
Introduction to Computer Architecture

(Parallel and Pipeline processors)

KR Chowdhary
Professor & Head
Email: [email protected]
webpage: krchowdhary.com

Department of Computer Science and Engineering


MBM Engineering College, Jodhpur

November 22, 2013



Classification of Parallel Processors

◮ One classification, by M.J. Flynn, considers the organization of a computer
  system by the number of instructions and data items that can be manipulated
  simultaneously.
◮ The sequence of instructions read from the memory constitutes an instruction
  stream.
◮ The operations performed on data in the processor constitute a data stream.
◮ Parallel processing may occur in the instruction stream, in the data stream,
  or in both.
1. Single-instruction single-data streams (SISD): Instructions are executed
   sequentially.
2. Single-instruction multiple-data streams (SIMD): All processors receive the
   same instruction from the control unit, but operate on different sets of
   data.
3. Multiple-instruction single-data streams (MISD): Of theoretical interest
   only, as no practical organization can be constructed this way.
4. Multiple-instruction multiple-data streams (MIMD): Several programs can
   execute at the same time. Most multiprocessors come in this category.



Flynn’s taxonomy of computer architecture (1966)

Figure 1: SISD and SIMD architecture. (SISD: a single control unit issues an
instruction stream to one processor, which exchanges a data stream with memory
and I/O. SIMD: one control unit broadcasts the instruction stream to processors
P1 . . . Pn , each with its own memory M1 . . . Mn ; the program is loaded from
a front end.)

Figure 2: MIMD Architecture. (Control units 1 . . . n each issue their own
instruction stream to their own processor and memory, so several independent
instruction and data streams exist simultaneously.)

One class of parallel processing that does not fit neatly into this
classification is pipeline processing.



Parallel processing

Parallel Processing:
◮ Increasing speed by doing many things in parallel.
◮ Let P be a sequential processor processing a task T sequentially. If T is
  partitioned into n subtasks T1 , T2 , . . . , Tn of approximately the same
  size, then a parallel machine P' (say) having n processors can be programmed
  so that all the subtasks of T execute in parallel.
◮ Ideally, P' then executes n times faster than P.
◮ A failure of the CPU is fatal to a sequential processor, but not to a
  parallel processor.
◮ Some of the applications of parallel computers (processors) are:
  1. Expert systems for AI
  2. Fluid flow analysis
  3. Seismic data analysis
  4. Long-range weather forecasting
  5. Computer-assisted tomography
  6. Nuclear reactor modeling
  7. Visual image processing
  8. VLSI design
◮ The typical characteristics of such computations are: a vast amount of
  computation, floating-point arithmetic, and a vast number of operands.



Computational Models

◮ We assume that a given computation can be divided into concurrent tasks for
  execution on the multiprocessor.
◮ Equal Duration Model: A given task can be divided into n equal subtasks,
  each of which can be executed by one processor. If ts is the execution time
  of the whole task on a single processor, then the time taken by each
  processor to execute its subtask is tm = ts /n. Since, in this model, all
  processors execute their subtasks simultaneously, the time taken to execute
  the whole task is also tm = ts /n.
◮ The speedup factor of a parallel system is defined as the ratio of the time
  taken by a single processor to solve a given problem instance to the time
  taken by a parallel system of n processors to solve the same instance:

      S(n) = ts /tm = ts /(ts /n) = n
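◮ A tiny C sketch of this model (illustrative only; the function name and the
  numbers below are made up for this note, not taken from the slides):

  #include <stdio.h>

  /* Equal-duration model: n equal subtasks, one per processor, all running
     simultaneously, with no overheads of any kind. */
  double equal_duration_speedup(double ts, int n) {
      double tm = ts / n;    /* time taken by each processor, and by the whole task */
      return ts / tm;        /* S(n) = ts / (ts/n) = n */
  }

  int main(void) {
      printf("S(8) = %.1f\n", equal_duration_speedup(100.0, 8));  /* prints S(8) = 8.0 */
      return 0;
  }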



Pipelining
◮ A typical example of parallel processing is a one-dimensional array of
  processors: n identical processors P1 . . . Pn , each having its own local
  memory. The processors communicate by message passing (send/receive).

Figure 3: Pipeline processing. (Processors P1 , P2 , . . . , Pn connected in a
chain, communicating through serial I/O send/receive operations.)

◮ In total, n operations go on in parallel.
◮ A pipeline constitutes a sequence of processing circuits, called segments or
  stages. Ideally, an m-stage pipeline has the same throughput as m separate
  units.

Figure 4: Pipeline segments. (Segment i consists of a buffer register Ri
feeding a computing element Ci ; segments 1 to m are chained as R1 -C1 ,
R2 -C2 , . . . , Rm -Cm , ending in a result register R.)


Pipeline Processors

1. Instruction pipeline: instructions move through the various stages of the
   CPU during the instruction cycle: fetch, decode, execute. Thus there can be
   three different instructions in different stages of execution at the same
   time: one being fetched, the previous one being decoded, and the one before
   that being executed.
2. Arithmetic pipeline: the data is computed through different stages.

Figure 5: Instruction and data pipeline examples. (Top: a three-segment
instruction pipeline; the instruction address enters segment 1 for instruction
fetch, segment 2 does instruction decode, and segment 3 does instruction
execute, producing results. Bottom: a four-segment floating-point adder for
X = (XM , XE ) and Y = (YM , YE ): compare exponents, align mantissas, add
mantissas, normalize the result to produce X + Y.)
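◮ As a rough illustration of the four adder segments named in Figure 5, the
  toy C sketch below performs the same four steps on a simplified decimal
  (mantissa, exponent) pair. This is only an assumed model for this note; real
  hardware operates on binary floating-point fields and handles many cases
  (e.g. results smaller than 1) that are omitted here.

  #include <stdio.h>

  /* Toy (mantissa, exponent) pair: value = m * 10^e. Illustrative only. */
  typedef struct { double m; int e; } fp;

  fp fp_add(fp x, fp y) {
      if (x.e < y.e) { fp t = x; x = y; y = t; }   /* segment 1: compare exponents */
      while (y.e < x.e) { y.m /= 10.0; y.e++; }    /* segment 2: align mantissas   */
      fp z = { x.m + y.m, x.e };                   /* segment 3: add mantissas     */
      while (z.m >= 10.0 || z.m <= -10.0) {        /* segment 4: normalize result  */
          z.m /= 10.0;
          z.e++;
      }
      return z;
  }

  int main(void) {
      fp x = { 9.5, 2 }, y = { 8.2, 1 };           /* 950 + 82 */
      fp z = fp_add(x, y);
      printf("%.4f x 10^%d\n", z.m, z.e);          /* 1.0320 x 10^3 = 1032 */
      return 0;
  }

◮ In the pipelined adder each of these four steps is a separate segment with
  its own buffer register, so four different additions can be in flight at
  once.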



Pipe-lining Example

◮ Consider an example to compute Ai ∗ Bi + Ci , for i = 1, 2, 3, 4, 5. The
  circuit uses registers R1 -R5 , a multiplier, and an adder, organized into
  three segments (Figure 6).

Figure 6: A segment comprising registers and computing elements. (R1 and R2
feed the multiplier, whose output goes to R3 ; R4 receives Ci ; the adder
combines R3 and R4 into R5 .)

R1 ← Ai , R2 ← Bi         ; input Ai , Bi
R3 ← R1 ∗ R2 , R4 ← Ci    ; multiply, and input Ci
R5 ← R3 + R4              ; add Ci to the product
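◮ The C sketch below simulates these register transfers clock by clock. The
  operand values are arbitrary and chosen only for illustration; the printed
  results follow the trace shown in Table 1 below, with results emerging from
  R5 from clock pulse 3 onward.

  #include <stdio.h>

  /* Clock-by-clock simulation of the three-segment pipeline
     (R1,R2) -> (R3,R4) -> R5 computing A[i]*B[i] + C[i] for i = 1..5. */
  int main(void) {
      double A[5] = {1, 2, 3, 4, 5}, B[5] = {10, 20, 30, 40, 50};
      double C[5] = {7, 7, 7, 7, 7};
      double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;
      int n = 5, k = 3;

      for (int clock = 1; clock <= k + (n - 1); clock++) {
          /* update segments back to front, so each segment reads the values
             its predecessor produced in the previous clock period */
          R5 = R3 + R4;                             /* segment 3: add              */
          if (clock >= 2 && clock <= n + 1) {       /* segment 2: multiply, read C */
              R3 = R1 * R2;
              R4 = C[clock - 2];
          }
          if (clock <= n) {                         /* segment 1: load operands    */
              R1 = A[clock - 1];
              R2 = B[clock - 1];
          }
          if (clock >= k)
              printf("clock %d: R5 = %g\n", clock, R5);
      }
      return 0;
  }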



Pipe-lining Example

Table 1: Computation of the expression Ai ∗ Bi + Ci in space and time in a 3-stage pipeline.

Clock pulse    Segment 1      Segment 2          Segment 3
    no.         R1 , R2        R3 , R4               R5
     1          A1 , B1          − , −                −
     2          A2 , B2       A1 ∗ B1 , C1            −
     3          A3 , B3       A2 ∗ B2 , C2        A1 ∗ B1 + C1
     4          A4 , B4       A3 ∗ B3 , C3        A2 ∗ B2 + C2
     5          A5 , B5       A4 ∗ B4 , C4        A3 ∗ B3 + C3
     6             −          A5 ∗ B5 , C5        A4 ∗ B4 + C4
     7             −             − , −            A5 ∗ B5 + C5
◮ Any operation that can be decomposed into a sequence of sub-operations of
  about the same complexity can be implemented by a pipeline processor.
◮ Consider a k-segment pipeline with clock cycle time tp sec., on which a
  total of n tasks (T1 , T2 , . . . , Tn ) are to be executed.
◮ T1 requires time equal to k.tp sec. The remaining n − 1 tasks emerge from
  the pipeline at the rate of one task per clock cycle and are completed in
  (n − 1).tp sec, so the total number of clock cycles required is k + (n − 1).
◮ For k = 3 segments and n = 5 tasks this is 3 + (5 − 1) = 7, as is clear from
  Table 1.
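◮ A trivial check of this count (a hypothetical helper written for this note,
  not taken from the slides):

  #include <stdio.h>

  /* Total clock cycles for n tasks on a k-segment pipeline: the first task
     needs k cycles, and each remaining task adds one more. */
  int pipeline_cycles(int k, int n) { return k + (n - 1); }

  int main(void) {
      printf("%d\n", pipeline_cycles(3, 5));   /* prints 7, matching Table 1 */
      return 0;
  }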
Computational Models

◮ Consider a non-pipelined unit that performs the same operation and takes
  time equal to tu to complete each task; the total time for n tasks is n.tu .
  The speedup of a k-segment pipeline with clock period tp is:

      S(n) = n.tu /((k + (n − 1)).tp )            (1)

◮ For a large number of tasks, n >> k − 1, so k + n − 1 ≈ n, and

      S(n) ≈ n.tu /(n.tp )                        (2)
           = tu /tp                               (3)
◮ Instruction pipelining is similar to the use of an assembly line in a
  manufacturing plant.
◮ An instruction's execution is broken into many steps, which indicates the
  scope for pipelining.
◮ Pipelining requires registers to store data between stages.
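◮ A minimal sketch of equation (1), assuming a non-pipelined unit that needs
  tu per task and a k-segment pipeline clocked at tp ; the parameter names
  follow the slide and the numbers are illustrative:

  #include <stdio.h>

  /* Speedup of a k-segment pipeline over a unit that takes tu per task. */
  double pipeline_speedup(double tu, double tp, int k, long n) {
      return (n * tu) / ((k + (n - 1.0)) * tp);
  }

  int main(void) {
      /* with tu = k.tp = 6 and tp = 2, the speedup approaches k = 3 as n grows */
      printf("n = 5:    S = %.2f\n", pipeline_speedup(6.0, 2.0, 3, 5));
      printf("n = 1000: S = %.2f\n", pipeline_speedup(6.0, 2.0, 3, 1000));
      return 0;
  }

◮ With tu = k.tp , the speedup approaches k for large n; that is, a k-segment
  pipeline is at best k times faster than the non-pipelined unit.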



Computational Models
Parallel computation with serial section model:
◮ It is assumed that a fraction f of a given task (computation) cannot be
  divided into concurrent subtasks. The remaining part (1 − f ) is assumed to
  be divisible (for example, f may correspond to data input).
◮ The time required to execute the task on n processors is:

      tm = f .ts + (1 − f ).ts /n                 (4)

◮ The speedup is therefore:

      S(n) = ts /(f .ts + (1 − f ).ts /n)         (5)
           = n/(1 + (n − 1).f )                   (6)

◮ So, S(n) is primarily determined by the fraction of the code that cannot be
  divided.
◮ If the task is completely serial (f = 1), then no speedup can be achieved
  even by a parallel processor.
◮ For n → ∞,

      S(n) = 1/f                                  (7)

  which is the maximum speedup.
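◮ A small C sketch of equation (6); the values of f and n below are
  illustrative only:

  #include <stdio.h>

  /* Serial-fraction (Amdahl) speedup: f is the fraction that cannot be
     parallelized, n is the number of processors. */
  double amdahl_speedup(double f, long n) {
      return n / (1.0 + (n - 1) * f);
  }

  int main(void) {
      printf("f = 0.05, n = 10:   S = %.2f\n", amdahl_speedup(0.05, 10));
      printf("f = 0.05, n = 1000: S = %.2f\n", amdahl_speedup(0.05, 1000));
      return 0;
  }

◮ Even with only 5% serial code, a thousand processors give a speedup of less
  than 20, close to the 1/f limit.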
Computational Models

◮ The improvement in performance (speed) of a parallel algorithm over a
  sequential one is limited not by the number of processors but by the
  fraction of the algorithm (code) that cannot be parallelized (Amdahl's law).
◮ Considering the communication overhead tc :

      S(n) = ts /(f .ts + (1 − f ).(ts /n) + tc )     (8)
           = n/(f .(n − 1) + 1 + n.(tc /ts ))         (9)

◮ For n → ∞,

      S(n) = 1/(f + tc /ts )                          (10)

◮ Thus, S(n) depends on the communication overhead tc as well.
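◮ Extending the previous sketch to equation (9); here tc/ts is the
  communication overhead expressed relative to the serial run time, and the
  values are again illustrative:

  #include <stdio.h>

  /* Serial-fraction speedup including a communication-overhead term tc/ts. */
  double amdahl_speedup_comm(double f, double tc_over_ts, long n) {
      return n / (f * (n - 1) + 1 + n * tc_over_ts);
  }

  int main(void) {
      /* for large n this tends to 1/(f + tc/ts) = 1/(0.05 + 0.05) = 10 */
      printf("S = %.2f\n", amdahl_speedup_comm(0.05, 0.05, 100000));
      return 0;
  }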



Pipe-lining processors

Instruction pipelining: the typical stages of the pipeline are:

1. FI (fetch instruction)
2. DI (decode instruction)
3. CO (calculate operands)
4. FO (fetch operands)
5. EI (execute instruction)
6. WO (write operands)



Instruction Pipe-lining

◮ Suppose nine different instructions are to be executed.
◮ The six-stage pipeline can reduce the execution time for the 9 instructions
  from 54 time units to 14 time units.
                                   time units →
Instruc.  1  2  3  4  5  6  7  8  9  10 11 12 13 14
In1       FI DI CO FO EI WO
In2          FI DI CO FO EI WO
In3             FI DI CO FO EI WO
In4                FI DI CO FO EI WO
In5                   FI DI CO FO EI WO
In6                      FI DI CO FO EI WO
In7                         FI DI CO FO EI WO
In8                            FI DI CO FO EI WO
In9                               FI DI CO FO EI WO

◮ The diagram assumes that each instruction goes through all 6 stages of the
  pipeline. But, for example, a load instruction does not need WO.
◮ It is also assumed that there are no memory conflicts; for example, FI, FO,
  and WO all require memory access and cannot, in general, proceed together.
◮ The value may be in cache, or FO/WO may be null.
◮ The six stages may not be of equal duration, and a conditional branch or
  interrupt instruction may invalidate several fetches.
◮ After which stage should the pipeline check for a conditional
  branch/interrupt?
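◮ The ideal diagram above can be generated mechanically. The short C sketch
  below, which assumes no hazards or stalls (unrealistic, as the bullets above
  note), prints one row per instruction and shows completion after
  k + (n − 1) = 14 time units:

  #include <stdio.h>

  int main(void) {
      const char *stage[] = {"FI", "DI", "CO", "FO", "EI", "WO"};
      const int k = 6, n = 9;        /* 6-stage pipeline, 9 instructions */

      printf("    ");
      for (int t = 1; t <= n + k - 1; t++) printf(" %2d", t);
      printf("\n");

      for (int i = 0; i < n; i++) {
          printf("In%-2d", i + 1);
          for (int t = 1; t <= n + k - 1; t++) {
              int s = t - 1 - i;     /* stage occupied by instruction i at time t */
              printf(" %s", (s >= 0 && s < k) ? stage[s] : "  ");
          }
          printf("\n");
      }
      return 0;
  }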
Factors in Instruction Pipe-lining

◮ There is overhead in each stage of the pipeline for data movement from
  buffer to buffer.
◮ The amount of control logic needed to handle memory/register dependencies
  increases with the size of the pipeline.
◮ It takes time for the buffers to operate.
◮ Pipeline hazards occur when the pipeline, or a portion of it, stalls:
  1. Resource hazards: two or more instructions in the pipeline require the
     same resource, say the ALU or a register (also called structural
     hazards).
  2. Data hazards: conflicts in operand/memory access.
  3. Control hazards (also called branch hazards): wrong decisions in branch
     prediction.



Vector Processing

◮ In many computational applications, a problem can be formulated in terms of
  vectors and matrices. Processing these on a special computer is called
  vector processing.
◮ A vector is V = [V1 V2 V3 . . . Vn ]. The i-th element Vi is written V [i].
  A program to add two vectors A and B of length 100, producing vector C, is:
◮ Scalar concept:

  for(i=0; i < 100; i++)
      c[i]=b[i]+a[i];

◮ In machine language this becomes a loop of scalar instructions, roughly:

        mvi i, 0
  loop: read A[i]
        read B[i]
        store C[i] = A[i] + B[i]
        i = i + 1
        cmp i, 100
        jnz loop

◮ This loop repeatedly fetches instructions that access the arrays A and B,
  while only the counter needs to be updated. A vector processing computer
  eliminates the need to fetch and decode these instructions repeatedly: they
  are fetched and decoded only once, but executed 100 times. This allows the
  operation to be specified simply as:

      C (1 : 100) = A(1 : 100) + B(1 : 100)



Vector Processing

◮ A vector instruction includes the initial addresses of the operands, the
  length of the vectors, and the operation to be performed, all in one
  composite instruction. The addition is done with a pipelined floating-point
  adder. It is possible to design the vector processor so that all operands
  are stored in registers in advance.
◮ Vector processing can be applied to matrix multiplication,
  [l × m] × [m × n].


RISC v/s CISC

◮ CISC (Complex Instruction Set Computer):
  1. Complex programs motivated complex and powerful HLLs. This produced a
     semantic gap between HLLs and machine languages, and required more
     effort in compiler construction.
  2. The attempt to reduce the semantic gap and to simplify compiler
     construction motivated more powerful instruction sets (CISC).
  3. CISC provides better support for HLLs.
  4. Fewer instructions per program, hence smaller program size, less memory,
     and faster access.
◮ RISC (Reduced Instruction Set Computer):
  1. A large number of general-purpose registers, with compiler technology
     used to optimize register usage.
  2. Register-to-register (R-R) operations (advantage?).
  3. Simple addressing modes.
  4. A limited and simple instruction set (one instruction per machine cycle).
  5. Advantage of simple instructions?
  6. Optimization of the instruction pipeline.


Exercises

1. Design a pipeline configuration to carry out the task of computing
   (Ai + Bi )/(Ci + Di ).
2. Construct a pipeline to add 100 floating-point numbers, i.e., find the
   result of x1 + x2 + . . . + x100 .
3. 3.1 List the advantages of designing a floating-point processor in the form
       of a k-segment pipeline rather than a k-unit parallel processor.
   3.2 A floating-point pipeline has four segments S1 , S2 , S3 , S4 , whose
       delays are 100, 90, 100, and 110 nanoseconds, respectively. What is the
       pipeline's maximum throughput in MFLOPS?
4. It is frequently argued that the large (super) computer is approaching its
   performance limits, and that future advances in large computers will
   depend on interconnecting large numbers of inexpensive computers. List the
   arguments for and against.
5. List the features to be added to sequential programming languages so that
   they can be used on a large interconnection of small, inexpensive
   computers.
6. Let S1 , S2 , . . ., Sk denote a sequence of k operations in a program.
   Suppose that execution of these operations on a uniprocessor produces the
   same results regardless of the order of execution of the k Si 's. Show
   that this does not imply that the Si 's are parallelizable on a
   multiprocessor.
Exercises

7. Prove that the general problem of determining whether two program segments
   S1 , S2 are parallelizable is undecidable, by showing that a solution to
   this problem implies a solution to the halting problem for Turing
   machines. (Hint: assume that an algorithm A to determine parallelizability
   exists, and consider applying A to a program containing the statement: if
   Turing machine T halts after at most n steps then S1 else S2 .)



