
Unit 5: Pipeline and Vector Processing

Reference: Chapter 9 from Computer System Architecture by Morris Mano

© Ronak Patel, Computer Engineering Department , CSPIT, CHARUSAT


Parallel Processing
• Why?
• To increase computational speed.
• To achieve faster execution time.

• How is it achieved?
• Concurrent data processing.
• Multiprocessor systems.
• Parallel processing can be viewed at various levels of complexity:
• Lowest level: parallel versus serial operation, depending on the type of registers used.
• Higher level: multiple functional units.



Multifunctional Units
• Used to establish parallel processing.
• All units are independent of each other.
• A multifunctional organization is usually associated with a complex control unit to coordinate all the activities among the various components.



Flynn’s Classification
• Considers the organization of a computer system by the number of instructions and data items that are manipulated simultaneously.
• The sequence of instructions read from memory constitutes an instruction stream.
• The operations performed on the data in the processor constitute a data stream.

1. SISD (Single Instruction stream, Single Data stream)


2. SIMD (Single Instruction stream, Multiple Data stream)
3. MISD (Multiple Instruction stream, Single Data stream)
4. MIMD(Multiple Instruction stream, Multiple Data
stream)



Continue…
• SISD: instructions are executed sequentially.
• Parallel processing in this case can be achieved by multiple functional units or pipeline processing.

• SIMD: many processing units.
• All processors receive the same instruction from the control unit but operate on different data.

• MIMD: a computer system capable of processing several programs at the same time.



Continue…
• Flynn’s classification depends on the distinction between the control unit and the data-processing unit.

• One type of parallel processing that does not fit Flynn’s classification is pipelining.

• Here we consider parallel processing under:
1. Pipeline processing: arithmetic suboperations or instruction phases overlap.
2. Vector processing: computations on large vectors and matrices.
3. Array processing: computations on large arrays of data.
Pipelining
• A technique of decomposing a sequential process into suboperations, with each subprocess executed in a special dedicated segment that operates concurrently with all other segments.

• Like an industrial assembly line.

• Example: Ai * Bi + Ci for i = 1, 2, 3, …, 7

Suboperations in each segment of the pipeline:
R1 ← Ai , R2 ← Bi
R3 ← R1 * R2 , R4 ← Ci
R5 ← R3 + R4



Content of registers in the pipeline example

Clock Pulse    Segment 1        Segment 2              Segment 3
Number         R1      R2       R3           R4        R5
1              A1      B1       -            -         -
2              A2      B2       A1 * B1      C1        -
3              A3      B3       A2 * B2      C2        A1 * B1 + C1
4              A4      B4       A3 * B3      C3        A2 * B2 + C2
5              A5      B5       A4 * B4      C4        A3 * B3 + C3
6              A6      B6       A5 * B5      C5        A4 * B4 + C4
7              A7      B7       A6 * B6      C6        A5 * B5 + C5
8              -       -        A7 * B7      C7        A6 * B6 + C6
9              -       -        -            -         A7 * B7 + C7
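The table above can be reproduced by a short simulation of the three-segment pipeline (a sketch; the update order mimics clocked registers, with each segment reading the values latched on the previous pulse):

```python
def pipeline(a, b, c):
    """Simulate the 3-segment pipeline computing Ai * Bi + Ci."""
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    results = []
    for pulse in range(1, n + 3):          # n + (k - 1) pulses, k = 3 segments
        # Segment 3: R5 <- R3 + R4 (values latched on the previous pulse)
        r5 = r3 + r4 if r3 is not None else None
        # Segment 2: R3 <- R1 * R2, R4 <- Ci
        r3 = r1 * r2 if r1 is not None else None
        r4 = c[pulse - 2] if 2 <= pulse <= n + 1 else None
        # Segment 1: R1 <- Ai, R2 <- Bi
        if pulse <= n:
            r1, r2 = a[pulse - 1], b[pulse - 1]
        else:
            r1, r2 = None, None
        if r5 is not None:
            results.append(r5)
    return results

a = [1, 2, 3, 4, 5, 6, 7]
b = [2, 2, 2, 2, 2, 2, 2]
c = [1, 1, 1, 1, 1, 1, 1]
print(pipeline(a, b, c))   # [3, 5, 7, 9, 11, 13, 15]
```

Note that the first result emerges only on pulse k = 3, and a new result emerges on every pulse thereafter, exactly as in the table.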



Four-segment Pipeline



Space-time Diagram

Task: the total operation performed going through all the segments in the pipeline.



Speedup

• Pipeline unit:
• k-segment pipeline with clock cycle time tp, completing n tasks.
• Clock cycles required to complete n tasks = k + (n - 1)
• Time to complete n tasks = k*tp + (n - 1)*tp = (k + n - 1) * tp
• Non-pipeline unit:
• tn to complete each task.
• Time to complete n tasks = n * tn
• Speedup S = n*tn / ((k + n - 1) * tp)



Continue…
• As the number of tasks increases, n becomes much larger than k - 1, and k + n - 1 approaches the value of n.
• In that condition, S = tn / tp.
• If we take tn = k*tp, then S = k (the maximum speedup the pipeline can provide).

• Example: n = 100, tp = 20 ns, k = 4, tn = 80 ns
Speedup = n*tn / ((k + n - 1) * tp) = 8000 / 2060 = 3.88
• Reason for not reaching the maximum speedup of 4:
• Different segments may take different times to complete their suboperations.
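The slide's numbers can be checked with a small helper (the function and variable names are mine):

```python
def pipeline_speedup(n, k, tp, tn):
    """Speedup S = (n * tn) / ((k + n - 1) * tp)."""
    t_pipeline = (k + n - 1) * tp       # total pipelined time
    t_nonpipeline = n * tn              # total non-pipelined time
    return t_nonpipeline / t_pipeline

# Slide example: n = 100 tasks, k = 4 segments, tp = 20 ns, tn = 80 ns
s = pipeline_speedup(n=100, k=4, tp=20, tn=80)
print(round(s, 2))   # 3.88 -- below the theoretical maximum tn / tp = 4
```

As n grows, s approaches tn / tp = 4, matching the limit argument above.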



Multiple Functional Units in Parallel



Pipelining is applicable in:
• Arithmetic pipeline:
• Divides an arithmetic operation into suboperations.
• Instruction pipeline:
• Overlaps the phases of instruction execution.



Arithmetic Pipeline
• Usually found in very high speed computers.
• Used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems.
• Floating-point operations are easily decomposed into suboperations.
• Floating-point addition and subtraction can be decomposed into:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.



Example
• X = 0.9504 * 10^3, Y = 0.8200 * 10^2
1. Compare the exponents (T1 = 60 ns):
• 3 - 2 = 1
• The larger exponent, 3, is chosen as the exponent of the result.
2. Align the mantissas (T2 = 70 ns):
• Y = 0.0820 * 10^3
3. Add the mantissas (T3 = 100 ns):
• R = 1.0324 * 10^3
4. Normalize the result (T4 = 80 ns):
• R = 0.10324 * 10^4

Interface register delay TR = 10 ns
Speedup = tn / tp, where tn = 60 + 70 + 100 + 80 + 10 = 320 ns (non-pipelined)
and tp = slowest segment + register delay = 100 + 10 = 110 ns:
Speedup = 320 / 110
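The four suboperations can be sketched one function per pipeline segment, using decimal (mantissa, exponent) pairs as in the slide. The helper names are my own, and normalize handles only the mantissa-overflow case this example needs:

```python
def compare_exponents(x, y):
    # Segment 1: the larger exponent becomes the result exponent.
    return x, y, max(x[1], y[1])

def align_mantissas(x, y, e):
    # Segment 2: shift the mantissa of the smaller-exponent operand right.
    (mx, ex), (my, ey) = x, y
    return mx * 10.0 ** (ex - e), my * 10.0 ** (ey - e)

def add_mantissas(mx, my):
    # Segment 3: add the aligned mantissas.
    return mx + my

def normalize(m, e):
    # Segment 4: shift right until the mantissa is a fraction below 1
    # (only the overflow case, which is all this example requires).
    while abs(m) >= 1:
        m, e = m / 10, e + 1
    return round(m, 5), e

x, y = (0.9504, 3), (0.8200, 2)
x, y, e = compare_exponents(x, y)          # e = 3
mx, my = align_mantissas(x, y, e)          # 0.9504, 0.0820
m = add_mantissas(mx, my)                  # 1.0324
print(normalize(m, e))                     # (0.10324, 4)
```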
Instruction Pipeline
• An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
• The instruction fetch segment can be implemented with a FIFO buffer.
• Whenever the execution unit is not using memory, the control increments the program counter and reads the next instruction.
• This reduces the average access time to memory for reading instructions.
• Instruction phases:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
Reducing the Phases
• A register-mode instruction does not need an effective address calculation.

• Two or more segments may require memory access at the same time, causing one segment to wait until another is finished with the memory.

• Memory conflicts can be resolved by using two memory buses for accessing instructions and data in separate modules.



Four-segment Instruction Pipeline
1. FI is the segment that fetches an instruction.

2. DA is the segment that decodes the instruction and calculates the effective address.

3. FO is the segment that fetches the operand.

4. EX is the segment that executes the instruction.



Continue…



Timing of instruction Pipeline



Instruction Pipeline Conflicts
• Conflicts cause the instruction pipeline to deviate from its normal operation.
1. Resource conflicts:
• Caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
2. Data dependency:
• Arises when an instruction depends on the result of a previous instruction, but this result is not yet available.
3. Branch difficulties:
• Arise from branch and other instructions that change the value of the PC.
Data Dependency
• Occurs when an instruction needs data that are not yet available.
• An instruction in the FO segment may need to fetch an operand that is being generated at the same time by the previous instruction in segment EX; the second instruction must therefore wait.
• An address dependency may occur when an operand address cannot yet be calculated.



Solutions to Data Dependency
1. Hardware interlocks:
• An interlock is a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline.
• An instruction whose source is not available is delayed by enough clock cycles to resolve the conflict.



Solutions to Data Dependency
2. Operand forwarding:
• Detects a conflict and avoids it by routing the data through special paths between pipeline segments.
• Instead of transferring the ALU result only into the destination register, the hardware checks the destination operand, and if it is needed as a source in the next instruction, passes the result directly to the ALU input, bypassing the register file.
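The forwarding check itself is simple: compare the destination of one instruction with the sources of the next. A minimal sketch (the three-field instruction format is my own, not from the text):

```python
# Each instruction: (dest, src1, src2). Forwarding is needed whenever the
# destination of one instruction is a source of the one that follows it.
def needs_forwarding(prev, curr):
    dest, _, _ = prev
    _, s1, s2 = curr
    return dest in (s1, s2)

program = [
    ("R1", "R2", "R3"),   # R1 <- R2 op R3
    ("R4", "R1", "R5"),   # uses R1: forward the ALU result, bypass registers
    ("R6", "R7", "R8"),   # independent: no forwarding path needed
]
for prev, curr in zip(program, program[1:]):
    status = "forward" if needs_forwarding(prev, curr) else "no hazard"
    print(prev[0], "->", curr, ":", status)
```

When the check fires, real hardware steers a bypass multiplexer at the ALU input instead of waiting for the register-file write.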



Solutions to Data Dependency
3. Delayed load:
• The compiler for such computers is designed to detect a data conflict and reorder the instructions as necessary to delay the loading of the conflicting data, inserting no-operation instructions when needed.



Branch Difficulties
• A branch instruction can be conditional or unconditional.
• It breaks the normal sequence of the instruction stream, causing difficulties in the operation of the instruction pipeline.



Handling of Branch Instructions
1. Prefetch target instruction:
• Prefetch the target instruction in addition to the instruction following the branch. Both are saved until the branch is executed.
2. Branch target buffer (BTB):
• An associative memory included in the fetch segment.
• Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for that branch.
• It also stores the next few instructions after the branch target instruction.



Handling of Branch Instructions
3. Loop buffer:
• A variation of the BTB.
• A small, very high speed register file.
• When a program loop is detected, it is stored in the loop buffer in its entirety, including all branches.
• Loop mode is exited by the final branch out of the loop.



Handling of Branch Instructions
4. Branch prediction:
• Uses additional logic to guess the outcome of a conditional branch instruction before it is executed.
• The pipeline then begins prefetching the instruction stream from the predicted path.
• A correct prediction eliminates the wasted time caused by branch penalties.



Handling of Branch Instructions
5. Delayed branch:
• Employed in most RISC processors.
• The compiler detects the branch instructions and rearranges the machine-language code sequence by inserting useful instructions that keep the pipeline operating without interruptions.



RISC Pipeline
• The simplicity of the instruction set can be utilized to implement an instruction pipeline using a small number of suboperations, each executed in one clock cycle.
• Decoding and register selection can occur at the same time, due to the fixed-length instruction format.
• There is no need to calculate an effective address or fetch operands from memory.
• The instruction pipeline can be implemented with 2 or 3 segments:
• Fetch the instruction.
• Execute the instruction in the ALU.
• Store the result in the destination register.



RISC Pipeline
• Data transfer instructions in RISC are limited to load and store instructions, which use register indirect addressing.
• To resolve memory conflicts between fetching an instruction and loading or storing an operand, most RISC machines use two separate buses with two memories.
• Advantages:
• The ability to execute instructions at the rate of one per clock cycle.
• Support from the compiler, which detects and minimizes the delays encountered due to data conflicts and branch penalties.



Example: Three-segment Instruction Pipeline
I: Instruction fetch
A: ALU operation
E: Execute instruction

Now consider the following four instructions:
1. LOAD: R1 <- M[ADD1]
2. LOAD: R2 <- M[ADD2]
3. ADD: R3 <- R1 + R2
4. STORE: M[ADD3] <- R3



Delayed Load

Advantage: Data dependency is taken care of by the compiler rather than the hardware.
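For the four-instruction sequence on the previous slide, the compiler's NOP insertion can be sketched as follows (the instruction encoding and a one-cycle load delay are my assumptions):

```python
# Insert a NOP whenever an instruction uses a register loaded by the
# instruction immediately before it (one-cycle load delay slot).
def delayed_load(program):
    out = []
    for instr in program:
        if out:
            op, dest, _ = out[-1]
            if op == "LOAD" and dest in instr[2]:
                out.append(("NOP", None, ()))
        out.append(instr)
    return out

# (opcode, destination, sources) for the slide's example
program = [
    ("LOAD", "R1", ("ADD1",)),
    ("LOAD", "R2", ("ADD2",)),
    ("ADD", "R3", ("R1", "R2")),     # needs R2 from the preceding LOAD
    ("STORE", "ADD3", ("R3",)),
]
for op, dest, srcs in delayed_load(program):
    print(op, dest, srcs)
# A NOP is inserted between the second LOAD and the ADD.
```

A smarter compiler would fill the delay slot with a useful independent instruction instead of a NOP, which is the point of the delayed-branch discussion that follows.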



Delayed Branch
• RISC processors rely on the compiler to redefine the branches so that they take effect at the proper time in the pipeline; this is referred to as delayed branch.
• The compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps.



Delayed Branch



Vector Processing
• There is a class of computational problems that are beyond the capabilities of a conventional computer.
• Science and engineering applications (problems that can be formulated in terms of vectors and matrices):
• Long-range weather forecasting, petroleum exploration, seismic data analysis, medical diagnosis, aerodynamics and space flight simulations, artificial intelligence and expert systems, mapping the human genome, image processing.



Vector Operations
• Arithmetic operations on large arrays of numbers, typically floating-point numbers.
• V = [v1, v2, v3, …, vn]
• A conventional system is capable of processing only one operand pair at a time.



Continue…
• Vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in a program loop.
• The operation can be specified with a single vector instruction of the form:
C(1:100) = A(1:100) + B(1:100)
• This includes: the initial addresses of the operands, the length of the vectors, and the operation to be performed.

• Matrix multiplication is one of the most computationally intensive operations performed in computers with vector processors.
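In Python terms (illustrative only; the text's example is Fortran-style), the single vector instruction replaces an explicit loop with per-iteration control overhead:

```python
# Scalar machine: one operand pair per loop iteration, plus loop-control
# overhead (increment, compare, branch) repeated every iteration.
def scalar_add(a, b):
    c = [0] * len(a)
    i = 0
    while i < len(a):
        c[i] = a[i] + b[i]
        i += 1
    return c

# Vector instruction: one instruction specifies the base addresses, the
# vector length, and the operation; the hardware streams all elements.
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

a, b = list(range(100)), list(range(100, 200))
assert scalar_add(a, b) == vector_add(a, b)   # same result, one "instruction"
```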



Memory Interleaving
• Allows simultaneous access to memory from two or more sources.
• An arithmetic pipeline usually requires two or more operands to enter the pipeline at the same time.
• The memory can be partitioned into a number of modules instead of using separate memory buses.
• A memory module is one kind of memory array.
• In interleaved memory, different sets of addresses are assigned to different memory modules.
• With four modules, the two least significant bits of the address can be used to distinguish between them.
• The advantage is that this allows the use of a technique called interleaving.
• A vector processor that uses n-way interleaved memory can fetch n operands from n different modules.
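The module-selection rule from the slide (two least significant bits pick one of four modules) can be sketched as:

```python
# Four-way interleaving: the low-order address bits select the module,
# and the remaining bits select the word within that module.
N_MODULES = 4

def module_of(address):
    return address & (N_MODULES - 1)      # two least significant bits

def word_of(address):
    return address >> 2                   # word index within the module

# Consecutive addresses cycle through modules 0, 1, 2, 3, 0, 1, ...
print([module_of(a) for a in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]

# So the pipeline can fetch 4 consecutive operands in one memory cycle,
# one from each module, instead of 4 sequential accesses to one module.
```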
Supercomputer
• Supercomputer = vector instructions + pipelined floating-point arithmetic operations
• Components are tightly coupled.
• Multiple functional units, each with its own pipeline configuration.
• Performance evaluation indexes:
• MIPS: millions of instructions per second
• FLOPS: floating-point operations per second
• megaflops: 10^6 FLOPS, gigaflops: 10^9 FLOPS



Cray-1
• Developed in 1976.
• 12 distinct functional units.
• Over 150 registers.
• 80 megaflops.
• Memory:
• capacity of 4 million 64-bit words
• divided into 16 banks
• transfer rate of 320 million words per second



Array Processor
• Performs computations on large arrays of data.
• Attached array processor: an auxiliary processor attached to a general-purpose computer.
• SIMD array processor: a computer with multiple processing units operating in parallel.

The objective is to provide vector manipulation capabilities to a conventional computer.
Any Questions?
