
COMPUTER ORGANIZATION AND ARCHITECTURE

Reshma M R, Dept of CSE, RSET


Text Book

"Computer Organization" by Carl Hamacher, Zvonko Vranesic, and Safwat Zaky, 5th ed.


Syllabus
• Module 3: Instruction Pipelining
• Pipelining: Basic principles, classification of pipeline processors, introduction to
5-stage RISC pipeline, pipeline hazards, hazard detection and resolution,
introduction to multistage pipeline, static and dynamic scheduling, speculative
execution, out-of-order execution, and reorder buffers.
• Course Outcome 3:
• CO3 Explain the concepts of pipelining, hazards and principles of Instruction
Pipelining (Cognitive Knowledge Level: Understand).



PIPELINING



Overview

• Pipelining is widely used in modern processors.

• Pipelining improves system performance in terms of throughput.

• Pipelined organization requires sophisticated compilation techniques.



Assembly line
• A pipeline is similar to an assembly line in a production factory.
• A product has to go through multiple stages in the assembly line before the final product is manufactured.
• At any given time, all the stages work simultaneously, each on a different unit of the product at a different phase of completion. This process is referred to as pipelining.
• Pipelining thus refers to the execution of multiple jobs/instructions in parallel, in an overlapped fashion.
Pipelining
• In a computer system, the technique of executing multiple instructions in an overlapped fashion is known as pipelining.
• A pipeline consists of many stages, and these stages are connected to one another in a pipe-like structure.
• An instruction enters one end of the pipeline and goes through several stages before exiting from the other end.
• Pipelining improves the overall throughput of the system (one instruction is completed in each clock cycle if there are no stalls).
• The pipeline rate is limited by the slowest pipeline stage.
• As the number of pipeline stages increases, the advantage of pipelining increases.


Basic Ideas of Pipelining
• A technique for overlapping the execution of several instructions to reduce the execution time of a set of instructions.
• Let Fi and Ei refer to the fetch and execute steps for instruction Ii.
• Execution of a program consists of a sequence of fetch and execute steps, as shown below.


Pipelining
• Execution of a program consists of a sequence of fetch and execute steps, as shown below.

[Figure: Sequential execution of fetch (F) and execute (E) steps over time]


Pipelining
• Consider a computer that has two separate hardware units, one for fetching instructions and another for executing them, as shown below.

[Figure: Hardware organization with separate fetch and execute units]
Basic Idea of Instruction Pipelining (Two-Stage)

[Figure: Pipelined execution, clock cycles vs. instructions, with the Fetch and Execute steps of successive instructions overlapped]


Two-Stage Pipeline

Sequential: stages k = 2, instructions n = 4, cycles = k*n = 2*4 = 8
Pipelined:  stages k = 2, instructions n = 4, cycles = k+n-1 = 2+4-1 = 5


Pipelining: Four instructions in progress at any given time

Pipelining increases throughput (one instruction is completed in each clock cycle if there are no stalls).


Classification of pipeline processors
• Pipeline processors can be classified based on the level of processing:
• Arithmetic pipeline
• Processor pipeline
• Instruction pipeline


Arithmetic pipeline
• In an arithmetic pipeline, an arithmetic operation like multiplication, addition, etc. can be divided into a series of steps that are executed one by one in the stages of the Arithmetic Logic Unit (ALU).
• The arithmetic pipeline technique decomposes complex arithmetic operations into simple stages that can be processed concurrently.
• Let's consider an arithmetic pipeline to add floating-point numbers:
• Fetch operands → the operands are fetched from the register file.
• Compare & align operands → the exponents are compared, and the mantissa of the operand with the smaller exponent is shifted so that both operands have the same exponent.
• Add the mantissas.
• Finally, normalize the result obtained from the addition.
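As a rough illustration of these four stages (a Python sketch, not the hardware design from the textbook; it assumes each operand is a (mantissa, exponent) pair with value = mantissa * 2**exponent and a 24-bit significand):

    def pipeline_fp_add(a, b):
        (ma, ea), (mb, eb) = a, b            # Stage 1: fetch operands

        # Stage 2: compare exponents and align the smaller operand
        shift = abs(ea - eb)
        if ea < eb:
            ma, ea = ma >> shift, eb         # right shift discards low bits, as in hardware
        else:
            mb, eb = mb >> shift, ea

        m, e = ma + mb, ea                   # Stage 3: add the mantissas

        # Stage 4: normalize the result to a 24-bit significand
        while m >= (1 << 24):
            m, e = m >> 1, e + 1
        while m and m < (1 << 23):
            m, e = m << 1, e - 1
        return m, e

    # 1.0 + 0.5: (2**23, -23) + (2**23, -24) -> (12582912, -23), i.e. 1.5
    print(pipeline_fp_add((1 << 23, -23), (1 << 23, -24)))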


Segments/Stages: Arithmetic Pipeline

[Figure: Arithmetic pipeline divided into segments, Segment 1 → Segment 2 → Segment 3]


Arithmetic Pipelining - Floating-point addition/subtraction

1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
Processor pipeline
• Pipeline processing of the same data stream by a cascade of processors, each of which handles a specific task.
• The data stream passes through the first processor, with the results stored in a memory block that is also accessible by the second processor.
• The second processor then passes the refined results to the third, and so on.
• There is no well-known practical example of a processor pipeline.


Instruction pipeline
• In an instruction pipeline processor, the execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode and operand fetch of subsequent instructions.
• An instruction pipeline processes instructions in program order.
• It makes the processing of instructions take place in distinct stages, thus helping to process multiple instructions simultaneously.
• An instruction's execution cycle consists of the following sequence of operations:
• Instruction fetch → the instruction is fetched from memory.
• Instruction decode → the fetched instruction is decoded to determine the operation to be performed.
• Operand fetch → the necessary operands are fetched from the register file.
• Execute → the instruction is executed to produce the result.
• Write-back → the final result obtained after execution is written into an appropriate register.


Instruction Pipelining
• The figure shows the timing diagram of an instruction pipeline.
• While the instruction pipeline reads one instruction from memory, the previous instruction is executed in another stage of the pipeline. Thus, multiple instructions are executed simultaneously.
• From the figure, it can be observed that the first instruction starts at time period one, the second instruction starts at time period two, and so on.
• Up to time period four, not all stages are working simultaneously, but from time period five onwards all five stages work simultaneously.

[Figure: Timing diagram for instruction pipeline operation]
Introduction to 5-stage RISC pipeline
• RISC stands for Reduced Instruction Set Computer.
• Unlike complex instruction set computers (CISC), which have a large number of complex
instructions, RISC processors have a smaller, more streamlined set of instructions.
• Key Characteristics of RISC Processors:
• Fixed-Length Instructions
• Load-Store Architecture
• Pipelining
• Hardwired Control Unit
• Examples of RISC Processors:
• ARM: Widely used in smartphones, tablets, and embedded systems.
• MIPS: Used in various embedded systems and networking devices.
• PowerPC: Initially used in Apple computers, now found in various embedded systems and
supercomputers.
Introduction to 5-stage RISC pipeline
• The 5-stage RISC pipeline is a common architecture used in modern
microprocessors to execute instructions more efficiently.
• It breaks down the instruction execution process into five distinct stages, allowing
multiple instructions to be processed simultaneously.
1. Fetch instructions from memory
2. Read registers and decode the instruction
3. Execute the instruction or calculate an address
4. Access an operand in data memory
5. Write the result into a register
• This improves performance by overlapping the execution of different instructions.



5-stage RISC pipeline
In steady state, all of the pipeline's resources are used by 5 different instructions in every cycle.

[Figure: Five instructions in flight in the 5-stage pipeline]


Introduction to 5-stage RISC pipeline
• The five stages in a 5-stage RISC pipeline:
1. Fetch: this stage retrieves the next instruction from memory and places it in the instruction register.
2. Decode: the instruction is decoded to determine the operation to be performed and the operands involved.
3. Execute: the operation specified in the instruction is carried out using the ALU (Arithmetic Logic Unit).
4. Memory: if the instruction requires memory access (e.g., load or store), the data is read from or written to memory.
5. Write Back: the result of the operation is written back to the register file.
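To see how the stages overlap, here is a small Python sketch (illustrative, not from the textbook) that prints which stage each instruction occupies in each clock cycle, assuming no hazards or stalls:

    STAGES = ["F", "D", "E", "M", "W"]   # the five stages, one cycle each

    def pipeline_chart(n):
        """One row per instruction: the stage it occupies in each clock cycle."""
        total = len(STAGES) + n - 1              # k + n - 1 cycles in total
        for i in range(n):
            row = ["."] * total
            for s, name in enumerate(STAGES):
                row[i + s] = name                # instruction i enters stage s at cycle i + s
            print("I%d  %s" % (i + 1, " ".join(row)))

    pipeline_chart(4)
    # I1  F D E M W . . .
    # I2  . F D E M W . .
    # I3  . . F D E M W .
    # I4  . . . F D E M W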
Instruction Pipelining (5-Stage Pipeline)
• In a pipelined system, each stage uses a register to hold the output of that stage. The output of one stage is applied as input to the next stage.
• The figure shows an example of a five-stage instruction pipeline consisting of fetch, decode, operand fetch, execute and write stages. Here, streams of instructions are executed in an overlapped fashion, thereby increasing the throughput of the computer system.


Introduction to 5-stage RISC pipeline

[Figure: 5-stage RISC pipeline]


Pipelining Performance
• Consider a 'k'-stage pipeline with clock cycle time 'Tp' (say 1 nanosecond).
• 'n' tasks (instructions) are to be completed in the pipelined processor.
• The first instruction takes 'k' cycles to come out of the pipeline, but the remaining 'n - 1' instructions take only one cycle each, i.e., a total of 'n - 1' cycles.
• Execution time (pipelined): ET_pipeline = (k + n - 1) * Tp
• In the same case, for a non-pipelined processor, the execution time of 'n' instructions is:
• ET_non-pipeline = n * k * Tp
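These two formulas can be checked with a few lines of Python (a sketch, using the numbers from the two-stage example earlier):

    def execution_times(k, n, tp):
        """Execution time with and without pipelining, in the units of tp."""
        et_pipeline = (k + n - 1) * tp    # first instruction: k cycles; the rest: 1 cycle each
        et_non_pipeline = n * k * tp      # every instruction takes all k cycles
        return et_pipeline, et_non_pipeline

    # Two-stage example: k = 2, n = 4, Tp = 1 ns
    piped, seq = execution_times(2, 4, 1)
    print(piped, seq, seq / piped)        # 5 8 1.6
    # As n grows, the speedup seq/piped approaches k, the number of stages.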


Pipeline Performance

[Figure: Pipeline performance]


Pipelining (4-stage pipeline)
• A pipelined processor may process each instruction in four steps, as follows:
• Fetch (F): read the instruction from the memory.
• Decode (D): decode the instruction and fetch the source operands.
• Execute (E): perform the operation specified by the instruction.
• Write (W): store the result in the destination location.


A 4-stage pipeline: Fetch + Decode + Execute + Write

[Figure: Clock cycles vs. instructions for a 4-stage pipeline]


Pipeline Performance
• The pipelined processor shown in the last slide completes the processing of one instruction in each clock cycle, which means that the rate of instruction processing is four times that of sequential operation.
• The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages.
• However, this increase would be achieved only if pipelined operation could be sustained without interruption throughout program execution.
• Unfortunately, this is not the case.


[Figure: Effect of an execution operation taking more than one clock cycle (clock cycles vs. instructions)]


Hazards
• The pipelined operation in the figure above is said to have been stalled for two clock cycles.
• A stall is a pipeline cycle with no operation or no new input.
• Any condition that causes the pipeline to stall is called a hazard.
• The example above is a data hazard.


Types of Hazards
• Control hazards (instruction hazards)
• Data hazards
• Structural hazards


Instruction Hazard
• The pipeline may also be stalled because of a delay in the availability of an instruction.
• For example, this may be the result of a miss in the cache, requiring the instruction to be fetched from main memory.
• Such hazards are often called control hazards or instruction hazards.


An Example of Instruction Hazard

[Figure: An instruction hazard, shown over clock cycles]


Instruction Hazard
• The idle periods shown in the figure are called stalls. They are also often referred to as bubbles in the pipeline.
• Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.


Structural Hazard
• In pipelined operation, when two instructions require the use of a given hardware resource at the same time, the pipeline has a structural hazard.
• The most common case in which this hazard may arise is in access to memory.
• One instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched.


An Example of a Structural Hazard
• The pipeline stalls for one cycle because both instructions I2 and I3 require access to the register file in cycle 6.
• Even though the instructions and their data are all available, the pipeline is stalled because one hardware resource, the register file, cannot handle two operations at once.


Data Hazard
• A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline.
• As a result, some operation has to be delayed and the pipeline stalls.


Data Hazards
• Consider a program that contains two instructions, I1 followed by I2.
• When this program is executed in a pipeline, the execution of I2 can begin before the execution of I1 is completed. This means that the results generated by I1 may not be available for use by I2.


Data Hazards
• Assume that A = 5, and consider the following two operations:

A ← 3 + A
B ← 4 × A

• When these operations are performed in the order given, the result is B = 32.
• But if they are performed concurrently, the value of A used in computing B would be the original value, 5, leading to an incorrect result.
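The same arithmetic, written out in Python for concreteness (a sketch, not from the textbook):

    A = 5
    A = 3 + A    # A = 8: the first operation completes first
    B = 4 * A    # B = 32, the correct result

    # If both operations were performed concurrently, each would read the
    # original A = 5, and the second would compute B = 4 * 5 = 20: incorrect.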


Data Hazards
• On the other hand, the two operations

A ← 5 × C
B ← 20 + C

• can be performed concurrently, because these operations are independent.
• These two examples illustrate a basic constraint that must be enforced to guarantee correct results.
• When two operations depend on each other, they must be performed sequentially in the correct order.


Data Hazard

A data dependency arises when the destination of one instruction is used as a source in the next instruction.
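As a concrete sketch, this dependency check can be written as follows (the dict representation of an instruction is illustrative, not the textbook's):

    def has_data_dependency(i1, i2):
        """True if i2 reads the register that i1 writes (read-after-write)."""
        return i1["dest"] in i2["srcs"]

    i1 = {"op": "Mul", "srcs": ["R2", "R3"], "dest": "R4"}
    i2 = {"op": "Add", "srcs": ["R5", "R4"], "dest": "R6"}
    print(has_data_dependency(i1, i2))   # True: stall, or forward the operand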


Operand Forwarding
• The data hazard just described arises because one instruction, instruction I2, is waiting for data to be written into the register file.
• However, these data are available at the output of the ALU once the Execute stage completes step E1.
• Hence, the delay can be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be forwarded directly for use in step E2.
• Operand forwarding is used to minimize the stalls caused by data dependencies.
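A minimal sketch of the forwarding (bypass) decision, assuming each interstage buffer records the destination register and value of the instruction ahead (names are illustrative, not from the textbook):

    def pick_operand(reg, regfile, ex_buf, mem_buf):
        """Return the freshest value of register `reg`: prefer a result still
        sitting in an interstage buffer over the (possibly stale) register file."""
        if ex_buf and ex_buf["dest"] == reg:
            return ex_buf["value"]      # forwarded from the instruction one stage ahead
        if mem_buf and mem_buf["dest"] == reg:
            return mem_buf["value"]     # forwarded from two stages ahead
        return regfile[reg]             # no pending write: read the register file

    # I1 has computed R4 = 12 in its Execute step but has not yet written it back.
    regfile = {"R4": 0, "R5": 7}
    ex_buf = {"dest": "R4", "value": 12}
    print(pick_operand("R4", regfile, ex_buf, None))   # 12: I2 executes without stalling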


Operand Forwarding in Datapath

[Figure: Operand forwarding path in the datapath]


Operand Forwarding in a Pipelined Processor
• Operand forwarding uses the interstage buffers that exist between the stages.
• These registers hold the intermediate output of each stage.
• With the help of these intermediate registers, a dependent instruction is able to directly access the new value.


Handling Data Hazards in Software
• An alternative approach is to leave the task of detecting data dependencies and dealing with them to the software.
• In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting NOP (no-operation) instructions, as follows:

I1: Mul R2, R3, R4
    NOP
    NOP
I2: Add R5, R4, R6

• Here I2 reads R4, which is produced by I1; the two NOPs delay I2 until the result of I1 has been written into R4.


Instruction Hazards
• Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls. A branch instruction may also cause a stall.
• Cache miss
• Branch
• Now we will see the effect of branch instructions and the techniques that can be used for mitigating their impact.


Instruction Hazards - Unconditional Branch

[Figure: Unconditional branch; the pipeline is stalled for one clock cycle]


Instruction Hazards - Unconditional Branch
• The time lost as a result of a branch instruction is referred to as the branch penalty.
• For a longer pipeline, the branch penalty may be higher.


Branch Penalty

[Figure: The branch penalty is two clock cycles]


Branch Penalty Reduction
• Reducing the branch penalty requires the branch address to be computed earlier in the pipeline.
• Typically, the instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched.


[Figure: The branch penalty is only one clock cycle when the branch address is computed in the Decode stage]


Instruction Queue and Prefetching
• Either a cache miss or a branch instruction stalls the pipeline for one or more clock cycles.
• To reduce the effect of these interruptions, many processors employ fetch units that can fetch instructions before they are needed and put them in a queue.
• Typically, the instruction queue can store several instructions.
• A separate unit, which we call the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit.


Instruction Queue and Prefetching
• The dispatch unit also performs the decoding function.
• To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions.
• It attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions.
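A toy model of this arrangement in Python (names and queue depth are illustrative, not from the textbook):

    from collections import deque

    class FetchUnit:
        """Keeps a small instruction queue filled; a dispatch unit drains it."""
        def __init__(self, program, depth=4):
            self.program = program            # the instruction stream
            self.pc = 0
            self.queue = deque()
            self.depth = depth

        def fetch_cycle(self):
            # The fetch unit adds one instruction each cycle (queue space and
            # memory permitting), even while the rest of the pipeline is stalled.
            if len(self.queue) < self.depth and self.pc < len(self.program):
                self.queue.append(self.program[self.pc])
                self.pc += 1

        def dispatch(self):
            # The dispatch unit issues from the front of the queue; an empty
            # queue means a bubble enters the pipeline.
            return self.queue.popleft() if self.queue else None

    fu = FetchUnit(["I1", "I2", "I3", "I4"])
    for _ in range(3):
        fu.fetch_cycle()                      # pipeline stalled: the queue fills up
    print(fu.dispatch(), len(fu.queue))       # I1 2: dispatch resumes from the queue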


Instruction Queue and Prefetching

[Figure: Fetch unit with instruction queue and dispatch unit]


Branch Timing with Instruction Queue

[Figure: Branch timing in the presence of an instruction queue; the branch target address is computed in the D stage]
Branch Timing with Instruction Queue
• When the pipeline stalls because of a data hazard, for example, the dispatch unit is not able to issue instructions from the instruction queue.
• However, the fetch unit continues to fetch instructions and add them to the queue.
• Conversely, if there is a delay in fetching instructions because of a branch or a cache miss, the dispatch unit continues to issue instructions from the instruction queue.


Branch Timing with Instruction Queue
• Instructions I1, I2, I3, I4, and Ik complete execution in successive clock cycles. Hence, the branch instruction does not increase the overall execution time.
• This is because the instruction fetch unit has executed the branch instruction (by computing the branch address) concurrently with the execution of other instructions.
• This technique is referred to as branch folding.


Branch Timing with Instruction Queue
• The instruction queue mitigates the impact of branch instructions on performance through the process of branch folding.
• It has a similar effect on stalls caused by cache misses.
• The effectiveness of this technique is enhanced when the instruction fetch unit is able to add more than one instruction at a time to the queue.


Conditional Branches
• A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction.
• The decision to branch cannot be made until the execution of that instruction has been completed.
• There are several ways to reduce the branch penalty associated with conditional branches and their negative impact on the rate of execution of instructions:
▪ Delayed branch
▪ Branch prediction
▪ Dynamic branch prediction
Delayed Branch
• The locations immediately following a branch instruction are called branch delay slots.
• The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken.
• The objective is to place useful instructions in these slots.
• The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions.


Delayed Branch

LOOP   Shift_left  R1
       Decrement   R2
       Branch≠0    LOOP
NEXT   Add         R1, R3

(a) Original program loop

LOOP   Decrement   R2
       Branch≠0    LOOP
       Shift_left  R1        (delay slot)
NEXT   Add         R1, R3

(b) Reordered instructions

Figure 8.12. Reordering of instructions for a delayed branch.
Delayed Branch: Timing

Clock cycle                1  2  3  4  5  6  7  8
Decrement                  F  E
Branch                        F  E
Shift (delay slot)               F  E
Decrement (Branch taken)            F  E
Branch                                 F  E
Shift (delay slot)                        F  E
Add (Branch not taken)                       F  E

• Logically, the program is executed as if the branch instruction were placed after the shift instruction.
• That is, branching takes place one instruction later than where the branch instruction appears in the instruction sequence in memory; hence the name delayed branch.
• The effectiveness of this approach depends on how often it is possible to reorder instructions.

Figure 8.13. Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 8.12.
Branch Prediction
• The simplest form of branch prediction is to assume that the branch will not take place and to continue to fetch instructions in sequential address order.
• Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis.


Branch Prediction
• Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution sequence.
• Care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed.


An Example of Incorrectly Predicted Branch

[Figure: Timing when a branch is incorrectly predicted as not taken]


An Example of Incorrectly Predicted Branch
• The results of the compare operation are available at the end of cycle 3.
• Assuming that they are forwarded immediately to the instruction fetch unit, the branch condition is evaluated in cycle 4.
• At this point, the instruction fetch unit realizes that the prediction was incorrect, and the two instructions in the execution pipe are purged.


Branch Prediction
• Better performance can be achieved if we arrange for some branch instructions to be predicted as taken and others as not taken.
• Use hardware to observe whether the target address is lower or higher than that of the branch instruction (backward branches, such as loop-closing branches, are usually taken).
• Let the compiler include a branch prediction bit in the instruction.


Dynamic Branch Prediction
• If the branch prediction decision is always the same every time a given instruction is executed, it is called static branch prediction.
• If the prediction decision may change depending on execution history, it is called dynamic branch prediction.
• In dynamic branch prediction schemes, the processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that instruction is executed.


Dynamic Branch Prediction
• In the simplest form, the execution history used in predicting the outcome of a given instruction is the result of the most recent execution of that instruction.
• The processor assumes that the next time the instruction is executed, the result is likely to be the same.
• Hence, the algorithm may be described by a two-state machine. The two states are:
LT: Branch is likely to be taken
LNT: Branch is likely not to be taken
• This scheme requires one bit of history information for each branch instruction and works well inside program loops.
Branch Prediction Algorithm: State Machine
• Better performance can be achieved by keeping more information about execution history.
• An algorithm that uses 4 states, and thus 2 bits of history information for each branch instruction, is shown in the figure.

[Figure: Four-state branch prediction algorithm]
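As a minimal sketch of such a four-state scheme (a 2-bit saturating counter; the encoding is illustrative, not the textbook's figure):

    class TwoBitPredictor:
        """Counter values 0,1 predict not taken; 2,3 predict taken."""
        def __init__(self):
            self.counters = {}                      # one 2-bit counter per branch address

        def predict(self, pc):
            return self.counters.get(pc, 1) >= 2    # True means "predict taken"

        def update(self, pc, taken):
            c = self.counters.get(pc, 1)
            self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

    # A loop branch taken 9 times, then not taken on exit: only the first and
    # last predictions miss, which is why this scheme works well inside loops.
    p = TwoBitPredictor()
    misses = 0
    for taken in [True] * 9 + [False]:
        if p.predict(0x40) != taken:
            misses += 1
        p.update(0x40, taken)
    print(misses)   # 2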
