COA Unit-3 Slides
Let's say that there are four loads of dirty laundry that need to be washed, dried,
and folded. We could put the first load in the washer for 30 minutes, dry it for
40 minutes, and then take 20 minutes to fold the clothes. Then pick up the second
load and wash, dry, and fold, and repeat for the third and fourth loads. Supposing
we started at 6 PM and worked as efficiently as possible, we would still be doing
laundry until midnight.
However, a smarter approach to the problem would be to put the second load of
dirty laundry into the washer after the first was already clean and whirling happily
in the dryer. Then, while the first load was being folded, the second load would dry,
and a third load could be added to the pipeline of laundry. Using this method, the
laundry would be finished by 9:30.
Instruction Execution
5cc
3cc
Pipelining
• Advantages
• Pipelining improves the throughput of the system.
• Once the pipeline is full, ideally one instruction finishes its execution in every clock cycle.
• It allows multiple instructions to be executed concurrently.
• Disadvantages
• The design of a pipelined processor is complex and costly to manufacture.
• The latency of an individual instruction increases.
Types of Pipelining
• Instruction Pipelining
• Arithmetic Pipelining
Instruction Pipelining
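ET Non-Pipe = N*K*TP, where N is the number of tasks, K is the number of stages and TP is the time of one stage (one clock period).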
Parameters that Determine the Performance of a Pipeline
• The speed up (S) of a pipelined processor over a non-pipelined processor, when N tasks are executed on the same processor, is:
S = ET Non-Pipe / ET Pipe
S = [N*K*TP] / [(K + N-1)*TP]
  = (N*K) / (K + N-1)
• When the number of tasks N is significantly larger than K, i.e. N >> K, then K + N-1 ≈ N and S ≈ (N*K)/N = K,
where K is the number of stages in the pipeline.
• Efficiency = Given Speed Up / Maximum Speed Up
= S / SMax
We know that SMax = K, so Efficiency = S / K.
Segment No. \ Clock: 1 2 3 4 5 6 7 8 9
No. of Segments = 4
No. of Tasks = 6
Speed Up Ratio = ?
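The exercise above can be checked with a short sketch of the speed-up formula. It assumes that every segment takes the same time TP, so TP cancels out; the function name pipeline_speedup is only illustrative.

```python
def pipeline_speedup(k, n):
    """Speed-up of a k-segment pipeline over a non-pipelined unit for n tasks.

    Assumes every segment takes the same time TP, so TP cancels out.
    """
    non_pipelined_cycles = n * k
    pipelined_cycles = k + n - 1
    return non_pipelined_cycles / pipelined_cycles


k, n = 4, 6
s = pipeline_speedup(k, n)
efficiency = s / k  # SMax = K
print(f"S = {n * k}/{k + n - 1} = {s:.2f}, efficiency = {efficiency:.2f}")
```

For 4 segments and 6 tasks this gives S = 24/9, about 2.67, which is consistent with the 9 clock cycles shown in the diagram header.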
Branch Instruction Hazard
Instruction \ Clock    1    2    3    4    5    6    7    8    9    10   11   12   13
1                      FI   DA   FO   EX
2                           FI   DA   FO   EX
3 (Branch)                       FI   DA   FO   EX
4                                     FI   --   --   FI   DA   FO   EX
5                                                         FI   DA   FO   EX
6                                                              FI   DA   FO   EX
7                                                                   FI   DA   FO   EX
No. of stalls = 2
Pipeline Hazard
• Pipeline: It is a technique of decomposing a sequential process into a number of sub-processes, with each sub-process being executed in a special dedicated segment that operates concurrently with all other segments.
• Any condition that causes a 'stall' in the pipeline operations can be called a hazard.
• Pipeline hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
• Hazard occurs:
A <- 3+A
B <- 4*A
• No hazard:
A <- 5*C
B <- 20+C
Pipeline Hazards
a) Data Hazard
b) Control or Instruction Hazard
c) Structural Hazard
Pipeline Hazard
a) Data Hazards
• An instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction.
• Data hazards occur when data is used before it is ready.
• In other words, an instruction attempts to use a resource before it is ready.
There are three types of data hazard possible:
1) RAW (Read after Write) [Flow/True data dependency]
2) WAR (Write after Read) [Anti-Data dependency]
3) WAW (Write after Write) [Output data dependency]
Pipeline Hazard
• Let there be two instructions I and J, such that J follows I. Then,
• RAW hazard occurs when instruction J tries to read data before instruction I writes it.
• Eg:
• I: R2 <- R1 + R3
• J: R4 <- R2 + R3
• WAR hazard occurs when instruction J tries to write data before instruction I reads it.
• Eg:
• I: R2 <- R1 + R3
• J: R3 <- R4 + R5
Pipeline Hazard
• WAW hazard occurs when instruction J tries to write output before instruction I writes it.
• Eg:
• I: R2 <- R1 + R3
• J: R2 <- R4 + R5
• WAR and WAW hazards occur during the out-of-order execution of the instructions.
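The three definitions can be turned into a small checker. This is a minimal sketch: the (destination, sources) tuple representation and the function name classify_hazards are assumptions made for illustration, not part of the slides.

```python
def classify_hazards(first, second):
    """Return the data hazards that instruction J (`second`) has on I (`first`).

    Each instruction is a (destination, sources) pair,
    e.g. ("R2", ("R1", "R3")) for R2 <- R1 + R3.
    """
    dest_i, srcs_i = first
    dest_j, srcs_j = second
    hazards = []
    if dest_i in srcs_j:
        hazards.append("RAW")  # J reads what I writes
    if dest_j in srcs_i:
        hazards.append("WAR")  # J writes what I reads
    if dest_i == dest_j:
        hazards.append("WAW")  # J writes what I writes
    return hazards


I = ("R2", ("R1", "R3"))                           # I: R2 <- R1 + R3
print(classify_hazards(I, ("R4", ("R2", "R3"))))   # ['RAW']
print(classify_hazards(I, ("R3", ("R4", "R5"))))   # ['WAR']
print(classify_hazards(I, ("R2", ("R4", "R5"))))   # ['WAW']
```

The three calls reuse the I/J pairs from the examples above.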
Observations
• All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in the WB stage (shown black), and the SUB instruction reads the value during its ID stage (IDsub). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it.
• The AND instruction is also affected by this data hazard. The write of R1 is not performed until cycle 5 (shown black). Thus, the AND instruction, which reads the registers during cycle 4 (IDand), will receive the wrong result.
• The OR instruction can be made to operate without incurring a hazard by a simple implementation technique: perform register file writes in the first half of the cycle and reads in the second half. Because both WB for ADD and IDor for OR are performed in cycle 5, the write to the register file by ADD is performed in the first half of the cycle, and the read of registers by OR is performed in the second half.
• The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD.
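The observations above can be reproduced with a short timing sketch. It assumes a 5-stage pipeline (IF, ID, EX, MEM, WB), one instruction issued per cycle starting with ADD in cycle 1, and a register file that writes in the first half of a cycle and reads in the second half. The function name is illustrative.

```python
# Assumed 5-stage pipeline: IF, ID, EX, MEM, WB, one instruction issued per cycle.
ID_STAGE, WB_STAGE = 2, 5


def reads_stale_value(producer_issue, consumer_issue):
    """True if the consumer reads a register before the producer has written it."""
    write_cycle = producer_issue + WB_STAGE - 1
    read_cycle = consumer_issue + ID_STAGE - 1
    # Reading in the same cycle as the write is safe:
    # the write uses the first half of the cycle, the read the second half.
    return read_cycle < write_cycle


program = ["ADD", "SUB", "AND", "OR", "XOR"]  # ADD is issued in cycle 1
for issue, name in enumerate(program[1:], start=2):
    status = "stale read (hazard)" if reads_stale_value(1, issue) else "ok"
    print(f"{name}: ID in cycle {issue + 1}, {status}")
```

Under these assumptions SUB and AND read a stale R1, while OR and XOR are safe, matching the four bullets.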
Control Hazard
• In the previous lecture, we studied pipeline hazards.
• Any condition that causes a STALL in the pipeline operations can be called a hazard.
• It means that, due to some circumstances, the pipeline gets disturbed and does not perform concurrently for some clock cycles.
• Control Hazard: The situation when the pipeline cannot operate normally due to non-sequential control flow.
Types of Control Hazards
• Branch Hazards: These occur when the processor encounters a conditional branch instruction (like an if statement or a loop). Since the outcome of a
branch (whether it will be taken or not) is not known until later in the pipeline, the processor may not know which instruction to fetch next.
• Example: If the processor encounters a branch that depends on the result of a comparison, it has to wait for the comparison to be evaluated before it
knows whether to jump to a new instruction or continue with the next one in the sequence.
• Jump Hazards: These occur when the processor encounters an unconditional jump instruction, where it must immediately change the flow of
execution to a new location. Unlike conditional branches, where the outcome depends on a condition, a jump is always taken, but the new instruction
address may not be known immediately.
• Example: An unconditional GOTO statement or a function call that transfers control to a different memory location.
• Indirect Branch Hazards: These occur when the target of a branch or jump is determined at runtime, rather than being a fixed address known in
advance.
• Example: A function call where the target address is stored in a register or memory location, making it more complex for the processor to predict where to go next.
• Solution: More advanced forms of branch prediction are required to handle indirect branches.
• Call and Return Hazards: These occur when the processor executes a function call or a return instruction. When returning from a function, the
processor needs to know where to resume execution. If the return address is not immediately available, this can cause a delay.
• Example: When a function is called, the processor pushes the return address to a stack and later needs to pop it to continue execution.
Control Hazard
• Now we will look at the control (or instruction) hazard.
• We will see the conditions under which a control hazard occurs in the pipeline, so that it does not perform concurrent/overlapped operations for some clock cycles.
• A control hazard occurs due to a branch instruction, i.e. the impact of the pipeline on the branch condition.
• We can understand the control hazard through an example.
Example:
Memory Location   Instruction
12:   BEQ R1, R3, 36    // When R1 and R3 are equal, jump to address 36
16:   AND R2, R3, R5
20:   OR  R6, R1, R7
24:   ADD R8, R1, R9
36:   XOR R10, R1, R11
Control Hazard
• Let us assume that, in a dedicated pipeline architecture, the following 5 phases are required to execute an instruction, namely Instruction Fetch, Instruction Decode, Execution, Memory access and Write Back (writing the result in the register).
IF: Instruction Fetch
ID: Instruction Decode
EX: Execution
MEM: Memory access
WB: Write Back / store the result in the register
Timing Diagram of given set of Instructions
Address   CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
12        IF    ID    EX    MEM   WB
16              IF    ID    EX    MEM   WB
20                    IF    ID    EX    MEM   WB
24                          IF    ID    EX    MEM   WB
Control Hazard
In instruction 12:
• Whether it will jump to address location 36 or not is known only in the 'MEM' phase, i.e. at CC4.
• It means that by the time instruction 12 is in the 4th CC, the next three instructions at memory locations 16, 20 and 24 have already entered the pipe and are performing the operations of their respective phases.
• In the normal pipeline concept, these instructions (located at addresses 16, 20 and 24) have entered the pipe.
• Now, suppose the condition (R1 = R3) becomes true. Then everything done for the subsequent instructions is wrong, because at this point it is clear that the instruction at location 36 should be the next instruction to be executed, instead of 16, 20 and 24.
• So, the wrongly entered/processed instructions must be flushed out.
• Therefore, a STALL of 3 CC occurs: the instructions at locations 16, 20 and 24 must not be executed and are flushed out of the pipe.
Control Hazard
• According to the above example, the branch instruction decides in the MEM stage, i.e. in CC4, to go to location 36 for the next instruction to be executed.
• The three subsequent instructions that follow the branch instruction are fetched and begin their execution as in the normal scenario, before BEQ branches to location 36.
• The common assumption is not to stop the pipeline because of a branch; once the branch condition is true, just flush the unwanted instructions out of the pipe.
• In this case, the Branch Penalty is 3 cycles.
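The branch penalty in this example can be sketched as a small calculation. It assumes one instruction is fetched per cycle, fetching continues sequentially until the branch outcome is known, and instructions are 4 bytes apart as in the example addresses. The function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def branch_penalty(resolve_stage):
    """Cycles lost when a taken branch is resolved in `resolve_stage`.

    Assumes one instruction is fetched per cycle and fetching continues
    sequentially until the branch outcome is known.
    """
    return STAGES.index(resolve_stage)


def flushed_addresses(branch_address, resolve_stage, instruction_size=4):
    """Addresses fetched after the branch that must be flushed if it is taken."""
    count = branch_penalty(resolve_stage)
    return [branch_address + instruction_size * i for i in range(1, count + 1)]


print(branch_penalty("MEM"))          # 3
print(flushed_addresses(12, "MEM"))   # [16, 20, 24]
```

Resolving the branch earlier shrinks the penalty: under the same assumptions, branch_penalty("EX") is 2.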
Control Hazard
So,
• The instruction fetch unit of the CPU is responsible for providing a stream of instructions to the execution unit.
• The instructions fetched by the fetch unit are in consecutive memory locations until some special condition or branch occurs.
• A problem arises when one of the instructions is a branch instruction and needs to go to some different memory location.
• In this case, all unwanted instructions fetched into the pipeline from consecutive memory locations are now invalid and need to be removed, i.e. flushed out of the pipe.
Memory Location   Instruction
100:   BEQ R1, R3, 120   // When R1 and R3 are equal, jump to address 120
104:   Instruction 2
108:   Instruction 3     (Flush OUT)
...
120:   Instruction 10
124:   Instruction 11
Control Hazard
• This causes a STALL in the pipeline until the correct instructions are fetched from memory.
• The time lost as a result of this is called the Branch Penalty.
• To reduce the resulting delay, dedicated hardware is incorporated in the fetch/decode unit to identify in advance the possibility of a branch instruction occurring.
• This can increase the cost.
Structural Hazard
• A structural hazard occurs when multiple instructions need the same resource.
• In a computer organization, common resources are used by multiple instructions for their execution.
• These resources include memory (RAM), different kinds of registers, the ALU, the common bus, etc.
• We have a limited number of resources and a large number of instructions.
• So, obviously, many conflicts may occur due to this situation.
• When the normal pipeline operation is disturbed because of this, it is called a Structural Hazard.
Types of Structural Hazards
• Single-Port Memory Hazards: Occur when one instruction is trying to read from a memory location while another instruction is trying to write to the same location at the same time.
• Execution Unit Conflicts: Arise when multiple instructions require the same execution unit, such as an ALU, simultaneously.
• Bus Contention: Happens when two or more instructions try to use the same bus to transfer data at the same time.
Structural Hazard
Clock Cycle   CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
I1            IF    ID    EX    MEM   WB
I2                  IF    ID    EX    MEM   WB
I3                        IF    ID    EX    MEM   WB
I4                              IF    ID    EX    MEM   WB
With one stall for I4:
Clock Cycle   CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
I1            IF    ID    EX    MEM   WB
I2                  IF    ID    EX    MEM   WB
I3                        IF    ID    EX    MEM   WB
I4                              STALL IF    ID    EX    MEM   WB
• For I4: If we insert a stall at CC4 and start I4 in CC5, the same kind of problem still exists, because at CC5, I2 is using the memory along with I4.
• The same kind of problem may continue to occur in CC6 and CC7 as well.
Stalling to avoid hazards
• For all of these hazards, the simplest solution to implement is to stall. Stalling involves having the hardware
introduce a delay, or bubble, into the pipeline when a hazard is encountered until that hazard is resolved.
Drawbacks of stalling:
• Reduced efficiency and increased cycle count
• Loss of parallelism and potential impact on performance
Structural Hazard
Clock Cycle   CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
I1            IF    ID    EX    MEM   WB
I2                  IF    ID    EX    MEM   WB
I3                        IF    ID    EX    MEM   WB
I4                              STALL STALL IF    ID    EX    MEM
Clock Cycle   CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11
I1            IF    ID    EX    MEM   WB
I2                  IF    ID    EX    MEM   WB
I3                        IF    ID    EX    MEM   WB
I4                              STALL STALL STALL IF    ID    EX    MEM   WB
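The number of stalls in the last diagram can be found by a simple search. This is a sketch under the assumptions used in the slides: a single-port memory shared by IF and MEM, and every earlier instruction accessing memory in its MEM stage, 3 cycles after its own fetch. The function and parameter names are illustrative.

```python
def stalls_for_fetch(earlier_issue_cycles, wanted_fetch_cycle, mem_offset=3):
    """Stalls needed before a fetch can use a single-port memory.

    Assumes every earlier instruction accesses memory in its MEM stage,
    `mem_offset` cycles after its own fetch.
    """
    busy = {issue + mem_offset for issue in earlier_issue_cycles}
    fetch_cycle = wanted_fetch_cycle
    while fetch_cycle in busy:
        fetch_cycle += 1  # memory is busy with a MEM access: insert one stall
    return fetch_cycle - wanted_fetch_cycle


# I1, I2 and I3 are fetched in CC1, CC2 and CC3; I4 would like to fetch in CC4.
print(stalls_for_fetch(earlier_issue_cycles=[1, 2, 3], wanted_fetch_cycle=4))  # 3
```

With three stalls, I4 is fetched in CC7 and completes in CC11, as in the last timing diagram.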
Data Hazard
Example
Methods of Optimizing Against Hazards – Compiler-Level
• While stalling is a universal remedy in that it can be used to resolve any pipelining hazard, the costs are high, and
stalls impact a chip’s ability to perform efficiently. However, there are other methods available to resolve hazards
that help retain efficiency.
• The first we will examine are all performed in the compiler; no additional hardware need be added to
implement them because the improvements are made to the code itself, not to the machine running it.
• Implemented correctly, compiler-level optimizations, as opposed to hardware-level optimizations, can provide a solution to hazards that does not require extra power and can be performed on any hardware that can implement the simple pipeline described above.
Resolving Structural Hazards
• One approach to structural hazards is to reorder operations such that two instructions are never so close to one another that this sort of hazard occurs. The extent to which this is possible depends largely on the hardware; a string of successive multiplications could be difficult to organize such that a structural hazard never occurred. However, barring that, reordering operations to prevent this kind of hazard is often viable at compile time. Because the compiler has knowledge of how long a particular
Example for Static Instruction Reordering
• R1 ← R1 + R2
• R3 ← R3 + R3
Instruction Reordering
• R1 ← R2 * R2
• R2 ← R1 + R3
Re-ordered sequence:
R1 ← R2 * R2
R4 ← R5 - R6
R2 ← R1 + R3
Reordering can sometimes be made more difficult by data hazards; reordering to avoid one hazard can result in
others, as in the following example:
• R1 ← R2 * R2
• R2 ← R1 + R3
• R4 ← R5 – R2
In this case, the subtraction reads R2, which the addition writes, so the subtraction has a true (RAW) data dependence on the addition and the two instructions are not independent. Normally, we might want to put an independent instruction between the multiplication and the addition, since multiplications take longer than additions and the addition would otherwise be delayed; here the subtraction cannot be moved there without changing the result.
Resolving Control Hazards
Managing control hazards at the compiler level is related to distancing the logical operation on which the
branch is based from the branch itself and on limiting the number of branches. Both of these are accomplished
by loop unrolling.
Loop unrolling essentially expands the body of a loop so that fewer branches are necessary. Hennessey
Example:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
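The effect of unrolling on the number of branches can be illustrated with a sketch. It is written in Python rather than C, assumes an unrolling factor of 4 (chosen only because 1000 is divisible by 4), and counts one loop-back branch per loop iteration.

```python
def add_scalar_rolled(x, s):
    """Direct translation of the loop: one loop-back branch per element."""
    branches = 0
    i = 1000
    while i > 0:
        x[i] = x[i] + s
        i -= 1
        branches += 1
    return branches


def add_scalar_unrolled(x, s):
    """Same work unrolled 4 times: one loop-back branch per four elements."""
    branches = 0
    i = 1000
    while i > 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
        branches += 1
    return branches


a = [0.0] * 1001  # index 0 is unused, as in the original loop
b = [0.0] * 1001
print(add_scalar_rolled(a, 1.5), add_scalar_unrolled(b, 1.5), a == b)
```

Both versions produce the same array, but the unrolled loop executes 250 loop-back branches instead of 1000.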
• Operand forwarding is a key strategy to combat data hazards.
• It allows the dynamic transfer of data between pipeline stages.
• It minimizes delays by forwarding results directly to dependent instructions.
• It enhances throughput by reducing the stalls incurred when executing dependent instructions.
• By doing so, it optimizes resource utilization within the pipeline.
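The benefit of forwarding can be sketched by counting stall cycles for a RAW dependence. The sketch assumes a 5-stage pipeline (IF, ID, EX, MEM, WB), a producer that is an ALU instruction (not a load), and a register file that writes in the first half of a cycle and reads in the second half. The function name is illustrative.

```python
def stalls_needed(distance, forwarding):
    """Stall cycles for a consumer issued `distance` instructions after an ALU producer.

    Assumed pipeline: IF, ID, EX, MEM, WB; registers are read in ID and written in WB.
    """
    if forwarding:
        # The ALU result is bypassed from the pipeline register straight to
        # the EX stage of the dependent instruction, so no stall is needed.
        return 0
    # Without forwarding, the consumer's ID must not come before the producer's WB
    # (the same cycle is fine: write in the first half, read in the second half).
    return max(0, 3 - distance)


for d in (1, 2, 3):
    print(d, stalls_needed(d, forwarding=False), stalls_needed(d, forwarding=True))
```

Under these assumptions, a dependent instruction immediately after its producer needs 2 stall cycles without forwarding and none with it.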
Resolving Control Hazards
Branch Prediction
• Guessing whether a branch (like an if-statement) will go one way or the other, to keep the pipeline moving without delays.
• There are two types: 1) Static and 2) Dynamic.
Static Prediction:
• Uses simple rules, like always assuming branches are "taken" or "not taken."
• Pros: Reduces stalls and keeps the pipeline working smoothly.
• Cons: Wrong guesses lead to wasted work and lost time.
Dynamic Branch prediction
• A smarter type of branch prediction that looks at what happened in the past to make better guesses about
branches.
• Common techniques include 1-bit and 2-bit prediction tables, and more complex methods like Pattern
History Tables (PHT) or Branch History Tables (BHT).
• Pros: More accurate than basic guessing, which means fewer stalls.
• Cons: Needs extra hardware, and wrong guesses can still waste time.
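A 2-bit predictor can be sketched as a saturating counter. This is a minimal illustration of the idea for a single branch, not a model of any particular processor; the class name, the initial state and the sample outcome sequence are assumptions.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch that is taken 9 times and then falls through, executed twice.
outcomes = ([True] * 9 + [False]) * 2
print(count_mispredictions(outcomes, TwoBitPredictor()))  # 4
```

After warming up, the counter mispredicts only the loop exit: a single not-taken outcome does not flip the prediction for the next run of the loop.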
Consider the following 4-stage instruction pipeline, where different instructions take different amounts of time (in clock cycles) at different stages. How many CC will be required to complete these four instructions in the given pipeline?
     IF   ID   EX   WB
I1   2    1    2    2
I2   1    3    3    1
I3   2    2    2    2
I4   1    2    1    2
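One way to work this out is to track when each instruction leaves each stage. The sketch below assumes an in-order pipeline in which an instruction enters a stage as soon as it has finished the previous stage and the preceding instruction has left the stage; the function name is illustrative.

```python
def completion_cycles(stage_times):
    """Clock cycles needed to finish all instructions in an in-order pipeline.

    stage_times[i][s] is the number of cycles instruction i spends in stage s.
    """
    num_stages = len(stage_times[0])
    prev_finish = [0] * num_stages  # finish times of the previous instruction
    for times in stage_times:
        finish = []
        ready = 0  # cycle at which this instruction finished its previous stage
        for s in range(num_stages):
            start = max(ready, prev_finish[s])  # wait for the stage to be free
            ready = start + times[s]
            finish.append(ready)
        prev_finish = finish
    return prev_finish[-1]


table = [
    (2, 1, 2, 2),  # I1: IF, ID, EX, WB
    (1, 3, 3, 1),  # I2
    (2, 2, 2, 2),  # I3
    (1, 2, 1, 2),  # I4
]
print(completion_cycles(table))  # 15
```

Under this model the four instructions complete in 15 clock cycles.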
Q. Consider the following program segment, which is executed in a 4-stage pipeline: Fetch (F), Decode (D), Execute (E), Write Back (W).
ADD R0, R1, R2
MUL R3, R4, R6
SUB R7, R8, R9
DIV R10, R11, R12
STORE X, R13
Fetch, Decode and Write Back take 1 CC each, while Execution takes 3 cycles for I2 (MUL) and I4 (DIV), as shown in the table, and 1 cycle for the remaining instructions.
     Fetch(F)   Decode(D)   Execute(E)   Write Back(W)
I1   1          1           1            1
I2   1          1           3            1
I3   1          1           1            1
I4   1          1           3            1
I5   1          1           1            1
What is the speed up?
Speed Up=
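The speed up for this question can be estimated with a short sketch. It assumes that the non-pipelined time is the sum of all stage times, that there are no data dependences between the five instructions, and that an instruction enters a stage once it has finished the previous one and the preceding instruction has left that stage. The names are illustrative.

```python
def pipelined_cycles(stage_times):
    """Cycles to finish all instructions when stages overlap (in-order pipeline)."""
    finish_prev = [0] * len(stage_times[0])
    for times in stage_times:
        t = 0
        finish = []
        for s, cycles in enumerate(times):
            # Start when the previous stage is done and this stage is free.
            t = max(t, finish_prev[s]) + cycles
            finish.append(t)
        finish_prev = finish
    return finish_prev[-1]


table = [
    (1, 1, 1, 1),  # I1 ADD
    (1, 1, 3, 1),  # I2 MUL
    (1, 1, 1, 1),  # I3 SUB
    (1, 1, 3, 1),  # I4 DIV
    (1, 1, 1, 1),  # I5 STORE
]
sequential = sum(sum(row) for row in table)
pipelined = pipelined_cycles(table)
print(sequential, pipelined, sequential / pipelined)  # 24 12 2.0
```

Under these assumptions the program takes 24 cycles without pipelining and 12 cycles with it, giving a speed up of 2.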
Q) A CPU has a 5-stage pipeline and operates at a frequency of 1 GHz. The instruction fetch happens in the first
stage. A conditional branch instruction computes the target address and evaluates the condition in the 3rd
stage. The CPU stalls and does not fetch new instructions following a conditional branch instruction until the
branch outcome is known. Given that a program consists of 1 billion instructions, where 20% of these
instructions are conditional branch instructions, and each instruction takes 1 clock cycle on average, calculate
the total time required for the completion of the program.
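A possible way to set up this calculation is sketched below. It assumes that each conditional branch adds (resolve stage - 1) stall cycles, since no fetch happens until the branch leaves the 3rd stage, and it ignores the few cycles needed to fill the pipeline. The function and parameter names are illustrative.

```python
def program_time_seconds(instructions, branch_fraction, resolve_stage,
                         clock_hz, base_cpi=1.0):
    """Execution time when the pipeline stalls after every conditional branch.

    Assumes the fetch of the next instruction waits until the branch has
    completed `resolve_stage`, i.e. (resolve_stage - 1) stall cycles per branch.
    Pipeline fill cycles are ignored.
    """
    stall_cycles_per_branch = resolve_stage - 1
    cpi = base_cpi + branch_fraction * stall_cycles_per_branch
    return instructions * cpi / clock_hz


total = program_time_seconds(instructions=10**9, branch_fraction=0.20,
                             resolve_stage=3, clock_hz=10**9)
print(f"{total:.2f} s")
```

With 2 stall cycles per branch, the average is 1 + 0.2 × 2 = 1.4 cycles per instruction, so the program takes about 1.4 seconds at 1 GHz.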
• Consider two pipeline implementations that have the same instruction structures and
support overlapping of all instructions, except for memory-related operations. In this
case, if memory operations cannot be executed simultaneously, it results in one stall
cycle. In the program, 20% of the instructions involve memory-related operations.
Pipeline 1 uses a 1-port memory, while Pipeline 2 uses a 2-port memory. If the speedup
factors of the respective pipelines are S1 and S2, what is the value of S2/S1?
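The ratio can be sketched in terms of average cycles per instruction (CPI). The sketch assumes an ideal CPI of 1, one stall cycle for every memory-related instruction on the 1-port memory, and no stall on the 2-port memory. Because both pipelines are otherwise identical, S2/S1 equals CPI1/CPI2.

```python
def effective_cpi(memory_fraction, stall_cycles):
    """Average cycles per instruction, assuming an ideal CPI of 1."""
    return 1.0 + memory_fraction * stall_cycles


cpi_1 = effective_cpi(0.20, 1)  # Pipeline 1: 1-port memory, one stall cycle
cpi_2 = effective_cpi(0.20, 0)  # Pipeline 2: 2-port memory, no stall
ratio = cpi_1 / cpi_2           # S2/S1 = CPI1/CPI2 for identical pipelines
print(f"S2/S1 = {ratio:.2f}")
```

Under these assumptions S2/S1 = 1.2.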