
Pipelining

Basic and Intermediate Concepts


What Is Pipelining?


Pipelining -- an implementation technique in which multiple instructions are overlapped in execution.

It takes advantage of the parallelism that exists among the actions needed to execute an instruction.

The throughput of an instruction pipeline is determined by how often an instruction exits the pipeline (instructions completed per clock cycle).

The time required between moving an instruction one step down the pipeline is a processor cycle.

In a computer, this processor cycle is usually 1 clock cycle (sometimes it is 2, rarely more).
The Basics of a RISC Instruction Set
Key Properties:

All operations on data apply to data in registers.

The only operations that affect memory are load and store operations, which move data from memory to a register or from a register to memory, respectively.

Load and store variants that move less than a full register (e.g., a byte, 16 bits, or 32 bits) are often available.

The instruction formats are few in number, with all instructions typically being one size.
RISC architectures have 3 classes of instructions:

ALU instructions
– take either two registers, or a register and a sign-extended immediate, as operands; operate on them; and store the result into a third register.
– operations include add (DADD), subtract (DSUB), and logical operations (such as AND or OR)
– immediate and unsigned forms exist


Load and store instructions
– take a base register and an immediate field (the offset) as operands.
– the effective address (the memory address) is the sum of the contents of the base register and the sign-extended offset.
– load instruction: a second register operand acts as the destination for the data loaded from memory.
– store instruction: the second register operand is the source of the data that is stored into memory.

Branches and jumps
– Branches are conditional transfers of control.
– The branch condition is specified either by a set of condition bits (a condition code) or by a limited set of comparisons between a pair of registers or between a register and zero.
– In all RISC architectures, the branch destination is obtained by adding a sign-extended offset to the current PC.
– Unconditional jumps are provided in many RISC architectures.
A Simple Implementation of a RISC Instruction Set
1. Instruction fetch cycle (IF):
– Send the program counter (PC) to memory and fetch the
current instruction from memory.
– Update the PC to the next sequential PC by adding 4
(since each instruction is 4 bytes) to the PC.

2. Instruction decode/register fetch cycle (ID):
– Decode the instruction and read the registers from the register file.
– Do the equality test on the registers as they are read, for a possible branch.
– Compute the possible branch target address by adding the sign-extended offset to the incremented PC.

3. Execution/effective address cycle (EX):
– Memory reference—the ALU adds the base register and the offset to form the effective address.
– Register-register ALU instruction—the ALU performs the operation specified by the opcode on the values read from the register file.
– Register-immediate ALU instruction—the ALU performs the operation specified by the opcode on the first value read from the register file and the sign-extended immediate.
– In a load-store architecture the effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address and perform an operation on the data.
4. Memory access (MEM)
– Load --> memory does a read using the effective address computed in the previous cycle.
– Store --> memory writes the data using the effective address.

5. Write-back cycle (WB)
– Write the result into the register file,
– whether it comes from the memory system (for a load) or
– from the ALU (for an ALU instruction).
In this implementation:

branch instructions require 2 cycles,
store instructions require 4 cycles,
and all other instructions require 5 cycles.

Assuming a branch frequency of 12% and a store frequency of 10%, this typical instruction distribution leads to an overall CPI of 4.54.
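The 4.54 is simply the frequency-weighted average of the cycle counts:

CPI = 0.12 × 2 + 0.10 × 4 + 0.78 × 5 = 0.24 + 0.40 + 3.90 = 4.54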
The Classic Five-Stage Pipeline for a RISC Processor
Observations

First, we use separate instruction and data memories, which we would typically implement with separate instruction and data caches.

The use of separate caches eliminates a conflict for a single memory that would otherwise arise between instruction fetch and data memory access.
●Second, the register file is used in two stages: for reading in ID and for writing in WB (which is why the pipeline figures show the register file in two places).
●Hence, we need to perform two reads and one write every clock cycle.
●To handle reads and a write to the same register, we perform the register write in the first half of the clock cycle and the read in the second half.
Third, Figure C.2 does not deal with the PC.

To start a new instruction every clock cycle,
● we must increment and store the PC every clock cycle, during the IF stage, in preparation for the next instruction, and
● we must have an adder to compute the potential branch target during ID.

One further problem is that a branch does not change the PC until the ID stage.

We must also ensure that instructions in the pipeline do not attempt to use the same hardware resources at the same time, and that instructions in different stages of the pipeline do not interfere with one another.
– This is accomplished with pipeline registers between successive stages.
Basic Performance Issues in Pipelining

Pipelining increases the CPU instruction throughput, but it does not reduce the execution time of an individual instruction.

In fact, it slightly increases the execution time of each instruction, due to overhead in the control of the pipeline.

The increase in instruction throughput means a program runs faster and has lower total execution time, even though no single instruction runs faster!

Practical limitations arise from pipeline latency, from imbalance among the pipe stages, and from pipelining overhead.

Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage.

Pipeline overhead arises from the combination of pipeline register delay and clock skew.

Pipeline registers add setup time (the time that a register input must be stable before the clock signal that triggers a write occurs) plus propagation delay to the clock cycle.

Clock skew is the maximum delay between the arrival of the clock at any two registers.

This simple RISC pipeline would function correctly only if every instruction were independent of every other instruction in the pipeline.

In reality, instructions in the pipeline can depend on one another.



The Major Hurdle of Pipelining—Pipeline Hazards
Structural hazards

resource conflicts: the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.

Data hazards

an instruction depends on the result of a previous instruction.

Control hazards

arise from the pipelining of branches and other instructions that change the PC.
Hazards in pipelines can make it necessary to stall the pipeline.

When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled.

Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.

No new instructions are fetched during the stall.

Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed.
Performance of Pipelines with Stalls
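In the usual formulation (assuming balanced stages and ignoring pipelining overhead), the standard result is:

Speedup from pipelining = Pipeline depth / (1 + Pipeline stall cycles per instruction)

so every stall cycle a hazard adds per instruction eats directly into the ideal depth-fold speedup.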
INSTRUCTION LEVEL PARALLELISM
and Its Exploitation
1. Concepts & Challenges
• ILP – overlap among instructions; parallel execution of instructions.
• Exploiting ILP:
  • H/W – dynamically, at execution time
  • S/W – statically, at compile time
• Pipelining hazards:
  – Structural hazards
  – Data hazards
  – Control hazards
• Hazards make it necessary to stall the pipeline – some instructions proceed, others are delayed.

• Stall – a wasted clock cycle
  – causes pipeline performance to degrade from ideal performance.
• Structural hazards
  • resource conflicts
• Data hazards
  • the current instruction depends on a previous one
• Control hazards
  • branch instructions
Structural Hazards
• Usually happen when a unit is not fully pipelined
  • that unit cannot accept a new instruction every cycle
• Or, when a resource has not been duplicated enough
  – Example: a single memory (or cache) shared for both instructions and data
  – Example: a single write port for the register file
• Usual solution: stall
  • also called a pipeline bubble, or simply a bubble
Why Allow Structural Hazards?
• Lower cost:
  – less hardware ==> lower cost
• Shorter latency of the unpipelined unit
  – may have other performance benefits
  – data hazards may introduce stalls anyway!

• Example: suppose the FP unit is unpipelined, and the other instructions have a 5-stage pipeline. What percentage of instructions can be FP, so that the CPI does not increase?
  – 20% can be FP, assuming no clustering of FP instructions
  – even if clustered, data hazards may introduce stalls anyway
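One way to see the 20% (assuming, as the 5-stage context suggests, that the unpipelined FP unit stays busy for the full 5 cycles of each FP instruction): the unit can accept a new FP instruction only once every 5 cycles, so evenly spaced FP instructions cause no structural stall as long as they make up at most 1 in 5, i.e., 20% of the instruction mix.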
Data Hazards
• Example:
  – ADD R1, R2, R3
  – SUB R4, R1, R5
  – AND R6, R1, R7
  – OR  R8, R1, R9
  – XOR R10, R1, R11
• All instructions after the ADD depend on R1
• Stalling is a possibility
  – Can we do better? (See the note below.)
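Yes, in this case: with forwarding (as in the classic five-stage pipeline), the ADD result is passed from the EX/MEM and MEM/WB pipeline registers directly to the ALU inputs of SUB and AND, and because the register file writes in the first half of a cycle and reads in the second half, OR reads the updated R1 during ADD's WB cycle; XOR reads it normally. No stalls are required.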
Data Hazard Classification
(consider two instructions i and j, with i occurring before j in program order)

• Read after Write (RAW):
  – j tries to read a source before i writes it.
  – Preserve instruction order; use data forwarding to overcome.
• Write after Write (WAW):
  – j tries to write an operand before it is written by i.
  – Arises only when writes can happen in different pipeline stages.
  – Often comes with other problems as well, such as structural hazards.
• Write after Read (WAR):
  – j writes to a destination before it is read by i.
  – Rare.
Stages in instruction execution
• IF
• ID
• EX
• MEM
• WB
Control Hazard

• The result of a branch instruction is not known until the end of the MEM stage
• Naive solution: stall until the result of the branch instruction is known
  – whether an instruction is a branch at all is only known at the end of its ID cycle
  – the IF of the following instruction may have to be repeated
Handling Control Hazards
• Naive solution: stall
• Predict untaken (not-taken):
  – treat every branch as not taken
  – only slightly more complex than stalling
  – do not update the machine state until the branch outcome is known
  – if the branch turns out to be taken, squash the fetched instruction by clearing the IF/ID register
More Ways to Reduce Control Hazard Delays
• Predict taken:
  – treat every branch as taken
  – of no use in DLX, since the branch target is not known before the branch condition anyway
  – may be of use in other architectures
• Delayed branch:
  – instruction(s) after the branch are executed anyway!
  – the sequential successors are called branch-delay slots
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
What is ILP?
• The amount of parallelism within a basic block is quite small
• Basic block – a straight-line code sequence with no branches in except at the entry, and no branches out except at the exit

• So we must exploit ILP across multiple basic blocks.
• The simplest and most common way to increase parallelism is to exploit parallelism among the iterations of a loop – loop-level parallelism.
• Related forms of parallelism:
  • loop-level parallelism
  • instruction-level parallelism
  • data-level parallelism
Example
● Adding two 1000-element arrays:

for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];

● Every iteration of the loop can overlap with any other iteration.
● Within each loop iteration there is little or no opportunity for overlap.
● To convert LLP to ILP, unroll the loop – statically by the compiler, or dynamically by the hardware.
Data dependences & Hazards
● To exploit ILP we must know which instructions can execute in parallel.
  – If 2 instructions are parallel, they can execute simultaneously with no stalls, given sufficient resources.
  – If 2 instructions are dependent, they are not parallel – they must execute in order, though they may partially overlap.
● How do we determine whether an instruction is dependent?
Dependences & Hazards
• Data dependences

• Name Dependences

• Control Dependences
Data dependences
• Instruction j is dependent on i if either of
the following holds:
a. Inst i produces a result that may be used by
j
b. Inst j is data dependent on k, and k is data
dependent on i (chain of dependences)

• Dependency with in a single inst. is not a


dependency
– ADD R1, R1
Example
• Loop: L.D    F0, 0(R1)    ; F0 = array element
        ADD.D  F4, F0, F2   ; add scalar in F2
        S.D    F4, 0(R1)    ; store result
        DADDUI R1, R1, #-8  ; decrement pointer by 8 bytes
        BNE    R1, R2, Loop ; branch if R1 != R2
Risk
● Program order must be preserved for correct execution.
● If 2 instructions are data dependent, they cannot be executed simultaneously or completely overlapped.
● If they were executed simultaneously, a hazard would occur and the pipeline would stall.
Pipeline organization properties
i. Whether a dependence results in an actual hazard being detected, and
ii. whether that hazard actually causes a stall,
are properties of the pipeline organization.
• Understanding this is essential to exploiting ILP.
Solution
• A data dependence conveys 3 things:
  1. the possibility of a hazard
  2. the order in which results must be calculated
  3. an upper bound on how much parallelism can be exploited

Since data dependences limit the ILP we can exploit, we can:
  – maintain the dependence but avoid the hazard, by scheduling the code (compiler or hardware)
  – eliminate the dependence by transforming the code

• Data values may flow between instructions through
  – registers
  – memory locations

• Detecting dependences:
  – through registers – easy, since register names are fixed
  – through memory locations – difficult, since two effective addresses may refer to the same location while looking different
Name Dependences
• Occur when 2 instructions use the same register or memory location, called a 'name',
• but there is no flow of data between the instructions associated with that name.
Types
• Antidependence
  – between i and j: occurs when j writes a register/memory location that i reads
  – the original ordering must be preserved, so that i reads the correct value
• Output dependence
  – occurs when i and j write the same register/memory location
  – the original ordering must be preserved, so that j writes the final value
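Illustrative examples (the register choices are hypothetical, with i before j):
  – Antidependence on F0:    i: ADD.D F4, F0, F2    j: L.D F0, 8(R1)     (j overwrites F0, which i must read first)
  – Output dependence on F0: i: L.D F0, 0(R1)       j: ADD.D F0, F2, F4  (both write F0; j's value must be the final one)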
Solution
• Register renaming
  – there is no data flow between the instructions, only a clash in the use of a 'name'
  – so use a different name (register) in one of the instructions
  – easy for registers
  – done by the compiler or by hardware
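A minimal sketch, applied to the antidependence above (F6 is an arbitrary unused register):
  Before: ADD.D F4, F0, F2   then   L.D F0, 8(R1)   – antidependence on F0
  After:  ADD.D F4, F0, F2   then   L.D F6, 8(R1)   – later uses read F6; the dependence disappears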
Control dependences
• A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that i is executed in correct program order and only when it should be.
• Example:  T1;
            if (P1) { S1; }
• Statement S1 is control dependent on P1, but T1 is not.
• What does this mean for execution?
  – S1 cannot be moved before P1, and T1 cannot be moved after P1.
Constraints imposed by CD
1. An instruction that is control dependent on a branch cannot be moved before the branch
   – its execution would no longer be controlled by the branch.
2. An instruction that is not control dependent on a branch cannot be moved after the branch
   – its execution would become controlled by the branch.
Solution
• Preserving strict program order ensures that control dependences are also preserved.
• But we can violate a control dependence without affecting the correctness of the program.
• What does correctness of the program mean here?
Correctness
● Two properties are critical to program correctness:
● Exception behavior
  – reordering must not cause new exceptions in the program
● Data flow
  – the actual flow of data values among instructions that produce results and consume them must be preserved
Basic compiler techniques for exposing ILP
• Pipeline scheduling
• Loop unrolling
Static vs. Dynamic Scheduling
● Static scheduling: limitations
  – dependences may not be known at compile time
  – even if known, the compiler becomes complex
  – the compiler must have knowledge of the pipeline
● Dynamic scheduling
  – handles dependences that are only known dynamically
  – simpler compiler
  – efficient even if the code was compiled for a different pipeline
Pipeline scheduling
● To exploit ILP, keep the pipeline full.
● How do we keep the pipeline full?
  – find sequences of unrelated instructions that can be overlapped in the pipeline.
● To avoid a stall, a dependent instruction must be separated from its source instruction "by a distance in clock cycles equal to the pipeline latency of that source instruction".
Bandwidth vs. Latency
• Bandwidth / throughput
  – total amount of work done in a given time
  • e.g., megabytes per second for a disk transfer
• Latency / response time
  – time between the start and completion of an event
  • e.g., milliseconds for a disk access
Pipeline scheduling (contd.)
• Pipeline scheduling depends on:
1. the amount of ILP available in the program
2. the latencies of the functional units in the pipeline

Instruction producing result | Instruction using result | Latency in clock cycles
-----------------------------+--------------------------+------------------------
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
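Here "latency" is the number of clock cycles that must intervene between the producing and the consuming instruction to avoid a stall: e.g., an FP ALU result needs 3 intervening cycles before another FP ALU op can use it, while a load feeding a store needs none.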
Loop unrolling

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;   // add scalar to vector

• Each iteration is independent of the others,
• so the loop is parallel.
• Evaluate the performance of this loop using the latencies above.
• Translated to assembly language:

Loop: L.D    F0, 0(R1)    ; F0 = array element
      ADD.D  F4, F0, F2   ; add scalar in F2
      S.D    F4, 0(R1)    ; store result
      DADDUI R1, R1, #-8  ; decrement pointer by 8 bytes (a double word)
      BNE    R1, R2, Loop ; branch if R1 != R2
Without scheduling, each iteration takes 9 clock cycles:

Loop: L.D    F0, 0(R1)    ; cycle 1
      stall               ; cycle 2
      ADD.D  F4, F0, F2   ; cycle 3
      stall               ; cycle 4
      stall               ; cycle 5
      S.D    F4, 0(R1)    ; cycle 6
      DADDUI R1, R1, #-8  ; cycle 7
      stall               ; cycle 8
      BNE    R1, R2, Loop ; cycle 9

After scheduling, each iteration takes 7 clock cycles:

Loop: L.D    F0, 0(R1)    ; cycle 1
      DADDUI R1, R1, #-8  ; cycle 2
      ADD.D  F4, F0, F2   ; cycle 3
      stall               ; cycle 4
      stall               ; cycle 5
      S.D    F4, 8(R1)    ; cycle 6 (offset adjusted, since DADDUI now executes before the store)
      BNE    R1, R2, Loop ; cycle 7

• Of the 7 cycles, actual work on the array takes only 3; the remaining 4 are loop overhead – DADDUI, BNE, and 2 stalls.
• To eliminate those 4 cycles, we need more operations relative to the number of overhead instructions.
• A simple scheme for increasing the number of instructions relative to branch and overhead instructions is LOOP UNROLLING.
• Unrolling replicates the loop body multiple times, adjusting the loop termination code; a sketch in C follows.
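For illustration, here is the running example unrolled four times at the source level (a minimal sketch, with double x[1001], s as in the example; it assumes the trip count of 1000 is a multiple of the unroll factor, otherwise a cleanup loop is needed):

/* original: one branch and one index update per element */
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

/* unrolled by 4: one branch and one index update per FOUR elements,
   and the four independent adds can be scheduled together */
for (i = 1000; i > 0; i = i - 4) {
    x[i]     = x[i]     + s;
    x[i - 1] = x[i - 1] + s;
    x[i - 2] = x[i - 2] + s;
    x[i - 3] = x[i - 3] + s;
}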
Pros and Cons
● Loop unrolling is used to improve scheduling:
  – it eliminates branches
  – it allows instructions from different iterations to be scheduled together
  – but we need more registers when a loop is unrolled
● Knowing when and how the ordering among instructions may be changed is the key to all of these techniques.
Limitations
• Diminishing returns: less overhead is amortized with each additional unroll
• Code size limitations
• Compiler limitations
• Register pressure
Summary of loop unrolling &
scheduling
● Determine that unrolling would be useful (the iterations are independent)
● Use different registers to avoid unnecessary constraints
● Eliminate the extra branch and test conditions
● Determine that the loads and stores can be interchanged
  – analyse the memory addresses to make sure they don't refer to the same location
● Schedule the code, preserving the dependences
● All of this requires understanding the dependences properly
3. Reducing branch costs with prediction
● How do we reduce branch hazards?
  – loop unrolling reduces the number of branches
  – beyond that, we reduce the performance losses of the remaining branches by predicting their behavior
● How do we predict their behavior?
  – static branch prediction
  – dynamic branch prediction
  – tournament predictors
Static branch prediction
• Predicts statically, at compile time.
• Useful when branch behavior is highly predictable at compile time.
• Also assists dynamic predictors.
• But how? By several methods (below).
• Major limitation of SBP: the misprediction rate for integer programs is higher than for FP programs.
Methods
● Predict every branch as taken
  – the average misprediction rate is then the untaken-branch frequency
  – it ranges from 9% (highly accurate) to 59% (not accurate)
  – for SPEC, the average is 34%
● Predict each branch based on profile information
  – collected from earlier runs
  – works because branch behavior is bimodally distributed:
  – an individual branch is usually biased toward taken or untaken
Dynamic branch prediction
• Branch prediction buffer / branch history table
• 2-bit prediction scheme
• Correlating branch predictors
Branch prediction buffer (BPB)

● a small memory
● indexed by the lower portion of the address of the branch instruction
● a bit in the entry indicates whether the branch was recently taken or not
● Limitations:
  – we cannot be sure the prediction is correct
  – the entry may belong to another branch with the same low-order address bits
● Remedy: the 2-bit prediction scheme
2-bit prediction scheme
● a small cache accessed with the instruction address during the IF stage
● if the instruction is decoded as a branch and the branch is predicted as taken, fetching begins from the target as soon as it is known
● accuracy 82% to 99% (misprediction 1% to 18%)
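A minimal sketch in C of the 2-bit saturating counter such a table typically stores (the 0..3 encoding is one common convention, not something these slides mandate):

/* 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken */
typedef unsigned char Counter2;        /* holds 0..3 */

int predict_taken(Counter2 c) {
    return c >= 2;                     /* weakly or strongly taken */
}

Counter2 update(Counter2 c, int taken) {
    if (taken)
        return (c < 3) ? c + 1 : 3;    /* saturate at strongly taken */
    else
        return (c > 0) ? c - 1 : 0;    /* saturate at strongly not taken */
}

The point of the two bits is that a prediction must be wrong twice before it flips, so a single anomalous outcome (such as a loop exit) does not destroy a well-established prediction.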
Correlating branch predictors (2-level predictors)
● a 2-bit predictor uses only the recent behaviour of a single branch
● we can improve accuracy by also looking at the recent behaviour of other branches
● example:

if (aa == 2) aa = 0;   /* branch b1 */
if (bb == 2) bb = 0;   /* branch b2 */
if (aa != bb) { ... }  /* branch b3 */

● the behaviour of branch b3 depends on the outcomes of b1 and b2
• Correlating predictors add information about the behavior of the most recent branches when deciding how to predict a given branch.
• (1,2) predictor:
  – uses the behaviour of the last 1 branch to choose from among a pair of 2-bit predictors
• (m,n) predictor:
  – uses the behaviour of the last m branches to choose from among 2^m predictors, each of which is an n-bit predictor
• Better accuracy than a plain 2-bit predictor, for only a trivial amount of extra hardware.

The number of bits in an (m,n) predictor is
  2^m × n × (number of prediction entries selected by the branch address)
A 2-bit predictor with no global history is a (0,2) predictor.

Example: how many bits are there in a (0,2) BP with 4K entries? How many entries are in a (2,2) BP with the same number of bits?
  – 2^0 × 2 × 4K = 8K bits
  – 2^2 × 2 × a = 8K bits, so a = 1K entries
Tournament Predictors
● Combine local and global predictors:
● multiple predictors – one based on global information and one on local information – combined with a selector
● the selector picks the right predictor for a particular branch
● misprediction rate < 0.5% for FP programs and about 14% for integer programs
Dynamic scheduling
• The hardware rearranges the instruction execution order to reduce stalls, while maintaining data flow and exception behaviour.
