0% found this document useful (0 votes)
8 views34 pages

Chapter4 2

The document discusses the architecture and implementation of a pipelined processor design, detailing the roles of various units such as the ALU, data memory, and registers. It covers the execution of different instruction types (R-type, load/store, J-type) and the impact of pipelining on performance, including potential hazards like structural, data, and control hazards. Additionally, it explores the benefits of bypassing to reduce stalls and improve throughput in a multi-stage pipeline system.

Uploaded by

Namit Jain
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views34 pages

Chapter4 2

The document discusses the architecture and implementation of a pipelined processor design, detailing the roles of various units such as the ALU, data memory, and registers. It covers the execution of different instruction types (R-type, load/store, J-type) and the impact of pipelining on performance, including potential hazards like structural, data, and control hazards. Additionally, it explores the benefits of bypassing to reduce stalls and improve throughput in a multi-stage pipeline system.

Uploaded by

Namit Jain
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 34

View from 30,000 Feet

Note: we haven’t bothered


showing multiplexors

• What is the role of the Add units? Source: H&P textbook

• Explain the inputs to the data memory unit


• Explain the inputs to the ALU
• Explain the inputs to the register unit 4
Clocking Methodology

Source: H&P textbook


• Which of the above units need a clock?
• What is being saved (latched) on the rising edge of the clock?
Keep in mind that the latched value remains there for an entire cycle
5
Implementing R-type Instructions

• Instructions of the form add $t1, $t2, $t3


• Explain the role of each signal

Source: H&P textbook

6
Implementing Loads/Stores

• Instructions of the form lw $t1, 8($t2) and sw $t1, 8($t2)

Where does this input come from?

7
Source: H&P textbook
Implementing J-type Instructions

• Instructions of the form beq $t1, $t2, offset

Source: H&P textbook 8


View from 10,000 Feet

9
Source: H&P textbook
View from 5,000 Feet

10
Source: H&P textbook
Latches and Clocks in a Single-Cycle Design

Instr Reg Data


PC ALU Addr
Mem File Memory

• The entire instruction executes in a single cycle


• Green blocks are latches
• At the rising edge, a new PC is recorded
• At the rising edge, the result of the previous cycle is recorded
• At the falling edge, the address of LW/SW is recorded so
we can access the data memory in the 2nd half of the cycle 11
Multi-Stage Circuit

Instead of executing the entire instruction in a single


cycle (a single stage), let’s break up the execution into
multiple stages, each separated by a latch

Instr Reg Data


PC L2 L3 ALU L4 L5
Mem File Memory

Reg
File
12
The Assembly Line
Unpipelined Start and finish a job before moving to the next

Jobs

Time

A B C
A B C Break the job into smaller stages
A B C
A B C
Pipelined

13
Performance Improvements?

• Does it take longer to finish each individual job?

• Does it take shorter to finish a series of jobs?

• What assumptions were made while answering these


questions?

• Is a 10-stage pipeline better than a 5-stage pipeline?

4
A 5-Stage Pipeline

register write in the first half of the clock cycle (dotted part)
register read in the second half of the clock cycle (solid part)

so that in CC5, I4 can use the register released by I1 (otherwise directly I5


will be able to use it)

5
Source: H&P textbook
A 5-Stage Pipeline

Use the PC to access the I-cache and increment PC by 4

all instructions go through all stages

for eg, add instruction does not require DM, but still it will
take 5 clock cycles (it will wait for that particular clock cycle)

6
DM - data memory
A 5-Stage Pipeline improved throughput, thorughput becomes 5 times

Read registers, compare registers, compute branch target; for now, assume
branches take 2 cyc (there is enough work that branches can easily take more)

branches dont work well with pipelines

7
A 5-Stage Pipeline

ALU computation, effective address computation for load/store

8
A 5-Stage Pipeline

Memory access to/from data cache, stores finish in 4 cycles

9
A 5-Stage Pipeline

Write result of ALU computation or load into register file

because of the solid and dotted lines, we are able to use


the writing of I1 for the reading of I4. otherwise we would have to
wait for I5 for reading what I1 wrote.

10
Pipeline Summary
note: no skipping of stages. so that there is no
overtaking (faster instruction overtaking the
slower one)
RR ALU DM RW
still 5 cycles taken (even if DM is empty). IM is not shown, only the latter 4 cycles are shown
ADD R1, R2,  R3 Rd R1,R2 R1+R2 -- Wr R3

BEQ R1, R2, 100 Rd R1, R2 -- -- --


Compare, Set PC

LD 8[R3]  R6 Rd R3 R3+8 Get data Wr R6


here R3 is the address

ST 8[R3]  R6 Rd R3,R6 R3+8 Wr data --

11
Performance Improvements?

• Does it take longer to finish each individual job?


yes, possibbly due to additional latch delays

• Does it take shorter to finish a series of jobs?

• What assumptions were made while answering these


questions?
– No dependences between instructions
– Easy to partition circuits into uniform pipeline stages
– No latch overhead

• Is a 10-stage pipeline better than a 5-stage pipeline?

12
Quantitative Effects

• As a result of pipelining:
 Time in ns per instruction goes up
 Each instruction takes more cycles to execute
 But… average CPI remains roughly the same
 Clock speed goes up becomes 5 times for 5 stage pipeline
 Total execution time goes down, resulting in lower
average time per instruction
 Under ideal conditions, speedup
= ratio of elapsed times between successive instruction
completions
= number of pipeline stages = increase in clock speed

13
Conflicts/Problems

• I-cache and D-cache are accessed in the same cycle – it


helps to implement them separately since, IM and DM might happen in the same clock
cycle, we must build separate hardware for these ops

• Registers are read and written in the same cycle – easy to


deal with if register read/write time equals cycle time/2

• Branch target changes only at the end of the second stage


-- what do you do in the meantime?

14
Hazards

• Structural hazards: different instructions in different stages


(or the same stage) conflicting for the same resource
for eg reading and writing in the same register in the same clock cycle solution: half cycles mein break kardo
eg(2) im and dm operations in the same clock cycle solution: keep separate IM and DM

• Data hazards: an instruction cannot continue because it


needs a value that has not yet been generated by an
earlier instruction dependencies, an instruction might need the output of some instruction that has still
not completed its 5 stages

• Control hazard: fetch cannot continue because it does


not know the outcome of an earlier branch – special case
of a data hazard – separate category because they are
treated in different ways

15
Structural Hazards

• Example: a unified instruction and data cache 


stage 4 (MEM) and stage 1 (IF) can never coincide

• The later instruction and all its successors are delayed


until a cycle is found when the resource is free  these
are pipeline bubbles

• Structural hazards are easy to eliminate – increase the


number of resources (for example, implement a separate
instruction and data cache, add more register ports)

5
Data Hazards

• An instruction produces a value in a given pipeline stage

• A subsequent instruction consumes that value in a pipeline


stage

• The consumer may have to be delayed so that the time


of consumption is later than the time of production

6
Example 1 – No Bypassing i1 and i2 have data hazard

• Show the instruction occupying each stage in each cycle (no bypassing)
if I1 is R1+R2R3 and I2 is R3+R4R5 and I3 is R7+R8R9
CYC-1 CYC-2 CYC-3 CYC-4 CYC-5 CYC-6 CYC-7 CYC-8

IF IF IF IF IF IF IF IF

D/R D/R D/R D/R D/R D/R D/R D/R

ALU ALU ALU ALU ALU ALU ALU ALU

DM DM DM DM DM DM DM DM

RW RW RW RW RW RW RW RW 7
Example 1 – No Bypassing
• Show the instruction occupying each stage in each cycle (no bypassing)
if I1 is R1+R2R3 and I2 is R3+R4R5 and I3 is R7+R8R9
CYC-1 CYC-2 CYC-3 CYC-4 CYC-5 CYC-6 CYC-7 CYC-8

IF IF IF IF IF IF IF IF
L2
I1 I2 I3 I3 I3
waiting for I2 to proceed
I4 I5

L3
D/R D/R D/R D/R D/R D/R D/R D/R
I1 I2 I2 I2
concluded finally
I3 I4
in the second half
of the clock cycle
L4 ALU ALU ALU ALU ALU ALU ALU ALU
I1 I2 I3
L5
DM DM DM DM DM DM DM DM
I1 this is a bubble
I2 I3
RW RW RW RW RW RW RW RW 8
I1 I2
Example 2 – Bypassing
• Show the instruction occupying each stage in each cycle (with bypassing)
if I1 is R1+R2R3 and I2 is R3+R4R5 and I3 is R3+R8R9.
Identify the input latch for each input operand.
CYC-1 CYC-2 CYC-3 CYC-4 CYC-5 CYC-6 CYC-7 CYC-8

IF IF IF IF IF IF IF IF

D/R D/R D/R D/R D/R D/R D/R D/R

ALU ALU ALU ALU ALU ALU ALU ALU

DM DM DM DM DM DM DM DM

RW RW RW RW RW RW RW RW 9
Example 2 – Bypassing Li Lj means that Li has been overwritten by Lj
L5 L3 because by the end of cyc4, L4 has been updated by the ALU op

• Show the instruction occupying each stage in each cycle (with bypassing)
if I1 is R1+R2R3 and I2 is R3+R4R5 and I3 is R3+R8R9.
Identify the input latch for each input operand.
observe that the result has been stored in L3 for I1 in cyc3 itself, and it is directly usable now for I2 (dont have to wait for all 5 cycles)
CYC-1 CYC-2 CYC-3 CYC-4 CYC-5 CYC-6 CYC-7 CYC-8

IF IF IF IF IF IF IF IF
I1 I2 I3 I4 I5
L2

D/R D/R D/R D/R D/R D/R D/R D/R


I1 I2 I3 I4
L3 L3 L3 L4 L3 L5 L3
ALU ALU ALU ALU ALU ALU ALU ALU
I1 I2 I3
L4

DM DM DM DM DM DM DM DM
I1 I2 I3
L5

RW RW RW RW RW RW RW RW
I1 I2 I3
Problem 1

IF D/R ALUL3 DM RW
i1 i1 i1 i1 i1

IF D/R ALU DM RW
L4
i2 i2 i2 i2
i2

IF D/R ALU DM RW
add $1, $2, $3

IF D/R ALU DM RW
lw $4, 8($1)

11
Problem 2
L2 L3 L4 L5

IF D/R ALU DM RW
i1 i1
i1 i1 i1

IF D/R ALU DM RW
i2 i2

lw $1, 8($2) IF D/R ALU DM RW


i2 i2
DM is in cyc4, so there is one cycle delay i2 i2
(still faster than non-bypass)

lw $4, 8($1) IF D/R ALU DM RW

12
Problem 3 1) read from L5
2) writing will happen in the first half and hence DM can access the written part in the second half

L5

IF D/R ALU DM RW
i1 i1 i1 i1
i1

IF D/R ALU DM RW
i2 i2 i2
i2 i2

IF D/R ALU DM RW
lw $1, 8($2)

IF D/R ALU DM RW
sw $1, 8($3)

13
Problem 4

A 7 or 9 stage pipeline, RR and RW take an entire stage

IF IF Dec Dec RR ALU RW

ALU DM DM RW

lw $1, 8($2)

add $4, $1, $3 9


Problem 4

A 7 or 9 stage pipeline, RR and RW take an entire stage

instruction
fetch decode
IF IF Dec Dec RR ALU RW

ALU DM DM RW

lw $1, 8($2)

add $4, $1, $3 10


Problem 4

Without bypassing: 4 stalls


IF:IF:DE:DE:RR:AL:DM:DM:RW
IF: IF :DE:DE:DE:DE: DE :DE:RR:AL:RW

With bypassing: 2 stalls


IF:IF:DE:DE:RR:AL:DM:DM:RW
IF: IF :DE:DE:DE:DE: RR :AL:RW
lw $1, 8($2)
IF IF Dec Dec RR ALU RW
add $4, $1, $3

ALU DM DM RW
11

You might also like