Chapter 4 Notes

This document covers the architecture of a processor, detailing the single-cycle and pipelined implementations of the CPU. It explains the stages of the datapath, instruction execution, and the impact of pipelining on performance, including hazards and solutions like forwarding and branch prediction. The document also highlights the importance of instruction set architecture (ISA) design in optimizing CPU efficiency and throughput.


CDA 4502

Computer Architecture
Chapter 4
The Processor

Lecture 4.1

Figure 1: Single cycle datapath

The opcode is sent to the control unit, which determines what
actions are performed. Note that the instruction memory is
read-only memory used solely to store instructions, separate
from the registers.
Figure 2: Single-Core Computer Layout

How the processor interfaces with memory through the cache. The
processor is organized around words and bytes as storage units.

The Processor
Processor (CPU): The active part of the computer that does all
the work (data manipulation and decision making)
Datapath: Portion of the processor that contains hardware
necessary to perform operations required by the processor.
Control: Portion of the processor (also in hardware) that tells
the datapath what needs to be done.

Single-Cycle Implementation

Figure 3: Instructions which are used in a single cycle


Instruction Execution
Program Counter → Instruction memory (fetching the instruction)
Register Numbers → Register file, read registers
Depending on the instruction class, we use the ALU to calculate
the arithmetic result, the memory address for loading/storing,
or the branch comparison. Memory is accessed to complete load
and store operations. The program counter is then updated to
the new target address or incremented by 4 bytes (one word).
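The execution flow above can be sketched as a small interpreter loop (a minimal sketch; the tuple-based instruction encoding and the `regs`/`memory` dictionaries are illustrative assumptions, not real RISC-V formats):

```python
# Simplified single-cycle execution loop. Each "cycle" fetches one
# instruction, performs its class-specific work, and updates the PC.
def run(program, regs, memory):
    pc = 0
    while pc // 4 < len(program):
        op, *args = program[pc // 4]   # fetch + decode (one word = 4 bytes)
        next_pc = pc + 4               # default: increment PC by one word
        if op == "add":                # arithmetic: ALU computes the result
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "ld":               # load: ALU computes address, memory read
            rd, rs1, offset = args
            regs[rd] = memory[regs[rs1] + offset]
        elif op == "sd":               # store: ALU computes address, memory write
            rs2, rs1, offset = args
            memory[regs[rs1] + offset] = regs[rs2]
        elif op == "beq":              # branch: compare, maybe redirect the PC
            rs1, rs2, target_offset = args
            if regs[rs1] == regs[rs2]:
                next_pc = pc + target_offset
        pc = next_pc
    return regs
```

Each branch of the `if` corresponds to one instruction class from the notes: arithmetic, load/store address calculation, or branch comparison.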

Stages of the Datapath (Overview)


Problem: A single block that executes an instruction (performs
all necessary operations beginning with fetching an
instruction) would be too bulky and inefficient.

Solution: Break up the process of "executing an instruction"


into stages, and then connect the stages to create the whole
datapath. Smaller stages are easier to design and optimize
without affecting others.

Five Stages of the Datapath


Stage 1: Instruction Fetch (IF) - fetching the instruction to
be executed
Stage 2: Instruction Decode (ID)
Stage 3: Execute (EX) - ALU (Arithmetic - Logic Unit)
Stage 4: Memory Access (MEM)
Stage 5: Write Back to Register (WB)

Not all instructions use every stage. For example, a load


instruction requires every stage, while an addition instruction
will bypass stage 4 and go to stage 5. A branch instruction
will stop at stage 3 and not perform stages 4 or 5.

Overview of Implementation
1. Send the program counter (PC) to the memory that contains
the code and fetch the instruction from that memory
2. Read one or two registers based on the instruction fields
determining which registers to read. For the load word
instruction we need to read only one register, but most other
instructions require reading two.
Instruction Fetching
The CPU is always in an infinite loop, fetching instructions
from memory and executing them. The program counter (PC) holds
the address of the current instruction, and is incremented to
indicate the next instruction.

Figure 4: Instruction fetching process

CPU Overview
The value written into the PC can come from one of two adders.
The data written into the register file can come from either
the ALU or the data memory. The second input to the ALU can
come from a register or the immediate field of the instruction.
A multiplexer selects from several inputs, based on the setting
of its control lines, to send to the output. Control lines are
set based on information taken from the instruction being
executed.
Figure 5: CPU circuitry diagram

Building a Datapath
Datapath: elements that process data and addresses in the CPU,
including registers, ALUs, Multiplexers, memory, etc...

Blue indicates that something is coming from / directly related


to the control unit.

Composing the Elements


First-cut datapath performs one instruction per clock cycle.
Each datapath element can only do one function at a time,
requiring separate instruction and data memory. Use
multiplexers where alternate data sources are used for
different instructions.

ALU Control
The ALU is used for load/store (add), branch (subtract), and
R-type instructions (opcode dependent). Assume that the 2-bit
ALUOp is derived from the opcode, while combinational logic
derives the ALU control.
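The two-level control scheme can be sketched as a small lookup (a hedged sketch; the ALUOp values and the funct bit patterns below follow the common textbook convention and are assumptions, not taken from these notes):

```python
# Sketch of the combinational ALU-control logic: the 2-bit ALUOp from the
# main control unit handles load/store and branch directly; R-type
# instructions are refined by their funct7/funct3 fields.
def alu_control(alu_op, funct7, funct3):
    if alu_op == 0b00:           # load/store: always add (address calculation)
        return "add"
    if alu_op == 0b01:           # branch: subtract to compare operands
        return "sub"
    # alu_op == 0b10: R-type, decode funct7/funct3
    table = {
        (0b0000000, 0b000): "add",
        (0b0100000, 0b000): "sub",
        (0b0000000, 0b111): "and",
        (0b0000000, 0b110): "or",
    }
    return table[(funct7, funct3)]
```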
Figure 6: ALU Control fields
The ALUOp line exiting the control unit is drawn 'thicker'
because it carries a 2-bit value, while the others carry 1 bit.

Memories
To fetch instructions and read and write words, we need these
memories to be 32 bits wide. Buses are represented by dark
lines here; blue lines represent control signals. MemRead and
MemWrite should be set to 1 if the data memory is to be read or
written, and 0 otherwise.
Figure 7: Memory operations
Figure 8: Interpretation of funct7 + funct3, producing ALU
Control

R-Format Instructions
• Read two register operands
• Perform arithmetic / logical operation
• Write register result

Figure 9: R-type instruction hardware usage


Figure 10: Datapath of R-type instructions

No writing or reading from memory, no immediate generation.

Load/Store Instructions
1. Read register operands
2. Calculate address using a 12-bit offset
(Use ALU, but sign-extend offset)
3. Load: read memory and update register / Store: write
register value to memory

Figure 11: Components of the datapath


Figure 12: Load: No writing, but needs to read
Store instructions need to write data instead of reading it

Branch Instructions
1. Read register operands
2. Compare operands
(Use ALU, subtract, and check Zero output)
3. Calculate target address
• sign-extend displacement
• shift left one place (halfword displacement)
• Add to PC value
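The target-address steps above can be written out directly (a sketch; the 12-bit immediate width is an assumption based on the offset width used earlier for loads/stores, and real hardware works on the encoded bit fields rather than Python integers):

```python
# Branch target calculation as described above: sign-extend the
# immediate, shift left one place (the offset counts halfwords),
# and add the result to the PC.
def branch_target(pc, imm12):
    if imm12 & 0x800:            # bit 11 set: negative 12-bit value
        imm12 -= 0x1000          # sign-extend
    return pc + (imm12 << 1)     # shift left one place, add to PC
```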

Figure 13: BEQ Datapath


Notice that the ALU result itself is not needed; instead, the
Zero output (whether or not the branch condition was met) is
checked, and no data is written back to any registers.

Figure 14: Inside the Control unit

Figure 15: Wires joined by MUXes


Wires cannot just be 'joined' together, so these junction
points will require the use of a multiplexer
Lecture 4.2
The previous lecture focused on the non-pipelined construction
of a CPU, in which instructions execute sequentially.

Figure 16: Pipelined CPU

Pipelining Analogy
Figure 17: Sequential vs. Pipelined Laundry

Instead of waiting for an entire cycle to be finished, start up


the washer with the next round of clothes while the first set
dries.

Pipelining is an implementation technique in which multiple
instructions are overlapped in execution, much like an assembly
line. In pipelining, the steps are called stages, and each
stage operates on a different instruction. Pipelining improves
performance by increasing instruction throughput, i.e.
executing multiple instructions in parallel, while the latency
of each instruction stays the same.

However, pipelining is subject to some hazards (structural,
data, and control). The ISA design affects the complexity of
the pipelined implementation.

RISC-V Pipeline Stages


5 Stages - 1 instruction / stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode/register read
3. EX: Execute operation or calculate address
4. MEM: access memory operand
5. WB: Write result back to register
Latency vs. Throughput
Latency is the same for pipelined and non-pipelined
implementations (the time it takes for one instruction to be
processed).
Throughput is the number of tasks completed in a given time
period (greater for pipelined CPUs).

Laundromat example
If any laundry stage takes 30 min (wash, dry, fold, store) and
the laundromat has 4 of each station (washers, driers, folding
stations, storing stations) then what is the
a) latency: 2 hours (30 x 4 stages) for any given load
b) throughput: 8 loads/hour (in steady state, 2 x 4 processes
completed each hour)
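The laundromat arithmetic can be checked directly:

```python
# 4 stages of 30 minutes each, with 4 stations per stage.
stage_minutes = 30
num_stages = 4
stations_per_stage = 4

# Latency: one load must still pass through every stage sequentially.
latency_hours = stage_minutes * num_stages / 60      # 2.0 hours

# Throughput: in steady state, all 4 final-stage stations finish a
# load every 30 minutes, i.e. twice per hour.
loads_per_hour = stations_per_stage * (60 // stage_minutes)   # 8 loads/hour
```

This illustrates the point above: pipelining (and replicating stations) raises throughput while leaving the latency of any single load unchanged.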

Pipelining & ISA Design

The RISC-V ISA is designed for pipelining. Each instruction is


32-bits, making it easier to fetch & decode within one cycle.

Pipelining does not help the latency of a single task, rather,


it helps the throughput of the entire workload.
• Multiple tasks operate simultaneously using different
resources.
• Potential for speedup w/ # of stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" and "drain" the pipeline reduces speedup,
and the pipeline may need to stall for dependencies
Figure 18: Comparison between implementations
Pipeline Performance Calculation

Figure 19: 100ps for read/write, 200ps other. Assume no delays


Figure 20: Comparison between single vs. pipelined CPU

A 600ps total improvement in execution time is observed between


the single and pipelined implementations.

SCAUSE and SEPC: registers that record information about
exceptions (issues, not hazards) which occur during pipeline
execution.

A superscalar pipeline can perform multiple instructions per
pipeline stage.

Branch instructions are the most complicated instructions for


the CPU to work with.

Hazards
Situations which prevent starting the next instruction in the
next cycle
Structural Hazard - a required resource is busy
• Two or more instructions in the pipeline compete for
access to a single resource.
• Solution 1: Some instructions have to stall
• Solution 2: Add more hardware to the machine
Data Hazard (most common) - there are data dependencies between
instructions
• Solve by forwarding (bypassing): retrieve the missing data
element from internal buffers rather than waiting for it
to arrive to program-visible registers or memory
Control Hazard - Deciding on the control action depends on the
previous instruction

Structural Hazard Example

Figure 21: Structural hazard example

If a fourth load instruction were added to the above example
(ld x4, 600(x4) [or something else which makes sense in the
program context]), the first instruction would be accessing
data in the same cycle that the fourth instruction is being
fetched, resulting in a structural hazard if instructions and
data share a single memory.

Therefore, instruction fetch would have to stall for that


cycle, causing a pipeline bubble

Pipeline stall (bubble): a stall initiated in order to resolve
a hazard; no-op instructions delay pipeline execution until the
hazard is resolved.

Data Hazard Example


Figure 22: Data hazard example

The subtraction instruction requires the result of the addition
before it can continue its operation. So, the pipeline has to
stall until register x19 is written (around 900 time units),
after which it can be read by the instruction decode stage,
resulting in a 3-cycle delay in processing.

Note:
shading on right → reading;
shading on left → writing;
no shading → no memory access

Solution: Forwarding (aka Bypassing)


Use the result of a previous operation when it is computed,
don't wait for it to be stored in a register! --- requires
additional datapath connections

Figure 23: Use result before it is stored in x1


Load-Use Data Hazard
Figure 24: Data hazard example

Without the stall, the subtract instruction would be going
'backwards' in time, i.e. using a register value from before x1
was loaded with the new number.

Branch instructions are the most complicated, followed by the


load instruction, and this is related to the hazards posed by
their execution.

> The CPU knows how to avoid these hazards by checking whether
destination and source registers are shared between sequential
instructions.

Code Scheduling to Avoid Stalls


Code can be reordered to avoid data hazards.

Figure 25: Resolving some load-use hazards by moving


instructions around

Control Hazards
Branch instructions determine the flow of control. Fetching the
next instruction depends on the outcome of the branch
instruction, but the pipeline can't always fetch the correct
instruction because it would still be decoding the branch.

In the RISC-V pipeline, registers need to be compared and the


target computed early in the pipeline

Figure 26: Stalling on a branch instruction


In the above example, if the branch is NOT taken, then you only
stalled once and can continue with execution. However, if the
branch is taken, then only one instruction needs to be flushed.
A look at the pipelined implementation of the CPU shows that
the AND gate determining whether or not to branch is located in
the MEM stage.

Branch Prediction
Longer pipelines cannot readily determine the branch outcome
early, and the stall penalty becomes unacceptable. As an
alternative, predict the outcome of the branch, stalling only
if the prediction turns out to be wrong.

Two types of prediction


Static Branch Prediction
Based on typical branch behavior, i.e. loops and if statements
Predict backward branches taken, forward branches not taken

Dynamic Branch Prediction


Hardware measures actual branch behavior (i.e. the branch
history). Assume future behavior will continue the trend; when
wrong, stall, refetch, and update the history.

Figure 27: RISC-V Pipelined Datapath


The write-back stage places the result back into the register
file in the middle of the datapath.

The next value of the PC is selected by choosing between the
incremented PC and the branch address from the MEM stage.

Pipelined Registers
Registers are needed between stages to hold info produced from
the previous cycle.
Figure 28: Pipelined Registers

Pipelining improves efficiency by regularizing the instruction
format for simplicity and by dividing instruction execution
into a fixed number of steps, each implemented as a segment of
the pipeline.

** In designing the pipeline it is important that each segment


takes just about the same amount of time to execute to
• maximize utilization and throughput
• minimize set-up time

The IF/ID register must be 64 bits wide to hold the 32-bit
instruction from memory AND the 32-bit PC address.

Load & Store


1. IF
Instruction is read from memory using the address from PC, then
placed into the IF/ID pipeline register. PC is incremented by 4
and written back. This is also saved in the IF/ID reg, as it
could be needed later for a branch condition. (The computer
doesn't know what it's going to need, so it occurs regardless
of what instruction is fetched. It will only know when it is
being decoded)
2. ID
The IF/ID reg supplies the immediate, which is sign-extended to
32 bits, as well as the register numbers to be read from. These
3 values are stored in ID/EX alongside incremented PC address.
a) Load stages
3. EX
The ALU adds the base register value and the sign-extended
immediate; the resulting address is placed in the EX/MEM
pipeline register
4. MEM
Memory is read using the offset address calculated in the
previous step, with the resultant data read into MEM/WB reg
5. WB
The accessed information is then written back to the
destination register
Corrected Datapath for Load

Figure 29: Corrected Datapath for Load


Here, the write register number comes from the MEM/WB pipeline
reg alongside the data. The register number gets passed from
the ID pipe stage until the MEM/WB pipeline register,
ultimately adding 5 more bits to the last three registers.

b) Store stages
3. EX
The effective address (reg + offset) is placed in EX/MEM reg
4. MEM
The register containing data to be stored was already read in
an earlier stage (ID/EX) so it needs to be passed over to
EX/MEM. Then it can be stored to memory as appropriate.
5. WB
Nothing happens, because the store instruction does not have an
RD. All steps are complete and the stage passes without action.

Figure 30: Pipeline diagram w/ control lines

Efficiency of 5-stage Pipeline


1. Allowing jumps, branches, and ALU instructions to take fewer
stages than the five required by the load instruction would
increase pipeline performance

2. Throughput is determined by the clock cycle - the number of


stages affects latency.

3. ALU instructions cannot be made to take fewer cycles due to
the write-back requirement. However, branches and jumps can be
reduced since they don't require an RD.

4. Instead of trying to reduce the # of cycles required per


instruction, attempting to make the pipeline longer (resulting
in shorter cycles) could improve performance.

Pipelined Control (Simplified)


This datapath borrows the control logic for PC source, register
destination number, and ALU control.
• Note that we now need the function code bits (funct7/funct3)
of the instruction from the EX stage as input to ALU control
• ID/EX can supply these from the immediate field since sign
extension will not alter these bits

Figure 31: Simplified Pipeline Control

Control signals are derived from each instruction, the same as
in a single-cycle implementation. These signals are passed
through the stages, as shown in the following:
Figure 32: Pipelined Control

Control of Pipeline by Stages


1. IF
Control signals to read instruction memory and write to the PC
are always asserted.
2. ID
As in previous stage, the same thing happens at each clock
cycle, so no optional control lines need to be set.
3. EX
RegDst (result register), ALUOp (ALU operation), and ALUSrc
(read data or sign extended immediate) need to be set.
4. MEM
The Branch (equal), MemRead (load), and MemWrite (store)
signals are set by these instructions. PCSrc selects the next
sequential address unless control asserts Branch and the ALU
result was 0.
5. WB
MemtoReg - decides between sending the ALU result or the memory
value back to the register
RegWrite - writes the chosen value

Lecture 4.3
A signal is asserted when its logical state is set to true. A
signal is deasserted when it is set to false or unknown.
• Some signals are 'true' on low voltage, some on high
When a 1-bit control to a 2-way MUX is asserted, the MUX
selects the input corresponding with 1. Otherwise, the 0 input
is selected.

PCSrc is controlled by an AND gate. If the Branch signal and
the ALU Zero signal are both set, then PCSrc is 1; otherwise,
it is 0. Control sets the Branch signal only during a BEQ
instruction.
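The AND-gate selection is small enough to state directly (a sketch with illustrative signal names):

```python
# PCSrc = Branch AND Zero: take the branch target only when the
# instruction is a branch and the ALU comparison (subtract) yielded 0.
def next_pc(pc, branch_target, branch, zero):
    pcsrc = 1 if (branch and zero) else 0
    return branch_target if pcsrc else pc + 4
```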

Figure 33: Multi-Cycle Pipeline Diagram


Notice the wire allowing for the bypass of memory write, for
example, when performing actions that do not require the
storage of memory

Data Hazards in ALU Instructions

Considering the following sequence of instructions:


Figure 34: ALU instruction dependencies example
Forwarding can be used to resolve dependency conflicts. Data is
brought "forward" to another stage of the pipeline, avoiding a
stall and maintaining throughput.

Figure 35: Red lines indicate the need for forwarding in order
to maintain this operation; otherwise, a stall is necessary to
resolve the dependency

Register numbers are passed along the pipeline. If a preceding
instruction will write to a register, its destination register
is not x0, and that destination matches a source register of
the current instruction, then forwarding is necessary.

Double Data Hazards

Figure 36: Double Data Hazard


The forwarding condition must be revised to address this
Figure 37: Forwarding control

Figure 38: All forwarding conditions


Figure 39: Datapath with forwarding
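The forwarding conditions can be summarized in a short sketch (the signal and field names follow the usual textbook convention and are assumptions here). Note that the EX/MEM result takes priority over MEM/WB, which is exactly the revision needed for the double data hazard:

```python
# Forwarding-unit selection for one ALU source operand rs.
# ex_mem and mem_wb are dicts modeling the pipeline registers, each with
# a 'reg_write' control bit and a destination register number 'rd'.
def forward_select(rs, ex_mem, mem_wb):
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == rs:
        return "EX/MEM"      # most recent result wins (double data hazard)
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == rs:
        return "MEM/WB"
    return "REG_FILE"        # no hazard: read the register file normally
```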
Load-Use Hazard Detection
A check is made when the using instruction is decoded in the ID
stage. The ALU operand register numbers are available here, and
a load-use hazard occurs if the rd of a load in EX matches
either rs1 or rs2 of the instruction being decoded. If the
condition holds, the instruction stalls for one cycle.
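The load-use condition just described is a one-line check (field names are illustrative):

```python
# Hazard-detection-unit check: stall one cycle when the instruction in
# EX is a load whose destination register matches a source register of
# the instruction currently being decoded in ID.
def load_use_stall(id_ex_mem_read, id_ex_rd, if_id_rs1, if_id_rs2):
    return id_ex_mem_read and id_ex_rd in (if_id_rs1, if_id_rs2)
```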

Figure 40: Load-Use Data Hazard Example

Here, the and instruction becomes a nop (no operation)


Figure 41: Pipeline w/ Hazard Detection Unit

Branch Hazards (or) Control Hazards


Because a branch instruction decides whether to branch in the
MEM stage, which is clock cycle 4, the three sequential
instructions that follow will be fetched and begin execution.
However, if the branch is taken, then these will be incorrect.
Thus, predictive measures are used:

Branch Prediction: A method of resolving a branch hazard that


assumes a given outcome for the branch and then proceeds from
that assumption, rather than stalling for the actual outcome.
• Static Prediction: Based on typical behavior, i.e. loop
and if-statement branches, predicting backward branches
taken and forward branches not taken
• Dynamic Prediction: Hardware measures actual branch
behavior, i.e. a recorded history of each branch, assuming
future behavior continues the trend. When it is incorrect,
a stall is used, refetching occurs, and history is updated.
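A common hardware realization of dynamic prediction is a 2-bit saturating counter per branch; a minimal sketch follows (the counter scheme is a standard textbook example, not something specified in these notes; real predictors index a table of such counters by branch address):

```python
# 2-bit saturating-counter branch predictor. States 0-1 predict not
# taken, states 2-3 predict taken; it takes two consecutive mispredicts
# to flip a strongly-held prediction (hysteresis).
class TwoBitPredictor:
    def __init__(self):
        self.state = 0                       # start: strongly not taken

    def predict(self):
        return self.state >= 2               # True means "predict taken"

    def update(self, taken):
        # Move toward taken (3) or not taken (0), saturating at the ends.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

On a mispredict, the pipeline stalls and refetches, and the counter update records the actual outcome for next time, matching the stall/refetch/update-history sequence above.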
Longer pipelines can't really determine branch outcomes early,
resulting in a long stall. Predicting the outcome allows the
pipeline to only stall where it is wrong. In RISC-V, the
pipeline can predict branches not taken, and then fetch the
instruction after the branch without delay.
If the branch is taken then the instructions being fetched,
decoded, and executed (3 instructions) must be discarded, so
execution will continue at branch target. Discarding
instructions is done by changing control values to 0, similar
to how load-use data hazards are resolved. Discarding requires
the flushing of the pipeline out before execution can continue.

To reduce branch delay, a target address adder and register


comparator can be added to the hardware to help determine the
branch outcome.

Figure 42: Example of branch being taken
