Unit 1
21EC74H6
Dr. P.N.JAYANTHI
Asst.Prof, Dept. of ECE,
RVCE.
1
SYLLABUS
2
References
3
Differences between Computer
Architecture & Computer Organization
Computer Architecture is a functional description of
requirements and design implementation for the various parts
of a computer.
It deals with the functional behavior of computer systems, and it is decided before the computer organization when designing a computer.
Computer architecture refers to the design of the internal
workings of a computer system, including the CPU, memory,
and other hardware components.
It involves decisions about the organization of the hardware,
such as the instruction set architecture, the data path design,
and the control unit design. Computer architecture is concerned
with optimizing the performance of a computer system and
ensuring that it can execute instructions quickly and efficiently.
Architecture describes what the computer does.
4
Advantages of Computer
Architecture
Performance Optimization: Sound architectural design can substantially improve the efficiency of a system.
Flexibility: A good architecture can adapt to and incorporate new technologies as well as accommodate different hardware components.
Scalability: Designs should make provision for future expansion and growth.
Disadvantages of Computer Architecture:
Complexity: Design and optimization can be a large, challenging, and time-consuming task.
Cost: High-performance architectures often need expensive equipment and parts, which makes them more costly.
5
Computer Organization
Computer Organization comes after the Computer Architecture has been decided.
Computer Organization is how operational attributes are linked together and contribute to realizing the architectural specification.
Computer Organization deals with structural relationships.
The organization describes how the computer does it.
6
Advantages of Computer Organization
Practical Implementation: It gives a concrete account of the physical layout of the computer system.
Cost Efficiency: Good organization helps avoid wastage of resources, resulting in reduced costs.
Reliability: Organization helps guarantee that the same work reliably produces the same favorable results.
Disadvantages of Computer Organization
Hardware Limitations: The physical components that are available constrain the systems that can be implemented, limiting performance.
Less Flexibility: The organization is much more fixed, and is harder to change once it has been set.
7
Computer System
8
COMPUTER ARCHITECTURE | COMPUTER ORGANIZATION
Architecture describes what the computer does. | The Organization describes how it does it.
Computer Architecture deals with the functional behavior of computer systems. | Computer Organization deals with a structural relationship.
In the above figure, it's clear that it deals with high-level design issues. | In the above figure, it's also clear that it deals with low-level design issues.
As a programmer, you can view architecture as a series of instructions, addressing modes, and registers. | The implementation of the architecture is called organization.
Computer Architecture is also called Instruction Set Architecture (ISA). | Computer Organization is frequently called microarchitecture.
It makes the computer's hardware visible. | It offers details on how well the computer performs.
Architecture coordinates the hardware and software of the system. | Computer Organization handles the segments of the network in a system.
9
RISC vs CISC
10
Load-Store architecture
The Load-Store architecture is a type of CPU design that differentiates it from other architectures such as CISC.
It is commonly associated with RISC architectures, where instructions are streamlined and optimized for efficiency.
Load instruction brings data from
memory into a CPU register.
Store instruction transfers data from a
CPU register to memory.
11
Load-Store architecture
Separation of Memory and ALU Operations:
In Load-Store architectures, the CPU separates
memory access instructions from computation
(arithmetic/logic) instructions.
Only load and store instructions can access
memory, and all other instructions operate only on
data within the CPU registers
Efficiency through Reduced Instructions:
• By limiting memory access to load and store
instructions, this architecture reduces the
complexity of the CPU’s instruction set,
streamlining the pipeline stages, which can
speed up instruction processing.
12
Load-Store architecture
Improved Pipeline and Parallelism:
Load-Store architectures facilitate instruction
pipelining by separating data movement (load/store)
from computation, allowing multiple instructions to be
processed concurrently.
Since each instruction generally takes the same amount
of time, this leads to fewer pipeline stalls, increasing the
CPU's efficiency
Registers-Based Computation:
All computation (arithmetic/logic) operations are carried
out on data in registers, which are much faster to access
than memory. This design encourages using more
registers and managing data within the CPU, minimizing
costly memory accesses.
13
LDR & STR examples
Generally, LDR is used to load something
from memory into a register, and STR is
used to store something from a register to
a memory address.
LDR R2, [R0] @ [R0] — the source address is the value found in R0.
STR R2, [R1] @ [R1] — the destination address is the value found in R1.
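As an illustrative sketch (not from the slides), the effect of these two instructions can be mimicked in Python, modelling the register file and memory as dicts; the addresses and values are made up:

```python
# Illustrative sketch (made-up addresses/values): LDR/STR semantics
# modelled with Python dicts for the register file and memory.
regs = {"R0": 0x100, "R1": 0x104, "R2": 0}
mem = {0x100: 42, 0x104: 0}

def ldr(rd, rn):
    """LDR rd, [rn] — load the word at the address held in rn into rd."""
    regs[rd] = mem[regs[rn]]

def strr(rd, rn):
    """STR rd, [rn] — store the word in rd at the address held in rn."""
    mem[regs[rn]] = regs[rd]

ldr("R2", "R0")    # R2 <- mem[0x100]
strr("R2", "R1")   # mem[0x104] <- R2
print(regs["R2"], mem[0x104])   # 42 42
```

Note how memory is touched only through the two dedicated functions, mirroring the load-store restriction described above.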
15
Architecture (Instruction Set Architecture (ISA))
17
Machine models
Machine models in computing refer to
abstract frameworks or theoretical models
used to understand and design computing
systems, helping to simulate, analyse, and
implement algorithms. These models play a
crucial role in both theoretical computer
science and practical computing, as they
influence how processors, memory, and
computation workflows are organized.
18
1. Von Neumann Model
This is the classic model for most modern computers,
named after John von Neumann. It consists of a
single memory space for storing both
instructions and data, and a central processing unit
(CPU) that sequentially processes instructions.
Components:
Memory: Holds both instructions and data.
Control Unit: Interprets instructions from memory.
ALU: Executes operations.
Input/Output: Interfaces for user interaction and data
transfer.
Characteristics: Sequential instruction processing, a
shared memory space for instructions and data (leading
to the "von Neumann bottleneck"), which can slow
down performance due to limited data throughput.
19
2. Harvard Architecture
This model separates the memory for instructions
and data, allowing the CPU to access both
simultaneously.
Components:
Separate Memory Banks: Separate instruction
and data memory.
Control Unit and ALU: As in the von Neumann
model.
Characteristics: The separation allows
parallelism in fetching instructions and data,
reducing bottlenecks. It’s common in
embedded systems and certain digital signal
processors.
20
3. Parallel Models
These models consider multiple processors or cores to
achieve concurrent processing.
SIMD (Single Instruction, Multiple Data): A single
control unit issues one instruction that operates on
multiple data points simultaneously. Ideal for tasks like
graphics processing and scientific computing.
MIMD (Multiple Instruction, Multiple Data): Each
processor executes its own set of instructions
independently. Common in multi-core CPUs and distributed
computing systems.
SISD (Single Instruction, Single Data): The traditional,
sequential model used in single-core processors.
MISD (Multiple Instruction, Single Data): Rarely used,
where multiple instructions operate on a single data
stream.
21
Machine (Data Path) Models
● Data in registers, memory, or stack can form operands for ALU, which
decides the type of machine model to follow.
Stack machine model:
• The stack machine performs operations by pushing operands (data
or variables) onto a stack and then applying
operations on the values at the top of the stack.
• Simple model with no explicit operands specified in ALU instructions, unless it is a multilevel stack (more than two locations).
Ex: the 8087 floating-point coprocessor, used in conjunction with the 8086 processor.
Ex: the Java virtual machine and Python interpreters work on the stack machine model.
*TOS: Top of the stack
22
Machine (Data Path) Models

Accumulator
• Accumulator machines were among the earliest computer systems, like the EDSAC and the IBM 701.
• Instructions are simple, with a single named operand; the accumulator is the implicit second operand and destination.

Register-Memory
• This model represents the architecture of many real-world systems like x86.
• Instructions have 2 or 3 named operands and can operate directly with memory.

Register-Register
• This model represents the load-store architecture of modern systems like ARM.
• Instructions have 2 or 3 named operands, all in registers.
23
Machine (Data Path) Models: Typical Program Sequence

Let us take an operation: C = A + B

Stack     Accumulator   Register-Memory   Register-Register
PUSH A    LOAD A        LOAD R1, A        LOAD R1, A
PUSH B    ADD B         ADD R3, R1, B     LOAD R2, B
ADD       STORE C       STORE R3, C       ADD R3, R1, R2
POP C                                     STORE R3, C
24
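The stack column can be sketched as a tiny Python stack machine; the variables A, B, C and their values are illustrative, not from the slides:

```python
# Illustrative sketch: the stack-machine sequence PUSH A; PUSH B; ADD;
# POP C for computing C = A + B (values of A and B are made up).
memory = {"A": 3, "B": 4, "C": 0}
stack = []

def push(var):             # PUSH: load a variable and push it on the stack
    stack.append(memory[var])

def add():                 # ADD: replace the two top-of-stack values by their sum
    stack.append(stack.pop() + stack.pop())

def pop(var):              # POP: pop the top of the stack into a variable
    memory[var] = stack.pop()

push("A"); push("B"); add(); pop("C")
print(memory["C"])   # 7
```

The ADD needs no explicit operands — exactly the property the stack model is known for.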
Instruction Set Architecture (ISA)
ISA is the interface between software and hardware, defining the supported
instructions, data types, registers, memory addressing modes, and I/O
mechanisms.
Each ISA has a unique set of characteristics that influences how effectively it
can handle different types of computation and how it balances performance,
energy efficiency, and simplicity.
Key Characteristics of an ISA
Instruction Set:
Types of Instructions: Defines arithmetic, logical, data transfer, control, and
floating-point instructions.
Instruction Length: Fixed-length or variable-length instructions. Fixed-length
instructions (like in RISC) simplify decoding, while variable-length instructions
(like in CISC) provide flexibility.
Format: Specifies how operands and operation codes are structured in
instructions.
Registers:
General-Purpose Registers (GPRs): Registers that can be used for various
purposes, providing faster access than memory.
Special-Purpose Registers: These may include program counters, status
registers, and others, optimized for specific operations.
Register Count: Impacts performance and complexity; more registers
generally mean more data can be processed without needing memory access.
25
Instruction Set Architecture (ISA)
Memory Addressing Modes:
Specifies how instructions reference memory. Common modes include
immediate, direct, indirect, register, and displacement addressing.
Complex addressing modes, as in CISC, allow versatile data
manipulation, while RISC designs like RISC-V often use simpler modes
to speed up instruction decoding and execution.
Data Types:
Defines supported data types, such as integer, floating-point, and
sometimes more specialized data types like packed or SIMD data types.
The types supported impact the ISA's utility for various applications,
e.g., floating-point support for scientific computation.
Instruction Decoding:
Complexity of Decoding: Affects CPU design; simpler decoding is a
characteristic of RISC ISAs, which can speed up the pipeline.
Control Flow Instructions: Defines branches, loops, and jumps.
Branch prediction optimizations depend on control instructions'
predictability.
Power Efficiency:
Simpler ISAs (RISC) tend to consume less power due to fewer
instructions, fixed-length encoding, and streamlined decoding.
26
Types of ISA
27
RISC-V
• RISC-V (pronounced "risk-five") is an ISA standard
– An open-source implementation of a reduced instruction set computing (RISC) based instruction set architecture (ISA)
– There were RISC-I, II, III, and IV before
• Most ISAs: X86, ARM, Power, MIPS, SPARC
– Commercially protected by patents
– Preventing practical efforts to reproduce the computer systems.
• RISC-V is open
– Permitting any person or group to construct compatible computers
– Use associated software
• Originated in 2010 by researchers at UC Berkeley
– Krste Asanović, David Patterson and students
• In 2017, version 2 of the user-space ISA was frozen
– User-Level ISA Specification v2.2
– Draft Compressed ISA Specification v1.79 (https://riscv.org/)
– Draft Privileged ISA Specification v1.10 (https://en.wikipedia.org/wiki/RISC-V)
28
29
Key Characteristics of RISC-V
Modular Design:
RISC-V follows a modular approach, allowing implementers to add
or omit specific features as needed.
The base ISA (RV32I for 32-bit, RV64I for 64-bit, RV128I for 128-bit) provides basic integer instructions.
Additional modules (extensions) support features like floating-point
arithmetic (F, D), atomic operations (A), and vector processing (V).
Simplified Instruction Set:
RISC-V follows RISC principles with a small, fixed set of instructions
that are easy to decode and execute.
Fixed-Length Instructions: RISC-V uses 32-bit fixed-length
instructions in the base ISA, simplifying instruction fetch and
decode stages.
32 General-Purpose Registers:
RISC-V uses 32 GPRs for each base ISA (RV32, RV64, and RV128).
These registers allow efficient data handling without frequent
memory access, which is especially beneficial for high-performance
computing.
30
Key Characteristics of RISC-V
31
Key Characteristics of RISC-V
Scalability and Interoperability:
RISC-V supports 32-bit, 64-bit, and 128-bit address spaces, providing
scalability across devices.
Modular extensions ensure compatibility, enabling devices of different
capabilities to run the same software while tailoring hardware to specific
application needs.
Simplified Load/Store Architecture:
RISC-V follows a load/store architecture where only load and store
instructions access memory, while arithmetic instructions operate solely
on registers. This approach simplifies pipelining and optimizes
performance.
Efficient and Power-Aware:
RISC-V’s minimalist design, fixed-length instructions, and load/store
architecture contribute to high power efficiency, making it suitable for
embedded and low-power applications.
Open-Source and Customizable:
Unlike proprietary ISAs like ARM or x86, RISC-V is an open standard,
fostering innovation and custom hardware designs without licensing
fees. This openness has led to rapid adoption in academia and industry,
as it encourages experimentation and customization.
32
RISC-V
RISC-V’s design aligns with traditional RISC
principles, prioritizing simplicity,
modularity, and flexibility.
Its open-source nature, coupled with an
extensible design, makes it versatile for a
wide range of applications, from small
embedded devices to large-scale data
centers.
This flexibility allows for targeted
optimizations, balancing performance,
power consumption, and cost for various
computing environments.
33
Applications
The application options are endless for the RISC-V ISA:
• Wearables, Industrial, IoT, and Home Appliances. RISC-V processors are ideal for
meeting the power requirements of space-constrained and battery-operated designs.
•Smartphones. RISC-V cores can be customized to handle the performance needed to
power smartphones, or can be used as part of a larger SoC to handle specific tasks for
phone operation.
•Automotive, High-Performance Computing (HPC), and Data Centres. RISC-V
cores can handle complex computational tasks with customized ISAs, while RISC-V
extensions enable development of simple, secure, and flexible cores for greater energy
efficiency.
• Aerospace and Government. RISC-V offers high reliability and security for these applications.
34
35
36
37
Pipelining
40
Simple RISC Instruction Encoding
41
Simple RISC Processor Design
● The approach to designing the processor is to divide the processing into
stages.
42
Simple RISC Processor
Design
MA (Memory Access) Stage
• Interfaces with the memory system
• Executes a load or a store
RW (Register Write) Stage
• Writes to the register file
• In the case of a call instruction, it writes the
return address to register, ra
43
SimpleRISC Processor
Datapath
The EX Stage
Contains an Arithmetic-Logic Unit (ALU). This unit can perform all arithmetic operations (add, sub, mul, div, cmp, mod) and logical operations (and, or, not).
Contains the branch unit for computing the branch condition (beq, bgt).
Contains the flags register (updated by the cmp instruction).
44
SimpleRISC Processor Pipelined
Datapath
IF Stage
46
Simple RISC Processor Pipelined Datapath
47
Unpipelined Datapath for MIPS
[Figure: unpipelined MIPS datapath — PC, instruction memory, register file (GPRs), immediate extension, ALU, and data memory. Control signals: PCSrc (selecting among br, rind, jabs, and pc+4), RegWrite, MemWrite, WBSrc.]
49
Pipelined Datapath
[Figure: the datapath divided into five phases — fetch, decode & register-fetch, execute, memory, and write-back.]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles:
tC > max {tIM, tRF, tALU, tDM, tRW} (= tDM, probably)
50
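As a side sketch (with made-up stage delays, in nanoseconds), the bound tC > max{tIM, tRF, tALU, tDM, tRW} simply picks the slowest stage:

```python
# Side sketch (illustrative stage delays, in ns): the clock period must
# exceed the slowest stage, t_C > max{t_IM, t_RF, t_ALU, t_DM, t_RW}.
stage_delay_ns = {"IM": 0.8, "RF": 0.5, "ALU": 0.9, "DM": 1.0, "RW": 0.4}

t_clock = max(stage_delay_ns.values())   # slowest stage sets the period
freq_ghz = 1.0 / t_clock                 # cycles per ns, i.e. GHz

print(t_clock, freq_ghz)   # 1.0 1.0
slowest = max(stage_delay_ns, key=stage_delay_ns.get)
print(slowest)             # DM
```

With these numbers the data memory dominates, matching the slide's remark that the bound equals tDM.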
Pipelined Control
[Figure: the same five-phase pipelined datapath as on the previous slide.]
51
Pipelined Control
[Figure: the pipelined datapath driven by a hardwired controller.]
52
Hardwired control in pipelining
Speed: Hardwired control is faster than microprogrammed control because it uses combinational logic to produce control signals directly based on the current state and inputs. This speed is crucial in pipelined processors, where each instruction stage needs quick control-signal generation to avoid slowing down the pipeline.
Predictability: Hardwired control is deterministic, meaning that it’s more straightforward to
design and predict behavior for each pipeline stage. This predictability helps in minimizing stalls
or hazards (such as data hazards, control hazards, or structural hazards) within the pipeline.
Lower Latency: The primary goal in pipelining is to improve the throughput of the instruction
execution process. Hardwired control contributes to this by ensuring minimal latency in signal
generation, helping each pipeline stage transition smoothly from one to the next without delays.
Simplicity in Execution: Hardwired control is typically simpler to implement in terms of the
circuitry needed for straightforward, well-defined control tasks in each pipeline stage, making it
an efficient choice for processors with simpler instruction sets, like RISC.
Although hardwired control is fast and efficient, it can become complex and difficult to modify if the instruction set is extensive or if additional functionality is needed, as adding control paths requires physical changes to the wiring.
For more complex processors or those that need flexibility (like supporting multiple instruction
sets), microprogrammed control may be a better fit.
53
Pipelined Control
[Figure: the pipelined datapath with hardwired controller, repeated from the previous slide.]
54
However, CPI will increase unless instructions are pipelined
Pipelined Control
[Figure: the pipelined datapath with hardwired controller, as above.]
55
CPI Examples
[Timing diagram for a microcoded machine: Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, and Inst 3 takes 10 cycles.]
61
Design of a Pipeline
* Splitting the Data Path
* We divide the data path into 5 parts : IF, OF,
EX, MA, and RW
* Timing
* We insert latches (registers) between
consecutive stages
* 4 Latches → IF-OF, OF-EX, EX-MA, and MA-RW
* At the negative edge of a clock, an instruction
moves from one stage to the next
72
Simple RISC Processor Design
● The approach to designing the processor is to divide the processing into
stages.
73
The Instruction Packet
* What travels between stages ?
* ANSWER : the instruction packet
* Instruction Packet
* Instruction contents
* Program counter
* All intermediate results
* Control signals
* Every instruction moves with its entire state, no
interference between instructions
74
Pipelined Data Path with Latches
Latches
75
Simple RISC Processor
Design
MA (Memory Access) Stage
• Interfaces with the memory system
• Executes a load or a store
RW (Register Write) Stage
• Writes to the register file
• In the case of a call instruction, it writes the
return address to register, ra
76
SimpleRISC Processor
Datapath
The EX Stage
Contains an Arithmetic-Logic Unit (ALU). This unit can perform all arithmetic operations (add, sub, mul, div, cmp, mod) and logical operations (and, or, not).
Contains the branch unit for computing the branch condition (beq, bgt).
Contains the flags register (updated by the cmp instruction).
77
Instruction Fetch
[Figure: the IF stage — instruction encoding with a 5-bit opcode; imm: immediate field.]
80
OF Stage
[Figure: the OF stage — control unit, register file, and the immediate and branch-target computation, operating on the fetched instruction.]
81
OF Stage
*The register file has two read ports
* 1st Input
* 2nd Input
*The two outputs are op1, and op2
* op1 is the branch target (return address) in the case of a ret
instruction, or rs1
* op2 is the value that needs to be stored in the case of a store
instruction, or rs2
82
EX Stage
[Figure: the EX stage — the ALU and branch unit read op1/op2 and aluSignals from the OF-EX latch; the branch unit uses isBeq, isBgt, isUBranch, and the flags register; isRet selects the return address as branchPC; outputs pc, aluResult, branchPC, and isBranchTaken (fed back to the fetch unit).]
83
MA Stage
[Figure: the MA stage — mar/mdr registers interface the memory unit with the data memory (isLd, isSt); pc, aluResult, op2, instruction, and control arrive through the EX-MA latch; pc, ldResult, aluResult, instruction, and control leave through the MA-RW latch.]
84
RW Stage
[Figure: the RW stage — a multiplexer (controlled by isLd and isCall) selects ldResult, aluResult, or the return address to write into the register file; the destination is rd, or ra (r15) for a call; the write is enabled by isWb.]
85
[Figure: the complete SimpleRISC pipelined datapath — fetch unit with instruction memory, control unit, register file, immediate and branch-target unit, ALU and branch unit, data memory, and register write-back, connected through the pipeline latches and control signals (isSt, isRet, isWb, isImmediate, aluSignals, isBeq, isBgt, isUBranch, isBranchTaken, isLd, isCall). Figure © Smruti R. Sarangi.]
86
Abridged Diagram
[Figure: abridged datapath — instruction memory, register file (op1, op2), ALU unit, and data memory.]
87
Instructions Interact With Each Other
in Pipeline
• Data Hazard: An instruction depends on a
data value produced by an earlier instruction
• Control Hazard: Whether or not an
instruction should be executed depends on a
control decision made by an earlier instruction
• Structural Hazard: An instruction in the
pipeline needs a resource being used by
another instruction in the pipeline
88
* Pipeline Diagram
[1]: add r1, r2, r3
[2]: sub r4, r5, r6
[3]: mul r8, r9, r10

Clock cycles:  1  2  3  4  5  6  7
IF             1  2  3
OF                1  2  3
EX                   1  2  3
MA                      1  2  3
RW                         1  2  3
89
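A staggered diagram like the one above can be generated programmatically; this Python sketch assumes an ideal 5-stage pipeline in which instruction i enters stage s in cycle i + s:

```python
# Sketch: generate the staggered diagram for an ideal 5-stage pipeline.
# Instruction i (0-based) occupies stage s during cycle i + s (0-based).
STAGES = ["IF", "OF", "EX", "MA", "RW"]

def diagram(n_insts):
    rows = []
    for s, name in enumerate(STAGES):
        cells = {s + i: str(i + 1) for i in range(n_insts)}   # cycle -> instruction
        width = n_insts + len(STAGES) - 1                     # total cycles needed
        rows.append(name + "  " + " ".join(cells.get(c, ".") for c in range(width)))
    return rows

for row in diagram(3):
    print(row)
# IF  1 2 3 . . . .
# OF  . 1 2 3 . . .
# EX  . . 1 2 3 . .
# MA  . . . 1 2 3 .
# RW  . . . . 1 2 3
```

The last instruction leaves in cycle n + k − 1, which is where the CPI formula later in the unit comes from.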
Example
[1]: add r1, r2, r3
[2]: sub r4, r2, r5
[3]: mul r5, r8, r9

Clock cycles:  1  2  3  4  5  6  7
IF             1  2  3
OF                1  2  3
EX                   1  2  3
MA                      1  2  3
RW                         1  2  3
90
Data Hazards
[1]: add r1, r2, r3

Clock cycles:  1  2  3
IF             1  2
OF                1  2
91
Data Hazard
Definition: A hazard is defined as the possibility of erroneous execution of an
instruction in a pipeline. A data hazard represents the possibility of erroneous
execution because of the unavailability of data, or the availability of incorrect
data.
92
Other Types of Data Hazards
* Our pipeline is in-order
Definition: In an in-order pipeline (such as ours), a preceding instruction is
always ahead of a succeeding instruction in the pipeline. Modern processors
however use out-of-order pipelines that break this rule. It is possible for later
instructions to execute before earlier instructions.
93
WAW Hazards & WAR Hazards
[1]: add r1, r2, r3        [1]: add r1, r2, r3
[2]: sub r1, r4, r3        [2]: add r2, r5, r6

* Instruction [2] cannot write the value of r1 before instruction [1] writes to it → a WAW hazard.
* Instruction [2] cannot write the value of r2 before instruction [1] reads it → a WAR hazard.
94
Control Hazards
95
Control Hazard – Pipeline Diagram
[1]: beq .foo
[2]: mov r1, 4
[3]: add r2, r4, r3

Clock cycles:  1  2  3  4  5  6  7
IF             1  2  3
OF                1  2  3
EX                   1  2  3
MA                      1  2  3
RW                         1  2  3
96
Control Hazards
* The two instructions fetched immediately after a branch
instruction might have been fetched incorrectly.
* These instructions are said to be on the wrong path
* A control hazard represents the possibility of erroneous
execution in a pipeline because instructions in the wrong
path of a branch can possibly get executed and save their
results in memory, or in the register file
97
Structural Hazards
* A structural hazard may occur when two instructions have a
conflict on the same set of resources in a cycle
* Example :
* Assume that we have an add instruction that can read
one operand from memory
* add r1, r2, 10[r3]
99
Solutions in Software
* Data hazards
1. Insert nop instructions, reorder code
[1]: add r1, r2, r3
[2]: sub r3, r1, r4
100
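A compiler pass for step 1 can be sketched in Python. The three-instructions-apart rule (DIST = 3) is an assumption matching a 5-stage pipeline without forwarding, and the encoding of instructions as (destination, sources) tuples is made up for illustration:

```python
# Hypothetical sketch of "insert nop instructions": a consumer must be
# at least DIST instructions behind the producer of any of its source
# registers (DIST = 3 assumes a 5-stage pipeline with no forwarding).
DIST = 3

def insert_nops(prog):
    """prog: list of (dest, sources) tuples; returns the program with nops added."""
    out = []
    for dest, srcs in prog:
        gap = 0
        # look back at recently issued instructions for a producer of our sources
        for back, (d, _) in enumerate(reversed(out), start=1):
            if back >= DIST:
                break
            if d in srcs:
                gap = max(gap, DIST - back)
        out.extend([("nop", set())] * gap)
        out.append((dest, srcs))
    return out

prog = [("r1", {"r2", "r3"}),    # [1]: add r1, r2, r3
        ("r3", {"r1", "r4"})]    # [2]: sub r3, r1, r4 (reads r1 immediately)
expanded = insert_nops(prog)
print([d for d, _ in expanded])  # ['r1', 'nop', 'nop', 'r3']
```

Two nops are inserted because the consumer follows the producer directly; reordering (next slide) tries to fill such gaps with useful instructions instead.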
2. Code Reordering

Original code:        Reordered code:
add r1, r2, r3        add r1, r2, r3
add r4, r1, 3         add r8, r5, r6
add r8, r5, r6        add r10, r11, r12
add r9, r8, r5        nop
add r10, r11, r12     add r4, r1, 3
add r13, r10, 2       add r9, r8, r5
                      add r13, r10, 2
101
Control Hazards — Delayed Branch
102
3. Delay Slots

Original code:        With delay slots:
add r1, r2, r3        b .foo
add r4, r5, r6        add r1, r2, r3
b .foo                add r4, r5, r6
add r8, r9, r10       add r8, r9, r10
103
Hardware solution for
pipeline hazards
104
Why interlocks ?
We cannot always trust the compiler to do a good job, or even to introduce nop instructions correctly.
Compilers would then need to be tailored to specific hardware.
We should ideally not expose the details of the pipeline to the compiler (they might be confidential as well).
A hardware mechanism to enforce correctness → interlock
105
Two kinds of Interlocks
* Data-Lock
* Do not allow a consumer instruction to move beyond
the OF stage till it has read the correct values.
Implication : Stall the IF and OF stages.
* Branch-Lock
* We never execute instructions in the wrong
path.
* The hardware needs to ensure both
these conditions.
106
Comparison between Software and
Hardware
107
Conceptual Look at Pipeline with
Interlocks
108
Example
[1]: add r1, r2, r3

Clock cycles:  1  2  3  4  5  6  7  8  9
IF             1  2
OF                1  2  2  2  2
MA                      1           2
RW                         1           2

(bubble: instruction 2 is held in the OF stage until its operands are ready)
109
A Pipeline Bubble
* A pipeline bubble is inserted into a
stage, when the previous stage
needs to be stalled
* It is a nop instruction
* To insert a bubble
* Create a nop instruction packet
* OR, Mark a designated bubble bit to 1
110
Bubbles in the Case of a Branch Instruction
[1]: beq .foo
[2]: add r1, r2, r3
[3]: sub r4, r5, r6
....
.foo:
[4]: add r8, r9, r10

Clock cycles:  1  2  3  4  5  6  7  8
IF             1  2  3  4
OF                1  2     4
EX                   1        4
MA                      1        4
RW                         1        4

(instructions 2 and 3 become bubbles when the branch is taken)
111
Control Hazards and Bubbles
* We know that an instruction is a branch in
the OF stage
* When it reaches the EX stage and the
branch is taken, let us convert the
instructions in the IF, and OF stages to
bubbles
* Ensures the branch-lock condition
112
Ensuring the Data-Lock Condition
113
How to Stall a Pipeline?
* Disable the write functionality of :
* The IF-OF register
* and the Program Counter (PC)
* To insert a bubble
* Write a bubble (nop instruction) into the OF-EX
register
114
Data Path with Interlocks (Data-Lock)
[Figure: the pipelined datapath augmented with a data-lock unit — it monitors the instruction in the OF stage against those in later stages, asserts stall signals for the PC and the IF-OF latch, and inserts a bubble into the OF-EX latch.]
115
Ensuring the Branch-Lock Condition
* Option 1 :
* Use delay slots (interlocks not required)
* Option 2 :
* Convert the instructions in the IF, and OF
stages, to bubbles once a branch instruction
reaches the EX stage.
* Start fetching from the next PC (not taken) or
the branch target (taken)
116
Ensuring the Branch-Lock Condition
- II
* Option 3
* If the branch instruction in the EX stage is taken,
then invalidate the instructions in the IF and OF
stages. Start fetching from the branch target.
* Otherwise, do not take any special action
* This method is also called predict not-taken (we shall use this method because it is more efficient than option 2)
117
Data Path with Interlocks
[Figure: the pipelined datapath in which the isBranchTaken signal from the branch unit invalidates the instructions in the IF-OF and OF-EX latches.]
118
Measuring Performance
* What do we mean by the
performance of a processor ?
* ANSWER : Almost nothing
* What should we ask instead ?
* What is the performance with respect to a
given program or a set of programs ?
* Performance is inversely proportional to the
time it takes to execute a program
122
Computing the Time a Program Takes

τ = #insts × CPI × (1/f) seconds

* CPI → cycles per instruction
* f → frequency (cycles per second)
123
The Performance Equation

P ∝ (IPC × f) / #insts
124
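With made-up numbers for two hypothetical processors running the same program, the proportionality P ∝ (IPC × f) / #insts can be evaluated directly:

```python
# Sketch of P ∝ (IPC × f) / #insts with illustrative numbers for two
# hypothetical processors running the same 1-million-instruction program.
def perf(ipc, freq_hz, n_insts):
    return ipc * freq_hz / n_insts   # performance, up to a constant factor

p1 = perf(ipc=0.8, freq_hz=3.0e9, n_insts=1_000_000)   # faster clock, lower IPC
p2 = perf(ipc=1.5, freq_hz=2.0e9, n_insts=1_000_000)   # slower clock, higher IPC
print(p1, p2)    # 2400.0 3000.0
print(p2 / p1)   # 1.25 -> the higher-IPC design wins here
```

The point of the sketch: frequency alone does not decide performance; the IPC and instruction count enter on equal footing.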
Number of Instructions (#insts)
125
Number of Instructions(#insts) – 2
* Function inlining
* Very small functions have a lot of overhead → call,
ret instructions, register spilling, and restoring
* Paste the code of the callee in the code of the caller
(known as inlining)
126
Computing the CPI
* CPI for a single cycle processor = 1
* CPI for an ideal pipeline(no hazards)
* Assume we have n instructions, and k stages
* The first instruction enters the pipeline in cycle 1
* It leaves the pipeline in cycle k
* The rest of the (n-1) instructions leave in the next (n-1) consecutive cycles
CPI = (n + k − 1) / n
127
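A quick check of CPI = (n + k − 1)/n for a 5-stage pipeline shows the CPI approaching 1 as the instruction count n grows:

```python
# Sketch: CPI of an ideal k-stage pipeline, CPI = (n + k - 1) / n.
def ideal_cpi(n, k):
    return (n + k - 1) / n

print(ideal_cpi(5, 5))      # 1.8   -- fill/drain overhead dominates for small n
print(ideal_cpi(1000, 5))   # 1.004 -- CPI tends to 1 as n grows
```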
Computing the Maximum Frequency
* Let the maximum amount of time that it takes to execute any instruction be:
* t_max (also known as algorithmic work)
* In the case of a pipeline, let us assume that all the pipeline stages are balanced
* We thus have (l is the latch delay):

t_stage = t_max / k + l
1/f = t_max / k + l
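These relations can be evaluated numerically; the values of t_max, k, and l below are illustrative assumptions:

```python
# 1/f = t_max/k + l : the clock period is one balanced stage plus a latch delay.
def max_frequency(t_max, k, l):
    return 1.0 / (t_max / k + l)

# Illustrative numbers: 10 ns of algorithmic work, 5 stages, 0.2 ns latch delay.
print(f"{max_frequency(10e-9, 5, 0.2e-9) / 1e6:.0f} MHz")   # about 455 MHz
```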
Performance of an Ideal Pipeline
* Let us assume that the number of instructions is a constant

P = f / CPI
Optimal Number of Pipeline Stages
* Setting ∂P/∂k = 0 gives the optimal pipeline depth
* k is inversely proportional to √l
* k is proportional to √t_max
* As we increase the latch delay, we should have fewer pipeline stages
* We need to minimise the time wasted in accessing latches
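One way to locate this optimum numerically, assuming the non-ideal model used later in these slides (CPI = CPI_ideal + stall rate × stall penalty, with the penalty growing linearly in the depth k); all the numbers below are illustrative:

```python
import math

def performance(k, t_max, l, r, c):
    """P = f / CPI, with 1/f = t_max/k + l and CPI = 1 + r*c*k
    (stall rate r, stall penalty proportional to the depth k)."""
    f = 1.0 / (t_max / k + l)
    cpi = 1.0 + r * c * k
    return f / cpi

def best_depth(t_max, l, r, c, k_max=100):
    """Sweep over integer depths and return the one maximising P."""
    return max(range(1, k_max + 1),
               key=lambda k: performance(k, t_max, l, r, c))

# With these illustrative numbers, the sweep lands next to the closed form
# k_opt = sqrt(t_max / (l*r*c)) that this model predicts.
t_max, l, r, c = 10.0, 0.1, 0.1, 0.5
print(best_depth(t_max, l, r, c))        # integer depth found by the sweep
print(math.sqrt(t_max / (l * r * c)))    # analytic optimum, about 44.7
```

Note how the sweep confirms the proportionalities on this slide: a larger latch delay l pushes the optimum down, a larger t_max pushes it up.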
A Non-Ideal Pipeline
* The ideal CPI is 1 (CPI_ideal = 1)
* However, in reality, we have stalls
Mathematical Model
P = f / CPI
Implications
* For programs with a lot of dependences (high value of the stall rate r) → use fewer pipeline stages
* For a pipeline with forwarding → the stall penalty c is smaller (than a pipeline that just has interlocks)
* Such a pipeline therefore reaches its optimal performance with a larger number of stages
Example
Consider two programs that have the following characteristics.
[Table: instruction type vs. fraction, for Program 1 and Program 2]
Example
CPI = CPI_ideal + stall rate × stall penalty
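A small numeric illustration of this equation (the stall rate and penalty are made-up values):

```python
# CPI = CPI_ideal + stall_rate * stall_penalty
def cpi(stall_rate, stall_penalty, cpi_ideal=1.0):
    return cpi_ideal + stall_rate * stall_penalty

# e.g. 20% of instructions stall, each stall costs 3 cycles:
print(round(cpi(0.20, 3), 2))   # 1.6
```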
Performance, Architecture, Compiler
[Figure: the performance P is a product of f and the IPC. Manufacturing technology and the architecture determine f; the compiler and the architecture determine the IPC.]
• Manufacturing technology affects the speed of transistors, and in turn
the speed of combinational logic blocks, and latches.
• Transistors are steadily getting smaller and faster.
• Consequently, the total algorithmic work (t_max) and the latch delay (l) are also steadily reducing.
• Hence, it is possible to run processors at higher frequencies leading to
improvements in performance.
• Manufacturing technology exclusively affects the frequency at which
we can run a processor.
• It does not have any effect on the IPC, or the number of instructions.
Constraints for Pipelining
• Note that the overall picture is not as simple as we described
• We need to consider power and complexity issues also.
• Typically, implementing a pipeline beyond 20 stages is very difficult
because of the increase in complexity.
• Secondly, most modern processors have severe power and temperature
constraints.
• This problem is also known as the power wall.
• It is often not possible to ramp up the frequency, because we cannot
afford the increase in power consumption.
• As a thumb rule, power increases as the cube of frequency. Hence,
increasing the frequency by 10% increases the power consumption by
more than 30%, which is prohibitively large.
• Designers are thus increasingly avoiding deeply pipelined designs that
run at very high frequencies.
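A quick check of the cube rule of thumb mentioned above:

```python
# Power scales roughly as the cube of frequency (rule of thumb).
increase = 1.10 ** 3 - 1.0
print(f"{increase:.1%}")   # a 10% frequency bump costs about 33% more power
```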
Consider a pipelined processor with the following four stages: Instruction Fetch (IF), Instruction Decode and Operand Fetch (ID), Execute (EX), and Write Back (WB).
The IF, ID, and WB stages take one clock cycle each to complete the operation. The number of clock cycles of the EX stage depends on the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction needs 3 clock cycles in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of operations?
ADD R2,R1,R0
MUL R4,R3,R2
SUB R6,R5,R4
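A minimal schedule calculator for this question; the model below (in-order fetch of one instruction per cycle, EX-to-EX forwarding so a consumer's EX starts the cycle after the producer's EX ends) is a simplifying assumption that matches the question's setting:

```python
# Sketch: schedule IF/ID/EX/WB stages with operand forwarding
# (illustrative model of the question above, not a full pipeline simulator).
def total_cycles(instrs):
    """instrs: list of (dest_reg, source_regs, ex_latency)."""
    done = {}                      # register -> cycle its EX result is ready
    if_c = id_c = 0
    last_wb = 0
    for dest, srcs, ex_lat in instrs:
        if_c += 1                  # in-order fetch, one instruction per cycle
        id_c = max(id_c + 1, if_c + 1)
        # With forwarding, EX starts the cycle after all sources are produced.
        ex_start = max(id_c + 1, *(done.get(s, 0) + 1 for s in srcs))
        ex_end = ex_start + ex_lat - 1
        done[dest] = ex_end
        last_wb = ex_end + 1       # WB takes one cycle after EX finishes
    return last_wb

program = [
    ("R2", ["R1", "R0"], 1),       # ADD R2, R1, R0
    ("R4", ["R3", "R2"], 3),       # MUL R4, R3, R2
    ("R6", ["R5", "R4"], 1),       # SUB R6, R5, R4
]
print(total_cycles(program))       # 8
```

The MUL picks up R2 via forwarding with no stall, but the SUB must wait for the MUL's 3-cycle EX to finish, giving 8 cycles in total.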
A processor executes instructions without
pipelining in 10 cycles per instruction. A
pipelined version of the processor splits
execution into 5 stages, each taking 2
cycles. Calculate the speedup for executing
100 instructions on the pipelined processor
compared to the non-pipelined processor.
Assume no stalls or hazards.
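One way to work the numbers, assuming the standard (k + n - 1) fill pattern and that a new instruction enters the pipeline every stage time (2 cycles):

```python
# Worked solution: compare total cycles with and without pipelining.
n = 100                        # instructions
non_pipelined = n * 10         # 10 cycles per instruction -> 1000 cycles
k, cycles_per_stage = 5, 2
# The first instruction takes all k stages; each later one finishes
# one stage time (2 cycles) after its predecessor.
pipelined = (k + n - 1) * cycles_per_stage   # (5 + 99) * 2 = 208 cycles
speedup = non_pipelined / pipelined
print(f"speedup = {speedup:.2f}x")           # about 4.81x
```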
A pipelined processor has 6 stages, each
taking 2 ns. Due to hazards and stalls, only
80% of the pipeline is utilized.
a) What is the effective throughput in
instructions per second?
b) What would the throughput be if the
pipeline were fully utilized?
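Worked numbers for this exercise, assuming one instruction completes per cycle when the pipeline is full and that throughput scales linearly with utilization:

```python
# Worked solution for the 6-stage, 2 ns pipeline.
cycle_time = 2e-9                    # seconds per stage (sets the clock period)
ideal = 1 / cycle_time               # one instruction per cycle when full
effective = 0.80 * ideal             # only 80% of issue slots are used
print(f"ideal:     {ideal:.0f} instructions/s")       # 500 million
print(f"effective: {effective:.0f} instructions/s")   # 400 million
```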
Thank you