Full Notes
Architecture
Text Book: Computer Architecture: A Quantitative
Approach by Hennessy and Patterson
RISC
Single Processor Performance
[Figure: growth in single-processor performance relative to the VAX-11/780, 1978-2006: roughly 25%/year before the mid-1980s, 52%/year during the RISC era, and a slowdown (??%/year) with the move to multi-processors in the mid-2000s.]
Contd…
[Figure: Flynn's classification. SISD: a control unit sends an instruction stream (IS) to one processing unit, which exchanges a data stream (DS) with a memory module. SIMD: one control unit broadcasts the IS to processing units 1..n, each with its own data stream (DS1..DSn) and memory module. MIMD: control units 1..n each issue their own IS to their own processing unit, each with its own DS and memory module.]
Pipelining: Basic and Intermediate
Concepts
RISC Instruction Set Basics
(from Hennessy and Patterson)
• Properties of RISC architectures:
– All operations on data apply to data in registers and typically change the entire register (32 or 64 bits).
– The only operations that affect memory are load/store operations: memory to register, and register to memory.
– Load and store operations on data smaller than a full register (32, 16, or 8 bits) are often available.
– Instructions are usually few in number (this can be relative) and are typically one size.
RISC Instruction Set Basics
Types Of Instructions
• ALU Instructions:
• Arithmetic operations either take two registers as operands or take one register and a sign-extended immediate value as an operand. The result is stored in a third register.
• Logical operations AND, OR, XOR do not usually differentiate between 32-bit and 64-bit.
• Load/Store Instructions:
• Usually take a register (base register) and a 16-bit immediate value as operands. The sum of the two forms the effective address. A second register acts as the destination in the case of a load operation.
RISC Instruction Set Basics
Types Of Instructions (continued)
• In the case of a store operation the second register
contains the data to be stored.
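To make the base-plus-offset calculation concrete, here is a minimal Python sketch (not from the slides; the function names are illustrative) of sign-extending the 16-bit immediate and adding it to the base register:

def sign_extend_16(imm16: int) -> int:
    """Sign-extend a 16-bit value to a full integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def effective_address(base_reg_value: int, imm16: int) -> int:
    """Effective address = base register + sign-extended immediate."""
    return base_reg_value + sign_extend_16(imm16)

# Example: LD R4, -8(R1) with R1 = 1000 -> address 992
assert effective_address(1000, -8 & 0xFFFF) == 992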
• Branches and Jumps
• Conditional branches are transfers of control. As
described before, a branch causes an immediate value
to be added to the current program counter.
RISC Instruction Set Implementation
• We first need to look at how instructions in the
MIPS64 instruction set are implemented without
pipelining. Assume that any instruction (MIPS) can be
executed in at most 5 clock cycles.
• The five clock cycles will be broken up into the
following steps:
• Instruction Fetch Cycle
• Instruction Decode/Register Fetch Cycle
• Execution Cycle
• Memory Access Cycle
• Write-Back Cycle
Instruction cycle
Instruction Fetch (IF) Cycle
• Send the program counter (PC) to memory
and fetch the current instruction from
memory.
• Update the PC to the next sequential PC by
adding 4 (since each instruction is 4 bytes) to
the PC.
Instruction Decode (ID)/Register Fetch Cycle
• Decode the instruction and at the same time read the values of the registers involved. As the registers are being read, do an equality test in case the instruction decodes as a branch or jump.
• The offset field of the instruction is sign-extended in case it is needed. The possible branch effective address is computed by adding the sign-extended offset to the incremented PC. The branch can be completed at this stage if the equality test is true and the instruction decoded as a branch.
Instruction Decode (ID)/Register Fetch
Cycle (continued)
[Figure: synchronous pipeline: stages separated by latches driven by a common clock; stage delay τm, latch delay d.]
Asynchronous Pipeline
- Transfers performed when individual stages are ready.
- Handshaking protocol between adjacent stages.
[Figure: stages Si and Si+1 connected by input/output handshake signals.]
Pipeline cycle τ:
τ = max{τm} + d, where d is the latch delay
Pipeline frequency f:
f = 1/τ
Example on Clock Period
Suppose the time delays of the 4 stages are τ1 = 60 ns, τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, and the interface latch has a delay of d = 10 ns.
The cycle time of this pipeline is then:
τ = max{τm} + d = 90 + 10 = 100 ns
Clock frequency of the pipeline: f = 1/τ = 1/100 ns = 10 MHz
If it were non-pipelined, the total time would be 60 + 50 + 90 + 80 = 280 ns.
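As a quick check of the arithmetic above, a small Python sketch (illustrative only; delays in ns):

stage_delays = [60, 50, 90, 80]   # tau_1..tau_4
latch_delay = 10                  # d

tau = max(stage_delays) + latch_delay     # pipeline cycle time
f_mhz = 1000 / tau                        # f = 1/tau, tau in ns -> MHz
non_pipelined = sum(stage_delays)         # one task through all stages

print(tau, f_mhz, non_pipelined)          # 100 ns, 10.0 MHz, 280 ns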
Ideal Pipeline Speedup
A k-stage pipeline processes n tasks in k + (n-1) clock cycles: k cycles for the first task and n-1 cycles for the remaining n-1 tasks.
Total time to process n tasks: Tk = [k + (n-1)]τ
For the non-pipelined processor: T1 = nkτ
Pipeline Speedup Expression
Speedup: Sk = T1/Tk = nkτ / [k + (n-1)]τ = nk / [k + (n-1)]
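A small Python sketch of this speedup formula (the function name is ours) shows Sk approaching k as n grows:

def pipeline_speedup(k: int, n: int) -> float:
    """Ideal k-stage speedup for n tasks: n*k / (k + (n - 1))."""
    return (n * k) / (k + (n - 1))

print(pipeline_speedup(4, 4))      # ~2.29 for few tasks
print(pipeline_speedup(4, 1000))   # ~3.99, close to the limit of k = 4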
[Figure: space-time diagram of instructions flowing through the IF, ID, EXE, MEM, WB stages, each instruction entering the pipeline one cycle after the previous one.]
An Example of a Structural
Hazard
[Figure: a Load followed by Instructions 1-4 in the pipeline (Mem, Reg, ALU, DM, Reg stages). With a single memory port, Instruction 3's instruction fetch conflicts with the Load's data-memory access in the same cycle.]
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
Stalls and Performance
(Contd…)
• Alternatively, pipelining can be viewed as improving the clock cycle time; if the CPI of both the pipelined and unpipelined machine is 1, the same expression results:
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
Defining Performance
• To maximize performance, we want to minimize response time or execution time for some task. We can relate performance and execution time for a computer X as:
Performance_X = 1 / ExecutionTime_X
• For two computers X and Y, "X is n times faster than Y" means:
Performance_X / Performance_Y = n
[Figure: pipeline diagram of the Load and Instructions 1-3 flowing through the Mem, Reg, ALU, DM, Reg stages, as above.]
Speedup = [1 / (1 + 0.4 × 1)] × [1 / (1/1.05)]
        = 1.05 / 1.4 ≈ 0.75
Dealing With Structural Hazards
(continued)
• We can see that even though the clock speed
of the processor with the hazard is a little
faster, the speedup is still less than 1.
• Therefore the hazard has quite an effect on
the performance.
• Sometimes computer architects will opt to
design a processor that exhibits a structural
hazard. Why?
• A: The improvement to the processor data path is too costly.
• B: The hazard occurs rarely enough so that the processor will still
perform to specifications.
An Example of Performance
Impact of Structural Hazard
• Assume:
– Pipelined processor.
– Data references constitute 40% of an instruction
mix.
– Ideal CPI of the pipelined machine is 1.
– Consider two cases:
• Unified data and instruction cache vs. separate data and
instruction cache.
• What is the impact on performance?
An Example
Cont…
• Avg. Instr. Time = CPI × Clock Cycle Time
(i) For the separate cache: Avg. Instr. Time = 1 × 1 = 1
(ii) For the unified cache case:
= (1 + 0.4 × 1) × (ideal clock cycle time)
= 1.4 × ideal clock cycle time = 1.4
• Speedup = 1/1.4 ≈ 0.71
• Roughly 30% degradation in performance
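The same comparison in a short Python sketch (variable names are ours), following the numbers above:

ideal_cpi = 1.0
data_ref_fraction = 0.4     # data references in the instruction mix
stall_per_data_ref = 1.0    # one stall per data reference (unified cache)

avg_time_split = ideal_cpi * 1.0
avg_time_unified = (ideal_cpi + data_ref_fraction * stall_per_data_ref) * 1.0

speedup = avg_time_split / avg_time_unified
print(speedup)              # 1/1.4 = 0.714..., roughly a 30% slowdown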
Data Dependences and Hazards
• Determining how one instruction depends on
another is critical to determining how much
parallelism exists in a program and how that
parallelism can be exploited.
Data Dependences
There are three different types of dependences:
• Data Dependences (also called true data
dependences), Name Dependences and
Control Dependences.
• An instruction j is data dependent on
instruction i if either of the following holds:
Instruction i produces a result that may be used by
instruction j, or
Instruction j is data dependent on instruction k, and
instruction k is data dependent on instruction i.
Consider the MIPS code sequence
that increments a vector of values in memory (starting at 0(R1), with the last element at 8(R2)), by a scalar in register F2:
Loop: L.D F0, 0(R1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer 8 bytes
BNE R1, R2, LOOP ;branch R1!=R2
Data Dependences
Contd…
• A data value may flow between instructions
either through registers or through memory
locations.
• When the data flow occurs in a register, detecting the dependence is straightforward since the register names are fixed in the instructions,
• although it gets more complicated when branches intervene.
Contd…
• Dependences that flow through memory
locations are more difficult to detect
• Since two addresses may refer to the same
location but look different:
• For example, 100(R4) and 20(R6) may be
identical memory addresses.
• The effective address of a load or store may change from one execution of the instruction to another (so that 20(R4) in one execution and 20(R4) in another may name different locations).
Detecting Data Dependences
• A data value may flow between instructions:
– (i) through registers
– (ii) through memory locations.
• When data flow is through a register:
– Detection is rather straightforward.
• When data flow is through a memory location:
– Detection is difficult.
– Two addresses may refer to the same memory
location but look different.
100(R4) and 20(R6)
Name Dependences
• A Name Dependence occurs when two
instructions use the same register or memory
location, called a name
• There are two types of name dependences between an instruction i that precedes instruction j in program order:
• Antidependence,
• Output Dependence
Contd…
• An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads.
• The original ordering must be preserved to ensure that i reads the correct value. There is an antidependence between S.D and DADDUI on register R1 in the MIPS code sequence on the next slide.
Consider the MIPS code sequence
that increments a vector of values in memory (starting at 0(R1), with the last element at 8(R2)), by a scalar in register F2:
Loop: L.D F0, 0(R1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer 8 bytes
BNE R1, R2, LOOP ;branch R1!=R2
Contd…
• An Output Dependence occurs when
instruction i and instruction j write the same
register or memory location.
[Figure: dependence types, where D(I) is the set of locations instruction I reads and R(I) the set it writes.
RAW: instruction I writes a value that a later instruction J reads (R(I) ∩ D(J) ≠ ∅).
WAR: a later instruction J writes a value that instruction I reads (D(I) ∩ R(J) ≠ ∅).
WAW: instructions I and J write the same location (R(I) ∩ R(J) ≠ ∅).]
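A minimal Python sketch of this classification (the tuple encoding of instructions is our assumption, not from the slides): each instruction is modeled as a destination register plus a list of source registers, and we report the dependences from an earlier instruction i to a later instruction j.

def dependences(i, j):
    """Return dependence types from earlier instr i to later instr j."""
    i_dst, i_srcs = i
    j_dst, j_srcs = j
    kinds = []
    if i_dst is not None and i_dst in j_srcs:   # j reads what i writes
        kinds.append("RAW (true dependence)")
    if j_dst is not None and j_dst in i_srcs:   # j writes what i reads
        kinds.append("WAR (antidependence)")
    if i_dst is not None and i_dst == j_dst:    # both write the same reg
        kinds.append("WAW (output dependence)")
    return kinds

# S.D reads R1; DADDUI then writes R1 -> antidependence, as in the loop above.
sd     = (None, ["F4", "R1"])        # S.D F4, 0(R1)
daddui = ("R1", ["R1"])              # DADDUI R1, R1, #-8
print(dependences(sd, daddui))       # ['WAR (antidependence)']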
Control dependence
Data Dependencies : Summary
Data dependencies in straight-line code:
– Load-Use dependency
– Define-Use dependency
[Figure: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9, all needing R1 before the ADD has written it back.]
[Figure: forwarding path from the EXECUTE stage's ALU-result latch back to the ALU input, bypassing the WRITE stage.]
When Can We Forward?
[Figure: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11. SUB gets R1 from the EX/MEM pipe register, AND gets it from the MEM/WB pipe register, OR gets it by forwarding within the register file (write in the first half of the cycle, read in the second half), and XOR reads R1 normally.]
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Problems
• We can avoid the hazard by using a Pipeline
interlock.
• The pipeline interlock will detect when data
forwarding will not be able to get the data to
the next instruction in time.
• A stall is introduced until the instruction can
get the appropriate data from the previous
instruction.
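A minimal Python sketch of the load-use check an interlock performs (the instruction encoding is our assumption): stall when the instruction in EX is a load whose destination matches a source of the instruction in ID, since forwarding cannot deliver the loaded value in time.

def needs_stall(ex_instr, id_instr) -> bool:
    """True when the EX-stage load feeds the ID-stage instruction."""
    ex_op, ex_dst, _ = ex_instr
    _, _, id_srcs = id_instr
    return ex_op == "LOAD" and ex_dst in id_srcs

ld   = ("LOAD", "R1", ["R2"])        # LD   R1, 0(R2)
dsub = ("ALU",  "R4", ["R1", "R5"])  # DSUB R4, R1, R5
print(needs_stall(ld, dsub))         # True -> insert one stall cycle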
Handling Data Hazards by S/W
• The compiler introduces NOPs between the two instructions.
• A NOP is a do-nothing instruction that keeps a gap between two dependent instructions.
• Detection of the dependency is left entirely to the S/W.
• Advantage: enables an easy technique called instruction reordering.
Instruction Reordering
Before:
ADD R1, R2, R3
SUB R4, R1, R5
XOR R8, R6, R7
AND R9, R10, R11
After:
ADD R1, R2, R3
XOR R8, R6, R7
AND R9, R10, R11
SUB R4, R1, R5
Instruction Execution:
MIPS Data path
• Can break down the process of “running” an
instruction into stages.
• These stages are what needs to be done to
complete the execution of each instruction.
Some instructions will not require some stages.
MIPS Data path
1. Instruction Fetch (IF) –
IR ← M[PC]
NPC ← PC + 4
IR – instruction register
NPC – next program counter
Instruction Execution
Contd…
2. Instruction Decode/Register Fetch (ID) –
Figure out what the instruction is supposed to
do and what it needs.
A ← Register File[Rs]
B ← Register File[Rt]
Imm ← sign-extended immediate: {(IR16)^16, IR15..0}
3. Execution/Effective Address (EX) –
Reg-Reg ALU instr: ALUout ← A op B
Reg-Imm: ALUout ← A op Imm
Branch: ALUout ← NPC + Imm
Cond ← (A {==, !=} 0)
LD/ST: ALUout ← A op Imm to form the effective address
Instruction Execution
Contd…
4. Memory Access/Branch Completion (MEM) – Besides the IF stage, this is the only stage that accesses the memory, to load and store data.
Load: LMD ← Mem[ALUout]
Store: Mem[ALUout] ← B
5. Write-Back (WB) –
Reg-Reg ALU instr: Rd ← ALUoutput
Load: Rd ← LMD
Reg-Imm: Rt ← ALUoutput
Amdahl’s Law
Quantifies the overall performance gain due to an improvement in a part of a computation (CPU bound).
Amdahl’s Law:
The performance improvement gained from using some faster mode of execution is limited by the amount of time the enhancement is actually used.
Speedup = (Performance for entire task using enhancement when possible) / (Performance for entire task without using enhancement)
OR
Speedup_overall = ExecutionTime_old / ExecutionTime_new
                = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Use the previous equation to solve for speedup.
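A minimal Python sketch of the formula above (the function name is ours), useful for the assignments that follow:

def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the task is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# E.g., if 40% of the time can be sped up 10x, the overall gain is modest:
print(amdahl_speedup(0.4, 10))   # ~1.56, far below 10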
Assignments
Q. Consider a 5-stage pipeline system. If 40% of the instructions are branch instructions, the penalty for a branch instruction is 3 clock cycles, and the clock cycle time is 20 ns, then find the throughput of the pipelined system.
Q. A 5-stage pipeline is separated by a 10 ns clock. If the non-pipelined clock has the same duration and the pipeline efficiency is 90%, calculate the speedup factor.
Assignments
Q. A five-stage pipeline processor has IF, ID, EXE, MEM, WB stages. The IF, ID, MEM, and WB stages take 1 clock cycle each for any instruction. The EXE stage takes 2 clock cycles for ADD & SUB instructions and 3 clock cycles for MUL & DIV instructions; all remaining instructions take 1 clock cycle to complete their EXE phase.
Assignments
Q. Consider the following instructions:
MUL R1, R2, R4
STORE R1, 4(R3)
DIV R4, R1, R2
ADD R7, R1, R4
MUL R9, R7, R4
LOAD R1, 8(R6)
SUB R1, R9, R7
For the above sequence:
(i) Find all types of data dependences.
(ii) Find the difference between the total number of clock cycles required to complete the execution of the above instruction set with and without operand forwarding.
Assignments
Q. Your company has just bought a new dual-Pentium processor, and you have been tasked with optimizing your software for this processor. You will run two applications on this dual Pentium, but the resource requirements are not equal. The first application needs 80% of the resources, and the other only 20% of the resources.
Assignments
(i) Given that 40% of the first application is parallelizable, how much speedup would you achieve with that application if run in isolation?
(ii) Given that 99% of the second application is parallelizable, how much overall system speedup would you get?
Control Hazards
• Result from branch and other instructions that change
the flow of a program (i.e. change PC).
• Example:
1: if (cond) {
2:   s1 }
3: s2
[Figure: the branch flows through IF, ID, EX, MEM, WB; the delay-slot instruction is fetched right behind it, followed by the branch target or fall-through successor.]
• Question: What instruction do we put in the delay slot?
• Answer: one that can safely be executed no matter what the branch does.
– The compiler decides this.
Delayed Branch
• One possibility: an instruction from before the branch
• Example: the DADD is moved into the branch's delay slot:
DADD R1, R2, R3
if R2 == 0 then
  (delay slot)
[Figure: pipeline timing: the branch, the add instruction in its delay slot, and the branch target/successor each enter IF ID EX MEM WB one cycle apart.]
• Second possibility: an instruction from the not-taken (fall-through) path:
if R2 == 0 then
  (delay slot)
OR R7, R8, R9
DSUB R4, R5, R6
• The OR instruction can be moved into the delay slot ONLY IF its execution doesn't disrupt the program execution (e.g., R7 is overwritten later)
Delayed Branch
• Third possibility: an instruction from inside the taken path
• Example:
DADD R1, R2, R3
if R1 == 0 then
  (delay slot)
OR R7, R8, R9
DSUB R4, R5, R6
• The OR instruction (from the taken path) can be moved into the delay slot ONLY IF its execution doesn't disrupt the program execution (e.g., R7 is overwritten later)
Performance of branch with Stalls
Latency assumptions for the following example:
Instruction producing result | Instruction using result | Latency (cycles)
FP ALU op                    | FP ALU op                | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Assume that the latency for integer ops is zero and the latency for integer load is 1.
An Example
Loop: LD F0, 0(R1)       1
      STALL               2
      ADDD F4, F0, F2     3
      STALL               4
      STALL               5
      SD 0(R1), F4        6
      DADDUI R1, R1, #-8  7
      STALL               8
      BNE R1, R2, Loop    9
[Figure: 2-bit branch-prediction state machine: states Predict Taken (11, 10) and Predict Not Taken (01, 00); a taken branch moves the state toward Predict Taken and a not-taken branch toward Predict Not Taken, so two consecutive mispredictions are needed to change the prediction.]
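A minimal Python sketch of the 2-bit saturating counter shown above (the class name is ours): counter values 0-1 predict not taken, 2-3 predict taken, and the counter moves one step per outcome.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 0                   # start at strongly not taken

    def predict(self) -> bool:
        return self.counter >= 2           # True = predict taken

    def update(self, taken: bool) -> None:
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:  # a loop-like branch pattern
    print(p.predict(), outcome)            # prediction vs. actual
    p.update(outcome)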
Branch Predictors
• The size of a branch predictor's memory will only increase its effectiveness so much.
• We also need to address the effectiveness of the scheme used. Just increasing the number of bits in the predictor doesn't do very much either.
• Some other predictors include:
– Correlating Predictors
– Tournament Predictors
Branch Predictors
• Correlating predictors use the history of a local branch AND some global information on how branches are executing to decide whether the branch will be taken or not.
• Tournament predictors are even more sophisticated: they use multiple predictors, local and global, and choose between them with a selector to improve accuracy.
Why Dynamic Scheduling?
• All the static (compiler) techniques discussed so far use in-order instruction issue.
• That means that if an instruction is stalled in the pipeline, no later instructions can proceed.
• With in-order issue, if two instructions have a
hazard between them, the pipeline will stall,
even if there are later instructions that are
independent and would not stall.
Why Dynamic Scheduling?
• Several early processors used another
approach, called dynamic scheduling, whereby
the hardware rearranges the instruction
execution to reduce the stalls.
Advantages of Dynamic Scheduling
• Handles cases where dependences are unknown at compile time (e.g., memory references)
• Simplifies the compiler: code compiled for one pipeline runs efficiently on a different pipeline
• Enables hardware speculation, a technique with significant performance advantages, which builds on dynamic scheduling
• Key idea: instructions behind a stall can proceed:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
(SUBD depends on neither DIVD nor ADDD, so it need not wait behind the stalled ADDD.)
• Out-of-order execution => out-of-order completion.
Instruction Parallelism by HW
• Enables out-of-order execution and allows out-of-order completion
• We will distinguish when an instruction begins execution and when it completes execution; in between, the instruction is in execution
• Dynamically scheduled pipeline: all instructions pass through the issue stage in order (in-order issue)
Dynamic Scheduling by Scoreboard:
bookkeeping technique - OLD
• To implement out-of-order execution, the ID stage must be split into two stages:
1. Issue—decode instructions, check for structural hazards
2. Read operands—wait until no data hazards, then read operands
• Scoreboards date to the CDC 6600 in 1963
• Instructions execute whenever they are not dependent on previous instructions and there are no hazards.
• CDC 6600: in-order issue, out-of-order execution (when there are no conflicts and the hardware is available), out-of-order commit (or completion)
– No forwarding
Scoreboard Architecture (CDC
6600)
[Figure: the SCOREBOARD controls a set of functional units (two FP multipliers, an FP divider, an FP adder, and an integer unit) connected to the registers and to memory.]
Scoreboard Implications
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
– Stall writeback until registers have been read
– Read registers only during Read Operands stage
• Solution for WAW:
– Detect hazard and stall issue of new instruction until other
instruction completes
• No register renaming
Scoreboard Implications
Contd…
• Need to have multiple instructions in
execution phase => multiple execution units or
pipelined execution units
• Scoreboard keeps track of dependencies
between instructions that have already issued.
• Scoreboard replaces ID, EX, WB with 4 stages
Four Stages of Scoreboard Control
• Issue—decode instructions & check for
structural hazards (ID1)
– Instructions issued in program order (for hazard checking)
– Don’t issue if structural hazard
– Don’t issue if instruction is output dependent on
previously issued but uncompleted instruction (no WAW
hazards)
• Read operands—wait until no data hazards,
then read operands (ID2)
– All real dependencies (RAW hazards) resolved in this
stage. Wait for instructions to write back data.
– No data forwarding
Four Stages of Scoreboard Control
• Execution—operate on operands (EX)
– The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that execution is complete
• Write result—finish execution (WB)
– Stall until no WAR hazards with previous instructions.
Scoreboard Example
Instruction status:
Instruction  j    k     Issue  Read operands  Execution complete  Write result
LD    F6   34+  R2
LD    F2   45+  R3
MULTD F0   F2   F4
SUBD  F8   F6   F2
DIVD  F10  F0   F6
ADDD  F6   F8   F2

Functional unit status:
Time  Name     Busy  Op  Fi(dest)  Fj(S1)  Fk(S2)  Qj(FU for j)  Qk(FU for k)  Rj(Fj?)  Rk(Fk?)
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
FU
Dynamic Scheduling Using A Scoreboard
Scoreboard Example Cycle 1
Instruction status:
Instruction        Issue  Read Operands  Execution Complete  Write Result
L.D   F6,34(R2)     X        X               X                  X
L.D   F2,45(R3)     X        X               X                  X
MUL.D F0,F2,F4      X        X               X
SUB.D F8,F6,F2      X        X               X                  X
DIV.D F10,F0,F6     X
ADD.D F6,F8,F2      X        X               X
[Figure: memory hierarchy with an optional L3 cache in front of main memory.]
Direct Mapping
[Figure: direct-mapped cache: block j of main memory maps to a fixed cache block; each block has 16 = 2^4 words. The block address is split into tag, cache-block, and word fields.]
Associative-mapped cache
[Figure: associative-mapped cache: a main-memory block may be placed in any cache block; each cache block stores a tag identifying which memory block it holds.]
Set-Associative Mapping
[Figure: 2-way set-associative cache with 128 blocks organized as 64 sets (128/2 = 64 = 2^6); each block has 16 = 2^4 words. The address is split into a 6-bit tag indicating which of the 2^6 blocks that map to a set is present (4096/64 = 2^6), a 6-bit field pointing to a particular set, and a 4-bit field selecting one of 16 words: field widths 6 | 6 | 4.]
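A small Python sketch of the 6/6/4 split above (assuming word addresses, so 16-bit addresses for the 4096-block, 16-words-per-block main memory):

def split_address(addr: int):
    """Split a 16-bit word address into (tag, set, word) fields."""
    word   = addr & 0xF            # low 4 bits: word within block
    setidx = (addr >> 4) & 0x3F    # next 6 bits: set number (0-63)
    tag    = (addr >> 10) & 0x3F   # top 6 bits: tag
    return tag, setidx, word

tag, setidx, word = split_address(0b101101_001011_0110)
print(tag, setidx, word)           # 45 11 6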
[Figure: split vs. unified cache hierarchies: one processor with separate L1 I-cache and D-cache backed by a unified L2 cache, versus one processor with a unified L1 cache backed by a unified L2 cache.]
Multiprocessor
A Broad Classification of Computers
• Shared-memory multiprocessors
– Also called UMA (uniform memory access)
• Distributed memory computers
– Also called NUMA (non-uniform memory access):
• Distributed shared-memory (DSM) architectures
• Clusters
• Grids, etc.
UMA vs. NUMA Computers
[Figure: (a) UMA: processors P1..Pn, each with a cache, share a bus to a single main memory; memory latency is 100s of ns. (b) NUMA: nodes of processor + cache + local main memory connected by a network; remote latency ranges from several milliseconds to seconds in loosely coupled systems.]
[Figure: the cache-coherence problem: processors on a shared bus each cache location U with value 5; after one processor writes U, the other caches still hold the stale value 5, so a subsequent read may return the wrong value.]
Snoopy-Cache State Machine-I
[Figure: state transitions for CPU requests, per cache block: a CPU write places a write miss on the bus and moves the block to Exclusive (read/write); CPU read hits and write hits stay in Exclusive; on a CPU write miss, the block is written back and a write miss is placed on the bus.]
Snoopy-Cache State Machine-II
• State machine considering only bus requests for each cache block.
[Figure: states Invalid, Shared (read only), and Exclusive (read/write). A write miss on the bus for this block moves Shared or Exclusive to Invalid; from Exclusive the cache first writes back the block, aborting the memory access. A read miss on the bus for this block moves Exclusive to Shared, again writing back the block and aborting the memory access.]
Combined Snoopy-Cache State Machine
• State machine combining the CPU-request and bus-request transitions for each cache block; e.g., a CPU read hit leaves the state unchanged.
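A hypothetical sketch (our encoding, not from the slides) of the combined protocol above as a Python transition table, with per-block states Invalid, Shared (read only), and Exclusive (read/write); bus write-backs are noted only in comments.

NEXT_STATE = {
    ("Invalid",   "cpu_read"):       "Shared",     # place read miss on bus
    ("Invalid",   "cpu_write"):      "Exclusive",  # place write miss on bus
    ("Shared",    "cpu_read"):       "Shared",     # read hit
    ("Shared",    "cpu_write"):      "Exclusive",  # place write miss on bus
    ("Shared",    "bus_write_miss"): "Invalid",    # another cache writes
    ("Exclusive", "cpu_read"):       "Exclusive",  # read hit
    ("Exclusive", "cpu_write"):      "Exclusive",  # write hit
    ("Exclusive", "bus_read_miss"):  "Shared",     # write back the block
    ("Exclusive", "bus_write_miss"): "Invalid",    # write back the block
}

state = "Invalid"
for event in ["cpu_read", "cpu_write", "bus_read_miss"]:
    state = NEXT_STATE[(state, event)]
    print(event, "->", state)      # Shared, Exclusive, Shared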
[Figure: superscalar organization: an instruction queue feeds a dispatch unit, which issues instructions in parallel to a floating-point unit and an integer unit; W: write results.]
• Limited by
– Data dependency
– Procedural dependency
– Resource conflicts
VLIW Processor
Basic Working Principles of VLIW
• Aim at speeding up computation by exploiting instruction-level parallelism.
• Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel.
• An instruction consists of multiple operations; typical word lengths range from 64 bits to 1 Kbit.
• All operations in an instruction are executed in lock-step mode.
• Rely on the compiler to find parallelism and schedule dependency-free program code.
Basic VLIW Approach
Register File Structure for VLIW
Differences Between VLIW & Superscalar
Architecture (I)
Differences Between VLIW & Superscalar
Architecture (II)
• Instruction formulation:
– Superscalar:
• Receive conventional instructions conceived for seq. processors.
– VLIW:
• Receive (very) long instruction words, each comprising a field (or
opcode) for each execution unit.
• Instruction word length depends on (a) the number of execution units, and (b) the code length to control each unit (such as opcode length, register names, …).
• Typical word length is 64 – 1024 bits, much longer than
conventional machine word length.
• Instruction scheduling:
– Superscalar:
• Done dynamically at run-time by the hardware.
• Data dependency is checked and resolved in hardware.
• Need a look ahead hardware window for instruction fetch.
– VLIW:
• Static scheduling done at compile-time by the compiler.
• Advantages:
– Reduce hardware complexity.
– Tasks such as decoding, data-dependency detection, instruction issue, etc. become simple.
– Potentially higher clock rate.
– Higher degree of parallelism with global program
information.
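A hypothetical Python sketch of this compile-time idea (the encoding and function name are ours, not a real VLIW compiler): greedily bundle independent operations, at most one per execution unit, into one long instruction word.

def pack(ops):
    """ops: list of (unit, dst, srcs). Returns a list of long words."""
    words, word, written = [], {}, set()
    for unit, dst, srcs in ops:
        independent = not (set(srcs) | {dst}) & written
        if unit in word or not independent:   # unit busy or dependent:
            words.append(word)                # close the current word
            word, written = {}, set()
        word[unit] = (dst, srcs)
        written.add(dst)
    if word:
        words.append(word)
    return words

ops = [("int", "R1", ["R2", "R3"]),   # ADD R1, R2, R3
       ("fp",  "F0", ["F2", "F4"]),   # MUL.D F0, F2, F4 (independent)
       ("int", "R4", ["R1", "R5"])]   # SUB R4, R1, R5 (depends on R1)
print(len(pack(ops)))                 # 2 long words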
Avoiding Address Translation to Reduce Hit Time (3)