CS1601 Computer Architecture
Unit 1
Contents
Measuring and Reporting performance
Quantitative principles of computer design
Classifying instruction set architecture
Memory addressing
Addressing Modes
Type and size of operands
Operations in the instruction set
Operands and operations for signal processing
Instruction for control flow
Encoding an instruction set
Example architecture – MIPS and TM32.
Measuring and Reporting Performance
Response time or execution time
Throughput
Benchmarks are used to measure and compare the performance of different systems
Response Time Breakup
Response time: same as elapsed time or wall-clock time
Includes the following components:
User time
System time (privileged time)
Waiting time and I/O time
Finding User time in Windows
Choosing Programs to Evaluate Performance
Execution time (seconds):
Computer     A      B      C
Program 1    1      10     20
Program 2    1000   100    20
Total        1001   110    40
Which computer is fastest depends on which programs make up the evaluation data!
Amdahl's Law
Performance gain (speedup)
= Performance for the entire task with enhancement / Performance for the entire task without enhancement
OR
= Execution time of the entire task without enhancement / Execution time of the entire task with enhancement
Example (for a 1 s task): Fraction_enhanced = 0.4, Speedup_enhanced = 10
Total time for enhanced execution = 0.6 + 0.4/10 s = 0.64 s
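The same computation, as a minimal C sketch (the function name and the 1 s task in main are illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's Law: overall speedup when fraction f of a task
   is accelerated by a factor s. */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    double sp = amdahl_speedup(0.4, 10.0);   /* the slide's numbers */
    printf("Overall speedup = %.4f\n", sp);  /* 1.5625 */
    printf("Enhanced time for a 1 s task = %.2f s\n", 1.0 / sp); /* 0.64 s */
    return 0;
}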
4. CPU Time
= (Instructions / Program) × (Clock Cycles / Instruction) × (Seconds / Clock Cycle)
= Seconds / Program
Take Advantage of Parallelism
Instruction level parallelism using
Pipeline
Memory Access using multiple
banks
Parallel computation of two or more
possible outcomes
Assignment
Unit I
Content
Internal Storage
Reading
Memory Addressing Assignment
Addressing Mode for Signal Processing
Types and Size of Operands
Operands for Media and Signal Processing
Operations in the Instruction Set
Operations for Media and Signal Processing
Instruction for Control Flow
Encoding an Instruction Set
Role of Compilers
Internal Storage
[Figure: the four ISA classes by internal operand storage — stack (operands on TOS), accumulator (ACC), register-memory, and register-register/load-store — each shown with its use of processor registers and memory]
Code Sequence with different
Internal Storage
Little Endian
Address:  Byte 0   Byte 1
Value:    02       01
Big Endian
Address:  Byte 0   Byte 1
Value:    01       02
(storing the 16-bit value 0x0102)
Alignment
Alignment is normally at a 2-byte or 4-byte boundary.
Allowing misaligned data optimizes storage, but makes the CPU logic more complex and slower.
Register-indirect addressing example:
Regs[R1] ← Regs[R1] + Mem[Regs[R2]]
[Figure: three basic instruction-encoding variations —
(a) Variable: operation and number of operands, followed by address specifier / address field pairs
(b) Fixed (e.g. Alpha, ARM, MIPS, PowerPC, SPARC, SuperH): operation plus address fields 1–3
(c) Hybrid: operation plus address specifier(s) and address field(s)]
Topics to prepare and present
1. RISC versus CISC
2. IEEE 754 Floating point standard
3. Floating point Arithmetic
4. DEC VAX Assembler Instruction Set
5. Benchmarking - Whetstone, Dhrystone,
Linpack, SPEC, & TPC
6. Stack based versus Register
architecture
7. Problem – Example on page 106
Instruction Level Parallelism
Unit 2
What is Instruction-level parallelism
(ILP)?
Overlapping the execution of
instructions
Assembled code
[Figure: a sequence of assembled instructions whose executions are overlapped]
Pipeline - Overlap
Compare & contrast:
1. Serving an individual to completion
2. Serving in a pipeline, as shown above
In the second case, the time to serve an individual does not improve, but more individuals are served per unit time.
Clock number
Instruction        1    2    3    4    5    6    7    8    9
Instruction i      IF   ID   EX   MEM  WB
Instruction i+1         IF   ID   EX   MEM  WB
Instruction i+2              IF   ID   EX   MEM  WB
Instruction i+3                   IF   ID   EX   MEM  WB
Instruction i+4                        IF   ID   EX   MEM  WB
Pipelining Data Paths
[Figure: program execution order (in instructions) against time in clock cycles CC 1–CC 9; each instruction passes through IM, Reg, ALU, DM, and Reg, with five instructions overlapped]
Basic Performance Issues in
Pipelining
Pipelining slows down individual instructions,
but it increases the rate at which the whole program executes!
Problem 1
Consider the un-pipelined processor section.
Assume that it has a 1 ns clock cycle and that it
uses 4 cycles for ALU operations and branches and
5 cycles for memory operations. Assume that the
relative frequencies of these operations are
40%,20% and 40% respectively. Suppose that due
to clock skew and setup, pipelining the processor
adds 0.2 ns of overhead to the clock. Ignoring any
latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
Speedup
= (CPI_unpipelined / CPI_pipelined) × (Clock period_unpipelined / Clock period_pipelined)
CPI P = Ideal CPI + Pipeline stall cycles per instruction
CPI P = 1 + Pipeline stall cycles per instruction
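A worked solution, using the formula above with the given numbers:
Average instruction time (unpipelined) = 1 ns × (0.4 × 4 + 0.2 × 4 + 0.4 × 5) = 4.4 ns
Average instruction time (pipelined) = 1 ns + 0.2 ns = 1.2 ns
Speedup = 4.4 / 1.2 ≈ 3.7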
Pipeline Implementation with MIPS – IF Cycle
IR ← Mem[PC];
NPC ← PC + 4;
Pipeline Implementation with MIPS – ID
Cycle
A ← Regs[rs];
B ← Regs[rt];
IMM ← sign-extended immediate field of IR;
Execution/Effective Address Cycle (EX)
The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the MIPS instruction type.
Memory reference:
ALUOutput ← A + IMM;
Register-Register ALU instruction:
ALUOutput ← A func B;
Register-Immediate ALU instruction:
ALUOutput ← A op IMM;
Branch:
ALUOutput ← NPC + (IMM << 2);
Cond ← (A == 0)
Pipeline Implementation with MIPS –
Memory Access Cycle
Memory reference:
LMD ← Mem[ALUOutput] or
Mem[ALUOutput] ← B;
Branch:
if (cond) PC ← ALUOutput;  # otherwise PC keeps the NPC content
Pipeline Implementation with MIPS – Write
Back
Register-Register ALU instruction: Regs[rd] ← ALUOutput;
Register-Immediate ALU instruction: Regs[rt] ← ALUOutput;
Load instruction: Regs[rt] ← LMD;
Instruction Level Parallelism
Basic block
Parallelism within basic block
Limitation of parallelism within …
Parallelism across basic block
Limitation of parallelism across …
[Figure: a basic block — straight-line code with a single entry (Inst 1) and a single exit (Inst 4), with no branches in or out in between]
MIPS Program Characteristics
[Figure: a 100-instruction MIPS listing in which control transfers (BE at 5, BNEZ at 10, JUMP at 92, BNE at 97) appear every few instructions, keeping basic blocks small]
Vector Processors – Parallelism
across …
Control dependence
True Dependence
Instruction i produces a result that may be used by instruction j, or
instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (dependence is transitive).
if p1 {
S1;
}
if p2 {
S2;
}
Data Hazards
RAW (Read After Write)
WAW (Write After Write)
WAR (Write After Read)
RAR (Read After Read)
Read After Write - RAW
Wrong execution: j tries to read a source before i writes to it.
Arises from true dependence
Example:
Load R1, A
Add R1, R2, R1
Write After Write
Arises from Output Dependence
Correct order between writes to be
maintained
Appears in only those pipelines
where there are multiple write stages
Example:
DMUL F4, F3, F4
DLOAD F4, A
Write After Read
Arises from Anti-dependence
An instruction writes to a destination before it is read by another instruction, making the latter read the new (incorrect) value.
Does not occur in static pipelines
Read After Read
Not a hazard!
Dynamic Scheduling
Summary: data dependence & structural dependence
Executing instructions in non-sequential order
Example:
L    R1, 0(R2)
SUB  R1, R1, R3
BEQZ R1, Same
S    R1, 0(R2)
JUMP End
Same: S R1, 0(R4)
Control Dependence
if p2 {
    S2;
}
In the compiled code, an instruction inside the guarded region (e.g. an ADD R1, R2, R3 at label L1, reached only past the branch or a RET) is control dependent on the branch and cannot be moved ahead of it.
Two Constraints
[Figure: IF and ID feed four parallel EX units — the integer unit, FP/integer multiplier (FP MUL), FP adder, and FP/integer divider (FP DIV) — followed by MEM and WB; instructions enter at IF and leave at WB]
5-Stage Pipeline with pipelined FP units
[Figure: the same pipeline with the FP functional units themselves pipelined]
5-Stage Pipeline with pipelined multiple FP units
[Figure: the same pipeline with multiple (replicated) functional units, including additional integer units]
Hazards in Longer Latency Pipelines
Because the divider is not fully pipelined, structural hazards can occur.
Because instructions have varying running times, more than one register write can be required in a cycle.
Instructions can complete in a different order than they were issued:
WAW hazards become possible
WAR hazards remain less likely
Stalls due to RAW hazards become more frequent because of the longer latencies
Dynamic Scheduling Stages
Dynamic scheduling implies out-of-order
execution
To allow this, we split the ID stage of the 5-stage pipeline into two stages:
Issue: decode instructions, check for structural hazards.
Read operands: wait until no data hazards, then read operands.
[Figure: pipeline IF → Issue → Read Operands (the two halves of ID) → EX → MEM → WB; instructions enter at IF and leave at WB]
Variations and other characteristics
Scoreboard content
Instruction Status
Indicates the current stage for an
instruction
Functional Unit Status
Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
Register Result Status
Indicates which functional unit will write
each register
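A minimal C sketch of the three scoreboard tables (field names follow the slide; array sizes and types are assumptions):

#include <stdbool.h>

enum stage { ISSUE, READ_OPERANDS, EXECUTE, WRITE_RESULT };

/* Functional-unit status: one row per functional unit. */
struct fu_status {
    bool busy;        /* unit busy? */
    int  op;          /* operation to perform */
    int  fi, fj, fk;  /* destination and source register numbers */
    int  qj, qk;      /* units producing Fj/Fk (-1 if value ready) */
    bool rj, rk;      /* Fj/Fk ready and not yet read? */
};

struct scoreboard {
    enum stage instr_status[64];  /* current stage of each instruction */
    struct fu_status fu[4];       /* functional-unit status */
    int reg_result[32];           /* unit that will write each register, -1 if none */
};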
Scoreboard Building Blocks
[Figure: the scoreboard exchanges control/status information with the registers and the functional units (FUs)]
Tomasulo’s Approach to Dynamic
Scheduling
Addresses WAR and WAW hazards
Based on renaming of registers
[Figure: Tomasulo organization — an instruction queue feeds load-store operations (through an address unit with load and store buffers) and floating-point operations (through reservation stations, numbered 1–3 for the FP adder and 1–2 for the multiplier); operand buses, an operation bus, and a common data/address bus connect them to the FP registers and memory]
Instruction Status
Indicates the current stage for an
instruction
Reservation Station Status
Busy, Op, Vj, Vk, Qj, Qk, A
Register Status
Indicates which functional unit will write to
each register
Show what information tables look like
for the following code sequence when
only the first load has completed and
written its result:
1. L.D F6,34(R2)
2. L.D F2,45(R3)
3. MUL.D F0,F2,F4
4. SUB.D F8,F2,F6
5. DIV.D F10,F0,F6
6. ADD.D F6,F8,F2
Example
Instruction status
Instruction        Issue   Execute   Write Result
L.D  F6,34(R2)     √       √         √
L.D  F2,45(R3)     √       √
MUL.D F0,F2,F4     √
SUB.D F8,F2,F6     √
DIV.D F10,F0,F6    √
ADD.D F6,F8,F2     √
Example
Reservation stations
Name    Busy   Op     Vj   Vk                   Qj      Qk      A
Load1   no
Load2   yes    Load                                             45 + Regs[R3]
Add1    yes    SUB          Mem[34 + Regs[R2]]  Load2
Add2    yes    ADD                               Add1    Load2
Add3    no
Mult1   yes    MUL          Regs[F4]             Load2
Mult2   yes    DIV          Mem[34 + Regs[R2]]   Mult1
Example
Register status
Field   F0      F2      F4   F6     F8     F10
Qi      Mult1   Load2        Add2   Add1   Mult2
Steps in Algorithm
To understand the full power of eliminating WAW and WAR hazards through dynamic renaming of registers, we must look at a loop. Consider the following simple sequence for multiplying the elements of an array by a scalar in F2:
LOOP: L.D    F0,0(R1)
      MUL.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,LOOP   ; branches if R1 ≠ R2
Steps in Algorithm
Instruction state    Wait until         Action or bookkeeping
Issue:
FP operation         Station r empty
  if (RegisterStat[rs].Qi ≠ 0)
      {RS[r].Qj ← RegisterStat[rs].Qi}
  else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0};
  if (RegisterStat[rt].Qi ≠ 0)
      {RS[r].Qk ← RegisterStat[rt].Qi}
  else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0};
  RS[r].Busy ← yes; RegisterStat[rd].Qi ← r;
Load or store        Buffer r empty
  if (RegisterStat[rs].Qi ≠ 0)
      {RS[r].Qj ← RegisterStat[rs].Qi}
  else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0};
  RS[r].A ← imm; RS[r].Busy ← yes;
Dynamic Branch Prediction
[Figure: 2-bit prediction scheme — states 11 and 10 predict taken, 01 and 00 predict untaken; a taken branch moves the counter toward 11, a not-taken branch toward 00, so two consecutive mispredictions are needed to flip the prediction. A prediction buffer indexed by branch address (baddr1, baddr3, …) holds the counters; simple 1-bit prediction reaches roughly 80–90% accuracy]
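A minimal C sketch of the 2-bit saturating counter in the figure (the buffer size and indexing are assumptions):

/* States: 3 (11) and 2 (10) predict taken; 1 (01) and 0 (00) predict untaken. */
typedef unsigned char counter2;

int predict_taken(counter2 c) { return c >= 2; }

counter2 train(counter2 c, int taken) {
    if (taken)
        return c < 3 ? c + 1 : 3;   /* move toward 11 */
    return c > 0 ? c - 1 : 0;       /* move toward 00 */
}

/* Prediction buffer indexed by low-order branch-address bits: */
counter2 buffer[4096];
/* predict: predict_taken(buffer[pc & 4095])
   update:  buffer[pc & 4095] = train(buffer[pc & 4095], outcome); */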
Static Branch Prediction
Useful when branch behavior is
highly predictable at compile time
It can also assist dynamic predictors
Loop unrolling is an example of this
Branch Causing 3-Cycle Stall
                       Time →
Branch instruction     IF  ID  EX  MEM  WB
Branch successor           IF  IF  IF   IF   ID  EX
Branch successor + 1                        IF   ID
Branch successor + 2                             IF
Freeze or Flush
Assume that every branch is not taken
Assume that every branch is taken
Delayed branch
Delayed Branch
Code segment
branch instruction
sequential successor
…
Loop: branch target if taken
Branch delay slot
Instruction in branch delay slot is always
executed
Compiler can schedule the instruction(s)
for branch delay slot so that its execution
is useful even when the prediction is
wrong
Branch Delay of 1
Not taken:
Untaken branch instruction      IF  ID  EX  MEM  WB
Branch delay instruction j+1        IF  ID  EX   MEM  WB
Instruction j+2                         IF  ID   EX   MEM  WB
Instruction j+3                             IF   ID   EX   MEM  WB
Instruction j+4                                  IF   ID   EX   MEM  WB
Taken:
Taken branch instruction        IF  ID  EX  MEM  WB
Branch delay instruction j+1        IF  ID  EX   MEM  WB
Branch target                           IF  ID   EX   MEM  WB
Branch target + 1                           IF   ID   EX   MEM  WB
Branch target + 2                                IF   ID   EX   MEM  WB
Scheduling the branch delay slot
[Figure: three sources for the delay-slot instruction —
(a) From before the branch: DADD R1, R2, R3 moves into the slot after "if R2 = 0 then" (the best choice);
(b) From the target: DSUB R4, R5, R6 is copied from the taken path into the slot after "if R1 = 0 then";
(c) From fall-through: OR R7, R8, R9 from the not-taken path moves into the slot, with DSUB R4, R5, R6 on the taken path]
Branch Prediction – Correlating Prediction
Prediction based on correlation between branches. Source code:
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { ... }
Compiled code (aa in R1, bb in R2):
      SUB  R3, R1, #2
      BNEZ R3, L1        ; branch b1 (aa != 2)
      ADD  R1, R0, R0    ; aa = 0
L1:   SUB  R3, R2, #2
      BNEZ R3, L2        ; branch b2 (bb != 2)
      ADD  R2, R0, R0    ; bb = 0
L2:   SUB  R3, R1, R2    ; R3 = aa - bb
      BEQZ R3, L3        ; branch b3 (aa == bb)
If both the first and second branches are not taken, then the last branch will be taken.
ILP With Scheduling
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Scheduling Techniques
Local Scheduling
Global Scheduling
Example – VLIW slots for an unrolled loop
Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Integer operation/branch
L.D F0,0(R1)         L.D F6,-8(R1)
L.D F10,-16(R1)      L.D F14,-24(R1)
L.D F18,-32(R1)      L.D F22,-40(R1)      ADD.D F4,F0,F2     ADD.D F8,F6,F2
L.D F26,-48(R1)                           ADD.D F12,F10,F2   ADD.D F16,F14,F2
                                          ADD.D F20,F18,F2   ADD.D F24,F22,F2
S.D F4,0(R1)         S.D F8,-8(R1)        ADD.D F28,F26,F2
S.D F12,-16(R1)      S.D F16,-24(R1)                                            DADDUI R1,R1,#-56
S.D F20,24(R1)       S.D F24,16(R1)
S.D F28,8(R1)                                                                   BNE R1,R2,Loop
Problems of VLIW Model
They are:
Technical problems
Logistical Problems
Technical Problems
Increase in code size and the limitations of lockstep.
Two different elements combine to increase code
size substantially for a VLIW.
Generating enough operations in a straight-line
code fragment requires ambitiously unrolling
loops, thereby increasing code size.
Whenever instructions are not full, the unused
functional units translate to wasted bits in the
instruction encoding.
Logistical Problems And Solution
Logistical Problems
Binary code compatibility has been a major problem.
The different numbers of functional units and unit latencies require
different versions of the code.
Migration problems
Solution
Approach:
Use object-code translation or emulation. This technology is developing quickly and could play a significant role in future migration schemes.
To temper the strictness of the approach so that binary compatibility is still feasible.
Advantages of multiple-issue versus
vector processor
Twofold.
A multiple-issue processor has the potential
to extract some amount of parallelism from
less regularly structured code.
It has the ability to use a more conventional,
and typically less expensive, cache-based
memory system.
Advanced compiler support for exposing and exploiting ILP
x[i] = x[i] + s;
Example 1
Consider a loop:
for (i= 1; i <= 100; i = i + 1)
{
A [ i + 1] = A [ i ] + C [ i ] ;
B [ i + 1] = B [ i ] + A [ i + 1 ] ;
}
Loop-carried dependence : execution of
an instance of a loop requires the
execution of a previous instance.
Example 2
Consider a loop:
for (i= 1; i <= 100; i = i + 1)
{
A [ i ] = A [ i ] + B [ i ] ; /* S1 */
B [ i + 1] = C [ i ] + D [ i ] ;/* S2 */
}
…
At iteration 99:  A[99] = A[99] + B[99];   B[100] = C[99] + D[99];
Three tasks
Good scheduling of code
Determining which loops might contain
parallelism
Eliminating name dependences
Finding Dependence
A dependence exists if two conditions hold:
There are two iteration indices, j and k , both
within the limits of the for loop. That is, m<=j
<= n, m<=k <= n.
The loop stores into an array element
indexed by a*j+b and later fetches from the
same array element when it is indexed by
c*k+d. That is, a*j+b=c*k+d.
Example 1
Use the GCD test to determine whether
dependences exist in the loop:
for ( i=1 ; i<= 100 ; i = i + 1)
{
X[2*i+3]=X[2*i ]*5.0;
}
a = 2, b = 3, c = 2, d = 0
Since GCD(a, c) = 2 does not divide d − b = −3, no dependence is possible.
GCD – Greatest Common Divisor
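The test itself is one line of code; a sketch in C (the function names are illustrative):

/* GCD test: a store to x[a*j + b] and a later load from x[c*k + d]
   can touch the same element only if gcd(a, c) divides (d - b). */
int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

int dependence_possible(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}

/* For the loop above: dependence_possible(2, 3, 2, 0)
   -> (0 - 3) % 2 != 0 -> returns 0 (no dependence). */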
Example 2
The loop has multiple types of dependences. Find all the true dependences, output dependences, and antidependences, and eliminate the output dependences and antidependences by renaming.
for ( i = 1 ; i <= 100 ; i = i + 1 )
{
y [ i ] = x [ i ] / c ; /* S1 */
x [ i ] = x [ i ] + c ; /* S2 */
z [ i ] = y [ i ] + c ; /* S3 */
y [ i ] = c - y [ i ] ; /* S4 */
}
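One standard renaming, as a sketch (T and X1 are fresh arrays introduced for illustration; the true dependences through S1 remain):

for ( i = 1 ; i <= 100 ; i = i + 1 )
{
    T [ i ] = x [ i ] / c ;   /* S1: y renamed to T, removing the output dependence with S4 */
    X1[ i ] = x [ i ] + c ;   /* S2: x renamed to X1, removing the antidependence with S1 */
    z [ i ] = T [ i ] + c ;   /* S3 */
    y [ i ] = c - T [ i ] ;   /* S4 */
}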
Dependence Analysis
There are a wide variety of situations in which array-oriented dependence analysis cannot tell us what we might want to know, including:
When objects are referenced via pointers rather than array
indices .
When array indexing is indirect through another array, which
happens with many representations of sparse arrays.
When a dependence may exist for some value of the inputs, but
does not exist in actuality when the code is run since the inputs
never take on those values.
When an optimization depends on knowing more than just the possibility of a dependence, and needs to know on which write of a variable a read of that variable depends.
Basic Approach used in points-to
analysis
The basic approach used in points-to analysis relies on
information from three major sources:
Type information, which restricts what a pointer can point to.
Information derived when an object is allocated or when the
address of an object is taken, which can be used to restrict
what a pointer can point to.
For example if p always points to an object allocated in a
given source line and q never points to that object, then p
and q can never point to the same object.
Information derived from pointer assignments.
For example , if p may be assigned the value of q, then p
may point to anything q points to.
Analyzing pointers
There are several cases where analyzing pointers has been
successfully applied and is extremely useful:
When pointers are used to pass the address of an object as a
parameter, it is possible to use points-to analysis to determine
the possible set of objects referenced by a pointer. One
important use is to determine if two pointer parameters may
designate the same object.
When a pointer can point to one of several types, it is sometimes possible to determine the type of the data object that the pointer designates at different parts of the program.
It is often possible to separate out pointers that may only point to a local object from those that may point to a global one.
Different types of limitations
There are two different types of limitations that affect
our ability to do accurate dependence analysis for
large programs.
Limitations arising from restrictions in the analysis
Copy propagation example: a pair of dependent adds such as
  DADDUI R1,R2, #4
  DADDUI R1,R1, #4
collapses into
  DADDUI R1,R2, #8
Example
Consider the code sequences:
ADD R1,R2,R3
ADD R4,R1,R6
ADD R8,R4,R7
Notice that this sequence requires at least three
execution cycles, since all the instructions depend on the
immediate predecessor. By taking advantage of
associativity, transform the code and rewrite:
ADD R1,R2,R3
ADD R4,R6,R7
ADD R8,R1,R4
This sequence can be computed in two execution cycles.
When loop unrolling is used , opportunities for these types
of optimizations occur frequently.
Recurrences
Recurrences are expressions whose value on
one iteration is given by a function that depends
on the previous iterations.
When a loop with a recurrence is unrolled, we
may be able to algebraically optimize the
unrolled loop, so that the recurrence need only be evaluated once per unrolled iteration.
Types of recurrences
Sum = Sum + X;
Unrolled recurrence, unoptimized (5 dependent instructions):
Sum = Sum + X1 + X2 + X3 + X4 + X5;
Optimized, with 3 dependent operations (tree-height reduction):
Sum = ((Sum + X1) + (X2 + X3)) + (X4 + X5);
Reading Assignment
Trace Scheduling
Software Pipelining versus Loop Unrolling
Software Pipelining
[Figure: (a) software pipelining — the number of overlapped operations stays roughly constant over time, with overlap only between iterations at start-up and wind-down; (b) loop unrolling — the overlap grows in proportion to the number of unrolls and repeats for each unrolled body]
Software Pipelining Difficulty
In practice, compilation using software pipelining is quite difficult for several reasons:
Many loops require significant transformation before they can be software pipelined, the trade-offs in terms of overhead versus efficiency of the software-pipelined loop are complex, and the issue of register management creates additional complexities.
To help deal with the last two of these issues, the IA-64 added extensive hardware support for software pipelining.
Although this hardware can make it more efficient to apply software pipelining, it does not eliminate the need for complex compiler support, or the need to make difficult decisions about the best way to compile a loop.
Global Code Scheduling
Global code scheduling aims to compact a code
fragment with internal control structure into the
shortest possible sequence that preserves the
data and control dependence .
Data dependence:
Forces a partial order on operations; dependent operations cannot easily be moved.
Control dependences arising from loop branches are reduced by unrolling.
[Figure: control flow — test A[i] == 0?; the true path assigns B[i], the false path executes X; both paths rejoin to assign C[i]]
LD    R4,0(R1)      ; load A
LD    R5,0(R2)      ; load B
DADDU R4,R4,R5      ; add to A
SD    R4,0(R1)      ; store A
….
BNEZ  R4,elsepart   ; test A
….                  ; then part
SD    ….,0(R2)      ; stores to B
….
J     join          ; jump over else
elsepart: ….        ; else part
X                   ; code for X
….
join: ….            ; after if
SD    ….,0(R3)      ; store C[i]
Factors for the Compiler
Consider the factors that the compiler would have to consider in moving the computation and assignment of B:
What are the relative execution frequencies of the then part and the else part?
[Figure: trace selection — the test A[i] == 0? with the assignments to B[i] and C[i] on the frequent path forms the trace; the infrequent path becomes a trace exit]
Drawback of trace scheduling
[Figure: superblock formation with tail duplication — the test A[i] == 0? and the assignments to B[i] and C[i] form a superblock with a single entry and superblock exits; the off-trace code (A[i] = A[i] + B[i] and X) is reached through duplicated tails, shown for n = 2 and n = 1]
Exploiting ILP with Conditional Instructions
If the condition is true, the instruction is executed normally;
if the condition is false, the instruction becomes a no-op.
Example:
Consider the following code:
if (A == 0) { S = T; }
Assuming that registers R1, R2, and R3 hold the values of A, S, and T respectively, show the code for this statement with a branch and with a conditional move.
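A C-level sketch of the two alternatives (the MIPS versions would use a branch such as BNEZ versus a conditional move such as CMOVZ; the snippet below only illustrates the transformation):

/* With a branch: skip the assignment when the condition fails. */
if (A == 0)
    S = T;

/* Branchless, as a conditional move computes it:
   always produce a value, keeping the old one when the condition fails. */
S = (A == 0) ? T : S;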
Conditional and Predictions
Conditional moves are the simplest form of conditional or
predicted instructions, and although useful for short
sequences, have limitations.
In particular, using conditional moves to eliminate branches that guard the execution of large blocks of code can be inefficient, since many conditional moves may need to be introduced.
To remedy the inefficiency of using conditional moves,
some architectures support full predication, whereby the
execution of all instructions is controlled by a predicate.
When the predicate is false, the instruction becomes a no-
op. Full predication allows us to simply convert large
blocks of code that are branch dependent.
Predicated instructions can also be used to speculatively
move an instruction that is time critical, but may cause an
exception if moved before a guarding branch.
Example:
Here is a code sequence for a two-issue superscalar
that can issue a combination of one memory
reference and one ALU operation, or a branch by
itself, every cycle:
First instruction      Second instruction
LW   R1,40(R2)         ADD R3,R4,R5
ADD  R6,R3,R7
BEQZ R10,L
LW   R8,0(R10)
LW   R9,0(R8)
a. True
b. False
Entry Quiz
a. Flip-Flop
b. Magnetic core
c. Capacitor
d. Non-volatile Technology
Entry Quiz
a. 3
b. 4
c. 1000
d. 10000
Entry Quiz
4. Virtual Memory is
a. Same as caching
b. Same as associative memory
c. Different from caching
d. Same as disk memory
Entry Quiz
Memory
Memory Requirements – Server, Desktop, and Embedded Devices
Server:    lower access time, higher bandwidth*, larger memory, better protection*
Desktop:   lower access time, larger memory
Embedded:  lower access time, simpler memory*
Processor-Memory Gap
Moore's Law
Transistor density on a chip die doubles roughly every 1.5–2 years.
Short reference:
http://en.wikipedia.org/wiki/Moore's_law
What is Memory Hierarchy and Why?
[Figure: CPU registers ~0.25 ns; main memory ~250 ns; storage & I/O devices behind a bus adapter ~2,500,000 ns]
Memory Hierarchy & Cache
Cache
Cache is a smaller, faster, and more expensive memory.
It improves the throughput/latency of the slower memory next to it in the memory hierarchy.
Blocking reads and delaying writes to the slower memory offers better performance.
There are two cache memories L1 and L2 between
CPU and main memory.
L1 is built into CPU.
L2 is an SRAM.
Data, Instructions, and Addresses are cached.
Cache Operation
Instruction
Example 1
Assume we have a computer where Clock
cycles Per Instruction (CPI) is 1.0 when all
memory accesses are cache hits. The
only data accesses are loads and stores
and these total 50% of the instructions. If
the miss penalty is 25 clock cycles and
miss rate is 2%, how much faster would
the computer be if all instructions were
cache hits?
Example 1 …
Memory accesses per instruction = 1 + 0.5 = 1.5
Memory stall cycles per instruction = 1.5 × 0.02 × 25 = 0.75
CPI with misses = 1.0 + 0.75 = 1.75
Ratio = (IC × 1.75 × Cycle Time) / (IC × 1.0 × Cycle Time) = 1.75
[Figure: CPU, a cache of C = 16 lines each holding one block (K words), and a main memory of 2^N − 1 addressable words organized into blocks]
Elements of Cache Design
Cache Size
Block Size
Mapping Function
Replacement Algorithm
Write Policy
Write Miss
Number of caches
Split versus Unified/Mixed Cache
Mapping Functions
[Figure: direct mapping — the CPU address splits into Tag, Line, and Word; (1) the Line field selects a cache line, (2) the line's stored tag is copied and (3) compared with the address Tag; (4) a match is a hit, a mismatch a miss; (5) on a miss, the block (of blocks 0 … n−1) is loaded from main memory]
Mapping Function
Direct
The Line value in the address uniquely points to one line in the cache: 1 tag comparison.
Set Associative
The Line value in the address points to a set of lines in the cache (typically 2/4/8, so 2/4/8 tag comparisons). This is known as 2/4/8-way set associative.
Associative
There is no Line field; a block can go to any line in the cache (m tag comparisons, e.g. 4).
Uses Content Addressable Memory (CAM) for comparison.
Needs a non-trivial replacement algorithm.
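A minimal C sketch of how an address is split for a direct-mapped cache (the 4-word blocks and 4-line cache are assumptions for illustration):

#include <stdint.h>

#define WORD_BITS 2   /* 4 words per block */
#define LINE_BITS 2   /* 4 lines in the cache */

struct parts { uint32_t tag, line, word; };

struct parts split_address(uint32_t addr) {
    struct parts p;
    p.word = addr & ((1u << WORD_BITS) - 1);
    p.line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    p.tag  = addr >> (WORD_BITS + LINE_BITS);
    return p;
}
/* Hit test: line p.line holds the block iff its stored tag equals p.tag.
   For set-associative mapping, p.line selects a set and 2/4/8 tags are compared. */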
Mapping Function Comparison
Write back
Information written only to cache. Content of the
cache is written to the main memory only when this
cache block is replaced or the program terminates.
Improving Cache Performance
Reducing penalty
Reducing misses
Compiler optimizations that attempt to reduce cache misses fall under this category.
Stall Cycles = IC × (Memory Accesses / Instruction) × Miss Rate × Miss Penalty
Improving Cache Performance Using
Compiler
Compilers are built with the following optimization:
Instructions:
Reordering instructions to avoid conflict misses
Data :
Merging Arrays
Loop Interchange
Loop Fusion
Blocking
Merging Arrays
A sketch of the idea (assuming two parallel arrays val and key that conflict in the cache):
/* Conflicting: two separate arrays */
int val[SIZE];
int key[SIZE];
/* Instead: one array of structures */
struct merge { int val; int key; } merged_array[SIZE];
Memory-reference sequence from the accompanying example:
WriteMem[100]; WriteMem[100]; ReadMem[200]; WriteMem[200]; WriteMem[100];
Exit Quiz
1. In memory hierarchy top layer is occupied by
the
a. EPROM
b. DRAM
c. SRAM
d. Flash
Reducing Miss Penalty – Techniques
1. Multilevel caches
2. Critical word first & early restart
3. Priority for Read misses over Write
4. Merging Write Buffers
5. Victim Cache
Reducing Miss Penalty – Multilevel Caches (1)
[Figure: main memory (100 ns, 128 MB) alone, versus adding L2 (10 ns, 512 KB) and L1 (2 ns, 16 KB)]
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
Obviously!
Reducing Cache Misses – Higher
Associativity (3)
Reducing Hit Time – Small and Simple
Cache (1)
Addressing:
Normal: VA → physical address → cache
Alternative: skip the translation levels; the VA maps directly to the cache
Problems:
No page-boundary checks
Building a direct mapping between VA and cache for every process is not easy
Reducing Hit Time – Pipelined Cache (3)
Multi-level Cache
Avg Access Time = Hit Time_L1 + Miss Rate_L1 × Penalty_L1
Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Penalty_L2
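The two formulas combined, as a C sketch (the example numbers in the comment are assumptions):

/* Two-level average memory access time, all times in clock cycles. */
double avg_access_time(double hit_l1, double miss_rate_l1,
                       double hit_l2, double miss_rate_l2,
                       double penalty_l2) {
    double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;
    return hit_l1 + miss_rate_l1 * penalty_l1;
}
/* e.g. avg_access_time(1, 0.05, 10, 0.02, 100) = 1 + 0.05 * 12 = 1.6 CC */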
Addressing
[Figure: two-way set-associative mapping — the CPU address splits into Tag, Set, and Word; the Set field selects a set of two lines, both stored tags are compared with the address Tag, a match is a hit and the word is loaded; a miss goes to main memory (blocks 0 … 127, addresses 0 … 511)]
Assignment I – Due same day next
week
Mapping functions
Replacement algorithms
Write policies
Write Miss policies
Split Cache versus Unified Cache
Primary Cache versus Secondary Cache
Compiler cache optimization
techniques with examples
Assignment II - Due same day next
week
Multilevel Cache
Cache Inclusion/Exclusion Property
Thumb rules of cache
Compiler pre-fetch
Multi-level Caching + one another Miss
Penalty Optimization technique
Two miss Rate Optimization Techniques
Assignment III - Due 2nd class of next
week
All odd numbered problems from
cache module of your text book.
Assignment IV - Due 2nd class of next
week
All even numbered problems from
cache module of your text book.
CPU Execution Time & Average Access
Time
[Figure: CPU – cache (1 CC) – main memory (100 CC)]
With Multi-level Cache
[Figure: CPU – L1 (1 CC) – L2 (10 CC) – main memory (100 CC)]
Memory Hierarchy
Main Memory
Main Memory
Module Objective
To understand Main memory latency
and bandwidth
Techniques to improve latency and
bandwidth
Memory Hierarchy & Cache
Main Memory – Cache – I/O
[Figure: CPU and cache connected to main memory (~250 ns) over separate data and address buses; storage & I/O devices (~2,500,000 ns) behind a bus adapter]
Sending the address: 4 CC; DRAM access: 56 CC; transferring the data: 4 CC (CC – clock cycle)
Access time per word = 4 + 56 + 4 = 64 CC
One word is 8 bytes; latency is 1 bit/CC
Improving Memory Performance
Improving Latency ( time to access
1 memory unit - word)
Improving bandwidth (bytes
accessed in unit time)
Improving Memory Bandwidth
[Figure: three organizations — (a) simple design: CPU, 64-bit bus, cache, one memory; (b) wider bus between cache and memory; (c) interleaved: four banks (Bank 0–Bank 3) holding words 0–15 round-robin; cache block = 4 words, one word = 8 bytes]
Bandwidth, Latency, Penalty
Interleaving Factor
Address allocation with Interleaving
Problem 1
Interleaving factor
What is the optimal interleaving factor if the memory cycle is 8 CC (1 + 6 + 1)? Assume 4 banks (b1–b4).
[Figure: timeline over cycles 1–20 showing the 1st through 4th read requests staggered across the four banks]
Problem 2 (page 452 in your text
book)
Block size = 1 word (simple memory)
Memory bus size = 1 word
Miss rate = 3%
Memory accesses per instruction = 1.2
Cache miss penalty = 64 CC
Avg cycles per instruction = 2
[Figure: virtual memory — virtual addresses mapped onto physical main memory and disk]
Virtual Memory versus First-Level Cache
Paging versus segmentation:
Efficient disk traffic — paging: yes (easy to tune the page size); segmentation: not always (small segments)
CPU Execution Time
CPU Execution Time = (CPU clock cycles + stall cycles) × clock cycle time
Multi-level Cache
Avg Access Time = Hit TimeL1 + Miss
RateL1 X PenaltyL1
SCSI …
[Figure: host I/O bus → host adapter → SCSI bus with disk controllers, individual disks, and a terminator (T)]
Reference: http://www.stjulians.com/cs/diskstoragenotes.html
Components of Disk Access Time
Seek Time
Rotational Latency
Internal Transfer Time
Other Delays
That is,
Avg Access Time =
Avg Seek Time + Avg Rotational Delay +
Transfer Time + Other overhead
Problem (page 684)
Seek Time = 5 ms/100 tracks
RPM = 10000 RPM
Transfer rate 40 MB/sec
Other Delays = 0.1 ms
Sector Size = 2 KB
[Figure: RAID 0 — logical blocks 01–12 striped across the physical disks; RAID 0+1 — two RAID 0 sets of n disks each, mirrored]
Performance RAID-0+1
Let the reliability of a RAID 0 sub-tree be R′.
Then the reliability of the RAID 1 tree is R = 1 − (1 − R′)(1 − R′).
R′ = r² (where the reliability of a single disk is r).
Throughput is the same as RAID-0, but using 2 × n disks.
Utilization is lower than RAID-0 due to mirroring.
"Write" is marginally slower due to atomicity.
When r = 0.9: R′ = 0.81, and R = 1 − (0.19)² ≈ 0.96.
RAID 1+0 – Mirroring & Striping
[Figure: RAID 1+0 — a RAID 0 stripe across RAID 1 mirrored pairs; each block (01–06) is stored on two disks]
Performance RAID-1+0
Let the reliability of a RAID 1 sub-tree be R′.
Then the reliability of the RAID 0 tree is R = (R′)².
R′ = 1 − (1 − r)² (where the reliability of a single disk is r).
Throughput is the same as RAID-0, but using 2 × n disks.
Utilization is lower than RAID-0 due to mirroring.
"Write" is marginally slower due to its atomicity.
When r = 0.9: R′ = 0.99, and R = (0.99)² ≈ 0.98.
RAID-2 Hamming Code Arrays
Low commercial interest due to
complex nature of Hamming code
computation.
RAID-3: Striping with Parity
[Figure: logical blocks 01–12 striped as single bits or words across four data disks, with a dedicated parity disk holding one parity block (P0–P2) per stripe]
RAID-3 Operation
Based on the principle of a reversible form of parity computation,
where parity P = C0 ⊕ C1 ⊕ … ⊕ Cn−1 ⊕ Cn,
and a missing stripe Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm−1 ⊕ Cm+1 ⊕ … ⊕ Cn−1 ⊕ Cn
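A minimal C sketch of the parity computation and recovery (byte-wide XOR for illustration):

#include <stddef.h>

/* P = C0 ^ C1 ^ ... ^ Cn: XOR of all data stripes. */
unsigned char parity(const unsigned char *c, size_t n) {
    unsigned char p = 0;
    for (size_t i = 0; i < n; i++)
        p ^= c[i];
    return p;
}

/* A missing stripe is the XOR of the parity with all surviving stripes. */
unsigned char recover(const unsigned char *survivors, size_t n, unsigned char p) {
    unsigned char m = p;
    for (size_t i = 0; i < n; i++)
        m ^= survivors[i];
    return m;
}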
RAID-3 Performance
RAID-1's 1-to-1 redundancy is replaced by one parity disk per n data disks, so RAID-3 is less expensive than RAID-1.
The rest of the performance is similar to RAID-0.
RAID-3 can withstand the failure of one of its disks.
Reliability = P(all disks working) + P(exactly one failed)
            = rⁿ + n · rⁿ⁻¹ · (1 − r)
When r = 0.9 and n = 5:
            = 0.9⁵ + 5 × 0.9⁴ × (1 − 0.9)
            = 0.59 + 5 × 0.656 × 0.1
            = 0.59 + 0.33 ≈ 0.92
RAID-4 Performance
Similar to RAID-3, but supports
larger chunks.
Performance measures are similar
to RAID-3.
RAID-5 (Distributed Parity)
[Figure: logical blocks 01–12 striped in chunks; the parity blocks P0, P1, P2 are rotated across all the disks instead of living on a dedicated parity disk]
RAID-5 Performance
In RAID-3 and RAID-4, Parity Disk is
a bottleneck for Write operations.
This issue is addressed in RAID-5.
Reading Assignment
Optical Disk
RAID study material
Worked out problems from the text
book: p 452, p 537, p 539, p561, p
684, p 691, p 704
Amdahl’s Speedup Multiprocessors
Speedup = 1 / (Fraction_e / Speedup_e + (1 − Fraction_e))
Speedup_e – number of processors
Fraction_e – fraction of the program that runs in parallel on the Speedup_e processors
Assumption: the program runs either in fully parallel (enhanced) mode, making use of all the processors, or in non-enhanced mode.
Multiprocessor Architectures
Single Instruction Stream, Single Data Stream (SISD):
Single Instruction Stream, Multiple Data Stream (SIMD)
Multiple Instruction Streams, Single Data Stream (MISD)
Multiple Instruction Streams, Multiple Data Streams (MIMD)
Classification of MIMD Architectures
Shared Memory:
Centralized shared memory architecture OR
Symmetric (shared-memory) Multiprocessors (SMP)
OR Uniform Memory Access (UMA).
Distributed shared memory architecture OR Non-
Uniform Memory Access (NUMA) architecture.
Message Passing:
Multiprocessor Systems based on messaging
Shared Memory versus Message
Passing
[Figure: processors, each with a private cache, connected through an interconnection network — either to shared memory or to each other via messages]
Symmetric versus Distributed
Memory MP
SMP: Uses shared memory for Inter-process Communication
Advantage:
Close coupling due to shared memory
Sharing of data is faster between processors
Disadvantage
Scaling: Memory is a bottleneck
High and unpredictable Memory Latency
2. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
Coherency and Consistency
Coherency and consistency are complementary.
[Figure: an SMP — processors with caches and cache controllers (cc) on a shared bus, together with main memory and I/O systems]
Cache Coherent (CC) Protocols
CC protocol implement cache coherency
Two types:
Snooping (Replicated): There are multiple copies of
the sharing status. Every cache that has a copy of the
data from a block of physical memory also has a copy
of the sharing status of the block, and no centralized
state is kept.
Directory based (logically centralized): There is only
one copy of the sharing status of a block of physical
memory. All the processors use this one copy. This
copy could be in any of the participating processors.
Snooping Protocol
[Figure: the same bus-based SMP; every cache controller (cc) snoops the shared bus]
Invalidation in the Snooping Protocol
Dirty bit: the dirty bit is set to FALSE when a block is loaded into a cache. It is set to TRUE when the block is updated for the first time. When another processor wants to load this block, the block is migrated from this cache instead of being loaded from memory.
Summary of Snooping Mechanism
Request     Source     State of addressed cache block   Function and explanation
Read hit    processor  shared or exclusive   Read data in cache
Read miss   processor  invalid               Place read miss on bus
Read miss   processor  shared                Address conflict miss: place read miss on bus
Read miss   processor  exclusive             Address conflict miss: write back block, then place read miss on bus
Write hit   processor  exclusive             Write data in cache
Write hit   processor  shared                Place write miss on bus
Write miss  processor  invalid               Place write miss on bus
Write miss  processor  shared                Address conflict miss: place write miss on bus
Write miss  processor  exclusive             Address conflict miss: write back block, then place write miss on bus
Read miss   bus        shared                No action; allow memory to service read miss
Read miss   bus        exclusive             Attempt to share data: place cache block on bus and change state to shared
Write miss  bus        shared                Attempt to write shared block; invalidate the block
Write miss  bus        exclusive             Attempt to write block that is exclusive elsewhere: write back the cache block and make its state invalid
State Transition
[Figure: cache state transitions based on requests from the CPU — states invalid, shared (read only), and exclusive (read/write); a CPU read miss places a read miss on the bus, a CPU write to a shared or invalid block places a write miss on the bus and moves the block to exclusive, and replacing a dirty block forces a write-back]
[Figure: cache state transitions based on requests from the bus — a read miss for an exclusive block forces a write-back (aborting the memory access) and a move to shared; a write miss for this block invalidates it]
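A minimal C sketch of the transitions in the two diagrams (bus actions are reduced to printed messages; this is an illustration, not a complete protocol):

#include <stdio.h>

enum state { INVALID, SHARED, EXCLUSIVE };

/* Transitions driven by this CPU's reads and writes. */
enum state cpu_access(enum state s, int is_write) {
    switch (s) {
    case INVALID:
        puts(is_write ? "place write miss on bus" : "place read miss on bus");
        return is_write ? EXCLUSIVE : SHARED;
    case SHARED:
        if (is_write) { puts("place write miss on bus"); return EXCLUSIVE; }
        return SHARED;                        /* read hit */
    default:
        return EXCLUSIVE;                     /* read or write hit */
    }
}

/* Transitions driven by misses snooped on the bus. */
enum state bus_event(enum state s, int is_write_miss) {
    if (s == EXCLUSIVE) puts("write back block");
    if (is_write_miss) return INVALID;        /* another CPU is writing */
    return (s == INVALID) ? INVALID : SHARED; /* another CPU is reading */
}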
Some Terminologies
Polling: a process periodically checks whether there is a message it needs to handle. This method of awaiting a message is called polling. Polling reduces processor utilization.
Interrupt: a process is notified when a message arrives via a built-in interrupt mechanism. Interrupts increase processor utilization compared to polling.
Synchronous: a process sends a message and waits for the response before sending another message or carrying out other tasks. This way of waiting is referred to as synchronous communication.
Asynchronous: a process sends a message and continues to carry out other tasks while the requested message is processed. This is referred to as asynchronous communication.
Communication Infrastructure
Multiprocessor Systems with shared
memory can have two types of
communication infrastructure:
Shared Bus
Interconnect
Directory Based Protocol
[Figure: nodes 1 … n, each with its own directory, connected by an interconnect; the directory entries are distributed across the nodes]
State of the Block
Shared: One or more processors have the block
cached, and the value in memory is up to date.
Uncached: No processor has a copy of the block.
Exclusive: Exactly one processor has a copy of the
cache block, and it has written the block, so the
memory copy is out of date. The processor is called the
owner of the block.
Local, Remote, Home Node
Local Node: This is the node where the request originates.
Home Node: This is the node where the memory location and the directory entry of an address reside.
Remote Node: A remote node is a node that has a copy of a cache block.
Shared State Operation
Read-miss: The requesting processor is
sent the requested data from memory
and the requestor is added to the sharing
set.
Write-miss: The requesting processor is sent the value. All the processors in the Sharers set are sent invalidate messages, and the Sharers set is updated to contain only the identity of the requesting processor. The state of the block is made exclusive.
Uncached State Operation
Read-miss: The requesting processor is sent
the requested data from memory and the
requestor is made the only sharing node. The
state of the block is made shared.
Write-miss: The requesting processor is sent
the requested data and becomes the sharing
node. The block is made exclusive to indicate
that the only valid copy is cached. Sharer
indicates the identity of the owner.
Exclusive State Operation
Read-miss: The owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
Data write back: The owner processor is replacing the block and therefore must
write it back. This write back makes the memory copy up to date (the home
directory essentially becomes the owner), the block is now uncached and the
Sharer is empty.
Write-miss: The block has a new owner. A message is sent to the old owner
causing the cache to invalidate the block and send the value to the directory, from
which it is sent to the requesting processor, which becomes the new owner. Sharer
is set to the identity of the new owner, and the state of the block remains exclusive.
Cache State Transition
[Figure: directory-side state diagram for a block — uncached, shared (read only), and exclusive (read/write); a read miss triggers a data value reply and adds the requester to Sharers (Sharers += {P}); a write miss sends invalidates to the sharers, sets Sharers = {P}, and makes the block exclusive; fetch/invalidate and data write-back messages move an exclusive block back to shared or uncached]
P1 and P2 interleaved on one processor (lock acquire: spin until X == 0, then WRITE 1):
P1:  TOP: READ X
     BNEZ TOP
       … P2 is scheduled …
P2:  TOP: READ X
     BNEZ TOP
     WRITE 1
       … P1 is scheduled …
P1:  WRITE 1
Both processes read X == 0 and both enter the critical section.
With Interrupt Mask
P1:  MASK
     TOP: READ X
     BNEZ TOP
     WRITE 1
     UNMASK
P2:  MASK
     TOP: READ X
     BNEZ TOP
     WRITE 1
     UNMASK
Masking interrupts keeps the READ–test–WRITE sequence from being interrupted, so on a uniprocessor the race disappears.
Multiprocessor System
Masking the interrupt will not work: P1 and P2 now run on different processors, so disabling interrupts on one processor does not stop the other.
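What does work is an atomic read-modify-write in hardware. A minimal C11 sketch of a spin lock built on atomic exchange (names are illustrative; the slides' separate READ and WRITE are replaced by one indivisible operation):

#include <stdatomic.h>

atomic_int X = 0;   /* 0 = free, 1 = held */

void acquire(atomic_int *lock) {
    /* Read the old value and write 1 in a single indivisible step. */
    while (atomic_exchange(lock, 1) != 0)
        ;           /* spin until the old value was 0 (lock was free) */
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);
}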
[Figure: instructions between mispredictions (0–300) for predicted-taken versus profile-based static prediction across SPEC benchmarks (li, ear, gcc, doduc, mdljdp, su2cor, hydro2d, compress, eqntott, espresso, tomcatv); profile-based prediction generally goes longer between mispredictions]