
Computer Architecture-Performance - Datapath

The document discusses pipelining in computer processors. Pipelining is a technique used to overlap the execution of instructions by breaking down the execution process into stages. This allows subsequent instructions to begin execution before previous ones finish. The document uses an example of doing laundry to illustrate how pipelining can improve throughput.


CMP3010: Computer Architecture

L02: Performance and DataPath

Dina Tantawy
Computer Engineering Department
Cairo University
Agenda
• Recap
• Performance Fallacies and pitfalls
• Real stuff: Benchmarking the Intel Core i7
• What is pipelining?
• Characteristics of pipelining

Computer Engineering, Cairo University


Recap
• Design for Moore's Law
• Abstraction
• Make the common case fast
• Parallelism
• Pipelining
• Prediction
• Hierarchy
• Dependability via redundancy


Computer Performance: TIME, TIME, TIME
• Our focus: user CPU time
• Response Time (execution time, latency)
• Throughput (bandwidth)

• Performance = 1 / Execution Time


• Cycle time (CT) = 1 / Clock rate (CR)
• Clock cycles = IC x CPI (IC = instruction count, CPI = cycles per instruction)
• Execution time = IC x CPI x CT = IC x CPI / CR
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                        Instruction_count   CPI   clock_cycle
Algorithm                       X            X
Programming language            X            X
Compiler                        X            X
ISA                             X            X         X
Core organization                            X         X
Technology                                             X
Component Analysis



Performance Pitfalls


CPI Example
• Suppose we have two implementations of the same instruction set
architecture (ISA).

For some program,

Machine A has a clock cycle time of 250 ps and a CPI of 2.0


Machine B has a clock cycle time of 500 ps and a CPI of 1.2

Which machine is faster for this program, and by how much?

Hint: seconds / program = (cycles / program) x (seconds / cycle)
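The question can be worked through with a short sketch. Both machines implement the same ISA and run the same program, so the instruction count is identical and cancels out; only CPI x cycle time matters. The function name below is illustrative and not part of the slides.

```python
def time_per_instruction_ps(cpi, cycle_time_ps):
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ps

# Values taken from the example above.
time_a = time_per_instruction_ps(2.0, 250)  # Machine A
time_b = time_per_instruction_ps(1.2, 500)  # Machine B

# Same ISA and same program, so the instruction count cancels out.
speedup_a_over_b = time_b / time_a

print(f"A: {time_a:.0f} ps per instruction, B: {time_b:.0f} ps per instruction")
print(f"A is {speedup_a_over_b:.1f}x faster than B")
```

Machine A spends 500 ps per instruction and Machine B 600 ps, so A is 1.2 times faster even though its CPI is higher.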


Another Example
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).
• The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
• The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? By how much?
What is the CPI for each sequence?
Clock cycles = sum over classes i of (CPI_i x IC_i)
Can we compare two machines using average CPI only?
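A minimal sketch of the calculation for the two sequences, using only the numbers stated in the example. The dictionary layout and function names are illustrative choices.

```python
# Cycles required by each instruction class, as given in the example.
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

def clock_cycles(mix):
    """Total cycles = sum over classes of (cycles per class x instruction count)."""
    return sum(CYCLES_PER_CLASS[cls] * count for cls, count in mix.items())

def cpi(mix):
    """Average CPI = total cycles / total instruction count."""
    return clock_cycles(mix) / sum(mix.values())

seq1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

for name, mix in (("sequence 1", seq1), ("sequence 2", seq2)):
    print(f"{name}: {clock_cycles(mix)} cycles, CPI = {cpi(mix):.2f}")
```

Sequence 1 needs 10 cycles (CPI 2.0) and sequence 2 needs 9 cycles (CPI 1.5), so the second sequence is faster by a factor of 10/9 even though it executes more instructions.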
Pitfall 1
• Can we use a subset of the performance equation as a performance measure?

CPU Time = (Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle)


MIPS
• Million instructions per second (MIPS) is a measure of a processor's
speed, providing a standard for representing the number
of instructions that a central processing unit (CPU) can process in 1
second.
• The number is meant to indicate how well a computer performs and
how much work it can do, especially when compared with other
systems. For example, a computer that can process 12,000 MIPS
should be able to outperform one that processes 10,000 MIPS.

Is this true?

[Figure: MIPS Example]


What is MIPS?
• Instruction execution rate => higher is better
• MIPS = Instruction count / (Execution time x 10^6)
• Issues:
• Cannot compare processors with different instruction sets
• Varies between programs on the same processor (a computer can’t have a single MIPS rating)
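The sketch below illustrates why a higher MIPS rating does not guarantee a shorter run time. All numbers are hypothetical and chosen for illustration only: a 4 GHz clock, three instruction classes costing 1, 2 and 3 cycles, and two compiled versions of the same program.

```python
CLOCK_RATE_HZ = 4e9                      # hypothetical clock rate
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}  # hypothetical cycles per class

def exec_time_s(mix):
    """Execution time = total clock cycles / clock rate."""
    cycles = sum(CPI_BY_CLASS[cls] * count for cls, count in mix.items())
    return cycles / CLOCK_RATE_HZ

def mips(mix):
    """MIPS = instruction count / (execution time x 10^6)."""
    return sum(mix.values()) / (exec_time_s(mix) * 1e6)

# Two hypothetical compilations of the same program (instruction counts).
prog1 = {"A": 5e9, "B": 1e9, "C": 1e9}
prog2 = {"A": 10e9, "B": 1e9, "C": 1e9}

for name, mix in (("version 1", prog1), ("version 2", prog2)):
    print(f"{name}: {exec_time_s(mix):.2f} s, {mips(mix):.0f} MIPS")
```

Under these assumptions version 2 reports the higher MIPS rating yet takes longer to finish, because it executes many more (cheap) instructions.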
Another Example

Op       Freq   CPI_i   Freq x CPI_i   Load = 2 cycles   Branch = 1 cycle   Two ALU ops at once
ALU      50%    1       .5             .5                .5                 .25
Load     20%    5       1.0            .4                1.0                1.0
Store    10%    3       .3             .3                .3                 .3
Branch   20%    2       .4             .4                .2                 .4
Sum                     2.2            1.6               2.0                1.95

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster
• How does this compare with using branch prediction to shave a cycle off the branch time?
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
• What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
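The three what-if cases can be reproduced with a small weighted-CPI sketch. The frequencies and CPI values are the ones in the table; the helper names are illustrative.

```python
# (frequency, CPI) per operation class, from the table above.
BASE_MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}

def average_cpi(mix):
    """Average CPI = sum of (frequency x CPI) over all classes."""
    return sum(freq * cpi for freq, cpi in mix.values())

def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix with one class's CPI replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed

baseline = average_cpi(BASE_MIX)
cases = [("better data cache", "Load", 2),
         ("branch prediction", "Branch", 1),
         ("two ALU ops at once", "ALU", 0.5)]
for label, op, new_cpi in cases:
    new = average_cpi(with_cpi(BASE_MIX, op, new_cpi))
    print(f"{label}: CPI {new:.2f}, {100 * (baseline / new - 1):.1f}% faster")
```

IC and clock cycle time are unchanged in all three cases, so the ratio of average CPIs is the speedup.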
Pitfall 2
• Expecting the improvement of one aspect of a computer to increase
overall performance by an amount proportional to the size of the
improvement.



Amdahl's Law
• The performance enhancement of an improvement is limited by how
much the improved feature is used. In other words: Don’t expect an
enhancement proportional to how much you enhanced something.
• Example:
"Suppose a program runs in 100 seconds on a machine, with multiply operations responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
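The example can be solved with the relation: improved time = (affected time / improvement factor) + unaffected time. The sketch below solves that relation for the improvement factor; the function name and the `None` return for impossible targets are illustrative choices.

```python
def required_improvement(total_s, affected_s, target_speedup):
    """Factor n such that affected_s / n + unaffected_s == total_s / target_speedup.

    Returns None when the target cannot be reached, i.e. when the unaffected
    part alone already takes at least the target time.
    """
    unaffected_s = total_s - affected_s
    target_s = total_s / target_speedup
    remaining_s = target_s - unaffected_s  # time budget left for the improved part
    if remaining_s <= 0:
        return None
    return affected_s / remaining_s

print(required_improvement(100, 80, 4))  # 4 times faster overall
print(required_improvement(100, 80, 5))  # 5 times faster overall
```

For a 4x overall speedup the target is 25 s, which leaves 5 s for multiplication, so multiply must become 16 times faster. A 5x overall speedup would need the program to finish in 20 s, which is already consumed by the non-multiply part, so no multiply improvement is enough.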


Performance Summary
The BIG Picture

CPU Time = (Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle)

• Performance depends on
• Algorithm: affects IC, possibly CPI
• Programming language: affects IC, CPI
• Compiler: affects IC, CPI
• Instruction set architecture: affects IC, CPI, Tc

Benchmarking the Intel Core i7
• Performance best determined by running a real application
• Use programs typical of expected workload
• Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
• nice for architects and designers
• easy to standardize
• SPEC (System Performance Evaluation Cooperative)
• companies have agreed on a set of real programs and inputs
• valuable indicator of performance (and compiler technology)



SPEC CPU Benchmark
• Programs used to measure performance
• Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
• Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
• Elapsed time to execute a selection of programs
• Negligible I/O, so focuses on CPU performance
• Normalize relative to reference machine
• Summarize as geometric mean of performance ratios
• CINT2006 (integer) and CFP2006 (floating-point)

Geometric mean = (Execution time ratio_1 x Execution time ratio_2 x ... x Execution time ratio_n)^(1/n)
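A minimal sketch of the geometric-mean summary. The ratios below are hypothetical placeholders, not measured SPEC results.

```python
import math

def geometric_mean(ratios):
    """n-th root of the product of n performance ratios."""
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical execution time ratios (reference time / measured time).
example_ratios = [10.0, 20.0, 40.0]
print(f"geometric mean = {geometric_mean(example_ratios):.2f}")
```

One reason the geometric mean suits normalized ratios is that the ranking of two machines does not depend on which machine is chosen as the reference.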
CINT2006 for Intel Core i7 920



What does the computer eat when it is hungry?
Chips.


What Is A Pipeline?
• Pipelining is used by virtually all modern microprocessors to enhance performance by
overlapping the execution of instructions.



What Is Pipelining
• Laundry Example
• 4 persons (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes


What Is Pipelining
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order run one after another, each taking 30 + 40 + 20 minutes]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
What Is Pipelining: Start work ASAP
[Figure: timeline starting at 6 PM; loads A, B, C, D overlap, giving segments of 30, 40, 40, 40, 40, 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads
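The 6 hour and 3.5 hour figures can be checked with a small scheduling sketch. It assumes each stage (washer, dryer, folder) handles one load at a time and that a load may wait between stages until the next stage is free; the function names are illustrative.

```python
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, folder (from the example)

def sequential_minutes(n_loads, stages=STAGE_MINUTES):
    """Each load runs through every stage before the next load starts."""
    return n_loads * sum(stages)

def pipelined_minutes(n_loads, stages=STAGE_MINUTES):
    """Each load enters a stage as soon as both the load and the stage are free."""
    stage_free_at = [0] * len(stages)  # time at which each stage becomes idle
    finish = 0
    for _ in range(n_loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, stage_free_at[s])
            ready = start + duration
            stage_free_at[s] = ready
        finish = ready
    return finish

print(sequential_minutes(4) / 60, "hours sequential")
print(pipelined_minutes(4) / 60, "hours pipelined")
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: the 40 minute dryer is the slowest stage, so it sets the rate at which loads complete.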


Pipelining Lessons
• Pipelining doesn’t help the latency of a single task, it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = Number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup


Pipelining Theoretical Performance
•An ideal pipeline divides a task into k independent sequential subtasks
• Each subtask requires 1 time unit to complete
• The task itself requires k time units to complete
•For n iterations of task, the execution times:
• With no pipelining: nk time units
• With pipelining: k + (n-1) time units
•Speedup of a k-stage pipeline is
• S = nk / [k+(n-1)] ==> k (for large n)
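A short sketch of the speedup expression above, showing how S approaches k as n grows. The function name is illustrative.

```python
def pipeline_speedup(k, n):
    """Ideal speedup of a k-stage pipeline over n tasks: nk / (k + n - 1)."""
    return (n * k) / (k + (n - 1))

for n in (1, 4, 100, 1_000_000):
    print(f"k=5, n={n}: S = {pipeline_speedup(5, n):.3f}")
```

With a single task there is no speedup at all (S = 1); the fill time of k - 1 units only becomes negligible when n is much larger than k.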



[Figure: Pipelining Performance]


Characteristics Of Pipelining
•The previous expression is ideal.
•In terms of a CPU, the implementation of pipelining has the effect of reducing the average instruction time, therefore reducing the average CPI.
• EX: If each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4 stage pipeline, the ideal average CPI with the pipeline will be 1.25.


MIPS processor
A pipelined processor


RISC Instruction Set Basics
•Properties of RISC architectures:
• All operations on data apply to data in registers and typically change
the entire register (32-bits or 64-bits).

• The only operations that affect memory are load/store operations: memory to register, and register to memory.
• Usually instructions are few in number and are typically one size.


RISC Instruction Set Basics (MIPS)
Types of Instructions
• MIPS ISA: R-type, I-type, J-type
ALU Instructions (R-type)
• Arithmetic operations, take two registers as operands.
• The result is stored in a third register.
• Logical operations: AND, OR, XOR, shift
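As an illustration of the R-type format, the sketch below splits a 32-bit instruction word into its fields. It assumes the standard MIPS R-type layout (op in bits 31-26, rs 25-21, rt 20-16, rd 15-11, shamt 10-6, funct 5-0); the function name and the sample word are illustrative.

```python
def decode_r_type(word):
    """Split a 32-bit MIPS R-type instruction word into its six fields."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31-26
        "rs": (word >> 21) & 0x1F,     # bits 25-21, first source register
        "rt": (word >> 16) & 0x1F,     # bits 20-16, second source register
        "rd": (word >> 11) & 0x1F,     # bits 15-11, destination register
        "shamt": (word >> 6) & 0x1F,   # bits 10-6, shift amount
        "funct": word & 0x3F,          # bits 5-0, selects the ALU operation
    }

# 0x02324020 encodes add $8, $17, $18 (op = 0, funct = 32 selects add).
fields = decode_r_type(0x02324020)
print(fields)
```

The two source registers and the destination register are all named inside the instruction word, which is why the register file can be addressed directly from the fetched instruction.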


Immediate Format Instructions (I-type)
• Usually take a register (base register) as an operand and a 16-bit immediate value.
• The sum of the two will create the effective address in some instructions.
• A second register acts as the destination in the case of a load operation.
• In the case of a store operation the second register contains the data to be stored.


[Figure: I-Type Instruction Example]


RISC Instruction Set Basics (MIPS)
Types of Instructions
•Jump Format (J-type)
• Conditional branches are transfers of control. As described before, a branch causes
an immediate value to be added to the current program counter.

RISC Instruction Set Implementation
•We first need to look at how instructions in the MIPS instruction
set are implemented without pipelining. We’ll assume that any
instruction of the subset of MIPS can be executed in one cycle
(SINGLE CYCLE PROCESSOR)



The single cycle datapath



Single Cycle Datapath with Control Unit
[Figure: single-cycle datapath. Components: PC, Instruction Memory, Register File, ALU, Data Memory, Sign Extend (16 to 32 bits), ALU control, two adders (PC + 4 and branch target), Shift left 2. Instruction fields: Instr[31-26] to the Control Unit, Instr[25-21] to Read Addr 1, Instr[20-16] to Read Addr 2, Instr[20-16] or Instr[15-11] to Write Addr, Instr[15-0] to Sign Extend, Instr[5-0] to ALU control. Control signals: PCSrc, ALUOp, Branch, MemRead, MemtoReg, MemWrite, ALUSrc, RegWrite, RegDst. ALU outputs: ovf, zero.]

R-type Instruction Data/Control Flow
[Figure: the single-cycle datapath with control unit, showing the data and control flow for an R-type instruction]

Load Word Instruction Data/Control Flow
[Figure: the single-cycle datapath with control unit, showing the data and control flow for a load word instruction]
Store Word Instruction?

Branch Instruction Data/Control Flow
[Figure: the single-cycle datapath with control unit, showing the data and control flow for a branch instruction]

RISC Instruction Set Implementation
•We first need to look at how instructions in the MIPS instruction
set are implemented without pipelining. We’ll assume that any
instruction of the subset of MIPS can be executed in at most 5
clock cycles.
•The five clock cycles will be broken up into the following steps:
• Instruction Fetch Cycle
• Instruction Decode/Register Fetch Cycle
• Execution Cycle
• Memory Access Cycle
• Write-Back Cycle



Fetching Instructions (IF)

•Fetching instructions involves
• reading the instruction from the Instruction Memory
• updating the PC to hold the address of the next instruction
• PC is updated every cycle, so it does not need an explicit write control signal
• Instruction Memory is read every cycle, so it doesn’t need an explicit read control signal
[Figure: the PC drives the Instruction Memory read address and an adder that adds 4 to the PC]


Decoding Instructions (ID)

•Decoding instructions involves
• sending the fetched instruction’s opcode and function field bits to the control unit
• reading two values from the Register File
• Register File addresses are contained in the instruction
[Figure: the instruction feeds the Control Unit and the Register File ports Read Addr 1, Read Addr 2, Write Addr and Write Data; the Register File outputs Read Data 1 and Read Data 2]

Executing R Format Operations (IE)

•R format operations (add, sub, slt, and, or)
• perform the (op and funct) operation on values in rs and rt
• store the result back into the Register File (into location rd)

R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]

[Figure: Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2) feeding the ALU; control signals RegWrite and ALU control; ALU outputs overflow and zero]


Executing Load and Store Operations (IE)

•Load and store operations involve
• computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
• for a store: the value (read from the Register File during decode) is written to the Data Memory
• for a load: the value read from the Data Memory is written to the Register File
[Figure: Register File, Sign Extend (16 to 32 bits), ALU and Data Memory; control signals RegWrite, ALU control, MemWrite and MemRead; ALU outputs overflow and zero]
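The address calculation described for loads and stores can be sketched as follows. Addresses are modelled as 32-bit unsigned integers, which is an assumption of this sketch, and the function names are illustrative.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def effective_address(base, offset16):
    """Effective address = base register + sign-extended 16-bit offset (32-bit wrap)."""
    return (base + sign_extend_16(offset16)) & 0xFFFFFFFF

# lw $1, 100($0): register $0 always holds 0, so the address is simply 100.
print(effective_address(0, 100))
# A negative offset: 0xFFFC is -4, so the access lands 4 bytes below the base.
print(hex(effective_address(0x1000, 0xFFFC)))
```

Sign extension is what lets a 16-bit field address memory both above and below the base register.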


Executing Branch Operations (IE)

•Branch operations involve
• comparing the operands read from the Register File during decode for equality (zero ALU output)
• computing the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instruction (shifted left by 2)
[Figure: the PC + 4 adder output and the sign-extended, shifted-left-2 offset feed a second adder that produces the branch target address; the ALU zero output goes to the branch control logic]
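A minimal sketch of the branch steps for a branch-on-equal instruction: the comparison is modelled as a subtraction whose zero result selects the branch target. The function names and the 32-bit wrap-around are assumptions of this sketch.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, offset16):
    """Target = updated PC (pc + 4) + sign-extended offset shifted left by 2."""
    return (pc + 4 + (sign_extend_16(offset16) << 2)) & 0xFFFFFFFF

def next_pc(pc, offset16, rs_value, rt_value):
    """Select the branch target when the operands are equal, else fall through."""
    zero = (rs_value - rt_value) == 0  # plays the role of the ALU zero output
    return branch_target(pc, offset16) if zero else pc + 4

print(hex(next_pc(0x40, 3, 5, 5)))  # operands equal: branch taken
print(hex(next_pc(0x40, 3, 5, 6)))  # operands differ: next sequential instruction
```

The shift by 2 converts the offset from a count of instructions to a count of bytes, since every instruction is 4 bytes long.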


Memory Access (MEM) Cycle

•If a load, the effective address computed from the previous cycle is referenced and the
memory is read. The actual data transfer to the register does not occur until the next cycle.
•If a store, the data from the register is written to the effective address in memory.



Write-Back (WB) Cycle

•Occurs with Register-Register ALU instructions or load instructions.
•Simple operation: whether the operation is a register-register operation or a memory load operation, the resulting data is written to the appropriate register.
•The Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File
[Figure: Register File and ALU with the RegWrite and ALU control signals; ALU outputs overflow and zero]


The Basic Pipeline For MIPS
[Figure: four instructions in program order over Cycle 1 to Cycle 7; each instruction passes through Ifetch, Reg, ALU, DMem, Reg, one stage per cycle, and starts one cycle after the previous instruction]
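The overlap in the basic pipeline can be reproduced with a short sketch that places each instruction's stages one cycle later than the previous instruction's. The stage labels follow the figure; the function name and the printed layout are illustrative.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]

def pipeline_diagram(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        rows.append(row)
    return rows

rows = pipeline_diagram(4)
for i, row in enumerate(rows, start=1):
    print(f"instr {i}: " + " ".join(f"{cell:<6}" for cell in row))
```

Four instructions through five stages need 5 + 4 - 1 = 8 cycles in total, and from cycle 5 onward one instruction completes every cycle.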


CPU Pipelining: Example

• Example: Single-cycle, non-pipelined execution
• Total time for 3 instructions: 24 ns

lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
[Figure: time axis in ns; each lw takes 8 ns (instruction fetch, Reg, ALU, data access, Reg) and the next instruction starts only after the previous one finishes]


CPU Pipelining: Example

• Single-cycle, pipelined execution
– Improve performance by increasing instruction throughput
– Total time for 3 instructions = 14 ns
– Each instruction adds 2 ns to total execution time
– Stage time limited by slowest resource (2 ns)
– Assumptions:
• Write to register occurs in 1st half of clock
• Read from register occurs in 2nd half of clock

lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
[Figure: time axis in ns; the three lw instructions overlap, each starting 2 ns after the previous one and passing through instruction fetch, Reg, ALU, data access, Reg at 2 ns per stage]
CPU Pipelining: Example

•Time without pipelining = 24 ns
•Time with pipelining = 14 ns (not 24/5). Why?
• The number of instructions is not large, so the time to fill the pipeline dominates
•Let’s increase the number of instructions
• If the number of instructions = 1,000,000, the total time with pipelining ≈ 1,000,000 x 2 ns = 2,000,000 ns
• Time without pipelining = 1,000,000 x 8 ns = 8,000,000 ns
• The speedup ≈ 4 (up from 24/14 ≈ 1.7 for 3 instructions)
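The numbers in this example follow from two simple formulas, sketched below with the 8 ns single-cycle time and the 2 ns, 5-stage pipeline from the slides. The function names are illustrative.

```python
def single_cycle_time_ns(n, cycle_ns=8):
    """Non-pipelined: every instruction takes one full 8 ns cycle."""
    return n * cycle_ns

def pipelined_time_ns(n, stages=5, stage_ns=2):
    """Pipelined: the first instruction takes all stages, each later one adds a stage time."""
    return (stages + n - 1) * stage_ns

for n in (3, 1_000_000):
    t_single = single_cycle_time_ns(n)
    t_pipe = pipelined_time_ns(n)
    print(f"n={n}: {t_single} ns vs {t_pipe} ns, speedup {t_single / t_pipe:.2f}")
```

For one million instructions the exact pipelined time is 2,000,008 ns; the extra 8 ns is the pipeline fill time, which is why the slide rounds it to 2,000,000 ns. The speedup approaches 8/2 = 4 rather than 5 because the 2 ns stage time is set by the slowest stage.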


The pipelined version of MIPS Datapath

•Need registers between stages
• To hold information produced in previous cycle


[Figures: the pipelined datapath, one slide per stage of a load instruction]

IF

ID

EX for Load

MEM for Load

WB for Load
There is a BUG here: wrong register number. The write register number must belong to the load itself, not to the instruction that is currently in the decode stage.

Corrected Datapath for Load

The pipelined data path with control signals


Control Signals
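The slide's own table did not survive extraction. As a stand-in, the sketch below encodes the commonly used textbook settings for the control signals named in the datapath figures (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp) for four instruction kinds. Treat the table as an assumption rather than the lecture's exact content; `None` marks a don't-care value, and the names `CONTROL`, `OPCODES` and `control_signals` are illustrative.

```python
# Assumed textbook-style control settings; None means "don't care".
CONTROL = {
    "R-type": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw": dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
               MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw": dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
               MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq": dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}

# MIPS opcodes (Instr[31-26]) for the four instruction kinds above.
OPCODES = {0x00: "R-type", 0x23: "lw", 0x2B: "sw", 0x04: "beq"}

def control_signals(instruction_word):
    """Look up the control settings from the opcode field, Instr[31-26]."""
    opcode = (instruction_word >> 26) & 0x3F
    return CONTROL[OPCODES[opcode]]

# 0x8C010064 encodes lw $1, 100($0).
print(control_signals(0x8C010064))
```

Only instructions that write a register (R-type and lw here) assert RegWrite, and only lw and sw touch the Data Memory, which matches the observation that the Register File and memory need explicit write controls.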



Recap
• Introduction (last time)
• Eight great ideas in computer architecture
• Performance
• Fallacies and pitfalls
• Real stuff: Benchmarking the Intel Core i7