740 Fall10 Lecture4 Afterlecture Pipelining
Computer Architecture
Lecture 4: Pipelining
2
Review: Other ISA-level Tradeoffs
Load/store vs. Memory/Memory
Condition codes vs. condition registers vs. compare&test
Hardware interlocks vs. software-guaranteed interlocking
VLIW vs. single instruction
0, 1, 2, 3 address machines
Precise vs. imprecise exceptions
Virtual memory vs. not
Aligned vs. unaligned access
Supported data types
Software vs. hardware managed page fault handling
Granularity of atomicity
Cache coherence (hardware vs. software)
…
3
Review: The Von-Neumann Model
[Block diagram: MEMORY (Mem Addr Reg), PROCESSING UNIT (ALU, TEMP), CONTROL UNIT (IP, Inst Register), INPUT, OUTPUT]
4
Review: The Von-Neumann Model
Stored program computer (instructions in memory)
One instruction at a time
Sequential execution
Unified memory
The interpretation of a stored value depends on the control signals
Execution time = time/program
= (# instructions/program) x (# cycles/instruction) x (time/cycle)
Instructions/program determined by: Algorithm, Program, ISA, Compiler
Cycles/instruction determined by: ISA, Microarchitecture
Time/cycle determined by: Microarchitecture, Logic design, Circuit implementation, Technology
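To make the decomposition concrete, here is a minimal back-of-the-envelope sketch in Python (not from the lecture; the instruction count, CPI, and clock period are made-up illustrative values):

def execution_time(insts_per_program, cycles_per_inst, seconds_per_cycle):
    # Execution time = (# instructions/program) x (# cycles/instruction) x (time/cycle)
    return insts_per_program * cycles_per_inst * seconds_per_cycle

# Example: 1 billion dynamic instructions, CPI of 1.5, 1 GHz clock (1 ns per cycle)
print(execution_time(1_000_000_000, 1.5, 1e-9))   # 1.5 seconds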
7
Improving Performance (Reducing Exec Time)
Reducing instructions/program
More efficient algorithms and programs
Better ISA?
8
Other Performance Metrics: IPS
Machine A: 10 billion instructions per second
Machine B: 1 billion instructions per second
Which machine has higher performance?
9
Other Performance Metrics: FLOPS
Machine A: 10 billion FP instructions per second
Machine B: 1 billion FP instructions per second
Which machine has higher performance?
10
Other Performance Metrics: Perf/Frequency
SPEC/MHz
Remember: Execution time = time/program = 1/Performance

Performance/Frequency = (time/cycle) / (time/program)
= (time/cycle) / [(# instructions/program) x (# cycles/instruction) x (time/cycle)]
= 1 / (# cycles/program)
What is wrong with comparing only “cycle count”?
Unfairly penalizes machines with high frequency
For machines of equal frequency, fairly reflects performance assuming an equal amount of “work” is done
Fair if used to compare two different same-ISA processors on the same binaries
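A small illustrative calculation (my own numbers, not from the slides) of why raw cycle count can mislead across machines with different frequencies:

# Two hypothetical machines running the same binary (same ISA, same work).
# Machine A runs at 2 GHz but needs more cycles (e.g., memory latency spans more cycles);
# Machine B runs at 1 GHz and needs fewer cycles.
cycles_a, freq_a = 12e9, 2e9
cycles_b, freq_b = 8e9, 1e9

time_a = cycles_a / freq_a   # 6.0 seconds
time_b = cycles_b / freq_b   # 8.0 seconds

# By cycle count alone B looks better (8e9 < 12e9), yet A finishes the program sooner.
print(time_a, time_b)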
11
An Example
[Diagram: original execution time split into a fraction (1 - f) that is unaffected and a fraction f that is enhanced by a factor S, so time_enhanced consists of (1 - f) and f/S]
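The diagram is Amdahl's Law; a minimal sketch of the resulting overall speedup, with placeholder values for f and S:

def overall_speedup(f, s):
    # A fraction f of the original execution time is sped up by a factor s;
    # the remaining (1 - f) is unaffected.
    return 1.0 / ((1.0 - f) + f / s)

# Example: enhance 90% of the execution time by 10x
print(overall_speedup(0.9, 10))   # about 5.26x, far from 10x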
13
Microarchitecture Design Principles
Bread and butter design
Spend time and resources where it matters (i.e., improving what the machine is designed to do)
Common case vs. uncommon case
Balanced design
Balance instruction/data flow through uarch components
Design to eliminate bottlenecks
14
Cycle Time (Frequency) vs. CPI (IPC)
Usually at odds with each other
Why?
Memory access latency: Increased frequency increases the number of cycles it takes to access main memory
15
Intro to Pipelining (I)
Single-cycle machines
Each instruction executed in one cycle
The slowest instruction determines cycle time
Multi-cycle machines
Instruction execution divided into multiple cycles
Fetch, decode, eval addr, fetch operands, execute, store result
Advantage: the slowest “stage” determines cycle time
Microcoded machines
Microinstruction: Control signals for the current cycle
Microcode: Set of all microinstructions needed to implement instructions → Translates each instruction into a set of microinstructions
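A rough sketch contrasting the single-cycle and multi-cycle rules above; the per-stage latencies are invented for illustration, and the slowest instruction is assumed to use all stages:

# Hypothetical per-stage latencies in ns:
# fetch, decode, eval addr, fetch operands, execute, store result
stage_ns = [200, 100, 100, 150, 250, 100]

# Single-cycle machine: the slowest full instruction must fit in one cycle
single_cycle_ns = sum(stage_ns)                   # 900 ns cycle time

# Multi-cycle machine: the slowest stage sets the cycle time;
# an instruction spends one cycle per stage it actually uses
multi_cycle_ns = max(stage_ns)                    # 250 ns cycle time
longest_inst_ns = multi_cycle_ns * len(stage_ns)  # 1500 ns if all 6 stages are used

print(single_cycle_ns, multi_cycle_ns, longest_inst_ns)

The multi-cycle machine pays off when many instructions skip stages, since shorter instructions finish in fewer of the short cycles.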
16
Microcoded Execution of an ADD
ADD DR ← SR1, SR2
[Diagram: MEMORY (Mem Addr Reg, Mem Data Reg), DATAPATH (ALU, GP Registers), CONTROL UNIT (Control Signals, Inst Pointer, Inst Register)]
Fetch: (What if this is SLOW?)
MAR ← IP
MDR ← MEM[MAR]
IR ← MDR
Decode:
Control Signals ← DecodeLogic(IR)
Execute:
TEMP ← SR1 + SR2
Store result (Writeback):
DR ← TEMP
IP ← IP + 4
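A toy Python walk-through of the same micro-sequenced steps (my own encoding of the instruction and register contents, purely for illustration):

# Toy state: memory holds the ADD instruction at the current instruction pointer.
MEM = {0x1000: ("ADD", "DR", "SR1", "SR2")}
regs = {"SR1": 5, "SR2": 7, "DR": 0}
IP = 0x1000

# Fetch
MAR = IP
MDR = MEM[MAR]
IR = MDR

# Decode: stand-in for DecodeLogic(IR) producing control signals
opcode, dst, src1, src2 = IR

# Execute
TEMP = regs[src1] + regs[src2]

# Store result (writeback), then advance the instruction pointer
regs[dst] = TEMP
IP = IP + 4

print(regs["DR"], hex(IP))   # 12 0x1004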
17
Intro to Pipelining (II)
In the microcoded machine, some resources are idle in different stages of instruction processing
Fetch logic is idle when ADD is being decoded or executed
Pipelined machines
Use idle resources to process other instructions
Each stage processes a different instruction
When decoding the ADD, fetch the next instruction
Think “assembly line”
Pipelined vs. multi-cycle machines
Advantage: Improves instruction throughput (reduces CPI)
Disadvantage: Requires more logic, higher power consumption
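An idealized cycle-count comparison of the two styles (no stalls; the numbers are illustrative):

def multicycle_cycles(n_insts, n_stages):
    # One instruction at a time: the next instruction starts only after the previous finishes
    return n_insts * n_stages

def pipelined_cycles(n_insts, n_stages):
    # Fill the pipeline once, then (ideally) one instruction completes per cycle
    return n_stages + (n_insts - 1)

n, k = 4, 4   # four independent ADDs, four stages (F, D, E, W)
print(multicycle_cycles(n, k))  # 16 cycles
print(pipelined_cycles(n, k))   # 7 cycles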
18
A Simple Pipeline
19
Execution of Four Independent ADDs
Multi-cycle: 4 cycles per instruction
F D E W
        F D E W
                F D E W
                        F D E W
Time
20
Issues in Pipelining: Increased CPI
Data dependency stall: what if the next ADD depends on the previous instruction's result?
ADD R3 ← R1, R2    F D E W
ADD R4 ← R3, R7    F D D E W
LD  R3 ← R2(0)     F D E M W
ADD R4 ← R3, R7    F D E E M W
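A tiny helper (my own illustration, not from the slides) that counts stall cycles in the stage strings above: a repeated stage letter means the instruction was held in that stage for an extra cycle.

def stalls(stage_string):
    # "F D D E W" -> held in D for one extra cycle -> 1 stall
    stages = stage_string.split()
    return sum(1 for a, b in zip(stages, stages[1:]) if a == b)

print(stalls("F D E W"))       # 0
print(stalls("F D D E W"))     # 1 (waiting for the older ADD's result)
print(stalls("F D E E M W"))   # 1 (waiting for the load's result)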
21
Implementing Stalling
Hardware-based interlocking
Common way: scoreboard
i.e., a valid bit associated with each register in the register file (see the sketch below)
Valid bits also associated with each forwarding/bypass path
[Diagram: Instruction Cache → Register File → several Func Units]
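A minimal sketch of valid-bit (scoreboard-style) interlocking as described above; the class and method names are mine, and forwarding-path valid bits are left out for brevity:

class Scoreboard:
    def __init__(self, num_regs):
        # One valid bit per register: True means the register file holds the up-to-date value
        self.valid = [True] * num_regs

    def can_issue(self, srcs):
        # Stall (return False) if any source register is still being produced
        # by an in-flight instruction, i.e., its valid bit is clear
        return all(self.valid[r] for r in srcs)

    def issue(self, dst):
        # Destination becomes invalid until writeback completes
        self.valid[dst] = False

    def writeback(self, dst):
        self.valid[dst] = True

sb = Scoreboard(8)
sb.issue(dst=3)                     # ADD R3 <- R1, R2 issues; R3 is now pending
print(sb.can_issue(srcs=[3, 7]))    # False: ADD R4 <- R3, R7 must stall
sb.writeback(dst=3)
print(sb.can_issue(srcs=[3, 7]))    # True: the stall is resolved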
22
Data Dependency Types
23
Issues in Pipelining: Increased CPI
Control dependency stall: what to fetch next
24