Pipelining Part 1: An Overview

Pipelining is a technique used in CPU design where multiple instructions are executed simultaneously in overlapping stages, enhancing performance. The document discusses the implementation of pipelining in the ARM instruction set architecture (ISA), highlighting its advantages over complex instruction set architectures (CISC) like Intel's. It also explains the stages of instruction execution in ARM ISA and how the design of the instruction set facilitates efficient pipelining.


Pipelining Part 1:

An Overview
Slides marked with the indicated symbol are enrichment material.

The content of these slides is not part of the required material for the course exam.

‫‪2‬‬
What is Pipelining?

- Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
- Today, pipelining is nearly universal, implemented by virtually all CPUs.

3
Explaining the idea of pipelining in the context of the ARM ISA

We will explain this technique in the context of its implementation in ARM processors,
and along the way demonstrate how the design of the ISA affects the efficiency of this technique.

Recall that the ARM ISA is a representative example of a RISC ISA
RISC = Reduced Instruction Set Computer
while the Intel ISA is a representative example of a CISC ISA
CISC = Complex Instruction Set Computer

Demonstrating the idea of pipelining in the context of the ARM ISA will serve us, more generally, to explain how a RISC ISA has advantages over a CISC ISA in everything related to pipeline performance.

‫‪4‬‬
ARM ISA

- 32 registers: X0–X30, XZR
  - register XZR always reads as 0.

- Memory:
  - 2^62 memory words: Memory[0], Memory[4], ..., Memory[18,446,744,073,709,551,612]
  - Each word = 4 bytes

5
ARM ISA – Arithmetic & Logical Instructions

Operation        Example            Meaning

add              ADD X1, X2, X3     // X1 <- X2 + X3
add immediate    ADDI X1, X2, 20    // X1 <- X2 + 20
subtract         ...

- In the ARM ISA, data must be in registers in order to perform an arithmetic or logic operation.
- That is, it is not allowed to add two integer values [operands], val1 and val2, if the value of one of them is in memory!
- Both values, val1 and val2, MUST first be 'brought' [= loaded] from memory into registers
  - using data transfer instructions
  - load [copy] val1 from memory into register Xi
  - and load val2 into register Xj

6
ARM ISA – Data Transfer Instructions

- Memory can be accessed only by data transfer instructions:


- load register
- LDUR X1, [X2,40] // X1 <- Memory[X2 + 40]
- store register
- STUR X1, [X2,40] // Memory[X2 + 40] <- X1
-…
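The semantics of LDUR and STUR can be sketched with a tiny Python model. This is only an illustrative sketch: memory is modeled as a dict keyed by byte address, and the register values and the STUR offset of 60 are hypothetical, chosen for the example.

```python
# Toy model: memory as a dict keyed by byte address, registers as a dict.
memory = {140: 7}        # Memory[140] holds the value 7
X = {1: 0, 2: 100}       # registers X1 and X2 (hypothetical contents)

# LDUR X1, [X2, 40]  ->  X1 <- Memory[X2 + 40] = Memory[140]
X[1] = memory[X[2] + 40]

# STUR X1, [X2, 60]  ->  Memory[X2 + 60] <- X1   (hypothetical offset 60)
memory[X[2] + 60] = X[1]

print(X[1], memory[160])   # both are 7
```

The point of the model is that these two instructions are the only way data moves between memory and the registers.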

7
ARM ISA - Conditional & Unconditional Branch Instructions

- compare and branch on zero (CBZ)
  - Compare and Branch on Zero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the value equals zero.
  - CBZ X1, 25 // meaning: if (X1 == 0) go to PC + 100.
    // Why 100? Because 4 × 25 = 100, and a word = 4 bytes.
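The offset arithmetic can be checked with a few lines of Python (the PC value below is hypothetical, chosen only for illustration); note that scaling by 4 is just a left shift by 2 bits:

```python
pc = 0x1000        # hypothetical current PC value
offset = 25        # word offset encoded in the CBZ instruction

# Branch target: PC + 4 * offset; the scaling is a shift, not a multiply
target = pc + (offset << 2)

print(target - pc)   # byte distance: 100
```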

8
‘break’ the instruction cycle into steps

Each ARM instruction is executed by an instruction cycle.

One way to describe the instruction cycle is that it takes [= is composed of] 3 stages: IF, ID, IE.

- For our current purpose, describing the concepts of pipelining, we change this description of an instruction cycle,
- and describe an instruction cycle as if it takes 5 stages:
1. Fetch instruction from memory.
2. Read registers and decode the instruction.
3. Execute the operation or calculate an address.
4. Access an operand in data memory (if necessary).
5. Write the result into a register (if necessary).

- Hence, the ARM ISA pipeline we explore in this lecture has 5 stages.
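Which of the five stages each instruction class actually uses can be summarized in a small table; the sketch below simply encodes the "if necessary" rules from the stage descriptions that follow (stage names IF/ID/EX/MEM/WB are shorthand labels for stages 1–5):

```python
# The five pipeline stages, in order
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Stages actually used by each instruction class; the "if necessary"
# stages (MEM, WB) are omitted where the instruction does not need them.
USES = {
    "LDUR": ["IF", "ID", "EX", "MEM", "WB"],
    "STUR": ["IF", "ID", "EX", "MEM"],   # no register write-back
    "ADD":  ["IF", "ID", "EX", "WB"],    # no data-memory access
    "CBZ":  ["IF", "ID", "EX"],          # neither
}

for instr, used in USES.items():
    print(f"{instr:5s} {' -> '.join(used)}")
```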

9
Explanation: ADD X1, X2, X3
1. Fetch instruction from memory.
- translate the current value stored in the PC register to a physical address
  - using the MMU
- store the physical address in the MAR [= Memory Address Register]
- initiate a read bus transaction
  - copy the MAR's value onto the system bus address lines
  - raise the appropriate control line of the system bus
    - to signal the MM controller that a read is requested
- wait for an acknowledgment from the MM controller that the requested data, in this case the instruction, is available on the data lines of the system bus
- copy the value from the data lines to the MBR [= Memory Buffer Register]
- copy the value from the MBR to the IR [= Instruction Register]

10
Explanation: ADD X1, X2, X3 (cont.)

2. Read registers and decode the instruction:


- decode that the instruction 'tells' [= instructs/commands] the CPU that it is required to perform an addition operation
- fetch the current value of register X2
- and fetch the current value of register X3

3. Execute the operation or calculate an address.


- 'activate' [= 'send' an appropriate control signal to] the ALU's addition circuit
  - which is implemented by FAs [= Full Adders]

4. Access an operand in data memory (if necessary).


- not carried out in the instruction cycle of ADD

5. Write the result into a register (if necessary).


- signal register X1 to 'take' [= store] the result computed by the ALU
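The FAs mentioned in stage 3 can be sketched in Python. This is only an illustrative bit-level model of a full adder chained into a ripple-carry adder, not a claim about how the actual ALU circuit is laid out:

```python
def full_adder(a, b, cin):
    """One full adder (FA): two input bits plus carry-in."""
    s = a ^ b ^ cin                         # sum bit
    cout = (a & b) | (a & cin) | (b & cin)  # carry-out (majority of the three inputs)
    return s, cout

def ripple_add(x, y, width=8):
    """Chain FAs bit by bit into a ripple-carry adder (illustrative only)."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

print(ripple_add(23, 19))   # 42
```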

11
Explanation: LDUR X1, [X2,40]

1. Fetch instruction from memory


2. Read registers and decode the instruction:
- decode that the instruction 'tells' [= instructs/commands] the CPU that it is required to transfer [= copy] data from MM
  - and store it in register X1
- from where in MM? From the logical address Memory[X2 + 40]
- so the CPU needs to fetch the current value of register X2 and feed it into one of the inputs [= 'entries'] of the ALU
- and the CPU also needs to feed the value 40 [= the immediate value] into the second input [= entry] of the ALU

3. Execute the operation or calculate an address.


{in this example: calculate an address}
- 'activate' [= 'send' an appropriate control signal to] the ALU's addition circuit to carry out the addition X2 + 40

12
Explanation: LDUR X1, [X2,40]

4. Access an operand in data memory (if necessary).


{in this example it is necessary of course !}
- translate the logical address X2 + 40, calculated in the previous stage by the ALU, to a physical address
  - using the MMU
- store the physical address in the MAR [= Memory Address Register]
- initiate a read bus transaction
  - copy the MAR's value onto the system bus address lines
  - raise the appropriate control line of the system bus
    - to signal the MM controller that a read is requested
- wait for an acknowledgment from the MM controller that the requested data, in this case a data operand, is available on the data lines of the system bus
- copy the value from the data lines to the MBR [= Memory Buffer Register]
- copy the value from the MBR to register X1

13
Explanation: CBZ X1, 25

1. Fetch instruction from memory

2. Read registers and decode the instruction:


- decode that the instruction 'tells' [= instructs/commands] the CPU that it is required to check whether the value in register X1 equals 0, and if so branch to the logical address 'created' by adding 25 × 4 to the current value of the PC register
  - since a word is 4 bytes long
- so the CPU needs to fetch the current value of register X1 and check whether it is 0

14
Explanation: CBZ X1, 25

3. Execute the operation or calculate an address.


- if the value stored in X1 is 0, then 'activate' [= 'send' an appropriate control signal to] the ALU's addition circuit to carry out the calculation: current PC value + 4 × 25
  - note: the product 4 × 25 is actually achieved by a shift, NOT a multiplication
- and 'store' the result in the PC

4. Access an operand in data memory (if necessary).


- not carried out in the instruction cycle of CBZ

5. Write the result into a register (if necessary).


- not carried out in the instruction cycle of CBZ

15
Assumption #1

- To make this discussion concrete, let's create a pipeline.
- In this example, we limit our attention to seven instructions:
- load register (LDUR),
- store register (STUR),
- add (ADD),
- subtract (SUB),
- AND (AND),
- OR (ORR),
- and compare and branch on zero (CBZ).

16
Assumption #2

- Assume that the operation times for the major functional units in the example we’ll shortly
describe are:
- 200 ps for memory access for instructions or data,
- 200 ps for ALU operation,
- and 100 ps for register file read or write.

- The total time for each of the seven instructions can be calculated from the time for each stage.

Note: this calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay.
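Under the stage times assumed above, the per-instruction totals can be reconstructed with a short sketch. The per-class stage usage below follows the "if necessary" rules from the earlier slides (STUR needs no write-back, ADD needs no data-memory access, CBZ needs neither):

```python
# Assumed stage delays in picoseconds (from this slide)
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages used by each instruction class ("if necessary" stages omitted)
USES = {
    "LDUR": ["IF", "ID", "EX", "MEM", "WB"],
    "STUR": ["IF", "ID", "EX", "MEM"],
    "ADD":  ["IF", "ID", "EX", "WB"],
    "CBZ":  ["IF", "ID", "EX"],
}

totals = {instr: sum(STAGE_PS[s] for s in used) for instr, used in USES.items()}
print(totals)   # {'LDUR': 800, 'STUR': 700, 'ADD': 600, 'CBZ': 500}
```

This is why LDUR, which uses all five stages, is the slowest instruction at 800 ps, while the fastest needs only 500 ps.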
17
???

Let us now compare nonpipelined and pipelined execution of three load


register instructions.

18
???

- The nonpipelined implementation design must allow for the slowest instruction!
  - Why? In order, as we shall see shortly, to be able to quantitatively compare the nonpipelined and the pipelined implementations.
- The slowest is the LDUR instruction.
- So, for the purpose of our discussion, we'll assume the time required for every instruction is 800 ps
  - even though some instructions can be as fast as 500 ps.

19
???

The time between the first and fourth instructions in the nonpipelined design is 3 × 800 ps or
2400 ps.
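Both the nonpipelined and the pipelined figures for the three loads can be reproduced with a small helper, under the timing assumptions of this example (800 ps per nonpipelined instruction; a 200 ps pipeline cycle with one instruction entering the pipeline per cycle):

```python
CYCLE_PS = 200   # pipelined clock cycle (worst-case stage time)
N_STAGES = 5

def nonpipelined_time(n, per_instr_ps=800):
    """Total time for n back-to-back instructions without pipelining."""
    return n * per_instr_ps

def pipelined_time(n):
    """Instruction i enters one cycle after i-1; the last one still needs all 5 stages."""
    return (n - 1) * CYCLE_PS + N_STAGES * CYCLE_PS

print(nonpipelined_time(3), pipelined_time(3))   # 2400 1400
```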

20
???

- Just as the nonpipelined design must take the worst-case time of 800 ps, even though some instructions can be as fast as 500 ps, the pipelined execution clock cycle must have the worst-case stage time of 200 ps
  - even though some stages take only 100 ps.
- Why? To simplify the control of the pipeline: every 200 ps the clock 'signals' to 'move' the instruction currently in stage S of the pipeline to the next stage S+1
  - or to 'get out of the pipeline' the instruction currently in the last stage.

21
???

We can turn the pipelining speed-up discussion above into a formula.

If the stages are perfectly balanced, then the time between instructions on the pipelined processor, assuming ideal conditions, is equal to:

    Time between instructions (pipelined) =
        Time between instructions (nonpipelined) / Number of pipe stages

22
???

Under ideal conditions and with a large number of instructions, the speed-up from pipelining is approximately equal to the number of pipe stages;
that is: a five-stage pipeline is nearly five times faster than a nonpipelined implementation!

23
???

- The example we explored shows, however, that the stages may be imperfectly balanced.
  - There are stages that take 100 ps, whereas others take 200 ps.
- Moreover, pipelining involves some overhead, the source of which will become clearer shortly:
  - hazards.

- Thus, the time per instruction in the pipelined processor will exceed the minimum
possible, and speed-up will be less than the number of pipeline stages.
- in our example: less than 5 times faster

24
???

- Nevertheless, our pipelining still offers a fourfold performance improvement.
- To see why, let us assume we extend the previous figures to 1,000,003 instructions
  - instead of only 3 instructions.
- In the nonpipelined example, each of the 1,000,003 instructions takes 800 ps, so the total execution time would be 1,000,003 × 800 ps = 800,002,400 ps.

- In the pipelined case:
  - The first 3 instructions take a total of 1400 ps.
  - Then, every 200 ps, another of the remaining 1,000,000 instructions will 'come out' of the pipeline.
  - The total execution time would be 1,000,000 × 200 ps + 1400 ps, or 200,001,400 ps.
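The arithmetic above can be verified in a few lines, under the same assumptions (800 ps per nonpipelined instruction; 200 ps pipeline cycle; the first 3 pipelined instructions taking 1400 ps):

```python
n = 1_000_003

nonpipelined = n * 800               # every instruction takes 800 ps
pipelined = 1400 + 1_000_000 * 200   # first 3 in 1400 ps, then one result per 200 ps

print(nonpipelined, pipelined)       # 800002400 200001400
print(nonpipelined / pipelined)      # about a 4x speed-up
```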

25
???

Under these conditions, the ratio of total execution times for real programs on nonpipelined to pipelined processors is close to the ratio of times between instructions:

    800,002,400 ps / 200,001,400 ps ≈ 4.00 ≈ 800 ps / 200 ps

26
???

Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction!

Instruction throughput is the important metric, because real programs execute billions of instructions.

27
Designing Instruction Sets for Pipelining
Pipelining and RISC ISA

- Even with the above simple explanation of pipelining, we can get insight into the design
of the ARM instruction set, which was designed for pipelined execution.

- First, all ARM instructions are the same length.


- This restriction makes it much easier to fetch instructions in the first pipeline stage and to
decode them in the second stage.
- In an instruction set like the x86, where instructions vary from 1 byte to 15 bytes, pipelining is
considerably more challenging.
- Modern implementations of the x86 architecture actually translate x86 instructions into simple
operations that look like ARM instructions and then pipeline the simple operations rather
than the native x86 instructions!

29
Pipelining and RISC ISA

- Second, ARM has just a few instruction formats, with the first source register
and destination register fields being located in the same place in each instruction.

- Third, memory operands only appear in loads or stores in ARM ISA.


- This restriction means we can use the execute stage to calculate the memory address and
then access memory in the following stage.
- If we could operate on the operands in memory, as in the x86, stages 3 and 4 would expand to
an address stage, memory stage, and then execute stage.

30
