Pipelining Part 1: An Overview

Pipelining is a technique used in CPU design where multiple instructions are executed simultaneously in overlapping stages, enhancing performance. The document discusses the implementation of pipelining in the ARM instruction set architecture (ISA), highlighting its advantages over complex instruction set architectures (CISC) like Intel's. It also explains the stages of instruction execution in ARM ISA and how the design of the instruction set facilitates efficient pipelining.


Pipelining Part 1:

An Overview
Slides marked with the indicated symbol are enrichment material.

The content of these slides is not part of the required material for the course exam.

‫‪2‬‬
What is Pipelining?

- Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
- Today, pipelining is nearly universal, implemented by virtually all CPUs.

3
Explaining the idea of pipelining in the context of the ARM ISA

We will explain this technique in the context of its implementation in ARM processors,
and along the way demonstrate how the design of the ISA affects the efficiency of this technique.

Recall that the ARM ISA is a representative example of a RISC ISA
RISC = Reduced Instruction Set Computer
while the Intel ISA is a representative example of a CISC ISA
CISC = Complex Instruction Set Computer

Demonstrating the idea of pipelining in the context of the ARM ISA will serve us, more generally, to explain how a RISC ISA has advantages over a CISC ISA in everything related to pipeline performance.

‫‪4‬‬
ARM ISA

- 32 registers: X0–X30, XZR
  - register XZR always reads as 0.

- Memory:
  - 2^62 memory words: Memory[0], Memory[4], ..., Memory[18,446,744,073,709,551,612]
  - Each word = 4 bytes

5
ARM ISA – Arithmetic & Logical Instructions

Operation        Example            Meaning

add              ADD X1, X2, X3     // X1 <- X2 + X3
add immediate    ADDI X1, X2, 20    // X1 <- X2 + 20
subtract         ...

- In the ARM ISA, data must be in registers in order to perform an arithmetic or logic operation.
- That is, it is not allowed to add two integer values [operands], val1 and val2, if the value of one of them is in memory!
- Both values, val1 and val2, MUST first be 'brought' [= loaded] from memory into registers
  - using data transfer instructions
  - load [copy] val1 from memory into register Xi
  - and load val2 into register Xj

6
ARM ISA – Data Transfer Instructions

- Memory can be accessed only by data transfer instructions:


- load register
- LDUR X1, [X2,40] // X1 <- Memory[X2 + 40]
- store register
- STUR X1, [X2,40] // Memory[X2 + 40] <- X1
-…
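The semantics of LDUR and STUR can be sketched with a tiny Python model. This is only an illustrative sketch: memory is modeled as a dict keyed by byte address, and the register values and the STUR offset of 60 are hypothetical, chosen for the example.

```python
# Toy model: memory as a dict keyed by byte address, registers as a dict.
memory = {140: 7}        # Memory[140] holds the value 7
X = {1: 0, 2: 100}       # registers X1 and X2 (hypothetical contents)

# LDUR X1, [X2, 40]  ->  X1 <- Memory[X2 + 40] = Memory[140]
X[1] = memory[X[2] + 40]

# STUR X1, [X2, 60]  ->  Memory[X2 + 60] <- X1   (hypothetical offset 60)
memory[X[2] + 60] = X[1]

print(X[1], memory[160])   # both are 7
```

The point of the model is that these two instructions are the only way data moves between memory and the registers.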

7
ARM ISA - Conditional & Unconditional Branch Instructions

- compare and branch on zero (CBZ)
  - Compare and Branch on Zero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the value equals zero.
  - CBZ X1, 25 // meaning: if (X1 == 0) go to PC + 100.
    // Why 100? Because 4 × 25 = 100, and a word = 4 bytes.
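The offset arithmetic can be checked with a few lines of Python (the PC value below is hypothetical, chosen only for illustration); note that scaling by 4 is just a left shift by 2 bits:

```python
pc = 0x1000        # hypothetical current PC value
offset = 25        # word offset encoded in the CBZ instruction

# Branch target: PC + 4 * offset; the scaling is a shift, not a multiply
target = pc + (offset << 2)

print(target - pc)   # byte distance: 100
```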

8
‘break’ the instruction cycle into steps

Each ARM instruction is executed by an instruction cycle.

One way to describe the instruction cycle is that it takes [= is composed of] 3 stages: IF, ID, IE.

- For our current purpose, describing the concepts of pipelining, we change this description of an instruction cycle,
- and describe an instruction cycle as if it takes 5 stages:
1. Fetch instruction from memory.
2. Read registers and decode the instruction.
3. Execute the operation or calculate an address.
4. Access an operand in data memory (if necessary).
5. Write the result into a register (if necessary).

- Hence, the ARM ISA pipeline we explore in this lecture has 5 stages.
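Which of the five stages each instruction class actually uses can be summarized in a small table; the sketch below simply encodes the "if necessary" rules from the stage descriptions that follow (stage names IF/ID/EX/MEM/WB are shorthand labels for stages 1–5):

```python
# The five pipeline stages, in order
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Stages actually used by each instruction class; the "if necessary"
# stages (MEM, WB) are omitted where the instruction does not need them.
USES = {
    "LDUR": ["IF", "ID", "EX", "MEM", "WB"],
    "STUR": ["IF", "ID", "EX", "MEM"],   # no register write-back
    "ADD":  ["IF", "ID", "EX", "WB"],    # no data-memory access
    "CBZ":  ["IF", "ID", "EX"],          # neither
}

for instr, used in USES.items():
    print(f"{instr:5s} {' -> '.join(used)}")
```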

9
Explanation: ADD X1, X2, X3
1. Fetch instruction from memory.
- translate the current value stored in the PC register to a physical address
  - using the MMU
- store the physical address in the MAR [= Memory Address Register]
- initiate a read bus transaction
  - copy the MAR's value onto the system bus address lines
  - raise the appropriate control line of the system bus
    - to signal the MM controller that a read is requested
- wait for an acknowledgment from the MM controller that the requested data, in this case the instruction, is available on the data lines of the system bus
- copy the value from the data lines to the MBR [= Memory Buffer Register]
- copy the value from the MBR to the IR [= Instruction Register]

10
Explanation: ADD X1, X2, X3 (cont.)

2. Read registers and decode the instruction:


- decode that the instruction 'tells' [= instructs/commands] the CPU that it is required to perform an addition operation
- fetch the current value of register X2
- and fetch the current value of register X3

3. Execute the operation or calculate an address.


- 'activate' [= 'send' an appropriate control signal to] the ALU's addition circuit
  - which is implemented by FAs [= Full Adders]

4. Access an operand in data memory (if necessary).


- not carried out in the instruction cycle of ADD

5. Write the result into a register (if necessary).


- signal register X1 to 'take' [= store] the result computed by the ALU
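The FAs mentioned in stage 3 can be sketched in Python. This is only an illustrative bit-level model of a full adder chained into a ripple-carry adder, not a claim about how the actual ALU circuit is laid out:

```python
def full_adder(a, b, cin):
    """One full adder (FA): two input bits plus carry-in."""
    s = a ^ b ^ cin                         # sum bit
    cout = (a & b) | (a & cin) | (b & cin)  # carry-out (majority of the three inputs)
    return s, cout

def ripple_add(x, y, width=8):
    """Chain FAs bit by bit into a ripple-carry adder (illustrative only)."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

print(ripple_add(23, 19))   # 42
```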

11
Explanation: LDUR X1, [X2,40]

1. Fetch instruction from memory


2. Read registers and decode the instruction:
- decode that the instruction 'tells' [= instructs/commands] the CPU that it is required to transfer [= copy] data from MM
  - and store it in register X1
- from where in MM? From the logical address Memory[X2 + 40]
- so the CPU needs to fetch the current value of register X2 and feed it into one of the inputs [= 'entries'] of the ALU
- and the CPU also needs to feed the value 40 [= the immediate value] into the second input [= entry] of the ALU

3. Execute the operation or calculate an address.


{in this example: calculate an address}
- 'activate' [= 'send' an appropriate control signal to] the ALU's addition circuit to carry out the addition X2 + 40

12
Explanation: LDUR X1, [X2,40]

4. Access an operand in data memory (if necessary).


{in this example it is necessary of course !}
- translate the logical address X2 + 40, calculated in the previous stage by the ALU, to a physical address
  - using the MMU
- store the physical address in the MAR [= Memory Address Register]
- initiate a read bus transaction
  - copy the MAR's value onto the system bus address lines
  - raise the appropriate control line of the system bus
    - to signal the MM controller that a read is requested
- wait for an acknowledgment from the MM controller that the requested data, in this case a data operand, is available on the data lines of the system bus
- copy the value from the data lines to the MBR [= Memory Buffer Register]
- copy the value from the MBR to register X1

13
Explanation: CBZ X1, 25

1. Fetch instruction from memory

2. Read registers and decode the instruction:


- decode that the instruction 'tells' [= instructs/commands] the CPU that it is required to check whether the value in register X1 equals 0, and if so branch to the logical address 'created' by adding 25 × 4 to the current value of the PC register
  - since a word is 4 bytes long
- so the CPU needs to fetch the current value of register X1 and check whether it is 0

14
Explanation: CBZ X1, 25

3. Execute the operation or calculate an address.


- if the value stored in X1 is 0, then 'activate' [= 'send' an appropriate control signal to] the ALU's addition circuit to carry out the calculation: current PC value + 4 × 25
  - note: the product 4 × 25 is actually achieved by a shift, NOT a multiplication
- and 'store' the result in the PC

4. Access an operand in data memory (if necessary).


- not carried out in the instruction cycle of CBZ

5. Write the result into a register (if necessary).


- not carried out in the instruction cycle of CBZ

15
Assumption #1

- To make this discussion concrete, let's create a pipeline.
- In this example, we limit our attention to seven instructions:
- load register (LDUR),
- store register (STUR),
- add (ADD),
- subtract (SUB),
- AND (AND),
- OR (ORR),
- and compare and branch on zero (CBZ).

16
Assumption #2

- Assume that the operation times for the major functional units in the example we’ll shortly
describe are:
- 200 ps for memory access for instructions or data,
- 200 ps for ALU operation,
- and 100 ps for register file read or write.

- The total time for each of the seven instructions can be calculated from the time for each stage.

Note: this calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay.
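Under the stage times assumed above, the per-instruction totals can be reconstructed with a short sketch. The per-class stage usage below follows the "if necessary" rules from the earlier slides (STUR needs no write-back, ADD needs no data-memory access, CBZ needs neither):

```python
# Assumed stage delays in picoseconds (from this slide)
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages used by each instruction class ("if necessary" stages omitted)
USES = {
    "LDUR": ["IF", "ID", "EX", "MEM", "WB"],
    "STUR": ["IF", "ID", "EX", "MEM"],
    "ADD":  ["IF", "ID", "EX", "WB"],
    "CBZ":  ["IF", "ID", "EX"],
}

totals = {instr: sum(STAGE_PS[s] for s in used) for instr, used in USES.items()}
print(totals)   # {'LDUR': 800, 'STUR': 700, 'ADD': 600, 'CBZ': 500}
```

This is why LDUR, which uses all five stages, is the slowest instruction at 800 ps, while the fastest needs only 500 ps.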
17
???

Let us now compare nonpipelined and pipelined execution of three load


register instructions.

18
???

- The nonpipelined implementation design must allow for the slowest instruction!
  - Why? In order, as we shall see shortly, to be able to quantitatively compare the nonpipelined and the pipelined implementations.
- The slowest is the LDUR instruction.
- So, for the purpose of our discussion, we'll assume the time required for every instruction is 800 ps
  - even though some instructions can be as fast as 500 ps.

19
???

The time between the first and fourth instructions in the nonpipelined design is 3 × 800 ps or
2400 ps.
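Both the nonpipelined and the pipelined figures for the three loads can be reproduced with a small helper, under the timing assumptions of this example (800 ps per nonpipelined instruction; a 200 ps pipeline cycle with one instruction entering the pipeline per cycle):

```python
CYCLE_PS = 200   # pipelined clock cycle (worst-case stage time)
N_STAGES = 5

def nonpipelined_time(n, per_instr_ps=800):
    """Total time for n back-to-back instructions without pipelining."""
    return n * per_instr_ps

def pipelined_time(n):
    """Instruction i enters one cycle after i-1; the last one still needs all 5 stages."""
    return (n - 1) * CYCLE_PS + N_STAGES * CYCLE_PS

print(nonpipelined_time(3), pipelined_time(3))   # 2400 1400
```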

20
???

- Just as the nonpipelined design must take the worst-case time of 800 ps, even though some instructions can be as fast as 500 ps, the pipelined execution clock cycle must have the worst-case stage time of 200 ps
  - even though some stages take only 100 ps.
- Why? To simplify the control of the pipeline: every 200 ps the clock 'signals' to 'move' the instruction currently in stage S of the pipeline to the next stage S+1
  - or to 'get out of the pipeline' the instruction currently in the last stage.

21
???

We can turn the pipelining speed-up discussion above into a formula.

If the stages are perfectly balanced, then the time between instructions on the pipelined processor, assuming ideal conditions, is equal to:

    Time between instructions (pipelined) =
        Time between instructions (nonpipelined) / Number of pipe stages

22
???

Under ideal conditions and with a large number of instructions, the speed-up from pipelining is approximately equal to the number of pipe stages;
that is: a five-stage pipeline is nearly five times faster than a nonpipelined implementation!

23
???

- The example we explored shows, however, that the stages may be imperfectly balanced.
  - There are stages that take 100 ps, whereas others take 200 ps.
- Moreover, pipelining involves some overhead, the source of which will become clearer shortly:
  - hazards.

- Thus, the time per instruction in the pipelined processor will exceed the minimum
possible, and speed-up will be less than the number of pipeline stages.
- in our example: less than 5 times faster

24
???

- Nevertheless, our pipelining still offers a fourfold performance improvement.
- To see why, let us assume we extend the previous figures to 1,000,003 instructions
  - instead of only 3 instructions.
- In the nonpipelined example, each of the 1,000,003 instructions takes 800 ps, so the total execution time would be 1,000,003 × 800 ps = 800,002,400 ps.

- In the pipelined case:
  - The first 3 instructions take a total of 1400 ps.
  - Then, every 200 ps, another of the remaining 1,000,000 instructions will 'come out' of the pipeline.
  - The total execution time would be 1,000,000 × 200 ps + 1400 ps, or 200,001,400 ps.
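The arithmetic above can be verified in a few lines, under the same assumptions (800 ps per nonpipelined instruction; 200 ps pipeline cycle; the first 3 pipelined instructions taking 1400 ps):

```python
n = 1_000_003

nonpipelined = n * 800               # every instruction takes 800 ps
pipelined = 1400 + 1_000_000 * 200   # first 3 in 1400 ps, then one result per 200 ps

print(nonpipelined, pipelined)       # 800002400 200001400
print(nonpipelined / pipelined)      # about a 4x speed-up
```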

25
???

Under these conditions, the ratio of total execution times for real programs on nonpipelined to pipelined processors is close to the ratio of times between instructions:

    800,002,400 ps / 200,001,400 ps ≈ 4.00 ≈ 800 ps / 200 ps

26
???

Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction!

Instruction throughput is the important metric, because real programs execute billions of instructions.

27
Designing Instruction Sets for Pipelining
Pipelining and RISC ISA

- Even with the above simple explanation of pipelining, we can get insight into the design
of the ARM instruction set, which was designed for pipelined execution.

- First, all ARM instructions are the same length.


- This restriction makes it much easier to fetch instructions in the first pipeline stage and to
decode them in the second stage.
- In an instruction set like the x86, where instructions vary from 1 byte to 15 bytes, pipelining is
considerably more challenging.
- Modern implementations of the x86 architecture actually translate x86 instructions into simple
operations that look like ARM instructions and then pipeline the simple operations rather
than the native x86 instructions!

29
Pipelining and RISC ISA

- Second, ARM has just a few instruction formats, with the first source register
and destination register fields being located in the same place in each instruction.

- Third, memory operands only appear in loads or stores in ARM ISA.


- This restriction means we can use the execute stage to calculate the memory address and
then access memory in the following stage.
- If we could operate on the operands in memory, as in the x86, stages 3 and 4 would expand to
an address stage, memory stage, and then execute stage.

30
