Pipelining Part 1: An Overview
Slides marked with … are supplementary (extension) material.
2
What is Pipelining ?
3
We explain the idea of pipelining in the context of the ARM ISA.
The demonstration of pipelining on the ARM ISA will also serve us, more generally, to explain why a RISC ISA has advantages over a CISC ISA in everything related to pipeline performance.
4
ARM ISA
- Memory:
- 2^62 memory words: Memory[0], Memory[4], …, Memory[18,446,744,073,709,551,612]
- Each word = 4 bytes
5
ARM ISA – Arithmetic & Logical Instructions
6
ARM ISA – Data Transfer Instructions
7
ARM ISA - Conditional & Unconditional Branch Instructions
8
‘break’ the instruction cycle into steps
- For our current purpose, describing the concepts of pipelining, we change this description of the instruction cycle,
- and describe the instruction cycle as if it takes five stages:
1. Fetch instruction from memory.
2. Read registers and decode the instruction.
3. Execute the operation or calculate an address.
4. Access an operand in data memory (if necessary).
5. Write the result into a register (if necessary).
- Hence, the ARM ISA pipeline we explore in this lecture has 5 stages.
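As a quick reference (a sketch of mine, not on the original slide), the five stages can be written down in Python; the IF/ID/EX/MEM/WB abbreviations are the conventional names and are an assumption, since the slide only describes the stages in words:

# The five pipeline stages listed above, with conventional abbreviations
STAGES = [
    "IF",   # 1. Fetch instruction from memory
    "ID",   # 2. Read registers and decode the instruction
    "EX",   # 3. Execute the operation or calculate an address
    "MEM",  # 4. Access an operand in data memory (if necessary)
    "WB",   # 5. Write the result into a register (if necessary)
]

# With pipelining, instruction i occupies stage s during clock cycle i + s
# (counting from 0), so a new instruction can enter the pipeline every cycle.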
9
Explanation: ADD X1, X2, X3
1. Fetch instruction from memory.
- translate the current value stored in the PC register to a physical address
- using the MMU
- store the physical address in the MAR [= Memory Address Register]
- initiate a read bus transaction:
- copy the MAR’s value onto the system bus address lines
- raise the appropriate control line of the system bus
- to signal the MM [= Main Memory] controller that a read is requested
- wait for an acknowledgment from the MM controller that the requested data, in this case the instruction, is available on the data lines of the system bus
- copy the value from the data lines to the MBR [= Memory Buffer Register]
- copy the value from the MBR to the IR [= Instruction Register]
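A rough Python sketch of the fetch micro-steps above, assuming hypothetical mmu, bus, and register objects (these names and methods are illustrative only, not a real API):

def fetch_instruction(registers, mmu, bus):
    # Translate the current PC value to a physical address using the MMU
    registers["MAR"] = mmu.translate(registers["PC"])
    # Initiate a read bus transaction: drive the address lines, raise the read control line
    bus.address_lines = registers["MAR"]
    bus.assert_control("READ")        # signals the MM controller that a read is requested
    # Wait for the MM controller to acknowledge that the data (the instruction) is ready
    bus.wait_for_ack()
    # Copy data lines -> MBR, then MBR -> IR
    registers["MBR"] = bus.data_lines
    registers["IR"] = registers["MBR"]
    return registers["IR"]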
10
Explanation: ADD X1, X2, X3 (cont.)
11
Explanation: LDUR X1, [X2,40]
12
Explanation: LDUR X1, [X2,40] (cont.)
13
Explanation: CBZ X1, 25
14
Explanation: CBZ X1, 25 (cont.)
15
Assumption #1
16
Assumption #2
- Assume that the operation times for the major functional units in the example we’ll shortly
describe are:
- 200 ps for memory access for instructions or data,
- 200 ps for ALU operation,
- and 100 ps for register file read or write.
18
???
- The non-pipelined implementation must allow for the slowest instruction!
- Why? Because, as we shall see shortly, we want to be able to quantitatively compare the non-pipelined and the pipelined implementations.
- The slowest instruction is LDUR.
- So, for the purpose of our discussion, we’ll assume the time required for every instruction is 800 ps,
- even though some instructions can be as fast as 500 ps (see the sketch below).
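A minimal sketch of where these numbers come from, assuming each instruction passes through only the stages it needs (the exact breakdown per instruction is my reconstruction from the stage list on slide 8 and the unit times in Assumption #2, not something these slides state explicitly):

# Unit times from Assumption #2 (in ps)
MEM = 200   # memory access, instruction or data
ALU = 200   # ALU operation
REG = 100   # register file read or write

latencies = {
    "LDUR": MEM + REG + ALU + MEM + REG,  # 800 ps: fetch, reg read, address calc, data mem, reg write
    "ADD":  MEM + REG + ALU + REG,        # 600 ps: no data-memory access
    "CBZ":  MEM + REG + ALU,              # 500 ps: no data-memory access, no register write
}

# A non-pipelined (single-cycle) design must allow for the slowest instruction:
assert max(latencies.values()) == 800     # LDUR sets the 800 ps worst case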
19
???
The time between the start of the first instruction and the start of the fourth instruction in the non-pipelined design is 3 × 800 ps = 2400 ps.
20
???
- Just as the non-pipelined design must take the worst-case time of 800 ps, even though some instructions can be as fast as 500 ps, the pipelined execution clock cycle must take the worst-case stage time of 200 ps,
- even though some stages take only 100 ps.
- Why? To simplify the control of the pipeline: every 200 ps the clock ‘signals’ to move the instruction currently in stage S of the pipeline to the next stage, S+1,
- or to move the instruction currently in the last stage out of the pipeline (see the sketch below).
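A short sketch of my own contrasting the two designs: in the non-pipelined design a new instruction starts every 800 ps, while in the pipelined design one starts every 200 ps clock cycle and finishes five cycles later:

NONPIPELINED_PERIOD = 800     # ps between instruction starts (worst-case instruction)
PIPELINED_PERIOD = 200        # ps between instruction starts (worst-case stage)
STAGES = 5

def start(i, period):
    # start time of instruction i (i = 0, 1, 2, ...)
    return i * period

def pipelined_finish(i):
    # instruction i finishes 5 stages (of 200 ps each) after it starts
    return start(i, PIPELINED_PERIOD) + STAGES * PIPELINED_PERIOD

print(start(3, NONPIPELINED_PERIOD))  # 2400 ps: time between the 1st and 4th instruction starts
print(start(3, PIPELINED_PERIOD))     # 600 ps in the pipelined design
print(pipelined_finish(2))            # 1400 ps: the first 3 instructions complete within 1400 ps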
21
???
22
???
Under ideal conditions and with a large number of instructions, the speed-up from pipelining is approximately equal to the number of pipe stages;
that is, a five-stage pipeline is nearly five times faster than a non-pipelined implementation!
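Written as a formula (the standard idealized relation, implied by the slide rather than printed on it):

\[
\text{Time between instructions}_{\text{pipelined}} \approx
\frac{\text{Time between instructions}_{\text{non-pipelined}}}{\text{Number of pipe stages}}
\]

Ideally that would give 800 ps / 5 = 160 ps per instruction here; our pipeline achieves 200 ps because, as the next slide notes, the stages are imperfectly balanced.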
23
???
- The example we explored shows, however, that the stages may be imperfectly balanced.
- Some stages take 100 ps, whereas others take 200 ps.
- Moreover, pipelining involves some overhead, the source of which will become clearer shortly:
- hazards.
- Thus, the time per instruction in the pipelined processor will exceed the minimum possible, and the speed-up will be less than the number of pipeline stages,
- in our example: less than 5 times faster.
24
???
- Nevertheless, our pipeline still offers a fourfold performance improvement.
- To see why, let us extend the previous figures to 1,000,003 instructions,
- instead of only 3 instructions.
- In the non-pipelined example, each of the 1,000,003 instructions takes 800 ps, so the total execution time would be 1,000,003 × 800 ps = 800,002,400 ps.
- In the pipelined example:
- The first 3 instructions take a total of 1400 ps.
- Then, every 200 ps, another of the remaining 1,000,000 instructions will ‘come out’ of the pipeline.
- The total execution time would be 1,000,000 × 200 ps + 1400 ps = 200,001,400 ps (a quick check follows below).
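A quick check of these totals in Python (my own sketch; the closed-form expression (N + 5 − 1) × 200 ps is just another way of writing the slide’s count):

N = 1_000_003                            # number of instructions
nonpipelined = N * 800                   # 800,002,400 ps: every instruction takes 800 ps
pipelined = 1400 + 1_000_000 * 200       # 200,001,400 ps: as computed on the slide

# Equivalent closed form: the pipeline needs N + (5 - 1) cycles of 200 ps each
assert pipelined == (N + 5 - 1) * 200

print(nonpipelined / pipelined)          # ~4.0: a fourfold improvement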
25
???
Under these conditions, the ratio of total execution times for real programs on non-pipelined to pipelined processors is close to the ratio of the times between instructions:
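Working the ratio out with the totals from the previous slide (the equation itself did not survive extraction, so this is a reconstruction):

\[
\frac{800{,}002{,}400~\text{ps}}{200{,}001{,}400~\text{ps}} \approx 4.00 \approx \frac{800~\text{ps}}{200~\text{ps}}
\]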
26
???
Pipelining improves performance by increasing instruction throughput rather than by decreasing the execution time of an individual instruction; instruction throughput is the important metric because real programs execute billions of instructions.
27
Designing Instruction Sets for Pipelining
Pipelining and RISC ISA
- Even with the above simple explanation of pipelining, we can get insight into the design
of the ARM instruction set, which was designed for pipelined execution.
29
Pipelining and RISC ISA
- Second, ARM has just a few instruction formats, with the first source register
and destination register fields being located in the same place in each instruction.
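To illustrate why this helps, here is a small Python sketch; the bit positions assume the LEGv8-style encoding from Patterson & Hennessy (Rn in bits 9–5 and Rd/Rt in bits 4–0 for both the R-format and the D-format), which these slides do not spell out, so treat the exact fields as an assumption:

def register_fields(instruction: int):
    # Because the register fields sit in the same bit positions in every format,
    # the decode stage can read the register file in parallel with figuring out
    # what kind of instruction it is holding.
    rn = (instruction >> 5) & 0x1F    # first source register, bits 9..5
    rd_or_rt = instruction & 0x1F     # destination / transfer register, bits 4..0
    return rn, rd_or_rt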
30