2 Programming Model and Pipelining
$$T_{prog} = \frac{N_{inst} \times CPI}{f_{clk}}$$
• where
− Ninst is the number of ARM instructions executed in the
course of the program
− CPI is the average number of clock cycles per instruction
− fclk is the processor's clock frequency.
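As a quick illustration, the formula can be evaluated directly. This is a minimal sketch; the function name `program_time` and the numbers passed to it are hypothetical values chosen only to show the arithmetic, not measurements of any real ARM core.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Execution time in seconds: T_prog = N_inst * CPI / f_clk."""
    return n_inst * cpi / f_clk_hz


# Hypothetical workload: 1,000,000 instructions, average CPI of 1.5,
# 100 MHz clock. These figures are assumptions for illustration only.
t = program_time(n_inst=1_000_000, cpi=1.5, f_clk_hz=100e6)
print(f"{t * 1e3:.1f} ms")  # prints "15.0 ms"
```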
5 stage Pipelining ARM organization
Since Ninst is constant for a given program, there are only two
ways to increase performance:
1. Increase the clock rate, fclk
o This requires the logic in each pipeline stage to be
simplified and, therefore, the number of pipeline stages to
be increased
2. Reduce the average number of clock cycles per instruction,
CPI
o This requires either that instructions which occupy more
than one pipeline slot in a 3-stage pipeline ARM are
re-implemented to occupy fewer slots, or that pipeline stalls
caused by dependencies between instructions are
reduced, or a combination of both.
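Because Ninst cancels out when two implementations run the same program, the combined effect of both options can be expressed as a single ratio. The sketch below assumes a hypothetical helper `speedup`; the CPI and clock values in the usage line are illustrative assumptions, not figures for any particular core.

```python
def speedup(cpi_old, f_old_hz, cpi_new, f_new_hz):
    """Ratio T_old / T_new for a fixed instruction count.

    From T_prog = N_inst * CPI / f_clk, N_inst cancels, leaving
    (CPI_old / f_old) / (CPI_new / f_new).
    """
    return (cpi_old / f_old_hz) / (cpi_new / f_new_hz)


# Hypothetical comparison: the clock rate doubles and the CPI falls
# from 2.0 to 1.5 (assumed numbers).
print(round(speedup(2.0, 50e6, 1.5, 100e6), 2))
```

A value above 1 means the new design runs the same program faster; raising fclk and lowering CPI multiply together rather than add.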
Fetch - the instruction is fetched from
memory and placed in the instruction
pipeline.
Decode - the instruction is decoded and
register operands are read from the register
file. There are three operand read ports in
the register file, so most ARM instructions
can source all their operands in one cycle.
Execute - an operand is shifted and the
ALU result is generated. If the instruction is
a load or store, the memory address is
computed in the ALU.
Buffer/data - data memory is accessed if
required. Otherwise the ALU result is
simply buffered for one clock cycle to give
the same pipeline flow for all instructions.
Write-back - the results generated by the
instruction are written back to the
register file, including any data loaded
from memory.
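The flow of instructions through the five stages can be sketched as a simple schedule. This is an idealised model that assumes one instruction enters the pipeline per cycle and no stalls occur; the names `STAGES`, `schedule` and `total_cycles` are hypothetical helpers, not part of any ARM tool.

```python
STAGES = ["Fetch", "Decode", "Execute", "Buffer/data", "Write-back"]


def schedule(n_instructions):
    """Per instruction, the 1-based cycle in which it occupies each stage.

    Assumes an ideal pipeline: one instruction issued per cycle, no stalls.
    """
    return [
        {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(n_instructions)
    ]


def total_cycles(n_instructions):
    """Cycles until the last instruction leaves Write-back (ideal case)."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


# Show where each of three instructions is on every cycle.
for i, row in enumerate(schedule(3)):
    print(f"instr {i}: " + ", ".join(f"{st}@{cy}" for st, cy in row.items()))
```

In this ideal case the first instruction takes five cycles to complete, and each later one completes one cycle after its predecessor, so the CPI approaches 1 for long instruction sequences.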
The fundamental problem with reducing the CPI relative to a
3-stage core is related to the von Neumann bottleneck: any
stored-program computer with a single instruction and data
memory will have its performance limited by the available
memory bandwidth.
A 3-stage ARM core accesses memory on (almost) every clock
cycle, either to fetch an instruction or to transfer data.
Simply tightening up on the few cycles where the memory is not
used will yield only a small performance gain.
To get a significantly better CPI, the memory system must
deliver more than one value in each clock cycle, either by
delivering more than 32 bits per cycle from a single memory or
by having separate memories for instruction and data accesses.
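The bandwidth argument can be made concrete with a deliberately simplified model. It assumes each instruction needs one instruction fetch, that a given fraction of instructions make exactly one data access, and that each memory port serves one access per cycle; multi-register transfers and all other stall sources are ignored. The function name and the fraction used are hypothetical.

```python
def cpi_lower_bound(data_access_fraction, split_memories):
    """Memory-bandwidth lower bound on CPI under the simplified model.

    data_access_fraction: assumed share of instructions that perform
    one data memory access (loads and stores).
    split_memories: True for separate instruction and data memories,
    False for a single shared memory.
    """
    fetches_per_inst = 1.0
    data_per_inst = data_access_fraction
    if split_memories:
        # The two ports work in parallel; the busier one sets the limit.
        return max(fetches_per_inst, data_per_inst)
    # A single port must serve fetches and data accesses one at a time.
    return fetches_per_inst + data_per_inst


# Assumed mix: one in four instructions is a load or store.
print(cpi_lower_bound(0.25, split_memories=False))  # prints 1.25
print(cpi_lower_bound(0.25, split_memories=True))   # prints 1.0
```

Under these assumptions a single memory cannot bring the CPI down to 1 however the pipeline is organised, whereas separate memories remove that particular limit.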
Higher-performance ARM cores employ a 5-stage pipeline and
have separate instruction and data memories.
Breaking instruction execution down into five components
rather than three reduces the maximum work which must be
completed in a clock cycle, and hence allows a higher clock
frequency to be used.
Separate instruction and data memories (which may be separate
caches connected to a unified instruction and data main
memory) allow a significant reduction in the core's CPI.
A typical 5-stage ARM pipeline is that employed in the
ARM9TDMI.