ARM Architecture: Universität Dortmund
[Figure: system block diagram showing registers, cache/SRAM memory, main memory, storage memory, and the I/O interface]
System Components
• The basic components:
– Processor with its associated temporary memory (registers and
cache if available) for code execution
– Main memory and secondary memory where code and data are
temporarily and permanently stored
– Input and output modules that provide interfaces between the
processor and the user
• Connected through an interface bus that consists of
Address, Data, and Control signals
– e.g., AMBA bus for the ARM-based processor
Memory Hierarchy
• A typical processor is supported by:
– on-board main memory (e.g. SDRAM up to GB)
– on-chip or on-die cache memory (e.g. SRAM KB to MB)
– on-die registers
• Some processors also provide general purpose on-chip SRAM
– (e.g. embedded processors) which may be configured as an
SRAM/cache combination (e.g. TI’s DSP)
• Typically, a processor also utilizes secondary non-volatile memory
– for permanent code and data storage, such as Flash-based memory
and hard disks
Address Space
• The address space of a processor depends on its
address decoding mechanism
– Its size will depend on the number of address bits used
• Depending on the processor design, there may be
two types of address space
– one is used by normal memory access
– another one is reserved for I/O peripheral registers (control,
status, and data)
– accessing this alternate address space needs an extra control
signal or a special means of access
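The relation between address bits and address space size can be sketched as follows. This is a minimal illustration, not a description of any real device: the region names and boundaries in `REGIONS` are hypothetical values chosen for a 16-bit, memory-mapped layout.

```python
def address_space_size(address_bits):
    """Number of distinct addresses reachable with the given number of address bits."""
    return 1 << address_bits


# Hypothetical memory-mapped layout for a 16-bit address bus.
# The boundaries are illustrative only and do not describe a real processor.
REGIONS = [
    ("code", 0x0000, 0x7FFF),
    ("data", 0x8000, 0xEFFF),
    ("io", 0xF000, 0xFFFF),
]


def decode(address):
    """Return the name of the region that an address falls into."""
    for name, start, end in REGIONS:
        if start <= address <= end:
            return name
    raise ValueError(f"address {address:#x} is outside the address space")


print(address_space_size(16))  # 16 address bits
print(address_space_size(32))  # 32 address bits
print(decode(0xF004))          # an address in the hypothetical I/O region
```

In this sketch the I/O registers share the memory address space, so no extra control signal is needed to reach them; a processor with a separate I/O address space would need one decoder per space.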
[Figure: address space diagrams showing a processor with code, data, and I/O registers in a memory address space (0x00000000 to 0xFFFFFFFF), and a variant with a separate I/O address space (0x0000 to 0xFFFF)]
Harvard Architecture
• The Harvard architecture utilizes separate instruction
bus and data bus
– code and data may still share the same memory space
[Figure: processor with separate buses for code and data; code occupies 0000h to 7FFFh and data occupies 8000h to FFFFh]
Harvard Features
• Separate instruction and data buses
– allow code and data access at the same time which gives
improved performance
– provide better support for instruction pipeline operations and
shorter instruction execution time
– allow different sizes of data and instructions to be used which
results in more flexibility
– do not incur any code corruption by data which makes the
operations more robust
• But more sophisticated hardware glue logic is
required to support multiple interface buses
• Cortex-M3 core is based on the Harvard architecture
with separate buses for instructions and data
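The performance argument above can be illustrated with a very rough cycle count. This is a sketch under strong assumptions that are not from the slides: every instruction fetch and every data access takes exactly one bus cycle, and with separate buses a fetch and a data access can always overlap.

```python
def bus_cycles_shared(n_fetches, n_data_accesses):
    """One shared bus: instruction fetches and data accesses are serialized."""
    return n_fetches + n_data_accesses


def bus_cycles_separate(n_fetches, n_data_accesses):
    """Separate instruction and data buses: both kinds of access can overlap."""
    return max(n_fetches, n_data_accesses)


# Hypothetical workload: 1000 instructions, 400 of which access data memory.
print(bus_cycles_shared(1000, 400))
print(bus_cycles_separate(1000, 400))
```

Real processors deviate from this idealized model because of wait states, caches, and bus arbitration, so the numbers only indicate the direction of the effect.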
Architecture Variations
[Figure: architecture variations. Panel labels: "Independent data and code memory but with one shared" (label truncated in the source); "Two separate internal bus for code & data (e.g. ARM9)" with a code cache and a data cache; program and data memories starting at 00..00h, with a reset vector]
Processor ‘Size’
• Processor size is described in terms of ‘bits’ (e.g. an
8-bit or 32-bit processor)
– corresponds to the data size that can be manipulated at a time by
the processor
– typically reflected in the size of the processor (internal) data path
and register bank
• An 8-bit processor can only manipulate one byte of
data at a time, while a 32-bit processor can handle
one 32-bit word (four bytes) of data at a time
– even though the data content may only be of single byte size
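The effect of the data path width can be sketched by adding two 32-bit numbers the way an 8-bit data path would have to: one byte at a time, propagating the carry. The function names below are hypothetical, and counting "steps" is only a rough stand-in for the number of add operations.

```python
def add32_on_8bit(a, b):
    """Add two 32-bit values one byte at a time, as an 8-bit data path would.

    Returns (result, number_of_byte_additions).
    """
    result = 0
    carry = 0
    steps = 0
    for i in range(4):  # four bytes, starting with the least significant one
        byte_a = (a >> (8 * i)) & 0xFF
        byte_b = (b >> (8 * i)) & 0xFF
        total = byte_a + byte_b + carry
        result |= (total & 0xFF) << (8 * i)
        carry = total >> 8  # carry into the next, more significant byte
        steps += 1
    return result, steps


def add32_on_32bit(a, b):
    """Add two 32-bit values in a single step, as a 32-bit data path would."""
    return (a + b) & 0xFFFFFFFF, 1


print(add32_on_8bit(0x000000FF, 0x00000001))
print(add32_on_32bit(0x000000FF, 0x00000001))
```

Both functions give the same result; the 8-bit version simply needs four additions where the 32-bit version needs one.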
Registers
• The most fundamental storage area in the processor
– is located within the processor itself
– provides very fast access, operating at the same frequency as the
processor clock
– but is of limited quantity (typically less than 100)
• Most are of the general purpose type and can store
any type of information:
– data – e.g., timer value, constants
– address – e.g., of an ASCII table or the stack
• Some are reserved for specific purposes
– program counter (r15 in ARM)
– program status register (CPSR in ARM)
Data Alignment
• A 32-bit data item consists of four bytes, which are
stored in four successive memory locations
• Data and code must be aligned to the boundary that matches
their size.
– e.g., for a 32-bit system which aligns to the word boundary, the
lowest two address bits are equal to zero
• But what is the order of the four bytes of data?
– depends on the Endianness adopted
• In the Little Endian format,
– the least significant byte (LSB) is stored in the lowest address of
the memory, with the most significant byte (MSB) stored in the
highest address location of the memory.
• In the Big Endian format,
– the least significant byte (LSB) is stored in the highest address of the memory,
with the most significant byte (MSB) stored in the lowest address location of the
memory.
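Both points can be checked with a short sketch using Python's standard `struct` module. The example value `0x12345678` and the helper name `is_word_aligned` are chosen here for illustration only.

```python
import struct


def is_word_aligned(address):
    """A word-aligned address has its lowest two address bits equal to zero."""
    return (address & 0b11) == 0


value = 0x12345678  # example 32-bit value: MSB is 0x12, LSB is 0x78

little = struct.pack("<I", value)  # little endian: LSB at the lowest address
big = struct.pack(">I", value)     # big endian: MSB at the lowest address

# Print the byte stored at each successive address offset.
for offset in range(4):
    print(offset, hex(little[offset]), hex(big[offset]))

print(is_word_aligned(0x1000), is_word_aligned(0x1002))
```

Offset 0 holds `0x78` in the little endian layout and `0x12` in the big endian layout, matching the two definitions above.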
Data Endianness
[Figure: a 32-bit value (MSB to LSB) stored in two memory address spaces, each starting at 0x000000]
Comparison
• Little Endian
– The byte order matches the way processor instructions typically
process numbers, from LSB to MSB
– The byte number corresponds with the address offset, suitable for
multi-precision data manipulation
• Big Endian
– Can compare numerical data by just accessing the zero offset byte
– Corresponds to the written order of number (starting with the most
significant digit)
• Some processors (e.g. ARM) have bi-endian
hardware that features ‘switchable’ endianness
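The two main arguments in this comparison can be sketched as follows, assuming unsigned 32-bit values. The helper names are hypothetical. In the little endian layout the byte at offset i carries the weight 256 to the power i; in the big endian layout, comparing bytes from offset zero onwards gives the same answer as comparing the numbers.

```python
def le_value(data):
    """Rebuild an unsigned number from little endian bytes.

    The byte at address offset i contributes byte << (8 * i), so the
    offset corresponds directly to the byte's significance.
    """
    return sum(byte << (8 * offset) for offset, byte in enumerate(data))


def be_less_than(a, b):
    """Compare two unsigned 32-bit numbers via their big endian bytes.

    Python compares bytes objects from offset 0 onwards, which for big
    endian data means starting with the most significant byte.
    """
    return a.to_bytes(4, "big") < b.to_bytes(4, "big")


a, b = 0x000000FF, 0x01000000
print(le_value(a.to_bytes(4, "little")) == a)
print(be_less_than(a, b), a < b)
# The same byte-wise comparison on little endian data gives the wrong order:
print(a.to_bytes(4, "little") < b.to_bytes(4, "little"))
```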
CISC
• Features of the Complex Instruction Set Computing (CISC):
– many instructions
– complex instructions
o each instruction can execute several low level operations
– complex addressing modes
o smaller number of registers needed
• A semantically rich instruction set is accommodated by
allowing instructions of variable length
Advantages of CISC
• As each instruction can execute several low level
operations,
– the code size is reduced to save on memory requirements
– fewer main memory accesses are required and hence processing time is
reduced (faster)
• Backward code compatibility is maintained
– can add new (and more powerful) instructions while retaining the ‘old’
instruction set for code compatibility (i.e. legacy programs can still run)
• Easy to program
– direct support of high-level language constructs
– complex instructions that fit well with high-level language expressions
Limitations of CISC
• A highly encoded instruction set needs to be decoded
by microcode or dedicated hardwired circuitry
– more complex hardware design
– slower instruction decoding/execution
• Variable length instructions
– different execution time among instructions
– affects pipelined operations
RISC
RISC – Reduced Instruction Set Computing
• Small instruction sets
• Simpler instructions
• Fixed length instructions
• Large number of registers
• Simpler addressing mode with the Load/Store
instructions for accessing memory
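The Load/Store idea can be sketched with a toy register machine. Everything here is a simplified model for illustration: the tuple-based instruction format, the four registers, and the memory addresses are assumptions, and only the mnemonics LDR, STR, and ADD are borrowed from ARM. Arithmetic works on registers only, so adding two values held in memory takes four instructions.

```python
def run(program, memory, n_registers=4):
    """Execute a list of (opcode, operands...) tuples on a toy load/store machine."""
    regs = [0] * n_registers
    for op, *args in program:
        if op == "LDR":    # load: register <- memory[address]
            rd, address = args
            regs[rd] = memory[address]
        elif op == "STR":  # store: memory[address] <- register
            rs, address = args
            memory[address] = regs[rs]
        elif op == "ADD":  # register-to-register add, no memory access
            rd, rn, rm = args
            regs[rd] = (regs[rn] + regs[rm]) & 0xFFFFFFFF
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs


# memory[0x18] = memory[0x10] + memory[0x14]
memory = {0x10: 7, 0x14: 5, 0x18: 0}
program = [
    ("LDR", 0, 0x10),
    ("LDR", 1, 0x14),
    ("ADD", 2, 0, 1),
    ("STR", 2, 0x18),
]
run(program, memory)
print(memory[0x18], len(program))
```

A CISC-style instruction set could express the same task as a single memory-to-memory add, which is the code density trade-off between the two approaches.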
Advantages of RISC
• Simpler instructions
– one clock per instruction gives faster execution than on a CISC
processor with the same clock speed
• Simpler addressing mode
– faster decoding
• Fixed length instructions
– faster decoding and better pipeline performance
• Simpler hardware
– less silicon area
– less power consumption
Limitations of RISC
• Fewer instructions than CISC
– as compared to CISC, RISC needs more instructions to execute one
task
– code density is less
– needs more memory
• No complex instructions
– often no hardware support for division or floating-point arithmetic
operations in simple RISC cores
– needs a more complex compiler and longer compiling time
• But ARM also adds DSP-like instructions to support
commonly used signal processing functions
Instruction Execution
• Multiple stages are involved in executing an
instruction.
– Example:
1) Fetching the instruction code
2) Decoding the instruction code
3) Executing the instruction code
• Hence multiple processor clock cycles are needed to
execute one single instruction.
[Figure: execution timeline of the 1st and 2nd instructions]
Instruction Pipeline
• The pipeline allows concurrent execution of multiple
different instructions
– execution of different stages of multiple instructions at the same time
• During a normal operation
– while one instruction is being executed
– the next instruction is being decoded
– and a third instruction is being fetched from memory
– allows effective throughput to increase to one instruction per clock cycle
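The throughput claim can be checked with a small cycle count. This sketch assumes an ideal pipeline in which every stage takes exactly one clock cycle and there are no stalls or branches; the function names are hypothetical.

```python
def cycles_unpipelined(n_instructions, n_stages=3):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=3):
    """The first instruction fills the pipeline; after that, one completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n = 100
print(cycles_unpipelined(n), cycles_pipelined(n))
print(cycles_pipelined(n) / n)  # average cycles per instruction
```

As the number of instructions grows, the average approaches one clock cycle per instruction, even though each individual instruction still takes three cycles from fetch to execute.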
Cortex-M3 Pipeline
• The Cortex-M3 uses a 3-stage pipeline for instruction
execution
– Fetch ⇒ Decode ⇒ Execute
– Pipeline design allows effective throughput to increase to one
instruction per clock cycle
– Allows the next instruction to be fetched while still decoding or
executing the previous instructions
Pipelined Architecture
• A longer pipeline can also be used to further break down
the operation carried out in the individual stage
– simpler logic for each stage to increase system clock
– Example: a 5-stage instruction pipeline
Fetch Instruction ⇒ Decode Instruction ⇒ Fetch Operand ⇒
Execute Instruction ⇒ Store Result
[Figure: parallel execution of multiple instructions (1st to 5th), each one stage behind the previous one along the time axis]
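The staggered execution in a 5-stage pipeline can be reproduced as a small text table. This is a sketch assuming one clock cycle per stage and no stalls; the two-letter stage abbreviations (FI, DI, FO, EX, SR) are introduced here for compact printing and are not from the slides.

```python
# Fetch Instruction, Decode Instruction, Fetch Operand, Execute Instruction, Store Result
STAGES = ["FI", "DI", "FO", "EX", "SR"]


def pipeline_table(n_instructions, stages=STAGES):
    """Return one row per instruction; each row lists the stage occupied in every cycle."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = ["--"] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters the pipeline in cycle i
        rows.append(row)
    return rows


for number, row in enumerate(pipeline_table(5), start=1):
    print(f"instr {number}: " + " ".join(row))
```

Reading one column of the output shows which stage each instruction occupies in that clock cycle; from the fifth cycle on, all five stages are busy at once.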
• ARM7
– Introduced in 1994.
– More than 10 billion ARM7 processor family-based devices have
powered a wide variety of applications.
– Today it is used for simple 32-bit devices.
• ARM9
– Is the most popular ARM processor family ever.
– Over 5 billion ARM9 processors have been shipped so far.
– Successfully deployed across a wide range of applications.
• Cortex-M0:
– Ultra low gate count (less than 12 K gates).
– Ultra low-power (3 µW/MHz).
– 32-bit processor.
• Cortex-M1:
– The first ARM processor designed specifically for implementation
in FPGAs.
– Supports all major FPGA vendors.
– Easy migration path from FPGA to ASIC.
• Cortex-M3:
– The mainstream ARM processor for microcontroller applications.
– High performance and energy efficiency.
• Harvard architecture:
– Separate Instruction & Data buses enable
parallel fetch & store.
• Advanced 3-Stage Pipeline:
– Includes Branch Forwarding & Speculation
• Additional Write-Back via Bus Matrix.
• Cortex-M4:
– The latest embedded processor for DSP.
• Cortex-R4:
– First embedded real-time processor based on the ARMv7-R
architecture.
– For high-volume deeply-embedded System-on-Chip applications:
• Hard disk drive controllers.
• Wireless baseband processors.
• Electronic control units for automotive systems.
• Cortex-R5:
– Extends the feature set of the Cortex-R4.
– Enables:
• Higher levels of system performance.
• Increased efficiency and reliability.
• Enhanced error management in dependable real-time systems.
• Cortex-R7:
– High-performance dual-core.
– It is the highest performing Cortex-R series processor.
• Cortex-A5:
– Delivers high-end features to power- and cost-sensitive
applications.
– Single core version also available.
– Suitable for a range of devices, from entry-level smartphones,
low-cost handsets and smart mobile devices to pervasive embedded,
consumer and industrial devices.
• Cortex-A8:
– Single core only.
– Can scale in speed from 600 MHz to greater than 1 GHz.
– Suitable for high-end feature phones, netbooks, DTVs, printers
and automotive infotainment.
• Cortex-A9:
– From 1 to 4 cores.
– Extremely high levels of performance and power efficiency.
– Ideal solution for designs requiring high performance in
low-power or thermally constrained cost-sensitive devices.
• Cortex-A15:
– Ultra low-power.
– Suitable for:
• Advanced smartphones.
• Mobile computing.
• High-end digital home entertainment.
• Wireless infrastructure.
• Low-power servers.