ARM Reference Notes
ARM ARCHITECTURE
ARM
ARM, previously Advanced RISC Machine and originally Acorn RISC Machine, is a family
of reduced instruction set computing (RISC) architectures for computer processors.
The ARM processor core is a key component of many successful 32-bit embedded
systems.
2. Pipelines: The processing of instructions is broken down into smaller units that can be
executed in parallel by pipelines. Ideally the pipeline advances by one step on each cycle
for maximum throughput. There is no need for an instruction to be executed by a
miniprogram called microcode, as on CISC processors.
3. Registers: RISC machines have a large general-purpose register set. Any register can
contain either data or an address. In contrast, CISC processors have dedicated registers
for specific purposes.
4. Load-store architecture: The processor operates on data held in registers. Separate load
and store instructions transfer data between the register bank and external memory. In
contrast, with a CISC design the data processing operations can act on memory directly.
Registers:
ARM processors provide general-purpose and special-purpose registers. Some additional
registers are available in privileged execution modes.
In all ARM processors, the following registers are available and accessible in any processor
mode:
The number of registers depends on the ARM version. According to the ARM Reference
Manual, there are 30 general-purpose 32-bit registers, with the exception of ARMv6-M and
ARMv7-M based processors. The first 16 registers are accessible in user-level mode; the
additional registers are available in privileged software execution (with the exception of
ARMv6-M and ARMv7-M). In this tutorial series we will work with the registers that are
accessible in any privilege mode: r0-r15. These 16 registers can be split into two groups:
general-purpose and special-purpose registers.
R0-R12: can be used during common operations to store temporary values, pointers (memory
locations), etc. R0, for example, can be referred to as the accumulator during arithmetic operations
or used for storing the result of a previously called function. R7 becomes useful while working with
syscalls, as it stores the syscall number, and R11 helps us keep track of boundaries on the stack,
serving as the frame pointer (covered later). Moreover, the function calling convention on
ARM specifies that the first four arguments of a function are stored in the registers r0-r3.
R13: SP (Stack Pointer). The Stack Pointer points to the top of the stack. The stack is an area of
memory used for function-specific storage, which is reclaimed when the function returns. The
stack pointer is therefore used for allocating space on the stack, by subtracting the value (in
bytes) we want to allocate from the stack pointer. In other words, if we want to allocate a 32-bit
value, we subtract 4 from the stack pointer.
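The stack pointer arithmetic described above can be sketched in C. This is a minimal model, assuming a full descending stack (the stack pointer moves to lower addresses as space is allocated). The names stack_mem, sp, push_word, and pop_word are hypothetical and exist only to illustrate the bookkeeping.

```c
#include <stdint.h>

#define STACK_WORDS 16

/* Hypothetical model: a small block of memory used as the stack. */
static uint32_t stack_mem[STACK_WORDS];

/* Model of the stack pointer as a byte offset into stack_mem.
   It starts just past the end because the stack grows downward. */
static uint32_t sp = STACK_WORDS * 4;

/* Allocate 4 bytes by subtracting 4 from sp, then store the value. */
static void push_word(uint32_t value)
{
    sp -= 4;
    stack_mem[sp / 4] = value;
}

/* Read the value at the top of the stack, then release its 4 bytes. */
static uint32_t pop_word(void)
{
    uint32_t value = stack_mem[sp / 4];
    sp += 4;
    return value;
}
```

Pushing two words lowers sp by 8 bytes, and popping them returns the values in reverse order and restores sp to its starting value. The sketch does no bounds checking.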
R14: LR (Link Register). When a function call is made, the Link Register is updated with the
memory address of the instruction that follows the call. This allows the program to return to
the calling function once the called function has finished.
R15: PC (Program Counter). The Program Counter is automatically incremented by the size of
the instruction executed. This size is always 4 bytes in ARM state and 2 bytes in Thumb state.
When a branch instruction is executed, the PC holds the destination address. During
execution, the PC stores the address of the current instruction plus 8 (two ARM instructions) in
ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb(v1) state. This
is different from x86, where the PC always points to the next instruction to be executed.
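The offset rule above can be written as a small C helper. This is only a sketch of the arithmetic; the function name pc_read_value and its parameters are hypothetical.

```c
#include <stdint.h>

/* Value observed when an instruction reads the pc, given the address of
   that instruction. thumb is nonzero for Thumb state, zero for ARM state. */
static uint32_t pc_read_value(uint32_t instr_addr, int thumb)
{
    /* Two instructions ahead: 2 * 4 bytes in ARM state,
       2 * 2 bytes in Thumb state. */
    return instr_addr + (thumb ? 4u : 8u);
}
```

For an instruction at address 0x8000, the helper returns 0x8008 in ARM state and 0x8004 in Thumb state.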
The execution state bits for the IT block (on ARMv6T2 and later).
The Current Program Status Register is a 32-bit wide register used in the ARM architecture to
record various pieces of information regarding the state of the program being executed by the
processor and the state of the processor. This information is recorded by setting or clearing
specific bits in the register.
The top four bits (bits 31, 30, 29, and 28) are the condition code (cc) bits and are of most interest
to us. Condition code bits are sometimes referred to as "flags". The lowest 8 bits (bit 7 through to
bit 0) store information about the processor's own state. The remaining bits (i.e. bit 27 to bit 8)
are currently unused in most ARM processors.
The N bit is the "negative flag" and indicates that a value is negative.
The Z bit is the "zero flag" and is set when an appropriate instruction produces a zero result.
The C bit is the "carry flag", but it can also be used to indicate "borrows" (from subtraction
operations) and "extends" (from shift instructions).
The V bit is the "overflow flag" which is set if an instruction produces a result that overflows and
hence may go beyond the range of numbers that can be represented in 2's complement signed
format.
The I and F bits determine whether interrupts (such as requests for input/output) are
enabled or disabled.
The T bit indicates whether the processor is in Thumb state, where the processor can
execute a subset of the assembly language as 16-bit compact instructions. As Thumb code packs
more instructions into the same amount of memory, it is an effective solution for applications
where physical memory is at a premium.
The M4 to M0 bits are the mode bits. Application programs normally run in user mode (where
the mode bits are 10000). Whenever an interrupt or similar event occurs, the processor switches
into one of the alternative modes allowing the software handler greater privileges with regard to
memory manipulation.
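The bit layout described above can be illustrated by decoding a cpsr value in C. This is a sketch, assuming the classic layout with I, F, and T in bits 7, 6, and 5 and the mode field in bits 4 to 0. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the fields discussed in the text. */
struct cpsr_fields {
    unsigned n, z, c, v;   /* condition flags, bits 31..28 */
    unsigned i, f;         /* interrupt mask bits 7 and 6 (1 = disabled) */
    unsigned t;            /* Thumb state bit, bit 5 */
    unsigned mode;         /* mode bits M4..M0, bits 4..0 */
};

/* Extract each field by shifting it to bit 0 and masking. */
static struct cpsr_fields decode_cpsr(uint32_t cpsr)
{
    struct cpsr_fields out;
    out.n = (cpsr >> 31) & 1u;
    out.z = (cpsr >> 30) & 1u;
    out.c = (cpsr >> 29) & 1u;
    out.v = (cpsr >> 28) & 1u;
    out.i = (cpsr >> 7) & 1u;
    out.f = (cpsr >> 6) & 1u;
    out.t = (cpsr >> 5) & 1u;
    out.mode = cpsr & 0x1Fu;
    return out;
}
```

For example, decoding 0x600000D0 gives Z and C set, N and V clear, both interrupt masks set, ARM state, and mode bits 10000 (user mode).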
Fetch
Decode
Execute.
During normal operation, while one instruction is being executed, its successor is being decoded,
and a third instruction is being fetched from memory. The program counter points to the
instruction being fetched rather than to the instruction being executed. This is important because
it means that the Program Counter (PC) value used in an executing instruction is always two
instructions ahead of the address of that instruction.
The pipeline design for each ARM family differs. For example, the ARM9 core increases the
pipeline length to five stages, as shown in Figure 2.9. The ARM9 adds a memory and writeback
stage, which allows the ARM9 to process on average 1.1 Dhrystone MIPS per MHz, an
increase in instruction throughput of around 13% compared with an ARM7. The maximum core
frequency attainable using an ARM9 is also higher.
The ARM10 increases the pipeline length still further by adding a sixth stage, as shown in Figure
2.10. The ARM10 can process on average 1.3 Dhrystone MIPS per MHz, about 34% more
throughput than an ARM7 processor core, but again at a higher latency cost.
Even though the ARM9 and ARM10 pipelines are different, they still use the same pipeline
executing characteristics as an ARM7. Code written for the ARM7 will execute on an ARM9 or
ARM10.
The address is within a special address range called the vector table. The entries in the vector
table are instructions that branch to specific routines designed to handle a particular exception or
interrupt.
The memory map address 0x00000000 is reserved for the vector table, a set of 32-bit words. On
some processors the vector table can optionally be located at a higher address in memory.
Some operating systems can take advantage of this feature.
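When an exception or interrupt occurs, the processor suspends normal execution and starts
loading instructions from the exception vector table (see Table 2.6). Each vector table entry
contains a form of branch instruction pointing to the start of a specific routine.
Reset vector is the location of the first instruction executed by the processor when power is
applied. This instruction branches to the initialization code.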
Software interrupt vector is called when you execute a SWI instruction. The SWI instruction is
frequently used as the mechanism to invoke an operating system routine.
Prefetch abort vector occurs when the processor attempts to fetch an instruction from an address
without the correct access permissions. The actual abort occurs in the decode stage.
Data abort vector is similar to a prefetch abort but is raised when an instruction attempts to
access data memory without the correct access permissions.
Interrupt request vector is used by external hardware to interrupt the normal execution flow of
the processor. It can only be raised if IRQs are not masked in the cpsr.
The vector table.
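As an illustration of how vector addresses are formed, the sketch below computes the address of each entry from a table base. It assumes the classic ARM layout of eight consecutive 32-bit entries, including the undefined instruction, reserved, and fast interrupt entries that are not described in the text above. The enum and function names are hypothetical, and the layout should be checked against the reference manual for a given core.

```c
#include <stdint.h>

/* Exception kinds in assumed table order, one 32-bit word per entry. */
enum arm_exception {
    EXC_RESET = 0,
    EXC_UNDEFINED,
    EXC_SWI,
    EXC_PREFETCH_ABORT,
    EXC_DATA_ABORT,
    EXC_RESERVED,
    EXC_IRQ,
    EXC_FIQ
};

/* Address of a vector: table base plus 4 bytes per entry. */
static uint32_t vector_address(uint32_t table_base, enum arm_exception exc)
{
    return table_base + 4u * (uint32_t)exc;
}
```

With a base of 0x00000000, the software interrupt vector is at 0x08 and the interrupt request vector is at 0x18.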
Architecture Revision
Every ARM processor implementation executes a specific instruction set architecture (ISA),
although an ISA revision may have more than one processor implementation.
The ISA has evolved to keep up with the demands of the embedded market. This evolution has
been carefully managed by ARM, so that code written to execute on an earlier architecture
revision will also execute on a later revision of the architecture.
Before we go on to explain the evolution of the architecture, we must introduce the ARM
processor nomenclature. The nomenclature identifies individual processors and provides basic
information about the feature set.
Nomenclature
ARM uses the nomenclature shown below to describe the processor implementations.
ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}
S: synthesizable version
The letters and numbers after "ARM" indicate the features a processor
may have. In the future the number and letter combinations may change as more features are
added. Note the nomenclature does not include the architecture revision information.
There are a few additional points to make about the ARM nomenclature:
All ARM cores after the ARM7TDMI include the TDMI features even though they may
not include those letters after the "ARM" label.
The processor family is a group of processor implementations that share the same
hardware characteristics. For example, the ARM7TDMI, ARM740T, and ARM720T all share
the same family characteristics and belong to the ARM7 family.
JTAG is described by IEEE 1149.1 Standard Test Access Port and boundary scan
architecture. It is a serial protocol used by ARM to send and receive debug information between the
processor core and test equipment.
EmbeddedICE macrocell is the debug hardware built into the processor that allows
breakpoints and watchpoints to be set.
Synthesizable means that the processor core is supplied as source code that can be
compiled into a form easily used by EDA tools.
Architecture Evolution
The architecture has continued to evolve since the first ARM processor implementation was
introduced in 1985.
The various parts of the program status register and the availability of certain features vary
between architecture revisions.
Table 2.9 shows a rough comparison of attributes between the ARM7, ARM9, ARM10, and
ARM11 cores. The numbers quoted can vary greatly and are directly dependent upon the type
and geometry of the manufacturing process, which has a direct effect on the frequency (MHz)
and power consumption (watts).
Within each ARM family, there are a number of variations of memory management, cache, and
TCM processor extensions. ARM continues to expand both the number of families available and
the different variations within each family.
You can find other processors that execute the ARM ISA such as StrongARM and XScale. These
processors are unique to a particular semiconductor company, in this case Intel.
Table 2.10 summarizes the different features of the various processors. The next subsections
describe the ARM families in more detail, starting with the ARM7 family.
ARM7 Family
The ARM7 core has a Von Neumann style architecture, where both data and instructions use the
same bus. The core has a three-stage pipeline and executes the architecture ARMv4T instruction
set.
The ARM7TDMI was the first of a new range of processors introduced in 1995 by ARM. It is
currently a very popular core and is used in many 32-bit embedded processors. It provides a very
good performance-to-power ratio. The ARM7TDMI processor core has been licensed by many
of the top semiconductor companies and includes the Thumb
instruction set, a fast multiply instruction, and the EmbeddedICE debug technology.
One variation in the ARM7 family is the ARM7TDMI-S. The ARM7TDMI-S has the
same operating characteristics as a standard ARM7TDMI but is also synthesizable. The ARM720T
includes an MMU. The presence of the
MMU means the ARM720T is capable of handling the Linux and Microsoft embedded platform
operating systems.
Another variation is the ARM7EJ-S processor, also synthesizable. ARM7EJ-S is quite different
since it includes a five-stage pipeline and executes ARMv5TEJ instructions. This version of the
ARM7 is the only one that provides both Java acceleration and the enhanced instructions but
without any memory protection.
ARM9 FAMILY
With its five-stage pipeline, the ARM9
processor can run at higher clock frequencies than the ARM7 family. The extra stages improve
the overall performance of the processor. The memory system has been redesigned to follow the
Harvard architecture, which separates the data D and instruction I buses.
The ARM920T includes separate D + I caches
and an MMU. This processor can be used by operating systems requiring virtual memory
support. ARM922T is a variation on the ARM920T but with half the D + I cache size.
The ARM940T includes a smaller D + I cache and an MPU. The ARM940T is designed for
applications that do not require a platform operating system. Both ARM920T and ARM940T
execute the architecture v4T instructions.
The next processors in the ARM9 family were based on the ARM9E-S core. This core is a
synthesizable version of the ARM9 core with the E extensions. There are two variations: the
ARM946E-S and the ARM966E-S. Both execute architecture v5TE instructions. They also
support the optional embedded trace macrocell (ETM), which allows a developer to trace
instruction and data execution in real time on the processor. This is important when debugging
applications with time-critical segments.
The ARM946E-S includes TCM, cache, and an MPU. The sizes of the TCM and caches are
configurable. This processor is intended for applications that require
deterministic real-time response. In contrast, the ARM966E-S does not have the MPU and cache
extensions.
The latest core in the ARM9 product line is the ARM926EJ-S synthesizable processor core,
announced in 2000. It is designed for use in small portable Java-enabled devices such as 3G
phones and personal digital assistants (PDAs). The ARM926EJ-S is the first ARM processor
core to include the Jazelle technology, which accelerates
Java bytecode execution.
ARM10 Family
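The ARM10, announced in 1999, was designed for performance. It extends the ARM9 pipeline
to six stages. It also supports an optional vector floating-point (VFP) unit, which adds a seventh
stage to the ARM10 pipeline. The VFP significantly increases floating-point performance and is
compliant with the IEEE 754 floating-point standard.
The ARM1020E includes the
enhanced E instructions. It has separate 32K D + I caches, an optional vector floating-point unit, and
an MMU. The ARM1020E also has a dual 64-bit bus interface for increased performance.
ARM1026EJ-S is very similar to the ARM926EJ-S but with both MPU and MMU. This
processor has the performance of the ARM10 with the flexibility of
an ARM926EJ-S.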
ARM11 Family
The ARM1136J-S, announced in 2003, was designed for high-performance and power-efficient
applications. The ARM1136J-S executes architecture
ARMv6 instructions. It incorporates an eight-stage pipeline with separate load-store and
arithmetic pipelines. Included in the ARMv6 instructions are single instruction multiple data
(SIMD) extensions for media processing, intended to increase
performance.
Specialized Processors
StrongARM was originally co-developed by Digital Semiconductor and is now exclusively
licensed by Intel Corporation. It has been popular for PDAs and applications that require
performance with low power consumption. It is a Harvard architecture with separate D + I caches.
StrongARM includes a five-stage pipeline, but
it does not support the Thumb instruction set.
Another specialized core is designed for low-power
applications and is based on an ARM7TDMI core
with an MPU. This core is small and has low voltage and current requirements, which makes it
attractive for smart card applications.
UNIT-II
ARM instructions process data held in registers and only access memory with load and store
instructions. ARM instructions commonly take two or three operands. For instance the ADD
instruction below adds the two values stored in registers r1 and r2 (the source registers). It writes
the result to register r3 (the destination register).
Instruction syntax    Destination register (Rd)    Source register 1 (Rn)    Source register 2 (Rm)
ADD r3, r1, r2        r3                           r1                        r2
In the following sections we examine the function and syntax of the ARM instructions by
instruction class: data processing instructions, branch instructions,
load-store instructions, software interrupt instruction, and program status register instructions.
Move Instructions
Move is the simplest ARM instruction. It copies N into a destination register Rd, where N is a
register or immediate value. This instruction is useful for setting initial values and transferring
data between registers.
Syntax: <instruction>{<cond>}{S} Rd, N
The second operand N can take several forms, which are common to all data processing
instructions. Usually it is a register Rm or a constant preceded by #.
Barrel Shifter
In the simplest case, N in a MOV instruction is a register. But N can be more than just a register or
immediate value; it can also be a register Rm that has been preprocessed by the barrel shifter
prior to being used by a data processing instruction.
Data processing instructions are processed within the arithmetic logic unit (ALU). A unique and
powerful feature of the ARM processor is the ability to shift the 32-bit binary pattern in one of
the source registers left or right by a specific number of positions before it enters the ALU. This
shift increases the power and flexibility of many data processing operations.
There are data processing instructions that do not use the barrel shift, for example, the MUL
(multiply), CLZ (count leading zeros), and QADD (signed saturated 32-bit add) instructions.
Pre-processing or shift occurs within the cycle time of the instruction. This is particularly useful
for loading constants into a register and achieving fast multiplies or division by a power of 2.
Arithmetic Instructions
The arithmetic instructions implement addition and subtraction of 32-bit signed and unsigned
values.
Using the Barrel Shifter with Arithmetic Instructions
The wide range of second operand shifts available on arithmetic and logical instructions is a very
powerful feature of the ARM instruction set. For example, the inline barrel shifter can be used with an
arithmetic instruction to multiply the value stored in register r1 by three.
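The barrel shifter operations and the multiply-by-three idiom can be modelled in C. This is a sketch under stated assumptions: the helper names are hypothetical, shift amounts are assumed to lie in the range 1 to 31, and the instruction ADD r0, r1, r1, LSL #1 is one assumed form of the multiply-by-three instruction mentioned above (the original listing is not reproduced in these notes).

```c
#include <stdint.h>

/* Logical shift left: vacated low bits are filled with zeros. */
static uint32_t lsl(uint32_t x, unsigned n) { return x << n; }

/* Logical shift right: vacated high bits are filled with zeros. */
static uint32_t lsr(uint32_t x, unsigned n) { return x >> n; }

/* Arithmetic shift right: vacated high bits copy the sign bit.
   Written with unsigned arithmetic so the result is portable. */
static uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t shifted = x >> n;
    if (x & 0x80000000u)
        shifted |= ~(0xFFFFFFFFu >> n);
    return shifted;
}

/* Rotate right: bits shifted out at the bottom re-enter at the top. */
static uint32_t ror(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Models ADD r0, r1, r1, LSL #1: r0 = r1 + (r1 << 1) = 3 * r1. */
static uint32_t times_three(uint32_t r1)
{
    return r1 + lsl(r1, 1);
}
```

The shift and the addition happen in one instruction on the ARM, which is why a multiply by a constant of the form 2^n + 1 can avoid the multiplier entirely.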
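BRANCH INSTRUCTIONS
A branch instruction changes the flow of execution or is used to call a routine. This type of
instruction allows programs to have subroutines, if-then-else structures, and loops.
Syntax: B{<cond>} label
        BL{<cond>} label
        BX{<cond>} Rm
        BLX{<cond>} label | Rm
B      branch                       pc = label
BL     branch with link             pc = label
                                    lr = address of the next instruction after the BL
BX     branch exchange              pc = Rm & 0xfffffffe, T = Rm & 1
BLX    branch exchange with link    pc = label, T = 1
                                    pc = Rm & 0xfffffffe, T = Rm & 1
                                    lr = address of the next instruction after the BLX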
The address label is stored in the instruction as a signed pc-relative offset and must be
within approximately 32 MB of the branch instruction. T refers to the Thumb bit in the cpsr.
When instructions set T, the ARM switches to Thumb state.
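The pc-relative offset and its range limit can be sketched in C. This assumes the ARM-state B and BL encoding, in which the offset is a signed 24-bit count of words measured from the pc value (the branch address plus 8). The function name and parameters are hypothetical.

```c
#include <stdint.h>

/* Compute the 24-bit offset field for an ARM-state B or BL.
   Returns 1 and writes the field on success, or 0 if the target is
   misaligned or outside the roughly +/-32 MB range. */
static int encode_branch_offset(uint32_t instr_addr, uint32_t target,
                                uint32_t *field)
{
    /* The offset is relative to the pc, which reads as instr_addr + 8. */
    int64_t delta = (int64_t)target - ((int64_t)instr_addr + 8);

    if (delta % 4 != 0)
        return 0;                       /* targets must be word aligned */
    if (delta < -(INT64_C(1) << 25) || delta >= (INT64_C(1) << 25))
        return 0;                       /* does not fit in 24 bits of words */

    /* Store the word offset in two's complement, truncated to 24 bits. */
    *field = (uint32_t)(delta / 4) & 0x00FFFFFFu;
    return 1;
}
```

A branch to its own address has a byte offset of -8, which gives the field 0xFFFFFE. A target 64 MB away is rejected.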
Example:
This example shows a forward and backward branch. Because these loops are address
specific, we do not include the pre- and post-conditions. The forward branch skips three
instructions.
LOAD-STORE INSTRUCTIONS
Load-store instructions transfer data between memory and processor registers. There are
three types of load-store instructions: single-register transfer, multiple-register transfer, and
swap.
Conditional execution reduces the number of branches, which also reduces the number of
pipeline flushes and thus improves the performance of the executed code.
Unit-III
ARM Programming Model II
Thumb Instruction Set
Thumb encodes a subset of the 32-bit ARM instructions into a 16-bit instruction set
space. Since Thumb has higher performance than ARM on a processor with a 16-bit data bus, but
lower performance than ARM on a 32-bit data bus, use Thumb for memory-constrained systems.
Thumb has higher code density (the space taken up in memory by an executable
program) than ARM. For memory-constrained embedded systems, for example, mobile phones
and PDAs, code density is very important. Cost pressures also limit memory size, width, and
speed.
On average, a Thumb implementation of the same code takes up around 30% less
memory than the equivalent ARM implementation. As an example, consider the same divide routine
implemented in ARM and Thumb assembly code. Even though the Thumb implementation uses
more instructions, the overall memory footprint is reduced. Code density was the main driving
force for the Thumb instruction set. Because it was also designed as a compiler target, rather than
for hand-written assembly code, we recommend that you write Thumb-targeted code in a high-
level language like C or C++.
Each Thumb instruction is related to a 32-bit ARM instruction. For example, a simple Thumb ADD
instruction is decoded into an equivalent ARM ADD instruction. Only the branch relative
instruction can be conditionally executed. The limited space available in 16 bits causes the barrel
shift operations ASR, LSL, LSR, and ROR to be separate instructions in the Thumb ISA.
Thumb instruction set.
To alter the cpsr or spsr, you must switch into ARM state to use MSR and MRS.
Similarly, there are no coprocessor instructions in Thumb state. You need to be in ARM state to
access the coprocessors.
ARM processors have 32-bit registers and 32-bit data processing operations. The ARM
architecture is a RISC load/store architecture. In other words you must load values from memory
into registers before acting on them. There are no arithmetic or logical instructions that manipulate
values in memory directly.
Early versions of the ARM architecture (ARMv1 to ARMv3) provided hardware support
for loading and storing unsigned 8-bit and unsigned or signed 32-bit values.
These architectures were used on processors prior to the ARM7TDMI. The load/store
instruction classes available by ARM architecture.
Loads that act on 8- or 16-bit values extend the value to 32 bits before writing to an ARM
register. Unsigned values are zero-extended, and signed values sign-extended. This means that the
cast of a loaded value to an int type does not cost extra instructions. Similarly, a store of an 8- or
16-bit value selects the lowest 8 or 16 bits of the register. The cast of an int to a smaller type does
not cost extra instructions on a store.
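The extension and truncation rules above can be written out in C. This is a sketch of the value transformations only, not of the load and store instructions themselves. The helper names are hypothetical.

```c
#include <stdint.h>

/* Unsigned 8-bit load: the upper 24 bits of the register become zero. */
static uint32_t zero_extend8(uint8_t b)
{
    return (uint32_t)b;
}

/* Signed 8-bit load: bit 7 is copied into the upper 24 bits. */
static uint32_t sign_extend8(uint8_t b)
{
    uint32_t v = b;
    if (v & 0x80u)
        v |= 0xFFFFFF00u;
    return v;
}

/* Signed 16-bit load: bit 15 is copied into the upper 16 bits. */
static uint32_t sign_extend16(uint16_t h)
{
    uint32_t v = h;
    if (v & 0x8000u)
        v |= 0xFFFF0000u;
    return v;
}

/* 8-bit store: only the lowest 8 bits of the register are written. */
static uint8_t store8(uint32_t reg)
{
    return (uint8_t)(reg & 0xFFu);
}
```

The byte 0xF0 becomes 0x000000F0 when zero-extended and 0xFFFFFFF0 when sign-extended, which is the register content an int cast of the loaded value expects.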
The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores
directly, through new instructions. Since these instructions are a later addition, they do not support
as many addressing modes as the pre-ARMv4 instructions.
Finally, ARMv5 adds instruction support for 64-bit loads and stores. This is available in
ARM9E and later cores.
Prior to ARMv4, ARM processors were not good at handling signed 8-bit or any 16-bit
values. Therefore ARM C compilers define char to be an unsigned 8-bit value, rather than a signed
8-bit value as is typical in many other compilers.
Compilers armcc and gcc use the datatype mappings in Table 5.2 for an ARM target. The
exceptional case for type char is worth noting, as it can cause problems when you are porting code
from another processor architecture. A common example is using a char type variable i as a loop
counter, with loop continuation condition i >= 0. As i is unsigned for the ARM compilers, the loop
will never terminate. Fortunately armcc produces a warning in this situation: unsigned comparison
with 0. Compilers also provide an override switch to make char signed. For example, the command
line option -fsigned-char will make char signed on gcc. The command line option -zc will have the
same effect with armcc.
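The loop pitfall can be reproduced on any host by spelling out the signedness explicitly. The sketch below is illustrative: the function names are hypothetical, and the limit parameter is an added safety cap so that the demonstration always returns instead of looping forever.

```c
/* Counts iterations of a countdown loop whose counter is unsigned,
   as plain char is on ARM compilers. The test i >= 0 is always true,
   so only the safety cap stops the loop. */
static int countdown_unsigned(unsigned char start, int limit)
{
    unsigned char i = start;
    int iterations = 0;
    while (i >= 0 && iterations < limit) {
        i--;                 /* wraps from 0 to 255 */
        iterations++;
    }
    return iterations;
}

/* The same loop with a signed counter terminates when i reaches -1. */
static int countdown_signed(signed char start, int limit)
{
    signed char i = start;
    int iterations = 0;
    while (i >= 0 && iterations < limit) {
        i--;
        iterations++;
    }
    return iterations;
}
```

Starting from 3, the signed version runs four iterations (3, 2, 1, 0), while the unsigned version runs until it hits the cap. A compiler may warn that the unsigned comparison is always true, which is the warning described above.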
Consider a function add_v1 that takes two short arguments, a and b, and returns a short value.
It is a useful test case to illustrate the problems faced
by the compiler. The input values a, b, and the return value will be passed in 32-bit ARM registers.
Should the compiler assume that these 32-bit values are in the range of a short type, that is, -32,768
to +32,767? Or should the compiler force values to be in this range by sign-extending the lowest 16
bits to fill the 32-bit register? The compiler must make compatible decisions for the function caller
and callee. Either the caller or callee must perform the cast to a short type.
We say that function arguments are passed wide if they are not reduced to the range of the
type and narrow if they are. You can tell which decision the compiler has made by looking at the
assembly output for add_v1. If the compiler passes arguments wide, then the callee must reduce
function arguments to the correct range. If the compiler passes arguments narrow, then the caller
must reduce the range. If the compiler returns values wide, then the caller must reduce the return
value to the correct range. If the compiler returns values narrow, then the callee must reduce the
range before returning the value.
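The difference between wide and narrow passing can be made concrete by modelling registers as 32-bit integers. This is a sketch under assumptions: the function body a + (b >> 1) and all names are chosen for illustration and are not taken from any compiler output.

```c
#include <stdint.h>

/* Reduce a 32-bit register value to the range of a 16-bit signed type
   by sign-extending its lowest 16 bits. */
static int32_t narrow_to_short(int32_t reg)
{
    uint32_t low = (uint32_t)reg & 0xFFFFu;
    return (low & 0x8000u) ? (int32_t)low - 65536 : (int32_t)low;
}

/* Callee when arguments are passed narrow: it trusts that a and b are
   already in range and only narrows the return value. */
static int32_t half_sum_narrow_callee(int32_t a, int32_t b)
{
    return narrow_to_short(a + (b >> 1));
}

/* Callee when arguments are passed wide: it must narrow a and b itself
   before using them. */
static int32_t half_sum_wide_callee(int32_t a, int32_t b)
{
    int32_t na = narrow_to_short(a);
    int32_t nb = narrow_to_short(b);
    return narrow_to_short(na + (nb >> 1));
}
```

If a caller passes the out-of-range register value 0x10000 for b, the two callees disagree: the narrow-passing callee returns -32768, while the wide-passing callee first reduces b to 0 and returns 0. This is why caller and callee must follow the same convention.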
For armcc in ADS, function arguments are passed narrow and values returned narrow. In
other words, the caller casts argument values and the callee casts return values. The compiler uses
the ANSI prototype of the function to determine the datatypes of the function arguments.
The armcc output for add_v1 shows that the compiler casts the return value to a short type,
but does not cast the input values. It assumes that the caller has already ensured that the 32-bit
values r0 and r1 are in the range of the short type. This shows narrow passing of arguments and
return value.
The gcc compiler we used is more cautious and makes no assumptions about the range of
argument values. This version of the compiler reduces the input arguments to the range
of a short type.
C LOOPING STRUCTURES
This section looks at the most efficient ways to code loops on the ARM.
We return to the
checksum example and look at the looping structure.
The first four integer arguments are passed in the registers r0, r1, r2, and
r3. Subsequent integer arguments are placed on the full descending stack, ascending in memory.
Function return integer values are passed in r0.
This description covers only integer or pointer arguments. Two-word arguments such as
long long or double are passed in a pair of consecutive argument registers and returned in r0, r1.
The compiler may pass structures in registers or by reference according to command line compiler
options.
If your C function needs more than four arguments, or your C++ method more than three
explicit arguments, then it is usually more efficient to use structures. Group related
arguments into structures, and pass a structure pointer rather than multiple arguments. Which
arguments are related will depend on the structure of your software.
Pointer Aliasing
Two pointers are said to alias when they point to the same address. If you write to one
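pointer, it will affect the value you read from the other pointer. In a function, the compiler often
does not know which pointers can alias, so it must be
pessimistic and assume that any write to a pointer may affect the value read from any other
pointer, which can reduce code efficiency. A simple example is a function that increments two
timer values by a step amount.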
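The aliasing problem can be seen with a function that increments two timer values by a step amount. The sketch below is illustrative, and the function names timers_v1 and timers_v2 are hypothetical. In the first version the compiler has to read *step twice, because the write to *timer1 might have changed it. The second version copies the step into a local variable, which cannot alias anything.

```c
/* Reads *step twice. After the write to *timer1 the compiler must
   reload *step, in case step and timer1 point to the same address. */
static void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* Reads *step once into a local variable, so both timers advance by
   the same amount regardless of aliasing. */
static void timers_v2(int *timer1, int *timer2, int *step)
{
    int s = *step;
    *timer1 += s;
    *timer2 += s;
}
```

When no pointers alias, both versions give the same result. When step and timer1 point to the same int, the versions differ: with starting values 5 and 20, timers_v1 leaves the second timer at 30, while timers_v2 leaves it at 25.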
STRUCTURE ARRANGEMENT
The way you lay out a frequently used structure can have a significant impact on its performance
and code density. There are two issues concerning structures on the ARM: alignment of the
structure entries and the overall size of the structure.
For architectures up to and including ARMv5TE, load and store instructions are only
guaranteed to load and store values with address aligned to the size of the access width. Table 5.4
summarizes these restrictions.
For this reason, ARM compilers will automatically align the start address of a structure to a
multiple of the largest access width used within the structure (usually four or eight bytes) and align
entries within structures to their access width by inserting padding.
struct {
    char a;
    int b;
    char c;
    short d;
};
For a little-endian memory system the compiler will lay this out adding padding to ensure
that the next object is aligned to the size of that object:
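The effect of padding can be inspected with offsetof and sizeof. This sketch gives the structure a tag so it can be measured, and adds a reordered variant for comparison. The names example, example_sorted, and example_padding are hypothetical, and the exact numbers depend on the target ABI.

```c
#include <stddef.h>

/* Same member order as the structure in the text. */
struct example {
    char a;
    int b;
    char c;
    short d;
};

/* Same members, ordered so that small members share a word. */
struct example_sorted {
    char a;
    char c;
    short d;
    int b;
};

/* Number of padding bytes the compiler inserted into struct example. */
static size_t example_padding(void)
{
    size_t members = 2 * sizeof(char) + sizeof(short) + sizeof(int);
    return sizeof(struct example) - members;
}
```

On a target with a 4-byte int aligned to 4 bytes and a 2-byte short aligned to 2 bytes, one would expect the members of struct example at offsets 0, 4, 8, and 10 for a total of 12 bytes, with 4 bytes of padding. The reordered structure would need 8 bytes and no padding. Printing offsetof(struct example, b) and sizeof(struct example) on the target confirms the layout a particular compiler chose.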
Floating Point
Most ARM processor implementations do not provide hardware floating-point
support, which saves on power and area when using ARM in a price-sensitive, embedded
application. With the exceptions of the Floating Point Accelerator (FPA) used on the ARM7500FE
and the Vector Floating Point accelerator (VFP) hardware, the C compiler must provide support
for floating point in software.
Processing digitized signals such as
audio and video is a large and important area of programming. For best performance you
need to code the algorithms in assembly.
Instruction Scheduling
The time taken to execute instructions depends on the implementation pipeline. For this
chapter, we assume ARM9TDMI pipeline timings.
The following rules summarize the cycle timings for common instruction classes on the
ARM9TDMI.
Instructions that are conditional on the value of the ARM condition codes in the cpsr take one
cycle if the condition is not met. If the condition is met, then the following rules apply:
ALU operations such as addition, subtraction, and logical operations take one cycle. This
includes a shift by an immediate value. If you use a register-specified shift, then add one cycle. If
the instruction writes to the pc, then add two cycles.
Load instructions that load N 32-bit words of memory such as LDR and LDM take N
cycles to issue, but the result of the last word loaded is not available on the following cycle. The
updated load address is available on the next cycle. This assumes zero-wait-state memory for an
uncached system, or a cache hit for a cached system. An LDM of a single value is exceptional,
taking two cycles. If the instruction loads pc, then add two cycles.
Load instructions that load 16-bit or 8-bit data such as LDRB, LDRSB, LDRH, and
LDRSH take one cycle to issue. The load result is not available on the following two cycles. The
updated load address is available on the next cycle. This assumes zero-wait-state memory for an
uncached system, or a cache hit for a cached system.
Branch instructions take three cycles.
Store instructions that store N values take N cycles. This assumes zero-wait-state memory
for an uncached system, or a cache hit or a write buffer with N free entries for a cached system. An
STM of a single value is exceptional, taking two cycles.
Multiply instructions take a varying number of cycles depending on the value of the second
operand in the product (see Table D.6 in Section D.3).
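The timing rules above can be collected into a rough cycle estimator. This is a simplified sketch, not a model of the real pipeline: it covers only the issue cycles stated in the rules, ignores register-specified shifts, multiplies, and the delay before a loaded value can be used, and assumes zero-wait-state memory. All type and function names are hypothetical.

```c
/* Instruction classes covered by the rules in the text. */
enum instr_class {
    IC_ALU,          /* add, subtract, logical operations */
    IC_LDR_WORD,     /* single 32-bit load */
    IC_LDM,          /* load multiple, n words */
    IC_LOAD_NARROW,  /* LDRB, LDRSB, LDRH, LDRSH */
    IC_BRANCH,
    IC_STR,          /* single store */
    IC_STM           /* store multiple, n values */
};

struct instr {
    enum instr_class cls;
    unsigned n;        /* word count, used by IC_LDM and IC_STM */
    int writes_pc;     /* nonzero if the instruction writes or loads pc */
    int cond_passed;   /* nonzero if the condition code test passes */
};

/* Estimated issue cycles for one instruction. */
static unsigned issue_cycles(const struct instr *in)
{
    unsigned cycles;

    if (!in->cond_passed)
        return 1;                        /* failed condition: one cycle */

    switch (in->cls) {
    case IC_BRANCH:
        return 3;
    case IC_LDM:
    case IC_STM:
        cycles = (in->n == 1) ? 2 : in->n;   /* single transfer is special */
        break;
    default:
        cycles = 1;                      /* ALU, LDR, narrow loads, STR */
        break;
    }

    /* ALU operations and word loads that write the pc cost two more. */
    if (in->writes_pc &&
        (in->cls == IC_ALU || in->cls == IC_LDR_WORD || in->cls == IC_LDM))
        cycles += 2;

    return cycles;
}
```

Summing issue_cycles over a straight-line instruction sequence gives a lower bound on its execution time. Interlocks caused by using a loaded value too early would add to that figure.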
Fetch: Fetch from memory the instruction at address pc. The instruction is loaded into the
core and then processes down the core pipeline.
Decode: Decode the instruction that was fetched in the previous cycle. The processor also
reads the input operands from the register bank if they are not available via one of the forwarding
paths.
ALU: Executes the instruction that was decoded in the previous cycle. Note this instruction
was originally fetched from address pc - 8 (ARM state) or pc - 4 (Thumb state). Normally this
involves calculating the answer for a data processing operation, or the address for a load, store, or
branch operation. Some instructions may spend several cycles in this stage. For example, multiply
and register-controlled shift operations take several ALU cycles.
Conditional Execution
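The processor core can conditionally execute most ARM instructions. This conditional
execution is based on a condition field. If you do not specify a condition, the
assembler defaults to the execute always condition (AL). The other 14 conditions split into
seven pairs of complements. The conditions depend on the four condition code flags N, Z, C, V
stored in the cpsr register. See Table A.2 in Appendix A for the list of possible ARM conditions.
By default, ARM instructions do not update the N, Z, C, V flags in the cpsr. For most instructions,
to update these flags you append an S suffix to the instruction mnemonic. Exceptions
to this are comparison instructions that do not write to a destination register. Their sole purpose is
to update the flags, so they do not require the S suffix. By combining conditional execution and
conditional setting of the flags, you can implement simple if statements without any need for
branches. This improves efficiency, since branches
can take many cycles, and also reduces code size.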
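How a condition is evaluated against the flags can be sketched in C. The enum below lists the execute always condition and the 14 others in their usual encoding order; the names carry a COND_ prefix and, like the function name, are hypothetical. The flag tests are the standard ARM definitions, but Table A.2 remains the reference.

```c
/* Condition codes in encoding order 0 to 14. */
enum cond {
    COND_EQ, COND_NE,   /* equal / not equal */
    COND_CS, COND_CC,   /* carry set / carry clear */
    COND_MI, COND_PL,   /* negative / positive or zero */
    COND_VS, COND_VC,   /* overflow / no overflow */
    COND_HI, COND_LS,   /* unsigned higher / unsigned lower or same */
    COND_GE, COND_LT,   /* signed >= / signed < */
    COND_GT, COND_LE,   /* signed > / signed <= */
    COND_AL             /* always */
};

/* Returns 1 if an instruction with condition c would execute, given
   the N, Z, C, V flags (each 0 or 1). */
static int condition_passes(enum cond c, int n, int z, int carry, int v)
{
    switch (c) {
    case COND_EQ: return z == 1;
    case COND_NE: return z == 0;
    case COND_CS: return carry == 1;
    case COND_CC: return carry == 0;
    case COND_MI: return n == 1;
    case COND_PL: return n == 0;
    case COND_VS: return v == 1;
    case COND_VC: return v == 0;
    case COND_HI: return carry == 1 && z == 0;
    case COND_LS: return carry == 0 || z == 1;
    case COND_GE: return n == v;
    case COND_LT: return n != v;
    case COND_GT: return z == 0 && n == v;
    case COND_LE: return z == 1 || n != v;
    case COND_AL: return 1;
    }
    return 0;
}
```

Each pair in the enum is a complement: for any flag values exactly one of the two conditions passes, which is what lets an if statement and its else branch be written as two conditionally executed instructions.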