ARM Reference Notes

The document provides an overview of ARM architecture, detailing its RISC design philosophy, which emphasizes simple instructions, single-cycle execution, and a load-store architecture. It describes the ARM processor's features, including its register set, instruction pipeline, and interrupt handling, as well as the evolution of ARM families from ARM7 to ARM11. The document also highlights the importance of ARM's design for embedded systems, focusing on power efficiency and performance.

Uploaded by

viquarsultana135

UNIT-I

ARM ARCHITECTURE

ARM
ARM, previously Advanced RISC Machine and originally Acorn RISC Machine, is a family
of Reduced Instruction Set Computing (RISC) architectures for computer processors.

The ARM processor core is a key component of many successful 32-bit embedded
systems.

The RISC design philosophy

The RISC design philosophy aims to deliver the following:

simple but powerful instructions

single-cycle execution at a high clock speed

intelligence in software rather than hardware

greater flexibility by reducing the complexity of instructions

The ARM core uses a RISC architecture.

The RISC philosophy is implemented with four major design rules:

1. Instructions: RISC processors have a reduced number of instruction classes. These
classes provide simple operations that can each execute in a single cycle. The compiler or
programmer synthesizes complicated operations (for example, a divide operation) by combining
several simple instructions. Each instruction is a fixed length to allow the pipeline to
fetch future instructions before decoding the current instruction. In contrast, in CISC
processors the instructions are often of variable size and take many cycles to execute.

2. Pipelines: The processing of instructions is broken down into smaller units that can be
executed in parallel by pipelines. Ideally the pipeline advances by one step on each cycle
for maximum throughput. There is no need for an instruction to be executed by a mini
program called microcode as on CISC processors.

3. Registers: RISC machines have a large general-purpose register set. Any register can
contain either data or an address. In contrast, CISC processors have dedicated registers
for specific purposes.

4. Load-store architecture: The processor operates on data held in registers. Separate load
and store instructions transfer data between the register bank and external memory. In
contrast, with a CISC design the data processing operations can act on memory directly.

The ARM Design Philosophy

A number of physical requirements of embedded systems have driven the ARM processor design.

1. Small size, to reduce power consumption and extend battery operation.
2. High code density.
3. Price sensitivity: the ability to use slow, low-cost memory devices.
4. Reduced area of the die taken up by the embedded processor.
5. Hardware debug technology.
6. As a result, the ARM core is not a pure RISC architecture.

Registers:
ARM processors provide general-purpose and special-purpose registers. Some additional
registers are available in privileged execution modes.

In all ARM processors, the following registers are available and accessible in any processor
mode:

13 general-purpose registers R0-R12.

One Stack Pointer (SP).

One Link Register (LR).

One Program Counter (PC).

One Application Program Status Register (APSR).

The number of registers depends on the ARM version. According to the ARM Reference
Manual, there are 30 general-purpose 32-bit registers, with the exception of ARMv6-M and
ARMv7-M based processors. The first 16 registers are accessible in user-level mode; the
additional registers are available in privileged software execution (again with the exception
of ARMv6-M and ARMv7-M). In this tutorial series we will work with the registers that are
accessible in any privilege mode: r0-r15. These 16 registers can be split into two groups:
general-purpose and special-purpose registers.

R0-R12: can be used during common operations to store temporary values, pointers (locations
in memory), etc. R0, for example, can be referred to as the accumulator during arithmetic
operations or used for storing the result of a previously called function. R7 becomes useful
while working with syscalls, as it stores the syscall number, and R11 helps us to keep track of
boundaries on the stack, serving as the frame pointer. Moreover, the function calling
convention on ARM specifies that the first four arguments of a function are stored in the
registers r0-r3.

R13: SP (Stack Pointer). The Stack Pointer points to the top of the stack. The stack is an area of
memory used for function-specific storage, which is reclaimed when the function returns. The
stack pointer is therefore used for allocating space on the stack, by subtracting the value (in
bytes) we want to allocate from the stack pointer. In other words, if we want to allocate a 32 bit
value, we subtract 4 from the stack pointer.
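The allocation step described above can be sketched in C. This is only a model under stated assumptions: the `stack_model` type, the 64-byte size, and the function names are hypothetical and exist purely for illustration; the stack is assumed to grow toward lower addresses, as the subtraction implies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a small stack that grows toward lower addresses. */
enum { STACK_BYTES = 64 };

typedef struct {
    uint8_t  mem[STACK_BYTES]; /* backing storage for the stack */
    uint32_t sp;               /* byte offset of the current top of stack */
} stack_model;

/* An empty stack starts with sp just past the end of the storage. */
void stack_init(stack_model *s)
{
    s->sp = STACK_BYTES;
}

/* Allocate a 32-bit slot by subtracting 4 from sp, then store the value. */
void stack_push32(stack_model *s, uint32_t value)
{
    assert(s->sp >= 4);                 /* room for one more word */
    s->sp -= 4;
    memcpy(&s->mem[s->sp], &value, 4);
}

/* Read the top word, then release the slot by adding 4 back to sp. */
uint32_t stack_pop32(stack_model *s)
{
    uint32_t value;
    assert(s->sp + 4 <= STACK_BYTES);   /* there is a word to pop */
    memcpy(&value, &s->mem[s->sp], 4);
    s->sp += 4;
    return value;
}
```

After `stack_init`, two pushes move `sp` from 64 to 56, and two pops return the values in reverse order and restore `sp` to 64.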

R14: LR (Link Register). When a function call is made, the Link Register is updated with the
memory address of the next instruction after the point where the call was made. Doing so
allows the program to return to the calling function once the called function has finished.

R15: PC (Program Counter). The Program Counter is automatically incremented by the size of
the instruction executed. This size is always 4 bytes in ARM state and 2 bytes in Thumb state.
When a branch instruction is being executed, the PC holds the destination address. During
execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in
ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb(v1) state. This
is different from x86 where PC always points to the next instruction to be executed.
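The PC offset rule can be written as a small calculation. The sketch below is illustrative only; the type and function names are hypothetical, and it assumes the fixed 2-byte Thumb(v1) instruction size mentioned above.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { STATE_ARM, STATE_THUMB } isa_state;

/* Size in bytes of one instruction in the given state (Thumb v1 assumed). */
uint32_t instruction_size(isa_state state)
{
    assert(state == STATE_ARM || state == STATE_THUMB);
    return state == STATE_ARM ? 4u : 2u;
}

/* Value an executing instruction observes when it reads the PC:
   its own address plus two instruction slots. */
uint32_t pc_as_read(uint32_t instruction_address, isa_state state)
{
    return instruction_address + 2u * instruction_size(state);
}
```

For an instruction at 0x8000, `pc_as_read` gives 0x8008 in ARM state and 0x8004 in Thumb state.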

Current Program Status Register

The Current Program Status Register (CPSR) holds the same program status flags as the APSR,
and some additional information.

The CPSR holds:

The APSR flags.

The processor mode.

The interrupt disable flags.

The instruction set state (ARM, Thumb, ThumbEE, or Jazelle).

The endianness state (on ARMv4T and later).

The execution state bits for the IT block (on ARMv6T2 and later).

The Current Program Status Register is a 32-bit wide register used in the ARM architecture to
record various pieces of information regarding the state of the program being executed by the
processor and the state of the processor. This information is recorded by setting or clearing
specific bits in the register.

The top four bits (bits 31, 30, 29, and 28) are the condition code (cc) bits and are of most
interest to us. Condition code bits are sometimes referred to as "flags". The lowest 8 bits
(bit 7 through to bit 0) store information about the processor's own state. The remaining bits
(i.e. bit 27 to bit 8) are currently unused in most ARM processors.

The N bit is the "negative flag" and indicates that a value is negative.

The Z bit is the "zero flag" and is set when an appropriate instruction produces a zero result.

The C bit is the "carry flag" but it can also be used to indicate "borrows" (from subtraction
operations) and "extends" (from shift instructions).

The V bit is the "overflow flag" which is set if an instruction produces a result that overflows
and hence goes beyond the range of numbers that can be represented in 2's complement signed
format.
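As an illustration of the four flag definitions, the following C sketch computes the flags a flag-setting 32-bit addition would produce. The struct and function names are hypothetical; only addition is modelled, so the "borrow" and "extend" uses of the C bit are not covered.

```c
#include <stdint.h>

typedef struct {
    unsigned n; /* negative: bit 31 of the result */
    unsigned z; /* zero: result is 0 */
    unsigned c; /* carry: unsigned carry out of bit 31 */
    unsigned v; /* overflow: signed result does not fit in 32 bits */
} nzcv_flags;

/* Flags produced by adding two 32-bit values. */
nzcv_flags flags_after_add(uint32_t a, uint32_t b)
{
    uint32_t result = a + b;   /* wraps modulo 2^32 */
    nzcv_flags f;

    f.n = (unsigned)(result >> 31);
    f.z = (result == 0u);
    f.c = (result < a);        /* wrapped around, so a carry occurred */
    /* Signed overflow: both operands share a sign the result lacks. */
    f.v = (unsigned)(((~(a ^ b)) & (a ^ result)) >> 31) & 1u;
    return f;
}
```

Adding 1 to 0x7fffffff sets N and V (the signed result overflowed), while adding 1 to 0xffffffff sets Z and C (the unsigned result wrapped to zero).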

For completeness, the other state bits are:

The I and F bits which determine whether interrupts (such as requests for input/output) are
enabled or disabled.

The T bit which indicates whether the processor is in "Thumb" mode, where the processor can
execute a subset of the assembly language as 16-bit compact instructions. As Thumb code packs
more instructions into the same amount of memory, it is an effective solution to applications
where physical memory is at a premium.

The M4 to M0 bits are the mode bits. Application programs normally run in user mode (where
the mode bits are 10000). Whenever an interrupt or similar event occurs, the processor switches
into one of the alternative modes allowing the software handler greater privileges with regard to
memory manipulation.

M[4:0]   Mode         Accessible registers

10000    User         PC, R14 to R0, CPSR
10001    FIQ          PC, R14_fiq to R8_fiq, R7 to R0, CPSR, SPSR_fiq
10010    IRQ          PC, R14_irq, R13_irq, R12 to R0, CPSR, SPSR_irq
10011    Supervisor   PC, R14_svc, R13_svc, R12 to R0, CPSR, SPSR_svc
10111    Abort        PC, R14_abt, R13_abt, R12 to R0, CPSR, SPSR_abt
11011    Undefined    PC, R14_und, R13_und, R12 to R0, CPSR, SPSR_und
11111    System       PC, R14 to R0, CPSR
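The CPSR layout described above can be unpacked with a few shifts and masks. This is a sketch, not a definitive decoder: the struct and function names are hypothetical, and the positions of the I, F, and T bits (bits 7, 6, and 5) are taken from the classic ARM layout rather than from these notes.

```c
#include <stdint.h>

typedef struct {
    unsigned n, z, c, v;     /* condition flags, bits 31..28 */
    unsigned irq_disabled;   /* I bit, bit 7 (assumed position) */
    unsigned fiq_disabled;   /* F bit, bit 6 (assumed position) */
    unsigned thumb;          /* T bit, bit 5 (assumed position) */
    unsigned mode;           /* M[4:0], bits 4..0 */
} cpsr_fields;

/* Split a 32-bit CPSR value into its individual fields. */
cpsr_fields cpsr_decode(uint32_t cpsr)
{
    cpsr_fields f;
    f.n = (cpsr >> 31) & 1u;
    f.z = (cpsr >> 30) & 1u;
    f.c = (cpsr >> 29) & 1u;
    f.v = (cpsr >> 28) & 1u;
    f.irq_disabled = (cpsr >> 7) & 1u;
    f.fiq_disabled = (cpsr >> 6) & 1u;
    f.thumb = (cpsr >> 5) & 1u;
    f.mode = cpsr & 0x1fu;
    return f;
}

/* Map the mode bits to the mode names listed in the table above. */
const char *cpsr_mode_name(unsigned mode)
{
    switch (mode) {
    case 0x10: return "User";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "Supervisor";
    case 0x17: return "Abort";
    case 0x1b: return "Undefined";
    case 0x1f: return "System";
    default:   return "unknown";
    }
}
```

For example, the value 0x600000d3 decodes to Z and C set, both interrupt types disabled, ARM state, and Supervisor mode.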

The instruction pipeline

The ARM uses a pipeline to increase the speed of the flow of instructions to the processor. This
allows several operations to take place simultaneously, and the processing and memory systems
to operate continuously.

A three-stage pipeline is used, so instructions are executed in three stages:

Fetch

Decode

Execute

Figure: The instruction pipeline.

During normal operation, while one instruction is being executed, its successor is being decoded,
and a third instruction is being fetched from memory. The program counter points to the
instruction being fetched rather than to the instruction being executed. This is important because
it means that the Program Counter (PC) value used in an executing instruction is always two
instructions ahead of the address of that instruction.

The pipeline design for each ARM family differs. For example, the ARM9 core increases the
pipeline length to five stages, as shown in Figure 2.9. The ARM9 adds a memory and writeback
stage, which allows the ARM9 to process on average 1.1 Dhrystone MIPS per MHz, an
increase in instruction throughput of around 13% compared with an ARM7. The maximum core
frequency attainable using an ARM9 is also higher.

The ARM10 increases the pipeline length still further by adding a sixth stage, as shown in Figure
2.10. The ARM10 can process on average 1.3 Dhrystone MIPS per MHz, about 34% more
throughput than an ARM7 processor core, but again at a higher latency cost.

Even though the ARM9 and ARM10 pipelines are different, they still use the same pipeline
executing characteristics as an ARM7. Code written for the ARM7 will execute on an ARM9 or
ARM10.

Interrupts and the Vector Table.

When an exception or interrupt occurs, the processor sets the pc to a specific memory address.
The address is within a special address range called the vector table. The entries in the vector
table are instructions that branch to specific routines designed to handle a particular exception
or interrupt.

The memory map address 0x00000000 is reserved for the vector table, a set of 32-bit words. On
some processors the vector table can be optionally located at a higher address in memory
(starting at the offset 0xffff0000). Some products can take advantage of this feature.

When an exception or interrupt occurs, the processor suspends normal execution and starts
loading instructions from the exception vector table (see Table 2.6). Each vector table entry
contains a form of branch instruction pointing to the start of a specific routine:

Reset vector is the location of the first instruction executed by the processor when power is
applied. This instruction branches to the initialization code.

Undefined instruction vector is used when the processor cannot decode an instruction.

Software interrupt vector is called when you execute a SWI instruction. The SWI instruction is
frequently used as the mechanism to invoke an operating system routine.

Prefetch abort vector occurs when the processor attempts to fetch an instruction from an address
without the correct access permissions. The actual abort occurs in the decode stage.

Data abort vector is similar to a prefetch abort but is raised when an instruction attempts to
access data memory without the correct access permissions.

Interrupt request vector is used by external hardware to interrupt the normal execution flow of
the processor. It can only be raised if IRQs are not masked in the cpsr.

The vector table.

Exception/interrupt      Shorthand   Address      High address

Reset                    RESET       0x00000000   0xffff0000
Undefined instruction    UNDEF       0x00000004   0xffff0004
Software interrupt       SWI         0x00000008   0xffff0008
Prefetch abort           PABT        0x0000000c   0xffff000c
Data abort               DABT        0x00000010   0xffff0010
Reserved                 -           0x00000014   0xffff0014
Interrupt request        IRQ         0x00000018   0xffff0018
Fast interrupt request   FIQ         0x0000001c   0xffff001c
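Each entry in the table is one 32-bit word, so an entry's address is the table base plus four times its index. A minimal C sketch of that calculation follows; the enum and function names are hypothetical, and the two base addresses are the ones listed in the table.

```c
#include <stdint.h>

/* Vector indices in table order; each entry is one 32-bit word. */
typedef enum {
    VEC_RESET = 0,
    VEC_UNDEF,
    VEC_SWI,
    VEC_PABT,
    VEC_DABT,
    VEC_RESERVED,
    VEC_IRQ,
    VEC_FIQ
} exception_vector;

#define VECTOR_BASE_LOW  0x00000000u
#define VECTOR_BASE_HIGH 0xffff0000u

/* Address of a vector table entry: base plus 4 bytes per entry. */
uint32_t vector_address(exception_vector v, int high_vectors)
{
    uint32_t base = high_vectors ? VECTOR_BASE_HIGH : VECTOR_BASE_LOW;
    return base + 4u * (uint32_t)v;
}
```

For instance, `vector_address(VEC_IRQ, 0)` yields 0x00000018 and `vector_address(VEC_FIQ, 1)` yields 0xffff001c, matching the table rows.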

Architecture Revision
Every ARM processor implementation executes a specific instruction set architecture (ISA),
although an ISA revision may have more than one processor implementation.

The ISA has evolved to keep up with the demands of the embedded market. This evolution has
been carefully managed by ARM, so that code written to execute on an earlier architecture
revision will also execute on a later revision of the architecture.

Before we go on to explain the evolution of the architecture, we must introduce the ARM
processor nomenclature, which identifies individual processors and provides basic
information about the feature set.

Nomenclature
ARM uses the nomenclature shown below to describe the processor implementations.

ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}

x    family
y    memory management/protection unit
z    cache
T    Thumb 16-bit decoder
D    JTAG debug
M    fast multiplier
I    EmbeddedICE macrocell
E    enhanced instructions (assumes TDMI)
J    Jazelle
F    vector floating-point unit
S    synthesizable version

The letters and numbers after the word "ARM" indicate the features a processor
may have. In the future the number and letter combinations may change as more features are
added. Note the nomenclature does not include the architecture revision information.

There are a few additional points to make about the ARM nomenclature:

All ARM cores after the ARM7TDMI include the TDMI features even though they may
not include those letters after the "ARM" label.

The processor family is a group of processor implementations that share the same
hardware characteristics. For example, the ARM7TDMI, ARM740T, and ARM720T all share
the same family characteristics and belong to the ARM7 family.

JTAG is described by IEEE 1149.1 Standard Test Access Port and boundary scan
architecture. It is a serial protocol used by ARM to send and receive debug information between
the processor core and test equipment.

EmbeddedICE macrocell is the debug hardware built into the processor that allows
breakpoints and watchpoints to be set.

Synthesizable means that the processor core is supplied as source code that can be
compiled into a form easily used by EDA tools.

Architecture Evolution
The architecture has continued to evolve since the first ARM processor implementation was
introduced in 1985. One significant change was the introduction of the Thumb instruction set
in ARMv4T (the ARM7TDMI processor).

The various parts of the program status register and the availability of certain features depend
on the architecture revision.
ARM PROCESSOR FAMILIES

ARM has designed a number of processors that are grouped into different families according to
the core they use. The families are based on the ARM7, ARM9, ARM10, and ARM11 cores. The
ascending number equates to an increase in performance and sophistication. ARM8 was
developed but was soon superseded.

Table 2.9 shows a rough comparison of attributes between the ARM7, ARM9, ARM10, and
ARM11 cores. The numbers quoted can vary greatly and are directly dependent upon the type
and geometry of the manufacturing process, which has a direct effect on the frequency (MHz)
and power consumption (watts).

Within each ARM family, there are a number of variations of memory management, cache, and
TCM processor extensions. ARM continues to expand both the number of families available and
the different variations within each family.

You can find other processors that execute the ARM ISA, such as StrongARM and XScale. These
processors are unique to a particular semiconductor company, in this case Intel.

Table 2.10 summarizes the different features of the various processors. The next subsections
describe the ARM families in more detail, starting with the ARM7 family.

ARM7 Family
The ARM7 core has a Von Neumann style architecture, where both data and instructions use the
same bus. The core has a three-stage pipeline and executes the architecture ARMv4T instruction
set.

The ARM7TDMI was the first of a new range of processors introduced in 1995 by ARM. It is
currently a very popular core and is used in many 32-bit embedded processors. It provides a very
good performance-to-power ratio. The ARM7TDMI processor core has been widely licensed; it
includes the Thumb instruction set, a fast multiply instruction, and the EmbeddedICE debug
technology.

One variation in the ARM7 family is the ARM7TDMI-S. The ARM7TDMI-S has the
same operating characteristics as a standard ARM7TDMI but is also synthesizable.

The ARM720T includes an MMU. The presence of the MMU means the ARM720T is capable of
handling the Linux and Microsoft embedded platform operating systems. The vector table can be
relocated to a higher address by setting a coprocessor 15 register.

Another variation is the ARM7EJ-S processor, also synthesizable. ARM7EJ-S is quite different
since it includes a five-stage pipeline and executes ARMv5TEJ instructions. This version of the
ARM7 is the only one that provides both Java acceleration and the enhanced instructions but
without any memory protection.

ARM9 Family
Because of its five-stage pipeline, the ARM9 processor can run at higher clock frequencies
than the ARM7 family. The extra stages improve the overall performance of the processor. The
memory system has been redesigned to follow the Harvard architecture, which separates the
data D and instruction I buses.

The ARM920T includes a separate D + I cache and an MMU. This processor can be used by
operating systems requiring virtual memory support. ARM922T is a variation on the ARM920T
but with half the D + I cache size.

The ARM940T includes a smaller D + I cache and an MPU. The ARM940T is designed for
applications that do not require a platform operating system. Both ARM920T and ARM940T
execute the architecture v4T instructions.

The next processors in the ARM9 family were based on the ARM9E-S core. This core is a
synthesizable version of the ARM9 core with the E extensions. There are two variations: the
ARM946E-S and the ARM966E-S. Both execute architecture v5TE instructions. They also
support the optional embedded trace macrocell (ETM), which allows a developer to trace
instruction and data execution in real time on the processor. This is important when debugging
applications with time-critical segments.

The ARM946E-S includes TCM, cache, and an MPU. The sizes of the TCM and caches are
configurable. This processor is aimed at applications that require deterministic real-time
response. In contrast, the ARM966E does not have the MPU and cache extensions.

The latest core in the ARM9 product line is the ARM926EJ-S synthesizable processor core,
announced in 2000. It is designed for use in small portable Java-enabled devices such as 3G
phones and personal digital assistants (PDAs). The ARM926EJ-S core includes the Jazelle
technology, which accelerates Java bytecode execution. It features an MMU, TCMs, and
D + I caches.

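ARM10 Family
The ARM10, announced in 1999, was designed for performance. It extends the ARM9 pipeline
to six stages. It also supports an optional vector floating-point (VFP) unit, which adds a seventh
stage to the ARM10 pipeline. The VFP increases floating-point performance and is
compliant with the IEEE 754 floating-point standard.

The ARM1020E includes the enhanced E instructions. It has separate 32K D + I caches, an
optional vector floating-point unit, and an MMU. The ARM1020E also has a dual 64-bit bus
interface for increased performance.

ARM1026EJ-S is very similar to the ARM926EJ-S but with both MPU and MMU. This
processor combines the performance of the ARM10 with the flexibility of an ARM926EJ-S.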

ARM11 Family
The ARM1136J-S, announced in 2003, was designed for high performance and power-efficient
applications. The ARM1136J-S executes architecture ARMv6 instructions. It incorporates an
eight-stage pipeline with separate load-store and arithmetic pipelines. Included in the ARMv6
instructions are single instruction multiple data (SIMD) extensions, designed to increase
performance.

The ARM1136JF-S is an ARM1136J-S with the addition of the vector floating-point unit for
fast floating-point operations.

Specialized Processors
StrongARM was originally co-developed by Digital Semiconductor and is now exclusively
licensed by Intel Corporation. It has been popular for PDAs and applications that require
performance with low power consumption. It is a Harvard architecture with separate D + I caches.
StrongARM includes a five-stage pipeline, but it does not support the Thumb instruction set.

Intel's XScale is a follow-on product to the StrongARM and offers dramatic increases in
performance. At the time of writing, XScale was quoted as being able to run up to 1 GHz.
XScale executes architecture v5TE instructions. It is a Harvard architecture and is similar to the
StrongARM, as it also includes an MMU.

SC100 is designed for low-power security applications. It is based on an ARM7TDMI core
with an MPU. This core is small and has low voltage and current requirements, which makes it
attractive for smart card applications.

UNIT-II

ARM Programming Model I

ARM instructions process data held in registers and only access memory with load and store
instructions. ARM instructions commonly take two or three operands. For instance the ADD
instruction below adds the two values stored in registers r1 and r2 (the source registers). It writes
the result to register r3 (the destination register).

Instruction syntax   Destination register (Rd)   Source register 1 (Rn)   Source register 2 (Rm)

ADD r3, r1, r2       r3                          r1                       r2

In the following sections we examine the function and syntax of the ARM instructions by
instruction class: data processing instructions, branch instructions, load-store instructions,
software interrupt instruction, and program status register instructions.

Data Processing Instructions

The data processing instructions manipulate data within registers. They are move instructions,
arithmetic instructions, logical instructions, comparison instructions, and multiply instructions.
Most data processing instructions can process one of their operands using the barrel shifter.

If you use the S suffix on a data processing instruction, then it updates the flags in the cpsr.

Move Instructions
Move is the simplest ARM instruction. It copies N into a destination register Rd, where N is a
register or immediate value. This instruction is useful for setting initial values and transferring
data between registers.

Syntax: <instruction>{<cond>}{S} Rd, N

MOV   Move a 32-bit value into a register                  Rd = N
MVN   Move the NOT of the 32-bit value into a register     Rd = ~N

For all data processing instructions, the second operand N is usually a register Rm or a
constant preceded by #.

Barrel Shifter
So far, N in a MOV instruction has been a simple register. But N can be more than just a
register or immediate value; it can also be a register Rm that has been preprocessed by the
barrel shifter prior to being used by a data processing instruction.

Data processing instructions are processed within the arithmetic logic unit (ALU). A unique and
powerful feature of the ARM processor is the ability to shift the 32-bit binary pattern in one of
the source registers left or right by a specific number of positions before it enters the ALU. This
shift increases the power and flexibility of many data processing operations.

There are data processing instructions that do not use the barrel shifter, for example, the MUL
(multiply), CLZ (count leading zeros), and QADD (signed saturated 32-bit add) instructions.

Pre-processing or shift occurs within the cycle time of the instruction. This is particularly useful
for loading constants into a register and achieving fast multiplies or division by a power of 2.
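To make the shift operations concrete, here is a hedged C model of four barrel shifter operations: LSL, LSR, ASR, and ROR. The function names are hypothetical, and the sketch restricts shift amounts to 0 through 31 for simplicity; it does not attempt to model every shift amount or the carry output of the real shifter.

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift left: multiplies by 2^n modulo 2^32. */
uint32_t lsl(uint32_t x, unsigned n)
{
    assert(n < 32);
    return x << n;
}

/* Logical shift right: unsigned division by 2^n. */
uint32_t lsr(uint32_t x, unsigned n)
{
    assert(n < 32);
    return x >> n;
}

/* Arithmetic shift right: vacated top bits are filled with the sign bit.
   The fill is built by hand so the result does not depend on how the
   C compiler shifts negative signed values. */
uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t shifted;
    assert(n < 32);
    shifted = x >> n;
    if ((x & 0x80000000u) && n > 0)
        shifted |= ~(0xffffffffu >> n);
    return shifted;
}

/* Rotate right: bits shifted out at the bottom re-enter at the top. */
uint32_t ror(uint32_t x, unsigned n)
{
    assert(n < 32);
    return n == 0 ? x : (x >> n) | (x << (32 - n));
}
```

With these definitions, `lsl(5, 2)` is 20 (a multiply by 4) and `asr(0xfffffff0, 2)` is 0xfffffffc (that is, -16 divided by 4 gives -4 in two's complement).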

Arithmetic Instructions
The arithmetic instructions implement addition and subtraction of 32-bit signed and unsigned
values.

Using the Barrel Shifter with Arithmetic Instructions
The wide range of second operand shifts available on arithmetic and logical instructions is a very
powerful feature of the ARM instruction set. The inline barrel shifter can be combined with an
arithmetic instruction, for example to multiply the value stored in register r1 by three.
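The notes do not reproduce the instruction itself. One common form, assumed here, is an ADD whose second operand is r1 shifted left by one, so that the result is r1 + 2*r1. The C sketch below mirrors that arithmetic; the function name is hypothetical.

```c
#include <stdint.h>

/* Mirrors an add with a shifted second operand:
   result = r1 + (r1 << 1), which equals r1 * 3 modulo 2^32. */
uint32_t times_three(uint32_t r1)
{
    return r1 + (r1 << 1);
}
```

For example, `times_three(5)` returns 15. The shift and the add are expressed as one operation here, just as the barrel shifter lets the processor perform both within a single instruction.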

BRANCH INSTRUCTIONS
A branch instruction changes the flow of execution or is used to call a routine. This type of
instruction allows programs to have subroutines, if-then-else structures, and loops.

The change of execution flow forces the program counter pc to point to a new address. The
ARMv5E instruction set includes four different branch instructions.

Syntax: B{<cond>} label
        BL{<cond>} label
        BX{<cond>} Rm
        BLX{<cond>} label | Rm

B     branch                      pc = label

BL    branch with link            pc = label
                                  lr = address of the next instruction after the BL

BX    branch exchange             pc = Rm & 0xfffffffe, T = Rm & 1

BLX   branch exchange with link   pc = label, T = 1
                                  pc = Rm & 0xfffffffe, T = Rm & 1
                                  lr = address of the next instruction after the BLX

The address label is stored in the instruction as a signed pc-relative offset and must be
within approximately 32 MB of the branch instruction. T refers to the Thumb bit in the cpsr.
When instructions set T, the ARM switches to Thumb state.
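The two address calculations above can be sketched in C. The BX rule follows the table directly. The branch target calculation additionally assumes the standard ARM encoding, which these notes do not spell out: a 24-bit signed word offset that is shifted left by two and added to the pc, where the pc reads as the branch address plus 8. All names are hypothetical.

```c
#include <stdint.h>

/* Target of an ARM-state B or BL, given the instruction's address and
   its 24-bit offset field (assumed encoding: signed word offset). */
uint32_t arm_branch_target(uint32_t instruction_address, uint32_t imm24)
{
    uint32_t offset = (imm24 & 0x00ffffffu) << 2;  /* words to bytes */
    if (offset & 0x02000000u)                      /* sign bit of the 26-bit offset */
        offset |= 0xfc000000u;                     /* sign-extend to 32 bits */
    return instruction_address + 8u + offset;      /* pc reads as address + 8 */
}

typedef struct {
    uint32_t pc;     /* new program counter */
    unsigned thumb;  /* new value of the T bit */
} bx_result;

/* BX Rm: bit 0 of Rm selects the state, the rest is the target address. */
bx_result branch_exchange(uint32_t rm)
{
    bx_result r;
    r.pc = rm & 0xfffffffeu;
    r.thumb = rm & 1u;
    return r;
}
```

A 26-bit signed byte offset spans roughly plus or minus 32 MB, which is where the range limit quoted above comes from. An offset field of 0xfffffe (that is, -2 words) branches back to the instruction itself, and `branch_exchange(0x9001)` gives pc = 0x9000 with T = 1.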
Example:

This example shows a forward and backward branch. Because these loops are address
specific, we do not include the pre- and post-conditions. The forward branch skips three
instructions.

Branches are used to change execution flow. Most assemblers hide the details of a branch
instruction encoding by using labels. In this example, forward and backward are the labels. The
branch labels are placed at the beginning of the line and are used to mark an address that can be
used later by the assembler to calculate the branch offset.

LOAD-STORE INSTRUCTIONS
Load-store instructions transfer data between memory and processor registers. There are
three types of load-store instructions: single-register transfer, multiple-register transfer, and
swap.

SINGLE-REGISTER TRANSFER

These instructions are used for moving a single data item in and out of a register. The
datatypes supported are signed and unsigned words (32-bit), halfwords (16-bit), and bytes.

SINGLE-REGISTER LOAD-STORE ADDRESSING MODES
The ARM instruction set provides different modes for addressing memory. These modes
incorporate one of the indexing methods: preindex with writeback, preindex, and postindex.
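The three indexing methods differ in which address is used for the memory access and in what is left in the base register afterwards. The C sketch below models just that bookkeeping; the struct and function names are hypothetical, and the assembler forms in the comments are the usual ARM syntax rather than something shown in these notes.

```c
#include <stdint.h>

typedef struct {
    uint32_t access_address; /* address used for the memory access */
    uint32_t base_after;     /* value left in the base register Rn */
} indexed_access;

/* Preindex, e.g. LDR r0, [r1, #4]: access base + offset, base unchanged. */
indexed_access preindex(uint32_t base, int32_t offset)
{
    indexed_access r = { base + (uint32_t)offset, base };
    return r;
}

/* Preindex with writeback, e.g. LDR r0, [r1, #4]!:
   access base + offset and update the base to that address. */
indexed_access preindex_writeback(uint32_t base, int32_t offset)
{
    uint32_t target = base + (uint32_t)offset;
    indexed_access r = { target, target };
    return r;
}

/* Postindex, e.g. LDR r0, [r1], #4:
   access the unmodified base, then add the offset to the base. */
indexed_access postindex(uint32_t base, int32_t offset)
{
    indexed_access r = { base, base + (uint32_t)offset };
    return r;
}
```

With a base of 0x1000 and an offset of 4, preindex accesses 0x1004 and leaves the base at 0x1000, preindex with writeback accesses 0x1004 and leaves 0x1004, and postindex accesses 0x1000 and leaves 0x1004.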

MULTIPLE-REGISTER TRANSFER
Load-store multiple instructions can transfer multiple registers between memory and the
processor in a single instruction. The transfer occurs from a base address register Rn pointing
into memory. Multiple-register transfer instructions are more efficient than single-register
transfers for moving blocks of data around memory and saving and restoring context and stacks.

CONDITIONAL EXECUTION
Most ARM instructions are conditionally executed: you can specify that the instruction
only executes if the condition code flags pass a given condition or test. By using conditional
execution instructions you can increase performance and code density.

The condition field is a two-letter mnemonic appended to the instruction mnemonic.

The default mnemonic is AL, or always execute.

Conditional execution reduces the number of branches, which also reduces the number of
pipeline flushes and thus improves the performance of the executed code.
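How a condition mnemonic is tested against the flags can be sketched in C. Only AL is named in these notes; the other fourteen mnemonics and their flag tests below are the standard ARM condition codes and are included as an assumption. The enum and function names are hypothetical.

```c
/* Standard ARM condition mnemonics (only AL appears in the notes above). */
typedef enum {
    COND_EQ, COND_NE, COND_CS, COND_CC, COND_MI, COND_PL, COND_VS, COND_VC,
    COND_HI, COND_LS, COND_GE, COND_LT, COND_GT, COND_LE, COND_AL
} condition;

/* Returns nonzero if an instruction carrying `cond` would execute,
   given the current N, Z, C, V flags (each passed as 0 or 1). */
int condition_passes(condition cond, unsigned n, unsigned z,
                     unsigned c, unsigned v)
{
    switch (cond) {
    case COND_EQ: return z != 0;             /* equal */
    case COND_NE: return z == 0;             /* not equal */
    case COND_CS: return c != 0;             /* carry set */
    case COND_CC: return c == 0;             /* carry clear */
    case COND_MI: return n != 0;             /* negative */
    case COND_PL: return n == 0;             /* positive or zero */
    case COND_VS: return v != 0;             /* overflow */
    case COND_VC: return v == 0;             /* no overflow */
    case COND_HI: return c != 0 && z == 0;   /* unsigned higher */
    case COND_LS: return c == 0 || z != 0;   /* unsigned lower or same */
    case COND_GE: return n == v;             /* signed >= */
    case COND_LT: return n != v;             /* signed < */
    case COND_GT: return z == 0 && n == v;   /* signed > */
    case COND_LE: return z != 0 || n != v;   /* signed <= */
    case COND_AL: return 1;                  /* always */
    }
    return 0;
}
```

With Z set, an instruction suffixed EQ executes and one suffixed NE is skipped, without any branch being taken.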

UNIT-III

ARM Programming Model II

Thumb Instruction Set
Thumb encodes a subset of the 32-bit ARM instructions into a 16-bit instruction set
space. Since Thumb has higher performance than ARM on a processor with a 16-bit data bus, but
lower performance than ARM on a 32-bit data bus, use Thumb for memory-constrained systems.

Thumb has higher code density than ARM, where code density relates to the space taken up in
memory by an executable program. For memory-constrained embedded systems, for example,
mobile phones and PDAs, code density is very important. Cost pressures also limit memory size,
width, and speed.

On average, a Thumb implementation of the same code takes up around 30% less
memory than the equivalent ARM implementation. As an example, consider the same divide code
routine implemented in ARM and Thumb assembly code: even though the Thumb implementation
uses more instructions, the overall memory footprint is reduced. Code density was the main
driving force for the Thumb instruction set. Because it was also designed as a compiler target,
rather than for hand-written assembly code, we recommend that you write Thumb-targeted code in
a high-level language like C or C++.

Each Thumb instruction is related to a 32-bit ARM instruction. For example, a simple Thumb ADD
instruction is decoded into an equivalent ARM ADD instruction. Only the branch relative
instruction can be conditionally executed. The limited space available in 16 bits causes the barrel
shift operations ASR, LSL, LSR, and ROR to be separate instructions in the Thumb ISA.

THUMB REGISTER USAGE

In Thumb state, you do not have direct access to all registers. Only the low registers r0 to
r7 are fully accessible, as shown in Table 4.2. The higher registers r8 to r12 are only
accessible with MOV, ADD, or CMP instructions. CMP and all the data processing instructions
that operate on low registers update the condition flags in the cpsr.

You may have noticed from the Thumb instruction set list and from the Thumb register
usage table that there is no direct access to the cpsr or spsr. In other words, there are no MSR-
and MRS-equivalent Thumb instructions.

To alter the cpsr or spsr, you must switch into ARM state to use MSR and MRS.
Similarly, there are no coprocessor instructions in Thumb state. You need to be in ARM state to
access the coprocessor.

OTHER BRANCH INSTRUCTIONS

There are two variations of the standard branch instruction, B. The first is similar to the
ARM version and is conditionally executed; the branch range is limited to a signed 8-bit
immediate, or -256 to +254 bytes. The second version removes the conditional part of the
instruction and expands the effective branch range to a signed 11-bit immediate, or -2048 to
+2046 bytes.
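The byte ranges quoted above follow from the width of the immediate: a signed n-bit immediate counts 16-bit halfwords, so its range in bytes is twice its numeric range. A small C sketch of that arithmetic follows; the names are hypothetical, and the halfword scaling is an assumption inferred from the quoted ranges.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t min_bytes; /* most negative reachable byte offset */
    int32_t max_bytes; /* most positive reachable byte offset */
} branch_range;

/* Byte range of a Thumb branch with a signed immediate of `bits` bits.
   The immediate is assumed to count 2-byte halfwords. */
branch_range thumb_branch_range(unsigned bits)
{
    branch_range r;
    int32_t half;
    assert(bits >= 2 && bits <= 16);
    half = (int32_t)1 << (bits - 1);   /* magnitude of the most negative value */
    r.min_bytes = -half * 2;
    r.max_bytes = (half - 1) * 2;
    return r;
}
```

`thumb_branch_range(8)` gives -256 to +254 and `thumb_branch_range(11)` gives -2048 to +2046, matching the two ranges in the text.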

The conditional branch instruction is the only conditionally executed instruction in
Thumb state.

Syntax: B<cond> label
        B label
        BL label

UNIT-IV

ARM Programming

BASIC C DATA TYPES
There are also differences between the addressing modes available when loading and
storing data of each type.

ARM processors have 32-bit registers and 32-bit data processing operations. The ARM
architecture is a RISC load/store architecture. In other words you must load values from memory
into registers before acting on them. There are no arithmetic or logical instructions that manipulate
values in memory directly.

Early versions of the ARM architecture (ARMv1 to ARMv3) provided hardware support
for loading and storing unsigned 8-bit and unsigned or signed 32-bit values.

These architectures were used on processors prior to the ARM7TDMI. The load/store
instruction classes available vary by ARM architecture.

Loads that act on 8- or 16-bit values extend the value to 32 bits before writing to an ARM
register. Unsigned values are zero-extended, and signed values sign-extended. This means that the
cast of a loaded value to an int type does not cost extra instructions. Similarly, a store of an 8- or
16-bit value selects the lowest 8 or 16 bits of the register. The cast of an int to a smaller type does
not cost extra instructions on a store.
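The extension and truncation rules can be modelled with masks. This C sketch is illustrative only: the function names are hypothetical and stand in for what the load and store hardware does, and the sign extension is written out by hand rather than relying on C's signed conversions.

```c
#include <stdint.h>

/* Unsigned 8-bit load: upper 24 bits of the register become zero. */
uint32_t zero_extend8(uint32_t raw)
{
    return raw & 0xffu;
}

/* Signed 8-bit load: bit 7 is copied into the upper 24 bits. */
uint32_t sign_extend8(uint32_t raw)
{
    uint32_t b = raw & 0xffu;
    return (b & 0x80u) ? (b | 0xffffff00u) : b;
}

/* Unsigned 16-bit load: upper 16 bits of the register become zero. */
uint32_t zero_extend16(uint32_t raw)
{
    return raw & 0xffffu;
}

/* Signed 16-bit load: bit 15 is copied into the upper 16 bits. */
uint32_t sign_extend16(uint32_t raw)
{
    uint32_t h = raw & 0xffffu;
    return (h & 0x8000u) ? (h | 0xffff0000u) : h;
}

/* 16-bit store: only the lowest 16 bits of the register reach memory. */
uint32_t store_low16(uint32_t reg)
{
    return reg & 0xffffu;
}
```

The byte 0x80 becomes 0x00000080 when zero-extended and 0xffffff80 when sign-extended, which is why a loaded value is already a valid int without further instructions.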

The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores
directly, through new instructions. Since these instructions are a later addition, they do not support
as many addressing modes as the pre-ARMv4 instructions.

Finally, ARMv5 adds instruction support for 64-bit loads and stores. This is available in
ARM9E and later cores.

Prior to ARMv4, ARM processors were not good at handling signed 8-bit or any 16-bit
values. Therefore ARM C compilers define char to be an unsigned 8-bit value, rather than a signed
8-bit value as is typical in many other compilers.

Compilers armcc and gcc use the datatype mappings in Table 5.2 for an ARM target. The
exceptional case for type char is worth noting as it can cause problems when you are porting code
from another processor architecture. A common example is using a char type variable i as a loop
counter, with loop continuation condition i >= 0. As i is unsigned for the ARM compilers, the loop
will never terminate. Fortunately armcc produces a warning in this situation: unsigned comparison
with 0. Compilers also provide an override switch to make char signed. For example, the command
line option -fsigned-char will make char signed on gcc. The command line option -zc will have the
same effect with armcc.

FUNCTION ARGUMENT TYPES

Converting local variables from types char or short to type int increases performance and
reduces code size. The same holds for function arguments. Consider the following simple function,
which adds two 16-bit values, halving the second, and returns a 16-bit sum:

short add_v1(short a, short b)
{
    return a + (b >> 1);
}

This function illustrates the problems faced by the compiler. The input values a, b, and the
return value will be passed in 32-bit ARM registers. Should the compiler assume that these 32-bit
values are in the range of a short type, that is, -32,768 to +32,767? Or should the compiler force
values to be in this range by sign-extending the lowest 16 bits to fill the 32-bit register? The
compiler must make compatible decisions for the function caller and callee. Either the caller or
callee must perform the cast to a short type.

We say that function arguments are passed wide if they are not reduced to the range of the
type and narrow if they are. You can tell which decision the compiler has made by looking at the
assembly output for add_v1. If the compiler passes arguments wide, then the callee must reduce
function arguments to the correct range. If the compiler passes arguments narrow, then the caller
must reduce the range. If the compiler returns values wide, then the caller must reduce the return
value to the correct range. If the compiler returns values narrow, then the callee must reduce the
range before returning the value.
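The difference between wide and narrow passing can be sketched in C by modeling registers as 32-bit integers. This is a hedged illustration, not compiler output: narrow_to_short, add_wide_args, and add_narrow_args are hypothetical names, and the sign extension is written arithmetically so that it does not depend on implementation-defined conversions.

```c
#include <stdint.h>

/* Reduce a 32-bit register value to the range of a 16-bit short by
   sign-extending its lowest 16 bits. */
int32_t narrow_to_short(int32_t wide)
{
    int32_t low16 = wide & 0xFFFF;              /* keep the lowest 16 bits */
    return (low16 & 0x8000) ? low16 - 0x10000 : low16;
}

/* Arguments passed wide: the callee cannot trust the upper bits, so it
   reduces both inputs itself before using them. */
int32_t add_wide_args(int32_t a, int32_t b)
{
    a = narrow_to_short(a);
    b = narrow_to_short(b);
    return narrow_to_short(a + (b >> 1));       /* return value narrow */
}

/* Arguments passed narrow: the caller has already reduced the inputs,
   so the callee only reduces the return value. */
int32_t add_narrow_args(int32_t a, int32_t b)
{
    return narrow_to_short(a + (b >> 1));
}
```

With narrow passing the callee does less work, but every caller must guarantee the range of the values it passes.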

For armcc in ADS, function arguments are passed narrow and values returned narrow. In
other words, the caller casts argument values and the callee casts return values. The compiler uses
the ANSI prototype of the function to determine the datatypes of the function arguments.

The armcc output for add_v1 shows that the compiler casts the return value to a short type,
but does not cast the input values. It assumes that the caller has already ensured that the 32-bit
values r0 and r1 are in the range of the short type. This shows narrow passing of arguments and
return value.

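The gcc compiler we used is more cautious and makes no assumptions about the range of
argument values. This version of the compiler reduces the input arguments to the range of the
short type.

C LOOPING STRUCTURES

This section looks at the most efficient ways to code loops on the ARM. We start with loops
with a fixed number of iterations and then move on to loops with a variable number of iterations.
Finally we look at loop unrolling.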

LOOPS WITH A FIXED NUMBER OF ITERATIONS

We return to the checksum example and look at the looping structure.
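As a sketch of the kind of looping structure under discussion, the two functions below sum a fixed block of 64 words. The names checksum_up and checksum_down and the block length are assumptions made for illustration. The second form counts down to zero, which is often cheaper on ARM because the subtraction that updates the counter can also set the flags tested by the loop branch, so no separate compare is needed.

```c
/* Fixed-count loop written with an incrementing counter. */
int checksum_up(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}

/* The same loop counting down to zero with an unsigned int counter. */
int checksum_down(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--) {
        sum += *data++;
    }
    return sum;
}
```

Both functions return the same result; only the loop control differs.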

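Function Calls

The ARM Procedure Call Standard (APCS) defines how to pass function arguments and
return values in ARM registers. The more recent ARM-Thumb Procedure Call Standard (ATPCS)
covers ARM and Thumb interworking as well.

The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and
r3. Subsequent integer arguments are placed on the full descending stack, ascending in memory.
Function return integer values are passed in r0.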

This description covers only integer or pointer arguments. Two-word arguments such as
long long or double are passed in a pair of consecutive argument registers and returned in r0, r1.
The compiler may pass structures in registers or by reference according to command line compiler
options.

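We can define the four-register rule. Functions with four or fewer arguments are far more
efficient to call than functions with five or more arguments. For functions with four or fewer
arguments, the compiler can pass all the arguments in registers. For functions with more
arguments, both the caller and callee must access the stack for some arguments. Note that for C++
the first argument to an object method is the this pointer. This argument is implicit and additional
to the explicit arguments.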

If your C function needs more than four arguments, or your C++ method more than three
explicit arguments, then it is usually more efficient to use structures. Group related
arguments into structures, and pass a structure pointer rather than multiple arguments. Which
arguments are related will depend on the structure of your software.
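The advice to group arguments can be sketched as below. The function and type names (weighted_sum5, SumArgs, weighted_sum_struct) are hypothetical and chosen only for illustration. Under the calling convention described above, the five-argument version needs the stack for its fifth argument, while the structure version passes a single pointer in r0.

```c
/* Five separate arguments: the fifth one does not fit in r0-r3. */
int weighted_sum5(int a, int b, int c, int d, int e)
{
    return a + 2 * b + 3 * c + 4 * d + 5 * e;
}

/* Related arguments grouped into one structure. */
typedef struct {
    int a;
    int b;
    int c;
    int d;
    int e;
} SumArgs;

/* One pointer argument: passed in a single register. */
int weighted_sum_struct(const SumArgs *args)
{
    return args->a + 2 * args->b + 3 * args->c + 4 * args->d + 5 * args->e;
}
```

The structure form pays off most when the same group of values is passed through several layers of calls, since only the pointer is copied each time.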
Pointer Aliasing

Two pointers are said to alias when they point to the same address. If you write to one
pointer, it will affect the value you read from the other pointer. In a function, the compiler often
does not know which pointers can alias and which cannot. The compiler must be
pessimistic and assume that any write to a pointer may affect the value read from any other
pointer. Consider, for example, a function that increments two values through pointers
by a step amount.
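A minimal sketch of such a function follows. The original listing is missing from these notes, so the names timers_v1 and timers_v2 and their signatures are assumptions. In the first version the compiler must reload *step after the first store, because timer1 might alias step. The second version reads the step once into a local variable, which removes that doubt.

```c
/* The compiler must assume that writing *timer1 may change *step,
   so *step is read from memory twice. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* Reading the step into a local variable means it is loaded only once,
   whatever the pointers alias. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int s = *step;

    *timer1 += s;
    *timer2 += s;
}
```

The two versions behave the same when the pointers are distinct, but differ when step aliases timer1, which is exactly why the compiler cannot make this transformation on its own.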
STRUCTURE ARRANGEMENT

The way you lay out a frequently used structure can have a significant impact on its
performance and code density. There are two issues concerning structures on the ARM: alignment
of the structure entries and the overall size of the structure.

For architectures up to and including ARMv5TE, load and store instructions are only
guaranteed to load and store values at addresses aligned to the access width. Table 5.4
summarizes these restrictions.

For this reason, ARM compilers will automatically align the start address of a structure to a
multiple of the largest access width used within the structure (usually four or eight bytes) and align
entries within structures to their access width by inserting padding.

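For example, consider the structure

struct {
    char a;
    int b;
    char c;
    short d;
};

For a little-endian memory system the compiler will lay this out with padding added so that
each object is aligned to its own size. Here a sits at offset 0 followed by three bytes of padding,
b at offset 4, c at offset 8 followed by one byte of padding, and d at offset 10, giving a 12-byte
structure.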
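The padding can be checked with offsetof and sizeof. The sketch below assumes a target where int is four bytes and short is two bytes, each aligned to its own size, which holds for ARM and for most desktop compilers. The struct tags padded and reordered and the helper padded_waste are hypothetical names. The second structure holds the same fields sorted so that small members share a word, which is one common way to reduce the overall size.

```c
#include <stddef.h>

/* Field order from the example above: padding follows a and c. */
struct padded {
    char a;
    int b;
    char c;
    short d;
};

/* Same fields, ordered so that no padding is needed. */
struct reordered {
    char a;
    char c;
    short d;
    int b;
};

/* Number of padding bytes the compiler added to struct padded. */
size_t padded_waste(void)
{
    size_t used = sizeof(char) + sizeof(int) + sizeof(char) + sizeof(short);
    return sizeof(struct padded) - used;
}
```

Under the stated assumptions the first layout occupies 12 bytes and the second occupies 8.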
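Floating Point

The majority of ARM processor implementations do not provide hardware floating-point
support, which saves on power and area when using ARM in a price-sensitive, embedded
application. With the exceptions of the Floating Point Accelerator (FPA) used on the ARM7500FE
and the Vector Floating Point accelerator (VFP) hardware, the C compiler must provide support
for floating point in software. In practice, this means that the C compiler converts every
floating-point operation into a subroutine call. The C library contains subroutines to simulate
floating-point behavior using integer arithmetic. Even so, floating-point
algorithms will execute far more slowly than corresponding integer algorithms.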

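If you need fast execution and fractional values, you should use fixed-point or
block-floating algorithms. Fractional values are most often used when processing digital signals
such as audio and video. This is a large and important area of programming. For best performance
you need to code the algorithms in assembly.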
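A minimal sketch of fixed-point arithmetic follows, using the common Q15 format in which a 16-bit integer x represents the fraction x / 32768. The type name q15_t and the function names are assumptions for illustration. The code assumes that right-shifting a negative value is an arithmetic shift, as it is with armcc and gcc, and it does not saturate, so multiplying -1.0 by -1.0 would overflow.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* value = integer / 32768, range [-1.0, 1.0) */

/* Multiply two Q15 fractions using only integer operations.
   The 32-bit product has 30 fractional bits; shifting right by 15
   returns it to Q15. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (q15_t)(product >> 15);
}

/* Addition needs no scaling because both operands share the format. */
q15_t q15_add(q15_t a, q15_t b)
{
    return (q15_t)(a + b);
}
```

For example, 0.5 is stored as 16384, and q15_mul(16384, 16384) yields 8192, which represents 0.25.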

Instruction Scheduling
The time taken to execute instructions depends on the implementation pipeline. For this
chapter, we assume ARM9TDMI pipeline timings.

The following rules summarize the cycle timings for common instruction classes on the
ARM9TDMI.
Instructions that are conditional on the value of the ARM condition codes in the cpsr take one
cycle if the condition is not met. If the condition is met, then the following rules apply:

ALU operations such as addition, subtraction, and logical operations take one cycle. This
includes a shift by an immediate value. If you use a register-specified shift, then add one cycle. If
the instruction writes to the pc, then add two cycles.

Load instructions that load N 32-bit words of memory such as LDR and LDM take N
cycles to issue, but the result of the last word loaded is not available on the following cycle. The
updated load address is available on the next cycle. This assumes zero-wait-state memory for an
uncached system, or a cache hit for a cached system. An LDM of a single value is exceptional,
taking two cycles. If the instruction loads pc, then add two cycles.

Load instructions that load 16-bit or 8-bit data such as LDRB, LDRSB, LDRH, and
LDRSH take one cycle to issue. The load result is not available on the following two cycles. The
updated load address is available on the next cycle. This assumes zero-wait-state memory for an
uncached system, or a cache hit for a cached system.

Branch instructions take three cycles.

Store instructions that store N values take N cycles. This assumes zero-wait-state memory
for an uncached system, or a cache hit or a write buffer with N free entries for a cached system. An
STM of a single value is exceptional, taking two cycles.

Multiply instructions take a varying number of cycles depending on the value of the second
operand in the product (see Table D.6 in Section D.3).
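The timing rules above can be collected into a small estimator, sketched here in C. The type and function names (OpClass, Instr, arm9tdmi_issue_cycles) are hypothetical. The function returns issue cycles only: it ignores the extra stalls that occur when a later instruction uses a load result too early, and it leaves out multiplies because their timing depends on Table D.6, which is not reproduced in these notes.

```c
typedef enum {
    OP_ALU, OP_LDR, OP_LDM, OP_LOAD_NARROW, OP_STR, OP_STM, OP_BRANCH
} OpClass;

typedef struct {
    OpClass op;
    int n_words;           /* words transferred by LDM/STM, else ignored */
    int register_shift;    /* nonzero if an ALU op uses a register shift */
    int writes_pc;         /* nonzero if an ALU op or load writes the pc */
    int condition_passed;  /* zero if the condition code test fails */
} Instr;

/* Estimate issue cycles, assuming zero-wait-state memory or cache hits. */
int arm9tdmi_issue_cycles(const Instr *in)
{
    int cycles;

    if (!in->condition_passed)
        return 1;                       /* failed condition: one cycle */

    switch (in->op) {
    case OP_ALU:
        cycles = 1 + (in->register_shift ? 1 : 0);
        return cycles + (in->writes_pc ? 2 : 0);
    case OP_LDR:
        return 1 + (in->writes_pc ? 2 : 0);
    case OP_LDM:
        cycles = (in->n_words == 1) ? 2 : in->n_words;
        return cycles + (in->writes_pc ? 2 : 0);
    case OP_LOAD_NARROW:
        return 1;
    case OP_STR:
        return 1;
    case OP_STM:
        return (in->n_words == 1) ? 2 : in->n_words;
    case OP_BRANCH:
        return 3;
    }
    return -1;                          /* unknown instruction class */
}
```

Summing this estimate over a loop body gives a lower bound on its cycle count; the real count is higher whenever a load result is used before it is available.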

To schedule code efficiently, we need to understand the ARM pipeline and dependencies. The
ARM9TDMI pipeline includes the following stages:

Fetch: Fetch from memory the instruction at address pc. The instruction is loaded into the
core and then proceeds down the core pipeline.

Decode: Decode the instruction that was fetched in the previous cycle. The processor also
reads the input operands from the register bank if they are not available via one of the forwarding
paths.

ALU: Execute the instruction that was decoded in the previous cycle. Note this instruction
was originally fetched from address pc - 8 (ARM state) or pc - 4 (Thumb state). Normally this
involves calculating the answer for a data processing operation, or the address for a load, store, or
branch operation. Some instructions may spend several cycles in this stage. For example, multiply
and register-controlled shift operations take several ALU cycles.

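Conditional Execution

The processor core can conditionally execute most ARM instructions. This conditional
execution is based on one of 15 condition codes. If you do not specify a condition, the
assembler defaults to the execute always condition (AL). The other 14 conditions split into
seven pairs of complements. The conditions depend on the four condition code flags N, Z, C, V
stored in the cpsr register. See Table A.2 in Appendix A for the list of possible ARM conditions.

By default, ARM instructions do not update the N, Z, C, V flags in the cpsr. For most
instructions, you append an S suffix to the instruction mnemonic to update the flags. Exceptions
to this are comparison instructions that do not write to a destination register. Their sole purpose is
to update the flags, so they do not require the S suffix. By combining conditional execution and
conditional setting of the flags, you can implement simple if statements without any need for
branches. This improves efficiency, since branches can take many cycles, and also reduces code
size.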
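As a sketch of a simple if statement that suits conditional execution, the C function below converts a value in the range 0 to 15 into a hexadecimal digit. The function name hex_digit is an assumption, and the instruction sequence in the comment is one possible hand-written translation, not guaranteed compiler output.

```c
/* Convert 0..15 to an ASCII hexadecimal digit.
 *
 * One possible branch-free ARM sequence, with i and the result in r0:
 *     CMP   r0, #10
 *     ADDLO r0, r0, #'0'       ; executed only if r0 < 10 (unsigned)
 *     ADDHS r0, r0, #'A'-10    ; executed only if r0 >= 10 (unsigned)
 */
char hex_digit(unsigned int i)
{
    char c;

    if (i < 10) {
        c = (char)(i + '0');
    } else {
        c = (char)(i + 'A' - 10);
    }
    return c;
}
```

The two ADD instructions use complementary conditions (LO and HS), so exactly one of them executes after the compare and no branch is required.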
