ARM – An Understanding and More
Shriram K Vasudevan
Dept. of CSE, Amrita University, Coimbatore, India.
[email protected]
Ph: 89399 18562
Agenda
• Features and Basics
• Architecture
• Programming Model
• Instruction Set
• Thumb Mode
• A Few Sample Codes
3/8/2020 ARM by Shriram 3
What is ARM?
• The ARM processor is fundamentally a 32-bit processor, aimed particularly at high-end applications that involve complex computation and calculation.
• The ARM processor was first developed at Acorn Computers Limited of Cambridge, England, between 1983 and 1985, shortly after the concept of RISC was introduced at Stanford and Berkeley around 1980. (ARM – Acorn RISC Machine)
• ARM specializes in the concept of the ARM core, which it licenses to a number of other manufacturers so they can build a variety of chips around the same processor core. (In other words: I tell you how to make it, and you make it under your own name!)
Contd.,
• So the focus is not on a family of processors, but conceptually on a CPU architecture that may appear in a number of different chips intended for embedded applications.
• ARM is based on the RISC architecture, but it is not purely RISC, because it has been enhanced to meet the requirements of embedded applications. Versatility!
• The requirements for embedded applications are essentially high code density, low power consumption, and a small silicon footprint. Architecturally, ARM satisfies the various conditions and properties of RISC processors as well.
Features
• The ARM processor has a large, uniform register file.
• It is fundamentally a load-store architecture: data processing operations work only between registers and do not involve any memory operations.
• It is a 32-bit processor that also supports 16-bit and 8-bit instruction variants.
• These variants, embedded into the 32-bit processor, are the THUMB (16-bit) and Jazelle (8-bit Java bytecode) architectures; we will discuss them later.
• ARM has a very good speed-to-power-consumption ratio and the high code density required by embedded applications.
• It has a barrel shifter in the data path, which maximizes the usage of the hardware available on the chip.
• It has auto-increment and auto-decrement addressing modes to optimize program loops; this is not very common in RISC processors. ARM also supports loading and storing multiple data elements with a single instruction.
• ARM has a feature named 'conditional execution', where an instruction is executed only when a condition is met, which maximizes execution throughput.
The Pipeline
• At the heart of the ARM7 CPU is the instruction pipeline. The pipeline is used to process instructions taken from the program store.
• On the ARM7, a three-stage pipeline is used.
Contd.,
• A three-stage pipeline is the simplest form of pipeline and does not suffer from the kinds of hazards, such as read-before-write, seen in pipelines with more stages.
• The pipeline has hardware-independent stages that execute one instruction while decoding a second and fetching a third.
• The pipeline speeds up the throughput of CPU instructions so effectively that most ARM instructions can be executed in a single cycle.
• The pipeline works most efficiently on linear code. As soon as a branch is encountered, the pipeline is flushed and must be refilled before full execution speed can be resumed. (Very essential!)
• As we shall see, the ARM instruction set has some interesting features which help smooth out small jumps in your code in order to get the best flow of code through the pipeline.
• As the pipeline is part of the CPU, the programmer does not have any direct exposure to it.
Contd.,
• The central set of registers is a bank of 16 user registers, R0–R15. Each of these registers is 32 bits wide, and R0–R12 are user registers in the sense that they have no other specific function. (That is, general-purpose registers.)
• Registers R13–R15 do have special functions in the CPU.
• R13 is used as the stack pointer (SP). (You know what it is!)
Contd.,
• R14 is called the link register (LR).
• When a call is made to a function, the return address is automatically stored in the link register and is immediately available on return from the function.
• This allows quick entry into and return from a 'leaf' function (a function that is not going to call further functions).
• If the function is part of a branch (i.e. it is going to call other functions), then the link register must be preserved on the stack (via R13). (Do you understand?)
• Finally, R15 is the program counter (PC).
• Interestingly, many instructions can be performed on R13–R15 as if they were standard user registers. (But don't ever do this, please!)
Contd.,
• The top four bits of the CPSR contain the condition codes, which are set by the CPU. (Flags)
• The lowest eight bits of the CPSR contain flags which may be set or cleared by the application code.
• Bits 7 and 6 are the I and F bits. These bits are used to enable and disable the two interrupt sources which are external to the ARM7 CPU. You should be careful when programming these two bits: to disable either interrupt source, the bit must be set to '1', not '0' as you might expect. Bit 5 is the THUMB (T) bit.
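To make the bit layout concrete, here is a small illustrative sketch in Python (not ARM code, and not part of the original slides) that unpacks the CPSR fields just described: condition flags N, Z, C, V in bits 31–28, I in bit 7, F in bit 6, T in bit 5, and the mode in the bottom five bits.

```python
# Sketch: decode the ARM7 CPSR fields described above.
def decode_cpsr(cpsr):
    return {
        "N": (cpsr >> 31) & 1,   # negative flag
        "Z": (cpsr >> 30) & 1,   # zero flag
        "C": (cpsr >> 29) & 1,   # carry flag
        "V": (cpsr >> 28) & 1,   # overflow flag
        "I": (cpsr >> 7) & 1,    # 1 = IRQ DISABLED (set to disable!)
        "F": (cpsr >> 6) & 1,    # 1 = FIQ DISABLED
        "T": (cpsr >> 5) & 1,    # 1 = THUMB state
        "mode": cpsr & 0x1F,     # bottom five mode bits
    }

# User mode (0b10000) with the zero flag set and IRQ disabled:
fields = decode_cpsr((1 << 30) | (1 << 7) | 0b10000)
print(fields["Z"], fields["I"], fields["mode"])  # 1 1 16
```

Note how the I bit reads as 1 when the IRQ source is disabled, matching the counter-intuitive polarity mentioned above.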
Contd.,
• The ARM7 CPU is capable of executing two instruction sets: the ARM instruction set, which is 32 bits wide, and the THUMB instruction set, which is 16 bits wide. (Jazelle also exists; we are not worried about it here!)
• Consequently, the T bit reports which instruction set is being executed. (Refer to the labelled slide.)
• Your code should not try to set or clear this bit directly to switch between instruction sets. (That is, the instructions are different for different modes.)
• The last five bits are the mode bits.
• The ARM7 has 7 different operating modes. (We shall see this later.)
• Your application code will normally run in user mode, with access to the register bank R0–R15 and the CPSR, as already discussed.
• However, in response to an exception such as an interrupt, memory error or software interrupt instruction, the processor will change modes.
• When this happens, registers R0–R12 and R15 remain the same, but R13 (SP) and R14 (LR) are replaced by a new pair of registers unique to that mode. This means that each mode has its own stack pointer and link register. (Understand this, please; it will help in resuming operation.)
• In addition, the fast interrupt mode (FIQ) has duplicate registers for R8–R12.
Contd.,
• Each of the modes except user mode has an additional register called the saved program status register (SPSR).
• If your application is running in user mode when an exception occurs, the mode will change and the current contents of the CPSR will be saved into the SPSR. (This, folks, is context saving.)
Register Set and View
[Diagram: the register bank. Besides the shared CPSR, each exception mode has its own banked SPSR.]
Contd.,
• User: the unprivileged mode under which most tasks run.
Contd.,
• All modes other than user mode are privileged.
• These have full access to system resources and can change modes freely.
• There are also exception modes (5 of them; see the next slide), entered when an exception occurs.
• System mode is like user mode, but privileged.
• It is intended for operating system tasks.
• It does not require additional registers, but it does need access to system resources, so it was made a privileged mode.
Exception Modes
• When an exception occurs, the CPU will change modes and the PC will be forced to an exception vector. The vector table starts at address zero with the reset vector and then has an exception vector every four bytes.
Contd.,
• There is a gap in the vector table because there is a
missing vector at 0x00000014.
• This location was used on an earlier ARM architecture
and has been preserved on ARM7 to ensure software
compatibility between different ARM architectures.
Contd.,
• In a classical shift register, a shift of n positions requires n clock cycles, because the shifting is driven by the clock.
• In a barrel shifter, a combinational circuit is used, so the shift takes place in a single pass.
• In fact, the shift takes place within the same instruction itself. This is a very basic enhancement present in the ARM data path.
• The other interesting feature is the increment and decrement logic, which can operate on the registers independently of the ALU.
• This facilitates the auto-increment and auto-decrement features of the ARM, which are used for moving blocks of data between memory and registers.
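To see what "the shift takes place within the same instruction" buys you, here is a small illustrative Python sketch (not ARM code). An ARM instruction such as ADD r0, r1, r2, LSL #2 computes r0 = r1 + (r2 << 2) in one instruction, because the barrel shifter sits in front of the ALU; the equivalent computation is sketched below under the assumption of 32-bit wrap-around.

```python
# Sketch: one ARM data-processing instruction can shift its second
# operand (via the barrel shifter) before the ALU operation, e.g.
# ADD r0, r1, r2, LSL #2  ->  r0 = r1 + (r2 << 2)
def add_with_shifted_operand(r1, r2, shift):
    # the barrel shifter is combinational: any shift amount, one pass
    return (r1 + (r2 << shift)) & 0xFFFFFFFF

print(add_with_shifted_operand(100, 3, 2))  # 112  (100 + 3*4)
```

A common use of this pattern is scaling an array index by an element size (here, 4 bytes) for free while computing an address.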
The arrows represent the direction of data flow, the lines represent the buses, and the boxes represent either a storage unit or an operation unit.
At any one time, three different instructions may occupy each of these stages, so the hardware in each stage has to be capable of independent operation.
Contd.,
• When the processor is executing simple data processing instructions, the pipeline enables one instruction to be completed every clock cycle.
• An individual instruction takes three clock cycles to complete, so it has a three-cycle latency, but the throughput is one instruction per cycle.
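The latency-versus-throughput distinction can be sketched with a toy Python model of the three-stage pipeline (illustrative only, not part of the original slides): instruction i completes at cycle i + 3, so every instruction takes three cycles, yet consecutive completions are one cycle apart.

```python
# Toy model of the 3-stage fetch/decode/execute pipeline on linear code.
def completion_cycles(n_instructions, stages=3):
    # instruction i (0-based) completes at cycle i + stages
    return [i + stages for i in range(n_instructions)]

cycles = completion_cycles(4)
print(cycles)                  # [3, 4, 5, 6] -> each has 3-cycle latency
print(cycles[1] - cycles[0])   # 1 -> throughput of one instruction/cycle
```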
5 stage pipeline
• Higher performance ARM cores employ a 5-stage pipeline
and have separate instruction and data memories.
• Breaking instruction execution down into five components
rather than three reduces the maximum work which must
be completed in a clock cycle, and hence allows a higher
clock frequency to be used (provided that other system
components, and particularly the instruction memory, are
also redesigned to operate at this higher clock rate).
Contd.,
• Fetch: the instruction is fetched from memory and placed in the
instruction pipeline.
• Decode: the instruction is decoded and register operands read
from the register file. There are three operand read ports in the
register file, so most ARM instructions can source all their
operands in one cycle.
• Execute: an operand is shifted and the ALU result generated. If
the instruction is a load or store the memory address is
computed in the ALU.
• Buffer/data: data memory is accessed if required. Otherwise
the ALU result is simply buffered for one clock cycle to give the
same pipeline flow for all instructions.
• Write-back: the results generated by the instruction are written
back to the register file, including any data loaded from
memory.
Contd.,
• Then the CPSR is copied into the SPSR of the exception mode that is about to be entered (e.g. SPSR_irq).
• The PC is then loaded with the address of the exception mode's interrupt vector. In the case of IRQ mode this is 0x00000018. (Refer to the table shown earlier.)
Contd.,
• At the same time, the mode is changed to IRQ mode, which causes R13 and R14 to be replaced by the IRQ R13 and R14 registers.
• Once your code has finished processing the exception, it must return to user mode and continue where it left off.
• However, the ARM instruction set does not contain a "return" or "return from interrupt" instruction, so manipulating the PC must be done with regular instructions.
Contd.,
• The situation is further complicated by there being a
number of different return cases. (Makes life further
difficult)
• Let us consider three cases! All are very interesting to
look into!
Case 1
• Consider the SWI instruction.
• When the SWI instruction is executed, the address of the next instruction to be executed is stored in the link register and the exception is processed.
• In order to return from the exception, all that is necessary is to move the contents of the link register into the PC, and processing can continue.
• However, in order to make the CPU switch back to user mode, a modified version of the move instruction is used, called MOVS (more about this later).
• Hence, for a software interrupt the return instruction is:
MOVS R15, R14 ; move the link register into the PC and switch modes
Case 2
• Consider the FIQ and IRQ exceptions. When such an exception occurs, the current instruction being executed is discarded and the exception is entered.
• When the code returns from the exception, the link register contains the address of the discarded instruction plus four.
• In order to resume processing at the correct point, we need to roll back the value in the link register by four.
• In this case we use the subtract instruction to deduct four from the link register and store the result in the PC.
• As with the move instruction, there is a form of the subtract instruction which will also restore the operating mode. For an IRQ, FIQ or prefetch abort, the return instruction is:
SUBS R15, R14, #4
Case 3
• In the case of a data abort, the exception will occur one instruction after execution of the instruction which caused it.
• In this case we will ideally enter the data abort ISR, sort out the problem with the memory, and return to reprocess the instruction that caused the exception. We therefore have to roll back the PC by two instructions: the discarded instruction and the instruction that caused the exception.
• In other words, subtract eight from the link register and store the result in the PC. For a data abort exception the return instruction is:
SUBS R15, R14, #8
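The three return cases above differ only in how far the link register must be rolled back before it is written to the PC. As a summary, here is an illustrative Python sketch (not ARM code; the table values come straight from the three cases above):

```python
# Sketch of the three exception-return adjustments: the return address
# is the link register (R14) minus a fixed offset written into the PC.
RETURN_OFFSET = {
    "SWI": 0,         # MOVS R15, R14      - LR already holds the next instr
    "IRQ": 4,         # SUBS R15, R14, #4  - roll back the discarded instr
    "FIQ": 4,         # SUBS R15, R14, #4
    "data_abort": 8,  # SUBS R15, R14, #8  - re-run the faulting instr too
}

def return_address(lr, exception):
    return lr - RETURN_OFFSET[exception]

print(hex(return_address(0x8004, "IRQ")))  # 0x8000
```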
Condition Field.
Conditional Execution
• To execute an instruction conditionally, simply postfix it with the appropriate condition.
• For example, an add instruction takes the form:
• ADD r0,r1,r2 ; r0 = r1 + r2 (ADDAL)
• To execute this only if the zero flag is set:
• ADDEQ r0,r1,r2 ; if the zero flag is set then…
; ... r0 = r1 + r2
• By default, data processing operations do not affect the condition flags (apart from the comparisons, where this is the only effect). To cause the condition flags to be updated, the S bit of the instruction needs to be set by postfixing the instruction (and any condition code) with an "S".
• For example, to add two numbers and set the condition flags:
• ADDS r0,r1,r2 ; r0 = r1 + r2
; ... and set flags
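The two suffixes are easy to confuse, so here is an illustrative Python sketch (not ARM code) of their semantics: ADDEQ writes its result only when the Z flag is set, while ADDS always writes its result and additionally updates the flags.

```python
# Sketch: conditional execution vs. flag setting (32-bit wrap-around).
def addeq(z_flag, rd, r1, r2):
    # executes only when the zero flag is set; otherwise rd is unchanged
    return (r1 + r2) & 0xFFFFFFFF if z_flag else rd

def adds(r1, r2):
    result = (r1 + r2) & 0xFFFFFFFF
    z = int(result == 0)        # the "S" suffix updates the flags...
    n = (result >> 31) & 1      # ...here just Z and N for brevity
    return result, z, n

print(addeq(True, 0, 2, 3))   # 5 - condition met, result written
print(addeq(False, 0, 2, 3))  # 0 - condition failed, rd untouched
```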
Arithmetic Instructions
• Operations are:
• ADD operand1 + operand2
• ADC operand1 + operand2 + carry
• SUB operand1 - operand2
• SBC operand1 - operand2 + carry - 1
• RSB operand2 - operand1
• RSC operand2 - operand1 + carry - 1
• Syntax:
• <Operation>{<cond>}{S} Rd, Rn, Operand2
• Examples:
• ADD r0, r1, r2
• SUBGT r3, r3, #1
• RSBLES r4, r5, #5
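The carry terms in SBC/RSC are the subtle part of the list above: a clear carry subtracts an extra 1, which is how multi-word subtraction chains its borrows. Here is an illustrative Python sketch (not ARM code) of those semantics, assuming 32-bit wrap-around:

```python
# Sketch of the arithmetic operations listed above (32-bit wrap-around).
MASK = 0xFFFFFFFF

def adc(a, b, carry):  return (a + b + carry) & MASK      # ADC
def sbc(a, b, carry):  return (a - b + carry - 1) & MASK  # SBC
def rsb(a, b):         return (b - a) & MASK              # RSB

print(rsb(1, 10))      # 9  (operand2 - operand1)
print(sbc(10, 3, 1))   # 7  (carry set: plain subtraction)
print(sbc(10, 3, 0))   # 6  (carry clear: an extra borrow of 1)
```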
Comparisons
• The only effect of the comparisons is to UPDATE THE CONDITION FLAGS. Thus there is no need to set the S bit.
• Operations are:
• CMP operand1 - operand2, but result not written
• CMN operand1 + operand2, but result not written
• TST operand1 AND operand2, but result not written
• TEQ operand1 EOR operand2, but result not written
• Syntax:
• <Operation>{<cond>} Rn, Operand2
• Examples:
• CMP r0, r1
• TSTEQ r2, #5
Data Movement
• Operations are:
• MOV operand2
• MVN NOT operand2
Note that these make no use of operand1.
• Syntax:
• <Operation>{<cond>}{S} Rd, Operand2
• Examples:
• MOV r0, r1
• MOVS r2, #10
• MVNEQ r1,#0
Multiplication Instructions
• The basic ARM provides two multiplication instructions.
• Multiply:
• MUL{<cond>}{S} Rd, Rm, Rs ; Rd = Rm * Rs
• Multiply accumulate - does the addition for free:
• MLA{<cond>}{S} Rd, Rm, Rs, Rn ; Rd = (Rm * Rs) + Rn
• Restrictions on use:
• Rd and Rm cannot be the same register.
• This can be avoided by swapping Rm and Rs around, which works because multiplication is commutative.
• The PC cannot be used.
• These restrictions will be picked up by the assembler if overlooked.
• Operands can be considered signed or unsigned.
• It is up to the user to interpret them correctly.
Load Store
• The ARM is a Load / Store Architecture:
• Does not support memory to memory data processing operations.
• Must move data values into registers before using them.
• This might sound inefficient, but in practice isn’t:
• Load data values from memory into registers.
• Process data in registers using a number of data processing
instructions which are not slowed down by memory access.
• Store results from registers out to memory.
• The ARM has three sets of instructions which interact with
main memory. These are:
• Single register data transfer (LDR / STR).
• Block data transfer (LDM/STM).
• Single Data Swap (SWP).
Software Support
• Download Keil from website. Select ARM core.
• It will ask you to register. Register, download and install. It
is easy.
Remember this.
• There must be an ENTRY directive. This tells the location of the first executable instruction.
• AREA = PROGRAM / DATA / OR ANYTHING!
• An END directive is a must, to show that the code ends there.
• ARM can deal directly with 32-bit instructions, as you all know.
• It is possible to have a halfword by use of the DCW directive. To ensure alignment, one should use the ALIGN directive as shown in the examples.
One’s complement
Two’s complement
Greatest of 2 numbers
Main
    LDRB R1, Value   ; load the value to be moved
    STR  R1, Result  ; store it back
    SWI  &11         ; software interrupt instead of the loop option seen earlier

Main
    LDRB R1, Value   ; load the value to be complemented
    MVN  R1, R1      ; note how R1 is both source and destination; MVN is NOT
    SWI  &11         ; software interrupt instead of the loop option seen earlier

(Screenshots of the register contents before and after execution were shown here.)
Contd.,
• The T bit can be set by adding 0x20 to the D3 value; the T bit will then be set and the processor will eventually enter THUMB mode.
Contd.,
• Here comes the challenge!
• The larger a memory is, the slower it gets. (Obvious, right?)
• So it is not possible to design a single memory that is both large and fast. (This is not possible!)
• Here comes the possibility:
• Combine a small, fast memory with a large, slow main memory. (Possible!! Trust me.)
• Now you get the feel of having a "large, fast memory" in hand.
• Let us name these now!
• Small, fast component == cache. (It holds the most frequently accessed instructions. The library table/rack is the analogy.)
• Here come the terms temporal and spatial locality!
Contd.,
• Suppose you were a student writing a term paper on important
historical developments in computer hardware.
• You are sitting at a desk in a library with a collection of books that you
have pulled from the shelves and are examining.
• You find that several of the important computers that
you need to write about are described in the books you
have, but there is nothing about the EDSAC.
• Therefore, you go back to the shelves and look for an additional book.
You find a book on early British computers that covers EDSAC.
• Once you have a good selection of books on the desk in front of you,
there is a good probability that many of the topics you need can be
found in them, and you may spend most of your time just using the
books on the desk without going back to the shelves.
• Having several books on the desk in front of you saves time
compared to having only one book there and constantly having to go
back to the shelves to return it and take out another.
Contd.,
• The same principle allows us to create the illusion of a
large memory that we can access as fast as a very small
memory.
• Just as you did not need to access all the books in the
library at once with equal probability, a program does not
access all of its code or data at once with equal
probability.
• Otherwise, it would be impossible to make most memory
accesses fast and still have large memory in computers,
just as it would be impossible for you to fit all the library
books on your desk and still find what you wanted quickly
Contd.,
• This principle of locality underlies both the way in which you did your work in the library and the way that programs operate.
• The principle of locality states that programs access a relatively small portion of their address space at any instant of time, just as you accessed a very small portion of the library's collection.
• There are two different types of locality:
• Temporal locality: if you recently brought a book to your desk to look at, you will probably need to look at it again soon.
• Spatial locality: you brought out the book on early English computers to find out about EDSAC, and you also noticed that there was another book shelved next to it about early mechanical computers, so you brought that book back too and, later on, found something useful in it. Books on the same topic are shelved together in the library to increase spatial locality.
Contd.,
• Just as accesses to books on the desk naturally exhibit
locality, locality in programs arises from simple and
natural program structures.
• For example, most programs contain loops, so
instructions and data are likely to be accessed repeatedly,
showing high amounts of temporal locality.
• Since instructions are normally accessed sequentially,
programs show high spatial locality.
• Accesses to data also exhibit a natural spatial locality.
For example, accesses to elements of an array or a
record will naturally have high degrees of spatial
locality.
Cache organization
• Since a cache holds a dynamically varying selection of items from main memory, it must have storage for both the data and the address at which the data is stored in main memory. (Remember this!)
• "Cache: a safe place for hiding or storing things," says the dictionary.
• Library example: the desk is also safe! Books remain safe on the table as well!
• We begin by looking at a very simple cache in which the processor requests are each one word and the blocks also consist of a single word.
Contd.,
• The figure shows such a simple cache, before and after requesting a data item that is not initially in the cache.
• Before the request, the cache contains a collection of recent references X1, X2, . . . , Xn-1, and the processor requests a word Xn that is not in the cache.
• This request results in a miss, and the word Xn is brought from memory into the cache.
(Figure caption: this reference causes a miss that forces the cache to fetch Xn from memory and insert it into the cache.)
Contd.,
• There are two questions to answer:
• How do we know if a data item is in the cache?
• Moreover, if it is, how do we find it?
• The answers to these two questions are related. If each word can
go in exactly one place in the cache, then it is straightforward to
find the word if it is in the cache.
• The simplest way to assign a location in the cache for each word in
memory is to assign the cache location based on the address of the
word in memory.
• This cache structure is called direct mapped, since each
memory location is mapped directly to exactly one
location in the cache. The typical mapping between
addresses and cache locations for a direct-mapped cache
is usually simple.
Contd.,
• The tag needs only to contain the upper portion of the address, corresponding to the bits that are not used as an index into the cache.
• For example (see the figure), we need only the upper 2 of the 5 address bits in the tag, since the lower 3-bit index field of the address selects the block.
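The tag/index split described above can be sketched in a few lines of Python (illustrative only, not part of the original slides): with a 5-bit address and an 8-entry direct-mapped cache, the low 3 bits index the cache and the remaining upper 2 bits become the tag.

```python
# Sketch: split an address into (tag, index) for an 8-entry
# direct-mapped cache, i.e. a 3-bit index field.
INDEX_BITS = 3

def split(address):
    index = address & ((1 << INDEX_BITS) - 1)  # low 3 bits pick the block
    tag = address >> INDEX_BITS                # upper bits identify it
    return tag, index

print(split(0b10010))  # (2, 2) -> tag 10two, index 010two
```

This matches the worked example on the next slide, where address 10010two lands in index 010 with tag 10.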
Contd.,
• We also need a way to recognize that a cache block does not hold valid information.
• For instance, when a processor starts up, the cache does not contain good data, and the tag fields will be meaningless.
• Even after executing many instructions, some of the cache entries may still be empty, as in the figure.
• Thus, we need to know that the tag should be ignored for such entries.
• The most common method is to add a valid bit to indicate whether an entry contains a valid address.
• If the bit is not set, there cannot be a match for this block.
The cache is initially empty, with all valid bits (the V entry in the cache) turned off (N). The processor requests the following addresses: 10110two (miss), 11010two (miss), 10110two (hit), 11010two (hit), 10000two (miss), 00011two (miss), 10000two (hit), and 10010two (miss). The figures show the cache contents after each miss in the sequence has been handled. When address 10010two (18) is referenced, the entry for address 11010two (26) must be replaced, and a reference to 11010two will cause a subsequent miss. The tag field contains only the upper portion of the address. The full address of a word contained in cache block i with tag field j for this cache is j × 8 + i, or equivalently the concatenation of the tag field j and the index i. For example, in cache f above, index 010 has tag 10 and corresponds to address 10010.
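The whole reference sequence above can be replayed with a short Python simulator (an illustrative sketch, not part of the original slides): an 8-entry direct-mapped cache with a 3-bit index, where storing a tag for an index stands in for the valid bit.

```python
# Sketch: replay the slide's reference sequence on an 8-entry
# direct-mapped cache (3-bit index; tag = upper address bits).
def simulate(addresses, index_bits=3):
    cache = {}          # index -> tag; presence means "valid bit set"
    results = []
    for addr in addresses:
        index = addr & ((1 << index_bits) - 1)
        tag = addr >> index_bits
        if cache.get(index) == tag:
            results.append("hit")
        else:
            cache[index] = tag   # miss: fetch the block, replacing any old one
            results.append("miss")
    return results

seq = [0b10110, 0b11010, 0b10110, 0b11010, 0b10000, 0b00011, 0b10000, 0b10010]
print(simulate(seq))
# ['miss', 'miss', 'hit', 'hit', 'miss', 'miss', 'hit', 'miss']
```

The final miss is the replacement case: 10010two maps to the same index (010) as the earlier 11010two, so the old entry is evicted.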
Contd.,
• So far, when we place a block in the cache, we have used a
simple placement scheme: A block can go in exactly one place in
the cache.
• It is direct mapping! We have seen it already!
• There is actually a whole range of schemes for placing blocks. At one
extreme is direct mapped, where a block can be placed in exactly one
location.
• At the other extreme is a scheme where a block can be placed in any
location in the cache.
• Such a scheme is called fully associative because a block in
memory may be associated with any entry in the cache.
• To find a given block in a fully associative cache, all the entries in the
cache must be searched because a block can be placed in any one.
• To make the search practical, it is done in parallel with a comparator
associated with each cache entry. These comparators significantly
increase the hardware cost, effectively making fully associative
placement practical only for caches with small numbers of blocks.
Contd.,
• The middle range of designs between direct mapped and
fully associative is called set associative.
• In a set-associative cache, there are a fixed number of
locations (at least two) where each block can be placed;
• A set-associative cache with n locations for a block is
called an n-way set-associative cache.
• An n-way set-associative cache consists of a number of
sets, each of which consists of n blocks. Each block in the
memory maps to a unique set in the cache given by the
index field, and a block can be placed in any element of
that set.
• Thus, a set associative placement combines direct-mapped
placement and fully associative placement: a block is directly
mapped into a set, and then all the blocks in the set are
searched for a match.
Contd.,
• Remember that in a direct-mapped cache, the position of
a memory block is given by
• (Block number) modulo (Number of cache blocks)
• In a set-associative cache, the set containing a memory
block is given by
• (Block number) modulo (Number of sets in the cache)
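The two modulo formulas above can be sketched directly in Python (illustrative only; the block number 12 and cache sizes are example values, not from the slides):

```python
# Sketch of the two placement formulas above, for a cache with 8 blocks.
def direct_mapped_slot(block_number, num_blocks=8):
    # direct mapped: (block number) modulo (number of cache blocks)
    return block_number % num_blocks

def set_index(block_number, num_sets):
    # set associative: (block number) modulo (number of sets)
    return block_number % num_sets

# Memory block 12 in an 8-block cache:
print(direct_mapped_slot(12))   # 4 - exactly one legal location
# The same cache organised 2-way set-associative (4 sets of 2 blocks):
print(set_index(12, 4))         # 0 - either block of set 0 may be used
```

Note how the same 8-block cache gives block 12 one candidate slot when direct mapped, but two candidates (the two blocks of set 0) when 2-way set-associative.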
PERIODICAL 2 PORTION IS COMPLETED.
Shriram K Vasudevan