Coa Module 5
ARM-Thumb differences
• Most instructions execute unconditionally (only branches can be conditional)
• 2-address format for data processing
• Less regular instruction formats.
Thumb exception
• On an exception, the processor switches to ARM state.
• On return, the previous state is restored when the SPSR is copied back to the CPSR.
Thumb Branching
• Short conditional branches
• Medium range unconditional branches
• Long range Subroutine calls
• Branch to change to ARM Mode
Thumb-ARM Decompression
• Translation from 16-bit Thumb instruction to 32-bit ARM instruction
• Condition bits changed to 'always'
• Lookup to translate major and minor opcodes
• Zero extending 3-bit register specifiers to give 4-bit specifiers
• Zero extending immediate values
• The implicit 'S' (affecting condition codes) must be specified explicitly
• Thumb 2-address format must be mapped to ARM 3-address format
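The translation steps listed above can be sketched for one instruction. This is a minimal illustration, not a full decompressor: it handles only the Thumb `ADD Rd, #imm8` encoding and expands it to the ARM `ADDS Rd, Rd, #imm8` encoding. The function name `thumb_add_imm8_to_arm` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: expand Thumb "ADD Rd, #imm8" into the
 * equivalent ARM "ADDS Rd, Rd, #imm8".
 * Returns 0 if the halfword is not that Thumb instruction. */
uint32_t thumb_add_imm8_to_arm(uint16_t thumb)
{
    /* Major/minor opcode lookup: bits 15..11 == 00110 is ADD Rd, #imm8 */
    if ((thumb >> 11) != 0x06)
        return 0;

    uint32_t rd   = (thumb >> 8) & 0x7u;  /* 3-bit specifier, zero-extended to 4 bits */
    uint32_t imm8 = thumb & 0xFFu;        /* immediate, zero-extended */

    uint32_t arm = 0;
    arm |= 0xEu << 28;  /* condition bits = always */
    arm |= 1u   << 25;  /* operand 2 is an immediate */
    arm |= 0x4u << 21;  /* ARM opcode ADD */
    arm |= 1u   << 20;  /* implicit S made explicit */
    arm |= rd   << 16;  /* Rn = Rd: 2-address mapped to 3-address */
    arm |= rd   << 12;  /* Rd */
    arm |= imm8;
    return arm;
}
```

For example, the Thumb halfword 0x3001 (`ADD r0, #1`) expands to the ARM word 0xE2900001 (`ADDS r0, r0, #1`).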
Properties
ARM-Thumb Interworking
Note: Switching between ARM and Thumb States of Execution Using BX Instruction
ARM-Thumb interworking is the name given to the method of linking ARM and Thumb code together for both
assembly and C/C++.
• It handles the transition between the two states.
• To call a Thumb routine from an ARM routine, the core has to change state.
• This state change is shown in the T bit of the cpsr.
• The BX and BLX branch instructions cause a switch between ARM and Thumb state while branching to a
routine.
Syntax:
BX Rm
BLX Rm | label
When BX r0 executes, the least significant bit of register r0 is used to set the T bit of the cpsr. The cpsr
changes from IFt, prior to the execution of the BX, to IFT, after execution. The pc is then set to point to the
start address of the Thumb routine.
A typical code fragment uses both the ARM and Thumb versions of the BX instruction. The branch address into
Thumb has the lowest bit set, which sets the T bit in the cpsr to Thumb state.
The return address is not automatically preserved by the BX instruction. Instead, the code sets the return
address explicitly using a MOV instruction prior to the branch.
Replacing the BX instruction with BLX simplifies the calling of a Thumb routine, since BLX sets the return
address in the link register lr.
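The state switch described above can be modelled in a few lines of C. This is a simplified sketch, not a processor model: the `CoreModel` type and the `model_bx` and `model_blx` functions are hypothetical, and the return address is passed in by the caller rather than derived from the pipeline.

```c
#include <stdint.h>

/* Hypothetical, simplified core state: only what BX/BLX touch. */
typedef struct {
    uint32_t pc;     /* address of the next instruction to fetch */
    uint32_t lr;     /* link register (r14) */
    int      thumb;  /* T bit of the cpsr: 0 = ARM state, 1 = Thumb state */
} CoreModel;

/* BX Rm: bit 0 of Rm selects the state, the remaining bits give the address. */
void model_bx(CoreModel *c, uint32_t rm)
{
    c->thumb = (int)(rm & 1u);
    c->pc    = rm & ~1u;
}

/* BLX Rm: like BX, but the return address is also saved in lr. */
void model_blx(CoreModel *c, uint32_t rm, uint32_t return_addr)
{
    /* From Thumb state, bit 0 of lr is set so that BX lr returns to Thumb. */
    c->lr = c->thumb ? (return_addr | 1u) : return_addr;
    model_bx(c, rm);
}
```

Calling `model_blx` from ARM state with a target of 0x9001 leaves the model in Thumb state at address 0x9000; a later `model_bx` on the saved lr switches back to ARM state.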
Stack Instructions:
The Thumb stack operations are different from the equivalent ARM instructions because they use the more
traditional POP and PUSH concepts.
The interesting point to note is that there is no stack pointer in the instruction. This is because the stack
pointer is fixed as register r13 in Thumb operations and sp is automatically updated. The list of registers is
limited to the low registers r0 to r7.
The PUSH register list can also include the link register lr; similarly, the POP register list can include the pc.
This provides support for subroutine entry and exit. In the example below, the link register lr is pushed onto
the stack with register r1. Upon return, register r1 is popped off the stack and the return address is loaded
into the pc, which returns from the subroutine.
Example :
BL ThumbRoutine
; continue
ThumbRoutine
PUSH {r1, lr} ; enter subroutine
MOV r0, #2
POP {r1, pc} ; return from subroutine
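The entry and exit sequence above can be traced with a small C model of a full descending stack. This is a sketch under simplifying assumptions: the stack pointer is kept as a word index rather than a byte address, there is no bounds checking, and the names `StackModel`, `model_push` and `model_pop` are hypothetical.

```c
#include <stdint.h>

#define STACK_WORDS 16

typedef struct {
    uint32_t r[16];               /* r13 = sp (word index), r14 = lr, r15 = pc */
    uint32_t stack[STACK_WORDS];  /* stack memory, one entry per word */
} StackModel;

/* PUSH {reglist}: bit i of reglist selects register ri.
 * sp moves down first, then the lowest-numbered register is stored lowest. */
void model_push(StackModel *m, uint16_t reglist)
{
    unsigned count = 0;
    for (unsigned i = 0; i < 16; i++)
        if (reglist & (1u << i))
            count++;
    m->r[13] -= count;
    uint32_t slot = m->r[13];
    for (unsigned i = 0; i < 16; i++)
        if (reglist & (1u << i))
            m->stack[slot++] = m->r[i];
}

/* POP {reglist}: load the registers in ascending order, then move sp back up. */
void model_pop(StackModel *m, uint16_t reglist)
{
    uint32_t slot = m->r[13];
    for (unsigned i = 0; i < 16; i++)
        if (reglist & (1u << i))
            m->r[i] = m->stack[slot++];
    m->r[13] = slot;
}
```

Pushing with the mask for {r1, lr} and popping with the mask for {r1, pc} restores r1 and places the saved return address in the pc, which is exactly how the subroutine returns.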
Software Interrupt Instruction:
Similar to the ARM equivalent, the Thumb software interrupt (SWI) instruction causes a software interrupt
exception. If any interrupt or exception flag is raised in the Thumb state, the processor automatically reverts
back to the ARM state to handle the exception.
The Embedded C programming language is used to develop applications that are closest to the hardware, i.e.
applications that communicate directly with the hardware. The major differences between C and Embedded C
are:
C versus EMBEDDED C
• C: C is a high-level programming language, which is used to design any type of desktop-based application. It is also a general-purpose programming language, focused on developing software for general-purpose computers.
  Embedded C: Embedded C is an extension of the C language. It is focused on developing software for embedded systems.
• C: C is a hardware-independent language.
  Embedded C: Embedded C is a fully hardware-dependent language.
• C: C uses a standard compiler to compile and execute the program and generates OS-dependent executable files.
  Embedded C: Embedded C employs specific compilers that generate output for a particular hardware/microcontroller. It generates hardware-dependent files.
• C: Popular compilers for C: GCC (GNU Compiler Collection), Borland Turbo C, and Intel C++.
  Embedded C: Popular compilers for Embedded C: Keil compiler, BiPOM ELECTRONIC, Green Hill Software.
• C: C has a free format of program coding.
  Embedded C: Formatting depends upon the type of microprocessor that is used.
• C: It is specifically used for desktop applications.
  Embedded C: It is used for embedded systems.
• C: GUI; Embedded software; Operating Systems operation.
  Embedded C: DVD/TV; Vehicle Tracking Systems; Digital camera.
• C: It has a normal level of optimization and supports various other programming languages during application. Input can be given to the program whilst it is running.
  Embedded C: It displays a high level of optimization. Only pre-defined input can be given to the running program.
• C: Fixing bugs is very easy in C.
  Embedded C: Fixing bugs is complicated.
Compilation is the process of converting C source code into machine code. As C is a mid-level language, it
needs a compiler to convert it into executable code so that the program can run on the target machine. A C
program goes through the following phases during compilation: preprocessing, compilation, assembly, and linking.
C data type
– char: unsigned byte (ARM C compiler)
ARM processors have 32-bit registers and 32-bit data processing operations, and use a load/store architecture
(no arithmetic or logical operation can act on memory directly). ARM architectures before ARMv4 were not good
at handling signed 8-bit or any 16-bit values, so ARM C compilers define char to be an unsigned 8-bit value
rather than a signed one. (In memory, characters and numbers are all stored as numbers, so what matters is how
the compiler treats a value declared as char.)
Local Variable Types
ARMv4-based processors can load and store 8-, 16- and 32-bit data efficiently, but most ARM data processing
operations are 32-bit only. So it is advisable to use int or long for local variables. Avoid char and short,
even when working with 8- or 16-bit values. The exception is modulo arithmetic that needs 255 + 1 = 0, where
char can be used.
Reason to avoid char as a local variable
Example: consider a function that computes the checksum of a data packet containing 64 words, as below.
int checksum(int *data)
{
char i;
int sum = 0;
for(i=0;i<64;i++)
sum += data[i];
return sum;
}
Declaring i as a char looks efficient, since it seems to occupy less space in a register and on the stack.
This is not correct, because all registers and stack entries are 32 bits wide. For i++, the compiler must
account for the case i = 255, where incrementing wraps around to 0 (255 + 1 = 0). The corresponding compiler
output for this code is given below.
checksum_v1_s
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v1_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
AND r1,r1,#0xff ; i = (char)r1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v1_loop ; if (i<64) loop
MOV pc,r14 ; return sum
If i is declared as unsigned int instead of char, the AND instruction can be removed. The compiler output for
the version in which i is declared as unsigned int is:
checksum_v2_s
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v2_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1++
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v2_loop ; if (i<64) goto loop
MOV pc,r14 ; return sum
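The difference between the two listings comes down to the wraparound of an 8-bit counter. The sketch below shows it in C; the function names are hypothetical, and unsigned char is written out explicitly so that the behaviour matches the ARM compiler's unsigned char on any host.

```c
/* 8-bit counter: the result is narrowed back to 8 bits, so 255 + 1 wraps to 0. */
unsigned char next_index_char(unsigned char i)
{
    return (unsigned char)(i + 1);
}

/* The same narrowing written the way the compiler emits it:
 * ADD r1,r1,#1 followed by AND r1,r1,#0xff. */
unsigned int next_index_masked(unsigned int r1)
{
    return (r1 + 1u) & 0xffu;
}

/* 32-bit counter: fills the whole register, so no masking is needed. */
unsigned int next_index_uint(unsigned int i)
{
    return i + 1u;
}
```

The first two functions return 0 for an input of 255, while the third returns 256; only the third avoids the extra AND.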
Local Variable: as “short”
Suppose the data packet contains 16-bit values and we need a 16-bit checksum. A direct implementation keeps
sum as a short and casts it back to short on every iteration, as below. Keeping sum as an int inside the
function and converting only the final result to short is the more efficient form.
short checksum_v3(short *data)
{
unsigned int i;
short sum=0;
for (i=0; i<64; i++) {
sum = (short)(sum + data[i]);
}
return sum;
}
The loop is now three instructions longer than the previous one, for two reasons:
• The LDRH instruction does not allow a shifted address offset, so the address is first computed with a
separate ADD, and the halfword at that address is then loaded.
• The explicit cast to short requires two MOV instructions. The compiler shifts left by 16 and then right by 16
to implement a 16-bit sign extend.
checksum_v3_s
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v3_loop
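The two-MOV sign extend described above can be sketched in C. This is an illustration, assuming a compiler where converting to int32_t wraps and right shift of a negative value is arithmetic (true for ARM compilers and for GCC and Clang, but implementation-defined in standard C). The name `sign_extend16` is hypothetical.

```c
#include <stdint.h>

/* Mirror of: MOV r,r,LSL #16 then MOV r,r,ASR #16.
 * Bit 15 is moved up to bit 31, then copied back down by the arithmetic shift. */
int32_t sign_extend16(uint32_t r)
{
    return (int32_t)(r << 16) >> 16;
}
```

Whatever is in the upper 16 bits of the input is discarded, which is why the compiler needs this step after every 32-bit addition when sum is a short.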
Post-Increment – *(p++) in C
The *(data++) operation translates to a single ARM instruction that loads the data and increments the pointer.
Accumulating in an int and casting only the final result also avoids the sign extend on every iteration:
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0; /* accumulate in int; cast once at the end */
    for (i = 0; i < 64; i++)
        sum += *(data++);
    return (short)sum;
}
Whatever the merits of different narrow or wide calling protocols, you can see that char or short type
function arguments and return values introduce extra casts.
It’s more efficient to use the int type for function arguments and return value, even if you are only passing
an 8-bit value.
Addition, subtraction and multiplication perform the same whether the operands are signed or unsigned.
Division is different. (A 32-bit int ranges from -2,147,483,648 to 2,147,483,647 inclusive.)
int average_v1(int a, int b)
{
    return (a+b)/2;
}
compiles to
average_v1_s
ADD r0,r0,r1 ; r0 = a+b
ADD r0,r0,r0,LSR #31 ; if (r0<0) r0++ (one more instruction)
MOV r0,r0,ASR #1 ; r0 = r0>>1
MOV pc,r14 ; return r0
The compiler adds one to the sum before shifting right if the sum is negative, because a divide by 2 is not a
right shift when the value is negative. If the data type is unsigned int, the second ADD instruction is not
needed. To see this, try the code in Keil µVision 4.
Negative Division
• In C on an ARM target, a divide by two is not a right shift if x is negative.
– For example, -3>>1 = -2, but -3/2 = -1.
– Division rounds towards zero, but arithmetic right shift rounds towards -∞.
• It’s more efficient to use unsigned types for divisions.
– The compiler converts unsigned power-of-two divisions directly to right shifts.
• Avoid explicit narrowing casts in expressions, because they usually cost extra cycles.
• Avoid char and short types for function arguments and return values.
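The rounding difference and the compiler's correction can be checked with a short C sketch. It assumes a 32-bit int and an arithmetic right shift for negative values (implementation-defined in standard C, but the behaviour on ARM compilers, GCC and Clang). The function names are hypothetical.

```c
/* Signed divide by two using the same two steps as the compiler output:
 * add the sign bit (1 only when x is negative), then shift right. */
int div2_signed(int x)
{
    int adjusted = x + (int)((unsigned int)x >> 31); /* ADD r0,r0,r0,LSR #31 */
    return adjusted >> 1;                            /* MOV r0,r0,ASR #1 */
}

/* Unsigned divide by two: a single logical shift, no correction needed. */
unsigned int div2_unsigned(unsigned int x)
{
    return x >> 1;
}
```

For x = -3, a plain shift gives -2, while `div2_signed` gives -1, matching -3/2.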
C Looping Structures
Loops with a fixed number of iterations
Let’s see how the compiler treats a loop with incrementing count i++
The key point is that the loop counter should count down to zero rather than counting up to some arbitrary
limit.
Why check N == 0?
With a “for” loop, the compiler checks that N is nonzero on entry to the function.
– Often this check is unnecessary, since you know that the array won’t be empty. In this case, a “do-while”
loop gives better performance and code density than a “for” loop.
int checksum_v8(int *data, unsigned int N)
{
    int sum=0;
    do {
The compiler output is
checksum_v8_s
MOV r2,#0 ; sum = 0
checksum_v8_loop
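The two loop forms discussed above can be written out in full as follows. This is a minimal sketch with hypothetical function names, not the original listings: the first counts down to zero over a fixed 64 words, and the second uses a do-while, which assumes the caller never passes n = 0.

```c
/* Fixed-length loop that counts down to zero, so the loop-end test
 * can reuse the flags set by the decrement. */
int checksum_count_down(const int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 64; i != 0; i--)
        sum += *(data++);
    return sum;
}

/* Variable-length loop as a do-while: no entry check on n.
 * Assumption: n is at least 1. */
int checksum_do_while(const int *data, unsigned int n)
{
    int sum = 0;
    do {
        sum += *(data++);
    } while (--n != 0);
    return sum;
}
```

Both functions return the same sum for a 64-word packet; the do-while form simply drops the initial test that a for loop would need.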
Loop Unrolling
Each loop iteration costs two instructions in addition to the body of the loop: a subtract to decrement the
loop count and a conditional branch. These are called the loop overhead. The subtract takes one cycle and the
branch takes three cycles, giving an overhead of four cycles per iteration. We can save some of these cycles by
unrolling the loop, that is, repeating the loop body several times and reducing the number of iterations by the
same proportion. For example:
int checksum_v9(int *data, unsigned int N)
{
    int sum=0;
    {
compiles to
checksum_v9_s
checksum_v9_loop
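A loop unrolled four times might look like the sketch below. The function name is hypothetical, and the code assumes that n is a nonzero multiple of four; a general version would need extra code to handle the leftover elements.

```c
/* Four loads and adds per iteration, one subtract and one branch.
 * Assumption: n is a nonzero multiple of 4. */
int checksum_unrolled4(const int *data, unsigned int n)
{
    int sum = 0;
    do {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        n -= 4;
    } while (n != 0);
    return sum;
}
```

The subtract and branch now execute once per four elements instead of once per element.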
Speedup of Unrolling
Cycles per instruction
– Loop overhead: SUB(1), BRANCH(3)
– Loop body: LOAD(3), ADD(1)
Cycles per element
– Old = 3(load)+1(add)+1(sub)+3(branch) = 8
– New = [(3+1)*4+1+3]/4 = 20/4 = 5
Speedup
– Old/New = 8/5 = 1.6, i.e. nearly double
– ARM9TDMI (faster LOAD) gives a larger speedup.
Q: How many times (K) should I unroll the loop?
– Suppose the loop is very important, for example, 30% of the entire application.
– Suppose you unroll the loop until it is 0.5 KB in code size (128 instructions).
– Then the loop overhead is at most 4 cycles compared to a loop body of around 128 cycles.
– The loop overhead cost is 4/128, roughly 3% of the loop, and about 1% (3% x 30%) of the overall application.
So unrolling the code further gains little extra performance, but has a significant impact on the cache
contents.
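The arithmetic above generalises to any unroll factor K. The sketch below is a simple cost model using the cycle counts quoted in this section (load 3, add 1, subtract 1, branch 3); the function names are hypothetical, and real cycle counts depend on the core and memory system.

```c
/* Cycles per element when the loop body is repeated k times per iteration. */
double cycles_per_element(unsigned int k)
{
    const double load = 3.0, add = 1.0, sub = 1.0, branch = 3.0;
    return (k * (load + add) + sub + branch) / k;
}

/* Share of each iteration spent on the 4-cycle loop overhead,
 * for a loop body of body_cycles cycles. */
double overhead_fraction(unsigned int body_cycles)
{
    return 4.0 / body_cycles;
}
```

With K = 1 the model gives 8 cycles per element and with K = 4 it gives 5; for a 128-cycle body the overhead share is 4/128, about 3%.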
Register Allocation
The compiler attempts to allocate a register to each local variable.
– It tries to use the same register for different local variables if the use of the variables does not overlap.
– When the number of local variables exceeds the number of available registers, the excess variables
are stored on the stack.
Spilling
– Such stacked variables are called spilled, since they are written out to memory.
– Spilled variables are slow to access compared to variables allocated to registers.
– To implement a function efficiently, you need to:
o Minimise the number of spilled variables.
o Ensure that critical variables are stored in registers.
AAPCS Registers
AAPCS is the ARM Architecture Procedure Call Standard. It is a convention which allows high-level languages
to interwork.
• Register / Alias / Usage
– r0-r3 (a1-a4): arguments and return values
– r4-r11 (v1-v8): general variable registers
– r9 (v6, sb): static base
• The function must preserve the callee value of this register except when compiling for “read-write
position independence (RWPI)”.
– r10 (v7, sl): stack limit
• The function must preserve the callee value of this register except when compiling with “stack limit
checking”.
– r11 (fp): frame pointer
• The function must preserve the callee value of this register except when compiling using a “frame
pointer” (only old versions of armcc use fp).
– r12 (ip)
• A general scratch register that the function can corrupt. It is useful as a scratch register for function
veneers or other intra-procedure call requirements.
– r13 (sp), r14 (lr), r15 (pc)
Available Registers
– R0..R12, R14 can all hold variables.
– Must save R4..R11, R14 on the stack if using these registers.
– Compiler can assign 14 variables to registers without spillage.
– But some compilers use a fixed register e.g. R12 as scratch and never keep values in it.
– Complex expressions need intermediate working registers.
Function Calls
Function calls carry overhead.
Four-register rule
– The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3.
– Arguments beyond the fourth are passed on the stack.
Two-word arguments
– Two-word arguments such as “long long” or “double” are passed in a pair of consecutive argument registers
and returned in r0, r1.
– In C++, the first argument to an object method is the “this” pointer.
Use structures
– If there are more than four arguments, group them into structures.
Insert N bytes from the array data into a queue (copy “data” to “Queue”):
char *queue_bytes_v1(
    char *Q_start,   /* Queue buffer start address */
    char *Q_end,     /* Queue buffer end address */
    char *Q_ptr,     /* Current queue pointer position */
    char *data,      /* Data to insert into the queue */
    unsigned int N)  /* Number of bytes to insert */
{
    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end) {
            Q_ptr = Q_start;
        }
    } while (--N);
    return Q_ptr;
}
The compiler output is
queue_bytes_v1_s
STR r14,[r13,#-4]! ; save lr on the stack
LDR r12,[r13,#4] ; r12 = N
queue_v1_loop
LDRB r14,[r3],#1 ; r14 = *(data++)
STRB r14,[r2],#1 ; *(Q_ptr++) = r14
CMP r2,r1 ; if (Q_ptr == Q_end)
MOVEQ r2,r0 ; { Q_ptr = Q_start; }
SUBS r12,r12,#1 ; --N and set flags
BNE queue_v1_loop ; if (N!=0) goto loop
MOV r0,r2 ; r0 = Q_ptr
LDR pc,[r13],#4 ; return r0
Using structure
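One way to stay within the four-register rule is to group the three queue pointers into a structure, so the function takes only three arguments and none of them goes on the stack. The sketch below is an illustration of that idea; the `Queue` type and the name `queue_bytes_v2` are assumptions, not taken from the listing above.

```c
/* Hypothetical structure holding the queue state. */
typedef struct {
    char *Q_start;  /* Queue buffer start address */
    char *Q_end;    /* Queue buffer end address */
    char *Q_ptr;    /* Current queue pointer position */
} Queue;

/* Insert N bytes from data into the queue, wrapping at the buffer end.
 * Assumption: N is at least 1. */
void queue_bytes_v2(Queue *queue, const char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;  /* work on local copies inside the loop */
    char *Q_end = queue->Q_end;

    do {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
            Q_ptr = queue->Q_start;
    } while (--N);

    queue->Q_ptr = Q_ptr;        /* write the updated position back once */
}
```

The caller fills in a Queue once and passes its address, so the updated position is stored back in the structure instead of being returned.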
inline
If the callee is very small, the compiler can inline its code into the caller, which removes the function call
overhead completely.
Summary: Calling Functions Efficiently
• Try to restrict functions to four arguments.
• Define small functions in the same source file and before the caller.
• Critical functions can be inlined using the __inline keyword.
Pointer Aliasing
Two pointers are said to alias when they point to the same address. If you write through one pointer, it
affects the value you read through the other. The compiler often doesn’t know which pointers alias, so it must
assume that any write through a pointer may affect the value read from any other pointer. This can
significantly reduce code efficiency.
The following function increments two timer values by a step amount.
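A minimal sketch of such a function is given here, together with a variant that reads the step once into a local variable. The names `timers_v1` and `timers_v2` are hypothetical. In the first version the compiler has to load *step twice, because the write to *timer1 might have changed it.

```c
/* *step is read twice: the store to *timer1 may alias it. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* *step is read once into a local, which cannot alias any pointer. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int s = *step;
    *timer1 += s;
    *timer2 += s;
}
```

The two versions give the same result when the pointers are distinct. They differ only when timer1 and step point to the same location, which is exactly the case the compiler must allow for in the first version.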
Question Bank:
1. What are the ARM and Thumb instruction sets, and how do they differ in terms of
instruction size and performance?
2. Explain the advantages of using Thumb instructions over ARM instructions in embedded
systems with limited memory.
3. How many general-purpose registers are available in the Thumb instruction set, and what
are their roles?
4. What is ARM-Thumb interworking, and why is it necessary in mixed-mode code execution
environments?
5. Discuss the mechanisms used for switching between ARM and Thumb modes during
interworking.
6. Apart from unconditional branches, what other types of branch instructions are available
in ARM and Thumb instruction sets?
7. Provide examples of conditional branch instructions and explain their usage.
8. Compare and contrast the syntax and functionality of data processing instructions in ARM
and Thumb modes.
9. How do stack instructions differ between ARM and Thumb modes, and what is their role in
managing the stack?
10. Explain the process of pushing and popping data onto and from the stack using stack
instructions.
11. What is a software interrupt, and how is it triggered in ARM and Thumb modes?
12. How are exceptions handled in ARM and Thumb modes, and what is the role of exception
vectors?
13. Describe the process of transitioning from normal execution to exception handling mode.
14. Explain the differences between load and store instructions in ARM and Thumb modes.
15. Discuss the addressing modes supported by load and store instructions and their impact
on memory access efficiency.