Efficient Programming Techniques For ARM

being in progress. Another way to view this is to understand that during E2, the control logic was busy generating signals for transferring data from the latch register to its destination in the register bank.

For obvious reasons, instruction #4 also experiences the effects of this stall.

Pipeline breaks: Another throughput limiting condition arises when a branch instruction enters the pipeline. When the branch is taken, the instructions already behind it in the pipeline (i.e., in the fetch / decode / execute stage) are discarded without getting executed.

Thus, branch instructions impose a significant overhead and disrupt the smooth flow in the pipeline. This is not to say that branches should not be used (for that is impractical), but only to emphasise the fact that knowledge of the pipeline behaviour guides a better program design, which leads to a lesser number of branches.

ARM instruction set

4 E3's primary purpose is to generate control signals for fetching instruction #5. Then on, the pipeline is back to steady state.
5 This categorisation is not from the instruction set design viewpoint but only from that of an assembly programmer.
ARM programming and optimisation techniques 5
As an illustration, a subroutine call can itself be made conditional, avoiding a branch around it:
if (b & c)
foo();
TST r1, r2
BLNE _foo
; TST is similar to ANDS, but
; does not modify any register
Optimisation techniques
Here on, program fragments in C are listed alongside their equivalent ARM assembly
code. Concepts specific to the ARM instruction set and relevant optimisation
techniques are introduced at appropriate places.
Note:
All ARM assemblers by convention use r13 as the stack pointer. The architecture
uses r14 as the link register.
#include <stdio.h>
int main(void)
{
int a[10] = {7, 6, 4, 5, 5, 1, 3, 2, 9, 8};
int i;
int s = 4;

/* linear search for 's' in 'a[]' */
for (i = 0; i < 10; i++)
    if (s == a[i])
        break;

if (i >= 10)
return 1; /* 1 => not found */
else
return 0; /* 0 => found */
}
.text
; Program labels are terminated by ‘:’ for readability
; Stack ‘grows’ downward, caller saves registers.
MOV r0, #0 ;
STR r0, [r13, #8] ; for (i = 0; ...)
loop_start: ; [#2]
; loop entry condition check
LDR r0, [r13, #8] ; load 'i'
CMP r0, #10 ; for (...; i < 10; ...)
BGE loop_end
LDR r1, [r13, #4] ; [#3] load 's' (already initialised to 4)
; get a[i]
MOV r2, #4 ; sizeof(a[i])
MUL r3, r0, r2 ; r3 = i * 4 (serves as an index in a[i])
ADD r3, r3, #12 ; adjust r3 relative to base of a[]
LDR r4, [r13, r3] ; r4 = *(r13 + r3) i.e., a[i]
TEQ r1, r4
BEQ loop_end ; if (s == a[i]) break;
; [#4] next iteration
ADD r0, r0, #1 ; i++
STR r0, [r13, #8] ; store 'i' back
B loop_start
loop_end: ; [#5]
LDR r0, [r13, #8] ; load ‘i’
CMP r0, #10
BGE return_1 ; if (i >= 10)
MOV r0, #0 ; return 0 (found)
B prog_end
return_1:
MOV r0, #1 ; return 1 (not found)
prog_end:
ADD r13, r13, #48 ; pop the function frame off the stack
MOV r15, r14 ; load LR into PC (r15) [causing a return]
Albeit deliberately under-optimised, this assembly listing gives scope for non-trivial
optimisations. But firstly, let us remove some glaring redundancies.
Unnecessary loads/stores:
The variables 'i' and 's' need not be stored on the stack. Registers can be used
instead, to serve their purpose. With this simple change, we save:
o 8-bytes on the stack (see grey-shaded #1)
o 20-bytes (5-instructions) in the program space (see grey-shaded #2-5)
o 3 of the load/store instructions off the loop, which in the worst-case scenario
(of the element being searched for being the last one in the array) saves
them from being executed 9 times (two loads and a store per iteration,
i.e., a minimum of 9 × (5 + 5 + 4) = 126-cycles7)
The compiler is good at tasks such as register allocation. But before we look at
compiler-generated code, let us attempt utilising the knowledge we have acquired of
the ARM instruction set.
Loop invariants:
‘r2’ is initialised to ‘#4’ in the loop, but never modified. So, the initialisation can be
moved out of the loop, say ahead of ‘loop_start’. This, in the worst-case scenario
saves it from being executed 9 times (i.e., a minimum of 27-cycles)
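The same transformation can be sketched in C (the function and names below are illustrative, not from the original listing; an optimising compiler will often hoist such invariants itself, but writing it explicitly mirrors the change made to the assembly):

```c
/* Sketch: 'step' is loop-invariant, so it is computed once ahead of the
 * loop instead of once per iteration - the same change as moving
 * 'MOV r2, #4' ahead of 'loop_start'. */
static int sum_of_offsets(int n)
{
    int step = sizeof(int);   /* invariant: hoisted out of the loop */
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i * step;    /* byte offset of a[i], as in the listing */
    return total;
}
```

With 4-byte ints, sum_of_offsets(10) adds the offsets 0, 4, ..., 36.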
Conditional execution:
Almost every ARM instruction can be executed conditionally by suffixing it with a
condition code ('EQ', 'NE', 'GE' etc.) that is evaluated against the CPSR flags; an
instruction whose condition fails passes through the pipeline without taking effect.
Short forward branches can therefore be replaced by conditionally executed
instructions, as the 'ADDNE' and 'BNE' in the following listing illustrate.
7 Four cycles per store and five per load
.text
; (stack frame set-up and a[] initialisation elided)
MOV r0, #0 ; i = 0 (now kept in a register)
MOV r1, #4 ; s = 4 (now kept in a register)
MOV r2, #4 ; sizeof(a[i]) - the loop invariant, hoisted ahead of the loop
loop_start:
; loop entry condition check
CMP r0, #10 ; for (...; i < 10; ...)
BGE loop_end
; get a[i]
MUL r3, r0, r2 ; r3 = i * 4 (serves as an index in a[i])
ADD r3, r3, #4 ; adjust r3 relative to base of a[]
LDR r4, [r13, r3] ; r4 = *(r13 + r3) i.e., a[i]
TEQ r1, r4 ; if (s == a[i]) break;
ADDNE r0, r0, #1 ; for (...; i++) (if 's' not found)
BNE loop_start ; next iteration (if ‘s’ not found)
loop_end:
CMP r0, #10
MOVGE r0, #1 ; return 1 (not found)
MOVLT r0, #0 ; return 0 (found)
prog_end:
ADD r13, r13, #40 ; pop the function frame off the stack
MOV r15, r14 ; load LR into PC (r15) [causing a return]
This seems to be as good as it can get. Yet, there is one little trick left – eliminating
the multiplication – to be tried out.
Shift to multiply:
We have already seen the second operand of ARM load/store instructions being used
for auto-indexing. It can also be used to specify a register with an optional shift as
follows:

LDR Rd, [Rn, Rm, LSL #n] ; Rd = *(Rn + (Rm << n))

The multiplication in the listing can now be replaced with a simple left shift:
.text
; r0 = i, r1 = s, r2 = base address of a[]
loop_start:
; loop entry condition check
CMP r0, #10 ; for (...; i < 10; ...)
BGE loop_end
; get a[i]
LDR r3, [r2, r0, LSL #2] ; r3 = *(r2 + r0*4) i.e., a[i]
TEQ r1, r3 ; if (s == a[i]) break;
ADDNE r0, r0, #1 ; for (...; i++) (if 's' not found)
BNE loop_start ; next iteration (if ‘s’ not found)
loop_end:
CMP r0, #10
MOVGE r0, #1 ; return 1 (not found)
MOVLT r0, #0 ; return 0 (found)
prog_end:
ADD r13, r13, #40 ; pop the function frame off the stack
MOV r15, r14 ; load LR into PC (r15) [causing a return]
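The equivalence the 'LSL #2' relies on can be spelled out in C (a sketch with an illustrative function name): a left shift by two multiplies an unsigned integer by four, which is exactly what the scaled-register operand encodes in place of the MUL.

```c
/* Illustrative: the byte offset of a[i] computed with a shift.
 * For unsigned i, (i << 2) == i * 4, which is what the
 * 'r0, LSL #2' operand in the load above encodes. */
static unsigned byte_offset(unsigned i)
{
    return i << 2;            /* i * sizeof(int) on 32-bit ARM */
}
```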
This listing compares favourably with the output of an optimising compiler8 (even
after making an allowance for 'a[]' initialisation code). This improvement can
primarily be attributed to our understanding of the application on hand. For
instance, it is difficult to imagine a compiler generating the 'TEQ, ADDNE, BNE'
sequence for this program!
It is very tempting to hand code a program in assembly as the returns are rewarding
enough, especially in the case of small programs. But any non-trivial program should
first be run through a respectable compiler. Further optimisation can then be
attempted on the generated assembly code. Not only does this cut the development
effort by an order of magnitude but also greatly reduces the chances of defects
creeping in due to oversight and the weariness that sets in on the programmer (on
the second sleepless night, when the coffee maker runs out of refills). For, modern
compilers incorporate many advanced optimisation techniques and are good at
applying them tirelessly, over and over, to large chunks of code.
To explore ARM optimisation further, let us now move on to 'block copy' - an example
no ARM programming tutorial can do without:
An optimised bcopy:
The routine in C:

void bcopy(char *to, char *from, int nbytes)
{
    while (nbytes--)
        *to++ = *from++;
}

Translation:
_bcopy:
CMP r2, #0 ; while (nbytes != 0)
BEQ bcopy_end
bcopy_start:
SUB r2, r2, #1 ; nbytes--
; *to++ = *from++
LDRB r3, [r1], #1 ; LDRB/STRB loads/stores a byte
STRB r3, [r0], #1 ; auto indexing for post increment (++)
B _bcopy ; back to the condition check
bcopy_end:
MOV r15, r14 ; PC = LR i.e., return
8 I make this claim after verifying the 'release' mode code generated by two popular compilers for ARM
There seems to be hardly any scope for optimisation at the outset. Yet, in a task that
involves a condition check, we hardly seem to be using any conditional execution.
This gives us a clue for optimisation.
Optimisation-1:

_bcopy:
SUBS r2, r2, #1 ; nbytes-- (also sets the condition flags)
LDRPLB r3, [r1], #1 ; executed only while the count stays non-negative
STRPLB r3, [r0], #1
BPL _bcopy ; next byte, if any remain
MOV r15, r14 ; PC = LR i.e., return
Now let us move our focus from size to performance, as a 30% reduction does not
really mean much when the original footprint is only 28-bytes.
As obvious from the listing, bcopy spends its entire lifetime in a loop. The branch
instruction contributes 25% to the size of the loop. More importantly, it takes up
30% of the execution time (5-cycles out of every 17-cycles). This overhead is
unacceptable to any non-trivial and performance sensitive application. This
understanding drives our further attempts at optimisation.
Optimisation-2:

_bcopy:
SUBS r2, r2, #4 ; nbytes -= 4 (also sets the flags)
LDRPLB r3, [r1], #1 ; four byte-copies per iteration
STRPLB r3, [r0], #1 ; (the loop 'unrolled' 4 times)
LDRPLB r3, [r1], #1
STRPLB r3, [r0], #1
LDRPLB r3, [r1], #1
STRPLB r3, [r0], #1
LDRPLB r3, [r1], #1
STRPLB r3, [r0], #1
BPL _bcopy
MOV r15, r14 ; PC = LR i.e., return
9 Yes. With O2 optimisation for space in 'release' mode!
By adding 6 more instructions, we have been able to reduce the share of ‘BPL’ from
30% to 14% (5 out of 44-cycles). Yet, this gain is highly deceptive. For we could as
well have used a load/store-word combination in place of the four load/store-byte
instructions, thereby increasing the effective throughput of the original loop without
incurring a size/cycle penalty. That way we only need 17-cycles to transfer four
bytes (in spite of ‘BPL’ usurping 30% of the cycles)!
The de-optimisation seen above is due to a blind application of the 'loop unrolling'
technique10. And such cases are not unique to this technique alone. Each technique
needs to be tailored to the task on hand.
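The tailored version can be sketched in C (the function name and the simplifying assumptions are this sketch's, not the original's): move one aligned 32-bit word per iteration instead of unrolling four byte copies.

```c
#include <stdint.h>

/* Sketch of the tailored transformation: one aligned 32-bit word per
 * iteration instead of four unrolled byte copies. Assumes 'to' and 'from'
 * are word-aligned and 'nbytes' is a multiple of 4 (a real routine would
 * handle the stragglers separately). */
static void bcopy_words(void *to, const void *from, unsigned nbytes)
{
    uint32_t *d = (uint32_t *)to;
    const uint32_t *s = (const uint32_t *)from;
    while (nbytes >= 4) {
        *d++ = *s++;          /* one LDR/STR pair moves 4 bytes */
        nbytes -= 4;
    }
}
```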
Optimisation-3:

_bcopy:
SUBS r2, r2, #16 ; nbytes -= 16
LDRPL r3, [r1], #4 ; four word-copies per iteration
STRPL r3, [r0], #4 ; (4 bytes per load/store pair)
LDRPL r3, [r1], #4
STRPL r3, [r0], #4
LDRPL r3, [r1], #4
STRPL r3, [r0], #4
LDRPL r3, [r1], #4
STRPL r3, [r0], #4
BPL _bcopy
MOV r15, r14 ; PC = LR i.e., return
With this, the throughput has increased to 16 bytes per 44-cycle iteration (a gain of
600% as compared to the original 1 byte per 17-cycle iteration), with 'BPL' taking
14% of the execution time.
10 An old saying goes 'When you have a hammer in the hand...'
Consider a subroutine which needs to call other subroutines and also uses all the
available general-purpose registers for local use. Assuming a ‘callee-saves’ protocol,
a stack that grows from low to high address and a stack pointer that always points
to the next free word available, the subroutine’s entry and exit code looks similar to
this:
_foo:
STR r4, [r13], #4 ; save r4, advance the stack pointer
STR r5, [r13], #4 ; save r5
; ... and so on for r6 through r12 ...
STR r14, [r13], #4 ; save the link register
;
; body of _foo
;
LDR r14, [r13, #-4]! ; restore in the reverse order
; ... and so on for r12 down to r5 ...
LDR r4, [r13, #-4]!
MOV r15, r14 ; return
To a non-ARM RISC programmer, this listing is familiar and normal. For, each and
every instruction is very much relevant and essential to the task on hand. Only, an
ARM programmer would simply have written this equivalent code:
_foo:
STMEA r13!, {r4-r12, r14} ; save r4-r12 and the link register
;
; body of _foo
;
LDMEA r13!, {r4-r12, r15} ; restore r4-r12 and return in one go
The two new instructions seen above (LDM & STM) are an explicit acknowledgement
from the ARM architecture of the frequency and importance of such multiple-register
load/store activity. These instructions can be used in multiple ways, such as:
STMFD & LDMFD: By replacing the suffix 'EA' with 'FD' you get a 'full descending'
stack, which is exactly opposite in behaviour to an 'EA' stack. For stores, 'FD' is
semantically equivalent to 'DB', which stands for 'decrement before' (for loads, it
maps to 'IA', 'increment after'). E.g., STMDB r13!, {r4-r12, r14} is the same as
STMFD r13!, {r4-r12, r14}.
Obviously, these instructions cannot take the same number of cycles as a single LDR
or STR. The real gains are in program space, reduced chances of making coding
mistakes and enhanced readability.
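A toy model in C may help fix the semantics (purely illustrative; the names stmfd/ldmfd below are this sketch's, not a real API): on a full descending stack the pointer is decremented before each store, and the lowest-numbered register in the list always lands at the lowest address.

```c
/* Toy model of STMFD/LDMFD on a 'full descending' stack (illustrative).
 * 'sp' points at the last word pushed; regs[0] plays the role of the
 * lowest-numbered register in the list. */
static void stmfd(unsigned **sp, const unsigned *regs, int n)
{
    for (int i = n - 1; i >= 0; i--)
        *--(*sp) = regs[i];   /* decrement before each store */
}

static void ldmfd(unsigned **sp, unsigned *regs, int n)
{
    for (int i = 0; i < n; i++)
        regs[i] = *(*sp)++;   /* load, then increment */
}
```

After a push, regs[0] sits at the lowest address, matching the ARM convention; the pop restores in the opposite order and leaves the pointer where it started.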
Optimisation-4:
As you might have guessed by now, the time has come to rejoin the main road and
revisit the bcopy example. The previous throughput of 16bytes per iteration can now
be achieved on a smaller footprint by replacing the four pairs of LDRPL/STRPL with a
single LDMPL/STMPL combination such as:
LDMPL r1!, {r3-r6} ; load r3-r6 from [r1], advance r1 by 16 bytes
STMPL r0!, {r3-r6} ; store r3-r6 starting at [r0], advance r0
The blueprint for a high throughput (close to 40-bytes per iteration) bcopy is as
follows:

_bcopy:
SUBS r2, r2, #40 ; nbytes -= 40
LDMPL r1!, {r3-r12} ; load 10 words (40 bytes) from [r1]
STMPL r0!, {r3-r12} ; store them at [r0]
BPL _bcopy
MOV r15, r14 ; PC = LR i.e., return
By saving r13 and r14 on the stack before the start of the loop, the throughput can
be pushed closer to the maximum possible 48bytes per iteration. This is an example
of how two of ARM’s best features - conditional execution and multiple load-stores
put together11 improve the program characteristics by an order of magnitude.
A brief12 introduction to the Thumb

The Thumb is a 16-bit instruction set architecture that is functionally complete but
relatively restricted in variety as compared to the regular 32-bit ARM instruction
set. Notable differences include a 2-address format, unconditional updating of the
CPSR flags by (almost) all instructions, and a less flexible 'second operand'. The
Thumb architecture is cleanly implemented in silicon by way of including an on-the-
fly instruction de-compressor functional unit in the processor that translates 16-bit
ARM instructions into their 32-bit equivalents that are understood by the rest of the
processor. It must be noted though that it is only the instruction length that is
halved and not the register sizes themselves. As a side effect, the number of
normally visible registers is reduced by five14.
The programmer is not constrained to use a single instruction set throughout his
program. He is free to intermix 32-bit and 16-bit code, switching between them
with the BX (branch and exchange) instruction. Whenever the
Thumb is found to be inadequate and restrictive for a particular functionality /
module / computation, the ARM instruction set can be used as a special case (or the
other way around, if at certain places the power of 32-bit seems an overkill15). It is
this flexibility which makes the ARM processor a very attractive option for an
embedded systems designer / programmer.
Conclusion
This paper made an attempt at introducing the ARM architecture to an embedded
systems designer / programmer by providing an overview of its functional units and
instruction set. ARM assembly optimisation techniques were introduced along with a
couple of examples in a graded, exercise-like fashion. This being targeted at those
who are new to the architecture, most fine-grained details were left out of this
paper for fear of losing reader interest. However, the techniques presented herein
should
11 Chess enthusiasts can liken this to an active Queen-Knight combination
12 Brief, not because its instruction length is only half as long as that of its more powerful 32-bit cousin, but because it requires a completely dedicated tutorial to do it justice
13 Even in performance critical cases such as an RTOS, the emphasis is usually on predictability and bounded response times rather than on searing speed
14 r0-r7, known as 'low' registers, are always visible, while the 'high' registers r8-r12 are visible only in certain instructions in restricted formats
15 If the reader wondered why there was a switch between a 2-column and single column mode, the answer should now be evident ;^)
be sufficient to venture into serious software design and programming with
the ARM processor(s). The references provided towards the end of this paper can be
used to further hone ARM assembly programming skills.
References
• Dave Jagger (editor), ARM Architecture Reference Manual, Prentice Hall
• Steve B. Furber, ARM System Architecture, Addison-Wesley, 1996, ISBN 0-201-
40352-8
• http://www.arm.com/ - data sheets, instruction set summaries, white papers,
application notes and architecture references for the ARM family of processors
• Rupesh W. Kumbhare, Optimization Techniques for DSP based Applications – A
highly detailed paper on optimisation techniques for the TMS320C54x family of
DSPs. It describes with code-samples, concepts that are applicable to non-DSP
programming as well. Interested readers can reach the author
([email protected]) for a copy of the same