DAI0034A Efficient C
ENGLAND
Advanced RISC Machines Limited
Fulbourn Road
Cherry Hinton
Cambridge CB1 4JN
UK
Telephone: +44 1223 400400
Facsimile: +44 1223 400410
Email: [email protected]

GERMANY
Advanced RISC Machines Limited
Otto-Hahn Str. 13b
85521 Ottobrunn-Riemerling
Munich
Germany
Telephone: +49 89 608 75545
Facsimile: +49 89 608 75599
Email: [email protected]

JAPAN
Advanced RISC Machines K.K.
KSP West Bldg, 3F 300D, 3-2-1 Sakado
Takatsu-ku, Kawasaki-shi
Kanagawa
213 Japan
Telephone: +81 44 850 1301
Facsimile: +81 44 850 1308
Email: [email protected]

USA
ARM USA Incorporated
Suite 5
985 University Avenue
Los Gatos
CA 95030 USA
Telephone: +1 408 399 5199
Facsimile: +1 408 399 8854
Email: [email protected]
Open Access
Proprietary Notice
ARM and the ARM Powered logo are trademarks of Advanced RISC Machines Ltd.
Neither the whole nor any part of the information contained in, or the product described in, this document may be adapted or reproduced in
any material form except with the prior written permission of the copyright holder.
The product described in this document is subject to continuous developments and improvements. All particulars of the product and its use
contained in this document are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied
warranties of merchantability, or fitness for purpose, are excluded.
This document is intended only to assist the reader in the use of the product. ARM Ltd shall not be liable for any loss or damage arising from
the use of any information in this document, or any error or omission in such information, or any incorrect use of the product.
Key
Document Number
This document has a number which identifies it uniquely. The number is displayed on the front page and at the foot of each subsequent page.
Document Status
The document’s status is displayed in a banner at the bottom of each page. This describes the document’s confidentiality and its information
status.
Change Log
Issue  Date          By   Change
A      January 1998  SKW  Released
Application Note 34
ii ARM DAI 0034A
Open Access
Table of Contents
1 Introduction
2 Setting Compiler Options
2.1 Selecting processor/architecture
2.2 Debugging options
2.3 Optimization options
2.4 APCS options
3 Division and Remainder
3.1 Combining division and remainder
3.2 Division and remainder by powers of two
3.3 Alternatives to remainder for modulo arithmetic
3.4 Division by a constant
3.5 Using lookup tables
4 Conditional Execution
5 Boolean Expressions
5.1 Range checking
5.2 Compares with zero
6 Loops
6.1 Loop termination
6.2 Loop unrolling
7 Switch Statement
7.1 Switch statement vs. lookup tables
8 Register Allocation
8.1 Register allocatable variables
8.2 Aliasing
8.3 Live variables and spilling
9 Variable Types
9.1 Local variables
9.2 Use of shorts/signed bytes on ARM
9.3 Space occupied by global data
10 Function Design
10.1 Function call overhead
10.2 Leaf functions
10.3 Tail continued functions
10.4 Pure functions
10.5 Inline functions
10.6 Function definitions
11 Using Lookup Tables
12 Floating-Point Arithmetic
13 Cross Jump Optimization
14 Portability of C Code
15 Further Information
1 Introduction
The ARM and Thumb C compilers are mature, industrial-strength ANSI C compilers which
are capable of producing high quality machine code. However, when writing source code,
it is always worthwhile to use programming techniques which work well on RISC
processors such as ARM. This Application Note describes some of the techniques that
can be useful. It also explains to some extent how the ARM compiler works, and how to
use the C language efficiently. These techniques and knowledge will enable programmers
to increase execution speed and/or reduce code size.
Note Most of the techniques discussed in this Application Note are equally applicable to both
armcc and tcc. If a technique is only applicable to ARM or Thumb, this is highlighted. In
principle, many of the described techniques are also applicable to other languages and
other compilers.
Note The code examples given in this Application Note have been compiled and disassembled
using tools supplied with ARM Software Development Toolkit version 2.11. If you are using
another version of the toolkit, your output may differ slightly, though the principles
highlighted should still hold, and should continue to hold in future releases.
Setting Compiler Options
Division and Remainder
= C0 + C1 * (log2(numerator) − log2(denominator)).
3.2 Division and remainder by powers of two
If the divisor in a division operation is a power of two, the compiler uses a shift to perform
the division. Therefore you should always arrange, where possible, for scaling factors to
be powers of two (for example, 128 rather than 100).
This can be seen by examining the following piece of code:
typedef unsigned int uint;
div16s
CMP a1,#0
ADDLT a1,a1,#&f
MOV a1,a1,ASR #4
MOV pc,lr
Notice that while both divisions avoid calling the division function, unsigned division takes
fewer instructions than signed. In many cases the shift instruction can be combined with
following instructions. Signed division needs additional instructions because it rounds
towards zero, while a shift rounds towards minus infinity.
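The C source for this listing is not reproduced above. A minimal sketch of the kind of functions being compared might look like the following; the name div16u for the unsigned variant is an assumption, as only the div16s label appears in the listing.

```c
typedef unsigned int uint;

/* Unsigned divide by 16: a single logical shift right is sufficient. */
uint div16u(uint a)
{
    return a / 16;
}

/* Signed divide by 16: C division rounds towards zero, so negative
   values need a correction before the arithmetic shift. */
int div16s(int a)
{
    return a / 16;
}
```

For example, div16s(-17) yields -1, whereas an arithmetic shift of -17 by four places on its own would yield -2. This is why the listing adds 15 to negative values before shifting.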
The following code is produced:
counter1
STMDB sp!,{lr}
ADD a2,a1,#1
MOV a1,#&3c
BL __rt_udiv
MOV a1,a2
LDMIA sp!,{pc}
counter2
ADD a1,a1,#1
CMP a1,#&3c
MOVCS a1,#0
MOV pc,lr
From this it is clear that the use of the if statement, rather than the remainder operator, is
preferable, as it produces much faster code. Note that the new version only works if it is
known that the range of count on input is 0−59.
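The source for counter1 and counter2 is not reproduced above. The sketch below is a plausible reconstruction from the listing, not necessarily the original text: both functions advance a modulo-60 counter, the first with the remainder operator and the second with an if statement.

```c
typedef unsigned int uint;

/* Remainder version: requires a call to the division routine. */
uint counter1(uint count)
{
    return (count + 1) % 60;
}

/* Conditional version: valid only while count stays in the range 0 to 59. */
uint counter2(uint count)
{
    if (++count >= 60)
        count = 0;
    return count;
}
```

Both functions return the same value for every input from 0 to 59. For larger inputs they differ, which is the restriction noted above.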
4 Conditional Execution
Note This section is applicable to armcc only.
Note Conditional execution is disabled for all debugging options.
All ARM instructions are conditional. Each instruction contains a 4-bit field which is a
condition code; the instruction is only executed if the ARM flag bits indicate that the
specified condition is true. Typically a conditionally executing code sequence starts with a
compare instruction setting the flags, followed by a few conditionally executed instructions.
For example:
CMP x, #0
MOVGE y, #1
MOVLT y, #0
This saves two branch instructions and on average 2.5 ARM7 cycles.
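In C terms, a sequence like the one above could come from a simple if/else assignment. The function name is_non_negative is hypothetical, chosen only for illustration.

```c
/* Returns 1 when x >= 0 and 0 otherwise. Each arm is a single
   assignment, so each can become one conditional instruction. */
int is_non_negative(int x)
{
    int y;

    if (x >= 0)
        y = 1;
    else
        y = 0;
    return y;
}
```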
Conditional execution reduces the number of branch instructions, and therefore improves
codesize and performance. However, when more than about four instructions are
conditional, performance could be worse in some cases (since branches take three cycles
or less on ARMs). The compiler therefore limits the number of conditionally executed
instructions. In SDT2.11 this limit is three instructions. In future compilers the limit will
depend on whether -Otime or -Ospace is used.
Conditional execution is applied mostly in the body of if statements, but it is also used
while evaluating complex expressions with relational (<, ==, > and so on) or boolean
operators (&&, !, and so on). Conditional execution is disabled for code sequences which
contain function calls, as on function return the flags are destroyed.
It is therefore beneficial to keep the bodies of if and else statements as simple as
possible, so that they can be conditionalized. Relational expressions should be grouped
into blocks of similar conditions.
The following example shows how the compiler uses conditional execution:
int g(int a, int b, int c, int d)
{ if (a > 0 && b > 0 && c < 0 && d < 0) /* grouped conditions */
return a + b + c + d;
return -1;
}
g
CMP a1,#0
CMPGT a2,#0
BLE |L000024.J4.g|
CMP a3,#0
CMPLT a4,#0
ADDLT a1,a1,a2
ADDLT a1,a1,a3
ADDLT a1,a1,a4
MOVLT pc,lr
|L000024.J4.g|
MVN a1,#0
MOV pc,lr
Because the conditions were grouped, the compiler was able to conditionalize them.
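As a sketch of what grouping means in practice, the two hypothetical functions below compute the same result. Whether a particular compiler release generates different code for them is an assumption you should check by disassembling the output. The second form keeps the two greater-than tests and the two less-than tests adjacent, as in g above.

```c
/* Tests interleaved: >, <, >, < */
int g_mixed(int a, int b, int c, int d)
{
    if (a > 0 && c < 0 && b > 0 && d < 0)
        return a + b + c + d;
    return -1;
}

/* Tests grouped by condition: >, >, <, < */
int g_grouped(int a, int b, int c, int d)
{
    if (a > 0 && b > 0 && c < 0 && d < 0)
        return a + b + c + d;
    return -1;
}
```

Because none of the tests has side-effects, reordering them does not change the result, so the grouped form can be used freely.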
5 Boolean Expressions
There is a faster way to implement this: (x >= min && x < max) can be transformed
into (unsigned)(x-min) < (max-min). This is especially beneficial if min is zero.
The same example after this optimization:
bool PointInRect2(Point p, Rectangle *r)
{ return ((unsigned) (p.x - r->xmin) < r->xmax &&
(unsigned) (p.y - r->ymin) < r->ymax);
}
PointInRect2
LDR a4,[a3,#0]
SUB a1,a1,a4
LDR a4,[a3,#4]
CMP a1,a4
LDRCC a1,[a3,#8]
SUBCC a1,a2,a1
LDRCC a2,[a3,#&c]!
CMPCC a1,a2
MOVCS a1,#0
MOVCC a1,#1
MOV pc,lr
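The transformation can also be written as a small stand-alone helper. The sketch below is illustrative only: in_range is a hypothetical name, and it assumes min <= max. Note that PointInRect2 compares directly against r->xmax and r->ymax, which matches the stated transformation only if those fields hold the extent of the rectangle or if xmin and ymin are zero. The Rectangle definition is not shown here, so that is an assumption.

```c
/* Returns 1 when min <= x < max, assuming min <= max.
   The subtraction is done on unsigned values, so values of x below
   min wrap round to large numbers and fail the single compare. */
int in_range(int x, int min, int max)
{
    return ((unsigned)x - (unsigned)min) < ((unsigned)max - (unsigned)min);
}
```

When min is zero both subtractions disappear, leaving a single unsigned compare of x against max.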
5.2 Compares with zero
The ARM flags are set after a compare (CMP) instruction. The flags can also be set by
other operations, such as MOV, ADD, AND, MUL, which are the basic arithmetic and logical
instructions (the data processing instructions). If a data processing instruction sets the flags,
the N and Z flags are set the same way as if the result was compared with zero. The N
flag indicates whether the result is negative, the Z flag indicates that the result is zero. For
example, the pair:
ADD R0, R0, R1
CMP R0, #0
leaves N and Z in the same state as the single flag-setting instruction ADDS R0, R0, R1.
The N and Z flags on the ARM correspond to the signed relational operators x < 0,
x >= 0, x == 0, x != 0, and unsigned x == 0, x != 0 (or x > 0) in C.
Each time a relational operator is used in C, the compiler emits a compare instruction. If
the operator is one of the above, the compiler can remove the compare if a data
processing operation preceded the compare. For example:
int g(int x, int y)
{ if (x + y < 0)
return 1;
else
return 0;
}
g
ADDS a1,a1,a2
MOVPL a1,#0
MOVMI a1,#1
MOV pc,lr
If possible, arrange for critical routines to test the above conditions (see 6.1 Loop
termination). This often allows you to save compares in critical loops, leading
to reduced code size and increased performance.
The C language has no concept of a carry flag or overflow flag so it is not possible to test
the C or V flag bits directly without using inline assembler. However, the compiler supports
the carry flag (unsigned overflow). For example:
int sum(int x, int y)
{ int res;
res = x + y;
if ((unsigned) res < (unsigned) x) /* carry set? */
res++;
return res;
}
sum
ADDS a2,a1,a2
ADC a2,a2,#0
MOV a1,a2
MOV pc,lr
6 Loops
Loops are a common construct in most programs; a significant amount of the execution
time is often spent in loops. It is therefore worthwhile to pay attention to time-critical loops.
fact2
MOVS a2,a1
MOV a1,#1
MOVEQ pc,lr
|L000034.J4.fact2|
MUL a1,a2,a1
SUBS a2,a2,#1
BNE |L000034.J4.fact2|
MOV pc,lr
You can see that the slight recoding of fact1 required to produce fact2 has caused the
original ADD/CMP instruction pair to be replaced by a single SUBS instruction. This is because
a compare with zero could be optimized away, as described in 5.2 Compares with zero.
In addition to saving an instruction in the loop, the variable n does not need to be saved
across the loop, so a register is also saved. This eases register allocation, and leads to
more efficient code elsewhere in the function (two more instructions saved).
This technique of initializing the loop counter to the number of iterations required, and then
decrementing down to zero, also applies to while and do statements.
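The source for fact1 and fact2 is not reproduced above. The sketch below is a plausible reconstruction consistent with the fact2 listing, offered as an illustration rather than as the original text. It assumes n is not negative.

```c
/* Counts up: the loop needs i, n and a compare of i against n. */
int fact1(int n)
{
    int i, fact = 1;

    for (i = 1; i <= n; i++)
        fact *= i;
    return fact;
}

/* Counts down to zero: the decrement sets the flags, so no separate
   compare is needed and n is not live inside the loop. */
int fact2(int n)
{
    int i, fact = 1;

    for (i = n; i != 0; i--)
        fact *= i;
    return fact;
}
```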
int countbit2(uint n)
{ int bits = 0;
while (n != 0)
{
if (n & 1) bits++;
if (n & 2) bits++;
if (n & 4) bits++;
if (n & 8) bits++;
n >>= 4;
}
return bits;
}
On the ARM7, checking a single bit takes six cycles when using the first version. The code
size is only nine instructions. The unrolled version checks four bits at a time, taking on
average only three cycles per bit. The cost is larger codesize: 15 instructions.
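The first version referred to here is not reproduced above. A sketch of what a one-bit-per-iteration loop could look like is given below; the name countbit1 is assumed by analogy with countbit2.

```c
typedef unsigned int uint;

/* Tests one bit per iteration, so the loop overhead is paid for every bit. */
int countbit1(uint n)
{
    int bits = 0;

    while (n != 0) {
        if (n & 1)
            bits++;
        n >>= 1;
    }
    return bits;
}
```

Both versions stop as soon as n becomes zero, so small values finish early in either case.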
7 Switch Statement
A switch statement is translated by the ARM compiler as follows:
• If the switch is dense the compiler uses a table lookup to jump to the code of the
selected case label. A switch is dense if case labels comprise more than half the range
spanned by the labels with the minimum and maximum values.
For armcc the table is a branch-table with one word per entry, while tcc uses an
offset table using only 8 or 16 bits per entry. tcc uses the 8-bit table when the
number of case labels is less than 32. However, when the code in the switch
statement is large, extra branches are needed to jump to the case labels.
• If the case labels are not dense, the compiler splits the case labels, and applies
the same rules on each part recursively until all case labels have been processed.
In order to improve the code size of switch statements, they should be as dense
as possible, and for tcc both the code and the number of case labels should be
kept small.
char * ConditionStr2(int condition)
{
if ((unsigned) condition >= 15) return 0;
return
"EQ\0NE\0CS\0CC\0MI\0PL\0VS\0VC\0HI\0LS\0GE\0LT\0GT\0LE\0\0" +
3 * condition;
}
The first routine needs a total of 240 bytes, the second only 72 bytes.
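The first routine is not reproduced above. A switch-based equivalent might look like the sketch below. The name ConditionStr1 and the handling of code 14 (an empty string, matching what ConditionStr2 returns for that index) are assumptions.

```c
/* Switch-based version: one case label and one string per condition code. */
const char *ConditionStr1(int condition)
{
    switch (condition) {
    case 0:  return "EQ";
    case 1:  return "NE";
    case 2:  return "CS";
    case 3:  return "CC";
    case 4:  return "MI";
    case 5:  return "PL";
    case 6:  return "VS";
    case 7:  return "VC";
    case 8:  return "HI";
    case 9:  return "LS";
    case 10: return "GE";
    case 11: return "LT";
    case 12: return "GT";
    case 13: return "LE";
    case 14: return "";      /* assumed: "always" has no suffix */
    default: return 0;
    }
}
```

The table version works because every name occupies exactly three bytes including its terminator, so the index only needs to be multiplied by three.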
8 Register Allocation
Note Register allocation is less efficient when the -gr or -g options are used. This is to ensure
that variables are always displayed correctly in the debugger.
The most important optimization supported by the ARM compilers is called register
allocation. This is a process where the compiler allocates variables to ARM registers,
rather than to memory. This has the advantage that those variables can be accessed
quickly whenever needed, without needing instructions to transfer them from/to memory.
As a result of register allocation, most variables are kept in registers, resulting in dramatic
improvement in codesize and performance. You can write code which enables the
compiler to achieve a more optimal register allocation.
8.2 Aliasing
Pointers are a powerful part of the C language. However, they must be used carefully or
poor code may result. If the address of a variable is taken, the compiler must assume that
the variable can be changed by any assignment through a pointer or by any function call,
making it impossible to put it into a register. This is also true for global variables, as they
might have their address taken in some other function. This problem is known as pointer
aliasing, because the pointer is known as an alias of the variable it points to.
Note Some C compilers offer an “ignore pointer aliasing” option, which tells the compiler to
assume that no other function accesses local variables which have their address taken. If that
assumption is wrong, the result is bugs which are difficult to trace. ARM does not offer this
option because it conflicts with the ANSI/ISO standard for C.
The negative effects which pointer aliasing has on performance can be reduced by using
the following techniques:
• Avoid taking the address of local variables.
• Avoid global variables.
• Avoid pointer chains.
8.2.1 Local variables
It is often necessary to take the address of variables, for example if they are passed as a
reference parameter to a function. This means that those variables cannot be allocated to
registers. A solution is to make a copy of the variable, and pass the address of that copy.
In the following example, test1 shows the conventional way of taking the address of the
local variable, resulting in inefficient code. test2 uses a dummy variable whose address
is taken. The value is then copied to a local variable i (whose address is not taken). This
allows the variable i to be allocated to a register, which reduces memory traffic.
void f(int *a);
int g(int a);
int test1(int i)
{ f(&i);
/* now use ’i’ extensively */
i += g(i);
i += g(i);
return i;
}
int test2(int i)
{ int dummy = i;
f(&dummy);
i = dummy;
/* now use ’i’ extensively */
i += g(i);
i += g(i);
return i;
}
test1
STMDB sp!,{a1,lr}
MOV a1,sp
BL f
LDR a1,[sp,#0]
BL g
LDR a2,[sp,#0]
ADD a1,a1,a2
STR a1,[sp,#0]
BL g
LDR a2,[sp,#0]
ADD a1,a1,a2
ADD sp,sp,#4
LDMIA sp!,{pc}
test2
STMDB sp!,{v1,lr}
STR a1,[sp,#-4]!
MOV a1,sp
BL f
LDR v1,[sp,#0]
MOV a1,v1
BL g
ADD v1,a1,v1
MOV a1,v1
BL g
ADD a1,a1,v1
ADD sp,sp,#4
LDMIA sp!,{v1,pc}
The first routine allocates i on the stack, and four memory accesses are needed for i.
The second uses two memory accesses for dummy, and none for i.
Note There are some exceptions where the compiler is able to determine that the address is not
really used. For example:
int f(int i)
{ return *(&i);
}
Here the compiler detects that the address is only taken inside the expression, and never
assigned to another variable or passed to a function.
int f(void); /* f and g are assumed to return an error count */
int g(void);
int errs;
void test1(void)
{ errs += f();
errs += g();
}
void test2(void)
{ int localerrs = errs;
localerrs += f();
localerrs += g();
errs = localerrs;
}
test1
STMDB sp!,{v1,lr}
BL f
LDR v1,[pc, #L00002c-.-8]
LDR a2,[v1,#0]
ADD a1,a1,a2
STR a1,[v1,#0]
BL g
LDR a2,[v1,#0]
ADD a1,a1,a2
STR a1,[v1,#0]
LDMIA sp!,{v1,pc}
L00002c
DCD |x$dataseg|
test2
STMDB sp!,{v1,v2,lr}
LDR v1,[pc, #L00002c-.-8]
LDR v2,[v1,#0]
BL f
ADD v2,a1,v2
BL g
ADD a1,a1,v2
STR a1,[v1,#0]
LDMIA sp!,{v1,v2,pc}
Note that test1 must load and store the global errs value each time it is incremented,
whereas test2 keeps localerrs in a register and needs only a single instruction for each
increment.
Another possibility is to include the Point3 structure in the Object structure, thereby
avoiding pointers completely.
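The source for the pointer-chain example is not reproduced above. The sketch below illustrates both remedies under stated assumptions: the layouts of Object and Object2 and the names InitPos1, InitPos2 and InitPos3 are hypothetical, and Point3 is assumed to hold three integers.

```c
typedef struct { int x, y, z; } Point3;
typedef struct { Point3 *pos, *direction; } Object;   /* layout assumed */
typedef struct { Point3 pos, direction; } Object2;    /* Point3 embedded */

/* p->pos is reloaded for each store, because the compiler cannot
   prove that the stores leave p->pos unchanged. */
void InitPos1(Object *p)
{
    p->pos->x = 0;
    p->pos->y = 0;
    p->pos->z = 0;
}

/* The pointer is loaded once into a local, which can live in a register. */
void InitPos2(Object *p)
{
    Point3 *pos = p->pos;

    pos->x = 0;
    pos->y = 0;
    pos->z = 0;
}

/* With the structure embedded there is no second pointer to follow. */
void InitPos3(Object2 *p)
{
    p->pos.x = 0;
    p->pos.y = 0;
    p->pos.z = 0;
}
```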
8.3 Live variables and spilling
As the ARM has a fixed set of registers, there is a limit to the number of variables that can
be kept in registers at any one point in the program. With the recommended options, there
are 14 integer registers available. For hardware floating-point, eight separate floating-point
registers are available. For software floating-point, the integer registers are used to hold
floating-point variables.
The ARM compilers support live-range splitting, where a variable can be allocated to
different registers as well as to memory in different parts of the function. The live-range of
a variable is defined as all statements between the last assignment to the variable, and the
last usage of the variable before the next assignment. In this range the value of the
variable is valid, thus it is alive. In between live ranges, the value of a variable is not
needed: it is dead, so its register can be used for other variables, allowing the compiler to
allocate more variables to registers.
The number of registers needed for register-allocatable variables is at least the number of
overlapping live-ranges at each point in a function. If this exceeds the number of registers
available, some variables must be stored to memory temporarily. This process is called
spilling. The compiler spills the least frequently used variables first, so as to minimize the
cost of spilling. Spilling of variables can be avoided by:
• Limiting the maximum number of live variables. This is typically achieved by
keeping expressions simple and small, and not using too many variables in a
function. Subdividing large functions into smaller, simpler ones might also help.
• Using register for frequently-used variables. This tells the compiler that the
register variable is going to be frequently used, so it should be allocated to a
register with a very high priority. However, such a variable may still be spilled in
some circumstances.
9 Variable Types
The C compilers support the basic types char, short, int and long long (signed and
unsigned), float and double. Using the most appropriate type for variables is important,
as it can reduce code and/or data size and increase performance considerably.
wordinc
ADD a1,a1,#1
MOV pc,lr
shortinc
ADD a1,a1,#1
MOV a1,a1,LSL #16
MOV a1,a1,ASR #16
MOV pc,lr
charinc
ADD a1,a1,#1
AND a1,a1,#&ff
MOV pc,lr
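The source for these three listings is not reproduced above. Functions of the following shape would be consistent with them; the parameter names are assumptions. The AND with &ff in charinc reflects plain char being unsigned with the ARM compilers.

```c
/* int: the addition is the whole function body. */
int wordinc(int a)
{
    return a + 1;
}

/* short: the result must be narrowed back to 16 bits and sign-extended. */
short shortinc(short a)
{
    return a + 1;
}

/* char: the result must be narrowed back to 8 bits. */
char charinc(char a)
{
    return a + 1;
}
```

The extra instructions in shortinc and charinc exist only to narrow the result, which is why int is usually the better choice for local variables and parameters.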
9.2 Use of shorts/signed bytes on ARM
ARM processors implementing an ARM Architecture earlier than version 4 do not have the
ability to load or store halfwords (shorts) or signed bytes (signed char) directly to or from
memory. These operations are implemented using load/store byte operations and shifts,
and may take up to four instructions. On Architecture 4 and later, these operations only
take a single instruction.
Therefore, if possible, select Architecture 4 processors for applications which use shorts or
signed chars heavily.
The following example illustrates the effect of using shorts and the -arch 4 option.
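/* VARTYPE is assumed to be defined at compile time (for example int,
   short or char); it defaults to short here so the example compiles. */
#ifndef VARTYPE
#define VARTYPE short
#endif
VARTYPE array [2000];
void varsize (void)
{ int loop;
for (loop = 0; loop < 2000; loop++)
array[loop] = loop;
}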
10 Function Design
In general, it is a good idea to keep functions small and simple. This enables the compiler
to perform other optimizations, such as register allocation, more efficiently.
int g1(void)
{ return f1(1, 2, 3, 4);
}
int g2(void)
{ return f2(1, 2, 3, 4, 5, 6);
}
The fifth and sixth parameters are stored on the stack in g2, and reloaded in f2, costing
two memory accesses per parameter.
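One common way to stay within four register arguments is to group related parameters in a structure and pass a pointer to it. The sketch below is illustrative only: Params6, f2_struct and g2_struct are hypothetical names, and summing the fields is an assumed stand-in for the real work.

```c
typedef struct {
    int a, b, c, d, e, f;
} Params6;

/* Takes one pointer argument, which fits in a single register. */
int f2_struct(const Params6 *p)
{
    return p->a + p->b + p->c + p->d + p->e + p->f;
}

int g2_struct(void)
{
    /* static const: the values are not rebuilt on every call */
    static const Params6 args = { 1, 2, 3, 4, 5, 6 };

    return f2_struct(&args);
}
```

The fields still have to be read from memory, so this is most likely to pay off when the same structure is passed on through several calls.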
void test(void)
{ int64 a, b, c, sum;
a.hi = 0x00000000; a.lo = 0xF0000000;
b.hi = 0x00000001; b.lo = 0x10000001;
sum = add64(a, b);
c.hi = 0x00000002; c.lo = 0xFFFFFFFF;
sum = add64(sum, c);
}
add64
ADDS a2,a2,a4
ADC a1,a3,a1
MOV pc,lr
test
STMDB sp!,{lr}
MOV a1,#0
MOV a2,#&f0000000
MOV a3,#1
MOV a4,#&10000001
BL add64
MOV a3,#2
MVN a4,#0
LDMIA sp!,{lr}
B add64
By using __value_in_regs, the code size is 52 bytes (compared with 160 bytes
otherwise). Note how the result from the first call to add64 is returned in r0 and r1, so
these registers are already prepared for the second call to add64. The compiler can use
the ADC instruction, resulting in optimal code.
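The definitions of int64 and add64 are not reproduced above. The sketch below shows one way they could be written in portable C. The field order is an assumption inferred from the register usage in the listing, and uint32_t stands in for a 32-bit unsigned int. With the ARM compilers the definition of add64 would additionally carry the __value_in_regs keyword.

```c
#include <stdint.h>

typedef struct {
    uint32_t hi;   /* field order assumed from the listing */
    uint32_t lo;
} int64;

/* Adds two 64-bit values held as pairs of 32-bit words. */
int64 add64(int64 x, int64 y)
{
    int64 res;

    res.lo = x.lo + y.lo;
    res.hi = x.hi + y.hi;
    if (res.lo < y.lo)      /* carry out of the low word? */
        res.hi++;
    return res;
}
```

The unsigned comparison of res.lo against y.lo is the same carry test described in 5.2 Compares with zero, which is what allows the compiler to use ADDS followed by ADC.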
__value_in_regs can also be an efficient way of interfacing C to assembler when you
wish to return more than one result. You can define a structure to hold the return values
and then write a C function declaration using __value_in_regs:
typedef struct
{ int x, y, z;
} Point3;
Overall, you should expect a leaf function to carry virtually no function entry / exit
overhead, and at worst, a small overhead, most likely in proportion to the useful work done
by the function.
If possible, you should try to arrange for frequently-called functions to be leaf functions.
The number of times a function is called can be determined by using the profiling facility
(refer to the Software Development Toolkit User Guide (ARM DUI 0040) for more details).
There are several ways to ensure that a function is compiled as a leaf function:
• Avoid calling other functions. This includes any operations which are converted to
calls to the C-library (such as division, or any floating-point operation when the
software floating-point library is used).
• Use __inline for small functions which are called from it (see 10.5 Inline
functions for more details).
to:
BL + B + MOV pc,lr
which works out as a saving of 25%. Many of the other examples given in this Application
Note also benefit from this optimization.
10.4 Pure functions
Note The pure function optimization is disabled for the -g option.
Pure functions are those which return a result which depends only on their arguments.
They can be thought of as mathematical functions: they always return the same result if
the arguments are the same. Therefore they must not have any side-effects, where a
different value could be returned, even though the parameters are the same. For example,
a pure function may use the stack for local storage, and read its parameters from the
stack. However, a pure function cannot read or write global state by using global variables
or indirecting through pointers. To tell the compiler that a function is pure, use the special
declaration keyword __pure.
Consider the following sample code:
int square(int x)
{ return x * x;
}
int f(int n)
{ return square(n) + square(n);
}
square
MOV a2,a1
MUL a1,a2,a2
MOV pc,lr
f
STMDB sp!,{lr}
MOV a3,a1
BL square
MOV a4,a1
MOV a1,a3
BL square
ADD a1,a4,a1
LDMIA sp!,{pc}
If the code is modified so that square is defined as pure:
__pure int square(int x)
{ return x * x;
}
f
STMDB sp!,{lr}
BL square
MOV a1,a1,LSL #1
LDMIA sp!,{pc}
Note that square is now only called once. This is because the compiler has detected that
it is a common subexpression (CSE). This optimization can only be performed on pure
functions as they do not have side-effects.
Pure functions can also improve the code in other ways, as they cannot have read or
written any memory locations. This means, for example, that values which are allocated to
memory can be safely cached in registers, instead of being written to memory before a call
and reloaded afterwards.
Another way to tell the compiler that a function is pure is to place the following pragmas
around its definition:
#pragma no_side_effects
/* function definition */
#pragma side_effects
#include <math.h>
length
STMDB sp!,{lr}
MUL a3,a1,a1
MLA a1,a2,a2,a3
BL _dflt
LDMIA sp!,{lr}
B sqrt
10.6 Function definitions
Note This section applies only to armcc, as tcc does not currently support this optimization.
Placing function definitions before their use can sometimes produce better code as it
allows the compiler to see the register usage of the called function. This is a simple form of
interprocedural optimization (where optimizations are carried out between functions).
int square(int x);
sumsquares1
STMDB sp!,{v1,v2,lr}
MOV v1,a2
BL square
MOV v2,a1
MOV a1,v1
BL square
ADD a1,v2,a1
LDMIA sp!,{v1,v2,pc}
square
MOV a2,a1
MUL a1,a2,a2
MOV pc,lr
sumsquares2
STMDB sp!,{lr}
MOV a3,a2
BL square
MOV a4,a1
MOV a1,a3
BL square
ADD a1,a4,a1
LDMIA sp!,{pc}
By putting the square definition before the sumsquares definition, the compiler knows
that a3 and a4 are not used. It is able to use these registers, knowing that they will not be
corrupted by the square function, instead of being forced to use v1 and v2 by storing their
values on the stack. Fewer memory accesses result in higher speed.
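The source for the two variants is not reproduced above. The sketch below shows the ordering that corresponds to sumsquares2, with square defined before its caller. The function bodies are assumptions consistent with the listings.

```c
/* Defined before its first use, so the compiler has already seen
   which registers square needs when it compiles the caller. */
int square(int x)
{
    return x * x;
}

int sumsquares2(int x, int y)
{
    return square(x) + square(y);
}
```

In the sumsquares1 ordering only the prototype of square is visible at the point of the call, so the compiler has to assume that every register the procedure call standard allows a callee to corrupt may be changed.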
Using Lookup Tables
12 Floating-Point Arithmetic
The ARM core does not contain any actual floating-point hardware. Instead there are three
options for an application which needs floating-point support.
• Floating-Point Accelerator (FPA) hardware coprocessor.
This implements a floating-point instruction set using a number of ARM
coprocessor instructions. However this does require the FPA hardware to exist
within the system as a coprocessor.
• The Floating-Point Emulator (FPE).
This emulates in software the instructions that the FPA executes. This means that
there is no need to recompile code for systems with or without the FPA.
• The Floating-Point Library (FPLib).
Floating-point operations are compiled into function calls to library routines rather
than floating-point instructions. Although this is slower than using an FPA it is typically
two or three times faster than using the FPE. (The FPE emulates the FPA instruction
set, which means there is some overhead per instruction.) The overall code size of
the system is also smaller because only the required library routines are included,
rather than the whole of the FPE. The floating-point library is therefore the route that
ARM recommends for use in embedded systems and is the default.
Note The Thumb instruction set does not have instruction space for coprocessor instructions.
Thus the Thumb compiler will always provide floating-point functionality using the
Floating-Point Library.
The recommended compiler options give the best results in terms of performance and
code size. However, when writing floating-point code, keep the following things in mind:
• Floating-point division is slow.
Division is typically twice as slow as addition or multiplication. Rewrite divisions by
a constant into a multiplication with the inverse (for example, x = x / 3.0
becomes x = x * (1.0/3.0); the constant is calculated during compilation).
• Use floats instead of doubles.
Float variables consume less memory and fewer registers, and are more efficient
because of their lower precision. Use floats whenever their precision is good
enough.
• Avoid using transcendental functions.
Transcendental functions, like sin, exp and log are implemented using series of
multiplications and additions (using extended precision). As a result, these
operations are at least ten times slower than a normal multiply.
• Simplify floating-point expressions.
The compiler cannot apply many optimizations which are performed on integers to
floating-point values. For example, 3 * (x / 3) cannot be optimized to x, since
floating-point operations generally lead to loss of precision. Even the order of
evaluation is important: (a + b) + c is not the same as a + (b + c).
Therefore, it is beneficial to perform floating-point optimizations manually if it is
known they are correct.
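The division rewrite from the first point above can be sketched as follows. The function names third_div and third_mul are hypothetical, and float is used in line with the advice to prefer floats over doubles.

```c
/* Division by a constant on every call. */
float third_div(float x)
{
    return x / 3.0f;
}

/* Multiplication by the reciprocal; (1.0f / 3.0f) is a constant
   expression, so it can be evaluated during compilation. */
float third_mul(float x)
{
    return x * (1.0f / 3.0f);
}
```

The two forms can differ in the last bit of the result, because the reciprocal is itself rounded. This is the kind of change the compiler will not make on its own, and it should only be made by hand where that difference is acceptable.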
For more information about optimizing floating-point performance, see Application Note
55: Floating-Point Performance (ARM DAI 0055).
However it is still possible that the floating-point performance will not reach the required level for a
particular application. In such a case the best approach may be to change from using
floating-point to fixed point arithmetic. When the range of values needed is sufficiently small,
fixed-point arithmetic is more accurate and much faster than floating-point arithmetic. See
Application Note 33: Fixed Point Arithmetic (ARM DAI 0033) for more details.
Cross Jump Optimization
f
STMDB sp!,{v1,v2,lr}
MOV v2,a2
CMP a1,a2
ADD v1,a1,a2
SUBLE v2,v2,#1
SUBGT v1,v1,#1
MOV a2,v2
MOV a1,v1
BL g
ADD a1,v1,v2
LDMIA sp!,{v1,v2,pc}
The final three instructions in the then and else part have been combined, resulting in
shorter and faster code.
14 Portability of C Code
Many of the C optimizations described here apply to other processors. However, some of
the optimizations rely on pragma statements or special function declaration keywords
(such as __inline). Many other compilers also support these, although they have a
different syntax.
To maintain code portability to other platforms, these ARM-specific keywords could be
made into macros:
#ifdef __arm
# define INLINE __inline
# define VALUE_IN_REGS __value_in_regs
# define PURE __pure
#else
# define INLINE
# define VALUE_IN_REGS
# define PURE
#endif
The code can make use of INLINE, VALUE_IN_REGS, and so on.
INLINE int square(int x) {
return x*x;
}
15 Further Information
For further information, refer to:
• Software Development Toolkit User Guide (ARM DUI 0040): Chapter 7
Benchmarking, Performance Analysis and Profiling
• Application Note 33: Fixed Point Arithmetic on the ARM (ARM DAI 0033)
• Application Note 36: Declaring C Global Data (ARM DAI 0036)
• Application Note 55: Floating-Point Performance (ARM DAI 0055)