CHAPTER-2 : C COMPILERS AND OPTIMIZATION
● Optimizing code takes time and reduces source code readability. Usually, it’s only worth
optimizing functions that are frequently executed and important for performance.
● We recommend you use a performance profiling tool, found in most ARM simulators, to
find these frequently executed functions.
● Document nonobvious optimizations with source code comments to aid maintainability.
● C compilers have to translate your C function literally into assembler so that it works for
all possible inputs.
● In practice, many of the input combinations are not possible or won’t occur. Let’s start by
looking at an example of the problems the compiler faces.
● The memclr function clears N bytes of memory at address data.
● No matter how advanced the compiler, it does not know whether N can be 0 on input or
not. Therefore the compiler needs to test for this case explicitly before the first iteration of
the loop.
● The compiler doesn’t know whether the data array pointer is four-byte aligned or not. If it
is four-byte aligned, then the compiler can clear four bytes at a time using an int store rather
than a char store.
● Nor does it know whether N is a multiple of four or not. If N is a multiple of four, then the
compiler can repeat the loop body four times or store four bytes at a time using an int store.
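A minimal sketch of the memclr function described above is given here for reference. The exact signature is an assumption; memclr_words is a hypothetical variant, not part of the original text, showing what becomes possible when the caller can promise four-byte alignment and a nonzero N that is a multiple of four (it also assumes a 4-byte int).

```c
/* Clears N bytes of memory starting at address data.
 * The compiler must allow for N == 0, for any alignment of data,
 * and for any value of N, so it stores one byte per iteration. */
void memclr(char *data, int N)
{
    for (; N > 0; N--)
    {
        *data = 0;
        data++;
    }
}

/* Hypothetical variant: the caller guarantees that data is four-byte
 * aligned and that N is a nonzero multiple of four. The loop can then
 * clear one int (assumed to be 4 bytes) per iteration and needs no
 * test before the first iteration. */
void memclr_words(int *data, unsigned int N)
{
    do
    {
        *(data++) = 0;
        N -= 4;
    } while (N != 0);
}
```

The second form encodes in the source what the compiler cannot deduce on its own from the first.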
To keep our examples concrete, we have tested them using the following specific C compilers:
❖ armcc from ARM Developer Suite version 1.1 (ADS1.1). You can license this compiler,
or a later version, directly from ARM.
❖ arm-elf-gcc version 2.95.2. This is the ARM target for the GNU C compiler, gcc, and is
freely available.
We have used armcc from ADS1.1 to generate the example assembler output in this book. The
following short script shows you how to invoke armcc on a C file test.c. You can use this to
reproduce our examples.
By default armcc has full optimizations turned on (the -O2 command line switch). The -Otime
switch optimizes for execution efficiency rather than space and mainly affects the layout of for
and while loops. If you are using the gcc compiler, then the following short script generates a
similar assembler output listing:
Basic C Data Types
ARM supports operations on different data types.
The data types we can load (or store) can be signed and unsigned words, halfwords, or bytes. The
load and store instruction suffixes for these data types are: -h or -sh for halfwords (LDRH, LDRSH),
-b or -sb for bytes (LDRB, LDRSB), and no suffix for words (LDR). The difference between signed
and unsigned data types is:
Signed data types can hold both positive and negative values, so their largest positive value is
smaller (for example, −128 to +127 for a byte).
Unsigned data types cannot hold negative values, so they can hold larger positive values
(for example, 0 to 255 for a byte).
● ARM processors have 32-bit registers and 32-bit data processing operations. The ARM
architecture is a RISC load/store architecture.
● In other words you must load values from memory into registers before acting on them.
There are no arithmetic or logical instructions that manipulate values in memory directly.
● The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores
directly, through new instructions.
● ARMv5 adds instruction support for 64-bit loads and stores. This is available in ARM9E
and later cores.
● Prior to ARMv4, ARM processors were not good at handling signed 8-bit or any 16-bit
values. Therefore ARM C compilers define char to be an unsigned 8-bit value, rather than a
signed 8-bit value as is typical in many other compilers.
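● The compilers armcc and gcc use the following datatype mappings: char is an unsigned
8-bit byte, short is a signed 16-bit halfword, int and long are signed 32-bit words, and
long long is a signed 64-bit double word.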
● A common example is using a char type variable i as a loop counter, with loop
continuation condition i ≥ 0.
● As i is unsigned for the ARM compilers, the loop will never terminate. Fortunately armcc
produces a warning in this situation: unsigned comparison with 0.
● Compilers also provide an override switch to make char signed. For example, the
command line option -fsigned-char will make char signed on gcc.
● The command line option -zc will have the same effect with armcc.
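The loop-counter problem described above can be reproduced on any machine by declaring the counter as unsigned char explicitly, which is how the ARM compilers treat plain char. This is only an illustrative sketch: the function names and the limit guard are hypothetical, and the guard exists solely so that the demonstration terminates.

```c
/* Counts iterations of a loop whose counter is an unsigned 8-bit value.
 * The condition i >= 0 is always true, because decrementing 0 wraps
 * around to 255. The hypothetical 'limit' guard is the only exit. */
unsigned int count_down_uchar(unsigned char start, unsigned int limit)
{
    unsigned int iterations = 0;
    unsigned char i;

    for (i = start; i >= 0; i--)   /* never false for an unsigned type */
    {
        if (++iterations == limit)
        {
            break;                 /* without this, the loop never ends */
        }
    }
    return iterations;
}

/* The same loop with an int counter terminates after start + 1 passes. */
unsigned int count_down_int(int start)
{
    unsigned int iterations = 0;
    int i;

    for (i = start; i >= 0; i--)
    {
        iterations++;
    }
    return iterations;
}
```

Calling count_down_uchar(3, 1000) runs until the guard fires, whereas count_down_int(3) stops after four iterations.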
Local Variable Types
● ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. However,
most ARM data processing operations are 32-bit only.
● For this reason, you should use a 32-bit datatype, int or long, for local variables wherever
possible.
● Avoid using char and short as local variable types, even if you are manipulating an 8- or
16-bit value.
● The one exception is when you want wrap-around to occur. If you require modulo
arithmetic of the form 255 + 1 = 0, then use the char type.
● The following code checksums a data packet containing 64 words. It shows why you
should avoid using char for local variables.
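A sketch of such a checksum routine follows. The name checksum_v1 and the exact body are assumptions that follow the checksum_v2 naming used later in the text; the first version uses a char loop counter and the second an unsigned int.

```c
/* Checksums a packet of 64 words using a char loop counter.
 * Because i is an 8-bit type, the compiler has to reduce i to the
 * range of a char after each increment before comparing it with 64. */
int checksum_v1(int *data)
{
    char i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}

/* The same routine with a 32-bit unsigned counter. No range reduction
 * is needed, since i already fills a whole register. */
int checksum_v2(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
```

Both functions return the same result. The difference is only in the code the compiler is likely to generate for the loop counter.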
When the packet contains 16-bit values and the sum is also declared as a short, the loop is
three instructions longer than the loop for example checksum_v2 earlier. There
are two reasons for the extra instructions:
● The LDRH instruction does not allow for a shifted address offset as the LDR instruction
did in checksum_v2. Therefore the first ADD in the loop calculates the address of item i
in the array. The LDRH loads from an address with no offset. LDRH has fewer
addressing modes than LDR as it was a later addition to the ARM instruction set.
● The cast reducing total + array[i] to a short requires two MOV instructions. The compiler
shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is
a sign-extending shift so it replicates the sign bit to fill the upper 16 bits.
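The two costs listed above can be seen in a routine of the following shape. This is a sketch under assumptions: the names checksum_v3 and checksum_v4 are hypothetical, and the variables are called sum and data rather than total and array. The second version keeps the running sum in an int and steps the pointer, which avoids both the per-iteration cast and the indexed address calculation.

```c
/* 16-bit data and a 16-bit running sum: every iteration narrows
 * sum + data[i] back to a short, and data[i] needs an address
 * calculation because halfword loads cannot use a shifted index. */
short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum = (short)(sum + data[i]);
    }
    return sum;
}

/* Hypothetical improvement: accumulate in an int, walk the array with
 * a pointer, and narrow to a short only once at the end. */
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += *(data++);
    }
    return (short)sum;
}
```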
FUNCTION ARGUMENT TYPES
Consider the following simple function, which adds two 16-bit values, halving the second, and
returns a 16-bit sum:
● The input values a, b, and the return value will be passed in 32-bit ARM registers. Should
the compiler assume that these 32-bit values are in the range of a short type, that is,
−32,768 to +32,767?
● Or should the compiler force values to be in this range by sign-extending the lowest 16 bits
to fill the 32-bit register?
● The compiler must make compatible decisions for the function caller and callee. Either the
caller or callee must perform the cast to a short type.
● If the compiler passes arguments wide, then the callee must reduce function arguments to
the correct range. If the compiler passes arguments narrow, then the caller must reduce
the range.
● If the compiler returns values wide, then the caller must reduce the return value to the
correct range. If the compiler returns values narrow, then the callee must reduce the range
before returning the value.
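A function matching the description at the start of this section might look like the sketch below. The names add_v1 and add_v2 are assumptions. The second version is a hypothetical alternative that uses int for the arguments and the return value, so that neither caller nor callee has to reduce anything to the range of a short.

```c
/* Adds two 16-bit values, halving the second, and returns a 16-bit sum.
 * The explicit cast marks the point where the result must be reduced
 * to the range of a short. */
short add_v1(short a, short b)
{
    return (short)(a + (b >> 1));
}

/* Hypothetical wide version: arguments and result are full 32-bit
 * values, so no narrowing is required on either side of the call. */
int add_v2(int a, int b)
{
    return a + (b >> 1);
}
```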
SIGNED VERSUS UNSIGNED TYPES
It is more efficient to use unsigned types for divisions. The compiler converts unsigned power of
two divisions directly to right shifts. For general divisions, the divide routine in the C library is
faster for unsigned types.
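The reason is that signed division in C rounds toward zero, while an arithmetic right shift rounds toward minus infinity, so a signed divide by two needs a correction step for negative values. The sketch below illustrates this with two averaging functions whose names are assumptions.

```c
/* Signed average: -3 / 2 must give -1, but shifting -3 right by one
 * bit would give -2, so the compiler cannot use a plain shift. */
int average_v1(int a, int b)
{
    return (a + b) / 2;
}

/* Unsigned average: division by two is exactly a logical shift right
 * by one bit, with no correction needed. */
unsigned int average_v2(unsigned int a, unsigned int b)
{
    return (a + b) / 2;
}
```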
C LOOPING STRUCTURES
LOOPS WITH A FIXED NUMBER OF ITERATIONS
The code below shows how the compiler treats a loop with an incrementing count, i++.
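A minimal sketch of a fixed-count loop in both directions follows. The names checksum_v5 and checksum_v6 are assumptions. In the incrementing form the loop typically needs an add, a compare against 64, and a branch. In the decrementing form the subtract can set the condition flags itself, so the separate compare is usually unnecessary.

```c
/* Incrementing counter: i is compared with 64 on every iteration. */
int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += *(data++);
    }
    return sum;
}

/* Decrementing counter: the loop ends when i reaches zero, so the
 * decrement and the termination test can be combined. */
int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--)
    {
        sum += *(data++);
    }
    return sum;
}
```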
● For an unsigned loop counter i we can use either of the loop continuation conditions
i!=0 or i>0.
● As i can’t be negative, they are the same condition. For a signed loop counter, it is
tempting to use the condition i>0 to continue the loop.
● You might expect the compiler to generate the following two instructions to implement
the loop:
LOOPS USING A VARIABLE NUMBER OF ITERATIONS
Now suppose we want our checksum routine to handle packets of arbitrary size. We pass in a
variable N giving the number of words in the data packet. Using the lessons from the last section
we count down until N = 0 and don’t require an extra loop counter i.
The checksum_v7 example shows how the compiler handles a for loop with a variable
number of iterations N.
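A sketch of checksum_v7 as described is given below. The second function, checksum_v8, is a hypothetical variant not taken from the text: if the caller can guarantee at least one word, a do-while loop removes the test for N == 0 that the compiler must otherwise place before the first iteration.

```c
/* Counts N down to zero, so no separate loop counter is required.
 * The compiler still has to check for N == 0 before the loop. */
int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--)
    {
        sum += *(data++);
    }
    return sum;
}

/* Hypothetical variant: valid only when the caller guarantees N >= 1. */
int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}
```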
LOOP UNROLLING
● On ARM7 or ARM9 processors the subtract takes one cycle and the branch three cycles,
giving an overhead of four cycles per loop.
● You can save some of these cycles by unrolling a loop—repeating the loop body several
times, and reducing the number of loop iterations by the same proportion.
● For example, let’s unroll our packet checksum example four times.
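One way the unrolled checksum could be written is sketched here. The name checksum_v9 is an assumption, and the routine is only valid when N is a nonzero multiple of four.

```c
/* Packet checksum unrolled four times.
 * Assumption: N is a nonzero multiple of four. The subtract and branch
 * overhead is now paid once per four words instead of once per word. */
int checksum_v9(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);
    return sum;
}
```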
Unrolling raises two questions: how many times should you unroll the loop, and what happens if
the number of loop iterations is not a multiple of the unroll amount?
To start with the first question, only unroll loops that are important for the overall performance
of the application. Otherwise unrolling will increase the code size with little performance benefit.
Unrolling may even reduce performance by evicting more important code from the cache.
For the second question, try to arrange it so that array sizes are multiples of your unroll amount.
If this isn’t possible, then you must add extra code to take care of the leftover cases. This
increases the code size a little but keeps the performance high.
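The extra code for leftover cases can be as small as a second loop. The following sketch, with the assumed name checksum_v10, handles any N by running an unrolled loop N/4 times and then a single-step loop for the remaining N&3 words.

```c
/* Packet checksum for arbitrary N, unrolled four times.
 * The first loop handles whole groups of four words; the second
 * loop mops up the zero to three words left over. */
int checksum_v10(int *data, unsigned int N)
{
    unsigned int i;
    int sum = 0;

    for (i = N / 4; i != 0; i--)
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    for (i = N & 3; i != 0; i--)
    {
        sum += *(data++);
    }
    return sum;
}
```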
SUMMARY: Writing Loops Efficiently
REGISTER ALLOCATION
● The compiler attempts to allocate a processor register to each local variable you use in a
C function.
● It will try to use the same register for different local variables if the uses of the variables
do not overlap.
● When there are more local variables than available registers, the compiler stores the
excess variables on the processor stack.
● These variables are called spilled or swapped out variables since they are written out to
memory (in much the same way that virtual memory is swapped out to disk).
● Spilled variables are slow to access compared to variables allocated to registers.
First let’s look at the number of processor registers the ARM C compilers have available for
allocating variables. The table below shows the standard register names and usage when following the
ARM-Thumb procedure call standard (ATPCS), which is used in code generated by C compilers.
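● Registers r13 (the stack pointer) and r15 (the program counter) are reserved, which leaves
r0 to r12 and r14. Provided it is not using software stack checking or a frame pointer, the C
compiler can in theory assign 14 variables to registers without spillage.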
● In practice, some compilers use a fixed register such as r12 for intermediate scratch
working and do not assign variables to this register.
● Also, complex expressions require intermediate working registers to evaluate. Therefore,
to ensure good assignment to registers, you should try to limit the internal loop of functions
to using at most 12 local variables.
● If the compiler does need to swap out variables, then it chooses which variables to swap
out based on frequency of use.
● A variable used inside a loop counts multiple times. You can guide the compiler as to which
variables are important by ensuring these variables are used within the innermost loop.
● The register keyword in C hints that a compiler should allocate the given variable to
a register.
● However, different compilers treat this keyword in different ways, and different
architectures have a different number of available registers (for example, Thumb and
ARM).
● Therefore we recommend that you avoid using register and rely on the compiler’s
normal register allocation routine.
FUNCTION CALLS
● The ARM Procedure Call Standard (APCS) defines how to pass function arguments and
return values in ARM registers.
● The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and Thumb
interworking as well.
● The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and
r3. Subsequent integer arguments are placed on the full descending stack, ascending in
memory as in figure. Function return integer values are passed in r0.
● This description covers only integer or pointer arguments. Two-word arguments such as
long long or double are passed in a pair of consecutive argument registers and
returned in r0, r1.
● The compiler may pass structures in registers or by reference according to command line
compiler options.
● The first point to note about the procedure call standard is the four-register rule.
● Functions with four or fewer arguments are far more efficient to call than functions with
five or more arguments.
● For functions with four or fewer arguments, the compiler can pass all the arguments in
registers.
● For functions with more arguments, both the caller and callee must access the stack for
some arguments.
● Note that for C++ the first argument to an object method is the this pointer. This
argument is implicit and additional to the explicit arguments.
● If your C function needs more than four arguments, or your C++ method more than three
explicit arguments, then it is almost always more efficient to use structures.
● Group related arguments into structures, and pass a structure pointer rather than multiple
arguments. Which arguments are related will depend on the structure of your software.
The next example illustrates the benefits of using a structure pointer. First we show a typical
routine to insert N bytes from array data into a queue. We implement the queue using a cyclic
buffer with start address Q_start (inclusive) and end address Q_end (exclusive).
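A sketch of such a routine, with five separate arguments, is shown below. Beyond the names Q_start, Q_end, data, and N given in the text, the details are assumptions: Q_ptr is a hypothetical name for the current insert position, and the routine expects N to be at least one.

```c
/* Inserts N bytes (N >= 1) from data into a cyclic buffer and returns
 * the updated insert position. With five arguments, the fifth (N) has
 * to be passed on the stack rather than in a register. */
char *queue_bytes_v1(
    char *Q_start,     /* queue buffer start address (inclusive) */
    char *Q_end,       /* queue buffer end address (exclusive)   */
    char *Q_ptr,       /* current insert position (hypothetical) */
    char *data,        /* bytes to insert                        */
    unsigned int N)    /* number of bytes to insert              */
{
    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = Q_start;   /* wrap around to the buffer start */
        }
    } while (--N);
    return Q_ptr;
}
```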
Example
The following code creates a Queue structure and passes this to the function to reduce the
number of function arguments.
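A sketch of the structure-based version follows. The field names inside Queue are assumptions chosen to match the Q_start and Q_end names in the text, and the routine expects N to be at least one.

```c
/* Queue state grouped into one structure (field names are assumed). */
typedef struct
{
    char *Q_start;   /* queue buffer start address (inclusive) */
    char *Q_end;     /* queue buffer end address (exclusive)   */
    char *Q_ptr;     /* current insert position                */
} Queue;

/* Inserts N bytes (N >= 1) from data into the queue. Only three
 * arguments are needed, so all of them fit in registers. */
void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;   /* cache hot fields in locals */
    char *Q_end = queue->Q_end;

    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = queue->Q_start;   /* wrap around */
        }
    } while (--N);
    queue->Q_ptr = Q_ptr;             /* write the position back once */
}
```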
● The function queue_bytes_v2 is one instruction longer than queue_bytes_v1, but it is in
fact more efficient overall.
● The second version has only three function arguments rather than five. Each call to the
function requires only three register setups.
● This compares with four register setups, a stack push, and a stack pull for the first version.
There is a net saving of two instructions in function call overhead.
● There are likely further savings in the callee function, as it only needs to assign a single
register to the Queue structure pointer, rather than three registers in the nonstructured
case.
Example
The function uint_to_hex converts a 32-bit unsigned integer into an array of eight
hexadecimal digits. It uses a helper function nybble_to_hex, which converts a digit d in the
range 0 to 15 to a hexadecimal digit.
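The two functions might be written as in the sketch below. It uses uint32_t and static inline as assumptions to keep the example portable standard C; the original listing may have used a compiler-specific inline keyword and a plain unsigned int.

```c
#include <stdint.h>

/* Converts a digit d in the range 0 to 15 to a hexadecimal character.
 * Being small and static, it is a natural candidate for inlining. */
static inline int nybble_to_hex(int d)
{
    if (d < 10)
    {
        return d + '0';
    }
    return d - 10 + 'A';
}

/* Writes the eight hexadecimal digits of in to out, most significant
 * digit first. No terminating NUL is written. */
void uint_to_hex(char *out, uint32_t in)
{
    unsigned int i;

    for (i = 8; i != 0; i--)
    {
        in = (in << 4) | (in >> 28);   /* rotate left by four bits */
        *(out++) = (char)nybble_to_hex((int)(in & 15));
    }
}
```

Each rotation brings the next most significant nybble into the low four bits, so the digits come out in reading order.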
The compiler will only inline small functions. You can ask the compiler to inline a function using
the inline keyword, although this keyword is only a hint and the compiler may ignore it.
Inlining large functions can lead to big increases in code size without much performance
improvement.
POINTER ALIASING
● Two pointers are said to alias when they point to the same address.
● If you write to one pointer, it will affect the value you read from the other pointer. In a
function, the compiler often doesn’t know which pointers can alias and which pointers can’t.
● The compiler must be very pessimistic and assume that any write to a pointer may affect
the value read from any other pointer, which can significantly reduce code efficiency.
● Note that the compiler loads from step twice. Usually a compiler optimization called
common subexpression elimination would kick in so that *step was only evaluated once,
and the value reused for the second occurrence.
● However, the compiler can’t use this optimization here. The pointers timer1 and step
might alias one another.
● In other words, the compiler cannot be sure that the write to timer1 doesn’t affect the
read from step.
● If the pointers do alias, the second value of *step is different from the first and has the
value *timer1. This possibility forces the compiler to insert an extra load instruction.
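The discussion above concerns a function of roughly the following shape. The function names are assumptions; timers_v2 is a hypothetical rewrite that reads *step once into a local variable, which tells the compiler that the value cannot change between the two updates.

```c
/* Adds *step to two timers. The compiler must reload *step for the
 * second update, because timer1 and step might be the same address. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* Hypothetical rewrite: the step value is copied to a local first,
 * so a single load is enough whatever the pointers refer to. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int s = *step;

    *timer1 += s;
    *timer2 += s;
}
```

The two versions give the same result when the pointers are distinct, but they differ when step aliases timer1, which is exactly why the compiler cannot make this change for you.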
Example
Consider the following example, which reads and then checksums a data packet:
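A minimal sketch of such a routine is given below. The name checksum_next_packet is an assumption, and the body of get_next_packet is a hypothetical stand-in that returns a fixed four-word packet so that the example is self-contained.

```c
/* Hypothetical stand-in for the real packet source: it always hands
 * back the same four-word packet and stores its size through N. */
static int packet[4] = {1, 2, 3, 4};

int *get_next_packet(int *N)
{
    *N = 4;
    return packet;
}

/* Reads the next packet and checksums it. Because the address of N is
 * passed to another function, the compiler may have to keep N in
 * memory and treat data and &N as possible aliases. */
int checksum_next_packet(void)
{
    int *data;
    int N, sum = 0;

    data = get_next_packet(&N);
    do
    {
        sum += *(data++);
    } while (--N);
    return sum;
}
```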
Here get_next_packet is a function returning the address and size of the next data
packet. The previous code compiles to
Portability Issues
Here is a summary of the issues you may encounter when porting C code to the ARM.
■ The char type. On the ARM, char is unsigned rather than signed as for many other processors. A common
problem concerns loops that use a char loop counter i and the continuation condition i ≥ 0; they become infinite
loops. In this situation, armcc produces a warning of unsigned comparison with zero. You should either use a
compiler option to make char signed or change loop counters to type int.
■ The int type. Some older architectures use a 16-bit int, which may cause problems when moving to ARM’s 32-bit
int type, although this is rare nowadays. Note that expressions are promoted to an int type before evaluation.
Therefore if i = -0x1000, the expression i == 0xF000 is true on a 16-bit machine but false on a 32-bit machine.
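The difference can be imitated on a machine with a 32-bit int by using the fixed-width types from stdint.h. This is only an illustration with hypothetical function names: the first function behaves as a 32-bit compiler would, and the second truncates to 16 bits explicitly to model the 16-bit result.

```c
#include <stdint.h>

/* With a 32-bit int, i is promoted to int before the comparison, so
 * -0x1000 is compared with 61440 and the result is false. */
int equals_f000_promoted(int16_t i)
{
    return i == 0xF000;
}

/* Models a 16-bit int: only the low 16 bits of i are compared, and
 * the bit pattern of -0x1000 is 0xF000, so the result is true. */
int equals_f000_16bit(int16_t i)
{
    return (uint16_t)i == 0xF000u;
}
```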
■ Unaligned data pointers. Some processors support the loading of short and int typed values from unaligned
addresses. A C program may manipulate pointers directly so that they become unaligned, for example, by casting
a char * to an int *. ARM architectures up to ARMv5TE do not support unaligned pointers. To detect them, run
the program on an ARM with an alignment checking trap. For example, you can configure the ARM720T to data
abort on an unaligned access.
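Once an unaligned access has been found, one portable way to remove it is to copy the bytes with memcpy instead of dereferencing a cast pointer. The helper names below are hypothetical and the sketch is not taken from the original text.

```c
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value from an address with any alignment.
 * memcpy copies byte by byte where necessary, so no unaligned
 * word load is ever issued by this code. */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t value;

    memcpy(&value, p, sizeof value);
    return value;
}

/* Writes a 32-bit value to an address with any alignment. */
void store_u32_unaligned(void *p, uint32_t value)
{
    memcpy(p, &value, sizeof value);
}
```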
■ Endian assumptions. C code may make assumptions about the endianness of a memory system, for example, by
casting a char * to an int *. If you configure the ARM for the same endianness the code is expecting, then there is
no issue. Otherwise, you must remove endian-dependent code sequences and replace them by endian-independent
ones. See Section 5.9 for more details.
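An endian-independent sequence typically assembles the value from individual bytes with shifts, so the result depends only on the byte order of the data and not on the processor configuration. The function names in this sketch are hypothetical.

```c
#include <stdint.h>

/* Reads a 32-bit value stored least significant byte first. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Reads a 32-bit value stored most significant byte first. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         | (uint32_t)p[3];
}
```

Because only byte accesses are used, these helpers also work on unaligned addresses.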
■ Function prototyping. The armcc compiler passes arguments narrow, that is, reduced to the range of the
argument type. If functions are not prototyped correctly, then the function may return the wrong answer. Other
compilers that pass arguments wide may give the correct answer even if the function prototype is incorrect.
Always use ANSI prototypes.
■ Use of bit-fields. The layout of bits within a bit-field is implementation and endian dependent. If C code assumes
that bits are laid out in a certain order, then the code is not portable.
■ Use of enumerations. Although enum is portable, different compilers allocate different numbers of bytes to an
enum. The gcc compiler will always allocate four bytes to an enum type. The armcc compiler will only allocate
one byte if the enum takes only eight-bit values. Therefore you can’t cross-link code and libraries between
different compilers if you use enums in an API structure.
■ Inline assembly. Using inline assembly in C code reduces portability between architectures. You should separate
any inline assembly into small inlined functions that can easily be replaced. It is also useful to supply reference,
plain C implementations of these functions that can be used on other architectures, where this is possible.
■ The volatile keyword. Use the volatile keyword on the type definitions of ARM memory-mapped peripheral
locations. This keyword prevents the compiler from optimizing away the memory access. It also ensures that
the compiler generates a data access of the correct type. For example, if you define a memory location as a volatile short
type, then the compiler will access it using 16-bit load and store instructions LDRSH and STRH.
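As an illustration, a 16-bit peripheral register could be described as in the sketch below. The type name, the STATUS_READY bit, and both functions are hypothetical. On real hardware the pointer would be set to the register address from the device memory map, whereas here the functions simply take a pointer so that they can be exercised anywhere.

```c
#include <stdint.h>

/* Hypothetical type for a 16-bit memory-mapped register. volatile
 * makes every read and write in the source reach the hardware. */
typedef volatile uint16_t reg16_t;

/* Hypothetical status bit, chosen for illustration only. */
#define STATUS_READY 0x0001u

/* Returns nonzero when the ready bit is set. Each call performs a
 * fresh 16-bit read, since the compiler may not cache the value. */
int device_ready(reg16_t *status)
{
    return (*status & STATUS_READY) != 0;
}

/* Sets the given bits with one 16-bit read and one 16-bit write. */
void set_control_bits(reg16_t *control, uint16_t bits)
{
    *control = (uint16_t)(*control | bits);
}
```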