● Optimizing code takes time and reduces source code readability. Usually, it’s only worth
optimizing functions that are frequently executed and important for performance.
● We recommend you use a performance profiling tool, found in most ARM simulators, to find
these frequently executed functions.
● C compilers have to translate your C function literally into assembler so that it works for all
possible inputs.
● In practice, many of the input combinations are not possible or won’t occur. Let’s start by
looking at an example of the problems the compiler faces.
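The example listing itself is not reproduced in these notes. A minimal sketch of the kind of function the following points discuss, assuming a byte-clearing routine (the name memclr is hypothetical; data and N are the names used in the text):

```c
/* Hypothetical example: clear N bytes starting at data.
   The compiler cannot assume N > 0, that data is four-byte
   aligned, or that N is a multiple of four. */
void memclr(char *data, int N)
{
    for (; N > 0; N--) {
        *data = 0;   /* one char store per iteration */
        data++;
    }
}
```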
● No matter how advanced the compiler, it does not know whether N can be 0 on input or not.
Therefore the compiler needs to test for this case explicitly before the first iteration of the loop.
● The compiler doesn’t know whether the data array pointer is four-byte aligned or not. If it is
four-byte aligned, then the compiler can clear four bytes at a time using an int store rather than a
char store.
● Nor does it know whether N is a multiple of four or not. If N is a multiple of four, then the
compiler can repeat the loop body four times or store four bytes at a time using an int store.
To keep our examples concrete, we have tested them using the following specific C compilers:
❖ armcc from ARM Developer Suite version 1.1 (ADS1.1). You can license this compiler, or a
later version, directly from ARM.
❖ arm-elf-gcc version 2.95.2. This is the ARM target for the GNU C compiler, gcc, and is freely
available.
We have used armcc from ADS1.1 to generate the example assembler output in this book. The
following short script shows you how to invoke armcc on a C file test.c. You can use this to
reproduce our examples.
Microcontroller and Embedded System 21CS43
By default armcc has full optimizations turned on (the -O2 command line switch). The -Otime
switch optimizes for execution efficiency rather than space and mainly affects the layout of for
and while loops. If you are using the gcc compiler, then the following short script generates a
similar assembler output listing:
● The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores directly,
through new instructions.
● ARMv5 adds instruction support for 64-bit loads and stores. This is available in ARM9E and
later cores.
● Architectures before ARMv4 handled signed 8-bit values poorly, because they had no
sign-extending byte load. Therefore ARM C compilers define char to be an unsigned 8-bit value,
rather than a signed 8-bit value as is typical in many other compilers.
● Compilers armcc and gcc use the datatype mappings
● A common example is using a char type variable i as a loop counter, with loop continuation
condition i ≥ 0.
● As i is unsigned for the ARM compilers, the loop will never terminate. Fortunately armcc
produces a warning in this situation: unsigned comparison with 0.
● Compilers also provide an override switch to make char signed. For example, the command
line option -fsigned-char will make char signed on gcc.
● The command line option -zc will have the same effect with armcc.
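The loop-counter pitfall above can be sketched as follows. This is an illustration, not the original listing: the function names and the limit parameter are hypothetical, and the counter types are written explicitly as unsigned char and signed char so the behaviour is the same on any compiler, whatever its default for plain char.

```c
/* Counts loop iterations, giving up after `limit` passes so that the
   demonstration always returns. With an unsigned counter the test
   i >= 0 is always true: decrementing 0 wraps around to 255. */
int count_down_unsigned(unsigned char start, int limit)
{
    int iterations = 0;
    unsigned char i;
    for (i = start; i >= 0; i--) {      /* never false */
        if (++iterations >= limit) {
            break;                      /* safety exit for the demo */
        }
    }
    return iterations;
}

/* Same loop with a signed counter: terminates once i reaches -1. */
int count_down_signed(signed char start, int limit)
{
    int iterations = 0;
    signed char i;
    for (i = start; i >= 0; i--) {
        if (++iterations >= limit) {
            break;
        }
    }
    return iterations;
}
```

With start = 3, the signed version runs four iterations (i = 3, 2, 1, 0), while the unsigned version only stops because of the artificial limit.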
The input values a, b, and the return value will be passed in 32-bit ARM registers. Should the
compiler assume that these 32-bit values are in the range of a short type, that is, −32,768 to
+32,767?
● Or should the compiler force values to be in this range by sign-extending the lowest 16 bits to
fill the 32-bit register?
● The compiler must make compatible decisions for the function caller and callee. Either the
caller or callee must perform the cast to a short type.
● If the compiler passes arguments wide, then the callee must reduce function arguments to the
correct range. If the compiler passes arguments narrow, then the caller must reduce the range.
● If the compiler returns values wide, then the caller must reduce the return value to the correct
range. If the compiler returns values narrow, then the callee must reduce the range before
returning the value.
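The narrowing step can be made explicit in C. The sketch below is only a model of the "wide arguments, narrow return" convention described above: the names narrow_to_short and add_callee_narrows are hypothetical, int stands in for a 32-bit register, and the addition a + (b >> 1) is an assumed example operation.

```c
/* Sign-extend the low 16 bits of a 32-bit value, which is what
   "reducing to the range of a short" means at register level. */
int narrow_to_short(int wide)
{
    unsigned int low = (unsigned int)wide & 0xFFFFu;
    return (low & 0x8000u) ? (int)low - 0x10000 : (int)low;
}

/* Model of a callee that receives wide arguments and must return
   a narrow value: it reduces the inputs, computes, then reduces
   the result before returning it. */
int add_callee_narrows(int a, int b)
{
    a = narrow_to_short(a);
    b = narrow_to_short(b);
    /* b >> 1 on a negative b is implementation-defined in C;
       ARM compilers use an arithmetic shift. */
    return narrow_to_short(a + (b >> 1));
}
```

Each call to narrow_to_short corresponds to extra instructions in the generated code, which is why short arguments and return values can cost more than int.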
FUNCTION CALLS
● The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return
values in ARM registers.
● The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and Thumb
interworking as well.
● The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3.
Subsequent integer arguments are placed on the full descending stack, ascending in memory as
shown in the figure. Integer return values are passed in r0.
This description covers only integer or pointer arguments. Two-word arguments such as long
long or double are passed in a pair of consecutive argument registers and returned in r0, r1.
● The compiler may pass structures in registers or by reference according to command line
compiler options.
● The first point to note about the procedure call standard is the four-register rule.
● Functions with four or fewer arguments are far more efficient to call than functions with five
or more arguments.
● For functions with four or fewer arguments, the compiler can pass all the arguments in
registers.
● For functions with more arguments, both the caller and callee must access the stack for some
arguments.
● Note that for C++ the first argument to an object method is the this pointer. This argument is
implicit and additional to the explicit arguments.
● If your C function needs more than four arguments, or your C++ method more than three
explicit arguments, then it is almost always more efficient to use structures.
● Group related arguments into structures, and pass a structure pointer rather than multiple
arguments. Which arguments are related will depend on the structure of your software.
The next example illustrates the benefits of using a structure pointer. First we show a typical
routine to insert N bytes from array data into a queue. We implement the queue using a cyclic
buffer with start address Q_start (inclusive) and end address Q_end (exclusive).
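The listings are not reproduced in these notes. The sketch below is consistent with the description, assuming the names queue_bytes_v1 and queue_bytes_v2 used in the discussion; the Queue type, the Q_ptr field (current insert position), and the const qualifier on data are assumptions. Both versions assume N is at least 1.

```c
/* Version 1: five arguments, so the fifth (N) is passed on the stack. */
char *queue_bytes_v1(char *Q_start,     /* buffer start (inclusive) */
                     char *Q_end,       /* buffer end (exclusive)   */
                     char *Q_ptr,       /* current insert position  */
                     const char *data,  /* bytes to insert          */
                     unsigned int N)    /* number of bytes, N >= 1  */
{
    do {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end) {
            Q_ptr = Q_start;            /* wrap around the cyclic buffer */
        }
    } while (--N);
    return Q_ptr;
}

/* Version 2: the three queue values are grouped in a structure. */
typedef struct {
    char *Q_start;
    char *Q_end;
    char *Q_ptr;
} Queue;

void queue_bytes_v2(Queue *queue, const char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;         /* work on local copies */
    char *Q_end = queue->Q_end;

    do {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end) {
            Q_ptr = queue->Q_start;
        }
    } while (--N);
    queue->Q_ptr = Q_ptr;               /* store the updated position */
}
```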
The queue_bytes_v2 is one instruction longer than queue_bytes_v1, but it is in fact more
efficient overall.
● The second version has only three function arguments rather than five. Each call to the
function requires only three register setups.
● This compares with four register setups, a stack push, and a stack pop for the first version.
There is a net saving of two instructions in function call overhead.
● There are likely further savings in the callee function, as it only needs to assign a single
register to the Queue structure pointer, rather than three registers in the nonstructured case.
The compiler will only inline small functions. You can ask the compiler to inline a function
using the __inline keyword, although this keyword is only a hint and the compiler may ignore it.
Inlining large functions can lead to big increases in code size without much performance
improvement.
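As an illustration of a small function that is a good inlining candidate, here is a sketch that converts a 32-bit value to eight hexadecimal digits. The function names are hypothetical, and it uses the standard C99 static inline spelling rather than armcc's __inline keyword so that it compiles with any C99 compiler; like __inline, it is only a hint.

```c
#include <stdint.h>

/* Small helper: a natural candidate for inlining, since the call
   overhead would be comparable to the work it does. */
static inline unsigned int nybble_to_hex(unsigned int d)
{
    return (d < 10) ? d + '0' : d - 10 + 'A';
}

/* Writes exactly 8 hex digits to out, most significant first.
   No terminating '\0' is written. */
void uint_to_hex(char *out, uint32_t in)
{
    unsigned int i;
    for (i = 8; i > 0; i--) {
        in = (in << 4) | (in >> 28);            /* rotate left by 4 bits */
        *out++ = (char)nybble_to_hex(in & 15u); /* low nybble after rotate */
    }
}
```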
POINTER ALIASING
● Two pointers are said to alias when they point to the same address.
● If you write to one pointer, it will affect the value you read from the other pointer. In a
function, the compiler often doesn’t know which pointers can alias and which pointers can’t.
● The compiler must be very pessimistic and assume that any write to a pointer may affect the
value read from any other pointer, which can significantly reduce code efficiency.
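The example listing is not reproduced in these notes. A minimal sketch, assuming a function that advances two timers by a common step (the name timers_v1 is hypothetical; timer1 and step are the names used in the discussion, timer2 is assumed):

```c
/* Adds *step to two timers. Because timer1 and step may point to
   the same int, the compiler must reload *step after the first
   store instead of reusing the value it already read. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;   /* may modify *step if the pointers alias */
    *timer2 += *step;   /* so *step has to be read again here     */
}
```

If the function is called with step equal to timer1, the second addition uses the already updated value, which is exactly the case the compiler has to allow for.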
Note that the compiler loads from step twice. Usually a compiler optimization called common
subexpression elimination would kick in so that *step was only evaluated once, and the value
reused for the second occurrence.
● However, the compiler can’t use this optimization here. The pointers timer1 and step might
alias one another.
● In other words, the compiler cannot be sure that the write to timer1 doesn’t affect the read
from step.
● If the pointers do alias, the second value of *step is different from the first and has the
value *timer1. This possibility forces the compiler to insert an extra load instruction.
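One common way to avoid the extra load is to read the value into a local variable first, since a local whose address is never taken cannot be aliased. A sketch under that assumption (the name timers_v2 and the local variable name are hypothetical):

```c
/* Reads *step once into a local. The compiler can keep the local
   in a register, so only one load from step is needed. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int local_step = *step;   /* single load; cannot alias anything */

    *timer1 += local_step;
    *timer2 += local_step;
}
```

Note that this changes the result when step really does alias timer1: both timers then advance by the original step value. It is only a valid rewrite when that is the behaviour you intend.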