Mod 3
∙ ARM processors have 32-bit registers and 32-bit data processing operations.
∙ Prior to ARMv4, ARM processors were not good at handling signed 8-bit or any 16-bit
values. Therefore ARM C compilers define char to be an unsigned 8-bit value, rather
than a signed 8-bit value as is typical in many other compilers.
LOCAL VARIABLE TYPES
∙ ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data.
However, most ARM data processing operations are 32-bit only. For this reason,
you should use a 32-bit datatype, int or long, for local variables wherever
possible.
∙ Avoid using char and short as local variable types, even if you are manipulating
an 8- or 16-bit value. The one exception is when you require modulo arithmetic of
the form 255 + 1 = 0; in that case, use the char type.
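The wrap-around case can be sketched as follows. This is a minimal illustration, not code from the original notes: the helper names next_index and next_index_int are hypothetical, and unsigned char is spelled out so the behavior does not depend on whether plain char is signed on a given compiler.

```c
/* Hypothetical helper: advance an 8-bit counter with wrap-around. */
unsigned char next_index(unsigned char i)
{
    /* i + 1 is computed as an int, then narrowed back to 8 bits,
       so 255 + 1 wraps to 0. */
    return (unsigned char)(i + 1);
}

/* The same step with an int local needs an explicit mask to wrap. */
int next_index_int(int i)
{
    return (i + 1) & 0xFF;
}
```

If no wrap-around is needed, an int local avoids the narrowing step altogether.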
Suppose the data packet contains 16-bit values and we need a 16-bit checksum. It is tempting to write
the following C code:
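A minimal sketch of such a routine is given here. The body is an assumption rather than the original listing: the name checksum16_sketch is hypothetical, and the packet length of 64 is borrowed from the loop examples later in these notes.

```c
/* Hypothetical 16-bit checksum over a 64-element packet (length assumed). */
short checksum16_sketch(const short *data)
{
    unsigned int i;
    short sum = 0;              /* 16-bit local: the choice the text warns about */

    for (i = 0; i < 64; i++) {
        /* sum + data[i] is evaluated as int, so every iteration
           needs a narrowing cast back to short. */
        sum = (short)(sum + data[i]);
    }
    return sum;
}
```

The cast inside the loop is the cost of the short local; it can be avoided by accumulating in an int and narrowing once at the end.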
FUNCTION ARGUMENT TYPES
• Consider the following simple function, which adds two 16-bit values, halving the
second, and returns a 16-bit sum:
• The input values a, b, and the return value will be passed in 32-bit ARM registers.
Should the compiler assume that these 32-bit values are in the range of a short
type, that is, -32,768 to 32,767?
• The compiler must make compatible decisions for the function caller and callee.
• Either the caller or callee must perform the cast to a short type.
• Function arguments are passed wide if they are not reduced to the range of the
type, and narrow if they are reduced to the range of the type.
∙ We can tell which decision the compiler has made by looking at the assembly output for add_v1.
■ If the compiler passes arguments wide, then the callee must reduce function arguments
to the correct range.
■ If the compiler passes arguments narrow, then the caller must reduce the range.
■ If the compiler returns values wide, then the caller must reduce the return value to the
correct range.
■ If the compiler returns values narrow, then the callee must reduce the range before
returning the value.
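For reference, a sketch of add_v1 that matches the description above (two 16-bit inputs, the second halved, a 16-bit result). The body is an assumption reconstructed from that description; the comments mark where the narrowing cast is implied.

```c
/* Sketch of add_v1 as described in the text; the body is assumed. */
short add_v1(short a, short b)
{
    /* a and b are promoted to int before the arithmetic. The int
       result is narrowed to short on return: this is the cast that
       either the caller or the callee has to perform.
       b >> 1 halves b (for negative b the result of >> is
       implementation-defined in C). */
    return (short)(a + (b >> 1));
}
```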
∙ The gcc compiler we used is more cautious and makes no assumptions about the range of the
argument values. This version of the compiler reduces the input arguments to the range of a
short in both the caller and the callee. It also casts the return value to a short type. Here is the
compiled code for add_v1:
• You can see that char or short type function arguments and return values
introduce extra casts.
• It is more efficient to use the int type for function arguments and return values,
even if you are only passing an 8-bit value.
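As an illustration of this advice, the same addition can be written with int arguments and an int return value. The name add_int is hypothetical and the body is a sketch, not a listing from the original notes.

```c
/* Hypothetical int-typed variant: no narrowing casts are implied
   for the arguments or the return value. */
int add_int(int a, int b)
{
    return a + (b >> 1);   /* >> on a negative b is implementation-defined */
}
```

A caller that really needs a 16-bit result can narrow once, at the point where the value is stored.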
SIGNED VERSUS UNSIGNED TYPES
[Figure: worked examples of 32-bit two's complement bit patterns, with bit positions numbered 32 down to 1. The recoverable examples are: negating 3 (invert the bits, then add 1) to give -3; y = 5 - 9, giving -4; and 4 + 8 = 12, shifted right by one place to give 6.]
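The negation step in the bit patterns above (invert every bit, then add 1) can be checked with a few lines of C. This is a sketch for illustration only: the function name negate32 is hypothetical, and unsigned 32-bit arithmetic is used so that the bit patterns are well defined.

```c
#include <stdint.h>

/* Two's complement negation on a 32-bit word: invert every bit, add 1. */
uint32_t negate32(uint32_t x)
{
    return (uint32_t)(~x + 1u);
}
```

For example, negate32(3) yields the bit pattern 0xFFFFFFFD, which is how -3 is stored in a 32-bit register.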
C Looping Structures - LOOPS WITH A FIXED NUMBER OF ITERATIONS
Consider how the compiler treats a loop with an incrementing count i++.
It takes three instructions to implement the for loop structure:
■ An ADD to increment i
■ A compare to check if i is less than 64
■ A conditional branch to continue the loop if i < 64
This is not efficient. On the ARM, a loop should only use two instructions:
■ A subtract to decrement the loop counter, which also sets the condition
code flags on the result
■ A conditional branch instruction
The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit.
Then the comparison with zero is free since the result is stored in the condition flags.
Since we are no longer using i as an array index, there is no problem in counting down rather than up.
The SUBS and BNE instructions implement the loop. Our checksum example now has the minimum of four
instructions per loop. This is much better than six for checksum_v1 and eight for checksum_v3.
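A count-down version of the checksum loop might look like the sketch below. The name checksum_countdown is hypothetical and the body is an assumption based on the description above: a fixed count of 64, a pointer walked through the data instead of an array index, and a counter that decrements to zero.

```c
/* Hypothetical count-down checksum over 64 words. */
int checksum_countdown(const int *data)
{
    unsigned int i;
    int sum = 0;

    /* i counts down to zero, so the loop test can reuse the flags
       set by the decrement; i is not used as an array index. */
    for (i = 64; i != 0; i--) {
        sum += *data++;
    }
    return sum;
}
```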
Signed and Unsigned Loop Counter
• For an unsigned loop counter i we can use either of the loop continuation conditions i!=0 or i>0.
As i can’t be negative, they are the same condition.
• For a signed loop counter, it is tempting to use the condition i>0 to continue the loop. You might expect
the compiler to generate the same two-instruction loop, but it generates one extra instruction.
The compiler is not being inefficient. It must be careful about the case when i = -0x80000000, because the
two-instruction sequence you might expect and the longer sequence the compiler actually generates give different
answers in this case. In the expected sequence, the SUBS instruction compares i with 1 and then decrements i.
Since -0x80000000 < 1, the loop terminates. In the generated sequence, we decrement i and then compare
with 0. Modulo arithmetic means that i now has the value +0x7fffffff, which is greater than zero. Thus the loop
continues for many iterations. Of course, in practice, i rarely takes the value -0x80000000.
Therefore you should use the termination condition i!=0 for signed or unsigned loop counters. It saves one instruction over
the condition i>0 for signed i.
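The two source-level forms being compared can be sketched as follows. Both function names are hypothetical, and both functions assume n >= 0; for such n they visit the same elements and differ only in the continuation condition.

```c
/* Form recommended in the text: continue while i != 0.
   Assumes n >= 0; a negative n would not terminate correctly. */
int sum_while_nonzero(const int *data, int n)
{
    int sum = 0;
    int i;

    for (i = n; i != 0; i--) {
        sum += *data++;
    }
    return sum;
}

/* Tempting form for a signed counter: continue while i > 0. */
int sum_while_positive(const int *data, int n)
{
    int sum = 0;
    int i;

    for (i = n; i > 0; i--) {
        sum += *data++;
    }
    return sum;
}
```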
LOOPS USING A VARIABLE NUMBER OF ITERATIONS
Now suppose we want our checksum routine to handle packets of arbitrary size. We pass in a variable N giving
the number of words in the data packet.
The checksum_v7 example shows how the compiler handles a for loop with a variable number of iterations N.
Note that the compiler checks that N is nonzero on entry to the function. Often this check is unnecessary since you
know that the array won’t be empty. In this case a do-while loop gives better performance and code density than a
for loop.
The do-while form of the routine removes the test for N being zero that occurs
in a for loop.
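A do-while version could be sketched like this. The name checksum_dowhile is hypothetical and the body is an assumption based on the description above. It relies on the caller guaranteeing N >= 1, which is exactly the knowledge that makes the entry test unnecessary.

```c
/* Hypothetical do-while checksum. Precondition: N >= 1. */
int checksum_dowhile(const int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *data++;
    } while (--N != 0);     /* decrement, then test against zero */

    return sum;
}
```

Calling this with N equal to 0 would wrap the counter and read far past the array, so the precondition has to hold.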
Loop Unrolling
In a decrementing loop, each iteration costs two instructions in addition to the body of the loop: a subtract to
decrement the loop counter and a conditional branch.
• You can save some of this overhead by unrolling a loop: repeating the loop body several times and
reducing the number of loop iterations by the same proportion. For example, let’s unroll our packet
checksum example four times.
There are two questions you need to ask when unrolling a loop:
■ How many times should I unroll the loop?
■ What if the number of loop iterations is not a multiple of the unroll amount? For example, what if N is not a
multiple of four in checksum_v9?
To start with the first question, only unroll loops that are important for the overall performance of the application.
Otherwise unrolling will increase the code size with little performance benefit. Unrolling may even reduce
performance by evicting more important code from the cache.
For the second question, try to arrange it so that array sizes are multiples of your unroll amount. If this isn’t
possible, then you must add extra code to take care of the leftover cases. This increases the code size a little but
keeps the performance high.
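The leftover handling can be sketched as below: a main loop unrolled four times, followed by a short loop for the remaining zero to three elements. The name checksum_unrolled is hypothetical and the code is an illustration under that assumption, not the original listing.

```c
/* Hypothetical checksum unrolled four times, for any N. */
int checksum_unrolled(const int *data, unsigned int N)
{
    int sum = 0;
    unsigned int i;

    /* Main loop: four elements per iteration. */
    for (i = N / 4; i != 0; i--) {
        sum += *data++;
        sum += *data++;
        sum += *data++;
        sum += *data++;
    }

    /* Leftover loop: the remaining N mod 4 elements. */
    for (i = N & 3; i != 0; i--) {
        sum += *data++;
    }
    return sum;
}
```

When N is known to be a multiple of four, the second loop can be dropped, which is the arrangement the text recommends.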
REGISTER ALLOCATION
❑ The compiler attempts to allocate a processor register to each local variable you use in a
C function. It will try to use the same register for different local variables if the uses of the
variables do not overlap.
❑ When there are more local variables than available registers, the compiler stores the
excess variables on the processor stack. These variables are called spilled or swapped out
variables since they are written out to memory (in a similar way virtual memory is
swapped out to disk).
❑ Spilled variables are slow to access compared to variables allocated to registers.
To implement a function efficiently, you need to:
■ Try to limit the number of local variables in the inner loop of a function to
12. The compiler should be able to allocate these to ARM registers.
FUNCTION CALLS
• The ARM Procedure Call Standard (APCS) defines how to pass function arguments
and return values in ARM registers. The more recent ARM-Thumb Procedure Call
Standard (ATPCS) covers ARM and Thumb interworking as well.
• The first four integer arguments are passed in the first four ARM registers: r0, r1, r2,
and r3. Subsequent integer arguments are placed on the full descending stack,
ascending in memory as in Figure 5.1. Function return integer values are passed in
r0.
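To make the four-register rule concrete, the sketch below contrasts a six-argument function with one that takes a single pointer to a structure. All names (SixArgs, sum6, sum6_struct) are hypothetical. Grouping arguments in a structure is one common way to stay within four register arguments; whether it pays off depends on how the caller already holds the values.

```c
/* Hypothetical argument bundle. */
typedef struct {
    int a, b, c, d, e, f;
} SixArgs;

/* Six integer arguments: by the rule above, a to d travel in r0 to r3
   and e and f are placed on the stack. */
int sum6(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}

/* One pointer argument: it fits in r0, and the callee loads the
   fields from memory as it needs them. */
int sum6_struct(const SixArgs *p)
{
    return p->a + p->b + p->c + p->d + p->e + p->f;
}
```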
POINTER ALIASING
• Two pointers are said to alias when they point to the same address.
• If you write to one pointer, it will affect the value you read from the other pointer.
• In a function, the compiler often doesn’t know which pointers can alias and which
pointers can’t.
• The compiler must be very pessimistic and assume that any write to a pointer may affect
the value read from any other pointer, which can significantly reduce code efficiency.
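A small sketch shows why the compiler has to be pessimistic. The function names timers_v1 and timers_v2 are hypothetical. In timers_v1 the compiler cannot assume that the write through timer1 leaves *step unchanged, so it has to read *step again. In timers_v2 the programmer reads *step once into a local variable, which states that aliasing is not intended.

```c
/* The write to *timer1 may change *step if the two pointers alias,
   so *step has to be loaded twice. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* *step is loaded once into a local; the two functions behave the
   same unless timer1 and step point to the same address. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int s = *step;

    *timer1 += s;
    *timer2 += s;
}
```

If timer1 and step do alias, the two versions give different results for *timer2, which is why the compiler cannot make this transformation on its own.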