UNIT-IV Basic C Data Types
Let’s understand how ARM compilers handle the basic C data types. We will see that
some of these types are more efficient to use for local variables than others. There
are also differences between the addressing modes available when loading and
storing data of each type.
ARM processors have 32-bit registers and 32-bit data processing operations.
The ARM architecture is a RISC load/store architecture. In other words you
must load values from memory into registers before acting on them. There are no
arithmetic or logical instructions that manipulate values in memory directly.
Early versions of the ARM architecture (ARMv1 to ARMv3) provided hardware support
for loading and storing unsigned 8-bit and unsigned or signed 32-bit values.
Table-1:
In Table-1, loads that act on 8- or 16-bit values extend the value to 32 bits before
writing it to an ARM register. Unsigned values are zero-extended and signed values
are sign-extended.
This means that the cast of a loaded value to an int type does not cost extra
instructions. Similarly, a store of an 8- or 16-bit value selects the lowest 8 or 16 bits of
the register. The cast of an int to a smaller type does not cost extra instructions on a
store.
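For instance, in a routine like the following (a minimal sketch; the names are made up), neither cast generates any code of its own:

void copy_byte(unsigned char *dst, unsigned char *src)
{
    int value = *src;             /* the byte load already zero-extends to 32 bits */

    *dst = (unsigned char)value;  /* the byte store keeps only the lowest 8 bits   */
}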
The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores
directly, through new instructions. Since these instructions are a later addition, they
do not support as many addressing modes as the pre-ARMv4 instructions.
Finally, ARMv5 adds instruction support for 64-bit load and stores. This is available in
ARM9E and later cores.
Prior to ARMv4, ARM processors were not good at handling signed values. Therefore
ARM C compilers define char to be an unsigned 8-bit value, rather than a
signed 8-bit value as is typical in many other compilers.
Compilers armcc and gcc use the datatype mappings in Table 2 for an ARM target.
Table-2:
The exceptional case for type char is worth noting as it can cause problems when you
are porting code from another processor architecture. A common example is using a
char type variable i as a loop counter, with loop continuation condition i ≥ 0.
As i is unsigned for the ARM compilers, the loop will never terminate. Fortunately
armcc produces a warning in this situation: unsigned comparison with 0. Compilers
also provide an override switch to make char signed.
For example, the command line option -fsigned-char will make char signed on gcc.
The command line option -zc will have the same effect with armcc.
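To make the pitfall concrete, ported code of the following shape loops forever with the default unsigned char (a sketch; the function name is made up):

void clear_table(int *table)
{
    char i;                     /* unsigned 8-bit for armcc and gcc ARM targets */

    for (i = 63; i >= 0; i--)   /* i >= 0 is always true when i is unsigned     */
    {
        table[i] = 0;           /* i wraps from 0 back to 255: infinite loop    */
    }
}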
ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data.
However, most ARM data processing operations are 32-bit only. For this reason, we
should use a 32-bit datatype, int or long, for local variables wherever
possible.
Avoid using char and short as local variable types, even if you are manipulating
an 8- or 16-bit value. The one exception is when you want wrap-around to occur. If
you require modulo arithmetic of the form 255 + 1 = 0, then use the char
type.
The following code checksums a data packet containing 64 words. It shows why you
should avoid using char for local variables.
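The listing is not reproduced here; the routine in question is a 64-word checksum with a char loop counter, along these lines (a reconstruction of what the text calls checksum_v1):

int checksum_v1(int *data)
{
    char i;                     /* unsigned 8-bit on the ARM compilers */
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}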
At first sight it looks as though declaring i as a char is efficient. You may be thinking
that a char uses less register space or less space on the ARM stack than an int. On the
ARM, both these assumptions are wrong. All ARM registers are 32-bit and all stack
entries are at least 32-bit. Furthermore, to implement the i++ exactly, the compiler
must account for the case when i = 255. Any attempt to increment 255 should
produce the answer 0.
Consider the compiler output for this function. We’ve added labels and comments to
make the assembly clear.
Now compare this to the compiler output where instead we declare i as an unsigned
int.
In the first case, the compiler inserts an extra AND instruction to reduce i to the range
0 to 255 before the comparison with 64. This instruction disappears in the second
case.
Next, suppose the data packet contains 16-bit values and we need a 16-bit checksum.
It is tempting to write the following C code:
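The code in question declares both the data and the accumulator as short, roughly as follows (a reconstruction of what the text calls checksum_v3):

short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum = (short)(sum + data[i]);   /* explicit narrowing cast on every iteration */
    }
    return sum;
}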
You may wonder why the for loop body doesn’t contain the code
sum += data[i];
With armcc this code will produce a warning if you enable implicit narrowing cast
warnings using the compiler switch -W+n. The expression sum + data[i] is an integer
and so can only be assigned to a short using an (implicit or explicit) narrowing cast. As
you can see in the following assembly output, the compiler must insert extra
instructions to implement the narrowing cast:
The loop is now three instructions longer than the loop for example checksum_v2
earlier! There are two reasons for the extra instructions:
1) The LDRH instruction does not allow for a shifted address offset as the LDR
instruction did in checksum_v2. Therefore the first ADD in the loop calculates the
address of item i in the array. The LDRH then loads from this address with no offset. LDRH
has fewer addressing modes than LDR as it was a later addition to the ARM instruction
set. (See Table-1.)
2) The cast reducing total + array[i] to a short requires two MOV instructions. The
compiler shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The
shift right is a sign-extending shift so it replicates the sign bit to fill the upper 16 bits.
We can avoid the second problem by using an int type variable to hold the partial
sum.
However, the first problem is a new issue. We can solve it by accessing the array by
incrementing the pointer data rather than using an index as in data[i]. This is efficient
regardless of array type size or element size. All ARM load and store instructions have
a post increment addressing mode.
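Putting both fixes together gives a routine of roughly the following shape, with an int accumulator and a post-incremented pointer (the function name is illustrative):

short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;                 /* 32-bit accumulator: no cast inside the loop */

    for (i = 0; i < 64; i++)
    {
        sum += *(data++);        /* pointer post-increment instead of data[i]   */
    }
    return (short)sum;           /* a single narrowing cast at the end          */
}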
Function Argument Types:
We saw in Local Variable Types that converting local variables from types char or
short to type int increases performance and reduces code size. The same holds for
function arguments.
Consider the following simple function, which adds two 16-bit values, halving the
second, and returns a 16-bit sum:
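The function in question, which the text calls add_v1, is along these lines (a reconstruction):

short add_v1(short a, short b)
{
    return a + (b >> 1);
}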
This function is a little artificial, but it is a useful test case to illustrate the problems
faced by the compiler.
The input values a, b, and the return value will be passed in 32-bit ARM registers.
Should the compiler assume that these 32-bit values are in the range of a short type,
that is, −32,768 to +32,767?
Or should the compiler force values to be in this range by sign-extending the lowest
16 bits to fill the 32-bit register?
The compiler must make compatible decisions for the function caller and callee. Either
the caller or callee must perform the cast to a short type.
We say that function arguments are passed wide if they are not reduced to the range
of the type (for example, not reduced to the range of a short) and narrow if they are
reduced to the range of the type.
You can tell which decision the compiler has made by looking at the assembly
output for add_v1.
If the compiler passes arguments wide, then the callee must reduce function
arguments to the correct range.
If the compiler passes arguments narrow, then the caller must reduce the
range.
If the compiler returns values wide, then the caller must reduce the return
value to the correct range.
If the compiler returns values narrow, then the callee must reduce the range
before returning the value.
For armcc in ADS (ARM Developer Suite), function arguments are passed narrow and
values returned narrow. In other words, the caller casts argument values and the
callee casts return values. The compiler uses the ANSI prototype of the function to
determine the datatypes of the function arguments.
The armcc output for add_v1 shows that the compiler casts the return value to a short
type, but does not cast the input values. It assumes that the caller has already
ensured that the 32-bit values r0 and r1 are in the range of the short type. This
shows narrow passing of arguments and return value.
The armcc output adds the two values (a and b) and then narrows the result to a
16-bit signed integer. Let's break it down step by step:
ADD r0, r0, r1, ASR #1: This instruction adds the value in register r1 to the value in
register r0 after right-shifting the value in r1 by 1 bit (ASR #1). This effectively
computes (int)a + ((int)b >> 1) and stores the result in r0.
MOV r0, r0, LSL #16: This shifts the value in r0 left by 16 bits, effectively multiplying it
by 2^16.
MOV r0, r0, ASR #16: This shifts the value in r0 right by 16 bits, effectively converting
it to a 16-bit signed integer. This operation is equivalent to sign-extending the lower
16 bits to fill the upper 16 bits.
MOV pc, r14: This moves the value of register r14 (link register lr on ARM architecture,
typically used to store the return address) into the program counter pc, effectively
returning from the subroutine or function.
Overall, this code takes two 16-bit values (a and b), adds them together with b shifted
right by 1 bit, and then converts the 32-bit result to a 16-bit signed integer,
which is returned as the result.
The gcc compiler makes no assumptions about the range of argument values. This
version of the compiler reduces the input arguments to the range of a short in both
the caller and the callee. It also casts the return value to a short type.
Let's break down the gcc output for add_v1 step by step:
MOV r0, r0, LSL #16: This instruction shifts the value in register r0 left by 16 bits
(effectively multiplying it by 2^16) and stores the result back in r0. This operation is
equivalent to r0 = r0 << 16.
MOV r1, r1, LSL #16: Similar to the first instruction, this shifts the value in register r1
left by 16 bits and stores the result back in r1.
MOV r1, r1, ASR #17: This arithmetic shift right by 17 undoes the left shift with sign
extension and at the same time halves the value, so r1 now holds (short)b >> 1.
ADD r1, r1, r0, ASR #16: This adds r0, arithmetically shifted right by 16 bits, to r1.
The ASR #16 undoes the earlier left shift of r0 with sign extension, recovering
(short)a, so the instruction computes (short)a + ((short)b >> 1) and stores the result
in r1.
MOV r1, r1, LSL #16: This shifts the result in r1 left by 16 bits.
MOV r0, r1, ASR #16: This shifts it back right by 16 bits with sign extension. Together
the two shifts narrow the 32-bit sum to a signed 16-bit value (the cast to short) and
place it in r0, the return register.
MOV pc, lr: This moves the value of the link register lr (which typically contains the
return address) into the program counter pc, effectively returning from the subroutine
or function.
Overall, the gcc output narrows both arguments to the range of a short in the callee,
computes (short)a + ((short)b >> 1), and then narrows the result to a short before
returning it in r0.
Whatever the merits of different narrow and wide calling protocols, you can see that
char or short type function arguments and return values introduce extra casts. These
increase code size and decrease performance. It is more efficient to use the int type
for function arguments and return values, even if you are only passing an 8-bit value.
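For instance, an int-argument version of the earlier add function might look like the following (the name is illustrative). The single explicit cast keeps the 16-bit behaviour while leaving the argument passing free of hidden casts:

int add_v2(int a, int b)
{
    return (short)(a + (b >> 1));
}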
C Looping Structures
This section looks at the most efficient ways to code for and while loops on the ARM.
We start by looking at loops with a fixed number of iterations and then move on to
loops with a variable number of iterations. Finally we look at loop unrolling.
What is the most efficient way to write a for loop on the ARM? Let’s return to our
checksum example and look at the looping structure.
Here is the last version of the 64-word packet checksum routine we studied in the
Local Variable Types section. This shows how the compiler treats a loop with incrementing count
i++.
This compiles to
The key point is that the loop counter should count down to zero rather than counting
up to some arbitrary limit. Then the comparison with zero is free since the result is
stored in the condition flags. Since we are no longer using i as an array index, there is
no problem in counting down rather than up.
Example-2:
This example shows the improvement if we switch to a decrementing loop rather than
an incrementing loop.
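A routine of the following shape corresponds to the assembly explained below (the function name is illustrative, following the text's numbering):

int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--)    /* count down to zero */
    {
        sum += *(data++);
    }
    return sum;
}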
Explanation of code:
This assembly code calculates the checksum of a block of data stored in memory.
Here's a breakdown of each instruction:
MOV r2, r0: This copies r0, the first function argument (the address of the data
block), into r2.
MOV r0, #0: This initializes r0 to 0; it will hold the running sum of the data words.
MOV r1, #0x40: This initializes r1 to 64, the number of words to process.
LDR r3, [r2], #4: This loads a 4-byte word from the memory address pointed to by r2
into r3, and then increments r2 by 4.
SUBS r1, r1, #1: This decrements r1 by 1 and sets the flags on the result, which the
branch at the end of the loop uses.
ADD r0, r3, r0: This adds the value in r3 to the sum stored in r0.
BNE: This branches back to the start of the loop while r1 is nonzero, using the flags
set by the SUBS.
MOV pc, r14: This moves the return address stored in r14 into the program counter
(pc), returning from the subroutine with the sum in r0.
Overall, this code loops through 64 words of data, adding each word to the sum stored
in r0, and then returns the sum as the checksum.
The SUBS and BNE instructions implement the loop. Our checksum example now has
the minimum number of four instructions per loop. This is much better than six for
checksum_v1 and eight for checksum_v3.
Signed and unsigned loop counter:
For an unsigned loop counter i we can use either of the loop continuation conditions
i != 0 or i > 0. As i can’t be negative, they are the same condition.
For a signed loop counter, it is tempting to use the condition i>0 to continue the loop.
You might expect the compiler to generate the following two instructions to
implement the loop:
In fact, the compiler generates a decrement, a separate compare with zero, and then
the conditional branch. The compiler is not being inefficient. It must be careful about
the case when i = -0x80000000, because the two code sequences generate different
answers in this case.
For the expected two-instruction sequence, the SUBS instruction compares i with 1 and
then decrements i. Since -0x80000000 < 1, the loop terminates.
For the sequence the compiler actually generates, we decrement i first and then
compare with 0. Modulo arithmetic means that i now has the value +0x7fffffff, which is
greater than zero, so the loop continues for many more iterations.
Of course, in practice, i rarely takes the value -0x80000000. The compiler can’t
usually determine this, especially if the loop starts with a variable number of
iterations.
Therefore you should use the termination condition i!=0 for signed or unsigned loop
counters. It saves one instruction over the condition i>0 for signed i.
Now suppose we want our checksum routine to handle packets of arbitrary size. We
pass in a variable N giving the number of words in the data packet.
Using the lessons from the last section we count down until N = 0 and don’t require
an extra loop counter i.
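A reconstruction of what the text calls checksum_v7:

int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--)          /* N itself is the loop counter */
    {
        sum += *(data++);
    }
    return sum;
}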
The checksum_v7 example shows how the compiler handles a for loop with a variable
number of iterations N.
Notice that the compiler checks that N is nonzero on entry to the function. Often this
check is unnecessary since you know that the array won’t be empty. In this case a do-
while loop gives better performance and code density than a for loop.
Explanation of code:
This assembly code calculates the checksum of a block of data stored in memory.
Here's a breakdown of each instruction:
MOV r2, #0: This initializes r2 to 0; it will hold the running sum of the data words.
CMP r1, #0: This compares N, the number of words to process (passed in r1), with 0.
A conditional branch after the compare skips the loop entirely when N is zero; this is
the test made on entry to the function.
LDR r3, [r0], #4: This loads a 4-byte word from the memory address pointed to by r0
into r3, and then increments r0 by 4.
SUBS r1, r1, #1: This decrements r1 by 1 and sets the flags on the result, which the
branch at the end of the loop uses.
ADD r2, r3, r2: This adds the value in r3 to the sum stored in r2.
BNE: This branches back to the start of the loop while r1 is nonzero, using the flags
set by the SUBS.
MOV r0, r2: This moves the sum stored in r2 into r0, the register used for the return
value.
MOV pc, r14: This moves the return address stored in r14 into the program counter
(pc), returning from the subroutine with the checksum in r0.
Overall, this code sums the N words of the data packet and returns the result as the
checksum.
Using do-while loop:
This example shows how to use a do-while loop to remove the test for N being zero
that occurs in a for loop.
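The do-while version is along these lines (a reconstruction; the name follows the text's numbering and the code assumes N is nonzero on entry):

int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);
    } while (--N != 0);          /* no test of N before the first iteration */

    return sum;
}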
Compare this with the output for checksum_v7 to see the two-cycle saving.
Register Allocation:
You can use 14 of the 16 visible ARM registers to hold general-purpose data. The
other two registers are the stack pointer r13 and the program counter r15.
For a function to be ATPCS compliant it must preserve the values of the callee-saved
registers r4 to r11 for its caller.
ATPCS also specifies that the stack should be eight-byte aligned; therefore you must
preserve this alignment if calling subroutines.
Use the following template for optimized assembly routines requiring many
registers:
Our only purpose in stacking r12 is to keep the stack eight-byte aligned. You need not
stack r12 if your routine doesn’t call other ATPCS routines.
For ARMv5 and above you can use the preceding template even when being called
from Thumb code. If your routine may be called from Thumb code on an ARMv4T
processor, then modify the template as follows:
In this section we look at how best to allocate variables to register numbers for
register intensive tasks, how to use more than 14 local variables, and how to make
the best use of the 14 available registers.
When you write an assembly routine, it is best to start by using names for the
variables, rather than explicit register numbers. This allows you to change the
allocation of variables to register numbers easily. You can even use different register
names for the same physical register number when their use doesn’t overlap. Register
names increase the clarity and readability of optimized code.
For the most part ARM operations are orthogonal with respect to register number. In
other words, specific register numbers do not have specific roles. If you swap all
occurrences of two registers Ra and Rb in a routine, the function of the routine does
not change.
However, there are several cases where the physical number of the register is
important:
■ Argument registers. The ATPCS convention defines that the first four arguments to
a function are placed in registers r0 to r3. Further arguments are placed on the stack.
The return value must be placed in r0.
■ Registers used in a load or store multiple. Load and store multiple instructions LDM
and
STM operate on a list of registers in order of ascending register number. If r0 and r1
appear in the register list, then the processor will always load or store r0 using a lower
address than r1 and so on.
■ Load and store double word. The LDRD and STRD instructions introduced in ARMv5E
operate on a pair of registers with sequential register numbers, Rd and Rd + 1.
Furthermore, Rd must be an even register number.
For an example of how to allocate registers when writing assembly, suppose we want
to shift an array of N bits upwards in memory by k bits. For simplicity assume that N is
large and a multiple of 256. Also assume that 0 ≤ k < 32 and that the input and
output pointers are word aligned. This type of operation is common in dealing with the
arithmetic of multiple-precision numbers where we want to multiply by 2^k. It is also
useful to block copy from one bit or byte alignment to a different bit or byte
alignment. For example, the C library function memcpy can use the routine to copy an
array of bytes using only word accesses.
The C routine shift_bits implements the simple k-bit shift of N bits of data. It returns
the k bits remaining following the shift.
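The listing is not reproduced here; a C sketch of shift_bits under the stated assumptions (word-aligned pointers, N a nonzero multiple of 256, 0 ≤ k < 32) is:

unsigned int shift_bits(unsigned int *out, unsigned int *in,
                        unsigned int N, unsigned int k)
{
    unsigned int carry = 0;          /* bits carried in from the previous word     */
    unsigned int x;

    do
    {
        x = *in++;
        *out++ = (x << k) | carry;   /* shift up and fill in the carried-in bits   */
        carry = x >> (32 - k);       /* top k bits become the next word's carry;
                                        note that a shift by 32 (the k == 0 case)
                                        is undefined in strict C                    */
        N -= 32;                     /* 32 bits of the input consumed               */
    } while (N != 0);

    return carry;                    /* the k bits shifted out of the final word    */
}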
The obvious way to improve efficiency is to unroll the loop to process eight words
(256 bits) per iteration, so that we can use load and store multiple instructions to
move eight words at a time. Before thinking about register numbers, we write the
following assembly code:
This ARM assembly code shifts a block of data upwards by k bits. Each output word is
the corresponding input word shifted left by k bits, with its low bits filled from the
bits shifted out of the previous word; the bits shifted out of the final word are
returned in the carry variable at the end of the operation. Step-by-step explanation
of the code:
1) STMFD sp!, {r4-r11, lr}: Save registers r4 to r11 and the link register lr onto the
stack.
2) RSB kr, k, #32: Compute kr = 32 - k, where k is the number of bits to shift.
3) MOV carry, #0: Initialize carry to 0.
4) loop: Start of the loop.
5) LDMIA in!, {x_0-x_7}: Load 8 words from the input memory address in into
registers x_0 to x_7.
6) ORR y_0, carry, x_0, LSL k: Shift x_0 left by k bits, OR in the carry bits from the
previous word, and store the result in y_0.
7) MOV carry, x_0, LSR kr: Shift x_0 right by kr = 32 - k bits, so that carry now holds
the top k bits of x_0, ready to fill the low bits of the next output word.
8) Repeat steps 6 and 7 for x_1 to x_7, storing the results in y_1 to y_7.
9) STMIA out!, {y_0-y_7}: Store the 8 modified words y_0 to y_7 to the output memory
address out.
10) SUBS N, N, #256: Subtract 256 (8 words × 32 bits) from N, the number of bits
still to be processed, setting the flags on the result.
11) BNE loop: Branch back to loop if N is not zero.
12) MOV r0, carry: Move the final carry value to register r0.
13) LDMFD sp!, {r4-r11, pc}: Restore registers r4 to r11 and the program counter pc
from the stack and return.
In this example eight words are processed per iteration, each word 32 bits wide, so
8 × 32 = 256 bits are handled per pass. A larger block, say 512 bits, simply takes
correspondingly more iterations, since N is reduced by 256 on each pass round the loop.
Register Allocation:
Now to the register allocation. So that the input arguments do not have to be moved
between registers, we can immediately assign them to the argument registers r0 to r3
in which they arrive.
For the load multiple to work correctly, we must assign x0 through x7 to sequentially
increasing register numbers, and similarly for y0 through y7. Notice that we finish with
x0 before starting with y1. In general, we can assign xn to the same register as yn+1.
Therefore, assign
We are nearly finished, but there is a problem. There are two remaining variables
carry and kr, but only one remaining free register lr. There are several possible ways
we can proceed when we run out of registers:
■ Use the stack to store the least-used values to free up more registers. In this case
we could store the loop counter N on the stack.
■ Alter the code implementation to free up more registers. This is the solution we
consider in the following text.
This assembly shows our final shift_bits routine. It uses all 14 available ARM registers.
Using More than 14 Local Variables
If you need more than 14 local 32-bit variables in a routine, then you must store some
variables on the stack. The standard procedure is to work outwards from the
innermost loop of the algorithm, since the innermost loop has the greatest
performance impact.
This example shows three nested loops, each loop requiring state information
inherited from the loop surrounding it.
This example shows how you can use the ARM assembler directives MAP (alias ^) and
FIELD (alias #) to define and allocate space for variables and arrays on the processor
stack. The directives perform a similar function to the structure operator in C.
Making the Most of Available Registers
On a load-store architecture such as the ARM, it is more efficient to access values held
in registers than values held in memory. There are several tricks you can use to fit
several sub-32-bit variables into a single 32-bit register and thus reduce
code size and increase performance. This section presents examples showing how you
can pack multiple variables into a single ARM register.
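As an illustration of the idea (the variable name indinc anticipates the discussion below; everything else here is made up), a 16-bit loop index and a 16-bit increment can share a single 32-bit variable:

unsigned int next_index(unsigned int indinc)
{
    /* index lives in the top 16 bits of indinc, increment in the bottom 16 bits */
    indinc += indinc << 16;   /* index += increment in a single add; the carry
                                 out of the top 16 bits is discarded             */
    return indinc;
}

For example, packing index = 0xFFF0 and increment = 0x0020 gives indinc = 0xFFF00020; after the update the top half holds 0x0010, which is exactly the 16-bit wrap-around of 0xFFF0 + 0x0020.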
Note that if index and increment are 16-bit values, then putting index in the top 16
bits of indinc correctly implements 16-bit-wrap-around. In other words, index = (short)
(index + increment). This can be useful if you are using a buffer where you want to
wrap from the end back to the beginning (often known as a circular buffer).
Conditional Executions:
The processor core can conditionally execute most ARM instructions. This conditional
execution is based on one of 15 condition codes. If you don’t specify a condition, the
assembler defaults to the execute always condition (AL). The other 14 conditions split
into seven pairs of complements. The conditions depend on the four condition code
flags N, Z, C, V stored in the cpsr register.
By default, ARM instructions do not update the N, Z, C, V flags in the ARM cpsr. For
most instructions, to update these flags you append an S suffix to the instruction
mnemonic. Exceptions to this are comparison instructions that do not write to a
destination register.
Their sole purpose is to update the flags and so they don’t require the S suffix. By
combining conditional execution and conditional setting of the flags, you can
implement simple if statements without any need for branches. This improves
efficiency since branches can take many cycles and also reduces code size.
if (i<10)
{
c = i + ‘0’;
}
else
{
c = i + ‘A’-10;
}
We can write this in assembly using conditional execution rather than conditional
branches:
CMP i, #10
ADDLO c, i, #‘0’
ADDHS c, i, #‘A’-10
The sequence works since the first ADD does not change the condition codes.
The second ADD is still conditional on the result of the compare.
Conditional execution is even more powerful for cascading conditions.
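The test being implemented is of roughly this C shape (a sketch; the variable names follow the explanation below):

if ((c == 'a') || (c == 'e') || (c == 'i') || (c == 'o') || (c == 'u'))
{
    vowel++;
}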
The assembly implementation, explained step by step below, checks whether the character held in register c
is a vowel ('a', 'e', 'i', 'o', or 'u') using a cascade of conditional TEQ instructions.
Step by Step explanation:
1) TEQ c, #'a': This instruction compares the value in register c with the ASCII value of 'a'. It sets the
flags based on the result of the comparison.
2) TEQNE c, #'e': This instruction executes only if the previous comparison did not match (Z clear), that is,
only if the character is not 'a'. It then compares the character with 'e' and updates the flags.
3) TEQNE c, #'i': Similar to the previous instruction, this checks if the character is not equal to 'i'.
4) TEQNE c, #'o': This checks if the character is not equal to 'o'.
5) TEQNE c, #'u': This checks if the character is not equal to 'u'.
6) ADDEQ vowel, vowel, #1: This adds 1 to vowel only if the Z flag is set, that is, only if one of the
preceding comparisons found that the character is 'a', 'e', 'i', 'o', or 'u'.
As soon as one of the TEQ comparisons detects a match, the Z flag is set in the cpsr. The following
TEQNE instructions have no effect as they are conditional on Z = 0.
The next instruction to have effect is the ADDEQ that increments vowel. You can use this method
whenever all the comparisons in the if statement are of the same type.
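Conditional execution also handles range comparisons well. For instance, counting characters that are letters involves two ranges; the C being compiled is along these lines (a sketch; the variable names follow the explanation below):

if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
{
    letter++;
}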
To implement this efficiently, we can use an addition or subtraction to move each range to the form 0 ≤
c ≤ limit . Then we use unsigned comparisons to detect this range and conditional comparisons to
chain together ranges. The following assembly implements this efficiently:
1) SUB temp, c, #'A': This subtracts the ASCII value of 'A' from the character in c and stores the result in
temp, so that uppercase letters map onto the range 0 ('A') to 25 ('Z').
2) CMP temp, #'Z'-'A': This compares temp with 'Z'-'A' (25). An unsigned lower-or-same result means the
character lies in the range 'A' to 'Z', that is, it is an uppercase letter.
3) SUBHI temp, c, #'a': This executes only if the previous comparison gave an unsigned higher result, that
is, only if the character is not an uppercase letter. It subtracts the ASCII value of 'a' from c, mapping
lowercase letters onto the range 0 ('a') to 25 ('z').
4) CMPHI temp, #'z'-'a': Again conditional on the higher result, this compares temp with 'z'-'a' (25) to test
whether the character lies in the range 'a' to 'z'.
5) ADDLS letter, letter, #1: This adds 1 to letter only if the last comparison that set the flags gave an
unsigned lower-or-same result. That is the case exactly when the character fell into one of the two ranges
('A' to 'Z' or 'a' to 'z'), so letter counts only letters.
Note that the logical operations AND and OR are related by the standard logical relations (De Morgan's
laws): !(a && b) is equivalent to !a || !b, and !(a || b) is equivalent to !a && !b.
You can invert logical expressions involving OR to get an expression involving AND, which can often be
useful in simplifying or rearranging logical expressions.
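As a small illustration (the variable names are made up), the inverted form of the two-range letter test from the previous example can be written either way, and De Morgan's laws guarantee that the two forms agree:

/* "not a letter", written as the negation of an OR of two range tests */
int not_letter_v1 = !((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));

/* the same test rewritten as an AND of the two inverted range tests */
int not_letter_v2 = (c < 'A' || c > 'Z') && (c < 'a' || c > 'z');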