Module 3 Notes
MODULE-3
In Table 3.1, loads and stores that act on 8- or 16-bit values extend the value to 32 bits
before writing to an ARM register. Unsigned values are zero-extended, and signed
values are sign-extended. This means that casting a loaded value to an int type
costs no extra instructions.
The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores
directly, through new instructions. Because these instructions are a later addition,
they do not support all of the addressing modes of the pre-ARMv4 instructions.
The compilers armcc and gcc use these datatype mappings for an ARM target.
ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data.
However, most ARM data processing operations are 32-bit only. For this reason, we
should use a 32-bit datatype, int or long, for local variables wherever possible and
avoid using char and short as local variable types. The one exception is when we
want wrap-around (modulo) arithmetic; for example, a counter that counts modulo
256 can use an unsigned char type.
Example: The following code checksums a data packet containing 64 words. It
shows why we should avoid using char for local variables: declaring the loop
counter i as a char forces the compiler to keep i in the range 0 to 255.
int checksum_v1(int *data)
{
char i;
int sum = 0;
for (i = 0; i < 64; i++)
{
sum += data[i];
}
return sum;
}
The compiler output for the same function with i declared as an unsigned int is
given below.
checksum_v2
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v2_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v2_loop ; if (i<64) loop
MOV pc, r14 ; return sum
In the first case, the compiler inserts an extra AND instruction to reduce i to the
range 0 to 255 before the comparison with 64. This instruction disappears in the
second case.
Case 2: Now consider what happens when the local variables match the size of the
data, with both the loop counter and the sum declared using short:
short checksum_v3(short *data)
{
unsigned int i;
short sum = 0;
for (i = 0; i < 64; i++)
{
sum += data[i];
}
return sum;
}
The expression sum + data[i] is an integer and so can only be assigned to a short
using an (implicit or explicit) narrowing cast, sum = (short)(sum + data[i]);
With armcc this code will produce a warning if you enable implicit narrowing cast
warnings.
checksum_v3
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v3_loop
ADD r3,r2,r1,LSL #1 ; r3 = &data[i]
LDRH r3,[r3,#0] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; r0 = sum + r3
MOV r0,r0,LSL #16
MOV r0,r0,ASR #16 ; sum = (short)r0
BCC checksum_v3_loop ; if (i<64) goto loop
MOV pc, r14 ; return sum
The loop is now three instructions longer than the loop for example checksum_v2
earlier! There are two reasons for the extra instructions:
⚫ The LDRH instruction does not allow a shifted address offset, as the
LDR instruction in checksum_v2 did. Therefore the first ADD in the loop calculates
the address of element i in the array, and the LDRH loads from that address with no offset.
LDRH has fewer addressing modes than LDR as it was a later addition to the ARM
instruction set.
⚫ The cast reducing sum + data[i] to a short requires two MOV
instructions. The compiler shifts left by 16 and then right by 16 to implement a 16-
bit sign extend. The shift right is a sign-extending shift, so it replicates the sign bit
to fill the upper 16 bits.
We can avoid the second problem by using an int type variable to hold the partial
sum. We only reduce the sum to a short type at the function exit.
The first problem is a new issue. We can solve it by accessing the array by
incrementing the pointer data rather than using an index as in data[i]. This is
efficient regardless of array type or element size, because all ARM load and store
instructions have a post-increment addressing mode.
Case 3: The checksum_v4 code fixes all the problems we have discussed in this
section. It uses int type local variables to avoid unnecessary casts, and it
increments the pointer data instead of using an index offset data[i].
short checksum_v4(short *data)
{
unsigned int i;
int sum=0;
for (i=0; i<64; i++)
{
sum += *(data++);
}
return (short)sum;
}
The compiler produces the following output. Three instructions have been
removed from the inside loop, saving three cycles per loop compared to
checksum_v3
checksum_v4
MOV r2,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v4_loop
LDRSH r3,[r0],#2 ; r3 = *(data++)
ADD r1,r1,#1 ; r1 = i+1
CMP r1,#0x40 ; compare i, 64
ADD r2,r3,r2 ; sum += r3
BCC checksum_v4_loop ; if (i<64) goto loop
MOV r0,r2,LSL #16
MOV r0,r0,ASR #16 ; sum = (short)sum
MOV pc, r14 ; return sum
Narrow function arguments and return values declared as char or short force the
compiler to insert extra casts, which increase code size and decrease performance.
Therefore it is more efficient to use the int type for function arguments and return
values, even if you are only passing an 8-bit value.
Example: Consider a simple function that averages two integers:
int average_v1(int a, int b)
{
return (a + b) / 2;
}
This compiles to
average_v1
ADD r0,r0,r1 ; r0=a+b
ADD r0,r0,r0,LSR #31 ; if (r0<0) r0++
MOV r0,r0,ASR #1 ; r0 = r0 >> 1
MOV pc,r14 ; return r0
The compiler adds one to the sum before shifting right if the sum is
negative. In other words, it replaces x/2 by the expression:
(x < 0) ? ((x + 1) >> 1) : (x >> 1)
Summary:
The Efficient Use of C Types
⚫ For local variables held in registers, we shouldn't use a char or short type
unless 8-bit or 16-bit modular arithmetic is necessary. Use the signed or
unsigned int types instead. Unsigned types are faster when the code uses
divisions.
⚫ For array entries and global variables held in main memory, use the type
with the smallest size possible to hold the required data. This saves
memory footprint. The ARMv4 architecture is efficient at loading and
storing all data widths provided we traverse arrays by incrementing the
array pointer. Avoid using offsets from the base of the array with short
type arrays, as LDRH does not support this.
⚫ Use explicit casts when reading array entries or global variables into local
variables, or writing local variables out to array entries. The casts make it
clear that for fast operation we are taking a narrow width type stored in
memory and expanding it to a wider type in the registers. Switch on
implicit narrowing cast warnings in the compiler to detect implicit casts.
⚫ Avoid implicit or explicit narrowing casts in expressions because they
usually cost extra cycles. Casts on loads or stores are usually free because
the load or store instruction performs the type casting.
⚫ Avoid char and short types for function arguments or return values. Use
the int type even if the range of the parameter is smaller. This prevents
the compiler performing unnecessary casts.
Example: Consider the loop structure of the word checksum routine:
int checksum_v5(int *data)
{
unsigned int i;
int sum = 0;
for (i = 0; i < 64; i++)
{
sum += *(data++);
}
return sum;
}
This compiles to
checksum_v5
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v5_loop
LDR r3,[r2],#4 ; r3 = *(data++)
ADD r1,r1,#1 ; i++
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v5_loop ; if (i<64) goto loop
MOV pc,r14 ; return sum
It takes three instructions to implement the for loop structure:
⚫ An ADD to increment i
⚫ A compare to check if i is less than 64
⚫ A conditional branch to continue the loop if i < 64
This is not efficient. On the ARM, a loop should only use two instructions:
⚫ A subtract to decrement the loop counter, which also sets the condition
code flags on the result
⚫ A conditional branch instruction
■ What if the number of loop iterations is not a multiple of the unroll amount?
We can try to arrange it so that array sizes are multiples of our unroll
amount. If this isn’t possible, then we must add extra code to take care of
the leftover cases.This increases the code size a little but keeps the
performance high.
Summary:
If our C function needs more than four arguments, it is almost always more
efficient to group related arguments into structures and pass a structure pointer
rather than multiple arguments.
Example: The following function inserts N bytes from array data into a queue
implemented as a circular buffer. It passes the queue state in five separate
arguments:
char *queue_bytes_v1(
char *Q_start, /* Queue buffer start address */
char *Q_end, /* Queue buffer end address */
char *Q_ptr, /* Current queue pointer position */
char *data, /* Data to insert into the queue */
unsigned int N) /* Number of bytes to insert */
{
do
{
*(Q_ptr++) = *(data++);
if (Q_ptr == Q_end)
{
Q_ptr = Q_start;
}
} while (--N);
return Q_ptr;
}
This compiles to
queue_bytes_v1
STR r14,[r13,#-4]! ; save lr on the stack
LDR r12,[r13,#4] ; r12 = N
queue_v1_loop
LDRB r14,[r3],#1 ; r14 = *(data++)
STRB r14,[r2],#1 ; *(Q_ptr++) = r14
CMP r2,r1 ; if (Q_ptr == Q_end)
MOVEQ r2,r0 ; {Q_ptr = Q_start;}
SUBS r12,r12,#1 ; --N and set flags
BNE queue_v1_loop ; if (N!=0) goto loop
MOV r0,r2 ; r0 = Q_ptr
LDR pc,[r13],#4 ; return r0
The more efficient version groups the three queue pointers into a structure and
passes a structure pointer instead:
typedef struct {
char *Q_start; /* Queue buffer start address */
char *Q_end; /* Queue buffer end address */
char *Q_ptr; /* Current queue pointer position */
} Queue;
void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
char *Q_ptr = queue->Q_ptr;
char *Q_end = queue->Q_end;
do
{
*(Q_ptr++) = *(data++);
if (Q_ptr == Q_end)
{
Q_ptr = queue->Q_start;
}
} while (--N);
queue->Q_ptr = Q_ptr;
}
This compiles to
queue_bytes_v2
STR r14,[r13,#-4]! ; save lr on the stack
LDR r3,[r0,#8] ; r3 = queue->Q_ptr
LDR r14,[r0,#4] ; r14 = queue->Q_end
queue_v2_loop
LDRB r12,[r1],#1 ; r12 = *(data++)
STRB r12,[r3],#1 ; *(Q_ptr++) = r12
CMP r3,r14 ; if (Q_ptr == Q_end)
LDREQ r3,[r0,#0] ; Q_ptr = queue->Q_start
SUBS r2,r2,#1 ; --N and set flags
BNE queue_v2_loop ; if (N!=0) goto loop
STR r3,[r0,#8] ; queue->Q_ptr = r3
LDR pc,[r13],#4 ; return
The queue_bytes_v2 code is one instruction longer than queue_bytes_v1, but it is
more efficient overall. The second version has only three function arguments
rather than five, so each call requires only three register setups. This compares
with four register setups, a stack push, and a stack pull for the first version, a net
saving of two instructions in function call overhead. The caller also only needs to
assign a single register to the Queue structure pointer, rather than three registers
in the nonstructured case.
Summary
To call functions efficiently, consider the following points:
⚫ Try to restrict functions to four arguments. This will make them more
efficient to call. Use structures to group related arguments and pass
structure pointers instead of multiple arguments.
⚫ Define small functions in the same source file and before the functions
that call them. The compiler can then optimize the function call or inline
the small function.
⚫ Critical functions can be inlined using the __inline keyword.
Summary:
Avoiding Pointer Aliasing
⚫ Do not rely on the compiler to eliminate common sub-expressions
involving memory accesses. Instead create new local variables to hold
the expression. This ensures the expression is evaluated only once.
⚫ Avoid taking the address of local variables. The variable may be
inefficient to access from then on.
5.7 Structure Arrangement
Every data type has alignment requirements, mandated by the processor
architecture rather than by the language. A processor's natural word length
matches its data bus width; on a 32-bit machine the processing word size is 4 bytes.
Fig 3.2 Data Alignment in Memory
Load and store instructions are only guaranteed to load and store values with
address aligned to the size of the access width.
Table 3.4 summarizes these restrictions.
Therefore ARM compilers will automatically align the start address of a structure
to a multiple of the largest access width used within the structure (usually four or
eight bytes) and align entries within structures to their access width by inserting
padding.
Example:
struct {
char a;
int b;
char c;
short d;
} s;
For a little-endian memory system the compiler will lay this out with padding
inserted so that each field is aligned to its own size: a at offset 0 followed by three
pad bytes, b at offset 4, c at offset 8 followed by one pad byte, and d at offset 10,
for a total size of 12 bytes.
The following rules generate a structure with the elements packed for maximum
efficiency: