MC-module 3 C Compilers and Optimization (BCS402)
Module-3
C Compilers and Optimization:
Basic C Data Types, C Looping Structures, Register Allocation, Function Calls, Pointer Aliasing,
Portability Issues.
A C compiler is a software tool that translates human-readable C code into machine-readable instructions
that a computer's processor can execute.
Optimization aims to improve the performance (speed and/or memory usage) of the compiled code without
altering its functionality.
Common optimization techniques include:
1. Inlining: replaces a function call with the function's body, reducing call overhead.
2. Loop unrolling: copies the body of a loop multiple times to reduce loop-control overhead.
3. Constant folding: evaluates constant expressions at compile time instead of run time.
4. Dead code elimination: removes code that does not affect the program output.
5. Register allocation: allocates variables to CPU registers to reduce memory access time.
6. Instruction scheduling: reorders instructions to minimize CPU idle time.
7. Data flow analysis: analyzes how data flows through the program to optimize memory accesses.
8. Tail call optimization: replaces tail-recursive calls with jumps to reduce stack usage.
As a simple running example, consider a routine that clears (sets to zero) a block of memory starting at the location pointed to by data and spanning N bytes.
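A straightforward C version of such a routine might look like this (a sketch; the name memclr and the exact signature are assumptions, not taken from an original listing):

void memclr(char *data, int N)
{
    for (; N > 0; N--)
    {
        *data = 0;      /* clear one byte */
        data++;         /* move to the next byte */
    }
}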
To write efficient C code, you must be aware of areas where the C compiler has to be conservative, the
limits of the processor architecture the C compiler is mapping to, and the limits of a specific C
compiler.
ARM processors have 32-bit registers and 32-bit data processing operations. The ARM
architecture is a RISC load/store architecture.
Table 3.1 Load and store instructions by ARM architecture.

Architecture   Instruction   Action
Pre-ARMv4      LDRB          load an unsigned 8-bit value
               STRB          store a signed or unsigned 8-bit value
               LDR           load a signed or unsigned 32-bit value
               STR           store a signed or unsigned 32-bit value
ARMv4          LDRSB         load a signed 8-bit value
               LDRH          load an unsigned 16-bit value
               LDRSH         load a signed 16-bit value
               STRH          store a signed or unsigned 16-bit value
ARMv5          LDRD          load a signed or unsigned 64-bit value
               STRD          store a signed or unsigned 64-bit value
In the above table, loads that act on 8- or 16-bit values extend the value to 32 bits before writing it to an ARM register: unsigned values are zero-extended and signed values are sign-extended.
This means that casting a loaded value to an int type does not cost extra instructions.
The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores directly, through new instructions. Since these instructions are a later addition, they do not support all the addressing modes available to the pre-ARMv4 instructions.
The armcc and gcc compilers use the same data type mappings for an ARM target: char is 8 bits, short is 16 bits, int and long are 32 bits, and long long is 64 bits. Note that ARM C compilers define char to be an unsigned 8-bit value, rather than a signed 8-bit value.
Example 1: The following code checksums a data packet containing 64 words. It shows why to avoid using char for local variables.

int checksum_v1(int *data)
{
    char i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
At first sight it looks as though declaring i as a char is efficient: you might expect that a char uses less register space, or less space on the ARM stack, than an int. On the ARM, both of these assumptions are wrong.
All ARM registers are 32-bit and all stack entries are at least 32-bit. Furthermore, to implement i++ exactly, the compiler must account for the case when i = 255: any attempt to increment 255 must produce the answer 0.
Case 1: The compiler output for this function is as given below.

checksum_v1
    MOV r2,r0             ; r2 = data
    MOV r0,#0             ; sum = 0
    MOV r1,#0             ; i = 0
checksum_v1_loop
    LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
    ADD r1,r1,#1          ; r1 = i+1
    AND r1,r1,#0xff       ; i = (char)r1
    CMP r1,#0x40          ; compare i, 64
    ADD r0,r3,r0          ; sum += r3
    BCC checksum_v1_loop  ; if (i<64) loop
    MOV pc,r14            ; return sum
Case 2: The compiler output for the same function, with i declared as unsigned int, is as given below.
checksum_v2
    MOV r2,r0             ; r2 = data
    MOV r0,#0             ; sum = 0
    MOV r1,#0             ; i = 0
checksum_v2_loop
    LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
    ADD r1,r1,#1          ; r1 = i+1
    CMP r1,#0x40          ; compare i, 64
    ADD r0,r3,r0          ; sum += r3
    BCC checksum_v2_loop  ; if (i<64) loop
    MOV pc,r14            ; return sum
In the first case, the compiler inserts an extra AND instruction to reduce i to the range
0 to 255 before the comparison with 64. This instruction disappears in the second case.
Example 2: Now suppose the data packet contains 16-bit values and we need a 16-bit checksum, so the data and sum use the short type; i is still declared as unsigned int.
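Written in the obvious way, the 16-bit version looks like this (a sketch of the checksum_v3 variant that the discussion below refers to; the 64-entry packet size is carried over from Example 1):

short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++)
    {
        /* the result of sum + data[i] is an int, so it is narrowed
           back to short on every iteration */
        sum += data[i];
    }
    return sum;
}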
The loop is now three instructions longer than the loop for example checksum_v2 earlier! There
are two reasons for the extra instructions:
⚫ The LDRH instruction does not allow for a shifted address offset as the LDR instruction did
in checksum_v2. Therefore the first ADD in the loop calculates the address of item i in the
array. The LDRH loads from an address with no offset. LDRH has fewer addressing modes
than LDR as it was a later addition to the ARM instruction set.
⚫ The cast reducing sum + data[i] to a short requires two MOV instructions. The compiler shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is a sign-extending shift, so it replicates the sign bit to fill the upper 16 bits.
We can avoid the second problem by using an int type variable to hold the partial sum. We only reduce the sum to a short type at the function exit.
The first problem is a new issue. We can solve it by accessing the array by incrementing the
pointer data rather than using an index as in data[i]. This is efficient regardless of array type
size or element size. All ARM load and store instructions have a post increment addressing
mode.
The following version avoids both problems: it uses int type local variables to avoid unnecessary casts, and it increments the pointer data instead of using an index offset data[i].
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        /* *(data++) translates to a single ARM instruction that loads
           the data and increments the data pointer */
        sum += *(data++);
    }
    return (short)sum;
}
The compiler produces the following output. Three instructions have been removed from the inside loop, saving three cycles per loop compared to checksum_v3.

checksum_v4
    MOV r2,r0            ; r2 = data
    MOV r0,#0            ; sum = 0
    MOV r1,#0            ; i = 0
checksum_v4_loop
    LDRSH r3,[r2],#2     ; r3 = *(data++)
    ADD r1,r1,#1         ; r1 = i+1
    CMP r1,#0x40         ; compare i, 64
    ADD r0,r3,r0         ; sum += r3
    BCC checksum_v4_loop ; if (i<64) loop
    MOV r0,r0,LSL #16
    MOV r0,r0,ASR #16    ; sum = (short)sum
    MOV pc,r14           ; return sum
If our code uses addition, subtraction, and multiplication, then there is no performance difference between signed and unsigned operations. There is, however, a difference for division.
Consider the following short example that averages two integers:

int average_v1(int a, int b)
{
    return (a + b) / 2;
}

This compiles to:
average_v1
ADD r0,r0,r1 ; r0=a+b
ADD r0,r0,r0,LSR #31 ; if (r0<0) r0++
MOV r0,r0,ASR #1 ; r0 = r0 >> 1
MOV pc,r14 ; return r0
The compiler adds one to the sum before shifting right if the sum is negative. In other words, it replaces x/2 by the statement:
(x < 0) ? ((x + 1) >> 1) : (x >> 1)
It must do this because x is signed: in C on an ARM target, a divide by two is not the same as a right shift by one if x is negative.
For example, −3 >> 1 = −2 but −3/2 = −1.
Division rounds towards zero, but arithmetic right shift rounds towards −∞.
It is more efficient to use unsigned types for divisions. The compiler converts unsigned power of
two divisions directly to right shifts. For general divisions, the divide routine in the C library is
faster for unsigned types.
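For comparison, a minimal sketch with unsigned operands (the name average_v2 is an assumption): the compiler can now implement the divide by two as a single logical shift right, with no sign-correction instruction.

unsigned int average_v2(unsigned int a, unsigned int b)
{
    return (a + b) / 2;   /* compiles to an ADD followed by LSR #1 */
}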
For the efficient use of C types, the following points should be considered:
➢ For local variables held in registers, don't use a char or short type unless 8-bit or 16-bit modular arithmetic is necessary. Use the signed or unsigned int types instead.
➢ For array entries and global variables held in main memory, use the type with the smallest size possible to hold the required data. This saves memory footprint. The ARMv4 architecture is efficient at loading and storing all data widths provided you traverse arrays by incrementing the array pointer. Avoid using offsets from the base of the array with short type arrays, as LDRH does not support this.
➢ Use explicit casts when reading array entries or global variables into local variables, or when writing local variables out to array entries. The casts make it clear that, for fast operation, you are taking a narrow-width type stored in memory and expanding it to a wider type in the registers. Switch on implicit narrowing cast warnings in the compiler to detect implicit casts (see the sketch below).
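A minimal sketch of the last guideline (the function name and arguments are hypothetical): the local variables are int, the array is traversed by pointer increment, and the narrowing store is marked with an explicit cast.

void fill_buffer(short *buffer, int value)
{
    int i;

    for (i = 0; i < 64; i++)
    {
        *(buffer++) = (short)value;   /* explicit narrowing cast on the store */
    }
}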
C Looping Structures
In this section we will learn the most efficient ways to code for and while loops on the ARM. This covers loops with a fixed number of iterations, loops with a variable number of iterations, and loop unrolling.
On the ARM, a loop that counts down to zero has a loop overhead of only two instructions:
⚫ A subtract that decrements the loop counter and also sets the condition code flags on the result
⚫ A conditional branch instruction
The example below shows the improvement if we switch to a decrementing loop rather than an incrementing loop.
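A decrementing version of the checksum might be written as follows (a sketch; the name checksum_v5 and the explicit length argument N are assumptions, used for continuity with the unrolled version later):

int checksum_v5(int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--)       /* counting down: one SUBS and one branch per iteration */
    {
        sum += *(data++);
    }
    return sum;
}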
A do-while loop removes the test for N being zero that occurs at the top of a for loop, and hence gives better performance than a for loop when the loop is known to execute at least once, for example:
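If the packet is known to contain at least one word, a do-while version removes that initial test (a sketch; the name checksum_v6 is an assumption):

int checksum_v6(int *data, unsigned int N)
{
    int sum = 0;

    do                        /* assumes N >= 1 */
    {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}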
Loop unrolling: Repeating the loop body several times, and reducing the number of loop
iterations by the same proportion.
There are two questions we need to ask when unrolling a loop:
■ How many times should we unroll the loop? Only unroll loops that are important for the overall performance of the application. Otherwise unrolling will increase the code size with little performance benefit. Unrolling may even reduce performance by evicting more important code from the cache.
■ What if the number of loop iterations is not a multiple of the unroll amount? The remaining iterations must be handled separately, as in the example below.
The following code unrolls the packet checksum loop four times. The first loop processes the words in groups of four; a second loop handles the case when the number of words N is not a multiple of four, so the function handles a data packet of any size.
int checksum_v10(int *data, unsigned int N)
{
    unsigned int i;
    int sum = 0;

    for (i = N/4; i != 0; i--)
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    for (i = N & 3; i != 0; i--)
    {
        sum += *(data++);
    }
    return sum;
}
The second for loop handles the remaining cases when N is not a multiple of four. Note that
both N/4 and N&3 can be zero, so we can’t use do-while loops.
Points to remember for using looping statements efficiently:
• Use loops that count down to zero. Then the compiler does not need to allocate a register to hold the termination value, and the comparison with zero is free.
• Use unsigned loop counters by default and the continuation condition i != 0 rather than i > 0. This will ensure that the loop overhead is only two instructions.
• Use do-while loops rather than for loops when you know the loop will iterate at least once. This saves the compiler checking to see if the loop count is zero.
• Unroll important loops to reduce the loop overhead. Do not over-unroll: if the loop overhead is small as a proportion of the total, then unrolling will increase code size and hurt the performance of the cache.
Register Allocation
The compiler attempts to allocate a processor register to each local variable use in a C function.
It will try to use the same register for different local variables if the use of the variables
does not overlap. When there are more local variables than available registers, the
compiler stores the excess variables on the processor stack. These variables are
called spilled or swapped out variables since they are written out to memory (in a
similar way virtual memory is swapped out to disk). Spilled variables are slow to access
compared to variables allocated to registers.
To implement a function efficiently, minimize the number of spilled variables & ensure
that the most important and frequently accessed variables are stored in registers. C
compiler register usage
The table below shows the standard register names and usage when following the ARM-Thumb Procedure Call Standard (ATPCS), which is used in code generated by C compilers.
Provided the compiler is not using software stack checking or a frame pointer, the C compiler can use registers r0 to r12 and r14 to hold variables. It must save the callee values of r4 to r11 and r14 on the stack if it uses these registers.
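A summary of the ATPCS register usage (a reconstruction from the ATPCS convention; the alternate names are the ATPCS register aliases):

Register   Alternate   ATPCS role
r0-r3      a1-a4       Argument registers: hold the first four function arguments and the return value; scratch within a function
r4-r11     v1-v8       Register variables: the callee must preserve these values
r12        ip          Intra-procedure-call scratch register
r13        sp          Stack pointer
r14        lr          Link register (return address); may hold a variable once saved
r15        pc          Program counter

If software stack checking or a frame pointer is in use, r10 (sl) and r11 (fp) take on those special roles and are not available for variables.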
Function Calls
The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return values in ARM registers. The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and Thumb interworking as well.
The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3. Subsequent integer arguments are placed on the full descending stack, ascending in memory. Function return integer values are passed in r0.
If a C function needs more than four arguments, it is almost always more efficient to group the related arguments into structures and pass a structure pointer rather than using multiple arguments. With more than four arguments, both the caller and callee must access the stack for some of the arguments.
There are other ways of reducing function call overhead if the function is very small and
corrupts few registers (uses few local variables). Put the C function in the same C file as
the functions that will call it. The C compiler then knows the code generated for the
callee function and can make optimizations in the caller function:
• The caller function need not preserve registers that it can see the callee doesn’t
corrupt. Therefore the caller function need not save all the ATPCS corruptible
registers.
• If the callee function is very small, then the compiler can inline the code in the caller function. This removes the function call overhead completely.
• Example: insert N bytes from an array data into a queue.
• Case 1: five arguments.
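A sketch of the five-argument version (the parameter names Q_start, Q_end, Q_ptr and the circular-buffer behaviour are assumptions; the listing is illustrative rather than the original one):

char *queue_bytes_v1(
    char *Q_start,     /* start address of the queue buffer   */
    char *Q_end,       /* end address of the queue buffer     */
    char *Q_ptr,       /* current insertion point             */
    char *data,        /* bytes to insert                     */
    unsigned int N)    /* number of bytes to insert (N >= 1)  */
{
    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = Q_start;          /* wrap around the circular buffer */
        }
    } while (--N);
    return Q_ptr;                     /* return the updated insertion point */
}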
When compiled, the first four arguments are passed in registers r0 to r3 and the fifth argument is pushed on the stack before the call.
The following code creates a Queue structure and passes this to the function to reduce the number
of function arguments
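A sketch of the structured version (field names assumed as before):

typedef struct {
    char *Q_start;     /* start address of the queue buffer */
    char *Q_end;       /* end address of the queue buffer   */
    char *Q_ptr;       /* current insertion point           */
} Queue;

void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;
    char *Q_end = queue->Q_end;

    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = queue->Q_start;   /* wrap around */
        }
    } while (--N);
    queue->Q_ptr = Q_ptr;             /* write the updated pointer back once */
}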
When compiled, all three arguments are passed in registers r0 to r2, so no stack access is needed to pass arguments.
queue_bytes_v2 is one instruction longer than queue_bytes_v1, but it is more efficient overall. The second version has only three function arguments rather than five. Each call to the function requires only three register setups. This compares with four register setups, a stack push, and a stack pull for the first version. There is a net saving of two instructions in function call overhead. The caller also only needs to assign a single register to the Queue structure pointer, rather than three registers in the non-structured case.
Pointer Aliasing
Two pointers are said to alias when they point to the same address. This means that if we write through one pointer, it can affect the value we read through the other pointer. Within a function, the compiler often does not know which pointers can alias and which cannot, so it must be conservative.
Example: the following function increments two timer values by a step amount:

void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}
The compiler loads from step twice. Usually a compiler optimization called common subexpression elimination would kick in, so that *step was only evaluated once and the value reused for the second occurrence. But the compiler cannot use this optimization here: the pointers timer1 and step might alias one another, so the compiler cannot be sure that the write to *timer1 does not affect the read from *step.
In that case the second value of *step is different from the first and has the value *timer1. This forces the compiler to insert an extra load instruction.
The same problem occurs if you use structure accesses rather than direct pointer accesses. The following code also compiles inefficiently:

typedef struct { int step; } State;
typedef struct { int timer1, timer2; } Timers;

void timers_v2(State *state, Timers *timers)
{
    timers->timer1 += state->step;
    timers->timer2 += state->step;
}
Avoiding Pointer Aliasing
• Do not rely on the compiler to eliminate common subexpressions involving memory accesses. Instead, create new local variables to hold the expression; this ensures the expression is evaluated only once (see the sketch after this list).
• Avoid taking the address of local variables. The variable may be inefficient to access from then on.
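A sketch of the first guideline applied to the timer example (the name timers_v3 and the local variable step_val are assumptions):

void timers_v3(int *timer1, int *timer2, int *step)
{
    int step_val = *step;    /* read *step exactly once into a local variable */

    *timer1 += step_val;     /* the write through timer1 can no longer force */
    *timer2 += step_val;     /* the compiler to reload *step                 */
}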
Portability Issues
char type: On the ARM, char is unsigned rather than signed as on many other processors. A common problem concerns loops that use a char loop counter i and the continuation condition i >= 0; on the ARM these become infinite loops. In this situation armcc produces a warning of unsigned comparison with zero. Use a compiler option to make char signed, or change the loop counter to type int.
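A sketch of the pitfall (the function is hypothetical): on the ARM, char is unsigned, so the condition i >= 0 is always true and the loop never terminates.

void count_down(void)
{
    char i;

    for (i = 9; i >= 0; i--)   /* infinite loop when char is unsigned */
    {
        /* ... */
    }
}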
int type: Some older architectures use a 16-bit int, which may cause problems when moving to ARM's 32-bit int type, although this is rare nowadays. Note that expressions are promoted to an int type before evaluation. Therefore if i = -0x1000, the expression i == 0xF000 is true on a 16-bit machine but false on a 32-bit machine.
Unaligned data pointers: Some processors support the loading of short and int typed values from unaligned addresses. A C program may manipulate pointers directly so that they become unaligned, for example, by casting a char * to an int *. ARM architectures up to ARMv5TE do not support unaligned pointers. To detect them, run the program on an ARM with an alignment checking trap; for example, configure the ARM720T to data abort on an unaligned access.
Endian assumptions: C code may make assumptions about the endianness of a memory system, for example, by casting a char * to an int *. If the ARM is configured for the same endianness the code is expecting, then there is no issue. Otherwise, endian-dependent code sequences must be removed and replaced by endian-independent ones.
Function prototyping: The armcc compiler passes arguments narrow, that is, reduced to the range of the argument type. If functions are not prototyped correctly, then the function may return the wrong answer. Other compilers that pass arguments wide may give the correct answer even if the function prototype is incorrect. Always use ANSI prototypes.
Use of bit-fields: The layout of bits within a bit-field is implementation and endian dependent. If C code assumes that bits are laid out in a certain order, then the code is not portable.
Use of enumerations: Although enum is portable, different compilers allocate different numbers of bytes to an enum. The gcc compiler will always allocate four bytes to an enum type. The armcc compiler will only allocate one byte if the enum takes only eight-bit values. Therefore you can't cross-link code and libraries between different compilers if you use enums in an API structure.
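A sketch of why this matters for a structure shared across an API boundary (the names are hypothetical):

typedef enum { MODE_OFF, MODE_ON } Mode;

typedef struct {
    Mode mode;     /* four bytes under gcc, possibly one byte under armcc, */
    int  value;    /* so the offset of value and the struct size differ    */
} Config;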
Inline assembly: Using inline assembly in C code reduces portability between architectures.
You should separate any inline assembly into small inlined functions that can easily be
replaced. It is also useful to supply reference, plain C implementations of these functions that can be used on other architectures, where this is possible.
The volatile keyword: Use the volatile keyword on the type definitions of ARM memory-mapped peripheral locations. This keyword prevents the compiler from optimizing away the memory access. It also ensures that the compiler generates a data access of the correct type. For example, if a memory location is defined as a volatile short type, then the compiler will access it using the 16-bit load and store instructions LDRSH and STRH.
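A minimal sketch (the peripheral address 0x10000000 is hypothetical):

#define TIMER_COUNT (*(volatile short *)0x10000000)

void reset_timer(void)
{
    TIMER_COUNT = 0;           /* the compiler must emit a real 16-bit store (STRH) */
}

short read_timer(void)
{
    return TIMER_COUNT;        /* the compiler must emit a fresh 16-bit load (LDRSH) */
}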
Remove the C definition of square (int square(int i) { return i*i; }), but not the declaration (the second line), to produce a new C file main1.c. Next add an armasm assembler file square.s along the following lines (a reconstruction consistent with the description below):

        AREA |.text|, CODE, READONLY

        EXPORT square

; int square(int i)
square
        MUL r1, r0, r0   ; r1 = r0 * r0
        MOV r0, r1       ; r0 = r1
        MOV pc, lr       ; return r0

        END
The AREA directive names the area or code section that the code lives in. If non-alphanumeric characters are used in a symbol or area name, then enclose the name in vertical bars; many non-alphanumeric characters have special meanings otherwise. In the previous code we define a read-only code area called .text.
The EXPORT directive makes the symbol square available for external linking. At line six
we define the symbol square as a code label. Note that armasm treats non-indented text as a
label definition.
When square is called, the parameter passing is defined by the ARM-Thumb procedure call
standard (ATPCS). The input argument is passed in register r0, and the return value is
returned in register r0. The multiply instruction has a restriction that the destination register
must not be the same as the first argument register. Therefore we place the multiply result
into r1 and move this to r0.
The END directive marks the end of the assembly file. Comments follow a
semicolon.
Explain code optimization, profiling and cycle counting. (MQP 2024, 10M)
Code optimization refers to the process of modifying a program to improve its performance. This can involve reducing execution time, memory usage, power consumption, or other resources.
Profiling and Cycle Counting: The first stage of any optimization process is to identify the
critical routines and measure their current performance.
Profiling is the process of analyzing a program to determine which parts of the code are consuming the most resources, such as CPU time, memory, or I/O operations. Profiling helps identify performance bottlenecks and areas that could benefit from optimization. A profiler is a tool that measures the proportion of time or processing cycles spent in each subroutine. It is used to identify the most critical routines.
A cycle counter measures the number of cycles taken by a specific routine. The ARM
simulator used by the ADS1.1 debugger is called the ARMulator and provides profiling and
cycle counting features. The ARMulator profiler works by sampling the program counter pc
at regular intervals. The profiler identifies the function the pc points to and updates a hit
counter for each function it encounters. Another approach is to use the trace output of a
simulator as a source for analysis.
Develop an ALP to find the sum of the first 10 integer numbers. (MQP 2024, 10M)
AREA SUM, CODE, READONLY
ENTRY
MOV R1,#10 ; number of array elements (loop count)
LDR R2,=ARRAY ; load the starting address of the array
MOV R4,#0 ; initialise the sum to 0
NEXT LDR R3,[R2],#4 ; load the next integer into R3 and post-increment R2 by 4
ADD R4,R4,R3 ; R4 = running sum of the integers
SUBS R1,R1,#1 ; decrement the count and set the flags
BNE NEXT ; if the Z flag is not set, repeat
MOV R5,#0X40000000 ; address at which to store the result
STR R4,[R5] ; store the result at the address held in R5
STOP B STOP
ARRAY DCD 1,2,3,4,5,6,7,8,9,10
END
Questions
1. Explain with an example the different basic C data types used by the ARM compiler.
2. With a program example, explain the advantages of using int rather than char and short types for local variables and function arguments.
3. Describe with an example the different looping structures used by the ARM compiler.
4. Explain the loop unrolling concept with a suitable program example.
5. Explain loops using a variable number of iterations with a program example.
6. Explain in detail Register Allocation.
7. Explain the function call operation with a suitable program example.
8. Explain the pointer aliasing concept with a suitable program example.
Reference:
1. Andrew N. Sloss, Dominic Symes, and Chris Wright, ARM System Developer's Guide, Elsevier / Morgan Kaufmann Publishers, 2008.