MC-module 3 C Compilers and Optimization (BCS402)

The document discusses C compilers and optimization techniques, emphasizing the importance of writing efficient C code while being aware of compiler limitations and processor architecture. It outlines various optimization methods such as inlining, loop unrolling, and dead code elimination, and provides examples of how variable types impact performance on ARM processors. Additionally, it covers best practices for using data types and looping structures to enhance code efficiency.

MICROCONTROLLERS(BCS402)

Module-3
C Compilers and Optimization :
Basic C Data Types, C Looping Structures, Register Allocation, Function Calls, Pointer Aliasing,
Portability Issues.

Overview of C Compilers and Optimization:


Optimizing code takes time and reduces source code readability. Therefore, it is only worth optimizing
functions that are frequently executed and important for performance. A C compiler has to translate
a C function into assembler so that it works for all possible inputs.
To write efficient C code, we must be aware of areas where the C compiler has to be conservative, the
limits of the processor architecture the C compiler is mapping to, and the limits of a specific C compiler.
Two commonly used C compilers for ARM are:
⚫ armcc from ARM Developer Suite version 1.1 (ADS1.1).
⚫ arm-elf-gcc version 2.95.2, the ARM target for the GNU C compiler, gcc.

A C compiler is a software tool that translates human-readable C code into machine-readable instructions
that a computer's processor can execute.
Optimization aims to improve the performance (speed and/or memory usage) of the compiled code without
altering its functionality.

Here are common optimization techniques:

1. Inlining: Replaces a function call with the function's body, reducing call overhead.
2. Loop Unrolling: Copies the body of a loop multiple times to reduce loop control overhead.
3. Constant Folding: Evaluates constant expressions at compile time instead of at run time.
4. Dead Code Elimination: Removes code that doesn't affect program output.
5. Register Allocation: Allocates variables to CPU registers to reduce memory access time.
6. Instruction Scheduling: Reorders instructions to minimize CPU idle time.
7. Data Flow Analysis: Analyzes how data flows through the program to optimize memory access.
8. Tail Call Optimization: Replaces tail-recursive calls with jumps to reduce stack usage.

The following function clears (sets to zero) a block of memory starting from the location pointed to by
data and spanning N bytes.

void memclr(char *data, int N)
{
    for (; N>0; N--)
    {
        *data=0;
        data++;
    }
}
void: The function does not return any value.
memclr: The name of the function.
char *data: A pointer to the beginning of the memory block that needs to be cleared. The char
type indicates that we're dealing with memory one byte at a time.
int N: An integer representing the number of bytes to be cleared.
Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 1

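For loop initialization and condition:
for (; N>0; N--): The loop has no initialization part. It repeats while N is greater than 0 and
decrements N after each iteration.
*data = 0;: This dereferences the pointer data and sets the value at the current memory location to
0.
data++;: This increments the pointer data to point to the next byte in memory.
Initial state (N = 3):
data -> [ ? ] [ ? ] [ ? ]
Final state (N = 0), with data now pointing one byte past the cleared block:
        [ 0 ] [ 0 ] [ 0 ] <- data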


Basic C Data Types

ARM processors have 32-bit registers and 32-bit data processing operations. The ARM
architecture is a RISC load/store architecture.
Table 3.1 Load and store instructions by ARM architecture.
Architecture   Instruction   Action
Pre-ARMv4      LDRB          load an unsigned 8-bit value
               STRB          store a signed or unsigned 8-bit value
               LDR           load a signed or unsigned 32-bit value
               STR           store a signed or unsigned 32-bit value
ARMv4          LDRSB         load a signed 8-bit value
               LDRH          load an unsigned 16-bit value
               LDRSH         load a signed 16-bit value
               STRH          store a signed or unsigned 16-bit value
ARMv5          LDRD          load a signed or unsigned 64-bit value
               STRD          store a signed or unsigned 64-bit value

In Table 3.1, loads that act on 8- or 16-bit values extend the value to 32 bits before
writing to an ARM register. Unsigned values are zero-extended, and signed values are
sign-extended. This means that casting a loaded value to an int type does not cost extra
instructions.
The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores directly,
through new instructions. Since these instructions are a later addition, they do not support
as many addressing modes as the pre-ARMv4 instructions.
The armcc and gcc compilers use the same basic data type mappings for an ARM target.
Note that ARM C compilers define char to be an unsigned 8-bit value, rather than a signed
8-bit value.
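Whether plain char is signed or unsigned can be checked at compile time from limits.h. The sketch below is illustrative only: the function names char_is_unsigned and next_u8 are made up for this example and do not come from the notes. It also shows the 8-bit wrap-around (255 + 1 = 0) that a char-sized counter implies.

```c
#include <limits.h>

/* Returns 1 when plain char is unsigned on this compiler (as with armcc),
   and 0 when it is signed. CHAR_MIN is 0 exactly when char is unsigned. */
int char_is_unsigned(void)
{
    return CHAR_MIN == 0;
}

/* Hypothetical helper: increments a value with 8-bit wrap-around,
   which is what incrementing an unsigned char counter must do. */
unsigned int next_u8(unsigned int i)
{
    return (unsigned char)(i + 1);
}
```

On an ARM target with the default armcc settings, char_is_unsigned() is expected to return 1; on many desktop compilers it returns 0.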

Local variable types:


ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. However,
most ARM data processing operations are 32-bit only. For this reason, use a 32-bit data
type, int or long, for local variables wherever possible. Avoid using char and short as
local variable types, even when manipulating an 8- or 16-bit value. The one exception is
when you want wrap-around to occur. For example, if you need modulo arithmetic of the
form 255 + 1 = 0, then use the char type.

Example 1: The following code checksums a data packet containing 64 words. It shows
why to avoid using char for local variables.
int checksum_v1(int *data)
{
    char i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
At first sight it looks as though declaring i as a char is efficient: a char should use less
register space or less space on the ARM stack than an int. On the ARM, both these
assumptions are wrong.
All ARM registers are 32-bit and all stack entries are at least 32-bit. Furthermore, to
implement the i++ exactly, the compiler must account for the case when i = 255. Any
attempt to increment 255 should produce the answer 0.
Case 1: The compiler output for this function is given below.
checksum_v1
        MOV r2,r0                ; r2 = data
        MOV r0,#0                ; sum = 0
        MOV r1,#0                ; i = 0
checksum_v1_loop
        LDR r3,[r2,r1,LSL #2]    ; r3 = data[i]
        ADD r1,r1,#1             ; r1 = i+1
        AND r1,r1,#0xff          ; i = (char)r1
        CMP r1,#0x40             ; compare i, 64
        ADD r0,r3,r0             ; sum += r3
        BCC checksum_v1_loop     ; if (i<64) loop
        MOV pc,r14               ; return sum


Case 2: The compiler output for the same function with i declared as unsigned int is given
below.
checksum_v2
        MOV r2,r0                ; r2 = data
        MOV r0,#0                ; sum = 0
        MOV r1,#0                ; i = 0
checksum_v2_loop
        LDR r3,[r2,r1,LSL #2]    ; r3 = data[i]
        ADD r1,r1,#1             ; r1 = i+1
        CMP r1,#0x40             ; compare i, 64
        ADD r0,r3,r0             ; sum += r3
        BCC checksum_v2_loop     ; if (i<64) loop
        MOV pc,r14               ; return sum
In the first case, the compiler inserts an extra AND instruction to reduce i to the range
0 to 255 before the comparison with 64. This instruction disappears in the second case.
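For reference, the C source behind the second listing is not shown in these notes. A minimal sketch, assuming it differs from checksum_v1 only in the type of the loop counter, would be:

```c
/* Sketch of checksum_v2: same as checksum_v1 except that i is a
   32-bit unsigned int, so no AND #0xff masking is required. */
int checksum_v2(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
```

Both versions return the same result for any 64-word packet; only the generated loop differs.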

Example 2: Suppose the data packet contains 16-bit values and we need a 16-bit checksum. The loop counter i is declared as unsigned int.

short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum = (short)(sum + data[i]);
    }
    return sum;
}
The loop body uses an explicit cast rather than sum += data[i] because, with armcc, the
latter produces a warning if you enable implicit narrowing cast warnings using the compiler
switch -W+n. The expression sum + data[i] is an integer and so can only be assigned to a
short using an (implicit or explicit) narrowing cast.
As you can see in the following assembly output, the compiler must insert extra
instructions to implement the narrowing cast:
checksum_v3
        MOV r2,r0                ; r2 = data
        MOV r0,#0                ; sum = 0
        MOV r1,#0                ; i = 0
checksum_v3_loop
        ADD r3,r2,r1,LSL #1      ; r3 = &data[i]
        LDRH r3,[r3,#0]          ; r3 = data[i]
        ADD r1,r1,#1             ; r1 = i+1
        CMP r1,#0x40             ; compare i, 64
        ADD r0,r3,r0             ; r0 = sum + r3
        MOV r0,r0,LSL #16
        MOV r0,r0,ASR #16        ; sum = (short)r0
        BCC checksum_v3_loop     ; if (i<64) goto loop
        MOV pc,r14               ; return sum

The loop is now three instructions longer than the loop for checksum_v2 earlier. There
are two reasons for the extra instructions:
⚫ The LDRH instruction does not allow a shifted address offset as the LDR instruction did
in checksum_v2. Therefore the first ADD in the loop calculates the address of item i in the
array, and the LDRH loads from that address with no offset. LDRH has fewer addressing
modes than LDR because it was a later addition to the ARM instruction set.
⚫ The cast reducing sum + data[i] to a short requires two MOV instructions. The compiler
shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is a
sign-extending shift, so it replicates the sign bit to fill the upper 16 bits.
We can avoid the second problem by using an int type variable to hold the partial sum. We
only reduce the sum to a short type at the function exit.
The first problem is a new issue. We can solve it by accessing the array by incrementing the
pointer data rather than using an index as in data[i]. This is efficient regardless of array
size or element size. All ARM load and store instructions have a post-increment addressing
mode.

The checksum_v4 code below fixes both problems. It uses int type local variables to avoid
unnecessary casts, and it increments the pointer data instead of using an index offset data[i].
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        /* *(data++) translates to a single ARM instruction that loads
           the data and increments the data pointer. */
        sum += *(data++);
    }
    return (short)sum;
}
The compiler produces the following output. Three instructions have been removed from the
inner loop, saving three cycles per iteration compared to checksum_v3.
checksum_v4
        MOV r2,#0                ; sum = 0
        MOV r1,#0                ; i = 0
checksum_v4_loop
        LDRSH r3,[r0],#2         ; r3 = *(data++)
        ADD r1,r1,#1             ; r1 = i+1
        CMP r1,#0x40             ; compare i, 64
        ADD r2,r3,r2             ; sum += r3
        BCC checksum_v4_loop     ; if (i<64) goto loop
        MOV r0,r2,LSL #16
        MOV r0,r0,ASR #16        ; r0 = (short)sum
        MOV pc,r14               ; return r0
Function argument types:
Converting local variables from types char or short to type int increases performance and
reduces code size. The same holds for function arguments. If a function argument has a short
type, either the caller or the callee must perform the cast to a short type. In this way, char
or short type function arguments and return values introduce extra casts, which increase code
size and decrease performance. Therefore it is more efficient to use the int type for function
arguments and return values, even if you are only passing an 8-bit value.
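The difference can be seen by writing the same small function twice. This is a sketch for illustration; the function names add_v1 and add_v2 and the expression computed are assumptions chosen for the example, not listings from these notes.

```c
/* Narrow version: the result must be reduced to 16 bits before it is
   returned, which costs a narrowing cast (shift left, shift right). */
short add_v1(short a, short b)
{
    return (short)(a + (b >> 1));
}

/* Wide version: arguments and result stay in 32-bit registers,
   so no narrowing cast is needed on entry or exit. */
int add_v2(int a, int b)
{
    return a + (b >> 1);
}
```

For values that fit in 16 bits both functions return the same number, so the int version can usually replace the short version without changing behaviour.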
Signed versus Unsigned Types

If our code uses addition, subtraction, and multiplication, then there is no performance
difference between signed and unsigned operations. However, there is a difference when it
comes to division.
Consider the following short example that averages two integers:
int average_v1(int a, int b)
{
    return (a+b)/2;
}
This compiles to:
average_v1
        ADD r0,r0,r1             ; r0 = a+b
        ADD r0,r0,r0,LSR #31     ; if (r0<0) r0++
        MOV r0,r0,ASR #1         ; r0 = r0 >> 1
        MOV pc,r14               ; return r0

The compiler adds one to the sum before shifting right if the sum is negative. In
other words, it replaces x/2 by the statement:
(x<0) ? ((x+1) >> 1) : (x >> 1)
It must do this because x is signed. In C on an ARM target, a divide by two is not a right
shift if x is negative.
For example, −3 >> 1 = −2 but −3/2 = −1.
Division rounds towards zero, but arithmetic right shift rounds towards −∞.
It is more efficient to use unsigned types for divisions. The compiler converts unsigned
power-of-two divisions directly to right shifts. For general divisions, the divide routine in
the C library is faster for unsigned types.
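A minimal sketch of the unsigned alternative follows. The names average_v2 and half_signed are chosen for this example. Note that right-shifting a negative signed value is implementation-defined in C, so the sketch only relies on the division rule (truncation towards zero, which C99 guarantees).

```c
/* Unsigned average: the division by two can compile to a single
   logical shift right, with no correction for negative values. */
unsigned int average_v2(unsigned int a, unsigned int b)
{
    return (a + b) / 2;
}

/* Signed division by two truncates towards zero, so -3/2 is -1.
   This is why the compiler cannot use a plain arithmetic shift. */
int half_signed(int x)
{
    return x / 2;
}
```

As with the signed version, a + b can overflow for very large inputs; the sketch leaves that aside to keep the focus on the division.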
For efficient use of C types, keep the following points in mind:
➢ For local variables held in registers, don't use a char or short type unless 8-bit or
16-bit modular arithmetic is necessary. Use the signed or unsigned int types
instead. Unsigned types are faster for division.

➢ For array entries and global variables held in main memory, use the type with the
smallest size possible to hold the required data. This saves memory footprint.
The ARMv4 architecture is efficient at loading and storing all data widths
provided you traverse arrays by incrementing the array pointer. Avoid using
offsets from the base of the array with short type arrays, as LDRH does not
support this.

➢ Use explicit casts when reading array entries or global variables into local
variables, or writing local variables out to array entries. The casts make it clear
that, for fast operation, you are taking a narrow-width type stored in memory and
expanding it to a wider type in the registers. Switch on implicit narrowing cast
warnings in the compiler to detect implicit casts.

➢ Avoid implicit or explicit narrowing casts in expressions because they usually
cost extra cycles.

➢ Avoid char and short types for function arguments or return values. Instead use
the int type even if the range of the parameter is smaller. This prevents the
compiler performing unnecessary casts.

C Looping Structures
This section looks at the most efficient ways to code for and while loops on the ARM.
It covers loops with a fixed number of iterations, loops with a variable number of
iterations, and loop unrolling.

Loops with a Fixed Number of Iterations


This concept is explained with the 64-word checksum routine.
First consider how the compiler treats a loop with an incrementing count, i.e. i++, as in
the checksum listings above. It takes three instructions to implement the for loop structure:
⚫ An ADD to increment i
⚫ A compare to check if i is less than 64
⚫ A conditional branch to continue the loop if i < 64
This is not efficient. On the ARM, a loop should only use two instructions:
⚫ A subtract to decrement the loop counter, which also sets the condition code flags on the result
⚫ A conditional branch instruction

Switching to a decrementing loop, for example for (i=64; i!=0; i--), gives this improvement:
the SUBS and BNE instructions alone implement the loop.
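The listing for the decrementing loop is not reproduced in these notes. A minimal sketch, assuming the same 64-word packet and using the illustrative name checksum_countdown, could look like this:

```c
/* Counts down from 64 to 0. The loop test "i != 0" lets the compiler
   use the flags set by the decrement (SUBS) followed by one branch (BNE). */
int checksum_countdown(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--)
    {
        sum += *(data++);
    }
    return sum;
}
```

The result is the same as with the incrementing loop; only the loop overhead changes.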


Loops Using a Variable Number of Iterations
Example: We pass in a variable N giving the number of iterations as an argument and
count down N until N = 0, so there is no need for an extra loop counter i.
In this case a do-while loop gives better performance and code density than a for loop.
The do-while loop removes the test for N being zero that a for loop performs before the
first iteration. It is therefore only suitable when N is known to be at least 1.
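A minimal sketch of the do-while form follows. The function name checksum_dowhile is chosen for this example, and the code assumes the caller guarantees N is at least 1.

```c
/* Sums N words using N itself as the down-counter.
   Assumption: N >= 1. With N == 0 the loop would run far too long,
   because the body executes before the first test. */
int checksum_dowhile(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);
    } while (--N != 0);

    return sum;
}
```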
Loop unrolling: Repeating the loop body several times, and reducing the number of loop
iterations by the same proportion.
There are two questions we need to ask when unrolling a loop:
■ How many times should we unroll the loop?
Only unroll loops that are important for the overall performance of the application.
Otherwise unrolling will increase the code size with little performance benefit. Unrolling
may even reduce performance by evicting more important code from the cache.

■ What if the number of loop iterations is not a multiple of the unroll amount?
We can try to arrange it so that array sizes are multiples of our unroll amount. If this
isn't possible, then we must add extra code to take care of the leftover cases. This
increases the code size a little but keeps the performance high.

The following code unrolls our packet checksum loop four times. We assume that the
number of words in the packet N is a multiple of four.
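The listing itself is missing from these notes. A minimal sketch of a four-times unrolled loop, assuming N is a non-zero multiple of four and using the illustrative name checksum_unroll4, is:

```c
/* Four additions per iteration, so the loop overhead (decrement and
   branch) is paid once per four words instead of once per word.
   Assumption: N is a multiple of 4 and N >= 4. */
int checksum_unroll4(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);

    return sum;
}
```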


This example handles the checksum of any size of data packet using a loop that has been unrolled
four times.
int checksum_v10(int *data, unsigned int N)
{
    unsigned int i;
    int sum = 0;

    for (i = N/4; i != 0; i--)
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    for (i = N&3; i != 0; i--)
    {
        sum += *(data++);
    }
    return sum;
}

The second for loop handles the remaining cases when N is not a multiple of four. Note that
both N/4 and N&3 can be zero, so we can't use do-while loops.
Points to remember for writing loops efficiently:
• Use loops that count down to zero. Then the compiler does not need to allocate a
register to hold the termination value, and the comparison with zero is free.
• Use unsigned loop counters by default and the continuation condition i!=0 rather than
i>0. This will ensure that the loop overhead is only two instructions.
• Use do-while loops rather than for loops when you know the loop will iterate at least
once. This saves the compiler checking to see if the loop count is zero.
• Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop
overhead is small as a proportion of the total, then unrolling will increase code size and
hurt the performance of the cache.


Register Allocation
The compiler attempts to allocate a processor register to each local variable used in a C
function. It will try to use the same register for different local variables if the use of the
variables does not overlap. When there are more local variables than available registers,
the compiler stores the excess variables on the processor stack. These variables are
called spilled or swapped out variables since they are written out to memory (in a
similar way to how virtual memory is swapped out to disk). Spilled variables are slow to
access compared to variables allocated to registers.
To implement a function efficiently, minimize the number of spilled variables and ensure
that the most important and frequently accessed variables are stored in registers.
C compiler register usage
The table of C compiler register usage shows the standard register names and usage when
following the ARM-Thumb procedure call standard (ATPCS), which is used in code generated
by C compilers.
Provided the compiler is not using software stack checking or a frame pointer, the
C compiler can use registers r0 to r12 and r14 to hold variables. It must save the
callee-saved registers r4 to r11 and r14 on the stack if it uses them.

In theory, the C compiler can assign 14 variables to registers without spillage. In
practice, some compilers use a fixed register such as r12 for intermediate scratch
working and do not assign variables to this register. Also, complex expressions require
intermediate working registers to evaluate. Therefore, to ensure good assignment to
registers, try to limit the internal loop of functions to using at most 12 local variables.
If the compiler does need to swap out variables, then it chooses which variables to swap
out based on frequency of use. A variable used inside a loop counts multiple times. The
register keyword in C hints that a compiler should allocate the given variable to a
register. Different compilers treat this keyword in different ways, and different
architectures have a different number of available registers (for example, Thumb and
ARM).

Efficient Register Allocation


• Try to limit the number of local variables in the internal loop of functions to 12.
The compiler should be able to allocate these to ARM registers.
• Guide the compiler as to which variables are important by ensuring these variables
are used within the innermost loop.

Function Calls
The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return
values in ARM registers. The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers
ARM and Thumb interworking as well.
The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and
r3. Subsequent integer arguments are placed on the full descending stack, ascending in
memory, as shown in the figure below. Function return integer values are passed in r0.

Fig: ATPCS argument passing.


This description covers only integer or pointer arguments. Two-word arguments such as long long
or double are passed in a pair of consecutive argument registers and returned in r0, r1. For
functions with more than four arguments, both the caller and callee must access the stack for
some arguments. If our C function needs more than four arguments, it is almost always more
efficient to group related arguments into structures and pass a structure pointer rather than
using multiple arguments.

There are other ways of reducing function call overhead if the function is very small and
corrupts few registers (uses few local variables). Put the C function in the same C file as
the functions that will call it. The C compiler then knows the code generated for the
callee function and can make optimizations in the caller function:
• The caller function need not preserve registers that it can see the callee doesn't
corrupt. Therefore the caller function need not save all the ATPCS corruptible
registers.
• If the callee function is very small, then the compiler can inline the code in the
caller function. This removes the function call overhead completely.

Example: Insert N bytes from an array data into a queue.
Case 1: Five arguments


[Listing: queue_bytes_v1 with five arguments, and its compiler output]

Case 2: Using a structure
The following code creates a Queue structure and passes this to the function to reduce the number
of function arguments.

[Listing: queue_bytes_v2 with a Queue structure pointer, and its compiler output]

The queue_bytes_v2 function is one instruction longer than queue_bytes_v1, but it is more
efficient overall.
The second version has only three function arguments rather than five. Each call to the
function requires only three register setups. This compares with four register setups, a
stack push, and a stack pull for the first version. There is a net saving of two instructions
in function call overhead. The caller also only needs to assign a single register to the
Queue structure pointer, rather than three registers in the non-structured case.
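The two C listings are missing from these notes. The sketch below shows one way the two cases could be written. The function names queue_bytes_v1 and queue_bytes_v2 appear in the text above, but the parameter names, the field names of Queue, and the convention that Q_end points one past the last byte of the buffer are assumptions made for this example. Both functions assume N is at least 1.

```c
/* Case 1: five arguments. Under ATPCS the fifth argument (N) does not
   fit in r0-r3 and is passed on the stack. */
char *queue_bytes_v1(char *Q_start, char *Q_end, char *Q_ptr,
                     char *data, unsigned int N)
{
    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = Q_start;          /* wrap around to the buffer start */
        }
    } while (--N != 0);
    return Q_ptr;                     /* caller stores the new insert position */
}

/* Case 2: the three queue pointers are grouped in one structure. */
typedef struct
{
    char *Q_start;                    /* first byte of the buffer */
    char *Q_end;                      /* one past the last byte (assumed) */
    char *Q_ptr;                      /* current insert position */
} Queue;

void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;       /* local copies live in registers */
    char *Q_end = queue->Q_end;

    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = queue->Q_start;
        }
    } while (--N != 0);
    queue->Q_ptr = Q_ptr;             /* write the position back once */
}
```

Copying queue->Q_ptr and queue->Q_end into locals keeps the inner loop free of repeated structure loads, which is the same idea used later to avoid pointer aliasing.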


To call functions efficiently:


• Try to restrict functions to four arguments. This will make them more efficient to
call. Use structures to group related arguments and pass structure pointers
instead of multiple arguments.
• Define small functions in the same source file and before the functions that call
them. The compiler can then optimize the function call or inline the small
function.
• Critical functions can be inlined using the inline keyword.


Pointer Aliasing
Two pointers are said to alias when they point to the same address. That means that if we
write through one pointer, it will affect the value we read through the other pointer. In a
function, the compiler often does not know which pointers can alias and which cannot.
Example: The function below increments two timer values by a step amount:
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}
The compiler loads from step twice. Usually, a compiler optimization called common
subexpression elimination would kick in so that *step was only evaluated once, and the
value reused for the second occurrence. But the compiler can't use this optimization
here. The pointers timer1 and step might alias one another, i.e. the compiler cannot be
sure that the write to timer1 doesn't affect the read from step.
If they do alias, the second value of *step is different from the first and has the value
*timer1. This forces the compiler to insert an extra load instruction.
The same problem occurs if you use structure accesses rather than direct
pointer access. The following code also compiles inefficiently:
typedef struct {int step;} State;
typedef struct {int timer1, timer2;} Timers;
void timers_v2(State *state, Timers *timers)
{
    timers->timer1 += state->step;
    timers->timer2 += state->step;
}
Avoiding Pointer Aliasing
• Do not rely on the compiler to eliminate common sub expressions involving
memory accesses. Instead, create new local variables to hold the expression.
This ensures the expression is evaluated only once.
• Avoid taking the address of local variables. The variable may be inefficient to
access from then on.
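The first point can be sketched as follows. timers_v1 is the function from the text; timers_v3 is an illustrative name for the version that reads *step once into a local variable.

```c
/* Original: *step is read from memory twice, because the write to
   *timer1 might have changed it. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* Sketch of the fix: read *step once into a local. The compiler can
   keep s in a register and needs only one load. */
void timers_v3(int *timer1, int *timer2, int *step)
{
    int s = *step;

    *timer1 += s;
    *timer2 += s;
}
```

The two versions behave identically when the pointers do not alias. If timer1 and step do point to the same location, timers_v3 adds the original step to both timers, while timers_v1 adds the updated value to the second one, so the rewrite is only valid when that aliasing is not intended.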


Portability Issues
char type: On the ARM, char is unsigned rather than signed as on many other processors. A
common problem concerns loops that use a char loop counter i and the continuation condition
i ≥ 0; they become infinite loops. In this situation, armcc produces a warning of unsigned
comparison with zero. Either use a compiler option to make char signed or change loop
counters to type int.
int type: Some older architectures use a 16-bit int, which may cause problems when moving
to ARM's 32-bit int type, although this is rare nowadays. Note that expressions are promoted
to an int type before evaluation. Therefore if i = -0x1000, the expression i == 0xF000 is true
on a 16-bit machine but false on a 32-bit machine.
Unaligned data pointers: Some processors support the loading of short and int typed values
from unaligned addresses. A C program may manipulate pointers directly so that they
become unaligned, for example, by casting a char * to an int *. ARM architectures up to
ARMv5TE do not support unaligned pointers. To detect them, run the program on an ARM
with an alignment checking trap. For example, configure the ARM720T to data abort on an
unaligned access.
Endian assumptions: C code may make assumptions about the endianness of a memory
system, for example, by casting a char * to an int *. If the ARM is configured for the same
endianness the code is expecting, then there is no issue. Otherwise, endian-dependent
code sequences must be removed and replaced by endian-independent ones.
Function prototyping: The armcc compiler passes arguments narrow, that is, reduced to the
range of the argument type. If functions are not prototyped correctly, then the function may
return the wrong answer. Other compilers that pass arguments wide may give the correct
answer even if the function prototype is incorrect. Always use ANSI prototypes.
Use of bit-fields: The layout of bits within a bit-field is implementation and endian
dependent. If C code assumes that bits are laid out in a certain order, then the code is not
portable.
Use of enumerations: Although enum is portable, different compilers allocate different
numbers of bytes to an enum. The gcc compiler will always allocate four bytes to an enum
type. The armcc compiler will only allocate one byte if the enum takes only eight-bit values.
Therefore you can't cross-link code and libraries between different compilers if you use
enums in an API structure.
Inline assembly: Using inline assembly in C code reduces portability between architectures.
You should separate any inline assembly into small inlined functions that can easily be
replaced. It is also useful to supply reference, plain C implementations of these functions
that can be used on other architectures, where this is possible.
The volatile keyword: Use the volatile keyword on the type definitions of ARM memory
mapped peripheral locations. This keyword prevents the compiler from optimizing away the
memory access. It also ensures that the compiler generates a data access of the correct type.
For example, if a memory location is defined as a volatile short type, then the compiler
will access it using the 16-bit load and store instructions LDRSH and STRH.
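As an illustration of an endian-independent code sequence, the sketch below reads a 32-bit value that is stored least significant byte first. The function name read_u32_le is made up for this example, and it assumes a compiler that provides stdint.h. Because it reads one byte at a time, it also avoids the unaligned pointer problem described above.

```c
#include <stdint.h>

/* Assembles a 32-bit value from four bytes stored in little-endian
   order. The result is the same on little- and big-endian targets,
   and p does not need to be word aligned. */
uint32_t read_u32_le(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

This replaces the non-portable pattern of casting a char * to an int * and dereferencing it.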

Writing Assembly Code


This section gives examples showing how to write basic assembly code, using the ARM
macro assembler armasm.
Question (MQP 2024, 10M): Write a C program that prints the squares of the integers from
0 to 9 using a function, and explain how to convert this C function to an assembly function
with commands.
Example 1: This example shows how to convert a C function to an assembly function, which
is usually the first stage of assembly optimization. Consider the simple C program main.c
that prints the squares of the integers from 0 to 9:

#include <stdio.h>

int square(int i);

int main(void)
{
    int i;

    for (i=0; i<10; i++)
    {
        printf("Square of %d is %d\n", i, square(i));
    }
}

int square(int i)
{
    return i*i;
}

Remove the C definition of square, but not the declaration (the second line), to produce a new
C file main1.c. Next add an armasm assembler file square.s with the following contents:

        AREA |.text|, CODE, READONLY
        EXPORT square
        ; int square(int i)
square
        MUL r1, r0, r0           ; r1 = r0 * r0
        MOV r0, r1               ; r0 = r1
        MOV pc, lr               ; return r0
        END

The AREA directive names the area or code section that the code lives in. If
non-alphanumeric characters are used in a symbol or area name, then enclose the name in
vertical bars. Many non-alphanumeric characters have special meanings otherwise. In the
previous code we define a read-only code area called .text.
The EXPORT directive makes the symbol square available for external linking. The
non-indented line that follows defines the symbol square as a code label. Note that armasm
treats non-indented text as a label definition.
When square is called, the parameter passing is defined by the ARM-Thumb procedure call
standard (ATPCS). The input argument is passed in register r0, and the return value is
returned in register r0. The multiply instruction has a restriction that the destination
register must not be the same as the first argument register. Therefore we place the
multiply result into r1 and move this to r0.
The END directive marks the end of the assembly file. Comments follow a semicolon.

Question (MQP 2024, 10M): Explain code optimization, profiling and cycle counting.
Code optimization refers to the process of modifying a program to improve its performance.
This can involve reducing execution time, memory usage, power consumption, or other
resources.
Profiling and Cycle Counting: The first stage of any optimization process is to identify the
critical routines and measure their current performance.
Profiling is the process of analyzing a program to determine which parts of the code are
consuming the most resources, such as CPU time, memory, or I/O operations. Profiling
helps identify performance bottlenecks and areas that could benefit from optimization. A
profiler is a tool that measures the proportion of time or processing cycles spent in each
subroutine. It is used to identify the most critical routines.
A cycle counter measures the number of cycles taken by a specific routine. The ARM
simulator used by the ADS1.1 debugger is called the ARMulator and provides profiling and
cycle counting features. The ARMulator profiler works by sampling the program counter pc
at regular intervals. The profiler identifies the function the pc points to and updates a hit
counter for each function it encounters. Another approach is to use the trace output of a
simulator as a source for analysis.


The accuracy of a pc-sampled profiler is limited, as it can produce meaningless results if it
records too few samples.
ARM implementations do not normally contain cycle-counting hardware, so the easiest way to
measure cycle counts is to use an ARM debugger with an ARM simulator. The ARMulator can be
configured to simulate a range of different ARM cores, which gives cycle count benchmarks
for a number of platforms.

Question (MQP 2024, 10M): Develop an ALP to find the sum of the first 10 integer numbers.
        AREA SUM, CODE, READONLY
        ENTRY
        MOV R1,#10               ; length of array
        LDR R2,=ARRAY            ; load the starting address of the array
        MOV R4,#0                ; initial sum
NEXT    LDR R3,[R2],#4           ; load the next integer into R3, then advance R2 by 4
        ADD R4,R4,R3             ; R4 = sum of integers so far
        SUBS R1,R1,#1            ; decrement the count and set the flags
        BNE NEXT                 ; repeat while the Z flag is not set (R1 != 0)
        MOV R5,#0x40000000       ; memory address at which to store the result
        STR R4,[R5]              ; store the result at the address held in R5
STOP    B STOP
ARRAY   DCD 1,2,3,4,5,6,7,8,9,10
        END
Questions
1. Explain with an example the different basic C data types used by the ARM compiler.
2. With a program example, explain the advantages of using int rather than char and short
types for local variables and function arguments.
3. Describe with an example the different looping structures used by the ARM compiler.
4. Explain the loop unrolling concept with a suitable program example.
5. Explain loops using a variable number of iterations with a program example.
6. Explain register allocation in detail.
7. Explain the function call operation with a suitable program example.
8. Explain the pointer aliasing concept with a suitable program example.
Reference:
1. Andrew N. Sloss, Dominic Symes and Chris Wright, ARM System Developer's Guide,
Elsevier, Morgan Kaufmann Publishers, 2008.
