0% found this document useful (0 votes)

56 views22 pages

MC-module 3 C Compilers and Optimization (BCS402)

The document discusses C compilers and optimization techniques, emphasizing the importance of writing efficient C code while being aware of compiler limitations and processor architecture. It outlines various optimization methods such as inlining, loop unrolling, and dead code elimination, and provides examples of how variable types impact performance on ARM processors. Additionally, it covers best practices for using data types and looping structures to enhance code efficiency.

Uploaded by

nandakishor425

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

56 views22 pages

MC-module 3 C Compilers and Optimization (BCS402)

Uploaded by

nandakishor425

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

MICROCONTROLLERS(BCS402)

Module-3
C Compilers and Optimization :
Basic C Data Types, C Looping Structures, Register Allocation, Function Calls, Pointer Aliasing,
Portability Issues.

Overview of C Compilers and Optimization:

Optimizing code takes time and reduces source code readability. therefore, it’s only worth optimizing
functions that are frequently executed and important for performance. C compilers have to translate
our C function into an assembler to work for all possible inputs.
To write efficient C code, we must be aware of areas where the C compiler has to be conservative, the
limits of the processor architecture the C compiler is mapping to, and the limits of a specific C compiler.
Most common c compilers are
⚫ armcc from ARM Developer Suite version 1.1 (ADS1.1).
⚫ arm-elf-gcc version 2.95.2. This is the ARM target for the GNU C compiler, gcc,

A C compiler is a software tool that translates human-readable C code into machine-readable instructions
that a computer's processor can execute.
Optimization aims to improve the performance (speed and/or memory usage) of the compiled code without
altering its functionality.

Here are common optimization techniques:

1. Inlining: Replaces a function call with the function's body, reducing overhead. 2.
Loop Unrolling: Copies the body of a loop multiple times to reduce loop control
overhead.
3. Constant Folding: Evaluates constant expressions at compile-time instead of runtime. 4.
Dead Code Elimination: Removes code that doesn't affect program output. 5. Register
Allocation: Allocates variables to CPU registers to reduce memory access time. 6.
Instruction Scheduling: Reorders instructions to minimize CPU idle time. 7. Data Flow
Analysis: Analyzes how data flows through the program to optimize memory access.
8. Tail Call Optimization: Replaces tail recursive calls with jumps to reduce stack usage.

This function is designed to clear (set to zero) a block of memory starting from the location pointed to by
data and spanning N bytes.

void memclr(char *data, int N)

{
for (; N>0; N--)
{
*data=0;
data++;
}
}
void: The function does not return any value.
memclr: The name of the function.
*char data: A pointer to the beginning of the memory block that needs to be cleared. The char
type indicates that we're dealing with memory one byte at a time.
int N: An integer representing the number of bytes to be cleared.
Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 1
MICROCONTROLLERS(BCS402)

For Loop Initialization and Condition:

*data = 0;: This dereferences the pointer data and sets the value at the current memory location to
0.
data++;: This increments the pointer data to point to the next byte in memory.
Initial state
data -> [ ? ] [ ? ] [ ? ] N = 3
Final
data -> [ 0 ] [ 0 ] [ 0 ]
N=0

To write efficient C code, you must be aware of areas where the C compiler has to be conservative, the
limits of the processor architecture the C compiler is mapping to, and the limits of a specific C
compiler.

Basic C Data Types

ARM processors have 32-bit registers and 32-bit data processing operations. The ARM
architecture is a RISC load/store architecture.
Table 3.1 Load and store instructions by ARM architecture.
Architecture Instruction Action
Pre-ARMv4 LDRB load an unsigned 8-bit value
STRB store a signed or unsigned 8-bit value
LDR load a signed or unsigned 32-bit value
STR store a signed or unsigned 32-bit value
ARMv4 LDRSB load a signed 8-bit value
LDRH load an unsigned 16-bit value
LDRSH load a signed 16-bit value
STRH store a signed or unsigned 16-bit value
ARMv5 LDRD load a signed or unsigned 64-bit value
STRD store a signed or unsigned 64-bit value

In above Table : loads/store that act on 8- or 16-bit values extend the value to 32 bits before
writing to an ARM register. Unsigned values are zero-extended, and signed values sign
extended.
This means that type casting of a loaded value to an int type does not cost extra
instructions.
ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores directly,
through new instructions. Since these instructions are a later addition, they do not support all
the addressing modes as the pre-ARMv4 instructions.
Compilers armcc and gcc use the datatype mappings for an ARM target.
ARM C compilers define char to be an unsigned 8-bit value, rather than a signed 8-bit
value.
Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 2
MICROCONTROLLERS(BCS402)

Local variable types:

ARMv4-based processors can efficiently load and store 8, 16, and 32-bit data. However,
most ARM data processing operations are 32-bit only. For this reason, a 32-bit data
type, int or long, is used for local variables wherever possible. Avoid using char and
short as local variable types, even if for an 8- or 16-bit value. The one exception is
when you want wrap-around to occur. If we are doing modulo arithmetic
operations(mod n=n) then we can use char type.

Example 1: The following code checksums a data packet containing 64 words. It shows
why to avoid using char for local variables.
Int checksum_v1(int*data)
{
char i;
int sum = 0;
for (i = 0; i < 64; i++)
{
sum += data[i];
}
return sum;
}
At first sight it looks as though declaring i as a char is efficient. a char uses less register
space or less space on the ARM stack than an int On the ARM, both these assumptions
are wrong.
All ARM registers are 32-bit and all stack entries are at least 32-bit. To implement the
i++ exactly, the compiler must accountfor the case when i = 255. Any attempt to
increment 255 should produce the answer 0.
Case1: The compiler output for this function is as given below
checksum_v1
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v1_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
AND r1,r1,#0xff ; i = (char)r1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v1_loop ; if (i<64) loop
MOV pc,r14 ; return sum

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 3

MICROCONTROLLERS(BCS402)

The compiler output for the same function by declaring I as unsigned int is as given
below
checksum_v2
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v2_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v2_loop ; if (i<64) loop
MOV pc, r14 ; return sum
In the first case, the compiler inserts an extra AND instruction to reduce i to the range
0 to 255 before the comparison with 64. This instruction disappears in the second case.

Example 2: The data packet contains 16-bit values for a 16-bit checksum.then i is declared as unsigned int

short checksum_v3(short *data)

{
unsigned int i;
short sum = 0;
for (i = 0; i < 64; i++)
{
sum = (short)(sum + data[i]);
}
return sum;
}
With armcc this code will produce a warning for enabling implicit narrowing cast
warnings using the compiler switch -W+ n. The expression sum + data[i] is an integer
and so can only be assigned to a short using an (implicit or explicit) narrowing cast.
As you can see in the following assembly output, the compiler must insert extra
instructions to implement the narrowing cast:
checksum_v3
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v3_loop
ADD r3,r2,r1,LSL #1 ; r3 = &data[i]
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 4
MICROCONTROLLERS(BCS402)

CMP r1,#0x40 ; compare i, 64

ADD r0,r3,r0 ; r0 = sum + r3
MOV r0,r0,LSL #16

MOV r0,r0,ASR #16 ; sum = (short)r0

BCC checksum_v3_loop ; if (i<64) goto loop
MOV pc, r14 ; return sum

The loop is now three instructions longer than the loop for example checksum_v2 earlier! There
are two reasons for the extra instructions:
⚫ The LDRH instruction does not allow for a shifted address offset as the LDR instruction did
in checksum_v2. Therefore the first ADD in the loop calculates the address of item i in the
array. The LDRH loads from an address with no offset. LDRH has fewer addressing modes
than LDR as it was a later addition to the ARM instruction set.
⚫ The cast reducing total + array[i] to a short requires two MOV instructions. The compiler
shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is a sign
extending shift so it replicates the sign bit to fill the upper 16 bits.
We can avoid the second problem by using an int type variable to hold the partial sum.We
only reduce the sum to a short type at the function exit.
The first problem is a new issue. We can solve it by accessing the array by incrementing the
pointer data rather than using an index as in data[i]. This is efficient regardless of array type
size or element size. All ARM load and store instructions have a post increment addressing
mode.

To avoid unnecessary casts it uses int type local variables. It increments the pointer
data instead of using an index offset data[i].
short checksum_v4(short *data)
{
unsigned int i;
int sum=0;
for (i=0; i<64; i++)
{
sum += *(data++); // The *(data++) operation translates to a single ARM instruction that loads the data and increments the data
pointer.
}
return (short)sum;
}
The compiler produces the following output. Three instructions have been removed from the
inside loop, saving three cycles per loop compared to checksum_v3
checksum_v4
MOV r2,r0 ; r2 = data
MOV r1,#0 ;i=0
checksum_v4_loop
LDRSH r3,[r0],#2 ; r3 = *(data++)
ADD r1,r1,#1 ; r1 = i+1

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 5

MICROCONTROLLERS(BCS402)

CMP r1,#0x40 ; compare i, 64

ADD r2,r3,r2 ; sum + = r3
BCC checksum_v4_loop ; if (i<64) goto loop
MOV r0,r2,LSL #16
MOV r0,r0,ASR #16 ; sum = (short)sum
MOV pc, r14 ; return sum
Function argument types:
Converting local variables from types char or short to type int increases performance and
reduces code size. The same holds for function arguments. If the function arguments are short
type either the caller or the callee must perform the cast to a short type. And these char or
short type function arguments and return values introduce extra casts. This increase code size
and decrease performance. Therefore It is more efficient to use the int type for function
arguments and return values, even if you are only passing an 8-bit value.
Signed versus Unsigned Types

If our code uses addition, subtraction, and multiplication, then there is no performance
difference between signed and unsigned operations , but there is a difference when it is
division operation.
Consider the following short example that averages two integers:
int average_v1(int a,
int b) {
return (a+b)/2;
}
average_v1
ADD r0,r0,r1 ; r0=a+b
ADD r0,r0,r0,LSR #31 ; if (r0<0) r0++
MOV r0,r0,ASR #1 ; r0 = r0 >> 1
MOV pc,r14 ; return r0

The compiler adds one to the sum before shifting by right if the sum is negative. In
other words it replaces x/2 by the statement:
(x<0) ? ((x+1) >> 1): (x >> 1)
It must do this because x is signed.. In C on an ARM target, a divide by two is not a right shift if
x is negative.
For example, −3 >> 1 =-2 but −3/2 = −1.
Division rounds towards zero, but arithmetic right shift rounds towards −∞.
It is more efficient to use unsigned types for divisions. The compiler converts unsigned power of
two divisions directly to right shifts. For general divisions, the divide routine in the C library is
faster for unsigned types.
For the efficient use of C types the following points to be deliberated. ➢
For local variables held in registers, don’t use a char or short type unless 8-bit or
16-bit modular arithmetic is necessary. Use the signed or unsigned int types

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 6

MICROCONTROLLERS(BCS402)

instead. Unsigned types are faster for divisions operation.

➢ For array entries and global variables held in main memory, use the type with the
smallest size possible to hold the required data. This saves memory footprint.
The ARMv4 architecture is efficient at loading and storing all data widths
provided you traverse arrays by incrementing the array pointer. Avoid using
offsets from the base of the array with short type arrays, as LDRH does not
support this.

➢ Use explicit casts when reading array entries or global variables into local
variables, or writing local variables out to array entries. The casts make it clear
that for fast operation taking a narrow width type stored in memory and
expanding it to a wider type in the registers. Switch on implicit narrowing cast
warnings in the compiler to detect implicit casts.

➢ Avoid implicit or explicit narrowing casts in expressions because they usually

cost extra cycles. Avoid char and short types for function arguments or return
values. Instead use the int type even if the range of the parameter is smaller. This
prevents the compiler performing unnecessarycasts.

C Looping Structures
In this section we will learn the most efficient ways to code for and while loops on the ARM.
This section includes, loops with a fixed number of iterations, loops with a variable number of
iterations and loop unrolling.

Loops with a Fixed Number of Iterations

This concept is explained with 64-word checksum routine.
Below example shows the compiler treats a loop with incrementing count i.e i++.

It takes three instructions to implement the for loop structure:

⚫ An ADD to increment i
⚫ A compare to check if i is less than 64
⚫ A conditional branch to continue the loop if i < 64
This is not efficient. On the ARM, a loop should only use two instructions:
Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 7
MICROCONTROLLERS(BCS402)

⚫ A subtract to decrement the loop counter, which also sets the condition code flags on the result
⚫ A conditional branch instruction

This example shows the improvement if we switch to a decrementing loop rather than an
incrementing loop

SUBS and BNE instructions implement the loop

Loops Using Variable Number of Iterations
In this case a do-while loop gives better performance and code density than a for loop.
Example: We pass in a variable N giving the Variable Number of Iterations as an
argument and count down N until N=0 ,no need of extra loop counter i.
In this case a do-while loop gives better performance than a for loop

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 8

MICROCONTROLLERS(BCS402)

do-while loop remove the test for N being zero that occurs in a for loop & hence it gives
betterperformance than a for loop.
Loop unrolling: Repeating the loop body several times, and reducing the number of loop
iterations by the same proportion.
There are two thing we we need to ask when unrolling a loop:
■ How many times should we unroll the loop?
Only unroll loops that are important for the overall
performance of the application. Otherwise unrolling will increase the code size with little
performance benefit. Unrolling may even reduce performance by evicting more important
code from the cache.

■ What if the number of loop iterations is not a multiple of the unroll

amount? We can try to arrange it so that array sizes are multiples of our unroll
amount. If this isn’t possible, then we must add extra code to take care of the leftover
cases.This increases the code size a little but keeps the performance high.

The following code unrolls our packet checksum loop by four times. We assume that the
number of words in the packet N is a multiple of four

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 9

MICROCONTROLLERS(BCS402)

This example handles the checksum of any size of data packet using a loop that has been unrolled
four times
int checksum_v10(int *data, unsigned int N)
{
unsigned int i;
int sum=0;
for (i=N/4; i!=0; i--)
{
sum += *(data++);
sum += *(data++);
sum += *(data++);
sum += *(data++);
}
for (i=N&3; i!=0; i--)
{
sum += *(data++);
}
return sum;
}

The second for loop handles the remaining cases when N is not a multiple of four. Note that
both N/4 and N&3 can be zero, so we can’t use do-while loops.
Points to remember while using Looping statement efficiently • Use loops
that count down to zero. Then the compiler does not need to allocate a register to
hold the termination value, and the comparison with zero is free. • Use unsigned loop
counters by default and the continuation condition i!=0 rather thani>0. This will
ensure that the loop overhead is only two instructions. • Use do-while loops rather
than for loops when you know the loop will iterate at least once. This saves the
compiler checking to see if the loop count is zero. • Unroll important loops to reduce
the loop overhead. Do not overunroll, if the loop overhead is small as a proportion
of the total, then unrolling will increase code size and hurt the performance of the
cache.

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 10

MICROCONTROLLERS(BCS402)

Register Allocation
The compiler attempts to allocate a processor register to each local variable use in a C function.
It will try to use the same register for different local variables if the use of the variables
does not overlap. When there are more local variables than available registers, the
compiler stores the excess variables on the processor stack. These variables are
called spilled or swapped out variables since they are written out to memory (in a
similar way virtual memory is swapped out to disk). Spilled variables are slow to access
compared to variables allocated to registers.
To implement a function efficiently, minimize the number of spilled variables & ensure
that the most important and frequently accessed variables are stored in registers. C
compiler register usage
Table shows the standard register names and usage when following the ARM-Thumb
procedure call standard (ATPCS), which is used in code generated by C compilers.
Provided the compiler is not using software stack checking or a frame pointer, then the
C compiler can use registers r0 to r12 and r14 to hold variables. It must save the callee
values of r4 to r11 and r14 on the stack if using these registers.
Dept. of CSE
(Data Science), SVIT , Asst. Prof. Anitha C S Page 11
MICROCONTROLLERS(BCS402)

In theory, the C compiler can assign 14 variables to registers without spillage. In

practice, some compilers use a fixed register such as r12 for intermediate scratch
working and do not assign variables to this register. Also, complex expressions require
intermediate working registers to evaluate. Therefore, to ensure good assignment to
registers, try to limit the internal loop of functions to using at most 12 local variables.
If the compiler does need to swap out variables, then it chooses which variables to swap
out based on frequency of use. A variable used inside a loop counts multiple times. The
register keyword in C hints that a compiler should allocate the given variable to a
register. Different compilers treat this keyword in different ways, and different
architectures have a different number of available registers (for example, Thumb and
ARM).

Efficient Register Allocation

• Try to limit the number of local variables in the internal loop of functions to 12.
The compiler should be able to allocate these to ARM registers.
• Guide the compiler as to which variables are important by ensuring these variables
are used within the innermost loop.

Function Calls
The ARM Procedure Call Standard (APCS) defines how to
pass function arguments and return values in ARM registers.
The more recent
ARM-Thumb Procedure Call Standard (ATPCS) covers ARM
and
Thumb interworking as well.
The first four integer arguments are passed in the
first four ARM registers: r0, r1, r2, and r3.
Subsequent integer arguments are placed on the full
descending stack, ascending in memory shown in
figure below. Function return integer values are passed
in r0.

Fig: ATPCS argument passing.

This description covers only integer or pointer arguments. Two- word arguments such as long or double
are passed in a pair of consecutive argument registers and returned in r0, r1. If our C function needs
more than four arguments, it is always more efficient, if we group related arguments into

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 12

MICROCONTROLLERS(BCS402)

structures, and pass a structure pointer rather than using multiple arguments.For functions with
more arguments, both the caller and callee must access the stack for some arguments. Example:

The following code creates a Queue structure and passes this to the function to reduce the
number of function arguments.
There are other ways of reducing function call overhead if the function is very small and
corrupts few registers (uses few local variables). Put the C function in the same C file as
the functions that will call it. The C compiler then knows the code generated for the
callee function and can make optimizations in the caller function:
• The caller function need not preserve registers that it can see the callee doesn’t
corrupt. Therefore the caller function need not save all the ATPCS corruptible
registers.
• If the callee function is very small, then the compilers can inline the code in the
caller function. This removes the function call overhead completely.
•Example: Insert N bytes(from Array data into a queue)
•Case1: 5 Arguments

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 13

MICROCONTROLLERS(BCS402)

This compiles to

The following code creates a Queue structure and passes this to the function to reduce the number
of function arguments

Case2: using Structure

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 14
MICROCONTROLLERS(BCS402)

This Compiles to

The
queue_bytes_v2 is one instruction longer than queue_bytes_v1, but it is more efficient
overall.
The second version has only three function arguments rather than
five. Each call to the function requires only three register setups. This compares with
four register setups, a stack push, and a stack pull for the first version. There is a net
saving of two instructions in function call overhead. It only needs to assign a single
register to the Queue structure pointer, rather than three registers in the non structured
case

There are other ways of reducing function call overhead

⚫ The caller function need not preserve registers that it can see the callee doesn’t
corrupt. Therefore the caller function need not save all the ATPCS corruptible
registers.
⚫ If the callee function is very small, then the compiler can inline the code in the caller
function. This removes the function call overhead completely.

For efficient use of calling a functions

• Try to restrict functions to four arguments. This will make them more efficient to
call. Use structures to group related arguments and pass structure pointers
instead of multiple arguments.
• Define small functions in the same source file and before the functions that call
them. The compiler can then optimize the function call or inline the small
function.
• Critical functions can be inlined using the inline keyword.

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 15

MICROCONTROLLERS(BCS402)

Pointer Aliasing
Two pointers are said to be alias when they point to the same address That means ,
If we write to one pointer, it will affect the value we read from the other pointer. In
function, the compiler often does not know which pointer cause aliasing and which
pointer not
Example: The below code for function increments, two timer values by a step
amount:void timers_v1(int *timer1, int *timer2, int *step) {
*timer1 += *step;
*timer2 += *step;
}
The compiler loads from step twice. Usually, a compiler optimization called common
sub expression elimination would kick in so that *step was only evaluated once, and the
value reused for the second occurrence. But the compiler can’t use this optimization
here. The pointers timer1 and step might alias one another i.e. the compiler cannot be
sure that the write to timer1 doesn’t affect the read from step.
In this case, the second value of *step is different from the first and has the value
*timer1. This forces the compiler to insert an extra load instruction.
The same problem occurs if you use structure accesses rather than direct
pointer access.The following code also compiles inefficiently:
typedef struct {int step;} State;
typedef struct {int timer1, timer2;}
Timers; void timers_v2(State *state,
Timers *timers) {
timers->timer1 += state->step; timers->timer2 += state->step; }
Avoiding Pointer Aliasing
• Do not rely on the compiler to eliminate common sub expressions involving
memory accesses. Instead, create new local variables to hold the expression.
This ensures the expression is evaluated only once.
• Avoid taking the address of local variables. The variable may be inefficient to
access from then on.

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 16

MICROCONTROLLERS(BCS402)

Portability Issues
char type: On the ARM, char is unsigned rather than signed as for many other processors. A
common problem concerns loops that use a char loop counter i and the continuation condition
i≥ 0, they become infinite loops. In this situation, armcc produces a warning of unsigned
comparison with zero. Use a compiler option to make char signed or change loop
counters to type int.
int type: Some older architectures use a 16-bit int, which may cause problems when moving
to ARM‟s 32-bit int type although this is rare nowadays. Note that expressions are promoted
to an int type before evaluation. Therefore if i = -0x1000, the expression i == 0xF000 is true
on a 16- bit machine but false on a 32- bit machine.
Unaligned data pointers: Some processors support the loading of short and int typed values
from unaligned addresses. A C program may manipulate pointers directly so that they
become unaligned, for example, by casting a char * to an int *. ARM architectures up to
ARMv5TE do not support unaligned pointers. To detect them, run the program on an ARM
with an alignment checking trap. Configure the ARM720T to data abort on an unaligned
access.
Endian assumptions: C code may make assumptions about the endianness of a memory
system, for example, by casting a char * to an int *. The ARM is configured for the same
endianness the code is expecting, and then there is no issue. Otherwise, endian-dependent
code sequences must be removed and replace them by endian-independent ones. Function
prototyping: The armcc compiler passes arguments narrow, that is, reduced to the range of
the argument type. If functions are not prototyped correctly, then the function may return the
wrong answer. Other compilers that pass arguments wide may give the correct answer even
if the function prototype is incorrect. Always use ANSI prototypes. Use of bit-fields: The
layout of bits within a bit-field is implementation and endian dependent. If C code assumes
that bits are laid out in a certain order, then the code is not portable.
Use of enumerations: Although enum is portable, different compilers allocate different
numbers of bytes to an enum. The gcc compiler will always allocate four bytes to an enum
type. The armcc compiler will only allocate one byte if the enum takes only eight-bit values.
Therefore you can‟t cross-link code and libraries between different compilers if you use
enums in an API structure.
Inline assembly: Using inline assembly in C code reduces portability between architectures.
You should separate any inline assembly into small inlined functions that can easily be

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 17

MICROCONTROLLERS(BCS402)

replaced.It is also useful to supply reference, plain C implementations of these functions that
can be used on other architectures, where this is possible.
The volatile keyword: Use the volatile keyword on the type definitions of ARM memory
mapped peripheral locations. This keyword prevents the compiler from optimizing away the
memory access. It also ensures that the compiler generates a data access of the correct type.
For example, a memory location is defined as a volatile short type, and then the compiler
will access it using 16-bit load and store instructions LDRSH and STRH.

Writing Assembly Code

This section gives examples showing how to write basic assembly code. Also, this section
uses the ARM macro assembler armasm for examples.
Example 1: Example shows how to convert a C function to an assembly function—
usually the first stage of assembly optimization. Consider the simple C program main.c
following that printsthe squares of the integers from 0 to 9:
Write a C program that prints the square of the integers between 0 to 9 using functions
and explain how to convert this C function to an assembly function with command.
MQP 2024 10M

Remove the C definition of square, but not the declaration (the second line) to produce a new
C file main1.c. Next add an armasm assembler file square.s with the following contents:

#include <stdio.h> AREA |.text|, code, readonly

int square(int i); EXPORT square ; int square(int i)

int main(void) { square

int i; MUL r1, r0, r0 ; r1 = r0 * r0

for (i=0; i<10; i++) { MOV r0, r1 ; r0 = r1 MOV pc, lr ;

printf("Square of %d is %d\n", i, return r0

square(i)); } } END

int square(int i) {

return i*i;

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 18

MICROCONTROLLERS(BCS402)

The AREA directive names the area or code section that the code lives in. If non
alphanumeric characters are used in a symbol or area name, then enclose the name in vertical
bars. Many non- alphanumeric characters have special meanings otherwise. In the previous
code we define a read-only code area called .text.
The EXPORT directive makes the symbol square available for external linking. At line six
we define the symbol square as a code label. Note that armasm treats non-indented text as a
label definition.
When square is called, the parameter passing is defined by the ARM-Thumb procedure call
standard (ATPCS). The input argument is passed in register r0, and the return value is
returned in register r0. The multiply instruction has a restriction that the destination register
must not be the same as the first argument register. Therefore we place the multiply result
into r1 and move this to r0.
The END directive marks the end of the assembly file. Comments follow a

semicolon.

Explain code optimization, profiling and cycle counting. MQP 2024 10M Code
optimization refers to the process of modifying a program to improve its performance. This
can involve reducing execution time, memory usage, power consumption, or other resources.
Profiling and Cycle Counting: The first stage of any optimization process is to identify the
critical routines and measure their current performance.
Profiling is the process of analyzing a program to determine which parts of the code are
consuming the most resources, such as CPU time, memory, or I/O operations. Profiling
helps identify performance bottlenecks and areas that could benefit from optimization A
profiler is a tool that measures the proportion of time or processing cycles spent in each
subroutine. It is used to identify the most critical routines.
A cycle counter measures the number of cycles taken by a specific routine. The ARM
simulator used by the ADS1.1 debugger is called the ARMulator and provides profiling and
cycle counting features. The ARMulator profiler works by sampling the program counter pc
at regular intervals. The profiler identifies the function the pc points to and updates a hit
counter for each function it encounters. Another approach is to use the trace output of a
simulator as a source for analysis.

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 19

MICROCONTROLLERS(BCS402)

The accuracy of a pc-sampled profiler is limited, as it can produce meaningless results if it

records too few samples.
ARM implementations do not normally contain cycle-counting hardware; so to easily
measure cycle counts an ARM debugger can be used with ARM simulator. Configure the
ARMulator to simulate a range of different ARM cores and obtain cycle count benchmarks
for a number of platforms.

Develop an ALP to find the sum of first 10 integer numbers. MQP 2024 10 M
AREA SUM, CODE, READONLY
ENTRY
MOV R1,#10 ;length of array
LDR R2,=ARRAY ;Load the starting address of the array
MOV R4,#0 ;Initial sum
NEXT LDR R3,[R2],#4 ;Load first integer of the array in R3
ADD R4,R4,R3 ;R4=sum of integers
SUBS R1,R1,#1 ;repeat until R1=0
BNE NEXT ;If z-flag is not set repeat
MOV R5,#0X40000000 ; initialize memory address to store the result in memory STR
R4,[R5] ; store the result in the address stored in R5
STOP B STOP
ARRAY DCD 1,2,3,4,5,6,7,8,9,10
END
Questions
1. Explain with an example the different basic C data types used by arm compiler. 2.
With program example explain the advantages of using int rather than char & short type
for local variables & function arguments.
3. Describe with an example different looping structure used by arm compiler.
4. Explain loop unrolling concept with suitable program example.
5. Explain loops using variable number of iterations with program example.
6. Explain loop unrolling concept with suitable program example.
7. Explain in detail about Register Allocation.
8. Explain the function call operation with suitable program example.
9. Explain the pointer aliasing concept with suitable program example.
Reference:
1. Andrew N Sloss, Dominic Symes and Chris Wright, ARM system developers
guide,Elsvier, Morgan Kaufman publishers, 2008.

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 20

MICROCONTROLLERS(BCS402)
Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 21

12 Information Technology (802) Important Question
67% (9)
12 Information Technology (802) Important Question
6 pages
Solutions: CS152 Computer Architecture and Engineering
No ratings yet
Solutions: CS152 Computer Architecture and Engineering
17 pages
BCS402 MC Module3 Notes
No ratings yet
BCS402 MC Module3 Notes
30 pages
Module 3 Notes
No ratings yet
Module 3 Notes
18 pages
BCS402 - MC - M3 - Notes SJCIT
No ratings yet
BCS402 - MC - M3 - Notes SJCIT
18 pages
BCS402 Module 3 PDF
No ratings yet
BCS402 Module 3 PDF
18 pages
BCS402 M3
No ratings yet
BCS402 M3
110 pages
Module 3 Book1 - Merged
No ratings yet
Module 3 Book1 - Merged
42 pages
Module 3
No ratings yet
Module 3
21 pages
UNIT-IV Basic C Data Types
No ratings yet
UNIT-IV Basic C Data Types
24 pages
Module 3 Notes-1
No ratings yet
Module 3 Notes-1
30 pages
Module 3
No ratings yet
Module 3
51 pages
Module 2 Part B (Mces 21cs43)
No ratings yet
Module 2 Part B (Mces 21cs43)
29 pages
Arm Unit 3
No ratings yet
Arm Unit 3
62 pages
Imp Notes - 1
No ratings yet
Imp Notes - 1
28 pages
Department of Computer Science and Engineering
No ratings yet
Department of Computer Science and Engineering
25 pages
Mod 3
No ratings yet
Mod 3
35 pages
Es (U4) 1
No ratings yet
Es (U4) 1
24 pages
Hello World
No ratings yet
Hello World
18 pages
Module 3
No ratings yet
Module 3
35 pages
Module 5
No ratings yet
Module 5
33 pages
ARM MC Module 03
No ratings yet
ARM MC Module 03
21 pages
Embedded C Interview Questions
75% (4)
Embedded C Interview Questions
3 pages
Module-3 ARMProgram Notes.-16857877494142 PDF
No ratings yet
Module-3 ARMProgram Notes.-16857877494142 PDF
5 pages
MC Ia-2
No ratings yet
MC Ia-2
14 pages
Class Ans Q
No ratings yet
Class Ans Q
24 pages
3rd Module MC Sem Exam Preparation
No ratings yet
3rd Module MC Sem Exam Preparation
31 pages
EE447 Week5 2023-24
No ratings yet
EE447 Week5 2023-24
37 pages
Exp4 - ARM Addressing Modes
No ratings yet
Exp4 - ARM Addressing Modes
11 pages
Lecture 08
No ratings yet
Lecture 08
17 pages
Embedded C Programming
No ratings yet
Embedded C Programming
49 pages
Introduction Unit
No ratings yet
Introduction Unit
8 pages
1 5
No ratings yet
1 5
7 pages
Sehs3317 L4
No ratings yet
Sehs3317 L4
53 pages
ARM Assembly Language Guide: Common ARM Instructions (And Psuedo-Instructions)
No ratings yet
ARM Assembly Language Guide: Common ARM Instructions (And Psuedo-Instructions)
7 pages
SET - ARM - Inst
No ratings yet
SET - ARM - Inst
4 pages
Chapter 04 ARM Assembly
No ratings yet
Chapter 04 ARM Assembly
53 pages
Module 2
No ratings yet
Module 2
41 pages
4-Instruction Set
No ratings yet
4-Instruction Set
45 pages
21CS43 - MCES Module-3 Chapter 1-2023.
No ratings yet
21CS43 - MCES Module-3 Chapter 1-2023.
23 pages
Assembly Instructions
No ratings yet
Assembly Instructions
4 pages
ARM Addressing Modes
No ratings yet
ARM Addressing Modes
11 pages
Lab 02
No ratings yet
Lab 02
7 pages
Lec3 - RISC-V Assembly
No ratings yet
Lec3 - RISC-V Assembly
56 pages
Aula Ch2 2
No ratings yet
Aula Ch2 2
27 pages
PPT-2 - Data Processing Instructions
No ratings yet
PPT-2 - Data Processing Instructions
59 pages
ARM Prog Model 3 Arithmetic
No ratings yet
ARM Prog Model 3 Arithmetic
13 pages
SJB Institute of Technology: CO & ARM Microcontrollers (21EC52)
No ratings yet
SJB Institute of Technology: CO & ARM Microcontrollers (21EC52)
61 pages
02 Arm
No ratings yet
02 Arm
53 pages
AppendixD Assembly Arm
No ratings yet
AppendixD Assembly Arm
53 pages
ARM-Inst Summary
No ratings yet
ARM-Inst Summary
2 pages
Addressing Memory
No ratings yet
Addressing Memory
39 pages
Cse331 L3 Arm Isa
No ratings yet
Cse331 L3 Arm Isa
103 pages
CS401 Short Notes Mid Term
No ratings yet
CS401 Short Notes Mid Term
22 pages
Condition Code Flags in APSR
No ratings yet
Condition Code Flags in APSR
25 pages
Chapter 3: Introduction To Assembly Language Programming: CEG2400 - Microcomputer Systems
No ratings yet
Chapter 3: Introduction To Assembly Language Programming: CEG2400 - Microcomputer Systems
57 pages
CH1 - ARM - PPT New New
No ratings yet
CH1 - ARM - PPT New New
62 pages
ARM Presentation
No ratings yet
ARM Presentation
51 pages
Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation
From Everand
Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation
Bruce Dang
No ratings yet
C Programming
From Everand
C Programming
Netra
No ratings yet
Projects With Microcontrollers And PICC
From Everand
Projects With Microcontrollers And PICC
Guillermo Perez Guillen
5/5 (1)
cs8261 Lab Manual
100% (2)
cs8261 Lab Manual
44 pages
BCA First and Second Year
No ratings yet
BCA First and Second Year
82 pages
OPS435 Assignment 2
No ratings yet
OPS435 Assignment 2
3 pages
W5-Lambda APIGateway
No ratings yet
W5-Lambda APIGateway
28 pages
Differences Between Procedural and Object Oriented Programming
No ratings yet
Differences Between Procedural and Object Oriented Programming
2 pages
How To Place An SE78 Image On An Adobe Form
No ratings yet
How To Place An SE78 Image On An Adobe Form
5 pages
Autosar Srs Com
No ratings yet
Autosar Srs Com
31 pages
CH-9-Forward Engineering
No ratings yet
CH-9-Forward Engineering
35 pages
Using Quincy 2005 To Write A C Program: Single-File Programs
No ratings yet
Using Quincy 2005 To Write A C Program: Single-File Programs
1 page
Coding Form Masuk
No ratings yet
Coding Form Masuk
7 pages
Algorithms 2. Order 3. Analysis of Algorithm 4. Some Mathematical Background
No ratings yet
Algorithms 2. Order 3. Analysis of Algorithm 4. Some Mathematical Background
41 pages
Unit-8 Rmi and Corba
No ratings yet
Unit-8 Rmi and Corba
27 pages
Jenkins Architecture
No ratings yet
Jenkins Architecture
6 pages
Home Assignment1 2
No ratings yet
Home Assignment1 2
7 pages
Internet Programming Laboratory
No ratings yet
Internet Programming Laboratory
2 pages
Assignment Part1 SP22 2014
No ratings yet
Assignment Part1 SP22 2014
6 pages
Tutorial (2024-02)
No ratings yet
Tutorial (2024-02)
213 pages
Sap CRM Technical
100% (2)
Sap CRM Technical
30 pages
Translating Termite
No ratings yet
Translating Termite
5 pages
Recommender System With Sentiment Analysis: Summer Internship Report
No ratings yet
Recommender System With Sentiment Analysis: Summer Internship Report
58 pages
Use FSG Transfer Feature To Transfer Reports F
No ratings yet
Use FSG Transfer Feature To Transfer Reports F
2 pages
SQL Grouping Records, Joins in SQL
No ratings yet
SQL Grouping Records, Joins in SQL
42 pages
Se Unit 2 Analysis Modelling
No ratings yet
Se Unit 2 Analysis Modelling
68 pages
Name Class XI Roll No. School: K.V. Jalipa Cantt, Barmer
No ratings yet
Name Class XI Roll No. School: K.V. Jalipa Cantt, Barmer
6 pages
Solutions of Homework 1
100% (1)
Solutions of Homework 1
9 pages
Unit-5 Operator Overloading
No ratings yet
Unit-5 Operator Overloading
8 pages
Zto Implement Hill Climbing Problem To Print "Hello World"
No ratings yet
Zto Implement Hill Climbing Problem To Print "Hello World"
4 pages
Krishna File
No ratings yet
Krishna File
113 pages

MC-module 3 C Compilers and Optimization (BCS402)

Uploaded by

MC-module 3 C Compilers and Optimization (BCS402)

Uploaded by

MICROCONTROLLERS(BCS402)

Overview of C Compilers and Optimization:

Here are common optimization techniques:

void memclr(char *data, int N)

For Loop Initialization and Condition:

Basic C Data Types

Local variable types:

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 3

short checksum_v3(short *data)

CMP r1,#0x40 ; compare i, 64

MOV r0,r0,ASR #16 ; sum = (short)r0

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 5

CMP r1,#0x40 ; compare i, 64

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 6

instead. Unsigned types are faster for divisions operation.

➢ Avoid implicit or explicit narrowing casts in expressions because they usually

Loops with a Fixed Number of Iterations

It takes three instructions to implement the for loop structure:

SUBS and BNE instructions implement the loop

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 8

■ What if the number of loop iterations is not a multiple of the unroll

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 9

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 10

In theory, the C compiler can assign 14 variables to registers without spillage. In

Efficient Register Allocation

Fig: ATPCS argument passing.

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 12

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 13

Case2: using Structure

There are other ways of reducing function call overhead

For efficient use of calling a functions

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 15

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 16

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 17

Writing Assembly Code

#include <stdio.h> AREA |.text|, code, readonly

int square(int i); EXPORT square ; int square(int i)

int main(void) { square

int i; MUL r1, r0, r0 ; r1 = r0 * r0

for (i=0; i<10; i++) { MOV r0, r1 ; r0 = r1 MOV pc, lr ;

printf("Square of %d is %d\n", i, return r0

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 18

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 19

The accuracy of a pc-sampled profiler is limited, as it can produce meaningless results if it

Dept. of CSE (Data Science), SVIT , Asst. Prof. Anitha C S Page 20

You might also like