Module 2
An instruction set architecture (ISA) defines:
o instructions
o data types
o processor registers
o main memory hardware
o the input/output model
o addressing modes
Programmers and system engineers rely on the ISA for guidance on how to program various
activities.
Instruction sets work with other important parts of a computer, such as compilers and interpreters.
Those components translate high-level programming code into machine code that the processor
can understand.
Think of the ISA as a programmer's gateway into the inner workings of a computer.
The barrel shifter provides five types of shifts and rotates (LSL, LSR, ASR, ROR, and RRX) that can be applied to Operand2. These are not instructions in their own right in ARM mode; they modify the operand of a data-processing instruction.
Certain ARM instructions such as MUL, CLZ and QADD cannot use the barrel shifter.
The pre-processing (shift) occurs within the cycle time of the instruction. This is useful for multiplying or dividing a value by a power of 2.
Instructions that use the barrel shifter are illustrated with an example below.
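As a small illustration (a sketch; the function name is assumed), multiplying by a constant power of two lets the compiler use a shifted operand instead of a multiply instruction:

    /* times_four: the compiler can implement x * 4 with the barrel shifter,
       typically emitting something like  MOV r0, r0, LSL #2
       rather than a MUL instruction. */
    int times_four(int x)
    {
        return x * 4;
    }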
3.1.3 Arithmetic Instruction
These instructions carry out addition and subtraction of 32-bit signed and unsigned values.
3.7 Programs
Program to find the sum of the first 10 numbers.
Program to find the factorial of a number. The result is stored in register R0.
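A rough C sketch of the factorial computation (the function name is assumed); under the ARM procedure call standard the returned result ends up in register R0:

    /* factorial: computes n! iteratively.
       The return value is passed back in R0 per the APCS. */
    unsigned int factorial(unsigned int n)
    {
        unsigned int result = 1;
        while (n > 1) {
            result *= n;
            n--;
        }
        return result;
    }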
Overview of C compilers and Optimization
5.1 Introduction
Differences between C and Embedded C:
1. C is a structured, general-purpose programming language used by developers to build desktop-based applications, whereas Embedded C is generally used to develop microcontroller-based applications.
2. C is a high-level programming language, whereas Embedded C is an extension variant of the C language.
3. C is hardware independent, whereas Embedded C is truly hardware dependent.
4. The compilers for C are OS dependent, whereas the compilers for Embedded C are OS independent.
5. In C, the traditional or standard compilers are used to run the program, whereas Embedded C needs a specific compiler that can generate microcontroller-based code.
6. Famous compilers used for C are Intel C++, Borland Turbo C, and more, whereas famous compilers used for Embedded C are the Keil compiler, BiPOM Electronics, and Green Hills Software.
What is compiler optimization in Embedded C?
Optimization is a series of actions taken by the compiler during code generation to reduce the number of instructions (code-space optimization), memory access time (time optimization), and power consumption.
The compiler optimization process should meet the following objectives:
o The optimization must be correct; it must not, in any way, change the meaning of the program.
o Optimization should increase the speed and performance of the program.
o The compilation time must be kept reasonable.
o The optimization process should not delay the overall compiling process.
With the loop counter i declared as char, the compiler must insert an AND instruction to reduce the counter to 8 bits after each increment. Instead of declaring i as char, if we declare it as unsigned int, the AND instruction can be removed, as the compiler output for that version shows.
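A minimal sketch of the kind of loop in question (function names, packet size, and data layout are assumed):

    /* char loop counter: after each i++ the compiler must truncate i back
       to 8 bits (the AND instruction mentioned above). */
    int checksum_char(int *data)
    {
        char i;
        int sum = 0;
        for (i = 0; i < 64; i++) {
            sum += data[i];
        }
        return sum;
    }

    /* unsigned int loop counter: the truncation is no longer needed. */
    int checksum_uint(int *data)
    {
        unsigned int i;
        int sum = 0;
        for (i = 0; i < 64; i++) {
            sum += data[i];
        }
        return sum;
    }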
Suppose the data packet contains 16-bit values and we need a 16-bit checksum. In that case, the expression sum + data[i] is an integer, and an explicit typecast to short is carried out on each iteration.
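One possible shape for such a checksum (names and packet size assumed), with sum declared as short:

    /* 16-bit data, 16-bit checksum: sum + data[i] is computed as an int
       and explicitly cast back to short on every iteration. */
    short checksum_short(short *data)
    {
        unsigned int i;
        short sum = 0;
        for (i = 0; i < 64; i++) {
            sum = (short)(sum + data[i]);
        }
        return sum;
    }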
The corresponding assembly language compiler output is
The loop is now three instructions longer than the previous one. The reasons are:
o The LDRH instruction does not allow a shifted address offset, so the address calculation must be done with a separate ADD, and the data at that address is then loaded and summed; LDRH supports only a simple base-plus-offset addressing mode.
o The explicit typecast requires two extra MOV instructions: the compiler shifts left by 16 and then right by 16 to implement a 16-bit sign extend.
If the embedded C program is modified so that sum is an int inside the function and only the final result is converted to short, the code is better optimized, as below.
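A sketch of the improved version (names assumed): sum is kept as an int inside the function, *(data++) is used for the load, and only the final result is cast to short:

    short checksum_int_sum(short *data)
    {
        unsigned int i;
        int sum = 0;                 /* wide accumulator inside the loop */
        for (i = 0; i < 64; i++) {
            sum += *(data++);        /* single load with pointer post-increment */
        }
        return (short)sum;           /* one cast, outside the loop */
    }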
The *(data++) operation translates to a single ARM instruction that loads the data and increments the pointer.
The corresponding assembly output of the compiler is as below.
The compiler is still performing one cast to a 16-bit range on the return value, outside the loop. If we make the function return an int, the two MOV instructions before the return can be removed.
We know that converting local variables from char or short to int increases performance and reduces code size. The same holds true for function arguments and return values.
Consider a function that adds two 16-bit values, halving the second, and returns a 16-bit sum. This is a useful test case to illustrate the problems faced by the compiler.
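Such a function might look like this (the function name is assumed; a and b follow the text):

    /* adds two 16-bit values, halving the second, and returns a 16-bit sum */
    short add_v1(short a, short b)
    {
        return a + (b >> 1);
    }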
The input values a, b and the return value will be passed in 32 bit registers.
Should the compiler assume that these 32-bit values are in the range of a short (i.e., -32768 to +32767)? Or should the compiler force the values into this range by sign-extending the lowest 16 bits to fill the 32-bit register?
The compiler must make compatible decisions between the function caller and callee about who performs the cast to the short type.
If the compiler passes arguments wide, the callee must reduce the arguments to the correct range.
If the compiler passes arguments narrow, then the caller must perform the task of reducing the arguments to the correct range.
In armcc, function arguments are passed narrow (i.e., the caller performs the cast) and return values are passed narrow (i.e., the callee casts the return value).
The following assembly code shows the narrow passing of arguments and the return value.
One version of the gcc compiler makes no assumptions about the range of argument values. It reduces the input arguments to the range of a short in both the caller and the callee. The compiler output is as below.
Addition, subtraction, and multiplication show no difference in performance between signed and unsigned operands. However, division is different.
(A 32-bit int has a minimum value of -2,147,483,648 and a maximum value of 2,147,483,647, inclusive.)
The compiler adds one to the sum before shifting right if the sum is negative, because for negative values a divide by 2 is not the same as an arithmetic right shift. If the data type is unsigned int, this second ADD instruction is not needed.
To understand the program, please try this code in Keil µVision 4.
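For illustration, a sketch of the two cases (function names assumed):

    /* signed: if (a + b) is negative, the compiler adds 1 to the sum before
       the arithmetic shift right, so the result rounds toward zero. */
    int average_signed(int a, int b)
    {
        return (a + b) / 2;
    }

    /* unsigned: a divide by 2 is exactly a logical shift right,
       so no correction (second ADD) is needed. */
    unsigned int average_unsigned(unsigned int a, unsigned int b)
    {
        return (a + b) / 2;
    }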
Section 5.3
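A sketch of a fixed-count incrementing loop of the kind discussed below (function name, array name, and iteration count are assumed):

    /* incrementing loop: counts up from 0 to 63 */
    int checksum_up(int *data)
    {
        unsigned int i;
        int sum = 0;
        for (i = 0; i < 64; i++) {
            sum += *(data++);
        }
        return sum;
    }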
This compiles to
It takes three instructions to implement the loop.
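The equivalent decrementing (count-down) loop, as a sketch with the same assumed names:

    /* decrementing loop: counts down from 64 to 0 */
    int checksum_down(int *data)
    {
        unsigned int i;
        int sum = 0;
        for (i = 64; i != 0; i--) {
            sum += *(data++);
        }
        return sum;
    }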
This compiles to
Here the loop contains only four instructions, which is better than the six we have in the incrementing loop.
SUBS and BNE implement the loop.
This works well when the loop counter is unsigned; we can then use either i != 0 or i > 0 as the continuation condition.
However, for a signed loop counter with the condition i > 0, the compiler will generate the following:
However, when i = -0x80000000, the two possible instruction sequences give different answers.
In the first sequence, the SUBS instruction compares i with 1 and then decrements i; since -0x80000000 < 1, the loop terminates.
In the second sequence, i is decremented first and then compared with 0; in this case i wraps around to +0x7fffffff, which is greater than 0, so the loop continues for many more iterations.
Because the two sequences disagree for this value, the compiler must generate the longer, safer sequence.
So, use i != 0 as the termination condition for both signed and unsigned loop counters.
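A sketch of the two loop forms for a signed counter (names assumed); the i != 0 form allows a short loop end, while i > 0 forces the longer sequence:

    int sum_ne(int *data, int n)
    {
        int sum = 0;
        int i;
        for (i = n; i != 0; i--) {   /* loop end: typically SUBS + BNE */
            sum += *(data++);
        }
        return sum;
    }

    int sum_gt(int *data, int n)
    {
        int sum = 0;
        int i;
        for (i = n; i > 0; i--) {    /* loop end: separate decrement, compare, branch */
            sum += *(data++);
        }
        return sum;
    }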
Suppose the packet size is unknown or arbitrary. Then we use a variable N that gives the number of data items in the packet, and the loop runs for a variable number of iterations N.
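A sketch of this variable-length version (names assumed):

    int checksum_N(int *data, unsigned int N)
    {
        int sum = 0;
        unsigned int i;
        for (i = N; i != 0; i--) {
            sum += *(data++);
        }
        return sum;
    }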
This compiles to
Notice that the compiler checks that N is nonzero on entry to the function, so that the loop body is skipped when N is 0. This check can be avoided if we use a do-while loop.
Example program with do-while
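A possible do-while version (names assumed; the caller is presumed to guarantee N >= 1):

    int checksum_do(int *data, unsigned int N)
    {
        int sum = 0;
        do {
            sum += *(data++);
        } while (--N != 0);    /* no check of N before the first iteration */
        return sum;
    }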
The compiler output is
Each loop iteration costs two instructions in addition to the body of the loop; this is called the loop overhead.
The subtract takes one cycle and the branch takes three cycles, giving an overhead of four cycles per loop.
We can save some of these cycles by unrolling the loop, that is, repeating the loop body several times and reducing the number of loop iterations accordingly. For example:
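A sketch of the loop unrolled four times (names assumed; N is taken to be a multiple of four):

    int checksum_unrolled(int *data, unsigned int N)
    {
        int sum = 0;
        do {
            sum += *(data++);    /* body repeated four times per iteration */
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            N -= 4;
        } while (N != 0);
        return sum;
    }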
This compiles to
With this, we have reduced the loop overhead from 4N cycles to N cycles.
However, there are two questions to be answered
1. How many times should the loop be unrolled?
2. What if the number of iterations is not a multiple of 4?
Only unroll loops that are important for the overall performance of the application. Otherwise, unrolling will increase the code size with little performance gain; sometimes it may even reduce performance.
For question 1: suppose a loop accounts for 30% of the entire application; we can unroll it until the loop body is about 0.5 KB in code size. The loop overhead is then only about 4 cycles compared to a loop body of around 128 cycles.
It is usually not worth unrolling when the gain is less than 1%.
For question 2: try to arrange for the number of iterations to be a multiple of the unroll amount.
Otherwise, add extra code to handle the leftover iterations; this still improves performance considerably.
Example
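One way to handle a count that is not a multiple of the unroll amount (a sketch, names assumed): an unrolled main loop followed by a second loop for the remaining items.

    int checksum_leftover(int *data, unsigned int N)
    {
        int sum = 0;
        unsigned int i;
        for (i = N / 4; i != 0; i--) {     /* main loop, unrolled four times */
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
        }
        for (i = N & 3; i != 0; i--) {     /* second loop: the 0-3 leftover items */
            sum += *(data++);
        }
        return sum;
    }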
Here the second loop handles the leftover iterations.
Writing loops efficiently
Spilling
When there are more live variables than available registers, the compiler places the excess variables on the stack. Such stacked variables are called spilled, since they are written out to memory.
Spilled variables are slow to access compared to variables allocated to registers.
To implement a function efficiently, you need to:
o Minimise the number of spilled variables.
o Ensure that critical variables are stored in registers.
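A tiny sketch of how spilling arises (all names assumed): if more variables are live at the same time than there are registers available, some of them must be kept on the stack.

    /* Sixteen accumulators are live across the whole loop; with only about
       14 registers available for allocation, the compiler is likely to spill
       some of them to the stack and reload them inside the loop. */
    int spill_example(int *data, int n)
    {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
        int s8 = 0, s9 = 0, s10 = 0, s11 = 0, s12 = 0, s13 = 0, s14 = 0, s15 = 0;
        int i;
        for (i = 0; i < n; i++) {
            int x = data[i];
            s0 += x;        s1 += x >> 1;   s2 += x >> 2;   s3 += x >> 3;
            s4 += x >> 4;   s5 += x >> 5;   s6 += x >> 6;   s7 += x >> 7;
            s8 += x >> 8;   s9 += x >> 9;   s10 += x >> 10; s11 += x >> 11;
            s12 += x >> 12; s13 += x >> 13; s14 += x >> 14; s15 += x >> 15;
        }
        return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7
             + s8 + s9 + s10 + s11 + s12 + s13 + s14 + s15;
    }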
Available registers (APCS register usage):
R0-R3 (A1-A4): Argument registers. These hold the first four function arguments on a function call and the return value on a function return. A function may corrupt these registers and use them as general scratch registers within the function.
R4-R8 (V1-V5): General variable registers. The function must preserve the callee values of these registers.
R9 (V6, SB): General variable register. The function must preserve the callee value of this register except when compiling for read-write position independence (RWPI); then R9 holds the static base address, which is the address of the read-write data.
R10 (V7, SL): General variable register. The function must preserve the callee value of this register except when compiling with stack limit checking; then R10 holds the stack limit address.
R11 (V8, FP): General variable register. The function must preserve the callee value of this register except when compiling using a frame pointer. Only old versions of armcc use a frame pointer.
R12 (IP): A general scratch register that the function can corrupt. It is useful as a scratch register for function veneers or other intra-procedure-call requirements.
R14 (LR): The link register. On a function call this holds the return address.
If the compiler does spill variables, it chooses which variables to spill based on
frequency of use.
A variable used inside a loop counts multiple times.
You can tell the compiler about important variables by using them within the innermost
loop.
The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return values in ARM registers.
Therefore, functions taking four or fewer arguments avoid the stack, which allows for
greater efficiency.
Two-word arguments such as long long or double are passed in a pair of consecutive argument registers and returned in R0 and R1.
For functions with more than four arguments, both the caller and callee must access the stack
for some arguments.
For C++, the first argument to an object method is the this pointer. This argument is implicit
and additional to the explicit arguments.
In general, if the number of arguments is greater than four, it is more efficient to group them into a structure and pass a pointer to the structure.
Example
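A sketch of the structure-based version (type and function names assumed): the queue state is grouped into a structure so the function needs only three arguments.

    typedef struct {
        char *start;    /* first byte of the queue buffer       */
        char *end;      /* one byte past the end of the buffer  */
        char *ptr;      /* current insertion point              */
    } Queue;

    void queue_bytes(Queue *queue, const char *data, int N)
    {
        char *ptr = queue->ptr;
        int i;

        for (i = 0; i < N; i++) {
            *(ptr++) = *(data++);
            if (ptr == queue->end) {
                ptr = queue->start;   /* wrap around the circular buffer */
            }
        }
        queue->ptr = ptr;
    }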
This compiles to
This version has only three function arguments and hence requires only three argument registers.
The called function needs only a single register for the queue structure pointer.
The function call overhead can be further reduced by putting both the caller and the callee function in the same C file. The compiler then knows the code generated for the callee and can make optimizations in the caller:
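As a sketch (names assumed), a caller placed in the same file as queue_bytes above:

    void dostuff(Queue *queue, const char *data, int N)
    {
        /* other per-packet processing could go here */
        queue_bytes(queue, data, N);   /* with the callee visible, the compiler
                                          may replace a BL + return with a single
                                          branch (a tail call) or inline the call */
    }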
This compiles to
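The discussion that follows concerns a function of roughly this shape (a sketch; the names timer1 and step follow the text, everything else is assumed):

    /* increments two timers by the same step value, read through a pointer */
    void update_timers(int *timer1, int *timer2, int *step)
    {
        *timer1 += *step;    /* first read of *step, then a write through timer1 */
        *timer2 += *step;    /* second read of *step                             */
    }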
You would expect *step to be pulled from memory once and used twice, but that does not happen. Usually a compiler optimization called common subexpression elimination would kick in, so that *step is evaluated only once and the value is reused for the second occurrence.
However, the compiler cannot use this optimization here: it cannot be sure that the write to *timer1 does not affect the value read from *step. This forces the compiler to insert an extra load instruction.
Avoiding pointer aliasing