Module 2

The document discusses compiler optimization in embedded C. It explains that compiler optimization aims to reduce code size, memory access time, and power consumption while maintaining program correctness and reasonable compilation time. It also describes the basic data types used in ARM processors, noting that char is typically unsigned 8-bit while int and long are preferred over char and short for local variables. An example shows how declaring a loop counter as char adds unnecessary instructions compared to declaring it as int.


Chapter-3

INTRODUCTION TO THE ARM INSTRUCTION SET


The Instruction Set Architecture is fundamental for building fast, efficient computers that optimize
memory and processing resources. It specifies the following supported capabilities:

 instructions
 data types
 processor registers
 main memory hardware
 input/output model
 addressing modes

Programmers and system engineers rely on the ISA as the definitive description of what the processor can do and how to program it.

Instruction sets work with other important parts of a computer, such as compilers and interpreters.
Those components translate high-level programming code into machine code that the processor
can understand.

Think of the ISA as a programmer's gateway into the inner workings of a computer.

Opcode and Operand


Each assembly language statement is split into an opcode and operands. The opcode is the operation executed by the CPU, and the operands are the data, registers, or memory locations used to execute that instruction. For example, in ADD r0, r1, r2 the opcode is ADD and the operands are the registers r0, r1, and r2.
ARMv7 Instruction set Architecture.
 Different ARM architecture revisions support different instructions. However, new revisions are backward compatible.
 The ARMv7 architecture is a 32-bit processor architecture.
 It is also a load/store architecture, meaning that data-processing instructions operate only
on values in general purpose registers.
 Only load and store instructions access memory.
 General purpose registers are also 32 bits.
 By a word, we mean 32 bits. A double-word is therefore 64 bits and a half-word is 16 bits wide.
Classes of Instructions
1. Data Processing Instructions
2. Branch Instructions
3. Load-Store Instructions
4. Software Interrupt Instructions
5. Program Status Register Instructions

3.1 Data Processing Instructions


These are the fundamental arithmetic and logical operations of the processor and operate on values in
the general-purpose registers, or a register and an immediate value.
Multiply and divide instructions can be considered special cases of these instructions.
Data processing instructions mostly use one destination register and two source operands.
The general format is the instruction mnemonic, followed by the destination register and then the source operands.

The data processing operations include:


 Move Instructions
 Arithmetic Instructions
 Logical Instructions
 Comparison Instructions
 Multiply Instructions

3.1.1 Move Instructions


These are the simplest ARM instructions.
A move instruction copies N into a destination register Rd, where N is a register or an immediate value.
It is used to initialize a register or to transfer data between registers.
Example
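The original example is not reproduced here. As a hedged sketch, assuming a GCC toolchain targeting ARM state, the register and immediate forms of MOV can be shown with inline assembly (function names are illustrative):

#include <stdint.h>

/* Illustrative only: register form of MOV, Rd := N where N is a register. */
uint32_t move_register(uint32_t n)
{
    uint32_t rd;
    __asm__("MOV %0, %1" : "=r"(rd) : "r"(n));
    return rd;
}

/* Illustrative only: immediate form of MOV, Rd := 200. */
uint32_t move_immediate(void)
{
    uint32_t rd;
    __asm__("MOV %0, #200" : "=r"(rd));
    return rd;
}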

3.1.2 Barrel Shifter


The barrel shifter is a functional unit that can be used in a number of different circumstances. It sits in front of the ALU and pre-processes the second operand before it reaches the ALU.

It provides five types of shifts and rotates that can be applied to Operand2. (In ARM mode these are not separate instructions; they are applied as part of another instruction.)
Certain ARM instructions, such as MUL, CLZ and QADD, cannot use the barrel shifter.
The pre-processing shift occurs within the cycle time of the instruction, which is useful for multiplying or dividing a value by a power of 2.
Instructions that use the barrel shifter are illustrated below.
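The original examples are not reproduced here. As a hedged C sketch of the idea, an add of a shifted value typically maps onto a single instruction whose second operand passes through the barrel shifter:

/* y = a + 4*b: ARM compilers typically emit something like
       ADD r0, r0, r1, LSL #2
   using the barrel shifter on the second operand, with no separate shift instruction. */
int add_scaled(int a, int b)
{
    return a + (b << 2);
}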
3.1.3 Arithmetic Instruction
Used to carry out addition and subtraction of 32 bit signed and unsigned values.

Simple subtract instruction

Reverse Subtract Instruction
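The original instruction examples are not shown here. As a hedged C sketch, reverse subtract (RSB) is what a compiler typically uses when a variable is subtracted from a constant:

/* SUB computes x - y; RSB computes y - x (operands reversed).
   For 49 - x, ARM compilers typically emit a single
       RSB r0, r0, #49
   rather than loading 49 into a register and subtracting. */
int subtract_from_49(int x)
{
    return 49 - x;
}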


3.1.4 Using the Barrel shifter with Arithmetic Instructions

3.1.5 Logical Instruction


Bitwise logical operations can be performed using these instructions (AND, ORR, EOR, BIC).
3.1.6 Comparison Instructions
Comparison instructions are used to compare or test a register. They always update the cpsr flags
according to the result, so there is no separate S-suffix form of these instructions.
CMP is actually a subtract instruction but the results are simply discarded. However, the cpsr
flags are modified.
TST is a logical AND operation.
TEQ is logical Exclusive OR operation.
In all cases, only the cpsr register is modified; no other registers are changed, and the numeric
results are simply discarded.
3.1.7 Multiply Instructions
Multiply instructions multiply the contents of a pair of registers.
The long multiply instructions produce a 64-bit result. In that case, RdLo holds the lower 32 bits
of the 64-bit result and RdHi holds the upper 32 bits, so two destination registers must be
specified to hold the result.
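The original examples are not reproduced here. As a hedged C sketch, a long multiply typically appears when a 32x32 -> 64-bit product is computed:

#include <stdint.h>

/* On ARM this usually compiles to a single UMULL, e.g.
       UMULL r0, r1, r0, r1    ; RdLo = low 32 bits, RdHi = high 32 bits
   (exact register choice depends on the compiler). */
uint64_t mul_u32_to_u64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}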

3.2 Branch Instructions


Branch instructions change the flow of execution or call a subroutine.
3.3 Load Store Instructions
These instructions are used to transfer data between memory and processor registers.
There are three types:
 Single-register transfer,
 Multiple Register Transfer,
 Swap
3.3.1 Single-Register Transfer
These instructions are used to move a single data item into or out of a register.
With them, we can transfer signed or unsigned 32-bit or 16-bit data.
3.4 Software Interrupt Instructions (SWI)
It causes a software interrupt exception and provides a gateway for calling operating system
routines.
When the program executes an SWI instruction, the processor sets the program counter (pc) to the
vector offset 0x8.
It also forces the processor mode to SVC (supervisor mode).
Each SWI has an associated SWI number.
3.5 Program Status Register Instructions
The ARM instruction set provides two instructions namely MRS and MSR to directly control the
psr.
3.5.1 Co-Processor Instructions
These instructions are used to extend the instruction set.
A coprocessor can either provide additional computational capability or control the memory
management.
Coprocessor instructions fall into three groups: data processing, register transfer, and memory
transfer. These instructions are used only by cores that have a coprocessor.

3.6 Loading Constants


Since ARM instructions are themselves only 32 bits wide, they cannot encode an arbitrary 32-bit
constant directly.
So, to load a 32-bit constant into a register, two pseudo-instructions are employed.

3.7 Programs
Program 1: find the sum of the first 10 numbers.
Program 2: find the factorial of a number.
The result is stored in register R0.
Overview of C compilers and Optimization

5.1 Introduction

Comparison of C and Embedded C

1. C: a structured, general-purpose programming language used by developers to build
   desktop-based applications.
   Embedded C: generally used to develop microcontroller-based applications.
2. C: a high-level programming language.
   Embedded C: an extension variant of the C language.
3. C: hardware independent.
   Embedded C: truly hardware dependent.
4. C: the compilers are OS dependent.
   Embedded C: the compilers are OS independent.
5. C: traditional or standard compilers are used to build and run the program.
   Embedded C: a specific compiler is needed that can generate code for the target
   microcontroller.
6. C: well-known compilers include Intel C++ and Borland Turbo C.
   Embedded C: well-known compilers include the Keil compiler, BiPOM Electronics tools, and
   Green Hills Software.
What is compiler optimization in Embedded C?

Optimization is a series of actions taken by the compiler during code generation to reduce the
number of instructions (code-space optimization), memory access time (time optimization), and
power consumption.
The compiler optimization process should meet the following objectives:
 The optimization must be correct; it must not, in any way, change the meaning of the
program.
 Optimization should increase the speed and performance of the program.
 The compilation time must be kept reasonable.
 The optimization process should not delay the overall compiling process.

5.2 Basic Data types


ARM processors have 32-bit registers and 32-bit data-processing operations, and use a load/store
architecture.
(No arithmetic or logical operation can act directly on memory; values must first be loaded into
registers.)
Earlier ARM architectures (ARMv4 and below) were not efficient at handling signed 8-bit or
16-bit values, so ARM C compilers define char to be an unsigned 8-bit value rather than a signed
8-bit value.
(In memory, characters and numbers are all stored simply as numbers; what matters is how the
compiler treats a value that is declared as char.)

Data type mappings used by armcc and gcc
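The mapping table itself is not reproduced here. As a hedged illustration of one practical consequence (plain char being unsigned on ARM compilers), consider:

#include <stdio.h>

int main(void)
{
    char c = -1;    /* on ARM compilers plain char is unsigned, so c becomes 255 */

    if (c < 0) {
        printf("char is signed on this target\n");
    } else {
        printf("char is unsigned on this target\n");   /* typical result with armcc/gcc for ARM */
    }
    return 0;
}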

5.3 Local Variable Types


Although ARM processors can load and store 8-bit, 16-bit and 32-bit data efficiently, their
data-processing operations work only on 32-bit values. So, it is advisable to use the int or long
type for local variables and to avoid char and short, even when working with 8-bit or 16-bit
values.
The exception is when you rely on modulo arithmetic, for example when a wrap-around such as
255 + 1 = 0 is required. (Here one can use char.)
Reason to avoid char as local variable
Example
Consider a function written to find the checksum of a data packet containing 64 words, as below.
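The original listing is not reproduced here; the following is a hedged sketch of such a function, with illustrative names rather than the book's own, and the loop counter deliberately declared as char:

/* Sketch: sum a packet of 64 words using a char loop counter.
   Because i is a char, the compiler must keep i in the range 0..255,
   so the compiled loop typically contains an extra instruction such as
       AND r1, r1, #0xff
   after each increment. */
int checksum_v1(int *data)
{
    char i;
    int  sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}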
Declaring the variable i as a char seems efficient, since a char occupies less space than an int in
a register or on the stack. However, this reasoning is wrong, because all ARM registers and stack
entries are 32 bits wide anyway.
For i++, the compiler must implement the behaviour required for i = 255: incrementing 255 must
wrap around to 0, so the compiler has to reduce i to the range 0-255 after each increment.
In the corresponding compiler output, this shows up as an extra AND instruction that masks i to 8 bits inside the loop.

Instead of declaring i as char, if we declare it as unsigned int, the AND instruction can be removed.
In the compiler output for the version where i is an unsigned int, the loop is correspondingly shorter.
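A hedged sketch of the corrected version (illustrative names):

/* With an unsigned int counter the wrap-to-255 bookkeeping disappears,
   so the compiler no longer needs the AND instruction in the loop. */
int checksum_v2(int *data)
{
    unsigned int i;
    int          sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}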
Suppose the data packet contains 16-bit values and we need a 16-bit checksum. In that case, the function might be written with short data and a short accumulator, as sketched below.
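A hedged sketch of such a version (illustrative names):

/* 16-bit data and a 16-bit checksum: sum + data[i] is promoted to int,
   so the result must be cast back to short on every iteration. */
short checksum_v3(short *data)
{
    unsigned int i;
    short        sum = 0;

    for (i = 0; i < 64; i++) {
        sum = (short)(sum + data[i]);
    }
    return sum;
}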

The expression sum + data[i] is promoted to an int, so an explicit cast back to short is needed on
every iteration. In the corresponding compiler output, the loop is now three instructions longer
than the previous one.
The reasons are:
 The LDRH instruction does not allow a shifted address offset, so the address must be
calculated with a separate ADD, and the halfword at that address is then loaded and summed.
 The explicit cast to short requires two MOV instructions: the compiler shifts left by 16 and
then right by 16 to implement a 16-bit sign extend.
If the program is modified so that sum is an int inside the function and only the final result is
converted to short, the code is better optimized, as sketched below.
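A hedged sketch of the improved version (illustrative names):

/* Accumulate in a 32-bit int and cast to short only once, at the end.
   The *(data++) access typically becomes a single load with
   post-increment addressing, e.g. LDRH r3, [r0], #2. */
short checksum_v4(short *data)
{
    unsigned int i;
    int          sum = 0;

    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return (short)sum;
}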
The *(data++) operation translates to a single ARM instruction that loads the data and increments
the pointer (a load with post-indexed addressing). In the corresponding compiler output, the cast
inside the loop has disappeared.

The compiler still performs one cast to a 16-bit range on the return value, outside the loop.
If we make the function return an int, the two MOV instructions before the return can also be
removed.

We know that converting local variables from char or short to int improves performance and
reduces code size. The same holds true for function arguments and return values.
Consider a function that adds two 16-bit values, halving the second, and returns a 16-bit sum, as sketched below.
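A hedged sketch of such a function (the name add_v1 is illustrative):

/* Adds two 16-bit values, halving the second, and returns a 16-bit sum.
   The arguments and the return value all travel in 32-bit registers,
   which raises the question of who narrows them to the short range. */
short add_v1(short a, short b)
{
    return a + (b >> 1);
}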

This program is actually a useful test case to illustrate the problem faced by the compiler.
The input values a, b and the return value will be passed in 32 bit registers.
Should the compiler assume that these 32-bit values are already in the range of the short type
(i.e., -32768 to +32767)? Or should it force the values into this range by sign-extending the
lowest 16 bits to fill the 32-bit register?
The caller and the callee must therefore make compatible decisions about who performs the cast to
the short range.
If the compiler passes arguments wide, the callee must reduce the arguments to the correct range;
if it passes arguments narrow, the caller must do so.
In armcc, function arguments are passed narrow (the caller performs the cast) and return values
are returned narrow (the callee casts the return value). The compiler output for the function
shows this narrow passing of arguments and return value.

One version of the gcc compiler makes no assumptions about the range of the argument values: it
reduces the input arguments to the range of a short in both the caller and the callee, which costs
extra instructions in the compiled output.

For addition, subtraction and multiplication it makes no difference to performance whether the
operands are signed or unsigned.
However, division is different.
(A 32-bit int has a minimum value of -2,147,483,648 and a maximum value of 2,147,483,647, inclusive.)

For a signed value, the compiler adds one to the sum before shifting right if the sum is negative,
because dividing a negative number by 2 is not the same as an arithmetic shift right.
If the data type is unsigned int, this second ADD instruction is not needed.
To see this, try the code in Keil µVision 4.
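The original listing is not reproduced here; a hedged sketch of the point (illustrative names, and the assembly in the comments is only the typical shape of the output):

/* Signed divide by two: the compiler typically emits something like
       ADD r0, r0, r0, LSR #31   ; add 1 if the sum is negative
       MOV r0, r0, ASR #1
   because an arithmetic shift right rounds towards minus infinity,
   while C division rounds towards zero. */
int average_signed(int a, int b)
{
    return (a + b) / 2;
}

/* Unsigned divide by two: a plain logical shift right is enough,
   so the extra ADD disappears. */
unsigned int average_unsigned(unsigned int a, unsigned int b)
{
    return (a + b) / 2;
}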

Efficient use of C types


 For local variables held in registers, do not use char or short unless 8-bit or 16-bit modular
arithmetic is required; use the signed or unsigned int types. unsigned int is faster when you use
division.
 For array entries and global variables held in main memory, use the smallest type that can hold
the required data. This reduces the memory footprint.
 Use explicit casts when reading array entries or global variables into local variables, or when
passing arguments to a function.
 Use explicit casts when writing local variables out to array entries, or when returning data.
 Avoid implicit or explicit narrowing casts in expressions, because they usually cost extra cycles.
 Avoid char and short types for function arguments and return values.
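A minimal hedged sketch of the casting guidelines (illustrative names):

short gain;              /* global held in memory: smallest type that fits saves footprint */

int scale(int x)
{
    int g = (int)gain;   /* explicit cast when reading the short global into an int local */

    return x * g;        /* all arithmetic is then done at 32-bit width */
}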

Section 5.3

Loops with a fixed number of iterations


Let us see how the compiler treats a loop with an incrementing counter (i++), as sketched below.
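The original listing is not reproduced here; a hedged sketch that reuses the checksum example with an incrementing counter (illustrative names, and the assembly in the comment is only the typical shape of the output):

/* Incrementing loop. The loop control typically compiles to three
   instructions per iteration, roughly:
       ADD r1, r1, #1      ; i++
       CMP r1, #64         ; compare i with the limit
       BCC loop            ; branch back while i < 64
   in addition to the loop body. */
int checksum_inc(int *data)
{
    unsigned int i;
    int          sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}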

In the compiled output it takes three instructions to implement the loop control. This is not
efficient for ARM; the loop control should need only two instructions:


 A subtract to decrement the loop variable. This also sets the condition flags on the result.
 Followed by a conditional branch instruction.
If we use a decrementing counter like this
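A hedged sketch (illustrative names; the assembly in the comment is the typical shape of the output):

/* Decrementing loop. The loop control typically compiles to just
       SUBS r2, r2, #1     ; decrement and set the condition flags
       BNE  loop           ; branch back while the result is nonzero
   so the whole loop here is around four instructions. */
int checksum_dec(int *data)
{
    unsigned int i;
    int          sum = 0;

    for (i = 64; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}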

In the compiled output the loop now contains only four instructions, compared with the six of the
incrementing version. SUBS and BNE implement the loop control.
This works well when the loop counter is unsigned: we can use either i != 0 or i > 0 as the
continuation condition.
However, for a signed loop counter with the condition i > 0, there are two candidate code
sequences: (1) a combined SUBS followed by a conditional branch, and (2) a separate decrement
followed by a CMP with 0 and a conditional branch.
When i = -0x80000000, the two sequences give different answers.
In case 1, the SUBS in effect compares i with 1 before decrementing; since -0x80000000 < 1, the
loop terminates.
In case 2, i is decremented first and then compared with 0; i now holds 0x7fffffff, which is
greater than 0, so the loop continues for many more iterations.
So, one must use i!=0 for signed or unsigned loop counters.
Suppose the packet size is unknown or arbitrary; we then use a variable N that gives the number of
data items in the packet, and the loop runs for a variable number of iterations N.
Notice that in the compiled output the compiler adds a check that N is nonzero at the entry to the
function. This check can be avoided if we use a do-while loop.
Example program with do-while
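A hedged sketch of such a program (illustrative names; the caller must ensure N is at least 1):

/* do-while version: the body always runs at least once, so the compiler
   does not need to test N before entering the loop. */
int checksum_dowhile(int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *(data++);
    } while (--N != 0);

    return sum;
}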
In the corresponding compiler output, the initial test of N disappears.

Each loop iteration costs two instructions in addition to the body of the loop; this is called the
loop overhead.
The subtract takes one cycle and the branch takes three cycles, giving an overhead of four cycles
per loop iteration.
We can save some of these cycles by unrolling the loop: repeating the loop body several times and
reducing the number of loop iterations accordingly. A sketch of a four-way unrolled version is
given below.
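A hedged sketch, assuming N is a nonzero multiple of four (illustrative names):

/* Four-way unrolled loop: the loop overhead (SUBS + BNE, about four cycles)
   is now paid once per four data words instead of once per word. */
int checksum_unrolled(int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);

    return sum;
}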

With this, we have reduced the loop overhead from 4N cycles to N cycles.
However, there are two questions to be answered
1. How many times one can unroll the loop?
2. What if the number of iteration is not a multiple of 4?

Only unroll loops that are important for the overall performance of the application. Otherwise,
unrolling increases the code size for little performance gain, and may sometimes even reduce
performance.
For example, suppose a loop accounts for 30% of the entire application; we could unroll it until
it is about 0.5 KB in code size. Then the loop overhead of roughly 4 cycles is small compared with
a loop body of around 128 cycles.
It is usually not worth unrolling when the gain is less than 1%.

For question 2, try to arrange that the number of iterations is a multiple of the unroll amount.
Otherwise, add extra code to handle the leftover iterations; this still improves performance
considerably.
Example
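A hedged sketch (illustrative names):

/* Unroll by four, with a second loop that handles the leftover iterations
   when N is not a multiple of four. */
int checksum_unrolled_any(int *data, unsigned int N)
{
    int sum = 0;

    while (N >= 4) {            /* main unrolled loop */
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    }
    while (N != 0) {            /* second loop: leftover 0..3 items */
        sum += *(data++);
        N--;
    }
    return sum;
}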

Here the second loop handles the leftover iterations.
Writing loops efficiently

 The compiler attempts to allocate a register to each local variable.


 It tries to use the same register for different local variables if the use of the variables does
not overlap.
 When the number of local variables exceeds the number of available registers, the excess
variables are stored on the stack.

Spilling
 Such stacked variables are called spilled since they are written out to memory.
 Spilled variables are slow to access compared to variables allocated to registers.
 To implement a function efficiently, you need to:
o Minimise the number of spilled variables.
o Ensure that critical variables are stored in registers.

AAPCS (ARM Architecture Procedure Call Standard) Registers


AAPCS is the ARM Architecture Procedure Call Standard. It is a convention that allows routines
written in high-level languages and assembly to interwork.
Rn       Name     Usage under AAPCS

R0-R3    A1-A4    Argument registers. These hold the first four function arguments on a
                  function call and the return value on a function return. A function may
                  corrupt these registers and use them as general scratch registers within
                  the function.

R4-R8    V1-V5    General variable registers. The function must preserve the callee values
                  of these registers.

R9       V6, SB   General variable register. The function must preserve the callee value of
                  this register except when compiling for read-write position independence
                  (RWPI); then R9 holds the static base address, which is the address of
                  the read-write data.

R10      V7, SL   General variable register. The function must preserve the callee value of
                  this register except when compiling with stack limit checking; then R10
                  holds the stack limit address.

R11      V8, FP   General variable register. The function must preserve the callee value of
                  this register except when compiling using a frame pointer. Only old
                  versions of armcc use a frame pointer.

R12      IP       A general scratch register that the function can corrupt. It is useful as
                  a scratch register for function veneers or other intra-procedure call
                  requirements.

R13      SP       The stack pointer, pointing to the full descending stack.

R14      LR       The link register. On a function call this holds the return address.

R15      PC       The program counter.

Available Registers

 R0..R12, R14 can all hold variables.


 Must save R4..R11, R14 on the stack if using these registers.
 Compiler can assign 14 variables to registers without spillage.
 But some compilers use a fixed register e.g. R12 as scratch and never keep values in it.
 Complex expressions need intermediate working registers.

Try to limit the inner loop of routines to at most 12 local variables.

 If the compiler does spill variables, it chooses which variables to spill based on
frequency of use.
 A variable used inside a loop counts multiple times.
 You can tell the compiler about important variables by using them within the innermost
loop.

The AAPCS defines how to pass function arguments and return values in ARM registers.

Four register rule

First four integer arguments to a function are passed in R0-R3.

The remainder of the arguments are passed on the stack.

 Therefore, functions taking four or fewer arguments avoid the stack, which allows for
greater efficiency.

Two-word arguments such as long long or double are passed in a pair of consecutive argument
registers, and a two-word result is returned in R0 and R1.
For functions with more than four arguments, both the caller and callee must access the stack
for some arguments.

For C++,the first argument to an object method is the this pointer. This argument is implicit
and additional to the explicit arguments.

In general, if the number of arguments is greater than four, it is more efficient to use
structures: group related arguments into a structure and pass a pointer to the structure.

Example
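The original listing is not reproduced here; the following is a hedged sketch of the idea, with a hypothetical Queue structure and illustrative names:

typedef struct {
    char *start;    /* first byte of the buffer  */
    char *end;      /* one past the last byte    */
    char *ptr;      /* current insertion point   */
} Queue;

/* Three arguments (queue pointer, data pointer, count), so everything
   fits in R0-R2 and nothing is passed on the stack. */
void queue_bytes(Queue *queue, const char *data, unsigned int N)
{
    char *ptr = queue->ptr;

    while (N != 0) {
        *(ptr++) = *(data++);
        if (ptr == queue->end) {    /* wrap around at the end of the buffer */
            ptr = queue->start;
        }
        N--;
    }
    queue->ptr = ptr;
}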

The resulting function has only three arguments and hence needs only three argument registers.

The callee function needs to assign a single register for the queue structure pointer.

The function-call overhead can be reduced further by putting the caller and the callee in the same
C file; the compiler then knows the code generated for the callee and can optimize the call site
in the caller.

Summary of Function calling


Pointer Aliasing

Two pointers are said to alias when they point to the same address.
If you write to one pointer, it will affect the value you read from the other pointer.
The compiler often doesn’t know which pointers alias.
The compiler must assume that any write through a pointer may affect the value read through any
other pointer. This can significantly reduce code efficiency.

The following function increments two timer values by a step amount.
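The original listing is not reproduced here; a hedged sketch of such a function (the name timers_v1 is illustrative):

/* Because timer1, timer2 and step could in principle point to the same
   address, the compiler must reload *step after the write to *timer1,
   giving three loads instead of the two you might expect. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}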

In the compiled output you would expect *step to be read from memory once and used twice. That
does not happen: the compiler loads it twice.

Usually a compiler optimization called common subexpression elimination would kick in, so that
*step was evaluated only once and reused for the second occurrence.
However, the compiler cannot apply that optimization here: it cannot be sure that the write to
*timer1 does not affect the value read from *step. This forces the compiler to insert an extra
load instruction.
Avoiding pointer aliasing
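A hedged sketch of the usual remedy: take a local copy of the value read through the pointer, so the compiler no longer has to assume it can change (the name timers_v2 is illustrative):

/* Taking a local copy of *step removes the aliasing problem:
   step is read from memory exactly once. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int delta = *step;

    *timer1 += delta;
    *timer2 += delta;
}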
