
Canara Engineering College

Computer Organization & ARM Microcontrollers
Course Code 21EC52 CIE Marks 50
Teaching Hours/Week (L: T: P: S) (3:0:2:0) SEE Marks 50
Total Hours of Pedagogy 40 hours Theory + 13 Lab slots Total Marks 100
Credits 04 Exam Hours 03
Module-5:
Introduction to the THUMB instruction set: Introduction, THUMB register usage, ARM – THUMB interworking, Other branch instructions,
Data processing instructions, Stack instructions, Software interrupt instructions.
Efficient C Programming: Overview of C Compilers and optimization, Basic C Data types, C looping structures.
Textbook : Chapters 4, 5
Teaching-Learning Process: Chalk and Talk, PowerPoint Presentation
RBT Level: L1, L2, L3
Suggested Learning Resources:
Textbooks
Andrew N. Sloss, Dominic Symes and Chris Wright, "ARM System Developer's Guide: Designing and Optimizing System Software", Elsevier, Morgan Kaufmann, 1st Edition, 2008.

Thumb Instruction Set


Thumb encodes a subset of the 32-bit ARM instructions into a 16-bit instruction set space.
• Thumb has higher performance than ARM on a processor with a 16-bit data bus, but lower performance than ARM on a 32-bit data bus.
• Use Thumb for memory-constrained systems.
• Thumb has higher code density.
• For memory-constrained embedded systems, for example, mobile phones and PDAs, code density is very important.
• A Thumb implementation of the same code takes up around 30% less memory than the equivalent ARM implementation.
The textbook shows the same divide routine implemented in both ARM and Thumb assembly code.

ARM-Thumb Mapping:

ARM-Thumb differences
• Most instructions execute unconditionally (only branches can be conditional).
• 2-address format for data processing.
• Less regular instruction formats.
Thumb exceptions
• An exception returns the processor to ARM state.
• On return from the exception the previous state is restored, as the SPSR is transferred back to the CPSR.
Thumb branching
• Short conditional branches
• Medium-range unconditional branches
• Long-range subroutine calls
• Branch to change to ARM state
Thumb-to-ARM decompression
• Translation from a 16-bit Thumb instruction to a 32-bit ARM instruction
• Condition bits changed to "always"
• Lookup to translate the major and minor opcodes
• Zero-extending 3-bit register specifiers to give 4-bit specifiers
• Zero-extending immediate values
• The implicit 'S' (set condition codes) is made explicit
• The Thumb 2-address format is mapped to the ARM 3-address format
Properties

• Thumb code requires about 70% of the space of the equivalent ARM code.
• Thumb code uses about 40% more instructions than the equivalent ARM code.
• With 32-bit memory, ARM code is about 40% faster than Thumb code.
• With 16-bit memory, Thumb code is about 45% faster than ARM code.
• Thumb code uses about 30% less external memory power than ARM code.

Thumb Register Usage


In Thumb state, you do not have direct access to all registers. Only the low registers r0 to r7 are fully accessible, as shown in the table below.
The higher registers r8 to r12 are only accessible with
MOV, ADD, or CMP instructions.
• CMP and all the data processing instructions that
operate on low registers update the condition flags in
the cpsr.
• There are no MSR- and MRS-equivalent Thumb
instructions.
• To alter the cpsr or spsr, you must switch into ARM
state to use MSR and MRS.

ARM-Thumb Interworking
Note: Switching between ARM and Thumb States of Execution Using BX Instruction
ARM-Thumb interworking is the name given to the method of linking ARM and Thumb code together for both
assembly and C/C++.
• It handles the transition between the two states.
• To call a Thumb routine from an ARM routine, the core has to change state.
• This state change is shown in the T bit of the cpsr.
• The BX and BLX branch instructions cause a switch between ARM and Thumb state while branching to a
routine.
Syntax:
BX Rm
BLX Rm | label

Unlike the ARM version, the Thumb BX instruction cannot be conditionally executed.
A branch exchange instruction can also be used as an absolute branch, providing bit 0 isn't used to force a state change:

You can see that the least significant bit of register r0 is used to set the T
bit of the cpsr. The cpsr changes from IFt, prior to the execution of the
BX, to IFT, after execution. The pc is then set to point to the start address
of the Thumb routine.

This example shows a small code fragment that uses both the ARM and Thumb versions of the BX instruction.
You can see that the branch address into Thumb has the lowest bit set. This sets the T bit in the cpsr to
Thumb state.
The return address is not automatically preserved by the BX instruction. Rather the code sets the return
address explicitly using a MOV instruction prior to the branch:

Replacing the BX instruction with BLX simplifies the calling of a Thumb routine since it sets the return
address in the link register lr:

Other Branch Instructions:


There are two variations of the standard branch instruction, or B.
• The first is similar to the ARM version and is conditionally executed; the branch range is limited to a signed 8-bit immediate, or −256 to +254 bytes.
• The second version removes the conditional part of the instruction and expands the effective branch range to a signed 11-bit immediate, or −2048 to +2046 bytes. (The BL subroutine-call instruction extends the range much further, to approximately ±4 MB.)
• The conditional branch instruction is the only conditionally executed instruction in Thumb state.

Data Processing Instructions:


• The data processing instructions manipulate data within registers.
• They include move instructions, arithmetic instructions, shifts, logical instructions, comparison instructions, and multiply instructions.
• The Thumb data processing instructions are a subset of the ARM data processing instructions.

Single Register Load – Store Instructions:


The Thumb instruction set supports loading and storing registers with the LDR and STR instructions.
• These instructions use two preindexed addressing modes:
o Offset by register
o Offset by immediate

Multiple-Register Load-Store Instructions:


The Thumb versions of the load-store multiple instructions are reduced forms of the ARM load-store multiple
instructions. They only support the increment after (IA) addressing mode.

Here N is the number of registers in the register list. You can see that these instructions always update the base register Rn after execution. The base register and the list of registers are limited to the low registers r0 to r7.
Example: STMIA r4!,{r1,r2,r3}

Stack Instructions:
The Thumb stack operations are different from the equivalent ARM instructions because they use the more
traditional POP and PUSH concepts.
The interesting point to note is that there is no stack
pointer in the instruction. This is because the stack
pointer is fixed as register r13 in Thumb operations
and sp is automatically updated. The list of registers is
limited to the low registers r0 to r7.
The PUSH register list can also include the link register lr; similarly, the POP register list can include the pc. This provides support for subroutine entry and exit. In the example below, the link register lr is pushed onto the stack together with register r1. Upon return, register r1 is popped off the stack and the return address is loaded into the pc, which returns from the subroutine.
Example:
    BL ThumbRoutine
    ; continue
ThumbRoutine
    PUSH {r1, lr}   ; enter subroutine
    MOV r0, #2
    POP {r1, pc}    ; return from subroutine
Software Interrupt Instruction:
Similar to the ARM equivalent, the Thumb software interrupt (SWI) instruction causes a software interrupt
exception. If any interrupt or exception flag is
raised in the Thumb state, the processor
automatically reverts back to the ARM state to
handle the exception.

The Thumb SWI instruction has the same effect and nearly the same syntax as the ARM equivalent. It differs in that the SWI number is limited to the range 0 to 255 and it is not conditionally executed.

Overview of C Compilers and Optimization:

The Embedded C programming language is used to develop applications that sit closest to the hardware, i.e., applications that communicate directly with the hardware. The major differences between C and Embedded C are:
C vs. Embedded C
• C is a high-level, general-purpose language used to design any type of desktop application; Embedded C is an extension of C focused on developing software for embedded systems.
• C is hardware-independent; Embedded C is fully hardware-dependent.
• C uses standard compilers that generate OS-dependent executable files; Embedded C uses specific cross compilers that generate hardware/microcontroller-dependent output files.
• Popular C compilers: GCC (GNU Compiler Collection), Borland Turbo C, Intel C++. Popular Embedded C compilers: Keil, BiPOM Electronics, Green Hills Software.
• C allows a free format of program coding; in Embedded C, formatting depends on the microcontroller that is used.
• C is used for desktop applications (GUIs, operating systems); Embedded C is used for embedded products (DVD/TV, vehicle tracking systems, digital cameras).
• C has a normal level of optimization and input can be given while the program is running; Embedded C displays a high level of optimization and only pre-defined input can be given to the running program.
• Fixing bugs is very easy in C; fixing bugs in Embedded C is more complicated.
Compilation is the process of converting C source code into machine code. As C is a mid-level language, it needs a compiler to convert it into executable code so that the program can run on the machine. A C program goes through the following phases during compilation: preprocessing, compilation, assembly, and linking.

What is Compiler Optimization :


Optimization is a series of actions taken by the compiler during code generation to reduce the number of instructions (code-space optimization), memory access time (time optimization), and power consumption. The optimization process should meet the following objectives:

• The optimization must be correct; it must not, in any way, change the meaning of the program.
• Optimization should increase the speed and performance of the program.
• The compilation time must be kept reasonable.
• The optimization process should not delay the overall compilation process.

Example: memclr

void memclr(char *data, int N)
{
    for ( ; N > 0; N--) {
        *data = 0;
        data++;
    }
}
No matter how advanced the compiler is, it does not know whether N can be 0 on input, nor whether the data pointer is four-byte aligned, so it must generate conservative, general-purpose code.
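Where the caller can guarantee these facts, rewriting the routine so they are explicit lets the compiler generate much tighter code. The following is only a sketch under assumptions not present in the original prototype (N is nonzero, N counts 32-bit words rather than bytes, and data is word-aligned):

/* Sketch only: assumes N > 0, N counts words (not bytes), and data is
 * word-aligned -- assumptions the original memclr prototype does not make. */
void memclr_int(int *data, unsigned int N)
{
    do {
        *data++ = 0;        /* one 32-bit store clears four bytes */
    } while (--N != 0);     /* count down to zero; do-while skips the initial N test */
}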
Basic Data types
Let's start by looking at how the ARM compiler (armcc) handles the basic C data types. We will see that some types are more efficient for local variables than others. There are also differences between the addressing modes available when loading and storing data of each type.

C data type
– char: unsigned byte (ARM C compiler)

– short: signed 16-bit


– int: signed 32-bit
– long: signed 32-bit
– long long: signed 64-bit
ARM’s load/store support
– Pre-ARMv4: LDRB/STRB, LDR/STR
– ARMv4: LDRSB/LDRH/LDRSH,STRH (H:half)
– ARMv5: LDRD/STRD (D:double)

ARM processors have 32-bit registers and 32-bit data processing operations, and they use a load/store architecture: no arithmetic or logical operation can work directly on memory. Architectures prior to ARMv4 were not good at handling signed 8-bit or 16-bit values, so the ARM C compilers define char to be an unsigned 8-bit value rather than a signed one. (In memory, characters and numbers are all stored simply as numbers; what matters is how the compiler treats a value that is declared as char.)

char: unsigned vs. signed ?


In the ARM C compiler, "char" is unsigned.
– This is because, prior to ARMv4, ARM processors were not good at handling signed 8-bit or 16-bit values, so the ARM C compilers define char to be an unsigned 8-bit value.
– For example:
char i;            // i is unsigned
while (i >= 0) …   // i is always >= 0, so the loop never exits
armcc will warn: "unsigned comparison with 0".
– Compilers also provide an override switch to make char signed:
GCC: "-fsigned-char"; armcc: "-zc"
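A minimal sketch of the pitfall (the function, bound, and variable names are illustrative only):

/* Sketch: with armcc's unsigned char, this countdown never terminates,
 * because i >= 0 is always true -- decrementing 0 wraps around to 255. */
void drain_buffer(void)
{
    char i;                      /* unsigned 8-bit under armcc */
    for (i = 7; i >= 0; i--) {   /* warning: unsigned comparison with 0 */
        /* process element i ... */
    }
}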

Data type mappings used by armcc and gcc


char: unsigned 8-bit byte
short: signed 16-bit half word
int, long: signed 32-bit word
long long: signed 64-bit double word

Prefer "int" to "char" for locals.

– For a local variable i, make it an "int" rather than a "char" (unless you want 8-bit wrap-around to occur, e.g., 255 + 1 = 0).
– Does char take less register space or less space on the ARM stack?
No! All ARM registers are 32 bits and all stack entries are at least 32 bits.

Local Variable Types
Although ARMv4-based processors can load and store 8-, 16-, and 32-bit data efficiently, their data processing operations are 32-bit only. It is therefore advisable to use the int or long type for local variables; avoid char and short even when working with 8- or 16-bit values. The exception is when you rely on modulo arithmetic and need wrap-around behaviour such as 255 + 1 = 0 (in that case a char can be used).
Reason to avoid char as a local variable: consider a function written to find the checksum of a data packet containing 64 words, as below.
int checksum(int *data)
{
    char i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
Declaring the loop variable i as a char may seem efficient, since it appears to take less space in a register and on the stack. However, this is not correct, because all ARM registers and stack entries are 32 bits wide. Furthermore, for i++ the
compiler has to generate code that handles the case i = 255, where incrementing wraps the value around to 0 (255 + 1 = 0). The corresponding compiler output for this code is given below.

checksum_v1_s
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v1_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
AND r1,r1,#0xff ; i = (char)r1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v1_loop ; if (i<64) loop
MOV pc,r14 ; return sum

Instead of declaring i as a char, if we declare it as an unsigned int, the AND instruction can be removed. The compiler output for the version in which i is declared as an unsigned int is:
checksum_v2_s
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v2_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1++
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v2_loop ; if (i<64) goto loop
MOV pc,r14 ; return sum
Local Variables as "short"
Suppose the data packet contains 16-bit values and we need a 16-bit checksum. It is tempting to declare both the data and the running sum as short, as in checksum_v3 below; we will see that keeping the sum in an int inside the function and converting the final result to short only at the end is the better-optimized approach.
short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++) {
        sum = (short)(sum + data[i]);
    }
    return sum;
}
The loop is now three instructions longer than the checksum_v2 loop, for two reasons:
• The LDRH instruction does not allow a shifted address offset, so the address calculation must be done with a separate ADD, and the data at that address is then loaded and summed.
• The explicit cast back to short requires two MOV instructions: the compiler shifts left by 16 and then arithmetic-shifts right by 16 to implement a 16-bit sign extend.

checksum_v3_s
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v3_loop

ADD r3,r2,r1,LSL #1 ; r3 = &data[i] // (1) Shifting


LDRH r3,[r3,#0] ; r3 = data[i] // LDRH
ADD r1,r1,#1 ; i++
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; r0 = sum + r3
MOV r0,r0,LSL #16 ;
MOV r0,r0,ASR #16 ; sum = (short)r0 // (2) Casting
BCC checksum_v3_loop ; if (i<64) goto loop
MOV pc,r14 ; return sum
Q1: LDRH does not allow for a shifted address offset as the LDR instruction did in checksum_v2.
– Ans: Access the array by incrementing the pointer data rather than using an index as in data[i]; all ARM load/store instructions have a post-increment addressing mode.
Q2: The cast reducing sum + data[i] to a short requires two MOV instructions.
– Ans: Use an int variable to hold the partial sum, and reduce it to a short only at the function exit.
Q3: You may wonder why not simply write sum += data[i].
– Ans: sum + data[i] is an integer, so it can only be assigned to a short using an (implicit or explicit) narrowing cast. With armcc, this code produces a warning if you enable the implicit narrowing cast warning with the compiler switch "-W+n".

Post-Increment – *(p++) in C
The *(data++) operation translates to a single ARM instruction that loads the data and increments the pointer. The corresponding assembly output of the compiler is shown below.
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;                 // Solution 2

    for (i = 0; i < 64; i++) {
        sum += *(data++);        // Solution 1: post-increment
    }
    return (short)sum;           // Solution 2
}
The compiler is still performing one cast to a 16-bit range on the return value outside the loop. If we make the function return an int, the two MOV instructions before the return can also be removed.
checksum_v4_s
MOV r2,#0 ; sum = 0
MOV r1,#0 ;i=0
checksum_v4_loop
LDRSH r3,[r0],#2 ; r3 = *(data++)
ADD r1,r1,#1 ; i++
CMP r1,#0x40 ; compare i, 64
ADD r2,r3,r2 ; sum += r3
BCC checksum_v4_loop ; if (i<64) goto loop
MOV r0,r2,LSL #16
MOV r0,r0,ASR #16 ; r0 = (short)sum
MOV pc,r14 ; return r0
Note:
– LDRSH with post-increment: three instructions have been removed from the loop.
– The MOV shifts implementing the cast are still present, but now sit outside the loop body.

Function Argument Types


We saw in "Local Variable Types" that converting local variables of type "char" or "short" to type "int" increases performance and reduces code size. The same holds for function arguments.
For example
short add_v1(short a, short b)
{
    return (short)(a + (b >> 1));
}
• The input values a and b are passed in 32-bit ARM registers.
– Should the compiler assume that these 32-bit values are in the range of a short (that is, -32768 to +32767)? Or should it force the values into this range by sign-extending them?
– The compiler must make compatible decisions for the function caller and callee: either the caller or the callee must perform the cast to a short type.
• Calling convention: narrow or wide?
– We say that function arguments are passed "wide" if they are not reduced to the range of the type and "narrow" if they are. (For armcc, arguments are passed narrow and values are returned narrow.)

Whatever the merits of different narrow or wide calling protocols, you can see that char or short type
function arguments and return values introduce extra casts.
It’s more efficient to use the int type for function arguments and return value, even if you are only passing
an 8-bit value.
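For comparison, here is a sketch of the same computation with int arguments and an int return value; no narrowing casts are needed in either the caller or the callee (the name add_v2 is illustrative):

/* Sketch: same computation as add_v1, but using int arguments and return
 * value, so neither caller nor callee needs a sign-extension cast. */
int add_v2(int a, int b)
{
    return a + (b >> 1);
}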

Signed versus Unsigned Types


• The previous sections demonstrate the advantage of using int rather than a char or short type for local
variable.
• This section compares the efficiencies of “signed int” and “unsigned int”.
– If your code use only ‘+’, ‘-’, ‘*’, there is no performance difference between signed and unsigned
operations.
– However, there is a difference when it comes to division (/).

Addition, subtraction, and multiplication perform identically whether the operands are signed or unsigned. Division, however, is different. (A 32-bit int has a minimum value of -2,147,483,648 and a maximum value of 2,147,483,647, inclusive.)
int average_v1(int a, int b)
{
    return (a + b) / 2;
}

compiles to

average_v1_s
    ADD r0,r0,r1          ; r0 = a + b
    ADD r0,r0,r0,LSR #31  ; if (r0 < 0) r0++ (one more instruction)
    MOV r0,r0,ASR #1      ; r0 = r0 >> 1
    MOV pc,r14            ; return r0

The compiler adds one to the sum before shifting right if the sum is negative. If the data type is unsigned int, the second ADD instruction is not needed, because a divide by two is only equivalent to a plain right shift when the value cannot be negative. (To explore this, try the code in Keil µVision 4.)
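A sketch of the unsigned variant (the name average_v2 is illustrative); with unsigned operands the compiler can implement the division as a single logical shift right, with no rounding fix-up:

/* Sketch: with unsigned operands, (a + b) / 2 reduces to a plain LSR #1,
 * so the extra ADD used to round a negative sum disappears.
 * (Assumes a + b does not overflow, as in the signed version.) */
unsigned int average_v2(unsigned int a, unsigned int b)
{
    return (a + b) / 2;
}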

Negative Division
• In C on an ARM target, a divide by two is not a right shift if x is negative.
– For example, -3 >> 1 = -2, but -3 / 2 = -1.
– Division rounds towards zero, but arithmetic right shift rounds towards -∞.
• It is more efficient to use unsigned types for divisions:
– the compiler converts unsigned power-of-two divisions directly into right shifts.

Efficient use of C type


• For local variables held in registers, don't use a char or short unless 8-bit or 16-bit modular arithmetic is necessary; use the signed or unsigned int types (unsigned int is faster when you use division).
• For array entries and global variables held in main memory, use the type with the smallest size that can hold the required data; this reduces the memory footprint.
• Use explicit casts when reading array entries or global variables into local variables (i.e., when passing arguments to a function).
• Use explicit casts when writing local variables out to array entries (i.e., when returning data).

• Avoid implicit or explicit narrowing casts in expressions, because they usually cost extra cycles.
• Avoid char and short types for function arguments and return values.
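A short sketch illustrating these guidelines (the array name, its size, and the function are illustrative only):

/* Sketch: small type in memory, int locals in registers, and explicit casts
 * at the load/store boundaries. All names here are illustrative. */
short samples[64];                  /* smallest type that holds the data */

void scale_sample(unsigned int i, int gain)
{
    int s = (int)samples[i];        /* explicit cast when reading into a local */
    s = s * gain;                   /* arithmetic done on int locals */
    samples[i] = (short)s;          /* explicit cast when writing back */
}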

C Looping Structures
Loops with a fixed number of iterations
Let’s see how the compiler treats a loop with incrementing count i++

int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return sum;
}

compiles to

checksum_v5_s
    MOV r2,r0             ; r2 = data
    MOV r0,#0             ; sum = 0
    MOV r1,#0             ; i = 0
checksum_v5_loop
    LDR r3,[r2],#4        ; r3 = *(data++)
    ADD r1,r1,#1          ; i++
    CMP r1,#0x40          ; compare i, 64
    ADD r0,r3,r0          ; sum += r3
    BCC checksum_v5_loop  ; if (i<64) goto loop
    MOV pc,r14            ; return sum

It takes three instructions to implement the loop.


– Counter increment: An ADD to increment i
– Comparing: A compare to check if i is less than 64
– Branch: A conditional branch
This is NOT efficient. On ARM, a loop should only use two instructions.
– A subtract to decrement the loop counter
– A conditional branch instruction

The key point is that the loop counter should count down to zero rather than counting up to some arbitrary
limit.

int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}

compiles to

checksum_v6_s
    MOV r2,r0             ; r2 = data
    MOV r0,#0             ; sum = 0
    MOV r1,#0x40          ; i = 64
checksum_v6_loop
    LDR r3,[r2],#4        ; r3 = *(data++)
    SUBS r1,r1,#1         ; i-- and set flags
    ADD r0,r3,r0          ; sum += r3
    BNE checksum_v6_loop  ; if (i!=0) goto loop
    MOV pc,r14            ; return sum

Q: "i != 0" or "i > 0" when i is signed?

– For an unsigned counter i, we can use either of the loop-continuation conditions "i != 0" or "i > 0"; since i cannot be negative, they are the same condition. For a signed counter i, it is tempting to use "i > 0".
You might expect the compiler to generate the following two instructions to implement the loop:
SUBS r1, r1, #1 ; compare i with 1, then i = i - 1
BGT loop        ; if (i+1 > 1) goto loop
– In fact, it generates three:
SUB r1, r1, #1  ; i--
CMP r1, #0      ; compare i with 0
BGT loop        ; if (i > 0) goto loop
The compiler is not being inefficient. It must be careful about the case i = -0x80000000, because the two code sequences give different answers in this case:
– With SUBS, since -0x80000000 < 1, the loop terminates.
– With SUB followed by CMP, modulo arithmetic means that i now has the value +0x7fffffff, which is greater than 0, so the loop continues.
So "i != 0" always wins: it saves one instruction over the signed "i > 0".

Loops Using a Variable Number of Iterations


Using the lessons from the last section, we count down until N=0 and don’t require an extra loop counter.
int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    for ( ; N != 0; N--) {
        sum += *(data++);
    }
    return sum;
}

compiles to

checksum_v7_s
    MOV r2,#0             ; sum = 0
    CMP r1,#0             ; compare N, 0
    BEQ checksum_v7_end   ; if (N==0) goto end
checksum_v7_loop
    LDR r3,[r0],#4        ; r3 = *(data++)
    SUBS r1,r1,#1         ; N-- and set flags
    ADD r2,r3,r2          ; sum += r3
    BNE checksum_v7_loop  ; if (N!=0) goto loop
checksum_v7_end
    MOV r0,r2             ; r0 = sum
    MOV pc,r14            ; return r0

Checking N == 0, Why ?
Compiler checks that N is nonzero on entry to the function
– Often, “check N” is unnecessary, since you know that the array won’t be empty. In this case, a “do-while”
loop gives better performance and code density than a “for” loop.
int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *(data++);
    } while (--N != 0);

    return sum;
}

The compiler output is

checksum_v8_s
    MOV r2,#0             ; sum = 0
checksum_v8_loop
    LDR r3,[r0],#4        ; r3 = *(data++)
    SUBS r1,r1,#1         ; N-- and set flags
    ADD r2,r3,r2          ; sum += r3
    BNE checksum_v8_loop  ; if (N!=0) goto loop
    MOV r0,r2             ; r0 = sum
    MOV pc,r14            ; return r0

With the do-while loop there is no need to check N == 0 on entry, saving two cycles per call.

Loop Unrolling
Each loop iteration costs two instructions in addition to the body of the loop; we call this the loop overhead. The subtract takes one cycle and the branch takes three cycles, giving an overhead of four cycles per iteration. We can save some of these cycles by unrolling the loop: repeating the loop body several times and reducing the number of loop iterations accordingly. For example:
int checksum_v9(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);

    return sum;
}

compiles to

checksum_v9_s
    MOV r2,#0             ; sum = 0
checksum_v9_loop
    LDR r3,[r0],#4        ; r3 = *(data++)
    SUBS r1,r1,#4         ; N -= 4 and set flags
    ADD r2,r3,r2          ; sum += r3
    LDR r3,[r0],#4        ; r3 = *(data++)
    ADD r2,r3,r2          ; sum += r3
    LDR r3,[r0],#4        ; r3 = *(data++)
    ADD r2,r3,r2          ; sum += r3
    LDR r3,[r0],#4        ; r3 = *(data++)
    ADD r2,r3,r2          ; sum += r3
    BNE checksum_v9_loop  ; if (N!=0) goto loop
    MOV r0,r2             ; r0 = sum
    MOV pc,r14            ; return r0

Speedup of Unrolling
Cycle counts:
– Loop overhead: SUB (1 cycle), branch (3 cycles)
– Loop body: LDR (3 cycles), ADD (1 cycle)
Cycles per array element:
– Rolled loop: 3 (load) + 1 (add) + 1 (sub) + 3 (branch) = 8
– Unrolled by four: [(3 + 1) × 4 + 1 + 3] / 4 = 20 / 4 = 5
Speedup:
– Old/New = 8/5 = 1.6
– On processors with faster loads (for example, ARM9TDMI) unrolling brings an even greater speedup.
Q: How many times (K) should I unroll the loop?
– Suppose the loop is very important, for example 30% of the entire application.
– Suppose you unroll the loop until its body is 0.5 KB of code (about 128 instructions).
– Then the loop overhead is at most 4 cycles compared to a loop body of around 128 cycles.
– The loop overhead cost is about 4/128, roughly 3% of the loop, and about 1% (3% × 30%) of the overall application.
Unrolling the code further gains little extra performance, but has a significant impact on the cache contents.

Summary: Writing Loops Efficiently


• Use loops that count down to zero;
• Use unsigned loop counters, and
– i!=0 rather than i>0;
• Use do-while loop rather than for loop
– This saves the compiler checking to see if counter is 0;
• Unroll important loops to reduce the loop overhead;
– but do not over-unroll, which hurts cache performance.

Register Allocation
The compiler attempts to allocate a register to each local variable.

– It tries to use the same register for different local variables if the use of the variables does not overlap.
– When the number of local variables exceeds the number of available registers, the excess variables are stored on the stack.
– Such stacked variables are called spilled, since they are written out to memory.
– Spilled variables are slow to access compared to variables allocated to registers.
– To implement a function efficiently, you need to:
o Minimise the number of spilled variables.
o Ensure that critical variables are stored in registers.

AAPCS (ARM Architecture Procedure Call Standard) Registers: the AAPCS is the procedure call standard for the ARM architecture. It is a convention that allows separately compiled routines and high-level languages to interwork.

• Register/Alias/Usage
– r0~3 (a1~4): arguments and return
– r4~11 (v1~8): general variable register
r9 (v6, sb): static base
• The function must preserve the value of this register, except when compiling for read-write position independence (RWPI).
r10 (v7, sl): stack limit
• The function must preserve the value of this register, except when compiling with stack-limit checking.
r11 (v8, fp): frame pointer
• The function must preserve the value of this register, except when compiling using a frame pointer (only old versions of armcc use fp).
r12 (ip)
• A general scratch register that the function can corrupt. It is useful as a scratch register for
function veneers or other intra-procedure call requirement.
r13/sp, r14/lr, r15/pc
Available Registers
– R0..R12, R14 can all hold variables.
– Must save R4..R11, R14 on the stack if using these registers.
– Compiler can assign 14 variables to registers without spillage.
– But some compilers use a fixed register e.g. R12 as scratch and never keep values in it.
– Complex expressions need intermediate working registers.

Lessons in Register Allocation


– Don't use more than 12 local variables in one function.
– In a nested loop, the innermost loop is the hottest, so give register priority to the variables used there.
– Trust the compiler's allocation rather than using the "register" keyword yourself.

Function Calls
There are overhead for function calls.

Four-register rule
– The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3.
– Arguments beyond the first four are transferred on the stack.
Two-word arguments
– Arguments such as "long long" or "double" are passed in a pair of consecutive argument registers, and two-word results are returned in r0 and r1.
– In C++, the first argument to an object method is the "this" pointer.
Use structure
– If more than four arguments, group them into structures.

Example: insert N bytes from the array data into a circular queue (copy data into the queue buffer).
char *queue_bytes_v1(
    char *Q_start,     /* Queue buffer start address */
    char *Q_end,       /* Queue buffer end address */
    char *Q_ptr,       /* Current queue pointer position */
    char *data,        /* Data to insert into the queue */
    unsigned int N)    /* Number of bytes to insert */
{
    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end) {
            Q_ptr = Q_start;
        }
    } while (--N);
    return Q_ptr;
}

compiles to

queue_bytes_v1_s
    STR r14,[r13,#-4]!  ; save lr on the stack
    LDR r12,[r13,#4]    ; r12 = N
queue_v1_loop
    LDRB r14,[r3],#1    ; r14 = *(data++)
    STRB r14,[r2],#1    ; *(Q_ptr++) = r14
    CMP r2,r1           ; if (Q_ptr == Q_end)
    MOVEQ r2,r0         ;   { Q_ptr = Q_start; }
    SUBS r12,r12,#1     ; --N and set flags
    BNE queue_v1_loop   ; if (N!=0) goto loop
    MOV r0,r2           ; r0 = Q_ptr
    LDR pc,[r13],#4     ; return r0

Using structure

typedef struct {
    char *Q_start;
    char *Q_end;
    char *Q_ptr;
} Queue;

void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;
    char *Q_end = queue->Q_end;

    do {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end) {
            Q_ptr = queue->Q_start;
        }
    } while (--N);
    queue->Q_ptr = Q_ptr;
}

compiles to

queue_bytes_v2_s
    STR r14,[r13,#-4]!  ; save lr on the stack
    LDR r3,[r0,#8]      ; r3 = queue->Q_ptr
    LDR r14,[r0,#4]     ; r14 = queue->Q_end
queue_v2_loop
    LDRB r12,[r1],#1    ; r12 = *(data++)
    STRB r12,[r3],#1    ; *(Q_ptr++) = r12
    CMP r3,r14          ; if (Q_ptr == Q_end)
    LDREQ r3,[r0,#0]    ;   Q_ptr = queue->Q_start
    SUBS r2,r2,#1       ; --N and set flags
    BNE queue_v2_loop   ; if (N!=0) goto loop
    STR r3,[r0,#8]      ; queue->Q_ptr = r3
    LDR pc,[r13],#4     ; return
Longer but more efficient: queue_bytes_v2 is one instruction longer than queue_bytes_v1, but it is in fact more efficient overall.
Is V2 less efficient?
– In each call there is one extra load to fetch Q_end from the structure, and in each iteration a (conditional) extra load for Q_start.
Or is it more efficient?
– V1 takes five arguments: each call requires four register setups plus a stack push and pull for the fifth argument.
– V2 takes three arguments: each call requires only three register setups and no stack traffic.
So V2 is better at reducing the per-call overhead.

Other Ways to Reduce Call Overhead

Put the C function in the same C file as its callers. If the function is very small and corrupts few registers (uses few local variables), the C compiler then knows the code generated for the callee and can make optimizations in the caller: the caller need not preserve registers that it can see the callee does not corrupt (otherwise the caller must assume all the ATPCS corruptible registers are destroyed).

inline
If the callee is very small, the compiler can inline its code into the caller, removing the function-call overhead completely.
Summary: Calling Functions Efficiently
• Try to restrict functions to four arguments or fewer.
• Define small functions in the same source file as, and before, their callers.
• Critical functions can be inlined using the __inline keyword.
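A minimal sketch of inlining with armcc's __inline keyword (the helper and caller names are illustrative):

/* Sketch: a small helper marked __inline. The compiler can expand each call
 * in place, removing the BL and return overhead entirely. */
__inline int square(int x)
{
    return x * x;
}

int sum_of_squares(int a, int b)
{
    return square(a) + square(b);   /* both calls expanded inline */
}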
Pointer Aliasing
Two pointers are said to alias when they point to the same address: if you write through one pointer, it affects the value read through the other. The compiler often does not know which pointers may alias, so it must assume that any write through a pointer may affect the value read through any other pointer. This can significantly reduce code efficiency.
The following function increments two timer values by a step amount.

void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

This compiles to

timers_v1
    LDR r3,[r0,#0]   ; r3 = *timer1
    LDR r12,[r2,#0]  ; r12 = *step
    ADD r3,r3,r12    ; r3 += r12
    STR r3,[r0,#0]   ; *timer1 = r3
    LDR r0,[r1,#0]   ; r0 = *timer2
    LDR r2,[r2,#0]   ; r2 = *step, AGAIN!
    ADD r0,r0,r2     ; r0 += r2
    STR r0,[r1,#0]   ; *timer2 = r0
    MOV pc,r14       ; return
You'd expect *step to be pulled from memory once and used twice. That does not happen: usually a compiler optimization called common subexpression elimination would kick in, so that *step was evaluated only once and reused for the second occurrence. However, the compiler cannot apply this optimization here. It cannot be sure that the write to *timer1 does not affect the value read from *step, so it is forced to insert an extra load instruction.
Avoiding pointer aliasing:
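One fix that keeps the same three-pointer interface is to read *step into a local variable so the compiler only loads it once. This is a sketch; the name timers_v2 is illustrative:

/* Sketch: read *step once into a local; the compiler keeps it in a register
 * and no longer needs to assume the write to *timer1 changed it. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int inc = *step;    /* single load of the step value */
    *timer1 += inc;
    *timer2 += inc;
}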

Eliminating Common Subexpressions with a Local Variable

void timers_v3(State *state, Timers *timers)
{
    int step = state->step;
    timers->timer1 += step;
    timers->timer2 += step;
}
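The State and Timers structure layouts are not shown in the notes; a minimal set of hypothetical definitions that makes the fragment above self-contained might look like this (in a real source file they would be declared before the function):

/* Hypothetical layouts, assumed only so that timers_v3 above compiles. */
typedef struct {
    int step;       /* increment applied to both timers */
} State;

typedef struct {
    int timer1;
    int timer2;
} Timers;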

Summary: Avoiding Pointer Aliasing


Don't rely on the compiler to eliminate common subexpressions involving memory accesses; instead, create a new local variable to hold the value.
Avoid taking the address of local variables; a variable whose address is taken may be inefficient to access from then on.

Question Bank:
1. What are the ARM and Thumb instruction sets, and how do they differ in terms of
instruction size and performance?
2. Explain the advantages of using Thumb instructions over ARM instructions in embedded
systems with limited memory.
3. How many general-purpose registers are available in the Thumb instruction set, and what
are their roles?
4. What is ARM-Thumb interworking, and why is it necessary in mixed-mode code execution
environments?
5. Discuss the mechanisms used for switching between ARM and Thumb modes during
interworking.
6. Apart from unconditional branches, what other types of branch instructions are available
in ARM and Thumb instruction sets?
7. Provide examples of conditional branch instructions and explain their usage.
8. Compare and contrast the syntax and functionality of data processing instructions in ARM
and Thumb modes.
9. How do stack instructions differ between ARM and Thumb modes, and what is their role in
managing the stack?
10. Explain the process of pushing and popping data onto and from the stack using stack
instructions.
11. What is a software interrupt, and how is it triggered in ARM and Thumb modes?
12. How are exceptions handled in ARM and Thumb modes, and what is the role of exception
vectors?
13. Describe the process of transitioning from normal execution to exception handling mode.
14. Explain the differences between load and store instructions in ARM and Thumb modes.
15. Discuss the addressing modes supported by load and store instructions and their impact
on memory access efficiency.

1. What is a compiler optimization, and why is it important in C programming?


2. What strategies can be employed to optimize code performance in ARM and Thumb
instruction sets?
3. How do basic C data types such as int, char, and double impact the generated assembly
code on ARM7 platforms?
4. What is the impact of C looping structures like for, while, and do-while loops on the
generated assembly code for ARM7 architectures?
5. Explain the concept of loop unrolling and how it affects the efficiency of code execution on
ARM7 platforms.
6. Discuss the advantages and disadvantages of loop unrolling in ARM7 embedded systems
programming.
7. How does register allocation impact the efficiency of assembly code generated from C
programs for ARM7 architectures?
8. Describe the process of register allocation performed by C compilers targeting ARM7
embedded systems.
9. What optimizations can be applied to function calls to improve code efficiency on ARM7
platforms?
10. Explain the concept of pointer aliasing and its implications for optimizing code in ARM7
embedded systems
11. How does pointer aliasing affect the efficiency of assembly code generated from C
programs?
12. Describe best practices for writing efficient C code for ARM7 embedded systems,
considering optimization techniques and architectural constraints.

- By. Dr. Ganesh V Bhat
