# **Thumb Instruction Programming**

Copyright (c) 2024 - 2014 Young W. Lim.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

Please send corrections (or suggestions) to youngwlim@hotmail.com.

This document was produced by using LibreOffice.

ARM System-on-Chip Architecture, 2<sup>nd</sup> ed, Steve Furber

Introduction to ARM Cortex-M Microcontrollers – Embedded Systems, Jonathan W. Valvano

Digital Design and Computer Architecture, D. M. Harris and S. L. Harris

ARM assembler in Raspberry Pi, Roger Ferrer Ibáñez

https://thinkingeek.com/arm-assembler-raspberry-pi/

#### ARM vs. Thumb programmer's models

| ARM state | Thumb state |
|-----------|-------------|
| R0        | R0          |
| R1        | R1          |
| R2        | R2          |
| R3        | R3          |
| R4        | R4          |
| R5        | R5          |
| R6        | R6          |
| R7        | R7          |
| R8        | R8          |
| R9        | R9          |
| R10       | R10         |
| R11       | R11         |
| R12       | R12         |
| R13 (SP)  | SP          |
| R14 (LR)  | LR          |
| R15 (PC)  | PC          |
| CPSR      | CPSR        |

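#### **ARM state**

- 16 + 1 = 17 normal registers (R0 to R15, plus the CPSR)

#### **Thumb state**

- 11 + 1 = 12 normal registers (R0 to R7, SP, LR, and PC, plus the CPSR); R8 to R12 are reachable only through a few instructions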

## ARM Register Sets (2-1)

- The biggest register <u>difference</u> involves the **SP** register.
  - The Thumb state has unique stack mnemonics (PUSH, POP).
  - The ARM state has no such stack mnemonics.
- PUSH and POP instructions <u>assume</u> the existence of a stack pointer (R13).
- PUSH and POP instructions translate into load and store instructions in the ARM state.

## ARM Register Sets (2-2)

- The CPSR register holds
  - processor mode bits (user or exception mode)
  - interrupt mask bits
  - condition codes
  - the Thumb status bit
- The Thumb status bit (T) <u>indicates</u> the processor's <u>current state</u>:
  - 0 for ARM state (default)
  - 1 for Thumb state
- Although other <u>bits</u> in the CPSR may be <u>modified</u> in software, it's <u>dangerous</u> to <u>write</u> to T directly; the results of an improper state change are *unpredictable*.

Condition code flags:

- N: Negative flag
- Z: Zero flag
- C: Carry flag
- V: Overflow flag

Interrupt mask bits:

- To <u>disable</u> Interrupt (**IRQ**), set **I**
- To <u>disable</u> Fast Interrupt (**FIQ**), set **F**

Processor modes:

- USR: User mode
- FIQ: Fast Interrupt mode
- IRQ: Interrupt mode
- SVC: Supervisor mode
- ABT: Abort mode
- UND: Undefined mode
- SYS: System mode
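The CPSR fields listed above can be pulled apart with a few shifts and masks. The following is a minimal Python sketch, not processor code: the function name `decode_cpsr` is hypothetical, and the bit positions and mode encodings are the classic ARM layout (N, Z, C, V in bits 31 to 28, I, F, T in bits 7 to 5, mode in bits 4 to 0), which is assumed here rather than taken from the slides.

```python
# Bit positions of the CPSR fields (classic ARM layout, assumed here).
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28, "I": 7, "F": 6, "T": 5}

# Encodings of the mode field, CPSR[4:0].
MODES = {
    0b10000: "USR",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "SVC",
    0b10111: "ABT",
    0b11011: "UND",
    0b11111: "SYS",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into flags, mask bits, state, and mode."""
    fields = {name: (cpsr >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["state"] = "Thumb" if fields["T"] else "ARM"
    fields["mode"] = MODES.get(cpsr & 0x1F, "reserved")
    return fields


if __name__ == "__main__":
    # 0x600000D3: Z and C set, IRQ and FIQ disabled, ARM state, SVC mode.
    print(decode_cpsr(0x600000D3))
```

Setting bit 5 in the example value (0x600000F3) would make the same call report the Thumb state.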



#### **Branch instructions**



BL and BLX copy the return address into LR (R14)



**BX** and **BLX** can change the processor state

https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/branch-and-control-instructions/b--bl--bx--blx--and-bxj

## Branch instructions and operand types

| Mnemonic | Meaning                       | with label       | with Rm       |
|----------|-------------------------------|------------------|---------------|
| B        | Branch                        | B {cond} label   | not available |
| BL       | Branch with Link              | BL {cond} label  | not available |
| BX       | Branch and eXchange           | not available    | BX {cond} Rm  |
| BLX      | Branch with Link and eXchange | BLX {cond} label | BLX {cond} Rm |

## **B** and **BL** instructions (1)

- B {cond} label
- BL {cond} label
- cond is an optional condition code
- label is a program-relative expression
- The **B** instruction
  - causes a <u>branch</u> to label.
- The **BL** instruction
  - copies the <u>address</u> of the next instruction into r14 (lr, the link register)
  - causes a <u>branch</u> to label.

## **B** and **BL** instructions (2)

- Machine-level B and BL instructions have a range of ±32MB from the address of the current instruction.
  - However, you can use these instructions even if label is <u>out of range</u>.
  - Often you do <u>not know</u> where the linker places label.
  - When necessary, the ARM linker <u>adds veneer code</u> to allow <u>longer branches</u>.



## **B** and **BL** instructions (3)

- The ARM BL instruction has a 24-bit immediate for encoding the branch offset.
- If the immediate were a byte offset, it would cover 2<sup>24</sup> bytes (16MB), or +/-8MB (given that the signed immediate allows forward or backward branches).
- All ARM instructions are 4 bytes long and must be aligned to their size,
- so there is <u>no need</u> to encode the *two* least significant bits of the address: the immediate counts words, not bytes.
- This multiplies the branch range by 4, from +/-8MB to +/-32MB.



| Bits  | 31–28 | 27–25 | 24 | 23–0                      |
|-------|-------|-------|----|---------------------------|
| Field | cond  | 1 0 1 | L  | Offset (24-bit immediate) |

- L = 0: **B**, L = 1: **BL**
- 2<sup>24</sup> bytes = 2<sup>4</sup> MB, giving +/- 8 MB as a byte offset
- as a word offset (× 4): +/- 32 MB

https://community.arm.com/support-forums/f/architectures-and-processors-forum/3061/range-of-bl-instruction-in-arm-state
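The range argument above can be checked by decoding a real instruction word. The sketch below is a Python model with a hypothetical function name, `arm_branch_target`. It sign-extends the 24-bit immediate, multiplies it by 4, and adds it to the PC value, which in ARM state reads as the instruction address plus 8. The test word `ebfffff5` at address `0x83b4` is the `bl 8390 <sum>` instruction from the disassembly listing later in this document.

```python
def arm_branch_target(instr_addr, word):
    """Return the target address of an ARM-state B or BL instruction word."""
    if (word >> 28) == 0xF or ((word >> 25) & 0b111) != 0b101:
        raise ValueError("not a conditional-space B/BL encoding")
    imm24 = word & 0x00FFFFFF
    if imm24 & 0x00800000:          # sign-extend the 24-bit immediate
        imm24 -= 1 << 24
    offset = imm24 * 4              # word offset -> byte offset
    return (instr_addr + 8 + offset) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(arm_branch_target(0x83B4, 0xEBFFFFF5)))   # 0x8390
    # Largest forward and backward byte offsets of the encoding:
    print((2**23 - 1) * 4, -(2**23) * 4)
```

The two printed extremes are 33554428 and -33554432 bytes, which is the +/-32MB range stated above.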

## BX and BLX instructions (1)

- BX {cond} Rm
- BLX {cond} label
- BLX {cond} Rm
- cond is an optional condition code
- label is a program-relative expression
- Rm is a register containing an address to branch to
- The **BX** instruction
  - causes a branch to the address contained in Rm
  - changes the instruction set, if required
- The **BLX** instruction
  - copies the <u>address</u> of the next instruction into r14 (lr, the link register)
  - causes a <u>branch</u> to label, or to the address contained in Rm
  - can <u>change</u> the instruction set

## **BX** and **BLX** instructions (2)

| Instruction      | Operand | State change                                                                         |
|------------------|---------|--------------------------------------------------------------------------------------|
| BX {cond} Rm     | Rm      | Rm[0] = **0** → to ARM state, Rm[0] = **1** → to Thumb state                         |
| BLX {cond} Rm    | Rm      | Rm[0] = **0** → to ARM state, Rm[0] = **1** → to Thumb state                         |
| BLX {cond} label | label   | always <u>changes</u> the state: ARM state → Thumb state, Thumb state → ARM state    |

Both ARM state and Thumb state provide BX and BLX.

## B, BL, BX, and BLX instructions



Thumb state  $\rightarrow$  ARM state

## Branch instructions – changing the state



https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/branch-and-control-instructions/b--bl--bx--blx--and-bxj

### BLX in ARM Architecture v5

In ARM Architecture v5, both ARM and Thumb state provide a **BLX** instruction that calls a subroutine <u>addressed by a register</u> and correctly sets the return address to the sequentially <u>next value</u> of the program counter.

/IHI0042E\_aapcs.pdf

## Switching the state (1) **BX** or **BLX**

- There are several ways to <u>enter</u> or <u>leave</u> the Thumb state properly.
- The usual method is via the Branch and Exchange (BX) instruction.
- also Branch, Link, and Exchange (BLX) if you're using an ARM with version 5 architecture.
- During the branch, the CPU examines the least significant bit (<u>lsb</u>) of the <u>destination address</u> to determine the <u>new state</u>.



## Switching the state (2) Exception Handler

- When an **exception** occurs, the processor automatically begins executing in ARM state at the address of the exception vector.
- So another way to <u>change state</u> is to place your 32-bit code in an <u>exception handler</u>.
- If the CPU is running in Thumb state when that exception occurs, you can count on it being in ARM state within the handler.
- If desired, you can have the exception handler put the CPU into Thumb state via a <u>branch</u>.

## Switching the state (3) T bit in the SPSR

The final way to change the state is via a **return** from **exception**.

- When returning from the processor's exception mode, the saved value of T in the SPSR register is used to restore the state.
- This T bit can be used, for example, by an <u>operating system</u> to <u>manually restart</u> a task in the <u>Thumb state</u> – if that's how it was running previously.
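The exception-based state changes described above come down to saving and restoring one bit. The following Python sketch is a deliberately reduced model with hypothetical function names: it tracks only the T bit (assumed to be bit 5 of the CPSR) and ignores the mode and interrupt-mask changes that a real exception entry also performs.

```python
T_BIT = 1 << 5  # Thumb status bit in the CPSR (assumed position)


def enter_exception(cpsr):
    """Model exception entry: save the CPSR in the SPSR and force ARM state.

    Returns (new_cpsr, spsr). Mode and mask bits are left untouched here,
    which is a simplification of what the processor really does.
    """
    spsr = cpsr
    new_cpsr = cpsr & ~T_BIT
    return new_cpsr, spsr


def return_from_exception(spsr):
    """Model exception return: the saved T bit comes back with the SPSR."""
    return spsr


if __name__ == "__main__":
    thumb_task = 0x00000030            # user mode with T set
    in_handler, saved = enter_exception(thumb_task)
    print(bool(in_handler & T_BIT))    # False: the handler runs in ARM state
    print(bool(return_from_exception(saved) & T_BIT))  # True: back in Thumb
```

An operating system that edits the saved value before returning can, in this model, choose the state a task restarts in.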

## Entering and leaving the Thumb state (1)

- several ways to <u>enter</u> or <u>leave</u> the <u>Thumb state</u> properly.
- the usual method is via the **BX** (Branch and EXchange) instruction.
- also **BLX** (**B**ranch, **L**ink, and **EX**change) with version 5 architecture.
- during the <u>branch</u>, the CPU examines the <u>lsb</u> of the <u>destination</u> address in a register operand to <u>determine</u> the <u>new state</u>.

Register-operand forms:

- BX {cond} Rm
- BLX {cond} Rm

with Rm: Rm[0] = **0** → to ARM state, Rm[0] = **1** → to Thumb state

https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/5655/question-about-a-code-snippet-on-arm-thumb-state-change

## Branch and Exchange (1)

- the Branch and Exchange (BX) instruction.
- also Branch, Link, and Exchange (BLX) if you're using an ARM with version 5 architecture.
- During the branch, the CPU examines the least significant bit (<u>lsb</u>) of the <u>destination address</u> to determine the <u>new state</u>.

State change by operand type:

- with label (**BLX** {cond} label): always changes the state (ARM state → Thumb state, Thumb state → ARM state)
- with Rm (**BX** {cond} Rm, **BLX** {cond} Rm): Rm[0] = **0** → to ARM state, Rm[0] = **1** → to Thumb state





## Branch and Exchange (2)

- Since all ARM instructions will align themselves on either a 32- or 16-bit boundary, the lsb of the address is not used in the branch directly.
- if the lsb is 1 when branching <u>from ARM state</u>, the processor <u>switches to Thumb state</u> before it begins executing from the new address;
- if the lsb is 0 when branching <u>from Thumb state</u>, the processor switches back <u>to ARM state</u>.

**BX** Rm and **BLX** Rm: the destination address is in the register Rm.

- If Rm[0] is 0, the processor goes to ARM state.
- If Rm[0] is 1, the processor goes to Thumb state.

**BLX** label: the destination address is the PC-relative label expression, and the state always changes (ARM → Thumb, Thumb → ARM).

## Entering and leaving the Thumb state (2)

- all ARM instructions are aligned on either a 32-bit or a 16-bit boundary, so the lsb of the destination address is <u>not used</u> in the branch directly.
- if the lsb is 1 when branching from ARM state, the processor switches to Thumb state <u>before</u> it begins executing from the new address;
- if the lsb is 0 when branching from Thumb state, the processor switches back to ARM state.
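The rule above can be written as a small Python model. The function name `bx` is only a label for this sketch. It returns the new state and the address that execution continues from. The case where bit 0 is clear but bit 1 is set is not a valid ARM-state target, so this sketch treats it as an error rather than guessing what the processor would do.

```python
def bx(rm):
    """Model the effect of BX Rm or BLX Rm on (state, pc)."""
    rm &= 0xFFFFFFFF
    if rm & 1:
        # lsb = 1: switch to (or stay in) Thumb state; the lsb itself
        # is dropped from the branch address.
        return "Thumb", rm & ~1
    if rm & 0b10:
        raise ValueError("ARM-state target must be word aligned")
    # lsb = 0: switch to (or stay in) ARM state.
    return "ARM", rm


if __name__ == "__main__":
    print(bx(0x8001))   # ('Thumb', 32768)
    print(bx(0x8000))   # ('ARM', 32768)
```

Both calls branch to the same address, 0x8000; only the lsb of the register value decides the instruction set.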

#### 32-bit word alignment

| word address | byte 3 | byte 2 | byte 1 | byte 0 |
|--------------|--------|--------|--------|--------|
| +0           | +3     | +2     | +1     | +0     |
| +4           | +7     | +6     | +5     | +4     |
| +8           | +11    | +10    | +9     | +8     |

A word address is the address of the least significant byte of a 4-byte word (little endian); its two least significant bits are <u>not used</u>.





https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/5655/question-about-a-code-snippet-on-arm-thumb-state-change

### 32-bit / 16-bit alignment

Since all ARM instructions have either a 32- or 16-bit alignment, the LSB of the address is <u>not used</u> in the branch directly.

- 32-bit (4-byte) word alignment: the least significant 2 bits of the target address are not used
- 16-bit (2-byte) halfword alignment: the least significant 1 bit of the target address is not used

The least significant bit can therefore be used to change the state (ARM  $\leftrightarrow$ Thumb).



https://www.cs.princeton.edu/courses/archive/fall13/cos375/ARMthumb.pdf

#### 16-bit halfword alignment

| halfword address | byte 1 | byte 0 |
|------------------|--------|--------|
| +0               | +1     | +0     |
| +2               | +3     | +2     |
| +4               | +5     | +4     |



## PC (Program Counter) R15 Register

The Program Counter (or PC) is a <u>register</u> inside the microprocessor that stores the memory <u>address</u> of the <u>next instruction</u> to be executed.

In ARM processors, the Program Counter is a 32-bit register which is also known as R15.

The processor first <u>fetches</u> the <u>instruction</u> from the <u>address</u> stored in the <u>PC</u>.

The fetched instruction is then <u>decoded</u> so that it can be interpreted by the microprocessor.

Once decoded, the instruction can then be <u>executed</u> and the PC <u>incremented</u> so that it contains the <u>address</u> of the <u>next instruction</u>.

#### The fetch-decode-execute cycle

fetch → decode → execute

## PC (Program Counter) R15 Register

- Memory addresses are given in bytes (byte addresses).
- For high performance, memory is usually <u>accessed</u> by a <u>word</u> that is <u>aligned</u> on a word boundary (word addresses).
- Memory can also be accessed by a <u>byte</u> or a <u>halfword</u>, with a performance loss.

In ARM processors,

- all <u>ARM</u> instructions take up <u>one word</u> (<u>4 bytes</u>).
- all <u>Thumb</u> instructions take up <u>one halfword</u> (2 bytes).

Incrementing the PC:

- in the ARM state: PC + 4
- in the Thumb state: PC + 2
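As a quick illustration of the two increments, this Python sketch lists the addresses of consecutive instructions in each state. The function name is hypothetical, and the model assumes straight-line code and the original 16-bit Thumb encoding, where every instruction occupies one halfword.

```python
INSTRUCTION_SIZE = {"ARM": 4, "Thumb": 2}   # bytes per instruction


def sequential_addresses(start, state, count):
    """Addresses of `count` consecutive instructions starting at `start`."""
    step = INSTRUCTION_SIZE[state]
    return [start + i * step for i in range(count)]


if __name__ == "__main__":
    print([hex(a) for a in sequential_addresses(0x8000, "ARM", 3)])
    print([hex(a) for a in sequential_addresses(0x8000, "Thumb", 3)])
```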



## PC (Program Counter) R15 Register

**3 stage pipeline execution** 



## Register relative and PC relative expressions (1)

armasm supports PC-relative and register-relative expressions.

- A register-relative expression evaluates to a named register combined with a numeric expression.
- A PC-relative expression is written as a label or the PC, optionally combined with a numeric expression:
  1. using a label
  2. using the PC
  3. using [PC, #number], for some instructions

https://developer.arm.com/documentation/dui0801/b/Cacdbfji

## Register relative and PC relative expressions (2)

If you specify a label, the assembler calculates the offset from the PC value of the current instruction to the address of the label.

The assembler encodes the offset in the instruction.

If the offset is too large, the assembler produces an error.

The offset is either <u>added</u> to or <u>subtracted</u> from the PC value to form the required address.

ARM recommends you write PC-relative expressions using labels rather than the PC, because the value of the PC depends on the instruction set.

https://developer.arm.com/documentation/dui0801/b/Cacdbfji

The value of the PC depends on the instruction set:

- In **A32** code (PC + 8), the value of the PC is the address of the current instruction plus 8 bytes.
- In **T32** code (PC + 4):
  - For **B**, **BL**, **CBNZ**, and **CBZ** instructions, the value of the PC is the address of the current instruction plus 4 bytes.
  - For <u>all other</u> instructions that use labels, the value of the PC is the address of the current instruction plus 4 bytes, with bit[1] of the result cleared to 0 to make it word-aligned.
- In **A64** code (PC), the value of the PC is the address of the current instruction.

https://developer.arm.com/documentation/dui0801/b/Cacdbfji







In T32 code, the two cases differ in which addresses the PC value can take:

- for B, BL, CBNZ, CBZ: the PC value can point to any halfword (a halfword address is the address of the least significant byte of a 2-byte halfword, little endian; bit[0] is <u>not used</u>)
- for all other instructions: the PC value can point only to a word (word aligned, with bit[1] = 0)

https://developer.arm.com/documentation/dui0801/b/Cacdbfji
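The three rules for the PC value can be collected in one function. This is a Python sketch of the rules as stated above, not assembler code. The function name `pc_value` and the `branch` flag (true for B, BL, CBNZ, and CBZ) are names invented for the example.

```python
def pc_value(instr_addr, isa, branch=False):
    """PC value seen by the instruction at `instr_addr`.

    isa    -- "A32", "T32", or "A64"
    branch -- True for B, BL, CBNZ, CBZ (only matters for T32)
    """
    if isa == "A32":
        return instr_addr + 8
    if isa == "T32":
        value = instr_addr + 4
        # Other label-using instructions see a word-aligned PC:
        # bit[1] of the result is cleared.
        return value if branch else value & ~0b10
    if isa == "A64":
        return instr_addr
    raise ValueError("unknown instruction set: " + isa)


if __name__ == "__main__":
    print(hex(pc_value(0x83A0, "A32")))                # 0x83a8
    print(hex(pc_value(0x1002, "T32", branch=True)))   # 0x1006
    print(hex(pc_value(0x1002, "T32")))                # 0x1004
    print(hex(pc_value(0x1000, "A64")))                # 0x1000
```

The two T32 results show why labels are the safer way to write PC-relative expressions: the same instruction address yields different PC values depending on the instruction.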

            LDR     r4,=data+4*n    ; n is an assembly-time variable
            ; code
            MOV     pc,lr
    data    DCD     value_0
            ; n-1 DCD directives
            DCD     value_n         ; data+4*n points here
            ; more DCD directives

https://developer.arm.com/documentation/dui0801/b/Cacdbfji

```
int f,g,y;//global variables
int sum(int a, int b){
    return (a+b);
}
int main(void){
    f = 2;
    g = 3;
    y = sum(f, g);
    return y;
}
```

```
00008390 <sum>:
int sum(int a, int b) {
return (a + b);
}
    8390: e0800001 add r0, r0, r1
    8394: e12fff1e bx lr
00008398 <main>:
int f, g, y; // global variables
int sum(int a, int b);
int main(void) {
    8398: e92d4008 push {r3, lr}
f = 2;
    839c: e3a00002 mov r0, #2
    83a0: e59f301c ldr r3, [pc, #28] ; 83c4 <main+0x2c>
    83a4: e5830000 str r0, [r3]
g = 3;
    83a8: e3a01003 mov r1, #3
    83ac: e59f3014 ldr r3, [pc, #20] ; 83c8 <main+0x30>
    83b0: e5831000 str r1, [r3]
y = sum(f, g);
    83b4: ebfffff5 bl 8390 <sum>
    83b8: e59f300c ldr r3, [pc, #12] ; 83cc <main+0x34>
    83bc: e5830000 str r0, [r3]
return y;
    83c0: e8bd8008 pop {r3, pc}
    83c4: 00010570 .word 0x00010570
    83c8: 00010574 .word 0x00010574
    83cc: 00010578 .word 0x00010578
```

https://stackoverflow.com/questions/24091566/why-does-the-arm-pc-register-point-to-the-instruction-after-the-next-one-to-be-e

In the listing above, each LDR uses a PC-relative address to load the address of the variable f, g, or y into r3.

For the instruction 83a0: e59f301c ldr r3, [pc, #28] ; 83c4 <main+0x2c>, the target address is 0x83c4, so the PC value used is 0x83c4 - 28 = 0x83c4 - 0x1C = 0x83a8, which is the instruction address 0x83a0 plus 8.

The PC value is the address of the instruction two after the one currently executing. ARM uses 32-bit instructions but byte addresses, so + 8 means 8 bytes, the length of two instructions.

This follows from the pipeline. ARM's 5-stage pipeline: fetch, decode, execute, memory, writeback.

The PC register is incremented by 4 each clock, so by the time an instruction reaches the execute stage, the PC has already advanced twice.

That is where the + 8 comes from: the PC points to the instruction being fetched, while the current instruction is the one being executed, so the PC refers to the instruction after the next one to be executed.

https://stackoverflow.com/questions/24091566/why-does-the-arm-pc-register-point-to-the-instruction-after-the-next-one-to-be-e
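The PC + 8 arithmetic can be verified against all three literal loads in the listing. The Python sketch below decodes the 12-bit offset and the up/down bit from an ARM-state `ldr rX, [pc, #imm]` word and computes the address it reads. The function name is hypothetical, and only this one addressing form (immediate offset, no write-back, base register PC) is handled.

```python
def ldr_literal_address(instr_addr, word):
    """Address read by an ARM-state `ldr rX, [pc, #imm]` instruction."""
    is_imm_form = ((word >> 25) & 0b111) == 0b010
    pre_indexed = ((word >> 24) & 1) == 1
    no_writeback = ((word >> 21) & 1) == 0
    is_load = ((word >> 20) & 1) == 1
    base_is_pc = ((word >> 16) & 0xF) == 15
    if not (is_imm_form and pre_indexed and no_writeback
            and is_load and base_is_pc):
        raise ValueError("not an ldr rX, [pc, #imm] encoding")
    imm12 = word & 0xFFF
    add = (word >> 23) & 1              # U bit: 1 = add, 0 = subtract
    offset = imm12 if add else -imm12
    return instr_addr + 8 + offset      # PC reads as instruction address + 8


if __name__ == "__main__":
    for addr, word in [(0x83A0, 0xE59F301C),
                       (0x83AC, 0xE59F3014),
                       (0x83B8, 0xE59F300C)]:
        print(hex(addr), "->", hex(ldr_literal_address(addr, word)))
```

The three results are 0x83c4, 0x83c8, and 0x83cc, matching the `.word` entries at the end of `main`.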

## Subroutine call (1) BL (Branch and link) operation

Both the ARM and Thumb instruction sets contain a primitive subroutine call instruction, **BL** target, which performs a branch-with-link operation:

- **LR** ← the return address (the <u>next value</u> of the PC)
- **PC** ← the destination address (target)
- LR[0] ← 1 if BL target was <u>executed</u> from Thumb state
- LR[0] ← 0 if BL target was <u>executed</u> from ARM state

The result is to transfer control to the destination address, passing the return address in LR as an <u>additional parameter</u> to the called subroutine

Control is returned to the instruction following the **BL** when the return address is loaded back into the PC

/IHI0042E\_aapcs.pdf
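The branch-with-link operation can be sketched as a Python function that returns the new LR and PC. The names are invented for this example. The model assumes the BL instruction occupies 4 bytes in both states (in Thumb state it is encoded as two halfwords), so the return address is the instruction address plus 4.

```python
def bl(instr_addr, target, state):
    """Model BL target executed at `instr_addr` in state "ARM" or "Thumb"."""
    return_addr = instr_addr + 4        # instruction following the BL
    if state == "Thumb":
        lr = return_addr | 1            # LR[0] = 1: return to Thumb code
    else:
        lr = return_addr                # LR[0] = 0: return to ARM code
    return {"LR": lr, "PC": target}


if __name__ == "__main__":
    # The `bl 8390 <sum>` at 0x83b4 from the listing, executed in ARM state.
    regs = bl(0x83B4, 0x8390, "ARM")
    print(hex(regs["LR"]), hex(regs["PC"]))     # 0x83b8 0x8390
    # The same kind of call from Thumb state sets the lsb of LR.
    print(hex(bl(0x1000, 0x2000, "Thumb")["LR"]))   # 0x1005
```

Because the state is recorded in LR[0], a later `bx lr` returns to the caller in the state it was called from.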



## Subroutine call (2) BL vs. BX







| Form      | Note                                                                                  |
|-----------|---------------------------------------------------------------------------------------|
| BX target | not valid: **BX** has no <u>label</u> operand                                         |
| BX R4     | valid, but a programmer must <u>explicitly</u> set the return address in **LR**       |

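## Subroutine call (3) BX (Branch and eXchange) operation

In ARM state, to call a subroutine addressed by **R4** with control returning to the following instruction, first copy the PC into LR (MOV LR, PC) and then branch (BX R4). Because the PC reads as the address of the current instruction plus 8, LR then holds the address of the instruction after the BX.

/IHI0042E\_aapcs.pdf

- LR[31:1] ← the return address
- LR[0] ← 0 to return to ARM code
- LR[0] ← 1 to return to Thumb code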

### Subroutine call (4) ARM vs. Thumb state





/IHI0042E\_aapcs.pdf

# Subroutine call (5) the lsb of a destination address



ARM is unusual among the processors by having the program counter available as a "general purpose" register.

Most other processors have the program counter hidden, and its value will only be disclosed as the return address when calling a function.

If you want to modify it, a jumping instruction is used.

For example, on the **x86**, the program counter is called the instruction pointer, and is stored in **eip**, which is <u>not</u> an accessible register.

After a function call, **eip** is <u>pushed</u> onto the stack, at which point it could be examined.

Return is done through the **ret** instruction which <u>pops</u> the return address off the stack, and jumps there.

Another example: on the **MIPS**, the program counter is stored into **register 31** after executing a **JALR** instruction, which is used for function calling.

The value in there can be examined, and a return is a register jump JR to that register.

ARM's unusual design allows many, many ways of <u>returning</u> from functions.

But first, we must understand how function calls work on the ARM.

On ARM, the program counter is register 15, or **r15**, also called **pc**.

The instruction to call a function is **bl** (for immediate offsets, a label operand) or **blx** (for addresses in registers, a register operand).

These instructions store the return address in **r14**, called the link register, or **lr**.

To <u>return</u>, we must put this value back into **pc**.

When writing <u>non-leaf</u> functions, i.e. functions that call other functions, the value of **lr** must be <u>preserved</u>, since <u>calling</u> another function will overwrite it.

The most common way is to store it on the stack.

On the ARM, we use **push** and **pop** to <u>preserve</u> the registers we modify.

For example, if we want to <u>preserve</u> **r3**, **r4**, and **lr**, we can write push {**r3**, **r4**, **lr**}.

A normal function will look like:

    push {r3, r4, lr}   ; save registers.
    ; function body.
    pop {r3, r4, pc}    ; restore registers and return.

PUSH stores registers on the stack, with the lowest numbered register using the lowest memory address and the highest numbered register using the highest memory address.

POP loads registers from the stack, with the lowest numbered register using the lowest memory address and the highest numbered register using the highest memory address.

...the registers in the {} can be specified in any order, but the order in which they appear on the stack is fixed...

https://stackoverflow.com/questions/63304428/ordering-of-registers-in-push-and-pop-brackets

So according to the above explanations, the ordering of registers in one PUSH bracket doesn't matter.

I.e. PUSH {R0,R1,R2}, PUSH {R2,R1,R0}, and PUSH {R1,R2,R0} would all result in the same ordering on the stack

because "...the lowest/highest numbered register (R0/R2) uses the lowest/highest (stack) memory address...".

Does that mean if a single PUSH instruction has multiple registers in the bracket, the assembler automatically sorts the pushing actions out in the object code, where PUSH R2 goes first into the stack to take the highest address, followed by PUSH R1 and ended with PUSH R0 taking the lowest address?

So if I want to guarantee R2 get pushed last and popped first in a LIFO stack (i.e. SP pointing R2 or for R2 to take the lowest stack address), I cannot do so in one PUSH bracket statement but only separately with PUSH R0; PUSH R1; PUSH R2?

If you look at the assembled hex/binary, you'll find that push instructions with the same registers but in a different order encode to the same instruction.

That will be related to instruction encoding, because it's pretty much a bitmask of registers

    .thumb

    push {r0,r1,r2}
    push {r2,r1,r0}
    push {r0}
    push {r1}
    push {r2}

Disassembly of section .text:

    00000000 <.text>:
       0: b407    push {r0, r1, r2}
       2: b407    push {r0, r1, r2}
       4: b401    push {r0}
       6: b402    push {r1}
       8: b404    push {r2}

From the ARM ARM you can see that in the push instruction the lower 8 bits are a register list/mask, so r0 is bit 0, r1 is bit 1, and so on. The 7 in b407 therefore indicates the three registers r0, r1, r2. The logic operates on machine code, not assembly language: it goes through the mask from bit 7 to bit 0 and pushes each register whose bit is set. All the assembler does is create the machine code; it doesn't create extra instructions or anything like that.

If you want these in a different order then you have to write them in separate instructions in the assembly language.

https://stackoverflow.com/questions/63304428/ordering-of-registers-in-push-and-pop-brackets
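The bitmask explanation can be reproduced in a few lines. The Python sketch below builds the 16-bit Thumb PUSH encoding from a register list. The function name is hypothetical. The base pattern 0xB400, with bits 7 to 0 as the r0 to r7 mask and bit 8 for lr, is assumed from the 16-bit Thumb encoding and is consistent with the `b407`, `b401`, `b402`, and `b404` values in the disassembly above.

```python
def encode_thumb_push(registers):
    """Encode a 16-bit Thumb PUSH of the given registers (r0-r7 and lr)."""
    mask = 0
    for name in registers:
        if name == "lr":
            mask |= 1 << 8              # separate bit for the link register
            continue
        number = int(name[1:])          # "r3" -> 3
        if not 0 <= number <= 7:
            raise ValueError("only r0-r7 and lr fit this encoding")
        mask |= 1 << number             # one bit per low register
    return 0xB400 | mask


if __name__ == "__main__":
    print(hex(encode_thumb_push(["r0", "r1", "r2"])))   # 0xb407
    print(hex(encode_thumb_push(["r2", "r1", "r0"])))   # 0xb407
    print(hex(encode_thumb_push(["r1"])))               # 0xb402
```

Since each register only sets a bit, the order in which the list is written cannot survive into the machine code.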

The registers are stored in sequence, from the lowest-numbered register at the lowest memory address (start\_address) through to the highest-numbered register at the highest memory address (end\_address).

The start\_address is the value of the SP minus 4 times the number of registers to be stored.

Subsequent addresses are formed by incrementing the previous address by four. One address is produced for each register specified in the register list.

The end\_address value is four less than the original value of SP. The SP register is decremented by four times the number of registers in the register list.

https://stackoverflow.com/questions/63304428/ordering-of-registers-in-push-and-pop-brackets
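The address rules above translate directly into code. This Python sketch models memory as a dictionary from byte address to 32-bit value. The function name and the dictionary representation are choices made for the example, not part of any ARM tool.

```python
def push(sp, memory, registers):
    """Model PUSH: store `registers` (number -> value) and return the new SP.

    The lowest-numbered register goes to the lowest address, whatever
    order the registers were supplied in.
    """
    numbers = sorted(registers)
    start_address = sp - 4 * len(numbers)
    for index, number in enumerate(numbers):
        memory[start_address + 4 * index] = registers[number]
    return start_address                # SP after the push


if __name__ == "__main__":
    memory = {}
    new_sp = push(0x1000, memory, {2: 0xC, 0: 0xA, 1: 0xB})
    print(hex(new_sp))                                  # 0xff4
    print({hex(a): hex(v) for a, v in sorted(memory.items())})
```

With three registers and SP = 0x1000, the start address is 0xFF4 and the end address is 0xFFC, four less than the original SP.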

This is our first way of returning: using pop to restore all the registers, except that the value that was in lr at the time of the push goes into pc.

This overwrites pc with the return address, achieving the return.

Note that we could use r14 instead of lr and r15 instead of pc, but this makes the intent less clear.

#### Method 2

We can use an unconditional jump to register to return, which is useful in leaf functions where lr is never stored on the stack. This is simply:

    bx lr

This jumps to the address in lr, setting pc to lr, and completing the return.

#### Method 3

Similar in rationale to method 2, but as stated in the beginning, ARM lets you manipulate the program counter as you would any other register. So... we have:

    mov pc, lr

This copies lr into pc, also completing the return.

#### Method n

Of course, there are many other ways of copying the value in one register into another, and to list them all would be fairly silly. But as long as the value lr held at the beginning of the function is placed into pc, a return is completed.

But please, use the most sensible ways to return. This means you should prefer the first two, depending on whether the function is a leaf. As a distant third, use method 3 (mov pc, lr).

    .entry
      BL myfunction
      MOV PC, R14

    .myfunction
      ; does nothing
      MOV PC, R14

This will fail because the exit address is in R14 on entry, and the BL call trashes that, so your program cannot ever exit as the return address is gone.

Consider:

    .entry              ; entry point, return address in R14
      BL myfunction     ; call subroutine (puts return address in R14)
      MOV PC, R14       ; return to BASIC (R14 will come back here)

    .myfunction
      ; does nothing
      MOV PC, R14       ; exit subroutine by jumping back to R14

If you follow through this code, you'll see that the line that is supposed to return to BASIC is the instruction following the BL, which means R14 will point to it, so it'll just keep jumping to itself (cue spooky voice) forever!

How to fix this? You need to preserve R14 prior to it being used again. Like this (assuming R13 is a valid stack, it is from BASIC):

.entry
STR R14, [R13, #-4]! ; stack R14
BL myfunction
LDR PC, [R13], #4 ; unstack R14 directly into PC to exit

.myfunction
; does nothing
MOV PC, R14

The weird looking offsets are to write-back R13 to support a fully descending stack.

The STR's "#-4]!" performs a decrement before action (akin to the behaviour of STMFD/STMDB), while the LDR's "], #4" performs an increment after action (akin to LDMFD/LDMIA).

This supports the type of stack used within RISC OS.
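The write-back behaviour of the two addressing modes can be modelled in a few lines. This is only a sketch: the dictionary standing in for memory, the helper names, and the addresses `0x8000` and `0x1234` are hypothetical and not taken from the original text.

```python
WORD = 4  # bytes occupied by one 32-bit register


def str_pre_decrement(memory, sp, value):
    """Model STR value, [sp, #-4]! : decrement sp first, then store (write-back)."""
    sp -= WORD
    memory[sp] = value
    return sp


def ldr_post_increment(memory, sp):
    """Model LDR dest, [sp], #4 : load from sp first, then increment sp."""
    value = memory[sp]
    return value, sp + WORD


memory = {}
initial_sp = 0x8000      # hypothetical stack pointer on entry
return_address = 0x1234  # hypothetical value of R14 on entry

sp = str_pre_decrement(memory, initial_sp, return_address)  # stack R14
stored_at = sp
pc, sp = ldr_post_increment(memory, sp)                     # unstack into PC

print(hex(stored_at), hex(pc), hex(sp))
```

The store lands one word below the original stack pointer, and after the load the stack pointer is back where it started, which is the behaviour expected of a full descending stack.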

You might have come across it like this, but these days it is inefficient to use a multiple register instruction to store and load single registers (and indeed, ARM64 doesn't support STM/LDM at all!).

This is for information purposes as you will probably come across code that does this. It's inefficient, so try to remember the STR/LDR version given above...

.entry
STMFD R13!, {R14}
BL myfunction
LDMFD R13!, {PC}

.myfunction
; does nothing
MOV PC, R14

That's what Entry and EXIT (and friends) are for:

- Entry → push a stack frame for procedure entry (implicitly adds lr to the register list), optionally reserving a block of local workspace on the stack
- EXIT → return from a procedure by popping the workspace + register list from the most recent Entry (i.e. the one located directly before it in the assembler listing)
- EntryS/EXITS → variants which save and restore some or all of the PSR
- EXITV/EXITVC/EXITVS → return with V flag in a specified state
- PullEnv/PullEnvS → pop the stack frame without returning from the procedure
- ALTENTRY → generate an Entry/EntryS equivalent to the most recent (used when shared code can have multiple entry points)
- FRAMLDR/FRAMSTR → load/store specific registers from the stack frame (calculates the correct offset, assuming you haven't used Push/Pull or adjusted SP manually)

If you're observant you'll also spot that there's an ENTRY macro which is equivalent to Entry, but that one isn't used any more because objasm confuses it with the ENTRY directive.

# State changing example (1)

| Label       | Instruction             | State                   |
|-------------|-------------------------|-------------------------|
|             | MOV R0, #5              | ARM                     |
|             | ADD R1, PC, #1          | ARM                     |
|             | <b>BX</b> R1            | ARM (switches to Thumb) |
| SUB_BRANCH: | <b>BL</b> thumb_sub (0) | Thumb                   |
|             | <b>BL</b> thumb_sub (1) | Thumb                   |
|             | ADD R1, #7              | Thumb                   |
|             | <b>BX</b> R1            | Thumb (switches to ARM) |
| SUB_RETURN: |                         | ARM                     |

In ARM mode, reading the PC yields an address 2 instructions (8 bytes) ahead of the current instruction.

So the PC value seen by 'ADD R1,PC,#1' is the address of SUB\_BRANCH, and **R1** becomes 'SUB\_BRANCH+1'.

Because bit 0 of **R1** is 1, 'BX R1' switches execution from **ARM** to **Thumb** at **SUB\_BRANCH**, and the program continues in **Thumb** mode.

Adding 7 to **R1** ('SUB\_BRANCH+1') gives 'SUB\_BRANCH+8'.

'SUB\_BRANCH+8' is the address of 'SUB\_RETURN'. Its LSB is 0, so the second 'BX R1' jumps there and switches execution from **Thumb** mode back to **ARM** mode.
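The address arithmetic of this example can be checked with a small model. It is a sketch under stated assumptions: the base address `0x8000` is hypothetical, and the helper names `arm_pc_value` and `bx` are illustrative only. The instruction sizes follow the example (4-byte ARM instructions, a 4-byte Thumb BL pair, 2 bytes for each other Thumb instruction).

```python
ARM_INSN_BYTES = 4
THUMB_INSN_BYTES = 2


def arm_pc_value(insn_addr):
    """In ARM state, reading PC gives the current instruction address + 8."""
    return insn_addr + 8


def bx(target):
    """Model BX: bit 0 of the target selects the state, the rest is the address."""
    state = "Thumb" if target & 1 else "ARM"
    return target & ~1, state


base = 0x8000                           # hypothetical address of MOV R0, #5
add_addr = base + ARM_INSN_BYTES        # ADD R1, PC, #1
bx_addr = add_addr + ARM_INSN_BYTES     # BX R1
sub_branch = bx_addr + ARM_INSN_BYTES   # first Thumb instruction

r1 = arm_pc_value(add_addr) + 1         # SUB_BRANCH + 1
first_jump = bx(r1)

# Thumb code: BL (two halfwords), ADD R1, #7, BX R1
sub_return = sub_branch + 2 * THUMB_INSN_BYTES + THUMB_INSN_BYTES + THUMB_INSN_BYTES
r1 += 7                                 # SUB_BRANCH + 8
second_jump = bx(r1)

print(hex(first_jump[0]), first_jump[1])
print(hex(second_jump[0]), second_jump[1])
```

With these assumptions the first BX lands on SUB\_BRANCH in Thumb state and the second lands on SUB\_RETURN in ARM state, as long as the subroutine preserves R1.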

https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/5655/question-at

### Branch and link operation (2)



/IHI0042E\_aapcs.pdf


# Branch and Exchange (2)

### change into Thumb state, then back

| State | Label       | Instruction    | Comment                                                    |
|-------|-------------|----------------|------------------------------------------------------------|
| ARM   |             | mov R0, #5     | ; argument to function is in R0                            |
| ARM   |             | add R1, PC, #1 | ; load address of SUB_BRANCH, set for THUMB by adding 1    |
| ARM   |             | BX R1          | ; R1 contains address of SUB_BRANCH+1                      |
|       |             |                | ; assembler-specific instruction to switch to <b>Thumb</b> |
| Thumb | SUB_BRANCH: | BL thumb_sub   | ; must be in a space of +/- 4 MB                           |
| Thumb |             | add R1, #7     | ; point to SUB_RETURN with bit 0 clear                     |
| Thumb |             | BX R1          |                                                            |
|       |             |                | ; assembler-specific instruction to switch to <b>ARM</b>   |
| ARM   | SUB_RETURN: |                |                                                            |

https://www.embedded.com/introduction-to-arm-thumb/

# Branch and Exchange (3)

- the BX instruction example goes from ARM to Thumb state and back.
- it first switches to Thumb state (**BX R1**)
  - **R1[0] = 1** (because of +1)
- then <u>calls</u> a <u>subroutine</u> <u>written</u> in <u>Thumb</u> code (**BL thumb\_sub**)
- upon <u>return</u> from the subroutine, **BX R1** switches the system back to ARM state
  - **R1[0] = 0** (because of +1+7 = +8)

| Label       | Instruction           | Comment                                                   |
|-------------|-----------------------|-----------------------------------------------------------|
|             | mov R0, #5            | ; argument to function is in R0                           |
|             | add <b>R1</b>, PC, #1 | ; load address of SUB_BRANCH, set for THUMB by adding 1   |
|             | BX R1                 | ; R1 contains address of SUB_BRANCH+1; to switch to Thumb |
| SUB_BRANCH: | BL thumb_sub          | ; must be in a space of +/- 4 MB                          |
|             | add <b>R1</b>, #7     | ; point to SUB_RETURN with bit 0 clear                    |
|             | BX R1                 | ; to switch to ARM                                        |
| SUB_RETURN: |                       |                                                           |

# Branch and Exchange (4)

- this example <u>assumes</u> that R1 is *preserved* by the subroutine.
- In ARM state, reading the PC yields the address of the <u>current</u> instruction plus 8
  - **add R1, PC,#1** (4 bytes)
  - **BX R1** (4 bytes)
  - SUB\_BRANCH (PC of add inst. + 8 bytes)


# Branch and Exchange (5)

- The Thumb BL instruction actually resolves into two instructions, so 8 bytes are used between SUB\_BRANCH and SUB\_RETURN.
- **BL** thumb\_sub (4 bytes)
  - **BL** (**H=0**) Offset\_high (2 bytes)
  - **BL** (**H=1**) Offset\_low (2 bytes)
- **add R1, #7** (2 bytes)
- **BX R1** (2 bytes)





# Thumb $\rightarrow$ ARM interworking call



- One way to make a Thumb-to-ARM interworking subroutine call is to **BL** to an intermediate Thumb code segment that executes the **BX** instruction (**BL \_\_call\_via\_r4**, then **BX r4**).
- The **BL** instruction loads the **link register** immediately before the **BX** instruction is executed.
- In addition, the **Thumb instruction set** version of **BL** sets bit 0 when it loads the **link register** with the return address (LR[0] = 1 → Thumb state on return).
- When a Thumb-to-ARM interworking subroutine call returns using a **BX LR** instruction, it causes the required state change to occur automatically.

| Thumb caller | Exit code and intermediate segment | ARM subroutine |
|--------------|------------------------------------|----------------|
| CODE16<br>ThumbProg<br>MOV r0, #2<br>MOV r1, #3<br>ADR r4, ARMSubroutine<br>BL \_\_call\_via\_r4 | Stop<br>MOV r0, #0x18<br>LDR r1, =0x20026<br>SWI 0xAB<br>\_\_call\_via\_r4<br>BX r4 | CODE32<br>ARMSubroutine<br>ADD r0, r0, r1<br>BX LR<br>END |

https://developer.arm.com/documentation/dui0040/d/Interworking-ARM-and-Thumb/Basic-assembly-language-interworking/Implementing-interworking-assembly-language-



If you always use the <u>same register</u> to store the <u>address</u> of the ARM <u>subroutine</u> that is being called from <u>Thumb</u>, this segment can be used to send an interworking call to <u>any</u> ARM subroutine.

You must use a **BX LR** instruction at the end of the ARM subroutine to return to the caller.

You cannot use the **MOV pc,lr** instruction to return in this situation because it does not cause the required change of state.

| Thumb caller | Intermediate segment | ARM subroutine |
|--------------|----------------------|----------------|
| CODE16<br>ThumbProg<br>...<br>ADR r4, ARMSubroutine<br>BL \_\_call\_via\_r4<br>... | \_\_call\_via\_r4<br>BX r4 | CODE32<br>ARMSubroutine<br>...<br>BX LR |

https://developer.arm.com/documentation/dui0040/d/Interworking-ARM-and-Thumb/Basic-assembly-language-interworking/Implementing-interworking-assembly-language-



# ARM $\rightarrow$ Thumb interworking call

In an ARM-to-Thumb interworking subroutine call there is <u>no need</u> to set bit 0 of the **link register** because the routine is <u>returning</u> to ARM state.

Store the return address by copying PC into LR with a **MOV lr,pc** instruction immediately before the **BX** instruction.

Remember that the address operand to the **BX** instruction that calls the **Thumb subroutine** must have bit 0 set so that the processor executes in **Thumb state** on arrival.

As with Thumb-to-ARM interworking subroutine calls, you must use a **BX** instruction to return.

| ARM caller           | Thumb subroutine |
|----------------------|------------------|
| CODE32               | CODE16           |
| ADR r4, ThumbSub + 1 | ThumbSub         |
| MOV lr, pc           | ADD r0, r0, r1   |
| BX r4                | BX LR            |
|                      | END              |

**MOV lr, pc** leaves bit 0 of LR clear: LR[0] = 0 → ARM state on return. The call itself is the pair **ADR r4, ThumbSub + 1** and **BX r4**.
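The ARM-to-Thumb call sequence can be traced with a small model. This is a sketch only: the addresses `0x8008` and `0x9000` are hypothetical, and `mov_lr_pc` and `bx` are illustrative helpers that model the register effects described above rather than any real tool.

```python
def mov_lr_pc(insn_addr):
    """Model MOV lr, pc in ARM state: pc reads as the current address + 8."""
    return insn_addr + 8


def bx(target):
    """Model BX: bit 0 of the target selects the state, the rest is the address."""
    state = "Thumb" if target & 1 else "ARM"
    return target & ~1, state


mov_addr = 0x8008        # hypothetical address of MOV lr, pc
bx_addr = mov_addr + 4   # BX r4 is the next ARM instruction
thumb_sub = 0x9000       # hypothetical address of ThumbSub

r4 = thumb_sub + 1       # ADR r4, ThumbSub + 1 sets bit 0
lr = mov_lr_pc(mov_addr) # return address, bit 0 clear

call = bx(r4)            # BX r4: enter the Thumb subroutine
ret = bx(lr)             # BX LR inside ThumbSub: back to the ARM caller

print(hex(call[0]), call[1])
print(hex(ret[0]), ret[1])
```

Because the PC reads 8 bytes ahead in ARM state, the value copied into LR is the address of the instruction after **BX r4**, which is why the MOV must come immediately before the BX.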

https://developer.arm.com/documentation/dui0040/d/Interworking-ARM-and-Thumb/Basic-assembly-language-interworking/Implementing-interworking-assembly-language-

# Thumb $\rightarrow$ ARM interworking call example code (1)

| Label             | Instruction               | Comment                                                                                                                         |
|-------------------|---------------------------|---------------------------------------------------------------------------------------------------------------------------------|
|                   | AREA ArmAdd,CODE,READONLY | ; Name this block of code.                                                                                                      |
|                   | ENTRY                     | ; Mark 1st instruction to call.                                                                                                 |
| main              |                           | ; Assembler starts in ARM mode.                                                                                                 |
|                   | ADR r2, ThumbProg + 1     | ; Generate branch target address and set bit 0,<br>; hence arrive at target in Thumb state.                                     |
|                   | BX r2                     | ; Branch exchange to ThumbProg.                                                                                                 |
|                   | CODE16                    | ; Subsequent instructions are Thumb.                                                                                            |
| ThumbProg         |                           |                                                                                                                                 |
|                   | MOV r0, #2                | ; Load r0 with value 2.                                                                                                         |
|                   | MOV r1, #3                | ; Load r1 with value 3.                                                                                                         |
|                   | ADR r4, ARMSubroutine     | ; Generate branch target address, leaving bit 0<br>; clear in order to arrive in ARM state.                                     |
|                   | BL \_\_call\_via\_r4      | ; Branch and link to Thumb code segment that will<br>; carry out the BX to the ARM subroutine.<br>; The BL causes bit 0 of lr to be set. |
| Stop              |                           | ; Terminate execution.                                                                                                          |
|                   | MOV r0, #0x18             | ; angel_SWIreason_ReportException                                                                                               |
|                   | LDR r1, =0x20026          | ; ADP_Stopped_ApplicationExit                                                                                                   |
|                   | SWI 0xAB                  | ; Angel semihosting Thumb SWI                                                                                                   |
| \_\_call\_via\_r4 |                           | ; This Thumb code segment will<br>; BX to the address contained in r4.                                                          |
|                   | BX r4                     | ; Branch exchange.                                                                                                              |

https://developer.arm.com/documentation/dui0040/d/Interworking-ARM-and-Thumb/Basic-assembly-language-interworking/Implementing-interworking-assembly-language-

# Thumb $\rightarrow$ ARM interworking call example code (2)

| Label         | Instruction    | Comment                                                          |
|---------------|----------------|------------------------------------------------------------------|
|               | CODE32         | ; Subsequent instructions are ARM.                               |
| ARMSubroutine |                |                                                                  |
|               | ADD r0, r0, r1 | ; Add the numbers together                                       |
|               | BX LR          | ; and return to Thumb caller<br>; (bit 0 of LR set by Thumb BL). |
|               | END            | ; Mark end of this file.                                         |

https://developer.arm.com/documentation/dui0040/d/Interworking-ARM-and-Thumb/Basic-assembly-language-interworking/Implementing-interworking-assembly-language-

# ARM $\rightarrow$ Thumb interworking call example code (1)

| Label    | Instruction                 | Comment                                                                                     |
|----------|-----------------------------|---------------------------------------------------------------------------------------------|
|          | AREA ThumbAdd,CODE,READONLY | ; Name this block of code.                                                                  |
|          | ENTRY                       | ; Mark 1st instruction to call.<br>; Assembler starts in ARM mode.                          |
|          | MOV r0, #2                  | ; Load r0 with value 2.                                                                     |
|          | MOV r1, #3                  | ; Load r1 with value 3.                                                                     |
|          | ADR r4, ThumbSub + 1        | ; Generate branch target address and set bit 0,<br>; hence arrive at target in Thumb state. |
|          | MOV lr, pc                  | ; Store the return address.                                                                 |
|          | BX r4                       | ; Branch exchange to subroutine ThumbSub.                                                   |
| Stop     |                             | ; Terminate execution.                                                                      |
|          | MOV r0, #0x18               | ; angel_SWIreason_ReportException                                                           |
|          | LDR r1, =0x20026            | ; ADP_Stopped_ApplicationExit                                                               |
|          | SWI ...                     | ; Angel semihosting ARM SWI                                                                 |
|          | CODE16                      | ; Subsequent instructions are Thumb.                                                        |
| ThumbSub |                             |                                                                                             |
|          | ADD r0, r0, r1              | ; Add the numbers together                                                                  |
|          | BX LR                       | ; and return to ARM caller.                                                                 |
|          | END                         | ; Mark end of this file.                                                                    |

https://developer.arm.com/documentation/dui0040/d/Interworking-ARM-and-Thumb/Basic-assembly-language-interworking/Implementing-interworking-assembly-language-

### Cortex-M3: 32-bit processor

- The Thumb instruction set is a <u>subset</u> of the most commonly used 32-bit ARM instructions.
- Thumb instructions are each 16 bits long, and have a corresponding 32-bit ARM instruction that has the same effect on the processor model.
- The Cortex-M3 processor is a high performance 32-bit processor designed for the microcontroller market.
- It offers significant benefits to developers, including:
  - outstanding processing performance combined with <u>fast</u> interrupt handling
  - enhanced system debug with extensive breakpoint and trace capabilities

https://developer.arm.com/documentation/dui0552/a/introduction/about-the-cortex-m3-processor-and-core-peripherals

### Cortex-M3 : Thumb state only

- The Cortex-M3 processor <u>only supports</u> execution of instructions in Thumb state. (T = 1)
- The following can <u>clear</u> the **T** bit to **0**:
  - instructions BLX, BX and POP {PC}
  - restoration from the stacked **xPSR** value on an exception return
  - bit[0] of the vector value on an exception entry or reset.
- In the Cortex-M3 processor, attempting to execute instructions when the T bit is 0 results in a fault or lockup.

- The Thumb status bit (T) indicates the processor's <u>current state</u>:
  - 0 for ARM state (default)
  - 1 for Thumb.
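Since bit[0] of a vector or branch target becomes the T bit, one practical consequence is that every such value on a Cortex-M3 needs bit 0 set. The sketch below flags values that would clear T. The vector names and addresses are hypothetical, and the helpers are illustrative rather than part of any ARM tool.

```python
def thumb_bit(value):
    """Bit 0 of a vector or BX/BLX/POP {PC} target is copied into the T bit."""
    return value & 1


def entries_clearing_t(vectors):
    """Return the names of entries whose bit 0 is clear (T would become 0)."""
    return [name for name, value in vectors.items() if thumb_bit(value) == 0]


# Hypothetical handler addresses; only bit 0 matters for this check.
vectors = {
    "Reset": 0x08000101,
    "NMI": 0x08000200,       # bit 0 clear: would lead to a fault on Cortex-M3
    "HardFault": 0x08000305,
}

bad_entries = entries_clearing_t(vectors)
print(bad_entries)
```

Toolchains normally set bit 0 automatically for Thumb function symbols, so a check like this mainly matters for hand-built tables or computed branch targets.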



https://developer.arm.com/documentation/dui0552/a/the-cortex-m3-processor/programmers-model/core-registers

# **Thumb Instruction**

# Thumb instruction set benefits

- The biggest reason to look for an ARM processor with the Thumb instruction set is if you need to <u>reduce</u> code size, that is, to improve code density.
- In addition to <u>reducing</u> the total amount of <u>memory required</u>, you may also be able to <u>narrow</u> the <u>data bus</u> to just 16 bits.
- With the narrower bus, it will take two bus cycles to fetch a single 32-bit instruction, but you'll only <u>pay</u> that penalty in the parts of your code that <u>can't</u> be <u>implemented</u> with the <u>Thumb instructions</u>.
- And you'll still have the benefits of a powerful 32-bit RISC processor. A nifty trick indeed.

# Thumb instructions (1)

- The Thumb instructions
  - 16-bit instructions
  - a compact <u>shorthand</u> for a <u>subset</u> of the <u>32-bit</u> ARM instructions
- every Thumb instruction has an *equivalent* 32-bit ARM instruction.
- <u>not every ARM instruction has</u> an *equivalent* Thumb instruction;
  - some <u>single ARM instructions</u> can only be simulated with a <u>sequence</u> of <u>Thumb instructions</u>
  - for example, there's <u>no way</u> to access status or coprocessor registers.
- a long branch with link (**BL**)
  - the assembler splits it into two 16-bit instructions
    - Instruction 1 (H = 0)
    - Instruction 2 (H = 1)

https://www.cs.princeton.edu/courses/archive/fall13/cos375/ARMthumb.pdf

# Thumb instructions (2)

- the ARM contains only <u>one</u> instruction set: the 32-bit set.
- When it's operating in the Thumb state,
  - the processor simply <u>expands</u> the smaller <u>shorthand instructions</u> fetched from memory into their 32-bit equivalents.
- The <u>difference</u> between two equivalent instructions (the ARM and Thumb instructions) lies in how the *instructions* are <u>fetched</u> and <u>interpreted</u> prior to <u>execution</u>, <u>not</u> in how they function.
- dedicated hardware expands the <u>16-bit</u> instruction into <u>32-bit</u>, so it <u>doesn't</u> slow execution even a bit.
- the narrower 16-bit instructions do offer memory advantages.

https://www.cs.princeton.edu/courses/archive/fall13/cos375/ARMthumb.pdf

# Thumb instructions (3)

- Roughly speaking, a CPU instruction is a particular sequence of bits
- To the CPU, a particular sequence of bits could mean "add two 32-bit values and carry"
- The exact value of the *bits in this sequence* has <u>nothing</u> to do with the values being added.
- In the ARM mode, this sequence has 32 bits.
- In the Thumb mode, it only has 16 bits.
- Consequently, the Thumb mode has <u>fewer</u> encoded instructions than the ARM mode (<u>fewer bits</u> to <u>encode</u> them).
- For the same function, most instructions are <u>encoded differently</u> in the <u>ARM</u> and the Thumb modes.

https://electronics.stackexchange.com/questions/353192/how-does-an-arm-processor-in-thumb-state-execute-32-bit-values

# Thumb instructions (4)

- for example, x86 instructions are variable-length and can be as short as 8 bits, yet they are also able to work on 32-bit values.
- For ARM, the *instruction length* is what changes when you switch between ARM and Thumb modes.
- For example, the instruction MOV R0, R1, which copies the contents of the 32-bit R1 register to the R0 register, is <u>encoded</u> in the following way:
  - *E1A00001* for ARM (32-bit : 4 bytes)
  - *4608* for Thumb (16-bit : 2 bytes)
- But the processor will perform exactly the <u>same operation</u>, and it will do it on 32-bit wide data, whatever the mode.
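To see that the two encodings name the same operation, the register fields can be pulled out of each word. This is a sketch: the field positions are the classic ARM data-processing layout and the Thumb high-register-operation layout as I understand them, not something stated in the slides, and the function names are illustrative.

```python
def decode_arm_mov(word):
    """Extract fields of a 32-bit ARM data-processing (register) instruction."""
    return {
        "cond": (word >> 28) & 0xF,    # 0xE = always
        "opcode": (word >> 21) & 0xF,  # 0xD = MOV
        "rd": (word >> 12) & 0xF,      # destination register
        "rm": word & 0xF,              # source register
    }


def decode_thumb_mov(halfword):
    """Extract fields of a 16-bit Thumb high-register operation."""
    h1 = (halfword >> 7) & 1           # extends Rd to r8-r15
    h2 = (halfword >> 6) & 1           # extends Rm to r8-r15
    return {
        "prefix": halfword >> 10,      # 0b010001 selects this format
        "op": (halfword >> 8) & 0x3,   # 0b10 = MOV
        "rd": (h1 << 3) | (halfword & 0x7),
        "rm": (h2 << 3) | ((halfword >> 3) & 0x7),
    }


arm = decode_arm_mov(0xE1A00001)
thumb = decode_thumb_mov(0x4608)
print(arm)
print(thumb)
```

Both decodings yield destination register 0 and source register 1: the same MOV R0, R1, packed into 32 bits in one case and 16 bits in the other.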

https://electronics.stackexchange.com/questions/353192/how-does-an-arm-processor-in-thumb-state-execute-32-bit-values

# Thumb instructions (5)

- The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions.
- Thumb instructions are 16 bits long, and have a <u>corresponding</u> 32-bit ARM instruction that has the same effect on the processor model.
- Thumb instructions operate with the standard ARM register configuration, enabling excellent <u>interoperability</u> between ARM and Thumb states.
- Thumb has all the advantages of a 32-bit core:
  - 32-bit address space
  - 32-bit registers
  - 32-bit shifter and Arithmetic Logic Unit (ALU)
  - 32-bit memory transfer

https://developer.arm.com/documentation/ddi0333/h/introduction/arm1176jz-s-architecture-with-jazelle-technology/the-thumb-instruction-set

#### Thumb instructions (6)

- The ARM processor can manipulate 32 bit values because it is a 32-bit processor, whatever mode it is running in (Thumb or ARM).
- thus, registers are 32 bits wide
- register width <u>doesn't</u> change when you switch mode (state)
- the data bus width of the processor has <u>nothing to do with</u> the length of the instructions.
- The instructions could be <u>encoded</u> in <u>any length</u>.

https://electronics.stackexchange.com/questions/353192/how-does-an-arm-processor-in-thumb-state-execute-32-bit-values

# Thumb instructions (7)

- The Thumb instruction set provides *most* of the *functionality* required by a typical application.
  - arithmetic and logical operations
  - load/store data movements
  - conditional and unconditional branches
- any code written in C could be executed successfully in Thumb state.
- However, device drivers and exception handlers must often be written <u>at least partly</u> in ARM state.

https://www.cs.princeton.edu/courses/archive/fall13/cos375/ARMthumb.pdf

### Thumb instructions (8)

- Switching modes allows programmers to <u>decide</u> on the <u>compromise</u> between code density and flexibility
- you can <u>pack</u> more instructions into a kB of code with <u>16-bit</u> instructions,
- but the 32-bit instructions are more *flexible*
  - they offer more features and
  - you can do more with a single instruction

https://electronics.stackexchange.com/questions/353192/how-does-an-arm-processor-in-thumb-state-execute-32-bit-values



## Thumb instructions (9)

- All Thumb instructions are 16 bits in length.
- Thumb provides approximately 30% better code density over ARM code.
- Most code written for Thumb is in a high-level language such as C and C++.
- ATPCS (ARM Thumb Procedure Call Standard) defines how ARM and Thumb code call each other, called ARM-Thumb interworking.
- Interworking uses the branch exchange (BX) instruction and branch exchange with link (BLX) instruction to <u>change</u> state and <u>jump</u> to a specific routine.

https://www.sciencedirect.com/topics/computer-science/thumb-instruction-set

# Thumb instructions (10)

- In Thumb, *only* the branch instructions are conditionally executed.
- The barrel shift operations are separate instructions
  - ASR
  - LSL
  - LSR
  - ROR
- The multiple-register load-store instructions only support the increment after (IA) addressing mode.
- The Thumb instruction set includes **POP** and **PUSH** instructions as stack operations.
- **POP** and **PUSH** instructions only support a full descending stack.
- There are <u>no</u> Thumb instructions to access the coprocessors, cpsr, and spsr.

https://www.sciencedirect.com/topics/computer-science/thumb-instruction-set

# Thumb instructions (11)

|                              | ARM (**CPSR T**=0)               | Thumb (**CPSR T**=1)                                |
|------------------------------|----------------------------------|-----------------------------------------------------|
| Instruction size             | 32-bit                           | 16-bit                                              |
| Core instructions            | 58                               | 30                                                  |
| Conditional execution        | most                             | only branch instruction                             |
| Data processing instructions | access to barrel shifter and ALU | <u>separate</u> barrel shifter and ALU instructions |
| Program status register      | R/W in privileged mode           | <u>no</u> <u>direct access</u>                      |
| Register usage               | 15 general purpose reg + PC      | 8 general purpose reg + 7 high reg + PC             |



https://electronics.stackexchange.com/questions/353192/how-does-an-arm-processor-in-thumb-state-execute-32-bit-values

# Thumb long branch with link **BL** instruction (1)

THUMB assembler : **BL label** 

- H=0: LR := PC + (OffsetHigh << 12)
- H=1: temp := next instruction address; PC := LR + (OffsetLow << 1); LR := temp | 1
- Combined effect: PC := PC + (OffsetHigh << 12) + (OffsetLow << 1)



# Thumb long branch with link **BL** instruction (2)

#### ARM **B** or **BL** instruction

| Bits       | 31:28       | 27:25 | 24       | 23:0          |
|------------|-------------|-------|----------|---------------|
| ARM B / BL | <b>cond</b> | 1 0 1 | <b>L</b> | 24-bit Offset |

#### Thumb **BL** instruction

| Bits           | 15:12   | 11 (H) | 10:0               |
|----------------|---------|--------|--------------------|
| Thumb BL (H=0) | 1 1 1 1 | 0      | 11-bit Offset_high |
| Thumb BL (H=1) | 1 1 1 1 | 1      | 11-bit Offset_low  |

| 23-bit Offset | 11-bit Offset_high | 11-bit Offset_low | 0 |
|---------------|--------------------|-------------------|---|

# Thumb long branch with link **BL** instruction (3)

#### Examples

| <b>BL faraway</b><br>next | ; Unconditionally Branch to 'faraway'<br>; and place following instruction address,<br>; ie 'next', in <b>R14</b> , the <b>Link Register (LR)</b><br>; and set bit 0 of <b>LR</b> high (1)<br>; Note that the THUMB opcodes will contain<br>; the number of halfwords to offset. |
|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| faraway                   | ; Must be Half-word aligned.                                                                                                                                                                                                                                                     |

- H=0: LR := PC + (OffsetHigh << 12)
- H=1: temp := next instruction address; PC := LR + (OffsetLow << 1); LR := temp | 1
- Combined effect: PC := PC + (OffsetHigh << 12) + (OffsetLow << 1)

# Thumb long branch with link **BL** instruction (4)

- This format specifies a long branch with link.
- The assembler splits the 23-bit two's complement half-word offset specified by the label into *two* 11-bit halves, ignoring bit 0 (which must be 0), and <u>creates</u> *two* THUMB instructions.
- Instruction 1 (H = 0)
  - In the <u>first</u> instruction, the Offset field contains the upper 11 bits of the target address offset.
  - This is shifted left by 12 bits and added to the <u>current</u> **PC** address.
  - The resulting address is placed in **LR**.

- Instruction 2 (H = 1)
  - In the <u>second</u> instruction, the Offset field contains the lower 11 bits of the target address offset.
  - This is shifted left by 1 bit and added to **LR**.
  - **LR**, which now contains the full 23-bit address, is placed in **PC**.
  - The address of the instruction following the **BL** is placed in **LR**, and bit 0 of **LR** is set.
  - The branch offset must take account of the prefetch operation, which causes the **PC** to be 1 word (4 bytes) ahead of the current instruction.
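The two-step behaviour described above can be checked with a small model. This is a minimal sketch, not the hardware definition: the function names and the example addresses are hypothetical, and it assumes that OffsetHigh is sign-extended before the shift, which the two's complement offset implies.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign_bit) - sign_bit


def split_bl_offset(byte_offset):
    """Split a signed, even byte offset into (offset_high, offset_low)."""
    if byte_offset % 2:
        raise ValueError("target must be half-word aligned")
    if not -(1 << 22) <= byte_offset < (1 << 22):
        raise ValueError("offset does not fit in 23 bits")
    halfwords = (byte_offset >> 1) & 0x3FFFFF  # drop bit 0, keep 22 bits
    return halfwords >> 11, halfwords & 0x7FF


def execute_bl(pc, offset_high, offset_low, next_address):
    """Model both halves of BL; returns (new_pc, new_lr)."""
    # H=0: LR := PC + (sign-extended OffsetHigh << 12)
    lr = (pc + (sign_extend(offset_high, 11) << 12)) & 0xFFFFFFFF
    # H=1: PC := LR + (OffsetLow << 1); LR := next instruction address | 1
    new_pc = (lr + (offset_low << 1)) & 0xFFFFFFFF
    return new_pc, next_address | 1


bl_address = 0x8000   # hypothetical address of the first BL halfword
pc = bl_address + 4   # prefetch: PC reads 4 bytes ahead
high, low = split_bl_offset(0x7000 - pc)  # backward branch to 0x7000
new_pc, new_lr = execute_bl(pc, high, low, bl_address + 4)
print(hex(new_pc), hex(new_lr))
```

The backward branch is the interesting case: OffsetHigh alone moves LR too far back, and the unsigned OffsetLow then brings the result forward to the exact target.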

#### **Thumb-2 Instruction**

# Thumb-2 Instructions (1)

- Thumb-1 only does 16 bit instructions
- Thumb-2 can do both 16 bit & 32 bit instructions
- Thumb-1 and Thumb-2
  - share the <u>same architecture</u> for <u>32 bit data</u>.
  - share the <u>same data bus</u> since <u>only</u> the instruction registers are different.

|         | Instructions                      | Registers      |
|---------|-----------------------------------|----------------|
| Thumb-1 | 16-bit instructions               | 32-bit GP regs |
| Thumb-2 | Mixed 16- and 32-bit instructions | 32-bit GP regs |

- For 64 bit processors, Thumb (T32) can support both 16 & 32 bit instructions, with some differences in each set, in order to <u>conserve</u> code space for some applications but at the <u>expense</u> of duplicate libraries.

|     | Instructions                      | Registers              |
|-----|-----------------------------------|------------------------|
| T32 | Mixed 16- and 32-bit instructions | 32-bit GP regs         |
| A32 | 32-bit instructions               | 32-bit GP regs         |
| A64 | 32-bit instructions               | 32- and 64-bit GP regs |

https://electronics.stackexchange.com/questions/353192/how-does-an-arm-processor-in-thumb-state-execute-32-bit-values

# Thumb-2 Instructions (2)

- Thumb-2 is an enhancement to the 16-bit Thumb instruction set.
- Thumb-2 adds 32-bit instructions that can be *freely intermixed* with 16-bit instructions in a program.
- The additional 32-bit instructions enable Thumb-2
  - to cover the functionality of the ARM instruction set.
  - to <u>combine</u> the code density of earlier versions of Thumb with the performance of the ARM instruction set.

|         | 16-bit | 32-bit                                     |
|---------|--------|--------------------------------------------|
| ARM     |        | 32-bit                                     |
| Thumb   | 16-bit |                                            |
| Thumb-2 | 16-bit | 32-bit (added 32-bit Thumb-2 instructions) |

https://developer.arm.com/documentation/ddi0344/c/programmer-s-model/thumb-2-instruction-set



# Thumb-2 Instructions (3)

- The most important <u>difference</u> between the Thumb-2 instruction set and the ARM instruction set is that <u>most</u> 32-bit Thumb instructions are unconditional, whereas <u>most</u> ARM instructions can be conditional.

|         | 16-bit          | 32-bit          |
|---------|-----------------|-----------------|
| ARM     |                 | (conditional)   |
| Thumb   | (unconditional) |                 |
| Thumb-2 | (unconditional) | (unconditional) |

- Thumb-2 introduces a conditional execution instruction, IT, that is a *logical if-then-else function* that you can <u>apply</u> to following <u>instructions</u> to make them <u>conditional</u>.
- If cond Then ... Else ...

| IT block     | Condition applied                  | Equivalent conditional instruction |
|--------------|------------------------------------|------------------------------------|
| ITTET EQ     |                                    |                                    |
| ADD r0,r0,r0 | T: EQ (always "then" for the 1st)  | ADDEQ r0,r0,r0                     |
| ADD r1,r0,r0 | T: EQ (T for the 2nd one)          | ADDEQ r1,r0,r0                     |
| ADD r2,r0,r0 | E: NE (E for the 3rd one)          | ADDNE r2,r0,r0                     |
| ADD r3,r0,r0 | T: EQ (T for the 4th one)          | ADDEQ r3,r0,r0                     |
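The mapping from an IT mnemonic to the condition on each following instruction can be sketched in a few lines. The function name `expand_it` and the table `INVERSE` are illustrative only; the sketch covers the fourteen invertible condition codes and ignores AL.

```python
# Each condition code paired with its logical inverse (used for "E" slots).
INVERSE = {
    "EQ": "NE", "NE": "EQ", "CS": "CC", "CC": "CS",
    "MI": "PL", "PL": "MI", "VS": "VC", "VC": "VS",
    "HI": "LS", "LS": "HI", "GE": "LT", "LT": "GE",
    "GT": "LE", "LE": "GT",
}


def expand_it(mnemonic, cond):
    """Return the condition applied to each instruction in the IT block."""
    mnemonic, cond = mnemonic.upper(), cond.upper()
    if not mnemonic.startswith("IT") or len(mnemonic) > 5:
        raise ValueError("an IT block covers one to four instructions")
    if cond not in INVERSE:
        raise ValueError(f"unsupported condition: {cond}")
    conditions = [cond]  # the first instruction is always the "then" case
    for letter in mnemonic[2:]:
        if letter == "T":
            conditions.append(cond)
        elif letter == "E":
            conditions.append(INVERSE[cond])
        else:
            raise ValueError(f"unexpected letter in IT mnemonic: {letter}")
    return conditions


print(expand_it("ITTET", "EQ"))
```

For `ITTET EQ` the result lists EQ, EQ, NE, EQ, matching the four ADD instructions in the table.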

## Thumb-2 Instructions (4)

 Thumb-2 instructions are <u>accessible</u> as were Thumb instructions when the processor is in Thumb state, that is, the T bit in the CPSR is 1 and the J bit in the CPSR is 0.

**TJ** = 10

 In addition to the 32-bit Thumb instructions, there are several 16-bit Thumb instructions and a few 32-bit ARM instructions, introduced as part of the Thumb-2 architecture.

https://en.wikipedia.org/wiki/Jazelle#Implementation



## New 32-bit Thumb Instructions (1-1)

- The <u>new 32-bit Thumb</u> instructions are added in the space previously occupied by the Thumb BL and BLX instructions.
- This is made possible by <u>treating</u> BL and BLX as 32-bit instructions, instead of treating them as two 16-bit instructions.
- This means that BL and BLX, and <u>all the other</u> 32-bit Thumb instructions, can only take exceptions on their start address.
- They <u>cannot</u> take <u>exceptions</u> at the <u>boundary</u> between *halfword1* and *halfword2* of the instruction.


## New 32-bit Thumb Instructions (1-2)

 All implementations must ensure that <u>both</u> *halfwords* are <u>fetched</u> and <u>consolidated</u> <u>before</u> they are <u>issued</u> and <u>executed</u> to *comply* with this exception event restriction.


- This is a <u>change from</u> Thumb.
- <u>Before Thumb-2</u>, the <u>two halfwords</u> of **BL** and **BLX** instructions <u>execute independently</u>, and can take <u>exceptions independently</u>.

# New 32-bit Thumb Instructions (2-1)

- The <u>new 32-bit Thumb</u> instructions are designed for:
  - the <u>existing</u> ARM/Thumb Programmers' Model, with as <u>few modifications</u> as possible.
    - Certain <u>changes</u> are essential to introduce the 32-bit Thumb instructions, notably to the Prefetch abort and Undefined Instruction exceptions.
    - There is <u>no increase</u> in the <u>number</u> of <u>registers</u> (general purpose or <u>special</u> purpose registers), and <u>no increase</u> in <u>register sizes</u>.
  - <u>existing compiler code generation techniques</u>, as far as possible.

# New 32-bit Thumb Instructions (2-2)

• <u>New concepts</u> are <u>supplementary</u> rather than <u>obligatory</u>.

 For example, literals can still be loaded using PC-relative instructions, or use in-line immediate values embedded in the MOV 16-bit immediate and MOVT instructions.
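As a rough illustration of the in-line immediate approach, the sketch below splits a 32-bit literal into the two 16-bit immediates and models how the pair rebuilds the value. It assumes the usual behaviour that the MOV 16-bit immediate form (written MOVW in UAL) clears the top halfword and that MOVT replaces only the top halfword; the function names are hypothetical.

```python
def split_literal(value):
    """Return (low16, high16) immediates for a 32-bit literal."""
    value &= 0xFFFFFFFF
    return value & 0xFFFF, value >> 16


def movw(imm16):
    """Model MOV (16-bit immediate): low half set, top half cleared."""
    return imm16 & 0xFFFF


def movt(register, imm16):
    """Model MOVT: top half replaced, low half kept."""
    return (register & 0xFFFF) | ((imm16 & 0xFFFF) << 16)


low, high = split_literal(0x12345678)
r0 = movt(movw(low), high)
print(hex(low), hex(high), hex(r0))
```

The order matters in this model: MOVT after the 16-bit MOV preserves the low half, while the reverse order would clear the top half again.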


## New 32-bit Thumb Instructions (3)

 You may <u>not need</u> to rewrite too <u>much</u> depending on what features of the ARM instruction set and ARM variant you've used.


- It's also possible that your ARM code is already <u>compatible</u> with Thumb-2.
- ARM created Unified Assembly Language (UAL) once Thumb-2 was introduced in order to increase the portability of code.
- UAL is <u>not</u> a <u>significant deviation</u> from older ARM assembly; the biggest change is the introduction of the IT(E) instruction for conditional execution.

### New 32-bit Thumb Instructions (4)

 There are some other constructs that <u>won't port directly</u>, and if you are using <u>features</u> of a more <u>advanced</u> or <u>complex ARM core that</u> the <u>Cortex-M4</u> doesn't have, then that will require a <u>rewrite</u> of that portion.


- If the code is <u>not</u> already <u>written</u> in ARM UAL, then, while it would take time, it would be relatively <u>simple</u> to run a <u>script</u> over the code that can <u>flag</u> the usage of <u>features</u> that are <u>not</u> written correctly for UAL.
- A simple <u>regular expression</u> could check for <u>conditionals</u> on the <u>end</u> of instructions, and it may even be relatively <u>easy</u> to then convert those constructs to use IT(E) <cond>.
  - If cond Then ... Else ...
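A flagging script of the kind described might look like the sketch below. It is only an illustration: the list of base mnemonics is a small hypothetical sample rather than the full instruction set, and the pattern only recognises a condition suffix placed directly after the mnemonic.

```python
import re

CONDITIONS = "EQ|NE|CS|HS|CC|LO|MI|PL|VS|VC|HI|LS|GE|LT|GT|LE"
BASES = "ADD|SUB|MOV|CMP|AND|ORR|EOR|LDR|STR|MUL"  # sample subset only
# Optional label, then a base mnemonic immediately followed by a condition.
PATTERN = re.compile(rf"^\s*(?:\w+:\s*)?({BASES})({CONDITIONS})\b", re.IGNORECASE)


def flag_conditionals(lines):
    """Return (line number, mnemonic, condition) for each conditional instruction."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PATTERN.match(line)
        if match:
            hits.append((number, match.group(1).upper(), match.group(2).upper()))
    return hits


source = [
    "    CMP   r0, #0",
    "    ADDEQ r1, r1, #1",
    "loop: SUBNE r2, r2, #1",
    "    MOVS  r3, r4",
    "    MOVT  r5, #0x1234",
]
for number, mnemonic, cond in flag_conditionals(source):
    print(f"line {number}: {mnemonic} with condition {cond} needs an IT block in Thumb")
```

Each flagged line is a candidate for being wrapped in an IT block with the reported condition; runs of adjacent hits with the same or inverse condition could share one block.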

# Thumb 2 instruction set (4)

- The main enhancements are:
- **1.** 32-bit instructions added to the Thumb instruction set to:
  - provide support for exception handling in Thumb state
  - provide <u>access</u> to coprocessors
    - include Digital Signal Processing (DSP)
    - and media instructions
- **2.** improve performance in cases where a <u>single 16-bit instruction restricts</u> functions available to the compiler.
- **3.** addition of a **16-bit IT instruction** that enables *one* to *four* following Thumb instructions, the IT block, to be conditional

## Thumb 2 instruction set (5)

- The main enhancements are:
- **4.** addition of a 16-bit CZB instruction
  - Compare with Zero and Branch (CZB, written as CBZ and CBNZ in UAL) improves code density by replacing a two-instruction sequence with a single instruction.
- **5.** The 32-bit ARM Thumb-2 instructions are added in the space occupied by the Thumb BL and BLX instructions

#### 32-bit ARM Thumb-2 Instruction Format (1)

- The <u>first halfword (hw1)</u> determines the instruction length and functionality.
- If the processor decodes the instruction as 32-bit long, then the processor <u>fetches</u> the <u>second</u> halfword (hw2) of the instruction from the instruction address <u>plus</u> two.
- The availability of both 16-bit and 32-bit instructions in the Thumb-2 instruction set gives you the flexibility to emphasize performance or code size at the subroutine level, according to the requirements of your application.
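The length decision made from the first halfword can be sketched as follows. The rule used here, that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction, is taken from the ARM architecture manuals rather than from these slides, and the helper names are hypothetical.

```python
def thumb_instruction_length(hw1):
    """Return the instruction length in bytes, decided from the first halfword."""
    top5 = (hw1 >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_stream(halfwords):
    """Group a list of halfwords into instructions (tuples of one or two halfwords)."""
    instructions = []
    index = 0
    while index < len(halfwords):
        count = thumb_instruction_length(halfwords[index]) // 2
        instructions.append(tuple(halfwords[index:index + count]))
        index += count  # hw2, when present, sits at instruction address plus two
    return instructions


stream = [0x4770, 0xF000, 0xF800, 0xE7FE]  # BX LR, BL (two halfwords), B .
print([[hex(h) for h in ins] for ins in split_stream(stream)])
```

The three prefixes are the encodings formerly used by the two halves of BL and BLX, which is how the 32-bit instructions fit into the existing 16-bit opcode space.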



#### 32-bit ARM Thumb-2 Instruction Format (2)

For example, you can code critical loops for applications such as fast interrupts and DSP algorithms using the 32-bit media instructions in Thumb-2 and use the smaller 16-bit classic Thumb instructions for the rest of the application.
 This is for code density and does not require any mode change.



https://developer.arm.com/documentation/ddi0344/c/programmer-s-model/thumb-2-instruction-set


# ARM, Thumb, Thumb 2 instruction encodings (1)

- Officially there's no "Thumb-2 instruction set".
- Ignoring ARMv8 (where everything is <u>renamed</u> and <u>AArch64</u> complicates things), from ARMv4T to ARMv7-A there are two instruction sets: ARM and Thumb.
- They are both "32-bit" in the sense that they operate on
  - up-to-32-bit-wide data
  - in 32-bit-wide registers
  - with 32-bit addresses.
- In fact, where they overlap they represent the exact same instructions; it is only the instruction encoding which differs.
- The CPU has <u>two</u> *different* decode front-ends to its pipeline which it can switch between.

https://stackoverflow.com/questions/28669905/what-is-the-difference-between-the-arm-thumb-and-thumb-2-instruction-encodings

## ARM, Thumb, Thumb 2 instruction encodings (2)

- ARM instructions have fixed-width 4-byte encodings which require 4-byte alignment.
- Thumb instructions have variable-length encodings requiring 2-byte alignment:
  - 2-byte "narrow" encoding
  - 4-byte "wide" encoding
- Most instructions have 2-byte encodings, but **bl** and **blx** have always had 4-byte encodings\*.

https://stackoverflow.com/questions/28669905/what-is-the-difference-between-the-arm-thumb-and-thumb-2-instruction-encodings


# ARM, Thumb, Thumb 2 instruction encodings (3)

- The really confusing bit came in ARMv6T2, which introduced "Thumb-2 Technology".
- Thumb-2 encompassed not just
  - adding a load more instructions to Thumb (mostly with <u>4-byte encodings</u>) to bring it almost to parity with ARM,
  - <u>but</u> also *extending* the execution state to allow for conditional execution of most Thumb instructions,
  - and finally introducing a whole new <u>assembly syntax</u> (UAL, "<u>Unified Assembly Language</u>")
    - which *replaced* the previous <u>separate</u> ARM and Thumb <u>syntaxes</u>
    - and allowed *writing* code once and assembling it to either ARM or Thumb <u>instruction set</u> without modification.

https://stackoverflow.com/questions/28669905/what-is-the-difference-between-the-arm-thumb-and-thumb-2-instruction-encodings

- **Thumb-2 Technology**: 4-byte encodings, conditional execution
- **UAL (Unified Assembly Language)**: unifies the ARM and Thumb <u>syntaxes</u>, assembling to either ARM or Thumb

## ARM, Thumb, Thumb 2 instruction encodings (4)

- The Cortex-M architectures only implement the Thumb instruction set:
  - ARMv7-M (Cortex-M3/M4/M7) supports most of "Thumb-2 Technology", including conditional execution and encodings for VFP instructions,
  - whereas ARMv6-M (Cortex-M0/M0+) only uses Thumb-2 in the form of a handful of 4-byte system instructions.
- Thus, the new 4-byte encodings (and those added later in ARMv7 revisions) are still Thumb instructions.
- The "Thumb-2" aspect of them is that they can have 4-byte encodings, and that they can (mostly) be conditionally executed via the IT instruction.
- Their mnemonics appear to be defined only in UAL.

https://stackoverflow.com/questions/28669905/what-is-the-difference-between-the-arm-thumb-and-thumb-2-instruction-encodings

# ARM, Thumb, Thumb 2 instruction encodings (7)

- Thumb: 16 bit instruction set
- ARM: 32 bit wide instruction set, hence more flexible instructions and lower code density
- Thumb2 (mixed 16/32 bit): a compromise between ARM and Thumb (16 bit), mixing them to get both the performance and flexibility of ARM and the instruction density of Thumb.
- So a Thumb2 instruction can be <u>either</u> a 32 bit wide instruction (covering only a subset of ARM) or a 16 bit wide Thumb instruction.

https://stackoverflow.com/questions/28669905/what-is-the-difference-between-the-arm-thumb-and-thumb-2-instruction-encodings

# UAL (Unified Assembly Language) (1-1)

- Unified assembly language (UAL) is the new assembly syntax introduced by ARM Ltd.
  - to handle the ambiguities introduced by the original Thumb-2 assembly syntax and
  - provide similar syntax for **ARM**, **Thumb** and **Thumb-2**.
- UAL is backwards compatible with old ARM assembly, but incompatible with the **Thumb** assembly syntax.
- **UAL** syntax is the default assembly syntax beginning with ARMv7 architectures.

http://downloads.ti.com/docs/esd/SPNU118/unified-assembly-language-syntax-support-spnu1184444.html

# UAL (Unified Assembly Language) (1-2)

- When writing assembly code, the .arm and .thumb directives are used to specify ARM and Thumb UAL syntax, respectively.
- The .state32 and .state16 directives remain to specify non-UAL **ARM** and **Thumb** syntax.
- The .arm and .state32 directives are equivalent since UAL syntax is backwards compatible in ARM mode.
- Since non-UAL syntax is <u>not supported</u> for **Thumb-2** instructions, **Thumb-2** instructions <u>cannot</u> be <u>used</u> inside of a .state16 section.
- However, assembly code with .state16 sections that contain <u>only</u> non-UAL Thumb code can be assembled for ARMv7 architectures to allow easy porting of older code.

http://downloads.ti.com/docs/esd/SPNU118/unified-assembly-language-syntax-support-spnu1184444.html

# UAL (Unified Assembly Language) (2-1)

- The ARM Unified Assembler Language (UAL) syntax provides a <u>canonical form</u> for *all* ARM and Thumb instructions.
- UAL describes the <u>syntax</u> for the <u>mnemonic</u> and the <u>operands</u> of each instruction.
- In addition, it assumes that instructions and data items can be given labels.
- It does <u>not specify</u> the <u>syntax</u> to be used for <u>labels</u>, <u>nor</u> what assembler <u>directives</u> and <u>options</u> are available.

https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/The-Instruction-Sets/Unified-Assembler-Language


# UAL (Unified Assembly Language) (2-2)

- <u>Most earlier ARM</u> assembly language mnemonics are still supported as <u>synonyms</u>
- <u>Most earlier</u> Thumb assembly language mnemonics are <u>not supported</u>.

https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/The-Instruction-Sets/Unified-Assembler-Language


# UAL (Unified Assembly Language) (3)

- UAL includes instruction selection rules that specify <u>which</u> instruction encoding is <u>selected</u> when more than one can provide the required functionality.
- For example, both 16-bit and 32-bit encodings exist for an ADD R0, R1, R2 instruction.
- The <u>most common instruction selection rule</u> is that when both 16-bit and 32-bit encodings are available, the 16-bit encoding is <u>selected</u>, to optimize code density.
- Syntax options exist to <u>override</u> the <u>normal</u> instruction selection rules and <u>ensure</u> that a <u>particular</u> encoding is selected.
- These are <u>useful</u> when <u>disassembling</u> code, to ensure that subsequent assembly produces the <u>original code</u>, and in some other situations.
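The selection rule and its override can be modelled in a few lines. This is a sketch under assumptions: the `.N` and `.W` width qualifiers are the UAL syntax options as documented by ARM, not something stated in these slides, and `select_encoding` is a hypothetical helper, not part of any assembler.

```python
def select_encoding(available, qualifier=None):
    """Pick an encoding width in bits from the set of widths an instruction offers."""
    if qualifier == ".N":
        if 16 not in available:
            raise ValueError("no 16-bit (narrow) encoding available")
        return 16
    if qualifier == ".W":
        if 32 not in available:
            raise ValueError("no 32-bit (wide) encoding available")
        return 32
    if qualifier is not None:
        raise ValueError(f"unknown qualifier: {qualifier}")
    # Default rule: prefer the smaller encoding to optimize code density.
    return min(available)


print(select_encoding({16, 32}))        # default choice
print(select_encoding({16, 32}, ".W"))  # forced wide encoding
```

A disassembler that emits the qualifier for every instruction lets a later assembly pass reproduce the original encodings byte for byte.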

https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/The-Instruction-Sets/Unified-Assembler-Language

#### **NEON and VFP**

- For the ARMv7 ISA (and variants):
  - NEON is a SIMD and parallel data processing unit for integer and floating point data.
  - The VFP is a fully IEEE-754 compatible floating point unit.
- In particular on the A8, the NEON unit is much <u>faster</u> for just about everything, even if you don't have highly parallel data, since the VFP is non-pipelined.
- So why would you ever use the VFP?!
  - The biggest difference is that the VFP provides double precision floating point.
  - Secondly, the VFP offers some specialized instructions that have no equivalent implementation in the NEON unit.
  - SQRT comes to mind, perhaps some type conversions.

https://stackoverflow.com/questions/4097034/arm-cortex-a8-whats-the-difference-between-vfp-and-neon

# Jazelle DBX (Direct Bytecode Execution)

# Jazelle (1)

- Jazelle DBX (direct bytecode execution) is an extension that allows some ARM processors to execute Java bytecode in hardware as a third execution state alongside the existing ARM and Thumb modes.
- Jazelle functionality was specified in the ARMv5TEJ architecture.
- The first processor with Jazelle technology was the ARM926EJ-S.
- Jazelle is denoted by a "J" appended to the CPU name except for <u>post-v5 cores</u> where it is required (albeit only in trivial form) for architecture conformance.

**TJ** = 10

https://en.wikipedia.org/wiki/Jazelle#Implementation

## Jazelle (2)

- The J bit in the CPSR indicates when the processor is in Jazelle state.
  - When J = 0, the processor is in ARM or Thumb state, depending on the T bit.
  - When J = 1, the processor is in Jazelle state.

| <b>TJ</b> | State     |
|-----------|-----------|
| 00        | ARM       |
| 10        | Thumb     |
| 01        | Jazelle   |
| 11        | undefined |
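The TJ combinations can be decoded from a CPSR value as in the sketch below. It assumes the bit positions shown in the CPSR layout of this document (T at bit 5, J at bit 24); the function name is hypothetical, and the TJ = 11 case follows the description here rather than later architecture revisions.

```python
T_BIT = 1 << 5    # Thumb state bit
J_BIT = 1 << 24   # Jazelle state bit

STATES = {
    (False, False): "ARM",
    (True, False): "Thumb",
    (False, True): "Jazelle",
    (True, True): "undefined",
}


def instruction_state(cpsr):
    """Return the instruction set state selected by the T and J bits."""
    t = bool(cpsr & T_BIT)
    j = bool(cpsr & J_BIT)
    return STATES[(t, j)]


for value in (0x00000010, 0x00000030, 0x01000010, 0x01000030):
    print(hex(value), instruction_state(value))
```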

https://developer.arm.com/documentation/ddi0301/h/programmer-s-model/the-program-status-registers/the-j-bit

## Jazelle (3)

- The combination of **J = 1** and **T = 1** causes <u>similar effects</u> to setting **T=1** on a non Thumb-aware processor.
- That is, the <u>next instruction</u> executed causes entry to the **Undefined** Instruction exception.
  - Entry to the exception handler causes the processor to <u>re-enter</u> ARM state, and
  - the handler can <u>detect</u> that this was the <u>cause</u> of the exception because J and T are <u>both set</u> in SPSR\_und.
- MSR cannot be used to change the J bit in the CPSR.

TJ = 00 ARM, TJ = 10 Thumb, TJ = 01 Jazelle, TJ = 11 undefined

https://developer.arm.com/documentation/ddi0301/h/programmer-s-model/the-program-status-registers/the-j-bit

# Jazelle (4)

- The placement of the **J bit** <u>avoids</u> the status or extension bytes in code running on ARMv5TE or earlier processors.
- This ensures that OS code written using the deprecated syntax CPSR, SPSR, CPSR\_all, or SPSR\_all for the <u>destination</u> of an MSR instruction continues to work.
- The MSR instruction is used to write
  - to the CPSR or
  - to the **SPSR** of the current mode.

| CPSR byte | flags CPSR_f | status CPSR_s | extension CPSR_x | control CPSR_c |
|-----------|--------------|---------------|------------------|----------------|
| Bits      | 31:24        | 23:16         | 15:8             | 7:0            |

| Bits  | 31 | 30 | 29 | 28 | 27 | 26:25   | 24 | 23:20 | 19:16 | 15:10   | 9 | 8 | 7 | 6 | 5 | 4:0  |
|-------|----|----|----|----|----|---------|----|-------|-------|---------|---|---|---|---|---|------|
| Field | N  | Z  | C  | V  | Q  | IT[1:0] | J  |       | GE    | IT[7:2] | E | A | I | F | T | mode |

Current Program Status Register (CPSR)
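The four byte-wide fields can be pictured as write masks. The sketch below is a simplified model, not the architectural definition: `msr_cpsr` and `FIELD_MASKS` are hypothetical names, the field letters c, x, s and f follow the CPSR_c, CPSR_x, CPSR_s and CPSR_f labels above, and the only special case modelled is that MSR leaves the J bit unchanged.

```python
FIELD_MASKS = {
    "c": 0x000000FF,  # control byte, bits 7:0
    "x": 0x0000FF00,  # extension byte, bits 15:8
    "s": 0x00FF0000,  # status byte, bits 23:16
    "f": 0xFF000000,  # flags byte, bits 31:24
}
J_BIT = 1 << 24


def msr_cpsr(cpsr, value, fields="cxsf"):
    """Return the CPSR after writing `value` to the selected byte fields."""
    mask = 0
    for field in fields:
        mask |= FIELD_MASKS[field]
    mask &= ~J_BIT  # MSR cannot be used to change the J bit
    return (cpsr & ~mask & 0xFFFFFFFF) | (value & mask)


print(hex(msr_cpsr(0x000000D3, 0xF0000000, "f")))  # update flags only
print(hex(msr_cpsr(0x600000D3, 0x00000010, "c")))  # update control byte only
```

Writing only the f and c fields leaves the status and extension bytes untouched, which is the property the placement of the J bit relies on.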

https://developer.arm.com/documentation/ddi0301/h/programmer-s-model/the-program-status-registers/the-j-bit

# CPSR Bits (1)

| Flag     | Meaning       |
|----------|---------------|
| <b>N</b> | Negative flag |
| <b>Z</b> | Zero flag     |
| <b>C</b> | Carry flag    |
| <b>V</b> | Overflow flag |

- To <u>disable</u> Interrupt (**IRQ**), set **I**
- To <u>disable</u> Fast Interrupt (**FIQ**), set **F**
- the **T bit** shows whether the processor runs in ARM state or in Thumb state.
- never set this bit
- can be changed only in a privileged mode

| Mode | mode bits [4:0] |
|------|-----------------|
| USR  | 10000           |
| FIQ  | 10001           |
| IRQ  | 10010           |
| SVC  | 10011           |
| ABT  | 10111           |
| UND  | 11011           |
| SYS  | 11111           |

| Bits  | 31 | 30 | 29 | 28 | 27:8 | 7 | 6 | 5 | 4:0  |
|-------|----|----|----|----|------|---|---|---|------|
| Field | N  | Z  | C  | V  |      | I | F | T | mode |

Current Program Status Register (CPSR)

https://developer.arm.com/documentation/ddi0301/h/programmer-s-model/the-program-status-registers/the-j-bit

https://courses.washington.edu/cp105/02\_Exceptions/Status\_Register\_Instructions.html

# CPSR Bits (2)

- Q Cumulative saturation bit
- IT[1:0] If-Then execution state bits
  - for the Thumb IT (If-Then) instruction
- J Jazelle bit
- GE Greater than or Equal flags
- IT[7:2] If-Then execution state bits
  - for the Thumb IT (If-Then) instruction
- E Endianness execution state bit: 0 Little-endian, 1 Big-endian
- A Asynchronous abort mask bit
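Pulling the individual fields out of a 32-bit CPSR value can be sketched as below. The bit positions and mode encodings are the ones listed in the CPSR tables of this document; `decode_cpsr` is a hypothetical helper and the example value is made up.

```python
MODES = {
    0b10000: "USR", 0b10001: "FIQ", 0b10010: "IRQ", 0b10011: "SVC",
    0b10111: "ABT", 0b11011: "UND", 0b11111: "SYS",
}


def decode_cpsr(cpsr):
    """Split a CPSR value into its named fields."""
    def bit(position):
        return (cpsr >> position) & 1

    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28), "Q": bit(27),
        # IT[7:2] lives in bits 15:10 and IT[1:0] in bits 26:25.
        "IT": (((cpsr >> 10) & 0x3F) << 2) | ((cpsr >> 25) & 0x3),
        "J": bit(24),
        "GE": (cpsr >> 16) & 0xF,
        "E": bit(9), "A": bit(8), "I": bit(7), "F": bit(6), "T": bit(5),
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }


fields = decode_cpsr(0x600000D3)
print(fields)
```

For the example value, Z and C are set, IRQ and FIQ are both disabled, and the mode bits select SVC.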



https://www.keil.com/pack/doc/CMSIS/Core\_A/html/group\_CMSIS\_CPSR.html

#### MRS – Move to Register from Status

- MRS is used to read
  - from the CPSR or
  - from the SPSR of the current mode
- It moves the value from the status register into a regular register.
- The **SPSR** that will be read is the one that is active for the CPU's current mode.
- MRS R0, CPSR
- MRS R1, SPSR
- Reading the **SPSR** while in user or system mode is <u>not valid</u> and yields <u>unpredictable</u> results.

https://courses.washington.edu/cp105/02\_Exceptions/Status\_Register\_Instructions.html

#### MSR – Move to Status from Register

- The MSR instruction is used to write
  - to the CPSR or
  - to the **SPSR** of the current mode.
- Writing to the **SPSR** while in the user or system mode is <u>not valid</u> and the results are <u>not predictable</u>.
- Any writes to the **CPSR** in user mode are <u>ignored</u>.
- The **CPSR** can only be written to in a privileged mode.
- MSR CPSR, R0
- MSR SPSR, R1

https://courses.washington.edu/cp105/02\_Exceptions/Status\_Register\_Instructions.html

#### 64-bit Processors



# ARM, Thumb, Thumb 2 instruction encodings (5)

- there is a 32-bit execution state (AArch32) and a 64-bit execution state (AArch64).
- the 32-bit execution state supports two different instruction sets:
  - T32 ("Thumb") and
  - A32 ("ARM").
- The 64-bit execution state supports only one instruction set A64.
- All A64, like all A32, instructions are 32-bit (4 byte) in size, requiring 4-byte alignment.
- Many/most A64 instructions can operate on both 32-bit and 64-bit registers (or arguably 32-bit or 64-bit views of the same underlying 64-bit register).
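The idea of 32-bit and 64-bit views of one underlying register can be modelled as below. The class is a hypothetical illustration, and it assumes the documented A64 behaviour that a write through the 32-bit view clears the upper 32 bits of the 64-bit register.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


class A64Register:
    """One 64-bit register with an X (64-bit) and a W (32-bit) view."""

    def __init__(self, value=0):
        self.x = value & MASK64

    def read_w(self):
        """The W view is the low 32 bits of the X register."""
        return self.x & MASK32

    def write_w(self, value):
        """Writing the W view zero-extends into the full X register."""
        self.x = value & MASK32

    def write_x(self, value):
        self.x = value & MASK64


reg = A64Register(0x1122334455667788)
print(hex(reg.read_w()))
reg.write_w(0xDEADBEEF)
print(hex(reg.x))
```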

https://stackoverflow.com/questions/28669905/what-is-the-difference-between-the-arm-thumb-and-thumb-2-instruction-encodings

# ARM, Thumb, Thumb 2 instruction encodings (6)

- All ARMv8 processors (like all ARMv7 processors) that implement AArch32 support Thumb-2 instructions in the T32 instruction set.
- <u>Not</u> all ARMv8-A processors implement AArch32, and some <u>don't</u> implement AArch64.
- Some Processors support both, but only support AArch32 at <u>lower exception levels</u>.

https://stackoverflow.com/questions/28669905/what-is-the-difference-between-the-arm-thumb-and-thumb-2-instruction-encodings

#### 64-bit processor (1)

Evolution of the ARM architecture

- The diagram shows how all the features present in **ARMv7-A** have been carried forward into **ARMv8-A**.
- But **ARMv8** supports two execution states:
  - AArch32, in which the A32 and T32 instruction sets (ARM and Thumb in ARMv7-A) are supported
  - AArch64, in which the new A64 instruction set is introduced.
- Although backwards compatible with ARMv7-A, the <u>exception</u>, <u>privilege</u> and <u>security</u> model has been significantly *extended* and is now classified as a set of <u>exception levels</u>, EL0 to EL3, in a four-level hierarchy.

Diagram: ARMv7-A provides AArch32 (ARM + Thumb ISAs); ARMv8-A provides AArch32 (A32 + T32 ISAs) and AArch64 (A64 ISA).

#### 64-bit processor (2)

- In AArch32, the ARMv7-A Large Physical Address Extensions are supported, providing
  - <u>32-bit</u> virtual addressing and
  - <u>40-bit</u> physical addressing.
- In AArch64, this is extended, again in a backward-compatible way, to provide
  - <u>64-bit</u> virtual addresses and
  - <u>48-bit</u> physical addresses.
- Other additions include cryptographic support at the instruction level.
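As a quick worked calculation, the address widths quoted above translate into the following address-space sizes. This is plain arithmetic (2 to the power of the number of address bits); the helper names are illustrative.

```python
# Sketch only: convert address widths into address-space sizes.

UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def address_space_bytes(bits):
    """Number of distinct byte addresses for a given address width."""
    return 1 << bits


def human_readable(size):
    """Format a power-of-two byte count using binary units."""
    index = 0
    while size >= 1024 and index < len(UNITS) - 1:
        size //= 1024
        index += 1
    return f"{size} {UNITS[index]}"


for label, bits in [
    ("AArch32 virtual", 32),
    ("AArch32 physical (LPAE)", 40),
    ("AArch64 physical", 48),
    ("AArch64 virtual", 64),
]:
    print(f"{label}: {bits} bits = {human_readable(address_space_bytes(bits))}")
```

So the Large Physical Address Extensions raise the reachable physical memory from 4 GiB to 1 TiB, and a 48-bit physical address reaches 256 TiB.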


#### 64-bit processor (3)

Overview of AArch64 in ARMv8-A
- The A64 instruction set, defined in AArch64, has been designed from the ground up as a <u>clean</u>, <u>modern</u> instruction set which operates on 64-bit or 32-bit native datatypes or registers.
- A64 is a <u>fixed-length</u> instruction set in which all instructions are <u>32 bits</u> in length.
- It does, as you might expect, have many similarities with the A32 instruction set which you'll be familiar with from earlier ARM architectures.
- There are some things you'll find which are new and some things which you'll go looking for and aren't there!



#### 64-bit processor (6)

Changing Execution state and Instruction set

- A fully-populated ARMv8-A processor supports both AArch32 and AArch64 execution states.
- <u>Transition</u> between the two is always <u>across</u> an <u>exception boundary</u>.
- This differs from ARMv7-A, in which a <u>change</u> of instruction set is triggered by an <u>interworking branch</u> (e.g. BLX).



https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/Porting%20to%20ARM%2064-bit.pdf
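The ARMv7-A interworking branch mentioned above can be sketched as follows. The model assumes the register form (`BX Rm` or `BLX Rm`), where bit 0 of the target address selects the instruction set; the function name and the sample addresses are hypothetical.

```python
# Sketch only: how a register-form interworking branch (BX/BLX Rm) in ARMv7-A
# selects the instruction set from bit 0 of the target address.

def interworking_branch(target):
    """Return (instruction_set, new_pc) for a branch to the given target."""
    if target & 1:
        # Bit 0 set: continue in Thumb state at the halfword-aligned address.
        return ("Thumb", target & ~1)
    if target & 2:
        # Bit 0 clear but bit 1 set: not a valid ARM-state target.
        raise ValueError("ARM-state target must be 4-byte aligned")
    # Bit 0 clear: continue in ARM state.
    return ("ARM", target)


print(interworking_branch(0x8001))   # Thumb code at 0x8000
print(interworking_branch(0x9000))   # ARM code at 0x9000
```

This is why pointers to Thumb functions have their lowest bit set: the bit is consumed by the branch and never reaches the program counter.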


#### 64-bit processor (7)

Changing Execution state and Instruction set

- The relationship between the **T32**, **A32** and **A64** instruction sets, and the events which can cause a switch between them:
  - When <u>taking an exception</u>, the execution state can <u>stay</u> the same or go from 32-bit to 64-bit.
  - When <u>returning from an exception</u>, the execution state can <u>stay</u> the same or go from 64-bit to 32-bit.
- This introduces a natural hierarchy of 64-bit and 32-bit support at each level.



https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/Porting%20to%20ARM%2064-bit.pdf
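The ARMv8-A rule for changing execution state can be expressed as a small check. This sketch assumes the architectural rule that exception entry may only keep the state or move from AArch32 to AArch64, while exception return may only keep the state or move from AArch64 to AArch32; the function and event names are illustrative.

```python
# Sketch only: which execution-state changes are possible in ARMv8-A.
# States are given as register widths: 32 for AArch32, 64 for AArch64.

EVENTS = ("exception_entry", "exception_return")


def transition_allowed(event, current_bits, next_bits):
    """True if the execution state may change this way on the given event."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    if current_bits == next_bits:
        return True                                   # staying put is always fine
    if event == "exception_entry":
        return (current_bits, next_bits) == (32, 64)  # only up to 64-bit
    return (current_bits, next_bits) == (64, 32)      # only down to 32-bit


for event in EVENTS:
    for current_bits, next_bits in [(32, 32), (32, 64), (64, 32), (64, 64)]:
        allowed = transition_allowed(event, current_bits, next_bits)
        print(f"{event}: {current_bits} -> {next_bits}: {allowed}")
```

A consequence is the hierarchy noted above: a 64-bit level can host 32-bit code below it, but a 32-bit level cannot host 64-bit code below it.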


Young Won Lim 11/13/24
