Arm Instruction 2 - 001

PART 2
ARM Assembly Language
Part 2 - Contents
Lecture 5 - ARM data processing
  ARM data processing instructions in detail
  ARM status flags and tests
Lecture 6 - ARM memory access
  Addressing modes for LDR/STR
  Assembler pseudo-instructions
Lecture 7 - Execution conditions & branches
  Comparison & test instructions
  Signed and unsigned comparison
  ARM tips & tricks
Lecture 8 - Processing parts of a word: shifts, rotates, & logical operations
  Example bit-manipulation problems
  ARM instructions: bitwise logical & shifts
Lecture 9 - Subroutines, return addresses, and stacks
  Why use subroutines? Why use stacks?
  Implementing stacks on ARM
  ARM multiple register transfer instructions
  ARM simulator & debugger

[Figure: EDSAC simulator (written by Martin Campbell-Kelly, Univ. Warwick), which reproduces the original EDSAC control panel. EDSAC was the first computer to be programmed in an assembly language. The assembler was 41 instructions long!]

tjwc - 2-Dec-10
ISE1/EE2 Introduction to Computer Architecture
Lecture 5 Data Processing: ARM implementation

"...and then the different branches of Arithmetic: Ambition, Distraction, Uglification, and Derision." - The Mock Turtle (Lewis Carroll)
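Data processing (ADD, SUB, AND, CMP, MOV, etc.)

Instruction word fields (field widths in bits in brackets):

cond [4] | 0 0 | bit 25 | Op [4] | S [1] | Rn [4] | Rd [4] | op2 [12]
  bit 25 = 1: op2 = Rot [4], C [8] (immediate value)    Rd := Rn Op C
  bit 25 = 0: op2 = Shift [8], Rm [4]                   Rd := Rn Op Rm
  op1 is Rn, dest is Rd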
Arithmetic is the most complex data processing operation at an assembly language level. ARM implements 32 bit addition and subtraction. Longer calculations must make appropriate use of carries. We will look at:
  ARM data processing arithmetic & logical instructions
  Use of immediate operands in data processing instructions
  Simple examples
ALU operation: dest := op1 op op2
  S bit = 1 => status bits are written
  S bit = 0 => status bits unchanged
The first operand, op1, is always register Rn.
The second operand, op2, is either a constant C or register Rm.
This lecture: assume Shift = 0, Rot = 0, for unshifted Rm or immediate constant C.
ARM data processing instructions

Op    Assembly           Operation                Pseudocode
0000  AND Rd, Rn, op2    Bitwise logical AND      Rd := Rn AND op2
0001  EOR Rd, Rn, op2    Bitwise logical XOR      Rd := Rn XOR op2
0010  SUB Rd, Rn, op2    Subtract                 Rd := Rn - op2
0011  RSB Rd, Rn, op2    Reverse subtract         Rd := op2 - Rn
0100  ADD Rd, Rn, op2    Add                      Rd := Rn + op2
0101  ADC Rd, Rn, op2    Add with carry           Rd := Rn + op2 + C
0110  SBC Rd, Rn, op2    Subtract with carry      Rd := Rn - op2 + C - 1
0111  RSC Rd, Rn, op2    Reverse sub with carry   Rd := op2 - Rn + C - 1
1100  ORR Rd, Rn, op2    Bitwise logical OR       Rd := Rn OR op2
1101  MOV Rd, op2        Move                     Rd := op2
1110  BIC Rd, Rn, op2    Bitwise clear            Rd := Rn AND NOT op2
1111  MVN Rd, op2        Bitwise move negated     Rd := NOT op2
Example: Rd := Rn Op Rm, here R0 := R1 + R2

cond   0 0 0   Op     S   Rn     Rd     Shift       Rm
1110   0 0 0   0100   0   0001   0000   0000 0000   0010
Here are the move and arithmetic data processing instructions.
  The operations with carry allow multi-word addition and subtraction.
  MOV and MVN do not use Rn; Rn should be set to 0 in the instruction word.
  Op = 0100 (ADD)
  Cond = 1110 (always)
  Rd = 0 => R0
  Rn = 1 => R1
  Rm = Op2 = 2 => R2
  S = 0 (don't write status bits)
If you use the assembler you don't need to worry about the precise bit format above; just write:
  ADD R0, R1, R2
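The field packing in this example can be checked with a few lines of Python. This is only a sketch of the register-operand format described here (Shift = 0 by default); the function name `encode_dp_register` and the `OPCODES` dictionary are illustrative names, not part of any ARM tool.

```python
# Opcode values copied from the table of data processing instructions.
OPCODES = {
    "AND": 0b0000, "EOR": 0b0001, "SUB": 0b0010, "RSB": 0b0011,
    "ADD": 0b0100, "ADC": 0b0101, "SBC": 0b0110, "RSC": 0b0111,
    "ORR": 0b1100, "MOV": 0b1101, "BIC": 0b1110, "MVN": 0b1111,
}

COND_ALWAYS = 0b1110


def encode_dp_register(op, rd, rn, rm, s=0, cond=COND_ALWAYS, shift=0):
    """Pack the fields of a data processing instruction with a register op2."""
    return (
        (cond << 28)             # bits 31:28  execution condition
        | (0b000 << 25)          # bits 27:25  000 selects a register op2
        | (OPCODES[op] << 21)    # bits 24:21  operation
        | (s << 20)              # bit 20      write status bits?
        | (rn << 16)             # bits 19:16  first operand register
        | (rd << 12)             # bits 15:12  destination register
        | (shift << 4)           # bits 11:4   shift field (0 = unshifted)
        | rm                     # bits 3:0    second operand register
    )


word = encode_dp_register("ADD", rd=0, rn=1, rm=2)   # ADD R0, R1, R2
print(f"{word:032b}")
print(f"{word:08X}")   # prints E0810002
```

Reading the binary output in groups of 4, 3, 4, 1, 4, 4, 8, 4 bits reproduces the row of fields shown in the example.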
Data Processing Instructions

Rules that apply to ARM data processing instructions:
  All operands are 32 bits, and come either from registers or are specified as constants (called literals) in the instruction itself.
  The result is also 32 bits and is placed in a register.
  3 operands: 2 for inputs and 1 for the result (usually).
Data Processing Instructions Arithmetic operations

ADD r0, r1, r2 ; r0 := r1 + r2
ADC r0, r1, r2 ; r0 := r1 + r2 + C
SUB r0, r1, r2 ; r0 := r1 - r2
SBC r0, r1, r2 ; r0 := r1 - r2 + (C - 1)
RSB r0, r1, r2 ; r0 := r2 - r1
RSC r0, r1, r2 ; r0 := r2 - r1 + (C - 1)
Example: SUB r0, r1, r2 ; r0 := r1 - r2
  Works for both unsigned and 2's complement signed numbers.
Note that the source registers are unchanged (unless dest = source).
RSB stands for reverse subtraction.
Operands & result may be interpreted as unsigned or 2's complement signed integers.
'C' is the carry (C) status bit in the CPSR.
  In subtraction the carry acts as an inverted "borrow" (C = 0 means borrow, C = 1 means no borrow), hence the (C - 1) term.
Result register can be the same as an input operand register:

ADD r0, r0, r0 ; doubles the value in r0!
ADC, SBC, and RSC are used to operate on data more than 32 bits long in 32-bit chunks (see the 64 bit addition example).
RSB and RSC are useful, instead of SUB and SBC with r1, r2 reversed, because r2 can be any of the op2 variants (see later).
SBC, RSC
Some of you are probably wondering: why (C - 1) in subtraction?
  The negation needed by subtraction is implemented in hardware by a bitwise NOT function and an addition with C = 1. Thus C = 0 has the effect of -1, and C = 1 is a normal subtract.
  The first (LSW) subtract of a multi-word subtraction must have the carry set: this is the default carry used in the SUB and RSB instructions.
Normally the LS word of a multi-word add or subtract will use ADDS or SUBS; all others will use ADCS or SBCS.
Note that the S suffix means S = 1: write status bits (condition codes). It can be added to any DP assembler mnemonic except comparisons.
Example 64 bit addition

For example, let's add two 64-bit numbers X and Y, storing the result in Z. We need two registers to hold each number, because registers are 32 bit: store X as r1:r0, Y in r3:r2, and Z in r5:r4 (notation MSW:LSW). Then:

ADDS r4, r0, r2 ; r4 := r0 + r2 (set C)
ADCS r5, r1, r3 ; r5 := r1 + r3 + C
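[Figure: multi-word addition in 32-bit chunks. ADDS adds bits 31:0; ADCS then adds bits 63:32, and bits 95:64 for a longer number, each using the carry from the chunk below.]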
S at the end of an instruction means you want to write the C, V, N, and Z status bits. In this case the C flag is needed. Similarly, if we wanted to subtract the two numbers:
SUBS r4, r0, r2 ; without carry
SBCS r5, r1, r3 ; with carry
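To see why the carry links the two halves, here is a small Python model of ADDS/ADCS and SUBS/SBCS on 32-bit words. It is a sketch of the arithmetic only (it tracks just the C flag), and the helper names are illustrative. Subtraction is modelled exactly as described above: add the bitwise NOT of the second operand plus the carry.

```python
MASK32 = 0xFFFFFFFF


def add_with_carry(a, b, carry_in):
    """32-bit a + b + carry_in; returns (result, carry_out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32


def adds(a, b):
    return add_with_carry(a, b, 0)               # ADDS: no carry in


def adcs(a, b, c):
    return add_with_carry(a, b, c)               # ADCS: carry in = C


def subs(a, b):
    return add_with_carry(a, ~b & MASK32, 1)     # SUBS: a + NOT b + 1


def sbcs(a, b, c):
    return add_with_carry(a, ~b & MASK32, c)     # SBCS: a + NOT b + C


def add64(x, y):
    """X in r1:r0, Y in r3:r2, result in r5:r4 (MSW:LSW)."""
    r0, r1 = x & MASK32, x >> 32
    r2, r3 = y & MASK32, y >> 32
    r4, c = adds(r0, r2)
    r5, c = adcs(r1, r3, c)
    return (r5 << 32) | r4


def sub64(x, y):
    r0, r1 = x & MASK32, x >> 32
    r2, r3 = y & MASK32, y >> 32
    r4, c = subs(r0, r2)
    r5, c = sbcs(r1, r3, c)
    return (r5 << 32) | r4


print(hex(add64(0x00000001FFFFFFFF, 1)))   # prints 0x200000000
print(hex(sub64(0x0000000200000000, 1)))   # prints 0x1ffffffff
```

In the second call the low-word SUBS produces C = 0 (a borrow), and SBCS then subtracts one extra from the high word.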
Data Processing Instructions Register Moves

Here are ARM's register move operations:
MOV r0, r2 ; r0 := r2
MVN r0, r2 ; r0 := NOT r2

These are a special case of data processing where one register is not used, but the other options (shifted r2 etc., see later) still apply.
MVN stands for 'move negated': a bitwise NOT.
  This is not a two's complement negate: there is no addition of 1!

r2: 0101 0011 1010 1111 1101 1010 0110 1011
r0: 1010 1100 0101 0000 0010 0101 1001 0100

Operand 2

Data processing instructions have a 3 operand format: Rd := Rn op op2
  First operand (Rn): always a register
  Second operand (op2) can be:
    an immediate (literal) value in the range 0-255
    a register Rm
  Use of shifts adds more options, considered later.

# indicates a literal:

ADD R5, R2, #200 ; Op2 = 200 is a decimal literal value
ADD R5, R2, R3   ; Op2 = R3

Negative literal values

Since the literal op2 is an unsigned value, it cannot be used directly to set a register to a negative number. However, usually this does not matter, because a different op-code can be used:

ADD r0, r1, #-11 => SUB r0, r1, #11
MOV r0, #-n      => MVN r0, #(n-1)     ; MVN inverts bits: NOT x = 2^32 - 1 - x
ADC r0, r1, #-n  => SBC r0, r1, #(n-1) ; WHY (n-1)?
Examples

cond [4] | 0 0 1 | Op [4] | S [1] | Rn [4] | Rd [4] | Rot [4] | C [8]      Rd := Rn Op C

1110     001     0010    0                      1111   0011   0000   01100100
always           SUB     do not write           R15    R3            #100
                         status bits

R3 := R15 - 100
ADD r3, r15, #-100  is assembled as  SUB r3, r15, #100
The "ADD with negative constant" is turned into the equivalent SUB automatically by the assembler.
The assembler will do this conversion automatically

See next slide
Examples (Rd := Rn Op C)

1110     001     0110    1                      0100   0001   0000   00000011
always           SBC     write status bits      R4     R1            #3
                         N,Z,C,V

R1 := R4 - 4 + C
ADCS r1, r4, #-4  is assembled as  SBCS r1, r4, #3
The ADC is turned into the equivalent SBC automatically.

1110     001     1111    1                      0000       0001   0000   00000000
always           MVN     write status bits      not used   R1            C = #0
                         N,Z (why not C,V?)

R1 := -1
MOVS r1, #-1  is assembled as  MVNS r1, #0
Note that MVN negates bits; it is not two's complement negation.
S = 1 => N,Z status bits are written. C,V status bits are only written on arithmetic operations (with Rot = 0, as here).
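The three substitutions for negative literals can be checked numerically. The Python below is a sketch, not the assembler's actual algorithm: `rewrite_negative_literal` and `execute` are hypothetical helper names, only the three rewrites listed in these slides are handled, and the 8-bit range check on the literal is left out.

```python
MASK32 = 0xFFFFFFFF


def rewrite_negative_literal(op, literal):
    """Return an equivalent (op, literal) pair with a non-negative literal."""
    if literal >= 0:
        return op, literal
    n = -literal
    if op == "ADD":
        return "SUB", n          # ADD #-n  =>  SUB #n
    if op == "MOV":
        return "MVN", n - 1      # MOV #-n  =>  MVN #(n-1)
    if op == "ADC":
        return "SBC", n - 1      # ADC #-n  =>  SBC #(n-1)
    raise ValueError(f"no rewrite modelled for {op}")


def execute(op, rn, literal, carry=0):
    """Result of 'op Rd, Rn, #literal' on 32-bit values (Rn ignored by MOV/MVN)."""
    lit = literal & MASK32
    if op == "ADD":
        return (rn + lit) & MASK32
    if op == "SUB":
        return (rn - lit) & MASK32
    if op == "ADC":
        return (rn + lit + carry) & MASK32
    if op == "SBC":
        return (rn - lit + carry - 1) & MASK32
    if op == "MOV":
        return lit
    if op == "MVN":
        return ~lit & MASK32
    raise ValueError(op)


print(rewrite_negative_literal("ADD", -100))   # prints ('SUB', 100)
print(rewrite_negative_literal("ADC", -4))     # prints ('SBC', 3)
print(rewrite_negative_literal("MOV", -1))     # prints ('MVN', 0)
```

The (n-1) in the SBC case comes from the extra (C - 1) term in SBC: Rn - (n-1) + C - 1 = Rn - n + C.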
64-bit checksum
A checksum is often calculated to check that data has not been corrupted.
C = \sum_i d_i

In this example 8K bytes of data is stored in memory in a buffer pointed to by r2. Each 8 contiguous bytes (2 words) are interpreted as a 64 bit number d_i.

[Figure: the buffer as pairs of 32 bit words, each pair forming one 64 bit number (MSW, LSW): [R2+20], [R2+16]; [R2+12], [R2+8]; [R2+4], [R2]]
CHECKSUM64  MOV r3, #0       ; bits 31:0 of sum
            MOV r4, #0       ; bits 63:32 of sum
            MOV r6, #1024    ; set up loop counter
LOOP        LDR r0, [r2]     ; load 31:0 of next 64 bit word
            ADD r2, r2, #4   ; move r2 to MSW word
            LDR r1, [r2]     ; load 63:32 of it
            ADD r2, r2, #4   ; move r2 to next 64 bit word
            ADDS r3, r3, r0  ; 31:0 of 64 bit addition, set C
            ADC r4, r4, r1   ; add bits 63:32, with C
            SUBS r6, r6, #1  ; decr counter, set status bits on result
            BNE LOOP         ; if counter is not 0 add next 64 bits
Add 64 bit numbers (assume words are ordered so that the LSW has the lowest address):
  r2 -> current word
  r3, r4 -> 64 bit sum
  r6 -> count of the number of 64 bit words, down to 0
Auto-increment memory load (discussed later) would make the code much more efficient.
Note that the 64 bit result can overflow, because the carry out of the MSW addition is discarded.
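A Python model of the same loop makes the data flow, and the discarded carry, easy to experiment with. This is a sketch under the stated assumptions: the buffer is given as a list of 32-bit words with the LSW of each 64-bit number first, and the function name `checksum64` simply mirrors the label in the listing.

```python
MASK32 = 0xFFFFFFFF


def checksum64(words):
    """Sum pairs of 32-bit words (LSW first) as 64-bit numbers, modulo 2**64."""
    r3 = 0                                   # bits 31:0 of sum
    r4 = 0                                   # bits 63:32 of sum
    for i in range(0, len(words), 2):
        r0 = words[i]                        # LDR r0, [r2]  (LSW)
        r1 = words[i + 1]                    # LDR r1, [r2]  (MSW)
        total = r3 + r0                      # ADDS r3, r3, r0
        r3, carry = total & MASK32, total >> 32
        r4 = (r4 + r1 + carry) & MASK32      # ADC r4, r4, r1 (carry out dropped)
    return (r4 << 32) | r3


# Two 64-bit numbers: 0x00000000_FFFFFFFF and 0x00000000_00000001
print(hex(checksum64([0xFFFFFFFF, 0, 1, 0])))   # prints 0x100000000
```

Feeding in values whose MSWs sum past 32 bits shows the overflow noted above: the result wraps modulo 2^64.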
Arithmetic on real numbers

So far, we have concentrated on integer representations, signed or unsigned. There is an implicit binary point to the right:

[Figure: N bit word, bits N-1 to 0, with the implicit binary point to the right of bit 0]

In general, the binary point can be in the middle of the word (or off the end!). This is the FIXED POINT representation of fractional numbers.

[Figure: N bit word, bits N-1 to 0, with the binary point in the middle of the word]

Fixed point arithmetic requires no extra hardware: the binary point is in the mind of the programmer, like signed/unsigned.

Idea of floating point representation

Although fixed point representation can cope with numbers with fractions, the range of values that can be represented is still limited. Alternative: use the equivalent of scientific notation, but in binary:

number = s \times m \times 2^{e}    (s = sign, m = mantissa, e = exponent)

For example:
  10.5 in binary: 1010.1(2)
  Move the binary point 3 places to the left: 1.0101(2) x 2^3
  10.5 = 1.3125 x 8

Thus, by choosing the correct exponent, any number can be represented as a fixed point binary number multiplied by a power of two. Equivalently, the binary point is "floating".
IEEE-754 standard floating point

32-bit single precision floating point:

bit 31: s (sign) | bits 30-23: 8-bit exp | bits 22-0: 23-bit frac

x = (-1)^{s} \times 2^{exp-127} \times 1.frac
5.9 \times 10^{-39} < |x| < 3.4 \times 10^{38}

MSB s is the sign bit: 1 => negative
Exponent = exp - 127
  Note this gives exponent = [-127,127], and the special case exp = 255. (Why not exponent = 128?)
The MSB of the mantissa is ALWAYS 1, therefore it is not stored:
  mantissa = 1 + frac \times 2^{-23}    (mantissa = 1.frac)
Special cases which break this rule:
  exp field = 0, frac field = 0 => number is +/- 0
  exp field = 255, frac field = 0 => +/- infinity
  exp field = 255, frac field != 0 => NaN (invalid number)

IEEE-754 example

single precision: 1100 0000 0110 0000 0000 0000 0000 0000
  s = 1 (sign is -1)
  exp = 1000 0000(2) = 128
  fractional part of mantissa = .11(2)

The number above, C0600000(16), must have a negative sign.
Exponent = exp - 127 = 1, mantissa = 1 + 0.11(2) = 1.11(2)
-2^1 x 1.11(2) = -11.1(2) = -3.5
Note that the leading 1.0 is always added to frac.
Conversion to IEEE 754

17.5(10) = 10001.1(2) = 2^4 x 1.00011(2)
  exp = 4 + 127 = 131 = 10000011(2)
  frac = 00011000000000000000000(2)
  s = 0 (positive)
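Both worked examples can be reproduced in Python. This sketch uses the standard `struct` module to obtain the 32-bit pattern of a float; the function names are illustrative, and `value_of` deliberately handles only normal numbers (exp field neither 0 nor 255), following the formula given earlier.

```python
import struct


def decode_single(word):
    """Split a 32-bit pattern into its (s, exp, frac) fields."""
    s = word >> 31
    exp = (word >> 23) & 0xFF
    frac = word & 0x7FFFFF
    return s, exp, frac


def value_of(word):
    """x = (-1)^s * 2^(exp-127) * 1.frac, for normal numbers only."""
    s, exp, frac = decode_single(word)
    mantissa = 1 + frac * 2.0 ** -23
    return (-1) ** s * mantissa * 2.0 ** (exp - 127)


def encode_single(x):
    """32-bit IEEE-754 single precision pattern of a Python float."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word


print(decode_single(0xC0600000))          # prints (1, 128, 6291456)
print(value_of(0xC0600000))               # prints -3.5
print(f"{encode_single(17.5):08X}")       # prints 418C0000
```

The frac value 6291456 is 0x600000, i.e. binary .11 followed by zeros, as in the example.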
Floating point is typically handled by a floating point coprocessor (FPU), separate from the CPU. The ARM architecture has FPUs; see the latest ARM datasheets for more details. We will not consider FPU instructions in this course.

Lecture 6 - Data Transfer Instructions (Load/Store)

"Computer programmers don't byte, they nibble a bit" - Unknown

This lecture will examine in detail the ARM LOAD/STORE instructions.
  Multiple register load/store instructions will be dealt with separately, when we are discussing stacks.
The ARM architecture has:
  some clever tricks which mean that memory locations close to the PC can easily be accessed
  special support for sequential data access
Example block copy

Block memory copy.
A block of memory at address TABLE1 is copied to address TABLE2.
  The size of this block is 200(10) bytes (50 words).
  Both TABLE1 & TABLE2 are word-aligned (address divisible by 4).
  The copy operation can be implemented by moving words.

  TABLE1      ->  TABLE2
  TABLE1+4    ->  TABLE2+4
  TABLE1+8    ->  TABLE2+8
  ...
  TABLE1+192  ->  TABLE2+192
  TABLE1+196  ->  TABLE2+196

Block copy solutions

We look first at simple solutions to the block copy, using instructions which read and write fixed words in memory.
The block copy could be implemented like this, through a sequence of 50 sets of read/write instructions, each with different addresses.
  This is actually more efficient than using a loop, but it is not practical when many words are copied, because of the large amount of code.
Next we look at how the reads and writes can be made to variable locations (like an access to an array with a variable as index, a[i]), so that a loop can be used with a single read and write to copy all 50 words.
Data Transfer Instructions single register load/store instructions

Basic operation: use a value in one register (called the base register) as a memory address, and either load the data value from that address into a destination register or store the register value to memory:

LDR r0, [r1] ; r0 := mem32[r1]
STR r0, [r1] ; mem32[r1] := r0

This is called register-indirect addressing (AKA indexed). Here r1 is a memory pointer (AKA index register).
LDR r0, [r1] is a word transfer: r1 must be a word address (divisible by 4).

[Figure: CPU registers r1 = &1000, r0 = 117; memory words &1000: 117, &1004: 560, &1008: 100]

Data Transfer Instructions Set up the address pointer with ADR

We need to initialize the address in r1 in the first place. How?
ADR is a pseudo-instruction: it looks like a normal instruction, but it is actually an assembler directive.
  The assembler translates it to one or more real instructions.
  ADR sets a register to a (known and constant) address, i.e. it moves a constant value into a register.

This copies one word from TABLE1 to TABLE2:

copy    ADR r1, TABLE1 ; r1 points to TABLE1
        ADR r2, TABLE2 ; r2 points to TABLE2
        LDR r0, [r1]   ; load first word
        STR r0, [r2]   ; and store it in TABLE2
        .
        .
TABLE1                 ; <source of data>
TABLE2                 ; <destination of data>
Data Transfer Instructions ADR instruction

How does the ADR directive work?
An address is 32-bit, and it is difficult to put a 32-bit address value in a register in the first place (constants are 8 bit).
Solution: the Program Counter PC (r15) is often close to the required value.
  ADR r1, TABLE1 is translated into a data processing instruction that adds or subtracts a constant to PC (r15), and puts the result in r1.
  This constant is known as a PC-relative offset, and it is calculated as: addr_of_TABLE1 - (PC_value + 8)
  (+8 is because of hardware pipelining, see Part 3)

Data Transfer Instructions Moving multiple data items

Extend the copy program further to copy the NEXT word:

copy    ADR r1, TABLE1  ; r1 points to TABLE1
        ADR r2, TABLE2  ; r2 points to TABLE2
        LDR r0, [r1]    ; load first value
        STR r0, [r2]    ; and store it in TABLE2
        ADD r1, r1, #4  ; step r1 onto next word
        ADD r2, r2, #4  ; step r2 onto next word
        LDR r0, [r1]    ; load second value
        STR r0, [r2]    ; and store it
        ...

Simplify with base+offset addressing mode:

LDR r0, [r1, #4] ; r0 := mem32[r1 + 4]

Here r1 holds the base address, #4 is the offset, and r1 + 4 is the effective address.
Data Transfer Instructions base+offset

A simplified version of the last slide's code is:

copy    ADR r1, TABLE1    ; r1 points to TABLE1
        ADR r2, TABLE2    ; r2 points to TABLE2
        LDR r0, [r1]      ; load first value
        STR r0, [r2]      ; and store it in TABLE2
        LDR r0, [r1, #4]  ; load second value
        STR r0, [r2, #4]  ; and store it
        ...
Data Transfer Instructions base+offset with auto-indexing

Base+offset addressing does not change the base register (r1 & r2 here). Sometimes it is useful to modify the base register to point to the new address. This is achieved by adding a '!', and is base+offset addressing with auto-indexing:

LDR r0, [r1, #4]! ; r0 := mem32[r1 + 4]
                  ; r1 := r1 + 4

The '!' indicates that the instruction should update the base register after the data transfer.
  One instruction changes two registers.
  Useful in loops.
Data Transfer Instructions post-indexed addressing

Another useful form of the instruction is:

LDR r0, [r1], #4 ; r0 := mem32[r1]
                 ; r1 := r1 + 4

This is called post-indexed addressing: the base address is used without an offset as the transfer address, after which it is always modified. Using this, we can write the copy program as a loop:

copy    ADR r1, TABLE1    ; r1 points to TABLE1
        ADR r2, TABLE2    ; r2 points to TABLE2
        MOV r3, #50       ; r3 counts no. of words copied
loop    LDR r0, [r1], #4  ; get TABLE1 1st word
        STR r0, [r2], #4  ; copy it to TABLE2
                          ; r1, r2 are updated afterwards
        SUBS r3, r3, #1   ; decrement & set flags
        BNE loop          ; loop if not finished

Data Transfer Instructions register-indexed addressing

Sometimes it is useful to have a base register and a register offset:

LDR r0, [r1, r2] ; r0 := mem32[r1 + r2]

This is called register-indexed addressing: the index register is added to the base register to make the address. Using this, we can use fixed base registers and a single offset register which also counts the loop iterations:

copy    ADR r1, TABLE1    ; r1 points to TABLE1
        ADR r2, TABLE2    ; r2 points to TABLE2
        MOV r3, #0
loop    LDR r0, [r1, r3]  ; get TABLE1 1st word
        STR r0, [r2, r3]  ; copy it to TABLE2
        ADD r3, r3, #4    ; move to next word
        CMP r3, #200      ; if more, go back to loop
        BNE loop          ; if r3 != 200
        ...
TABLE1                    ; < source of data >
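The two copy loops can be traced with a small Python model in which memory is a dictionary from byte address to 32-bit word. This is only a sketch: the function names and the example addresses &1000 and &2000 are hypothetical, chosen for illustration.

```python
def block_copy_post_indexed(mem, table1, table2, words):
    """Model of the post-indexed loop: LDR r0,[r1],#4 / STR r0,[r2],#4."""
    r1, r2, r3 = table1, table2, words   # ADR r1 / ADR r2 / MOV r3
    while True:
        r0 = mem[r1]                     # LDR r0, [r1], #4: load, then...
        r1 += 4                          # ...step r1 onto the next word
        mem[r2] = r0                     # STR r0, [r2], #4: store, then...
        r2 += 4                          # ...step r2 onto the next word
        r3 -= 1                          # SUBS r3, r3, #1
        if r3 == 0:                      # BNE loop falls through when Z = 1
            break
    return r1, r2


def block_copy_register_indexed(mem, table1, table2, size_bytes):
    """Model of the register-indexed loop: r3 is both offset and counter."""
    r1, r2, r3 = table1, table2, 0
    while True:
        r0 = mem[r1 + r3]                # LDR r0, [r1, r3]
        mem[r2 + r3] = r0                # STR r0, [r2, r3]
        r3 += 4                          # ADD r3, r3, #4
        if r3 == size_bytes:             # CMP r3, #200 / BNE loop
            break
    return r3


memory = {0x1000 + 4 * i: 3 * i for i in range(50)}    # TABLE1 contents
end1, end2 = block_copy_post_indexed(memory, 0x1000, 0x2000, 50)
print(hex(end1), hex(end2))   # prints 0x10c8 0x20c8
```

Note how the post-indexed version leaves r1 and r2 pointing just past the tables, while the register-indexed version leaves the base registers unchanged.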
Data Transfer Instructions scaled register-indexed addressing

LDR r0, [r1, r2, lsl #n] ; r0 := mem32[r1 + (r2 left shift n)]

The second (index) register can have an optional shift, useful in this case so that it can count words (bytes*4) directly.
  In principle any of the shift modes (lsl, asl, asr, rrx) described in the next lecture can be used.
  lsl #n, used here, multiplies by a scale factor of 2^n.

copy    ADR r1, TABLE1            ; r1 points to TABLE1
        ADR r2, TABLE2            ; r2 points to TABLE2
        MOV r3, #0
loop    LDR r0, [r1, r3, lsl #2]  ; get TABLE1 1st word
        STR r0, [r2, r3, lsl #2]  ; copy it to TABLE2
        ADD r3, r3, #1            ; move to next word
        CMP r3, #50               ; if more, go back to loop
        BNE loop                  ; if r3 != 50
        ...
TABLE1                            ; < source of data >

ARM equivalent of direct addressing

Sometimes it is not necessary to load a base register (e.g. with ADR). The code below accesses TABLE1 & TABLE2 by computing the correct offset (as for ADR) and using PC as the base register. The assembly LDR r0, TABLE1 below is translated automatically into a load using PC as base with the correct offset, for example:

LDR r0, [r15, #&88]      ; &88 = &8090 - (&8000 + 8)

&8000   LDR r0, TABLE1   ; load using PC as base
        STR r0, TABLE2   ; store using PC as base
        .                ; will only work if TABLE1, TABLE2
        .                ; are within 4096 bytes of PC at
                         ; LDR, STR instructions
&8090   TABLE1

Because the value of R15 is known, this is effectively direct addressing, in a limited range close to PC.
  It does not use a normal base register, so it can't be used for auto-increment modes etc., which would change PC.
Benefits of PC = r15: pseudo-instructions

We see here two benefits of allowing PC to be a general purpose register (R15):
  Adding a constant number to PC can often be used to load a register with a memory address:
    ADR R0, TABLE  =>  ADD R0, R15, #offset
  Using PC offset addressing is equivalent to direct addressing:
    LDR R0, TABLE  =>  LDR R0, [R15, #offset]
These pseudo-instructions, the transformations, and the offset calculations are implemented by the assembler.
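The offset calculation the assembler performs for both pseudo-instructions can be sketched in Python. The function names are hypothetical, the example addresses (&8000 for the instruction, &8090 for the label) are assumptions for illustration, and the check that an ADR constant is actually encodable as an immediate is left out.

```python
def pc_relative_offset(instruction_address, target_address):
    """Offset from the PC value seen by the instruction (its address + 8)."""
    return target_address - (instruction_address + 8)


def translate_adr(rd, label_address, instruction_address):
    """ADR Rd, label  =>  ADD or SUB of a constant to R15."""
    offset = pc_relative_offset(instruction_address, label_address)
    if offset >= 0:
        return f"ADD R{rd}, R15, #{offset}"
    return f"SUB R{rd}, R15, #{-offset}"


def translate_ldr_label(rd, label_address, instruction_address):
    """LDR Rd, label  =>  load with R15 as the base register."""
    offset = pc_relative_offset(instruction_address, label_address)
    if abs(offset) >= 4096:
        raise ValueError("label is not within 4096 bytes of PC")
    sign = "-" if offset < 0 else ""
    return f"LDR R{rd}, [R15, #{sign}{abs(offset)}]"


print(translate_adr(0, 0x8090, 0x8000))         # prints ADD R0, R15, #136
print(translate_ldr_label(0, 0x8090, 0x8000))   # prints LDR R0, [R15, #136]
```

Here 136 decimal is &88, i.e. &8090 - (&8000 + 8). A label that precedes the instruction gives a negative offset, which ADR turns into a SUB.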
Data transfer encoding (to or from memory LDR,STR)

cond [4] | 0 1 | bit 25 | P | U | B | W | L | Rn [4] | Rd [4] | offset [12]
  bit 25 = 0: offset = S (12 bit constant)     Rd <-> mem[Rn+S]
  bit 25 = 1: offset = Shift [8], Rm [4]       Rd <-> mem[Rn+Rm*]   (Rm* = Rm after the optional shift)

Bit in word   0                                 1
P             use base register addressing      use indexed or offset address
              [Rn]                              [Rn+Rm], [Rn+S]
U             subtract offset [Rn-S]            add offset [Rn+S]
B             Word                              Byte
W             leave Rn unchanged if P=1         write indexed or offset address
                                                back into Rn if P=1
L             Store                             Load

NB: if P=0, W=0. If P=0, always write the offset address back into Rn.
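As a check on the P, U, B, W and L bits, the following Python sketch packs the immediate-offset form of LDR/STR. The function name and its keyword parameters are illustrative only; the register-offset form (bit 25 = 1) is not modelled.

```python
COND_ALWAYS = 0b1110


def encode_ldr_str_imm(load, rd, rn, offset=0, pre_indexed=True,
                       write_back=False, byte=False, cond=COND_ALWAYS):
    """Encode LDR/STR with a 12-bit constant offset S (bit 25 = 0)."""
    if not -4095 <= offset <= 4095:
        raise ValueError("offset does not fit in 12 bits")
    p = 1 if pre_indexed else 0                   # 0 => post-indexed [Rn], #S
    u = 1 if offset >= 0 else 0                   # add or subtract the offset
    b = 1 if byte else 0                          # LDRB / STRB
    w = 1 if (pre_indexed and write_back) else 0  # '!' ; W = 0 when P = 0
    l = 1 if load else 0                          # LDR or STR
    return (
        (cond << 28) | (0b010 << 25)
        | (p << 24) | (u << 23) | (b << 22) | (w << 21) | (l << 20)
        | (rn << 16) | (rd << 12) | abs(offset)
    )


print(f"{encode_ldr_str_imm(True, 0, 1, 4):08X}")                     # LDR r0, [r1, #4]   prints E5910004
print(f"{encode_ldr_str_imm(True, 0, 1, 4, write_back=True):08X}")    # LDR r0, [r1, #4]!  prints E5B10004
print(f"{encode_ldr_str_imm(True, 0, 1, 4, pre_indexed=False):08X}")  # LDR r0, [r1], #4   prints E4910004
```

Only the P and W bits differ between the three addressing modes shown.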
Data Transfer Instruction Assembly

The size of the data can be reduced to an 8-bit byte for any instruction:

LDRB r0, [r1] ; r0 := mem8[r1]
STRB r0, [r1] ; mem8[r1] := r0

In practice, most loops which access data sequentially can be simplified by using base+offset or post-indexed addressing, as appropriate, with auto-indexing.
Summary of addressing modes (replace LDR by STR for STORE):

LDRB r0, [r1]                  ; register-indirect addressing
LDRB r0, [r1, #offset]         ; base+offset addressing
LDRB r0, [r1, #offset]!        ; base+offset, auto-indexing
LDRB r0, [r1], #offset         ; post-indexed, auto-indexing
LDRB r0, [r1, r2]              ; register-indexed addressing
LDRB r0, [r1, r2, lsl #shift]  ; scaled register-indexed addressing
LDRB r0, address_label         ; PC relative addressing    (pseudo-instruction)
ADR  r0, address_label         ; load PC relative address  (pseudo-instruction)

Lecture 7 - Branches, Comparisons, Status Bits, and Conditional Execution

"When I hear somebody sigh, 'Life is hard,' I am always tempted to ask, 'Compared to what?'" - Sydney J. Harris

In the ARM ISA, "jumps", which change the value of PC, are called "branches".
The ARM ISA has a unique and clever way of dealing with conditional branches.
  Instead of having special instructions, ALL instructions are given an execution condition which determines whether they are executed or ignored.
  The condition is the top 4 bits of the instruction word.
  The always-true condition is used with most instructions to make their execution unconditional.
  A single branch instruction thus provides conditional and unconditional branches.
Branches
The basic branch instruction is:

        B label   ; unconditionally branch to label
        .
label

Conditional branch instructions can be used to control loops:

        MOV r0, #10     ; initialize loop counter r0
loop    .               ; start of body of loop
        .
        SUB r0, r0, #1  ; decrement loop counter
        CMP r0, #0      ; is it zero yet?
        BNE loop        ; branch if r0 != 0

Here the CMP instruction is a SUBTRACTION which gives no result EXCEPT possibly changing the status flags in the CPSR. Here we need to know that:
  If r0 = 0, then the Z bit is set (= '1'), else the Z bit is reset (= '0').
  Z controls the following BNE conditional branch instruction.

An example

Consider the pseudo-code:
  if (a = 1) then c := c+1 else d := d-1
This needs to be implemented using conditional branches or, as we will see, conditional execution. The first step is to assign registers to variables. We assume a = r0, c = r2, d = r3, and then the problem becomes:
  if (r0 = 1) then r2 := r2+1 else r3 := r3-1
To translate this pseudocode we need to use branches and conditional execution.

Example with branches

if (r0 = 1) then r2 := r2+1 else r3 := r3-1

EXAMPLE   CMP r0, #1       ; comparison
          BEQ THENPART     ; conditional branch
          ; else part
          SUB r3, r3, #1
          B ENDCODE
THENPART  ; then part
          ADD r2, r2, #1
ENDCODE
Comparison Operations
Here are ARM's register test operations:
CMP r1, r2 ; set NZCV on (r1 - r2)
CMN r1, r2 ; set NZCV on (r1 + r2)
TST r1, r2 ; set NZ on (r1 and r2)
TEQ r1, r2 ; set NZ on (r1 xor r2)

Results of the subtract, add, and, xor are NOT stored in any register, so the destination register Rd is not used.
Status flags in the CPSR are set or cleared by these instructions (you don't need the S).

Take the CMP r1, r2 instruction:
  N = 1  if MSB of (r1 - r2) is '1'                                   (BMI, BPL)
  Z = 1  if (r1 - r2) = 0                                             (BEQ, BNE)
  C = 1  if carry-out of the addition (r1 + NOT r2 + 1) is 1          (BCS, BCC)
  V = 1  if there is a two's complement overflow                      (BVS, BVC)
The S-bit
Explicit comparisons are not needed after a SUBS or ADDS:

        MOV r0, #10      ; initialize loop counter r0
loop    .                ; start of body of loop
        .
        SUBS r0, r0, #1  ; decrement loop counter AND set flags
        BNE loop         ; branch if r0 != 0

The SUBS instruction is the same as SUB except that the former updates the NZCV flags in the CPSR. After a SUBS instruction, the Z bit is set or cleared depending on the result of the subtraction, so CMP is not needed.
All data processing instructions can have S:
  EORS R0, R1, R2
  ANDS R0, R3, #0
  ADCS R0, R1, R2
CMP is identical to SUBS but with no destination, TEQ to EORS, etc.

ARM condition code field

BMI LABEL ; Branch to LABEL on MI condition

Conditional Execution
Conditional execution applies not only to branches, but to all ARM instructions. For example:

        CMP r0, #5        ; if (r0 >= 5) then
        BLO BYPASS
        ADD r1, r1, r0    ;   r1 := r1 + r0 - r2
        SUB r1, r1, r2
BYPASS  ..

Can be replaced by:

        CMP r0, #5        ; if (r0 >= 5) then
        ADDHS r1, r1, r0  ;   r1 := r1 + r0 - r2
        SUBHS r1, r1, r2
BYPASS  ..

Here the ADDHS and SUBHS instructions are executed only if C = 1, i.e. the CMP instruction gives R0 >= 5 (unsigned).

Using Condition Codes

The two letter condition code is appended to the 3 letter instruction op-code to make instruction execution conditional: MOVEQ, ADDPL, BCC, LDRMI, etc.
  The "always" condition, AL, may be omitted for (normal) unconditional execution.
Op-code suffixes (S for data processing instructions, B for LDR/STR) go after the condition code:
  ADDPLS, STRNEB, SBCCSS
Conditional Execution Replaces Branches

We have seen that IF-THEN-ELSE constructions in pseudocode turn into multiple branches in assembly. If the THEN and ELSE statements are short, branches can be avoided by using conditional execution. The same optimisation works for IF-THEN code if the THEN statement is short.

          CMP r0, #1
          BEQ THENPART
          ; else part
          SUB r3, r3, #1
          B ENDCODE         ; go to end
THENPART  ; then part
          ADD r2, r2, #1
ENDCODE   ; finished

becomes:

          CMP r0, #1
          SUBNE r3, r3, #1
          ADDEQ r2, r2, #1
          ; finished

Conditional Execution - more

Here is another very clever use of this unique feature of the ARM instruction set. ALL instructions can be qualified by the condition codes, including CMP!

; if ((a = b) and (c = d)) then e := e + 1
CMP r0, r1        ; r0 has a, r1 has b
CMPEQ r2, r3      ; r2 has c, r3 has d
ADDEQ r4, r4, #1  ; e := e+1

Note how, if the first comparison finds unequal operands, the second and third instructions are both skipped. Also, the logical 'and' in the if clause is implemented by making the second comparison conditional on the first.
Conditional execution is normally only efficient if the conditional sequence is three instructions or fewer. If the conditional sequence is longer, use branches.
BCC-BLO, BCS-BHS equivalence

The names of all the conditional branches only really make sense if they follow a CMP instruction.
  LO (lower), HS (higher or same) are used for unsigned numbers.
  The equivalents for signed numbers are LT (less than), GE (greater than or equal).
Remember that CMP is a SUB instruction without a destination:
  CMP r0, r1 => invert all bits in r1, add 1, and add to r0
  This computes u(R0) + (2^N - u(R1)), where u(R0) is the unsigned value of R0.
  There will be a carry out if this is >= 2^N, so:
    carry set <=> u(R0) + (2^N - u(R1)) >= 2^N <=> R0 >= R1 (unsigned)

  R0 - R1                =  R0 + (2^N - R1)
  00000020 - 00000002    =  00000020 + FFFFFFFE = 0000001E (& carry out)
  00000020 - 00000020    =  00000020 + FFFFFFE0 = 00000000 (& carry out)
  00000002 - 00000020    =  00000002 + FFFFFFE0 = FFFFFFE2 (no carry out)

The more complex cases

GE is the two's complement signed comparison greater than or equal to: r0 >= r1. Two cases:
  r0 - r1 is a positive result, no overflow => V=0, N=0
  r0 - r1 is a negative result, with overflow => V=1, N=1
    e.g. (8 bit) r0 = 127, r1 = -128. EXACT: 127 - (-128) = +255. 8 bit signed interpretation: -1 (so V=1, N=1)
V=0, N=0 or V=1, N=1 means r0 >= r1, so GE tests NOT(N XOR V).
Other conditions:
  LT (<) is NOT GE
  GT (>) is GE AND NOT EQ
  LE (<=) is LT OR EQ
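The flag rules above can be tried out with a short Python model of CMP. This is a sketch for experimentation: `cmp_flags` and `conditions` are illustrative names, and operands are treated as 32-bit words, so negative Python integers are masked to their two's complement bit pattern first.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(r0, r1):
    """N, Z, C, V after CMP r0, r1, computed as r0 + NOT(r1) + 1."""
    a = r0 & MASK32
    b = r1 & MASK32
    total = a + (~b & MASK32) + 1
    result = total & MASK32
    n = result >> 31                        # MSB of the result
    z = int(result == 0)
    c = total >> 32                         # carry out of the 32-bit addition
    # Signed overflow of a - b: operand signs differ and the result's sign
    # differs from the sign of a.
    v = ((a ^ b) & (a ^ result)) >> 31
    return n, z, c, v


def conditions(r0, r1):
    """Truth value of each inequality condition after CMP r0, r1."""
    n, z, c, v = cmp_flags(r0, r1)
    ge = n == v                             # GE tests NOT(N XOR V)
    return {
        "HS": c == 1, "LO": c == 0,
        "HI": c == 1 and z == 0, "LS": c == 0 or z == 1,
        "GE": ge, "LT": not ge,
        "GT": ge and z == 0, "LE": (not ge) or z == 1,
    }


print(cmp_flags(0x20, 0x02))                # prints (0, 0, 1, 0)
print(cmp_flags(0x02, 0x20))                # prints (1, 0, 0, 0)
print(cmp_flags(0x7FFFFFFF, 0x80000000))    # prints (1, 0, 0, 1)
```

The last line is the 32-bit analogue of the 127 - (-128) example: N = 1 and V = 1, so GE is still true. Comparing -1 with 1 shows why the choice of condition matters: HI is true (unsigned FFFFFFFF > 1) while LT is also true (signed -1 < 1).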
Inequality Conditions Summarised

ARM has the full set of signed and unsigned inequality conditions. They can be confusing. After a CMP or SUBS, if x, y are the two operands, the 8 possible inequalities are shown in the table below.
It is important to choose the correct condition if the test is to work for all inputs, even though for positive numbers signed and unsigned comparisons are identical.

Test     Signed                  Unsigned
x > y    GT  GreaTer             HI  HIgher
x >= y   GE  Greater or Equal    HS  Higher or Same (= CS)
x <= y   LE  Less or Equal       LS  Lower or Same
x < y    LT  Less Than           LO  LOwer (= CC)

Lecture 8: Bit manipulation - shifts etc.

"The best teachers have shown me that things have to be done bit by bit. Nothing that means anything happens quickly - we only think it does." - Joseph Bruchac

Individual bits can have separate meanings in assembly programs:
  Hardware registers where every bit is a separate flag
  Hardware registers where bit fields have specific meaning
Two types of operation help in manipulating bits:
  Shifts & rotates
  32 bit bitwise logical data processing instructions
Register Shifts
The key to manipulating bit fields (contiguous groups of bits) is the use of data shifts. ARM has a large collection of shifts available for the 2nd register operand of a data processing instruction.
  Shifts can be combined with arithmetic or bitwise logical operations in one instruction.

ADD r0, r1, r2, lsl #3
MOV r0, r1, lsr #11

Rd := Rn op (Rm shift by n) ; shift = lsl, lsr, asr, asl, ror, rrx
  0 <= n <= 31. RRX is a special case, only possible by 1 bit (n = 1).
  NOTE: Rm is not changed by the shift; the shifted value is used as the operand (op2 shifted).

ARM shift operations - LSL and LSR

Here are all the six possible ARM shift operations you can use:

LSL: logical shift left by 0 to 31 places; fill the vacated bits at the least significant end of the word with zeros.
  x LSL n = x * 2^n if no overflow
LSR: logical shift right by 0 to 31 places; fill the vacated bits at the most significant end of the word with zeros.
  x LSR n = x / 2^n if x is positive (integer division)

ARM shift ops - ASL and ASR

ASL: arithmetic shift left; this is the same as LSL.
ASR: arithmetic shift right by 0 to 31 places; fill the vacated bits at the most significant end of the word with zeros if the source operand was positive, and with ones if it is negative. That is, sign extend while shifting right.
  x ASR n = x / 2^n (x > 0)
  -x ASR n = -x / 2^n, rounding negatively (e.g. -x ASR 1 = -((x+1) / 2))

  x          3   2   1   0  -1  -2  -3  -4
  x asr 1    1   1   0   0  -1  -1  -2  -2

ARM rotate operations - ROR and RRX

ROR: rotate right by 0 to 31 places; the bits which fall off the least significant end are used to fill the vacated bits at the most significant end of the word. (ROL n = ROR 32-n)
RRX: rotate right extended by 1 place; the vacated bit (bit 31) is filled with the old value of the C flag and the operand is shifted one place to the right. This is effectively a 33 bit rotate using the register and the C flag.
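The shift and rotate definitions above translate directly into Python on 32-bit values. The functions below are a sketch with illustrative names; `to_signed` is a helper added here to print results as two's complement integers, and `rrx` returns the new carry as a second value, following the 33 bit rotate description.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


def lsl(x, n):
    return (x << n) & MASK32              # zeros enter at the LS end


def lsr(x, n):
    return (x & MASK32) >> n              # zeros enter at the MS end


def asr(x, n):
    return (to_signed(x & MASK32) >> n) & MASK32   # sign bit is replicated


def ror(x, n):
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def rrx(x, carry):
    """Rotate right by 1 through the C flag; returns (result, new_carry)."""
    x &= MASK32
    return (carry << 31) | (x >> 1), x & 1


print([to_signed(asr(x & MASK32, 1)) for x in (3, 2, 1, 0, -1, -2, -3, -4)])
# prints [1, 1, 0, 0, -1, -1, -2, -2]
print(lsl(11, 4))                          # prints 176
print(hex(ror(0x51, 22)))                  # prints 0x14400
```

The first line reproduces the x asr 1 table, including the rounding towards minus infinity for negative x. The last line is a rotate left by 10 written as ROR 32-10.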
Register-valued Shifts

    ADD r0, r1, r2, lsl r3 ; shift r2 by value of register r3. 4 regs!
    MOV r0, r1, asr r10    ; shift r1 by value of register r10

The number (n previously) of bits to shift can be variable and come from the value in a register, as above.
"Register-valued" shifts take two cycles to execute.

    MOV r0, r1, lsl r3

If r3 = 4 & r1 = 11 this will set r0 := 11*2^4.
This allows variable shifts, for example, to select bit n from a 32 bit register:

    ; r4 contains n
    ; result (r0) has bit n from r2 aligned with bit 0
    MOVS r0, r2, lsr r4

Rotation in immediate op2

    field:  cond | 0 0 1 | Op | S | Rn | Rd | Rot | C = Const
    bits:     4  | 1 1 1 | 4  | 1 | 4  | 4  |  4  |    8
    op1 = Rn, dest = Rd, op2 = immediate value (Rot and C)

Rd := Rn Op C'
All data processing instructions can have a rotated immediate operand.
C' = C rotated right (ROR) by 2r, where r is the unsigned value of the Rot field.

Rotation in immediate op2 (2)

The 12 bit immediate field is split into two parts, a 4 bit unsigned rotate number, r (0 <= r <= 15), and an 8 bit unsigned constant, C (0 <= C <= 255).
C' = C ROR 2r
Note that 2^(2x) * C = C ROL 2x = C ROR (32-2x) - easier to work out x first! (NB special case, x=0 => r=0, C'=C.)
Example: C = &51, x = 5 => r = 11, C' = 2^(2*5) * &51 = &14400

    C' (binary):  zero | 01010001 (non-zero field C) | 0000000000 (2*x zeros)
    instruction:  cond | 0 0 1 | Op | S | Rn | Rd | 1011 | 01010001

Common rotated immediate values

C' = 2^(2x) * C (r = (16-x) mod 16)
x = 0: any value in range 0 - 255
x = 1 => x4: any word address offset in range 0 - 1020 (e.g. in ADR pseudo-instruction)
Any single bit set (2^n)
How do you get constant 2^n for odd n?
In general any 8 bit binary field aligned on any even bit position is possible.
NB negative numbers use alternate instruction e.g. SUB not ADD
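The rotated immediate rule (C' = C ROR 2r) can be explored with a short Python sketch. The names `decode_imm` and `encode_imm` are hypothetical helpers written for this illustration, not assembler functions. `encode_imm` tries each of the 16 rotate values and returns the first (r, C) pair that reproduces the constant, or None when no pair exists.

```python
MASK32 = 0xFFFFFFFF

def ror(x, n):
    """Rotate the 32 bit word x right by n places."""
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def decode_imm(rot, c):
    """Value seen by the ALU: C' = C ROR 2r."""
    return ror(c, 2 * rot)

def encode_imm(value):
    """Return (rot, c) with decode_imm(rot, c) == value, or None."""
    value &= MASK32
    for rot in range(16):
        # Undo the rotation: rotating left by 2*rot must leave an 8 bit value.
        c = ror(value, (32 - 2 * rot) % 32)
        if c <= 0xFF:
            return rot, c
    return None

print(encode_imm(0x14400))      # the worked example: r = 11, C = &51
print(encode_imm(0x80000000))   # 2^31 (odd n) needs C = 2, not C = 1
print(encode_imm(0x102))        # None: the set bits do not fit one even-aligned 8 bit field
```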
Data Processing Instructions Bitwise Logical operations

Here are ARM's bit-wise logical operations:

    AND r0, r1, r2 ; r0 := r1 and r2 (bit-by-bit for 32 bits)
    ORR r0, r1, r2 ; r0 := r1 or r2
    EOR r0, r1, r2 ; r0 := r1 xor r2
    BIC r0, r1, r2 ; r0 := r1 and not r2

BIC stands for 'bit clear', where every '1' in the second operand clears the corresponding bit in the first:

    r1: 0101 0011 1010 1111 1101 1010 0110 1011
    r2: 1111 1111 1111 1111 0000 0000 0000 0000
    r0: 0000 0000 0000 0000 1101 1010 0110 1011

BIC allows immediate operands to be used to clear individual bits.

Example typical memory-mapped I/O

A/D convertor converts input voltage from up to 8 inputs into digital (unsigned) value.
LPC2138 A/D convertor data register AD0DR:

    bit 31    bit 30    bits 26:24    bits 15:6
    D         OV        CHN           DATA

Memory mapped as 32 bit word, read/write.
Read provides the 10 bit conversion output, 3 bit channel output, and other status info:

    D     Done      1 when conversion has finished
    OV    Overrun   1 if data from a conversion is not read before another conversion starts
    CHN   channel   which of the 8 possible inputs was converted
    DATA  10 bit binary data output (bit 15 is MSB, bit 6 is LSB)
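The four bitwise logical operations listed above (AND, ORR, EOR, BIC) can be modelled in Python on 32 bit words. This is a sketch with illustrative function names; the operand values are the r1 and r2 of the BIC example.

```python
MASK32 = 0xFFFFFFFF

def arm_and(a, b):
    return a & b & MASK32

def arm_orr(a, b):
    return (a | b) & MASK32

def arm_eor(a, b):
    return (a ^ b) & MASK32

def arm_bic(a, b):
    """Bit clear: every 1 in b clears the corresponding bit of a."""
    return a & ~b & MASK32

r1 = 0b0101_0011_1010_1111_1101_1010_0110_1011
r2 = 0b1111_1111_1111_1111_0000_0000_0000_0000

print(f"{arm_bic(r1, r2):032b}")  # top 16 bits cleared, bottom 16 bits of r1 kept
```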

Extracting a bit-field using AND

    AD0DR:  bit 31: D | bit 30: OV | bits 26:24: CHN | bits 15:6: DATA
    R0:     CHN (all other bits zero)

To extract only CHN bit field of AD0DR to R0:

    LDRL R0, AD0DR          ; get data into register
    AND  R0, R0, #&07000000 ; set all unwanted bits to 0

LDRL LABEL is like LDR LABEL, but LABEL can be anywhere in memory.

Extracting bit fields using LSL & LSR

Left shift by N of a number is the same as multiplying by 2^N.
Arithmetic right shift by N of a number is the same as dividing by 2^N and rounding negatively.
Logical right shift does the same for unsigned numbers.

Shifts can be used to extract bit fields. In a 32 bit word, bits n:m can be extracted and aligned with bit 0 by:

    left shift 31-n
    right shift (31-n)+m

Example, extracting bits 11:7:

    11101010100001110001001111110011
    00111111001100000000000000000000 (LSL 31-11 = 20)
    00000000000000000000000000000111 (LSR 31-11+7 = 27)
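Both extraction methods above can be sketched in Python. The helper names are illustrative. `extract_by_mask` models the AND approach and leaves the field in place, while `extract_by_shifts` models the LSL/LSR pair and aligns the field with bit 0.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    """Logical shift left, keeping only 32 bits."""
    return (x << n) & MASK32

def lsr(x, n):
    """Logical shift right of a 32 bit word."""
    return (x & MASK32) >> n

def extract_by_mask(word, n, m):
    """AND with a mask covering bits n:m; the field keeps its position."""
    mask = ((1 << (n - m + 1)) - 1) << m
    return word & mask

def extract_by_shifts(word, n, m):
    """LSL by 31-n, then LSR by (31-n)+m; the field ends up at bit 0."""
    return lsr(lsl(word, 31 - n), (31 - n) + m)

word = 0b11101010100001110001001111110011
print(f"{extract_by_shifts(word, 11, 7):05b}")  # bits 11:7 of the example word
```

For the DATA field of AD0DR the same call is `extract_by_shifts(value, 15, 6)`, which corresponds to the shifts lsl #16 and lsr #22.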
    AD0DR:  bit 31: D | bit 30: OV | bits 26:24: CHN | bits 15:6: DATA
    AD0CR:  bits 26:24: START | bits 19:17: CLKS | bit 16: B | bits 15:8: CLKDIV | bits 7:0: SEL

Extract the 10 bit DATA field: AD0DR(15:6)

    ADRL r1, AD0DR       ; load address
    LDR  r0, [r1]
    MOV  r0, r0, lsl #16
    MOV  r0, r0, lsr #22 ; R0 contains extracted DATA field

NB ADRL used when address is >4096 bytes from PC.

Write CLKDIV: AD0CR(15:8), from r3(7:0)

    ; r3 contains 8 bit value to be written
    ADRL r1, AD0CR
    LDR  r0, [r1]           ; load whole of AD0CR
    BIC  r0, r0, #&ff00     ; clear bits 15:8 (CLKDIV)
    ORR  r0, r0, r3, lsl #8 ; set 15:8 from r3(7:0)
    STR  r0, [r1]           ; store back to AD0CR

Multiplying by a (small) constant

Multiplying by 2^N is easy using a left shift. Other constants can be derived from this by using ADD or RSB as in the table below. 2,3,4,5,7,8,9, etc are all possible in this way.
Where possible this is preferable to using a MUL instruction because it is faster, does not require the immediate operand to be set up in a register, and is available on all architectures.

    r0 := 2^N r1        MOV r0, r1, lsl #n
    r0 := (2^N+1) r1    ADD r0, r1, r1, lsl #n
    r0 := (2^N-1) r1    RSB r0, r1, r1, lsl #n   (Note RSB not SUB)

    ADD r0, r0, r0, LSL #2 ; r0' := 5 x r0
    RSB r0, r0, r0, LSL #3 ; r0" := 7 x r0'

What does this multiply by?
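The shift-and-add patterns above can be checked numerically with a Python sketch. The function names are made up for this illustration, and results wrap to 32 bits as they would in a register.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    return (x << n) & MASK32

def times_pow2(r1, n):
    """MOV r0, r1, lsl #n"""
    return lsl(r1, n)

def times_pow2_plus_1(r1, n):
    """ADD r0, r1, r1, lsl #n"""
    return (r1 + lsl(r1, n)) & MASK32

def times_pow2_minus_1(r1, n):
    """RSB r0, r1, r1, lsl #n  (reverse subtract: shifted operand minus r1)"""
    return (lsl(r1, n) - r1) & MASK32

def two_step(r0):
    """The ADD ... LSL #2 followed by RSB ... LSL #3 pair."""
    r0 = times_pow2_plus_1(r0, 2)    # r0' := 5 x r0
    r0 = times_pow2_minus_1(r0, 3)   # r0'' := 7 x r0'
    return r0

print(two_step(1))  # the overall multiplier of the two-instruction sequence
```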

Lecture 9 Subroutines & Stacks

"Television is like the American toaster, you push the button and the same thing pops up everytime" Alfred Hitchcock

The subroutine is a key element in assembly language programs, allowing code reuse.
It is also the way that High Level Language procedures and functions are implemented.
Storage of data on a stack is an essential element of all modern computer programs and typically is done on subroutine entry & exit.
ARM has instructions to support subroutines and stacks.
This lecture will consider:

    Use of return addresses by subroutines
    Branch & link instruction
    Storing data on stacks in the ARM ISA
    Load & Store Multiple Registers instructions

Subroutines

Subroutines allow you to modularize your code so that it is more reusable. The general structure of a subroutine in a program is:

    MAIN  ......       ; main program
          BL SUB1      ; subroutine call
          ......

    SUB1  .....        ; subroutine
          MOV pc, R14

Branch & Link instruction

BL subroutine_name (Branch-and-Link) is the instruction to jump to subroutine. It performs the following operations:
1) It saves the PC value (which points to the next instruction) in r14. This is the return address.
2) It loads PC with the address of the subroutine. This performs a branch.
BL always uses r14 to store the return address. r14 is called the link register (can be referred to as lr or r14).
Return from subroutine is simple: just put r14 back into PC (r15).
Essential documentation for subroutines must describe:

    Inputs
    Outputs (if any)
    What subroutine does (other than compute outputs)
    Which registers it changes

Example

EXAMPLE: Subroutine to move n bytes (spaced one per word, e.g. at &1000, &1004, &1008, ...) into n contiguous bytes at a different position in memory (e.g. starting at &2000).

    PACK_BYTES
          ; Input: src=r0, dest=r1, n=r2
          ; loads LS bytes in words [R0],[R0+4], ..., [R0+4(n-1)]
          ; into contiguous bytes [R1],[R1+1],.....[R1+n-1]
          ; Changes r2,r3
          SUBS R2, R2, #1           ; n := n-1
          LDRB R3, [R0,R2, lsl #2]  ; load byte from [R0+4(n-1)]
          STRB R3, [R1,R2]          ; store it to [R1+n-1]
          BNE  PACK_BYTES
          MOV  pc, r14              ; return to caller
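The behaviour of PACK_BYTES can be sketched in Python, one statement per instruction. Assumptions of this model: memory is a `bytearray` indexed by byte address, memory is little-endian so the least significant byte of the word at address A is the byte at A, and n is at least 1 (the assembly has the same restriction, since it decrements before testing).

```python
def pack_bytes(mem, src, dest, n):
    """Copy the LS byte of n consecutive words at src into n contiguous bytes at dest."""
    r2 = n
    while True:
        r2 -= 1                    # SUBS R2, R2, #1
        r3 = mem[src + 4 * r2]     # LDRB R3, [R0, R2, lsl #2]
        mem[dest + r2] = r3        # STRB R3, [R1, R2]
        if r2 == 0:                # BNE falls through once SUBS has set Z
            break
    return mem

# Four words at address 0, each holding one useful byte in its LS position.
mem = bytearray(64)
for k in range(4):
    mem[4 * k] = 0x10 + k      # LS byte of word k
    mem[4 * k + 1] = 0xEE      # other bytes are ignored by the routine
pack_bytes(mem, 0, 32, 4)
print(mem[32:36].hex())
```

Like the assembly, the model works from the last byte back to the first, which lets the loop counter double as the index.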
Calling the subroutine from the main program:

    MAIN  ADR R0, TAB1   ; set up subroutine inputs
          ADR R1, TAB2
          MOV R2, #100
          BL  PACK_BYTES ; call the subroutine

Nested Subroutines

    SUB1  ...
          BL SUB2
    SUB2  ...
          BL SUB3
    SUB3  ...
          X

When executing at "X" the nested subroutines SUB1, SUB2, SUB3 are all active.
Nested Subroutines

Since the return address is held in register r14, you should not call a further subroutine without first saving r14. How do you achieve this goal?
Could use separate storage for each subroutine: SUB1 -> store 1, SUB2 -> store 2, SUB3 -> store 3.
Problem: storage needed scales with number of subroutines. Typically may have 1000s of subroutines, which means 1000s of separate storage locations.
The number of subroutines active at any time (nested) is much smaller than the total number, typically less than 10. This motivates use of a stack: an area of memory which is shared for storage by subroutines.
Can store all registers changed by subroutine on stack, not just R14.

The idea of a STACK

A stack is a portion of main memory used to store data temporarily, so that the memory can be shared between different items at different times.
A PUSH operation stores a number of registers onto the stack memory. r13 is called the stack pointer SP.

    PUSH {r1, r3-r5, r14}

[Figure: memory before and after PUSH. After the push, r14, r5, r4, r3, r1 occupy consecutive words from high to low addresses, and r13 has moved from the high end to the low end.]
Nested Subroutines using stack

    SUB1  BL SUB2
    SUB2  BL SUB3
    SUB3  X

    Stack Memory (downwards growing):
        SUB1 data
        SUB2 data
        SUB3 data
        empty       <- stack pointer at X

When executing "X" the nested subroutines SUB1, SUB2, SUB3 are all active.

PUSH R14 onto stack: method 1

    mem32[R13] := R14
    R13 := R13-4

    STR R14, [R13], #-4

    Before PUSH                    After PUSH
    &134C  stored item             &134C  stored item
    &1348  stored item             &1348  stored item
    &1344              <- R13      &1344  stored R14
    &1340                          &1340              <- R13

Would need one STR instruction for each item pushed (and one LDR for each item popped)...
PUSHing onto a Stack: multiple registers

ARM has a single instruction which transfers multiple registers to a stack and implements PUSH this way:

    STMED r13!, {r1, r3-r5, r14} ; Push r1, r3-r5, r14 onto stack
                                 ; Stack grows down in mem
                                 ; r13 points to next empty loc.

Note the following properties of this ARM PUSH operation:

    r13 is used as the address pointer. We call this the STACK POINTER (SP). We could have used any other register (except r15) as SP, but it is good practice to use r13 unless there is a good reason not to do so.
    This stack grows down through decreasing memory address.
    The base register points to the first empty location of the stack. To store values in memory, the SP is decremented after it is used.

STMED vs STR

    STMED R13!, {R14}     stack pointer first, then list of one or more data registers; offset is calculated and added after operation
    STR R14, [R13], #-4   data register first, then stack pointer; offset is explicitly written and added to SP after operation

These two instructions look different but do the same thing with one register.
STMED can be used with any number of registers.
STMED is conventionally used for stacks even when only a single transfer is needed.
POP operation

The complementary operation of PUSH is the POP operation.

    POP {r1, r3-r5, r14}

This is equivalent to the ARM instruction:

    LDMED r13!, {r1, r3-r5, r14} ; Pop r1, r3-r5, r14 from stack

[Figure: memory before and after POP. Before: r14, r5, r4, r3, r1 are stored from high to low addresses, with r13 just below them. After: r13 has moved back up to the high end, and the old values (r14) ... (r1) are no longer part of the stack.]

Multiple Stack Operations

A stack operates as a Last-In-First-Out (LIFO) memory:

    PUSH A             A stored
    PUSH B             B, A stored
    PUSH C             C, B, A stored
    POP (returns C)    B, A stored
    POP (returns B)    A stored
    POP (returns A)    empty

Nested subroutines will each PUSH and then POP their registers at the same level (all PUSHes & POPs from subroutine calls will balance) so this will work.
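The PUSH and POP described above can be modelled with a small Python class. This is a sketch under stated assumptions: the class name `EmptyDescendingStack` is made up for illustration, memory is a dictionary from word address to value, and registers are a dictionary from register number to value. The ordering follows the ARM rule that the lowest numbered register always uses the lowest address.

```python
class EmptyDescendingStack:
    """Model of an empty descending stack as used by STMED / LDMED."""

    def __init__(self, sp):
        self.mem = {}   # word address -> stored value
        self.sp = sp    # r13: address of the first empty location

    def push(self, regs, reg_numbers):
        """STMED r13!, {...}: store at the empty location, then move SP down."""
        for num in sorted(reg_numbers, reverse=True):   # highest register first
            self.mem[self.sp] = regs[num]
            self.sp -= 4

    def pop(self, regs, reg_numbers):
        """LDMED r13!, {...}: move SP up, then load from that location."""
        for num in sorted(reg_numbers):                 # lowest register first
            self.sp += 4
            regs[num] = self.mem.pop(self.sp)

stack = EmptyDescendingStack(sp=0x1344)
regs = {n: 100 + n for n in range(16)}
stack.push(regs, [14])                 # same effect as STR R14, [R13], #-4
print(hex(stack.sp), stack.mem[0x1344])
```

Pushing a list and then popping the same list leaves SP where it started, which is the balance property that makes nested subroutine calls work.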

Preserve things inside subroutine with STACK

          BL    SUB1
          ..
    SUB1  STMED r13!, {r0-r2, r14} ; push work & link registers
          .
          BL    SUB2               ; jump to a nested subroutine
          LDMED r13!, {r0-r2, r14} ; pop work & link registers
          MOV   pc, r14            ; return to calling program

[Figure: stack on entry to SUB1 and on return from SUB1. STMED r13!, {r0-r2, r14} stores r14, r2, r1, r0 from high to low addresses and moves SP down from r13 to r13'. LDMED r13!, {r0-r2, r14} restores the registers and moves SP back up to r13.]

    ; Input: r0
    ; Output: r1=1 if odd parity (xor of all 32 bits), otherwise 0
    ; preserves value of r2 on stack
          STMED r13!, {r2}          ; save registers, why not r1?
          MOV   r2, #31
          MOV   r1, #0
    LOOP  EOR   r1, r0, r1, ror #1
          SUBS  r2, r2, #1
          BPL   LOOP                ; loop 32 times
          AND   r1, r1, #1
          LDMED r13!, {r2}          ; restore registers
          MOV   pc, r14             ; return to caller
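The parity loop above is compact enough that it helps to run it. The Python sketch below mirrors the EOR with a rotated operand. After 32 passes r1 holds the XOR of all 32 rotations of r0, so every bit of r1, including bit 0, is the XOR of all bits of r0. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def ror(x, n):
    """Rotate the 32 bit word x right by n places."""
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def parity(r0):
    """Return 1 if r0 has an odd number of set bits, otherwise 0."""
    r1 = 0                       # MOV r1, #0
    for _ in range(32):          # r2 counts down from 31; BPL loops 32 times
        r1 = r0 ^ ror(r1, 1)     # EOR r1, r0, r1, ror #1
    return r1 & 1                # AND r1, r1, #1

print(parity(0b0111), parity(0b0011))
```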
Optimising subroutine entry/exit

The usual case is for a subroutine which calls other subroutines, and so which saves and restores registers including R14, the return address.

    STMED r13!, {r0,r1,r2,r14}
    .
    LDMED r13!, {r0,r1,r2,r14}
    MOV   pc, r14 ; return to caller

In this case the subroutine exit can be optimised by restoring r14 directly to the PC, r15:

    STMED r13!, {r0,r1,r2,r14}
    .
    LDMED r13!, {r0,r1,r2,pc} ; return to caller

Note that it is important NOT to include both r14 & r15 in the LDMED register list - which would be one too many POPs!

Effect on stack of subroutine nesting

SUBX (1) calls SUBY (2). The arrangement of storage on the stack when inside SUBY is as follows:

    SUBX  STMED r13!, {R14}
          BL    SUBY
          .......
          LDMED r13!, {pc}

    SUBY  STMED r13!, {r0,r1,r2}
          .....
          LDMED r13!, {r0,r1,r2}
          MOV   pc, r14

    Stack (downwards growing). Base of stack is highest location.

        Rest of stack
        SUBX caller return address (r14)    Stack frame (1) SUBX    <- stack pointer before SUBX
        r2                                  Stack frame (2) SUBY    <- stack pointer inside SUBX
        r1
        r0
        (empty)                                                     <- stack pointer inside SUBY

    Top of stack is SP+4 (lowest location).
ARM PUSH instructions

STMED implements Descending stack, with SP pointing to Empty location.
Stacks can be Ascending or Descending.
SP can point to Full location (last item PUSHED) or Empty location (first space available to PUSH next item).

    STMED - Empty location, Descending stack
    STMEA - Empty location, Ascending stack
    STMFD - Full location, Descending stack
    STMFA - Full location, Ascending stack

LDMED (pop) matches STMED (push) etc.

Other uses of LDM/STM

LDM,STM can work with any register being SP, not just R13.
Can move block of memory by setting up SP1, SP2, POP from SP1, PUSH to SP2.
Faster than loop with LDR/STR.
The 4 types of stack POP & PUSH have different mnemonics (for convenience) when used for general data movement like this. It does not matter which mnemonic you use: LDMED & LDMIB are the same instruction. (Alternative names for LDM instructions!)
Example of using Load/Store Multiple

Here is an example to move 8 words from a source memory location to a destination memory location:

    ADR   r0, src_addr   ; initialize src addr
    ADR   r1, dest_addr  ; initialize dest addr
    LDMIA r0!, {r2-r9}   ; fetch 8 words from mem, r0 := r0+32
    STMIA r1!, {r2-r9}   ; copy 8 words to mem, r1 := r1+32

When using LDMIA and STMIA instructions, you:

    INCREMENT the address in memory to load/store your data
    the increment of the address occurs AFTER the address is used.

In fact, one could use 4 different forms of load/store (see next slide):

    Increment - After     LDMIA and STMIA
    Increment - Before    LDMIB and STMIB
    Decrement - After     LDMDA and STMDA
    Decrement - Before    LDMDB and STMDB
The four variations of the STM instruction

Higher register numbers stored or loaded to/from higher addresses, always.

Optional update of base address register with Load/Store Multiple Instructions

So far the base address register, r1 below, has always been updated. You can choose NOT to update this pointer register by removing the "!". All variants of LDM/STM have optional base register update.

    LDMIA r1, {r2-r9}    ; r2 := mem32[r1]
                         ; .
                         ; r9 := mem32[r1+28]

    LDMIA r1!, {r2-r9}   ; r2 := mem32[r1]
                         ; .
                         ; r9 := mem32[r1+28]
                         ; r1 := r1 + 32 (8 registers)
                         ; "!" indicates r1 is changed
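The four addressing variants and the optional "!" can be modelled in Python. This is a sketch: the function `ldm` and its arguments are illustrative, and memory is a dictionary from word address to value. In every variant the lowest numbered register is loaded from the lowest address of the block.

```python
def ldm(mem, base, reg_numbers, mode="IA", writeback=False):
    """Model LDMIA / LDMIB / LDMDA / LDMDB.

    Returns (dict of loaded registers, new value of the base register)."""
    n = len(reg_numbers)
    if mode == "IA":            # increment after
        start = base
    elif mode == "IB":          # increment before
        start = base + 4
    elif mode == "DA":          # decrement after
        start = base - 4 * n + 4
    elif mode == "DB":          # decrement before
        start = base - 4 * n
    else:
        raise ValueError("mode must be IA, IB, DA or DB")
    loaded = {}
    for i, num in enumerate(sorted(reg_numbers)):
        loaded[num] = mem[start + 4 * i]
    step = 4 * n if mode[0] == "I" else -4 * n
    return loaded, (base + step if writeback else base)

mem = {0x100 + 4 * i: i for i in range(16)}
regs, r1 = ldm(mem, 0x100, range(2, 10), "IA", writeback=True)
print(regs[2], regs[9], hex(r1))   # first word, eighth word, base advanced by 32
```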

Multiple register transfer instructions

Register list has one bit per register: bit 0 = 1 => load/store r0; bit 1 = 1 => load/store r1; etc.

    STMIA r13!, {r0-r2, r14}    machine code: E8AD 4007

Lecture 10: Miscellaneous

    Multiplication
    Overview of machine instructions
    Machine instruction timing
ARM Multiply instructions

The original ARM 1 architecture did not have multiply instructions.
32X32->32 bit (least significant 32 bits of result kept) was added for ARM 3 and above.
32X32->64 multiplication was added for ARM7DM and above.
The multiplications were shoe-horned into the data processing instructions, using bit combinations specifying shifts that were previously unused and illegal.

ARM3 and above:

    MUL   rd, rm, rs        multiply (32 bit)        Rd := (Rm*Rs)[31:0]
    MLA   rd, rm, rs, rn    multiply-acc (32 bit)    Rd := (Rm*Rs)[31:0] + Rn

ARM7DM core and above (64 bit multiply result):

    UMULL rl, rh, rm, rs    unsigned multiply        (Rh:Rl) := Rm*Rs
    UMLAL rl, rh, rm, rs    unsigned multiply-acc    (Rh:Rl) := (Rh:Rl)+Rm*Rs
    SMULL rl, rh, rm, rs    signed multiply          (Rh:Rl) := Rm*Rs
    SMLAL rl, rh, rm, rs    signed multiply-acc      (Rh:Rl) := (Rh:Rl)+Rm*Rs

Multiply in detail

MUL, MLA were the original (32 bit LSW result) instructions:

    Why does it not matter whether they are signed or unsigned?
    Register operands only
    No constants, no shifts
    NB d & m must be different for MUL, MLA

Later architectures added 64 bit results.
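The question of why signedness does not matter for MUL can be explored with a Python sketch. The helper names are illustrative. The low 32 bits of a product depend only on the low 32 bits of the operands, so signed and unsigned readings agree there, while the high word of a 64 bit product differs, which is why separate UMULL and SMULL instructions exist.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(word):
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word

def mul(rm, rs):
    """MUL: least significant 32 bits of the product."""
    return (rm * rs) & MASK32

def umull(rm, rs):
    """Unsigned 64 bit product, returned as (low word, high word)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, (product >> 32) & MASK32

def smull(rm, rs):
    """Signed 64 bit product, returned as (low word, high word)."""
    product = to_signed32(rm) * to_signed32(rs)
    return product & MASK32, (product >> 32) & MASK32

a, b = 0xFFFFFFFF, 2          # a is 4294967295 unsigned, or -1 signed
print(hex(mul(a, b)), hex(mul(to_signed32(a), b)))   # same low word either way
print([hex(w) for w in umull(a, b)], [hex(w) for w in smull(a, b)])
```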

Example of using ARM Multiplier

This calculates a 64 bit scalar product of two signed vectors, each 20 words long:

    z = sum over j = 0..19 of (xj * yj)

    r8 and r9 point to the two vectors xj and yj
    r11 is the loop counter
    r7:r6 stores the result

          MOV   r11, #20        ; initialize loop counter
          MOV   r7, #0          ; initialize 64 bit total
          MOV   r6, #0
    LOOP  LDR   r0, [r8], #4    ; get x component
          LDR   r1, [r9], #4    ; . and y component
          SMLAL r6, r7, r0, r1  ; accumulate product
          SUBS  r11, r11, #1    ; decrement loop counter
          BNE   LOOP            ; loop 20 times

ARM Machine Instruction Overview (1)

Data processing (ADD,SUB,CMP,MOV)

    cond | 0 0 0 | Op | Rn | Rd | Shift | Rm     Rd := Rn Op Rm*
    cond | 0 0 1 | Op | Rn | Rd | S              Rd := Rn Op S
    Op = ALU operation; Rm* = Rm with optional shift
    multiply instructions are special case

Data transfer (to or from memory LDR,STR)

    cond | 0 1 0 | Trans | Rn | Rd | S            Rd <-> mem[Rn+S]
    cond | 0 1 1 | Trans | Rn | Rd | Shift | Rm   Rd <-> mem[Rn+Rm*]
    Trans = byte/word, load/store, etc

Multiple register transfer

    cond | 1 0 0 | Type | Rn | Register list
    Transfer registers to/from stack
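The multiple register transfer format above can be checked against the STMIA r13!, {r0-r2, r14} example given earlier, whose machine code the slides quote as E8AD 4007. The Python sketch below assembles the word field by field. The slides only name a "Type" field; the split of that field into P, U, S, W and L bits (pre/post, up/down, S bit, writeback, load/store) follows the ARM architecture definition and is not spelled out in the slides, and the function name is made up for this illustration.

```python
def encode_block_transfer(cond, p, u, s, w, l, rn, reg_numbers):
    """Assemble an LDM/STM machine word from its fields."""
    reglist = 0
    for num in reg_numbers:
        reglist |= 1 << num          # one bit per register, bit n selects rn
    return ((cond << 28) | (0b100 << 25) |
            (p << 24) | (u << 23) | (s << 22) | (w << 21) | (l << 20) |
            (rn << 16) | reglist)

# STMIA r13!, {r0-r2, r14}: always (cond = 1110), post-indexed (P=0),
# upwards (U=1), writeback (W=1), store (L=0), base register r13.
word = encode_block_transfer(0b1110, 0, 1, 0, 1, 0, 13, [0, 1, 2, 14])
print(f"{word:08X}")
```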

Overview (2)

Branch B, BL, BNE, BMI

    cond | 1 0 1 | L | S      PC := PC+S
    L = 0 => Branch, B ...
    L = 1 => Branch and link (R14 := PC), BL ...

coprocessor interface

    cond | 1 1 0 0
    cond | 1 1 0 1
    cond | 1 1 1 0

Software Interrupt (SWI)

    cond | 1 1 1 1 | S
    Simulate hardware interrupt: S is passed to handler

ARM Instruction Timing

Exact instruction timing is very complex and depends in general on memory cycle times which are system dependent. The table below gives an approximate guide.

    Instruction                                             Typical execution time (cycles)
    Any instruction, with condition false                   1
    data processing (all except register-valued shifts)     1 (+3 if PC is dest)
    data processing (register-valued shifts):
        MOV R1, R2, lsl R3                                  2 (+3 if PC is dest)
    LDR, LDRB, STR, STRB                                    4
    LDM (n registers)                                       n+3 (+3 more if PC is loaded)
    STM (n registers)                                       n+3
    B, BL                                                   4
    Multiply                                                7-14
Instruction Timing Notes

Most instructions take 1 cycle - RISC
Memory reference takes longer (4 cycles typically)
Branch takes longer (4 cycles)
    Writing to PC => branch
ALL instructions take 1 cycle if not executed (condition false)
"register-valued shift" is special case: 2 cycles
    Make sure you know what a register-valued shift is!
Multiply takes a lot longer, though exact timing depends on data and also on ARM core - later cores have more efficient hardware multiply
Instruction timing is hardware-dependent. Not part of Instruction Set Architecture
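As a rough illustration of how the approximate cycle counts above can be used, the Python sketch below compares two ways of copying 8 words: one LDMIA/STMIA pair, and a hypothetical word-at-a-time loop of LDR, STR, SUBS, BNE (that loop is an assumption made for this comparison, not code from the slides). The figures are the guide values from the timing table, so the result is an estimate only.

```python
# Approximate cycle counts taken from the timing table; not exact for any core.
DATA_PROCESSING = 1
LDR_STR = 4
BRANCH_TAKEN = 4
CONDITION_FALSE = 1

def ldm_cycles(n, loads_pc=False):
    return n + 3 + (3 if loads_pc else 0)

def stm_cycles(n):
    return n + 3

def copy_with_ldm_stm(words):
    """One LDMIA followed by one STMIA."""
    return ldm_cycles(words) + stm_cycles(words)

def copy_with_loop(words):
    """Hypothetical loop: LDR, STR, SUBS, BNE for each word."""
    per_word = LDR_STR + LDR_STR + DATA_PROCESSING
    taken = (words - 1) * BRANCH_TAKEN      # BNE is taken on all but the last pass
    return words * per_word + taken + CONDITION_FALSE

print(copy_with_ldm_stm(8), copy_with_loop(8))
```

Under these assumptions the multiple register transfer version needs roughly a fifth of the cycles, which is consistent with the earlier remark that LDM/STM is faster than a loop with LDR/STR.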
