Arm Instruction 2 - 001


PART 2

ARM Assembly Language

Part 2 - Contents

Lecture 5 - ARM data processing
  - ARM data processing instructions in detail
  - ARM status flags and tests

Lecture 6 - ARM memory access
  - Addressing modes for LDR/STR
  - Assembler pseudo-instructions

Lecture 7 - Execution conditions & branches
  - Comparison & test instructions
  - Signed and unsigned comparison
  - ARM tips & tricks

Lecture 8 - Processing parts of a word: shifts, rotates, & logical operations
  - Example bit-manipulation problems
  - ARM instructions: bitwise logical & shifts

Lecture 9 - Subroutines, return addresses, and stacks
  - Why use subroutines?
  - Why use stacks?
  - Implementing stacks on ARM
  - ARM multiple register transfer instructions

ARM simulator & debugger

[Figure: EDSAC simulator (written by Martin Campbell-Kelly, Univ. Warwick), which reproduces the original EDSAC control panel. EDSAC was the first computer to be programmed in an assembly language. The assembler was 41 instructions long!]


tjwc - 2-Dec-10

ISE1/EE2 Introduction to Computer Architecture

2.1


Lecture 5 Data Processing: ARM implementation

"...and then the different branches of Arithmetic: Ambition, Distraction, Uglification, and Derision." The Mock Turtle (Lewis Carroll)

Arithmetic is the most complex data processing operation at assembly language level. ARM implements 32-bit addition and subtraction. Longer calculations must make appropriate use of carries. We will look at:
  - ARM data processing arithmetic & logical instructions
  - Use of immediate operands in data processing instructions
  - Simple examples

Data processing (ADD, SUB, AND, CMP, MOV, etc.)

Instruction word layout, most significant bit first (field widths in bits in square brackets):

    op2 = immediate value C:   cond[4] 0 0 1 Op[4] S[1] Rn[4] Rd[4] Rot[4] C[8]       Rd := Rn Op C
    op2 = register Rm:         cond[4] 0 0 0 Op[4] S[1] Rn[4] Rd[4] Shift[8] Rm[4]    Rd := Rn Op Rm

  - Op selects the ALU operation: dest := op1 op op2
  - S bit = 1 => status bits are written; S bit = 0 => status bits unchanged
  - The first operand, op1, is always register Rn; the result, dest, goes to register Rd
  - The second operand, op2, is either a constant C or register Rm
  - This lecture: assume Shift = 0, Rot = 0, for unshifted Rm or immediate constant C

ARM data processing instructions

    Op    Assembly           Operation                Pseudocode
    0000  AND Rd, Rn, op2    Bitwise logical AND      Rd := Rn AND op2
    0001  EOR Rd, Rn, op2    Bitwise logical XOR      Rd := Rn XOR op2
    0010  SUB Rd, Rn, op2    Subtract                 Rd := Rn - op2
    0011  RSB Rd, Rn, op2    Reverse subtract         Rd := op2 - Rn
    0100  ADD Rd, Rn, op2    Add                      Rd := Rn + op2
    0101  ADC Rd, Rn, op2    Add with carry           Rd := Rn + op2 + C
    0110  SBC Rd, Rn, op2    Subtract with carry      Rd := Rn - op2 + C - 1
    0111  RSC Rd, Rn, op2    Reverse sub with carry   Rd := op2 - Rn + C - 1
    1100  ORR Rd, Rn, op2    Bitwise logical OR       Rd := Rn OR op2
    1101  MOV Rd, op2        Move                     Rd := op2
    1110  BIC Rd, Rn, op2    Bitwise clear            Rd := Rn AND NOT op2
    1111  MVN Rd, op2        Bitwise move negated     Rd := NOT op2

  - Here are the move, arithmetic, and bitwise logical data processing instructions.
  - The operations with carry allow multi-word addition and subtraction.
  - MOV and MVN do not use Rn; Rn should be set to 0 in the instruction word.

Example

    cond[4] 0 0 0 Op[4] S[1] Rn[4] Rd[4] Shift[8]   Rm[4]      Rd := Rn Op Rm
    1110    0 0 0 0100  0    0001  0000  0000 0000  0010       R0 := R1 + R2

  - Op = 0100 (ADD)
  - Cond = 1110 (always)
  - Rd = 0 (R0)
  - Rn = 1 (R1)
  - Rm = Op2 = 2 (R2)
  - S = 0 (don't write status bits)
  - Use the assembler: there is no need to worry about the precise bit format above. Just write
        ADD R0, R1, R2

Data Processing Instructions

Rules that apply to ARM data processing instructions:
  - All operands are 32 bits, and come either from registers or are specified as constants (called literals) in the instruction itself
  - The result is also 32 bits and is placed in a register
  - 3 operands: 2 for inputs and 1 for the result (usually)

Example:
    SUB r0, r1, r2    ; r0 := r1 - r2
  - Works for both unsigned and 2's complement signed numbers
  - Note that source registers are unchanged (unless dest = source)

Result register can be the same as an input operand register:
    ADD r0, r0, r0    ; doubles the value in r0!

Data Processing Instructions Arithmetic operations

    ADD r0, r1, r2    ; r0 := r1 + r2
    ADC r0, r1, r2    ; r0 := r1 + r2 + C
    SUB r0, r1, r2    ; r0 := r1 - r2
    SBC r0, r1, r2    ; r0 := r1 - r2 + (C - 1)
    RSB r0, r1, r2    ; r0 := r2 - r1
    RSC r0, r1, r2    ; r0 := r2 - r1 + (C - 1)

  - RSB stands for reverse subtraction
  - Operands & result may be interpreted as unsigned or 2's complement signed integers.
  - 'C' is the carry (C) status bit in the CPSR
  - In subtraction the carry acts as a "borrow" (0 or 1), hence C - 1
  - ADC, SBC, and RSC are used to operate on data more than 32 bits long in 32-bit chunks
  - RSB and RSC are useful, instead of SUB and SBC with r1, r2 reversed, because r2 can be any of the Op2 variants, see later.

SBC, RSC

Some of you are probably wondering: why (C - 1) in subtraction?
  - The negation needed by subtraction is implemented in hardware by a bitwise NOT and an addition with carry-in C = 1. Thus C = 0 has the effect of subtracting an extra 1, and C = 1 is a normal subtract.
  - The first (LSW) subtract of a multi-word subtraction must have carry set: this is the default carry used in the SUB and RSB instructions.
  - Normally the LS word of a multi-word add or subtract will use ADDS or SUBS; all others will use ADCS or SBCS.
  - Note the S suffix means S = 1, write status bits (condition codes). It can be added to any DP assembler mnemonic except comparisons.

[Figure: a 96-bit addition built from three 32-bit additions: ADDS on bits 31:0, ADCS on bits 63:32, ADCS on bits 95:64]

Example 64 bit addition

For example, let's add two 64-bit numbers X and Y, storing the result in Z.
  - We need two registers to hold each number, since registers are 32 bits
  - Store X as r1:r0, Y in r3:r2, and Z in r5:r4 (notation MSW:LSW)

Then:
    ADDS r4, r0, r2    ; r4 := r0 + r2 (set C)
    ADCS r5, r1, r3    ; r5 := r1 + r3 + C

S at the end of an instruction means you want to write the C, V, N, and Z status bits. In this case the C flag is needed.

Similarly, if we wanted to subtract the two numbers:
    SUBS r4, r0, r2    ; without carry
    SBCS r5, r1, r3    ; with carry
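The carry chain in the 64-bit add and subtract can be modelled in Python to see why the (C - 1) term works. This is a sketch under the rules stated in these slides (subtraction is addition of the bitwise NOT with a carry-in); the helper names `add_with_carry`, `add64` and `sub64` are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def add_with_carry(a, b, carry_in):
    """32-bit adder: returns (result, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64(x_hi, x_lo, y_hi, y_lo):
    """Z = X + Y using two 32-bit additions, as in ADDS then ADCS."""
    lo, c = add_with_carry(x_lo, y_lo, 0)   # ADDS r4, r0, r2
    hi, c = add_with_carry(x_hi, y_hi, c)   # ADCS r5, r1, r3
    return hi, lo, c


def sub64(x_hi, x_lo, y_hi, y_lo):
    """Z = X - Y: each word adds NOT(y); the first carry-in is 1."""
    lo, c = add_with_carry(x_lo, ~y_lo & MASK32, 1)   # SUBS r4, r0, r2
    hi, c = add_with_carry(x_hi, ~y_hi & MASK32, c)   # SBCS r5, r1, r3
    return hi, lo, c


# 0x00000001_FFFFFFFF + 1 carries from the low word into the high word
print([hex(v) for v in add64(0x1, 0xFFFFFFFF, 0x0, 0x1)])
```

In `sub64` a carry-out of 0 from the low word means a borrow, which makes the high word come out one smaller: exactly the (C - 1) term in SBC.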

Data Processing Instructions Register Moves

Here are ARM's register move operations:
    MOV r0, r2    ; r0 := r2
    MVN r0, r2    ; r0 := NOT r2

  - Special case of data processing where one register is not used, but other options (shifted r2 etc, see later) still apply.
  - MVN stands for 'move negated': bitwise NOT
  - This is not two's complement negate: no addition of 1!

    r2: 0101 0011 1010 1111 1101 1010 0110 1011
    r0: 1010 1100 0101 0000 0010 0101 1001 0100

Operand 2

Data processing instructions have a 3 operand format: Rd := Rn op op2
  - First operand (Rn): always a register
  - Second operand (Op2) can be
      - An immediate (literal) value in range 0-255
      - A register Rm
  - Use of shifts adds more options, considered later
  - # indicates a literal

    ADD R5, R2, #200    ; Op2 = 200 is decimal literal value
    ADD R5, R2, R3      ; Op2 = R3

Negative literal values

Since literal op2 is an unsigned value it cannot be used directly to set a register to a negative number. However, usually this does not matter, because a different op-code can be used:

    ADD r0, r1, #-11   =>   SUB r0, r1, #11
    MOV r0, #-n        =>   MVN r0, #(n-1)        ; MVN inverts bits: NOT x = 2^32 - 1 - x
    ADC r0, r1, #-n    =>   SBC r0, r1, #(n-1)    ; WHY (n-1)?

The assembler will do this conversion automatically.

Examples

All three examples use the immediate form of the instruction word:

    cond[4] 0 0 1 Op[4] S[1] Rn[4] Rd[4] Rot[4] C[8]      Rd := Rn Op C

Example 1: ADD r3, r15, #-100

    1110 001 0010 0 1111 0011 0000 01100100
  - 1110 = always, 0010 = SUB, S = 0 (do not write status bits)
  - Rn = 1111 (R15), Rd = 0011 (R3), Rot = 0, C = #100
  - R3 := R15 - 100
  - The "ADD with negative constant" is turned into the equivalent SUB r3, r15, #100 automatically by the assembler

Example 2: ADCS r1, r4, #-4

    1110 001 0110 1 0100 0001 0000 00000011
  - 1110 = always, 0110 = SBC, S = 1 (write status bits N, Z, C, V)
  - Rn = 0100 (R4), Rd = 0001 (R1), Rot = 0, C = #3
  - R1 := R4 - 4 + C
  - The ADC is turned into the equivalent SBCS r1, r4, #3 automatically

Example 3: MOVS r1, #-1

    1110 001 1111 1 0000 0001 0000 00000000
  - 1110 = always, 1111 = MVN, S = 1
  - Rn not used (0000), Rd = 0001 (R1), Rot = 0, C = #0
  - R1 := -1
  - The MOVS is turned into MVNS r1, #0. Note that MVN negates bits; it is not two's complement negation
  - S = 1 => N, Z status bits are written. Why not C, V? The C and V status bits are only written by arithmetic operations
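The "WHY (n-1)?" question can be answered numerically. The sketch below models MVN, ADC and SBC on 32-bit values using the pseudocode from the data processing table; the function names are illustrative, and only the arithmetic result is modelled, not the status bits.

```python
MASK32 = 0xFFFFFFFF


def mvn(op2):
    """Rd := NOT op2"""
    return ~op2 & MASK32


def adc(rn, op2, c):
    """Rd := Rn + op2 + C"""
    return (rn + op2 + c) & MASK32


def sbc(rn, op2, c):
    """Rd := Rn - op2 + C - 1"""
    return (rn - op2 + c - 1) & MASK32


n = 4
r4 = 1000
for c in (0, 1):
    # ADC with the (unencodable) literal -n equals SBC with literal n-1:
    # the built-in "- 1" of SBC supplies the missing 1.
    assert adc(r4, -n & MASK32, c) == sbc(r4, n - 1, c)

# MOV r0, #-n equals MVN r0, #(n-1), because NOT x = 2^32 - 1 - x
assert mvn(n - 1) == (-n & MASK32)
print(hex(mvn(0)))   # the bit pattern of -1
```

In both rewrites the literal becomes n - 1 because the replacement instruction already contributes an extra -1 (the NOT in MVN, the borrow term in SBC).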

64-bit checksum

A checksum is often calculated to check that data has not been corrupted:

    C = \sum_i d_i

In this example 8K bytes of data is stored in memory in a buffer pointed to by r2. Each 8 contiguous bytes (2 words) are interpreted as a 64 bit number d_i. Add the 64 bit numbers (assume words are ordered so that the LSW has the lowest address):

    d_0 = [R2+4]:[R2]     d_1 = [R2+12]:[R2+8]     d_2 = [R2+20]:[R2+16]     ...     (MSW:LSW, 32 bits each)

    CHECKSUM64  MOV  r3, #0          ; bits 31:0 of sum
                MOV  r4, #0          ; bits 63:32 of sum
                MOV  r6, #1024       ; set up loop counter
    LOOP        LDR  r0, [r2]        ; load 31:0 of next 64 bit word
                ADD  r2, r2, #4      ; move r2 to MSW word
                LDR  r1, [r2]        ; load 63:32 of it
                ADD  r2, r2, #4      ; move r2 to next 64 bit word
                ADDS r3, r3, r0      ; 31:0 of 64 bit addition, set C
                ADC  r4, r4, r1      ; add bits 63:32, with C
                SUBS r6, r6, #1      ; decr counter, set status bits on result
                BNE  LOOP            ; if counter is not 0 add next 64 bits

  - r2 -> current word
  - r3, r4 -> 64 bit sum
  - r6 -> counts the number of 64 bit words down to 0
  - Auto-increment memory load (discussed later) would make the code much more efficient.
  - Note that the 64 bit result can overflow, because the carry out of the MSW addition is discarded
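The loop can be mirrored in Python to check the carry handling against ordinary big-integer arithmetic. This is a sketch, not a translation of any particular simulator: it assumes the buffer is a little-endian byte string (LSW at the lowest address, as stated above), and `checksum64` is an illustrative name.

```python
import struct

MASK32 = 0xFFFFFFFF


def checksum64(buffer):
    """Sum the buffer as 64-bit numbers using only 32-bit additions."""
    sum_lo = 0                                   # r3: bits 31:0 of sum
    sum_hi = 0                                   # r4: bits 63:32 of sum
    for offset in range(0, len(buffer), 8):      # r6 counts these iterations
        lo, hi = struct.unpack_from("<II", buffer, offset)   # two LDRs
        total = sum_lo + lo                      # ADDS r3, r3, r0
        carry = total >> 32
        sum_lo = total & MASK32
        sum_hi = (sum_hi + hi + carry) & MASK32  # ADC: carry out is discarded
    return (sum_hi << 32) | sum_lo


data = bytes(range(256)) * 32                    # 8K bytes of sample data
print(hex(checksum64(data)))
```

Because the carry out of the upper addition is dropped, the result equals the true sum modulo 2^64, which is the overflow behaviour noted above.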

Arithmetic on real numbers

So far, we have concentrated on integer representations, signed or unsigned. There is an implicit binary point to the right of bit 0:

    [Figure: N-bit word, bits N-1 down to 0, with the implicit binary point after bit 0]

In general, the binary point can be in the middle of the word (or off the end!). This is FIXED POINT representation of fractional numbers:

    [Figure: N-bit word, bits N-1 down to 0, with the binary point inside the word]

Fixed point arithmetic requires no extra hardware: the binary point is in the mind of the programmer, like signed/unsigned.

Idea of floating point representation

Although fixed point representation can cope with numbers with fractions, the range of values that can be represented is still limited. Alternative: use the equivalent of scientific notation, but in binary:

    number = s x m x 2^e      (s = sign, m = mantissa, e = exponent)

For example:
  - 10.5 in binary: 1010.1(2)
  - Move the binary point 3 places to the left: 1.0101(2) x 2^3
  - 10.5 = 1.3125 x 8

Thus by choosing the correct exponent any number can be represented as a fixed point binary number multiplied by a power of 2. Equivalently, the binary point is "floating".

IEEE-754 standard floating point

32-bit single precision floating point:

    bit 31    bits 30-23    bits 22-0
    s         8-bit exp     23-bit frac

    x = (-1)^s x 2^(exp - 127) x 1.frac
    5.9 x 10^-39 < |x| < 3.4 x 10^38

  - MSB s is the sign bit: 1 => negative
  - Exponent = exp - 127
  - Note this gives exponent = [-127, 127], and special case exp = 255. (Why not exponent = 128?)
  - The MSB of the mantissa is ALWAYS 1, therefore it is not stored
  - mantissa = 1 + frac * 2^-23 (mantissa = 1.frac)

Special cases which break this rule:
  - exp field = 0, frac field = 0 => number is +/- 0
  - exp field = 255, frac field = 0 => +/- infinity
  - exp field = 255, frac field ≠ 0 => NaN (invalid number)

IEEE-754 example

    1100 0000 0110 0000 0000 0000 0000 0000      (single precision)
    s = 1 (negative), exp = 128, fractional part of mantissa = .11(2)

The number above, C0600000(16), must have negative sign.
  - Exponent = exp - 127 = 1
  - mantissa = 1 + 0.11(2) = 1.11(2)
  - Value = -2^1 x 1.11(2) = -11.1(2) = -3.5
  - Note the leading 1.0 is always added to frac

Conversion to IEEE 754

17.5(10) = 10001.1(2) = 2^4 x 1.00011(2)
  - exp = 4 + 127 = 131 = 10000011(2)
  - frac = 00011000000000000000000(2)
  - s = 0 (positive)
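Both worked examples can be reproduced with a short Python sketch. `decode_single` applies the formula x = (-1)^s x 2^(exp - 127) x 1.frac directly and so only handles normal numbers (the special cases exp = 0 and exp = 255 are deliberately not modelled); `encode_single` uses the standard `struct` module to obtain the bit pattern. Both function names are illustrative.

```python
import struct


def decode_single(word):
    """Split a 32-bit pattern into (s, exp, frac) and compute its value."""
    s = word >> 31
    exp = (word >> 23) & 0xFF
    frac = word & 0x7FFFFF
    mantissa = 1 + frac * 2.0 ** -23          # the hidden leading 1 is added
    value = (-1) ** s * 2.0 ** (exp - 127) * mantissa
    return s, exp, frac, value


def encode_single(x):
    """Return the IEEE-754 single precision bit pattern of x."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word


print(decode_single(0xC0600000))              # fields and value of the example
print(f"{encode_single(17.5):032b}")          # bit pattern for 17.5
```

The second printout can be split as 1 + 8 + 23 bits and compared with s, exp and frac from the conversion example.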

Floating point is typically handled by a Floating Point coprocessor (FPU) separate from the CPU. The ARM architecture has FPUs; see the latest ARM datasheets for more details. We will not consider FPU instructions in this course.

Lecture 6 - Data Transfer Instructions (Load/Store)

"Computer programmers don't byte, they nibble a bit" - Unknown

This lecture will examine in detail the ARM LOAD/STORE instructions
  - Multiple register load/store instructions will be dealt with separately, when we are discussing stacks.

The ARM architecture has
  - some clever tricks which mean that memory locations close to the PC can easily be accessed.
  - special support for sequential data access

Example block copy

Block memory copy.
  - A block of memory at address TABLE1 is copied to address TABLE2.
  - Both TABLE1 & TABLE2 are word-aligned (address divisible by 4)
  - The copy operation can be implemented by moving words
  - The size of this block is 200(10) bytes (50 words).

    TABLE1        ->   TABLE2
    TABLE1+4      ->   TABLE2+4
    TABLE1+8      ->   TABLE2+8
    ...
    TABLE1+192    ->   TABLE2+192
    TABLE1+196    ->   TABLE2+196

Block copy solutions

  - We look first at simple solutions to the block copy using instructions which read and write fixed words in memory
  - The block copy could be implemented like this, through a sequence of 50 sets of read/write instructions, each with different addresses.
  - This is actually more efficient than using a loop, but it is not practical when many words are copied, because of the large amount of code.
  - Next we look at how the reads and writes can be made to variable locations (like an access to an array with a variable as index, a[i]), so that a loop can be used with a single read and write to copy all 50 words.

Data Transfer Instructions single register load/store instructions

Basic operation
  - Use a value in one register (called the base register) as a memory address, and either load the data value from that address into a destination register or store the register value to memory:

    LDR r0, [r1]    ; r0 := mem32[r1]
    STR r0, [r1]    ; mem32[r1] := r0

  - This is called register-indirect addressing (AKA indexed)
  - Here r1 is a memory pointer (AKA index register)
  - LDR r0, [r1] is a word transfer: r1 must be a word address (divisible by 4)

    CPU:     r1 = &1000    r0 = 117
    Memory:  &1000: 117    &1004: 560    &1008: 100

Data Transfer Instructions Set up the address pointer with ADR

  - Need to initialize the address in r1 in the first place. How?
  - ADR is a pseudo-instruction: it looks like a normal instruction, but it is actually an assembler directive.
  - The assembler translates it to one or more real instructions.
  - ADR sets a register to a (known and constant) address, i.e. it moves a constant value into a register.

This copies one word from TABLE1 to TABLE2:

    copy    ADR r1, TABLE1    ; r1 points to TABLE1
            ADR r2, TABLE2    ; r2 points to TABLE2
            LDR r0, [r1]      ; load first word
            STR r0, [r2]      ; and store it in TABLE2
            .
    TABLE1                    ; <source of data>
    TABLE2                    ; <destination of data>

Data Transfer Instructions ADR instruction

How does the ADR directive work?
  - An address is 32 bits, so it is difficult to put a 32-bit address value in a register in the first place (constants are 8 bit)
  - Solution: the Program Counter PC (r15) is often close to the required value
  - ADR r1, TABLE1 is translated into a data processing instruction that adds or subtracts a constant to PC (r15), and puts the result in r1
  - This constant is known as a PC-relative offset, and it is calculated as:
        addr_of_TABLE1 - (PC_value + 8)
    (+8 is because of hardware pipelining, see Part 3)

Data Transfer Instructions Moving multiple data items

Extend the copy program further to copy the NEXT word:

    copy    ADR r1, TABLE1     ; r1 points to TABLE1
            ADR r2, TABLE2     ; r2 points to TABLE2
            LDR r0, [r1]       ; load first value
            STR r0, [r2]       ; and store it in TABLE2
            ADD r1, r1, #4     ; step r1 onto next word
            ADD r2, r2, #4     ; step r2 onto next word
            LDR r0, [r1]       ; load second value
            STR r0, [r2]       ; and store it
            ...

Simplify with base+offset addressing mode:

    LDR r0, [r1, #4]    ; r0 := mem32[r1 + 4]

Here r1 is the base address, #4 is the offset, and r1 + 4 is the effective address.
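The PC-relative offset calculation for ADR can be written out as a small Python sketch. Several things here are assumptions rather than facts from the slides: the function names are illustrative, the range check only allows an unrotated 8-bit constant (Rot = 0, as assumed in Lecture 5), and the sample addresses 8000 and 8090, borrowed from the later direct-addressing figure, are taken to be hexadecimal.

```python
def adr_offset(instr_addr, target_addr):
    """PC-relative offset: target - (address of the instruction + 8)."""
    return target_addr - (instr_addr + 8)


def adr_as_data_processing(rd, instr_addr, target_addr):
    """Show the ADD/SUB on r15 that ADR rd, target could be turned into."""
    offset = adr_offset(instr_addr, target_addr)
    op = "ADD" if offset >= 0 else "SUB"     # negative offset: use SUB
    magnitude = abs(offset)
    if magnitude > 255:
        # Assumption: only unrotated 8-bit literals are modelled here.
        raise ValueError("offset does not fit an unrotated 8-bit literal")
    return f"{op} r{rd}, r15, #{magnitude}"


print(adr_as_data_processing(1, 0x8000, 0x8090))
print(adr_as_data_processing(1, 0x8100, 0x8090))
```

The second call shows a table that lies before the instruction, where the offset is negative and a SUB is needed instead of an ADD.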

Data Transfer Instructions base+offset

A simplified version of the last slide's code is:

    copy    ADR r1, TABLE1      ; r1 points to TABLE1
            ADR r2, TABLE2      ; r2 points to TABLE2
            LDR r0, [r1]        ; load first value
            STR r0, [r2]        ; and store it in TABLE2
            LDR r0, [r1, #4]    ; load second value
            STR r0, [r2, #4]    ; and store it
            ...

Data Transfer Instructions base+offset with auto-indexing

Base+offset addressing does not change the base register (r1 & r2 here). Sometimes it is useful to modify the base register to point to the new address. This is achieved by adding a '!', and is base+offset addressing with auto-indexing:

    LDR r0, [r1, #4]!    ; r0 := mem32[r1 + 4]
                         ; r1 := r1 + 4

  - The '!' indicates that the instruction should update the base register after the data transfer
  - One instruction changes two registers
  - Useful in loops

Data Transfer Instructions post-indexed addressing

Another useful form of the instruction is:

    LDR r0, [r1], #4    ; r0 := mem32[r1]
                        ; r1 := r1 + 4

This is called post-indexed addressing: the base address is used without an offset as the transfer address, after which it is always modified. Using this, we can write the copy program as a loop:

    copy    ADR  r1, TABLE1      ; r1 points to TABLE1
            ADR  r2, TABLE2      ; r2 points to TABLE2
            MOV  r3, #50         ; r3 counts no. of words still to copy
    loop    LDR  r0, [r1], #4    ; get TABLE1 1st word
            STR  r0, [r2], #4    ; copy it to TABLE2
                                 ; r1, r2 are updated afterwards
            SUBS r3, r3, #1      ; decrement & set flags
            BNE  loop            ; loop if not finished
            .
    TABLE1

Data Transfer Instructions register-indexed addressing

Sometimes it is useful to have a base register and a register offset:

    LDR r0, [r1, r2]    ; r0 := mem32[r1 + r2]

This is called register-indexed addressing: the index register is added to the base register to make the address. Using this, we can use fixed base registers and a single offset register which also counts the loop iterations:

    copy    ADR r1, TABLE1      ; r1 points to TABLE1
            ADR r2, TABLE2      ; r2 points to TABLE2
            MOV r3, #0
    loop    LDR r0, [r1, r3]    ; get TABLE1 1st word
            STR r0, [r2, r3]    ; copy it to TABLE2
            ADD r3, r3, #4      ; move to next word
            CMP r3, #200        ; if more, go back to loop
            BNE loop            ; if r3 ≠ 200
            .
    TABLE1                      ; <source of data>
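The effect of post-indexed addressing on the registers can be traced with a tiny Python model of the copy loop. It is a sketch only: memory is a dictionary keyed by word address, the addresses 0x1000 and 0x2000 are made-up sample values, and `block_copy` is an illustrative name.

```python
def block_copy(memory, table1, table2, words=50):
    """Model of the post-indexed copy loop; returns final r1, r2."""
    r1, r2, r3 = table1, table2, words    # ADR r1 / ADR r2 / MOV r3, #50
    while True:
        r0 = memory[r1]                   # LDR r0, [r1], #4 : load, then
        r1 += 4                           #   step r1 to the next word
        memory[r2] = r0                   # STR r0, [r2], #4 : store, then
        r2 += 4                           #   step r2 to the next word
        r3 -= 1                           # SUBS r3, r3, #1
        if r3 == 0:                       # BNE loop falls through when Z = 1
            break
    return r1, r2


TABLE1, TABLE2 = 0x1000, 0x2000           # hypothetical word-aligned addresses
memory = {TABLE1 + 4 * i: 100 + i for i in range(50)}
print(block_copy(memory, TABLE1, TABLE2))
```

After the loop both pointers have advanced by 200 bytes, one word past the end of each table, which is the usual end state of a post-indexed loop.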

Data Transfer Instructions scaled register-indexed addressing

    LDR r0, [r1, r2, lsl #n]    ; r0 := mem32[r1 + (r2 left shift n)]

  - The second (index) register can have an optional shift: useful in this case so that it can count words directly (byte offset = word count * 4)
  - In principle any of the shift modes (lsl, lsr, asl, asr, ror, rrx) described in Lecture 8 can be used
  - lsl #n, used here, multiplies by a scale factor of 2^n

    copy    ADR r1, TABLE1              ; r1 points to TABLE1
            ADR r2, TABLE2              ; r2 points to TABLE2
            MOV r3, #0
    loop    LDR r0, [r1, r3, lsl #2]    ; get TABLE1 1st word
            STR r0, [r2, r3, lsl #2]    ; copy it to TABLE2
            ADD r3, r3, #1              ; move to next word
            CMP r3, #50                 ; if more, go back to loop
            BNE loop                    ; if r3 ≠ 50
            .
    TABLE1                              ; <source of data>

ARM equivalent of direct addressing

  - Sometimes it is not necessary to load a base register (e.g. with ADR). The code below accesses TABLE1 & TABLE2 by computing the correct offset (as for ADR) and using PC as the base register.
  - The assembly LDR r0, TABLE1 below is translated automatically into a load using PC as base with the correct offset, for example:
        LDR r0, [r15, #88]
  - Because the value of R15 is known this is effectively direct addressing, in a limited range close to PC
  - It does not use a normal base register, so it can't be used for auto-increment modes etc, which would change PC

    8000    LDR r0, TABLE1    ; load using PC as base
            STR r0, TABLE2    ; store using PC as base
            .                 ; will only work if TABLE1, TABLE2
            .                 ; are within 4096 bytes of PC at
            .                 ; LDR, STR instructions
    8090    TABLE1

Benefits of PC = r15: pseudo-instructions

We see here two benefits of allowing PC to be a general purpose register (R15):
  - Adding a constant number to PC can often be used to load a register with a memory address:
        ADR R0, TABLE    =>    ADD R0, R15, #offset
  - Using PC offset addressing is equivalent to direct addressing:
        LDR R0, TABLE    =>    LDR R0, [R15, #offset]

These pseudo-instructions, the transformations, and the offset calculations are implemented by the assembler.

Data transfer encoding (to or from memory LDR, STR)

Instruction word layout, most significant bit first (field widths in bits in square brackets):

    Immediate offset S:   cond[4] 0 1 0 P U B W L Rn[4] Rd[4] S[12]             Rd <-> mem[Rn+S]
    Register offset Rm:   cond[4] 0 1 1 P U B W L Rn[4] Rd[4] Shift[8] Rm[4]    Rd <-> mem[Rn+Rm*]

    Bit in word    0                               1
    P              use base register addressing    use indexed or offset address
                   [Rn]                            [Rn+Rm], [Rn+S]
    U              subtract offset [Rn-S]          add offset [Rn+S]
    B              Word                            Byte
    W              leave Rn unchanged if P=1       write indexed or offset address back into Rn if P=1
    L              Store                           Load

  - NB: if P=0, W=0
  - If P=0, always write the offset address back into Rn
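The P, U, B, W and L bits can be tied to the assembler syntax with a small Python sketch of the immediate-offset form. It follows the field layout and the table above; the function name `encode_transfer_immediate` and its keyword arguments are illustrative, and the register-offset form is not modelled.

```python
def encode_transfer_immediate(load, rd, rn, offset=0, pre_indexed=True,
                              writeback=False, byte=False, cond=0b1110):
    """Pack an LDR/STR with a 12-bit immediate offset into a 32-bit word."""
    magnitude = abs(offset)
    if magnitude > 0xFFF:
        raise ValueError("offset does not fit in 12 bits")
    p = 1 if pre_indexed else 0            # P=0: post-indexed, [Rn] is the address
    u = 1 if offset >= 0 else 0            # U=1 add offset, U=0 subtract it
    b = 1 if byte else 0                   # B=1 byte (LDRB/STRB), B=0 word
    w = 1 if (pre_indexed and writeback) else 0   # W is 0 whenever P=0
    l = 1 if load else 0                   # L=1 load, L=0 store
    return ((cond << 28) | (0b010 << 25) | (p << 24) | (u << 23) | (b << 22)
            | (w << 21) | (l << 20) | (rn << 16) | (rd << 12) | magnitude)


examples = {
    "LDR r0, [r1, #4]":  encode_transfer_immediate(True, 0, 1, 4),
    "LDR r0, [r1, #4]!": encode_transfer_immediate(True, 0, 1, 4, writeback=True),
    "LDR r0, [r1], #4":  encode_transfer_immediate(True, 0, 1, 4, pre_indexed=False),
    "STRB r0, [r1]":     encode_transfer_immediate(False, 0, 1, byte=True),
}
for text, word in examples.items():
    print(f"{text:20s} {word:08X}")
```

The three LDR forms differ only in the P and W bits, which is how one instruction format covers base+offset, auto-indexed and post-indexed addressing.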

Data Transfer Instruction Assembly

  - Size of data can be reduced to an 8-bit byte for any instruction:
        LDRB r0, [r1]    ; r0 := mem8[r1]
        STRB r0, [r1]    ; mem8[r1] := r0
  - In practice, most loops which access data sequentially can be simplified by using base+offset or post-indexed addressing, as appropriate, with auto-indexing.
  - Summary of addressing modes (replace LDR by STR for STORE):

    LDRB r0, [r1]                     ; register-indirect addressing
    LDRB r0, [r1, #offset]            ; base+offset addressing
    LDRB r0, [r1, #offset]!           ; base+offset, auto-indexing
    LDRB r0, [r1], #offset            ; post-indexed, auto-indexing
    LDRB r0, [r1, r2]                 ; register-indexed addressing
    LDRB r0, [r1, r2, lsl #shift]     ; scaled register-indexed addressing
    LDRB r0, address_label            ; PC relative addressing (pseudo-instruction)
    ADR  r0, address_label            ; load PC relative address (pseudo-instruction)

Lecture 7 - Branches, Comparisons, Status Bits, and Conditional Execution

When I hear somebody sigh, "Life is hard," I am always tempted to ask, "Compared to what?" Sydney J Harris

  - In the ARM ISA "jumps", which change the value of PC, are called "branches"
  - The ARM ISA has a unique and clever way of dealing with conditional branches.
      - Instead of having special instructions, ALL instructions are given an execution condition which determines whether they are executed, or ignored.
      - The condition is the top 4 bits of the instruction word
      - The "always true" condition is used with most instructions to make their execution unconditional
  - A single branch instruction thus provides conditional and unconditional branches.

Branches

The basic branch instruction is:

            B label    ; unconditionally branch to label
            .
    label

Conditional branch instructions can be used to control loops:

            MOV r0, #10       ; initialize loop counter r0
    loop    .                 ; start of body of loop
            .
            SUB r0, r0, #1    ; decrement loop counter
            CMP r0, #0        ; is it zero yet?
            BNE loop          ; branch if r0 ≠ 0

  - Here the CMP instruction is a SUBTRACTION which gives no result EXCEPT possibly changing status flags in the CPSR.
  - Here we need to know that if r0 = 0, then the Z bit is set (= '1'), else the Z bit is reset (= '0')
  - Z controls the following BNE conditional branch instruction

An example

Consider the pseudo-code:

    if (a = 1) then c := c+1 else d := d-1

  - This needs to be implemented using conditional branches or, as we will see, conditional execution.
  - The first step is to assign registers to variables. We assume a=r0, c=r2, d=r3, and then the problem becomes:

    if (r0 = 1) then r2 := r2+1 else r3 := r3-1

  - To translate this pseudocode we need to use branches or conditional execution

Example with branches

    if (r0 = 1) then r2 := r2+1 else r3 := r3-1

    EXAMPLE     CMP r0, #1        ; comparison
                BEQ THENPART      ; conditional branch
                                  ; else part
                SUB r3, r3, #1
                B   ENDCODE
    THENPART                      ; then part
                ADD r2, r2, #1
    ENDCODE

Comparison Operations

Here are ARM's register test operations:

    CMP r1, r2    ; set NZCV on (r1 - r2)
    CMN r1, r2    ; set NZCV on (r1 + r2)
    TST r1, r2    ; set NZ on (r1 and r2)
    TEQ r1, r2    ; set NZ on (r1 xor r2)

  - Results of the subtract, add, and, xor are NOT stored in any register, so the destination register Rd is not used
  - Status flags in the CPSR are set or cleared by these instructions (you don't need the S).
  - Take the CMP r1, r2 instruction:

    N=1    if MSB of (r1 - r2) is '1'                                      (BMI, BPL)
    Z=1    if (r1 - r2) = 0                                                (BEQ, BNE)
    C=1    if the carry-out of the addition that implements r1 - r2 is 1   (BCS, BCC)
    V=1    if there is a two's complement overflow                         (BVS, BVC)
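The four flags produced by CMP can be computed in Python from the rule that subtraction is an addition of the inverted operand plus 1. This is a sketch of the flag definitions listed above; `cmp_flags` is an illustrative name and the dictionary return type is just a convenient way to label the bits.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(r1, r2):
    """Flags after CMP r1, r2, computed as r1 + NOT(r2) + 1."""
    a, b = r1 & MASK32, r2 & MASK32
    total = a + (~b & MASK32) + 1
    result = total & MASK32
    return {
        "N": result >> 31,                          # MSB of the result
        "Z": int(result == 0),                      # result is zero
        "C": total >> 32,                           # carry out of the adder
        # Overflow: operands have different signs and the result's sign
        # differs from the sign of the first operand.
        "V": ((a ^ b) & (a ^ result)) >> 31,
    }


print(cmp_flags(0x20, 0x02))   # first operand larger
print(cmp_flags(0x20, 0x20))   # equal operands
print(cmp_flags(0x02, 0x20))   # first operand smaller
```

Note that C = 1 means "no borrow", so after a CMP the carry is set exactly when the first operand is not smaller than the second as an unsigned number.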

The S-bit

Explicit comparisons are not needed after a SUBS or ADDS:

            MOV  r0, #10       ; initialize loop counter r0
    loop    .                  ; start of body of loop
            .
            SUBS r0, r0, #1    ; decrement loop counter AND set flags
            BNE  loop          ; branch if r0 ≠ 0

  - The SUBS instruction is the same as SUB except that the former updates the NZCV flags in the CPSR.
  - After the SUBS instruction, the Z bit is set or cleared depending on the result of the subtraction, so CMP is not needed.
  - All data processing instructions can have S:
        EORS R0, R1, R2
        ANDS R0, R3, #0
        ADCS R0, R1, R2
  - CMP is identical to SUBS but with no destination, TEQ to EORS, etc

ARM condition code field

    BMI LABEL    ; Branch to LABEL on MI condition

Conditional Execution

Conditional execution applies not only to branches, but to all ARM instructions. For example:

            CMP r0, #5          ; if (r0 >= 5) then
            BLO BYPASS
            ADD r1, r1, r0      ;     r1 := r1 + r0 - r2
            SUB r1, r1, r2
    BYPASS  ..

Can be replaced by:

            CMP   r0, #5        ; if (r0 >= 5) then
            ADDHS r1, r1, r0    ;     r1 := r1 + r0 - r2
            SUBHS r1, r1, r2
            ..

Here the ADDHS and SUBHS instructions are executed only if C=1, i.e. the CMP instruction gives R0 >= 5 (unsigned).

Using Condition Codes

  - The two letter condition code is appended to the 3 letter instruction op-code to make instruction execution conditional: MOVEQ, ADDPL, BCC, LDRMI, etc.
  - The "always" condition AL may be omitted for (normal) unconditional execution
  - Op-code suffixes (S for data processing instructions, B for LDR/STR) go after the condition code:
        ADDPLS, STRNEB, SBCCSS

Conditional Execution Replaces Branches

  - We have seen that IF-THEN-ELSE constructions in pseudocode turn into multiple branches in assembly.
  - If the THEN and ELSE statements are short, branches can be avoided by using conditional execution.
  - The same optimisation works for IF-THEN code if the THEN statement is short.

With branches:

                CMP r0, #1
                BEQ THENPART
                                  ; else part
                SUB r3, r3, #1
                B   ENDCODE       ; go to end
    THENPART                      ; then part
                ADD r2, r2, #1
    ENDCODE                       ; finished

With conditional execution:

                CMP   r0, #1
                SUBNE r3, r3, #1
                ADDEQ r2, r2, #1
                                  ; finished

Conditional Execution - more

Here is another very clever use of this unique feature of the ARM instruction set. ALL instructions can be qualified by the condition codes, including CMP!

    ; if ((a=b) and (c=d)) then e := e + 1
            CMP   r0, r1        ; r0 has a, r1 has b
            CMPEQ r2, r3        ; r2 has c, r3 has d
            ADDEQ r4, r4, #1    ; e := e+1

  - Note how, if the first comparison finds unequal operands, the second and third instructions are both skipped.
  - The logical 'and' in the if clause is implemented by making the second comparison conditional on the first.
  - Conditional execution is normally only efficient if the conditional sequence is three instructions or fewer. If the conditional sequence is longer, use branches.

BCC-BLO, BCS-BHS equivalence

  - The names of all the conditional branches only really make sense if they follow a CMP instruction
  - LO (lower) and HS (higher or same) are used for unsigned numbers; the equivalents for signed numbers are LT (less than) and GE (greater than or equal)
  - Remember that CMP is a SUB instruction without a destination
  - CMP r0, r1 => invert all bits in r1, add 1, and add to r0:
        u(R0) + (2^N - u(R1))        where u(R0) is the unsigned value of R0
  - There will be a carry out if this is ≥ 2^N, so:
        carry set  <=>  u(R0) + (2^N - u(R1)) ≥ 2^N  <=>  R0 ≥ R1 (unsigned)

    R0 - R1                  R0 + (2^N - R1)
    00000020 - 00000002      00000020 + FFFFFFFE = 0000001E (& carry out)
    00000020 - 00000020      00000020 + FFFFFFE0 = 00000000 (& carry out)
    00000002 - 00000020      00000002 + FFFFFFE0 = FFFFFFE2 (no carry out)

The more complex cases

GE is the two's complement signed comparison "greater than or equal to": r0 ≥ r1. Two cases:
  - r0 - r1 is a positive result, no overflow => V=0, N=0
  - r0 - r1 is a negative result, with overflow => V=1, N=1
      - 8-bit example: r0 = 127, r1 = -128
      - EXACT: 127 - (-128) = +255
      - 8 bit signed interpretation: -1 (so V=1, N=1)

V=0, N=0 or V=1, N=1 means r0 ≥ r1, so GE tests NOT (N XOR V)

Other conditions:
  - LT (<) is NOT GE
  - GT (>) is GE AND NOT EQ
  - LE (≤) is LT OR EQ
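The HS and GE tests can be checked exhaustively on 8-bit words, where the 127 and -128 example above applies directly. The sketch below uses the same u(R0) + (2^N - u(R1)) formulation as the text; the function names and the `bits` parameter are illustrative, and 8 bits is chosen only so that every operand pair can be tried quickly.

```python
def sub_flags(x, y, bits=32):
    """Return (N, C, V) after comparing x with y on a word of `bits` bits."""
    mask = (1 << bits) - 1
    a, b = x & mask, y & mask
    total = a + ((1 << bits) - b)          # u(R0) + (2^N - u(R1))
    result = total & mask
    n = result >> (bits - 1)
    c = int(total >= (1 << bits))          # carry out of the N-bit adder
    v = ((a ^ b) & (a ^ result)) >> (bits - 1)
    return n, c, v


def hs(x, y, bits=32):
    """Unsigned higher or same: C = 1."""
    return sub_flags(x, y, bits)[1] == 1


def ge(x, y, bits=32):
    """Signed greater or equal: NOT (N XOR V), i.e. N == V."""
    n, _, v = sub_flags(x, y, bits)
    return n == v


print(sub_flags(127, -128, bits=8))        # the 8-bit example above
print(ge(127, -128, bits=8), hs(127, -128 & 0xFF, bits=8))
```

The last line shows why choosing the right condition matters: the same two bit patterns compare one way as signed numbers and the other way as unsigned numbers.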

Inequality Conditions Summarised

  - ARM has the full set of signed and unsigned inequality conditions. They can be confusing.
  - After a CMP or SUBS, if x, y are the two operands, the 8 possible inequalities are shown in the table below
  - It is important to choose the correct condition if the test is to work for all inputs, even though for positive numbers signed and unsigned comparisons are identical.

    Test      Signed                      Unsigned
    x > y     GreaTer             GT      HIgher            HI
    x ≥ y     Greater or Equal    GE      Higher or Same    HS (= CS)
    x ≤ y     Less or Equal       LE      Lower or Same     LS
    x < y     Less Than           LT      LOwer             LO (= CC)

Lecture 8: Bit manipulation - shifts etc

"The best teachers have shown me that things have to be done bit by bit. Nothing that means anything happens quickly; we only think it does." Joseph Bruchac

Individual bits can have separate meanings in assembly programs
  - Hardware registers where every bit is a separate flag
  - Hardware registers where bit fields have specific meaning

Two types of operation help in manipulating bits
  - Shifts & rotates
  - 32 bit bitwise logical data processing instructions

Register Shifts

    ADD r0, r1, r2, lsl #3     ; op2 is r2 shifted left by 3
    MOV r0, r1, lsr #11        ; op2 is r1 shifted right by 11

  - The key to manipulating bit fields (contiguous groups of bits) is the use of data shifts.
  - ARM has a large collection of shifts available for the 2nd register operand of a data processing instruction.
      - Shifts can be combined with arithmetic or bitwise logical operations in one instruction.

    Rd := Rn op (Rm shift by n)    ; shift = lsl, lsr, asl, asr, ror, rrx

  - 0 ≤ n ≤ 31. RRX is a special case: only possible by 1 bit (n=1).
  - NOTE: Rm is not changed by the shift; the shifted value is used as the operand

ARM shift operations - LSL and LSR

Here are all the six possible ARM shift operations you can use:

  - LSL: logical shift left by 0 to 31 places; fill the vacated bits at the least significant end of the word with zeros.
        x LSL n = x * 2^n if no overflow
  - LSR: logical shift right by 0 to 31 places; fill the vacated bits at the most significant end of the word with zeros.
        x LSR n = x / 2^n if x is positive (integer division)

ARM shift ops - ASL and ASR

ASL: arithmetic shift left; this is the same as LSL.
ASR: arithmetic shift right by 0 to 31 places; fill the vacated bits at the most significant end of the word with zeros if the source operand was positive, and with ones if it is negative. That is, sign extend while shifting right.
x ASR n = x / 2^n for x ≥ 0 (integer division)
For negative x, x ASR n is x / 2^n rounded negatively (towards minus infinity), so for example -1 ASR 1 = -1.

x          3   2   1   0  -1  -2  -3  -4
x asr 1    1   1   0   0  -1  -1  -2  -2
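The LSL, LSR and ASR definitions can be tried out with a short Python model. It is only a sketch: the helper names are hypothetical, words are assumed to be 32 bits, and shift amounts are assumed to lie in the range 0 to 31.

```python
MASK = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32 bit word as a two's complement number."""
    return x - (1 << 32) if x & 0x80000000 else x

def lsl(x, n):
    """Logical shift left: zeros enter at the least significant end."""
    return (x << n) & MASK

def lsr(x, n):
    """Logical shift right: zeros enter at the most significant end."""
    return (x & MASK) >> n

def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter at the top."""
    return (to_signed(x & MASK) >> n) & MASK

# Reproduce the x asr 1 row of the table for x = 3 .. -4.
asr_row = [to_signed(asr(x, 1)) for x in (3, 2, 1, 0, -1, -2, -3, -4)]
```

Python's own >> on negative integers also rounds towards minus infinity, which is why asr can reuse it once the word has been reinterpreted as signed.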

ARM rotate operations - ROR and RRX

ROR: rotate right by 0 to 31 places; the bits which fall off the least significant end are used to fill the vacated bits at the most significant end of the word. (ROL n = ROR 32-n)
RRX: rotate right extended by 1 place; the vacated bit (bit 31) is filled with the old value of the C flag and the operand is shifted one place to the right. This is effectively a 33 bit rotate using the register and the C flag.
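A minimal Python sketch of the two rotates follows. The function names are hypothetical, and rrx returns the bit shifted out alongside the result on the assumption that the instruction is one that updates the C flag.

```python
MASK = 0xFFFFFFFF

def ror(x, n):
    """Rotate a 32 bit word right by n places."""
    n %= 32
    x &= MASK
    return ((x >> n) | (x << (32 - n))) & MASK

def rol(x, n):
    """There is no ROL shift; ROL n is expressed as ROR 32-n."""
    return ror(x, (32 - n) % 32)

def rrx(x, carry_in):
    """Rotate right extended: C enters bit 31, old bit 0 is shifted out.

    Returns (result, bit shifted out).
    """
    x &= MASK
    return (carry_in << 31) | (x >> 1), x & 1
```

Applying rrx 33 times, feeding each shifted-out bit back in as the next carry, returns the original word and carry, which illustrates the "33 bit rotate" description.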


Register-valued Shifts
ADD r0, r1, r2, lsl r3 ; shift r2 by value of register r3. 4 regs!
MOV r0, r1, asr r10 ; shift r1 by value of register r10

The number (n previously) of bits to shift can be variable and come from the value in a register, as above. "Register-valued" shifts take two cycles to execute.
MOV r0, r1, lsl r3
If r3 = 4 & r1 = 11 this will set r0 := 11*2^4

This allows variable shifts, for example, to select bit n from a 32 bit register:
; r4 contains n
; result (r0) has bit n from r2 aligned with bit 0
MOVS r0, r2, lsr r4

Rotation in immediate op2

Instruction fields (widths in bits): cond (4) | 0 0 1 | Op (4) | S (1) | Rn (4) | Rd (4) | Rot (4) | C = Const (8)
op1 = Rn, dest = Rd, op2 = immediate value
Rd := Rn Op C'

All data processing instructions can have a rotated immediate operand.
C' = C rotated right (ROR) by 2r, where r is the unsigned value of the Rot field.

Rotation in immediate op2 (2)

The 12 bit immediate field is split into two parts, a 4 bit unsigned rotate number, r, (0 ≤ r ≤ 15) and an 8 bit unsigned constant, C, (0 ≤ C ≤ 255).
C' = C ROR 2r
Note that 2^(2x) × C = C ROL 2x = C ROR (32-2x) - easier to work out x first! (NB special case, x=0 => r=0, C=C').
Example: C = &51, x = 5 => r = 11, C' = 2^(2*5) × &51 = &14400
&14400 = 01010001 0000000000 (non-zero field C followed by 2*x zeros)
Instruction fields: cond | 0 0 1 | Op | S | Rn | Rd | 1011 | 01010001

Common rotated immediate values

C' = 2^(2x) × C (r = (16-x) mod 16)
x = 0: any value in range 0 - 255
x = 1 => ×4: any word address offset in range 0 - 1020 (e.g. in ADR pseudo-instruction)
Any single bit set (2^n)
How do you get constant 2^n for odd n?
In general any 8 bit binary field aligned on any even bit position is possible
NB negative numbers use alternate instruction e.g. SUB not ADD
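The encoding rule C' = C ROR 2r can be explored with a short Python sketch. The function names encode_immediate and decode_immediate are hypothetical; the search simply tries every rotate value r and keeps the first one that leaves an 8 bit constant.

```python
MASK = 0xFFFFFFFF

def ror32(x, n):
    """Rotate a 32 bit word right by n places."""
    n %= 32
    x &= MASK
    return ((x >> n) | (x << (32 - n))) & MASK

def decode_immediate(r, c):
    """Value of the operand C' for rotate field r and 8 bit constant c."""
    return ror32(c, 2 * r)

def encode_immediate(value):
    """Return (r, c) with value == c ROR 2r, or None if no encoding exists."""
    value &= MASK
    for r in range(16):
        c = ror32(value, (32 - 2 * r) % 32)   # undo the ROR: rotate left by 2r
        if c <= 0xFF:
            return r, c
    return None
```

For example, encode_immediate(0x200) gives r = 12 and C = 2: a power of two with an odd exponent is reached by placing the single bit at an odd position inside the 8 bit constant, since the rotation itself is always by an even amount.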

Data Processing Instructions - Bitwise Logical operations

Here are ARM's bit-wise logical operations:
AND r0, r1, r2 ; r0 := r1 and r2 (bit-by-bit for 32 bits)
ORR r0, r1, r2 ; r0 := r1 or r2
EOR r0, r1, r2 ; r0 := r1 xor r2
BIC r0, r1, r2 ; r0 := r1 and not r2

BIC stands for 'bit clear', where every '1' in the second operand clears the corresponding bit in the first:
r1: 0101 0011 1010 1111 1101 1010 0110 1011
r2: 1111 1111 1111 1111 0000 0000 0000 0000
r0: 0000 0000 0000 0000 1101 1010 0110 1011

BIC allows immediate operands to be used to clear individual bits.

Example - typical memory-mapped I/O

AD0DR: bit 31 = D, bit 30 = OV, bits 26:24 = CHN, bits 15:6 = DATA

The A/D convertor converts input voltage from up to 8 inputs into a digital (unsigned) value. AD0DR is the LPC2138 A/D convertor data register.
Memory mapped as 32 bit word, read/write.
Read provides the 10 bit conversion output, 3 bit channel output, and other status info:
D - Done: 1 when conversion has finished
OV - Overrun: 1 if data from a conversion is not read before another conversion starts
CHN - channel: which of the 8 possible inputs was converted
DATA - 10 bit binary data output (bit 15 is MSB, bit 6 is LSB).
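The four bitwise logical operations (AND, ORR, EOR, BIC) map directly onto Python integer operators, which makes the BIC example easy to verify. The sketch below is illustrative only; write_field is a hypothetical helper showing the common "clear with BIC, then set with ORR" pattern for updating a bit field.

```python
MASK = 0xFFFFFFFF

def and_(a, b):
    return a & b & MASK

def orr(a, b):
    return (a | b) & MASK

def eor(a, b):
    return (a ^ b) & MASK

def bic(a, b):
    """Bit clear: every 1 in b clears the corresponding bit of a."""
    return a & ~b & MASK

def write_field(word, value, lsb, width):
    """Replace bits [lsb+width-1 : lsb] of word with value (hypothetical helper)."""
    field_mask = ((1 << width) - 1) << lsb
    cleared = bic(word, field_mask)                    # BIC clears the old field
    return orr(cleared, (value << lsb) & field_mask)   # ORR inserts the new one

r1 = 0x53AFDA6B   # 0101 0011 1010 1111 1101 1010 0110 1011
r2 = 0xFFFF0000   # 1111 1111 1111 1111 0000 0000 0000 0000
r0 = bic(r1, r2)
```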



Extracting a bit-field (≤ 8 bits) using AND

AD0DR: bit 31 = D, bit 30 = OV, bits 26:24 = CHN, bits 15:6 = DATA
R0 after extraction: only the CHN field (bits 26:24) is kept, all other bits are 0

To extract only the CHN bit field of AD0DR to R0:
LDRL R0, AD0DR ; get data into register
AND R0, R0, #&07000000 ; set all unwanted bits to 0
LDRL LABEL is like LDR LABEL, but LABEL can be anywhere in memory.

Extracting bit fields using LSL & LSR

Left shift by N of a number is the same as multiplying by 2^N.
Arithmetic right shift by N of a number is the same as dividing by 2^N and rounding negatively.
Logical right shift does the same for unsigned numbers.

Shifts can be used to extract bit fields. In a 32 bit word, bits n:m can be extracted and aligned with bit 0 by:
left shift by 31-n
right shift by (31-n)+m

Example, extracting bits 11:7:
11101010100001110001001111110011
00111111001100000000000000000000 (LSL 31-11 = 20)
00000000000000000000000000000111 (LSR 31-11+7 = 27)
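The shift-pair recipe for bits n:m can be written as a small Python function and checked against the worked bit pattern. This is a sketch with hypothetical names; extract_by_mask is included only as an independent cross-check using a shift and a mask.

```python
MASK = 0xFFFFFFFF

def extract_field(word, n, m):
    """Extract bits n:m of a 32 bit word, aligned with bit 0."""
    shifted_left = (word << (31 - n)) & MASK     # LSL #(31-n): discard bits above n
    return shifted_left >> ((31 - n) + m)        # LSR #((31-n)+m): discard bits below m

def extract_by_mask(word, n, m):
    """Same result using a right shift followed by a mask."""
    return (word >> m) & ((1 << (n - m + 1)) - 1)

example = 0b11101010100001110001001111110011
field = extract_field(example, 11, 7)
```

Unlike the AND method, the shift pair works for fields of any width, because no immediate mask has to fit in the instruction.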

AD0DR: bit 31 = D, bit 30 = OV, bits 26:24 = CHN, bits 15:6 = DATA
AD0CR: bits 26:24 = START, bits 19:17 = CLKS, bit 16 = B, bits 15:8 = CLKDIV, bits 7:0 = SEL

Extract the 10 bit DATA field: AD0DR(15:6)

ADRL r1, AD0DR ; load address
LDR r0, [r1]
MOV r0, r0, lsl #16
MOV r0, r0, lsr #22 ; R0 contains extracted DATA field
NB ADRL is used when the address is >4096 bytes from PC

Write CLKDIV: AD0CR(15:8), from r3(7:0)

; r3 contains 8 bit value to be written
ADRL r1, AD0CR
LDR r0, [r1] ; load whole of AD0CR
BIC r0, r0, #&ff00 ; clear bits 15:8 (CLKDIV)
ORR r0, r0, r3, lsl #8 ; set 15:8 from r3(7:0)
STR r0, [r1] ; store back to AD0CR

Multiplying by a (small) constant

Multiplying by 2^N is easy using a left shift. Other constants can be derived from this by using ADD or RSB as in the table below. 2,3,4,5,7,8,9, etc are all possible in this way.
Where possible this is preferable to using a MUL instruction because it is faster, does not require the immediate operand to be set up in a register, and is available on all architectures.

r0 := 2^N × r1        MOV r0, r1, lsl #n
r0 := (2^N+1) × r1    ADD r0, r1, r1, lsl #n
r0 := (2^N-1) × r1    RSB r0, r1, r1, lsl #n    (Note RSB not SUB)

ADD r0, r0, r0, LSL #2 ; r0' := 5 x r0
RSB r0, r0, r0, LSL #3 ; r0" := 7 x r0'

What does this multiply by?
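The question can be answered by modelling the ADD and RSB forms in Python. The sketch below uses hypothetical function names and assumes 32 bit wrap-around arithmetic; chaining the two steps multiplies by 5 and then by 7, i.e. by 35 overall.

```python
MASK = 0xFFFFFFFF

def times_2n_plus_1(r1, n):
    """ADD r0, r1, r1, lsl #n  gives  r0 := r1 + (r1 << n) = (2^n + 1) * r1."""
    return (r1 + (r1 << n)) & MASK

def times_2n_minus_1(r1, n):
    """RSB r0, r1, r1, lsl #n  gives  r0 := (r1 << n) - r1 = (2^n - 1) * r1."""
    return ((r1 << n) - r1) & MASK

def two_step_multiply(r0):
    r0 = times_2n_plus_1(r0, 2)    # ADD r0, r0, r0, LSL #2   ; 5 x r0
    r0 = times_2n_minus_1(r0, 3)   # RSB r0, r0, r0, LSL #3   ; 7 x (5 x r0)
    return r0
```

RSB is needed rather than SUB because only the second operand can be shifted, and the shifted value has to be the one subtracted from.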



Lecture 9 Subroutines & Stacks

"Television is like the American toaster, you push the button and the same thing pops up everytime" Alfred Hitchcock

The subroutine is a key element in assembly language programs, allowing code reuse.
It is also the way that High Level Language procedures and functions are implemented.
Storage of data on a stack is an essential element of all modern computer programs and typically is done on subroutine entry & exit.
ARM has instructions to support subroutines and stacks.
This lecture will consider:
Use of return addresses by subroutines
Branch & link instruction
Storing data on stacks in the ARM ISA
Load & Store Multiple Registers instructions

Subroutines

Subroutines allow you to modularize your code so that it is more reusable. The general structure of a subroutine in a program is:

MAIN    ......          ; main program
        BL SUB1         ; subroutine call
        ......

SUB1    .....           ; subroutine
        MOV pc, R14

Branch & Link instruction


BL subroutine_name (Branch-and-Link) is the instruction to jump to a subroutine. It performs the following operations:
1) It saves the PC value (which points to the next instruction) in r14. This is the return address.
2) It loads PC with the address of the subroutine. This performs a branch.
BL always uses r14 to store the return address. r14 is called the link register (can be referred to as lr or r14).
Return from subroutine is simple: just put r14 back into PC (r15).

Example

Essential documentation for subroutines must describe:
Inputs
Outputs (if any)
What subroutine does (other than compute outputs)
Which registers it changes

EXAMPLE: Subroutine to move n bytes (spaced one per word) into n contiguous bytes at a different position in memory.
(Diagram: source bytes in the words at &1000, &1004, &1008, ...; destination bytes contiguous from &2000.)

PACK_BYTES
        ; Input: src=r0, dest=r1, n=r2
        ; loads LS bytes in words [R0],[R0+4], ..., [R0+4(n-1)]
        ; into contiguous bytes [R1],[R1+1],.....[R1+n-1]
        ; Changes r2,r3
        SUBS R2, R2, #1             ; n := n-1
        LDRB R3, [R0, R2, lsl #2]   ; load byte from [R0+4(n-1)] (last byte first)
        STRB R3, [R1, R2]           ; store it at [R1+n-1]
        BNE PACK_BYTES              ; repeat until n = 0
        MOV pc, r14                 ; return to caller
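To see what PACK_BYTES does, the loop can be mirrored in Python. This is a sketch under stated assumptions: memory is modelled as a bytearray, the machine is assumed little-endian (so the least significant byte of a word sits at the word's own address), and n is assumed to be at least 1, as in the original routine.

```python
def pack_bytes(mem, src, dest, n):
    """Copy the LS byte of n consecutive words at src to n contiguous bytes at dest.

    mem is a bytearray standing in for memory (an assumption of this sketch).
    """
    while True:
        n -= 1                        # SUBS R2, R2, #1
        r3 = mem[src + (n << 2)]      # LDRB R3, [R0, R2, lsl #2]
        mem[dest + n] = r3            # STRB R3, [R1, R2]
        if n == 0:                    # BNE falls through once Z is set
            break

# Three source words at offsets 0, 4, 8; destination bytes at offset 32.
memory = bytearray([0xEE] * 64)
memory[0], memory[4], memory[8] = 0x11, 0x22, 0x33
pack_bytes(memory, 0, 32, 3)
```

The loop runs backwards from the last byte to the first, which lets the single SUBS both update the index and provide the loop-termination test.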

MAIN
        ADR R0, TAB1        ; set up subroutine inputs
        ADR R1, TAB2
        MOV R2, #100
        BL PACK_BYTES       ; call the subroutine

Nested Subroutines

SUB1    ...
        BL SUB2

SUB2    ...
        BL SUB3

SUB3    ...
        X

When executing at "X" the nested subroutines SUB1, SUB2, SUB3 are all active.

Nested Subroutines

Since the return address is held in register r14, you should not call a further subroutine without first saving r14. How do you achieve this goal?
Could use separate storage for each subroutine (SUB1: store 1, SUB2: store 2, SUB3: store 3).
Problem: storage needed scales with number of subroutines. Typically may have 1000s of subroutines, which means 1000s of separate storage locations.
The number of subroutines active at any time (nested) is much smaller than the total number, typically less than 10. This motivates use of a stack: an area of memory which is shared for storage by subroutines.
Can store all registers changed by subroutine on stack, not just R14.

The idea of a STACK

A stack is a portion of main memory used to store data temporarily, so that the memory can be shared between different items at different times. A PUSH operation stores a number of registers onto the stack memory. r13 is called the stack pointer (SP).

PUSH {r1, r3-r5, r14}

(Diagram: memory BEFORE PUSH, with r13 at the high end; memory AFTER PUSH, holding r14, r5, r4, r3, r1 from high to low addresses, with r13 moved down below them.)

Nested Subroutines using stack

SUB1    ...
        BL SUB2

SUB2    ...
        BL SUB3

SUB3    ...
        X

Stack memory (downwards growing): SUB1 data, then SUB2 data, then SUB3 data, then the stack pointer at X, then empty space.
When executing "X" the nested subroutines SUB1, SUB2, SUB3 are all active.

PUSH R14 onto stack: method 1

mem32[R13] := R14
R13 := R13-4

STR R14, [R13], #-4

Would need one STR instruction for each item...

Before PUSH: stored items at &134C and &1348; R13 = &1344 (empty); &1340 empty.
After PUSH: stored items at &134C and &1348; stored R14 at &1344; R13 = &1340.


PUSHing onto a Stack: multiple registers

ARM has a single instruction which transfers multiple registers to a stack and implements PUSH this way:
STMED r13!, {r1, r3-r5, r14} ; Push r1, r3-r5, r14 onto stack
                             ; Stack grows down in mem
                             ; r13 points to next empty loc.

Note the following properties of this ARM PUSH operation:
r13 is used as the address pointer. We call this the STACK POINTER (SP). We could have used any other register (except r15) as SP, but it is good practice to use r13 unless there is a good reason not to do so.
This stack grows down through decreasing memory addresses, and the base register points to the first empty location of the stack. To store values in memory, the SP is decremented after it is used.

STMED vs STR

These two instructions look different but do the same thing with one register.
STMED can be used with any number of registers.
STMED is conventionally used for stacks even when only a single transfer is needed.

STMED R13!, {R14}    ; stack pointer first, then list of one or more data registers;
                     ; offset is calculated and added after operation
STR R14, [R13], #-4  ; data register first, then stack pointer;
                     ; offset is explicitly written and added to SP after operation


POP operation

The complementary operation of PUSH is the POP operation.

POP {r1, r3-r5, r14}

(Diagram: memory BEFORE POP holds r14, r5, r4, r3, r1 from high to low addresses with r13 just below them; AFTER POP r13 is back at the high end and the stored copies are no longer part of the stack.)

This is equivalent to the ARM instruction:
LDMED r13!, {r1, r3-r5, r14} ; Pop r1, r3-r5, r14 from stack

Multiple Stack Operations

A stack operates as a Last In First Out memory:
PUSH A             A stored
PUSH B             B, A stored
PUSH C             C, B, A stored
POP (returns C)    B, A stored
POP (returns B)    A stored
POP (returns A)    empty

Stack implements a Last-In-First-Out (LIFO) memory.

Nested subroutines will each PUSH and then POP their registers at the same level (all PUSHes & POPs from subroutine calls will balance) so this will work.
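A small Python model of an empty descending stack shows how PUSH (STMED) and POP (LDMED) mirror each other. It is a sketch only: the class name is hypothetical, memory is a dictionary keyed by word address, and registers are a dictionary keyed by register number.

```python
class EmptyDescendingStack:
    """SP points at the next empty word; the stack grows to lower addresses."""

    def __init__(self, sp):
        self.sp = sp
        self.mem = {}

    def push(self, regs, reglist):
        """Model of STMED r13!, {reglist}."""
        for r in sorted(reglist, reverse=True):   # highest register at highest address
            self.mem[self.sp] = regs[r]           # store at the empty location...
            self.sp -= 4                          # ...then move SP down

    def pop(self, regs, reglist):
        """Model of LDMED r13!, {reglist}."""
        for r in sorted(reglist):                 # lowest register is at lowest address
            self.sp += 4                          # move SP up to the last stored word...
            regs[r] = self.mem.pop(self.sp)       # ...and load it

stack = EmptyDescendingStack(0x1344)
regs = {1: 0xA, 3: 0xB, 4: 0xC, 5: 0xD, 14: 0xE}
stack.push(regs, [1, 3, 4, 5, 14])
sp_after_push = stack.sp
top_word, bottom_word = stack.mem[0x1344], stack.mem[0x1334]
saved = dict(regs)
regs.update({r: 0 for r in regs})                 # the subroutine body clobbers everything
stack.pop(regs, [1, 3, 4, 5, 14])
```

Because every push is undone by a matching pop, SP returns to its starting value, which is what lets nested subroutines share the same area of memory.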



Preserve things inside subroutine with STACK

        BL SUB1
        ..
SUB1    STMED r13!, {r0-r2, r14}    ; push work & link registers
        .
        BL SUB2                     ; jump to a nested subroutine
        LDMED r13!, {r0-r2, r14}    ; pop work & link registers
        MOV pc, r14                 ; return to calling program

(Diagram: on entry to SUB1, STMED r13!, {r0-r2, r14} stores r14, r2, r1, r0 from high to low addresses and SP moves down from r13 to r13'. On return from SUB1, LDMED r13!, {r0-r2, r14} reloads them and SP moves back up to r13.)

        ; Input: r0
        ; Output: r1=1 if odd parity (xor of all 32 bits), otherwise 0
        ; preserves value of r2 on stack
        STMED r13!, {r2}            ; save registers, why not r1?
        MOV r2, #31
        MOV r1, #0
LOOP    EOR r1, r0, r1, ror #1
        SUBS r2, r2, #1
        BPL LOOP                    ; loop 32 times
        AND r1, r1, #1
        LDMED r13!, {r2}            ; restore registers
        MOV pc, r14                 ; return to caller
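The parity loop is compact enough that it is worth tracing in Python. The sketch below mirrors the instructions one for one (function names are hypothetical); after 32 iterations r1 holds the XOR of all 32 rotations of r0, so every bit of r1, including bit 0, is the XOR of all the bits of r0.

```python
MASK = 0xFFFFFFFF

def ror(x, n):
    """Rotate a 32 bit word right by n places (n in 1..31 here)."""
    return ((x >> n) | (x << (32 - n))) & MASK

def parity(r0):
    """Return 1 if r0 has an odd number of 1 bits, else 0."""
    r0 &= MASK
    r2 = 31                      # MOV r2, #31
    r1 = 0                       # MOV r1, #0
    while True:
        r1 = r0 ^ ror(r1, 1)     # EOR r1, r0, r1, ror #1
        r2 -= 1                  # SUBS r2, r2, #1
        if r2 < 0:               # BPL LOOP: exit once the result goes negative
            break
    return r1 & 1                # AND r1, r1, #1
```

Starting the counter at 31 and looping with BPL gives exactly 32 passes, because the pass that takes r2 from 0 to -1 is still executed.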


Optimising subroutine entry/exit

The usual case is for a subroutine which calls other subroutines, and so which saves and restores registers including R14, the return address. In this case the subroutine exit can be optimised by restoring r14 directly to the PC, r15.
Note that it is important NOT to include both r14 & r15 in the LDMED register list - which would be one too many POPs!

Unoptimised:
        STMED r13!, {r0,r1,r2,r14}
        .
        LDMED r13!, {r0,r1,r2,r14}
        MOV pc, r14                 ; return to caller

Optimised:
        STMED r13!, {r0,r1,r2,r14}
        .
        LDMED r13!, {r0,r1,r2,pc}   ; return to caller

(Diagram: the stack holds r14, r2, r1, r0 from high to low addresses.)

Effect on stack of subroutine nesting

SUBX (1) calls SUBY (2). The arrangement of storage on the stack when inside SUBY is as follows.

SUBX    STMED r13!, {R14}
        BL SUBY
        .......
        LDMED r13!, {pc}

SUBY    STMED r13!, {r0,r1,r2}
        .....
        LDMED r13!, {r0,r1,r2}
        MOV pc, r14

Stack (downwards growing):
Base of stack is highest location
Rest of stack                                    <- Stack pointer before SUBX
Stack frame (1) SUBX: SUBX caller return address <- Stack pointer inside SUBX
Stack frame (2) SUBY: r2, r1, r0                 <- Stack pointer inside SUBY
Top of stack is SP+4 (lowest location)

ARM PUSH instructions

STMED implements a Descending stack, with SP pointing to an Empty location.
Stacks can be Ascending or Descending.
SP can point to a Full location (last item PUSHED) or an Empty location (first space available to PUSH next item).
STMED - Empty location, Descending stack
STMEA - Empty location, Ascending stack
STMFD - Full location, Descending stack
STMFA - Full location, Ascending stack
LDMED (pop) matches STMED (push) etc.

(Diagram: on entry to SUB1, STMED r13!, {r0-r2, r14} stores r14, r2, r1, r0 from high to low addresses and SP moves down from r13 to r13'.)

Other uses of LDM/STM

LDM, STM can work with any register being SP, not just R13.
Can move a block of memory by setting up SP1, SP2, then POP from SP1, PUSH to SP2.
Faster than a loop with LDR/STR.

The 4 types of stack POP & PUSH have different mnemonics (for convenience) when used for general data movement like this. It does not matter which mnemonic you use: LDMED & LDMIB are the same instruction.


Alternative names for LDM instructions!

Example of using Load/Store Multiple

Here is an example to move 8 words from a source memory location to a destination memory location:
ADR r0, src_addr        ; initialize src addr
ADR r1, dest_addr       ; initialize dest addr
LDMIA r0!, {r2-r9}      ; fetch 8 words from mem, r0 := r0+32
STMIA r1!, {r2-r9}      ; copy 8 words to mem, r1 := r1+32

When using LDMIA and STMIA instructions, you INCREMENT the address in memory to load/store your data, and the increment of the address occurs AFTER the address is used.

In fact, one could use 4 different forms of load/store:
Increment - After     LDMIA and STMIA
Increment - Before    LDMIB and STMIB
Decrement - After     LDMDA and STMDA
Decrement - Before    LDMDB and STMDB
(see next slide)


The four variations of the STM instruction

Higher register numbers are stored or loaded to/from higher addresses, always.

Optional update of base address register with Load/Store Multiple Instructions

So far the base address register, r1 below, has always been updated. You can choose NOT to update this pointer register by removing the "!". All variants of LDM/STM have optional base register update.

LDMIA r1, {r2-r9}       ; r2 := mem32[r1]
                        ; .
                        ; r9 := mem32[r1+28]

LDMIA r1!, {r2-r9}      ; r2 := mem32[r1]
                        ; .
                        ; r9 := mem32[r1+28]
                        ; r1 := r1 + 32 (8 registers)
                        ; "!" indicates r1 is changed
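The address arithmetic for the four LDM/STM forms, and the effect of "!", can be sketched in Python. The names transfer_addresses and ldm are hypothetical; memory is a dictionary keyed by word address, and the base register is assumed not to appear in the register list.

```python
def transfer_addresses(base, nregs, mode):
    """Return (addresses used, lowest first; updated base) for IA, IB, DA or DB."""
    size = 4 * nregs
    start = {
        "IA": base,               # increment after
        "IB": base + 4,           # increment before
        "DA": base - size + 4,    # decrement after
        "DB": base - size,        # decrement before
    }[mode]
    new_base = base + size if mode[0] == "I" else base - size
    return [start + 4 * i for i in range(nregs)], new_base

def ldm(mem, regs, base_reg, reglist, mode="IA", writeback=False):
    """Load the listed registers; lowest register number from lowest address."""
    addresses, new_base = transfer_addresses(regs[base_reg], len(reglist), mode)
    for reg, address in zip(sorted(reglist), addresses):
        regs[reg] = mem[address]
    if writeback:                 # the "!" in LDMIA r1!, {...}
        regs[base_reg] = new_base

mem = {0x100 + 4 * i: 10 * i for i in range(8)}
regs = {1: 0x100}
ldm(mem, regs, 1, range(2, 10))                   # LDMIA r1, {r2-r9}
base_without_update = regs[1]
ldm(mem, regs, 1, range(2, 10), writeback=True)   # LDMIA r1!, {r2-r9}
base_with_update = regs[1]
```

In this model the DA form with one register uses the same address and base update as STR R14, [R13], #-4, and the IB form retraces the addresses that DA used, which matches the pairing of STMED with LDMED.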



Multiple register transfer instructions

Register list has one bit per register: bit 0 = 1 => load/store r0; bit 1 = 1 => load/store r1; etc.
STMIA r13!, {r0-r2, r14} assembles to &E8AD4007

Lecture 10: Miscellaneous

Multiplication
Overview of machine instructions
Machine instruction timing

ARM Multiply instructions

The original ARM 1 architecture did not have multiply instructions.
32X32->32 bit (least significant 32 bits of result kept) was added for ARM 3 and above.
32X32->64 multiplication was added for ARM7DM and above.
The multiplications were shoe-horned into the data processing instructions, using bit combinations specifying shifts that were previously unused and illegal.

Multiply in detail

MUL, MLA were the original (32 bit LSW result) instructions.
Why does it not matter whether they are signed or unsigned?
Register operands only. No constants, no shifts.
NB d & m must be different for MUL, MLA.
Later architectures added 64 bit results.

ARM3 and above:
MUL rd, rm, rs            multiply (32 bit)        Rd := (Rm*Rs)[31:0]
MLA rd, rm, rs, rn        multiply-acc (32 bit)    Rd := (Rm*Rs)[31:0] + Rn

ARM7DM core and above (64 bit multiply result):
UMULL rl, rh, rm, rs      unsigned multiply        (Rh:Rl) := Rm*Rs
UMLAL rl, rh, rm, rs      unsigned multiply-acc    (Rh:Rl) := (Rh:Rl)+Rm*Rs
SMULL rl, rh, rm, rs      signed multiply          (Rh:Rl) := Rm*Rs
SMLAL rl, rh, rm, rs      signed multiply-acc      (Rh:Rl) := (Rh:Rl)+Rm*Rs
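The difference between the 32 bit and 64 bit multiplies can be illustrated with a Python model. This is a sketch with hypothetical function names; results are returned as (low word, high word) pairs, and all arithmetic is reduced modulo 2^32 or 2^64 as appropriate.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed(x):
    """Interpret a 32 bit word as a two's complement number."""
    return x - (1 << 32) if x & 0x80000000 else x

def mul(rm, rs):
    """MUL: only the low 32 bits of the product are kept."""
    return (rm * rs) & MASK32

def umull(rm, rs):
    """UMULL: full 64 bit unsigned product as (low, high)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32

def smull(rm, rs):
    """SMULL: full 64 bit signed product as (low, high)."""
    product = (to_signed(rm & MASK32) * to_signed(rs & MASK32)) & MASK64
    return product & MASK32, product >> 32

def smlal(lo, hi, rm, rs):
    """SMLAL: add the signed product to the 64 bit accumulator hi:lo."""
    acc = ((hi << 32) | lo) + to_signed(rm & MASK32) * to_signed(rs & MASK32)
    acc &= MASK64
    return acc & MASK32, acc >> 32

# A three element signed scalar product accumulated with smlal.
lo, hi = 0, 0
for x, y in zip((1, -2, 3), (4, 5, -6)):
    lo, hi = smlal(lo, hi, x & MASK32, y & MASK32)
```

For &FFFFFFFF times 2, umull and smull agree on the low word (&FFFFFFFE) and differ only in the high word, which is why MUL and MLA do not need separate signed and unsigned versions.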



Example of using ARM Multiplier

This calculates a 64 bit scalar product of two signed vectors, each 20 words long:
z = sum over j = 0..19 of xj * yj
r8 and r9 point to the two vectors xj and yj
r11 is the loop counter
r7:r6 stores the result

        MOV r11, #20            ; initialize loop counter
        MOV r7, #0              ; initialize 64 bit total
        MOV r6, #0
LOOP    LDR r0, [r8], #4        ; get x component
        LDR r1, [r9], #4        ; . and y component
        SMLAL r6, r7, r0, r1    ; accumulate product
        SUBS r11, r11, #1       ; decrement loop counter
        BNE LOOP                ; loop 20 times

ARM Machine Instruction Overview (1)

Data processing (ADD,SUB,CMP,MOV)
cond | 0 0 0 | Op (ALU operation) | Rn | Rd | Shift | Rm      Rd := Rn Op Rm*
cond | 0 0 1 | Op (ALU operation) | Rn | Rd | S               Rd := Rn Op S
Rm* = Rm with optional shift
Multiply instructions are a special case.

Data transfer (to or from memory: LDR, STR)
cond | 0 1 0 | Trans (byte/word, load/store, etc) | Rn | Rd | S               Rd <-> mem[Rn+S]
cond | 0 1 0 | Trans (byte/word, load/store, etc) | Rn | Rd | Shift | Rm      Rd <-> mem[Rn+Rm*]

Multiple register transfer
cond | 1 0 0 | Type | Rn | Register list
Transfer registers to/from stack
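The multiple register transfer format can be made concrete by assembling one instruction by hand. The Python sketch below uses a hypothetical function name; the split of the Type field into P (bit 24), U (bit 23), W (bit 21) and L (bit 20) follows the standard ARM encoding, and the check value &E8AD4007 for STMIA r13!, {r0-r2, r14} is the one quoted in these notes.

```python
def encode_block_transfer(load, mode, rn, reglist, writeback, cond=0xE):
    """Encode LDM/STM. mode is IA, IB, DA or DB; cond defaults to 'always'."""
    p, u = {"IA": (0, 1), "IB": (1, 1), "DA": (0, 0), "DB": (1, 0)}[mode]
    register_bits = 0
    for r in reglist:
        register_bits |= 1 << r           # one bit per register, bit n = rn
    return ((cond << 28)
            | (0b100 << 25)               # multiple register transfer class
            | (p << 24)                   # P: before (1) or after (0)
            | (u << 23)                   # U: increment (1) or decrement (0)
            | (int(writeback) << 21)      # W: base register update ("!")
            | (int(load) << 20)           # L: load (1) or store (0)
            | (rn << 16)
            | register_bits)

stmia = encode_block_transfer(load=False, mode="IA", rn=13,
                              reglist=[0, 1, 2, 14], writeback=True)
```

The low 16 bits, &4007, are the register list: bits 0, 1, 2 and 14 set.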



Overview (2)

Branch: B, BL, BNE, BMI
cond | 1 0 1 | L | S
L = 0 => Branch, B ...
L = 1 => Branch and link (R14 := PC), BL ...
PC := PC+S

Coprocessor interface
cond | 1 1 0 0
cond | 1 1 0 1
cond | 1 1 1 0

Software Interrupt (SWI)
cond | 1 1 1 1 | S
Simulate hardware interrupt: S is passed to handler

ARM Instruction Timing

Exact instruction timing is very complex and depends in general on memory cycle times which are system dependent. The table below gives an approximate guide.

Instruction                                                     Typical execution time (cycles)
Any instruction, with condition false                           1
data processing (all except register-valued shifts)             1 (+3 if PC is dest)
data processing (register-valued shifts): MOV R1, R2, lsl R3    2 (+3 if PC is dest)
LDR, LDRB, STR, STRB                                            4
LDM (n registers)                                               n+3 (+3 more if PC is loaded)
STM (n registers)                                               n+3
B, BL                                                           4
Multiply                                                        7-14

Instruction Timing Notes

Most instructions take 1 cycle - RISC
Memory reference takes longer (4 cycles typically)
Branch takes longer (4 cycles)
Writing to PC => branch

ALL instructions take 1 cycle if not executed (condition false)
"Register-valued shift" is a special case: 2 cycles
Make sure you know what a register-valued shift is!

Multiply takes a lot longer, though exact timing depends on data and also on ARM core - later cores have more efficient hardware multiply.
Instruction timing is hardware-dependent. It is not part of the Instruction Set Architecture.
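As a rough illustration of how the approximate cycle counts combine, the Python sketch below estimates the run time of the 20 element scalar product loop (three MOVs, then LDR, LDR, SMLAL, SUBS, BNE per pass). The function name is hypothetical and the figures are only the typical values quoted in the timing table, so the result is an estimate rather than a measurement.

```python
def scalar_product_cycles(n, multiply_cycles):
    """Estimate cycles for n passes of the LDR/LDR/SMLAL/SUBS/BNE loop."""
    setup = 3 * 1                              # three MOV instructions, 1 cycle each
    body = 4 + 4 + multiply_cycles + 1         # LDR + LDR + SMLAL + SUBS
    branches = (n - 1) * 4 + 1                 # BNE taken n-1 times, then not executed once
    return setup + n * body + branches

best_case = scalar_product_cycles(20, 7)       # multiply at its quickest
worst_case = scalar_product_cycles(20, 14)     # multiply at its slowest
```

Under these assumptions the two LDRs and the taken branch account for 12 cycles per pass, so memory access and branching cost about as much as the multiply itself.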