Arm Instruction 2 - 001
Part 2 - Contents
Lecture 5 - ARM data processing
ARM data processing instructions in detail
ARM status flags and tests
EDSAC simulator (Written by Martin Campbell-Kelly, Univ Warwick) reproduces original EDSAC control panel. EDSAC was the first computer to be programmed in an assembly language. The assembler was 41 instructions long!
tjwc - 2-Dec-10
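[Figure: operand 2 (op2) of a data processing instruction is either an 8-bit immediate value C (Rd := Rn Op C) or a 4-bit register field Rm (Rd := Rn Op Rm)]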
Arithmetic is the most complex data processing operation at an assembly language level. ARM implements 32 bit addition and subtraction. Longer calculations must make appropriate use of carries. We will look at:
ARM data processing arithmetic & logical instructions
Use of immediate operands in data processing instructions
Simple examples
[Figure: ALU operation]
S bit = 1 => status bits are written
S bit = 0 => status bits unchanged
The first operand, op1, is always register Rn.
The second operand, op2, is either a constant C or register Rm.
This lecture: assume Shift=0, Rot=0, for unshifted Rm or immediate constant C.
Example
Instruction word fields: cond 0 0 0 Op S Rn Rd Shift Rm (here cond = 1110)
Rd := Rn Op Rm
R0 := R1 + R2
Here are the move and arithmetic data processing instructions.
The operations with Carry allow multi-word addition and subtraction.
MOV and MVN do not use Rn; Rn should be set to 0 in the instruction word.
tjwc - 2-Dec-10 ISE1/EE2 Introduction to Computer Architecture 2.5
Op = 0100 (ADD)
Cond = 1110 (always)
Rd = 0 (R0)
Rn = 1 (R1)
Rm = Op2 = 2 (R2)
S = 0 (don't write status bits)
With an assembler you don't need to worry about the precise bit format above:
ADD R0, R1, R2
Example:
SUB r0, r1, r2 ; r0 := r1 - r2
Works for both unsigned and 2's complement signed.
Note that source registers are unchanged (unless dest = source)
RSB stands for reverse subtraction.
Operands & result may be interpreted as unsigned or 2's complement signed integers.
'C' is the carry (C) status bit in the CPSR.
Subtraction - carry acts as "not borrow" - 0 or 1 - hence the C-1 term.
ADC, SBC, and RSC are used to operate on data more than 32 bits long in 32-bit chunks: see next slide.
RSB, RSC are useful, instead of SUB, SBC with r1, r2 reversed, because r2 can be any of the Op2 variants, see later.
SBC, RSC
Some of you are probably wondering: why (C - 1) in subtraction? The negation needed by subtraction is implemented in hardware by a bitwise NOT function and addition with C=1. Thus C=0 has the effect of -1, and C=1 is a normal subtract. The first (LSW) subtract of a multi-word subtraction must have carry set; this is the default carry used in the SUB, RSB instructions.
Normally the LS word of a multi-word add or subtract will use ADDS or SUBS, all others will use ADCS or SBCS.
Note the S suffix means S=1: write Status bits (condition codes). It can be added to any DP assembler mnemonic except comparisons.
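The NOT-plus-carry view of subtraction can be checked with a small Python model. This is only a sketch of the arithmetic, not ARM code: the function names (`add_with_carry`, `subs`, `sbcs`, `add64`, `sub64`) are made up for illustration, and registers are modelled as plain 32-bit integers.

```python
MASK32 = 0xFFFFFFFF

def add_with_carry(a, b, carry_in):
    """Return (32-bit result, carry out) of a + b + carry_in."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def adds(rn, op2):
    # ADDS: plain addition, carry in is 0
    return add_with_carry(rn, op2, 0)

def adcs(rn, op2, c):
    # ADCS: carry in is the C flag from the previous word
    return add_with_carry(rn, op2, c)

def subs(rn, op2):
    # SUBS: Rn + NOT(Op2) + 1, i.e. the default carry is 1
    return add_with_carry(rn, ~op2 & MASK32, 1)

def sbcs(rn, op2, c):
    # SBCS: Rn + NOT(Op2) + C, which equals Rn - Op2 + C - 1
    return add_with_carry(rn, ~op2 & MASK32, c)

def add64(a, b):
    """64-bit add: ADDS on the low words, ADCS on the high words."""
    lo, c = adds(a & MASK32, b & MASK32)
    hi, _ = adcs(a >> 32, b >> 32, c)
    return (hi << 32) | lo

def sub64(a, b):
    """64-bit subtract: SUBS on the low words, SBCS on the high words."""
    lo, c = subs(a & MASK32, b & MASK32)
    hi, _ = sbcs(a >> 32, b >> 32, c)
    return (hi << 32) | lo
```

For instance, `subs(3, 5)` leaves C = 0 (a borrow occurred), and that 0 is exactly what makes the following `sbcs` subtract one extra from the next word.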
[Figure: 96-bit addition in three 32-bit chunks: ADDS on bits 31:0, ADCS on bits 63:32, ADCS on bits 95:64]
S at the end of an instruction means you want to write the C, V, N, and Z status bits. In this case the C flag is needed. Similarly, if we wanted to subtract the two numbers:
SUBS SBCS
Operand 2
Data processing instructions have a 3 operand format: Rd := Rn op op2
First operand (Rn) - always a register
Second operand (Op2) can be:
An immediate (literal) value in range 0-255
A register Rm
Special case of data processing where one register is not used, but other options (shifted r2 etc, see later) still apply.
MVN stands for 'move negated': bitwise NOT.
This is not two's complement negate - no addition of 1!
r2: 0101 0011 1010 1111 1101 1010 0110 1011
r0: 1010 1100 0101 0000 0010 0101 1001 0100
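Examples
Instruction word fields (width in bits): cond (4), 0 0 1, Op (4), S (1), Rn (4), Rd (4), Rot (4), C (8)
Rd := Rn Op C
1110 001 0010 0 1111 0011 0000 01100100
cond = 1110 (always), Op = 0010 (SUB), S = 0 (do not write status bits), Rn = 1111 (R15), Rd = 0011 (R3), Rot = 0000, C = 01100100 (#100)
R3 := R15 - 100
SUB r3, r15, #100
ADD r3, r15, #-100 ; the "ADD with negative constant" is turned into the equivalent SUB automatically by the assembler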
Examples
1110 001 0110 1 0100 0001 0000 00000011
cond = 1110 (always), Op = 0110 (SBC), S = 1 (write status bits N,Z,C,V), Rn = 0100 (R4), Rd = 0001 (R1), Rot = 0000, C = 00000011 (#3)
R1 := R4 - 4 + C
SBCS r1, r4, #3
ADCS r1, r4, #-4 ; the ADC is turned into the equivalent SBC automatically

1110 001 1111 1 0000 0001 0000 00000000
cond = 1110 (always), Op = 1111 (MVN), S = 1 (write status bits N,Z; why not C,V?), Rn = 0000 (not used), Rd = 0001 (R1), Rot = 0000, C = 00000000 (#0)
R1 := -1
MVNS r1, #0
MOVS r1, #-1 ; assembled as the MVNS above
Note that MVN negates bits, not two's complement negation.
S = 1 => N,Z status bits are written. C,V status bits are only written on arithmetic operations.
64-bit checksum
A checksum is often calculated to check that data has not been corrupted.
$C = \sum_i d_i$
In this example 8K bytes of data is stored in memory in a buffer pointed to by r2. Each 8 contiguous bytes (2 words) are interpreted as a 64 bit number $d_i$.
[Figure: buffer layout as pairs of 32 bit words: ([R2+4], [R2]), ([R2+12], [R2+8]), ([R2+20], [R2+16])]
CHECKSUM64 MOV  r3, #0        ; bits 31:0 of sum
           MOV  r4, #0        ; bits 63:32 of sum
           MOV  r6, #1024     ; set up loop counter
LOOP       LDR  r0, [r2]      ; load 31:0 of next 64 bit word
           ADD  r2, r2, #4    ; move r2 to MSW word
           LDR  r1, [r2]      ; load 63:32 of it
           ADD  r2, r2, #4    ; move r2 to next 64 bit word
           ADDS r3, r3, r0    ; 31:0 of 64 bit addition, set C
           ADC  r4, r4, r1    ; add bits 63:32, with C
           SUBS r6, r6, #1    ; decr counter, set status bits on result
           BNE  LOOP          ; if counter is not 0 add next 64 bits
Add 64 bit numbers (assume words are ordered so that LSW has lowest address)
r2 -> current word
r3, r4 -> 64 bit sum
r6 -> count of 64 bit words, down to 0
Auto-increment memory load (discussed later) would make the code much more efficient.
Note that the 64 bit result can overflow, because the carry out of the MSW addition is discarded.
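The checksum loop can be mirrored in Python to see where the carry goes. This is a sketch under stated assumptions: the buffer is passed as a list of 32-bit words with the LSW of each 64-bit number first, and `checksum64` is a hypothetical helper name, not part of the course code.

```python
MASK32 = 0xFFFFFFFF

def checksum64(words):
    """Sum pairs of 32-bit words as 64-bit numbers (LSW first), modulo 2**64."""
    r3 = 0  # bits 31:0 of sum
    r4 = 0  # bits 63:32 of sum
    for i in range(0, len(words), 2):
        r0 = words[i]          # LDR r0, [r2]   (low word)
        r1 = words[i + 1]      # LDR r1, [r2]   (high word, after r2 += 4)
        total = r3 + r0        # ADDS r3, r3, r0
        r3 = total & MASK32
        carry = total >> 32    # C flag written by ADDS
        r4 = (r4 + r1 + carry) & MASK32  # ADC r4, r4, r1; carry out is dropped
    return (r4 << 32) | r3
```

The final carry out of the high-word addition is thrown away, so the result wraps modulo 2^64, which is usually acceptable for a checksum.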
$\text{number} = s \times m \times 2^{e}$ (s = sign, m = mantissa, e = exponent)
For example:
In general, the binary point can be in the middle of the word (or off the end!). This is FIXED POINT representation of fractional numbers
[Figure: N-bit word, bits N-1 down to 0, holding sign S, exponent and mantissa, with the binary point marked]
10.5 in binary: 1010.1(2)
Move binary point 3 places to left: 1.0101(2) x 2^3
10.5 = 1.3125 x 8
Thus by choosing the correct exponent any number can be represented as a fixed point binary number multiplied by a power of two. Equivalently, the binary point is "floating".
Fixed point arithmetic requires no extra hardware: the binary point is in the mind of the programmer, like signed/unsigned.
IEEE-754 example
[Figure: IEEE-754 single precision layout: sign bit, 8-bit exp, 23-bit frac; in this example exp = 128]
The number above, C0600000(16), must have negative sign.
Exponent = exp - 127 = 1
mantissa = 1 + 0.11(2) = 1.11(2)
value = -2^1 x 1.11(2) = -11.1(2) = -3.5
Note leading 1.0 is always added to frac.
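The same decoding can be done mechanically. The sketch below is a Python illustration with a hypothetical helper `decode_single`; it assumes a normalised number (exp not 0 and not 255) and cross-checks the hand calculation against Python's own `struct` conversion.

```python
import struct

def decode_single(bits):
    """Split a 32-bit IEEE-754 single into fields (normalised numbers only)."""
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    mantissa = 1 + frac / 2**23          # the leading 1.0 is implicit
    value = (-1) ** sign * mantissa * 2.0 ** (exp - 127)
    return sign, exp, mantissa, value

def hardware_value(bits):
    """Let the host interpret the same 32 bits as a float, for comparison."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]
```

Calling `decode_single(0xC0600000)` gives sign 1, exp 128, mantissa 1.75 and value -3.5, matching the slide.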
Floating point is typically handled by Floating Point coprocessor (FPU) separate from the CPU. ARM architecture has FPUs, see latest ARM datasheets for more details. We will not consider FPU instructions in this course.
[Figure: TABLE1 and TABLE2 in memory, with labelled addresses TABLE1+192, TABLE1+196, TABLE2+192, TABLE2+196]
Next we look at how the reads and writes can be made to variable locations (like an access to an array with a variable as index a[i]), so that a loop can be used with a single read and write to copy all 50 words.
LDR r0, [r1] ; r0 := mem32[r1]
STR r0, [r1] ; mem32[r1] := r0
This is called register-indirect addressing (AKA indexed). Here r1 is a memory pointer (AKA index register).
LDR r0, [r1] ; this is a word transfer, r1 must be a word address (divisible by 4)
[Figure: r1 pointing into memory at TABLE1 / TABLE2, with the transferred word in r0]
How does the ADR directive work?
An address is 32-bit, and it is difficult to put a 32-bit address value in a register in the first place (constants are 8 bit).
Solution: the Program Counter PC (r15) is often close to the required value.
ADR r1, TABLE1 is translated into a data processing instruction that adds or subtracts a constant to PC (r15), and puts the result in r1.
This constant is known as a PC-relative offset, and it is calculated as: addr_of_TABLE1 - (PC_value + 8)
(+8 is because of hardware pipelining, see Part 3)
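The offset calculation can be sketched in Python. The helper name `adr_as_data_processing` and the example addresses are hypothetical, and the sketch does not check whether the constant is actually encodable as an ARM immediate.

```python
def adr_as_data_processing(instr_addr, target_addr, rd="r1"):
    """Show the ADD/SUB that an ADR at instr_addr could be translated into.

    instr_addr is the address of the ADR instruction itself; reading r15
    there yields instr_addr + 8 because of pipelining.
    """
    offset = target_addr - (instr_addr + 8)
    if offset >= 0:
        return f"ADD {rd}, pc, #{offset}"
    return f"SUB {rd}, pc, #{-offset}"
```

With an ADR at address 0x8000 and a table at 0x8090 (made-up addresses), the offset is 0x8090 - 0x8008 = 136, so the result is an ADD; a table located before the instruction gives a SUB instead.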
LDR r0, [r1, #4] ; r0 := mem32[r1 + 4] (offset = 4, effective address = r1 + 4)
Base+offset addressing does not change the base register (r1 & r2 here). Sometimes it is useful to modify the base register to point to the new address. This is achieved by adding a '!', and is base + offset addressing with auto-indexing:
LDR r0, [r1, #4]! ; r0 := mem32[r1 + 4]
                  ; r1 := r1 + 4
The '!' indicates that the instruction should update the base register after the data transfer.
One instruction changes two registers.
Useful in loops.
This is called post-indexed addressing - the base address is used without an offset as the transfer address, after which it is always modified. Using this, we can write the copy program as a loop:
copy  ADR  r1, TABLE1      ; r1 points to TABLE1
      ADR  r2, TABLE2      ; r2 points to TABLE2
      MOV  r3, #50         ; r3 counts no words copied
loop  LDR  r0, [r1], #4    ; get TABLE1 1st word ...
      STR  r0, [r2], #4    ; copy it to TABLE2
                           ; ... r1, r2 are updated afterwards
      SUBS r3, r3, #1      ; decrement & set flags
      BNE  loop            ; loop if not finished
The second (index) register can have an optional shift - useful in this case so that it can count words (bytes*4) directly.
In principle any of the shift modes (lsl, asl, asr, rrx) described in the next lecture can be used.
lsl #n used here multiplies by a scale factor of 2^n.
copy  ADR  r1, TABLE1            ; r1 points to TABLE1
      ADR  r2, TABLE2            ; r2 points to TABLE2
      MOV  r3, #0
loop  LDR  r0, [r1, r3, lsl #2]  ; get TABLE1 1st word ...
      STR  r0, [r2, r3, lsl #2]  ; copy it to TABLE2
      ADD  r3, r3, #1            ; move to next word
      CMP  r3, #50               ; if more, go back to loop
      BNE  loop                  ; if r3 != 50
TABLE1 ...                       ; < source of data >
Because value of R15 is known this is effectively direct addressing, in limited range close to PC
It does not use a normal base register so can't be used for auto-increment modes etc which would change PC
[Figure: LDR/STR instruction format: Rd <-> mem[Rn+S] or Rd <-> mem[Rn+Rm*]]
P = 1 => use indexed or offset address [Rn+Rm], [Rn+S]
U = 1 => add offset
B = 1 => byte
W = 1 => write indexed or offset address back into Rn (if P=1)
L = 1 => load
NB - if P=0, W=0
In the ARM ISA "jumps", which change the value of PC, are called "branches".
The ARM ISA has a unique and clever way of dealing with conditional branches.
Instead of having special instructions, ALL instructions are given an execution condition which determines whether they are executed, or ignored.
The condition is the top 4 bits of the instruction word.
The always true condition is used with most instructions to make their execution unconditional.
LDR r0, [r1]                  ; register-indirect addressing
LDR r0, [r1, #offset]         ; base+offset addressing
LDR r0, [r1, #offset]!        ; base+offset, auto-indexing
LDR r0, [r1], #offset         ; post-indexed, auto-indexing
LDR r0, [r1, r2]              ; register-indexed addressing
LDR r0, [r1, r2, lsl #shift]  ; scaled register-indexed addressing
LDR r0, address_label         ; PC relative addressing (pseudo instruction)
ADR r0, address_label         ; load PC relative address (pseudo instruction)
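The difference between the offset, auto-indexed and post-indexed forms is easiest to see as a pair (transfer address, new base value). The following Python sketch models that; `effective_address` and `copy_words` are hypothetical names, memory is a plain dictionary keyed by byte address, and a register-indexed offset would simply be passed in as `rm << shift`.

```python
def effective_address(mode, rn, offset=0):
    """Return (transfer address, base register value after the instruction)."""
    if mode == "offset":   # [rn, #offset]   base unchanged
        return rn + offset, rn
    if mode == "pre":      # [rn, #offset]!  base becomes the transfer address
        return rn + offset, rn + offset
    if mode == "post":     # [rn], #offset   transfer at rn, then base updated
        return rn, rn + offset
    raise ValueError(f"unknown addressing mode: {mode}")

def copy_words(memory, table1, table2, count=50):
    """Model of the post-indexed copy loop: LDR r0,[r1],#4 / STR r0,[r2],#4."""
    r1, r2 = table1, table2
    for _ in range(count):
        addr, r1 = effective_address("post", r1, 4)
        r0 = memory[addr]
        addr, r2 = effective_address("post", r2, 4)
        memory[addr] = r0
    return r1, r2
```

After copying 50 words both pointers have advanced by 200 bytes, one word past the end of each table.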
Branches
The basic branch instruction is:
B label ; unconditionally branch to label
An example
Consider the pseudo-code:
If (a = 1) then c := c+1 else d := d-1
Needs to be implemented using conditional branches, or, as we will see, conditional execution. First step is to assign registers to variables. We assume: a=r0, c=r2, d=r3, and then the problem becomes:
if (r0 = 1) then r2 := r2+1 else r3 := r3-1
Here the CMP instruction is a SUBTRACTION, which gives no result EXCEPT possibly changing status flags in the CPSR. Here we need to know that if r0 - 1 = 0 (that is, r0 = 1), then the Z bit is set (='1'), else the Z bit is reset (='0').
Z controls the following BNE conditional branch instruction
Comparison Operations
Here are ARM's register test operations:
CMP r1, r2 ; set NZCV on (r1 - r2)
CMN r1, r2 ; set NZCV on (r1 + r2)
TST r1, r2 ; set NZ on (r1 and r2)
TEQ r1, r2 ; set NZ on (r1 xor r2)
Results of the subtract, add, and, xor are NOT stored in any registers, so destination register Rd is not used.
Status flags in the CPSR are set or cleared by these instructions (you don't need the S).
N: if MSB of (r1 - r2) is '1' (BMI,BPL)
Z: if (r1 - r2) = 0 (BEQ,BNE)
C: if carry-out of addition is 1 (BCS,BCC)
V: if there is a two's complement overflow (BVS,BVC)
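How the four flags come out of a compare can be modelled in a few lines of Python. This is an illustrative sketch (the name `cmp_flags` is hypothetical); it computes the subtraction as r1 + NOT(r2) + 1, so C = 1 means "no borrow".

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(r1, r2):
    """Return (N, Z, C, V) as set by CMP r1, r2; no register is written."""
    a, b = r1 & MASK32, r2 & MASK32
    total = a + (~b & MASK32) + 1       # subtraction done as an addition
    result = total & MASK32
    n = result >> 31                    # MSB of the result
    z = int(result == 0)                # result is zero
    c = total >> 32                     # carry out of the addition
    # Overflow: operands have different signs and the result's sign
    # differs from the first operand's sign.
    v = ((a ^ b) & (a ^ result)) >> 31
    return n, z, c, v
```

Equal operands give Z = 1 and C = 1; comparing the largest positive value with the most negative one gives N = 1 and V = 1, the overflow case.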
The S-bit
Explicit comparisons are not needed after a SUBS or ADDS:
      MOV  r0, #10     ; initialize loop counter r0
loop  ...              ; start of body of loop
      ...
      SUBS r0, r0, #1  ; decrement loop counter AND set flags
      BNE  loop        ; branch if r0 != 0
SUBS instruction is the same as SUB except that the former updates the NZCV flags in the CPSR.
After a SUBS instruction, the Z-bit is set or cleared depending on the result of the subtraction, so CMP is not needed.
All data processing instructions can have S:
EORS R0,R1,R2
ANDS R0,R3,#0
ADCS R0, R1, R2
Conditional Execution
Conditional execution applies not only to branches, but to all ARM instructions. For example:
        CMP r0, #5      ; if (r0 >= 5) then
        BLO BYPASS
        ADD r1, r1, r0  ; r1 := r1 + r0 - r2
        SUB r1, r1, r2
BYPASS  ...
Op-code suffixes (S for data processing instructions, B for LDR/STR) go after the condition code:
ADDPLS, STRNEB, SBCCSS
        CMP   r0, #5      ; if (r0 >= 5) then
        ADDHS r1, r1, r0  ; r1 := r1 + r0 - r2
        SUBHS r1, r1, r2
Here the ADDHS and SUBHS instructions are executed only if C=1, i.e. the CMP instruction gives R0 >= 5 (unsigned).
Note how if the first comparison finds unequal operands, the second and third instructions are both skipped. Also the logical 'and' in the if clause is implemented by making the second comparison conditional on the first. Conditional execution is normally only efficient if the conditional sequence is three instructions or fewer. If the conditional sequence is longer, use branches.
The more complex cases
GE is a two's complement signed comparison: Greater than or Equal to (r0 >= r1). Two cases:
r0 - r1 is a positive result, no overflow => V=0, N=0
r0 - r1 is a negative result, with overflow => V=1, N=1
Example (8 bit): r0 = 127, r1 = -128
EXACT: 127 - (-128) = +255
8 bit signed interpretation: -1 (so V=1, N=1)
Other conditions:
LT (<) is NOT GE
GT (>) is GE AND NOT EQ
LE (<=) is LT OR EQ
Signed:
GT  GreaTer
GE  Greater or Equal
LE  Less or Equal
LT  Less Than
Unsigned:
HI         HIgher
HS (= CS)  Higher or Same
LS         Lower or Same
LO (= CC)  LOwer
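Each condition is just a Boolean function of the N, Z, C, V flags. The Python table below is a sketch of that mapping (the dictionary and helper names are hypothetical); flags are passed in as 0/1 integers.

```python
# Condition name -> test on the (n, z, c, v) flags
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "HS": lambda n, z, c, v: c == 1,              # same encoding as CS
    "LO": lambda n, z, c, v: c == 0,              # same encoding as CC
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,   # unsigned >
    "LS": lambda n, z, c, v: c == 0 or z == 1,    # unsigned <=
    "GE": lambda n, z, c, v: n == v,              # signed >=
    "LT": lambda n, z, c, v: n != v,              # signed <
    "GT": lambda n, z, c, v: z == 0 and n == v,   # signed >
    "LE": lambda n, z, c, v: z == 1 or n != v,    # signed <=
}

def condition_passes(name, n, z, c, v):
    """True if an instruction with this condition suffix would execute."""
    return CONDITIONS[name](n, z, c, v)
```

For the 127 versus -128 example the flags are N=1, Z=0, C=0, V=1: GE and GT pass (127 is greater as a signed value), while LO also passes because as unsigned bit patterns 0x7F is lower than 0x80.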
Register Shifts
ADD r0, r1, r2, lsl #3
MOV r0, r1, lsr #11
(op2 is shifted before it is used)
The key to manipulating bit fields - contiguous groups of bits - is the use of data shifts. ARM has a large collection of shifts available for the 2nd register operand of a data processing instruction.
Shifts can be combined with arithmetic or bitwise logical operations in one instruction.
LSL: logical shift left by 0 to 31 places; fill the vacated bits at the least significant end of the word with zeros.
x LSL n = x*2^n if no overflow
LSR: logical shift right by 0 to 31 places; fill the vacated bits at the most significant end of the word with zeros.
x LSR n = x/2^n if x is positive (integer division)
x       :  3  2  1  0 -1 -2 -3 -4
x asr 1 :  1  1  0  0 -1 -1 -2 -2
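The table can be reproduced with a short Python model that contrasts the arithmetic and logical right shifts. The helper names are hypothetical; registers are 32-bit unsigned patterns and `to_signed` reinterprets a pattern as two's complement.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def lsr(x, n):
    """Logical shift right: vacated top bits are filled with zeros."""
    return (x & MASK32) >> n

def asr(x, n):
    """Arithmetic shift right: vacated top bits copy the sign bit."""
    return (to_signed(x) >> n) & MASK32
```

Note how ASR rounds towards minus infinity (-3 asr 1 is -2), whereas LSR applied to a negative pattern gives a large positive number.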
Register-valued Shifts
ADD r0, r1, r2, lsl r3 ; shift r2 by value of register r3. 4 regs!
MOV r0, r1, asr r10    ; shift r1 by value of register r10
The number of bits to shift (n previously) can be variable and come from the value in a register, as above.
"Register-valued" shifts take two cycles to execute.
MOV r0, r1, lsl r3     ; if r3 = 4 & r1 = 11 this will set r0 := 11*2^4
This allows variable shifts, for example, to select bit n from a 32 bit register:
MOVS r0, r2, lsr r4    ; r4 contains n
                       ; result (r0) has bit n from r2 aligned with bit 0
[Figure: op2 as an immediate value: 4-bit Rot field and 8-bit constant C; Rd := Rn Op C']
All data processing instructions can have a rotated immediate operand: C' = C rotated right (ROR) by 2r bit positions, where r is the unsigned value of the Rot field.
In general any 8 bit binary field aligned on any even bit position is possible.
NB negative numbers use the alternate instruction, e.g. SUB not ADD.
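Whether a given constant fits the "8 bits rotated right by an even amount" rule can be tested by trying all 16 rotations. The Python sketch below does this; `ror32` and `encode_immediate` are hypothetical helper names, and the function returns the first (smallest Rot) encoding it finds.

```python
MASK32 = 0xFFFFFFFF

def ror32(x, n):
    """Rotate a 32-bit value right by n places."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def encode_immediate(value):
    """Return (rot, c) such that value == ror32(c, 2 * rot), or None."""
    value &= MASK32
    for rot in range(16):
        # Rotating right by (32 - 2*rot) is a left rotate by 2*rot,
        # which undoes the ROR applied by the hardware.
        c = ror32(value, 32 - 2 * rot)
        if c < 256:
            return rot, c
    return None
```

For example 0x07000000 encodes as C = 7 with Rot = 4, while 0x101 spans nine bits and has no encoding, so it would need a different instruction sequence.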
A/D convertor converts input voltage from up to 8 inputs into a digital (unsigned) value.
LPC2138 A/D convertor data register AD0DR:
Memory mapped as 32 bit word, read/write
Read provides the 10 bit conversion output, 3 bit channel output, and other status info
D (Done): 1 when conversion has finished
OV (Overrun): 1 if data from a conversion is not read before another conversion starts
CHN (channel): which of the 8 possible inputs was converted
DATA: 10 bit binary data output (bit 15 is MSB, bit 6 is LSB)
BIC stands for 'bit clear', where every '1' in the second operand clears the corresponding bit in the first:
r1: 0101 0011 1010 1111 1101 1010 0110 1011
r2: 1111 1111 1111 1111 0000 0000 0000 0000
r0: 0000 0000 0000 0000 1101 1010 0110 1011
To extract only the CHN bit field of AD0DR to R0:
LDRL R0, AD0DR          ; get data into register
AND  R0, R0, #&07000000 ; set all unwanted bits to 0
LDRL LABEL is like LDR LABEL, but LABEL can be anywhere in memory.
Shifts can be used to extract bit fields. In a 32 bit word, bits n:m can be extracted and aligned with bit 0 by:
left shift by 31-n
right shift by (31-n)+m
Example, bits 11:7:
11101010100001110001001111110011
00111111001100000000000000000000 (LSL 31-11 = 20)
00000000000000000000000000000111 (LSR 31-11+7 = 27)
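The two-shift recipe can be written as a small Python function and checked against the example word. `extract_field` is a hypothetical name; the masking after the left shift stands in for bits falling off the top of a 32-bit register.

```python
MASK32 = 0xFFFFFFFF

def extract_field(word, n, m):
    """Extract bits n:m of a 32-bit word and align them with bit 0."""
    shifted = (word << (31 - n)) & MASK32   # LSL #(31-n): drop bits above n
    return shifted >> ((31 - n) + m)        # LSR #((31-n)+m): drop bits below m
```

The same function with n = 15, m = 6 performs the `lsl #16` then `lsr #22` pair used to pull the 10-bit DATA field out of the A/D data register.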
AD0DR: D (bit 31), OV (bit 30), CHN (bits 26:24), DATA (bits 15:6)
AD0CR: START (bits 26:24), CLKS (bits 19:17), B (bit 16), CLKDIV (bits 15:8), SEL (bits 7:0)
      ADRL r1, AD0DR          ; load address
      LDR  r0, [r1]
      MOV  r0, r0, lsl #16
      MOV  r0, r0, lsr #22    ; R0 contains extracted DATA field
NB ADRL is used when the address is >4096 bytes from PC.
                              ; r3 contains 8 bit value to be written
      ADRL r1, AD0CR
      LDR  r0, [r1]           ; load whole of AD0CR
      BIC  r0, r0, #&ff00     ; clear bits 15:8 (CLKDIV)
      ORR  r0, r0, r3, lsl #8 ; set 15:8 from r3(7:0)
      STR  r0, [r1]           ; store back to AD0CR
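MOV r0, r1, lsl #n     ; r0 := r1 x 2^n
ADD r0, r1, r1, lsl #n ; r0 := r1 x (2^n + 1)
RSB r0, r1, r1, lsl #n ; r0 := r1 x (2^n - 1)
Example, multiplying r0 by 35 = 5 x 7:
ADD r0, r0, r0, LSL #2 ; r0' := 5 x r0
RSB r0, r0, r0, LSL #3 ; r0" := 7 x r0' (note RSB not SUB)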
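A quick Python check of the shift-and-add trick, with results truncated to 32 bits as in a register. The function name `times35` is hypothetical; each line mirrors one of the two instructions.

```python
MASK32 = 0xFFFFFFFF

def times35(r0):
    """Multiply by 35 using only shifts, one add and one reverse subtract."""
    r0 = (r0 + (r0 << 2)) & MASK32   # ADD r0, r0, r0, LSL #2  -> 5 x r0
    r0 = ((r0 << 3) - r0) & MASK32   # RSB r0, r0, r0, LSL #3  -> 7 x (5 x r0)
    return r0
```

RSB is needed for the second step because the shifted value is the one being subtracted from, and only the second operand can carry a shift.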
Subroutines
Subroutines allow you to modularize your code so that it is more reusable. The general structure of a subroutine in a program is:
MAIN  ......     ; main program
      BL SUB1    ; subroutine call
      ......
The subroutine is a key element in assembly language programs, allowing code reuse
It is also the way that High Level Language procedures and functions are implemented
Storage of data on a stack is an essential element of all modern computer programs and typically is done on subroutine entry & exit.
ARM has instructions to support subroutines and stacks.
This lecture will consider:
Use of return addresses by subroutines
Branch & link instruction
SUB1 subroutine
Example
Essential documentation for subroutines must describe
Inputs
Outputs (if any)
What subroutine does (other than compute outputs)
Which registers it changes
EXAMPLE: Subroutine to move n bytes (spaced one per word) into n contiguous bytes at a different position in memory
[Figure: memory map with addresses &1000, &1004, &1008 and &2000]
PACK_BYTES  ; Input: src=r0, dest=r1, n=r2
            ; loads LS bytes in words [R0],[R0+4], ..., [R0+4(n-1)]
            ; into contiguous bytes [R1],[R1+1],.....[R1+n-1]
            ; Changes r2,r3
            SUBS R2, R2, #1            ; n := n-1
            LDRB R3, [R0, R2, lsl #2]  ; load first byte [R0+4(n-1)]
            STRB R3, [R1, R2]          ; store it [R1+n-1]
            BNE  PACK_BYTES
            MOV  pc, r14               ; return to caller
Nested Subroutines
SUB1 BL SUB2
SUB2 BL SUB3
SUB3 X
MAIN  ADR R0, TAB1      ; set up subroutine inputs
      ADR R1, TAB2
      MOV R2, #100
      BL  PACK_BYTES    ; call the subroutine
When executing at "X" the nested subroutines SUB1, SUB2, SUB3 are all active.
Nested Subroutines
Since the return address is held in register r14, you should not call a further subroutine without first saving r14. How do you achieve this goal?
Could use separate storage for each subroutine.
Problem: storage needed scales with number of subroutines. A program may typically have 1000s of subroutines, which means 1000s of separate storage locations.
The number of subroutines active at any time (nested) is much smaller than the total number, typically less than 10.
This motivates use of a stack: an area of memory which is shared for storage by subroutines.
Can store all registers changed by a subroutine on the stack, not just R14.
[Figure: stack between high and low memory addresses with stack pointer r13; shows the stack pointer at "X" and the empty locations below it, while nested subroutines SUB1, SUB2, SUB3 are all active]
STMED vs STR
These two instructions look different but do the same thing with one register:
STMED R13!, {R14}    ; stack pointer first, then list of one or more data registers; offset is calculated and added after operation
STR R14, [R13], #-4  ; data register first, then stack pointer; offset is explicitly written and added to SP after operation
STMED can be used with any number of registers.
STMED is conventionally used for stacks even when only a single transfer is needed.
ARM has a single instruction which transfers multiple registers to a stack and implements PUSH this way:
STMED r13!, {r1, r3-r5, r14} ; Push r1, r3-r5, r14 onto stack
                             ; Stack grows down in mem
                             ; r13 points to next empty loc.
POP operation
The complementary operation of PUSH is the POP operation. POP {r1, r3-r5, r14}
[Figure: memory before and after POP: r14, r5, r4, r3, r1 are read back from the stack (r14 at the highest address, r1 at the lowest) and r13 moves from low back towards high]
POP (returns C): B, A stored
POP (returns B): A stored
POP (returns A): empty
Nested subroutines will each PUSH and then POP their registers at the same level (all PUSHes & POPs from subroutine calls will balance) so this will work.
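The balance between PUSH and POP can be seen in a small Python model of an empty descending stack. The names `stmed` and `ldmed` simply mirror the mnemonics; memory is a dictionary keyed by byte address, registers are a dictionary keyed by register number, and the addresses and values used in the check are made up.

```python
def stmed(memory, sp, regs, values):
    """Push registers; sp points at the next empty location before and after."""
    for reg in sorted(regs, reverse=True):  # highest register at highest address
        memory[sp] = values[reg]            # store into the empty slot...
        sp -= 4                             # ...then move the pointer down
    return sp

def ldmed(memory, sp, regs, values):
    """Pop registers pushed by stmed; returns the restored stack pointer."""
    for reg in sorted(regs):                # lowest register at lowest address
        sp += 4                             # move up to the last full slot...
        values[reg] = memory[sp]            # ...then load from it
    return sp
```

Pushing five registers from r13 = 0x4000 leaves r13 at 0x3FEC, and the matching pop brings it back to 0x4000 with every register restored, which is why nested calls can share one stack.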
SUB1  STMED r13!, {r0-r2, r14}  ; push work & link registers
      BL    SUB2                ; jump to a nested subroutine
      LDMED r13!, {r0-r2, r14}  ; pop work & link registers
      MOV   pc, r14             ; return to calling program
[Figure: stack holding r14, r2, r1, r0 from high to low addresses; r13 before the LDMED and r13' on return from SUB1]
      ; Input: r0
      ; Output: r1=1 if odd parity (xor of all 32 bits), otherwise 0
      ; preserves value of r2 on stack
      STMED r13!, {r2}          ; save registers, why not r1?
      MOV   r2, #31
      MOV   r1, #0
LOOP  EOR   r1, r0, r1, ror #1
      SUBS  r2, r2, #1
      BPL   LOOP                ; loop 32 times
      AND   r1, r1, #1
      LDMED r13!, {r2}          ; restore registers
      MOV   pc, r14             ; return to caller
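It is not obvious at first sight that the rotate-and-XOR loop produces the parity, so here is a Python model of it. The names `ror32` and `parity` are hypothetical; the loop body corresponds to the single EOR instruction, executed 32 times.

```python
MASK32 = 0xFFFFFFFF

def ror32(x, n):
    """Rotate a 32-bit value right by n places."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def parity(r0):
    """Return 1 if r0 has an odd number of 1 bits, otherwise 0."""
    r1 = 0
    for _ in range(32):                       # r2 runs 31..0, BPL exits at -1
        r1 = (r0 ^ ror32(r1, 1)) & MASK32     # EOR r1, r0, r1, ror #1
    return r1 & 1                             # AND r1, r1, #1
```

After 32 iterations r1 is the XOR of r0 rotated by every amount from 0 to 31, so each bit of r1 (including bit 0) is the XOR of all 32 bits of r0.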
SUBX  STMED r13!, {R14}
      BL    SUBY
      .......
      LDMED r13!, {pc}
SUBY  STMED r13!, {r0,r1,r2}
      .....
      LDMED r13!, {r0,r1,r2}
      MOV   pc, r14
The 4 types of stack POP & PUSH have different mnemonics (for convenience) when used for general data movement like this. It does not matter which mnemonic you use: LDMED & LDMIB are the same instruction
When using LDMIA and STMIA instructions, you INCREMENT the address in memory to load/store your data; the increment of the address occurs AFTER the address is used.
Higher register numbers are stored or loaded to/from higher addresses, always.
LDMIA r1, {r2-r9}  ; r2 := mem32[r1]
                   ; ...
                   ; r9 := mem32[r1+28]
Lecture 10: Miscellaneous
Multiplication
Overview of machine instructions
Machine instruction timing
Register list has one bit per register: bit 0 = 1 => load/store r0; bit 1 = 1 => load/store r1; etc
STMIA r13!, {r0-r2, r14} ; machine code E8AD 4007
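The register-list field is simply a 16-bit mask, which can be built and decoded as in this Python sketch. Both function names are hypothetical; only the low 16 bits of the instruction word are examined.

```python
def register_list_bits(regs):
    """Build the 16-bit register list field from register numbers."""
    mask = 0
    for reg in regs:
        mask |= 1 << reg      # bit n = 1 => load/store rn
    return mask

def decode_register_list(instruction_word):
    """Return the register numbers named in the low 16 bits of an LDM/STM."""
    field = instruction_word & 0xFFFF
    return [reg for reg in range(16) if (field >> reg) & 1]
```

For the slide's example, the list {r0-r2, r14} gives 0x4007, which is the low half of the machine code word E8AD 4007.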
ARM Multiply instructions
The original ARM 1 architecture did not have multiply instructions.
32x32->32 bit multiplication (least significant 32 bits of result kept) was added for ARM 3 and above.
32x32->64 bit multiplication was added for ARM7DM and above.
Multiply in detail
MUL,MLA were the original (32 bit LSW result) instructions
Why does it not matter whether they are signed or unsigned?
Register operands only
No constants, no shifts
The multiplications were shoe-horned into the data processing instructions, using bit combinations specifying shifts that were previously unused and illegal.
MUL   rd, rm, rs       ; multiply (32 bit)       ; Rd := (Rm*Rs)[31:0]
MLA   rd, rm, rs, rn   ; multiply-acc (32 bit)   ; Rd := (Rm*Rs)[31:0] + Rn
UMULL rh, rl, rm, rs   ; unsigned multiply       ; (Rh:Rl) := Rm*Rs
UMLAL rh, rl, rm, rs   ; unsigned multiply-acc   ; (Rh:Rl) := (Rh:Rl)+Rm*Rs
SMULL rh, rl, rm, rs   ; signed multiply         ; (Rh:Rl) := Rm*Rs
SMLAL rh, rl, rm, rs   ; signed multiply-acc     ; (Rh:Rl) := (Rh:Rl)+Rm*Rs
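The question of why MUL does not care about signedness can be explored with a Python model of the products. This is a sketch only: the function names mirror the mnemonics, the long multiplies return a (high word, low word) pair, and operands are given as 32-bit patterns.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mul(rm, rs):
    """MUL: low 32 bits of the product."""
    return (rm * rs) & MASK32

def umull(rm, rs):
    """UMULL: full 64-bit unsigned product as (high word, low word)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product >> 32, product & MASK32

def smull(rm, rs):
    """SMULL: full 64-bit signed product as (high word, low word)."""
    product = (to_signed(rm) * to_signed(rs)) & MASK64
    return product >> 32, product & MASK32
```

With operands 0xFFFFFFFF and 2, the low word is 0xFFFFFFFE either way; only the high word differs (1 unsigned, 0xFFFFFFFF signed), which is why a 32-bit result needs just one instruction while the 64-bit result needs signed and unsigned variants.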
$z = \sum_{j=0}^{19} x_j \, y_j$
      MOV   r11, #20         ; initialize loop counter
      MOV   r7, #0           ; initialize 64 bit total (r6, r7)
      MOV   r6, #0
LOOP  LDR   r0, [r8], #4     ; get x component
      LDR   r1, [r9], #4     ; ... and y component
      SMLAL r6, r7, r0, r1   ; accumulate product
      SUBS  r11, r11, #1     ; decrement loop counter
      BNE   LOOP             ; loop 20 times
Overview (2)
Branch: B, BL, BNE, BMI
cond 1 0 1 L S
PC := PC+S
L = 0 => Branch, B ...
L = 1 => Branch and link (R14 := PC), BL ...
Coprocessor interface: cond 1 1 0 0, cond 1 1 0 1, cond 1 1 1 0
Instruction timing (cycles):
data processing (register-valued shifts), e.g. MOV R1, R2, lsl R3: 2 (+3 if PC is dest)
LDR, LDRB, STR, STRB: 4
LDM (n registers): n+3 (+3 more if PC is loaded)
Remaining cycle counts from the original table, whose row labels did not survive: 1 (+3 if PC is dest), n+3, 4, 7-14
ALL instructions take 1 cycle if not executed (condition false).
"Register-valued shift" is a special case: 2 cycles.
Make sure you know what a register-valued shift is!
Multiply takes a lot longer, though exact timing depends on data and also on the ARM core - later cores have more efficient hardware multiply.
Instruction timing is hardware-dependent. It is not part of the Instruction Set Architecture.