Cortex®-A72 Software Optimization Guide
Release Information
The following changes have been made to this Software Optimization Guide.
Change History
Proprietary Notice
Words and logos marked with ™ or ® are registered trademarks or trademarks of ARM® in the EU and other
countries except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein
may be the trademarks of their respective owners.
Neither the whole nor any part of the information contained in, or the product described in, this document may be
adapted or reproduced in any material form except with the prior written permission of the copyright holder.
The product described in this document is subject to continuous developments and improvements. All particulars
of the product and its use contained in this document are given by ARM in good faith. However, all warranties
implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are
excluded.
This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss
or damage arising from the use of any information in this document, or any error or omission in such information,
or any incorrect use of the product.
Where the term ARM is used it means “ARM or any of its subsidiaries as appropriate”.
Confidentiality Status
Product Status
Web Address
http://www.arm.com
1.1 References
2 INTRODUCTION
3 INSTRUCTION CHARACTERISTICS
3.20 CRC
4 SPECIAL CONSIDERATIONS
1.1 References
This document refers to the following documents.
[Figure: Cortex-A72 processor pipeline. Fetch feeds Rename/Dispatch, which issues to the Branch, Integer 0, Integer 1, FP/ASIMD 0, FP/ASIMD 1, Load, and Store pipelines.]
FP/ASIMD-0 (F0): ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, and crypto µops
3 INSTRUCTION CHARACTERISTICS
Note:
1. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds two cycles to the latency.
Note:
1. Sequential MOVW/MOVT (AArch32) instruction pairs and certain MOVZ/MOVK, MOVK/MOVK (AArch64)
instruction pairs can be executed with one-cycle execute latency and four-instruction/cycle execution
throughput in I0/I1. See Section 4.11 for more details on the instruction pairs that can be merged.
2. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds two cycles to the latency.
3. Sequential ADRP/ADD instruction pairs can be executed with one-cycle execute latency and
four-instruction/cycle execution throughput in I0/I1, as illustrated below. See Section 4.12 for more details
on the instruction pairs that can be merged.
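As an illustration, the following AArch64 address-formation idiom produces such a mergeable pair (a minimal sketch; my_symbol is a hypothetical label):
    ADRP    x0,my_symbol             // x0 = 4KB page address of my_symbol
    ADD     x0,x0,:lo12:my_symbol    // add the low 12 address bits; can merge with the ADRP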
Note:
1. Integer divides are performed using an iterative algorithm and block any subsequent divide operations until
complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops, allowing
a typical sequence of multiply-accumulate µops to issue one every N cycles (accumulate latency N shown
in parentheses).
3. Long-form multiplies (which produce two result registers) stall the multiplier pipeline for one extra cycle.
4. Multiplies that set the condition flags require an additional integer µop.
5. X-form multiply accumulates stall the multiplier pipeline for two extra cycles.
6. Multiply high operations stall the multiplier pipeline for N extra cycles before any other type M µop can be
issued to that pipeline, with N shown in parentheses.
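For example, the accumulate chains described in note 2 behave as in the following hedged AArch64 sketch: each MADD receives its accumulate operand by late forwarding from the previous multiply-accumulate, so a dependent chain issues at the accumulate latency N rather than the full multiply latency:
    MADD    x0,x1,x2,x0    // x0 = x1*x2 + x0
    MADD    x0,x3,x4,x0    // accumulate operand x0 late-forwarded from the previous MADD
    MADD    x0,x5,x6,x0    // a typical chain issues one µop every N cycles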
Note:
1. Conditional GE-setting instructions require three extra µops and two additional cycles to conditionally
update the GE field (GE latency shown in parentheses).
Note:
1. Base register updates are typically completed in parallel with the load operation and with shorter latency
(update latency shown in parentheses).
2. For load multiple instructions, N=floor((num_regs+1)/2).
3. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds two cycles to the latency.
Note:
1. Base register updates are typically completed in parallel with the store operation and with shorter latency
(update latency is shown in parentheses).
2. For store multiple instructions, N=floor((num_regs+1)/2).
Note:
1. FP divide and square root operations are performed using an iterative algorithm and block subsequent
similar operations to the same pipeline until complete.
2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply µops to the
accumulate operands of an FP multiply-accumulate µop. The latter can potentially be issued one cycle
after the FP multiply µop has been issued.
3. FP multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops,
allowing a typical sequence of multiply-accumulate µops to issue one every N cycles (accumulate latency
N is shown in parentheses).
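Note 2 can be illustrated with a minimal AArch64 scalar sketch: the FMADD below uses the FMUL result as its accumulate operand, so it can potentially issue one cycle after the FMUL:
    FMUL    d0,d1,d2       // d0 = d1*d2
    FMADD   d3,d4,d5,d0    // d3 = d4*d5 + d0; accumulate operand d0 late-forwarded from the FMUL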
Note:
1. For FP load multiple instructions, N=floor((num_regs+1)/2).
2. For conditional FP load multiple instructions, N = num_regs.
3. Writeback forms of load instructions require an extra µop to update the base address. This update is
typically performed in parallel with, or prior to, the load µop (update latency is shown in parentheses).
Note:
1. For single-precision store multiple instructions, N=floor((num_regs+1)/2). For double-precision store
multiple instructions, N=(num_regs).
2. Writeback forms of store instructions require an extra µop to update the base address. This update is
typically performed in parallel with, or prior to, the store µop (address update latency is shown in
parentheses).
Note:
1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops,
allowing a typical sequence of floating-point multiply-accumulate µops to issue one every N cycles
(accumulate latency N is shown in parentheses).
2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP multiply µops to
the accumulate operands of an ASIMD FP multiply-accumulate µop. The latter can potentially be issued
one cycle after the ASIMD FP multiply µop has been issued.
3. ASIMD divide operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
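For example (a hedged AArch64 ASIMD sketch of note 1), a dependent vector accumulate chain can issue one FMLA every N cycles because the accumulate operand is late-forwarded:
    FMLA    v0.4s,v1.4s,v2.4s    // per-lane v0 += v1*v2
    FMLA    v0.4s,v3.4s,v4.4s    // accumulate operand v0 late-forwarded from the previous FMLA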
Note:
1. For table branches (TBL and TBX), N denotes the number of registers in the table.
Note:
1. Writeback forms of load instructions require an extra µop to update the base address. This update is
typically performed in parallel with the load µop (update latency is shown in parentheses).
Note:
1. Adjacent AESE/AESMC instruction pairs and adjacent AESD/AESIMC instruction pairs will exhibit the
described performance characteristics. See Section 4.10 for additional details.
2. Crypto execution supports late forwarding of the result from a producer µop to a consumer µop. This results
in a one-cycle reduction in latency as seen by the consumer.
3.20 CRC
Instruction Group    AArch32 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
CRC checksum ops     CRC32, CRC32C           2               1                       M                     1
Note:
1. CRC execution supports late forwarding of the result from a producer CRC µop to a consumer CRC µop. This
results in a one-cycle reduction in latency as seen by the consumer.
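For example, in a running checksum such as the following AArch32 sketch, each CRC32W consumes the previous result, and late forwarding trims one cycle from the latency seen by the consumer:
    CRC32W  r0,r0,r1    @ accumulate the CRC of r1 into r0
    CRC32W  r0,r0,r2    @ consumer µop: the producer result is late-forwarded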
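4 SPECIAL CONSIDERATIONS
Consider, for example, a pair of conditional multiplies such as the following (a hedged AArch32 illustration; the source registers of the first multiply are arbitrary):
    MULEQ   r1,r3,r5    @ first conditional multiply, writes R1
    MULEQ   r1,r2,r4    @ reads R2 and R4; being conditional, it also depends on the prior value of R1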
For this pair of instructions, the second multiply is dependent upon the result of the first multiply, not through one
of its normal input operands (R2 and R4), but through the destination register R1. The combined latency for these
instructions is six cycles, rather than the four cycles that would be required if these instructions were not
conditional (three cycles of latency for the first, and one additional cycle for the second, which is fully pipelined
behind the first). So if the condition is easily predictable (by the branch predictor), conditional execution can lead to a
loss of performance compared with an equivalent sequence that uses a conditional branch.
VMOV S0,R0
VMOV S1,R1
VADD D2, D1, D0
The first two instructions write S0 and S1, which correspond to the bottom and top halves of D0. The third
instruction then requires D0 as an input operand. In this scenario, the Cortex®-A72 processor detects that at least
one of the upper or lower S registers (S0/S1) overlaid on D0 was previously written, at which point the VADD
instruction is serialized until the prior S-register writes are guaranteed to have been architecturally committed,
likely incurring
significant additional latency. Note that after the D0 register has been written as a D-register or Q-register
destination, subsequent consumers of that register will no longer encounter this register-hazard condition, until the
next S-register write, if any.
The Cortex®-A72 processor is able to avoid this register-hazard condition for certain cases. The following rules
describe the conditions under which a register-hazard can occur.
To avoid unnecessary hazards, ARM recommends that the programmer use D[x] scalar writes when populating
registers prior to ASIMD operations. For example, either of the following instruction forms would safely prevent a
subsequent hazard:
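(A hedged sketch of such forms; register choices are illustrative.)
    VMOV.32 d0[0],r0    @ D[x] scalar write to lane 0 of D0, tracked as a D-register write
    VMOV.32 d0[1],r1    @ D[x] scalar write to lane 1 of D0
    @ or, writing the whole register at once:
    VMOV    d0,r0,r1    @ writes both halves of D0 from two core registers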
To achieve maximum throughput for memory copy (or similar loops), do the following:
• Unroll the loop to include multiple load and store operations for each iteration, minimizing the overheads
of looping.
• Use discrete, non-writeback forms of load and store instructions (such as LDRD and STRD), interleaving
them so that one load and one store operation can be performed each cycle. Avoid load-multiple/store-
multiple instruction encodings (such as LDM and STM), which lead to separated bursts of load and store
µops which might not allow concurrent use of both the load and store pipelines.
The following example shows a recommended instruction sequence for a long memory copy in AArch32 state:
Loop_start:
    SUBS    r2,r2,#64        @ decrement the byte count by 64 and set the flags
    LDRD    r3,r4,[r1,#0]    @ interleaved loads and stores: one of each per cycle
    STRD    r3,r4,[r0,#0]
    LDRD    r3,r4,[r1,#8]
    STRD    r3,r4,[r0,#8]
    LDRD    r3,r4,[r1,#16]
    STRD    r3,r4,[r0,#16]
    LDRD    r3,r4,[r1,#24]
    STRD    r3,r4,[r0,#24]
    LDRD    r3,r4,[r1,#32]
    STRD    r3,r4,[r0,#32]
    LDRD    r3,r4,[r1,#40]
    STRD    r3,r4,[r0,#40]
    LDRD    r3,r4,[r1,#48]
    STRD    r3,r4,[r0,#48]
    LDRD    r3,r4,[r1,#56]
    STRD    r3,r4,[r0,#56]
    ADD     r1,r1,#64        @ advance the source pointer
    ADD     r0,r0,#64        @ advance the destination pointer
    BGT     Loop_start       @ loop while the count remains positive
A recommended copy routine for AArch64 would look similar to the sequence above, but would use LDP/STP
instructions.
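A hedged sketch of such a routine (assuming x0 = destination pointer, x1 = source pointer, x2 = byte count, matching the register roles above; each LDP/STP pair moves 16 bytes):
Loop_start:
    SUBS    x2,x2,#64        // decrement the byte count by 64 and set the flags
    LDP     x3,x4,[x1,#0]    // interleaved load pairs and store pairs
    STP     x3,x4,[x0,#0]
    LDP     x3,x4,[x1,#16]
    STP     x3,x4,[x0,#16]
    LDP     x3,x4,[x1,#32]
    STP     x3,x4,[x0,#32]
    LDP     x3,x4,[x1,#48]
    STP     x3,x4,[x0,#48]
    ADD     x1,x1,#64        // advance the source pointer
    ADD     x0,x0,#64        // advance the destination pointer
    B.GT    Loop_start       // loop while the count remains positive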
• Try not to include more than two taken branches within the same quadword-aligned quadword of
instruction memory.
• Consider aligning subroutine entry points and branch targets to quadword boundaries, within the bounds
of the code-density requirements of the program. This ensures that the fetch following the taken branch
can retrieve a full quadword's worth of instructions (four), maximizing fetch bandwidth, as in the sketch below.
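With the GNU assembler, for example, such alignment can be requested with a directive (a minimal sketch; the label and body are hypothetical):
    .balign 16               // pad to the next 16-byte (quadword) boundary
subroutine_entry:            // branch target now starts a full fetch quadword
    // ... subroutine body ...
    RET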
The table below summarizes various special instructions and the associated execution constraints or side-effects.
Note:
1. Conditional forms of these instructions for which the condition is not satisfied will not access special
registers or trigger flush side-effects.
2. Conditional forms of these instructions are always executed non-speculatively and in-order to properly
resolve the condition.
3. MSR instructions that write APSR_nzcvq generate a separate µop to write the Q bit. That µop executes
non-speculatively and in-order. But the main µop, which writes the NZCV bits, executes as shown in the
table above.
4. A subset of MCR instructions must be executed non-speculatively. A subset of MRC instructions trigger
flush side-effects for synchronization. Those subsets are not documented here.
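For example, an AES encryption round sequence might be arranged as follows (a hedged AArch64 sketch; v0 holds the state and v1/v2 hold round keys):
    AESE    v0.16b,v1.16b    // round: AddRoundKey, SubBytes, ShiftRows
    AESMC   v0.16b,v0.16b    // MixColumns, adjacent to and after its AESE
    AESE    v0.16b,v2.16b    // next round
    AESMC   v0.16b,v0.16b    // again paired adjacently, AESE then AESMC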
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions provide higher performance when adjacent,
and in the described order, in the program code. Therefore it is important to ensure that these instructions come
in pairs in AES encryption/decryption loops, as shown in the code segment above.
If any of these sequences appear sequentially and in the described order in program code, the two instructions
can be executed at lower latency and higher bandwidth than if they do not appear sequentially in the program
code, enabling 32-bit literals to be generated in a single cycle and 64-bit literals to be generated in two cycles.
Thus it is advantageous to ensure that compilers or programmers writing assembly code schedule these
instruction pairs sequentially.
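For example, the following AArch64 sequence builds an arbitrary 64-bit literal; assuming the pairs merge as described in Section 4.11, the four instructions complete in two cycles:
    MOVZ    x0,#0xDEF0             // bits [15:0]
    MOVK    x0,#0x9ABC,LSL #16     // bits [31:16]; can merge with the MOVZ
    MOVK    x0,#0x5678,LSL #32     // bits [47:32]
    MOVK    x0,#0x1234,LSL #48     // bits [63:48]; can merge with the preceding MOVK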