Arm® Cortex®-A710 Core

Revision: r2p0

Software Optimization Guide


Non-Confidential
Issue 4.0
Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.
PJDOC-466751330-14951

Release information

Document history
Issue Date Confidentiality Change

1.0 31 Mar 2020 Confidential First release for r0p0

2.0 15 May 2020 Confidential Release for r1p0

3.0 21 Aug 2020 Confidential Release for r2p0

4.0 25 May 2021 Non-Confidential Second release for r2p0

Non-Confidential Proprietary Notice


This document is protected by copyright and other related rights and the practice or implementation of
the information contained in this document may be protected by one or more patents or pending patent
applications. No part of this document may be reproduced in any form by any means without the express
prior written permission of Arm. No license, express or implied, by estoppel or otherwise to any intellectual
property rights is granted by this document unless specifically stated.

Your access to the information in this document is conditional upon your acceptance that you will not use
or permit others to use the information for the purposes of determining whether implementations infringe
any third party patents.

THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation
with respect to, has undertaken no analysis to identify or understand the scope and content of, patents,
copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.

TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY,
ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.

This document consists solely of commercial items. You shall be responsible for ensuring that any use,
duplication or disclosure of this document complies fully with any relevant export laws and regulations to
assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such
export laws. Use of the word “partner” in reference to Arm's customers is not intended to create or refer to
any partnership relationship with any other company. Arm may make changes to this document at any
time and without notice.


This document may be translated into other languages for convenience, and you agree that if there is any
conflict between the English version of this document and any translation, the terms of the English version
of the Agreement shall prevail.

The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm
Limited (or its affiliates) in the US and/or elsewhere. All rights reserved. Other brands and names
mentioned in this document may be the trademarks of their respective owners. Please follow Arm's
trademark usage guidelines at https://www.arm.com/company/policies/trademarks.

Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.

Arm Limited. Company 02557590 registered in England.

110 Fulbourn Road, Cambridge, England CB1 9NJ.

(LES-PRE-20349)

Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to
license restrictions in accordance with the terms of the agreement entered into by Arm and the party that
Arm delivered this document to.

Unrestricted Access is an Arm internal classification.

Product Status
The information in this document is final, that is for a developed product.

Web Address
developer.arm.com

Progressive terminology commitment


Arm values inclusive communities. Arm recognizes that we and our industry have used terms that can be
offensive. Arm strives to lead the industry and create change.

This document includes terms that can be offensive. We will replace these terms in a future issue of this
document. If you find offensive terms in this document, please email [email protected].


Contents
1 Introduction
1.1 Product revision status
1.2 Intended audience
1.3 Scope
1.4 Conventions
1.4.1 Glossary
1.4.2 Terms and abbreviations
1.4.3 Typographical conventions
1.5 Additional reading
1.6 Feedback
1.6.1 Feedback on this product
1.6.2 Feedback on content

2 Overview
2.1 Pipeline overview

3 Instruction characteristics
3.1 Instruction tables
3.2 Legend for reading the utilized pipelines
3.3 Branch instructions
3.4 Arithmetic and logical instructions
3.5 Move and shift instructions
3.6 Divide and multiply instructions
3.7 Saturating and parallel arithmetic instructions
3.8 Pointer Authentication Instructions
3.9 Miscellaneous data-processing instructions
3.10 Load instructions
3.11 Store instructions
3.12 Tag Load Instructions
3.13 Tag Store instructions
3.14 FP data processing instructions
3.15 FP miscellaneous instructions
3.16 FP load instructions
3.17 FP store instructions
3.18 ASIMD integer instructions
3.19 ASIMD floating-point instructions
3.20 ASIMD BFloat16 (BF16) instructions
3.21 ASIMD miscellaneous instructions
3.22 ASIMD load instructions
3.23 ASIMD store instructions
3.24 Cryptography extensions
3.25 CRC
3.26 SVE Predicate instructions
3.27 SVE integer instructions
3.28 SVE floating-point instructions
3.29 SVE BFloat16 (BF16) instructions
3.30 SVE Load instructions
3.31 SVE Store instructions
3.32 SVE Miscellaneous instructions
3.33 SVE Cryptographic instructions

4 Special considerations
4.1 Dispatch constraints
4.2 Dispatch stall
4.3 Optimizing general-purpose register spills and fills
4.4 Optimizing memory routines
4.5 Load/Store alignment
4.6 AES encryption/decryption
4.7 Region based fast forwarding
4.8 Branch instruction alignment
4.9 FPCR self-synchronization
4.10 Special register access
4.11 Register forwarding hazards
4.12 IT blocks
4.13 Instruction fusion
4.14 Zero Latency MOVs
4.15 Cache maintenance operation
4.16 Memory Tagging - Tagging Performance
4.17 Memory Tagging - Synchronous Mode
4.18 Complex ASIMD and SVE instructions

1 Introduction
1.1 Product revision status
The rmpn identifier indicates the revision status of the product described in this book, for
example, r1p2, where:

rm
Identifies the major revision of the product, for example, r1.

pn
Identifies the minor revision or modification status of the product, for example, p2.
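
As an illustration of the identifier format described above, the following is a minimal sketch that splits an rmpn string into its major and minor parts. The function name `parse_rmpn` is hypothetical and is not part of any Arm tool.

```python
import re
from typing import Tuple


def parse_rmpn(identifier: str) -> Tuple[int, int]:
    """Split an rmpn identifier such as 'r1p2' into (major, minor).

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"r(\d+)p(\d+)", identifier)
    if match is None:
        raise ValueError(f"not an rmpn identifier: {identifier!r}")
    # Group 1 is the major revision (rm), group 2 the minor revision (pn).
    return int(match.group(1)), int(match.group(2))


print(parse_rmpn("r1p2"))  # (1, 2)
```

For the revision covered by this guide, `parse_rmpn("r2p0")` yields major revision 2 and minor revision 0.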

1.2 Intended audience


This document is for system designers, system integrators, and programmers who are designing
or programming a System-on-Chip (SoC) that uses an Arm core.

1.3 Scope
This document describes aspects of the Cortex-A710 core micro-architecture that influence
software performance. Micro-architectural detail is limited to that which is useful for software
optimization.

Documentation extends only to software visible behavior of the Cortex-A710 core and not to the
hardware rationale behind the behavior.

1.4 Conventions
The following subsections describe conventions used in Arm documents.

1.4.1 Glossary
The Arm Glossary is a list of terms used in Arm documentation, together with definitions for
those terms. The Arm Glossary does not contain terms that are industry standard unless the Arm
meaning differs from the generally accepted meaning.

See the Arm Glossary for more information: https://developer.arm.com/glossary.


1.4.2 Terms and abbreviations


This document uses the following terms and abbreviations.
Term Meaning

ALU Arithmetic and Logical Unit

ASIMD Advanced SIMD

MOP Macro-OPeration

µOP Micro-OPeration

SQRT Square Root

T32 AArch32 Thumb® instruction set

FP Floating-point


1.4.3 Typographical conventions


Convention           Use

italic               Introduces citations.

bold                 Highlights interface elements, such as menu names. Denotes signal names. Also used
                     for terms in descriptive lists, where appropriate.

monospace            Denotes text that you can enter at the keyboard, such as commands, file and program
                     names, and source code.

monospace bold       Denotes language keywords when used outside example code.

monospace underline  Denotes a permitted abbreviation for a command or option. You can enter the
                     underlined text instead of the full command or option name.

< and >              Encloses replaceable terms for assembler syntax where they appear in code or code
                     fragments.
                     For example:
                     MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>

SMALL CAPITALS       Used in body text for a few terms that have specific technical meanings, that are
                     defined in the Arm® Glossary. For example, IMPLEMENTATION DEFINED, IMPLEMENTATION
                     SPECIFIC, UNKNOWN, and UNPREDICTABLE.

Caution              This represents a recommendation which, if not followed, might lead to system failure
                     or damage.

Warning              This represents a requirement for the system that, if not followed, might result in
                     system failure or damage.

Danger               This represents a requirement for the system that, if not followed, will result in system
                     failure or damage.

Note                 This represents an important piece of information that needs your attention.

Tip                  This represents a useful tip that might make it easier, better or faster to perform a task.

Remember             This is a reminder of something important that relates to the information you are
                     reading.


1.5 Additional reading


This document contains information that is specific to this product. See the following documents
for other relevant information:

Table 1-1 Arm publications

Document name                                   Document ID  Licensee only

Arm® Architecture Reference Manual, Armv8, for  DDI 0487     Y
Armv8-A architecture profile

Arm® Cortex®-A710 Core Technical Reference      101430       Y
Manual


1.6 Feedback
Arm welcomes feedback on this product and its documentation.

1.6.1 Feedback on this product


If you have any comments or suggestions about this product, contact your supplier and give:
• The product name.
• The product revision or version.
• An explanation with as much information as you can provide. Include symptoms and
diagnostic procedures if appropriate.

1.6.2 Feedback on content


If you have comments on content, send an email to [email protected] and give:
• The title Arm® Cortex®-A710 Core Software Optimization Guide.
• The number PJDOC-466751330-14951.
• If applicable, the page number(s) to which your comments refer.
• A concise explanation of your comments.

Arm also welcomes general suggestions for additions and improvements.

Arm tests the PDF only in Adobe Acrobat and Acrobat Reader and cannot guarantee the quality
of the represented document when used with any other PDF reader.


2 Overview
The Cortex-A710 core is a high-performance, low-power, and constrained area product that
implements the Armv9.0-A architecture and supports all previous Armv8-A architectures up to
Armv8.5-A. It targets clamshell and premium high-end smartphone applications.

The key features of Cortex-A710 core are:


• Implementation of the Armv9-A A32, T32, and A64 instruction sets.
• AArch32 Execution state at Exception level EL0 and AArch64 Execution state at all exception
levels, EL0-EL3
• Memory Management Unit (MMU)
• 40-bit Physical Address (PA) and 48-bit Virtual Address (VA)
• Generic Interrupt Controller (GIC) CPU interface to connect to an external interrupt
distributor
• Generic Timers that support 64-bit count input from an external system counter
• Implementation of the Reliability, Availability, and Serviceability (RAS) Extension
• Implementation of the Scalable Vector Extension (SVE) with a 128-bit vector length and
Scalable Vector Extension 2 (SVE2)
• Integrated execution unit with Advanced SIMD and floating-point support
• Support for the optional Cryptographic Extension, which is licensed separately
• Activity Monitoring Unit (AMU)
• Separate L1 data and instruction caches
• Private, unified data and instruction L2 cache
• Support for Memory System Resource Partitioning and Monitoring (MPAM)
• Armv9-A debug logic
• Performance Monitoring Unit (PMU)
• Embedded Trace Macrocell (ETM) with support for Embedded Trace Extension (ETE)
• Trace Buffer Extension (TRBE)
• Optional Embedded Logic Analyzer (ELA)

This document describes elements of the Cortex-A710 core micro-architecture that influence
software performance so that software and compilers can be optimized accordingly.


2.1 Pipeline overview


The following figure describes the high-level Cortex-A710 instruction processing pipeline.
Instructions are first fetched and then decoded into internal Macro-OPerations (MOPs). From
there, the MOPs proceed through register renaming and dispatch stages. A MOP can be split
into two Micro-OPerations (µOPs) further down the pipeline after the decode stage. Once
dispatched, µOPs wait for their operands and issue out-of-order to one of thirteen issue
pipelines. Each issue pipeline can accept one µOP per cycle.

Figure 2-1 Cortex-A710 core pipeline

[Figure: Fetch, then Decode, Rename, Dispatch (in order), then Issue (out of order) to the
thirteen pipelines: Branch 0, Branch 1, Integer Single-Cycle 0, Integer Single-Cycle 1,
Integer Single/Multi-Cycle 0, Integer Single/Multi-Cycle 1, FP/ASIMD 0, FP/ASIMD 1,
Load/Store 0, Load/Store 1, Load 2, Store data 0, Store data 1.]

The execution pipelines support different types of operations, as shown in the following table.


Table 2-1 Cortex-A710 core operations

Instruction groups              Instructions

Branch 0/1                      Branch µOPs

Integer Single-Cycle 0/1        Integer ALU µOPs

Integer Single/Multi-cycle 0/1  Integer shift-ALU, multiply, divide, CRC and sum-of-absolute-differences µOPs

Load/Store 0/1                  Load, Store address generation and special memory µOPs

Load 2                          Load µOPs

Store data 0/1                  Store data µOPs

FP/ASIMD-0                      ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply,
                                FP divide, FP sqrt, crypto µOPs, store data µOPs

FP/ASIMD-1                      ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift µOPs, store data µOPs,
                                crypto µOPs.


3 Instruction characteristics
3.1 Instruction tables
This chapter describes high-level performance characteristics for most Armv9-A instructions. A
series of tables summarize the effective execution latency and throughput (instruction bandwidth
per cycle), pipelines utilized, and special behaviors associated with each group of instructions.
Utilized pipelines correspond to the execution pipelines described in chapter 2.

In the tables below, Exec Latency is defined as the minimum latency seen by an operation
dependent on an instruction in the described group.

In the tables below, Execution Throughput is defined as the maximum throughput (in
instructions per cycle) of the specified instruction group that can be achieved in the entirety of
the Cortex-A710 microarchitecture.
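
To make the difference between the two definitions above concrete, the following is a rough sketch, not a model of the core. It assumes a sequence made of a single instruction group and ignores dispatch limits, operand availability, and contention with other instructions. The function names are hypothetical. The example figures are the "ALU, basic" row (latency 1, throughput 4) and the "Multiply" row (latency 2, throughput 2) from the tables in this chapter.

```python
import math
from fractions import Fraction


def dependent_chain_cycles(count: int, exec_latency: int) -> int:
    """Rough cycle count when each instruction consumes the previous result.

    Exec Latency is a minimum, so this is a lower bound.
    """
    return count * exec_latency


def independent_cycles(count: int, throughput) -> int:
    """Rough cycle count for mutually independent instructions of one group.

    Pass fractional throughputs as Fraction(1, 12) rather than as floats
    so that the division stays exact.
    """
    return math.ceil(Fraction(count) / Fraction(throughput))


# "ALU, basic": latency 1, throughput 4.
print(dependent_chain_cycles(8, 1), independent_cycles(8, 4))  # 8 2
# "Multiply": latency 2, throughput 2.
print(dependent_chain_cycles(8, 2), independent_cycles(8, 2))  # 16 4
```

The gap between the two numbers in each line is the headroom available from breaking dependency chains, for example by interleaving independent work.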

3.2 Legend for reading the utilized pipelines


Table 3-1 Cortex-A710 core pipeline names and symbols
Pipeline name Symbol used in tables

Branch 0/1 B

Integer single Cycle 0/1 S

Integer single Cycle 0/1 and single/multicycle 0/1 I

Integer single/multicycle 0/1 M

Integer multicycle 0 M0

Load/Store 0/1 L01

Load/Store 0/1 and Load 2 L

Store data 0/1 D

FP/ASIMD 0/1 V

FP/ASIMD 0 V0

FP/ASIMD 1 V1
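
The legend above can be expressed as a lookup table. The following is a minimal sketch in which the dictionary and the `expand` helper are hypothetical names. It assumes that the pipeline names match Figure 2-1 and that the "Integer multicycle 0" row refers to the Integer Single/Multi-Cycle 0 pipeline.

```python
SINGLE = ["Integer Single-Cycle 0", "Integer Single-Cycle 1"]
MULTI = ["Integer Single/Multi-Cycle 0", "Integer Single/Multi-Cycle 1"]
LOAD_STORE = ["Load/Store 0", "Load/Store 1"]
FP_ASIMD = ["FP/ASIMD 0", "FP/ASIMD 1"]

# Symbol used in the instruction tables -> pipelines it stands for.
PIPELINES = {
    "B": ["Branch 0", "Branch 1"],
    "S": SINGLE,
    "I": SINGLE + MULTI,
    "M": MULTI,
    "M0": MULTI[:1],  # assumption: "Integer multicycle 0" is this pipeline
    "L01": LOAD_STORE,
    "L": LOAD_STORE + ["Load 2"],
    "D": ["Store data 0", "Store data 1"],
    "V": FP_ASIMD,
    "V0": FP_ASIMD[:1],
    "V1": FP_ASIMD[1:],
}


def expand(utilized: str) -> list:
    """Expand a 'Utilized Pipelines' cell such as 'B, S' into pipeline names."""
    names = []
    for symbol in utilized.split(","):
        names.extend(PIPELINES[symbol.strip()])
    return names


ALL_PIPELINES = sorted({name for names in PIPELINES.values() for name in names})
print(len(ALL_PIPELINES))  # 13
print(expand("B, S"))
```

The count of 13 distinct names agrees with the thirteen issue pipelines described in the pipeline overview.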


3.3 Branch instructions


Table 3-2 AArch64 Branch instructions

Instruction Group                  AArch64            Exec     Execution   Utilized   Notes
                                   Instructions       Latency  Throughput  Pipelines

Branch, immed                      B                  1        2           B          -

Branch, register                   BR, RET            1        2           B          -

Branch and link, immed             BL                 1        2           B, S       -

Branch and link, register          BLR                1        2           B, S       -

Compare and branch                 CBZ, CBNZ, TBZ,    1        2           B          -
                                   TBNZ

Table 3-3 AArch32 Branch instructions

Instruction Group                  AArch32            Exec     Execution   Utilized   Notes
                                   Instructions       Latency  Throughput  Pipelines

Branch, immed                      B                  1        2           B          -

Branch, register                   BX                 1        2           B          -

Branch and link, immed             BL, BLX            1        2           B, S       -

Branch and link, register          BLX                1        2           B, S       -

Compare and branch                 CBZ, CBNZ          1        2           B          -


3.4 Arithmetic and logical instructions


Table 3-4 AArch64 Arithmetic and logical instructions

Instruction Group                  AArch64            Exec     Execution   Utilized   Notes
                                   Instructions       Latency  Throughput  Pipelines

ALU, basic                         ADD, ADC, AND,     1        4           I          -
                                   BIC, EON, EOR,
                                   ORN, ORR, SUB,
                                   SBC

ALU, basic, flagset                ADDS, ADCS,        1        3           I          -
                                   ANDS, BICS,
                                   SUBS, SBCS

ALU, extend and shift              ADD{S}, SUB{S}     2        2           M          -

Arithmetic, LSL shift, shift <= 4  ADD, SUB           1        4           I          -

Arithmetic, flagset, LSL shift,    ADDS, SUBS         1        4           I          -
shift <= 4

Arithmetic, LSR/ASR/ROR shift      ADD{S}, SUB{S}     2        2           M          -
or LSL shift > 4

Arithmetic, immediate to           ADDG, SUBG         2        2           M          -
logical address tag

Conditional compare                CCMN, CCMP         1        4           I          -

Conditional select                 CSEL, CSINC,       1        4           I          -
                                   CSINV, CSNEG

Convert floating-point             AXFLAG, XAFLAG     1        1           I          -
condition flags

Flag manipulation instructions     SETF8, SETF16,     1        1           I          -
                                   RMIF, CFINV

Insert Random Tags                 IRG                2, 3     2, 1        M, M0      1

Insert Tag Mask                    GMI                1        4           I          -

Logical, shift, no flagset         AND, BIC, EON,     1        4           I          -
                                   EOR, ORN, ORR

Logical, shift, flagset            ANDS, BICS         2        2           M          -

Subtract Pointer                   SUBP               1        4           I          -

Subtract Pointer, flagset          SUBPS              1        3           I          -
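
The shifted-register arithmetic rows of Table 3-4 reduce to one rule: an LSL shift of 4 or less stays on the I pipelines with single-cycle latency, and any other shift moves to the M pipelines. The following is a minimal sketch of that rule. The function name is hypothetical, and the values are copied from the ADD{S}/SUB{S} shifted rows of the table, not measured.

```python
def shifted_arith_cost(shift_type: str, amount: int) -> dict:
    """Look up AArch64 ADD{S}/SUB{S} shifted-register figures from Table 3-4.

    Hypothetical helper; covers only the "Arithmetic, ... shift" rows.
    """
    if shift_type not in ("LSL", "LSR", "ASR", "ROR"):
        raise ValueError(f"unknown shift type: {shift_type!r}")
    if shift_type == "LSL" and amount <= 4:
        # Rows "Arithmetic, LSL shift, shift <= 4" (with and without flagset).
        return {"latency": 1, "throughput": 4, "pipelines": "I"}
    # Row "Arithmetic, LSR/ASR/ROR shift or LSL shift > 4".
    return {"latency": 2, "throughput": 2, "pipelines": "M"}


print(shifted_arith_cost("LSL", 3))
print(shifted_arith_cost("LSL", 5))
```

Under this reading, an address computation such as a base plus an index scaled by 8 (LSL #3) keeps the cheaper form, while scaling by 32 (LSL #5) does not.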


Table 3-5 AArch32 Arithmetic and logical instructions

Instruction Group                  AArch32            Exec     Execution   Utilized   Notes
                                   Instructions       Latency  Throughput  Pipelines

ALU, basic, unconditional, no      ADD, ADC, ADR,     1        4           I          -
flagset                            AND, BIC, EOR,
                                   ORN, ORR, RSB,
                                   RSC, SUB, SBC

ALU, basic, unconditional,         ADDS, ADCS,        1        3           I          -
flagset                            ANDS, BICS,
                                   CMN, CMP,
                                   EORS, ORNS,
                                   ORRS, RSBS,
                                   RSCS, SUBS,
                                   SBCS, TEQ, TST

ALU, basic, conditional            ADD{S}, ADC{S},    1        1           M0         -
                                   AND{S}, BIC{S},
                                   CMN, CMP,
                                   EOR{S}, ORN{S},
                                   ORR{S}, RSB{S},
                                   RSC{S}, SUB{S},
                                   SBC{S}, TEQ, TST

ALU, basic, shift by register,     (same as ALU       2        1           I, M0      -
conditional                        basic,
                                   conditional)

ALU, basic, shift by register,     (same as ALU,      2        1           M0         -
unconditional, flagset             basic,
                                   unconditional,
                                   flagset)

Arithmetic, shift by register,     ADD, ADC, RSB,     2        1           M0         -
unconditional, no flagset          RSC, SUB, SBC

Logical, shift by register,        AND, BIC, EOR,     1        1           M0         -
unconditional, no flagset          ORN, ORR

Arithmetic, LSL shift by immed,    ADD, ADC, RSB,     1        4           I          -
shift <= 4, unconditional, no      RSC, SUB, SBC
flagset

Arithmetic, LSL shift by immed,    ADDS, ADCS,        1        4           I          -
shift <= 4, unconditional,         RSBS, RSCS,
flagset                            SUBS, SBCS

Arithmetic, LSL shift by immed,    ADD{S}, ADC{S},    1        1           M0         -
shift <= 4, conditional            RSB{S}, RSC{S},
                                   SUB{S}, SBC{S}

Arithmetic, LSR/ASR/ROR shift      ADD{S}, ADC{S},    2        2           M          -
by immed or LSL shift by           RSB{S}, RSC{S},
immed > 4, unconditional           SUB{S}, SBC{S}

Arithmetic, LSR/ASR/ROR shift      ADD{S}, ADC{S},    2        1           M0         -
by immed or LSL shift by           RSB{S}, RSC{S},
immed > 4, conditional             SUB{S}, SBC{S}

Logical, shift by immed, no        AND, BIC, EOR,     1        4           I          -
flagset, unconditional             ORN, ORR

Logical, shift by immed, no        AND, BIC, EOR,     1        1           M0         -
flagset, conditional               ORN, ORR

Logical, shift by immed, flagset,  ANDS, BICS,        2        2           M          -
unconditional                      EORS, ORNS,
                                   ORRS

Logical, shift by immed, flagset,  ANDS, BICS,        2        1           M0         -
conditional                        EORS, ORNS,
                                   ORRS

Test/Compare, shift by immed       CMN, CMP, TEQ,     2        2           M          -
                                   TST

Branch forms                                          +1       2           +B         2

Notes:
1. The latency is 2, throughput is 2, and the utilized pipeline is M when GCR_EL1.RRND = 1. When GCR_EL1.RRND = 0,
the latency is 3, throughput is 1, and the utilized pipeline is M0.
2. Branch forms are possible when the instruction destination register is the PC. For those cases, an additional branch
µOP is required. This adds 1 cycle to the latency.


3.5 Move and shift instructions


Table 3-6 AArch32 Move and shift instructions
Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Move, basic MOV, MOVW, 1 4 I -


MVN

Move, basic, flagset MOVS, MVNS 1 3 I -

Move, shift by immed, no ASR, LSL, LSR, 1 4 I -


flagset ROR, RRX, MVN

Move, shift by immed, flagset ASRS, LSLS, LSRS, 2 2 M -


RORS, RRXS,
MVNS

Move, shift by register, no ASR, LSL, LSR, 1 4 I -


flagset, unconditional ROR, RRX, MVN

Move, shift by register, no ASR, LSL, LSR, 2 2 I -


flagset, conditional ROR, RRX, MVN

Move, shift by register, flagset ASRS, LSLS, LSRS, 2 1 M0 -


RORS, RRXS,
MVNS

Move, top MOVT 1 4 I -

Move, branch forms +1 2 +B -


3.6 Divide and multiply instructions


Table 3-7 AArch64 Divide and multiply instructions
Instruction Group            AArch64 Instructions            Exec Latency  Execution Throughput  Utilized Pipelines  Notes
Divide, W-form               SDIV, UDIV                      5 to 12       1/12 to 1/5           M0                  1
Divide, X-form               SDIV, UDIV                      5 to 20       1/20 to 1/5           M0                  1
Multiply                     MUL, MNEG                       2             2                     M                   -
Multiply accumulate, W-form  MADD, MSUB                      2(1)          1                     M0                  2
Multiply accumulate, X-form  MADD, MSUB                      2(1)          1                     M0                  2
Multiply accumulate long     SMADDL, SMSUBL, UMADDL, UMSUBL  2(1)          1                     M0                  2
Multiply high                SMULH, UMULH                    3             2                     M                   2
Multiply long                SMNEGL, SMULL, UMNEGL, UMULL    2             2                     M                   -

Table 3-8 AArch32 Divide and multiply instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Divide SDIV, UDIV 5 to 12 1/12 to 1/5 M0 1

Multiply, unconditional MUL, SMULBB, 2 2 M -


SMULBT,
SMULTB,
SMULTT,
SMULWB,
SMULWT,
SMMUL{R},
SMUAD{X},
SMUSD{X}

Multiply, conditional MUL, SMULBB, 2 1 M0 -


SMULBT,
SMULTB,
SMULTT,
SMULWB,
SMULWT,
SMMUL{R},
SMUAD{X},
SMUSD{X}


Multiply accumulate, MLA, MLS, 3 1 M0, I -


conditional SMLABB,
SMLABT,
SMLATB,
SMLATT,
SMLAWB,
SMLAWT,
SMLAD{X},
SMLSD{X},
SMMLA{R},
SMMLS{R}

Multiply accumulate, MLA, MLS, 2(1) 1 M0 2


unconditional SMLABB,
SMLABT,
SMLATB,
SMLATT,
SMLAWB,
SMLAWT,
SMLAD{X},
SMLSD{X},
SMMLA{R},
SMMLS{R}

Multiply accumulate UMAAL 4 1 I, M0 -


accumulate long, conditional

Multiply accumulate UMAAL 3 1 I, M0 -


accumulate long, unconditional

Multiply accumulate long, no SMLAL, 3 1 M0, I -


flagset SMLALBB,
SMLALBT,
SMLALTB,
SMLALTT,
SMLALD{X},
SMLSLD{X},
UMLAL

Multiply accumulate long, SMLAL, 4 1 M0, I -


flagset SMLALBB,
SMLALBT,
SMLALTB,
SMLALTT,
SMLALD{X},
SMLSLD{X},
UMLAL

Multiply long, unconditional, SMULL, UMULL 2 2 M -


no flagset

Multiply long, unconditional, SMULLS, UMULLS 3 1 M, I -


flagset


Multiply long, conditional SMULL{S}, 3 1 M, I -


UMULL{S}

Notes:
1. Integer divides are performed using an iterative algorithm and block any subsequent divide operations until
complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
Accumulator forwarding is not supported for consumers of 64-bit multiply high operations.
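
As a rough illustration of notes 1 and 2, the following Python sketch estimates cycle counts for back-to-back integer divides and for a dependent multiply-accumulate chain. It is a simplified model under stated assumptions, not a description of the core: the function names are hypothetical, the default values come from the W-form divide and MADD rows of Table 3-7, each divide is assumed to fully block the next one, and every accumulate operand is assumed to be late-forwarded.

```python
def divide_chain_cycles(num_divides, min_latency=5, max_latency=12):
    """Return (best, worst) cycle bounds for back-to-back integer divides.

    Assumption: a divide blocks the next divide until it completes (note 1),
    so the bounds scale linearly with the number of divides.
    """
    return num_divides * min_latency, num_divides * max_latency


def mac_chain_cycles(num_macs, result_latency=2, accumulate_latency=1):
    """Return cycles until the last result of a dependent MAC chain.

    Assumption: each MAC only depends on the previous one through its
    accumulate operand, so it can issue `accumulate_latency` cycles after
    its predecessor (note 2). The final MAC still needs the full
    `result_latency` to produce its result.
    """
    if num_macs <= 0:
        return 0
    return result_latency + (num_macs - 1) * accumulate_latency


if __name__ == "__main__":
    print("3 W-form divides:", divide_chain_cycles(3))
    print("4 chained MADDs, forwarded:", mac_chain_cycles(4))
    # Without forwarding, model each MAC as waiting the full result latency.
    print("4 chained MADDs, not forwarded:", mac_chain_cycles(4, accumulate_latency=2))
```

Under these assumptions, four dependent MADD µOPs complete in about 5 cycles, compared with 8 cycles if each one waited for the full 2-cycle result latency of its predecessor.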


3.7 Saturating and parallel arithmetic instructions


Table 3-9 AArch32 Saturating and parallel arithmetic instructions
Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Parallel arith, unconditional SADD16, SADD8, 2 1 M -


SSUB16, SSUB8,
UADD16, UADD8,
USUB16, USUB8

Parallel arith, conditional SADD16, SADD8, 2(4) 1 M0, I 1


SSUB16, SSUB8,
UADD16, UADD8,
USUB16, USUB8

Parallel arith with exchange, SASX, SSAX, 3 2 I, M -


unconditional UASX, USAX

Parallel arith with exchange, SASX, SSAX, 3(5) 1 I, M0 1


conditional UASX, USAX

Parallel halving arith, SHADD16, 2 2 M -


unconditional SHADD8,
SHSUB16,
SHSUB8,
UHADD16,
UHADD8,
UHSUB16,
UHSUB8

Parallel halving arith, SHADD16, 2 1 M0 -


conditional SHADD8,
SHSUB16,
SHSUB8,
UHADD16,
UHADD8,
UHSUB16,
UHSUB8

Parallel halving arith with SHASX, SHSAX, 3 1 I, M0 -


exchange UHASX, UHSAX

Parallel saturating arith, QADD16, 2 2 M -


unconditional QADD8, QSUB16,
QSUB8,
UQADD16,
UQADD8,
UQSUB16,
UQSUB8


Parallel saturating arith, QADD16, 2 1 M0 -


conditional QADD8, QSUB16,
QSUB8,
UQADD16,
UQADD8,
UQSUB16,
UQSUB8

Parallel saturating arith with QASX, QSAX, 3 2 I, M -


exchange, unconditional UQASX, UQSAX

Parallel saturating arith with QASX, QSAX, 3 1 I, M0 -


exchange, conditional UQASX, UQSAX

Saturate, unconditional SSAT, SSAT16, 2 2 M -


USAT, USAT16

Saturate, conditional SSAT, SSAT16, 2 1 M0 -


USAT, USAT16

Saturating arith, unconditional QADD, QSUB 2 2 M -

Saturating arith, conditional QADD, QSUB 2 1 M0 -

Saturating doubling arith, QDADD, QDSUB 3 1 M, M -


unconditional

Saturating doubling arith, QDADD, QDSUB 3 1 M, M0 -


conditional

Notes:
1. GE-setting instructions require three extra µOPs and two additional cycles to conditionally update the GE field (GE
latency shown in parentheses).

3.8 Pointer Authentication Instructions


Table 3-10 AArch64 pointer authentication instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Authenticate data address AUTDA, AUTDB, 5 1 M0 -


AUTDZA,
AUTDZB

Authenticate instruction AUTIA, AUTIB, 5 1 M0 -


address AUTIA1716,
AUTIB1716,
AUTIASP,
AUTIBSP, AUTIAZ,
AUTIBZ, AUTIZA,
AUTIZB


Branch and link, register, with BLRAA, BLRAAZ, 6 1 M0, B


pointer authentication BLRAB, BLRABZ

Branch, register, with pointer BRAA, BRAAZ, 6 1 M0, B


authentication BRAB, BRABZ

Branch, return, with pointer RETA, RETB 6 1 M0, B


authentication

Compute pointer PACDA, PACDB, 5 1 M0


authentication code for data PACDZA,
address PACDZB

Compute pointer PACGA 5 1 M0


authentication code, using
generic key

Compute pointer PACIA, PACIB, 5 1 M0


authentication code for PACIA1716,
instruction address PACIB1716,
PACIASP,
PACIBSP, PACIAZ,
PACIBZ, PACIZA,
PACIZB

Load register, with pointer LDRAA, LDRAB 9 1 M0, L


authentication

Strip pointer authentication XPACD, XPACI, 2 1 M0


code XPACLRI


3.9 Miscellaneous data-processing instructions


Table 3-11 AArch64 Miscellaneous data-processing instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Address generation ADR, ADRP 1 4 I -

Bitfield extract, one reg EXTR 1 4 I -

Bitfield extract, two regs EXTR 3 2 I, M -

Bitfield move, basic SBFM, UBFM 1 4 I -

Bitfield move, insert BFM 2 2 M -

Count leading CLS, CLZ 1 4 I -

Move immed MOVN, MOVK, 1 4 I -


MOVZ

Reverse bits/bytes RBIT, REV, REV16, 1 4 I -


REV32

Variable shift ASRV, LSLV, 1 4 I -


LSRV, RORV

Table 3-12 AArch32 Miscellaneous data-processing instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Bit field extract SBFX, UBFX 1 4 I -

Bit field insert/clear, BFI, BFC 2 2 M -


unconditional

Bit field insert/clear, BFI, BFC 2 1 M0 -


conditional

Count leading zeros CLZ 1 4 I -

Pack halfword, unconditional PKH 2 2 M -

Pack halfword, conditional PKH 2 1 M0 -

Reverse bits/bytes RBIT, REV, REV16, 1 4 I -


REVSH

Select bytes, unconditional SEL 1 4 I -

Select bytes, conditional SEL 2 2 I -

Sign/zero extend, normal SXTB, SXTH, 1 4 I -


UXTB, UXTH

Sign/zero extend, parallel, SXTB16, UXTB16 2 2 M -


unconditional


Sign/zero extend, parallel, SXTB16, UXTB16 2 1 M0 -


conditional

Sign/zero extend and add, SXTAB, SXTAH, 2 2 M -


normal, unconditional UXTAB, UXTAH

Sign/zero extend and add, SXTAB, SXTAH, 2 1 M0 -


normal, conditional UXTAB, UXTAH

Sign/zero extend and add, SXTAB16, 4 1 M -


parallel, unconditional UXTAB16

Sign/zero extend and add, SXTAB16, 4 1 M, M0 -


parallel, conditional UXTAB16

Sum of absolute differences USAD8 2 1 M0 -

Sum of absolute differences USADA8 2 1 M0 -


accumulate, unconditional

Sum of absolute differences USADA8 3 1 M0, I -


accumulate, conditional

3.10 Load instructions


The latencies shown assume the memory access hits in the Level 1 Data Cache and represent the maximum latency to
load all the registers written by the instruction.

Table 3-13 AArch64 Load instructions


Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Load register, literal LDR, LDRSW, 4 3 L -


PRFM

Load register, unscaled immed LDUR, LDURB, 4 3 L -


LDURH, LDURSB,
LDURSH,
LDURSW, PRFUM

Load register, immed post- LDR, LDRB, LDRH, 4 3 L, I -


index LDRSB, LDRSH,
LDRSW

Load register, immed pre-index LDR, LDRB, LDRH, 4 3 L, I -


LDRSB, LDRSH,
LDRSW

Load register, immed LDTR, LDTRB, 4 3 L -


unprivileged LDTRH, LDTRSB,
LDTRSH, LDTRSW


Load register, unsigned immed LDR, LDRB, LDRH, 4 3 L -


LDRSB, LDRSH,
LDRSW, PRFM

Load register, register offset, LDR, LDRB, LDRH, 4 3 L -


basic LDRSB, LDRSH,
LDRSW, PRFM

Load register, register offset, LDR, LDRSW, 4 3 L -


scale by 4/8 PRFM

Load register, register offset, LDRH, LDRSH 4 3 L -


scale by 2

Load register, register offset, LDR, LDRB, LDRH, 4 3 L -


extend LDRSB, LDRSH,
LDRSW, PRFM

Load register, register offset, LDR, LDRSW, 4 3 L -


extend, scale by 4/8 PRFM

Load register, register offset, LDRH, LDRSH 4 3 L -


extend, scale by 2

Load pair, signed immed offset, LDP, LDNP 4 3 L -


normal, W-form

Load pair, signed immed offset, LDP, LDNP 4 1.5 L -


normal, X-form

Load pair, signed immed offset, LDPSW 5 1 I, L -


signed words

Load pair, immed post-index or LDP 4 3 L, I -


immed pre-index, normal, W-
form

Load pair, immed post-index or LDP 4 1.5 L, I -


immed pre-index, normal, X-
form

Load pair, immed post-index or LDPSW 5 1 I, L -


immed pre-index, signed
words

Table 3-14 AArch32 Load instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Load, immed offset LDR{T}, LDRB{T}, 4 3 L 1, 2


LDRD, LDRH{T},
LDRSB{T},
LDRSH{T}


Load, register offset, plus LDR, LDRB, LDRD, 4 3 L 1, 2


LDRH, LDRSB,
LDRSH

Load, register offset, minus LDR, LDRB, LDRD, 5 3 I, L 1, 2


LDRH, LDRSB,
LDRSH

Load, scaled register offset, LDR, LDRB 4 3 L 1, 2


plus, LSL2

Load, scaled register offset, LDR, LDRB, LDRH, 5 3 I, L 1, 2


other LDRSB, LDRSH

Load, immed pre-indexed LDR, LDRB, LDRD, 4 3 L, I 1, 2


LDRH, LDRSB,
LDRSH

Load, register pre-indexed LDRH, LDRSB, 5 3 I, L, M0 1, 2, 3


LDRSH

Load, register pre-indexed LDRD 4 3 L, M0 1, 2, 3

Load, scaled register pre- LDR, LDRB 4 3 L, M0 1, 2, 3


indexed, plus, LSL2

Load, scaled register pre- LDR, LDRB 4 3 L, M0 1, 2, 3


indexed, unshifted

Load, scaled register pre- LDR, LDRB 5 3 I, L, M0 1, 2, 3


indexed, other

Load, immed post-indexed LDR{T}, LDRB{T}, 4 3 L, I 1, 2


LDRD, LDRH{T},
LDRSB{T},
LDRSH{T}

Load, register post-indexed LDR{T}, LDRB{T}, 5 3 I, L, M0 1, 2, 3


LDRH{T},
LDRSB{T},
LDRSH{T}

Load, register post-indexed LDRD 4 3 L, M0 1, 2, 3

Preload, immed offset PLD, PLDW 4 3 L -

Preload, register offset, plus, PLD, PLDW 4 3 L -


LSL2 and unshifted

Preload, register offset, minus PLD, PLDW 5 3 I, L -

Load multiple, no writeback, LDMIA, LDMIB, N 3/R L 1, 4, 5


base reg not in list LDMDA, LDMDB

Load multiple, no writeback, LDMIA, LDMIB, 1+ N 3/R I, L 1, 4, 5


base reg in list LDMDA, LDMDB


Load multiple, writeback LDMIA, LDMIB, 1+ N 3/R L, I 1, 4, 5


LDMDA, LDMDB,
POP

(Load, all branch forms) - +1 - +B 6

Notes:
1. Conditional loads have extra µOP(s) that go down pipeline I and have 1 cycle of extra latency compared to their
unconditional counterparts.
2. Conditional loads go down the L01 pipe and have an execution throughput of 2, whereas unconditional versions have a
throughput of 3.
3. The address update op goes down pipeline ‘I’ if the load is unconditional.
4. N is floor[(num_reg + 5)/6].
5. R is floor[(num_reg + 1)/2].
6. Branch forms are possible when the instruction destination register is the PC. For those cases, an additional branch
µOP is required. This adds 1 cycle to the latency.
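
The load multiple formulas in notes 4 and 5 can be evaluated with a small Python sketch like the one below. The function names are hypothetical, and the sketch only covers the unconditional rows of Table 3-14; it does not model the extra cycle for conditional loads (note 1) or for branch forms (note 6).

```python
from fractions import Fraction


def ldm_n(num_reg):
    """Note 4: N = floor((num_reg + 5) / 6)."""
    return (num_reg + 5) // 6


def ldm_r(num_reg):
    """Note 5: R = floor((num_reg + 1) / 2)."""
    return (num_reg + 1) // 2


def ldm_latency(num_reg, extra_cycle=False):
    """Exec latency from the table: N, or 1 + N when `extra_cycle` is set.

    Set `extra_cycle` for the "base reg in list" and "writeback" rows.
    """
    return ldm_n(num_reg) + (1 if extra_cycle else 0)


def ldm_throughput(num_reg):
    """Execution throughput from the table: 3/R, kept as an exact fraction."""
    return Fraction(3, ldm_r(num_reg))


if __name__ == "__main__":
    for regs in (2, 4, 8, 16):
        print(regs, ldm_latency(regs), ldm_latency(regs, True), ldm_throughput(regs))
```

For example, an 8-register LDMIA without writeback gives N = 2 and R = 4, that is, a latency of 2 and a throughput of 3/4 in the table's units.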


3.11 Store instructions


The following tables describe performance characteristics for standard store instructions. Store
instructions are split into address and data µOPs. Once executed, stores are buffered and committed
in the background.

Table 3-15 AArch64 Store instructions


Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Store register, unscaled immed STUR, STURB, 1 2 L01, D -


STURH

Store register, immed post- STR, STRB, STRH 1 2 L01, D, I -


index

Store register, immed pre- STR, STRB, STRH 1 2 L01, D, I -


index

Store register, immed STTR, STTRB, 1 2 L01, D -


unprivileged STTRH

Store register, unsigned immed STR, STRB, STRH 1 2 L01, D -

Store register, register offset, STR, STRB, STRH 1 2 L01, D -


basic

Store register, register offset, STR 1 2 L01, D -


scaled by 4/8

Store register, register offset, STRH 1 2 I, L01, D -


scaled by 2

Store register, register offset, STR, STRB, STRH 1 2 L01, D -


extend

Store register, register offset, STR 1 2 L01, D -


extend, scale by 4/8

Store register, register offset, STRH 1 2 I, L01, D -


extend, scale by 2

Store pair, immed offset STP, STNP 1 2 L01, D -

Store pair, immed post-index STP 1 2 L01, D, I -

Store pair, immed pre-index STP 1 2 L01, D, I -

Table 3-16 AArch32 Store instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Store, immed offset STR{T}, STRB{T}, 1 2 L01, D -


STRD, STRH{T}

Store, register offset, plus STR, STRB, STRD, 1 2 L01, D -


STRH


Store, register offset, minus STR, STRB, STRD, 1 2 L01, D -


STRH

Store, scaled register STR, STRB 1 2 L01, D -


offset, plus, no shift

Store, scaled register offset, STR, STRB 1 2 L01, D -


plus, LSL2

Store, scaled register offset, STR, STRB 2 2 I, L01, D -


plus, other

Store, scaled register offset, STR, STRB 2 2 I, L01, D -


minus

Store, immed pre-indexed STR, STRB, STRD, 1 2 L01, D, I -


STRH

Store, register pre-indexed, STR, STRB, STRD, 1 2 L01, D, M0 1


plus, no shift STRH

Store, register pre-indexed, STR, STRB, STRD, 2 2 I, L01, D, M0 1


minus STRH

Store, scaled register pre- STR, STRB 1 2 L01, D, M0 1


indexed, plus LSL2

Store, scaled register pre- STR, STRB 2 2 I, L01, D, M0 1


indexed, other

Store, immed post-indexed STR{T}, STRB{T}, 1 2 L01, D, I -


STRD, STRH{T}

Store, register post-indexed STRH{T}, STRD 1 2 L01, D, M0 1

Store, register post-indexed STR{T}, STRB{T} 1 2 L01, D, M0 1

Store, scaled register post- STR{T}, STRB{T} 1 2 L01, D, M0 2


indexed

Store multiple, no writeback STMIA, STMIB, N 1/N L01, D 3


STMDA, STMDB

Store multiple, writeback STMIA, STMIB, N 1/N L01, D 3


STMDA, STMDB,
PUSH

Notes:
1. The address update op goes down pipeline ‘I’ if the store is unconditional.
2. The address update op goes down pipeline “M” if the store is unconditional.
3. For store multiple instructions, N=floor((num_regs+3)/4).
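
As a quick check of note 3, the sketch below evaluates N and the resulting latency and throughput entries for store multiple instructions. The function names are hypothetical and the code does nothing more than apply the formula and the table entries (latency N, throughput 1/N).

```python
from fractions import Fraction


def stm_n(num_regs):
    """Note 3: N = floor((num_regs + 3) / 4)."""
    return (num_regs + 3) // 4


def stm_latency(num_regs):
    """Exec latency for store multiple is listed as N."""
    return stm_n(num_regs)


def stm_throughput(num_regs):
    """Execution throughput for store multiple is listed as 1/N."""
    return Fraction(1, stm_n(num_regs))


if __name__ == "__main__":
    for regs in (1, 4, 6, 16):
        print(regs, stm_latency(regs), stm_throughput(regs))
```

A 6-register PUSH, for instance, gives N = 2, so the table entries evaluate to a latency of 2 and a throughput of 1/2.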


3.12 Tag Load Instructions


Table 3-17 AArch64 Tag load instructions
Instruction Group              AArch64 Instructions  Exec Latency  Execution Throughput  Utilized Pipelines  Notes
Load allocation tag            LDG                   4             3                     L                   -
Load multiple allocation tags  LDGM                  4             3                     L                   -

3.13 Tag Store instructions


Table 3-18 AArch64 Tag store instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Store allocation tags to one or STG, ST2G 1 2 L01, D, I -


two granules, post-index

Store allocation tags to one or STG, ST2G 1 2 L01, D, I -


two granules, pre-index

Store allocation tags to one or STG, ST2G 1 2 L01, D -


two granules, signed offset

Store allocation tag to one or STZG, STZ2G 1 2 L01, D, I -


two granules, zeroing, post-
index

Store allocation tag to one or STZG, STZ2G 1 2 L01, D, I -


two granules, zeroing, pre-
index

Store allocation tag to two STZG, STZ2G 1 2 L01, D -


granules, zeroing, signed offset

Store allocation tag and reg STGP 1 2 L01, D, I -


pair to memory, post-index

Store allocation tag and reg STGP 1 2 L01, D, I -


pair to memory, pre-index

Store allocation tag and reg STGP 1 2 L01, D -


pair to memory, signed offset

Store multiple allocation tags STGM 1 2 L01, D -


Store multiple allocation tags, STZGM 1 2 L01, D -


zeroing

3.14 FP data processing instructions


Table 3-19 AArch64 FP data processing instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

FP absolute value FABS, FABD 2 2 V -

FP arithmetic FADD, FSUB 2 2 V -

FP compare FCCMP{E}, 2 1 V0 -
FCMP{E}

FP divide, H-form FDIV 7 2/7 V0 1

FP divide, S-form FDIV 7 to 10 2/9 to 2/7 V0 1

FP divide, D-form FDIV 7 to 15 1/7 to 2/7 V0 1

FP min/max FMIN, FMINNM, 2 2 V -


FMAX, FMAXNM

FP multiply FMUL, FNMUL 3 2 V 2

FP multiply accumulate FMADD, FMSUB, 4 (2) 2 V 3


FNMADD,
FNMSUB

FP negate FNEG 2 2 V -

FP round to integral FRINTA, FRINTI, 3 1 V0 -


FRINTM, FRINTN,
FRINTP, FRINTX,
FRINTZ,
FRINT32X,
FRINT64X,
FRINT32Z,
FRINT64Z

FP select FCSEL 2 2 V -

FP square root, H-form FSQRT 7 4/7 V0 1

FP square root, S-form FSQRT 7 to 9 1/2 to 4/7 V0 1

FP square root, D-form FSQRT 7 to 16 2/15 to 2/7 V0 1


Table 3-20 AArch32 FP data processing instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

VFP absolute value VABS 2 2 V -

VFP arith VADD, VSUB 2 2 V -

VFP compare, unconditional VCMP, VCMPE 2 1 V0 -

VFP compare, conditional VCMP, VCMPE 4 1 V, V0 -

VFP convert VCVT{R}, VCVTB, 3 1 V0 -


VCVTT, VCVTA,
VCVTM, VCVTN,
VCVTP

VFP convert to BFloat16 VCVTB, VCVTT 3 1 V0 -

VFP divide, H-form VDIV 7 4/7 V0 1

VFP divide, S-form VDIV 7 to 10 4/9 to 4/7 V0 1

VFP divide, D-form VDIV 7 to 15 1/7 to 2/7 V0 1

VFP max/min VMAXNM, 2 2 V -


VMINNM

VFP multiply VMUL, VNMUL 3 2 V 2

VFP multiply accumulate VMLA, VMLS, 5 (2) 2 V 3


(chained) VNMLA, VNMLS

VFP multiply accumulate VFMA, VFMS, 4 (2) 2 V 3


(fused) VFNMA, VFNMS

VFP negate VNEG 2 2 V -

VFP round to integral VRINTA, VRINTM, 3 1 V0 -


VRINTN, VRINTP,
VRINTR, VRINTX,
VRINTZ

VFP select VSELEQ, VSELGE, 2 2 V -


VSELGT, VSELVS

VFP square root, H-form VSQRT 7 4/7 V0 1

VFP square root, S-form VSQRT 7 to 9 1/2 to 4/7 V0 1

VFP square root, D-form VSQRT 7 to 16 2/15 to 2/7 V0 1

Notes:
1. FP divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply µOPs to the accumulate
operands of an FP multiply-accumulate µOP. The latter can potentially be issued 1 cycle after the FP multiply µOP has
been issued.
3. FP multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in
parentheses).
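
Notes 2 and 3 can be illustrated with a simplified Python model. The names below are hypothetical and the constants are taken from the AArch64 FMUL and FMADD rows of Table 3-19 (latency 3 for FMUL, 4 (2) for FMADD). The model assumes a single dependent chain and ignores issue width, so it is a sketch of the latency arithmetic rather than a prediction of measured performance.

```python
FMUL_LATENCY = 3        # FP multiply, Table 3-19
FMADD_LATENCY = 4       # FP multiply accumulate, full result latency
FMADD_ACC_LATENCY = 2   # accumulate latency shown in parentheses


def fmadd_chain_cycles(num_fmadd, forwarded=True):
    """Cycles until the last result of a chain linked through the accumulator.

    With late forwarding (note 3), each FMADD can issue FMADD_ACC_LATENCY
    cycles after the previous one. Without it, the model assumes each FMADD
    waits for the full FMADD_LATENCY of its predecessor.
    """
    if num_fmadd <= 0:
        return 0
    step = FMADD_ACC_LATENCY if forwarded else FMADD_LATENCY
    return FMADD_LATENCY + (num_fmadd - 1) * step


def fmul_then_fmadd_cycles(forwarded=True):
    """Cycles for an FMUL whose result feeds the accumulate operand of an FMADD.

    With late forwarding (note 2), the FMADD can potentially issue 1 cycle
    after the FMUL. Otherwise the model waits for the full FMUL latency.
    """
    issue_cycle = 1 if forwarded else FMUL_LATENCY
    return issue_cycle + FMADD_LATENCY


if __name__ == "__main__":
    print("8 chained FMADDs, forwarded:", fmadd_chain_cycles(8))
    print("8 chained FMADDs, not forwarded:", fmadd_chain_cycles(8, forwarded=False))
    print("FMUL then FMADD, forwarded:", fmul_then_fmadd_cycles())
    print("FMUL then FMADD, not forwarded:", fmul_then_fmadd_cycles(forwarded=False))
```

In this model, an 8-element accumulation through one register takes 18 cycles with forwarding and 32 cycles without it, which is why the accumulate latency in parentheses matters more than the full latency for long reduction loops.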


3.15 FP miscellaneous instructions


Table 3-21 AArch64 FP miscellaneous instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

FP convert, from gen to vec reg SCVTF, UCVTF 3 1 M0 -

FP convert, from vec to gen reg FCVTAS, FCVTAU, 3 1 V -


FCVTMS,
FCVTMU,
FCVTNS,
FCVTNU, FCVTPS,
FCVTPU, FCVTZS,
FCVTZU

FP convert, JavaScript from vec FJCVTZS 3 1 V0 -


to gen reg

FP convert, from vec to vec reg FCVT, FCVTXN 3 1 V0 -

FP move, immed FMOV 2 2 V -

FP move, register FMOV 2 2 V -

FP transfer, from gen to low FMOV 3 1 M0 -


half of vec reg

FP transfer, from gen to high FMOV 5 1 M0, V -


half of vec reg

FP transfer, from vec to gen reg FMOV 2 1 V -

Table 3-22 AArch32 FP miscellaneous instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

VFP move, extraction VMOVX 2 2 V -

VFP move, immed VMOV 2 2 V -

VFP move, insert VINS 2 2 V -

VFP move, register VMOV 2 2 V -

VFP transfer, core to vfp, single VMOV 5 1 M0, V -


reg to S-reg, cond

VFP transfer, core to vfp, single VMOV 3 1 M0 -


reg to S-reg, uncond

VFP transfer, core to vfp, single VMOV 5 1 M0, V -


reg to upper/lower half of D-
reg


VFP transfer, core to vfp, 2 regs VMOV 6 1/2 M0, V -


to 2 S-regs, cond

VFP transfer, core to vfp, 2 regs VMOV 4 1/2 M0 -


to 2 S-regs, uncond

VFP transfer, core to vfp, 2 regs VMOV 5 1 M0, V -


to D-reg, cond

VFP transfer, core to vfp, 2 regs VMOV 3 1 M0 -


to D-reg, uncond

VFP transfer, vfp S-reg or VMOV 3 1 V, I -


upper/lower half of vfp D-reg
to core reg, cond

VFP transfer, vfp S-reg or VMOV 2 1 V -


upper/lower half of vfp D-reg
to core reg, uncond

VFP transfer, vfp 2 S-regs or D- VMOV 3 1 V, I -


reg to 2 core regs, cond

VFP transfer, vfp 2 S-regs or D- VMOV 2 1 V -


reg to 2 core regs, uncond

3.16 FP load instructions


The latencies shown assume the memory access hits in the Level 1 Data Cache and represent the
maximum latency to load all the vector registers written by the instruction. Compared to
standard loads, an extra cycle is required to forward results to FP/ASIMD pipelines.

Table 3-23 AArch64 FP load instructions


Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Load vector reg, literal, S/D/Q LDR 6 3 L -


forms

Load vector reg, unscaled LDUR 6 3 L -


immed

Load vector reg, immed post- LDR 6 3 L, I -


index

Load vector reg, immed pre- LDR 6 3 L, I -


index

Load vector reg, unsigned LDR 6 3 L -


immed


Load vector reg, register offset, LDR 6 3 L -


basic

Load vector reg, register offset, LDR 6 3 L -


scale, S/D-form

Load vector reg, register offset, LDR 7 3 I, L -


scale, H/Q-form

Load vector reg, register offset, LDR 6 3 L -


extend

Load vector reg, register offset, LDR 6 3 L -


extend, scale, S/D-form

Load vector reg, register offset, LDR 7 3 I, L -


extend, scale, H/Q-form

Load vector pair, immed offset, LDP, LDNP 6 3 L -


S/D-form

Load vector pair, immed offset, LDP, LDNP 6 3/2 L -


Q-form

Load vector pair, immed post- LDP 6 3 I, L -


index, S/D-form

Load vector pair, immed post- LDP 6 3/2 L, I -


index, Q-form

Load vector pair, immed pre- LDP 6 3 I, L -


index, S/D-form

Load vector pair, immed pre- LDP 6 3/2 L, I -


index, Q-form

Table 3-24 AArch32 FP load instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

FP load, register VLDR 6 3 (2) L 1, 6, 7

FP load multiple, S form VLDMIA, N(N*) 3/R (2/R) L 1, 2, 3, 4,


VLDMDB, VPOP 6, 7

FP load multiple, D form VLDMIA, N(N*) 3/R (2/R) L, V 1, 2, 3, 4,


VLDMDB, VPOP 6, 7

(FP load, writeback forms) - (1) - +I 5, 7

Notes:
1. Conditional loads have an extra µOP which goes down pipeline V and have 2 cycles of extra latency compared to their
unconditional counterparts.
2. N is (num_reg)/6 + 5.
3. N* is (num_reg)/4 + 5.
4. R is num_reg/2.
5. Writeback forms of load instructions require an extra µOP to update the base address. This update is typically
performed in parallel with or prior to the load µOP (update latency shown in parentheses).
6. The number in parentheses represents the latency and throughput of conditional loads.
7. Conditional loads go down the L01 pipe.

3.17 FP store instructions


Store MOPs are split into store address and store data µOPs. Once executed, stores are buffered
and committed in the background.

Table 3-25 AArch64 FP store instructions


Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Store vector reg, unscaled STUR 2 2 L01, V -


immed, B/H/S/D-form

Store vector reg, unscaled STUR 2 2 L01, V -


immed, Q-form

Store vector reg, immed post- STR 2 2 L01, V, I -


index, B/H/S/D-form

Store vector reg, immed post- STR 2 2 L01, V, I -


index, Q-form

Store vector reg, immed pre- STR 2 2 L01, V, I -


index, B/H/S/D-form

Store vector reg, immed pre- STR 2 2 L01, V, I -


index, Q-form

Store vector reg, unsigned STR 2 2 L01, V -


immed, B/H/S/D-form

Store vector reg, unsigned STR 2 2 L01, V -


immed, Q-form

Store vector reg, register offset, STR 2 2 L01, V -


basic, B/H/S/D-form

Store vector reg, register offset, STR 2 2 L01, V -


basic, Q-form

Store vector reg, register offset, STR 2 2 I, L01, V -


scale, H-form

Store vector reg, register offset, STR 2 2 L01, V -


scale, S/D-form

Store vector reg, register offset, STR 2 2 I, L01, V -


scale, Q-form


Store vector reg, register offset, STR 2 2 L01, V -


extend, B/H/S/D-form

Store vector reg, register offset, STR 2 2 L01, V -


extend, Q-form

Store vector reg, register offset, STR 2 2 I, L01, V -


extend, scale, H-form

Store vector reg, register offset, STR 2 2 L01, V -


extend, scale, S/D-form

Store vector reg, register offset, STR 2 2 I, L01, V -


extend, scale, Q-form

Store vector pair, immed offset, STP, STNP 2 2 L01, V -


S-form

Store vector pair, immed offset, STP, STNP 2 2 L01, V -


D-form

Store vector pair, immed offset, STP, STNP 2 1 L01, V -


Q-form

Store vector pair, immed post- STP 2 2 I, L01, V -


index, S-form

Store vector pair, immed post- STP 2 2 I, L01, V -


index, D-form

Store vector pair, immed post- STP 2 1 I, L01, V -


index, Q-form

Store vector pair, immed pre- STP 2 2 I, L01, V -


index, S-form

Store vector pair, immed pre- STP 2 2 I, L01, V -


index, D-form

Store vector pair, immed pre- STP 2 1 I, L01, V -


index, Q-form

Table 3-26 AArch32 FP store instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

FP store, immed offset VSTR 2 2 L01, V -

FP store multiple, S-form VSTMIA, N+1 2/R L01, V 1, 2


VSTMDB, VPUSH

FP store multiple, D-form VSTMIA, N+1 2/R L01, V 1, 2


VSTMDB, VPUSH

(FP store, writeback forms) - (1) - +I 3


Notes:
1. For store multiple instructions, N = (num_reg/2).
2. R is num_reg.
3. Writeback forms of store instructions require an extra µOP to update the base address. This update is typically
performed in parallel with or prior to the store µOP (update latency shown in parentheses).

3.18 ASIMD integer instructions


Table 3-27 AArch64 ASIMD integer instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD absolute diff SABD, UABD 2 2 V -

ASIMD absolute diff accum SABA, UABA 4(1) 1 V1 2

ASIMD absolute diff accum SABAL(2), 4(1) 1 V1 2


long UABAL(2)

ASIMD absolute diff long SABDL(2), 2 2 V -


UABDL(2)

ASIMD arith, basic ABS, ADD, NEG, 2 2 V -


SADDL(2),
SADDW(2),
SHADD, SHSUB,
SSUBL(2),
SSUBW(2), SUB,
UADDL(2),
UADDW(2),
UHADD, UHSUB,
USUBL(2),
USUBW(2)

ASIMD arith, complex ADDHN(2), 2 2 V -


RADDHN(2),
RSUBHN(2),
SQABS, SQADD,
SQNEG, SQSUB,
SRHADD,
SUBHN(2),
SUQADD,
UQADD, UQSUB,
URHADD,
USQADD

ASIMD arith, pair-wise ADDP, SADDLP, 2 2 V -


UADDLP

ASIMD arith, reduce, 4H/4S ADDV, SADDLV, 2 1 V1 -


UADDLV

ASIMD arith, reduce, 8B/8H ADDV, SADDLV, 4 1 V1, V -


UADDLV


ASIMD arith, reduce, 16B ADDV, SADDLV, 4 1 V1 -


UADDLV

ASIMD compare CMEQ, CMGE, 2 2 V -


CMGT, CMHI,
CMHS, CMLE,
CMLT, CMTST

ASIMD dot product SDOT, UDOT 3(1) 2 V 2

ASIMD dot product using SUDOT, USDOT 3(1) 2 V 2


signed and unsigned integers

ASIMD logical AND, BIC, EOR, 2 2 V -


MOV, MVN, NOT,
ORN, ORR

ASIMD matrix multiply- SMMLA, UMMLA, 3(1) 2 V 2


accumulate USMMLA

ASIMD max/min, basic and SMAX, SMAXP, 2 2 V -


pair-wise SMIN, SMINP,
UMAX, UMAXP,
UMIN, UMINP

ASIMD max/min, reduce, 4H/4S SMAXV, SMINV, 2 2 V1 -


UMAXV, UMINV

ASIMD max/min, reduce, SMAXV, SMINV, 4 1 V1, V -


8B/8H UMAXV, UMINV

ASIMD max/min, reduce, 16B SMAXV, SMINV, 4 1/2 V1 -


UMAXV, UMINV

ASIMD multiply MUL, SQDMULH, 4 1 V0 -


SQRDMULH

ASIMD multiply accumulate MLA, MLS 4(1) 1 V0 1

ASIMD multiply accumulate SQRDMLAH, 4(2) 1 V0 1


high SQRDMLSH

ASIMD multiply accumulate SMLAL(2), 4(1) 1 V0 1


long SMLSL(2),
UMLAL(2),
UMLSL(2)

ASIMD multiply accumulate SQDMLAL(2), 4(2) 1 V0 1


saturating long SQDMLSL(2)

ASIMD multiply/multiply long PMUL, PMULL(2) 3 1 V0 3


(8x8) polynomial, D-form

ASIMD multiply/multiply long PMUL, PMULL(2) 3 1 V0 3


(8x8) polynomial, Q-form

ASIMD multiply long SMULL(2), 3 2 V -


UMULL(2),
SQDMULL(2)


ASIMD pairwise add and SADALP, UADALP 4(1) 1 V1 2


accumulate long

ASIMD shift accumulate SSRA, SRSRA, 4(1) 1 V1 2


USRA, URSRA

ASIMD shift by immed, basic SHL, SHLL(2), 2 1 V1 -


SHRN(2),
SSHLL(2), SSHR,
SXTL(2),
USHLL(2), USHR,
UXTL(2)

ASIMD shift by immed and SLI, SRI 2 1 V1 -


insert, basic

ASIMD shift by immed, RSHRN(2), 4 1 V1 -


complex SQRSHRN(2),
SQRSHRUN(2),
SQSHL{U},
SQSHRN(2),
SQSHRUN(2),
SRSHR,
UQRSHRN(2),
UQSHL,
UQSHRN(2),
URSHR

ASIMD shift by register, basic SSHL, USHL 2 1 V1 -

ASIMD shift by register, SRSHL, SQRSHL, 4 1 V1 -


complex SQSHL, URSHL,
UQRSHL, UQSHL

Table 3-28 AArch32 ASIMD integer instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD absolute diff VABD 2 2 V -

ASIMD absolute diff accum VABA 4(1) 1 V1 2

ASIMD absolute diff accum VABAL 4(1) 1 V1 2


long

ASIMD absolute diff long VABDL 2 2 V -

ASIMD arith, basic VADD, VADDL, 2 2 V -


VADDW, VNEG,
VSUB, VSUBL,
VSUBW


ASIMD arith, complex VABS, VADDHN, 2 2 V -


VHADD, VHSUB,
VQABS, VQADD,
VQNEG, VQSUB,
VRADDHN,
VRHADD,
VRSUBHN,
VSUBHN

ASIMD arith, pair-wise VPADD, VPADDL 2 2 V -

ASIMD compare VCEQ, VCGE, 2 2 V -


VCGT, VCLE, VTST

ASIMD dot product VSDOT, VUDOT 3(1) 2 V 2

ASIMD dot product using VSUDOT, 3(1) 2 V 2


signed and unsigned integers VUSDOT

ASIMD logical VAND, VBIC, 2 2 V -


VMVN, VORR,
VORN, VEOR

ASIMD matrix multiply- VSMMLA, 3(1) 2 V 2


accumulate VUMMLA,
VUSMMLA

ASIMD max/min VMAX, VMIN, 2 2 V -


VPMAX, VPMIN

ASIMD multiply VMUL, 4 1 V0 -


VQDMULH,
VQRDMULH

ASIMD multiply accumulate VMLA, VMLS 4(1) 1 V0 1

ASIMD multiply accumulate VMLAL, VMLSL 4(1) 1 V0 1


long

ASIMD multiply accumulate VQDMLAL, 4 1 V0 -


saturating long VQDMLSL

ASIMD multiply/multiply long VMUL (.P8), 3 1 V0 -


(8x8) polynomial, D-form VMULL (.P8)

ASIMD multiply (8x8) VMUL (.P8) 3 1 V0 -


polynomial, Q-form

ASIMD multiply long VMULL (.S, .I), 3 1 V0 -


VQDMULL

ASIMD pairwise add and VPADAL 4(1) 1 V1 1


accumulate

ASIMD shift accumulate VSRA, VRSRA 4(1) 1 V1 1

ASIMD shift by immed, basic VMOVL, VSHL, 2 1 V1 -


VSHLL, VSHR,
VSHRN

ASIMD shift by immed and VSLI, VSRI 2 1 V1 -


insert, basic

ASIMD shift by immed, VQRSHRN, 4 1 V1 -


complex VQRSHRUN,
VQSHL{U},
VQSHRN,
VQSHRUN,
VRSHR, VRSHRN

ASIMD shift by register, basic VSHL 2 1 V1 -

ASIMD shift by register, VQRSHL, VQSHL, 4 1 V1 -


complex VRSHL

Notes:
1. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of integer multiply-accumulate µOPs to issue one every cycle or one every other cycle (accumulate latency
shown in parentheses).
2. Other accumulate pipelines also support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of such µOPs to issue one every cycle (accumulate latency shown in parentheses).
3. This category includes instructions of the form “PMULL Vd.8H, Vn.8B, Vm.8B” and “PMULL2 Vd.8H, Vn.16B, Vm.16B”.
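
To illustrate notes 1 and 2, the following sketch estimates when the last result of a chain of accumulate µOPs becomes available, where each µOP depends on the previous one only through its accumulate operand. It is a simplified model rather than a description of the pipeline: it assumes the chain is the only work in flight and that every µOP after the first benefits from late-forwarding. The function name is hypothetical.

```python
def accumulate_chain_cycles(num_ops, exec_latency, accum_latency):
    """Estimate cycles until the last result of a dependent accumulate chain.

    exec_latency  -- the Exec Latency column, for example 4 for MLA
    accum_latency -- the value in parentheses, for example 1 for MLA
    Simplified model: the first µOP takes the full exec latency and each
    later µOP issues accum_latency cycles after its predecessor.
    """
    if num_ops < 1:
        raise ValueError("num_ops must be at least 1")
    return exec_latency + (num_ops - 1) * accum_latency
```

Under this model, eight chained MLA µOPs with latency 4(1) complete in 11 cycles, compared with 32 cycles if each µOP had to wait for the full 4-cycle latency of its predecessor.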

3.19 ASIMD floating-point instructions


Table 3-29 AArch64 ASIMD floating-point instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD FP absolute FABS, FABD 2 2 V -


value/difference

ASIMD FP arith, normal FADD, FSUB, 2 2 V -


FADDP

ASIMD FP compare FACGE, FACGT, 2 2 V -


FCMEQ, FCMGE,
FCMGT, FCMLE,
FCMLT

ASIMD FP complex add FCADD 2 2 V -

ASIMD FP complex multiply FCMLA 4(2) 2 V 1


add

ASIMD FP convert, long (F16 to FCVTL(2) 4 1/2 V0 -


F32)

ASIMD FP convert, long (F32 to FCVTL(2) 3 1 V0 -


F64)


ASIMD FP convert, narrow (F32 FCVTN(2) 4 1/2 V0 -


to F16)

ASIMD FP convert, narrow (F64 FCVTN(2), 3 2 V0 -


to F32) FCVTXN(2)

ASIMD FP convert, other, D- FCVTAS, FCVTAU, 3 1 V0 -


form F32 and Q-form F64 FCVTMS,
FCVTMU,
FCVTNS,
FCVTNU, FCVTPS,
FCVTPU, FCVTZS,
FCVTZU, SCVTF,
UCVTF

ASIMD FP convert, other, D- FCVTAS, FCVTAU, 4 1/2 V0 -


form F16 and Q-form F32 FCVTMS,
FCVTMU,
FCVTNS,
FCVTNU, FCVTPS,
FCVTPU, FCVTZS,
FCVTZU, SCVTF,
UCVTF

ASIMD FP convert, other, Q- FCVTAS, FCVTAU, 6 1/4 V0 -


form F16 FCVTMS,
FCVTMU,
FCVTNS,
FCVTNU, FCVTPS,
FCVTPU, FCVTZS,
FCVTZU, SCVTF,
UCVTF

ASIMD FP divide, D-form, F16 FDIV 7 1/7 V0 3

ASIMD FP divide, D-form, F32 FDIV 7 to 10 2/9 to 2/7 V0 3

ASIMD FP divide, Q-form, F16 FDIV 10 to 13 1/13 to 1/10 V0 3

ASIMD FP divide, Q-form, F32 FDIV 7 to 10 1/9 to 1/7 V0 3

ASIMD FP divide, Q-form, F64 FDIV 7 to 15 1/14 to 1/7 V0 3

ASIMD FP max/min, normal FMAX, FMAXNM, 2 2 V -


FMIN, FMINNM

ASIMD FP max/min, pairwise FMAXP, 2 2 V -


FMAXNMP,
FMINP,
FMINNMP

ASIMD FP max/min, reduce, FMAXV, 4 1 V -


F32 and D-form F16 FMAXNMV,
FMINV,
FMINNMV


ASIMD FP max/min, reduce, Q- FMAXV, 6 2/3 V -


form F16 FMAXNMV,
FMINV,
FMINNMV

ASIMD FP multiply FMUL, FMULX 3 2 V 2

ASIMD FP multiply accumulate FMLA, FMLS 4(2) 2 V 1

ASIMD FP multiply accumulate FMLAL(2), 5(2) 2 V 1


long FMLSL(2)

ASIMD FP negate FNEG 2 2 V -

ASIMD FP round, D-form F32 FRINTA, FRINTI, 3 1 V0 -


and Q-form F64 FRINTM, FRINTN,
FRINTP, FRINTX,
FRINTZ,
FRINT32X,
FRINT64X,
FRINT32Z,
FRINT64Z

ASIMD FP round, D-form F16 FRINTA, FRINTI, 4 1/2 V0 -


and Q-form F32 FRINTM, FRINTN,
FRINTP, FRINTX,
FRINTZ,
FRINT32X,
FRINT64X,
FRINT32Z,
FRINT64Z

ASIMD FP round, Q-form F16 FRINTA, FRINTI, 6 1/4 V0 -


FRINTM, FRINTN,
FRINTP, FRINTX,
FRINTZ,
FRINT32X,
FRINT64X,
FRINT32Z,
FRINT64Z

ASIMD FP square root, D-form, FSQRT 7 1/7 V0 3


F16

ASIMD FP square root, D-form, FSQRT 7 to 10 2/9 to 2/7 V0 3


F32

ASIMD FP square root, Q-form, FSQRT 11 to 13 1/13 to 1/11 V0 3


F16

ASIMD FP square root, Q-form, FSQRT 7 to 10 1/9 to 1/7 V0 3


F32

ASIMD FP square root, Q-form, FSQRT 7 to 16 1/15 to 1/7 V0 3


F64


Table 3-30 AArch32 ASIMD floating-point instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD FP absolute value VABS 2 2 V -

ASIMD FP arith VABD, VADD, 2 2 V -


VPADD, VSUB

ASIMD FP compare VACGE, VACGT, 2 2 V -


VACLE, VACLT,
VCEQ, VCGE,
VCGT, VCLE

ASIMD FP complex add VCADD 2 2 V -

ASIMD FP complex multiply VCMLA 4(2) 2 V 2


add

ASIMD FP convert, integer, D- VCVT, VCVTA, 3 1 V0 -


form VCVTM, VCVTN,
VCVTP

ASIMD FP convert, integer, Q- VCVT, VCVTA, 4 1/2 V0 -


form VCVTM, VCVTN,
VCVTP

ASIMD FP convert, fixed, D- VCVT 3 1 V0 -


form

ASIMD FP convert, fixed, Q- VCVT 4 1/2 V0 -


form

ASIMD FP convert, half- VCVT 4 1/2 V0 -


precision

ASIMD FP max/min VMAX, VMIN, 2 2 V -


VPMAX, VPMIN,
VMAXNM,
VMINNM

ASIMD FP multiply VMUL, VNMUL 3 2 V 2

ASIMD FP chained multiply VMLA, VMLS 5(2) 2 V 1


accumulate

ASIMD FP fused multiply VFMA, VFMS 4(2) 2 V 1


accumulate

ASIMD FP multiply accumulate VFMAL, VFMSL 5(2) 2 V 1


long

ASIMD FP negate VNEG 2 2 V -

ASIMD FP round to integral, D- VRINTA, VRINTM, 3 1/2 V0 -


form VRINTN, VRINTP,
VRINTX, VRINTZ


ASIMD FP round to integral, Q- VRINTA, VRINTM, 4 1 V0 -


form VRINTN, VRINTP,
VRINTX, VRINTZ

Notes:
1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing
a typical sequence of floating-point multiply-accumulate µOPs to issue one every N cycles (accumulate latency N
shown in parentheses).
2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP multiply µOPs to the
accumulate operands of an ASIMD FP multiply-accumulate µOP. The latter can potentially be issued 1 cycle after the
ASIMD FP multiply µOP has been issued.
3. ASIMD divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
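
As an illustration of note 1, the sketch below estimates how many independent accumulator registers a loop needs so that dependent multiply-accumulate µOPs do not limit throughput. It applies the usual latency-times-throughput estimate to the Execution Throughput column and the accumulate latency in parentheses. The function name is hypothetical, and the estimate ignores other limits such as load bandwidth or register pressure.

```python
import math
from fractions import Fraction


def accumulators_needed(throughput, accum_latency):
    """Estimate the independent accumulate chains needed to sustain throughput.

    throughput    -- instructions per cycle from the Execution Throughput column
    accum_latency -- accumulate latency shown in parentheses
    Each chain can accept a new µOP only every accum_latency cycles, so
    throughput * accum_latency chains are needed to keep the pipelines busy.
    """
    if throughput <= 0 or accum_latency < 1:
        raise ValueError("throughput must be positive and accum_latency >= 1")
    return math.ceil(Fraction(throughput) * accum_latency)
```

For FMLA, listed as 4(2) with a throughput of 2, this gives four accumulators. Using the full 4-cycle latency in place of the accumulate latency would give eight.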

3.20 ASIMD BFloat16 (BF16) instructions


Table 3-31 AArch64 ASIMD BFloat16 (BF16) instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD convert, F32 to BF16 BFCVTN, 4 1 V0 -


BFCVTN2

ASIMD dot product BFDOT 4(2) 2 V 1

ASIMD matrix multiply BFMMLA 5(3) 2 V 1


accumulate

ASIMD multiply accumulate BFMLALB, 4(2) 2 V 1


long BFMLALT

Scalar convert, F32 to BF16 BFCVT 3 1 V0 -

Table 3-32 AArch32 ASIMD BFloat16 (BF16) instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD convert, F32 to BF16 VCVTB, VCVTT 4 1 V0 -

ASIMD dot product VDOT 4(2) 2 V 1

ASIMD matrix multiply VMMLA 5(3) 2 V 1


accumulate

ASIMD multiply accumulate VFMAB, VFMAT 4(2) 2 V 1


long

Scalar convert, F32 to BF16 VCVT 3 1 V0 -


Notes:
1. ASIMD pipelines that execute these instructions support late-forwarding of accumulate operands from similar µOPs,
allowing a typical sequence of µOPs to issue one every N cycles (accumulate latency N shown in parentheses).


3.21 ASIMD miscellaneous instructions


Table 3-33 AArch64 ASIMD miscellaneous instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD bit reverse RBIT 2 2 V -

ASIMD bitwise insert BIF, BIT, BSL 2 2 V -

ASIMD count CLS, CLZ, CNT 2 2 V -

ASIMD duplicate, gen reg DUP 3 1 M0 -

ASIMD duplicate, element DUP 2 2 V -

ASIMD extract EXT 2 2 V -

ASIMD extract narrow XTN(2) 2 2 V -

ASIMD extract narrow, SQXTN(2), 4 1 V1 -


saturating SQXTUN(2),
UQXTN(2)

ASIMD insert, element to INS 2 2 V -


element

ASIMD move, FP immed FMOV 2 2 V -

ASIMD move, integer immed MOVI, MVNI 2 2 V -

ASIMD reciprocal and square URECPE, 3 1 V0 -


root estimate, D-form U32 URSQRTE

ASIMD reciprocal and square URECPE, 4 1/2 V0 -


root estimate, Q-form U32 URSQRTE

ASIMD reciprocal and square FRECPE, FRSQRTE 3 1 V0 -


root estimate, D-form F32 and
scalar forms

ASIMD reciprocal and square FRECPE, FRSQRTE 4 1/2 V0 -


root estimate, D-form F16 and
Q-form F32

ASIMD reciprocal and square FRECPE, FRSQRTE 6 1/4 V0 -


root estimate, Q-form F16

ASIMD reciprocal exponent FRECPX 3 1 V0 -

ASIMD reciprocal step FRECPS, FRSQRTS 4 2 V -

ASIMD reverse REV16, REV32, 2 2 V -


REV64

ASIMD table lookup, 1 or 2 TBL 2 2 V -


table regs

ASIMD table lookup, 3 table TBL 4 1 V -


regs


ASIMD table lookup, 4 table TBL 4 2/3 V -


regs

ASIMD table lookup extension, TBX 2 2 V -


1 table reg

ASIMD table lookup extension, TBX 4 1 V -


2 table reg

ASIMD table lookup extension, TBX 6 2/3 V -


3 table reg

ASIMD table lookup extension, TBX 6 2/5 V -


4 table reg

ASIMD transfer, element to gen UMOV, SMOV 2 1 V -


reg

ASIMD transfer, gen reg to INS 5 1 M0, V -


element

ASIMD transpose TRN1, TRN2 2 2 V -

ASIMD unzip/zip UZP1, UZP2, ZIP1, 2 2 V -


ZIP2

Table 3-34 AArch32 ASIMD miscellaneous instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD bitwise insert VBIF, VBIT, VBSL 2 2 V -

ASIMD count VCLS, VCLZ, 2 2 V -


VCNT

ASIMD duplicate, core reg VDUP 3 1 M0 -

ASIMD duplicate, scalar VDUP 2 2 V -

ASIMD extract VEXT 2 2 V -

ASIMD move, immed VMOV 2 2 V -

ASIMD move, register VMOV 2 2 V -

ASIMD move, narrowing VMOVN 2 2 V -

ASIMD move, saturating VQMOVN, 4 1 V1 -


VQMOVUN

ASIMD reciprocal estimate, D- VRECPE, 3 1 V0 -


form F32 and F64 VRSQRTE

ASIMD reciprocal estimate, D- VRECPE, 4 1/2 V0 -


form F16 and Q-form F32 VRSQRTE


ASIMD reciprocal estimate, Q- VRECPE, 6 1/4 V0 -


form F16 VRSQRTE

ASIMD reciprocal step VRECPS, 5 2 V -


VRSQRTS

ASIMD reverse VREV16, VREV32, 2 2 V -


VREV64

ASIMD swap VSWP 4 2/3 V -

ASIMD table lookup, 1 or 2 VTBL 2 2 V -


table regs

ASIMD table lookup, 3 table VTBL 4 1 V -


regs

ASIMD table lookup, 4 table VTBL 6 2/3 V -


regs

ASIMD table lookup extension, VTBX 2 2 V -


1 reg

ASIMD table lookup extension, VTBX 4 1 V -


2 table reg

ASIMD table lookup extension, VTBX 6 2/3 V -


3 table reg

ASIMD table lookup extension, VTBX 6 2/5 V -


4 table reg

ASIMD transfer, scalar to core VMOV 2 1 V -


reg, word

ASIMD transfer, scalar to core VMOV 3 1 V, I -


reg, byte/hword

ASIMD transfer, core reg to VMOV 5 1 M0, V -


scalar

ASIMD transpose VTRN 4 2/3 V -

ASIMD unzip/zip VUZP, VZIP 4 2/3 V -


3.22 ASIMD load instructions


The latencies shown assume the memory access hits in the Level 1 Data Cache and represent
the maximum latency to load all the vector registers written by the instruction. Compared to
standard loads, an extra cycle is required to forward results to FP/ASIMD pipelines.

Table 3-35 AArch64 ASIMD load instructions


Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD load, 1 element, LD1 6 3 L -


multiple, 1 reg, D-form

ASIMD load, 1 element, LD1 6 3 L -


multiple, 1 reg, Q-form

ASIMD load, 1 element, LD1 6 3/2 L -


multiple, 2 reg, D-form

ASIMD load, 1 element, LD1 6 3/2 L -


multiple, 2 reg, Q-form

ASIMD load, 1 element, LD1 6 1 L -


multiple, 3 reg, D-form

ASIMD load, 1 element, LD1 6 1 L -


multiple, 3 reg, Q-form

ASIMD load, 1 element, LD1 7 3/4 L -


multiple, 4 reg, D-form

ASIMD load, 1 element, LD1 7 3/4 L -


multiple, 4 reg, Q-form

ASIMD load, 1 element, one LD1 8 2 L, V -


lane, B/H/S

ASIMD load, 1 element, one LD1 8 2 L, V -


lane, D

ASIMD load, 1 element, all LD1R 8 2 L, V -


lanes, D-form, B/H/S

ASIMD load, 1 element, all LD1R 8 2 L, V -


lanes, D-form, D

ASIMD load, 1 element, all LD1R 8 2 L, V -


lanes, Q-form

ASIMD load, 2 element, LD2 8 2 L, V -


multiple, D-form, B/H/S

ASIMD load, 2 element, LD2 8 3/2 L, V -


multiple, Q-form, B/H/S

ASIMD load, 2 element, LD2 8 3/2 L, V -


multiple, Q-form, D


ASIMD load, 2 element, one LD2 8 2 L, V -


lane, B/H

ASIMD load, 2 element, one LD2 8 2 L, V -


lane, S

ASIMD load, 2 element, one LD2 8 2 L, V -


lane, D

ASIMD load, 2 element, all LD2R 8 2 L, V -


lanes, D-form, B/H/S

ASIMD load, 2 element, all LD2R 8 2 L, V -


lanes, D-form, D

ASIMD load, 2 element, all LD2R 8 2 L, V -


lanes, Q-form

ASIMD load, 3 element, LD3 8 2/3 L, V -


multiple, D-form, B/H/S

ASIMD load, 3 element, LD3 8 2/3 L, V -


multiple, Q-form, B/H/S

ASIMD load, 3 element, LD3 8 2/3 L, V -


multiple, Q-form, D

ASIMD load, 3 element, one LD3 8 2/3 L, V -


lane, B/H

ASIMD load, 3 element, one LD3 8 2/3 L, V -


lane, S

ASIMD load, 3 element, one LD3 8 2/3 L, V -


lane, D

ASIMD load, 3 element, all LD3R 8 2/3 L, V -


lanes, D-form, B/H/S

ASIMD load, 3 element, all LD3R 8 2/3 L, V -


lanes, D-form, D

ASIMD load, 3 element, all LD3R 8 2/3 L, V -


lanes, Q-form, B/H/S

ASIMD load, 3 element, all LD3R 8 2/3 L, V -


lanes, Q-form, D

ASIMD load, 4 element, LD4 8 1 L, V -


multiple, D-form, B/H/S

ASIMD load, 4 element, LD4 9 1/2 L, V -


multiple, Q-form, B/H/S

ASIMD load, 4 element, LD4 9 1/2 L, V -


multiple, Q-form, D

ASIMD load, 4 element, one LD4 8 1 L, V -


lane, B/H


ASIMD load, 4 element, one LD4 8 1 L, V -


lane, S

ASIMD load, 4 element, one LD4 8 1 L, V -


lane, D

ASIMD load, 4 element, all LD4R 8 1 L, V -


lanes, D-form, B/H/S

ASIMD load, 4 element, all LD4R 8 1 L, V -


lanes, D-form, D

ASIMD load, 4 element, all LD4R 8 1 L, V -


lanes, Q-form, B/H/S

ASIMD load, 4 element, all LD4R 8 1 L, V -


lanes, Q-form, D

(ASIMD load, writeback form) - - - I 1

Table 3-36 AArch32 ASIMD load instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD load, 1 element, VLD1 5 3(2) L 2


multiple, 1 reg

ASIMD load, 1 element, VLD1 5 3(2) L 2


multiple, 2 reg

ASIMD load, 1 element, VLD1 5 3/2(1) L 2


multiple, 3 reg

ASIMD load, 1 element, VLD1 5 3/2(1) L 2


multiple, 4 reg

ASIMD load, 1 element, one VLD1 7 3(2) L, V 2


lane

ASIMD load, 1 element, all VLD1 7 3(2) L, V 2


lanes, 1 reg

ASIMD load, 1 element, all VLD1 7 1 L, V 2


lanes, 2 reg

ASIMD load, 2 element, VLD2 7 1 L, V 2


multiple, 2 reg

ASIMD load, 2 element, VLD2 8 1/2 L, V 2


multiple, 4 reg

ASIMD load, 2 element, one VLD2 7 1 L, V 2


lane, size 32


ASIMD load, 2 element, one VLD2 7 1 L, V 2


lane, size 8/16

ASIMD load, 2 element, all VLD2 7 1 L, V 2


lanes

ASIMD load, 3 element, VLD3 8 2/3 (1) L, V 2


multiple, 3 reg

ASIMD load, 3 element, one VLD3 8 2/3 (1) L, V 2


lane, size 32

ASIMD load, 3 element, one VLD3 8 2/3 (1) L, V 2


lane, size 8/16

ASIMD load, 3 element, all VLD3 8 2/3 (1) L, V 2


lanes

ASIMD load, 4 element, VLD4 8 1/2 L, V 2


multiple, 4 reg

ASIMD load, 4 element, one VLD4 8 1/2 L, V 2


lane, size 32

ASIMD load, 4 element, one VLD4 8 1/2 L, V 2


lane, size 8/16

ASIMD load, 4 element, all VLD4 8 1/2 L, V 2


lanes

(ASIMD load, writeback form) - - - I 1

Notes:
1. Writeback forms of load instructions require an extra µOP to update the base address. This update is typically
performed in parallel with the load µOP (update latency shown in parentheses).
2. Conditional loads use the L01 pipelines, and the number in parentheses represents their throughput when it differs
from that of the unconditional forms.

3.23 ASIMD store instructions


Store MOPs are split into store address and store data µOPs. Once executed, stores are buffered
and committed in the background.

Table 3-37 AArch64 ASIMD store instructions


Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD store, 1 element, ST1 2 2 L01, V -


multiple, 1 reg, D-form

ASIMD store, 1 element, ST1 2 2 L01, V -


multiple, 1 reg, Q-form


ASIMD store, 1 element, ST1 2 2 L01, V -


multiple, 2 reg, D-form

ASIMD store, 1 element, ST1 2 1 L01, V -


multiple, 2 reg, Q-form

ASIMD store, 1 element, ST1 2 1 L01, V -


multiple, 3 reg, D-form

ASIMD store, 1 element, ST1 2 2/3 L01, V -


multiple, 3 reg, Q-form

ASIMD store, 1 element, ST1 2 1 L01, V -


multiple, 4 reg, D-form

ASIMD store, 1 element, ST1 2 1/2 L01, V -


multiple, 4 reg, Q-form

ASIMD store, 1 element, one ST1 4 1 L01, V -


lane, B/H/S

ASIMD store, 1 element, one ST1 4 1 L01, V -


lane, D

ASIMD store, 2 element, ST2 4 1 V, L01 -


multiple, D-form, B/H/S

ASIMD store, 2 element, ST2 4 1/2 V, L01 -


multiple, Q-form, B/H/S

ASIMD store, 2 element, ST2 4 1/2 V, L01 -


multiple, Q-form, D

ASIMD store, 2 element, one ST2 4 1 V, L01 -


lane, B/H/S

ASIMD store, 2 element, one ST2 4 1 V, L01 -


lane, D

ASIMD store, 3 element, ST3 5 1/2 V, L01 -


multiple, D-form, B/H/S

ASIMD store, 3 element, ST3 6 1/3 V, L01 -


multiple, Q-form, B/H/S

ASIMD store, 3 element, ST3 6 1/3 V, L01 -


multiple, Q-form, D

ASIMD store, 3 element, one ST3 5 1/2 V, L01 -


lane, B/H

ASIMD store, 3 element, one ST3 5 1/2 V, L01 -


lane, S

ASIMD store, 3 element, one ST3 5 1/2 V, L01 -


lane, D

ASIMD store, 4 element, ST4 6 1/3 V, L01 -


multiple, D-form, B/H/S


ASIMD store, 4 element, ST4 7 1/6 V, L01 -


multiple, Q-form, B/H/S

ASIMD store, 4 element, ST4 5 1/4 V, L01 -


multiple, Q-form, D

ASIMD store, 4 element, one ST4 6 2/3 V, L01 -


lane, B/H/S

ASIMD store, 4 element, one ST4 4 1/2 V, L01 -


lane, D

(ASIMD store, writeback form) - - - I 1

Table 3-38 AArch32 ASIMD store instructions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

ASIMD store, 1 element, VST1 2 2 L01, V -


multiple, 1 reg

ASIMD store, 1 element, VST1 2 2 L01, V -


multiple, 2 reg

ASIMD store, 1 element, VST1 2 1 L01, V -


multiple, 3 reg

ASIMD store, 1 element, VST1 2 1 L01, V -


multiple, 4 reg

ASIMD store, 1 element, one VST1 4 1 V, L01 -


lane

ASIMD store, 2 element, VST2 5 2/3 V, L01 -


multiple, 2 reg

ASIMD store, 2 element, VST2 5 1/3 V, L01 -


multiple, 4 reg

ASIMD store, 2 element, one VST2 4 1 V, L01 -


lane

ASIMD store, 3 element, VST3 5 1/2 V, L01 -


multiple, 3 reg

ASIMD store, 3 element, one VST3 4 1/2 V, L01 -


lane, size 32

ASIMD store, 3 element, one VST3 4 1/2 V, L01 -


lane, size 8/16

ASIMD store, 4 element, VST4 5 1/3 V, L01 -


multiple, 4 reg


ASIMD store, 4 element, one VST4 5 2/3 V, L01 -


lane, size 32

ASIMD store, 4 element, one VST4 5 2/3 V, L01 -


lane, size 8/16

(ASIMD store, writeback form) - (1) - +I 1

Notes:
1. Writeback forms of store instructions require an extra µOP to update the base address. This update is typically
performed in parallel with the store µOP (update latency shown in parentheses).

3.24 Cryptography extensions


Table 3-39 AArch64 Cryptography extensions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Crypto AES ops AESD, AESE, 2 2 V -


AESIMC, AESMC

Crypto polynomial (64x64) PMULL(2) 2 1 V0 -


multiply long

Crypto SHA1 hash acceleration SHA1H 2 1 V0 -


op

Crypto SHA1 hash acceleration SHA1C, SHA1M, 4 1 V0 -


ops SHA1P

Crypto SHA1 schedule SHA1SU0, 2 1 V0 -


acceleration ops SHA1SU1

Crypto SHA256 hash SHA256H, 4 1 V0 -


acceleration ops SHA256H2

Crypto SHA256 schedule SHA256SU0, 2 1 V0 -


acceleration ops SHA256SU1

Crypto SHA512 hash SHA512H, 2 1 V0 -


acceleration ops SHA512H2,
SHA512SU0,
SHA512SU1

Crypto SHA3 ops BCAX, EOR3, 2 1 V0 -


RAX1, XAR

Crypto SM3 ops SM3PARTW1, 2 1 V0 -


SM3PARTW2,
SM3SS1, SM3TT1A,
SM3TT1B,
SM3TT2A,
SM3TT2B


Crypto SM4 ops SM4E, SM4EKEY 4 1 V0 -

Table 3-40 AArch32 Cryptography extensions


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Crypto AES ops AESD, AESE, 2 2 V 1


AESIMC, AESMC

Crypto polynomial (64x64) VMULL.P64 2 1 V0 -


multiply long

Crypto SHA1 hash acceleration SHA1H 2 1 V0 -


op

Crypto SHA1 hash acceleration SHA1C, SHA1M, 4 1 V0 -


ops SHA1P

Crypto SHA1 schedule SHA1SU0, 2 1 V0 -


acceleration ops SHA1SU1

Crypto SHA256 hash SHA256H, 4 1 V0 -


acceleration ops SHA256H2

Crypto SHA256 schedule SHA256SU0, 2 1 V0 -


acceleration ops SHA256SU1

Notes:
1. Adjacent AESE/AESMC instruction pairs and adjacent AESD/AESIMC instruction pairs will exhibit the performance
characteristics described in Section 4.6.

3.25 CRC
Table 3-41 AArch64 CRC
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

CRC checksum ops CRC32, CRC32C 2 1 M0 1

Table 3-42 AArch32 CRC


Instruction Group AArch32 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

CRC checksum ops CRC32, CRC32C 2 1 M0 1

Notes:

1. CRC execution supports late forwarding of the result from a producer µOP to a consumer µOP. This results in a 1
cycle reduction in latency as seen by the consumer.

3.26 SVE Predicate instructions


Table 3-43 SVE Predicate Instructions
Instruction Group SVE Instruction Exec Execution Utilized Notes
Latency Throughput Pipelines

Loop control, based on BRKA, BRKB 2 2 M 1


predicate

Loop control, based on BRKAS, BRKBS 3 2 M 1


predicate and flag setting

Loop control, propagating BRKN, BRKPA, 2 1 M0 1


BRKPB

Loop control, propagating and BRKNS, BRKPAS, 3 1 M0, M 1


flag setting BRKPBS

Loop control, based on GPR WHILEGE, 3 1 M -


WHILEGT,
WHILEHI,
WHILEHS,
WHILELE,
WHILELO,
WHILELS,
WHILELT,
WHILERW,
WHILEWR

Loop terminate CTERMEQ, 1 1 M -


CTERMNE


Predicate counting scalar ADDPL, ADDVL, 2 2 M -


CNTB, CNTH,
CNTW, CNTD,
DECB, DECH,
DECW, DECD,
INCB, INCH,
INCW, INCD,
RDVL, SQDECB,
SQDECH,
SQDECW,
SQDECD,
SQINCB, SQINCH,
SQINCW,
SQINCD,
UQDECB,
UQDECH,
UQDECW,
UQDECD,
UQINCB,
UQINCH,
UQINCW,
UQINCD

Predicate counting scalar, INC, DEC 1 4 I -


ALL, {1,2,4}

Predicate counting scalar, CNTP, DECP, 2 2 M -


active predicate INCP, SQDECP,
SQINCP,
UQDECP,
UQINCP

Predicate counting vector, DECP, INCP, 7 1 M, M0, V -


active predicate SQDECP, SQINCP,
UQDECP,
UQINCP

Predicate logical AND, BIC, EOR, 1 1 M0 1


MOV, NAND,
NOR, NOT, ORN,
ORR

Predicate logical, flag setting ANDS, BICS, 2 1 M0, M 1


EORS, MOVS,
NANDS, NORS,
NOTS, ORNS,
ORRS

Predicate reverse REV 2 2 M -

Predicate select SEL 1 1 M0 -

Predicate set PFALSE, PTRUE 2 2 M -

Predicate set/initialize, set flags PTRUES 3 2 M -


Predicate find first/next PFIRST, PNEXT 3 2 M -

Predicate test PTEST 1 2 M -

Predicate transpose TRN1, TRN2 2 2 M -

Predicate unpack and widen PUNPKHI, 2 2 M -


PUNPKLO

Predicate zip/unzip ZIP1, ZIP2, UZP1, 2 2 M -


UZP2

Notes:
1. When the governing predicate is the same as destination, the latency is increased by one cycle.

3.27 SVE integer instructions


Table 3-44 SVE integer instructions
Instruction Group SVE Instruction Exec Execution Utilized Notes
Latency Throughput Pipelines

Arithmetic, absolute diff SABD, UABD 2 2 V -

Arithmetic, absolute diff accum SABA, UABA 4(1) 1 V1 2

Arithmetic, absolute diff accum SABALB, SABALT, 4(1) 1 V1 2


long UABALB, UABALT

Arithmetic, absolute diff long SABDLB, SABDLT, 2 2 V -


UABDLB, UABDLT


Arithmetic, basic ABS, ADD, ADR, 2 2 V -


CNOT, NEG,
SADDLB,
SADDLBT,
SADDLT,
SADDWB,
SADDWT,
SHADD, SHSUB,
SHSUBR,
SSUBLB,
SSUBLBT,
SSUBLT,
SSUBLTB,
SSUBWB,
SSUBWT, SUB,
SUBHNB,
SUBHNT, SUBR,
UADDLB,
UADDLT,
UADDWB,
UADDWT,
UHADD, UHSUB,
UHSUBR,
USUBLB, USUBLT,
USUBWB,
USUBWT

Arithmetic, complex ADDHNB, 2 2 V -


ADDHNT,
RADDHNB,
RADDHNT,
RSUBHNB,
RSUBHNT,
SQABS, SQADD,
SQNEG, SQSUB,
SQSUBR,
SRHADD,
SUQADD,
UQADD, UQSUB,
UQSUBR,
USQADD,
URHADD

Arithmetic, large integer ADCLB, ADCLT, 2 2 V -


SBCLB, SBCLT

Arithmetic, pairwise add ADDP 2 2 V -

Arithmetic, pairwise add and SADALP, UADALP 4(1) 1 V1 2


accum long

Arithmetic, shift ASR, ASRR, LSL, 2 1 V1 -


LSLR, LSR, LSRR


Arithmetic, shift and SRSRA, SSRA, 4(1) 1 V1 2


accumulate URSRA, USRA

Arithmetic, shift by immediate SHRNB, SHRNT, 2 1 V1 -


SSHLLB, SSHLLT,
USHLLB, USHLLT

Arithmetic, shift by immediate SLI, SRI 2 1 V1 -


and insert

Arithmetic, shift complex RSHRNB, 4 1 V1 -


RSHRNT,
SQRSHL,
SQRSHLR,
SQRSHRNB,
SQRSHRNT,
SQRSHRUNB,
SQRSHRUNT,
SQSHL, SQSHLR,
SQSHLU,
SQSHRNB,
SQSHRNT,
SQSHRUNB,
SQSHRUNT,
UQRSHL,
UQRSHLR,
UQRSHRNB,
UQRSHRNT,
UQSHL, UQSHLR,
UQSHRNB,
UQSHRNT

Arithmetic, shift right for divide ASRD 4 1 V1 -

Arithmetic, shift rounding SRSHL, SRSHLR, 4 1 V1 -


SRSHR, URSHL,
URSHLR, URSHR

Bit manipulation BDEP, BEXT, 6 1/2 V1 -


BGRP

Bitwise select BSL, BSL1N, 2 2 V -


BSL2N, NBSL

Count/reverse bits CLS, CLZ, CNT, 2 2 V -


RBIT

Broadcast logical bitmask DUPM, MOV 2 2 V -


immediate to vector

Compare and set flags CMPEQ, CMPGE, 4 1 V0, M 1


CMPGT, CMPHI,
CMPHS, CMPLE,
CMPLO, CMPLS,
CMPLT, CMPNE


Complex add CADD, SQCADD 2 2 V -

Complex dot product 8-bit CDOT 3(1) 2 V 2


element

Complex dot product 16-bit CDOT 4(1) 1 V0 2


element

Complex multiply-add B, H, S CMLA 4(1) 1 V0 2


element size

Complex multiply-add D CMLA 5(3) 1/2 V0 2


element size

Conditional extract operations, CLASTA, CLASTB 8 1 M0, V1, V -


scalar form

Conditional extract operations, CLASTA, CLASTB, 3 1 V1 -


SIMD&FP scalar and vector COMPACT,
forms SPLICE

Convert to floating point, 64b SCVTF, UCVTF 3 1 V0 -


to float or convert to double

Convert to floating point, 32b SCVTF, UCVTF 4 1/2 V0 -


to single or half

Convert to floating point, 16b SCVTF, UCVTF 6 1/4 V0 -


to half

Copy, scalar CPY 5 1 M0, V

Copy, scalar SIMD&FP or imm CPY 2 2 V

Divides, 32 bit SDIV, SDIVR, 7 to 12 1/11 to 1/7 V0 3


UDIV, UDIVR

Divides, 64 bit SDIV, SDIVR, 7 to 20 1/20 to 1/7 V0 3


UDIV, UDIVR

Dot product, 8 bit SDOT, UDOT 3(1) 2 V 2

Dot product, 8 bit, using SUDOT, USDOT 3(1) 2 V 2


signed and unsigned integers

Dot product, 16 bit SDOT, UDOT 4(1) 1 V0 2

Duplicate, immediate and DUP, MOV 2 2 V -


indexed form

Duplicate, scalar form DUP, MOV 3 1 M0 -

Extend, sign or zero SXTB, SXTH, 2 1 V1 -


SXTW, UXTB,
UXTH, UXTW

Extract EXT 2 2 V -


Extract narrow saturating SQXTNB, 4 1 V1 -


SQXTNT,
SQXTUNB,
SQXTUNT,
UQXTNB,
UQXTNT

Extract/insert operation, SIMD LASTA, LASTB, 3 1 V1 -


and FP scalar form INSR

Extract/insert operation, scalar LASTA, LASTB, 5 1 V1, M0 -


INSR

Histogram operations HISTCNT, 2 2 V -


HISTSEG

Horizontal operations, B, H, S INDEX 4 1 V0 -


form, immediate operands only

Horizontal operations, B, H, S INDEX 7 1 M0, V0 -


form, scalar, immediate
operands / scalar operands
only / immediate, scalar
operands

Horizontal operations, D form, INDEX 5 1/2 V0 -


immediate operands only

Horizontal operations, D form, INDEX 8 1/2 M0, V0 -


scalar, immediate operands /
scalar operands only /
immediate, scalar operands

Logical AND, BIC, EON, 2 2 V -


EOR, EORBT,
EORTB, MOV,
NOT, ORN, ORR

Max/min, basic and pairwise SMAX, SMAXP, 2 2 V -


SMIN, SMINP,
UMAX, UMAXP,
UMIN, UMINP

Matching operations MATCH, 2 1 V0, M 1,5


NMATCH

Matrix multiply-accumulate SMMLA, UMMLA, 3(1) 2 V 2


USMMLA

Move prefix MOVPRFX 2 2 V -

Multiply, B, H, S element size MUL, SMULH, 4 1 V0 -


UMULH

Multiply, D element size MUL, SMULH, 5 1/2 V0 -


UMULH


Multiply long SMULLB, 4 1 V0 -


SMULLT,
UMULLB,
UMULLT

Multiply accumulate, B, H, S MLA, MLS 4(1) 1 V0 2


element size

Multiply accumulate, D MLA, MLS, MAD, 5(3) 1/2 V0 2


element size MSB

Multiply accumulate long SMLALB, SMLALT, 4(1) 1 V0 2


SMLSLB, SMLSLT,
UMLALB,
UMLALT,
UMLSLB, UMLSLT

Multiply accumulate saturating SQDMLALB, 4(2) 1 V0 4


doubling long regular SQDMLALT,
SQDMLALBT,
SQDMLSLB,
SQDMLSLT,
SQDMLSLBT

Multiply saturating doubling SQDMULH 4 1 V0 -


high, B, H, S element size

Multiply saturating doubling SQDMULH 5 1/2 V0 -


high, D element size

Multiply saturating doubling SQDMULLB, 4 1 V0 -


long SQDMULLT

Multiply saturating rounding SQRDMLAH, 4(2) 1 V0 4


doubling regular/complex SQRDMLSH,
accumulate, B, H, S element SQRDCMLAH
size

Multiply saturating rounding SQRDMLAH, 5(3) 1/2 V0 4


doubling regular/complex SQRDMLSH,
accumulate, D element size SQRDCMLAH

Multiply saturating rounding SQRDMULH 4 1 V0 -


doubling regular/complex, B,
H, S element size

Multiply saturating rounding SQRDMULH 5 1/2 V0 -


doubling regular/complex, D
element size

Multiply/multiply long, (8x8) PMUL, PMULLB, 2 1 V0 -


polynomial PMULLT


Predicate counting vector CNT, DECB, 2 2 V0 -


DECH, DECW,
DECD, INCB,
INCH, INCW,
INCD, SQDECB,
SQDECH,
SQDECW,
SQDECD,
SQINCB, SQINCH,
SQINCW,
SQINCD,
UQDECB,
UQDECH,
UQDECW,
UQDECD,
UQINCB,
UQINCH,
UQINCW,
UQINCD

Reciprocal estimate URECPE, 4 1/2 V0


URSQRTE

Reduction, arithmetic, B form SADDV, UADDV, 11 1/2 V, V1 -


SMAXV, SMINV,
UMAXV, UMINV

Reduction, arithmetic, H form SADDV, UADDV, 9 1/2 V, V1 -


SMAXV, SMINV,
UMAXV, UMINV

Reduction, arithmetic, S form SADDV, UADDV, 8 4/5 V, V1 -


SMAXV, SMINV,
UMAXV, UMINV

Reduction, logical ANDV, EORV, 6 1 V, V1 -


ORV

Reverse, vector REV, REVB, REVH, 2 2 V -


REVW

Select, vector form MOV, SEL 2 2 V -

Table lookup TBL 2 2 V -

Table lookup extension TBX 2 2 V -

Transpose, vector form TRN1, TRN2 2 2 V -

Unpack and extend SUNPKHI, 2 2 V -


SUNPKLO,
UUNPKHI,
UUNPKLO

Zip/unzip UZP1, UZP2, ZIP1, 2 2 V -


ZIP2


Notes:
1. When the governing predicate is the same as destination, the latency is increased by one cycle.
2. SVE accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of such µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
3. SVE integer divide operations are performed using an iterative algorithm and block subsequent similar operations
to the same pipeline until complete.
4. Same as note 2, except that saturating instructions require an extra cycle of latency for late-forwarding of
accumulate operands.
5. If the consuming instruction has a flag source, the latency for this instruction is 4 cycles.
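
As an illustration of note 2, the following Python sketch estimates how long a chain of dependent accumulate µOPs takes with and without late-forwarding. The function name and the timing model (back-to-back issue, no other stalls or resource conflicts) are assumptions made for illustration and are not defined by this guide.

```python
def accumulate_chain_cycles(n_ops: int, exec_latency: int, acc_latency: int) -> int:
    """Estimate cycles until the last result of a dependent accumulate chain.

    Simplified model (an assumption): each µOP depends on the previous one
    only through its accumulate operand, so with late-forwarding a new µOP
    can issue every `acc_latency` cycles, and the final result is available
    `exec_latency` cycles after the last µOP issues.
    """
    if n_ops <= 0:
        return 0
    return (n_ops - 1) * acc_latency + exec_latency


# MLA with B, H or S elements is listed as 4(1):
# execution latency 4, accumulate latency 1.
with_forwarding = accumulate_chain_cycles(8, exec_latency=4, acc_latency=1)
# Without late-forwarding every link in the chain would cost the full latency.
without_forwarding = accumulate_chain_cycles(8, exec_latency=4, acc_latency=4)
print(with_forwarding, without_forwarding)
```

Under this model, an entry such as 4(2) for the saturating forms in note 4 simply uses `acc_latency=2`.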

3.28 SVE floating-point instructions


Table 3-45 SVE floating-point instructions
Instruction Group SVE Instruction Exec Execution Utilized Notes
Latency Throughput Pipelines

Floating point absolute FABD, FABS 2 2 V -


value/difference

Floating point arithmetic FADD, FADDP, 2 2 V -


FNEG, FSUB,
FSUBR

Floating point associative add, FADDA 10 1/9 V1 -


F16

Floating point associative add, FADDA 6 1/5 V1 -


F32

Floating point associative add, FADDA 4 2 V -


F64

Floating point compare FACGE, FACGT, 2 1 V0 -


FACLE, FACLT,
FCMEQ, FCMGE,
FCMGT, FCMLE,
FCMLT, FCMNE,
FCMUO

Floating point complex add FCADD 3 2 V -

Floating point complex FCMLA 5(2) 2 V 1


multiply add

Floating point convert, long or FCVT, FCVTLT, 4 1/2 V0 -


narrow (F16 to F32 or F32 to FCVTNT
F16)

Floating point convert, long or FCVT, FCVTLT, 3 1 V0 -


narrow (F16 to F64, F32 to F64, FCVTNT
F64 to F32 or F64 to F16)

Floating point convert, round FCVTX, 3 1 V0 -


to odd FCVTXNT


Floating point base2 log, F16 FLOGB 6 1/4 V0

Floating point base2 log, F32 FLOGB 4 1/2 V0

Floating point base2 log, F64 FLOGB 3 1 V0

Floating point convert to FCVTZS, FCVTZU 6 1/4 V0 -


integer, F16

Floating point convert to FCVTZS, FCVTZU 4 1/2 V0 -


integer, F32

Floating point convert to FCVTZS, FCVTZU 3 1 V0 -


integer, F64

Floating point copy FCPY, FDUP, 2 2 V -


FMOV

Floating point divide, F16 FDIV, FDIVR 10 to 13 1/12 to 1/10 V0 2

Floating point divide, F32 FDIV, FDIVR 7 to 10 1/9 to 1/7 V0 2

Floating point divide, F64 FDIV, FDIVR 7 to 15 1/14 to 1/7 V0 2

Floating point min/max FMAXP, 2 2 V


pairwise FMAXNMP,
FMINP,
FMINNMP

Floating point min/max FMAX, FMIN, 2 2 V -


FMAXNM,
FMINNM

Floating point multiply FSCALE, FMUL, 3 2 V -


FMULX

Floating point multiply FMLA, FMLS, 4(2) 2 V 1


accumulate FMAD, FMSB,
FNMAD, FNMLA,
FNMLS, FNMSB

Floating point multiply FMLALB, FMLALT, 4(2) 2 V 1


add/sub accumulate long FMLSLB, FMLSLT

Floating point reciprocal FRECPE, FRECPX, 6 1/4 V0 -


estimate, F16 FRSQRTE

Floating point reciprocal FRECPE, FRECPX, 4 1/2 V0 -


estimate, F32 FRSQRTE

Floating point reciprocal FRECPE, FRECPX, 3 1 V0 -


estimate, F64 FRSQRTE

Floating point reciprocal step FRECPS, FRSQRTS 4 2 V -

Floating point reduction, F16 FADDV, 6 2/3 V -


FMAXNMV,
FMAXV,
FMINNMV,
FMINV


Floating point reduction, F32 FADDV, 4 1 V -


FMAXNMV,
FMAXV,
FMINNMV,
FMINV

Floating point reduction, F64 FADDV, 2 2 V -


FMAXNMV,
FMAXV,
FMINNMV,
FMINV

Floating point round to FRINTA, FRINTM, 6 1/4 V0 -


integral, F16 FRINTN, FRINTP,
FRINTX, FRINTZ

Floating point round to FRINTA, FRINTM, 4 1/2 V0 -


integral, F32 FRINTN, FRINTP,
FRINTX, FRINTZ

Floating point round to FRINTA, FRINTM, 3 1 V0 -


integral, F64 FRINTN, FRINTP,
FRINTX, FRINTZ

Floating point square root, F16 FSQRT 10 to 13 1/12 to 1/10 V0 2

Floating point square root, F32 FSQRT 7 to 10 1/9 to 1/7 V0 2

Floating point square root, F64 FSQRT 7 to 16 1/14 to 1/7 V0 2

Floating point trigonometric FEXPA 3 1 V1


exponentiation

Floating point trigonometric FTMAD 4 2 V


multiply add

Floating point trigonometric, FTSMUL, FTSSEL 3 2 V -


miscellaneous

Notes:
1. SVE multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of floating-point multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown
in parentheses).
2. SVE divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.

3.29 SVE BFloat16 (BF16) instructions


Table 3-46 SVE BFloat16 (BF16) instructions
Instruction Group SVE Instruction Exec Execution Utilized Notes
Latency Throughput Pipelines

Convert, F32 to BF16 BFCVT, BFCVTNT 3 1 V0 -


Dot product BFDOT 4(2) 2 V 1

Matrix multiply accumulate BFMMLA 5(3) 2 V 1

Multiply accumulate long BFMLALB, 4(2) 2 V 1


BFMLALT

Notes:
1. SVE pipelines that execute these instructions support late-forwarding of accumulate operands from similar µOPs,
allowing a typical sequence of µOPs to issue one every N cycles (accumulate latency N shown in parentheses).

3.30 SVE Load instructions


The latencies shown assume the memory access hits in the Level 1 Data Cache and represent the
maximum latency to load all the vector registers written by the instruction.

Table 3-47 SVE Load instructions


Instruction Group SVE Instruction Exec Execution Utilized Notes
Latency Throughput Pipelines

Load vector LDR 6 3 L -

Load predicate LDR 6 3 L, M -

Contiguous load, scalar + imm LD1B, LD1D, 6 3 L -


LD1H, LD1W,
LD1SB, LD1SH,
LD1SW

Contiguous load, scalar + LD1B, LD1D, 6 3 L01 -


scalar LD1H, LD1W,
LD1SB, LD1SH,
LD1SW

Contiguous load broadcast, LD1RB, LD1RH, 6 3 L -


scalar + imm LD1RD, LD1RW,
LD1RSB, LD1RSH,
LD1RSW,
LD1RQB,
LD1RQD,
LD1RQH,

Contiguous load broadcast, LD1RQB, 6 3 L -


scalar + scalar LD1RQD,
LD1RQH,
LD1RQW

Non temporal load, scalar + LDNT1B, 6 3 L -


imm LDNT1D,
LDNT1H,
LDNT1W


Non temporal load, scalar + LDNT1B, 6 3 L, S -


scalar LDNT1D,
LDNT1H,
LDNT1W

Non temporal gather load, LDNT1B, 9 1 L, V -


vector + scalar 32-bit element LDNT1H,
size LDNT1W,
LDNT1SB,
LDNT1SH

Non temporal gather load, LDNT1B, 10 1/2 L, V1 -


vector + scalar 64-bit element LDNT1D,
size LDNT1H,
LDNT1W,
LDNT1SB,
LDNT1SH,
LDNT1SW

Contiguous first faulting load, LDFF1B, LDFF1D, 6 3 L, S -


scalar + scalar LDFF1H, LDFF1W,
LDFF1SB,
LDFF1SD,
LDFF1SH,
LDFF1SW

Contiguous non faulting load, LDNF1B, 6 3 L -


scalar + imm LDNF1D,
LDNF1H,
LDNF1W,
LDNF1SB,
LDNF1SH,
LDNF1SW

Contiguous Load two LD2B, LD2D, 8 1 V, L -


structures to two vectors, scalar LD2H, LD2W
+ imm

Contiguous Load two LD2B, LD2D, 9 1 V, L -


structures to two vectors, scalar LD2H, LD2W
+ scalar

Contiguous Load three LD3B, LD3D, 9 3/2 V, L -


structures to three vectors, LD3H, LD3W
scalar + imm

Contiguous Load three LD3B, LD3D, 10 3/2 V, L, S -


structures to three vectors, LD3H, LD3W
scalar + scalar

Contiguous Load four LD4B, LD4D, 9 1/2 V, L -


structures to four vectors, LD4H, LD4W
scalar + imm


Contiguous Load four LD4B, LD4D, 10 1/2 L, V, S -


structures to four vectors, LD4H, LD4W
scalar + scalar

Gather load, vector + imm, 32- LD1B, LD1H, 9 1 L, V -


bit element size LD1W, LD1SB,
LD1SH, LD1SW,
LDFF1B, LDFF1H,
LDFF1W,
LDFF1SB,
LDFF1SH,
LDFF1SW

Gather load, vector + imm, 64- LD1B, LD1D, 9 1/2 L, V -


bit element size LD1H, LD1W,
LD1SB, LD1SH,
LD1SW, LDFF1B,
LDFF1D, LDFF1H,
LDFF1W,
LDFF1SB,
LDFF1SD,
LDFF1SH,
LDFF1SW

Gather load, 32-bit scaled LD1H, LD1SH, 10 1/2 L, V -


offset LDFF1H,
LDFF1SH, LD1W,
LDFF1W,
LDFF1SW

Gather load, 32-bit unpacked LD1B, LD1SB, 9 1 L, V -


unscaled offset LDFF1B, LDFF1SB,
LD1D, LDFF1D,
LD1H, LD1SH,
LDFF1H,
LDFF1SH, LD1W,
LD1SW, LDFF1W,
LDFF1SW

3.31 SVE Store instructions


Table 3-48 SVE Store instructions
Instruction Group SVE Instruction Exec Execution Utilized Notes
Latency Throughput Pipelines

Store from predicate reg STR 1 2 L01 -

Store from vector reg STR 2 2 L01, V -


Contiguous store, scalar + imm ST1B, ST1H, 2 2 L01, V -


ST1D, ST1W

Contiguous store, scalar + ST1H 2 2 L01, S, V -


scalar

Contiguous store, scalar + ST1B, ST1D, 2 2 L01, V -


scalar ST1W

Contiguous store two ST2B, ST2H, 4 1 L01, V -


structures from two vectors, ST2D, ST2W
scalar + imm

Contiguous store two ST2H 4 1 L01, S, V -


structures from two vectors,
scalar + scalar

Contiguous store two ST2B, ST2D, 4 1 L01, V -


structures from two vectors, ST2W
scalar + scalar

Contiguous store three ST3B, ST3D, 7 2/9 L01, V -


structures from three vectors, ST3H, ST3W
scalar + imm

Contiguous store three ST3H 7 2/9 L01, S, V -


structures from three vectors,
scalar + scalar

Contiguous store three ST3B, ST3D, 7 2/9 L01, S, V -


structures from three vectors, ST3W
scalar + scalar

Contiguous store four ST4B, ST4D, 11 1/9 L01, V -


structures from four vectors, ST4H, ST4W
scalar + imm

Contiguous store four ST4H 11 1/9 L01, S, V -


structures from four vectors,
scalar + scalar

Contiguous store four ST4B, ST4D, 11 1/9 L01, S, V -


structures from four vectors, ST4W
scalar + scalar

Non temporal store, scalar + STNT1B, STNT1D, 2 2 L01, V -


imm STNT1H,
STNT1W

Non temporal store, scalar + STNT1H 2 2 L01, S, V -


scalar

Non temporal store, scalar + STNT1B, STNT1D, 2 2 L01, V -


scalar STNT1W

Scatter non temporal store, STNT1B, STNT1H, 4 1/2 L01, V -


vector + scalar 32-bit element STNT1W
size


Scatter non temporal store, STNT1B, STNT1D, 2 1 L01, V -


vector + scalar 64-bit element STNT1H,
size STNT1W

Scatter store vector + imm 32- ST1B, ST1H, 4 1/2 L01, V -


bit element size ST1W

Scatter store vector + imm 64- ST1B, ST1D, 2 1 L01, V -


bit element size ST1H, ST1W

Scatter store, 32-bit scaled ST1H, ST1W 4 1/2 L01, V -


offset

Scatter store, 32-bit unpacked ST1B, ST1D, 2 1 L01, V -


unscaled offset ST1H, ST1W

Scatter store, 32-bit unpacked ST1D, ST1H, 2 1 L01, V -


scaled offset ST1W

Scatter store, 32-bit unscaled ST1B, ST1H, 4 1/2 L01, V -


offset ST1W

Scatter store, 64-bit scaled ST1D, ST1H, 2 1 L01, V -


offset ST1W

Scatter store, 64-bit unscaled ST1B, ST1D, 2 1 L01, V -


offset ST1H, ST1W

3.32 SVE Miscellaneous instructions


Table 3-49 SVE miscellaneous instructions
Instruction Group SVE Instruction Exec Execution Utilized Notes
Latency Throughput Pipelines

Read first fault register, RDFFR 2 1 M0 -


unpredicated

Read first fault register, RDFFR 3 1 M0, M 1


predicated

Read first fault register and set RDFFRS 4 1/2 M0, M 1


flags

Set first fault register SETFFR 2 1 M0 -

Write to first fault register WRFFR 2 1 M0 -

Notes:
1. When destination is same as the governing predicate, the latency of the instruction increases by one cycle.


3.33 SVE Cryptographic instructions


Table 3-50 SVE cryptographic instructions
Instruction Group AArch64 Exec Execution Utilized Notes
Instructions Latency Throughput Pipelines

Crypto AES ops AESD, AESE, 2 2 V -


AESIMC, AESMC

Crypto SHA3 ops BCAX, EOR3, 2 1 V0 -


RAX1, XAR

Crypto SM4 ops SM4E, SM4EKEY 4 1 V0 -


4 Special considerations
4.1 Dispatch constraints
Dispatch of µOPs from the in-order portion to the out-of-order portion of the microarchitecture includes several
constraints. It is important to consider these constraints during code generation to maximize the effective dispatch
bandwidth and subsequent execution bandwidth of Cortex-A710.

The dispatch stage can process up to 5 MOPs per cycle and dispatch up to 10 µOPs per cycle, with the following
limitations on the number of µOPs of each type that may be simultaneously dispatched.

• Up to 4 µOPs utilizing the S or B pipelines
• Up to 4 µOPs utilizing the M pipelines
• Up to 2 µOPs utilizing the M0 pipelines
• Up to 2 µOPs utilizing the V0 pipeline
• Up to 2 µOPs utilizing the V1 pipeline
• Up to 6 µOPs utilizing the L pipelines

In the event there are more µOPs available to be dispatched in a given cycle than can be supported by the constraints
above, µOPs will be dispatched in oldest to youngest age-order to the extent allowed by the above.
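
The limits above can be explored with a small model. The following Python sketch counts how many µOPs from an age-ordered list fit in one dispatch cycle. The class labels, the function name, and the simplifications (dispatch stops at the first µOP that would exceed a limit, each µOP counts against a single class, and the 5-MOP limit is ignored) are assumptions for illustration only, not a description of the actual dispatch hardware.

```python
# Per-cycle dispatch limits from the list above; keys are pipeline classes.
LIMITS = {"SB": 4, "M": 4, "M0": 2, "V0": 2, "V1": 2, "L": 6}
MAX_UOPS_PER_CYCLE = 10


def dispatch_one_cycle(uops):
    """Return how many of the oldest µOPs in `uops` fit in one dispatch cycle.

    `uops` is a list of pipeline-class strings in age order (oldest first).
    Classes without a listed limit (for example "V") only count towards the
    10-µOP total. Assumptions: dispatch stops at the first µOP that would
    exceed a limit, each µOP counts against a single class, and the separate
    5-MOP limit is ignored.
    """
    used = {}
    count = 0
    for cls in uops:
        if count == MAX_UOPS_PER_CYCLE:
            break
        limit = LIMITS.get(cls)
        if limit is not None and used.get(cls, 0) == limit:
            break
        used[cls] = used.get(cls, 0) + 1
        count += 1
    return count


# Three V0 µOPs in a row: only two dispatch in the first cycle.
print(dispatch_one_cycle(["V0", "V0", "V0", "V"]))
# Mixing in µOPs from other classes lets more dispatch together.
print(dispatch_one_cycle(["V0", "V", "V0", "V", "V1"]))
```

In this model, a run of µOPs that all target one restricted pipeline class limits dispatch well below the 10-µOP maximum, which is why mixing instruction types can help.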

4.2 Dispatch stall


In the event of a V-pipeline µOP containing more than 1 quad-word register source, a portion or
all of which was previously written as one or multiple single words, that µOP will stall in dispatch
for three cycles. This stall occurs only on the first such instance, and subsequent consumers of
the same register will not experience this stall.

4.3 Optimizing general-purpose register spills and fills


Register transfers between general-purpose registers (GPR) and ASIMD registers (VPR) have lower
latency than reads and writes to the cache hierarchy, so it is recommended that GPRs
be filled from and spilled to VPRs rather than memory, when possible.


4.4 Optimizing memory routines


To achieve maximum throughput for memory copy (or similar loops), do the following:

• Unroll the loop to include multiple load and store operations per iteration, minimizing the overheads of
looping.
• Align stores on a 32B boundary wherever possible.
• Use non-writeback forms of LDP and STP instructions, interleaving them as shown in the example below:

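Loop_start:
SUBS x2,x2,#96          // 96 bytes are copied per iteration; sets flags for B.GT
LDP q3,q4,[x1,#0]
STP q3,q4,[x0,#0]
LDP q3,q4,[x1,#32]
STP q3,q4,[x0,#32]
LDP q3,q4,[x1,#64]
STP q3,q4,[x0,#64]
ADD x1,x1,#96           // advance the source pointer
ADD x0,x0,#96           // advance the destination pointer
B.GT Loop_start         // loop while bytes remain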

A recommended copy routine for AArch32 would look like the sequence above but would use LDRD/STRD
instructions. Avoid load-/store-multiple instruction encodings (such as LDM and STM).

If the memory locations being copied are non-cacheable, the non-temporal version of LDPQ (LDNPQ) should be used.
STPQ should still be used for the stores.

Similarly, it is recommended to use LDPQ to achieve maximum throughput for memcmp (memory compare) loops
that compare cacheable memory. LDNPQ should be used for non-cacheable memory.

To achieve maximum throughput on memset, it is recommended that one do the following.

Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping.

Loop_start:
STP q1,q3,[x0,#0]
STP q1,q3,[x0,#0x20]
STP q1,q3,[x0,#0x40]
STP q1,q3,[x0,#0x60]
ADD x0,x0,#0x80
SUBS x2,x2,#0x80
B.GT Loop_start

To achieve maximum performance on memset to zero, it is recommended that one use DC ZVA instead of STP. An
optimal routine might look something like the following.

Loop_start:

SUBS x2,x2,#0x80
DC ZVA,x0
ADD x0,x0,#0x40
DC ZVA,x0
ADD x0,x0,#0x40
B.GT Loop_start
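
The DC ZVA loop above advances by 0x40 bytes per DC ZVA and so assumes a 64-byte zeroing block and a block-aligned destination. The following Python sketch shows one way to split an arbitrary region into a head and tail for ordinary stores and a block-aligned body for DC ZVA. The function name is hypothetical, and the 64-byte default is an assumption taken from the loop above; software would normally read the block size from DCZID_EL0.

```python
def split_for_dc_zva(addr: int, length: int, block: int = 64):
    """Split [addr, addr + length) into (head, blocks, tail).

    `head` and `tail` are byte counts to clear with ordinary stores, and
    `blocks` is the number of whole, block-aligned chunks that DC ZVA can
    zero. The 64-byte default matches the 0x40 stride in the loop above.
    """
    head = (-addr) % block          # bytes up to the next block boundary
    if head > length:
        head = length               # region ends before the first boundary
    remaining = length - head
    blocks = remaining // block
    tail = remaining - blocks * block
    return head, blocks, tail


print(split_for_dc_zva(0x1010, 300))
```

Handling the head and tail separately keeps DC ZVA from zeroing bytes outside the requested region, because the instruction always clears a whole aligned block.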

4.5 Load/Store alignment


The Armv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A710 core
handles most unaligned accesses without performance penalties. However, there are cases which could reduce
bandwidth or incur additional latency, as described below.

• Load operations that cross a cache-line (64-byte) boundary.


• Quad-word load operations that are not 4B aligned.
• Store operations that cross a 32B boundary.
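
The three cases above are simple address arithmetic. The following Python sketch flags them for a single access; the function names and message strings are hypothetical, and a quad-word access is taken to be 16 bytes.

```python
def crosses_boundary(addr: int, size: int, boundary: int) -> bool:
    """True if the access [addr, addr + size) spans a `boundary`-aligned edge."""
    return (addr // boundary) != ((addr + size - 1) // boundary)


def alignment_hazards(addr: int, size: int, is_load: bool):
    """List which of the three cases above apply to one access."""
    hazards = []
    if is_load and crosses_boundary(addr, size, 64):
        hazards.append("load crosses 64-byte cache line")
    if is_load and size == 16 and addr % 4 != 0:
        hazards.append("quad-word load not 4-byte aligned")
    if not is_load and crosses_boundary(addr, size, 32):
        hazards.append("store crosses 32-byte boundary")
    return hazards


print(alignment_hazards(0x103A, 16, is_load=True))
print(alignment_hazards(0x1018, 16, is_load=False))
```

A check like this could be applied to the base addresses and strides of a hot loop to see whether padding or realigning a buffer would avoid the penalized cases.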

4.6 AES encryption/decryption


Cortex-A710 can issue two AESE/AESMC/AESD/AESIMC instructions every cycle (fully pipelined)
with an execution latency of two cycles. This means encryption or decryption for at least four
data chunks should be interleaved for maximum performance:
AESE data0, key_reg
AESMC data0, data0
AESE data1, key_reg
AESMC data1, data1
AESE data2, key_reg
AESMC data2, data2
AESE data3, key_reg
AESMC data3, data3
AESE data0, key_reg
AESMC data0, data0
...

Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when
they are adjacent in the program code and both instructions use the same destination register.
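
The figure of four data chunks follows from multiplying the issue rate by the latency. The following Python sketch works through that arithmetic and generates the interleaved order used in the listing above. The function names are hypothetical, and the model assumes a fully pipelined unit with no other stalls.

```python
def streams_to_interleave(issue_per_cycle: int, latency_cycles: int) -> int:
    """Independent streams needed to keep a fully pipelined unit busy."""
    return issue_per_cycle * latency_cycles


def interleaved_rounds(n_streams: int, n_rounds: int):
    """Yield (round, stream) pairs: every stream in turn, then the next round."""
    for rnd in range(n_rounds):
        for stream in range(n_streams):
            yield rnd, stream


# Two instructions per cycle with a two-cycle latency.
n = streams_to_interleave(issue_per_cycle=2, latency_cycles=2)
print(n)
print(list(interleaved_rounds(n, 2)))
```

Each (round, stream) pair corresponds to one AESE/AESMC pair on that stream's data register, so a stream is only revisited after the other three have been issued.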


4.7 Region based fast forwarding


The forwarding logic in the V pipelines is optimized to provide optimal latency for instructions
which are expected to commonly forward to one another. The effective latency of FP and ASIMD
instructions as described in section 3 is increased by one cycle if the producer and consumer
instructions are not part of the same forwarding region. These optimized forwarding regions are
defined in the following table.

Table 4-1 Optimized forwarding regions


Region Instruction Types Notes

1 ASIMD/SVE integer ALU, ASIMD/SVE integer shift, ASIMD/scalar insert and move, 1
ASIMD/SVE integer abs/cmp/max/min and the ASIMD miscellaneous instructions
in table 3-18.

2 FP/ASIMD/SVE floating-point multiply, FP/ASIMD/SVE floating-point multiply-accumulate, 1,2,3
FP/ASIMD/SVE compare, FP/ASIMD/SVE add/sub and the ASIMD
miscellaneous instructions in table 3-18.

3 ASIMD/SVE Crypto and SHA1/SHA256 -

4 ASIMD/SVE AES, ASIMD/SVE polynomial multiply and all the instruction types in 1
region 1.

5 ASIMD/SVE BFDOT and BFMMLA instructions -

Notes:
1. Reciprocal step and estimate instructions are excluded from this region.
2. ASIMD/SVE extract narrow, saturating instructions are excluded from this region.
3. ASIMD miscellaneous instructions can only be consumers of this region.

The following instructions are not a part of any region:


• FP/ASIMD/SVE floating-point div/sqrt and SVE integer divides
• FP/ASIMD/SVE convert and rounding instructions that do not write to general purpose
registers
• ASIMD/SVE integer mul/mac
• ASIMD/SVE integer reduction

In addition to the regions mentioned in the table above, all instructions in regions 1 and 2 can
fast forward to FP/ASIMD/SVE stores, FP/ASIMD vector to integer register transfers and ASIMD
converts that write to general purpose registers.

More special notes about the forwarding region in table 4-1:


• Element sources (the non-vector operand in "by element" multiplies) used by ASIMD/SVE
floating-point multiply and multiply-accumulate operations cannot be consumers.


• Complex shift by immediate/register and shift accumulate instructions cannot be producers
(see sections 3.16 and 3.25) in region 1.
• Extract narrow, saturating instructions cannot be producers (see sections 3.19 and 3.25) in
region 1.
• Absolute difference accumulate and pairwise add and accumulate instructions cannot be
producers (see sections 3.16 and 3.25) in region 1.
• For floating-point producer-consumer pairs, the precision of the instructions should match
(single, double or half) in region 2.
• Pair-wise floating-point instructions cannot be producers or consumers in region 2.

It is not advisable to interleave instructions belonging to different regions. Also, certain
instructions can only be producers or consumers in a particular region but not both (see
footnote 3 for table 4-1). For example, the code below interleaves producers and consumers
from regions 1 and 2. This will result in an additional latency of 1 cycle as seen by FMUL.
FSUB v27.2s, v28.2s, v20.2s   // Region 2
FADD v20.2s, v28.2s, v20.2s   // Region 2
MOV v27.s[1], v20.s[1]        // Region 2 producer but not a region 2 consumer
FMUL v26.2s, v27.2s, v6.2s    // Region 2

4.8 Branch instruction alignment


Branch instruction and branch target instruction alignment and density can affect performance.

For best case performance, avoid placing more than four branch instructions within an
aligned 32-byte instruction memory region.
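
The recommendation above can be checked mechanically against a disassembly. The following Python sketch takes a list of branch instruction addresses and reports any aligned 32-byte region holding more than four of them. The function name and the example addresses are hypothetical.

```python
from collections import Counter


def dense_branch_regions(branch_addrs, region_bytes=32, max_branches=4):
    """Return {region_base: count} for aligned regions holding too many branches."""
    counts = Counter(addr - (addr % region_bytes) for addr in branch_addrs)
    return {base: n for base, n in sorted(counts.items()) if n > max_branches}


# Five branches packed into the 32-byte region at 0x1000, one more at 0x1020.
addrs = [0x1000, 0x1004, 0x100C, 0x1014, 0x101C, 0x1020]
print(dense_branch_regions(addrs))
```

Regions reported by such a check are candidates for padding or for reordering code so that the branches spread across neighbouring regions.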

4.9 FPCR self-synchronization


Programmers and compiler writers should note that writes to the FPCR register are
self-synchronizing; that is, their effect on subsequent instructions can be relied upon without an
intervening context synchronizing operation.

4.10 Special register access


The Cortex-A710 core performs register renaming for general purpose registers to enable
speculative and out-of-order instruction execution. However, most special-purpose registers are
not renamed. Instructions that read or write non-renamed registers are subject to one or more of
the following additional execution constraints.

Non-Speculative Execution – Instructions may only execute non-speculatively.



In-Order Execution – Instructions must execute in-order with respect to other similar instructions
or in some cases all instructions.

Flush Side-Effects – Instructions trigger a flush side-effect after executing for synchronization.

The table below summarizes various special-purpose register read accesses and the associated
execution constraints or side-effects.

Table 4-2 Special-purpose register read accesses

Register Read Non-Speculative In-Order Flush Side-Effect Notes

APSR Yes Yes No 3

CurrentEL No Yes No -

DAIF No Yes No -

DLR_EL0 No Yes No -

DSPSR_EL0 No Yes No -

ELR_* No Yes No -

FPCR No Yes No -

FPSCR Yes Yes No 2

FPSR Yes Yes No 2

NZCV No No No 1

SP_* No No No 1

SPSel No Yes No -

SPSR_* No Yes No -

FFR No Yes No -

Notes:
1. The NZCV and SP registers are fully renamed.
2. FPSR/FPSCR reads must wait for all prior instructions that may update the status flags to execute and retire.
3. APSR reads must wait for all prior instructions that may set the Q bit to execute and retire.

The table below summarizes various special-purpose register write accesses and the associated
execution constraints or side-effects.

Table 4-3 Special-purpose register write accesses

Register Write  Non-Speculative   In-Order   Flush Side-Effect   Notes
APSR            Yes               Yes        No                  4
DAIF            Yes               Yes        No                  -
DLR_EL0         Yes               Yes        No                  -
DSPSR_EL0       Yes               Yes        No                  -
ELR_*           Yes               Yes        No                  -
FPCR            Yes               Yes        Maybe               2
FPSCR           Yes               Yes        Maybe               2, 3
FPSR            Yes               Yes        No                  3
NZCV            No                No         No                  1
SP_*            No                No         No                  1
SPSel           Yes               Yes        Yes                 -
SPSR_*          Yes               Yes        No                  -
FFR             Yes               Yes        No                  -

Notes:
1. The NZCV and SP registers are fully renamed.
2. If the FPCR/FPSCR write is predicted to change the control field values, it will introduce a barrier which prevents
subsequent instructions from executing. If the FPCR/FPSCR write is predicted to not change the control field values, it
will execute without a barrier but trigger a flush if the values change.
3. FPSR/FPSCR writes must stall at dispatch if another FPSR/FPSCR write is still pending.
4. APSR writes that set the Q bit will introduce a barrier which prevents subsequent instructions from executing until
the write completes.

4.11 Register forwarding hazards


The Armv8-A architecture allows FP/ASIMD instructions to read and write 32-bit S-registers. In
AArch32, each S-register corresponds to one half (upper or lower) of an overlaid 64-bit
D-register. A Q-register in turn consists of two overlaid D-registers. Register forwarding hazards
may occur when one µOP reads a Q-register operand that has recently been written with one or
more S-register results. Consider the following scenario.
VADD S0, S1, S2
VADD Q6, Q5, Q0

The first instruction writes S0, which corresponds to the lowest part of Q0. The second
instruction then requires Q0 as an input operand. In this scenario, there is a RAW dependency
between the first and the second instructions. In most cases, Cortex-A710 performs slightly
worse in such situations.

Cortex-A710 is able to avoid this register-hazard condition for certain cases. The following rules
describe the conditions under which a register-hazard can occur.
• The producer writes an S-register (not a D[x] scalar)
• The consumer reads an overlapping Q-register (not as a D[x] scalar)
• The consumer is a FP/ASIMD µOP (not a store or MOV µOP)


To avoid unnecessary hazards, it is recommended that the programmer use D[x] scalar writes
when populating registers prior to ASIMD operations. For example, either of the following
instruction forms would safely prevent a subsequent hazard.
VLD1.32 D0[x], [address]
VADD Q1, Q0, Q2

4.12 IT blocks
The Armv8-A architecture performance deprecates some uses of the IT instruction, with the result
that software may be written using multiple naïve single-instruction IT blocks. It is preferred
that software instead generate multi-instruction IT blocks rather than single-instruction blocks.

4.13 Instruction fusion


Cortex-A710 can accelerate certain instruction pairs in an operation called fusion. Specific
AArch64 instruction pairs that can be fused are as follows:

CMP/CMN (immediate) + B.cond

CMP/CMN (register) + B.cond

CMP (immediate) + CSEL

CMP (register) + CSEL

CMP (immediate) + CSET

CMP (register) + CSET

TST (immediate) + B.cond

TST (register) + B.cond

BICS (register) + B.cond

NOP + Any instruction

The following instruction pairs are fused in both AArch32 and AArch64 modes:

AESE + AESMC (see Section 4.6 on AES Encryption/Decryption)

AESD + AESIMC (see Section 4.6 on AES Encryption/Decryption)

CMP/CMN (immediate) + B.cond

CMP/CMN (register) + B.cond


TST (immediate) + B.cond

TST (register) + B.cond

BICS (register) + B.cond

These instruction pairs must be adjacent to each other in program code. For CMP, CMN, TST and
BICS, fusion is not allowed for shifted and/or extended register forms. For BICS, the destination
register should be XZR or WZR if fusion is to take place.
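
The adjacency requirement can be illustrated with a small C sketch that scans A64 instruction words for one of the listed pairs, CMP/CMN (immediate) immediately followed by B.cond. The function names are hypothetical and the sketch covers only this one pair. It does not model any other condition the core may apply before fusing, so a reported pair is a candidate for fusion rather than a guarantee.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CMP/CMN (immediate) are aliases of SUBS/ADDS (immediate) whose
 * destination register field is 31 (XZR or WZR). */
static bool is_cmp_cmn_imm(uint32_t insn)
{
    return (insn & 0x3F80001Fu) == 0x3100001Fu;
}

/* B.cond: bits [31:24] are 0x54 and bit [4] is 0. */
static bool is_b_cond(uint32_t insn)
{
    return (insn & 0xFF000010u) == 0x54000000u;
}

/* Counts positions where a CMP/CMN (immediate) is directly followed by a
 * B.cond, that is, where the pair is adjacent in program order. */
size_t count_cmp_bcond_pairs(const uint32_t *code, size_t count)
{
    size_t pairs = 0;

    for (size_t i = 0; i + 1 < count; i++) {
        if (is_cmp_cmn_imm(code[i]) && is_b_cond(code[i + 1]))
            pairs++;
    }
    return pairs;
}
```

In the sketch, a compare that is separated from its branch by even a single NOP is not counted, which mirrors the requirement that the two instructions be adjacent.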

4.14 Zero Latency MOVs


A subset of register-to-register move operations and move immediate operations are executed
with zero latency. These instructions do not utilize the scheduling and execution resources of the
machine. These are as follows:

MOV Xd, #0

MOV Xd, XZR

MOV Wd, #0

MOV Wd, WZR

MOV Hd, WZR

MOV Hd, XZR

MOV Sd, WZR

MOV Dd, XZR

MOVI Dd, #0

MOVI Vd.2D, #0

MOV Rd, #0 (AArch32)

MOV Wd, Wn

MOV Xd, Xn

MOV Rd, Rn (AArch32)

The last 3 instructions may not be executed with zero latency under certain conditions.


4.15 Cache maintenance operation


When using set/way invalidation operations on the L1 cache, it is recommended that software be
written to traverse the sets in the inner loop and the ways in the outer loop.
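
The recommended loop order can be sketched in C as follows. Instead of issuing the privileged maintenance instruction, this host-runnable sketch writes the set/way operands into an array in the order in which they would be issued. The function names are hypothetical, and the operand layout (way in the top bits of the low word, set above the line-offset bits, level index in bits [3:1]) is assumed from the architectural set/way operand format. Real code would read the cache geometry from the cache ID registers and issue the maintenance operation for each operand.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest n such that 2^n >= v (v is assumed to be at least 1). */
static unsigned ceil_log2(uint32_t v)
{
    unsigned n = 0;
    while ((UINT64_C(1) << n) < v)
        n++;
    return n;
}

/* Fills `out` with set/way operands, ways in the outer loop and sets in
 * the inner loop. `line_shift` is log2(line size in bytes) and
 * `level_index` is the cache level minus one (0 for L1).
 * Returns the number of operands written, or 0 if `cap` is too small. */
size_t build_setway_operands(uint32_t num_ways, uint32_t num_sets,
                             unsigned line_shift, unsigned level_index,
                             uint64_t *out, size_t cap)
{
    size_t total = (size_t)num_ways * num_sets;
    unsigned way_shift = 32u - ceil_log2(num_ways);
    size_t n = 0;

    if (total > cap)
        return 0;

    for (uint32_t way = 0; way < num_ways; way++) {         /* outer: ways */
        for (uint32_t set = 0; set < num_sets; set++) {     /* inner: sets */
            out[n++] = ((uint64_t)way << way_shift) |
                       ((uint64_t)set << line_shift) |
                       ((uint64_t)level_index << 1);
        }
    }
    return n;
}
```

For an illustrative 4-way cache with 256 sets and 64-byte lines, the first 256 operands all target way 0 with the set index increasing, and only then does the way index advance.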

4.16 Memory Tagging - Tagging Performance


To achieve maximum throughput for tag-only operations, it is recommended that one do the following.

Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping. Use the STGM (or
DC GVA) instruction as shown in the example below:

Loop_start:
SUBS x2,x2,#0x80
STGM x1,[x0]
ADD x0,x0,#0x40
STGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start

To achieve maximum throughput for tagging and zeroing out data, it is recommended that one do the following.

Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping. Use the STZGM (or
DC GZVA) instruction as shown in the example below:

Loop_start:
SUBS x2,x2,#0x80
STZGM x1,[x0]
ADD x0,x0,#0x40
STZGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start

To achieve maximum throughput for tag-loading, it is recommended that one do the following.

Unroll the loop to include multiple load operations per iteration, minimizing the overheads of looping. Use LDGM
instruction as shown in the example below:

Loop_start:
SUBS x2,x2,#0x80
LDGM x1,[x0]
ADD x0,x0,#0x40
LDGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start

Also, it is recommended to use STZGM (or DC GZVA) to set tags if the existing data does not need to be preserved.

4.17 Memory Tagging - Synchronous Mode


In synchronous tag checking mode, stores cannot be performed speculatively. Each store must
complete a tag check before the next store can be executed non-speculatively. Thus,
performance of stores in synchronous tag checking mode will be diminished.

It is recommended to use asynchronous mode for better performance.

4.18 Complex ASIMD and SVE instructions


The bandwidth of the following ASIMD and SVE instructions is limited by decode constraints, and
it is advisable to avoid them when high-performance code is desired.

ASIMD

LD4R, post-indexed addressing, element size = 64b.

LD4, single 4-element structure, post indexed addressing mode, element size = 64b.

LD4, multiple 4-element structures, quad form.

LD4, multiple 4-element structures, double word form.

ST4, multiple 4-element structures, quad form, element size less than 64b.

ST4, multiple 4-element structures, quad form, element size = 64b, post indexed addressing
mode.

SVE

LD1B gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b unscaled offset.

LD1H gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.

LD1W gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.

LD3[B/H/W/D] contiguous (scalar + scalar addressing).

LD4[B/H/D/W] contiguous (scalar + immediate addressing).


LD4[B/H/D/W] contiguous (scalar + scalar addressing).

LDFF1B gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b unscaled offset.

LDFF1H gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.

LDFF1W gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.

ST3[B/H/W/D] contiguous (scalar + scalar addressing).

ST4[B/H/D/W] contiguous (scalar + immediate addressing).

ST4[B/H/D/W] contiguous (scalar + scalar addressing).
