Arm Cortex-A710 Core Software Optimization Guide
Revision: r2p0
Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.
Release information
Document history
Issue Date Confidentiality Change
Your access to the information in this document is conditional upon your acceptance that you will not use
or permit others to use the information for the purposes of determining whether implementations infringe
any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation
with respect to, has undertaken no analysis to identify or understand the scope and content of, patents,
copyrights, trade secrets, or other rights.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY,
ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use,
duplication or disclosure of this document complies fully with any relevant export laws and regulations to
assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such
export laws. Use of the word “partner” in reference to Arm's customers is not intended to create or refer to
any partnership relationship with any other company. Arm may make changes to this document at any
time and without notice.
Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.
Non-Confidential
Page 2 of 92
Arm® Cortex®-A710 Core Software Optimization Guide PJDOC-466751330-14951
Issue 4.0
This document may be translated into other languages for convenience, and you agree that if there is any
conflict between the English version of this document and any translation, the terms of the English version
of the Agreement shall prevail.
The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm
Limited (or its affiliates) in the US and/or elsewhere. All rights reserved. Other brands and names
mentioned in this document may be the trademarks of their respective owners. Please follow Arm's
trademark usage guidelines at https://www.arm.com/company/policies/trademarks.
Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.
(LES-PRE-20349)
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to
license restrictions in accordance with the terms of the agreement entered into by Arm and the party that
Arm delivered this document to.
Product Status
The information in this document is final, that is for a developed product.
Web Address
developer.arm.com
This document includes terms that can be offensive. We will replace these terms in a future issue of this
document. If you find offensive terms in this document, please email terms@arm.com.
Contents
1 Introduction
1.1 Product revision status
1.2 Intended audience
1.3 Scope
1.4 Conventions
1.4.1 Glossary
1.4.2 Terms and abbreviations
1.4.3 Typographical conventions
1.5 Additional reading
1.6 Feedback
1.6.1 Feedback on this product
1.6.2 Feedback on content
2 Overview
2.1 Pipeline overview
1 Introduction
1.1 Product revision status
The rmpn identifier indicates the revision status of the product described in this book, for
example, r1p2, where:
rm Identifies the major revision of the product, for example, r1.
pn Identifies the minor revision or modification status of the product, for example, p2.
1.3 Scope
This document describes aspects of the Cortex-A710 core micro-architecture that influence
software performance. Micro-architectural detail is limited to that which is useful for software
optimization.
Documentation extends only to software visible behavior of the Cortex-A710 core and not to the
hardware rationale behind the behavior.
1.4 Conventions
The following subsections describe conventions used in Arm documents.
1.4.1 Glossary
The Arm Glossary is a list of terms used in Arm documentation, together with definitions for
those terms. The Arm Glossary does not contain terms that are industry standard unless the Arm
meaning differs from the generally accepted meaning.
MOP Macro-OPeration
µOP Micro-OPeration
FP Floating-point
bold Highlights interface elements, such as menu names. Denotes signal names. Also used
for terms in descriptive lists, where appropriate.
monospace Denotes text that you can enter at the keyboard, such as commands, file and program
names, and source code.
monospace bold Denotes language keywords when used outside example code.
monospace underline Denotes a permitted abbreviation for a command or option. You can enter the
underlined text instead of the full command or option name.
< and > Enclose replaceable terms for assembler syntax where they appear in code or code
fragments.
For example:
MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>
SMALL CAPITALS Used in body text for a few terms that have specific technical meanings, that are
defined in the Arm® Glossary. For example, IMPLEMENTATION DEFINED, IMPLEMENTATION
SPECIFIC, UNKNOWN, and UNPREDICTABLE.
This represents a recommendation which, if not followed, might lead to system failure
or damage.
This represents a requirement for the system that, if not followed, might result in
system failure or damage.
This represents a requirement for the system that, if not followed, will result in system
failure or damage.
This represents a useful tip that might make it easier, better or faster to perform a task.
This is a reminder of something important that relates to the information you are
reading.
1.6 Feedback
Arm welcomes feedback on this product and its documentation.
Arm tests the PDF only in Adobe Acrobat and Acrobat Reader and cannot guarantee the quality
of the represented document when used with any other PDF reader.
2 Overview
The Cortex-A710 core is a high-performance, low-power, and constrained-area product that
implements the Armv9.0-A architecture and supports all previous Armv8-A architectures up to
Armv8.5-A. It targets clamshell and premium high-end smartphone applications.
This document describes elements of the Cortex-A710 core micro-architecture that influence
software performance so that software and compilers can be optimized accordingly.
[Figure: Cortex-A710 pipeline. Labels recovered from the diagram: Issue; Branch 0; Branch 1; Integer Single-Cycle 0; FP/ASIMD 0; FP/ASIMD 1; Load/Store 0; Load/Store 1; Load 2; Store data 0; Store data 1.]
The execution pipelines support different types of operations, as shown in the following table.
Instruction groups              Instructions
Branch 0/1                      Branch µOPs
Integer Single/Multi-cycle 0/1  Integer shift-ALU, multiply, divide, CRC and sum-of-absolute-differences µOPs
Load/Store 0/1                  Load, Store address generation and special memory µOPs
FP/ASIMD-0                      ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, FP sqrt, crypto µOPs, store data µOPs
FP/ASIMD-1                      ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift µOPs, store data µOPs, crypto µOPs
3 Instruction characteristics
3.1 Instruction tables
This chapter describes high-level performance characteristics for most Armv9-A instructions. A
series of tables summarize the effective execution latency and throughput (instruction bandwidth
per cycle), pipelines utilized, and special behaviors associated with each group of instructions.
Utilized pipelines correspond to the execution pipelines described in chapter 2.
In the tables below, Exec Latency is defined as the minimum latency seen by an operation
dependent on an instruction in the described group.
In the tables below, Execution Throughput is defined as the maximum throughput (in
instructions per cycle) of the specified instruction group that can be achieved in the entirety of
the Cortex-A710 microarchitecture.
Branch 0/1 B
Integer multicycle 0 M0
Load/Store 0/1 L01
FP/ASIMD 0/1 V
FP/ASIMD 0 V0
FP/ASIMD 1 V1
Branch, immed B 1 2 B -
Branch, immed B 1 2 B -
Branch, register BX 1 2 B -
Branch forms +1 2 +B 2
Notes:
1. The latency is 2, throughput is 2 and the utilized pipeline is M when GCR_EL1.RRND = 1. When GCR_EL1.RRND = 0,
the latency is 3, throughput is 1 and the utilized pipeline is M0.
2. Branch forms are possible when the instruction destination register is the PC. For those cases, an additional branch
µOP is required. This adds 1 cycle to the latency.
Notes:
1. Integer divides are performed using an iterative algorithm and block any subsequent divide operations until
complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
Accumulator forwarding is not supported for consumers of 64-bit multiply-high operations.
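As a rough illustration of the late-forwarding note, the following C sketch estimates how long a chain of dependent multiply-accumulate µOPs takes when each µOP can issue N cycles after its predecessor. The function name mac_chain_cycles and the simple cost model are assumptions made for illustration; the real full latency and accumulate latency must be read from the instruction tables.

```c
#include <assert.h>

/*
 * Hypothetical cost model (not from the guide):
 * the last µOP in a chain of k issues (k - 1) * acc_latency cycles
 * after the first, and its result is ready full_latency cycles later.
 */
static unsigned mac_chain_cycles(unsigned k,
                                 unsigned full_latency,
                                 unsigned acc_latency)
{
    assert(acc_latency <= full_latency); /* forwarding never lengthens the chain */
    if (k == 0) {
        return 0; /* empty chain takes no time */
    }
    return full_latency + (k - 1u) * acc_latency;
}
```

With illustrative values of 4 cycles for the full latency and 1 cycle for the accumulate latency, a chain of 8 µOPs is modeled as 4 + 7 × 1 = 11 cycles, compared with 8 × 4 = 32 cycles if no forwarding were available.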
Notes:
1. GE-setting instructions require three extra µOPs and two additional cycles to conditionally update the GE field (GE
latency shown in parentheses).
Notes:
1. Conditional loads have extra µOP(s) which go down pipeline I and have 1 cycle of extra latency compared to their
unconditional counterparts.
2. Conditional loads go down the L01 pipe and have an execution throughput of 2, whereas unconditional versions have a
throughput of 3.
3. The address update op goes down pipeline ‘I’ if the load is unconditional.
4. N is floor[(num_reg+5)/6].
5. R is floor[(num_reg+1)/2].
6. Branch forms are possible when the instruction destination register is the PC. For those cases, an additional branch
µOP is required. This adds 1 cycle to the latency.
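The N and R expressions in notes 4 and 5 can be evaluated with plain unsigned integer division, which already rounds down. The sketch below is a minimal illustration; the function names load_multiple_n and load_multiple_r are hypothetical, and how N and R enter the latency and throughput columns is defined by the load table itself.

```c
#include <assert.h>

/* Note 4: N = floor((num_reg + 5) / 6). Hypothetical helper name. */
static unsigned load_multiple_n(unsigned num_reg)
{
    assert(num_reg > 0); /* a load multiple transfers at least one register */
    return (num_reg + 5u) / 6u; /* unsigned division truncates, giving floor */
}

/* Note 5: R = floor((num_reg + 1) / 2). Hypothetical helper name. */
static unsigned load_multiple_r(unsigned num_reg)
{
    assert(num_reg > 0);
    return (num_reg + 1u) / 2u;
}
```

For example, a load of 6 registers gives N = 1 and R = 3, while a load of 7 registers gives N = 2 and R = 4.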
Notes:
1. The address update op goes down pipeline ‘I’ if the store is unconditional.
2. The address update op goes down pipeline “M” if the store is unconditional.
3. For store multiple instructions, N=floor((num_regs+3)/4).
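The store-multiple expression in note 3 can be evaluated the same way. This is a minimal sketch with a hypothetical function name, store_multiple_n; the meaning of N in the latency and throughput columns is given by the store table.

```c
#include <assert.h>

/* Note 3: N = floor((num_regs + 3) / 4). Hypothetical helper name. */
static unsigned store_multiple_n(unsigned num_regs)
{
    assert(num_regs > 0); /* a store multiple transfers at least one register */
    return (num_regs + 3u) / 4u; /* unsigned division truncates, giving floor */
}
```

For example, storing 4 registers gives N = 1, 5 registers gives N = 2, and 9 registers gives N = 3.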
FP compare FCCMP{E}, FCMP{E} 2 1 V0 -
FP negate FNEG 2 2 V -
FP select FCSEL 2 2 V -
Notes:
1. FP divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply µOPs to the accumulate
operands of an FP multiply-accumulate µOP. The latter can potentially be issued 1 cycle after the FP multiply µOP has
been issued.
3. FP multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in
parentheses).
Notes:
Conditional loads have an extra µOP which goes down pipeline V and have 2 cycles of extra latency compared to their
unconditional counterparts.
1. N is (num_reg)/6 + 5.
2. N* is (num_reg)/4 + 5.
3. R is num_reg/2.
4. Writeback forms of load instructions require an extra µOP to update the base address. This update is typically
performed in parallel with or prior to the load µOP (update latency shown in parentheses).
5. The number in parentheses represents the latency and throughput of conditional loads.
6. Conditional loads go down the L01 pipe.
Notes:
1. For store multiple instructions, N = (num_reg/2)
2. R is num_regs.
3. Writeback forms of store instructions require an extra µOP to update the base address. This update is typically
performed in parallel with or prior to the store µOP (update latency shown in parentheses).
Notes:
1. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of integer multiply-accumulate µOPs to issue one every cycle or one every other cycle (accumulate latency
shown in parentheses).
2. Other accumulate pipelines also support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of such µOPs to issue one every cycle (accumulate latency shown in parentheses).
3. This category includes instructions of the form “PMULL Vd.8H, Vn.8B, Vm.8B” and “PMULL2 Vd.8H, Vn.16B, Vm.16B”.
Notes:
1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing
a typical sequence of floating-point multiply-accumulate µOPs to issue one every N cycles (accumulate latency N
shown in parentheses).
2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP multiply µOPs to the
accumulate operands of an ASIMD FP multiply-accumulate µOP. The latter can potentially be issued 1 cycle after the
ASIMD FP multiply µOP has been issued.
3. ASIMD divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
Notes:
1. ASIMD pipelines that execute these instructions support late-forwarding of accumulate operands from similar µOPs,
allowing a typical sequence of µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
Notes:
1. Writeback forms of load instructions require an extra µOP to update the base address. This update is typically
performed in parallel with the load µOP (update latency shown in parentheses).
2. Conditional loads go down the L01 pipe, and the number in parentheses represents their throughput when it differs
from that of the unconditional forms.
Notes:
1. Writeback forms of store instructions require an extra µOP to update the base address. This update is typically
performed in parallel with the store µOP (update latency shown in parentheses).
Notes:
1. Adjacent AESE/AESMC instruction pairs and adjacent AESD/AESIMC instruction pairs will exhibit the performance
characteristics described in Section 4.6.
3.25 CRC
Table 3-41 AArch64 CRC
Instruction Group    AArch64 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Notes:
1. CRC execution supports late forwarding of the result from a producer µOP to a consumer µOP. This results in a 1
cycle reduction in latency as seen by the consumer.
Notes:
1. When the governing predicate is the same as destination, the latency is increased by one cycle.
Extract EXT 2 2 V -
Notes:
1. When the governing predicate is the same as destination, the latency is increased by one cycle.
2. SVE accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of such µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
3. SVE integer divide operations are performed using an iterative algorithm and block subsequent similar operations
to the same pipeline until complete.
4. Same as 2, except that saturating instructions require an extra cycle of latency for late-forwarding of accumulate
operands.
5. If the consuming instruction has a flag source, the latency for this instruction is 4 cycles.
Notes:
1. SVE multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of floating-point multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown
in parentheses).
2. SVE divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
Notes:
1. SVE pipelines that execute these instructions support late-forwarding of accumulate operands from similar µOPs,
allowing a typical sequence of µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
Notes:
1. When destination is same as the governing predicate, the latency of the instruction increases by one cycle.
4 Special considerations
4.1 Dispatch constraints
Dispatch of µOPs from the in-order portion to the out-of-order portion of the microarchitecture is subject to several
constraints. Consider these constraints during code generation to maximize the effective dispatch
bandwidth and subsequent execution bandwidth of Cortex-A710.
The dispatch stage can process up to 5 MOPs per cycle and dispatch up to 10 µOPs per cycle, with the following
limitations on the number of µOPs of each type that may be simultaneously dispatched.
If more µOPs are available for dispatch in a given cycle than the constraints above allow, µOPs are dispatched in
oldest-to-youngest age order to the extent those constraints permit.
Unroll the loop to include multiple load and store operations per iteration, minimizing the overheads of
looping.
Align stores on 32B boundary wherever possible.
Use non-writeback forms of LDP and STP instructions, interleaving them as shown in the example below:
Loop_start:
SUBS x2,x2,#96
LDP q3,q4,[x1,#0]
STP q3,q4,[x0,#0]
LDP q3,q4,[x1,#32]
STP q3,q4,[x0,#32]
LDP q3,q4,[x1,#64]
STP q3,q4,[x0,#64]
ADD x1,x1,#96
ADD x0,x0,#96
B.GT Loop_start
A recommended copy routine for AArch32 would look like the sequence above but would use LDRD/STRD
instructions. Avoid load-/store-multiple instruction encodings (such as LDM and STM).
If the memory locations being copied are non-cacheable, the non-temporal version of LDPQ (LDNPQ) should be used.
STPQ should still be used for the stores.
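The copy loop above assumes that the byte count in x2 is a positive multiple of 96; with any other count the final pass would copy past the end of the buffer. As a minimal sketch of how a caller might split the work, the C fragment below models the same 96-byte unrolled body followed by a separate tail copy. The name copy_unrolled96 is hypothetical, and each 32-byte memcpy only stands in for one LDP/STP pair of Q registers; a compiler is free to lower it differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes using a 96-byte unrolled body,
 * mirroring the three LDP/STP pairs per iteration of the loop above.
 * The source and destination buffers must not overlap. */
static void copy_unrolled96(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    while (n >= 96) {
        memcpy(dst,      src,      32);  /* stands in for LDP/STP at #0  */
        memcpy(dst + 32, src + 32, 32);  /* stands in for LDP/STP at #32 */
        memcpy(dst + 64, src + 64, 32);  /* stands in for LDP/STP at #64 */
        src += 96;
        dst += 96;
        n   -= 96;
    }
    memcpy(dst, src, n);                 /* tail of 0 to 95 bytes */
}
```

Keeping the tail outside the unrolled body leaves the hot loop identical in shape to the assembly sequence, so the recommendations on alignment and non-writeback addressing still apply to it.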
Similarly, it is recommended to use LDPQ to achieve maximum throughput for memcmp (memory compare) loops
that compare cacheable memory. LDNPQ should be used for non-cacheable memory.
Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping.
Loop_start:
STP q1,q3,[x0,#0]
STP q1,q3,[x0,#0x20]
STP q1,q3,[x0,#0x40]
STP q1,q3,[x0,#0x60]
ADD x0,x0,#0x80
SUBS x2,x2,#0x80
B.GT Loop_start
To achieve maximum performance on memset to zero, it is recommended that one use DC ZVA instead of STP. An
optimal routine might look something like the following.
Loop_start:
SUBS x2,x2,#0x80
DC ZVA,x0
ADD x0,x0,#0x40
DC ZVA,x0
ADD x0,x0,#0x40
B.GT Loop_start
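The 0x40 stride in the loop above presumes a 64-byte DC ZVA block. The architecture reports the actual block size in DCZID_EL0: bits [3:0] (BS) hold log2 of the block size in 4-byte words, and bit 4 (DZP) indicates that DC ZVA is prohibited. The sketch below decodes a DCZID_EL0 value that has already been read (for example with MRS); the function name zva_block_bytes is an assumption made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: decode a DCZID_EL0 value.
 * Returns the DC ZVA block size in bytes, or 0 if DC ZVA is prohibited. */
static unsigned zva_block_bytes(uint64_t dczid)
{
    if (dczid & (1u << 4))               /* DZP set: do not use DC ZVA */
        return 0;
    return 4u << (dczid & 0xf);          /* BS = log2(words), 4 bytes per word */
}
```

A BS value of 4 yields 64 bytes, which matches the stride used above. A routine could fall back to the STP-based loop when the function returns 0 or a size it does not expect. Because DC ZVA zeroes the whole block containing the address, the region should also be aligned to the block size.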
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions perform better when
they are adjacent in the program code and both instructions use the same destination register.
1 ASIMD/SVE integer ALU, ASIMD/SVE integer shift, ASIMD/scalar insert and move, 1
ASIMD/SVE integer abs/cmp/max/min and the ASIMD miscellaneous instructions
in table 3-18.
4 ASIMD/SVE AES, ASIMD/SVE polynomial multiply and all the instruction types in 1
region 1.
Notes:
1. Reciprocal step and estimate instructions are excluded from this region.
2. ASIMD/SVE extract narrow, saturating instructions are excluded from this region.
3. ASIMD miscellaneous instructions can only be consumers of this region.
In addition to the regions mentioned in the table above, all instructions in regions 1 and 2 can
fast forward to FP/ASIMD/SVE stores, FP/ASIMD vector to integer register transfers and ASIMD
converts that write to general purpose registers.
For best case performance, avoid placing more than four branch instructions within an
aligned 32-byte instruction memory region.
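An aligned 32-byte region holds eight 4-byte A64 instructions, so the guideline above amounts to keeping at most four branches in any such group of eight. As a rough sketch of how a tool might check generated code, the C fragment below takes a sorted list of branch instruction addresses and reports the highest branch count found in any aligned 32-byte window. The function name and the input format are assumptions; obtaining the addresses (for example from a disassembly listing) is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker: branch_addrs holds the byte addresses of branch
 * instructions, sorted in ascending order. Returns the largest number of
 * branches that fall within a single aligned 32-byte window. */
static unsigned max_branches_per_32b(const uint64_t *branch_addrs, size_t n)
{
    unsigned best = 0, run = 0;
    uint64_t window = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t w = branch_addrs[i] >> 5;   /* index of the 32-byte window */
        if (i == 0 || w != window) {         /* entered a new window */
            window = w;
            run = 0;
        }
        run++;
        if (run > best)
            best = run;
    }
    return best;
}
```

A result greater than four flags a region that may be worth padding or reordering, subject to the usual code-size trade-offs.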
In-Order Execution – Instructions must execute in-order with respect to other similar instructions
or in some cases all instructions.
Flush Side-Effects – Instructions trigger a flush side-effect after executing for synchronization.
The table below summarizes various special-purpose register read accesses and the associated
execution constraints or side-effects.
CurrentEL No Yes No -
DAIF No Yes No -
DLR_EL0 No Yes No -
DSPSR_EL0 No Yes No -
ELR_* No Yes No -
FPCR No Yes No -
NZCV No No No 1
SP_* No No No 1
SPSel No Yes No -
SPSR_* No Yes No -
FFR No Yes No -
Notes:
1. The NZCV and SP registers are fully renamed.
2. FPSR/FPSCR reads must wait for all prior instructions that may update the status flags to execute and retire.
3. APSR reads must wait for all prior instructions that may set the Q bit to execute and retire.
The table below summarizes various special-purpose register write accesses and the associated execution
constraints or side-effects.
NZCV No No No 1
SP_* No No No 1
Notes:
1. The NZCV and SP registers are fully renamed.
2. If the FPCR/FPSCR write is predicted to change the control field values, it will introduce a barrier which prevents
subsequent instructions from executing. If the FPCR/FPSCR write is predicted to not change the control field values, it
will execute without a barrier but trigger a flush if the values change.
3. FPSR/FPSCR writes must stall at dispatch if another FPSR/FPSCR write is still pending.
4. APSR writes that set the Q bit will introduce a barrier which prevents subsequent instructions from executing until
the write completes.
The first instruction writes S0, which corresponds to the lowest part of Q0. The second
instruction then requires Q0 as an input operand. In this scenario, there is a RAW dependency
between the first and the second instructions. In most cases, Cortex-A710 performs slightly
worse in such situations.
Cortex-A710 is able to avoid this register-hazard condition for certain cases. The following rules
describe the conditions under which a register-hazard can occur.
• The producer writes an S-register (not a D[x] scalar)
• The consumer reads an overlapping Q-register (not as a D[x] scalar)
• The consumer is a FP/ASIMD µOP (not a store or MOV µOP)
To avoid unnecessary hazards, it is recommended that the programmer use D[x] scalar writes
when populating registers prior to ASIMD operations. For example, either of the following
instruction forms would safely prevent a subsequent hazard.
VLD1.32 D0[x], [address]
VADD Q1, Q0, Q2
4.12 IT blocks
The Armv8-A architecture performance deprecates some uses of the IT instruction in such a way
that software may be written using multiple naïve single-instruction IT blocks. It is preferred that
software instead generate multi-instruction IT blocks rather than single-instruction blocks.
The following instruction pairs are fused in both AArch32 and AArch64 modes:
These instruction pairs must be adjacent to each other in program code. For CMP, CMN, TST and
BICS, fusion is not allowed for shifted and/or extended register forms. For BICS, the destination
register should be XZR or WZR if fusion is to take place.
MOV Xd, #0
MOV Wd, #0
MOVI Dd, #0
MOVI Vd.2D, #0
MOV Wd, Wn
MOV Xd, Xn
The last 3 instructions may not be executed with zero latency under certain conditions.
Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping. Use the STGM (or
DC GVA) instruction as shown in the example below:
Loop_start:
SUBS x2,x2,#0x80
STGM x1,[x0]
ADD x0,x0,#0x40
STGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start
To achieve maximum throughput when setting tags and zeroing out data, it is recommended that one do the following.
Unroll the loop to include multiple store operations per iteration, minimizing the overheads of looping. Use the STZGM (or
DC ZGVA) instruction as shown in the example below:
Loop_start:
SUBS x2,x2,#0x80
STZGM x1,[x0]
ADD x0,x0,#0x40
STZGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start
To achieve maximum throughput for tag-loading, it is recommended that one do the following.
Unroll the loop to include multiple load operations per iteration, minimizing the overheads of looping. Use the LDGM
instruction as shown in the example below:
Loop_start:
SUBS x2,x2,#0x80
LDGM x1,[x0]
ADD x0,x0,#0x40
LDGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start
Also, it is recommended to use STZGM (or DC ZGVA) to set tags if the data contents are not a concern.
ASIMD
LD4, single 4-element structure, post indexed addressing mode, element size = 64b.
ST4, multiple 4-element structures, quad form, element size less than 64b.
ST4, multiple 4-element structures, quad form, element size = 64b, post indexed addressing
mode.
SVE
LD1B gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b unscaled offset.
LD1H gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
LD1W gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
LDFF1B gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b unscaled offset.
LDFF1H gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
LDFF1W gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.