Arm Cortex-A77 Software Optimization Guide
Revision: r1p1
Copyright © 2018, 2019 Arm Limited (or its affiliates). All rights reserved.
Release information
Document history
Your access to the information in this document is conditional upon your acceptance that you will not use
or permit others to use the information for the purposes of determining whether implementations
infringe any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation
with respect to, and has undertaken no analysis to identify or understand the scope and content of,
patents, copyrights, trade secrets, or other rights.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING
OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use,
duplication or disclosure of this document complies fully with any relevant export laws and regulations to
assure that this document or any portion thereof is not exported, directly or indirectly, in violation of
such export laws. Use of the word “partner” in reference to Arm's customers is not intended to create or
refer to any partnership relationship with any other company. Arm may make changes to this document
at any time and without notice.
Non-Confidential
Arm® Cortex®-A77 Core Software Optimization Guide PJDOC-466751330-11050
Issue 3.0
If any of the provisions contained in these terms conflict with any of the provisions of any click through or
signed written agreement covering this document with Arm, then the click through or signed written
agreement prevails over and supersedes the conflicting provisions of these terms. This document may be
translated into other languages for convenience, and you agree that if there is any conflict between the
English version of this document and any translation, the terms of the English version of the Agreement
shall prevail.
The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm
Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands and names
mentioned in this document may be the trademarks of their respective owners. Please follow Arm's
trademark usage guidelines at http://www.arm.com/company/policies/trademarks.
LES-PRE-20349
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to
license restrictions in accordance with the terms of the agreement entered into by Arm and the party that
Arm delivered this document to.
Web Address
http://www.arm.com
Contents
1 Introduction
1.1 Product revision status
1.2 Intended audience
1.3 Conventions
1.3.1 Glossary
1.3.2 Typographical conventions
1.4 Additional reading
1.5 Feedback
1.5.1 Feedback on this product
1.5.2 Feedback on content
3 Instruction characteristics
3.1 Instruction tables
3.2 Legend for reading the utilized pipelines
3.3 Branch instructions
3.4 Arithmetic and logical instructions
3.5 Move and shift instructions
3.6 Divide and multiply instructions
3.7 Saturating and parallel arithmetic instructions
3.8 Miscellaneous data-processing instructions
3.9 Load instructions
3.10 Store instructions
3.11 FP data processing instructions
3.12 FP miscellaneous instructions
3.13 FP load instructions
3.14 FP store instructions
3.15 ASIMD integer instructions
3.16 ASIMD floating-point instructions
3.17 ASIMD miscellaneous instructions
4 Special considerations
4.1 Dispatch constraints
4.2 Dispatch stall
4.3 Optimizing general-purpose register spills and fills
4.4 Optimizing memory copy
4.5 Load/Store alignment
4.6 AES encryption/decryption
4.7 Region based fast forwarding
4.8 Branch instruction alignment
4.9 FPCR self-synchronization
4.10 Special register access
4.11 Register forwarding hazards
4.12 IT blocks
4.13 Instruction fusion
1 Introduction
1.1 Product revision status
The rmpn identifier indicates the revision status of the product described in this book, for
example, r1p2, where:
rm    Identifies the major revision of the product, for example, r1.
pn    Identifies the minor revision or modification status of the product, for
      example, p2.
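As an illustration of the rmpn format described above, the following minimal Python sketch splits an identifier into its major and minor parts. The function name parse_revision and the regular expression are assumptions made for this example; they are not part of any Arm tool.

```python
import re

# Hypothetical helper: matches identifiers of the form r<m>p<n>, such as "r1p2".
_RMPN = re.compile(r"r(\d+)p(\d+)")

def parse_revision(identifier):
    """Return (major, minor) for an rmpn identifier, e.g. "r1p2" -> (1, 2)."""
    match = _RMPN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not an rmpn identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2))
```

For the revision named on the title page, parse_revision("r1p1") yields major revision 1 and minor revision 1.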
1.3 Conventions
The following subsections describe conventions used in Arm documents.
1.3.1 Glossary
The Arm Glossary is a list of terms used in Arm documentation, together with definitions for those
terms. The Arm Glossary does not contain terms that are industry standard unless the Arm
meaning differs from the generally accepted meaning.
Term    Meaning
ALU     Arithmetic and Logical Unit
ASIMD   Advanced SIMD
DSU     DynamIQ™ Shared Unit
MOP     Macro-OPeration
µOP     Micro-OPeration
SQRT    Square Root
T32     AArch32 Thumb® instruction set
FP      Floating-point
1.5 Feedback
1.5.1 Feedback on this product
If you have any comments or suggestions about this product, contact your supplier and give:
Arm tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the quality
of the represented document when used with any other PDF reader.
2 About this document
2.1 Scope
This document describes aspects of the Cortex-A77 core micro-architecture that influence
software performance. Micro-architectural detail is limited to that which is useful for software
optimization.
This documentation extends only to software visible behavior of the Cortex-A77 core and not to
the hardware rationale behind the behavior.
The Cortex-A77 core has a Level 1 (L1) memory system and a private, integrated Level 2 (L2) cache.
It also includes a superscalar, variable-length, out-of-order pipeline.
The Cortex-A77 core is implemented inside the DynamIQ™ Shared Unit (DSU) cluster. For more
information, see the Arm® DynamIQ™ Shared Unit Technical Reference Manual.
[Figure: Cortex-A77 pipeline. Front-end stages: Fetch; Decode, Rename, Dispatch. Execution
pipelines shown: Branch 0, Branch 1, Integer Single-Cycle 0, Integer Single-Cycle 1,
FP/ASIMD 1, Load/Store 0, Load/Store 1, Store data 0, Store data 1.]
The execution pipelines support different types of operations, as shown in the following table.
3 Instruction characteristics
3.1 Instruction tables
This chapter describes high-level performance characteristics for most Armv8.2-A A32, T32, and
A64 instructions. A series of tables summarize the effective execution latency and throughput
(instruction bandwidth per cycle), pipelines utilized, and special behaviors associated with each
group of instructions. Utilized pipelines correspond to the execution pipelines described in
chapter 2.
In the tables below, Exec Latency is defined as the minimum latency seen by an operation
dependent on an instruction in the described group.
In the tables below, Execution Throughput is defined as the maximum throughput (in instructions
per cycle) of the specified instruction group that can be achieved in the entirety of the Cortex-A77
microarchitecture.
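The difference between the two figures can be made concrete with a first-order model. The sketch below is an assumption-laden simplification, not a description of the hardware: it treats Exec Latency as the only cost in a fully dependent chain and Execution Throughput as the only limit on fully independent instructions, and it ignores dispatch and every other constraint. The function names are hypothetical.

```python
import math
from fractions import Fraction

def dependent_chain_cycles(count, exec_latency):
    """Lower bound when each instruction consumes the previous result."""
    # Latencies add up because no two instructions in the chain overlap.
    return count * exec_latency

def independent_stream_cycles(count, throughput):
    """Lower bound when no instruction depends on another.

    throughput is in instructions per cycle; pass an int or a Fraction
    (for example Fraction(1, 2)) to keep the arithmetic exact.
    """
    return math.ceil(Fraction(count) / throughput)
```

With illustrative values of latency 3 and throughput 2, eight dependent instructions need at least 24 cycles under this model, while eight independent ones need at least 4.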
Instruction Group           AArch32 Instructions   Exec Latency   Execution Throughput   Utilized Pipelines   Notes
Branch, immed               B                      1              2                      B                    -
Branch, register            BX                     1              2                      B                    -
Branch and link, immed      BL, BLX                1              2                      B                    -
Branch and link, register   BLX                    1              2                      B                    -
Compare and branch          CBZ, CBNZ              1              2                      B                    -
Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µOP is required. This adds 1 cycle to the latency.
1. Integer divides are performed using an iterative algorithm and block any subsequent divide
operations until complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar
µOPs, allowing a typical sequence of multiply-accumulate µOPs to issue one every N cycles
(accumulate latency N shown in parentheses). Accumulator forwarding is not supported for
consumers of 64-bit multiply high operations.
3. Multiplies that set the condition flags require an additional integer µOP.
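The effect of accumulate forwarding in note 2 can be sketched with a small calculation. This is a simplified model under stated assumptions: every µOP in the chain accumulates into the result of the previous one, forwarding always applies, and nothing else limits issue. The function name and the numeric values used below are illustrative, not taken from the tables.

```python
def mac_chain_cycles(count, exec_latency, accumulate_latency):
    """Cycles until the last result of a multiply-accumulate chain is ready.

    With late forwarding, each later µOP can issue accumulate_latency (N)
    cycles after its predecessor; only the last one pays the full
    exec_latency before its result is available.
    """
    if count <= 0:
        return 0
    return (count - 1) * accumulate_latency + exec_latency
```

For example, with an assumed latency of 2 and accumulate latency of 1, a chain of four such µOPs completes in 5 cycles in this model, compared with 8 cycles if each had to wait for the full latency of its predecessor.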
Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µOP is required. This adds 1 cycle to the latency.
1. Conditional loads have an extra µOP that goes down pipeline I and have 1 cycle of extra
latency compared to their unconditional counterparts.
2. The throughput of conditional LDRD is 1, compared to a throughput of 2 for unconditional
LDRD.
3. The address update op for addressing forms that use register plus scaled register, or
register plus extended register, goes down pipeline ‘I’ if the shift is LSL and the shift
value is less than or equal to 4.
4. N is floor((num_reg + 3) / 4).
5. R is floor((num_reg + 1) / 2).
6. Branch forms are possible when the instruction destination register is the PC. For those
cases, an additional branch µOP is required. This adds 1 cycle to the latency.
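The quantities N and R in notes 4 and 5 are plain integer expressions of the number of registers transferred. A minimal sketch follows; the function names are hypothetical, and what N and R count is defined by the table entries that reference them.

```python
def load_multiple_n(num_reg):
    """N = floor((num_reg + 3) / 4), as given in note 4."""
    return (num_reg + 3) // 4

def load_multiple_r(num_reg):
    """R = floor((num_reg + 1) / 2), as given in note 5."""
    return (num_reg + 1) // 2
```

Both expressions round up: N is the register count divided by four, rounded up, and R is the register count divided by two, rounded up. A five-register load multiple therefore gives N = 2 and R = 3.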
1. The address update op for addressing forms that use register plus scaled register, or
register plus extended register, goes down pipeline ‘I’ if the shift is LSL and the shift
value is less than or equal to 4.
2. For store multiple instructions, N = floor((num_regs + 3) / 4).
1. FP divide and square root operations are performed using an iterative algorithm and block
subsequent similar operations to the same pipeline until complete.
1. Conditional loads have an extra µOP that goes down pipeline V and have 2 cycles of extra
latency compared to their unconditional counterparts.
2. N is floor((num_reg + 3) / 4).
3. R is floor((num_reg + 1) / 2).
4. Writeback forms of load instructions require an extra µOP to update the base address. This
update is typically performed in parallel with or prior to the load µOP (update latency shown
in parentheses).
Writeback forms of load instructions require an extra µOP to update the base address. This
update is typically performed in parallel with the load µOP (update latency shown in
parentheses).
Writeback forms of store instructions require an extra µOP to update the base address. This
update is typically performed in parallel with the store µOP (update latency shown in
parentheses).
Adjacent AESE/AESMC instruction pairs and adjacent AESD/AESIMC instruction pairs will exhibit
the performance characteristics described in Section 4.6.
3.21 CRC
Table 38: AArch64 CRC
CRC execution supports late forwarding of the result from a producer µOP to a consumer µOP.
This results in a 1 cycle reduction in latency as seen by the consumer.
4 Special considerations
4.1 Dispatch constraints
Dispatch of µOPs from the in-order portion to the out-of-order portion of the microarchitecture
includes several constraints. It is important to consider these constraints during code generation
to maximize the effective dispatch bandwidth and subsequent execution bandwidth of
Cortex-A77.
The dispatch stage can process up to 6 MOPs per cycle and dispatch up to 10 µOPs per cycle, with
the following limitations on the number of µOPs of each type that may be simultaneously
dispatched.
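Setting aside the per-type limits, the two overall bounds stated above already give a lower bound on how many cycles a block of code needs to pass through dispatch. The sketch below is a simplification that models only those two bounds; the function and constant names are assumptions made for this example.

```python
import math

MAX_MOPS_PER_CYCLE = 6    # dispatch stage processes up to 6 MOPs per cycle
MAX_UOPS_PER_CYCLE = 10   # and dispatches up to 10 µOPs per cycle

def min_dispatch_cycles(mops, uops):
    """Lower bound on dispatch cycles for a block with the given totals."""
    return max(math.ceil(mops / MAX_MOPS_PER_CYCLE),
               math.ceil(uops / MAX_UOPS_PER_CYCLE))
```

A block of 12 MOPs that expands to 25 µOPs needs at least 3 cycles under this model, because the µOP bound is the tighter of the two.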
Unroll the loop to include multiple load and store operations per iteration, minimizing the
overheads of looping.
Align stores on a 16-byte boundary wherever possible.
Use non-writeback forms of LDP and STP instructions, interleaving them as shown in the
example below:
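Loop_start:
    SUBS X2, X2, #192           // X2: bytes remaining; sets the flags tested by B.GT
    LDP  Q3, Q4, [X1, #0]       // load 32 bytes from the source (X1)
    STP  Q3, Q4, [X0, #0]       // store them to the destination (X0)
    LDP  Q3, Q4, [X1, #32]
    STP  Q3, Q4, [X0, #32]
    LDP  Q3, Q4, [X1, #64]
    STP  Q3, Q4, [X0, #64]
    LDP  Q3, Q4, [X1, #96]
    STP  Q3, Q4, [X0, #96]
    LDP  Q3, Q4, [X1, #128]
    STP  Q3, Q4, [X0, #128]
    LDP  Q3, Q4, [X1, #160]
    STP  Q3, Q4, [X0, #160]
    ADD  X1, X1, #192           // advance the source; ADD leaves the flags unchanged
    ADD  X0, X0, #192           // advance the destination
    B.GT Loop_start             // repeat while bytes remain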
A recommended copy routine for AArch32 would look like the sequence above but would use
LDRD/STRD instructions. Avoid load-/store-multiple instruction encodings (such as LDM and STM).
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when
they are adjacent in the program code and both instructions use the same destination register.
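The conditions in the sentence above can be expressed as a small check over two neighbouring instructions. The sketch below is only an illustration: the tuple representation (mnemonic, destination register, source register) and the function name are assumptions made for this example, and the check says nothing about how the core implements the optimization.

```python
# Pairs named in the text: AESE followed by AESMC, AESD followed by AESIMC.
_AES_PAIRS = {("AESE", "AESMC"), ("AESD", "AESIMC")}

def is_fast_aes_pair(first, second):
    """Check two adjacent instructions, each a (mnemonic, dest, src) tuple."""
    op1, dst1, _src1 = first
    op2, dst2, src2 = second
    return ((op1.upper(), op2.upper()) in _AES_PAIRS
            and src2 == dst1     # dependent: the second consumes the first's result
            and dst2 == dst1)    # both write the same destination register
```

For instance, ("AESE", "v0", "v1") followed by ("AESMC", "v0", "v0") satisfies the check, whereas writing the AESMC result to a different register does not.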
In addition to the regions mentioned in the table above, all floating point and ASIMD instructions
can fast forward to FP and ASIMD stores.
Fast forwarding will not occur in AArch32 mode if the consuming register’s width is greater
than that of the producer.
Element sources used by FP multiply and multiply-accumulate operations cannot be
consumers.
Complex ASIMD shift by immediate/register and shift accumulate instructions cannot be
producers (see section 3.14) in region 1.
ASIMD extract narrow, saturating instructions cannot be producers (see section 3.16) in region
1.
ASIMD absolute difference accumulate and pairwise add and accumulate instructions cannot
be producers (see section 3.14) in region 1.
For FP producer-consumer pairs, the precision of the instructions should match (single, double
or half) in region 2.
For best case performance, avoid placing more than four branch instructions within an aligned
32-byte instruction memory region.
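One way to check generated code against this guideline is to bucket branch addresses by aligned 32-byte region and flag any region that holds more than four. The following sketch assumes the branch instruction addresses are already available as integers (for example, from a disassembly listing); the function and constant names are hypothetical.

```python
from collections import Counter

REGION_BYTES = 32   # aligned region size from the guideline above
MAX_BRANCHES = 4    # more than this many branches per region is flagged

def crowded_regions(branch_addresses):
    """Return base addresses of aligned 32-byte regions with more than 4 branches."""
    counts = Counter(addr // REGION_BYTES * REGION_BYTES
                     for addr in branch_addresses)
    return sorted(base for base, n in counts.items() if n > MAX_BRANCHES)
```

A compiler or post-link pass could use such a report to decide where padding or reordering is worth considering.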
The table below summarizes various special-purpose register write accesses and the associated
execution constraints or side-effects.
The first instruction writes S0, which corresponds to the lowest part of Q0. The second instruction
then requires Q0 as an input operand. In this scenario, there is a RAW dependency
between the first and the second instructions. In most cases, Cortex-A77 performs slightly worse
in such situations.
Cortex-A77 is able to avoid this register-hazard condition for certain cases. The following rules
describe the conditions under which a register-hazard can occur.
4.12 IT blocks
The Armv8-A architecture performance deprecates some uses of the IT instruction in such a way
that software may be written using multiple naïve single-instruction IT blocks. Software should
instead generate multi-instruction IT blocks wherever possible.
The following instruction pairs are fused in both AArch32 and AArch64 modes:
These instruction pairs must be adjacent to each other in program code. For CMP, CMN, TST and
BICS, fusion is not allowed for shifted and/or extended register forms. For BICS, the destination
register should be XZR or WZR if fusion is to take place.