Arm® Cortex®-A77 Core

Revision: r1p1

Software Optimization Guide

Non-Confidential Issue 3.0


Copyright © 2018, 2019 Arm Limited (or its affiliates). All rights reserved.
PJDOC-466751330-11050

Release information

Document history

Issue    Date    Confidentiality    Change
1.0    9 May 2018    Confidential    First release for r0p0
2.0    28 Sept 2018    Confidential    First release for r1p0
3.0    10 May 2019    Non-Confidential    First release for r1p1

Non-Confidential Proprietary Notice


This document is protected by copyright and other related rights and the practice or implementation of
the information contained in this document may be protected by one or more patents or pending patent
applications. No part of this document may be reproduced in any form by any means without the express
prior written permission of Arm. No license, express or implied, by estoppel or otherwise to any
intellectual property rights is granted by this document unless specifically stated.

Your access to the information in this document is conditional upon your acceptance that you will not use
or permit others to use the information for the purposes of determining whether implementations
infringe any third party patents.

THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation
with respect to, and has undertaken no analysis to identify or understand the scope and content of,
patents, copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.

TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING
OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.

This document consists solely of commercial items. You shall be responsible for ensuring that any use,
duplication or disclosure of this document complies fully with any relevant export laws and regulations to
assure that this document or any portion thereof is not exported, directly or indirectly, in violation of
such export laws. Use of the word “partner” in reference to Arm's customers is not intended to create or
refer to any partnership relationship with any other company. Arm may make changes to this document
at any time and without notice.

If any of the provisions contained in these terms conflict with any of the provisions of any click through or
signed written agreement covering this document with Arm, then the click through or signed written
agreement prevails over and supersedes the conflicting provisions of these terms. This document may be
translated into other languages for convenience, and you agree that if there is any conflict between the
English version of this document and any translation, the terms of the English version of the Agreement
shall prevail.

The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm
Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands and names
mentioned in this document may be the trademarks of their respective owners. Please follow Arm's
trademark usage guidelines at http://www.arm.com/company/policies/trademarks.

Copyright © 2018, 2019 Arm Limited (or its affiliates). All rights reserved.

Arm Limited. Company 02557590 registered in England.

110 Fulbourn Road, Cambridge, England CB1 9NJ.

LES-PRE-20349

Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to
license restrictions in accordance with the terms of the agreement entered into by Arm and the party that
Arm delivered this document to.

Unrestricted Access is an Arm internal classification.

Product Status
The information in this document is Final, that is for a developed product.

Web Address
http://www.arm.com

Contents

1 Introduction
1.1 Product revision status
1.2 Intended audience
1.3 Conventions
1.3.1 Glossary
1.3.2 Typographical conventions
1.4 Additional reading
1.5 Feedback
1.5.1 Feedback on this product
1.5.2 Feedback on content

2 About this document
2.1 Scope
2.2 Product overview
2.2.1 Pipeline overview

3 Instruction characteristics
3.1 Instruction tables
3.2 Legend for reading the utilized pipelines
3.3 Branch instructions
3.4 Arithmetic and logical instructions
3.5 Move and shift instructions
3.6 Divide and multiply instructions
3.7 Saturating and parallel arithmetic instructions
3.8 Miscellaneous data-processing instructions
3.9 Load instructions
3.10 Store instructions
3.11 FP data processing instructions
3.12 FP miscellaneous instructions
3.13 FP load instructions
3.14 FP store instructions
3.15 ASIMD integer instructions
3.16 ASIMD floating-point instructions
3.17 ASIMD miscellaneous instructions
3.18 ASIMD load instructions
3.19 ASIMD store instructions
3.20 Cryptography extensions
3.21 CRC

4 Special considerations
4.1 Dispatch constraints
4.2 Dispatch stall
4.3 Optimizing general-purpose register spills and fills
4.4 Optimizing memory copy
4.5 Load/Store alignment
4.6 AES encryption/decryption
4.7 Region based fast forwarding
4.8 Branch instruction alignment
4.9 FPCR self-synchronization
4.10 Special register access
4.11 Register forwarding hazards
4.12 IT blocks
4.13 Instruction fusion

1 Introduction
1.1 Product revision status
The rmpn identifier indicates the revision status of the product described in this book, for
example, r1p2, where:

rm    Identifies the major revision of the product, for example, r1.
pn    Identifies the minor revision or modification status of the product, for example, p2.
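
As an illustration of the rmpn convention, the following sketch splits a revision identifier into its major and minor parts. The function name parse_revision and the regular expression are assumptions made for this example; they are not part of any Arm tool.

```python
import re

# Hypothetical helper: the pattern mirrors the rmpn convention described above
# (r + major revision, p + minor revision or modification status).
_REVISION = re.compile(r"r(\d+)p(\d+)")


def parse_revision(identifier):
    """Return (major, minor) for an rmpn identifier such as 'r1p1'."""
    match = _REVISION.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not an rmpn identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_revision("r1p1"))
```

For the revision covered by this issue of the guide, r1p1, the major revision is 1 and the minor revision is 1.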

1.2 Intended audience


This document is for system designers, system integrators, and programmers who are designing or
programming a System-on-Chip (SoC) that uses an Arm core.

1.3 Conventions
The following subsections describe conventions used in Arm documents.

1.3.1 Glossary
The Arm Glossary is a list of terms used in Arm documentation, together with definitions for those
terms. The Arm Glossary does not contain terms that are industry standard unless the Arm
meaning differs from the generally accepted meaning.

See the Arm® Glossary for more information.

1.3.1.1 Terms and Abbreviations


This document uses the following terms and abbreviations.

Term    Meaning
ALU    Arithmetic and Logical Unit
ASIMD    Advanced SIMD
DSU    DynamIQ™ Shared Unit
MOP    Macro-OPeration
µOP    Micro-OPeration
SQRT    Square Root
T32    AArch32 Thumb® instruction set
FP    Floating-point

1.3.2 Typographical conventions


Convention    Use
italic    Introduces special terminology, denotes cross-references, and citations.
bold    Highlights interface elements, such as menu names. Denotes signal names. Also used for terms in descriptive lists, where appropriate.
monospace    Denotes text that you can enter at the keyboard, such as commands, file and program names, and source code.
monospace bold    Denotes language keywords when used outside example code.
monospace italic    Denotes arguments to monospace text where the argument is to be replaced by a specific value.
monospace underline    Denotes a permitted abbreviation for a command or option. You can enter the underlined text instead of the full command or option name.
< and >    Encloses replaceable terms for assembler syntax where they appear in code or code fragments. For example: MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>
SMALL CAPITALS    Used in body text for a few terms that have specific technical meanings, that are defined in the Arm® Glossary. For example, IMPLEMENTATION DEFINED, IMPLEMENTATION SPECIFIC, UNKNOWN, and UNPREDICTABLE.

Caution

Warning

Note

1.4 Additional reading


This document contains information that is specific to this product. See the following documents
for other relevant information:

Table 1: Arm publications

Document name    Document ID    Licensee only Y/N
Arm® Architecture Reference Manual, Armv8, for Armv8-A architecture profile    DDI 0487    Y
Arm® Cortex®-A77 Core Technical Reference Manual    101111    Y

1.5 Feedback
1.5.1 Feedback on this product
If you have any comments or suggestions about this product, contact your supplier and give:

• The product name.
• The product revision or version.
• An explanation with as much information as you can provide. Include symptoms and diagnostic procedures if appropriate.

1.5.2 Feedback on content


If you have comments on content, send an e-mail to [email protected] and give:

• The title: Arm® Cortex®-A77 Core Software Optimization Guide.
• The number: PJDOC-466751330-11050.
• If applicable, the page number(s) to which your comments refer.
• A concise explanation of your comments.
Arm also welcomes general suggestions for additions and improvements.

Arm tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the quality
of the represented document when used with any other PDF reader.

2 About this document


This document contains a guide to the Cortex-A77 core micro-architecture with a view to aiding
software optimization.

2.1 Scope
This document describes aspects of the Cortex-A77 core micro-architecture that influence
software performance. Micro-architectural detail is limited to that which is useful for software
optimization.

This documentation extends only to software visible behavior of the Cortex-A77 core and not to
the hardware rationale behind the behavior.

2.2 Product overview


The Cortex-A77 core is a high-performance and low-power Arm product that implements the
Armv8-A architecture with support for the Armv8.2-A extension, including the RAS extension, the
Load acquire (LDAPR) instructions introduced in the Armv8.3-A extension, and the dot product
instructions introduced in the Armv8.4-A extension.

The Cortex-A77 core has a Level 1 (L1) memory system and a private, integrated Level 2 (L2) cache.
It also includes a superscalar, variable-length, out-of-order pipeline.

The Cortex-A77 core is implemented inside the DynamIQ™ Shared Unit (DSU) cluster. For more
information, see the Arm® DynamIQ™ Shared Unit Technical Reference Manual.

2.2.1 Pipeline overview


The following figure describes the high-level Cortex-A77 instruction processing pipeline.
Instructions are first fetched and then decoded into internal Macro-OPerations (MOPs). From
there, the MOPs proceed through register renaming and dispatch stages. A MOP can be split into
two Micro-OPerations (µOPs) further down the pipeline after the decode stage. Once dispatched,
µOPs wait for their operands and issue out-of-order to one of twelve issue pipelines. Each issue
pipeline can accept one µOP per cycle.

Figure 1: Cortex-A77 core pipeline

[Figure: Fetch, then Decode, Rename, Dispatch (IN ORDER), followed by Issue to the execution pipelines (OUT OF ORDER): Branch 0, Branch 1, Integer Single-Cycle 0, Integer Single-Cycle 1, Integer Single/Multi-Cycle 0, Integer Single/Multi-Cycle 1, FP/ASIMD 0, FP/ASIMD 1, Load/Store 0, Load/Store 1, Store data 0, Store data 1]

The execution pipelines support different types of operations, as shown in the following table.

Table 2: Cortex-A77 core operations

Instruction groups    Instructions
Branch 0/1    Branch µOPs
Integer Single-Cycle 0/1    Integer ALU µOPs
Integer Single/Multi-cycle 0/1    Integer shift-ALU, multiply, divide, CRC and sum-of-absolute-differences µOPs
Load/Store 0/1    Load, Store address generation and special memory µOPs
Store data 0/1    Store data µOPs
FP/ASIMD 0    ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, FP sqrt, crypto µOPs, store data µOPs
FP/ASIMD 1    ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift µOPs, store data µOPs, crypto µOPs
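
The two FP/ASIMD pipelines in Table 2 are not symmetric, which matters when estimating how many µOPs of one kind can issue per cycle. The sketch below transcribes the two FP/ASIMD rows into a lookup. The names FP_ASIMD_PIPES and pipes_for are hypothetical, chosen only for this example.

```python
# µOP classes accepted by each FP/ASIMD pipeline, transcribed from Table 2.
# The dictionary and function names are illustrative, not an Arm interface.
FP_ASIMD_PIPES = {
    "FP/ASIMD 0": {
        "ASIMD ALU", "ASIMD misc", "ASIMD integer multiply", "FP convert",
        "FP misc", "FP add", "FP multiply", "FP divide", "FP sqrt",
        "crypto", "store data",
    },
    "FP/ASIMD 1": {
        "ASIMD ALU", "ASIMD misc", "FP misc", "FP add", "FP multiply",
        "ASIMD shift", "store data", "crypto",
    },
}


def pipes_for(uop_class):
    """Return the FP/ASIMD pipelines that list the given µOP class."""
    return sorted(
        name for name, classes in FP_ASIMD_PIPES.items() if uop_class in classes
    )


if __name__ == "__main__":
    for uop in ("FP add", "FP divide", "ASIMD shift"):
        print(uop, "->", pipes_for(uop))
```

Under this reading of Table 2, FP add µOPs can use either pipeline, while FP divide µOPs are limited to FP/ASIMD 0 and ASIMD shift µOPs to FP/ASIMD 1.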

3 Instruction characteristics
3.1 Instruction tables
This chapter describes high-level performance characteristics for most Armv8.2-A A32, T32, and
A64 instructions. A series of tables summarize the effective execution latency and throughput
(instruction bandwidth per cycle), pipelines utilized, and special behaviors associated with each
group of instructions. Utilized pipelines correspond to the execution pipelines described in
chapter 2.

In the tables below, Exec Latency is defined as the minimum latency seen by an operation
dependent on an instruction in the described group.

In the tables below, Execution Throughput is defined as the maximum throughput (in instructions
per cycle) of the specified instruction group that can be achieved in the entirety of the Cortex-A77
microarchitecture.
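
To show how the two figures interact, the following sketch contrasts a chain of dependent instructions, which is bound by Exec Latency, with a run of independent instructions, which is bound by Execution Throughput. It is a simplified model that ignores fetch, decode, and dispatch limits as well as competition for pipelines. The function names are assumptions for this example; the sample values are those listed for the "Arithmetic, basic" and "Arithmetic, extend and shift" groups in Table 6.

```python
import math
from fractions import Fraction


def dependent_chain_cycles(count, latency):
    """Each instruction consumes the previous result, so latencies add up."""
    return count * latency


def independent_cycles(count, throughput):
    """No dependencies: the per-cycle throughput is the limiting factor.

    Pass throughput as an int or a Fraction (for example Fraction(1, 2)).
    """
    return math.ceil(Fraction(count) / throughput)


if __name__ == "__main__":
    # "Arithmetic, basic": latency 1, throughput 4.
    print(dependent_chain_cycles(8, 1), independent_cycles(8, 4))
    # "Arithmetic, extend and shift": latency 2, throughput 2.
    print(dependent_chain_cycles(8, 2), independent_cycles(8, 2))
```

In this model, eight dependent basic ADDs need about 8 cycles, while eight independent ones need about 2.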

3.2 Legend for reading the utilized pipelines


Table 3: Cortex-A77 core pipeline names and symbols

Pipeline name    Symbol used in tables
Branch 0/1    B
Integer single-cycle 0/1    S
Integer single-cycle 0/1 and single/multicycle 0/1    I
Integer single/multicycle 0/1    M
Integer multicycle 0    M0
Load/Store 0/1    L
Store data 0/1    D
FP/ASIMD 0/1    V
FP/ASIMD 0    V0
FP/ASIMD 1    V1
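
The legend can be read as a mapping from each symbol to a set of physical pipelines. The sketch below encodes Table 3 that way. The names PIPELINES and pipeline_count are hypothetical, and treating "Integer multicycle 0" (M0) as the same pipeline as "Integer single/multicycle 0" is an assumption based on the pipeline list in chapter 2.

```python
# Symbol -> pipelines, following Table 3. Names are illustrative only.
PIPELINES = {
    "B": ("Branch 0", "Branch 1"),
    "S": ("Integer single-cycle 0", "Integer single-cycle 1"),
    "M": ("Integer single/multicycle 0", "Integer single/multicycle 1"),
    # Assumption: M0 denotes Integer single/multicycle 0.
    "M0": ("Integer single/multicycle 0",),
    "L": ("Load/Store 0", "Load/Store 1"),
    "D": ("Store data 0", "Store data 1"),
    "V": ("FP/ASIMD 0", "FP/ASIMD 1"),
    "V0": ("FP/ASIMD 0",),
    "V1": ("FP/ASIMD 1",),
}
# "I" covers both the single-cycle and the single/multicycle integer pipelines.
PIPELINES["I"] = PIPELINES["S"] + PIPELINES["M"]


def pipeline_count(symbol):
    """Number of pipelines that a symbol stands for."""
    return len(PIPELINES[symbol])


if __name__ == "__main__":
    for symbol in ("B", "I", "M", "M0", "V"):
        print(symbol, pipeline_count(symbol))
    distinct = {pipe for pipes in PIPELINES.values() for pipe in pipes}
    print("distinct pipelines:", len(distinct))
```

The pipeline count is only an upper bound on per-cycle throughput for a group: several groups that use I in the tables list a throughput of 3 rather than 4.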

3.3 Branch instructions


Table 4: AArch64 Branch instructions

Instruction Group    AArch64 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Branch, immed    B    1    2    B    -
Branch, register    BR, RET    1    2    B    -
Branch and link, immed    BL    1    2    B    -
Branch and link, register    BLR    1    2    B    -
Compare and branch    CBZ, CBNZ, TBZ, TBNZ    1    2    B    -

Table 5: AArch32 Branch instructions

Instruction Group    AArch32 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Branch, immed    B    1    2    B    -
Branch, register    BX    1    2    B    -
Branch and link, immed    BL, BLX    1    2    B    -
Branch and link, register    BLX    1    2    B    -
Compare and branch    CBZ, CBNZ    1    2    B    -

3.4 Arithmetic and logical instructions


Table 6: AArch64 Arithmetic and logical instructions

Instruction Group    AArch64 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Arithmetic, basic    ADD, ADC, SUB, SBC    1    4    I    -
Arithmetic, basic, flag set    ADDS, ADCS, SUBS, SBCS    1    3    I    -
Arithmetic, extend and shift    ADD{S}, SUB{S}    2    2    M    -
Arithmetic, LSL shift, shift <= 4    ADD, SUB    1    4    I    -
Arithmetic, flag set, LSL shift, shift <= 4    ADDS, SUBS    1    3    I    -
Arithmetic, LSR/ASR/ROR shift or LSL shift > 4    ADD{S}, SUB{S}    2    2    M    -
Conditional compare    CCMN, CCMP    1    3    I    -
Conditional select    CSEL, CSINC, CSINV, CSNEG    1    3    I    -
Logical, basic    AND{S}, BIC{S}, EON, EOR, ORN, ORR    1    3    I    -
Logical, shift, no flagset    AND, BIC, EON, EOR, ORN, ORR    1    4    I    -
Logical, shift, flagset    ANDS, BICS    2    2    M    -

Table 7: AArch32 Arithmetic and logical instructions

Instruction Group    AArch32 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
ALU, basic, no flagset    ADD, ADC, ADR, AND, BIC, EOR, ORN, ORR, RSB, RSC, SUB, SBC    1    4    I    -
ALU, basic, flagset    ADDS, ADCS, ANDS, BICS, CMN, CMP, EORS, ORNS, ORRS, RSBS, RSCS, SUBS, SBCS, TEQ, TST    1    3    I    -
ALU, basic, shift by register, conditional    (same as ALU basic, flagset and no flagset)    2    1    I, M0    -
ALU, basic, shift by register, unconditional, flagset    (same as ALU, basic, flagset)    2    1    M0    -
Arithmetic, shift by register, unconditional, no flagset    ADD, ADC, RSB, RSC, SUB, SBC    2    1    M0    -
Logical, shift by register, unconditional, no flagset    AND, BIC, EOR, ORN, ORR    1    1    M0    -
Arithmetic, LSL shift by immed, shift <= 4, unconditional, no flagset    ADD, ADC, RSB, RSC, SUB, SBC    1    4    I    -
Arithmetic, LSL shift by immed, shift <= 4, unconditional, flagset    ADDS, ADCS, RSBS, RSCS, SUBS, SBCS    1    3    I    -
Arithmetic, LSL shift by immed, shift <= 4, conditional    ADD{S}, ADC{S}, RSB{S}, RSC{S}, SUB{S}, SBC{S}    1    1    M0    -
Arithmetic, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, unconditional    ADD{S}, ADC{S}, RSB{S}, RSC{S}, SUB{S}, SBC{S}    2    2    M    -
Arithmetic, LSR/ASR/ROR shift by immed or LSL shift by immed > 4, conditional    ADD{S}, ADC{S}, RSB{S}, RSC{S}, SUB{S}, SBC{S}    2    1    M0    -
Logical, shift by immed, no flagset, unconditional    AND, BIC, EOR, ORN, ORR    1    4    I    -
Logical, shift by immed, no flagset, conditional    AND, BIC, EOR, ORN, ORR    1    1    M0    -
Logical, shift by immed, flagset, unconditional    ANDS, BICS, EORS, ORNS, ORRS    2    2    M    -
Logical, shift by immed, flagset, conditional    ANDS, BICS, EORS, ORNS, ORRS    2    1    M0    -
Test/Compare, shift by immed    CMN, CMP, TEQ, TST    2    2    M    -
Branch forms    -    +1    2    +B    1

1. Branch forms are possible when the instruction destination register is the PC. For those cases, an additional branch µOP is required. This adds 1 cycle to the latency.

3.5 Move and shift instructions


Table 8: AArch32 Move and shift instructions

Instruction Group    AArch32 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Move, basic    MOV, MOVW, MVN    1    4    I    -
Move, basic, flagset    MOVS, MVNS    1    3    I    -
Move, shift by immed, no flagset    ASR, LSL, LSR, ROR, RRX, MVN    1    4    I    -
Move, shift by immed, flagset    ASRS, LSLS, LSRS, RORS, RRXS, MVNS    2    2    M    -
Move, shift by register, no flagset, unconditional    ASR, LSL, LSR, ROR, RRX, MVN    1    4    I    -
Move, shift by register, no flagset, conditional    ASR, LSL, LSR, ROR, RRX, MVN    2    2    I    -
Move, shift by register, flagset    ASRS, LSLS, LSRS, RORS, RRXS, MVNS    2    1    M0    -
Move, top    MOVT    1    4    I    -
Move, branch forms    -    +1    2    +B    -

3.6 Divide and multiply instructions


Table 9: AArch64 Divide and multiply instructions

Instruction Group    AArch64 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Divide, W-form    SDIV, UDIV    5 to 12    1/12 to 1/5    M0    1
Divide, X-form    SDIV, UDIV    5 to 20    1/20 to 1/5    M0    1
Multiply accumulate, W-form    MADD, MSUB    2(1)    1    M0    2
Multiply accumulate, X-form    MADD, MSUB    2(1)    1    M0    2
Multiply accumulate long    SMADDL, SMSUBL, UMADDL, UMSUBL    2(1)    1    M0    2
Multiply high    SMULH, UMULH    3    1    M0    2

Table 10: AArch32 Divide and multiply instructions

Instruction Group    AArch32 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Divide    SDIV, UDIV    5 to 12    1/12 to 1/5    M0    1
Multiply    MUL, SMULBB, SMULBT, SMULTB, SMULTT, SMULWB, SMULWT, SMMUL{R}, SMUAD{X}, SMUSD{X}    2    1    M0    -
Multiply accumulate, conditional    MLA, MLS, SMLABB, SMLABT, SMLATB, SMLATT, SMLAWB, SMLAWT, SMLAD{X}, SMLSD{X}, SMMLA{R}, SMMLS{R}    3    1    M0, I    -
Multiply accumulate, unconditional    MLA, MLS, SMLABB, SMLABT, SMLATB, SMLATT, SMLAWB, SMLAWT, SMLAD{X}, SMLSD{X}, SMMLA{R}, SMMLS{R}    2(1)    1    M0    2
Multiply accumulate accumulate long, conditional    UMAAL    4    1    I, M0    -
Multiply accumulate accumulate long, unconditional    UMAAL    3    1    I, M0    -
Multiply accumulate long, no flagset    SMLAL, SMLALBB, SMLALBT, SMLALTB, SMLALTT, SMLALD{X}, SMLSLD{X}, UMLAL    2    1    M0, I    -
Multiply accumulate long, flagset    SMLAL, SMLALBB, SMLALBT, SMLALTB, SMLALTT, SMLALD{X}, SMLSLD{X}, UMLAL    2    1    M0, I, M    -
Multiply long, unconditional, no flagset    SMULL, UMULL    2    1    M0    -
Multiply long, unconditional, flagset    SMULLS, UMULLS    3    1    M0, I    -
Multiply long, conditional    SMULL{S}, UMULL{S}    3    1    M0, I    -

1. Integer divides are performed using an iterative algorithm and block any subsequent divide
operations until complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar
µOPs, allowing a typical sequence of multiply-accumulate µOPs to issue one every N cycles
(accumulate latency N shown in parentheses). Accumulator forwarding is not supported for
consumers of 64 bit multiply high operations.
3. Multiplies that set the condition flags require an additional integer µOP.
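
Notes 1 and 2 can be turned into rough cycle estimates. The sketch below is a simplified model under stated assumptions: the multiply-accumulate chain depends only through the accumulator operand, and divides run back to back with nothing else contending for M0. The function names are hypothetical, and the default values come from the W-form rows of Table 9 (multiply-accumulate latency 2(1), divide throughput 1/12 to 1/5).

```python
from fractions import Fraction


def mac_chain_cycles(count, latency=2, accumulate_latency=1):
    """Cycles until the last result of a chain of multiply-accumulates.

    With late forwarding of the accumulate operand, each following µOP can
    issue accumulate_latency cycles after the previous one; the final result
    is available after the full latency of the last µOP.
    """
    if count == 0:
        return 0
    return (count - 1) * accumulate_latency + latency


def divide_cycle_range(count, best=Fraction(1, 5), worst=Fraction(1, 12)):
    """(minimum, maximum) cycles for count blocking integer divides."""
    return int(count / best), int(count / worst)


if __name__ == "__main__":
    # Four MADDs accumulating into the same register.
    print("with forwarding:", mac_chain_cycles(4))
    print("without forwarding:", mac_chain_cycles(4, accumulate_latency=2))
    # Four back-to-back W-form divides.
    print("divide range:", divide_cycle_range(4))
```

In this model, four chained MADDs finish in about 5 cycles instead of 8, and four W-form divides take between 20 and 48 cycles depending on the data values.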

3.7 Saturating and parallel arithmetic instructions


Table 11: AArch32 Saturating and parallel arithmetic instructions

Instruction Group    AArch32 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Parallel arith, unconditional    SADD16, SADD8, SSUB16, SSUB8, UADD16, UADD8, USUB16, USUB8    2    1    M    -
Parallel arith, conditional    SADD16, SADD8, SSUB16, SSUB8, UADD16, UADD8, USUB16, USUB8    2(4)    1    M0, I    1
Parallel arith with exchange, unconditional    SASX, SSAX, UASX, USAX    3    2    I, M    -
Parallel arith with exchange, conditional    SASX, SSAX, UASX, USAX    3(5)    1    I, M0    1
Parallel halving arith, unconditional    SHADD16, SHADD8, SHSUB16, SHSUB8, UHADD16, UHADD8, UHSUB16, UHSUB8    2    2    M    -
Parallel halving arith, conditional    SHADD16, SHADD8, SHSUB16, SHSUB8, UHADD16, UHADD8, UHSUB16, UHSUB8    2    1    M0    -
Parallel halving arith with exchange    SHASX, SHSAX, UHASX, UHSAX    3    1    I, M0    -
Parallel saturating arith, unconditional    QADD16, QADD8, QSUB16, QSUB8, UQADD16, UQADD8, UQSUB16, UQSUB8    2    2    M    -
Parallel saturating arith, conditional    QADD16, QADD8, QSUB16, QSUB8, UQADD16, UQADD8, UQSUB16, UQSUB8    2    1    M0    -
Parallel saturating arith with exchange, unconditional    QASX, QSAX, UQASX, UQSAX    3    2    I, M    -
Parallel saturating arith with exchange, conditional    QASX, QSAX, UQASX, UQSAX    3(5)    1    I, M0    -
Saturate, unconditional    SSAT, SSAT16, USAT, USAT16    2    2    M    -
Saturate, conditional    SSAT, SSAT16, USAT, USAT16    2    1    M0    -
Saturating arith, unconditional    QADD, QSUB    2    2    M    -
Saturating arith, conditional    QADD, QSUB    2    1    M0    -
Saturating doubling arith, unconditional    QDADD, QDSUB    4    1    M, M    -
Saturating doubling arith, conditional    QDADD, QDSUB    4    1    M, M0    -

Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µOP is required. This adds 1 cycle to the latency.

3.8 Miscellaneous data-processing instructions


Table 12: AArch64 Miscellaneous data-processing instructions

Instruction Group    AArch64 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Address generation    ADR, ADRP    1    4    I    -
Bitfield extract, one reg    EXTR    1    4    I    -
Bitfield extract, two regs    EXTR    3    2    I, M    -
Bitfield move, basic    SBFM, UBFM    1    4    I    -
Bitfield move, insert    BFM    2    2    M    -
Count leading    CLS, CLZ    1    4    I    -
Move immed    MOVN, MOVK, MOVZ    1    4    I    -
Reverse bits/bytes    RBIT, REV, REV16, REV32    1    4    I    -
Variable shift    ASRV, LSLV, LSRV, RORV    1    4    I    -

Table 13: AArch32 Miscellaneous data-processing instructions

Instruction Group    AArch32 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Bit field extract    SBFX, UBFX    1    4    I    -
Bit field insert/clear, unconditional    BFI, BFC    2    2    M    -
Bit field insert/clear, conditional    BFI, BFC    2    1    M0    -
Count leading zeros    CLZ    1    4    I    -
Pack halfword, unconditional    PKH    2    2    M    -
Pack halfword, conditional    PKH    2    1    M0    -
Reverse bits/bytes    RBIT, REV, REV16, REVSH    1    4    I    -
Select bytes, unconditional    SEL    1    4    I    -
Select bytes, conditional    SEL    2    2    I    -
Sign/zero extend, normal    SXTB, SXTH, UXTB, UXTH    1    4    I    -
Sign/zero extend, parallel, unconditional    SXTB16, UXTB16    2    2    M    -
Sign/zero extend, parallel, conditional    SXTB16, UXTB16    2    1    M0    -
Sign/zero extend and add, normal, unconditional    SXTAB, SXTAH, UXTAB, UXTAH    2    2    M    -
Sign/zero extend and add, normal, conditional    SXTAB, SXTAH, UXTAB, UXTAH    2    1    M0    -
Sign/zero extend and add, parallel, unconditional    SXTAB16, UXTAB16    4    1/2    M    -
Sign/zero extend and add, parallel, conditional    SXTAB16, UXTAB16    4    1/2    M, M0    -
Sum of absolute differences, unconditional    USAD8, USADA8    2    1    M0    -
Sum of absolute differences, conditional    USAD8, USADA8    2    1    M0, I    -

3.9 Load instructions


Table 14: AArch64 Load instructions

Instruction Group    AArch64 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Load register, literal    LDR, LDRSW, PRFM    4    2    L    -
Load register, unscaled immed    LDUR, LDURB, LDURH, LDURSB, LDURSH, LDURSW, PRFUM    4    2    L    -
Load register, immed post-index    LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW    4    2    L, I    -
Load register, immed pre-index    LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW    4    2    L, I    -
Load register, immed unprivileged    LDTR, LDTRB, LDTRH, LDTRSB, LDTRSH, LDTRSW    4    2    L    -
Load register, unsigned immed    LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM    4    2    L    -
Load register, register offset, basic    LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM    4    2    L    -
Load register, register offset, scale by 4/8    LDR, LDRSW, PRFM    4    2    L    -
Load register, register offset, scale by 2    LDRH, LDRSH    5    2    I, L    -
Load register, register offset, extend    LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM    4    2    L    -
Load register, register offset, extend, scale by 4/8    LDR, LDRSW, PRFM    4    2    L    -
Load register, register offset, extend, scale by 2    LDRH, LDRSH    5    2    I, L    -
Load pair, signed immed offset, normal, W-form    LDP, LDNP    4    2    L    -
Load pair, signed immed offset, normal, X-form    LDP, LDNP    4    1    L    -
Load pair, signed immed offset, signed words, base != SP    LDPSW    5    1    I, L    -
Load pair, signed immed offset, signed words, base = SP    LDPSW    5    1    I, L    -
Load pair, immed post-index, normal    LDP    4    1    L, I    -
Load pair, immed post-index, signed words    LDPSW    5    1    I, L    -
Load pair, immed pre-index, normal    LDP    4    1    L, I    -
Load pair, immed pre-index, signed words    LDPSW    5    1    I, L    -

Table 15: AArch32 Load instructions

Instruction Group AArch32 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

Load, immed offset LDR{T}, 4 2 L 1,2


LDRB{T},
LDRD,
LDRH{T},
LDRSB{T},
LDRSH{T}
Load, register offset, plus LDR, LDRB, 4 2 L 1.2
LDRD, LDRH,
LDRSB,
LDRSH

Load, register offset, LDR, LDRB, 5 2 I, L 1,2
minus LDRD, LDRH,
LDRSB,
LDRSH
Load, scaled register LDR, LDRB 4 2 L 1
offset, plus, LSL2
Load, scaled register LDR, LDRB, 5 2 I, L 1
offset, other LDRH, LDRSB,
LDRSH
Load, immed pre-indexed LDR, LDRB, 4 2 L, I 1,2
LDRD, LDRH,
LDRSB,
LDRSH
Load, register pre- LDR, LDRB, 5 2 I, L, M 3
indexed, shift Rm, plus LDRH, LDRSB,
and minus LDRSH
Load, register pre- LDRD 4 2 L, I -
indexed
Load, register pre- LDRD 5 1 1/2 L, I -
indexed, cond
Load, scaled register pre- LDR, LDRB 4 2 L, I 1
indexed, plus, LSL2
Load, scaled register pre- LDR, LDRB 4 2 L, I -
indexed, unshifted
Load, immed post- LDR{T}, 4 2 L, I 1,2
indexed LDRB{T},
LDRD,
LDRH{T},
LDRSB{T},
LDRSH{T}
Load, register post- LDR, LDRB, 5 2 I, L -
indexed LDRH{T},
LDRSB{T},
LDRSH{T}
Load, register post- LDRD 4 2 L, I -
indexed
Load, register post- LDRT, LDRBT 5 2 I, L -
indexed

Load, scaled register LDR, LDRB 4 2 L, M 3
post-indexed
Load, scaled register LDRT, LDRBT 4 2 L, M 3
post-indexed
Preload, immed offset PLD, PLDW 4 2 L -
Preload, register offset, PLD, PLDW 4 2 L -
plus
Preload, register offset, PLD, PLDW 5 2 I, L -
minus
Preload, scaled register PLD, PLDW 5 2 I, L -
offset, plus LSL2
Preload, scaled register PLD, PLDW 5 2 I, L -
offset, other
Load multiple, no LDMIA, N 2/R L 1, 4,
writeback, base reg not in LDMIB, 5
list LDMDA,
LDMDB
Load multiple, no LDMIA, 1+N 2/R I, L 1, 4,
writeback, base reg in list LDMIB, 5
LDMDA,
LDMDB
Load multiple, writeback LDMIA, 1+N 2/R L, I 1, 4,
LDMIB, 5
LDMDA,
LDMDB, POP
(Load, all branch forms) - +1 - +B 6

1. Conditional loads have an extra µOP which goes down pipeline I and have 1 cycle extra latency
compared to their unconditional counterparts.
2. The throughput of conditional LDRD is 1 as compared to a throughput of 2 for unconditional
LDRD.
3. The address update op for addressing forms which use reg scaled reg, or reg extend goes
down pipeline ‘I’ if the shift is LSL where the shift value is less than or equal to 4.
4. N is floor[(num_reg+3)/4].
5. R is floor[(num_reg+1)/2].
6. Branch forms are possible when the instruction destination register is the PC. For those
cases, an additional branch µOP is required. This adds 1 cycle to the latency.
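
As a worked illustration of notes 4 and 5, the sketch below evaluates N and R for a few register-list sizes. It is only a reading aid: the function name `ldm_cost` is hypothetical, and it takes the Table 15 entries for the "no writeback, base reg not in list" form (latency N, throughput 2/R) at face value.

```python
from fractions import Fraction


def ldm_cost(num_reg: int) -> tuple[int, Fraction]:
    """Return (latency, throughput) as listed in Table 15 for an AArch32 LDM
    without writeback whose base register is not in the register list.

    Hypothetical helper for illustration; it only evaluates notes 4 and 5.
    """
    if num_reg < 1:
        raise ValueError("register list must contain at least one register")
    n = (num_reg + 3) // 4  # note 4: N = floor((num_reg + 3) / 4)
    r = (num_reg + 1) // 2  # note 5: R = floor((num_reg + 1) / 2)
    return n, Fraction(2, r)  # table entry: latency N, throughput 2/R


for count in (1, 4, 8, 16):
    latency, throughput = ldm_cost(count)
    print(f"{count:2d} regs: latency {latency}, throughput {throughput} per cycle")
```

For the forms with the base register in the list or with writeback, the table lists a latency of 1+N instead of N; the throughput expression is unchanged.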


3.10 Store instructions


The following tables describe performance characteristics for standard store instructions. Store
µOPs are split into address and data µOPs. Once executed, stores are buffered and committed in
the background.

Table 16: AArch64 Store instructions

Instruction Group AArch64 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines
Store register, unscaled STUR, STURB, 1 2 L, D -
immed STURH
Store register, immed STR, STRB, 1 2 L, D -
post-index STRH
Store register, immed STR, STRB, 1 2 L, D -
pre-index STRH
Store register, immed STTR, STTRB, 1 2 L, D -
unprivileged STTRH
Store register, unsigned STR, STRB, 1 2 L, D -
immed STRH
Store register, register STR, STRB, 1 2 L, D -
offset, basic STRH
Store register, register STR 1 2 L, D -
offset, scaled by 4/8
Store register, register STRH 2 3/2 I, L, D -
offset, scaled by 2
Store register, register STR, STRB, 1 2 L, D -
offset, extend STRH
Store register, register STR 1 2 L, D -
offset, extend, scale by
4/8
Store register, register STRH 2 3/2 I, L, D -
offset, extend, scale by 1
Store pair, immed offset, STP, STNP 1 2 L, D -
W-form
Store pair, immed offset, STP, STNP 1 1 L, D -
X-form
Store pair, immed post- STP 1 1 L, D -
index, W-form
Store pair, immed post- STP 1 1 L, D -
index, X-form

Store pair, immed pre- STP 1 1 L, D -
index, W-form
Store pair, immed pre- STP 1 1 L, D -
index, X-form

Table 17: AArch32 Store instructions

Instruction Group AArch32 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

Store, immed offset STR{T}, 1 2 L, D -
STRB{T},
STRD,
STRH{T}
Store, register offset, plus STR, STRB, 1 2 L, D -
STRD, STRH
Store, register offset, STR, STRB, 1 2 L, D -
minus STRD, STRH
Store, register offset, no STR, STRB 1 2 L, D -
shift, plus
Store, scaled register STR, STRB 1 2 L, D -
offset, plus LSL2
Store, scaled register STR, STRB 2 3/2 I, L, D -
offset, other
Store, scaled register STR, STRB 2 3/2 I, L, D -
offset, minus
Store, immed pre- STR, STRB, 1 3/2 I, L, D -
indexed STRD, STRH
Store, register pre- STR, STRB, 1 3/2 L, D -
indexed, plus, no shift STRD, STRH
Store, register pre- STR, STRB, 2 1 I, L, D -
indexed, minus STRD, STRH
Store, scaled register pre- STR, STRB 1 3/2 L, D -
indexed, plus LSL2
Store, scaled register pre- STR, STRB 2 1 I, L, D, M 1
indexed, other

Store, immed post- STR{T}, 1 3/2 L, D -
indexed STRB{T},
STRD,
STRH{T}
Store, register post- STRH{T}, 1 3/2 L, D -
indexed STRD
Store, register post- STR{T}, 1 3/2 L, D -
indexed STRB{T}
Store, scaled register STR{T}, 1 3/2 L, D -
post-indexed STRB{T}
Store multiple, no STMIA, N 1/N L, D 2
writeback STMIB,
STMDA,
STMDB
Store multiple, writeback STMIA, N 1/N L, D 2
STMIB,
STMDA,
STMDB,
PUSH

1. The address update op for addressing forms which use reg scaled reg, or reg extend goes
down pipeline ‘I’ if the shift is LSL where the shift value is less than or equal to 4.
2. For store multiple instructions, N=floor((num_regs+3)/4).

3.11 FP data processing instructions


Table 18: AArch64 FP data processing instructions

Instruction Group AArch64 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines
FP absolute value FABS 2 2 V -
FP arithmetic FADD, FSUB 2 2 V -
FP compare FCCMP{E}, 2 1 V0 -
FCMP{E}
FP divide, H-form FDIV 7 4/7 V0 1
FP divide, S-form FDIV 7 to 10 4/9 to 4/7 V0 1
FP divide, D-form FDIV 7 to 15 1/7 to 2/7 V0 1

FP min/max FMIN, 2 2 V -
FMINNM,
FMAX,
FMAXNM
FP multiply FMUL, 3 2 V 2
FNMUL
FP multiply accumulate FMADD, 4 (2) 2 V 3
FMSUB,
FNMADD,
FNMSUB
FP negate FNEG 2 2 V -
FP round to integral FRINTA, 3 1 V -
FRINTI,
FRINTM,
FRINTN,
FRINTP,
FRINTX,
FRINTZ
FP select FCSEL 2 2 V -
FP square root, H-form FSQRT 7 4/7 V0 1
FP square root, S-form FSQRT 7 to 10 4/9 to 4/7 V0 1
FP square root, D-form FSQRT 7 to 17 1/8 to 2/7 V0 1

Table 19: AArch32 FP data processing instructions

Instruction Group AArch32 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

VFP absolute value VABS 2 2 V -
VFP arith VADD, VSUB 2 2 V -
VFP compare, VCMP, 2 1 V0 -
unconditional VCMPE
VFP compare, conditional VCMP, 4 1 V, V0 -
VCMPE

VFP convert VCVT{R}, 3 1 V0 -
VCVTB,
VCVTT,
VCVTA,
VCVTM,
VCVTN,
VCVTP
VFP divide, H-form VDIV 7 4/7 V0 1
VFP divide, S-form VDIV 7 to 10 4/9 to 4/7 V0 1
VFP divide, D-form VDIV 7 to 15 1/7 to 2/7 V0 1
VFP max/min VMAXNM, 2 2 V -
VMINNM
VFP multiply VMUL, 3 2 V 2
VNMUL
VFP multiply accumulate VMLA, VMLS, 5 (2) 2 V 3
(chained) VNMLA,
VNMLS
VFP multiply accumulate VFMA, VFMS, 4 (2) 2 V 3
(fused) VFNMA,
VFNMS
VFP negate VNEG 2 2 V -
VFP round to integral VRINTA, 3 1 V0 -
VRINTM,
VRINTN,
VRINTP,
VRINTR,
VRINTX,
VRINTZ
VFP select VSELEQ, 2 2 V -
VSELGE,
VSELGT,
VSELVS
VFP square root, H-form VSQRT 7 4/7 V0 1
VFP square root, S-form VSQRT 7 to 10 4/9 to 4/7 V0 1
VFP square root, D-form VSQRT 7 to 17 1/8 to 2/7 V0 1

1. FP divide and square root operations are performed using an iterative algorithm and block
subsequent similar operations to the same pipeline until complete.


2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply
µOPs to the accumulate operands of an FP multiply-accumulate µOP. The latter can
potentially be issued 1 cycle after the FP multiply µOP has been issued.
3. FP multiply-accumulate pipelines support late-forwarding of accumulate operands from
similar µOPs, allowing a typical sequence of multiply-accumulate µOPs to issue one every N
cycles (accumulate latency N shown in parentheses).
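
The effect described in note 3 can be approximated with a simple latency model. The sketch below is a simplification under stated assumptions, not a description of the core's scheduler: it assumes a chain of FMADD µOPs in which each µOP depends on the previous one only through the accumulate operand, the multiply operands are available early, and nothing else competes for the V pipelines. The function name and parameters are hypothetical; the default values 4 and 2 are the "4 (2)" entry from Table 18.

```python
def fmadd_chain_cycles(count: int, exec_latency: int = 4, acc_latency: int = 2) -> int:
    """Approximate cycles from issue of the first FMADD in a dependent
    accumulate chain until the last result is available.

    Assumed model: each later µOP issues acc_latency cycles after its
    predecessor (late forwarding of the accumulate operand), and the final
    µOP still needs the full exec_latency to produce its result.
    """
    if count < 1:
        raise ValueError("chain must contain at least one µOP")
    return (count - 1) * acc_latency + exec_latency


# Chain of 8 FMADDs with late forwarding (accumulate latency 2).
print(fmadd_chain_cycles(8))
# Same chain if every µOP had to wait the full 4-cycle latency instead.
print(fmadd_chain_cycles(8, acc_latency=4))
```

Under this model a single dependent chain sustains one FMADD every 2 cycles, while the table lists a throughput of 2 per cycle, so about four independent accumulators would be needed to approach the listed throughput.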

3.12 FP miscellaneous instructions


Table 20: AArch64 FP miscellaneous instructions

Instruction Group AArch64 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines
FP convert, from vec to FCVT, 3 1 V0 -
vec reg FCVTXN
FP convert, from gen to SCVTF, 6 1 M0, V0 -
vec reg UCVTF
FP convert, from vec to FCVTAS, 4 1 V0, V1 -
gen reg FCVTAU,
FCVTMS,
FCVTMU,
FCVTNS,
FCVTNU,
FCVTPS,
FCVTPU,
FCVTZS,
FCVTZU
FP move, immed FMOV 2 2 V -
FP move, register FMOV 2 2 V -
FP transfer, from gen to FMOV 3 1 M0 -
vec reg
FP transfer, from vec to FMOV 2 1 V1 -
gen reg

Table 21: AArch32 FP miscellaneous instructions

Instruction Group AArch32 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

VFP move, immed VMOV 2 2 V -
VFP move, register VMOV 2 2 V -
VFP move, insert VINS 2 2 V -

VFP move, extraction VMOVX 2 2 V -
VFP transfer, core to vfp, VMOV 5 1 M0, V -
single reg to S-reg, cond
VFP transfer, core to vfp, VMOV 3 1 M0 -
single reg to S-reg,
uncond
VFP transfer, core to vfp, VMOV 5 1 M0, V -
single reg to upper/lower
half of D-reg
VFP transfer, core to vfp, VMOV 6 1/2 M0, V -
2 regs to 2 S-regs, cond
VFP transfer, core to vfp, VMOV 4 1/2 M0 -
2 regs to 2 S-regs, uncond
VFP transfer, core to vfp, VMOV 5 1 M0, V -
2 regs to D-reg, cond
VFP transfer, core to vfp, VMOV 3 1 M0 -
2 regs to D-reg, uncond
VFP transfer, vfp S-reg or VMOV 3 1 V1, I -
upper/lower half of vfp
D-reg to core reg, cond
VFP transfer, vfp S-reg or VMOV 2 1 V1 -
upper/lower half of vfp
D-reg to core reg, uncond
VFP transfer, vfp 2 S-regs VMOV 3 1 V1, I -
or D-reg to 2 core regs,
cond
VFP transfer, vfp 2 S-regs VMOV 2 1 V1 -
or D-reg to 2 core regs,
uncond

3.13 FP load instructions


The latencies shown assume the memory access hits in the Level 1 Data Cache. Compared to
standard loads, an extra cycle is required to forward results to FP/ASIMD pipelines.

Table 22: AArch64 FP load instructions

Instruction Group AArch64 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

Load vector reg, literal, LDR - 2 L -
S/D/Q forms
Load vector reg, unscaled LDUR 5 2 L -
immed
Load vector reg, immed LDR 5 2 L, I -
post-index
Load vector reg, immed LDR 5 2 L, I -
pre-index
Load vector reg, unsigned LDR 5 2 L, I -
immed
Load vector reg, register LDR 5 2 L, I -
offset, basic
Load vector reg, register LDR 5 2 L, I -
offset, scale, S/D-form
Load vector reg, register LDR 6 2 I, L -
offset, scale, H/Q-form
Load vector reg, register LDR 5 2 L, I -
offset, extend
Load vector reg, register LDR 5 2 L, I -
offset, extend, scale, S/D-
form
Load vector reg, register LDR 6 2 I, L -
offset, extend, scale,
H/Q-form
Load vector pair, immed LDP, LDNP 5 1 L, I -
offset, S/D-form
Load vector pair, immed LDP, LDNP 7 1 L -
offset, Q-form
Load vector pair, immed LDP 5 1 I, L -
post-index, S/D-form
Load vector pair, immed LDP 7 1 L, I -
post-index, Q-form
Load vector pair, immed LDP 5 1 I, L -
pre-index, S/D-form
Load vector pair, immed LDP 7 1 L, I -
pre-index, Q-form


Table 23: AArch32 FP load instructions

Instruction Group AArch32 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

FP load, register VLDR 4 2 L 1
FP load multiple, S form VLDMIA, N 2/R L 1, 2,
VLDMDB, 3
VPOP
FP load multiple, D form VLDMIA, N+2 1/R L, V 1, 2,
VLDMDB, 3
VPOP
(FP load, writeback - (1) - +I 4
forms)

1. Conditional loads have an extra µOP which goes down pipeline V and have 2 cycles of extra
latency compared to their unconditional counterparts.
2. N is floor[ (num_reg+3)/4 ].
3. R is floor[(num_reg+1)/2].
4. Writeback forms of load instructions require an extra µOP to update the base address. This
update is typically performed in parallel with or prior to the load µOP (update latency shown
in parentheses).

3.14 FP store instructions


Store MOPs are split into store address and store data µOPs. Once executed, stores are buffered
and committed in the background.

Table 24: AArch64 FP store instructions

Instruction Group AArch64 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines
Store vector reg, STUR 2 2 L, V -
unscaled immed,
B/H/S/D-form
Store vector reg, STUR 2 1 L, V -
unscaled immed, Q-form
Store vector reg, immed STR 2 2 L, V -
post-index, B/H/S/D-form
Store vector reg, immed STR 2 1 L, V -
post-index, Q-form
Store vector reg, immed STR 2 2 L, V -
pre-index, B/H/S/D-form
Store vector reg, immed STR 2 1 L, V -
pre-index, Q-form
Store vector reg, STR 2 2 L, V -
unsigned immed,
B/H/S/D-form
Store vector reg, STR 2 1 L, V -
unsigned immed, Q-form
Store vector reg, register STR 2 2 L, V -
offset, basic, B/H/S/D-
form
Store vector reg, register STR 2 1 L, V -
offset, basic, Q-form
Store vector reg, register STR 2 2 I, L, V -
offset, scale, H-form
Store vector reg, register STR 2 2 L, V -
offset, scale, S/D-form
Store vector reg, register STR 2 1 I, L, V -
offset, scale, Q-form
Store vector reg, register STR 2 2 L, V -
offset, extend, B/H/S/D-
form
Store vector reg, register STR 2 1 L, V -
offset, extend, Q-form
Store vector reg, register STR 2 2 I, L, V -
offset, extend, scale, H-
form
Store vector reg, register STR 2 2 L, V -
offset, extend, scale, S/D-
form
Store vector reg, register STR 2 1 I, L, V -
offset, extend, scale, Q-
form
Store vector pair, immed STP, STNP 2 2 L, V -
offset, S-form
Store vector pair, immed STP, STNP 2 1 L, V -
offset, D-form
Store vector pair, immed STP, STNP 3 1/2 L, V -
offset, Q-form

Store vector pair, immed STP 2 1 L, V -
post-index, S-form
Store vector pair, immed STP 2 1 L, V -
post-index, D-form
Store vector pair, immed STP 3 1 L, V -
post-index, Q-form
Store vector pair, immed STP 2 1 L, V -
pre-index, S-form
Store vector pair, immed STP 2 1 L, V -
pre-index, D-form
Store vector pair, immed STP 3 1/2 L, V -
pre-index, Q-form

Table 25: AArch32 FP store instructions

Instruction Group AArch32 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

FP store, immed offset VSTR 2 2 L, V -
FP store multiple, S-form VSTMIA, N+1 2/R L, V 1, 3
VSTMDB,
VPUSH
FP store multiple, D-form VSTMIA, P+1 1/R L, V 2, 3
VSTMDB,
VPUSH
(FP store, writeback - (1) - +I 4
forms)

1. For store multiple instructions, N=floor((num_regs+3)/4).
2. For store multiple instructions, P=floor((num_regs+1)/2).
3. R=floor((num_regs+1)/2).
4. Writeback forms of store instructions require an extra µOP to update the base address.
This update is typically performed in parallel with or prior to the store µOP (update latency
shown in parentheses).
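
The formulas in notes 1 to 3 can be evaluated as follows. This is a minimal sketch that takes the Table 25 entries at face value (S-form: latency N+1, throughput 2/R; D-form: latency P+1, throughput 1/R); the function name `vstm_cost` and its `d_form` parameter are hypothetical, and the extra writeback µOP from note 4 is not modelled.

```python
from fractions import Fraction


def vstm_cost(num_regs: int, d_form: bool) -> tuple[int, Fraction]:
    """Return (latency, throughput) as listed in Table 25 for an AArch32
    FP store multiple of num_regs registers.

    Hypothetical helper for illustration; writeback forms are not modelled.
    """
    if num_regs < 1:
        raise ValueError("register list must contain at least one register")
    r = (num_regs + 1) // 2  # note 3: R = floor((num_regs + 1) / 2)
    if d_form:
        p = (num_regs + 1) // 2  # note 2: P = floor((num_regs + 1) / 2)
        return p + 1, Fraction(1, r)  # D-form: latency P+1, throughput 1/R
    n = (num_regs + 3) // 4  # note 1: N = floor((num_regs + 3) / 4)
    return n + 1, Fraction(2, r)  # S-form: latency N+1, throughput 2/R


print(vstm_cost(8, d_form=False))  # eight S registers
print(vstm_cost(8, d_form=True))   # eight D registers
```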


3.15 ASIMD integer instructions


Table 26: AArch64 ASIMD integer instructions

Instruction Group AArch64 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines
ASIMD absolute diff SABD, UABD 2 2 V -
ASIMD absolute diff SABA, UABA 4(1) 1 V1 2
accum
ASIMD absolute diff SABAL(2), 4(1) 1 V1 2
accum long UABAL(2)
ASIMD absolute diff long SABDL(2), 2 2 V -
UABDL(2)
ASIMD arith, basic ABS, ADD, 2 2 V -
NEG,
SADDL(2),
SADDW(2),
SHADD,
SHSUB,
SSUBL(2),
SSUBW(2),
SUB,
UADDL(2),
UADDW(2),
UHADD,
UHSUB,
USUBL(2),
USUBW(2)
ASIMD arith, complex ADDHN(2), 2 2 V -
RADDHN(2),
RSUBHN(2),
SQABS,
SQADD,
SQNEG,
SQSUB,
SRHADD,
SUBHN(2),
SUQADD,
UQADD,
UQSUB,
URHADD,
USQADD
ASIMD arith, pair-wise ADDP, 2 2 V -
SADDLP,
UADDLP
ASIMD arith, reduce, ADDV, 3 1 V1 -
4H/4S SADDLV,
UADDLV
ASIMD arith, reduce, ADDV, 5 1 V1, V -
8B/8H SADDLV,
UADDLV
ASIMD arith, reduce, 16B ADDV, 6 1/2 V1 -
SADDLV,
UADDLV
ASIMD compare CMEQ, 2 1 V -
CMGE,
CMGT, CMHI,
CMHS, CMLE,
CMLT, CMTST
ASIMD dot product SDOT, UDOT 2 2 V -
ASIMD logical AND, BIC, 2 1 V -
EOR, MOV,
MVN, ORN,
ORR, NOT
ASIMD max/min, basic SMAX, 2 1 V -
and pair-wise SMAXP,
SMIN,
SMINP,
UMAX,
UMAXP,
UMIN,
UMINP
ASIMD max/min, reduce, SMAXV, 3 1 V1 -
4H/4S SMINV,
UMAXV,
UMINV
ASIMD max/min, reduce, SMAXV, 5 1 V1, V -
8B/8H SMINV,
UMAXV,
UMINV
ASIMD max/min, reduce, SMAXV, 6 1/2 V1 -
16B SMINV,
UMAXV,
UMINV

ASIMD multiply, D-form MUL, 4 1 V0 -
SQDMULH,
SQRDMULH
ASIMD multiply, Q-form MUL, 5 1/2 V0 -
SQDMULH,
SQRDMULH
ASIMD multiply MLA, MLS 4(1) 1 V0 1
accumulate, D-form
ASIMD multiply MLA, MLS 5(2) 1/2 V0 1
accumulate, Q-form
ASIMD multiply SQRDMLAH, 4 1 V0 -
accumulate high, D-form SQRDMLSH
ASIMD multiply SQRDMLAH, 5 1/2 V0 -
accumulate high, Q-form SQRDMLSH
ASIMD multiply SMLAL(2), 4(1) 1 V0 1
accumulate long SMLSL(2),
UMLAL(2),
UMLSL(2)
ASIMD multiply SQDMLAL(2), 4 1 V0 -
accumulate saturating SQDMLSL(2)
long
ASIMD multiply/multiply PMUL, 3 1 V0 3
long (8x8) polynomial, D- PMULL(2)
form
ASIMD multiply/multiply PMUL, 4 1/2 V0 3
long (8x8) polynomial, Q- PMULL(2)
form
ASIMD multiply long SMULL(2), 4 1 V0 -
UMULL(2),
SQDMULL(2)
ASIMD pairwise add and SADALP, 4(1) 1 V1 2
accumulate long UADALP
ASIMD shift accumulate SSRA, SRSRA, 4(1) 1 V1 2
USRA, URSRA

ASIMD shift by immed, SHL, SHLL(2), 2 1 V1 -
basic SHRN(2),
SSHLL(2),
SSHR,
SXTL(2),
USHLL(2),
USHR,
UXTL(2)
ASIMD shift by immed SLI, SRI 2 1 V1 -
and insert, basic
ASIMD shift by immed, RSHRN(2), 4 1 V1 -
complex SQRSHRN(2),
SQRSHRUN(2),
SQSHL{U},
SQSHRN(2),
SQSHRUN(2),
SRSHR,
UQRSHRN(2),
UQSHL,
UQSHRN(2),
URSHR
ASIMD shift by register, SSHL, USHL 2 1 V1 -
basic
ASIMD shift by register, SRSHL, 4 1 V1 -
complex SQRSHL,
SQSHL,
URSHL,
UQRSHL,
UQSHL

Table 27: AArch32 ASIMD integer instructions

Instruction Group AArch32 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

ASIMD absolute diff VABD 2 2 V -
ASIMD absolute diff VABA 4(1) 1 V1 2
accum
ASIMD absolute diff VABAL 4(1) 1 V1 2
accum long
ASIMD absolute diff long VABDL 2 2 V -

ASIMD arith, basic VADD, 2 2 V -
VADDL,
VADDW,
VNEG, VSUB,
VSUBL,
VSUBW
ASIMD arith, complex VABS, 2 2 V -
VADDHN,
VHADD,
VHSUB,
VQABS,
VQADD,
VQNEG,
VQSUB,
VRADDHN,
VRHADD,
VRSUBHN,
VSUBHN
ASIMD arith, dot product VSDOT, 2 2 V -
VUDOT
ASIMD arith, pair-wise VPADD, 2 2 V -
VPADDL
ASIMD compare VCEQ, VCGE, 2 1 V -
VCGT, VCLE,
VTST
ASIMD logical VAND, VBIC, 2 1 V -
VMVN,
VORR, VORN,
VEOR
ASIMD max/min VMAX, VMIN, 2 1 V -
VPMAX,
VPMIN
ASIMD multiply, D-form VMUL, 4 1 V0 -
VQDMULH,
VQRDMULH
ASIMD multiply, Q-form VMUL, 5 1/2 V0 -
VQDMULH,
VQRDMULH
ASIMD multiply VMLA, VMLS 4(1) 1 V0 1
accumulate, D-form

ASIMD multiply VMLA, VMLS 5(2) 1/2 V0 1
accumulate, Q-form
ASIMD multiply VQRDMLAH, 4 1 V0 -
accumulate high, D-form VQRDMLSH
ASIMD multiply VQRDMLAH, 5 1/2 V0 -
accumulate high, Q-form VQRDMLSH
ASIMD multiply VMLAL, 4(1) 1 V0 1
accumulate long VMLSL
ASIMD multiply VQDMLAL, 4 1 V0 -
accumulate saturating VQDMLSL
long
ASIMD multiply/multiply VMUL (.P8), 3 1 V0 -
long (8x8) polynomial, D- VMULL (.P8)
form
ASIMD multiply (8x8) VMUL (.P8) 4 1/2 V0 -
polynomial, Q-form
ASIMD multiply long VMULL (.S, 4 1 V0 -
.I), VQDMULL
ASIMD pairwise add and VPADAL 4(1) 1 V1 1
accumulate
ASIMD shift accumulate VSRA, VRSRA 4(1) 1 V1 1
ASIMD shift by immed, VMOVL, 2 1 V1 -
basic VSHL, VSHLL,
VSHR, VSHRN
ASIMD shift by immed VSLI, VSRI 2 1 V1 -
and insert, basic
ASIMD shift by immed, VQRSHRN, 4 1 V1 -
complex VQRSHRUN,
VQSHL{U},
VQSHRN,
VQSHRUN,
VRSHR,
VRSHRN
ASIMD shift by register, VSHL 2 1 V1 -
basic
ASIMD shift by register, VQRSHL, 4 1 V1 -
complex VQSHL,
VRSHL


1. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar
µOPs, allowing a typical sequence of integer multiply-accumulate µOPs to issue one every
cycle or one every other cycle (accumulate latency shown in parentheses).
2. Other accumulate pipelines also support late-forwarding of accumulate operands from
similar µOPs, allowing a typical sequence of such µOPs to issue one every cycle (accumulate
latency shown in parentheses).
3. This category includes instructions of the form “PMULL Vd.8H, Vn.8B, Vm.8B” and “PMULL2
Vd.8H, Vn.16B, Vm.16B”.

3.16 ASIMD floating-point instructions


Table 28: AArch64 ASIMD floating-point instructions

Instruction Group AArch64 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines
ASIMD FP absolute FABS, FABD 2 2 V -
value/difference
ASIMD FP arith, normal FABD, FADD, 2 2 V -
FSUB, FADDP
ASIMD FP compare FACGE, 2 2 V -
FACGT,
FCMEQ,
FCMGE,
FCMGT,
FCMLE,
FCMLT
ASIMD FP convert, long FCVTL(2) 4 1/2 V0 -
(F16 to F32)
ASIMD FP convert, long FCVTL(2) 3 1 V0 -
(F32 to F64)
ASIMD FP convert, FCVTN(2) 4 1/2 V0 -
narrow (F32 to F16)
ASIMD FP convert, FCVTN(2), 3 1 V0 -
narrow (F64 to F32) FCVTXN(2)

ASIMD FP convert, other, FCVTAS, 3 1 V0 -
D-form F32 and Q-form FCVTAU,
F64 FCVTMS,
FCVTMU,
FCVTNS,
FCVTNU,
FCVTPS,
FCVTPU,
FCVTZS,
FCVTZU,
SCVTF,
UCVTF
ASIMD FP convert, other, FCVTAS, 4 1/2 V0 -
D-form F16 and Q-form FCVTAU,
F32 FCVTMS,
FCVTMU,
FCVTNS,
FCVTNU,
FCVTPS,
FCVTPU,
FCVTZS,
FCVTZU,
SCVTF,
UCVTF
ASIMD FP convert, other, FCVTAS, 6 1/4 V0 -
Q-form F16 FCVTAU,
FCVTMS,
FCVTMU,
FCVTNS,
FCVTNU,
FCVTPS,
FCVTPU,
FCVTZS,
FCVTZU,
SCVTF,
UCVTF
ASIMD FP divide, D-form, FDIV 7 1/7 V0 3
F16
ASIMD FP divide, D-form, FDIV 7 to 10 2/9 to 2/7 V0 3
F32
ASIMD FP divide, Q-form, FDIV 10 to 13 1/13 to 1/10 V0 3
F16

ASIMD FP divide, Q-form, FDIV 7 to 10 1/9 to 1/7 V0 3
F32
ASIMD FP divide, Q-form, FDIV 7 to 15 1/14 to 1/7 V0 3
F64
ASIMD FP max/min, FMAX, 2 2 V -
normal FMAXNM,
FMIN,
FMINNM
ASIMD FP max/min, FMAXP, 2 2 V -
pairwise FMAXNMP,
FMINP,
FMINNMP
ASIMD FP max/min, FMAXV, 5 2 V -
reduce FMAXNMV,
FMINV,
FMINNMV
ASIMD FP max/min, FMAXV, 8 2/3 V -
reduce, Q-form F16 FMAXNMV,
FMINV,
FMINNMV
ASIMD FP multiply FMUL, 3 2 V 2
FMULX
ASIMD FP multiply FMLA, FMLS 4 (2) 2 V 1
accumulate
ASIMD FP multiply FMLAL(2), 5(2) 2 V 1
accumulate long FMLSL(2)
ASIMD FP negate FNEG 2 2 V -
ASIMD FP round, D-form FRINTA, 3 1 V0 -
F32 and Q-form F64 FRINTI,
FRINTM,
FRINTN,
FRINTP,
FRINTX,
FRINTZ

ASIMD FP round, D-form FRINTA, 4 1/2 V0 -
F16 and Q-form F32 FRINTI,
FRINTM,
FRINTN,
FRINTP,
FRINTX,
FRINTZ
ASIMD FP round, Q-form FRINTA, 6 1/4 V0 -
F16 FRINTI,
FRINTM,
FRINTN,
FRINTP,
FRINTX,
FRINTZ
ASIMD FP square root, D- FSQRT 7 1/7 V0 3
form, F16
ASIMD FP square root, D- FSQRT 7 to 10 2/9 to 2/7 V0 3
form, F32
ASIMD FP square root, Q- FSQRT 11 to 13 1/13 to 1/11 V0 3
form, F16
ASIMD FP square root, Q- FSQRT 7 to 10 1/9 to 1/7 V0 3
form, F32
ASIMD FP square root, Q- FSQRT 7 to 17 1/16 to 1/7 V0 3
form, F64

Table 29: AArch32 ASIMD floating-point instructions

Instruction Group AArch32 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines

ASIMD FP absolute value VABS 2 2 V -
ASIMD FP arith VABD, VADD, 2 2 V -
VPADD, VSUB
ASIMD FP compare VACGE, 2 2 V -
VACGT,
VACLE,
VACLT, VCEQ,
VCGE, VCGT,
VCLE

ASIMD FP convert, VCVT, VCVTA, 3 1 V0 -
integer, D-form VCVTM,
VCVTN,
VCVTP
ASIMD FP convert, VCVT, VCVTA, 4 1/2 V0 -
integer, Q-form VCVTM,
VCVTN,
VCVTP
ASIMD FP convert, fixed, VCVT 3 1 V0 -
D-form
ASIMD FP convert, fixed, VCVT 4 1/2 V0 -
Q-form
ASIMD FP convert, half- VCVT 4 1/2 V0 -
precision
ASIMD FP max/min VMAX, VMIN, 2 2 V -
VPMAX,
VPMIN,
VMAXNM,
VMINNM
ASIMD FP multiply VMUL, 3 2 V 2
VNMUL
ASIMD FP chained VMLA, VMLS 5(2) 2 V 1
multiply accumulate
ASIMD FP fused multiply VFMA, VFMS 4(2) 2 V 1
accumulate
ASIMD FP negate VNEG 2 2 V -
ASIMD FP round to VRINTA, 3 1 V0 -
integral, D-form VRINTM,
VRINTN,
VRINTP,
VRINTX,
VRINTZ
ASIMD FP round to VRINTA, 4 1/2 V0 -
integral, Q-form VRINTM,
VRINTN,
VRINTP,
VRINTX,
VRINTZ


1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from
similar µOPs, allowing a typical sequence of floating-point multiply-accumulate µOPs to issue
one every N cycles (accumulate latency N shown in parentheses).
2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP
multiply µOPs to the accumulate operands of an ASIMD FP multiply-accumulate µOP. The
latter can potentially be issued 1 cycle after the ASIMD FP multiply µOP has been issued.
3. ASIMD divide and square root operations are performed using an iterative algorithm and
block subsequent similar operations to the same pipeline until complete.
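
Because of the blocking behaviour in note 3, the throughput column is the figure that matters for back-to-back divides or square roots. The sketch below converts a listed throughput into an approximate cycle count for a batch of independent operations. It is a steady-state estimate under the assumption that the V0 pipeline is occupied by nothing else and that the latency of the final operation is ignored; the function name `batch_cycles` is hypothetical. The example values 1/14 and 1/7 are the Q-form F64 FDIV throughput range from Table 28.

```python
from fractions import Fraction


def batch_cycles(count: int, throughput: Fraction) -> Fraction:
    """Approximate cycles needed to issue `count` independent operations
    whose sustained throughput is `throughput` operations per cycle.

    Assumed steady-state model: cycles = count / throughput.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return Fraction(count) / throughput


# Ten Q-form F64 FDIVs at the slow and fast ends of the listed range.
slowest = batch_cycles(10, Fraction(1, 14))
fastest = batch_cycles(10, Fraction(1, 7))
print(f"between {fastest} and {slowest} cycles")
```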

3.17 ASIMD miscellaneous instructions


Table 30: AArch64 ASIMD miscellaneous instructions

Instruction Group AArch64 Exec Execution Utilized Notes


Instructions Latency Throughput Pipelines
ASIMD bit reverse RBIT 2 2 V -
ASIMD bitwise insert BIF, BIT, BSL 2 2 V -
ASIMD count CLS, CLZ, CNT 2 2 V -
ASIMD duplicate, gen reg DUP 3 1 M0 -
ASIMD duplicate, DUP 2 2 V -
element
ASIMD extract EXT 2 2 V -
ASIMD extract narrow XTN 2 2 V -
ASIMD extract narrow, SQXTN(2), 4 1 V1 -
saturating SQXTUN(2),
UQXTN(2)
ASIMD insert, element to INS 2 2 V -
element
ASIMD move, FP immed FMOV 2 2 V -
ASIMD move, integer MOVI, MVNI 2 2 V -
immed
ASIMD reciprocal FRECPE, 3 1 V0 -
estimate, D-form F32 and FRECPX,
F64 FRSQRTE,
URECPE,
URSQRTE

ASIMD reciprocal FRECPE, 4 1/2 V0 -
estimate, D-form F16 and FRECPX,
Q-form F32 FRSQRTE,
URECPE,
URSQRTE
ASIMD reciprocal FRECPE, 6 1/4 V0 -
estimate, Q-form F16 FRECPX,
FRSQRTE,
URECPE,
URSQRTE
ASIMD reciprocal step FRECPS, 4 2 V -
FRSQRTS
ASIMD reverse REV16, 2 2 V -
REV32,
REV64
ASIMD table lookup, 1 or TBL 2 2 V -
2 table regs
ASIMD table lookup, 3 TBL 4 1/2 V -
table regs
ASIMD table lookup, 4 TBL 4 2/3 V -
table regs
ASIMD table lookup TBX 2 2 V -
extension, 1 table reg
ASIMD table lookup TBX 4 1/2 V -
extension, 2 table reg
ASIMD table lookup TBX 6 2/3 V -
extension, 3 table reg
ASIMD table lookup TBX 6 2/5 V -
extension, 4 table reg
ASIMD transfer, element UMOV, 2 1 V1 -
to gen reg SMOV
ASIMD transfer, gen reg INS 5 1 M0, V -
to element
ASIMD transpose TRN1, TRN2 2 2 V -
ASIMD unzip/zip UZP1, UZP2, 2 2 V -
ZIP1, ZIP2

Table 31: AArch32 ASIMD miscellaneous instructions

Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
ASIMD bitwise insert | VBIF, VBIT, VBSL | 2 | 2 | V | -
ASIMD count | VCLS, VCLZ, VCNT | 2 | 2 | V | -
ASIMD duplicate, core reg | VDUP | 3 | 1 | M0 | -
ASIMD duplicate, scalar | VDUP | 2 | 2 | V | -
ASIMD extract | VEXT | 2 | 2 | V | -
ASIMD move, immed | VMOV | 2 | 2 | V | -
ASIMD move, register | VMOV | 2 | 2 | V | -
ASIMD move, narrowing | VMOVN | 2 | 2 | V | -
ASIMD move, saturating | VQMOVN, VQMOVUN | 4 | 1 | V1 | -
ASIMD reciprocal estimate, D-form | VRECPE, VRSQRTE | 3 | 1 | V0 | -
ASIMD reciprocal estimate, Q-form | VRECPE, VRSQRTE | 4 | 1/2 | V0 | -
ASIMD reciprocal step | VRECPS, VRSQRTS | 5 | 2 | V | -
ASIMD reverse | VREV16, VREV32, VREV64 | 2 | 2 | V | -
ASIMD swap | VSWP | 4 | 2/3 | V | -
ASIMD table lookup, 1 or 2 table regs | VTBL | 2 | 2 | V | -
ASIMD table lookup, 3 table regs | VTBL | 4 | 1/2 | V | -
ASIMD table lookup, 4 table regs | VTBL | 4 | 2/3 | V | -
ASIMD table lookup extension, 1 reg | VTBX | 2 | 2 | V | -
ASIMD table lookup extension, 2 table reg | VTBX | 4 | 1/2 | V | -
ASIMD table lookup extension, 3 table reg | VTBX | 6 | 2/3 | V | -
ASIMD table lookup extension, 4 table reg | VTBX | 6 | 2/5 | V | -
ASIMD transfer, scalar to core reg, word | VMOV | 2 | 1 | V1 | -
ASIMD transfer, scalar to core reg, byte/hword | VMOV | 3 | 1 | V1, I | -
ASIMD transfer, core reg to scalar | VMOV | 5 | 1 | M0, V | -
ASIMD transpose | VTRN | 4 | 2/3 | V | -
ASIMD unzip/zip | VUZP, VZIP | 4 | 2/3 | V | -

3.18 ASIMD load instructions


The latencies shown assume the memory access hits in the Level 1 Data Cache. Compared to
standard loads, an extra cycle is required to forward results to FP/ASIMD pipelines.

Table 32: AArch64 ASIMD load instructions

Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
ASIMD load, 1 element, multiple, 1 reg, D-form | LD1 | 5 | 2 | L | -
ASIMD load, 1 element, multiple, 1 reg, Q-form | LD1 | 5 | 2 | L | -
ASIMD load, 1 element, multiple, 2 reg, D-form | LD1 | 5 | 1 | L | -
ASIMD load, 1 element, multiple, 2 reg, Q-form | LD1 | 5 | 1 | L | -
ASIMD load, 1 element, multiple, 3 reg, D-form | LD1 | 6 | 2/3 | L | -
ASIMD load, 1 element, multiple, 3 reg, Q-form | LD1 | 6 | 2/3 | L | -
ASIMD load, 1 element, multiple, 4 reg, D-form | LD1 | 6 | 1/2 | L | -
ASIMD load, 1 element, multiple, 4 reg, Q-form | LD1 | 6 | 1/2 | L | -
ASIMD load, 1 element, one lane, B/H/S | LD1 | 7 | 2 | L, V | -
ASIMD load, 1 element, one lane, D | LD1 | 7 | 2 | L, V | -
ASIMD load, 1 element, all lanes, D-form, B/H/S | LD1R | 7 | 2 | L, V | -
ASIMD load, 1 element, all lanes, D-form, D | LD1R | 7 | 2 | L, V | -
ASIMD load, 1 element, all lanes, Q-form | LD1R | 7 | 2 | L, V | -
ASIMD load, 2 element, multiple, D-form, B/H/S | LD2 | 7 | 1 | L, V | -
ASIMD load, 2 element, multiple, Q-form, B/H/S | LD2 | 7 | 1 | L, V | -
ASIMD load, 2 element, multiple, Q-form, D | LD2 | 7 | 1 | L, V | -
ASIMD load, 2 element, one lane, B/H | LD2 | 7 | 1 | L, V | -
ASIMD load, 2 element, one lane, S | LD2 | 7 | 1 | L, V | -
ASIMD load, 2 element, one lane, D | LD2 | 7 | 1 | L, V | -
ASIMD load, 2 element, all lanes, D-form, B/H/S | LD2R | 7 | 1 | L, V | -
ASIMD load, 2 element, all lanes, D-form, D | LD2R | 7 | 1 | L, V | -
ASIMD load, 2 element, all lanes, Q-form | LD2R | 7 | 1 | L, V | -
ASIMD load, 3 element, multiple, D-form, B/H/S | LD3 | 8 | 1/2 | L, V | -
ASIMD load, 3 element, multiple, Q-form, B/H/S | LD3 | 8 | 1/2 | L, V | -
ASIMD load, 3 element, multiple, Q-form, D | LD3 | 8 | 1/2 | L, V | -
ASIMD load, 3 element, one lane, B/H | LD3 | 7 | 1/2 | L, V | -
ASIMD load, 3 element, one lane, S | LD3 | 7 | 1/2 | L, V | -
ASIMD load, 3 element, one lane, D | LD3 | 7 | 1/2 | L, V | -
ASIMD load, 3 element, all lanes, D-form, B/H/S | LD3R | 7 | 1/2 | L, V | -
ASIMD load, 3 element, all lanes, D-form, D | LD3R | 7 | 1/2 | L, V | -
ASIMD load, 3 element, all lanes, Q-form, B/H/S | LD3R | 7 | 1/2 | L, V | -
ASIMD load, 3 element, all lanes, Q-form, D | LD3R | 7 | 1/2 | L, V | -
ASIMD load, 4 element, multiple, D-form, B/H/S | LD4 | 8 | 2/7 | L, V | -
ASIMD load, 4 element, multiple, Q-form, B/H/S | LD4 | 10 | 1/5 | L, V | -
ASIMD load, 4 element, multiple, Q-form, D | LD4 | 10 | 1/5 | L, V | -
ASIMD load, 4 element, one lane, B/H | LD4 | 8 | 1/2 | L, V | -
ASIMD load, 4 element, one lane, S | LD4 | 8 | 1/2 | L, V | -
ASIMD load, 4 element, one lane, D | LD4 | 8 | 1/2 | L, V | -
ASIMD load, 4 element, all lanes, D-form, B/H/S | LD4R | 8 | 1/2 | L, V | -
ASIMD load, 4 element, all lanes, D-form, D | LD4R | 8 | 1/2 | L, V | -
ASIMD load, 4 element, all lanes, Q-form, B/H/S | LD4R | 8 | 1/2 | L, V | -
ASIMD load, 4 element, all lanes, Q-form, D | LD4R | 8 | 1/2 | L, V | -
(ASIMD load, writeback form) | - | (1) | - | +I | 1

Table 33: AArch32 ASIMD load instructions

Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
ASIMD load, 1 element, multiple, 1 reg | VLD1 | 5 | 2 | L | -
ASIMD load, 1 element, multiple, 2 reg | VLD1 | 5 | 2 | L | -
ASIMD load, 1 element, multiple, 3 reg | VLD1 | 5 | 1 | L | -
ASIMD load, 1 element, multiple, 4 reg | VLD1 | 5 | 1 | L | -
ASIMD load, 1 element, one lane | VLD1 | 7 | 2 | L, V | -
ASIMD load, 1 element, all lanes, 1 reg | VLD1 | 7 | 2 | L, V | -
ASIMD load, 1 element, all lanes, 2 reg | VLD1 | 7 | 2/3 | L, V | -
ASIMD load, 2 element, multiple, 2 reg | VLD2 | 7 | 2/3 | L, V | -
ASIMD load, 2 element, multiple, 4 reg | VLD2 | 8 | 1/2 | L, V | -
ASIMD load, 2 element, one lane, size 32 | VLD2 | 7 | 1 | L, V | -
ASIMD load, 2 element, one lane, size 8/16 | VLD2 | 7 | 1 | L, V | -
ASIMD load, 2 element, all lanes | VLD2 | 7 | 1 | L, V | -
ASIMD load, 3 element, multiple, 3 reg | VLD3 | 8 | 2/3 | L, V | -
ASIMD load, 3 element, one lane, size 32 | VLD3 | 8 | 2/3 | L, V | -
ASIMD load, 3 element, one lane, size 8/16 | VLD3 | 8 | 2/3 | L, V | -
ASIMD load, 3 element, all lanes | VLD3 | 8 | 2/3 | L, V | -
ASIMD load, 4 element, multiple, 4 reg | VLD4 | 8 | 1/2 | L, V | -
ASIMD load, 4 element, one lane, size 32 | VLD4 | 8 | 1/2 | L, V | -
ASIMD load, 4 element, one lane, size 8/16 | VLD4 | 8 | 1/2 | L, V | -
ASIMD load, 4 element, all lanes | VLD4 | 8 | 1/2 | L, V | -
(ASIMD load, writeback form) | - | (1) | - | +I | 1

Writeback forms of load instructions require an extra µOP to update the base address. This
update is typically performed in parallel with the load µOP (update latency shown in
parentheses).

3.19 ASIMD store instructions


Store MOPs are split into store address and store data µOPs. Once executed, stores are buffered
and committed in the background.

Table 34: AArch64 ASIMD store instructions

Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
ASIMD store, 1 element, multiple, 1 reg, D-form | ST1 | 2 | 2 | L, V | -
ASIMD store, 1 element, multiple, 1 reg, Q-form | ST1 | 2 | 1 | L, V | -
ASIMD store, 1 element, multiple, 2 reg, D-form | ST1 | 2 | 1 | L, V | -
ASIMD store, 1 element, multiple, 2 reg, Q-form | ST1 | 3 | 1/2 | L, V | -
ASIMD store, 1 element, multiple, 3 reg, D-form | ST1 | 3 | 2/3 | L, V | -
ASIMD store, 1 element, multiple, 3 reg, Q-form | ST1 | 4 | 1/3 | L, V | -
ASIMD store, 1 element, multiple, 4 reg, D-form | ST1 | 3 | 1/2 | L, V | -
ASIMD store, 1 element, multiple, 4 reg, Q-form | ST1 | 5 | 1/4 | L, V | -
ASIMD store, 1 element, one lane, B/H/S | ST1 | 4 | 1 | V, L | -
ASIMD store, 1 element, one lane, D | ST1 | 4 | 1 | V, L | -
ASIMD store, 2 element, multiple, D-form, B/H/S | ST2 | 4 | 1 | V, L | -
ASIMD store, 2 element, multiple, Q-form, B/H/S | ST2 | 5 | 1/2 | V, L | -
ASIMD store, 2 element, multiple, Q-form, D | ST2 | 5 | 1/2 | V, L | -
ASIMD store, 2 element, one lane, B/H/S | ST2 | 4 | 1 | V, L | -
ASIMD store, 2 element, one lane, D | ST2 | 4 | 1 | V, L | -
ASIMD store, 3 element, multiple, D-form, B/H/S | ST3 | 5 | 1/2 | V, L | -
ASIMD store, 3 element, multiple, Q-form, B/H/S | ST3 | 6 | 1/3 | V, L | -
ASIMD store, 3 element, multiple, Q-form, D | ST3 | 6 | 1/3 | V, L | -
ASIMD store, 3 element, one lane, B/H | ST3 | 4 | 1/2 | V, L | -
ASIMD store, 3 element, one lane, S | ST3 | 4 | 1/2 | V, L | -
ASIMD store, 3 element, one lane, D | ST3 | 5 | 1/2 | V, L | -
ASIMD store, 4 element, multiple, D-form, B/H/S | ST4 | 7 | 1/3 | V, L | -
ASIMD store, 4 element, multiple, Q-form, B/H/S | ST4 | 9 | 1/6 | V, L | -
ASIMD store, 4 element, multiple, Q-form, D | ST4 | 6 | 1/4 | V, L | -
ASIMD store, 4 element, one lane, B/H | ST4 | 5 | - | V, L | -
ASIMD store, 4 element, one lane, S | ST4 | - | 2/3 | V, L | -
ASIMD store, 4 element, one lane, D | ST4 | - | - | V, L | -
(ASIMD store, writeback form) | - | (1) | - | Add I | 1

Table 35: AArch32 ASIMD store instructions

Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
ASIMD store, 1 element, multiple, 1 reg | VST1 | 2 | 2 | L, V | -
ASIMD store, 1 element, multiple, 2 reg | VST1 | 2 | 2 | L, V | -
ASIMD store, 1 element, multiple, 3 reg | VST1 | 3 | 2/3 | L, V | -
ASIMD store, 1 element, multiple, 4 reg | VST1 | 3 | 1/2 | L, V | -
ASIMD store, 1 element, one lane | VST1 | 4 | 2 | V, L | -
ASIMD store, 2 element, multiple, 2 reg | VST2 | 4 | 1 | V, L | -
ASIMD store, 2 element, multiple, 4 reg | VST2 | 5 | 1/2 | V, L | -
ASIMD store, 2 element, one lane | VST2 | 4 | 2 | V, L | -
ASIMD store, 3 element, multiple, 3 reg | VST3 | 5 | 2/3 | V, L | -
ASIMD store, 3 element, one lane, size 32 | VST3 | 4 | 1 | V, L | -
ASIMD store, 3 element, one lane, size 8/16 | VST3 | 4 | 1 | V, L | -
ASIMD store, 4 element, multiple, 4 reg | VST4 | 8 | 1/2 | V, L | -
ASIMD store, 4 element, one lane, size 32 | VST4 | 7 | 2 | V, L | -
ASIMD store, 4 element, one lane, size 8/16 | VST4 | 7 | 2 | V, L | -
(ASIMD store, writeback form) | - | (1) | - | +I | 1

Writeback forms of store instructions require an extra µOP to update the base address. This
update is typically performed in parallel with the store µOP (update latency shown in
parentheses).

3.20 Cryptography extensions


Table 36: AArch64 Cryptography extensions

Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Crypto AES ops | AESD, AESE, AESIMC, AESMC | 2 | 2 | V | -
Crypto polynomial (64x64) multiply long | PMULL(2) | 2 | 1 | V0 | -
Crypto SHA1 hash acceleration op | SHA1H | 2 | 1 | V0 | -
Crypto SHA1 hash acceleration ops | SHA1C, SHA1M, SHA1P | 4 | 1 | V0 | -
Crypto SHA1 schedule acceleration ops | SHA1SU0, SHA1SU1 | 2 | 1 | V0 | -
Crypto SHA256 hash acceleration ops | SHA256H, SHA256H2 | 4 | 1 | V0 | -
Crypto SHA256 schedule acceleration ops | SHA256SU0, SHA256SU1 | 2 | 1 | V0 | -

Table 37: AArch32 Cryptography extensions

Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Crypto AES ops | AESD, AESE, AESIMC, AESMC | 2 | 2 | V | 1
Crypto polynomial (64x64) multiply long | VMULL.P64 | 2 | 1 | V0 | -
Crypto SHA1 hash acceleration op | SHA1H | 2 | 1 | V0 | -
Crypto SHA1 hash acceleration ops | SHA1C, SHA1M, SHA1P | 4 | 1 | V0 | -
Crypto SHA1 schedule acceleration ops | SHA1SU0, SHA1SU1 | 2 | 1 | V0 | -
Crypto SHA256 hash acceleration ops | SHA256H, SHA256H2 | 4 | 1 | V0 | -
Crypto SHA256 schedule acceleration ops | SHA256SU0, SHA256SU1 | 2 | 1 | V0 | -

Adjacent AESE/AESMC instruction pairs and adjacent AESD/AESIMC instruction pairs will exhibit
the performance characteristics described in Section 4.6.

3.21 CRC

Table 38: AArch64 CRC

Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
CRC checksum ops | CRC32, CRC32C | 2 | 1 | M0 | 1

Table 39: AArch32 CRC

Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
CRC checksum ops | CRC32, CRC32C | 2 | 1 | M0 | 1

CRC execution supports late forwarding of the result from a producer µOP to a consumer µOP.
This results in a 1 cycle reduction in latency as seen by the consumer.

4 Special considerations
4.1 Dispatch constraints
Dispatch of µOPs from the in-order portion to the out-of-order portion of the microarchitecture
includes several constraints. It is important to consider these constraints during code generation
to maximize the effective dispatch bandwidth and subsequent execution bandwidth of Cortex-
A77.

The dispatch stage can process up to 6 MOPs per cycle and dispatch up to 10 µOPs per cycle, with
the following limitations on the number of µOPs of each type that may be simultaneously
dispatched.

• Up to 4 µOPs utilizing the S or B pipelines
• Up to 4 µOPs utilizing the M pipelines
• Up to 2 µOPs utilizing the M0 pipeline
• Up to 2 µOPs utilizing the V0 pipeline
• Up to 2 µOPs utilizing the V1 pipeline
• Up to 4 µOPs utilizing the L pipelines
• Up to 4 µOPs utilizing the D pipelines

In the event there are more µOPs available to be dispatched in a given cycle than can be
supported by the constraints above, µOPs will be dispatched in oldest to youngest age-order to
the extent allowed by the above.
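
As a rough illustration of how these limits interact, the following Python sketch counts the cycles needed to dispatch a sequence of µOPs. It is a simplified model and not a description of the hardware: the class labels (`SB`, `M`, `M0`, ...), the function name `dispatch_cycles`, the assumption that each µOP counts against exactly one limit, and the assumption that dispatch ends for the cycle at the first µOP that would exceed a limit are all choices made for this example. The limit of 6 MOPs per cycle is not modeled.

```python
from collections import Counter

# Per-cycle limits copied from the list above. The keys are labels chosen
# for this sketch; "SB" stands for "S or B pipelines".
LIMITS = {"SB": 4, "M": 4, "M0": 2, "V0": 2, "V1": 2, "L": 4, "D": 4}
MAX_UOPS_PER_CYCLE = 10


def dispatch_cycles(uops):
    """Return the cycles needed to dispatch `uops` (oldest first).

    Assumption: a cycle ends at the first µOP that would exceed its class
    limit, because dispatch proceeds in age order.
    """
    cycles = 0
    i = 0
    while i < len(uops):
        used = Counter()
        dispatched = 0
        while i < len(uops) and dispatched < MAX_UOPS_PER_CYCLE:
            kind = uops[i]
            if used[kind] >= LIMITS[kind]:
                break  # class limit reached; the rest waits for the next cycle
            used[kind] += 1
            dispatched += 1
            i += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Four V0 µOPs need two cycles because at most two dispatch per cycle.
    print(dispatch_cycles(["V0"] * 4))
```

Under this model, a mix of four L, four S/B and two D µOPs fits in one cycle, while an eleventh µOP of any class spills into the next.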

4.2 Dispatch stall


In the event of a V-pipeline µOP containing more than 1 quad-word register source, a portion or all
of which was previously written as one or multiple single words, that µOP will stall in dispatch for
three cycles. This stall occurs only on the first such instance, and subsequent consumers of the
same register will not experience this stall.

4.3 Optimizing general-purpose register spills and fills


Register transfers between general-purpose registers (GPR) and ASIMD registers (VPR) have lower
latency than reads and writes to the cache hierarchy. It is therefore recommended that GPRs be
spilled to and filled from VPRs rather than memory, when possible.

4.4 Optimizing memory copy


To achieve maximum throughput for memory copy (or similar loops), one should do the following.
• Unroll the loop to include multiple load and store operations per iteration, minimizing the
overheads of looping.
• Align stores on a 16B boundary wherever possible.
• Use non-writeback forms of LDP and STP instructions, interleaving them as shown in the
example below:
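Loop_start:
    SUBS X2, X2, #192          // X2 = bytes remaining; sets the flags tested by B.GT
    LDP  Q3, Q4, [X1, #0]      // load 32 bytes from the source
    STP  Q3, Q4, [X0, #0]      // store 32 bytes to the destination
    LDP  Q3, Q4, [X1, #32]
    STP  Q3, Q4, [X0, #32]
    LDP  Q3, Q4, [X1, #64]
    STP  Q3, Q4, [X0, #64]
    LDP  Q3, Q4, [X1, #96]
    STP  Q3, Q4, [X0, #96]
    LDP  Q3, Q4, [X1, #128]
    STP  Q3, Q4, [X0, #128]
    LDP  Q3, Q4, [X1, #160]
    STP  Q3, Q4, [X0, #160]
    ADD  X1, X1, #192          // advance the source pointer (flags unchanged)
    ADD  X0, X0, #192          // advance the destination pointer (flags unchanged)
    B.GT Loop_start            // repeat while bytes remain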

A recommended copy routine for AArch32 would look like the sequence above but would use
LDRD/STRD instructions. Avoid load-/store-multiple instruction encodings (such as LDM and STM).
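
The structure of the loop above (one SUBS, then N interleaved LDP/STP pairs at 32-byte steps, then two pointer updates and the branch) can be sketched as a small Python generator. This is only an illustration: the function name `emit_copy_loop` and its parameters are hypothetical, and it prints text in the style of the example rather than producing a tuned or verified routine.

```python
def emit_copy_loop(pairs=6, src="X1", dst="X0", count="X2"):
    """Return assembly text for an unrolled, non-writeback LDP/STP copy loop.

    Assumption: each LDP/STP of two Q registers moves 32 bytes, so one
    iteration copies pairs * 32 bytes, matching the 6 x 32 = 192 of the
    example in the text.
    """
    step = 32
    total = pairs * step
    lines = ["Loop_start:", f"    SUBS {count}, {count}, #{total}"]
    for i in range(pairs):
        offset = i * step
        # Loads and stores are interleaved, as recommended above.
        lines.append(f"    LDP Q3, Q4, [{src}, #{offset}]")
        lines.append(f"    STP Q3, Q4, [{dst}, #{offset}]")
    lines.append(f"    ADD {src}, {src}, #{total}")
    lines.append(f"    ADD {dst}, {dst}, #{total}")
    lines.append("    B.GT Loop_start")
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_copy_loop())
```

Changing `pairs` changes the unroll factor; the best value for a given workload is something to measure rather than assume.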

4.5 Load/Store alignment


The Armv8.2-A architecture allows many types of load and store accesses to be arbitrarily aligned.
The Cortex-A77 core handles most unaligned accesses without performance penalties.
However, there are cases which reduce bandwidth or incur additional latency, as described below.

• Load operations that cross a cache-line (64-byte) boundary.
• Quad-word load operations that are not 4B aligned.
• Store operations that cross a 16B boundary.
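
The three cases above can be checked mechanically for a given address and access size. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, a quad-word access is taken to mean a 16-byte access, and the result only says that an access falls into one of the listed cases, not how large the penalty is.

```python
def crosses_boundary(addr, size, boundary):
    """True if the bytes [addr, addr + size) span a multiple of `boundary`."""
    return (addr // boundary) != ((addr + size - 1) // boundary)


def may_be_penalized(kind, addr, size):
    """Apply the three rules listed above to one access.

    kind: "load" or "store" (labels chosen for this sketch).
    """
    if kind == "load":
        crosses_line = crosses_boundary(addr, size, 64)
        misaligned_quad = size == 16 and addr % 4 != 0
        return crosses_line or misaligned_quad
    if kind == "store":
        return crosses_boundary(addr, size, 16)
    raise ValueError("kind must be 'load' or 'store'")


if __name__ == "__main__":
    print(may_be_penalized("load", 60, 8))    # crosses a 64-byte line
    print(may_be_penalized("store", 16, 16))  # stays within one 16B block
```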

4.6 AES encryption/decryption


Cortex-A77 can issue two AESE/AESMC/AESD/AESIMC instructions every cycle (fully pipelined) with
an execution latency of two cycles. This means encryption or decryption for at least four data
chunks should be interleaved for maximum performance:

AESE data0, key0
AESMC data0, data0
AESE data1, key0
AESMC data1, data1
AESE data2, key0
AESMC data2, data2
AESE data3, key0
AESMC data3, data3
AESE data0, key1
AESMC data0, data0
...

Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when
they are adjacent in the program code and both instructions use the same destination register.
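
The figure of four data chunks follows from multiplying the issue rate by the latency: with two instructions issued per cycle and a two-cycle latency, four independent operations are needed to keep the pipelines busy. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the model deliberately ignores fusion and any other effects.

```python
def min_independent_streams(issue_per_cycle, latency_cycles):
    """Independent dependency chains needed to hide the latency.

    Assumption: a simple throughput x latency model; each chain can issue
    its next instruction only after the previous result is available.
    """
    return issue_per_cycle * latency_cycles


if __name__ == "__main__":
    # Two AES instructions per cycle, two-cycle latency (values from the text).
    print(min_independent_streams(2, 2))
```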

4.7 Region based fast forwarding


The forwarding logic in the V pipelines is designed to give the lowest latency to instructions
that are expected to forward to one another frequently. These optimized forwarding regions
are defined in the following table.

Table 40: Optimized forwarding regions

Region | Instruction Types
1 | ASIMD ALU, ASIMD shift and certain ASIMD Miscellaneous.
2 | FP multiply, FP multiply-accumulate, FP compare, FP add/sub and certain ASIMD Miscellaneous.
3 | Cryptography Extensions.

In addition to the regions mentioned in the table above, all floating point and ASIMD instructions
can fast forward to FP and ASIMD stores.

Exceptions to these forwarding regions are as follows:

• Fast forwarding will not occur in AArch32 mode if the consuming register’s width is greater
than that of the producer.
• Element sources used by FP multiply and multiply-accumulate operations cannot be
consumers.
• Complex ASIMD shift by immediate/register and shift accumulate instructions cannot be
producers (see section 3.14) in region 1.
• ASIMD extract narrow, saturating instructions cannot be producers (see section 3.16) in region
1.
• ASIMD absolute difference accumulate and pairwise add and accumulate instructions cannot
be producers (see section 3.14) in region 1.
• For FP producer-consumer pairs, the precision of the instructions should match (single, double
or half) in region 2.
• Pair-wise FP instructions cannot be consumers in region 2.


The effective latency of FP and ASIMD instructions as described in section 3 is increased by one
cycle if the producer and consumer instructions are not part of the same forwarding region, or if
they are affected by the exceptions described above. ASIMD integer multiply/multiply-accumulate,
ASIMD max/min reduction, FP divide and square root, FP convert/round and reciprocal/reciprocal
sqrt estimate instructions are not subject to this behavior.
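
The latency rule in the paragraph above can be written as a small function. The Python sketch below is an illustration only: the function name and parameters are hypothetical, region numbers follow Table 40, and whether an exception applies or an instruction is exempt must be decided by the caller from the lists above. Forwarding to FP and ASIMD stores is not modeled.

```python
def effective_latency(table_latency, producer_region, consumer_region,
                      exception_applies=False, exempt=False):
    """Latency seen by the consumer under the forwarding-region rule.

    table_latency: latency from the section 3 tables.
    producer_region / consumer_region: 1, 2, 3 or None (no region).
    exception_applies: one of the listed exceptions affects this pair.
    exempt: the instruction is one of the types stated above as not
            subject to this behavior (assumption: decided by the caller).
    """
    if exempt:
        return table_latency
    same_region = (producer_region is not None
                   and producer_region == consumer_region)
    if same_region and not exception_applies:
        return table_latency
    return table_latency + 1  # one extra cycle outside a fast-forward path


if __name__ == "__main__":
    print(effective_latency(2, 1, 1))  # same region: table latency
    print(effective_latency(2, 1, 2))  # different regions: one cycle more
```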

4.8 Branch instruction alignment


Branch instruction and branch target instruction alignment and density can affect performance.

For best case performance, avoid placing more than four branch instructions within an aligned
32-byte instruction memory region.
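
A code generator or post-link tool could check this guideline by counting branch instructions per aligned 32-byte region. The Python sketch below shows one way to do that; the function name and the idea of passing a plain list of branch instruction addresses are assumptions made for the example.

```python
from collections import Counter


def dense_branch_regions(branch_addresses, region_bytes=32, max_branches=4):
    """Return start addresses of aligned regions holding too many branches.

    branch_addresses: byte addresses of branch instructions (assumed input).
    A region is flagged when it contains more than `max_branches` branches.
    """
    counts = Counter(addr // region_bytes for addr in branch_addresses)
    return sorted(region * region_bytes
                  for region, n in counts.items() if n > max_branches)


if __name__ == "__main__":
    # Five branches in the region starting at address 0: flagged.
    print(dense_branch_regions([0, 4, 8, 12, 16]))
```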

4.9 FPCR self-synchronization


Programmers and compiler writers should note that writes to the FPCR register are
self-synchronizing, i.e. their effect on subsequent instructions can be relied upon without an
intervening context synchronizing operation.

4.10 Special register access


The Cortex-A77 core performs register renaming for general purpose registers to enable
speculative and out-of-order instruction execution. But most special-purpose registers are not
renamed. Instructions that read or write non-renamed registers are subjected to one or more of
the following additional execution constraints.

• Non-Speculative Execution – Instructions may only execute non-speculatively.
• In-Order Execution – Instructions must execute in-order with respect to other similar
instructions or in some cases all instructions.
• Flush Side-Effects – Instructions trigger a flush side-effect after executing for synchronization.

The table below summarizes various special-purpose register read accesses and the associated
execution constraints or side-effects.

Table 41: Special-purpose register read accesses

Register Read | Non-Speculative | In-Order | Flush Side-Effect | Notes
APSR | Yes | Yes | No | 3
CurrentEL | No | Yes | No | -
DAIF | No | Yes | No | -
DLR_EL0 | No | Yes | No | -
DSPSR_EL0 | No | Yes | No | -
ELR_* | No | Yes | No | -
FPCR | No | Yes | No | -
FPSCR | Yes | Yes | No | 2
FPSR | Yes | Yes | No | 2
NZCV | No | No | No | 1
SP_* | No | No | No | 1
SPSel | No | Yes | No | -
SPSR_* | No | Yes | No | -

1. The NZCV and SP registers are fully renamed.
2. FPSR/FPSCR reads must wait for all prior instructions that may update the status flags to
execute and retire.
3. APSR reads must wait for all prior instructions that may set the Q bit to execute and retire.

The table below summarizes various special-purpose register write accesses and the associated
execution constraints or side-effects.

Table 42: Special-purpose register write accesses

Register Write | Non-Speculative | In-Order | Flush Side-Effect | Notes
APSR | Yes | Yes | No | 4
DAIF | Yes | Yes | No | -
DLR_EL0 | Yes | Yes | No | -
DSPSR_EL0 | Yes | Yes | No | -
ELR_* | Yes | Yes | No | -
FPCR | Yes | Yes | Maybe | 2
FPSCR | Yes | Yes | Maybe | 2, 3
FPSR | Yes | Yes | No | 3
NZCV | No | No | No | 1
SP_* | No | No | No | 1
SPSel | Yes | Yes | Yes | -
SPSR_* | Yes | Yes | No | -

1. The NZCV and SP registers are fully renamed.
2. If the FPCR/FPSCR write is predicted to change the control field values, it will introduce a
barrier which prevents subsequent instructions from executing. If the FPCR/FPSCR write is
predicted to not change the control field values, it will execute without a barrier but trigger a
flush if the values change.
3. FPSR/FPSCR writes must stall at dispatch if another FPSR/FPSCR write is still pending.
4. APSR writes that set the Q bit will introduce a barrier which prevents subsequent
instructions from executing until the write completes.

4.11 Register forwarding hazards


The Armv8-A architecture allows FP/ASIMD instructions to read and write 32-bit S-registers. In
AArch32, each S-register corresponds to one half (upper or lower) of an overlaid 64-bit D-register.
A Q-register in turn consists of two overlaid D-registers. Register forwarding hazards may occur
when one µOP reads a Q-register operand that has recently been written with one or more
S-register results. Consider the following scenario.

VADD S0, S1, S2
VADD Q6, Q5, Q0

The first instruction writes S0, which corresponds to the lowest part of Q0. The second instruction
then requires Q0 as an input operand. In this scenario, there is a RAW dependency
between the first and the second instructions. In most cases, Cortex-A77 performs slightly worse
in such situations.

Cortex-A77 is able to avoid this register-hazard condition for certain cases. The following rules
describe the conditions under which a register-hazard can occur.

• The producer writes an S-register (not a D[x] scalar)
• The consumer reads an overlapping Q-register (not as a D[x] scalar)
• The consumer is a FP/ASIMD µOP (not a store or MOV µOP)

To avoid unnecessary hazards, it is recommended that the programmer use D[x] scalar writes
when populating registers prior to ASIMD operations. For example, either of the following
instruction forms would safely prevent a subsequent hazard.

VLD1.32 D0[x], [address]
VADD Q1, Q0, Q2
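
The overlap between S-registers and Q-registers described at the start of this section can be computed directly: two S-registers overlay each D-register and two D-registers overlay each Q-register, so S0 to S3 fall in Q0, S4 to S7 in Q1, and so on. The Python sketch below applies the three conditions listed above. The function names are hypothetical, the check covers only S-register producers and Q-register consumers, and it reports that a hazard is possible, not that one will occur.

```python
def q_register_of(s_index):
    """Index of the Q-register that overlays S<s_index> in AArch32."""
    if not 0 <= s_index <= 31:
        raise ValueError("S-registers are S0..S31")
    return s_index // 4  # four 32-bit S-registers per 128-bit Q-register


def hazard_possible(produced_s, consumed_q, consumer_is_store_or_mov=False):
    """Apply the listed conditions to one producer/consumer pair.

    produced_s: index of the S-register written by the producer.
    consumed_q: index of the Q-register read by the consumer.
    consumer_is_store_or_mov: True if the consumer is a store or MOV µOP.
    """
    overlaps = q_register_of(produced_s) == consumed_q
    return overlaps and not consumer_is_store_or_mov


if __name__ == "__main__":
    # VADD S0, S1, S2 followed by VADD Q6, Q5, Q0: S0 lies in Q0.
    print(hazard_possible(0, 0))
```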

4.12 IT blocks
The Armv8-A architecture performance deprecates some uses of the IT instruction in such a way
that software may be written using multiple naïve single instruction IT blocks. It is preferred that
software instead generate multi instruction IT blocks rather than single instruction blocks.

4.13 Instruction fusion


1. CMP/CMN (immediate) + B.cond
2. CMP/CMN (register) + B.cond
3. TST (immediate) + B.cond
4. TST (register) + B.cond
5. BICS (register) + B.cond
6. NOP + Any instruction

The following instruction pairs are fused in both AArch32 and AArch64 modes:

1. AESE + AESMC (see Section 4.6 on AES Encryption/Decryption)
2. AESD + AESIMC (see Section 4.6 on AES Encryption/Decryption)

These instruction pairs must be adjacent to each other in program code. For CMP, CMN, TST and
BICS, fusion is not allowed for shifted and/or extended register forms. For BICS, the destination
register should be XZR or WZR if fusion is to take place.
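
The pairs and restrictions in this section can be collected into a simple predicate, for example as a sanity check in a scheduler or peephole pass. The Python sketch below is illustrative only: the function name and parameters are hypothetical, mnemonics are compared as plain strings, the caller must establish that the two instructions are adjacent, and the execution mode is not modeled. The same-destination-register preference for AES pairs noted in Section 4.6 is also not modeled.

```python
# Mnemonics taken from the lists above; the set names are chosen for this sketch.
FUSIBLE_BEFORE_BCOND = {"CMP", "CMN", "TST", "BICS"}
AES_PAIRS = {("AESE", "AESMC"), ("AESD", "AESIMC")}


def may_fuse(first, second, shifted_or_extended=False,
             bics_dest_is_zero_reg=True):
    """Return True if two adjacent instructions match a listed fusion pair.

    shifted_or_extended: the first instruction uses a shifted or extended
        register form (not fused for CMP, CMN, TST and BICS).
    bics_dest_is_zero_reg: the BICS destination is XZR or WZR.
    """
    first, second = first.upper(), second.upper()
    if first == "NOP":
        return True  # "NOP + Any instruction"
    if (first, second) in AES_PAIRS:
        return True
    if first in FUSIBLE_BEFORE_BCOND and second.startswith("B."):
        if shifted_or_extended:
            return False
        if first == "BICS" and not bics_dest_is_zero_reg:
            return False
        return True
    return False


if __name__ == "__main__":
    print(may_fuse("CMP", "B.NE"))
    print(may_fuse("ADD", "B.EQ"))
```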
