Cortex®-A72 Software Optimization Guide
Release Information
The following changes have been made to this Software Optimization Guide.
Change History
Proprietary Notice
Words and logos marked with ™ or ® are registered trademarks or trademarks of ARM® in the EU and other
countries except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein
may be the trademarks of their respective owners.
Neither the whole nor any part of the information contained in, or the product described in, this document may be
adapted or reproduced in any material form except with the prior written permission of the copyright holder.
The product described in this document is subject to continuous developments and improvements. All particulars
of the product and its use contained in this document are given by ARM in good faith. However, all warranties
implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are
excluded.
This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss
or damage arising from the use of any information in this document, or any error or omission in such information,
or any incorrect use of the product.
Where the term ARM is used it means “ARM or any of its subsidiaries as appropriate”.
Confidentiality Status
Product Status
Web Address
http://www.arm.com
1.1 References
2 INTRODUCTION
3 INSTRUCTION CHARACTERISTICS
3.20 CRC
4 SPECIAL CONSIDERATIONS
1.1 References
This document refers to the following documents.
[Figure: Cortex-A72 processor pipeline. Fetch feeds Rename/Dispatch, which issues to the Branch, Integer 0, Integer 1, FP/ASIMD 0, FP/ASIMD 1, Load, and Store pipelines.]
FP/ASIMD-0 (F0): ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, and crypto µops
3 INSTRUCTION CHARACTERISTICS
Note:
1. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds two cycles to the latency.
Note:
1. Sequential MOVW/MOVT (AArch32) instruction pairs and certain MOVZ/MOVK, MOVK/MOVK (AArch64)
instruction pairs can be executed with one-cycle execute latency and four-instruction/cycle execution
throughput in I0/I1. See Section 4.11 for more details on the instruction pairs that can be merged.
2. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds two cycles to the latency.
3. Sequential ADRP/ADD instruction pairs can be executed with one-cycle execute latency and
four-instruction/cycle execution throughput in I0/I1, as illustrated below. See Section 4.12 for more details
on the instruction pairs that can be merged.
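As an illustration, the following AArch64 address-formation idiom produces such a mergeable pair (a minimal sketch; my_symbol is a hypothetical label):
    ADRP    x0,my_symbol             // x0 = 4KB page address of my_symbol
    ADD     x0,x0,:lo12:my_symbol    // add the low 12 address bits; can merge with the ADRP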
Note:
1. Integer divides are performed using an iterative algorithm and block any subsequent divide operations until
complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops, allowing
a typical sequence of multiply-accumulate µops to issue one every N cycles (accumulate latency N shown
in parentheses).
3. Long-form multiplies (which produce two result registers) stall the multiplier pipeline for one extra cycle.
4. Multiplies that set the condition flags require an additional integer µop.
5. X-form multiply accumulates stall the multiplier pipeline for two extra cycles.
6. Multiply high operations stall the multiplier pipeline for N extra cycles before any other type M µop can be
issued to that pipeline, with N shown in parentheses.
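For example, the accumulate chains described in note 2 behave as in the following hedged AArch64 sketch: each MADD receives its accumulate operand by late forwarding from the previous multiply-accumulate, so a dependent chain issues at the accumulate latency N rather than the full multiply latency:
    MADD    x0,x1,x2,x0    // x0 = x1*x2 + x0
    MADD    x0,x3,x4,x0    // accumulate operand x0 late-forwarded from the previous MADD
    MADD    x0,x5,x6,x0    // a typical chain issues one µop every N cycles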
Note:
1. Conditional GE-setting instructions require three extra µops and two additional cycles to conditionally
update the GE field (GE latency shown in parentheses).
Note:
1. Base register updates are typically completed in parallel with the load operation and with shorter latency
(update latency shown in parentheses).
2. For load multiple instructions, N=floor((num_regs+1)/2).
3. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds two cycles to the latency.
Note:
1. Base register updates are typically completed in parallel with the store operation and with shorter latency
(update latency is shown in parentheses).
2. For store multiple instructions, N=floor((num_regs+1)/2).
Note:
1. FP divide and square root operations are performed using an iterative algorithm and block subsequent
similar operations to the same pipeline until complete.
2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply µops to the
accumulate operands of an FP multiply-accumulate µop. The latter can potentially be issued one cycle
after the FP multiply µop has been issued.
3. FP multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops,
allowing a typical sequence of multiply-accumulate µops to issue one every N cycles (accumulate latency
N is shown in parentheses).
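Note 2 can be illustrated with a minimal AArch64 scalar sketch: the FMADD below uses the FMUL result as its accumulate operand, so it can potentially issue one cycle after the FMUL:
    FMUL    d0,d1,d2       // d0 = d1*d2
    FMADD   d3,d4,d5,d0    // d3 = d4*d5 + d0; accumulate operand d0 late-forwarded from the FMUL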
Note:
1. For FP load multiple instructions, N=floor((num_regs+1)/2).
2. For conditional FP load multiple instructions, N = num_regs.
3. Writeback forms of load instructions require an extra µop to update the base address. This update is
typically performed in parallel with, or prior to, the load µop (update latency is shown in parentheses).
Note:
1. For single-precision store multiple instructions, N=floor((num_regs+1)/2). For double-precision store
multiple instructions, N=(num_regs).
2. Writeback forms of store instructions require an extra µop to update the base address. This update is
typically performed in parallel with, or prior to, the store µop (address update latency is shown in
parentheses).
Note:
1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops,
allowing a typical sequence of floating-point multiply-accumulate µops to issue one every N cycles
(accumulate latency N is shown in parentheses).
2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP multiply µops to
the accumulate operands of an ASIMD FP multiply-accumulate µop. The latter can potentially be issued
one cycle after the ASIMD FP multiply µop has been issued.
3. ASIMD divide operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
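For example (a hedged AArch64 ASIMD sketch of note 1), a dependent vector accumulate chain can issue one FMLA every N cycles because the accumulate operand is late-forwarded:
    FMLA    v0.4s,v1.4s,v2.4s    // per-lane v0 += v1*v2
    FMLA    v0.4s,v3.4s,v4.4s    // accumulate operand v0 late-forwarded from the previous FMLA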
Note:
1. For table branches (TBL and TBX), N denotes the number of registers in the table.
Note:
1. Writeback forms of load instructions require an extra µop to update the base address. This update is
typically performed in parallel with the load µop (update latency is shown in parentheses).
Note:
1. Adjacent AESE/AESMC instruction pairs and adjacent AESD/AESIMC instruction pairs will exhibit the
described performance characteristics. See Section 4.10 for additional details.
2. Crypto execution supports late forwarding of the result from a producer µop to a consumer µop. This results
in a one-cycle reduction in latency as seen by the consumer.
3.20 CRC
Instruction Group    AArch32 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
CRC checksum ops     CRC32, CRC32C           2               1                       M                     1
Note:
1. CRC execution supports late forwarding of the result from a producer CRC µop to a consumer CRC µop. This
results in a one-cycle reduction in latency as seen by the consumer.
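For example, in a running checksum such as the following AArch32 sketch, each CRC32W consumes the previous result, and late forwarding trims one cycle from the latency seen by the consumer:
    CRC32W  r0,r0,r1    @ accumulate the CRC of r1 into r0
    CRC32W  r0,r0,r2    @ consumer µop: the producer result is late-forwarded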
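4 SPECIAL CONSIDERATIONS
Consider, for example, a pair of conditional multiplies such as the following (a hedged AArch32 illustration; the source registers of the first multiply are arbitrary):
    MULEQ   r1,r3,r5    @ first conditional multiply, writes R1
    MULEQ   r1,r2,r4    @ reads R2 and R4; being conditional, it also depends on the prior value of R1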
For this pair of instructions, the second multiply is dependent upon the result of the first multiply, not through one
of its normal input operands (R2 and R4), but through the destination register R1. The combined latency for these
instructions is six cycles, rather than the four cycles that would be required if these instructions were not
conditional (three cycles of latency for the first, and one additional cycle for the second, which is fully pipelined
behind the first). So if the condition is easily predictable (by the branch predictor), conditional execution can lead to a
loss of performance compared with an equivalent sequence that uses a conditional branch.
VMOV S0,R0
VMOV S1,R1
VADD D2, D1, D0
The first two instructions write S0 and S1, which correspond to the bottom and top halves of D0. The third
instruction then requires D0 as an input operand. In this scenario, the Cortex®-A72 processor detects that at least
one of the upper or lower S registers (S0/S1) overlaid on D0 was previously written, at which point the VADD
instruction is serialized until the prior S-register writes are guaranteed to have been architecturally committed,
likely incurring
significant additional latency. Note that after the D0 register has been written as a D-register or Q-register
destination, subsequent consumers of that register will no longer encounter this register-hazard condition, until the
next S-register write, if any.
The Cortex®-A72 processor is able to avoid this register-hazard condition for certain cases. The following rules
describe the conditions under which a register-hazard can occur.
To avoid unnecessary hazards, ARM recommends that the programmer use D[x] scalar writes when populating
registers prior to ASIMD operations. For example, either of the following instruction forms would safely prevent a
subsequent hazard:
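(A hedged sketch of such forms; register choices are illustrative.)
    VMOV.32 d0[0],r0    @ D[x] scalar write to lane 0 of D0, tracked as a D-register write
    VMOV.32 d0[1],r1    @ D[x] scalar write to lane 1 of D0
    @ or, writing the whole register at once:
    VMOV    d0,r0,r1    @ writes both halves of D0 from two core registers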
To achieve maximum throughput for memory copy (or similar loops), do the following:
• Unroll the loop to include multiple load and store operations for each iteration, minimizing the overheads
of looping.
• Use discrete, non-writeback forms of load and store instructions (such as LDRD and STRD), interleaving
them so that one load and one store operation can be performed each cycle. Avoid load-multiple/store-
multiple instruction encodings (such as LDM and STM), which lead to separated bursts of load and store
µops which might not allow concurrent use of both the load and store pipelines.
The following example shows a recommended instruction sequence for a long memory copy in AArch32 state:
Loop_start:
    SUBS    r2,r2,#64        @ decrement the byte count by 64 and set the flags
    LDRD    r3,r4,[r1,#0]    @ interleaved loads and stores: one of each per cycle
    STRD    r3,r4,[r0,#0]
    LDRD    r3,r4,[r1,#8]
    STRD    r3,r4,[r0,#8]
    LDRD    r3,r4,[r1,#16]
    STRD    r3,r4,[r0,#16]
    LDRD    r3,r4,[r1,#24]
    STRD    r3,r4,[r0,#24]
    LDRD    r3,r4,[r1,#32]
    STRD    r3,r4,[r0,#32]
    LDRD    r3,r4,[r1,#40]
    STRD    r3,r4,[r0,#40]
    LDRD    r3,r4,[r1,#48]
    STRD    r3,r4,[r0,#48]
    LDRD    r3,r4,[r1,#56]
    STRD    r3,r4,[r0,#56]
    ADD     r1,r1,#64        @ advance the source pointer
    ADD     r0,r0,#64        @ advance the destination pointer
    BGT     Loop_start       @ loop while the count remains positive
A recommended copy routine for AArch64 would look similar to the sequence above, but would use LDP/STP
instructions.
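A hedged sketch of such a routine (assuming x0 = destination pointer, x1 = source pointer, x2 = byte count, matching the register roles above; each LDP/STP pair moves 16 bytes):
Loop_start:
    SUBS    x2,x2,#64        // decrement the byte count by 64 and set the flags
    LDP     x3,x4,[x1,#0]    // interleaved load pairs and store pairs
    STP     x3,x4,[x0,#0]
    LDP     x3,x4,[x1,#16]
    STP     x3,x4,[x0,#16]
    LDP     x3,x4,[x1,#32]
    STP     x3,x4,[x0,#32]
    LDP     x3,x4,[x1,#48]
    STP     x3,x4,[x0,#48]
    ADD     x1,x1,#64        // advance the source pointer
    ADD     x0,x0,#64        // advance the destination pointer
    B.GT    Loop_start       // loop while the count remains positive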
• Try not to include more than two taken branches within the same quadword-aligned quadword of
instruction memory.
• Consider aligning subroutine entry points and branch targets to quadword boundaries, within the bounds
of the code-density requirements of the program. This ensures that the fetch following the taken branch
can retrieve a full quadword's worth of instructions (four), maximizing fetch bandwidth, as in the sketch below.
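With the GNU assembler, for example, such alignment can be requested with a directive (a minimal sketch; the label and body are hypothetical):
    .balign 16               // pad to the next 16-byte (quadword) boundary
subroutine_entry:            // branch target now starts a full fetch quadword
    // ... subroutine body ...
    RET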
The table below summarizes various special instructions and the associated execution constraints or side-effects.
Note:
1. Conditional forms of these instructions for which the condition is not satisfied will not access special
registers or trigger flush side-effects.
2. Conditional forms of these instructions are always executed non-speculatively and in-order to properly
resolve the condition.
3. MSR instructions that write APSR_nzcvq generate a separate µop to write the Q bit. That µop executes
non-speculatively and in-order. But the main µop, which writes the NZCV bits, executes as shown in the
table above.
4. A subset of MCR instructions must be executed non-speculatively. A subset of MRC instructions trigger
flush side-effects for synchronization. Those subsets are not documented here.
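For example, an AES encryption round sequence might be arranged as follows (a hedged AArch64 sketch; v0 holds the state and v1/v2 hold round keys):
    AESE    v0.16b,v1.16b    // round: AddRoundKey, SubBytes, ShiftRows
    AESMC   v0.16b,v0.16b    // MixColumns, adjacent to and after its AESE
    AESE    v0.16b,v2.16b    // next round
    AESMC   v0.16b,v0.16b    // again paired adjacently, AESE then AESMC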
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions provide higher performance when adjacent,
and in the described order, in the program code. Therefore it is important to ensure that these instructions come
in pairs in AES encryption/decryption loops, as shown in the code segment above.
If any of these sequences appear sequentially and in the described order in program code, the two instructions
can be executed at lower latency and higher bandwidth than if they do not appear sequentially in the program
code, enabling 32-bit literals to be generated in a single cycle and 64-bit literals to be generated in two cycles.
Thus it is advantageous to ensure that compilers or programmers writing assembly code schedule these
instruction pairs sequentially.
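For example, the following AArch64 sequence builds an arbitrary 64-bit literal; assuming the pairs merge as described in Section 4.11, the four instructions complete in two cycles:
    MOVZ    x0,#0xDEF0             // bits [15:0]
    MOVK    x0,#0x9ABC,LSL #16     // bits [31:16]; can merge with the MOVZ
    MOVK    x0,#0x5678,LSL #32     // bits [47:32]
    MOVK    x0,#0x1234,LSL #48     // bits [63:48]; can merge with the preceding MOVK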