Ampere INS

The Ampere instruction set includes instructions for floating point, integer, conversion, movement, predicate, load/store, uniform, texture, surface, and control operations. It uses a format of (instruction) (destination) (source1), (source2) ... with valid destinations and sources including registers, uniform registers, special registers, predicate registers, and constant memory. Some examples of instructions are FADD for FP32 add, IADD for integer addition, LD for load, and BRA for relative branch.

Uploaded by

Yang Elio

Instruction Set Reference

4.5. Ampere Instruction Set

The Ampere architecture (Compute Capability 8.0 and 8.6) has the following instruction set format:
(instruction) (destination) (source1), (source2) ...
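
As an illustration of the format above, the following minimal Python sketch splits one instruction line into its opcode, destination, and sources. The function name `parse_instruction` is hypothetical, and the sketch assumes operands are separated by commas or whitespace and that the first operand is the destination, as the format line states. Real disassembler output may carry predicates, modifiers, or operand orders that this sketch does not handle.

```python
import re


def parse_instruction(line):
    """Split '(instruction) (destination) (source1), (source2) ...'.

    Returns a tuple (opcode, destination, sources). This is a sketch that
    follows the positional convention in the text, not a full disassembly
    parser.
    """
    # Drop surrounding whitespace and an optional trailing semicolon.
    text = line.strip().rstrip(";").strip()
    if not text:
        raise ValueError("empty instruction")

    # The first whitespace-delimited token is the opcode.
    parts = text.split(None, 1)
    opcode = parts[0]

    # Remaining tokens are operands, separated by commas and/or whitespace.
    operands = []
    if len(parts) > 1:
        operands = [tok for tok in re.split(r"[,\s]+", parts[1]) if tok]

    # Assumption: the first operand is the destination, the rest are sources.
    destination = operands[0] if operands else None
    sources = operands[1:]
    return opcode, destination, sources


print(parse_instruction("FADD R0, R1, R2"))  # ('FADD', 'R0', ['R1', 'R2'])
print(parse_instruction("EXIT"))             # ('EXIT', None, [])
```

Treating the first operand as the destination is only the convention stated above; for opcodes such as stores or branches the first operand may play a different role.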

Valid destination and source locations include:

‣ RX for registers
‣ URX for uniform registers
‣ SRX for special system-controlled registers
‣ PX for predicate registers
‣ c[X][Y] for constant memory
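
The location notations listed above can be told apart with simple pattern matching. The Python sketch below is one way to do that. The name `classify_operand` and the regular expressions are assumptions made for illustration: they read `X` and `Y` as placeholders (decimal indices for registers, arbitrary bracket contents for constant memory) and report anything else as unknown. Disassembler output may use further operand forms that this sketch does not cover.

```python
import re

# Patterns follow the notation in the list above. Treating X as a decimal
# index is an assumption made for this sketch.
OPERAND_KINDS = [
    ("uniform register", re.compile(r"UR\d+")),
    ("special register", re.compile(r"SR\d+")),
    ("register", re.compile(r"R\d+")),
    ("predicate register", re.compile(r"P\d+")),
    ("constant memory", re.compile(r"c\[[^\]]+\]\[[^\]]+\]")),
]


def classify_operand(operand):
    """Return the kind of location an operand string names, or 'unknown'."""
    for kind, pattern in OPERAND_KINDS:
        # fullmatch requires the whole operand to fit the pattern, so
        # 'UR2' is never mistaken for a plain register.
        if pattern.fullmatch(operand):
            return kind
    return "unknown"


print(classify_operand("R4"))             # register
print(classify_operand("UR2"))            # uniform register
print(classify_operand("c[0x0][0x160]"))  # constant memory
```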
Table 8 lists valid instructions for the Ampere GPUs.

Table 8. Ampere Instruction Set

Opcode Description
Floating Point Instructions
FADD FP32 Add
FADD32I FP32 Add
FCHK Floating-point Range Check
FFMA32I FP32 Fused Multiply and Add
FFMA FP32 Fused Multiply and Add

CUDA Binary Utilities DA-06762-001_v11.7 | 34

FMNMX FP32 Minimum/Maximum
FMUL FP32 Multiply
FMUL32I FP32 Multiply
FSEL Floating Point Select
FSET FP32 Compare And Set
FSETP FP32 Compare And Set Predicate
FSWZADD FP32 Swizzle Add
MUFU FP32 Multi Function Operation
HADD2 FP16 Add
HADD2_32I FP16 Add
HFMA2 FP16 Fused Multiply Add
HFMA2_32I FP16 Fused Multiply Add
HMMA Matrix Multiply and Accumulate
HMNMX2 FP16 Minimum / Maximum
HMUL2 FP16 Multiply
HMUL2_32I FP16 Multiply
HSET2 FP16 Compare And Set
HSETP2 FP16 Compare And Set Predicate
DADD FP64 Add
DFMA FP64 Fused Multiply Add
DMMA Matrix Multiply and Accumulate
DMUL FP64 Multiply
DSETP FP64 Compare And Set Predicate
Integer Instructions
BMMA Bit Matrix Multiply and Accumulate
BMSK Bitfield Mask
BREV Bit Reverse
FLO Find Leading One
IABS Integer Absolute Value
IADD Integer Addition
IADD3 3-input Integer Addition
IADD32I Integer Addition
IDP Integer Dot Product and Accumulate
IDP4A Integer Dot Product and Accumulate
IMAD Integer Multiply And Add
IMMA Integer Matrix Multiply and Accumulate
IMNMX Integer Minimum/Maximum
IMUL Integer Multiply

IMUL32I Integer Multiply
ISCADD Scaled Integer Addition
ISCADD32I Scaled Integer Addition
ISETP Integer Compare And Set Predicate
LEA Load Effective Address
LOP Logic Operation
LOP3 Logic Operation
LOP32I Logic Operation
POPC Population Count
SHF Funnel Shift
SHL Shift Left
SHR Shift Right
VABSDIFF Absolute Difference
VABSDIFF4 Absolute Difference
Conversion Instructions
F2F Floating Point To Floating Point Conversion
F2I Floating Point To Integer Conversion
I2F Integer To Floating Point Conversion
I2I Integer To Integer Conversion
I2IP Integer To Integer Conversion and Packing
I2FP Integer to FP32 Convert and Pack
F2IP FP32 Down-Convert to Integer and Pack
FRND Round To Integer
Movement Instructions
MOV Move
MOV32I Move
MOVM Move Matrix with Transposition or Expansion
PRMT Permute Register Pair
SEL Select Source with Predicate
SGXT Sign Extend
SHFL Warp Wide Register Shuffle
Predicate Instructions
PLOP3 Predicate Logic Operation
PSETP Combine Predicates and Set Predicate
P2R Move Predicate Register To Register
R2P Move Register To Predicate Register
Load/Store Instructions
LD Load from generic Memory

LDC Load Constant
LDG Load from Global Memory
LDGDEPBAR Global Load Dependency Barrier
LDGSTS Asynchronous Global to Shared Memcopy
LDL Load within Local Memory Window
LDS Load within Shared Memory Window
LDSM Load Matrix from Shared Memory with Element Size Expansion
ST Store to Generic Memory
STG Store to Global Memory
STL Store within Local or Shared Window
STS Store within Local or Shared Window
MATCH Match Register Values Across Thread Group
QSPC Query Space
ATOM Atomic Operation on Generic Memory
ATOMS Atomic Operation on Shared Memory
ATOMG Atomic Operation on Global Memory
RED Reduction Operation on Generic Memory
CCTL Cache Control
CCTLL Cache Control
ERRBAR Error Barrier
MEMBAR Memory Barrier
CCTLT Texture Cache Control
Uniform Datapath Instructions
R2UR Move from Vector Register to a Uniform Register
REDUX Reduction of a Vector Register into a Uniform Register
S2UR Move Special Register to Uniform Register
UBMSK Uniform Bitfield Mask
UBREV Uniform Bit Reverse
UCLEA Load Effective Address for a Constant
UF2FP Uniform FP32 Down-convert and Pack
UFLO Uniform Find Leading One
UIADD3 Uniform Integer Addition
UIADD3.64 Uniform Integer Addition
UIMAD Uniform Integer Multiplication
UISETP Integer Compare and Set Uniform Predicate
ULDC Load from Constant Memory into a Uniform Register
ULEA Uniform Load Effective Address
ULOP Logic Operation

ULOP3 Logic Operation
ULOP32I Logic Operation
UMOV Uniform Move
UP2UR Uniform Predicate to Uniform Register
UPLOP3 Uniform Predicate Logic Operation
UPOPC Uniform Population Count
UPRMT Uniform Byte Permute
UPSETP Uniform Predicate Logic Operation
UR2UP Uniform Register to Uniform Predicate
USEL Uniform Select
USGXT Uniform Sign Extend
USHF Uniform Funnel Shift
USHL Uniform Left Shift
USHR Uniform Right Shift
VOTEU Voting across SIMD Thread Group with Results in Uniform Destination
Texture Instructions
TEX Texture Fetch
TLD Texture Load
TLD4 Texture Load 4
TMML Texture MipMap Level
TXD Texture Fetch With Derivatives
TXQ Texture Query
Surface Instructions
SUATOM Atomic Op on Surface Memory
SULD Surface Load
SUQUERY Surface Query
SURED Reduction Op on Surface Memory
SUST Surface Store
Control Instructions
BMOV Move Convergence Barrier State
BPT BreakPoint/Trap
BRA Relative Branch
BREAK Break out of the Specified Convergence Barrier
BRX Relative Branch Indirect
BRXU Relative Branch with Uniform Register Based Offset
BSSY Barrier Set Convergence Synchronization Point
BSYNC Synchronize Threads on a Convergence Barrier

CALL Call Function
EXIT Exit Program
JMP Absolute Jump
JMX Absolute Jump Indirect
JMXU Absolute Jump with Uniform Register Based Offset
KILL Kill Thread
NANOSLEEP Suspend Execution
RET Return From Subroutine
RPCMOV PC Register Move
RTT Return From Trap
WARPSYNC Synchronize Threads in Warp
YIELD Yield Control
Miscellaneous Instructions
B2R Move Barrier To Register
BAR Barrier Synchronization
CS2R Move Special Register to Register
DEPBAR Dependency Barrier
GETLMEMBASE Get Local Memory Base Address
LEPC Load Effective PC
NOP No Operation
PMTRIG Performance Monitor Trigger
R2B Move Register to Barrier
S2R Move Special Register to Register
SETCTAID Set CTA ID
SETLMEMBASE Set Local Memory Base Address
VOTE Vote Across SIMD Thread Group
