Fusion Ug
User’s Guide
This publication is provided “AS IS.” Cadence Design Systems, Inc. (hereafter “Cadence”) does not make any warranty of any
kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a
particular purpose. Information in this document is provided solely to enable system and software developers to use our
processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual
property rights or licenses granted hereunder to design or fabricate Cadence integrated circuits or integrated circuits based on
the information in this document. Cadence does not warrant that the contents of this publication, whether individually or as one
or more groups, meets your requirements or that the publication is error-free. This publication could include technical
inaccuracies or typographical errors. Changes may be made to the information herein, and these changes may be incorporated
in new editions of this publication.
Cadence, the Cadence logo, Allegro, Assura, Broadband Spice, CDNLIVE!, Celtic, Chipestimate.com, Conformal, Connections,
Denali, Diva, Dracula, Encounter, Flashpoint, FLIX, First Encounter, Incisive, Incyte, InstallScape, NanoRoute, NC-Verilog,
OrCAD, OSKit, Palladium, PowerForward, PowerSI, PSpice, Purespec, Puresuite, Quickcycles, SignalStorm, Sigrity, SKILL,
SoC Encounter, SourceLink, Spectre, Specman, Specman-Elite, SpeedBridge, Stars & Strikes, Tensilica, TripleCheck,
TurboXim, Vectra, Virtuoso, VoltageStorm, Xplorer, Xtensa, and Xtreme are either trademarks or registered trademarks of
Cadence Design Systems, Inc. in the United States and/or other jurisdictions.
OSCI, SystemC, Open SystemC, Open SystemC Initiative, and SystemC Initiative are registered trademarks of Open SystemC
Initiative, Inc. in the United States and other countries and are used with permission. All other trademarks are the property of
their respective holders.
PD-17-8537-10-03
RG-2018.9
Issue Date: 4/2018
Contents
1. Introduction
1.1 Purpose of this Guide
1.1.1 Conventions
1.2 Installation Overview
1.3 Fusion DSP Architecture Overview
1.4 Prefetching
1.4.1 Software Prefetching
1.5 Fusion DSP Instruction Set Overview
2. Fusion DSP Features
2.1 Instruction Naming Conventions
2.2 Fixed-point Values and Fixed-point Arithmetic
2.2.1 Representation of Fixed-point Values
2.2.2 Arithmetic with Fixed-point Values
2.2.3 Other Fixed-point Representations
2.3 VLIW Slots and Formats
2.4 Load and Store Operations
2.4.1 Aligning Loads and Stores
2.4.2 Circular Buffer
2.4.3 Load and Store Naming Scheme
2.4.4 Load Operations
2.4.5 Core Load Operations
2.4.6 Store Operations
2.5 Core Updating Stores
2.6 Multiply and Accumulate Operations
2.6.1 24x24-bit Multiplication Operations
2.6.2 32x32-bit Multiplication Operations
2.6.3 32x16-bit Multiplication Operations
2.6.4 16x16-bit Multiplication Operations
2.6.5 16x16-bit Legacy Multiplication Operations
2.6.6 32x16-bit Legacy Multiplication Operations
2.6.7 HiFi 2 EP 32x24-bit Multiplication Operations
2.7 Add, Subtract, and Compare Operations
2.8 Shift Operations
2.9 HiFi 2 Shift Operations
2.10 Normalize Shift Amount Operation
2.11 Divide Step Operation
2.12 Truncate, Round, Saturate, Convert, and Move Operations
2.13 Selection and Permutation Operations
2.14 Bit Reversal
2.15 Zero Operation
Added information in section 2.4.2 that CBEGIN need not be less than CEND
Clarified for converting from HiFi 2 legacy types to and from HiFi 3 vector types
Support for 64-bit format for the Viterbi decoder and Soft-bit demapping options
The following changes were made to this document for the Cadence Tensilica RG-2016.3
release of Fusion F1 DSP:
The title and introduction of this document reflect the name change to "Fusion F1".
Support for 64-bit format for the Viterbi decoder and Soft-bit demapping options in
Section 2.3. Also described in Section 2.3, one of the Fusion operations added by
the Viterbi option overlaps in the Inst opcode space with the reserved CUST0
opcode.
Corrected the boundary conditions for the Circular Load/Store instructions in Table
2-18.
Five new instructions to support complex conjugate and complex conjugate multiply
in Section 2.18.
Operator overloading for 16b data types in Table 3-2 Fusion DSP C/C++ Operators.
Clarified information about HiFi 2 and HiFi Mini code portability in Section 3.10.
Updated the text and screen shots in Section 6.1 to include the newly included Viterbi
Decoder and soft-bit demapping options.
Amended restriction in Section 7.1 to “As Fusion DSP is always coprocessor number
1, the number of coprocessors must be at least 2.”
Added a list of the XPG options selected for each template in Section 6.2.
Added appendixes with a summary list of instructions for each Fusion DSP option
(Appendix A) and Appendix B with instruction width requirements.
1. Introduction
The Cadence® Tensilica® Fusion F1 DSP is a highly optimized, highly configurable processor
geared for efficient execution of dataplane algorithms needed for the Internet of Things (IoT),
and other applications, such as codec chips, sensor hubs, and narrowband wireless
communications. It is derived from a smaller version of the Cadence HiFi 3 DSP. It supports
dual-issuing a single load or store together with two-way SIMD ALU or MAC operations,
supporting dual 16x16-, 32x16-, and 24x24-bit MACs and single 32-bit MACs. The base
configuration is source code software compatible with the Cadence HiFi 2, HiFi Mini, and
HiFi 2 EP DSPs except for bitstream and variable-length decode and encode. It is also
compatible with the Cadence HiFi 3 DSP except for bitstream, variable-length decode and
encode and HiFi 3 quad MAC instructions.
The Fusion F1 DSP contains a wide range of configuration options to meet your needs. Each
one of the eight options can be selected independently.
1. The AVS option adds full HiFi software source compatibility by adding bitstream,
variable-length decode and encode operations, and HiFi 3 quad MAC emulation
capability.¹
2. The 16-bit Quad MAC option adds computation extension to support 16-bit vector
Quad MAC (four MAC) for complex and dot product operations. Also included with
this Quad MAC option are specialized instructions for FFT computation
acceleration.
3. The FP option adds support for single issuing single-precision, IEEE 754, floating
point operations, including fused multiply accumulates, together with two-way
SIMD loads or stores.
4. The Reduced MAC Latency option halves the latency of all long latency operations,
sacrificing the maximum MHz achievable to enable lower area and power at low
MHz.
5. The Advanced Bit Manipulation option adds support for bit-level operations for
baseband-PHY and MAC processing. This option also supports CRC and
scrambling, FEC convolutional encoding, and adds instructions for bit-level
shuffling operations.
6. The BLE/Wi-Fi AES 128-CCM option supports instructions to accelerate AES 128
CCM-mode encryption/decryption.
7. The Viterbi Decoder option adds instructions for efficient Viterbi decoding to
support rates 1/2 and 1/3 with arbitrary polynomials of constraint lengths 5 and 7.
8. The Soft-bit Demapping option adds instructions for 4/16/64/256-QAM soft bit
demapping with support for different Gray Encoding formats needed by 3GPP and
WiFi.
¹ Note that even without the AVS option, Fusion DSP is able to emulate all HiFi bitstream instructions in software,
albeit very slowly. This makes even the base configuration fully compatible with HiFi 2 and HiFi Mini.
The Fusion F1 DSP is a coprocessor configuration option for the Xtensa® LX6 processor. All
Fusion operations can be used as intrinsics in standard C/C++ applications. In addition, when
compiling with automatic vectorization or with the -mcoproc option, the compiler will
automatically infer these operations when compiling standard C code.
Note that the remainder of this document refers to the Fusion F1 DSP as “Fusion DSP” or as
“Fusion”.
To use this guide most effectively, a basic level of familiarity with the Xtensa software
development flow is highly recommended. For more details, see the Xtensa Software
Development Toolkit User’s Guide.
1.1.1 Conventions
Throughout this document, the symbol <xtensa_root> refers to the installation directory
of a user’s Xtensa configuration. For example, <xtensa_root> might refer to the directory
\usr\xtensa\XtDevTools\install\builds\RF-2015.2-win32\<s1> if <s1> is
the name of your Xtensa configuration. In the examples in this guide, replace
<xtensa_root> with the installation directory of your Xtensa distribution.
For Fusion DSP usage, include the following file.
<xtensa_root>/xtensa-elf/arch/include/xtensa/tie/xt_fusion.h
For floating point usage with the optional floating point unit, include the following file.
<xtensa_root>/xtensa-elf/arch/include/xtensa/tie/xt_FP.h
In general, baseline 16-bit support is geared towards efficient support of the ITU-T/ETSI
intrinsic model, while 32x16-bit and 24-bit support is provided for both integer and fixed-point
computation. With the 16-bit Quad MAC option, support is provided for complex 16-bit
multiplications as well as for real 16-bit dot product instructions, allowing efficient
implementations of complex and real FFTs and FIRs.
Fusion DSP is a VLIW architecture, supporting the execution of two operations in parallel.
DSP loads and stores, bit-stream and Huffman operations and core operations are available
in slot 0 of a VLIW instruction. DSP MAC and ALU operations are typically available in slot
1. The optional floating point operations are generally available in slot 1.
Fusion DSP supports either caches or local memories with the full flexibility provided by
Xtensa. Configurations can have either or both and can make different choices for instruction
and data. Audio packages supplied by Cadence do not use DMA. Hence, most customers
either use caches or make local memories sufficiently large to cover desired applications.
Figure 1-1 illustrates the main custom state, register file and execution units added to an
Xtensa LX processor by the Fusion DSP.
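[Figure 1-1: block diagram of the Fusion DSP additions to the Xtensa LX processor. Recoverable labels: AE_DR register file (12 x 64 bits), AR base register file (32 bits), register MUX, load/store unit, variable-length encode/decode and bitstream unit, ALU, MAC, miscellaneous function units, slot 0, and slot 1.]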
The main hardware resources in the DSP subsystem are a multiply/accumulate unit, an
option for a single precision IEEE floating point unit, a 12-entry register file AE_DR to hold
64-bit, pairs of 32-bit or quads of 16-bit data items, an arithmetic/logic unit, and a shift unit to
operate on the AE_DR values. The multiplier unit supports one 32x32-bit MAC or two 24x24,
16x32 or 16x16-bit MACs per cycle (four 16x16-bit with the 16-bit Quad MAC option).
The load/store unit is capable of loading or storing up to two 24-bit or 32-bit SIMD elements,
four 16-bit SIMD elements, or single elements up to 64 bits in size. 24-bit data can either be
contained inside 32-bit envelopes or can be packed together into 24 bits of memory. Eight
packed elements can be loaded or stored in three instructions. The load/store unit supports
unaligned accesses whereby a stream is first primed and afterwards 64 unaligned bits can
be loaded or stored in every cycle.
single slot 40-bit format (fusion_slot40) used mainly for wide branches and AES
instructions
optional two slot 40-bit format (fusion_slot_fir_0 and fusion_slot_fir_1) for emulation
of HiFi 3 FIR instructions
optional two slot 40-bit format (fusion_slot40_0 and fusion_slot40_1) used for 16-bit
FFT support with the 16-bit Quad MAC option.
The operations for the two-slot VLIW formats can be issued in one of the two slots. In each
execution cycle, zero or one operation from each slot can be executed independently
according to the static bundling expressed in the machine code. So, for example, load
operations can execute concurrently with multiply/accumulate operations because loads are
in fusion_slot0 and multiply/accumulate operations are in fusion_slot1. For better code size,
many operations (but not integer or fixed point multiplies) are also available in single issue
16- and 24-bit formats. Most floating point operations are available in the 24-bit formats.
1.4 Prefetching
Fusion DSP supports a prefetch option geared for systems with long memory latency. When
the Fusion DSP processor detects a positive stride-1 stream of cache misses (either data or
instruction), it can speculatively prefetch ahead up to four cache lines and place them in a
buffer close to the processor, or on the data side, optionally into the L1 data cache (there is
no support for prefetching directly into the L1 instruction cache). In addition, you can manually
issue prefetch instructions.
By default, hardware prefetching is enabled in the reset code provided by Cadence with a
low setting. On configurations that support it, data prefetches are placed into the L1 data
cache by default. You can use the following HAL calls to explicitly disable prefetching or to
increase its aggressiveness in different sections of your code. With more aggressive
prefetching, the hardware will prefetch earlier when detecting a stream and will prefetch more
lines ahead. Assuming sufficient bus bandwidth, performance will improve with more
aggressive prefetch but the system will require more bandwidth. Prefetching instructions and
data can be controlled separately.
#include <xtensa/hal.h>
int xthal_set_cache_prefetch(unsigned long mode);
The value returned is not meant for direct use or interpretation; however, it is suitable for
passing to a subsequent call to xthal_set_cache_prefetch().
The mode argument is either one of the following constants, which apply to both instruction and data caches:
XTHAL_PREFETCH_ENABLE (enable cache prefetch)
XTHAL_PREFETCH_DISABLE (disable cache prefetch)
or a bit-wise OR of two cache prefetch mode constants, one for the instruction cache:
XTHAL_ICACHE_PREFETCH_OFF (disable instruction cache prefetch)
XTHAL_ICACHE_PREFETCH_LOW (enable, less aggressive prefetch)
XTHAL_ICACHE_PREFETCH_MEDIUM (enable, midway aggressive prefetch)
XTHAL_ICACHE_PREFETCH_HIGH (enable, more aggressive prefetch)
XTHAL_ICACHE_PREFETCH(n) (explicitly set the InstCtl field of the PREFCTL
register to a value n in the range 0..15. See the Prefetch Architectural Additions
section of the Prefetch Unit option chapter in the Xtensa Microprocessor Data Book
for details.)
For easier simulation, prefetching can also be disabled in the simulator using the
xt-run --prefetch=0 flag. Disabling prefetching from the simulation command line will
override any HAL calls.
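The save-and-restore use of the return value described above can be sketched as follows. This is a minimal sketch, assuming the value returned by xthal_set_cache_prefetch() captures the previous setting so that passing it back restores that setting. The names run_without_prefetch() and bump() are hypothetical helpers, and the #else branch contains host-only stand-ins with made-up values so that the sketch also compiles without an Xtensa toolchain.

```c
#include <assert.h>

#if defined(__XTENSA__)
#include <xtensa/hal.h>
#else
/* Host-only stand-ins (hypothetical values, NOT the real HAL definitions). */
#define XTHAL_PREFETCH_ENABLE  1UL
#define XTHAL_PREFETCH_DISABLE 0UL
static unsigned long fake_prefetch_mode = XTHAL_PREFETCH_ENABLE;
static int xthal_set_cache_prefetch(unsigned long mode)
{
    int previous = (int)fake_prefetch_mode;
    fake_prefetch_mode = mode;
    return previous;
}
#endif

/* Hypothetical helper: run fn(arg) with prefetching disabled, then pass the
   saved return value back to restore the earlier setting. */
static void run_without_prefetch(void (*fn)(void *), void *arg)
{
    assert(fn != 0);
    int saved = xthal_set_cache_prefetch(XTHAL_PREFETCH_DISABLE);
    fn(arg);
    (void)xthal_set_cache_prefetch((unsigned long)saved);
}

/* Hypothetical workload used only to exercise the helper. */
static void bump(void *counter)
{
    ++*(int *)counter;
}
```

Wrapping the change in a helper keeps the disable and the restore paired, so a code section that is sensitive to bus bandwidth cannot leave prefetching off by accident.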
__builtin_prefetch(addr);
Software prefetches can be used for either data or instructions. They can be used in addition
to or instead of hardware prefetching. If hardware prefetching is disabled, the software
prefetches are still enabled.
For configurations that do not prefetch into the cache but instead use a small, 8- to 16-entry
buffer outside of the cache, you must be careful not to prefetch too far ahead. Otherwise, the
data will be overwritten before it is needed by the processor.
Consider a simple example that performs an energy calculation. You might choose to place
a few explicit prefetch instructions before the loop to seed the hardware prefetcher.
Otherwise, depending on mode, the hardware prefetch might delay prefetching until after the
second miss.
__builtin_prefetch(&ap[0]);
__builtin_prefetch(&ap[XCHAL_DCACHE_LINESIZE]);
__builtin_prefetch(&ap[2*XCHAL_DCACHE_LINESIZE]);
for (i=0; i<n; i++) {
    sum += ap[i]*ap[i];
}
You might also want to put prefetch instructions directly inside the loop. Doing so allows one
to prefetch more aggressively than the hardware prefetcher and allows one to prefetch
patterns other than the stride-1 references that are detected by the hardware prefetcher. On
the other hand, placing prefetch instructions inside the loop incurs instruction overhead
whether or not the loop actually suffers from cache misses.
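An in-loop variant of the energy calculation might look like the following sketch. PREFETCH_AHEAD and ELEMS_PER_LINE are hypothetical tuning values (the latter assumes 64-byte cache lines and 4-byte elements), not Fusion DSP recommendations, and energy_with_prefetch() is an illustrative name. The loop is strip-mined so that only one prefetch is issued per cache line rather than one per element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ELEMS_PER_LINE 16  /* hypothetical: 64-byte line / 4-byte element */
#define PREFETCH_AHEAD 64  /* hypothetical distance, in elements          */

int64_t energy_with_prefetch(const int32_t *ap, size_t n)
{
    assert(ap != NULL || n == 0);
    int64_t sum = 0;
    for (size_t base = 0; base < n; base += ELEMS_PER_LINE) {
        /* One prefetch per cache line, a fixed distance ahead of the reads. */
        if (base + PREFETCH_AHEAD < n)
            __builtin_prefetch(&ap[base + PREFETCH_AHEAD]);
        /* Process the elements of the current line (or the final partial line). */
        size_t end = (n - base < ELEMS_PER_LINE) ? n : base + ELEMS_PER_LINE;
        for (size_t i = base; i < end; i++)
            sum += (int64_t)ap[i] * ap[i];
    }
    return sum;
}
```

On configurations that prefetch into a small buffer rather than the cache, PREFETCH_AHEAD would need to stay small enough that lines are not overwritten before use.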
In general, given the effectiveness of the hardware prefetcher, software prefetches should
be used judiciously. Carefully compare performance between using and not using software
prefetching on a loop-by-loop basis.
Multiply operations include 32x32-bit, 32x24-bit, 24x24-bit, 32x16-bit and 16x16-bit. Multiply
operations come in fixed-point and integer variants. They come in high precision and low
precision variants. High-precision multiplies use a 64-bit accumulator. Since an accumulator
can hold only one result, Fusion DSP supports dual multiplies where the results of two
multiplies are added or subtracted together before being added into the accumulator. For
example, a single operation might compute the following operation where H and L refer to
the high bits or low bits respectively of an operand.
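As a host-side illustration of a dual multiply, the following sketch models one plausible form, acc + a.H*b.H ± a.L*b.L, with plain C integers. The type pair32 and the function names are hypothetical and are not Fusion intrinsics; the integer (non-fractional) variant is assumed, and accumulator overflow is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Host model of a 64-bit register viewed as two 32-bit halves, H and L. */
typedef struct {
    int32_t h;
    int32_t l;
} pair32;

/* Two products added together, then added into a 64-bit accumulator. */
int64_t dual_mac_add(int64_t acc, pair32 a, pair32 b)
{
    return acc + (int64_t)a.h * b.h + (int64_t)a.l * b.l;
}

/* Two products subtracted from each other, then added into the accumulator. */
int64_t dual_mac_sub(int64_t acc, pair32 a, pair32 b)
{
    return acc + (int64_t)a.h * b.h - (int64_t)a.l * b.l;
}
```

Folding both products into one accumulator is what lets a dot product consume two element pairs per operation while still keeping a single high-precision result.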
Low-precision multiplies accumulate in 32 bits or even 16 bits. Since each register can hold
two 32-bit or four 16-bit accumulators, these instructions can perform two or four independent
SIMD multiplies.
A set of bitstream and variable-length instructions allows efficient access to serial
bitstreams including Huffman encode and decode.
The optional floating point unit supports IEEE-754 single precision floating point operations
(scalar for compute, two-way SIMD loads and stores).
The Fusion DSP contains a 12-entry, 64-bit register file, AE_DR. Each register can hold one
or two, 24- or 32-bit operands, one or four 16-bit operands or one 56- or 64-bit operand as
shown in Figure 2-1. 24-bit and 56-bit operands are sign extended to fill their 32 or 64-bit
container. The separate halves or quarters of the register are always separate data items.
For example, if you shift a SIMD 32-bit element to the left, each half is shifted separately.
The high bits of the L input half do not impact the H half of the output.
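[Figure 2-1: AE_DR register layout. Bits 63..0 hold one 64-bit operand; two 32-bit halves, H and L (bits 31..0 each); or four 16-bit quarters numbered 3, 2, 1, and 0 (bits 15..0 each).]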
When a register is stored to memory, the high half of the register is always stored in the lower
memory address. For example, a load that loads a 32 by 2-way SIMD value from address
"a" will place the 32-bits from address "a" into the high 32-bits of the register and the 32-bits
from address "a+4" into the low 32-bits of the register. A load that loads a 16 by 4-way SIMD
value from address "a" will place the 16-bits from address "a" into the high 16-bits of the
register. Operations that access individual 24- or 32-bit elements of AE_DR registers refer to
the elements with selectors L and H in the mnemonics. Operations that access individual
16-bit elements refer to the elements with selectors 3, 2, 1 and 0 in the mnemonics.
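The element ordering and the independence of SIMD lanes described above can be modeled on a host machine with a plain 64-bit integer. This is a sketch only: the function names are hypothetical and are not Fusion intrinsics, and the placement of the 16-bit elements at "a+2", "a+4", and "a+6" into quarters 2, 1, and 0 is an assumption extrapolated from the stated placement of the first element.

```c
#include <assert.h>
#include <stdint.h>

/* 32 by 2-way load: the element at the lower address lands in half H. */
uint64_t load32x2(const int32_t *a)
{
    return ((uint64_t)(uint32_t)a[0] << 32) | (uint32_t)a[1];
}

int32_t half_h(uint64_t r) { return (int32_t)(uint32_t)(r >> 32); }
int32_t half_l(uint64_t r) { return (int32_t)(uint32_t)r; }

/* 16 by 4-way load: the element at the lowest address lands in quarter 3. */
uint64_t load16x4(const int16_t *a)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r = (r << 16) | (uint16_t)a[i];
    return r;
}

/* Extract quarter sel (3 is the highest 16 bits, 0 the lowest). */
int16_t quarter(uint64_t r, int sel)
{
    assert(sel >= 0 && sel <= 3);
    return (int16_t)(uint16_t)(r >> (16 * sel));
}

/* SIMD left shift: each half shifts on its own; bits leaving L never enter H. */
uint64_t shl32x2(uint64_t r, unsigned n)
{
    assert(n < 32);
    uint32_t h = (uint32_t)(r >> 32) << n;
    uint32_t l = (uint32_t)r << n;
    return ((uint64_t)h << 32) | l;
}
```

Shifting the pair {1, 0x80000001} left by one in this model yields {2, 2}: the top bit of the L half is discarded rather than carried into the H half.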
For compatibility with HiFi 2, HiFi EP, and HiFi Mini, a 32-bit data item might occupy the
middle of an entire AE_DR register and a 16-bit data item might occupy the middle of a 32-
bit half register. When using such legacy instructions, a register holds half as many elements
and hence the instruction exploits less parallelism. Such instructions should only be used in
legacy code.
Fusion DSP supports a 4-entry, 64-bit alignment register, AE_VALIGN. The use of this
register allows the hardware to load or store a SIMD stream that is not 64-bit aligned at a
rate of 64-bits per cycle. It also allows 24-bit data to be packed densely into 24-bit containers.
These mechanisms are described in more detail in Section 2.4.1.
The TIE state registers in the Fusion DSP are listed in Table 2-1.
The state registers in Table 2-2 pertain to the bitstream and variable-length encode/decode
support subsystem of the Fusion DSP. This subsystem is described in detail in Section 2.20.
All these registers are available with the AVS option and AE_BITHEAD, AE_BITPTR, and
AE_BITSUSED are also available with the Advanced Bit Manipulation Package option.
Programmers generally will not need to consider the details of how each of these state
registers is used by the instructions, but the state registers are documented here for
completeness. These descriptions make more sense to a reader who is already somewhat
familiar with the variable-length encode/decode instructions.
Table 2-2 Bitstream and Variable-length Encode/Decode Support Subsystem State Registers
The state registers in Table 2-3 pertain to the circular buffer support and are shared between
the DSP subsystem and the bitstream and variable-length encode/decode support
subsystem of the Fusion DSP.
The state registers in Table 2-4 pertain to the optional floating point support.
The TIE state registers are grouped as follows into user registers for the purposes of efficient
save and restore operations:
With the floating point option, the following user register is used to control and detect
rounding and exception behavior. See Chapter 4 of the Xtensa Instruction Set Architecture
(ISA) Reference Manual for more details about rounding and exception behavior.
user_register FCR_FSR
{RoundMode,InvalidFlag,DivZeroFlag,OverflowFlag,UnderflowFlag,InexactFlag}
In addition to specialized instruction sequences used to save and restore entire user registers
efficiently from memory, instructions are provided to read and write individual state registers.
Both types are listed in Table 2-6.
In the operation descriptions in Sections 2.4 through 2.20, each mnemonic is listed with
assembly syntax showing placeholders (templates) for its operands. The register files of the
operands are implied by the placeholders, as in Table 2-7.
Each operation description is annotated with the name(s) of the slot(s) where that operation
can be issued. Each operation description is also annotated with the C syntax showing the
intrinsic name and prototype for the operation. A discussion of using C data types and
intrinsics to program the Fusion DSP is included in Chapter 3.
Following the AE_ prefix, each mnemonic has a string of one or more characters signifying
the type of operation such as load, shift, add, etc. For example, AE_L is the prefix denoting
Fusion DSP loads.
The remaining portion of each operation mnemonic typically includes reminders of various
aspects of the operation’s details. Multiplies and loads and stores have more regular naming
conventions that are described in their respective sections.
Mnemonic   Meaning
ASYM       Denotes asymmetric rounding (e.g., AE_ROUND32X2F64SASYM)
F          Denotes fractional arithmetic (e.g., AE_MULZAAFD24.HH.LL) or the
           value False in a conditional move (e.g., AE_MOVF64)
H and L    Combinations of H and L are used to refer to halves of registers
           (e.g., AE_MULZAAFD24.HH.LL)
0,1,2,3    Combinations of 0, 1, 2 and 3 are used to refer to quarters of
           registers (e.g., AE_MULF32X16.L0)
I          Denotes use of an immediate operand (e.g., AE_SRAIP32)
S          Denotes saturating arithmetic (e.g., AE_MULF32S.LL) or the use of
           the AE_SAR state register as a shift amount (e.g., AE_SRASP32),
           depending on context
SYM        Denotes symmetric rounding (e.g., AE_ROUND32X2F64SSYM)
T          Denotes the value True in a conditional move (e.g., AE_MOVT64)
U          Denotes unsigned arithmetic (e.g., AE_MULS32U.LL)
X          Denotes use of an index register in an address computation (e.g.,
           AE_L64.XP)
X2         Denotes a two-way SIMD operation in contexts (e.g., AE_L32X2.I)
           where scalar operations are also available
X4         Denotes a four-way SIMD operation (e.g., AE_L16X4.XC)
Thus, for example, the 24-bit 1.23 number 0.5 is represented as 0x400000,
and the 64-bit 17.47 number -1.5 (-2 + 0.5) is represented as 0xffff 4000 0000 0000, or in binary:
1 1111 1111 1111 1110 100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
Fusion DSP fractional instructions use fractional operations on 1.15, 1.23, 9.23, 1.31, 17.47
and 1.63, described in more detail as follows.
1.15 16-bit fixed-point data type with 1 sign bit and 15 bits to the right of the
decimal. The largest positive value 0x7fff is interpreted as 1.0 – 2^-15. The smallest
negative value 0x8000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
9.23 32-bit fixed-point data type with a 9-bit integer and 23 bits to the right of the
decimal. The largest positive value 0x7fffffff is interpreted as 256.0 – 2^-23. The
smallest negative value 0x80000000 is interpreted as -256.0. The value 0 is
interpreted as 0.0.
1.23 24-bit fixed-point data type with 1 sign bit and 23 bits to the right of the
decimal. The largest positive value 0x7fffff is interpreted as 1.0 – 2^-23. The smallest
negative value 0x800000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
Since register halves hold 32 bits, not 24 bits, typical 24-bit fractional variables are
9.23. However, 24-bit fixed-point multiply instructions ignore the upper 8 bits,
thereby treating them as 1.23.
1.31 32-bit fixed-point data type with 1 sign bit and 31 bits to the right of the
decimal. The largest positive value 0x7fffffff is interpreted as 1.0 – 2^-31. The smallest
negative value 0x80000000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
17.47 64-bit fixed-point data type with a 17-bit integer and 47 bits to the right of the
decimal. The largest positive value 0x7fff ffff ffff ffff is interpreted as 65536.0 – 2^-47.
The smallest negative value 0x8000 0000 0000 0000 is interpreted as -65536.0. The
value 0 is interpreted as 0.0.
1.63 64-bit fixed-point data type with 1 sign bit and 63 bits to the right of the
decimal. The largest positive value 0x7fff ffff ffff ffff is interpreted as 1.0 – 2^-63. The
smallest negative value 0x8000 0000 0000 0000 is interpreted as -1.0. The value 0
is interpreted as 0.0.
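The interpretations above all follow one rule: a fixed-point value with f fractional bits is the stored signed integer divided by 2^f. The following host-side sketch shows the conversion in both directions, plus the sign extension that turns a 9.23 register half into the 1.23 value seen by a 24-bit multiply. The function names to_fixed(), from_fixed(), and sext24() are hypothetical helpers, not Fusion intrinsics, and round-to-nearest is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Real value -> signed integer with frac_bits fractional bits (round to nearest). */
int64_t to_fixed(double x, unsigned frac_bits)
{
    assert(frac_bits < 63);
    double scaled = x * (double)((int64_t)1 << frac_bits);
    return (int64_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Signed integer with frac_bits fractional bits -> real value. */
double from_fixed(int64_t v, unsigned frac_bits)
{
    assert(frac_bits < 63);
    return (double)v / (double)((int64_t)1 << frac_bits);
}

/* Keep the low 24 bits of a 32-bit container and sign-extend them,
   which is how a 9.23 value is viewed as 1.23. */
int32_t sext24(int32_t v)
{
    return (int32_t)(((uint32_t)v & 0xFFFFFFu) ^ 0x800000u) - 0x800000;
}
```

For example, to_fixed(0.5, 23) gives 0x400000, and to_fixed(-1.5, 47) gives the 64-bit pattern 0xffff 4000 0000 0000.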
Fusion DSP contains both saturating and non-saturating instructions. Overflowing the
supplied guard bits with a non-saturating instruction is a program error that will cause the
result to wrap around. For saturating operations, the processor will also set the overflow state
which can later be checked programmatically. In the instruction descriptions that follow,
whether an operation saturates is explicitly stated.
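The difference between wrapping and saturating behavior can be sketched on a host machine with 32-bit addition. The names add32_wrap(), add32_sat(), and overflow_flag are hypothetical; overflow_flag merely stands in for the overflow state mentioned above and does not model how the processor exposes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the processor's sticky overflow state. */
static bool overflow_flag = false;

/* Non-saturating add: the result wraps around on overflow. */
int32_t add32_wrap(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Saturating add: clamp to the representable range and record the overflow. */
int32_t add32_sat(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + b;
    if (wide > INT32_MAX) {
        overflow_flag = true;
        return INT32_MAX;
    }
    if (wide < INT32_MIN) {
        overflow_flag = true;
        return INT32_MIN;
    }
    return (int32_t)wide;
}
```

In 1.31 terms, adding two values just under 1.0 wraps to a large negative number with the first function but clamps to just under 1.0 with the second, which is usually the less harmful error in signal processing code.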
Format fusion_format_fir is a specialized format tied to the AVS option, used for emulating
HiFi 3 FIR operations that require too many operands to issue in parallel with stores. Format
fusion_format_40_3 is a specialized format tied to the 16-bit Quad MAC option. It is used in
FFTs to allow specialized add and subtract operations to issue in parallel with stores.
For the fusion_format48 format, the first slot contains all of the Fusion DSP load/store
instructions and some miscellaneous operations. The second slot contains all of the regular
multiply and DSP ALU operations. A subset of the core Xtensa operations is also available
in both slots allowing some parallelism with core Xtensa operations.
The optional Viterbi decoder and soft-bit demapping options add an additional two-slot
fusion_format64 format. The first slot contains a subset of DSP load/store instructions (those
that use immediate offset) along with instructions for the Viterbi decoding option to compute
the branchmetrics and store the states. The second slot contains instructions for Viterbi trellis
radix-4 butterfly computations, Viterbi traceback, and for 4/16/64/256-QAM soft-bit
demapping.
A subset of the operations as well as all the bit-stream operations are available in a single
issue, 24-bit format called Inst. The compiler will automatically use the 24-bit format when it
is not possible (or beneficial) to bundle a relevant operation together with an operation that
can go in another slot.
For the optional floating point unit, most floating point operations are available in the second
slot, fusion_slot1, of the fusion_format48, allowing the machine to issue, for example, one
two-way SIMD floating point load in parallel with one scalar multiply-accumulation operation.
Understanding the slotting is important when optimizing code for Fusion DSP. Often a loop
is limited by operations that can only go in one slot or another. For example, it is never
possible to issue more than one (possibly SIMD) load or store per cycle. If a loop is limited
by the operations in one slot, there is no point in trying to optimize the operations in another
slot.
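The slot-limit argument reduces to simple arithmetic, sketched below. The sketch assumes every operation in the loop body can issue in exactly one of the two slots and ignores latencies and dependences; min_cycles_per_iteration() is a hypothetical helper name.

```c
#include <assert.h>

/* Lower bound on cycles per loop iteration for a two-slot bundle, given the
   number of operations that can only issue in slot 0 and only in slot 1. */
unsigned min_cycles_per_iteration(unsigned slot0_ops, unsigned slot1_ops)
{
    return slot0_ops > slot1_ops ? slot0_ops : slot1_ops;
}
```

With three loads and stores in slot 0 and two multiplies in slot 1, the bound is three cycles; removing one multiply leaves it at three, while removing one load or store lowers it to two.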
All Fusion DSP core instructions available in the Inst slot share (but do not overlap) opcode
space with the MAC16 option. Note however, that we discourage selecting the MAC16 option
with the Fusion DSP core. All Fusion DSP floating point operations available in the Inst slot
share (overlap) opcode space with core floating point instructions; thus it is not possible to
turn on the core Single/Double Precision FP when the Fusion FP option is selected. In
addition, the Viterbi option on the Fusion DSP adds an instruction to the Inst slot,
AE_MOVSANORM, whose opcode overlaps the CUST0 opcode normally reserved for
customer-added operations. However, the CUST1 opcode is still available for customer-
added operations, and other customer-added operations are possible to encode using TIE
Compiler features. For more information on CUST0 and CUST1, refer to the Xtensa
Instruction Set Architecture (ISA) Reference Manual.
A summary table describing the instruction width required for each option is provided in
Appendix B. The available slotting for the different operations is listed next to the operation
descriptions in the remainder of this chapter.
Special support is provided for retaining full throughput when vectors of data are not aligned
to 64 bits. Fusion DSP also supports a single circular buffer that can be used with either
aligned or unaligned data.
Such loads and stores are called aligning loads and stores. Support is available for 16, 24
and 32-bit data. The aligning vector load and store instructions use the Fusion DSP alignment
register file to provide a throughput of one aligning load or store operation per instruction.
A special priming instruction, AE_LA64.PP, is used to begin the process of loading an array
of unaligned data. This instruction loads the alignment register with data from the start of the
stream. The subsequent aligning load instruction loads from the next location in memory,
merging it with the data already in the alignment register. The exact details of how the
aligning instructions work are not relevant to the programmer. Simply invoke the
AE_LA64_PP priming intrinsic with the first address (aligned or not) to be loaded and
continue loading with the appropriate aligning loads to achieve a subsequent throughput of
one aligning load per instruction.
The design of the priming load and aligning load instructions is such that they can be used
in situations where the alignment of the address is unknown. The load sequence works
whether the starting address is aligned or not. Consider a simple example that adds up the
32-bit elements in an array.
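As a rough illustration of what the priming and aligning loads do, the following portable C sketch models the mechanism in software. It is not Fusion DSP intrinsic code: the names valign_model, model_prime, model_la32x2_ip and model_sum are hypothetical, memory is modeled as a byte array indexed by byte offset, and the alignment register is assumed to hold a single 64-bit envelope.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of an alignment register: one 64-bit envelope. */
typedef struct { uint8_t env[8]; } valign_model;

/* Models the priming load: fetch the aligned 64-bit envelope containing addr. */
static valign_model model_prime(const uint8_t *mem, size_t addr)
{
    valign_model u;
    memcpy(u.env, mem + (addr & ~(size_t)7), 8);
    return u;
}

/* Models one aligning load of two 32-bit elements with post-increment by 8. */
static void model_la32x2_ip(int32_t d[2], valign_model *u,
                            const uint8_t *mem, size_t *addr)
{
    size_t off = *addr & 7;                  /* misalignment inside the envelope */
    size_t next = (*addr & ~(size_t)7) + 8;  /* offset of the next envelope */
    uint8_t merged[16];

    memcpy(merged, u->env, 8);               /* data already in the register */
    memcpy(u->env, mem + next, 8);           /* fetch the next envelope */
    memcpy(merged + 8, u->env, 8);
    memcpy(d, merged + off, 8);              /* extract the 8 requested bytes */
    *addr += 8;
}

/* Sums n 32-bit elements (n even) starting at byte offset addr in mem.
   mem must extend to the end of the envelope that follows the last element. */
static int64_t model_sum(const uint8_t *mem, size_t addr, size_t n)
{
    valign_model u = model_prime(mem, addr);
    int64_t sum = 0;

    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        int32_t d[2];
        model_la32x2_ip(d, &u, mem, &addr);
        sum += (int64_t)d[0] + d[1];
    }
    return sum;
}
```

In this model, calling model_sum with a start offset of 4 (not 64-bit aligned) or 8 (64-bit aligned) over the same element values gives the same sum, which mirrors the point that the load sequence works whether or not the starting address is aligned.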
Similarly, when accessing the data stride negative one, prime the stream by passing in the
address of the first scalar element to be loaded (a[n-1]), as follows.
int i;
Note that in the negative stride case, the start of the stream is handled differently in the
aligned versus the non-aligned case. With aligned loads, one passes in the address of
a[n-2] because that is the address of the first 64-bit word being loaded. With aligning
loads, one passes in the address of the first 32-bit scalar being loaded, a[n-1], because
the priming load loads from memory the aligned 64-bit envelope containing its argument and
a[n-2] might not be in the same 64-bit envelope as a[n-1].
Fusion DSP supports storing 24-bit data in a packed format that requires only 24 bits per
data element. Using this load/store feature can potentially save 25% of the memory required
for a 24-bit variable and has the added benefit of reducing the number of memory
transactions, thereby reducing memory power and improving performance. Support for this
packed data is implemented using the alignment mechanism. In the examples above, simply
use AE_LA24X2 intrinsics instead of AE_LA32X2, as shown below. Note that we have used
char * for the pointer type. While not strictly necessary, it is helpful to indicate that the
packed stream is an unaligned byte stream.
For packed data, even scalar streams are unaligned so support is also available for AE_LA24
intrinsics. Because the memory format for packed data is different, packed data can only be
used in cases where all loads and stores of a stream are done using the packing loads and
stores. While the packing loads and stores can be used on any 24-bit variable, since a
priming load and a finalizing store is required for every stream, it is often only efficient to use
them on stride one or stride negative one streams. Similarly, since there are only four
alignment registers, it is only efficient to use them on loops that have at most four streams.
Aligning stores operate in a slightly different manner. Before starting a stream, the alignment
variable needs to be zeroed using the AE_ZALIGN64() intrinsic. On an unaligned store,
each aligning store instruction merges some of the data with data already in the alignment
register and writes the result to memory. The remaining data is written into the alignment
register for use in the next aligning store. If the data happens to be aligned, each aligning
store simply writes its data to memory. After completing the stream, the user must finalize
the stream using a finalization instruction. If the data happens to be unaligned, that
finalization instruction writes out the remaining data from the alignment register. The
finalization instruction also zeroes the alignment register so that a follow-on stream can skip
the use of the AE_ZALIGN64() intrinsic. Following is a simple example that zeroes an n
element array of ints named a.
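The store sequence described above can be sketched in portable C as a software model. This is only an approximation under stated assumptions, not Fusion DSP intrinsic code: the names salign_model, model_zalign, model_sa64_ip, model_flush and model_zero_ints are hypothetical, memory is a byte array indexed by byte offset, and the alignment register is modeled as a small buffer of bytes that still have to be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an alignment register used for stores:
   up to 7 bytes that still have to be written to memory. */
typedef struct { uint8_t pending[8]; size_t count; } salign_model;

/* Models zeroing the alignment register before a stream starts. */
static salign_model model_zalign(void)
{
    salign_model u;
    memset(&u, 0, sizeof u);
    return u;
}

/* Models one aligning store of 8 bytes with post-increment by 8. */
static void model_sa64_ip(const uint8_t data[8], salign_model *u,
                          uint8_t *mem, size_t *addr)
{
    size_t off = *addr & 7;

    /* Bytes left over from the previous store complete the current envelope. */
    memcpy(mem + *addr - u->count, u->pending, u->count);
    /* The leading bytes of the new data fill the rest of the envelope. */
    memcpy(mem + *addr, data, 8 - off);
    /* The trailing bytes wait in the register for the next store. */
    memcpy(u->pending, data + (8 - off), off);
    u->count = off;
    *addr += 8;
}

/* Models the finalization step: write out what remains, then zero the register.
   addr is the stream address after the last aligning store. */
static void model_flush(salign_model *u, uint8_t *mem, size_t addr)
{
    memcpy(mem + addr - u->count, u->pending, u->count);
    *u = model_zalign();
}

/* Zeroes n 32-bit elements (n even) starting at byte offset addr in mem. */
static void model_zero_ints(uint8_t *mem, size_t addr, size_t n)
{
    static const uint8_t zeros[8] = {0};
    salign_model u = model_zalign();

    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2)
        model_sa64_ip(zeros, &u, mem, &addr);
    model_flush(&u, mem, addr);
}
```

In the model, bytes outside the target range are never written, including the bytes that share a 64-bit envelope with the first and last elements. When the start offset is 64-bit aligned, nothing is ever held in the register and the flush writes no data.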
Negative strided streams work analogously to the case of loads, with the use of RIP
intrinsics. Note that there are separate flush instructions for the positive stride and negative
stride streams.
The circular buffer boundaries are specified through two 32-bit states, as in Table 2-16.
State        Description
AE_CBEGIN0   The start address of the circular buffer.
AE_CEND0     The end address of the circular buffer, i.e., the start address plus the
             byte size of the buffer.
The following intrinsic functions may be used to read from the circular buffer states in C:
The following intrinsic functions may be used to write to the circular buffer states in C:
All circular buffer operations follow a “post-increment” convention; that is, in every case the
effective address is the base address while the updated base address is formed by adding
the register offset to the base address with circular wrap-around.
The address increment is specified in terms of number of bytes and must be less than or
equal to the buffer byte size. The increment can be either positive (wrap-around at the end
of the buffer), or negative (wrap-around at the beginning of the buffer).
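The post-increment update with wrap-around can be sketched as a small C model. This is a simplified illustration, not the hardware definition: the names circ_model and model_circ_post_inc are hypothetical, and the sketch assumes the common case in which the begin address is below the end address, the base address lies inside the buffer, and the magnitude of the increment does not exceed the buffer size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two circular buffer states (byte addresses). */
typedef struct { uint32_t cbegin; uint32_t cend; } circ_model;

/* Post-increment with wrap-around: returns the effective address (the old
   base) and advances *base by inc bytes, wrapping inside [cbegin, cend). */
static uint32_t model_circ_post_inc(const circ_model *c, uint32_t *base,
                                    int32_t inc)
{
    uint32_t ea = *base;                      /* effective address = old base */
    uint32_t size = c->cend - c->cbegin;
    int64_t next = (int64_t)*base + inc;

    assert(c->cbegin < c->cend);              /* assumption of this sketch */
    if (inc >= 0 && next >= (int64_t)c->cend)
        next -= size;                         /* wrap at the end of the buffer */
    else if (inc < 0 && next < (int64_t)c->cbegin)
        next += size;                         /* wrap at the beginning */
    *base = (uint32_t)next;
    return ea;
}
```

For a 32-byte buffer at 0x1000, a positive step of 8 from 0x1018 accesses 0x1018 and leaves the base at 0x1000. A negative step of 8 from 0x1000 accesses 0x1000 and leaves the base at 0x1018.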
Both aligned and unaligned accesses are supported. However, for unaligned accesses,
AE_CBEGIN0 and AE_CEND0 must be aligned to 64 bits. For aligned accesses,
AE_CBEGIN0 and AE_CEND0 must be aligned to the size of the data being loaded or
stored. Unaligned accesses use the alignment mechanism described in Section 2.4.1.
Priming loads use the PC suffix with separate instructions for positive and negative stride.
For unaligned references, only stride one and stride negative one are supported. Packed 24-
bit loads are supported.
AE_CBEGIN0 need not be smaller than AE_CEND0. If an instruction accesses data past
the AE_CEND0 boundary, data will continue to be accessed at AE_CBEGIN0 regardless of
whether it is before or after AE_CEND0.
Circular buffer support is available for DSP loads and stores to the AE_DR register file as
well as bit-stream loads and stores to the AR register file.
Following is an example C code snippet demonstrating how to initialize and use the circular
buffer. The buffer is used to store 24-bit vector data in the 24 MSBs of each 32-bit word with
a negative stride starting from the last element of the buffer.
24X2      Vector of 24-bit           This operation accesses two of the size “24” above,
                                     occupying 48 bits in memory.
32X2      Vector of 32-bit           This operation accesses two of the size “32” above.
                                     Some instructions need the pair to be 64-bit aligned
                                     while others do not.
32X2F24   Vector of left-justified   This operation accesses two of the size “32F24”
          24-bit fraction            above. Some instructions need the pair to be 64-bit
                                     aligned while others do not.
16X4      Vector of 16-bit           This operation accesses four of the size “16” above.
                                     Some instructions need the quartet to be 64-bit
                                     aligned while others do not.
8X4F      Vector of left-justified   This operation accesses four of size 8.
          8-bit fraction
The mnemonics of most load and store operations contain a suffix indicating how the
effective address is computed and whether the base address register is updated. The
suffixes are listed in Table 2-18.
Operations with suffix IP, XP, IC, or XC follow a “post-increment” convention where the
effective address is the base AR register, and the base address register is updated by adding
an immediate, constant or register offset. Operations with suffix IU or XU follow a “pre-
increment” convention where the effective address is the result of adding the immediate or
register offset to the base address register’s contents and the base address register is
updated with the effective address. Operations with suffix I or X do not increment but create
an effective address which is the sum of the base address register and an immediate or offset
register.
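The three addressing conventions can be contrasted with a short C model. This is an illustrative sketch only: the function names model_read32, model_load_i, model_load_ip and model_load_iu are hypothetical, memory is modeled as a byte array, and the base address is a byte offset into that array rather than an AR register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value at a byte offset in the modeled memory. */
static int32_t model_read32(const uint8_t *mem, size_t ea)
{
    int32_t v;
    assert(mem != NULL);
    memcpy(&v, mem + ea, sizeof v);
    return v;
}

/* I / X: the effective address is base + offset; the base is not updated. */
static int32_t model_load_i(const uint8_t *mem, size_t base, ptrdiff_t off)
{
    return model_read32(mem, base + (size_t)off);
}

/* IP / XP (post-increment): the effective address is the base;
   the base is then advanced by the offset. */
static int32_t model_load_ip(const uint8_t *mem, size_t *base, ptrdiff_t off)
{
    int32_t v = model_read32(mem, *base);
    *base += (size_t)off;
    return v;
}

/* IU / XU (pre-increment): the base is first advanced by the offset;
   the updated base is the effective address. */
static int32_t model_load_iu(const uint8_t *mem, size_t *base, ptrdiff_t off)
{
    *base += (size_t)off;
    return model_read32(mem, *base);
}
```

With four 32-bit values 10, 20, 30, 40 in memory and a base offset of 4, the I form with offset 4 reads 30 and leaves the base unchanged, the IP form reads 20 and moves the base to 8, and a following IU form moves the base to 12 and reads 40.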
Operation             <sz>                              <adr>                        Description
AE_L<sz>.<adr>        64, 32, 32F24, 16                 I, X, IP, XP, XC             Aligned loads of scalars
AE_L<sz>.<adr>        32X2, 32X2F24, 16X4, 8X8, 8X4F    I, X, IP, RIP, XP, XC, RIC   Aligned loads of vectors
AE_LA<sz>.<adr>       64                                PP                           Prime for unaligned loads using IP
AE_LA<sz>POS.<adr>    32X2, 16X4, 24, 24X2              PC                           Prime for unaligned loads using IC with positive stride
AE_LA<sz>NEG.<adr>    32X2, 16X4, 24, 24X2              PC                           Prime for unaligned loads using IC with negative stride
AE_LA<sz>.<adr>       32X2, 32X2F24, 16X4, 24, 24X2     IP, IC                       Unaligned loads for accessing vectors of aligned scalars with positive update
AE_LA<sz>.<adr>       32X2, 32X2F24, 16X4, 24, 24X2     RIP, RIC                     Unaligned loads for accessing vectors of aligned scalars with negative update
AE_LALIGN64.I                                                                        Load of alignment register
AE_L<sz>M.<adr>       16X2, 32, 16                      I, X, XC, IU, XU             Legacy loads
In the following instructions, the instruction names show the assembler syntax. We also
include the C syntax below each instruction description.
C syntax:
ae_int32x2 AE_L32X2_I (const ae_int32x2 * a, immediate i64);
ae_int32x2 AE_L32X2_X (const ae_int32x2 * a, int ax);
void AE_L32X2_IP (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/, immediate i64pos);
void AE_L32X2_XP (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/, int ax);
void AE_L32X2_XC (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/, int ax);
void AE_L32X2_RIP (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/);
void AE_L32X2_RIC (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/);
ae_p24x2s AE_LP24X2_I (const ae_p24x2s * a, immediate i64);
void AE_LP24X2_IU (ae_p24x2s d /*out*/,
const ae_p24x2s * a /*inout*/, immediate i64);
ae_p24x2s AE_LP24X2_X (const ae_p24x2s * a, int ax);
void AE_LP24X2_XU (ae_p24x2s d /*out*/,
const ae_p24x2s * a /*inout*/, int ax);
void AE_LP24X2_C (ae_p24x2s d /*out*/,
const ae_p24x2s * a /*inout*/, int ax);
AE_L8X4F.I(.IP) d, a, i [ fusion_slot0 ]
AE_L8X4F.X(.XP) d, a, ax [ fusion_slot0 ]
Load four 8-bit values from 32 bits in memory, sign-extend them to 16 bits, and store the
values into the four 16-bit elements of AE_DR register d. See Table 2-3 for the meanings of
the address mode suffixes. The intent here is that the values in memory represent 8-bit (1.7)
fractions that get placed in the four elements of the AE_DR register as 16-bit (1.15) fractions.
C syntax:
ae_f16x4 AE_L8X4F_I (const int8 * a, immediate i);
void AE_L8X4F_IP (ae_f16x4 p /*out*/,
const int8 * a /*inout*/, immediate i);
C syntax:
ae_f24x2 AE_L32X2F24_I (const ae_f24x2 * a, immediate i64);
ae_f24x2 AE_L32X2F24_X (const ae_f24x2 *a, int ax);
void AE_L32X2F24_IP (ae_f24x2 d /*out*/,
const ae_f24x2 * a /*inout*/,
immediate i64pos);
void AE_L32X2F24_XP (ae_f24x2 d /*out*/,
const ae_f24x2 * a /*inout*/, int ax);
void AE_L32X2F24_XC (ae_f24x2 d /*out*/,
const ae_f24x2 * a /*inout*/, int ax);
void AE_L32X2F24_RIP (ae_f24x2 d /*out*/,
const ae_f24x2 *a /*inout*/);
void AE_L32X2F24_RIC (ae_f24x2 d /*out*/,
const ae_f24x2 *a /*inout*/);
ae_p24x2s AE_LP24X2F_I (const ae_p24x2f * a, immediate i64);
void AE_LP24X2F_IU (ae_p24x2s d /*out*/,
const ae_p24x2f * a /*inout*/, immediate i64);
ae_p24x2s AE_LP24X2F_X (const ae_p24x2f * a, int ax);
void AE_LP24X2F_XU (ae_p24x2s d /*out*/,
const ae_p24x2f * a /*inout*/, int ax);
void AE_LP24X2F_C (ae_p24x2s d /*out*/,
const ae_p24x2f * a /*inout*/, unsigned ax);
AE_L32.I d, a, i32 [ fusion_slot0, Inst ]
AE_L32.IP d, a, i32 [ fusion_slot0, Inst ]
AE_L32.X (.XC) d, a, ax [ fusion_slot0, Inst ]
AE_L32.XP d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Load a 32-bit value from memory and replicate the value into the two elements of the AE_DR
register d. See Table 2-3 for the meanings of the address mode suffixes.
Note: C intrinsics AE_LP24_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L32.I (.X, .XC, .I, .I), respectively.
C syntax:
ae_int32x2 AE_L32_I (const ae_int32 * a, immediate i32);
ae_int32x2 AE_L32_X (const ae_int32 * a, int ax);
void AE_L32_IP(ae_int32x2 d /*out*/,
const ae_int32 * a /*inout*/, immediate off);
void AE_L32_XP(ae_int32x2 d /*out*/,
const ae_int32 * a /*inout*/, int ax);
void AE_L32_XC(ae_int32x2 d /*out*/,
const ae_int32 * a /*inout*/, int ax);
ae_p24x2s AE_LP24_I (const ae_p24s * a, immediate i32);
void AE_LP24_IU (ae_p24x2s d /*out*/,
const ae_p24s * a /*inout*/, immediate i32);
ae_p24x2s AE_LP24_X (const ae_p24s * a, int ax);
void AE_LP24_XU (ae_p24x2s d /*out*/,
const ae_p24s * a /*inout*/, int ax);
void AE_LP24_C (ae_p24x2s d /*out*/,
const ae_p24s * a /*inout*/, int ax);
AE_L32F24.I d, a, i32 [ fusion_slot0, Inst ]
AE_L32F24.IP d, a, i32 [ fusion_slot0, Inst ]
AE_L32F24.XC d, a, ax [ fusion_slot0, Inst ]
AE_L32F24.X (.XP) d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Load a 24-bit value from the most significant 24 bits of the 32-bit word from memory, sign-
extend to 32 bits and replicate the value into the two 32-bit elements of the AE_DR register
d. See Table 2-3 for the meanings of the address mode suffixes. The intent here is that the
value in memory represents a 32-bit (1.31) fraction that gets truncated and replicated into
the two elements of d as 9.23-bit fractions.
Note: C intrinsics AE_LP24F_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L32F24.I (.X, .XC, .I, .I),
respectively.
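The data transformation performed by AE_L32F24 can be illustrated with a small portable C model. This is a sketch of the arithmetic only, not the intrinsic itself: the names model_l32f24 and model_l32f24_x2 are hypothetical, and the sign-extending extraction of the 24 most significant bits is written as a floor division by 256 so that it does not depend on how a particular compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Keeps the 24 most significant bits of a 32-bit word and sign-extends them
   to 32 bits, i.e. truncates a 1.31 fraction to a 9.23 fraction. */
static int32_t model_l32f24(int32_t word)
{
    int64_t w = word;
    /* Floor division by 256, equivalent to an arithmetic shift right by 8. */
    int64_t r = (w >= 0) ? (w / 256) : -((-w + 255) / 256);

    assert(r >= -0x800000 && r <= 0x7FFFFF);  /* result fits in 24 bits */
    return (int32_t)r;
}

/* Replicates the converted value into both 32-bit elements of d. */
static void model_l32f24_x2(int32_t word, int32_t d[2])
{
    d[0] = d[1] = model_l32f24(word);
}
```

For example, the 1.31 value 0x40000000 (0.5) becomes the 9.23 value 0x00400000 (also 0.5), and the low 8 bits of the memory word are discarded.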
C syntax:
ae_f24x2 AE_L32F24_I (const ae_f24 * a, immediate i32);
ae_f24x2 AE_L32F24_X (const ae_f24 * a, int ax);
void AE_L32F24_IP (ae_f24x2 d /*out*/,
const ae_f24 * a /*inout*/, immediate i32);
void AE_L32F24_XP (ae_f24x2 d /*out*/,
const ae_f24 * a /*inout*/, int ax);
void AE_L32F24_XC (ae_f24x2 d /*out*/,
const ae_f24 * a /*inout*/, int ax);
ae_p24x2s AE_LP24F_I (const ae_p24f * a, immediate i32);
void AE_LP24F_IU (ae_p24x2s d /*out*/,
const ae_p24f * a /*inout*/, immediate i32);
This instruction is used to prime the unaligned access stream for all AE_LA<size>.IP and
AE_LA<size>.RIP instructions regardless of size or direction.
C syntax:
ae_valign AE_LA64_PP (void *a);
AE_LA32X2POS.PC u, a [ fusion_slot0, Inst]
AE_LA32X2NEG.PC u, a [ fusion_slot0]
Required alignment: 4 bytes
This operation loads a 64-bit value from memory into AE_VALIGN register u. The effective
address is (a & 0xFFFFFFF8).
The instruction AE_LA32X2POS.PC is used to prime the unaligned access stream for
AE_LA32X2.IC and AE_LA32X2F24.IC instructions. The instruction AE_LA32X2NEG.PC is
used to prime the unaligned access stream for AE_LA32X2.RIC and AE_LA32X2F24.RIC
instructions.
The instruction AE_LA16X4POS.PC is used to prime the unaligned access stream for
AE_LA16X4.IC instructions. The instruction AE_LA16X4NEG.PC is used to prime the
unaligned access stream for AE_LA16X4.RIC instructions.
C syntax:
void AE_LA16X4POS_PC (ae_valign u /*out*/, ae_int16x4 *a /*inout*/);
void AE_LA16X4NEG_PC (ae_valign u /*out*/, ae_int16x4 *a /*inout*/);
AE_LA24POS.PC u, a [ fusion_slot0]
AE_LA24NEG.PC u, a [ fusion_slot0]
Required alignment: 1 byte
Load a 64-bit value from memory to AE_VALIGN register u. The effective address is
(a & 0xFFFFFFF8).
The instruction AE_LA24POS.PC is used to prime the unaligned access stream for
AE_LA24.IC instructions. The instruction AE_LA24NEG.PC is used to prime the unaligned
access stream for AE_LA24.RIC instructions.
C syntax:
void AE_LA24POS_PC (ae_valign u /*out*/, void *a /*inout*/);
void AE_LA24NEG_PC (ae_valign u /*out*/, void *a /*inout*/);
AE_LA24X2POS.PC u, a [ fusion_slot0]
AE_LA24X2NEG.PC u, a [ fusion_slot0]
Required alignment: 1 byte
Load a 64-bit value from memory to AE_VALIGN register u. The effective address is
(a & 0xFFFFFFF8).
The instruction AE_LA24X2POS.PC is used to prime the unaligned access stream for
AE_LA24X2.IC instructions. The instruction AE_LA24X2NEG.PC is used to prime the
unaligned access stream for AE_LA24X2.RIC instructions.
C syntax:
void AE_LA24X2POS_PC (ae_valign u /*out*/, void *a /*inout*/);
void AE_LA24X2NEG_PC (ae_valign u /*out*/, void *a /*inout*/);
AE_LA32X2.IP (.IC) d, u, a [ fusion_slot0, fusion_slot_fir_0, Inst]
AE_LA32X2.RIC (.RIP) d, u, a [ fusion_slot0]
Required alignment: 4 bytes
Load a pair of 32-bit values from effective address (a) in memory into the AE_DR register d.
Instructions AE_LA32X2.IP (.IC) are used if the direction of the load operations is positive.
Instructions AE_LA32X2.RIP (.RIC) are used if the direction of the load operations is
negative.
C syntax:
void AE_LA32X2_IP (ae_int32x2 d /*out*/, ae_valign u /*inout*/,
ae_int32x2 *a /*inout*/);
void AE_LA32X2_IC (ae_int32x2 d /*out*/, ae_valign u /*inout*/,
ae_int32x2 *a /*inout*/);
void AE_LA32X2_RIP (ae_int32x2 d /*out*/, ae_valign u /*inout*/,
ae_int32x2 *a /*inout*/);
void AE_LA32X2_RIC (ae_int32x2 d /*out*/, ae_valign u /*inout*/,
ae_int32x2 *a /*inout*/);
AE_LA32X2F24.IP (.IC) d, u, a [ fusion_slot0, Inst]
AE_LA32X2F24.RIC (.RIP) d, u, a [ fusion_slot0]
Load a pair of 24-bit values, each from the most significant 24 bits of a 32-bit half of the 64
bits in memory, sign-extend them to 32 bits and store the values into the two 32-bit elements
of AE_DR register d. Instructions AE_LA32X2F24.IP (.IC) are used if the direction of the load
operations is positive. Instructions AE_LA32X2F24.RIP (.RIC) are used if the direction of the
load operations is negative.
C syntax:
void AE_LA32X2F24_IP (ae_f24x2 d /*out*/, ae_valign u /*inout*/,
ae_f24x2 *a /*inout*/);
void AE_LA32X2F24_IC (ae_f24x2 d /*out*/, ae_valign u /*inout*/,
ae_f24x2 *a /*inout*/);
void AE_LA32X2F24_RIP (ae_f24x2 d /*out*/, ae_valign u /*inout*/,
ae_f24x2 *a /*inout*/);
void AE_LA32X2F24_RIC (ae_f24x2 d /*out*/, ae_valign u /*inout*/,
ae_f24x2 *a /*inout*/);
C syntax:
void AE_LA24X2_IP (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24X2_IC (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24X2_RIP (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24X2_RIC (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
AE_LALIGN64.I u, a, imm [ fusion_slot0]
Required alignment: 8 bytes
Load a 64-bit value from effective address (a + imm) in memory into the AE_VALIGN register
u.
C syntax:
ae_valign AE_LALIGN64_I (void *a, immediate imm);
AE_L16X2M.I d, a, i32 [ fusion_slot0, Inst ]
AE_L16X2M.IU d, a, i32 [ fusion_slot0, Inst ]
AE_L16X2M.X (.XU) d, a, ax [ fusion_slot0, Inst ]
AE_L16X2M.XC d, a, ax [fusion_slot0 ]
Required alignment: 4 bytes
Load a pair of 16-bit values from memory, pad 8-bit zeroes at the low end and sign-extend
to 32 bits and store the values into the two 32-bit elements of AE_DR register d. See Table
2-3 for the meanings of the address mode suffixes.
Note: C intrinsics AE_LP16X2F_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L16X2M.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
ae_int32x2 AE_L16X2M_I (const ae_p16x2s * a, immediate i32);
void AE_L16X2M_IU (ae_int32x2 d /*out*/,
const ae_p16x2s * a /*inout*/, immediate i32);
ae_int32x2 AE_L16X2M_X (const ae_p16x2s * a, int ax);
void AE_L16X2M_XU (ae_int32x2 d /*out*/,
const ae_p16x2s * a /*inout*/, int ax);
void AE_L16X2M_XC (ae_int32x2 d /*out*/,
const ae_p16x2s * a /*inout*/, int ax);
ae_p24x2s AE_LP16X2F_I (const ae_p16x2s * a, immediate i32);
void AE_LP16X2F_IU (ae_p24x2s d /*out*/,
const ae_p16x2s * a /*inout*/, immediate i32);
ae_p24x2s AE_LP16X2F_X (const ae_p16x2s * a, int ax);
void AE_LP16X2F_XU (ae_p24x2s d /*out*/,
Note: C intrinsics AE_LP16F_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L16M.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
ae_int32x2 AE_L16M_I (const ae_p16s * a, immediate i16);
void AE_L16M_IU (ae_int32x2 d /*out*/,
const ae_p16s * a /*inout*/, immediate i16);
ae_int32x2 AE_L16M_X (const ae_p16s * a, int ax);
void AE_L16M_XU (ae_int32x2 d /*out*/,
const ae_p16s * a /*inout*/, int ax);
void AE_L16M_XC (ae_int32x2 d /*out*/,
const ae_p16s * a /*inout*/, int ax);
ae_p24x2s AE_LP16F_I (const ae_p16s * a, immediate i16);
void AE_LP16F_IU (ae_p24x2s d /*out*/,
const ae_p16s * a /*inout*/, immediate i16);
ae_p24x2s AE_LP16F_X (const ae_p16s * a, int ax);
void AE_LP16F_XU (ae_p24x2s d /*out*/,
const ae_p16s * a /*inout*/, int ax);
void AE_LP16F_C (ae_p24x2s d /*out*/,
const ae_p16s * a /*inout*/, int ax);
Limited immediate versions of the core L16SI and L16UI instructions. These instructions are
inferred automatically by the C/C++ compiler.
C syntax:
unsigned AE_L16SI_N (const void * a, immediate i32);
unsigned AE_L16UI_N (const void * a, immediate i32);
C syntax:
void AE_S64_I (ae_int64 d, ae_int64 * a, immediate i64);
void AE_S64_X (ae_int64 d, ae_int64 * a, int ax);
void AE_S64_IP (ae_int64 d, ae_int64 * a /*inout*/, immediate i64);
void AE_S64_XP (ae_int64 d, ae_int64 * a /*inout*/, int ax);
void AE_S64_XC (ae_int64 d, ae_int64 * a /*inout*/, int ax);
void AE_SQ56S_I (ae_q56s d, ae_q56s * a, immediate i64);
void AE_SQ56S_IU (ae_q56s d, ae_q56s * a /*inout*/, immediate i64);
void AE_SQ56S_X (ae_q56s d, ae_q56s * a, int ax);
void AE_SQ56S_XU (ae_q56s d, ae_q56s * a /*inout*/, int ax);
void AE_SQ56S_C (ae_q56s d, ae_q56s * a /*inout*/, int ax);
AE_S32X2.I d, a, i64 [ fusion_slot0, Inst ]
AE_S32X2.IP d, a, i64pos [ fusion_slot0, Inst ]
AE_S32X2.RIP (.RIC) d, a [ fusion_slot0 ]
AE_S32X2.X (.XP, .XC) d, a, ax [ fusion_slot0, Inst ]
Required alignment: 8 bytes
Store a pair of 32-bit values from the AE_DR register d to memory. See Table 2-3 for the
meanings of the address mode suffixes.
Note: C intrinsics AE_SP24X2S_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2/EP code
portability. They are implemented through operations AE_S32X2.I (.X, .XC, .I, .I),
respectively.
C syntax:
void AE_S32X2_I (ae_int32x2 d, ae_int32x2 * a, immediate i64);
void AE_S32X2_X (ae_int32x2 d, ae_int32x2 * a, int ax);
void AE_S32X2_IP (ae_int32x2 d,
ae_int32x2 * a /*inout*/, immediate i64);
void AE_S32X2_XP (ae_int32x2 d,
ae_int32x2 * a /*inout*/, int ax);
void AE_S32X2_XC (ae_int32x2 d,
ae_int32x2 * a /*inout*/, int ax);
void AE_S32X2_RIP (ae_int32x2 d, ae_int32x2 * a /*inout*/);
void AE_S32X2_RIC (ae_int32x2 d, ae_int32x2 * a /*inout*/);
void AE_SP24X2S_I (ae_p24x2s d, ae_p24x2s * a, immediate i64);
void AE_SP24X2S_IU (ae_p24x2s d,
ae_p24x2s * a /*inout*/, immediate i64);
void AE_SP24X2S_X (ae_p24x2s d, ae_p24x2s * a, int ax);
void AE_SP24X2S_XU (ae_p24x2s d,
ae_p24x2s * a /*inout*/, int ax);
void AE_SP24X2S_C (ae_p24x2s d,
ae_p24x2s * a /*inout*/, int ax);
AE_S8X4F.I(.IP) d, a, i [ fusion_slot0 ]
Required alignment: 4 bytes
Store four 8-bit values, taken from the high eight bits of each 16-bit element of AE_DR
register d into 32 bits of memory. See Table 2-3 for the meanings of the address mode
suffixes.
C syntax:
void AE_S8X4F_I (ae_f16x4 d, int8 * a, immediate i);
void AE_S8X4F_IP (ae_f16x4 d, int8 * a /*inout*/, immediate i);
C syntax:
void AE_S32X2F24_I (ae_f24x2 d, ae_f24x2 *a, immediate i64);
void AE_S32X2F24_X (ae_f24x2 d, ae_f24x2 * a, int ax);
void AE_S32X2F24_IP (ae_f24x2 d,
ae_f24x2 * a /*inout*/, immediate i64);
void AE_S32X2F24_RIP (ae_f24x2 d, ae_f24x2 * a /*inout*/);
void AE_S32X2F24_RIC (ae_f24x2 d, ae_f24x2 * a /*inout*/);
void AE_S32X2F24_XP (ae_f24x2 d,
ae_f24x2 * a /*inout*/, int ax);
void AE_S32X2F24_XC (ae_f24x2 d,
ae_f24x2 * a /*inout*/, int ax);
void AE_SP24X2F_I (ae_p24x2s d, ae_p24x2f * a, immediate i64);
void AE_SP24X2F_IU (ae_p24x2s d,
ae_p24x2f * a /*inout*/, immediate i64);
void AE_SP24X2F_X (ae_p24x2s d, ae_p24x2f * a, int ax);
void AE_SP24X2F_XU (ae_p24x2s d,
ae_p24x2f * a /*inout*/, int ax);
void AE_SP24X2F_C (ae_p24x2s d,
ae_p24x2f * a /*inout*/, int ax);
AE_S32.L.I d, a, i32 [ fusion_slot0, Inst ae_minislot0 ]
AE_S32.L.IP d, a, i32 [ fusion_slot0, Inst ]
AE_S32.L.X (.XP) d, a, ax [ fusion_slot0, Inst ]
AE_S32.L.XC d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Store the 32-bit L element of the AE_DR register d to memory. For operations with suffix .I,
the effective address is (a + i32). See Table 2-3 for the meanings of the address mode
suffixes.
Note: C intrinsics AE_SP24S_L_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S32.L.I (.X, .XC, .I, .I) respectively.
C syntax:
void AE_S32_L_I (ae_int32x2 d, ae_int32 * a, immediate i32);
void AE_S32_L_X (ae_int32x2 d, ae_int32 * a, int ax);
void AE_S32_L_IP (ae_int32x2 d,
ae_int32 * a /*inout*/, immediate i32);
void AE_S32_L_XP (ae_int32x2 d,
ae_int32 * a /*inout*/, int ax);
void AE_S32_L_XC (ae_int32x2 d,
ae_int32 * a /*inout*/, int ax);
void AE_SP24S_L_I (ae_p24x2s d, ae_p24s * a, immediate i32);
void AE_SP24S_L_IU (ae_p24x2s d,
ae_p24s * a /*inout*/, immediate i32);
void AE_SP24S_L_X (ae_p24x2s d, ae_p24s * a, int ax);
void AE_SP24S_L_XU (ae_p24x2s d,
ae_p24s * a /*inout*/, int ax);
void AE_SP24S_L_C (ae_p24x2s d,
ae_p24s * a /*inout*/, int ax);
C syntax:
void AE_SA24_L_IP (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24_L_IC (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24_L_RIP (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24_L_RIC (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
AE_SA24X2.IP (.IC, .RIP, .RIC) d, u, a [ fusion_slot0 ]
Required alignment: 1 byte
Store the 24 LSBs of the two 32-bit elements of AE_DR register d to 48 bits in memory with
effective address (a). Instructions AE_SA24X2.IP (.IC) are used if the direction of the store
operations is positive. Instructions AE_SA24X2.RIP (.RIC) are used if the direction of the
store operations is negative.
C syntax:
void AE_SA24X2_IP (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24X2_IC (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24X2_RIP (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24X2_RIC (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
AE_SALIGN64.I u, a, imm [ fusion_slot0 ]
Stores a 64-bit value from AE_VALIGN register u to memory with effective address (a +
imm).
C syntax:
void AE_SALIGN64_I (ae_valign u, void *a, immediate imm);
AE_SA64POS.FP u, a [ Inst ]
Required alignment: varies depending on the data type in the AE_VALIGN register u.
Flushes the value in AE_VALIGN register u to memory with effective address (a). The
AE_VALIGN register u is updated with value zero. This operation is used when the direction
of the store operation is positive.
C syntax:
void AE_SA64POS_FP (ae_valign u /*inout*/, void *a);
void AE_SA64POS_FC (ae_valign u /*inout*/, void *a);
AE_SA64NEG.FP u, a [ fusion_slot0 ]
Required alignment: varies depending on the data type in the AE_VALIGN register u.
Flushes the value in AE_VALIGN register u to memory with effective address (a). The
AE_VALIGN register u is updated with value zero. This operation is used when the direction
of the store operation is negative.
C syntax:
void AE_SA64NEG_FP (ae_valign u /*inout*/, void *a);
void AE_SA64NEG_FC (ae_valign u /*inout*/, void *a);
AE_ZALIGN64 u [ Inst ]
Initialize the AE_VALIGN register u with zero.
C syntax:
ae_valign AE_ZALIGN64 ();
AE_S16X2M.I (.IU) d, a, i32 [ fusion_slot0, Inst ]
AE_S16X2M.X (.XU) d, a, ax [ fusion_slot0, Inst ]
AE_S16X2M.XC d, a, ax [ fusion_slot0]
Required alignment: 4 bytes
Store the middle 16-bit element of each 32-bit half of AE_DR register d into 32 bits in
memory. See Table 2-3 for the meanings of the address mode suffixes.
Note: C intrinsics AE_SP16X2F_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S16X2M.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
void AE_S16X2M_I (ae_int32x2 d, ae_p16x2s *a, immediate i32);
void AE_S16X2M_IU (ae_int32x2 d, ae_p16x2s *a /*inout*/,
immediate i32);
void AE_S16X2M_X (ae_int32x2 d, ae_p16x2s *a, int ax);
void AE_S16X2M_XU (ae_int32x2 d, ae_p16x2s *a /*inout*/, int ax);
void AE_S16X2M_XC (ae_int32x2 d, ae_p16x2s *a /*inout*/, int ax);
void AE_SP16X2F_I (ae_p24x2s d, ae_p16x2s *a, immediate i32);
void AE_SP16X2F_IU (ae_p24x2s d, ae_p16x2s *a /*inout*/,
immediate i32);
void AE_SP16X2F_X (ae_p24x2s d, ae_p16x2s *a, int ax);
void AE_SP16X2F_XU (ae_p24x2s d, ae_p16x2s *a /*inout*/,
int ax);
void AE_SP16X2F_C (ae_p24x2s d, ae_p16x2s *a /*inout*/,
unsigned ax);
Store the middle 16-bit element of the low-order 32-bit element of AE_DR register d into 16
bits in memory. See Table 2-3 for the meanings of the address mode suffixes.
Note: C intrinsics AE_SP16F_L_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S16M.L.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
void AE_S16M_L_I (ae_int32x2 d, ae_p16s *a, immediate i16);
void AE_S16M_L_IU (ae_int32x2 d, ae_p16s *a /*inout*/,
immediate i16);
void AE_S16M_L_X (ae_int32x2 d, ae_p16s *a, int ax);
void AE_S16M_L_XU (ae_int32x2 d, ae_p16s *a /*inout*/, int ax);
void AE_S16M_L_XC (ae_int32x2 d, ae_p16s *a /*inout*/, int ax);
void AE_SP16F_L_I (ae_p24x2s d, ae_p16s *a, immediate i16);
void AE_SP16F_L_IU (ae_p24x2s d, ae_p16s *a /*inout*/,
immediate i16);
void AE_SP16F_L_X (ae_p24x2s d, ae_p16s *a, int ax);
void AE_SP16F_L_XU (ae_p24x2s d, ae_p16s *a /*inout*/, int ax);
void AE_SP16F_L_C (ae_p24x2s d, ae_p16s *a /*inout*/, int ax);
Fusion DSP MAC operations are named using the following convention:
AE_MUL<accum_type>[F][DP]{C,CR,CI}<size>{R,RA}[S][U].specifier
The operations use a specifier of L or H suffix to select input operands from the two 32-bit
AE_DR elements or a 0, 1, 2, 3 suffix for 16-bit data.
The two-MAC operations have two forms, dual MACs and SIMD MACs. Dual MACs take the results
of two multiplies and add or subtract them together, as in the example below.
SIMD MACs do not combine the results of different multiplies. They instead perform the
same multiply operation on different portions of the data, as in the example below.
The dual MACs use a D in the name. Most of the SIMD MACs pack their results into 32 or
16 bits and hence use a P in their name. By adding or subtracting two multiply results
together, the dual MAC instructions are able to maintain high precision for their accumulation
without needing to write multiple output registers.
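The difference between the two forms can be sketched in portable C. This is a behavioral illustration under simplifying assumptions, not a definition of any particular instruction: the names model_dual_mac and model_simd_mac are hypothetical, the dual form is shown with 32-bit inputs and one 64-bit accumulator, the SIMD form is shown with 16-bit inputs and two 32-bit lanes, and rounding, saturation and overflow handling are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dual MAC: two products are added to (or the second is subtracted from)
   a single wide accumulator, so only one result register is written. */
static int64_t model_dual_mac(int64_t acc, const int32_t x[2],
                              const int32_t y[2], int subtract_second)
{
    int64_t p0 = (int64_t)x[0] * y[0];
    int64_t p1 = (int64_t)x[1] * y[1];
    return acc + p0 + (subtract_second ? -p1 : p1);
}

/* SIMD MAC: each lane multiplies and accumulates independently; the
   products of different lanes are never combined. The caller is assumed
   to keep the lane values small enough that no lane overflows. */
static void model_simd_mac(int32_t acc[2], const int16_t x[2],
                           const int16_t y[2])
{
    assert(acc != NULL && x != NULL && y != NULL);
    acc[0] += (int32_t)x[0] * y[0];
    acc[1] += (int32_t)x[1] * y[1];
}
```

With inputs (3, 4) and (5, 6), the dual form starting from an accumulator of 100 yields 100 + 15 + 24 = 139, while the SIMD form starting from lanes (1, 2) yields the two separate results (16, 26).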
Quad MACs compute the sum of four products and have a Q in the name.
With the AVS option, 24x24-bit and 32x16-bit complex multiply operations are dual-MAC
operations that compute either the real half or the imaginary half of a complex multiplication
and pack their two results down to 32-bits. They are designated with a CR or a CI.
With the 16-bit Quad MAC option, 16x16-bit complex multiplies are quad-MAC operations
that produce either a 32x2-bit or 16x2-bit result. They are designated with a C.
Among the single-multiply and SIMD multiply operations, each family of multiply/accumulate
operations has a multiply-only variant, a multiply/add variant, and a multiply/subtract variant,
denoted by having accum_type set to nothing, A or S respectively. With the MUL variant, the
accumulator contents are overwritten with the result of the multiplication. With the MULA
variant, the result of the multiplication is added to the accumulator contents and written back
to the accumulator. With the MULS variant, the result of the multiplication is subtracted from
the accumulator contents and written back to the accumulator.
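The three accumulation variants can be summarized with a small C model. The enum accum_kind and the function model_mac are hypothetical names used only for this sketch, which assumes 32-bit inputs and a 64-bit accumulator and ignores rounding and saturation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags for the three accum_type settings of a single multiply. */
typedef enum { ACC_MUL, ACC_MULA, ACC_MULS } accum_kind;

/* Returns the new accumulator contents for one multiply of x by y. */
static int64_t model_mac(accum_kind kind, int64_t acc, int32_t x, int32_t y)
{
    int64_t p = (int64_t)x * y;

    assert(kind == ACC_MUL || kind == ACC_MULA || kind == ACC_MULS);
    switch (kind) {
    case ACC_MUL:  return p;         /* accumulator is overwritten */
    case ACC_MULA: return acc + p;   /* product is added */
    case ACC_MULS: return acc - p;   /* product is subtracted */
    }
    return acc;                      /* not reached for valid kinds */
}
```

Starting from an accumulator value of 100 and multiplying 3 by 4, the MUL variant produces 12, the MULA variant produces 112, and the MULS variant produces 88.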
Dual MAC operations with an accum_type starting with Z do their accumulation against
zero; in other words, the initial contents of the accumulator are discarded. Those without any
Z accumulate against the initial contents of the accumulator. Following the optional Z there
are two letters that indicate addition or subtraction, one for each of the two multiplication
results.
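The accum_type letters of a dual MAC can be modeled in portable C as follows. This is a sketch under stated assumptions: dual_mac_model is a hypothetical helper, not a Fusion intrinsic, the first letter after the optional Z is assumed to apply to the first (high) product, and integer 32x16-bit multiplies are assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the dual-MAC accum_type letters: an optional
 * leading 'Z', then one 'A' (add) or 'S' (subtract) per product. */
int64_t dual_mac_model(const char *accum_type, int64_t acc,
                       int32_t x_hi, int16_t y_hi,
                       int32_t x_lo, int16_t y_lo)
{
    if (accum_type[0] == 'Z') {    /* accumulate against zero */
        acc = 0;
        accum_type++;
    }
    assert(strlen(accum_type) == 2);

    int64_t p_hi = (int64_t)x_hi * y_hi;
    int64_t p_lo = (int64_t)x_lo * y_lo;

    acc += (accum_type[0] == 'A') ? p_hi : -p_hi;
    acc += (accum_type[1] == 'A') ? p_lo : -p_lo;
    return acc;
}
```

For example, with products 2 x 3 and 4 x 5 and an initial accumulator of 1000, "ZAA" yields 26 while "AS" yields 986.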
Quad MAC operations with an accum_type starting with Z do their accumulation against
zero; in other words, the initial contents of the accumulator are discarded. Those without any
Z accumulate against the initial contents of the accumulator. Following the optional Z there
are four As, one for each of the four multiplication results.
Fusion DSP supports both integer and fractional multiplication. Fractional multiply
instructions have an F immediately following accum_type.
The size of a multiply instruction is 16, 24, 32 or 32X16 for 16-bit, 24-bit, 32-bit and 32 times
16-bit respectively. For SIMD multipliers, a suffix X2 or X4 is added to the size to signify the
number of SIMD elements.
Integral SIMD multiply instructions throw away the upper bits of their results, just like standard
C/C++ multiplies. Fractional SIMD multiply instructions round away the lower bits using either
a symmetric or asymmetric rounding. They are signified with R or RA in the name. With
asymmetric rounding, halves are rounded upward, i.e., 0.5 times the least significant result
bit is rounded up to 1.0 and -0.5 times the least significant result bit is rounded up to 0. With
symmetric rounding, halves are rounded away from zero, i.e., -0.5 times the least significant
result bit is rounded down to -1.0. In the instruction descriptions, symmetric rounds are
referred to as round while asymmetric are referred to as round+∞.
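The difference between the two rounding modes can be sketched in portable C. The functions below are hypothetical models, not Fusion intrinsics; they drop the n least significant bits of a value and do not handle inputs close to the limits of the 64-bit range.

```c
#include <assert.h>
#include <stdint.h>

/* floor(x / 2^n) without relying on right-shifting a negative value. */
static int64_t floor_div_pow2(int64_t x, unsigned n)
{
    int64_t d = (int64_t)1 << n;
    int64_t q = x / d;
    if (x % d < 0)
        q--;
    return q;
}

/* Asymmetric rounding (round toward +infinity on halves). */
int64_t round_asym_model(int64_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    return floor_div_pow2(x + ((int64_t)1 << (n - 1)), n);
}

/* Symmetric rounding (halves are rounded away from zero). */
int64_t round_sym_model(int64_t x, unsigned n)
{
    return (x >= 0) ? round_asym_model(x, n) : -round_asym_model(-x, n);
}
```

With n = 1, an input of -1 represents -0.5 times the result LSB: the asymmetric model returns 0 and the symmetric model returns -1, matching the description above.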
MAC operations without guard bits, 1.31x1.31 into 1.63, 1.31x1.15 into 1.31, and 1.15x1.15
into 1.15 or 1.31, saturate their results. All other MAC operations have guard bits and do not
saturate. Saturating multiplies have an S following the size or the rounding designation.
Some 16x16-bit multipliers are designed to be bit exact with the ITU-T/ETSI intrinsics and
therefore do multiple saturations in series. These instructions have SS in the name.
All MAC operations appear in slot fusion_slot1 or fusion_slot_fir_1 when the AVS option has
been selected.
HiFi 2/EP had a different naming scheme for multipliers. Compatibility intrinsics are provided
for all the old HiFi 2/EP intrinsics and are listed in the following sections.
Complex quad 24x24-bit into 32-bit signed integer MAC with no saturation: These are
emulated using two-instruction sequences: one containing CR in the name and computing
the real part of the product and the other containing CI and computing the imaginary part.
2-way SIMD 1.23x1.23-bit into 9.23-bit signed MAC with symmetric (away from zero)
rounding of the product.
2-way SIMD 1.23x1.23-bit into 9.23-bit signed MAC with asymmetric rounding of the product.
C syntax:
ae_f32x2 AE_MULFP32X2RS (ae_f32x2 d0, ae_f32x2 d1);
void AE_MULAFP32X2RS (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
void AE_MULSFP32X2RS (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
AE_MULFP32X2RAS d, d0, d1 [fusion_slot1] AVS ONLY
AE_MULAFP32X2RAS d, d0, d1 [fusion_slot1] AVS ONLY
AE_MULSFP32X2RAS d, d0, d1 [fusion_slot1] AVS ONLY
2-way SIMD 1.31x1.31-bit into 1.31-bit signed MAC with asymmetric rounding of the product
and 32-bit saturation of the final result: These are emulated using two-instruction sequences.
AE_MULF32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
AE_MULAF32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
AE_MULSF32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
Single 1.31x1.15-bit into 17.47-bit signed MAC without saturation:
d ← [d_{17.47} ±] d0.L_{1.31} × d1.0_{1.15}
C syntax:
ae_f64 AE_MULF32X16_L0 (ae_f32x2 d0, ae_f16x4 d1);
void AE_MULAF32X16_L0 (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
void AE_MULSF32X16_L0 (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
AE_MULZAAFD32X16.H1.L0 (.H3.L2 .H2.L3 .H0.L1) d, d0, d1 [fusion_slot1]
AE_MULZASFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULZSAFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULZSSFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULAAFD32X16.H1.L0 (.H3.L2 .H2.L3 .H0.L1) d, d0, d1 [fusion_slot1]
AE_MULASFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULSAFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULSSFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
Dual 1.31x1.15-bit into 17.47-bit signed MAC without saturation:
d ← [d_{17.47}] ± d0.H_{1.31} × d1.1_{1.15} ± d0.L_{1.31} × d1.0_{1.15}
The extra .H3.L2 and .H0.L1 specifiers are for computing half of a complex multiplication.
C syntax:
ae_f64 AE_MULZAAFD32X16_H1_L0 (ae_f32x2 d0, ae_f16x4 d1);
ae_f64 AE_MULZASFD32X16_H1_L0 (ae_f32x2 d0, ae_f16x4 d1);
ae_f64 AE_MULZSAFD32X16_H1_L0 (ae_f32x2 d0, ae_f16x4 d1);
ae_f64 AE_MULZSSFD32X16_H1_L0 (ae_f32x2 d0, ae_f16x4 d1);
C syntax:
ae_f32x2 AE_MULFP32X16X2RS (ae_f32x2 d0, ae_f16x4 d1);
void AE_MULAFP32X16X2RS (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
void AE_MULSFP32X16X2RS (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
These MAC operations are bit-exact with the ITU-T L_mul, L_mac and L_msu basic
primitives.
C syntax:
ae_f32x2 AE_MULF16SS_00 (ae_f16x4 d0, ae_f16x4 d1);
void AE_MULAF16SS_00 (ae_f32x2 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
void AE_MULSF16SS_00 (ae_f32x2 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
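The saturating behavior of a fractional 1.15x1.15-bit multiply into 1.31 bits can be sketched in portable C. The functions below are hypothetical models written for illustration; they are not Fusion intrinsics and have not been verified for bit-exactness against the ITU-T reference code.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp v to the closed range [lo, hi]. */
static int64_t saturate(int64_t v, int64_t lo, int64_t hi)
{
    assert(lo <= hi);
    return (v > hi) ? hi : (v < lo) ? lo : v;
}

/* 1.15 x 1.15 -> 1.31 fractional product with 32-bit saturation. */
int32_t mulf16ss_model(int16_t a, int16_t b)
{
    return (int32_t)saturate(2 * ((int64_t)a * b), INT32_MIN, INT32_MAX);
}

/* Multiply/add: saturate the product, then saturate the accumulation. */
int32_t mulaf16ss_model(int32_t acc, int16_t a, int16_t b)
{
    return (int32_t)saturate((int64_t)acc + mulf16ss_model(a, b),
                             INT32_MIN, INT32_MAX);
}

/* Multiply/subtract: saturate the product, then saturate the difference. */
int32_t mulsf16ss_model(int32_t acc, int16_t a, int16_t b)
{
    return (int32_t)saturate((int64_t)acc - mulf16ss_model(a, b),
                             INT32_MIN, INT32_MAX);
}
```

The only product that saturates is -1.0 x -1.0 (both inputs equal to INT16_MIN), which would otherwise produce +1.0 and overflow the 1.31 format.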
AE_MULZAAFD16SS.11.00 (.33.22 .13.02) d, d0, d1 [fusion_slot1]
AE_MULZSSFD16SS.11.00 (.33.22 .13.02) d, d0, d1 [fusion_slot1]
AE_MULAAFD16SS.11.00 (.33.22 .13.02) d, d0, d1 [fusion_slot1]
AE_MULSSFD16SS.11.00 (.33.22 .13.02) d, d0, d1 [fusion_slot1]
Dual 1.15x1.15-bit into a single 1.31-bit signed MAC with 32-bit saturation after each product
and after each accumulation. The 32-bit result is replicated into each half of the result
register.
tmp ← saturate1.31([d_{1.31}] ± saturate1.31(d0.1_{1.15} × d1.1_{1.15}))
d_{1.31} ← saturate1.31(tmp ± saturate1.31(d0.0_{1.15} × d1.0_{1.15}))
These MAC operations are bit-exact with a pair of ITU-T L_mul, L_mac and L_msu basic
primitives.
C syntax:
ae_f32x2 AE_MULZAAFD16SS_11_00 (ae_f16x4 d0, ae_f16x4 d1);
ae_f32x2 AE_MULZSSFD16SS_11_00 (ae_f16x4 d0, ae_f16x4 d1);
void AE_MULAAFD16SS_11_00 (ae_f32x2 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
void AE_MULSSFD16SS_11_00 (ae_f32x2 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
AE_MULF16X4SS d0, d1, d2, d3 [fusion_slot1] AVS ONLY
AE_MULAF16X4SS d0, d1, d2, d3 [fusion_slot1] AVS ONLY
AE_MULSF16X4SS d0, d1, d2, d3 [fusion_slot1] AVS ONLY
Four way SIMD 1.15x1.15-bit into 1.31-bit signed MAC with 32-bit intermediate product and
accumulator saturation. These are emulated using two-instruction sequences.
C syntax:
void AE_MULF16X4SS (ae_f32x2 d0 /*out*/, ae_f32x2 d1 /*out*/,
ae_f16x4 d2, ae_f16x4 d3);
void AE_MULAF16X4SS (ae_f32x2 d0 /*inout*/,
ae_f32x2 d1 /*inout*/,
ae_f16x4 d2, ae_f16x4 d3);
void AE_MULSF16X4SS (ae_f32x2 d0 /*inout*/,
ae_f32x2 d1 /*inout*/,
ae_f16x4 d2, ae_f16x4 d3);
AE_MUL16X4 d0, d1, d2, d3 [fusion_slot1] AVS/16-bit Quad MAC Options ONLY
AE_MULA16X4 d0, d1, d2, d3 [fusion_slot1] AVS/16-bit Quad MAC Options ONLY
AE_MULS16X4 d0, d1, d2, d3 [fusion_slot1] AVS/16-bit Quad MAC Options ONLY
Four way SIMD 16x16-bit into 32-bit integer signed MAC without saturation. These are
emulated using two-instruction sequences.
d0.H ← [d0.H ±] d2.3 × d3.3
d0.L ← [d0.L ±] d2.2 × d3.2
d1.H ← [d1.H ±] d2.1 × d3.1
d1.L ← [d1.L ±] d2.0 × d3.0
C syntax:
void AE_MUL16X4 (ae_int32x2 d0 /*out*/, ae_int32x2 d1 /*out*/,
ae_int16x4 d2, ae_int16x4 d3);
void AE_MULA16X4 (ae_int32x2 d0 /*inout*/,
ae_int32x2 d1 /*inout*/,
ae_int16x4 d2, ae_int16x4 d3);
void AE_MULS16X4 (ae_int32x2 d0 /*inout*/,
ae_int32x2 d1 /*inout*/,
ae_int16x4 d2, ae_int16x4 d3);
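The element pairing of the four-way SIMD integer multiply/accumulate can be sketched in portable C. The types vec16x4 and vec32x2 and the function mula16x4_model are hypothetical stand-ins for the AE_DR register views; they are not the Fusion data types. Because the operation does not saturate, the model wraps on overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register views: e[3] is the most significant 16-bit element. */
typedef struct { int16_t e[4]; } vec16x4;
typedef struct { int32_t h, l; } vec32x2;

/* Wrapping 32-bit add, done in unsigned arithmetic to avoid signed overflow.
 * The conversion back to int32_t assumes a two's complement target. */
static int32_t wrap_add32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Multiply/add variant: each product is added to its own 32-bit lane. */
void mula16x4_model(vec32x2 *d0, vec32x2 *d1, vec16x4 d2, vec16x4 d3)
{
    assert(d0 && d1);
    d0->h = wrap_add32(d0->h, (int32_t)d2.e[3] * d3.e[3]);
    d0->l = wrap_add32(d0->l, (int32_t)d2.e[2] * d3.e[2]);
    d1->h = wrap_add32(d1->h, (int32_t)d2.e[1] * d3.e[1]);
    d1->l = wrap_add32(d1->l, (int32_t)d2.e[0] * d3.e[0]);
}
```

Note that the two upper products go to d0 and the two lower products go to d1, so a full result occupies two output registers.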
C syntax:
ae_f16x4 AE_MULFP16X4S (ae_f16x4 d0, ae_f16x4 d1);
AE_MULFP16X4RAS d, d0, d1 [fusion_slot1] AVS ONLY
Four way SIMD 1.15x1.15-bit into 1.15-bit signed multiply with saturation and rounding.
These are emulated using two-instruction sequences.
d.3 ← saturate1.15(round+∞_{2.15}(d0.3_{1.15} × d1.3_{1.15}))
d.2 ← saturate1.15(round+∞_{2.15}(d0.2_{1.15} × d1.2_{1.15}))
d.1 ← saturate1.15(round+∞_{2.15}(d0.1_{1.15} × d1.1_{1.15}))
d.0 ← saturate1.15(round+∞_{2.15}(d0.0_{1.15} × d1.0_{1.15}))
The operation is bit-exact with the ITU-T mult_r basic primitives.
C syntax:
ae_f16x4 AE_MULFP16X4RAS (ae_f16x4 d0, ae_f16x4 d1);
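The per-element computation (asymmetric rounding followed by 1.15 saturation) can be sketched in portable C for one 16-bit lane. The function mulfp16ras_model is a hypothetical model, not a Fusion intrinsic, and it has not been checked for bit-exactness against the ITU-T reference code.

```c
#include <assert.h>
#include <stdint.h>

/* One lane: saturate1.15(round+inf(a * b)), with a and b in 1.15 format. */
int16_t mulfp16ras_model(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* 2.30 product */
    int32_t biased  = product + 0x4000;         /* add half of the 1.15 LSB */

    /* floor(biased / 2^15), avoiding a right shift of a negative value */
    int32_t q = biased / 32768;
    if (biased % 32768 < 0)
        q--;

    if (q > INT16_MAX)                          /* only -1.0 x -1.0 gets here */
        q = INT16_MAX;
    assert(q >= INT16_MIN);
    return (int16_t)q;
}
```

A product of exactly +0.5 LSB rounds up to 1, while a product of exactly -0.5 LSB rounds up to 0, which is the asymmetric behavior.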
The following intrinsics are provided to ensure HiFi 2 code compatibility and are implemented
through a sequence of one or more of the multiplication operations described in this section:
AE_ABS32 d, d0 [ fusion_slot1 ]
Absolute value of 32-bit element of an AE_DR register d0 without saturation, with result
placed in d.
d.H ← |d0.H|
d.L ← |d0.L|
Note: C intrinsic AE_ABSP24 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_ABS32.
C syntax:
ae_int32x2 AE_ABS32 (ae_int32x2 d0);
ae_p24x2s AE_ABSP24 (ae_p24x2s d0);
AE_ABS32S d, d0 [ fusion_slot1, Inst ]
Absolute value, saturating, of a 32-bit element of an AE_DR register d0 with result placed in
d.
d.H ← saturate1.31(|d0.H|)
d.L ← saturate1.31(|d0.L|)
C syntax:
ae_int32x2 AE_ABS32S (ae_int32x2 d0);
AE_ABS24S d, d0 [ fusion_slot1, Inst ]
Absolute value, with 24-bit (9.23) saturation of a 32-bit element of an AE_DR register d0 with
result placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ABSSP24S is provided to ensure HiFi 2 code portability. It is
implemented through operation AE_ABS24S.
d.H ← sext9.23(saturate1.23(|d0.H_{9.23}|))
d.L ← sext9.23(saturate1.23(|d0.L_{9.23}|))
C syntax:
ae_f24x2 AE_ABS24S (ae_f24x2 d0);
ae_p24x2s AE_ABSSP24S (ae_p24x2s d0);
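The 24-bit saturation can be sketched in portable C for one 32-bit element. The function abs24s_model and its overflow flag parameter are hypothetical; the flag stands in for the AE_OVERFLOW state.

```c
#include <assert.h>
#include <stdint.h>

/* One element: absolute value of a 9.23 input, saturated to 1.23.
 * *overflow models the sticky AE_OVERFLOW state (set, never cleared). */
int32_t abs24s_model(int32_t x, int *overflow)
{
    assert(overflow);

    /* Widen first so that |INT32_MIN| is representable. */
    int64_t mag = (x < 0) ? -(int64_t)x : (int64_t)x;

    if (mag > 0x7FFFFF) {          /* largest positive 1.23 value */
        mag = 0x7FFFFF;
        *overflow = 1;
    }
    return (int32_t)mag;           /* non-negative, so sign extension is trivial */
}
```

The most negative 1.23 value, -8388608, has no positive counterpart in 1.23 format, so its absolute value saturates to 8388607 and sets the flag.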
AE_ABS16S d, d0 [ fusion_slot1, Inst ]
Absolute value, saturating, element-wise of 16-bit elements of an AE_DR register d0 with
result placed in d.
C syntax:
ae_f16x4 AE_ABS16S (ae_f16x4 d0);
C syntax:
xtbool4 AE_LT16 (ae_int16x4 d0, ae_int16x4 d1);
AE_LE16 b3210, d0, d1 [ fusion_slot1 ]
Compare, less-than-or-equal, two 16-bit signed elements of AE_DR registers d0 and d1;
results go to a four element Boolean register.
b3210[3] ← (d0.3 <= d1.3) ? 1 : 0
b3210[2] ← (d0.2 <= d1.2) ? 1 : 0
b3210[1] ← (d0.1 <= d1.1) ? 1 : 0
b3210[0] ← (d0.0 <= d1.0) ? 1 : 0
C syntax:
xtbool4 AE_LE16 (ae_int16x4 d0, ae_int16x4 d1);
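The four-element Boolean result of AE_LE16 can be pictured as a 4-bit mask. The portable C sketch below uses a hypothetical function le16_model and plain arrays in place of the AE_DR and Boolean register types; array index i corresponds to element i.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a 4-bit mask: bit i is set when d0[i] <= d1[i] (signed compare). */
unsigned le16_model(const int16_t d0[4], const int16_t d1[4])
{
    assert(d0 && d1);

    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        if (d0[i] <= d1[i])
            mask |= 1u << i;
    }
    return mask;
}
```

A mask of this form is the kind of value that a conditional move such as AE_MOVT16X4 consumes to select elements lane by lane.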
AE_EQ16 b3210, d0, d1 [ fusion_slot1 ]
Compare, equal, two AE_DR registers d0 and d1; results go to a four element Boolean
register.
b3210[3] ← (d0.3 == d1.3) ? 1 : 0
b3210[2] ← (d0.2 == d1.2) ? 1 : 0
b3210[1] ← (d0.1 == d1.1) ? 1 : 0
b3210[0] ← (d0.0 == d1.0) ? 1 : 0
C syntax:
xtbool4 AE_EQ16 (ae_int16x4 d0, ae_int16x4 d1);
AE_ADD64 d, d0, d1 [ fusion_slot1, Inst ]
AE_SUB64 d, d0, d1 [fusion_slot1, Inst ]
Add/Subtract two 64-bit AE_DR registers d0 and d1 without saturation, with result placed in
d.
d ← d0 ± d1
Note: C intrinsics AE_ADDQ56 and AE_SUBQ56 are provided to ensure HiFi 2 code
portability. They are implemented through operations AE_ADD64 and AE_SUB64,
respectively.
C syntax:
ae_int64 AE_ADD64 (ae_int64 d0, ae_int64 d1);
ae_int64 AE_SUB64 (ae_int64 d0, ae_int64 d1);
ae_q56s AE_ADDQ56 (ae_q56s d0, ae_q56s d1);
ae_q56s AE_SUBQ56 (ae_q56s d0, ae_q56s d1);
C syntax:
ae_q56s AE_NEGSQ56S (ae_q56s d0);
AE_ABS64 d, d0 [fusion_slot1, Inst ]
Get absolute value of 64-bit AE_DR register d0 without saturation, with result placed in d.
d ← |d0|
Note: C intrinsic AE_ABSQ56 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_ABS64.
C syntax:
ae_int64 AE_ABS64 (ae_int64 d0);
ae_q56s AE_ABSQ56 (ae_q56s d0);
AE_ABS64S d, d0 [ fusion_slot1 ]
Get absolute value, saturating, of 64-bit AE_DR register d0, with result placed in d. In case
of saturation, state AE_OVERFLOW is set to 1.
d ← saturate1.63(|d0|)
C syntax:
ae_q64 AE_ABS64S (ae_q64 d0);
AE_ABSSQ56S d, d0 [ fusion_slot1 ]
Get absolute value, with 56-bit (9.55) saturation of 64-bit AE_DR register d0, with result
placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
d ← sext9.55(saturate1.55(|d0_{9.55}|))
Note: These are legacy instructions meant to support HiFi 2 code portability.
C syntax:
ae_q56s AE_ABSSQ56S (ae_q56s d0);
AE_MAX64 d, d0, d1 [ fusion_slot1 ]
AE_MIN64 d, d0, d1 [ fusion_slot1 ]
Get maximum/minimum of two signed 64-bit AE_DR registers d0 and d1, with result placed
in d.
Maximum: d ← (d0 > d1) ? d0 : d1
Minimum: d ← (d0 < d1) ? d0 : d1
Note: C intrinsics AE_MAXQ56S and AE_MINQ56S are provided to ensure HiFi 2 code
portability. They are implemented through operations AE_MAX64 and AE_MIN64,
respectively. C intrinsics AE_MAXB64/AE_MINB64 are implemented through a sequence of
the AE_MAX64/AE_MIN64 and AE_LT64 operations and set the Boolean result only if the
d0 value is greater/less than the d1 value. C intrinsics AE_MAXBQ56S/AE_MINBQ56S are
implemented in a similar way and are provided to ensure HiFi 2 code portability.
C syntax:
ae_int64 AE_MAX64 (ae_int64 d0, ae_int64 d1);
ae_int64 AE_MIN64 (ae_int64 d0, ae_int64 d1);
ae_q56s AE_MAXQ56S (ae_q56s d0, ae_q56s d1);
ae_q56s AE_MINQ56S (ae_q56s d0, ae_q56s d1);
void AE_MAXB64 (ae_int64 d /* out */, ae_int64 d0, ae_int64 d1,
xtbool b /* out */);
void AE_MINB64 (ae_int64 d /* out */, ae_int64 d0, ae_int64 d1,
xtbool b /* out */);
void AE_MAXBQ56S (ae_q56s d /* out */, ae_q56s d0, ae_q56s d1,
xtbool b /* out */);
void AE_MINBQ56S (ae_q56s d /* out */, ae_q56s d0, ae_q56s d1,
xtbool b /* out */);
AE_MAXABS64S d, d0, d1 [fusion_slot1]
AE_MINABS64S d, d0, d1 [ fusion_slot1 ]
Get maximum/minimum of absolute value of two 64-bit signed AE_DR registers d0 and d1.
The result is saturated to 64 bits and placed in d.
In case of saturation, state AE_OVERFLOW is set to 1.
Maximum: d ← saturate1.63((|d0| > |d1|) ? |d0| : |d1|)
Minimum: d ← saturate1.63((|d0| < |d1|) ? |d0| : |d1|)
C syntax:
ae_f64 AE_MAXABS64S (ae_f64 d0, ae_f64 d1);
ae_f64 AE_MINABS64S (ae_f64 d0, ae_f64 d1);
All shift operations start with the prefix AE_S. The following letter is either L or R signifying
whether the primary shift direction is left or right. The next letter is either L or A signifying
whether a shift is logical (fill in 0’s on a right shift) or arithmetic (sign-extend on a right shift).
The next letter is I for immediate shifts, A for AR shifts and S for AE_SAR shifts. Following is
a number signifying the size of the element being shifted and an optional R for right shifts
that round rather than truncate and an optional S for left shifts that saturate.
C syntax:
ae_int24x2 AE_SLAI24 (ae_int24x2 d0, immediate i);
ae_p24x2s AE_SLLIP24 (ae_p24x2s d0, immediate i);
AE_SRLI24 d, d0, i [ fusion_slot0 ]
Shift right logical (zero-extending), element-wise, two 24-bit elements of AE_DR register d0
by immediate, with result placed in d. Note that the sign of the result will be zero for any non-
zero shift amount.
d.L = sext24(d0.L[23:0] >>u i);
d.H = sext24(d0.H[23:0] >>u i).
C syntax:
ae_int24x2 AE_SRLI24 (ae_int24x2 d0, immediate i);
ae_p24x2s AE_SRLIP24 (ae_p24x2s d0, immediate i);
AE_SRAI24 d, d0, i [ fusion_slot0, Inst ]
Shift right arithmetic (sign-extending), element-wise, two 24-bit elements of AE_DR register
d0 by immediate value, with result placed in d.
d.L = sext24(d0.L[23:0] >>s i);
d.H = sext24(d0.H[23:0] >>s i).
C syntax:
ae_int24x2 AE_SRAI24 (ae_int24x2 d0, immediate i);
ae_p24x2s AE_SRAIP24 (ae_p24x2s d0, immediate i);
AE_SLAI24S d, d0, i [ fusion_slot0, Inst ]
Shift left, saturating, element-wise, two 24-bit signed elements of AE_DR register d0 by
immediate, with result placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
d.L = sext24(saturate24(d0.L[23:0] << i));
d.H = sext24(saturate24(d0.H[23:0] << i)).
Note: C intrinsic AE_SLLISP24S is implemented through operation AE_SLAI24S.
C syntax:
ae_f24x2 AE_SLAI24S (ae_f24x2 d0, immediate i);
ae_p24x2s AE_SLLISP24S (ae_p24x2s d0, immediate i);
C syntax:
ae_int24x2 AE_SRAS24 (ae_int24x2 d0);
ae_p24x2s AE_SRASP24 (ae_p24x2s d0);
AE_SLAS24S d, d0 [ fusion_slot0 ]
Shift left or right, arithmetic (sign-extending), saturating, element-wise, two 24-bit elements
of AE_DR register d0 by shift amount register AE_SAR, with result placed in d. For a positive
shift amount, the value is shifted to the left. In case of a negative shift amount, the value is
shifted to the right. In case of saturation, state AE_OVERFLOW is set to 1.
d.L = sext24((SAR ≥ 0) ? saturate24(d0.L[23:0] << SAR) : (d0.L[23:0] >>s −SAR));
d.H = sext24((SAR ≥ 0) ? saturate24(d0.H[23:0] << SAR) : (d0.H[23:0] >>s −SAR)).
Note: C intrinsic AE_SLLSSP24S is implemented through operation AE_SLAS24S. Note
that in the case of a negative shift amount, this intrinsic performs an arithmetic right shift.
C syntax:
ae_f24x2 AE_SLAS24S (ae_f24x2 d0);
ae_p24x2s AE_SLLSSP24S (ae_p24x2s d0);
AE_SLAI32 d, d0, i [ fusion_slot0, Inst]
Shift left, element-wise, two 32-bit elements of AE_DR register d0 by immediate value, with
result placed in d.
d.L = d0.L << i;
d.H = d0.H << i.
C syntax:
ae_int32x2 AE_SLAI32 (ae_int32x2 d0, immediate i);
AE_SRLI32 d, d0, i [ fusion_slot0, Inst]
Shift right logical (zero-extending), element-wise, two 32-bit elements of AE_DR register d0
by immediate value, with result placed in d.
d.L = d0.L >>u i;
d.H = d0.H >>u i.
C syntax:
ae_int32x2 AE_SRLI32 (ae_int32x2 d0, immediate i);
AE_SRAI32 d, d0, i [ fusion_slot0, Inst]
Shift right arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR register
d0 by immediate value, with result placed in d.
d.L = d0.L >>s i;
d.H = d0.H >>s i.
C syntax:
ae_int32x2 AE_SRAI32 (ae_int32x2 d0, immediate i);
C syntax:
ae_int32x2 AE_SRAA32 (ae_int32x2 d0, int32 sa);
AE_SLAA32S d, d0, a0 [fusion_slot0, Inst ]
Shift left or right arithmetic (sign-extending), saturating, element-wise, two 32-bit elements of
AE_DR register by AR register a0, with result placed in d. For a positive shift amount, the
value is shifted to the left. In case of a negative shift amount, the value is shifted to the right
and sign-extended. In case of saturation, state AE_OVERFLOW is set to 1.
d.L = (a0 ≥ 0) ? saturate32(d0.L << a0) : (d0.L >>s −a0);
d.H = (a0 ≥ 0) ? saturate32(d0.H << a0) : (d0.H >>s −a0).
C syntax:
ae_f32x2 AE_SLAA32S (ae_f32x2 d0, int32 a0);
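The bidirectional, saturating shift can be sketched in portable C for one 32-bit element. The function slaa32s_model is a hypothetical model, not a Fusion intrinsic; the overflow parameter stands in for the AE_OVERFLOW state, and the restriction of the shift amount to the range -31 to 31 is an assumption of this sketch rather than a documented limit.

```c
#include <assert.h>
#include <stdint.h>

/* One element: left shift with 32-bit saturation for a0 >= 0,
 * arithmetic right shift by -a0 for a0 < 0. */
int32_t slaa32s_model(int32_t x, int a0, int *overflow)
{
    assert(overflow && a0 > -32 && a0 < 32);  /* range is an assumption */

    if (a0 >= 0) {
        /* Multiply in 64 bits so the unsaturated result is exact. */
        int64_t r = (int64_t)x * ((int64_t)1 << a0);
        if (r > INT32_MAX) { *overflow = 1; return INT32_MAX; }
        if (r < INT32_MIN) { *overflow = 1; return INT32_MIN; }
        return (int32_t)r;
    }

    /* Arithmetic right shift expressed as floor division. */
    int64_t d = (int64_t)1 << -a0;
    int64_t q = x / d;
    if (x % d < 0)
        q--;
    return (int32_t)q;
}
```

Only the left-shift path can saturate; the right-shift path always produces a representable result.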
AE_SLAS32 d, d0 [ fusion_slot0]
Shift left or right arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR
register d0 by the shift amount register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the left. For a negative shift amount, the value is shifted to
the right and sign-extended.
d.L = (SAR ≥ 0) ? (d0.L << SAR) : (d0.L >>s −SAR);
d.H = (SAR ≥ 0) ? (d0.H << SAR) : (d0.H >>s −SAR).
C syntax:
ae_int32x2 AE_SLAS32 (ae_int32x2 d0);
AE_SRAA32RS d, d0, a0 [fusion_slot0 ]
Shift right or left arithmetic (sign-extending), element-wise, 32-bit elements of AE_DR
register d0 by AR register a0, with result placed in d. For a positive shift amount, the value
is shifted to the right and rounded. For a negative shift amount, the value is shifted to the left.
The result is saturated, corresponding to ITU intrinsic L_shr_r.
C syntax:
ae_f32x2 AE_SRAA32RS (ae_f32x2 d0, int32 a0);
AE_SRAA32S d, d0, a0 [fusion_slot0, Inst ]
Shift right arithmetic (sign-extending), saturating, element-wise, two 32-bit elements of
AE_DR register d0 by AR register a0, with result placed in d corresponding to ITU intrinsic
L_shr.
C syntax:
ae_f32x2 AE_SRAA32S (ae_f32x2 d0, int32 a0);
AE_SRLS32 d, d0 [ fusion_slot0]
Shift right or left logical (zero-extending), element-wise, two 32-bit elements AE_DR register
d0 by the shift amount in register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the right. For a negative shift amount, the value is shifted to
the left.
d.L = (SAR ≥ 0) ? (d0.L >>u SAR) : (d0.L << −SAR);
d.H = (SAR ≥ 0) ? (d0.H >>u SAR) : (d0.H << −SAR).
C syntax:
ae_int32x2 AE_SRLS32 (ae_int32x2 d0);
AE_SRAS32 d, d0 [ fusion_slot0]
Shift right or left arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR
register d0 by the shift amount register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the right. In case of a negative shift amount, the value is shifted
to the left.
d.L = (SAR ≥ 0) ? (d0.L >>s SAR) : (d0.L << −SAR);
d.H = (SAR ≥ 0) ? (d0.H >>s SAR) : (d0.H << −SAR).
C syntax:
ae_int32x2 AE_SRAS32 (ae_int32x2 d0);
AE_SLAS32S d, d0 [ fusion_slot0]
Shift left or right arithmetic (sign-extending), saturating, element-wise, two 32-bit elements of
AE_DR register d0 by the shift amount register AE_SAR, with result placed in d. For a
positive shift amount, the value is shifted to the left. For a negative shift amount, the value is
shifted to the right and sign-extended. In case of saturation, state AE_OVERFLOW is set to
1.
d.L = (SAR ≥ 0) ? saturate32(d0.L << SAR) : (d0.L >>s −SAR);
d.H = (SAR ≥ 0) ? saturate32(d0.H << SAR) : (d0.H >>s −SAR).
C syntax:
ae_f32x2 AE_SLAS32S (ae_f32x2 d0);
AE_SLAI64 d, d0, i [ fusion_slot0, Inst]
Shift left, 64-bit AE_DR register d0 by immediate value, with result placed in d.
d = d0 << i
Note: C intrinsic AE_CVTQ56P32S_L converts a signed 1.31-bit value in d0.L to a 1.63-bit
value in d. It is implemented through operation AE_SLAI64 with a shift amount of 32.
C syntax:
ae_int64 AE_SLAI64 (ae_int64 d0, immediate i);
ae_int64 AE_CVTQ56P32S_L (ae_int32x2 d0);
Note: This instruction is designed to work only when d >= 0, d0.L(.H) > 0 and d <= d0.L(.H).
C syntax:
void AE_DIV64D32_L (ae_int64 d, ae_int32x2 d0);
C syntax:
ae_f32x2 AE_ROUND32X2F64SSYM (ae_f64 dh, ae_f64 dl);
ae_f32x2 AE_ROUND32F64SSYM (ae_f64 d0);
AE_ROUND32X2F64SASYM d, dh, dl [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 1.63-bit values from AE_DR registers dh and dl to 1.31-
bit values, and store the results in the two elements of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND32F64SASYM is implemented through operation
AE_ROUND32X2F64SASYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f32x2 AE_ROUND32X2F64SASYM (ae_f64 dh, ae_f64 dl);
ae_f32x2 AE_ROUND32F64SASYM (ae_f64 d0);
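Rounding a 1.63-bit value to 1.31 bits with asymmetric rounding and saturation can be sketched in portable C for one input. The function round32f64sasym_model is a hypothetical model, not a Fusion intrinsic; the overflow parameter stands in for the AE_OVERFLOW state.

```c
#include <assert.h>
#include <stdint.h>

/* One input: drop the low 32 bits of a 1.63 value, rounding halves
 * toward +infinity, then saturate to 1.31. */
int32_t round32f64sasym_model(int64_t d, int *overflow)
{
    const int64_t one = (int64_t)1 << 32;  /* weight of the 1.31 LSB */
    assert(overflow);

    /* Split into floor quotient q and non-negative remainder r. */
    int64_t q = d / one;
    int64_t r = d % one;
    if (r < 0) {
        q--;
        r += one;
    }

    if (r >= one / 2)              /* half or more: round up */
        q++;

    if (q > INT32_MAX) {           /* rounding up can overflow 1.31 */
        q = INT32_MAX;
        *overflow = 1;
    }
    return (int32_t)q;
}
```

Saturation is only possible on the positive side, when the input is so close to +1.0 that rounding up exceeds the largest 1.31 value.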
AE_ROUNDSP16F24SYM d, d0 [ fusion_slot1 ]
Round symmetrically (away from 0), saturate each 9.23-bit element of AE_DR register d0 to
a 1.15-bit value, sign-extend it and store the results as 9.23-bit values in the two elements of
AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSP16SYM is implemented through operation
AE_ROUNDSP16F24SYM and is provided to ensure HiFi 2 code portability.
C syntax:
ae_f32x2 AE_ROUNDSP16F24SYM (ae_f32x2 d0);
ae_int24x2s AE_ROUNDSP16SYM (ae_int24x2s d0);
AE_ROUNDSP16F24ASYM d, d0 [ fusion_slot1 ]
Round asymmetrically, saturate the two 9.23-bit elements of AE_DR register d0 to 1.15-bit
values, sign-extend it and store the results as 9.23-bit values in the two elements of AE_DR
register d. In case of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f32x2 AE_ROUNDSP16F24ASYM (ae_f32x2 d0);
ae_int24x2s AE_ROUNDSP16ASYM (ae_int24x2s d0);
AE_ROUND32X2F48SSYM d, dh, dl [ fusion_slot1 ]
Round symmetrically (away from 0), saturate the 17.47-bit values from AE_DR registers dh
and dl to 1.31-bit values and store the results into the two elements of AE_DR register d.
In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND32F48SSYM is implemented through operation
AE_ROUND32X2F48SSYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f32x2 AE_ROUND32X2F48SSYM (ae_f64 dh, ae_f64 dl);
ae_f32x2 AE_ROUND32F48SSYM (ae_f64 d0);
AE_ROUND32X2F48SASYM d, dh, dl [ fusion_slot1 ]
Round asymmetrically, saturate the 17.47-bit values from AE_DR registers dh and dl to 1.31-
bit values and store the results into the two elements of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND32F48SASYM is implemented through operation
AE_ROUND32X2F48SASYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f32x2 AE_ROUND32X2F48SASYM (ae_f64 dh, ae_f64 dl);
ae_f32x2 AE_ROUND32F48SASYM (ae_f64 d0);
AE_ROUND24X2F48SSYM d, dh, dl [ fusion_slot1, Inst ]
Round symmetrically (away from 0), saturate the 17.47-bit values from AE_DR registers dh
and dl to 1.23-bit values, sign-extend it and store the results as 9.23-bit values in the two
elements of AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND24F48SSYM is implemented through operation
AE_ROUND24X2F48SSYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f24x2 AE_ROUND24X2F48SSYM (ae_f64 dh, ae_f64 dl);
ae_f24x2 AE_ROUND24F48SSYM (ae_f64 d0);
AE_ROUND24X2F48SASYM d, dh, dl [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 17.47-bit values from AE_DR registers dh and dl to 1.23-
bit values, sign-extend it and store the results as 9.23-bit values in the two elements of
AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND24F48SASYM is implemented through operation
AE_ROUND24X2F48SASYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f24x2 AE_ROUND24X2F48SASYM (ae_f64 dh, ae_f64 dl);
ae_f24x2 AE_ROUND24F48SASYM (ae_f64 d0);
AE_ROUNDSP16Q48X2SYM d, dh, dl [ fusion_slot1 ]
Round symmetrically (away from 0), saturate the 17.47-bit values from AE_DR registers dh
and dl to 1.15-bit values, sign-extend it and store the results as 9.23-bit values in the two
elements of AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSP16Q48SYM is implemented through operation
AE_ROUNDSP16Q48X2SYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f24x2 AE_ROUNDSP16Q48X2SYM (ae_f64 dh, ae_f64 dl);
ae_f24x2 AE_ROUNDSP16Q48SYM (ae_f64 d0);
AE_ROUNDSP16Q48X2ASYM d, dh, dl [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 17.47-bit values from AE_DR registers dh and dl to 1.15-
bit values, sign-extend it and store the results as 9.23-bit values in the two elements of
AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSP16Q48ASYM is implemented through operation
AE_ROUNDSP16Q48X2ASYM; it rounds a single input AE_DR value and replicates the
result in the two elements of the output AE_DR register.
C syntax:
ae_f24x2 AE_ROUNDSP16Q48X2ASYM (ae_f64 dh, ae_f64 dl);
ae_f24x2 AE_ROUNDSP16Q48ASYM (ae_f64 d0);
AE_ROUND16X4F32SASYM d, dh, dl [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 1.31-bit values from AE_DR registers dh and dl to 1.15-
bit values, and store the results in the four elements of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f16x4 AE_ROUND16X4F32SASYM (ae_f32x2 dh, ae_f32x2 dl);
AE_ROUND16X4F32SSYM d, dh, dl [ fusion_slot1 ]
Round symmetrically, saturate the 1.31-bit values from AE_DR registers dh and dl to 1.15-
bit values, and store the results in the four elements of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f16x4 AE_ROUND16X4F32SSYM (ae_f32x2 dh, ae_f32x2 dl);
AE_ROUNDSQ32F48SYM d, d0 [ fusion_slot1, Inst ]
Round symmetrically (away from 0), saturate the 17.47-bit value from AE_DR register d0 to
a 1.31-bit value, sign-extend it and store the result as 17.47-bit value in AE_DR register d.
In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSQ32SYM is implemented through operation
AE_ROUNDSQ32F48SYM and is provided to ensure HiFi 2 code portability.
C syntax:
ae_f64 AE_ROUNDSQ32F48SYM (ae_f64 d0);
ae_q56s AE_ROUNDSQ32SYM (ae_q56s d0);
C syntax:
ae_f64 AE_SAT48S (ae_f64 d0);
ae_q56s AE_SATQ48S (ae_q56s d0);
AE_SAT24S d, d0 [ fusion_slot1 ]
Saturate the two 9.23 values in AE_DR register d0 into 1.23 values and sign-extend into
9.23. In case of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f24x2 AE_SAT24S (ae_int32x2 d0);
AE_SAT16X4 d, d0,d1 [ fusion_slot1 ]
Saturate the four 32-bit integral values in AE_DR registers d0 and d1 to 16-bit integral
values. In case of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_int16x4 AE_SAT16X4 (ae_int32x2 d0, ae_int32x2 d1);
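The four-way saturation can be sketched in portable C. The function sat16x4_model is a hypothetical model, not a Fusion intrinsic. Two points are assumptions of this sketch: the elements of d0 are taken to land in the two upper result elements and those of d1 in the two lower ones, and index 1 of each input array is taken to be the H element.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate one 32-bit integer to the 16-bit range. */
static int16_t sat16(int32_t v, int *overflow)
{
    if (v > INT16_MAX) { *overflow = 1; return INT16_MAX; }
    if (v < INT16_MIN) { *overflow = 1; return INT16_MIN; }
    return (int16_t)v;
}

/* d0[1]/d1[1] model the H elements, d0[0]/d1[0] the L elements.
 * The output element order is an assumption of this sketch. */
void sat16x4_model(int16_t d[4], const int32_t d0[2], const int32_t d1[2],
                   int *overflow)
{
    assert(d && d0 && d1 && overflow);
    d[3] = sat16(d0[1], overflow);
    d[2] = sat16(d0[0], overflow);
    d[1] = sat16(d1[1], overflow);
    d[0] = sat16(d1[0], overflow);
}
```

The overflow flag is shared by all four lanes, so it reports that at least one element saturated, not which one.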
AE_SEXT32 d, d0,i [ fusion_slot0 ]
Sign-extend (SIMD). Takes the contents of each 32-bit element of register d0 and replicates
the bit specified by its immediate operand (in the range 7 to 22) to the high bits and writes
the results to register d.
C syntax:
ae_int32x2 AE_SEXT32 (ae_int32x2 d0, immediate i);
AE_SEXT32X2D16.32 {.10} d, d0 [ fusion_slot0]
Promote the two higher (or lower) 16-bit elements from register d0 and place into the lower
16-bit elements of each pair of AE_DR register d. The remaining upper 16-bits of each half
are sign extended. These correspond to ITU intrinsics L_deposit_l.
C syntax:
ae_int32x2 AE_SEXT32X2D16_32(ae_int16x4 d);
AE_CVTP24A16X2.LL (.LH, .HL, .HH) d, ah, al [ fusion_slot0 ]
Sign-extend and copy the 16 most (.HL, .HH) or least (.LL, .LH) significant bits from the AR
register ah into the 24 most significant bits of d.H, and the 16 most (.LH, .HH) or least (.LL,
.HL) significant bits from the AR register al into the 24 most significant bits of d.L. In other
words, convert 1.15-bit values in AR to 9.23-bit values in AE_DR.
Note: C intrinsic AE_CVTP24A16X2 is equivalent to and implemented through operation
AE_CVTP24A16X2.LL. C intrinsic AE_CVTP24A16 sign-extends and replicates the 16 least
significant bits from an AR register into the 24 most significant bits of both elements of an
AE_DR register. It is implemented through operation AE_CVTP24A16X2.LL.
C syntax:
ae_int24x2 AE_CVTP24A16X2_LL (unsigned ah, unsigned al);
ae_int24x2 AE_CVTP24A16X2 (unsigned ah, unsigned al);
ae_int24x2 AE_CVTP24A16 (unsigned a);
AE_CVT64A32 d, a [ fusion_slot0 ]
Convert a signed 1.31-bit value in AR register a to a 1.63-bit value in AE_DR register d.
C syntax:
ae_f64 AE_CVT64A32 (unsigned a);
AE_CVTQ56A32 d, a [ fusion_slot0]
Convert a signed 1.31-bit value in an AR register a to a 9.55-bit value in AE_DR register d.
C syntax:
ae_q56s AE_CVTQ56A32S (unsigned a);
C syntax:
ae_int64 AE_MOV64 (ae_int64 d0);
ae_int32x2 AE_MOV32X2 (ae_int32x2 d0);
ae_q56s AE_MOVQ56 (ae_q56s d0);
ae_p24x2s AE_MOVP48 (ae_p24x2s d0);
AE_MOVT32X2 d, d0, bhl [ fusion_slot1, Inst ]
If bhl[0] is set, copy the contents of d0.L to d.L;
If bhl[1] is set, copy the contents of d0.H to d.H.
Note: C intrinsic AE_MOVTP24X2 is implemented through operation AE_MOVT32X2 and is
provided to ensure HiFi 2 code portability.
C syntax:
void AE_MOVT32X2 (ae_int32x2 d /*inout*/, ae_int32x2 d0,
xtbool2 bhl);
void AE_MOVTP24X2 (ae_p24x2s d /*inout*/, ae_p24x2s d0,
xtbool2 bhl);
AE_MOVF32X2 d, d0, bhl [ fusion_slot1, Inst ]
If bhl[0] is clear, copy the contents of d0.L to d.L;
If bhl[1] is clear, copy the contents of d0.H to d.H.
Note: C intrinsic AE_MOVFP24X2 is implemented through operation AE_MOVF32X2 and is
provided to ensure HiFi 2 code portability.
C syntax:
void AE_MOVF32X2 (ae_int32x2 d /*inout*/, ae_int32x2 d0,
xtbool2 bhl);
void AE_MOVFP24X2 (ae_p24x2s d /*inout*/, ae_p24x2s d0,
xtbool2 bhl);
AE_MOVT16X4 d, d0, b3210 [ fusion_slot1 ]
If b3210[0] is set, copy the contents of d0.0 to d.0;
If b3210[1] is set, copy the contents of d0.1 to d.1.
If b3210[2] is set, copy the contents of d0.2 to d.2;
If b3210[3] is set, copy the contents of d0.3 to d.3.
C syntax:
void AE_MOVT16X4 (ae_int16x4 d /*inout*/, ae_int16x4 d0,
xtbool4 b3210);
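The per-element conditional move can be sketched in portable C. The function movt16x4_model is a hypothetical model, not a Fusion intrinsic; the Boolean register is represented as a 4-bit mask and array index i corresponds to element i.

```c
#include <assert.h>
#include <stdint.h>

/* For each set bit i of b3210, copy d0[i] into d[i]; other lanes keep
 * their previous contents, which is why d is an in/out operand. */
void movt16x4_model(int16_t d[4], const int16_t d0[4], unsigned b3210)
{
    assert(d && d0 && (b3210 & ~0xFu) == 0);

    for (int i = 0; i < 4; i++) {
        if ((b3210 >> i) & 1u)
            d[i] = d0[i];
    }
}
```

With a mask of 0x5, only elements 0 and 2 are replaced.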
0 5432
1 7632
2 7610
3 5410
4 4321
5 6543
6 7520
7 7531 (used for the AE_TRUNC16X4F32 operation)
8 6420
9 7362
10 5146
11 5140
12 2301
13 7160
14 5342
15 7351
C syntax:
ae_int32x2 AE_SEL32_LL (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_SELP24_LL (ae_p24x2s d0, ae_p24x2s d1);
AE_SEL32.LH d, d0, d1 [ fusion_slot1 ]
d.H = d0.L;
d.L = d1.H.
Note: AE_SEL32.LH is a proto implemented using AE_SEL16I. Also, C intrinsic
AE_SELP24_LH is similar to proto AE_SEL32.LH and is implemented through operation
AE_SEL16I. It is provided to ensure HiFi 2 code portability.
C syntax:
ae_int32x2 AE_SEL32_LH (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_SELP24_LH (ae_p24x2s d0, ae_p24x2s d1);
AE_SEL32.HL d, d0, d1 [ fusion_slot1 ]
d.H = d0.H;
d.L = d1.L.
Note: AE_SEL32.HL is a proto implemented using AE_SEL16I. Also, C intrinsic
AE_SELP24_HL is similar to proto AE_SEL32.HL and is implemented through operation
AE_SEL16I. It is provided to ensure HiFi 2 code portability.
C syntax:
ae_int32x2 AE_SEL32_HL (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_SELP24_HL (ae_p24x2s d0, ae_p24x2s d1);
AE_SEL32.HH d, d0, d1 [ fusion_slot1 ]
d.H = d0.H;
d.L = d1.H.
Note: AE_SEL32.HH is a proto implemented using AE_SEL16I. Also, C intrinsic
AE_SELP24_HH is similar to proto AE_SEL32.HH and is implemented through operation
AE_SEL16I. It is provided to ensure HiFi 2 code portability.
C syntax:
ae_int32x2 AE_SEL32_HH (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_SELP24_HH (ae_p24x2s d0, ae_p24x2s d1);
AE_SEL16.7362 (5146, 6543, 4321, 7520, 5410, 5432, 7610, 7632, 6420) d, d0, d1 [fusion_slot1 ]
Combine 16-bit elements from d0 and d1 into d. Elements are numbered so that 7
corresponds to the most significant 16 bits of input register d0, down to 0, which
corresponds to the least significant 16 bits of input register d1. For example,
AE_SEL16.7362 selects elements as follows:
d0:d1 = [7 6 5 4 | 3 2 1 0] -> d = [7 3 6 2]
Note: AE_SEL16.7632 and its variants are protos implemented using AE_SEL16I.
C syntax:
ae_int16x4 AE_SEL16_7362 (ae_int16x4 d0, ae_int16x4 d1);
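The element numbering used by the AE_SEL16 variants can be illustrated with a small portable C model. It is a sketch only: `model_int16x4`, `model_pick`, and `model_sel16` are hypothetical names, and the model assumes that the four digits of the pattern list the result elements from most significant to least significant.

```c
#include <stdint.h>

/* Hypothetical stand-in for an ae_int16x4 register; e[3] is the most significant element. */
typedef struct { int16_t e[4]; } model_int16x4;

/* Fetch element n (0..7) of the d0:d1 pair: 7..4 come from d0, 3..0 come from d1. */
static int16_t model_pick(model_int16x4 d0, model_int16x4 d1, unsigned n)
{
    return (n >= 4u) ? d0.e[n - 4u] : d1.e[n];
}

/* Model of AE_SEL16.<pattern>; pattern is four decimal digits, each 0..7 (for example 7362). */
static model_int16x4 model_sel16(model_int16x4 d0, model_int16x4 d1, unsigned pattern)
{
    model_int16x4 d;
    d.e[3] = model_pick(d0, d1, (pattern / 1000u) % 10u);
    d.e[2] = model_pick(d0, d1, (pattern / 100u) % 10u);
    d.e[1] = model_pick(d0, d1, (pattern / 10u) % 10u);
    d.e[0] = model_pick(d0, d1, pattern % 10u);
    return d;
}
```

For pattern 7362, the model interleaves the two upper elements of d0 with the two upper elements of d1.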
AE_SHORTSWAP v, v0 [ fusion_slot1 ]
v.3 = v0.0;
v.2 = v0.1;
v.1 = v0.2;
v.0 = v0.3.
C syntax:
ae_int16x4 AE_SHORTSWAP (ae_int16x4 d0);
The computations performed by these operations are implied by their opcode mnemonics
and operands as given below.
C syntax:
unsigned AE_ADDBRBA32 (unsigned ab, unsigned ax);
C syntax:
ae_int64 AE_ZERO (void);
ae_int64 AE_ZERO64 (void);
ae_int32x2 AE_ZERO32 (void);
ae_int24x2 AE_ZERO24 (void);
ae_int16x4 AE_ZERO16 (void);
ae_q56s AE_ZEROQ56 (void);
ae_int24x2 AE_ZEROP48 (void);
AE_ZEROB br1, v0, v1 [ fusion_slot1]
br1 is set to true if any of the bytes in v0 or v1 is equal to zero.
C syntax:
xtbool AE_ZEROB (ae_int64 v0, ae_int64 v1);
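The zero-byte test performed by AE_ZEROB can be sketched in portable C as follows. The functions `model_has_zero_byte` and `model_zerob` are hypothetical names, and the two 64-bit operands are passed as plain `uint64_t` values rather than AE_DR registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if any of the eight bytes of a 64-bit value is zero. */
static bool model_has_zero_byte(uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        if (((v >> (8 * i)) & 0xFFu) == 0) {
            return true;
        }
    }
    return false;
}

/* Model of AE_ZEROB: true if any byte of v0 or v1 is zero. */
static bool model_zerob(uint64_t v0, uint64_t v1)
{
    return model_has_zero_byte(v0) || model_has_zero_byte(v1);
}
```

A test of this kind is typically useful for detecting a string terminator within a block of bytes.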
Specialized version of the core CLAMPS instruction that clamps ars to a signed 16-bit value.
C syntax:
int AE_CLAMPS16 (int ars);
AE_SEXT16 art, ars [ Inst16b ]
Specialized version of the core SEXT instruction that replicates bit 15 of ars to the upper
16 bits.
C syntax:
int AE_SEXT16 (int ars);
AE_ZEXT8 art, ars [ Inst16b ]
Specialized version of the core EXTUI instruction that zeroes the upper 24-bits of the result.
C syntax:
unsigned int AE_ZEXT8 (unsigned ars);
Specialized version of the core EXTUI instruction that zeroes the upper 16-bits of the result.
C syntax:
unsigned int AE_ZEXT16 (unsigned ars);
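The clamp and extension operations above can be expressed as the following portable C sketch. The `model_` function names are hypothetical and only illustrate the arithmetic described in the text.

```c
#include <stdint.h>

/* Model of AE_CLAMPS16: clamp a 32-bit value to the signed 16-bit range. */
static int32_t model_clamps16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return x;
}

/* Model of AE_SEXT16: replicate bit 15 into the upper 16 bits. */
static int32_t model_sext16(int32_t x)
{
    uint32_t low = (uint32_t)x & 0xFFFFu;
    return (int32_t)(low ^ 0x8000u) - 0x8000;
}

/* Model of AE_ZEXT8: zero the upper 24 bits. */
static uint32_t model_zext8(uint32_t x)  { return x & 0xFFu; }

/* Model of AE_ZEXT16: zero the upper 16 bits. */
static uint32_t model_zext16(uint32_t x) { return x & 0xFFFFu; }
```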
C syntax:
ae_f16x4 AE_MULFC16RAS (ae_f16x4 d0, ae_f16x4 d1);
void AE_MULAFC16RAS (ae_f16x4 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
AE_MULZAAAAQ16 q0, d0, d1 [ fusion_slot1 ]
AE_MULAAAAQ16 q0, d0, d1 [ fusion_slot1]
Quad 16x16-bit into 64-bit signed MAC without saturation:
q0 ← [q0 +] d0.3 × d1.3 + d0.2 × d1.2 + d0.1 × d1.1 + d0.0 × d1.0
C syntax:
ae_int64 AE_MULZAAAAQ16 (ae_int16x4 d0,
ae_int16x4 d1);
void AE_MULAAAAQ16 (ae_int64 q0 /* inout */,
ae_int16x4 d0, ae_int16x4 d1);
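The quad multiply-accumulate above amounts to a four-term dot product accumulated in 64 bits. A minimal portable C sketch follows. The function names are hypothetical, the four 16-bit elements are passed as arrays indexed by element number, and the model assumes that the 64-bit accumulator does not overflow.

```c
#include <stdint.h>

/* Model of AE_MULAAAAQ16: q0 + sum of d0[i] * d1[i] for elements 3..0, without saturation. */
static int64_t model_mulaaaaq16(int64_t q0, const int16_t d0[4], const int16_t d1[4])
{
    for (int i = 0; i < 4; i++) {
        q0 += (int64_t)d0[i] * d1[i];
    }
    return q0;
}

/* Model of AE_MULZAAAAQ16: the same sum, starting from a zero accumulator. */
static int64_t model_mulzaaaaq16(const int16_t d0[4], const int16_t d1[4])
{
    return model_mulaaaaq16(0, d0, d1);
}
```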
AE_MULC16S.L (.H) q0, d0, d1 [ fusion_slot1]
AE_MULAC16S.L (.H) q0, d0, d1 [ fusion_slot1]
Complex quad-mac 16x16-bit into 2x32-bit signed integer MAC with saturation:
For H version
q0.H ← saturate32 ([q0.H +] d0.3 × d1.3 - d0.2 × d1.2)
q0.L ← saturate32 ([q0.L +] d0.3 × d1.2 + d0.2 × d1.3)
For L version
q0.H ← saturate32 ([q0.H +] d0.1 × d1.1 - d0.0 × d1.0)
q0.L ← saturate32 ([q0.L +] d0.1 × d1.0 + d0.0 × d1.1)
C syntax:
ae_int32x2 AE_MULC16S_L (_H) (ae_int16x4 d0,
ae_int16x4 d1);
void AE_MULAC16S_L (_H) (ae_int32x2 q0 /* inout */,
ae_int16x4 d0, ae_int16x4 d1);
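The complex multiply-accumulate formulas above can be sketched in portable C. This is a reference model only: the names `model_int32x2`, `model_sat32`, and `model_mulac16s` are hypothetical, and the model assumes that saturation is applied once, to the complete sum of each lane. The caller passes the pair of 16-bit elements that form one complex value (elements 3 and 2 for the H version, elements 1 and 0 for the L version).

```c
#include <stdint.h>

/* Hypothetical stand-in for an ae_int32x2 register: h is the H lane, l is the L lane. */
typedef struct { int32_t h, l; } model_int32x2;

/* Saturate a 64-bit intermediate value to the signed 32-bit range. */
static int32_t model_sat32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* a_hi/a_lo are the upper/lower elements of the d0 pair; b_hi/b_lo those of the d1 pair.
   Pass a zero accumulator to model the non-accumulating AE_MULC16S form. */
static model_int32x2 model_mulac16s(model_int32x2 acc,
                                    int16_t a_hi, int16_t a_lo,
                                    int16_t b_hi, int16_t b_lo)
{
    model_int32x2 r;
    r.h = model_sat32((int64_t)acc.h + (int64_t)a_hi * b_hi - (int64_t)a_lo * b_lo);
    r.l = model_sat32((int64_t)acc.l + (int64_t)a_hi * b_lo + (int64_t)a_lo * b_hi);
    return r;
}
```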
AE_MUL16JS d, d0 [ fusion_slot1 ]
Two-way SIMD multiply by the imaginary number j. For each half, the upper 16 bits of d are
set to the lower 16 bits of d0. The lower 16 bits of d are set to the negation of the upper
16 bits of d0, saturated.
C syntax:
ae_f16x4 AE_MUL16JS (ae_f16x4 d0);
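The element movement described for AE_MUL16JS can be sketched in portable C as follows. The names `model_int16x4`, `model_neg16_sat`, and `model_mul16js` are hypothetical, and e[3] is assumed to be the most significant element, so that each half consists of the element pairs (3, 2) and (1, 0).

```c
#include <stdint.h>

/* Hypothetical stand-in for an ae_f16x4 register; e[3] is the most significant element. */
typedef struct { int16_t e[4]; } model_int16x4;

/* Negate with saturation: the negation of -32768 saturates to 32767. */
static int16_t model_neg16_sat(int16_t x)
{
    return (x == INT16_MIN) ? INT16_MAX : (int16_t)-x;
}

/* Model of AE_MUL16JS: per half, upper <- lower of d0, lower <- saturated -(upper of d0). */
static model_int16x4 model_mul16js(model_int16x4 d0)
{
    model_int16x4 d;
    d.e[3] = d0.e[2];
    d.e[2] = model_neg16_sat(d0.e[3]);
    d.e[1] = d0.e[0];
    d.e[0] = model_neg16_sat(d0.e[1]);
    return d;
}
```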
C syntax:
unsigned int AE_CALCRNG3 (void);
Add and subtract 16-bit elements of two AE_DR registers d0 and d1 without saturation and
shift the results arithmetically right by 0 or 1 place, depending on the value of AE_SAR[0]. If
shifting, round asymmetrically.
C syntax:
ae_f16x4 AE_MUL16JS (ae_f16x4 d0);
Floating point operations typically have four cycles of latency but are fully pipelined. With the
Reduced MAC Latency option, the latency is reduced to two cycles. Divide and sqrt are
implemented using instruction sequences.
C syntax:
float XT_CONST_S (immediate i);
C syntax:
void XT_MOVEQZ_S (float fr /* inout */,
float fs, int art);
void XT_MOVNEZ_S (float fr /* inout */,
float fs, int art);
void XT_MOVGEZ_S (float fr /* inout */,
float fs, int art);
void XT_MOVLTZ_S (float fr /* inout */,
float fs, int art);
MOVT.S fr, fs, bt [Inst]
MOVF.S fr, fs, bt [Inst]
Conditional move of the low half of data operand fs to fr based on scalar condition in xtbool
bt. The upper half is conditionally zeroed.
MOVT.S: fr.L ← (bt==1) ? fs : fr; fr.H ← (bt==1) ? 0 : fr
MOVF.S: fr.L ← (bt==0) ? fs : fr; fr.H ← (bt==0) ? 0 : fr
C syntax:
void XT_MOVT_S (float fr /* inout */,
float fs, xtbool b);
void XT_MOVF_S (float fr /* inout */,
float fs, xtbool b);
ABS.S fr, fs [fusion_slot1, Inst]
Computes an IEEE 754 abs of the contents of the lower floating-point operand of fs. The
upper half is zeroed.
fr.H ← 0
fr.L ← abs(fs.L)
C syntax:
float XT_ABS_S (float fs);
Compares the single-precision values of the low half of floating-point operands fs and ft. If
the contents of the half of fs are ordered with, and less than or equal to the contents of the
half of ft, then br is set to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0
compare as equal. IEEE 754 floating-point values are ordered if neither is a NaN (Not a
Number).
C syntax:
xtbool XT_OLE_S (float fs, float ft);
OLT.S br, fs, ft [Inst]
Compares the single-precision values of the low half of floating-point operands fs and ft. If
the contents of the half of fs are ordered with and less than the contents of the half of ft, then
br is set to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal.
IEEE 754 floating-point values are ordered if neither is a NaN.
C syntax:
xtbool XT_OLT_S (float fs, float ft);
OEQ.S br, fs, ft [Inst]
Compares the single-precision values of the low half of floating-point operands fs and ft. If
the contents of the half of fs are ordered with and equal to the contents of the half of ft, then
br is set to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal.
IEEE 754 floating-point values are ordered if neither is a NaN.
C syntax:
xtbool XT_OEQ_S (float fs, float ft);
ULE.S br, fs, ft [Inst]
Compares the single-precision values of the low half of floating-point operands fs and ft. If
the contents of the half of fs are less than or equal to or unordered with respect to the half of
ft, then br is set to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as
equal. IEEE 754 floating-point values are unordered if either is a NaN.
C syntax:
xtbool XT_ULE_S (float fs, float ft);
ULT.S br, fs, ft [Inst]
Compares the single-precision values of the low half of floating-point operands fs and ft. If
the contents of the half of fs are less than or unordered with respect to the half of ft, then br
is set to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal.
IEEE 754 floating-point values are unordered if either is a NaN.
C syntax:
xtbool XT_ULT_S (float fs, float ft);
Compares the single-precision values of the low half of floating-point operands fs and ft. If
the contents of the half of fs are equal to or unordered with the half of ft, br is set to 1.
Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal. IEEE 754
floating-point values are unordered if either is a NaN.
C syntax:
xtbool XT_UEQ_S (float fs, float ft);
UN.S br, fs, ft [Inst]
Unordered compare. If the contents of the half of fs or half of ft are equal to NaN, then br is
set to 1. Otherwise, br is set to 0.
C syntax:
xtbool XT_UN_S (float fs, float ft);
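The ordered and unordered comparisons above differ only in how they treat NaN operands. The following portable C sketch illustrates the distinction. It assumes IEEE 754 single-precision `float` and a compiler that is not run with value-unsafe floating-point optimizations, and the `model_` function names are hypothetical.

```c
#include <stdbool.h>

/* Model of UN.S: true if either operand is a NaN (a NaN never compares equal to itself). */
static bool model_un(float fs, float ft)  { return fs != fs || ft != ft; }

/* Ordered compares: false whenever either operand is a NaN. */
static bool model_ole(float fs, float ft) { return fs <= ft; }
static bool model_olt(float fs, float ft) { return fs < ft; }
static bool model_oeq(float fs, float ft) { return fs == ft; }

/* Unordered compares: true whenever either operand is a NaN. */
static bool model_ule(float fs, float ft) { return model_un(fs, ft) || fs <= ft; }
static bool model_ult(float fs, float ft) { return model_un(fs, ft) || fs < ft; }
static bool model_ueq(float fs, float ft) { return model_un(fs, ft) || fs == ft; }
```

When neither operand is a NaN, each unordered compare gives the same result as its ordered counterpart.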
FLOAT.S fr, ars, i [fusion_slot0]
Converts the contents of integral operand ars from signed integer to single-precision format,
rounding according to the current rounding mode. The converted integer value is then scaled
by a power of two constant value encoded in the immediate field, with 0..31 representing 1.0,
0.5, 0.25, …, 1.0/2147483648.0. The scaling allows for a fixed-point notation where the binary
point is at the right end of the integer for i=0 and moves to the left as i increases until for i=31
there are 31 fractional bits represented in the fixed-point number. The result is placed in the
low half of fr. The upper half is zeroed.
C syntax:
float XT_FLOAT_S (int ars, immediate i);
UFLOAT.S fr, ars, i [fusion_slot0]
Converts the contents of integral operand ars from unsigned integer to single-precision
format, rounding according to the current rounding mode. The converted integer value is then
scaled by a power of two constant value encoded in the immediate field, with 0..31
representing 1.0, 0.5, 0.25,…, 1.0/2147483648.0. The scaling allows for a fixed-point
notation where the binary point is at the right end of the integer for i=0 and moves to the left
as i increases until for i=31 there are 31 fractional bits represented in the fixed-point number.
The result is placed in the low half of floating-point operand fr. The upper half is zeroed.
C syntax:
float XT_UFLOAT_S (unsigned int ars, immediate i);
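The scaling applied by FLOAT.S and UFLOAT.S can be sketched in portable C: the integer is converted to single precision and multiplied by 2^-i. The `model_` names are hypothetical. The model relies on the C conversion from integer to `float`, which rounds in an implementation-defined manner on some compilers, so it only approximates the rounding-mode behavior described above.

```c
#include <stdint.h>

/* 2^-i for i in 0..31; both the divisor and the quotient are exactly representable. */
static float model_scale(unsigned i)
{
    return 1.0f / (float)(UINT32_C(1) << i);
}

/* Model of FLOAT.S: signed integer with i fractional bits to single precision. */
static float model_float_s(int32_t ars, unsigned i)
{
    return (float)ars * model_scale(i);
}

/* Model of UFLOAT.S: unsigned integer with i fractional bits to single precision. */
static float model_ufloat_s(uint32_t ars, unsigned i)
{
    return (float)ars * model_scale(i);
}
```

For example, with i=1 the integer 3 is interpreted as the fixed-point value 1.5.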
FIROUND.S vt, vr [fusion_slot0]
Rounds the floating point value of the low half of the input vector register operand into an
integral value in the low half of the output vector register operand. The high half is zeroed.
The value is rounded to the nearest integral value. When the fractional part of an input is
exactly 1/2, the value is rounded away from 0.
C syntax:
float XT_FIROUND_S (float b);
Rounds the floating-point value in the low half of the input vector register operand into an
integral value in the low half of the output vector register operand, using the ROUND mode.
The high half is zeroed.
C syntax:
float XT_FIRINT_S (float b);
TRUNC.S arr, fs, i [fusion_slot0]
Converts the contents of the lower 32-bits of floating-point operand fs from single-precision
to signed integer format, rounding toward zero. The converted integer value is first scaled by
a power of two constant value encoded in the immediate field, with 0..31 representing 1.0,
0.5, 0.25, …, 1.0/2147483648.0. The scaling allows for a fixed-point notation where the
binary point is at the right end of the integer for i=0 and moves to the left as i increases until
for i=31 there are 31 fractional bits represented in the fixed-point number.
C syntax:
int XT_TRUNC_S (float fs, immediate i);
UTRUNC.S arr, fs, i [fusion_slot0]
Converts the contents of the lower 32-bits of floating-point operand fs from single-precision
to unsigned integer format, rounding toward zero. The converted unsigned integer value is
first scaled by a power of two constant value encoded in the immediate field, with 0..31
representing 1.0, 0.5, 0.25, …, 1.0/2147483648.0. The scaling allows for a fixed-point
notation where the binary point is at the right end of the integer for i=0 and moves to the left
as i increases until for i=31 there are 31 fractional bits represented in the fixed-point number.
C syntax:
unsigned int XT_UTRUNC_S (float fs, immediate i);
RFR art, vr [Inst]
Copy the low 32 bits of vr into art.
C syntax:
unsigned int XT_RFR (float vs);
WFR vt, art [Inst]
Replicate art into each half of data register vt.
C syntax:
float XT_WFR (unsigned int vs);
Additional helper instructions exist that are used in compiler generated divide and sqrt
sequences. These are not documented here. Refer to the generated HTML file available via
the Xtensa Xplorer IDE for details.
LSX2I d, a, i64
LSX2IP d, a, i64pos
LSX2RI (RIP) d, a, i64
LSX2RIC d, a, [i64neg]
LSX2X (XP, XC) d, a, ax
Required alignment: 8 bytes
Load a pair of 32-bit values from memory into the AE_DR register d. See Table 2-3 for the
meanings of the address mode suffixes.
Note: RI and RIP are intrinsics mapped to equivalent instructions.
C syntax:
xtfloatx2 XT_LSX2I (const xtfloatx2 * a, immediate i64);
xtfloatx2 XT_LSX2X (const xtfloatx2 * a, int ax);
void XT_LSX2IP (xtfloatx2 d /*out*/,
const xtfloatx2 *a /*inout*/, immediate i64pos);
void XT_LSX2XP (xtfloatx2 d /*out*/,
const xtfloatx2 *a /*inout*/, int ax);
void XT_LSX2XC (xtfloatx2 d /*out*/,
const xtfloatx2 *a /*inout*/, int ax);
This instruction is used to prime the unaligned access stream for LASX2IP and LASX2RIP
instructions regardless of size or direction.
C syntax:
ae_valign XT_LASX2PP (xtfloatx2 *a);
LASX2POSPC u, a
LASX2NEGPC u, a
Required alignment: 4 bytes
This operation loads a 64-bit value from memory into AE_VALIGN register u. The effective
address is (a & 0xFFFFFFF8).
The instruction LASX2POSPC is used to prime the unaligned access stream for LASX2IC
instructions. The instruction LASX2NEGPC is used to prime the unaligned access stream for
LASX2RIC instructions.
C syntax:
void XT_LASX2POSPC (ae_valign u /*out*/, xtfloatx2 *a /*inout*/);
void XT_LASX2NEGPC (ae_valign u /*out*/, xtfloatx2 *a /*inout*/);
LASX2IP (IC, RIP, RIC) d, u, a
Required alignment: 4 bytes
Load a pair of 32-bit values from effective address (a) in memory into the AE_DR register d.
Instructions LASX2IP (IC) are used if the direction of the load operations is positive.
Instructions LASX2RIP (RIC) are used if the direction of the load operations is negative.
C syntax:
void XT_LASX2IP (xtfloatx2 d /*out*/, ae_valign u /*inout*/,
xtfloatx2 *a /*inout*/);
void XT_LASX2IC (xtfloatx2 d /*out*/, ae_valign u /*inout*/,
xtfloatx2 *a /*inout*/);
void XT_LASX2RIP (xtfloatx2 d /*out*/, ae_valign u /*inout*/,
xtfloatx2 *a /*inout*/);
void XT_LASX2RIC (xtfloatx2 d /*out*/, ae_valign u /*inout*/,
xtfloatx2 *a /*inout*/);
SSX2I d, a, i64
SSX2IP d, a, i64pos
SSX2RI (RIP) d, a, i64
SSX2RIC d, a
SSX2X (XP, XC) d, a, ax
Required alignment: 8 bytes
Store a pair of 32-bit values from the AE_DR register d to memory. See Table 2-3 for the
meanings of the address mode suffixes.
Note: RI and RIP are intrinsics mapped to equivalent instructions.
C syntax:
void XT_SSX2I (xtfloatx2 d, xtfloatx2 * a, immediate i64);
void XT_SSX2X (xtfloatx2 d, xtfloatx2 * a, int ax);
void XT_SSX2IP (xtfloatx2 d,
xtfloatx2 * a /*inout*/, immediate i64);
void XT_SSX2XP (xtfloatx2 d,
xtfloatx2 * a /*inout*/, int ax);
void XT_SSX2XC (xtfloatx2 d,
xtfloatx2 * a /*inout*/, int ax);
void XT_SSX2RI (xtfloatx2 d, xtfloatx2 * a, immediate i64);
void XT_SSX2RIP (xtfloatx2 d, xtfloatx2 * a /*inout*/, immediate i64);
void XT_SSX2RIC (xtfloatx2 d, xtfloatx2 * a /*inout*/);
SSI d, a, i32
SSIP d, a, i32
SSIX (XP, XC) d, a, ax
Required alignment: 4 bytes
Store the 32-bit L element of the AE_DR register d to memory. For operations with suffix I,
the effective address is (a + i32). See Table 2-3 for the meanings of the address mode
suffixes.
C syntax:
void XT_SSI (float d, float * a, immediate i32);
void XT_SSX (float d, float * a, int ax);
void XT_SSIP (float d,
float * a /*inout*/, immediate i32);
void XT_SSXP (float d,
float * a /*inout*/, int ax);
void XT_SSXC (float d,
float * a /*inout*/, int ax);
SASX2IP (IC, RIP, RIC) d, u, a
Required alignment: 4 bytes
Store a pair of 32-bit values from AE_DR register d to memory with effective address (a).
Instructions SASX2IP (IC, IC1) are used if the direction of the store operations is positive.
Instructions SASX2RIP (RIC, RIC1) are used if the direction of the store operations is
negative.
C syntax:
void XT_SASX2IP (xtfloatx2 d, ae_valign u /*inout*/,
xtfloatx2 * a /*inout*/);
void XT_SASX2IC (xtfloatx2 d, ae_valign u /*inout*/,
xtfloatx2 * a /*inout*/);
void XT_SASX2RIP (xtfloatx2 d, ae_valign u /*inout*/,
xtfloatx2 * a /*inout*/);
void XT_SASX2RIC (xtfloatx2 d, ae_valign u /*inout*/,
xtfloatx2 * a /*inout*/);
SASX2POSFP u, a
Required alignment: varies depending on the data type in the AE_VALIGN register u.
Flushes the value in AE_VALIGN register u to memory with effective address (a). The
AE_VALIGN register u is updated with value zero. This operation is used when the direction
of the store operation is positive.
C syntax:
void XT_SASX2POSFP (ae_valign u /*inout*/, xtfloatx2 *a);
SASX2NEGFP u, a
Required alignment: varies depending on the data type in the AE_VALIGN register u.
Flushes the value in AE_VALIGN register u to memory with effective address (a). The
AE_VALIGN register u is updated with value zero. This operation is used when the direction
of the store operation is negative.
C syntax:
void XT_SASX2NEGFP (ae_valign u /*inout*/, xtfloatx2 *a);
AE_ZALIGN64 u
Initialize the AE_VALIGN register u with zero.
C syntax:
ae_valign AE_ZALIGN64 ();
SEL32_LL.SX2 d, d0, d1 [fusion_slot1]
d.H = d0.L;
d.L = d1.L.
C syntax:
xtfloatx2 XT_SEL32_LL_S (xtfloatx2 d0, xtfloatx2 d1);
SEL32_LH.SX2 d, d0, d1 [fusion_slot1]
d.H = d0.L;
d.L = d1.H.
C syntax:
xtfloatx2 XT_SEL32_LH_S (xtfloatx2 d0, xtfloatx2 d1);
SEL32_HL.SX2 d, d0, d1 [fusion_slot1]
d.H = d0.H;
d.L = d1.L.
C syntax:
xtfloatx2 XT_SEL32_HL_S (xtfloatx2 d0, xtfloatx2 d1);
SEL32_HH.SX2 d, d0, d1 [fusion_slot1]
d.H = d0.H;
d.L = d1.H.
C syntax:
xtfloatx2 XT_SEL32_HH_S (xtfloatx2 d0, xtfloatx2 d1);
LOW.S d, d0 [fusion_slot1]
Extract the low half of a SIMD floating point value.
d = d0.L;
C syntax:
float XT_LOW_S (xtfloatx2 d0);
HIGH.S d, d0
Extract the high half of a SIMD floating point value.
d = d0.H;
C syntax:
float XT_HIGH_S (xtfloatx2 d0);
However, programmers are reminded not to depend on NaN propagation, payload, or the
sign bit, since recompilation may cause the propagation to change or to cease.
ABS.SX2 MOVT.SX2 MOVF.SX2
ADD.SX2 MSUBC.S
AE_MOVXTFLOATX2_FROMINT32X2 MUL.SX2
AE_MOVINT32X2_FROMXTFLOATX2 MSUBCCONJ.S
AE_MOVXTFLOATX2_FROMF32X2 MSUB.SX2
AE_MOVF32X2_FROMXTFLOATX2 MULC.S
CONJC.S NEG.SX2
FICEIL.SX2 OEQ.SX2
FIFLOOR.SX2 OLE.SX2
FIRINT.SX2 OLT.SX2
FIROUND.SX2 UEQ.SX2
FITRUNC.SX2 ULE.SX2
FLOAT.SX2 ULT.SX2
MADDC.S UN.SX2
MADDCCONJ.S SSX2RI
MADD.SX2 SUB.SX2
MAX.SX2 UFLOAT.SX2
MIN.S TRUNC.SX2
MIN.SX2 UTRUNC.SX2
MOVEQZ.SX2 RECIP.SX2
MOVGEZ.SX2 RSQRT.SX2
MOVLTZ.SX2 SQRT.SX2
MOVNEZ.SX2 DIV.SX2
MOV.SX2 FSQRT.SX2
The HiFi bitstream engine supports both fixed length and variable length encoding and
decoding. Variable length (Huffman) encode and decode instructions are specialized
instructions, in which the elements with variable bit-widths are encoded or decoded. The
instructions are assisted by a special set of tables generated from Huffman
encoding/decoding schemes used in the algorithm. These tables are generated offline and
their entries capture the bit-widths, bit-patterns, and values. The format of the table entries is
specified in section 2.19.1. For details on how the variable-length encode/decode instructions
should be used, refer to Chapter 4.
Internally, the instructions share the state registers described in Table 2-2 Bitstream and
variable-length Encode/Decode Support Subsystem State Registers. Therefore, the program
cannot switch between encoding and decoding modes without storing and restoring their
values.
All of the following are 24-bit instructions that issue in the Inst slot.
16-bit table entry load for variable-length decode. Given a pointer a0 to a decoding table of
16-bit entries, an entry is loaded and parsed from a0[AE_NEXTOFFSET]. If the table entry
loaded completes the current decoding operation, b is set to true and a is set to the decoded
symbol value. Otherwise b is set to false.
C syntax:
void AE_VLDL16T (xtbool b /*out*/, unsigned a /*out*/,
const unsigned short * a0);
AE_VLDL32T b, a, a0 [ Inst ] AVS ONLY
32-bit table entry load for variable-length decode. Given a pointer a0 to a decoding table of
32-bit entries, an entry is loaded and parsed from a0[AE_NEXTOFFSET]. If the table entry
loaded completes the current decoding operation, b is set to true and a is set to the decoded
symbol value. Otherwise b is set to false.
C syntax:
void AE_VLDL32T (xtbool b /*out*/, unsigned a /*out*/,
const unsigned * a0);
AE_VLDL16C a [ Inst ] AVS ONLY
16-bit conditional bitstream load for variable-length decode. 16 bits are loaded from the
bitstream pointed to by (a+2) if they are needed to maintain the invariant that we have at
least 16 bits of look ahead from the AE_BITPTR position in the AE_BITHEAD state register.
In the event that a load occurs, a is advanced to refer to the next 16 bits in memory.
C syntax:
void AE_VLDL16C (const unsigned short * a /*inout*/);
16-bit conditional bitstream load for variable-length decode. 16 bits are loaded from the
bitstream pointed to by a if they are needed to maintain the invariant that we have at least
16 bits of look ahead from the AE_BITPTR position in the AE_BITHEAD state register. In the
event that a load occurs, a is advanced to refer to the next 16 bits in memory.
C syntax:
void AE_VLDL16C_IP (const unsigned short * a /*inout*/);
AE_VLDL16C.IC a [ Inst ] AVS ONLY
16-bit conditional bitstream load for variable-length decode. 16 bits are loaded from the
bitstream pointed to by a if they are needed to maintain the invariant that we have at least
16 bits of look ahead from the AE_BITPTR position in the AE_BITHEAD state register. In the
event that a load occurs, a is advanced using a circular wrap-around to refer to the next 16
bits in memory.
C syntax:
void AE_VLDL16C_XC (const unsigned short * a /*inout*/);
void AE_VLDL16C_IC (const unsigned short * a /*inout*/);
AE_VLDSHT a [ Inst ] AVS ONLY
Set Huffman Table for variable-length decode. This instruction sets AE_NEXTOFFSET
according to the current bits at the head of the bitstream and the table size specified by a for
the next lookup that will take place via AE_VLDL16T or AE_VLDL32T.
C syntax:
void AE_VLDSHT (unsigned a);
AE_LB a, a0 [ Inst ] AVS ONLY
Look ahead in the bitstream. Return as few as 1 bit or as many as 16 bits from the head (or
MSBits) of the state register AE_BITHEAD. The number of bits to return is given by the low
five bits of a0, and must be in the range [0..16]. No state is updated; this is a look ahead
instruction. The bits from the bitstream are returned right-justified in a.
AE_BITHEAD holds 16 to 32 bits of the bitstream pointed by a. The number of bits consumed
from the AE_BITHEAD is stored in AE_BITPTR.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
unsigned AE_LB (unsigned a0);
Look ahead in the bitstream. Return as few as 1 bit or as many as 16 bits from the head (or
MSBits) of the state register AE_BITHEAD. The number of bits to return is given by the
immediate value i, and must be in the range [1..16]. No state is updated; this is a look-ahead
instruction. The bits from the bitstream are returned right-justified in a.
AE_BITHEAD holds 16 to 32 bits of the bitstream pointed by a. The number of bits consumed
from the AE_BITHEAD is stored in AE_BITPTR.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
unsigned AE_LBI (immediate i);
AE_LBS a, a0 [ Inst ] AVS ONLY
Signed look ahead in the bitstream. Return as few as 1 bit or as many as 16 bits from the
head (or MSBits) of the state register AE_BITHEAD. The number of bits to return is given by
the low five bits of a0, and must be in the range [0..16]. No state is updated; this is a look
ahead instruction. The bits from the bitstream are returned sign-extended, right-justified in a.
AE_BITHEAD holds 16 to 32 bits of the bitstream pointed by a. The number of bits consumed
from the AE_BITHEAD is stored in AE_BITPTR.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
unsigned AE_LBS (unsigned a0);
AE_LBSI a, i [ Inst ] AVS ONLY
Signed look ahead in the bitstream. Return as few as 1 bit or as many as 16 bits from the
head (or MSBits) of the state register AE_BITHEAD. The number of bits to return is given by
the immediate value i, and must be in the range [1..16]. No state is updated; this is a look-
ahead instruction. The bits from the bitstream are returned sign-extended, right-justified in a.
AE_BITHEAD holds 16 to 32 bits of the bitstream pointed by a. The number of bits consumed
from the AE_BITHEAD is stored in AE_BITPTR.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
unsigned AE_LBSI (immediate i);
Look ahead in the bitstream, keeping low bits of a0. Returns as few as 1 bit or as many as
16 bits from the head (or MSBits) of the state register AE_BITHEAD in the low bits of a, with
the remaining bits filled with low bits from a0. The number of bits to move from the stream to
a is given by the low five bits of a1, and must be in the range [1..16]. No state is updated;
this is a look-ahead instruction.
AE_BITHEAD holds 16 to 32 bits of the bitstream pointed by a. The number of bits consumed
from the AE_BITHEAD is stored in AE_BITPTR.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
unsigned AE_LBK (unsigned a0, unsigned a1);
AE_LBKI a, a0, i [ Inst ] AVS ONLY
Look ahead in the bitstream, keeping low bits of a0. Returns as few as 1 bit or as many as
16 bits from the head (or MSBits) of the state register AE_BITHEAD in the low bits of a,
with the remaining bits filled with low bits from a0. The number of bits to move from the
stream to a is given by the immediate value i, and must be in the range [1..16]. No state is
updated; this is a look-ahead instruction.
AE_BITHEAD holds 16 to 32 bits of the bitstream pointed by a. The number of bits consumed
from the AE_BITHEAD is stored in AE_BITPTR.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
unsigned AE_LBKI (unsigned a0, immediate i);
AE_DB a, a0 [ Inst ] AVS ONLY
Discards bits from the state register AE_BITHEAD. The number of bits to be discarded is
given by the low five bits of a0, and must be in the range [0..16]. AE_BITPTR value
increments by the number of bits-read and keeps track of the number of bits consumed from
the AE_BITHEAD. When the number of bits remaining in the AE_BITHEAD is less than or equal to
16, the operation reads a 16-bit word from memory location (a+2) into the state register
AE_BITHEAD, and the pointer value is updated to (a+2). The value stored in AE_BITPTR is
decremented by 16.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
void AE_DB (const unsigned short * a /*inout*/, unsigned a0);
Discards bits from the state register AE_BITHEAD. The number of bits to be discarded is
given by the low five bits of a0, and must be in the range [0..16]. AE_BITPTR value
increments by the number of bits-read and keeps track of the number of bits consumed from
the AE_BITHEAD. When the number of bits remaining in the AE_BITHEAD is less than or equal to
16, the operation reads a 16-bit word from memory location (a) into the state register
AE_BITHEAD, and the pointer value is updated to (a+2). The value stored in AE_BITPTR is
decremented by 16.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
void AE_DB_IP (const unsigned short * a /*inout*/, unsigned a0);
AE_DB.IC a, a0 [ Inst ] AVS ONLY
Discards bits from the state register AE_BITHEAD. The number of bits to be discarded is
given by the low five bits of a0, and must be in the range [0..16]. AE_BITPTR value
increments by the number of bits-read and keeps track of the number of bits consumed from
the AE_BITHEAD. When the number of bits remaining in the AE_BITHEAD is less than or equal to
16, the operation reads a 16-bit word from memory location (a) into the state register
AE_BITHEAD, and the pointer value is updated, using a circular wrap-around, to (a+2). The
value stored in AE_BITPTR is decremented by 16.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
void AE_DB_IC (const unsigned short * a /*inout*/, unsigned a0);
void AE_DB_XC (const unsigned short * a /*inout*/, unsigned a0);
AE_DBI a, i [ Inst ] AVS ONLY
Discards bits from the state register AE_BITHEAD. The number of bits to be discarded is
given by the immediate i, and must be in the range [1..16]. AE_BITPTR value increments by
the number of bits-read and keeps track of the number of bits consumed from the
AE_BITHEAD. When the number of bits remaining in the AE_BITHEAD is less than or equal to
16, the operation reads a 16-bit word from memory location (a+2) into the state register
AE_BITHEAD, and the pointer value is updated to (a+2). The value stored in AE_BITPTR is
decremented by 16.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
The following sequence of instructions is useful to start bitparsing of the bitstream buffer
stored in short bitParseBuf[] using AE_LB*/AE_DB* instructions.
{
short *a=&bitParseBuf[0]-1;
WAE_BITPTR(0);
AE_DBI(a,16);
AE_DBI(a,16);
}
This sequence fills the AE_BITHEAD with 32 bits starting from bitParseBuf[0].
The actual bit-parsing is done using a sequence of AE_LB, AE_LBI, or AE_LBK instructions
followed by AE_DB or AE_DBI instructions.
C syntax:
void AE_DBI (const unsigned short * a /*inout*/, immediate i);
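The interaction between AE_BITHEAD, AE_BITPTR, and the look-ahead and discard operations can be illustrated with a small portable C model. This is a sketch under stated assumptions, not a description of the hardware: the type `model_bitreader` and the functions `model_db`, `model_lb`, `model_lbs`, and `model_init` are hypothetical, the state registers are modeled as structure fields, and the pointer a is replaced by an index of the next 16-bit word to load.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of the decode-side bitstream state. */
typedef struct {
    uint32_t bithead;      /* models AE_BITHEAD: up to 32 buffered bits */
    unsigned bitptr;       /* models AE_BITPTR: bits already consumed   */
    const uint16_t *buf;   /* bitstream buffer                          */
    size_t next;           /* index of the next 16-bit word to load     */
} model_bitreader;

/* Models AE_DB/AE_DBI: discard n bits (0..16), then refill when 16 or fewer bits remain. */
static void model_db(model_bitreader *r, unsigned n)
{
    r->bitptr += n;
    if (32u - r->bitptr <= 16u) {
        r->bithead = (r->bithead << 16) | r->buf[r->next++];
        r->bitptr -= 16u;
    }
}

/* Models AE_LB/AE_LBI: return the next n bits (1..16) right-justified; no state changes. */
static uint32_t model_lb(const model_bitreader *r, unsigned n)
{
    return (r->bithead << r->bitptr) >> (32u - n);
}

/* Models AE_LBS/AE_LBSI: as model_lb, but the n-bit field is sign-extended. */
static int32_t model_lbs(const model_bitreader *r, unsigned n)
{
    uint32_t sign = UINT32_C(1) << (n - 1u);
    return (int32_t)(model_lb(r, n) ^ sign) - (int32_t)sign;
}

/* Models the start-up sequence: AE_BITPTR set to 0, followed by two 16-bit discards. */
static void model_init(model_bitreader *r, const uint16_t *buf)
{
    r->bithead = 0;
    r->bitptr = 0;
    r->buf = buf;
    r->next = 0;
    model_db(r, 16);
    model_db(r, 16);
}
```

Because the model loads ahead of the bits consumed, the buffer passed to model_init must extend at least two 16-bit words beyond the last bit that is parsed.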
AE_DBI.IP a, i [ Inst ] AVS ONLY
Discards bits from the state register AE_BITHEAD. The number of bits to be discarded is
given by the immediate i, and must be in the range [1..16]. AE_BITPTR value increments by
the number of bits-read and keeps track of the number of bits consumed from the
AE_BITHEAD. When the number of bits remaining in the AE_BITHEAD is less than or equal to
16, the operation reads a 16-bit word from memory location (a) into the state register
AE_BITHEAD, and the pointer value is updated to (a+2). The value stored in AE_BITPTR is
decremented by 16.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
void AE_DBI_IP (const unsigned short * a /*inout*/, immediate i);
AE_DBI.IC a, i [ Inst ] AVS ONLY
Discards bits from the state register AE_BITHEAD. The number of bits to be discarded is
given by the immediate i, and must be in the range [1..16]. AE_BITPTR value increments by
the number of bits-read and keeps track of the number of bits consumed from the
AE_BITHEAD. When the number of bits remaining in the AE_BITHEAD is less than or equal to
16, the operation reads a 16-bit word from memory location (a) into the state register
AE_BITHEAD, and the pointer value is updated, using a circular wrap-around, to (a+2). The
value stored in AE_BITPTR is decremented by 16.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
void AE_DBI_IC (const unsigned short * a /*inout*/, immediate i);
AE_VLEL16T b, a, a0 [ Inst ] AVS ONLY
16-bit table entry load for variable-length encode. Given a pointer a0 to an encoding table of
16-bit entries, an entry is loaded and parsed from a0[a]. If the table entry loaded completes
the current encoding operation, b is set to true, otherwise b is set to false and a is set to the
appropriate index for the next lookup to continue the encoding operation. In either case, the
appropriate codeword bits are pushed onto the output bitstream.
C syntax:
void AE_VLEL16T (xtbool b /*out*/, unsigned a /*inout*/,
const unsigned short * a0);
AE_VLEL32T b, a, a0 [ Inst ] AVS ONLY
32-bit table entry load for variable-length encode. Given a pointer a0 to an encoding table of
32-bit entries, an entry is loaded and parsed from a0[a]. If the table entry loaded completes
the current encoding operation, b is set to true, otherwise b is set to false and a is set to the
appropriate index for the next lookup to continue the encoding operation. In either case, the
appropriate codeword bits are pushed onto the output bitstream.
C syntax:
void AE_VLEL32T (xtbool b /*out*/, unsigned a /*inout*/,
const unsigned * a0);
AE_VLES16C a [ Inst ] AVS ONLY
16-bit conditional bitstream store for variable-length encode. 16 bits are stored to the
bitstream pointed to by (a+2) if doing so is needed to maintain the invariant that fewer than
16 bits are buffered in AE_BITHEAD.
C syntax:
void AE_VLES16C (unsigned short * a /*inout*/);
AE_VLES16C.IP a [ Inst ] AVS ONLY
16-bit conditional bitstream store for variable-length encode. 16 bits are stored to the
bitstream pointed to by a if doing so is needed to maintain the invariant that fewer than 16
bits are buffered in AE_BITHEAD.
C syntax:
void AE_VLES16C_IP (unsigned short * a /*inout*/);
AE_VLES16C.IC a [ Inst ] AVS ONLY
16-bit conditional bitstream store for variable-length encode. 16 bits are stored to the
bitstream pointed to by a if doing so is needed to maintain the invariant that fewer than 16
bits are buffered in AE_BITHEAD and a is advanced by 2 with a circular wrap-around.
C syntax:
void AE_VLES16C_IC (unsigned short * a /*inout*/);
AE_SB a, a0 [ Inst ] AVS ONLY
This instruction writes to the memory location (a+2) through the state register AE_BITHEAD
in chunks of 16 bits. Each call of the instruction appends low bits from a0 to the AE_BITHEAD
register. The number of low bits written to AE_BITHEAD is specified by AE_BITSUSED
(Note: If the value of AE_BITSUSED is zero, it is interpreted as 16). Another state register,
AE_BITPTR, keeps track of the number of bits appended to the AE_BITHEAD register.
After one or more calls of this instruction, the AE_BITHEAD register fills with 16
bits or more. When this occurs, the 16 oldest bits in AE_BITHEAD are flushed out and
stored as a 16-bit word at memory location (a+2), and the pointer value in “a” is updated
to (a+2). At the initialization of an output bitstream, AE_BITPTR and AE_BITHEAD are set
to 0.
C syntax:
void AE_SB (unsigned short * a /*inout*/, unsigned a0);
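The buffering behaviour described above can be pictured with a small software model in portable C. This is only a sketch under stated assumptions, not the instruction's exact semantics: the type bit_writer and the functions bw_init, bw_put, and bw_flush are hypothetical names, bits are assumed to be packed most significant bit first, and the model simply stores at the current pointer and then advances it, ignoring the addressing differences between the base, .IP, and .IC forms.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of the output bitstream state. */
typedef struct {
    uint32_t head;   /* models AE_BITHEAD: pending bits, right-justified */
    unsigned count;  /* models AE_BITPTR: number of pending bits (0..15) */
    uint16_t *out;   /* next 16-bit word to be written */
} bit_writer;

/* Start of an output bitstream: both state values are cleared. */
static void bw_init(bit_writer *w, uint16_t *out)
{
    w->head = 0;
    w->count = 0;
    w->out = out;
}

/* Append the low nbits of value; nbits == 0 is interpreted as 16. */
static void bw_put(bit_writer *w, uint32_t value, unsigned nbits)
{
    assert(nbits <= 16);
    if (nbits == 0)
        nbits = 16;
    w->head = (w->head << nbits) | (value & ((1u << nbits) - 1u));
    w->count += nbits;
    if (w->count >= 16) {
        /* Flush the 16 oldest bits as one 16-bit word. */
        w->count -= 16;
        *w->out++ = (uint16_t)(w->head >> w->count);
        w->head &= (1u << w->count) - 1u;
    }
}

/* Flush the remaining bits, zero-padded in the LSB positions.
 * The pending-bit count is left unchanged, as the text describes for AE_SBF. */
static void bw_flush(bit_writer *w)
{
    *w->out++ = (uint16_t)(w->head << (16 - w->count));
    w->head = 0;
}
```

Because at most 15 bits are pending before each call and at most 16 are appended, the model never needs more than 31 bits of buffer, which is why a 32-bit head is sufficient here.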
AE_SB.IP a, a0 [ Inst ] AVS ONLY
This instruction writes to the memory location (a+2) through the state register AE_BITHEAD
in chunks of 16 bits. Each call of the instruction appends low bits from a0 to the AE_BITHEAD
register. The number of low bits written to AE_BITHEAD is specified by AE_BITSUSED
(Note: If the value of AE_BITSUSED is zero, it is interpreted as 16). Another state register,
AE_BITPTR, keeps track of the number of bits appended to the AE_BITHEAD register.
After one or more calls of this instruction, the AE_BITHEAD register fills with 16
bits or more. When this happens, the 16 oldest bits in AE_BITHEAD are flushed out
and stored as a 16-bit word at memory location (a+2), and the pointer value in “a” is
updated to (a+2). At the initialization of an output bitstream, AE_BITPTR and AE_BITHEAD
are set to 0.
C syntax:
void AE_SB_IP (unsigned short * a /*inout*/, unsigned a0);
AE_SB.IC a, a0 [ Inst ] AVS ONLY
This instruction writes to the memory location (a) through the state register AE_BITHEAD in
chunks of 16 bits. Each call of the instruction appends low bits from a0 to the AE_BITHEAD
register. The number of low bits written to AE_BITHEAD is specified by AE_BITSUSED
(Note: If the value of AE_BITSUSED is zero, it is interpreted as 16). Another state register,
AE_BITPTR, keeps track of the number of bits appended to the AE_BITHEAD register.
After one or more calls of this instruction, the AE_BITHEAD register fills with 16
bits or more. When this occurs, the 16 oldest bits in AE_BITHEAD are flushed out and
stored as a 16-bit word at memory location (a), and the pointer value in “a” is updated
using a circular wrap-around to (a+2). At the initialization of an output bitstream, AE_BITPTR
and AE_BITHEAD are set to 0.
C syntax:
void AE_SB_IC (unsigned short * a /*inout*/, unsigned a0);
AE_SBI a, a0, i [ Inst ] AVS ONLY
This instruction writes to the memory location (a+2) through the state register AE_BITHEAD
in chunks of 16 bits. Each call of the instruction appends low bits from a0 to the AE_BITHEAD
register. The number of low bits written to AE_BITHEAD is specified by the immediate i (Note:
If the value of immediate i is zero, it is interpreted as 16). Another state register, AE_BITPTR,
keeps track of the number of bits appended to the AE_BITHEAD register.
After one or more calls of this instruction, the AE_BITHEAD register fills with 16
bits or more. When this occurs, the 16 oldest bits in AE_BITHEAD are flushed out and
stored as a 16-bit word at memory location (a+2), and the pointer value in “a” is updated
to (a+2). At the initialization of an output bitstream, AE_BITPTR and AE_BITHEAD are set
to 0.
C syntax:
void AE_SBI (unsigned short *a /*inout*/, unsigned a0, immediate i);
AE_SBI.IP a, a0, i [ Inst ] AVS ONLY
This instruction writes to the memory location (a) through the state register AE_BITHEAD in
chunks of 16 bits. Each call of the instruction appends low bits from a0 to the AE_BITHEAD
register. The number of low bits written to AE_BITHEAD is specified by the immediate i (Note:
If the value of immediate i is zero, it is interpreted as 16). Another state register, AE_BITPTR,
keeps track of the number of bits appended to the AE_BITHEAD register.
After one or more calls of this instruction, the AE_BITHEAD register fills with 16
bits or more. When this occurs, the 16 oldest bits in AE_BITHEAD are flushed out and
stored as a 16-bit word at memory location (a), and the pointer value in “a” is updated to
(a+2). At the initialization of an output bitstream, AE_BITPTR and AE_BITHEAD are set to
0.
C syntax:
void AE_SBI_IP (unsigned short *a /*inout*/, unsigned a0, immediate i);
AE_SBI.IC a, a0, i [ Inst ] AVS ONLY
This instruction writes to the memory location (a) through the state register AE_BITHEAD in
chunks of 16 bits. Each call of the instruction appends low bits from a0 to the AE_BITHEAD
register. The number of low bits written to AE_BITHEAD is specified by the immediate i (Note:
If the value of immediate i is zero, it is interpreted as 16). Another state register, AE_BITPTR,
keeps track of the number of bits appended to the AE_BITHEAD register.
After one or more calls of this instruction, the AE_BITHEAD register fills with 16
bits or more. When this occurs, the 16 oldest bits in AE_BITHEAD are flushed out and
stored as a 16-bit word at memory location (a), and the pointer value in “a” is updated
using a circular wrap-around to (a+2). At the initialization of an output bitstream, AE_BITPTR
and AE_BITHEAD are set to 0.
C syntax:
void AE_SBI_IC (unsigned short *a /*inout*/, unsigned a0, immediate i);
AE_SBF a [ Inst ] AVS ONLY
Flush any remaining bits from AE_BITHEAD to the stream in memory pointed to by (a + 2).
This instruction stores AE_BITHEAD into (a+2), including the padded bits (zero padding)
stored in LSB positions and clears AE_BITHEAD. The ptr (a) is updated/incremented by 2.
Because this instruction doesn't modify AE_BITPTR, the number of bits stored (without
padding) can be retrieved from the AE_BITPTR state register.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
void AE_SBF (unsigned short * a /*inout*/);
AE_SBF.IP a [ Inst ] AVS ONLY
Flush any remaining bits from AE_BITHEAD to the stream in memory pointed to by (a).
This instruction stores AE_BITHEAD into (a), including the padded bits (zero padding) stored
in LSB positions and clears AE_BITHEAD. The ptr (a) is updated/incremented by 2. Because
this instruction doesn't modify AE_BITPTR, the number of bits stored (without padding) can
be retrieved from the AE_BITPTR state register.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
void AE_SBF_IP (unsigned short * a /*inout*/);
AE_SBF.IC a [ Inst ] AVS ONLY
Flush any remaining bits from AE_BITHEAD to the stream in memory pointed to by (a).
This instruction stores AE_BITHEAD into (a), including the padded bits (zero padding) stored
in LSB positions and clears AE_BITHEAD. The ptr (a) is updated/incremented using a
circular wrap-around by 2. Because this instruction doesn't modify AE_BITPTR, the number
of bits stored (without padding) can be retrieved from the AE_BITPTR state register.
The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
void AE_SBF_IC (unsigned short * a /*inout*/);
Each 32-bit variable-length decode codebook table entry has the following format:
Bits     31    30..27    26..0
Field    F     N         S
Width    1     4         27
In this entry, F is a single bit that indicates whether the symbol has been found.
If F is set, the codeword is decoded and the symbol is found. N gives the number of bits
consumed at the current (final) stage of the lookup, and S gives the 27-bit symbol value.
If F is clear, the codeword is only partly decoded and the symbol isn’t found yet. N is a 4-bit
indication of the number of stream prefix bits used to perform the lookup in the next table,
and S gives the 27-bit offset of the beginning of the next table. The number of bits consumed
is implied by the size of the current sub-table.
Each 16-bit variable-length decode codebook table entry has the following format:
Bits     15    14..11    10..0
Field    F     N         S
Width    1     4         11
In this entry, F is a single bit that indicates whether the symbol has been found.
If F is set, the codeword is decoded and the symbol is found. N gives the number of bits
consumed at the current (final) stage of the lookup, and S gives the 11-bit symbol value.
If F is clear, the codeword is only partly decoded and the symbol isn’t found yet. N is a 4-bit
indication of the number of stream prefix bits used to perform the lookup in the next table,
and S gives the 11-bit offset of the beginning of the next table. The number of bits consumed
is implied by the size of the current sub-table.
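The two decode entry layouts above translate directly into shift-and-mask code. The following is a minimal sketch for building or inspecting tables on a host; the type vld_entry and the functions vld_unpack32 and vld_unpack16 are hypothetical names introduced for illustration and are not part of the Fusion DSP intrinsic set.

```c
#include <stdint.h>

/* Unpacked view of one variable-length decode table entry. */
typedef struct {
    unsigned found;  /* F: 1 if the symbol has been found */
    unsigned nbits;  /* N: 4-bit bit count */
    uint32_t value;  /* S: symbol value (F=1) or next-table offset (F=0) */
} vld_entry;

/* 32-bit entry: F = bit 31, N = bits 30..27, S = bits 26..0. */
static vld_entry vld_unpack32(uint32_t e)
{
    vld_entry r;
    r.found = (e >> 31) & 0x1u;
    r.nbits = (e >> 27) & 0xFu;
    r.value = e & 0x07FFFFFFu;
    return r;
}

/* 16-bit entry: F = bit 15, N = bits 14..11, S = bits 10..0. */
static vld_entry vld_unpack16(uint16_t e)
{
    vld_entry r;
    r.found = ((unsigned)e >> 15) & 0x1u;
    r.nbits = ((unsigned)e >> 11) & 0xFu;
    r.value = (uint32_t)e & 0x07FFu;
    return r;
}
```

For example, the 32-bit entry 0x98000005 unpacks to F=1, N=3, S=5, that is, a found symbol with value 5 whose final lookup stage consumed 3 bits.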
Each 32-bit variable-length encode codebook table entry has the following format:
Bits     31    30..0
Field    F     …
Width    1     31
In this entry, F is a single bit that indicates whether the symbol has been completed.
If F is set, the symbol is encoded completely, and the rest of the table entry is interpreted as
follows:
Bits     31    30..20    19..16    15..0
Field    1     …         N         C
Width    1     11        4         16
N is the codeword segment size in bits (N equal to zero means 16 bits). C contains the right-
justified codeword segment. 11 bits of the 32-bit word are unused in this case.
If F is clear, the symbol is only partly encoded, and the rest of the table entry is interpreted
as follows:
Bits     31    30..16    15..0
Field    0     K         C
Width    1     15        16
K is the table entry index for the next encode lookup. C is a 16-bit segment of the codeword.
Each 16-bit variable-length encode codebook table entry has the following format:
Bits     15    14..0
Field    F     …
Width    1     15
In this entry, F is a single bit that indicates whether the symbol has been completed.
If F is set, the symbol is encoded completely, and the rest of the table entry is interpreted as
follows:
Bits     15    14..11    10..0
Field    1     N         C
Width    1     4         11
N is the codeword segment size in bits with valid values in the range from 1 to 11. C contains
the right-justified codeword segment.
If F is clear, the symbol is only partly encoded, and the rest of the table entry is interpreted
as follows:
Bits     15    14..6    5..0
Field    0     K        C
Width    1     9        6
K is the table entry index for the next encode lookup. C is a six-bit segment of the codeword.
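The encode entry layouts can be unpacked in the same way. The sketch below follows the field positions given above for both entry widths; the type vle_entry and the functions vle_unpack32 and vle_unpack16 are hypothetical helper names, and the code models only the field extraction, not how the hardware pushes codeword bits onto the output bitstream.

```c
#include <stdint.h>

/* Unpacked view of one variable-length encode table entry. */
typedef struct {
    unsigned done;      /* F: 1 if the symbol is completely encoded */
    unsigned seg_bits;  /* number of codeword bits in this segment */
    unsigned segment;   /* C: right-justified codeword segment */
    unsigned next;      /* K: next lookup index, meaningful only if done == 0 */
} vle_entry;

/* 32-bit entry. F=1: N = bits 19..16 (0 means 16), C = bits 15..0.
 *               F=0: K = bits 30..16, C = 16-bit segment in bits 15..0. */
static vle_entry vle_unpack32(uint32_t e)
{
    vle_entry r;
    r.done = (e >> 31) & 0x1u;
    r.segment = e & 0xFFFFu;
    if (r.done) {
        unsigned n = (e >> 16) & 0xFu;
        r.seg_bits = n ? n : 16u;
        r.next = 0;
    } else {
        r.seg_bits = 16u;
        r.next = (e >> 16) & 0x7FFFu;
    }
    return r;
}

/* 16-bit entry. F=1: N = bits 14..11, C = bits 10..0.
 *               F=0: K = bits 14..6, C = six-bit segment in bits 5..0. */
static vle_entry vle_unpack16(uint16_t e)
{
    vle_entry r;
    unsigned v = e;
    r.done = (v >> 15) & 0x1u;
    if (r.done) {
        r.seg_bits = (v >> 11) & 0xFu;
        r.segment = v & 0x7FFu;
        r.next = 0;
    } else {
        r.seg_bits = 6u;
        r.segment = v & 0x3Fu;
        r.next = (v >> 6) & 0x1FFu;
    }
    return r;
}
```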
CRC and Scrambling (Linear Feedback Shift Register) operations, commonly used
in Baseband PHY/MAC standards such as Bluetooth, Wi-Fi, and 3GPP.
Bit-level shuffling and manipulation, commonly used in Baseband PHY and MAC
standards.
AE_CRC32 a, d, a0 [ fusion_slot1]
AE_CRC32 processes 8 bits of input data from the AE_DR register d, and updates the CRC
value in address register a, using the CRC polynomial specified by address register a0. CRC
polynomials of up to 32 bits are supported.
C syntax:
extern void AE_CRC32(unsigned a, ae_int32x2 d, unsigned a0);
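For checking results on a host, a bit-serial reference model of a byte-wise CRC update is often useful. The sketch below is a generic textbook formulation and not a description of the AE_CRC32 datapath: it assumes the polynomial is given in its usual MSB-first form without the leading term and that data bits are processed most significant bit first, neither of which is stated in this guide. The function names crc32_update_byte and crc32_buffer are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold one byte into a 32-bit CRC, most significant bit first (assumed order). */
static uint32_t crc32_update_byte(uint32_t crc, uint8_t data, uint32_t poly)
{
    crc ^= (uint32_t)data << 24;
    for (int bit = 0; bit < 8; ++bit) {
        if (crc & 0x80000000u)
            crc = (crc << 1) ^ poly;
        else
            crc <<= 1;
    }
    return crc;
}

/* Run the byte-wise update over a whole buffer, 8 bits per step. */
static uint32_t crc32_buffer(const uint8_t *buf, size_t len,
                             uint32_t init, uint32_t poly)
{
    uint32_t crc = init;
    for (size_t i = 0; i < len; ++i)
        crc = crc32_update_byte(crc, buf[i], poly);
    return crc;
}
```

With the polynomial 0x04C11DB7 and an initial value of 0xFFFFFFFF, this model produces 0x0376E6E7 for the ASCII string "123456789", the published check value for the CRC-32/MPEG-2 parameter set. Refer to the ISA HTML documentation for the bit ordering and polynomial encoding that the operation itself uses.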
AE_LFSR8 a, d, a0 [ fusion_slot1]
AE_LFSR8 generates 8 bits of a Linear Feedback Shift Register (LFSR), using the 32-bit
shift register state in the address register a and the polynomial encoded in the address
register a0. Polynomials of up to 32 bits are supported.
C syntax:
extern void AE_LFSR8(unsigned a, ae_int32x2 d, unsigned a0);
AE_LFSR16 a, d, a0 [ fusion_slot1]
AE_LFSR16 generates 16 bits of a Linear Feedback Shift Register (LFSR), using the 32-bit
state in address register a, and the polynomial encoded into the address register a0.
Polynomials of up to 16 bits are supported.
C syntax:
extern void AE_LFSR16(unsigned a, ae_int32x2 d, unsigned a0);
Generate 8 bits of output of a three-state Convolutional Turbo Encoder, using input bits from
AE_DR register d, and two programmable polynomials from AR register c.
C syntax:
extern void AE_CTC_BIN(unsigned a /*inout*/, ae_int32x2 d /*inout*/,
unsigned c);
This option also includes a variation of a small subset of the AVS bitstream operations. These
variants are AE_LB_BR, AE_LBI_BR, AE_DB_BR.IP, AE_DBI_BR.IP, AE_SB_BR.IP,
AE_SBI_BR.IP, and AE_SBF_BR.IP. These bitstream operation variants operate on the
bitstream with the least significant bit first in each byte. Note that the regular AVS bitstream
operations operate with the most significant bit first in each byte.
Refer to the ISA HTML documentation for detailed specifications of each of the operations.
Byte select operations allow an arbitrary combination/selection from a set of 8 input bytes
formed using 4 bytes each from the two input registers, to generate four output bytes. Each
of the output bytes can be independently selected from any of the 8 input bytes. With this
general definition of byte selection, it is easy to implement byte-level replication, rotation,
shift, and interleaving with the same basic instruction.
C syntax:
extern void AE_SEL4X8_L(ae_int32x2 a /*inout*/, ae_int32x2 b, unsigned
c);
extern void AE_SEL4X8_H(ae_int32x2 a /*inout*/, ae_int32x2 b, unsigned
c);
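The generality of byte selection is easiest to see in a plain C model. The sketch below is illustrative only: the function name select_bytes is hypothetical, and the selector encoding (four 3-bit fields, one per output byte, indexing bytes 0..3 of b and then bytes 4..7 of a, least significant byte first) is an assumption made for this example. The actual encoding of operand c is given in the ISA HTML documentation.

```c
#include <stdint.h>

/* Build four output bytes, each chosen independently from eight input bytes.
 * Assumed encoding: bits [3k+2 : 3k] of sel pick the source of output byte k;
 * indices 0..3 are the bytes of b, indices 4..7 are the bytes of a. */
static uint32_t select_bytes(uint32_t a, uint32_t b, unsigned sel)
{
    uint64_t pool = ((uint64_t)a << 32) | b;
    uint32_t out = 0;
    for (unsigned k = 0; k < 4; ++k) {
        unsigned idx = (sel >> (3 * k)) & 0x7u;
        uint32_t byte = (uint32_t)(pool >> (8 * idx)) & 0xFFu;
        out |= byte << (8 * k);
    }
    return out;
}
```

Under this assumed encoding, a selector of 0 replicates the lowest byte of b into all four positions, and the selector fields (0, 4, 1, 5) interleave the two low bytes of b with the two low bytes of a.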
AE_DEPBITS_L d0, d, imm1, imm2 [ fusion_slot40]
AE_DEPBITS_H d0, d, imm1, imm2 [ fusion_slot40]
Deposit a field into an arbitrary position in an AE_DR register. These instructions are similar
to the Xtensa DEPBITS option, with the difference that the AE_DEPBITS_L/H use AE_DR
registers for input/outputs (The Xtensa DEPBITS uses AR registers for input/output).
C syntax:
extern void AE_DEPBITS_L(ae_int64 dout /*inout*/, ae_int64 d, immediate
low_depbits, immediate lngth_depbits);
extern void AE_DEPBITS_H(ae_int64 dout /*inout*/, ae_int64 d, immediate
low_depbits, immediate lngth_depbits);
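A deposit operation copies the low bits of a source into a chosen bit field of a destination, leaving the other destination bits untouched. The following C sketch shows that idea on a 32-bit value; the function name deposit_bits is hypothetical, the parameters low and length mirror the low_depbits and lngth_depbits immediates only by analogy, and the model says nothing about which half of the AE_DR register the _L and _H forms operate on.

```c
#include <assert.h>
#include <stdint.h>

/* Deposit the low `length` bits of src into dst at bit position `low`. */
static uint32_t deposit_bits(uint32_t dst, uint32_t src,
                             unsigned low, unsigned length)
{
    assert(length >= 1 && length <= 32 && low + length <= 32);
    uint32_t mask = (length == 32) ? 0xFFFFFFFFu : ((1u << length) - 1u);
    /* Clear the target field, then insert the masked source bits. */
    return (dst & ~(mask << low)) | ((src & mask) << low);
}
```

For instance, depositing the 4-bit value 0x5 at bit 8 of 0xFFFFFFFF yields 0xFFFFF5FF.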
AE_LB_BR a, a0 [fusion_slot0]
AE_LBI_BR a, i [fusion_slot0]
AE_DB_BR.IP a, a0 [fusion_slot0]
AE_DBI_BR.IP a, i [fusion_slot0]
AE_SB_BR.IP a, a0 [fusion_slot0]
AE_SBI_BR.IP a, a0, i [fusion_slot0]
AE_SBF_BR.IP a [fusion_slot0]
These are variants of a subset of the AVS bitstream instructions. Each variant has _BR
in its name to distinguish it from the corresponding AVS instruction. The _BR
variants operate least significant bit first in each byte, whereas the corresponding AVS
instructions operate most significant bit first in each byte.
C syntax:
unsigned AE_LB_BR (unsigned a0);
unsigned AE_LBI_BR (immediate i);
void AE_DB_BR_IP (const unsigned short * a /*inout*/, unsigned a0);
void AE_DBI_BR_IP (const unsigned short * a /*inout*/, immediate i);
void AE_SB_BR_IP (unsigned short * a /*inout*/, unsigned a0);
void AE_SBI_BR_IP (unsigned short *a /*inout*/, unsigned a0, immediate i);
void AE_SBF_BR_IP (unsigned short * a /*inout*/);
The AES algorithm specified in the FIPS-197 standard is capable of using cryptographic keys
of 128, 192, and 256 bits to perform a forward cipher and reverse cipher of data in blocks of
128 bits. However, the CCM mode of operation defines the CCM-generation-encryption and
CCM-decryption-verification procedures by only using the forward cipher of the AES
algorithm in FIPS-197. CCM mode does not require the reverse cipher from FIPS-197.
Furthermore, a 128-bit block size is the most widely used in data communication standards
(for example, Bluetooth Low Energy).
Fusion DSP has optional operations to support efficient implementation of the AES forward
cipher algorithm for block size of 128 bits.
Do one step of the Key Expansion procedure as specified in FIPS 197 standard.
C syntax:
extern void AE_AES_RKEY(ae_int64 d0 /*inout*/, ae_int64 d1 /*inout*/,
unsigned a);
AE_AES_SB128 d0, d1 [ fusion_slot40]
Do the ShiftRows transformation on the state array in registers d0 and d1 as specified in
FIPS 197 standard.
C syntax:
extern void AE_AES_SB128(ae_int64 d0 /*inout*/, ae_int64 d1
/*inout*/);
6-bit soft-bit values from interleaved streams of soft bit data are loaded from memory. Internal
state values are stored in 8-bit signed elements of the vector register files. The operations
are designed to perform a forward pass through input soft bits, updating the states and
buffering branch select decisions. These 1-bit branch select decisions are packed and stored
to memory. After all branch select decisions have been stored, the maximal state is identified,
then a backwards traceback pass through the decision bits computes the hard-bit outputs.
The Viterbi operations on Fusion DSP are implemented based on a radix-4 architecture.
Each single-step radix-4 trellis butterfly is equivalent to four two-step radix-2 trellis
butterflies, as shown in Figure 2-2. N is the number of states in the convolutional code.
For constraint length K=5, N is 16, and for constraint length K=7, N is 64.
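[Figure 2-2: radix-4 trellis butterfly and the equivalent radix-2 butterflies. Input states: S4n, S4n+1, S4n+2, S4n+3. Intermediate states: S2n, S2n+1, S2n+N/2, S2n+N/2+1. Output states: Sn, Sn+N/4, Sn+N/2, Sn+3*N/4.]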
The AE_VTADDSUB3BX2S operation calculates partial branch metrics for two consecutive
time instances using 6-bit LLRs whose sign extension occupies 8 bits in memory. The partial
branch metrics are stored in the BMETRICS state register, which will be used by the add-
compare-select instruction.
For code rate R=1/3, the most significant 32 bits of the input register hold the three LLRs of
bits b0, b1, b2 for time instance k, and the least significant 32 bits of the input register hold
the three LLRs for time instance k+1.
The following four partial branch metrics are calculated for each time instance in this
operation:
Following are the other four branch metrics for each time instance:
PMM = -MPP
PMP = -MPM
PPM = -MMP
PPP = -MMM
The above are not calculated in this operation, but derived in the add-compare-select
operation.
For code rate R=1/2, there are only two LLRs for each time instance; you can still use the
same operation, but LLR(b2) in the input register should be 0.
The operation assumes a 6-bit LLR input, which occupies 8 physical bits. The immediate
operand msb of this operation is used to select either the most significant 6 bits or the least
significant 6 bits from 8 bits. If the most significant 6 bits are selected, only four effective bits
of each LLR have been used. If the least significant 6 bits are selected, all six effective bits
of each LLR have been used.
The branch metrics calculated by operation AE_VTADDSUB3BX2S will be used in the add-
compare-select operation (refer to the ISA HTML pages of these operations for a detailed
description, along with the pseudo-code). The branch metrics are stored in state register
BMETRICS. Before calling the add-compare-select operation to update state metrics in trellis
forward processing, first you need to build the branch metric index table. From the branch
metric index, the butterfly operation can find the branch metric.
By exploiting the branch symmetry of the radix-2 butterfly, we only need one branch metric
index for each radix-2 butterfly. For constraint length K=7, there are 64 states. We only store
the branch metric for the branch entering into states from 0 to 31. Each of those 32 states
has two input branches, which originate from two previous states. We only store the branch
metric index for the branch that is connected to the previous state whose state index is an
even number. As shown in Figure 2-2, the radix-2 butterfly is composed of states S4n S4n+1
and S2n S2n+N/2; we only need to store the branch metric index for the branch from S4n to
S2n.
Each add-compare-select operation updates 16 states for two consecutive time instances.
For each add-compare-select operation we need eight branch metric indices for the first time
instance and eight branch metric indices for the second time instance. Each branch metric
index is 3 bits, but occupies 4 physical bits. Four continuous add-compare-select operations
are needed for 64 states of constraint length K=7 and only one add-compare-select operation
is needed for 16 states of constraint length K=5. The order of input states and branch metric
indices feeding to add-compare-select should follow the following sequence:
The intermediate states output from the first stage of the radix-4 butterfly are in the
following sequence:
The intermediate states are not exposed from operation, but the branch select decision bits
for all intermediate states are stored in the same sequence as intermediate states.
The output states from the radix-4 butterfly before shuffling are in the following
sequence:
Normally we apply shuffling on output states combined with the input shuffling to reorder the
states into the right sequence as described above. The branch select decision bits are stored
in the same sequence as the output states before shuffling.
The branch metric indices are packed into sixteen 16-bit words for K=7 and four 16-bit words
for K=5 as follows:
The least significant 8 bits (two branch metric indices) are used by the first stage of the radix-
4 butterfly operation and the most significant 8 bits are used by the second stage of the radix-
4 butterfly operation.
Four consecutive 16-bit words of branch metric indices are needed for each add-compare-select
operation. The 32-bit input operand bmsel0 holds the first two 16-bit words and the 32-bit input
operand bmsel1 holds the next two.
For K=7, the output states after shuffling are in the following sequence:
n,n+1,n+4,n+5,n+N/4,n+1+N/4,n+4+N/4,n+5+N/4,
n+N/2,n+1+N/2,n+4+N/2,n+5+N/2,n+N/4+N/2,n+1+N/4+N/2,n+4+N/4+N/2,n+5+N/4+N/2
with n=0 for the first add-compare-select operation, n=8 for the second operation, n=2 for the
third operation, and n=10 for the fourth operation.
n,n+1,n+4,n+5,n+N/4,n+1+N/4,n+4+N/4,n+5+N/4,
n+N/2,n+1+N/2,n+4+N/2,n+5+N/2,n+N/4+N/2,n+1+N/4+N/2,n+4+N/4+N/2,n+5+N/4+N/2
with n=0.
The input shuffling will change the order of states output from the last iteration into the input
order:
The least significant bit of the 2-bit input immediate operand shfl is used to enable input
shuffling and the most significant bit is used to enable output shuffling.
The immediate input norm is used to select and update the normalization enable flag. The
state register holds two normalization enable flags: the previous normalization flag
NORMALIZE_PREV and the current normalization flag NORMALIZE_CUR.
If norm is true, the effective normalization enable flag is set to the current normalization flag
NORMALIZE_CUR; otherwise it is set to the previous normalization flag
NORMALIZE_PREV. If norm is true, NORMALIZE_PREV is set to NORMALIZE_CUR and
NORMALIZE_CUR is recalculated by examining the most significant 3 bits, excluding the
sign bit, of the output states. If any state is positive and any of its bits in the field selected
by state register NORM_MASK is 1, the current normalization flag is set to 1; otherwise it is
set to 0. The main purpose of the immediate input norm is to make sure that all states
processed by multiple add-compare-select operations are normalized in the same way.
If the effective normalization enable flag is 1, the value of state register NORM_CONST is
subtracted from all output states. The user state registers NORM_MASK and NORM_CONST
are initialized once before trellis processing.
Each add-compare-select operation takes 16 states as input and updates 16 states as
output. For constraint length K=7, which has 64 states, we need four add-compare-select
operations. The 128 decision bits of branch metric selection are stored to memory. The
sequence of operations for every two time instances is summarized below:
AE_VTACSR4X4S_H
AE_VTACSR4X4S_H
AE_VTACSR4X4S_L
AE_VTACSR4X4S_L
AE_S64_DECBITS_H_IP
AE_S64_DECBITS_L_IP
The first add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=0,1,4,5 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=0,1,4,5 at time instance k+2 as output.
The second add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=8,9,12,13 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=8,9,12,13 at time instance k+2 as output.
The third add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=2,3,6,7 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=2,3,6,7 at time instance k+2 as output.
The fourth add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=10,11,14,15 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=10,11,14,15 at time instance k+2 as output.
The states input to add-compare-select operations should always be in sequential order, but
the output state of add-compare-select operations that will be used as input for the next
iteration are not in sequential order. We need to shuffle both the input states and output
states in add-compare-select operations to make sure the state input to each radix-4 butterfly
is always S(4n), S(4n+1), S(4n+2), S(4n+3).
AE_VTACSR4X4S_H
AE_S64_DECBITS_H_IP
The add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=0,1,2,3 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=0,1,2,3 at time instance k+2 as output. The shuffling function in the add-compare-select
operation should also be enabled.
For trellis forward processing, the state metrics are maintained in 8-bit signed vector
elements. For each iteration of a trellis loop, if any output state metric is large enough (the
state metric is positive and any bit of its most significant 3-bit field, excluding the sign bit,
that is selected by the 3-bit state register NORM_MASK is nonzero), the normalization flag is
set, and state normalization is done in the next iteration by subtracting the constant value
held in the 8-bit state register NORM_CONST from the output state metrics.
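The normalization test and the subtraction it triggers can be expressed as a short scalar model. This is a sketch under assumptions: the function names norm_flag and norm_apply are hypothetical, the three bits examined are taken to be bits 6..4 of each 8-bit metric with NORM_MASK applied bit for bit, and the subtraction is clamped to the 8-bit signed range, which this guide does not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if any metric is positive and has a bit set in the masked
 * 3-bit field just below the sign bit (assumed to be bits 6..4). */
static int norm_flag(const int8_t *metric, size_t n, unsigned norm_mask)
{
    for (size_t i = 0; i < n; ++i) {
        if (metric[i] > 0 &&
            ((((unsigned)metric[i]) >> 4) & norm_mask & 0x7u) != 0)
            return 1;
    }
    return 0;
}

/* Subtract the normalization constant from every metric.
 * Clamping to [-128, 127] is an assumption of this model. */
static void norm_apply(int8_t *metric, size_t n, int8_t norm_const)
{
    for (size_t i = 0; i < n; ++i) {
        int v = metric[i] - norm_const;
        if (v < INT8_MIN) v = INT8_MIN;
        if (v > INT8_MAX) v = INT8_MAX;
        metric[i] = (int8_t)v;
    }
}
```

In a loop, the flag computed from one iteration's outputs would gate the call to norm_apply in the following iteration, so that all metrics are shifted down by the same constant and their differences are preserved.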
At the end of the Viterbi computation on the input data streams, before traceback, the
maximal metric must be identified. The 8-bit state metrics are sign-extended to 16-bit state
metrics by the operation AE_UNPKS8X16, and the index of the maximum state metric can
then be found. The traceback starts from the maximal state. If the convolutional encoder is
terminated to state 0, we can instead force the maximal state to be state 0 and start
traceback from state 0.
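The search for the maximal state is an ordinary argmax over the final state metrics. A minimal scalar sketch follows; the function name max_state is hypothetical, the widening to 16 bits only mirrors the sign-extension step described above, and ties are resolved in favour of the lowest state index, which is a choice made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index of the largest state metric (first one on ties). */
static size_t max_state(const int8_t *metric, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        /* Widen to 16 bits before comparing, as after sign extension. */
        int16_t cur = (int16_t)metric[i];
        int16_t top = (int16_t)metric[best];
        if (cur > top)
            best = i;
    }
    return best;
}
```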
After the forward trellis processing, the backward traceback operation AE_VTTB2X64
collects two traceback bits per cycle.
The Fusion_Viterbi_Decoder example demonstrates the use of the Viterbi operations with
LTE and WiFi standard rates and polynomials.
The soft-bit demapping operations are used to convert soft-symbol estimates, outputs of an
equalizer, into soft-bit estimates, or log-likelihood ratios (LLRs), later to be processed by a
soft channel decoder for error correction and detection. The soft-bit demapper typically sits
at the interface between complex and soft-bit domains.
$$\mathrm{LLR}(b_i) = \ln \frac{P(b_i = 1 \mid x)}{P(b_i = 0 \mid x)}$$
This is according to the mapping of bits to a constellation S. The LLR calculation uses a Max-
Log approximation and assumes an unbiased symbol estimate with zero-mean additive white
Gaussian noise (AWGN), i.e. x=s+w, where s belongs to S and w is AWGN.
The scaling factor is used to account for the signal-to-noise ratio and any other desired
weighting adjustments. You can negate the LLR values with an additional sign option.
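Under the Max-Log approximation with AWGN, the LLR of bit i reduces to a scaled difference of two squared distances: the distance from x to the nearest constellation point whose bit i is 0, minus the distance to the nearest point whose bit i is 1. The floating-point sketch below illustrates that calculation for an arbitrary constellation. It is a reference model only; the type const_point, the function maxlog_llr, and the way the scaling factor is passed are hypothetical, and the Fusion DSP operations work on fixed-point data with their own mappings.

```c
#include <float.h>
#include <stddef.h>

/* One constellation point with the bit label it carries. */
typedef struct {
    double re, im;   /* coordinates of the point */
    unsigned bits;   /* bit label; bit i is (bits >> i) & 1 */
} const_point;

/* Max-Log LLR of one bit: scale * (min distance over points with bit = 0
 * minus min distance over points with bit = 1), distances squared. */
static double maxlog_llr(double xr, double xi,
                         const const_point *pts, size_t npts,
                         unsigned bit, double scale)
{
    double d0 = DBL_MAX, d1 = DBL_MAX;
    for (size_t k = 0; k < npts; ++k) {
        double dr = xr - pts[k].re;
        double di = xi - pts[k].im;
        double d = dr * dr + di * di;
        if ((pts[k].bits >> bit) & 1u) {
            if (d < d1) d1 = d;
        } else {
            if (d < d0) d0 = d;
        }
    }
    return scale * (d0 - d1);
}
```

With the sign convention of the definition above, a positive result means that bit value 1 is more likely. The scale argument plays the role of the scaling factor that accounts for the signal-to-noise ratio, and negating it corresponds to the sign option.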
Supported constellations and mappings are summarized in Table 2-23. Symbol mappings
for 3GPP and WiFi use different Gray Encoding formats, both supported by the soft-bit
demapper operations.
The 256-QAM soft-demapping operation selects one complex input from the most significant
or least significant 32 bits of the input vector register vr, where each complex input consists
of a 16-bit real and a 16-bit imaginary part. It computes soft bits for either the higher 4 or the
lower 4 of the eight bits used to generate the constellation, scales them using an exponent
and mantissa from the vector register vs, and writes the resulting 4 soft bits (4 bytes) into the
most significant or least significant 32 bits of the output vector register vt.
C syntax:
extern void AE_SDMAP256QAM1X16C_H(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate, immediate
out_high);
extern void AE_SDMAP256QAM1X16C_L(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate, immediate
out_high);
Scale factors per point: 4-bit mantissa and 4-bit exponent in paired vector elements
Optionally interleave output soft-bit LLRs for real/imaginary parts (IEEE vs. 3GPP
standard)
And, for the outputs of the Fusion DSP soft-bit demap operations:
All operations only output 4 soft-bits each cycle, scaled by the scaling factors, with
rounding and saturation to 8-bit integer resolution at output.
Scaling before the soft demapper is needed to place the soft-symbol inputs onto an integer
grid (assumed hardware implementation Q5.10 format). Scaling after the soft demodulation
is optionally applied by the operations.
Note that this chapter does not attempt to duplicate material in either of these guides.
Fusion DSP offers two MACs per cycle for 24x24-bit, 32x16-bit, and 16x16-bit audio and
voice data and one MAC per cycle for 32x32-bit operations. It offers equivalent support for
both integer and fractional arithmetic. The C and C++ languages support integer arithmetic
on 32x32-bit or 16x16-bit data. Therefore, while standard applications can effectively utilize
Fusion DSP’s resources, applications that require fractional arithmetic or applications that
require 24-bit or 32x16-bit multiplication must be modified to express those semantics. These
modifications can be as simple as declaring variables of the appropriate custom data types
and then relying on built-in operator overloading, or they can involve using explicit intrinsics
to express the exact operations desired. For 16-bit applications, the ITU-T/ETSI intrinsics are
fully supported.
In essentially no case is it required to resort to assembly. All the Fusion DSP instructions can
be accessed from C/C++ level intrinsics. The XCC compiler will efficiently register allocate
Fusion DSP variables and schedule Fusion DSP instructions, relieving the programmer from
the hardest aspects of writing in assembly.
For 24-bit and 32x16-bit applications, the compiler does not automatically vectorize. The
application writer must write the code using explicit vector data types or intrinsics.
This chapter describes multiple approaches to programming Fusion DSP and illustrates them
with some simple examples. The next chapter goes into more detail with more complicated
examples.
To use the Fusion DSP data types and instruction intrinsics, you must appropriately include
the following:
#include <xtensa/tie/xt_fusion.h>
in the C or C++ source code before referring to any of the data types or intrinsics. Optionally,
HiFi 2 or HiFi 3 code can include xt_hifi2.h or xt_hifi3.h, which makes it easy to reuse
existing HiFi applications.
For floating point intrinsics using the optional floating point unit, you must appropriately
include the following:
#include <xtensa/tie/xt_FP.h>
The intrinsic prototype for each Fusion DSP operation is described in Chapter 2.
Fusion DSP supports 16-, 24-, 32- and 64-bit types. All types come in both integer and
fractional versions. For intrinsic programmers using 16-, 32- and 64-bit types, the two types
can usually be used interchangeably. A variable of an integer type can be assigned to a
fractional variable, and vice-versa, without changing the bit pattern in registers or memory. It
is up to the programmer to use the appropriate intrinsic to achieve the desired computation.
However, for programmers using operator overloading, the fractional and integer types map
to different instructions. In particular, fractional types use fractional multiplies and saturating
arithmetic, while integer types use integer multiplies and non-saturating arithmetic. 24-bit
fractional and integer types have an additional difference. 24-bit integer types are stored in
memory in the low 24 bits of a 32-bit word, equivalent to the storage representation for 32-
bit integers. 24-bit fractional types are stored in memory in the high 24 bits of a 32-bit word,
equivalent to a 1.31-bit representation, with the low-precision bits all set to 0.
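The two 24-bit storage conventions can be illustrated in portable C. This is a minimal sketch that models only the 32-bit memory word described above. The helper names pack_int24, pack_f24 and unpack_f24 are hypothetical and are not Fusion DSP intrinsics.

```c
#include <stdint.h>

/* Hypothetical helpers: they model only the 32-bit memory word layout,
 * not the actual Fusion DSP load/store instructions. */

/* 24-bit integer: the value sits in the low 24 bits of the word, which is
 * the same storage representation as a 32-bit integer. */
static int32_t pack_int24(int32_t v24)
{
    return v24;
}

/* 24-bit fraction: the value sits in the high 24 bits and the low 8 bits
 * are zero, so the word reads as a 1.31 number. */
static int32_t pack_f24(int32_t v24)
{
    return (int32_t)((uint32_t)v24 << 8);
}

/* Recover the 24-bit fraction from a 1.31 memory word by dropping the low
 * 8 bits. Assumes >> on a negative value is an arithmetic shift, as it is
 * on mainstream compilers. */
static int32_t unpack_f24(int32_t word)
{
    return word >> 8;
}
```

For example, the 1.23 value 0x400000 (0.5) is stored as the word 0x40000000 in the fractional format and as 0x00400000 in the integer format.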
All types (other than the 64-bit types) come in both scalar and vector versions. In general,
computation happens on vector variables. Scalar variables are stored in the low parts of
registers. The high parts are undefined. Assigning a scalar variable to a variable of the
equivalent vector type will replicate the element in the lowest bit-position into all the elements
of the vector. Assigning a vector to a scalar will not change the bit pattern in the register.
Assigning a low precision variable to a high precision variable in general sign extends the
variable for signed types and zero extends for unsigned types. Assigning a high precision
variable to a low precision variable discards the upper bits for integer types and discards the
lower bits for fractional types.
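The conversion rules above can be mimicked with plain C integers. The following sketch uses hypothetical helper names and standard fixed-width types as stand-ins for the Fusion DSP types. It covers integer widening and both kinds of narrowing, and does not model any particular intrinsic.

```c
#include <stdint.h>

/* Widening an integer: signed values sign-extend, unsigned values
 * zero-extend. */
static int32_t  widen_signed(int16_t v)    { return (int32_t)v; }
static uint32_t widen_unsigned(uint16_t v) { return (uint32_t)v; }

/* Narrowing an integer keeps the low 16 bits (upper bits discarded). */
static int16_t narrow_int(int32_t v)
{
    return (int16_t)(uint16_t)((uint32_t)v & 0xFFFFu);
}

/* Narrowing a 1.31 fraction to 1.15 keeps the high 16 bits (lower bits
 * discarded). Assumes >> on a negative value is an arithmetic shift. */
static int16_t narrow_frac(int32_t v)
{
    return (int16_t)(v >> 16);
}
```

With the input 0x12345678, the integer narrowing yields 0x5678 while the fractional narrowing yields 0x1234.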
Conversions can also be implicitly applied to intrinsic invocations. For example, just like
assigning a scalar variable to a vector variable replicates the lowest element in the register,
a scalar variable assigned to an intrinsic expecting an input vector argument will first be
implicitly replicated.
With the floating point option, Fusion DSP supports a 2-way SIMD, single precision floating
point type xtfloatx2. This type can be converted to and from ae_int32x2 using the standard
C floating point to integer conversions.
All the legacy HiFi 2 types are supported so that HiFi 2 code can work out-of-the-box. They
should only be used on HiFi 2 code but can be freely intermixed when porting HiFi 2 code to
Fusion DSP. Note that for compatibility with HiFi 2, assigning variables of vector types to
variables of type ae_p24s or ae_p24f does not replicate the elements and instead leaves the
bit patterns unchanged.
Table 3-1 contains a complete list of the Fusion DSP data types with a brief description of
each.
Type Description
ae_int32x2 64-bit type containing two 32-bit integer elements. The memory
format for this type is two elements stored in adjacent 32-bit
words. In memory, this type is eight-byte aligned.
ae_f32x2 64-bit type containing two 32-bit fractional elements. The
memory format for this type is two elements stored in adjacent
32-bit words. In memory, this type is eight-byte aligned.
ae_int24x2 48-bit type containing two 24-bit integer elements. The memory
format for this type is two elements, each stored in the least
significant 24 bits of adjacent 32-bit words. In memory, this
type is eight-byte aligned. This type is loaded and stored in a
way that is equivalent to loading and storing the ae_int32x2
type.
ae_f24x2 48-bit type containing two 24-bit fractional elements. The
memory format for this type is two elements, each stored in the
most significant 24 bits of adjacent 32-bit words making it
equivalent to a 1.31-bit representation. In registers, this
occupies the lower 24 bits of each 32-bit half of a register,
allowing for extra guard bits of precision.
ae_int16x4 64-bit type containing four 16-bit integer elements. This type
normally represents the 64-bit contents of a AE_DR register
when the register entry holds four data elements. The memory
format for this type is four elements stored in adjacent 16-bit
words. In memory, this type is eight-byte aligned.
ae_f16x4 64-bit type containing four 16-bit fractional elements. The
memory format for this type is four elements stored in adjacent
16-bit words. In memory, this type is eight-byte aligned.
ae_int32 32-bit type consisting of a single integer element stored in
memory. When this type is converted to an ae_int32x2 type in
an AE_DR register, the data is replicated into the two 32-bit
register elements.
ae_f32 32-bit type consisting of a single fractional element stored in
memory. When this type is converted to an ae_f32x2 type in an
AE_DR register, the data is replicated into the two 32-bit
register elements.
ae_int24 24-bit type containing a single integer element stored in the
least significant 24 bits of a 32-bit word. In memory, this type is
four-byte aligned. This type is loaded and stored in a way that
is equivalent to loading and storing the ae_int32 type.
ae_f24 24-bit type containing a single 24-bit fractional element. The
memory format for this is an element stored in the most
significant 24 bits of a 32-bit word making it equivalent to a
1.31-bit representation. In registers, this occupies the lower 24
bits of each 32-bit half of a register, allowing for extra guard
bits of precision.
ae_int16 16-bit type consisting of a single integer element stored in
memory. When this type is converted to an ae_int16x4 type in
an AE_DR register, the data is replicated into the four 16-bit
register elements.
ae_f16 16-bit type consisting of a single fractional element stored in
memory. When this type is converted to an ae_f16x4 type in an
AE_DR register, the data is replicated into the four 16-bit
register elements.
ae_int64 64-bit type representing the contents of an AE_DR register
when the register entry holds a single integer element.
ae_f64 64-bit type representing the contents of an AE_DR register
when the register entry holds a single fractional element.
ae_int32x4 128-bit type containing four 32-bit integer elements. This is a
composite type containing two ae_int32x2 types. Its main use
is to support operator overloading for 32x16-bit multiplication.
ae_f32x4 128-bit type containing four 32-bit fractional elements. This is a
composite type containing two ae_f32x2 types. Its main use is
to support operator overloading for 32x16-bit multiplication.
HiFi-2 Compatibility Types
ae_p16x2s This type ensures HiFi 2 target code compatibility. 32-bit type
containing two 16-bit elements. This type lives only in memory,
and represents two elements in a 1.15 format. It can be
automatically converted into an ae_p24x2s object, in which
case the low 8 bits of each resulting element are zero and the
upper 8 bits are sign-extended.
ae_p24x2s This type ensures HiFi 2 target code compatibility. 48-bit type
containing two 24-bit elements. The memory format for this
type is two elements, each stored in the least significant 24 bits
of adjacent 32-bit words. In memory, this type is eight-byte
aligned. In Fusion DSP, this type is loaded and stored in a way
that is equivalent to loading and storing the ae_p32x2s type.
ae_p24x2f This type ensures HiFi 2 target code compatibility. This type
occupies 64 bits in memory, but should be thought of as a 48-
bit type containing two 24-bit fractional elements. This type
exists only in memory, and represents two elements in 1.31
format; the low eight bits of each of the elements are ignored. It
can be automatically converted into an ae_p24x2s object, in
which case the low eight bits of each element are discarded –
the 1.31-bit value in memory is converted to 9.23-bit value in
register.
ae_p16s This type ensures HiFi 2 target code compatibility. 16-bit type
consisting of a single element stored in memory. This type can
be automatically converted into an ae_p24x2s. In such a
conversion, the ae_p16s object's bits are padded with zeros
and duplicated to form the two 24-bit elements of the resulting
ae_p24x2s object. In Fusion DSP, each 24-bit element is sign
extended to 32-bits.
ae_p24s This type ensures HiFi 2 target code compatibility. It is a 24-bit
type consisting of a single element stored in the low 24 bits of a
32-bit memory word. This type exists only in memory and can
be automatically converted into an ae_p24x2s object. In such a
conversion, the ae_p24s object’s bits are duplicated to form the
two 24-bit elements of the resulting ae_p24x2s object. In
Fusion DSP, this type is loaded and stored in a way that is
equivalent to loading and storing the ae_p32s type.
ae_p24f This type ensures HiFi 2 target code compatibility. It is a 24-bit
type consisting of a single element stored in the high 24 bits of
a 32-bit memory word. This type exists only in memory and can
be automatically converted into an ae_p24x2s object. In such a
conversion, the ae_p24f object’s bits are duplicated to form the
two 24-bit elements of the resulting ae_p24x2s object. In
Fusion DSP, the 1.31-bit value in memory is converted to a
9.23-bit value in register.
ae_q56s This type ensures HiFi 2 target code compatibility. It is a 56-bit
type representing the contents of an AE_DR register. The
memory format for this type has the bits of the ae_q56s object
stored in the low 56 bits of a 64-bit double word. In Fusion
DSP, this type is loaded and stored in a way that is equivalent
to loading and storing the ae_int64 type.
ae_q32s This type ensures HiFi 2 target code compatibility. It is a 32-bit
type representing a value in memory that will be padded with
16 zeros at the low end and sign extended by eight bits at the
high end to form a 56-bit value when converted to an ae_q56s
object (i.e., when loaded into an AE_DR register). In Fusion
DSP, the 1.31-bit value in memory is converted to a 17.47-bit
value in register.
xtfloatx2 For configurations with the optional SIMD IEEE floating point
unit, a type containing two 32-bit IEEE floating point values.
The following examples demonstrate how to efficiently load, store, and convert various data
types in C using Fusion DSP. The examples do not enumerate all possible conversions
between core C and Fusion DSP types. Generally, conversion between register (local)
variables and data in memory (arrays, struct fields, etc.) should be done through pointer
typecasting, while conversion between register variables should be done through direct use
of the appropriate Fusion DSP conversion intrinsics.
Convert and sign-extend the low (L) 1.31-bit fraction in AE_DR to a 9.55 value in
AE_DR.
ae_int32x2 p = …;
ae_f64 q = AE_CVTQ56P32S_L(p);
Saturate and truncate two 9.55-bit values in AE_DR to the two 1.31-bit fraction
elements of AE_DR.
ae_int64 qh = …;
ae_int64 ql = …;
ae_int32x2 p = AE_TRUNCI32X2F64S(qh, ql, 8);
Saturate two 9.23-bit values in AE_DR into two 1.23-bit fraction elements in AE_DR.
This allows the resultant values to be safely used in future 24-bit multiply instructions.
ae_f32x2 d = …;
ae_f24x2 p = AE_SAT24S(d);
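The effect of saturating a 9.23 value to the 1.23 range can be modeled in portable C. This is a sketch of the arithmetic for one element only. The function sat24 is a hypothetical helper, not the AE_SAT24S intrinsic, and it assumes the 9.23 value is held sign-extended in a 32-bit integer.

```c
#include <stdint.h>

/* Clamp a 9.23 value (24 significant bits plus 8 guard bits) to the range
 * representable in 1.23. Hypothetical helper for illustration only. */
static int32_t sat24(int32_t x)
{
    const int32_t max24 = 0x007FFFFF;    /* largest 1.23 value  */
    const int32_t min24 = -0x00800000;   /* smallest 1.23 value */

    if (x > max24) return max24;
    if (x < min24) return min24;
    return x;
}
```

A value such as 0x01000000, which needs the guard bits, clamps to 0x007FFFFF. Values already in range pass through unchanged.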
Changing Types
Sometimes it is necessary to treat a variable as one type for one computation and another
for a follow-on computation. For example, one might want to do a fractional multiply on a 24-
bit variable that is stored in memory in the low 24-bits rather than the high 24-bits of a word.
For such uses Fusion DSP supports conversion protos that do not change the bit-
representation of a variable.
ae_f64 d = …;
ae_int24x2 p = AE_MOVINT24X2_FROMF64(d);
ae_int32x2: Displays hex and decimal for each element of the vector.
ae_f32x2: Displays hex and decimal for each element of the vector assuming a 1.31
representation.
ae_int24x2: Displays hex and decimal for each element of the vector. The upper 8
bits of the variable, whether in register or in memory, are not displayed.
ae_f24x2: Displays hex and decimal for each element of the vector. If the variable is
in memory, it is displayed as a 1.31 variable. If it is in a register, it is displayed as a
9.23.
ae_int16x4: Displays hex and decimal for each element of the vector.
ae_f16x4: Displays hex and decimal for each element of the vector assuming a 1.15
representation.
ae_int24: Displays hex and decimal. The upper 8 bits of the variable, whether in
register or in memory, are not displayed.
ae_int32x4: Displays hex and decimal for each element of the vector.
ae_f32x4: Displays hex and decimal for each element of the vector assuming a 1.31
representation.
xtfloatx2: Displays floating point for each element of the vector on configurations
with the optional floating point unit.
ae_p24x2f: Displays hex for each element of the vector. All 24-bits of an element
are displayed, even if 0.
ae_p16x2s: Displays hex for each element of the vector. All 16-bits of an element
are displayed, even if 0.
ae_q56s: Displays hex, with the 8 guard bits separated from the other 48-bits. All
48-bits are displayed, even if 0.
Reference codes are frequently written in terms of basic fixed-point intrinsic libraries. As a
first step, it is often desirable to implement the existing intrinsic library in terms of Fusion DSP
intrinsics. When implementing such an intrinsic library, the programmer has the choice of
whether to use standard C/C++ data types as external interfaces or whether to use the native
Fusion DSP data types. If the body of a library is ported to use Fusion DSP intrinsics but the
interface remains standard C/C++, the implementation must convert to and from the Fusion
DSP data types. The compiler can sometimes, but not always, eliminate these conversions.
If instead, the interfaces of the libraries are changed to use Fusion DSP data types,
performance will be better, but all the code that calls into the library must be changed to
handle the Fusion DSP data types. That is not always possible.
In general, the most common scenario is that important functions in the application are
optimized directly for Fusion DSP, and the original library is left for the less important
functions.
There are several basic programming styles that can be used, depending on application
needs, in increasing order of manual effort. These are as follows:
C/C++ code with Fusion DSP data types and operator overloading
Use of intrinsic functions for computation instructions along with Fusion DSP data
types and implicit loads and stores
Use of intrinsic functions for both computation and loads and stores.
These different styles can be freely intermixed. For maximum performance, it is typically
necessary to use at least some amount of explicit intrinsics for computation. However, it is
often not necessary to use intrinsics for loads or stores.
For each of these strategies, one can write either scalar or vector code. One general strategy
is to port a single function at a time. If the desired semantics match standard C/C++ code or
the ITU-T/ETSI intrinsics, start with that and automatic vectorization. For 24-bit or 32x16-bit
applications, start with scalar code, using operator overloading where the desired semantics
match the available overloads and intrinsics where a specialized semantic is needed. Either
way, the code is then profiled. Those parts of the code that are computationally important
can then be manually vectorized. At any point, if the performance goals for the code have
been met, the optimization can cease. By starting with what can be done easily and refining
only the most computationally-intensive portions of code manually, the engineering effort can
be directed to where it has the most effect, which is discussed in the following sections.
The xt-xcc compiler provides several options and methods of analysis to assist in
vectorization. These are discussed in more detail in the Xtensa C and C++ Compiler User’s
Guide, in particular in the SIMD Vectorization section. Cadence recommends studying this
guide in detail. The following guidelines summarize the main points:
Data should be aligned to 8-byte boundaries. The XCC compiler will naturally align
arrays to start on 8-byte boundaries. But the compiler cannot assume that pointer
arguments are aligned. The compiler needs to be told that data is aligned by one of
the following methods:
Using global or local arrays rather than pointers
Using #pragma aligned(<pointer>, n)
Compiling with -LNO:aligned_pointers=on
Pointer aliasing causes problems with vectorization. The __restrict attribute for
pointer declarations (e.g., short * __restrict cp;) tells the compiler that the
pointer does not alias.
There are global compiler aliasing options, but these can sometimes be dangerous.
Subtle C/C++ semantics in loops may make them impossible to vectorize. The
Vectorization Assistant can help identify small changes that allow effective
vectorization.
Outer loops can be simplified wherever possible to allow inner loops to be more
easily vectorized. Sometimes trading outer and inner loops can improve results.
Loops containing function calls and conditionals may prevent vectorization. It may
be better to duplicate code and perform a little "unnecessary computation" to
produce better results.
Array references, rather than pointer dereferencing, can make code (especially
mathematical algorithms) both easier to understand and easier to vectorize.
At –O3, the compiler will perform optimizations that, while mathematically correct,
might change the exact bit results of floating point computations. For example, the
compiler might replace a += b*c with a fused multiply-accumulate operation that
avoids a round between the multiply and the accumulate. If bit-exact answers are
needed, compile with -fno-unsafe-math-optimizations.
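As an illustration of the guidelines above, the following is a minimal sketch of a loop written in a vectorizer-friendly style. It uses a restrict-qualified pointer, array indexing, and a simple counted loop with no calls or conditionals in the body. The function name energy16 is hypothetical, and the standard C99 restrict keyword is used here in place of the __restrict spelling. Whether a particular compiler vectorizes the loop still depends on the options and alignment information it is given.

```c
/* Sum of squares of a 16-bit array, written to be easy to vectorize:
 * restrict-qualified pointer, array references, and a loop body free of
 * function calls and branches. The caller must keep the sum within the
 * range of int. */
int energy16(const short *restrict a, int n)
{
    int sum = 0;

    for (int i = 0; i < n; i++) {
        sum += a[i] * a[i];
    }
    return sum;
}
```

For the input {1, 2, 3, 4} the function returns 30.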
The program can be compiled either with or without automatic vectorization. Note that even
without automatic vectorization, it is still important to use the Use DSP co-processor
button or equivalently the –mcoproc compiler option. These optimizations allow the compiler
to automatically use Fusion DSP instructions for scalar code.
Without vectorization and without the –mcoproc compiler option, the compiler is limited to
the use of Xtensa foundation instructions, and those do not include multiply-add instructions.
The compiler chooses to unroll the loop by a factor of 8, and then packs the 8 adds, 8
multiplies, and 8 loads into 13 cycles. Using the –mcoproc compiler option, the compiler is
able to utilize the Fusion DSP multiply-accumulate operations and generates an inner loop
that performs one 16-bit multiply every cycle.
loopgtz a3,L
{
ae_l16.ip aed0,a2,2
ae_mula16x4.l aed1,aed0,aed0
}
L:
Note that the ae_mula16x4.l instruction performs two multiplies, but because ae_l16.ip
performs a single 16-bit load that replicates the data, each of the two multiplies is multiplying
the same operand.
Note that operations within brackets {, }, in assembly code are part of the same instruction
and execute in parallel.
With vectorization, the compiler generates a loop that executes two multiply-adds every
cycle.
loopgtz a3,L
{
ae_la16x4.ip aed0,u0,a2
ae_mula16x4.h aed1,aed3,aed3
}
{
ae_la16x4.ip aed3,u0,a2
ae_mula16x4.l aed2,aed3,aed3
}
{
nop
ae_mula16x4.h aed1,aed0,aed0
}
{
nop
ae_mula16x4.l aed2,aed0,aed0
}
L:
Note that since the input array is a parameter, and we have not used any special compiler
flags or pragmas, the compiler must assume that it might not be aligned. Therefore, the
compiler uses the aligning load instructions.
If our example had used int instead of short, the compiler would generate a loop that
executes one, 32-bit multiply-add per cycle.
#include <fusion/basic_op_xtensa.h>
#include <fusion/oper_32b_xtensa.h>
The standard intrinsics can then be used either with or without automatic vectorization, just
like standard C/C++ code.
#include <fusion/basic_op_xtensa.h>
Without vectorization (but using –mcoproc), the compiler generates an inner loop that
performs a multiplication every cycle.
loop a3, L
{
ae_l16.ip aed0,a2,2
ae_mulaf16ss.00 aed1,aed0,aed0
}
L:
With vectorization, the compiler generates an inner loop that performs two multiplications
every cycle.
loopgtz a3,L
{
ae_la16x4.ip aed0,u0,a2
ae_mulaafd16ss.33_22 aed1,aed2,aed2
}
{
ae_la16x4.ip aed2,u0,a2
ae_mulaafd16ss.11_00 aed1,aed2,aed2
}
{
nop
ae_mulaafd16ss.33_22 aed1,aed0,aed0
}
{
nop
ae_mulaafd16ss.11_00 aed1,aed0,aed0
}
L:
Table 3-2 describes the supported operators. Unless noted otherwise, the operators return
variables with the same type as the input operand types. If at least one of the input operands
has a SIMD type, the return type will also be SIMD.
The same operator might map to both a version that takes a register argument and one that
takes an immediate. The compiler will automatically choose the immediate version when
used with an immediate that is in range.
Table 3-3 describes the supported operators for the legacy HiFi 2 data types. Note that the
overloading choices for the HiFi 2 types are quite different than for Fusion DSP.
Note that all the non-legacy multiply overloads produce results of the same low precision
as the operands. This is because there are no high-precision SIMD multiplies. The high-
precision dual multiplies in Fusion DSP add (or subtract) together the two multiply results
into a single result, and it is less natural to define the semantics of multiplying two ae_f24x2
variables, for example, to be a single ae_f64 that is the dot-product of the two variables. This
is in contrast to the legacy HiFi 2/EP data types, such as ae_p24x2f, where multiplying two
such variables does indeed do a dot product. Those semantics were chosen because HiFi
2/EP has no true SIMD multiplies.
s = 0;
for (i=0; i<n; i++)
{
s += ((long long) a[i]*a[i]) >> 31;
}
return s;
Assuming that we wish to use 24-bit arithmetic and can therefore throw away the bottom
eight bits of the input, the code can be converted into Fusion DSP code as follows.
The main loop uses operator overloading to perform a 24-bit fixed-point multiply. The ae_f24
typed array is implicitly loaded, just like any standard C/C++ type. As part of the load, the
bottom 8-bits of the 1.31 input array are discarded. The accumulator is of type ae_f32, giving
8 guard bits. The assignment of the result to an int does not change the bit pattern. Hence
this routine returns a 9.23 value stored as an int.
loop a3, L
{
ae_l32f24.ip aed0,a2,4
ae_mulafp24x2ra aed1,aed0,aed0
}
L:
Fusion DSP is able to issue a multiply and a load every cycle. Note that the compiler
automatically generates the multiply-add instruction, ae_mulafp24x2ra. This instruction
does a 24-bit multiplication with a 32-bit accumulation. The 32-bit accumulation does not
saturate, so this code is only safe where 32-bit overflow is not possible. If overflow is possible,
compile with –mno-enable-non-exact-imaps. The compiler will leave the multiply and
the addition as two separate instructions and will use a saturating add for the addition.
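The arithmetic of this scalar loop can be approximated in portable C to check results on a host machine. The following is a simplified sketch, not the exact instruction semantics. It truncates the product, whereas the real multiply-accumulate may also round, and the function name energy24_model is hypothetical. Like the generated code, the accumulation is a plain 32-bit add without saturation.

```c
#include <stdint.h>

/* Truncating model of the 24-bit fixed-point energy loop.
 * Input: 1.31 values. Output: a 9.23 value held in a 32-bit integer.
 * Assumes >> on a negative value is an arithmetic shift. */
int32_t energy24_model(const int32_t *a, int n)
{
    int32_t acc = 0;                        /* 9.23 accumulator           */

    for (int i = 0; i < n; i++) {
        int32_t x = a[i] >> 8;              /* 1.31 -> 1.23, drop 8 bits  */
        int64_t prod = (int64_t)x * x;      /* 2.46 product               */
        acc += (int32_t)(prod >> 23);       /* scale to .23, no saturation */
    }
    return acc;
}
```

For two inputs of 0.5 (0x40000000 in 1.31), each product is 0.25 and the result is 0x400000, which is 0.5 in 9.23 format.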
The inner loop is perfect except that no SIMD is used. By changing ae_f24 into ae_f24x2,
ae_f32 into ae_f32x2, and cutting the trip count in half, we convert the example into a 2-way
SIMD example. The main loop is computing two partial sums in parallel. After the loop, we
must add together the two partial sums into a single sum using the AE_ADD32_HL_LH
intrinsic.
loop a3, L
{
ae_l32x2f24.ip aed0,a2,8
ae_mulafp24x2ra aed1,aed0,aed0
}
L:
The generated code is now able to do two multiplies every cycle with the speed limited by
the load/store bandwidth of the machine.
Note that the optimized code assumes that n is a multiple of two. If that is not guaranteed,
the last iteration of the loop must be conditionally peeled as follows.
If the total number of iterations dynamically turns out to be odd, the last iteration is executed
separately, using scalar instructions. Note the use of the AE_MOVINT32X2_FROMF32
intrinsic. The reduction add intrinsic returns an ae_int32x2 type and therefore the product
of the last iteration must be appropriately coerced.
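The structure of the 2-way version with a peeled last iteration can be sketched in portable C. Two partial sums stand in for the two SIMD lanes, a final add stands in for the reduction, and an odd trailing element is handled separately. The names mulf24 and energy24_x2_model are hypothetical, and the multiply is a truncating model rather than the exact instruction behavior.

```c
#include <stdint.h>

/* Truncating model of one 24-bit fractional square: 1.31 in, .23 out.
 * Assumes >> on a negative value is an arithmetic shift. */
static int32_t mulf24(int32_t word)
{
    int32_t x = word >> 8;
    return (int32_t)(((int64_t)x * x) >> 23);
}

int32_t energy24_x2_model(const int32_t *a, int n)
{
    int32_t acc_h = 0, acc_l = 0;       /* one partial sum per lane */
    int pairs = n >> 1;                 /* trip count cut in half   */

    for (int i = 0; i < pairs; i++) {
        acc_h += mulf24(a[2 * i]);
        acc_l += mulf24(a[2 * i + 1]);
    }

    int32_t sum = acc_h + acc_l;        /* reduction after the loop */

    if (n & 1) {
        sum += mulf24(a[n - 1]);        /* peeled last iteration    */
    }
    return sum;
}
```

With three inputs of 0.5 the loop handles the first pair and the peeled iteration handles the third element, giving 0x600000 (0.75 in 9.23 format).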
This example code uses fixed-point arithmetic. If instead, integral arithmetic is desired, simply
use the integral rather than the fixed-point types.
Every Fusion DSP instruction can be directly accessed by an intrinsic of the same name
(except that “.” in instruction names get converted into “_” in intrinsic names). The prototypes
of the supported intrinsics were listed along with the instruction descriptions in the previous
chapter.
Consider a simple example that does a 24-bit fixed-point energy calculation but wants to
keep all the intermediate results in high precision. Operator overloading always uses the low-
precision multipliers. Therefore, we must use intrinsics for the multiply.
In addition to the dual-multiply intrinsic, intrinsics are used to round the final result back down
to 24-bits.
The intrinsics are not assembly operations. They do not need to be manually scheduled
into FLIX bundles. Variables do not need to be manually allocated into particular
registers. The compiler takes care of all that. The code still remains quite "C-like".
The compiler generates a perfect inner loop with a dual, updating load and a dual
multiply instruction.
The compiler will automatically select load/store instructions, but programmers may
in some cases be able to optimize results using their own selection, by using the
correct intrinsic instead of leaving it to the compiler.
Consider now a similar example where the operand is stored in the circular buffer. The
assumption is that the operand array might cross the end of the buffer. After loading the last
element in the buffer, the code needs to continue to the first element. There is no way to
implicitly utilize the circular buffer load instructions. One needs to use the explicit load
intrinsics as shown in the following code.
ae_f24x2 tmp;
ae_f24x2 *ap = (ae_f24x2 *) a;
ae_f64 s = 0LL;
The operand pointer is loaded using the updating, circular load intrinsic, AE_L32X2F24_XC.
This example assumes that the boundaries of the circular buffer have been set elsewhere.
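The wrap-around behavior that a circular updating load provides can be modeled in portable C. This sketch uses an explicit index and buffer length in place of the hardware circular buffer boundaries. The names circ_load and circ_energy are hypothetical and do not correspond to Fusion DSP intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Load the element at *pos, then advance *pos by one element, wrapping
 * from the end of the buffer back to its beginning. */
static int32_t circ_load(const int32_t *buf, size_t len, size_t *pos)
{
    int32_t v = buf[*pos];

    *pos = (*pos + 1 == len) ? 0 : *pos + 1;
    return v;
}

/* Sum of squares of n elements starting at index start. The operand may
 * cross the end of the buffer and continue at its first element. */
int64_t circ_energy(const int32_t *buf, size_t len, size_t start, size_t n)
{
    int64_t s = 0;
    size_t pos = start;

    for (size_t i = 0; i < n; i++) {
        int32_t v = circ_load(buf, len, &pos);
        s += (int64_t)v * v;
    }
    return s;
}
```

For the buffer {1, 2, 3, 4}, starting at index 3 and reading three elements visits 4, 1 and 2, so the result is 21.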
Code                                           Description
#define XCHAL_HAVE_FUSION                      Fusion
#define XCHAL_HAVE_FUSION_FP                   Fusion FP option
#define XCHAL_HAVE_FUSION_LOW_POWER            Fusion Low Power option
#define XCHAL_HAVE_FUSION_AES                  Fusion BLE/Wifi AES-128 CCM option
#define XCHAL_HAVE_FUSION_CONVENC              Fusion Conv Encode option
#define XCHAL_HAVE_FUSION_LFSR_CRC             Fusion LFSR-CRC option
#define XCHAL_HAVE_FUSION_BITOPS               Fusion Bit Operations Support option
#define XCHAL_HAVE_FUSION_AVS                  Fusion AVS option
#define XCHAL_HAVE_FUSION_16BIT_BASEBAND 1     Fusion 16-bit Quad Mac Unit
#define XCHAL_HAVE_FUSION_VITERBI 1            Fusion Viterbi option
#define XCHAL_HAVE_FUSION_SOFTDEMAP 1          Fusion Soft Bit Demap option
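These macros can be used to guard configuration-specific code. The following is a minimal sketch. It assumes the macro is defined to 0 or 1 by the processor configuration headers and is already visible at this point. USE_FUSION_KERNELS and have_fusion_kernels are hypothetical names introduced for illustration only.

```c
/* Select the Fusion DSP code path only when the configuration provides it.
 * Assumes XCHAL_HAVE_FUSION, when defined, expands to 0 or 1. */
#if defined(XCHAL_HAVE_FUSION) && XCHAL_HAVE_FUSION
#include <xtensa/tie/xt_fusion.h>
#define USE_FUSION_KERNELS 1
#else
#define USE_FUSION_KERNELS 0    /* fall back to portable C */
#endif

/* Report at run time which path was compiled in. */
int have_fusion_kernels(void)
{
    return USE_FUSION_KERNELS;
}
```

On a host build where the macro is not defined, the portable path is selected and the function returns 0.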
The HiFi Mini DSP supports 2-way SIMD 8-bit load instructions AE_LP8X2F.I and
AE_LP8X2F.IU that have no equivalent on Fusion. Fusion instead supports 4-way SIMD 8-
bit loads.
Following are several guidelines for porting HiFi 2 target code to Fusion DSP:
Mapping: Refer to the operations and intrinsics (C syntax) in Chapter 2 for notes on
the HiFi 2-to-Fusion DSP operation and intrinsic mapping.
Optimization level. When optimizing code, the code should be compiled with either
the –O2 or –O3 level of optimization. On average, -O3 will give higher performance,
but not always. It is recommended that critical functions be compiled both ways to
compare performance.
Compiling for code size. Less performance-critical functions should be compiled with
–Os (in addition to either –O2 or –O3). This will meaningfully shrink the code size
required. In addition to saving on memory, smaller code might improve performance
on real systems with more limited instruction cache sizes.
In addition to the flexibility of table structure, we have the flexibility of instructions supporting
both 16- and 32-bit table entries. 16-bit table entries are expected to be superior in most
cases because they tend to save space over 32-bit entries. However, the option to use 32-
bit entries is important, because certain codebooks can make 16-bit table entries impossible
to use: the smaller entries cannot represent large table indices the way 32-bit entries can.
While 16-bit table entries will also give slower encoding for long codewords, we don't expect
this to be a major consideration because the difference is only a few cycles per symbol. In
keeping with the versatility of the mechanism, it is possible to use hierarchical tables with 32-
bit entries at some levels and 16-bit entries at others.
In the vast majority of implementations, 16-bit table entries will be the right choice.
Nonetheless, the instructions for 32-bit entries are there when they are needed.
4.2 Encoding
Since encoding usually has fewer worthwhile table-structure variants than decoding, we will
describe the encode side first and then move to the more complicated considerations around
decoding.
The examples shown in Section 4.4 structure their tables in a couple of ways that are the
most commonly used. You will certainly encounter cases, e.g., in at least one of WMA’s
codebooks, where you will want to implement a different structure for the tables.
For encoding, the usual technique is simple: Translate the symbol to be coded into a table
index, and use that index to retrieve a sequence of codeword bits and a codeword length
from a table or a pair of tables. Usually table entries for each codebook are just long enough
to hold the longest codeword, but in the present mechanism we wanted to provide a way to
keep the codeword length from being dependent on either the size of the table entries or on
other aspects of the implementation. So in our scheme, depending on the length of the
longest codeword, it might be that some codewords don't fit within a single table entry. When
this situation happens, the first lookup in the encoding table provides not only a portion of the
codeword, but also the index of the location in the table to look for the next codeword
segment. Each lookup in the encoding table either completes the codeword or yields an index
for the next lookup. In the case of 32-bit table entries, a second lookup is required only if the
codeword exceeds 16 bits in length. In the case of 16-bit table entries, codewords longer
than 11 bits will require a second and possibly subsequent lookups.
In the first case, we are finished encoding the present symbol once we push the found
codeword bits onto the output bitstream.
The second case is a little more interesting. In the second case, we get some bits of the
codeword from the table entry, and those are pushed onto the output stream, but there are
more codeword bits still to come that could not be accommodated in a single table entry.
When this happens, the first table entry tells us the index of another table entry that will give
us another segment of the codeword’s bit sequence. Once we retrieve the second table entry
based on the new index, we are back in the same situation: either this table entry completes
the codeword, or yet another lookup is required. Table entries needed to support lookups
beyond the first one for each symbol would generally appear at the end of the table, just
beyond the symbol-indexed part.
The length of the codebook’s longest codeword and your decision about whether to use 16-
or 32-bit table entries will bound the number of lookups required to encode a symbol. In
practice, three or more lookups per symbol will be rare with 32-bit table entries (Editor’s note:
we are not aware of any codebooks used in audio that would require three lookups for any
symbol), and four or more will be rare with 16-bit entries.
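The chained lookup described in this section can be sketched in portable C. The entry layout below (a separate struct field for the bits, the bit count and the next index) is a hypothetical simplification chosen for readability. It is not the packed 16-bit or 32-bit entry format used by the Fusion DSP instructions, and the three-entry demo table is invented for illustration.

```c
#include <stdint.h>

/* Hypothetical, unpacked table entry. */
typedef struct {
    uint16_t bits;    /* codeword segment, right-aligned              */
    uint8_t  nbits;   /* number of valid bits in this segment         */
    int16_t  next;    /* index of the next segment, or -1 if complete */
} enc_entry;

typedef struct {
    uint32_t bits;    /* assembled codeword, right-aligned (<= 32 bits) */
    int      length;  /* total codeword length in bits                  */
    int      lookups; /* number of table lookups used                   */
} codeword;

/* Demo table: symbols 0 and 1 index the first two entries; continuation
 * entries follow the symbol-indexed part. */
static const enc_entry demo_table[] = {
    { 0x0, 1, -1 },   /* symbol 0: "0"                            */
    { 0x5, 3,  2 },   /* symbol 1: "101", continues at entry 2    */
    { 0x3, 2, -1 },   /* continuation: "11" completes "10111"     */
};

/* Each lookup either completes the codeword or yields the next index. */
codeword encode_symbol(const enc_entry *table, int symbol)
{
    codeword cw = { 0, 0, 0 };
    int idx = symbol;

    while (idx >= 0) {
        const enc_entry *e = &table[idx];
        cw.bits = (cw.bits << e->nbits) | e->bits;
        cw.length += e->nbits;
        cw.lookups++;
        idx = e->next;
    }
    return cw;
}
```

With the demo table, symbol 0 is encoded in a single lookup, while symbol 1 needs two lookups and produces the 5-bit codeword 10111.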
4.3 Decoding
The decoding process is more complicated than encoding because codewords have variable
length. If we could afford a huge table, we could just pad all the codewords out to the length
of the longest codeword (with bits from the bitstream), and use the resulting string of bits as
an index into a single giant table where we would find an entry telling us the symbol value
and the number of bits in the codeword. Note that the lookup has to tell us the number of bits
in the codeword so we know how many bits to discard from the head of the bitstream we are
reading before doing the next decoding operation.
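The single-table approach can be sketched in portable C for a toy codebook. The code "0", "10", "11" and all names below are hypothetical and chosen only to illustrate the padding idea. The bitstream is passed as an integer holding nbits valid bits, first bit most significant, with nbits at most 32.

```c
#include <stdint.h>

#define MAXLEN 2    /* length of the longest codeword in the toy code */

typedef struct {
    uint8_t symbol;  /* decoded symbol value                   */
    uint8_t length;  /* number of bits the codeword really has */
} dec_entry;

/* Every MAXLEN-bit pattern maps to the symbol whose codeword is its
 * prefix: "0" occupies two slots, "10" and "11" one slot each. */
static const dec_entry full_table[1 << MAXLEN] = {
    { 0, 1 }, { 0, 1 },   /* 00, 01 -> "0"  */
    { 1, 2 },             /* 10     -> "10" */
    { 2, 2 },             /* 11     -> "11" */
};

/* Decode up to max_out symbols. Returns the number of symbols written. */
int decode_all(uint32_t stream, int nbits, uint8_t *out, int max_out)
{
    int n = 0;

    while (nbits > 0 && n < max_out) {
        uint32_t mask = (1u << MAXLEN) - 1;
        uint32_t head;

        if (nbits >= MAXLEN)
            head = (stream >> (nbits - MAXLEN)) & mask;
        else
            head = (stream << (MAXLEN - nbits)) & mask;  /* zero padding */

        const dec_entry *e = &full_table[head];
        if (e->length > nbits)
            break;                      /* truncated codeword */
        out[n++] = e->symbol;
        nbits -= e->length;             /* discard only the used bits */
    }
    return n;
}
```

The bit string 01011 decodes to the symbols 0, 1, 2. The table grows as two to the power of the longest codeword length, which is why this approach is only practical for short codes.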
As with encoding, we look up entries for decoding in a table. But unlike encoding where the
alphabet size determined the size of the initial table, the decoding process has power-of-two
table sizes that are decided by you according to the space/time tradeoffs you want to make.
Decoding takes place through a hierarchy of tables where the size of each table in the
hierarchy is up to you (within limits, of course). A table can have as few as two entries, in
which case it is essentially a node in a binary tree where a single bit of the codeword guides
the decoding process to the next step, or as many as 65536 entries where a 16-bit chunk of
the bitstream forms the table index.
FAAD2 uses a so-called two-step table as the other of its basic table structures. K bits at the
head of the stream are used to index into the first table. (Depending on the codebook, K is
either five or six.) The entry found in the first table gives an index into the second table, which
is essentially made up of consecutively placed subtables of various sizes. The index from
the first table entry tells where the appropriate subtable begins. Each subtable in the second
table corresponds to one or more K-bit combinations that might appear at the head of the
bitstream. If the codeword is longer than K bits, the entry from the first table also tells how
many bits are used to index into the subtable. If the codeword has K bits or fewer, the
corresponding subtable has only one entry so no additional bits are used as an index into it.
The entry found in the second table by indexing using the appropriate number of bits off the
base given in the first table entry gives the decoded symbol value and the codeword length.
This sounds complicated, but it isn't as bad as it sounds.
WMA uses a hierarchically-structured table consisting of 4-ary tree nodes and binary tree
nodes. The eight levels closest to the root in the tree consist of 4-ary tree nodes, and the
remaining six levels are binary.
Our decoding support permits us to structure our decode essentially according to any of
those example schemes, or indeed according to a wide variety of other schemes as well. Our
Fusion DSP variable-length encoding and decoding instructions also permit us more efficient
use of the bits in table entries than the generic-processor implementations, meaning that for
a given table organization scheme, the tables to drive our instructions are smaller than those
in the corresponding generic implementation. And, of course, our decoding operations are
faster as well.
When we begin decoding a codeword, we start at the root of the decoding table hierarchy
and use a prefix of the bitstream to look up a table entry. As mentioned before, the length of
this prefix is determined when the table hierarchy is designed. Once we have a table entry,
there are two cases much like there were for encoding, and again a bit in the table entry
distinguishes between the two.
In the first case, the codeword is short enough that we are done decoding it and the table
entry tells us the symbol corresponding to the codeword, along with the number of bits
occupied by the codeword at the head of the stream. Note that the number of bits used to
index into the table might be greater than the length of the codeword, in which case there
are duplicate table entries, one for each combination of the “don't care” bits that follow the
codeword in the stream.
In the second case, the codeword is longer than an index into the table. In this case, we have
not yet found the symbol corresponding to our codeword (because we have not yet looked
at all the codeword bits). In this case, the table entry tells us where to find the next table and
the number of bits to use as an index into that table. The bits we need to discard from the
head of the stream are exactly those that we used as the table index, so the table entry itself
need not have any direct indication of the number of bits to discard. Upon knowing the base
of the next table in the hierarchy for this codeword and discarding the bits that made up the
index we used for the first table, we are back in the same situation as when we began
decoding: We have a table into which we will index according to a set number of bits at the
head of the bitstream. The process repeats until we find ourselves in the first case with our
symbol in hand.
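The two cases above can be sketched in portable C. This is an illustrative model only: the entry layout (dec_entry), the reader (bit_reader), the routine decode_symbol, and the toy table are assumptions and do not reflect the actual Fusion DSP table format. The toy codebook is A=0, B=10, C=110, D=111, with a 1-bit root table followed by one 2-bit subtable.

```c
#include <stdint.h>

/* Hypothetical entry layout for illustration; not the Fusion DSP format. */
typedef struct {
    uint8_t  done;   /* 1: symbol found, 0: continue in another table */
    uint8_t  nbits;  /* done: bits to discard; else: next index width */
    uint16_t value;  /* done: symbol; else: base index of next table */
} dec_entry;

/* Hypothetical reader: stream bits are left-aligned in one 32-bit word. */
typedef struct {
    uint32_t bits;
} bit_reader;

static unsigned decode_symbol(const dec_entry *table, unsigned root_bits,
                              bit_reader *r)
{
    unsigned base = 0;           /* start at the root of the hierarchy */
    unsigned width = root_bits;  /* index width, 1..16 */
    for (;;) {
        unsigned index = r->bits >> (32 - width);
        const dec_entry *e = &table[base + index];
        if (e->done) {
            r->bits <<= e->nbits;  /* may be fewer bits than 'width' */
            return e->value;       /* first case: symbol in hand */
        }
        r->bits <<= width;         /* second case: discard the index bits */
        base = e->value;
        width = e->nbits;
    }
}

static const dec_entry toy_table[] = {
    { 1, 1, 'A' },  /* root, index 0: codeword "0" */
    { 0, 2, 2 },    /* root, index 1: go to subtable at 2, 2-bit index */
    { 1, 1, 'B' },  /* subtable 00: "10" plus one don't-care bit */
    { 1, 1, 'B' },  /* subtable 01: duplicate entry for "10" */
    { 1, 2, 'C' },  /* subtable 10: "110" */
    { 1, 2, 'D' },  /* subtable 11: "111" */
};
```

Note how the two subtable entries for B are duplicates that discard only one bit, because the second index bit is a don't-care bit belonging to the next codeword.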
The instruction mnemonics are as follows: Audio Engine Variable-Length Decode, Load
{16|32}-bit Table entry; Audio Engine Variable-Length Decode, Load 16 stream bits
Conditional (reflecting the fact that the bitstream is refreshed from memory in 16-bit chunks).
“Audio Engine” in this context refers to this part of the Fusion DSP.
xtbool complete;
unsigned int symbol;
unsigned short *table;
...
not_done:
AE_VLEL16T(complete, symbol, table);
AE_VLES16C(stream);
if (!complete) {
#pragma frequency_hint NEVER
goto not_done;
}
...
With the above sequence, the Xtensa C compiler generates assembly code like the following:
not_done:
ae_vlel16t b0, a3, a9 /* First lookup likely to succeed. */
ae_vles16c a2
bf b0, not_done /* Avoid branch delay in common case. */
done_encoding:
...
If, for example, you know that your encoding table structure is only one layer deep, you can
optimize the code further.
For decoding, the optimal code implementation will depend on the structure of your tables,
although it is possible to build a single routine that works very fast with all the possible
structures. A single decoding step might be enough most of the time if your top-level table
uses a 5-bit index. In such a case, the best way to decode is the simplest, and is exactly
analogous to the encoding code above:
xtbool complete;
unsigned int symbol;
unsigned short *table;
...
not_done:
AE_VLDL16T(complete, symbol, table);
AE_VLDL16C(stream);
if (!complete) {
#pragma frequency_hint NEVER
goto not_done;
}
...
The above sequence in C should yield assembly code like the following:
...
not_done:
ae_vldl16t b0, a9, a4
ae_vldl16c a2
bf b0, not_done
done_decoding:
...
On the other hand, if you build your tables as a binary tree, you're unlikely to find any symbols
within a single decoding step. In this case, if you have to have every last bit of decoding
speed, you can use something like the following example, which is a fast, generic
implementation that handles lookups deep in the table hierarchy with fewer branch delays
than the simple loop above:
...
ae_vldl16t b0, a9, a4
ae_vldl16c a2
bf b0, not_done
done_decoding:
...
not_done:
loopnez a0, .Loopend /* use stack pointer as while (1) loop
counter */
ae_vldl16t b0, a9, a4
ae_vldl16c a2
bt b0, done_decoding
.Loopend:
j not_done /* more lookup iterations than the stack pointer?!? */
In conclusion, the Fusion DSP supplies a generic set of instructions to support variable-length
(Huffman) encode/decode. These instructions place only minimal restrictions on the kind of
table hierarchies you use in your application.
We start with the following simple reference code, written using standard C.
We use a 64-bit accumulator for all the intermediate calculations. When we have completed
one output point, we use a shift to throw away the bottom fractional bits.
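A minimal sketch of a reference routine of this kind is given here for orientation. It assumes 32-bit Q31 (1.31) samples and coefficients, M coefficients, N output points, and forward traversal of both arrays; the name fir_ref and its parameter names are assumptions. In plain C the Q31 products are in 2.62 format, so the final shift is 31 rather than the 32 used with the fractional intrinsics, and nothing saturates.

```c
#include <stdint.h>

/* Hypothetical reference FIR: y[i] = sum over j of x[i+j] * h[j].
   x must hold n + m - 1 samples; data and coefficients are Q31. */
static void fir_ref(int32_t *y, const int32_t *x, const int32_t *h,
                    int n, int m)
{
    for (int i = 0; i < n; i++) {
        int64_t acc = 0;                      /* 64-bit accumulator */
        for (int j = 0; j < m; j++)
            acc += (int64_t)x[i + j] * h[j];  /* Q31 * Q31 = Q62 product */
        /* Throw away the bottom fractional bits to return to Q31.
           Assumes an arithmetic right shift for negative values and
           that the sum does not overflow 64 bits. */
        y[i] = (int32_t)(acc >> 31);
    }
}
```

For example, with x = {0.5, 0.25, 0.125} and h = {0.5, 0.5} in Q31, the two outputs are 0.375 and 0.1875.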
We choose to use the fractional data type ae_f64 and the fractional multiply-accumulate intrinsic
AE_MULAF32S_LL. This intrinsic saturates the result to 64 bits, avoiding the danger of
overflow. The accumulation yields a 1.63-bit result, which is later shifted to the right by 32 bits to
produce a 1.31-bit result.
Next, we utilize SIMD to perform two iterations in parallel. We have a choice in which loop to
run SIMD. If we run two iterations of the j loop in parallel, then in each iteration, we will need
to access x[i+j] and x[i+j+1]. For odd values of i, those two values will not be aligned, and we
need to use aligning loads. While certainly feasible, that does add some overhead.
Alternatively, we can run two iterations of the i loop in parallel. Using pseudo code, the inner
loop is computing y[i:i+1] += x[i+j:i+j+1]*h[j]. Note that in the first j iteration, we are using
x[i:i+1] while in the second we are using overlapping data x[i+1:i+2]. We cannot simply utilize
a SIMD load that will get the right data in both even and odd iterations. Instead, we also unroll
the j loop by two. In the first unrolled iteration, we use x[i:i+1] and x[i+1:i+2]. In the next
iteration, we use data that is exactly two elements ahead: x[i+2:i+3] and x[i+3:i+4]. The code
is shown below.
Note that the first product, x[i+j]*h[j], uses the HH variant, AE_MULAF32S_HH. The Fusion
DSP processor loads vector elements in big endian order, i.e., the element at the lower
memory address goes into the higher half of the register. Note also that we have used the
AE_TRUNC32X2F64 intrinsic to truncate two 1.63-bit values into two 1.31-bit values using one
instruction.
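The loop structure described above, two i iterations in parallel with the j loop also unrolled by two, can be modelled in scalar C as follows. This is only a sketch of the data access pattern, not the intrinsic-based listing: the name fir_unrolled is an assumption, n and m are assumed to be even, and the arithmetic is plain Q31 without saturation.

```c
#include <stdint.h>

/* Scalar model of the 2x2 unrolled FIR. x must hold n + m - 1 samples;
   n and m are assumed even. acc0 and acc1 stand for the two SIMD lanes. */
static void fir_unrolled(int32_t *y, const int32_t *x, const int32_t *h,
                         int n, int m)
{
    for (int i = 0; i < n; i += 2) {
        int64_t acc0 = 0, acc1 = 0;
        for (int j = 0; j < m; j += 2) {
            /* First unrolled j iteration uses the pair x[i+j : i+j+1]. */
            acc0 += (int64_t)x[i + j]     * h[j];
            acc1 += (int64_t)x[i + j + 1] * h[j];
            /* Second uses the overlapping pair x[i+j+1 : i+j+2]. */
            acc0 += (int64_t)x[i + j + 1] * h[j + 1];
            acc1 += (int64_t)x[i + j + 2] * h[j + 1];
        }
        y[i]     = (int32_t)(acc0 >> 31);  /* Q62 back to Q31 */
        y[i + 1] = (int32_t)(acc1 >> 31);
    }
}
```

Each pass through the inner loop touches x[i+j] through x[i+j+2], and the next pass starts exactly two elements further on, so every pass begins on the same alignment.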
Note that we traverse both the coefficient array and the data array in the forward direction,
while a typical formulation often accesses the data in the reverse direction (x[M+N-1+i-j]
* h[j]). To use the reverse formulation, we must traverse the data array in the reverse
direction using the .RIP instructions. It is not possible to use the .RIP instructions with
implicit loads, so we must use explicit intrinsics.
The code is all standard C except that we have used pragmas to tell the compiler that all the
arrays are aligned on 8-byte boundaries. While those pragmas aren’t necessary, they allow
the XCC vectorizer to generate somewhat more efficient code. When compiling with -O3
-LNO:simd, the XCC compiler vectorizes the inner j loop and unrolls the j loop by four. The
resultant code performs eight multiply-accumulate operations in every iteration and
schedules in nine cycles on configurations with the Reduced MAC Latency option, close to
the ideal limit of eight.
To understand what the vectorizer does, and to try to get the ideal schedule, let us vectorize
the code using intrinsics. First, let us vectorize exactly equivalently to how we vectorized the
integer example.
Note that no truncation is needed with floating point. We have explicitly used the
XT_SSX2_L_IP intrinsic that allows us to store two floating point values using one store. On
configurations using the Reduced MAC Latency option, the inner loop schedules in a perfect
four cycles for four iterations. However, on full power configurations, the MADD operations
have four cycles of latency. Because every iteration of the inner loop accumulates twice
into the same accumulator, each iteration takes eight cycles.
We can double the inner loop performance by unrolling the outer loop by four instead of by
two, resulting in the following code.
Now the compiler is able to generate a perfect eight cycle schedule for eight MADD
operations.
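The effect of unrolling the outer loop by four can be seen in a scalar model: each of the four output points gets its own accumulator, so consecutive multiply-adds never wait on the result of the previous one. The sketch below is an illustration in standard C only, not the intrinsic-based listing; the name fir_f32_unroll4 is an assumption and n is assumed to be a multiple of four.

```c
/* Scalar model: four output points per outer iteration, each with an
   independent accumulator. x must hold n + m - 1 samples; n % 4 == 0. */
static void fir_f32_unroll4(float *y, const float *x, const float *h,
                            int n, int m)
{
    for (int i = 0; i < n; i += 4) {
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
        for (int j = 0; j < m; j++) {
            float c = h[j];
            a0 += x[i + j]     * c;  /* four independent dependency chains */
            a1 += x[i + j + 1] * c;
            a2 += x[i + j + 2] * c;
            a3 += x[i + j + 3] * c;
        }
        y[i]     = a0;
        y[i + 1] = a1;
        y[i + 2] = a2;
        y[i + 3] = a3;
    }
}
```

With four independent chains in flight, a multiply-add latency of four cycles can be hidden as long as the chains are interleaved in the schedule.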
k = 0;
for (b = n ; b > 2; b >>= 1) {
unsigned int b2 = b >> 1;
for (i = 0; i < b/4; i += 1) {
short wr = twid[2 * k + 0];
short wi = twid[2 * k + 1];
for (j0 = j1 = i; j0 < n; ) {
int d0r, d0i, d1r, d1i;
int tr, ti;
int r0r, r0i, r1r, r1i;
data[2 * j1 + 0] = r0r;
data[2 * j1 + 1] = r0i;
j1 += b2;
data[2 * j1 + 0] = r1r;
data[2 * j1 + 1] = r1i;
j1 += b2;
}
k += 1;
}
}
}
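For reference, a single radix-2 butterfly of the kind computed in the inner loop (sum, difference, then a complex multiply of the difference by a 16-bit twiddle factor) can be modelled in portable C as follows. This is a sketch under stated assumptions: the names cplx32, mulq15, and butterfly are hypothetical, the twiddle is taken to be Q15, each partial product is rounded separately, and there is no saturation, so the result is not claimed to be bit-exact with any Fusion DSP instruction.

```c
#include <stdint.h>

typedef struct {
    int32_t re, im;
} cplx32;

/* Q15 fractional multiply with rounding: (a * w) / 2^15.
   Assumes an arithmetic right shift for negative values. */
static int32_t mulq15(int32_t a, int16_t w)
{
    return (int32_t)(((int64_t)a * w + (1 << 14)) >> 15);
}

/* In-place butterfly: d0 <- d0 + d1, d1 <- (d0 - d1) * (wr + j*wi).
   The caller must keep the data scaled so the sums do not overflow. */
static void butterfly(cplx32 *d0, cplx32 *d1, int16_t wr, int16_t wi)
{
    cplx32 t = { d0->re - d1->re, d0->im - d1->im };
    d0->re += d1->re;
    d0->im += d1->im;
    d1->re = mulq15(t.re, wr) - mulq15(t.im, wi);
    d1->im = mulq15(t.re, wi) + mulq15(t.im, wr);
}
```

For example, with d0 = (1000, 2000), d1 = (200, 400), and a twiddle of 0.5 + 0j (wr = 0x4000, wi = 0), the outputs are d0 = (1200, 2400) and d1 = (400, 800).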
The 2-way SIMD architecture of Fusion DSP maps nicely to this computation, as a 64-bit
register can hold a single, 32-bit complex data item. The 16-bit twiddle factors complicate the
algorithm a bit since we can store four twiddle factors in one register. Successive iterations
of i must access either the top two or the bottom two entries in the twiddle register. The
simplest way to handle this is to load the twiddle array every other iteration, and use selects
to copy the second pair of elements into proper position for the other iterations. Otherwise,
the conversion to Fusion DSP is very simple, and the resultant code is actually simpler than
the original. Note the use of the explicit complex multiply intrinsic AE_MULFC32X16RAS_H.
k = 0;
ae_f32x2 t;
ae_f32x2 r0, r1;
d0 = data[j0];
j0 += b2;
d1 = data[j0];
j0 += b2;
r0 = d0 + d1;
t = d0 - d1;
r1 = AE_MULFC32X16RAS_H(t, w);
data[j1] = r0;
j1 += b2;
data[j1] = r1;
j1 += b2;
}
}
}
}
According to the Xtensa C Application Programmer’s Guide, you should look at the generated
.S file to see how the compiler compiles the particular loop. By looking at the code next to
the inner loop, you can see that the compiler is not software pipelining the inner loop.
The compiler was not able to optimize the inner loop well because the compiler could not
calculate the number of iterations in the inner loop. If we rewrite the trip count calculation as
follows, the compiler is able to better optimize the inner loop.
d0 = data[i+2*b2*trip];
d1 = data[i+2*b2*trip+b2];
r0 = d0 + d1;
t = d0 - d1;
r1 = AE_MULFC32X16RAS_H(t, w);
data[i+2*b2*trip] = r0;
data[i+2*b2*trip+b2] = r1;
}
}
}
}
Performance is better, but still not ideal. To achieve top performance, the compiler must
software pipeline the loop and execute loads from iteration trip+1 ahead of stores from the
previous iteration trip. However, the compiler will not move the loads up because it doesn’t
know if the loads and stores access the same memory. Therefore, you must move the loads
manually as follows.
d0 = data[i];
d1 = data[i+b2];
for (trip=0; trip < ((n-i+b-1)>>lg2)-1; trip++) {
r0 = d0 + d1;
t = d0 - d1;
r1 = AE_MULFC32X16RAS_H(t, w);
d0 = data[i+2*b2*(trip+1)];
d1 = data[i+2*b2*(trip+1)+b2];
data[i+2*b2*trip] = r0;
data[i+2*b2*trip+b2] = r1;
}
r0 = d0 + d1;
t = d0 - d1;
r1 = AE_MULFC32X16RAS_H(t, w);
data[i+2*b2*trip] = r0;
data[i+2*b2*trip+b2] = r1;
}
}
}
}
Before the inner loop, we load the two elements from the first iteration. In the inner loop, we
operate on the values loaded from the previous iteration and load the values for the next
iteration. After the inner loop, we complete the computation for the last iteration.
With these changes, the compiler is able to schedule the inner loop in four cycles per
iteration, the minimum possible due to the load/store bandwidth of the machine.
Note that the same performance could have been achieved by using the __restrict
attribute on two pointers for the input and output accesses of data, rather than manually
software pipelining the loop. However, this attribute is only allowed to be used when pointers
do not overlap, and the two pointers would in fact overlap.
The library project includes, under doc, a reference manual, NatureDSP Signal Library
Reference for Tensilica Fusion F1 DSP, which describes the library and the test program in
depth.
7. Implementation Methodology
The Fusion DSP is an optional coprocessor for the Xtensa LX core. Fusion DSP is provided
as a check box option in the Xplorer Processor Generator (XPG) interface in Xtensa Xplorer
(XX). This section includes guidelines for using the XPG to configure a Fusion DSP
coprocessor.
As an alternative, Xplorer provides a number of templates for Fusion DSP. These are
described briefly in the next section. If you choose one of these templates, it selects
both the Fusion DSP options for a particular use case and the other attributes of a configuration.
You can then edit the configuration further if your particular use case requires further changes.
In that sense, templates are regarded as a recommended starting point.
FP
Support for IEEE 754 single precision floating point. Floating point compute
operations, including fused multiply-accumulates, can be issued in parallel with
loads or stores. The compute operations work on scalar, 32-bit data. The load
and store operations can load or store two-way SIMD, 32x2-bit data.
AVS
Support for software compatibility with HiFi-2, HiFi 3, and HiFi Mini audio, voice
and speech codecs. Enables HiFi bitstream intrinsics as well as emulation
intrinsics for the HiFi 3 quad multiplication instructions.
Developers using caches should also configure the Cache Prefetch Entries from the
Interfaces window under the category PIF/Memory Interface Widths (refer to Figure 6-2). A
selection of 0 eliminates hardware prefetching from the configuration. Otherwise, 8 or
16 entries are available. The latter provides slightly higher performance at the cost of slightly
more area. In addition, customers should decide whether to enable prefetching directly to L1.
Prefetching into L1 typically improves performance, minimally on configurations with very
large delays to main memory and more significantly on systems with small delays to
secondary or main memory, but at the cost of additional hardware.
You can now customize the processor containing the Fusion DSP as described in the Xtensa
Development Tools Installation Guide. As you customize the processor, remember the
following restrictions:
Core multiplier options (for example, MUL16, MUL32) cannot be selected. Fusion
DSP implements the 32-bit multiplier instructions contained in the MUL32 checkbox
option directly within the Fusion DSP; thus, this checkbox is not needed. MUL16 is
not available. MAC16 is available but is generally not useful on Fusion cores.
The Fusion DSP option is incompatible with the other DSP families.
If the Viterbi Decoder and Soft Bit Demap options are not selected, the maximum
instruction width must be six bytes, because the Fusion DSP has 48-bit instruction
formats, and a 64-bit instruction fetch is required. The data interfaces to memory
must be at least 64 bits wide. If you wish to add your own formats that are larger
than 48 bits (for example, 56 bits (7 bytes) or 64 bits (8 bytes)), then the maximum
instruction width must be set accordingly. On the other hand, if the Viterbi Decoder
or Soft Bit Demap option is selected, then a 64-bit format is created, requiring a
maximum instruction width of 8 bytes. Maximum instruction widths greater than
8 bytes (64 bits) are possible but not recommended, because they cause a relatively
large increase in gate count.
It is not possible for users to add their own additional 48-bit instruction format, as the
current one in Fusion DSP is quite full and there is no space for an additional format
of this size. However, users may add new 56-bit or 64-bit formats as discussed
above when the Viterbi Decoder and Soft Bit Demap options are not selected.
Once a processor has been configured and downloaded, it can be exercised in simulation.
XRC_FusionF1_All_Cache
This template configures the Fusion DSP core for products that combine voice
processing and sensor fusion applications that often require floating point support.
Selected options include the quad 16x16 MAC, FPU, and AVS extensions.
XRC_FusionF1_All_LM
This template configures the Fusion DSP core for products that combine voice
processing and sensor fusion applications that often require floating point support.
Selected options include the quad 16x16 MAC, FPU, and AVS extensions.
This template supports a local memory subsystem and includes debug functionality.
XRC_FusionF1_802ah
This template configures the Fusion DSP core for all narrowband wireless
communications applications, including 802.11ah, by enabling communications ISA
options including quad 16x16 MAC, Soft Bit Demap, Viterbi, and Advanced Bit
Manipulations.
This template supports a local memory subsystem and includes debug functionality.
Boolean registers
NSA/NSAU instructions
Density instructions
Zero overhead loop instructions. Note that this option is not strictly required.
However, audio codecs licensed by Cadence are compiled using these instructions
and not selecting these instructions can significantly increase the MCPS required by
an application.
5- or 7-stage pipeline. Note that this choice has several implications. A 5-stage
pipeline results in a smaller configuration, but the maximum clock speed achievable
after synthesis and layout is lower than with a 7-stage pipeline. In addition, larger
local memories (e.g., 32 KB or larger) may operate better with a 7-stage pipeline
configuration, which has extra memory access stages. Consider these trade-offs in
light of your application.
Without the Viterbi Decoder and Soft Bit Demap options selected, the instruction
width (specified by the ‘max instruction width in bytes’ option in Xplorer) needs to be
at least 6 bytes. If you add user formats greater than 6 bytes (48 bits), this must be
increased. With the Viterbi Decoder or Soft Bit Demap option selected, the instruction
width needs to be at least 8 bytes. However, an increase beyond 8 bytes (64 bits) is
not recommended. A summary table describing the instruction width required for
each option is provided in Appendix B.
To create new larger formats when Viterbi Decoder and Soft Bit Demap options are not used,
you can use the following suggested TIE:
When creating new instructions to put in the existing formats, consider the following points.
The AR register file in Fusion DSP has 2 read ports and 1 write port in each of slots
fusion_slot0 and fusion_slot1. Creating an operation that requires more than two
read or one write operation on the AR register file will increase the number of ports.
The AE_DR register file has one read and one write port in fusion_slot0 and three
read and one write ports in fusion_slot1. When the Viterbi Decoder or Soft Bit Demap
option is selected, the AE_DR register file has one read and one write port in
fusion_slot64_0 and three reads and two writes in fusion_slot64_1. Creating an
operation that has more operands in either slot will increase the number of ports in
the machine and therefore will have a large hardware impact. Such operations
should instead be limited to the non-FLIX fusion_slot40.
Single-cycle DSP instructions should read their AE_DR operands in stage Mstage and write
them in stage Mstage. Ideally, two-cycle DSP instructions should read their earliest AE_DR
operands in stage Mstage, and write their AE_DR operands in stage Mstage+1.
Existing instructions, either core or Fusion DSP, can be placed in additional slots to increase
parallelism. As with custom TIE instructions, simply use the TIE slot_opcode statement to
place the existing operation in one of the VLIW slots. Load and store instructions can be
added to fusion_slot1 to double the memory bandwidth.
It is not currently possible to share existing Fusion DSP functional resources for new
instructions. New multiplier instructions, for example, must use their own dedicated
multipliers.
For timing closure between synthesis and place-and-route, Cadence recommends using a
physically aware synthesis flow such as RC-Physical from Cadence or DC-topo from
Synopsys. These flows are currently supported by the provided synthesis scripts.
AE_AES_SUBBYTE_MIX_XOR64
AE_AES_SB128
AE_AES_RKEY
AE_LB_BR AE_DEPBITS_L
AE_LBI_BR AE_DEPBITS_H
AE_DB_BR.IP AE_CC32_L
AE_DBI_BR.IP AE_CC32_H
AE_SBI_BR.IP AE_CTC_BIN
AE_SB_BR.IP AE_CRC32
AE_SBF_BR.IP AE_SCR32
AE_ADDMOD16U AE_LFSR16
AE_BISEL4X8_L AE_LFSR8
AE_S16X4RNG.I AE_ADDANDSUBRNG16RAS_S1
AE_S16X4RNG.IP AE_ADDANDSUBRNG16RAS_S2
AE_S16X4RNG.X AE_MAXABS16S
AE_S16X4RNG.XP AE_CONJ16S
AE_MULC16S.H AE_MULC16JS.H
AE_MULC16S.L AE_MULC16JS.L
AE_MULAC16S.H AE_MULAC16JS.H
AE_MULAC16S.L AE_MULAC16JS.L
AE_MULFC16RAS. AE_MUL16X4.H
AE_MULAFC16RAS.H AE_MUL16X4.L
AE_MULAFC16RAS.L AE_MULA16X4.H
AE_MULZAAAAQ16 AE_MULA16X4.L
AE_MULAAAAQ16 AE_MULS16X4.H
AE_MULS16X4.L
AE_MOVTABLEFIRSTSEARCHNEXTV AE_MUL16X4.L
AE_MOVVTABLEFIRSTSEARCHNEXT AE_MULCI24
AE_MULFP32X2RS.H AE_MULFCI24RA
AE_MULFP32X2RAS.H AE_MULCI32X16.L
AE_MULAFP32X2RS.H AE_MULCI32X16.H
AE_MULAFP32X2RAS.H AE_MULACR24
AE_MULSFP32X2RS.H AE_MULAFCR24RA
AE_MULSFP32X2RAS.H AE_MULACR32X16.L
AE_MULFP32X2RS.L AE_MULACR32X16.H
AE_MULFP32X2RAS.L AE_MULACI24
AE_MULAFP32X2RS.L AE_MULAFCI24RA
AE_MULAFP32X2RAS.L AE_MULACI32X16.L
AE_MULSFP32X2RS.L AE_MULACI32X16.H
AE_MULSFP32X2RAS.L AE_MULF16X4SS.H
AE_MULFP16X4S.H AE_MULAF16X4SS.H
AE_MULFP16X4RAS.H AE_MULSF16X4SS.H
AE_MULFP16X4S.L AE_MULF16X4SS.L
AE_MULFP16X4RAS.L AE_MULAF16X4SS.L
AE_MULCR24 AE_MULSF16X4SS.L
AE_MULFCR24RA AE_MUL16X4.H
AE_MULCR32X16.L AE_MULA16X4.H
AE_MULCR32X16.H AE_MULS16X4.H
AE_MULA16X4.L AE_DB.IC
AE_MULS16X4.L AE_DBI.IC
AE_MULFD24X2.FIR.H.H AE_DB.IP
AE_MULFD24X2.FIR.H.L AE_DBI.IP
AE_MULFD32X16X2.FIR.HH.H AE_VLEL32T
AE_MULFD32X16X2.FIR.HH.L AE_VLEL16T
AE_MULFD32X16X2.FIR.HL.H AE_SB
AE_MULFD32X16X2.FIR.HL.L AE_SBI
AE_MULAFD24X2.FIR.H.H AE_VLES16C
AE_MULAFD24X2.FIR.H.L AE_SBF
AE_MULAFD32X16X2.FIR.HH.H AE_SB.IC
AE_MULAFD32X16X2.FIR.HH.L AE_SBI.IC
AE_MULAFD32X16X2.FIR.HL.H AE_VLES16C.IC
AE_MULAFD32X16X2.FIR.HL.L AE_SBF.IC
AE_SHA32 AE_SB.IP
AE_VLDL32T AE_SBI.IP
AE_VLDL16T AE_VLES16C.IP
AE_VLDL16C AE_SBF.IP
AE_VLDL16C.IP WUR.AE_BITPTR
AE_VLDL16C.IC RUR.AE_BITSUSED
AE_VLDSHT WUR.AE_BITSUSED
AE_LB RUR.AE_TABLESIZE
AE_LBI WUR.AE_TABLESIZE
AE_LBK RUR.AE_FIRST_TS
AE_LBKI WUR.AE_FIRST_TS
AE_LBS RUR.AE_NEXTOFFSET
AE_LBSI WUR.AE_NEXTOFFSET
AE_DB RUR.AE_SEARCHDONE
AE_DBI WUR.AE_SEARCHDONE
AE_VTACSR4X4S_H
AE_VTADDSUB3BX2S
AE_VTTB2X64
AE_S64_DECBITS.H.IP
AE_S64_DECBITS.L.IP
AE_UNPKS8X16
AE_MOVBMETRICSV
AE_MOVVBMETRICS
AE_MOVDBITSV.H
AE_MOVDBITSV.L
AE_MOVSANORM
AE_SDMAP256QAM1X16C_L
AE_SDMAP64QAM1X16C_H
AE_SDMAP64QAM1X16C_L
AE_SDMAP16QAM1X16C_H
AE_SDMAP16QAM1X16C_L
AE_SDMAPQPSK2X16C
AE_SDMAP64QAM1X16C_HL
The following table highlights the instruction width required (specified by the ‘max
instruction width in bytes’ option in Xplorer) for each of the Fusion F1 options.