Fusion F1 DSP

User’s Guide

For Cadence Tensilica Fusion F1 DSP

Cadence Design Systems, Inc.


2566 Seely Ave.
San Jose, CA 95134
www.cadence.com

© 2018 Cadence Design Systems, Inc.


All rights reserved worldwide

This publication is provided “AS IS.” Cadence Design Systems, Inc. (hereafter “Cadence”) does not make any warranty of any
kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a
particular purpose. Information in this document is provided solely to enable system and software developers to use our
processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual
property rights or licenses granted hereunder to design or fabricate Cadence integrated circuits or integrated circuits based on
the information in this document. Cadence does not warrant that the contents of this publication, whether individually or as one
or more groups, meets your requirements or that the publication is error-free. This publication could include technical
inaccuracies or typographical errors. Changes may be made to the information herein, and these changes may be incorporated
in new editions of this publication.

Cadence, the Cadence logo, Allegro, Assura, Broadband Spice, CDNLIVE!, Celtic, Chipestimate.com, Conformal, Connections,
Denali, Diva, Dracula, Encounter, Flashpoint, FLIX, First Encounter, Incisive, Incyte, InstallScape, NanoRoute, NC-Verilog,
OrCAD, OSKit, Palladium, PowerForward, PowerSI, PSpice, Purespec, Puresuite, Quickcycles, SignalStorm, Sigrity, SKILL,
SoC Encounter, SourceLink, Spectre, Specman, Specman-Elite, SpeedBridge, Stars & Strikes, Tensilica, TripleCheck,
TurboXim, Vectra, Virtuoso, VoltageStorm, Xplorer, Xtensa, and Xtreme are either trademarks or registered trademarks of
Cadence Design Systems, Inc. in the United States and/or other jurisdictions.

OSCI, SystemC, Open SystemC, Open SystemC Initiative, and SystemC Initiative are registered trademarks of Open SystemC
Initiative, Inc. in the United States and other countries and are used with permission. All other trademarks are the property of
their respective holders.

PD-17-8537-10-03
RG-2018.9
Issue Date: 4/2018


Contents

1. Introduction .................................................................................................................. 1
1.1 Purpose of this Guide ............................................................................................. 2
1.1.1 Conventions ........................................................................................................ 2
1.2 Installation Overview .............................................................................................. 2
1.3 Fusion DSP Architecture Overview ........................................................................ 3
1.4 Prefetching.............................................................................................................. 4
1.4.1 Software Prefetching .......................................................................................... 6
1.5 Fusion DSP Instruction Set Overview .................................................................... 7
2. Fusion DSP Features ................................................................................................... 8
2.1 Instruction Naming Conventions........................................................................... 15
2.2 Fixed-point Values and Fixed-point Arithmetic ..................................................... 16
2.2.1 Representation of Fixed-point Values .............................................................. 16
2.2.2 Arithmetic with Fixed-point Values ................................................................... 18
2.2.3 Other Fixed-point Representations................................................................... 18
2.3 VLIW Slots and Formats ....................................................................................... 19
2.4 Load and Store Operations .................................................................................. 21
2.4.1 Aligning Loads and Stores................................................................................ 22
2.4.2 Circular Buffer ................................................................................................... 24
2.4.3 Load and Store Naming Scheme ..................................................................... 26
2.4.4 Load Operations ............................................................................................... 29
2.4.5 Core Load Operations ...................................................................................... 41
2.4.6 Store Operations............................................................................................... 41
2.5 Core Updating Stores ........................................................................................... 52
2.6 Multiply and Accumulate Operations .................................................................... 52
2.6.1 24x24-bit Multiplication Operations .................................................................. 54
2.6.2 32x32-bit Multiplication Operations .................................................................. 58
2.6.3 32x16-bit Multiplication Operations .................................................................. 62
2.6.4 16x16-bit Multiplication Operations .................................................................. 66
2.6.5 16x16-bit Legacy Multiplication Operations ...................................................... 69
2.6.6 32x16-bit Legacy Multiplication Operations ...................................................... 70
2.6.7 HiFi 2 EP 32x24-bit Multiplication Operations .................................................. 73
2.7 Add, Subtract, and Compare Operations ............................................................. 73
2.8 Shift Operations .................................................................................................... 83
2.9 HiFi 2 Shift Operations ......................................................................................... 93
2.10 Normalize Shift Amount Operation ....................................................................... 96
2.11 Divide Step Operation .......................................................................................... 96
2.12 Truncate, Round, Saturate, Convert, and Move Operations ................................ 97
2.13 Selection and Permutation Operations ............................................................... 111
2.14 Bit Reversal ........................................................................................................ 115
2.15 Zero Operation.................................................................................................... 115


2.16 Core ALU Operations ......................................................................................... 116


2.17 Optional 16-bit Quad MAC Unit .......................................................................... 117
2.18 Optional Floating Point Unit ................................................................................ 121
2.18.1 Floating Point Intrinsics .............................................................................. 132
2.18.2 Notes on Not a Number (NaN) Propagation ............................................... 137
2.18.3 HiFi 3 Floating Point Intrinsics Emulation ................................................... 137
2.19 Bitstream and Variable-Length Encode and Decode Instructions AVS ONLY... 138
2.19.1 Codebook Formats ..................................................................................... 150
2.20 Optional Fusion Advanced Bit Manipulation Package........................................ 152
2.21 CRC and Scrambling (LFSR) Operations .......................................................... 152
2.22 Bit-level Convolutional Encode Operations ........................................................ 153
2.23 Bit Shuffling and Selection Operations ............................................................... 154
2.24 Optional AES128-CCM Operations .................................................................... 156
2.25 Optional Viterbi Decoder Operations .................................................................. 157
2.26 Optional Soft-bit Demapping Operations ............................................................ 163
3. Programming the Fusion DSP ................................................................................. 167
3.1 Data Types ......................................................................................................... 168
3.1.1 Example Memory Types ................................................................................. 172
3.2 Xtensa Xplorer Display Format Support ............................................................. 173
3.3 Programming Styles ........................................................................................... 174
3.4 Auto-vectorization of Standard C/C++ ................................................................ 176
3.5 ITU-T/ETSI Intrinsics .......................................................................................... 178
3.6 Operator Overloading ......................................................................................... 180
3.6.1 Energy Calculation Example .......................................................................... 187
3.6.2 32X16-bit Dot Product Example ..................................................................... 190
3.7 Intrinsic-based Programming.............................................................................. 190
3.8 Checking Configuration Options in C/C++ Code ................................................ 192
3.9 HiFi 3 Code Portability ........................................................................................ 192
3.10 HiFi 2 and HiFi Mini Code Portability .................................................................. 193
3.11 Important Compiler Switches.............................................................................. 194
4. Variable-Length Encode and Decode ...................................................................... 195
4.1 Overview of Huffman Instructions....................................................................... 195
4.1.1 Reading and Writing a Sequence of Raw Bits ............................................... 196
4.2 Encoding ............................................................................................................. 196
4.2.1 What Encoding a Symbol Looks Like ............................................................. 197
4.2.2 Encoding Table Lookup Instruction Sequence ............................................... 197
4.3 Decoding............................................................................................................. 198
4.3.1 Supported Decoding Structure Examples ...................................................... 198
4.3.2 Decoding Table Lookup Instruction Sequence............................................... 199
4.4 Encode/Decode Examples ................................................................................. 200
5. Fusion DSP Examples ............................................................................................. 202
5.1 Correlation/Convolutional/FIR Coding ................................................................ 202


5.2 Floating-point FIR ............................................................................................... 204


5.3 Fast Fourier Transform ....................................................................................... 206
6. Fusion F1 NatureDSP Signal Library ....................................................................... 211
7. Implementation Methodology ................................................................................... 212
7.1 Configuring a Fusion DSP .................................................................................. 212
7.2 Xplorer-provided Fusion DSP Templates ........................................................... 215
7.3 Basic Fusion DSP Characteristics ...................................................................... 216
7.4 Extending a Fusion DSP with User TIE .............................................................. 217
7.4.1 Utilizing Fusion DSP Resources..................................................................... 218
7.5 Synthesis and Place-and-Route ......................................................................... 218
Appendix A. Option Instruction Lists ............................................................................... 219
Appendix B. Instruction Width Required by Fusion F1 Options ...................................... 225

Figures

Figure 1-1 Fusion DSP Components .................................................................................... 3


Figure 2-1 AE_DR Register ................................................................................................... 8
Figure 2-2 Radix-4 Trellis Butterfly .................................................................................... 158
Figure 7-1 XPG Options for a Fusion DSP........................................................................ 213
Figure 7-2 Configuring Hardware Prefetch........................................................................ 214

Tables

Table 2-1 DSP Subsystem State Registers ........................................................................... 9


Table 2-2 Bitstream and Variable-length Encode/Decode Support Subsystem State
Registers .............................................................................................................. 9
Table 2-3 Circular Buffer Support State Registers ............................................................... 10
Table 2-4 Floating Point Support State Registers ................................................................ 10
Table 2-5 Viterbi State Registers ......................................................................................... 11
Table 2-6 State Register Access Instructions ...................................................................... 12
Table 2-7 Operand Register Types ...................................................................................... 15
Table 2-8 Operand Immediate Types ................................................................................... 15
Table 2-9 Operation Mnemonics .......................................................................................... 16
Table 2-10: VLIW Slotting..................................................................................................... 19
Table 2-11: Port Usage in Format fusion_format48 ...................................................... 19


Table 2-12: Port Usage in Format fusion_format40 ...................................................... 19


Table 2-13: Port Usage in Format fusion_format40_3 (only applicable with 16-bit Quad
MAC option) ....................................................................................................... 20
Table 2-14: Port Usage in Format fusion_format_fir (only applicable with AVS option)
............................................................................................................................ 20
Table 2-15: Port Usage in Format fusion_format64 (only applicable with Viterbi
Decoder or Soft-bit Demap option) .................................................................... 20
Table 2-16 Circular Buffer States ......................................................................................... 24
Table 2-17 Load/Store Operation Sizes ............................................................................... 26
Table 2-18 Load/Store Operation Suffixes ........................................................................... 27
Table 2-19 Load Overview ................................................................................................... 29
Table 2-20 Store Overview ................................................................................................... 42
Table 2-21 Permutations of Immediate Field Values ......................................................... 111
Table 2-22 Immediate “I” Values ..................................................................................... 127
Table 2-23 Set of Symbol Constellations Supported ......................................................... 164
Table 3-1 Fusion DSP C Types .......................................................................................... 169
Table 3-2 Fusion DSP C/C++ Operators ............................................................................ 180
Table 3-3 Legacy HiFi 2 C/C++ Operators ......................................................................... 185


Changes from the Previous Release


The following changes were made to this document for the Cadence Tensilica RG-2018.9
release of Fusion F1 DSP:

• Corrected syntax for instruction AE_ROUNDSQ32F48ASYM


----------------------------- -------------------------------------
The following changes were made to this document for the Cadence Tensilica RG-2017.8
release of Fusion F1 DSP:

• Improved descriptions for bitstream operations in Section 2.19

• Updated two Xplorer project names in Chapter 6


----------------------------- -------------------------------------
The following changes were made to this document for the Cadence Tensilica RG-2017.7
release of Fusion F1 DSP:

• Added information in Section 2.4.2 that CBEGIN need not be less than CEND

• Added Chapter 6 Fusion F1 NatureDSP Signal Library


----------------------------- -------------------------------------
The following changes were made to this document for the Cadence Tensilica RG-2017.5
release of Fusion F1 DSP:

• Enhanced the description of floating point operations.

• Corrected the intrinsic for AE_L16_IP

• Corrected round instructions that were documented as returning integer types
instead of fractional types

• Corrected AE_PKSR24 as returning a fractional type instead of an integer type

• Clarified the usage restrictions for AE_DIV64

• Clarified conversions between HiFi 2 legacy types and HiFi 3 vector types

• Clarified rules on conversions between float and ae_int32x2

• Corrected an inaccuracy in the output type name for AE_MOVPA24x2

• Added Section 3.8 “Checking Configuration Options in C/C++ Code”


----------------------------- -------------------------------------
The following changes were made to this document for the Cadence Tensilica RG-2016.4
release of Fusion F1 DSP:

• Minor corrections to the (NaN) description

• Enhanced conversions between variables of different types.

• Support for 64-bit format for the Viterbi decoder and Soft-bit demapping options


----------------------------- -------------------------------------
The following changes were made to this document for the Cadence Tensilica RG-2016.3
release of Fusion F1 DSP:

• The title and introduction of this document reflect the name change to "Fusion F1".

• Support for 64-bit format for the Viterbi decoder and Soft-bit demapping options in
Section 2.3. Also described in Section 2.3, one of the Fusion operations added by
the Viterbi option overlaps in the Inst opcode space with the reserved CUST0
opcode.

• Support for AE_MULP32X2, AE_MULAP32X2 and AE_MULSP32X2 in Section
2.6.2 is now provided by default in Fusion F1 base configuration. Previously, it was
only a part of the AVS option.

• Support for AE_MULFC32X16RAS.L (.H) and AE_MULAFC32X16RAS.L (.H) in
Section 2.6.3 is now provided by default in Fusion F1 base configuration. Previously,
these operations were only a part of the AVS option.

• Support for AE_MUL16X4, AE_MULA16X4, AE_MULS16X4 operations in the 16-bit
Quad MAC option in Section 2.6.4.

• Additional information about shift instructions using an AR register added in Section
2.8.

• Made the following changes in Section 2.12:
   - Added a description of instruction AE_TRUNCI32X2F64S.
   - Added new instruction AE_TRUNCI16X4F32S.
   - Amended the note for AE_MOVDA32X2.

• Corrected the boundary conditions for the Circular Load/Store instructions in Table
2-18.

• New select patterns for AE_SEL16I instruction in Section 2.13.

• Five new instructions to support complex conjugate and complex conjugate multiply
in Section 2.18.

• Added a description of HiFi 3 Floating Point intrinsics emulation in Section 2.19.2.

• Added a description of the Viterbi Decoder option in Section 2.25.

• Added a description of the Soft-bit demapping option in Section 2.26.

• Operator overloading for 16b data types in Table 3-2 Fusion DSP C/C++ Operators.

• Clarified information about HiFi 2 and HiFi Mini code portability in Section 3.10.

• Updated the text and screen shots in Section 6.1 to include the newly included Viterbi
Decoder and soft-bit demapping options.

• Amended restriction in Section 7.1 to “As Fusion DSP is always coprocessor number
1, the number of coprocessors must be at least 2.”

• Added a list of the XPG options selected for each template in Section 6.2.


• Added appendixes with a summary list of instructions for each Fusion DSP option
(Appendix A) and Appendix B with instruction width requirements.


1. Introduction
The Cadence® Tensilica® Fusion F1 DSP is a highly optimized, highly configurable processor
geared for efficient execution of dataplane algorithms needed for the Internet of Things (IoT)
and other applications, such as codec chips, sensor hubs, and narrowband wireless
communications. It is derived from a smaller version of the Cadence HiFi 3 DSP. It supports
dual-issuing a single load or store together with two-way SIMD ALU or MAC operations,
with dual 16x16-bit, 32x16-bit, and 24x24-bit MACs and a single 32x32-bit MAC. The base
configuration is source-code compatible with the Cadence HiFi 2, HiFi Mini, and
HiFi 2 EP DSPs except for bitstream and variable-length decode and encode. It is also
compatible with the Cadence HiFi 3 DSP except for bitstream, variable-length decode and
encode, and HiFi 3 quad MAC instructions.

The Fusion F1 DSP contains a wide range of configuration options to meet your needs. Each
one of the eight options can be selected independently.

1. The AVS option adds full HiFi software source compatibility by adding bitstream,
variable-length decode and encode operations, and HiFi 3 quad MAC emulation
capability [1].

2. The 16-bit Quad MAC option adds computation extension to support 16-bit vector
Quad MAC (four MAC) for complex and dot product operations. Also included with
this Quad MAC option are specialized instructions for FFT computation
acceleration.
3. The FP option adds support for single issuing single-precision, IEEE 754, floating
point operations, including fused multiply accumulates, together with two-way
SIMD loads or stores.
4. The Reduced MAC Latency option halves the latency of all long latency operations,
sacrificing the maximum MHz achievable to enable lower area and power at low
MHz.
5. The Advanced Bit Manipulation option adds support for bit-level operations for
baseband-PHY and MAC processing. This option also supports CRC,
scrambling, and FEC convolutional encoding, and adds instructions for bit-level
shuffling operations.
6. The BLE/Wi-Fi AES 128-CCM option supports instructions to accelerate AES 128
CCM-mode encryption/decryption.
7. The Viterbi Decoder option adds instructions for efficient Viterbi decoding to
support rates 1/2 and 1/3 with arbitrary polynomials of constraint lengths 5 and 7.
8. The Soft-bit Demapping option adds instructions for 4/16/64/256-QAM soft bit
demapping with support for different Gray Encoding formats needed by 3GPP and
Wi-Fi.

[1] Note that even without the AVS option, Fusion DSP is able to emulate all HiFi bitstream instructions in software,
albeit very slowly. This makes even the base configuration fully compatible with HiFi 2 and HiFi Mini.


The Fusion F1 DSP is a coprocessor configuration option for the Xtensa® LX6 processor. All
Fusion operations can be used as intrinsics in standard C/C++ applications. In addition, when
compiling with automatic vectorization or with the -mcoproc option, the compiler
automatically infers these operations from standard C code.

Note that the remainder of this document refers to the Fusion F1 DSP as “Fusion DSP” or as
“Fusion”.

1.1 Purpose of this Guide


This guide provides an overview of the Fusion DSP architecture and its instruction set. It will
help programmers using Fusion DSP by identifying some of the techniques commonly used
to optimize algorithms. It provides guidelines to achieve improved performance by using the
Fusion DSP instructions, intrinsics, protos, and primitives. This guide also serves as a C/C++
usage reference for the appropriate way to use Fusion DSP features in C/C++ software
development. It also assists Xtensa Fusion DSP users who wish to add their own
instructions to the Fusion DSP architecture.

To use this guide most effectively, a basic level of familiarity with the Xtensa software
development flow is highly recommended. For more details, see the Xtensa Software
Development Toolkit User’s Guide.

1.1.1 Conventions
Throughout this document, the symbol <xtensa_root> refers to the installation directory
of a user’s Xtensa configuration. For example, <xtensa_root> might refer to the directory
\usr\xtensa\XtDevTools\install\builds\RF-2015.2-win32\<s1> if <s1> is
the name of your Xtensa configuration. In the examples in this guide, replace
<xtensa_root> with the installation directory of your Xtensa distribution.

1.2 Installation Overview


To install a Fusion DSP configuration, follow the same procedures described in the Xtensa
Development Tools Installation Guide. The Fusion DSP include files are in the following
directories and files:

<xtensa_root>/xtensa-elf/arch/include/xtensa/tie/xt_fusion.h

Note that either xt_hifi2.h or xt_hifi3.h can be used instead of xt_fusion.h. This
enables easier migration of existing HiFi code.

For floating point usage with the optional floating point unit, include the following file.

<xtensa_root>/xtensa-elf/arch/include/xtensa/tie/xt_FP.h


1.3 Fusion DSP Architecture Overview


The Fusion DSP is a SIMD (single-instruction/multiple-data) processor that can
work in parallel on two 24/32-bit data items or four 16-bit data items. For example, one
operation can perform two 32-bit additions in parallel, with each addition occupying half
of a 64-bit AE_DR register. The Fusion DSP multipliers support two 24x24-bit, two
32x16-bit, or two 16x16-bit multiplications per cycle, or one 32x32-bit multiplication
per cycle. With the 16-bit Quad MAC option, they also support four
16x16-bit multiplications per cycle. There are operations for single, dual, and quad multiplication.
A single, 64-bit load/store unit is supported. The Fusion DSP can only be configured to use
little-endian byte ordering.
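The two-way SIMD addition described above can be modeled in portable C. The following is only a host-side sketch, not Fusion DSP code: the type reg64_t and the helpers pack2x32, lane_hi, lane_lo, and add2x32 are hypothetical names introduced for illustration, and the sketch assumes plain wraparound (non-saturating) addition in each lane.

```c
#include <stdint.h>

/* Hypothetical host-side model of a 64-bit register holding two 32-bit lanes. */
typedef uint64_t reg64_t;

/* Place 'hi' in the upper 32 bits and 'lo' in the lower 32 bits. */
reg64_t pack2x32(int32_t hi, int32_t lo)
{
    return ((uint64_t)(uint32_t)hi << 32) | (uint64_t)(uint32_t)lo;
}

/* Extract each lane as a signed 32-bit value. */
int32_t lane_hi(reg64_t r) { return (int32_t)(uint32_t)(r >> 32); }
int32_t lane_lo(reg64_t r) { return (int32_t)(uint32_t)r; }

/* Lane-wise add: each 32-bit half is added on its own with wraparound,
   so a carry out of the low lane never reaches the high lane. */
reg64_t add2x32(reg64_t a, reg64_t b)
{
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32);
    uint32_t lo = (uint32_t)a + (uint32_t)b;
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}
```

For instance, adding the pairs (100, -7) and (23, 10) with add2x32 gives the pair (123, 3). On the real hardware a single operation does the work of both lane additions.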

In general, baseline 16-bit support is geared towards efficient support of the ITU-T/ETSI
intrinsic model, while 32x16-bit and 24-bit support is provided for both integer and fixed-point
computation. With the 16-bit Quad MAC option, support is provided for complex 16-bit
multiplications as well as for real 16-bit dot product instructions, allowing efficient
implementations of complex and real FFTs and FIRs.
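To make the arithmetic concrete, the sketch below shows in portable C why a complex 16-bit multiplication and a real 16-bit dot product map naturally onto four 16x16-bit multiplies per step. It is a reference model only. The names cplx64_t, cmul16, and dot16 are hypothetical, and the 64-bit result width is an assumption chosen to avoid overflow in the model, not a statement about the hardware accumulators.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for the model: wide enough that no sum overflows. */
typedef struct { int64_t re; int64_t im; } cplx64_t;

/* (a + jb) * (c + jd) needs exactly four 16x16-bit products. */
cplx64_t cmul16(int16_t a, int16_t b, int16_t c, int16_t d)
{
    cplx64_t r;
    r.re = (int64_t)a * c - (int64_t)b * d;
    r.im = (int64_t)a * d + (int64_t)b * c;
    return r;
}

/* Real dot product written so that each loop step performs four
   multiply-accumulates, followed by a scalar tail for leftover elements. */
int64_t dot16(const int16_t *x, const int16_t *y, size_t n)
{
    int64_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += (int64_t)x[i]     * y[i];
        acc += (int64_t)x[i + 1] * y[i + 1];
        acc += (int64_t)x[i + 2] * y[i + 2];
        acc += (int64_t)x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i) {
        acc += (int64_t)x[i] * y[i];
    }
    return acc;
}
```

As a check of the model, cmul16(1, 2, 3, 4) computes (1 + 2j)(3 + 4j) = -5 + 10j.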

Fusion DSP is a VLIW architecture, supporting the execution of two operations in parallel.
DSP loads and stores, bitstream and Huffman operations, and core operations are available
in slot 0 of a VLIW instruction. DSP MAC and ALU operations are typically available in slot
1. The optional floating point operations are generally available in slot 1.

Fusion DSP supports either caches or local memories with the full flexibility provided by
Xtensa. Configurations can have either or both and can make different choices for instruction
and data. Audio packages supplied by Cadence do not use DMA. Hence, most customers
either use caches or make local memories sufficiently large to cover desired applications.

Figure 1-1 illustrates the main custom state, register file and execution units added to an
Xtensa LX processor by the Fusion DSP.
[Figure 1-1: block diagram showing the AR base register file, the AE_DR register file
(12 x 64 bits), a register MUX, and the execution units of Slot 0 and Slot 1: load/store
unit, variable-length encode/decode and bitstream unit, ALUs, MAC, and miscellaneous
function units.]

Figure 1-1 Fusion DSP Components


The main hardware resources in the DSP subsystem are a multiply/accumulate unit; an
optional single-precision IEEE floating point unit; a 12-entry register file, AE_DR, that holds
64-bit data items, pairs of 32-bit data items, or quads of 16-bit data items; an arithmetic/logic
unit; and a shift unit to operate on the AE_DR values. The multiplier unit supports one
32x32-bit MAC or two 24x24-bit, 32x16-bit, or 16x16-bit MACs per cycle (four 16x16-bit
MACs with the 16-bit Quad MAC option).

The load/store unit is capable of loading or storing up to two 24-bit or 32-bit SIMD elements,
four 16-bit SIMD elements, or single elements up to 64 bits in size. 24-bit data can either be
contained inside 32-bit envelopes or be packed so that each element occupies 24 bits of
memory. Eight packed elements (8 x 24 = 192 bits) can be loaded or stored in three
instructions of 64 bits each. The load/store unit supports
unaligned accesses whereby a stream is first primed and afterwards 64 unaligned bits can
be loaded or stored in every cycle.
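The difference between packed 24-bit storage and 32-bit envelopes can be illustrated with a small host-side model. This is a sketch under stated assumptions, not the behavior of any specific Fusion load operation: the helper names load24_le and unpack24 are hypothetical, the byte order is little-endian (the only ordering the Fusion DSP supports), and the sketch assumes each 24-bit value is sign-extended into its 32-bit container.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one packed 24-bit little-endian value and sign-extend it to 32 bits.
   Sign extension into the envelope is an assumption of this model. */
int32_t load24_le(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    /* Flip bit 23, then subtract 2^23: a portable way to sign-extend. */
    return (int32_t)(v ^ 0x800000u) - 0x800000;
}

/* Expand 'count' packed elements (3 bytes each) into 32-bit containers.
   Eight elements consume 24 bytes, that is, three 64-bit words. */
void unpack24(const uint8_t *src, int32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i] = load24_le(src + 3 * i);
    }
}
```

Packed storage saves one byte per element at the cost of elements that no longer start on 32-bit boundaries, which is why the hardware pairs it with support for unaligned streams.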

The DSP subsystem can be issued in several VLIW formats:

• two-slot 48-bit format (fusion_slot0, fusion_slot1)

• single-slot 40-bit format (fusion_slot40) used mainly for wide branches and AES
instructions

• optional two-slot 40-bit format (fusion_slot_fir_0 and fusion_slot_fir_1) for emulation
of HiFi 3 FIR instructions

• optional two-slot 40-bit format (fusion_slot40_0 and fusion_slot40_1) used for 16-bit
FFT support with the 16-bit Quad MAC option.

The operations for the two-slot VLIW formats can be issued in one of the two slots. In each
execution cycle, zero or one operation from each slot can be executed independently
according to the static bundling expressed in the machine code. So, for example, load
operations can execute concurrently with multiply/accumulate operations because loads are
in fusion_slot0 and multiply/accumulate operations are in fusion_slot1. For better code size,
many operations (but not integer or fixed point multiplies) are also available in single issue
16- and 24-bit formats. Most floating point operations are available in the 24-bit formats.

1.4 Prefetching
Fusion DSP supports a prefetch option geared for systems with long memory latency. When
the Fusion DSP processor detects a positive stride-1 stream of cache misses (either data or
instruction), it can speculatively prefetch ahead up to four cache lines and place them in a
buffer close to the processor, or on the data side, optionally into the L1 data cache (there is
no support for prefetching directly into the L1 instruction cache). In addition, you can manually
issue prefetch instructions.

By default, hardware prefetching is enabled in the reset code provided by Cadence with a
low setting. On configurations that support it, data prefetches are placed into the L1 data
cache by default. You can use the following HAL calls to explicitly disable prefetching or to
increase its aggressiveness in different sections of your code. With more aggressive
prefetching, the hardware will prefetch earlier when detecting a stream and will prefetch more
lines ahead. Assuming sufficient bus bandwidth, performance will improve with more
aggressive prefetch but the system will require more bandwidth. Prefetching instructions and
data can be controlled separately.

#include <xtensa/hal.h>
int xthal_set_cache_prefetch(unsigned long mode);

The value returned is not meant for direct use or interpretation; however, it is suitable for
passing to a subsequent call to xthal_set_cache_prefetch().

The mode parameter can be one of the following:

• The value returned from a previous call to xthal_set_cache_prefetch() or
xthal_get_cache_prefetch()

• One of the following constants, which apply to both instruction and data caches:
  - XTHAL_PREFETCH_ENABLE (enable cache prefetch)
  - XTHAL_PREFETCH_DISABLE (disable cache prefetch)

• A bit-wise OR of two cache prefetch mode constants, one for the instruction cache:
  - XTHAL_ICACHE_PREFETCH_OFF (disable instruction cache prefetch)
  - XTHAL_ICACHE_PREFETCH_LOW (enable, less aggressive prefetch)
  - XTHAL_ICACHE_PREFETCH_MEDIUM (enable, midway aggressive prefetch)
  - XTHAL_ICACHE_PREFETCH_HIGH (enable, more aggressive prefetch)
  - XTHAL_ICACHE_PREFETCH(n) (explicitly set the InstCtl field of the PREFCTL
    register to 0..15. See the Prefetch Architectural Additions section of the
    Prefetch Unit option chapter in the Xtensa Microprocessor Data Book for
    details.)

  and one for the data cache:
  - XTHAL_DCACHE_PREFETCH_OFF (disable data cache prefetch)
  - XTHAL_DCACHE_PREFETCH_LOW (enable, less aggressive prefetch)
  - XTHAL_DCACHE_PREFETCH_MEDIUM (enable, midway aggressive prefetch)
  - XTHAL_DCACHE_PREFETCH_HIGH (enable, more aggressive prefetch)
  - XTHAL_DCACHE_PREFETCH(n) (explicitly set the DataCtl field of the
    PREFCTL register to 0..15. See the Prefetch Architectural Additions section of
    the Prefetch Unit option chapter in the Xtensa Microprocessor Data Book for
    details.)
  - XTHAL_DCACHE_PREFETCH_L1_OFF (prefetch data to prefetch buffers only)
  - XTHAL_DCACHE_PREFETCH_L1 (on configurations that support it, prefetch
    directly to L1 data cache)

For easier simulation, prefetching can also be disabled in the simulator using the
xt-run --prefetch=0 flag. Disabling prefetching from the simulation command line will
override any HAL calls.
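
The following is a minimal sketch of how the returned value might be used to raise data prefetch aggressiveness around one section of code and then restore the earlier setting. The helper names prefetch_boost_begin() and prefetch_boost_end() are hypothetical. The sketch assumes that the value returned by xthal_set_cache_prefetch() describes the setting in force before the call, which is how the text above suggests it is meant to be used. The #else branch is a host-only stand-in so the fragment can be compiled off-target; its numeric values are placeholders, not the real HAL encodings.

```c
#include <assert.h>

#ifdef __XTENSA__
#include <xtensa/hal.h>
#else
/* Host-only stand-ins so the sketch builds off-target. The numeric values
   below are placeholders and NOT the real HAL encodings. */
#define XTHAL_ICACHE_PREFETCH_LOW   0x01
#define XTHAL_DCACHE_PREFETCH_HIGH  0x30
static int host_prefetch_mode = 0x11;     /* pretend reset-code setting */
static int xthal_set_cache_prefetch(unsigned long mode)
{
    int previous = host_prefetch_mode;    /* assumed: the old mode is returned */
    host_prefetch_mode = (int)mode;
    return previous;
}
#endif

/* Hypothetical helper: keep instruction prefetch low, make data prefetch
   more aggressive, and hand back the opaque value needed to undo this. */
int prefetch_boost_begin(void)
{
    return xthal_set_cache_prefetch(XTHAL_ICACHE_PREFETCH_LOW |
                                    XTHAL_DCACHE_PREFETCH_HIGH);
}

/* Hypothetical helper: restore the setting saved by prefetch_boost_begin().
   The saved value is passed back unchanged and never interpreted. */
void prefetch_boost_end(int saved)
{
    xthal_set_cache_prefetch((unsigned long)saved);
}
```

A streaming kernel would be bracketed by the two calls. Whether the more aggressive setting actually helps depends on the available bus bandwidth, so measure both variants.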

1.4.1 Software Prefetching


Prefetching can also be individually controlled via software using the following GCC
extension.

__builtin_prefetch(addr);

Software prefetches can be used for either data or instructions. They can be used in addition
to or instead of hardware prefetching. If hardware prefetching is disabled, the software
prefetches are still enabled.

For configurations that do not prefetch into the cache, but instead use a small, 8- to 16-entry
buffer outside of the cache, you must be careful not to prefetch too far ahead. Otherwise, the
data will be overwritten before it is needed by the processor.

Consider a simple example that performs an energy calculation. You might choose to place
a few explicit prefetch instructions before the loop to seed the hardware prefetcher.
Otherwise, depending on mode, the hardware prefetch might delay prefetching until after the
second miss.

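/* Seed the hardware prefetcher with the first three cache lines of ap[].
   XCHAL_DCACHE_LINESIZE is a byte count, so step through a byte pointer;
   indexing ap[] directly would skip sizeof(ap[0]) lines per step. */
__builtin_prefetch((const char *)ap);
__builtin_prefetch((const char *)ap + XCHAL_DCACHE_LINESIZE);
__builtin_prefetch((const char *)ap + 2*XCHAL_DCACHE_LINESIZE);
for (i=0; i<n; i++) {
    sum += ap[i]*ap[i];
}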

You might also want to put prefetch instructions directly inside the loop. Doing so allows one
to prefetch more aggressively than the hardware prefetcher and allows one to prefetch
patterns other than the stride-1 references that are detected by the hardware prefetcher. On
the other hand, placing prefetch instructions inside the loop incurs instruction overhead
whether or not the loop actually suffers from cache misses.

In general, given the effectiveness of the hardware prefetcher, software prefetches should
be used judiciously. Carefully compare performance between using and not using software
prefetching on a loop-by-loop basis.
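
As an illustration of prefetching inside the loop, the sketch below issues one software prefetch per cache line, a fixed distance ahead of the element being processed. The function name, the 64-byte LINE_BYTES value, and the LINES_AHEAD distance are assumptions made for this example; on a real configuration the line size would come from XCHAL_DCACHE_LINESIZE and the distance would have to be tuned (and kept small on configurations that prefetch into the 8- to 16-entry buffer).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES  64u   /* assumed line size; use XCHAL_DCACHE_LINESIZE on target */
#define LINES_AHEAD 2u    /* assumed prefetch distance; tune per loop */

/* Energy calculation with one software prefetch per cache line. */
int64_t energy_prefetch(const int32_t *ap, size_t n)
{
    const size_t per_line = LINE_BYTES / sizeof ap[0];
    const size_t ahead = LINES_AHEAD * per_line;
    int64_t sum = 0;

    assert(ap != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Prefetch only once per line and only for addresses inside ap[]. */
        if (i % per_line == 0 && i + ahead < n)
            __builtin_prefetch(&ap[i + ahead]);
        sum += (int64_t)ap[i] * ap[i];
    }
    return sum;
}
```

The test on i inside the loop is exactly the instruction overhead the text warns about, which is why such a loop should be compared against the version that relies on the hardware prefetcher alone.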

1.5 Fusion DSP Instruction Set Overview


The Fusion DSP is built on the baseline Xtensa RISC architecture, which implements a rich
set of generic instructions optimized for efficient embedded processing. The power of Fusion
DSP comes from a comprehensive DSP and audio instruction set. A wide variety of
load/store operations support multiple addressing modes with support for 16/24/32-bit scalar
and vector data types together with 56/64-bit scalar. Vector data management is supported
with select operations and shifting.

Multiply operations include 32x32-bit, 32x24-bit, 24x24-bit, 32x16-bit and 16x16-bit. They
come in fixed-point and integer variants, and in high-precision and low-precision variants.
High-precision multiplies use a 64-bit accumulator. Since an accumulator can hold only one
result, Fusion DSP supports dual multiplies where the results of two multiplies are added or
subtracted together before being added into the accumulator. For example, a single
operation might compute the following, where H and L refer to the high bits or low bits
respectively of an operand.

acc = acc – d0.L*d1.L + d0.H*d1.H.
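
The arithmetic of such a dual multiply can be written out in plain C. This is only an integer reference model with hypothetical function names: it treats each operand as a 64-bit register image with signed 32-bit halves, and it ignores saturation, the fractional left shift, and the exact operand widths of any particular instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 64-bit register image from signed 32-bit halves H and L. */
uint64_t pack32x2(int32_t h, int32_t l)
{
    return ((uint64_t)(uint32_t)h << 32) | (uint32_t)l;
}

/* Extract the signed halves again. */
static int32_t half_h(uint64_t d) { return (int32_t)(uint32_t)(d >> 32); }
static int32_t half_l(uint64_t d) { return (int32_t)(uint32_t)d; }

/* acc = acc - d0.L*d1.L + d0.H*d1.H, as ordinary integer arithmetic.
   Intended for operands small enough that no 64-bit overflow occurs. */
int64_t dual_mac_model(int64_t acc, uint64_t d0, uint64_t d1)
{
    int64_t lo = (int64_t)half_l(d0) * half_l(d1);
    int64_t hi = (int64_t)half_h(d0) * half_h(d1);
    return acc - lo + hi;
}
```

Both products are formed from the same pair of registers, which is how one operation can feed two multiplies into a single accumulator.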

Low-precision multiplies accumulate in 32 bits or even 16 bits. Since each register can hold
two 32-bit or four 16-bit accumulators, these instructions can perform two or four independent
SIMD multiplies.

A set of bitstream and variable-length instructions allow for efficient access of serial
bitstreams including Huffman encode and decode.

The optional floating point unit supports IEEE-754 single precision floating point operations
(scalar for compute, two-way SIMD loads and stores).

2. Fusion DSP Features

The Fusion DSP contains a 12-entry, 64-bit register file, AE_DR. Each register can hold one
or two 24- or 32-bit operands, one or four 16-bit operands, or one 56- or 64-bit operand, as
shown in Figure 2-1. 24-bit and 56-bit operands are sign extended to fill their 32- or 64-bit
container. The separate halves or quarters of the register are always separate data items.
For example, if you shift a SIMD 32-bit element to the left, each half is shifted separately.
The high bits of the L input half do not impact the H half of the output.

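[Figure 2-1 shows the 64-bit AE_DR register (bits 63…0) in three views: as a single 64-bit
item; as two 32-bit halves, H in bits 63…32 and L in bits 31…0; and as four 16-bit quarters
numbered 3, 2, 1, 0 from most significant to least significant.]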

Figure 2-1 AE_DR Register
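
The independence of the two halves can be seen in a small C model. The function below is a hypothetical illustration of a non-saturating two-way SIMD left shift, not a description of any specific instruction: each 32-bit half is shifted on its own, so bits leaving the L half are discarded rather than carried into H.

```c
#include <assert.h>
#include <stdint.h>

/* Shift the H and L halves of a 64-bit register image left by n bits,
   treating them as two separate 32-bit items. */
uint64_t sll32x2_model(uint64_t d, unsigned n)
{
    assert(n < 32);
    uint32_t h = (uint32_t)(d >> 32) << n;   /* bits shifted out of H are lost */
    uint32_t l = (uint32_t)d << n;           /* bits shifted out of L are lost */
    return ((uint64_t)h << 32) | l;
}
```

For the register image 0x00000001C0000000 and a shift of one, the model yields 0x0000000280000000, whereas an ordinary 64-bit shift would carry the top bit of L into H and give 0x0000000380000000.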

When a register is stored to memory, the high half of the register is always stored in the lower
memory address. For example, a load that loads a 32 by 2-way SIMD value from address
"a" will place the 32 bits from address "a" into the high 32 bits of the register and the 32 bits
from address "a+4" into the low 32 bits of the register. A load that loads a 16 by 4-way SIMD
value from address "a" will place the 16 bits from address "a" into the high 16 bits of the
register. Operations that access individual 24- or 32-bit elements of AE_DR registers refer to
the elements with selectors L and H in the mnemonics. Operations that access individual
16-bit elements refer to the elements with selectors 3, 2, 1 and 0 in the mnemonics.
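
The element ordering can be modeled in C as follows. The function names are hypothetical. For the four-way case the text states only where the first element goes; placing the remaining elements in descending register position is an assumption consistent with the rule that the high half comes from the lower address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32 by 2-way load: a[0] goes to the H half, a[1] to the L half. */
uint64_t load32x2_model(const int32_t *a)
{
    assert(a != NULL);
    return ((uint64_t)(uint32_t)a[0] << 32) | (uint32_t)a[1];
}

/* 16 by 4-way load: a[0] goes to element 3 (the high 16 bits).
   Elements 2, 1, 0 taking a[1], a[2], a[3] is assumed, see lead-in. */
uint64_t load16x4_model(const int16_t *a)
{
    assert(a != NULL);
    return ((uint64_t)(uint16_t)a[0] << 48) |
           ((uint64_t)(uint16_t)a[1] << 32) |
           ((uint64_t)(uint16_t)a[2] << 16) |
            (uint64_t)(uint16_t)a[3];
}
```

The model is independent of host byte order because it works on array elements rather than on raw bytes.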

For compatibility with HiFi 2, HiFi EP, and HiFi Mini, a 32-bit data item might occupy the
middle of an entire AE_DR register and a 16-bit data item might occupy the middle of a 32-
bit half register. When using such legacy instructions, a register holds half as many elements
and hence the instruction exploits less parallelism. Such instructions should only be used in
legacy code.

Fusion DSP supports a 4-entry, 64-bit alignment register, AE_VALIGN. The use of this
register allows the hardware to load or store a SIMD stream that is not 64-bit aligned at a
rate of 64-bits per cycle. It also allows 24-bit data to be packed densely into 24-bit containers.
These mechanisms are described in more detail in Section 2.4.1.

The TIE state registers in the Fusion DSP are listed in Table 2-1.

Table 2-1 DSP Subsystem State Registers

State Register Bit Size Description


AE_OVERFLOW 1 Indicates whether any arithmetic operation has saturated
since the time when AE_OVERFLOW was last reset to
zero.
AE_SAR 7 Contains the shift amount for various DSP shift operations.

The state registers in Table 2-2 pertain to the bitstream and variable-length encode/decode
support subsystem of the Fusion DSP. This subsystem is described in detail in Section 2.20.
All these registers are available with the AVS option and AE_BITHEAD, AE_BITPTR, and
AE_BITSUSED are also available with the Advanced Bit Manipulation Package option.
Programmers generally will not need to consider the details of how each of these state
registers is used by the instructions, but the state registers are documented here for
completeness. These descriptions make more sense to a reader who is already somewhat
familiar with the variable-length encode/decode instructions.

Table 2-2 Bitstream and Variable-length Encode/Decode Support Subsystem State Registers

State Register Bit Size Description


AE_BITHEAD 32 Contains the bits at the head of the bitstream. The high half
has the current 16 bits and the low half has the next 16 bits.
Only the high half is used for output bitstreams.
AE_BITPTR 4 Offset within the 16 most-significant bits of the bitstream
head. For an input bitstream, this value signifies the
number of most significant bits of AE_BITHEAD that have
been consumed already by the application. For an output
bitstream, this value signifies the number of most significant
bits of AE_BITHEAD that have already been initialized.
AE_BITSUSED 4 Contains the number of bits consumed or produced in the
last table lookup by a variable-length encode/decode
instruction. This value is coded in binary except that all-
zeroes is interpreted as the value 16.
AE_TABLESIZE 4 Contains one less than the base-2 logarithm of the current
decoding table size for variable-length decode. 0
corresponds to a 2-entry table; 15 corresponds to a 65536-
entry table.
AE_FIRST_TS 4 Contains the correct value of AE_TABLESIZE for the first
level in the lookup-table hierarchy. This state is an
optimization so that no AE_VLDSHT instruction is needed
between consecutive decoding operations using the same
codebook.
AE_NEXTOFFSET 27 This state is used for three different things.
- In variable-length decode: Before an AE_VLDL16T or
AE_VLDL32T instruction, AE_NEXTOFFSET is the index of
the table entry corresponding to the current bitstream prefix
to look up.
- After an AE_VLDL16T or AE_VLDL32T instruction,
AE_NEXTOFFSET is the offset of the base of the next
decoding lookup table.
- In variable-length encode: After an AE_VLEL16T or
AE_VLEL32T instruction, the low bits of AE_NEXTOFFSET
hold the codeword bits produced by the most recent lookup.
AE_SEARCHDONE 1 This state tells the AE_VLDL16C instruction to prepare
AE_NEXTOFFSET (using AE_FIRST_TS) for a fresh
decoding search starting with the first table in the decoding
hierarchy. This state is an optimization so that no
AE_VLDSHT instruction is needed between consecutive
decoding operations using the same codebook.
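
Two of the 4-bit fields in Table 2-2 use encodings that are easy to misread: AE_BITSUSED stores 16 as all-zeroes, and AE_TABLESIZE stores one less than the base-2 logarithm of the table size. The helpers below are a sketch of how software that inspects saved state might interpret the raw field values; the function names are hypothetical.

```c
#include <assert.h>

/* AE_BITSUSED: plain binary count, except that 0 means 16 bits. */
unsigned bitsused_count(unsigned field)
{
    assert(field < 16);
    return field == 0 ? 16u : field;
}

/* AE_TABLESIZE: field = log2(entries) - 1, so 0 -> 2 and 15 -> 65536. */
unsigned long table_entries(unsigned field)
{
    assert(field < 16);
    return 1ul << (field + 1);
}
```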

The state registers in Table 2-3 pertain to the circular buffer support and are shared between
the DSP subsystem and the bitstream and variable-length encode/decode support
subsystem of the Fusion DSP.

Table 2-3 Circular Buffer Support State Registers

State Register Bit Size Description


AE_CBEGIN0 32 Contains the start address of the circular buffer.
AE_CEND0 32 Contains the end address of the circular buffer.
AE_CWRAP 1 Indicates whether any bit-stream circular buffer operation has
wrapped around since the time when AE_CWRAP was last
reset to zero.

The state registers in Table 2-4 pertain to the optional floating point support.

Table 2-4 Floating Point Support State Registers

State Register Bit Size Description


RoundMode 2 Controls the rounding mode of floating point operations. A
value of 0 rounds to nearest, a value of 1 rounds toward 0,
a value of 2 rounds toward positive infinity and a value of
3 rounds toward negative infinity.
InvalidFlag 1 Invalid exception flag.
DivZeroFlag 1 Divide-by-zero flag.
OverflowFlag 1 Overflow exception flag.
UnderflowFlag 1 Underflow exception flag.
InexactFlag 1 Inexact exception flag.
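
When results from the DSP are checked against a reference computed on a host, it can help to run the host in the matching rounding mode. The sketch below maps the four RoundMode values listed in Table 2-4 onto the standard C <fenv.h> rounding macros. The function name is hypothetical, and the mapping is an illustration for host-side reference code only; it does not read or write the FCR register.

```c
#include <assert.h>
#include <fenv.h>

/* Translate a RoundMode field value (0..3) into a <fenv.h> rounding macro
   suitable for fesetround() on the host. */
int roundmode_to_fenv(unsigned round_mode)
{
    assert(round_mode < 4);
    switch (round_mode) {
    case 0:  return FE_TONEAREST;   /* round to nearest */
    case 1:  return FE_TOWARDZERO;  /* round toward 0 */
    case 2:  return FE_UPWARD;      /* round toward positive infinity */
    default: return FE_DOWNWARD;    /* round toward negative infinity */
    }
}
```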

Table 2-5 Viterbi State Registers

State Register Bit Size Description
DECISION_BITS_H 64 Contains the 64 bits of the branch decision bits from
AE_VTACS4X4S_H of which MSB 32 bits correspond to
stage 0 and LSB 32 bits correspond to stage 1 in Viterbi
decoding.
DECISION_BITS_L 64 Contains the 64 bits of the branch decision bits from
AE_VTACS4X4S_L of which MSB 32 bits correspond to
stage 0 and LSB 32 bits correspond to stage 1 in Viterbi
decoding.
BMETRICS 64 Contains the branch metrics for the two stages in Viterbi
decoding.
NORM_CONST 7 Contains a 7-bit unsigned normalization constant that is
zero extended to 8 bits and subtracted from the state
metrics in Viterbi decoding.
NORM_MASK 3 Contains the 3-bit normalization mask initialized by the
programmer that is ANDed with the MSB-1 to MSB-3 bits
of the state metric outputs to detect an overflow possibility
and indicate it using the NORMALIZE_CUR state.
NORMALIZE_CUR 1 Contains a 1-bit flag indicating if any of the state metric
outputs in the current stages are close to overflow that can
be used to trigger normalization in the next stages.
NORMALIZE_PREV 1 Contains a 1-bit flag indicating if the previous state metric
outputs are close to overflow that can be used to
implement normalization in the current stages.

The TIE state registers are grouped as follows into user registers for the purposes of efficient
save and restore operations:

• user_register CIRC {AE_CBEGIN0, AE_CEND0}

• user_register TABLEFIRSTSEARCHNEXT {AE_TABLESIZE, AE_FIRST_TS,
AE_SEARCHDONE, AE_NEXTOFFSET}

• user_register FUSIONMISC {AE_BITHEAD, AE_BITPTR, AE_BITSUSED,
AE_CWRAP, AE_OVERFLOW, AE_SAR}

or

• user_register FUSIONMISC {AE_CWRAP, AE_OVERFLOW, AE_SAR}

or

• user_register FUSIONMISC {NORMALIZE_PREV, NORMALIZE_CUR,
NORM_MASK, NORM_CONST, AE_BITHEAD, AE_BITPTR, AE_BITSUSED,
AE_CWRAP, AE_OVERFLOW, AE_SAR}

or

• user_register FUSIONMISC {NORMALIZE_PREV, NORMALIZE_CUR,
NORM_MASK, NORM_CONST, AE_CWRAP, AE_OVERFLOW, AE_SAR}

• user_register DBITS_H {DECISION_BITS_H} (Viterbi Option only)

• user_register DBITS_L {DECISION_BITS_L} (Viterbi Option only)

• user_register BMETRS {BMETRICS} (Viterbi Option only)

With the floating point option, the following user register is used to control and detect
rounding and exception behavior. See Chapter 4 of the Xtensa Instruction Set Architecture
(ISA) Reference Manual for more details about rounding and exception behavior.

user_register FCR_FSR
{RoundMode,InvalidFlag,DivZeroFlag,OverflowFlag,UnderflowFlag,InexactFlag}

In addition to specialized instruction sequences used to save and restore entire user registers
efficiently from memory, instructions are provided to read and write individual state registers.
Both types are listed in Table 2-6.

Table 2-6 State Register Access Instructions

Instruction                  Intrinsic                                   Description
RUR.AE_OVERFLOW              RUR_AE_OVERFLOW, RAE_OVERFLOW               Read state register AE_OVERFLOW.
RUR.AE_SAR                   RUR_AE_SAR, RAE_SAR                         Read state register AE_SAR.
RUR.AE_TABLESIZE             RUR_AE_TABLESIZE, RAE_TABLESIZE             Read state register AE_TABLESIZE.
RUR.AE_FIRST_TS              RUR_AE_FIRST_TS, RAE_FIRST_TS               Read state register AE_FIRST_TS.
RUR.AE_BITHEAD               RUR_AE_BITHEAD, RAE_BITHEAD                 Read state register AE_BITHEAD.
RUR.AE_BITSUSED              RUR_AE_BITSUSED, RAE_BITSUSED               Read state register AE_BITSUSED.
RUR.AE_BITPTR                RUR_AE_BITPTR, RAE_BITPTR                   Read state register AE_BITPTR.
RUR.AE_SEARCHDONE            RUR_AE_SEARCHDONE, RAE_SEARCHDONE           Read state register AE_SEARCHDONE.
RUR.AE_NEXTOFFSET            RUR_AE_NEXTOFFSET, RAE_NEXTOFFSET           Read state register AE_NEXTOFFSET.
RUR.AE_CBEGIN0               RUR_AE_CBEGIN0, RAE_CBEGIN0, AE_GETCBEGIN0  Read state register AE_CBEGIN0. AE_GETCBEGIN0 returns a void * value.
RUR.AE_CEND0                 RUR_AE_CEND0, RAE_CEND0, AE_GETCEND0        Read state register AE_CEND0. AE_GETCEND0 returns a void * value.
RUR.AE_CWRAP                 RUR_AE_CWRAP, RAE_CWRAP                     Read state register AE_CWRAP.
RUR.FCR                      RUR_FCR                                     Read register FCR containing state RoundMode.
RUR.FSR                      RUR_FSR                                     Read register FSR corresponding to state registers InvalidFlag, DivZeroFlag, OverflowFlag, and UnderflowFlag.
AE_MOVVCIRC                  AE_MOVVCIRC                                 Copy user register CIRC into a vector register which can be stored to memory.
AE_MOVVFUSIONMISC            AE_MOVVFUSIONMISC                           Copy user register FUSIONMISC into a vector register which can be stored to memory.
AE_MOVVTABLEFIRSTSEARCHNEXT  AE_MOVVTABLEFIRSTSEARCHNEXT                 Copy user register TABLEFIRSTSEARCHNEXT into a vector register which can be stored to memory.
AE_MOVVFCRFSR                AE_MOVVFCRFSR                               Copy user register FCR_FSR into a vector register which can be stored to memory.
WUR.AE_OVERFLOW              WUR_AE_OVERFLOW, WAE_OVERFLOW               Write state register AE_OVERFLOW.
WUR.AE_SAR                   WUR_AE_SAR, WAE_SAR                         Write state register AE_SAR.
WUR.AE_TABLESIZE             WUR_AE_TABLESIZE, WAE_TABLESIZE             Write state register AE_TABLESIZE.
WUR.AE_FIRST_TS              WUR_AE_FIRST_TS, WAE_FIRST_TS               Write state register AE_FIRST_TS.
WUR.AE_BITHEAD               WUR_AE_BITHEAD, WAE_BITHEAD                 Write state register AE_BITHEAD.
WUR.AE_BITSUSED              WUR_AE_BITSUSED, WAE_BITSUSED               Write state register AE_BITSUSED.
WUR.AE_BITPTR                WUR_AE_BITPTR, WAE_BITPTR                   Write state register AE_BITPTR.
WUR.AE_SEARCHDONE            WUR_AE_SEARCHDONE, WAE_SEARCHDONE           Write state register AE_SEARCHDONE.
WUR.AE_NEXTOFFSET            WUR_AE_NEXTOFFSET, WAE_NEXTOFFSET           Write state register AE_NEXTOFFSET.
WUR.AE_CBEGIN0               WUR_AE_CBEGIN0, WAE_CBEGIN0, AE_SETCBEGIN0  Write state register AE_CBEGIN0. AE_SETCBEGIN0 takes a void * value.
WUR.AE_CEND0                 WUR_AE_CEND0, WAE_CEND0, AE_SETCEND0        Write state register AE_CEND0. AE_SETCEND0 takes a void * value.
WUR.AE_CWRAP                 WUR_AE_CWRAP, WAE_CWRAP                     Write state register AE_CWRAP.
WUR.FCR                      WUR_FCR                                     Write register FCR containing state RoundMode.
WUR.FSR                      WUR_FSR                                     Write register FSR corresponding to state registers InvalidFlag, DivZeroFlag, OverflowFlag, and UnderflowFlag.
AE_MOVCIRCV                  AE_MOVCIRCV                                 Set user register CIRC from a vector register which can be loaded from memory.
AE_MOVFUSIONMISCV            AE_MOVFUSIONMISCV                           Set user register FUSIONMISC from a vector register that can be loaded from memory.
AE_MOVTABLEFIRSTSEARCHNEXTV  AE_MOVTABLEFIRSTSEARCHNEXTV                 Set user register TABLEFIRSTSEARCHNEXT from a vector register which can be loaded from memory.
AE_MOVFCRFSRV                AE_MOVFCRFSRV                               Set user register FCR_FSR from a vector register which can be loaded from memory.
AE_MOVDBITSV.H               AE_MOVDBITSV.H                              Set user register DBITS_H from a vector register that can be loaded from memory. Viterbi Option only.
AE_MOVDBITSV.L               AE_MOVDBITSV.L                              Set user register DBITS_L from a vector register that can be loaded from memory. Viterbi Option only.
AE_MOVBMETRICSV              AE_MOVBMETRICSV                             Set user register BMETRS from a vector register that can be loaded from memory. Viterbi Option only.
AE_MOVVBMETRICS              AE_MOVVBMETRICS                             Copy user register BMETRS into a vector register that can be stored to memory. Viterbi Option only.

In the operation descriptions in Sections 2.4 through 2.20, each mnemonic is listed with
assembly syntax showing placeholders (templates) for its operands. The register files of the
operands are implied by the placeholders, as in Table 2-7.

Table 2-7 Operand Register Types

Placeholder                   Register file  Legal values            Example
A, ah, al, a0, a1, ax         AR             a0 – a15                a3
q, q0, q1, d, d0, d1, dh, dl  AE_DR          aed0 – aed11            aed2
B                             BR             b0 – b15                b3
Bhl                           BR2            b0 – b14 (even)         b0
b3210                         BR4            b0-b16 (multiple of 4)  b0
U                             AE_VALIGN      u0-u3                   u0

Table 2-8 Operand Immediate Types

Placeholder Value Range Stride


i16 -16..14 2
i16pos 0..14 2
i32 -32..28 4
i32pos 0..28 4
i64 -64..56 8
i64pos 0..56 8
i Operation-dependent 1

Each operation description is annotated with the name(s) of the slot(s) where that operation
can be issued. Each operation description is also annotated with the C syntax showing the
intrinsic name and prototype for the operation. A discussion of using C data types and
intrinsics to program the Fusion DSP is included in Chapter 3.

2.1 Instruction Naming Conventions


All base Fusion DSP operation mnemonics, as shown in Table 2-9, begin with the string AE_
to avoid colliding with any other space of names. The optional floating point instructions use
the standard Xtensa floating point names, which have no prefix and instead consist of the
operation name followed by a .S suffix. Note that, as with other core Xtensa operations,
the intrinsic names are prefixed with XT_ even though the instructions are not.

Following the AE_ prefix, each mnemonic has a string of one or more characters signifying
the type of operation such as load, shift, add, etc. For example, AE_L is the prefix denoting
Fusion DSP loads.

The remaining portion of each operation mnemonic typically includes reminders of various
aspects of the operation’s details. Multiplies and loads and stores have more regular naming
conventions that are described in their respective sections.

Table 2-9 Operation Mnemonics

Mnemonic Meaning
ASYM Denotes asymmetric rounding (e.g., AE_ROUND32X2F64SASYM)
F Denotes fractional arithmetic (e.g., AE_MULZAAFD24.HH.LL) or the value
False in a conditional move (e.g., AE_MOVF64).
H and L Combinations of H and L are used to refer to halves of registers (e.g.,
AE_MULZAAFD24.HH.LL).
0,1,2,3 Combinations of 0,1,2 and 3 are used to refer to quarters of registers
(e.g. AE_MULF32X16.L0)
I Denotes use of an immediate operand (e.g., AE_SRAIP32)
S Denotes saturating arithmetic (e.g., AE_MULF32S.LL) or the use of the
AE_SAR state register as a shift amount (e.g., AE_SRASP32), depending
on context
SYM Denotes symmetric rounding (e.g., AE_ROUND32X2F64SSYM)
T Denotes the value True in a conditional move (e.g., AE_MOVT64)
U Denotes unsigned arithmetic (e.g., AE_MULS32U.LL)
X Denotes use of an index register in an address computation (e.g.,
AE_L64.XP)
X2 Denotes a two-way SIMD operation in contexts (e.g., AE_L32X2.I) where
scalar operations are also available
X4 Denotes a four-way SIMD operation (e.g., AE_L16X4.XC)

2.2 Fixed-point Values and Fixed-point Arithmetic

The Fusion DSP contains instructions for implementing fixed-point arithmetic. This section
describes the representation and interpretation of fixed-point values, as well as some
operations on fixed-point values.

2.2.1 Representation of Fixed-point Values


A fixed-point data type m.n contains m bits to the left of the decimal point (a sign bit plus
m-1 integer bits) and n bits to the right of the decimal point. When expressed as a binary
value and stored into a register file, the least significant n bits are the fractional part, and
the most significant m bits are the integer part expressed as a signed 2’s complement
number. If the binary value is interpreted as a 2’s complement signed integer, converting
from the binary value to a fixed-point number requires dividing the integer by 2^n.

Thus, for example, the 24-bit 1.23 number 0.5 is represented as 0x400000.

Signed Integer (1 bit)   Fractional (23 bits)
0                        100 0000 0000 0000 0000 0000
0x0                      0x40 0000

and the 64-bit 17.47 number -1.5 is represented as 0xffff 4000 0000 0000 (that is, -2 + 0.5).

Signed Integer (17 bits)   Fractional (47 bits)
1 1111 1111 1111 1110      100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0x1fffe                    0x4000 0000 0000
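
The conversion rule (divide the raw integer by 2^n) can be checked with a few lines of C. The helper names are hypothetical, and the sketch is meant for host-side experiments: it goes through double, so raw values that need more than 53 significant bits are not converted exactly, and out-of-range inputs are not handled.

```c
#include <assert.h>
#include <stdint.h>

/* 2^n as a double, computed without the math library. */
static double pow2(int n)
{
    double p = 1.0;
    assert(n >= 0 && n <= 63);
    while (n-- > 0)
        p *= 2.0;
    return p;
}

/* Interpret a raw 2's complement integer as an m.n fixed-point value. */
double fixed_to_double(int64_t raw, int n)
{
    return (double)raw / pow2(n);
}

/* Convert a real value to the raw integer of an m.n type, rounding to
   nearest with ties away from zero. */
int64_t double_to_fixed(double x, int n)
{
    double scaled = x * pow2(n);
    return (int64_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}
```

With n = 23 the raw value 0x400000 converts to 0.5, and with n = 47 the value -1.5 converts to the 64-bit pattern 0xffff400000000000, matching the two layouts above.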

Fusion DSP fractional instructions use fractional operations on 1.15, 1.23, 9.23, 1.31, 17.47
and 1.63, described in more detail as follows.

• 1.15 16-bit fixed-point data type with 1 sign bit and 15 bits to the right of the
decimal. The largest positive value 0x7fff is interpreted as 1.0 – 2^-15. The smallest
negative value 0x8000 is interpreted as -1.0. The value 0 is interpreted as 0.0.

• 9.23 32-bit fixed-point data type with a 9-bit integer and 23 bits to the right of the
decimal. The largest positive value 0x7fffffff is interpreted as 256.0 – 2^-23. The
smallest negative value 0x80000000 is interpreted as -256.0. The value 0 is
interpreted as 0.0.

• 1.23 24-bit fixed-point data type with 1 sign bit and 23 bits to the right of the
decimal. The largest positive value 0x7fffff is interpreted as 1.0 – 2^-23. The smallest
negative value 0x800000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
Since register halves hold 32 bits, not 24 bits, typical 24-bit fractional variables are
9.23. However, 24-bit fixed-point multiply instructions ignore the upper 8 bits,
thereby treating them as 1.23.

• 1.31 32-bit fixed-point data type with 1 sign bit and 31 bits to the right of the
decimal. The largest positive value 0x7fffffff is interpreted as 1.0 – 2^-31. The smallest
negative value 0x80000000 is interpreted as -1.0. The value 0 is interpreted as 0.0.

• 17.47 64-bit fixed-point data type with a 17-bit integer and 47 bits to the right of the
decimal. The largest positive value 0x7fff ffff ffff ffff is interpreted as 65536.0 – 2^-47.
The smallest negative value 0x8000 0000 0000 0000 is interpreted as -65536.0. The
value 0 is interpreted as 0.0.

• 1.63 64-bit fixed-point data type with 1 sign bit and 63 bits to the right of the
decimal. The largest positive value 0x7fff ffff ffff ffff is interpreted as 1.0 – 2^-63. The
smallest negative value 0x8000 0000 0000 0000 is interpreted as -1.0. The value 0
is interpreted as 0.0.

2.2.2 Arithmetic with Fixed-point Values


When multiplying fixed-point numbers m0.n0 * m1.n1 with a standard signed integer
multiplier, the natural result of the multiply is an m.n data type where n = n0+n1 and
m = m0+m1. Thus, for example, multiplying a 1.23 typed variable by a 1.23 typed variable
generates a 2.46 typed variable. Because Fusion DSP supports the 17.47 data type, the
fixed-point multiply instructions shift the 2.46 result to the left by one bit and then sign
extend it by 15 bits. In general, high-precision fixed-point multiplications shift their results to
the left by one bit.
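
The following C model sketches the 1.23 by 1.23 case just described. The function names are hypothetical and the model covers only the arithmetic: the upper 8 bits of each 32-bit operand are ignored, the 2.46 product is doubled (a one-bit left shift), and holding the result in a 64-bit integer provides the sign extension to 17.47.

```c
#include <assert.h>
#include <stdint.h>

/* Keep bits 23..0 of x and sign extend from bit 23. */
static int32_t sext24(int32_t x)
{
    uint32_t low = (uint32_t)x & 0xFFFFFFu;
    return (int32_t)(low ^ 0x800000u) - 0x800000;
}

/* 1.23 x 1.23 fractional multiply producing a 17.47 result. */
int64_t mulf24_model(int32_t a, int32_t b)
{
    int64_t p = (int64_t)sext24(a) * sext24(b);      /* 2.46 product */
    assert(p >= -((int64_t)1 << 46) && p <= ((int64_t)1 << 46));
    return p * 2;                                    /* shift left by one bit */
}
```

For example, 0.5 times 0.5 (0x400000 times 0x400000) gives the raw value 2^45, which is 0.25 in 17.47, and -1.0 times -1.0 gives 2^47, which is +1.0 and is representable only because of the extra integer bits.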

Fusion DSP contains both saturating and non-saturating instructions. Overflowing the
supplied guard bits with a non-saturating instruction is a program error that will cause the
result to wrap around. For saturating operations, the processor will also set the overflow state
which can later be checked programmatically. In the instruction descriptions that follow,
whether an operation saturates is explicitly stated.
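
The difference between the two behaviors can be modeled with a 32-bit add. This is an illustrative sketch with hypothetical function names, not the definition of any particular instruction; the overflow argument stands in for the AE_OVERFLOW state, which stays set until software clears it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating add: clamp to the 32-bit range and set the sticky flag. */
int32_t add32_sat_model(int32_t a, int32_t b, int *overflow)
{
    int64_t wide = (int64_t)a + b;
    assert(overflow != NULL);
    if (wide > INT32_MAX) { *overflow = 1; return INT32_MAX; }
    if (wide < INT32_MIN) { *overflow = 1; return INT32_MIN; }
    return (int32_t)wide;        /* in range: flag is left untouched */
}

/* Non-saturating add: the result wraps around modulo 2^32. */
int32_t add32_wrap_model(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}
```

Because the flag is sticky, a whole block of saturating operations can run before the flag is checked once at the end.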

2.2.3 Other Fixed-point Representations


Programmers are free to use fixed-point representations other than the ones listed above.
Most Fusion DSP operations are independent of fixed-point representation; e.g., a fixed-point
add is equivalent to an integer one. Even for multiplies, the multiply instructions are
compatible with any representations that expect the result to be shifted left by one bit. So, if
the input data is actually a 2.22 data type rather than a 1.23 data type, the 24-bit fixed-point
multiply instructions will correctly produce a 19.45 typed variable. The programmer is simply
responsible for knowing what type of data is in what variables, and if manual conversions are
needed, the programmer can always use shift instructions.

2.3 VLIW Slots and Formats


Fusion DSP can issue up to two operations in a single 40- or 48-bit instruction bundle using
Xtensa LX FLIX (VLIW) technology. Fusion DSP supports up to four different formats:
fusion_format48, fusion_format40, fusion_format_fir, and fusion_format40_3. If you select the
Viterbi Decoder or Soft-bit Demapping options, an additional 64-bit FLIX format,
fusion_format64, is created. Every instruction belongs to one format, but different formats
may pack different numbers of operations in a single instruction. The basic structure of the
formats is shown in Table 2-10. For each of the formats, the port usage for the various
register files is summarized in Tables 2-11 to 2-15.

Table 2-10: VLIW Slotting

Formats Slot 0 Slot 1


fusion_format48: Load/store DSP and FP multiply
{fusion_slot0,fusion_slot1} DSP shifts DSP and FP ALU
Core ops Core ops
Miscellaneous Core MUL32 ops
fusion_format_fir: DSP loads HiFi 3 FIR
{fusion_slot_fir_0,fusion_slot_fir_1}
fusion_format40_3: FFT stores FFT scaling add and
{fusion_slot40_0,fusion_slot40_1} sub
FFT bit-reversal
fusion_format40: {fusion_slot40} Wide branches NA
DSP truncation
AES ops
Depbits
fusion_format64: Load/store Viterbi trellis butterfly
{fusion_slot64_0,fusion_slot64_1}
Viterbi branchmetrics Viterbi traceback
(Viterbi and Soft-bit Demap options only)
Viterbi state store Soft-bit demap

Table 2-11: Port Usage in Format fusion_format48

Register File Slot 0 read/write Slot 1 read/write


AE_DR 1/1 3/1
AR 2/1 2/1
AE_VALIGN 1/1
BR 2/0 1/1

Table 2-12: Port Usage in Format fusion_format40

Register File Slot 0 read/write


AE_DR 2/2
AR 2/1
AE_VALIGN
BR

Table 2-13: Port Usage in Format fusion_format40_3 (only applicable with 16-bit Quad MAC option)

Register File Slot 0 read/write Slot 1 read/write


AE_DR 1/0 2/2
AR 2/1
AE_VALIGN
BR

Table 2-14: Port Usage in Format fusion_format_fir (only applicable with AVS option)

Register File Slot 0 read/write Slot 1 read/write


AE_DR 0/1 4/1
AR 2/1
AE_VALIGN 1/1
BR

Table 2-15: Port Usage in Format fusion_format64 (only applicable with Viterbi Decoder or Soft-bit Demap option)

Register File Slot 0 read/write Slot 1 read/write


AE_DR 1/1 3/2
AR 1/1 2/1
AE_VALIGN
BR

Formats fusion_format48, fusion_format_fir and fusion_format40_3 all support two parallel
operations. Format fusion_format40 is a single-slot format that allows for individual
operations with more operands or larger immediates than can be used in the two-slot formats.

Format fusion_format_fir is a specialized format tied to the AVS option, used for emulating
HiFi 3 FIR operations that require too many operands to issue in parallel with stores. Format
fusion_format40_3 is a specialized format tied to the 16-bit Quad MAC option. It is used in
FFTs to allow specialized add and subtract operations to issue in parallel with stores.

For the fusion_format48 format, the first slot contains all of the Fusion DSP load/store
instructions and some miscellaneous operations. The second slot contains all of the regular
multiply and DSP ALU operations. A subset of the core Xtensa operations is also available
in both slots, allowing some parallelism with core Xtensa operations.

The optional Viterbi decoder and soft-bit demapping options add an additional two-slot
fusion_format64 format. The first slot contains a subset of DSP load/store instructions (those
that use immediate offset) along with instructions for the Viterbi decoding option to compute
the branchmetrics and store the states. The second slot contains instructions for Viterbi trellis
radix-4 butterfly computations, Viterbi traceback, and for 4/16/64/256-QAM soft-bit
demapping.


A subset of the operations as well as all the bit-stream operations are available in a
single-issue, 24-bit format called Inst. The compiler will automatically use the 24-bit format when it
is not possible (or beneficial) to bundle a relevant operation together with an operation that
can go in another slot.

For the optional floating point unit, most floating point operations are available in the second
slot, fusion_slot1, of the fusion_format48, allowing the machine to issue, for example, one
two-way SIMD floating point load in parallel with one scalar multiply-accumulation operation.

Understanding the slotting is important when optimizing code for Fusion DSP. Often a loop
is limited by operations that can only go in one slot or another. For example, it is never
possible to issue more than one (possibly SIMD) load or store per cycle. If a loop is limited
by the operations in one slot, there is no point in trying to optimize the operations in another
slot.

All Fusion DSP core instructions available in the Inst slot share (but do not overlap) opcode
space with the MAC16 option. Note however, that we discourage selecting the MAC16 option
with the Fusion DSP core. All Fusion DSP floating point operations available in the Inst slot
share (overlap) opcode space with core floating point instructions; thus it is not possible to
turn on the core Single/Double Precision FP when the Fusion FP option is selected. In
addition, the Viterbi option on the Fusion DSP adds an instruction to the Inst slot,
AE_MOVSANORM, whose opcode overlaps the CUST0 opcode normally reserved for
customer-added operations. However, the CUST1 opcode is still available for customer-
added operations, and other customer-added operations are possible to encode using TIE
Compiler features. For more information on CUST0 and CUST1, refer to the Xtensa
Instruction Set Architecture (ISA) Reference Manual.

A summary table describing the instruction width required for each option is provided in
Appendix B. The available slotting for each operation is listed next to the operation
descriptions in the remainder of this chapter.

2.4 Load and Store Operations


Fusion DSP supports loading and storing scalars or vectors of 16, 24, 32 and 64 bits. Each
scalar load/store accesses 16, 24, 32 or 64 bits. Each vector load/store accesses 64 bits, or
48 bits for packed 24-bit data. For vector loads and stores, the high address in memory is always stored
in the least significant bits in the register. This enables the same source code to work on all
configurations, including big-endian HiFi 2 cores. Reverse vector loads and stores reverse
the elements in a register so that the low address in memory is stored in the least significant
bits in the register. This way, whether accessing data in a stride one or stride negative one
fashion, the earliest data to be accessed is always in the same position in the register.
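
The element ordering described above can be illustrated with a small host-side model in portable C. This is only a sketch of the register layout for a pair of 32-bit elements, not Fusion DSP intrinsic code; the function names model_load32x2 and model_load32x2_reverse are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a forward 32x2 vector load: the element at the higher address
 * (p[1]) lands in the least significant 32 bits of the 64-bit register. */
uint64_t model_load32x2(const int32_t *p)
{
    assert(p != 0);
    return ((uint64_t)(uint32_t)p[0] << 32) | (uint32_t)p[1];
}

/* Model of a reverse 32x2 vector load: the two elements are swapped, so the
 * element at the lower address (p[0]) lands in the least significant 32 bits. */
uint64_t model_load32x2_reverse(const int32_t *p)
{
    assert(p != 0);
    return ((uint64_t)(uint32_t)p[1] << 32) | (uint32_t)p[0];
}
```

In both directions the first element the stream reaches (p[0] when walking up, p[1] when walking down) ends up in the most significant half of the modeled register.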

Special support is provided for retaining full throughput when vectors of data are not aligned
to 64-bits. Fusion DSP also supports a single circular buffer that can be used with either
aligned or unaligned data.


2.4.1 Aligning Loads and Stores


Fusion DSP has support for loading or storing vector streams of data 64-bits at a time even
if the data is not aligned to 64-bits. Note that while the vector variables need not be aligned
to 64-bits, they must still be aligned according to the requirements of each scalar element,
i.e., 32-bits for vectors of ints.

Such loads and stores are called aligning loads and stores. Support is available for 16, 24
and 32-bit data. The aligning vector load and store instructions use the Fusion DSP alignment
register file to provide a throughput of one aligning load or store operation per instruction.

A special priming instruction, AE_LA64.PP, is used to begin the process of loading an array
of unaligned data. This instruction loads the alignment register with data from the start of the
stream. The subsequent aligning load instruction loads from the next location in memory,
merging it with the data already in the alignment register. The exact details of how the
aligning instructions work are not relevant to the programmer. Simply invoke the
AE_LA64_PP priming intrinsic with the first address (aligned or not) to be loaded and
continue loading with the appropriate aligning loads to achieve a subsequent throughput of
one aligning load per instruction.
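
Although the programmer does not need the details, a simplified host-side model may help build intuition for why one memory access per aligning load is enough. The sketch below is portable C, not intrinsic code; it treats memory as a byte array whose index is the address, and the names valign_model, model_prime and model_aligning_load are hypothetical. The actual hardware behavior may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an alignment register: one 64-bit envelope. */
typedef struct { uint8_t bytes[8]; } valign_model;

/* Priming: load the aligned 64-bit envelope that contains addr. */
valign_model model_prime(const uint8_t *mem, size_t addr)
{
    valign_model u;
    assert(mem != 0);
    memcpy(u.bytes, mem + (addr & ~(size_t)7), 8);
    return u;
}

/* Aligning load: produce the 8 bytes starting at *addr with a single
 * aligned memory access, then advance *addr by 8. */
void model_aligning_load(const uint8_t *mem, size_t *addr,
                         valign_model *u, uint8_t out[8])
{
    size_t off = *addr & 7;
    assert(mem != 0 && addr != 0 && u != 0);
    if (off == 0) {
        /* Aligned stream: the data is a single envelope. */
        memcpy(out, mem + *addr, 8);
    } else {
        /* Unaligned stream: merge the tail of the saved envelope with the
         * head of the next one, and keep the next envelope for later. */
        size_t next = (*addr & ~(size_t)7) + 8;
        memcpy(out, u->bytes + off, 8 - off);
        memcpy(out + (8 - off), mem + next, off);
        memcpy(u->bytes, mem + next, 8);
    }
    *addr += 8;
}
```

The same two calls work whether or not the starting address is aligned, which mirrors the usage pattern of the priming and aligning intrinsics.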

The design of the priming load and aligning load instructions is such that they can be used
in situations where the alignment of the address is unknown. The load sequence works
whether the starting address is aligned or not. Consider a simple example that adds up the
32-bit elements in an array.

void add(int * a, int n)
{
    ae_int32x2 *ap=(ae_int32x2 *) &a[0];
    ae_int32x2 tmp;
    ae_int32x2 V = (ae_int32x2)(0); // running sums, one per 32-bit lane
    ae_valign align;
    int i;

    align = AE_LA64_PP(ap); // prime the stream

    for(i = 0; i < n; i = i + 2) // n is assumed to be even
    {
        AE_LA32X2_IP(tmp,align,ap); // load the next two elements
        V = V + tmp;
    }
}

Similarly, when accessing the data stride negative one, prime the stream by passing in the
address of the first scalar element to be loaded (a[n-1]), as follows.

void add(int * a, int n)
{
    ae_int32x2 *ap=(ae_int32x2 *) &a[n-1];
    ae_int32x2 tmp;
    ae_int32x2 V = (ae_int32x2)(0); // running sums, one per 32-bit lane
    ae_valign align;
    int i;

    align = AE_LA64_PP(ap); // prime the stream

    for(i = 0; i < n; i = i + 2) // n is assumed to be even
    {
        AE_LA32X2_RIP(tmp,align,ap); // load the next two elements
        V = V + tmp;
    }
}

Note that in the negative stride case, the start of the stream is handled differently in the
aligned versus the non-aligned case. With aligned loads, one passes in the address of
a[n-2] because that is the address of the first 64-bit word being loaded. With aligning
loads, one passes in the address of the first 32-bit scalar being loaded, a[n-1], because
the priming load loads from memory the aligned 64-bit envelope containing its argument and
a[n-2] might not be in the same 64-bit envelope as a[n-1].
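
The envelope argument can be checked with a few lines of portable C. This is a host-side sketch that works on plain 32-bit address values; the helper names and the example addresses are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the aligned 64-bit envelope that contains addr. */
uint32_t model_envelope(uint32_t addr)
{
    return addr & 0xFFFFFFF8u;
}

/* Address of a[n-1] for an array of 32-bit ints starting at base. */
uint32_t model_last_element(uint32_t base, uint32_t n)
{
    assert(n > 0);
    return base + 4u * (n - 1u);
}

/* Nonzero when two addresses fall in the same 64-bit envelope. */
int model_same_envelope(uint32_t a, uint32_t b)
{
    return model_envelope(a) == model_envelope(b);
}
```

For an array that starts at the 4-byte-aligned address 0x1004 with n = 2, a[1] sits at 0x1008 and a[0] at 0x1004, so the two elements are in different envelopes. Priming with the address of a[n-2] would then fetch an envelope that does not contain a[n-1].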

Fusion DSP supports storing 24-bit data in a packed format that requires only 24 bits per
data element. Using this load/store feature can potentially save 25% of the memory required
for a 24-bit variable and has the added benefit of reducing the number of memory
transactions, thereby reducing memory power and improving performance. Support for this
packed data is implemented using the alignment mechanism. In the examples above, simply
use AE_LA24X2 intrinsics instead of AE_LA32X2, as shown below. Note that we have used
char * for the pointer type. While not strictly necessary, it is helpful to indicate that the
packed stream is an unaligned byte stream.

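void add(int * a, int n)
{
    char *ap=(char *) &a[0];
    ae_int24x2 tmp;
    ae_int24x2 V = AE_ZERO24(); // running sums (assumes the AE_ZERO24 intrinsic is available)
    ae_valign align;
    int i;

    align = AE_LA64_PP(ap); // prime the stream

    for(i = 0; i < n; i = i + 2) // n is assumed to be even
    {
        AE_LA24X2_IP(tmp,align,ap); // load the next two elements
        V = V + tmp;
    }
}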

For packed data, even scalar streams are unaligned so support is also available for AE_LA24
intrinsics. Because the memory format for packed data is different, packed data can only be
used in cases where all loads and stores of a stream are done using the packing loads and
stores. While the packing loads and stores can be used on any 24-bit variable, since a
priming load and a finalizing store are required for every stream, it is often only efficient to use
them on stride one or stride negative one streams. Similarly, since there are only four
alignment registers, it is only efficient to use them on loops that have at most four streams.
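
The 25% saving follows from storing three bytes per element instead of four. The following host-side sketch in portable C packs and unpacks 24-bit values to make that concrete. The function names are hypothetical, and the little-endian byte order within each packed element is an assumption of this sketch, not a statement about the hardware memory format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack n 24-bit values (held sign-extended in int32_t) into 3 bytes each.
 * Returns the number of bytes written. Byte order is assumed little-endian. */
size_t model_pack24(const int32_t *src, size_t n, uint8_t *dst)
{
    assert(src != 0 && dst != 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t v = (uint32_t)src[i];
        dst[3 * i + 0] = (uint8_t)(v & 0xFFu);
        dst[3 * i + 1] = (uint8_t)((v >> 8) & 0xFFu);
        dst[3 * i + 2] = (uint8_t)((v >> 16) & 0xFFu);
    }
    return 3 * n;
}

/* Read packed element i and sign-extend it from bit 23 to 32 bits. */
int32_t model_unpack24(const uint8_t *src, size_t i)
{
    uint32_t v = (uint32_t)src[3 * i]
               | ((uint32_t)src[3 * i + 1] << 8)
               | ((uint32_t)src[3 * i + 2] << 16);
    return (v & 0x800000u) ? (int32_t)v - 0x1000000 : (int32_t)v;
}
```

Four elements occupy 12 bytes when packed, compared with 16 bytes when each is stored in a 32-bit word. Packed elements are not individually aligned, which is why the packed stream must be accessed through the aligning loads and stores.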


Aligning stores operate in a slightly different manner. Before starting a stream, the alignment
variable needs to be zeroed using the AE_ZALIGN64() intrinsic. On an unaligned store,
each aligning store instruction merges some of the data with data already in the alignment
register and writes the result to memory. The remaining data is written into the alignment
register for use in the next aligning store. If the data happens to be aligned, each aligning
store simply writes its data to memory. After completing the stream, the user must finalize
the stream using a finalization instruction. If the data happens to be unaligned, that
finalization instruction writes out the remaining data from the alignment register. The
finalization instruction also zeroes the alignment register so that a follow-on stream can skip
the use of the AE_ZALIGN64() intrinsic. Following is a simple example that zeroes an n
element array of ints named a.

ae_int32x2 V_con = (ae_int32x2)(0);
ae_int32x2 *addr = (ae_int32x2 *) a;
ae_valign align = AE_ZALIGN64(); // zero alignment reg
for(i = 0; i < n; i = i + 2) // n is assumed to be even
{
    AE_SA32X2_IP(V_con, align, addr); // store two elements
}
AE_SA64POS_FP(align, addr); // finalize the stream

Negative strided streams work analogously to the case of loads, with the use of RIP
intrinsics. Note that there are separate flush instructions for the positive stride and negative
stride streams.
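
The zero, store, finalize sequence for aligning stores can be modeled on a host in portable C. The sketch below is a simplification under stated assumptions: memory is a byte array indexed by address, the stream has a positive stride, and the names salign_model, model_zalign, model_aligning_store and model_finalize are hypothetical. It is meant to show why the finalization step is needed, not to describe the hardware exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an alignment register used for stores. */
typedef struct {
    uint8_t pending[8]; /* bytes waiting to be written */
    size_t  count;      /* number of valid pending bytes */
} salign_model;

/* Zero the alignment state before starting a stream. */
void model_zalign(salign_model *u) { u->count = 0; }

/* Aligning store of 8 bytes at *addr, then advance *addr by 8. */
void model_aligning_store(uint8_t *mem, size_t *addr, salign_model *u,
                          const uint8_t data[8])
{
    size_t off = *addr & 7;
    size_t base = *addr - off;
    if (off == 0) {
        assert(u->count == 0);       /* aligned stream: nothing pending */
        memcpy(mem + *addr, data, 8);
    } else {
        if (u->count != 0)           /* head of envelope from last store */
            memcpy(mem + base, u->pending, off);
        memcpy(mem + base + off, data, 8 - off);
        memcpy(u->pending, data + (8 - off), off); /* keep the remainder */
        u->count = off;
    }
    *addr += 8;
}

/* Finalize: write out any remaining bytes and zero the alignment state. */
void model_finalize(uint8_t *mem, size_t addr, salign_model *u)
{
    if (u->count != 0)
        memcpy(mem + (addr & ~(size_t)7), u->pending, u->count);
    u->count = 0;
}
```

Without the call to model_finalize, the last few bytes of an unaligned stream would remain in the modeled alignment register and never reach memory.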

2.4.2 Circular Buffer


Fusion DSP has support for a single circular buffer which can be accessed in either the
forward or the backward direction.

The circular buffer boundaries are specified through two 32-bit states, as in Table 2-16.

Table 2-16 Circular Buffer States

State         Description
AE_CBEGIN0    The start address of the circular buffer.
AE_CEND0      The end address of the circular buffer, i.e., the start address plus the byte
              size of the buffer.

The following intrinsic functions may be used to read from the circular buffer states in C:

void * AE_GETCBEGIN0 (void);
void * AE_GETCEND0 (void);

The following intrinsic functions may be used to write to the circular buffer states in C:

void AE_SETCBEGIN0 (const void * addr);
void AE_SETCEND0 (const void * addr);


All circular buffer operations follow a “post-increment” convention; that is, in every case the
effective address is the base address while the updated base address is formed by adding
the register offset to the base address with circular wrap-around.

The address increment is specified in terms of number of bytes and must be less than or
equal to the buffer byte size. The increment can be either positive (wrap-around at the end
of the buffer), or negative (wrap-around at the beginning of the buffer).
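
The wrap-around rule for the updated base address can be sketched as a host-side model in portable C. The function name model_circ_update is hypothetical, and the sketch assumes AE_CBEGIN0 is smaller than AE_CEND0 and that the magnitude of the increment does not exceed the buffer size. The comparisons follow the description of the XC suffix in Table 2-18.

```c
#include <assert.h>
#include <stdint.h>

/* Post-increment update of a circular buffer address.
 * addr   : current base address (also the effective address)
 * inc    : signed byte increment
 * cbegin : modeled AE_CBEGIN0, cend : modeled AE_CEND0 */
uint32_t model_circ_update(uint32_t addr, int32_t inc,
                           uint32_t cbegin, uint32_t cend)
{
    uint32_t size, mag, updated;
    assert(cbegin < cend);                 /* assumption of this sketch */
    size = cend - cbegin;
    mag = inc >= 0 ? (uint32_t)inc : 0u - (uint32_t)inc;
    assert(mag <= size);
    updated = addr + (uint32_t)inc;
    if (inc >= 0) {
        if (addr < cend && updated >= cend)
            updated -= size;               /* wrap at the end */
    } else {
        if (addr >= cbegin && updated < cbegin)
            updated += size;               /* wrap at the beginning */
    }
    return updated;
}
```

With a 32-byte buffer from 0x1000 to 0x1020, stepping forward by 8 from 0x1018 wraps to 0x1000, and stepping backward by 8 from 0x1000 wraps to 0x1018.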

Both aligned and unaligned accesses are supported. However, for unaligned accesses,
AE_CBEGIN0 and AE_CEND0 must be aligned to 64 bits. For aligned accesses,
AE_CBEGIN0 and AE_CEND0 must be aligned to the size of the data being loaded or
stored. Unaligned accesses use the alignment mechanism described in Section 2.4.1.
Priming loads use the PC suffix with separate instructions for positive and negative stride.
For unaligned references, only stride one and stride negative one are supported. Packed 24-
bit loads are supported.

AE_CBEGIN0 need not be smaller than AE_CEND0. If an instruction accesses data past
the AE_CEND0 boundary, data will continue to be accessed at AE_CBEGIN0 regardless of
whether it is before or after AE_CEND0.

Circular buffer support is available for DSP loads and stores to the AE_DR register file as
well as bit-stream loads and stores to the AR register file.

Following is an example C code snippet demonstrating how to initialize and use the circular
buffer. The buffer is used to store 24-bit vector data in the 24 MSBs of each 32-bit word with
a negative stride starting from the last element of the buffer.

/* Allocate the buffer. */
char *buf = (char *) malloc(buf_size);

/* Initialize the circular buffer boundaries. */
AE_SETCBEGIN0(buf);
AE_SETCEND0(buf + buf_size);

/* Point to the first element to be loaded/stored. */
ae_f24x2 *buf_ptr = (ae_f24x2 *)(buf + buf_size - sizeof(ae_f24x2));

for (…) {
    ae_f24x2 p;

    /* Store p, then step back one vector with circular wrap-around. */
    AE_S32X2F24_XC(p, buf_ptr, -(int)sizeof(ae_f24x2));

}


2.4.3 Load and Store Naming Scheme


The mnemonic of most load and store operations contains a size indicating what size
operands it will load or store. The sizes are listed in Table 2-17.

Table 2-17 Load/Store Operation Sizes

Size      Definition                                Description

16        16-bit scalar                             This operation accesses an aligned 16-bit quantity.
24        24-bit scalar                             This operation accesses a 24-bit quantity that is packed into
                                                    memory so as to occupy only 24 bits in memory.
32        32-bit scalar                             This operation accesses an aligned 32-bit quantity. This size
                                                    is also used for legacy 24-bit integers which are stored in a
                                                    32-bit memory location right-justified and with 8 bits of
                                                    sign extension.
32F24     Left-justified 24-bit fraction            This operation accesses a 24-bit fraction which is stored
                                                    left-justified in a 32-bit memory location. It shifts the
                                                    value right by 8 bits and sign extends on the left by 8 bits.
                                                    The address must be 32-bit aligned.
64        64-bit scalar                             This operation accesses an aligned 64-bit quantity.
24X2      Vector of 24-bit                          This operation accesses two of the size “24” above, occupying
                                                    48 bits in memory.
32X2      Vector of 32-bit                          This operation accesses two of the size “32” above. Some
                                                    instructions need the pair to be 64-bit aligned while others
                                                    do not.
32X2F24   Vector of left-justified 24-bit fraction  This operation accesses two of the size “32F24” above. Some
                                                    instructions need the pair to be 64-bit aligned while others
                                                    do not.
16X4      Vector of 16-bit                          This operation accesses four of the size “16” above. Some
                                                    instructions need the quartet to be 64-bit aligned while
                                                    others do not.
8X4F      Vector of left-justified 8-bit fraction   This operation accesses four of size 8.
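
The 32F24 entry describes a shift right by 8 bits with sign extension. A host-side sketch of that conversion in portable C is shown below; the function name model_load_32f24 is hypothetical, and the sign extension is written out explicitly rather than relying on the behavior of a signed right shift.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 32F24 load: take the 24 most significant bits of a 32-bit
 * memory word, shift them right by 8 and sign-extend from bit 23. */
int32_t model_load_32f24(uint32_t word)
{
    uint32_t v = word >> 8;
    int32_t r = (v & 0x800000u) ? (int32_t)v - 0x1000000 : (int32_t)v;
    assert(r >= -0x800000 && r <= 0x7FFFFF); /* result fits in 24 bits */
    return r;
}
```

For example, the memory word 0x12345678 yields 0x123456, with the low 8 bits of the word discarded.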

The mnemonics of most load and store operations contain a suffix indicating how the
effective address is computed and whether the base address register is updated. The
suffixes are listed in Table 2-18.


Operations with suffix IP, XP, IC, or XC follow a “post-increment” convention where the
effective address is the base AR register, and the base address register is updated by adding
an immediate, constant or register offset. Operations with suffix IU or XU follow a “pre-
increment” convention where the effective address is the result of adding the immediate or
register offset to the base address register’s contents and the base address register is
updated with the effective address. Operations with suffix I or X do not increment but create
an effective address which is the sum of the base address register and an immediate or offset
register.
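
The three conventions differ only in which value is used as the effective address and which value is written back to the base register. A minimal host-side sketch in portable C follows; the type addr_result and the function names are hypothetical, and circular wrap-around is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Result of an address computation: the address used for the access and
 * the value left in the base address register afterwards. */
typedef struct {
    uint32_t effective;
    uint32_t updated_base;
} addr_result;

/* Suffix I or X: offset is applied, base register is not updated. */
addr_result model_no_update(uint32_t base, int32_t offset)
{
    addr_result r = { base + (uint32_t)offset, base };
    return r;
}

/* Suffix IP or XP: post-increment, access uses the old base. */
addr_result model_post_update(uint32_t base, int32_t offset)
{
    addr_result r = { base, base + (uint32_t)offset };
    return r;
}

/* Suffix IU or XU: pre-increment, access uses the updated base. */
addr_result model_pre_update(uint32_t base, int32_t offset)
{
    addr_result r = { base + (uint32_t)offset, base + (uint32_t)offset };
    assert(r.effective == r.updated_base);
    return r;
}
```

Starting from a base of 0x100 with an offset of 8, the no-update form accesses 0x108 and leaves the base at 0x100, the post-increment form accesses 0x100 and leaves 0x108, and the pre-increment form accesses 0x108 and leaves 0x108.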

Table 2-18 Load/Store Operation Suffixes

Each entry lists the suffix and its definition, followed by the effective address, the base
register update, and the description.

I (Immediate)
    Effective address: Reg + Immed
    Base reg update:   [none]
    The effective address is a base AR register plus an immediate value. The base AR
    register is not updated.

X (Indexed)
    Effective address: Reg + Reg
    Base reg update:   [none]
    The effective address is a base AR register plus an index AR register value. The base
    AR register is not updated.

IP (Post Update Immediate)
    Effective address: Reg
    Base reg update:   Reg + Immed
    The effective address is a base AR register. The base AR register is updated with the
    base AR register plus an immediate or constant value.

XP (Post Update Indexed)
    Effective address: Reg
    Base reg update:   Reg + Reg
    The effective address is a base AR register. The base AR register is updated with the
    base AR register plus an offset AR register value.

IC (Post Update Implied Immediate with Circular Buffer)
    Effective address: Reg
    Base reg update:   Reg + Const folded back into circular buffer
    The effective address is a base AR register. The base AR register is updated with the
    base AR register plus a positive constant value equal to one element. If the address is
    less than AE_CEND0 and the updated value is greater than or equal to AE_CEND0, then
    AE_CEND0-AE_CBEGIN0 is subtracted from it.

XC (Post Update Indexed with Circular Buffer)
    Effective address: Reg
    Base reg update:   Reg + Reg folded back into circular buffer
    The effective address is a base AR register. The base AR register is updated with the
    base AR register plus an offset AR register value. For positive updates, if the address
    is less than AE_CEND0 and the updated value is greater than or equal to AE_CEND0, then
    AE_CEND0-AE_CBEGIN0 is subtracted from it. For negative updates, if the address is
    greater than or equal to AE_CBEGIN0 and the updated value is less than AE_CBEGIN0, then
    AE_CEND0-AE_CBEGIN0 is added to it.

RIP (Reverse Post Update)
    Effective address: Reg
    Base reg update:   Reg
    The effective address is a base AR register. The base AR register is updated with the
    base AR register minus the size of the element being loaded or stored. The vector
    elements in the result register are also swapped.

RIC (Reverse Post Update Implied Immediate with Circular Buffer)
    Effective address: Reg
    Base reg update:   Reg + Const folded back into circular buffer
    The effective address is a base AR register. The base AR register is updated with the
    base AR register minus a positive constant value equal to one element. If the address
    is greater than or equal to AE_CBEGIN0 and the updated value is less than AE_CBEGIN0,
    then AE_CEND0-AE_CBEGIN0 is added to it. The vector elements in the result register are
    also swapped.

PP (Prime)
    Effective address: See Instruction
    Base reg update:   See Instruction
    This addressing mode is used for priming instructions which set up the beginning of an
    unaligned load sequence.

PC (Circular Prime)
    Effective address: See Instruction
    Base reg update:   See Instruction
    This addressing mode is used for priming instructions which set up the beginning of an
    unaligned load sequence in a circular buffer.

FP (Flush)
    Effective address: See Instruction
    Base reg update:   See Instruction
    This addressing mode is used for flushing the last part of an unaligned store sequence.

IU (Immediate with Pre-update)
    Effective address: Reg + Immed
    Base reg update:   Reg + Immed
    The effective address is a base AR register plus an immediate value. The base AR
    register is updated with the effective address. These instructions are used for legacy
    HiFi 2/EP operations only. (They are “pre-incrementing” operations).

XU (Indexed with Pre-update)
    Effective address: Reg + Reg
    Base reg update:   Reg + Reg
    The effective address is a base AR register plus an offset AR register value. The base
    AR register is updated with the effective address. These instructions are used for
    legacy HiFi 2/EP operations only. (They are “pre-incrementing” operations).


2.4.4 Load Operations


Table 2-19 gives an overview of the various types of load operations. The first column
indicates a set of load operations which includes all those with the size <sz> and the address
mode <adr> replaced by any of the values in the second and third columns. The fourth
column summarizes the purpose of that group of operations.

Table 2-19 Load Overview

Instruction           Size <sz>                        Suffix <adr>                 Purpose

AE_L<sz>.<adr>        64, 32, 32F24, 16                I, X, IP, XP, XC             Aligned loads of scalars
AE_L<sz>.<adr>        32X2, 32X2F24, 16X4, 8X8, 8X4F   I, X, IP, RIP, XP, XC, RIC   Aligned loads of vectors
AE_LA<sz>.<adr>       64                               PP                           Prime for unaligned loads using IP
AE_LA<sz>POS.<adr>    32X2, 16X4, 24, 24X2             PC                           Prime for unaligned loads using IC
                                                                                    with positive stride
AE_LA<sz>NEG.<adr>    32X2, 16X4, 24, 24X2             PC                           Prime for unaligned loads using IC
                                                                                    with negative stride
AE_LA<sz>.<adr>       32X2, 32X2F24, 16X4, 24, 24X2    IP, IC                       Unaligned loads for accessing vectors
                                                                                    of aligned scalars with positive update
AE_LA<sz>.<adr>       32X2, 32X2F24, 16X4, 24, 24X2    RIP, RIC                     Unaligned loads for accessing vectors
                                                                                    of aligned scalars with negative update
AE_LALIGN64.I                                                                       Load of alignment register
AE_L<sz>M.<adr>       16X2, 32, 16                     I, X, XC, IU, XU             Legacy loads

In the following instructions, the instruction names show the assembler syntax. We also
include the C syntax below each instruction description.


AE_L64.I d, a, i64 [ fusion_slot0, Inst ]
AE_L64.IP d, a, i64 [ fusion_slot0, Inst ]
AE_L64.X (.XP, .XC) d, a, ax [ fusion_slot0, Inst ]
Required alignment: 8 bytes
Load a 64-bit value from memory into the AE_DR register d. See Table 2-18 for the meanings
of the address mode suffixes.
Note: C intrinsics AE_LQ56_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L64.I (.X, .XC, .I, .I), respectively.
C syntax:
ae_int64 AE_L64_I (const ae_int64 * a, immediate i64);
ae_int64 AE_L64_X (const ae_int64 * a, int ax);
void AE_L64_IP (ae_int64 d /*out*/,
const ae_int64 *a /*inout*/, immediate i64);
void AE_L64_XP (ae_int64 d /*out*/,
const ae_int64 *a /*inout*/, int ax);
void AE_L64_XC (ae_int64 d /*out*/,
const ae_int64 *a /*inout*/, int ax);
ae_q56s AE_LQ56_I (const ae_q56s * a, immediate i64);
void AE_LQ56_IU (ae_q56s d /*out*/,
const ae_q56s * a /*inout*/, immediate i64);
ae_q56s AE_LQ56_X (const ae_q56s * a, int ax);
void AE_LQ56_XU (ae_q56s d /*out*/,
const ae_q56s * a /*inout*/, int ax);
void AE_LQ56_C (ae_q56s d /*out*/,
const ae_q56s * a /*inout*/, int ax);

AE_L32X2.I d, a, i64 [ fusion_slot0, fusion_slot_fir_0, Inst ]
AE_L32X2.IP d, a, i64pos [ fusion_slot0, fusion_slot_fir_0, Inst ]
AE_L32X2.RIP (.RIC) d, a [ fusion_slot0, fusion_slot_fir_0, Inst ]
AE_L32X2.X (.XC) d, a, ax [ fusion_slot0, fusion_slot_fir_0, Inst ]
AE_L32X2.XP d, a, ax [ fusion_slot0, Inst ]
Required alignment: 8 bytes
Load a pair of 32-bit values from memory into the AE_DR register d. See Table 2-18 for the
meanings of the address mode suffixes.
Note: C intrinsics AE_LP24X2_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L32X2.I (.X, .XC, .I, .I),
respectively.
C syntax:
ae_int32x2 AE_L32X2_I (const ae_int32x2 * a, immediate i64);
ae_int32x2 AE_L32X2_X (const ae_int32x2 * a, int ax);
void AE_L32X2_IP (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/, immediate i64pos);
void AE_L32X2_XP (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/, int ax);
void AE_L32X2_XC (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/, int ax);
void AE_L32X2_RIP (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/);
void AE_L32X2_RIC (ae_int32x2 d /*out*/,
const ae_int32x2 *a /*inout*/);
ae_p24x2s AE_LP24X2_I (const ae_p24x2s * a, immediate i64);
void AE_LP24X2_IU (ae_p24x2s d /*out*/,
const ae_p24x2s * a /*inout*/, immediate i64);
ae_p24x2s AE_LP24X2_X (const ae_p24x2s * a, int ax);
void AE_LP24X2_XU (ae_p24x2s d /*out*/,
const ae_p24x2s * a /*inout*/, int ax);
void AE_LP24X2_C (ae_p24x2s d /*out*/,
const ae_p24x2s * a /*inout*/, int ax);

AE_L8X4F.I (.IP) d, a, i [ fusion_slot0 ]
AE_L8X4F.X (.XP) d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Load four 8-bit values from 32 bits in memory, sign-extend them to 16 bits and store the
values into the four 16-bit elements of AE_DR register d. See Table 2-18 for the meanings of
the address mode suffixes. The intent here is that the values in memory represent 8-bit (1.7)
fractions that get placed in the four elements of the AE_DR register as 1.15-bit fractions.
C syntax:
ae_f16x4 AE_L8X4F_I (const int8 * a, immediate i);
void AE_L8X4F_IP (ae_f16x4 p /*out*/,
const int8 * a /*inout*/, immediate i);

AE_L8X8.I (.IP) d, a, i [ fusion_slot0 ]
Required alignment: 8 bytes
Load eight 8-bit values from memory into the AE_DR register d. See Table 2-18 for the
meanings of the address mode suffixes.
C syntax:
ae_int64 AE_L8X8_I (const uint8 * a, immediate i64);
void AE_L8X8_IP (ae_int64 p /*out*/, const uint8 * a /*inout*/,
immediate i64);


AE_L16X4.I d, a, i64 [ fusion_slot0, fusion_slot_fir_0, Inst ]
AE_L16X4.IP d, a, i64pos [ fusion_slot0, fusion_slot_fir_0, Inst ]
AE_L16X4.RIP (.RIC) d, a [ fusion_slot0 ]
AE_L16X4.X (.XC) d, a, ax [ fusion_slot0 ]
AE_L16X4.XP d, a, ax [ fusion_slot0, Inst ]
Required alignment: 8 bytes
Load four 16-bit values from memory into the AE_DR register d. See Table 2-18 for the
meanings of the address mode suffixes.
C syntax:
ae_int16x4 AE_L16X4_I (const ae_int16x4 * a, immediate i64);
ae_int16x4 AE_L16X4_X (const ae_int16x4 * a, int ax);
void AE_L16X4_IP (ae_int16x4 d /*out*/,
const ae_int16x4 *a /*inout*/, immediate i64pos);
void AE_L16X4_XP (ae_int16x4 d /*out*/,
const ae_int16x4 *a /*inout*/, int ax);
void AE_L16X4_XC (ae_int16x4 d /*out*/,
const ae_int16x4 *a /*inout*/, int ax);
void AE_L16X4_RIP (ae_int16x4 d /*out*/,
const ae_int16x4 *a /*inout*/);
void AE_L16X4_RIC (ae_int16x4 d /*out*/,
const ae_int16x4 *a /*inout*/);

AE_L32X2F24.I d, a, i64 [ fusion_slot0, fusion_slot_fir_0, Inst ]
AE_L32X2F24.IP d, a, i64pos [ fusion_slot0, fusion_slot_fir_0, Inst ]
AE_L32X2F24.RIP (.RIC) d, a [ fusion_slot0 ]
AE_L32X2F24.X (.XP) d, a, ax [ fusion_slot0, Inst ]
AE_L32X2F24.XC d, a, ax [ fusion_slot0, fusion_slot_fir_0, Inst ]
Required alignment: 8 bytes
Load a pair of 24-bit values, each from the most significant 24 bits of a 32-bit half of the 64
bits in memory, sign-extend them to 32 bits and store the values into the two 32-bit
elements of AE_DR register d. See Table 2-18 for the meanings of the address mode suffixes.
The intent here is that the values in memory represent 32-bit (1.31) fractions that get
truncated and placed in the two elements of the AE_DR register as 9.23-bit fractions.
Note: C intrinsics AE_LP24X2F_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L32X2F24.I (.X, .XC, .I, .I),
respectively.
C syntax:
ae_f24x2 AE_L32X2F24_I (const ae_f24x2 * a, immediate i64);
ae_f24x2 AE_L32X2F24_X (const ae_f24x2 *a, int ax);
void AE_L32X2F24_IP (ae_f24x2 d /*out*/,
const ae_f24x2 * a /*inout*/,
immediate i64pos);
void AE_L32X2F24_XP (ae_f24x2 d /*out*/,
const ae_f24x2 * a /*inout*/, int ax);
void AE_L32X2F24_XC (ae_f24x2 d /*out*/,
const ae_f24x2 * a /*inout*/, int ax);
void AE_L32X2F24_RIP (ae_f24x2 d /*out*/,
const ae_f24x2 *a /*inout*/);
void AE_L32X2F24_RIC (ae_f24x2 d /*out*/,
const ae_f24x2 *a /*inout*/);
ae_p24x2s AE_LP24X2F_I (const ae_p24x2f * a, immediate i64);
void AE_LP24X2F_IU (ae_p24x2s d /*out*/,
const ae_p24x2f * a /*inout*/, immediate i64);
ae_p24x2s AE_LP24X2F_X (const ae_p24x2f * a, int ax);
void AE_LP24X2F_XU (ae_p24x2s d /*out*/,
const ae_p24x2f * a /*inout*/, int ax);
void AE_LP24X2F_C (ae_p24x2s d /*out*/,
const ae_p24x2f * a /*inout*/, unsigned ax);

AE_L32.I d, a, i32 [ fusion_slot0, Inst ]
AE_L32.IP d, a, i32 [ fusion_slot0, Inst ]
AE_L32.X (.XC) d, a, ax [ fusion_slot0, Inst ]
AE_L32.XP d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Load a 32-bit value from memory and replicate the value into the two elements of the AE_DR
register d. See Table 2-18 for the meanings of the address mode suffixes.
Note: C intrinsics AE_LP24_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L32.I (.X, .XC, .I, .I), respectively.
C syntax:
ae_int32x2 AE_L32_I (const ae_int32 * a, immediate i32);
ae_int32x2 AE_L32_X (const ae_int32 * a, int ax);
void AE_L32_IP (ae_int32x2 d /*out*/,
const ae_int32 * a /*inout*/, immediate off);
void AE_L32_XP (ae_int32x2 d /*out*/,
const ae_int32 * a /*inout*/, int ax);
void AE_L32_XC (ae_int32x2 d /*out*/,
const ae_int32 * a /*inout*/, int ax);
ae_p24x2s AE_LP24_I (const ae_p24s * a, immediate i32);
void AE_LP24_IU (ae_p24x2s d /*out*/,
const ae_p24s * a /*inout*/, immediate i32);
ae_p24x2s AE_LP24_X (const ae_p24s * a, int ax);
void AE_LP24_XU (ae_p24x2s d /*out*/,
const ae_p24s * a /*inout*/, int ax);
void AE_LP24_C (ae_p24x2s d /*out*/,
const ae_p24s * a /*inout*/, int ax);

AE_L32F24.I d, a, i32 [ fusion_slot0, Inst ]
AE_L32F24.IP d, a, i32 [ fusion_slot0, Inst ]
AE_L32F24.XC d, a, ax [ fusion_slot0, Inst ]
AE_L32F24.X (.XP) d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Load a 24-bit value from the most significant 24 bits of the 32-bit word from memory, sign-
extend to 32 bits and replicate the value into the two 32-bit elements of the AE_DR register
d. See Table 2-18 for the meanings of the address mode suffixes. The intent here is that the
value in memory represents a 32-bit (1.31) fraction that gets truncated and replicated into
the two elements of d as 9.23-bit fractions.
Note: C intrinsics AE_LP24F_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L32F24.I (.X, .XC, .I, .I),
respectively.
C syntax:
ae_f24x2 AE_L32F24_I (const ae_f24 * a, immediate i32);
ae_f24x2 AE_L32F24_X (const ae_f24 * a, int ax);
void AE_L32F24_IP (ae_f24x2 d /*out*/,
const ae_f24 * a /*inout*/, immediate i32);
void AE_L32F24_XP (ae_f24x2 d /*out*/,
const ae_f24 * a /*inout*/, int ax);
void AE_L32F24_XC (ae_f24x2 d /*out*/,
const ae_f24 * a /*inout*/, int ax);
ae_p24x2s AE_LP24F_I (const ae_p24f * a, immediate i32);
void AE_LP24F_IU (ae_p24x2s d /*out*/,
const ae_p24f * a /*inout*/, immediate i32);
ae_p24x2s AE_LP24F_X (const ae_p24f * a, int ax);
void AE_LP24F_XU (ae_p24x2s d /*out*/,
const ae_p24f * a /*inout*/, int ax);
void AE_LP24F_C (ae_p24x2s d /*out*/,
const ae_p24f * a /*inout*/, int ax);
AE_L16.I d, a, i16 [ fusion_slot0, Inst ]
AE_L16.IP d, a, i16 [ fusion_slot0, Inst ]
AE_L16.X (.XP, .XC) d, a, ax [ fusion_slot0]
Required alignment: 2 bytes
Load a 16-bit value from memory and replicate the value into the four elements of AE_DR
register d. See Table 2-18 for the meanings of the address mode suffixes.
C syntax:
ae_int16x4 AE_L16_I (const ae_int16 * a, immediate i16);
ae_int16x4 AE_L16_X (const ae_int16 * a, int ax);
void AE_L16_IP (ae_int16x4 d /*out*/,
const ae_int16 * a /*inout*/, immediate i16);
void AE_L16_XP (ae_int16x4 d /*out*/,
const ae_int16 * a /*inout*/, int ax);
void AE_L16_XC (ae_int16x4 d /*out*/,
const ae_int16 * a /*inout*/, int ax);
AE_LA64.PP u, a [ Inst]
Required alignment: 1 byte (but following instructions have alignment requirements).
Load a 64-bit value from memory to AE_VALIGN register u. The effective address is
(a & 0xFFFFFFF8). No update is made to the address register.

This instruction is used to prime the unaligned access stream for all AE_LA<size>.IP and
AE_LA<size>.RIP instructions regardless of size or direction.

C syntax:
ae_valign AE_LA64_PP (void *a);
AE_LA32X2POS.PC u, a [ fusion_slot0, Inst]
AE_LA32X2NEG.PC u, a [ fusion_slot0]
Required alignment: 4 bytes
Load a 64-bit value from memory into AE_VALIGN register u. The effective
address is (a & 0xFFFFFFF8).

The instruction AE_LA32X2POS.PC is used to prime the unaligned access stream for
AE_LA32X2.IC and AE_LA32X2F24.IC instructions. The instruction AE_LA32X2NEG.PC is
used to prime the unaligned access stream for AE_LA32X2.RIC and AE_LA32X2F24.RIC
instructions.

Note: C intrinsic AE_LA32X2F24POS_PC is implemented using operation
AE_LA32X2POS.PC. C intrinsic AE_LA32X2F24NEG_PC is implemented using operation
AE_LA32X2NEG.PC.
C syntax:
void AE_LA32X2POS_PC (ae_valign u /*out*/, ae_int32x2 *a /*inout*/);
void AE_LA32X2F24POS_PC (ae_valign u /*out*/,ae_f24x2 *a /*inout*/);
void AE_LA32X2NEG_PC (ae_valign u /*out*/, ae_int32x2 *a /*inout*/);
void AE_LA32X2F24NEG_PC (ae_valign u/*out*/, ae_f24x2 *a /*inout*/);
AE_LA16X4POS.PC u, a [ fusion_slot0]
AE_LA16X4NEG.PC u, a [ fusion_slot0]
Required alignment: 2 bytes
Load a 64-bit value from memory into AE_VALIGN register u. The effective address is (a &
0xFFFFFFF8).

The instruction AE_LA16X4POS.PC is used to prime the unaligned access stream for
AE_LA16X4.IC instructions. The instruction AE_LA16X4NEG.PC is used to prime the
unaligned access stream for AE_LA16X4.RIC instructions.

C syntax:
void AE_LA16X4POS_PC (ae_valign u /*out*/, ae_int16x4 *a /*inout*/);
void AE_LA16X4NEG_PC (ae_valign u /*out*/, ae_int16x4 *a /*inout*/);
AE_LA24POS.PC u, a [ fusion_slot0]
AE_LA24NEG.PC u, a [ fusion_slot0]
Required alignment: 1 byte
Load a 64-bit value from memory to AE_VALIGN register u. The effective address is
(a & 0xFFFFFFF8).

The instruction AE_LA24POS.PC is used to prime the unaligned access stream for
AE_LA24.IC instructions. The instruction AE_LA24NEG.PC is used to prime the unaligned
access stream for AE_LA24.RIC instructions.

C syntax:
void AE_LA24POS_PC (ae_valign u /*out*/, void *a /*inout*/);
void AE_LA24NEG_PC (ae_valign u /*out*/, void *a /*inout*/);
AE_LA24X2POS.PC u, a [ fusion_slot0]
AE_LA24X2NEG.PC u, a [ fusion_slot0]
Required alignment: 1 byte
Load a 64-bit value from memory to AE_VALIGN register u. The effective address is
(a & 0xFFFFFFF8).

The instruction AE_LA24X2POS.PC is used to prime the unaligned access stream for
AE_LA24X2.IC instructions. The instruction AE_LA24X2NEG.PC is used to prime the
unaligned access stream for AE_LA24X2.RIC instructions.
C syntax:
void AE_LA24X2POS_PC (ae_valign u /*out*/, void *a /*inout*/);
void AE_LA24X2NEG_PC (ae_valign u /*out*/, void *a /*inout*/);
AE_LA32X2.IP (.IC) d, u, a [ fusion_slot0, fusion_slot_fir_0, Inst]
AE_LA32X2.RIC (.RIP) d, u, a [ fusion_slot0]
Required alignment: 4 bytes
Load a pair of 32-bit values from effective address (a) in memory into the AE_DR register d.
Instructions AE_LA32X2.IP (.IC) are used if the direction of the load operations is positive.
Instructions AE_LA32X2.RIP (.RIC) are used if the direction of the load operations is
negative.
C syntax:
void AE_LA32X2_IP (ae_int32x2 d /*out*/, ae_valign u /*inout*/,
ae_int32x2 *a /*inout*/);
void AE_LA32X2_IC (ae_int32x2 d /*out*/, ae_valign u /*inout*/,
ae_int32x2 *a /*inout*/);
void AE_LA32X2_RIP (ae_int32x2 d /*out*/, ae_valign u /*inout*/,
ae_int32x2 *a /*inout*/);
void AE_LA32X2_RIC (ae_int32x2 d /*out*/, ae_valign u /*inout*/,
ae_int32x2 *a /*inout*/);
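One way to picture how a priming operation and the subsequent AE_LA loads cooperate is the portable C model below, written for the positive direction only. It is a sketch under stated assumptions, not the hardware algorithm: `model_valign`, `model_prime`, and `model_load8` are hypothetical names, memory is addressed as a byte offset from a base that is assumed to be 8-byte aligned, and the element-size alignment requirement and circular addressing are not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an AE_VALIGN register:
 * it holds the most recently read aligned 8-byte block. */
typedef struct { uint8_t bytes[8]; } model_valign;

/* Prime: read the aligned 8-byte block that contains offset off.
 * (off & ~7) plays the role of the effective address (a & 0xFFFFFFF8). */
static model_valign model_prime(const uint8_t *base, size_t off)
{
    model_valign u;
    memcpy(u.bytes, base + (off & ~(size_t)7), 8);
    return u;
}

/* Load 8 bytes starting at offset *off, then advance *off by 8.
 * Only one new aligned block is read per call; the remaining bytes come
 * from the alignment register. The model always reads the next aligned
 * block, even when *off is already aligned. */
static void model_load8(uint8_t out[8], model_valign *u,
                        const uint8_t *base, size_t *off)
{
    size_t k = *off & 7;                   /* misalignment in bytes */
    model_valign next;
    memcpy(next.bytes, base + (*off & ~(size_t)7) + 8, 8);
    memcpy(out, u->bytes + k, 8 - k);      /* tail of the previous block */
    memcpy(out + (8 - k), next.bytes, k);  /* head of the new block */
    *u = next;                             /* carry the new block forward */
    *off += 8;
}
```

In this model, skipping the priming step would leave the alignment register without the first aligned block, so the first load would return stale bytes.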
AE_LA32X2F24.IP (.IC) d, u, a [ fusion_slot0, Inst]
AE_LA32X2F24.RIC (.RIP) d, u, a [ fusion_slot0]

Required alignment: 4 bytes

Load a pair of 24-bit values, each from the most significant 24 bits of a 32-bit half of the 64
bits in memory, sign-extend them to 32 bits and store the values into the two 32-bit elements
of AE_DR register d. Instructions AE_LA32X2F24.IP (.IC) are used if the direction of the load
operations is positive. Instructions AE_LA32X2F24.RIP (.RIC) are used if the direction of the
load operations is negative.

C syntax:
void AE_LA32X2F24_IP (ae_f24x2 d /*out*/, ae_valign u /*inout*/,
ae_f24x2 *a /*inout*/);
void AE_LA32X2F24_IC (ae_f24x2 d /*out*/, ae_valign u /*inout*/,
ae_f24x2 *a /*inout*/);
void AE_LA32X2F24_RIP (ae_f24x2 d /*out*/, ae_valign u /*inout*/,
ae_f24x2 *a /*inout*/);
void AE_LA32X2F24_RIC (ae_f24x2 d /*out*/, ae_valign u /*inout*/,
ae_f24x2 *a /*inout*/);
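The per-element conversion described for AE_LA32X2F24 (take the most significant 24 bits of a 32-bit memory word and sign-extend to 32 bits) can be sketched in portable C. The function name `model_load_f24` is hypothetical and models one element only; it says nothing about the alignment register or address update.

```c
#include <stdint.h>

/* Model: keep the most significant 24 bits of a 32-bit memory word
 * and sign-extend them to a 32-bit register element. */
static int32_t model_load_f24(uint32_t mem_word)
{
    int32_t v = (int32_t)(mem_word >> 8);  /* top 24 bits, now in bits 23..0 */
    if (v & 0x800000)                      /* bit 23 is the 24-bit sign bit */
        v -= 0x1000000;                    /* sign-extend to 32 bits */
    return v;
}
```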

AE_LA16X4.IP (.IC, .RIP) d, u, a [ fusion_slot0, Inst]
AE_LA16X4.RIC d, u, a [ fusion_slot0]
Required alignment: 2 bytes
Load four 16-bit values from effective address (a) in memory into the AE_DR register d.
Instructions AE_LA16X4.IP (.IC) are used if the direction of the load operations is positive.
Instructions AE_LA16X4.RIP (.RIC) are used if the direction of the load operations is
negative.
C syntax:
void AE_LA16X4_IP (ae_int16x4 d /*out*/, ae_valign u /*inout*/,
ae_int16x4 *a /*inout*/);
void AE_LA16X4_IC (ae_int16x4 d /*out*/, ae_valign u /*inout*/,
ae_int16x4 *a /*inout*/);
void AE_LA16X4_RIP (ae_int16x4 d /*out*/, ae_valign u /*inout*/,
ae_int16x4 *a /*inout*/);
void AE_LA16X4_RIC (ae_int16x4 d /*out*/, ae_valign u /*inout*/,
ae_int16x4 *a /*inout*/);

AE_LA24.IP (.IC) d, u, a [ fusion_slot0, Inst]
AE_LA24.RIP (.RIC) d, u, a [ fusion_slot0]
Required alignment: 1 byte
Load a 24-bit value from effective address (a) in memory into the AE_DR register d.
Instructions AE_LA24.IP (.IC) are used if the direction of the load operations is positive.
Instructions AE_LA24.RIP (.RIC) are used if the direction of the load operations is negative.
C syntax:
void AE_LA24_IP (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24_IC (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24_RIP (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24_RIC (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
AE_LA24X2.IP (.IC) d, u, a [ fusion_slot0, Inst, fusion_slot_fir_0]
AE_LA24X2.RIP (.RIC) d, u, a [ fusion_slot0]
Required alignment: 1 byte
Load a pair of 24-bit values from effective address (a) in memory into the AE_DR register d.
Instructions AE_LA24X2.IP (.IC) are used if the direction of the load operations is positive.
Instructions AE_LA24X2.RIP (.RIC) are used if the direction of the load operations is
negative.

C syntax:
void AE_LA24X2_IP (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24X2_IC (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24X2_RIP (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
void AE_LA24X2_RIC (ae_int24x2 d /*out*/, ae_valign u /*inout*/,
void *a /*inout*/);
AE_LALIGN64.I u, a, imm [ fusion_slot0]
Required alignment: 8 bytes
Load a 64-bit value from effective address (a + imm) in memory into the AE_VALIGN register
u.
C syntax:
ae_valign AE_LALIGN64_I (void *a, immediate imm);
AE_L16X2M.I d, a, i32 [ fusion_slot0, Inst ]
AE_L16X2M.IU d, a, i32 [ fusion_slot0, Inst ]
AE_L16X2M.X (.XU) d, a, ax [ fusion_slot0, Inst ]
AE_L16X2M.XC d, a, ax [fusion_slot0 ]
Required alignment: 4 bytes
Load a pair of 16-bit values from memory, pad each with 8 zero bits at the low end,
sign-extend to 32 bits, and store the values into the two 32-bit elements of AE_DR register d.
See Table 2-3 for the meanings of the address mode suffixes.
Note: C intrinsics AE_LP16X2F_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L16X2M.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
ae_int32x2 AE_L16X2M_I (const ae_p16x2s * a, immediate i32);
void AE_L16X2M_IU (ae_int32x2 d /*out*/,
const ae_p16x2s * a /*inout*/, immediate i32);
ae_int32x2 AE_L16X2M_X (const ae_p16x2s * a, int ax);
void AE_L16X2M_XU (ae_p16x2s d /*out*/,
const ae_p16x2s * a /*inout*/, int ax);
void AE_L16X2M_XC (ae_int32x2 d /*out*/,
const ae_p16x2s * a /*inout*/, int ax);
ae_p24x2s AE_LP16X2F_I (const ae_p16x2s * a, immediate i32);
void AE_LP16X2F_IU (ae_p24x2s d /*out*/,
const ae_p16x2s * a /*inout*/, immediate i32);
ae_p24x2s AE_LP16X2F_X (const ae_p16x2s * a, int ax);
void AE_LP16X2F_XU (ae_p24x2s d /*out*/,
const ae_p16x2s * a /*inout*/, int ax);
void AE_LP16X2F_C (ae_p24x2s d /*out*/,
const ae_p16x2s * a /*inout*/, int ax);
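The widening performed by the AE_L16X2M loads (8 zero bits appended at the low end, then sign extension to 32 bits) amounts to placing each 16-bit value in bits 23..8 of a 32-bit element. The portable C model below is a sketch: `model_int32x2`, `model_widen16m`, and `model_l16x2m` are hypothetical names, and the mapping of the first memory element to the H half is an assumption that this section does not state.

```c
#include <stdint.h>

/* Hypothetical stand-in for ae_int32x2: H and L 32-bit elements. */
typedef struct { int32_t h, l; } model_int32x2;

/* Append 8 zero bits at the low end and sign-extend to 32 bits.
 * Multiplying by 256 equals a left shift by 8 but stays well defined
 * for negative inputs. */
static int32_t model_widen16m(int16_t x)
{
    return (int32_t)x * 256;
}

/* Model of the pair load. Assumption: a[0] goes to H and a[1] to L. */
static model_int32x2 model_l16x2m(const int16_t a[2])
{
    model_int32x2 d;
    d.h = model_widen16m(a[0]);
    d.l = model_widen16m(a[1]);
    return d;
}
```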
AE_L32M.I d, a, i32 [fusion_slot0, Inst ]
AE_L32M.IU d, a, i32 [fusion_slot0, Inst ]
AE_L32M.X (.XU) d, a, ax [fusion_slot0, Inst ]
AE_L32M.XC d, a, ax [fusion_slot0]
Required alignment: 4 bytes
Load a 32-bit value from memory, pad it with 16 zero bits at the low end, sign-extend to
64 bits, and store the value into AE_DR register d. See Table 2-3 for the meanings of the
address mode suffixes.
Note: C intrinsics AE_LQ32F_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L32M.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
ae_int64 AE_L32M_I (const ae_q32s * a, immediate i32);
void AE_L32M_IU (ae_int64 d /*out*/,
const ae_q32s * a /*inout*/, immediate i32);
ae_int64 AE_L32M_X (const ae_q32s * a, int ax);
void AE_L32M_XU (ae_int64 d /*out*/,
const ae_q32s * a /*inout*/, int ax);
void AE_L32M_XC (ae_int64 d /*out*/,
const ae_q32s * a /*inout*/, int ax);
ae_p56s AE_LQ32F_I (const ae_q32s * a, immediate i32);
void AE_LQ32F_IU (ae_p56s d /*out*/,
const ae_q32s * a /*inout*/, immediate i32);
ae_p56s AE_LQ32F_X (const ae_q32s * a, int ax);
void AE_LQ32F_XU (ae_p56s d /*out*/,
const ae_q32s * a /*inout*/, int ax);
void AE_LQ32F_C (ae_p56s d /*out*/,
const ae_q32s * a /*inout*/, int ax);
AE_L16M.I d, a, i16 [fusion_slot0, Inst ]
AE_L16M.IU d, a, i16 [fusion_slot0, Inst ]
AE_L16M.XU d, a, ax [fusion_slot0, Inst ]
AE_L16M.X (.XC) d, a, ax [fusion_slot0 ]
Required alignment: 2 bytes
Load a 16-bit value from memory, pad it with 8 zero bits at the low end, sign-extend to
32 bits, and store the value into both halves of AE_DR register d. See Table 2-3 for the
meanings of the address mode suffixes.

Note: C intrinsics AE_LP16F_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_L16M.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
ae_int32x2 AE_L16M_I (const ae_p16s * a, immediate i16);
void AE_L16M_IU (ae_int32x2 d /*out*/,
const ae_p16s * a /*inout*/, immediate i16);
ae_int32x2 AE_L16M_X (const ae_p16s * a, int ax);
void AE_L16M_XU (ae_int32x2 d /*out*/,
const ae_p16s * a /*inout*/, int ax);
void AE_L16M_XC (ae_int32x2 d /*out*/,
const ae_p16s * a /*inout*/, int ax);
ae_p24x2s AE_LP16F_I (const ae_p16s * a, immediate i16);
void AE_LP16F_IU (ae_p24x2s d /*out*/,
const ae_p16s * a /*inout*/, immediate i16);
ae_p24x2s AE_LP16F_X (const ae_p16s * a, int ax);
void AE_LP16F_XU (ae_p24x2s d /*out*/,
const ae_p16s * a /*inout*/, int ax);
void AE_LP16F_C (ae_p24x2s d /*out*/,
const ae_p16s * a /*inout*/, int ax);

2.4.5 Core Load Operations


AE_L16SI.N art, ars, i32 [ Inst16b ]
AE_L16UI.N art, ars, i32 [ Inst16b ]

Required alignment: 2 bytes

Limited immediate versions of the core L16SI and L16UI instructions. These instructions are
inferred automatically by the C/C++ compiler.

C syntax:
unsigned AE_L16SI_N (const void * a, immediate i32);
unsigned AE_L16UI_N (const void * a, immediate i32);

2.4.6 Store Operations


Table 2-20 provides an overview of the various types of store instructions. The first column
indicates a set of store instructions which includes all those with the size <sz> and the
address mode <adr> replaced by any of the values in the second and third columns. The
fourth column summarizes the purpose of that group of instructions.

Table 2-20 Store Overview

Instruction         Size <sz>                       Suffix <adr>                 Purpose
AE_S<sz>.<adr>      64                              I, X, IP, XP, XC             Aligned stores of scalars
AE_S<sz>IP          8, 16, 32                       IP                           Core updating store
AE_S<sz>.<adr>      32X2, 16X4, 8X4F, 32X2F24       I, X, IP, XP, XC, RIP, RIC   Aligned stores of vectors
AE_S<sz>.L.<adr>    32, 32F24, 16                   I, X, IP, XP, XC             Aligned stores of scalars from the low part of a register
AE_S<sz>.<adr>      32RA64S, 24RA64S                I, IP, X, XP, XC             Aligned stores of scalars from the middle part of a register with rounding and saturation
AE_S<sz>.<adr>      32X2RA64S, 24X2RA64S            IP                           Aligned stores of two scalars from the middle part of a register with rounding and saturation
AE_SA<sz>.<adr>     32X2, 32X2F24, 16X4, 24, 24X2   IP, IC, RIP, RIC             Unaligned stores for accessing vectors of aligned scalars
AE_SA64POS.FP       -                               -                            Flush after unaligned store with positive stride
AE_SA64NEG.FP       -                               -                            Flush after unaligned store with negative stride
AE_SALIGN64.I       -                               -                            Store of alignment register
AE_ZALIGN64         -                               -                            Zero alignment register
AE_S<sz>M.<adr>     16X2, 32, 16                    I, X, XC, IU, XU             Legacy stores

AE_S64.I d, a, i64 [ fusion_slot0, Inst ]
AE_S64.IP d, a, i64 [ fusion_slot0, Inst ]
AE_S64.X (.XP, .XC) d, a, ax [ fusion_slot0 ]
Required alignment: 8 bytes
Store the 64 bits of the AE_DR register d to memory. See Table 2-3 for the meanings of the
address mode suffixes.
Note: C intrinsics AE_SQ56S_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2/EP code
portability. They are implemented through operations AE_S64.I (.X, .XC, .I, .I), respectively.

C syntax:
void AE_S64_I (ae_int64 d, ae_int64 * a, immediate i64);
void AE_S64_X (ae_int64 d, ae_int64 * a, int ax);
void AE_S64_IP (ae_int64 d, ae_int64 * a /*inout*/, immediate i64);
void AE_S64_XP (ae_int64 d, ae_int64 * a /*inout*/, int ax);
void AE_S64_XC (ae_int64 d, ae_int64 * a /*inout*/, int ax);
void AE_SQ56S_I (ae_q56s d, ae_q56s * a, immediate i64);
void AE_SQ56S_IU (ae_q56s d, ae_q56s * a /*inout*/, immediate i64);
void AE_SQ56S_X (ae_q56s d, ae_q56s * a, int ax);
void AE_SQ56S_XU (ae_q56s d, ae_q56s * a /*inout*/, int ax);
void AE_SQ56S_C (ae_q56s d, ae_q56s * a /*inout*/, int ax);
AE_S32X2.I d, a, i64 [ fusion_slot0, Inst ]
AE_S32X2.IP d, a, i64pos [ fusion_slot0, Inst ]
AE_S32X2.RIP (.RIC) d, a [ fusion_slot0 ]
AE_S32X2.X (.XP, .XC) d, a, ax [ fusion_slot0, Inst ]
Required alignment: 8 bytes
Store a pair of 32-bit values from the AE_DR register d to memory. See Table 2-3 for the
meanings of the address mode suffixes.
Note: C intrinsics AE_SP24X2S_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2/EP code
portability. They are implemented through operations AE_S32X2.I (.X, .XC, .I, .I),
respectively.
C syntax:
void AE_S32X2_I (ae_int32x2 d, ae_int32x2 * a, immediate i64);
void AE_S32X2_X (ae_int32x2 d, ae_int32x2 * a, int ax);
void AE_S32X2_IP (ae_int32x2 d,
ae_int32x2 * a /*inout*/, immediate i64);
void AE_S32X2_XP (ae_int32x2 d,
ae_int32x2 * a /*inout*/, int ax);
void AE_S32X2_XC (ae_int32x2 d,
ae_int32x2 * a /*inout*/, int ax);
void AE_S32X2_RIP (ae_int32x2 d, ae_int32x2 * a /*inout*/);
void AE_S32X2_RIC (ae_int32x2 d, ae_int32x2 * a /*inout*/);
void AE_SP24X2S_I (ae_p24x2s d, ae_p24x2s * a, immediate i64);
void AE_SP24X2S_IU (ae_p24x2s d,
ae_p24x2s * a /*inout*/, immediate i64);
void AE_SP24X2S_X (ae_p24x2s d, ae_p24x2s * a, int ax);
void AE_SP24X2S_XU (ae_p24x2s d,
ae_p24x2s * a /*inout*/, int ax);
void AE_SP24X2S_C (ae_p24x2s d,
ae_p24x2s * a /*inout*/, int ax);

AE_S16X4.I d, a, i64 [ fusion_slot0, Inst ]
AE_S16X4.IP d, a, i64pos [ fusion_slot0, Inst ]
AE_S16X4.RIP (.RIC) d, a [ fusion_slot0 ]
AE_S16X4.X (.XP, .XC) d, a, ax [ fusion_slot0 ]
Required alignment: 8 bytes
Store four 16-bit values from AE_DR register d to memory. See Table 2-3 for the meanings
of the address mode suffixes.
C syntax:
void AE_S16X4_I (ae_int16x4 d, ae_int16x4 * a, immediate i64);
void AE_S16X4_X (ae_int16x4 d, ae_int16x4 * a, int ax);
void AE_S16X4_IP (ae_int16x4 d,
ae_int16x4 * a /*inout*/, immediate i64);
void AE_S16X4_RIP (ae_int16x4 d, ae_int16x4 * a /*inout*/);
void AE_S16X4_RIC (ae_int16x4 d, ae_int16x4 * a /*inout*/);
void AE_S16X4_XP (ae_int16x4 d,
ae_int16x4 * a /*inout*/, int ax);
void AE_S16X4_XC (ae_int16x4 d,
ae_int16x4 * a /*inout*/, unsigned ax);

AE_S8X4F.I (.IP) d, a, i [ fusion_slot0 ]
Required alignment: 4 bytes
Store four 8-bit values, taken from the high eight bits of each 16-bit element of AE_DR
register d, into 32 bits of memory. See Table 2-3 for the meanings of the address mode
suffixes.
C syntax:
void AE_S8X4F_I (ae_f16x4 d, int8 * a, immediate i);
void AE_S8X4F_IP (ae_f16x4 d, int8 * a /*inout*/, immediate i);
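The narrowing done by AE_S8X4F (keep the high eight bits of each 16-bit element) can be sketched in portable C. The name `model_s8x4f` is hypothetical, and the assumption that element i of the register maps to byte i in memory is made only for illustration; this section does not specify the byte order.

```c
#include <stdint.h>

/* Model: store the high eight bits of each of four 16-bit elements
 * as four consecutive signed bytes.
 * Assumption: element i is written to mem[i]. */
static void model_s8x4f(const int16_t d[4], int8_t mem[4])
{
    for (int i = 0; i < 4; ++i) {
        /* High eight bits of the element, as an unsigned pattern. */
        unsigned hi = ((uint16_t)d[i] >> 8) & 0xFFu;
        /* Reinterpret the pattern as a signed byte without relying on
         * implementation-defined narrowing conversions. */
        mem[i] = (int8_t)(hi < 128 ? (int)hi : (int)hi - 256);
    }
}
```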

AE_S32X2F24.I d, a, i64 [ fusion_slot0, Inst ]
AE_S32X2F24.IP d, a, i64pos [ fusion_slot0, Inst ]
AE_S32X2F24.RIP (.RIC) d, a [ fusion_slot0 ]
AE_S32X2F24.X (.XU, .XP, .XC) d, a, ax [ fusion_slot0, Inst ]
Required alignment: 8 bytes
Store the 24 LSBs of the two 32-bit elements of AE_DR register d, with each value padded
on the right with zeroes to 32 bits and placed in half of the 64 bits in memory. See Table 2-3
for the meanings of the address mode suffixes. The intent here is that the values in register
d represent 9.23-bit values that get padded to a 1.31-bit memory representation.
Note: C intrinsics AE_SP24X2F_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S32X2F24.I (.X, .XC, .I, .I),
respectively.

C syntax:
void AE_S32X2F24_I (ae_f24x2 d, ae_f24x2 *a, immediate i64);
void AE_S32X2F24_X (ae_f24x2 d, ae_f24x2 * a, int ax);
void AE_S32X2F24_IP (ae_f24x2 d,
ae_f24x2 * a /*inout*/, immediate i64);
void AE_S32X2F24_RIP (ae_f24x2 d, ae_f24x2 * a /*inout*/);
void AE_S32X2F24_RIC (ae_f24x2 d, ae_f24x2 * a /*inout*/);
void AE_S32X2F24_XP (ae_f24x2 d,
ae_f24x2 * a /*inout*/, int ax);
void AE_S32X2F24_XC (ae_f24x2 d,
ae_f24x2 * a /*inout*/, int ax);
void AE_SP24X2F_I (ae_p24x2s d, ae_p24x2f * a, immediate i64);
void AE_SP24X2F_IU (ae_p24x2s d,
ae_p24x2f * a /*inout*/, immediate i64);
void AE_SP24X2F_X (ae_p24x2s d, ae_p24x2f * a, int ax);
void AE_SP24X2F_XU (ae_p24x2s d,
ae_p24x2f * a /*inout*/, int ax);
void AE_SP24X2F_C (ae_p24x2s d,
ae_p24x2f * a /*inout*/, int ax);
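The per-element conversion described for AE_S32X2F24 (keep the 24 LSBs and pad with zeroes on the right, so that a 9.23-bit register value becomes a 1.31-bit memory value) can be sketched in portable C. The name `model_store_f24` is hypothetical and the function models a single 32-bit element only.

```c
#include <stdint.h>

/* Model: keep the 24 LSBs of a register element and pad 8 zero bits on
 * the right, giving the 32-bit pattern that is written to memory. */
static uint32_t model_store_f24(int32_t reg_elem)
{
    uint32_t low24 = (uint32_t)reg_elem & 0x00FFFFFFu;  /* drop the upper 8 bits */
    return low24 << 8;                                  /* 9.23 -> 1.31 layout */
}
```

For example, 0.5 is 0x00400000 in the 9.23 format and 0x40000000 in the 1.31 format, which is what the model returns. Bits above the 24 LSBs are discarded, so values outside the 1.23 range do not survive the store.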
AE_S32.L.I d, a, i32 [ fusion_slot0, Inst, ae_minislot0 ]
AE_S32.L.IP d, a, i32 [ fusion_slot0, Inst ]
AE_S32.L.X (.XP) d, a, ax [ fusion_slot0, Inst ]
AE_S32.L.XC d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Store the 32-bit L element of the AE_DR register d to memory. For operations with suffix .I,
the effective address is (a + i32). See Table 2-3 for the meanings of the address mode
suffixes.
Note: C intrinsics AE_SP24S_L_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S32.L.I (.X, .XC, .I, .I) respectively.

C syntax:
void AE_S32_L_I (ae_int32x2 d, ae_int32 * a, immediate i32);
void AE_S32_L_X (ae_int32x2 d, ae_int32 * a, int ax);
void AE_S32_L_IP (ae_int32x2 d,
ae_int32 * a /*inout*/, immediate i32);
void AE_S32_L_XP (ae_int32x2 d,
ae_int32 * a /*inout*/, int ax);
void AE_S32_L_XC (ae_int32x2 d,
ae_int32 * a /*inout*/, int ax);
void AE_SP24S_L_I (ae_p24x2s d, ae_p24s * a, immediate i32);
void AE_SP24S_L_IU (ae_p24x2s d,
ae_p24s * a /*inout*/, immediate i32);
void AE_SP24S_L_X (ae_p24x2s d, ae_p24s * a, int ax);
void AE_SP24S_L_XU (ae_p24x2s d,
ae_p24s * a /*inout*/, int ax);
void AE_SP24S_L_C (ae_p24x2s d,
ae_p24s * a /*inout*/, int ax);

AE_S32F24.L.I d, a, i32 [ fusion_slot0, Inst ]
AE_S32F24.L.IP d, a, i32 [ fusion_slot0, Inst ]
AE_S32F24.L.X (.XP, .XC) d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Store the 24 LSBs from the L element of the AE_DR register d, padded with zeroes on the
right, to the 32 bits in memory. See Table 2-3 for the meanings of the address mode suffixes.
Note: C intrinsics AE_SP24F_L_I (_X, _C, _IU, _XU) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S32F24.L.I (.X, .XC, .I, .I),
respectively.
C syntax:
void AE_S32F24_L_I (ae_f24x2 d, ae_f24 * a, immediate i32);
void AE_S32F24_L_X (ae_f24x2 d, ae_f24 * a, int ax);
void AE_S32F24_L_IP (ae_f24x2 d,
ae_f24 * a /*inout*/, immediate i32);
void AE_S32F24_L_XP (ae_f24x2 d,
ae_f24 * a /*inout*/, int ax);
void AE_S32F24_L_XC (ae_f24x2 d,
ae_f24 * a /*inout*/, int ax);
void AE_SP24F_L_I (ae_p24x2s d, ae_p24f * a, immediate i32);
void AE_SP24F_L_IU (ae_p24x2s d,
ae_p24f * a /*inout*/, immediate i32);
void AE_SP24F_L_X (ae_p24x2s d, ae_p24f * a, int ax);
void AE_SP24F_L_XU (ae_p24x2s d,
ae_p24f * a /*inout*/, int ax);
void AE_SP24F_L_C (ae_p24x2s d,
ae_p24f * a /*inout*/, int ax);
AE_S16.0.I d, a, i16 [ fusion_slot0, Inst ]
AE_S16.0.IP d, a, i16 [ fusion_slot0, Inst ]
AE_S16.0.XP d, a, ax [ fusion_slot0, Inst ]
AE_S16.0.X (.XC) d, a, ax [ fusion_slot0]
Required alignment: 2 bytes
Store the 16-bit 0 element of the AE_DR register d to memory. See Table 2-3 for the
meanings of the address mode suffixes.
C syntax:
void AE_S16_0_I (ae_int16x4 d, ae_int16 * a, immediate i16);
void AE_S16_0_X (ae_int16x4 d, ae_int16 * a, int ax);
void AE_S16_0_IP (ae_int16x4 d,
ae_int16 * a /*inout*/, immediate i16);
void AE_S16_0_XP (ae_int16x4 d, ae_int16 * a, int ax);
void AE_S16_0_XC (ae_int16x4 d, ae_int16 * a, int ax);
AE_SA16X4.IP d, u, a [ fusion_slot0, Inst ]
AE_SA16X4.IC (.RIP, .RIC) d, u, a [ fusion_slot0 ]
Required alignment: 2 bytes
Store four 16-bit values from AE_DR register d to memory with effective address (a).
Instructions AE_SA16X4.IP (.IC) are used if the direction of the store operations is positive.
Instructions AE_SA16X4.RIP (.RIC) are used if the direction of the store operations is
negative.
C syntax:
void AE_SA16X4_IP (ae_int16x4 d, ae_valign u /*inout*/,
ae_int16x4 * a /*inout*/);
void AE_SA16X4_IC (ae_int16x4 d, ae_valign u /*inout*/,
ae_int16x4 * a /*inout*/);
void AE_SA16X4_RIP (ae_int16x4 d, ae_valign u /*inout*/,
ae_int16x4 * a /*inout*/);
void AE_SA16X4_RIC (ae_int16x4 d, ae_valign u /*inout*/,
ae_int16x4 * a /*inout*/);

AE_SA32X2.IP d, u, a [ fusion_slot0, Inst ]
AE_SA32X2.IC (.RIP, .RIC) d, u, a [ fusion_slot0 ]
Required alignment: 4 bytes
Store a pair of 32-bit values from AE_DR register d to memory with effective address (a).
Instructions AE_SA32X2.IP (.IC) are used if the direction of the store operations is positive.
Instructions AE_SA32X2.RIP (.RIC) are used if the direction of the store operations is
negative.
C syntax:
void AE_SA32X2_IP (ae_int32x2 d, ae_valign u /*inout*/,
ae_int32x2 * a /*inout*/);
void AE_SA32X2_IC (ae_int32x2 d, ae_valign u /*inout*/,
ae_int32x2 * a /*inout*/);
void AE_SA32X2_RIP (ae_int32x2 d, ae_valign u /*inout*/,
ae_int32x2 * a /*inout*/);
void AE_SA32X2_RIC (ae_int32x2 d, ae_valign u /*inout*/,
ae_int32x2 * a /*inout*/);
AE_SA32X2F24.IP d, u, a [ fusion_slot0, Inst ]
AE_SA32X2F24.IC (.RIP, .RIC) d, u, a [ fusion_slot0 ]
Required alignment: 4 bytes
Store the 24 LSBs of the two 32-bit elements of AE_DR register d, with each value padded
on the right with zeroes to 32 bits and placed in half of the 64 bits in memory with effective
address (a). Instructions AE_SA32X2F24.IP (.IC) are used if the direction of the store
operations is positive. Instructions AE_SA32X2F24.RIP (.RIC) are used if the direction of the
store operations is negative.
C syntax:
void AE_SA32X2F24_IP (ae_f24x2 d, ae_valign u /*inout*/,
ae_f24x2 * a /*inout*/);
void AE_SA32X2F24_IC (ae_f24x2 d, ae_valign u /*inout*/,
ae_f24x2 * a /*inout*/);
void AE_SA32X2F24_RIP (ae_f24x2 d, ae_valign u /*inout*/,
ae_f24x2 * a /*inout*/);
void AE_SA32X2F24_RIC (ae_f24x2 d, ae_valign u /*inout*/,
ae_f24x2 * a /*inout*/);
AE_SA24.L.IP (.IC, .RIP, .RIC) d, u, a [ fusion_slot0 ]
Required alignment: 1 byte
Store the 24 LSBs of AE_DR register d to 24 bits in memory with effective address (a).
Instructions AE_SA24.L.IP (.IC) are used if the direction of the store operations is positive.
Instructions AE_SA24.L.RIP (.RIC) are used if the direction of the store operations is negative.

C syntax:
void AE_SA24_L_IP (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24_L_IC (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24_L_RIP (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24_L_RIC (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
AE_SA24X2.IP (.IC, .RIP, .RIC) d, u, a [ fusion_slot0 ]
Required alignment: 1 byte
Store the 24 LSBs of the two 32-bit elements of AE_DR register d to 48 bits in memory with
effective address (a). Instructions AE_SA24X2.IP (.IC) are used if the direction of the store
operations is positive. Instructions AE_SA24X2.RIP (.RIC) are used if the direction of the
store operations is negative.
C syntax:
void AE_SA24X2_IP (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24X2_IC (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24X2_RIP (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
void AE_SA24X2_RIC (ae_int24x2 d, ae_valign u /*inout*/,
void * a /*inout*/);
AE_SALIGN64.I u, a, imm [ fusion_slot0 ]

Required alignment: 8 bytes

Stores a 64-bit value from AE_VALIGN register u to memory with effective address (a +
imm).

C syntax:
void AE_SALIGN64_I (ae_valign u, void *a, immediate imm);
AE_SA64POS.FP u, a [ Inst ]

Required alignment: varies depending on the data type in the AE_VALIGN register u.

Flushes the value in AE_VALIGN register u to memory with effective address (a). The
AE_VALIGN register u is updated with value zero. This operation is used when the direction
of the store operation is positive.

C syntax:
void AE_SA64POS_FP (ae_valign u /*inout*/, void *a);
void AE_SA64POS_FC (ae_valign u /*inout*/, void *a);

AE_SA64NEG.FP u, a [ fusion_slot0 ]
Required alignment: varies depending on the data type in the AE_VALIGN register u.
Flushes the value in AE_VALIGN register u to memory with effective address (a). The
AE_VALIGN register u is updated with value zero. This operation is used when the direction
of the store operation is negative.
C syntax:
void AE_SA64NEG_FP (ae_valign u /*inout*/, void *a);
void AE_SA64NEG_FC (ae_valign u /*inout*/, void *a);
AE_ZALIGN64 u [ Inst ]
Initialize the AE_VALIGN register u with zero.
C syntax:
ae_valign AE_ZALIGN64 ();
AE_S16X2M.I (.IU) d, a, i32 [ fusion_slot0, Inst ]
AE_S16X2M.X (.XU) d, a, ax [ fusion_slot0, Inst ]
AE_S16X2M.XC d, a, ax [ fusion_slot0]
Required alignment: 4 bytes
Store the middle 16-bit element of each 32-bit half of AE_DR register d into 32 bits in
memory. See Table 2-3 for the meanings of the address mode suffixes.
Note: C intrinsics AE_SP16X2F_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S16X2M.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
void AE_S16X2M_I (ae_int32x2 d, ae_p16x2s *a, immediate i32);
void AE_S16X2M_IU (ae_int32x2 d, ae_p16x2s *a /*inout*/,
immediate i32);
void AE_S16X2M_X (ae_int32x2 d, ae_p16x2s *a, int ax);
void AE_S16X2M_XU (ae_int32x2 d, ae_p16x2s *a /*inout*/, int ax);
void AE_S16X2M_XC (ae_int32x2 d, ae_p16x2s *a /*inout*/, int ax);
void AE_SP16X2F_I (ae_p24x2s d, ae_p16x2s *a, immediate i32);
void AE_SP16X2F_IU (ae_p24x2s d, ae_p16x2s *a /*inout*/,
immediate i32);
void AE_SP16X2F_X (ae_p24x2s d, ae_p16x2s *a, int ax);
void AE_SP16X2F_XU (ae_p24x2s d, ae_p16x2s *a /*inout*/,
int ax);
void AE_SP16X2F_C (ae_p24x2s d, ae_p16x2s *a /*inout*/,
unsigned ax);

AE_S32M.I d, a, i32 [ fusion_slot0, Inst ]
AE_S32M.IU d, a, i32 [ fusion_slot0, Inst ]
AE_S32M.X (.XU) d, a, ax [ fusion_slot0, Inst ]
AE_S32M.XC d, a, ax [ fusion_slot0 ]
Required alignment: 4 bytes
Store the middle 32-bit element of AE_DR register d into 32 bits in memory. See Table 2-3
for the meanings of the address mode suffixes.
Note: C intrinsics AE_SQ32F_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S32M.I (.IU, .X, .XU, .XC),
respectively.
C syntax:
void AE_S32M_I (ae_int64 d, ae_q32s *a, immediate i32);
void AE_S32M_IU (ae_int64 d, ae_q32s *a /*inout*/, immediate i32);
void AE_S32M_X (ae_int64 d, ae_q32s *a, int ax);
void AE_S32M_XU (ae_int64 d, ae_q32s *a /*inout*/, int ax);
void AE_S32M_XC (ae_int64 d, ae_q32s *a /*inout*/, int ax);
void AE_SQ32F_I (ae_q56s d, ae_q32s *a, immediate i32);
void AE_SQ32F_IU (ae_q56s d, ae_q32s *a /*inout*/, immediate i32);
void AE_SQ32F_X (ae_q56s d, ae_q32s *a, int ax);
void AE_SQ32F_XU (ae_q56s d, ae_q32s *a /*inout*/, int ax);
void AE_SQ32F_C (ae_q56s d, ae_q32s *a /*inout*/, int ax);
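The legacy AE_S32M store can be read as the inverse of the AE_L32M load described earlier, which places the loaded value above 16 zero bits. The portable C model below is a sketch under that reading: it assumes that the "middle 32-bit element" means bits 47..16 of the 64-bit register, and `model_l32m` and `model_s32m` are hypothetical names, not Fusion intrinsics.

```c
#include <stdint.h>

/* Model of the AE_L32M widening: 16 zero bits at the low end,
 * sign-extended to 64 bits. */
static int64_t model_l32m(int32_t mem)
{
    return (int64_t)mem * 65536;
}

/* Model of the AE_S32M narrowing.
 * Assumption: the middle element is bits 47..16 of the register. */
static int32_t model_s32m(int64_t d)
{
    uint32_t bits = (uint32_t)(((uint64_t)d >> 16) & 0xFFFFFFFFu);
    /* Convert the bit pattern to a signed value without relying on
     * implementation-defined narrowing conversions. */
    return (bits & 0x80000000u)
        ? (int32_t)((int64_t)bits - 4294967296LL)
        : (int32_t)bits;
}
```

Under this assumption a value loaded with the first function and stored with the second is returned unchanged.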
AE_S16M.L.I d, a, i16 [ fusion_slot0, Inst ]
AE_S16M.L.IU d, a, i16 [ fusion_slot0, Inst]
AE_S16M.L.X d, a, ax [ fusion_slot0, Inst ]
AE_S16M.L.XU (.XC) d, a, ax [ fusion_slot0 ]

Required alignment: 2 bytes

Store the middle 16-bit element of the low-order 32-bit element of AE_DR register d into 16
bits in memory. See Table 2-3 for the meanings of the address mode suffixes.
Note: C intrinsics AE_SP16F_L_I (_IU, _X, _XU, _C) are provided to ensure HiFi 2 EP code
portability. They are implemented through operations AE_S16M.L.I (.IU, .X, .XU, .XC),
respectively.

C syntax:
void AE_S16M_L_I (ae_int32x2 d, ae_p16s *a, immediate i16);
void AE_S16M_L_IU (ae_int32x2 d, ae_p16s *a /*inout*/,
immediate i16);
void AE_S16M_L_X (ae_int32x2 d, ae_p16s *a, int ax);
void AE_S16M_L_XU (ae_int32x2 d, ae_p16s *a /*inout*/, int ax);
void AE_S16M_L_XC (ae_int32x2 d, ae_p16s *a /*inout*/, int ax);
void AE_SP16F_L_I (ae_p24x2s d, ae_p16s *a, immediate i16);
void AE_SP16F_L_IU (ae_p24x2s d, ae_p16s *a /*inout*/,
immediate i16);
void AE_SP16F_L_X (ae_p24x2s d, ae_p16s *a, int ax);
void AE_SP16F_L_XU (ae_p24x2s d, ae_p16s *a /*inout*/, int ax);
void AE_SP16F_L_C (ae_p24x2s d, ae_p16s *a /*inout*/, int ax);

2.5 Core Updating Stores


AE_S32IP d, a, i32 [ fusion_slot0, Inst ]
AE_S16IP d, a, i16 [ fusion_slot0, Inst ]
AE_S8IP d, a, i8 [ fusion_slot0,Inst]
Required alignment: 4, 2, and 1 bytes, respectively
Core updating stores that are equivalent to S32I, S16I, and S8I with an immediate of zero,
followed by an increment of the address register by 4, 2, or 1, respectively. These
instructions are automatically inferred by the compiler.
C syntax:
void AE_S32IP (int32 d, int32 * a, immediate i32);
void AE_S16IP (int32 d, int16 * a, immediate i16);
void AE_S8IP (int32 d, int8 * a, immediate i8);
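The store-then-increment behavior of AE_S32IP can be pictured with the portable C model below. It is only a sketch: `model_s32ip` is a hypothetical name, the address register is modeled as a pointer passed by reference, and the increment is fixed at one 32-bit element (4 bytes) as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a post-incrementing 32-bit store: write d at the current
 * address, then advance the address by one element (4 bytes). */
static void model_s32ip(int32_t d, int32_t **a)
{
    assert(a != NULL && *a != NULL);
    **a = d;   /* store with an immediate offset of zero */
    *a += 1;   /* address register += 4 bytes */
}
```

This is the pattern a compiler sees in a loop body such as `*p++ = x;`, which is why these instructions can be inferred automatically.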

2.6 Multiply and Accumulate Operations


The Fusion DSP ISA supports a rich collection of single and dual multiply/accumulate
operations with different input and output precision, scaling, rounding and saturation modes.
Fusion DSP supports two 24x24-bit, 32x16-bit, or 16x16-bit multiplies per cycle, or one
32x32-bit multiply per cycle. With the 16-bit Quad MAC option, Fusion DSP also supports four
16x16-bit multiplies per cycle.

Fusion DSP MAC operations are named using the following convention:

AE_MUL<accum_type>[F][DP]{C,CR,CI}<size>{R,RA}[S][U].specifier

The operations use an L or H specifier suffix to select input operands from the two 32-bit
AE_DR elements, or a 0, 1, 2, or 3 suffix for 16-bit data.

Operations that perform two multiplies come in two forms: dual MACs and SIMD MACs. Dual MACs
take the results of two multiplies and add or subtract them, as in the example below.

acc = acc – d0.L*d1.L + d0.H*d1.H.

SIMD MACs do not combine the results of different multiplies. They instead perform the
same multiply operation on different portions of the data, as in the example below.

acc.h = acc.h – d0.h*d1.h
acc.l = acc.l – d0.l*d1.l
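The difference between the dual and SIMD forms can be made concrete with a portable C model of the two formulas above. This is a sketch only: `model_pair32`, `model_acc2`, `model_dual_mac`, and `model_simd_muls` are hypothetical names, integer arithmetic is used, and the result widths, rounding, and saturation of the real operations are not modeled.

```c
#include <stdint.h>

/* Hypothetical two-element operand: H and L halves of an AE_DR register. */
typedef struct { int32_t h, l; } model_pair32;

/* Hypothetical two-lane accumulator for the SIMD form. */
typedef struct { int64_t h, l; } model_acc2;

/* Dual MAC: both products feed a single accumulator.
 * acc = acc - d0.L*d1.L + d0.H*d1.H */
static int64_t model_dual_mac(int64_t acc, model_pair32 d0, model_pair32 d1)
{
    return acc - (int64_t)d0.l * d1.l + (int64_t)d0.h * d1.h;
}

/* SIMD multiply/subtract: the same operation in each lane; lanes never mix.
 * acc.h = acc.h - d0.h*d1.h;  acc.l = acc.l - d0.l*d1.l */
static model_acc2 model_simd_muls(model_acc2 acc, model_pair32 d0, model_pair32 d1)
{
    acc.h -= (int64_t)d0.h * d1.h;
    acc.l -= (int64_t)d0.l * d1.l;
    return acc;
}
```

With d0 = (H=3, L=4), d1 = (H=5, L=6), and every accumulator starting at 100, the dual form yields the single value 100 - 24 + 15 = 91, while the SIMD form yields the pair (H=85, L=76).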

The dual MACs use a D in the name. Most of the SIMD MACs pack their results into 32 or
16 bits and hence use a P in their name. By adding or subtracting two multiply results
together, the dual MAC instructions are able to maintain high precision for their accumulation
without needing to write multiple output registers.

Quad MACs compute the sum of four products and have a Q in the name.

With the AVS option, 24x24-bit and 32x16-bit complex multiply operations are dual-MAC
operations that compute either the real half or the imaginary half of a complex multiplication
and pack their two results down to 32-bits. They are designated with a CR or a CI.

With the 16-bit Quad MAC option, 16x16-bit complex multiplies are quad-MAC operations
that produce either a 32x2-bit or 16x2-bit result. They are designated with a C.

Among the single-multiply and SIMD multiply operations, each family of multiply/accumulate
operations has a multiply-only variant, a multiply/add variant, and a multiply/subtract variant,
denoted by having accum_type set to nothing, A or S respectively. With the MUL variant, the
accumulator contents are overwritten with the result of the multiplication. With the MULA
variant, the result of the multiplication is added to the accumulator contents and written back
to the accumulator. With the MULS variant, the result of the multiplication is subtracted from
the accumulator contents and written back to the accumulator.

Dual MAC operations with an accum_type starting with Z do their accumulation against
zero; in other words, the initial contents of the accumulator are discarded. Those without any
Z accumulate against the initial contents of the accumulator. Following the optional Z there
are two letters that indicate addition or subtraction, one for each of the two multiplication
results.

Quad MAC operations with an accum_type starting with Z do their accumulation against
zero; in other words, the initial contents of the accumulator are discarded. Those without any
Z accumulate against the initial contents of the accumulator. Following the optional Z there
are four As, one for each of the four multiplication results.
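As an illustration of the dual-MAC accum_type letters, the following portable C sketch applies a string such as "ZAA", "AS" or "ZSA" to two already-computed products. The function name is hypothetical, and the mapping of the first letter to the first product is an assumption made for this example.

```c
#include <stdint.h>

/* Apply a dual-MAC accum_type ("ZAA", "AS", "ZSA", ...) to two products.
 * Assumption: the first letter applies to prod0, the second to prod1.
 * Returns 0 on success, -1 if the string is not a valid dual accum_type;
 * *acc is left unchanged on failure. */
int dual_accum_model(const char *accum_type, int64_t *acc,
                     int64_t prod0, int64_t prod1)
{
    const int64_t prods[2] = { prod0, prod1 };
    int64_t sum = *acc;
    int i;

    if (accum_type[0] == 'Z') {   /* Z: accumulate against zero */
        sum = 0;
        accum_type++;
    }
    for (i = 0; i < 2; i++) {
        if (accum_type[i] == 'A')
            sum += prods[i];      /* A: add this product */
        else if (accum_type[i] == 'S')
            sum -= prods[i];      /* S: subtract this product */
        else
            return -1;
    }
    if (accum_type[2] != '\0')
        return -1;
    *acc = sum;
    return 0;
}
```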

Fusion DSP supports both integer and fractional multiplication. Fractional multiply
instructions have an F immediately following accum_type.

The size of a multiply instruction is 16, 24, 32 or 32X16 for 16-bit, 24-bit, 32-bit and 32 times
16-bit respectively. For SIMD multipliers, a suffix X2 or X4 is added to the size to signify the
number of SIMD elements.

Integer SIMD multiply instructions discard the upper bits of their results, just like standard
C/C++ multiplies. Fractional SIMD multiply instructions round away the lower bits using either
symmetric or asymmetric rounding, signified by R or RA in the name respectively. With
asymmetric rounding, halves are rounded upward, i.e., 0.5 times the least significant result
bit is rounded up to 1.0 and -0.5 times the least significant result bit is rounded up to 0. With
symmetric rounding, halves are rounded away from zero, i.e., -0.5 times the least significant
result bit is rounded down to -1.0. In the instruction descriptions, symmetric rounding is
referred to as round while asymmetric rounding is referred to as round+∞.
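The two rounding modes can be modeled in portable C as follows. This is a minimal sketch, not the hardware implementation: the function names are hypothetical, v is a wide intermediate value, n (at least 1) is the number of low bits being rounded away, and values near the limits of int64_t are not handled.

```c
#include <stdint.h>

/* floor(v / 2^n) without relying on arithmetic right shift of negatives. */
int64_t floor_shift(int64_t v, unsigned n)
{
    if (v >= 0)
        return v >> n;
    return -(((-v) + ((int64_t)1 << n) - 1) >> n);
}

/* Asymmetric rounding (round+inf): halves go upward, floor(v/2^n + 1/2). */
int64_t round_asym(int64_t v, unsigned n)
{
    return floor_shift(v + ((int64_t)1 << (n - 1)), n);
}

/* Symmetric rounding: halves go away from zero. */
int64_t round_sym(int64_t v, unsigned n)
{
    const int64_t half = (int64_t)1 << (n - 1);
    if (v >= 0)
        return (v + half) >> n;
    return -(((-v) + half) >> n);
}
```

With n = 1, an input of -1 represents -0.5 result LSBs: round_asym yields 0 and round_sym yields -1, matching the description in the text.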

MAC operations without guard bits, 1.31x1.31 into 1.63, 1.31x1.15 into 1.31, and 1.15x1.15
into 1.15 or 1.31, saturate their results. All other MAC operations have guard bits and do not
saturate. Saturating multiplies have an S following the size or the rounding designation.
Some 16x16-bit multipliers are designed to be bit exact with the ITU-T/ETSI intrinsics and
therefore do multiple saturations in series. These instructions have SS in the name.

Unsigned multiplies have a U preceding the specifier.

All MAC operations appear in slot fusion_slot1 or fusion_slot_fir_1 when the AVS option has
been selected.

HiFi 2/EP had a different naming scheme for multipliers. Compatibility intrinsics are provided
for all the old HiFi 2/EP intrinsics and are listed in the following sections.

2.6.1 24x24-bit Multiplication Operations


Fusion DSP supports dual 24x24-bit multiplication operations. With the AVS option, for HiFi
3 compatibility, quad 24x24-bit multiplication operations are emulated using two-instruction
sequences. SIMD variants compute two or four products that are individually accumulated in
32-bit precision. Non-SIMD variants compute the sum or difference of two 48-bit products
added to or subtracted from a 64-bit accumulator. There is no support for single 24x24-bit
multiplication; users should instead use 32x32-bit instructions. To ensure compatibility with
HiFi 2 and consistency with the dual multiply instructions, 24-bit single multiplication intrinsics
are provided. However, these intrinsics are implemented using the higher-precision 32x32-bit
multipliers.

AE_MULZAAFD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULZSSFD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULZASFD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULZSAFD24.HH.LL d, d0, d1 [fusion_slot1]
AE_MULAAFD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULSSFD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULASFD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULSAFD24.HH.LL d, d0, d1 [fusion_slot1]
Dual 1.23x1.23-bit into 17.47-bit signed MAC:
d ← [d_{17.47}] ± d0.H_{1.23} × d1.H_{1.23} ± d0.L_{1.23} × d1.L_{1.23}
Note: C intrinsics with ae_p24x2s input operand types and ae_q56s accumulator operand
types are provided to ensure HiFi 2 code portability and are implemented through the
operations above.
C syntax:
ae_f64 AE_MULZAAFD24_HH_LL (ae_f24x2 d0, ae_f24x2 d1);
void AE_MULAAFD24_HH_LL (ae_f64 d /*inout*/,
ae_f24x2 d0, ae_f24x2 d1);
ae_q56s AE_MULZAAFP24S_HH_LL (ae_p24x2s d0, ae_p24x2s d1);
void AE_MULAAFP24S_HH_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
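The fractional alignment of the dual 1.23x1.23-bit MAC can be illustrated with a portable C model. This is a sketch under stated assumptions, not the intrinsic itself: the function names are hypothetical and each Q1.23 operand is assumed to be a sign-extended 24-bit integer held in an int32_t.

```c
#include <stdint.h>

/* A 1.23 x 1.23 product has 46 fractional bits; doubling it aligns the
 * result with the 47 fractional bits of the 17.47 accumulator format. */
int64_t frac_prod_1_23(int32_t a, int32_t b)
{
    return 2 * ((int64_t)a * b);
}

/* Model of the "ZAA" form: d = d0.H*d1.H + d0.L*d1.L, no saturation.
 * The 17 integer bits act as guard bits, so even (-1.0 * -1.0) twice
 * (a result of 2.0) is representable. */
int64_t mulzaafd24_model(int32_t d0h, int32_t d0l, int32_t d1h, int32_t d1l)
{
    return frac_prod_1_23(d0h, d1h) + frac_prod_1_23(d0l, d1l);
}
```

For example, with all four operands equal to 0.5 (integer value 1 << 22) the result is 0.5 in 17.47 format, which is the integer 1 << 46.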
AE_MULFD24X2.FIR.H q0, q1, d0, d1, c [ fusion_slot1, fusion_slot_fir_1 ] AVS ONLY
AE_MULAFD24X2.FIR.H q0, q1, d0, d1, c [ fusion_slot1, fusion_slot_fir_1 ] AVS ONLY
Quad 1.23x1.23-bit multiplications into two 17.47-bit signed MACs with operands selected to
accelerate FIR computations. These are emulated using two-instruction sequences. Note
that the FIR.L instructions available in HiFi 3 and HiFi 4 are not available with Fusion.
For the .H version:
q0 ← [q0_{17.47}] + d0.H_{1.23} × c.H_{1.23} + d0.L_{1.23} × c.L_{1.23}
q1 ← [q1_{17.47}] + d0.L_{1.23} × c.H_{1.23} + d1.H_{1.23} × c.L_{1.23}
C syntax:
void AE_MULFD24X2_FIR_H (ae_f64 q0 /*out*/, ae_f64 q1 /*out*/,
ae_f24x2 d0, ae_f24x2 d1, ae_f24x2 c);
void AE_MULAFD24X2_FIR_H (ae_f64 q0 /*inout*/,
ae_f64 q1 /*inout*/,
ae_f24x2 d0, ae_f24x2 d1, ae_f24x2 c);
AE_MULZAAD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULZSSD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULZASD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULZSAD24.HH.LL d, d0, d1 [fusion_slot1]
AE_MULAAD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULSSD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULASD24.HH.LL (.HL.LH) d, d0, d1 [fusion_slot1]
AE_MULSAD24.HH.LL d, d0, d1 [fusion_slot1]
Dual 24x24-bit into 64-bit signed integer MAC with no saturation:
d ← [d] ± d0.H × d1.H ± d0.L × d1.L
Note: C intrinsics with ae_p24x2s input operand types and ae_q56s accumulator operand
types are provided to ensure HiFi 2 code portability and are implemented through the
operations above.
C syntax:
ae_int64 AE_MULZAAD24_HH_LL (ae_int24x2 p0, ae_int24x2 p1);
void AE_MULAAD24_HH_LL (ae_int64 d /*inout*/,
ae_int24x2 p0, ae_int24x2 p1);
ae_q56s AE_MULZAAP24S_HH_LL (ae_p24x2s p0, ae_p24x2s p1);
void AE_MULAAP24S_HH_LL (ae_q56s q /*inout*/,
ae_p24x2s p0, ae_p24x2s p1);
AE_MULFC24RA d, d0, d1 [fusion_slot1] AVS ONLY
AE_MULAFC24RA d, d0, d1 [fusion_slot1] AVS ONLY
Complex quad 1.23x1.23-bit into 9.23-bit signed MAC with asymmetric rounding of the
product. These are emulated using two-instruction sequences: one containing CR in the
name and computing the real part of the product and the other containing CI and computing
the imaginary part.
d.H ← [d.H_{9.23} +] round+∞_{9.23}(d0.H_{1.23} × d1.H_{1.23} - d0.L_{1.23} × d1.L_{1.23}) (CR instruction)
d.L ← [d.L_{9.23} +] round+∞_{9.23}(d0.H_{1.23} × d1.L_{1.23} + d0.L_{1.23} × d1.H_{1.23}) (CI instruction)
C syntax:
ae_f32x2 AE_MULFC24RA (ae_f24x2 d0, ae_f24x2 d1);
void AE_MULAFC24RA (ae_f32x2 d /*inout*/,
ae_f24x2 d0, ae_f24x2 d1);
AE_MULC24 d, d0, d1 [fusion_slot1] AVS ONLY
AE_MULAC24 d, d0, d1 [fusion_slot1] AVS ONLY

Complex quad 24x24-bit into 32-bit signed integer MAC with no saturation. These are
emulated using two-instruction sequences: one containing CR in the name and computing
the real part of the product and the other containing CI and computing the imaginary part.
d.H ← [d.H +] d0.H × d1.H - d0.L × d1.L (CR instruction)
d.L ← [d.L +] d0.H × d1.L + d0.L × d1.H (CI instruction)
C syntax:
ae_int32x2 AE_MULC24 (ae_int24x2 d0, ae_int24x2 d1);
void AE_MULAC24 (ae_int32x2 d /*inout*/,
ae_int24x2 d0, ae_int24x2 d1);
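The CR/CI split of the integer complex multiply can be sketched in portable C. This is a reference model only: the type and function names are hypothetical, the H element is taken as the real part and the L element as the imaginary part (as the formulas suggest), and the inputs are assumed to be sign-extended 24-bit values.

```c
#include <stdint.h>

/* Hypothetical view of an AE_DR register as a complex number. */
typedef struct { int32_t h, l; } cplx32_t;   /* h = real, l = imaginary */

/* Keep the low 32 bits, like a C integer multiply (no saturation).
 * Assumes the usual two's-complement conversion. */
int32_t wrap32_model(int64_t v)
{
    return (int32_t)(uint32_t)(uint64_t)v;
}

/* "CR" half: real part of d0 * d1. */
int32_t mulc_cr_model(cplx32_t d0, cplx32_t d1)
{
    return wrap32_model((int64_t)d0.h * d1.h - (int64_t)d0.l * d1.l);
}

/* "CI" half: imaginary part of d0 * d1. */
int32_t mulc_ci_model(cplx32_t d0, cplx32_t d1)
{
    return wrap32_model((int64_t)d0.h * d1.l + (int64_t)d0.l * d1.h);
}

/* The full complex multiply is the pair of the two halves. */
cplx32_t mulc_model(cplx32_t d0, cplx32_t d1)
{
    cplx32_t r;
    r.h = mulc_cr_model(d0, d1);
    r.l = mulc_ci_model(d0, d1);
    return r;
}
```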
AE_MULFP24X2R d, d0, d1 [fusion_slot1]
AE_MULAFP24X2R d, d0, d1 [fusion_slot1]
AE_MULSFP24X2R d, d0, d1 [fusion_slot1]

2-way SIMD 1.23x1.23-bit into 9.23-bit signed MAC with symmetric (away from zero)
rounding of the product.
d.H ← [d.H_{9.23} ±] round_{9.23}(d0.H_{1.23} × d1.H_{1.23})
d.L ← [d.L_{9.23} ±] round_{9.23}(d0.L_{1.23} × d1.L_{1.23})
C syntax:
ae_f32x2 AE_MULFP24X2R (ae_f24x2 d0, ae_f24x2 d1);
void AE_MULAFP24X2R (ae_f32x2 d /*inout*/,
ae_f24x2 d0, ae_f24x2 d1);
void AE_MULSFP24X2R (ae_f32x2 d /*inout*/,
ae_f24x2 d0, ae_f24x2 d1);
AE_MULFP24X2RA d, d0, d1 [fusion_slot1]
AE_MULAFP24X2RA d, d0, d1 [fusion_slot1]
AE_MULSFP24X2RA d, d0, d1 [fusion_slot1]

2-way SIMD 1.23x1.23-bit into 9.23-bit signed MAC with asymmetric rounding of the product.
d.H ← [d.H_{9.23} ±] round+∞_{9.23}(d0.H_{1.23} × d1.H_{1.23})
d.L ← [d.L_{9.23} ±] round+∞_{9.23}(d0.L_{1.23} × d1.L_{1.23})
C syntax:
ae_f32x2 AE_MULFP24X2RA (ae_f24x2 d0, ae_f24x2 d1);
void AE_MULAFP24X2RA (ae_f32x2 d /*inout*/,
ae_f24x2 d0, ae_f24x2 d1);
void AE_MULSFP24X2RA (ae_f32x2 d /*inout*/,
ae_f24x2 d0, ae_f24x2 d1);
AE_MULP24X2 d, d0, d1 [fusion_slot1]
AE_MULAP24X2 d, d0, d1 [fusion_slot1]
AE_MULSP24X2 d, d0, d1 [fusion_slot1]
2-way SIMD 24x24-bit into 32-bit signed integer MAC with no saturation:
d.H ← [d.H ±] d0.H × d1.H
d.L ← [d.L ±] d0.L × d1.L
C syntax:
ae_int32x2 AE_MULP24X2 (ae_int24x2 d0, ae_int24x2 d1);
void AE_MULAP24X2 (ae_int32x2 d /*inout*/,
ae_int24x2 d0, ae_int24x2 d1);
void AE_MULSP24X2 (ae_int32x2 d /*inout*/,
ae_int24x2 d0, ae_int24x2 d1);

2.6.2 32x32-bit Multiplication Operations


Fusion DSP supports two 24x24-bit or 32x16-bit multiplications per cycle but only one
32x32-bit multiplication. The input operands for 32x32-bit multiplication are elements of
AE_DR registers. Each AE_DR register holds two 32-bit elements; for each AE_DR register
operand to a multiplication, one of the two elements must be selected as the input to the
multiplication through an H or an L suffix. The result of each multiply/accumulate operation
goes into an AE_DR register.

AE_MULF32S.LL (.LH .HH) d, d0, d1 [fusion_slot1]
AE_MULAF32S.LL (.LH .HH) d, d0, d1 [fusion_slot1]
AE_MULSF32S.LL (.LH .HH) d, d0, d1 [fusion_slot1]
Single 1.31x1.31-bit into 1.63-bit signed MAC with 64-bit saturation:
d ← saturate_{1.63}([d_{1.63} ±] d0.L_{1.31} × d1.L_{1.31})
Note: C intrinsics AE_MUL[AS]F32S_HL are provided and implemented through the .LH
operations above. C intrinsics with ae_f24x2 input operands are implemented through the
above operations. C intrinsics with ae_p24x2s input operand types and ae_q56s accumulator
operand types are provided to ensure HiFi 2 code portability and are implemented through
the operations above. The HiFi 2 intrinsics that perform 56-bit accumulator saturation
(AE_MUL[AS]FS56*) instantiate an additional AE_SATQ56S operation.
C syntax:
ae_f64 AE_MULF32S_LL (ae_f32x2 d0, ae_f32x2 d1);
void AE_MULAF32S_LL (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
void AE_MULSF32S_LL (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
ae_f64 AE_MULF24S_LL (ae_f24x2 d0, ae_f24x2 d1);
void AE_MULAF24S_LL (ae_f64 d /*inout*/,
ae_f24x2 d0, ae_f24x2 d1);
void AE_MULSF24S_LL (ae_f64 d /*inout*/,
ae_f24x2 d0, ae_f24x2 d1);
ae_q56s AE_MULFP24S_LL (ae_p24x2s d0, ae_p24x2s d1);
void AE_MULAFP24S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
void AE_MULSFP24S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
void AE_MULAFS56P24S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
void AE_MULSFS56P24S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
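Because the 1.63 result format has no guard bits, the 1.31x1.31-bit fractional MAC has to saturate. The portable C model below is a sketch of that behavior, not the hardware implementation: the function names are hypothetical, and it follows the single outer saturate in the formula for the multiply and multiply/add forms only.

```c
#include <stdint.h>

/* 64-bit addition that clamps instead of wrapping. */
int64_t sat_add64(int64_t a, int64_t b)
{
    if (b > 0 && a > INT64_MAX - b) return INT64_MAX;
    if (b < 0 && a < INT64_MIN - b) return INT64_MIN;
    return a + b;
}

/* Multiply only: 1.31 x 1.31 into 1.63. The product is doubled to align
 * the 62 fractional bits of a*b with the 63 fractional bits of the result.
 * Only -1.0 * -1.0 (= +1.0) is not representable and saturates. */
int64_t mulf32s_model(int32_t a, int32_t b)
{
    if (a == INT32_MIN && b == INT32_MIN)
        return INT64_MAX;
    return 2 * ((int64_t)a * b);
}

/* Multiply/add: saturate(acc + a*b) with one saturation at the end. */
int64_t mulaf32s_model(int64_t acc, int32_t a, int32_t b)
{
    if (a == INT32_MIN && b == INT32_MIN) {
        /* The exact product is 2^63; add it without overflowing. */
        return (acc >= 0) ? INT64_MAX : (acc + INT64_MAX) + 1;
    }
    return sat_add64(acc, 2 * ((int64_t)a * b));
}
```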

AE_MUL32.LL (.LH .HH) d, d0, d1 [fusion_slot1]
AE_MULA32.LL (.LH .HH) d, d0, d1 [fusion_slot1]
AE_MULS32.LL (.LH .HH) d, d0, d1 [fusion_slot1]
Single 32x32-bit into 64-bit signed integer MAC with no saturation:
d ← [d ±] d0.L × d1.L
Note: C intrinsics AE_MUL[AS]32S_HL are provided and implemented through the .LH
operations above. C intrinsics with ae_int24x2 input operands are implemented through the
above operations. C intrinsics with ae_p24x2s input operand types and ae_q56s accumulator
operand types are provided to ensure HiFi 2 code portability and are implemented through
the operations above. The HiFi 2 intrinsics that perform 56-bit accumulator saturation
instantiate an additional AE_SATQ56S operation.
C syntax:
ae_int64 AE_MUL32_LL (ae_int32x2 d0, ae_int32x2 d1);
void AE_MULA32_LL (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int32x2 d1);
void AE_MULS32_LL (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int32x2 d1);
ae_int64 AE_MUL24_LL (ae_int24x2 d0, ae_int24x2 d1);
void AE_MULA24_LL (ae_int64 d /*inout*/,
ae_int24x2 d0, ae_int24x2 d1);
void AE_MULS24_LL (ae_int64 d /*inout*/,
ae_int24x2 d0, ae_int24x2 d1);
ae_q56s AE_MULP24S_LL (ae_p24x2s d0, ae_p24x2s d1);
void AE_MULAP24S_LL (ae_q56s d /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
void AE_MULSP24S_LL (ae_q56s d /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
void AE_MULAS56P24S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
void AE_MULSS56P24S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
AE_MULF32R.LL (.LH .HH) d, d0, d1 [fusion_slot1]
AE_MULAF32R.LL (.LH .HH) d, d0, d1 [fusion_slot1]
AE_MULSF32R.LL (.LH .HH) d, d0, d1 [fusion_slot1]
Single 1.31x1.31-bit into 17.47-bit signed MAC with symmetric (away from zero) rounding of the
product:
d ← [d_{17.47} ±] round_{17.47}(d0.L_{1.31} × d1.L_{1.31})
Note: C intrinsics AE_MUL[AS]F32R_HL and AE_MULF32R_HL are provided and
implemented through the .LH operations above.
C syntax:
ae_f64 AE_MULF32R_LL (ae_f32x2 d0, ae_f32x2 d1);
void AE_MULAF32R_LL (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
void AE_MULSF32R_LL (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
AE_MUL32U.LL d, d0, d1 [ fusion_slot1 ]
AE_MULA32U.LL d, d0, d1 [ fusion_slot1 ]
AE_MULS32U.LL d, d0, d1 [ fusion_slot1 ]
Single 32x32-bit into 64-bit unsigned integer MAC with no saturation:
d ← [d ±] d0.L_{u} × d1.L_{u}
C syntax:
ae_int64 AE_MUL32U_LL (ae_int32x2 d0, ae_int32x2 d1);
void AE_MULA32U_LL (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int32x2 d1);
void AE_MULS32U_LL (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int32x2 d1);
AE_MULFP32X2RS d, d0, d1 [fusion_slot1] AVS ONLY
AE_MULAFP32X2RS d, d0, d1 [fusion_slot1] AVS ONLY
AE_MULSFP32X2RS d, d0, d1 [fusion_slot1] AVS ONLY
2-way SIMD 1.31x1.31-bit into 1.31-bit signed MAC with symmetric (away from zero)
rounding of the product and 32-bit saturation of the final result. These are emulated using
two-instruction sequences.
d.H ← saturate_{1.31}([d.H_{1.31} ±] round_{1.31}(d0.H_{1.31} × d1.H_{1.31}))
d.L ← saturate_{1.31}([d.L_{1.31} ±] round_{1.31}(d0.L_{1.31} × d1.L_{1.31}))

C syntax:
ae_f32x2 AE_MULFP32X2RS (ae_f32x2 d0, ae_f32x2 d1);
void AE_MULAFP32X2RS (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
void AE_MULSFP32X2RS (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
AE_MULFP32X2RAS d, d0, d1 [fusion_slot1] AVS ONLY
AE_MULAFP32X2RAS d, d0, d1 [fusion_slot1] AVS ONLY
AE_MULSFP32X2RAS d, d0, d1 [fusion_slot1] AVS ONLY
2-way SIMD 1.31x1.31-bit into 1.31-bit signed MAC with asymmetric rounding of the product
and 32-bit saturation of the final result. These are emulated using two-instruction sequences.
d.H ← saturate_{1.31}([d.H_{1.31} ±] round+∞_{1.31}(d0.H_{1.31} × d1.H_{1.31}))
d.L ← saturate_{1.31}([d.L_{1.31} ±] round+∞_{1.31}(d0.L_{1.31} × d1.L_{1.31}))
C syntax:
ae_f32x2 AE_MULFP32X2RAS (ae_f32x2 d0, ae_f32x2 d1);
void AE_MULAFP32X2RAS (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
void AE_MULSFP32X2RAS (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f32x2 d1);
AE_MULP32X2 d, d0, d1 [fusion_slot1]
AE_MULAP32X2 d, d0, d1 [fusion_slot1]
AE_MULSP32X2 d, d0, d1 [fusion_slot1]
2-way SIMD 32x32-bit into 32-bit signed integer MAC with no saturation. These are emulated
using two-instruction sequences.
d.H ← [d.H ±] d0.H × d1.H
d.L ← [d.L ±] d0.L × d1.L
C syntax:
ae_int32x2 AE_MULP32X2 (ae_int32x2 d0, ae_int32x2 d1);
void AE_MULAP32X2 (ae_int32x2 d /*inout*/,
ae_int32x2 d0, ae_int32x2 d1);
void AE_MULSP32X2 (ae_int32x2 d /*inout*/,
ae_int32x2 d0, ae_int32x2 d1);

2.6.3 32x16-bit Multiplication Operations


The input operands for 32x16-bit multiplication operations are elements of AE_DR registers.
The first multiplicand holds two 32-bit elements. The second multiplicand holds four 16-bit
elements. For operations that allow operand selection within a register, each 32-bit operand
is specified through an H or L suffix and each 16-bit operand is selected through a 3, 2, 1 or
0 suffix.
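Operand selection for 32x16-bit operations can be modeled in portable C as shown below. This is only a sketch: the function names are hypothetical, and the register layout (H in bits 63..32, L in bits 31..0, 16-bit elements 3 down to 0 from most to least significant) is an assumption made for the example.

```c
#include <stdint.h>

/* Select the H or L 32-bit element of a 64-bit AE_DR image.
 * Assumed layout: H = bits 63..32, L = bits 31..0. */
int32_t sel32(uint64_t reg, char half)
{
    uint32_t raw = (half == 'H') ? (uint32_t)(reg >> 32) : (uint32_t)reg;
    return (int32_t)raw;
}

/* Select 16-bit element 3, 2, 1 or 0 (assumed: element 0 = bits 15..0). */
int16_t sel16(uint64_t reg, unsigned idx)
{
    return (int16_t)(uint16_t)(reg >> (16 * (idx & 3u)));
}

/* Model of a single fractional 1.31 x 1.15 MAC into 17.47, for example
 * the .L0 form: acc + d0.L * d1.0. The product has 46 fractional bits,
 * so it is doubled to reach 47. No saturation is applied. */
int64_t mulaf32x16_model(int64_t acc, uint64_t d0, char half,
                         uint64_t d1, unsigned idx)
{
    return acc + 2 * ((int64_t)sel32(d0, half) * sel16(d1, idx));
}
```

For instance, with d0.L = 0.25 and d1.0 = 0.5, the .L0 multiply yields 0.125, which is the integer 1 << 44 in 17.47 format.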

AE_MULF32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
AE_MULAF32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
AE_MULSF32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
Single 1.31x1.15-bit into 17.47-bit signed MAC without saturation:
d ← [d_{17.47} ±] d0.L_{1.31} × d1.0_{1.15}
C syntax:
ae_f64 AE_MULF32X16_L0 (ae_f32x2 d0, ae_f16x4 d1);
void AE_MULAF32X16_L0 (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
void AE_MULSF32X16_L0 (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
AE_MULZAAFD32X16.H1.L0 (.H3.L2 .H2.L3 .H0.L1) d, d0, d1 [fusion_slot1]
AE_MULZASFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULZSAFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULZSSFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULAAFD32X16.H1.L0 (.H3.L2 .H2.L3 .H0.L1) d, d0, d1 [fusion_slot1]
AE_MULASFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULSAFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULSSFD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
Dual 1.31x1.15-bit into 17.47-bit signed MAC without saturation:
d ← [d_{17.47}] ± d0.H_{1.31} × d1.1_{1.15} ± d0.L_{1.31} × d1.0_{1.15}
The extra .H3.L2 and .H0.L1 specifiers are for computing half of a complex multiplication.
C syntax:
ae_f64 AE_MULZAAFD32X16_H1_L0 (ae_f32x2 d0, ae_f16x4 d1);
ae_f64 AE_MULZASFD32X16_H1_L0 (ae_f32x2 d0, ae_f16x4 d1);
ae_f64 AE_MULZSAFD32X16_H1_L0 (ae_f32x2 d0, ae_f16x4 d1);
ae_f64 AE_MULZSSFD32X16_H1_L0 (ae_f32x2 d0, ae_f16x4 d1);

void AE_MULAAFD32X16_H1_L0 (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
void AE_MULASFD32X16_H1_L0 (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
void AE_MULSAFD32X16_H1_L0 (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
void AE_MULSSFD32X16_H1_L0 (ae_f64 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
AE_MULFD32X16X2.FIR.HH (.HL) q0, q1, d0, d1, c [fusion_slot1, fusion_slot_fir_1] AVS ONLY
AE_MULAFD32X16X2.FIR.HH (.HL) q0, q1, d0, d1, c [fusion_slot1, fusion_slot_fir_1] AVS ONLY
Quad 1.31x1.15-bit multiplications into two 17.47-bit signed MACs with operands selected to
accelerate FIR computations. These are emulated using two-instruction sequences. Note
that the FIR.LL and FIR.LH variants available on HiFi 3 and HiFi 4 are not available on
Fusion.
For the .HH version:
q0 ← [q0_{17.47} +] d0.H_{1.31} × c.3_{1.15} + d0.L_{1.31} × c.2_{1.15}
q1 ← [q1_{17.47} +] d0.L_{1.31} × c.3_{1.15} + d1.H_{1.31} × c.2_{1.15}
For the .HL version:
q0 ← [q0_{17.47} +] d0.H_{1.31} × c.1_{1.15} + d0.L_{1.31} × c.0_{1.15}
q1 ← [q1_{17.47} +] d0.L_{1.31} × c.1_{1.15} + d1.H_{1.31} × c.0_{1.15}
C syntax:
void AE_MULFD32X16X2_FIR_HH (ae_f64 q0 /*out*/, ae_f64 q1 /*out*/,
ae_f32x2 d0, ae_f32x2 d1, ae_f16x4 c);
void AE_MULAFD32X16X2_FIR_HH (ae_f64 q0 /*inout*/,
ae_f64 q1 /*inout*/,
ae_f32x2 d0, ae_f32x2 d1, ae_f16x4 c);
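The operand selection of the FIR operations amounts to computing two adjacent outputs of a two-tap filter from three consecutive samples. The portable C model below sketches the accumulating form under stated assumptions: the function name is hypothetical, x[0..2] stand for d0.H, d0.L and d1.H, and c_hi and c_lo stand for c.3 and c.2 in the .HH version (c.1 and c.0 in the .HL version).

```c
#include <stdint.h>

/* Two 17.47 accumulators, each updated with two 1.31 x 1.15 products.
 * Each product has 46 fractional bits and is doubled to reach 47.
 *   q0 += x[0]*c_hi + x[1]*c_lo
 *   q1 += x[1]*c_hi + x[2]*c_lo
 * Note that x[1] (d0.L) is shared by both outputs. */
void fir_pair_model(int64_t *q0, int64_t *q1,
                    const int32_t x[3], int16_t c_hi, int16_t c_lo)
{
    *q0 += 2 * ((int64_t)x[0] * c_hi) + 2 * ((int64_t)x[1] * c_lo);
    *q1 += 2 * ((int64_t)x[1] * c_hi) + 2 * ((int64_t)x[2] * c_lo);
}
```

Clearing both accumulators before the call gives the behavior of the non-accumulating form.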
AE_MUL32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
AE_MULA32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
AE_MULS32X16.L0 (.L1 .L2 .L3 .H0 .H1 .H2 .H3) d, d0, d1 [fusion_slot1]
Single 32x16-bit into 64-bit signed MAC without saturation:
d ← [d] ± d0.L × d1.0
C syntax:
ae_int64 AE_MUL32X16_L0 (ae_int32x2 d0, ae_int16x4 d1);
void AE_MULA32X16_L0 (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);
void AE_MULS32X16_L0 (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);
AE_MULZAAD32X16.H1.L0 (.H3.L2 .H2.L3 .H0.L1) d, d0, d1 [fusion_slot1]
AE_MULZASD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULZSAD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULZSSD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULAAD32X16.H1.L0 (.H3.L2 .H2.L3 .H0.L1) d, d0, d1 [fusion_slot1]
AE_MULASD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULSAD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
AE_MULSSD32X16.H1.L0 (.H3.L2) d, d0, d1 [fusion_slot1]
Dual 32x16-bit into 64-bit signed MAC without saturation:
d ← [d] ± d0.H × d1.1 ± d0.L × d1.0
The extra .H3.L2 and .H0.L1 specifiers are meant for computing half of a complex
multiplication.
C syntax:
ae_int64 AE_MULZAAD32X16_H1_L0 (ae_int32x2 d0, ae_int16x4 d1);
ae_int64 AE_MULZASD32X16_H1_L0 (ae_int32x2 d0, ae_int16x4 d1);
ae_int64 AE_MULZSAD32X16_H1_L0 (ae_int32x2 d0, ae_int16x4 d1);
ae_int64 AE_MULZSSD32X16_H1_L0 (ae_int32x2 d0, ae_int16x4 d1);

void AE_MULAAD32X16_H1_L0 (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);
void AE_MULASD32X16_H1_L0 (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);
void AE_MULSAD32X16_H1_L0 (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);
void AE_MULSSD32X16_H1_L0 (ae_int64 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);
AE_MULFP32X16X2RS.L (.H) d, d0, d1 [fusion_slot1]
AE_MULAFP32X16X2RS.L (.H) d, d0, d1 [fusion_slot1]
AE_MULSFP32X16X2RS.L (.H) d, d0, d1 [fusion_slot1]
2-way SIMD 1.31x1.15-bit into 1.31-bit signed MAC with saturation and symmetric (away
from zero) rounding of the product. When the suffix .H is specified, the upper two 16-bit
elements of d1 are used. When the suffix .L is specified, the lower two 16-bit elements are
used.
d.H ← saturate_{1.31}([d.H_{1.31} ±] round_{1.31}(d0.H_{1.31} × d1.1_{1.15}))
d.L ← saturate_{1.31}([d.L_{1.31} ±] round_{1.31}(d0.L_{1.31} × d1.0_{1.15}))

C syntax:
ae_f32x2 AE_MULFP32X16X2RS_L (ae_f32x2 d0, ae_f16x4 d1);
void AE_MULAFP32X16X2RS_L (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
void AE_MULSFP32X16X2RS_L (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);

AE_MULFP32X16X2RAS.L (.H) d, d0, d1 [fusion_slot1]
AE_MULAFP32X16X2RAS.L (.H) d, d0, d1 [fusion_slot1]
AE_MULSFP32X16X2RAS.L (.H) d, d0, d1 [fusion_slot1]
2-way SIMD 1.31x1.15-bit into 1.31-bit signed MAC with saturation and asymmetric rounding
of the product. When the suffix .H is specified, the upper two 16-bit elements of d1 are used.
When the suffix .L is specified, the lower two 16-bit elements are used.
d.H ← saturate_{1.31}([d.H_{1.31} ±] round+∞_{1.31}(d0.H_{1.31} × d1.1_{1.15}))
d.L ← saturate_{1.31}([d.L_{1.31} ±] round+∞_{1.31}(d0.L_{1.31} × d1.0_{1.15}))
C syntax:
ae_f32x2 AE_MULFP32X16X2RAS_L (ae_f32x2 d0, ae_f16x4 d1);
void AE_MULAFP32X16X2RAS_L (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
void AE_MULSFP32X16X2RAS_L (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
AE_MULP32X16X2.L (.H) d, d0, d1 [fusion_slot1 ]
AE_MULAP32X16X2.L (.H) d, d0, d1 [fusion_slot1 ]
AE_MULSP32X16X2.L (.H) d, d0, d1 [fusion_slot1 ]
2-way SIMD 32x16-bit into 32-bit signed MAC without saturation. When the suffix .H is
specified, the upper two 16-bit elements of d1 are used. When the suffix .L is specified, the
lower two 16-bit elements are used.
d.H ← [d.H ±] d0.H × d1.1
d.L ← [d.L ±] d0.L × d1.0
C syntax:
ae_int32x2 AE_MULP32X16X2_L (ae_int32x2 d0, ae_int16x4 d1);
void AE_MULAP32X16X2_L (ae_int32x2 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);
void AE_MULSP32X16X2_L (ae_int32x2 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);
AE_MULFC32X16RAS.L (.H) d, d0, d1 [ fusion_slot1 ]
AE_MULAFC32X16RAS.L (.H) d, d0, d1 [ fusion_slot1 ]
Complex quad 1.31x1.15-bit into 1.31-bit signed MAC with asymmetric rounding of the
product and 32-bit saturation of the final result. When the suffix .H is specified, the upper two
16-bit elements of d1 are used. When the suffix .L is specified, the lower two 16-bit elements
are used. These are emulated using two-instruction sequences: one containing CR in the
name and computing the real part of the product and the other containing CI and computing
the imaginary part.
d.H ← saturate_{1.31}([d.H_{1.31} +] round+∞_{3.31}(d0.H_{1.31} × d1.1_{1.15} - d0.L_{1.31} × d1.0_{1.15})) (CR instruction)
d.L ← saturate_{1.31}([d.L_{1.31} +] round+∞_{3.31}(d0.H_{1.31} × d1.0_{1.15} + d0.L_{1.31} × d1.1_{1.15})) (CI instruction)
C syntax:
ae_f32x2 AE_MULFC32X16RAS_L (ae_f32x2 d0, ae_f16x4 d1);
void AE_MULAFC32X16RAS_L (ae_f32x2 d /*inout*/,
ae_f32x2 d0, ae_f16x4 d1);
AE_MULC32X16.L (.H) d, d0, d1 [ fusion_slot1 ] AVS ONLY
AE_MULAC32X16.L (.H) d, d0, d1 [ fusion_slot1 ] AVS ONLY
Complex quad 32x16-bit into 32-bit signed integer MAC with no saturation. When the suffix
.H is specified, the upper two 16-bit elements of d1 are used. When the suffix .L is specified,
the lower two 16-bit elements are used. These are emulated using two-instruction
sequences: one containing CR in the name and computing the real part of the product and
the other containing CI and computing the imaginary part.
d.H ← [d.H +] d0.H × d1.1 - d0.L × d1.0 (CR instruction)
d.L ← [d.L +] d0.H × d1.0 + d0.L × d1.1 (CI instruction)
C syntax:
ae_int32x2 AE_MULC32X16_L (ae_int32x2 d0, ae_int16x4 d1);
void AE_MULAC32X16_L (ae_int32x2 d /*inout*/,
ae_int32x2 d0, ae_int16x4 d1);

2.6.4 16x16-bit Multiplication Operations


The input operands for 16x16-bit multiplication operations are elements of AE_DR registers.
Each AE_DR register holds four 16-bit elements; for each AE_DR register operand to a
multiplication, one of the four elements must be selected as the input to the multiplication
through a 3, 2, 1 or 0 suffix.
AE_MULF16SS.00 (.33 .22 .32 .21 .31 .30 .10 .20 .11) d, d0, d1 [fusion_slot1]
AE_MULAF16SS.00 (.33 .22 .32 .21 .31 .30 .10 .20 .11) d, d0, d1 [fusion_slot1]
AE_MULSF16SS.00 (.33 .22 .32 .21 .31 .30 .10 .20 .11) d, d0, d1 [fusion_slot1]
Single 1.15x1.15-bit into 1.31-bit signed MAC with 32-bit intermediate product and
accumulator saturation. The 32-bit result is replicated into each half of the result register.
d_{1.31} ← saturate_{1.31}([d_{1.31} ±] saturate_{1.31}(d0.0_{1.15} × d1.0_{1.15}))

These MAC operations are bit-exact with the ITU-T L_mul, L_mac and L_msu basic
primitives.
C syntax:
ae_f32x2 AE_MULF16SS_00 (ae_f16x4 d0, ae_f16x4 d1);
void AE_MULAF16SS_00 (ae_f32x2 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
void AE_MULSF16SS_00 (ae_f32x2 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
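The double saturation in these operations (once on the product, once on the accumulation) can be sketched with a portable C model of the corresponding ITU-T style primitives. The function names below are hypothetical stand-ins chosen to avoid clashing with the ITU-T reference code, and the model covers a single 16-bit lane only.

```c
#include <stdint.h>

/* Clamp a wide value to the signed 32-bit range. */
int32_t sat32_model(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* 1.15 x 1.15 into 1.31: the product is doubled to align the binary
 * point. Only -1.0 * -1.0 overflows and saturates to the maximum. */
int32_t l_mul_model(int16_t a, int16_t b)
{
    return sat32_model(2 * ((int64_t)a * b));
}

/* Multiply/add: saturate(acc + saturate(a * b)). */
int32_t l_mac_model(int32_t acc, int16_t a, int16_t b)
{
    return sat32_model((int64_t)acc + l_mul_model(a, b));
}

/* Multiply/subtract: saturate(acc - saturate(a * b)). */
int32_t l_msu_model(int32_t acc, int16_t a, int16_t b)
{
    return sat32_model((int64_t)acc - l_mul_model(a, b));
}
```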
AE_MULZAAFD16SS.11.00 (.33.22 .13.02) d, d0, d1 [fusion_slot1]
AE_MULZSSFD16SS.11.00 (.33.22 .13.02) d, d0, d1 [fusion_slot1]
AE_MULAAFD16SS.11.00 (.33.22 .13.02) d, d0, d1 [fusion_slot1]
AE_MULSSFD16SS.11.00 (.33.22 .13.02) d, d0, d1 [fusion_slot1]
Dual 1.15x1.15-bit into a single 1.31-bit signed MAC with 32-bit saturation after each product
and after each accumulation. The 32-bit result is replicated into each half of the result
register.
tmp ← saturate_{1.31}([d_{1.31}] ± saturate_{1.31}(d0.1_{1.15} × d1.1_{1.15}))
d_{1.31} ← saturate_{1.31}(tmp ± saturate_{1.31}(d0.0_{1.15} × d1.0_{1.15}))
These MAC operations are bit-exact with a pair of ITU-T L_mul, L_mac and L_msu basic
primitives.
C syntax:
ae_f32x2 AE_MULZAAFD16SS_11_00 (ae_f16x4 d0, ae_f16x4 d1);
ae_f32x2 AE_MULZSSFD16SS_11_00 (ae_f16x4 d0, ae_f16x4 d1);
void AE_MULAAFD16SS_11_00 (ae_f32x2 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
void AE_MULSSFD16SS_11_00 (ae_f32x2 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
AE_MULF16X4SS d0, d1, d2, d3 [fusion_slot1] AVS ONLY
AE_MULAF16X4SS d0, d1, d2, d3 [fusion_slot1] AVS ONLY
AE_MULSF16X4SS d0, d1, d2, d3 [fusion_slot1] AVS ONLY
Four-way SIMD 1.15x1.15-bit into 1.31-bit signed MAC with 32-bit intermediate product and
accumulator saturation. These are emulated using two-instruction sequences.
d0.H ← saturate_{1.31}([d0.H_{1.31} ±] saturate_{1.31}(d2.3_{1.15} × d3.3_{1.15}))
d0.L ← saturate_{1.31}([d0.L_{1.31} ±] saturate_{1.31}(d2.2_{1.15} × d3.2_{1.15}))
d1.H ← saturate_{1.31}([d1.H_{1.31} ±] saturate_{1.31}(d2.1_{1.15} × d3.1_{1.15}))
d1.L ← saturate_{1.31}([d1.L_{1.31} ±] saturate_{1.31}(d2.0_{1.15} × d3.0_{1.15}))
These MAC operations are bit-exact with the ITU-T L_mul, L_mac and L_msu basic
primitives.
C syntax:
void AE_MULF16X4SS (ae_f32x2 d0 /*out*/, ae_f32x2 d1 /*out*/,
ae_f16x4 d2, ae_f16x4 d3);
void AE_MULAF16X4SS (ae_f32x2 d0 /*inout*/,
ae_f32x2 d1 /*inout*/,
ae_f16x4 d2, ae_f16x4 d3);
void AE_MULSF16X4SS (ae_f32x2 d0 /*inout*/,
ae_f32x2 d1 /*inout*/,
ae_f16x4 d2, ae_f16x4 d3);

AE_MUL16X4 d0, d1, d2, d3 [fusion_slot1] AVS/16-bit Quad MAC Options ONLY
AE_MULA16X4 d0, d1, d2, d3 [fusion_slot1] AVS/16-bit Quad MAC Options ONLY
AE_MULS16X4 d0, d1, d2, d3 [fusion_slot1] AVS/16-bit Quad MAC Options ONLY
Four-way SIMD 16x16-bit into 32-bit signed integer MAC without saturation. These are
emulated using two-instruction sequences.
d0.H ← [d0.H ±] d2.3 × d3.3
d0.L ← [d0.L ±] d2.2 × d3.2
d1.H ← [d1.H ±] d2.1 × d3.1
d1.L ← [d1.L ±] d2.0 × d3.0
C syntax:
void AE_MUL16X4 (ae_int32x2 d0 /*out*/, ae_int32x2 d1 /*out*/,
ae_int16x4 d2, ae_int16x4 d3);
void AE_MULA16X4 (ae_int32x2 d0 /*inout*/,
ae_int32x2 d1 /*inout*/,
ae_int16x4 d2, ae_int16x4 d3);
void AE_MULS16X4 (ae_int32x2 d0 /*inout*/,
ae_int32x2 d1 /*inout*/,
ae_int16x4 d2, ae_int16x4 d3);

AE_MULFP16X4S d, d0, d1 [fusion_slot1] AVS ONLY
Four-way SIMD 1.15x1.15-bit into 1.15-bit signed multiply with saturation. These are
emulated using two-instruction sequences.
d.3 ← saturate_{1.15}(d0.3_{1.15} × d1.3_{1.15})
d.2 ← saturate_{1.15}(d0.2_{1.15} × d1.2_{1.15})
d.1 ← saturate_{1.15}(d0.1_{1.15} × d1.1_{1.15})
d.0 ← saturate_{1.15}(d0.0_{1.15} × d1.0_{1.15})
This operation is bit-exact with the ITU-T mult basic primitive. Without the AVS option, this
instruction is emulated using a seven-instruction sequence.

C syntax:
ae_f16x4 AE_MULFP16X4S (ae_f16x4 d0, ae_f16x4 d1);
AE_MULFP16X4RAS d, d0, d1 [fusion_slot1] AVS ONLY
Four-way SIMD 1.15x1.15-bit into 1.15-bit signed multiply with saturation and asymmetric
rounding. These are emulated using two-instruction sequences.
d.3 ← saturate_{1.15}(round+∞_{2.15}(d0.3_{1.15} × d1.3_{1.15}))
d.2 ← saturate_{1.15}(round+∞_{2.15}(d0.2_{1.15} × d1.2_{1.15}))
d.1 ← saturate_{1.15}(round+∞_{2.15}(d0.1_{1.15} × d1.1_{1.15}))
d.0 ← saturate_{1.15}(round+∞_{2.15}(d0.0_{1.15} × d1.0_{1.15}))
The operation is bit-exact with the ITU-T mult_r basic primitive.
C syntax:
ae_f16x4 AE_MULFP16X4RAS (ae_f16x4 d0, ae_f16x4 d1);
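The per-lane behavior of the two 1.15-bit result multiplies can be sketched in portable C. The function names are hypothetical stand-ins for the ITU-T mult and mult_r primitives, and the model covers a single 16-bit lane rather than the four-way SIMD operation.

```c
#include <stdint.h>

/* Clamp to the signed 16-bit range. */
int16_t sat16_model(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* floor(v / 2^15) without relying on arithmetic shift of negatives. */
int32_t floor_shift15(int32_t v)
{
    return (v >= 0) ? (v >> 15) : -((-v + 32767) >> 15);
}

/* Truncating multiply: saturate(floor(a * b / 2^15)). */
int16_t mult_model(int16_t a, int16_t b)
{
    return sat16_model(floor_shift15((int32_t)a * b));
}

/* Rounding multiply: half an LSB (0x4000) is added before the shift,
 * so halves round toward +infinity (asymmetric rounding). */
int16_t mult_r_model(int16_t a, int16_t b)
{
    return sat16_model(floor_shift15((int32_t)a * b + 0x4000));
}
```

In both models only -1.0 times -1.0 reaches the saturation limit.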

2.6.5 16x16-bit Legacy Multiplication Operations


The input operands for legacy 16x16-bit multiplication operations are elements of AE_DR
registers. Each AE_DR register holds two 16-bit elements; for each AE_DR register operand
to a multiplication, one of the two elements must be selected as the input to the multiplication
through an H or an L suffix. The result of each multiply/accumulate operation goes into an
AE_DR register.

AE_MULS32F48P16S.LL (.LH .HH) q, d0, d1 [fusion_slot1]
AE_MULAS32F48P16S.LL (.LH .HH) q, d0, d1 [fusion_slot1]
AE_MULSS32F48P16S.LL (.LH .HH) q, d0, d1 [fusion_slot1]
Single 1.15x1.15-bit into 1.31-bit signed MAC with 32-bit intermediate product and
accumulator saturation. The input 32-bit AE_DR elements are treated as 9.23-bit values and
the result is formatted as a 17.47-bit value.
q_{17.47} ← saturate_{1.31}([q_{17.47} ±] saturate_{1.31}(d0.L[23:8]_{1.15} × d1.L[23:8]_{1.15}))
These MAC operations are bit-exact with the ITU-T L_mul, L_mac and L_msu basic
primitives.
Note: C intrinsics AE_MUL[AS]S32F48P16S_HL are provided and implemented through the
.LH operations above. C intrinsics with ae_p24x2s input operand types and ae_q56s
accumulator operand types are provided to ensure HiFi 2 code portability and are
implemented through the operations above.
C syntax:
ae_q56s AE_MULS32F48P16S_LL (ae_p24x2s d0, ae_p24x2s d1);
void AE_MULAS32F48P16S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
void AE_MULSS32F48P16S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);

ae_q56s AE_MULFS32P16S_LL (ae_p24x2s d0, ae_p24x2s d1);
void AE_MULAFS32P16S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
void AE_MULSFS32P16S_LL (ae_q56s q /*inout*/,
ae_p24x2s d0, ae_p24x2s d1);
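The arithmetic behind these legacy 16x16-bit MAC operations can be sketched on the host as follows. The model is an illustration under stated assumptions: the helper names (sat32, l_mul_model, l_mac_model, l_msu_model) are hypothetical, the operands are passed as plain 1.15 values rather than extracted from bits 23:8 of an AE_DR element, the accumulator is held as a 1.31 value rather than in its 17.47 register layout, and no overflow state is modeled.

```c
#include <stdint.h>

/* Saturate a wide intermediate to the 1.31 range. */
static int32_t sat32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* 1.15 x 1.15 -> 1.31 fractional product with saturation (L_mul style). */
static int32_t l_mul_model(int16_t a, int16_t b)
{
    /* doubling realigns the 2.30 product to 1.31 */
    return sat32((int64_t)a * b * 2);
}

/* acc + a*b with product and accumulator saturation (L_mac style). */
static int32_t l_mac_model(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc + l_mul_model(a, b));
}

/* acc - a*b with product and accumulator saturation (L_msu style). */
static int32_t l_msu_model(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc - l_mul_model(a, b));
}
```

The only product that saturates is −1.0 × −1.0, which returns 0x7FFFFFFF; accumulator saturation applies independently of that.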

2.6.6 32x16-bit Legacy Multiplication Operations


Fusion DSP provides a basic set of legacy 32x16-bit MAC operations for efficient execution
of HiFi 2 target code. The legacy 32- and 16-bit operand formats can only store half as many
elements in a register and are therefore less efficient than the Fusion DSP-specific 32x16-
bit operations. The 32-bit input operand comes from bits 47 through 16 of the AE_DR
register. The 16-bit input operand comes from bits 23 through 8 of the L 32-bit AE_DR
element.
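The operand positions described above can be made concrete with a host-side sketch that picks the two operands out of a 64-bit register image. The function names are hypothetical, and the sketch assumes that the L element occupies register bits 31:0, so that bits 23:8 of the L element are also bits 23:8 of the register.

```c
#include <stdint.h>

/* Sign-extend the low `width` bits of v (1 <= width <= 63). */
static int64_t sext_bits(uint64_t v, unsigned width)
{
    uint64_t sign = (uint64_t)1 << (width - 1);
    v &= (sign << 1) - 1;
    return (int64_t)(v ^ sign) - (int64_t)sign;
}

/* 32-bit operand: bits 47..16 of the 64-bit register image. */
static int32_t legacy_q32_operand(uint64_t dr)
{
    return (int32_t)sext_bits(dr >> 16, 32);
}

/* 16-bit operand: bits 23..8 of the L element
   (assumed here to be register bits 31..0). */
static int16_t legacy_p16_operand_l(uint64_t dr)
{
    return (int16_t)sext_bits(dr >> 8, 16);
}

/* Signed integer product d0[47:16] x d1[23:8], without saturation. */
static int64_t mulq32sp16s_l_model(uint64_t d0, uint64_t d1)
{
    return (int64_t)legacy_q32_operand(d0) * legacy_p16_operand_l(d1);
}
```

For instance, a register image of 0x0000000000ABCD00 yields the 16-bit operand 0xABCD, which is negative when read as a signed value.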

The following intrinsics are provided to ensure HiFi 2 code compatibility and are implemented
through a sequence of one or more of the multiplication operations described in this section:

void AE_MULAFQ32SP16S_H (_L) (ae_q56s q /* inout */,
ae_q56s d0, ae_p24x2s d1);
void AE_MULAFQ32SP16U_H (_L) (ae_q56s q /* inout */,
ae_q56s d0, ae_p24x2s d1);
void AE_MULAQ32SP16S_H (ae_q56s q /* inout */,
ae_q56s d0, ae_p24x2s d1);
void AE_MULAQ32SP16U_H (ae_q56s q /* inout */,
ae_q56s d0, ae_p24x2s d1);
ae_q56s AE_MULFQ32SP16S_H (_L) (ae_q56s d0, ae_p24x2s d1);
ae_q56s AE_MULFQ32SP16U_H (_L) (ae_q56s d0, ae_p24x2s d1);
ae_q56s AE_MULQ32SP16S_H (ae_q56s d0, ae_p24x2s d1);
ae_q56s AE_MULQ32SP16U_H (ae_q56s d0, ae_p24x2s d1);
void AE_MULSFQ32SP16S_H (_L) (ae_q56s q /* inout */,
ae_q56s d0, ae_p24x2s d1);
void AE_MULSFQ32SP16U_H (_L) (ae_q56s q /* inout */,
ae_q56s d0, ae_p24x2s d1);
void AE_MULSQ32SP16S_H (ae_q56s q /* inout */,
ae_q56s d0, ae_p24x2s d1);
void AE_MULSQ32SP16U_H (ae_q56s q /* inout */,
ae_q56s d0, ae_p24x2s d1);

ae_q56s AE_MULZAAFQ32SP16S_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZAAFQ32SP16U_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZAAQ32SP16S_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);

ae_q56s AE_MULZAAQ32SP16U_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZASFQ32SP16S_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZASFQ32SP16U_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZASQ32SP16S_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZASQ32SP16U_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZSAFQ32SP16S_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZSAFQ32SP16U_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZSAQ32SP16S_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZSAQ32SP16U_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZSSFQ32SP16S_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZSSFQ32SP16U_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZSSQ32SP16S_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
ae_q56s AE_MULZSSQ32SP16U_HH (_LH _LL) (ae_q56s q0, ae_p24x2s p0,
ae_q56s q1, ae_p24x2s p1);
AE_MULF48Q32SP16S.L q, d0, d1 [fusion_slot1]
AE_MULAF48Q32SP16S.L q, d0, d1 [fusion_slot1]
AE_MULSF48Q32SP16S.L q, d0, d1 [fusion_slot1]
Single 1.31x1.15-bit into 17.47-bit signed MAC without saturation:
q ← [q17.47 ±] d0[47:16]1.31 × d1[23:8]1.15
Note: C intrinsics AE_MUL[AS]F48Q32SP16S_H are provided and implemented through the
.L operations above.
C syntax:
ae_int64 AE_MULF48Q32SP16S_L (ae_int64 d0, ae_f32x2 d1);
void AE_MULAF48Q32SP16S_L (ae_int64 q /*inout*/,
ae_int64 d0, ae_f32x2 d1);
void AE_MULSF48Q32SP16S_L (ae_int64 q /*inout*/,
ae_int64 d0, ae_f32x2 d1);

AE_MULF48Q32SP16U.L qd, d0, d1 [fusion_slot1]
AE_MULAF48Q32SP16U.L qd, d0, d1 [fusion_slot1]
AE_MULSF48Q32SP16U.L qd, d0, d1 [fusion_slot1]
Single 1.31x1.15u-bit into 17.47-bit MAC without saturation. Note that the 32-bit operand is
treated as a signed value while the 16-bit operand is treated as an unsigned value.
qd ← [qd17.47 ±] d0[47:16]1.31 × d1[23:8]1.15u
C syntax:
ae_int64 AE_MULF48Q32SP16U_L (ae_int64 d0, ae_f32x2 d1);
void AE_MULAF48Q32SP16U_L (ae_int64 qd /*inout*/,
ae_int64 d0, ae_f32x2 d1);
void AE_MULSF48Q32SP16U_L (ae_int64 qd /*inout*/,
ae_int64 d0, ae_f32x2 d1);
AE_MULQ32SP16S.L q, d0, d1 [fusion_slot1]
AE_MULAQ32SP16S.L q, d0, d1 [fusion_slot1]
AE_MULSQ32SP16S.L q, d0, d1 [fusion_slot1]
Single 32x16-bit into 64-bit signed integer MAC with no saturation:
q ← [q ±] d0[47:16] × d1[23:8]
C syntax:
ae_q56s AE_MULQ32SP16S_L (ae_q56s d0, ae_p24x2s d1);
void AE_MULAQ32SP16S_L (ae_q56s q /*inout*/,
ae_q56s d0, ae_p24x2s d1);
void AE_MULSQ32SP16S_L (ae_q56s q /*inout*/,
ae_q56s d0, ae_p24x2s d1);
AE_MULQ32SP16U.L qd, d0, d1 [fusion_slot1]
AE_MULAQ32SP16U.L qd, d0, d1 [fusion_slot1]
AE_MULSQ32SP16U.L qd, d0, d1 [fusion_slot1]
Single 32x16u-bit into 64-bit integer MAC with no saturation. Note that the 32-bit operand is
treated as a signed value while the 16-bit operand is treated as an unsigned value.
qd ← [qd ±] d0[47:16] × d1[23:8]u
C syntax:
ae_q56s AE_MULQ32SP16U_L (ae_q56s d0, ae_p24x2s d1);
void AE_MULAQ32SP16U_L (ae_q56s qd /*inout*/,
ae_q56s d0, ae_p24x2s d1);
void AE_MULSQ32SP16U_L (ae_q56s qd /*inout*/,
ae_q56s d0, ae_p24x2s d1);

2.6.7 HiFi 2 EP 32x24-bit Multiplication Operations


Fusion DSP does not provide support for the 32x24-bit Multiplication operations of HiFi EP
and HiFi 3 as they are superseded by 32x32-bit operations.

2.7 Add, Subtract, and Compare Operations


AE_ADD32 d, d0, d1 [ fusion_slot1, Inst ]
AE_SUB32 d, d0, d1 [ fusion_slot1, Inst ]
AE_ADDSUB32 d, d0, d1 [ fusion_slot1 ]
AE_SUBADD32 d, d0, d1 [ fusion_slot1 ]
Add/subtract the 32-bit elements of two AE_DR registers d0 and d1 without saturation. The results
are placed in d. For AE_ADDSUB32 the high halves of the two registers are added and the
low halves are subtracted. For AE_SUBADD32 the high halves are subtracted and
the low halves are added.
d.H ← d0.H ± d1.H
d.L ← d0.L ± d1.L
Note: C intrinsics AE_ADDP24 and AE_SUBP24 are provided to ensure HiFi 2 code
portability. They are implemented through operations AE_ADD32 and AE_SUB32,
respectively.
C syntax:
ae_int32x2 AE_ADD32 (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_SUB32 (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_ADDSUB32 (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_SUBADD32 (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_ADDP24 (ae_p24x2s d0, ae_p24x2s d1);
ae_p24x2s AE_SUBP24 (ae_p24x2s d0, ae_p24x2s d1);
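The lane behavior of AE_ADDSUB32 and AE_SUBADD32 can be sketched with a host-side model. The pair32 type and the function names are hypothetical stand-ins for ae_int32x2 and the intrinsics; the model assumes that "without saturation" means two's-complement wrap-around.

```c
#include <stdint.h>

/* Hypothetical stand-in for ae_int32x2: H and L 32-bit elements. */
typedef struct { int32_t h, l; } pair32;

/* Keep the low 32 bits of x and reinterpret them as a signed value,
   i.e. two's-complement wrap-around without signed-overflow UB. */
static int32_t wrap32(int64_t x)
{
    uint32_t u = (uint32_t)x;
    return (int32_t)((int64_t)(u ^ 0x80000000u) - 0x80000000LL);
}

/* High halves added, low halves subtracted. */
static pair32 addsub32_model(pair32 d0, pair32 d1)
{
    pair32 d;
    d.h = wrap32((int64_t)d0.h + d1.h);
    d.l = wrap32((int64_t)d0.l - d1.l);
    return d;
}

/* High halves subtracted, low halves added. */
static pair32 subadd32_model(pair32 d0, pair32 d1)
{
    pair32 d;
    d.h = wrap32((int64_t)d0.h - d1.h);
    d.l = wrap32((int64_t)d0.l + d1.l);
    return d;
}
```

With d0 = (10, 10) and d1 = (3, 3), the ADDSUB model returns (13, 7) and the SUBADD model returns (7, 13).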
AE_ADD32_HL_LH d, d0, d1 [ fusion_slot1 ]
Generalized reduction add. Add 32-bit elements of two AE_DR registers d0 and d1 without
saturation. Add the low half of one register to the high half of the other.
d.H ← d0.H + d1.L
d.L ← d0.L + d1.H
C syntax:
ae_int32x2 AE_ADD32_HL_LH (ae_int32x2 d0, ae_int32x2 d1);
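One way to read the "reduction" in this operation is that passing the same register as both operands leaves the sum H + L in each lane. The following host-side sketch illustrates this; pair32, wrap32, add32_hl_lh_model, and horizontal_sum32 are hypothetical names, and wrap-around arithmetic is assumed for the unsaturated add.

```c
#include <stdint.h>

/* Hypothetical stand-in for ae_int32x2: H and L 32-bit elements. */
typedef struct { int32_t h, l; } pair32;

/* Two's-complement wrap-around to 32 bits without signed-overflow UB. */
static int32_t wrap32(int64_t x)
{
    uint32_t u = (uint32_t)x;
    return (int32_t)((int64_t)(u ^ 0x80000000u) - 0x80000000LL);
}

/* Cross-lane add: each result lane takes one element from each half. */
static pair32 add32_hl_lh_model(pair32 d0, pair32 d1)
{
    pair32 d;
    d.h = wrap32((int64_t)d0.h + d1.l);
    d.l = wrap32((int64_t)d0.l + d1.h);
    return d;
}

/* Horizontal sum: using x for both operands puts x.h + x.l in both lanes. */
static int32_t horizontal_sum32(pair32 x)
{
    return add32_hl_lh_model(x, x).l;
}
```

This pattern is a typical final step of a dot product that has accumulated partial sums in the two lanes.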

AE_ADD32S d, d0, d1 [ fusion_slot1, Inst ]
AE_SUB32S d, d0, d1 [ fusion_slot1, Inst ]
AE_ADDSUB32S d, d0, d1 [ fusion_slot1 ]
AE_SUBADD32S d, d0, d1 [ fusion_slot1 ]
Add/subtract, with signed saturation, the 32-bit elements of two AE_DR registers d0 and d1. For
AE_ADDSUB32S the high halves of the two registers are added and the low halves are
subtracted. For AE_SUBADD32S the high halves are subtracted and the low halves
are added. The results are placed in d. In case of saturation, state AE_OVERFLOW
is set to 1.
d.H ← saturate1.31(d0.H ± d1.H)
d.L ← saturate1.31(d0.L ± d1.L)
C syntax:
ae_f32x2 AE_ADD32S (ae_f32x2 d0, ae_f32x2 d1);
ae_f32x2 AE_SUB32S (ae_f32x2 d0, ae_f32x2 d1);
ae_int32x2 AE_ADDSUB32S (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_SUBADD32S (ae_int32x2 d0, ae_int32x2 d1);
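A minimal host-side sketch of one saturating lane, including the overflow indication, is shown here. The variable ae_overflow_model is a hypothetical stand-in for the AE_OVERFLOW state; the model only ever sets it, on the assumption that the state is not cleared by operations that do not saturate.

```c
#include <stdint.h>

/* Hypothetical stand-in for the AE_OVERFLOW state (set on saturation,
   never cleared by this model). */
static int ae_overflow_model = 0;

/* Saturate to the 32-bit signed range and record saturation. */
static int32_t sat32_flag(int64_t x)
{
    if (x > INT32_MAX) { ae_overflow_model = 1; return INT32_MAX; }
    if (x < INT32_MIN) { ae_overflow_model = 1; return INT32_MIN; }
    return (int32_t)x;
}

/* One lane of AE_ADD32S. */
static int32_t add32s_lane(int32_t a, int32_t b)
{
    return sat32_flag((int64_t)a + b);
}

/* One lane of AE_SUB32S. */
static int32_t sub32s_lane(int32_t a, int32_t b)
{
    return sat32_flag((int64_t)a - b);
}
```

AE_ADDSUB32S and AE_SUBADD32S follow by applying add32s_lane to one half and sub32s_lane to the other.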
AE_ADD24S d, d0, d1 [ fusion_slot1, Inst ]
AE_SUB24S d, d0, d1 [ fusion_slot1, Inst ]
Add/subtract 32-bit elements with 24-bit (9.23) signed saturation of two AE_DR registers d0
and d1. The results are placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
d.H ← sext9.23(saturate1.23(d0.H9.23 ± d1.H9.23))
d.L ← sext9.23(saturate1.23(d0.L9.23 ± d1.L9.23))
Note: C intrinsics AE_ADDSP24S and AE_SUBSP24S are provided to ensure HiFi 2 code
portability. They are implemented through operations AE_ADD24S and AE_SUB24S,
respectively.
C syntax:
ae_f24x2 AE_ADD24S (ae_f24x2 d0, ae_f24x2 d1);
ae_f24x2 AE_SUB24S (ae_f24x2 d0, ae_f24x2 d1);
ae_p24x2s AE_ADDSP24S (ae_p24x2s d0, ae_p24x2s d1);
ae_p24x2s AE_SUBSP24S (ae_p24x2s d0, ae_p24x2s d1);
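The 24-bit saturation can be illustrated with a host-side model of one lane. The names sat24, add24s_lane, and sub24s_lane are hypothetical; each 32-bit element is treated as a 9.23 value, the result is clamped to the 1.23 range, and the clamped value is returned sign-extended in 32 bits. The AE_OVERFLOW state is left out of this sketch.

```c
#include <stdint.h>

/* Clamp to the signed 24-bit range; the returned value is already
   sign-extended when held in 32 bits. */
static int32_t sat24(int64_t x)
{
    if (x >  0x7FFFFF) return  0x7FFFFF;
    if (x < -0x800000) return -0x800000;
    return (int32_t)x;
}

/* One lane of AE_ADD24S: 9.23 inputs, 1.23 saturated result. */
static int32_t add24s_lane(int32_t a, int32_t b)
{
    return sat24((int64_t)a + b);
}

/* One lane of AE_SUB24S. */
static int32_t sub24s_lane(int32_t a, int32_t b)
{
    return sat24((int64_t)a - b);
}
```

Note that the clamp limits are 0x7FFFFF and −0x800000, not the 32-bit limits, even though the element itself is 32 bits wide.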

AE_ADD16 d, d0, d1 [ fusion_slot1 ]
AE_SUB16 d, d0, d1 [ fusion_slot1 ]
Add/subtract signed 16-bit elements from two AE_DR registers d0 and d1.
C syntax:
ae_int16x4 AE_ADD16 (ae_int16x4 d0, ae_int16x4 d1);
ae_int16x4 AE_SUB16 (ae_int16x4 d0, ae_int16x4 d1);

AE_ADD16S d, d0, d1 [ fusion_slot1, Inst ]
AE_SUB16S d, d0, d1 [ fusion_slot1, Inst ]
Add/subtract, with saturation, the signed 16-bit elements of two AE_DR registers d0 and d1. The
results are placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f16x4 AE_ADD16S (ae_f16x4 d0, ae_f16x4 d1);
ae_f16x4 AE_SUB16S (ae_f16x4 d0, ae_f16x4 d1);
AE_NEG32 d, d0 [fusion_slot1 ]
Negate 32-bit elements of AE_DR register d0 without saturation, with result placed in d.
d.H ← −d0.H
d.L ← −d0.L
Note: C intrinsic AE_NEGP24 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_NEG32.
C syntax:
ae_int32x2 AE_NEG32 (ae_int32x2 d0);
ae_p24x2s AE_NEGP24 (ae_p24x2s d0);
AE_NEG32S d, d0 [ fusion_slot1, Inst ]
Negate, saturating, the 32-bit elements of AE_DR register d0, with result placed in d.
d.H ← saturate1.31(−d0.H)
d.L ← saturate1.31(−d0.L)
C syntax:
ae_f32x2 AE_NEG32S (ae_f32x2 d0);
AE_NEG24S d, d0 [ fusion_slot1 ]
Negate 32-bit element with 24-bit (9.23) saturation of an AE_DR register d0, with result
placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
d.H ← sext9.23(saturate1.23(−d0.H9.23))
d.L ← sext9.23(saturate1.23(−d0.L9.23))
Note: C intrinsic AE_NEGSP24S is provided to ensure HiFi 2 code portability. It is
implemented through operation AE_NEG24S.
C syntax:
ae_f24x2 AE_NEG24S (ae_f24x2 d0);
ae_p24x2s AE_NEGSP24S (ae_p24x2s d0);
AE_NEG16S d, d0 [ fusion_slot1, Inst ]
Negate, saturating, the 16-bit elements of AE_DR register d0, with result placed in d.
C syntax:
ae_int16 AE_NEG16S (ae_int16 d0);

AE_ABS32 d, d0 [ fusion_slot1 ]
Absolute value of 32-bit element of an AE_DR register d0 without saturation, with result
placed in d.
d.H ← |d0.H|
d.L ← |d0.L|
Note: C intrinsic AE_ABSP24 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_ABS32.
C syntax:
ae_int32x2 AE_ABS32 (ae_int32x2 d0);
ae_p24x2s AE_ABSP24 (ae_p24x2s d0);
AE_ABS32S d, d0 [ fusion_slot1, Inst ]
Absolute value, saturating, of a 32-bit element of an AE_DR register d0 with result placed in
d.
d.H ← saturate1.31(|d0.H|)
d.L ← saturate1.31(|d0.L|)
C syntax:
ae_int32x2 AE_ABS32S (ae_int32x2 d0);
AE_ABS24S d, d0 [ fusion_slot1, Inst ]
Absolute value, with 24-bit (9.23) saturation of a 32-bit element of an AE_DR register d0 with
result placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ABSSP24S is provided to ensure HiFi 2 code portability. It is
implemented through operation AE_ABS24S.
d.H ← sext9.23(saturate1.23(|d0.H9.23|))
d.L ← sext9.23(saturate1.23(|d0.L9.23|))
C syntax:
ae_f24x2 AE_ABS24S (ae_f24x2 d0);
ae_p24x2s AE_ABSSP24S (ae_p24x2s d0);
AE_ABS16S d, d0 [ fusion_slot1, Inst ]
Absolute value, saturating, element-wise of 16-bit elements of an AE_DR register d0 with
result placed in d.
C syntax:
ae_f16x4 AE_ABS16S (ae_f16x4 d0);

AE_MAX32 d, d0, d1 [ fusion_slot1, Inst ]
AE_MIN32 d, d0, d1 [ fusion_slot1, Inst ]
Get maximum/minimum of two 32-bit elements of AE_DR registers d0 and d1. The results
are placed in d.
Maximum: d.H ← (d0.H > d1.H) ? d0.H : d1.H
d.L ← (d0.L > d1.L) ? d0.L : d1.L
Note: C intrinsics AE_MAXP24S and AE_MINP24S are provided to ensure HiFi 2 code
portability. They are implemented through operations AE_MAX32 and AE_MIN32,
respectively. C intrinsics AE_MAXB32/AE_MINB32 are implemented through a sequence of
the AE_MAX32/AE_MIN32 and AE_LT32 operations and set the Boolean result only if the
d0 element is greater/less than the d1 element. C intrinsics AE_MAXBP24S/AE_MINBP24S
are implemented in a similar way and are provided to ensure HiFi 2 code portability.
C syntax:
ae_int32x2 AE_MAX32 (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_MIN32 (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_MAXP24S (ae_p24x2s d0, ae_p24x2s d1);
ae_p24x2s AE_MINP24S (ae_p24x2s d0, ae_p24x2s d1);
void AE_MAXB32 (ae_int32x2 d /* out */, ae_int32x2 d0,
ae_int32x2 d1, xtbool2 bhl /* out */);
void AE_MINB32 (ae_int32x2 d /* out */, ae_int32x2 d0,
ae_int32x2 d1, xtbool2 bhl/* out */);
void AE_MAXBP24S (ae_p24x2s d /* out */, ae_p24x2s d0,
ae_p24x2s d1, xtbool2 bhl /* out */);
void AE_MINBP24S (ae_p24x2s d /* out */, ae_p24x2s d0,
ae_p24x2s d1, xtbool2 bhl /* out */);
AE_MAXABS32S d, d0, d1 [ fusion_slot1 ]
AE_MINABS32S d, d0, d1 [ fusion_slot1 ]
Get maximum/minimum of absolute value of two signed 32-bit elements of AE_DR registers
d0 and d1. The two element-wise results are saturated to 32 bits and placed in d. In case of
saturation, state AE_OVERFLOW is set to 1.
Maximum: d.H ← saturate1.31(|d0.H| > |d1.H| ? |d0.H| : |d1.H|)
d.L ← saturate1.31(|d0.L| > |d1.L| ? |d0.L| : |d1.L|)
C syntax:
ae_f32x2 AE_MAXABS32S (ae_f32x2 d0, ae_f32x2 d1);
ae_f32x2 AE_MINABS32S (ae_f32x2 d0, ae_f32x2 d1);
Note: C intrinsics AE_MAXABSSP24S and AE_MINABSSP24S are provided to ensure
HiFi 2 EP code portability. They are implemented through operations AE_MAXABS32S and
AE_MINABS32S, respectively.
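The reason these operations saturate can be seen in a host-side model of one lane: the absolute value of the most negative 32-bit number does not fit in 32 bits. The function names are hypothetical, the minimum is assumed to mirror the maximum equation with the comparison reversed, and the AE_OVERFLOW state is not modeled.

```c
#include <stdint.h>

/* Absolute value computed in 64 bits so that |INT32_MIN| is representable. */
static int64_t abs_wide(int32_t x)
{
    return (x < 0) ? -(int64_t)x : (int64_t)x;
}

/* Saturate to the 32-bit signed range. */
static int32_t sat32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* One lane of AE_MAXABS32S. */
static int32_t maxabs32s_lane(int32_t a, int32_t b)
{
    int64_t ma = abs_wide(a), mb = abs_wide(b);
    return sat32(ma > mb ? ma : mb);
}

/* One lane of AE_MINABS32S (assumed symmetric to the maximum). */
static int32_t minabs32s_lane(int32_t a, int32_t b)
{
    int64_t ma = abs_wide(a), mb = abs_wide(b);
    return sat32(ma < mb ? ma : mb);
}
```

Saturation occurs only when the selected magnitude is |0x80000000|; the maximum operation reaches it whenever either input is 0x80000000, the minimum only when both are.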

AE_LT32 bhl, d0, d1 [ fusion_slot1, Inst ]


Compare, signed less-than, two 32-bit elements of AE_DR registers d0 and d1; results go to
a pair bhl of adjacent Boolean registers.
bhl[1] ← (d0.H < d1.H) ? 1 : 0
bhl[0] ← (d0.L < d1.L) ? 1 : 0
Note: C intrinsic AE_LTP24S is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_LT32.
C syntax:
xtbool2 AE_LT32 (ae_int32x2 d0, ae_int32x2 d1);
xtbool2 AE_LTP24S (ae_p24x2s d0, ae_p24x2s d1);
AE_LE32 bhl, d0, d1 [ fusion_slot1, Inst ]
Compare, less-than-or-equal, two 32-bit signed elements of AE_DR registers d0 and d1;
results go to a pair bhl of adjacent Boolean registers.
bhl[1] ← (d0.H ≤ d1.H) ? 1 : 0
bhl[0] ← (d0.L ≤ d1.L) ? 1 : 0
Note: C intrinsic AE_LEP24S is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_LE32.
C syntax:
xtbool2 AE_LE32 (ae_int32x2 d0, ae_int32x2 d1);
xtbool2 AE_LEP24S (ae_p24x2s d0, ae_p24x2s d1);
AE_EQ32 bhl, d0, d1 [ fusion_slot1, Inst ]
Compare, equal, two 32-bit elements of AE_DR registers d0 and d1; results go to a pair bhl
of adjacent Boolean registers.
bhl[1] ← (d0.H == d1.H) ? 1 : 0
bhl[0] ← (d0.L == d1.L) ? 1 : 0
Note: C intrinsic AE_EQP24 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_EQ32.
C syntax:
xtbool2 AE_EQ32 (ae_int32x2 d0, ae_int32x2 d1);
xtbool2 AE_EQP24 (ae_p24x2s d0, ae_p24x2s d1);
AE_LT16 b3210, d0, d1 [ fusion_slot1 ]
Compare, less-than, two 16-bit signed elements of AE_DR registers d0 and d1; results go to
a four element Boolean register.
b3210[3] ← (d0.3 < d1.3) ? 1 : 0
b3210[2] ← (d0.2 < d1.2) ? 1 : 0
b3210[1] ← (d0.1 < d1.1) ? 1 : 0
b3210[0] ← (d0.0 < d1.0) ? 1 : 0

C syntax:
xtbool4 AE_LT16 (ae_int16x4 d0, ae_int16x4 d1);
AE_LE16 b3210, d0, d1 [ fusion_slot1 ]
Compare, less-than-or-equal, two 16-bit signed elements of AE_DR registers d0 and d1;
results go to a four element Boolean register.
b3210[3] ← (d0.3 <= d1.3) ? 1 : 0
b3210[2] ← (d0.2 <= d1.2) ? 1 : 0
b3210[1] ← (d0.1 <= d1.1) ? 1 : 0
b3210[0] ← (d0.0 <= d1.0) ? 1 : 0
C syntax:
xtbool4 AE_LE16 (ae_int16x4 d0, ae_int16x4 d1);
AE_EQ16 b3210, d0, d1 [ fusion_slot1 ]
Compare, equal, the 16-bit elements of two AE_DR registers d0 and d1; results go to a four
element Boolean register.
b3210[3] ← (d0.3 == d1.3) ? 1 : 0
b3210[2] ← (d0.2 == d1.2) ? 1 : 0
b3210[1] ← (d0.1 == d1.1) ? 1 : 0
b3210[0] ← (d0.0 == d1.0) ? 1 : 0
C syntax:
xtbool4 AE_EQ16 (ae_int16x4 d0, ae_int16x4 d1);
AE_ADD64 d, d0, d1 [ fusion_slot1, Inst ]
AE_SUB64 d, d0, d1 [fusion_slot1, Inst ]
Add/Subtract two 64-bit AE_DR registers d0 and d1 without saturation, with result placed in
d.
d ← d0 ± d1
Note: C intrinsics AE_ADDQ56 and AE_SUBQ56 are provided to ensure HiFi 2 code
portability. They are implemented through operations AE_ADD64 and AE_SUB64,
respectively.
C syntax:
ae_int64 AE_ADD64 (ae_int64 d0, ae_int64 d1);
ae_int64 AE_SUB64 (ae_int64 d0, ae_int64 d1);
ae_q56s AE_ADDQ56 (ae_q56s d0, ae_q56s d1);
ae_q56s AE_SUBQ56 (ae_q56s d0, ae_q56s d1);

AE_ADD64S d, d0, d1 [fusion_slot1, Inst ]
AE_SUB64S d, d0, d1 [fusion_slot1, Inst ]
Add/Subtract, saturating, two 64-bit signed AE_DR registers d0 and d1, with result placed in
d. In case of saturation, state AE_OVERFLOW is set to 1.
d ← saturate1.63(d0 ± d1)
C syntax:
ae_f64 AE_ADD64S (ae_f64 d0, ae_f64 d1);
ae_f64 AE_SUB64S (ae_f64 d0, ae_f64 d1);
AE_ADDSQ56S d, d0, d1 [ fusion_slot1 ]
AE_SUBSQ56S d, d0, d1 [ fusion_slot1 ]
Add/Subtract (56-bit (9.55) saturation), two 64-bit signed AE_DR registers d0 and d1, with
result placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
d ← sext9.55(saturate1.55(d09.55 ± d19.55))
Note: These are legacy instructions meant to support HiFi 2 code portability.
C syntax:
ae_q56s AE_ADDSQ56S (ae_q56s d0, ae_q56s d1);
ae_q56s AE_SUBSQ56S (ae_q56s d0, ae_q56s d1);
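The 56-bit saturation of these legacy operations can be sketched on the host as follows. The macro and function names are hypothetical; the operands are treated as 9.55 values held in 64 bits, the result is clamped to the 1.55 range, and the AE_OVERFLOW state is not modeled.

```c
#include <stdint.h>

#define Q56_MAX ((((int64_t)1) << 55) - 1)   /* largest 1.55 value  */
#define Q56_MIN (-(((int64_t)1) << 55))      /* smallest 1.55 value */

/* Clamp a 64-bit value to the signed 56-bit range. */
static int64_t sat56(int64_t x)
{
    if (x > Q56_MAX) return Q56_MAX;
    if (x < Q56_MIN) return Q56_MIN;
    return x;
}

/* Model of AE_ADDSQ56S. */
static int64_t addsq56s_model(int64_t a, int64_t b)
{
    /* Detect 64-bit overflow first so the host addition stays defined. */
    if (b > 0 && a > INT64_MAX - b) return Q56_MAX;
    if (b < 0 && a < INT64_MIN - b) return Q56_MIN;
    return sat56(a + b);
}

/* Model of AE_SUBSQ56S. */
static int64_t subsq56s_model(int64_t a, int64_t b)
{
    if (b < 0 && a > INT64_MAX + b) return Q56_MAX;
    if (b > 0 && a < INT64_MIN + b) return Q56_MIN;
    return sat56(a - b);
}
```

The explicit overflow checks exist only because the host model must not overflow a C int64_t; they do not describe how the hardware computes the result.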
AE_NEG64 d, d0 [fusion_slot1, Inst ]
Negate 64-bit AE_DR register d0 without saturation, with result placed in d.
d ← −d0
Note: C intrinsic AE_NEGQ56 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_NEG64.
C syntax:
ae_int64 AE_NEG64 (ae_int64 d0);
ae_q56s AE_NEGQ56 (ae_q56s d0);
AE_NEG64S d, d0 [ fusion_slot1 ]
Negate, saturating, 64-bit AE_DR register d0, with result placed in d. In case of saturation,
state AE_OVERFLOW is set to 1.
d ← saturate1.63(−d0)
C syntax:
ae_f64 AE_NEG64S (ae_f64 d0);
AE_NEGSQ56S d, d0 [ fusion_slot1 ]
Negate, with 56-bit (9.55) saturation, 64-bit AE_DR register d0, with result placed in d. In
case of saturation, state AE_OVERFLOW is set to 1.
d ← sext9.55(saturate1.55(−d09.55))
Note: These are legacy instructions meant to support HiFi 2 code portability.

C syntax:
ae_q56s AE_NEGSQ56S (ae_q56s d0);
AE_ABS64 d, d0 [fusion_slot1, Inst ]
Get absolute value of 64-bit AE_DR register d0 without saturation, with result placed in d.
d ← |d0|
Note: C intrinsic AE_ABSQ56 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_ABS64.
C syntax:
ae_int64 AE_ABS64 (ae_int64 d0);
ae_q56s AE_ABSQ56 (ae_q56s d0);
AE_ABS64S d, d0 [ fusion_slot1 ]
Get absolute value, saturating, of 64-bit AE_DR register d0, with result placed in d. In case
of saturation, state AE_OVERFLOW is set to 1.
d ← saturate1.63(|d0|)
C syntax:
ae_f64 AE_ABS64S (ae_f64 d0);
AE_ABSSQ56S d, d0 [ fusion_slot1 ]
Get absolute value, with 56-bit (9.55) saturation of 64-bit AE_DR register d0, with result
placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
d ← sext9.55(saturate1.55(|d09.55|))
Note: These are legacy instructions meant to support HiFi 2 code portability.
C syntax:
ae_q56s AE_ABSSQ56S (ae_q56s d0);
AE_MAX64 d, d0, d1 [ fusion_slot1 ]
AE_MIN64 d, d0, d1 [ fusion_slot1 ]
Get maximum/minimum of two signed 64-bit AE_DR registers d0 and d1, with result placed
in d.
Maximum: d ← (d0 > d1) ? d0 : d1
Note: C intrinsics AE_MAXQ56S and AE_MINQ56S are provided to ensure HiFi 2 code
portability. They are implemented through operations AE_MAX64 and AE_MIN64,
respectively. C intrinsics AE_MAXB64/AE_MINB64 are implemented through a sequence of
the AE_MAX64/AE_MIN64 and AE_LT64 operations and set the Boolean result only if the
d0 value is greater/less than the d1 value. C intrinsics AE_MAXBQ56S/AE_MINBQ56S are
implemented in a similar way and are provided to ensure HiFi 2 code portability.

C syntax:
ae_int64 AE_MAX64 (ae_int64 d0, ae_int64 d1);
ae_int64 AE_MIN64 (ae_int64 d0, ae_int64 d1);
ae_q56s AE_MAXQ56S (ae_q56s d0, ae_q56s d1);
ae_q56s AE_MINQ56S (ae_q56s d0, ae_q56s d1);
void AE_MAXB64 (ae_int64 d /* out */, ae_int64 d0, ae_int64 d1,
xtbool b /* out */);
void AE_MINB64 (ae_int64 d /* out */, ae_int64 d0, ae_int64 d1,
xtbool b /* out */);
void AE_MAXBQ56S (ae_q56s d /* out */, ae_q56s d0, ae_q56s d1,
xtbool b /* out */);
void AE_MINBQ56S (ae_q56s d /* out */, ae_q56s d0, ae_q56s d1,
xtbool b /* out */);
AE_MAXABS64S d, d0, d1 [fusion_slot1]
AE_MINABS64S d, d0, d1 [ fusion_slot1 ]
Get maximum/minimum of absolute value of two 64-bit signed AE_DR registers d0 and d1.
The result is saturated to 64 bits and placed in d.
In case of saturation, state AE_OVERFLOW is set to 1.
Maximum: d ← saturate1.63((|d0| > |d1|) ? |d0| : |d1|)
C syntax:
ae_f64 AE_MAXABS64S (ae_f64 d0, ae_f64 d1);
ae_f64 AE_MINABS64S (ae_f64 d0, ae_f64 d1);

Note: C intrinsics AE_MAXABSSQ56S and AE_MINABSSQ56S are provided to ensure HiFi 2
EP code portability. They are implemented through operations AE_MAXABS64S and
AE_MINABS64S, respectively.

AE_LT64 b, d0, d1 [fusion_slot1, Inst ]


Compare, less-than, two signed 64-bit AE_DR registers d0 and d1; result goes to a Boolean
register b.
b ← (d0 < d1) ? 1 : 0
Note: C intrinsic AE_LTQ56S is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_LT64.
C syntax:
xtbool AE_LT64 (ae_int64 d0, ae_int64 d1);
xtbool AE_LTQ56S (ae_q56s d0, ae_q56s d1);

AE_LE64 b, d0, d1 [fusion_slot1, Inst ]


Compare, less-than-or-equal, two 64-bit signed AE_DR registers d0 and d1; result goes to a
Boolean register b.
b ← (d0 ≤ d1) ? 1 : 0
Note: C intrinsic AE_LEQ56S is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_LE64.
C syntax:
xtbool AE_LE64 (ae_int64 d0, ae_int64 d1);
xtbool AE_LEQ56S (ae_q56s d0, ae_q56s d1);
AE_EQ64 b, d0, d1 [fusion_slot1, Inst ]
Compare, equal, two 64-bit AE_DR registers d0 and d1; result goes to a Boolean register b.
b ← (d0 == d1) ? 1 : 0
Note: C intrinsic AE_EQQ56 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_EQ64.
C syntax:
xtbool AE_EQ64 (ae_int64 d0, ae_int64 d1);
xtbool AE_EQQ56 (ae_q56s d0, ae_q56s d1);

2.8 Shift Operations


Fusion DSP comes with a large variety of shift operations, supporting 16-, 24-, 32-, and 64-bit
shifts as well as legacy HiFi 2 shift operations. The shift amount can come from an
immediate, an AR register, or the AE_SAR shift register. Variable shifts are bidirectional,
meaning that the direction of the shift reverses if the shift amount is negative. Variable shifts
that take the amount from an AR register can perform a shift without first setting the AE_SAR
shift register. Shift instructions using an AR register or the AE_SAR state truncate the shift
amount based on the size of the data being shifted. For example, shifting a 16-bit element by
17 truncates the shift amount from 17 down to 1.

All shift operations start with the prefix AE_S. The following letter is either L or R, signifying
whether the primary shift direction is left or right. The next letter is either L or A, signifying
whether a shift is logical (fills in 0’s on a right shift) or arithmetic (sign-extends on a right shift).
The next letter is I for immediate shifts, A for AR shifts, and S for AE_SAR shifts. Following is
a number signifying the size of the element being shifted, an optional R for right shifts
that round rather than truncate, and an optional S for left shifts that saturate.
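The naming scheme can be exercised with a small host-side decoder that splits a shift mnemonic into its fields, using L for logical and A for arithmetic as the mnemonics in this section do. This is an illustrative sketch only; the shift_name type and parse_shift_name function are hypothetical and are not part of any Cadence tool or header.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int  valid;       /* 1 if the name follows the scheme */
    char direction;   /* 'L' or 'R': primary shift direction */
    char kind;        /* 'L' logical, 'A' arithmetic */
    char source;      /* 'I' immediate, 'A' AR register, 'S' AE_SAR */
    int  bits;        /* element size in bits */
    int  rounds;      /* optional trailing R present */
    int  saturates;   /* optional trailing S present */
} shift_name;

/* Decode names of the form AE_S<dir><kind><source><bits>[R][S]. */
static shift_name parse_shift_name(const char *name)
{
    shift_name r = {0};
    const char *p = name;
    char *end;

    if (strncmp(p, "AE_S", 4) != 0) return r;
    p += 4;
    if (*p != 'L' && *p != 'R') return r;
    r.direction = *p++;
    if (*p != 'L' && *p != 'A') return r;
    r.kind = *p++;
    if (*p != 'I' && *p != 'A' && *p != 'S') return r;
    r.source = *p++;
    if (!isdigit((unsigned char)*p)) return r;
    r.bits = (int)strtol(p, &end, 10);
    p = end;
    if (*p == 'R') { r.rounds = 1; p++; }
    if (*p == 'S') { r.saturates = 1; p++; }
    r.valid = (*p == '\0');
    return r;
}
```

For example, AE_SRAA16RS decodes as a right, arithmetic, AR-register shift of 16-bit elements that rounds and saturates, while a non-shift name such as AE_SUB32 is rejected.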

AE_SRAI16 d, d0, i [ fusion_slot0]


Shift right arithmetic (sign-extending), element-wise, 16-bit elements of AE_DR register d0
by immediate value, with result placed in d.
C syntax:
ae_int16x4 AE_SRAI16 (ae_int16x4 d0, immediate i);

AE_SRAI16R d, d0, i [ fusion_slot0]


Shift right arithmetic (sign-extending), element-wise, 16-bit elements of AE_DR register d0
by immediate, with result placed in d. Result is rounded corresponding to ITU intrinsic shr_r.
C syntax:
ae_int16x4 AE_SRAI16R (ae_int16x4 d0, immediate i);
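The rounding behavior can be illustrated with a host-side model of one 16-bit lane. The function name srai16r_lane is hypothetical, and the sketch assumes an immediate in the range 0 to 15 (the valid immediate range is not stated in this entry). Rounding is modeled as in shr_r: half of the discarded range is added before the arithmetic shift, so ties round toward +∞.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 16-bit lane of AE_SRAI16R, 0 <= i <= 15. */
static int16_t srai16r_lane(int16_t x, int i)
{
    int32_t t, r;

    assert(i >= 0 && i <= 15);
    if (i == 0) return x;                   /* nothing is discarded */

    t = (int32_t)x + (1 << (i - 1));        /* add half of the discarded range */

    /* floor(t / 2^i), written without shifting a negative value */
    r = (t >= 0) ? (t >> i) : -((-t + (1 << i) - 1) >> i);
    return (int16_t)r;
}
```

Under this model 3 >> 1 rounds to 2 and −3 >> 1 rounds to −1, whereas the truncating AE_SRAI16 would give 1 and −2.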
AE_SRAA16RS d, d0, a0 [fusion_slot0 ]
Shift right or left arithmetic (sign-extending), saturating, element-wise, four 16-bit signed
elements of AE_DR register d0 by AR register a0, with result placed in d. For a positive shift
amount, the value is shifted to the right. For a negative shift amount, the value is shifted to
the left. When shifted to the right, result is rounded corresponding to ITU intrinsic shr_r. In
case of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f16x4 AE_SRAA16RS (ae_f16x4 d0, int32 a0);
AE_SRAA16S d, d0, a0 [ fusion_slot0]
Shift right or left arithmetic, (sign-extending), saturating, element-wise, four 16-bit elements
of AE_DR register d0 by AR register a0, with result placed in d. For a positive shift amount,
the value is shifted to the right. For a negative shift amount, the value is shifted to the left. In
case of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f16x4 AE_SRAA16S (ae_f16x4 d0, int32 a0);
AE_SLAI16S d, d0, i [ fusion_slot0]
Shift left arithmetic, saturating, element-wise, four 16-bit signed elements of AE_DR register
d0 by immediate value, with result placed in d. In case of saturation, state AE_OVERFLOW
is set to 1.
C syntax:
ae_f16x4 AE_SLAI16S (ae_f16x4 d0, immediate i);
AE_SLAA16S d, d0, a0 [fusion_slot0, Inst ]
Shift left or right, saturating, element-wise, four 16-bit signed elements of AE_DR register d0 by
AR register a0, with result placed in d. For a positive shift amount, the value is shifted to the
left. For a negative shift amount, the value is shifted to the right and sign-extended. In case
of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f16x4 AE_SLAA16S (ae_f16x4 d0, int32 a0);
AE_SLAI24 d, d0, i [ fusion_slot0, Inst ]
Shift left element-wise, two 24-bit elements of AE_DR register d0 by immediate value, with
result placed in d.
d.L = sext24(d0.L[23:0] << i);
d.H = sext24(d0.H[23:0] << i).
Note: C intrinsic AE_SLLIP24 is implemented through operation AE_SLAI24.

C syntax:
ae_int24x2 AE_SLAI24 (ae_int24x2 d0, immediate i);
ae_p24x2s AE_SLLIP24 (ae_p24x2s d0, immediate i);
AE_SRLI24 d, d0, i [ fusion_slot0 ]
Shift right logical (zero-extending), element-wise, two 24-bit elements of AE_DR register d0
by immediate, with result placed in d. Note that the sign of the result will be zero for any non-
zero shift amount.
d.L = sext24(d0.L[23:0] >>u i);
d.H = sext24(d0.H[23:0] >>u i).

Note: C intrinsic AE_SRLIP24 is implemented through operation AE_SRLI24.

C syntax:
ae_int24x2 AE_SRLI24 (ae_int24x2 d0, immediate i);
ae_p24x2s AE_SRLIP24 (ae_p24x2s d0, immediate i);
AE_SRAI24 d, d0, i [ fusion_slot0, Inst ]
Shift right arithmetic (sign-extending), element-wise, two 24-bit elements of AE_DR register
d0 by immediate value, with result placed in d.
d.L = sext24(d0.L[23:0] >>s i);
d.H = sext24(d0.H[23:0] >>s i).

Note: C intrinsic AE_SRAIP24 is implemented through operation AE_SRAI24.

C syntax:
ae_int24x2 AE_SRAI24 (ae_int24x2 d0, immediate i);
ae_p24x2s AE_SRAIP24 (ae_p24x2s d0, immediate i);
AE_SLAI24S d, d0, i [ fusion_slot0, Inst ]
Shift left, saturating, element-wise, two 24-bit signed elements of AE_DR register d0 by
immediate, with result placed in d. In case of saturation, state AE_OVERFLOW is set to 1.
d.L = sext24(saturate24(d0.L[23:0] << i));
d.H = sext24(saturate24(d0.H[23:0] << i)).
Note: C intrinsic AE_SLLISP24S is implemented through operation AE_SLAI24S.
C syntax:
ae_f24x2 AE_SLAI24S (ae_f24x2 d0, immediate i);
ae_p24x2s AE_SLLISP24S (ae_p24x2s d0, immediate i);

AE_SLAS24 d, d0 [ fusion_slot0, Inst ]


Shift left or right arithmetic, (sign-extending), element-wise two 24-bit elements of AE_DR
register d0 by shift amount register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the left. For a negative shift amount, the value is shifted to the
right and sign-extended. Note that in the case of a negative shift amount, this intrinsic
performs an arithmetic right shift.
d.L = sext24((SAR ≥ 0) ? (d0.L[23:0] << SAR) : (d0.L[23:0] >>s −SAR));
d.H = sext24((SAR ≥ 0) ? (d0.H[23:0] << SAR) : (d0.H[23:0] >>s −SAR)).
Note: C intrinsic AE_SLLSP24 is implemented through operation AE_SLAS24.
C syntax:
ae_int24x2 AE_SLAS24 (ae_int24x2 d0);
ae_p24x2s AE_SLLSP24 (ae_p24x2s d0);
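The bidirectional 24-bit shift equation can be modeled on the host for one lane as shown here. The names sext24 and slas24_lane are hypothetical, the shift amount is passed as a parameter instead of being read from AE_SAR, and it is assumed to lie in the range −23 to 23, so the truncation of larger shift amounts is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 24 bits of v into a 32-bit value. */
static int32_t sext24(uint32_t v)
{
    v &= 0x00FFFFFFu;
    return (int32_t)(v ^ 0x00800000u) - 0x00800000;
}

/* One lane of the AE_SLAS24 equation; sar assumed in -23..23. */
static int32_t slas24_lane(int32_t d0, int sar)
{
    int32_t x = sext24((uint32_t)d0);   /* d0[23:0] as a signed 24-bit value */
    int32_t div, q;

    assert(sar >= -23 && sar <= 23);

    if (sar >= 0)                        /* left shift, no saturation: */
        return sext24((uint32_t)x << sar);   /* bits above bit 23 are dropped */

    /* arithmetic right shift by -sar, written as a floor division */
    div = (int32_t)1 << -sar;
    q = x / div;
    if (x % div != 0 && x < 0) q -= 1;
    return q;
}
```

Because this operation does not saturate, shifting 0x400000 left by 1 wraps to the negative value 0x800000; AE_SLAS24S would clamp the result instead.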
AE_SRLS24 d, d0 [ fusion_slot0 ]
Shift right or left, logical (zero-extending), element-wise two 24-bit elements of AE_DR
register d0 by shift amount register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the right. In case of a negative shift amount, the value is shifted
to the left.

Note: C intrinsic AE_SRLSP24 is implemented through operation AE_SRLS24.
d.L = sext24((SAR ≥ 0) ? (d0.L[23:0] >>u SAR) : (d0.L[23:0] << −SAR));
d.H = sext24((SAR ≥ 0) ? (d0.H[23:0] >>u SAR) : (d0.H[23:0] << −SAR)).
C syntax:
ae_int24x2 AE_SRLS24 (ae_int24x2 d0);
ae_p24x2s AE_SRLSP24 (ae_p24x2s d0);
AE_SRAS24 d, d0 [ fusion_slot0 ]
Shift right or left arithmetic (sign-extending), element-wise two 24-bit elements of AE_DR
register d0 by shift amount register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the right. In case of a negative shift amount, the value is shifted
to the left.
d.L = sext24((SAR ≥ 0) ? (d0.L[23:0] >>s SAR) : (d0.L[23:0] << −SAR));
d.H = sext24((SAR ≥ 0) ? (d0.H[23:0] >>s SAR) : (d0.H[23:0] << −SAR)).

Note: C intrinsic AE_SRASP24 is implemented through operation AE_SRAS24.

C syntax:
ae_int24x2 AE_SRAS24 (ae_int24x2 d0);
ae_p24x2s AE_SRASP24 (ae_p24x2s d0);

AE_SLAS24S d, d0 [ fusion_slot0 ]
Shift left or right, arithmetic (sign-extending), saturating, element-wise, two 24-bit elements
of AE_DR register d0 by shift amount register AE_SAR, with result placed in d. For a positive
shift amount, the value is shifted to the left. In case of a negative shift amount, the value is
shifted to the right. In case of saturation, state AE_OVERFLOW is set to 1.
d.L = sext24((SAR ≥ 0) ? saturate24(d0.L[23:0] << SAR) : (d0.L[23:0] >>s −SAR));
d.H = sext24((SAR ≥ 0) ? saturate24(d0.H[23:0] << SAR) : (d0.H[23:0] >>s −SAR)).
Note: C intrinsic AE_SLLSSP24S is implemented through operation AE_SLAS24S. Note
that in the case of a negative shift amount, this intrinsic performs an arithmetic right shift.
C syntax:
ae_f24x2 AE_SLAS24S (ae_f24x2 d0);
ae_p24x2s AE_SLLSSP24S (ae_p24x2s d0);
AE_SLAI32 d, d0, i [ fusion_slot0, Inst]
Shift left, element-wise, two 32-bit elements of AE_DR register d0 by immediate value, with
result placed in d.
d.L = d0.L << i;
d.H = d0.H << i.
C syntax:
ae_int32x2 AE_SLAI32 (ae_int32x2 d0, immediate i);
AE_SRLI32 d, d0, i [ fusion_slot0, Inst]
Shift right logical (zero-extending), element-wise, two 32-bit elements of AE_DR register d0
by immediate value, with result placed in d.
d.L = d0.L >>u i;
d.H = d0.H >>u i.
C syntax:
ae_int32x2 AE_SRLI32 (ae_int32x2 d0, immediate i);
AE_SRAI32 d, d0, i [ fusion_slot0, Inst]
Shift right arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR register
d0 by immediate value, with result placed in d.
d.L = d0.L >>s i;
d.H = d0.H >>s i.
C syntax:
ae_int32x2 AE_SRAI32 (ae_int32x2 d0, immediate i);
AE_SRAI32R d, d0, i [ fusion_slot0]
Shift right arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR register
d0 by immediate value, with result placed in d. The result is rounded corresponding to ITU
intrinsic L_shr_r.
C syntax:
ae_int32x2 AE_SRAI32R (ae_int32x2 d0, immediate i);
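As an illustration of the rounding, the sketch below follows the ITU-T basic-operator definition of L_shr_r for a positive shift amount: shift right arithmetically, then add 1 if the most significant discarded bit was set. It is a portable-C model of one lane, not the hardware implementation; the function name and the range 0 ≤ i ≤ 31 are assumptions.

```c
#include <stdint.h>

/* Arithmetic right shift with rounding of one 32-bit lane, for 0 <= i <= 31. */
static int32_t model_srai32r_lane(int32_t x, unsigned i)
{
    int32_t shifted;
    if (i == 0)
        return x;
    /* Portable arithmetic right shift (floor division by 2^i). */
    shifted = x < 0 ? ~(~x >> i) : x >> i;
    /* Bit i-1 is the most significant bit that the shift discards. */
    if (((uint32_t)x >> (i - 1)) & 1u)
        shifted += 1;
    return shifted;
}
```

With this rule, exact halves round toward positive infinity: 5 >> 1 gives 3, while -5 >> 1 gives -2.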
AE_SLAI32S d, d0, i [ fusion_slot0, Inst]
Shift left, saturating, element-wise, two signed 32-bit elements of AE_DR register d0 by
immediate value, with result placed in d. In case of saturation, state AE_OVERFLOW is set
to 1.
d.L = saturate32(d0.L << i);
d.H = saturate32(d0.H << i).
C syntax:
ae_f32x2 AE_SLAI32S (ae_f32x2 d0, immediate i);
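The saturate32() step can be modeled by computing the shifted value in a wider type and clamping it. This is a minimal sketch under stated assumptions: the function name and the overflow pointer (standing in for state AE_OVERFLOW) are hypothetical, and i is assumed to lie in 0..31.

```c
#include <stdint.h>

/* Saturating left shift of one signed 32-bit lane.
 * *overflow models AE_OVERFLOW: it is set to 1 on saturation, never cleared here. */
static int32_t model_slai32s_lane(int32_t x, unsigned i, int *overflow)
{
    /* Exact in 64 bits for 0 <= i <= 31. */
    int64_t wide = (int64_t)x * ((int64_t)1 << i);
    if (wide > INT32_MAX) { *overflow = 1; return INT32_MAX; }
    if (wide < INT32_MIN) { *overflow = 1; return INT32_MIN; }
    return (int32_t)wide;
}
```

Note that shifting -0x40000000 left by 1 yields exactly the most negative 32-bit value, so it does not count as saturation in this model.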
AE_SLAA32 d, d0, a0 [fusion_slot0, Inst ]
Shift left or right arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR
register d0 by AR register a0, with result placed in d. For a positive shift amount, the value
is shifted to the left. In case of a negative shift amount, the value is shifted to the right and
sign-extended.
d.L = (a0 ≥ 0) ? (d0.L << a0) : (d0.L >>s −a0);
d.H = (a0 ≥ 0) ? (d0.H << a0) : (d0.H >>s −a0).
C syntax:
ae_int32x2 AE_SLAA32 (ae_int32x2 d0, int32 sa);
AE_SRLA32 d, d0, a0 [fusion_slot0 ]
Shift right or left logical (zero-extending), element-wise, two 32-bit elements of AE_DR
register d0 by AR register a0, with result placed in d. For a positive shift amount, the value
is shifted to the right. In case of a negative shift amount, the value is shifted to the left.
d.L = (a0 ≥ 0) ? (d0.L >>u a0) : (d0.L << −a0);
d.H = (a0 ≥ 0) ? (d0.H >>u a0) : (d0.H << −a0).
C syntax:
ae_int32x2 AE_SRLA32 (ae_int32x2 d0, int32 a0);
AE_SRAA32 d, d0, a0 [fusion_slot0, Inst ]
Shift right or left arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR
register d0 by AR register a0, with result placed in d. For a positive shift amount, the value
is shifted to the right. In case of a negative shift amount, the value is shifted to the left.
d.L = (a0 ≥ 0) ? (d0.L >>s a0) : (d0.L << −a0);
d.H = (a0 ≥ 0) ? (d0.H >>s a0) : (d0.H << −a0).

C syntax:
ae_int32x2 AE_SRAA32 (ae_int32x2 d0, int32 sa);
AE_SLAA32S d, d0, a0 [fusion_slot0, Inst ]
Shift left or right arithmetic (sign-extending), saturating, element-wise, two 32-bit elements of
AE_DR register d0 by AR register a0, with result placed in d. For a positive shift amount, the
value is shifted to the left. In case of a negative shift amount, the value is shifted to the right
and sign-extended. In case of saturation, state AE_OVERFLOW is set to 1.
d.L = (a0 ≥ 0) ? saturate32(d0.L << a0) : (d0.L >>s −a0);
d.H = (a0 ≥ 0) ? saturate32(d0.H << a0) : (d0.H >>s −a0).
C syntax:
ae_f32x2 AE_SLAA32S (ae_f32x2 d0, int32 a0);
AE_SLAS32 d, d0 [ fusion_slot0]
Shift left or right arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR
register d0 by the shift amount register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the left. For a negative shift amount, the value is shifted to
the right and sign-extended.
d.L = (SAR ≥ 0) ? (d0.L << SAR) : (d0.L >>s −SAR);
d.H = (SAR ≥ 0) ? (d0.H << SAR) : (d0.H >>s −SAR).
C syntax:
ae_int32x2 AE_SLAS32 (ae_int32x2 d0);
AE_SRAA32RS d, d0, a0 [fusion_slot0 ]
Shift right or left arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR
register d0 by AR register a0, with result placed in d. For a positive shift amount, the value
is shifted to the right. For a negative shift amount, the value is shifted to the left. The result
is rounded corresponding to ITU intrinsic L_shr_r.
C syntax:
ae_f32x2 AE_SRAA32RS (ae_f32x2 d0, int32 a0);
AE_SRAA32S d, d0, a0 [fusion_slot0, Inst ]
Shift right arithmetic (sign-extending), saturating, element-wise, two 32-bit elements of
AE_DR register d0 by AR register a0, with result placed in d corresponding to ITU intrinsic
L_shr.
C syntax:
ae_f32x2 AE_SRAA32S (ae_f32x2 d0, int32 a0);
AE_SRLS32 d, d0 [ fusion_slot0]
Shift right or left logical (zero-extending), element-wise, two 32-bit elements of AE_DR register
d0 by the shift amount in register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the right. For a negative shift amount, the value is shifted to
the left.
d.L = (SAR ≥ 0) ? (d0.L >>u SAR) : (d0.L << −SAR);
d.H = (SAR ≥ 0) ? (d0.H >>u SAR) : (d0.H << −SAR).
C syntax:
ae_int32x2 AE_SRLS32 (ae_int32x2 d0);
AE_SRAS32 d, d0 [ fusion_slot0]
Shift right or left arithmetic (sign-extending), element-wise, two 32-bit elements of AE_DR
register d0 by the shift amount register AE_SAR, with result placed in d. For a positive shift
amount, the value is shifted to the right. In case of a negative shift amount, the value is shifted
to the left.
d.L = (SAR ≥ 0) ? (d0.L >>s SAR) : (d0.L << −SAR);
d.H = (SAR ≥ 0) ? (d0.H >>s SAR) : (d0.H << −SAR).
C syntax:
ae_int32x2 AE_SRAS32 (ae_int32x2 d0);
AE_SLAS32S d, d0 [ fusion_slot0]
Shift left or right arithmetic (sign-extending), saturating, element-wise, two 32-bit elements of
AE_DR register d0 by the shift amount register AE_SAR, with result placed in d. For a
positive shift amount, the value is shifted to the left. For a negative shift amount, the value is
shifted to the right and sign-extended. In case of saturation, state AE_OVERFLOW is set to
1.
d.L = (SAR ≥ 0) ? saturate32(d0.L << SAR) : (d0.L >>s −SAR);
d.H = (SAR ≥ 0) ? saturate32(d0.H << SAR) : (d0.H >>s −SAR).
C syntax:
ae_f32x2 AE_SLAS32S (ae_f32x2 d0);
AE_SLAI64 d, d0, i [ fusion_slot0, Inst]
Shift left, 64-bit AE_DR register d0 by immediate value, with result placed in d.
d = d0 << i
Note: C intrinsic AE_CVTQ56P32S_L converts a signed 1.31-bit value in d0.L to a 1.63-bit
value in d. It is implemented through operation AE_SLAI64 with a shift amount of 32.
C syntax:
ae_int64 AE_SLAI64 (ae_int64 d0, immediate i);
ae_int64 AE_CVTQ56P32S_L (ae_int32x2 d0);

AE_SRLI64 d, d0, i [ fusion_slot0]
Shift right logical (zero-extending), 64-bit AE_DR register d0 by immediate value, with result
placed in d.
d = d0 >>u i
C syntax:
ae_int64 AE_SRLI64 (ae_int64 d0, immediate i);
AE_SRAI64 d, d0, i [ fusion_slot0, Inst]
Shift right arithmetic (sign-extending), 64-bit AE_DR register d0 by immediate, with result
placed in d.
d = d0 >>s i
Note: C intrinsic AE_SRAIQ56 is provided to ensure HiFi 2 code portability. It is implemented
through operation AE_SRAI64.
C syntax:
ae_int64 AE_SRAI64 (ae_int64 d0, immediate i);
ae_q56s AE_SRAIQ56 (ae_q56s d0, immediate i);
AE_SLAI64S d, d0, i [ fusion_slot0]
Shift left, saturating, 64-bit AE_DR register d0 by immediate value, with result placed in d.
In case of saturation, state AE_OVERFLOW is set to 1.
d = saturate64(d0 << i)
C syntax:
ae_f64 AE_SLAI64S (ae_f64 d0, immediate i);
AE_SLAA64 d, d0, a0 [ fusion_slot0, Inst ]
Shift left or right arithmetic (sign-extending) 64-bit AE_DR register d0 by AR register a0, with
result placed in d. For a positive shift amount, the value is shifted to the left. For a negative
shift amount, the value is shifted to the right and sign-extended.
d = (a0 ≥ 0) ? (d0 << a0) : (d0 >>s −a0)
C syntax:
ae_int64 AE_SLAA64 (ae_int64 d0, int32 a0);
AE_SRLA64 d, d0, a0 [fusion_slot0 ]
Shift right or left logical (zero-extending), 64-bit AE_DR register d0 by AR register a0, with
result placed in d. For a positive shift amount, the value is shifted to the right. For a negative
shift amount, the value is shifted to the left.
d = (a0 ≥ 0) ? (d0 >>u a0) : (d0 << −a0)
C syntax:
ae_int64 AE_SRLA64 (ae_int64 d0, int32 a0);
AE_SRAA64 d, d0, a0 [ fusion_slot0, Inst ]
Shift right or left arithmetic (sign-extending) 64-bit AE_DR register d0 by AR register a0, with
result placed in d. For a positive shift amount, the value is shifted to the right. For a negative
shift amount, the value is shifted to the left.
d = (a0 ≥ 0) ? (d0 >>s a0) : (d0 << −a0)
C syntax:
ae_int64 AE_SRAA64 (ae_int64 d0, int32 a0);
AE_SLAA64S d, d0, a0 [fusion_slot0 ]
Shift left or right, arithmetic (sign-extending), saturating, 64-bit AE_DR register d0 by AR register a0,
with result placed in d. For a positive shift amount, the value is shifted to the left. In case of
a negative shift amount, the value is shifted to the right and sign-extended. In case of
saturation, state AE_OVERFLOW is set to 1.
d = (a0 ≥ 0) ? saturate64(d0 << a0) : (d0 >>s −a0)
C syntax:
ae_f64 AE_SLAA64S (ae_f64 d0, int32 a0);
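For the 64-bit saturating form there is no wider C integer type to compute in, so a model can instead test whether the operand fits before shifting. The sketch below is an assumption-laden illustration: the names asr64 and model_slaa64s are hypothetical, the overflow pointer stands in for state AE_OVERFLOW, and the shift amount is assumed to satisfy -63 ≤ a0 ≤ 63.

```c
#include <stdint.h>

/* Arithmetic right shift that avoids implementation-defined behavior. */
static int64_t asr64(int64_t v, unsigned n)
{
    return v < 0 ? ~(~v >> n) : v >> n;
}

/* d = (a0 >= 0) ? saturate64(d0 << a0) : (d0 >>s -a0) */
static int64_t model_slaa64s(int64_t d0, int a0, int *overflow)
{
    if (a0 < 0)
        return asr64(d0, (unsigned)(-a0));
    /* d0 << a0 fits in 64 bits iff d0 lies in [INT64_MIN >> a0, INT64_MAX >> a0]. */
    if (d0 > asr64(INT64_MAX, (unsigned)a0)) { *overflow = 1; return INT64_MAX; }
    if (d0 < asr64(INT64_MIN, (unsigned)a0)) { *overflow = 1; return INT64_MIN; }
    while (a0-- > 0)
        d0 *= 2;    /* cannot overflow after the range checks above */
    return d0;
}
```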
AE_SLAS64 d, d0 [ fusion_slot0]
Shift left or right arithmetic (sign-extending) the 64-bit AE_DR register d0 by the shift amount
register AE_SAR, with result placed in d. For a positive shift amount, the value is shifted to
the left. For a negative shift amount, the value is shifted to the right and sign-extended.
d = (SAR ≥ 0) ? (d0 << SAR) : (d0 >>s −SAR)
C syntax:
ae_int64 AE_SLAS64 (ae_int64 d0);
AE_SRLS64 d, d0 [ fusion_slot0]
Shift right or left, logical (zero-extending) the 64-bit AE_DR register d0 by the shift amount
register AE_SAR, with result placed in d. For a positive shift amount, the value is shifted to
the right. For a negative shift amount, the value is shifted to the left.
d = (SAR ≥ 0) ? (d0 >>u SAR) : (d0 << −SAR)
C syntax:
ae_int64 AE_SRLS64 (ae_int64 d0);
AE_SRAS64 d, d0 [ fusion_slot0]
Shift right or left arithmetic (sign-extending) the 64-bit AE_DR register d0 by the shift amount
register AE_SAR, with result placed in d. For a positive shift amount, the value is shifted to
the right. For a negative shift amount, the value is shifted to the left.
d = (SAR ≥ 0) ? (d0 >>s SAR) : (d0 << −SAR)
C syntax:
ae_int64 AE_SRAS64 (ae_int64 d0);

AE_SLAS64S d, d0 [ fusion_slot0, Inst]
Shift left or right, arithmetic (sign-extending), saturating, the 64-bit AE_DR register d0 by the
shift amount register AE_SAR, with result placed in d. For a positive shift amount, the value
is shifted to the left. For a negative shift amount, the value is shifted to the right and
sign-extended. In case of saturation, state AE_OVERFLOW is set to 1.
d = (SAR ≥ 0) ? saturate64(d0 << SAR) : (d0 >>s −SAR)
C syntax:
ae_f64 AE_SLAS64S (ae_f64 d0);
AE_SRA64_32 d, s, sa [ fusion_slot0 ]
Convert a 1.31 variable s into a 17.47 variable by sign-extending the MSB by 16 bits and
shifting left by 16 bits, filling the LSBs with 0. Then shift right using the 4-bit shift amount
in AR register sa and pick the 64 LSBs.
C syntax:
ae_int64 AE_SRA64_32 (ae_int32x2 s, uint32 sa);

2.9 HiFi 2 Shift Operations


The Fusion DSP ISA provides a set of shift operations for efficient execution of HiFi 2 target
code. The 56-bit HiFi 2 shift operations process the 56 LSBs of the AE_DR register and sign-
extend the 56-bit result to 64 bits.
AE_SLAISQ56S d, d0, i [ fusion_slot0, Inst ]
Shift left arithmetic (sign-extending), saturating, the 56-bit signed element of AE_DR register
d0 by immediate, with result placed in d. For a positive shift amount, the value is shifted to
the left. For a negative shift amount, the value is shifted to the right. In case of saturation,
state AE_OVERFLOW is set to 1.
d = sext56(saturate56(d0[55:0] << i)).
Note: C intrinsic AE_SLLISQ56S is implemented through operation AE_SLAISQ56S.
C syntax:
ae_q56s AE_SLAISQ56S (ae_q56s d0, immediate i);
ae_q56s AE_SLLISQ56S (ae_q56s d0, immediate i);
AE_SLAAQ56 d, d0, a0 [ fusion_slot0, Inst ]
Shift left or right arithmetic (sign-extending), the 56-bit element of AE_DR register d0 by AR
register a0, with result placed in d. For a positive shift amount, the value is shifted to the left.
In case of a negative shift amount, the value is shifted to the right and sign-extended.
d = sext56((a0 ≥ 0) ? (d0[55:0] << a0) : (d0[55:0] >>s −a0)).
Note: C intrinsics AE_SLAIQ56 and AE_SLLIQ56 are implemented through operation
AE_SLAAQ56 by first assigning the immediate shift amount to an AR register. C intrinsic
AE_SLLAQ56 is implemented through operation AE_SLAAQ56; note that in the case of a
negative shift amount, this intrinsic performs an arithmetic right shift.
C syntax:
ae_q56s AE_SLAAQ56 (ae_q56s d0, int32 a0);
ae_q56s AE_SLAIQ56 (ae_q56s d0, immediate i);
ae_q56s AE_SLLAQ56 (ae_q56s d0, ae_int32 sa);
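The 56-bit operations use only d0[55:0] and sign-extend the 56-bit result to 64 bits. A portable-C sketch of the AE_SLAAQ56 formula follows; the names sext56 and model_slaaq56 are hypothetical, and the shift amount is assumed to have a magnitude below 56.

```c
#include <stdint.h>

/* Sign-extend the low 56 bits of v to 64 bits. */
static int64_t sext56(uint64_t v)
{
    const uint64_t sign = (uint64_t)1 << 55;
    v &= ((uint64_t)1 << 56) - 1;
    return (v & sign) ? (int64_t)v - ((int64_t)1 << 56) : (int64_t)v;
}

/* d = sext56((a0 >= 0) ? (d0[55:0] << a0) : (d0[55:0] >>s -a0)) */
static int64_t model_slaaq56(int64_t d0, int a0)
{
    int64_t x = sext56((uint64_t)d0);       /* bits 63:56 of d0 are ignored */
    if (a0 >= 0)
        return sext56((uint64_t)x << a0);   /* wraps at 56 bits, no saturation */
    return sext56((uint64_t)(x < 0 ? ~(~x >> -a0) : x >> -a0));
}
```

A left shift that carries into bit 55 changes the sign of the result, which is the case the saturating form AE_SLAASQ56S is meant to catch.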
AE_SRLAQ56 d, d0, a0 [ fusion_slot0, Inst ]
Shift right or left logical (zero-extending) the 56-bit element of AE_DR register d0 by AR
register a0, with result placed in d. For a positive shift amount, the value is shifted to the
right. For a negative shift amount, the value is shifted to the left.
d = sext56((a0 ≥ 0) ? (d0[55:0] >>u a0) : (d0[55:0] << −a0)).
Note: C intrinsic AE_SRLIQ56 is implemented through operation AE_SRLAQ56 by first
assigning the immediate shift amount to an AR register.
C syntax:
ae_q56s AE_SRLAQ56 (ae_q56s d0, int32 a0);
ae_q56s AE_SRLIQ56 (ae_q56s d0, immediate i);
AE_SRAAQ56 d, d0, a0 [ fusion_slot0, Inst ]
Shift right or left arithmetic (sign-extending) the 56-bit element of AE_DR register d0 by AR
register a0, with result placed in d. For a positive shift amount, the value is shifted to the
right. In case of a negative shift amount, the value is shifted to the left.
d = sext56((a0 ≥ 0) ? (d0[55:0] >>s a0) : (d0[55:0] << −a0)).
C syntax:
ae_q56s AE_SRAAQ56 (ae_q56s d0, int32 a0);
AE_SLAASQ56S d, d0, a0 [ fusion_slot0, Inst ]
Shift left or right arithmetic (sign-extending), saturating the 56-bit element of AE_DR register
d0 by AR register a0, with result placed in d. For a positive shift amount, the value is shifted
to the left. In case of a negative shift amount, the value is shifted to the right and sign-
extended. In case of saturation, state AE_OVERFLOW is set to 1.
d = sext56((a0 ≥ 0) ? saturate56(d0[55:0] << a0) : (d0[55:0] >>s −a0)).
Note: C intrinsic AE_SLLASQ56S is implemented through operation AE_SLAASQ56S; note
that in the case of a negative shift amount, this intrinsic performs an arithmetic right shift.
C syntax:
ae_q56s AE_SLAASQ56S (ae_q56s d0, int32 a0);
ae_q56s AE_SLLASQ56S (ae_q56s d0, int32 a0);

AE_SLASQ56 d, d0 [ fusion_slot0, Inst ]
Shift left or right arithmetic (sign-extending) the 56-bit element of AE_DR register d0 by shift
amount register AE_SAR, with result placed in d. For a positive shift amount, the value is
shifted to the left. In case of a negative shift amount, the value is shifted to the right and
sign-extended.
d = sext56((SAR ≥ 0) ? (d0[55:0] << SAR) : (d0[55:0] >>s −SAR)).
Note: C intrinsic AE_SLLSQ56 is implemented through operation AE_SLASQ56; note that
in the case of a negative shift amount, this intrinsic performs an arithmetic right shift.
C syntax:
ae_q56s AE_SLASQ56 (ae_q56s d0);
ae_q56s AE_SLLSQ56 (ae_q56s d0);
AE_SRLSQ56 d, d0 [ fusion_slot0 ]
Shift right or left logical (zero-extending) the 56-bit element of AE_DR register d0 by shift
amount register AE_SAR, with result placed in d. For a positive shift amount, the value is
shifted to the right. For a negative shift amount, the value is shifted to the left.
d = sext56((SAR ≥ 0) ? (d0[55:0] >>u SAR) : (d0[55:0] << −SAR)).
C syntax:
ae_q56s AE_SRLSQ56 (ae_q56s d0);

AE_SRASQ56 d, d0 [ fusion_slot0, Inst ]
Shift right or left arithmetic (sign-extending) the 56-bit element of AE_DR register d0 by shift
amount register AE_SAR, with result placed in d. For a positive shift amount, the value is
shifted to the right. For a negative shift amount, the value is shifted to the left.
d = sext56((SAR ≥ 0) ? (d0[55:0] >>s SAR) : (d0[55:0] << −SAR)).
C syntax:
ae_q56s AE_SRASQ56 (ae_q56s d0);
AE_SLASSQ56S d, d0 [ fusion_slot0 ]
Shift left or right arithmetic (sign-extending), saturating the 56-bit element of AE_DR register
d0 by shift amount register AE_SAR, with result placed in d. For a positive shift amount, the
value is shifted to the left. In case of a negative shift amount, the value is shifted to the right
and sign-extended. In case of saturation, state AE_OVERFLOW is set to 1.
d = sext56((SAR ≥ 0) ? saturate56(d0[55:0] << SAR) : (d0[55:0] >>s −SAR)).
Note: C intrinsic AE_SLLSSQ56S is implemented through operation AE_SLASSQ56S; note
that in the case of a negative shift amount, this intrinsic performs an arithmetic right shift.
C syntax:
ae_q56s AE_SLASSQ56S (ae_q56s d0);
ae_q56s AE_SLLSSQ56S (ae_q56s d0);
2.10 Normalize Shift Amount Operation


AE_NSA64 a, d0 [ fusion_slot0, Inst ]
Calculate the left shift amount that will normalize (maximize the value that can be represented
without overflow) the two's complement contents of an AE_DR register and write the amount
(in the range of 0 to 63) to AR register a. If d0 contains 0 or -1, return 63. To calculate the
normalization exponent for a 9.55 fixed-point number, subtract 8 from the result. If the result
is negative, a right shift is required for normalization.
Note: C intrinsic AE_NSAQ56S is provided to ensure HiFi 2 code portability. It is
implemented by subtracting 8 from the result of operation AE_NSA64.
C syntax:
int AE_NSA64 (ae_int64 d0);
int AE_NSAQ56S (ae_q56s d0);
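The normalization count equals the number of redundant sign bits below bit 63. The sketch below models that description in portable C; model_nsa64 and model_nsaq56s are hypothetical names, not the intrinsics themselves.

```c
#include <stdint.h>

/* Left-shift amount (0..63) that normalizes a 64-bit two's complement value. */
static int model_nsa64(int64_t d0)
{
    /* Fold negative inputs onto non-negative ones so that redundant
     * sign bits become leading zeros. */
    uint64_t v = (uint64_t)(d0 < 0 ? ~d0 : d0);
    int n = 0;
    while (n < 63 && ((v >> (62 - n)) & 1u) == 0)
        n++;
    return n;   /* 63 for inputs 0 and -1 */
}

/* 9.55 variant as described in the note: subtract 8 from the 64-bit count. */
static int model_nsaq56s(int64_t d0)
{
    return model_nsa64(d0) - 8;
}
```

A negative return value from the 9.55 variant indicates that the value occupies the 8 guard bits and needs a right shift to normalize.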
AE_NSAZ32.L a, d0 [ fusion_slot0, Inst ]
Calculate the left shift amount that will normalize the two's complement contents of the
lower 32 bits of an AE_DR register and write the amount (in the range of 0 to 31) to AR
register a. If d0 contains 0, return 0.
C syntax:
int AE_NSAZ32_L (ae_int32x2 d0);
AE_NSAZ16.0 a, d0 [ fusion_slot0, Inst ]
Calculate the left shift amount that will normalize the two's complement contents of the
lower 16 bits of an AE_DR register and write the amount (in the range of 0 to 15) to AR
register a. If d0 contains 0, return 0.
C syntax:
int AE_NSAZ16_0 (ae_int16x4 d0);

2.11 Divide Step Operation


AE_DIV64D32.L d, d0 [ fusion_slot1, Inst ]
AE_DIV64D32.H d, d0 [ fusion_slot1 ]
Perform a 1-bit divide-step operation. Shift left the 64-bit input value d by 1. If the unsigned
32-bit L (H) element of AE_DR register d0 is greater than the unsigned value in the 32 MSBs
of the shifted value, the shifted value is placed in AE_DR register d; otherwise, d0.L (d0.H)
is subtracted from the 32-bit MSBs of the shifted value, the LSB of the shifted value is set to
1 and the result is placed in AE_DR register d:
d = { d[62:31], d[30:0], 1’b0 }, if d0.L >u d[62:31];
d = { d[62:31] − d0.L, d[30:0], 1’b1 }, otherwise.

Note: This instruction is designed to work only when d >= 0, d0.L(.H) > 0 and d <= d0.L(.H).

C syntax:
void AE_DIV64D32_L (ae_int64 d, ae_int32x2 d0);
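The divide step is one iteration of restoring division. The following portable-C sketch models a single step and, as an assumed usage pattern that the text above does not spell out, repeats it 32 times to divide a 32-bit dividend placed in the lower half of d. The function names are hypothetical, and both operands are assumed to be positive and below 2^31 so that the upper 32 bits of d stay below the divisor.

```c
#include <stdint.h>

/* One divide step: shift d left by 1, then conditionally subtract the
 * divisor from the upper 32 bits and record a quotient bit in the LSB. */
static uint64_t model_div_step(uint64_t d, uint32_t divisor)
{
    uint64_t shifted = d << 1;
    uint32_t hi = (uint32_t)(shifted >> 32);
    if (divisor > hi)
        return shifted;                                  /* quotient bit 0 */
    return (shifted - ((uint64_t)divisor << 32)) | 1u;   /* quotient bit 1 */
}

/* 32 steps: remainder ends up in bits 63:32, quotient in bits 31:0. */
static uint64_t model_divide32(uint32_t dividend, uint32_t divisor)
{
    uint64_t d = dividend;
    int i;
    for (i = 0; i < 32; i++)
        d = model_div_step(d, divisor);
    return d;
}
```

Dividing 100 by 7 this way leaves quotient 14 in the lower half and remainder 2 in the upper half.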

2.12 Truncate, Round, Saturate, Convert, and Move Operations

AE_TRUNCA32Q48 a, d0 [ fusion_slot0, Inst ]
Truncate a 17.47 AE_DR fraction d0 to a 32-bit (1.31) AR fraction in a. The 16 MSBs and
the 16 LSBs of the input 64-bit value are discarded. This operation is provided to ensure
HiFi 2 code compatibility.
C syntax:
int AE_TRUNCA32Q48 (ae_q56s d0);
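Discarding the 16 MSBs and 16 LSBs amounts to extracting bits 47:16 of the 64-bit value. A minimal portable-C sketch, with a hypothetical function name:

```c
#include <stdint.h>

/* Keep d0[47:16] of a 17.47 value and return it as a signed 1.31 value. */
static int32_t model_trunca32q48(int64_t d0)
{
    uint32_t bits = (uint32_t)((uint64_t)d0 >> 16);
    /* Reinterpret the bit pattern as signed without an
     * implementation-defined conversion. */
    return bits <= 0x7FFFFFFFu ? (int32_t)bits
                               : (int32_t)(bits - 0x80000000u) + INT32_MIN;
}
```

For example, 0.5 in 17.47 format (2^46) maps to 0.5 in 1.31 format (0x40000000). No saturation is involved, so a 17.47 value outside the 1.31 range simply loses its upper bits.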
AE_TRUNC32X2F64 d, dh, dl [ fusion_slot1 ]
Truncate the two 1.63-bit fixed-point fractions from registers dh and dl to two 1.31-bit fixed-
point fraction elements in AE_DR register d. This is an intrinsic implemented using the
AE_SELI16 instruction.
C syntax:
ae_f32x2 AE_TRUNC32X2F64 (ae_f64 dh, ae_f64 dl);
AE_TRUNCA32X2F64S d, dh, dl, a [ Inst ]
Shift left or right arithmetic (sign-extending), two signed 64-bit values from AE_DR registers
dh and dl by AR register shift amount a; each shifted value is saturated to 64 bits and the 32
MSBs of the two results are stored in the two 32-bit elements of AE_DR register d. For a
positive shift amount, the value is shifted to the left. For a negative shift amount, the value is
sign-extended and shifted to the right. In case of saturation, state AE_OVERFLOW is set to
1.
Note: C intrinsic AE_TRUNCA32F64S performs the same operation on a single input AE_DR
register and replicates the result in the two 32-bit AE_DR elements. It is implemented through
operation AE_TRUNCA32X2F64S.
C syntax:
ae_int32x2 AE_TRUNCA32X2F64S (ae_int64 dh, ae_int64 dl, int a);
ae_int32x2 AE_TRUNCA32F64S (ae_int64 d0, int a);
AE_TRUNCI32F64S.L d, d0, d1, i [ fusion_slot40 ]
Shift left the signed 64-bit AE_DR register value d1 by immediate shift amount i, saturate it
to 64 bits and store the 32 MSBs of the result into the L element of AE_DR register d; store
the L element of AE_DR register d0 into the H element of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_int32x2 AE_TRUNCI32F64S_L (ae_int32x2 d0, ae_int64 d1,
immediate i);
AE_TRUNCI32X2F64S d, d0, d1, i [ fusion_slot40 ]
Shift left two signed 64-bit AE_DR register values d0 and d1 by immediate shift amount i,
saturate each shifted value to 64 bits and store the 32 MSBs of each result into the two 32-
bit elements of the AE_DR register d. In case of saturation, state AE_OVERFLOW is set to
1.
C syntax:
ae_int32x2 AE_TRUNCI32X2F64S (ae_int64 d0, ae_int64 d1,
immediate i);
AE_TRUNCI16X4F32S d, dh, dl, i [ fusion_slot40 ]
Shift left four signed 32-bit AE_DR register values from each half of dh and dl by immediate
shift amount i, saturate each shifted value to 32 bits and store the 16 MSBs of each result
into four 16-bit elements of the AE_DR register d. In case of saturation, state
AE_OVERFLOW is set to 1.
C syntax:
ae_int16x4 AE_TRUNCI16X4F32S (ae_int32x2 dh, ae_int32x2 dl,
immediate i);
AE_TRUNCA32F64S.L d, d0, d1, a [ fusion_slot40 ]
Shift left or right arithmetic (sign-extending), 64-bit AE_DR register value d1 by AR register
shift amount a; the shifted value is saturated to 64 bits and the 32 MSBs of the result are
stored into the L element of AE_DR register d; store the L element of AE_DR register d0 into
the H element of AE_DR register d. For a positive shift amount, the value is shifted to the left.
For a negative shift amount, the value is sign-extended and shifted to the right. In case of
saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_int32x2 AE_TRUNCA32F64S_L (ae_int32x2 d0, ae_int64 d1,
int a);
AE_TRUNCP24Q48X2 d, dh, dl [ fusion_slot1 ]
Truncate two 17.47-bit fixed-point fractions from AE_DR registers dh and dl into two 1.23-bit
fixed-point fractions, sign-extend them to 9.23-bit values and store in the two 32-bit elements
of AE_DR register d. This operation is provided to ensure HiFi 2 code compatibility.
Note: C intrinsic AE_TRUNCP24Q48 truncates and replicates a single 17.47-bit fixed-point
fraction in AE_DR to two 9.23-bit fixed-point fractions in AE_DR. It is implemented through
operation AE_TRUNCP24Q48X2.
C syntax:
ae_p24x2s AE_TRUNCP24Q48X2 (ae_q56s dh, ae_q56s dl);
ae_p24x2s AE_TRUNCP24Q48 (ae_q56s d0);

AE_TRUNCP24A32X2 d, ah, al [ fusion_slot0, Inst ]
Truncate and sign-extend two 1.31-bit fixed-point fractions from AR registers ah and
al to two 9.23-bit fixed-point fraction elements in AE_DR register d. This operation is
provided to ensure HiFi 2 code compatibility.
C syntax:
ae_int24x2 AE_TRUNCP24A32X2 (unsigned ah, unsigned al);
AE_TRUNCA16P24S.L (.H) a, d0 [ fusion_slot0, Inst ]
Truncate a 9.23-bit AE_DR fraction in d0.L (d0.H) to 1.15 bits, sign-extend it to 17.15 bits
and store it into AR register a. The 8 MSBs and the 8 LSBs of the input value are discarded.
This operation is provided to ensure HiFi 2 code compatibility.
C syntax:
int AE_TRUNCA16P24S_L (ae_int24x2 d0);
AE_TRUNCP16 d, d0 [ fusion_slot1 ]
Truncate two 9.23-bit fixed-point fractions in AE_DR register d0 to 1.15-bits, and sign-extend
them to 9.23-bit fractions into AE_DR register d. The 8 MSBs and the 8 LSBs of the input
value are discarded. The 8 LSBs of the result are set to zero. This operation is provided to
ensure HiFi 2 code compatibility.
C syntax:
ae_int24x2 AE_TRUNCP16 (ae_int24x2 d0);
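Per lane, the truncation keeps bits 23:8 of the input, clears the 8 LSBs, and sign-extends from bit 23. The sketch below models one lane in portable C; the function name is hypothetical and the lane is held in an int32_t.

```c
#include <stdint.h>

/* One lane of AE_TRUNCP16: 9.23 in, 1.15 precision kept, 9.23 out. */
static int32_t model_truncp16_lane(int32_t lane)
{
    uint32_t kept = (uint32_t)lane & 0x00FFFF00u;   /* drop 8 MSBs and 8 LSBs */
    /* Sign-extend from bit 23 back to a 9.23 value. */
    return (kept & 0x00800000u) ? (int32_t)kept - 0x01000000 : (int32_t)kept;
}
```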
AE_TRUNC16X4F32 d, dh, dl [ fusion_slot1 ]
Truncate the four 1.31-bit fixed-point fractions from registers dh and dl to four 1.15-bit
fixed-point fraction elements in AE_DR register d. This is an intrinsic implemented using the
AE_SELI16 instruction.
C syntax:
ae_f16x4 AE_TRUNC16X4F32 (ae_f32x2 dh, ae_f32x2 dl);
AE_TRUNCQ32 d, d0 [ fusion_slot1 ]
Truncate (set to zero) the 16 least significant bits of a 17.47-bit fixed-point number in AE_DR
register d0 with the result placed into AE_DR register d. This operation is provided to ensure
HiFi 2 code compatibility.
C syntax:
ae_q56s AE_TRUNCQ32 (ae_q56s d0);
AE_ROUND32X2F64SSYM d, dh, dl [ fusion_slot1 ]
Round symmetrically (away from 0), saturate the 1.63-bit values from AE_DR registers dh
and dl to 1.31-bit values, and store the results in the two elements of AE_DR register d. In
case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND32F64SSYM is implemented through operation
AE_ROUND32X2F64SSYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f32x2 AE_ROUND32X2F64SSYM (ae_f64 dh, ae_f64 dl);
ae_f32x2 AE_ROUND32F64SSYM (ae_f64 d0);
AE_ROUND32X2F64SASYM d, dh, dl [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 1.63-bit values from AE_DR registers dh and dl to 1.31-
bit values, and store the results in the two elements of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND32F64SASYM is implemented through operation
AE_ROUND32X2F64SASYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f32x2 AE_ROUND32X2F64SASYM (ae_f64 dh, ae_f64 dl);
ae_f32x2 AE_ROUND32F64SASYM (ae_f64 d0);
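The two rounding modes differ only for exact ties on negative values. The sketch below contrasts them for a 1.63 to 1.31 conversion in portable C. It rests on an assumption that this section does not state explicitly: "asymmetric" is taken to mean that ties round toward positive infinity (add half an LSB, then truncate). The function names are hypothetical, and the overflow pointer stands in for state AE_OVERFLOW.

```c
#include <stdint.h>

/* Asymmetric: ties toward +infinity (assumed meaning), with saturation. */
static int32_t model_round32_f64_asym(int64_t x, int *overflow)
{
    const int64_t half = (int64_t)1 << 31;
    int64_t sum;
    if (x > INT64_MAX - half) { *overflow = 1; return INT32_MAX; }
    sum = x + half;
    /* floor(sum / 2^32) using a portable arithmetic shift */
    return (int32_t)(sum < 0 ? ~(~sum >> 32) : sum >> 32);
}

/* Symmetric: ties away from 0, with saturation. */
static int32_t model_round32_f64_sym(int64_t x, int *overflow)
{
    /* Round the magnitude so that both signs behave the same way. */
    uint64_t mag = x < 0 ? (uint64_t)0 - (uint64_t)x : (uint64_t)x;
    uint64_t r = (mag + ((uint64_t)1 << 31)) >> 32;   /* 0 .. 2^31 */
    if (x < 0)
        return r == 0x80000000u ? INT32_MIN : -(int32_t)r;
    if (r > 0x7FFFFFFFu) { *overflow = 1; return INT32_MAX; }
    return (int32_t)r;
}
```

An input of exactly minus half an output LSB rounds to 0 in the asymmetric model and to -1 in the symmetric one; positive ties round up in both.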
AE_ROUNDSP16F24SYM d, d0 [ fusion_slot1 ]
Round symmetrically (away from 0), saturate each 9.23-bit element of AE_DR register d0 to
a 1.15-bit value, sign-extend it and store the results as 9.23-bit values in the two elements of
AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSP16SYM is implemented through operation
AE_ROUNDSP16F24SYM and is provided to ensure HiFi 2 code portability.
C syntax:
ae_f32x2 AE_ROUNDSP16F24SYM (ae_f32x2 d0);
ae_int24x2s AE_ROUNDSP16SYM (ae_int24x2s d0);
AE_ROUNDSP16F24ASYM d, d0 [ fusion_slot1 ]
Round asymmetrically, saturate the two 9.23-bit elements of AE_DR register d0 to 1.15-bit
values, sign-extend them and store the results as 9.23-bit values in the two elements of AE_DR
register d. In case of saturation, state AE_OVERFLOW is set to 1.

Note: C intrinsic AE_ROUNDSP16ASYM is implemented through operation
AE_ROUNDSP16F24ASYM and is provided to ensure HiFi 2 code portability.
C syntax:
ae_f32x2 AE_ROUNDSP16F24ASYM (ae_f32x2 d0);
ae_int24x2s AE_ROUNDSP16ASYM (ae_int24x2s d0);
AE_ROUND32X2F48SSYM d, dh, dl [ fusion_slot1 ]
Round symmetrically (away from 0), saturate the 17.47-bit values from AE_DR registers dh
and dl to 1.31-bit values and store the results into the two elements of AE_DR register d.
In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND32F48SSYM is implemented through operation
AE_ROUND32X2F48SSYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.

C syntax:
ae_f32x2 AE_ROUND32X2F48SSYM (ae_f64 dh, ae_f64 dl);
ae_f32x2 AE_ROUND32F48SSYM (ae_f64 d0);
AE_ROUND32X2F48SASYM d, dh, dl [ fusion_slot1 ]
Round asymmetrically, saturate the 17.47-bit values from AE_DR registers dh and dl to 1.31-
bit values and store the results into the two elements of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND32F48SASYM is implemented through operation
AE_ROUND32X2F48SASYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f32x2 AE_ROUND32X2F48SASYM (ae_f64 dh, ae_f64 dl);
ae_f32x2 AE_ROUND32F48SASYM (ae_f64 d0);
AE_ROUND24X2F48SSYM d, dh, dl [ fusion_slot1, Inst ]
Round symmetrically (away from 0), saturate the 17.47-bit values from AE_DR registers dh
and dl to 1.23-bit values, sign-extend them and store the results as 9.23-bit values in the two
elements of AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND24F48SSYM is implemented through operation
AE_ROUND24X2F48SSYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f24x2 AE_ROUND24X2F48SSYM (ae_f64 dh, ae_f64 dl);
ae_f24x2 AE_ROUND24F48SSYM (ae_f64 d0);
AE_ROUND24X2F48SASYM d, dh, dl [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 17.47-bit values from AE_DR registers dh and dl to 1.23-
bit values, sign-extend them and store the results as 9.23-bit values in the two elements of
AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUND24F48SASYM is implemented through operation
AE_ROUND24X2F48SASYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f24x2 AE_ROUND24X2F48SASYM (ae_f64 dh, ae_f64 dl);
ae_f24x2 AE_ROUND24F48SASYM (ae_f64 d0);
AE_ROUNDSP16Q48X2SYM d, dh, dl [ fusion_slot1 ]
Round symmetrically (away from 0), saturate the 17.47-bit values from AE_DR registers dh
and dl to 1.15-bit values, sign-extend them and store the results as 9.23-bit values in the two
elements of AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSP16Q48SYM is implemented through operation
AE_ROUNDSP16Q48X2SYM; it rounds a single input AE_DR value and replicates the result
in the two elements of the output AE_DR register.
C syntax:
ae_f24x2 AE_ROUNDSP16Q48X2SYM (ae_f64 dh, ae_f64 dl);
ae_f24x2 AE_ROUNDSP16Q48SYM (ae_f64 d0);
AE_ROUNDSP16Q48X2ASYM d, dh, dl [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 17.47-bit values from AE_DR registers dh and dl to 1.15-
bit values, sign-extend them and store the results as 9.23-bit values in the two elements of
AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSP16Q48ASYM is implemented through operation
AE_ROUNDSP16Q48X2ASYM; it rounds a single input AE_DR value and replicates the
result in the two elements of the output AE_DR register.
C syntax:
ae_f24x2 AE_ROUNDSP16Q48X2ASYM (ae_f64 dh, ae_f64 dl);
ae_f24x2 AE_ROUNDSP16Q48ASYM (ae_f64 d0);
AE_ROUND16X4F32SASYM d, dh, dl [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 1.31-bit values from AE_DR registers dh and dl to 1.15-
bit values, and store the results in the four elements of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f16x4 AE_ROUND16X4F32SASYM (ae_f32x2 dh, ae_f32x2 dl);
AE_ROUND16X4F32SSYM d, dh, dl [ fusion_slot1 ]
Round symmetrically (away from 0), saturate the 1.31-bit values from AE_DR registers dh and dl to 1.15-
bit values, and store the results in the four elements of AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f16x4 AE_ROUND16X4F32SSYM (ae_f32x2 dh, ae_f32x2 dl);
AE_ROUNDSQ32F48SYM d, d0 [ fusion_slot1, Inst ]
Round symmetrically (away from 0), saturate the 17.47-bit value from AE_DR register d0 to
a 1.31-bit value, sign-extend it and store the result as 17.47-bit value in AE_DR register d.
In case of saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSQ32SYM is implemented through operation
AE_ROUNDSQ32F48SYM and is provided to ensure HiFi 2 code portability.
C syntax:
ae_f64 AE_ROUNDSQ32F48SYM (ae_f64 d0);
ae_q56s AE_ROUNDSQ32SYM (ae_q56s d0);


AE_ROUNDSQ32F48ASYM d, d0 [ fusion_slot1, Inst ]
Round asymmetrically, saturate the 17.47-bit value from AE_DR register d0 to a 1.31-bit
value, sign-extend it and store the result as a 17.47-bit value in AE_DR register d. In case of
saturation, state AE_OVERFLOW is set to 1.
Note: C intrinsic AE_ROUNDSQ32ASYM is implemented through operation
AE_ROUNDSQ32F48ASYM and is provided to ensure HiFi 2 code portability.
C syntax:
ae_f64 AE_ROUNDSQ32F48ASYM (ae_f64 d0);
ae_q56s AE_ROUNDSQ32ASYM (ae_q56s d0);
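
The difference between the symmetric and asymmetric rounding modes shows up only on exact ties. The sketch below is a portable C reference model, not the Cadence intrinsic: the function name `model_round_q47_to_q31` and its `overflow` flag (a stand-in for state AE_OVERFLOW) are hypothetical, and it assumes that `>>` on a negative `int64_t` is an arithmetic shift, as on GCC and Clang.

```c
#include <stdint.h>

/* Reference model only: round a 17.47-bit value held in an int64_t to
 * 1.31 bits, saturate, and return the result re-aligned as 17.47 bits.
 * symmetric != 0 : ties round away from zero.
 * symmetric == 0 : ties round toward +infinity (asymmetric).
 * *overflow is set to 1 when saturation occurs (stand-in for AE_OVERFLOW). */
static int64_t model_round_q47_to_q31(int64_t x, int symmetric, int *overflow)
{
    int64_t q = x >> 16;          /* floor(x / 2^16), arithmetic shift assumed */
    int64_t rem = x & 0xFFFF;     /* discarded fraction, 0..65535 */
    int round_up;

    if (symmetric && x < 0)
        round_up = rem > 0x8000;  /* negative tie stays at floor (away from 0) */
    else
        round_up = rem >= 0x8000; /* tie goes up (toward +infinity) */
    q += round_up;

    if (q > INT32_MAX) { q = INT32_MAX; *overflow = 1; }
    if (q < INT32_MIN) { q = INT32_MIN; *overflow = 1; }
    return q * 65536;             /* sign-extended 1.31 value in 17.47 alignment */
}
```

In this model, an input of minus one half of a 1.31 LSB rounds to 0 asymmetrically and to minus one LSB symmetrically; positive ties round up in both modes.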

AE_S24RA64S.I d, a, i16 [ fusion_slot0 ]
AE_S24RA64S.IP d, a, i16 [ fusion_slot0 ]
AE_S24RA64S.X (.XC) d, a, x [ fusion_slot0 ]
AE_S24RA64S.XP d, a, x [ fusion_slot0, Inst ]
Round asymmetrically, saturate the 17.47-bit value from AE_DR register d to a 1.23-bit value
and store the result to memory in the high 24 bits of a 32-bit bundle. In case of saturation,
state AE_OVERFLOW is set to 1. This operation is equivalent to an
AE_ROUND24F48SASYM followed by a store.
C syntax:
void AE_S24RA64S_I (ae_f64 d, ae_f24 *a, immediate i16);
void AE_S24RA64S_IP (ae_f64 d, ae_f24 *a, immediate i16);
void AE_S24RA64S_X (ae_f64 d, ae_f24 *a, int x);
void AE_S24RA64S_XP (ae_f64 d, ae_f24 *a, int x);
void AE_S24RA64S_XC (ae_f64 d, ae_f24 *a, int x);


AE_S32RA64S.I d, a, i16 [ fusion_slot0 ]
AE_S32RA64S.IP d, a, i16 [ fusion_slot0 ]
AE_S32RA64S.X (.XC) d, a, x [ fusion_slot0 ]
AE_S32RA64S.XP d, a, x [ fusion_slot0, Inst ]
Round asymmetrically, saturate the 17.47-bit value from AE_DR register d to a 1.31-bit value
and store the result to memory. In case of saturation, state AE_OVERFLOW is set to 1. This
operation is equivalent to an AE_ROUNDSQ32F48ASYM followed by a store.
C syntax:
void AE_S32RA64S_I (ae_f64 d, ae_f32 *a, immediate i16);
void AE_S32RA64S_IP (ae_f64 d, ae_f32 *a, immediate i16);
void AE_S32RA64S_X (ae_f64 d, ae_f32 *a, int32 x);
void AE_S32RA64S_XP (ae_f64 d, ae_f32 *a, int32 x);
void AE_S32RA64S_XC (ae_f64 d, ae_f32 *a, int32 x);
AE_S24X2RA64S.IP d0, d1, a [ Inst ]
Round asymmetrically, saturate the two 17.47-bit values from AE_DR registers d0 and d1 to
1.23-bit values and store the results to memory in the high 24 bits of two 32-bit bundles. In
case of saturation, state AE_OVERFLOW is set to 1. This operation is equivalent to an
AE_ROUND24F48SASYM followed by a store. This instruction also post-increments the
address register by eight (implicit immediate).
C syntax:
void AE_S24X2RA64S_IP (ae_f64 d0, ae_f64 d1, ae_f24x2 *a);
AE_S32X2RA64S.IP d0, d1, a [ Inst ]
Round asymmetrically, saturate the two 17.47-bit values from AE_DR registers d0 and d1 to
1.31-bit values and store the results to memory. In case of saturation, state AE_OVERFLOW
is set to 1. This operation is equivalent to an AE_ROUND32X2F48SASYM followed by a
store. This instruction also post-increments the address register by eight (implicit immediate).
C syntax:
void AE_S32X2RA64S_IP (ae_f64 d0, ae_f64 d1, ae_int32x2 *a);
AE_SATQ56S d, d0 [ fusion_slot1 ]
Saturate the 9.55-bit value in AE_DR register d0 to a 1.55-bit value, sign-extend it and store
the result as a 9.55-bit value in AE_DR register d. In case of saturation, state
AE_OVERFLOW is set to 1.
C syntax:
ae_f64 AE_SATQ56S (ae_f64 d0);
AE_SAT48S d, d0 [ fusion_slot1, Inst ]
Saturate the 17.47-bit value in AE_DR register d0 to a 1.47-bit value, sign-extend it and store
the result as a 17.47-bit value in AE_DR register d. In case of saturation, state
AE_OVERFLOW is set to 1.


C syntax:
ae_f64 AE_SAT48S (ae_f64 d0);
ae_q56s AE_SATQ48S (ae_q56s d0);
AE_SAT24S d, d0 [ fusion_slot1 ]
Saturate the two 9.23-bit values in AE_DR register d0 to 1.23-bit values and sign-extend them
to 9.23-bit values. In case of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_f24x2 AE_SAT24S (ae_int32x2 d0);
AE_SAT16X4 d, d0,d1 [ fusion_slot1 ]
Saturate the four 32-bit integral values in AE_DR registers d0 and d1 to 16-bit integral
values and store them in AE_DR register d. In case of saturation, state AE_OVERFLOW is set to 1.
C syntax:
ae_int16x4 AE_SAT16X4 (ae_int32x2 d0, ae_int32x2 d1);
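
As a rough illustration of the saturation described above, here is a portable C model (not the intrinsic itself). The names `model_sat16` and `model_sat16x4` are hypothetical, the `overflow` flag stands in for state AE_OVERFLOW, and the mapping of array index to register element is an assumption of the sketch.

```c
#include <stdint.h>

/* Saturate one 32-bit integer to the signed 16-bit range. */
static int16_t model_sat16(int32_t x, int *overflow)
{
    if (x > INT16_MAX) { *overflow = 1; return INT16_MAX; }
    if (x < INT16_MIN) { *overflow = 1; return INT16_MIN; }
    return (int16_t)x;
}

/* Assumed layout: in[0], in[1] come from d0 and in[2], in[3] from d1;
 * out[i] receives the saturated in[i]. */
static void model_sat16x4(const int32_t in[4], int16_t out[4], int *overflow)
{
    for (int i = 0; i < 4; i++)
        out[i] = model_sat16(in[i], overflow);
}
```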
AE_SEXT32 d, d0,i [ fusion_slot0 ]
Sign-extend (SIMD). Takes the contents of each 32-bit element of register d0 and replicates
the bit specified by its immediate operand (in the range 7 to 22) to the high bits and writes
the results to register d.
C syntax:
ae_int32x2 AE_SEXT32 (ae_int32x2 d0, immediate i);
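
The per-element behavior can be sketched in portable C as follows. This is a reference model under stated assumptions, not the intrinsic: `model_sext32` is a hypothetical name, it handles one 32-bit element, and it relies on the usual two's-complement conversion from `uint32_t` to `int32_t`.

```c
#include <stdint.h>

/* Replicate bit 'bit' of x into all higher bits (bit is 7..22 for AE_SEXT32). */
static int32_t model_sext32(int32_t x, unsigned bit)
{
    uint32_t sign = (uint32_t)1 << bit;
    uint32_t low = (uint32_t)x & ((sign << 1) - 1u); /* keep bits [bit:0] */
    return (int32_t)((low ^ sign) - sign);           /* standard sign-extend trick */
}
```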
AE_SEXT32X2D16.32 {.10} d, d0 [ fusion_slot0]
Promote the two higher (or lower) 16-bit elements from register d0 and place them into the lower
16 bits of each 32-bit half of AE_DR register d. The remaining upper 16 bits of each half
are filled by sign extension. These correspond to the ITU intrinsic L_deposit_l.
C syntax:
ae_int32x2 AE_SEXT32X2D16_32(ae_int16x4 d);
AE_CVTP24A16X2.LL (.LH, .HL, .HH) d, ah, al [ fusion_slot0 ]
Sign-extend and copy the 16 most (.HL, .HH) or least (.LL, .LH) significant bits from the AR
register ah into the 24 most significant bits of d.H, and the 16 most (.LH, .HH) or least (.LL,
.HL) significant bits from the AR register al into the 24 most significant bits of d.L. In other
words, convert 1.15-bit values in AR to 9.23-bit values in AE_DR.
Note: C intrinsic AE_CVTP24A16X2 is equivalent to and implemented through operation
AE_CVTP24A16X2.LL. C intrinsic AE_CVTP24A16 sign-extends and replicates the 16 least
significant bits from an AR register into the 24 most significant bits of both elements of an
AE_DR register. It is implemented through operation AE_CVTP24A16X2.LL.
C syntax:
ae_int24x2 AE_CVTP24A16X2_LL (unsigned ah, unsigned al);
ae_int24x2 AE_CVTP24A16X2 (unsigned ah, unsigned al);
ae_int24x2 AE_CVTP24A16 (unsigned a);


AE_CVT64A32 d, a [ fusion_slot0 ]
Convert a signed 1.31-bit value in AR register a to a 1.63-bit value in AE_DR register d.
C syntax:
ae_f64 AE_CVT64A32 (unsigned a);
AE_CVTQ56A32 d, a [ fusion_slot0]
Convert a signed 1.31-bit value in an AR register a to a 9.55-bit value in AE_DR register d.
C syntax:
ae_q56s AE_CVTQ56A32S (unsigned a);

AE_CVT48A32 d, a [ fusion_slot0, Inst ]
Convert a signed 1.31-bit value in AR register a to a 17.47-bit value in AE_DR register d.
C syntax:
ae_f64 AE_CVT48A32 (unsigned a);
AE_CVT64F32.H d, d0 [ fusion_slot0 ]
Convert a signed 1.31-bit value in d0.H to a 1.63-bit value in d.
C syntax:
ae_f64 AE_CVT64F32_H (ae_int32x2 d0);
ae_f64 AE_CVT64F32_L(ae_int32x2 p0);
AE_CVT56F32.L (.H) d, d0 [ fusion_slot0 ]
Convert a signed 1.31-bit value in d0.L (d0.H) to a 9.55-bit value in d.
Note: C intrinsic AE_CVTQ48P24S_L (_H) is provided to ensure HiFi 2 code portability. It is
implemented through operation AE_CVT56F32.L (.H).
C syntax:
ae_f64 AE_CVT56F32_L (ae_int32x2 d0);
ae_q56s AE_CVTQ48P24S_L (ae_p24x2s d0);
AE_CVTA32F24S.L (.H) a, d0 [ fusion_slot0 ]
Convert a 9.23-bit value in d0.L (d0.H) to a 1.31-bit value in AR register a. The 8 MSBs of
the input value are discarded.
C syntax:
int AE_CVTA32F24S_L (ae_int24x2 d0);
AE_CVT16X4 d, dh, dl [ fusion_slot0 ]
Convert/truncate the lower 16-bit elements from four 32-bit signed elements of registers dh
and dl into four 16-bit integer elements in AE_DR register d.
C syntax:
ae_int16x4 AE_CVT16X4 (ae_int32x2 dh, ae_int32x2 dl);


AE_CVT32X2F16.32 {.10} d, d0 [ fusion_slot0 ]
Promote the two higher (or lower) 16-bit elements from register d0 and place them into the higher
16 bits of each 32-bit half of AE_DR register d. The remaining lower 16 bits of each half
of register d are filled with 0.
These correspond to the ITU intrinsic L_deposit_h.
C syntax:
ae_f32x2 AE_CVT32X2F16_32(ae_f16x4 d);
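
For readers unfamiliar with the ITU basic operators, the two per-element promotions (L_deposit_h here, L_deposit_l for AE_SEXT32X2D16) can be modeled in portable C as below. The function names are hypothetical and each call models a single 16-bit element rather than a full register.

```c
#include <stdint.h>

/* L_deposit_h style: place x in the upper 16 bits, zero the lower 16 bits. */
static int32_t model_deposit_h(int16_t x)
{
    return (int32_t)((uint32_t)(uint16_t)x << 16);
}

/* L_deposit_l style: place x in the lower 16 bits, sign-extend upward. */
static int32_t model_deposit_l(int16_t x)
{
    return (int32_t)x;
}
```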
AE_PKSR32 d, d0, i [ fusion_slot1 ]
Move the low 32-bits of d into the high 32-bits of d. Sign extend the 17.47-bit value in d0 by
three bits so that it becomes a 20.47-bit value. Logical Left Shift the result by 0 to 3 bits as
encoded in the 2-bit immediate. Round the result, using an asymmetric round, to a 20.31-bit
value, removing 16 bits from the low end. Saturate the result to a 1.31-bit value, removing
19 bits from the top. If saturation was needed, state AE_OVERFLOW is set to 1. Store the
result in the low 32-bits of d. This instruction is useful for the acceleration of biquad filters.
C syntax:
void AE_PKSR32 (ae_f32x2 d /*inout*/, ae_f64 d0, immediate i);
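
The sequence of steps in AE_PKSR32 can be followed more easily in code. The sketch below is a portable C model, not the intrinsic: `model_i32x2` and `model_pksr32` are hypothetical names, `overflow` stands in for AE_OVERFLOW, and an arithmetic right shift of negative values is assumed. Because the 3-bit sign extension guarantees that the left shift loses no bits, shifting left by i and then dropping 16 bits is modeled as dropping 16 - i bits directly.

```c
#include <stdint.h>

typedef struct { int32_t h, l; } model_i32x2; /* stand-in for ae_f32x2 */

static void model_pksr32(model_i32x2 *d, int64_t d0, unsigned i, int *overflow)
{
    unsigned drop = 16u - (i & 3u);               /* net bits removed from d0 */
    int64_t q = d0 >> drop;                       /* floor, arithmetic shift */
    int64_t rem = d0 & (((int64_t)1 << drop) - 1);

    q += rem >= ((int64_t)1 << (drop - 1));       /* asymmetric rounding */

    if (q > INT32_MAX) { q = INT32_MAX; *overflow = 1; }  /* saturate to 1.31 */
    if (q < INT32_MIN) { q = INT32_MIN; *overflow = 1; }

    d->h = d->l;                                  /* old low half moves up */
    d->l = (int32_t)q;                            /* new result in low half */
}
```

Calling the model twice in a row leaves the two most recent results packed in one register, which is the pattern a biquad state update relies on.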
AE_PKSR24 d, d0, i [ fusion_slot1 ]
Move the low 32-bits of d into the high 32-bits of d. Sign extend the 17.47-bit value in d0 by
three bits so that it becomes a 20.47-bit value. Logical Left Shift the result by 0 to 3 bits as
encoded in the 2-bit immediate. Round the result, using an asymmetric round, to a 20.23-bit
value, removing 24 bits from the low end. Saturate the result to a 1.23-bit value, removing
19 bits from the top. If saturation was needed, state AE_OVERFLOW is set to 1. Sign extend
the result by 8 bits to a 9.23-bit value. Store the result in the low 32-bits of d. This instruction
is useful for the acceleration of biquad filters.
C syntax:
void AE_PKSR24 (ae_f24x2 d /*inout*/, ae_f64 d0, immediate i );
AE_MOVI d, i [ fusion_slot0, fusion_slot1, Inst ]
Copy and replicate the immediate (from -16 to 47) into the two halves of d.
C syntax:
ae_int32x2 AE_MOVI (immediate i);
AE_MOVDA32X2 d, ah, al [ fusion_slot0, Inst ]
Copy the 32-bit contents of each of two AR registers, ah and al, into the two 32-bit elements
of an AE_DR register d.
Note: AE_MOVPA24X2 and AE_MOVPA24 are provided to ensure HiFi 2 code portability.
C syntax:
ae_int32x2 AE_MOVDA32X2 (unsigned ah, unsigned al);
ae_p24x2s AE_MOVPA24X2 (unsigned ah, unsigned al);
ae_p24x2s AE_MOVPA24 (unsigned a);


AE_MOVDA32 d, a [ fusion_slot0, Inst ]
Copy and replicate the 32-bit contents of AR register a into the two 32-bit elements of an
AE_DR register d.
C syntax:
ae_int32 AE_MOVDA32 (unsigned a);
AE_MOVDA16 d, a [ fusion_slot0, Inst]
Copy the 16-bit contents of a into each of the four 16-bit elements of an AE_DR register d.
C syntax:
ae_int16x4 AE_MOVDA16 (unsigned a);
AE_MOVDA16X2 d, a0, a1 [ fusion_slot0, Inst ]
Combine the 16-bit contents of a0 and a1 and copy into each of the two 32-bit elements of
an AE_DR register d.
C syntax:
ae_int16x4 AE_MOVDA16X2 (unsigned a0, unsigned a1);
AE_MOVAD32.L (.H) a, d0 [ fusion_slot0, Inst ]
Copy the 32-bit contents of d0.L (d0.H) to an AR register a.
Note: C intrinsic AE_TRUNCA32Q64 is implemented through operation AE_MOVAD32.H.
C intrinsic AE_MOVAP24S_L (_H) is implemented through operation AE_MOVAD32.L (.H)
and is provided to ensure HiFi 2 code portability.
C syntax:
int AE_MOVAD32_L (ae_int32x2 d0);
int AE_MOVAP24S_L (ae_p24x2s d0);
int AE_TRUNCA32Q64 (ae_int64 d0);
AE_MOVAD16.0 (.2, .3) a, d0 [ fusion_slot0, Inst ]
AE_MOVAD16.1 a, d0 [ fusion_slot0 ]
Copy and sign-extend the 16-bit contents of d0.0 (d0.1, d0.2, d0.3) to an AR register a.
C syntax:
int AE_MOVAD16_0 (ae_int16x4 d0);
AE_MOV d, d0 [ fusion_slot0, fusion_slot1, Inst ]
Copy the 64-bit contents of AE_DR register d0 to AE_DR register d.
Note: C intrinsic AE_MOV32X2 (operating on C type ae_int32x2) is implemented through
operation AE_MOV. C intrinsics AE_MOVQ56 (operating on C type ae_q56s) and
AE_MOVP48 (operating on C type ae_int24x2) are implemented through operation AE_MOV
and are provided to ensure HiFi 2 code portability.


C syntax:
ae_int64 AE_MOV64 (ae_int64 d0);
ae_int32x2 AE_MOV32X2 (ae_int32x2 d0);
ae_q56s AE_MOVQ56 (ae_q56s d0);
ae_p24x2s AE_MOVP48 (ae_p24x2s d0);
AE_MOVT32X2 d, d0, bhl [ fusion_slot1, Inst ]
If bhl[0] is set, copy the contents of d0.L to d.L;
If bhl[1] is set, copy the contents of d0.H to d.H.
Note: C intrinsic AE_MOVTP24X2 is implemented through operation AE_MOVT32X2 and is
provided to ensure HiFi 2 code portability.
C syntax:
void AE_MOVT32X2 (ae_int32x2 d /*inout*/, ae_int32x2 d0,
xtbool2 bhl);
void AE_MOVTP24X2 (ae_p24x2s d /*inout*/, ae_p24x2s d0,
xtbool2 bhl);
AE_MOVF32X2 d, d0, bhl [ fusion_slot1, Inst ]
If bhl[0] is clear, copy the contents of d0.L to d.L;
If bhl[1] is clear, copy the contents of d0.H to d.H.
Note: C intrinsic AE_MOVFP24X2 is implemented through operation AE_MOVF32X2 and is
provided to ensure HiFi 2 code portability.
C syntax:
void AE_MOVF32X2 (ae_int32x2 d /*inout*/, ae_int32x2 d0,
xtbool2 bhl);
void AE_MOVFP24X2 (ae_p24x2s d /*inout*/, ae_p24x2s d0,
xtbool2 bhl);
AE_MOVT16X4 d, d0, b3210 [ fusion_slot1 ]
If b3210[0] is set, copy the contents of d0.0 to d.0;
If b3210[1] is set, copy the contents of d0.1 to d.1.
If b3210[2] is set, copy the contents of d0.2 to d.2;
If b3210[3] is set, copy the contents of d0.3 to d.3.
C syntax:
void AE_MOVT16X4 (ae_int16x4 d /*inout*/, ae_int16x4 d0,
xtbool4 b3210);


AE_MOVF16X4 d, d0, b3210 [ fusion_slot1 ]
If b3210[0] is clear, copy the contents of d0.0 to d.0;
If b3210[1] is clear, copy the contents of d0.1 to d.1;
If b3210[2] is clear, copy the contents of d0.2 to d.2;
If b3210[3] is clear, copy the contents of d0.3 to d.3.
C syntax:
void AE_MOVF16X4 (ae_int16x4 d /*inout*/, ae_int16x4 d0,
xtbool4 b3210);
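
The conditional moves above amount to a per-element select. A minimal portable C model follows; the function names are hypothetical, array index k stands for element k, and bit k of the mask stands for b3210[k].

```c
#include <stdint.h>

/* Move-if-true: copy d0[k] to d[k] where mask bit k is set. */
static void model_movt16x4(int16_t d[4], const int16_t d0[4], unsigned b3210)
{
    for (int k = 0; k < 4; k++)
        if ((b3210 >> k) & 1u)
            d[k] = d0[k];
}

/* Move-if-false: copy d0[k] to d[k] where mask bit k is clear. */
static void model_movf16x4(int16_t d[4], const int16_t d0[4], unsigned b3210)
{
    for (int k = 0; k < 4; k++)
        if (!((b3210 >> k) & 1u))
            d[k] = d0[k];
}
```

Applying both models with the same mask and the same sources overwrites every element exactly once, which is why d is an in/out operand.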
AE_MOVT64 d, d0, b [ fusion_slot1, Inst ]
If b is set, copy the contents of d0 to d.
Note: C intrinsics AE_MOVTQ56 and AE_MOVTP48 are implemented through operation
AE_MOVT64 and are provided to ensure HiFi 2 code portability.
C syntax:
void AE_MOVT64 (ae_int64 d /*inout*/, ae_int64 d0, xtbool b);
void AE_MOVTQ56 (ae_q56s d /*inout*/, ae_q56s d0, xtbool b);
void AE_MOVTP48 (ae_p24x2s d /*inout*/, ae_p24x2s d0, xtbool b);
AE_MOVF64 d, d0, b [ fusion_slot1 ]
If b is clear, copy the contents of d0 to d.
Note: C intrinsics AE_MOVFQ56 and AE_MOVFP48 are implemented through operation
AE_MOVF64 and are provided to ensure HiFi 2 code portability.
C syntax:
void AE_MOVF64 (ae_int64 d /*inout*/, ae_int64 d0, xtbool b);
void AE_MOVFQ56 (ae_q56s d /*inout*/, ae_q56s d0, xtbool b);
void AE_MOVFP48 (ae_p24x2s d /*inout*/, ae_p24x2s d0, xtbool b);
AE_MOVALIGN u, v [ Inst ]
Copy the 64-bit contents of AE_VALIGN register v to AE_VALIGN register u.
C syntax:
ae_valign u = AE_MOVALIGN (ae_valign v);


2.13 Selection and Permutation Operations


The select and permute operations allow 16-, 24-, or 32-bit SIMD elements from two
registers to be combined. Not all combinations are supported; only the most
commonly used ones.

AE_SEL16I d, d0, d1, imm [ fusion_slot1 ]
Combine 16-bit elements from d0 and d1 into d. The AE_SEL16.xxxx operations are
implemented using this operation. The immediate field ‘imm’ is a 4-bit encoded value that
selects the permutation, such as 5146 or 7520. The table below shows each encoded value
and the permutation it selects.

Table 2-21 Permutations of Immediate Field Values

Immediate Field Value    Permutation

0                        5432
1                        7632
2                        7610
3                        5410
4                        4321
5                        6543
6                        7520
7                        7531 (used for the AE_TRUNC16X4F32 operation)
8                        6420
9                        7362
10                       5146
11                       5140
12                       2301
13                       7160
14                       5342
15                       7351

AE_SEL32.LL d, d0, d1 [ fusion_slot1 ]
d.H = d0.L;
d.L = d1.L.
Note: AE_SEL32.LL is a proto implemented using AE_SEL16I. Also, C intrinsic
AE_SELP24_LL is similar to proto AE_SEL32.LL and is implemented through operation
AE_SEL16I. It is provided to ensure HiFi 2 code portability.


C syntax:
ae_int32x2 AE_SEL32_LL (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_SELP24_LL (ae_p24x2s d0, ae_p24x2s d1);
AE_SEL32.LH d, d0, d1 [ fusion_slot1 ]
d.H = d0.L;
d.L = d1.H.
Note: AE_SEL32.LH is a proto implemented using AE_SEL16I. Also, C intrinsic
AE_SELP24_LH is similar to proto AE_SEL32.LH and is implemented through operation
AE_SEL16I. It is provided to ensure HiFi 2 code portability.
C syntax:
ae_int32x2 AE_SEL32_LH (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_SELP24_LH (ae_p24x2s d0, ae_p24x2s d1);
AE_SEL32.HL d, d0, d1 [ fusion_slot1 ]
d.H = d0.H;
d.L = d1.L.
Note: AE_SEL32.HL is a proto implemented using AE_SEL16I. Also, C intrinsic
AE_SELP24_HL is similar to proto AE_SEL32.HL and is implemented through operation
AE_SEL16I. It is provided to ensure HiFi 2 code portability.
C syntax:
ae_int32x2 AE_SEL32_HL (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_SELP24_HL (ae_p24x2s d0, ae_p24x2s d1);
AE_SEL32.HH d, d0, d1 [ fusion_slot1 ]
d.H = d0.H;
d.L = d1.H.
Note: AE_SEL32.HH is a proto implemented using AE_SEL16I. Also, C intrinsic
AE_SELP24_HH is similar to proto AE_SEL32.HH and is implemented through operation
AE_SEL16I. It is provided to ensure HiFi 2 code portability.
C syntax:
ae_int32x2 AE_SEL32_HH (ae_int32x2 d0, ae_int32x2 d1);
ae_p24x2s AE_SELP24_HH (ae_p24x2s d0, ae_p24x2s d1);
AE_SEL16.7362 (5146, 6543, 4321, 7520, 5410, 5432, 7610, 7632, 6420) d, d0, d1 [fusion_slot1 ]
Combine 16-bit elements from d0 and d1 into d. Elements are numbered so that 7
corresponds to the most significant 16 bits of input register d0, down to 0, which
corresponds to the least significant 16 bits of register d1. For example, the diagram below
shows the usage of AE_SEL16.7362.

[Diagram: inputs d0 (elements 7 6 5 4) and d1 (elements 3 2 1 0); output d (elements 7 3 6 2).]


Note: AE_SEL16.7362 and its variants are protos implemented using AE_SEL16I.
C syntax:
ae_int16x4 AE_SEL16_7362 (ae_int16x4 d0, ae_int16x4 d1);
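
The element numbering and the immediate encoding of Table 2-21 can be combined into a small portable C model. This is a sketch under stated assumptions rather than the hardware definition: the names are hypothetical, array index k stands for 16-bit element k of a register (k = 3 is most significant), and the leftmost digit of a permutation is assumed to feed the most significant element of d, as in the AE_SEL16.7362 example.

```c
#include <stdint.h>

/* Permutations from Table 2-21, indexed by the 4-bit immediate. */
static const char *const model_sel16_perm[16] = {
    "5432", "7632", "7610", "5410", "4321", "6543", "7520", "7531",
    "6420", "7362", "5146", "5140", "2301", "7160", "5342", "7351"
};

/* Source elements 7..4 are d0[3]..d0[0]; elements 3..0 are d1[3]..d1[0].
 * d must not alias d0 or d1 in this model. */
static void model_sel16i(int16_t d[4], const int16_t d0[4],
                         const int16_t d1[4], unsigned imm)
{
    const char *p = model_sel16_perm[imm & 15u];
    for (int k = 0; k < 4; k++) {
        int src = p[k] - '0';                          /* digit 0..7 */
        d[3 - k] = (src >= 4) ? d0[src - 4] : d1[src]; /* leftmost digit -> d[3] */
    }
}
```

Filling d0 with {4, 5, 6, 7} and d1 with {0, 1, 2, 3} makes each element hold its own number, so the output can be read directly against the permutation string.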
AE_SHORTSWAP v, v0 [ fusion_slot1 ]
v.3 = v0.0;
v.2 = v0.1;
v.1 = v0.2;
v.0 = v0.3.
C syntax:
ae_int16x4 AE_SHORTSWAP (ae_int16x4 d0);

AE_INTSWAP v, v0 [ fusion_slot0, fusion_slot1 ]
v.H = v0.L;
v.L = v0.H.
C syntax:
ae_int32x2 AE_INTSWAP (ae_int32x2 d0);
Bitwise Logical Operations

The computations performed by these operations are implied by their opcode mnemonics
and operands as given below.

AE_AND d, d0, d1 [ fusion_slot1 ]
AE_NAND d, d0, d1 [ fusion_slot1 ]
AE_OR d, d0, d1 [ fusion_slot1 ]
AE_XOR d, d0, d1 [ fusion_slot1 ]
Note: Type-specific C intrinsics are provided through the operations above. C intrinsics
AE_ANDQ56, AE_NANDQ56, AE_ORQ56, AE_XORQ56 and AE_NOTQ56 (operating on
the ae_q56s C type) and AE_ANDP48, AE_NANDP48, AE_ORP48, AE_XORP48 and
AE_NOTP48 (operating on the ae_int24x2 C type) are provided to ensure HiFi 2 code
portability and are implemented through the operations above.
C syntax:
ae_int64 AE_AND (ae_int64 d0, ae_int64 d1);
ae_int64 AE_NAND (ae_int64 d0, ae_int64 d1);
ae_int64 AE_OR (ae_int64 d0, ae_int64 d1);
ae_int64 AE_XOR (ae_int64 d0, ae_int64 d1);
ae_int64 AE_NOT (ae_int64 d0);

ae_int64 AE_AND64 (ae_int64 d0, ae_int64 d1);
ae_int64 AE_NAND64 (ae_int64 d0, ae_int64 d1);
ae_int64 AE_OR64 (ae_int64 d0, ae_int64 d1);
ae_int64 AE_XOR64 (ae_int64 d0, ae_int64 d1);
ae_int64 AE_NOT64 (ae_int64 d0);

ae_int32x2 AE_AND32 (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_NAND32 (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_OR32 (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_XOR32 (ae_int32x2 d0, ae_int32x2 d1);
ae_int32x2 AE_NOT32 (ae_int32x2 d0);

ae_int24x2 AE_AND24 (ae_int24x2 d0, ae_int24x2 d1);
ae_int24x2 AE_NAND24 (ae_int24x2 d0, ae_int24x2 d1);
ae_int24x2 AE_OR24 (ae_int24x2 d0, ae_int24x2 d1);
ae_int24x2 AE_XOR24 (ae_int24x2 d0, ae_int24x2 d1);
ae_int24x2 AE_NOT24 (ae_int24x2 d0);

ae_int16x4 AE_AND16 (ae_int16x4 d0, ae_int16x4 d1);
ae_int16x4 AE_NAND16 (ae_int16x4 d0, ae_int16x4 d1);
ae_int16x4 AE_OR16 (ae_int16x4 d0, ae_int16x4 d1);
ae_int16x4 AE_XOR16 (ae_int16x4 d0, ae_int16x4 d1);
ae_int16x4 AE_NOT16 (ae_int16x4 d0);

ae_q56s AE_ANDQ56 (ae_q56s d0, ae_q56s d1);
ae_q56s AE_NANDQ56 (ae_q56s d0, ae_q56s d1);
ae_q56s AE_ORQ56 (ae_q56s d0, ae_q56s d1);
ae_q56s AE_XORQ56 (ae_q56s d0, ae_q56s d1);
ae_q56s AE_NOTQ56 (ae_q56s d0);

ae_p24x2s AE_ANDP48 (ae_p24x2s d0, ae_p24x2s d1);
ae_p24x2s AE_NANDP48 (ae_p24x2s d0, ae_p24x2s d1);
ae_p24x2s AE_ORP48 (ae_p24x2s d0, ae_p24x2s d1);
ae_p24x2s AE_XORP48 (ae_p24x2s d0, ae_p24x2s d1);
ae_p24x2s AE_NOTP48 (ae_p24x2s d0);


2.14 Bit Reversal


AE_ADDBRBA32 a, ab, ax [ fusion_slot0, fusion_slot40_0 ]
32-bit add to a bit-reversed base:
a ← bitrev32(bitrev32(ab) + ax).
This helper operation may be used in combination with indexed loads and stores (.X) to
perform bit-reversed addressing in optimized FFT implementations. For example, the C code
below steps through a set of 256 32-bit complex data elements in bit-reversed order:
/* The data elements will be accessed in the following order:
0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, …
i.e., for i = 0…255, access element at index bitrev8(i). */
ae_int32x2 *buf = …;
unsigned int index = 0;
unsigned int stride = 0x80000000U >> (8 /* log2(256) */);

for (…) {

ae_int32x2 p = AE_L32X2_X(buf, index);
index = AE_ADDBRBA32(index, stride);

}

C syntax:
unsigned AE_ADDBRBA32 (unsigned ab, unsigned ax);
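
To experiment with the addressing pattern off-target, the formula a ← bitrev32(bitrev32(ab) + ax) can be modeled in portable C. The names `bitrev32` and `model_addbrba32` below are a sketch, not the hardware operation. In this model the running index is treated as a byte offset into 8-byte elements, so the stride used in the usage note is derived for the model as 0x80000000 >> (log2(256) + log2(8) - 1); that derivation is an assumption of the sketch, so the shift amount used on real hardware should be checked against the toolchain.

```c
#include <stdint.h>

/* Reverse the order of the 32 bits of x. */
static uint32_t bitrev32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Model of the add to a bit-reversed base: the carry propagates from the
 * most significant index bit toward the least significant one. */
static uint32_t model_addbrba32(uint32_t ab, uint32_t ax)
{
    return bitrev32(bitrev32(ab) + ax);
}
```

Starting from index 0 with stride 0x80000000u >> 10, successive calls in this model yield byte offsets 1024, 512, 1536, 256, that is, elements 128, 64, 192 and 32 of an array of 8-byte elements.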

2.15 Zero Operation


AE_ZERO d [fusion_slot0, fusion_slot1, Inst]
Set all bits of an AE_DR register d to zero. This intrinsic is implemented in terms of the
AE_MOVI instruction.
Note: Type-specific C intrinsics are implemented through AE_ZERO64. C intrinsics
AE_ZEROQ56 and AE_ZEROP48 are provided to ensure HiFi 2 code portability and are
implemented through operation AE_ZERO64.


C syntax:
ae_int64 AE_ZERO (void);
ae_int64 AE_ZERO64 (void);
ae_int32x2 AE_ZERO32 (void);
ae_int24x2 AE_ZERO24 (void);
ae_int16x4 AE_ZERO16 (void);
ae_q56s AE_ZEROQ56 (void);
ae_int24x2 AE_ZEROP48 (void);
AE_ZEROB br1, v0, v1 [ fusion_slot1]
br1 is set to true if any of the bytes in v0 or v1 is equal to zero.
C syntax:
xtbool AE_ZEROB (ae_int64 v0, ae_int64 v1);

2.16 Core ALU Operations


The following instructions are simplified versions of core ALU operations encoded in 16 bits
for better code density. They are all inferred automatically by the C/C++ compiler.
AE_CLAMPS16 art, ars [ Inst16b ]

Specialized version of the core CLAMPS instruction that clamps ars to a signed 16-bit value.

C syntax:
int AE_CLAMPS16 (int ars);
AE_SEXT16 art, ars [ Inst16b ]

Specialized version of the core SEXT instruction that replicates bit 15 of ars to the upper 16-
bits.

C syntax:
int AE_SEXT16 (int ars);
AE_ZEXT8 art, ars [ Inst16b ]

Specialized version of the core EXTUI instruction that zeroes the upper 24-bits of the result.

C syntax:
unsigned int AE_ZEXT8 (unsigned ars);


AE_ZEXT16 art, ars [ Inst16b ]

Specialized version of the core EXTUI instruction that zeroes the upper 16-bits of the result.

C syntax:
unsigned int AE_ZEXT16 (unsigned ars);
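
Since the compiler infers these operations, source code normally expresses them as ordinary C. The portable functions below sketch the computations the four operations are described as performing; the names are hypothetical, and whether a given compiler maps such idioms to the 16-bit encodings is not something this sketch can confirm.

```c
#include <stdint.h>

/* Clamp to the signed 16-bit range (AE_CLAMPS16 behavior as described). */
static int32_t model_clamps16(int32_t x)
{
    return x > 32767 ? 32767 : (x < -32768 ? -32768 : x);
}

/* Replicate bit 15 into the upper 16 bits (AE_SEXT16 behavior as described). */
static int32_t model_sext16(int32_t x)
{
    uint32_t low = (uint32_t)x & 0xFFFFu;
    return (int32_t)(low ^ 0x8000u) - 0x8000;
}

/* Zero the upper 24 bits (AE_ZEXT8) or upper 16 bits (AE_ZEXT16). */
static uint32_t model_zext8(uint32_t x)  { return x & 0xFFu; }
static uint32_t model_zext16(uint32_t x) { return x & 0xFFFFu; }
```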

2.17 Optional 16-bit Quad MAC Unit


Fusion DSP supports an optional 16-bit Quad MAC unit that provides improved 16-bit DSP
performance. In particular, support is provided for a limited set of quad 16-bit multiply
instructions as well as specialized instructions for speeding up 16-bit FFTs, including support
for dynamic scaling.
AE_MAXABS16S d, d0, d1 [ fusion_slot1 ]
Compute the element-wise maximum of the absolute values of the four signed 16-bit elements
of AE_DR registers d0 and d1. The four results are saturated to 16 bits and placed in d. In
case of saturation, state AE_OVERFLOW is set to 1.
d.3 ← saturate_1.15(|d0.3| > |d1.3| ? |d0.3| : |d1.3|)
d.2 ← saturate_1.15(|d0.2| > |d1.2| ? |d0.2| : |d1.2|)
d.1 ← saturate_1.15(|d0.1| > |d1.1| ? |d0.1| : |d1.1|)
d.0 ← saturate_1.15(|d0.0| > |d1.0| ? |d0.0| : |d1.0|)
C syntax:
ae_f16x4 AE_MAXABS16S (ae_f16x4 d0, ae_f16x4 d1);
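
Saturation matters here because the absolute value of the most negative 16-bit number does not fit in 16 bits. A portable C model of the element-wise rule is sketched below; `model_maxabs16s` is a hypothetical name and `overflow` stands in for AE_OVERFLOW.

```c
#include <stdint.h>

static void model_maxabs16s(int16_t d[4], const int16_t d0[4],
                            const int16_t d1[4], int *overflow)
{
    for (int k = 0; k < 4; k++) {
        int32_t a = d0[k] < 0 ? -(int32_t)d0[k] : d0[k]; /* |d0.k| in 32 bits */
        int32_t b = d1[k] < 0 ? -(int32_t)d1[k] : d1[k]; /* |d1.k| in 32 bits */
        int32_t m = a > b ? a : b;
        if (m > INT16_MAX) { m = INT16_MAX; *overflow = 1; } /* |-32768| case */
        d[k] = (int16_t)m;
    }
}
```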
AE_MULFC16RAS d, d0, d1 [ fusion_slot1 ]
AE_MULAFC16RAS d, d0, d1 [ fusion_slot1 ]
Two-way SIMD complex 1.15x1.15-bit into 1.15-bit signed MAC with asymmetric rounding of
the product and 16-bit saturation of the final result. These instructions are implemented using
a two-instruction sequence: AE_MUL[A]FC16RAS.L followed by AE_MUL[A]FC16RAS.H.
Each instruction does a single quad-MAC, 16-bit complex multiplication. The “H” instruction
in the sequence leaves the low half of the result unchanged. The “L” MUL instruction zeroes
the upper bits while the “L” MAC instruction leaves the upper bits unchanged.
All operands below are 1.15-bit values; the bracketed term applies to the MAC form only.
d.3 ← saturate_1.15([d.3 +] round+∞_1.15(d0.3 × d1.3 - d0.2 × d1.2))
d.2 ← saturate_1.15([d.2 +] round+∞_1.15(d0.3 × d1.2 + d0.2 × d1.3))
d.1 ← saturate_1.15([d.1 +] round+∞_1.15(d0.1 × d1.1 - d0.0 × d1.0))
d.0 ← saturate_1.15([d.0 +] round+∞_1.15(d0.1 × d1.0 + d0.0 × d1.1))


C syntax:
ae_f16x4 AE_MULFC16RAS (ae_f16x4 d0, ae_f16x4 d1);
void AE_MULAFC16RAS (ae_f16x4 d /*inout*/,
ae_f16x4 d0, ae_f16x4 d1);
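
One complex lane of the multiply form can be modeled in portable C as shown below. This is a sketch, not the intrinsic: `model_c16` and the function names are hypothetical, the upper element of each pair is taken as the real part (following the element formulas), an arithmetic right shift of negative values is assumed, and the accumulate form is not modeled.

```c
#include <stdint.h>

typedef struct { int16_t re, im; } model_c16; /* re = d.3 or d.1, im = d.2 or d.0 */

/* Round a Q30 sum of products to Q15 (ties toward +infinity), then saturate. */
static int16_t model_round_sat_q30(int64_t acc, int *overflow)
{
    int64_t r = (acc + (1 << 14)) >> 15;
    if (r > INT16_MAX) { *overflow = 1; return INT16_MAX; }
    if (r < INT16_MIN) { *overflow = 1; return INT16_MIN; }
    return (int16_t)r;
}

static model_c16 model_mulfc16ras(model_c16 a, model_c16 b, int *overflow)
{
    model_c16 out;
    int64_t re = (int64_t)a.re * b.re - (int64_t)a.im * b.im;
    int64_t im = (int64_t)a.re * b.im + (int64_t)a.im * b.re;
    out.re = model_round_sat_q30(re, overflow);
    out.im = model_round_sat_q30(im, overflow);
    return out;
}
```

In this model (-1.0) x (-1.0) produces +1.0, which is not representable in 1.15 format, so the result saturates to 0x7FFF and the overflow flag is raised.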
AE_MULZAAAAQ16 q0, d0, d1 [ fusion_slot1 ]
AE_MULAAAAQ16 q0, d0, d1 [ fusion_slot1]
Quad 16x16-bit into 64-bit signed MAC without saturation:
q0 ← [q0 +] d0.3 × d1.3 + d0.2 × d1.2 + d0.1 × d1.1 + d0.0 × d1.0
C syntax:
ae_int64 AE_MULZAAAAQ16 (ae_int16x4 d0,
ae_int16x4 d1);
void AE_MULAAAAQ16 (ae_int64 q0 /* inout */,
ae_int16x4 d0, ae_int16x4 d1) ;
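
The quad MAC is a four-term dot product accumulated in 64 bits. A portable C model is sketched below; `model_mulaaaaq16` is a hypothetical name, and passing 0 for q0 models the zeroing form (AE_MULZAAAAQ16).

```c
#include <stdint.h>

/* q0 + sum of d0[k] * d1[k] for k = 0..3, accumulated in 64 bits. */
static int64_t model_mulaaaaq16(int64_t q0, const int16_t d0[4],
                                const int16_t d1[4])
{
    for (int k = 0; k < 4; k++)
        q0 += (int64_t)d0[k] * d1[k];
    return q0;
}
```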
AE_MULC16S.L (.H) q0, d0, d1 [ fusion_slot1]
AE_MULAC16S.L (.H) q0, d0, d1 [ fusion_slot1]

Complex quad-MAC 16x16-bit into 2x32-bit signed integer MAC with saturation:

For H version
q0.H ← saturate32 ([q0.H +] d0.3 × d1.3 - d0.2 × d1.2)
q0.L ← saturate32 ([q0.L +] d0.3 × d1.2 + d0.2 × d1.3)
For L version
q0.H ← saturate32 ([q0.H +] d0.1 × d1.1 - d0.0 × d1.0)
q0.L ← saturate32 ([q0.L +] d0.1 × d1.0 + d0.0 × d1.1)
C syntax:
ae_int32x2 AE_MULC16S_L (_H) (ae_int16x4 d0,
ae_int16x4 d1);
void AE_MULAC16S_L (_H) (ae_int32x2 q0 /* inout */,
ae_int16x4 d0, ae_int16x4 d1) ;
AE_MUL16JS d, d0 [ fusion_slot1 ]
Two-way SIMD multiply by the imaginary number j. For each half, the upper 16-bits of d are
set to the lower 16-bits of d0. The lower 16-bits of d are set to the negation of the upper 16-
bits of d0, saturated.
C syntax:
ae_f16x4 AE_MUL16JS (ae_f16x4 d0);


AE_S16X4RNG.I d, a, i64 [fusion_slot0, fusion_slot40_0]
AE_S16X4RNG.IP d, a, i64pos [fusion_slot0, fusion_slot40_0]
AE_S16X4RNG.X (.XP) d, a, ax [fusion_slot0, fusion_slot40_0]
Required alignment: 8 bytes
Store four 16-bit values from the AE_DR register d to memory with range detection. Bits 5,
4, and 3 of AE_SAR are set if bits 14, 13 and 12, respectively, of any 16-bit quarter of d differ
from that quarter's sign bit. These instructions are meant to be used together with
AE_CALCRNG3 to allow a right shift by up to 3 bits for dynamic normalization.
See Table 2-3 for the meanings of the address mode suffixes.
C syntax:
void AE_S16X4RNG_I (ae_int16x4 d, ae_int16x4 * a, immediate i64);
void AE_S16X4RNG_X (ae_int16x4 d, ae_int16x4 * a, int ax);
void AE_S16X4RNG_IP (ae_int16x4 d,
ae_int16x4 * a /*inout*/, immediate i64pos);
void AE_S16X4RNG_XP (ae_int16x4 d,
ae_int16x4 * a /*inout*/, int ax);
AE_CALCRNG3 a [fusion_slot1]
AE_CALCRNG3 returns a value from 0 to 3 that depends on which of bits 5, 4 and 3 of
AE_SAR is the highest bit set. AE_SAR is also set to 0, 2, 4 or 5 respectively so that an
AE_ADDANDSUBRNG16RAS_S1 followed by an AE_ADDANDSUBRNG16RAS_S2 will
shift to the right by the same amount as AE_CALCRNG3 returns. These are meant to be
used after a series of AE_S16X4RNG.I (.IP, .X, .XP) and before a series of
AE_ADDANDSUBRNG16RAS_S1 and AE_ADDANDSUBRNG16RAS_S2 instructions.
Together, they allow the FFT algorithm to shift the minimum amount necessary to keep data
from overflowing.

C syntax:
unsigned int AE_CALCRNG3 (void);
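
The interplay between the range-detecting store and AE_CALCRNG3 can be modeled in portable C. The sketch below rests on several assumptions that the text does not spell out: the range bits are taken to accumulate (OR) across stores, bit 5, 4 or 3 being the highest set bit is taken to mean a return value of 3, 2 or 1 (0 when none is set), and `model_state`, `model_s16x4rng` and `model_calcrng3` are hypothetical names.

```c
#include <stdint.h>

typedef struct { unsigned sar; } model_state; /* stand-in for AE_SAR */

/* Store four elements and record how far each value strays from its sign bit. */
static void model_s16x4rng(model_state *s, const int16_t d[4], int16_t mem[4])
{
    for (int k = 0; k < 4; k++) {
        uint16_t u = (uint16_t)d[k];
        unsigned sign = (u >> 15) & 1u;
        if (((u >> 14) & 1u) != sign) s->sar |= 1u << 5;
        if (((u >> 13) & 1u) != sign) s->sar |= 1u << 4;
        if (((u >> 12) & 1u) != sign) s->sar |= 1u << 3;
        mem[k] = d[k];
    }
}

/* Return the required shift (0..3) and re-encode AE_SAR as 0, 2, 4 or 5. */
static unsigned model_calcrng3(model_state *s)
{
    static const unsigned next_sar[4] = {0u, 2u, 4u, 5u};
    unsigned shift = (s->sar & 0x20u) ? 3u
                   : (s->sar & 0x10u) ? 2u
                   : (s->sar & 0x08u) ? 1u : 0u;
    s->sar = next_sar[shift];
    return shift;
}
```

With the re-encoded values, AE_SAR[0] plus AE_SAR[2:1] always equals the returned shift (0+0, 0+1, 0+2, 1+2), which is what lets the S1 and S2 butterfly stages split the shift between them.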


AE_ADDANDSUBRNG16RAS_S1 da, ds [fusion_slot40_1]
Add and subtract the 16-bit elements of two AE_DR registers, da and ds, without saturation and
shift the results arithmetically right by 0 or 1 place depending on the value of AE_SAR[0]. If
shifting, round asymmetrically.
da.3 ← round+∞_1.15((da.3 + ds.3) >> AE_SAR[0])
da.2 ← round+∞_1.15((da.2 + ds.2) >> AE_SAR[0])
da.1 ← round+∞_1.15((da.1 + ds.1) >> AE_SAR[0])
da.0 ← round+∞_1.15((da.0 + ds.0) >> AE_SAR[0])
ds.3 ← round+∞_1.15((da.3 - ds.3) >> AE_SAR[0])
ds.2 ← round+∞_1.15((da.2 - ds.2) >> AE_SAR[0])
ds.1 ← round+∞_1.15((da.1 - ds.1) >> AE_SAR[0])
ds.0 ← round+∞_1.15((da.0 - ds.0) >> AE_SAR[0])
AE_ADDANDSUBRNG16RAS_S2 da, ds [fusion_slot40_1]
Add and subtract the 16-bit elements of two AE_DR registers, da and ds, without saturation and
shift the results arithmetically right by 0, 1 or 2 places depending on the value of AE_SAR[2:1].
If shifting, round asymmetrically. A shift value of 3 is not supported.
da.3 ← round+∞_1.15((da.3 + ds.3) >> AE_SAR[2:1])
da.2 ← round+∞_1.15((da.2 + ds.2) >> AE_SAR[2:1])
da.1 ← round+∞_1.15((da.1 + ds.1) >> AE_SAR[2:1])
da.0 ← round+∞_1.15((da.0 + ds.0) >> AE_SAR[2:1])
ds.3 ← round+∞_1.15((da.3 - ds.3) >> AE_SAR[2:1])
ds.2 ← round+∞_1.15((da.2 - ds.2) >> AE_SAR[2:1])
ds.1 ← round+∞_1.15((da.1 - ds.1) >> AE_SAR[2:1])
ds.0 ← round+∞_1.15((da.0 - ds.0) >> AE_SAR[2:1])
C syntax:
AE_ADDRNG16RAS (ae_int16x4 va /* inout */,
ae_int16x4 vs /* inout */ );
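
The butterfly with its conditional shift can be modeled in portable C as follows. This is a sketch with hypothetical names: both outputs are computed from the old values of da and ds, "without saturation" is modeled as wrapping to 16 bits, the shift amounts come from AE_SAR[0] (S1) and AE_SAR[2:1] (S2) as described above, and an arithmetic right shift of negative values is assumed.

```c
#include <stdint.h>

/* Wrap a 32-bit value to 16 bits without saturating. */
static int16_t model_wrap16(int32_t x)
{
    uint32_t low = (uint32_t)x & 0xFFFFu;
    return (int16_t)((int32_t)(low ^ 0x8000u) - 0x8000);
}

/* Shift right by sh with asymmetric rounding (ties toward +infinity). */
static int16_t model_shift_round(int32_t x, unsigned sh)
{
    if (sh)
        x = (x + (1 << (sh - 1))) >> sh;
    return model_wrap16(x);
}

static void model_addandsub(int16_t da[4], int16_t ds[4], unsigned sh)
{
    for (int k = 0; k < 4; k++) {
        int32_t a = da[k], s = ds[k];          /* old values feed both results */
        da[k] = model_shift_round(a + s, sh);
        ds[k] = model_shift_round(a - s, sh);
    }
}

static void model_addandsub_s1(int16_t da[4], int16_t ds[4], unsigned sar)
{
    model_addandsub(da, ds, sar & 1u);         /* shift = AE_SAR[0] */
}

static void model_addandsub_s2(int16_t da[4], int16_t ds[4], unsigned sar)
{
    model_addandsub(da, ds, (sar >> 1) & 3u);  /* shift = AE_SAR[2:1] */
}
```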


AE_MULC16JS.L (.H) q0, d0, d1 [fusion_slot1]
AE_MULAC16JS.L (.H) q0, d0, d1 [fusion_slot1]
Complex conjugate quad-MAC 16x16-bit into 2x32-bit signed integer MAC with saturation:
For H version
q0.H ← saturate32 ([q0.H +] d0.3 × d1.3 + d0.2 × d1.2)
q0.L ← saturate32 ([q0.L -] d0.3 × d1.2 + d0.2 × d1.3)
For L version
q0.H ← saturate32 ([q0.H +] d0.1 × d1.1 + d0.0 × d1.0)
q0.L ← saturate32 ([q0.L -] d0.1 × d1.0 + d0.0 × d1.1)
C syntax:
ae_int32x2 AE_MULC16JS_L (_H) (ae_int16x4 d0,
ae_int16x4 d1);
void AE_MULAC16JS_L (_H) (ae_int32x2 q0 /* inout */,
ae_int16x4 d0, ae_int16x4 d1) ;
AE_CONJ16S d, d0 [ fusion_slot1 ]
Two-way SIMD complex conjugate. For each half, the upper 16-bits of d are set to the upper
16-bits of d0. The lower 16-bits of d are set to the negation of the lower 16-bits of d0,
saturated. In case of saturation, state AE_OVERFLOW is set to 1.

C syntax:
ae_f16x4 AE_CONJ16S (ae_f16x4 d0);

2.18 Optional Floating Point Unit


Fusion DSP supports an optional IEEE 754 floating point unit. The floating point unit shares
the AE_DR register file with the rest of Fusion DSP. Therefore, standard loads, stores and
selects can all be used together with floating point compute operations. Fusion DSP supports
loading or storing 64-bits but only 32-bit floating point computation. Common operations
allow taking their operands from either half of a register. Other operations only take operands
from the low half of a register. For operations taking their operands from the low half of a register,
intrinsics are rarely needed and programmers should instead use just standard C.

Floating point operations typically have four cycles of latency but are fully pipelined. With the
Reduced MAC Latency option, the latency is reduced to two cycles. Divide and sqrt are
implemented using instruction sequences.
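
As an illustration of the "just standard C" advice, a scalar single-precision kernel can be written with no intrinsics at all. The function `dot_f32` below is a hypothetical example; how a particular compiler schedules it onto the floating point unit is not claimed here.

```c
#include <stddef.h>

/* Plain C single-precision dot product; no intrinsics are required. */
static float dot_f32(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}
```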


ADD.S fr, fs, ft [fusion_slot1, Inst]
Computes an IEEE 754 single-precision sum of the contents of the low halves of fs and ft. This
operation rounds the result(s) to the destination format when necessary, according to the
rounding mode in FCR.
fr.H ← 0
fr.L ← fs.L + ft.L
C syntax:
float XT_ADD_S (float fs, float ft);
float XT_ADD_LLL_S (xtfloatx2 fs, xtfloatx2 ft);

ADD_LLH.S fr, fs, ft [fusion_slot1]
Computes an IEEE 754 single-precision sum of the contents of the low half of fs and the high
half of ft. This operation rounds the result(s) to the destination format when necessary,
according to the rounding mode in FCR.
fr.H ← 0
fr.L ← fs.L + ft.H
C syntax:
float XT_ADD_LLH_S (xtfloatx2 fs, xtfloatx2 ft);
float XT_ADD_LHL_S (xtfloatx2 ft, xtfloatx2 fs);
ADD_LHH.S fr, fs, ft [fusion_slot1]
Computes an IEEE 754 single-precision sum of the contents of the high halves of fs and ft.
This operation rounds the result(s) to the destination format when necessary, according to
the rounding mode in FCR.
fr.H ← 0
fr.L ← fs.H + ft.H
C syntax:
float XT_ADD_LHH_S (xtfloatx2 fs, xtfloatx2 ft);
SUB.S fr, fs, ft [fusion_slot1, Inst]
Computes an IEEE 754 single-precision difference of the contents of the low halves of fs and
ft. This operation rounds the result(s) to the destination format when necessary, according to
the rounding mode in FCR.
fr.H ← 0
fr.L ← fs.L - ft.L
C syntax:
float XT_SUB_S (float fs, float ft);
float XT_SUB_LLL_S (xtfloatx2 fs, xtfloatx2 ft);


SUB_LLH.S fr, fs, ft [fusion_slot1]


Computes an IEEE 754 single-precision difference of the contents of the low half of fs and
the high half of ft. This operation rounds the result(s) to the destination format when
necessary, according to the rounding mode in FCR.
fr.H ← 0
fr.L ← fs.L - ft.H
C syntax:
float XT_SUB_LLH_S (xtfloatx2 fs, xtfloatx2 ft);
float XT_SUB_LHL_S (xtfloatx2 ft, xtfloatx2 fs);
SUB_LHH.S fr, fs, ft [fusion_slot1]
Computes an IEEE 754 single-precision difference of the contents of the high halves of fs
and ft. This operation rounds the result(s) to the destination format when necessary,
according to the rounding mode in FCR.
fr.H ← 0
fr.L ← fs.H - ft.H
C syntax:
float XT_SUB_LHH_S (xtfloatx2 fs, xtfloatx2 ft);
MUL.S fr, fs, ft [fusion_slot1, Inst]
Computes an IEEE 754 single-precision product of the contents of the low halves of fs and ft.
This operation rounds the result(s) to the destination format when necessary, according to
the rounding mode in FCR.
fr.H ← 0
fr.L ← fs.L * ft.L
C syntax:
float XT_MUL_S (float fs, float ft);
float XT_MUL_LLL_S (xtfloatx2 fs, xtfloatx2 ft);
MUL_LLH.S fr, fs, ft [fusion_slot1]
Computes an IEEE 754 single-precision product of the contents of the low half of fs and the
high half of ft. This operation rounds the result(s) to the destination format when necessary,
according to the rounding mode in FCR.
fr.H ← 0
fr.L ← fs.L * ft.H
C syntax:
float XT_MUL_LLH_S (xtfloatx2 fs, xtfloatx2 ft);
float XT_MUL_LHL_S (xtfloatx2 ft, xtfloatx2 fs);


MUL_LHH.S fr, fs, ft [fusion_slot1]


Computes an IEEE 754 single-precision product of the contents of the high halves of fs and
ft. This operation rounds the result(s) to the destination format when necessary, according to
the rounding mode in FCR.
fr.H ← 0
fr.L ← fs.H * ft.H
C syntax:
float XT_MUL_LHH_S (xtfloatx2 fs, xtfloatx2 ft);
MADD.S fr, fs, ft [fusion_slot1, Inst]
MADD.S implements the IEEE 754-2008 fusedMultiplyAdd in single precision (binary32).
MADD.S multiplies the lower halves of data registers fs and ft, adds the product to the lower
half of data register fr, and then writes the sum back to the lower half of data register fr. This
operation rounds the sum to the destination format when necessary, according to the
rounding mode in FCR. The intermediate product is exact and is not rounded. This operation
zeroes the higher half of data register fr.
fr.H ← 0
fr.L ← fr.L + fs.L * ft.L
C syntax:
void XT_MADD_S (float fr /* inout */,
float fs, float ft);
void XT_MADD_LLL_S (float fr /* inout */,
xtfloatx2 fs, xtfloatx2 ft);
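
The effect of leaving the intermediate product unrounded can be seen on a host machine with the following sketch. Both function names are hypothetical. The fused variant is only a model: it relies on the fact that the product of two binary32 values is exact in binary64, and it is not claimed to be bit-exact with MADD.S for every input.

```c
/* Separate multiply and add: the product is rounded to float first. */
float madd_unfused(float fr, float fs, float ft)
{
    volatile float product = fs * ft;   /* volatile forces rounding to float */
    return fr + product;
}

/* Host-side model of a fused multiply-add: the product of two floats is
 * exact in double (24 + 24 significand bits fit in 53), so only the final
 * result is rounded to float. Illustration only, not a bit-exact model
 * of MADD.S for every input. */
float madd_fused_model(float fr, float fs, float ft)
{
    double exact_product = (double)fs * (double)ft;
    return (float)((double)fr + exact_product);
}
```

With fs = ft = 1 + 2^-23 and fr = -(1 + 2^-22), the unfused version returns 0, while the fused model returns 2^-46, the low-order term of the exact product that the separate multiply discards.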
MADD_LLH.S fr, fs, ft [fusion_slot1]
MADD_LLH.S implements the IEEE754-2008 fusedMultiplyAdd in single precision
(binary32). This operation multiplies the corresponding lower half of data register fs and the
higher half of data register ft, adds the product to the lower half of data register fr, and then
writes the sum back to the lower half of data register fr. This operation rounds the sum to the
destination format when necessary, according to the rounding mode in FCR. There is no
rounding on the intermediate and precise product. This operation zeroes out the higher half
of the data register fr.
fr.H ← 0
fr.L ← fr.L + fs.L * ft.H
C syntax:
void XT_MADD_LLH_S (float fr /* inout */,
xtfloatx2 fs, xtfloatx2 ft);
void XT_MADD_LHL_S (float fr /* inout */,
xtfloatx2 ft, xtfloatx2 fs);


MADD_LHH.S fr, fs, ft [fusion_slot1]


MADD_LHH.S implements the IEEE754-2008 fusedMultiplyAdd in single precision
(binary32). This operation multiplies the higher half of data registers fs and ft, adds the
product to the lower half of data register fr, and then writes the sum back to the lower half of
data register fr. This operation rounds the sum to the destination format when necessary,
according to the rounding mode in FCR. There is no rounding on the intermediate and precise
product. This operation zeroes out the higher half of the data register fr.
fr.H ← 0
fr.L ← fr.L + fs.H * ft.H
C syntax:
void XT_MADD_LHH_S (float fr /* inout */,
xtfloatx2 fs, xtfloatx2 ft);
MSUB.S fr, fs, ft [fusion_slot1, Inst]
MSUB.S implements the IEEE754-2008 fusedMultiplyAdd in single precision (binary32). This
operation multiplies the corresponding lower half of data registers fs and ft, subtracts the
product from the lower half of data register fr, and then writes the difference back to the lower
half of data register fr. This operation rounds the difference to the destination format when
necessary, according to the rounding mode in FCR. There is no rounding on the intermediate
and precise product. This operation zeroes out the higher half of the data register fr.
fr.H ← 0
fr.L ← fr.L - fs.L * ft.L
C syntax:
void XT_MSUB_S (float fr /* inout */,
float fs, float ft);
void XT_MSUB_LLL_S (float fr /* inout */,
xtfloatx2 fs, xtfloatx2 ft);
MSUB_LLH.S fr, fs, ft [fusion_slot1]
MSUB_LLH.S implements the IEEE754-2008 fusedMultiplyAdd in single precision
(binary32). This operation multiplies the lower half of data register fs and the higher half of
data register ft, subtracts the product from the lower half of data register fr, and then writes
the difference back to the lower half of data register fr. This operation rounds the difference
to the destination format when necessary, according to the rounding mode in FCR. There is
no rounding on the intermediate and precise product. This operation zeroes out the higher
half of the data register fr.
fr.H ← 0
fr.L ← fr.L - fs.L * ft.H
C syntax:
void XT_MSUB_LLH_S (float fr /* inout */,
xtfloatx2 fs, xtfloatx2 ft);
void XT_MSUB_LHL_S (float fr /* inout */,
xtfloatx2 ft, xtfloatx2 fs);


MSUB_LHH.S fr, fs, ft [fusion_slot1]


MSUB_LHH.S implements the IEEE754-2008 fusedMultiplyAdd in single precision
(binary32). This operation multiplies the higher half of data registers fs and ft, subtracts the
product from the lower half of data register fr, and then writes the difference back to the lower
half of data register fr. This operation rounds the difference to the destination format when
necessary, according to the rounding mode in FCR. There is no rounding on the intermediate
and precise product. This operation zeroes out the higher half of the data register fr.
fr.H ← 0
fr.L ← fr.L - fs.H * ft.H
C syntax:
void XT_MSUB_LHH_S (float fr /* inout */,
xtfloatx2 fs, xtfloatx2 ft);
DIV.S fr, fs, ft [fusion_slot1, Inst]
Computes an IEEE 754 single-precision division of the contents of the low halves of fs and
ft using a sequence of instructions. Division will take approximately 15 cycles. Faster, but
non-IEEE 754 exact, results can be achieved using RECIP.S and MUL.S. This operation
rounds the result(s) to the destination format when necessary, according to the rounding
mode in FCR.
fr.H ← 0
fr.L ← fs.L / ft.L
C syntax:
float XT_DIV_S (float fs, float ft);
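
The accuracy trade-off between an exact divide and a reciprocal followed by a multiply can be illustrated on a host machine as follows. This is a sketch under stated assumptions: plain C arithmetic stands in for the DIV.S and RECIP.S/MUL.S sequences, the function names are hypothetical, and because RECIP.S itself need not be correctly rounded, the difference on the DSP may be larger than the one this model shows.

```c
/* Host-side illustration: exact division versus reciprocal-then-multiply.
 * Plain C arithmetic stands in for the DIV.S and RECIP.S/MUL.S sequences. */
float div_exact(float fs, float ft)
{
    return fs / ft;                 /* one correctly rounded operation */
}

float div_via_recip(float fs, float ft)
{
    volatile float r = 1.0f / ft;   /* first rounding (reciprocal) */
    return fs * r;                  /* second rounding (multiply)  */
}

/* Relative difference between the two results, |a - b| / |b|. */
float rel_diff(float a, float b)
{
    float d = a - b;
    if (d < 0.0f) d = -d;
    if (b < 0.0f) b = -b;
    return d / b;
}
```

In this model the two results agree to within a few units in the last place; code that needs IEEE 754 exact quotients should use the divide.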
RECIP.S fr, fs [fusion_slot1, Inst]
Computes a single-precision reciprocal of the contents of the low half of fs using a sequence of
instructions taking approximately 5 cycles.
fr.H ← 0
fr.L ← 1.0 / fs.L
C syntax:
float XT_RECIP_S (float fs);
SQRT.S fr, fs [fusion_slot1, Inst]
Computes an IEEE 754 single-precision sqrt using a sequence of instructions, taking
approximately 15 cycles. Faster, but non-IEEE 754 exact, results can be achieved using
RSQRT.S and MUL.S. This operation rounds the result(s) to the destination format when
necessary, according to the rounding mode in FCR.
fr.H ← 0
fr.L ← sqrtf(fs.L)
C syntax:
float XT_SQRT_S (float fs);


RSQRT.S fr, fs [fusion_slot1, Inst]


Computes a single-precision reciprocal sqrt of the contents of the low half of fs using a
sequence of instructions taking approximately 10 cycles.
fr.H ← 0
fr.L ← 1.0 / sqrtf(fs.L)
C syntax:
float XT_RSQRT_S (float fs);
CONST.S fr, i [fusion_slot1, Inst]
Creates a single-precision constant and places it in the low half of floating-point register fr.
The upper half is zeroed.
The constant is chosen by the value of the i field as shown in Table 2-22.

Table 2-22 Immediate “i” Values

Immediate “i”    Decimal Value    Hex Value
0                0.0              0x0000_0000
1                1.0              0x3F80_0000
2                2.0              0x4000_0000
3                0.5              0x3F00_0000

C syntax:
float XT_CONST_S (immediate i);
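
The mapping in Table 2-22 can be checked on a host machine with the sketch below, which reinterprets each listed bit pattern as a binary32 value. The names const_s_bits and const_s_model are hypothetical, and the host float type is assumed to be IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Bit patterns from Table 2-22, indexed by the immediate i (0..3). */
static const uint32_t const_s_bits[4] = {
    0x00000000u,  /* i = 0 -> 0.0 */
    0x3F800000u,  /* i = 1 -> 1.0 */
    0x40000000u,  /* i = 2 -> 2.0 */
    0x3F000000u   /* i = 3 -> 0.5 */
};

/* Host-side model of the value produced in the low half of fr.
 * const_s_model is an illustrative name; on the DSP use XT_CONST_S. */
float const_s_model(unsigned i)
{
    float f;
    uint32_t bits = const_s_bits[i & 3u];
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bits as binary32 */
    return f;
}
```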

MOVEQZ.S fr, fs, art [Inst]
MOVNEZ.S fr, fs, art [Inst]
MOVGEZ.S fr, fs, art [Inst]
MOVLTZ.S fr, fs, art [Inst]
Conditional move of the low half of data operand fs to fr based on the integer condition in art.
The upper half is zeroed when the move takes place.
MOVEQZ.S: fr.H ← (art == 0) ? 0 : fr.H; fr.L ← (art == 0) ? fs.L : fr.L;
MOVNEZ.S: fr.H ← (art != 0) ? 0 : fr.H; fr.L ← (art != 0) ? fs.L : fr.L;
MOVGEZ.S: fr.H ← (art >= 0) ? 0 : fr.H; fr.L ← (art >= 0) ? fs.L : fr.L;
MOVLTZ.S: fr.H ← (art < 0) ? 0 : fr.H; fr.L ← (art < 0) ? fs.L : fr.L;

C syntax:
void XT_MOVEQZ_S (float fr /* inout */,
float fs, int art);
void XT_MOVNEZ_S (float fr /* inout */,
float fs, int art);
void XT_MOVGEZ_S (float fr /* inout */,
float fs, int art);
void XT_MOVLTZ_S (float fr /* inout */,
float fs, int art);
MOVT.S fr, fs, bt [Inst]
MOVF.S fr, fs, bt [Inst]
Conditional move of the low half of data operand fs to fr based on the scalar condition in xtbool
bt. The upper half is zeroed when the move takes place.
MOVT.S: fr.H ← (bt == 1) ? 0 : fr.H; fr.L ← (bt == 1) ? fs.L : fr.L;
MOVF.S: fr.H ← (bt == 0) ? 0 : fr.H; fr.L ← (bt == 0) ? fs.L : fr.L;
C syntax:
void XT_MOVT_S (float fr /* inout */,
float fs, xtbool b);
void XT_MOVF_S (float fr /* inout */,
float fs, xtbool b);
ABS.S fr, fs [fusion_slot1, Inst]
Computes an IEEE 754 abs of the contents of the lower floating-point operand of fs. The
upper half is zeroed.
fr.H ← 0
fr.L ← abs(fs.L)
C syntax:
float XT_ABS_S (float fs);

NEG.S fr, fs [fusion_slot1, Inst]


Computes an IEEE 754 negation of the contents of the lower floating-point operand of fs.
The upper half is zeroed.
fr.H ← 0
fr.L ← -fs.L
C syntax:
float XT_NEG_S (float fs);


OLE.S br, fs, ft [Inst]

Compares the single-precision values in the low halves of floating-point operands fs and ft. If
the low half of fs is ordered with, and less than or equal to, the low half of ft, then br is set
to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal. IEEE 754
floating-point values are ordered if neither is a NaN (Not a Number).

C syntax:
xtbool XT_OLE_S (float fs, float ft);
OLT.S br, fs, ft [Inst]

Compares the single-precision values in the low halves of floating-point operands fs and ft. If
the low half of fs is ordered with and less than the low half of ft, then br is set to 1.
Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal.
IEEE 754 floating-point values are ordered if neither is a NaN.

C syntax:
xtbool XT_OLT_S (float fs, float ft);
OEQ.S br, fs, ft [Inst]

Compares the single-precision values in the low halves of floating-point operands fs and ft. If
the low half of fs is ordered with and equal to the low half of ft, then br is set to 1.
Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal.
IEEE 754 floating-point values are ordered if neither is a NaN.

C syntax:
xtbool XT_OEQ_S (float fs, float ft);
ULE.S br, fs, ft [Inst]

Compares the single-precision values in the low halves of floating-point operands fs and ft. If
the low half of fs is less than or equal to, or unordered with respect to, the low half of ft,
then br is set to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as
equal. IEEE 754 floating-point values are unordered if either is a NaN.

C syntax:
xtbool XT_ULE_S (float fs, float ft);
ULT.S br, fs, ft [Inst]

Compares the single-precision values in the low halves of floating-point operands fs and ft. If
the low half of fs is less than, or unordered with respect to, the low half of ft, then br
is set to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal.
IEEE 754 floating-point values are unordered if either is a NaN.

C syntax:
xtbool XT_ULT_S (float fs, float ft);


UEQ.S br, fs, ft [Inst]

Compares the single-precision values in the low halves of floating-point operands fs and ft. If
the low half of fs is equal to, or unordered with respect to, the low half of ft, then br is set
to 1. Otherwise, br is set to 0. According to IEEE 754, +0 and -0 compare as equal. IEEE 754
floating-point values are unordered if either is a NaN.

C syntax:
xtbool XT_UEQ_S (float fs, float ft);
UN.S br, fs, ft [Inst]

Unordered compare. If the low half of fs or the low half of ft is a NaN, then br is
set to 1. Otherwise, br is set to 0.

C syntax:
xtbool XT_UN_S (float fs, float ft);
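
The difference between the ordered and unordered predicates shows up only when an operand is a NaN. The following host-side sketch models the seven compares in plain C; the function names are hypothetical and the code illustrates the IEEE 754 semantics described above rather than the Fusion implementation.

```c
#include <stdbool.h>

/* Host-side models; names are illustrative. A NaN never compares equal
 * to itself, so (x != x) detects it without library calls. */
bool un_s(float fs, float ft)  { return (fs != fs) || (ft != ft); }

/* Ordered compares are false when either operand is a NaN. */
bool ole_s(float fs, float ft) { return fs <= ft; }
bool olt_s(float fs, float ft) { return fs <  ft; }
bool oeq_s(float fs, float ft) { return fs == ft; }

/* Unordered compares are true when either operand is a NaN. */
bool ule_s(float fs, float ft) { return un_s(fs, ft) || fs <= ft; }
bool ult_s(float fs, float ft) { return un_s(fs, ft) || fs <  ft; }
bool ueq_s(float fs, float ft) { return un_s(fs, ft) || fs == ft; }
```

For operands that are not NaNs, each unordered predicate gives the same answer as its ordered counterpart.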
FLOAT.S fr, ars, i [fusion_slot0]
Converts the contents of integral operand ars from signed integer to single-precision format,
rounding according to the current rounding mode. The converted integer value is then scaled
by a power of two constant value encoded in the immediate field, with 0..31 representing 1.0,
0.5, 0.25, …, 1.0/2147483648.0. The scaling allows for a fixed-point notation where the binary
point is at the right end of the integer for i=0 and moves to the left as i increases until for i=31
there are 31 fractional bits represented in the fixed-point number. The result is placed in the
low half of fr. The upper half is zeroed.
C syntax:
float XT_FLOAT_S (int ars, immediate i);
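
The scaling can be read as a fixed-point to floating-point conversion in which i is the number of fractional bits. The sketch below models that reading on a host machine. The name float_s_model is hypothetical, and the host conversion uses the host rounding mode rather than the mode in FCR.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the FLOAT.S scaling: convert, then multiply by 2^-i.
 * float_s_model is an illustrative name; on the DSP use XT_FLOAT_S. */
float float_s_model(int32_t ars, unsigned i)
{
    assert(i <= 31u);
    /* 2^i is exactly representable as a float for i in 0..31. */
    float scale = 1.0f / (float)(UINT32_C(1) << i);
    return (float)ars * scale;   /* scaling by a power of two is exact here */
}
```

For example, the Q15 value 16384 with i = 15 converts to 0.5, and with i = 0 the integer is converted unscaled.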
UFLOAT.S fr, ars, i [fusion_slot0]
Converts the contents of integral operand ars from unsigned integer to single-precision
format, rounding according to the current rounding mode. The converted integer value is then
scaled by a power of two constant value encoded in the immediate field, with 0..31
representing 1.0, 0.5, 0.25,…, 1.0/2147483648.0. The scaling allows for a fixed-point
notation where the binary point is at the right end of the integer for i=0 and moves to the left
as i increases until for i=31 there are 31 fractional bits represented in the fixed-point number.
The result is placed in the low half of floating-point operand fr. The upper half is zeroed.
C syntax:
float XT_UFLOAT_S (unsigned int ars, immediate i);
FIROUND.S vt, vr [fusion_slot0]
Rounds the floating point value of the low half of the input vector register operand into an
integral value in the low half of the output vector register operand. The high half is zeroed.
The value is rounded to the nearest integral value. When the fractional part of an input is
exactly 1/2, the value is rounded away from 0.
C syntax:
float XT_FIROUND_S (float b);


FIFLOOR.S vt, vr [fusion_slot0]


Rounds the floating point value of the low half of the input vector register operand into an
integral value in the low half of the output vector register operand. The high half is zeroed.
The value is rounded down to the nearest integral value.
C syntax:
float XT_FIFLOOR_S (float b);
FICEIL.S vt, vr [fusion_slot0]
Rounds the floating point value of the low half of the input vector register operand into an
integral value in the low half of the output vector register operand. The high half is zeroed.
The value is rounded up to the nearest integral value.
C syntax:
float XT_FICEIL_S (float b);
FITRUNC.S vt, vr [fusion_slot0]
Rounds the floating point value of the low half of the input vector register operand into an
integral value in the low half of the output vector register operand. The high half is zeroed.
The value is rounded towards 0.
C syntax:
float XT_FITRUNC_S (float b);
FIRINT.S vt, vr [fusion_slot0]

Rounds the floating point value of the low half of the input vector register operand into an
integral value in the low half of the output vector register operand, using the current
rounding mode. The high half is zeroed.

C syntax:
float XT_FIRINT_S (float b);
TRUNC.S arr, fs, i [fusion_slot0]
Converts the contents of the lower 32-bits of floating-point operand fs from single-precision
to signed integer format, rounding toward zero. The single-precision value is first scaled by
a power of two constant value encoded in the immediate field, with 0..31 representing 1.0,
2.0, 4.0, …, 2147483648.0. The scaling allows for a fixed-point notation where the
binary point is at the right end of the integer for i=0 and moves to the left as i increases until
for i=31 there are 31 fractional bits represented in the fixed-point number.
C syntax:
int XT_TRUNC_S (float fs, immediate i);
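
Reading the immediate as the number of fractional bits in the integer result, the conversion can be modelled on a host machine as shown below. This is a sketch under that assumption: the name trunc_s_model is hypothetical, and inputs whose scaled value does not fit in a 32-bit signed integer are excluded because the hardware result for them is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model: scale by 2^i, then truncate toward zero.
 * trunc_s_model is an illustrative name; on the DSP use XT_TRUNC_S.
 * Inputs whose scaled value does not fit in int32_t are rejected,
 * since the C conversion would be undefined for them. */
int32_t trunc_s_model(float fs, unsigned i)
{
    assert(i <= 31u);
    float scaled = fs * (float)(UINT32_C(1) << i);
    assert(scaled >= -2147483648.0f && scaled < 2147483648.0f);
    return (int32_t)scaled;   /* C float-to-int conversion truncates toward zero */
}
```

For example, 0.75 with i = 15 produces the Q15 value 24576, and -1.5 with i = 0 produces -1.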
UTRUNC.S arr, fs, i [fusion_slot0]
Converts the contents of the lower 32-bits of floating-point operand fs from single-precision
to unsigned integer format, rounding toward zero. The single-precision value is
first scaled by a power of two constant value encoded in the immediate field, with 0..31
representing 1.0, 2.0, 4.0, …, 2147483648.0. The scaling allows for a fixed-point
notation where the binary point is at the right end of the integer for i=0 and moves to the left
as i increases until for i=31 there are 31 fractional bits represented in the fixed-point number.

C syntax:
unsigned int XT_UTRUNC_S (float fs, immediate i);
RFR art, vr [Inst]
Copy the low 32-bits of vr into art.
C syntax:
unsigned int XT_RFR (float vs);
WFR vt, art [Inst]
Replicate art into each half of data register vt.
C syntax:
float XT_WFR (unsigned int vs);

Additional helper instructions exist that are used in compiler generated divide and sqrt
sequences. These are not documented here. Refer to the generated HTML file available via
the Xtensa Xplorer IDE for details.

2.18.1 Floating Point Intrinsics


Fusion DSP floating point programs use the standard Fusion DSP load, store and select
operations. To ease programming, intrinsics using floating point types are provided that map
into the core Fusion DSP instructions. Refer to Sections 2.4 and 2.13 for more details on the
instructions themselves.

LSX2I d, a, i64
LSX2IP d, a, i64pos
LSX2RI (RIP) d, a, i64
LSX2RIC d, a, [i64neg]
LSX2X (XP, XC) d, a, ax
Required alignment: 8 bytes
Load a pair of 32-bit values from memory into the AE_DR register d. See Table 2-3 for the
meanings of the address mode suffixes.
Note: RI and RIP are intrinsics mapped to equivalent instructions.
C syntax:
xtfloatx2 XT_LSX2I (const xtfloatx2 * a, immediate i64);
xtfloatx2 XT_LSX2X (const xtfloatx2 * a, int ax);
void XT_LSX2IP (xtfloatx2 d /*out*/,
const xtfloatx2 *a /*inout*/, immediate i64pos);
void XT_LSX2XP (xtfloatx2 d /*out*/,
const xtfloatx2 *a /*inout*/, int ax);
void XT_LSX2XC (xtfloatx2 d /*out*/,
const xtfloatx2 *a /*inout*/, int ax);

xtfloatx2 XT_LSX2RI (const xtfloatx2 * a, immediate i64);
void XT_LSX2RIP (xtfloatx2 d /*out*/,
const xtfloatx2 *a /*inout*/, immediate i64);
void XT_LSX2RIC (xtfloatx2 d /*out*/,
const xtfloatx2 *a /*inout*/);
LSI d, a, i32
LSIP d, a, i32
LSX (XP, XC) d, a, ax
Required alignment: 4 bytes
Load a 32-bit value from memory and replicate the value into the two elements of the AE_DR
register d. See Table 2-3 for the meanings of the address mode suffixes.
C syntax:
float XT_LSI (const float * a, immediate i32);
float XT_LSX (const float * a, int ax);
void XT_LSIP(float d /*out*/,
const float * a /*inout*/, immediate off);
void XT_LSXP(float d /*out*/,
const float * a /*inout*/, int ax);
void XT_LSXC (float d /*out*/,
const float * a /*inout*/, int ax);
LASX2PP u, a
Required alignment: 1 byte (but following instructions have alignment requirements).
Load a 64-bit value from memory to AE_VALIGN register u. The effective address is
(a & 0xFFFFFFF8). No update is made to the address register.

This instruction is used to prime the unaligned access stream for LASX2IP and LASX2RIP
instructions regardless of size or direction.
C syntax:
ae_valign XT_LASX2PP (xtfloatx2 *a);
LASX2POSPC u, a
LASX2NEGPC u, a
Required alignment: 4 bytes
This operation loads a 64-bit value from memory into AE_VALIGN register u. The effective
address is (a & 0xFFFFFFF8).
The instruction LASX2POSPC is used to prime the unaligned access stream for LASX2IC
instructions. The instruction LASX2NEGPC is used to prime the unaligned access stream for
LASX2RIC instructions.

C syntax:
void XT_LASX2POSPC (ae_valign u /*out*/, xtfloatx2 *a /*inout*/);
void XT_LASX2NEGPC (ae_valign u /*out*/, xtfloatx2 *a /*inout*/);
LASX2IP (IC, RIP, RIC) d, u, a
Required alignment: 4 bytes
Load a pair of 32-bit values from effective address (a) in memory into the AE_DR register d.
Instructions LASX2IP (IC) are used if the direction of the load operations is positive.
Instructions LASX2RIP (RIC) are used if the direction of the load operations is negative.
C syntax:
void XT_LASX2IP (xtfloatx2 d /*out*/, ae_valign u /*inout*/,
xtfloatx2 *a /*inout*/);
void XT_LASX2IC (xtfloatx2 d /*out*/, ae_valign u /*inout*/,
xtfloatx2 *a /*inout*/);
void XT_LASX2RIP (xtfloatx2 d /*out*/, ae_valign u /*inout*/,
xtfloatx2 *a /*inout*/);
void XT_LASX2RIC (xtfloatx2 d /*out*/, ae_valign u /*inout*/,
xtfloatx2 *a /*inout*/);
SSX2I d, a, i64
SSX2IP d, a, i64pos
SSX2RI (RIP) d, a, i64
SSX2RIC d, a
SSX2X (XP, XC) d, a, ax
Required alignment: 8 bytes
Store a pair of 32-bit values from the AE_DR register d to memory. See Table 2-3 for the
meanings of the address mode suffixes.
Note: RI and RIP are intrinsics mapped to equivalent instructions.
C syntax:
void XT_SSX2I (xtfloatx2 d, xtfloatx2 * a, immediate i64);
void XT_SSX2X (xtfloatx2 d, xtfloatx2 * a, int ax);
void XT_SSX2IP (xtfloatx2 d,
xtfloatx2 * a /*inout*/, immediate i64);
void XT_SSX2XP (xtfloatx2 d,
xtfloatx2 * a /*inout*/, int ax);
void XT_SSX2XC (xtfloatx2 d,
xtfloatx2 * a /*inout*/, int ax);
void XT_SSX2RI (xtfloatx2 d, xtfloatx2 * a, immediate i64);
void XT_SSX2RIP (xtfloatx2 d, xtfloatx2 * a /*inout*/, immediate i64);
void XT_SSX2RIC (xtfloatx2 d, xtfloatx2 * a /*inout*/);


SSI d, a, i32
SSIP d, a, i32
SSX (XP, XC) d, a, ax
Required alignment: 4 bytes
Store the 32-bit L element of the AE_DR register d to memory. For operations with suffix I,
the effective address is (a + i32). See Table 2-3 for the meanings of the address mode
suffixes.
C syntax:
void XT_SSI (float d, float * a, immediate i32);
void XT_SSX (float d, float * a, int ax);
void XT_SSIP (float d,
float * a /*inout*/, immediate i32);
void XT_SSXP (float d,
float * a /*inout*/, int ax);
void XT_SSXC (float d,
float * a /*inout*/, int ax);
SASX2IP (IC, RIP, RIC) d, u, a
Required alignment: 4 bytes
Store a pair of 32-bit values from AE_DR register d to memory with effective address (a).
Instructions SASX2IP (IC) are used if the direction of the store operations is positive.
Instructions SASX2RIP (RIC) are used if the direction of the store operations is negative.
C syntax:
void XT_SASX2IP (xtfloatx2 d, ae_valign u /*inout*/,
xtfloatx2 * a /*inout*/);
void XT_SASX2IC (xtfloatx2 d, ae_valign u /*inout*/,
xtfloatx2 * a /*inout*/);
void XT_SASX2RIP (xtfloatx2 d, ae_valign u /*inout*/,
xtfloatx2 * a /*inout*/);
void XT_SASX2RIC (xtfloatx2 d, ae_valign u /*inout*/,
xtfloatx2 * a /*inout*/);
SASX2POSFP u, a
Required alignment: varies depending on the data type in the AE_VALIGN register u.
Flushes the value in AE_VALIGN register u to memory with effective address (a). The
AE_VALIGN register u is updated with value zero. This operation is used when the direction
of the store operation is positive.
C syntax:
void XT_SASX2POSFP (ae_valign u /*inout*/, xtfloatx2 *a);


SASX2NEGFP u, a
Required alignment: varies depending on the data type in the AE_VALIGN register u.
Flushes the value in AE_VALIGN register u to memory with effective address (a). The
AE_VALIGN register u is updated with value zero. This operation is used when the direction
of the store operation is negative.
C syntax:
void XT_SASX2NEGFP (ae_valign u /*inout*/, xtfloatx2 *a);
AE_ZALIGN64 u
Initialize the AE_VALIGN register u with zero.
C syntax:
ae_valign AE_ZALIGN64 ();
SEL32_LL.SX2 d, d0, d1 [fusion_slot1]
d.H = d0.L;
d.L = d1.L.
C syntax:
xtfloatx2 XT_SEL32_LL_S (xtfloatx2 d0, xtfloatx2 d1);
SEL32_LH.SX2 d, d0, d1 [fusion_slot1]
d.H = d0.L;
d.L = d1.H.
C syntax:
xtfloatx2 XT_SEL32_LH_S (xtfloatx2 d0, xtfloatx2 d1);
SEL32_HL.SX2 d, d0, d1 [fusion_slot1]
d.H = d0.H;
d.L = d1.L.
C syntax:
xtfloatx2 XT_SEL32_HL_S (xtfloatx2 d0, xtfloatx2 d1);
SEL32_HH.SX2 d, d0, d1 [fusion_slot1]
d.H = d0.H;
d.L = d1.H.
C syntax:
xtfloatx2 XT_SEL32_HH_S (xtfloatx2 d0, xtfloatx2 d1);
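
The four select operations differ only in which half of each source they pick. The following host-side sketch models them with a two-element struct standing in for an AE_DR register; the struct and function names are hypothetical.

```c
/* Host-side stand-in for a 64-bit AE_DR register holding two floats.
 * The struct and function names are illustrative only. */
typedef struct {
    float h;   /* high 32-bit element */
    float l;   /* low 32-bit element  */
} f32x2_model;

/* Each select takes the first-named half of d0 for d.H and the
 * second-named half of d1 for d.L. */
f32x2_model sel32_ll(f32x2_model d0, f32x2_model d1)
{
    f32x2_model d = { d0.l, d1.l };
    return d;
}

f32x2_model sel32_lh(f32x2_model d0, f32x2_model d1)
{
    f32x2_model d = { d0.l, d1.h };
    return d;
}

f32x2_model sel32_hl(f32x2_model d0, f32x2_model d1)
{
    f32x2_model d = { d0.h, d1.l };
    return d;
}

f32x2_model sel32_hh(f32x2_model d0, f32x2_model d1)
{
    f32x2_model d = { d0.h, d1.h };
    return d;
}
```

In this model, passing the same value as both operands of sel32_lh swaps its two halves.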
LOW.S d, d0 [fusion_slot1]
Extract the low half of a SIMD floating point value.
d = d0.L;

C syntax:
float XT_LOW_S (xtfloatx2 d0);
HIGH.S d, d0
Extract the high half of a SIMD floating point value.
d = d0.H;
C syntax:
float XT_HIGH_S (xtfloatx2 d0);

2.18.2 Notes on Not a Number (NaN) Propagation


Some floating-point operations have a floating-point datum as an input operand or an output
operand, but not both. Other floating-point operations have both a floating-point input
operand and a floating-point output operand. Most operations of the latter kind propagate a
NaN to the output if an input is a NaN, in accordance with IEEE 754-2008. This propagation
helps programmers trace a numerical exception or NaN back to its origin, usually an invalid
operation such as inf - inf.

However, programmers are reminded not to depend on NaN propagation, payload, or the
sign bit, since recompilation may cause the propagation to change or to cease.
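
Code that must detect invalid results can test for a NaN explicitly at the point of interest instead of relying on propagation. The sketch below shows one way to do so in standard C on a host machine; the function names are hypothetical, and the self-comparison test assumes the compiler is not run with options that disable IEEE 754 NaN semantics.

```c
#include <math.h>      /* INFINITY macro only; no libm functions are called */
#include <stdbool.h>

/* A NaN is the only value that compares unequal to itself. */
bool is_nan_f32(float x)
{
    return x != x;
}

/* inf - inf is an invalid operation and produces a NaN. */
float make_nan_from_inf(void)
{
    volatile float inf = INFINITY;   /* volatile: evaluate at run time */
    return inf - inf;
}

/* Explicit check at a point of interest instead of relying on the NaN
 * surviving to the end of a computation. */
bool sum_is_valid(float a, float b)
{
    return !is_nan_f32(a + b);
}
```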

2.18.3 HiFi 3 Floating Point Intrinsics Emulation


We emulate the following HiFi 3 floating point intrinsics on Fusion DSP using proto
sequences. Refer to the HiFi 3 DSP User’s Guide for more details on these intrinsics.

• ABS.SX2                        • MOVT.SX2, MOVF.SX2
• ADD.SX2                        • MSUBC.S
• AE_MOVXTFLOATX2_FROMINT32X2    • MUL.SX2
• AE_MOVINT32X2_FROMXTFLOATX2    • MSUBCCONJ.S
• AE_MOVXTFLOATX2_FROMF32X2      • MSUB.SX2
• AE_MOVF32X2_FROMXTFLOATX2      • MULC.S
• CONJC.S                        • NEG.SX2
• FICEIL.SX2                     • OEQ.SX2
• FIFLOOR.SX2                    • OLE.SX2
• FIRINT.SX2                     • OLT.SX2
• FIROUND.SX2                    • UEQ.SX2
• FITRUNC.SX2                    • ULE.SX2
• FLOAT.SX2                      • ULT.SX2
• MADDC.S                        • UN.SX2
• MADDCCONJ.S                    • SSX2RI
• MADD.SX2                       • SUB.SX2
• MAX.SX2                        • UFLOAT.SX2
• MIN.S                          • TRUNC.SX2
• MIN.SX2                        • UTRUNC.SX2
• MOVEQZ.SX2                     • RECIP.SX2
• MOVGEZ.SX2                     • RSQRT.SX2
• MOVLTZ.SX2                     • SQRT.SX2
• MOVNEZ.SX2                     • DIV.SX2
• MOV.SX2                        • FSQRT.SX2

2.19 Bitstream and Variable-Length Encode and Decode Instructions AVS ONLY

The HiFi bitstream encoding and decoding instructions explained in this section provide
efficient support for serial access to bitstreams (bits stored in memory in serial byte order,
with the most significant bit first). The encoding instructions are used to create a bitstream
from a list of values and their bit-widths. The decoding instructions are used to read the
bitstreams into elements using the list of bit-widths.

The HiFi bitstream engine supports both fixed length and variable length encoding and
decoding. Variable length (Huffman) encode and decode instructions are specialized
instructions, in which the elements with variable bit-widths are encoded or decoded. The
instructions are assisted by a special set of tables generated from Huffman
encoding/decoding schemes used in the algorithm. These tables are generated offline and
their entries capture the bit-widths, bit-patterns and values. The format of the table entries is
specified in section 2.19.1. For details on how the variable-length encode/decode instructions
should be used, refer to Chapter 4.

Internally, the instructions share the state registers described in Table 2-2 Bitstream and
variable-length Encode/Decode Support Subsystem State Registers. Therefore, the program
cannot switch between encoding and decoding modes without storing and restoring their
values.

All of the following are 24-bit instructions that issue in the Inst slot.


AE_SHA32 a, a0 [ Inst ] AVS ONLY


Swap 32 bits for half word access. This instruction is used to swap bytes in the two half words
in an AR, typically for endianness change during a memcpy()-like operation. For example, if
a0 contains 0x12345678 before the AE_SHA32 instruction executes, a contains 0x34127856
immediately afterward.
C syntax:
unsigned AE_SHA32 (unsigned a0);
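
The byte swap described above can be modelled in portable C as follows. The name sha32_model is hypothetical; the sketch reproduces only the documented example, in which 0x12345678 becomes 0x34127856.

```c
#include <stdint.h>

/* Host-side model: swap the two bytes inside each 16-bit half word.
 * sha32_model is an illustrative name; on the DSP use AE_SHA32. */
uint32_t sha32_model(uint32_t a0)
{
    return ((a0 & 0x00FF00FFu) << 8) | ((a0 & 0xFF00FF00u) >> 8);
}
```

Applying the model twice returns the original value, which is why the same operation serves for conversion in either direction.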
AE_VLDL16T b, a, a0 [ Inst ] AVS ONLY

Required alignment: 2 bytes

16-bit table entry load for variable-length decode. Given a pointer a0 to a decoding table of
16-bit entries, an entry is loaded and parsed from a0[AE_NEXTOFFSET]. If the table entry
loaded completes the current decoding operation, b is set to true and a is set to the decoded
symbol value. Otherwise b is set to false.
C syntax:
void AE_VLDL16T (xtbool b /*out*/, unsigned a /*out*/,
const unsigned short * a0);
AE_VLDL32T b, a, a0 [ Inst ] AVS ONLY

Required alignment: 4 bytes

32-bit table entry load for variable-length decode. Given a pointer a0 to a decoding table of
32-bit entries, an entry is loaded and parsed from a0[AE_NEXTOFFSET]. If the table entry
loaded completes the current decoding operation, b is set to true and a is set to the decoded
symbol value. Otherwise b is set to false.
C syntax:
void AE_VLDL32T (xtbool b /*out*/, unsigned a /*out*/,
const unsigned * a0);
AE_VLDL16C a [ Inst ] AVS ONLY

Required alignment: 2 bytes

16-bit conditional bitstream load for variable-length decode. 16 bits are loaded from the
bitstream pointed to by (a+2) if they are needed to maintain the invariant that we have at
least 16 bits of look ahead from the AE_BITPTR position in the AE_BITHEAD state register.
In the event that a load occurs, a is advanced to refer to the next 16 bits in memory.
C syntax:
void AE_VLDL16C (const unsigned short * a /*inout*/);


AE_VLDL16C.IP a [ Inst ] AVS ONLY

Required alignment: 2 bytes

16-bit conditional bitstream load for variable-length decode. 16 bits are loaded from the
bitstream pointed to by a if they are needed to maintain the invariant that we have at least
16 bits of look ahead from the AE_BITPTR position in the AE_BITHEAD state register. In the
event that a load occurs, a is advanced to refer to the next 16 bits in memory.
C syntax:
void AE_VLDL16C_IP (const unsigned short * a /*inout*/);
AE_VLDL16C.IC a [ Inst ] AVS ONLY

Required alignment: 2 bytes

16-bit conditional bitstream load for variable-length decode. 16 bits are loaded from the
bitstream pointed to by a if they are needed to maintain the invariant that we have at least
16 bits of look ahead from the AE_BITPTR position in the AE_BITHEAD state register. In the
event that a load occurs, a is advanced using a circular wrap-around to refer to the next 16
bits in memory.
C syntax:
void AE_VLDL16C_XC (const unsigned short * a /*inout*/);
void AE_VLDL16C_IC (const unsigned short * a /*inout*/);
AE_VLDSHT a [ Inst ] AVS ONLY
Set Huffman Table for variable-length decode. This instruction sets AE_NEXTOFFSET
according to the current bits at the head of the bitstream and the table size specified by a for
the next lookup that will take place via AE_VLDL16T or AE_VLDL32T.
C syntax:
void AE_VLDSHT (unsigned a);
AE_LB a, a0 [ Inst ] AVS ONLY

Look ahead in the bitstream. Return as few as 1 bit or as many as 16 bits from the head (or
MSBits) of the state register AE_BITHEAD. The number of bits to return is given by the low
five bits of a0, and must be in the range [0..16]. No state is updated; this is a look ahead
instruction. The bits from the bitstream are returned right-justified in a.

AE_BITHEAD holds 16 to 32 bits of the bitstream being parsed (this instruction takes no
stream pointer). The number of bits consumed from AE_BITHEAD is stored in AE_BITPTR.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
unsigned AE_LB (unsigned a0);

AE_LBI a, i [ Inst ] AVS ONLY

Look ahead in the bitstream. Return as few as 1 bit or as many as 16 bits from the head (or
MSBits) of the state register AE_BITHEAD. The number of bits to return is given by the
immediate value i, and must be in the range [1..16]. No state is updated; this is a look-ahead
instruction. The bits from the bitstream are returned right-justified in a.

AE_BITHEAD holds 16 to 32 bits of the bitstream being parsed (this instruction takes no
stream pointer). The number of bits consumed from AE_BITHEAD is stored in AE_BITPTR.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
unsigned AE_LBI (immediate i);
AE_LBS a, a0 [ Inst ] AVS ONLY

Signed look ahead in the bitstream. Return as few as 1 bit or as many as 16 bits from the
head (or MSBits) of the state register AE_BITHEAD. The number of bits to return is given by
the low five bits of a0, and must be in the range [0..16]. No state is updated; this is a look
ahead instruction. The bits from the bitstream are returned sign-extended, right-justified in a.

AE_BITHEAD holds 16 to 32 bits of the bitstream being parsed (this instruction takes no
stream pointer). The number of bits consumed from AE_BITHEAD is stored in AE_BITPTR.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
unsigned AE_LBS (unsigned a0);
AE_LBSI a, i [ Inst ] AVS ONLY

Signed look ahead in the bitstream. Return as few as 1 bit or as many as 16 bits from the
head (or MSBits) of the state register AE_BITHEAD. The number of bits to return is given by
the immediate value i, and must be in the range [1..16]. No state is updated; this is a look-
ahead instruction. The bits from the bitstream are returned sign-extended, right-justified in a.

AE_BITHEAD holds 16 to 32 bits of the bitstream being parsed (this instruction takes no
stream pointer). The number of bits consumed from AE_BITHEAD is stored in AE_BITPTR.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.
C syntax:
unsigned AE_LBSI (immediate i);

AE_LBK a, a0, a1 [ Inst ] AVS ONLY

Look ahead in the bitstream, keeping low bits of a0. Returns as few as 1 bit or as many as
16 bits from the head (or MSBits) of the state register AE_BITHEAD in the low bits of a, with
the remaining bits filled with low bits from a0. The number of bits to move from the stream to
a is given by the low five bits of a1, and must be in the range [1..16]. No state is updated;
this is a look-ahead instruction.

AE_BITHEAD holds 16 to 32 bits of the bitstream being parsed (this instruction takes no
stream pointer). The number of bits consumed from AE_BITHEAD is stored in AE_BITPTR.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
unsigned AE_LBK (unsigned a0, unsigned a1);
AE_LBKI a, a0, i [ Inst ] AVS ONLY
Look ahead in the bitstream, keeping low bits of a0. Returns as few as 1 bit or as many as
16 bits from the head (or MSBits) of the state register AE_BITHEAD in the low bits of a,
with the remaining bits filled with low bits from a0. The number of bits to move from the
stream to a is given by the immediate value i, and must be in the range [1..16]. No state is
updated; this is a look-ahead instruction.

AE_BITHEAD holds 16 to 32 bits of the bitstream being parsed (this instruction takes no
stream pointer). The number of bits consumed from AE_BITHEAD is stored in AE_BITPTR.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
unsigned AE_LBKI (unsigned a0, immediate i);
AE_DB a, a0 [ Inst ] AVS ONLY

Required alignment: 2 bytes

Discards bits from the state register AE_BITHEAD. The number of bits to discard is given by
the low five bits of a0, and must be in the range [0..16]. AE_BITPTR is incremented by the
number of bits discarded, so it tracks the number of bits consumed from AE_BITHEAD. When 16
or fewer bits remain in AE_BITHEAD, a 16-bit word is read from memory location (a+2) into
AE_BITHEAD, the pointer a is updated to (a+2), and the value stored in AE_BITPTR is
decremented by 16.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_DB (const unsigned short * a /*inout*/, unsigned a0);

AE_DB.IP a, a0 [ Inst ] AVS ONLY

Required alignment: 2 bytes

Discards bits from the state register AE_BITHEAD. The number of bits to discard is given by
the low five bits of a0, and must be in the range [0..16]. AE_BITPTR is incremented by the
number of bits discarded, so it tracks the number of bits consumed from AE_BITHEAD. When 16
or fewer bits remain in AE_BITHEAD, a 16-bit word is read from memory location (a) into
AE_BITHEAD, the pointer a is updated to (a+2), and the value stored in AE_BITPTR is
decremented by 16.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_DB_IP (const unsigned short * a /*inout*/, unsigned a0);
AE_DB.IC a, a0 [ Inst ] AVS ONLY

Required alignment: 2 bytes

Discards bits from the state register AE_BITHEAD. The number of bits to discard is given by
the low five bits of a0, and must be in the range [0..16]. AE_BITPTR is incremented by the
number of bits discarded, so it tracks the number of bits consumed from AE_BITHEAD. When 16
or fewer bits remain in AE_BITHEAD, a 16-bit word is read from memory location (a) into
AE_BITHEAD, the pointer a is updated to (a+2) with circular wrap-around, and the value
stored in AE_BITPTR is decremented by 16.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_DB_IC (const unsigned short * a /*inout*/, unsigned a0);
void AE_DB_XC (const unsigned short * a /*inout*/, unsigned a0);
AE_DBI a, i [ Inst ] AVS ONLY

Required alignment: 2 bytes

Discards bits from the state register AE_BITHEAD. The number of bits to discard is given by
the immediate i, and must be in the range [1..16]. AE_BITPTR is incremented by the number
of bits discarded, so it tracks the number of bits consumed from AE_BITHEAD. When 16 or
fewer bits remain in AE_BITHEAD, a 16-bit word is read from memory location (a+2) into
AE_BITHEAD, the pointer a is updated to (a+2), and the value stored in AE_BITPTR is
decremented by 16.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

The following sequence of instructions is useful to start bit parsing of a bitstream buffer
stored in short bitParseBuf[] using the AE_LB*/AE_DB* instructions.

{
    /* Start one 16-bit word before the buffer, because AE_DBI loads from (a+2). */
    short *a = &bitParseBuf[0] - 1;
    WAE_BITPTR(0);      /* no bits consumed yet */
    AE_DBI(a, 16);      /* loads bitParseBuf[0] into AE_BITHEAD */
    AE_DBI(a, 16);      /* loads bitParseBuf[1] into AE_BITHEAD */
}

• This sequence fills AE_BITHEAD with 32 bits starting from bitParseBuf[0].

• It also sets AE_BITPTR to 0, and the pointer a appropriately, for initializing bitstream
parsing.

• The actual bit parsing is done using a sequence of AE_LB, AE_LBI, or AE_LBK followed
by AE_DB/AE_DBI instructions.
C syntax:
void AE_DBI (const unsigned short * a /*inout*/, immediate i);
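
The look-ahead and discard behavior described in the AE_LB* and AE_DB* entries can be
modeled in portable C. The following is a minimal software sketch under stated assumptions,
not the hardware implementation: the names bitreader_t, br_init, br_peek, br_peek_signed,
and br_discard are hypothetical, the stream is assumed to be an array of 16-bit words
consumed most significant bit first, and the caller is assumed to provide at least one
readable word beyond the last word parsed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of the AE_BITHEAD/AE_BITPTR state pair. */
typedef struct {
    const uint16_t *next; /* next 16-bit word to load from the stream */
    uint32_t head;        /* models AE_BITHEAD: 32 bits of the stream */
    unsigned ptr;         /* models AE_BITPTR: bits consumed, 0..15   */
} bitreader_t;

/* Fill the 32-bit window with the first two words (cf. the two AE_DBI calls). */
static void br_init(bitreader_t *br, const uint16_t *buf)
{
    br->head = ((uint32_t)buf[0] << 16) | buf[1];
    br->next = buf + 2;
    br->ptr = 0;
}

/* Look ahead n bits (1..16), right-justified; no state is updated (cf. AE_LB). */
static unsigned br_peek(const bitreader_t *br, unsigned n)
{
    assert(n >= 1 && n <= 16);
    return (unsigned)((br->head << br->ptr) >> (32 - n));
}

/* Signed look-ahead: sign-extend the n-bit field (cf. AE_LBS). */
static int br_peek_signed(const bitreader_t *br, unsigned n)
{
    unsigned v = br_peek(br, n);
    unsigned sign = 1u << (n - 1);
    return (int)(v ^ sign) - (int)sign;
}

/* Discard n bits (0..16); reload one word when 16 or fewer bits remain (cf. AE_DB). */
static void br_discard(bitreader_t *br, unsigned n)
{
    assert(n <= 16);
    br->ptr += n;
    if (br->ptr >= 16) {
        br->head = (br->head << 16) | *br->next++;
        br->ptr -= 16;
    }
}
```

In this model a parser alternates br_peek (or br_peek_signed) with br_discard, in the same
way that the instruction sequence alternates AE_LB* with AE_DB*.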
AE_DBI.IP a, i [ Inst ] AVS ONLY

Required alignment: 2 bytes

Discards bits from the state register AE_BITHEAD. The number of bits to discard is given by
the immediate i, and must be in the range [1..16]. AE_BITPTR is incremented by the number
of bits discarded, so it tracks the number of bits consumed from AE_BITHEAD. When 16 or
fewer bits remain in AE_BITHEAD, a 16-bit word is read from memory location (a) into
AE_BITHEAD, the pointer a is updated to (a+2), and the value stored in AE_BITPTR is
decremented by 16.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_DBI_IP (const unsigned short * a /*inout*/, immediate i);
AE_DBI.IC a, i [ Inst ] AVS ONLY

Required alignment: 2 bytes

Discards bits from the state register AE_BITHEAD. The number of bits to discard is given by
the immediate i, and must be in the range [1..16]. AE_BITPTR is incremented by the number
of bits discarded, so it tracks the number of bits consumed from AE_BITHEAD. When 16 or
fewer bits remain in AE_BITHEAD, a 16-bit word is read from memory location (a) into
AE_BITHEAD, the pointer a is updated to (a+2) with circular wrap-around, and the value
stored in AE_BITPTR is decremented by 16.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_DBI_IC (const unsigned short * a /*inout*/, immediate i);
AE_VLEL16T b, a, a0 [ Inst ] AVS ONLY

Required alignment: 2 bytes

16-bit table entry load for variable-length encode. Given a pointer a0 to an encoding table of
16-bit entries, an entry is loaded and parsed from a0[a]. If the table entry loaded completes
the current encoding operation, b is set to true, otherwise b is set to false and a is set to the
appropriate index for the next lookup to continue the encoding operation. In either case, the
appropriate codeword bits are pushed onto the output bitstream.
C syntax:
void AE_VLEL16T (xtbool b /*out*/, unsigned a /*inout*/,
const unsigned short * a0);
AE_VLEL32T b, a, a0 [ Inst ] AVS ONLY

Required alignment: 4 bytes

32-bit table entry load for variable-length encode. Given a pointer a0 to an encoding table of
32-bit entries, an entry is loaded and parsed from a0[a]. If the table entry loaded completes
the current encoding operation, b is set to true, otherwise b is set to false and a is set to the
appropriate index for the next lookup to continue the encoding operation. In either case, the
appropriate codeword bits are pushed onto the output bitstream.
C syntax:
void AE_VLEL32T (xtbool b /*out*/, unsigned a /*inout*/,
const unsigned * a0);
AE_VLES16C a [ Inst ] AVS ONLY

Required alignment: 2 bytes

16-bit conditional bitstream store for variable-length encode. 16 bits are stored to the
bitstream pointed to by (a+2) if doing so is needed to maintain the invariant that fewer than
16 bits are buffered in AE_BITHEAD.
C syntax:
void AE_VLES16C (unsigned short * a /*inout*/);

AE_VLES16C.IP a [ Inst ] AVS ONLY

Required alignment: 2 bytes

16-bit conditional bitstream store for variable-length encode. 16 bits are stored to the
bitstream pointed to by a if doing so is needed to maintain the invariant that fewer than 16
bits are buffered in AE_BITHEAD.
C syntax:
void AE_VLES16C_IP (unsigned short * a /*inout*/);
AE_VLES16C.IC a [ Inst ] AVS ONLY

Required alignment: 2 bytes

16-bit conditional bitstream store for variable-length encode. 16 bits are stored to the
bitstream pointed to by a if doing so is needed to maintain the invariant that fewer than 16
bits are buffered in AE_BITHEAD and a is advanced by 2 with a circular wrap-around.
C syntax:
void AE_VLES16C_IC (unsigned short * a /*inout*/);
AE_SB a, a0 [ Inst ] AVS ONLY

Required alignment: 2 bytes

This instruction writes to memory location (a+2) through the state register AE_BITHEAD in
16-bit chunks. Each call appends low bits from a0 to AE_BITHEAD. The number of low bits
appended is specified by AE_BITSUSED (note: an AE_BITSUSED value of zero is interpreted
as 16). A second state register, AE_BITPTR, keeps track of the number of bits appended to
AE_BITHEAD.
After one or more calls of this instruction, AE_BITHEAD holds 16 bits or more. When this
occurs, the 16 oldest bits in AE_BITHEAD are flushed and stored as a 16-bit word at memory
location (a+2), and the pointer a is updated to (a+2). At the initialization of an output
bitstream, AE_BITPTR and AE_BITHEAD are set to 0.

AE_BITPTR, AE_BITSUSED, and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SB (unsigned short * a /*inout*/, unsigned a0);
AE_SB.IP a, a0 [ Inst ] AVS ONLY

Required alignment: 2 bytes

This instruction writes to memory location (a) through the state register AE_BITHEAD in
16-bit chunks. Each call appends low bits from a0 to AE_BITHEAD. The number of low bits
appended is specified by AE_BITSUSED (note: an AE_BITSUSED value of zero is interpreted
as 16). A second state register, AE_BITPTR, keeps track of the number of bits appended to
AE_BITHEAD.
After one or more calls of this instruction, AE_BITHEAD holds 16 bits or more. When this
occurs, the 16 oldest bits in AE_BITHEAD are flushed and stored as a 16-bit word at memory
location (a), and the pointer a is updated to (a+2). At the initialization of an output
bitstream, AE_BITPTR and AE_BITHEAD are set to 0.

AE_BITPTR, AE_BITSUSED, and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SB_IP (unsigned short * a /*inout*/, unsigned a0);
AE_SB.IC a, a0 [ Inst ] AVS ONLY

Required alignment: 2 bytes

This instruction writes to memory location (a) through the state register AE_BITHEAD in
16-bit chunks. Each call appends low bits from a0 to AE_BITHEAD. The number of low bits
appended is specified by AE_BITSUSED (note: an AE_BITSUSED value of zero is interpreted
as 16). A second state register, AE_BITPTR, keeps track of the number of bits appended to
AE_BITHEAD.
After one or more calls of this instruction, AE_BITHEAD holds 16 bits or more. When this
occurs, the 16 oldest bits in AE_BITHEAD are flushed and stored as a 16-bit word at memory
location (a), and the pointer a is updated to (a+2) with circular wrap-around. At the
initialization of an output bitstream, AE_BITPTR and AE_BITHEAD are set to 0.

AE_BITPTR, AE_BITSUSED, and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SB_IC (unsigned short * a /*inout*/, unsigned a0);
AE_SBI a, a0, i [ Inst ] AVS ONLY

Required alignment: 2 bytes

This instruction writes to memory location (a+2) through the state register AE_BITHEAD in
16-bit chunks. Each call appends low bits from a0 to AE_BITHEAD. The number of low bits
appended is specified by the immediate i (note: an immediate value of zero is interpreted
as 16). A second state register, AE_BITPTR, keeps track of the number of bits appended to
AE_BITHEAD.
After one or more calls of this instruction, AE_BITHEAD holds 16 bits or more. When this
occurs, the 16 oldest bits in AE_BITHEAD are flushed and stored as a 16-bit word at memory
location (a+2), and the pointer a is updated to (a+2). At the initialization of an output
bitstream, AE_BITPTR and AE_BITHEAD are set to 0.

AE_BITPTR, AE_BITSUSED, and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SBI (unsigned short *a /*inout*/, unsigned a0, immediate i);
AE_SBI.IP a, a0, i [ Inst ] AVS ONLY

Required alignment: 2 bytes

This instruction writes to memory location (a) through the state register AE_BITHEAD in
16-bit chunks. Each call appends low bits from a0 to AE_BITHEAD. The number of low bits
appended is specified by the immediate i (note: an immediate value of zero is interpreted
as 16). A second state register, AE_BITPTR, keeps track of the number of bits appended to
AE_BITHEAD.
After one or more calls of this instruction, AE_BITHEAD holds 16 bits or more. When this
occurs, the 16 oldest bits in AE_BITHEAD are flushed and stored as a 16-bit word at memory
location (a), and the pointer a is updated to (a+2). At the initialization of an output
bitstream, AE_BITPTR and AE_BITHEAD are set to 0.

AE_BITPTR, AE_BITSUSED, and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SBI_IP (unsigned short *a /*inout*/, unsigned a0, immediate i);
AE_SBI.IC a, a0, i [ Inst ] AVS ONLY

Required alignment: 2 bytes

This instruction writes to memory location (a) through the state register AE_BITHEAD in
16-bit chunks. Each call appends low bits from a0 to AE_BITHEAD. The number of low bits
appended is specified by the immediate i (note: an immediate value of zero is interpreted
as 16). A second state register, AE_BITPTR, keeps track of the number of bits appended to
AE_BITHEAD.
After one or more calls of this instruction, AE_BITHEAD holds 16 bits or more. When this
occurs, the 16 oldest bits in AE_BITHEAD are flushed and stored as a 16-bit word at memory
location (a), and the pointer a is updated to (a+2) with circular wrap-around. At the
initialization of an output bitstream, AE_BITPTR and AE_BITHEAD are set to 0.

AE_BITPTR, AE_BITSUSED, and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SBI_IC (unsigned short *a /*inout*/, unsigned a0, immediate i);
AE_SBF a [ Inst ] AVS ONLY

Required alignment: 2 bytes

Flush any remaining bits from AE_BITHEAD to the stream in memory pointed to by (a+2).

This instruction stores AE_BITHEAD to (a+2), including the zero padding in the LSB
positions, and clears AE_BITHEAD. The pointer a is incremented by 2. Because this
instruction does not modify AE_BITPTR, the number of bits stored (excluding padding) can be
read from the AE_BITPTR state register.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SBF (unsigned short * a /*inout*/);
AE_SBF.IP a [ Inst ] AVS ONLY

Required alignment: 2 bytes

Flush any remaining bits from AE_BITHEAD to the stream in memory pointed to by (a).

This instruction stores AE_BITHEAD to (a), including the zero padding in the LSB positions,
and clears AE_BITHEAD. The pointer a is incremented by 2. Because this instruction does not
modify AE_BITPTR, the number of bits stored (excluding padding) can be read from the
AE_BITPTR state register.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SBF_IP (unsigned short * a /*inout*/);
AE_SBF.IC a [ Inst ] AVS ONLY

Required alignment: 2 bytes

Flush any remaining bits from AE_BITHEAD to the stream in memory pointed to by (a).

This instruction stores AE_BITHEAD to (a), including the zero padding in the LSB positions,
and clears AE_BITHEAD. The pointer a is incremented by 2 with circular wrap-around. Because
this instruction does not modify AE_BITPTR, the number of bits stored (excluding padding)
can be read from the AE_BITPTR state register.

The AE_BITPTR and AE_BITHEAD must be initialized appropriately before using this
instruction.

C syntax:
void AE_SBF_IC (unsigned short * a /*inout*/);
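
The append-and-flush behavior described in the AE_SB* and AE_SBF* entries can be modeled
in portable C. The following is a minimal software sketch, not the hardware implementation:
the names bitwriter_t, bw_init, bw_put, and bw_flush are hypothetical, bits are assumed to
be written most significant bit first into 16-bit words, and post-increment addressing (as
in the .IP variants) is assumed. Unlike AE_SBF, which leaves AE_BITPTR unchanged, the
hypothetical bw_flush returns the pending bit count and then resets it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of the output-side AE_BITHEAD/AE_BITPTR pair. */
typedef struct {
    uint16_t *out;  /* next 16-bit word to write                 */
    uint32_t head;  /* pending bits, right-justified             */
    unsigned count; /* number of pending bits, 0..15 between calls */
} bitwriter_t;

static void bw_init(bitwriter_t *bw, uint16_t *buf)
{
    bw->out = buf;
    bw->head = 0;
    bw->count = 0;
}

/* Append the low n bits (1..16) of value; store the 16 oldest bits once
 * 16 or more bits are pending (cf. AE_SBI). */
static void bw_put(bitwriter_t *bw, unsigned value, unsigned n)
{
    assert(n >= 1 && n <= 16);
    bw->head = (bw->head << n) | (value & ((1u << n) - 1u));
    bw->count += n;
    if (bw->count >= 16) {
        bw->count -= 16;
        *bw->out++ = (uint16_t)(bw->head >> bw->count);
        bw->head &= (1u << bw->count) - 1u;
    }
}

/* Store the pending bits, zero-padded in the LSB positions (cf. AE_SBF).
 * Returns the number of valid (unpadded) bits in the stored word. */
static unsigned bw_flush(bitwriter_t *bw)
{
    unsigned pending = bw->count;
    *bw->out++ = (uint16_t)(bw->head << (16 - pending));
    bw->head = 0;
    bw->count = 0;
    return pending;
}
```

As with AE_SBF, the flush in this model always stores one word, so the caller should leave
room for one extra 16-bit word at the end of the output buffer.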

2.19.1 Codebook Formats


The variable-length encode and decode instructions described in Section 3.1 use 16- and
32-bit variable-length encode and decode tables (codebooks). The structure of each
codebook table entry is described below.

32-bit Variable-Length Decode Table Entry

Each 32-bit variable-length decode codebook table entry has the following format:

Bit position:   31    30..27    26..0
Field:          F     N         S
Width (bits):   1     4         27

In this entry, F is a single bit that indicates whether the symbol has been found.

If F is set, the codeword is decoded and the symbol is found. N gives the number of bits
consumed at the current (final) stage of the lookup, and S gives the 27-bit symbol value.

If F is clear, the codeword is only partly decoded and the symbol isn’t found yet. N is a 4-bit
indication of the number of stream prefix bits used to perform the lookup in the next table,
and S gives the 27-bit offset of the beginning of the next table. The number of bits consumed
is implied by the size of the current sub-table.

16-bit Variable-Length Decode Table Entry

Each 16-bit variable-length decode codebook table entry has the following format:

Bit position:   15    14..11    10..0
Field:          F     N         S
Width (bits):   1     4         11

In this entry, F is a single bit that indicates whether the symbol has been found.

If F is set, the codeword is decoded and the symbol is found. N gives the number of bits
consumed at the current (final) stage of the lookup, and S gives the 11-bit symbol value.

If F is clear, the codeword is only partly decoded and the symbol isn’t found yet. N is a 4-bit
indication of the number of stream prefix bits used to perform the lookup in the next table,
and S gives the 11-bit offset of the beginning of the next table. The number of bits consumed
is implied by the size of the current sub-table.
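
The two decode entry layouts above can be unpacked with plain shifts and masks. The
following is a minimal sketch for building or checking codebooks on a host; the type
vld_entry_t and the functions vld_unpack32 and vld_unpack16 are hypothetical names, and
only the field positions are taken from the layouts above.

```c
#include <stdint.h>

/* Fields of one decode codebook entry (hypothetical helper type). */
typedef struct {
    unsigned found;  /* F: 1 if the symbol has been found            */
    unsigned nbits;  /* N: 4-bit bit count                           */
    uint32_t value;  /* S: symbol value, or offset of the next table */
} vld_entry_t;

/* 32-bit entry: F = bit 31, N = bits 30..27, S = bits 26..0. */
static vld_entry_t vld_unpack32(uint32_t e)
{
    vld_entry_t r;
    r.found = (e >> 31) & 1u;
    r.nbits = (e >> 27) & 0xFu;
    r.value = e & 0x07FFFFFFu;
    return r;
}

/* 16-bit entry: F = bit 15, N = bits 14..11, S = bits 10..0. */
static vld_entry_t vld_unpack16(uint16_t e)
{
    vld_entry_t r;
    r.found = (e >> 15) & 1u;
    r.nbits = (e >> 11) & 0xFu;
    r.value = e & 0x07FFu;
    return r;
}
```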

32-bit Variable-Length Encode Table Entry

Each 32-bit variable-length encode codebook table entry has the following format:

Bit position:   31    30..0
Field:          F     …
Width (bits):   1     31

In this entry, F is a single bit that indicates whether the symbol has been completed.

If F is set, the symbol is encoded completely, and the rest of the table entry is interpreted as
follows:

Bit position:   31    30..20    19..16    15..0
Field:          1     …         N         C
Width (bits):   1     11        4         16

N is the codeword segment size in bits (N equal to zero means 16 bits). C contains the right-
justified codeword segment. 11 bits of the 32-bit word are unused in this case.

If F is clear, the symbol is only partly encoded, and the rest of the table entry is interpreted
as follows:

Bit position:   31    30..16    15..0
Field:          0     K         C
Width (bits):   1     15        16

K is the table entry index for the next encode lookup. C is a 16-bit segment of the codeword.

16-bit Variable-Length Encode Table Entry

Each 16-bit variable-length encode codebook table entry has the following format:

Bit position:   15    14..0
Field:          F     …
Width (bits):   1     15

In this entry, F is a single bit that indicates whether the symbol has been completed.

If F is set, the symbol is encoded completely, and the rest of the table entry is interpreted as
follows:

Bit position:   15    14..11    10..0
Field:          1     N         C
Width (bits):   1     4         11

N is the codeword segment size in bits with valid values in the range from 1 to 11. C contains
the right-justified codeword segment.

If F is clear, the symbol is only partly encoded, and the rest of the table entry is interpreted
as follows:

Bit position:   15    14..6    5..0
Field:          0     K        C
Width (bits):   1     9        6

K is the table entry index for the next encode lookup. C is a six-bit segment of the codeword.
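
The encode entry layouts above can likewise be unpacked on a host when generating or
validating codebooks. The following is a minimal sketch; vle_entry_t, vle_unpack32, and
vle_unpack16 are hypothetical names, and the segment lengths reported for partly encoded
entries (16 and 6 bits) follow the fixed widths of the C field given above.

```c
#include <stdint.h>

/* Fields of one encode codebook entry (hypothetical helper type). */
typedef struct {
    unsigned done;   /* F: 1 if the symbol is completely encoded     */
    unsigned nbits;  /* number of codeword bits in this segment      */
    unsigned next;   /* K: index of the next lookup (0 when done)    */
    unsigned code;   /* C: right-justified codeword segment          */
} vle_entry_t;

/* 32-bit entry. F set: N = bits 19..16, C = bits 15..0.
 * F clear: K = bits 30..16, C = bits 15..0. */
static vle_entry_t vle_unpack32(uint32_t e)
{
    vle_entry_t r;
    r.done = (e >> 31) & 1u;
    r.code = e & 0xFFFFu;
    if (r.done) {
        unsigned n = (e >> 16) & 0xFu;
        r.nbits = n ? n : 16u;      /* N equal to zero means 16 bits */
        r.next = 0;
    } else {
        r.nbits = 16u;              /* C is a full 16-bit segment */
        r.next = (e >> 16) & 0x7FFFu;
    }
    return r;
}

/* 16-bit entry. F set: N = bits 14..11, C = bits 10..0.
 * F clear: K = bits 14..6, C = bits 5..0. */
static vle_entry_t vle_unpack16(uint16_t e)
{
    vle_entry_t r;
    r.done = (e >> 15) & 1u;
    if (r.done) {
        r.nbits = (e >> 11) & 0xFu; /* valid values are 1 to 11 */
        r.code = e & 0x7FFu;
        r.next = 0;
    } else {
        r.nbits = 6u;               /* C is a six-bit segment */
        r.next = (e >> 6) & 0x1FFu;
        r.code = e & 0x3Fu;
    }
    return r;
}
```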

2.20 Optional Fusion Advanced Bit Manipulation Package

The Fusion Advanced Bit Manipulation Package option speeds up the bit-level
operations commonly used in Baseband PHY and MAC standards such as Bluetooth, Wi-Fi,
and 3GPP. It enables three groups of operations:

• CRC and Scrambling (Linear Feedback Shift Register) operations, commonly used
in Baseband PHY/MAC standards such as Bluetooth, Wi-Fi, and 3GPP.

• Bit-level Convolutional Encode operations

• Bit-level shuffling and manipulation, commonly used in Baseband PHY and MAC
standards.

Each of these groups is described in the following sections.

2.21 CRC and Scrambling (LFSR) Operations


Fusion DSP has optional support for Cyclic Redundancy Check (CRC) and Linear Feedback
Shift Register (LFSR) operations. Refer to the ISA HTML documentation for detailed
specifications of each of the operations.

AE_CRC32 a, d, a0 [ fusion_slot1]
AE_CRC32 processes 8 bits of input data from the AE_DR register d, and updates the CRC
value in address register a, using the CRC polynomial specified by address register a0. CRC
polynomials of up to 32 bits are supported.
C syntax:
extern void AE_CRC32(unsigned a, ae_int32x2 d, unsigned a0);

AE_LFSR8 a, d, a0 [ fusion_slot1]
AE_LFSR8 generates 8 bits of a Linear Feedback Shift Register (LFSR), using the 32 bit
shift register state in the address register a, and the polynomial encoded into the address
register a0. Polynomials of up to 32 bits are supported.
C syntax:
extern void AE_LFSR8(unsigned a, ae_int32x2 d, unsigned a0);
AE_LFSR16 a, d, a0 [ fusion_slot1]
AE_LFSR16 generates 16 bits of a Linear Feedback Shift Register (LFSR), using the 32-bit
state in address register a, and the polynomial encoded into the address register a0.
Polynomials of up to 16 bits are supported.
C syntax:
extern void AE_LFSR16(unsigned a, ae_int32x2 d, unsigned a0);
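
For reference, the kind of computation that an 8-bit CRC step performs can be written
bit-serially in portable C. The following is a generic sketch, not a model of AE_CRC32: the
function names crc32_update8 and crc32_buffer are hypothetical, and the most-significant-
bit-first ordering and left-aligned polynomial are assumptions. The bit ordering and
polynomial encoding actually used by the hardware are given in the ISA HTML documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Update a 32-bit CRC with 8 input bits, most significant bit first.
 * poly is the generator polynomial without its leading x^32 term. */
static uint32_t crc32_update8(uint32_t crc, uint8_t data, uint32_t poly)
{
    crc ^= (uint32_t)data << 24;
    for (int bit = 0; bit < 8; bit++) {
        /* Shift out the top bit; XOR in the polynomial when it was set. */
        crc = (crc & 0x80000000u) ? (crc << 1) ^ poly : (crc << 1);
    }
    return crc;
}

/* Run the 8-bit step over a whole buffer, starting from init. */
static uint32_t crc32_buffer(const uint8_t *buf, size_t len,
                             uint32_t init, uint32_t poly)
{
    assert(buf != NULL || len == 0);
    uint32_t crc = init;
    for (size_t i = 0; i < len; i++)
        crc = crc32_update8(crc, buf[i], poly);
    return crc;
}
```

With polynomial 0x04C11DB7 and initial value 0xFFFFFFFF this corresponds to the
CRC-32/MPEG-2 parameter set, which makes the sketch easy to check against published
test vectors.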

2.22 Bit-level Convolutional Encode Operations

Fusion DSP has optional operations to support efficient implementation of Convolutional and
Turbo Encoding (FEC) functions typically found in Baseband PHY standards such as
Wi-Fi/3G/LTE. Refer to the ISA HTML documentation for detailed specification of each of the
operations.

AE_CC32_L d, d0, a [ fusion_slot1]
AE_CC32_H d, d0, a [ fusion_slot1]
Generate 32 output bits by performing a convolutional encoding operation using input bits
from the AE_DR register d0 and polynomial from AR register a.
C syntax:
extern ae_int32x2 AE_CC32_L(ae_int32x2 d0, unsigned a);
extern void AE_CC32_H(ae_int32x2 d, ae_int32x2 d0, unsigned a);
AE_CTC_BIN a, d, c [ fusion_slot1]

Generate 8 bits of output of a three-state Convolutional Turbo Encoder, using input bits from
AE_DR register d, and two programmable polynomials from AR register c.

C syntax:
extern void AE_CTC_BIN(unsigned a /*inout*/, ae_int32x2 d /*inout*/,
unsigned c);
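
A convolutional encoder produces each output bit as the parity of the recent input bits
selected by a generator polynomial. The following portable C sketch illustrates that
general technique only; it is not a model of AE_CC32_L/H or AE_CTC_BIN. The names parity32
and conv_encode are hypothetical, and the bit ordering (input taken LSB first, most recent
input bit in bit 0 of the state) is an assumption.

```c
#include <stdint.h>

/* Parity (XOR of all bits) of a 32-bit word. */
static unsigned parity32(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (unsigned)(x & 1u);
}

/*
 * Encode up to 32 input bits (bit 0 of `in` first) with one generator
 * polynomial. `state` holds the input history, most recent bit in bit 0,
 * and is updated in place. Output bit i is returned in bit i of the result.
 */
static uint32_t conv_encode(uint32_t in, unsigned nbits, uint32_t poly,
                            uint32_t *state)
{
    uint32_t out = 0;
    for (unsigned i = 0; i < nbits && i < 32; i++) {
        *state = (*state << 1) | ((in >> i) & 1u);
        out |= (uint32_t)parity32(*state & poly) << i;
    }
    return out;
}
```

A rate-1/2 code in this model calls conv_encode twice on the same input, once per
generator polynomial, each time starting from the same saved state.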

2.23 Bit Shuffling and Selection Operations


Fusion DSP has optional support for operations for bit-level selection. These bit-level
operations enable efficient implementation of bit-level processing functions such as bit-level
interleaving/deinterleaving, bit-level Rate Matching for 3GPP-LTE Baseband PHY, and other
functions. A typical use for such functions is for bit-level processing in Baseband PHY and
MAC protocols for standards such as 3GPP, Wi-Fi, and Bluetooth.

This option also includes a variation of a small subset of the AVS bitstream operations. These
variants are AE_LB_BR, AE_LBI_BR, AE_DB_BR.IP, AE_DBI_BR.IP, AE_SB_BR.IP,
AE_SBI_BR.IP, and AE_SBF_BR.IP. These bitstream operation variants operate on the
bitstream with the least significant bit first in each byte. Note that the regular AVS bitstream
operations operate with the most significant bit first in each byte.

Refer to the ISA HTML documentation for detailed specifications of each of the operations.

AE_BSEL4X8_L d0, d, a [ fusion_slot1]
AE_BSEL4X8_H d0, d, a [ fusion_slot1]
AE_BISEL4X8_L d0, d, a [ fusion_slot1]
AE_BISEL4X8_H d0, d, a [ fusion_slot1]
Generate eight output bits, each chosen arbitrarily from any of 16 input bits. This
functionality is applied four times, to four different 16-bit input elements, to generate four
8-bit output vector elements, each using the same selection pattern specified in AR register a.
The AE_BSEL4X8 and AE_BISEL4X8 operations differ in how the data inputs are split into
four 16-bit vector elements: AE_BSEL4X8 concatenates two 32-bit halves from the two
input registers, while AE_BISEL4X8 interleaves 8-bit elements from the two 32-bit halves
before concatenating them.
C syntax:
extern void AE_BSEL4X8_L(ae_int32x2 d0, ae_int32x2 d, unsigned a);
extern void AE_BSEL4X8_H(ae_int32x2 d0, ae_int32x2 d, unsigned a);
extern void AE_BISEL4X8_L(ae_int32x2 d0, ae_int32x2 d, unsigned a);
extern void AE_BISEL4X8_H(ae_int32x2 d0, ae_int32x2 d, unsigned a);
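
The selection of eight output bits from 16 input bits, repeated over four elements with one
shared pattern, can be illustrated in portable C. The following is a sketch under an assumed
pattern encoding, not the documented one: it assumes the 32-bit pattern holds eight 4-bit
source indices, with nibble k selecting the input bit that becomes output bit k. The names
bsel8 and bsel4x8 are hypothetical; the actual encoding and the way the input elements are
formed are specified in the ISA HTML documentation.

```c
#include <stdint.h>

/* Pick 8 bits out of one 16-bit element. Nibble k of sel is assumed to be
 * the index (0..15) of the input bit that becomes output bit k. */
static uint8_t bsel8(uint16_t in, uint32_t sel)
{
    unsigned out = 0;
    for (unsigned k = 0; k < 8; k++) {
        unsigned idx = (sel >> (4 * k)) & 0xFu;
        out |= ((in >> idx) & 1u) << k;
    }
    return (uint8_t)out;
}

/* Apply the same selection pattern to four 16-bit elements. */
static void bsel4x8(const uint16_t in[4], uint32_t sel, uint8_t out[4])
{
    for (unsigned e = 0; e < 4; e++)
        out[e] = bsel8(in[e], sel);
}
```

For example, under this assumed encoding the pattern 0xECA86420 extracts the even-numbered
bits of each element, which is the basic step of a 2:1 bit de-interleaver.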

AE_SEL4X8_L d0, d, a [ fusion_slot1]
AE_SEL4X8_H d0, d, a [ fusion_slot1]

Byte select operations allow an arbitrary combination/selection from a set of 8 input bytes
formed using 4 bytes each from the two input registers, to generate four output bytes. Each
of the output bytes can be independently selected from any of the 8 input bytes. With this
general definition of byte selection, it is easy to implement byte-level replication, rotation,
shift, and interleaving with the same basic instruction.

C syntax:
extern void AE_SEL4X8_L(ae_int32x2 a /*inout*/, ae_int32x2 b, unsigned
c);
extern void AE_SEL4X8_H(ae_int32x2 a /*inout*/, ae_int32x2 b, unsigned
c);
AE_DEPBITS_L d0, d, imm1, imm2 [ fusion_slot40]
AE_DEPBITS_H d0, d, imm1, imm2 [ fusion_slot40]
Deposit a field into an arbitrary position in an AE_DR register. These instructions are similar
to the Xtensa DEPBITS option, with the difference that the AE_DEPBITS_L/H use AE_DR
registers for input/outputs (The Xtensa DEPBITS uses AR registers for input/output).
C syntax:
extern void AE_DEPBITS_L(ae_int64 dout /*inout*/, ae_int64 d, immediate
low_depbits, immediate lngth_depbits);
extern void AE_DEPBITS_H(ae_int64 dout /*inout*/, ae_int64 d, immediate
low_depbits, immediate lngth_depbits);
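
A bit-field deposit replaces a contiguous field of the destination with the low bits of the
source and leaves all other destination bits unchanged. The following portable C sketch shows
that operation on a 64-bit value; it is a generic illustration, not a model of
AE_DEPBITS_L/H. The name depbits64 is hypothetical, and how the _L and _H variants map onto
the register and which immediate ranges they accept are not assumed here.

```c
#include <assert.h>
#include <stdint.h>

/* Deposit the low `len` bits of src into dst starting at bit `low`.
 * All other bits of dst are preserved. */
static uint64_t depbits64(uint64_t dst, uint64_t src, unsigned low, unsigned len)
{
    assert(len >= 1 && low + len <= 64);
    uint64_t mask = (len == 64) ? ~(uint64_t)0 : (((uint64_t)1 << len) - 1);
    return (dst & ~(mask << low)) | ((src & mask) << low);
}
```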
AE_LB_BR a, a0 [fusion_slot0]
AE_LBI_BR a, i [fusion_slot0]
AE_DB_BR.IP a, a0 [fusion_slot0]
AE_DBI_BR.IP a, i [fusion_slot0]
AE_SB_BR.IP a, a0 [fusion_slot0]
AE_SBI_BR.IP a, a0, i [fusion_slot0]
AE_SBF_BR.IP a [fusion_slot0]
These are variants of a subset of the AVS bitstream instructions. The variants have _BR in
the name to distinguish them from the corresponding AVS instructions. The _BR variants
operate least significant bit first in each byte, whereas the corresponding AVS instructions
operate most significant bit first in each byte.
C syntax:
unsigned AE_LB_BR (unsigned a0);
unsigned AE_LBI_BR (immediate i);
void AE_DB_BR_IP (const unsigned short * a /*inout*/, unsigned a0);
void AE_DBI_BR_IP (const unsigned short * a /*inout*/, immediate i);
void AE_SB_BR_IP (unsigned short * a /*inout*/, unsigned a0);
void AE_SBI_BR_IP (unsigned short *a /*inout*/, unsigned a0, immediate i);
void AE_SBF_BR_IP (unsigned short * a /*inout*/);

2.24 Optional AES128-CCM Operations


The Advanced Encryption Standard (AES) is a widely used standard for data security. The
AES algorithm is defined in the NIST Publication FIPS-197
(http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf). The most commonly used mode
of operation of the AES is the CCM mode; a description of the CCM mode can be found in
the NIST Publication 800-38C (http://csrc.nist.gov/publications/PubsSPs.html).

The AES algorithm specified in the FIPS-197 standard is capable of using cryptographic keys
of 128, 192, and 256 bits to perform a forward cipher and reverse cipher of data in blocks of
128 bits. However, the CCM mode of operation defines the CCM-generation-encryption and
CCM-decryption-verification procedures by only using the forward cipher of the AES
algorithm in FIPS-197. CCM mode does not require the reverse cipher from FIPS-197.
Furthermore, 128-bit block size is the most widely used in data communication standards
(for example, Bluetooth Low Energy).

To implement the CCM-generation-encryption and CCM-decryption-verification procedures
for 128-bit blocks, the forward cipher procedure for 128-bit blocks from the FIPS-197
specified algorithm is used. The forward cipher procedure performs a series of
transformations (operations) on an input 128-bit block to generate an output ciphered
128-bit block. The transformations are referred to as the SubBytes transformation, ShiftRows
transformation, MixColumns transformation, and AddRoundKey transformation. Each of
these transformations is performed multiple times (in what are referred to as rounds) to
eventually generate the final ciphered output 128-bit block. Each round uses one key from a
set of intermediate 128-bit keys, called a key schedule. This set of intermediate keys is
generated from the input 128-bit key by using a separate procedure called KeyExpansion,
which is also specified in the FIPS-197 standard. Refer to the FIPS-197 standard for a
detailed specification of the forward cipher algorithm.
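For readers who want a concrete picture of three of the transformations named above, the following portable C sketch models ShiftRows, MixColumns (on a single column), and AddRoundKey as defined in FIPS-197. It is a reference model only, not the Fusion DSP implementation; the function names and the column-major byte layout of the state are assumptions of this sketch, and SubBytes and KeyExpansion are omitted.

```c
#include <stdint.h>

/* Multiply by x (that is, by 2) in GF(2^8) with the AES reduction polynomial 0x11B. */
static uint8_t xtime(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1B : 0x00));
}

/* MixColumns on one 4-byte column, in place. */
void aes_mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    c[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}

/* ShiftRows on a 16-byte state stored column-major (byte index = row + 4*col):
 * row r is rotated left by r columns. */
void aes_shift_rows(uint8_t s[16])
{
    uint8_t t[16];
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++)
            t[row + 4 * col] = s[row + 4 * ((col + row) % 4)];
    for (int i = 0; i < 16; i++)
        s[i] = t[i];
}

/* AddRoundKey: XOR the state with one 128-bit round key. */
void aes_add_round_key(uint8_t s[16], const uint8_t rk[16])
{
    for (int i = 0; i < 16; i++)
        s[i] ^= rk[i];
}
```

In a full AES-128 forward cipher these steps are applied once per round together with SubBytes, and the final round omits MixColumns.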

Fusion DSP has optional operations to support efficient implementation of the AES forward
cipher algorithm for block size of 128 bits.

AE_AES_SUBBYTE_MIX_XOR64 d0, d1, a, imm1, imm2 [ fusion_slot40]


Performs the SubBytes, MixColumns, and AddRoundKey steps of the AES-128 encryption
procedure on a 128-bit block.
C syntax:
extern void AE_AES_SUBBYTE_MIX_XOR64(ae_int64 d0 /*inout*/, ae_int64
d1 /*inout*/, const void * keyaddr, immediate offset, immediate
index);
AE_AES_SUBBYTE_XOR64 d0, d1, a, imm1, imm2 [ fusion_slot40]
Performs the SubBytes and AddRoundKey steps of AES-128 encryption on a 128-bit block.
This operation does not perform the MixColumns step, and is intended to be used in the last
round of the AES-128 encryption procedure (the last round does not have MixColumns).
C syntax:
extern void AE_AES_SUBBYTE_XOR64(ae_int64 d0 /*inout*/, ae_int64 d1
/*inout*/, const void * keyaddr, immediate offset, immediate index);


AE_AES_RKEY d0, d1, a [ fusion_slot40]

Performs one step of the KeyExpansion procedure as specified in the FIPS-197 standard.

C syntax:
extern void AE_AES_RKEY(ae_int64 d0 /*inout*/, ae_int64 d1 /*inout*/,
unsigned a);
AE_AES_SB128 d0, d1 [ fusion_slot40]
Performs the ShiftRows transformation on the state array in registers d0 and d1 as specified
in the FIPS-197 standard.
C syntax:
extern void AE_AES_SB128(ae_int64 d0 /*inout*/, ae_int64 d1
/*inout*/);

2.25 Optional Viterbi Decoder Operations


This option supports efficient Viterbi operations. The operations support rates 1/2 and 1/3
with arbitrary polynomials of constraint length 5 and 7. Rates 1/2 and 1/3 are the code rates
before puncturing (or rate-matching). In the WiFi standard (IEEE 802.11), code rate=1/2 and
constraint length=7 are used for both signaling and data channels. In the 3GPP standard,
code rate=1/3 and constraint length=7 are normally used for signaling channels; however, for
GSM (Global System for Mobile Communications), code rate=1/2 and constraint length=5 are
used.

6-bit soft-bit values from interleaved streams of soft bit data are loaded from memory. Internal
state values are stored in 8-bit signed elements of the vector register files. The operations
are designed to perform a forward pass through input soft bits, updating the states and
buffering branch select decisions. These 1-bit branch select decisions are packed and stored
to memory. After all branch select decisions have been stored, the maximal state is identified,
then a backwards traceback pass through the decision bits computes the hard-bit outputs.


The Viterbi operations on Fusion DSP are implemented based on a radix-4 architecture.
Each single-step radix-4 trellis butterfly is equivalent to four two-step radix-2 trellis
butterflies, as shown in Figure 2-2. N is the number of states in the convolutional code. For
constraint length K=5, N is 16 and for constraint length K=7, N is 64.

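[Figure 2-2: radix-4 trellis butterfly. Input states S4n, S4n+1, S4n+2, S4n+3; intermediate
states S2n, S2n+1, S2n+N/2, S2n+N/2+1; output states Sn, Sn+N/4, Sn+N/2, Sn+3*N/4.]

Figure 2-2 Radix-4 Trellis Butterfly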

AE_VTADDSUB3BX2S d0, 0..1 [fusion_slot64_0]


Branch metric calculation for two consecutive time instances. The branch metrics are used
by the add-compare-select instruction.
C syntax:
extern void AE_VTADDSUB3BX2S(ae_int32x2 a, immediate msb);

The AE_VTADDSUB3BX2S operation calculates partial branch metrics for two consecutive
time instances using 6-bit LLRs whose sign extension occupies 8 bits in memory. The partial
branch metrics are stored in the BMETRICS state register, which will be used by the add-
compare-select instruction.

For code rate R=1/3, the most significant 32 bits of the input register hold the three LLRs of
bits b0,b1,b2 for time instance k and the least significant 32 bits of the input register hold the
three LLRs for time instance k+1.

The following four partial branch metrics are calculated for each time instance in this
operation:

• MPP = -LLR(b2) +LLR(b1)+LLR(b0)

• MPM = -LLR(b2) +LLR(b1)-LLR(b0)

• MMP = -LLR(b2) -LLR(b1)+LLR(b0)

• MMM = -LLR(b2) -LLR(b1)-LLR(b0)


Following are the other four branch metrics for each time instance:

• PMM = -MPP

• PMP = -MPM

• PPM = -MMP

• PPP = -MMM

The above are not calculated in this operation, but derived in the add-compare-select
operation.

For code rate R=1/2, there are only two LLRs for each time instance; you can still use the
same operation, but LLR(b2) in the input register should be 0.
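The partial branch metric equations above can be written as a small portable C model. This is a sketch for checking values by hand, not the hardware data path: the struct and function names are hypothetical, and the results are kept in plain int so that no saturation or packing into BMETRICS is modeled.

```c
#include <stdint.h>

/* Partial branch metrics for one time instance (field names follow the text). */
typedef struct {
    int mpp, mpm, mmp, mmm;
} partial_bm;

/* Compute the four partial branch metrics from the three LLRs of one time
 * instance. For code rate 1/2, pass llr_b2 = 0. */
partial_bm partial_branch_metrics(int8_t llr_b0, int8_t llr_b1, int8_t llr_b2)
{
    partial_bm m;
    m.mpp = -llr_b2 + llr_b1 + llr_b0;
    m.mpm = -llr_b2 + llr_b1 - llr_b0;
    m.mmp = -llr_b2 - llr_b1 + llr_b0;
    m.mmm = -llr_b2 - llr_b1 - llr_b0;
    return m;
}
```

The remaining four metrics follow by negation, for example PMM = -mpp and PPP = -mmm.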

The operation assumes a 6-bit LLR input, which occupies 8 physical bits. The immediate
operand msb of this operation is used to select either the most significant 6 bits or the least
significant 6 bits from 8 bits. If the most significant 6 bits are selected, only four effective bits
of each LLR have been used. If the least significant 6 bits are selected, all six effective bits
of each LLR have been used.

AE_VTACSR4X4S_H d0,d1,a0,a1,0..3, 0..1 [fusion_slot64_1]


AE_VTACSR4X4S_L d0,d1,a0,a1,0..3, 0..1 [fusion_slot64_1]
Add-compare-select operation performs 16 state updates across two time instances.
C syntax:
extern void AE_VTACSR4X4S_H(ae_int32x2 st0 /*inout*/, ae_int32x2 st1
/*inout*/, unsigned bmsel0, unsigned bmsel1, immediate shfl, immediate
norm);
extern void AE_VTACSR4X4S_L(ae_int32x2 st0 /*inout*/, ae_int32x2 st1
/*inout*/, unsigned bmsel0, unsigned bmsel1, immediate shfl, immediate
norm);

The AE_VTACSR4X4S_H and AE_VTACSR4X4S_L operations each perform four radix-4 trellis
butterfly operations on 16 states across two consecutive time instances. The only difference
between the two operations is that AE_VTACSR4X4S_H outputs branch select decision bits to
the most significant 64 bits of state register BMETRICS and AE_VTACSR4X4S_L outputs
branch select decision bits to the least significant 64 bits of state register BMETRICS.

The branch metrics calculated by operation AE_VTADDSUB3BX2S will be used in the add-
compare-select operation (refer to the ISA HTML pages of these operations for a detailed
description, along with the pseudo-code). The branch metrics are stored in state register
BMETRICS. Before calling the add-compare-select operation to update state metrics in trellis
forward processing, first you need to build the branch metric index table. From the branch
metric index, the butterfly operation can find the branch metric.

By exploiting the branch symmetry of the radix-2 butterfly, we only need one branch metric
index for each radix-2 butterfly. For constraint length K=7, there are 64 states. We only store
the branch metric for the branch entering into states from 0 to 31. Each of those 32 states
has two input branches, which originate from two previous states. We only store the branch
metric index for the branch that is connected to the previous state whose state index is an
even number. As shown in Figure 2-2, the radix-2 butterfly is composed of states S4n, S4n+1
and S2n, S2n+N/2; we only need to store the branch metric index for the branch from S4n to
S2n.

Operation AE_VTADDSUB3BX2S calculates only partial branch metrics: four out of eight for
R=1/3 and two out of four for R=1/2. The four partial branch metrics for R=1/3 are MMM,
MMP, MPM and MPP, which are indexed as 0, 1, 2 and 3. The remaining branch metrics PPP,
PPM, PMP and PMM are indexed as 4, 5, 6 and 7. The two partial branch metrics for R=1/2
are MM and MP, which are indexed as 0 and 1. The remaining branch metrics PP and PM are
indexed as 4 and 5.
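Given this indexing, a branch metric can be recovered from the stored partial metrics with one lookup and an optional negation, because each index of 4 or above names the negation of the metric at index minus 4 (PPP = -MMM, PPM = -MMP, and so on). The following C sketch illustrates that relationship; the function name and the array layout are hypothetical, and applying the same negation rule to the R=1/2 case (PP = -MM, PM = -MP) is an assumption made by analogy with R=1/3.

```c
/* partial[] holds the partial metrics in index order:
 * partial[0]=MMM, partial[1]=MMP, partial[2]=MPM, partial[3]=MPP
 * (for R=1/2 only partial[0]=MM and partial[1]=MP are meaningful).
 * Indices 4..7 select the negated metric of index - 4. */
int branch_metric_from_index(const int partial[4], unsigned index)
{
    int value = partial[index & 3u];
    return (index & 4u) ? -value : value;
}
```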

Each add-compare-select operation updates 16 states for two consecutive time instances.
Each add-compare-select operation needs eight branch metric indices for the first time
instance and eight branch metric indices for the second time instance. Each branch metric
index is 3 bits, but occupies 4 physical bits. Four consecutive add-compare-select operations
are needed for the 64 states of constraint length K=7 and only one add-compare-select
operation is needed for the 16 states of constraint length K=5. The input states and branch
metric indices fed to add-compare-select should be ordered as follows:

4n, 4n+1, 4n+2, 4n+3, … (n= 0,1,4,5,8,9,12,13,2,3,6,7,10,11,14,15 for K=7)

4n, 4n+1, 4n+2, 4n+3, … (n= 0,1,2,3 for K=5)

The intermediate states output from the first stage of the radix-4 butterfly are in the
following sequence:

2n, 2n+N/2, 2n+1, 2n+1+N/2, … (n= 0,1,4,5,8,9,12,13,2,3,6,7,10,11,14,15 for K=7)

2n, 2n+N/2, 2n+1, 2n+1+N/2, … (n= 0,1,2,3 for K=5)

The intermediate states are not exposed by the operation, but the branch select decision bits
for all intermediate states are stored in the same sequence as the intermediate states.

The output states from the radix-4 butterfly before shuffling are in the following
sequence:

n, n+N/2, n+N/4, n+N/4+N/2, … (n= 0,1,4,5,8,9,12,13,2,3,6,7,10,11,14,15 for K=7)

n, n+N/2, n+N/4, n+N/4+N/2, … (n= 0,1,2,3 for K=5)

Normally we apply shuffling on output states combined with the input shuffling to reorder the
states into the right sequence as described above. The branch select decision bits are stored
in the same sequence as the output states before shuffling.
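The input and pre-shuffle output orderings listed above can be generated programmatically, which is convenient when building test vectors. The C sketch below is only an illustration of those index formulas: the function and array names are hypothetical, and it does not model the shuffling performed by the operation itself.

```c
#include <stddef.h>

/* Order of n used for K=7 (N=64), as listed in the text. */
const int n_order_k7[16] = {0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15};

/* Fill `in` with the input state order 4n, 4n+1, 4n+2, 4n+3 and `out` with the
 * pre-shuffle output state order n, n+N/2, n+N/4, n+N/4+N/2, where N is
 * num_states. `n_order` has num_states/4 entries; `in` and `out` each have
 * num_states entries. */
void acs_state_orders(const int *n_order, int num_states, int *in, int *out)
{
    for (int i = 0; i < num_states / 4; i++) {
        int n = n_order[i];
        in[4 * i + 0] = 4 * n;
        in[4 * i + 1] = 4 * n + 1;
        in[4 * i + 2] = 4 * n + 2;
        in[4 * i + 3] = 4 * n + 3;
        out[4 * i + 0] = n;
        out[4 * i + 1] = n + num_states / 2;
        out[4 * i + 2] = n + num_states / 4;
        out[4 * i + 3] = n + num_states / 4 + num_states / 2;
    }
}
```

For K=5, pass an n order of 0, 1, 2, 3 with num_states set to 16.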

The branch metric indices are packed into sixteen 16-bit words for K=7 and four 16-bit words
for K=5 as follows:

{ bmsel[n+N/4], bmsel[n], bmsel[2n+1], bmsel[2n] }

where n follows the same sequence as the states.

The least significant 8 bits (two branch metric indices) are used by the first stage of the radix-
4 butterfly operation and the most significant 8 bits are used by the second stage of the radix-
4 butterfly operation.
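The packing described above can be sketched in portable C as follows. The function name and parameter names are hypothetical, and the sketch assumes that the braces list the fields from most significant nibble to least significant nibble, which is consistent with the statement that the least significant 8 bits hold the two first-stage indices.

```c
#include <stdint.h>

/* Pack four branch metric indices (3 significant bits each, one per 4-bit
 * nibble) into a 16-bit word laid out as
 * { stage2_b, stage2_a, stage1_b, stage1_a }, with stage1_a in the least
 * significant nibble. In terms of the text: stage1_a = bmsel[2n],
 * stage1_b = bmsel[2n+1], stage2_a = bmsel[n], stage2_b = bmsel[n+N/4]. */
uint16_t pack_bmsel(unsigned stage2_b, unsigned stage2_a,
                    unsigned stage1_b, unsigned stage1_a)
{
    return (uint16_t)(((stage2_b & 7u) << 12) | ((stage2_a & 7u) << 8) |
                      ((stage1_b & 7u) << 4) | (stage1_a & 7u));
}
```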


Four consecutive 16-bit words of branch metric indices are needed for each
add-compare-select operation. The 32-bit input operand bmsel0 holds the first two 16-bit
words and the 32-bit input operand bmsel1 holds the next two.

For K=7, the output states after shuffling are in the following sequence:

n,n+1,n+4,n+5,n+N/4,n+1+N/4,n+4+N/4,n+5+N/4,

n+N/2,n+1+N/2,n+4+N/2,n+5+N/2,n+N/4+N/2,n+1+N/4+N/2,n+4+N/4+N/2,n+5+N/4+N/2

with n=0 for the first add-compare-select operation, n=8 for the second operation, n=2 for
the third operation and n=10 for the fourth operation.

Similarly, for K=5, the output states are:

n,n+1,n+4,n+5,n+N/4,n+1+N/4,n+4+N/4,n+5+N/4,

n+N/2,n+1+N/2,n+4+N/2,n+5+N/2,n+N/4+N/2,n+1+N/4+N/2,n+4+N/4+N/2,n+5+N/4+N/2

with n=0.

Input shuffling reorders the states output from the previous iteration into the input
order:

4n, 4n+1, 4n+2, 4n+3, … (n= 0,1,4,5,8,9,12,13,2,3,6,7,10,11,14,15 for K=7)

4n, 4n+1, 4n+2, 4n+3, … (n= 0,1,2,3 for K=5)

The least significant bit of the 2-bit input immediate operand shfl is used to enable input
shuffling and the most significant bit is used to enable output shuffling.

The immediate input norm is used to select and update the normalization enable flag. The
state register holds two normalization enable flags: the previous normalization flag
NORMALIZE_PREV and the current normalization flag NORMALIZE_CUR.

If norm is true, the effective normalization enable flag is set to the current normalization flag
NORMALIZE_CUR; otherwise it is set to the previous normalization flag
NORMALIZE_PREV. If norm is true, NORMALIZE_PREV is also set to NORMALIZE_CUR, and
NORMALIZE_CUR is recalculated by examining the most significant 3 bits, excluding the
sign bit, of the output states. If any state is positive and any bit in the field specified by state
register NORM_MASK is 1, the current normalization flag is set to 1; otherwise it is set
to 0. The main purpose of the immediate input norm is to make sure that all states processed
by multiple add-compare-select operations are normalized in the same way.

If the effective normalization enable flag is 1, the value in state register NORM_CONST is
subtracted from all output states. The user state registers NORM_MASK and NORM_CONST
are initialized once before trellis processing.


AE_S64_DECBITS.H.IP a0,-64..56 [fusion_slot64_0]


AE_S64_DECBITS.L.IP a0,-64..56 [fusion_slot64_0]
Store either the most significant or least significant 64 bits of decision bits to memory.
C syntax:
extern void AE_S64_DECBITS_H_IP(ae_int64 * a /*inout*/, immediate off);
extern void AE_S64_DECBITS_L_IP(ae_int64 * a /*inout*/, immediate off);
AE_VTTB2X64 d,d0,d1,a0,0..3 [fusion_slot64_1]
Backward traceback operation generates two hard-decision bits packed into the least
significant 2 bits of the output register.
C syntax:
extern void AE_VTTB2X64(ae_int64 a /*inout*/, ae_int64 b1, ae_int64 b0,
unsigned idx /*inout*/, immediate shfl);

Each add-compare-select operation updates 16 states. For constraint length K=7, which
has 64 states, four add-compare-select operations are needed; each operation takes 16 states
as input and updates 16 states as output. The 128 decision bits of branch metric selection
are stored to memory. The sequence of operations for every two time instances is
summarized below:

AE_VTACSR4X4S_H
AE_VTACSR4X4S_H
AE_VTACSR4X4S_L
AE_VTACSR4X4S_L
AE_S64_DECBITS_H_IP
AE_S64_DECBITS_L_IP

The first add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=0,1,4,5 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=0,1,4,5 at time instance k+2 as output.

The second add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=8,9,12,13 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=8,9,12,13 at time instance k+2 as output.

The third add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=2,3,6,7 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=2,3,6,7 at time instance k+2 as output.

The fourth add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=10,11,14,15 at time instance k as input and update states: S(n), S(n+32), S(n+16), S(n+48)
n=10,11,14,15 at time instance k+2 as output.

The states input to add-compare-select operations should always be in sequential order, but
the output states of add-compare-select operations that will be used as input for the next
iteration are not in sequential order. We need to shuffle both the input states and output
states in add-compare-select operations to make sure the state input to each radix-4 butterfly
is always S(4n), S(4n+1), S(4n+2), S(4n+3).

For constraint length K=5, only the following sequence of operations is needed:

• AE_VTACSR4X4S_H

• AE_S64_DECBITS_H_IP

The add-compare-select operation will take states: S(4n), S(4n+1), S(4n+2), S(4n+3)
n=0,1,2,3 at time instance k as input and update states: S(n), S(n+8), S(n+4), S(n+12)
n=0,1,2,3 at time instance k+2 as output (that is, S(n), S(n+N/2), S(n+N/4), S(n+N/4+N/2)
with N=16). The shuffling function in the add-compare-select operation should also be enabled.

For trellis forward processing, the state metrics are maintained in 8-bit signed vector
elements. For each iteration of a trellis loop, if any output state metric is large enough (the
state metric is positive and any bit selected by the 3-bit state register NORM_MASK within
the most significant 3 bits, excluding the sign bit, is nonzero), the normalization flag is set
and state normalization is done in the next iteration by subtracting a constant value,
specified by the 8-bit state register NORM_CONST, from the output state metrics.
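The normalization rule described above can be modeled in portable C as shown below. This is a behavioral sketch under stated assumptions, not the hardware definition: the function names are hypothetical, bit 2 of the mask is assumed to correspond to metric bit 6 (the bit just below the sign bit), and the subtraction is clamped to the signed 8-bit range, which is also an assumption.

```c
#include <stdint.h>

/* Returns 1 if any metric is positive and has a masked bit set among
 * bits 6..4 (the three bits just below the sign bit), otherwise 0.
 * `mask` is a 3-bit value; mask bit 2 maps to metric bit 6 (assumed). */
int needs_normalization(const int8_t *metrics, int count, unsigned mask)
{
    for (int i = 0; i < count; i++) {
        if (metrics[i] > 0 && (((unsigned)metrics[i] >> 4) & mask & 7u) != 0)
            return 1;
    }
    return 0;
}

/* Subtract `norm_const` from every metric, clamping to the int8_t range
 * (the clamping is an assumption of this sketch). */
void normalize_metrics(int8_t *metrics, int count, int8_t norm_const)
{
    for (int i = 0; i < count; i++) {
        int v = metrics[i] - norm_const;
        metrics[i] = (int8_t)(v < -128 ? -128 : (v > 127 ? 127 : v));
    }
}
```

In a trellis loop modeled this way, the flag computed from the outputs of one iteration would decide whether normalize_metrics is applied in the next iteration.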

At the end of the Viterbi computation on the input data streams, before traceback, the
maximal metric must be identified. The 8-bit state metrics are sign extended to 16-bit state
metrics by operation AE_UNPKS8X16, after which the index of the maximum state metric
can be found. The traceback starts from the maximal state. If the convolutional encoder is
terminated to state 0, the maximal state can be forced to state 0 and traceback started from
state 0.
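The search for the maximal state can be expressed as a simple scalar reference loop, for example to validate a vectorized implementation. The function name is hypothetical, and resolving ties to the lowest index is a choice made for this sketch only.

```c
#include <stdint.h>

/* Return the index of the largest state metric. Each 8-bit metric is sign
 * extended to 16 bits before comparison, mirroring the description above.
 * Ties resolve to the lowest index (a choice of this sketch). */
int max_state_index(const int8_t *metrics, int count)
{
    int best = 0;
    for (int i = 1; i < count; i++) {
        int16_t cur = (int16_t)metrics[i];
        if (cur > (int16_t)metrics[best])
            best = i;
    }
    return best;
}
```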

After the forward trellis processing, the backward traceback operation AE_VTTB2X64
collects two traceback bits per cycle.

The Fusion_Viterbi_Decoder example demonstrates the use of the Viterbi operations with
LTE and WiFi standard rates and polynomials.

2.26 Optional Soft-bit Demapping Operations


This option supports up to 256 QAM soft-bit demapping.

The soft-bit demapping operations are used to convert soft-symbol estimates, outputs of an
equalizer, into soft-bit estimates, or log-likelihood ratios (LLRs), later to be processed by a
soft channel decoder for error correction and detection. The soft-bit demapper typically sits
at the interface between complex and soft-bit domains.

The soft-bit demapper accepts as inputs complex-valued soft-symbol estimates $x$ in addition
to a scaling factor. Given these inputs, for each bit $b_i$, it calculates the log-likelihood ratio as
follows:

$$\mathrm{LLR}(b_i) = \ln \frac{P(b_i = 1 \mid x)}{P(b_i = 0 \mid x)}$$


This is according to the mapping of bits to a constellation S. The LLR calculation uses a Max-
Log approximation and assumes an unbiased symbol estimate with zero-mean additive white
Gaussian noise (AWGN), i.e. x=s+w, where s belongs to S and w is AWGN.

Therefore, the SDMAP output is given by:

$$\mathrm{LLR}_{\mathrm{approx}}(b_i) = (\mathrm{sign}) \times (\text{scaling factor}) \times \left( \min_{s_1 \in S \mid b_i = 1} |x - s_1|^2 - \min_{s_0 \in S \mid b_i = 0} |x - s_0|^2 \right)$$

The scaling factor is used to account for the signal-to-noise ratio and any other desired
weighting adjustments. You can negate the LLR values with an additional sign option.
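The Max-Log expression above can be evaluated directly with a brute-force search over the constellation, which is useful as a floating point reference when testing fixed-point results. The C sketch below is such a reference: the type and function names are hypothetical, it uses double precision rather than the Q5.10 and 8-bit formats of the operations, and the bit labels are supplied by the caller rather than taken from the 3GPP or IEEE Gray mappings.

```c
#include <float.h>

typedef struct { double re, im; } cplx;

/* Max-Log LLR of bit number `bit` for the received sample x.
 * points[k] is the k-th constellation point and labels[k] its bit label.
 * Evaluates sign * scale * (min over b=1 of |x-s|^2 - min over b=0 of |x-s|^2).
 * `sign` is +1.0 or -1.0. Both bit values must occur among the labels. */
double maxlog_llr(cplx x, const cplx *points, const unsigned *labels,
                  int num_points, unsigned bit, double sign, double scale)
{
    double min1 = DBL_MAX, min0 = DBL_MAX;
    for (int k = 0; k < num_points; k++) {
        double dr = x.re - points[k].re;
        double di = x.im - points[k].im;
        double d2 = dr * dr + di * di;
        if ((labels[k] >> bit) & 1u) {
            if (d2 < min1) min1 = d2;
        } else {
            if (d2 < min0) min0 = d2;
        }
    }
    return sign * scale * (min1 - min0);
}
```

For a QPSK-like constellation with points at (+/-1, +/-1), a sample at (0.5, -0.25) gives a distance difference of 4 times the real part for a bit that is 1 on the negative real half-plane.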

Table 2-23 Set of Symbol Constellations Supported

Standard             Supported Constellations         Gray Encoding   Output to Soft Decoding
3GPP                 QPSK, 16-QAM, 64-QAM             3GPP            Turbo
WiFi (IEEE 802.11)   QPSK, 16-QAM, 64-QAM, 256-QAM    IEEE            Convolutional or LDPC

Supported constellations and mappings are summarized in Table 2-23. Symbol mappings
for 3GPP and WiFi use different Gray Encoding formats, both supported by the soft-bit
demapper operations.

The Fusion DSP implementation covers cases of 4/16/64/256-QAM soft-demodulation. Each
soft-demapping operation can only output four soft-bits. QPSK needs one operation to output
4 soft-bits for two complex inputs, 16-QAM needs two operations to output 8 soft-bits for two
complex inputs, 64-QAM needs three operations to output 12 soft-bits for two complex inputs,
and 256-QAM needs four operations to output 16 soft-bits for two complex inputs. The
instruction set provides one operation for QPSK, two operations for 16-QAM, three operations
for 64-QAM, and two operations for 256-QAM. These operations are summarized below:

AE_SDMAPQPSK2X16C vt,vs,vr, 0..1 [fusion_slot64_1]


The QPSK soft-demapping operation uses two complex inputs from input vector register vr,
where each complex input consists of a 16-bit real and 16-bit imaginary part, computes a
soft bit for each of the bits used to generate the constellation, scales them using an exponent
and mantissa from the vector register vs, and writes the resulting 4 soft-bits (4 bytes) into the
most significant 32 bits of output vector register vt.
C syntax:
extern ae_int16x4 AE_SDMAPQPSK2X16C(ae_int16x4 exp_mant, ae_int16x4 cx,
immediate negate);


AE_SDMAP16QAM1X16C_H vt,vs,vr, 0..1, 0..1 [fusion_slot64_1]


AE_SDMAP16QAM1X16C_L vt,vs,vr, 0..1, 0..1 [fusion_slot64_1]
The 16-QAM soft-demapping operation selects one complex input from the most significant or
least significant 32 bits of the input vector register vr, where each complex input consists of
a 16-bit real and 16-bit imaginary part, computes a soft bit for each of the bits used to
generate the constellation, scales them using an exponent and mantissa from the vector
register vs, and writes the resulting 4 soft-bits (4 bytes) into the most significant or least
significant 32 bits of output vector register vt. AE_SDMAP16QAM1X16C_H will use the
complex input in the most significant 32 bits of the input vector register and write the output
to the most significant 32 bits of output vector register. AE_SDMAP16QAM1X16C_L will use
the complex input in the least significant 32 bits of the input vector register and write the
output to the least significant 32 bits of the output vector register.
C syntax:
extern void AE_SDMAP16QAM1X16C_H(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate);
extern void AE_SDMAP16QAM1X16C_L(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate);
AE_SDMAP64QAM1X16C_H vt,vs,vr, 0..1, 0..1,0..1 [fusion_slot64_1]
AE_SDMAP64QAM1X16C_HL vt,vs,vr, 0..1, 0..1,0..1 [fusion_slot64_1]
AE_SDMAP64QAM1X16C_L vt,vs,vr, 0..1, 0..1,0..1,0..1 [fusion_slot64_1]
64-QAM soft-demapping has three different operations. The AE_SDMAP64QAM1X16C_H
operation selects one complex input from the most significant 32 bits of the input vector
register vr. The AE_SDMAP64QAM1X16C_L operation selects one complex input from the
least significant 32 bits of the input vector register vr, and the AE_SDMAP64QAM1X16C_HL
operation uses both complex inputs. For two complex inputs, we need to output 6 soft-bits
for the first complex input denoted by b00, b01, b02, b03, b04, and b05. We also need to
output 6 soft-bits for the second complex input denoted by b10, b11, b12, b13, b14, and b15.
The sequence of operations needed for two complex inputs is as follows:
AE_SDMAP64QAM1X16C_H (output b03,b02,b01,b00)
AE_SDMAP64QAM1X16C_HL (output b11,b10,b05,b04)
AE_SDMAP64QAM1X16C_L (output b15,b14,b13,b12)
C syntax:
extern void AE_SDMAP64QAM1X16C_H(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate, immediate
out_high, immediate seq);
extern void AE_SDMAP64QAM1X16C_HL(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate, immediate
out_high);
extern void AE_SDMAP64QAM1X16C_L(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate, immediate
out_high, immediate seq);


AE_SDMAP256QAM1X16C_H vt,vs,vr, 0..1, 0..1,0..1 [fusion_slot64_1]


AE_SDMAP256QAM1X16C_L vt,vs,vr, 0..1, 0..1,0..1 [fusion_slot64_1]

The 256-QAM soft-demapping operation selects one complex input from the most significant
or least significant 32 bits of the input vector register vr, where each complex input consists
of a 16-bit real and 16-bit imaginary part, computes soft bits for either the higher four or the
lower four of the eight bits used to generate the constellation, scales them using an exponent
and mantissa from the vector register vs, and writes the resulting 4 soft-bits (4 bytes) into the
most significant or least significant 32 bits of output vector register vt.

C syntax:
extern void AE_SDMAP256QAM1X16C_H(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate, immediate
out_high);
extern void AE_SDMAP256QAM1X16C_L(ae_int16x4 llr /*inout*/, ae_int16x4
exp_mant, ae_int16x4 cx, immediate intlv, immediate negate, immediate
out_high);

Following are inputs to Fusion DSP soft-bit demap operations:

• Complex constellation points

  - Assumed Q5.10 (16-bit resolution), not normalized

• Scale factors per point: 4-bit mantissa and 4-bit exponent in paired vector elements

  - Used for SNR and channel weighting adjustments

• Different immediate operands to select between various modes

  - Pick upper or lower half of input complex vector to operate on

  - Optionally negate soft-bit LLRs

  - Optionally interleave output soft-bit LLRs for real/imaginary parts (IEEE vs. 3GPP
    standard)

And, for the outputs of the Fusion DSP soft-bit demap operations:

• All operations only output 4 soft-bits each cycle, scaled by the scaling factors, with
  rounding and saturation to 8-bit integer resolution at output.

Scaling before the soft demapper is needed to place the inputs onto an integer grid (the
assumed hardware implementation uses Q5.10 format). Scaling after the soft-demodulation is
optionally applied by the operations.
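As a sketch of the number formats mentioned above, the following portable C helpers convert a real value to Q5.10 and saturate a wider integer to the signed 8-bit soft-bit range. The function names are hypothetical, and the rounding mode (round half away from zero) is an assumption of this sketch; the operations' actual rounding and mantissa/exponent scaling are described in the ISA HTML pages and are not modeled here.

```c
#include <stdint.h>

/* Convert a real value to Q5.10 (1 sign bit, 5 integer bits, 10 fractional
 * bits), rounding half away from zero and saturating to the 16-bit range. */
int16_t to_q5_10(double v)
{
    double scaled = v * 1024.0;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)scaled;   /* truncation toward zero completes the rounding */
}

/* Saturate a wider integer to the signed 8-bit soft-bit range. */
int8_t saturate_int8(int32_t v)
{
    if (v > 127) return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}
```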

The Fusion_Soft_Demapper example demonstrates the use of the soft-demapper operations
for the LTE and WiFi standard modulation schemes.


3. Programming the Fusion DSP


Cadence recommends that you read and become familiar with the following two Xtensa
manuals before attempting to obtain optimal results by programming Fusion DSP:

• Xtensa C Application Programmer’s Guide

• Xtensa C and C++ Compiler User’s Guide

Note that this chapter does not attempt to duplicate material in either of these guides.

Fusion DSP offers two MACs per cycle for 24x24-bit, 32x16-bit, and 16x16-bit audio and
voice data and one MAC per cycle for 32x32-bit operations. It offers equivalent support for
both integer and fractional arithmetic. The C and C++ languages support integer arithmetic
on 32x32-bit or 16x16-bit data. Therefore, while standard applications can effectively utilize
Fusion DSP’s resources, applications that require fractional arithmetic or applications that
require 24-bit or 32x16-bit multiplication must be modified to express those semantics. These
modifications can be as simple as declaring variables of the appropriate custom data types
and then relying on built-in operator overloading, or they can involve using explicit intrinsics
to express the exact operations desired. For 16-bit applications, the ITU-T/ETSI intrinsics are
fully supported.

In essentially no case is it required to resort to assembly. All the Fusion DSP instructions can
be accessed from C/C++ level intrinsics. The XCC compiler will efficiently register allocate
Fusion DSP variables and schedule Fusion DSP instructions, relieving the programmer from
the hardest aspects of writing in assembly.

Fusion DSP is also a 2/4-way SIMD (Single Instruction/Multiple Data) architecture.
Applications that do not take advantage of SIMD may run slower than applications that do.
Because Fusion DSP supports only one 32-bit multiplier or one 32-bit floating point unit,
applications dominated by 32x32-bit multiplies or floating point operations will see a more
limited benefit from SIMD. For 16 or 32-bit integer applications
and for applications written using the ITU-T/ETSI intrinsics, the compiler is able to
automatically vectorize code to take advantage of the SIMD architecture. Even so, it is typical
for programmers to do some work to fully exploit the available performance. It may only
require recognizing that an existing implementation of an application is already in essentially
the right form for vectorization, or it may require completely reordering the algorithm’s
computations to bring together those that can be done in parallel.

For 24-bit and 32x16-bit applications, the compiler does not automatically vectorize. The
application writer must write the code using explicit vector data types or intrinsics.

This chapter describes multiple approaches to programming Fusion DSP and illustrates them
with some simple examples. The next chapter goes into more detail with more complicated
examples.


To use the Fusion DSP data types and instruction intrinsics, you must appropriately include
the following:

#include <xtensa/tie/xt_fusion.h>

in the C or C++ source code before referring to any of the data types or intrinsics. Optionally,
for HiFi 2 or HiFi 3 code, you can include xt_hifi2.h or xt_hifi3.h, which makes it
easier to reuse existing HiFi applications.

For floating point intrinsics using the optional floating point unit, you must appropriately
include the following:

#include <xtensa/tie/xt_FP.h>

3.1 Data Types


Several C data types are provided by the Fusion DSP to facilitate programming the Fusion
DSP in C and C++ using instruction intrinsics and operator overloading.

The intrinsic prototype for each Fusion DSP operation is described in Chapter 2.

Fusion DSP supports 16-, 24-, 32- and 64-bit types. All types come in both integer and
fractional versions. For intrinsic programmers using 16-, 32- and 64-bit types, the two types
can usually be used interchangeably. A variable of an integer type can be assigned to a
fractional variable, and vice-versa, without changing the bit pattern in registers or memory. It
is up to the programmer to use the appropriate intrinsic to achieve the desired computation.
However, for programmers using operator overloading, the fractional and integer types map
to different instructions. In particular, fractional types use fractional multiplies and saturating
arithmetic, while integer types use integer multiplies and non-saturating arithmetic. 24-bit
fractional and integer types have an additional difference. 24-bit integer types are stored in
memory in the low 24 bits of a 32-bit word, equivalent to the storage representation for 32-
bit integers. 24-bit fractional types are stored in memory in the high 24 bits of a 32-bit word,
equivalent to a 1.31-bit representation, with the low-precision bits all set to 0.
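The two 24-bit memory layouts described above can be illustrated with a few portable C helpers that operate on ordinary 32-bit words. These helpers are hypothetical and exist only to show where the 24 significant bits sit in each format; real code would use the Fusion DSP types and their load/store intrinsics instead.

```c
#include <stdint.h>

/* Fractional layout: place a signed 24-bit value in the high 24 bits of a
 * 32-bit word, with the low 8 bits set to 0 (a 1.31-style representation). */
int32_t frac24_to_mem_word(int32_t v24)
{
    return (int32_t)((uint32_t)v24 << 8);
}

/* Recover the signed 24-bit value from the fractional layout. */
int32_t mem_word_to_frac24(int32_t word)
{
    return (word - (word & 0xFF)) / 256;
}

/* Integer layout: the value sits in the low 24 bits of the 32-bit word.
 * This helper sign extends those 24 bits to a full 32-bit integer. */
int32_t sign_extend24(uint32_t word)
{
    int32_t v = (int32_t)(word & 0xFFFFFFu);
    if (v & 0x800000)
        v -= 0x1000000;
    return v;
}
```

For example, the 24-bit fractional value 0x400000 (0.5) is stored as 0x40000000, whereas the 24-bit integer 0x400000 is stored unchanged in the low bits of the word.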

All types (other than the 64-bit types) come in both scalar and vector versions. In general,
computation happens on vector variables. Scalar variables are stored in the low parts of
registers. The high parts are undefined. Assigning a scalar variable to a variable of the
equivalent vector type will replicate the element in the lowest bit-position into all the elements
of the vector. Assigning a vector to a scalar will not change the bit pattern in the register.

Assigning a low precision variable to a high precision variable in general sign extends the
variable for signed types and zero extends for unsigned types. Assigning a high precision
variable to a low precision variable discards the upper bits for integer types and discards the
lower bits for fractional types.
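The narrowing rules above can be modeled on plain C integers as follows. The helper names are hypothetical, and the sketch covers only the 32-bit to 16-bit case; with the Fusion DSP types these conversions happen implicitly on assignment.

```c
#include <stdint.h>

/* Integer narrowing 32 -> 16: the upper 16 bits are discarded. */
int16_t narrow_int32_to_int16(int32_t v)
{
    int32_t low = (int32_t)((uint32_t)v & 0xFFFFu);
    return (int16_t)(low >= 0x8000 ? low - 0x10000 : low);
}

/* Fractional narrowing 32 -> 16: the lower 16 bits are discarded, so the
 * result keeps the most significant part of the fraction. */
int16_t narrow_frac32_to_frac16(int32_t v)
{
    return (int16_t)((v - (v & 0xFFFF)) / 65536);
}

/* Widening a signed 16-bit integer to 32 bits sign extends it. */
int32_t widen_int16_to_int32(int16_t v)
{
    return (int32_t)v;
}
```

Applied to the bit pattern 0x12345678, integer narrowing keeps 0x5678 while fractional narrowing keeps 0x1234.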

Conversions can also be implicitly applied to intrinsic invocations. For example, just like
assigning a scalar variable to a vector variable replicates the lowest element in the register,
a scalar variable assigned to an intrinsic expecting an input vector argument will first be
implicitly replicated.


With the floating point option, Fusion DSP supports a 2-way SIMD, single precision floating
point type xtfloatx2. This type can be converted to and from ae_int32x2 using the standard
C floating point to integer conversions.

All the legacy HiFi 2 types are supported so that HiFi 2 code can work out-of-the-box. They
should only be used on HiFi 2 code but can be freely intermixed when porting HiFi 2 code to
Fusion DSP. Note that for compatibility with HiFi 2, assigning variables of vector types to
variables of type ae_p24s or ae_p24f does not replicate the elements and instead leaves the
bit patterns unchanged.

Table 3-1 contains a complete list of the Fusion DSP data types with a brief description of
each.

Table 3-1 Fusion DSP C Types

Type Description
ae_int32x2 64-bit type containing two 32-bit integer elements. The memory
format for this type is two elements stored in adjacent 32-bit
words. In memory, this type is eight-byte aligned.
ae_f32x2 64-bit type containing two 32-bit fractional elements. The
memory format for this type is two elements stored in adjacent
32-bit words. In memory, this type is eight-byte aligned.
ae_int24x2 48-bit type containing two 24-bit integer elements. The memory
format for this type is two elements, each stored in the least
significant 24 bits of adjacent 32-bit words. In memory, this
type is eight-byte aligned. This type is loaded and stored in a
way that is equivalent to loading and storing the ae_int32x2
type.
ae_f24x2 48-bit type containing two 24-bit fractional elements. The
memory format for this type is two elements, each stored in the
most significant 24 bits of adjacent 32-bit words making it
equivalent to a 1.31-bit representation. In registers, this
occupies the lower 24 bits of each 32-bit half of a register,
allowing for extra guard bits of precision.
ae_int16x4 64-bit type containing four 16-bit integer elements. This type
normally represents the 64-bit contents of a AE_DR register
when the register entry holds four data elements. The memory
format for this type is four elements stored in adjacent 16-bit
words. In memory, this type is eight-byte aligned.
ae_f16x4 64-bit type containing four 16-bit fractional elements. The
memory format for this type is four elements stored in adjacent
16-bit words. In memory, this type is eight-byte aligned.
ae_int32 32-bit type consisting of a single integer element stored in
memory. When this type is converted to an ae_int32x2 type in
an AE_DR register, the data is replicated into the two 32-bit
register elements.
ae_f32 32-bit type consisting of a single fractional element stored in
memory. When this type is converted to an ae_f32x2 type in an
AE_DR register, the data is replicated into the two 32-bit
register elements.

ae_int24 24-bit type containing a single integer element stored in the
least significant 24 bits of a 32-bit word. In memory, this type is
four-byte aligned. This type is loaded and stored in a way that
is equivalent to loading and storing the ae_int32 type.
ae_f24 24-bit type containing a single 24-bit fractional element. The
memory format for this is an element stored in the most
significant 24 bits of a 32-bit word making it equivalent to a
1.31-bit representation. In registers, this occupies the lower 24
bits of each 32-bit half of a register, allowing for extra guard
bits of precision.
ae_int16 16-bit type consisting of a single integer element stored in
memory. When this type is converted to an ae_int16x4 type in
an AE_DR register, the data is replicated into the four 16-bit
register elements.
ae_f16 16-bit type consisting of a single fractional element stored in
memory. When this type is converted to an ae_f16x4 type in an
AE_DR register, the data is replicated into the four 16-bit
register elements.
ae_int64 64-bit type representing the contents of an AE_DR register
when the register entry holds a single integer element.
ae_f64 64-bit type representing the contents of an AE_DR register
when the register entry holds a single fractional element.
ae_int32x4 128-bit type containing four 32-bit integer elements. This is a
composite type containing two ae_int32x2 types. Its main use
is to support operator overloading for 32x16-bit multiplication.
ae_f32x4 128-bit type containing four 32-bit fractional elements. This is a
composite type containing two ae_f32x2 types. Its main use is
to support operator overloading for 32x16-bit multiplication.
HiFi-2 Compatibility Types
ae_p16x2s This type ensures HiFi 2 target code compatibility. 32-bit type
containing two 16-bit elements. This type lives only in memory,
and represents two elements in a 1.15 format. It can be
automatically converted into an ae_p24x2s object, in which
case the low 8 bits of each resulting element are zero and the
upper 8 bits are sign-extended.
ae_p24x2s This type ensures HiFi 2 target code compatibility. 48-bit type
containing two 24-bit elements. The memory format for this
type is two elements, each stored in the least significant 24 bits
of adjacent 32-bit words. In memory, this type is eight-byte
aligned. In Fusion DSP, this type is loaded and stored in a way
that is equivalent to loading and storing the ae_p32x2s type.
ae_p24x2f This type ensures HiFi 2 target code compatibility. This type
occupies 64 bits in memory, but should be thought of as a 48-
bit type containing two 24-bit fractional elements. This type
exists only in memory, and represents two elements in 1.31
format; the low eight bits of each of the elements are ignored. It
can be automatically converted into an ae_p24x2s object, in
which case the low eight bits of each element are discarded: the
1.31-bit value in memory is converted to a 9.23-bit value in the
register.
ae_p16s This type ensures HiFi 2 target code compatibility. 16-bit type
consisting of a single element stored in memory. This type can
be automatically converted into an ae_p24x2s. In such a
conversion, the ae_p16s object's bits are padded with zeros
and duplicated to form the two 24-bit elements of the resulting
ae_p24x2s object. In Fusion DSP, each 24-bit element is sign
extended to 32-bits.
ae_p24s This type ensures HiFi 2 target code compatibility. It is a 24-bit
type consisting of a single element stored in the low 24 bits of a
32-bit memory word. This type exists only in memory and can
be automatically converted into an ae_p24x2s object. In such a
conversion, the ae_p24s object’s bits are duplicated to form the
two 24-bit elements of the resulting ae_p24x2s object. In
Fusion DSP, this type is loaded and stored in a way that is
equivalent to loading and storing the ae_p32s type.
ae_p24f This type ensures HiFi 2 target code compatibility. It is a 24-bit
type consisting of a single element stored in the high 24 bits of
a 32-bit memory word. This type exists only in memory and can
be automatically converted into an ae_p24x2s object. In such a
conversion, the ae_p24f object’s bits are duplicated to form the
two 24-bit elements of the resulting ae_p24x2s object. In
Fusion DSP, the 1.31-bit value in memory is converted to a
9.23-bit value in register.
ae_q56s This type ensures HiFi 2 target code compatibility. It is a 56-bit
type representing the contents of an AE_DR register. The
memory format for this type has the bits of the ae_q56s object
stored in the low 56 bits of a 64-bit double word. In Fusion
DSP, this type is loaded and stored in a way that is equivalent
to loading and storing the ae_int64 type.
ae_q32s This type ensures HiFi 2 target code compatibility. It is a 32-bit
type representing a value in memory that will be padded with
16 zeros at the low end and sign extended by eight bits at the
high end to form a 56-bit value when converted to an ae_q56s
object (i.e., when loaded into an AE_DR register). In Fusion
DSP, the 1.31-bit value in memory is converted to a 17.47-bit
value in register.
xtfloatx2 For configurations with the optional SIMD IEEE floating point
unit, a type containing two 32-bit IEEE floating point values.
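
The memory-to-register conversions described for the HiFi 2 compatibility types ae_p24f and
ae_q32s in Table 3-1 amount to simple rescaling of a 1.31 value. The following portable C
sketch models that arithmetic. The function names are hypothetical, and the code does not
use or reproduce any Fusion DSP or HiFi 2 intrinsic.

```c
#include <stdint.h>

/* Hypothetical helpers; they model the arithmetic described in Table 3-1 and
 * are not Fusion DSP or HiFi 2 intrinsics. */

/* Arithmetic (sign-preserving) right shift that avoids relying on the
 * implementation-defined behavior of >> on negative values. */
static int32_t asr32(int32_t x, int n)
{
    return (x < 0) ? ~(~x >> n) : (x >> n);
}

/* ae_p24f-style conversion: 1.31 in memory -> 9.23 in register.
 * The low eight bits are discarded and the result is sign extended. */
static int32_t q1_31_to_q9_23(int32_t mem) { return asr32(mem, 8); }

/* ae_q32s-style conversion: 1.31 in memory -> 17.47 in register.
 * Sixteen zero bits are appended at the low end; multiplying (rather than
 * shifting) keeps the operation well defined for negative inputs. */
static int64_t q1_31_to_q17_47(int32_t mem) { return (int64_t)mem * 65536; }
```

For example, 0x40000000 (0.5 in 1.31 format) becomes 0x400000 in 9.23 format and 2^46 in
17.47 format, both of which still represent 0.5.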


3.1.1 Example Memory Types



The following examples demonstrate how to efficiently load, store, and convert various data
types in C using Fusion DSP. The examples do not enumerate all possible conversions
between core C and Fusion DSP types. Generally, conversion between register (local)
variables and data in memory (arrays, struct fields, etc.) should be done through pointer
typecasting, while conversion between register variables should be done through direct use
of the appropriate Fusion DSP conversion intrinsics.

• Take a 32-bit value and replicate it as two 32-bit elements in AE_DR.

    int mem32 = …;
    ae_int32x2 p = mem32;

• Load two 32-bit values into AE_DR. &mem32[i] must be 64-bit aligned.

    int *mem32 = …;
    ae_int32x2 p = *((ae_int32x2 *) &mem32[i]);

• Move two 32-bit values in AR to the two 32-bit elements in AE_DR.

    int ah = …;
    int al = …;
    ae_int32x2 p = AE_MOVDA32X2(ah, al);

• Convert and sign-extend a 32-bit (1.31) fraction in AR to a 9.55-bit value in AE_DR.

    int a = …;
    ae_int64 q = AE_CVTQ56A32S(a);

• Convert and sign-extend the low (L) 1.31-bit fraction in AE_DR to a 9.55-bit value in
  AE_DR.

    ae_int32x2 p = …;
    ae_f64 q = AE_CVTQ56P32S_L(p);

• Saturate and truncate two 9.55-bit values in AE_DR to the two 1.31-bit fraction
  elements of AE_DR.

    ae_int64 qh = …;
    ae_int64 ql = …;
    ae_int32x2 p = AE_TRUNCI32X2F64S(qh, ql, 8);

• Saturate two 9.23-bit values in AE_DR into two 1.23-bit fraction elements in AE_DR.
  This allows the resultant values to be safely used in future 24-bit multiply instructions.

    ae_f32x2 d = …;
    ae_f24x2 p = AE_SAT24S(d);
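
The arithmetic behind the 1.31 and 9.55 conversions in the examples above can be checked on
a host machine with a portable C model. The two functions below are hypothetical helpers that
model only the numeric effect (rescaling by 24 bits, with saturation on the way back). They
are not the AE_CVTQ56A32S or AE_TRUNCI32X2F64S intrinsics, whose exact definitions are given
in the ISA reference.

```c
#include <stdint.h>

/* Hypothetical helpers that model the arithmetic only; they are not the
 * AE_CVTQ56A32S or AE_TRUNCI32X2F64S intrinsics. */

/* 1.31 -> 9.55: sign extend and append 24 fractional bits. */
static int64_t q1_31_to_q9_55(int32_t a)
{
    return (int64_t)a * ((int64_t)1 << 24);
}

/* 9.55 -> 1.31: drop the low 24 fractional bits, then saturate to 32 bits. */
static int32_t q9_55_to_q1_31_sat(int64_t q)
{
    /* Sign-preserving shift written to avoid >> on a negative value. */
    int64_t r = (q < 0) ? ~(~q >> 24) : (q >> 24);
    if (r > INT32_MAX) return INT32_MAX;
    if (r < INT32_MIN) return INT32_MIN;
    return (int32_t)r;
}
```

In this model a value of 1.0 or more in 9.55 format, which has no 1.31 representation,
saturates to the largest 1.31 value rather than wrapping.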


Changing Types

Sometimes it is necessary to treat a variable as one type for one computation and another
for a follow-on computation. For example, one might want to do a fractional multiply on a 24-
bit variable that is stored in memory in the low 24-bits rather than the high 24-bits of a word.
For such uses Fusion DSP supports conversion protos that do not change the bit-
representation of a variable.

They are all of the form AE_MOV<DEST_TYPE>_FROM<SRC_TYPE>. The following example
shows how to coerce an ae_f64 variable into an ae_int24x2.

ae_f64 d = …;
ae_int24x2 p = AE_MOVINT24X2_FROMF64(d);
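
The effect of changing the type without changing the bits can be modelled in portable C. In
this sketch the two wrapper structs and all three functions are hypothetical. They stand in
for an integer view and a fractional view of the same 24-bit pattern and show why the choice
of type matters for a later multiply. Rounding and saturation are left out.

```c
#include <stdint.h>

/* Hypothetical wrapper types standing in for two views of the same 32-bit
 * register element; they are not the real ae_int24 / ae_f24 types. */
typedef struct { int32_t bits; } model_int24;
typedef struct { int32_t bits; } model_f24;

/* Type change in the spirit of AE_MOV<dest>_FROM<src>: the bits are copied
 * unchanged; only the static type differs. */
static model_f24 model_movf24_from_int24(model_int24 x)
{
    model_f24 r = { x.bits };
    return r;
}

/* Integer view: plain 24x24-bit product. */
static int64_t model_mul_int24(model_int24 a, model_int24 b)
{
    return (int64_t)a.bits * b.bits;
}

/* Fractional view: 1.23 x 1.23 product rescaled to 23 fractional bits
 * (rounding and saturation are omitted in this sketch). */
static int64_t model_mul_f24(model_f24 a, model_f24 b)
{
    int64_t p = (int64_t)a.bits * b.bits;
    return (p < 0) ? ~(~p >> 23) : (p >> 23);
}
```

With the pattern 0x400000, the integer view squares to 2^44, while the fractional view
treats the same bits as 0.5 and yields 0x200000, that is, 0.25.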

3.2 Xtensa Xplorer Display Format Support


Xtensa Xplorer supports a wide variety of display formats, which makes these varied data
types easier to use and to debug. These formats allow the contents of vector registers to be
displayed in an easier-to-read form. By default, variables are displayed in a format matching
their vector data types. Registers are always displayed as ae_int64 by default, but you can
change the format to any other format.

The display formats for the different types are as follows:

• ae_int32x2: Displays hex and decimal for each element of the vector.

• ae_f32x2: Displays hex and decimal for each element of the vector assuming a 1.31
  representation.

• ae_int24x2: Displays hex and decimal for each element of the vector. The upper 8 bits
  of the variable, whether in a register or in memory, are not displayed.

• ae_f24x2: Displays hex and decimal for each element of the vector. If the variable is
  in memory, it is displayed as a 1.31 variable. If it is in a register, it is displayed as a
  9.23 variable.

• ae_int16x4: Displays hex and decimal for each element of the vector.

• ae_f16x4: Displays hex and decimal for each element of the vector assuming a 1.15
  representation.

• ae_int32: Displays hex and decimal.

• ae_f32: Displays hex and decimal assuming a 1.31 representation.

• ae_int24: Displays hex and decimal. The upper 8 bits of the variable, whether in a
  register or in memory, are not displayed.

• ae_f24: Displays hex and decimal. If the variable is in memory, it is displayed as a
  1.31 variable. If it is in a register, it is displayed as a 9.23 variable.

• ae_int16: Displays hex and decimal.

• ae_f16: Displays hex and decimal assuming a 1.15 representation.

• ae_int64: Displays hex and decimal.

• ae_f64: Displays hex and decimal assuming a 17.47 representation. A 1.63
  representation can be seen by explicitly selecting it in Xplorer.

• ae_int32x4: Displays hex and decimal for each element of the vector.

• ae_f32x4: Displays hex and decimal for each element of the vector assuming a 1.31
  representation.

• xtfloatx2: Displays floating point for each element of the vector on configurations
  with the optional floating point unit.

• ae_p24x2f: Displays hex for each element of the vector. All 24 bits of an element
  are displayed, even if 0.

• ae_p24s: Displays hex. All 24 bits are displayed, even if 0.

• ae_p24f: Displays hex. All 24 bits are displayed, even if 0.

• ae_p16x2s: Displays hex for each element of the vector. All 16 bits of an element
  are displayed, even if 0.

• ae_p16s: Displays hex. All 16 bits are displayed, even if 0.

• ae_q32s: Displays hex. All 32 bits are displayed, even if 0.

• ae_q56s: Displays hex, with the 8 guard bits separated from the other 48 bits. All
  48 bits are displayed, even if 0.

3.3 Programming Styles


Programmers typically need to invest some effort to make their code run efficiently on any
fixed-point DSP. For example, if the reference code is floating point, it must be converted to
fixed point unless the optional floating point unit is used. Such conversions are beyond the
scope of this guide. Note, however, that because all Fusion DSP configurations support
floating point, albeit inefficiently, it is often desirable to convert to fixed point one
function at a time.

Reference code is frequently written in terms of basic fixed-point intrinsic libraries. As a
first step, it is often desirable to implement the existing intrinsic library in terms of Fusion DSP
intrinsics. When implementing such an intrinsic library, the programmer has the choice of
whether to use standard C/C++ data types as external interfaces or whether to use the native
Fusion DSP data types. If the body of a library is ported to use Fusion DSP intrinsics but the
interface remains standard C/C++, the implementation must convert to and from the Fusion
DSP data types. The compiler can sometimes, but not always, eliminate these conversions.
If instead, the interfaces of the libraries are changed to use Fusion DSP data types,
performance will be better, but all the code that calls into the library must be changed to
handle the Fusion DSP data types. That is not always possible.


Cadence provides an optimized implementation of the ITU-T/ETSI intrinsics used often in
voice codecs. The interface uses standard C/C++ data types, but the implementation has
been carefully crafted to allow the compiler to eliminate the conversions.

In general, the most common scenario is that important functions in the application are
optimized directly for Fusion DSP, and the original library is left for the less important
functions.

There are several basic programming styles that can be used, depending on application
needs. In increasing order of manual effort, they are as follows:

• Standard scalar C/C++ code compiled with or without automatic vectorization.

• Auto-vectorized code written on top of the ITU-T/ETSI intrinsics.

• C/C++ code with Fusion DSP data types and operator overloading.

• Use of intrinsic functions for computation instructions along with Fusion DSP data
  types and implicit loads and stores.

• Use of intrinsic functions for both computation and loads and stores.

These different styles can be freely intermixed. For maximum performance, it is typically
necessary to use at least some amount of explicit intrinsics for computation. However, it is
often not necessary to use intrinsics for loads or stores.

For each of these strategies, one can write either scalar or vector code. One general strategy
is to port a single function at a time. If the desired semantics match standard C/C++ code or
the ITU-T/ETSI intrinsics, start with that and automatic vectorization. For 24-bit or 32x16-bit
applications, start with scalar code, using operator overloading where the desired semantics
match the available overloads and intrinsics where a specialized semantic is needed. Either
way, the code is then profiled. Those parts of the code that are computationally important
can then be manually vectorized. At any point, if the performance goals for the code have
been met, the optimization can cease. By starting with what can be done easily and refining
only the most computationally intensive portions of code manually, the engineering effort can
be directed to where it has the most effect. The following sections discuss each style.


3.4 Auto-vectorization of Standard C/C++


Auto-vectorization of scalar C code can produce effective results on simple loop nests, but
has its limits. It can be improved through the use of compiler pragmas and options, and
effective data marshaling to make data accesses (loads and stores) regular and aligned.

The xt-xcc compiler provides several options and methods of analysis to assist in
vectorization. These are discussed in more detail in the Xtensa C and C++ Compiler User’s
Guide, in particular in the SIMD Vectorization section. Cadence recommends studying this
guide in detail. The following guidelines summarize the main points:

• Vectorization is triggered with the compiler options -O3, -LNO:simd, or by selecting
  the Enable Automatic Vectorization option in Xplorer. The -LNO:simd_v and -keep
  command-line options give feedback on vectorization issues and keep intermediate
  compilation files, respectively. Xplorer’s Vectorization Assistant is a graphical tool that
  helps the programmer understand what did and did not vectorize.

• Data should be aligned to 8-byte boundaries. The XCC compiler naturally aligns
  arrays to start on 8-byte boundaries, but it cannot assume that pointer
  arguments are aligned. The compiler must be told that data is aligned by one of
  the following methods:
    - Using global or local arrays rather than pointers
    - Using #pragma aligned(<pointer>, n)
    - Compiling with -LNO:aligned_pointers=on

• Pointer aliasing causes problems with vectorization. The __restrict attribute for
  pointer declarations (e.g., short * __restrict cp;) tells the compiler that the
  pointer does not alias.

• Compiler alignment options, such as -LNO:aligned_pointers=on, tell the
  compiler that it can assume data is always aligned.

• There are global compiler aliasing options, but these can sometimes be dangerous.

• Subtle C/C++ semantics in loops may make them impossible to vectorize. The
  Vectorization Assistant can help identify small changes that allow effective
  vectorization.

• Irregular or non-unity strides in data array accesses can be a problem for
  vectorization. Changing data array accesses to regular unity strides can improve
  results, even if some “unnecessary computation” is required.

• Outer loops can be simplified wherever possible to allow inner loops to be more
  easily vectorized. Sometimes trading outer and inner loops can improve results.

• Loops containing function calls and conditionals may prevent vectorization. It may
  be better to duplicate code and perform a little "unnecessary computation" to
  produce better results.

• Array references, rather than pointer dereferencing, can make code (especially
  mathematical algorithms) both easier to understand and easier to vectorize.

• At -O3, the compiler will perform optimizations that, while mathematically correct,
  might change the exact bit results of floating point computations. For example, the
  compiler might replace a += b*c with a fused multiply-accumulate operation that
  avoids a rounding step between the multiply and the accumulate. If bit-exact answers
  are needed, compile with -fno-unsafe-math-optimizations.
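
Several of the guidelines above can be combined in one loop. The following is a minimal
sketch, not a definitive recipe: mul_clip is a hypothetical example function, and whether a
given compiler version actually vectorizes it should be confirmed with the vectorization
feedback options. It uses __restrict pointers, array-style indexing with unit stride, and an
unconditional computation followed by a select in place of a branch.

```c
/* A sketch of a loop shaped for auto-vectorization. mul_clip is a
 * hypothetical example function, not part of any Cadence library. */
void mul_clip(int * __restrict out,
              const short * __restrict a,
              const short * __restrict b,
              int n)
{
    int i;

    /* On xt-xcc, the alignment of a, b, and out could additionally be
     * asserted with #pragma aligned or -LNO:aligned_pointers=on, as
     * described above; that is left out here to keep the code portable. */
    for (i = 0; i < n; i++)        /* unit stride, array-style indexing */
    {
        int p = a[i] * b[i];

        /* Compute unconditionally and select the result instead of
         * branching around the work; no function calls in the loop. */
        out[i] = (p < 0) ? 0 : p;
    }
}
```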

Consider a simple example that performs a 16-bit energy calculation:


int Energy (short a[], int n)
{
    int i;
    int s = 0;

    for (i = 0; i < n; i++)
    {
        s = s + a[i]*a[i];
    }
    return s;
}

The program can be compiled either with or without automatic vectorization. Note that even
without automatic vectorization, it is still important to select the Use DSP co-processor
button or, equivalently, to pass the -mcoproc compiler option. This option allows the
compiler to automatically use Fusion DSP instructions for scalar code.

Without vectorization and without the -mcoproc compiler option, the compiler is limited to
the use of Xtensa foundation instructions, and those do not include multiply-add instructions.
The compiler chooses to unroll the loop by a factor of 8, and then packs the 8 adds, 8
multiplies, and 8 loads into 13 cycles. Using the -mcoproc compiler option, the compiler is
able to utilize the Fusion DSP multiply-accumulate operations and generates an inner loop
that performs one 16-bit multiply every cycle.

loopgtz a3,L
{
ae_l16.ip aed0,a2,2
ae_mula16x4.l aed1,aed0,aed0
}
L:

Note that the ae_mula16x4.l instruction performs two multiplies, but because ae_l16.ip
performs a single 16-bit load that replicates the data, each of the two multiplies is multiplying
the same operand.

Note that operations enclosed in braces { } in assembly code are part of the same
instruction and execute in parallel.


With vectorization, the compiler generates a loop that executes two multiply-adds every
cycle.

loopgtz a3,L
{
ae_la16x4.ip aed0,u0,a2
ae_mula16x4.h aed1,aed3,aed3
}
{
ae_la16x4.ip aed3,u0,a2
ae_mula16x4.l aed2,aed3,aed3
}
{
nop
ae_mula16x4.h aed1,aed0,aed0
}
{
nop
ae_mula16x4.l aed2,aed0,aed0
}
L:

Note that since the input array is a parameter, and we have not used any special compiler
flags or pragmas, the compiler must assume that it might not be aligned. Therefore, the
compiler uses the aligning load instructions.
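
The vectorized loop accumulates into more than one register (aed1 and aed2 in the listing
above) and combines the partial results after the loop. A scalar C model of that idea is
sketched below. EnergyLanes is a hypothetical name, and the assignment of elements to lanes
is illustrative only, not a description of what the compiler emits. Because integer addition
can be reordered freely in the absence of overflow, the model returns the same result as the
original Energy function.

```c
/* Scalar model of a 4-way vectorized energy loop. EnergyLanes is a
 * hypothetical name; the lane assignment is illustrative only. */
int EnergyLanes(const short a[], int n)
{
    int lane[4] = { 0, 0, 0, 0 };   /* one partial sum per vector lane */
    int i, k;

    /* Main loop: four 16-bit elements per iteration. */
    for (i = 0; i + 4 <= n; i += 4)
    {
        for (k = 0; k < 4; k++)
        {
            lane[k] += a[i + k] * a[i + k];
        }
    }

    /* Remainder: elements left over when n is not a multiple of four. */
    for (; i < n; i++)
    {
        lane[0] += a[i] * a[i];
    }

    /* Final reduction of the partial sums. */
    return lane[0] + lane[1] + lane[2] + lane[3];
}
```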

If our example had used int instead of short, the compiler would generate a loop that
executes one 32-bit multiply-add per cycle.

3.5 ITU-T/ETSI Intrinsics


To use the ITU-T/ETSI Intrinsics, simply include one or both of the following header files.

#include <fusion/basic_op_xtensa.h>
#include <fusion/oper_32b_xtensa.h>

For compatibility with HiFi, hifi2 can be used instead of fusion.

The standard intrinsics can then be used either with or without automatic vectorization, just
like standard C/C++ code.


Consider our energy calculation example, modified to use the intrinsics.

#include <fusion/basic_op_xtensa.h>

int Energy (short a[], int n)
{
    int i;
    int s = 0;

    for (i = 0; i < n; i++)
    {
        s = L_mac (s, a[i], a[i]);
    }
    return s;
}
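
For checking results on a host machine, the behavior of L_mac can be modelled in portable C
from the standard definitions of the ITU-T/ETSI basic operators. This is a reference sketch
only: the ref_ names are hypothetical, the Overflow flag of the reference library is not
modelled, and the code is unrelated to the optimized Cadence implementation.

```c
#include <stdint.h>

/* Portable reference model of three ITU-T/ETSI basic operators. The ref_
 * names are hypothetical; the Overflow flag of the reference library is
 * not modelled. */

/* L_mult: 16x16-bit fractional multiply, result = 2 * a * b, saturated. */
static int32_t ref_L_mult(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    /* Only (-32768) * (-32768) overflows when doubled. */
    return (p == 0x40000000) ? INT32_MAX : p * 2;
}

/* L_add: 32-bit addition with saturation. */
static int32_t ref_L_add(int32_t x, int32_t y)
{
    int64_t s = (int64_t)x + (int64_t)y;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* L_mac: multiply-accumulate, acc + L_mult(a, b), saturated. */
static int32_t ref_L_mac(int32_t acc, int16_t a, int16_t b)
{
    return ref_L_add(acc, ref_L_mult(a, b));
}
```

Note that because L_mult doubles the product, an energy computed with L_mac is twice the
plain integer sum of squares as long as no saturation occurs.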

Without vectorization (but using -mcoproc), the compiler generates an inner loop that
performs a multiplication every cycle.

loop a3, L
{
ae_l16.ip aed0,a2,2
ae_mulaf16ss.00 aed1,aed0,aed0
}
L:

With vectorization, the compiler generates an inner loop that performs two multiplications
every cycle.

loopgtz a3,L
{
ae_la16x4.ip aed0,u0,a2
ae_mulaafd16ss.33_22 aed1,aed2,aed2
}
{
ae_la16x4.ip aed2,u0,a2
ae_mulaafd16ss.11_00 aed1,aed2,aed2
}
{
nop
ae_mulaafd16ss.33_22 aed1,aed0,aed0
}
{
nop
ae_mulaafd16ss.11_00 aed1,aed0,aed0
}
L:


3.6 Operator Overloading


Common Fusion DSP operations can be accessed in C or C++ by applying standard C
operators to the Fusion DSP data types. For example, the following C code infers operation
AE_ADD32:

ae_int32x2 p0, p1;
ae_int32x2 p = p0 + p1;

Table 3-2 describes the supported operators. Unless noted otherwise, the operators return
variables with the same type as the input operand types. If at least one of the input operands
has a SIMD type, the return type will also be SIMD.

The same operator might map to both a version that takes a register argument and one that
takes an immediate. The compiler will automatically choose the immediate version when
used with an immediate that is in range.

Table 3-2 Fusion DSP C/C++ Operators

Operator Operand Types Operation Description


+ ae_f32, AE_ADD32S Signed saturating 32-bit addition.
ae_f32x2,ae_f32x4
- ae_f32, AE_SUB32S Signed saturating 32-bit subtraction.
ae_f32x2,ae_f32x4
- ae_f32, AE_NEG32S Signed saturating 32-bit negation.
ae_f32x2,ae_f32x4
* ae_f32x2 AE_MULFP32 Signed SIMD fixed-point 1.31x1.31-bit
X2RAS into 1.31-bit multiplication with an
AVS ONLY ae_f32x2 return type.
* ae_f32 AE_MULFP32 Signed fixed-point 1.31x1.31-bit into
X2RAS 1.31-bit multiplication with an ae_f32
AVS ONLY return type.
* ae_f32x4 * ae_f16x4 AE_MULFP32 Signed SIMD fixed-point 1.31x1.15-bit
X16X2RAS.L into 1.31-bit multiplication
AE_MULFP32
X16X2RAS.H
* ae_f32 * ae_f16 AE_MULFP32 Signed fixed-point 1.31x1.15-bit into
X16X2RAS.H 1.31-bit multiplication
& ae_f32, ae_f32x2, AE_AND Binary AND.
ae_f32x4, ae_int32,
ae_int32x2, ae_int32x4
ae_f24, ae_f24x2,
ae_int24, ae_int24x2,
ae_f64, ae_int64,
ae_f16, ae_f16x4,
ae_int16, ae_int16x4

180  CADENCE DESIGN SYSTEMS, INC.


Fusion F1 DSP User’s Guide

Operator Operand Types Operation Description


| ae_f32, ae_f32x2, AE_OR Binary OR.
ae_f32x4, ae_int32,
ae_int32x2, ae_int32x4
ae_f24, ae_f24x2,
ae_int24, ae_int24x2,
ae_f64, ae_int64,
ae_f16, ae_f16x4,
ae_int16, ae_int16x4
^ ae_f32, ae_f32x2, AE_XOR Binary Exclusive OR.
ae_f32x4, ae_int32,
ae_int32x2, ae_int32x4
ae_f24, ae_f24x2,
ae_int24, ae_int24x2,
ae_f64, ae_int64,
ae_f16, ae_f16x4,
ae_int16, ae_int16x4
~ ae_f32, ae_f32x2, AE_NAND Binary NOT.
ae_f32x4, ae_int32,
ae_int32x2, ae_int32x4,
ae_f24, ae_f24x2,
ae_int24, ae_int24x2,
ae_f64, ae_int64,
ae_f16, ae_f16x4,
ae_int16, ae_int16x4
>> ae_f32, ae_f32x2, AE_SRAI32 Signed arithmetic 32-bit right shift by
ae_f32x4, an immediate shift amount.
ae_int32, ae_int32x2,
ae_int32x4,
ae_in24, ae_int24x2
>> ae_f32, ae_f32x2, AE_SRAA32 Signed arithmetic 32-bit right shift by a
ae_f32x4, variable shift amount.
ae_int32, ae_int32x2,
ae_int32x4
ae_int24, ae_int24x2
<< ae_f32, ae_f32x2, AE_SLAI32S Signed saturating 32-bit left shift by an
ae_f32x4 immediate shift amount.
<< ae_f32, ae_f32x2, AE_SLAA32S Signed saturating 32-bit left shift by a
ae_f32x4 variable shift amount.
< ae_f32x2, ae_int32x2 AE_LT32 Signed less-than comparison with an
ae_f24x2, ae_int24x2, xtbool2 return type.
<= ae_f32x2, ae_int32x2 AE_LE32 Signed less-than-or-equal comparison
ae_f24x2, ae_int24x2, with an xtbool2 return type.

 CADENCE DESIGN SYSTEMS , INC. 181


Fusion F1 DSP User’s Guide

Operator Operand Types Operation Description


== ae_f32x2, ae_int32x2 AE_EQ32 Equal comparison with an xtbool2
ae_f24x2, ae_int24x2, return type.
>= ae_f32x2, ae_int32x2 AE_LE32 Signed greater-than-or-equal
ae_f24x2, ae_int24x2, comparison with an xtbool2 return
type.
> ae_f32x2, ae_int32x2 AE_LT32 Signed greater-than comparison with
ae_f24x2, ae_int24x2, an xtbool2 return type.

+ ae_int32, ae_int32x2, AE_ADD32 Signed 32-bit addition.


ae_int32x4,
ae_int24, ae_int24x2
- ae_int32, ae_int32x2, AE_SUB32 Signed 32-bit subtraction.
ae_int32x4,
ae_int24, ae_int24x2
- ae_int32, ae_int32x2, AE_NEG32 Signed 32-bit negation.
ae_int32x4,
ae_int24, ae_int24x2
* ae_int32x2 AE_MULP32X Signed SIMD 32x32 into 32-bit
2 multiplication with an ae_int32x2
return type.
* ae_int32 AE_MULP32X Signed 32x32 into 32-bit multiplication
2 with an ae_int32 return type.
* ae_int32x4 *ae_int16x4 AE_MULP32X Signed SIMD 32x16-bit into 32-bit
16X2.L multiplication
AE_MULP32X
16X2.H
* ae_int32 * ae_int16 AE_MULP32X Signed 32x16-bit into 32-bit
16X2.H multiplication
<< ae_int32, ae_int32x2, AE_SLAI32 Signed 32-bit left shift by an
ae_int32x4, immediate shift amount.
ae_int24, ae_int24x2
<< ae_int32, ae_int32x2, AE_SLAA32 Signed 32-bit left shift by a variable
ae_int32x4, shift amount.
ae_int24, ae_int24x2

+ ae_f24, ae_f24x2 AE_ADD24S Signed saturating 24-bit addition.


- ae_f24, ae_f24x2 AE_SUB24S Signed saturating 24-bit subtraction.
- ae_f24, ae_f24x2 AE_NEG24S Signed saturating 24-bit negation.
* ae_f24 AE_MULFP24 Signed SIMD fixed-point 1.23x1.23-bit
X2RA into 9.23-bit multiplication with an
ae_f32 return type.

182  CADENCE DESIGN SYSTEMS, INC.


Fusion F1 DSP User’s Guide

Operator Operand Types Operation Description


* ae_f24x2 AE_MULFP24 Signed SIMD fixed-point 1.23x1.23-bit
X2RA into 9.23-bit multiplication with an
ae_f32x2 return type.
>> ae_f24, ae_f24x2 AE_SRAI24 Signed arithmetic 24-bit right shift by
an immediate shift amount.
>> ae_f24, ae_f24x2 AE_SRAS24 Signed arithmetic 24-bit right shift by a
variable shift amount.
<< ae_f24, ae_f24x2 AE_SLAI24S Signed saturating 24-bit left shift by an
immediate shift amount.
<< ae_f24, ae_f24x2 AE_SLAS24S Signed saturating 24-bit left shift by a
variable shift amount.

* ae_int24x2 AE_MULP24X Signed SIMD 24x24 into 32-bit


2 multiplication with an ae_int32x2
return type.
* ae_int24 AE_MULP24X Signed 24x24 into 32-bit multiplication
2 with an ae_int32 return type.

+ ae_f64 AE_ADD64S Signed saturating 64-bit addition.


- ae_f64 AE_SUB64S Signed saturating 64-bit subtraction.
- ae_f64 AE_NEG64S Signed saturating 64-bit negation.
>> ae_f64, ae_int64 AE_SRAI64 Signed arithmetic 64-bit right shift by
an immediate shift amount.
>> ae_f64, ae_int64 AE_SRAA64 Signed arithmetic 64-bit right shift by a
variable shift amount.
<< ae_f64 AE_SLAI64S Signed saturating 64-bit left shift by an
immediate shift amount.
<< ae_f64 AE_SLAA64S Signed saturating 64-bit left shift by a
variable shift amount.
< ae_f64, ae_int64 AE_LT64 Signed less-than comparison with an
xtbool return type.
<= ae_f64, ae_int64 AE_LE64 Signed less-than-or-equal comparison
with an xtbool return type.
== ae_f64, ae_int64 AE_EQ64 Equal comparison with an xtbool
return type.
>= ae_f64, ae_int64 AE_LE64 Signed greater-than-or-equal
comparison with an xtbool return
type.
> ae_f64, ae_int64 AE_LT64 Signed greater-than comparison with
an xtbool return type.

 CADENCE DESIGN SYSTEMS , INC. 183


Fusion F1 DSP User’s Guide

Operator Operand Types Operation Description


+ ae_int64 AE_ADD64 Signed 64-bit addition.
- ae_int64 AE_SUB64 Signed 64-bit subtraction.
- ae_int64 AE_NEG64 Signed 64-bit negation.
<< ae_int64 AE_SLAI64 Signed 64-bit left shift by an
immediate shift amount.
<< ae_int64 AE_SLAA64 Signed 64-bit left shift by a variable
shift amount.

+ ae_f16, ae_f16x4 AE_ADD16S Signed saturating 16-bit addition.


- ae_f16, ae_f16x4 AE_SUB16S Signed saturating 16-bit subtraction.
- ae_f16, ae_f16x4 AE_NEG16S Signed saturating 16-bit negation.
* ae_f16x4 AE_MULF16X Signed SIMD fixed-point 1.15x1.15-bit
4SS into 1.31-bit multiplication with an
ae_f32x4 return type.
>> ae_f16, ae_f16x4 AE_SRAI16 Signed arithmetic 16-bit right shift by
an immediate shift amount.
>> ae_f16, ae_f16x4 AE_SRAA16S Signed saturating arithmetic 16-bit
right shift by a variable shift amount.
<< ae_f16, ae_f16x4 AE_SLAI16S Signed saturating 16-bit left shift by an
immediate shift amount.
<< ae_f16, ae_f16x4 AE_SLAA16S Signed saturating 16-bit left shift by a
variable shift amount.
< ae_f16x4, ae_int16x4 AE_LT16 Signed less-than comparison with an
xtbool4 return type.
<= ae_f16x4, ae_int16x4 AE_LE16 Signed less-than-or-equal comparison
with an xtbool4 return type.
== ae_f16x4, ae_int16x4 AE_EQ16 Equal comparison with an xtbool4
return type.
>= ae_f16x4, ae_int16x4 AE_LE16 Signed greater-than-or-equal
comparison with an xtbool4 return
type.
> ae_f16x4, ae_int16x4 AE_LT16 Signed greater-than comparison with
an xtbool4 return type.

+ ae_int16, ae_int16x4 AE_ADD16 Signed 16-bit addition.


- ae_int16, ae_int16x4 AE_SUB16 Signed 16-bit subtraction.
- ae_int16, ae_int16x4 AE_MOVI, Signed 16-bit negation.
AE_SUB16
* ae_int16x4 AE_MUL16X4 Signed SIMD 16x16 into 32-bit
multiplication with an ae_int32x4
return type.



>> ae_int16, ae_int16x4 AE_SRAI16 Signed 16-bit right shift by an
immediate shift amount.
>> ae_int16, ae_int16x4 AE_SRAA16S Signed 16-bit right shift by a variable
shift amount.

Table 3-3 describes the supported operators for the legacy HiFi 2 data types. Note that the
overloading choices for the HiFi 2 types are quite different from those for Fusion DSP.

Table 3-3 Legacy HiFi 2 C/C++ Operators

Operator Operand Types Operation Description


+ ae_p24s, ae_p24f, AE_ADDSP24S Signed saturating 24-bit
ae_p24x2s, addition.
ae_p24x2f
- ae_p24s, ae_p24f, AE_SUBSP24S Signed saturating 24-bit
ae_p24x2s, subtraction.
ae_p24x2f
- ae_p24s, ae_p24f, AE_NEGSP24S Signed saturating 24-bit
ae_p24x2s, negation.
ae_p24x2f
* ae_p24s, ae_p24f AE_MULFP24S.LL Signed single fixed-point
1.23x1.23-bit into 9.47-bit
multiplication with an
ae_q56s return type.
* ae_p24x2s, AE_MULZAAFP24S.HH.LL Signed dual fixed-point
ae_p24x2f 1.23x1.23-bit into 9.47-bit
multiplication with an
ae_q56s return type.
& ae_p24s, ae_p24f, AE_ANDP48 Binary AND.
ae_p24x2s,
ae_p24x2f
| ae_p24s, ae_p24f, AE_ORP48 Binary OR.
ae_p24x2s,
ae_p24x2f
^ ae_p24s, ae_p24f, AE_XORP48 Binary Exclusive OR.
ae_p24x2s,
ae_p24x2f

~ ae_p24s, ae_p24f, AE_NANDP48 Binary NOT.
ae_p24x2s,
ae_p24x2f
>> ae_p24s, ae_p24f, AE_SRAIP24 Signed arithmetic 24-bit
ae_p24x2s, right shift by an immediate
ae_p24x2f shift amount.


>> ae_p24s, ae_p24f, AE_SRASP24 Signed arithmetic 24-bit
ae_p24x2s, right shift by a variable
ae_p24x2f shift amount.
<< ae_p24s, ae_p24f, AE_SLLISP24S Signed saturating 24-bit
ae_p24x2s, left shift by an immediate
ae_p24x2f shift amount.
<< ae_p24s, ae_p24f, AE_SLLSSP24S Signed saturating 24-bit
ae_p24x2s, left shift by a variable shift
ae_p24x2f amount.
< ae_p24x2s, AE_LTP24S Signed less-than
ae_p24x2f comparison with an
xtbool2 return type.
<= ae_p24x2s, AE_LEP24S Signed less-than-or-equal
ae_p24x2f comparison with an
xtbool2 return type.
== ae_p24x2s, AE_EQP24 Equal comparison with an
ae_p24x2f xtbool2 return type.
>= ae_p24x2s, AE_LEP24S Signed greater-than-or-
ae_p24x2f equal comparison with an
xtbool2 return type.
> ae_p24x2s, AE_LTP24S Signed greater-than
ae_p24x2f comparison with an
xtbool2 return type.

+ ae_q56s AE_ADDQ56 56-bit addition.


- ae_q56s AE_SUBQ56 56-bit subtraction.
- ae_q56s AE_NEGQ56 56-bit negation.
& ae_q56s AE_ANDQ56 Binary AND.
| ae_q56s AE_ORQ56 Binary OR.
^ ae_q56s AE_XORQ56 Binary Exclusive OR.
~ ae_q56s AE_NANDQ56 Binary NOT.
>> ae_q56s AE_SRAIQ56 Signed arithmetic 56-bit
right shift by an immediate
shift amount.
>> ae_q56s AE_SRAAQ56 Signed arithmetic 56-bit
right shift by a variable
shift amount.
<< ae_q56s AE_SLLIQ56 56-bit left shift by an
immediate shift amount.
<< ae_q56s AE_SLLAQ56 56-bit left shift by a
variable shift amount.



< ae_q56s AE_LTQ56S Signed less-than
comparison with an
xtbool return type.
<= ae_q56s AE_LEQ56S Signed less-than-or-equal
comparison with an
xtbool return type.
== ae_q56s AE_EQQ56 Equal comparison with an
xtbool return type.
>= ae_q56s AE_LEQ56S Signed greater-than-or-
equal comparison with an
xtbool return type.
> ae_q56s AE_LTQ56S Signed greater-than
comparison with an
xtbool return type.

Note that all the non-legacy multiply overloads produce results of a similar, low precision
to that of the operands, because there are no high-precision SIMD multiplies. The
high-precision dual multiplies in Fusion DSP add (or subtract) the two multiply results
into a single result, and it is less natural to define the product of two ae_f24x2
variables, for example, as a single ae_f64 that is the dot product of the two variables. This
is in contrast to the legacy HiFi 2/EP data types, such as ae_p24x2f, where multiplying two
such variables does indeed compute a dot product. Those semantics were chosen because HiFi
2/EP has no true SIMD multiplies.

3.6.1 Energy Calculation Example


Consider our energy calculation example for operator overloading where the input data is
stored in memory as a 1.31 fixed-point value. The standard C reference code is shown below.

s = 0;
for (i=0; i<n; i++)
{
    s += ((long long) a[i]*a[i]) >> 31;
}
return s;

Assuming that we wish to use 24-bit arithmetic and can therefore throw away the bottom
eight bits of the input, the code can be converted into Fusion DSP code as follows.

ae_f24 *ap = (ae_f24 *) a;
ae_f32 s = 0;
for (i=0; i<n; i++) {
    s += ap[i]*ap[i];
}
return s;

The main loop uses operator overloading to perform a 24-bit fixed-point multiply. The ae_f24
typed array is implicitly loaded, just like any standard C/C++ type. As part of the load, the
bottom 8 bits of the 1.31 input array are discarded. The accumulator is of type ae_f32, giving
8 guard bits. The assignment of the result to an int does not change the bit pattern. Hence
this routine returns a 9.23 value stored as an int.

The compiler generates the following inner loop.

loop a3, L
{
    ae_l32f24.ip aed0,a2,4
    ae_mulafp24x2ra aed1,aed0,aed0
}
L:

Fusion DSP is able to issue a multiply and a load every cycle. Note that the compiler
automatically generates the multiply-add instruction, ae_mulafp24x2ra. This instruction
does a 24-bit multiplication with a 32-bit accumulation. The 32-bit accumulation does not
saturate, so this code is only safe where 32-bit overflow is not possible. If overflow is possible,
compile with -mno-enable-non-exact-imaps. The compiler will then leave the multiply and
the addition as two separate instructions and will use a saturating add for the addition.
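
For readers who want to check the arithmetic on a host machine, the behavior described above can be modeled in portable C. This is only a sketch under stated assumptions: the helper names (q31_to_q23, square_q23, energy_q23_wrap, energy_q23_sat) are hypothetical, and the round-to-nearest step is our assumption about how the product is reduced to 23 fraction bits, not a statement of the exact instruction semantics documented in Chapter 2.

```c
#include <stdint.h>

/* Hypothetical helper: drop the bottom 8 bits of a 1.31 value (1.31 -> 1.23).
 * Written without shifting a negative number so the result is portable. */
static int32_t q31_to_q23(int32_t a)
{
    return (a - (a & 0xFF)) / 256;
}

/* Hypothetical helper: square a 1.23 value and reduce the exact 2.46 product
 * to a 9.23 value. Round-to-nearest is an assumption of this model. */
static int32_t square_q23(int32_t x)
{
    int64_t p = (int64_t)x * x;              /* exact and never negative */
    return (int32_t)((p + (1 << 22)) >> 23);
}

/* Non-saturating 32-bit accumulation: the sum wraps on overflow. */
static int32_t energy_q23_wrap(const int32_t *a, int n)
{
    uint32_t s = 0;                          /* unsigned addition wraps by definition */
    for (int i = 0; i < n; i++)
        s += (uint32_t)square_q23(q31_to_q23(a[i]));
    /* Reinterpret the wrapped bit pattern as a signed 9.23 value. */
    return (int32_t)((int64_t)(s ^ 0x80000000u) - 0x80000000LL);
}

/* Saturating accumulation: the sum clamps at the largest 9.23 value. */
static int32_t energy_q23_sat(const int32_t *a, int n)
{
    int64_t s = 0;
    for (int i = 0; i < n; i++) {
        s += square_q23(q31_to_q23(a[i]));
        if (s > INT32_MAX)
            s = INT32_MAX;                   /* squares are >= 0: only the upper bound matters */
    }
    return (int32_t)s;
}
```

In this model, 256 inputs of -1.0 are enough to exceed the 8 guard bits: the wrapping version returns the most negative value, while the saturating version stays at the largest positive one.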

The inner loop is perfect except that no SIMD is used. By changing ae_f24 into ae_f24x2,
ae_f32 into ae_f32x2, and cutting the trip count in half, we convert the example into a 2-way
SIMD example. The main loop is computing two partial sums in parallel. After the loop, we
must add together the two partial sums into a single sum using the AE_ADD32_HL_LH
intrinsic.

ae_f24x2 *ap = (ae_f24x2 *) a;
ae_f32x2 s = 0;
for (i = 0; i < n>>1; i++)
{
    s = s + ap[i]*ap[i];
}
return AE_ADD32_HL_LH(s,s);

The compiler generates the following inner loop.

loop a3, L
{
    ae_l32x2f24.ip aed0,a2,8
    ae_mulafp24x2ra aed1,aed0,aed0
}
L:

The generated code is now able to do two multiplies every cycle with the speed limited by
the load/store bandwidth of the machine.


Note that the optimized code assumes that n is a multiple of two. If that is not guaranteed,
the last iteration of the loop must be conditionally peeled as follows.

ae_f24x2 *ap = (ae_f24x2 *) a;
ae_f32x2 s = 0;
for (i = 0; i < n>>1; i++)
{
    s = s + ap[i]*ap[i];
}
ae_f32 last_prod = 0;
if (n%2) {
    ae_f24 *ap_scalar = (ae_f24 *) a;
    last_prod = ap_scalar[n-1]*ap_scalar[n-1];
}
return AE_ADD32_HL_LH(s,s) +
       AE_MOVINT32X2_FROMF32(last_prod);

If the total number of iterations dynamically turns out to be odd, the last iteration is executed
separately, using scalar instructions. Note the use of the AE_MOVINT32X2_FROMF32
intrinsic. The reduction add intrinsic returns an ae_int32x2 type and therefore the product
of the last iteration must be appropriately coerced.
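
The loop structure above (two partial sums, a reduction, and a conditionally peeled last element) can be checked on a host with a plain C model. The sketch below uses 64-bit integer accumulators and makes no attempt to model DSP precision; the names lane_h and lane_l are hypothetical stand-ins for the two halves of the SIMD accumulator, and which element lands in which half is an assumption.

```c
#include <stdint.h>

/* Scalar reference: one product per iteration. */
static int64_t energy_scalar(const int32_t *a, int n)
{
    int64_t s = 0;
    for (int i = 0; i < n; i++)
        s += (int64_t)a[i] * a[i];
    return s;
}

/* Two-lane model of the vectorized loop with a peeled last element. */
static int64_t energy_two_lane(const int32_t *a, int n)
{
    int64_t lane_h = 0, lane_l = 0;          /* two partial sums */
    for (int i = 0; i < (n >> 1); i++) {
        lane_h += (int64_t)a[2 * i] * a[2 * i];
        lane_l += (int64_t)a[2 * i + 1] * a[2 * i + 1];
    }
    int64_t last = 0;
    if (n % 2)                               /* odd n: handle the final element alone */
        last = (int64_t)a[n - 1] * a[n - 1];
    return lane_h + lane_l + last;           /* reduction of the partial sums */
}
```

Comparing energy_two_lane against energy_scalar for both even and odd n is a quick way to confirm that the peeled iteration covers the element the halved trip count skips.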

This example code uses fixed-point arithmetic. If integral arithmetic is desired instead, simply
use the integral types rather than the fixed-point types.

ae_int24x2 *ap = (ae_int24x2 *) a;
ae_int32x2 s = 0;
for (i = 0; i < n>>1; i++)
{
    s = s + ap[i]*ap[i];
}
ae_int32 last_prod = 0;
if (n%2) {
    ae_int24 *ap_scalar = (ae_int24 *) a;
    last_prod = ap_scalar[n-1]*ap_scalar[n-1];
}
return AE_ADD32_HL_LH(s,s)+last_prod;

3.6.2 32X16-bit Dot Product Example


Consider now a scenario for operator overloading where we wish to do 32x16-bit
multiplication rather than 24-bit. An energy calculation only has a single input operand, while
32x16-bit multiplication requires two. So, we convert our energy example into a dot product.
Because four 16-bit elements can fit into a register, we vectorize by four rather than by two.
The number of elements in the 32-bit operand must be the same as the number of elements in
the 16-bit operand. Therefore, Fusion DSP defines an ae_int32x4 (and an ae_f32x4) data type.
These are structure data types that occupy two registers. Most operations defined on these
types result in two instructions, so they are no faster than the two-way SIMD types. However,
their use is necessary when doing 32x16-bit multiplication using operator overloading. The
example is shown below. Note that the result is reduced into a single int using the
AE_INT32X4_RADD intrinsic. This is a convenience intrinsic that translates into a
three-instruction sequence.

ae_int32x4 *ap = (ae_int32x4 *) a;
ae_int16x4 *bp = (ae_int16x4 *) b;
ae_int32x4 s = 0;
for (i = 0; i < n>>2; i++)
{
    s += bp[i]*ap[i];
}
return AE_INT32X4_RADD(s);

3.7 Intrinsic-based Programming


The next programming style is to use explicit intrinsics. Even if operator overloading is not
sufficient, it may not be necessary to use intrinsics everywhere, as the compiler may, for
example, infer the right vector loads and stores. Sometimes adding just a few strategic
intrinsics may be sufficient to achieve maximum efficiency. The compiler can still be counted
on for efficient scheduling and optimization.

Every Fusion DSP instruction can be directly accessed by an intrinsic of the same name
(except that “.” in instruction names get converted into “_” in intrinsic names). The prototypes
of the supported intrinsics were listed along with the instruction descriptions in the previous
chapter.


Consider a simple example that does a 24-bit fixed-point energy calculation but wants to
keep all the intermediate results in high precision. Operator overloading always uses the low-
precision multipliers. Therefore, we must use intrinsics for the multiply.

ae_f24x2 *ap = (ae_f24x2 *) a;
ae_f64 s = 0LL;
for (i = 0; i < n>>1; i++)
{
    AE_MULAAFD24_HH_LL(s, ap[i], ap[i]);
}
ae_f24 result = AE_ROUND24F48SASYM(s);

In addition to the dual-multiply intrinsic, intrinsics are used to round the final result back down
to 24 bits.

Following are several interesting points:

• There is no need to use explicit vector loads.

• The intrinsics are not assembly operations. They do not need to be manually scheduled
into FLIX bundles. Variables do not need to be manually allocated into particular
registers. The compiler takes care of all that. The code still remains quite "C-like".
The compiler generates a perfect inner loop with a dual, updating load and a dual
multiply instruction.

• The compiler will automatically select load/store instructions, but in some cases
programmers may be able to improve the results by choosing the appropriate
intrinsic themselves instead of leaving the selection to the compiler.
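
To see why keeping the intermediate results in high precision can matter, the two accumulation styles can be compared with a portable C model. This is a sketch, not the instruction semantics: the function names are hypothetical, the inputs are assumed to be 1.23 values already held in int32_t variables, and round-to-nearest is an assumed rounding mode.

```c
#include <stdint.h>

/* Low-precision model: each 1.23 x 1.23 product is rounded to 23 fraction
 * bits before it is added (rounding mode is an assumption). */
static int32_t energy_low_precision(const int32_t *x, int n)
{
    int32_t s = 0;
    for (int i = 0; i < n; i++) {
        int64_t p = (int64_t)x[i] * x[i];            /* exact, never negative */
        s += (int32_t)((p + (1 << 22)) >> 23);
    }
    return s;
}

/* High-precision model: exact products are summed in a 64-bit accumulator
 * and rounded once at the end, then saturated to 24 bits. */
static int32_t energy_high_precision(const int32_t *x, int n)
{
    int64_t s = 0;
    for (int i = 0; i < n; i++)
        s += (int64_t)x[i] * x[i];
    int64_t r = (s + (1 << 22)) >> 23;
    return r > 0x7FFFFF ? 0x7FFFFF : (int32_t)r;
}
```

With sixteen small inputs of 2^-13 each, every individual product rounds to zero in the low-precision model, while the high-precision model still reports a nonzero energy because the rounding happens only once.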

Consider now a similar example where the operand is stored in a circular buffer. The
assumption is that the operand array might cross the end of the buffer. After loading the last
element in the buffer, the code needs to continue to the first element. There is no way to
implicitly utilize the circular buffer load instructions. One needs to use the explicit load
intrinsics as shown in the following code.

ae_f24x2 tmp;
ae_f24x2 *ap = (ae_f24x2 *) a;
ae_f64 s = 0LL;
for (i = 0; i < n>>1; i++)
{
    AE_L32X2F24_XC(tmp, ap, 8);
    AE_MULAAFD24_HH_LL(s, tmp, tmp);
}
ae_f24 result = AE_ROUND24F48SASYM(s);

The operand is loaded using the updating, circular load intrinsic, AE_L32X2F24_XC, which
also advances the operand pointer. This example assumes that the boundaries of the circular
buffer have been set elsewhere.
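
The wrap-around behavior that the circular load provides can be pictured with a small host-side model. The sketch below uses element indices instead of pointers; circ_advance and energy_circular are hypothetical names, and the model assumes that the buffer length, the start index, and n are all even so that a pair of elements never straddles the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: advance an element index by `step` inside a circular
 * buffer of `len` elements, wrapping past the end (assumes step <= len). */
static size_t circ_advance(size_t idx, size_t step, size_t len)
{
    idx += step;
    if (idx >= len)
        idx -= len;
    return idx;
}

/* Sum of squares of n elements starting at index `start`, two per iteration. */
static int64_t energy_circular(const int32_t *buf, size_t len,
                               size_t start, size_t n)
{
    int64_t s = 0;
    size_t idx = start;
    for (size_t i = 0; i < (n >> 1); i++) {
        int64_t hi = buf[idx];
        int64_t lo = buf[idx + 1];
        s += hi * hi + lo * lo;              /* dual multiply-accumulate */
        idx = circ_advance(idx, 2, len);     /* post-update with wrap */
    }
    return s;
}
```

Starting two elements before the end of a six-element buffer and reading four elements visits the last pair and then the first pair, which is the access pattern the explicit circular load intrinsic is meant to support.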

In Chapter 5, we go through some more examples in detail.

3.8 Checking Configuration Options in C/C++ Code

It can be convenient for code to know which optional features are available. The file
<xtensa/config/core.h> contains a set of #defines that are set to 1 or 0, depending on the
particular Fusion F1 configuration being targeted.

Code Description
#define XCHAL_HAVE_FUSION Fusion
#define XCHAL_HAVE_FUSION_FP Fusion FP option
#define XCHAL_HAVE_FUSION_LOW_POWER Fusion Low Power option
#define XCHAL_HAVE_FUSION_AES Fusion BLE/Wifi AES-128 CCM
option
#define XCHAL_HAVE_FUSION_CONVENC Fusion Conv Encode option
#define XCHAL_HAVE_FUSION_LFSR_CRC Fusion LFSR-CRC option
#define XCHAL_HAVE_FUSION_BITOPS Fusion Bit Operations Support
option
#define XCHAL_HAVE_FUSION_AVS Fusion AVS option
#define XCHAL_HAVE_FUSION_16BIT_BASEBAND 1 Fusion 16-bit Quad Mac Unit
#define XCHAL_HAVE_FUSION_VITERBI 1 Fusion Viterbi option
#define XCHAL_HAVE_FUSION_SOFTDEMAP 1 Fusion Soft Bit Demap option
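
A typical use of these defines is to select a code path at compile time. The following is a minimal sketch: build_variant is a hypothetical application function, and the __XTENSA__ guard assumes that the target compiler predefines that macro, so that the same file also builds on a host where <xtensa/config/core.h> is not available.

```c
#if defined(__XTENSA__)
#include <xtensa/config/core.h>      /* provides the XCHAL_HAVE_* defines */
#endif

/* On a host build the defines are absent, so treat every option as 0. */
#ifndef XCHAL_HAVE_FUSION
#define XCHAL_HAVE_FUSION 0
#endif
#ifndef XCHAL_HAVE_FUSION_AVS
#define XCHAL_HAVE_FUSION_AVS 0
#endif

/* Hypothetical application function: reports which code path was built. */
static const char *build_variant(void)
{
#if XCHAL_HAVE_FUSION && XCHAL_HAVE_FUSION_AVS
    return "fusion+avs";             /* AVS-only intrinsics may be used here */
#elif XCHAL_HAVE_FUSION
    return "fusion";                 /* base Fusion intrinsics only */
#else
    return "generic";                /* portable C fallback */
#endif
}
```

Because the defines are always 0 or 1 on a Fusion F1 configuration, they can be tested with #if rather than #ifdef.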

3.9 HiFi 3 Code Portability


With the AVS option, the Fusion DSP implements all HiFi 3 C types, intrinsics, and operator
overloads to ensure that existing HiFi 3 C and C++ target source code can compile and run
on a Fusion DSP processor. HiFi 3 assembly target code must be manually modified to build
and run on a Fusion DSP processor. The Fusion DSP and HiFi 3 ISAs are not binary
compatible—a binary generated for a HiFi 3 processor will not execute correctly on a Fusion
DSP processor. As listed in the last chapter, some HiFi 3 operations, particularly those that
do four multiplies or two 32x32-bit multiplies, are implemented using two-instruction
sequences.


3.10 HiFi 2 and HiFi Mini Code Portability


The Fusion DSP implements all HiFi 2 C types, intrinsics, and operator overloads to ensure
that existing HiFi 2 C and C++ target source code can compile and run on a Fusion DSP
processor. HiFi 2 assembly target code must be manually modified to build and run on a
Fusion DSP processor. The Fusion DSP and HiFi 2 ISAs are not binary compatible: a binary
compiled for a HiFi 2 processor will not execute correctly on a Fusion DSP processor.

The HiFi Mini DSP supports 2-way SIMD 8-bit load instructions AE_LP8X2F.I and
AE_LP8X2F.IU that have no equivalent on Fusion. Fusion instead supports 4-way SIMD 8-
bit loads.

Following are several guidelines for porting HiFi 2 target code to Fusion DSP:

• Mapping: Refer to the operations and intrinsics (C syntax) in Chapter 2 for notes on
the HiFi 2-to-Fusion DSP operation and intrinsic mapping.

• Precision: To ensure efficient execution of existing HiFi 2 code on Fusion DSP as
well as efficient Fusion DSP hardware implementation, some HiFi 2-specific
intrinsics (DSP operations, loads, and stores) provide wider precision than the
intrinsics available in the HiFi 2 ISA. For example, the AE_ADDP24 intrinsic is
implemented through operation AE_ADD32: if a computation overflowed the 24 bits
in HiFi 2, in Fusion DSP the computation will maintain the extra precision in the 8
MSBs of each 32-bit AE_DR element. If a HiFi 2 application assumes wrap-around
due to the limited register width, it may need to be fixed to ensure correct execution
on Fusion DSP.

• Performance: To ensure efficient Fusion DSP hardware implementation, some
HiFi 2 intrinsics that map to a single operation in the HiFi 2 ISA are implemented
through a sequence of two or more operations in Fusion DSP. For example, HiFi 2
intrinsic AE_MULZASFQ32SP16S_HH is implemented through a sequence of four
operations in Fusion DSP. If a HiFi 2 application relies on such intrinsics, it may need
to be manually reoptimized to ensure efficient execution on Fusion DSP. However,
in many cases extra registers and MACs provided by Fusion DSP will be sufficient
to compensate.
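
The Precision guideline above can be illustrated with a small portable C model. This is a sketch only: wrap24, add_keep_precision, and add_wrap24 are hypothetical names, and the model assumes that a 24-bit value is held sign-extended in the low 24 bits of a 32-bit element. It shows how code that depends on 24-bit wrap-around could restore that behavior explicitly.

```c
#include <stdint.h>

/* Hypothetical helper: reduce a value to 24 bits and sign-extend it,
 * reproducing the wrap-around of a 24-bit register. */
static int32_t wrap24(int32_t v)
{
    int32_t low = v & 0xFFFFFF;              /* keep the low 24 bits */
    return (low ^ 0x800000) - 0x800000;      /* sign-extend bit 23 */
}

/* Model of an add that keeps the extra precision in the upper 8 bits. */
static int32_t add_keep_precision(int32_t a, int32_t b)
{
    return a + b;
}

/* Model of an add whose result wraps to 24 bits. */
static int32_t add_wrap24(int32_t a, int32_t b)
{
    return wrap24(a + b);
}
```

Adding 1 to the largest positive 24-bit value gives 0x800000 when the extra precision is kept, but the most negative 24-bit value when the result wraps. An application that silently relied on the second result needs an explicit step of this kind.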

3.11 Important Compiler Switches


The following compiler switches are important:

• -mcoproc, as discussed in the Xtensa C Application Programmer’s Guide and
Xtensa C and C++ Compiler User’s Guide. In particular for Fusion DSP, the use of
this flag allows the compiler to emulate standard C/C++ operations using the Fusion
DSP instructions.

• Optimization level. When optimizing code, the code should be compiled with either
the -O2 or -O3 level of optimization. On average, -O3 will give higher performance,
but not always. It is recommended that critical functions be compiled both ways to
compare performance.

• Compiling for code size. Less performance-critical functions should be compiled with
-Os (in addition to either -O2 or -O3). This will meaningfully shrink the code size
required. In addition to saving on memory, smaller code might improve performance
on real systems with more limited instruction cache sizes.


4. Variable-Length Encode and Decode


With the AVS option, the Fusion DSP Instruction Set Architecture includes a set of
instructions that make it convenient for you to do variable-length encoding and decoding in
your software routines for the Fusion DSP. This chapter will help you become familiar with
the Fusion DSP Huffman-related instructions and Huffman table formats.

4.1 Overview of Huffman Instructions


This section will orient you to the instructions that support variable-length encoding and
decoding, as well as raw bitstream reads and writes.

The instructions we supply to support variable-length (Huffman) encode/decode are very
generic in the sense that they place only the most minimal restrictions on the kind of table
hierarchies you use in your application. We expect every practical structure for variable-
length encoding and decoding to be efficiently implementable using the instructions we
supply. The programmer is free to choose the space/time tradeoffs that suit the application.
One of the main goals of this discussion is to help you understand the mechanism well
enough that you can make and exploit those choices.

In addition to the flexibility of table structure, we have the flexibility of instructions supporting
both 16- and 32-bit table entries. 16-bit table entries are expected to be superior in most
cases because they tend to save space over 32-bit entries. However, the option to use 32-
bit entries is important, because certain codebooks can make 16-bit table entries impossible
to use: the smaller entries cannot represent large table indices the way 32-bit entries can.
While 16-bit table entries will also give slower encoding for long codewords, we don't expect
this to be a major consideration because the difference is only a few cycles per symbol. In
keeping with the versatility of the mechanism, it is possible to use hierarchical tables with 32-
bit entries at some levels and 16-bit entries at others.

In the vast majority of implementations, 16-bit table entries will be the right choice.
Nonetheless, the instructions for 32-bit entries are there when they are needed.

4.1.1 Reading and Writing a Sequence of Raw Bits


The instructions for variable-length encoding and decoding are part of a larger family of
instructions designed to support highly-efficient processing of bitstream input and output. In
addition to the instructions for encoding and decoding, there are instructions to retrieve a
sequence of raw bits from an input stream and there are instructions to write a sequence of
raw bits to an output stream. The one major restriction is that only one input bitstream or one
output bitstream can be active at a given time without a significant sacrifice of efficiency. To
explain, there is a single set of state registers that underpin the implementation of the whole
family of instructions, and that collection of state pertains to a single stream. To switch from
reading to writing, or even just to switch from one input (output) stream to another input
(output) stream, all of the underlying state would typically need to be saved to memory and
reinitialized. While this restriction is typically not a problem for audio and voice codec
applications, programmers must nevertheless be aware of it.

4.2 Encoding
Since encoding usually has fewer worthwhile table-structure variants than decoding, we will
describe the encode side first and then move to the more complicated considerations around
decoding.

The examples shown in Section 4.4 structure their tables in a couple of the most commonly
used ways. You will certainly encounter cases, for example in at least one of WMA’s
codebooks, where you will want to implement a different structure for the tables.

For encoding, the usual technique is simple: Translate the symbol to be coded into a table
index, and use that index to retrieve a sequence of codeword bits and a codeword length
from a table or a pair of tables. Usually table entries for each codebook are just long enough
to hold the longest codeword, but in the present mechanism we wanted to provide a way to
keep the codeword length from being dependent on either the size of the table entries or on
other aspects of the implementation. So in our scheme, depending on the length of the
longest codeword, it might be that some codewords don't fit within a single table entry. When
this situation happens, the first lookup in the encoding table provides not only a portion of the
codeword, but also the index of the location in the table to look for the next codeword
segment. Each lookup in the encoding table either completes the codeword or yields an index
for the next lookup. In the case of 32-bit table entries, a second lookup is required only if the
codeword exceeds 16 bits in length. In the case of 16-bit table entries, codewords longer
than 11 bits will require a second and possibly subsequent lookups.


4.2.1 What Encoding a Symbol Looks Like


Here’s what the process of encoding one symbol looks like. In memory we have an encoding
table indexed by symbol value, and each entry in the table is one of the 16- or 32-bit table
entries we’ve been discussing. We look in the table and retrieve the encoding entry
corresponding to the symbol we want to encode. In that table entry we either find the entire
codeword corresponding to our symbol, along with an indication of the codeword’s length in
bits; or we find that the entire codeword is too long to fit in the table entry. A bit in the table
entry indicates which of the two cases occurred.

In the first case, we are finished encoding the present symbol once we push the found
codeword bits onto the output bitstream.

The second case is a little more interesting. In the second case, we get some bits of the
codeword from the table entry, and those are pushed onto the output stream, but there are
more codeword bits still to come that could not be accommodated in a single table entry.
When this happens, the first table entry tells us the index of another table entry that will give
us another segment of the codeword’s bit sequence. Once we retrieve the second table entry
based on the new index, we are back in the same situation: either this table entry completes
the codeword, or yet another lookup is required. Table entries needed to support lookups
beyond the first one for each symbol would generally appear at the end of the table, just
beyond the symbol-indexed part.

The length of the codebook’s longest codeword and your decision about whether to use 16-
or 32-bit table entries will bound the number of lookups required to encode a symbol. In
practice, three or more lookups per symbol will be rare with 32-bit table entries (Editor’s note:
we are not aware of any codebooks used in audio that would require three lookups for any
symbol), and four or more will be rare with 16-bit entries.
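
The lookup chain described above can be sketched in portable C. The entry layout below is hypothetical: it spreads the fields over a struct for readability, whereas the real 16- and 32-bit entries pack them into a single word in a format not reproduced here. The bit sink is likewise a stand-in that holds at most 64 bits, which is enough for a demonstration.

```c
#include <stdint.h>

/* Hypothetical, unpacked encoding table entry. */
typedef struct {
    uint16_t bits;      /* codeword segment, right-aligned */
    uint8_t  len;       /* number of valid bits in the segment */
    uint8_t  complete;  /* 1: codeword finished, 0: another lookup needed */
    uint16_t next;      /* index of the next entry when not complete */
} enc_entry_t;

/* Hypothetical bit sink that keeps at most 64 bits. */
typedef struct {
    uint64_t bits;
    unsigned count;
} bit_sink_t;

static void push_bits(bit_sink_t *out, uint16_t bits, unsigned len)
{
    out->bits = (out->bits << len) | bits;
    out->count += len;
}

/* Encode one symbol; returns the number of table lookups performed. */
static int encode_symbol(const enc_entry_t *table, unsigned symbol,
                         bit_sink_t *out)
{
    unsigned idx = symbol;          /* first lookup is indexed by the symbol */
    int lookups = 0;
    for (;;) {
        const enc_entry_t *e = &table[idx];
        push_bits(out, e->bits, e->len);
        lookups++;
        if (e->complete)
            return lookups;
        idx = e->next;              /* continuation entry beyond the symbol-indexed part */
    }
}
```

For a three-symbol alphabet, the table would hold three symbol-indexed entries followed by any continuation entries, so a symbol whose codeword does not fit in one entry costs one extra lookup per additional segment.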

4.2.2 Encoding Table Lookup Instruction Sequence


Each encoding table lookup operation consists of a sequence of two instructions: ae_vlel16t
(or ae_vlel32t if you are using 32-bit table entries) and ae_vles16c. ae_vlel{16|32}t loads a
table entry based on the current symbol value, and ae_vles16c pushes the segment of bits
onto the bitstream being written, flushing 16 stream bits to memory if that many are available.
The instruction mnemonics are as follows: Audio Engine Variable-Length Encode, Load
{16|32}-bit Table entry; Audio Engine Variable-Length Encode, Store 16 stream bits
Conditional (reflecting the fact that the bitstream is stored to memory in 16-bit chunks). “Audio
Engine” in this context refers to the “AE” part of Fusion DSP.

4.3 Decoding
The decoding process is more complicated than encoding because codewords have variable
length. If we could afford a huge table, we could just pad all the codewords out to the length
of the longest codeword (with bits from the bitstream), and use the resulting string of bits as
an index into a single giant table where we would find an entry telling us the symbol value
and the number of bits in the codeword. Note that the lookup has to tell us the number of bits
in the codeword so we know how many bits to discard from the head of the bitstream we are
reading before doing the next decoding operation.

As with encoding, we look up entries for decoding in a table. But unlike encoding where the
alphabet size determined the size of the initial table, the decoding process has power-of-two
table sizes that are decided by you according to the space/time tradeoffs you want to make.
Decoding takes place through a hierarchy of tables where the size of each table in the
hierarchy is up to you (within limits, of course). A table can have as few as two entries, in
which case it is essentially a node in a binary tree where a single bit of the codeword guides
the decoding process to the next step, or as many as 65536 entries where a 16-bit chunk of
the bitstream forms the table index.

4.3.1 Supported Decoding Structure Examples


FAAD2, the freeware MPEG-AAC decoder, uses a binary tree as one of its basic table
structures. Decoding begins with a two-entry table at the root and proceeds one bit at a time
to a new two-entry table for each codeword bit, until the end of the codeword is reached and
the table entry contains the symbol value.

FAAD2 uses a so-called two-step table as the other of its basic table structures. K bits at the
head of the stream are used to index into the first table. (Depending on the codebook, K is
either five or six.) The entry found in the first table gives an index into the second table, which
is essentially made up of consecutively placed subtables of various sizes. The index from
the first table entry tells where the appropriate subtable begins. Each subtable in the second
table corresponds to one or more K-bit combinations that might appear at the head of the
bitstream. If the codeword is longer than K bits, the entry from the first table also tells how
many bits are used to index into the subtable. If the codeword has K bits or fewer, the
corresponding subtable has only one entry so no additional bits are used as an index into it.
The entry found in the second table by indexing using the appropriate number of bits off the
base given in the first table entry gives the decoded symbol value and the codeword length.
This sounds complicated, but it isn't as bad as it sounds.

WMA uses a hierarchically-structured table consisting of 4-ary tree nodes and binary tree
nodes. The eight levels closest to the root in the tree consist of 4-ary tree nodes, and the
remaining six levels are binary.

Our decoding support permits us to structure our decode essentially according to any of
those example schemes, or indeed according to a wide variety of other schemes as well. Our
Fusion DSP variable-length encoding and decoding instructions also permit us more efficient
use of the bits in table entries than the generic-processor implementations, meaning that for
a given table organization scheme, the tables to drive our instructions are smaller than those
in the corresponding generic implementation. And, of course, our decoding operations are
faster as well.

When we begin decoding a codeword, we start at the root of the decoding table hierarchy
and use a prefix of the bitstream to look up a table entry. As mentioned before, the length of
this prefix is determined when the table hierarchy is designed. Once we have a table entry,
there are two cases much like there were for encoding, and again a bit in the table entry
distinguishes between the two.

In the first case, the codeword is short enough that we are done decoding it and the table
entry tells us the symbol corresponding to the codeword, along with the number of bits
occupied by the codeword at the head of the stream. Note that the number of bits used to
index into the table might be greater than the length of the codeword, in which case there
are duplicate table entries, one for each combination of the “don't care” bits that follow the
codeword in the stream.

In the second case, the codeword is longer than an index into the table. In this case, we have
not yet found the symbol corresponding to our codeword (because we have not yet looked
at all the codeword bits). In this case, the table entry tells us where to find the next table and
the number of bits to use as an index into that table. The bits we need to discard from the
head of the stream are exactly those that we used as the table index, so the table entry itself
need not have any direct indication of the number of bits to discard. Upon knowing the base
of the next table in the hierarchy for this codeword and discarding the bits that made up the
index we used for the first table, we are back in the same situation as when we began
decoding: We have a table into which we will index according to a set number of bits at the
head of the bitstream. The process repeats until we find ourselves in the first case with our
symbol in hand.
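
The two cases above can be sketched as a host-side C model of a table hierarchy walk. As with any software model of this mechanism, the entry layout is hypothetical (the fields are unpacked into a struct rather than packed into a 16- or 32-bit word), and the bit source holds at most 32 bits, most significant bit first.

```c
#include <stdint.h>

/* Hypothetical, unpacked decoding table entry. */
typedef struct {
    uint8_t  complete;  /* 1: symbol found, 0: continue in another table */
    uint8_t  nbits;     /* complete: codeword bits to discard;
                           otherwise: index width of the next table */
    uint16_t value;     /* complete: symbol; otherwise: base of next table */
} dec_entry_t;

/* Hypothetical bit source: up to 32 bits, most significant bit first. */
typedef struct {
    uint32_t bits;
    unsigned pos;       /* number of bits already consumed (0..32) */
} bit_source_t;

/* Return the next `width` bits (1..16) without consuming them. */
static unsigned peek_bits(const bit_source_t *in, unsigned width)
{
    uint64_t window = (uint64_t)in->bits << in->pos;
    return (unsigned)((window >> (32 - width)) & ((1u << width) - 1u));
}

static unsigned decode_symbol(const dec_entry_t *table, unsigned root_width,
                              bit_source_t *in)
{
    unsigned base = 0, width = root_width;
    for (;;) {
        const dec_entry_t *e = &table[base + peek_bits(in, width)];
        if (e->complete) {
            in->pos += e->nbits;    /* first case: discard only the codeword bits */
            return e->value;
        }
        in->pos += width;           /* second case: discard the index bits */
        base = e->value;
        width = e->nbits;
    }
}
```

With a 2-bit root table and the codewords 0, 10, 110, and 111, the entries for indices 00 and 01 are duplicates of the 1-bit codeword, and index 11 points to a two-entry subtable that resolves the third bit.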

4.3.2 Decoding Table Lookup Instruction Sequence


Each decoding table lookup operation consists of a sequence of two instructions, ae_vldl16t
(or ae_vldl32t if you are using 32-bit table entries) and ae_vldl16c. ae_vldl{16|32}t loads a
table entry based on the bits currently at the head of the bitstream, and ae_vldl16c refreshes
the head of the bitstream from memory if necessary.

The instruction mnemonics are as follows: Audio Engine Variable-Length Decode, Load
{16|32}-bit Table entry; Audio Engine Variable-Length Decode, Load 16 stream bits
Conditional (reflecting the fact that the bitstream is refreshed from memory in 16-bit chunks).
“Audio Engine” in this context refers to the “AE” part of Fusion DSP.

4.4 Encode/Decode Examples


Within a C routine that uses encoding, a speed-optimized encoding sequence looks like this:

xtbool complete;
unsigned int symbol;
unsigned short *table;
...
not_done:
    AE_VLEL16T(complete, symbol, table);
    AE_VLES16C(stream);
    if (!complete) {
#pragma frequency_hint NEVER
        goto not_done;
    }
...

With the above sequence, the Xtensa C compiler generates assembly code like the following:

not_done:
    ae_vlel16t b0, a3, a9 /* First lookup likely to succeed. */
    ae_vles16c a2
    bf b0, not_done /* Avoid branch delay in common case. */
done_encoding:
    ...

If, for example, you know that your encoding table structure is only one layer deep, you can
optimize the code more.

For decoding, the optimal code implementation will depend on the structure of your tables,
although it is possible to build a single routine that works very fast with all the possible
structures. A single decoding step might be enough most of the time if your top-level table
uses a 5-bit index. In such a case, the best way to decode is the simplest, and is exactly
analogous to the encoding code above:

xtbool complete;
unsigned int symbol;
unsigned short *table;
...
not_done:
    AE_VLDL16T(complete, symbol, table);
    AE_VLDL16C(stream);
    if (!complete) {
#pragma frequency_hint NEVER
        goto not_done;
    }
...

The above sequence in C should yield assembly code like the following:

    ...
not_done:
    ae_vldl16t b0, a9, a4
    ae_vldl16c a2
    bf b0, not_done
done_decoding:
    ...

On the other hand, if you build your tables as a binary tree, you are unlikely to find any
symbol within a single decoding step. In this case, if you need every last bit of decoding
speed, you can use something like the following example, a fast, generic implementation
that handles lookups deep in the table hierarchy with fewer branch delays than the simple
loop above:

not_done:
    loopnez a0, .Loopend    /* use stack pointer as while (1) loop counter */
    ae_vldl16t b0, a9, a4
    ae_vldl16c a2
    bt b0, done_decoding
.Loopend:
    j not_done              /* more lookup iterations than the stack pointer?!? */

In conclusion, the Fusion DSP supplies a generic set of instructions to support variable-length
(Huffman) encode/decode. These instructions place only minimal restrictions on the kind of
table hierarchies you use in your application.

5. Fusion DSP Examples


In this chapter, we cover a few examples in more detail, showing how to optimize the
examples for the Fusion DSP.

5.1 Correlation/Convolutional/FIR Coding


The following example shows how to efficiently implement a 32-bit correlation (or
equivalently, a convolution or block FIR filter) using the Fusion DSP.

We start with the following simple reference code, written using standard C.

void fir_ref ( int * __restrict y,        // [n]
               const int *__restrict x,   // [m+n]
               const int *__restrict h,   // [m]
               unsigned int n, unsigned int m
             )
{
    unsigned int i, j;
    long long sum;

    for (i = 0; i < n; i++) {
        sum = 0;
        for (j = 0; j < m; j++) {
            sum += (long long) x[i+j] * (long long) h[j];
        }
        y[i] = sum >> 31;
    }
}

We use a 64-bit accumulator for all the intermediate calculations. When we have completed
one output point, we use a shift to throw away the bottom fractional bits.
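
To make the fixed-point scaling concrete, here is a small portable sketch of one output point, assuming the inputs are Q1.31 fractions (an interpretation consistent with the shift by 31, but not stated for the reference code itself). The helper name `corr_point_q31` is hypothetical. Each product of two Q1.31 values carries 62 fractional bits, so shifting the 64-bit sum right by 31 returns a value with 31 fractional bits.

```c
#include <assert.h>
#include <stdint.h>

/* One output point of the reference correlation on Q1.31 data.
 * Note: right-shifting a negative signed value is implementation-defined in C;
 * common compilers perform an arithmetic shift. */
static int32_t corr_point_q31(const int32_t *x, const int32_t *h, unsigned m)
{
    int64_t sum = 0;                             /* 64-bit accumulator */
    assert(m > 0);
    for (unsigned j = 0; j < m; j++)
        sum += (int64_t)x[j] * (int64_t)h[j];    /* product has 62 fractional bits */
    return (int32_t)(sum >> 31);                 /* drop the bottom fractional bits */
}
```

For example, with x = {0.5, 0.25} and h = {0.5, 0.5} in Q1.31 (0x40000000 and 0x20000000), the result is 0.375, or 0x30000000.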

Next we use an intrinsic to generate more efficient but bit-exact code.

void fir_opt ( int * __restrict y,        // [n]
               const int *__restrict x,   // [m+n]
               const int *__restrict h,   // [m]
               unsigned int n, unsigned int m
             )
{
    unsigned int i, j;
    ae_f64 sum;

    for (i = 0; i < n; i++) {
        sum = 0LL;
        for (j = 0; j < m; j++) {
            AE_MULAF32S_LL(sum, x[i+j], h[j]);
        }
        y[i] = (long long) (sum >> 32);
    }
}

We choose to use the fractional data type ae_f64 and the fractional multiply-accumulate
intrinsic AE_MULAF32S_LL. This intrinsic saturates the result to 64 bits, avoiding the danger
of overflow. It produces a 1.63 result, which is later shifted to the right by 32 bits to
produce a 1.31 result.
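
The relationship between the two shifts (31 in the reference code, 32 in the fractional version) can be checked with a portable model. This is a sketch only: `mulaf32_model` and `sat_add64` are hypothetical helpers that assume a fractional multiply doubles the integer product so that a 1.31 by 1.31 multiply yields a 1.63 value, and that accumulation saturates at the 64-bit limits. It is not a bit-accurate description of AE_MULAF32S_LL.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating 64-bit addition. */
static int64_t sat_add64(int64_t a, int64_t b)
{
    if (b > 0 && a > INT64_MAX - b) return INT64_MAX;
    if (b < 0 && a < INT64_MIN - b) return INT64_MIN;
    return a + b;
}

/* Hypothetical model of a saturating fractional multiply-accumulate:
 * Q1.31 x Q1.31 -> Q1.63, added into a Q1.63 accumulator. */
static int64_t mulaf32_model(int64_t acc, int32_t a, int32_t b)
{
    int64_t p;
    if (a == INT32_MIN && b == INT32_MIN)
        p = INT64_MAX;                        /* (-1) * (-1) = +1 is not representable */
    else
        p = ((int64_t)a * (int64_t)b) * 2;    /* doubling aligns the product to Q1.63 */
    return sat_add64(acc, p);
}
```

Under these assumptions, shifting the doubled product right by 32 gives the same value as shifting the plain integer product right by 31, which is why the two versions of the filter can agree as long as no saturation occurs.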

Next, we utilize SIMD to perform two iterations in parallel. We have a choice in which loop to
run SIMD. If we run two iterations of the j loop in parallel, then in each iteration, we will need
to access x[i+j] and x[i+j+1]. For odd values of i, those two values will not be aligned, and we
need to use aligning loads. While certainly feasible, that does add some overhead.
Alternatively, we can run two iterations of the i loop in parallel. Using pseudo code, the inner
loop is computing y[i:i+1] += x[i+j:i+j+1]*h[j]. Note that in the first j iteration, we are using
x[i:i+1] while in the second we are using overlapping data x[i+1:i+2]. We cannot simply utilize
a SIMD load that will get the right data in both even and odd iterations. Instead, we also unroll
the j loop by two. In the first unrolled iteration, we use x[i:i+1] and x[i+1:i+2]. In the next
iteration, we use data that is exactly two elements ahead: x[i+2:i+3] and x[i+3:i+4]. The code
is shown below.

void fir_opt ( int * __restrict y,        // [n]
               const int *__restrict x,   // [m+n]
               const int *__restrict h,   // [m]
               unsigned int n, unsigned int m
             )
{
    unsigned int i, j;
    ae_f64 sum0, sum1;
    ae_f32x2 *xp = (ae_f32x2 *) x;
    ae_f32x2 *hp = (ae_f32x2 *) h;
    ae_f32x2 *yp = (ae_f32x2 *) y;

    for (i = 0; i < n/2; i++) {
        sum0 = 0LL;
        sum1 = 0LL;
        for (j = 0; j < m/2; j++) {
            AE_MULAF32S_HH(sum0, xp[i+j], hp[j]);
            AE_MULAF32S_LL(sum0, xp[i+j], hp[j]);
            AE_MULAF32S_LH(sum1, xp[i+j], hp[j]);
            AE_MULAF32S_HL(sum1, xp[i+j+1], hp[j]);
        }
        yp[i] = AE_TRUNC32X2F64(sum0, sum1);
    }
}

Note that the first product, x[i+j]*h[j], uses the HH variant, AE_MULAF32S_HH. The Fusion
DSP processor loads vector elements in big endian order, i.e., the lower element in
memory goes into the higher half of the register. Note also that we have used the
AE_TRUNC32X2F64 intrinsic to truncate two 1.63 values into two 1.31 values using one
instruction.
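
The pairing of high and low halves can be verified off-target with a scalar model. The sketch below is an illustration, not Fusion DSP code: `pair32`, `load_pair`, and `corr_pairs` are hypothetical names, the pair type simply places the element at the lower address in field `h` as described above, and the final truncation is omitted so that raw 64-bit sums can be compared against a straightforward loop.

```c
#include <assert.h>
#include <stdint.h>

/* Two 32-bit elements; h holds the element at the lower address. */
typedef struct { int32_t h, l; } pair32;

static pair32 load_pair(const int32_t *p)
{
    pair32 v = { p[0], p[1] };
    return v;
}

/* Two outputs per outer iteration, two taps per inner iteration.
 * x must hold n + m elements, c holds m coefficients, y receives n raw sums. */
static void corr_pairs(int64_t *y, const int32_t *x, const int32_t *c,
                       unsigned n, unsigned m)
{
    assert(n % 2 == 0 && m % 2 == 0);
    for (unsigned i = 0; i < n / 2; i++) {
        int64_t sum0 = 0, sum1 = 0;
        for (unsigned j = 0; j < m / 2; j++) {
            pair32 xa = load_pair(x + 2 * (i + j));      /* x[2i+2j],   x[2i+2j+1] */
            pair32 xb = load_pair(x + 2 * (i + j + 1));  /* x[2i+2j+2], x[2i+2j+3] */
            pair32 cc = load_pair(c + 2 * j);            /* c[2j],      c[2j+1]    */
            sum0 += (int64_t)xa.h * cc.h;                /* HH */
            sum0 += (int64_t)xa.l * cc.l;                /* LL */
            sum1 += (int64_t)xa.l * cc.h;                /* LH */
            sum1 += (int64_t)xb.h * cc.l;                /* HL, from the next pair */
        }
        y[2 * i] = sum0;
        y[2 * i + 1] = sum1;
    }
}
```

With x = {1, 2, 3, 4, 5, 6} and coefficients {10, 1}, each output is 10*x[i] + x[i+1], so the four results are 12, 23, 34, and 45.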

Note that we traverse both the coefficient array and the data array in the forward direction,
while a typical formulation often accesses the data in the reverse direction (x[M+N-1+i-j]
* h[j]). To use the reverse formulation, we must traverse the data array in the reverse
direction using the .RIP instructions. It is not possible to use the .RIP instructions with
implicit loads, so we must use explicit intrinsics.

5.2 Floating-point FIR


Now we consider a floating point FIR on the optional floating point unit. The reference code
is quite simple.

void fir_ref ( float * __restrict y, const float *__restrict x,
               const float *__restrict h,
               int n, int m)
{
#pragma aligned(x,8)
#pragma aligned(h,8)
#pragma aligned(y,8)
    int i, j;
    for (i = 0; i < n; i++) {
        y[i] = 0;
        for (j = 0; j < m; j++) {
            y[i] += x[i+j]*h[j];
        }
    }
}

The code is all standard C except that we have used pragmas to tell the compiler that all the
arrays are aligned on 8-byte boundaries. While those pragmas aren’t necessary, they allow
the XCC vectorizer to generate somewhat more efficient code. When compiling with
-O3 -LNO:simd, the XCC compiler vectorizes the inner j loop and unrolls the j loop by four.
The resultant code performs eight multiply-accumulate operations in every iteration and
schedules in nine cycles on configurations with the Reduced MAC Latency option, close to
the ideal limit of eight.

To understand what the vectorizer does, and to try to get the ideal schedule, let us vectorize
the code using intrinsics. First, let us vectorize exactly equivalently to how we vectorized the
integer example.

void fir_opt ( float * __restrict y, const float *__restrict x,
               const float *__restrict h,
               int n, int m)
{
    int i, j;
    float sum0,sum1;
    xtfloatx2 *xp = (xtfloatx2 *) x;
    xtfloatx2 *hp = (xtfloatx2 *) h;
    xtfloatx2 *yp = (xtfloatx2 *) y;

    for (i = 0; i < n/2; i++) {
        sum0 = 0.0F;
        sum1 = 0.0F;
        for (j = 0; j < m/2; j++) {
            XT_MADD_LHH_S(sum0, xp[i+j], hp[j]);
            XT_MADD_LLL_S(sum0, xp[i+j], hp[j]);
            XT_MADD_LLH_S(sum1, xp[i+j], hp[j]);
            XT_MADD_LHL_S(sum1, xp[i+j+1], hp[j]);
        }
        XT_SSX2_L_IP(sum0, sum1, yp);
    }
}

Note that no truncation is needed with floating point. We have explicitly used the
XT_SSX2_L_IP intrinsic that allows us to store two floating point values using one store. On
configurations using the Reduced MAC Latency option, the inner loop schedules in a perfect
four cycles for four iterations. However, on full power configurations, the MADD operations
have four cycles of latency. In every iteration of the inner loop, we are accumulating twice
into the same accumulator, causing eight cycles every iteration.

We can double the inner loop performance by unrolling the outer loop by four instead of by
two, resulting in the following code.

    for (i = 0; i < n/4; i++) {
        sum0 = 0.0F;
        sum1 = 0.0F;
        sum2 = 0.0F;
        sum3 = 0.0F;
        for (j = 0; j < m/2; j++) {
            XT_MADD_LHH_S(sum0, xp[2*i+j], hp[j]);
            XT_MADD_LLL_S(sum0, xp[2*i+j], hp[j]);
            XT_MADD_LLH_S(sum1, xp[2*i+j], hp[j]);
            XT_MADD_LHL_S(sum1, xp[2*i+j+1], hp[j]);

            XT_MADD_LHH_S(sum2, xp[2*i+j+1], hp[j]);
            XT_MADD_LLL_S(sum2, xp[2*i+j+1], hp[j]);
            XT_MADD_LLH_S(sum3, xp[2*i+j+1], hp[j]);
            XT_MADD_LHL_S(sum3, xp[2*i+j+2], hp[j]);
        }
        XT_SSX2_L_IP(sum0, sum1, yp);
        XT_SSX2_L_IP(sum2, sum3, yp);
    }

Now the compiler is able to generate a perfect eight cycle schedule for eight MADD
operations.
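
The idea behind unrolling the outer loop is to keep four independent accumulation chains in flight, so that a multiply-add with four cycles of latency never has to wait for its own previous result. The following portable sketch shows the same structure in scalar C without intrinsics. The function name `fir_unroll4_f` is hypothetical, and whether a given compiler schedules it as well as the intrinsic version is not implied.

```c
#include <assert.h>

/* Four outputs per outer iteration, each with its own accumulator.
 * x must hold n + m elements; n must be a multiple of four. */
static void fir_unroll4_f(float *y, const float *x, const float *h, int n, int m)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (int j = 0; j < m; j++) {
            float c = h[j];
            s0 += x[i + j] * c;        /* the four updates do not depend on each other */
            s1 += x[i + j + 1] * c;
            s2 += x[i + j + 2] * c;
            s3 += x[i + j + 3] * c;
        }
        y[i] = s0;
        y[i + 1] = s1;
        y[i + 2] = s2;
        y[i + 3] = s3;
    }
}
```

Each output is still accumulated in the same tap order as in the reference code, so only the interleaving of the four sums changes.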

5.3 Fast Fourier Transform


The Fast Fourier Transform (FFT) algorithm is an optimized implementation of the Discrete
Fourier Transform, which is common in digital signal-processing applications. The following
is a simple radix-2 decimation-in-frequency FFT implementation that assumes n is a power
of 2 and is the number of complex elements stored in the data array. An efficient
implementation would use radix-4, but the radix-2 implementation is simpler to explain.
Consider the following simple reference implementation, where the data is 32-bit but the
twiddle factors are 16-bit.

void fft_ref( int data[], const short twid[],
              const unsigned int n)
{
    unsigned int i, j0, j1, k, b;
    int r = 0x4000;

    k = 0;
    for (b = n ; b > 2; b >>= 1) {
        unsigned int b2 = b >> 1;
        for (i = 0; i < b/4; i += 1) {
            short wr = twid[2 * k + 0];
            short wi = twid[2 * k + 1];
            for (j0 = j1 = i; j0 < n; ) {
                int d0r, d0i, d1r, d1i;
                int tr, ti;
                int r0r, r0i, r1r, r1i;

                d0r = data[2 * j0 + 0];
                d0i = data[2 * j0 + 1];
                j0 += b2;

                d1r = data[2 * j0 + 0];
                d1i = data[2 * j0 + 1];
                j0 += b2;

                r0r = d0r + d1r;
                r0i = d0i + d1i;

                tr = d0r - d1r;
                ti = d0i - d1i;

                r1r =((long long)wr*tr - (long long)wi*ti + r) >> 15;
                r1i =((long long)wr*ti + (long long)wi*tr + r) >> 15;

                data[2 * j1 + 0] = r0r;
                data[2 * j1 + 1] = r0i;
                j1 += b2;

                data[2 * j1 + 0] = r1r;
                data[2 * j1 + 1] = r1i;
                j1 += b2;
            }
            k += 1;
        }
    }
}
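
The butterfly at the heart of the reference loop can be isolated for testing. The sketch below restates the same arithmetic (sum, difference, then a rounded multiply of the difference by a 16-bit twiddle) with hypothetical helper names `cplx32`, `twiddle_mul_q15`, and `butterfly`. It assumes the twiddle is a Q1.15 fraction, which is consistent with the rounding constant 0x4000 and the shift by 15.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t re, im; } cplx32;

/* Rounded product of a 32-bit complex value with a Q1.15 twiddle (wr + j*wi).
 * Note: right-shifting a negative signed value is implementation-defined in C;
 * common compilers perform an arithmetic shift. */
static cplx32 twiddle_mul_q15(cplx32 t, int16_t wr, int16_t wi)
{
    const int64_t r = 0x4000;   /* half of one output LSB, for rounding */
    cplx32 out;
    out.re = (int32_t)(((int64_t)wr * t.re - (int64_t)wi * t.im + r) >> 15);
    out.im = (int32_t)(((int64_t)wr * t.im + (int64_t)wi * t.re + r) >> 15);
    return out;
}

/* In-place radix-2 decimation-in-frequency butterfly. */
static void butterfly(cplx32 *d0, cplx32 *d1, int16_t wr, int16_t wi)
{
    cplx32 sum  = { d0->re + d1->re, d0->im + d1->im };
    cplx32 diff = { d0->re - d1->re, d0->im - d1->im };
    *d0 = sum;
    *d1 = twiddle_mul_q15(diff, wr, wi);
}
```

For example, with a twiddle of 0.5 (wr = 0x4000, wi = 0), the inputs (3000, 5000) and (1000, 1000) produce the sum (4000, 6000) and the scaled difference (1000, 2000).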

The 2-way SIMD architecture of Fusion DSP maps nicely to this computation, as a 64-bit
register can hold a single, 32-bit complex data item. The 16-bit twiddle factors complicate the
algorithm a bit since we can store four twiddle factors in one register. Successive iterations
of i must access either the top two or the bottom two entries in the twiddle register. The
simplest way to handle this is to load the twiddle array every other iteration, and use selects
to copy the second pair of elements into proper position for the other iterations. Otherwise,
the conversion to Fusion DSP is very simple, and the resultant code is actually simpler than
the original. Note the use of the explicit complex multiply intrinsic AE_MULFC32X16RAS_H.

void fft_opt( ae_f32x2 data[], const ae_f16x4 twid[],
              const unsigned int n)
{
    unsigned int i, j0, j1, k, b;

    k = 0;

    for (b = n ; b > 2; b >>= 1) {
        unsigned int b2 = b >> 1;
        ae_f16x4 w;
        for (i = 0; i < b/4; i += 1) {
            if (!(i%2)) w = twid[k];
            else {
                w = AE_SEL16_5410(w, w);
                k += 1;
            }
            for (j0 = j1 = i; j0 < n; ) {
                ae_f32x2 d0, d1;
                ae_f32x2 t;
                ae_f32x2 r0, r1;

                d0 = data[j0];
                j0 += b2;

                d1 = data[j0];
                j0 += b2;

                r0 = d0 + d1;
                t = d0 - d1;

                r1 = AE_MULFC32X16RAS_H(t, w);

                data[j1] = r0;
                j1 += b2;

                data[j1] = r1;
                j1 += b2;
            }
        }
    }
}

According to the Xtensa C Application Programmer’s Guide, you should look at the generated
.S file to see how the compiler compiles the particular loop. By looking at the code next to
the inner loop, you can see that the compiler is not software pipelining the inner loop.

The compiler was not able to optimize the inner loop well because the compiler could not
calculate the number of iterations in the inner loop. If we rewrite the trip count calculation as
follows, the compiler is able to better optimize the inner loop.

void fft_opt( ae_int32x2 data[], const ae_int16x4 twid[],
              const unsigned int n)
{
    unsigned int i, k=0, b,trip;
    unsigned lg2 = 31-AE_NSAZ32_L(n);
    for (b = n ; b > 2; b >>= 1) {
        lg2--;
        unsigned int b2 = b >> 1;
        ae_int16x4 w;
        for (i = 0; i < b/4; i += 1) {
            if (!(i%2)) w = twid[k];
            else {
                w = AE_SEL16_5410(w, w);
                k += 1;
            }
            for (trip=0; trip < (n-i+b-1)>>lg2; trip++) {
                ae_int32x2 d0, d1;
                ae_int32x2 t;
                ae_int32x2 r0, r1;

                d0 = data[i+2*b2*trip];
                d1 = data[i+2*b2*trip+b2];

                r0 = d0 + d1;
                t = d0 - d1;
                r1 = AE_MULFC32X16RAS_H(t, w);

                data[i+2*b2*trip] = r0;
                data[i+2*b2*trip+b2] = r1;
            }
        }
    }
}
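
The closed-form trip count can be sanity-checked off-target. The sketch below assumes that the shift amount equals log2(b), which is what the lg2 bookkeeping in the code above appears to maintain, and that the original inner loop advances its index by b in total per iteration (two steps of b2). The helper names are hypothetical, and `ilog2` is a portable stand-in for the normalization intrinsic.

```c
#include <assert.h>

/* floor(log2(v)) for v > 0. */
static unsigned ilog2(unsigned v)
{
    unsigned r = 0;
    while (v >>= 1)
        r++;
    return r;
}

/* Count iterations the way the original loop runs: j0 = i, i + b, i + 2b, ... < n. */
static unsigned trips_by_walking(unsigned n, unsigned b, unsigned i)
{
    unsigned count = 0;
    for (unsigned j0 = i; j0 < n; j0 += b)
        count++;
    return count;
}

/* Closed form: ceil((n - i) / b) for a power-of-two b. */
static unsigned trips_closed_form(unsigned n, unsigned b, unsigned i)
{
    return (n - i + b - 1) >> ilog2(b);
}

/* Return 1 if both counts agree for every stage and every i used by the FFT loops. */
static int trip_counts_agree(unsigned n)
{
    for (unsigned b = n; b > 2; b >>= 1)
        for (unsigned i = 0; i < b / 4; i++)
            if (trips_by_walking(n, b, i) != trips_closed_form(n, b, i))
                return 0;
    return 1;
}
```

Because the closed form depends only on values known before the loop starts, the compiler can compute the iteration count up front, which is the property the rewritten loop relies on.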

Performance is better, but still not ideal. To achieve top performance, the compiler must
software pipeline the loop and execute loads from iteration trip+1 ahead of stores from the
previous iteration trip. However, the compiler will not move the loads up because it doesn’t
know if the loads and stores access the same memory. Therefore, you must move the loads
manually as follows.

void fft_opt( ae_int32x2 data[], const ae_int16x4 twid[],
              const unsigned int n)
{
    unsigned int i, k=0, b,trip;
    unsigned lg2 = 31-AE_NSAZ32_L(n);

    for (b = n ; b > 2; b >>= 1) {
        lg2--;
        unsigned int b2 = b >> 1;
        ae_int16x4 w;
        for (i = 0; i < b/4; i += 1) {
            if (!(i%2)) w = twid[k];
            else {
                w = AE_SEL16_5410(w, w);
                k += 1;
            }
            if (((n-i+b-1)>>lg2) > 0) {
                ae_int32x2 d0, d1;
                ae_int32x2 t;
                ae_int32x2 r0, r1;

                d0 = data[i];
                d1 = data[i+b2];
                for (trip=0; trip < ((n-i+b-1)>>lg2)-1; trip++) {
                    r0 = d0 + d1;
                    t = d0 - d1;
                    r1 = AE_MULFC32X16RAS_H(t, w);

                    d0 = data[i+2*b2*(trip+1)];
                    d1 = data[i+2*b2*(trip+1)+b2];

                    data[i+2*b2*trip] = r0;
                    data[i+2*b2*trip+b2] = r1;
                }
                r0 = d0 + d1;
                t = d0 - d1;
                r1 = AE_MULFC32X16RAS_H(t, w);

                data[i+2*b2*trip] = r0;
                data[i+2*b2*trip+b2] = r1;
            }
        }
    }
}

Before the inner loop, we load the two elements from the first iteration. In the inner loop, we
operate on the values loaded from the previous iteration and load the values for the next
iteration. After the inner loop, we complete the computation for the last iteration.

With these changes, the compiler is able to schedule the inner loop in four cycles per
iteration, the minimum possible due to the load/store bandwidth of the machine.
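
The prologue, steady-state, and epilogue structure described above is a general pattern that can be tried on any in-place loop whose iterations touch disjoint elements. The following portable sketch applies it to a simple sum and difference of strided pairs. The function names and the `count` and `stride` parameters are hypothetical and stand in for the trip count and b2 of the FFT loop.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: load, compute, and store within each iteration. */
static void pairs_simple(int *data, size_t count, size_t stride)
{
    for (size_t t = 0; t < count; t++) {
        int d0 = data[2 * stride * t];
        int d1 = data[2 * stride * t + stride];
        data[2 * stride * t] = d0 + d1;
        data[2 * stride * t + stride] = d0 - d1;
    }
}

/* Manually pipelined form: loads for iteration t+1 are issued before the
 * stores of iteration t. This is valid only because the two iterations
 * access different elements. */
static void pairs_pipelined(int *data, size_t count, size_t stride)
{
    if (count == 0)
        return;
    int d0 = data[0];                 /* prologue: loads for the first iteration */
    int d1 = data[stride];
    size_t t;
    for (t = 0; t + 1 < count; t++) {
        int r0 = d0 + d1;
        int r1 = d0 - d1;
        d0 = data[2 * stride * (t + 1)];           /* loads for the next iteration */
        d1 = data[2 * stride * (t + 1) + stride];
        data[2 * stride * t] = r0;                 /* stores for this iteration */
        data[2 * stride * t + stride] = r1;
    }
    data[2 * stride * t] = d0 + d1;   /* epilogue: finish the last iteration */
    data[2 * stride * t + stride] = d0 - d1;
}
```

Both functions produce identical results; the second simply makes the load-before-store ordering explicit so that the compiler does not have to prove that the accesses are independent.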

Note that the same performance could have been achieved by using the __restrict
attribute on two pointers for the input and output accesses of data, rather than manually
software pipelining the loop. However, this attribute is only allowed to be used when pointers
do not overlap, and the two pointers would in fact overlap.

6. Fusion F1 NatureDSP Signal Library


The Fusion F1 DSP has an associated generic DSP library that can be downloaded from the
XPG. The library comes with two Xplorer projects: fusion_library, which is the library itself,
and fusion_demo, a test application that exercises all the functions in the library. Both
are delivered in source form, allowing you to use the library as is or as a starting point
for your own development. The library contains functions for FIR filters, IIR filters, basic
math functions, matrix operations, and FFTs.

The doc folder of the library project contains a reference manual, NatureDSP Signal Library
Reference for Tensilica Fusion F1 DSP, which describes the library and the test program in
depth.

7. Implementation Methodology
The Fusion DSP is an optional coprocessor for the Xtensa LX core. Fusion DSP is provided
as a check box option in the Xplorer Processor Generator (XPG) interface in Xtensa Xplorer
(XX). This section includes guidelines for using the XPG to configure a Fusion DSP
coprocessor.

The last section in this chapter discusses synthesis and place-and-route.

7.1 Configuring a Fusion DSP


Configuring a Fusion DSP in the XPG is done by selecting the relevant check box options in
the Xplorer Configuration editor in the Processors window under the category Fusion DSP.
The additional sub-options, shown in Figure 7-1, should be properly selected.

As an alternative, Xplorer provides a number of templates for Fusion DSP. These are
described briefly in the next section. Each template selects both the Fusion DSP options
for a particular use case and other attributes of the configuration. You can then edit the
configuration further if your use case requires it. In that sense, templates are a
recommended starting point.

• FP
  Support for IEEE 754 single precision floating point. Floating point compute
  operations, including fused multiply-accumulates, can be issued in parallel with
  loads or stores. The compute operations work on scalar, 32-bit data. The load
  and store operations can load or store two-way SIMD, 32x2-bit data.

• AVS
  Support for software compatibility with HiFi 2, HiFi 3, and HiFi Mini audio, voice
  and speech codecs. Enables HiFi bitstream intrinsics as well as emulation
  intrinsics for the HiFi 3 quad multiplication instructions.

• Fusion Advanced Bit Manipulation Package
  This option enables speedup for bit-level operations commonly used in Baseband
  PHY and MAC standards such as Bluetooth, WiFi, and 3GPP, for:
  - CRC and Scrambling (Linear Feedback Shift Register) functions
  - Bit-level Convolutional Encode operations
  - Bit-level shuffling and manipulation, commonly used in Baseband PHY and
    MAC standards.

• BLE/Wi-Fi AES-128 CCM
  Support for efficient implementation of AES-128 CCM encryption functions, as
  required, for example, in Bluetooth Low Energy and Wi-Fi protocols.

• Fusion Reduced MAC Latency
  Without this option, multiply and bit-stream instructions are fully pipelined and can
  be issued every cycle but take an additional cycle to complete. With this option,
  they complete without additional cycles of latency. The optional floating point
  operations normally complete in four cycles, but with this option they complete in
  two. Using this option allows for smaller and lower energy hardware, but only when
  synthesizing at lower clock frequencies. This option limits the maximum frequency to
  approximately two-thirds of the normal maximum, and is generally beneficial only
  when synthesizing to at most one-third of the maximum frequency.

• Fusion 16-bit Quad MAC DSP
  This option allows for more efficient 16-bit DSP performance. In particular, this
  option enables complex quad 16-bit multiply instructions, real quad 16-bit multiply
  dot-product instructions, and specialized instructions to speed up 16-bit FFTs with
  dynamic scaling support.

• Fusion Viterbi Decoder
  This option adds support for efficient Viterbi decoder operations. The operations
  support 1/2 and 1/3 rate with arbitrary polynomials of constraint length 5 and 7.

• Fusion Soft Bit Demap
  This option adds support for 4/16/64/256-QAM soft bit demapping. Supported
  constellations and mappings are summarized in Table 2-16. Symbol mappings
  for 3GPP and WiFi use different Gray Encoding formats, and both are supported
  by the soft-bit demapper operations.

Figure 7-1 XPG Options for a Fusion DSP

Developers using caches should also configure the Cache Prefetch Entries from the
Interfaces window under the category PIF/Memory Interface Widths (refer to Figure 7-2). A
selection of 0 eliminates hardware prefetching from the configuration. Otherwise, eight or
16 entries are available. The latter provides slightly higher performance at the cost of
slightly more area. In addition, you should decide whether to enable prefetching directly to
L1. Prefetching into L1 typically improves performance, minimally on configurations with very
large delays to main memory and more significantly on systems with small delays to
secondary or main memory, but at the cost of additional hardware.

Figure 7-2 Configuring Hardware Prefetch

You can now customize the processor containing the Fusion DSP as described in the Xtensa
Development Tools Installation Guide. As you customize the processor, remember the
following restrictions:

• The Fusion DSP option must be selected.

• Because Fusion DSP is always coprocessor number 1, the number of coprocessors
  must be at least 2.

• Core multiplier options (for example, MUL16, MUL32) cannot be selected. Fusion
  DSP implements the 32-bit multiplier instructions contained in the MUL32 checkbox
  option directly within the Fusion DSP; thus, this checkbox is not needed. MUL16 is
  not available. MAC16 is available but is generally not useful on Fusion cores.

• If the FP sub-option is selected, the core Single or Double precision FP options
  cannot be selected. Single precision is implemented directly by Fusion DSP. Double
  precision is not compatible, although you can still select the double precision
  floating point accelerator option for faster double precision emulation.

• The Fusion DSP option is incompatible with the other DSP families.

• If the Viterbi Decoder and Soft Bit Demap options are not selected, the maximum
  instruction width must be at least six bytes, because the Fusion DSP has 48-bit
  instruction formats, and a 64-bit instruction fetch is required. The data interfaces to
  memory must be at least 64 bits wide. If you add your own formats that are larger
  than 48 bits (for example, 56 bits (7 bytes) or 64 bits (8 bytes)), then the maximum
  instruction width must be set accordingly. On the other hand, if the Viterbi Decoder
  or Soft Bit Demap option is selected, then a 64-bit format is created, requiring a
  maximum instruction width of 8 bytes. Maximum instruction widths greater than
  8 bytes (64 bits) are possible but not recommended, because they cause a
  relatively large increase in gate count.

• It is not possible for users to add their own additional 48-bit instruction format, as the
  current one in Fusion DSP is quite full and there is no space for an additional format
  of this size. However, users may add new 56-bit or 64-bit formats as discussed
  above when the Viterbi Decoder and Soft Bit Demap options are not selected.
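
The instruction-width rules in the list above can be summarized as a small decision function. This is only an illustrative sketch with a hypothetical function name and parameters; it encodes the rules as stated in this section (6 bytes by default, 8 bytes with the Viterbi Decoder or Soft Bit Demap option, and wider if a larger user format is added). The summary table in Appendix B remains the authoritative reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimum 'max instruction width in bytes' implied by the rules in this section.
 * largest_user_format_bits is 0 when no user-defined format is added. */
static unsigned min_instruction_width_bytes(bool viterbi, bool soft_bit_demap,
                                            unsigned largest_user_format_bits)
{
    /* 48-bit Fusion formats need 6 bytes; the 64-bit format needs 8. */
    unsigned bytes = (viterbi || soft_bit_demap) ? 8u : 6u;
    unsigned user_bytes = (largest_user_format_bits + 7u) / 8u;
    return user_bytes > bytes ? user_bytes : bytes;
}
```

For example, a configuration without either option and with a user-defined 56-bit format needs a width of 7 bytes, while selecting the Viterbi Decoder option alone requires 8 bytes.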

Once a processor has been configured and downloaded, it can be exercised in simulation.

7.2 Xplorer-provided Fusion DSP Templates


Xplorer provides a number of templates for Fusion DSP. Each template selects both the
Fusion DSP options for a particular use case and other attributes of the configuration.
You can then edit the configuration further if your use case requires it. In that sense,
templates are a recommended starting point. The list includes:

• XRC_FusionF1_All_Cache
  This template configures the Fusion DSP core for products that combine voice
  processing and sensor fusion applications that often require floating point support.
  Selected options include the quad 16x16 MAC, FPU, and AVS extensions.

  This template supports a cache memory subsystem and includes debug functionality.

• XRC_FusionF1_All_LM
  This template configures the Fusion DSP core for products that combine voice
  processing and sensor fusion applications that often require floating point support.
  Selected options include the quad 16x16 MAC, FPU, and AVS extensions.

  This template supports a local memory subsystem and includes debug functionality.

• XRC_FusionF1_802ah
  This template configures the Fusion DSP core for all narrowband wireless
  communications applications, including 802.11ah, by enabling communications ISA
  options including quad 16x16 MAC, Soft Bit Demap, Viterbi, and Advanced Bit
  Manipulations.

  This template supports a local memory subsystem and includes debug functionality.

7.3 Basic Fusion DSP Characteristics


Some of the relevant configuration characteristics of the Fusion DSP coprocessor include:

• Fusion DSP instruction set

• Boolean registers

• Sign extend to 32 bits

• NSA/NSAU instructions

• TIE arbitrary byte enables

• Density instructions

• Zero overhead loop instructions. Note that this option is not strictly required.
  However, audio codecs licensed by Cadence are compiled using these instructions,
  and not selecting these instructions can significantly increase the MCPS required by
  an application.

• 5- or 7-stage pipeline. However, note that this choice has several implications. A
  5-stage pipeline will result in a smaller configuration, but the maximum speed that it is
  possible to synthesize and lay out will be less than is possible with a 7-stage pipeline.
  In addition, larger local memories (e.g., 32 KB or larger) may operate better with a
  7-stage pipeline configuration that has extra memory access stages. Thus,
  depending on the application, consider these trade-offs.

• Cache prefetch entries.

• Without the Viterbi Decoder and Soft Bit Demap options selected, the instruction
  width (specified by the ‘max instruction width in bytes’ option in Xplorer) needs to be
  at least 6 bytes. If you add user formats greater than 6 bytes (48 bits), this must be
  increased. With the Viterbi Decoder or Soft Bit Demap option selected, the instruction
  width needs to be at least 8 bytes. However, an increase beyond 8 bytes (64 bits) is
  not recommended. A summary table describing the instruction width required for
  each option is provided in Appendix B.

• Data memory interface of at least eight bytes.

• Little-endian byte ordering (fixed).

7.4 Extending a Fusion DSP with User TIE


Fusion DSP can be extended with user TIE by defining new instructions. These new
instructions can be assigned to the 24-bit regular instruction format, can use one of the
existing 48- or 40-bit FLIX instruction formats, or can go in a new user-defined format. To use
the existing formats, simply use the TIE slot_opcode statement to place the new operation
in one of fusion_slot0, fusion_slot1, or fusion_slot40. When the Viterbi Decoder or Soft Bit
Demap option is selected, a 64-bit FLIX instruction format with two slots, fusion_slot64_0
and fusion_slot64_1, is created. New operations meant to benefit from parallel execution
should go in a FLIX format. Operations that might be used in parallel with existing
Fusion DSP operations should go in one of the slots of the existing instruction formats. If
you are creating a set of operations that are meant to be used in code separate from other
Fusion DSP code, it might be better to put them in a new format. Note that due to
encoding space limitations, you cannot create a new 48-bit format, but you can create a new
56-bit or 64-bit format (when the Viterbi Decoder and Soft Bit Demap options are not
selected) as described below, as well as new 40-bit formats. If the Viterbi Decoder and Soft
Bit Demap options are selected, no further format lengths can be added, but you can create
new 40-bit or 64-bit FLIX instruction formats. Formats beyond 64 bits are not recommended.

To create new larger formats when Viterbi Decoder and Soft Bit Demap options are not used,
you can use the following suggested TIE:

length big_length <one of 56/64> {InstBuf[4:0] == 5'b11111}
format F_user big_length {user_slot0, user_slot1,… }

When creating new instructions to put in the existing formats, consider the following points.

• The AR register file in Fusion DSP has two read ports and one write port in each of slots
  fusion_slot0 and fusion_slot1. Creating an operation that requires more than two
  reads or one write on the AR register file will increase the number of ports.

• The AE_DR register file has one read and one write port in fusion_slot0 and three
  read ports and one write port in fusion_slot1. When the Viterbi Decoder or Soft Bit
  Demap option is selected, the AE_DR register file has one read and one write port in
  fusion_slot64_0 and three read ports and two write ports in fusion_slot64_1. Creating
  an operation that has more operands in either slot will increase the number of ports in
  the machine and therefore will have a large hardware impact. Such operations
  should instead be limited to the non-FLIX fusion_slot40.

Single-cycle DSP instructions should read their AE_DR operands in stage Mstage and write
them in stage Mstage. Ideally, two-cycle DSP instructions should read their earliest AE_DR
operands in stage Mstage, and write their AE_DR operands in stage Mstage+1.

7.4.1 Utilizing Fusion DSP Resources


New instructions may utilize existing Fusion DSP resources. For example, it is possible to
create a new instruction that utilizes the AE_DR register file. Simply use the string AE_DR in
your operation arguments. Similarly, existing Fusion DSP states can be used by using the
names listed in Table 2-1 DSP Subsystem State Registers, Table 2-2 Bitstream and Variable-
length Encode/Decode Support Subsystem State Registers, Table 2-3 Circular Buffer
Support State Registers, or Table 2-4 Floating Point Support State Registers.

Existing instructions, either core or Fusion DSP, can be placed in additional slots to increase
parallelism. As with custom TIE instructions, simply use the TIE slot_opcode statement to
place the existing operation in one of the VLIW slots. Load and store instructions can be
added to fusion_slot1 to double the memory bandwidth.

It is not currently possible to share existing Fusion DSP functional resources for new
instructions. New multiplier instructions, for example, must use their own dedicated
multipliers.

7.5 Synthesis and Place-and-Route


When the Fusion DSP is included in an Xtensa processor configuration, the synthesis and
place-and-route scripts that are included with the software build can be used with the usual
methodology, which is outlined in the Xtensa Hardware User’s Guide.

For timing closure between synthesis and place-and-route, Cadence recommends using a
physically aware synthesis flow such as RC-Physical from Cadence or DC-topo from
Synopsys. These flows are currently supported by the provided synthesis scripts.

Appendix A. Option Instruction Lists

Instructions added by the FP Option

• AE_MOVFCRFSRV
• AE_MOVVFCRFSR
• ABS.S
• NEG.S
• NEG_LH.S
• MOVEQZ.S
• MOVNEZ.S
• MOVLTZ.S
• MOVGEZ.S
• MOVF.S
• MOVT.S
• WFR
• RFR
• TRUNC.S
• UTRUNC.S
• FLOAT.S
• UFLOAT.S
• FICEIL.S
• FIFLOOR.S
• FIROUND.S
• FITRUNC.S
• FIRINT.S
• UN.S
• ULT.S
• ULE.S
• UEQ.S
• OLT.S
• OLE.S
• OEQ.S
• ADD.S
• ADD_LLH.S
• ADD_LHH.S
• SUB.S
• SUB_LLH.S
• SUB_LHH.S
• MUL.S
• MUL_LLH.S
• MUL_LHH.S
• MADD.S
• MADD_LLH.S
• MADD_LHH.S
• MSUB.S
• MSUB_LLH.S
• MSUB_LHH.S

Note that the following cannot be used outside of their predefined sequences:

• SQRT0.S
• DIV0.S
• RECIP0.S
• RSQRT0.S
• MADDN.S
• MSUBN.S
• DIVN.S
• CONST.S
• NEXP01.S
• ADDEXP.S
• ADDEXPM.S
• MKDADJ.S
• MKSADJ.S

Instructions added by the BLE/WiFi AES 128 CCM Option

• AE_AES_SUBBYTE_XOR64
• AE_AES_SUBBYTE_MIX_XOR64
• AE_AES_SB128
• AE_AES_RKEY

Instructions added by the Advanced Bit Manipulation Package Option

• AE_LB_BR
• AE_LBI_BR
• AE_DB_BR.IP
• AE_DBI_BR.IP
• AE_SBI_BR.IP
• AE_SB_BR.IP
• AE_SBF_BR.IP
• AE_ADDMOD16U
• AE_BISEL4X8_L
• AE_BSEL4X8_L
• AE_BISEL4X8_H
• AE_BSEL4X8_H
• AE_SEL4X8_L
• AE_SEL4X8_H
• AE_DEPBITS_L
• AE_DEPBITS_H
• AE_CC32_L
• AE_CC32_H
• AE_CTC_BIN
• AE_CRC32
• AE_SCR32
• AE_LFSR16
• AE_LFSR8

The following apply only if AVS is not selected:

• RUR.AE_BITPTR
• WUR.AE_BITPTR
• RUR.AE_BITSUSED
• WUR.AE_BITSUSED

Instructions added by the 16-bit Quad MAC Option

• AE_S16X4RNG.I
• AE_S16X4RNG.IP
• AE_S16X4RNG.X
• AE_S16X4RNG.XP
• AE_MULC16S.H
• AE_MULC16S.L
• AE_MULAC16S.H
• AE_MULAC16S.L
• AE_MUL16JS
• AE_CALCRNG3
• AE_MULFC16RAS.H
• AE_MULFC16RAS.
• AE_MULAFC16RAS.H
• AE_MULAFC16RAS.L
• AE_MULZAAAAQ16
• AE_MULAAAAQ16
• AE_ADDANDSUBRNG16RAS_S1
• AE_ADDANDSUBRNG16RAS_S2
• AE_MAXABS16S
• AE_CONJ16S
• AE_MULC16JS.H
• AE_MULC16JS.L
• AE_MULAC16JS.H
• AE_MULAC16JS.L

The following instructions are normally provided by the AVS option. If the AVS
option is not selected, the 16-bit Quad MAC option provides them instead:

• AE_MUL16X4.H
• AE_MUL16X4.L
• AE_MULA16X4.H
• AE_MULA16X4.L
• AE_MULS16X4.H
• AE_MULS16X4.L

Instructions added by the AVS Option

• AE_MOVTABLEFIRSTSEARCHNEXTV
• AE_MOVVTABLEFIRSTSEARCHNEXT
• AE_MULFP32X2RS.H
• AE_MULFP32X2RAS.H
• AE_MULAFP32X2RS.H
• AE_MULAFP32X2RAS.H
• AE_MULSFP32X2RS.H
• AE_MULSFP32X2RAS.H
• AE_MULFP32X2RS.L
• AE_MULFP32X2RAS.L
• AE_MULAFP32X2RS.L
• AE_MULAFP32X2RAS.L
• AE_MULSFP32X2RS.L
• AE_MULSFP32X2RAS.L
• AE_MULFP16X4S.H
• AE_MULFP16X4RAS.H
• AE_MULFP16X4S.L
• AE_MULFP16X4RAS.L
• AE_MULCR24
• AE_MULFCR24RA
• AE_MULCR32X16.L
• AE_MULCR32X16.H
• AE_MUL16X4.L
• AE_MULCI24
• AE_MULFCI24RA
• AE_MULCI32X16.L
• AE_MULCI32X16.H
• AE_MULACR24
• AE_MULAFCR24RA
• AE_MULACR32X16.L
• AE_MULACR32X16.H
• AE_MULACI24
• AE_MULAFCI24RA
• AE_MULACI32X16.L
• AE_MULACI32X16.H
• AE_MULF16X4SS.H
• AE_MULAF16X4SS.H
• AE_MULSF16X4SS.H
• AE_MULF16X4SS.L
• AE_MULAF16X4SS.L
• AE_MULSF16X4SS.L
• AE_MUL16X4.H
• AE_MULA16X4.H
• AE_MULS16X4.H

Instructions added by the AVS Option (continued):

• AE_MULA16X4.L
• AE_MULS16X4.L
• AE_MULFD24X2.FIR.H.H
• AE_MULFD24X2.FIR.H.L
• AE_MULFD32X16X2.FIR.HH.H
• AE_MULFD32X16X2.FIR.HH.L
• AE_MULFD32X16X2.FIR.HL.H
• AE_MULFD32X16X2.FIR.HL.L
• AE_MULAFD24X2.FIR.H.H
• AE_MULAFD24X2.FIR.H.L
• AE_MULAFD32X16X2.FIR.HH.H
• AE_MULAFD32X16X2.FIR.HH.L
• AE_MULAFD32X16X2.FIR.HL.H
• AE_MULAFD32X16X2.FIR.HL.L
• AE_SHA32
• AE_VLDL32T
• AE_VLDL16T
• AE_VLDL16C
• AE_VLDL16C.IP
• AE_VLDL16C.IC
• AE_VLDSHT
• AE_LB
• AE_LBI
• AE_LBK
• AE_LBKI
• AE_LBS
• AE_LBSI
• AE_DB
• AE_DBI
• AE_DB.IC
• AE_DBI.IC
• AE_DB.IP
• AE_DBI.IP
• AE_VLEL32T
• AE_VLEL16T
• AE_SB
• AE_SBI
• AE_VLES16C
• AE_SBF
• AE_SB.IC
• AE_SBI.IC
• AE_VLES16C.IC
• AE_SBF.IC
• AE_SB.IP
• AE_SBI.IP
• AE_VLES16C.IP
• AE_SBF.IP
• WUR.AE_BITPTR
• RUR.AE_BITSUSED
• WUR.AE_BITSUSED
• RUR.AE_TABLESIZE
• WUR.AE_TABLESIZE
• RUR.AE_FIRST_TS
• WUR.AE_FIRST_TS
• RUR.AE_NEXTOFFSET
• WUR.AE_NEXTOFFSET
• RUR.AE_SEARCHDONE
• WUR.AE_SEARCHDONE

Instructions added by the Viterbi Decoder Option

• AE_VTACSR4X4S_L
• AE_VTACSR4X4S_H
• AE_VTADDSUB3BX2S
• AE_VTTB2X64
• AE_S64_DECBITS.H.IP
• AE_S64_DECBITS.L.IP
• AE_UNPKS8X16
• AE_MOVBMETRICSV
• AE_MOVVBMETRICS
• AE_MOVDBITSV.H
• AE_MOVDBITSV.L
• AE_MOVSANORM

Instructions added by the Soft-bit Demap Option

• AE_SDMAP256QAM1X16C_H
• AE_SDMAP256QAM1X16C_L
• AE_SDMAP64QAM1X16C_H
• AE_SDMAP64QAM1X16C_L
• AE_SDMAP16QAM1X16C_H
• AE_SDMAP16QAM1X16C_L
• AE_SDMAPQPSK2X16C
• AE_SDMAP64QAM1X16C_HL

Appendix B. Instruction Width Required by Fusion F1 Options

The following table lists the instruction width required by each of the Fusion F1
options. The width is specified by the ‘max instruction width in bytes’ option in
Xplorer.

Fusion F1 Option                            Max. Instruction Width in Bytes

FP                                          ≥ 6
AVS                                         ≥ 6
Fusion Advanced Bit Manipulation Package    ≥ 6
BLE/Wi-Fi AES-128 CCM                       ≥ 6
Fusion 16-bit Quad MAC DSP                  ≥ 6
Fusion Viterbi Decoder                      ≥ 8 (beyond 8 is not recommended)
Fusion Soft Bit Demap                       ≥ 8 (beyond 8 is not recommended)
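
As an illustration only, the table above can be turned into a small lookup that reports the smallest ‘max instruction width in bytes’ value satisfying a chosen set of options. This is a sketch, not part of the Cadence tools: the dictionary and the function name required_width are hypothetical, and it assumes that the requirement for a combination of options is simply the largest of the individual minimums, which the table does not state explicitly.

```python
# Minimum 'max instruction width in bytes' per option, transcribed from the
# table above. The dictionary itself is illustrative, not a Cadence API.
MIN_WIDTH_BYTES = {
    "FP": 6,
    "AVS": 6,
    "Fusion Advanced Bit Manipulation Package": 6,
    "BLE/Wi-Fi AES-128 CCM": 6,
    "Fusion 16-bit Quad MAC DSP": 6,
    "Fusion Viterbi Decoder": 8,
    "Fusion Soft Bit Demap": 8,
}


def required_width(options):
    """Return the smallest width (in bytes) meeting every selected option's
    minimum, or None when no option is selected.

    Assumption: the combined requirement is the maximum of the per-option
    minimums listed in the table.
    """
    unknown = [name for name in options if name not in MIN_WIDTH_BYTES]
    if unknown:
        raise KeyError("option(s) not in the table: " + ", ".join(unknown))
    widths = [MIN_WIDTH_BYTES[name] for name in options]
    return max(widths) if widths else None


if __name__ == "__main__":
    print(required_width(["FP", "AVS"]))
    print(required_width(["FP", "Fusion Viterbi Decoder"]))
```

Because the table advises against going beyond 8 bytes for the Viterbi Decoder and Soft Bit Demap options, a configuration that includes either of them would normally use exactly the value this sketch returns.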
