6. DSP Builder for Intel FPGAs (Advanced Blockset) Design Examples and Reference Designs
HB_DSPB_ADV | 2020.10.05

13. Interpolating FIR Filter with Multiple Coefficient Banks
14. Interpolating FIR Filter with Updating Coefficient Banks
15. Root-Raised Cosine FIR Filter
16. Single-Rate FIR Filter
17. Super-Sample Decimating FIR Filter
18. Super-Sample Fractional FIR Filter
19. Super-Sample Interpolating FIR Filter
20. Variable-Rate CIC Filter

6.4.1. Complex FIR Filter


This design example demonstrates how to implement a complex FIR filter using three
real filters. The resource-efficient implementation (three real multipliers per complex
multiply) maps optimally onto Intel Arria 10 DSP blocks, using the scan and cascade
modes.
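
As a rough illustration of why three real filters are enough, the following Python sketch uses one common three-multiplier decomposition of a complex product. The exact arrangement inside the model is not described here, so the decomposition, the function names, and the sample values are assumptions.

```python
def real_fir(coeffs, samples):
    """Causal FIR with zero initial history; output has the same length as samples."""
    return [
        sum(coeffs[k] * samples[n - k] for k in range(len(coeffs)) if n - k >= 0)
        for n in range(len(samples))
    ]


def complex_fir_three_real(h, x):
    """Complex FIR built from three real FIR filters (assumed decomposition).

    For x = a + jb and h = c + jd:
      f1 = c * (a + b),  f2 = (d - c) * a,  f3 = (c + d) * b   (* is convolution)
      real = f1 - f3,    imag = f1 + f2
    """
    hr = [c.real for c in h]
    hi = [c.imag for c in h]
    xr = [s.real for s in x]
    xi = [s.imag for s in x]
    f1 = real_fir(hr, [a + b for a, b in zip(xr, xi)])
    f2 = real_fir([d - c for c, d in zip(hr, hi)], xr)
    f3 = real_fir([c + d for c, d in zip(hr, hi)], xi)
    return [complex(p - r, p + q) for p, q, r in zip(f1, f2, f3)]


# Hypothetical coefficients and input samples, for illustration only.
h = [1 + 2j, 3 - 1j, -2 + 0.5j]
x = [1 + 1j, 2 - 3j, 0 + 4j, -1 - 1j]
three_filter_output = complex_fir_three_real(h, x)
reference_output = real_fir(h, x)  # the same routine run directly on complex values
```

The trade is one fewer multiplier per tap in exchange for extra additions on the input samples and on the coefficients; the coefficient additions can be precomputed.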

The model file is demo_complex_fir.mdl.

6.4.2. Decimating CIC Filter


This design example implements a decimating CIC filter.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_dcic.m script.

The CICSystem subsystem includes the Device and DecimatingCIC blocks.

The model file is demo_dcic.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.3. Decimating FIR Filter


This design example implements a decimating FIR filter.

This design example uses the Decimating FIR block to build a 20-channel decimate
by 5, 49-tap FIR filter with a target system clock frequency of 240 MHz.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_fird.m script.

The FilterSystem subsystem includes the Device and Decimating FIR blocks.

The model file is demo_fird.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

Send Feedback DSP Builder for Intel FPGAs (Advanced Blockset): Handbook

6.4.4. Filter Chain with Forward Flow Control


This design example builds a filter chain with forward flow control.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_filters_flow_control.m script.

The FilterSystem subsystem includes FractionalRateFIR, InterpolatingFIR,
InterpolatingCIC, Const and Scale blocks.

The model file is demo_filters_flow_control.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.5. FIR Filter with Exposed Bus


This design example is a multichannel single-rate FIR filter with rewritable coefficients.
The initial configuration is a high-pass filter, but halfway through the testbench
simulation, DSP Builder reconfigures it as a low-pass filter. The testbench feeds the
sum of a fast and a slow sine wave into the filter. Only the fast sine wave emerges from
the initially configured high-pass filter; only the slow one remains after DSP Builder
reconfigures the filter as low-pass.

The model file is demo_fir_exposed_bus.mdl.

6.4.6. Fractional FIR Filter Chain


This design example uses a chain of InterpolatingFIR and DecimatingFIR blocks to
build a 16-channel fractional rate filter with a target system clock frequency of 360
MHz.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_fir_fractional.m script.

The FilterSystem subsystem includes ChanView, Decimating FIR,
InterpolatingFIR, and Scale blocks.

The model file is demo_fir_fractional.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.7. Fractional-Rate FIR Filter


This design example implements a fractional rate FIR filter.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_firf.m script.

The FilterSystem subsystem includes the Device and FractionalRateFIR blocks.

The model file is demo_firf.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.8. Half-Band FIR Filter


This design example implements a half-band interpolating FIR filter.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_firih.m script.

The FilterSystem subsystem includes the Device block and two separate
InterpolatingFIR blocks for the regular and interpolating filters.

The model file is demo_firih.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.9. IIR: Full-rate Fixed-point


This design example implements a full-rate fixed-point IIR filter.

This design demonstrates a single-channel second-order Infinite Impulse Response
(IIR) filter running at the clock rate. Usually with such designs, closing the feedback
loop is difficult at high clock rates. This design recursively expands the mathematical
expression from the feedback in terms of earlier samples, which gives a feed-forward
scalar product and a longer feedback loop. You can make the feedback loop long
enough to add any length of pipelining at the expense of more resources for the
expansion.
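
The recursive expansion can be sketched in Python for one expansion step. With f[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] and y[n] = f[n] - a1 y[n-1] - a2 y[n-2], substituting the expression for y[n-1] gives y[n] = f[n] - a1 f[n-1] + (a1^2 - a2) y[n-2] + a1 a2 y[n-3], so the shortest feedback path now spans two samples instead of one. The coefficient values, the sign convention, and the function names below are assumptions for illustration and are not taken from the model.

```python
def past(seq, n):
    """Return seq[n], treating samples before time zero as 0.0."""
    return seq[n] if n >= 0 else 0.0


def feed_forward(b, x, n):
    """f[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2]."""
    return b[0] * past(x, n) + b[1] * past(x, n - 1) + b[2] * past(x, n - 2)


def biquad_direct(b, a, x):
    """Direct form: the feedback uses y[n-1] and y[n-2]."""
    a1, a2 = a
    y = []
    for n in range(len(x)):
        y.append(feed_forward(b, x, n) - a1 * past(y, n - 1) - a2 * past(y, n - 2))
    return y


def biquad_expanded(b, a, x):
    """One expansion step: the feedback uses y[n-2] and y[n-3] only."""
    a1, a2 = a
    y = []
    for n in range(len(x)):
        y.append(
            feed_forward(b, x, n)
            - a1 * feed_forward(b, x, n - 1)   # extra feed-forward term
            + (a1 * a1 - a2) * past(y, n - 2)  # longer feedback loop
            + a1 * a2 * past(y, n - 3)
        )
    return y


# Hypothetical stable coefficients and input, for illustration only.
b = (0.2, 0.4, 0.2)
a = (-0.5, 0.25)
x = [1.0, 0.0, -2.0, 3.0, 0.5, 0.0, 0.0, 1.0]
direct_output = biquad_direct(b, a, x)
expanded_output = biquad_expanded(b, a, x)
```

Repeating the substitution lengthens the feedback loop further, at the cost of more feed-forward terms each time.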

The model file is demo_full_rate_iir_fixed.mdl.

Figure 46. IIR Second-Order Biquad

6.4.10. IIR: Full-rate Floating-point


This design example implements a full-rate floating-point IIR filter.

This design demonstrates a single-channel second-order Infinite Impulse Response
(IIR) filter running at the clock rate. Usually with such designs, closing the feedback
loop is difficult at high clock rates. This design recursively expands the
mathematical expression from the feedback in terms of earlier samples, which gives a
feed-forward scalar product and a longer feedback loop. You can make the feedback
loop long enough to add any length of pipelining at the expense of more resources for
the expansion.

The model file is demo_full_rate_iir_floating.mdl.

Figure 47. IIR Second-Order Biquad

6.4.11. Interpolating CIC Filter


This design example implements an interpolating CIC filter.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_icic.m script.

The FilterSystem subsystem includes the Device and InterpolatingCIC blocks.

The model file is demo_icic.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.12. Interpolating FIR Filter


This design example uses the InterpolatingFIR block to build a 16-channel
interpolate by 2, symmetrical, 49-tap FIR filter with a target system clock frequency of
240 MHz.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_firi.m script.

The FilterSystem subsystem includes the Device and InterpolatingFIR blocks.

The model file is demo_firi.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.13. Interpolating FIR Filter with Multiple Coefficient Banks


This design example builds an interpolating FIR filter that regularly switches between
coefficient banks.

Multiple sets of coefficients require storage in memory so that the design can switch
easily from one set, or bank, of coefficients in use to another in a single clock cycle.

The design must perform the following actions:


• Specify the number of coefficient banks
• Initialize the banks
• Update the coefficients in a particular bank
• Select the bank in use in the filter

You specify the coefficient array as a matrix rather than a vector—(bank rows) by
(number of coefficient columns).

The addressing scheme has address offsets of base address + (bank number *
number of coefficients for each bank).

If the number of rows is greater than one, DSP Builder creates a bank select input
port on the FIR filter. In a design, you can drive this input from either data or bus
interface blocks, allowing either direct or bus control. The data type is unsigned
integer of width ceil(log2(number of banks)).
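
The addressing and bank select width rules can be sketched in Python as follows. The function names, the per-coefficient index within a bank, and the example base address are assumptions for illustration; only the two formulas come from the text above.

```python
def bank_select_width(num_banks):
    """Width in bits of the bank select input: ceil(log2(num_banks))."""
    if num_banks < 1:
        raise ValueError("num_banks must be at least 1")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)) using integers only.
    return (num_banks - 1).bit_length()


def coefficient_address(base_address, bank, index, coeffs_per_bank):
    """Address of one coefficient: base + bank * coefficients per bank + index.

    The index term (position within the bank) is an assumption of this sketch.
    """
    if not 0 <= index < coeffs_per_bank:
        raise ValueError("index is outside the bank")
    return base_address + bank * coeffs_per_bank + index


# Hypothetical layout: 2 banks of 8 coefficients starting at address 0x100.
width = bank_select_width(2)
first_of_bank_1 = coefficient_address(0x100, bank=1, index=0, coeffs_per_bank=8)
```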

The bank select is a single signal. For example, for a FIR filter with four input channels
over two timeslots:

<0><1>

<2><3>

The corresponding input channel signal is:

<0><1>

Here the design receives more than one channel at a time, but can only choose a
single bank of coefficients. Channels 0 and 2 use one set of coefficients and channels
1 and 3 another. Channel 0 cannot use a different set of coefficients to channel 2 in
the same filter.

For multiple coefficient banks, you enter an array of coefficient sets, rather than a
single coefficient set. For example, for a MATLAB array of 1 row and 8 columns [1 x
8], enter:

fi(fir1(7, 0.5 ),1,16,15)

For a MATLAB array of 2 rows and 8 columns [2 x 8] enter:

[fi(fir1(7, 0.5 ),1,16,15);fi(fir1(7, 0.5 ),1,16,15)]

Therefore, the number of rows determines the number of banks, and you need not
specify the number of banks separately. If the number of banks is greater than 1, the
block has an additional bank select input.

The model file is demo_firi_multibank.mdl.

6.4.14. Interpolating FIR Filter with Updating Coefficient Banks


This design example is similar to the Interpolating FIR Filter with Multiple Coefficient
Banks design example. While one bank is in use DSP Builder writes a new set of FIR
filter coefficients to the other bank. You can see the resulting change in the filter
output when the bank select switches to the updated bank.

Write to the bus interface using the BusStimulus block with a sample rate
proportionate with the bus clock. Generally, DSP Builder does not guarantee bus
interface transactions to be cycle accurate in Simulink simulations. However, in this
design example, DSP Builder updates the coefficient bank while it is not in use.

The model file is demo_firi_updatecoeff.mdl.

6.4.15. Root-Raised Cosine FIR Filter


This design example uses the Decimating FIR block to build a 4-channel decimate by
5, 199-tap root raised cosine filter with a target system clock frequency of 304 MHz.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_fir_rrc.m script.

The FilterSystem subsystem includes the Device and Decimating FIR blocks.

The model file is demo_fir_rrc.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.16. Single-Rate FIR Filter


This design example uses the SingleRateFIR block to build a 16-channel single rate
49-tap FIR filter with a target system clock frequency of 360 MHz.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_firs.m script.

The FilterSystem subsystem includes the Device and SingleRateFIR blocks.

The model file is demo_firs.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.17. Super-Sample Decimating FIR Filter


This design example shows how the filters cope with data rates greater than the clock
rate. The design example uses the DecimatingFIR block to build a single channel
decimate by 2, symmetrical, 33-tap FIR filter.

The input sample rate is six times the clock rate. The filter decimates the input
sample rate by two, to three times the clock rate, which is visible in the vector input
and output data connections. The filter receives six samples in parallel at the input
and outputs three samples each clock cycle.

After simulation, you can view the resource usage.

The model file is demo_ssfird.mdl.

6.4.18. Super-Sample Fractional FIR Filter


This design example shows how the filters cope with data rates greater than the clock
rate. The design example uses the FractionalFIR block to build a single channel
interpolate by 3, decimate by 2, symmetrical, 33-tap FIR filter.

The input sample rate is two times the clock rate. The filter upconverts the input
sample rate to three times the clock rate, which is visible in the vector input and
output data connections. The filter receives two samples in parallel at the input and
outputs three samples each clock cycle.

The model file is demo_ssfirf.mdl.

6.4.19. Super-Sample Interpolating FIR Filter


This design example shows how the filters cope with data rates greater than the clock
rate. The design example uses the InterpolatingFIR block to build a single channel
interpolate by 3, symmetrical, 33-tap FIR filter.

The input sample rate is twice the clock rate, and the filter interpolates it by three
to six times the clock rate, which is visible in the vector input and output data
connections. The filter receives two samples in parallel at the input and outputs six
samples each clock cycle.
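
The vector widths quoted for the three super-sample examples follow from simple rate arithmetic, sketched here in Python. The helper name and its parameters are assumptions for illustration and are not DSP Builder parameters.

```python
from fractions import Fraction


def samples_per_clock(input_per_clock, interpolate=1, decimate=1):
    """Parallel output samples per clock cycle for a given rate change."""
    return Fraction(input_per_clock) * interpolate / decimate


# Super-sample decimating example: 6 samples in, decimate by 2.
decimating = samples_per_clock(6, decimate=2)
# Super-sample fractional example: 2 samples in, interpolate by 3, decimate by 2.
fractional = samples_per_clock(2, interpolate=3, decimate=2)
# Super-sample interpolating example: 2 samples in, interpolate by 3.
interpolating = samples_per_clock(2, interpolate=3)
```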

After simulation, you can view the resource usage.

The model file is demo_ssfiri.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.4.20. Variable-Rate CIC Filter


CIC filters are extremely hardware efficient, as they require no multipliers. You see
CIC filters commonly in applications that require large interpolation and decimation
factors. Usually the interpolation and decimation factors are fixed, and you can use
the CIC IP block. However, a subset of applications require you to change the
interpolation and decimation factors at run time. This design example shows how to
build a variable-rate CIC filter from primitives. It contains a variable-rate decimating
CIC filter, which consists of a number of integrators and differentiators with a
decimation block between them, where the rate change occurs.

You can control the rate change with a register field, which is part of the control
interface. The register field controls the generation of a valid signal that feeds into the
differentiators.

The design example also contains a gain compensation block that compensates for the
rate change dependent gain of the CIC. It shifts the input up so that the MSB at the
output is always at the same position, regardless of the rate change that you select.
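
The structure described above (integrators, a rate change, differentiators, and an input shift for gain compensation) can be sketched in Python. This is a behavioral sketch under stated assumptions: three stages, a differential delay of one, unbounded integers instead of the fixed internal data widths, and hypothetical function names. It is not the implementation in the model.

```python
def cic_decimate(samples, rate, stages=3):
    """Decimating CIC: integrators, keep every rate-th sample, then differentiators."""
    integrators = [0] * stages
    comb_delays = [0] * stages
    outputs = []
    for n, x in enumerate(samples):
        # The integrator chain runs on every input sample.
        acc = x
        for i in range(stages):
            integrators[i] += acc
            acc = integrators[i]
        # Rate change: only every rate-th integrator output is marked valid.
        if (n + 1) % rate == 0:
            # The differentiator chain runs only on valid samples.
            y = acc
            for i in range(stages):
                y, comb_delays[i] = y - comb_delays[i], y
            outputs.append(y)
    return outputs


def gain_shift(rate, max_rate, stages=3):
    """Left shift for the input so the output MSB stays where it is at max_rate.

    The DC gain of this CIC is rate**stages, so the bit growth is
    ceil(log2(rate**stages)), computed here with integers only.
    """
    growth = (rate ** stages - 1).bit_length()
    max_growth = (max_rate ** stages - 1).bit_length()
    return max_growth - growth


# Hypothetical rates: decimate by 4 in a design sized for a maximum rate of 16.
shift = gain_shift(4, 16)
compensated = cic_decimate([1 << shift] * 40, 4)
full_rate = cic_decimate([1] * 160, 16)
```

With a constant input of 1, the settled output equals rate**stages, which is the rate-dependent gain that the input shift compensates.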

The associated setup file contains parameters for the minimum and maximum
decimation rate, and calculates the required internal data widths and the scaling
number. To change the decimation factor for simulation, adjust variable CicDecRate
to the desired current decimation rate.

The model file is demo_vcic.mdl.

6.5. DSP Builder Folding Design Examples


1. Position, Speed, and Current Control for AC Motors
2. Position, Speed, and Current Control for AC Motors (with ALU Folding)
3. About FOC
4. Folded FIR Filter

6.5.1. Position, Speed, and Current Control for AC Motors


This design example implements a field-oriented control (FOC) algorithm for AC
motors such as permanent magnet synchronous machines (PMSM). Industrial servo
motors, where the precise control of torque is important, commonly use these
algorithms. This design example includes position and speed control, which allow the
control of rotor speed and angle.

Note: Intel has not tested this design on hardware and Intel does not provide a model of a
motor.

The model file is psc_ctrl.mdl. Also, an equivalent fixed-point design,
psc_ctrl_fixed.mdl, exists. To change the precision this design uses, refer to the
setup_position_speed_current_controller_fixed.m script.

Functional Description

An encoder measures the rotor position in the motor, which the FPGA then reads. An
analog-to-digital converter (ADC) measures current feedback, which the FPGA then
reads.

Figure 48. AC Motor Control System Block Diagram

[Block diagram. Labels recovered from the figure: an FPGA containing an SOPC Builder
system (Nios II processor, Ethernet MAC), the position, speed, and current control for
AC motors example design in DSP Builder, an IGBT control interface, an ADC interface,
and an encoder interface. Blocks outside the FPGA: Ethernet PHY (industrial Ethernet),
power stage, AC motor, ADC, and position encoder.]

Each of the FOC, speed, and position feedback loops uses a simple PI controller to
reduce the steady state error to zero. In a real-world PI controller, you may also need
to consider integrator windup and tune the PI gains appropriately. The feedback loops
for the integral portion of the PI controllers are internal to the design.
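
A PI controller with output limits and a simple guard against integrator windup can be sketched in Python as follows. The class name, the gains, the time step, and the conditional-integration scheme are assumptions for illustration; the design example's own controllers are built from DSP Builder blocks and are not reproduced here.

```python
class PIController:
    """PI controller with output limits and conditional integration (a sketch)."""

    def __init__(self, kp, ki, out_min, out_max):
        self.kp = kp
        self.ki = ki
        self.out_min = out_min
        self.out_max = out_max
        self.integral = 0.0

    def update(self, error, dt):
        candidate = self.integral + self.ki * error * dt
        unclamped = self.kp * error + candidate
        output = min(max(unclamped, self.out_min), self.out_max)
        # Anti-windup: commit the integrator only while the output is not saturated.
        if output == unclamped:
            self.integral = candidate
        return output


# Hypothetical gains and limits; a constant error drives the output into saturation.
pi = PIController(kp=1.0, ki=10.0, out_min=-5.0, out_max=5.0)
saturated = [pi.update(1.0, dt=0.01) for _ in range(1000)][-1]
integral_after_saturation = pi.integral
# Once the error changes sign, the output leaves saturation immediately.
recovered = pi.update(-1.0, dt=0.01)
```

Without the guard, the integrator would keep growing during saturation and the output would stay pinned at the limit long after the error reverses.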

The example assumes you sample the inputs at a rate of 100 kHz and the FPGA clock
rate is 100 MHz (suitable for Cyclone IV devices). ALU folding reduces the resource
usage by sharing operators such as adders, multipliers, and cosine blocks. The folding
factor is set to 100 to allow each operator to be timeshared up to 100 times, which gives
an input sample rate of 1 Msps, but as the real input sample rate is 100 ksps, only one
out of every ten input timeslots is used. DSP Builder identifies the used timeslots when
valid_in is 1. Use valid_in to enable the latcher in the PI controller, which stores
data for use in the next valid timeslot. The valid_out signal indicates when the
ChannelOut block has valid output data. You can calculate nine additional channels
on the same design without incurring extra latency (or extra FPGA resources).

You should adjust the folding factor to see the effect it has on hardware resources and
latency. To adjust, change the Sample rate (MHz) parameter in the ChannelIn and
ChannelOut blocks of the design either directly or change the FoldingFactor
parameter in the setup script. For example, a clock frequency of 100 MHz and sample
rate of 10 MHz gives a folding factor of 10. Disabling folding, or setting the factor to 1,
results in no resource sharing and minimal latency. Generally, you should not set the
folding factor greater than the number of shareable operators, that is, for 24 adders
and 50 multipliers, use a maximum folding factor of 50.
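
The folding arithmetic in this section can be checked with a short Python sketch. The helper names are hypothetical and are not setup script variables; the numbers are the ones quoted in the text.

```python
from fractions import Fraction


def folding_factor(clock_hz, sample_hz):
    """Folding factor: clock cycles available per input sample."""
    return clock_hz // sample_hz


def timeslot_utilization(clock_hz, folding, real_sample_hz):
    """Fraction of folded input timeslots that carry a real sample."""
    folded_sample_rate = Fraction(clock_hz, folding)
    return Fraction(real_sample_hz) / folded_sample_rate


def max_useful_folding(operator_counts):
    """Upper bound suggested by the text: the largest count of any shareable operator."""
    return max(operator_counts.values())


# 100 MHz clock with a 10 MHz sample rate gives a folding factor of 10.
factor = folding_factor(100_000_000, 10_000_000)
# Folding factor 100 at 100 MHz gives 1 Msps; with 100 ksps of real data, 1 in 10 slots is used.
used = timeslot_utilization(100_000_000, 100, 100_000)
# 24 adders and 50 multipliers suggest a maximum folding factor of 50.
limit = max_useful_folding({"adders": 24, "multipliers": 50})
```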

Note: The testbench does not support simulations if you adjust the folding factor.

The testbench varies the desired position over time to exercise the control algorithm
and its FOC, position, and speed control loops. The three control loops are
parameterized with minimum and maximum limits, and PI values. These values are not
optimized and are for demonstration only.

Resource Usage

Table 16. Position, Speed, and Current Control for AC Motors Design Example Resource Usage

Folding Factor    Add and Sub Blocks    Mult Blocks    Cos Blocks    Latency
No folding        22                    22             4             170
>22               1                     1              1             279

The example uses floating-point arithmetic that automatically avoids arithmetic
overflow, but you can implement it in a fixed-point design and tune individual
accuracies while manually avoiding overflows.

Hardware Generation
When hardware generation is disabled, the Simulink system simulates the design at
the external sample rate of 100 kHz, so that it outputs a new value at 100 kHz (once
every 10 µs). When hardware generation is enabled, the design simulates at the FPGA
clock rate (100 MHz), which represents real-life latency clock delays, but it still
outputs a new value only at 100 kHz. This mode slows the system simulation speed
greatly as the model is evaluated 1,000 times for every output. The setup script for the design
example automatically detects whether hardware generation is enabled and sets the
sample rates accordingly. The example is configured with hardware generation
disabled, which allows fast simulations. When you enable hardware generation, set a
very small simulation time (for example 0.0001 s) as simulation may be very slow.

Figure 49. Input Position Request


At 0 s, a position of 3 is requested and then at 0.5 s a position of 0 is requested. The figure also shows the
actual position and motor feedback currents.

Figure 50. Output Response for Speed and Torque


The maximum speed request saturates at 10 and the torque request saturates at 5 as set by parameters of the
model. Also, some oscillation exists on the speed and torque requests because of nonoptimal settings for the PI
controller causing an under-damped response.

Figure 51. Output Current


From 0 to 0.1, the motor is accelerating; 0.1 to 0.3, it is at a constant speed; 0.3 to 0.5, it is decelerating to
stop. From 0.5 to 0.6, the motor accelerates in the opposite direction; from 0.6 to 0.8, it is at a constant
speed; from 0.8 to 1, it is decelerating to stop.

6.5.2. Position, Speed, and Current Control for AC Motors (with ALU Folding)

The position, speed, and current control for AC motors (with ALU folding) design
example is a FOC algorithm for AC motors, which is identical to the position, speed,
and current control for AC motors design example. However, this design example uses
ALU folding.

The model file is psc_ctrl_alu.mdl.

The design example targets a Cyclone V device (speed grade 8). Cyclone V devices
have distributed memory (MLABs). ALU folding uses many distributed memory
components. ALU folding performs better in devices that have distributed memories,
rather than devices with larger block memories.

The design example includes a setup script
setup_position_speed_current_controller_alu.m.

Table 17. Setup Script Variables

Variable                                  Description
dspb_psc_ctrl.SampleRateHz = 10000        Sample rate. Default set to 10000, which is 10 kHz sample rate.
dspb_psc_ctrl.ClockRate = 100             FPGA clock frequency. Default set to 100, which is 100 MHz clock.
dspb_psc_ctrl.LatencyConstraint = 1000    Maximum latency. Default 1,000 clock cycles.

This design example uses a significantly large maximum latency, so resource
consumption is the factor to optimize in ALU folding rather than latency.

Generally, industrial designs require a testbench that operates at the real-world
sample rate. This example emulates the behavior of a motor sending current, position,
and speed samples at a rate of 10 kHz.

When you run this design example without folding, the DSP Builder system operates
at the same 10 kHz sample rate. Therefore, the system calculates a new packet of
data for every Simulink sample. Also, the sample times of the testbench are the same
as the sample times for the DSP Builder system.

The Rate Transition blocks translate between the Simulink testbench and the DSP
Builder system. These blocks allow Simulink to manage the different sample times
that the DSP Builder system requires. You need not modify the design example when
you run designs with or without folding.

The Rate Transition blocks produce Simulink samples with a sample time of
dspb_psc_ctrl.SampleTime for the testbench and
dspb_psc_ctrl.DSPBASampleTime for the DSP Builder system. The samples are in
the stimuli system, within the dummy motor. To hold the data consistent at the inputs
to the Rate Transition blocks for the entire length of the output sample
(dspb_psc_ctrl.SampleTime), turn on Register Outputs.

The data valid signal consists of a one Simulink sample pulse that signifies the
beginning of a data packet followed by zero values until the next data sample, as
required by ALU folding. The design example sets the period of this pulsing data valid
signal to the number of Simulink samples for the DSP Builder system (at
dspb_psc_ctrl.DSPBASampleTime) between data packets. This value is
dspb_psc_ctrl.SampleTime/dspb_psc_ctrl.DSPBASampleTime.

The verification script within ALU folding uses the To Workspace blocks. The
verification script searches for To Workspace blocks on the output of systems to fold.
The script uses these blocks to record the outputs from both the design example with
and without folding. The script compares the results with respect to valid outputs. To
run the verification script, enter the following command at the MATLAB prompt:

Folder.Testing.RunTest('psc_ctrl_alu');

6.5.3. About FOC


FOC involves controlling the motor's sinusoidal 3-phase currents in real time, to create
a smoothly rotating magnetic flux pattern, where the frequency of rotation
corresponds to the frequency of the sine waves. FOC controls the amplitude of the
current vector that is at 90 degrees with respect to the rotor magnet flux axis
(quadrature current) to control torque.

The direct current component (0 degrees) is set to zero. The algorithm involves the
following steps:
• Converting the 3-phase feedback current inputs and the rotor position from the
encoder into quadrature and direct current components with the Clarke and Park
transforms.
• Using these current components as the inputs to two proportional and integral (PI)
controllers running in parallel to control the direct current to zero and the
quadrature current to the desired torque.
• Converting the direct and quadrature current outputs from the PI controllers back
to 3-phase currents with inverse Clarke and Park transforms.
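
The transform steps listed above can be sketched in Python. The sketch assumes the amplitude-invariant form of the Clarke transform and a Park transform with the d axis aligned to the rotor angle theta; other scalings and sign conventions exist, and the function names are hypothetical.

```python
import math


def clarke(ia, ib, ic):
    """3-phase currents to stationary alpha/beta components (amplitude-invariant)."""
    alpha = (2.0 * ia - ib - ic) / 3.0
    beta = (ib - ic) / math.sqrt(3.0)
    return alpha, beta


def park(alpha, beta, theta):
    """Stationary alpha/beta to rotating direct (d) and quadrature (q) components."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q


def inverse_park(d, q, theta):
    """Rotating d/q back to stationary alpha/beta."""
    alpha = d * math.cos(theta) - q * math.sin(theta)
    beta = d * math.sin(theta) + q * math.cos(theta)
    return alpha, beta


def inverse_clarke(alpha, beta):
    """Stationary alpha/beta back to balanced 3-phase quantities."""
    ia = alpha
    ib = -0.5 * alpha + (math.sqrt(3.0) / 2.0) * beta
    ic = -0.5 * alpha - (math.sqrt(3.0) / 2.0) * beta
    return ia, ib, ic


# Hypothetical balanced currents of amplitude 2 that lead the rotor angle by 90 degrees.
theta = 0.7
currents = tuple(
    2.0 * math.cos(theta + math.pi / 2 - k * 2 * math.pi / 3) for k in range(3)
)
d, q = park(*clarke(*currents), theta)
restored = inverse_clarke(*inverse_park(d, q, theta))
```

In this convention the example currents map to a direct component of zero and a quadrature component equal to the current amplitude, which is the operating point the two PI controllers aim for.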

6.5.4. Folded FIR Filter


This design example implements a simple non-symmetric FIR filter using primitive
blocks, with a data sample rate much less than the system clock rate. This design
example uses ALU folding to minimize hardware resource utilization.

The model file is demo_alu_fir.mdl.

6.6. DSP Builder Floating Point Design Examples


1. Black-Scholes Floating Point
2. Double-Precision Real Floating-Point Matrix Multiply
3. Fine Doppler Estimator
4. Floating-Point Mandelbrot Set
5. General Real Matrix Multiply One Cycle Per Output
6. Newton Root Finding Tutorial Step 1—Iteration
7. Newton Root Finding Tutorial Step 2—Convergence
8. Newton Root Finding Tutorial Step 3—Valid
9. Newton Root Finding Tutorial Step 4—Control
10. Newton Root Finding Tutorial Step 5—Final
11. Normalizer
12. Single-Precision Complex Floating-Point Matrix Multiply
13. Single-Precision Real Floating-Point Matrix Multiply
14. Simple Nonadaptive 2D Beamformer

6.6.1. Black-Scholes Floating Point


The DSP Builder Black-Scholes single- and double-precision floating-point design
examples implement the calculation of a Black-Scholes equation and demonstrate the
load exponent, reciprocal square root, logarithm and divide floating-point Primitive
library blocks for single- or double-precision floating-point designs.

The model files are blackScholes_S.mdl and blackScholes_D.mdl.

6.6.2. Double-Precision Real Floating-Point Matrix Multiply


This design example is a simpler floating-point matrix multiply implementation than the
complex matrix multiply example. It performs each vector multiply simultaneously, using
many more multiply-adds in parallel.

The model file is matmul_flash_RD.mdl.

6.6.3. Fine Doppler Estimator


The fine Doppler estimator design example is an interpolator for radar applications.
The example has three complex input values. It calculates the magnitude of each
value, then performs a parabolic curve fit, identifies the location of the peak, and
calculates the peak magnitude. The example performs all processing in single-
precision floating-point data.
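
The magnitude and parabolic curve fit steps can be sketched in Python. The sketch fits a parabola through three equally spaced magnitudes and returns the peak location as an offset from the middle sample; the function name and the sample values are assumptions, and the model's own arithmetic may be arranged differently.

```python
def parabolic_peak(v_prev, v_mid, v_next):
    """Fit a parabola through three magnitudes; return (offset, peak magnitude).

    The offset is measured in sample spacings relative to the middle sample.
    """
    a, b, c = abs(v_prev), abs(v_mid), abs(v_next)
    curvature = a - 2.0 * b + c
    if curvature == 0.0:
        # The three points are collinear, so there is no interior peak to refine.
        return 0.0, b
    offset = 0.5 * (a - c) / curvature
    peak = b - 0.25 * (a - c) * offset
    return offset, peak


# Hypothetical complex inputs whose magnitudes lie on 10 - (x - 0.3)**2 at x = -1, 0, 1.
offset, peak = parabolic_peak(complex(8.31, 0.0), complex(0.0, 9.91), complex(-9.51, 0.0))
```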

For more information about fine Doppler estimators, refer to Fundamentals of Radar
Signal Processing by Mark A. Richards, McGraw-Hill, ISBN 0-07-144474-2, ch. 5.3.4.

The model file is FineDopplerEstimator.mdl.

6.6.4. Floating-Point Mandelbrot Set

This design example plots the Mandelbrot set for a defined region of the complex
plane, shows many advanced blockset features, and highlights recommended design
styles.

A complex number C is in the Mandelbrot set if the value of the following equation
remains finite when repeatedly iterated:

z_{n+1} = z_n^2 + C

where n is the iteration number and C is the complex number under test.
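
The iteration can be sketched in Python as an escape-time test. The starting value of zero, the escape radius of 2, the iteration limit, and the function name are the usual conventions for this calculation and are assumptions here, not parameters taken from the model.

```python
def mandelbrot_iterations(c, max_iter=100):
    """Count square-and-add iterations until |z| exceeds 2, up to max_iter."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n + 1  # the point escaped, so it is outside the set
    return max_iter  # still bounded after max_iter iterations


inside = mandelbrot_iterations(-1 + 0j)  # orbit alternates between -1 and 0
outside = mandelbrot_iterations(1 + 0j)  # orbit runs 1, 2, 5 and escapes
```

Points need different numbers of iterations, so a hardware loop with a fixed pipeline latency always holds a mix of points at different stages.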

The system takes longer to perform floating-point calculations than for the
corresponding fixed-point calculations. You cannot wait around for partial results to be
ready, if you want to achieve maximum efficiency. Instead, you must ensure your
algorithm fully uses the floating-point calculation engines. The design contains two
floating-point math subsystems: one for scaling and offsetting pixel indices to give a
point in the complex plane; the other to perform the main square-and-add iteration
operation.

For this design example, the total latency is approximately 19 clock cycles, depending
on target device and clock speed. The latency is not excessive, but it is long enough
that it is inefficient to wait for partial results.

FIFO buffers control the circulation of data through the iterative process. If a
partial result is available for a further iteration in the z_{n+1} = z_n^2 + C
progression, the FIFO buffers ensure that the design works on that point.
Otherwise, the design starts a new point (a new value of C). Thus, the design maintains
a full flow of data through the floating-point arithmetic.

This main iteration loop can exert back pressure on the new point calculation engine.
If the design does not read new points off the command queue FIFO buffers quickly
enough, such that they fill up, the loop iteration stalls. The design does not
explicitly signal the calculation of each point when it is required, and so avoids
waiting through the latency cycles before it can use the point. Nor does the design
attempt to calculate this latency exactly in clock cycles and issue generate point
commands the exact number of clock cycles before it needs them, because you would
have to change these values each time you retarget a device or change the target
clock rate. Instead, the design calculates the points quickly from the start and
catches them in a FIFO buffer. If the FIFO buffer starts to get full (a sufficient
number of cycles ahead of full), the design stops the calculation upstream
without loss of data. This self-regulating flow mitigates latency while remaining flexible.

Avoid inefficiencies by designing the algorithm implementation around the latency and
availability of partial results. Data dependencies in processing can stall processing.
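
The effect of recirculating partial results can be sketched in Python. This is a behavioral model under stated assumptions, not the structure of Mandelbrot_S.mdl: the function names, the 50-iteration limit, and the single feedback queue are illustrative, and only the 19-cycle latency comes from the text above.

```python
from collections import deque


def direct_iterations(c, max_iter):
    """Reference result: iterate one point from start to finish."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def pipelined_iterations(points, latency=19, max_iter=50):
    """Keep a `latency`-deep pipeline busy using a feedback FIFO."""
    new_points = deque(points)            # command queue of new values of C
    feedback = deque()                    # partial results awaiting another pass
    pipe = deque([None] * latency)        # one slot per pipeline stage
    results, busy, cycles = {}, 0, 0
    while new_points or feedback or any(slot is not None for slot in pipe):
        if feedback:                      # prefer a partial result ...
            pipe.append(feedback.popleft())
        elif new_points:                  # ... otherwise start a new point
            pipe.append((new_points.popleft(), 0j, 0))
        else:
            pipe.append(None)             # nothing to issue: a pipeline bubble
        finished = pipe.popleft()
        cycles += 1
        if finished is None:
            continue
        busy += 1
        c, z, n = finished
        if abs(z) > 2.0 or n == max_iter:
            results[c] = n                # point complete: mark it as valid
        else:
            feedback.append((c, z * z + c, n + 1))
    return results, (busy / cycles if cycles else 0.0)
```

With a single point in flight, the model uses roughly one pipeline slot in every 19; with many points queued, almost every slot carries useful work, and points complete in a different order from the order in which they started.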

The design example uses the FinishedThisPoint signal as the valid signal. Although
the system constantly produces data on the output, it marks the data as valid only
when the design finishes a point. Downstream components can then process only the
valid data, just as the enabled subsystem in the testbench captures and plots only
the valid points.

In both feedback loops, you must provide sufficient delay for the scheduler to
redistribute as pipelining. In feed-forward paths you can add pipelining without
changing the algorithm; DSP Builder changes only the timing of the algorithm. But in
feedback loops, inserting a delay can alter the meaning of an algorithm. For example,
adding N cycles of delay to an accumulator loop increments N different numbers, each
incrementing every N clock cycles. The design must provide enough slack in each loop
for the scheduler, which redistributes delays and pipelines operators, to be able to
close timing by redistributing this slack. The scheduler must not change the total
latency around the loop, and it must ensure the function of the algorithm is
unaltered.

Such slack delays are in the top-level design of the synthesizable design, in the
feedback loop controlling the generation of new points, and in the FeedBackFIFO
subsystem controlling the main iteration calculation. The design example uses the
minimum delay feature on the SampleDelay blocks to set these slack delays to the
minimum latency that satisfies the schedule, which DSP Builder solves as part of
the integer linear programming problem that finds an optimum pipelining and
scheduling solution.

You can group delays into numbered equivalence groups to match other delays. In this
design example, the single delay around the coordinate generation loop is in one
equivalence group, and all the slack delays around the main calculation loop are in
another equivalence group. The equivalence group field can contain any MATLAB
expression that evaluates to a string. The SampleDelay block displays the delay that
DSP Builder uses.
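
The accumulator remark can be checked with a small Python model. The function name is illustrative; the model simply places `loop_delay` registers in the feedback path of an accumulator.

```python
from collections import deque


def accumulate(samples, loop_delay):
    """Accumulator whose feedback path contains `loop_delay` registers."""
    state = deque([0] * loop_delay)      # contents of the feedback registers
    outputs = []
    for sample in samples:
        total = state.popleft() + sample  # value written loop_delay cycles ago
        state.append(total)
        outputs.append(total)
    return outputs


single = accumulate([1] * 6, loop_delay=1)   # one running total
triple = accumulate([1] * 6, loop_delay=3)   # three interleaved totals
```

With a delay of one cycle the output is a single running total. With a delay of three cycles the same input produces three independent totals, each advancing once every three cycles, which is why the scheduler may redistribute the delay around a loop but must not change its total.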

The FIFO buffers operate in show-ahead mode: they display the next value to be
read. The read signal is a read acknowledgement, which reads the output value,
discards it, and shows the next value. The design uses multiple FIFO buffers with the
same control signals, so they are full and give a valid output at the same time. The
design needs the output control signals from only one of the FIFO buffers and can
ignore the corresponding signals from the other FIFO buffers.

As floating-point simulation is not bit accurate to the hardware, some points in the
complex plane take fewer or more iterations to complete in hardware compared to the
Simulink simulation. The results for finished points may therefore come out in
a different order. You must build a testbench mechanism that is robust to this
behavior. Use the testbench override feature in the Run All Testbenches block:
• Set the condition on mismatches to Warning.
• Use the Run All Testbenches block to set an import variable, which brings the
ModelSim results back into MATLAB, and a custom verification function, which sets
the pass or fail criteria.
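
The handbook's verification function is written in MATLAB and is not reproduced here. The following Python sketch only illustrates the idea of an order-insensitive check; the tuple layout (x index, y index, iteration count) and the tolerance of one iteration are assumptions.

```python
def results_match(simulated, hardware, iteration_tolerance=1):
    """Compare two result lists without relying on the order of the points.

    Each result is a tuple (x_index, y_index, iterations).
    """
    expected = {(x, y): n for x, y, n in simulated}
    actual = {(x, y): n for x, y, n in hardware}
    if expected.keys() != actual.keys():
        return False                      # a point is missing or unexpected
    return all(abs(expected[point] - actual[point]) <= iteration_tolerance
               for point in expected)
```

Keying the results by point rather than by arrival order makes the check insensitive to reordering, while the tolerance absorbs small differences in iteration count.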

The model file is Mandelbrot_S.mdl.

6.6.5. General Real Matrix Multiply One Cycle Per Output


This design example implements a floating-point matrix multiply. The design performs
each vector multiply simultaneously, using many multiply-adds in parallel.

The model file is gemm_flash.mdl.

6.6.6. Newton Root Finding Tutorial Step 1—Iteration


This design example is part of the Newton-Raphson tutorial. It demonstrates a naive
test for convergence and exposes problems with rounding and testing equality with
zero.

The model file is demo_newton_iteration.mdl.

6.6.7. Newton Root Finding Tutorial Step 2—Convergence


This design example is part of the Newton-Raphson tutorial. It demonstrates
convergence criteria exposing mismatches between Simulink and ModelSim that you
can correct by bit-accurate simulation. The discrepancies are worse when you use
faithful rounding.

The model file is demo_newton_convergence.mdl.

6.6.8. Newton Root Finding Tutorial Step 3—Valid


This design example is part of the Newton-Raphson tutorial. It demonstrates how you
avoid having the same answer multiple times on the output. It introduces a valid
control signal, parallel to the datapath, to keep track of which pipeline slots the design
empties. It uses equivalence groups in the minimum SampleDelay blocks.

The model file is demo_newton_valid.mdl.

6.6.9. Newton Root Finding Tutorial Step 4—Control


This design example is part of the Newton-Raphson tutorial. It demonstrates flow
control which allows the design to buffer inputs in a FIFO buffer and insert data into
pipeline slots as they become available.

The model file is demo_newton_control.mdl.

6.6.10. Newton Root Finding Tutorial Step 5—Final


This design example is part of the Newton-Raphson tutorial. It demonstrates a parallel
integer datapath for counting iterations. It detects divergence in cases where the
Newton method oscillates between two finite values.
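
The ideas behind the tutorial steps can be sketched in Python. This is not the contents of the tutorial models: the function name, tolerance, and iteration limit are assumptions. The sketch tests |f(x)| against a tolerance rather than testing for equality with zero, and uses an integer iteration counter to give up when the method fails to converge.

```python
def newton(f, df, x0, tolerance=1e-6, max_iterations=50):
    """Newton-Raphson with a tolerance test and an iteration counter.

    Returns (x, iterations, converged).
    """
    x = x0
    for n in range(max_iterations):
        fx = f(x)
        if abs(fx) <= tolerance:          # tolerance test, not fx == 0
            return x, n, True
        x = x - fx / df(x)
    return x, max_iterations, False       # counter expired: report divergence


# Converges to the square root of 2.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)

# Oscillates between 0 and 1 and never converges.
stuck = newton(lambda x: x ** 3 - 2.0 * x + 2.0,
               lambda x: 3.0 * x * x - 2.0, 0.0)
```

The second call shows why a separate counter is useful: both values in the oscillation are finite, so a test on the data alone never detects the failure.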

The model file is demo_newton_final.mdl.

6.6.11. Normalizer
The normalizer design example demonstrates the ilogb block and the multifunction
ldexp block. The parameters allow you to select the ilogb or ldexp. The design
example implements a simple floating-point normalization. The magnitude of the
output is always in the range 0.5 to 1.0, irrespective of the (non-zero) input.
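
A minimal Python sketch of the same normalization follows. Python's math module has no ilogb function, so the sketch derives the equivalent exponent from math.frexp; the function name is illustrative, and the result here lies in the half-open range from 0.5 up to, but not including, 1.0.

```python
import math


def normalize(x):
    """Scale a non-zero float by a power of two so that 0.5 <= |result| < 1."""
    if x == 0.0:
        raise ValueError("input must be non-zero")
    # frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1,
    # so e - 1 is the value that C's ilogb(x) returns for finite non-zero x.
    exponent = math.frexp(x)[1] - 1
    return math.ldexp(x, -(exponent + 1))


examples = [normalize(8.0), normalize(-3.0), normalize(0.1)]
```

Because the scaling is by an exact power of two, the mantissa bits of the input are preserved and only the exponent changes.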

The model file is demo_normalizer.mdl.

6.6.12. Single-Precision Complex Floating-Point Matrix Multiply


This design example uses a similar flow control style to that in the floating-point
Mandelbrot set design example. The design example uses a limited number of
multiply-adds, set by the vector size, to perform a complex single-precision matrix
multiply.

A matrix multiplication must calculate the dot product of a row and a column for
each output element. For 8×8 matrices A and B:

Equation 1. Matrix Multiply Equation

(AB)_{ij} = \sum_{k=1}^{8} A_{ik} B_{kj}

You may accumulate the adjacent partial results, or build adder trees, without
considering any latency. However, to implement with a smaller dot product, consider
resource usage folding, which uses a smaller number of multipliers rather than
performing everything in parallel. Also split up the loop over k into smaller chunks.
Then reorder the calculations to avoid adjacent accumulations.
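
The folding and reordering can be sketched in Python. The function name and the loop order are assumptions chosen for illustration; the sketch uses `vector_size` multipliers per step and visits every output element once per chunk of k, so that two accumulations into the same element are never adjacent.

```python
def folded_matmul(a, b, vector_size):
    """Square matrix multiply using a dot product of only `vector_size` terms."""
    n = len(a)
    assert n % vector_size == 0
    result = [[0] * n for _ in range(n)]
    schedule = []                          # order in which elements are updated
    for start in range(0, n, vector_size):         # chunk of the loop over k
        for i in range(n):
            for j in range(n):
                partial = sum(a[i][k] * b[k][j]
                              for k in range(start, start + vector_size))
                result[i][j] += partial    # accumulate the partial dot product
                schedule.append((i, j))
    return result, schedule
```

With the chunk loop outermost, successive updates to the same output element are n × n steps apart, which leaves time for a pipelined floating-point adder to finish before its result is needed again.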

A traditional implementation of a matrix multiply design is structured around a delay
line and an adder tree:

A_{11}B_{11} + A_{12}B_{21} + A_{13}B_{31} and so on.

The traditional implementation has the following features:
• The length and size grow with the folding size (typically 8 to 12).
• It uses adder trees of 7 to 10 adders that are used only once every 10 cycles.
• Each matrix size needs a different length, so you must provide for the worst case.

A better implementation is to use FIFO buffers to provide self-timed control. The
design accumulates new data when both FIFO buffers have data. This implementation
has the following advantages:
• Runs as fast as possible.
• Is not sensitive to the latency of the dot product on different devices or fMAX.
• Is not sensitive to matrix size (the hardware just stalls for small N).
• Can respond to back pressure, which stops the FIFO buffers emptying and feeds
the full signal back to the control logic.

The model file is matmul_CS.mdl.

6.6.13. Single-Precision Real Floating-Point Matrix Multiply


This design example is a simpler floating-point matrix multiply implementation than
the complex matrix multiply example. The design example uses many
more multiply-adds in parallel (128 single-precision multiply-adds in the default
parameterization), to perform each vector multiply simultaneously.

The model file is matmul_flash_RS.mdl.

6.6.14. Simple Nonadaptive 2D Beamformer


This design example demonstrates a simple nonadaptive 2D beamformer using
vectors and single precision arithmetic. The parameters are the number of beams,
angle, focus and intensity of each beam.

A beamformer is a key algorithm in radar and wireless systems. It is a signal
processing technique that sensor arrays use for directional signal transmission or
reception. In transmission, a beamformer controls the phase and amplitude of the
individual array elements to create constructive or destructive interference in the
wavefront. In reception, the beamformer combines information from different elements
such that the expected pattern of radiation is preferentially observed. A number of
different algorithms exist. An efficient scheme combines multiple paths constructively.
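
The handbook does not give the phase equations that beamform_2d.mdl uses. As a hedged illustration of constructive combining, the following Python sketch applies the textbook narrowband phase-steering rule to a rectangular array; the function names, the element spacing in wavelengths, and the angle convention (theta measured from the array normal, phi as azimuth) are all assumptions.

```python
import cmath
import math


def direction_cosines(theta, phi):
    """Project a unit direction vector onto the two array axes."""
    return math.sin(theta) * math.cos(phi), math.sin(theta) * math.sin(phi)


def steering_phases(nx, ny, spacing, theta, phi):
    """Phase (radians) applied to each element to steer towards (theta, phi)."""
    ux, uy = direction_cosines(theta, phi)
    return [[-2.0 * math.pi * spacing * (ix * ux + iy * uy)
             for iy in range(ny)] for ix in range(nx)]


def array_response(phases, spacing, theta, phi):
    """Magnitude of the summed element outputs for a plane wave from (theta, phi)."""
    ux, uy = direction_cosines(theta, phi)
    total = 0j
    for ix, row in enumerate(phases):
        for iy, phase in enumerate(row):
            arrival = 2.0 * math.pi * spacing * (ix * ux + iy * uy)
            total += cmath.exp(1j * (arrival + phase))
    return abs(total)
```

In the steered direction every element contributes in phase, so the response of a 4 × 4 array equals 16; in other directions the contributions partly cancel.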

The simulation calculates the phases in MATLAB code (as a reference), simulates the
beamformer 2D design to calculate the phases in DSP Builder Advanced Blockset,
compares the reference to the simulation results and plots the beam pattern.

The design example uses vectors of single precision floating-point numbers, with
state-machine control from two for loops.

The model file is beamform_2d.mdl.

6.7. DSP Builder Flow Control Design Examples


1. Avalon-ST Interface (Input and Output FIFO Buffer) with Backpressure
2. Avalon-ST Interface (Output FIFO Buffer) with Backpressure
3. Kronecker Tensor Product
4. Parallel Loops
5. Primitive FIR with Back Pressure
6. Primitive FIR with Forward Pressure
7. Primitive Systolic FIR with Forward Flow Control
8. Rectangular Nested Loop
9. Sequential Loops
10. Triangular Nested Loop

6.7.1. Avalon-ST Interface (Input and Output FIFO Buffer) with Backpressure


This example demonstrates the Avalon-ST input interface with FIFO buffers and the
Avalon-ST output interface blocks. This example has FIFO buffers in the input and
output interfaces. Use the manual switches in the testbench to change when
downstream is ready for data or to turn off input. The simulation ends by turning off
incoming data and ensures that it writes out as many valid data cycles as it receives.

The model file is demo_avalon_st_input_fifo.mdl.

6.7.2. Avalon-ST Interface (Output FIFO Buffer) with Backpressure


This example demonstrates the Avalon-ST input interface and the Avalon-ST output
interface blocks. This example has FIFO buffers in the output interface only. Use
manual switches in the testbench to change when downstream is ready for data or to
turn off input. The simulation ends by turning off incoming data and ensures that it
writes out as many valid data cycles as it receives.

The model file is demo_avalon_st.mdl.

6.7.3. Kronecker Tensor Product


This design example generates a Kronecker tensor product. The design example
shows how to use the Loop block to generate datapaths that operate on regular data.
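
For reference, the Kronecker product itself can be written as four nested loops, which is the kind of regular index pattern that loop counters generate. This Python sketch is illustrative only and does not describe how demo_kronecker.mdl arranges its counters or memories.

```python
def kronecker(a, b):
    """Kronecker tensor product of two matrices given as lists of rows."""
    out = []
    for i in range(len(a)):                # row of a
        for p in range(len(b)):            # row of b
            row = []
            for j in range(len(a[0])):     # column of a
                for q in range(len(b[0])):  # column of b
                    row.append(a[i][j] * b[p][q])
            out.append(row)
    return out


product = kronecker([[1, 2], [3, 4]], [[0, 1], [1, 0]])
```

Each element of the first matrix scales a complete copy of the second matrix, so a 2 × 2 by 2 × 2 product yields a 4 × 4 result.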

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.

The Chip subsystem includes the Device block and a lower-level
KroneckerSubsystem subsystem.

The KroneckerSubsystem subsystem includes ChannelIn, ChannelOut, Loop,
Const, DualMem, Mult, and SynthesisInfo blocks.

In this design example, the top level of the FPGA device (marked by the Device
block) and the synthesizable KroneckerSubsystem subsystem (marked by the
SynthesisInfo block) are at different hierarchy levels.

The model file is demo_kronecker.mdl.

6.7.4. Parallel Loops


This design example has two inner loops nested within the outer loop. The inner loops
execute in parallel rather than sequentially. The two inner loops are started
simultaneously by duplicating the control token but finish at different times. The
Rendezvous block waits until both of them finish and then passes the control token
back to the outer loop.

The model file is forloop_parloop.mdl.

6.7.5. Primitive FIR with Back Pressure


This DSP Builder design example uses Primitive library blocks to implement a FIR
design with flow control and back pressure. The design example shows how you use
the Primitive FIFO block to implement back pressure and flow control.

The top-level testbench includes Control and Signals blocks.

The FirChip subsystem includes the Device block and a lower-level primitive FIR
subsystem.

The primitive FIR subsystem includes ChannelIn, ChannelOut, FIFO, Not, And,
Mux, SampleDelay, Const, Mult, Add, and SynthesisInfo blocks.

In this design example, the top level of the FPGA device (marked by the Device
block) and the synthesizable Primitive FIR subsystem (marked by the SynthesisInfo
block) are at different hierarchy levels.

The model file is demo_back_pressure.mdl.

This design example shows how back pressure from a downstream block can halt
upstream processing. This design example provides three FIR filters. A FIFO buffer
follows each FIR filter and can buffer any data that is still flowing through the
filter. If the FIFO buffer becomes half full, the design deasserts the ready signal
to the upstream block, which prevents any new input (as flagged by valid) entering
the FIR block. The FIFO buffers always show the next data if it is available and
assert the valid signal high. You must AND this FIFO valid signal with the ready
signal to consume the data at the head of the FIFO buffer. If the AND result is
high, you can consume data because it is available and you are ready for it.

You can chain several blocks together in this way, and no ready signal has to feed
back further than one block, which allows you to use modular design techniques with
local control.

The delay in the feedback loop represents the lumped delay that spreads throughout
the FIR filter block. The delay must be at least as big as the delay through the FIR
filter. This delay is not critical. Experiment with some values to find the right one. The
FIFO buffer must be able to hold at least this much data after it asserts full. The full
threshold must be at least this delay amount below the size of the FIFO buffer (64 –
32 in this design example).
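
The sizing rule can be checked with a small Python model. The model is a sketch under stated assumptions: the upstream block sends a sample on every cycle in which ready is high, each sample takes `latency` cycles to reach the FIFO buffer, and the downstream consumer never reads, which is the worst case. The function name is illustrative.

```python
from collections import deque


def max_fifo_fill(threshold, latency, cycles=500):
    """Return the highest fill level seen when the consumer never reads."""
    in_flight = deque([False] * latency)  # samples still inside the FIR filter
    fill = 0
    highest = 0
    for _ in range(cycles):
        ready = fill < threshold          # back pressure once the threshold is hit
        in_flight.append(ready)           # upstream sends whenever ready is high
        if in_flight.popleft():           # sample leaves the filter after `latency`
            fill += 1
        highest = max(highest, fill)
    return highest


safe = max_fifo_fill(threshold=32, latency=32)     # fits in a 64-entry FIFO
overflow = max_fifo_fill(threshold=32, latency=40)  # needs more than 64 entries
```

The fill level keeps rising for `latency` cycles after the threshold is reached, so the FIFO buffer must have at least that many free entries above the threshold.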

The final block uses an external ready signal that comes from a downstream block in
the system.

6.7.6. Primitive FIR with Forward Pressure


This DSP Builder design example uses Primitive library blocks to implement a FIR
design with forward flow control. The design example shows how you can add a simple
forward flow control scheme to a FIR design so that it can handle invalid source data
correctly.

The top-level testbench includes Control and Signals blocks.

The FirChip subsystem includes the Device block and a lower-level Primitive FIR
subsystem.

The primitive FIR subsystem includes ChannelIn, ChannelOut, Mux, SampleDelay,
Const, Mult, Add, and SynthesisInfo blocks.

In this design example, the top level of the FPGA device (marked by the Device
block) and the synthesizable primitive FIR subsystem (marked by the SynthesisInfo
block) are at different hierarchy levels.

The model file is demo_forward_pressure.mdl.

The design example has a sequence of three FIR filters that stall when the valid signal
is low, preventing invalid data polluting the datapath. The design example has a
regular filter structure, but with a delay line implemented in single-cycle latches—
effectively an enabled delay line.

You need not enable everything in the filter (multipliers, adders, and so on), just the
blocks with state (the registers). Then observe the output valid signal, which DSP
Builder pipelines with the logic, and observe the valid output data only.
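
The enabled delay line can be modeled in a few lines of Python. The function name and the tuple output are assumptions for illustration; only the delay line is gated by valid, while the multiply-add arithmetic runs on every cycle.

```python
def enabled_fir(samples, valids, coefficients):
    """FIR filter whose delay line advances only on valid input cycles.

    Returns one (valid, value) pair per clock cycle.
    """
    delay_line = [0] * len(coefficients)
    outputs = []
    for sample, valid in zip(samples, valids):
        if valid:                          # only the registers are enabled
            delay_line = [sample] + delay_line[:-1]
        value = sum(c * x for c, x in zip(coefficients, delay_line))
        outputs.append((valid, value))     # arithmetic runs every cycle
    return outputs


cycles = enabled_fir([1, 99, 2, 99, 99, 3, 4],
                     [True, False, True, False, False, True, True],
                     [1, 1, 1])
valid_values = [value for valid, value in cycles if valid]
```

The samples presented on invalid cycles never enter the delay line, so the valid outputs are the same as those of an ungated filter fed with only the valid samples.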

You can also use vectors to implement the constant multipliers and adder tree, which
also speeds up simulation.

You can improve the design example further by using the TappedDelayLine block.

6.7.7. Primitive Systolic FIR with Forward Flow Control


This DSP Builder design example uses Primitive library blocks to implement a systolic
FIR design with forward flow control. The design example shows how you can add a
simple forward flow control scheme to a FIR design so that it can handle invalid source
data correctly.

The top-level testbench includes Control and Signals blocks.

The FirChip subsystem includes the Device block and a lower-level Primitive FIR
subsystem.

The Primitive FIR subsystem includes ChannelIn, ChannelOut, Mux, SampleDelay,
Const, Mult, Add, and SynthesisInfo blocks.

In this design example, the top level of the FPGA device (marked by the Device
block) and the synthesizable primitive FIR subsystem (marked by the SynthesisInfo
block) are at different hierarchy levels.

The design example has a sequence of three FIR filters that stall when the valid signal
is low, preventing invalid data polluting the datapath. The design example has a
regular filter structure, but with a delay line implemented in single-cycle latches—
effectively an enabled delay line.

You need not enable everything in the filter (multipliers, adders, and so on), just the
blocks with state (the registers). Then observe the output valid signal, which DSP
Builder pipelines with the logic, and observe the valid output data only.

You can also use vectors to implement the constant multipliers and adder tree, which
also speeds up simulation. You can improve the design example further with the
TappedDelayLine block.

The model file is demo_forward_pressure.mdl.

6.7.8. Rectangular Nested Loop


In this design example, all initialization, step, and limit values are constant. At
the corners (at the end of loops) there may be cycles where the count value goes out
of range; in those cycles, the output valid signal from the loop is low.

The token-passing structure is typical for a nested-loop structure. The bs port of the
innermost loop (ForLoopB) connects to the bd port of the same loop, so that the next
loop iteration of this loop starts immediately after the previous iteration.

The bs port of the outer loop (ForLoopA) connects to the ls port of the inner loop;
the ld port of the inner loop loops back to the bd port of the outer loop. Each iteration
of the outer loop runs a full activation of the inner loop before continuing on to the
next iteration.

The ls port of the outer loop connects to external logic and the ld port of the outer
loop is unconnected, which is typical of applications where the control token is
generated afresh for each activation of the outermost loop.

The model file is forloop_rectangle.mdl.

6.7.9. Sequential Loops


This design example nests two inner loops (InnerLoopA and InnerLoopB) within the
outer loop. The design example daisy chains the ld port of InnerLoopA to the ls
port of InnerLoopB rather than connecting it directly to the bd port of
OuterLoop. Thus, each activation of InnerLoopA is followed by an activation of
InnerLoopB.

The model file is forloop_seqloop.mdl.

6.7.10. Triangular Nested Loop

The initialization, step, and limit values do not have to be constants. By using the
count value from an outer loop as the limit of an inner loop, the counter effectively
walks through a triangular set of indices.

The token-passing structure for this loop is identical to that for the rectangular loop,
except for the parameterization of the loops.
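
The difference between the two index sets can be shown in Python. The function names are illustrative, and the sketch assumes that the inner count runs from zero up to and including the outer count; the exact limit convention of the loop blocks is not stated here.

```python
def rectangular_indices(outer_limit, inner_limit):
    """Constant limits: every outer count sees the same inner range."""
    return [(i, j) for i in range(outer_limit) for j in range(inner_limit)]


def triangular_indices(outer_limit):
    """The outer count sets the limit of the inner loop."""
    return [(i, j) for i in range(outer_limit) for j in range(i + 1)]


rectangle = rectangular_indices(3, 3)   # 9 index pairs
triangle = triangular_indices(3)        # 6 index pairs
```

Only the inner limit changes between the two functions, which mirrors the statement that the token-passing structure is the same and only the parameterization differs.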

The model file is forloop_triangle.mdl.

6.8. DSP Builder HDL Import Design Example


This digital up-converter resamples 20 MSPS complex baseband data to 80 MSPS,
mixes it to center on +16 MHz, and applies some simple
digital predistortion (DPD). This design example takes FIR and DPD VHDL components
and creates a complete up-conversion chain by importing this existing IP and adding
the up-conversion, mixer, and pre-DPD scaling.

The digital upconverter includes: input memory, upconverter, FIR filter, scaler, mixer
and digital predistortion (DPD).

Table 18. Example Design Files

File or Directory              Description
hdl_import_duc.mdl             The DSP Builder design.
hdl_import_duc_params.xml      The design's parameter file.
hdl_import_calc_fir_coefs.m    A script to generate the FIR coefficients using MATLAB's cfirpm
                               function. DSP Builder prints the coefficients to MATLAB's
                               Command Window and you can copy and paste them into
                               coefficients.vhd.
calc_dpd_coefs.m               A script to generate the DPD coefficients using a simple
                               polynomial model of a power amplifier. DSP Builder prints the
                               coefficients to MATLAB's Command Window and you can copy and
                               paste them into lut_dpd.vhd.
to_import                      This directory contains 12 VHDL source files.

VHDL Components

The design example includes a complex FIR filter in VHDL optimized for Intel Stratix
10 devices. This FIR filter has one valid data sample every eight clock cycles.
See Designing Filters for High Performance.

The simple LUT-based DPD is initialized with a third-order polynomial.

Top-Level Design

The top-level design contains the device-level subsystem and five downsample and
spectrum analyzer blocks from the MathWorks DSP System Toolbox. These blocks show
the spectral output from the various stages of the up-conversion chain.

Figure 52. Top-Level Design

Digital Up Converter

The digital_up_converter subsystem is the device-level subsystem. It contains all of
the design's DSP Builder-based components and two gaps for HDL Import blocks.

Figure 53. Digital Upconverter

Buffer and Upsample

This scheduled subsystem contains two SharedMem blocks, which contain the 20
MSPS baseband source: one for the real part of the signal and one for the imaginary
part. You can write to the blocks via the bus or use the preloaded tones.

The read_counter block drives the upconversion. It counts modulo 32 because it
upsamples the 20 MSPS baseband by 4 to 80 MSPS and then holds each sample for 8
clock cycles at a clock rate of 640 MHz. The FIR filter accepts one sample every eight
cycles. By holding the samples, the FIR does not need synchronization logic.
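
The modulo-32 count follows from the rates: 640 MHz / 80 MSPS = 8 clock cycles per sample, 80 MSPS / 20 MSPS = 4 samples per baseband sample, and 4 × 8 = 32. The Python sketch below models one possible behavior of such a counter, assuming zero-insert upsampling; the names are illustrative and the sketch is not derived from the subsystem itself.

```python
UPSAMPLE = 4          # 20 MSPS baseband to 80 MSPS
HOLD_CYCLES = 8       # 640 MHz clock divided by 80 MSPS
PERIOD = UPSAMPLE * HOLD_CYCLES   # 32 clock cycles per baseband sample


def upsample_and_hold(baseband):
    """Produce one output value per clock cycle from a modulo-32 count."""
    stream = []
    for count in range(len(baseband) * PERIOD):
        index, phase = divmod(count, PERIOD)
        if phase < HOLD_CYCLES:
            stream.append(baseband[index])   # the sample, held for 8 cycles
        else:
            stream.append(0)                 # three inserted zeros, each held
    return stream


stream = upsample_and_hold([5, 7])
```

Each baseband sample therefore occupies 8 clock cycles followed by 24 cycles of zeros, and a filter that samples its input once every eight cycles sees the sample followed by three zeros.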

Figure 54. Buffer and Up-Sample

Mixer

This single-channel mixer consists of NCO and ComplexMixer IP blocks and a
scheduled subsystem for controlling the NCO. The control subsystem asserts the valid
signal once every eight cycles. The NCO generates a 16 MHz complex tone, which the
ComplexMixer uses to mix the filtered signal.
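
The mixing operation can be sketched in Python as a multiplication by a complex exponential. The function names are illustrative and the sketch uses floating-point arithmetic, whereas the NCO and ComplexMixer IP blocks are fixed-point hardware blocks.

```python
import cmath
import math

CLOCK_HZ = 640e6
SAMPLE_RATE_HZ = CLOCK_HZ / 8     # one valid sample every eight cycles
TONE_HZ = 16e6


def nco(num_samples):
    """Complex tone at TONE_HZ, sampled at SAMPLE_RATE_HZ."""
    step = 2.0 * math.pi * TONE_HZ / SAMPLE_RATE_HZ
    return [cmath.exp(1j * step * n) for n in range(num_samples)]


def mix(signal, tone):
    """Shift the spectrum of signal by the tone frequency."""
    return [s * t for s, t in zip(signal, tone)]


shifted = mix([1 + 0j] * 10, nco(10))          # DC moves to +16 MHz
normalized_to_clock = TONE_HZ / CLOCK_HZ       # 0.025 cycles per clock
```

At 80 MSPS the 16 MHz tone repeats every five samples, and relative to the 640 MHz clock it has a normalized frequency of 0.025.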

Figure 55. Mixer

Scale

The scale scheduled subsystem scales the data so that it fits within the DPD's range of
operation by bit-shifting the mixer's output. You can use the optional multiplier
to increase the signal level if bit-shifting is insufficient.

Figure 56. Scale

FIR Coefficients

The FIR coefficients are defined in coefficients.vhd.

The coefficients are calculated in hdl_import_calc_fir_coefs.m. This script uses
MATLAB’s cfirpm command to create complex coefficients.

DPD

The file lut_dpd.vhd contains the DPD for this design example. The DPD consists of
an address generator that indexes a LUT. The output of the LUT is then multiplied with
the complex input data. The LUT contents are calculated in
hdl_import_calc_dpd_coefs.m. This script uses a simple, real-numbered, third-
order model of an amplifier to calculate predistortion coefficients. DSP Builder uses
these coefficients to calculate the LUT contents.

Simulink Simulation Results

Figure 57. Simulation

The first four waveforms are the real and imaginary input and output of the FIR. The FIR smooths the zero-
padded signals.

The next four waveforms are the real and imaginary input and output of the DPD.

Figure 58. Upconverted

The two preloaded memory signals are clearly visible about 0, as are their four aliases because of the zero-
insert upsampling.

Figure 59. Filtered

The aliased signals are attenuated by 40 dB, as expected from the analysis in calc_fir_coefs.m.

Figure 60. Mixed

The mixed spectrum shows the baseband signal moving over to be centered on 16 MHz. This view shows the
Simulink clock rate of 1 Hz rather than the FPGA clock rate of 640 MHz, so 16 MHz becomes 25 mHz.

Figure 61. Scaled

Scaled looks identical to mixed, except that the signal amplitude is much greater.

Figure 62. Output

The post-DPD output signal is a noisier version of the scaled signal. Observe the two third-order harmonics in
the pass-band.

Related Information
Designing Filters for High Performance

6.8.1. Performing a Cosimulation


This tutorial uses the DSP Builder HDL import design example.

The design example has two HDL entities: the DPD (lut_dpd.vhd) and the FIR
(complex_fir.vhd).

In DSP Builder cosimulation, each HDL Import block represents an HDL instance. You
must instantiate both of these entities in a top-level VHDL file. For this design
example, Intel provides top.vhd.

In addition, the FIR filter uses a signed data type with a generic for the data width.
When DSP Builder instantiates the FIR filter, it uses its own paradigm (i.e.
std_logic_vector and no generics). This design example adds a wrapper entity:
complex_fir_wrapper.vhd. This entity instantiates complex_fir, including setting
the generic to the appropriate value, and converts signed to std_logic_vector.

These two files, top.vhd and complex_fir_wrapper.vhd, are in the to_import
directory.
1. Add an HDL Import Config block to the top-level design.

Figure 63. Top-level Design with HDL Import Config Block

2. Parameterize the HDL Import Config block.
a. Click Add to add all of the files from the to_import directory.
The order of the files does not matter. DSP Builder determines the type of HDL
file by the extension, but you can change the type manually.
b. Enter top in the Top level instance.
c. Turn on Top-level is a wrapper.
d. Click the Compile button.
e. Set the Simulink sample time field to 1.
f. When the status light is green, click Launch Cosim.

Figure 64. HDL Import Configuration

3. Add an HDL Import block to the digital_up_converter subsystem.
a. Double click the HDL Import block.
b. Click Instance and select inst_fir.
c. Set the fractional bits of the two output signals to 16.

Figure 65. HDL Import Block inst_fir Parameters

4. Add a second HDL Import block to the digital_up_converter subsystem.
a. Double click the HDL Import block.
b. Click Instance and select inst_dpd.
c. Set the fractional bits of the two output signals to 27.
d. Set the valid output to unsigned.

Figure 66. HDL Import inst_dpd Parameters

5. Wire up the HDL Import blocks.
The HDL Import block port names are in alphabetical order.

Figure 67. Wire up HDL Import Blocks

6. Press the play button or advance through the simulation a cycle at a time.
7. To verify the HDL import with the ModelSim simulator, select DSP
Builder ➤ Run ModelSim ➤ Device.
The cosimulation turns any non-high state (e.g. U or X) to a zero.
8. Compile the design in Intel Quartus Prime by selecting DSP Builder ➤ Run
Quartus Prime Software.

6.9. DSP Builder Host Interface Design Examples


1. Memory-Mapped Registers

6.9.1. Memory-Mapped Registers


This design example takes the use of processor registers to an extreme by
implementing a simple calculator. The processor writes arguments and reads results
through registers and shared memories.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.

This design also includes BusStimulus and BusStimulusFileReader blocks.

The RegChip subsystem includes RegField, RegBit, RegOut, SharedMem, Const,
Add, Sub, Mult, Convert, Select, BitExtract, Shift, and SynthesisInfo blocks.

The model file is demo_regs.mdl.


6.10. DSP Builder Platform Design Examples


This folder contains design examples that illustrate how you can implement a digital
down converter (DDC) or digital up converter (DUC) for use in a radio basestation. Use
these designs as a starting point to build your own filter chain that meets your exact
needs.

1. 16-Channel DDC
2. 16-Channel DUC
3. 2-Antenna DUC for WiMAX
4. 2-Channel DUC
5. Super-Sample Rate Digital Upconverter

6.10.1. 16-Channel DDC


This design example shows how to use IP and Interface blocks to build a
16-channel digital down converter (DDC) for modern radio systems.

Decimating CIC and FIR filters down convert eight complex carriers (16 real channels)
from 61.44 MHz. The total decimation rate is 64. A real mixer and NCO isolate the
eight carriers. The testbench isolates two channels of data from the TDM signals using
a channel viewer.
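
As a quick check of the rates quoted above, the following sketch computes the sample rate that remains after the full decimation. The function name is illustrative and is not part of DSP Builder. The text does not say how the design splits the factor of 64 between the CIC and FIR stages, so the sketch treats it as a single factor.

```python
def output_rate_hz(input_rate_hz: float, total_decimation: int) -> float:
    """Return the sample rate after decimating by an integer factor."""
    if total_decimation < 1:
        raise ValueError("decimation factor must be at least 1")
    return input_rate_hz / total_decimation


# 61.44 MHz input decimated by 64, as in the 16-channel DDC description.
rate = output_rate_hz(61.44e6, 64)
print(f"{rate / 1e3:.0f} ksps")  # prints "960 ksps"
```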

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output bus. An
Edit Params block allows easy access to the setup variables in the
setup_demo_ddc.m script.

The DDCChip subsystem includes Device, DecimatingFIR, DecimatingCIC,
Mixer, NCO, Scale, RegBit, and RegField blocks.

The model file is demo_ddc.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.10.2. 16-Channel DUC


This design example shows how to build a 16-channel DUC as found in modern radio
systems using Interface, IP, and Primitive blocks.


This design example shows an interpolating filter chain with interpolating CIC and FIR
filters that up convert eight complex channels (16 real channels). The total
interpolation rate is 50. DSP Builder integrates several Primitive subsystems into the
datapath. This design example shows how you can integrate IP blocks with Primitive
subsystems:
• The programmable Gain subsystem, at the start of the datapath, shows how you
can use processor-visible register blocks to control a datapath element.
• The Sync subsystem is a Primitive subsystem that shows how to manage two
data streams coming together and synchronizing. The design writes the data from
the NCOs to a memory with the channel as an address. The data stream uses its
channel signals to read out the NCO signals, which resynchronizes the data
correctly. Alternatively, you can simply delay the NCO value by the correct number
of cycles to ensure that the NCO and channel data arrive at the Mixer on the
same cycle.
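
The resynchronization idea in the Sync subsystem can be sketched in software as a small memory addressed by channel number. This is a behavioral illustration only: the function name, the tuple layout, and the write-before-read ordering within a cycle are assumptions, not details taken from the design.

```python
def resync(nco_stream, data_stream, num_channels):
    """Pair each data sample with the latest NCO value for the same channel.

    Both streams are lists of (channel, value) tuples, one tuple per cycle.
    The streams may be offset from each other by some number of channels.
    """
    memory = [None] * num_channels  # one word per channel
    paired = []
    for (nco_ch, nco_val), (data_ch, data_val) in zip(nco_stream, data_stream):
        memory[nco_ch] = nco_val  # write side: address is the NCO channel
        # Read side: the data stream's own channel signal is the address.
        paired.append((data_ch, data_val, memory[data_ch]))
    return paired
```

Until every address has been written once, the read side returns the initial contents (None in this sketch), which corresponds to the start-up period of the memory.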

The design makes extensive use of Simulink multiplexer and demultiplexer blocks to
manage vector signals.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output bus. An
Edit Params block allows easy access to the setup variables in the
setup_demo_duc.m script.

The DUCChip subsystem includes a Device block and a lower level DUC16
subsystem.

The DUC16 subsystem includes InterpolatingFIR, InterpolatingCIC,
ComplexMixer, NCO, and Scale blocks.

It also includes lower level Gain, Sync, and CarrierSum subsystems, which make use
of other Interface and Primitive blocks including AddSLoad, And, BitExtract,
ChannelIn, ChannelOut, CompareEquality, Const, SampleDelay, DualMem,
Mult, Mux, Not, Or, RegBit, RegField, and SynthesisInfo blocks.

The model file is demo_duc.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.10.3. 2-Antenna DUC for WiMAX


This design example shows how to build a 2-antenna DUC to meet a WiMAX
specification.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output bus.

The DUCChip subsystem includes a Device block and a lower level DUC2Antenna
subsystem.

The DUC2Antenna subsystem includes InterpolatingFIR, SingleRateFIR, Const,
ComplexMixer, NCO, and Scale blocks.

The model file is demo_wimax_duc.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.


6.10.4. 2-Channel DUC


This design example shows how to build a 2-channel DUC.

Interpolating CIC and FIR filters up convert a single complex channel (2 real
channels). An NCO and Mixer subsystem combine the complex input channels into a
single output channel.

This design example shows how quick and easy it is to emulate the contents of an
existing datapath. A Mixer block implements the mixer in this design example because
the data rate is low enough to save resources with a time-shared hardware technique.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output bus. An
Edit Params block allows easy access to the setup variables in the
setup_demo_AD9856.m script.

The AD9856 subsystem includes a Device block and a lower level DUCIQ
subsystem.

The DUCIQ subsystem includes Const, InterpolatingFIR, SingleRateFIR,
InterpolatingCIC, NCO, Scale blocks, and a lower level Mixer subsystem.

The Mixer subsystem includes ChannelIn, ChannelOut, Mult, Const, BitExtract,
CompareEquality, And, Delay, Sub, and SynthesisInfo blocks.

The model file is demo_AD9856.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.10.5. Super-Sample Rate Digital Upconverter

The model file is demo_ssduc.mdl.

6.11. DSP Builder Primitive Block Design Examples


1. 8×8 Inverse Discrete Cosine Transform
2. Automatic Gain Control
3. Bit Combine for Boolean Vectors
4. Bit Extract for Boolean Vectors
5. Color Space Converter
6. CORDIC from Primitive Blocks
7. Digital Predistortion Forward Path
8. Fibonacci Series
9. Folded Vector Sort
10. Fractional Square Root Using CORDIC
11. Fixed-point Maths Functions
12. Gaussian Random Number Generator
13. Hello World
14. Hybrid Direct Form and Transpose Form FIR Filter
15. Loadable Counter
16. Matrix Initialization of LUT
17. Matrix Initialization of Vector Memories
18. Multichannel IIR Filter
19. Quadrature Amplitude Modulation
20. Reinterpret Cast for Bit Packing and Unpacking
21. Run-time Configurable Decimating and Interpolating Half-Rate FIR Filter
22. Square Root Using CORDIC
23. Test CORDIC Functions with the CORDIC Block
24. Uniform Random Number Generator
25. Vector Sort—Sequential
26. Vector Sort—Iterative
27. Vector Initialization of Sample Delay
28. Wide Single-Channel Accumulators

6.11.1. 8×8 Inverse Discrete Cosine Transform


This design example uses the Chen-Wang algorithm to implement a fully pipelined
8×8 inverse discrete cosine transform (IDCT).

Separate subsystems perform the row transformation (Row), corner turner
(CornerTurn), and column transformation (Col) functions. The design example
synthesizes each subsystem separately. The Row and Col subsystems have
additional levels of hierarchy for the different stages. The SynthesisInfo block is at
the row or column level, so the design example flattens these subsystems before
synthesis.
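
The row, corner-turn, column decomposition can be illustrated with a floating-point reference model. This sketch uses the direct textbook 1-D inverse DCT formula with orthonormal scaling, not the Chen-Wang fast algorithm that the design example implements. The function names and the final transpose that restores the original orientation are assumptions made for illustration.

```python
import math

N = 8


def idct_1d(coeffs):
    """Direct-form 1-D inverse DCT of N coefficients (orthonormal scaling)."""
    out = []
    for n in range(N):
        total = 0.0
        for k in range(N):
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            total += scale * coeffs[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        out.append(total)
    return out


def corner_turn(block):
    """Transpose the block so that columns become rows."""
    return [list(col) for col in zip(*block)]


def idct_2d(block):
    rows = [idct_1d(row) for row in block]   # Row stage
    turned = corner_turn(rows)               # CornerTurn stage
    cols = [idct_1d(row) for row in turned]  # Col stage
    return corner_turn(cols)                 # restore the original orientation
```

Because the 2-D transform is separable, the same 1-D routine serves both the row and the column stage; only the corner turn between them changes which dimension it operates on.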

The CornerTurn block makes extensive use of Simulink Goto/From blocks to
reduce the wiring complexity. The top-level testbench includes Control and Signals
blocks. The IDCTChip subsystem includes the Device block and a lower level IDCT
subsystem. The IDCT subsystem includes lower level subsystems built from the
ChannelIn, ChannelOut, Const, BitCombine, Shift, Mult, Add, Sub,
BitExtract, SampleDelay, OR Gate, Not, Sequence, and SynthesisInfo blocks.

The model file is demo_idct8x8.mdl.

6.11.2. Automatic Gain Control


This design example implements an automatic gain control.

This design example shows a complex loop with several subloops that it schedules and
pipelines without inserting registers. The design example spreads a lumped delay
around the circuit to satisfy timing while maintaining correctness. Processor visible
registers control the thresholds and gains.


In complex algorithmic circuits, the zero-latency blocks make it easy to follow a data
value through the circuit and investigate the algorithm without offsetting all the
results by the pipelining delays.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.

The AGC_Chip subsystem includes the Device block, a RegField block and a lower
level AGC subsystem.

The AGC subsystem includes RegField, ChannelIn, ChannelOut, Mult,
SampleDelay, Add, Sub, Convert, Abs, CmpGE, Lut, Const, SharedMem, Shift,
BitExtract, Select, and SynthesisInfo blocks.

The model file is demo_agc.mdl.

6.11.3. Bit Combine for Boolean Vectors


This design example demonstrates different ways to use the BitCombine primitive
block to create signals of different widths from a vector of Boolean signals.

The one input BitCombine block is a special case that concatenates all the
components of the input vector and produces one wide scalar output signal. You can
apply 1-bit reducing operators to vectors of Boolean signals. The BitCombine block
supports multiple input concatenation. When vectors of Boolean signals are input on
multiple ports, corresponding components from each vector are combined so that the
output is a vector of signals.
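
The two behaviors described above can be modeled in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the bit ordering (first element or first port as the least significant bit) is chosen for illustration and may differ from the block's actual convention.

```python
def bit_combine(bits):
    """Concatenate Boolean values into one integer; bits[0] is the LSB (assumed)."""
    word = 0
    for position, bit in enumerate(bits):
        word |= (1 if bit else 0) << position
    return word


def bit_combine_ports(*ports):
    """Combine corresponding components of several Boolean vectors.

    Each port is a vector of Booleans. The result is a vector in which
    component i concatenates component i of every port; the first port
    supplies the LSB (assumed).
    """
    return [bit_combine(component) for component in zip(*ports)]
```

For example, `bit_combine([True, False, True, True])` yields one scalar, while `bit_combine_ports([1, 0], [1, 1])` yields a two-component vector.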

The model file is demo_bitcombine.mdl.

6.11.4. Bit Extract for Boolean Vectors


This design example demonstrates different ways to use the BitExtract block to split
a wide signal into a vector of narrow signal components.

This block converts a scalar signal into a vector of Boolean signals. You use the
initialization parameter to arbitrarily order the components of the vector output by the
BitExtract block. If the input to a BitExtract block is a vector, different bits can be
extracted from each of the components. The output does not always have to be a
vector of Boolean signals. For example, you may split a 16-bit wide signal into four
components, each 4 bits wide.
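
A minimal software model of this splitting follows. The function name and the parameter that lists the least significant bit position of each output component are hypothetical; they stand in for the block's initialization parameter, whose exact format is not reproduced here.

```python
def bit_extract(word, lsb_positions, width=1):
    """Return one field of `width` bits for each requested LSB position."""
    mask = (1 << width) - 1
    return [(word >> lsb) & mask for lsb in lsb_positions]


# Scalar to a vector of Booleans, most significant bit first.
print(bit_extract(0b1010, [3, 2, 1, 0]))             # [1, 0, 1, 0]
# A 16-bit word split into four 4-bit components, lowest nibble first.
print(bit_extract(0xABCD, [0, 4, 8, 12], width=4))   # [13, 12, 11, 10]
```

Reordering the position list reorders the output components, which mirrors the arbitrary ordering that the text describes.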

The model file is demo_bitextract.mdl.

6.11.5. Color Space Converter


This design example demonstrates DSP Builder Primitive subsystems with a simple
RGB to Y'CbCr color space conversion:

• Y = 0.257R + 0.504G + 0.098B + 16
• Cb = -0.148R - 0.291G + 0.439B + 128
• Cr = 0.439R - 0.368G - 0.071B + 128

The RGB data arrives as three parallel signals each clock cycle.

The model file is demo_csc.mdl.
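
The three conversion equations translate directly into a floating-point reference function. This sketch only restates the arithmetic; the function name is illustrative, and the fixed-point word lengths and rounding used in the design example are not modeled.

```python
def rgb_to_ycbcr(r, g, b):
    """Apply the RGB to Y'CbCr equations of this design example."""
    y = 0.257 * r + 0.504 * g + 0.098 * b + 16
    cb = -0.148 * r - 0.291 * g + 0.439 * b + 128
    cr = 0.439 * r - 0.368 * g - 0.071 * b + 128
    return y, cb, cr


print(rgb_to_ycbcr(0, 0, 0))  # black maps to (16.0, 128.0, 128.0)
```

With 8-bit inputs, the luma coefficients sum to 0.859, so white (255, 255, 255) maps to a luma of about 235 while both chroma terms cancel to 128.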


6.11.6. CORDIC from Primitive Blocks


This design example demonstrates building a CORDIC out of basic operators. This
design has the same functionality as the CORDIC library block in the
demo_cordic_lib_block example.

The model file is demo_cordic_primitives.mdl.

6.11.7. Digital Predistortion Forward Path


This design example demonstrates forward paths that implement digital predistortion
(DPD).

Forward paths compensate for nonlinear power amplifiers by applying the inverse of
the distortion that the power amplifier generates, such that the pre-distortion and the
distortion of the power amplifier cancel each other out. The power amplifier's non-
linearity may change over time, therefore such systems are typically adaptive.

This design example is based on "A robust digital baseband pre-distorter constructed
using memory polynomials," L. Ding, G. T. Zhou, D. R. Morgan, et al., IEEE
Transactions on Communications, vol. 52, no. 1, pp. 159-165, 2004.

This design example only implements the forward path, which is representative of
many systems where you implement the forward path in FPGAs, and the feedback
path on external processors. The design example sets the predistortion memory, Q, to
8; the highest nonlinearity order K is 5 in this design example. The file
setup_demo_dpd_fwdpath initializes the complex valued coefficients, which are
stored in registers. During operation, the external processor continuously improves
and adapts these coefficients with a microcontroller interface.
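
A memory-polynomial forward path of the kind described in the cited paper can be sketched as follows. The indexing convention is an assumption: here `coeffs[k-1][q]` weights the term `x[n-q] * |x[n-q]|**(k-1)` for k = 1..K and q = 0..Q-1, and samples before the start of the input count as zero. The design example may index the memory taps differently, and the function name is hypothetical.

```python
def memory_polynomial(x, coeffs):
    """Apply a memory-polynomial predistorter to a complex sequence.

    coeffs[k - 1][q] multiplies x[n - q] * abs(x[n - q]) ** (k - 1).
    """
    out = []
    for n in range(len(x)):
        acc = 0j
        for k_index, row in enumerate(coeffs):   # nonlinearity order k = k_index + 1
            for q, a_kq in enumerate(row):       # memory tap q
                if n - q >= 0:                   # treat earlier samples as zero
                    sample = x[n - q]
                    acc += a_kq * sample * abs(sample) ** k_index
        out.append(acc)
    return out
```

With K = 5 and Q = 8 as in this design example, `coeffs` holds 5 rows of 8 complex values, so each output sample needs 40 complex coefficient multiplications.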

The model file is demo_dpd_fwdpath.mdl.

6.11.8. Fibonacci Series


This DSP Builder design example generates a Fibonacci sequence.

This design example shows that even circuitry with tight feedback loops and 120-bit
adders can achieve high data rates through the pipelining algorithms. The top-level
testbench includes Control, Signals, Run ModelSim, and Run Quartus Prime
blocks. The Chip subsystem includes the Device block and a lower level FibSystem
subsystem. The FibSystem subsystem includes ChannelIn, ChannelOut,
SampleDelay, Add, Mux, and SynthesisInfo blocks.
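
The feedback loop can be modeled as two delayed values feeding a 120-bit adder. In this sketch the starting values (0 and 1) and the wraparound on overflow are assumptions; the text does not state how the design example initializes the sequence or handles overflow.

```python
WIDTH = 120
MASK = (1 << WIDTH) - 1


def fibonacci(count):
    """Return the first `count` Fibonacci numbers, wrapped to WIDTH bits."""
    prev, curr = 0, 1
    out = []
    for _ in range(count):
        out.append(prev)
        # The adder output feeds back through the sample delays.
        prev, curr = curr, (prev + curr) & MASK
    return out


print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```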

Note: In this design example, the top-level of the FPGA device (marked by the Device
block) and the synthesizable Primitive subsystem (marked by the SynthesisInfo
block) are at different hierarchy levels.

The model file is demo_fibonacci.mdl.


6.11.9. Folded Vector Sort


This design sorts the values on the input vector from largest to smallest. The design is
a masked subsystem that allows for sorting with either a comparator and mux block,
or a minimum and a maximum block. The first implementation is more efficient. Both
use the reconfigurable subsystem to choose between implementations using the
BlockChoice parameter.

Folded designs repeatedly use a single dual sort stage. The throughput of the design is
limited in the number of channels, vector width, and data rate. The data passes
through the dual sort stage (vector width)/2 times. The vector sort design example
uses full throughput with (vector width)/2 dual sort stages in sequence.
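
One way to picture the folded structure is an odd-even transposition sort in which a single stage is reused. The sketch below assumes that a dual sort stage is one compare-and-swap pass over even pairs followed by one pass over odd pairs. That interpretation, and the function names, are assumptions rather than details taken from the model.

```python
def dual_sort_stage(values):
    """One even pass then one odd pass of compare-and-swap, largest first."""
    v = list(values)
    for start in (0, 1):
        for i in range(start, len(v) - 1, 2):
            if v[i] < v[i + 1]:
                v[i], v[i + 1] = v[i + 1], v[i]
    return v


def folded_sort(values):
    """Reuse a single dual sort stage (vector width)/2 times."""
    v = list(values)
    for _ in range((len(v) + 1) // 2):
        v = dual_sort_stage(v)
    return v


print(folded_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [9, 6, 5, 4, 3, 2, 1, 1]
```

An unfolded version would instantiate the stage (vector width)/2 times in sequence and accept a new vector every cycle, at the cost of proportionally more hardware.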

Look under the mask to view the implementation of reconfigurable subsystem
templates and the blocks that reorder and interleave vectors.

The model file is demo_foldedsort.mdl.

6.11.10. Fractional Square Root Using CORDIC


This design example demonstrates CORDIC techniques, but does not use the CORDIC
block. This design example is fully iterative.

The design example allows you to generate a valid signal. The design example only
generates output and can only accept input every N cycles, where N depends on the
number of stages, the data output format, and the target fMAX. The valid signal goes
high when the output is ready. You can use this output signal to trigger the next input,
for example, a FIFO buffer read for bursty data.

The model file is demo_cordic_fracsqrt.mdl.

6.11.11. Fixed-point Maths Functions


This design example demonstrates how the Math, Trig and Sqrt functions support
fixed-point types and the fixed-point Divide function. You can use fixed-point types of
width up to and including 32 bits.

DSP Builder generates results using the same techniques as in the floating point
functions but at generally reduced resource usage, depending on data bit width.
Outputs are faithfully rounded. If the exact result is between two representable
numbers within the data format, DSP Builder uses either of them. In some instances
you see a difference in output result between simulation and hardware by one LSB. To
get bit-accurate results at the subsystem level, this example uses the Bit Exact
option on the SynthesisInfo block.
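
Faithful rounding means that either of the two representable neighbors of the exact result is acceptable, which is why two implementations can differ by one LSB. The following sketch lists both candidates for a given number of fractional bits; the function name and the use of exact rational arithmetic are illustrative choices.

```python
import math
from fractions import Fraction


def faithful_candidates(exact, frac_bits):
    """Return the representable values just below and just above `exact`.

    A faithfully rounded result may be either one. When `exact` is itself
    representable, both candidates are equal to it.
    """
    lsb = Fraction(1, 1 << frac_bits)
    scaled = Fraction(exact) / lsb
    return math.floor(scaled) * lsb, math.ceil(scaled) * lsb


low, high = faithful_candidates(Fraction(1, 3), 4)
print(low, high)  # 5/16 3/8
```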

The model file is demo_fixed_math.mdl.

6.11.12. Gaussian Random Number Generator


This DSP Builder design example demonstrates a random number generator (CLT
component method) that produces random numbers with normal distribution and
standard deviation that you specify using the input sigma_input.


You can also specify the seed value for the random sequence using the seed_value
input. The reset input resets the sequence to the initial state defined by the
seed_value. The output is a 32-bit single-precision floating-point number.
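
The central limit theorem (CLT) approach can be illustrated by summing several uniform random numbers and scaling the result. This sketch is a software analogy only: it uses Python's built-in generator rather than the hardware generator in the design, and the choice of 12 uniform terms per sample is an assumption.

```python
import random


def clt_gaussian(seed_value, sigma, count, terms=12):
    """Generate approximately normal samples with standard deviation sigma."""
    rng = random.Random(seed_value)  # the same seed restarts the same sequence
    samples = []
    for _ in range(count):
        total = sum(rng.random() for _ in range(terms))
        # A sum of `terms` uniforms has mean terms/2 and variance terms/12.
        normalised = (total - terms / 2) / (terms / 12) ** 0.5
        samples.append(sigma * normalised)
    return samples
```

Calling the function twice with the same seed reproduces the same sequence, which mirrors the effect of the reset input.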

6.11.13. Hello World


This DSP Builder design example produces a simple text message that it stores in a
look-up table.

An external input enables a counter that addresses a lookup-table (LUT) that contains
some text. The design example writes the result to a MATLAB array. You can examine
the contents with a char(message) command in the MATLAB command window.

This design example does not use any ChannelIn, ChannelOut, GPIn, or GPOut
blocks. The design example uses Simulink ports for simplicity although they prevent
the automatic testbench flow from working.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.

The Chip subsystem includes Device, Counter, Lut, and SynthesisInfo blocks.

Note: In this design example, the top-level of the FPGA device (marked by the Device
block) and the synthesizable Primitive subsystem (marked by the SynthesisInfo
block) are at the same level.

The model file is helloWorld.mdl.

6.11.14. Hybrid Direct Form and Transpose Form FIR Filter


The design example uses small, four-tap direct form filters to use the structure inside
the DSP block efficiently. The design example combines these direct form minifilters
into a transpose structure, which minimizes the logic and memory that the sample
pipe uses. This FIR filter shows a FIR architecture that is a hybrid between the direct
form and transpose form FIR filter. It combines the advantages of both.
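
A behavioral sketch of the hybrid idea follows. Each group of four taps forms a direct-form partial sum over the most recent four samples, and the partial sums are then combined with a delay of four samples per group, as a transpose-style chain would do. The function names are hypothetical and the sketch models arithmetic only, not the DSP block mapping.

```python
def direct_fir(x, h):
    """Plain direct-form FIR: y[n] = sum of h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def hybrid_fir(x, h, block=4):
    """Combine direct-form groups of `block` taps in a transpose-style chain."""
    # Pad the taps so that they split evenly into groups.
    taps = list(h) + [0] * (-len(h) % block)
    groups = [taps[i:i + block] for i in range(0, len(taps), block)]
    # Each group only needs the newest `block` input samples.
    partial = [direct_fir(x, g) for g in groups]
    # Group j contributes its partial sum delayed by block * j samples.
    return [sum(p[n - block * j] for j, p in enumerate(partial)
                if n - block * j >= 0)
            for n in range(len(x))]
```

Because every group reads the same short sample pipe, the long delay line of a pure direct-form filter is replaced by delays on the partial sums.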

The model file is demo_hybrid_fir_mc.mdl.

6.11.15. Loadable Counter


This design example demonstrates the LoadableCounter block.

The testbench reloads the counter with new parameters every 64 cycles. A manual
switch allows you to control whether the counter is permanently enabled, or only
enabled on alternate cycles. You can view the signals input and output from the
counter with the provided scope.

The model file is demo_ld_counter.mdl.


6.11.16. Matrix Initialization of LUT


This design example feeds a vector of addresses to the Primitive block such that DSP
Builder gives each vector component a different address. This design example also
shows Lut blocks working with complex data types. You can initialize Lut blocks in
exactly the same way.

Using this design example avoids demultiplexing, connecting, and multiplexing, so
that you can build parameterizable systems.

You can use one of the following ways to specify the contents of the Lut block:
• Specify table contents as single row or column vector. The length of the 1D row or
column vector determines the number of addressable entries in the table. If DSP
Builder reads vector data from the table, all components of a given vector share
the same value.
• When a look-up table contains vector data, you can provide a matrix to specify the
table contents. The number of rows in the matrix determines the number of
addressable entries in the table. Each row specifies the vector contents of the
corresponding table entry. The number of columns must match the vector length,
otherwise DSP Builder issues an error.
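
The two initialization forms and the per-component addressing can be modeled as follows. This is a sketch with hypothetical function names; it represents a matrix as a list of rows and does not reproduce DSP Builder's actual error message.

```python
def build_lut(contents, vector_length):
    """Expand the table contents into one vector per addressable entry."""
    if contents and isinstance(contents[0], (list, tuple)):
        # Matrix form: each row holds the vector stored at that address.
        for row in contents:
            if len(row) != vector_length:
                raise ValueError("number of columns must match the vector length")
        return [list(row) for row in contents]
    # 1-D form: every component of an entry shares the same value.
    return [[value] * vector_length for value in contents]


def lookup(table, addresses):
    """Read component i of the output from the entry at addresses[i]."""
    return [table[address][i] for i, address in enumerate(addresses)]


table = build_lut([[1, 2], [3, 4]], vector_length=2)
print(lookup(table, [1, 0]))  # [3, 2]
```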

Note: The default initialization of the LUT is a row vector round([0:255]/17). This vector
is inconsistent with the default for the DualMem block, which is a column vector
[zeros(16, 1)]. The latter form is consistent with the new matrix initialization form in
which the number of rows determines the addressable size.

The model file is demo_lut_matrix_init.mdl.

6.11.17. Matrix Initialization of Vector Memories


Use this feature in DSP Builder designs that handle vector data and require individual
components of each vector in the dual memory to be initialized uniquely.

The design example file is demo_dualmem_matrix_init.mdl.

You can initialize both the dual memory and LUT Primitive library blocks with matrix
data.

The number of rows in the 2D matrix that you provide for initialization determines the
addressable size of the dual memory. The number of columns must match the width of
the vector data. So the nth column specifies the contents of the nth dual memory.
Within each of these columns the ith row specifies the contents at the (i - 1)th
address (the first row is address zero, second row address 1, and so on).

The exception for this row and column interpretation of the initialization matrix is for
1D data, where the initialization matrix consists of either a single column or single
row. In this case, the interpretation is flexible and maps the vector (row or column)
into the contents of each dual memory. As in the previous behavior, all dual memories
have identical initial contents.

The demo_dualmem_matrix_init design example uses complex values in both the
initialization and the data that it later writes to the dual memory. You set up the
contents matrix in the model's set-up script, which runs on model initialization.


6.11.18. Multichannel IIR Filter


This DSP Builder design example implements a multichannel infinite impulse
response (IIR) filter as a masked subsystem that it builds from Primitive library
blocks.

This design example has many feedback loops. The design example implements all the
pipelined delays in the circuit automatically. The multiple channels provide more
latency around the circuit to ensure a high clock frequency result. Lumped delays
allow you to easily parameterize the design example when changing the channel
counts. For example, masking the subsystem provides the benefits of a black-box IP
block but with visibility.
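
The role of the lumped delays can be seen in a behavioral model of a time-division multiplexed second-order section. Every unit delay of the single-channel filter becomes a delay of as many samples as there are channels, so changing the channel count only changes one parameter. The function names are hypothetical, and any coefficients you pass in are placeholders rather than the values that ellip returns.

```python
def _past(seq, index):
    """Return seq[index], or zero for samples before the start."""
    return seq[index] if index >= 0 else 0.0


def tdm_biquad(x, b, a, channels):
    """Filter a channel-interleaved stream with one shared biquad.

    x holds samples in the order ch0, ch1, ..., ch0, ch1, ... and a[0] is
    assumed to be 1. Delays of `channels` samples keep the channels apart.
    """
    y = []
    for n in range(len(x)):
        d1, d2 = n - channels, n - 2 * channels
        y.append(b[0] * x[n] + b[1] * _past(x, d1) + b[2] * _past(x, d2)
                 - a[1] * _past(y, d1) - a[2] * _past(y, d2))
    return y
```

Setting `channels` to 1 gives the ordinary single-channel filter, which makes it easy to check that each interleaved channel is filtered independently.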

The top-level testbench includes Control and Signals blocks, plus a ChanView block
that deserializes the output buses.

The IIRChip subsystem includes the Device block and a masked IIRSubsystem
subsystem. The coefficients for the filter are set from [b, a] = ellip(2, 1, 10, 0.3); in
the callbacks for the masked subsystem. You can look under the mask to see the
implementation details of the IIRSubsystem subsystem which includes ChannelIn,
ChannelOut, SampleDelay, Const, Mult, Add, Sub, Convert, and SynthesisInfo
blocks.

The model file is demo_iir.mdl.

6.11.19. Quadrature Amplitude Modulation


This design example implements a simple quadrature amplitude modulation (QAM256)
scheme with noise addition. The testbench uses various Simulink blocks.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.

The QAM256Chip subsystem includes Add, GPIn, GPOut, BitExtract, Lut,
BitCombine, and SynthesisInfo blocks.

The model file is demo_QAM256.mdl.

Note: This design example uses the Simulink Communications Blockset.

6.11.20. Reinterpret Cast for Bit Packing and Unpacking


This design example demonstrates the ReinterpretCast block, which packs signals
into a long word and extracts multiple signals from a long word.

The first datapath reinterprets a single precision complex signal into raw 32-bit
components that separate into real and imaginary parts. A BitCombine block then
merges it into a 64-bit signal. The second datapath uses the BitExtract block to split
a 64-bit wide signal into a two-component vector of 32-bit signals. The
ReinterpretCast block then converts the raw bit pattern into single-precision IEEE
format. The HDL that the design synthesizes consists of simple wire connections, which
perform no computation.
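
The same packing and unpacking can be mimicked in software with the standard struct module, which reinterprets bit patterns without changing them. The placement of the imaginary part in the upper 32 bits is an assumption made for this sketch, as are the function names.

```python
import struct


def pack_complex_single(value):
    """Pack a complex number into a 64-bit word of two IEEE single patterns.

    The imaginary part occupies the upper 32 bits (assumed ordering).
    """
    re_bits, = struct.unpack("<I", struct.pack("<f", value.real))
    im_bits, = struct.unpack("<I", struct.pack("<f", value.imag))
    return (im_bits << 32) | re_bits


def unpack_complex_single(word):
    """Split a 64-bit word and reinterpret each half as an IEEE single."""
    re, = struct.unpack("<f", struct.pack("<I", word & 0xFFFFFFFF))
    im, = struct.unpack("<f", struct.pack("<I", word >> 32))
    return complex(re, im)


print(hex(pack_complex_single(complex(1.0, -2.0))))  # 0xc00000003f800000
```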

The model file is demo_reinterpret_cast.mdl.


6.11.21. Run-time Configurable Decimating and Interpolating Half-Rate FIR Filter

This design example contains a half-rate FIR filter, which can perform either
decimation or interpolation by a factor of two during run time.

In decimation mode, the design example accepts a new sample every clock cycle, and
produces a new result every two clock cycles. When interpolating, the design example
accepts a new input every other clock cycle, and produces a new result every clock
cycle. In both cases, the design example fully uses multipliers, making this structure
very efficient compared to parallel instantiations of interpolate and decimate filters, or
compared to a single rate filter with external interpolate and decimate stages.

The setup_demo_fir_tdd.m script sets the coefficients to [1 0 3 0 5 6 5 0 3 0 1] to
illustrate the operation of the filter.

The model file is demo_fir_tdd.mdl.

6.11.22. Square Root Using CORDIC


This design example demonstrates the CORDIC block. It configures the CORDIC
block for uint(32) input and uint(16) output. The example is partially parallelized
(four stages).

The design example allows you to generate a valid signal. The design example only
generates output and can only accept input every N cycles, where N depends on the
number of stages, the data output format, and the target fMAX. The valid signal goes
high when the output is ready. You can use this output signal to trigger the next input,
for example, a FIFO buffer read for bursty data.

The model file is demo_cordic_sqrt.mdl.

6.11.23. Test CORDIC Functions with the CORDIC Block


This design example demonstrates how to use the DSP Builder Primitive CORDIC
block to implement the coordinate rotation digital computer (CORDIC) algorithm.

The Mode input selects whether the block rotates the input vector by a specified
angle, or rotates the input vector to the x-axis while recording the angle required to
make that rotation. You can experiment with different input sizes to control the
precision of the CORDIC output.

The top-level testbench includes Control and Signals blocks.

The SinCos and AGC subsystem includes ChannelIn, ChannelOut, CORDIC, and
SynthesisInfo blocks.

The model file is demo_cordic_lib_block.mdl.

6.11.24. Uniform Random Number Generator


This DSP Builder design example demonstrates a random number generator
(Tausworthe-88) that produces uniformly distributed random numbers.


You can specify the seed value for the random sequence using the seed_value input.
The reset input resets the sequence to the initial state defined by the seed_value.
The output is a 32-bit random number, which can be interpreted as a random integer
sampled from the uniform distribution.

6.11.25. Vector Sort—Sequential


This design example sorts the values on the input vector from largest to smallest. The
sorting is a configurable masked subsystem: sortstages.

For sorting, the sortstages subsystem allows either a comparator and mux based
block, or one based on a minimum and a maximum block. The first is more efficient.
Both use the reconfigurable subsystem to choose between implementations using the
BlockChoice parameter.

The design repeatedly uses a dual sort stage in series. The data passes through the
dual sort stage (vector width)/2 times.

Look under the mask to view the implementation of reconfigurable subsystem
templates and the blocks that reorder and interleave vectors.

The model file is demo_vectorsort.mdl.

6.11.26. Vector Sort—Iterative


This design sorts the values on the input vector from largest to smallest. The design is
a masked subsystem that allows for sorting with either a comparator and mux block,
or a minimum and a maximum block. The first implementation is more efficient. Both
use the reconfigurable subsystem to choose between implementations using the
BlockChoice parameter.

Folded designs repeatedly use a single dual sort stage. The throughput of the design is
limited in the number of channels, vector width, and data rate. The data passes
through the dual sort stage (vector width)/2 times. The vector sort design example
uses full throughput with (vector width)/2 dual sort stages in sequence.

Look under the mask to view the implementation of reconfigurable subsystem
templates and the blocks that reorder and interleave vectors.

The model file is demo_foldedsort.mdl.

6.11.27. Vector Initialization of Sample Delay


This DSP Builder design example shows that one sample delay can replace what
usually requires a Demultiplex, SampleDelay, and Multiplex combination.

When the SampleDelay Primitive library block receives vector input, you can
independently specify a different delay for each of the components of the vector.

You may give individual components zero delay resulting in a direct feed through of
only that component. Avoid algebraic loops if you select some components to be zero
delays.


This rule only applies when DSP Builder is reading and outputting vector data. A scalar
specification of delay length still sets all the delays on each vector component to the
same value. You must not specify a vector that is not the same length as the vector on
the input port. A negative delay on any one component is also an error. However, as in
the scalar case, you can specify a zero length delay for one or more of the
components.
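
The rules above can be captured in a small behavioral model. The function name is hypothetical, and the zero value output before a delayed component has any history is an assumption of this sketch.

```python
def vector_sample_delay(frames, delays):
    """Delay component i of each frame by delays[i] samples.

    `frames` is a list of equal-length vectors, one per cycle. A scalar
    `delays` applies the same delay to every component.
    """
    width = len(frames[0])
    if isinstance(delays, int):
        delays = [delays] * width
    if len(delays) != width:
        raise ValueError("delay vector must match the input vector length")
    if any(d < 0 for d in delays):
        raise ValueError("delays must not be negative")
    return [[frames[t - d][i] if t - d >= 0 else 0
             for i, d in enumerate(delays)]
            for t in range(len(frames))]


frames = [[1, 10], [2, 20], [3, 30]]
print(vector_sample_delay(frames, [0, 1]))  # [[1, 0], [2, 10], [3, 20]]
```

A zero entry passes that component straight through, which is why a feedback path through such a component can create an algebraic loop.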

The model file is demo_sample_delay_vector.mdl.

6.11.28. Wide Single-Channel Accumulators


This design example shows various ways to connect up an adder, sample delay
(depth=1), and optional multiplexer to implement reset or load.

The output type of the adder is propagated from one of the inputs. You must select
the correct input, otherwise the accumulator fails to schedule. You may add a Convert
block to ensure the accumulator also maintains sufficient precision.

The wide single-channel accumulator consists of a two-input adder and sample-delay
feedback with one cycle of latency. If you use a fixed-point input to this accumulator,
you can make it arbitrarily wide, provided that you match the types of the inputs with
a data type prop duplicate block. The output type of the Add block can be with or
without word growth. Alternatively, you can propagate the input type to the output of
the adder.

The optional use of a two-to-one multiplexer allows the accumulator to load values
according to a Boolean control signal. The inputs differ in precision, so the type with
wider fractional part must be propagated to the output type of the adder, otherwise
the accumulator fails to schedule. Converting both inputs to the same precision
ensures that the single-channel accumulator can always be scheduled even at high
fMAX targets.

If neither input has a fixed-point type that is suitable for the adder to output, use a
Convert block to ensure that the precision of both inputs to the Add block is the
same. Scheduling of this accumulator at high fMAX fails.
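
The structure described above (adder, one-sample feedback delay, optional load multiplexer, and precision matching) can be sketched behaviourally. This is an illustrative Python model under stated assumptions: fixed-point values are plain integers with an implied number of fractional bits, and the helper names `to_frac_bits` and `accumulate` are hypothetical, not DSP Builder functions.

```python
def to_frac_bits(value, from_frac, to_frac):
    """Re-scale a fixed-point integer to a wider fractional part
    (stands in for the Convert step; an assumption of this sketch)."""
    if to_frac < from_frac:
        raise ValueError("this sketch only widens the fractional part")
    return value << (to_frac - from_frac)


def accumulate(samples, in_frac, acc_frac, loads=None):
    """Adder plus one-sample delay feedback, with an optional 2:1 load mux.

    samples: fixed-point inputs as integers with in_frac fractional bits.
    loads:   optional list of (load_flag, load_value) per cycle; load_value
             already has acc_frac fractional bits.
    Returns the accumulator register value after each cycle.
    """
    acc = 0                                     # sample delay (depth = 1) state
    history = []
    for n, x in enumerate(samples):
        x = to_frac_bits(x, in_frac, acc_frac)  # match precision of both adder inputs
        total = acc + x                         # adder output uses the wider type
        load, value = loads[n] if loads else (False, 0)
        acc = value if load else total          # mux selects load value or sum
        history.append(acc)
    return history
```

Python integers grow without bound, which loosely mirrors an accumulator that you can make arbitrarily wide; real hardware needs an explicit output type.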

The model file is demo_wide_accumulators.mdl.

6.12. DSP Builder Reference Designs


DSP Builder also includes reference designs. This folder accesses groups of reference
designs that illustrate the design of DDC and DUC systems for digital intermediate
frequency (IF) processing.

The first group implements IF modem designs compatible with the Worldwide
Interoperability for Microwave Access (WiMAX) standard. Intel provides separate
models for one and two antenna receivers and transmitters.

The second group implements IF modem designs compatible with the wideband Code
Division Multiple Access (W-CDMA) standard.

This folder also contains reference designs.

STAP for radar systems applies temporal and spatial filtering to separate slow moving
targets from clutter and null jammers. Applications have high processing
requirements and need low latency for rapid adaptation. High dynamic ranges demand
floating-point datapaths.

1. 1-Antenna WiMAX DDC
2. 2-Antenna WiMAX DDC
3. 1-Antenna WiMAX DUC
4. 2-Antenna WiMAX DUC
5. 4-Carrier, 2-Antenna W-CDMA DDC
6. 1-Carrier, 2-Antenna W-CDMA DDC
7. 4-Carrier, 2-Antenna W-CDMA DUC
8. 4-Carrier, 4-Antenna DUC and DDC for LTE
9. 1-Carrier, 2-Antenna W-CDMA DUC
10. 4-Carrier, 2-Antenna High-Speed W-CDMA DUC at 368.64 MHz with Total Rate Change 32
11. 4-Carrier, 2-Antenna High-Speed W-CDMA DUC at 368.64 MHz with Total Rate Change 48
12. 4-Carrier, 2-Antenna High-Speed W-CDMA DUC at 307.2 MHz with Total Rate Change 40
13. Cholesky-based Matrix Inversion
14. Cholesky Solver Multiple Channels
15. Crest Factor Reduction
16. Direct RF with Synthesizable Testbench
17. Dynamic Decimating FIR Filter
18. Multichannel QR Decomposition
19. QR Decomposition
20. QRD Solver
21. Reconfigurable Decimation Filter
22. Single-Channel 10-MHz LTE Transmitter
23. STAP Radar Forward and Backward Substitution
24. STAP Radar Steering Generation
25. STAP Radar QR Decomposition 192x204
26. Time Delay Beamformer
27. Transmit and Receive Modem
28. Variable Integer Rate Decimation Filter

Related Information
AN 544: Digital IF Modem Design with the DSP Builder Advanced Blockset
For more information about these designs

6.12.1. 1-Antenna WiMAX DDC


This reference design uses IP and Interface blocks to build a 2-channel, 1-antenna,
single-frequency modulation DDC for use in an IF modem design compatible with the
WiMAX standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
The design includes an Edit Params block to allow easy access to the setup variables
in the setup_wimax_ddc_1rx.m script.

The DDCChip subsystem includes Device, Decimating FIR, Mixer, NCO,
SingleRateFIR, and Scale blocks. Also, an Interleaver subsystem extracts the
correct I and Q channel data from the demodulated data stream.

The FIR filters implement a decimating filter chain that down converts the two channels
from a frequency of 89.6 MSPS to a frequency of 11.2 MSPS (a total decimation rate
of 8). The real mixer, NCO, and Interleaver subsystem isolate the two channels.
The design configures the NCO with a single channel to provide one sine and one
cosine wave at a frequency of 22.4 MHz. The NCO has the same sample rate (89.6
MSPS) as the input data sample rate.

A system clock rate of 179.2 MHz drives the design on the FPGA that the Device block
defines inside the DDCChip subsystem.
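
The rates quoted for this design can be cross-checked with a few lines of arithmetic. The following Python sketch only restates the figures from the text; the variable names are illustrative, and reading the clock-to-sample-rate ratio as the number of clock cycles available per input sample is an interpretation rather than a statement from the handbook.

```python
from fractions import Fraction

input_rate = Fraction("89.6")      # MSPS, from the text
output_rate = Fraction("11.2")     # MSPS, from the text
clock_rate = Fraction("179.2")     # MHz, from the text
nco_freq = Fraction("22.4")        # MHz, from the text

decimation = input_rate / output_rate          # total decimation of the FIR chain
cycles_per_input = clock_rate / input_rate     # clock cycles per input sample
nco_cycles_per_sample = nco_freq / input_rate  # NCO frequency relative to the sample rate

print(decimation, cycles_per_input, nco_cycles_per_sample)
```

The results are a decimation of 8, two clock cycles per input sample, and an NCO frequency of one quarter of the sample rate.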

The model file is wimax_ddc_1rx.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.2. 2-Antenna WiMAX DDC


This reference design uses IP and Interface blocks to build a 4-channel, 2-antenna,
2-frequency modulation DDC for use in an IF modem design compatible with the
WiMAX standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
The design includes an Edit Params block to allow easy access to the setup variables
in the setup_wimax_ddc_2rx_iiqq.m script.

The DDCChip subsystem includes Device, Decimating FIR, Mixer, NCO,
SingleRateFIR, and Scale blocks.

The FIR filters implement a decimating filter chain that down converts the two channels
from a frequency of 89.6 MSPS to a frequency of 11.2 MSPS (a total decimation rate
of 8). The real mixer and NCO isolate the two channels. The design configures the
NCO with two channels to provide two sets of sine and cosine waves at the same
frequency of 22.4 MHz. The NCO has the same sample rate (89.6 MSPS) as the
input data sample rate.

A system clock rate of 179.2 MHz drives the design on the FPGA, which the Device
block defines inside the DDCChip subsystem.

The model file is wimax_ddc_2rx_iiqq.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.3. 1-Antenna WiMAX DUC


This reference design uses IP, Interface, and Primitive library blocks to build a
2-channel, 1-antenna, single-frequency modulation DUC for use in an IF modem design
compatible with the WiMAX standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
The design includes an Edit Params block to allow easy access to the setup variables
in the setup_wimax_duc_1tx.m script.

The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC2Channel subsystem which contains SingleRateFIR, Scale,
InterpolatingFIR, NCO, and ComplexMixer blocks. The deinterleaver subsystem
contains a series of Primitive blocks including delays and multiplexers that
deinterleave the two I and Q channels.

The FIR filters implement an interpolating filter chain that up converts the two
channels from a frequency of 11.2 MSPS to a frequency of 89.6 MSPS (a total
interpolating rate of 8). The complex mixer and NCO modulate the two input channel
baseband signals to the IF domain. The design configures the NCO with a single
channel to provide one sine and one cosine wave at a frequency of 22.4 MHz. The
NCO has the same sample rate (89.6 MSPS) as the input data sample rate.
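
To make the mixing step concrete, here is a minimal floating-point sketch in Python of modulating baseband I and Q samples onto the 22.4 MHz carrier at 89.6 MSPS. It assumes the conventional form I·cos − Q·sin (the real part of the complex product); the handbook does not state the sign convention or the internal arithmetic of the ComplexMixer block, and the function names are hypothetical.

```python
import math


def nco(freq_mhz, sample_rate_msps, n):
    """Ideal NCO: cosine and sine at sample index n (no quantization)."""
    phase = 2.0 * math.pi * freq_mhz * n / sample_rate_msps
    return math.cos(phase), math.sin(phase)


def complex_mix_up(i_samples, q_samples, freq_mhz=22.4, sample_rate_msps=89.6):
    """Modulate baseband I/Q onto the carrier: Re{(I + jQ) * exp(j*phase)}."""
    out = []
    for n, (i, q) in enumerate(zip(i_samples, q_samples)):
        c, s = nco(freq_mhz, sample_rate_msps, n)
        out.append(i * c - q * s)   # assumed sign convention
    return out
```

Because the carrier is one quarter of the sample rate, the ideal cosine and sine sequences repeat every four samples.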

A system clock rate of 179.2 MHz drives the design on the FPGA, which the Device
block defines inside the DUCChip subsystem.

The model file is wimax_duc_1tx.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.4. 2-Antenna WiMAX DUC


This reference design uses IP, Interface, and Primitive library blocks to build a
4-channel, 2-antenna, single-frequency modulation DUC for use in an IF modem design
compatible with the WiMAX standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
The design includes an Edit Params block to allow easy access to the setup variables
in the setup_wimax_duc_2tx_iiqq.m script.

The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC2Channel subsystem which contains SingleRateFIR, Scale,
InterpolatingFIR, NCO, ComplexMixer, and Const blocks. It also contains a Sync
subsystem, which shows how to manage two data streams coming together and
synchronizing. The design writes the data from the NCOs to a memory with the
channel index as an address. The data stream uses its channel signals to read out the
NCO signals, which resynchronizes the data correctly. (Alternatively, you can simply
delay the NCO value by the correct number of cycles to ensure that the NCO and
channel data arrive at the Mixer on the same cycle.) The deinterleaver subsystem
contains a series of Primitive blocks including delays and multiplexers that
deinterleave the four I and Q channels.

The FIR filters implement an interpolating filter chain that up converts the two
channels from a frequency of 11.2 MSPS to a frequency of 89.6 MSPS (a total
interpolating rate of 8).

A complex mixer and NCO modulate the two input channel baseband signals to the IF
domain. The design configures the NCO to provide two sets of sine and cosine waves
at a frequency of 22.4 MHz. The NCO has the same sample rate (89.6 MSPS) as the
input data sample rate.

The Sync subsystem shows how to manage two data streams coming together and
synchronizing. The design writes the data from the NCOs to a memory with the
channel as an address. The data stream uses its channel signals to read out the NCO
signals, which resynchronizes the data correctly.
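
The resynchronization idea can be sketched in software as a small memory addressed by channel number. This Python model is a simplification under stated assumptions: it ignores cycle-level timing and assumes every NCO value is written before the datapath reads it. The function name `resync` and the tuple layout are hypothetical.

```python
def resync(nco_stream, data_stream, num_channels):
    """Align NCO samples with data samples using a channel-addressed memory.

    nco_stream:  iterable of (channel, nco_value) in NCO arrival order.
    data_stream: iterable of (channel, data_value) in datapath arrival order.
    Returns a list of (channel, data_value, nco_value) tuples.
    """
    memory = [None] * num_channels           # one word per channel
    for channel, value in nco_stream:        # write side: channel is the address
        memory[channel] = value
    aligned = []
    for channel, sample in data_stream:      # read side: data channel is the address
        aligned.append((channel, sample, memory[channel]))
    return aligned
```

Each data sample picks up the NCO value for its own channel, whatever order the two streams arrive in.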

A system clock rate of 179.2 MHz drives the design on the FPGA, which the Device
block defines inside the DUCChip subsystem.

The model file is wimax_duc_2tx_iiqq.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.5. 4-Carrier, 2-Antenna W-CDMA DDC


This reference design uses IP and Interface blocks to build a 16-channel, 2-antenna,
multiple-frequency modulation DDC for use in an IF modem design compatible with
the W-CDMA standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks,
plus a ChanView block that isolates two channels of data from the TDM signals.

The DDCChip subsystem includes Device, DecimatingCIC, Decimating FIR,
Mixer, NCO, and Scale blocks. It also contains a Sync subsystem, which provides the
synchronization of the channel data to the NCO carrier waves.

The CIC and FIR filters implement a decimating filter chain that down converts the
eight complex carriers (16 real channels from two antennas with four pairs of I and Q
inputs from each antenna) from a frequency of 122.88 MSPS to a frequency of 7.68
MSPS (a total decimation rate of 16). The real mixer and NCO isolate the four
channels. The design configures the NCO with four channels to provide four pairs of
sine and cosine waves at frequencies of 12.5 MHz, 17.5 MHz, 22.5 MHz, and 27.5
MHz, respectively. The NCO has the same sample rate (122.88 MSPS) as the input
data sample rate.

The Sync subsystem shows how to manage two data streams that come together and
synchronize. The data from the NCOs writes to a memory with the channel as an
address. The data stream uses its channel signals to read out the NCO signals, which
resynchronizes the data correctly.

A system clock rate of 245.76 MHz drives the design on the FPGA, which the Device
block defines inside the DDCChip subsystem.

The model file is wcdma_multichannel_ddc_mixer.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.6. 1-Carrier, 2-Antenna W-CDMA DDC


This reference design uses IP and Interface blocks to build a 4-channel, 2-antenna,
single-frequency modulation DDC for use in an IF modem design compatible with the
W-CDMA standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks,
plus a ChanView block that isolates two channels of data from the TDM signals.

The DDCChip subsystem includes Device, DecimatingCIC, Decimating FIR,
Mixer, NCO, and Scale blocks.

The CIC and FIR filters implement a decimating filter chain that down converts the two
complex carriers (4 real channels from two antennas with one pair of I and Q inputs
from each antenna) from a frequency of 122.88 MSPS to a frequency of 7.68 MSPS (a
total decimation rate of 16). The real mixer and NCO isolate the four channels. The
design configures the NCO with a single channel to provide one sine and one cosine
wave at a frequency of 17.5 MHz. The NCO has the same sample rate (122.88 MSPS)
as the input data sample rate.

A system clock rate of 122.88 MHz drives the design on the FPGA, which the Device
block defines inside the DDCChip subsystem.

The model file is wcdma_picocell_ddc_mixer.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.7. 4-Carrier, 2-Antenna W-CDMA DUC


This reference design uses IP and Interface blocks to build a 16-channel, 2-antenna,
multiple-frequency modulation DUC for use in an IF modem design compatible with
the W-CDMA standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.

The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.

The FIR and CIC filters implement an interpolating filter chain that up converts the 16-
channel input data from a frequency of 3.84 MSPS to a frequency of 122.88 MSPS (a
total interpolation factor of 32). The complex mixer and NCO modulate the four
channel baseband input signal onto the IF region. The design configures the NCO with
four channels to provide four pairs of sine and cosine waves at frequencies of 12.5
MHz, 17.5 MHz, 22.5 MHz, and 27.5 MHz, respectively. The NCO has the same sample
rate (122.88 MSPS) as the final interpolated output sample rate from the last CIC filter
in the interpolating filter chain.

The subsystem SyncMixSumSel uses Primitive blocks to implement the
synchronization, mixing, summation, scaling, and signal selection. This subsystem
separates each operation into further subsystems. The Sync subsystem shows how to
manage two data streams that come together and synchronize. The data from the
NCOs writes to a memory with the channel as an address. The data stream uses its
channel signals to read out the NCO signals, which resynchronizes the data correctly.

The Sum and SampSelectr subsystems sum up the correct modulated signals to the
designated antenna.

A system clock rate of 245.76 MHz drives the design on the FPGA, which the Device
block defines inside the DUC subsystem.

The model file is wcdma_multichannel_duc_mixer.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.8. 4-Carrier, 4-Antenna DUC and DDC for LTE

These DUC and matching DDC designs connect to 4 antennas and can process 4
channels per antenna. With a sample rate of 61.44 MHz and a clock rate of 491.52
MHz, these designs represent up- and downconverters used in LTE.

DUC

The top-level design of the upconverter contains a TEST_BENCH block with signal
sources, the upconverter, and a SINKS block that stores the data streams coming out
of the upconverter in MATLAB variables. Depending on which simulation you run, the
TEST_BENCH block uses either real LTE sample streams or specialized debugging
patterns. The upconverter consists of the LDUC module (the lower DUC), which
contains a channel filter and two interpolating filters, each interpolating by a factor of
2. The filtered sample stream feeds into the COMPLEX MIXER block, where an NCO
generates separate frequencies for each of the four channels and multiplies the
generated sine waves with the filtered sample stream. A delay match block ensures
that the sample stream and the generated frequencies align correctly. After the
COMPLEX MIXER block is an antenna summer block, which adds up the different
channels for each antenna, multiplies each with a different frequency, and outputs
them to the four separate antennas.

The model file is duc_4c4ant.mdl.

DDC

The top-level design of the DDC also contains a TESTBENCH block, which contains
source blocks that read from workspace. It uses the data that DSP Builder generates
during the simulation of the DUC. The SINKS block again traces the outputs of the
design in MATLAB variables, which you can analyze and manipulate in MATLAB. The
DDC consists of a complex mixer that matches the complex mixer of the DUC, and the
LDDC (Lower DownConverter), which contains two decimate-by-2 filters and a channel
filter.

The model file is ddc_4c4ant.mdl.

Simulation Scripts

The design, which is in the
Examples\ReferenceDesigns\DDC4c4ant\4C4T4R_echodemo\4C4T4R\Design directory,
contains two separate parts: duc_4c4ant.mdl contains the upconverter, and
ddc_4c4ant.mdl contains the downconverter. The directory also contains two scripts
that allow you to run the simulation of both designs. Both Run_DUC_DDC_demo.m and
Test_DUC_DDC_demo.m create test vectors, run the upconverter first (which
generates the input vectors for the downconverter), then run the downconverter and
analyze the outputs. The designs contain no channel model, but you can add your
own channel model and apply it to the output data of the DUC before running the DDC
to simulate more realistic operating conditions.

Run_DUC_DDC_demo.m uses typical LTE waveforms; Test_DUC_DDC_demo.m works with
ramps that help you visualize which data goes into which channel and which antenna
it transmits on. In the test pattern, an impulse is set first, followed by a ramp on
channel 1 on antenna 1. All other channels and antennas are 0. The next section
transmits channel 1 on antenna 1, channel 2 on antenna 2, and so on up to channel 4
on antenna 4. The last section transmits all 4 channels on all 4 antennas, using the
full capacity of the system. Use this debug pattern if you want to modify or extend
the design.

To step through a script section by section, run it with the echodemo command: type
echodemo Run_DUC_DDC_demo.m at the MATLAB command prompt, and then click Next
several times to step through the simulation script. Alternatively, you can run the
entire script by typing Run_DUC_DDC_demo.m at the MATLAB command prompt. The last
step of the script calls a plot function that generates input versus output plots for
each channel, with overlaid input and output plots. These plots should match closely,
displaying only a small quantization error. The script also produces channel scopes,
which show each channel’s data in time and frequency domains.

6.12.9. 1-Carrier, 2-Antenna W-CDMA DUC


This reference design uses IP and Interface blocks to build a 4-channel, 2-antenna,
single-frequency modulation DUC for an IF modem design compatible with the W-
CDMA standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.

The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.

The FIR and CIC filters implement an interpolating filter chain that up converts the
four-channel input data from a frequency of 3.84 MSPS to a frequency of 122.88 MSPS
(a total interpolation factor of 32). The complex mixer and NCO modulate the
four-channel baseband input signal onto the IF region.

The design example configures the NCO with a single channel to provide one sine and
one cosine wave at a frequency of 17.5 MHz. The NCO has the same sample rate
(122.88 MSPS) as the final interpolated output sample rate from the last CIC filter in
the interpolating filter chain.

A system clock rate of 122.88 MHz drives the design on the FPGA, which the Device
block defines inside the DUC subsystem.

The model file is wcdma_picocell_duc_mixer.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.10. 4-Carrier, 2-Antenna High-Speed W-CDMA DUC at 368.64 MHz with Total Rate Change 32

This reference design uses IP and Interface blocks to build a high-speed 16-channel,
2-antenna, multiple-frequency modulation DUC for use in an IF modem design
compatible with the W-CDMA standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.

The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.

The FIR and CIC filters implement an interpolating filter chain that up converts the 16-
channel input data from a frequency of 3.84 MSPS to a frequency of 122.88 MSPS (a
total interpolation factor of 32). This design example uses dummy signals and carriers
to achieve the desired rate up conversion, because of the unusual FPGA clock
frequency and total rate change combination. The complex mixer and NCO modulate
the four channel baseband input signal onto the IF region. The design example
configures the NCO with four channels to provide four pairs of sine and cosine waves
at frequencies of 12.5 MHz, 17.5 MHz, 22.5 MHz and 27.5 MHz, respectively. The NCO
has the same sample rate (122.88 MSPS) as the final interpolated output sample rate
from the last CIC filter in the interpolating filter chain.

The Sync subsystem shows how to manage two data streams that come together and
synchronize. The data from the NCOs writes to a memory with the channel as an
address. The data stream uses its channel signals to read out the NCO signals, which
resynchronizes the data correctly.

The GenCarrier subsystem manipulates the NCO outputs to generate carrier signals
that can align with the datapath signals.

The CarrierSum and SignalSelector subsystems sum up the right modulated signals
to the designated antenna.

A system clock rate of 368.64 MHz, which is 96 times the input sample rate, drives the
design on the FPGA, which the Device block defines inside the DUC subsystem. The
higher clock rate can potentially allow resource re-use in other modules of a digital
system implemented on an FPGA.

The model file is mcducmix96x32R.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.11. 4-Carrier, 2-Antenna High-Speed W-CDMA DUC at 368.64 MHz with Total Rate Change 48

This reference design uses IP and Interface blocks to build a high-speed 16-channel,
2-antenna, multiple-frequency modulation DUC for use in an IF modem design
compatible with the W-CDMA standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.

The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.

The FIR and CIC filters implement an interpolating filter chain that up converts the 16-
channel input data from a frequency of 3.84 MSPS to a frequency of 184.32 MSPS (a
total interpolation factor of 48).

The complex mixer and NCO modulate the four channel baseband input signal onto
the IF region. The design configures the NCO with four channels to provide four pairs
of sine and cosine waves at frequencies of 12.5 MHz, 17.5 MHz, 22.5 MHz, and 27.5
MHz, respectively. The NCO has the same sample rate (184.32 MSPS) as the final
interpolated output sample rate from the last CIC filter in the interpolating filter chain.

The Sync subsystem shows how to manage two data streams that come together and
synchronize. The data from the NCOs writes to a memory with the channel as an
address. The data stream uses its channel signals to read out the NCO signals, which
resynchronizes the data correctly.

The CarrierSum and SignalSelector subsystems sum up the right modulated signals
to the designated antenna.

A system clock rate of 368.64 MHz, which is 96 times the input sample rate, drives the
design on the FPGA, which the Device block defines inside the DUC subsystem. The
higher clock rate can potentially allow resource re-use in other modules of a digital
system implemented on an FPGA.

The model file is mcducmix96x48R.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.12. 4-Carrier, 2-Antenna High-Speed W-CDMA DUC at 307.2 MHz with Total Rate Change 40

This reference design uses IP and Interface blocks to build a high-speed 16-channel,
2-antenna, multiple-frequency modulation DUC for use in an IF modem design
compatible with the W-CDMA standard.

The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.

The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.

The FIR and CIC filters implement an interpolating filter chain that up converts the 16-
channel input data from a frequency of 3.84 MSPS to a frequency of 153.6 MSPS (a
total interpolation factor of 40).

The complex mixer and NCO modulate the four channel baseband input signal onto
the IF region. The design configures the NCO with four channels to provide four pairs
of sine and cosine waves at frequencies of 12.5 MHz, 17.5 MHz, 22.5 MHz, and 27.5
MHz, respectively. The NCO has the same sample rate (153.6 MSPS) as the final
interpolated output sample rate from the last CIC filter in the interpolating filter chain.

The Sync subsystem shows how to manage two data streams that come together and
synchronize. The design writes data from the NCOs to a memory with the channel as
an address. The data stream uses its channel signals to read out the NCO signals,
which resynchronizes the data correctly.

The CarrierSum and SignalSelector subsystems sum up the right modulated signals
to the designated antenna.

A system clock rate of 307.2 MHz, which is 80 times the input sample rate, drives the
design on the FPGA, which the Device block defines inside the DUC subsystem. The
higher clock rate can potentially allow resource re-use in other modules of a digital
system implemented on an FPGA.
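
The clock and rate figures for the three high-speed W-CDMA DUC designs can be checked side by side. The Python sketch below uses only numbers quoted in the text; the function name `rate_plan` is hypothetical, and the last quantity (16 channels of output samples per clock cycle) is an interpretation added for illustration, not a figure from the handbook.

```python
from fractions import Fraction

INPUT_RATE = Fraction("3.84")   # MSPS, common to the three designs

# (model file, clock in MHz, total interpolation factor), all from the text
variants = [
    ("mcducmix96x32R.mdl", Fraction("368.64"), 32),
    ("mcducmix96x48R.mdl", Fraction("368.64"), 48),
    ("mcducmix80x40R.mdl", Fraction("307.2"), 40),
]


def rate_plan(clock, interpolation, channels=16):
    """Return (clock / input rate, output rate in MSPS,
    output samples per clock cycle summed over all channels)."""
    output_rate = INPUT_RATE * interpolation
    return clock / INPUT_RATE, output_rate, channels * output_rate / clock


for name, clock, factor in variants:
    ratio, out_rate, per_clock = rate_plan(clock, factor)
    print(f"{name}: clock/input = {ratio}, output = {float(out_rate)} MSPS, "
          f"samples per clock = {per_clock}")
```

The ratios of 96, 96, and 80 match the model file names. Only the rate-change-32 design gives a non-integer number of samples per clock (16/3), which may relate to the dummy signals and carriers that design uses, although the handbook does not spell this out.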

The model file is mcducmix80x40R.mdl.

Note: This reference design uses the Simulink Signal Processing Blockset.

6.12.13. Cholesky-based Matrix Inversion


Matrix inversion has many applications in wireless communications, for example
digital predistortion (DPD) for RF linearization and multiple-input multiple-output
(MIMO) detection. Matrix inversion algorithms typically require high-resolution
numerics to guarantee accuracy and numerical stability. The implementation is
normally resource-demanding, particularly as the matrix dimension grows. The DSP
Builder Cholesky-based Matrix Inversion reference design offers an efficient
implementation of matrix inversion for minimized resource utilization and improved
latency and throughput. The Cholesky decomposition technique inverts a
positive-definite real or complex square matrix. Cholesky decomposition-based matrix
inversion is more efficient than direct matrix inversion.

Figure 68. Matrix inversion based on Cholesky decomposition


The figure shows the three steps of implementing a Hermitian matrix inversion using Cholesky decomposition:
1. Cholesky decomposition
2. Triangular matrix inversion through forward substitution
3. Triangular matrix multiplication

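[Figure: Input Matrix (A) → Cholesky Decomposition → Triangular Matrix Inversion → Triangular Matrix Mult → A_inverse. The Cholesky decomposition passes the lower triangle matrix and the diagonal reciprocal values 1/Lkk to the triangular matrix inversion, which outputs matrix J.]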

The Cholesky decomposition calculates the reciprocal values of the diagonal elements
of L, $1/L_{kk}$, which the triangular matrix inversion requires. The design propagates
those values to the output interface of the Cholesky decomposition, reducing resource
usage and latency.

Assuming matrix A is an N×N positive-definite square matrix, the Cholesky decomposition
of A into lower and upper triangular matrices, $L$ and $L^H$, is given by:

$$A = L \cdot L^H$$

The inverse of the Hermitian matrix A, $A^{-1}$, is:

$$A^{-1} = (L^{-1})^H \cdot L^{-1}$$

The design performs Cholesky decomposition and calculates the inverse of L, $J = L^{-1}$,
through forward substitution. J is a lower triangular matrix. The inverse of the input
matrix requires a triangular matrix multiplication, followed by a Hermitian matrix
multiplication:

$$A^{-1} = J^H \cdot J$$
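
The three steps (decomposition, triangular inversion by forward substitution, and triangular multiplication) can be sketched in plain Python. This is a minimal floating-point illustration of the mathematics only, assuming a Hermitian positive-definite input given as a list of lists; it does not model the pipelined, multichannel hardware, and the function names are hypothetical.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L * L^H, plus the reciprocals 1/L[k][k]."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]
    inv_diag = [0.0] * n
    for j in range(n):
        s = a[j][j] - sum(l[j][k] * l[j][k].conjugate() for k in range(j))
        l[j][j] = complex(math.sqrt(s.real))     # diagonal of L is real and positive
        inv_diag[j] = 1.0 / l[j][j].real
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = s * inv_diag[j]            # multiply by the reciprocal, no divide
    return l, inv_diag


def invert_lower(l, inv_diag):
    """Forward substitution: solve L * J = I column by column, so J = L^-1."""
    n = len(l)
    j_mat = [[0j] * n for _ in range(n)]
    for c in range(n):
        for r in range(c, n):
            rhs = 1.0 if r == c else 0.0
            s = rhs - sum(l[r][k] * j_mat[k][c] for k in range(c, r))
            j_mat[r][c] = s * inv_diag[r]        # reuse the diagonal reciprocals
    return j_mat


def hermitian_inverse(a):
    """A^-1 = J^H * J with J = L^-1."""
    n = len(a)
    l, inv_diag = cholesky(a)
    j_mat = invert_lower(l, inv_diag)
    return [[sum(j_mat[k][r].conjugate() * j_mat[k][c] for k in range(n))
             for c in range(n)] for r in range(n)]
```

Returning the diagonal reciprocals from the decomposition and reusing them in the forward substitution mirrors the way the design forwards 1/Lkk between the two stages.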

The Cholesky-based matrix inversion reference design comprises a Cholesky
decomposition design and a triangular matrix inversion design. Both designs are fully
pipelined, with multichannel input and output streaming to maximize throughput. The
size of the dot-product engines in both designs is compile-time configurable according
to the size of the input matrices. The datapath and control logic are split.

Figure 69. Cholesky Decomposition Top-level Design


[Figure: block diagram of the Cholesky decomposition. Number of inputs = size × (size + 1) × channels / 2. Blocks shown: Input Memory, Circular Memory, Data Mux and Vectorization, Scalar Product and Subtract Operators, 1/√, Multiplier, FIFO, and Control Logic, arranged in a Top Datapath and a Bottom Datapath. Signals and types shown: Li,j, 1/Li,j, 16s15(c), 18s17, 18s17(c).]

Figure 70. Triangular Matrix Inversion Top-level Design

[Figure: block diagram of the triangular matrix inversion. Blocks shown: L Input Memory, 1/Li,j Input Memory (diagonal), Circular Memory, Write Controllers, multiply (X), sum (Σ), Scale, Negate, Mux, FIFO, J Output, Feedback Write Back, and Control Logic. Stage latencies range from 0 to 7 cycles. Types shown: 18s17(c), 18s12(c), 18s12.]

This design supports single-precision floating-point Cholesky matrix inversion. DSP
Builder requires a single-precision floating-point input for the floating-point inversion.

Matrix inversion takes multiple matrices and interleaves the inverse computations for
all matrices. This method hides the latency in computing each element by pipelining
inversion of a completely different channel. Multichannel designs use the idle cycles in
the computation chain to process the next channel. Two buffers at the input and
output of the design create channels for streaming matrices into multichannel
interfaces.

Table 19. Top-level matrix inversion input and output ports

The input and output interfaces follow the Avalon™ streaming (Avalon-ST) standard.

| Signal | Direction | Type | Width | Description |
|---|---|---|---|---|
| Sink_Valid | Input | Boolean | 1 | Avalon streaming sink valid signal for the input matrix interface. Number of valid inputs = (matrix size*(matrix size + 1))/2. |
| Sink_Channel | Input | Unsigned integer | 8 | Avalon streaming sink channel bus for the input matrix interface. |
| Sink_Data | Input | Single floating-point complex | 64 bit I/Q | Avalon streaming sink data bus for the input matrix interface. Lower matrix elements are streamed in column major order. |
| Source_Valid | Output | Boolean | 1 | Avalon streaming source valid signal for output interface. This signal is asserted for (size*(size+1))/2 clocks. |
| Source_Channel | Output | Unsigned integer | 8 | Avalon streaming source channel bus for output interface. |
| Source_Data | Output | Single floating-point complex | 64 bit I/Q | Avalon streaming source data bus for output interface. Lower matrix elements are streamed in column major order. |
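As an illustration of the streaming order on the data buses, the sketch below lists the
lower-triangle elements of a matrix in column major order. It assumes zero-based (row,
column) indices and that each column is streamed from the diagonal element downwards;
the function name is hypothetical and not part of the design.

```python
def lower_triangle_column_major(size):
    """Return (row, column) index pairs of the lower triangle, column by column."""
    return [(row, col) for col in range(size) for row in range(col, size)]

order = lower_triangle_column_major(8)
print(len(order))   # size * (size + 1) / 2 elements
print(order[:9])    # all of column 0, then the first element of column 1
```

For an 8x8 matrix the list has 36 entries, which matches the number of cycles for which
the valid signal is asserted.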


Parameters

Table 20. Parameters of the matrix inversion design

The parameters are compile-time configurable using the setup file.

| Parameter | Description |
|---|---|
| Size of Matrix | The size of matrix to invert. |
| Channels | Number of matrices inverted in a burst. Minimum of 16 channels. |
| Latency | The period in cycles the module waits before receiving the next set of matrices. |

The latency value that you set and the system clock determine the throughput of the
design:

Throughput (matrix inversions per second) = System clock/Latency
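The throughput formula can be evaluated directly. The sketch below uses a 368.64 MHz
system clock and the minimum latencies that this section recommends for Intel Stratix 10
devices; the function name is an illustrative assumption.

```python
def throughput_per_second(system_clock_hz, latency_cycles):
    """Matrix inversions per second = system clock / latency."""
    return system_clock_hz / latency_cycles

for size, latency in (("4x4", 30), ("8x8", 74), ("16x16", 220)):
    rate = throughput_per_second(368.64e6, latency)
    print(f"{size}: {rate / 1e6:.2f} million inversions per second")
```

The results land close to the maximum throughput figures that the performance table in
this section reports for the same clock.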

Although elements of input matrices arrive in streaming format, the internal
channelizer vectorizes the input matrices into several channels (the default is 16). This
vectorization significantly improves the throughput.

Figure 71. Input streaming interface for 8x8 Hermitian input matrix

The figure shows the latency configuration parameter in the input interface including data, valid, and channel
signals. In this example of 8x8 matrix inversion, the valid signal remains high for 36 clock cycles (total number
of lower triangle elements of the Hermitian matrix of 8x8) and remains low for (latency – 36) cycles before
inserting the next matrix elements. The minimum duration to remain low and hence the minimum latency
period may vary depending on the matrix size and the pipelining required to meet timing constraints.
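The valid pattern that the figure describes can be modelled as a simple list: high for
the lower-triangle element count, then low for the rest of the latency period. The
sketch is illustrative only; the function name is hypothetical, and the latency of 74
cycles is one of the recommended minimum values for an 8x8 matrix.

```python
def valid_pattern(matrix_size, latency_cycles):
    """Return one latency period of the valid signal as a list of 0/1 values."""
    active = matrix_size * (matrix_size + 1) // 2   # lower-triangle element count
    if latency_cycles < active:
        raise ValueError("latency must cover all valid cycles")
    return [1] * active + [0] * (latency_cycles - active)

pattern = valid_pattern(8, 74)
print(sum(pattern), len(pattern) - sum(pattern))   # high cycles, low cycles
```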


Table 21. Recommended Values for the Minimum Latency (maximum throughput)
In Intel Stratix 10 and Intel Arria 10 devices, speed grade –1 and –2, for three different matrix sizes.

| Matrix Dimension | Latency in clock cycles (Intel Arria 10 Devices) | Latency in clock cycles (Intel Stratix 10 Devices) |
|---|---|---|
| 4x4 | ≥ 30 | ≥ 30 |
| 8x8 | ≥ 75 | ≥ 74 |
| 16x16 | ≥ 230 | ≥ 220 |

Performance and Resource Usage

Table 22. Floating-point implementation resource utilization targeting GX/SX/TX 280 FPGA
The table shows the resource count of the floating-point Cholesky-based matrix inversion design including the
channelizing input and output buffers.

| Matrix Dimension | Number of channels | Logic Elements (ALMs) | DSP Blocks | Memory bits | RAM blocks | Registers |
|---|---|---|---|---|---|---|
| 4x4 | 16 | 8,236 | 55 | 548,448 | 55 | 22,066 |
| 8x8 | 16 | 16,665 | 103 | 2,001,664 | 194 | 45,463 |
| 16x16 | 16 | 35,025 | 199 | 7,085,088 | 521 | 95,079 |

Table 23. Performance of the floating-point matrix inversion module for different matrix dimensions
This table shows the fMAX performance of the floating-point design for different matrix sizes with a system clock
of 368.64 MHz and targeting an FPGA device. The maximum throughput is in millions of matrix inversions per
second.

| Matrix Dimension | Number of channels | Target System clock (MHz) | fMAX (MHz) | ThroughputMAX |
|---|---|---|---|---|
| 4x4 | 16 | 368.64 | 468.06 | 12.2 |
| 8x8 | 16 | 368.64 | 403.88 | 5.0 |
| 16x16 | 16 | 368.64 | 392.77 | 1.67 |

6.12.14. Cholesky Solver Multiple Channels


The Cholesky Solver Multiple Channels reference design performs Cholesky
decomposition to solve for column vector x in Ax = b, where A is a Hermitian,
positive-definite matrix (for example, a covariance matrix) and b is a column vector.

The design uses forward and backward substitution to solve x.

The design decomposes A into L*L', therefore L*L'*x = b, or L*y = b, where y = L'*x.
The design solves y with forward substitution and x with backward substitution.
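The two substitution passes can be sketched as follows, assuming the Cholesky factor L
is already available as a list of rows. This is a plain Python reference model with
illustrative function names, not the cycle-stealing hardware implementation.

```python
def forward_substitution(l, b):
    """Solve L*y = b for y, where L is lower triangular."""
    n = len(b)
    y = [0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    return y

def backward_substitution(l, y):
    """Solve L'*x = y for x, where L' is the conjugate transpose of L."""
    n = len(y)
    x = [0] * n
    for i in reversed(range(n)):          # the bottom element is solved first
        s = sum(complex(l[k][i]).conjugate() * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / complex(l[i][i]).conjugate()
    return x

# Hypothetical example: L = [[2, 0], [1, 1]] gives A = L*L' = [[4, 2], [2, 2]].
factor = [[2, 0], [1, 1]]
y = forward_substitution(factor, [8, 6])
x = backward_substitution(factor, y)
```

Backward substitution naturally produces the bottom element of x first, which is
consistent with a solver that outputs x starting from the bottom element.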

This design uses cycle stealing and command FIFO techniques to enhance
performance. Although it targets multiple channels, it also works well with single
channels.


To input the lower triangular elements of matrix A and b with the input bus, specify
the column, row, and channel index of each element. The design transposes and
appends the column vector b to the bottom of A and treats it as an extension of A in
terms of column and row addressing.

The output is column vector x with the bottom element output first.

A multiple channel design optimizes performance by prioritizing diagonal element
calculation over non-diagonal ones.

The model file is cholseky_solver_mc.mdl.

6.12.15. Crest Factor Reduction


This reference design implements crest factor reduction, based on the peak cancelling
algorithm.

For further information refer to the web page.

You can change the simulation length by clicking on the Simulink Length block.

The model file is demo_cfr.mdl.

Related Information
Crest factor reduction for wireless systems

6.12.16. Direct RF with Synthesizable Testbench


This very large reference design implements a digital upconversion to RF and digital
predistortion, with a testbench that you can synthesize to hardware for easier on-chip
testing.

The model file is DirectRFTest_and_DPD_SV.mdl.

6.12.17. Dynamic Decimating FIR Filter


The dynamic decimating FIR reference design offers multichannel run-time decimation
ratios in integer powers of 2 and run-time control of the channel count (traded against
bandwidth). The design supports a dynamic trade-off between channel count and signal
bandwidth (if you halve the channel count, the input sample rate doubles).

The FIR filter length is 2 x (Dmax / Dmin) x N + 1 where Dmax and Dmin are the
maximum and minimum decimation ratios and N is the number of (1 sided) symmetric
coefficients at Dmin.

All channels must have the same decimation ratio. The product of the number of
channels and the minimum decimation ratio must be 4 or more. The design limits the
wire count to 1 and:

number of channels x sample rate = clock rate.
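The filter length formula and the rate constraints above can be checked with a few
lines of arithmetic. The function names and the example values (decimation ratios of 2
and 32, 10 symmetric coefficients, a 200 MHz clock) are hypothetical and chosen only
for illustration.

```python
def fir_length(d_max, d_min, n_sym):
    """FIR length = 2 * (Dmax / Dmin) * N + 1 (ratios are integer powers of 2)."""
    return 2 * (d_max // d_min) * n_sym + 1

def sample_rate(clock_rate_hz, channels):
    """Per-channel input sample rate from: channels * sample rate = clock rate."""
    return clock_rate_hz / channels

def is_valid_configuration(channels, d_min):
    """The product of channel count and minimum decimation ratio must be 4 or more."""
    return channels * d_min >= 4

print(fir_length(32, 2, 10))             # filter length for Dmax=32, Dmin=2, N=10
print(sample_rate(200e6, 8))             # eight channels
print(sample_rate(200e6, 4))             # halving the channel count doubles the rate
print(is_valid_configuration(2, 2))
```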

The model file is demo_dyndeci.mdl.


6.12.18. Multichannel QR Decomposition


This reference design is a complete linear equations system solution that uses QR
decomposition.

To optimize the overall throughput the solver can interleave multiple data instances at
the same time. The inputs of the design are system matrices A [n × m] and input
vectors.

The reference design uses the Gram-Schmidt method to decompose system matrix A
to Q and R matrices. It calculates the solution of the system by completing backward
substitution.
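For reference, the Gram-Schmidt decomposition followed by backward substitution can be
sketched in plain Python as shown below. This is a single-channel software model using
classical Gram-Schmidt with illustrative function names; it does not model the
interleaving, the dot product engine, or the microinstruction processor of the design.

```python
import math

def gram_schmidt_qr(a):
    """Decompose an n x m matrix (list of rows) into Q (n x m) and R (m x m)."""
    n, m = len(a), len(a[0])
    q = [[0j] * m for _ in range(n)]
    r = [[0j] * m for _ in range(m)]
    for j in range(m):
        v = [complex(a[i][j]) for i in range(n)]
        for k in range(j):
            # Project column j of A onto the already-computed column k of Q.
            r[k][j] = sum(q[i][k].conjugate() * a[i][j] for i in range(n))
            v = [v[i] - r[k][j] * q[i][k] for i in range(n)]
        r[j][j] = complex(math.sqrt(sum(abs(x) ** 2 for x in v)))
        for i in range(n):
            q[i][j] = v[i] / r[j][j]
    return q, r

def qr_solve(a, b):
    """Solve A*x = b (least squares when n > m) through R*x = Q^H * b."""
    q, r = gram_schmidt_qr(a)
    n, m = len(a), len(a[0])
    qhb = [sum(q[i][j].conjugate() * b[i] for i in range(n)) for j in range(m)]
    x = [0j] * m
    for j in reversed(range(m)):          # backward substitution on upper-triangular R
        x[j] = (qhb[j] - sum(r[j][k] * x[k] for k in range(j + 1, m))) / r[j][j]
    return x

# Hypothetical 2x2 system: 2x + y = 3, x + 3y = 5.
solution = qr_solve([[2, 1], [1, 3]], [3, 5])
```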

The reference design is fully parametrizable: system dimensions n and m, the
processing vector size, which defines the parallelization ratio of the dot product
engine, and the number of channels that the design processes in parallel. This design
uses single-precision Multiply and Add blocks that perform most of the floating-point
calculations to implement a parallel dot product engine. The design uses a processor,
which executes a fixed set of microinstructions and generates operation indexes, to
route different phases of the calculation through these blocks. The design uses for-loop
macro blocks, which allow very efficient, flexible, and high-level implementation
of iterative operations, to implement the processor.

The model file is demo_mcqrd.mdl.

6.12.19. QR Decomposition
This reference design is a complete linear equations system solution that uses QR
decomposition.

The input of the design is a system matrix A [n × m] and input vector.

The reference design uses the Gram-Schmidt method to decompose system matrix A
to Q and R matrices, and calculates the solution of the system by completing
backward substitution.

The reference design is fully parametrizable: system dimensions n and m, and the
processing vector size, which defines the parallelization ratio of the dot product
engine. This design uses single-precision Multiply and Add blocks that perform most
of the floating-point calculations to implement a parallel dot product engine. The
design uses a processor, which executes a fixed set of microinstructions and generates
operation indexes, to route different phases of the calculation through these blocks.
The design uses for-loop macro blocks, which allow very efficient, flexible, and
high-level implementation of iterative operations, to implement the processor.

This design uses the Run All Testbenches block to access enhanced features of the
automatically generated testbench. An application-specific m-function verifies the
simulation output, to correctly handle the complex results and the numerical
approximation because of the floating-point format.

The model file is demo_qrd.mdl.


6.12.20. QRD Solver


The QRD Solver reference design is a complete linear equations system solution using
QR decomposition. The input of the design is a system matrix A [n x m] and input
vector [b].

Figure 72. QRD Solver


The design decomposes the system matrix A to Q and R matrices using the Gram-Schmidt method. The design
calculates the solution of the system by completing backward substitution.
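[Block diagram for b = Ax. The system matrix [A] enters the QR Decomposition block,
which produces [Q] and [R]. [Q] and the input vector [b] feed the Qxb block. The
Backward Substitution block takes [R] and the Qxb result and produces the solution [x].]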

The reference design is fully parameterizable over system dimensions n and m and the
processing vector size, which defines the parallelization ratio of the dot product
engine. This design implements a parallel dot product engine using single-precision
Multiply and Add blocks that perform most of the floating-point calculations. The
design routes different phases of the calculation through these blocks with a
controlling processor that executes a fixed set of microinstructions and generates
operation indexes. The design implements the controlling processor using for-loop
macro blocks, which allow very efficient, flexible, and high-level implementation of
iterative operations.

This design uses the Run All Testbenches block to access enhanced features of the
automatically generated testbench. An application-specific m-function verifies the
simulation output, to correctly handle the complex results and the numerical
approximation because of the floating-point format. Intel optimized the design for
Intel Stratix 10 FPGAs. The design implements hardened floating-point operators in
the FPGA DSP blocks.

Table 24. Performance

Intel tested the design with Intel Quartus Prime v18.1.1 build 259, targeting a 1SG280LN3F43E2VG device.

| Matrix Size | Parallel Processing Vector Size | fMAX (MHz) | Resources: ALM | Resources: DSPs | Resources: M20K | Throughput: Cycles | Throughput: Matrices/s | Latency: Cycles | Latency: ms |
|---|---|---|---|---|---|---|---|---|---|
| 512x256 | 512 | 320 | 461K (49%) | 4,370 (76%) | 1,313 (11%) | 71,232 | 4,492 | 137,545 | 0.43 |
| 64x64 | 64 | 418 | 60.5 (6%) | 562 (10%) | 160 (1%) | 7,920 | 52,777 | 12,392 | 0.03 |

The model file is demo_qrd_s10.mdl.


6.12.21. Reconfigurable Decimation Filter


The reconfigurable decimation filter reference design uses primitive blocks to build a
variable integer rate decimation FIR filter.

The reference design has the following features:


• Supports arbitrary integer decimation rate (including the cases without rate
change), arbitrary number of channels and arbitrary clock rate and input sample
rate, if the clock rate is high enough to process all channels in a single data path
(i.e. no hardware duplication).
• Supports run-time reconfiguration of decimation rate.
• Uses two memory banks for filter coefficients storage instead of prestoring
coefficients for all rates in memory. Updates one memory bank while the design is
reading coefficients from the other bank.
• Implements real time control of scaling in the FIR datapath.

You can modify the parameters in the setup_vardownsampler.m file, which you
access from the Edit Params icon.

The model file is vardownsampler.mdl.

6.12.22. Single-Channel 10-MHz LTE Transmitter


This reference design uses blocks from the IP, Primitive, and FFT Blockset libraries to
build a single-channel 10-MHz LTE transmitter.

The top-level testbench includes blocks to access control and signals, and to run the
Quartus Prime software. It also includes an Edit Params block to allow easy access to
the configuration variables in the setup_sc_LTEtxr.m script. A discrete-time scatter
plot scope displays the constellation of the modulated signal in inphase versus
quadrature components.

The LTE_txr subsystem includes a Device block to specify the target FPGA device,
and 64QAM, 1K_IFFT, ScaleRnd, CP_bReverse, Chg_Data_Format, and DUC
blocks.

The 64QAM subsystem uses a lookup table to convert the source input data into 64-QAM
symbol-mapped data. The 1K_IFFT subsystem converts the frequency-domain quadrature
amplitude modulation (QAM) symbols to the time domain. The ScaleRnd subsystem follows
the conversion; it scales down the output signals and converts them to the specified
fixed-point type.

The CP_bReverse subsystem adds an extended cyclic prefix (CP), or guard interval, to
each orthogonal frequency-division multiplexing (OFDM) symbol to avoid the intersymbol
interference (ISI) that multipath propagation causes. The CP_bReverse block also
reorders the outputs of the IFFT subsystem, which are in bit-reversed order, so that
they are in the correct order in the time domain. The design adds the cyclic prefix
by copying the last 25% of the data frame and prepending the copy to the frame.
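The two operations of this subsystem, bit-reversed reordering and cyclic prefix
insertion, can be sketched in plain Python as follows. The function names are
illustrative assumptions, and the sketch works on lists of samples rather than on the
streaming interfaces of the design.

```python
def bit_reverse_reorder(samples):
    """Return samples in natural order, given input in bit-reversed index order."""
    n = len(samples)
    bits = n.bit_length() - 1          # n must be a power of 2 and at least 2
    return [samples[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]

def add_cyclic_prefix(frame, fraction=0.25):
    """Copy the last `fraction` of the frame and prepend it to the frame."""
    cp_len = int(len(frame) * fraction)
    return frame[len(frame) - cp_len:] + frame

natural = bit_reverse_reorder([0, 4, 2, 6, 1, 5, 3, 7])
with_prefix = add_cyclic_prefix(natural)
print(with_prefix)
```

For a 1,024-sample frame, a 25% prefix adds 256 samples, so each symbol occupies 1,280
samples.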

The Chg_Data_Format subsystem changes the output data format of the CP_bReverse
subsystem to match the working protocol format of the DUC subsystem.


The DUC subsystem uses an interpolating filter chain to achieve an interpolation
factor of 16, such that the design interpolates the 15.36 Msps input channel to 245.76
Msps. In this design, an interpolating finite impulse response (FIR) filter interpolates
by 2, followed by a cascaded integrator-comb (CIC) filter with an interpolation rate of
8. An NCO generates orthogonal sinusoids at the specified carrier frequency. The design
mixes the signals with complex input data with a ComplexMixer block. The final SINC
compensation filter compensates for the digital-to-analog converter (DAC) frequency
response roll-off.
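The rate plan of the filter chain is simple arithmetic: each interpolation stage
multiplies the sample rate by its factor. The sketch below, with a hypothetical function
name, confirms that the FIR and CIC stages together take 15.36 Msps to 245.76 Msps.

```python
def output_rate_msps(input_rate_msps, stage_factors):
    """Multiply the input sample rate by each interpolation stage in turn."""
    rate = input_rate_msps
    for factor in stage_factors:
        rate *= factor
    return rate

stages = [2, 8]                       # FIR interpolate-by-2, then CIC interpolate-by-8
print(output_rate_msps(15.36, stages))
```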

A system clock rate of 245.76 MHz drives the design on the FPGA. The Signals block
of the design defines this clock. The input random data for the 64QAM symbol
mapping subsystem has a data rate of 15.36 Msps.

The model file is sc_LTEtxr.mdl.

6.12.23. STAP Radar Forward and Backward Substitution


The QR decomposition reference design produces an upper triangular matrix and a
lower triangular matrix.

The design applies this linear system of equations to the steering vector in the
following two steps:
• Forward substitution with the lower triangular matrix
• Backward substitution with the upper triangular matrix

A command pipeline controls the routing of floating-point vectors. Nested ForLoop
blocks generate these commands. Another FIFO unit queues the commands. This
decoupled system of FIFO buffers maximizes the usage of the shared vector floating-
point block while automatically throttling the rate of the ForLoop system.

This design uses advanced settings from the DSP Builder > Verify Design menu to
access enhanced features of the automatically generated testbench. An
application-specific m-function verifies the simulation output, to correctly compare complex
results and properly handle floating-point errors that arise from the ill-conditioning of
the QRD output.

The model file is STAP_ForwardAndBackwardSubstitution.mdl.

6.12.24. STAP Radar Steering Generation


The STAP radar steering generation reference design uses ForLoop blocks and
floating-point primitives to generate the steering vector. You input the angle of arrival
and Doppler frequency.

The model file is STAP_steeringGen.mdl.

6.12.25. STAP Radar QR Decomposition 192x204


The QR decomposition reference design implements a sequence of floating-point
vector operations.

Single-precision Multiply and Add blocks perform most of the floating-point
calculations. The design routes different phases of the calculation through these blocks
with a controlling processor that executes a fixed set of microinstructions. FIFO units
ensure this architecture maximizes the usage of the Multiply and Add blocks.


This design uses the Run All Testbenches block to access enhanced features of the
automatically generated testbench. An application-specific m-function verifies the
simulation output, to correctly handle the complex results and the numerical
approximation due to the floating-point format.

The model file is STAP_qrd192x204.mdl. The parallel version model file is
STAP_qrd192x204_p.mdl.

6.12.26. Time Delay Beamformer


The time delay beamformer reference design implements a time-delay beamformer
that has many advantages over a traditional phase-shifted beamformer. It uses a (full-
band) Nyquist filter and Farrow-like structure for optimal performance and resource
usage.

The design includes the following features so you can simulate and verify the transmit
and receive beamforming operations:
• Waveform (chirp) generation
• Target emulation
• Receiver noise emulation
• Aperture tapering
• Pulse compression

6.12.27. Transmit and Receive Modem


The transmit and receive modem design contains a QAM transmitter, a synthesizable
channel model, and a receiver, working at sample rates that match or exceed the clock
rate. The design works at different sample rates, and can provide up to 16 parallel
data streams between transmitter and receiver.

The transmitter can produce random data, which is useful for generating a hardware
demo, or you can feed it with data from the MATLAB environment. You can modulate
the data, where the modulation order can be QAM4 or QAM64. The design filters the
signal, and then feeds it into optional crest factor reduction (CFR) and digital
predistortion (DPD) blocks. Intel assumes you have a control processor that configures
modulation scheme and CFR and DPD parameters.

The channel model contains a random noise source, and a channel model, which you
can configure through the setup script. This channel model allows you to build a
hardware demonstrator on a standard FPGA development platform, without DA or AD
converters and analogue components. Following the channel model is the model of a
decimating ADC, which emulates the behavior of some existing ADC components that
provide this functionality.

The receiver contains an RRC filter, followed by an equalizer. Intel assumes that a
control processor calculates the equalizer coefficients. The equalizer feeds into an AGC
block, which feeds into a demapper. You can configure the demapper to different
modulation orders.

The model file is tx_ch_rx.mdl.


6.12.28. Variable Integer Rate Decimation Filter


The variable integer rate decimation filter reference design implements a 16-channel
interpolate-by-2 symmetrical 49-tap FIR filter. The target system clock frequency is
320 MHz.

You can modify the parameters in the setup_vardecimator_rt.m file, which you
access from the Edit Params icon.

The model file is vardecimator_rt.mdl.

6.13. DSP Builder Waveform Synthesis Design Examples


This folder contains design examples that synthesize waveforms with an NCO or direct
digital synthesis (DDS).

1. Complex Mixer
2. Four Channel, Two Banks NCO
3. Four Channel, Four Banks NCO
4. Four Channel, Eight Banks, Two Wires NCO
5. Four Channel, 16 Banks NCO
6. IP
7. NCO
8. NCO with Exposed Bus
9. Real Mixer
10. Super-sample NCO

6.13.1. Complex Mixer


This design example shows how to mix complex signals.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_complex_mixer.m script.

The FilterSystem subsystem includes the Device and ComplexMixer blocks.

The model file is demo_complex_mixer.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.13.2. Four Channel, Two Banks NCO


This design example implements an NCO with four channels and two banks.

This design example demonstrates frequency-hopping with the NCO block to generate
four channels of sinusoidal waves that you can switch from one set (bank) of
frequencies to another.


The phase increment values are set directly into the NCO Parameter dialog box as a
2 (rows) × 4 (columns) matrix. The input for the bank index is set up so that it
alternates between the two predefined banks with each one lasting 2000 steps.

A BusStimulus block sets up an Avalon-MM interface that writes into the phase
increment memory registers. It shows how you can use the Avalon-MM interface to
dynamically change the frequencies of the NCO-generated sinusoidal signals at run
time. This design example uses a 16-bit memory interface (as the Control block
specifies) and a 24-bit accumulator in the NCO block. The design example therefore
requires two registers for each phase increment value. With the base address of the
phase increment memory map set to 1000 in this design example, the addresses
[1000 1001 1002 1003 1012 1013 1014 1015] write to the phase increment memory
registers of channels 1 and 2 in the first bank (bank 0), and to the registers of
channels 3 and 4 in the second bank (bank 1). The write data is also made up of two
parts, with each part writing to one of the registers feeding the selected phase
increment accumulators.
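One register layout that reproduces the addresses listed above is sketched below. The
layout (values stored bank by bank and channel by channel, two consecutive registers per
value), the zero-based bank and channel indices, and the low-word-first split are all
assumptions made for illustration; check the BusStimulus setup in the model for the
actual mapping.

```python
def phase_increment_addresses(base, bank, channel, channels=4, regs_per_value=2):
    """Return the register addresses of one phase increment value.

    Assumed layout (hypothetical): bank-major, then channel, with
    `regs_per_value` consecutive registers per value. Indices are zero-based.
    """
    first = base + (bank * channels + channel) * regs_per_value
    return list(range(first, first + regs_per_value))

def split_phase_increment(value, interface_bits=16):
    """Split a phase increment into (low word, high word); the order is assumed."""
    mask = (1 << interface_bits) - 1
    return value & mask, value >> interface_bits

addresses = []
for bank, channel in ((0, 0), (0, 1), (1, 2), (1, 3)):
    addresses += phase_increment_addresses(1000, bank, channel)
print(addresses)
```

With a base address of 1000, the loop yields the same eight addresses that the design
example writes.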

This design example has two banks of frequencies, with each bank processed for 2,000
steps before switching to the other. You should write a new value into the phase
increment memory register for each bank to change the NCO output frequencies after
8,000 steps during simulation. To avoid writing new values to the active bank, the
design example configures the write enable signals in the following way:

[zeros(1,7000) 1 1 1 1 zeros(1,2000) 1 1 1 1 zeros(1,8000)]

This configuration ensures that a new phase increment value for bank 0 is written at
7000 steps when the NCO is processing bank 1; and a new phase increment value for
bank 1 is written at 9000 steps when the NCO is processing bank 0.
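The timing argument can be checked with a small model of the write enable vector and
the bank schedule. The sketch assumes zero-based step numbers and that the NCO starts
on bank 0; the function names are hypothetical.

```python
def write_enable(segments):
    """Build a 0/1 vector from (value, repeat) pairs, like runs of zeros(1,n)."""
    vector = []
    for value, repeat in segments:
        vector.extend([value] * repeat)
    return vector

def active_bank(step, steps_per_bank=2000, banks=2):
    """Bank that the NCO is processing at a given step, starting from bank 0."""
    return (step // steps_per_bank) % banks

enable = write_enable([(0, 7000), (1, 4), (0, 2000), (1, 4), (0, 8000)])
write_steps = [step for step, bit in enumerate(enable) if bit]
print(write_steps)
print([active_bank(step) for step in write_steps])
```

The first four writes occur while bank 1 is active, so they can safely target bank 0;
the last four occur while bank 0 is active, so they can safely target bank 1.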

Four writes for each bank exist to write new values for channels 1 and 2 into bank 0,
and new values for channels 3 and 4 into bank 1. Each new phase value needs two
registers due to the size of the memory interface.

The Spectrum Scope block shows three peaks for a selected channel, with the first
two peaks representing the two banks and the third peak showing the frequency that
you specify through the memory interface. The scope of the selected channel shows the
sinusoidal waves of the channel you select. You can zoom in to see the smooth and
continuous sinusoidal signals at the switching point. You can also see the frequency
changes after 8000 steps where the phase increment value alters through the memory
interface.

The top-level testbench includes Control, Signals, BusStimulus, Run ModelSim,
and Run Quartus Prime blocks, plus ChanView blocks that deserialize the output
buses. An Edit Params block allows easy access to the setup variables in the
setup_demo_mc_nco_2banks_mem_interface.m script.

The NCOSubSystem subsystem includes the Device and NCO blocks.

The model file is demo_mc_nco_2banks_mem_interface.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.


6.13.3. Four Channel, Four Banks NCO


This design example implements an NCO with four channels and four banks.

This design example is similar to the Four Channel, Two Banks NCO design, but it has
four banks of frequencies defined for the phase increment values. Each spectrum plot
has five peaks: the fifth peak shows the changes the design example writes through
the memory interface.

The design example uses a 32-bit memory interface with a 24-bit accumulator. Hence,
the design example requires only one phase increment memory register for each
phase increment value. Refer to the address and data setup on the BusStimulus
block inside this design example.

This design example has four banks of frequencies with each bank processed for 2,000
steps before switching to the other. You should write a new value into the phase
increment memory register for each bank to change the NCO output frequencies after
16,000 steps during simulation. To avoid writing new values to the active bank, the
design example configures the write enable signals in the following way:

[zeros(1,15000) 1 zeros(1,2000) 1 zeros(1,2000) 1 zeros(1,2000) 1 zeros(1,8000)]

This configuration ensures that a new phase increment value for bank 0 is written at
15000 steps when the NCO is processing bank 3; a new phase increment value for
bank 1 is written at 17000 steps when the NCO is processing bank 0; a new phase
increment value for bank 2 is written at 19000 steps when the NCO is processing
bank 1; and a new phase increment value for bank 3 is written at 21000 steps when
the NCO is processing bank 2.

There is one write for each bank to write a new value for channel 1 into bank 0; a new
value for channel 2 into bank 1; a new value for channel 3 into bank 2; and a new
value for channel 4 into bank 3. Each new phase value needs only one register due to
the size of the memory interface.

The top-level testbench includes Control, Signals, BusStimulus, Run ModelSim,
and Run Quartus Prime blocks, plus ChanView blocks that deserialize the output
buses. An Edit Params block allows easy access to the setup variables in the
setup_demo_mc_nco_4banks_mem_interface.m script.

The NCOSubSystem subsystem includes the Device and NCO blocks.

The model file is demo_mc_nco_4banks_mem_interface.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.13.4. Four Channel, Eight Banks, Two Wires NCO


This design example implements an NCO with four channels and eight banks.

This design example is similar to the Four Channel, 16 Banks NCO design, but has
only eight banks of phase increment values (specified in the setup script for the
workspace variable) feeding into the NCO. Furthermore, the sample time for the NCO
requires two wires to output the four channels of the sinusoidal signals. Because two
wires exist for the NCO output, each wire contains only two channels. Hence, the
channel indicator runs from 0 to 1 instead of from 0 to 3.


You can inspect the eight peaks on the spectrum graph for each channel and see the
smooth continuous sinusoidal waves on the scope display.

This design example uses an additional subsystem (Select_bank_out) to extract the
NCO-generated sinusoidal signal of a selected bank on a channel.

The design example outputs the data to the workspace and plots it with the
separate demo_mc_nco_extracted_waves.mdl model, which demonstrates that the
output of the bank you select does represent a genuine sinusoidal wave. However,
from the scope display, you can see that the sinusoidal wave is no longer smooth at
the switching point, because the design example uses different phase increment
values for the selected banks. You can only run the
demo_mc_nco_extracted_waves.mdl model after you run
demo_mc_nco_8banks_2wires.mdl.

The top-level testbench includes Control, Signals, BusStimulus, Run ModelSim,
and Run Quartus Prime blocks, plus ChanView blocks that deserialize the output
buses. An Edit Params block allows easy access to the setup variables in the
setup_demo_mc_nco_8banks_2wires.m script.

The NCOSubSystem subsystem includes the Device and NCO blocks.

The Select_bank_out subsystem contains Const, CompareEquality, and AND
Gate blocks.

The model file is demo_mc_nco_8banks_2wires.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.13.5. Four Channel, 16 Banks NCO


This design example implements an NCO with four channels and 16 banks. This design
example demonstrates frequency-hopping with the NCO block to generate 4 channels
of sinusoidal waves, which you can switch from one set (bank) of frequencies to
another in the 16 predefined frequency sets.

A workspace variable phaseIncr defines the 16 (rows) × 4 (columns) matrix for the
phase increment input with the phase increment values that the setup script
calculates.

The input for the bank index is set up so that it cycles from 0 to 15 with each bank
lasting 1200 steps.

The spectrum display clearly shows 16 peaks for the selected channel, indicating that
the design example generates 16 different frequencies for that channel. The scope of
the selected channel shows the sinusoidal waves of the selected channel. You can
zoom in to see that the design example generates smooth and continuous sinusoidal
signals at the switching point.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus ChanView blocks that deserialize the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_mc_nco_16banks.m script.

The NCOSubSystem subsystem includes the Device and NCO blocks.

The model file is demo_mc_nco_16banks.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.13.6. IP
The IP design example describes how you can build an NCO design with the NCO block
from the Waveform Synthesis library.

Note: This design example uses the Simulink Signal Processing Blockset.

6.13.7. NCO
This design example uses the NCO block from the Waveform Synthesis library to
implement an NCO. The design compares the results against a Simulink
double-precision sine or cosine wave.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus ChanView blocks that deserialize the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_nco.m script.

The NCOSubSystem subsystem includes the Device and NCO blocks.

The model file is demo_nco.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

Related Information
NCO on page 245

6.13.8. NCO with Exposed Bus


This design example is a multichannel NCO that outputs four waveforms with slightly
different frequencies. Halfway through the simulation, DSP Builder reconfigures the
NCO for smaller increments, which gives a waveform with a longer period.

The model file is demo_nco_exposed_bus.mdl.

6.13.9. Real Mixer


This design example shows how to mix non-complex signals.

The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_mix.m script.

The MixerSystem subsystem includes the Device and Mixer blocks.

The model file is demo_mix.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

6.13.10. Super-sample NCO


This design example uses the NCO block from the Waveform Synthesis library to
implement a super-sample NCO. The design demonstrates run-time reconfiguring of
the frequency using a register bus.

A super-sample NCO uses multiple NCOs that each have an initial phase offset. When
you combine the parallel outputs into a serial stream, they can describe frequencies N
times the Nyquist frequency of a single NCO, where N is the total number of NCOs
that the design uses.

The NCO block produces four outputs, which all have the same phase increment but
each has a different, evenly distributed initial phase offset. When you place the four
parallel outputs in series, they describe frequencies up to four times higher than the
Nyquist frequency of an individual NCO.
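
The interleaving idea can be checked numerically with a short Python sketch. It is an illustration under stated assumptions and not the design's implementation: the clock rate and output frequency are made-up values, and the offsets are taken as k × (phase increment / N), which is one way to distribute them evenly.

```python
import math

N = 4            # number of parallel NCOs (from the text)
F_CLK = 100e6    # assumed clock rate of each NCO
F_OUT = 130e6    # assumed target frequency, above the 50 MHz Nyquist limit of one NCO

# Every NCO advances by the same phase on each clock cycle ...
phase_inc = 2 * math.pi * F_OUT / F_CLK
# ... but starts from its own evenly spaced initial offset (assumed spacing).
offsets = [k * phase_inc / N for k in range(N)]


def parallel_outputs(cycle):
    """The N samples produced on one clock cycle."""
    return [math.sin(phase_inc * cycle + off) for off in offsets]


def serial_stream(num_cycles):
    """Interleave the parallel outputs into one stream at N * F_CLK."""
    return [s for c in range(num_cycles) for s in parallel_outputs(c)]


def reference(num_samples):
    """A single ideal oscillator sampled directly at N * F_CLK."""
    return [math.sin(2 * math.pi * F_OUT * m / (N * F_CLK)) for m in range(num_samples)]
```

Comparing serial_stream(8) with reference(32) sample by sample shows that the two agree to within floating-point rounding, which is why the combined stream can represent a frequency that a single NCO at F_CLK cannot.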

To change the frequency of the super-sample NCO using the bus, write a new phase
increment and offset to each of the four constituent NCOs and then strobe the
synchronization register. The NCO block includes the phase increment register; a
separate primitive subsystem implements the phase offset and synchronization
registers.

The setup_demo_nco_super_sample script allows you to configure the clock rate,
number of NCOs, NCO accumulator size, and many other parameters. This script
calculates the phase increment and offsets required to sweep the super-sample NCO
through five frequencies. The script also defines the memory map and creates
the bus stimulus.

DSP Builder writes the output of the super-sample NCO into a MATLAB workspace
variable and compares it with a MATLAB-generated waveform in the script
test_demo_nco_super_sample.

DSP Builder schedules the bus in HDL but not in Simulink, so bus writes occur at
different clock cycles. Therefore, the verify_demo_nco_super_sample function
verifies the design by checking that the Simulink and ModelSim frequency
distributions match within a tolerance.

The output of the Spectrum Analyser block shows that the simulation initializes to the
last frequency in dspb_super_nco.frequencies and then rotates through the list.

The model file is demo_nco_super_sample.mdl.

Note: This design example uses the Simulink Signal Processing Blockset.

7. DSP Builder Design Rules, Design Recommendations, and Troubleshooting
1. DSP Builder Design Rules and Recommendations on page 176
2. Troubleshooting DSP Builder Designs on page 178

7.1. DSP Builder Design Rules and Recommendations


Use the design rules and recommendations to ensure your design performs correctly.

Design Rules for the Top-Level Design


• Ensure the top-level design has a Control block and a Signals block.
• Ensure the synthesizable part of your design is a subsystem or contained within a
subsystem of the top-level design.
• Ensure testbench stimulus data types that feed into the synthesizable design are
correct, as DSP Builder propagates them.
• Ensure you place Interface ➤ ExternalMemory and AvalonMMSlaveSettings
blocks only in the top-level design.

Design Rules for the Synthesized Top-Level Design


• Ensure your synthesized hardware top-level subsystem has a Device block.
• Ensure you place some non-synthesizable blocks (from the Interface ➤
MemoryMapped ➤ Stimulus and Utilities ➤ Testbench libraries) outside the
synthesized system.

Design Rules for the Primitive Top-Level Design


• Ensure the primitive top-level subsystem contains a SynthesisInfo block with
style set to Scheduled.
• Ensure the Primitive subsystems do not contain IP blocks.
• Only use primitive blocks in primitive subsystems and delimit them by primitive
boundary blocks.
• If using ALU folding, ensure the ALU Folding block is in the primitive top-level
subsystem.
• Route all subsystem inputs with associated valid and channel signals that are
to be scheduled together through the same ChannelIn blocks immediately
following the subsystem inputs. Route any other subsystem inputs through GPIn
blocks.
• Route all subsystem outputs with associated valid and channel signals that are
to be scheduled together through the same ChannelOut blocks immediately
before the subsystem outputs. Route any other subsystem outputs through
GPOut blocks.
• Ensure all primitive subsystem input boundary blocks (GPIn or ChannelIn) or
output boundary blocks (GPOut or ChannelOut) are in the primitive top-level
subsystem.
Note: Avalon-MM interface blocks can also be subsystem schedule boundaries.
• Ensure the valid signal is a scalar Boolean signal or ufix(1).
• Ensure the channel signal is a scalar uint(8).

Design Rules for Avalon-MM Interface Blocks


• Place shared memory blocks inside a primitive scheduled subsystem.
• Ensure the output type width of the RegField and RegBit blocks exactly matches
the range you specify for these blocks through the MSB and LSB parameters.
• Ensure the ranges you specify through the MSB and LSB parameters fit within the
Avalon-MM word width set from Avalon Interfaces ➤ Avalon-MM Slave Settings.
• Ensure different instances of register blocks (RegBit, RegField, or RegOut) that
map to the same Avalon-MM address specify disjoint ranges.
• For shared memory blocks, ensure output data width matches or is twice the size
of Avalon-MM data width set from Avalon Interfaces ➤ Avalon-MM Slave
Settings.
• Locate the BusStimulus and BusStimulusFileReader blocks in the testbench,
which is outside the synthesizable system.
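
The range rules for register blocks lend themselves to a quick check before you build the model. The following Python sketch is illustrative only: the register map, block names, and the 32-bit word width are hypothetical, and DSP Builder performs its own checks independently of any script like this.

```python
from itertools import combinations

AVALON_WORD_WIDTH = 32  # assumed value of the Avalon-MM word width setting

# Hypothetical register map: (block name, address, MSB, LSB, output width in bits)
registers = [
    ("enable_bit", 0x10, 0, 0, 1),
    ("gain_field", 0x10, 16, 1, 16),
    ("mode_field", 0x14, 3, 0, 4),
]


def check_register_map(regs, word_width=AVALON_WORD_WIDTH):
    """Return a list of rule violations (empty if the map obeys the rules)."""
    problems = []
    for name, addr, msb, lsb, width in regs:
        # The MSB..LSB range must fit within the Avalon-MM word width.
        if not (0 <= lsb <= msb < word_width):
            problems.append(f"{name}: range [{msb}:{lsb}] does not fit in {word_width} bits")
        # The output type width must exactly match the MSB..LSB range.
        if width != msb - lsb + 1:
            problems.append(f"{name}: output width {width} does not match range [{msb}:{lsb}]")
    # Blocks that share an address must use disjoint bit ranges.
    for (n1, a1, m1, l1, _), (n2, a2, m2, l2, _) in combinations(regs, 2):
        if a1 == a2 and l1 <= m2 and l2 <= m1:
            problems.append(f"{n1} and {n2}: overlapping ranges at address {a1:#x}")
    return problems
```

Calling check_register_map(registers) on the map above returns an empty list; adding a block whose range overlaps another block at the same address adds one entry per overlapping pair.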

Recommendations for your Top-Level Design


• Create a Simulink project for your model file, libraries, and scripts.
• Use workspace variables to set parameters, which allows you to globally organize
and change them.
• Use set-up scripts to set the workspace variables and clear-up scripts to clear
them from the workspace afterwards.
• Run set-up, analysis, and clear-up scripts automatically by adding them to the
model callbacks.
• Build a testbench that is parameterizable with system parameters such as sample
rate, clock rate, and number of channels. Use the Channelizer block to create
data in the valid-channel-data protocol.
• Hierarchically structure your design into subsystems. A modular design with well-
defined subsystem boundaries allows you to precisely manage latency and speed
of different modules and achieve timing closure.
• Save repeated subsystems as library blocks. Replace the design blocks with copies
from the library.
• Make library blocks configurable and self-modifying.
• Create and use your own libraries of reusable components. Organize them into
separate library files.
• Use configurable subsystem blocks in libraries to switch implementations in place.

• Build separate testbenches for library blocks.
• Keep block and subsystem names short, but descriptive. Do not use names that
contain special characters or slashes, or that begin with numbers.
• Use vectors to build parameterizable designs. DSP Builder does not need to redraw
them when parameters such as the number of channels change. A design that
uses a vector input of width N is the same as connecting N copies of the block with
a single scalar connection to each.

Recommendations for Loops in Primitive Subsystems


• Ensure sufficient sample delays (SampleDelay blocks) exist around loops to allow
for pipelining.
• To determine the minimum loop latency, turn on Minimum Delay on the
SampleDelay block.
• Simulink performs data type, complexity, and vector width propagation.
Sometimes Simulink does not successfully resolve propagation around loops,
particularly multiple nested loops.
• If Simulink is unsuccessful, look for where data types are not annotated.
• You may have to explicitly set data types. Simulink provides a library of blocks that
duplicate data types to help in such situations. For example, the control library
latches use the data type prop duplicate block, fixpt_dtprop (type
open fixpt_dtprop at the MATLAB command prompt).
• Avoid primitive subsystems with logic that clocked inputs do not drive, because
either reset behavior determines hardware behavior or the hardware is inefficient.
• Avoid examples that start from reset, as the design simulation in Simulink may
not match that of the generated hardware. You should start a counter from the
valid signal, rather than the constant. If the counter repeats without stopping after
the first valid, add a zero-latency latch into this connection.
• Avoid loops that DSP Builder drives without clocked inputs.

Related Information
• Control on page 221
• Avalon-MM Slave Settings (AvalonMMSlaveSettings) on page 219
• External Memory, Memory Read, Memory Write on page 264
• Channel In (ChannelIn) on page 344
• Channel Out (ChannelOut) on page 345
• Synthesis Information (SynthesisInfo) on page 347
• Setting DSP Builder Design Parameters with MATLAB Scripts on page 183

7.2. Troubleshooting DSP Builder Designs


You might see errors when you build, test, update, simulate, or verify your DSP
Builder design.
1. Check your design construction:

• Follow the recommendations for structuring and managing your model.
• Follow the Simulink setup guidelines.
• Follow the design rules.
• Follow the rules for Primitive and IP library blocks and specific blocks like
SampleDelay blocks.
2. Check for common Simulink errors including algebraic loops and unresolved data
types.
3. Ensure your DSP Builder design does not use Primitive library blocks in unsupported
modes: either outside of primitive subsystems or in loops without sufficient start
to end of loop timing offset.
4. Read DSP Builder error messages to see the root cause.
5. Click DSP Builder ➤ Design Checker to check your design for common
mistakes.
6. Select individual steps and click Check.
The output only matches the hardware when valid is high.
If your design uses FIFO buffers within multiple feedback loops, while the data
throughput and frequency of invalid cycles is the same, their distribution over a
frame of data might vary (because of the final distribution of delays around the
loop). If you find a mismatch, step past errors.

1. About Loops on page 179
2. DSP Builder Timed Feedback Loops on page 180
3. DSP Builder Loops, Clock Cycles, and Data Cycles on page 181

Related Information
DSP Builder Design Rules and Recommendations on page 176

7.2.1. About Loops


Your design can contain many loops that can interact with or be nested inside each
other. DSP Builder uses standard mathematical linear programming techniques to
solve a set of simultaneous timing constraints.

Consider the following two main cases:


• The simpler case is feed-forward. When no loops exist, feed-forward datapaths are
balanced to ensure that all the input data reaches each functional unit in the same
cycle. After analysis, DSP Builder inserts delays on all the non-critical paths to
balance out the delays on the critical path.
• The case with loops is more complex. Loops cannot be combinational—all loops in
the Simulink design must include delay memory. Otherwise Simulink displays an
'algebraic loop' error. In hardware, the signal has to have a specified number of
clock cycles latency round the feedback loop. Typically, one or more lumped delays
exist with SampleDelay blocks specifying the latency around some or all of the
loop. DSP Builder preserves the latency around the loop to maintain correct
functional operation. To achieve the target clock frequency, the total delay of the
SampleDelay blocks around the loop must be greater than or equal to the
required pipelining.

If the pipelining requirements of the functional units around the loop are greater than
the delay specified by the SampleDelay blocks on the loop path, DSP Builder
generates an error message. The message states that distribution of memory failed as
there was insufficient delay to satisfy the fMAX requirement. DSP Builder cannot
simultaneously satisfy the pipelining to achieve the given fMAX and the loop criteria to
re-circulate the data in the number of clock cycles specified by the SampleDelay
blocks.
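
The condition behind this error can be written as a simple comparison. The Python sketch below is a simplified model with hypothetical stage counts: the real scheduler derives pipelining from its internal timing models, and block latencies are not always independent of one another, so treat it as an aid to reasoning only.

```python
def loop_delay_shortfall(block_pipeline_stages, sample_delays):
    """Extra cycles of SampleDelay a feedback loop needs; 0 means it can be scheduled.

    block_pipeline_stages: pipeline stages each functional unit on the loop needs
    sample_delays: lengths of the SampleDelay blocks on the same loop
    """
    required = sum(block_pipeline_stages)
    available = sum(sample_delays)
    return max(0, required - available)


# Hypothetical loop: two adders needing one stage each and a multiplier needing three.
short_loop = loop_delay_shortfall([1, 1, 3], [1])      # single one-cycle SampleDelay
padded_loop = loop_delay_shortfall([1, 1, 3], [4, 2])  # six cycles of SampleDelay
```

A nonzero result corresponds to the case where DSP Builder reports insufficient delay for the fMAX requirement; a zero result means the lumped delay is large enough for the functional units to borrow from.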

DSP Builder automatically adjusts the pipeline requirements of every Primitive block
according to these factors:
• The type of block
• The target fMAX
• The device family and speed grade
• The type of inputs
• The bit width in the data inputs

Note: Multipliers on Cyclone devices take two cycles at all clock rates. On Stratix V, Arria V,
and Cyclone V devices, fixed-point multipliers take two cycles at low clock rates, three
cycles at high clock rates. Very wide fixed-point multipliers incur higher latency when
DSP Builder splits them into smaller multipliers and adders. You cannot count the
multiplier and adder latencies separately because DSP Builder may combine them into
a single DSP block. The latency of some blocks depends on what pipelining you apply
to surrounding blocks. DSP Builder avoids pipelining every block but inserts pipeline
stages after every few blocks in a long sequence of logical components, if fMAX is
sufficiently low that timing closure is still achievable.

In the SynthesisInfo block, you can optionally specify a latency constraint limit that
can be a workspace variable or expression, but must evaluate to a positive integer.
However, only use this feature to add further latency. Never use the feature to reduce
latency to less than the latency required to pipeline the design to achieve the target
fMAX.

After you run a simulation in Simulink, the help page for the SynthesisInfo block
shows the latency, port interface, and estimated resource utilization for the current
Primitive subsystem.

When no loops exist, feed-forward datapaths are balanced to ensure that all the input
data reaches each functional unit in the same cycle. After analysis, DSP Builder inserts
delays on all the non-critical paths to balance out the delays on the critical path.
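
Feed-forward balancing amounts to padding every path up to the latency of the slowest one. A minimal Python sketch follows, with hypothetical path names and latencies; DSP Builder computes the real latencies during its own analysis.

```python
def balancing_delays(path_latencies):
    """Delay to add to each path so that all inputs arrive in the same cycle."""
    critical = max(path_latencies.values())
    return {name: critical - latency for name, latency in path_latencies.items()}


# Hypothetical latencies (in clock cycles) of three paths feeding one functional unit.
paths = {"multiplier_path": 3, "bypass_path": 0, "adder_path": 1}
delays = balancing_delays(paths)
```

The critical path receives no extra delay, and every other path receives exactly the difference between its own latency and the critical latency.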

In designs with loops, DSP Builder advanced blockset must synthesize at least one
cycle of delay in every feedback loop to avoid combinational loops that Simulink
cannot simulate. Typically, one or more lumped delays exist. To preserve the delay
around the loop for correct operation, the functional units that need more pipelining
stages borrow from the lumped delay.

7.2.2. DSP Builder Timed Feedback Loops


Take care with feedback loops generally; in particular, provide sufficient delay around
the loop.

A design whose feedback cycle contains two adders but only a single sample delay
does not have sufficient delay. When automatically pipelining designs, DSP Builder
creates a schedule of signals through the design. From internal timing models, DSP
Builder calculates how fast certain components, such as wide adders, can run and how
many pipelining stages they require to run at a specific clock frequency. DSP Builder
must account for the required pipelining while not changing the order of the schedule.
The single sample delay is not enough to pipeline the path through the two adders at
the specific clock frequency. DSP Builder is not free to insert more pipelining, because
doing so changes the algorithm: the design would accumulate every n cycles, rather
than every cycle. The scheduler detects this change and gives an appropriate error
indicating how much more latency the loop requires for it to run at the specific clock
rate. In designs with multiple loops, you may see this error several times in a row as
DSP Builder balances and resolves each loop.

7.2.3. DSP Builder Loops, Clock Cycles, and Data Cycles


Never confuse clock cycles and data cycles in relation to feedback loops. For example,
you may want to accumulate previous data from the same channel. The DSP Builder
multichannel IIR filter design example (demo_iir) shows feedback accumulators
processing multiple channels. In this example, consecutive data samples on any
particular channel are 20 clock cycles apart. DSP Builder derives this number from
clock rate and sample rate.

The folded IIR filter design example (demo_iir_fold2) demonstrates one channel, at
a low data rate. This design example implements a single-channel infinite impulse
response (IIR) filter with a subsystem built from Primitive blocks folded down to a
serial implementation.

The design of the IIR is the same as the IIR in the multichannel example, demo_iir.
As the channel count is one, the lumped delays in the feedback loops are all one. If
you run the design at full speed, there is a scheduling problem. With new data arriving
every clock cycle, the lumped delay of one cycle is not enough to allow for pipelining
around the loops. However, the data arrives at a much slower rate than the clock rate,
in this example 32 times slower (the clock rate in the design is 320 MHz, and the
sample rate is 10 MHz), which gives 32 clock cycles between each sample.
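
The arithmetic in this paragraph can be reproduced with a few lines of Python. The function names are hypothetical; the rates are the ones quoted in the text for demo_iir_fold2.

```python
def cycles_between_samples(clock_rate_mhz, sample_rate_mhz):
    """Clock cycles between consecutive samples on one channel."""
    cycles, remainder = divmod(clock_rate_mhz, sample_rate_mhz)
    if remainder:
        raise ValueError("clock rate is not an integer multiple of the sample rate")
    return cycles


def unfolded_utilization(clock_rate_mhz, sample_rate_mhz):
    """Fraction of clock cycles in which an unfolded arithmetic unit does useful work."""
    return 1 / cycles_between_samples(clock_rate_mhz, sample_rate_mhz)


gap = cycles_between_samples(320, 10)   # 320 MHz clock, 10 MHz sample rate
busy = unfolded_utilization(320, 10)    # share of cycles carrying new data
```

With a gap of 32 cycles, an arithmetic unit that handles one sample per visit is busy for only 1/32 of the time, which is the inefficiency that folding removes.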

You can set the lumped delays to 32 cycles long—the gap between successive data
samples—which is inefficient both in terms of register use and in underused multipliers
and adders. Instead, use folding to schedule the data through a minimum set of fully
used hardware.

Set the SampleRate on both the ChannelIn and ChannelOut blocks to 10 MHz, to
inform the synthesis for the Primitive subsystem of the schedule of data through the
design. Even though the clock rate is 320 MHz, each data sample per channel is
arriving only at 10 MHz. The RTL is folded down—in multiplier use—at the expense of
extra logic for signal multiplexing and extra latency.


8. About DSP Builder for Intel FPGAs Optimization


Improve your designs and learn about folding and floating-point data types.

1. Associating DSP Builder with MATLAB on page 182
2. Setting Up Simulink for DSP Builder Designs on page 182
3. The DSP Builder Windows Shortcut on page 183
4. Setting DSP Builder Design Parameters with MATLAB Scripts on page 183
5. Managing your Designs on page 186
6. How to Manage Latency on page 187
7. Flow Control in DSP Builder Designs on page 194
8. Reset Minimization on page 196
9. About Importing HDL on page 198

8.1. Associating DSP Builder with MATLAB


If you install another version of MATLAB or you install DSP Builder without associating
it with a version of MATLAB, you can associate DSP Builder with MATLAB.
1. Type the following command into a command window:
<path to dsp_builder.bat> -m "<path to matlab executable>"
For example:
c:\intel_FPGA_pro\quartus\dspba\dsp_builder.bat -m "c:\tools\matlab\R2015a\windows64\bin\matlab.exe"

8.2. Setting Up Simulink for DSP Builder Designs


1. Setting Up Simulink Solver on page 182
2. Setting Up Simulink Signal Display Option on page 183

8.2.1. Setting Up Simulink Solver

1. On the File menu, click Preferences.
2. Expand Configuration defaults and click Solver.
3. For Type, select Fixed-step solver, unless you have folding turned on in some
part of your design. In that case, you need to select Variable-step solver.
4. For Solver select Discrete (no continuous states).
5. Click on Display Defaults and turn on Show port data types.

8.2.2. Setting Up Simulink Signal Display Option


Display various port and signal properties to aid debugging and visualization.
1. In Simulink, click Format ➤ Port/Signal Displays.
2. Click Sample Time Colors to change the color of blocks and wires in a particular
clock domain, which is useful when creating multirate designs.
3. Click the Port Data Type option to display the data type of the blocks. You can only
connect ports of the same data type.
4. Click Signal Dimensions to display the dimensions of a particular signal wire.
5. Make show data types and wide non-scalar lines the default for new models:
a. Click File ➤ Preferences.
b. Select Display Defaults for New Models.
c. Turn on Wide nonscalar lines and Show port data types.

8.3. The DSP Builder Windows Shortcut


Create a shortcut to set the file paths to DSP Builder and run a batch file with an
argument for the MATLAB executable to use.

The shortcut target is:

<dsp_builder.bat from the DSP Builder release to use> -m "<path to the MATLAB executable to use>"

For example:

C:\Altera\16.0\quartus\dspba\dsp_builder.bat -m "C:\tools\matlab\R2013a\windows64\bin\matlab.exe"

You can copy the shortcut from the Start menu and paste it to your desktop to create
a desktop shortcut. You can edit the properties to use different installed DSP Builder
releases, different MATLAB releases, or different start directories.

Related Information
Starting DSP Builder in MATLAB

8.4. Setting DSP Builder Design Parameters with MATLAB Scripts


1. Set block and design parameters using MATLAB workspace variables with names
unique to your model.
2. Define the MATLAB workspace variables in a MATLAB script or set of scripts, where
you can manage them.
3. Run the scripts automatically when you open the model and again before
simulation.
DSP Builder evaluates and updates all parameters before it generates hardware.
4. Clean up the workspace using a separate script when you close the design.

1. Running Setup Scripts Automatically on page 184

2. Defining Unique DSP Builder Design Parameters on page 184
3. Example DSP Builder Custom Scripts on page 184

8.4.1. Running Setup Scripts Automatically


1. In a Simulink model file (.mdl), click File ➤ Model properties.
2. Select the Callbacks tab.
3. Select PreLoadFcn and type the setup script name in the window on the right
hand side. When you open your Simulink design file, the setup script runs.
4. Select InitFcn and type the setup script name in the window on the right hand
side. Simulink runs your setup script first at the start of each simulation before it
evaluates the model design file .mdl.

8.4.2. Defining Unique DSP Builder Design Parameters


Define unique parameters to avoid parameters clashing with other open designs and
to help clear the workspace.
1. Create named structures and append a common root to all parameter names.
For example:

my_design_params.clockrate = 200;
my_design_params.samplerate = 50;
my_design_params.inputChannels = 4;

2. Clear the specific workspace variables you create with a clear-up script that runs
when you close the model. Do not use clear all.
For example, if you use the named structure my_design_params, run clear
my_design_params;. You may have other temporary workspace variables to
clear too.

8.4.3. Example DSP Builder Custom Scripts


You can write scripts that directly change parameters (such as the hardware
destination directory) on the Control and Signals blocks.

For example, in a script that passes the design name (without .mdl extension) as
model you can use:
%% Load the model
load_system(model);
%% Get the Signals block
signals = find_system(model, 'type', 'block', ...
    'MaskType', 'DSP Builder Advanced Blockset Signals Block');
if (isempty(signals))
    error('The design must contain a Signals Block. ');
end;
%% Get the Control block
control = find_system(model, 'type', 'block', ...
    'MaskType', 'DSP Builder Advanced Blockset Control Block');
if (isempty(control))
    error('The design must contain a Control Block. ');
end;
%% Example: set the RTL destination directory
%% (freq is assumed to be defined earlier in your script)
rtlDir = ['../rtl' num2str(freq)];
dspba.SetRTLDestDir(model, rtlDir);

Similarly you can get and set other parameters. For example, on the Signals block
you can set the target clock frequency:
fmax_freq = 300.0;
dspba.set_param(signals{1},'freq', fmax_freq);

You can also change the following threshold values that are parameters on the
Control block:
• distRamThresholdBits
• hardMultiplierThresholdLuts
• mlabThresholdBits
• ramThresholdBits

You can loop over changing these values, change the destination directory, run the
Quartus Prime software each time, and perform design space exploration. For
example:
%% Run a simulation, which also generates the RTL.
t = sim(model);
%% Then run the Quartus Prime compilation flow.
[success, details] = run_hw_compilation(model, './')
%% where details is a struct containing resource and timing information:
details.Logic,
details.Comb_Aluts,
details.Mem_Aluts,
details.Regs,
details.ALM,
details.DSP_18bit,
details.Mem_Bits,
details.M9K,
details.M144K,
details.IO,
details.FMax,
details.Slack,
details.Required,
details.FMax_unres,
details.timingpath,
details.dir,
details.command,
details.pwd
such that >> disp(details) gives output something like:
Logic: 4915
Comb_Aluts: 3213
Mem_Aluts: 377
Regs: 4725
ALM: 2952
DSP_18bit: 68
Mem_Bits: 719278
M9K: 97
M144K: 0
IO: 116
FMax: 220.1700
Slack: 0.4581
Required: 200
FMax_unres: 220.1700
timingpath: [1x4146 char]
dir: '../quartus_demo_ifft_4096_for_SPR_FFT_4K_n_2'
command: [1x266 char]
pwd: 'D:\test\script'

Note: The Timing Report is in the timingpath variable, which you can display with
disp(details.timingpath). Unused resources may appear as -1, rather than 0.

You must execute load_system before commands such as find_system
and run_hw_compilation work.

A useful set of commands to generate RTL, compile in the Quartus Prime software and
return the details is:
load_system(<model>);
sim(<model>);
[success, details] = run_hw_compilation(<model>, './')

8.5. Managing your Designs


DSP Builder supports parameterization through scripting.
1. To define many DSP Builder advanced blockset parameters as MATLAB
workspace variables, such as clock rate, sample rate, and bit width, define these
variables in a .m file.
2. Run this setup script before running your design.
3. Explore different values for various parameters, without having to modify the
Simulink design.
For instance, you can evaluate the performance impact of varying bit width at
different stages of your design.
4. Define the data type and width of Primitive library blocks in the script.
5. Experiment with different values. DSP Builder advanced blockset vector signal and
ALU folding support allows you to use the same design file to target single-channel
and multiple-channel designs.
6. Use a script for device options in your setup script, which eases design migration,
whether you are targeting a new device or you are upgrading the design to
support more data channels.
7. Use advanced scripting to fine tune Quartus Prime settings and to build automatic
test sweeping, including parameter changes and device changes.

1. Managing Basic Parameters
2. Creating User Libraries and Converting a Primitive Subsystem into a Custom Block
3. Revision Control

8.5.1. Managing Basic Parameters


Before you start implementing your design, you should define key parameters in a
script.

Based on the FPGA clock rate and data sample rates, you can derive how many clock
cycles are available to process unique data samples. This parameter is called Period in
many of the design examples. For example, for a period of three, a new sample for
the same channel appears every three clock cycles. For multiplication, you have three
clock cycles to compute one multiplication for this channel. In a design with multiple
channels, you can accommodate three different channels with just one multiplier. A
resource reuse potential exists when the period is greater than one.
1. Define the following parameters:

• FPGA clock rate


• Data sample rates at various stages of your design
• Number of channels or data sources
• Bit widths of signals at various stages of your design, including possible bit
growth throughout the computational datapath
• Coefficients of filters
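The parameters above could be collected in a setup script such as the following sketch. The file name, variable names, and values are illustrative assumptions rather than names that DSP Builder requires. The rates are chosen to reproduce the period of three described earlier, and the bit-growth line is a worst-case estimate for a full-precision FIR filter.

```matlab
% setup_my_design.m (hypothetical file name): key design parameters
ClockRate  = 240;        % FPGA clock rate, MHz (illustrative value)
SampleRate = 80;         % per-channel data sample rate, MSPS (illustrative)
ChanCount  = 3;          % number of channels or data sources
InputWidth = 16;         % input signal width, bits
CoefWidth  = 18;         % filter coefficient width, bits
Coefs      = [1 4 6 4 1] / 16;   % illustrative filter coefficients

% Clock cycles available per sample on one channel
Period = floor(ClockRate / SampleRate);

% Worst-case bit growth through one multiplier and a sum over all taps
NumTaps     = numel(Coefs);
OutputWidth = InputWidth + CoefWidth + ceil(log2(NumTaps));

% One multiplier serves up to Period channels by time sharing
MultsPerTap = ceil(ChanCount / Period);

fprintf('Period = %d, OutputWidth = %d bits, MultsPerTap = %d\n', ...
    Period, OutputWidth, MultsPerTap);
```

With these values the period is three, so a single multiplier per tap can serve all three channels.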

8.5.2. Creating User Libraries and Converting a Primitive Subsystem into a Custom Block

You can group frequently used custom blocks into libraries for future use.
1. Mask the block to hide the block's contents and provide a custom block dialog.
2. Place the block in a library to prohibit modifications and allow you to easily update
copies of the block.
Note: This procedure is similar to creating a Simulink custom block and custom
library. You can also add a custom library to the Simulink library browser.

8.5.3. Revision Control


Use Simulink revision control to manage revisions of your DSP Builder advanced
blockset design.

The Simulink Model Info block displays revision control information about a model as
an annotation block in the model's block diagram. It shows revision control
information embedded in the model and information maintained by an external
revision control or configuration management system.

You can customize some revision control tools to use the Simulink report generator
XML comparison, which allows you to compare two versions of the same file.

You must add the following files to revision control:


• Your setup script (.m file)
• Model design files (.mdl)
• All customized library files
• The _params.xml file

Note: You do not need to archive autogenerated files such as Quartus Prime project files or
synthesizable RTL files.

8.6. How to Manage Latency


The Primitive library blocks are untimed circuits, so they are not cycle accurate. A
one-to-one mapping does not exist between the blocks in the Simulink model and the
blocks you implement in your design in RTL. This decoupling of design intent from
design implementation gives productivity benefits. The ChannelOut block is the
boundary between the untimed section and the cycle accurate section. This block
adds the additional delay that the RTL introduces, so that data going in to the
ChannelOut block is delayed internally before DSP Builder presents it externally. The
latency of the block shows on the ChannelOut mask. You may want to fix or constrain
the latency after you complete part of a DSP Builder design, for example on an IP
library block or for a Primitive subsystem. In other cases, you may want to limit the
latency in advance, which allows future changes to other subsystems without causing
undesirable effects upon the overall design.

To accommodate extra latency, insert registers. This feature applies only to Primitive
subsystems. To access this feature, use the SynthesisInfo block.

Latency is the number of delays in the valid signal across the subsystem. The DSP
Builder advanced blockset balances delays in the valid and channel paths with
delays that DSP Builder inserts for autopipelining in the datapath.

Note: User-inserted sample delays in the datapath are part of the algorithm, rather than
pipelining, and are not balanced. However, any uniform delays that you insert across
the entire datapath optimize out. If you want to constrain the latency across the entire
datapath, you can specify this latency constraint in the SynthesisInfo block.

1. Reading the Added Latency Value for an IP Block
2. Zero Latency Example
3. Implicit Delays in DSP Builder Designs
4. Distributed Delays in DSP Builder Designs
5. Latency and fMAX Constraint Conflicts in DSP Builder Designs
6. Control Units Delays

8.6.1. Reading the Added Latency Value for an IP Block


1. Select the block and type the following command:
get_param(gcb, 'latency')
You can also use this command in an M-script, for example when you want to use
the returned latency value to balance delays with external circuitry.
Note: If you use an M-script to get this parameter and set latency elsewhere in
your design, by the time it updates and sets on the IP block, it is too late to
initialize the delays elsewhere. You must run your design twice after any
changes to make sure that you have the correct latency. If you are scripting
the whole flow, you must run once with end time 0, and then run again
immediately with the desired simulation end time.
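The two-pass flow described in the note could be scripted along the following lines. This is a sketch only: the model name, the block path, the end time, and the workspace variable extLatency are hypothetical, and whether get_param returns the latency as text or as a number is not stated here, so the script handles both.

```matlab
% Sketch of the two-pass flow for reading an IP block latency in a script.
model    = 'my_design';                  % hypothetical model name
ipBlock  = [model '/FilterSystem/FIR'];  % hypothetical block path
stopTime = '20000';                      % desired end time (illustrative)

load_system(model);

% Pass 1: run with end time 0 so that the IP block updates its latency
sim(model, 'StopTime', '0');
lat = get_param(ipBlock, 'latency');
if ischar(lat)                           % the parameter may come back as text
    lat = str2double(lat);
end

% extLatency is a hypothetical workspace variable used by the balancing
% delays elsewhere in the design
assignin('base', 'extLatency', lat);

% Pass 2: run again immediately with the desired simulation end time
sim(model, 'StopTime', stopTime);
```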

8.6.2. Zero Latency Example


In this example, sufficient delays in the design ensure that DSP Builder requires no
extra automatic pipelining to reach the fMAX target (although DSP Builder distributes
this user-added delay through the datapath). Thus, the reported latency is zero. DSP
Builder inserts no extra pipelining registers in the datapath to meet fMAX and thus
inserts no balancing registers on the channel and valid paths. The delay of the valid
signal across the subsystem is zero clock cycles, as the Lat: 0 latency value on the
ChannelOut block shows.

Figure 73. Latency Example with a User-Specified Delay

8.6.3. Implicit Delays in DSP Builder Designs


The DSP Builder scheduler may add extra delays on paths between the ChannelIn
and ChannelOut blocks. The extra latency is the same for all such paths and is
displayed on the ChannelOut block.

If the valid input directly drives the valid output, the delay on the valid signal matches
the latency displayed on the ChannelOut block. The delay does not match if the valid
output is generated in any other way, for example by using a Sequence block.

For example, the 4K FFT design example uses a Sequence block to drive the valid
signal explicitly.

Figure 74. Sequence Block in the 4K FFT Design Example

The latency that the ChannelOut block reports is therefore not 4096 + the automatic
pipelining value, but just the pipelining value.

8.6.4. Distributed Delays in DSP Builder Designs


Distributed delays are not cycle-accurate inside a primitive subsystem, because DSP
Builder distributes and optimizes the user-specified delay. To consistently apply extra
latency to a primitive subsystem, use latency constraints.

In this example, the Mult block has a direct feed-through simulation model, and the
following SampleDelay block has a delay of 10. The Mult block has zero delay in
simulation, followed by a delay of 10. In the generated hardware, DSP Builder
distributes part of this 10-stage pipelining throughout the multiplier optimally, such
that the Mult block has a delay (in this case, four pipelining stages) and the
SampleDelay block a delay (in this case, six pipelining stages). The overall result is
the same—10 pipelining stages, but if you try to match signals in the primitive
subsystem against hardware, you may find DSP Builder shifts them by several cycles.

Similarly, if you have insufficient user-inserted delay to meet the required fMAX, DSP
Builder automatically pipelines and balances the delays, and then corrects the cycle-
accuracy of the primitive subsystem as a whole, by delaying the output signals in
simulation by the appropriate number of cycles at the ChannelOut block.

If you specify no pipelining, the simulation model for the multiplier is direct
feed-through, and the result appears on the output immediately.

Figure 75. Latency Example without a User-Specified Delay

To reach the desired fMAX, DSP Builder then inserts four pipelining stages in the
multiplier, and balances these with four registers on the channel and valid paths.
To correct the simulation model to match hardware, the ChannelOut block
delays the outputs by four cycles in simulation and displays Lat: 4 on the block. Thus,
if you compare the output of the multiplier in simulation with the hardware, the
simulation is four cycles early; but if you compare the primitive subsystem outputs
with hardware, they match, because the ChannelOut block provides the simulation
correction for the automatically inserted pipelining.
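The arithmetic in the two multiplier examples can be sketched as follows. This is a simplified single-path model written for illustration only. It assumes that DSP Builder first absorbs user-inserted delay into the pipelining that the fMAX target requires and adds automatic pipelining only for the shortfall; it does not describe how the scheduler is implemented, and the variable names are hypothetical.

```matlab
% Simplified model of how user delay and required pipelining combine
requiredPipe = 4;                    % stages the Mult block needs for fMAX
userDelays   = [10 0];               % SampleDelay length: with and without delay

multStages  = zeros(size(userDelays));
delayStages = zeros(size(userDelays));
reportedLat = zeros(size(userDelays));

for k = 1:numel(userDelays)
    multStages(k)  = requiredPipe;                         % inside the Mult block
    delayStages(k) = max(userDelays(k) - requiredPipe, 0); % left in SampleDelay
    reportedLat(k) = max(requiredPipe - userDelays(k), 0); % Lat on ChannelOut
    fprintf('user delay %2d: Mult %d, SampleDelay %d, Lat: %d\n', ...
        userDelays(k), multStages(k), delayStages(k), reportedLat(k));
end
```

With a user delay of 10, the model gives four stages in the multiplier, six in the SampleDelay block, and a reported latency of zero. With no user delay, it gives a reported latency of four.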

If you want a consistent 10 cycles of delay across the valid, channel and
datapath, you may need latency constraints.

Figure 76. Latency Example with Consistent Delays

This example has a consistent line of SampleDelay blocks inserted across the design.
However, the algorithm does not use these delays. DSP Builder recognizes that
the design does not require them and optimizes them away, leaving only the delay that
the design requires. In this case, each block requires a delay of four, to balance the
four pipelining stages that the multiplier needs to reach the target fMAX. The delay of
10 in simulation remains from the non-direct-feed-through SampleDelay blocks. In
such cases, you receive the following warning on the MATLAB command line:

DSP Builder optimizes away some user inserted SampleDelays. The latency on the
valid path across primitive subsystem design name in hardware is 4, which may
differ from the simulation model. If you need to preserve extra SampleDelay
blocks in this case, use the Constraint Latency option on the SynthesisInfo
block.

Note: SampleDelay blocks reset to unknown values ('X'), not to zero. Designs that rely on
a SampleDelay output of zero after reset may not behave correctly in hardware. Use
the valid signal to indicate valid data and its propagation through the design.

8.6.5. Latency and fMAX Constraint Conflicts in DSP Builder Designs


Some blocks need to have a minimum latency, because of either logical or silicon
limitations. In these cases, it is possible to create an abstract design that cannot be
realized in hardware. You can generally address these cases, but in some cases, such
as IIR filters, you must find algorithmic alternatives.

Generally, problems occur in feedback loops. You can solve these issues by lowering
the fMAX target, or by restructuring the feedback loop to reduce the combinatorial logic
or increasing the delay. You can redesign some control structures that have feedback
loops to make them completely feed forward.

You cannot set a latency constraint that conflicts with the constraint that the fMAX
target implies. For example, a latency constraint of < 2 conflicts with the pipelining
that the fMAX target implies if the multiplier needs four pipelining stages to reach
the target fMAX. In that case, the simulation fails and issues an error, highlighting
the Primitive subsystem.

DSP Builder gives this error because a latency of four satisfies only a constraint of
< 5 or looser, so you must increase the constraint limit by at least 3 (from < 2 to
< 5) to meet the target fMAX.

8.6.6. Control Units Delays


Commonly, you may use an FSM to design control units. An FSM uses DSP Builder
SampleDelay blocks to store its internal state. DSP Builder automatically
redistributes these SampleDelay blocks, which may alter the functional behavior of
the control unit subsystem. Then the generated hardware no longer matches the
simulation. Also, redistribution of SampleDelay blocks throughout the design may
change the behavior of the FSM by altering its initial state. Classically, you exploit the
reset states of the constituent components to determine the initial state; however this
approach may not work. DSP Builder may not preserve any given component because
it automatically pipelines Primitive subsystems. Also it can leave some components
combinatorial based on fMAX target, device family, speed grade, and the locations of
registers immediately upstream or downstream.

Figure 77. SampleDelay Block Example

DSP Builder relocates the sample delay, to save registers, to the Boolean signal that
drives the s-input of the 2-to-1 Mux block. You may see a mismatch in the first cycle
and beyond, depending on the contents of the LUT.

When you design a control unit as an FSM, the locations of SampleDelay blocks
specify where DSP Builder expects zero values during the first cycle. In Figure 77,
DSP Builder expects the first sample that the a-input of the CmpGE block receives
to be zero. Therefore, the first output value of that compare block is
high. Delay redistribution changes this initialization. You cannot rely on the reset state
of that block, especially if you embed the Primitive subsystem within a larger design.
Other subsystems may drive the feedback loop whose pipeline depth adapts to fMAX.
The first valid sample may only enter this subsystem after some arbitrary number of
cycles that you cannot predetermine. To avoid this problem, always ensure you anchor
the SampleDelay blocks to the valid signal so that the control unit enters a well-
defined state when valid-in first goes high.

Figure 78. SampleDelay Block Example 2

To make a control unit design resistant to automated delay redistribution and to solve
most hardware FSM designs that fail to match simulation, replace every SampleDelay
block with the Anchored Delay block from the Control folder in the Additional
libraries. When the valid-in first goes high, the Anchored Delay block outputs one (or
more) zeros, otherwise it behaves just like an ordinary SampleDelay block.
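The behavior described above can be illustrated with a small behavioral model of a one-cycle anchored delay. This is a sketch written for illustration, not the block's implementation: the signal values are made up, and the value 99 stands in for whatever unknown content the register holds before the first valid cycle.

```matlab
% Behavioral sketch of a one-cycle Anchored Delay (illustrative model only)
d         = [5 6 7 8 9];             % data input, one value per clock cycle
valid     = [0 0 1 1 1];             % valid-in for the same cycles
state     = 99;                      % arbitrary content before the first valid
seenValid = false;
y         = zeros(size(d));

for n = 1:numel(d)
    if valid(n) && ~seenValid
        y(n) = 0;                    % first valid cycle: force a zero output
        seenValid = true;
    else
        y(n) = state;                % otherwise behave like a SampleDelay
    end
    state = d(n);                    % register captures the current input
end

disp(y);
```

From the first valid cycle onward the outputs are 0, 7, 8, whatever value state held initially, which is the property that makes the control unit enter a well-defined state.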

Synthesizing the example design (fMAX = 250 MHz) on Arria V (speed grade 4) shows
that DSP Builder still redistributes the delays contained inside the Anchored
Delay block to minimize register utilization. DSP Builder still inserts a register
initialized to zero before the s-input of the 2-to-1 Mux block. However, the hardware
continues to match Simulink simulation because of the anchoring. If you place highly
pipelined subsystems upstream so that the control unit doesn't enter its first state
until several cycles after device initialization, the FSM still provides correct outputs.
Synchronization is maintained because DSP Builder inserts balancing delays on the
valid-in wire that drives the Anchored Delay and forces the control unit to enter its
initial state the correct number of cycles later.

Control units that use this design methodology are also robust to optimizations that
alter the latency of components. For example, when a LUT block grows sufficiently
large, DSP Builder synthesizes a DualMem block in its place that has a latency of at
least one cycle. Automated delay balancing inserts a sufficient number of one bit wide
delays on the valid signal control path inside every Anchored Delay. Hence, even if
the CmpGE block is registered, its reset state has no influence on the initial state of
the control unit when the valid-in first goes high.

Each Anchored Delay introduces a 2-to-1 Mux block in the control path. When
targeting a high fMAX (or a slow device), tight feedback loops may fail to schedule or
meet timing. Using Anchored Delay blocks in place of SampleDelay blocks may also
use more registers and contribute to routing congestion.

8.7. Flow Control in DSP Builder Designs


Use the DSP Builder valid and channel signals alongside the data to indicate when
data is valid and to synchronize it. Use these signals to process valid data and ignore
invalid data cycles in a streaming style, which uses the FPGA efficiently. You can build
designs that run as fast as the data allows, are not sensitive to latency or device fMAX,
and can respond to backpressure.

This style uses FIFO buffers for capturing and flow control of valid outputs, and loops
and for loops for simple and complex nested counter structures. Also add latches to
enable only components with state, which minimizes enable line fan-out that can
otherwise be a bottleneck to performance.

Flow Control Using Latches

Generally, hardware designers avoid latches. However, these subsystems synthesize to
flip-flops.

Often designs need to stall or enable signals. Routing an enable signal to all the blocks
in the design can lead to high fan-out nets, which become the critical timing path in
the design. To avoid this situation, enable only blocks with state, while marking output
data as invalid when necessary.

DSP Builder provides the following utility functions in the Additional Blocks Control
library, which are masked subsystems.
• Zero-Latency Latch (latch_0L)
• Single-Cycle Latency Latch (latch_1L)
• Reset-Priority Latch (SRlatch_PS)
• Set-Priority Latch (SRlatch)

Some of these blocks use the Simulink Data Type Prop Duplicate block, which takes
the data type of a reference signal ref and back propagates it to another signal prop.
Use this feature to match data types without forcing an explicit type that you can use
in other areas of your design.

Forward Flow Control Using Latches

The demo_forward_pressure example design shows how to use latches to
implement forward flow control.

Flow Control Using FIFO Buffers

You can use FIFO buffers to build flexible, self-timed designs insensitive to latency.
They are an essential component in building parameterizable designs with feedback,
such as those that implement back pressure.

Flow Control and Backpressure Using FIFO Buffers

The demo_back_pressure design example shows how to use latches to implement
back pressure flow control.

You must not acknowledge reading of invalid output data. Consider a FIFO buffer with the
following parameters:
• Depth = 8
• Fill threshold = 2
• Full period = 7

A three-cycle latency exists between the first write and valid going high. The q output
has a similar latency in response to writes. The latency in response to read
acknowledgements is only one cycle for all output ports. The valid output goes low in
response to the first read, even though the design writes two items to the FIFO buffer,
because the second write is not older than three cycles when the read occurs.

With the fill threshold set to a low value, the t output can go high even though the v
output is still zero. Also, the q output stays at the last value read when valid goes low in
response to a read.

Problems can occur when you use no feedback on the read line, or if you take the
feedback from the t output instead with fill threshold set to a very low value (< 3). A
situation may arise where a read acknowledgement is received shortly following a
write but before the valid output goes high. In this situation, the internal state of the
FIFO buffer does not recover for many cycles. Instead of attempting to reproduce this
behavior, Simulink issues a warning when a read acknowledgement is received while
valid output is zero. This intermediate state between the first write to an empty FIFO
buffer and the valid going high, highlights that the input to output latency across the
FIFO buffer is different in this case. This situation is the only time when the FIFO
buffer behaves with a latency greater than one cycle. With other primitive blocks,
which have consistent constant latency across each input to output path, you never
have to consider these intermediate states.

You can mitigate this issue by taking care when using the FIFO buffer. The model
needs to ensure that the read is never high when valid is low using the simple
feedback. If you derive the read input from the t output, ensure that you use a
sufficiently high threshold.

You can set fill threshold to a low number (<3) and arrive at a state where output t is
high and output v is low, because of differences in latency across different pairs of
ports—from w to v is three cycles, from r to t is one cycle, from w to t is one cycle. If
this situation arises, do not send a read acknowledgement signal to the FIFO buffer.
Ensure that when the v output is low, the r input is also low. A warning appears in the
MATLAB command window if you ever violate this rule. If you derive the read
acknowledgement signal with a feedback from the t output, ensure that the fill
threshold is set to a sufficiently high number (3 or above). The same applies to the
f output and the full period.
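One way to check this rule on simulation results is sketched below. It assumes that you have logged the v output and r input of the FIFO buffer into MATLAB vectors, one entry per clock cycle. The vectors here are made-up illustrative data, not output from a real design.

```matlab
% Sketch: check logged FIFO control signals for reads while valid is low
v = logical([0 0 0 1 1 0 1]);        % logged v output (illustrative data)
r = logical([0 1 0 1 0 0 1]);        % logged r input (illustrative data)

badCycles = find(r & ~v);            % cycles that violate the rule
if isempty(badCycles)
    disp('No read acknowledgements while valid is low.');
else
    fprintf('Read acknowledged while valid is low at cycle(s): %s\n', ...
        mat2str(badCycles));
end

% Gating the read request with v removes the violations
rGated = r & v;
```

In this data the only violation is in cycle 2, where r is high before v has gone high.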

If you supply vector data to the d input, you see vector data on the q output. DSP
Builder does not support vector signals on the w or r inputs, as the behavior is
unspecified. The v, t, and f outputs are always scalar.

Flow Control using Simple Loop

Designs may require counters, or nested counters, to implement indexing of
multidimensional data. The Loop block provides a simple nested counter, equivalent
to a simple software loop.
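For comparison, the software loop that a two-level nested counter corresponds to might look like the following sketch. The limits are illustrative, and the assumption that each counter runs from zero up to its limit minus one is stated here for the example rather than taken from the block's documentation.

```matlab
% Software equivalent of a two-level nested counter (illustrative limits)
limits = [2 3];                      % outer and inner loop limits
counts = zeros(prod(limits), 2);     % one row of counter values per cycle
row = 0;

for outer = 0:limits(1)-1            % assumption: counters run 0..limit-1
    for inner = 0:limits(2)-1
        row = row + 1;
        counts(row, :) = [outer inner];
    end
end

disp(counts);
```

Each row of counts holds the pair of counter values for one active cycle, with the inner counter changing fastest.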

The enable input and demo_kronecker design example demonstrate flow control
using a loop.

Flow Control Using the ForLoop Block

You can use either Loop or ForLoop blocks for building nested loops.

The Loop block has the following advantages:


• A single Loop block can implement an entire stack of nested loops.
• No wasted cycles when the loop is active but the count is not valid.
• The implementation cost is lower because no overhead for the token-passing
scheme exists.

The ForLoop block has the following advantages:


• Loops may count either up or down.
• You may specify the initial value and the step, not just the limit value.
• The token-passing scheme allows the construction of control structures that are
more sophisticated than just nesting rectangular loops.

When a stack of nested loops is the appropriate control structure (for example, matrix
multiplication) use a single Loop block. When a more complex control structure is
required, use multiple ForLoop blocks.

8.8. Reset Minimization


Reset minimization reduces the amount of reset logic in your design. A reduction in
reset logic can give an area decrease and potential fMAX increase. Reset minimization
removes resets on the datapath. You can apply reset minimization globally to floating-
point operators and to your synthesizable subsystems. By default, DSP Builder turns
on reset minimization for HyperFlex™ architectures and off for all other devices.

DSP Builder distinguishes control flow from data flow: control flow is the logic you
connect to the ChannelIn and ChannelOut valid signal path. DSP Builder applies
little or no reset minimization to control logic and aggressive minimization to data flow.

By default, DSP Builder chooses reset minimization options for you automatically. It
automatically applies reset minimization if your target device includes the HyperFlex
architecture.

You may override the default automatic reset minimization options, for example as
part of design space optimization.

When you globally apply reset minimization, DSP Builder determines a local reset
minimization setting for each of your synthesizable subsystems. DSP Builder applies
this local reset minimization conditionally, if your subsystem contains ChannelIn or
ChannelOut blocks.

Table 25. Reset Minimization Summary

If your synthesizable subsystem uses a mixture of Channel and GP blocks, choose
Conditional for Local reset minimization.

Global Enable   Local Setting   Synthesizable Subsystem     Reset Minimization
Off             Any             Any                         No
On              Off             Any                         No
On              Conditional     ChannelIn and ChannelOut    Yes
On              Conditional     GPIn and GPOut              No
On              On              ChannelIn and ChannelOut    Yes
On              On              GPIn and GPOut              Yes
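
The rows of Table 25 can be restated as a single logical rule, as in the sketch below. The helper resetMinApplied is a hypothetical name used only for illustration and is not part of DSP Builder. The third argument is true for a subsystem with ChannelIn and ChannelOut blocks and false for one with GPIn and GPOut blocks.

```matlab
% Table 25 expressed as a logical rule (illustrative helper, not a tool API)
resetMinApplied = @(globalOn, localSetting, hasChannelIO) ...
    globalOn && (strcmp(localSetting, 'On') || ...
    (strcmp(localSetting, 'Conditional') && hasChannelIO));

% Table rows: global enable, local setting, ChannelIn/ChannelOut present
rows = {false, 'On',          true;
        true,  'Off',         true;
        true,  'Conditional', true;
        true,  'Conditional', false;
        true,  'On',          true;
        true,  'On',          false};

for k = 1:size(rows, 1)
    applied = resetMinApplied(rows{k, :});
    fprintf('%-5s %-12s channel I/O: %d -> reset minimization: %d\n', ...
        mat2str(rows{k, 1}), rows{k, 2}, rows{k, 3}, applied);
end
```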

DSP Builder does not apply reset minimization to blocks with innate state, user-
constructed cycles, and enable logic in your design, as that can give undefined initial
values.

Reset minimization only detects local cycles within a subsystem. You should avoid
broader feedback cycles.

Reset minimization may affect the behavior of your design during Simulink simulation
and on hardware.

Simulink Simulation

The DSP Builder simulation engine within Simulink is unaware of the reset
minimization optimization and therefore always simulates your design behavior with
reset present.

In general there is no difference in behavior, and this is aided by the testbench inputs
defaulting typically to zero and a longer minimum reset pulse-width allowing such
defaults to propagate through the datapath register stages.

However, in some cases mismatches may occur because data entering a SampleDelay
block in your design during reset is non-zero.

If an input does not default to zero or the internal behavior is incompatible with
SampleDelay blocks resetting to zeros (or the minimum reset-pulse width is less
than the design latency), the Simulink simulation might differ from the HDL
simulation.

Implementation on Hardware

Removing a reset on the datapath means that when DSP Builder releases a reset, your
data flow logic may contain values clocked in during reset, which might affect the
initial post-reset behavior of your system.

Reset minimization detects and avoids optimizing cycles in your synthesizable
subsystem. It does not detect cycles constructed outside of a single synthesizable
subsystem. Do not enable it for such designs.

Related Information
• Control
• Synthesis Information (SynthesisInfo)

8.9. About Importing HDL


Importing HDL enables you to cosimulate existing HDL as a subsystem within your
DSP Builder designs.

Importing HDL has the following software requirements:


• HDL Verifier toolbox
• An HDL Verifier compatible version of the ModelSim simulator (importing HDL does
not support ModelSim AE)

Additionally, your HDL must conform to DSP Builder design rules and must:
• Have only one clock domain
• Match reset level with DSP Builder
• Use the std_logic data type for clock and reset ports
• Use std_logic_vector for all other ports
• Have no top-level generics
• Contain no bus components

You may need to write a wrapper HDL file that instantiates your HDL, which might
configure generics, convert from other data types to std_logic_vector, or invert the
reset signal.

DSP Builder can import any number of instantiated entities. To import multiple
copies of an entity or multiple distinct entities, instantiate the entities in a top-
level wrapper file.

Simulink does not model all the signal states that ModelSim uses (e.g. ‘U’).
Simulink interprets all non-‘1’ states as a ‘0’.

Importing HDL uses the HDL Verifier toolbox to communicate with an HDL simulation
running in ModelSim. You can have as many components in your ModelSim simulation
as you like; each component communicates with a separate DSP Builder HDL Import
block. Your top-level design must include an HDL Import Config block.

Figure 79. HDL Import Block Placement

[Figure: a Simulink model containing Source, Control, Subsystem, HDL Import, and Sink
blocks in a DSP Builder Advanced design. Each HDL Import block (Component 0,
Component 1, through Component n) pairs with the matching component in ModelSim.]

You cannot place HDL Import blocks inside a primitive scheduled subsystem.

DSP Builder creates the appropriate instantiation of the component represented by the
HDL Import block.

DSP Builder sees imported HDL as a scheduled system. DSP Builder does not try to
schedule your imported HDL. You cannot import HDL into a scheduled subsystem.
Imported HDL acts like other DSP Builder IP blocks (e.g. NCO, FFT). You must
manually delay-balance any parallel datapaths and turn on Generate Hardware in
the Control block.


9. About Folding
Folding optimizes hardware usage for low throughput systems, which have many clock
cycles between data samples. Low throughput systems often inefficiently use
hardware resources. When you map designs that process data as it arrives every clock
cycle to hardware, many hardware resources may be idle for the clock cycles between
data.

Folding allows you to create your design and generate hardware that reuses resources
to create an efficient implementation.

The folding factor is the number of times you reuse a single hardware resource, such
as a multiplier, and it depends on the ratio of the data and clock rates:

Folding factor = clock rate/data rate

DSP Builder offers ALU folding for folding factors greater than 500. With ALU folding,
DSP Builder arranges one of each resource in a central arithmetic logic unit (ALU) with
a program to schedule the data through the shared operation.
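
As a worked example of the folding factor formula, the sketch below uses illustrative rates of a 200 MHz clock and a 100 kSPS data stream. The variable names are hypothetical; the threshold of 500 is the one stated in the paragraph above.

```matlab
% Worked example of the folding factor (illustrative rates)
clockRate = 200e6;                   % clock rate, Hz
dataRate  = 100e3;                   % data sample rate, samples per second

foldingFactor = clockRate / dataRate;
aluCandidate  = foldingFactor > 500; % threshold for ALU folding

fprintf('Folding factor = %g, ALU folding candidate: %d\n', ...
    foldingFactor, aluCandidate);
```

Here the folding factor is 2000, so each hardware resource could in principle be reused 2000 times per sample, which is well above the threshold for ALU folding.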

1. ALU Folding
2. Removing Resource Sharing Folding

9.1. ALU Folding


ALU folding generates an ALU architecture specific to the DSP Builder design. The
functional units in the generated ALU architecture depend on the blocks and data
types in your design. DSP Builder maps the operations performed by connecting
blocks in Simulink to the functional units on the generated architecture.

ALU folding reduces the resource consumption of a design by as much as it can while
still meeting the latency constraint. The constraint specifies the maximum number of
clock cycles a system with folding takes to process a packet. If ALU folding cannot
meet this latency constraint, or if ALU folding cannot meet a latency constraint
internal to the DSP Builder system due to a feedback loop, you see an error message
stating it is not possible to schedule the design.

1. ALU Folding Limitations
2. ALU Folding Parameters
3. ALU Folding Simulation Rate
4. Using ALU Folding
5. Using Automated Verification
6. Ready Signal
7. Connecting the ALU Folding Ready Signal

