6.4.1. Complex FIR Filter
DSP Builder for Intel FPGAs (Advanced Blockset) Design Examples and Reference Designs
HB_DSPB_ADV | 2020.10.05
13. Interpolating FIR Filter with Multiple Coefficient Banks
14. Interpolating FIR Filter with Updating Coefficient Banks
15. Root-Raised Cosine FIR Filter
16. Single-Rate FIR Filter
17. Super-Sample Decimating FIR Filter
18. Super-Sample Fractional FIR Filter
19. Super-Sample Interpolating FIR Filter
20. Variable-Rate CIC Filter
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_dcic.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
This design example uses the Decimating FIR block to build a 20-channel,
decimate-by-5, 49-tap FIR filter with a target system clock frequency of 240 MHz.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_fird.m script.
The FilterSystem subsystem includes the Device and Decimating FIR blocks.
Note: This design example uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_filters_flow_control.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_fir_fractional.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_firf.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_firih.m script.
The FilterSystem subsystem includes the Device block and two separate
InterpolatingFIR blocks for the regular and interpolating filters.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_icic.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_firi.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
Multiple sets of coefficients require storage in memory so that the design can switch
easily from one set, or bank, of coefficients in use to another in a single clock cycle.
You specify the coefficient array as a matrix rather than a vector—(bank rows) by
(number of coefficient columns).
The addressing scheme has address offsets of base address + (bank number *
number of coefficients for each bank).
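As a rough illustration of this addressing scheme, the following Python sketch computes where a coefficient would sit. The function name, the example base address, and the per-tap offset within a bank are assumptions made for illustration only; they are not part of the DSP Builder interface.

```python
def coefficient_address(base_address, bank, coeffs_per_bank, tap=0):
    """Address of coefficient `tap` in `bank`, following
    base address + (bank number * number of coefficients for each bank).
    The `tap` offset inside the bank is an illustrative assumption."""
    if not 0 <= tap < coeffs_per_bank:
        raise ValueError("tap index lies outside the bank")
    return base_address + bank * coeffs_per_bank + tap


# Example: 4 banks of 8 coefficients at a hypothetical base address of 0x100.
bank_starts = [coefficient_address(0x100, bank, 8) for bank in range(4)]
print([hex(address) for address in bank_starts])
```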
If the number of rows is greater than one, DSP Builder creates a bank select input
port on the FIR filter. In a design, you can drive this input from either data or bus
interface blocks, allowing either direct or bus control. The data type is unsigned
integer of width ceil(log2(number of banks)).
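The width rule can be checked with a small Python sketch. The function name is hypothetical, and returning a width of 0 for a single bank is an assumption that mirrors the statement that the port only exists when there is more than one row.

```python
def bank_select_width(num_banks):
    """Bits needed for the unsigned bank select: ceil(log2(number of banks)).

    (num_banks - 1).bit_length() equals ceil(log2(num_banks)) for any
    num_banks >= 1 and avoids floating-point rounding.
    """
    if num_banks < 1:
        raise ValueError("at least one coefficient bank is required")
    return (num_banks - 1).bit_length()


# A coefficient matrix with 5 rows (banks) and 8 columns (coefficients).
coefficients = [[0] * 8 for _ in range(5)]
print(bank_select_width(len(coefficients)))
```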
The bank select is a single signal. For example, for a FIR filter with four input channels
over two timeslots:
<0><1>
<2><3>
<0><1>
Here the design receives more than one channel at a time, but can only choose a
single bank of coefficients. Channels 0 and 2 use one set of coefficients and channels
1 and 3 another. Channel 0 cannot use a different set of coefficients to channel 2 in
the same filter.
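The constraint can be sketched in Python as follows. The sketch assumes that channels 0 and 2 arrive together in the first timeslot and channels 1 and 3 in the second, which is the arrangement the paragraph above implies; the function names and the bank assignment are illustrative.

```python
def timeslot_of(channel, num_timeslots=2):
    """Timeslot in which a channel arrives (assumed layout: channels 0 and 2
    share timeslot 0, channels 1 and 3 share timeslot 1)."""
    return channel % num_timeslots


def bank_for_channel(channel, bank_per_timeslot):
    """The bank select is a single signal per timeslot, so every channel
    that arrives in that timeslot uses the same bank."""
    return bank_per_timeslot[timeslot_of(channel)]


bank_per_timeslot = [0, 1]  # hypothetical choice: bank 0, then bank 1
banks = {ch: bank_for_channel(ch, bank_per_timeslot) for ch in range(4)}
print(banks)
```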
For multiple coefficient banks, you enter an array of coefficients sets, rather than a
single coefficient set. For example, for a MATLAB array of 1 row and 8 columns [1 x
8], enter:
Therefore, the number of rows determines the number of banks, and you do not need
to specify the number of banks separately. If the number of banks is greater than 1,
the block has an additional bank select input.
Write to the bus interface using the BusStimulus block with a sample rate
proportionate to the bus clock. Generally, DSP Builder does not guarantee bus
interface transactions to be cycle accurate in Simulink simulations. However, in this
design example, DSP Builder updates the coefficient bank while it is not in use.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_fir_rrc.m script.
The FilterSystem subsystem includes the Device and Decimating FIR blocks.
Note: This design example uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_firs.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
The input sample rate is six times the clock rate. The filter decimates the input by
two, to three times the clock rate, which is visible in the vector input and output
data connections. The filter receives six samples in parallel at the input and
outputs three samples each cycle.
The input sample rate is two times the clock rate. The filter upconverts the input
sample rate to three times the clock rate, which is visible in the vector input and
output data connections. The filter receives two samples in parallel at the input and
outputs three samples each cycle.
The input sample rate is twice the clock rate, and the filter interpolates it by three
to six times the clock rate, which is visible in the vector input and output data
connections. The filter receives two samples in parallel at the input and outputs six
samples each cycle.
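The three super-sample cases follow the same arithmetic: the number of parallel output samples per clock cycle is the number of parallel input samples scaled by the rate change. The short Python sketch below works through the numbers. The function name is hypothetical, and the 3/2 ratio for the fractional filter is inferred from the stated input and output rates rather than quoted from the design.

```python
from fractions import Fraction


def samples_out_per_cycle(samples_in_per_cycle, interpolation=1, decimation=1):
    """Parallel output samples per clock for a given rate change."""
    return Fraction(samples_in_per_cycle) * Fraction(interpolation, decimation)


examples = {
    "decimating": samples_out_per_cycle(6, decimation=2),
    "fractional": samples_out_per_cycle(2, interpolation=3, decimation=2),
    "interpolating": samples_out_per_cycle(2, interpolation=3),
}
for name, rate in examples.items():
    print(name, rate)
```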
Note: This design example uses the Simulink Signal Processing Blockset.
You can control the rate change with a register field, which is part of the control
interface. The register field controls the generation of a valid signal that feeds into the
differentiators.
The design example also contains a gain compensation block that compensates for the
rate change dependent gain of the CIC. It shifts the input up so that the MSB at the
output is always at the same position, regardless of the rate change that you select.
The associated setup file contains parameters for the minimum and maximum
decimation rate, and calculates the required internal data widths and the scaling
number. To change the decimation factor for simulation, adjust variable CicDecRate
to the desired current decimation rate.
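One common way to size such a compensation is sketched below in Python. It relies on the standard CIC result that an N-stage decimator with rate R and differential delay M has a gain of (R × M)^N; the stage count, the rates, and the function names are hypothetical, and the setup file in the design example may calculate its scaling differently.

```python
import math


def cic_bit_growth(rate, stages, diff_delay=1):
    """Bit growth of a CIC decimator: ceil(stages * log2(rate * diff_delay))."""
    return math.ceil(stages * math.log2(rate * diff_delay))


def input_shift(rate, max_rate, stages, diff_delay=1):
    """Left shift applied to the input so that the output MSB sits where it
    would for the maximum decimation rate (illustrative assumption)."""
    return (cic_bit_growth(max_rate, stages, diff_delay)
            - cic_bit_growth(rate, stages, diff_delay))


# Hypothetical 3-stage CIC with a maximum decimation rate of 32.
for rate in (8, 16, 32):
    print(rate, input_shift(rate, max_rate=32, stages=3))
```

A pure shift compensates only to the nearest power of two, so rates that are not powers of two leave a residual gain of less than 2.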
Note: Intel has not tested this design on hardware and Intel does not provide a model of a
motor.
Functional Description
An encoder measures the rotor position in the motor, which the FPGA then reads. An
analog-to-digital converter (ADC) measures current feedback, which the FPGA then
reads.
[Figure: block diagram. Labels: SOPC Builder; Position, Speed, and Current Control for AC Motors Example Design in DSP Builder; Nios II Processor; Ethernet MAC; Industrial Ethernet PHY; IGBT Control Interface; Power Stage; AC Motor; ADC; ADC Interface; Position Encoder; Encoder Interface]
Each of the FOC, speed, and position feedback loops uses a simple PI controller to
reduce the steady-state error to zero. In a real-world PI controller, you may also need
to consider integrator windup and tune the PI gains appropriately. The feedback loops
for the integral portion of the PI controllers are internal to the design.
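For readers unfamiliar with integrator windup, the following Python sketch shows a PI controller whose integrator is clamped to the output limits. This is one simple anti-windup strategy chosen for illustration. The class name, gains, and limits are hypothetical, and nothing here is taken from the design example itself.

```python
class PIController:
    """Discrete PI controller with a clamped integrator (simple anti-windup)."""

    def __init__(self, kp, ki, out_min, out_max):
        self.kp = kp
        self.ki = ki
        self.out_min = out_min
        self.out_max = out_max
        self.integral = 0.0  # state of the internal integral feedback loop

    def _clamp(self, value):
        return min(max(value, self.out_min), self.out_max)

    def update(self, error, dt):
        # Clamping the integrator keeps it from growing while the output is
        # saturated, so it can recover quickly when the error changes sign.
        self.integral = self._clamp(self.integral + self.ki * error * dt)
        return self._clamp(self.kp * error + self.integral)
```

With `PIController(0.5, 1.0, -5.0, 5.0)`, a sustained error of 10 holds the output at the upper limit while the integrator stays at 5.0 instead of growing without bound.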
The example assumes you sample the inputs at a rate of 100 kHz and the FPGA clock
rate is 100 MHz (suitable for Cyclone IV devices). ALU folding reduces the resource
usage by sharing operators such as adders, multipliers, and cosine blocks. The folding
factor is set to 100 to allow each operator to be timeshared up to 100 times, which
gives an input sample rate of 1 Msps. Because the real input sample rate is 100 ksps,
only one out of every ten input timeslots is used. DSP Builder identifies the used
timeslots when valid_in is 1. Use valid_in to enable the latcher in the PI controller,
which stores data for use in the next valid timeslot. The valid_out signal indicates
when the ChannelOut block has valid output data. You can calculate nine additional
channels on the same design without incurring extra latency (or extra FPGA resources).
Adjust the folding factor to see the effect it has on hardware resources and latency.
To adjust it, either change the Sample rate (MHz) parameter in the ChannelIn and
ChannelOut blocks of the design directly, or change the FoldingFactor parameter in
the setup script. For example, a clock frequency of 100 MHz and a sample rate of
10 MHz give a folding factor of 10. Disabling folding, or setting the factor to 1,
results in no resource sharing and minimal latency. Generally, you should not set the
folding factor greater than the number of shareable operators. For example, for 24
adders and 50 multipliers, use a maximum folding factor of 50.
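The folding arithmetic in this section can be reproduced with a few lines of Python. The function names are hypothetical; the figures are the ones quoted in the text.

```python
def folding_factor(clock_hz, sample_rate_hz):
    """Folding factor implied by a clock rate and a sample rate."""
    return clock_hz // sample_rate_hz


def used_timeslot_interval(clock_hz, folding, input_rate_hz):
    """With folding, the design accepts clock_hz / folding samples per second;
    a slower real input uses only one timeslot in every `interval`."""
    return (clock_hz // folding) // input_rate_hz


def max_recommended_folding(operator_counts):
    """Rule of thumb from the text: do not fold beyond the largest number of
    shareable operators of any one kind."""
    return max(operator_counts)


print(folding_factor(100_000_000, 10_000_000))            # 100 MHz clock, 10 MHz samples
print(used_timeslot_interval(100_000_000, 100, 100_000))  # folding 100, 100 ksps input
print(max_recommended_folding([24, 50]))                  # 24 adders, 50 multipliers
```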
Note: The testbench does not support simulations if you adjust the folding factor.
The control algorithm, with the FOC, position, and speed control loops, varies the
desired position across time. The three control loops are parameterized with minimum
and maximum limits, and PI values. These values are not optimized and are for
demonstration only.
Resource Usage
Table 16. Position, Speed, and Current Control for AC Motors Design Example Resource Usage

Folding Factor   Add and Sub Blocks   Mult Blocks   Cos Blocks   Latency
No folding       22                   22            4            170
>22              1                    1             1            279
Hardware Generation
When hardware generation is disabled, the Simulink system simulates the design at
the external sample rate of 100 kHz, so that it outputs a new value every 10 µs.
When hardware generation is enabled, the design simulates at the FPGA clock rate
(100 MHz), which represents real-life latency clock delays, but it still outputs a new
value only at the 100 kHz rate. This mode slows the system simulation speed greatly
because the model is evaluated 1,000 times for every output. The setup script for the
design example automatically detects whether hardware generation is enabled and sets
the sample rates accordingly. The example is configured with hardware generation
disabled, which allows fast simulations. When you enable hardware generation, set a
very small simulation time (for example, 0.0001 s) because simulation may be very slow.
6.5.2. Position, Speed, and Current Control for AC Motors (with ALU
Folding)
The position, speed, and current control for AC motors (with ALU folding) design
example is a FOC algorithm for AC motors, which is identical to the position, speed,
and current control for AC motors design example. However, this design example uses
ALU folding.
The design example targets a Cyclone V device (speed grade 8). Cyclone V devices
have distributed memory (MLABs). ALU folding uses many distributed memory
components. ALU folding performs better in devices that have distributed memories,
rather than devices with larger block memories.
dspb_psc_ctrl.SampleRateHz = 10000    Sample rate. The default is 10000, which is a 10 kHz sample rate.
dspb_psc_ctrl.ClockRate = 100         FPGA clock frequency. The default is 100, which is a 100 MHz clock.
When you run this design example without folding, the DSP Builder system operates
at the same 10 kHz sample rate. Therefore, the system calculates a new packet of
data for every Simulink sample. Also, the sample times of the testbench are the same
as the sample times for the DSP Builder system.
The Rate Transition blocks translate between the Simulink testbench and the DSP
Builder system. These blocks allow Simulink to manage the different sample times
that the DSP Builder system requires. You need not modify the design example when
you run designs with or without folding.
The Rate Transition blocks produce Simulink samples with a sample time of
dspb_psc_ctrl.SampleTime for the testbench and
dspb_psc_ctrl.DSPBASampleTime for the DSP Builder system. The samples are in
the stimuli system, within the dummy motor. To hold the data consistent at the inputs
to the Rate Transition blocks for the entire length of the output sample
(dspb_psc_ctrl.SampleTime), turn on Register Outputs.
The data valid signal consists of a pulse, one Simulink sample wide, that signifies the
beginning of a data packet followed by zero values until the next data sample, as
required by ALU folding. The design example sets the period of this pulsing data valid
signal to the number of Simulink samples for the DSP Builder system (at
dspb_psc_ctrl.DSPBASampleTime) between data packets. This value is
dspb_psc_ctrl.SampleTime/dspb_psc_ctrl.DSPBASampleTime.
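The shape of this data valid signal can be sketched in Python. The 10 kHz sample rate comes from the default shown earlier; the DSP Builder sample time of 10 µs and the function name are hypothetical values chosen only to make the example concrete.

```python
def data_valid_pattern(sample_time, dspba_sample_time, num_packets):
    """Return (period, pattern): one '1' at the start of each data packet,
    followed by zeros until the next packet begins."""
    period = round(sample_time / dspba_sample_time)
    pattern = []
    for _ in range(num_packets):
        pattern.extend([1] + [0] * (period - 1))
    return period, pattern


# 10 kHz testbench sample rate and a hypothetical 10 us DSP Builder sample time.
period, pattern = data_valid_pattern(1 / 10000, 1e-5, num_packets=3)
print(period, pattern)
```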
The verification script within ALU folding uses the To Workspace blocks. The
verification script searches for To Workspace blocks on the output of systems to fold.
The script uses these blocks to record the outputs from the design example both with
and without folding. The script compares the results with respect to valid outputs. To
run the verification script, enter the following command at the MATLAB prompt:
Folder.Testing.RunTest('psc_ctrl_alu');
The direct current component (0 degrees) is set to zero. The algorithm involves the
following steps:
• Converting the 3-phase feedback current inputs and the rotor position from the
encoder into quadrature and direct current components with the Clarke and Park
transforms.
• Using these current components as the inputs to two proportional and integral (PI)
controllers running in parallel to control the direct current to zero and the
quadrature current to the desired torque.
• Converting the direct and quadrature current outputs from the PI controllers back
to 3-phase currents with inverse Clarke and Park transforms.
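The transforms named in these steps can be sketched in Python as follows. The sketch uses the amplitude-invariant form of the Clarke transform and assumes balanced currents (a + b + c = 0); these conventions and the function names are assumptions, and the design example may scale or order the components differently.

```python
import math


def clarke(a, b, c):
    """3-phase currents to stationary alpha/beta (amplitude-invariant form)."""
    alpha = (2.0 / 3.0) * (a - 0.5 * b - 0.5 * c)
    beta = (b - c) / math.sqrt(3.0)
    return alpha, beta


def park(alpha, beta, theta):
    """Stationary alpha/beta to rotating direct/quadrature at rotor angle theta."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q


def inverse_park(d, q, theta):
    """Rotating direct/quadrature back to stationary alpha/beta."""
    alpha = d * math.cos(theta) - q * math.sin(theta)
    beta = d * math.sin(theta) + q * math.cos(theta)
    return alpha, beta


def inverse_clarke(alpha, beta):
    """Stationary alpha/beta back to 3-phase quantities."""
    a = alpha
    b = -0.5 * alpha + (math.sqrt(3.0) / 2.0) * beta
    c = -0.5 * alpha - (math.sqrt(3.0) / 2.0) * beta
    return a, b, c
```

In a complete loop, the two PI controllers would sit between `park` and `inverse_park`, driving the direct component toward zero and the quadrature component toward the torque demand.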
For more information about fine Doppler estimators, refer to Fundamentals of Radar
Signal Processing by Mark A. Richards, McGraw-Hill, ISBN 0-07-144474-2, ch. 5.3.4.
A complex number C is in the Mandelbrot set if for the following equation the value
remains finite when repeatedly iterated:
$$z_{n+1} = z_n^2 + C$$
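A software reference for this iteration fits in a few lines of Python. The iteration limit is an arbitrary assumption, and the bailout radius of 2 relies on the standard result that the sequence diverges once its magnitude exceeds 2; the function name is hypothetical.

```python
def escape_iterations(c, max_iter=100, bailout=2.0):
    """Number of square-and-add iterations before |z| exceeds the bailout
    radius, or max_iter if the value stays bounded for the whole run."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > bailout:
            return n
        z = z * z + c
    return max_iter


print(escape_iterations(0))   # stays at 0, treated as inside the set
print(escape_iterations(-1))  # cycles between 0 and -1, treated as inside
print(escape_iterations(1))   # 0, 1, 2, 5, ... escapes quickly
```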
The system takes longer to perform floating-point calculations than for the
corresponding fixed-point calculations. You cannot wait around for partial results to be
ready, if you want to achieve maximum efficiency. Instead, you must ensure your
algorithm fully uses the floating-point calculation engines. The design contains two
floating-point math subsystems: one for scaling and offsetting pixel indices to give a
point in the complex plane; the other to perform the main square-and-add iteration
operation.
For this design example, the total latency is approximately 19 clock cycles, depending
on target device and clock speed. The latency is not excessive; but long enough that it
is inefficient to wait for partial results.
FIFO buffers control the circulation of data through the iterative process. The FIFO
buffers ensure that if a partial result is available for a further iteration in the
$z_{n+1} = z_n^2 + C$ progression, the design works on that point.
Otherwise, the design starts a new point (new value of C). Thus, the design maintains
a full flow of data through the floating-point arithmetic. This main iteration loop can
exert back pressure on the new point calculation engine. If the design does not read
new points off the command queue FIFO buffers quickly enough, such that they fill up,
the loop iteration stalls. The design does not explicitly signal the calculation of each
point when it is required (and thus avoids waiting through the latency cycles before you
can use it). Nor does the design attempt to calculate this latency exactly in clock
cycles and issue generate-point commands that exact number of clock cycles before you
need them, because you would have to change those values each time you retarget a
device or change the target clock rate. Instead, the design calculates the points quickly
from the start and catches them in a FIFO buffer. If the FIFO buffer starts to get full
(a sufficient number of cycles ahead of full), the design stops the calculation upstream
without loss of data. This self-regulating flow mitigates latency while remaining flexible.
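The following Python sketch models this self-regulating flow at cycle level. It is a behavioral illustration only: the FIFO depth, pipeline latency, consumer rate, and the rule "stop issuing once the fill level is within one pipeline latency of full" are assumptions chosen to show why no data is lost.

```python
from collections import deque


def simulate(depth, latency, cycles, consume_every):
    """Issue points into a pipeline of `latency` cycles that feeds a FIFO.
    Returns the highest FIFO fill level seen and the points received."""
    pipeline = deque([None] * latency)  # models the calculation latency
    fifo = deque()
    threshold = depth - latency         # stop issuing this far ahead of full
    next_point, max_fill, received = 0, 0, []
    for cycle in range(cycles):
        # The consumer reads one point every `consume_every` cycles.
        if fifo and cycle % consume_every == 0:
            received.append(fifo.popleft())
        # A result leaving the pipeline this cycle lands in the FIFO.
        done = pipeline.popleft()
        if done is not None:
            fifo.append(done)
        # Upstream issues a new point only while the FIFO is below threshold.
        if len(fifo) < threshold:
            pipeline.append(next_point)
            next_point += 1
        else:
            pipeline.append(None)
        max_fill = max(max_fill, len(fifo))
    return max_fill, received


max_fill, received = simulate(depth=16, latency=5, cycles=200, consume_every=3)
print(max_fill, len(received))
```

Because the points already in flight always fit in the space left above the threshold, the FIFO never overflows, and the threshold does not need retuning to an exact latency figure as long as it leaves at least that much headroom.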
Avoid inefficiencies by designing the algorithm implementation around the latency and
availability of partial results. Data dependencies in processing can stall processing.
The design example uses the FinishedThisPoint signal as the valid signal. Although
the system constantly produces data on the output, it marks the data as valid only
when the design finishes a point. Downstream components can then process only valid
data, just as the enabled subsystem in the testbench captures and plots the valid
points.
In both feedback loops, you must provide sufficient delay for the scheduler to
redistribute as pipelining. In feed-forward paths you can add pipelining without
changing the algorithm—DSP Builder changes only the timing of the algorithm. But in
feedback loops, inserting a delay can alter the meaning of an algorithm. For example,
The FIFO buffers operate in show-ahead mode—they display the next value to be
read. The read signal is a read acknowledgement, which reads the output value,
discards it, and shows the next value. The design uses multiple FIFO buffers with the
same control signal, which are full and give a valid output at the same time. The
design only needs the output control signals from one of the FIFO buffers and can
ignore the corresponding signals from the other FIFO buffers. As floating-point
simulation is not bit accurate to the hardware, some points in the complex plane take
fewer or more iterations to complete in hardware compared to the Simulink
simulation. The results, when you are finished with a particular point, may come out in
a different order. You must build a testbench mechanism that is robust to this feature.
Use the testbench override feature in the Run All Testbenches block:
• Set the condition on mismatches to Warning
• Use the Run All Testbenches block to set an import variable, which brings the
ModelSim results back into MATLAB and a custom verification function that sets
the pass or fail criteria.
6.6.11. Normalizer
The normalizer design example demonstrates the ilogb block and the multifunction
ldexp block. The parameters allow you to select the ilogb or ldexp. The design
example implements a simple floating-point normalization. The magnitude of the
output is always in the range 0.5 to 1.0, irrespective of the (non-zero) input.
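The normalization can be mimicked in Python with the standard `math.frexp` and `math.ldexp` functions. Python's `math` module has no `ilogb`, so the helper below derives it from `frexp`; how the design example actually combines its ilogb and ldexp blocks is an assumption here.

```python
import math


def ilogb(x):
    """Unbiased binary exponent of x, floor(log2(|x|)), for finite non-zero x."""
    return math.frexp(x)[1] - 1


def normalize(x):
    """Scale x by a power of two so that 0.5 <= |result| < 1.0."""
    if x == 0.0:
        raise ValueError("zero cannot be normalized")
    return math.ldexp(x, -(ilogb(x) + 1))


for value in (8.0, -3.0, 0.1, 1e-20, 123456.789):
    print(value, normalize(value))
```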
A matrix multiplication must compute a row-by-column dot product for each output
element. For 8×8 matrices A and B:
$$(AB)_{ij} = \sum_{k=1}^{8} A_{ik} B_{kj}$$
You may accumulate the adjacent partial results, or build adder trees, without
considering any latency. However, to implement with a smaller dot product, consider
resource usage folding, which uses a smaller number of multipliers rather than
performing everything in parallel. Also split up the loop over k into smaller chunks.
Then reorder the calculations to avoid adjacent accumulations.
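The reordering can be sketched in Python. Here the loop over k is split into chunks, and for each chunk the code sweeps every output element before moving on, so two accumulations into the same element are separated by a full pass over the matrix. The chunk size and function names are illustrative assumptions.

```python
def dot(a, b):
    """Small dot product, standing in for the reduced-size hardware unit."""
    return sum(x * y for x, y in zip(a, b))


def matmul_chunked(A, B, chunk=4):
    """Square matrix product with the k loop split into chunks and the
    accumulations reordered so the same output is never updated back to back."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for k0 in range(0, n, chunk):          # one partial dot product per chunk
        k1 = min(k0 + chunk, n)
        for i in range(n):
            for j in range(n):
                column = [B[k][j] for k in range(k0, k1)]
                C[i][j] += dot(A[i][k0:k1], column)
    return C


A = [[i + j for j in range(8)] for i in range(8)]
B = [[i - j for j in range(8)] for i in range(8)]
C = matmul_chunked(A, B, chunk=4)
print(C[0][:4])
```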
A better implementation is to use FIFO buffers to provide self-timed control. New data
is accumulated when both FIFO buffers have data. This implementation has the
following advantages:
• Runs as fast as possible
• Is not sensitive to latency of dot product on devices or fMAX
• Is not sensitive to matrix size (hardware just stalls for small N)
• Can respond to back pressure, which stops the FIFO buffers from emptying and
feeds their full status back to the control logic
reception, information from different elements is combined such that the expected
pattern of radiation is preferentially observed. A number of different algorithms exist.
An efficient scheme combines multiple paths constructively.
The simulation calculates the phases in MATLAB code (as a reference), simulates the
beamformer 2D design to calculate the phases in DSP Builder Advanced Blockset,
compares the reference to the simulation results and plots the beam pattern.
The design example uses vectors of single precision floating-point numbers, with
state-machine control from two for loops.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.
In this design example, the top level of the FPGA device (marked by the Device
block) and the synthesizable KroneckerSubsystem subsystem (marked by the
SynthesisInfo block) are at different hierarchy levels.
The FirChip subsystem includes the Device block and a lower-level primitive FIR
subsystem.
The primitive FIR subsystem includes ChannelIn, ChannelOut, FIFO, Not, And,
Mux, SampleDelay, Const, Mult, Add, and SynthesisInfo blocks.
In this design example, the top level of the FPGA device (marked by the Device
block) and the synthesizable Primitive FIR subsystem (marked by the SynthesisInfo
block) are at different hierarchy levels.
This design example shows how back pressure from a downstream block can halt
upstream processing. This design example provides three FIR filters. A FIFO buffer
follows each FIR filter that can buffer any data that is flowing through the FIFO buffer.
If the FIFO buffer becomes half full, the design asserts the ready signal back to the
upstream block. This signal prevents any new input (as flagged by valid) entering the
FIR block. The FIFO buffers always show the next data if it is available and the valid
signal is asserted high. You must AND this FIFO valid signal with the ready signal to
consume the data at the head of the FIFO buffer. If the AND result is high, you can
consume data because it is available and you are ready for it.
You can chain several blocks together in this way, and no ready signal has to feed
back further than one block, which allows you to use modular design techniques with
local control.
The delay in the feedback loop represents the lumped delay that spreads throughout
the FIR filter block. The delay must be at least as big as the delay through the FIR
filter. This delay is not critical. Experiment with some values to find the right one. The
FIFO buffer must be able to hold at least this much data after it asserts full. The full
threshold must be at least this delay amount below the size of the FIFO buffer (64 –
32 in this design example).
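The sizing rule and the consume condition can be written down in a short Python sketch. It reads "64 – 32" as a FIFO depth of 64 and a lumped loop delay of 32, which gives a full threshold at half full; that reading and the function names are assumptions.

```python
def full_threshold(fifo_depth, loop_delay):
    """Highest fill level at which 'full' may be asserted while still leaving
    room for the data already in flight through the FIR filter."""
    if loop_delay >= fifo_depth:
        raise ValueError("FIFO is too small to absorb the in-flight data")
    return fifo_depth - loop_delay


def can_consume(fifo_valid, ready):
    """Data at the head of the FIFO is consumed only when it is available
    (valid) and the downstream block is ready for it."""
    return fifo_valid and ready


print(full_threshold(64, 32))
print(can_consume(True, True), can_consume(True, False))
```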
The final block uses an external ready signal that comes from a downstream block in
the system.
The FirChip subsystem includes the Device block and a lower-level Primitive FIR
subsystem.
In this design example, the top level of the FPGA device (marked by the Device
block) and the synthesizable primitive FIR subsystem (marked by the SynthesisInfo
block) are at different hierarchy levels.
The design example has a sequence of three FIR filters that stall when the valid signal
is low, preventing invalid data polluting the datapath. The design example has a
regular filter structure, but with a delay line implemented in single-cycle latches—
effectively an enabled delay line.
You need not enable everything in the filter (multipliers, adders, and so on), just the
blocks with state (the registers). Then observe the output valid signal, which DSP
Builder pipelines with the logic, and observe the valid output data only.
You can also use vectors to implement the constant multipliers and adder tree, which
also speeds up simulation.
You can improve the design example further by using the TappedDelayLine block.
The FirChip subsystem includes the Device block and a lower-level Primitive FIR
subsystem.
In this design example, the top level of the FPGA device (marked by the Device
block) and the synthesizable primitive FIR subsystem (marked by the SynthesisInfo
block) are at different hierarchy levels.
The design example has a sequence of three FIR filters that stall when the valid signal
is low, preventing invalid data polluting the datapath. The design example has a
regular filter structure, but with a delay line implemented in single-cycle latches—
effectively an enabled delay line.
You need not enable everything in the filter (multipliers, adders, and so on), just the
blocks with state (the registers). Then observe the output valid signal, which DSP
Builder pipelines with the logic, and observe the valid output data only.
You can also use vectors to implement the constant multipliers and adder tree, which
also speeds up simulation. You can improve the design example further with the
TappedDelayLine block.
The token-passing structure is typical for a nested-loop structure. The bs port of the
innermost loop (ForLoopB) connects to the bd port of the same loop, so that the next
loop iteration of this loop starts immediately after the previous iteration.
The bs port of the outer loop (ForLoopA) connects to the ls port of the inner loop;
the ld port of the inner loop loops back to the bd port of the outer loop. Each iteration
of the outer loop runs a full activation of the inner loop before continuing on to the
next iteration.
The ls port of the outer loop connects to external logic and the ld port of the outer
loop is unconnected, which is typical of applications where the control token is
generated afresh for each activation of the outermost loop.
The initialization, step, and limit values do not have to be constants. By using the
count value from an outer loop as the limit of an inner loop, the counter effectively
walks through a triangular set of indices.
The token-passing structure for this loop is identical to that for the rectangular loop,
except for the parameterization of the loops.
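In software terms, the difference between the rectangular and triangular nests is only the inner loop's limit, as the Python sketch below shows. The zero-based counters and exclusive limits are assumptions made for illustration and may not match the initialization, step, and limit settings used in the design examples.

```python
def rectangular(limit_a, limit_b):
    """Index pairs visited when the inner limit is a constant."""
    return [(a, b) for a in range(limit_a) for b in range(limit_b)]


def triangular(limit_a):
    """Index pairs visited when the inner limit is the outer loop's count."""
    return [(a, b) for a in range(limit_a) for b in range(a)]


print(rectangular(3, 2))
print(triangular(4))
```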
The digital upconverter includes: input memory, upconverter, FIR filter, scaler, mixer
and digital predistortion (DPD).
hdl_import_calc_fir_coefs.m A script to generate the FIR coefficients using MATLAB's cfirpm function. DSP Builder
prints the coefficients to MATLAB's Command Window and you can copy and paste
them into coefficients.vhd.
calc_dpd_coefs.m A script to generate the DPD coefficients using a simple polynomial model of a power
amplifier. DSP Builder prints the coefficients to MATLAB's Command Window and you
can copy and paste them into lut_dpd.vhd.
VHDL Components
The design example includes a complex FIR filter in VHDL optimized for Intel Stratix
10 devices. This FIR filter has one valid data sample every eight clock cycles.
See Designing Filters for High Performance.
Top-Level Design
The top-level design contains the device-level subsystem and five downsample and
spectrum analyzer blocks from MathWorks' DSP System Toolbox. These blocks show
the spectral output from the various stages of the up-conversion chain.
Digital Up Converter
This scheduled subsystem contains two SharedMem blocks, which contain the 20
MSPS baseband source: one for the real part of the signal and one for the imaginary
part. You can write to the blocks via the bus or use the preloaded tones.
Mixer
Scale
The scale scheduled subsystem scales the data so that it fits within the DPD's range of
operation by bit-shifting from the mixer's output. You can use the optional multiplier
for increasing the signal level if bit-shifting is insufficient.
FIR Coefficients
DPD
The file lut_dpd.vhd contains the DPD for this design example. The DPD consists of
an address generator that indexes a LUT. The output of the LUT is then multiplied with
the complex input data. The LUT contents are calculated in
hdl_import_calc_dpd_coefs.m. This script uses a simple, real-numbered, third-
order model of an amplifier to calculate predistortion coefficients. DSP Builder uses
these coefficients to calculate the LUT contents.
The first four waveforms are the real and imaginary input and output of the FIR. The FIR smooths the zero-padded signals.
The next four waveforms are the real and imaginary input and output of the DPD.
The two preloaded memory signals are clearly visible about 0, as are their four aliases because of the zero-
insert upsampling.
The aliased signals are attenuated by 40 dB, as expected from the analysis in calc_fir_coefs.m.
The mixed spectrum shows the baseband signal moving over to be centered on 16 MHz. This view shows the
Simulink clock rate of 1 Hz rather than the FPGA clock rate of 640 MHz, so 16 MHz becomes 25 mHz.
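The frequency scaling in this caption is a division by the FPGA clock rate. The short Python check below reproduces the arithmetic; the function name `normalized_frequency` is a hypothetical helper, not part of the design example.

```python
from fractions import Fraction


def normalized_frequency(signal_hz, clock_hz):
    """Frequency as seen when the clock rate is normalized to 1 Hz."""
    return Fraction(signal_hz) / Fraction(clock_hz)


f_norm = normalized_frequency(16_000_000, 640_000_000)  # in Hz at a 1 Hz clock
f_norm_millihertz = float(f_norm * 1000)                # 1 Hz = 1000 mHz
print(f_norm, f_norm_millihertz)
```

The result is 1/40 Hz, which is 25 mHz (millihertz, not megahertz).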
Scaled looks identical to mixed, except that the signal amplitude is much greater.
The post-DPD output signal is a noisier version of the scaled signal. Observe the two third-order harmonics in
the pass-band.
Related Information
Designing Filters for High Performance
The design example has two HDL entities: the DPD (lut_dpd.vhd) and the FIR
(complex_fir.vhd).
In DSP Builder cosimulation, each HDL Import block represents an HDL instance. You
must instantiate both of these entities in a top-level VHDL file. For this design
example, Intel provides top.vhd.
In addition, the FIR filter uses a signed data type with a generic for the data width.
When DSP Builder instantiates the FIR filter, it uses its own paradigm (i.e.
std_logic_vector and no generics). This design example adds a wrapper entity:
complex_fir_wrapper.vhd. This entity instantiates complex_fir, including setting
the generic to the appropriate value, and converts signed to std_logic_vector.
6. Press the play button or advance through the simulation a cycle at a time.
7. To verify the HDL import with the ModelSim simulator, select DSP Builder ➤ Run
ModelSim ➤ Device.
The cosimulation turns any non-high state (e.g. U or X) to a zero.
8. To compile the design in Intel Quartus Prime, select DSP Builder ➤ Run
Quartus Prime Software.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.
Decimating CIC and FIR filters down convert eight complex carriers (16 real channels)
from 61.44 MHz. The total decimation rate is 64. A real mixer and NCO isolate the
eight carriers. The testbench isolates two channels of data from the TDM signals using
a channel viewer.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output bus. An
Edit Params block allows easy access to the setup variables in the
setup_demo_ddc.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
This design example shows an interpolating filter chain with interpolating CIC and FIR
filters that up convert eight complex channels (16 real channels). The total
interpolation rate is 50. DSP Builder integrates several Primitive subsystems into the
datapath. This design example shows how you can integrate IP blocks with Primitive
subsystems:
• The programmable Gain subsystem, at the start of the datapath, shows how you
can use processor-visible register blocks to control a datapath element.
• The Sync subsystem is a Primitive subsystem that shows how to manage two
data streams coming together and synchronizing. The design writes the data from
the NCOs to a memory with the channel as an address. The data stream uses its
channel signals to read out the NCO signals, which resynchronizes the data
correctly. Alternatively, you can simply delay the NCO value by the correct number
of cycles to ensure that the NCO and channel data arrive at the Mixer on the
same cycle.
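The memory-based resynchronization that the Sync subsystem demonstrates can be sketched behaviorally in Python. This is a simplified model under stated assumptions: the function name `resync_nco` and the tuple layout are hypothetical, and the model ignores cycle timing by assuming that all NCO writes for a frame finish before the data stream reads.

```python
def resync_nco(nco_stream, data_stream, num_channels):
    """Pair each data sample with the NCO value of the same channel.

    nco_stream:  iterable of (channel, nco_value), in NCO arrival order
    data_stream: iterable of (channel, sample), in datapath arrival order
    """
    memory = [None] * num_channels
    for channel, nco_value in nco_stream:
        memory[channel] = nco_value          # write side: channel is the address
    # Read side: the data stream's own channel signal addresses the memory.
    return [(channel, sample, memory[channel]) for channel, sample in data_stream]


nco = [(2, "n2"), (0, "n0"), (1, "n1"), (3, "n3")]    # NCO order differs
data = [(0, "d0"), (1, "d1"), (2, "d2"), (3, "d3")]
print(resync_nco(nco, data, num_channels=4))
```

Because the channel number is the address on both sides, the two streams line up regardless of the order or latency with which the NCO values arrive.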
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output bus. An
Edit Params block allows easy access to the setup variables in the
setup_demo_duc.m script.
The DUCChip subsystem includes a Device block and a lower level DUC16
subsystem.
It also includes lower level Gain, Sync, and CarrierSum subsystems, which use
other Interface and Primitive blocks including the AddSLoad, And, BitExtract,
ChannelIn, ChannelOut, CompareEquality, Const, SampleDelay, DualMem,
Mult, Mux, Not, Or, RegBit, RegField, and SynthesisInfo blocks.
Note: This design example uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output bus.
The DUCChip subsystem includes a Device block and a lower level DUC2Antenna
subsystem.
Note: This design example uses the Simulink Signal Processing Blockset.
Interpolating CIC and FIR filters up convert a single complex channel (2 real
channels). An NCO and Mixer subsystem combine the complex input channels into a
single output channel.
This design example shows how quick and easy it is to emulate the contents of an
existing datapath. A Mixer block implements the mixer in this design example because the
data rate is low enough to save resources by time-sharing hardware.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output bus. An
Edit Params block allows easy access to the setup variables in the
setup_demo_AD9856.m script.
The AD9856 subsystem includes a Device block and a lower level DUCIQ
subsystem.
Note: This design example uses the Simulink Signal Processing Blockset.
The CornerTurn block makes extensive use of Simulink Goto/From blocks to
reduce the wiring complexity. The top-level testbench includes Control and Signals
blocks. The IDCTChip subsystem includes the Device block and a lower level IDCT
subsystem. The IDCT subsystem includes lower level subsystems that it describes
with the ChannelIn, ChannelOut, Const, BitCombine, Shift, Mult, Add, Sub,
BitExtract, SampleDelay, OR Gate, Not, Sequence, and SynthesisInfo blocks.
This design example shows a complex loop with several subloops that it schedules and
pipelines without inserting registers. The design example spreads a lumped delay
around the circuit to satisfy timing while maintaining correctness. Processor-visible
registers control the thresholds and gains.
In complex algorithmic circuits, the zero-latency blocks make it easy to follow a data
value through the circuit and investigate the algorithm without offsetting all the
results by the pipelining delays.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.
The AGC_Chip subsystem includes the Device block, a RegField block and a lower
level AGC subsystem.
The one-input BitCombine block is a special case that concatenates all the
components of the input vector and produces one wide scalar output signal. You can
apply 1-bit reducing operators to vectors of Boolean signals. The BitCombine block
supports multiple input concatenation. When vectors of Boolean signals are input on
multiple ports, corresponding components from each vector are combined so that the
output is a vector of signals.
This block converts a scalar signal into a vector of Boolean signals. You use the
initialization parameter to arbitrarily order the components of the vector output by the
BitExtract block. If the input to a BitExtract block is a vector, different bits can be
extracted from each of the components. The output does not always have to be a
vector of Boolean signals. You may split a 16-bit wide signal into four components
each 4-bits wide.
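The 16-bit to four 4-bit split mentioned above can be illustrated with plain integer arithmetic. This Python sketch is only a behavioral analogy: the helper names `split_bits` and `combine_bits` are hypothetical, and the least-significant-field-first ordering is an assumption (on the block itself, the initialization parameter sets the ordering).

```python
def split_bits(value, total_width=16, field_width=4):
    """Split an unsigned value into equal-width fields, LSB field first."""
    if total_width % field_width != 0:
        raise ValueError("total_width must be a multiple of field_width")
    mask = (1 << field_width) - 1
    count = total_width // field_width
    return [(value >> (i * field_width)) & mask for i in range(count)]


def combine_bits(fields, field_width=4):
    """Concatenate fields (LSB field first) back into one wide value."""
    value = 0
    for i, field in enumerate(fields):
        value |= (field & ((1 << field_width) - 1)) << (i * field_width)
    return value


fields = split_bits(0xABCD)
print([hex(f) for f in fields], hex(combine_bits(fields)))
```

Splitting and then recombining returns the original value, which mirrors how the BitExtract and BitCombine operations only rearrange wires.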
The RGB data arrives as three parallel signals each clock cycle. The model file is
demo_csc.mdl.
Forward paths compensate for nonlinear power amplifiers by applying the inverse of
the distortion that the power amplifier generates, such that the pre-distortion and the
distortion of the power amplifier cancel each other out. The power amplifier's
nonlinearity may change over time, so such systems are typically adaptive.
This design example is based on "A robust digital baseband pre-distorter constructed
using memory polynomials," L. Ding, G. T. Zhou, D. R. Morgan, et al., IEEE
Transactions on Communications, vol. 52, no. 1, pp. 159-165, 2004.
This design example only implements the forward path, which is representative of
many systems where you implement the forward path in FPGAs, and the feedback
path on external processors. The design example sets the predistortion memory, Q, to
8; the highest nonlinearity order K is 5 in this design example. The file
setup_demo_dpd_fwdpath initializes the complex-valued coefficients, which are
stored in registers. During operation, the external processor continuously improves
and adapts these coefficients with a microcontroller interface.
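A memory-polynomial forward path of this kind can be sketched as a floating-point reference model in Python. The sketch assumes the commonly used form z(n) = Σ_k Σ_q a[k][q] · x(n − q) · |x(n − q)|^(k − 1), with k running from 1 to K and q over Q taps starting at zero. Whether the design example indexes its taps this way, and whether it uses every order, is not stated here, so treat the indexing and the function name `memory_polynomial` as assumptions.

```python
def memory_polynomial(x, coeffs):
    """Apply a memory-polynomial predistorter to a list of complex samples.

    coeffs[k - 1][q] weights x[n - q] * abs(x[n - q]) ** (k - 1).
    Samples before the start of x are treated as zero.
    """
    num_orders = len(coeffs)
    num_taps = len(coeffs[0])
    out = []
    for n in range(len(x)):
        acc = 0j
        for q in range(num_taps):
            if n - q < 0:
                continue
            sample = x[n - q]
            magnitude = abs(sample)
            for k in range(1, num_orders + 1):
                acc += coeffs[k - 1][q] * sample * magnitude ** (k - 1)
        out.append(acc)
    return out


K, Q = 5, 8                                   # values quoted in the text
coeffs = [[0j] * Q for _ in range(K)]
coeffs[0][0] = 1 + 0j                         # linear, memoryless pass-through
x = [1 + 1j, 0.5 - 0.25j, -0.75j]
print(memory_polynomial(x, coeffs))
```

With only the first-order, zero-delay coefficient set to one, the model passes the input through unchanged, which is a convenient sanity check before loading real coefficients.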
This design example shows that even for circuitry with tight feedback loops and 120-bit
adders, designs can achieve high data rates through the pipelining algorithms. The top-level
testbench includes Control, Signals, Run ModelSim, and Run Quartus Prime
blocks. The Chip subsystem includes the Device block and a lower level FibSystem
subsystem. The FibSystem subsystem includes ChannelIn, ChannelOut,
SampleDelay, Add, Mux, and SynthesisInfo blocks.
Note: In this design example, the top-level of the FPGA device (marked by the Device
block) and the synthesizable Primitive subsystem (marked by the SynthesisInfo
block) are at different hierarchy levels.
Folded designs repeatedly use a single dual sort stage. The throughput of the design is
limited by the number of channels, vector width, and data rate. The data passes
through the dual sort stage (vector width)/2 times. The vector sort design example
uses full throughput with (vector width)/2 dual sort stages in sequence.
The design example allows you to generate a valid signal. The design example only
generates output and can only accept input every N cycles, where N depends on the
number of stages, the data output format, and the target fMAX. The valid signal goes
high when the output is ready. You can use this output signal to trigger the next input,
for example, a FIFO buffer read for bursty data.
DSP Builder generates results using the same techniques as in the floating point
functions but at generally reduced resource usage, depending on data bit width.
Outputs are faithfully rounded. If the exact result is between two representable
numbers within the data format, DSP Builder uses either of them. In some instances
you see a difference in output result between simulation and hardware by one LSB. To
get bit-accurate results at the subsystem level, this example uses the Bit Exact
option on the SynthesisInfo block.
You can also specify the seed value for the random sequence using the seed_value
input. The reset input resets the sequence to the initial state defined by the
seed_value. The output is a 32-bit single-precision floating-point number.
An external input enables a counter that addresses a lookup-table (LUT) that contains
some text. The design example writes the result to a MATLAB array. You can examine
the contents with a char(message) command in the MATLAB command window.
This design example does not use any ChannelIn, ChannelOut, GPIn, or GPOut
blocks. The design example uses Simulink ports for simplicity although they prevent
the automatic testbench flow from working.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.
The Chip subsystem includes Device, Counter, Lut, and SynthesisInfo blocks.
Note: In this design example, the top-level of the FPGA device (marked by the Device
block) and the synthesizable Primitive subsystem (marked by the SynthesisInfo
block) are at the same level.
The testbench reloads the counter with new parameters every 64 cycles. A manual
switch allows you to control whether the counter is permanently enabled, or only
enabled on alternate cycles. You can view the signals input and output from the
counter with the provided scope.
You can use one of the following ways to specify the contents of the Lut block:
• Specify table contents as single row or column vector. The length of the 1D row or
column vector determines the number of addressable entries in the table. If DSP
Builder reads vector data from the table, all components of a given vector share
the same value.
• When a look-up table contains vector data, you can provide a matrix to specify the
table contents. The number of rows in the matrix determines the number of
addressable entries in the table. Each row specifies the vector contents of the
corresponding table entry. The number of columns must match the vector length,
otherwise DSP Builder issues an error.
Note: The default initialization of the LUT is a row vector round([0:255]/17). This vector
is inconsistent with the default for the DualMem block, which is a column vector
[zeros(16, 1)]. The latter form is consistent with the new matrix initialization form in
which the number of rows determines the addressable size.
You can initialize both the dual memory and LUT Primitive library blocks with matrix
data.
The number of rows in the 2D matrix that you provide for initialization determines the
addressable size of the dual memory. The number of columns must match the width of
the vector data. So the nth column specifies the contents of the nth dual memory.
Within each of these columns, the ith row specifies the contents at the (i − 1)th
address (the first row is address zero, second row address 1, and so on).
The exception for this row and column interpretation of the initialization matrix is for
1D data, where the initialization matrix consists of either a single column or single
row. In this case, the interpretation is flexible and maps the vector (row or column)
into the contents of each dual memory. As in the previous behavior, all dual memories
then have identical initial contents.
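The row and column interpretation can be made concrete with a small Python model. It is a sketch only: the function name `memory_contents` is hypothetical, a nested list stands in for the 2D initialization matrix, and a flat list stands in for the 1D row or column case.

```python
def memory_contents(init, vector_width):
    """Return contents[n][address] for each of the vector_width memories."""
    if not isinstance(init[0], (list, tuple)):
        # 1D data: every memory receives the same contents.
        return [list(init) for _ in range(vector_width)]
    for row in init:
        if len(row) != vector_width:
            raise ValueError("number of columns must match the vector width")
    # 2D data: row i holds address i - 1, column n holds memory n.
    return [[row[n] for row in init] for n in range(vector_width)]


matrix = [[1, 10],
          [2, 20],
          [3, 30]]          # 3 addressable entries, vector width 2
print(memory_contents(matrix, 2))
print(memory_contents([5, 6, 7], 2))
```

In this model the first row of the matrix lands at address zero of every memory, and the second column supplies the contents of the second memory.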
This design example has many feedback loops. The design example implements all the
pipelined delays in the circuit automatically. The multiple channels provide more
latency around the circuit to ensure a high clock frequency result. Lumped delays
allow you to easily parameterize the design example when changing the channel
counts. For example, masking the subsystem provides the benefits of a black-box IP
block but with visibility.
The top-level testbench includes Control and Signals blocks, plus a ChanView block
that deserializes the output buses.
The IIRChip subsystem includes the Device block and a masked IIRSubsystem
subsystem. The coefficients for the filter are set from [b, a] = ellip(2, 1, 10, 0.3); in
the callbacks for the masked subsystem. You can look under the mask to see the
implementation details of the IIRSubsystem subsystem which includes ChannelIn,
ChannelOut, SampleDelay, Const, Mult, Add, Sub, Convert, and SynthesisInfo
blocks.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks.
The first datapath reinterprets a single precision complex signal into raw 32-bit
components that separate into real and imaginary parts. A BitCombine block then
merges it into a 64-bit signal. The second datapath uses the BitExtract block to split
a 64-bit wide signal into a two-component vector of 32-bit signals. The
ReinterpretCast block then converts the raw bit pattern into single-precision IEEE
format. The HDL that the design synthesizes is simple wire connections, which
perform no computation.
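The two datapaths have a direct software analogy using Python's `struct` module, which reinterprets bytes without changing them. The helper names are hypothetical, and placing the real part in the low 32 bits is an assumption; the design example may order the halves differently.

```python
import struct


def float_to_bits(value):
    """Reinterpret a single-precision float as a raw 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float(bits):
    """Reinterpret a raw 32-bit unsigned integer as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def pack_complex(real, imag):
    """First datapath: reinterpret both parts, then combine into 64 bits."""
    return (float_to_bits(imag) << 32) | float_to_bits(real)


def unpack_complex(word):
    """Second datapath: split the 64-bit word, then reinterpret each half."""
    return bits_to_float(word & 0xFFFFFFFF), bits_to_float(word >> 32)


word = pack_complex(1.5, -2.25)
print(hex(word), unpack_complex(word))
```

No arithmetic is applied to the values at any point; only the interpretation of the bits changes, which is why the synthesized hardware reduces to wiring.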
In decimation mode, the design example accepts a new sample every clock cycle, and
produces a new result every two clock cycles. When interpolating, the design example
accepts a new input every other clock cycle, and produces a new result every clock
cycle. In both cases, the design example fully uses multipliers, making this structure
very efficient compared to parallel instantiations of interpolate and decimate filters, or
compared to a single rate filter with external interpolate and decimate stages.
The design example allows you to generate a valid signal. The design example only
generates output and can only accept input every N cycles, where N depends on the
number of stages, the data output format, and the target fMAX. The valid signal goes
high when the output is ready. You can use this output signal to trigger the next input,
for example, a FIFO buffer read for bursty data.
The Mode input can either rotate the input vector by a specified angle, or rotate the
input vector to the x-axis while recording the angle required to make that rotation.
You can experiment with different input sizes to control the precision of the CORDIC
output.
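The two modes correspond to what is usually called rotation mode and vectoring mode of the CORDIC algorithm. The floating-point Python sketch below shows the textbook iteration, not the block's fixed-point implementation; the function name, the mode strings `"rotate"` and `"vector"`, and the iteration count are assumptions.

```python
import math


def cordic(x, y, z, mode, iterations=32):
    """Textbook CORDIC in floating point.

    mode "rotate": rotate (x, y) by angle z; z is driven toward zero.
    mode "vector": rotate (x, y) onto the x-axis; z accumulates the angle.
    Converges for angles within roughly +/-1.74 radians (and x > 0 when vectoring).
    """
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i
        if mode == "rotate":
            d = 1.0 if z >= 0 else -1.0
        else:
            d = -1.0 if y >= 0 else 1.0
        x, y = x - d * y * step, y + d * x * step   # shift-and-add micro-rotation
        z -= d * math.atan(step)
        gain *= math.sqrt(1.0 + step * step)
    return x / gain, y / gain, z                    # remove the CORDIC gain


print(cordic(1.0, 0.0, 0.5, "rotate"))   # approximately (cos 0.5, sin 0.5, 0)
print(cordic(3.0, 4.0, 0.0, "vector"))   # approximately (5, 0, atan2(4, 3))
```

In a fixed-point implementation, the number of iterations and the input word length together bound the achievable precision.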
The SinCos and AGC subsystem includes ChannelIn, ChannelOut, CORDIC, and
SynthesisInfo blocks.
You can specify the seed value for the random sequence using the seed_value input.
The reset input resets the sequence to the initial state defined by the seed_value.
The output is a 32-bit random number, which can be interpreted as a random integer
sampled from the uniform distribution.
For sorting, the sortstages subsystem allows either a comparator and mux based
block, or one based on a minimum and a maximum block. The first is more efficient.
Both use the reconfigurable subsystem to choose between implementations using the
BlockChoice parameter.
The design repeatedly uses a dual sort stage in series. The data passes through the
dual sort stage (vector width)/2 times.
Folded designs repeatedly use a single dual sort stage. The throughput of the design is
limited by the number of channels, vector width, and data rate. The data passes
through the dual sort stage (vector width)/2 times. The vector sort design example
uses full throughput with (vector width)/2 dual sort stages in sequence.
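One way to picture (vector width)/2 passes through a dual sort stage is odd-even transposition sort, where each dual stage performs one even-pair and one odd-pair compare-exchange step. The Python sketch below rests on that assumption; the source does not state that the design example uses exactly this network, and the function names are hypothetical.

```python
def compare_exchange(values, start):
    """Compare-exchange adjacent pairs beginning at index start (0 or 1)."""
    out = list(values)
    for i in range(start, len(out) - 1, 2):
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def dual_sort_stage(values):
    """One dual stage: an even-pair step followed by an odd-pair step."""
    return compare_exchange(compare_exchange(values, 0), 1)


def vector_sort(values):
    """Apply the dual stage ceil(width / 2) times; sorts any input vector."""
    out = list(values)
    for _ in range((len(out) + 1) // 2):
        out = dual_sort_stage(out)
    return out


print(vector_sort([7, 3, 8, 1, 6, 2, 5, 4]))
```

A folded version would reuse one `dual_sort_stage` in a loop, as `vector_sort` does here, while a full-throughput version would instantiate the stages back to back.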
When the SampleDelay Primitive library block receives vector input, you can
independently specify a different delay for each of the components of the vector.
You may give individual components zero delay resulting in a direct feed through of
only that component. Avoid algebraic loops if you select some components to be zero
delays.
This rule only applies when DSP Builder is reading and outputting vector data. A scalar
specification of delay length still sets all the delays on each vector component to the
same value. You must not specify a vector that is not the same length as the vector on
the input port. A negative delay on any one component is also an error. However, as in
the scalar case, you can specify a zero length delay for one or more of the
components.
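The delay rules above can be expressed as a small behavioral model in Python. It is a sketch under assumptions: the function name `sample_delay` is hypothetical, each cycle is a list of component values, and the delay line is assumed to output zero until the first delayed sample arrives.

```python
def sample_delay(stream, delays):
    """Delay each vector component by its own number of cycles.

    stream: list of per-cycle vectors, for example [[a0, b0], [a1, b1], ...]
    delays: one non-negative int (applied to all components) or one per component
    """
    width = len(stream[0])
    if isinstance(delays, int):
        delays = [delays] * width            # scalar: same delay everywhere
    if len(delays) != width:
        raise ValueError("delay vector length must match the input vector length")
    if any(d < 0 for d in delays):
        raise ValueError("negative delays are not allowed")
    return [
        [stream[t - d][c] if t - d >= 0 else 0 for c, d in enumerate(delays)]
        for t in range(len(stream))
    ]


stream = [[1, 10], [2, 20], [3, 30]]
print(sample_delay(stream, [0, 2]))   # first component feeds straight through
print(sample_delay(stream, 1))        # scalar delay applies to both components
```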
The output type of the adder is propagated from one of the inputs. You must select
the correct input, otherwise the accumulator fails to schedule. You may add a Convert
block to ensure the accumulator also maintains sufficient precision.
The optional use of a two-to-one multiplexer allows the accumulator to load values
according to a Boolean control signal. The inputs differ in precision, so the type with
wider fractional part must be propagated to the output type of the adder, otherwise
the accumulator fails to schedule. Converting both inputs to the same precision
ensures that the single-channel accumulator can always be scheduled even at high
fMAX targets.
If neither input has a fixed-point type that is suitable for the adder to output, use a
Convert block to ensure that the precision of both inputs to the Add block is the
same. Scheduling of this accumulator at high fMAX fails.
This folder accesses groups of reference designs that illustrate the design of DDC and
DUC systems for digital intermediate frequency (IF) processing.
The first group implements IF modem designs compatible with the Worldwide
Interoperability for Microwave Access (WiMAX) standard. Intel provides separate
models for one and two antenna receivers and transmitters.
The second group implements IF modem designs compatible with the wideband Code
Division Multiple Access (W-CDMA) standard.
STAP for radar systems applies temporal and spatial filtering to separate slow moving
targets from clutter and null jammers. Applications demand high processing
requirements and low latency for rapid adaptation. High-dynamic ranges demand
floating-point datapaths.
Related Information
AN 544: Digital IF Modem Design with the DSP Builder Advanced Blockset
For more information about these designs
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
The design includes an Edit Params block to allow easy access to the setup variables
in the setup_wimax_ddc_1rx.m script.
The FIR filters implement a decimating filter chain that down converts the two channels
from a frequency of 89.6 MSPS to a frequency of 11.2 MSPS (a total decimation rate
of eight). The real mixer, NCO, and Interleaver subsystem isolate the two channels.
The design configures the NCO with a single channel to provide one sine and one
cosine wave at a frequency of 22.4 MHz. The NCO has the same sample rate (89.6
MSPS) as the input data sample rate.
A system clock rate of 179.2 MHz drives the design on the FPGA that the Device block
defines inside the DDCChip subsystem.
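The rates quoted for this design can be cross-checked with exact rational arithmetic. The Python snippet below uses only the figures stated above; the variable names are illustrative.

```python
from fractions import Fraction

input_rate_msps = Fraction("89.6")
output_rate_msps = Fraction("11.2")
nco_frequency_mhz = Fraction("22.4")
system_clock_mhz = Fraction("179.2")

decimation = input_rate_msps / output_rate_msps            # total decimation rate
clocks_per_input_sample = system_clock_mhz / input_rate_msps
nco_fraction_of_sample_rate = nco_frequency_mhz / input_rate_msps

print(decimation, clocks_per_input_sample, nco_fraction_of_sample_rate)
```

The system clock provides two clock cycles per input sample, and the 22.4 MHz NCO frequency is one quarter of the 89.6 MSPS sample rate.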
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
The design includes an Edit Params block to allow easy access to the setup variables
in the setup_wimax_ddc_2rx_iiqq.m script.
The FIR filters implement a decimating filter chain that down converts the two channels
from a frequency of 89.6 MSPS to a frequency of 11.2 MSPS (a total decimation rate
of 8). The real mixer and NCO isolate the two channels. The design configures the
NCO with two channels to provide two sets of sine and cosine waves at the same
frequency of 22.4 MHz. The NCO has the same sample rate (89.6 MSPS) as the
input data sample rate.
A system clock rate of 179.2 MHz drives the design on the FPGA, which the Device
block defines inside the DDCChip subsystem.
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
The design includes an Edit Params block to allow easy access to the setup variables
in the setup_wimax_duc_1tx.m script.
The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC2Channel subsystem which contains SingleRateFIR, Scale,
InterpolatingFIR, NCO, and ComplexMixer blocks. The deinterleaver subsystem
contains a series of Primitive blocks including delays and multiplexers that
deinterleave the two I and Q channels.
The FIR filters implement an interpolating filter chain that up converts the two
channels from a frequency of 11.2 MSPS to a frequency of 89.6 MSPS (a total
interpolating rate of 8). The complex mixer and NCO modulate the two input channel
baseband signals to the IF domain. The design configures the NCO with a single
channel to provide one sine and one cosine wave at a frequency of 22.4 MHz. The
NCO has the same sample rate (89.6 MSPS) as the input data sample rate.
A system clock rate of 179.2 MHz drives the design on the FPGA, which the Device
block defines inside the DUCChip subsystem.
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
The design includes an Edit Params block to allow easy access to the setup variables
in the setup_wimax_duc_2tx_iiqq.m script.
The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC2Channel subsystem which contains SingleRateFIR, Scale,
InterpolatingFIR, NCO, ComplexMixer, and Const blocks. It also contains a Sync
subsystem, which shows how to manage two data streams coming together and
synchronizing. The design writes the data from the NCOs to a memory with the
channel index as an address. The data stream uses its channel signals to read out the
NCO signals, which resynchronizes the data correctly. Alternatively, you can simply
delay the NCO value by the correct number of cycles to ensure that the NCO and
channel data arrive at the Mixer on the same cycle. The deinterleaver subsystem
contains a series of Primitive blocks including delays and multiplexers that
deinterleave the four I and Q channels.
The FIR filters implement an interpolating filter chain that up converts the two
channels from a frequency of 11.2 MSPS to a frequency of 89.6 MSPS (a total
interpolating rate of 8).
A complex mixer and NCO modulate the two input channel baseband signals to the IF
domain. The design configures the NCO to provide two sets of sine and cosine waves
at a frequency of 22.4 MHz. The NCO has the same sample rate (89.6 MSPS) as the
input data sample rate.
The Sync subsystem shows how to manage two data streams coming together and
synchronizing. The design writes the data from the NCOs to a memory with the
channel as an address. The data stream uses its channel signals to read out the NCO
signals, which resynchronizes the data correctly.
A system clock rate of 179.2 MHz drives the design on the FPGA, which the Device
block defines inside the DUCChip subsystem.
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks,
plus a ChanView block that isolates two channels of data from the TDM signals.
The CIC and FIR filters implement a decimating filter chain that down converts the
eight complex carriers (16 real channels from two antennas with four pairs of I and Q
inputs from each antenna) from a frequency of 122.88 MSPS to a frequency of 7.68
MSPS (a total decimation rate of 16). The real mixer and NCO isolate the four
channels. The design configures the NCO with four channels to provide four pairs of
sine and cosine waves at frequencies of 12.5 MHz, 17.5 MHz, 22.5 MHz, and 27.5
MHz, respectively. The NCO has the same sample rate (122.88 MSPS) as the input
data sample rate.
The Sync subsystem shows how to manage two data streams that come together and
synchronize. The data from the NCOs writes to a memory with the channel as an
address. The data stream uses its channel signals to read out the NCO signals, which
resynchronizes the data correctly.
A system clock rate of 245.76 MHz drives the design on the FPGA, which the Device
block defines inside the DDCChip subsystem.
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks,
plus a ChanView block that isolates two channels of data from the TDM signals.
The CIC and FIR filters implement a decimating filter chain that down converts the two
complex carriers (4 real channels from two antennas with one pair of I and Q inputs
from each antenna) from a frequency of 122.88 MSPS to a frequency of 7.68 MSPS (a
total decimation rate of 16). The real mixer and NCO isolate the four channels. The
design configures the NCO with a single channel to provide one sine and one cosine
wave at a frequency of 17.5 MHz. The NCO has the same sample rate (122.88 MSPS)
as the input data sample rate.
A system clock rate of 122.88 MHz drives the design on the FPGA, which the Device
block defines inside the DDCChip subsystem.
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.
The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.
The FIR and CIC filters implement an interpolating filter chain that up converts the 16-
channel input data from a frequency of 3.84 MSPS to a frequency of 122.88 MSPS (a
total interpolation factor of 32). The complex mixer and NCO modulate the four
channel baseband input signal onto the IF region. The design configures the NCO with
four channels to provide four pairs of sine and cosine waves at frequencies of 12.5
MHz, 17.5 MHz, 22.5 MHz, and 27.5 MHz, respectively. The NCO has the same sample
rate (122.88 MSPS) as the final interpolated output sample rate from the last CIC filter
in the interpolating filter chain.
The Sum and SampSelectr subsystems sum up the correct modulated signals to the
designated antenna.
A system clock rate of 245.76 MHz drives the design on the FPGA, which the Device
block defines inside the DUC subsystem.
Note: This reference design uses the Simulink Signal Processing Blockset.
These DUC and matching DDC designs connect to 4 antennas and can process 4
channels per antenna. With a sample rate of 61.44 MHz and a clock rate of 491.52
MHz, these designs represent up- and downconverters used in LTE.
DUC
The top-level design of the upconverter contains a TEST_BENCH block with signal
sources, the upconverter, and a SINKS block that stores the datastreams coming out
of the upconverter in MATLAB variables. Depending on which simulation you run, the
TEST_BENCH block uses either real LTE sample streams or specialized debugging
patterns. The upconverter consists of the LDUC module, the lower DUC, which
contains a channel filter and two interpolating filters, each interpolating by a factor of
2. The filtered sample stream feeds into the COMPLEX MIXER block, where an NCO
generates separate frequencies for each of the four channels, and multiplies the
generated sinewaves with the filtered sample stream. A delay match block ensures
that the sample stream and the generated frequencies align correctly. After the
COMPLEX MIXER block is an antenna summer block, which adds up the different
channels for each antenna, multiplies each with a different frequency, and outputs
them to the four separate antennas.
DDC
The top-level design of the DDC also contains a TESTBENCH block, which contains
source blocks that read from workspace. It uses the data that DSP Builder generates
during the simulation of the DUC. The SINKS block again traces the outputs of the
design in MATLAB variables, which you can analyze and manipulate in MATLAB. The
DDC consists of a complex mixer that matches the complex mixer of the DUC, and the
LDDC (Lower DownConverter), which contains two decimate-by-2 filters and a channel
filter.
Simulation Scripts
generates the input vectors for the downconverter, - then run the downconverter and
analyze the outputs. The design contains no channel model, but you can add your
own channel model and apply it to the output data of the DUC before running the DDC
to simulate more realistic operating conditions. Run_DUC_DDC_demo.m uses typical
LTE waveforms; Test_DUC_DDC_demo.m works with ramps that help you visualize
which data goes into which channel and which antenna it transmits on. In the test
pattern, an impulse is set first, followed by a ramp on channel 1 on antenna 1. All
other channels and antennas are 0. The next section transmits channel 1 on antenna 1,
channel 2 on antenna 2 … channel 4 on antenna 4. The last section transmits all 4
channels on all 4 antennas, using the full capacity of the system. Use this debug
pattern if you want to modify or extend the design.
To step through a script section by section, use the echodemo command: type
echodemo Run_DUC_DDC_demo.m at the MATLAB command prompt, and then
click Next several times to step through the simulation script. Alternatively, you
can run the entire script by typing Run_DUC_DDC_demo.m at the MATLAB command
prompt. The last step of the script calls a plot function that generates input versus
output plots for each channel, with overlaid input and output plots. These plots should
match closely, displaying only a small quantization error. The script also produces
channel scopes, which show each channel’s data in time and frequency domains.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.
The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.
The FIR and CIC filters implement an interpolating filter chain that up converts the four
channel input data from a frequency of 3.84 MSPS to a frequency of 122.88 MSPS (a
total interpolation factor of 32). The complex mixer and NCO modulate the four
channel baseband input signal onto the IF region.
The design example configures the NCO with a single channel to provide one sine and
one cosine wave at a frequency of 17.5 MHz. The NCO has the same sample rate
(122.88 MSPS) as the final interpolated output sample rate from the last CIC filter in
the interpolating filter chain.
A system clock rate of 122.88 MHz drives the design on the FPGA, which the Device
block defines inside the DDC subsystem.
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.
The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.
The FIR and CIC filters implement an interpolating filter chain that up converts the 16-
channel input data from a frequency of 3.84 MSPS to a frequency of 122.88 MSPS (a
total interpolation factor of 32). This design example uses dummy signals and carriers
to achieve the desired rate up conversion, because of the unusual FPGA clock
frequency and total rate change combination. The complex mixer and NCO modulate
the four channel baseband input signal onto the IF region. The design example
configures the NCO with four channels to provide four pairs of sine and cosine waves
at frequencies of 12.5 MHz, 17.5 MHz, 22.5 MHz and 27.5 MHz, respectively. The NCO
has the same sample rate (122.88 MSPS) as the final interpolated output sample rate
from the last CIC filter in the interpolating filter chain.
The Sync subsystem shows how to manage two data streams that come together and
synchronize. The data from the NCOs writes to a memory with the channel as an
address. The data stream uses its channel signals to read out the NCO signals, which
resynchronizes the data correctly.
The GenCarrier subsystem manipulates the NCO outputs to generate carrier signals
that can align with the datapath signals.
The CarrierSum and SignalSelector subsystems sum up the right modulated signals
to the designated antenna.
A system clock rate of 368.64 MHz, which is 96 times the input sample rate, drives the
design on the FPGA, which the Device block defines inside the DUC subsystem. The
higher clock rate can potentially allow resource re-use in other modules of a digital
system implemented on an FPGA.
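The rate relationships quoted for this design can be checked with exact arithmetic. The following Python sketch is only a sanity check of the published numbers; the helper name `ratio` is an assumption, and the rates are passed as decimal strings so that no floating-point rounding occurs.

```python
from fractions import Fraction


def ratio(numerator_rate, denominator_rate):
    """Exact ratio of two rates given as decimal strings."""
    return Fraction(numerator_rate) / Fraction(denominator_rate)


input_rate = "3.84"      # MSPS, input sample rate
output_rate = "122.88"   # MSPS, rate after the interpolating filter chain
clock_rate = "368.64"    # MHz, system clock

interpolation_factor = ratio(output_rate, input_rate)
clocks_per_input_sample = ratio(clock_rate, input_rate)
clocks_per_output_sample = ratio(clock_rate, output_rate)
```

With these values the interpolation factor is 32 and the clock runs 96 cycles per input sample, which leaves 3 clock cycles per output sample.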
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.
The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.
The FIR and CIC filters implement an interpolating filter chain that up converts the 16-
channel input data from a frequency of 3.84 MSPS to a frequency of 184.32 MSPS (a
total interpolation factor of 48).
The complex mixer and NCO modulate the four channel baseband input signal onto
the IF region. The design configures the NCO with four channels to provide four pairs
of sine and cosine waves at frequencies of 12.5 MHz, 17.5 MHz, 22.5 MHz, and 27.5
MHz, respectively. The NCO has the same sample rate (184.32 MSPS) as the final
interpolated output sample rate from the last CIC filter in the interpolating filter chain.
The Sync subsystem shows how to manage two data streams that come together and
synchronize. The data from the NCOs writes to a memory with the channel as an
address. The data stream uses its channel signals to read out the NCO signals, which
resynchronizes the data correctly.
The CarrierSum and SignalSelector subsystems sum up the right modulated signals
to the designated antenna.
A system clock rate of 368.64 MHz, which is 96 times the input sample rate, drives the
design on the FPGA, which the Device block defines inside the DUC subsystem. The
higher clock rate can potentially allow resource re-use in other modules of a digital
system implemented on an FPGA.
Note: This reference design uses the Simulink Signal Processing Blockset.
The top-level testbench includes Control, Signals, and Run Quartus Prime blocks.
A Spectrum Scope block computes and displays the periodogram of the outputs from
the two antennas.
The DUCChip subsystem includes a Device block to specify the target FPGA device,
and a DUC subsystem that contains InterpolatingFIR, InterpolatingCIC, NCO,
ComplexMixer, and Scale blocks.
The FIR and CIC filters implement an interpolating filter chain that up converts the 16-
channel input data from a frequency of 3.84 MSPS to a frequency of 153.6 MSPS (a
total interpolation factor of 40).
The complex mixer and NCO modulate the four channel baseband input signal onto
the IF region. The design configures the NCO with four channels to provide four pairs
of sine and cosine waves at frequencies of 12.5 MHz, 17.5 MHz, 22.5 MHz, and 27.5
MHz, respectively. The NCO has the same sample rate (153.6 MSPS) as the final
interpolated output sample rate from the last CIC filter in the interpolating filter chain.
The Sync subsystem shows how to manage two data streams that come together and
synchronize. The design writes data from the NCOs to a memory with the channel as
an address. The data stream uses its channel signals to read out the NCO signals,
which resynchronizes the data correctly.
The CarrierSum and SignalSelector subsystems sum up the right modulated signals
to the designated antenna.
A system clock rate of 307.2 MHz, which is 80 times the input sample rate, drives the
design on the FPGA, which the Device block defines inside the DUC subsystem. The
higher clock rate can potentially allow resource re-use in other modules of a digital
system implemented on an FPGA.
Note: This reference design uses the Simulink Signal Processing Blockset.
[Figure: matrix inversion data flow. Input Matrix (A) → Cholesky Decomposition → Lower Triangle Matrix and Diagonal Reciprocal Values 1/Lkk → Triangular Matrix Inversion → Triangular Matrix J → Triangular Matrix Mult → A_inverse]
The Cholesky decomposition calculates the reciprocal values of the diagonal elements
of L, $1/L_{kk}$, which the triangular matrix inversion requires. The design propagates
those values to the output interface of the Cholesky decomposition, reducing resource
usage and latency.
$A = L \cdot L^{H}$
The design performs Cholesky decomposition and calculates the inverse of L, $J = L^{-1}$,
through forward substitution. J is a lower triangle matrix. The inverse of the input
matrix requires a triangular matrix multiplication, followed by a Hermitian matrix
multiplication:
$A^{-1} = J^{H} \cdot J$
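As a numerical illustration of these three steps, the following pure-Python sketch factors a small Hermitian positive-definite matrix, inverts the triangular factor by forward substitution while reusing the diagonal reciprocals, and forms the Hermitian product. It is a reference model under stated assumptions, not the hardware datapath; the function names and the 2x2 test matrix are made up for the example.

```python
import cmath


def cholesky(a):
    """Return lower-triangular L with A = L * L^H (A Hermitian positive definite)."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(l[j][k] * l[j][k].conjugate() for k in range(j))
        l[j][j] = cmath.sqrt(diag)
        for i in range(j + 1, n):
            acc = a[i][j] - sum(l[i][k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = acc / l[j][j]
    return l


def invert_lower(l):
    """Forward substitution for J = L^-1, reusing the reciprocals 1/L_kk."""
    n = len(l)
    inv_diag = [1 / l[k][k] for k in range(n)]
    j = [[0j] * n for _ in range(n)]
    for col in range(n):
        j[col][col] = inv_diag[col]
        for row in range(col + 1, n):
            acc = sum(l[row][k] * j[k][col] for k in range(col, row))
            j[row][col] = -acc * inv_diag[row]
    return j


def hermitian_product(j):
    """Return J^H * J."""
    n = len(j)
    return [[sum(j[k][r].conjugate() * j[k][c] for k in range(n))
             for c in range(n)] for r in range(n)]


a = [[4, 2 - 2j], [2 + 2j, 6]]      # example Hermitian positive-definite matrix
l_factor = cholesky(a)
j_matrix = invert_lower(l_factor)
a_inv = hermitian_product(j_matrix)
```

Multiplying `a` by `a_inv` should give the identity matrix to within floating-point rounding.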
Cholesky Decomposition
[Figure: Cholesky decomposition datapath. Top datapath: input memory, mux, vectorization, and scalar product and subtract operators (data format 16s15(c)). Bottom datapath: 1/√ operator, multiplier, and circular memory and FIFO holding Li,j (18s17(c)) and 1/Li,j (18s17). Control logic.]
[Figure: triangular matrix inversion datapath. Labels: L input memory (Li,j, 18s17(c)) and 1/Li,j input memory (diagonal, 18s12), each with a write controller indexed by rows, columns, and channels; circular memory; multiply, sum, scale, negate, and mux stages; FIFO; J output; stage latencies of 1, 4, 7, 5, and 0 cycles; control logic.]
Matrix inversion takes multiple matrices and interleaves the inverse computations for
all matrices. This method hides the latency in computing each element by pipelining
inversion of a completely different channel. Multichannel designs use the idle cycles in
the computation chain to process the next channel. Two buffers at the input and
output of the design create channels for streaming matrices into multichannel
interfaces.
Signal | Direction | Type | Width | Description
Sink_Valid | Input | Boolean | 1 | Avalon streaming sink valid signal for the input matrix interface. Number of valid inputs = (matrix size*(matrix size + 1))/2.
Sink_Channel | Input | Unsigned integer | 8 | Avalon streaming sink channel bus for the input matrix interface.
Sink_Data | Input | Single floating-point complex | 64 bit I/Q | Avalon streaming sink data bus for the input matrix interface. Lower matrix elements are streamed in column major order.
Source_Valid | Output | Boolean | 1 | Avalon streaming source valid signal for the output interface. This signal is asserted for (size*(size+1))/2 clocks.
Source_Channel | Output | Unsigned integer | 8 | Avalon streaming source channel bus for the output interface.
Source_Data | Output | Single floating-point complex | 64 bit I/Q | Avalon streaming source data bus for the output interface. Lower matrix elements are streamed in column major order.
Parameters
Parameter Description
Latency The period in cycles the module waits before receiving the next set of matrices.
DSP Builder calculates the throughput of the design by setting the latency value and
the system clock:
Figure 71. Input streaming interface for 8x8 Hermitian input matrix
The figure shows the latency configuration parameter in the input interface including data, valid, and channel
signals. In this example of 8x8 matrix inversion, the valid signal remains high for 36 clock cycles (total number
of lower triangle elements of the Hermitian matrix of 8x8) and remains low for (latency – 36) cycles before
inserting the next matrix elements. The minimum duration to remain low and hence the minimum latency
period may vary depending on the matrix size and the pipelining required to meet timing constraints.
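The element count and the shape of the valid signal can be modeled in a few lines of Python. This is a sketch for checking numbers only; the function names are hypothetical, and the latency of 75 cycles is taken as an example value for an 8x8 matrix.

```python
def lower_triangle_count(matrix_size):
    """Number of lower-triangle elements, diagonal included."""
    return matrix_size * (matrix_size + 1) // 2


def column_major_lower(matrix_size):
    """(row, column) indices of the lower triangle in column-major order."""
    return [(row, col)
            for col in range(matrix_size)
            for row in range(col, matrix_size)]


def valid_pattern(matrix_size, latency):
    """Valid signal over one latency period: high per element, then low."""
    high = lower_triangle_count(matrix_size)
    if latency < high:
        raise ValueError("latency must cover all lower-triangle elements")
    return [1] * high + [0] * (latency - high)


pattern = valid_pattern(matrix_size=8, latency=75)   # example latency value
order = column_major_lower(3)
```

For the 8x8 case the pattern is high for 36 cycles and low for the remaining 39 cycles of the 75-cycle period.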
Table 21. Recommended Values for the Minimum Latency (maximum throughput)
In Intel Stratix 10 and Intel Arria 10 devices, speed grade –1 and –2, for three different matrix sizes.
4x4 ≥ 30 ≥ 30
8x8 ≥ 75 ≥ 74
Matrix Dimension Number of channels Logic Elements (ALMs) DSP Blocks Memory bits RAM blocks Registers
Table 23. Performance of the floating-point matrix inversion module for different
matrix dimensions
This table shows the fMAX performance of the floating-point design for different matrix sizes with a system clock
of 368.64 MHz and targeting an FPGA device. The maximum throughput is in millions of matrix inversions per
second.
Matrix Dimension Number of channels Target System clock (MHz) fMAX (MHz) ThroughputMAX
The design decomposes A into L*L', therefore L*L'*x = b, or L*y = b, where y = L'*x.
The design solves y with forward substitution and x with backward substitution.
This design uses cycle stealing and command FIFO techniques to enhance
performance. Although it targets multiple channels, it also works well with single
channels.
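The two substitution steps can be sketched in Python as follows. This is a minimal real-valued model that takes the Cholesky factor L as given; the function names and the 2x2 example are assumptions, and the sketch ignores the multichannel, cycle-stealing, and FIFO aspects of the design.

```python
def forward_substitution(l, b):
    """Solve L * y = b for lower-triangular L."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        acc = sum(l[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - acc) / l[i][i]
    return y


def backward_substitution(l, y):
    """Solve L' * x = y, where L' is the transpose of L (real-valued case)."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):          # the bottom element is solved first
        acc = sum(l[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - acc) / l[i][i]
    return x


l = [[2.0, 0.0], [1.0, 3.0]]              # so A = L*L' = [[4, 2], [2, 10]]
b = [10.0, 32.0]                          # chosen so that A * [1, 3] = b
y = forward_substitution(l, b)
x = backward_substitution(l, y)
```

Here `y` evaluates to [5, 9] and `x` to [1, 3], which satisfies A*x = b for the example matrix.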
To input the lower triangular elements of matrix A and b with the input bus, specify
the column, row, and channel index of each element. The design transposes and
appends the column vector b to the bottom of A and treats it as an extension of A in
terms of column and row addressing.
The output is column vector x with the bottom element output first.
You can change the simulation length by clicking on the Simulink Length block.
Related Information
Crest factor reduction for wireless systems
The FIR filter length is 2 x (Dmax / Dmin) x N + 1 where Dmax and Dmin are the
maximum and minimum decimation ratios and N is the number of (1 sided) symmetric
coefficients at Dmin.
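The filter length expression can be written as a small Python helper. The function name is hypothetical, and the sketch assumes that Dmax is an integer multiple of Dmin, which the text does not state.

```python
def fir_length(d_max, d_min, n_symmetric):
    """FIR length 2 * (Dmax / Dmin) * N + 1.

    Assumes d_max is an integer multiple of d_min (an assumption of this
    sketch, not a documented requirement).
    """
    if d_min <= 0 or d_max % d_min != 0:
        raise ValueError("d_max must be a positive integer multiple of d_min")
    return 2 * (d_max // d_min) * n_symmetric + 1


length = fir_length(d_max=16, d_min=2, n_symmetric=8)
```

For these example values the length is 2 x 8 x 8 + 1 = 129 taps. The result is always odd, as expected for a symmetric filter with a center tap.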
All channels must have the same decimation ratio. The product of the number of
channels and the minimum decimation ratio must be 4 or more. The design limits the
wire count to 1 and:
To optimize the overall throughput the solver can interleave multiple data instances at
the same time. The inputs of the design are system matrices A [n × m] and input
vectors.
The reference design uses the Gram-Schmidt method to decompose system matrix A
to Q and R matrices. It calculates the solution of the system by completing backward
substitution.
6.12.19. QR Decomposition
This reference design is a complete linear equations system solution that uses QR
decomposition.
The reference design uses the Gram-Schmidt method to decompose system matrix A
to Q and R matrices, and calculates the solution of the system by completing
backward substitution.
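The method can be illustrated with a short pure-Python model. This sketch uses classical Gram-Schmidt on real-valued data for brevity, whereas the reference design handles complex data; the function names and the 2x2 example are assumptions.

```python
import math


def gram_schmidt_qr(a):
    """Classical Gram-Schmidt: A (n rows, m columns, n >= m) -> Q columns, R."""
    n, m = len(a), len(a[0])
    q_cols = []
    r = [[0.0] * m for _ in range(m)]
    for j in range(m):
        col = [a[i][j] for i in range(n)]
        v = col[:]
        for k, q in enumerate(q_cols):
            r[k][j] = sum(q[i] * col[i] for i in range(n))   # projection on q_k
            v = [v[i] - r[k][j] * q[i] for i in range(n)]
        r[j][j] = math.sqrt(sum(value * value for value in v))
        q_cols.append([value / r[j][j] for value in v])
    return q_cols, r


def solve_qr(a, b):
    """Solve A * x = b through R * x = Q' * b and backward substitution."""
    q_cols, r = gram_schmidt_qr(a)
    m = len(r)
    qtb = [sum(q[i] * b[i] for i in range(len(b))) for q in q_cols]
    x = [0.0] * m
    for i in reversed(range(m)):
        acc = sum(r[i][k] * x[k] for k in range(i + 1, m))
        x[i] = (qtb[i] - acc) / r[i][i]
    return x


a = [[3.0, 1.0], [4.0, 2.0]]
b = [5.0, 8.0]                      # chosen so that A * [1, 2] = b
x = solve_qr(a, b)
```

The result `x` is approximately [1, 2]. Classical Gram-Schmidt can lose orthogonality on ill-conditioned matrices, so tolerance-based comparison of results is advisable.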
This design uses the Run All Testbenches block to access enhanced features of the
automatically-generated testbench. An application-specific m-function verifies the
simulation output, to correctly handle the complex results and the numerical
approximation because of the floating-point format.
[Figure: QRD solver data flow. [A] → QR Decomposition → [Q] and [R]; [Q] and [b] → Qxb; [R] and Qxb → Backward Substitution → [x]]
The reference design is fully parameterizable over system dimensions n and m and the
processing vector size, which defines the parallelization ratio of the dot product
engine. This design implements a parallel dot product engine using single-precision
Multiply and Add blocks that perform most of the floating-point calculations. The
design routes different phases of the calculation through these blocks with a
controlling processor that executes a fixed set of microinstructions and generates
operation indexes. The design implements the controlling processor using for-loop
macro blocks, which allow very efficient, flexible, and high-level implementation of
iterative operations.
This design uses the Run All Testbenches block to access enhanced features of the
automatically generated testbench. An application-specific m-function verifies the
simulation output, to correctly handle the complex results and the numerical
approximation because of the floating-point format. Intel optimized the design for
Intel Stratix 10 FPGAs. The design implements hardened floating-point operators in
the FPGA DSP blocks.
Intel tested the design with Intel Quartus Prime v18.1.1 build 259, targeting a 1SG280LN3F43E2VG device
512x256 512 320 461K 4,370 1,313 71,232 4,492 137,545 0.43
(49%) (76%) (11%)
64x64 64 418 60.5 (6%) 562 (10%) 160 (1%) 7,920 52,777 12,392 0.03
You can modify the parameters in the setup_vardownsampler.m file, which you
access from the Edit Params icon.
The top-level testbench includes blocks to access control and signals, and to run the
Quartus Prime software. It also includes an Edit Params block to allow easy access to
the configuration variables in the setup_sc_LTEtxr.m script. A discrete-time scatter
plot scope displays the constellation of the modulated signal in inphase versus
quadrature components.
The LTE_txr subsystem includes a Device block to specify the target FPGA device,
and 64QAM, 1K_IFFT, ScaleRnd, CP_bReverse, Chg_Data_Format, and DUC
blocks.
The 64QAM subsystem uses a lookup table to convert the source input data into 64
QAM symbol mapped data. The 1K_IFFT subsystem converts the frequency domain
quadrature amplitude modulation (QAM) modulated symbols to the time domain. The
ScaleRnd subsystem follows this conversion; it scales down the output signals
and converts them to the specified fixed-point type.
The CP_bReverse subsystem adds an extended cyclic prefix (CP), or guard interval,
to each orthogonal frequency-division multiplexing (OFDM) symbol to avoid the
intersymbol interference (ISI) that multipath propagation causes. The CP_bReverse
block also reorders the outputs of the IFFT subsystem, which are in bit-reversed
order, so that they are in the correct order in the time domain. The design adds the
cyclic prefix by copying the last 25% of the data frame and appending the copy to the
beginning of the frame.
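The two operations of this subsystem can be sketched behaviorally in Python. This is an illustration only; the function names are hypothetical, the frame is a plain list rather than a streamed bus, and an 8-point frame stands in for the 1K IFFT output.

```python
def bit_reverse_reorder(samples):
    """Put bit-reversed-order samples into natural order.

    The frame length must be a power of two.
    """
    n = len(samples)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("frame length must be a power of two")
    out = [None] * n
    for index, sample in enumerate(samples):
        reversed_index = int(format(index, f"0{bits}b")[::-1], 2)
        out[reversed_index] = sample
    return out


def add_cyclic_prefix(frame, fraction=0.25):
    """Copy the last `fraction` of the frame to the front of the frame."""
    cp_len = int(len(frame) * fraction)
    return frame[len(frame) - cp_len:] + frame


ifft_out = [0, 4, 2, 6, 1, 5, 3, 7]        # sample indices in bit-reversed order
natural = bit_reverse_reorder(ifft_out)
symbol = add_cyclic_prefix(natural)
```

For this 8-sample frame, `natural` is [0, 1, ..., 7] and `symbol` starts with the copied tail [6, 7] followed by the full frame.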
A system clock rate of 245.76 MHz drives the design on the FPGA. The Signals block
of the design defines this clock. The input random data for the 64QAM symbol
mapping subsystem has a data rate of 15.36 MSPS.
The design applies this linear system of equations to the steering vector in the
following two steps:
• Forward substitution with the lower triangular matrix
• Backward substitution with the lower triangular matrix
This design uses advanced settings from the DSP Builder > Verify Design menu to
access enhanced features of the automatically generated testbench. An
application-specific m-function verifies the simulation output, to correctly compare complex
results and properly handle floating-point errors that arise from the ill-conditioning of
the QRD output.
This design uses the Run All Testbenches block to access enhanced features of the
automatically generated testbench. An application-specific m-function verifies the
simulation output, to correctly handle the complex results and the numerical
approximation due to the floating-point format.
The design includes the following features so you can simulate and verify the transmit
and receive beamforming operations:
• Waveform (chirp) generation
• Target emulation
• Receiver noise emulation
• Aperture tapering
• Pulse compression
The transmitter can produce random data, which is useful for generating a hardware
demo, or you can feed it with data from the MATLAB environment. You can modulate
the data, where the modulation order can be QAM4 or QAM64. The design filters the
signal, and then feeds it into optional crest factor reduction (CFR) and digital
predistortion (DPD) blocks. Intel assumes you have a control processor that configures
modulation scheme and CFR and DPD parameters.
The channel model contains a random noise source, and a channel model, which you
can configure through the setup script. This channel model allows you to build a
hardware demonstrator on a standard FPGA development platform, without DA or AD
converters and analogue components. Following the channel model is the model of a
decimating ADC, which emulates the behavior of some existing ADC components that
provide this functionality.
The receiver contains an RRC filter, followed by an equalizer. Intel assumes that a
control processor calculates the equalizer coefficients. The equalizer feeds into an AGC
block, which feeds into a demapper. You can configure the demapper to different
modulation orders.
You can modify the parameters in the setup_vardecimator_rt.m file, which you
access from the Edit Params icon.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus a ChanView block that deserializes the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_complex_mixer.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
This design example demonstrates frequency-hopping with the NCO block to generate
four channels of sinusoidal waves that you can switch from one set (bank) of
frequencies to another.
The phase increment values are set directly into the NCO Parameter dialog box as a
2 (rows) × 4 (columns) matrix. The input for the bank index is set up so that it
alternates between the two predefined banks with each one lasting 2000 steps.
A BusStimulus block sets up an Avalon-MM interface that writes into the phase
increment memory registers. It shows how you can use the Avalon-MM interface to
dynamically change the frequencies of the NCO-generated sinusoidal signals at run
time. This design example uses a 16-bit memory interface (as the Control block
specifies) and a 24-bit accumulator in the NCO block. The design example
requires two registers for each phase increment value. With the base address of the
phase increment memory map set to 1000 in this design example, the addresses
[1000 1001 1002 1003 1012 1013 1014 1015] write to the phase increment memory
registers of channels 1 and 2 in bank 1, and to the registers of channels 3 and 4 in
bank 2. The write data is also made up of two parts with each part writing to one of
the registers feeding the selected phase increment accumulators.
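The address list above is consistent with a linear register layout, which the following Python sketch reproduces. The layout formula, the zero-based bank and channel indices, and the least-significant-word-first split are assumptions inferred from the listed addresses, not documented behavior; the function names are hypothetical.

```python
def phase_increment_addresses(base, bank, channel,
                              channels_per_bank=4, regs_per_value=2):
    """Register addresses for one phase increment value.

    bank and channel are zero-based. Assumes registers are laid out
    linearly by bank, then channel (an inferred layout).
    """
    first = base + (bank * channels_per_bank + channel) * regs_per_value
    return list(range(first, first + regs_per_value))


def split_words(value, bus_bits=16, regs=2):
    """Split a phase increment into bus-sized words, least significant first.

    The word order is an assumption of this sketch.
    """
    mask = (1 << bus_bits) - 1
    return [(value >> (bus_bits * i)) & mask for i in range(regs)]


targets = [(0, 0), (0, 1), (1, 2), (1, 3)]   # (bank, channel), zero-based
addresses = [addr
             for bank, channel in targets
             for addr in phase_increment_addresses(1000, bank, channel)]
words = split_words(0x123456)                # example 24-bit phase increment
```

With a base address of 1000, the four targets map to the eight addresses listed in the text, and the example value splits into the words 0x3456 and 0x0012.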
This design example has two banks of frequencies with each bank processed for 2,000
steps before switching to the other. You should write a new value into the phase
increment memory register for each bank to change the NCO output frequencies after
8,000 steps during simulation. To avoid writing new values to the active bank, the
design example configures the write enable signals in the following way:
This configuration ensures that a new phase increment value for bank 0 is written at
7000 steps when the NCO is processing bank 1; and a new phase increment value for
bank 1 is written at 9000 steps when the NCO is processing bank 0.
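The timing of these writes can be checked with a small Python model of the bank schedule. The model assumes that bank 0 is active from step 0 and that banks rotate in order every 2,000 steps; the function names are hypothetical.

```python
def active_bank(step, num_banks=2, steps_per_bank=2000):
    """Bank the NCO is processing at a given step (bank 0 assumed first)."""
    return (step // steps_per_bank) % num_banks


def safe_to_write(step, target_bank, num_banks=2, steps_per_bank=2000):
    """True when the target bank is not the active bank at this step."""
    return active_bank(step, num_banks, steps_per_bank) != target_bank


bank_at_7000 = active_bank(7000)
bank_at_9000 = active_bank(9000)
```

Under these assumptions the NCO processes bank 1 at step 7000 and bank 0 at step 9000, so writing bank 0 at step 7000 and bank 1 at step 9000 never touches the active bank.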
Four writes for each bank exist to write new values for channel 1 and 2 into bank 0,
and new values for channel 3 and 4 into bank 1. Each new phase value needs two
registers due to the size of the memory interface.
The Spectrum Scope block shows three peaks for a selected channel with the first
two peaks representing the two banks and the third peak showing the frequency that
you specify through the memory interface. The scope of the selected channel shows the
sinusoidal waves of the channel you select. You can zoom in to see the smooth and
continuous sinusoidal signals at the switching point. You can also see the frequency
changes after 8000 steps where the phase increment value alters through the memory
interface.
Note: This design example uses the Simulink Signal Processing Blockset.
This design example is similar to the Four Channel, Two Banks NCO design, but it has
four banks of frequencies defined for the phase increment values. Each spectrum plot
has five peaks: the fifth peak shows the changes the design example writes through
the memory interface.
The design example uses a 32-bit memory interface with a 24-bit accumulator. Hence,
the design example requires only one phase increment memory register for each
phase increment value—refer to the address and data setup on the BusStimulus
block inside this design example.
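The register count follows from the ratio of accumulator width to memory interface width. The following Python helper expresses that relationship as a ceiling division; this rule is an assumption that matches the two configurations described in these design examples, and the function name is hypothetical.

```python
def registers_per_value(accumulator_bits, bus_bits):
    """Registers needed to hold one phase increment (ceiling division)."""
    if accumulator_bits <= 0 or bus_bits <= 0:
        raise ValueError("bit widths must be positive")
    return -(-accumulator_bits // bus_bits)


regs_16_bit_bus = registers_per_value(accumulator_bits=24, bus_bits=16)
regs_32_bit_bus = registers_per_value(accumulator_bits=24, bus_bits=32)
```

A 24-bit accumulator needs two registers on a 16-bit interface and one register on a 32-bit interface.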
This design example has four banks of frequencies with each bank processed for 2,000
steps before switching to the other. You should write a new value into the phase
increment memory register for each bank to change the NCO output frequencies after
16,000 steps during simulation. To avoid writing new values to the active bank, the
design example configures the write enable signals in the following way:
This configuration ensures that the design writes each new phase increment value
while the NCO is processing a different bank:
• Bank 0 is written at 15,000 steps, when the NCO is processing bank 3.
• Bank 1 is written at 17,000 steps, when the NCO is processing bank 0.
• Bank 2 is written at 19,000 steps, when the NCO is processing bank 1.
• Bank 3 is written at 21,000 steps, when the NCO is processing bank 2.
Each bank receives one write: a new value for channel 1 goes into bank 0, a new
value for channel 2 into bank 1, a new value for channel 3 into bank 2, and a new
value for channel 4 into bank 3. Each new phase increment value needs only one register
because of the size of the memory interface.
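The write timing can be checked with a few lines of arithmetic. The following Python sketch is an illustration only: it assumes steps are counted from zero, that bank 0 is processed first, and that the banks rotate in order every 2,000 steps. The names are hypothetical.

```python
STEPS_PER_BANK = 2000
NUM_BANKS = 4

# (simulation step of the write, bank that is written), taken from the text.
WRITES = [(15000, 0), (17000, 1), (19000, 2), (21000, 3)]


def active_bank(step):
    """Bank the NCO is processing at a given step (assumes bank 0 runs first)."""
    return (step // STEPS_PER_BANK) % NUM_BANKS


def write_report(writes):
    """List (step, written bank, active bank) for each scheduled write."""
    return [(step, bank, active_bank(step)) for step, bank in writes]


for step, bank, current in write_report(WRITES):
    status = "ok" if bank != current else "collides with active bank"
    print(f"step {step}: write bank {bank}, NCO on bank {current} -> {status}")
```

Under these assumptions, every write targets a bank other than the one being processed, and bank 0 becomes active again with its new value at step 16,000.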
Note: This design example uses the Simulink Signal Processing Blockset.
This design example is similar to the Four Channel, 16 Banks NCO design, but has
only eight banks of phase increment values (specified in the setup script for the
workspace variable) feeding into the NCO. Furthermore, the sample time for the NCO
requires two wires to output the four channels of the sinusoidal signals. Each of the
two NCO output wires carries only two channels. Hence, the channel
indicator changes from 0 .. 3 to 0 .. 1.
You can inspect the eight peaks on the spectrum graph for each channel and see the
smooth continuous sinusoidal waves on the scope display.
The design example outputs the data to the workspace and plots it with the
separate demo_mc_nco_extracted_waves.mdl model, which demonstrates that the
output of the bank you select does represent a genuine sinusoidal wave. However,
from the scope display, you can see that the sinusoidal wave is no longer smooth at
the switching point, because the selected banks use different phase
increment values. You can only run the
demo_mc_nco_extracted_waves.mdl model after you run
demo_mc_nco_8banks_2wires.mdl.
Note: This design example uses the Simulink Signal Processing Blockset.
A workspace variable phaseIncr defines the 16 (rows) × 4 (columns) matrix for the
phase increment input with the phase increment values that the setup script
calculates.
The input for the bank index is set up so that it cycles from 0 to 15 with each bank
lasting 1200 steps.
The spectrum display clearly shows 16 peaks for the selected channel, indicating that
the design example generates 16 different frequencies for that channel. The scope for
the selected channel shows its sinusoidal waves. You can
zoom in to see that the design example generates smooth and continuous sinusoidal
signals at the switching point.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus ChanView blocks that deserialize the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_mc_nco_16banks.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
6.13.6. IP
The IP design example describes how you can build an NCO design with the NCO block
from the Waveform Synthesis library.
Note: This design example uses the Simulink Signal Processing Blockset.
6.13.7. NCO
This design example uses the NCO block from the Waveform Synthesis library to
implement an NCO. The design compares the results against a Simulink
double-precision sine or cosine wave.
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus ChanView blocks that deserialize the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_nco.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
Related Information
NCO on page 245
The top-level testbench includes Control, Signals, Run ModelSim, and Run
Quartus Prime blocks, plus ChanView blocks that deserialize the output buses. An
Edit Params block allows easy access to the setup variables in the
setup_demo_mix.m script.
Note: This design example uses the Simulink Signal Processing Blockset.
A super-sample NCO uses multiple NCOs that each have an initial phase offset. When
you combine the parallel outputs into a serial stream, they can describe frequencies up to N
times the Nyquist frequency of a single NCO, where N is the total number of NCOs
that the design uses.
The NCO block produces four outputs, which all have the same phase increment but
each has a different, evenly distributed initial phase offset. Taken in series, the four parallel
outputs describe frequencies up to four times higher than the Nyquist
frequency of an individual NCO.
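The interleaving idea can be sketched in a few lines of Python. This is a behavioral illustration under stated assumptions, not the NCO block's implementation: the 24-bit accumulator width, the function names, and the choice of offsets (NCO k starts at k times the serial phase step) are all assumptions made for the example.

```python
import math

ACC_BITS = 24                 # hypothetical accumulator width
MODULUS = 1 << ACC_BITS


def reference_phases(inc, count):
    """Phase sequence of one NCO running at the full (serial) sample rate."""
    return [(n * inc) % MODULUS for n in range(count)]


def super_sample_phases(inc, num_ncos, cycles):
    """Interleave num_ncos slow NCOs into one serial phase stream."""
    shared_inc = (num_ncos * inc) % MODULUS                   # same increment for every NCO
    phases = [(k * inc) % MODULUS for k in range(num_ncos)]   # evenly spaced offsets
    serial = []
    for _ in range(cycles):
        serial.extend(phases)                                 # parallel outputs, taken in series
        phases = [(p + shared_inc) % MODULUS for p in phases]
    return serial


def to_sine(phases):
    """Convert accumulator phases to sine samples."""
    return [math.sin(2.0 * math.pi * p / MODULUS) for p in phases]


# 0.3 cycles per serial sample is 1.2 cycles per slow-NCO sample,
# which is above the 0.5 cycles-per-sample Nyquist limit of a single NCO.
inc = round(0.3 * MODULUS)
serial = super_sample_phases(inc, num_ncos=4, cycles=8)
print(serial == reference_phases(inc, 32))  # True
```

Because sample m*N + k of the serial stream has phase (m*N + k) times the serial step, the four slow accumulators reproduce exactly the phase sequence of a single NCO clocked four times faster.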
To change the frequency of the super-sample NCO using the bus, write a new phase
increment and offset to each of the four constituent NCOs and then strobe the
synchronization register. The NCO block includes the phase increment register; a
separate primitive subsystem implements the phase offset and synchronization
registers.
DSP Builder writes the output of the super-sample NCO into a MATLAB workspace
variable and compares it with a MATLAB-generated waveform in the script
test_demo_nco_super_sample.
DSP Builder schedules the bus in HDL but not in Simulink, so bus writes occur at
different clock cycles. Therefore, the verify_demo_nco_super_sample
function verifies the design by checking that the Simulink and ModelSim frequency
distributions match within a tolerance.
The output of the Spectrum Analyser block shows that the simulation initializes to the last
frequency in dspb_super_nco.frequencies and then rotates through the list.
Note: This design example uses the Simulink Signal Processing Blockset.
7. DSP Builder Design Rules, Design Recommendations, and Troubleshooting
Related Information
• Control on page 221
• Avalon-MM Slave Settings (AvalonMMSlaveSettings) on page 219
• External Memory, Memory Read, Memory Write on page 264
• Channel In (ChannelIn) on page 344
• Channel Out (ChannelOut) on page 345
• Synthesis Information (SynthesisInfo) on page 347
• Setting DSP Builder Design Parameters with MATLAB Scripts on page 183
Related Information
DSP Builder Design Rules and Recommendations on page 176
If the pipelining requirements of the functional units around the loop are greater than
the delay specified by the SampleDelay blocks on the loop path, DSP Builder
generates an error message. The message states that distribution of memory failed as
there was insufficient delay to satisfy the fMAX requirement. DSP Builder cannot
simultaneously satisfy the pipelining to achieve the given fMAX and the loop criteria to
re-circulate the data in the number of clock cycles specified by the SampleDelay
blocks.
DSP Builder automatically adjusts the pipeline requirements of every Primitive block
according to these factors:
• The type of block
• The target fMAX
• The device family and speedgrade
• The number of inputs
• The bit width of the data inputs
Note: Multipliers on Cyclone devices take two cycles at all clock rates. On Stratix V, Arria V,
and Cyclone V devices, fixed-point multipliers take two cycles at low clock rates, three
cycles at high clock rates. Very wide fixed-point multipliers incur higher latency when
DSP Builder splits them into smaller multipliers and adders. You cannot count the
multiplier and adder latencies separately because DSP Builder may combine them into
a single DSP block. The latency of some blocks depends on what pipelining you apply
to surrounding blocks. DSP Builder avoids pipelining every block but inserts pipeline
stages after every few blocks in a long sequence of logical components, if fMAX is
sufficiently low that timing closure is still achievable.
In the SynthesisInfo block, you can optionally specify a latency constraint limit that
can be a workspace variable or expression, but must evaluate to a positive integer.
However, only use this feature to add further latency. Never use the feature to reduce
latency to less than the latency required to pipeline the design to achieve the target
fMAX.
After you run a simulation in Simulink, the help page for the SynthesisInfo block
shows the latency, port interface, and estimated resource utilization for the current
Primitive subsystem.
When no loops exist, feed-forward datapaths are balanced to ensure that all the input
data reaches each functional unit in the same cycle. After analysis, DSP Builder inserts
delays on all the non-critical paths to balance out the delays on the critical path.
In designs with loops, DSP Builder advanced blockset must synthesize at least one
cycle of delay in every feedback loop to avoid combinational loops that Simulink
cannot simulate. Typically, one or more lumped delays exist. To preserve the delay
around the loop for correct operation, the functional units that need more pipelining
stages borrow from the lumped delay.
A loop that contains two adders but only a single sample delay may not have
sufficient delay. When it automatically pipelines a design, DSP Builder creates a schedule of
signals through the design. From internal timing models, DSP Builder calculates how
fast certain components, such as wide adders, can run and how many pipelining
stages they require to run at a specific clock frequency. DSP Builder must account for
the required pipelining while not changing the order of the schedule. The single
sample delay is not enough to pipeline the path through the two adders at the specific
clock frequency. DSP Builder is not free to insert more pipelining, because that changes the
algorithm: the loop would accumulate every n cycles rather than every cycle. The scheduler detects
this change and gives an appropriate error indicating how much more latency the loop
requires for it to run at the specific clock rate. In designs with multiple loops, you may see this error
a few times in a row as DSP Builder balances and resolves each loop.
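The feasibility condition amounts to comparing the pipelining that the loop components need with the delay available on the loop path. The Python sketch below shows that comparison only. The per-adder stage counts are hypothetical; in practice they come from DSP Builder's internal timing models and are not something you compute yourself.

```python
def extra_loop_latency(required_stages, loop_sample_delay):
    """Additional delay the loop needs before it can be scheduled.

    required_stages: pipelining stages each component on the loop path needs
                     at the target clock rate (hypothetical values here).
    loop_sample_delay: total SampleDelay cycles on the loop path.
    Returns 0 when the existing delay is sufficient.
    """
    return max(0, sum(required_stages) - loop_sample_delay)


# Two adders that each need one stage, with a single sample delay in the loop.
print(extra_loop_latency([1, 1], 1))  # 1

# The same loop with a lumped delay of two cycles can be scheduled.
print(extra_loop_latency([1, 1], 2))  # 0
```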
The folded IIR filter design example (demo_iir_fold2) demonstrates one channel, at
a low data rate. This design example implements a single-channel infinite impulse
response (IIR) filter with a subsystem built from Primitive blocks folded down to a
serial implementation.
The design of the IIR is the same as the IIR in the multichannel example, demo_iir.
As the channel count is one, the lumped delays in the feedback loops are all one. If
you run the design at full speed, there is a scheduling problem. With new data arriving
every clock cycle, the lumped delay of one cycle is not enough to allow for pipelining
around the loops. However, the data arrives at a much slower rate than the clock rate,
in this example 32 times slower (the clock rate in the design is 320 MHz, and the
sample rate is 10 MHz), which gives 32 clock cycles between each sample.
You could set the lumped delays to 32 cycles (the gap between successive data
samples), but that is inefficient both in register use and in underused multipliers
and adders. Instead, use folding to schedule the data through a minimum set of fully
used hardware.
Set the SampleRate on both the ChannelIn and ChannelOut blocks to 10 MHz, to
inform the synthesis for the Primitive subsystem of the schedule of data through the
design. Even though the clock rate is 320 MHz, each data sample per channel is
arriving only at 10 MHz. The RTL is folded down—in multiplier use—at the expense of
extra logic for signal multiplexing and extra latency.
8. About DSP Builder for Intel FPGAs Optimization
For example:
C:\Altera\16.0\quartus\dspba\dsp_builder.bat -m "C:\tools\matlab\R2013a\windows64\bin\matlab.exe"
You can copy the shortcut from the Start menu and paste it to your desktop to create
a desktop shortcut. You can edit the properties to use different installed DSP Builder
releases, different MATLAB releases, or different start directories.
Related Information
Starting DSP Builder in MATLAB
my_design_params.clockrate = 200;
my_design_params.samplerate = 50;
my_design_params.inputChannels = 4;
2. Clear the specific workspace variables you create with a clear-up script that runs
when you close the model. Do not use clear all.
For example, if you use the named structure my_design_params, run clear
my_design_params;. You may have other temporary workspace variables to
clear too.
For example, in a script that passes the design name (without the .mdl extension) as
model, you can use:
%% Load the model
load_system(model);
%% Get the Signals block
signals = find_system(model, 'type', 'block', 'MaskType', ...
    'DSP Builder Advanced Blockset Signals Block');
if (isempty(signals))
    error('The design must contain a Signals Block. ');
end;
%% Get the Control block
control = find_system(model, 'type', 'block', 'MaskType', ...
    'DSP Builder Advanced Blockset Control Block');
if (isempty(control))
    error('The design must contain a Control Block. ');
end;
%% Example: set the RTL destination directory
%% (freq is assumed to be defined earlier in your script)
rtlDir = ['../rtl' num2str(freq)];
dspba.SetRTLDestDir(model, rtlDir);
Similarly you can get and set other parameters. For example, on the Signals block
you can set the target clock frequency:
fmax_freq = 300.0;
dspba.set_param(signals{1}, 'freq', fmax_freq);
You can also change the following threshold values that are parameters on the
Control block:
• distRamThresholdBits
• hardMultiplierThresholdLuts
• mlabThresholdBits
• ramThresholdBits
You can loop over changing these values, change the destination directory, run the
Quartus Prime software each time, and perform design space exploration. For
example:
%% Run a simulation, which also does the RTL generation.
t = sim(model);
%% Then run the Quartus Prime compilation flow.
[success, details] = run_hw_compilation(model, './')
%% where details is a struct containing resource and timing information
details.Logic,
details.Comb_Aluts,
details.Mem_Aluts,
details.Regs,
details.ALM,
details.DSP_18bit,
details.Mem_Bits,
details.M9K,
details.M144K,
details.IO,
details.FMax,
details.Slack,
details.Required,
details.FMax_unres,
details.timingpath,
details.dir,
details.command,
details.pwd
such that >> disp(details) gives output something like:
Logic: 4915
Comb_Aluts: 3213
Mem_Aluts: 377
Regs: 4725
ALM: 2952
DSP_18bit: 68
Mem_Bits: 719278
M9K: 97
M144K: 0
IO: 116
FMax: 220.1700
Slack: 0.4581
Required: 200
FMax_unres: 220.1700
timingpath: [1x4146 char]
dir: '../quartus_demo_ifft_4096_for_SPR_FFT_4K_n_2'
command: [1x266 char]
pwd: 'D:\test\script'
Note: The Timing Report is in the timingpath variable, which you can display by
disp(details.timingpath). Unused resources may appear as -1, rather than 0.
A useful set of commands to generate RTL, compile in the Quartus Prime software and
return the details is:
load_system(<model>);
sim(<model>);
[success, details] = run_hw_compilation(<model>, './')
Based on the FPGA clock rate and data sample rates, you can derive how many clock
cycles are available to process unique data samples. This parameter is called Period in
many of the design examples. For example, for a period of three, a new sample for
the same channel appears every three clock cycles. For multiplication, you have three
clock cycles to compute one multiplication for this channel. In a design with multiple
channels, you can accommodate three different channels with just one multiplier. A
resource reuse potential exists when the period is greater than one.
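The period calculation is simple enough to sketch. The Python below is an illustration with hypothetical helper names; it assumes the clock rate is an integer multiple of the sample rate (otherwise it rounds down) and that one multiplier can serve one channel per clock cycle.

```python
import math


def period(clock_rate_mhz, sample_rate_mhz):
    """Clock cycles available per sample on one channel (rounded down)."""
    return int(clock_rate_mhz // sample_rate_mhz)


def shared_multipliers(channels, cycles_per_sample):
    """Multipliers needed when each one is time-shared across channels."""
    return math.ceil(channels / cycles_per_sample)


print(period(200, 50))            # 4
print(period(320, 10))            # 32
print(shared_multipliers(3, 3))   # 1
print(shared_multipliers(4, 3))   # 2
```

With a period of three, three channels fit on one multiplier, while a fourth channel would require a second one.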
1. Define the following parameters:
The Simulink Model Info block displays revision control information about a model as
an annotation block in the model's block diagram. It shows revision control
information embedded in the model and information maintained by an external
revision control or configuration management system.
You can customize some revision control tools to use the Simulink report generator
XML comparison, which allows you to compare two versions of the same file.
Note: You do not need to archive autogenerated files such as Quartus Prime project files or
synthesizable RTL files.
the latency after you complete part of a DSP Builder design, for example on an IP
library block or for a Primitive subsystem. In other cases, you may want to limit the
latency in advance, which allows future changes to other subsystems without causing
undesirable effects upon the overall design.
To accommodate extra latency, insert registers. This feature applies only to Primitive
subsystems. To access, use the Synthesis Info block.
Latency is the number of delays in the valid signal across the subsystem. The DSP
Builder advanced blockset balances delays in the valid and channel path with
delays that DSP Builder inserts for autopipelining in the datapath.
Note: User-inserted sample delays in the datapath are part of the algorithm, rather than
pipelining, and are not balanced. However, any uniform delays that you insert across
the entire datapath optimize out. If you want to constrain the latency across the entire
datapath, you can specify this latency constraint in the SynthesisInfo block.
If the valid input directly drives the valid output, the delay on the valid signal matches
the latency displayed on the ChannelOut block. The delay does not match if the valid output is
generated in any other way, for example by using a Sequence block.
For example, the 4K FFT design example uses a Sequence block to drive the valid
signal explicitly.
The latency that the ChannelOut block reports is therefore not 4096 + the automatic
pipelining value, but just the pipelining value.
In this example, the Mult block has a direct feed-through simulation model, and the
following SampleDelay block has a delay of 10. The Mult block has zero delay in
simulation, followed by a delay of 10. In the generated hardware, DSP Builder
distributes part of this 10-stage pipelining throughout the multiplier optimally, such
that the Mult block has a delay (in this case, four pipelining stages) and the
SampleDelay block a delay (in this case, six pipelining stages). The overall result is
the same—10 pipelining stages, but if you try to match signals in the primitive
subsystem against hardware, you may find DSP Builder shifts them by several cycles.
Similarly, if you have insufficient user-inserted delay to meet the required fMAX, DSP
Builder automatically pipelines and balances the delays, and then corrects the cycle-
accuracy of the primitive subsystem as a whole, by delaying the output signals in
simulation by the appropriate number of cycles at the ChannelOut block.
If you specify no pipelining, the simulation model for the multiplier is direct
feed-through, and the result appears on the output immediately.
To reach the desired fMAX, DSP Builder then inserts four pipelining stages in the
multiplier, and balances these with four registers on the channel and valid paths.
To correct the simulation model to match hardware, the ChannelOut block
delays the outputs by four cycles in simulation and displays Lat: 4 on the block. Thus,
if you compare the output of the multiplier simulation with the hardware, it is now four
cycles early in simulation; but if you compare the primitive subsystem outputs with
hardware, they match, because the ChannelOut block provides the simulation
correction for the automatically inserted pipelining.
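The two multiplier cases can be summarized with a little bookkeeping. The Python sketch below is a simplified model of the accounting described in the text, not of DSP Builder's scheduler: it assumes a single multiplier followed by one user-inserted SampleDelay, and the function and field names are hypothetical.

```python
def distribute_pipelining(user_delay, required_stages):
    """Split pipelining between a multiplier and the SampleDelay that follows it.

    user_delay: cycles in the user-inserted SampleDelay block.
    required_stages: pipelining stages the multiplier needs to reach fMAX.
    """
    absorbed = min(user_delay, required_stages)  # stages taken from the SampleDelay
    return {
        "mult_stages": required_stages,
        "sample_delay_left": user_delay - absorbed,
        # Extra cycles of automatic pipelining that simulation must compensate for.
        "simulation_correction": required_stages - absorbed,
    }


# Delay of 10 after the multiplier: 4 stages move into the multiplier, 6 remain.
print(distribute_pipelining(10, 4))

# No user delay: 4 stages are inserted automatically and corrected in simulation.
print(distribute_pipelining(0, 4))
```

In the second case the correction of four cycles corresponds to the Lat: 4 value that the ChannelOut block displays.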
If you want a consistent 10 cycles of delay across the valid, channel and
datapath, you may need latency constraints.
This example has a consistent line of SampleDelay blocks inserted across the design.
However, the algorithm does not use these delays. DSP Builder recognizes that
the design does not require them and optimizes them away, leaving only the delay that
the design requires. In this case, each block requires a delay of four, to balance the four
delay stages to pipeline the multiplier sufficiently to reach the target fMAX. The delay of
10 in simulation remains from the non-direct-feed-through SampleDelay blocks. In
such cases, you receive the following warning on the MATLAB command line:
DSP Builder optimizes away some user inserted SampleDelays. The latency on the
valid path across primitive subsystem design name in hardware is 4, which may
differ from the simulation model. If you need to preserve extra SampleDelay
blocks in this case, use the Constraint Latency option on the SynthesisInfo
block.
Note: SampleDelay blocks reset to unknown values ('X'), not to zero. Designs that rely on a
SampleDelay output of zero after reset may not behave correctly in hardware. Use
the valid signal to indicate valid data and its propagation through the design.
Generally, problems occur in feedback loops. You can solve these issues by lowering
the fMAX target, or by restructuring the feedback loop to reduce the combinatorial logic
or increasing the delay. You can redesign some control structures that have feedback
loops to make them completely feed forward.
You cannot set a latency constraint that conflicts with the constraint that the fMAX
target implies. For example, a latency constraint of < 2 may conflict with the fMAX
implied pipelining constraint. The multiplier may need four pipelining stages to reach
the target fMAX. The simulation fails and issues an error, highlighting the Primitive
subsystem.
DSP Builder gives this error because you must increase the constraint limit by at least
3 (that is, to < 5) to meet the target fMAX.
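The arithmetic behind this error can be written out as follows. This Python sketch only reproduces the numbers in the example; it assumes the constraint has the form "latency < limit" with integer latencies, and the function name is hypothetical.

```python
def required_limit_increase(limit, required_stages):
    """How far a 'latency < limit' constraint must be raised to allow the
    pipelining that the fMAX target implies. Returns 0 if it already fits."""
    # 'latency < limit' permits at most limit - 1 cycles of latency.
    return max(0, required_stages - (limit - 1))


increase = required_limit_increase(limit=2, required_stages=4)
print(increase)       # 3
print(2 + increase)   # 5, that is, the constraint becomes latency < 5
```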
DSP Builder relocates the sample delay, to save registers, to the Boolean signal that
drives the s-input of the 2-to-1 Mux block. You may see a mismatch in the first cycle
and beyond, depending on the contents of the LUT.
When you design a control unit as an FSM, the locations of SampleDelay blocks
specify where DSP Builder expects zero values during the first cycle. In Figure 77 on
page 193, DSP Builder expects the first sample that the a-input of the CmpGE block
receives to be zero. Therefore, the first output value of that compare block is
high. Delay redistribution changes this initialization. You cannot rely on the reset state
of that block, especially if you embed the Primitive subsystem within a larger design.
Other subsystems may drive the feedback loop whose pipeline depth adapts to fMAX.
The first valid sample may only enter this subsystem after some arbitrary number of
cycles that you cannot predetermine. To avoid this problem, always ensure you anchor
the SampleDelay blocks to the valid signal so that the control unit enters a well-
defined state when valid-in first goes high.
To make a control unit design resistant to automated delay redistribution and to solve
most hardware FSM designs that fail to match simulation, replace every SampleDelay
block with the Anchored Delay block from the Control folder in the Additional
libraries. When the valid-in first goes high, the Anchored Delay block outputs one (or
more) zeros, otherwise it behaves just like an ordinary SampleDelay block.
Synthesizing the example design (fMAX = 250 MHz) on Arria V (speedgrade 4) shows
that DSP Builder still redistributes the delays inside the Anchored
Delay block to minimize register utilization. DSP Builder still inserts a register
initialized to zero before the s-input of the 2-to-1 Mux block. However, the hardware
continues to match Simulink simulation because of the anchoring. If you place highly
pipelined subsystems upstream so that the control unit doesn't enter its first state
until several cycles after device initialization, the FSM still provides correct outputs.
Synchronization is maintained because DSP Builder inserts balancing delays on the
valid-in wire that drives the Anchored Delay and forces the control unit to enter its
initial state the correct number of cycles later.
Control units that use this design methodology are also robust to optimizations that
alter the latency of components. For example, when a LUT block grows sufficiently
large, DSP Builder synthesizes a DualMem block in its place that has a latency of at
least one cycle. Automated delay balancing inserts a sufficient number of one bit wide
delays on the valid signal control path inside every Anchored Delay. Hence, even if
the CmpGE block is registered, its reset state has no influence on the initial state of
the control unit when the valid-in first goes high.
Each Anchored Delay introduces a 2-to-1 Mux block in the control path. When
targeting a high fMAX (or slow device) tight feedback loops may fail to schedule or
meet timing. Using Anchored Delay blocks in place of SampleDelay blocks may also
use more registers and can also contribute to routing congestion.
This style uses FIFO buffers for capturing and flow control of valid outputs, loops, and
for loops, for simple and complex nested counter structures. Also add latches to
enable only components with state—thus minimizing enable line fan-out, which can
otherwise be a bottleneck to performance.
Often designs need to stall or enable signals. Routing an enable signal to all the blocks
in the design can lead to high fan-out nets, which become the critical timing path in
the design. To avoid this situation, enable only blocks with state, while marking output
data as invalid when necessary.
DSP Builder provides the following utility functions in the Additional Blocks Control
library, which are masked subsystems.
• Zero-Latency Latch (latch_0L)
• Single-Cycle Latency Latch (latch_1L)
• Reset-Priority Latch (SRlatch_PS)
• Set-Priority Latch (SRlatch)
Some of these blocks use the Simulink Data Type Prop Duplicate block, which takes
the data type of a reference signal ref and back propagates it to another signal prop.
Use this feature to match data types without forcing an explicit type that you can use
in other areas of your design.
You can use FIFO buffers to build flexible, self-timed designs insensitive to latency.
They are an essential component in building parameterizable designs with feedback,
such as those that implement back pressure.
You must not acknowledge a read while the output data is invalid. Consider a FIFO buffer with the
following parameters:
• Depth = 8
• Fill threshold = 2
• Fill period = 7
A three cycle latency exists between the first write and valid going high. The q output
has a similar latency in response to writes. The latency in response to read
acknowledgements is only one cycle for all output ports. The valid out goes low in
response to the first read, even though the design writes two items to the FIFO buffer.
The second write is not older than three cycles when the read occurs.
With the fill threshold set to a low value, the t output can go high even though the v
out is still zero. Also, the q output stays at the last value read when valid goes low in
response to a read.
Problems can occur when you use no feedback on the read line, or if you take the
feedback from the t output instead with fill threshold set to a very low value (< 3). A
situation may arise where a read acknowledgement is received shortly following a
write but before the valid output goes high. In this situation, the internal state of the
FIFO buffer does not recover for many cycles. Instead of attempting to reproduce this
behavior, Simulink issues a warning when a read acknowledgement is received while
valid output is zero. This intermediate state between the first write to an empty FIFO
buffer and the valid going high, highlights that the input to output latency across the
FIFO buffer is different in this case. This situation is the only time when the FIFO
buffer behaves with a latency greater than one cycle. With other primitive blocks,
which have consistent constant latency across each input to output path, you never
have to consider these intermediate states.
You can mitigate this issue by taking care when using the FIFO buffer. The model
needs to ensure that the read is never high when valid is low using the simple
feedback. If you derive the read input from the t output, ensure that you use a
sufficiently high threshold.
You can set fill threshold to a low number (<3) and arrive at a state where output t is
high and output v is low, because of differences in latency across different pairs of
ports—from w to v is three cycles, from r to t is one cycle, from w to t is one cycle. If
this situation arises, do not send a read acknowledgement signal to the FIFO buffer.
Ensure that when the v output is low, the r input is also low. A warning appears in the
MATLAB command window if you ever violate this rule. If you derive the read
acknowledgement signal with a feedback from the t output, ensure that the fill
threshold is set to a sufficiently high number (3 or above). The same consideration
applies to the f output and the full period.
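The effect of the different port latencies can be illustrated with a deliberately simplified Python model. It assumes only the latencies quoted above (w to t is one cycle, w to v is three cycles), assumes that t goes high when the counted items reach the fill threshold, and ignores reads entirely. The function names are hypothetical and the model is not the behavior of the actual FIFO block.

```python
def t_v_trace(write_cycles, fill_threshold, cycles, w_to_t=1, w_to_v=3):
    """Return a list of (t, v) pairs, one per clock cycle.

    Simplified assumption: a write at cycle w is counted towards t
    from cycle w + w_to_t and becomes readable (v) from w + w_to_v.
    """
    trace = []
    for cycle in range(cycles):
        counted = sum(1 for w in write_cycles if cycle - w >= w_to_t)
        readable = sum(1 for w in write_cycles if cycle - w >= w_to_v)
        trace.append((counted >= fill_threshold, readable > 0))
    return trace


def t_leads_v(trace):
    """True if any cycle has t high while v is still low."""
    return any(t and not v for t, v in trace)


# Worst case for this model: one write on every cycle from cycle 0.
writes = list(range(8))
for threshold in (1, 2, 3, 4):
    print(threshold, t_leads_v(t_v_trace(writes, threshold, cycles=8)))
```

Under these assumptions, thresholds of 1 and 2 produce cycles where t is high and v is low, while a threshold of 3 or more does not, which matches the guidance in the text.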
If you supply vector data to the d input, you see vector data on the q output. DSP
Builder does not support vector signals on the w or r inputs, as the behavior is
unspecified. The v, t, and f outputs are always scalar.
The enable input and demo_kronecker design example demonstrate flow control
using a loop.
You can use either Loop or ForLoop blocks for building nested loops.
When a stack of nested loops is the appropriate control structure (for example, matrix
multiplication) use a single Loop block. When a more complex control structure is
required, use multiple ForLoop blocks.
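As a software analogy for a stack of nested loops, the following Python sketch enumerates the index tuples of a matrix multiplication from a single list of loop limits. It does not model the ports or timing of the Loop block; the helper names are hypothetical.

```python
from itertools import product


def nested_loop_indices(limits):
    """Yield index tuples for a stack of nested loops.

    The last index changes fastest, as in ordinary nested for-loops.
    """
    return product(*(range(n) for n in limits))


def matmul(a, b):
    """Multiply two matrices using one flat stream of (i, j, k) indices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for i, j, k in nested_loop_indices((rows, cols, inner)):
        c[i][j] += a[i][k] * b[k][j]
    return c


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Because the whole iteration space comes from one vector of limits, a single counter structure is enough; irregular control flow that does not fit this pattern is where separate loops become necessary.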
DSP Builder distinguishes control flow from data flow: control flow is the logic you
connect to the ChannelIn and ChannelOut valid signal path. DSP Builder applies
little or no reset minimization to control logic and aggressive minimization to data flow.
By default, DSP Builder chooses reset minimization options for you automatically. It
automatically applies reset minimization if your target device includes the HyperFlex
architecture.
You may override the default automatic reset minimization options, for example as
part of design space optimization.
When you globally apply reset minimization, DSP Builder determines a local reset
minimization setting for each of your synthesizable subsystems. DSP Builder applies
this local reset minimization conditionally, if your subsystem contains ChannelIn or
ChannelOut blocks.
On Off Any No
DSP Builder does not apply reset minimization to blocks with innate state, user-
constructed cycles, and enable logic in your design, as that can give undefined initial
values.
Reset minimization only detects local cycles within a subsystem. You should avoid
broader feedback cycles.
Reset minimization may affect the behavior of your design during Simulink simulation
and on hardware.
Simulink Simulation
The DSP Builder simulation engine within Simulink is unaware of the reset
minimization optimization and therefore always simulates your design behavior with
reset present.
In general, the behavior does not differ. Testbench inputs typically default to zero,
and a longer minimum reset pulse width allows those defaults to propagate through
the datapath register stages.
However, in some cases mismatches may occur because data entering a Sample
Delay in your design during reset is non-zero.
If an input does not default to zero, the internal behavior is incompatible with
Sample Delay blocks resetting to zeros, or the minimum reset pulse width is less
than the design latency, the Simulink simulation might differ from the HDL
simulation.
Implementation on Hardware
Removing a reset on the datapath means that when DSP Builder releases a reset, your
data flow logic may contain values clocked in during reset, which might affect the
initial post-reset behavior of your system.
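The following Python sketch illustrates how a datapath delay line with and without reset can diverge after reset release. It is a toy model under stated assumptions, not DSP Builder behavior: the function name, the power_up value standing in for unknown register contents, and the convention that reset is asserted for the first reset_cycles cycles are all hypothetical.

```python
def simulate_delay_line(samples, reset_cycles, depth, has_reset, power_up=7):
    """Return the delay-line outputs seen after reset is released.

    Registers with reset are forced to zero while reset is asserted.
    Registers without reset keep clocking in the input during reset.
    """
    regs = [power_up] * depth  # arbitrary stand-in for unknown contents
    outputs = []
    for cycle, sample in enumerate(samples):
        outputs.append(regs[-1])  # output of the last register stage
        if cycle < reset_cycles and has_reset:
            regs = [0] * depth
        else:
            regs = [sample] + regs[:-1]
    return outputs[reset_cycles:]


# Zero inputs and a reset pulse at least as long as the delay: no mismatch.
zeros = [0, 0, 0, 0, 1, 2, 3, 4, 5]
assert simulate_delay_line(zeros, 4, 3, True) == simulate_delay_line(zeros, 4, 3, False)

# Non-zero data during reset: the version without reset replays it afterwards.
busy = [9, 9, 9, 9, 1, 2, 3, 4, 5]
print(simulate_delay_line(busy, 4, 3, True))   # [0, 0, 0, 1, 2]
print(simulate_delay_line(busy, 4, 3, False))  # [9, 9, 9, 1, 2]
```

The same model also diverges when the inputs are zero but the reset pulse is shorter than the delay depth, because the unknown initial contents have not yet been flushed.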
Related Information
• Control on page 221
• Synthesis Information (SynthesisInfo) on page 347
Additionally, your HDL must conform to DSP Builder design rules and must:
• Have only one clock domain
• Match reset level with DSP Builder
• Use the std_logic data type for clock and reset ports
• Use std_logic_vector for all other ports
• Have no top-level generics
• Contain no bus components
You may need to write a wrapper HDL file that instantiates your HDL, which might
configure generics, convert from other data types to std_logic_vector, or invert the
reset signal.
DSP Builder can import any number of instantiated entities. To import multiple
copies of an entity or multiple distinct entities, instantiate the entities in a top-
level wrapper file.
Simulink does not model all the signal states that ModelSim uses (e.g. ‘U’).
Simulink interprets all non-‘1’ states as a ‘0’.
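The mapping described above can be written as a one-line Python function. This is only a sketch of the stated rule (every state other than '1' reads as 0); the function name is hypothetical and the list of states is the standard nine-value std_logic set.

```python
def std_logic_to_simulink_bit(state: str) -> int:
    """Map a std_logic state character to the bit value seen in Simulink.

    Per the rule in the text, only '1' maps to 1; every other state,
    including 'U', 'X', 'Z', and 'H', maps to 0.
    """
    return 1 if state == "1" else 0


STD_LOGIC_STATES = ["U", "X", "0", "1", "Z", "W", "L", "H", "-"]
print({s: std_logic_to_simulink_bit(s) for s in STD_LOGIC_STATES})
```

One consequence is that uninitialized signals in the HDL simulation appear as clean zeros in Simulink, so an apparent match in Simulink does not by itself show that the HDL signal was driven.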
Importing HDL uses the HDL Verifier toolbox to communicate with an HDL simulation
running in ModelSim. You can have as many components in your ModelSim simulation
as you like; each component communicates with a separate DSP Builder HDL Import
block. Your top-level design must include an HDL Import Config block.
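Figure: HDL import cosimulation between Simulink and ModelSim. The Simulink side shows
Source, Control, and Sink blocks with DSP Builder Advanced subsystems containing HDL
Import blocks; the ModelSim side shows Component 0 and Component 1.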
You cannot place HDL Import blocks inside a primitive scheduled subsystem.
DSP Builder creates the appropriate instantiation of the component represented by the
HDL Import block.
DSP Builder sees imported HDL as a scheduled system. DSP Builder does not try to
schedule your imported HDL. You cannot import HDL into a scheduled subsystem.
Imported HDL acts like other DSP Builder IP blocks (e.g. NCO, FFT). You must
manually delay-balance any parallel datapaths and turn on Generate Hardware in
the Control block.
9. About Folding
Folding optimizes hardware usage for low throughput systems, which have many clock
cycles between data samples. Low throughput systems often inefficiently use
hardware resources. When you map designs that process data as it arrives every clock
cycle to hardware, many hardware resources may be idle for the clock cycles between
data.
Folding allows you to create your design and generate hardware that reuses resources
to create an efficient implementation.
The folding factor is the number of times you reuse a single hardware resource, such
as a multiplier, and it depends on the ratio of the clock and data rates:
folding factor = clock rate / data rate
DSP Builder offers ALU folding for folding factors greater than 500. With ALU folding,
DSP Builder arranges one of each resource in a central arithmetic logic unit (ALU) with
a program to schedule the data through the shared operation.
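As a worked example, the following Python sketch computes a folding factor and checks it against the ALU folding threshold of 500 given above. It assumes the folding factor is the clock rate divided by the data rate; the function names and the example rates are hypothetical.

```python
ALU_FOLDING_MIN_FACTOR = 500  # from the text: ALU folding applies above this


def folding_factor(clock_rate_hz: float, data_rate_hz: float) -> float:
    """Assumed definition: clock cycles available per data sample."""
    if clock_rate_hz <= 0 or data_rate_hz <= 0:
        raise ValueError("rates must be positive")
    return clock_rate_hz / data_rate_hz


def alu_folding_applies(clock_rate_hz: float, data_rate_hz: float) -> bool:
    """True if the folding factor exceeds the ALU folding threshold."""
    return folding_factor(clock_rate_hz, data_rate_hz) > ALU_FOLDING_MIN_FACTOR


# A 240 MHz clock with 100 kHz data gives 2400 cycles per sample.
print(folding_factor(240_000_000, 100_000))       # 2400.0
print(alu_folding_applies(240_000_000, 100_000))  # True
```

With 1 MHz data at the same clock rate the factor drops to 240, which is below the threshold in this sketch.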
ALU folding reduces the resource consumption of a design by as much as it can while
still meeting the latency constraint. The constraint specifies the maximum number of
clock cycles a system with folding takes to process a packet. If ALU folding cannot
meet this latency constraint, or if ALU folding cannot meet a latency constraint
internal to the DSP Builder system due to a feedback loop, you see an error message
stating it is not possible to schedule the design.