Intel FPGA AI Suite PCIe-based Design Example User Guide
Contents
3. Getting Started with the Intel FPGA AI Suite PCIe-based Design Example......................6
B. Intel FPGA AI Suite PCIe-based Design Example User Guide Document Revision History.........................33
768977 | 2023.12.01
The following sections in this document describe the steps to build and execute the
design:
• Building the Intel FPGA AI Suite Runtime on page 7
• Running the Design Example Demonstration Applications on page 9
The following sections in this document describe design decisions and architectural
details about the design:
• Design Example Components on page 17
• Design Example System Architecture for the Intel PAC with Intel Arria 10 GX FPGA
on page 25
Use this document to understand how to create a PCIe example design with the targeted Intel FPGA AI Suite architecture and number of instances, and how to compile the design for use with the Intel FPGA Basic Building Blocks (BBBs) system.
Documentation for the Intel FPGA AI Suite is split across a few publications. Use the
following table to find the publication that contains the Intel FPGA AI Suite information
that you are looking for:
Intel Corporation. All rights reserved. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Intel warrants performance of its FPGA and semiconductor products to current specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.
ISO 9001:2015 Registered
*Other names and brands may be claimed as the property of others.
1. Intel® FPGA AI Suite PCIe-based Design Example User Guide
To use the Intel FPGA AI Suite, you must be familiar with the Intel Distribution of
OpenVINO toolkit.
Intel FPGA AI Suite Version 2023.3 requires the Intel Distribution of OpenVINO toolkit Version 2022.3.1 LTS. For OpenVINO documentation, refer to https://docs.openvino.ai/2022.3/documentation.html.
The PCIe-based design example (Intel Arria 10) is implemented with the following
components:
• Intel FPGA AI Suite IP
• Intel Acceleration Stack for Intel Xeon CPU with FPGAs
• Open Programmable Acceleration Engine (OPAE) components:
— OPAE libraries
— Intel FPGA Basic Building Blocks (BBB)
• Intel Distribution of OpenVINO toolkit
• Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA
• Sample hardware and software systems that illustrate the use of these
components
The PCIe-based design example (Intel Agilex 7) is implemented with the following
components:
• Intel FPGA AI Suite IP
• Intel Distribution of OpenVINO toolkit
• Terasic DE10-Agilex-B2E2 board
• Sample hardware and software systems that illustrate the use of these
components
This design example includes pre-built FPGA bitstreams that correspond to pre-
optimized architecture files. However, the design example build scripts let you choose
from a variety of architecture files and build (or rebuild) your own bitstreams,
provided that you have a license permitting bitstream generation.
This design is provided with the Intel FPGA AI Suite as an example that shows how to incorporate the IP into a design. The design is not intended for unaltered use in production scenarios. Before using any portion of this example design in a production application, review that portion for both robustness and security.
3. Getting Started with the Intel FPGA AI Suite PCIe-based Design Example
Before starting with the Intel FPGA AI Suite PCIe-based Design Example, ensure that
you have followed all the installation instructions for the Intel FPGA AI Suite compiler
and IP generation tools and completed the design example prerequisites as provided
in the Intel FPGA AI Suite Getting Started Guide.
The CMake tool manages the overall build flow to build the Intel FPGA AI Suite
runtime plugin.
The flow also builds additional targets as dependencies for the top-level target. The
most significant additional targets are:
• The OPAE-based MMD library, libintel_opae_mmd.so. The source files for this
target are under runtime/coredla_device/mmd/.
• The Input and Output Layout Transform library,
libdliaPluginIOTransformations.a. The sources for this target are under
runtime/plugin/io_transformations/.
4. Building the Intel FPGA AI Suite Runtime
--disable_jit
    If this flag is specified, the runtime supports only the Ahead-of-Time (AOT) mode and does not link to the precompiled compiler libraries. Use this mode when compiling the runtime on an unsupported operating system.

--aot_splitter_example
    Builds the AOT splitter example utility for the selected target (Intel PAC with Intel Arria 10 GX FPGA or Terasic DE10-Agilex Development Board). This option builds an AOT file for a model, splits the AOT file into its constituent components (weights, overlay instructions, etc.), and then builds a small utility that loads the model and a single image onto the target FPGA board without using OpenVINO.
    You must set the $AOT_SPLITTER_EXAMPLE_MODEL and $AOT_SPLITTER_EXAMPLE_INPUT environment variables correctly. For details, refer to "Intel FPGA AI Suite Ahead-of-Time (AOT) Splitter Utility Example Application" in the Intel FPGA AI Suite IP Reference Manual.
The Intel FPGA AI Suite runtime plugin is built in release mode by default. To enable
debug mode, you must specify the -cmake_debug option of the script command.
The -no_make option skips the final call to the make command. You can make this
call manually instead.
Intel FPGA AI Suite hardware is compiled to include one or more IP instances, with the
same architecture for all instances. Each instance accesses data from a unique bank of
DDR:
• An Intel Programmable Acceleration Card (PAC) with Intel Arria 10 GX FPGA has
two DDR banks and supports two instances.
• The Terasic DE10-Agilex board supports up to four instances.
If the Intel FPGA AI Suite runtime uses two or more instances, the image batches are divided between the instances so that two or more batches execute in parallel on the FPGA device.
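The division of batches between instances can be modeled as a simple round-robin assignment. The following Python sketch is illustrative only (the function name is invented, and the actual runtime scheduling may differ):

```python
def assign_batches(num_batches, num_instances):
    """Round-robin assignment of batch indices to IP instances.

    Instance i receives every batch whose index is congruent to i
    modulo the number of instances.
    """
    return {i: [b for b in range(num_batches) if b % num_instances == i]
            for i in range(num_instances)}
```

With two instances and six batches, each instance receives three batches, so both DDR banks stay busy.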
For details on creating the .bin/.xml files, refer to the Intel FPGA AI Suite Getting
Started Guide.
The Intel FPGA AI Suite compiler compiles the network and exports it to a .bin file
that uses the same .bin format as required by the OpenVINO™ Inference Engine.
This .bin file created by the compiler contains the compiled network parameters for
all the target devices (FPGA, CPU, or both) along with the weights and biases. The
inference application imports this file at runtime.
The Intel FPGA AI Suite compiler can also compile the graph and provide estimated
area or performance metrics for a given Architecture File or produce an optimized
Architecture File.
For more details about the Intel FPGA AI Suite compiler, refer to the Intel FPGA AI
Suite Compiler Reference Manual.
To build example design bitstreams, you must have a license that permits bitstream
generation for the IP, and have the correct version of Quartus installed. Use the
dla_build_example_design.py utility to create a bitstream.
For more details about this command, the steps it performs, and advanced command
options, refer to Build Script and to the Intel FPGA AI Suite Getting Started Guide.
Intel Corporation. All rights reserved. Intel, the Intel logo, and other Intel marks are trademarks of Intel
Corporation or its subsidiaries. Intel warrants performance of its FPGA and semiconductor products to current
specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any ISO
products and services at any time without notice. Intel assumes no responsibility or liability arising out of the 9001:2015
application or use of any information, product, or service described herein except as expressly agreed to in Registered
writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying
on any published information and before placing orders for products or services.
*Other names and brands may be claimed as the property of others.
5. Running the Design Example Demonstration Applications
The OPAE sample application programs are described in the Intel Acceleration Stack
Quick Start Guide for Intel Programmable Acceleration Card with Intel Arria 10 GX
FPGA.
If the Intel PAC is connected and has sufficient cooling, then you can program a
bitstream using the fpgaconf command. If the demonstration bitstreams (which
correspond to the architectures in the example_architectures/ directory) were
installed, then you can use them. You can also compile a new bitstream, as described
in section Compiling the PCIe-based Example Design on page 9.
You can program the design example bitstreams by using the fpgaconf command.
The Intel PAC with Intel Arria 10 GX FPGA requires server-level cooling with a dual-fan
graphics card cooler or another appropriate fan. Contact your Intel representative for
specific cooling solution suggestions and quote case 14016255788 when contacting
your representative.
If the supplementary cooling is not sufficient, the PAC will hang during inference. Until
you are certain that the cooling solution is sufficient, monitor the temperature using
the command: sudo fpgainfo temp
For details, refer to “Intel FPGA AI Suite Quick Start Tutorial” in the Intel FPGA AI
Suite Getting Started Guide.
Each graph can either have its own input dataset or share a common dataset with all other graphs. Each graph also requires its own ground_truth_file file; specify the files as a comma-separated list. If some ground_truth_file files are missing, the dla_benchmark continues to run and ignores the missing files.
When multi-graph mode is enabled, the -niter flag specifies the number of iterations for each graph, so the total number of iterations becomes -niter × the number of graphs.
The board you use determines the number of instances that you can compile the Intel
FPGA AI Suite hardware for:
• For the Intel PAC with Intel Arria 10 GX FPGA, you can compile up to two
instances with the same architecture on all instances.
• For the Terasic DE10-Agilex Development Board, you can compile up to four
instances with the same architecture on all instances.
Each instance accesses its own DDR bank and executes the graph independently, which enables multiple batches to run in parallel. Each inference request created by the demonstration application is assigned to one of the instances in the FPGA plugin.
To ensure that batches are evenly distributed between the instances, choose an inference request batch size that is a multiple of the number of Intel FPGA AI Suite instances. For example, with two instances, specify a batch size of six (instead of the OpenVINO default of five) to meet this requirement.
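The batch-size rule above can be expressed as a small helper that rounds a requested batch size up to the nearest multiple of the instance count. This sketch is illustrative only and is not part of the runtime:

```python
def round_up_batch(batch_size, num_instances):
    """Round batch_size up to the nearest multiple of num_instances."""
    remainder = batch_size % num_instances
    if remainder == 0:
        return batch_size
    # Pad the batch so every instance receives the same number of batches.
    return batch_size + (num_instances - remainder)
```

For two instances, a requested batch size of five rounds up to six, matching the example in the text.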
The following example usage assumes that a Model Optimizer IR .xml file has been placed in demo/models/public/resnet-50-tf/FP32/. It also assumes that an image set has been placed into demo/sample_images/. Lastly, it assumes that the FPGA device has been programmed with a bitstream corresponding to A10_Performance.arch.
binxml=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
imgdir=$COREDLA_ROOT/demo/sample_images
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
-b=1 \
-m $binxml/resnet-50-tf.xml \
-d=HETERO:FPGA,CPU \
-i $imgdir \
-niter=5 \
-plugins_xml_file ./plugins.xml \
-arch_file $COREDLA_ROOT/example_architectures/A10_Performance.arch \
-api=async \
-groundtruth_loc $imgdir/TF_ground_truth.txt \
-perf_est \
-nireq=4 \
-bgr
The following example shows how the IP can dynamically swap between graphs. This example usage assumes that another Model Optimizer IR .xml file has been placed in demo/models/public/resnet-101-tf/FP32/. It also assumes that another image set has been placed into demo/sample_images_rn101/. In this case, dla_benchmark evaluates the classification accuracy of only ResNet-50 because no ground truth file is provided for the second graph (ResNet-101).
binxml1=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
binxml2=$COREDLA_ROOT/demo/models/public/resnet-101-tf/FP32
imgdir1=$COREDLA_ROOT/demo/sample_images
imgdir2=$COREDLA_ROOT/demo/sample_images_rn101
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
-b=1 \
-m $binxml1/resnet-50-tf.xml,$binxml2/resnet-101-tf.xml \
-d=HETERO:FPGA,CPU \
-i $imgdir1,$imgdir2 \
-niter=5 \
-plugins_xml_file ./plugins.xml \
-arch_file $COREDLA_ROOT/example_architectures/A10_Performance.arch \
-api=async \
-groundtruth_loc $imgdir1/TF_ground_truth.txt \
-perf_est \
-nireq=4 \
-bgr
The -enable_object_detection_ap flag lets the dla_benchmark calculate the mAP and COCO AP for object detection graphs. In addition, you must specify the version of the YOLO graph that you provide to the dla_benchmark through the -yolo_version flag. Currently, this routine is known to work with YOLOv3 (graph version yolo-v3-tf) and TinyYOLOv3 (graph version yolo-v3-tiny-tf).
The dla_benchmark application uses two metrics for accuracy evaluation. The mean average precision (mAP) is the challenge metric for PASCAL VOC. The mAP value is averaged over all 80 categories using a single IoU threshold of 0.5. The COCO AP is the primary challenge metric for object detection in the Common Objects in Context (COCO) contest. The COCO AP value averages over 10 IoU thresholds, from 0.50 to 0.95 in steps of 0.05 (.50:.05:.95). Averaging over multiple IoU thresholds rewards detectors with better localization.
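The .50:.05:.95 notation expands to ten IoU thresholds, which can be generated as follows (an illustrative sketch, not dla_benchmark code):

```python
def coco_iou_thresholds():
    """Return the ten COCO IoU thresholds: 0.50, 0.55, ..., 0.95."""
    # Step from 0.50 to 0.95 in increments of 0.05; round to avoid
    # floating-point drift in the printed values.
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]
```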
The dla_benchmark application currently allows only plain text ground truth files. To
convert the downloaded JSON annotation file to plain text, use the
convert_annotations.py script.
To compute the accuracy scores on many images, you can usually increase the
number of iterations using the flag -niter instead of a large batch size -b. The
product of the batch size and the number of iterations should be less than or equal to
the number of images that you provide.
cd $COREDLA_ROOT/runtime/build_Release
python ./convert_annotations.py ./instances_val2017.json \
./groundtruth
./dla_benchmark/dla_benchmark \
-b=1 \
-niter=5000 \
-m=./graph.xml \
-d=HETERO:FPGA,CPU \
-i=./mscoco-images \
-plugins_xml_file=./plugins.xml \
-arch_file=../../example_architectures/A10_Performance.arch \
-yolo_version=yolo-v3-tf \
-api=async \
-groundtruth_loc=./groundtruth \
-nireq=4 \
-enable_object_detection_ap \
-perf_est \
-bgr
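The constraint noted earlier, that the product of batch size and iteration count must not exceed the number of images provided, can be checked with a small helper (illustrative only; not part of dla_benchmark):

```python
def iteration_plan_is_valid(batch_size, niter, num_images):
    """True when batch_size * niter does not exceed the image count."""
    return batch_size * niter <= num_images
```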
-arch_file=<FILE>, --arch=<FILE>
    Specifies the location of the .arch file that was used to configure the IP on the FPGA. The dla_benchmark issues an error if this file does not match the .arch file used to generate the IP on the FPGA.

-m=<FILE>, --network_file=<FILE>
    Points to the XML file from the OpenVINO Model Optimizer that describes the graph. The BIN file from Model Optimizer must be kept in the same directory and have the same filename (except for the file extension) as the XML file.

-groundtruth_loc=<FILE>
    Location of the file with ground truth data. If not provided, dla_benchmark does not evaluate accuracy. This file may contain classification data or object detection data, depending on the graph.

-bgr
    Indicates that the graph expects input image channel data in BGR order.

-plugins_xml_file=<FILE>
    Specifies the location of the file that lists the OpenVINO plugins to use. In most cases, set this option to $COREDLA_ROOT/runtime/plugins.xml. If you are porting the design to a new host or doing other development, you might need a different value.
The Intel FPGA AI Suite runtime includes customized versions of the following demo
applications for use with the Intel FPGA AI Suite IP and plugins:
• classification_sample_async
• object_detection_demo_yolov3_async
Each demonstration application uses a different graph. The OpenVINO HETERO plugin can fall back to the CPU for portions of the graph that are not supported with FPGA-based acceleration. However, in a production environment, it may be more efficient to use alternative graphs that execute exclusively on the FPGA.
The example .arch files supplied with the Intel FPGA AI Suite can be used with the demonstration applications. However, certain example .arch files do not enable some of the layer types used by the graphs associated with the demonstration applications. Using these .arch files causes portions of the graph to needlessly execute on the CPU. To minimize the number of layers that are executed on the CPU by the demonstration application, use the following architecture description files, located in the example_architectures/ directory of the Intel FPGA AI Suite installation package, to run the demos:
• Intel Arria 10: A10_Generic.arch
• Intel Agilex 7: AGX7_Generic.arch
As specified in Programming the FPGA Device (Intel Arria 10) on page 10, you must
program the FPGA device with the bitstream for the architecture being used. Each
demonstration application includes a README.md file specifying how to use it.
When the OpenVINO sample applications are modified to support the Intel FPGA AI Suite, the Intel FPGA AI Suite plugin used by OpenVINO must be told where to find the .arch file that describes the IP parameterization. The path is supplied through a configuration key; the demos set it with the following C++ code:
ie.SetConfig({ { DLIA_CONFIG_KEY(ARCH_PATH), FLAGS_arch_file } }, "FPGA");
Model Optimizer generates an FP32 version and an FP16 version. Use the FP32
version.
• Input video from: https://github.com/intel-iot-devkit/sample-videos.
• The recommended video is person-bicycle-car-detection.mp4
1. Ensure that demonstration applications have been built with the following
command:
build_runtime.sh -build-demo
2. Ensure that the FPGA has been configured with the Generic bitstream.
3. Run the following command:
./runtime/build_Release/object_detection_demo/object_detection_demo \
-d HETERO:FPGA,CPU \
-i <path_to_video>/input_video.mp4 \
-m <path_to_model>/yolo_v3.xml \
-arch_file=$COREDLA_ROOT/example_architectures/A10_Generic.arch \
-plugins_xml_file $COREDLA_ROOT/runtime/plugins.xml \
-t 0.65 \
-at yolo
The script generates a wrapper that wraps one or more IP instances along with
adapters necessary to connect to the Terasic DE10-Agilex BSP or to the OPAE BSP
(Intel Arria 10).
The OPAE BSP used by the Intel PAC with Intel Arria 10 GX FPGA board is compatible only with bitstreams for accelerator function units (AFUs) compiled with Intel Quartus® Prime Pro Edition Version 19.2. An AFU and its associated accelerator functions (AFs) are sometimes referred to as the green bitstream or green bits. For more details about OPAE bitstream types, refer to Design Example Microarchitecture.
The DE10-Agilex design is only validated for use with Intel Quartus Prime Pro Edition
Version 23.3. This Intel Agilex 7 design does not use the Intel Quartus Prime partial
reconfiguration feature, unlike the OPAE BSP. The Agilex device has significantly more
resources and can support up to four IP instances.
6. Design Example Components
--build-dir
    Path to the hardware build directory where the BSP infrastructure and generated RTL are located. (default: coredla/pcie_ed/platform/build_synth)

--build
    Performs compilation of the PCIe design using Intel Quartus Prime after instantiation. (default: False)

-d, --archs-dir
    Path to a directory that contains Architecture Description Files for you to interactively choose from (alternative to '-a').

-ed, --example-design-id
    To build for the Intel PAC with Intel Arria 10 GX FPGA board, specify 1. To build for the Terasic DE10-Agilex board, specify 3. (default: 1)

--num-paths
    Number of top critical paths to report after compiling the design. (default: 2000)

-q, --quiet
    Runs the script quietly without printing the output of underlying scripts to the terminal.

--qor-modules
    List of internal modules (instance names) from inside the IP to include in the QoR summary report.

--unlicensed/licensed
    This option is passed to the dla_create_ip tool to tell the tool to generate either an unlicensed or licensed copy of the Intel FPGA AI Suite:
    • Unlicensed IP: Unlicensed IP has a limit of 10000 inferences. After 10000 inferences, the unlicensed IP refuses to perform any additional inference and a bit in the CSR is set. For details about the CSR bit, refer to DMA Descriptor Queue in the Intel FPGA AI Suite IP Reference Manual.
    • Licensed IP: Licensed IP has no inference limitation.
    If you do not have a license but generate licensed IP, Quartus cannot generate a bitstream.
    If neither option is specified, the dla_create_ip tool queries the lmutil license manager to determine the correct option.

--wsl
    Sets up the build script so that the final Intel Quartus Prime compilation runs in the Windows* environment. After the script sets up the compilation, it prints the instructions to complete the compilation on Windows.
    Restriction: Only supported within a WSL 2 environment and for the DE10-Agilex example design.

--finalize
    Restriction: This option can be used only when following the instructions provided by the build script run with the --wsl option.
1. Runs the dla_create_ip script to create an Intel FPGA AI Suite IP for the requested Intel FPGA AI Suite architecture.
2. Creates a wrapper around the Intel FPGA AI Suite IP instances and adapter logic.
3. Runs the OPAE afu_synth_setup script to create a build directory that has the BSP infrastructure needed to compile the design with the Intel Quartus Prime software.
4. Runs the OPAE run.sh script to compile the design example with Intel Quartus
Prime software:
a. Compile the example design up to and including bitstream generation.
b. Analyze the timing report to extract the Intel FPGA AI Suite clock maximum
frequency (fMAX).
c. Create an AFU/AF bitstream (green bitstream) file with PLL configurations to
generate a clock that is slightly lower than the fMAX of the design.
5. Runs the OPAE PACSign script to generate an unsigned version of the AFU/AF
bitstream as well as a signed version.
The script uses the Intel Quartus Prime timing analyzer to report the top critical paths
of the compiled design example.
The unsigned and signed versions of the bitstreams are in the <build_dir> directory
that you set when running the script (or the default location, if you did not set it). The
signed and unsigned bitstream file names are dla_afu.gbs and
dla_afu_unsigned.gbs, respectively.
The Intel Quartus Prime compilation reports are available in the <build_dir>/build/output_files directory. A build.log file that contains the full output log of the build script is available in the <build_dir> directory. In addition, the achieved Intel FPGA AI Suite clock frequency is reported as the clock-frequency-low value in the following file:
<build_dir>/build/output_files/user_clock_freq.txt
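If you need the achieved clock frequency in a script, you can pull the value from that file with a short parser. The sketch below assumes a simple "<key> <value>" line format for user_clock_freq.txt; this format is an assumption, so verify it against your generated file before relying on the parser:

```python
def read_clock_frequency_low(path):
    """Return the clock-frequency-low value from user_clock_freq.txt.

    Assumes lines of the form '<key> <value>'; this format is an
    assumption, not documented behavior.
    """
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[0] == "clock-frequency-low":
                return float(fields[1])
    return None  # key not found
```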
The following figure, Software Stacks for Intel FPGA AI Suite Inference, shows the
complete runtime stack.
For the Intel Arria 10 design example, the following components comprise the runtime
stack:
• OpenVINO Toolkit 2022.3.1 LTS (Inference Engine, Heterogeneous Plugin)
• Intel FPGA AI Suite runtime plugin
• OPAE driver 1.1.2-2
For the Intel Agilex 7 design example, the following components comprise the runtime
stack:
• OpenVINO Toolkit 2022.3.1 LTS (Inference Engine, Heterogeneous Plugin)
• Intel FPGA AI Suite runtime plugin
• Terasic DE10-Agilex-B2E2 board driver
The PCIe-based design example contains the source files and Makefiles to build the
Intel FPGA AI Suite runtime plugin. The other components, OpenVINO and OPAE, are
external and must be manually pre-installed.
A separate flow compiles the AI network graph using the Intel FPGA AI Suite compiler,
as shown in figure Software Stacks for Intel FPGA AI Suite Inference below as the
Compilation Software Stack.
The compilation flow output is a single binary file called CompiledNetwork.bin that
contains the compiled network partitions for FPGA and CPU devices along with the
network weights. The network is compiled for a specific Intel FPGA AI Suite
architecture and batch size. This binary is created on-disk only when using the Ahead-
Of-Time flow; when the JIT flow is used, the compiled object stays in-memory only.
An Architecture File describes the Intel FPGA AI Suite IP architecture to the compiler.
You must specify the same Architecture File to the Intel FPGA AI Suite compiler and to
the Intel FPGA AI Suite PCIe Example Design build script
(dla_build_example_design.py).
The runtime flow accepts the CompiledNetwork.bin file as the input network along
with the image data files.
[Figure: Software Stacks for Intel FPGA AI Suite Inference — the Compilation Software Stack produces the CompiledNetwork.bin file that the Runtime Software Stack consumes.]
The runtime stack cannot program the FPGA with a bitstream. To build a bitstream
and program the FPGA devices:
1. Compile the design example. For details, refer to Compiling the PCIe-based Design
Example.
2. Program the device with the bitstream. For details, refer to Programming the FPGA
Device (Intel Arria 10) on page 10 or Programming the FPGA Device (Intel Agilex
7) on page 10 (depending on your FPGA device).
To run inference through the OpenVINO Toolkit on the FPGA, set the OpenVINO device
configuration flag (used by the heterogeneous Plugin) to FPGA or HETERO:FPGA,CPU.
The source files are located under runtime/plugin. The three main components of
the runtime plugin are the Plugin class, the Executable Network class, and the
Inference Request class. The primary responsibilities for each class are as follows:
Plugin class
• Initializes the runtime plugin with an Intel® FPGA AI Suite Architecture File which
you set as an OpenVINO™ configuration key (refer to Running the Ported
OpenVINO Demonstration Applications on page 14).
• Contains the QueryNetwork function, which analyzes the network layers and returns a list of the layers that the specified architecture supports. This function allows network execution to be distributed between the FPGA and other devices and is enabled by the HETERO mode.
• Creates an executable network instance in one of the following ways:
— Just-in-time (JIT) flow: Compiles a network such that the compiled network is
compatible with the hardware corresponding to the Intel FPGA AI Suite
Architecture File, and then loads the compiled network onto the FPGA device.
— Ahead-of-time (AOT) flow: Imports a precompiled network (exported by Intel
FPGA AI Suite compiler) and loads it onto the FPGA device.
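The layer-partitioning behavior of QueryNetwork can be modeled as filtering a graph's layers against the set of layer types the architecture supports. The following is an illustrative Python model only; the real implementation is C++ inside the runtime plugin, and the names here are invented:

```python
def query_network(layers, supported_types):
    """Return the names of layers that the architecture supports.

    layers: list of (layer_name, layer_type) pairs.
    Layers not returned here fall back to the CPU under HETERO mode.
    """
    return [name for name, layer_type in layers
            if layer_type in supported_types]
```

Under HETERO:FPGA,CPU, the returned layers are scheduled on the FPGA and the remainder on the CPU.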
Related Information
OpenVINO™ Developer Guide for Inference Engine Plugin Library
The runtime source files are located under runtime/coredla_device. The three
most important classes in the runtime are the Device class, the GraphJob class, and
the BatchJob class.
Device class
• Acquires a handle to the MMD for performing operations by calling
aocl_mmd_open.
• Initializes a DDR memory allocator with the size of 1 DDR bank for each Intel
FPGA AI Suite IP instance on the device.
• Implements and registers a callback function on the MMD DMA (host to FPGA)
thread to launch Intel FPGA AI Suite IP for batch=1 after the batch input data is
transferred from host to DDR.
• Implements and registers a callback function (interrupt service routine) on the
MMD kernel interrupt thread to service interrupts from hardware after one batch
job completes.
• Provides the CreateGraphJob function to create a GraphJob object for each Intel
FPGA AI Suite IP instance on the device.
• Provides the WaitForDla(instance id) function to wait for a batch inference
job to complete on a given instance. The function returns immediately if the
number of finished batch jobs (that is, the number of jobs processed by the
interrupt service routine) is greater than the number of batch jobs already
waited on for this instance. Otherwise, the function blocks until the interrupt
service routine signals a completion. Before returning, the function increments
the count of batch jobs waited on for this instance.
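The WaitForDla bookkeeping described above can be sketched with a counter pair per instance and a condition variable. This is an illustrative model, not the actual Device class source; the class and method names are invented.

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

// Sketch of the WaitForDla counting scheme: one "finished" and one
// "waited" counter per IP instance, guarded by a mutex.
class DlaWaitCounters {
public:
    explicit DlaWaitCounters(size_t num_instances)
        : finished_(num_instances, 0), waited_(num_instances, 0) {}

    // Called from the interrupt service routine when a batch job finishes.
    void NotifyJobDone(size_t instance) {
        std::lock_guard<std::mutex> lock(mutex_);
        ++finished_[instance];
        cv_.notify_all();
    }

    // Return once more jobs have finished than have been waited on, then
    // record that one more job has been waited on for this instance.
    void WaitForDla(size_t instance) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return finished_[instance] > waited_[instance]; });
        ++waited_[instance];
    }

    size_t FinishedCount(size_t instance) {
        std::lock_guard<std::mutex> lock(mutex_);
        return finished_[instance];
    }

    size_t WaitedCount(size_t instance) {
        std::lock_guard<std::mutex> lock(mutex_);
        return waited_[instance];
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<size_t> finished_;  // jobs completed by the ISR
    std::vector<size_t> waited_;    // jobs the caller has waited on
};
```

If the interrupt arrives before the host calls WaitForDla, the predicate is already true and the wait returns without blocking, which matches the "returns immediately" case above.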
GraphJob class
• Represents a compiled network that is loaded onto one instance of the Intel FPGA
AI Suite IP on an FPGA device.
• Allocates buffers in DDR memory to transfer configuration, filter, and bias data.
• Creates BatchJob objects for a given number of pipelines and allocates input and
output buffers for each pipeline in DDR.
BatchJob class
• Represents a single batch inference job.
• Stores the DDR addresses for batch input and output data.
• Provides the LoadInputFeatureToDdr function to transfer input data to DDR and
start inference for this batch asynchronously.
• Provides the ReadOutputFeatureFromDdr function to transfer output data from
DDR. This function must be called only after inference for this batch completes.
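The GraphJob/BatchJob relationship above can be sketched as a DDR layout problem: a configuration/filter/bias region for the graph, followed by an input and output buffer pair per pipeline. The types and the bump-allocator layout below are simplified stand-ins for the real runtime classes, purely for illustration.

```cpp
#include <cstddef>
#include <vector>

// Toy model of a per-pipeline batch job: just the DDR addresses of its
// input and output buffers.
struct BatchJob {
    size_t input_addr;
    size_t output_addr;
};

// Toy model of a graph loaded onto one IP instance: a config region plus
// one BatchJob per pipeline.
struct GraphJob {
    size_t config_addr;
    std::vector<BatchJob> pipelines;
};

// Lay out one graph in a DDR bank with a simple bump allocator: the
// config/filter/bias region first, then input+output buffers per pipeline.
GraphJob CreateGraphJob(size_t num_pipelines, size_t config_bytes,
                        size_t io_bytes, size_t* ddr_cursor) {
    GraphJob job;
    job.config_addr = *ddr_cursor;
    *ddr_cursor += config_bytes;
    for (size_t i = 0; i < num_pipelines; ++i) {
        BatchJob b;
        b.input_addr = *ddr_cursor;  *ddr_cursor += io_bytes;
        b.output_addr = *ddr_cursor; *ddr_cursor += io_bytes;
        job.pipelines.push_back(b);
    }
    return job;
}
```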
The source files for the driver are in runtime/coredla_device/mmd. The source
files contain classes for managing and accessing the FPGA device by using BSP
functions for reading/writing to CSR, reading/writing to DDR, and handling kernel
interrupts.
The Intel Programmable Acceleration Card (PAC) with Intel Arria 10 GX FPGA uses
OPAE software libraries as its BSP driver.
To compile the Intel FPGA AI Suite runtime library and run the demonstration
application on the Intel PAC board, you must have the OPAE software libraries
installed on the machine according to the installation instructions under Section 4 -
Installing the OPAE Software Package of the Intel Acceleration Stack Quick Start
Guide.
Contact your Intel representative for information on the driver for the Terasic DE10-
Agilex board support package.
Related Information
Intel Acceleration Stack Quick Start Guide
When porting the runtime to a new board, the team responsible for the new board
support must ensure that each member function in MmdWrapper calls into a
board-specific implementation function. This porting effort also requires
modifying the runtime build process and adjacent code.
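The porting pattern can be sketched as a thin wrapper whose methods forward to board-specific functions. Everything below is hypothetical: the `board` namespace stands in for whatever your BSP provides, and the CSR here is a plain array in host memory so the sketch is self-contained.

```cpp
#include <cstdint>

// Fake "board layer" standing in for real BSP calls; a port would replace
// these with the board support package's CSR access functions.
namespace board {
uint32_t& CsrSlot(uint32_t addr) {
    static uint32_t csr[16] = {0};  // toy CSR space, word-addressed
    return csr[(addr / 4) % 16];
}
uint32_t read_csr(uint32_t addr) { return CsrSlot(addr); }
void write_csr(uint32_t addr, uint32_t val) { CsrSlot(addr) = val; }
}  // namespace board

// Each MmdWrapper member function delegates to the board-specific
// implementation, so only the board layer changes per port.
class MmdWrapper {
public:
    uint32_t ReadCsr(uint32_t addr) { return board::read_csr(addr); }
    void WriteCsr(uint32_t addr, uint32_t val) { board::write_csr(addr, val); }
};
```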
The Intel Acceleration Stack is designed to make FPGAs usable as accelerators. On the
FPGA side, the Intel Acceleration Stack splits acceleration functions into two parts:
• The FPGA interface manager (FIM) is FPGA hardware that contains the FPGA
interface unit (FIU) and external interfaces for functions like memory access and
networking. The FIM is locked and cannot be changed. The FIM is sometimes
referred to as BBS, blue bits, or blue bitstream.
• The accelerator function (AF) is a compiled accelerator image implemented in
FPGA logic that accelerates an application. AFs are compiled from accelerator
functional units (AFUs). An AFU and associated AFs are sometimes referred to as
GBS, green bits, or green bitstream. An FPGA device can be reprogrammed while
leaving the FIM in place.
The FIM handles external interfaces to the host, to which it is connected via PCIe. On
the host side, a driver stack, referred to as OPAE (Open Programmable Acceleration
Engine), communicates with the AFU via the FIM. OPAE talks to the AFU with the
CCI-P (Core Cache Interface) protocol, which provides an abstraction over the PCIe
protocol.
The FPGA image consists of the Intel FPGA AI Suite IP and additional logic that
connects it to the PCIe interface and DDR. The host can read and write the DDR
memory through the PCIe port. In addition, the host can communicate with and control
the Intel FPGA AI Suite instances through the PCIe connection, which is also connected
to the direct memory access (DMA) CSR port of the Intel FPGA AI Suite instances.
7. Design Example System Architecture for the Intel PAC with Intel Arria 10 GX FPGA
The Intel FPGA AI Suite IP accelerates neural network inference on batches of images.
The process of executing a batch follows these steps:
1. The host writes a batch of images, weights, and config data to DDR. Weights
can be reused between batches.
2. The host writes to the Intel FPGA AI Suite CSR to start execution.
3. Intel FPGA AI Suite computes the results of the batch and stores them in DDR.
4. Once the computation is complete, Intel FPGA AI Suite raises an interrupt to the
host.
5. The host reads back the results from DDR.
Figure: Board-level view — the host connects over PCIe to the PCIe example design on
the FPGA, which contains two Intel FPGA AI Suite IP core instances (instance 0 and
instance 1) attached to DDR Bank A and DDR Bank B, respectively.
7.2. Hardware
This section describes the Example Design (Intel Arria 10) in detail. However, many of
the components close to the IP are shared with the Example Design (Intel Agilex 7).
A top-level view of the design example is shown in Intel FPGA AI Suite Example
Design Top Level.
There are two instances of the Intel FPGA AI Suite IP, shown on the right (dla_top.sv).
All communication between the Intel FPGA AI Suite IP systems and the outside world
occurs via the Intel FPGA AI Suite DMA, which provides a CSR (with interrupt
functionality) and reader/writer modules that read from and write to DDR.
The host communicates with the board through PCIe using the CCI-P protocol. The
host can do the following things:
1. Read and write the on-board DDR memory (these reads/writes do not go through
Intel FPGA AI Suite).
2. Read/write to the Intel FPGA AI Suite DMA CSR of both instances.
3. Receive interrupt signals from the Intel FPGA AI Suite DMA CSR of both instances.
From the perspective of the Intel FPGA AI Suite accelerator function (AF), the external
connections are to the CCI-P interface running over PCIe and to the on-board DDR4
memory. The DDR memory is connected directly to the board.qsys block, while the
CCI-P interface is converted into Avalon memory-mapped (MM) interfaces in the
bsp_logic.sv block for communication with the board.qsys block.
The board.qsys block arbitrates the connections to DDR memory between the
reader/writer modules in Intel FPGA AI Suite IP and reads/writes from the host. Each
Intel FPGA AI Suite IP instance in this design has access to only one of the two DDR
banks. This design decision implies that no more than two simultaneous Intel FPGA AI
Suite IP instances can exist in the design. Adding an additional arbiter would relax this
restriction and allow additional Intel FPGA AI Suite IP instances.
There are three clock domains: the host clock, the DDR clock, and the Intel FPGA AI
Suite IP clock. The PCIe logic runs on the host clock at 200 MHz. The Intel FPGA AI
Suite DMA and the platform adapters run on the DDR clock. The rest of the Intel FPGA
AI Suite IP runs on the Intel FPGA AI Suite IP clock.
Figure 3. Intel FPGA AI Suite Example Design Top Level (ccp_std_afu.sv)
The host CCI-P interface passes through CCI MPF into the bsp_logic.sv block, which
converts the CCI-P MMIO and read/write channels into Avalon MM interfaces toward
board.qsys. Within dla_platform_wrapper.sv, each dla_top.sv instance contains the
Intel FPGA AI Suite DMA, which exposes a 32b AXI4-Lite CSR and 512b AXI4
reader/writer interfaces, with interrupts routed back to the host.
Note: Arrows show host/agent relationships. Clock domains are indicated with dashed lines.
The board.qsys interfaces between DDR memory, the readers/writers, and the host
read/write channels. The internals of the board.qsys block are shown in Figure 4.
This figure shows three Avalon MM interfaces on the left and bottom: MMIO, host
read, and host write.
• Host read is used to read data from DDR memory and send it to the host.
• Host write is used to write data from the host into DDR memory.
• The MMIO interface performs several functions:
— Initiating DDR read and write transactions from the host.
— Reading from the AFU ID block. The AFU ID block identifies the AFU with a
unique identifier and is required by the OPAE driver.
— Reading/writing the DLA DMA CSRs, where each instance has its own CSR
base address.
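A sketch of the resulting MMIO address decode is shown below. The AFU ID range (0x00000-0x0003F) and the instance 0 CSR range (0x38000-0x38FFF) come from Figure 4; the instance 1 CSR range (a 0x1000 stride above instance 0) is an assumption made for illustration, not a documented address.

```cpp
#include <cstdint>
#include <string>

// Hypothetical MMIO decoder for the board.qsys address map. Only the
// AFU ID and instance 0 CSR ranges are documented; the instance 1 range
// is an assumed 0x1000 stride for illustration.
std::string DecodeMmio(uint32_t addr) {
    if (addr <= 0x0003F) return "afu_id";
    if (addr >= 0x38000 && addr <= 0x38FFF) return "dla_csr_inst0";
    if (addr >= 0x39000 && addr <= 0x39FFF) return "dla_csr_inst1";  // assumed
    return "unmapped";
}
```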
Figure 4. The board.qsys Block, Showing Two DDR Connections and Two IP Instances
(All interfaces are Avalon MM unless otherwise noted. The AFU ID block is mapped at
0x00000-0x0003F; the CoreDLA instance 0 CSR is mapped at 0x38000-0x38FFF behind a
32b clock crossing bridge and an AVMM/AXI4 converter. The remaining 512b interfaces
carry DDR traffic, and interrupts are routed back to the host.)
The above figure also shows the ddr_board.qsys block. The three central blocks (an
address expander and two msgdma_bbb.qsys scatter-gather DMA instances) allow host
direct memory access (DMA) to DDR. This DMA is distinct from the DMA module inside
the Intel FPGA AI Suite IP, shown in Figure 3 on page 28. Host reads and writes begin
with the host sending a request via the MMIO interface to initiate a read or write.
When requesting a read, the DMA gathers the data from DDR and sends it to the host
via the host-read interface. When requesting a write, the DMA reads the data over the
host-write interface and subsequently writes it to DDR.
Note that in board.qsys, a block for the Avalon MM to AXI4 conversion is not
explicitly instantiated. Instead, an Avalon MM pipeline bridge connects to an AXI4
bridge. Platform Designer implicitly infers a protocol adapter between these two
bridges.
Note: Avalon MM/AXI4 adapters in Platform Designer might not close timing.
Platform Designer optimizes for area instead of fMAX by default, so you might need to
change the interconnect settings for the inferred Avalon MM/AXI4 adapter. For
example, we made some changes as shown in the following figure.
Figure 5. Adjusting the Interconnect Settings for the Inferred Avalon MM/AXI4
Adapter to Optimize for fMAX Instead of Area.
This was the only change needed to close timing; however, it took several rounds of
experimentation to determine that this was the important setting. Depending on your
system, other settings might need to be tweaked.
A fully rigorous production-quality flow would rerun timing analysis after the PLL
adjustment to account for the small possibility that a change in PLL frequency might
alter clock characteristics (for example, jitter) and cause a timing failure. A
production design that shares the Intel FPGA AI Suite IP clock with other system
components might instead target a fixed frequency and skip the PLL adjustment.
B. Intel FPGA AI Suite PCIe-based Design Example User Guide Document Revision History

Document Version | Intel FPGA AI Suite Version | Changes
2023.12.01 | 2023.3 | Added the --wsl and --finalize options to "Build Script Options."
2023.04.05 | 2023.1 | Renamed the dlac command; the Intel FPGA AI Suite compiler command is now dla_compiler. Updated the Intel Agilex™ product family name to "Intel Agilex 7."
2021.09.10 | 2021.2 | Added updates for initial Intel Agilex device support.