
Intel® FPGA AI Suite

PCIe-based Design Example User Guide

Updated for Intel® FPGA AI Suite: 2023.3

Online Version 768977


2023.12.01
Contents

1. Intel® FPGA AI Suite PCIe-based Design Example User Guide........................................ 3

2. About the PCIe-based Design Example........................................................................... 5

3. Getting Started with the Intel FPGA AI Suite PCIe-based Design Example......................6

4. Building the Intel FPGA AI Suite Runtime....................................................................... 7


4.1. CMake Targets...................................................................................................... 7
4.2. Build Options........................................................................................................ 7
5. Running the Design Example Demonstration Applications ............................................. 9
5.1. Exporting Trained Graphs from Source Frameworks.................................................... 9
5.2. Compiling Exported Graphs Through the Intel FPGA AI Suite....................................... 9
5.3. Compiling the PCIe-based Example Design................................................................9
5.4. Programming the FPGA Device (Intel Arria 10).........................................................10
5.5. Programming the FPGA Device (Intel Agilex 7).........................................................10
5.6. Performing Accelerated Inference with the dla_benchmark Application......................10
5.6.1. Inference on Image Classification Graphs.................................................... 10
5.6.2. Inference on Object Detection Graphs.........................................................12
5.6.3. Additional dla_benchmark Options........................................................... 13
5.7. Running the Ported OpenVINO Demonstration Applications........................................ 14
5.7.1. Example Running the Object Detection Demonstration Application.................. 15
6. Design Example Components........................................................................................ 17
6.1. Build Script......................................................................................................... 17
6.1.1. Build Script Options..................................................................................18
6.1.2. Script Flow..............................................................................................18
6.2. Example Architecture Bitstream Files...................................................................... 19
6.3. Software Components.......................................................................................... 19
6.3.1. OpenVINO™ FPGA Runtime Plugin.............................................................. 21
6.3.2. Intel FPGA AI Suite Runtime...................................................................... 22
6.3.3. BSP Driver.............................................................................................. 23
6.4. Software Interface to the BSP................................................................................23
7. Design Example System Architecture for the Intel PAC with Intel Arria 10 GX FPGA.... 25
7.1. System Overview.................................................................................................25
7.2. Hardware............................................................................................................26
7.2.1. PLL Adjustment....................................................................................... 30
A. Intel FPGA AI Suite PCIe-based Design Example User Guide Archives.......................... 32

B. Intel FPGA AI Suite PCIe-based Design Example User Guide Document Revision
History.....................................................................................................................33


1. Intel® FPGA AI Suite PCIe-based Design Example User Guide
The Intel® FPGA AI Suite PCIe*-based Design Example User Guide describes the
design and implementation for accelerating AI inference using the Intel FPGA AI Suite,
Intel Distribution of OpenVINO™ toolkit, and an Intel PAC with Intel Arria® 10 GX FPGA
or a Terasic* DE10-Agilex Development Board.

The following sections in this document describe the steps to build and execute the
design:
• Building the Intel FPGA AI Suite Runtime on page 7
• Running the Design Example Demonstration Applications on page 9

The following sections in this document describe design decisions and architectural
details about the design:
• Design Example Components on page 17
• Design Example System Architecture for the Intel PAC with Intel Arria 10 GX FPGA
on page 25

Use this document to help you understand how to create a PCIe example design with
the targeted Intel FPGA AI Suite architecture and number of instances, and how to
compile the design for use with the Intel FPGA Basic Building Blocks (BBBs) system.

About the Intel FPGA AI Suite Documentation Library

Documentation for the Intel FPGA AI Suite is split across a few publications. Use the
following table to find the publication that contains the Intel FPGA AI Suite information
that you are looking for:

Table 1. Intel FPGA AI Suite Documentation Library

Release Notes
Provides late-breaking information about the Intel FPGA AI Suite, including new features, important bug fixes, and known issues.

Getting Started Guide
Get up and running with the Intel FPGA AI Suite by learning how to initialize your compiler environment and reviewing the various design examples and tutorials provided with the Intel FPGA AI Suite.

IP Reference Manual
Provides an overview of the Intel FPGA AI Suite IP and the parameters you can set to customize it. This document also covers the Intel FPGA AI Suite IP generation utility.

© Intel Corporation. All rights reserved. Intel, the Intel logo, and other Intel marks are trademarks of Intel
Corporation or its subsidiaries. Intel warrants performance of its FPGA and semiconductor products to current
specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any
products and services at any time without notice. Intel assumes no responsibility or liability arising out of the
application or use of any information, product, or service described herein except as expressly agreed to in
writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying
on any published information and before placing orders for products or services. ISO 9001:2015 Registered.
*Other names and brands may be claimed as the property of others.

Compiler Reference Manual
Describes the use modes of the graph compiler (dla_compiler). It also provides details about the compiler command options and the format of compilation inputs and outputs.

PCIe-based Design Example User Guide
Describes the design and implementation for accelerating AI inference using the Intel FPGA AI Suite, Intel Distribution of OpenVINO toolkit, and an Intel PAC with Intel Arria 10 GX FPGA or a Terasic DE10-Agilex Development Board.

SoC-based Design Example User Guide
Describes the design and implementation for accelerating AI inference using the Intel FPGA AI Suite, Intel Distribution of OpenVINO toolkit, and an Intel Arria 10 SX SoC FPGA Development Kit.

Intel Distribution of OpenVINO toolkit Requirement

To use the Intel FPGA AI Suite, you must be familiar with the Intel Distribution of
OpenVINO toolkit.

Intel FPGA AI Suite Version 2023.3 requires the Intel Distribution of OpenVINO toolkit
Version 2022.3.1 LTS. For OpenVINO documentation, refer to
https://docs.openvino.ai/2022.3/documentation.html.


2. About the PCIe-based Design Example


The Intel FPGA AI Suite PCIe-based design examples (Intel Arria 10 and Intel Agilex®
7) demonstrate how the Intel Distribution of OpenVINO toolkit and the Intel FPGA AI
Suite support the look-aside deep learning acceleration model.

The PCIe-based design example (Intel Arria 10) is implemented with the following
components:
• Intel FPGA AI Suite IP
• Intel Acceleration Stack for Intel Xeon CPU with FPGAs
• Open Programmable Acceleration Engine (OPAE) components:
— OPAE libraries
— Intel FPGA Basic Building Blocks (BBB)
• Intel Distribution of OpenVINO toolkit
• Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA
• Sample hardware and software systems that illustrate the use of these
components

The PCIe-based design example (Intel Agilex 7) is implemented with the following
components:
• Intel FPGA AI Suite IP
• Intel Distribution of OpenVINO toolkit
• Terasic DE10-Agilex-B2E2 board
• Sample hardware and software systems that illustrate the use of these
components

This design example includes pre-built FPGA bitstreams that correspond to pre-
optimized architecture files. However, the design example build scripts let you choose
from a variety of architecture files and build (or rebuild) your own bitstreams,
provided that you have a license permitting bitstream generation.

This design is provided with the Intel FPGA AI Suite as an example showing how to
incorporate the IP into a design. This design is not intended for unaltered use in
production scenarios. Before using any portion of this example design in a production
application, review that portion for both robustness and security.


3. Getting Started with the Intel FPGA AI Suite PCIe-based Design Example
The Intel FPGA AI Suite PCIe-based design example version 2023.3 is provided with
the Intel FPGA AI Suite (earlier versions were distributed as separate components).

Before starting with the Intel FPGA AI Suite PCIe-based Design Example, ensure that
you have followed all the installation instructions for the Intel FPGA AI Suite compiler
and IP generation tools and completed the design example prerequisites as provided
in the Intel FPGA AI Suite Getting Started Guide.


4. Building the Intel FPGA AI Suite Runtime


The Intel FPGA AI Suite PCIe-based Design Example runtime directory contains the
source code for the OpenVINO plugins, the lower-level MMD layer that interacts with
the OPAE drivers, and customized versions of the following OpenVINO programs:
• dla_benchmark
• classification_sample_async
• object_detection_demo_yolov3_async
• segmentation_demo

The CMake tool manages the overall build flow to build the Intel FPGA AI Suite
runtime plugin.

4.1. CMake Targets


The top level CMake build target is the Intel FPGA AI Suite runtime plugin shared
library, libcoreDLARuntimePlugin.so. The source files used to build this target
are located under the following directories:
• runtime/plugin/src/
• runtime/coredla_device/src/

The flow also builds additional targets as dependencies for the top-level target. The
most significant additional targets are:
• The OPAE-based MMD library, libintel_opae_mmd.so. The source files for this
target are under runtime/coredla_device/mmd/.
• The Input and Output Layout Transform library,
libdliaPluginIOTransformations.a. The sources for this target are under
runtime/plugin/io_transformations/.

4.2. Build Options


The runtime folder in the design example package contains a script to build the
runtime called build_runtime.sh.

Issue the following command to run the script:


./build_runtime.sh <command_line_options>

Where <command_line_options> are defined in the following table:

Intel Corporation. All rights reserved. Intel, the Intel logo, and other Intel marks are trademarks of Intel
Corporation or its subsidiaries. Intel warrants performance of its FPGA and semiconductor products to current
specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any ISO
products and services at any time without notice. Intel assumes no responsibility or liability arising out of the 9001:2015
application or use of any information, product, or service described herein except as expressly agreed to in Registered
writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying
on any published information and before placing orders for products or services.
*Other names and brands may be claimed as the property of others.

Table 2. Command Line Options for the build_runtime.sh Script


Command Description

-h | --help Show usage details

--cmake_debug Call cmake with a debug flag

--verbosity=<number> Large numbers add some extra verbosity

--build_dir=<path> Directory where the runtime build should be placed

--disable_jit If this flag is specified, then the runtime will only support
the Ahead of Time mode. The runtime will not link to the
precompiled compiler libraries.
Use this mode when trying to compile the runtime on an
unsupported operating system.

--build_demo Adds several OpenVINO demo applications to the runtime
build. The demo applications are in subdirectories of the
runtime/ directory.

--de10_agilex Target the Terasic DE10-Agilex Development Board.


If this option is not specified, then the runtime will by
default target the Intel PAC with Intel Arria 10 GX FPGA.

--aot_splitter_example Builds the AOT splitter example utility for the selected
target (Intel PAC with Intel Arria 10 GX FPGA or Terasic
DE10-Agilex Development Board).
This option builds an AOT file for a model, splits the AOT file
into its constituent components (weights, overlay
instructions, etc.), and then builds a small utility that loads
the model and a single image onto the target FPGA board
without using OpenVINO.
You must set the $AOT_SPLITTER_EXAMPLE_MODEL and
$AOT_SPLITTER_EXAMPLE_INPUT environment variables
correctly. For details, refer to “Intel FPGA AI Suite Ahead-of-
Time (AOT) Splitter Utility Example Application” in Intel
FPGA AI Suite IP Reference Manual.

The Intel FPGA AI Suite runtime plugin is built in release mode by default. To enable
debug mode, specify the --cmake_debug option of the script command.

The --no_make option skips the final call to the make command. You can make this
call manually instead.

Intel FPGA AI Suite hardware is compiled to include one or more IP instances, with the
same architecture for all instances. Each instance accesses data from a unique bank of
DDR:
• An Intel Programmable Acceleration Card (PAC) with Intel Arria 10 GX FPGA has
two DDR banks and supports two instances.
• The Terasic DE10-Agilex board supports up to four instances.

The runtime automatically adapts to the correct number of instances.

If the Intel FPGA AI Suite Runtime uses two or more instances, then the image
batches are divided between the instances to execute two or more batches in parallel
on the FPGA device.
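The division of batches between instances can be sketched as follows (an illustrative Python sketch of the scheduling idea only, not the actual runtime code; the function name is hypothetical):

```python
# Illustrative sketch (not the actual runtime code): round-robin
# assignment of inference batches to IP instances, so that two or
# more batches execute in parallel on the FPGA device.
def assign_batches(num_batches: int, num_instances: int) -> dict:
    """Map each batch index to the instance that executes it."""
    return {batch: batch % num_instances for batch in range(num_batches)}

# With two instances (Intel PAC with Intel Arria 10 GX FPGA),
# batches alternate between instance 0 and instance 1.
print(assign_batches(4, 2))  # {0: 0, 1: 1, 2: 0, 3: 1}
```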


5. Running the Design Example Demonstration Applications
This section describes the steps to run the demonstration application and perform
accelerated inference using the PCIe Example Design.

5.1. Exporting Trained Graphs from Source Frameworks


Before running any demonstration application, you must convert the trained model to
the Inference Engine format (.xml, .bin) with the OpenVINO™ Model Optimizer.

For details on creating the .bin/.xml files, refer to the Intel FPGA AI Suite Getting
Started Guide.

5.2. Compiling Exported Graphs Through the Intel FPGA AI Suite


The network as described in the .xml and .bin files (created by the Model Optimizer)
is compiled for a specific Intel FPGA AI Suite Architecture File by using the Intel FPGA
AI Suite compiler.

The Intel FPGA AI Suite compiler compiles the network and exports it to a .bin file
that uses the same .bin format as required by the OpenVINO™ Inference Engine.

This .bin file created by the compiler contains the compiled network parameters for
all the target devices (FPGA, CPU, or both) along with the weights and biases. The
inference application imports this file at runtime.

The Intel FPGA AI Suite compiler can also compile the graph and provide estimated
area or performance metrics for a given Architecture File or produce an optimized
Architecture File.

For more details about the Intel FPGA AI Suite compiler, refer to the Intel FPGA AI
Suite Compiler Reference Manual.

5.3. Compiling the PCIe-based Example Design


Prepackaged bitstreams are available for the PCIe Example Design. If the prepackaged
bitstreams are installed, they are installed in demo/bitstreams/.

To build example design bitstreams, you must have a license that permits bitstream
generation for the IP, and have the correct version of Quartus installed. Use the
dla_build_example_design.py utility to create a bitstream.

For more details about this command, the steps it performs, and advanced command
options, refer to Build Script and to the Intel FPGA AI Suite Getting Started Guide.


5.4. Programming the FPGA Device (Intel Arria 10)


You can verify that OPAE correctly programs the Intel PAC with Intel Arria 10 GX FPGA
by using one of the OPAE sample application programs.

The OPAE sample application programs are described in the Intel Acceleration Stack
Quick Start Guide for Intel Programmable Acceleration Card with Intel Arria 10 GX
FPGA.

If the Intel PAC is connected and has sufficient cooling, then you can program a
bitstream using the fpgaconf command. If the demonstration bitstreams (which
correspond to the architectures in the example_architectures/ directory) were
installed, then you can use them. You can also compile a new bitstream, as described
in section Compiling the PCIe-based Example Design on page 9.

You can program the design example bitstreams by using the following command:

fpgaconf -v <path_to_design_example_bitstream (.gbs)>

The Intel PAC with Intel Arria 10 GX FPGA requires server-level cooling with a dual-fan
graphics card cooler or another appropriate fan. Contact your Intel representative for
specific cooling solution suggestions and quote case 14016255788 when contacting
your representative.

If the supplementary cooling is not sufficient, the PAC will hang during inference. Until
you are certain that the cooling solution is sufficient, monitor the temperature using
the command: sudo fpgainfo temp

5.5. Programming the FPGA Device (Intel Agilex 7)


You can program the Terasic DE10-Agilex Development Board using the
fpga_jtag_reprogram tool.

For details, refer to “Intel FPGA AI Suite Quick Start Tutorial” in the Intel FPGA AI
Suite Getting Started Guide.

5.6. Performing Accelerated Inference with the dla_benchmark Application

You can use the dla_benchmark demonstration application included with the Intel
FPGA AI Suite runtime to benchmark the performance of image classification
networks.

5.6.1. Inference on Image Classification Graphs


The demonstration application requires the OpenVINO device flag to be either
HETERO:FPGA,CPU for heterogeneous execution or HETERO:FPGA for FPGA-only
execution.

The dla_benchmark demonstration application runs five inference requests (batches)
in parallel on the FPGA by default to achieve optimal system performance. To
measure steady-state performance, run multiple batches (using the -niter flag)
because the first iteration is significantly slower with FPGA devices.


The dla_benchmark demonstration application also supports multiple graphs in the
same execution. You can pass more than one graph or compiled graph as input,
separated by commas.

Each graph can either have its own input dataset or share a common dataset with all
other graphs. Each graph requires an individual ground_truth_file, with the files
separated by commas. If some ground_truth_file files are missing, the
dla_benchmark continues to run and ignores the missing ones.
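This pairing behavior can be sketched as follows (an illustrative Python sketch; the helper and file names are hypothetical, and the real dla_benchmark implements this internally):

```python
# Illustrative sketch: graphs and ground-truth files are paired by
# position; a graph whose ground-truth file is missing is simply not
# scored, and the run continues.
def pair_ground_truth(graphs, gt_files, exists):
    pairs = {}
    for graph, gt in zip(graphs, gt_files):
        if exists(gt):
            pairs[graph] = gt  # accuracy is evaluated for this graph
        # else: skip scoring for this graph; inference still runs
    return pairs

# Hypothetical file names; only the first ground-truth file "exists".
pairs = pair_ground_truth(
    ["resnet-50-tf.xml", "resnet-101-tf.xml"],
    ["gt_rn50.txt", "gt_rn101.txt"],
    exists=lambda path: path == "gt_rn50.txt",
)
print(pairs)  # {'resnet-50-tf.xml': 'gt_rn50.txt'}
```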

When multi-graph mode is enabled, the -niter flag specifies the number of iterations for
each graph, so the total number of iterations becomes -niter × the number of graphs.

The dla_benchmark demonstration application switches graphs after submitting
-nireq requests. The request queue holds up to -nireq × the number of graphs
requests. This total is constrained by the DMA CSR descriptor queue size (64 per
hardware instance).
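The multi-graph arithmetic described above can be sketched as follows (illustrative helpers; the function names are not part of the tool):

```python
# Illustrative helpers for the multi-graph arithmetic.
def total_iterations(niter: int, num_graphs: int) -> int:
    # -niter applies per graph, so the total scales with the graph count.
    return niter * num_graphs

def max_outstanding_requests(nireq: int, num_graphs: int) -> int:
    # The request queue holds up to -nireq requests per graph; this total
    # must stay within the 64-entry DMA CSR descriptor queue per instance.
    return nireq * num_graphs

# Example: -niter=5 and -nireq=4 with two graphs.
print(total_iterations(5, 2))          # 10
print(max_outstanding_requests(4, 2))  # 8, well under the 64-entry limit
```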

The board you use determines the number of instances that you can compile the Intel
FPGA AI Suite hardware for:
• For the Intel PAC with Intel Arria 10 GX FPGA, you can compile up to two
instances with the same architecture on all instances.
• For the Terasic DE10-Agilex Development Board, you can compile up to four
instances with the same architecture on all instances.

Each instance accesses its own DDR bank and executes the graph independently. This
optimization enables multiple batches to run in parallel. Each inference request
created by the demonstration application is assigned to one of the instances in the
FPGA plugin.

To ensure that batches are evenly distributed between the instances, you must choose
an inference request batch size that is a multiple of the number of Intel FPGA AI Suite
instances. For example, with two instances, specify the batch size as six (instead of
the OpenVINO default of five) to ensure that the experiment meets this requirement.
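One way to pick a compliant batch size (an illustrative helper, not part of the runtime):

```python
# Illustrative helper: round the desired inference-request batch size up
# to the nearest multiple of the number of IP instances so that batches
# divide evenly between instances.
def next_valid_batch_size(desired: int, num_instances: int) -> int:
    remainder = desired % num_instances
    if remainder == 0:
        return desired
    return desired + (num_instances - remainder)

# With two instances, the OpenVINO default of five rounds up to six,
# matching the example in the text.
print(next_valid_batch_size(5, 2))  # 6
```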

The following example usage assumes that a Model Optimizer IR .xml file has been
placed in demo/models/public/resnet-50-tf/FP32/. It also assumes that an
image set has been placed into demo/sample_images/. Lastly, it assumes that the
FPGA device has been programmed with a bitstream corresponding to A10_Performance.arch.
binxml=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
imgdir=$COREDLA_ROOT/demo/sample_images
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
-b=1 \
-m $binxml/resnet-50-tf.xml \
-d=HETERO:FPGA,CPU \
-i $imgdir \
-niter=5 \
-plugins_xml_file ./plugins.xml \
-arch_file $COREDLA_ROOT/example_architectures/A10_Performance.arch \
-api=async \
-groundtruth_loc $imgdir/TF_ground_truth.txt \
-perf_est \
-nireq=4 \
-bgr


The following example shows how the IP can dynamically swap between graphs. This
example usage assumes that another Model Optimizer IR .xml file has been placed in
demo/models/public/resnet-101-tf/FP32/. It also assumes that another
image set has been placed into demo/sample_images_rn101/. In this case,
dla_benchmark only evaluates the classification accuracy of ResNet50 because no
ground truth file is provided for the second graph (ResNet101).
binxml1=$COREDLA_ROOT/demo/models/public/resnet-50-tf/FP32
binxml2=$COREDLA_ROOT/demo/models/public/resnet-101-tf/FP32
imgdir1=$COREDLA_ROOT/demo/sample_images
imgdir2=$COREDLA_ROOT/demo/sample_images_rn101
cd $COREDLA_ROOT/runtime/build_Release
./dla_benchmark/dla_benchmark \
-b=1 \
-m $binxml1/resnet-50-tf.xml,$binxml2/resnet-101-tf.xml \
-d=HETERO:FPGA,CPU \
-i $imgdir1,$imgdir2 \
-niter=5 \
-plugins_xml_file ./plugins.xml \
-arch_file $COREDLA_ROOT/example_architectures/A10_Performance.arch \
-api=async \
-groundtruth_loc $imgdir1/TF_ground_truth.txt \
-perf_est \
-nireq=4 \
-bgr

5.6.2. Inference on Object Detection Graphs


To enable the accuracy checking routine for object detection graphs, use the
-enable_object_detection_ap=1 flag.

This flag lets the dla_benchmark calculate the mAP and COCO AP for object
detection graphs. In addition, you must specify the version of the YOLO graph that
you provide to the dla_benchmark through the -yolo_version flag. Currently, this
routine is known to work with YOLOv3 (graph version yolo-v3-tf) and
TinyYOLOv3 (graph version yolo-v3-tiny-tf).

5.6.2.1. The mAP and COCO AP Metrics


Average precision and average recall are averaged over multiple Intersection over
Union (IoU) values.

Two metrics are used for accuracy evaluation in the dla_benchmark application. The
mean average precision (mAP) is the challenge metric for PASCAL VOC. The mAP
value is averaged over all 80 categories using a single IoU threshold of 0.5. The COCO
AP is the primary challenge metric for object detection in the Common Objects in
Context (COCO) contest. The COCO AP value uses 10 IoU thresholds from 0.50 to 0.95
in steps of 0.05. Averaging over multiple IoUs rewards detectors with better localization.
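The difference between the two averaging schemes can be sketched as follows (illustrative only; a real mAP computation also matches detections to ground truth and integrates precision-recall curves):

```python
# Illustrative sketch of the IoU thresholds behind the two metrics.
# mAP (PASCAL VOC): a single IoU threshold of 0.5.
# COCO AP: ten thresholds 0.50, 0.55, ..., 0.95, averaged.
map_thresholds = [0.5]
coco_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(coco_thresholds)
# [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]

def coco_ap(ap_at_iou) -> float:
    """Average a per-threshold AP function over the ten COCO thresholds."""
    return sum(ap_at_iou(t) for t in coco_thresholds) / len(coco_thresholds)
```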

5.6.2.2. Specifying Ground Truth


The path to the ground truth files is specified by the -groundtruth_loc flag.

The validation dataset is available on the COCO official website.

The dla_benchmark application currently allows only plain text ground truth files. To
convert the downloaded JSON annotation file to plain text, use the
convert_annotations.py script.


5.6.2.3. Example of Inference on Object Detection Graphs


The following example makes these assumptions:
• The Model Optimizer IR graph.xml for either YOLOv3 or TinyYOLOv3 is in the
current working directory. Model Optimizer generates an FP32 version and an
FP16 version; use the FP32 version.
• The validation images downloaded from the COCO website are placed in the
./mscoco-images directory.
• The JSON annotation file is downloaded and unzipped in the current directory.

To compute the accuracy scores on many images, you can usually increase the
number of iterations using the flag -niter instead of a large batch size -b. The
product of the batch size and the number of iterations should be less than or equal to
the number of images that you provide.
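This constraint can be sanity-checked before a long run (an illustrative helper; the function name is hypothetical):

```python
# Illustrative helper: the product of the batch size and the number of
# iterations must not exceed the number of images provided.
def check_image_budget(batch_size: int, niter: int, num_images: int) -> bool:
    return batch_size * niter <= num_images

# -b=1 with -niter=5000 needs at least 5000 images in the directory.
print(check_image_budget(1, 5000, 5000))  # True
print(check_image_budget(2, 5000, 5000))  # False
```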
cd $COREDLA_ROOT/runtime/build_Release
python ./convert_annotations.py ./instances_val2017.json \
./groundtruth
./dla_benchmark/dla_benchmark \
-b=1 \
-niter=5000 \
-m=./graph.xml \
-d=HETERO:FPGA,CPU \
-i=./mscoco-images \
-plugins_xml_file=./plugins.xml \
-arch_file=../../example_architectures/A10_Performance.arch \
-yolo_version=yolo-v3-tf \
-api=async \
-groundtruth_loc=./groundtruth \
-nireq=4 \
-enable_object_detection_ap \
-perf_est \
-bgr

5.6.3. Additional dla_benchmark Options


The dla_benchmark tool is part of the example design and the distributed runtime
includes full source code for the tool.

Table 3. Command Line dla_benchmark Options


Command Description

-nireq=<N> This controls the number of simultaneous inference requests
that are sent to the FPGA.
Typically, this should be at least twice the number of IP
instances; this ensures that each IP instance can execute one
inference request while dla_benchmark loads the feature
data for a second inference request to the FPGA-attached
DDR memory.

-b=<N> This controls the batch size.
--batch-size=<N> A batch size greater than 1 is created by repeating
configuration data for multiple copies of the graph.
A batch size of 1 is typically best.

-niter=<N> Number of batches (iterations) to process.



-d=<STRING> Using -d=HETERO:FPGA,CPU causes dla_benchmark to
use the OpenVINO heterogeneous plugin to execute
inference on the FPGA, with fallback to the CPU for any
layers that cannot go to the FPGA.
Using -d=HETERO:CPU or -d=CPU executes inference on
the CPU, which may be useful for testing the flow when an
FPGA is not available. Using -d=HETERO:FPGA may be
useful for ensuring that all graph layers are accelerated on
the FPGA (an error is issued if this is not possible).

-arch_file=<FILE> This specifies the location of the .arch file that was used to
--arch=<FILE> configure the IP on the FPGA. The dla_benchmark issues
an error if this does not match the .arch file used to
generate the IP on the FPGA.

-m=<FILE> This points to the XML file from the OpenVINO Model Optimizer
--network_file=<FILE> that describes the graph. The BIN file from Model Optimizer
must be kept in the same directory with the same filename
(except for the file extension) as the XML file.

-i=<DIRECTORY> Points to the directory containing the input images. Each input
file corresponds to one inference request. The files are read in order sorted
by filename; set the environment variable VERBOSE=1 to see details describing
the file order.

-api=[sync|async] The -api=async option allows dla_benchmark to take full
advantage of multithreading to improve performance. The -api=sync option may be
useful during debugging.

-groundtruth_loc=<FILE> Location of the file with ground truth data. If not
provided, dla_benchmark does not evaluate accuracy. This file may contain
classification data or object detection data, depending on the graph.

-yolo_version=<STRING> This option is used when evaluating the accuracy of a
YOLOv3 or TinyYOLOv3 object detection graph. The options are yolo-v3-tf and
yolo-v3-tiny-tf.

-enable_object_detection_ap This option may be used with an object detection
graph (YOLOv3 or TinyYOLOv3) to calculate the object detection accuracy.

-bgr When used, this flag indicates that the graph expects input
image channel data to use BGR order.

-plugins_xml_file=<FILE> Specifies the location of the file that lists the
OpenVINO plugins to use. In most cases, set this to
$COREDLA_ROOT/runtime/plugins.xml. If you are porting the design to a new host
or doing other development, a different value may be necessary.
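
Combining these options, a complete dla_benchmark invocation might look like
the following sketch. The model path, image directory, and -nireq value are
placeholders for illustration only; substitute the graph, Architecture File,
and input data from your own setup:

```shell
# Illustrative invocation only; adapt the paths and -nireq for your system.
./dla_benchmark \
    -m=$COREDLA_WORK/models/resnet50/resnet50.xml \
    -arch_file=$COREDLA_ROOT/example_architectures/A10_Generic.arch \
    -plugins_xml_file=$COREDLA_ROOT/runtime/plugins.xml \
    -i=$COREDLA_WORK/images \
    -d=HETERO:FPGA,CPU \
    -b=1 \
    -nireq=4 \
    -api=async \
    -bgr
```

With two IP instances on the FPGA, -nireq=4 follows the guideline above of
using at least twice the number of IP instances.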

5.7. Running the Ported OpenVINO Demonstration Applications


Some of the sample demonstration applications from the OpenVINO toolkit for Linux
Version 2022.3.1 have been ported to work with the Intel FPGA AI Suite. These
applications are built at the same time as the runtime when you use the
-build-demo flag with build_runtime.sh.

The Intel FPGA AI Suite runtime includes customized versions of the following demo
applications for use with the Intel FPGA AI Suite IP and plugins:
• classification_sample_async
• object_detection_demo_yolov3_async


Each demonstration application uses a different graph. The OpenVINO HETERO plugin
can fall-back to the CPU for portions of the graph that are not supported with FPGA-
based acceleration. However, in a production environment, it may be more efficient to
use alternate graphs that execute exclusively on the FPGA.

You can use the example .arch files supplied with the Intel FPGA AI Suite with
the demonstration applications. However, certain example .arch files do not
enable some of the layer types used by the graphs associated with the
demonstration applications. Using these .arch files causes portions of the
graph to needlessly execute on the CPU. To minimize the number of layers that
are executed on the CPU by the demonstration application, use the following
architecture description files located in the example_architectures/ directory
of the Intel FPGA AI Suite installation package to run the demos:
• Intel Arria 10: A10_Generic.arch
• Intel Agilex 7: AGX7_Generic.arch

As specified in Programming the FPGA Device (Intel Arria 10) on page 10, you must
program the FPGA device with the bitstream for the architecture being used. Each
demonstration application includes a README.md file specifying how to use it.

When the OpenVINO sample applications are modified to support the Intel FPGA AI
Suite, the Intel FPGA AI Suite plugin used by OpenVINO must be told where to
find the .arch file that describes the IP parameterization. The demos set the
configuration key with the following C++ code:
ie.SetConfig({ { DLIA_CONFIG_KEY(ARCH_PATH), FLAGS_arch_file } }, "FPGA");

The OpenVINO demonstration application hello_query_device does not work with
the Intel FPGA AI Suite due to low-level hardware identification assumptions.

5.7.1. Example Running the Object Detection Demonstration Application


You must download the following items:
• yolo-v3-tf from the OpenVINO™ Model Downloader. The command should look
similar to the following command:
python3 <path_to_installation>/open_model_zoo/omz_downloader \
--name yolo-v3-tf \
--output_dir <download_dir>

From the downloaded model, generate the .bin/.xml files:


python3 <path_to_installation>/open_model_zoo/omz_converter \
--name yolo-v3-tf \
--download_dir <download_dir> \
--output_dir <output_dir> \
--mo <path_to_installation>/model_optimizer/mo.py

Model Optimizer generates an FP32 version and an FP16 version. Use the FP32
version.
• Input video from: https://github.com/intel-iot-devkit/sample-videos.
• The recommended video is person-bicycle-car-detection.mp4

To run the object detection demonstration application:


1. Ensure that demonstration applications have been built with the following
command:
build_runtime.sh -build-demo

2. Ensure that the FPGA has been configured with the Generic bitstream.
3. Run the following command:
./runtime/build_Release/object_detection_demo/object_detection_demo \
-d HETERO:FPGA,CPU \
-i <path_to_video>/input_video.mp4 \
-m <path_to_model>/yolo_v3.xml \
-arch_file=$COREDLA_ROOT/example_architectures/A10_Generic.arch \
-plugins_xml_file $COREDLA_ROOT/runtime/plugins.xml \
-t 0.65 \
-at yolo


6. Design Example Components

6.1. Build Script


While this design example includes prepackaged bitstreams, you can also use the
build script to build bitstreams.

To build the PCIe-based example design, use the
bin/dla_build_example_design.py script. You can use this script to create an
example design with one or more Intel FPGA AI Suite IP instances.

The script generates a wrapper that wraps one or more IP instances along with
adapters necessary to connect to the Terasic DE10-Agilex BSP or to the OPAE BSP
(Intel Arria 10).

When specifying an <architecture_file>, pay attention to the resource
limitations on the FPGA, as well as the number of resources that the board
support package (BSP) uses.

The dla_compiler tool includes a --fanalyze-area option to estimate the
resources required for a single IP instance corresponding to an architecture
file, as described in the Intel FPGA AI Suite Compiler Reference Manual.

Implementing two instances (the default for dla_build_example_design.py)
requires twice the resources.

Table 4. Build Script Resources

Resources                                           ALMs     M20Ks    DSPs
Resources available on Intel Arria 10 1150 device   427200   2713     1518
Reasonable target utilization                       80%      90%      90%
Usable resources                                    342000   2445     1365
Resources needed for BSP                            62000    330      0
Resources available for IP instances                280000   2115     1365
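
The derivation behind Table 4 can be checked with a few lines of shell
arithmetic. This is only a sketch of the budgeting logic; the table rounds the
"usable" figures, so the exact integer results below differ slightly from the
published values (for example, 427200 × 0.80 = 341760, which the table rounds
to 342000):

```shell
# Raw device resources for the Intel Arria 10 1150 device (from Table 4).
alms_total=427200; m20k_total=2713; dsp_total=1518

# Apply the reasonable target utilization: 80% for ALMs, 90% for M20Ks and DSPs.
usable_alms=$(( alms_total * 80 / 100 ))
usable_m20k=$(( m20k_total * 90 / 100 ))
usable_dsp=$(( dsp_total * 90 / 100 ))

# Subtract the resources that the BSP itself consumes.
bsp_alms=62000; bsp_m20k=330; bsp_dsp=0
echo "ALMs available for IP instances:  $(( usable_alms - bsp_alms ))"
echo "M20Ks available for IP instances: $(( usable_m20k - bsp_m20k ))"
echo "DSPs available for IP instances:  $(( usable_dsp - bsp_dsp ))"
```

Dividing these budgets by the per-instance estimate from
dla_compiler --fanalyze-area indicates how many IP instances can fit.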

The OPAE BSP used by the Intel PAC with Intel Arria 10 GX FPGA board is compatible
only with bitstreams for accelerator function units (AFUs) compiled with Intel
Quartus® Prime Pro Edition Version 19.2. An AFU and its associated accelerator
functions (AFs) are sometimes referred to as the green bitstream or green bits. For
more details about OPAE bitstream types, refer to Design Example Microarchitecture.

The DE10-Agilex design is only validated for use with Intel Quartus Prime Pro Edition
Version 23.3. This Intel Agilex 7 design does not use the Intel Quartus Prime partial
reconfiguration feature, unlike the OPAE BSP. The Agilex device has significantly more
resources and can support up to four IP instances.

Intel Corporation. All rights reserved. Intel, the Intel logo, and other Intel marks are trademarks of Intel
Corporation or its subsidiaries. Intel warrants performance of its FPGA and semiconductor products to current
specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any
products and services at any time without notice. Intel assumes no responsibility or liability arising out of the
application or use of any information, product, or service described herein except as expressly agreed to in
writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying
on any published information and before placing orders for products or services. ISO 9001:2015 Registered.
*Other names and brands may be claimed as the property of others.

6.1.1. Build Script Options


Table 5. Build Script Options

Option Description

-a, --archs Path to Intel FPGA AI Suite IP Architecture Description File

--build-dir Path to hardware build directory where BSP infrastructure and generated RTL
will be located.
(default: coredla/pcie_ed/platform/build_synth)

--build Option to perform compilation of the PCIe design using Intel Quartus Prime after
instantiation (default: False).

-d, --archs-dir Path to a directory that contains Architecture Description Files for you to
interactively choose from (an alternative to -a)

-ed, --example-design-id To build for the Intel PAC with Intel Arria 10 GX FPGA board, specify 1.
To build for the Terasic DE10-Agilex board, specify 3.
(default: 1)

-n, --num-instances Number of IP instances to build (default: 2).
For the Intel PAC with Intel Arria 10 GX FPGA board, this number must be either
1 or 2.
For the Terasic DE10-Agilex board, this number must be 1, 2, 3, or 4.

--num-paths Number of top critical paths to report after compiling the design (default:
2000).

-q, --quiet Run script quietly without printing the output of underlying scripts to the
terminal.

--qor-modules List of internal modules (instance names) from inside the IP to include in the
QoR summary report.

-s, --seed Seed to be used in compiling the design (default: 1).

--unlicensed/licensed This option is passed to the dla_create_ip tool to tell the tool to generate
either an unlicensed or licensed copy of the Intel FPGA AI Suite:
• Unlicensed IP: Unlicensed IP has a limit of 10000 inferences. After 10000
inferences, the unlicensed IP refuses to perform any additional inference and
a bit in the CSR is set. For details about the CSR bit, refer to DMA Descriptor
Queue in Intel FPGA AI Suite IP Reference Manual.
• Licensed IP: Licensed IP has no inference limitation.
If you do not have a license but generate licensed IP, Quartus cannot generate a
bitstream.
If neither option is specified, then the dla_create_ip tool queries the lmutil
license manager to determine the correct option.

--wsl This option sets the build script to run such that the final Intel Quartus Prime
compilation runs in the Windows* environment. After the script sets up the
compilation, it prints the instructions to complete the compilation on Windows.
Restriction: Only supported within a WSL 2 environment and for the DE10-
Agilex example design.

--finalize Restriction: This option can be used only when following the instructions
provided by the build script run with the --wsl option.
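
As an example of how these options combine, the following sketch builds a
two-instance Intel Arria 10 design. The build directory is a placeholder, and
the option values are assembled from the table above for illustration;
substitute your own Architecture File as needed:

```shell
# Illustrative invocation only; adjust paths and options for your setup.
dla_build_example_design.py \
    -ed 1 \
    -n 2 \
    -a $COREDLA_ROOT/example_architectures/A10_Generic.arch \
    --build-dir ./my_build \
    -s 1 \
    --build
```

Here -ed 1 selects the Intel PAC with Intel Arria 10 GX FPGA board and -n 2
requests two IP instances, the maximum for that board.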

6.1.2. Script Flow

The following steps describe the internal flow of the
dla_build_example_design.py script for the Intel Arria 10 PAC:


1. Runs the dla_create_ip script to create an Intel FPGA AI Suite IP for the
requested Intel FPGA AI Suite architecture.
2. Creates a wrapper around the Intel FPGA AI Suite IP instances and adapter
logic.
3. Runs the OPAE afu_synth_setup script to create a build directory that has the
BSP infrastructure needed to compile the design with Intel Quartus Prime
software.
4. Runs the OPAE run.sh script to compile the design example with Intel Quartus
Prime software:
a. Compile the example design up to and including bitstream generation.
b. Analyze the timing report to extract the Intel FPGA AI Suite clock maximum
frequency (fMAX).
c. Create an AFU/AF bitstream (green bitstream) file with PLL configurations to
generate a clock that is slightly lower than the fMAX of the design.
5. Runs the OPAE PACSign script to generate an unsigned version of the AFU/AF
bitstream as well as a signed version.

The script uses the Intel Quartus Prime timing analyzer to report the top critical paths
of the compiled design example.

The unsigned and signed versions of the bitstreams are in the <build_dir> directory
that you set when running the script (or the default location, if you did not set it). The
signed and unsigned bitstream file names are dla_afu.gbs and
dla_afu_unsigned.gbs, respectively.

The Intel Quartus Prime compilation reports are available in the <build_dir>/
build/output_files directory. A build.log file that contains the full output
log from running the build script is available in the <build_dir> directory. In
addition, the achieved Intel FPGA AI Suite clock frequency is the
clock-frequency-low value in the following file:

<build_dir>/build/output_files/user_clock_freq.txt

6.2. Example Architecture Bitstream Files


The Intel FPGA AI Suite provides example Architecture Files and bitstreams for the
PCIe-based Example Design (Intel Arria 10 and Intel Agilex 7). The bitstreams are
distributed as a separate tarball.

6.3. Software Components


The PCIe-based design example contains a sample software stack for the runtime
flow.

The following figure, Software Stacks for Intel FPGA AI Suite Inference, shows the
complete runtime stack.


For the Intel Arria 10 design example, the following components comprise the runtime
stack:
• OpenVINO Toolkit 2022.3.1 LTS (Inference Engine, Heterogeneous Plugin)
• Intel FPGA AI Suite runtime plugin
• OPAE driver 1.1.2-2

For the Intel Agilex 7 design example, the following components comprise the runtime
stack:
• OpenVINO Toolkit 2022.3.1 LTS (Inference Engine, Heterogeneous Plugin)
• Intel FPGA AI Suite runtime plugin
• Terasic DE10-Agilex-B2E2 board driver

The PCIe-based design example contains the source files and Makefiles to build the
Intel FPGA AI Suite runtime plugin. The other components, OpenVINO and OPAE, are
external and must be manually pre-installed.

A separate flow compiles the AI network graph using the Intel FPGA AI Suite compiler,
as shown in figure Software Stacks for Intel FPGA AI Suite Inference below as the
Compilation Software Stack.

The compilation flow output is a single binary file called CompiledNetwork.bin that
contains the compiled network partitions for FPGA and CPU devices along with the
network weights. The network is compiled for a specific Intel FPGA AI Suite
architecture and batch size. This binary is created on disk only when using the
ahead-of-time (AOT) flow; when the just-in-time (JIT) flow is used, the compiled
object stays in memory only.

An Architecture File describes the Intel FPGA AI Suite IP architecture to the compiler.
You must specify the same Architecture File to the Intel FPGA AI Suite compiler and to
the Intel FPGA AI Suite PCIe Example Design build script
(dla_build_example_design.py).

The runtime flow accepts the CompiledNetwork.bin file as the input network along
with the image data files.

Figure 1. Software Stacks for Intel FPGA AI Suite Inference

Compilation Software Stack (inputs: Architecture File in .arch format, plus
network .xml and .bin files from the OpenVINO Model Optimizer; output:
CompiledNetwork.bin file):
• Intel FPGA AI Suite Compiler
• OpenVINO Inference Engine
• OpenVINO Hetero Plugin
• Intel FPGA AI Suite AOT Plugin

Runtime Software Stack (inputs: CompiledNetwork.bin file, plus input data files
in .bmp or .bin format):
• OpenVINO Sample Demo
• OpenVINO Inference Engine
• OpenVINO Hetero Plugin
• Intel FPGA AI Suite Runtime Plugin
• BSP Driver


The runtime stack cannot program the FPGA with a bitstream. To build a bitstream
and program the FPGA devices:
1. Compile the design example. For details, refer to Compiling the PCIe-based Design
Example.
2. Program the device with the bitstream. For details, refer to Programming the FPGA
Device (Intel Arria 10) on page 10 or Programming the FPGA Device (Intel Agilex
7) on page 10 (depending on your FPGA device).

To run inference through the OpenVINO Toolkit on the FPGA, set the OpenVINO device
configuration flag (used by the heterogeneous Plugin) to FPGA or HETERO:FPGA,CPU.

6.3.1. OpenVINO™ FPGA Runtime Plugin

The FPGA runtime plugin uses the OpenVINO™ Inference Engine Plugin API. The
OpenVINO™ plugin architecture is described in the OpenVINO™ Developer Guide for
Inference Engine Plugin Library.

The source files are located under runtime/plugin. The three main components of
the runtime plugin are the Plugin class, the Executable Network class, and the
Inference Request class. The primary responsibilities for each class are as follows:

Plugin class
• Initializes the runtime plugin with an Intel® FPGA AI Suite Architecture File which
you set as an OpenVINO™ configuration key (refer to Running the Ported
OpenVINO Demonstration Applications on page 14).
• Contains QueryNetwork function that analyzes network layers and returns a list
of layers that the specified architecture supports. This function allows network
execution to be distributed between FPGA and other devices and is enabled with
the HETERO mode.
• Creates an executable network instance in one of the following ways:
— Just-in-time (JIT) flow: Compiles a network such that the compiled network is
compatible with the hardware corresponding to the Intel FPGA AI Suite
Architecture File, and then loads the compiled network onto the FPGA device.
— Ahead-of-time (AOT) flow: Imports a precompiled network (exported by Intel
FPGA AI Suite compiler) and loads it onto the FPGA device.

Executable Network Class


• Represents an Intel FPGA AI Suite compiled network
• Loads the compiled model and config data for the network onto the FPGA device
that has already been programmed with an Intel FPGA AI Suite AFU/AF bitstream.
For two instances of Intel FPGA AI Suite, the Executable Network class loads the
network onto both instances, allowing them to perform parallel batch inference.
• Stores input/output processing information.
• Creates infer request instances for pipelining multiple batch execution.


Infer Request class


• Runs a single batch inference serially.
• Executes five stages in one inference job – input layout transformation on CPU,
input transfer to DDR, Intel FPGA AI Suite FPGA execution, output transfer from
DDR, output layout transformation on CPU.
• In asynchronous mode, executes the stages on multiple threads that are shared
across all inference request instances so that multiple batch jobs are pipelined,
and the FPGA is always active.

Related Information
OpenVINO™ Developer Guide for Inference Engine Plugin Library

6.3.2. Intel FPGA AI Suite Runtime


The Intel FPGA AI Suite runtime implements lower-level classes and functions that
interact with the memory-mapped device (MMD). The MMD is responsible for
communicating requests to the OPAE driver, and the OPAE driver connects to the OPAE
FPGA BSP, and ultimately to the Intel FPGA AI Suite IP instance or instances.

The runtime source files are located under runtime/coredla_device. The three
most important classes in the runtime are the Device class, the GraphJob class, and
the BatchJob class.

Device class
• Acquires a handle to the MMD for performing operations by calling
aocl_mmd_open.
• Initializes a DDR memory allocator with the size of 1 DDR bank for each Intel
FPGA AI Suite IP instance on the device.
• Implements and registers a callback function on the MMD DMA (host to FPGA)
thread to launch Intel FPGA AI Suite IP for batch=1 after the batch input data is
transferred from host to DDR.
• Implements and registers a callback function (interrupt service routine) on the
MMD kernel interrupt thread to service interrupts from hardware after one batch
job completes.
• Provides the CreateGraphJob function to create a GraphJob object for each Intel
FPGA AI Suite IP instance on the device.
• Provides the WaitForDla(instance id) function to wait for a batch inference
job to complete on a given instance. The function returns instantly if the
number of finished batch jobs (that is, the number of jobs processed by the
interrupt service routine) is greater than the number of batch jobs already
waited on for this instance. Otherwise, the function blocks until the
interrupt service routine notifies it. Before returning, the function
increments the count of batch jobs that have been waited on for this
instance.


GraphJob class
• Represents a compiled network that is loaded onto one instance of the Intel FPGA
AI Suite IP on an FPGA device.
• Allocates buffers in DDR memory to transfer configuration, filter, and bias data.
• Creates BatchJob objects for a given number of pipelines and allocates input and
output buffers for each pipeline in DDR.

BatchJob class
• Represents a single batch inference job.
• Stores the DDR addresses for batch input and output data.
• Provides LoadInputFeatureToDdr function to transfer input data to DDR and
start inference for this batch asynchronously.
• Provides ReadOutputFeatureFromDdr function to transfer output data from
DDR. Must be called after inference for this batch is completed.

6.3.3. BSP Driver


The Intel FPGA AI Suite runtime MMD software uses a driver supplied as part of the
BSP to access and interact with the FPGA device.

The source files for the driver are in runtime/coredla_device/mmd. The source
files contain classes for managing and accessing the FPGA device by using BSP
functions for reading/writing to CSR, reading/writing to DDR, and handling kernel
interrupts.

BSP Driver for the Intel Arria 10 Design Example

The Intel Programmable Acceleration Card (PAC) with Intel Arria 10 GX FPGA uses
OPAE software libraries as its BSP driver.

To compile the Intel FPGA AI Suite runtime library and run the demonstration
application on the Intel PAC board, you must have the OPAE software libraries
installed on the machine according to the installation instructions under Section 4 -
Installing the OPAE Software Package of the Intel Acceleration Stack Quick Start
Guide.

BSP Driver for the Intel Agilex 7 Design Example

Contact your Intel representative for information on the driver for the Terasic DE10-
Agilex board support package.

Related Information
Intel Acceleration Stack Quick Start Guide

6.4. Software Interface to the BSP


The interface to the user-space portion of the BSP drivers is centralized in the
MmdWrapper class, which can be found in the file
$COREDLA_ROOT/runtime/coredla_device/inc/mmd_wrapper.h.


When porting the runtime to a new board, the team responsible for the new board
support must ensure that each of the member functions in MmdWrapper calls into a
board-specific implementation function. That team also needs to modify the
runtime build process and adjacent code.


7. Design Example System Architecture for the Intel PAC with Intel Arria 10 GX FPGA

The Intel PAC with Intel Arria 10 GX FPGA design example is derived from the OpenCL
BSP provided with the Intel Acceleration Stack for Intel Xeon CPU with FPGAs (which
also includes OPAE technology).

The Intel Acceleration Stack is designed to make FPGAs usable as accelerators. On the
FPGA side, the Intel Acceleration Stack splits acceleration functions into two parts:
• The FPGA interface manager (FIM) is FPGA hardware that contains the FPGA
interface unit (FIU) and external interfaces for functions like memory access and
networking. The FIM is locked and cannot be changed. The FIM is sometimes
referred to as BBS, blue bits, or blue bitstream.
• The accelerator function (AF) is a compiled accelerator image implemented in
FPGA logic that accelerates an application. AFs are compiled from accelerator
functional units (AFUs). An AFU and associated AFs are sometimes referred to as
GBS, green bits, or green bitstream. An FPGA device can be reprogrammed while
leaving the FIM in place.

The FIM handles external interfaces to the host, to which it is connected via PCIe. On
the host side, a driver stack communicates with the AFU via the FIM. This is referred
to as OPAE (Open Programmable Acceleration Engine). OPAE talks to the AFU with the
CCI-P (core cache interface) protocol that provides an abstraction over PCIe protocol.

7.1. System Overview


The system consists of the following components connected to a host system via a
PCIe interface as shown in the following figure.
• A board with the FPGA device
• On-board DDR memory

The FPGA image consists of the Intel FPGA AI Suite IP and additional logic that
connects it to a PCIe interface and DDR. The host can read and write to the DDR
memory through the PCIe port. In addition, the host can communicate with and
control the Intel FPGA AI Suite instances through the PCIe connection, which is
also connected to the direct memory access (DMA) CSR port of the Intel FPGA AI
Suite instances.


The Intel FPGA AI Suite IP accelerates neural network inference on batches of images.
The process of executing a batch follows these steps:
1. The host writes a batch of images, weights, and config data to DDR where weights
can be reused between batches.
2. The host writes to the Intel FPGA AI Suite CSR to start execution.
3. Intel FPGA AI Suite computes the results of the batch and stores them in DDR.
4. Once the computation is complete, Intel FPGA AI Suite raises an interrupt to the
host.
5. The host reads back the results from DDR.

Figure 2. Intel FPGA AI Suite Example Design System Overview

The host connects over PCIe to the FPGA on the board. The design example in the
FPGA contains two Intel FPGA AI Suite IP core instances (instance 0 and
instance 1), each attached to its own on-board DDR bank (Bank A and Bank B).

7.2. Hardware

This section describes the Example Design (Intel Arria 10) in detail. However,
many of the components close to the IP are shared with the Example Design
(Intel Agilex 7).

A top-level view of the design example is shown in Intel FPGA AI Suite Example
Design Top Level.

There are two instances of Intel FPGA AI Suite, shown on the right (dla_top.sv). All
communication between the Intel FPGA AI Suite IP systems and the outside occurs via
the Intel FPGA AI Suite DMA. The Intel FPGA AI Suite DMA provides a CSR (which also
has interrupt functionality) and reader/writer modules which read/write from DDR.

The host communicates with the board through PCIe using the CCI-P protocol. The
host can do the following things:
1. Read and write the on-board DDR memory (these reads/writes do not go through
Intel FPGA AI Suite).
2. Read/write to the Intel FPGA AI Suite DMA CSR of both instances.
3. Receive interrupt signals from the Intel FPGA AI Suite DMA CSR of both instances.

Intel FPGA AI Suite: PCIe-based Design Example User Guide Send Feedback

26
7. Design Example System Architecture for the Intel PAC with Intel Arria 10 GX FPGA
768977 | 2023.12.01

Each Intel FPGA AI Suite IP instance can do the following things:


1. Read/write to its DDR bank.
2. Send interrupts to the host through the interrupt interface.
3. Receive reads/writes to its DMA CSR.

From the perspective of the Intel FPGA AI Suite accelerator function (AF), the
external connections are to the CCI-P interface running over PCIe and to the
on-board DDR4 memory. The DDR memory is connected directly to the board.qsys
block, while the CCI-P interface is converted into Avalon memory-mapped (MM)
interfaces in the bsp_logic.sv block for communication with the board.qsys
block.

The board.qsys block arbitrates the connections to DDR memory between the
reader/writer modules in Intel FPGA AI Suite IP and reads/writes from the host. Each
Intel FPGA AI Suite IP instance in this design has access to only one of the two DDR
banks. This design decision implies that no more than two simultaneous Intel FPGA AI
Suite IP instances can exist in the design. Adding an additional arbiter would relax this
restriction and allow additional Intel FPGA AI Suite IP instances.

Much of board.qsys operates using the Avalon® memory-mapped (MM) interface
protocol. The Intel FPGA AI Suite DMA uses the AXI protocol, and board.qsys has
Avalon MM-to-AXI adapters just before each interface is exported (so that,
outside of the Platform Designer system, it can be connected to the Intel FPGA
AI Suite IP). Clock crossings are also handled inside board.qsys. For example,
the host interface must be brought to the DDR clock to talk with the Intel FPGA
AI Suite IP CSR.

There are three clock domains: the host clock, the DDR clock, and the Intel
FPGA AI Suite IP clock. The PCIe logic runs on the host clock at 200 MHz. The
Intel FPGA AI Suite DMA and the platform adapters run on the DDR clock. The
rest of the Intel FPGA AI Suite IP runs on the Intel FPGA AI Suite IP clock.

Intel FPGA AI Suite IP protocols:
• Readers and writers: 512-bit data (width configurable), 32-bit address, AXI4
interface, 16-word maximum burst (fixed).
• CSR: 32-bit data, 11-bit address.


Figure 3. Intel FPGA AI Suite Example Design Top Level

Note: Arrows show host/agent relationships. Clock domains indicated with dashed
lines.

The figure shows the host connected over PCIe (host clock at 200 MHz) to the
DCP blue bits (static region) and the ccip_std_afu.sv wrapper. CCI-P MMIO and
read/write channels pass through CCI MPF and bsp_logic.sv, which converts them
to Avalon MM interfaces into board.qsys (DDR clock at 266 MHz). board.qsys
connects to two dla_top.sv instances inside dla_platform_wrapper.sv; each
instance contains an Intel FPGA AI Suite DMA with a 32-bit AXI4-Lite CSR
interface and 512-bit AXI4 reader/writer interfaces, plus interrupts back to
the host. DDR banks A and B attach to board.qsys.

The board.qsys block interfaces between the DDR memory, the readers/writers,
and the host read/write channels. The internals of the board.qsys block are
shown in Figure 4, which shows three Avalon MM interfaces on the left and
bottom: MMIO, host read, and host write.


• Host read is used to read data from DDR memory and send it to the host.
• Host write is used to transfer data from the host into DDR memory.
• The MMIO interface performs several functions:
— DDR read and write transactions are initiated by the host via the MMIO
interface.
— Reading from the AFU ID block. The AFU ID block identifies the AFU with a
unique identifier and is required for the OPAE driver.
— Reading/writing to the DLA DMA CSRs, where each instance has its own CSR
base address.
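The routing performed by this address decode can be sketched as a small model. The address ranges below are taken from the map in Figure 4; the Python model itself is illustrative only and is not part of the design example sources.

```python
# Illustrative model of the board.qsys MMIO address decode (ranges from Figure 4).
# This is not design example code; it only demonstrates how an MMIO offset is
# routed to an agent.

MMIO_MAP = [
    (0x00000, 0x0003F, "Board AFU ID"),
    (0x10000, 0x11FFF, "FPGA DDR register access (address expander)"),
    (0x20000, 0x2007F, "msgdma_bbb.qsys (scatter-gather DMA)"),
    (0x20100, 0x2017F, "msgdma_bbb.qsys (scatter-gather DMA)"),
    (0x20200, 0x2023F, "DDR AFU ID"),
    (0x38000, 0x38FFF, "CoreDLA instance 0 CSR"),
    (0x39000, 0x39FFF, "CoreDLA instance 1 CSR"),
]

def decode(offset: int) -> str:
    """Return the agent an MMIO offset decodes to, or 'unmapped'."""
    for base, last, agent in MMIO_MAP:
        if base <= offset <= last:
            return agent
    return "unmapped"

print(decode(0x38010))  # CoreDLA instance 0 CSR
print(decode(0x00020))  # Board AFU ID
```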

Figure 4. The board.qsys Block, Showing Two DDR Connections and Two IP Instances

Note: Arrows indicate host/agent relationships (from host to agent).

(Figure content: board.qsys contains an MMIO address decode on the 200 MHz host
clock, clock-crossing bridges into the 266 MHz DDR clock domain, Avalon MM/AXI4
converters, arbiters, and the ddr_board.qsys block. All interfaces are Avalon MM
unless stated otherwise. The MMIO address map is:

• 0x00000 - 0x0003F: Board AFU ID (64-bit)
• 0x10000 - 0x11FFF: Address expander for host register access into FPGA DDR (64-bit)
• 0x20000 - 0x2007F: msgdma_bbb.qsys (64-bit)
• 0x20100 - 0x2017F: msgdma_bbb.qsys (64-bit)
• 0x20200 - 0x2023F: DDR AFU ID (64-bit)
• 0x38000 - 0x38FFF: CoreDLA instance 0 CSR (32-bit)
• 0x39000 - 0x39FFF: CoreDLA instance 1 CSR (32-bit)

Within ddr_board.qsys, read and write arbiters and clock-crossing bridges connect
the host write, host read, msgdma_bbb.qsys (scatter-gather DMA), and CoreDLA
reader/writer interfaces to the two DDR4 banks through Avalon MM/AXI4 converters.)

The above figure also shows the ddr_board.qsys block. The three central blocks
(the address expander and the two msgdma_bbb.qsys scatter-gather DMA instances)
allow host direct memory access (DMA) to DDR. This DMA is distinct from the DMA
module inside the Intel FPGA AI Suite IP, shown in Figure 3 on page 28. Host reads
and writes begin with the host sending a request via the MMIO interface to initiate
a read or write. When a read is requested, the DMA gathers the data from DDR and
sends it to the host via the host-read interface. When a write is requested, the
DMA reads the data over the host-write interface and subsequently writes it to DDR.
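This host-initiated flow can be sketched as a simple software model. The model below is illustrative only; it stands in for the real MMIO, scatter-gather DMA, and DDR hardware, and does not reflect the actual driver API.

```python
# Illustrative model of the host DMA flow described above; not the actual
# OPAE/driver code. DDR is modeled as a bytearray, and the host-write and
# host-read interfaces as plain byte strings.

DDR = bytearray(1024)

def host_initiated_write(ddr_addr: int, payload: bytes) -> None:
    """Host requests a write via MMIO; the DMA pulls the payload over the
    host-write interface and writes it to DDR."""
    DDR[ddr_addr:ddr_addr + len(payload)] = payload

def host_initiated_read(ddr_addr: int, length: int) -> bytes:
    """Host requests a read via MMIO; the DMA gathers data from DDR and
    returns it over the host-read interface."""
    return bytes(DDR[ddr_addr:ddr_addr + length])

host_initiated_write(0x40, b"input tensor")
print(host_initiated_read(0x40, 12))  # b'input tensor'
```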

Note that in board.qsys, a block for the Avalon MM to AXI4 conversion is not
explicitly instantiated. Instead, an Avalon MM pipeline bridge connects to an AXI4
bridge. Platform Designer implicitly infers a protocol adapter between these two
bridges.


Note: Avalon MM/AXI4 adapters in Platform Designer might not close timing.

Platform Designer optimizes for area instead of fMAX by default, so you might need to
change the interconnect settings for the inferred Avalon MM/AXI4 adapter. For
example, we made some changes as shown in the following figure.

Figure 5. Adjusting the Interconnect Settings for the Inferred Avalon MM/AXI4
Adapter to Optimize for fMAX Instead of Area.

Note: This enables timing closure on the DDR clock.

To access the view in the above figure:

• Within the Platform Designer GUI, choose View -> Domains. This brings up the
Domains tab in the top-right window.
• From there, choose an interface (for example, ddr_0_axi).
• For the selected interface, you can adjust the interconnect parameters, as shown
in the bottom-right pane.
• In particular, we needed to change Burst adapter implementation from
Generic converter (slower, lower area) to Per-burst-type converter
(faster, higher area) to close timing on the DDR clock.

This was the only change needed to close timing; however, it took several rounds of
experimentation to determine that this was the setting of importance. Depending on
your system, you might need to adjust other settings as well.

7.2.1. PLL Adjustment


The design example build script adjusts the PLL driving the Intel FPGA AI Suite IP
clock based on the fMAX that the Intel Quartus Prime compiler achieves.


A fully rigorous production-quality flow would rerun timing analysis after the PLL
adjustment to account for the small possibility that the change in PLL frequency
alters the clock characteristics (for example, jitter) and causes a timing failure.
A production design that shares the Intel FPGA AI Suite IP clock with other system
components might target a fixed frequency and skip the PLL adjustment.
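The adjustment itself amounts to choosing the highest PLL setting that the achieved fMAX supports. The sketch below illustrates that selection; the frequency list and the selection logic are assumptions for illustration, not the actual build script implementation.

```python
# Illustrative only: pick the highest PLL frequency that does not exceed the
# fmax achieved by Quartus Prime timing analysis. The real build script's
# supported-frequency list and selection logic may differ.
def select_pll_frequency(achieved_fmax_mhz: float,
                         supported_mhz: list[float]) -> float:
    candidates = [f for f in supported_mhz if f <= achieved_fmax_mhz]
    if not candidates:
        raise ValueError("achieved fmax is below all supported PLL settings")
    return max(candidates)

# Hypothetical frequency list and achieved fmax:
print(select_pll_frequency(387.5, [200.0, 266.0, 300.0, 350.0, 400.0]))  # 350.0
```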


A. Intel FPGA AI Suite PCIe-based Design Example User Guide Archives

For the latest and previous versions of this user guide, refer to Intel FPGA AI Suite
PCIe-based Design Example User Guide. If an Intel FPGA AI Suite software version is
not listed, the user guide for the previous software version applies.

Intel Corporation. All rights reserved. Intel, the Intel logo, and other Intel marks are trademarks of Intel
Corporation or its subsidiaries. Intel warrants performance of its FPGA and semiconductor products to current
specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any
products and services at any time without notice. Intel assumes no responsibility or liability arising out of the
application or use of any information, product, or service described herein except as expressly agreed to in
writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying
on any published information and before placing orders for products or services. ISO 9001:2015 Registered.
*Other names and brands may be claimed as the property of others.

B. Intel FPGA AI Suite PCIe-based Design Example User Guide Document Revision History
Document Version | Intel FPGA AI Suite Version | Changes

2023.12.01 | 2023.3
• Added --wsl and --finalize options to "Build Script Options".

2023.09.06 | 2023.2.1
• Updated supported OpenVINO version to 2022.3.1 LTS.

2023.07.03 | 2023.2
• Updated "Software Components".
• Renamed "OPAE Driver" to "BSP Driver" and revised the content in the topic.
• Updated supported OpenVINO version to 2022.3 LTS.
• Updated OpenVINO installation paths to /opt/intel/openvino_2023.
• Updated Intel FPGA AI Suite installation paths to /opt/intel/fpga_ai_suite_2023.2.
• Changed occurrences of tools/downloader/downloader.py to omz_downloader.
• Changed occurrences of tools/downloader/converter.py to omz_converter.

2023.04.05 | 2023.1
• Renamed the dlac command. The Intel FPGA AI Suite compiler command is now dla_compiler.
• Updated the Intel Agilex™ product family name to "Intel Agilex 7."

2022.12.23 | 2022.2
• Removed the -f build script (dla_create_and_build_pcie_ed.py) option.
• Renamed the benchmark_app to dla_benchmark.
• Renamed dla_create_and_build_pcie_ed.py to dla_build_example_design.py.

2022.05.26 | 2022.1.1
• Updated description of how to run the OpenVINO demos.

2022.04.27 | 2022.1
• Additional updates for Intel Agilex device support.
• Updated the locations of graphs from the Getting Started tutorial.

2021.09.10 | 2021.2
• Added updates for initial Intel Agilex device support.

2021.04.30 | 2021.1
• Added additional demonstration programs.
• Various corrections and updates.

2020.12.04 | 2020.2
• Initial release.

