Hailo Dataflow Compiler v3.30.0 User Guide
Release 3.30.0
1 January 2025
Table of Contents
I User Guide
2 Changelog
4 Tutorials
4.1 Dataflow Compiler Tutorials Introduction
4.2 Parsing Tutorial
4.3 Model Optimization Tutorial
4.4 Compilation Tutorial
4.5 Inference Tutorial
4.6 Accuracy Analysis Tool Tutorial
4.7 Quantization Aware Training Tutorial
5 Building Models
5.1 Translating Tensorflow and ONNX Models
5.2 Model Scripts
5.3 Model Optimization
5.4 Model Compilation
5.5 Supported Layers
Bibliography
Copyright
No part of this document may be reproduced or transmitted in any form without the expressed, written permission
of Hailo. Nothing contained in this document should be construed as granting any license or right to use proprietary
information for that matter, without the written permission of Hailo.
General Notice
Hailo, to the fullest extent permitted by law, provides this document “as-is” and disclaims all warranties, either ex-
press or implied, statutory or otherwise, including but not limited to the implied warranties of merchantability, non-
infringement of third parties’ rights, and fitness for particular purpose.
Although Hailo used reasonable efforts to ensure the accuracy of the content of this document, it is possible that
this document may contain technical inaccuracies or other errors. Hailo assumes no liability for any error in this
document, and for damages, whether direct, indirect, incidental, consequential or otherwise, that may result from
such error, including, but not limited to loss of data or profits.
The content in this document is subject to change without prior notice and Hailo reserves the right to make changes
to content of this document without providing a notification to its users.
Part I
User Guide
1.1. Introduction
The Dataflow Compiler API is used for compiling users’ models to Hailo binaries. The input to the Dataflow Compiler is a trained Deep Learning model; the output is a binary file that is loaded onto the Hailo device.
The HailoRT API is used for deploying the built model on the target device. This library is used by the runtime applications.
[Figure 1: The Hailo software stack. The Hailo Dataflow Compiler (SDK), exposed through a Python API and CLI tools, comprises the Model Parser, Model Optimizer, Resource Allocator, Profiler, Emulator, and Compiler. HailoRT provides the C/C++ API and library, the pyHailoRT Python API, CLI tools, an Integration Tool (in preview), and the Hailo driver, which communicates with the NN core (part of a Hailo Vision Processor or AI Accelerator) over Ethernet (via the OS IP stack), PCIe, or an integrated interface.]
The Hailo Dataflow Compiler toolchain enables users to generate a Hailo executable binary file (HEF) based on input
from a Tensorflow checkpoint, a Tensorflow frozen graph file, a TFLite file, or an ONNX file. The build process consists
of several steps including translation of the original model to a Hailo model, model parameters optimization, and
compilation.
[Figure 2: Model build process, starting from a Tensorflow or ONNX model and ending with a Hailo binary (HEF). A Hailo Archive (HAR) first holds the model representation and parameters (32-bit weights); Hailo Model Optimization, fed with calibration images, produces a HAR with the optimized model representation and parameters (quantized weights). At each stage the model can be analyzed with the Hailo Profiler Tool (resources) and the Hailo Emulator (numeric model accuracy).]
After the user has prepared the model in its original format, it can be converted into Hailo-compatible representation files. The translation API receives the user’s model and generates an internal Hailo representation format (a HAR compressed file, which includes HN and NPZ files). The HN model is a textual JSON output file. The weights are also returned as a NumPy NPZ file.
1.2.2. Profiler
The Profiler tool uses the HAR file and profiles the expected performance of the model on hardware. This includes the number of required devices, hardware resource utilization, and throughput (in frames per second). A breakdown of the profiling figures for each of the model’s layers is also provided.
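For example, the Profiler can be invoked from the command line on a saved HAR file (a sketch; the file name is a placeholder):

hailo profiler resnet_v1_18_compiled.har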
1.2.3. Emulator
The Dataflow Compiler Emulator allows users to run inference on their model without actual hardware. The Emulator supports three main modes: native mode, fp_optimize mode, and quantized mode. The native mode runs the original model with float32 parameters, the fp_optimize mode runs with float32 parameters and all model modifications, and the quantized mode provides results that mimic the hardware implementation. Please note that the quantized emulator is not bit-exact to the Hailo hardware, but offers a good and fast approximation. The native mode can be used to validate the Tensorflow/ONNX translation process, the fp_optimize mode can be used to validate the model modifications, and the quantized mode can be used to analyze the optimized model’s accuracy.
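For instance, the three modes map to the inference contexts of the Python API (a sketch, assuming runner already holds the model and dataset is a preprocessed NumPy batch):

from hailo_sdk_client import InferenceContext

with runner.infer_context(InferenceContext.SDK_NATIVE) as ctx:
    native_res = runner.infer(ctx, dataset)      # original model, float32
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    fp_res = runner.infer(ctx, dataset)          # float32 plus model modifications
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    quantized_res = runner.infer(ctx, dataset)   # approximates the hardware implementation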
After the user generates the HAR representation, the next step is to convert the parameters from float32 to integer
representation. To convert the parameters, the user should run the model emulation in native mode on a small set of
images and collect activation statistics. Based on these statistics, the calibration module will generate a new network
configuration for the integer representation. This includes integer weights and biases, scaling configuration, and HW
configuration.
Now the model can be compiled into a HW-compatible binary format with the extension HEF. The Dataflow Compiler allocates hardware resources to reach the highest possible FPS within reasonable allocation difficulty. Then the microcode is compiled and the HEF is generated. This whole step is performed internally, so from the user’s perspective the compilation is done by calling a single API.
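From the command line, the same single step might look like this (a sketch; the HAR name is a placeholder):

hailo compiler resnet_v1_18_optimized.har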
The Dataflow Compile Studio allows users to parse and visualize neural network graphs efficiently. Users can upload ONNX or TFLite files. By default, the tool provides recommended start and end nodes for the parsing process. In the next screen, the GUI displays a side-by-side comparison of Hailo’s parsed graph and the original graph. Users can review these recommended start and end nodes, make adjustments as needed, and re-parse the graph to see the updated results. This interactive process ensures that users can tune the parsing to meet their specific requirements. See Using Dataflow Studio for more details.
After the model is compiled, it can be used to run inference on the target device. The HailoRT library provides access
to the device in order to load and run the model. This library is accessible from both C/C++ and Python APIs. It also
includes command line tools.
On Hailo-8 and Hailo-10, if the device is connected to the host through PCIe, the HailoRT library uses Hailo’s PCIe driver to communicate with the device. On Hailo-8, if Ethernet is used, the library uses the Linux IP stack to communicate. On Hailo-15, the HailoRT library communicates with the NN core through an internal interface.
The HailoRT library can be installed on the same machine as the Dataflow Compiler (on accelerator modules, such as Hailo-8) or on a separate machine. A Yocto layer is provided to allow easy integration of HailoRT into embedded environments.
Hailo-8™ is a series of AI accelerator modules that allow edge devices to run deep learning applications at full scale more efficiently, effectively, and sustainably than other AI chips and solutions, while significantly lowering costs.
The relevant hardware architecture types to be used in the compilation process are:
hailo8
Use hw_arch=hailo8 to compile for Hailo-8 based devices, such as Hailo-8, Century, or custom Chip-on-Board solutions.
hailo8l
Use hw_arch=hailo8l to compile for Hailo-8L based devices, such as Hailo-8L, or custom Chip-on-Board solutions.
hailo8r
Use hw_arch=hailo8r to compile for Hailo-8R based devices, such as Mini PCIe modules.
Hailo-15™ is a series of AI vision processors, featuring up to 20 TOPS of AI performance. The Hailo-15 is a System-on-a-Chip (SoC) that combines Hailo’s AI capabilities with advanced computer vision engines, generating premium image quality and advanced video analytics.
hailo15h
Use hw_arch=hailo15h to compile for Hailo-15H based devices.
hailo15m
Use hw_arch=hailo15m to compile for Hailo-15M based devices.
Hailo-10™ is a series of AI accelerator modules, delivering up to 40 TOPS of AI performance with 4-bit precision and up to 20 TOPS with 8-bit precision. Hailo-10 is a powerful and scalable structure-driven dataflow architecture that takes advantage of the core properties of neural networks. It enables edge devices to run deep learning applications at full scale more efficiently and effectively. Unlike Hailo-8, the Hailo-10 accelerator contains direct DRAM access, which allows it to scale for large models.
hailo10h
Use hw_arch=hailo10h to compile for Hailo-10H based devices.
2. Changelog
Dataflow Compiler v3.30.0 (January 2025)
Package Updates
Model Optimization
Parser
Compiler
• HEF compatibility warning: the next DFC version is expected to use a different error-detection mechanism in HEF files by default (moving from CRC32 to XXHASH3_64); therefore, older HailoRT versions will not be able to run HEFs compiled by the new DFC version. The current version uses CRC32 by default, but can also produce the new format with the following command: hef_param(hef_signature=xxhash3_64).
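For example, the command above is simply a model script (.alls) line (a sketch; add it to the model script used for compilation):

hef_param(hef_signature=xxhash3_64)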
Deprecated APIs
• Deprecation warning for parsing of TensorFlow 1.x models (.ckpt or .pb files) using all parsing APIs. TensorFlow users are recommended to move to TensorFlow Lite.
Parser
• Added the input-format flag to the parser CLI API. This flag maps to the net_input_format flag in translate_onnx_model(). Command example: hailo parser onnx model.onnx --input-format BWC, for an input tensor with shape [B, W, C] (batch, width, channels) (preview).
• Einsum operator support in ONNX was expanded with another equation, bmchw,bnmc → bmhwn,
which represents a group convolution.
Model Optimization
• Added format conversion support for YCbCr to RGB conversions; full details are in the input conversion documentation.
– YUV601 to RGB: complies with the BT.601 standard, equivalent to the current default YUV to RGB conversion.
• Added a model script command to allow splitting of a fused activation from its layer, making it a standalone
activation.
Tools
• The redesigned dfc-studio now supports viewing multiple models within the same project.
• New feature in dfc-studio: mapping corresponding layers between the original graph and Hailo’s graph in the Parsing stage.
Deprecated APIs
General
Parser
Compiler
• Added a new Compiler flag for allocating very big models, which returns the first feasible partition it finds, reducing compile time at the expense of the solution’s quality.
Model Optimization
• Added a new flag to the optimization CLI, --compilation-only-har, allowing saving of a reduced-size HAR that contains only compilation-related data.
Tools
• Added a preview version of “Dataflow Compile Studio”, a new graphical tool for Hailo’s toolchain.
Parser
– Mapping input layers to original input node names, to ease creation of a feed dict for native inference.
– Listing output layers by their original names, in the same order as specified by the user (or as in the original model, if not specified).
Post Processing
• Added support for post-processing of bbox decoding only in YOLOv5, by using bbox_decoding_only=True.
• YOLOv5 SEG NMS for the instance segmentation task is supported in all stages of emulation and compilation with engine=cpu (preview).
Emulator
General
Model Optimization
– Full precision models are serialized to the Hailo archive in additional states: QUANTIZED_MODEL, COMPILED_MODEL.
• output_encoding_vector added to include a different multiplicative scale for each feature (preview).
Compiler
• Improved compilation time for big models on all hardware architectures, for both multi-context and single-context networks.
Kernels
Post Processing
• YOLOv8 NMS is supported in all stages of emulation and compilation with engine=cpu.
Parser
• Added support for LSTM bidirectional layers (PyTorch and ONNX only). Please note that this operator is unrolled by the sequence length, which may add a large number of layers to the model for long sequences.
Deprecated APIs
• Profiler mode deprecation: the Profiler will select its mode automatically.
General
• Interactive Mode in the parser CLI: allows retrying a failed parsing with suggested start/end nodes, or adding an auto-detected NMS post-process to the model script.
• Supports an optimization-only mode, to only display optimization-related data (saves compilation time).
Compiler
• Allocation algorithm improvements that result in higher FPS for most models.
Model Modifications
• Added resize model script command for applying resize layer on input or output tensor(s).
• The input_conversion command for NV conversions (nv12, nv21, i420) returns only one layer when converting to YUV, and two conversion layers when converting to RGB.
• All layers that are added to the model using input_conversion, now show up on Netron and Visualizer.
• NMS post-process:
– The value of nms_scores_th in the default NMS post-process config json was changed from 0.01 to 0.3.
– When using the NMS post-process on CPU with the default configuration, nms_iou_th is changed to 0.6.
Model Optimization
• Ability to run MO algorithms at reduced resolution, to decrease running time and RAM consumption.
– Added a full-precision-only argument to the hailo optimize CLI command, allowing running just the full precision optimizations on a model. Command example: hailo optimize model.har --full-precision-only --model-script script.alls
– Defuse (split) Multi-head attention blocks into groups for easier compilation, using a model script command.
– Convolution layers are defused (split) automatically by input features if they are large enough; this is also possible using a model script command.
Parser
• Added support for Softsign activation (PyTorch, Tensorflow, not supported in TFLite).
• Added support for ceil_mode=True in pooling layers (PyTorch and ONNX only).
• Added support for RNN and LSTM layers (PyTorch and ONNX only). Please note that these operators are unrolled by the sequence length, which may add a large number of layers to the model for long sequences.
• Added support for the OneHot operator (preview level, PyTorch and ONNX only), limited to the last axis.
• Added support for the Greater activation (PyTorch and ONNX only), limited to a constant value only.
• Added support for Conv3D and Concat3D (PyTorch and ONNX only) - preview, limited support: models are assumed to have rank-4 input and output.
Deprecated APIs
• Deprecation warning for the resize_input model script command; please use resize instead.
• Profiler:
– --use-new-report flag was deprecated (since the new report is used by default)
– profile() return type will change to a single Python dict type in the near future
– Deprecation message for the --mode CLI argument
General
Model Optimization
• 16-bit precision mode can be applied to specific Conv layers inside the model to increase their accuracy
Profiler
• You can use the new profiler HTML design by appending the --use-new-report flag to the CLI command (preview; will be the default starting 2023-10)
Parser
• The order of output layer names is determined by their order in the parser API
– If a post-process json configuration file is used (for SSD, for example), the reg and cls layer names can remain empty, and the auto-detect algorithm will locate them
• Added the set_seed command for reproducing quantization results; affects the seed of the TensorFlow, NumPy, and Python random libraries (preview)
– YOLOX - after the objectness and classes layers before the NMS
Compiler
• hailo optimize now uses RGB images instead of random data when the --use-random-calib-set flag is given
• hailo analyze-noise now saves its results inside the model’s HAR
Deprecated APIs
Known issues
• Refer to the Hailo AI SW Suite: Known Issues page for an updated list of issues
Compiler
• Introducing Performance Mode, which gradually increases the utilization to achieve the best FPS (preview)
• The compiler has been optimized for better stability and performance
Model Optimization
• Added support for 16-bit precision on full networks, in case all layers are supported (preview)
• Optimization levels now range between 0 (no optimization) and 4 (best plausible optimization), as opposed to 0-3. Their current description is found in the model_optimization_flavor API guide
• The default optimization level is now 2 for a GPU with 1024 images or more, 1 for a GPU with fewer than 1024 images, and 0 for CPU only
• When importing Hailo Python libraries, the TF memory allocation mechanism is set to “memory growth” by default, to decrease memory consumption. This can be overridden with an environment variable
• Improved the FineTune algorithm for models with multiple output nodes
• A 16-bit output layer is enabled automatically when supported, for small output tensors
• When optimization fails, a better error message is displayed, referring to the failing algorithm
Profiler
• The HTML profiler now displays a quick version of the layer analysis tool (Accuracy tab) automatically
• Added the --stream-fps flag to hailo profiler, to be used with single-context models, to evaluate the performance using an FPS lower than the network’s FPS
• Added the --collect-runtime-data flag to hailo profiler, to automatically infer using hailortcli and display runtime data in the report
Emulator
• Added support for emulating YOLOv5 NMS with engine=cpu, as well as for SSD
Parser
• nms_postprocess command supports SSD post-processing also on CPU using the ‘engine’ flag (preview)
• Automatic anchor extraction for YOLOv5-type NMS models; a message is displayed during parsing
• Added support for the Biased Delta activation on TFLite, which is implemented using ABS->SIGN->MUL
• Added Hailo-ONNX support for models with Shape connections around the HailoOp
• Added an option to disable hailo-onnx runtime model build, when it hinders model parsing
• Softmax and Argmax can be added to the model using the logits_layer model script command
• Whenever NMS is being added (using a nms_postprocess command), Sigmoid is now added automatically
• Added hybrid conversion commands in the input_conversion section: yuy2_to_rgb, nv12_to_rgb, nv21_to_rgb, i420_to_rgb
• Log level can be set using the LOGLEVEL environment variable (0 [default] to 3)
• hailo visualizer shows layers added using model script commands that were folded
• Layer Analysis Tool Tutorial has been updated to demonstrate how to increase accuracy
• Model Optimization Tutorial now uses YOLOv5 NMS with engine=cpu, and also a bbox visualization code
• Added description of which optimization algorithms are activated with each optimization level
• Removed the Multiple Models Tutorial. The Join API is still supported
Command Line Tools
Deprecated APIs
Known issues
• Some Transformer models are at risk of hitting a runtime bug when inferring with batch_size > 1 and multi-context allocation is used; a workaround is to use the max_utilization parameter of the context_switch_param command to change the failing context partition
• In some cases, using the Fine Tune algorithm when the whole network is quantized to 16-bit might cause a degradation
Parser
• Fixed an issue where a model script had to be provided explicitly to hailo compiler when an NMS command
was used
Compiler
• Fixed on-screen prints during compilation regarding single/multi-context flow and resource utilization
• Removed the warning message of using on-chip NMS with multi context allocation, since the new version
of HailoRT fixes the issue
Package Updates
• Added support for Ubuntu 22.04, Python 3.9, and Python 3.10
• Ubuntu 18.04, Python 3.6 and Python 3.7 are no longer supported
Profiler
• Introducing Accuracy Tab on the HTML Profiler, to be used as a tool to analyze and improve accuracy
• Profiler in post-placement mode doesn’t require a .hef file when working on a compiled .har file
• Profiler applies model modifications in pre_placement mode, if a model script was supplied
• profile() API will not update the runner state, even if it compiles for the profiling process
• Bug fixes
Model Optimization
• ClientRunner now has a new SdkFPOptimized state (see runner states diagram), for assessing model
accuracy before quantization
• Updated the Model Optimization workflow section with simple and advanced optimization flows
• Updated the Model Optimization Tutorial with step-by-step instructions for validating accuracy through
the optimization and compilation phases
• Updated the Layer Analysis Tool tutorial to utilize the new HTML profiler Accuracy tab
Emulator
• Added Emulator support for YUY2 color conversions, using the ‘emulator_support=True’ flag on the input_conversion command
• Added support for on-chip NV12->YUV, NV21->YUV and YUV->BGR format conversions, using an input_conversion command
Parser
• nms_postprocess command now supports the ‘engine’ flag, which instructs HailoRT to run YOLOv5 NMS post-processing on the host platform (preview)
• Added support for Less operator in both ONNX and Tensorflow parsers
• Added support for dual broadcast in element-wise mult (Hx1xC * 1xWxC -> HxWxC)
• Added support for depthwise with depth multiplier as group convolution in TFLite
Compiler
• Bug fixes
Known Bugs
• On this version, on-chip YOLOv5 NMS needs to be compiled using the legacy fps command.
API
• nms_postprocess model script command now resolves relative paths against the .alls script location. In addition, when working with a HAR file that has a model script inside, it uses the json from within the HAR
• Layer Analysis Tool now exports its data to a json file, which can be used with the HTML profiler to unlock the new Accuracy tab
• ClientRunner APIs
– New
* analyze_noise
* optimize_full_precision
– Argument changes
* Deprecation warning: fps flag in all APIs that compile (profile_hn_model, get_tf_graph)
– Removed
* quantize
* equalize_params
* get_hw_representation
* revert_state
– Deprecation warning
* get_results_by_layer
* translate_params
* update_params_layer_bias
* profile_hn_model
* get_mapped_graph
* get_params_after_bn
* set_original_model
* apply_model_modification_commands
• CLI tools
* revert removed
– parser
* --fps removed
* --analysis-data added
Parser
• Added support for custom TFLite operators that implement a biased delta activation
• Added support for the self operators add(x,x), concat(x,x), and mul(x,x) in the TF and ONNX parsers
Model Optimization
• FPS is improved for large models: quantization of 20% of the model weights to 4-bit is enabled by default on large networks
• Added on-chip support for RGBX->RGB conversion using the input conversion command
Compiler
Parser
• Added a recommendation to use the TFLite parser if TF2 parsing fails (see the conversion guide, section 4.2.5)
• Refactored the logger
• HTML Profiler report includes model optimization information: compression and optimization levels,
model modifications, weight and activation value ranges
API
• Updated platform_param model script command to optimize compilation for low PCIe bandwidth hosts
• Deprecation warning for the legacy --fps argument; use the performance_param model script command instead
• As the parser detects Tensorflow1/2/TFLite automatically, the API for specifying the framework is depre-
cated
• The argument onnx_path of ClientRunner.translate_onnx_model was renamed to model, and it also supports ‘bytes’ format
Note: Ubuntu 18.04 will be deprecated in Hailo Dataflow Compiler future version
Note: Python 3.6 will be deprecated in Hailo Dataflow Compiler future version
Note: This section describes the installation of the Dataflow Compiler only. For a complete description of the instal-
lation of Hailo Suite, which contains all Hailo SW products, please refer to the Hailo AI SW Suite user guide.
The Hailo Dataflow Compiler requires the following minimum hardware and software configuration:
The following additional requirements are needed for GPU-based hardware emulation:
• Nvidia’s Pascal/Turing/Ampere GPU architecture (such as Titan X Pascal, GTX 1080 Ti, RTX 2080 Ti, or RTX A4000)
• CUDA 11.8
• CUDNN 8.9
Note: The Dataflow Compiler installs and runs Tensorflow; when Tensorflow is installed from PyPI and runs on the CPU, it also requires AVX instruction support. Therefore, it is recommended to use a CPU that supports AVX instructions. Another option is to compile Tensorflow from sources without AVX.
Warning: These requirements are for the Dataflow Compiler, which is used to build models. Running inference
using HailoRT works on smaller systems as well. In order to run inference and demos on a Hailo device, the latest
HailoRT needs to be installed as well. See HailoRT’s user guide for more details.
Warning: This installation requires an internet connection (or a local pip server) in order to download Python
packages.
Note: If you wish to upgrade both the Hailo Dataflow Compiler and HailoRT, which are installed in the same virtualenv: update HailoRT first, and then the Dataflow Compiler, using the following instructions.
Hailo Dataflow Compiler’s Wheel file (.whl) can be downloaded from Hailo’s Developer Zone.
virtualenv <VENV_NAME>
. <VENV_NAME>/bin/activate
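Then install the downloaded wheel inside the activated virtualenv. The wheel filename below is a placeholder; use the exact file downloaded from the Developer Zone:

pip install hailo_dataflow_compiler-<version>-py3-none-any.whl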
If you already have a previous version (v3.15.0 or newer), enter the virtualenv and install using the line above. The old version will be updated automatically.
If you have an even older version (<=3.14.0), you have to uninstall it manually from within the existing virtualenv:
Install the new package with pip using the method above (the package names were changed from v3.14.0 to v3.15.0).
After installation / upgrade, it is recommended to view Hailo’s CLI tool options with:
hailo -h
Note: You can validate that the install/update to the latest Hailo packages succeeded by running pip freeze | grep hailo.
4. Tutorials
The tutorials below go through the model build and inference steps. They are also available as Jupyter notebook files
in the directory VENV/lib/python…/site-packages/tutorials.
It’s recommended to use the command hailo tutorial (when inside the virtualenv) to open a Jupyter server
that contains the tutorials.
Model Compilation:
It is recommended to start with the Hailo Dataflow Compiler Overview / Model build process section of the user guide.
1. Translating the Tensorflow / ONNX model into a Hailo model (HAR).
2. Optimizing the model parameters (quantization).
3. Compiling the network to binary files (HEF), for running on the Hailo device.
Inference:
These use-cases were chosen to show an end-to-end flow, beginning with a Tensorflow / ONNX model and ending with a hardware-deployed model.
Throughout this guide, the Resnet-v1-18 neural network is used to demonstrate the capabilities of the Dataflow Compiler. The neural network is defined using a Tensorflow checkpoint.
4.1.1. Usage
The HTML and PDF versions are for viewing only. The best way to use the tutorials is to run them as Jupyter notebooks:
1. The Dataflow Compiler should be installed, either as a standalone Python package, or as part of the Hailo SW
Suite.
5. Remote viewing from a machine different than the one used to run the Jupyter server is also possible by running hailo tutorial --ip=0.0.0.0
This tutorial describes the steps for parsing models from various frameworks to the HAR format (Hailo Archive).
HAR is a tar.gz archive file that contains the representation of the graph structure and the weights that are deployed
to Hailo’s runtime.
Note: Running this code in Jupyter notebook is recommended, see the Introduction tutorial for more details.
Note: This section demonstrates the Python APIs of the Hailo Parser. You could also use the CLI: try hailo parser {tf, onnx} --help.
More details can be found in Dataflow Compiler User Guide / Building Models / Profiler and other command line tools.
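For example, a CLI parse of the model used below could look like this (a sketch; run hailo parser onnx --help to confirm the available flags):

hailo parser onnx ../models/resnet_v1_18.onnx --hw-arch hailo8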
[ ]: from hailo_sdk_client import ClientRunner

chosen_hw_arch = "hailo8"
# For Hailo-15 devices, use 'hailo15h'
# For Mini PCIe modules or Hailo-8R devices, use 'hailo8r'
[ ]: onnx_model_name = "resnet_v1_18"
onnx_path = "../models/resnet_v1_18.onnx"
The main API of the Dataflow Compiler that the user interacts with is the ClientRunner class (see the API Reference
section on the Dataflow Compiler user guide for more information).
Arguments:
• model_path
• model_name to use
• start_node_names (list of str, optional): Name of the first ONNX node to parse.
• end_node_names (list of str, optional): List of ONNX nodes after which the parsing can stop, once all of them are parsed.
• net_input_shapes (dict, optional): A dictionary describing the input shapes for each of the start nodes given in
start_node_names, where the keys are the names of the start nodes and the values are their corresponding
input shapes. Use only when the original model has dynamic input shapes (described with a wildcard denoting
each dynamic axis, e.g. [b, c, h, w]).
As a suggestion, try translating the ONNX model without supplying the optional arguments.
[ ]: runner = ClientRunner(hw_arch=chosen_hw_arch)
hn, npz = runner.translate_onnx_model(
    onnx_path,
    onnx_model_name,
    start_node_names=["input.1"],
    end_node_names=["192"],
    net_input_shapes={"input.1": [1, 3, 224, 224]},
)
Hailo Archive is a tar.gz archive file that captures the “state” of the model - the files and attributes used in a given stage
from parsing to compilation. Use the save_har method to save the runner’s state in any stage and load_har
method to load a saved state to an uninitialized runner.
The initial HAR file includes:
• HN file, which is a JSON-like representation of the graph structure that is deployed to the Hailo hardware.
• NPZ file, which includes the weights of the model.
[ ]: hailo_model_har_name = f"{onnx_model_name}_hailo_model.har"
runner.save_har(hailo_model_har_name)
The Hailo parser supports inference models as inputs; therefore we advise using the TensorFlow Lite representation for TensorFlow 2 models (the TF2 SavedModel format is commonly used for training models).
The following example shows how to parse a TensorFlow Lite model, using a different model.
[ ]: model_name = "dense_example"
model_path = "../models/v3-large-minimalistic_224_1.0_float.tflite"
runner = ClientRunner(hw_arch=chosen_hw_arch)
hn, npz = runner.translate_tf_model(model_path, model_name)
The following examples focus on Tensorflow’s TFLite converter support for various TF formats, showing how older
formats of TF can be converted to TFLite, which can then be used in Hailo’s parsing stage.
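For context, each snippet below assumes a converter built with TensorFlow's standard TFLite API for the relevant source format (the SavedModel path here is a placeholder; from_keras_model and from_concrete_functions cover the other formats):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("../models/my_saved_model")
tflite_model = converter.convert()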
tflite_model_path = "../models/small_example.tflite"
with tf.io.gfile.GFile(tflite_model_path, "wb") as f:
    f.write(tflite_model)

tflite_model_path = "../models/dense_example_tf2.tflite"
with tf.io.gfile.GFile(tflite_model_path, "wb") as f:
    f.write(tflite_model)
tflite_model = converter.convert()
tflite_model_path = "../models/ew_sub_example.tflite"
with tf.io.gfile.GFile(tflite_model_path, "wb") as f:
    f.write(tflite_model)
This tutorial describes the process of optimizing the user’s model. The input to this tutorial is a HAR file in Hailo Model state (before optimization, with native weights) and the output is a quantized HAR file with quantized weights.
Note: For full information about Optimization and Quantization, refer to the Dataflow Compiler user guide / Model optimization section.
Requirements:
• Run this code in Jupyter notebook. See the Introduction tutorial for more details.
• The user should review the complete Parsing Tutorial first (or have created the HAR file in another way)
Recommendation:
• To obtain the best performance, run this code on a GPU machine. For full information see the Dataflow Compiler user guide / Model optimization section.
Contents:
import json
import os

import numpy as np
import tensorflow as tf
from IPython.display import SVG
from matplotlib import patches
from matplotlib import pyplot as plt
from PIL import Image
from tensorflow.python.eager.context import eager_mode

from hailo_sdk_client import ClientRunner, InferenceContext

%matplotlib inline
IMAGES_TO_VISUALIZE = 5
After the HAR file has been created (using either runner.translate_tf_model or runner.translate_onnx_model), the next step is to go through the optimization process.
The basic optimization is performed just by calling runner.optimize(calib_dataset) (or the CLI hailo optimize command), as described in the user guide on: Building Models / Model optimization / Model Optimization Workflow. The calibration dataset should be preprocessed according to the model’s input requirements, and it is recommended to have at least 1024 inputs and to use a GPU. During this step it is also possible to use a model script, which changes the default behavior of the Dataflow Compiler, for example, to add an additional layer for normalization. All the available model script commands are described in the user guide on: Building Models / Model optimization / Optimization Related Model Script Commands.
In order to learn how to deal with common pitfalls, image formats and accuracy, refer to the in-depth section.
[ ]: # First, we will prepare the calibration set. Resize the images to the correct size and crop them.
def preproc(image, output_height=224, output_width=224, resize_side=256):
    # Matches the preproc function used later in this tutorial (without normalization)
    with eager_mode():
        h, w = image.shape[0], image.shape[1]
        scale = tf.cond(tf.less(h, w), lambda: resize_side / h, lambda: resize_side / w)
        resized_image = tf.compat.v1.image.resize_bilinear(tf.expand_dims(image, 0), [int(h * scale), int(w * scale)])
        cropped_image = tf.compat.v1.image.resize_with_crop_or_pad(resized_image, output_height, output_width)
        return tf.squeeze(cropped_image)


images_path = "../data"
images_list = [img_name for img_name in os.listdir(images_path) if os.path.splitext(img_name)[1] == ".jpg"]

# Build the calibration set by preprocessing every image in the folder
calib_dataset = np.zeros((len(images_list), 224, 224, 3))
for idx, img_name in enumerate(sorted(images_list)):
    img = np.array(Image.open(os.path.join(images_path, img_name)))
    calib_dataset[idx, :, :, :] = preproc(img).numpy()

np.save("calib_set.npy", calib_dataset)
[ ]: # Second, we will load our parsed HAR from the Parsing Tutorial
model_name = "resnet_v1_18"
hailo_model_har_name = f"{model_name}_hailo_model.har"
assert os.path.isfile(hailo_model_har_name), "Please provide valid path for HAR file"
runner = ClientRunner(har=hailo_model_har_name)
# By default it uses the hw_arch that is saved on the HAR. For overriding, use the hw_arch flag.
[ ]: # Now we will create a model script that tells the compiler to add a normalization layer at the beginning
alls = "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n"
runner.load_model_script(alls)
The advanced optimization process (see the diagram in the user guide on: Building Models / Model optimization / Model Optimization Workflow) consists of the following steps:
1. Test the parsed Native model before any changes are made (still in floating point precision); check that the pre- and post-processing code works well with the start and end nodes provided. The Native model will match the results of the original model, between the start_node_names and the end_node_names provided by the user during the Parsing stage.
2. Optional: Apply Model Modifications (like an input normalization layer, YUY2 to RGB conversion, changing output activations, and others), using a model script.
3. Test the FP Optimized model (the model after floating point optimizations and modifications) to verify that the required results have been achieved.
• Note: Remember to update the pre and post processing code to match the changes in the model. For
example, if normalization has been added to the model, remove the normalization code from the pre-
processing code, and feed un-normalized images to the model. If softmax has been added onto the
outputs, remove the softmax from the post-processing code. Etc.
4. Now perform Optimization on the model, using the calibration set that has been prepared. The result is a Quantized model, which has some degradation compared to the pre-quantized model.
• Note: The format of the calibration set is the same as the inputs of the modified model. For example, if a normalization layer has been added to the model, the calibration set should not be normalized. If this layer has not been added yet, pre-process and normalize the images.
5. Test the quantized model using the same already-validated code for the pre and post processing.
• If there is a degradation, it is due to the quantization process and not due to input/output formats, as
they were already verified with the pre-quantized model.
6. To increase the accuracy of the quantized model, it is possible to optimize again using a model script to
affect the optimization process.
• Note: The most basic method is to raise the optimization_level, an example model script command is
model_optimization_flavor(optimization_level=4). The advanced method is to
use the Layer Analysis Tool, presented on the next tutorial.
• Note: If the accuracy is good, consider increasing the performance by using 4-bit weights. This is done using compression_level; an example model script command is model_optimization_flavor(compression_level=2).
7. During the next tutorials, compilation and on-device inference, input and output values are expected to be
similar to the quantized model’s values.
The testing (whether of the Native, Modified, or Quantized model) is performed using the Emulator feature, which is described in this tutorial.
To further understand the advanced optimization process, the following steps are described below.
Hailo offers an Emulator for testing the model in its different states. The emulator is implemented as a Tensorflow graph, and its results are the return value of runner.infer(context, network_input). To get inference results, run this API within the context manager runner.infer_context(inference_context), where the inference context is one of [InferenceContext.SDK_NATIVE, InferenceContext.SDK_FP_OPTIMIZED, InferenceContext.SDK_QUANTIZED]:
• InferenceContext.SDK_NATIVE: Testing method for Step 1 of the optimization process steps (Native model). Runs the model as-is, without any changes. Use it to make sure the model has been converted properly into Hailo’s internal representation. Should yield the exact results of the original model.
• InferenceContext.SDK_FP_OPTIMIZED: Testing method for Step 3 of the optimization process steps (Modified model). The modified model represents the Hailo model prior to quantization, and is the result of performing model modifications (e.g. normalizing/resizing inputs) and full precision optimizations (e.g. tiled squeeze & excite, equalization). As a result, inference results may vary slightly from the native results.
• InferenceContext.SDK_QUANTIZED: Testing method for Step 5 of the optimization process steps (Quantized model). This inference context emulates the hardware implementation, and is useful for measuring the overall accuracy and degradation of the quantized model. This measurement is performed against the original model over large datasets, prior to running inference on the actual Hailo device.
[ ]: # -----------------------------------------
# Pre processing (prepare the input images)
# -----------------------------------------
def preproc(image, output_height=224, output_width=224, resize_side=256, normalize=False):
    with eager_mode():
        h, w = image.shape[0], image.shape[1]
        scale = tf.cond(tf.less(h, w), lambda: resize_side / h, lambda: resize_side / w)
        resized_image = tf.compat.v1.image.resize_bilinear(tf.expand_dims(image, 0), [int(h * scale), int(w * scale)])
        cropped_image = tf.compat.v1.image.resize_with_crop_or_pad(resized_image, output_height, output_width)
        if normalize:
            # Default normalization parameters for ImageNet
            cropped_image = (cropped_image - [123.675, 116.28, 103.53]) / [58.395, 57.12, 57.375]
        return tf.squeeze(cropped_image)


# -----------------------------------------------------
# Post processing (what to do with the model's outputs)
# -----------------------------------------------------
def _get_imagenet_labels(json_path="../data/imagenet_names.json"):
    imagenet_names = json.load(open(json_path))
    imagenet_names = [imagenet_names[str(i)] for i in range(1001)]
    return imagenet_names[1:]


imagenet_labels = _get_imagenet_labels()


def postproc(results):
    labels = []
    scores = []
    results = [np.squeeze(result) for result in results]
    for result in results:
        top_ind = np.argmax(result)
        cur_label = imagenet_labels[top_ind]
        cur_score = 100 * result[top_ind]
        labels.append(cur_label)
        scores.append(cur_score)
    return scores, labels


# -------------
# Visualization
# -------------
def mynorm(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data))


def visualize_results(
    images,
    first_scores,
    first_labels,
    second_scores=None,
    second_labels=None,
    first_title="Full Precision",
    second_title="Other",
):
    # Deal with input arguments
    assert (second_scores is None and second_labels is None) or (
        second_scores is not None and second_labels is not None
    ), "second_scores and second_labels must both be supplied, or both not be supplied"
    assert len(images) == len(first_scores) == len(first_labels), "lengths of inputs must be equal"
    show_only_first = second_scores is None  # show a single title when no comparison results are given
    # Display
    for img_idx in range(len(images)):
        plt.figure()
        plt.imshow(mynorm(images[img_idx]))
        if not show_only_first:
            plt.title(
                f"{first_title}: top-1 class is {first_labels[img_idx]}. Confidence is {first_scores[img_idx]:.2f}%,\n"
                f"{second_title}: top-1 class is {second_labels[img_idx]}. Confidence is {second_scores[img_idx]:.2f}%"
            )
        else:
            plt.title(
                f"{first_title}: top-1 class is {first_labels[img_idx]}. Confidence is {first_scores[img_idx]:.2f}%",
            )
Load the network to the ClientRunner from the saved Hailo Archive file:
[ ]: model_name = "resnet_v1_18"
hailo_model_har_name = f"{model_name}_hailo_model.har"
assert os.path.isfile(hailo_model_har_name), "Please provide valid path for HAR file"
runner = ClientRunner(har=hailo_model_har_name)
# By default it uses the hw_arch that is saved on the HAR. For overriding, use the hw_arch flag.

[ ]: images_path = "../data"
images_list = [img_name for img_name in os.listdir(images_path) if os.path.splitext(img_name)[1] == ".jpg"]

[ ]: # Notice that we use the normalized images, because normalization is not in the model
with runner.infer_context(InferenceContext.SDK_NATIVE) as ctx:
    native_res = runner.infer(ctx, image_dataset_normalized[:IMAGES_TO_VISUALIZE, :, :, :])
The Model Script is a text file that includes model script commands, affecting the stages of the compiler.
In the next steps the following will be performed:
• Create a model script for the Optimization process, which also includes the model modifications.
• Load the model script (it won’t be applied yet).
• Call runner.optimize_full_precision() to apply the model modifications (alternatively, optimize() could be called, which also applies the model modifications).
• Then the SDK_FP_OPTIMIZED emulation context can be used.
[ ]: model_script_lines = [
    # Add a normalization layer with mean [123.675, 116.28, 103.53] and std [58.395, 57.12, 57.375]
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
]

runner.load_model_script("".join(model_script_lines))
runner.optimize_full_precision()
[ ]: # Notice that we use the original images, because normalization is IN the model
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    modified_res = runner.infer(ctx, image_dataset[:IMAGES_TO_VISUALIZE, :, :, :])

visualize_results(
    image_dataset[:IMAGES_TO_VISUALIZE, :, :, :],
    native_scores,
    native_labels,
    modified_scores,
    modified_labels,
    second_title="FP Modified",
)
1. We will create a calibration dataset (it will be the same as the input to the modified model)
2. We will quantize the model by calling the optimize() API with that calibration set
3. Then we will test its accuracy vs. the modified model. Please note that the quantized emulator is not bit-exact with the Hailo hardware but provides a good and fast approximation.
[ ]: # The original images are being used, just as the input to the SDK_FP_OPTIMIZED emulator
calib_dataset = image_dataset

hn_layers = runner.get_hn_dict()["layers"]
print("Input layers are: ")
print([layer for layer in hn_layers if hn_layers[layer]["type"] == "input_layer"])  # See available input layer names

calib_dataset_dict = {"resnet_v1_18/input_layer1": calib_dataset}  # In our case there is only one input layer
runner.optimize(calib_dataset_dict)
[ ]: # Notice that we use the original images, because normalization is in the model
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    quantized_res = runner.infer(ctx, image_dataset[:IMAGES_TO_VISUALIZE, :, :, :])

visualize_results(
    image_dataset[:IMAGES_TO_VISUALIZE, :, :, :],
    modified_scores,
    modified_labels,
    quantized_scores,
    quantized_labels,
    first_title="FP Modified",
    second_title="Quantized",
)
This demo depends on multiple-GPU availability. Further information on utilizing multiple GPUs is available in the Dataflow Compiler user guide / Model optimization section.
[ ]: num_gpus = len(tf.config.list_physical_devices("GPU"))
if num_gpus > 1:
    with runner.infer_context(InferenceContext.SDK_NATIVE, gpu_policy="model_parallelization") as ctx:
        # e.g., run native inference inside the multi-GPU context
        native_res = runner.infer(ctx, image_dataset[:IMAGES_TO_VISUALIZE, :, :, :])
To increase the accuracy of the quantized model, optimize again using a model script to affect the optimization pro-
cess.
• Verify that a GPU is used and that there are at least 1024 images in the calibration set
• Raise the optimization_level value using the model_optimization_flavor command. If it fails on high GPU memory, try lowering the batch_size as described in the last example
• Decrease the compression_level value using the model_optimization_flavor command (default is 0, the lowest option)
• Set the output layer(s) to use 16-bit accuracy using the command quantization_param(output_layer_name, precision_mode=a16_w16). Note that the DFC will set 16-bit output automatically for small enough outputs.
• Use the Layer Noise Analysis tools to find layers with low SNR, and affect their quantization using weight or activation clipping (see the next tutorial)
• Experiment with the FineTune parameters (refer to the user guide for more details)
For more information refer to the user guide in: Building Models / Model optimization / Model Optimization Workflow / Debugging accuracy.
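As an illustration, a model script combining several of these suggestions might look like this (a sketch; output_layer1 is a placeholder layer name):

model_optimization_flavor(optimization_level=4, compression_level=0)
quantization_param(output_layer1, precision_mode=a16_w16)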
This block will apply model modification commands using a model script. A YUY2 -> YUV -> RGB conversion will be added.
Unlike the normalization layer, which can be simulated with the SDK_FP_OPTIMIZED and SDK_QUANTIZED emulators, not all format conversions are supported in the emulator (for more information see the Dataflow Compiler user guide / Model optimization section). Every conversion that runs in the emulator affects the calibration set, and the user should supply the set accordingly. For example, after adding a YUV -> RGB format conversion layer, the calibration set is expected to be in YUV format. However, for some conversions the user may choose to skip the conversion in emulation and use the original calibration set instead. For instance, in this tutorial we will use the YUY2 -> YUV layer without emulation, because we want the emulator input and the calibration dataset to remain in YUV format. The format conversion layer is relevant only when running the compiled .hef file on a device.
4) Using the optimize() API, the commands are applied and the model is quantized
5) Usage:
• To create an input conversion after all input layers: net_scope1/yuv2rgb1, net_scope2/yuv2rgb2 = input_conversion(yuv_to_rgb)
# Now we're adding a yuy2_to_yuv conversion before the yuv_to_rgb, and a normalization layer.
# The order of the layers is determined by the order of the commands in the model script:
# First we add normalization to the original input layer -> the input to the network is now normalization1
model_script_commands = [
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    "yuv_to_rgb1 = input_conversion(yuv_to_rgb)\n",
    "yuy2_to_yuv1 = input_conversion(input_layer1, yuy2_to_hailo_yuv)\n",
]
runner.load_model_script("".join(model_script_commands))
modified_model_har_name = f"{model_name}_modified.har"
runner.save_har(modified_model_har_name)
This block will apply an on-chip bilinear image resize at the beginning of the network through model script commands:
• Using the optimize() API, the command is applied and the model is quantized
[ ]: images_path = "../data"
images_list = [img_name for img_name in os.listdir(images_path) if os.path.splitext(img_name)[1] == ".jpg"]

idx_to_visualize = None
images_list = images_list[:64]
calib_dataset_new = np.zeros((len(images_list), 480, 640, 3))
for idx, img_name in enumerate(images_list):
    img = Image.open(os.path.join(images_path, img_name))
    resized_image = np.array(img.resize((640, 480), Image.Resampling.BILINEAR))
    calib_dataset_new[idx, :, :, :] = resized_image
    # find an image that will be nice to display
    if idx_to_visualize is None and img.size[0] != 640:
        idx_to_visualize = idx
        img_to_show = img

np.save("calib_set_480_640.npy", calib_dataset_new)

plt.imshow(img_to_show)
plt.title("Original image")
plt.show()
plt.imshow(np.array(calib_dataset_new[idx_to_visualize, :, :, :], np.uint8))
plt.title("Resized image")
plt.show()
[ ]: model_name = "resnet_v1_18"
hailo_model_har_name = f"{model_name}_hailo_model.har"
assert os.path.isfile(hailo_model_har_name), "Please provide valid path for HAR file"
runner = ClientRunner(har=hailo_model_har_name)
calib_dataset_large = np.load("calib_set_480_640.npy")

# Add a bilinear resize from 480x640 to the network's input size - in this case, 224x224.
# The order of the layers is determined by the order of the commands in the model script:
# First we add normalization to the original input layer -> the input to the network is now normalization1
model_script_commands = [
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    "resize_input1 = resize(resize_shapes=[480,640])\n",
]
runner.load_model_script("".join(model_script_commands))

calib_dataset_dict = {"resnet_v1_18/input_layer1": calib_dataset_large}  # In our case there is only one input layer
runner.optimize(calib_dataset_dict)

modified_model_har_name = f"{model_name}_resized.har"
runner.save_har(modified_model_har_name)
This block will add an NMS layer at the end of the network through the nms_postprocess model script command. The following arguments can be used:
• Config json: an external json file that allows changing the NMS parameters (can be skipped for the default configuration).
• Meta architecture: which meta architecture to use (for example, yolov5, ssd, etc). In this example, yolov5
will be used.
• Engine: defines the inference device for running the nms: nn_core, cpu or auto (this example shows
cpu).
Usage:
• Use the optimize_full_precision() API to apply the command (note that the optimize() API can also be used)
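For example, the model script line used in this block might look like this (a sketch; the json path is a placeholder and can be omitted for the default configuration):

nms_postprocess("../configs/yolov5_nms_config.json", meta_arch=yolov5, engine=cpu)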
[ ]: model_name = "yolov5s"
onnx_path = f"../models/{model_name}.onnx"
assert os.path.isfile(onnx_path), "Please provide valid path for ONNX file"
]
# Note: A scores threshold of 0.0 means no filtering, 1.0 means maximal filtering. IoU thresholds are
# opposite: 1.0 means filtering boxes only if they are equal, and 0.0 means filtering any overlapping boxes.
runner.load_model_script("".join(model_script_commands))

# Note: On HailoRT APIs (that are used on the Inference Tutorial, and with C++ APIs), the default is a
# list per class. For more information look for NMS on the HailoRT user guide.
        if cur_score == 0:
            continue
        # Plotting code
        if not found_any:
            found_any = True
            fig, ax = plt.subplots()
            ax.imshow(Image.fromarray(np.array(calib_dataset_new[i], np.uint8)))
        if min_score is None or cur_score < min_score:
            min_score = cur_score
        if max_score is None or cur_score > max_score:
            max_score = cur_score
        y_min, x_min = box[0, detection_idx] * HEIGHT, box[1, detection_idx] * WIDTH
        y_max, x_max = box[2, detection_idx] * HEIGHT, box[3, detection_idx] * WIDTH
        center, width, height = (x_min, y_min), x_max - x_min, y_max - y_min
        # draw the box on the input image
        rect = patches.Rectangle(center, width, height, linewidth=1, edgecolor="r", facecolor="none")
        ax.add_patch(rect)
    if found_any:
        plt.title(f"Plot of high score boxes. Scores between {min_score:.2f} and {max_score:.2f}")
        plt.show()
For aggressive quantization (compressing a significant amount of weights to 4 bits), a higher optimization level will be
needed to obtain good results.
For quick iterations it is always recommended to start with the default setting of the model optimizer
(optimization_level=2, compression_level=1). However, when moving to production, it is recommended to work at
the highest complexity level to achieve optimal accuracy. With regards to compression, users should increase it
when the overall throughput/latency of the model is not good enough.
Note that increasing compression has a negligible effect on power consumption, so the motivation to work
with a higher compression level is mainly FPS considerations.
Here the compression level is set to 4 (meaning ~80% of the weights will be quantized to 4 bits) using the
compression_level parameter in a model script, and model optimization is run again. Using 4-bit weights might reduce
the model's accuracy but helps reduce the model's memory footprint. In this example, it can be seen that the
reliability of some examples decreases after changing several layers to 4-bit weights; later, the reliability will improve
after applying a higher optimization_level.
[ ]: alls_lines = [
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    # Batch size is 8 by default; 2 was used for stability on PCs with a low amount of RAM / VRAM
    # The following line is needed because resnet_v1_18 is a really small model, and the
    # compression_level is always reverted back to 0.
    # To force using compression_level with small models, the following line should be used
    # (compression_level=4 equals 80% 4-bit):
    "model_optimization_config(compression_params, auto_4bit_weights_ratio=0.8)\n",
    # The application of the compression can be seen in the [info] messages: "Assigning 4bit weight to layer ..."
]
# -- Reduces weights memory by 80%!
runner = ClientRunner(har=hailo_model_har_name)
runner.load_model_script("".join(alls_lines))
runner.optimize(calib_dataset)
[ ]: images = calib_dataset[:IMAGES_TO_VISUALIZE, :, :, :]
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    modified_res = runner.infer(ctx, images)
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    quantized_res = runner.infer(ctx, images)
visualize_results(
    image_dataset[:IMAGES_TO_VISUALIZE, :, :, :],
    modified_scores,
    modified_labels,
    quantized_scores,
    quantized_labels,
    first_title="FP Modified",
    second_title="Quantized",
)
Now, repeat the same process with a higher optimization level (for full information, see the Dataflow Compiler
user guide / Model optimization section):
[ ]: images = calib_dataset[:IMAGES_TO_VISUALIZE, :, :, :]
alls_lines = [
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    # Batch size is 8 by default; 2 was used for stability on PCs with a low amount of RAM / VRAM
    # The following line is needed because resnet_v1_18 is a really small model, and the
    # compression_level is always reverted back to 0.
    # To force using compression_level with small models, the following line should be used
    # (compression_level=4 equals 80% 4-bit):
    "model_optimization_config(compression_params, auto_4bit_weights_ratio=0.8)\n",
    # The application of the compression can be seen in the [info] messages: "Assigning 4bit weight to layer ..."
]
# -- Reduces weights memory by 80%!
runner = ClientRunner(har=hailo_model_har_name)
runner.load_model_script("".join(alls_lines))
runner.optimize(calib_dataset)
visualize_results(
    image_dataset[:IMAGES_TO_VISUALIZE, :, :, :],
    modified_scores,
    modified_labels,
    quantized_scores_new,
    quantized_labels_new,
    first_title="FP Modified",
    second_title="Quantized",
)
[ ]: print(
    f"Full precision predictions: {modified_labels}\n"
    f"Quantized predictions (with optimization_level=2): {quantized_labels_new} "
    f"({sum(np.array(modified_labels) == np.array(quantized_labels_new))}/{len(modified_labels)})\n"
)
[ ]: runner.save_har(quantized_model_har_path)
4.4.1. Hailo Compilation Example from Hailo Archive Quantized Model to HEF
This tutorial describes how to convert the model into the HEF executable format.
Requirements:
• Run the code below in a Jupyter notebook; see the Introduction tutorial for more details.
Note: This section demonstrates the Python APIs for the Hailo Compiler. You could also use the CLI: try hailo
compiler --help. More details are in Dataflow Compiler User Guide / Building Models / Profiler and other command
line tools.
Choose the quantized model Hailo Archive file to use throughout the example:
[ ]: model_name = "resnet_v1_18"
quantized_model_har_path = f"{model_name}_quantized_model.har"
[ ]: runner = ClientRunner(har=quantized_model_har_path)
# By default the hw_arch that is saved on the HAR is used. It is not recommended to change the hw_arch after optimization.
hef = runner.compile()
file_name = f"{model_name}.hef"
with open(file_name, "wb") as f:
    f.write(hef)
[ ]: har_path = f"{model_name}_compiled_model.har"
runner.save_har(har_path)
!hailo profiler {har_path}
Note:
The HTML profiler report can be augmented with runtime statistics, which are saved after the .hef is run on the
device using hailortcli.
For more information look under the section: Dataflow Compiler User Guide / Building Models / Profiler and other
command line tools / Running the Profiler.
Requirements:
• HailoRT installed in the same virtual environment, or as part of the Hailo SW Suite.
• Run this code in a Jupyter notebook; see the Introduction tutorial for more details.
Note: This section demonstrates PyHailoRT, which is a Python library for communication with Hailo devices. For
evaluation purposes, refer to hailortcli run2 --help (or the alias hailo run2 --help). For more
details, see the HailoRT User Guide / Command Line Tools.
import numpy as np
The standalone flow allows direct access to the HW: applications are developed directly on top of the Hailo core
HW using HailoRT.
This way the Hailo hardware can be used without Tensorflow, and even without the Hailo Dataflow Compiler (after
the HEF is built).
A HEF is Hailo’s binary format for neural networks. The HEF file contains:
• Target HW configuration
• Weights
Note: If a Hailo-15 device is being used, the tutorial and the resnet_v1_18.hef file should be copied to and
run on the device itself.
# The target can be used as a context manager ("with" statement) to ensure it's released on time.
configure_params = ConfigureParams.create_from_hef(hef=hef, interface=HailoStreamInterface.PCIe)
with network_group.activate(network_group_params):
    infer_results = infer_pipeline.infer(input_data)
    # The result output tensor is infer_results[output_vstream_info.name]
    print(f"Stream output shape is {infer_results[output_vstream_info.name].shape}")
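For context, these fragments belong to a flow along the lines of the following sketch. This is a hedged example,
assuming PyHailoRT's hailo_platform package, a PCIe-attached device, and the resnet_v1_18.hef built earlier; the
random input batch is a placeholder:

import numpy as np
from hailo_platform import (HEF, VDevice, ConfigureParams, HailoStreamInterface,
                            InferVStreams, InputVStreamParams, OutputVStreamParams, FormatType)

hef = HEF("resnet_v1_18.hef")
with VDevice() as target:
    # Configure the device from the HEF over PCIe
    configure_params = ConfigureParams.create_from_hef(hef=hef, interface=HailoStreamInterface.PCIe)
    network_group = target.configure(hef, configure_params)[0]
    network_group_params = network_group.create_params()
    input_vstreams_params = InputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)
    output_vstreams_params = OutputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)
    input_vstream_info = hef.get_input_vstream_infos()[0]
    output_vstream_info = hef.get_output_vstream_infos()[0]
    # Placeholder input: one random frame with the network's input shape
    input_data = {input_vstream_info.name: np.random.rand(1, *input_vstream_info.shape).astype(np.float32)}
    with InferVStreams(network_group, input_vstreams_params, output_vstreams_params) as infer_pipeline:
        with network_group.activate(network_group_params):
            infer_results = infer_pipeline.infer(input_data)
            print(f"Stream output shape is {infer_results[output_vstream_info.name].shape}")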
This section shows how to run streaming inference using multiple processes in Python.
Infer will not be used; instead a send/receive model will be employed, where the send function and the receive
function run in different processes.
for _ in range(num_frames):
    for vstream, buff in vstream_to_buffer.items():
        vstream.send(buff)
Define the number of images to stream and the processes, then recreate the target and run the processes:
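For illustration, a minimal sketch using Python's multiprocessing, assuming send_all and recv_all functions built
like the send loop above (the function names, the configured network object, and the frame count are assumptions):

from multiprocessing import Process

num_frames = 1000  # assumed number of frames to stream

send_process = Process(target=send_all, args=(configured_network, num_frames))
recv_process = Process(target=recv_all, args=(configured_network, num_frames))
recv_process.start()
send_process.start()
send_process.join()
recv_process.join()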
Note: This section is not yet supported on the Hailo-15, as it requires the Dataflow Compiler to be installed on the
device.
The runner.infer() method that was used for emulation in the model optimization tutorial can also be used
for running inference on the Hailo device inside the infer_context environment. Before calling this function
with a hardware context, please make sure a HEF file is loaded to the runner by one of these options: calling
runner.compile(), loading a compiled HAR using runner.load_har(), or setting the HEF attribute runner.hef.
First, create the runner and load a compiled HAR:
[ ]: model_name = "resnet_v1_18"
compiled_model_har_path = f"{model_name}_compiled_model.har"
runner = ClientRunner(hw_arch="hailo8", har=compiled_model_har_path)
# For Mini PCIe modules or Hailo-8R devices, use hw_arch='hailo8r'
This will demonstrate the usage of the HTML profiler with runtime data:
# Run hailortcli (the `hailo` alias can be used instead) to run the .hef on the device, and save runtime statistics to runtime_data.json
resnet_v1_18 is a small network, which fits in a single device without context switching (this is called "single
context"). Its FPS and latency are always displayed.
The --runtime-data flag is useful with big models, where the FPS and latency cannot be calculated at compile
time. With runtime data, the profiler displays the load, config and runtime of the contexts, and the FPS and latency
for multiple batch sizes.
This is an advanced tutorial; if the accuracy results obtained so far are satisfactory, it can be skipped. Before using it,
make sure that your native (pre-quantization) results are satisfactory. For more details refer to the Debugging Accuracy
section in the Dataflow Compiler User Guide.
This tutorial serves as a guide for how model quantization analysis breaks down the quantization noise per layer.
It is intended to guide the user in using the Hailo layer noise analysis tool, by using it to analyze the classification
model MobileNet-v3-Large-Minimalistic.
• Paths definitions: Defining the paths to the model and data for analysis.
• Accuracy analysis: This step is the heart of the tool, and computes the quantization noise of each layer output.
For each layer, the layer under analysis is the only quantized layer, while the rest of the model is kept in full
precision. This highlights the quantization sensitivity of the model to the noise of that specific layer.
• Visualizing the results: Walk through the results of the accuracy analysis and explain the different graphs and
information.
• Re-optimizing the model: After debugging the noise we repeat the optimization process to improve the results.
Requirements:
• Run this code in a Jupyter notebook; see the Introduction tutorial for more details.
• Verify that you have completed the Parsing tutorial and the Model Optimization tutorial, or generated analysis
data in another way.
[ ]: import os
%matplotlib inline
• data_path: path to preprocessed .npy image files for optimization and analysis
[ ]: model_name = "v3-large-minimalistic_224_1.0_float"
model_path = "../models/" + model_name + ".tflite"
assert os.path.isfile(model_path), "Please provide valid path for the model"
data_path = "./calib_set.npy"
assert os.path.isfile(data_path), "Please provide valid path for a dataset"
har_path = model_name + ".har"
[ ]: if len(tf.config.list_physical_devices("GPU")) == 0:
    print("Warning: you are running the accuracy analysis tool without a GPU, expect long running time.")
In this step, the model will be parsed and optimized to prepare it for analysis. For more details, check out the Parsing
tutorial and the Model Optimization tutorial.
[ ]: runner = ClientRunner(hw_arch="hailo8")
runner.translate_tf_model(model_path, model_name)
# The same normalization model script that is used again later in this tutorial
model_script = "normalization1 = normalization([127.5, 127.5, 127.5], [127.5, 127.5, 127.5])\n"
runner.load_model_script(model_script)
runner.optimize(data_path)
Though most models work well with our default optimization, some suffer from high quantization noise that induces
substantial accuracy degradation. As an example, we choose the MobileNet-v3-Large-Minimalistic neural network
model that, due to its structural characteristics, results in a high degradation of 6% for Top-1 accuracy on the
ImageNet-1K validation dataset.
To analyze the source of degradation, the Hailo analyze_noise API will be used. The analysis tool uses a given
dataset to measure the noise level in each layer and allows pinpointing problematic layers that should be handled.
The analysis tool uses the entire dataset by default; use the data_count argument to limit the number of images.
It is recommended to use at least 64 images, preferably not from the same calibration set. However, to keep the
tool's processing time at a reasonable level, it is also recommended not to use more than 100-200 images.
[ ]: runner.analyze_noise(data_path, data_count=64)  # data_count limits the number of images used by the tool
runner.save_har(har_path)
In this section, a general explanation of the noise analysis report is provided.
To visualize the accuracy analysis results and debug the quantization noise, the Hailo Model Profiler will be used.
The Hailo Model Profiler generates an HTML report with all the information for the model.
In the Optimization Details tab of the report, all the relevant information for this tutorial can be found:
SNR Chart
Displayed on the top ribbon, only if the profiled HAR contains the analyze-noise data.
This chart shows the sensitivity of each layer to quantization (measured separately for each output layer). To measure
the quantization noise of each layer's output, the tool iterates over all layers; in each iteration the given layer is the
only quantized layer, while the rest are kept in full precision, and the SNR is measured at each output layer. The
number of SNR values equals the number of output layers affected by the quantized layer. The graph shows the SNR
values in decibels (dB); any value higher than 10 dB should be fine (higher is better).
In case an output layer is sensitive (low SNR) across many layers, it is recommended to re-quantize with one of the
following model script commands (not in the scope of this tutorial):
• Configure the output layer to 16-bit output. For example, using the model script command:
quantization_param(output_layer1, precision_mode=a16_w16).
• When possible, offload the output activation to the accelerator. For example, the following command adds a
sigmoid activation to the output layer conv51: change_output_activation(conv51, sigmoid),
and should be used to offload sigmoid from post-processing code to the accelerator.
• Use massive fine-tuning, which is enabled by default in optimization_level=2 but can be customized. For
example, a specific fine-tune command: post_quantization_optimization(finetune,
policy=enabled, learning_rate=0.0001, epochs=8, batch_size=4,
dataset_size=4000). Other useful attributes of this command are loss_layer_names, loss_factors
and loss_types, which allow the user to manually edit the loss function of the fine-tune training. In case
fine-tuning fails due to GPU memory, try to use a lower batch_size.
Layers Information
This section provides detailed per-layer information that helps debug local quantization errors in the model, for
example, a specific layer that is very sensitive to quantization. Note that quantization noise may stem from the layers'
weights, activations or both.
• Weight Histogram: this graph shows the weights distribution and can help to identify outliers. If outliers exist
in the weight distribution, the following command can be used to clip them, for example, clipping the kernel values of
conv27: pre_quantization_optimization(weights_clipping, layers=[conv27],
mode=percentile, clipping_values=[0.01, 99.99])
• Activation Histogram: this graph shows the activation distribution as collected by the layer noise analysis tool.
A wide activation distribution is a major source of degradation, and in general it is strongly recommended
to use a model with batch normalization after each layer to limit the layer's extreme activation values. Another
important argument that affects the activation distribution is the calibration set size that was used during
quantization; to raise it, use the following command: model_optimization_config(calibration,
calibset_size=512). The default calibration set size is 64. In case of outliers in the
layers' activation distribution, we recommend using the activation clipping command, for example:
pre_quantization_optimization(activation_clipping, layers={*},
mode=percentile, clipping_values=[0.01, 99.99])
• Scatter Plot: this graph shows a comparison between full precision and quantized values of the layers'
activations. The X-axis of each point in this graph is its value in full precision and the Y-axis is the value after
quantization. Zero quantization noise means the slope would be exactly one. In case of bias noise, expect
to find many points above/below the line, representing imperfect quantization; if this is the case, use
the following commands: post_quantization_optimization(bias_correction,
policy=enabled) and post_quantization_optimization(finetune,
policy=disabled)
To examine these results, first plot the SNR graph for this specific model. Note that in general the profiler report
should be used, but here an alternative visualization is shown.
[ ]: def get_snr_results():
    # SNR results are saved in the params statistics object
    params_statistics = runner.get_params_statistics()
    out_layer = "v3-large-minimalistic_224_1_0_float/output_layer1"
    layers = []
    snr = []
    for layer in runner.get_hn_model():
        # We get the SNR for each analyzed layer for a specific output layer (there is only one in this case)
        layer_snr = params_statistics.get(f"{layer.name}/layer_noise_analysis/noise_results/{out_layer}")
        if layer_snr is not None:
            layers.append(layer.name)
            snr.append(layer_snr)
    return layers, snr
layers, snr = get_snr_results()
plt.bar(layers, snr)  # plot the per-layer SNR (a bar chart is an assumed choice for the visualization)
plt.xticks(rotation=75, fontsize="x-small")
plt.ylabel("SNR")
plt.grid()
plt.show()
Next, we will try to improve the model accuracy results by using specific model script commands. Specifically, we
will use the activation_clipping command on the problematic layers to clip outliers from the output of the
layers and optimization_level=2. For further information we refer the user to the full Accuracy report in
the profiler HTML.
[ ]: runner = ClientRunner(hw_arch="hailo8")
runner.translate_tf_model(model_path, model_name)
model_script_commands = [
    "normalization1 = normalization([127.5, 127.5, 127.5], [127.5, 127.5, 127.5])\n",
    "model_optimization_config(calibration, calibset_size=128)\n",
    "pre_quantization_optimization(activation_clipping, layers=[dw1, conv2, conv3], mode=percentile, clipping_values=[0.5, 99.5])\n",
    "model_optimization_flavor(optimization_level=2, compression_level=0)\n",
]
runner.load_model_script("".join(model_script_commands))
runner.optimize(data_path)
runner.save_har(har_path)
After fixing the optimization process, it should be possible to reduce the model degradation to 1% (Top-1 accuracy
on the ImageNet-1K validation dataset) which is usually the target goal for classification models.
The improvement can also be seen from the new SNR graph:
This tutorial is intended for advanced users; if the previous accuracy results were satisfactory, it can be skipped.
This section describes the steps for performing Quantization Aware Training (QAT) using Hailo's quantized model.
It is assumed that the user already has a background in training deep neural networks.
Quantization-aware training refers to a set of algorithms that incorporate full network training in a quantized
domain. The technique utilizes the straight-through estimator (STE) concept to allow backpropagation through non-
differentiable operations, such as rounding and clipping, during the training process. In deep learning literature, QAT
typically refers to an extended training procedure using the full dataset, labels, and multiple GPUs, similar to the
original training process. However, it can also be applied in other scenarios.
The main differences between the quantization-aware training method and the optimization method shown in pre-
vious tutorials are:
• QAT enables training using labeled data, whereas the FineTune algorithm (Model Optimization Tutorial) is limited
to training using knowledge distillation from the full precision model.
• QAT allows for the use of a pipeline of networks or the integration of post-processing functions into the training
procedure.
In summary, QAT is a useful tool for training quantized models with labeled data and supports multi-GPU training
and integration of post-processing functions. Currently, Hailo QAT only supports Keras.
• Input definitions: In this step, we will prepare the dataset and model for training and testing.
• Full precision training: A short training procedure will be run to initialize the model’s weights.
– In real scenarios, a complete full precision training procedure should take place here. In this notebook,
the full precision training has been shortened to simplify the tutorial.
• Translation of the model: The model will be exported to TFlite, parsed, optimized, and evaluated using the
Hailo toolchain.
• Running QAT: Finally, quantization-aware training will be performed on the quantized model to optimize its
accuracy.
Requirements:
• Run this code in a Jupyter notebook; see the Introduction tutorial for more details.
The input definitions step of this tutorial involves using the MNIST dataset and a simple Convolutional Neural Network
(CNN). The code provided will download the dataset and prepare it for training and evaluation.
[ ]: # Model parameters
num_classes = 10
input_shape = (28, 28, 1)
# Load the data and split it between train and test sets
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
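The preparation itself can follow the usual MNIST recipe; a minimal sketch, assuming tf and np are imported, and
matching the categorical loss and rank-4 label handling used later in this tutorial:

# Scale images to the [0, 1] range and add a trailing channel axis
x_train = x_train.astype("float32")[..., np.newaxis] / 255.0
x_test = x_test.astype("float32")[..., np.newaxis] / 255.0
# One-hot encode the labels for the categorical cross-entropy loss
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)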
In this step, a short training procedure will be run to initialize the model’s weights. Only 5,000 images from the full
training dataset will be used. The accuracy of the model will be measured on the test dataset.
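The exact architecture is not essential here; a small CNN and a short fit call along the following lines are sufficient
(the layer choices and epoch count are assumptions):

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=input_shape),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# Short training on 5,000 images only, to initialize the weights
model.fit(x_train[:5000], y_train[:5000], batch_size=128, epochs=3, validation_split=0.1)
model.evaluate(x_test, y_test)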
In this step, a trained model will be exported into TFlite format to prepare it for use in the Hailo toolchain. After being
translated into TFlite, the model can be parsed, optimized, and inferred using the Hailo DFC. The results of the full
precision model will be compared to those of the quantized model. It is important to note that the results of the
full precision model should be identical to those obtained from the Keras evaluation, while the quantized model may
experience some degradation due to quantization noise.
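A minimal export sketch (the output file name is an assumption):

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("qat_tutorial_model.tflite", "wb") as f:
    f.write(tflite_model)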
runner.load_model_script(””.join(model_script_commands))
runner.optimize(x_train[:1024])
In this final step, the quantized model will be optimized to enhance its accuracy. The runner.get_keras_model
API will be used to obtain a Keras model initialized with the quantized weights. The model can then be trained using
the straight-through estimator (STE) method.
• The runner.get_keras_model API must be used with trainable=True to allow training (usage
of fit).
• Additional layers, post-processing, or other models can be added to the Keras model. For example, here a new
tf.keras.layers.Softmax layer is added.
• For training, use the fit API provided by Keras. Training can be done with customized loss functions and
different optimizers.
• After training is complete, update the ClientRunner weights with the updated model. This is done using
the runner.set_keras_model API. The only allowed changes to the Keras model are weight changes.
Once the new weights are updated, compile the model with the new weights using the runner.compile
API.
[ ]: with tf.distribute.MultiWorkerMirroredStrategy().scope():
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
# get the Hailo Keras model for training
model = runner.get_keras_model(ctx, trainable=True)
new_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
    metrics=["accuracy"],
)
# The Hailo Keras model is exported with rank-4 layers; expand the dimensions of y_train to match the model output shape
y_train = np.expand_dims(y_train, axis=[1, 2])
train_data = train_data.with_options(options)
# start QAT
log = new_model.fit(train_data, batch_size=128, epochs=10)
# set the Keras model after training. The model is already optimized, so do not run optimize() again.
runner.set_keras_model(model)
QAT can gain additional accuracy by training with a teacher (the full precision model) that guides the student model
(the quantized model), i.e., knowledge distillation. To use the full precision model, call the runner.get_keras_model
API with a different context and change the loss accordingly. In the following code, a new Distiller class is
defined to distill the full precision model's outputs and combine them with the supervision of the labels.
• Note that Hailo's FineTune algorithm works in the same way (more information can be found in the
DFC user guide).
[ ]: class Distiller(tf.keras.Model):
    def __init__(self, student, teacher):
        super().__init__()
        self._teacher = teacher
        self._student = student

    def compile(self, optimizer, metrics, student_loss_fn, distillation_loss_fn, alpha, temperature):
        super().compile(optimizer=optimizer, metrics=metrics)
        self._student_loss_fn = student_loss_fn
        self._distillation_loss_fn = distillation_loss_fn
        self._alpha = alpha
        self._temperature = temperature

    def train_step(self, data):
        ...
        # compute gradients
        trainable_vars = self._student.trainable_variables
        gradients = tape.gradient(total_loss, trainable_vars)
        # update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        ...
        return results
runner.load_model_script(””.join(model_script_commands))
runner.optimize(x_train[:1024])
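# Between optimizing and the fit call below, the student (quantized) and teacher (full
# precision) models are obtained from the runner and wired into the Distiller defined
# above. This is a sketch; the loss functions, alpha and temperature values are assumptions.
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    student = runner.get_keras_model(ctx, trainable=True)
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    teacher = runner.get_keras_model(ctx, trainable=False)
distiller = Distiller(student=student, teacher=teacher)
distiller.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
    metrics=["accuracy"],
    student_loss_fn=tf.keras.losses.CategoricalCrossentropy(),
    distillation_loss_fn=tf.keras.losses.KLDivergence(),
    alpha=0.1,
    temperature=3,
)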
# start QAT
log = distiller.fit(x_train, y_train, batch_size=128, epochs=10)
5. Building Models
This section describes the process of taking ONNX/TF trained models and compiling them to a Hailo executable binary
file (HEF). The main API for this process is the ClientRunner. The client runner is a stateful object that handles
all stages. In each stage, the client runner can be serialized into a Hailo archive file (HAR) that can be loaded in the
future to initialize a new client runner. There are three main stages: Translation, Optimization and Compilation.
1. Translation: this process takes an ONNX/TF model and translates it into Hailo’s internal representation. For
that, the translate_tf_model() method or the translate_onnx_model() method should be
used. For examples, see the Parsing Tutorial. At the end of this stage the state of the runner is changed from
Uninitialized to Hailo Model and new functionality is available:
A. Running inference on SDK_NATIVE context. For further details refer to: Model Optimization Tutorial.
B. Profile the model to obtain a model overview. For example, using the command line interface: hailo
profiler --help.
Note: The same functionality can be obtained using the command line interface. For example, hailo
parser {tf, onnx} --help.
2. Optimization: in this stage the model is optimized before compilation using the optimize() method.
The optimize method runs several steps of optimization, including quantization, which may degrade the model
accuracy; therefore, evaluation is needed to verify the model accuracy. For further information see Model Op-
timization Workflow and Model Optimization Tutorial. The load_model_script() method can be used
to apply advanced configuration before calling optimize. At the end of the optimization stage, the state of the
runner is changed from Hailo Model to Quantized Model and new functionality is available:
A. Running inference on SDK_QUANTIZED context (quantized model emulation). For further details refer
to: Model Optimization Tutorial. This step allows the measurement of the degradation due to quantization
of the model without executing on the device. It is recommended to evaluate the quantized model in
emulation before proceeding to compilation.
B. Run the analyze_noise() method to execute the layer noise analysis tool and analyze the model’s
accuracy. This tool is useful to debug quantization issues in case of large degradation in your quantized
model. For further details see the Layer Noise Analysis Tutorial.
Note: The same functionality can be obtained using the command line interface. For example, hailo
optimize --help
3. Compilation: this step takes a runner in state Quantized Model and compiles it to a Hailo executable binary
file (HEF). At the end of this stage the state of the runner is changed from Quantized Model to Compiled Model,
which allows the exporting of a binary HEF file to run on the Hailo hardware.
A. Save the HEF file to be used with the HailoRT. For further details refer to the Compilation Tutorial.
B. Run Inference on hardware. For further details refer to: Inference Tutorial.
Note: The same functionality can be obtained using the command line interface. For example, hailo
compiler --help
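Putting the three stages together, a minimal end-to-end flow looks roughly like the following sketch (the model and
script paths are assumptions, and calib_dataset is a prepared calibration dataset):

from hailo_sdk_client import ClientRunner

runner = ClientRunner(hw_arch="hailo8")
# 1. Translation: Uninitialized -> Hailo Model
runner.translate_onnx_model("model.onnx", "model")
# 2. Optimization: Hailo Model -> Quantized Model
runner.load_model_script("model_script.alls")
runner.optimize(calib_dataset)
# 3. Compilation: Quantized Model -> Compiled Model
hef = runner.compile()
with open("model.hef", "wb") as f:
    f.write(hef)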
The following block diagram illustrates how the runner states and the API switch between each other.
runner = ClientRunner(...)  ->  Uninitialized
runner.translate_tf_model(...) or runner.translate_onnx_model(...)  ->  Hailo Model
runner.optimize(...)  ->  Quantized Model
runner.compile(...)  ->  Compiled Model
The Parser translates the model to Hailo Archive (.har) format. Hailo Archive is a tar.gz archive file that captures the
“state” of the model - the files and attributes used in a given stage from parsing to compilation.
• HN file, which is a JSON-like representation of the graph structure that is deployed to the Hailo hardware.
More files are added when the optimization and compilation stages are done.
Note: Advanced users can use the hailo har CLI tool to extract the internal files of the HAR.
Note: Tensorflow 1.x models (checkpoints, frozen protobuf) support is planned for deprecation in April 2024. It is
recommended to export/convert to TFLite via Keras and Tensorflow's APIs (Python/CLI); see more info in the official
Tensorflow guide.
Note: APIs that do not create new nodes in the TF graph (such as tf.name_scope and
tf.variable_scope) are not listed because they do not require additional parser support.
Tensorflow models are translated to HAR by calling the translate_tf_model() method of the
ClientRunner object. The nn_framework optional parameter tells the Parser whether it's a TF1 or TF2 model.
The start_node_names and end_node_names optional parameters tell the Parser which parts to
include/exclude from parsing. For example, the user may want to exclude certain parts of the post-processing and
evaluation, so they won't be compiled to the Hailo device.
See also:
• TF1 models – checkpoints and frozen graphs (.pb). The Dataflow Compiler automatically distinguishes between
them based on the file extension, but this decision can be overridden using the is_frozen flag.
Tensorflow v1.15.4 has no group conv operation. The Hailo Dataflow Compiler recognizes the following pattern and
automatically converts it to a group conv layer:
• Several (>2) conv ops, which have the same input layer, input dimensions, and kernel dimensions.
• The features are equally sliced from the input layer into the convolutions.
• Bias addition should be before the concat, after each conv op.
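For illustration, a TF1-style snippet matching this pattern might look like the following sketch (the input tensor and
sizes are assumptions):

# `inputs` is assumed to be a [batch, height, width, features] tensor
groups = tf.split(inputs, num_or_size_splits=4, axis=3)  # equal feature slices
convs = [
    tf.layers.conv2d(g, filters=16, kernel_size=3, padding="same", use_bias=True)  # bias before the concat
    for g in groups
]
out = tf.concat(convs, axis=3)  # recognized and converted to a single group conv layer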
Tensorflow v1.15.4 has no feature shuffle operation. The Hailo Dataflow Compiler recognizes the following pattern
of sequential ops and automatically converts it to a feature shuffle layer:
• tf.reshape from 4-dim [batch, height, width, features] to 5-dim [batch, height,
width, groups, features in group].
• tf.transpose where the groups and features in group dimensions are switched. In other words, this op
interleaves features from the different groups.
More details can be found in the Shufflenet paper (Zhang et al., 2017).
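A snippet matching this pattern might look like the following sketch (the tensor and dimension variables are
assumptions):

# [batch, height, width, features] -> [batch, height, width, groups, features_in_group]
x = tf.reshape(inputs, [batch, height, width, groups, features // groups])
# Swap the groups and features-in-group dimensions, interleaving features across groups
x = tf.transpose(x, [0, 1, 2, 4, 3])
x = tf.reshape(x, [batch, height, width, features])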
Squeeze and excitation block parsing is supported. An example Tensorflow snippet is shown below.
out_dim = 32
ratio = 4
conv1 = tf.keras.layers.Conv2D(out_dim, 1)(my_input)
x = tf.keras.layers.GlobalAveragePooling2D()(conv1)
x = tf.keras.layers.Dense(out_dim // ratio, activation='relu')(x)
x = tf.keras.layers.Dense(out_dim, activation='sigmoid')(x)
x = tf.reshape(x, [1, 1, 1, out_dim])
ew_mult = conv1 * x
tf.keras.activations.relu(input_tensor, threshold=threshold)
val * tf.sign(tf.abs(input_tensor))
Tensorflow Lite models are translated by calling the translate_tf_model() method of the ClientRunner
object. No additional parameters are needed.
Note: Hailo supports 32-bit/16-bit TFLite models, since our Model Optimization stage uses the high precision weights
to optimize the model for Hailo devices. Models that are already quantized to 8-bit are not supported.
See also:
For more info, and some useful examples on converting models from Tensorflow to Tensorflow-lite, refer to the
Parsing Tutorial, or the official Tensorflow guide on (tflite converter CLI).
ONNX models are translated by calling the translate_onnx_model() method of the ClientRunner
object. The supported ONNX opset versions are 8 and 11-17.
The following example shows how to export a PyTorch model to ONNX, note the inline comments which explain each
parameter in the export function.
Note: Before trying this small example, make sure Pytorch is installed in the environment.
import torch

# `model` and `dummy_input` are assumed to be the trained module and an example input tensor
torch.onnx.export(model,                      # the PyTorch model to export
                  dummy_input,                # an example input used for tracing
                  "model.onnx",               # output file name
                  export_params=True,         # store the trained weights inside the ONNX file
                  training=torch.onnx.TrainingMode.PRESERVE,
                  do_constant_folding=False,
                  opset_version=13)
• NMS is a technique that is used to filter the predictions of object detectors, by selecting final entities (e.g.,
bounding box) out of many overlapping entities. It consists of two stages: score threshold (filtering low-
probability detections by their score), and IoU (Intersection over Union, filtering overlapping boxes).
• The NMS algorithm needs to be fed with bounding boxes, which are calculated out of the network outputs.
This process is called “bbox decoding”, and it consists of mathematically converting the network outputs to
box coordinates.
• The bbox decoding calculations can vary greatly from one implementation to another, and include many types
of math operations (pow, exp, log, and more).
On neural core:
On CPU:
1. YOLOv5: bbox decoding, score_threshold filtering, IoU filtering (also compatible with YOLOv7)
2. SSD: bbox decoding, score_threshold filtering, IoU filtering (also compatible with EfficientDet)
4. YOLOv8: bbox decoding, score_threshold filtering, IoU filtering (also compatible with NanoDet)
5. YOLOv5 SEG: bbox decoding, score_threshold filtering, IoU filtering, segmentation mask per instance
Note: NMS on the neural core is only supported in models that are compiled to a single context. If the model is compiled
with multiple contexts, undefined runtime behavior might occur. In this case, you are encouraged to either try single-
context compilation using a model script, or perform the NMS on the host platform.
1. When translating the network using the parser, you should supply the end_node_names parameter with the
layers that come before the post-processing (bbox decoding) section. For Tensorflow models, for example, this
is done using the translate_tf_model() API or the CLI tool: hailo parser tf
--end-node-names [list].
Note: When the hailo CLI tool is used, the arguments are separated by spaces: --end-node-names
END_NODE1 END_NODE2 .. and so on.
2. The post-processing has to be manually added to the translated (parsed) network using a Model
Script command (nms_postprocess), which is fed to the hailo optimize CLI tool, or is loaded with
load_model_script() before calling the optimize() method. The command adds the relevant
postprocess to the Hailo model, according to the architecture (e.g. SSD) and the configuration json file.
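For example, a single model script line that adds a YOLOv5 post-process running on the CPU might look like this
sketch (the JSON path is an assumption, and the config file can be omitted for the default configuration):

nms_postprocess("nms_config.json", meta_arch=yolov5, engine=cpu)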
Note: The output format of the on-chip post-process can be found in the HailoRT guide:
• For the Python API, look for tf_nms_format and see the definitions of the Hailo format and the TensorFlow format.
• For the C++ API, look for HAILO_FORMAT_ORDER_HAILO_NMS. It is similar to the Hailo format from the Python
API.
3. One can experiment with the output format using the SDK_FP_OPTIMIZED or the SDK_QUANTIZED emulators,
before compiling the model. For more information, refer to the Model Optimization Workflow section.
SSD
SSD (which is also used by EfficientDet models) post-processing consists of bbox decoding and NMS.
Hailo supports the specific SSD NMS implementation from the TF Object Detection API SSD, tag v1.13.
The ssd_anchor_generator is used, which utilizes the center of a pixel as the anchor centers (so anchor centers
cannot be changed):
anchor_generator {
ssd_anchor_generator {
num_layers: 6
min_scale: 0.2
max_scale: 0.95
aspect_ratios: 1.0
aspect_ratios: 2.0
aspect_ratios: 0.5
aspect_ratios: 3.0
aspect_ratios: 0.3333
}
}
It is assumed that each branch (“box predictor”) has its own anchors repeated on all pixels.
The bbox decoding function currently supported on the chip can be found here (see def _decode, which contains
the mathematical transformation needed for extracting the bboxes).
For this NMS implementation, the end_nodes that come just-before the bbox decoding might be:
end_node_names =
[
    "BoxPredictor_0/BoxEncodingPredictor/BiasAdd",
    "BoxPredictor_0/ClassPredictor/BiasAdd",
    "BoxPredictor_1/BoxEncodingPredictor/BiasAdd",
    "BoxPredictor_1/ClassPredictor/BiasAdd",
    "BoxPredictor_2/BoxEncodingPredictor/BiasAdd",
    "BoxPredictor_2/ClassPredictor/BiasAdd",
    "BoxPredictor_3/BoxEncodingPredictor/BiasAdd",
    "BoxPredictor_3/ClassPredictor/BiasAdd",
    "BoxPredictor_4/BoxEncodingPredictor/BiasAdd",
    "BoxPredictor_4/ClassPredictor/BiasAdd",
    "BoxPredictor_5/BoxEncodingPredictor/BiasAdd",
    "BoxPredictor_5/ClassPredictor/BiasAdd"
]
An example of the corresponding SSD NMS JSON is found at:
site-packages/hailo_sdk_client/tools/core_postprocess/nms_ssd_config_example_json_notes.txt, relative to the
virtual environment where the Dataflow Compiler is installed. This example file is not a valid JSON file since it has
in-line comments, but a ready-to-use file is in the same folder.
CenterNet
CenterNet post-processing consists of bbox decoding and then choosing the bboxes with the best scores.
Our CenterNet post-processing corresponds to the CenterNetDecoder class on Gluon-CV (link). Therefore we
support any CenterNet post-processing which is equivalent in functionality to the above-mentioned code.
For this implementation, the end_nodes that come just-before the bbox decoding might be:
end_node_names =
[
    "threshold_confidence/threshold_activation/threshold_confidence/re_lu/Relu",
    "CenterNet0_conv3/BiasAdd",
    "CenterNet0_conv5/BiasAdd"
]
An example of the corresponding CenterNet JSON can be found at:
site-packages/hailo_sdk_client/tools/core_postprocess/centerNet_example_json_notes.txt, relative to the virtual
environment where the Dataflow Compiler is installed. This example file is not a valid JSON file since it has in-line
comments, but a ready-to-use file is in the same folder.
YOLOv5
YOLOv5 post-processing (true also for YOLOv7) consists of bbox decoding and NMS. The NMS consists of two parts:
1. Filtering bboxes according to their detection score threshold (“low probability” boxes are filtered).
2. Filtering the remaining bboxes with IoU technique: selecting final entities (e.g., bounding box) out of many
overlapping entities.
Hailo implemented the bbox decoding in-chip, as well as score threshold filtering. The IoU section needs to be
implemented on the host, but since score threshold filtering has been performed, the number of bboxes to deal with
has decreased by an order of magnitude.
Support for the post-processing from the original implementation of YOLOv5, tag v2.0, has been tested.
To add a post-process block from the model script, the model needs to be parsed up to the regression layers that
lead into the post-process. These regression layers are given by the end_node_names. For example, for this
implementation, on YOLOv5m (tag v2.0) the end_node_names might be:
end_node_names =
[
    "Conv_307",
    "Conv_286",
    "Conv_265"
]
An example of the corresponding YOLOv5 JSON is found at:
site-packages/hailo_sdk_client/tools/core_postprocess/nms_yolov5_example_json_notes.txt, relative to the virtual
environment where the Dataflow Compiler is installed. This example file is not a valid JSON file since it has in-line
comments, but a ready-to-use file is in the same folder.
YOLOv5 SEG
YOLOv5 SEG NMS is used for the instance segmentation task; the post-processing consists of bbox decoding, score
threshold filtering, IoU filtering, and a segmentation mask per instance. The post-processing runs on the CPU. The
output is a per-image detection in the format [N, 1, num_max_proposals, 6 + image_dims[0] * image_dims[1]], where
the format of axis -1 is [y_min, x_min, y_max, x_max, score, class, flattened masks].
YOLOv6
YOLOv6 post-processing is supported only for tag 0.2.0 and above (for tag 0.1.0, please use the postprocess of YOLOX).
It consists of bbox decoding and NMS. As in other meta-architectures, the NMS postprocess performs score and IoU
threshold filtering. YOLOv6 is anchor-free, which means there is no need to configure anchor dimensions via the
NMS config JSON.
We have tested support for post-processing from the original implementation of YOLOv6, tag 0.2.1.
An example of the corresponding YOLOv6 JSON can be found at:
site-packages/hailo_sdk_client/tools/core_postprocess/nms_yolov6_example_json_notes.txt, relative to the virtual
environment where the Dataflow Compiler is installed. This example file is not a valid JSON file since it has in-line
comments, but a ready-to-use file is in the same folder.
This section describes the TF and ONNX parser limitations regarding ordering of layers.
• Bias – only before Conv, before DW Conv, after Conv, after DW Conv, after Deconv, or after Dense.
The following padding schemes are supported in Conv, DW Conv, Max Pooling, and Average Pooling layers:
• VALID
• SAME (symmetric padding)
• SAME_TENSORFLOW
Other padding schemes are also supported, and will translate into External Padding layers.
In some cases, the translated model may have some differences as compared to the original model:
1. BatchNorm layer in training mode. The difference in this case is because the BN params are static in the Hailo
model (folded into the relevant layers' kernel/bias), while in the original model framework, training mode means
that the layer would first update the moving mean/var and then normalize its output in place. To avoid this case:
• PyTorch: export your model to ONNX in preserve or eval mode. For more information, check Parsing
Tutorial.
While it is recommended to optimize and compile using the default configuration (using either the CLI tools or Python
APIs), Model Scripts make it possible to change the default behavior of the Dataflow Compiler, and to make modifi-
cations to the model.
client_runner.load_model_script('model_script.alls')
client_runner.optimize(calib_dataset)
compiled_hef = client_runner.compile()
The model script is a text file that is optionally fed to the Optimize or Compile functions and contains commands that
serve different purposes. The most frequently used and recommended commands are:
– [Important] The Model Modification commands are used to modify the parsed model, and to add transformations
(that were not a part of the original ONNX/TF model) to decrease CPU load. Examples:
* Apply resize at the input, from the source resolution to the model’s resolution.
* Add post processing to your model, on supported architectures only (if not detected automatically
during the parsing stage).
– [Important] The Optimization level determines how aggressive the algorithms used to increase the accuracy
of the quantized model are. A higher optimization level requires more time and system
resources, but results in higher accuracy.
– [Important] The Compression level determines the percentage of 4-bit layers; a higher percentage increases
the performance (FPS) of the compiled model. It requires a high optimization level to regain the accuracy
loss.
– Resolution reduction command can be used to run the Optimization stage in lower spatial resolution, to
decrease its running time.
– Advanced commands:
* The precision_mode field of the quantization_param command can be used to apply 16-bit precision
to specific layers or outputs, to increase the model accuracy.
* Weights clipping can be used to ignore outliers on a layer’s weights, to increase the accuracy.
* Activation clipping can be used to ignore outliers on a layer’s activations, to increase the accuracy.
* Global average pool reduction can be used to split a global average pooling layer with a large input
resolution.
* The Post-quantization commands allow changing the parameters of the advanced post-quantization
algorithms. Although the algorithms and their parameters are automatically chosen according to the
Optimization level, manual configuration is possible. For example, decreasing the AdaRound batch
size if it fails.
* When Optimization level < 2, you can manually enable the checker_cfg in order to collect activation
statistics, for further analysis using the profiler (it is enabled by default when Optimization level >=
2).
• Compilation stage:
– [Important] The Performance Mode can be used to compile the model to the highest possible resource
utilization, to maximize performance (FPS). Expect the compilation time to increase dramatically.
– Suggestions for the compilation could be supplied (for example: compile for platforms with low PCIe band-
width).
– The Automatic model script can be used to pin the compilation results to a previously compiled version of
the same model.
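For illustration, a small model script combining a few of the commands above could look like the following sketch
(the layer names and values are assumptions taken from examples elsewhere in this guide):

normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])
model_optimization_flavor(optimization_level=2, compression_level=1)
pre_quantization_optimization(weights_clipping, layers=[conv27], mode=percentile, clipping_values=[0.01, 99.99])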
Note: Each stage only considers commands that are relevant for it; if a model script is provided at the Optimization
stage but also contains compilation-related commands, those commands will be ignored at the Optimization stage,
and will be activated during the compilation stage.
Note: If a new model script is given at the Compilation stage, it will not undo the already executed optimization
related commands, but will overwrite any compilation related commands that were loaded at the Optimization stage.
This step optimizes the model for deployment. The main objective is translating the model's parameters numerically
from floating point to integer representation, also known as quantization (or model optimization). This is a mandatory
step in order to run models on the Hailo hardware. This step takes place after translating the model from its
original framework and before compiling it. For optimized performance, we recommend using a machine with a GPU
when running the model optimization, and preparing a calibration dataset with at least 1024 entries.
The model optimization has two main steps: Full Precision Optimization and Quantization Optimization.
Full precision optimization includes any changes to the model in the floating-point precision domain, for example
Equalization [Meller2019], TSE [Vosco2021] and pruning.
It also applies any model modifications from the model script, such as color conversions or adding post-processing
layers.
The next step, quantization, includes compressing the model from floating point to integer representation of
the weights (4/8/16-bits) and activations (8/16-bits) and algorithms to improve the model’s accuracy, such as IBC
[Finkelstein2019], AdaRound [Nagel2020], and QFT [Finkelstein2022]. Both steps may degrade the model accuracy,
therefore, evaluation is needed to verify the model accuracy after each step.
To perform these steps, one can use the simple optimization flow. Use the hailo optimize CLI, or the
load_model_script() method followed by the optimize() API. Once the optimization process has
finished, continue to the compilation stage. The simple optimization flow is presented in this diagram.
The advanced Python workflow can also be followed for tracking the accuracy of the model throughout the stages
of the optimization. This advanced workflow, as well as the simple flows, are presented in the Model Optimization
Tutorial.
The advanced workflow consists of number of stages, which are depicted in the following chart:
1. A preliminary step would be to test the Native model before any changes, right after parsing. This stage is
important for ensuring that the parsing was successful, and that the preprocessing (before the start nodes)
and post-processing (after the end nodes) were built correctly. As mentioned, the SDK_NATIVE emulator is used for
this purpose:
import tensorflow as tf
from hailo_sdk_client import ClientRunner, InferenceContext
runner = ClientRunner(har=model_path)
with runner.infer_context(InferenceContext.SDK_NATIVE) as ctx:
output = runner.infer(ctx, input_data)
The parsed model can also be compared to the original model using the hailo parser command with the
--compare flag. For more information refer to the relevant User Guide section.
2. Load the model script, and use the optimize_full_precision() method to apply the model script
and the full precision optimizations.
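A minimal sketch of this step, assuming a runner in the Hailo Model state and a model script string as in the tutorials:

runner.load_model_script(model_script)
runner.optimize_full_precision()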
3. Perform full precision validation, when the model is in its final state before the optimization process. This
stage is important because it allows emulating the input and output formats, taking into account the model
modifications (normalization, resize, color conversions, etc.). Achieving good accuracy means that the pre/post
processing functions are built correctly, and that the infrastructure is ready for testing the quantized model.
The SDK_FP_OPTIMIZED emulator is used for this purpose:
import tensorflow as tf
from hailo_sdk_client import ClientRunner, InferenceContext
runner = ClientRunner(har=model_path)
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
output = runner.infer(ctx, input_data_modified)
4. Next, call the model optimization API to generate an optimized model. To obtain the best performance, it is
recommended to use a GPU machine and a dataset with at least 1024 entries for calibration, which is used to
gather activation statistics in the inputs/outputs of each layer. This data is used to optimize the accuracy of
the final model. These statistics are used to map the floating-point values into their integer representation
(a.k.a. quantization). Using high quality calibration data (that represents the validation dataset and the real-
life scenario well) is crucial for obtaining good accuracy. Supported calibration data types are: a Numpy array with
shape [BxHxWxC], an NPY file of a Numpy array with shape [BxHxWxC], a directory of Numpy files, each with shape
[HxWxC], and a tf.data.Dataset object with an expected return value of [{layer_name: input}, _].
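For example, the tf.data.Dataset form could be built like the following sketch (the layer name and the label
placeholder are assumptions):

import numpy as np
import tensorflow as tf

images = np.load("calib_set.npy")  # Numpy array with shape [B, H, W, C]
calib_dataset = tf.data.Dataset.from_tensor_slices(
    ({"model/input_layer1": images.astype(np.float32)}, np.zeros(len(images), np.float32))
)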
5. Finally, it is necessary to verify the accuracy of the optimized model to validate that the process was successful. In
case of large degradation (that doesn't meet the accuracy requirement), re-try the optimization with an increased
optimization level. Optimization and Compression levels allow controlling the model optimization effort
and the model memory footprint. For quick iterations it is recommended to start with the default setting of the
model optimizer (optimization_level=2, compression_level=1).
import tensorflow as tf
from hailo_sdk_client import ClientRunner, InferenceContext
runner = ClientRunner(har=model_path)
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
output = runner.infer(ctx, input_data_modified)
Note: Due to known installation issues with Hailo’s Docker, GPU usage is possible only when Tensorflow packages
are imported before any of Hailo’s DFC packages (e.g. client runner, inference context). See code examples above.
Note: Familiarity with the runner states diagram is important for understanding the following diagram.
Note: If problems are encountered with VRAM allocation during stages other than AdaRound, it is possible to attempt
to resolve the issue by disabling the memory growth flag. To do this, set the following environment variable:
HAILO_SET_MEMORY_GROWTH=false
By doing so, the default memory allocation method for tensorflow GPU will be modified, and the entire VRAM will be
allocated and managed internally.
Additionally, if tensorflow is imported, please make sure the SDK is imported before tensorflow is used.
The optimize() method serves as the model optimization API. This API requires a sample dataset (typically >=
1024 entries), which is used to collect statistics. After the statistics are collected, they are used to quantize the weights
and activations, that is, map the floating point values into integer representation. Hailo's quantization scheme uses
uniformly distributed bins and optimizes for the best trade-off between range and precision.
Before calling the optimize() API, it is worth considering calling load_model_script() to load a model script
(.alls file) that includes commands that modify the model, affect the basic quantization flow, and enable additional
algorithms that improve the accuracy and optimize the running time.
To control the optimization grade, it is recommended to set the optimization_level argument with the
model_optimization_flavor command, which accepts values of 0-4 and controls which quantization algorithms will
be enabled. Using a higher optimization level means the model optimization tool will use more advanced algorithms,
which are expected to achieve better accuracy but take longer to run. Note that optimization levels 2 and 4 require at
least 1024 images to run, and optimization level 3 requires 256. The default setting is optimization_level=2, unless a
GPU is not available or the dataset is not large enough (less than 1024). To reduce the running time, the optimization
process uses multiple GPUs when available. To avoid using multiple GPUs, use the model_optimization_config
command. For reference, these are the expected running times for optimizing ResNet-v1-50 with compression_level=4
using an Nvidia A4000 GPU:
• optimization_level=0: 59s
• optimization_level=1: 206s
• optimization_level=2: 256s
• optimization_level=3: 2828s
• optimization_level=4: 11002s
Figure 8. Block diagram of the advanced model optimization flow using Python APIs
To control the compression degree, use the compression_level argument of the model_optimization_flavor com-
mand, which accepts values of 0-5 and controls the percentage of weights that are quantized to 4-bit (the default is
8-bit precision for weights quantization). Using a higher compression level means the compression will be more
aggressive and accuracy may be degraded. To recover the accuracy loss, it is recommended to use a higher optimiza-
tion level as well. A high compression rate improves the FPS, especially for large networks (more than 20M parameters)
or when used in a pipeline. The default setting is compression_level=1.
Note: The algorithms that compose each optimization level are expected to change in future versions. To see the
current algorithms in use, refer to the model_optimization_flavor command description.
The tables below display the results of applying different choices of optimization/compression levels on common CV
models.
Table 4. An example of the degradations for the RegNetX-800MF model over various flavor settings. Reported degra-
dations are Top-1 scores over the ImageNet-1K dataset (validation set of 50k images). Note that the RegNetX-800MF
model is relatively small (defined as having less than 20M parameters), hence there is only one valid compression
level (compression_level=0).
Table 5. An example of the degradations for the YOLOv5m model over various flavor settings. Reported degradations
are mAP scores over a validation set of 5k samples from the COCO2017 dataset.
Table 6. An example of the degradations for the DeepLab-v3-MobileNet-v2 model over various flavor settings. Re-
ported degradations are mIoU scores over the PASCAL-VOC dataset. Note that the DeepLab-v3-MobileNet-v2 model
is relatively small (defined as having less than 20M parameters), hence there is only one valid compression level (com-
pression_level=0).
Debugging Accuracy
If the quantization accuracy is not sufficient, try any of the following methods (after each step, validate
the accuracy of your model):
1. Make sure there are at least 1024 images in the calibration dataset and that the machine has a GPU.
2. Usage of BatchNormalization is crucial to obtain good quantization accuracy because it reduces the activation
ranges throughout the network; therefore, it is highly recommended to use it during training.
3. Run the layer noise analysis tool to identify the source of degradation. For example, using the CLI command:
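# a hypothetical invocation; consult the Layer Noise Analysis Tutorial for the exact CLI syntax and flags
hailo analyze-noise model_quantized.har --data-path calib_set.npy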
4. If you have used compression_level, lower its value (the minimum is 0). For example, use the following com-
mand in the model script:
model_optimization_flavor(compression_level=1)
5. Configure a higher optimization_level in the model script, which activates more optimization algorithms, and
experiment with different optimization levels. For example:
model_optimization_flavor(optimization_level=4)
6. Configure 16-bit output. Note that using 16-bit output affects the output bandwidth from the Hailo device. For exam-
ple, using the following model script command:
quantization_param(output_layer1, precision_mode=a16_w16)
7. Configure 16-bit on specific layers that are sensitive to quantization. Note that if the activation function is
not linear/relu/leaky, the accuracy might be limited by the activation precision. In addition, note that using
16-bit affects the throughput obtained from the Hailo device. For example, using the following model script
command:
quantization_param(conv1, precision_mode=a16_w16)
8. Try to run with activation clipping using the following model script commands:
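# a sketch assuming the percentile-based clipping mode described later in this guide; the values are illustrative
pre_quantization_optimization(activation_clipping, layers={*}, mode=percentile, clipping_values=[0.01, 99.99])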
9. Use more data and a longer optimization process in Finetune, for example:
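# a sketch; the dataset size, epochs and learning rate are illustrative values
post_quantization_optimization(finetune, policy=enabled, dataset_size=4000, epochs=8, learning_rate=0.0001)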
10. Use quantization aware training (QAT). For more information, see the QAT Tutorial.
See also:
The Model Optimization Tutorial which explains how to use the optimization API and the optimization/compression
levels and the Layer Noise Analysis Tutorial which explains how to use the analysis tool.
The model script is loaded before running the model optimization by using load_model_script().
The model script supports the following model optimization commands, which are processed on optimize():
1. model_optimization_flavor
2. model_optimization_config
3. quantization_param
4. pre_quantization_optimization
5. post_quantization_optimization
model_modification_commands
• input_conversion
• transpose
• normalization
• nms_postprocess
• change_output_activation
• logits_layer
• set_seed
• resize
Note: Each input modification command inserts a layer directly after the input layer, and the commands are applied
sequentially as they appear in the script. The final model structure places the most recently executed command’s
layer immediately after the input layer, resulting in the modification layers appearing in reverse order to the script.
# example script to create this structure: input_layer1 -> reshape_yuy2 -> norm_layer1
norm_layer1 = normalization(mean_array, std_array, input_layer1)
reshape_yuy2 = format_conversion(input_layer1, yuy2_to_hailo_yuv)
input_conversion
• yuv_full_range_to_rgb - full range YUV conversion from YUV to RGB, implemented by the following kernel:
[[1.0, 1.0, 1.0], [0, -0.343, 1.765], [1.4, -0.711, 0]] and bias [-179.2, 134.912, -225.92] terms. Corresponds
to cv::COLOR_YCrCb2RGB in OpenCV terminology, OpenCV documentation: OpenCV cv::COLOR_YCrCb2RGB.
Elaborated matrices and equations can be found at matrices and equations
• yuv_to_rgb / yuv601_to_rgb - in compliance with ITU-R BT.601 standard, implemented by the following
kernel: [[1.164, 1.164, 1.164], [0, -0.392, 2.017], [1.596, -0.813, 0]] and bias [-222.912, 135.616, -276.8]
terms. Corresponds to cv::COLOR_YUV2RGB_NV12 in OpenCV terminology, OpenCV documentation: OpenCV
cv::COLOR_YUV2RGB_NV12.
• yuv709_to_rgb - in compliance with ITU-R BT.709 standard, implemented by the following kernel: [[1.164, 1.164,
1.164], [0, -0.213, 2.112], [1.793, -0.533, 0]] and bias [-248.128, 76.864, -288.96] terms.
• yuv_full_range_to_bgr - full range YUV conversion from YUV to BGR, implemented by the following kernel: [[1.0,
1.0, 1.0], [1.765, -0.343, 0], [0, -0.711, 1.4]] and bias [-225.92, 134.912, -179.2] terms.
• yuv_to_bgr / yuv601_to_bgr - in compliance with ITU-R BT.601 standard, implemented by the following kernel:
[[1.164, 1.164, 1.164], [2.017, -0.392, 0], [0, -0.813, 1.596]] and bias [-276.8, 135.616, -222.912] terms. Corre-
sponds to cv::COLOR_YUV2BGR in OpenCV terminology.
• yuv709_to_bgr - in compliance with ITU-R BT.709 standard, implemented by the following kernel: [[1.164, 1.164,
1.164], [2.112, -0.213, 0], [0, -0.533, 1.793]] and bias [-288.96, 76.864, -248.128] terms.
• bgr_to_rgb - which swaps the R and B channels using a channel-reversal (exchange) matrix as kernel, no bias.
Corresponds to cv2.cvtColor(src, code) where src is a BGR image, and code is cv2.COLOR_BGR2RGB.
• rgb_to_bgr - as the above, swaps the R and B channels using a channel-reversal (exchange) matrix as kernel,
no bias. Corresponds to cv2.cvtColor(src, code) where src is an RGB image, and code is cv2.COLOR_RGB2BGR.
Note: The input_layer argument is optional. If a layer name is not specified, the conversion will be added after all
input layers.
# number of return values should match the number of inputs of the network
rgb_layer1, rgb_layer2, ... = input_conversion(yuv_to_rgb)
Or a format conversion:
• yuy2_to_hailo_yuv – Converts the YUY2 format, which is used by some cameras, to YUV. This is use-
ful together with the YUV to RGB layer to create a full vision pipeline YUY2 to YUV to RGB. Corresponds to
cv::COLOR_YUV2RGB_YUY2 in OpenCV terminology.
• nv12_to_hailo_yuv – converts the NV12 format, which is used by a growing number of cameras, to YUV
format. This is a useful conversion to be used before the first layer to offload this conversion from the host.
• nv21_to_hailo_yuv – Converts the NV21 format, which is used by some cameras, to YUV.
• i420_to_hailo_yuv – Converts the i420 format, which is used by some cameras, to YUV.
• tf_rgbx_to_hailo_rgb – Converts RGBX to Hailo RGB format.
Note: By default, format conversions will only be part of the compiled model but they won’t be part of the op-
timization process. To include emulation supported format conversions - yuy2_to_yuv, tf_rgbx_to_hailo_rgb and
nv12_to_hailo_yuv in the optimization process, set emulator_support=True inside the command. When setting it to
True, the calibration set should be given in the source format.
Or a hybrid conversion:
• yuy2_to_rgb - which is implemented by adding format conversion yuy2_to_yuv and color conversion yuv_to_rgb.
• nv12_to_rgb - which is implemented by adding format conversion nv12_to_yuv and color conversion yuv_to_rgb.
• nv21_to_rgb - which is implemented by adding format conversion nv21_to_yuv and color conversion yuv_to_rgb.
• i420_to_rgb - which is implemented by adding format conversion i420_to_yuv and color conversion yuv_to_rgb.
Note: By default, the format conversion part of the hybrid conversion command behaves as a format conversion, i.e. it
will be part of the compiled model but not part of the optimization process. To include the supported format conver-
sions - yuy2_to_yuv, tf_rgbx_to_hailo_rgb and nv12_to_hailo_yuv - in the optimization process, set emulator_support=True
inside the command.
transpose
Transposes the whole connected component(s) of the chosen input layer(s), so the network runs transposed on chip
(improves performance in some cases).
Not supported when there are SpaceToDepth (columns to features) or DepthToSpace (features to columns) reshapes
in the network.
HailoRT is responsible for transposing the inputs and outputs on the host side.
normalization
Multiple commands can be used to apply different normalization to each input layer.
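For instance, a common form of the command (the mean/std values shown are the standard ImageNet ones, used here for illustration):
normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])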
nms_postprocess
For more information about NMS post-processing, refer to nms_post_processing. Hailo's optimized implementation of
the NMS post-process is recommended to enhance performance and avoid unnecessary format conversions. This is
true for all engines and architectures. In Hailo's implementation, only the boxes with scores above the threshold are
converted to FLOAT32 format, which significantly improves overall performance compared to converting the entire
output to FLOAT32.
# example for adding SSD NMS with a config file; the architecture name is written without quotes
nms_postprocess('nms_config_file.json', meta_arch=ssd)
There are a few options for using this command. Note that in each option, the architecture name must be provided,
using meta_arch argument.
• If an NMS structure was detected during parsing, an autogenerated config file with the values ex-
tracted from the original model will be used.
• If an NMS structure was detected during parsing and arguments are provided in the command, the
autogenerated config file will be used, edited by the provided arguments.
• The config arguments that can be set via the command are: nms_scores_th, nms_iou_th, im-
age_dims, classes.
• Please note that when providing the config path, do not provide any of the config arguments using
the command; set them only inside the file.
The available default config files are:
• default_nms_config_yolov5.json
• default_nms_config_yolov6.json
• default_nms_config_yolox.json
• default_nms_config_yolo8.json
• default_nms_config_centernet.json
• default_nms_config_ssd.json
• default_nms_config_yolov5_seg.json
For available architectures see NMSMetaArchitectures.
Networks with YOLOv5-based post-process perform bbox decoding and score_threshold filtering on the neural core
and IoU filtering on the CPU by default. Networks with SSD/Centernet-based post-process run on the neural core by
default. All other supported post-process architectures run on the CPU by default. Networks with post-process can
be configured manually to run either on the neural core or on the CPU using the engine argument in the relevant model script
command. Decoded bounding boxes are normalized between 0 and 1.
For example:
nms_postprocess(meta_arch=ssd, engine=nn_core)
• cpu - the NMS post-process will run on the CPU; currently supported on YOLOv5, YOLOv5 SEG,
YOLOv8, SSD and YOLOX.
For example:
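# a sketch following the documented command form; the meta_arch value is illustrative
nms_postprocess(meta_arch=yolov8, engine=cpu)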
Note: The output format of object detection models is [batch_size, num_classes, 5, num_proposals_per_class],
where the format of the axis 2 dimension is [y_min, x_min, y_max, x_max, score]. The output format of instance
segmentation models is [N, 1, num_max_proposals, 6 + image_dims[0] * image_dims[1]], where the format of
axis -1 is [y_min, x_min, y_max, x_max, score, class, flattened masks].
• auto - currently supported on YOLOv5; performs bbox decoding and score_threshold filtering on the neural
core and IoU filtering on the CPU.
For example:
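# a sketch following the documented command form
nms_postprocess(meta_arch=yolov5, engine=auto)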
Note: When using NMS post-process with the default configuration the nms_scores_th value is 0.3. When using
NMS post-process on CPU with default configuration the nms_iou_th is changed to 0.6.
Note: Running bbox decoding only on CPU is computationally expensive and may affect the perfor-
mance, since the decoding is done over all the proposals.
change_output_activation
Changes output layer activation. See the supported activations section for activation types. If the output layer doesn’t
support activation, a standalone activation layer will be added.
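For illustration, a possible command of this form (the layer and activation names are placeholders):
change_output_activation(output_layer1, sigmoid)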
logits_layer
Adds logits layer after an output layer. The supported logits layers are Softmax and Argmax.
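A hypothetical invocation following the script's command style (the exact argument names may differ):
logits_layer(output_layer1, softmax)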
set_seed
Sets the global random seed for python random, numpy and tensorflow libraries, and enables operator determinism
in tensorflow’s backend. Setting the seed ensures reproducibility of quantization results.
Note: When running the Finetune algorithm on a GPU, tensorflow's back-propagation operators can't produce determin-
istic results.
Note: Using tensorflow's operator determinism comes at the expense of runtime efficiency; it's recommended to
use this feature for debugging only. For more details please refer to tensorflow's docs.
set_seed(seed=5)
resize
Performs resize for the input or output tensor(s). The resize can be applied either on-chip or on the CPU. The default resize
method is bilinear interpolation with align_corners=True, half_pixels=False, and engine=nn_core.
The resize limitations are those of resize bilinear as described here. When the resize ratio is high, the compilation
process will be more difficult, as more on-chip memories and sub-clusters are required.
Note: When using the resize command on an input layer, resize_shapes represents the new input shape of the
network, while when using the command on an output layer, resize_shapes represents the new output shape of the network.
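A hypothetical invocation, assuming the resize_shapes argument described in the note above:
resize(input_layer1, resize_shapes=[256, 256])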
model_optimization_flavor
Configures the model optimization effort by setting the compression level and the optimization level. The flavor's
algorithms behave as defaults; any algorithm-specific configuration will override the flavor's default
config.
Default values:
• compression_level: 1
• optimization_level: 2 for GPU and 1024 images, 1 for GPU and less than 1024 images, and 0
for CPU only.
optimization_level values:
• 0 - Equalization
• 3 - Equalization + Adaround with 320 epochs & 256 images on all layers
• 4 - Equalization + Adaround with 320 epochs & 1024 images on all layers
compression_level values:
• 0 - nothing is applied
• 1 - auto 4-bit is set to 0.2 if the network is large enough (20% of the weights)
• 2 - auto 4-bit is set to 0.4 if the network is large enough (40% of the weights)
• 3 - auto 4-bit is set to 0.6 if the network is large enough (60% of the weights)
• 4 - auto 4-bit is set to 0.8 if the network is large enough (80% of the weights)
• 5 - auto 4-bit is set to 1.0 if the network is large enough (100% of the weights)
Example commands:
model_optimization_flavor(optimization_level=4)
model_optimization_flavor(compression_level=2)
model_optimization_flavor(optimization_level=2, compression_level=1)
model_optimization_flavor(optimization_level=2, batch_size=4)
Parameters:
model_optimization_config
• compression_params
• negative_exponent
• globals
• calibration
• checker_cfg
compression_params
This command controls layers' 4-bit and 16-bit quantization. In 4-bit mode, it reduces some layers' precision mode to
a8_w4. The values (between 0 and 1 inclusive) represent how much of the total weight memory usage you want to
optimize to 4-bit. When the value is 1, all the weights will be set to 4-bit; when 0, the weights won't be modified. The
16-bit mode is supported only when set on the entire network (setting the 16-bit value to 1) and without using 4-bit
(setting the 4-bit value to 0).
Example command:
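# a sketch; the parameter name follows the 4-bit weight-ratio semantics described above and the value is illustrative
model_optimization_config(compression_params, auto_4bit_weights_ratio=0.2)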
Note: If you manually set some layers’ precision_mode using quantization_param, the optimization will take it into
account, and won’t set any weight back to 8bit
Note: If you set 16-bit quantization, all layers activations and weights are quantized using 16 bits. In this case, explicit
configuration of layer bias mode is not allowed.
Parameters:
negative_exponent
During the process of quantization, certain layers may experience bit loss, resulting in reduced precision of the output.
To mitigate this issue, this command can enable the addition of extra layers. Setting rank to 1 introduces a helper layer
that mitigates the bits lost in the quantized output; this can decrease the FPS of the network. Setting rank to 0
introduces no layer, and the loss of bits will be delegated to the output.
Example commands:
# This will enable the split of conv3 into two layers to not lose precision by a negative exponent >= 1
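# a hypothetical command matching the comment above; the sub-command arguments are assumed
model_optimization_config(negative_exponent, layers=conv3, rank=1)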
Note: This operation does modify the structure of the model’s graph
Parameters:
globals
Model configuration options used during quantization that don't fit anywhere else.
Example command:
Parameters:
calibration
During the quantization process, the model will be inferred with a small dataset for calibration purposes. The calibration
can be configured here. (This replaces the calib_num_batch and batch_size arguments of the quantize() API.)
Example command:
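# a sketch; the parameter names are assumed from the calibration semantics described above
model_optimization_config(calibration, batch_size=8, calibset_size=1024)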
Parameters:
checker_cfg
Checker Config will generate information about the quantization process using the layer analysis tool.
Example commands:
Note: This operation does not modify the structure of the model’s graph
Parameters:
quantization_param
quantization_param(<layer>, <parameter>=<value>)
For example
quantization_param(conv1, bias_mode=double_scale_initialization)
Multiple parameters can be assigned at once, by simply adding more parameter-value couples, for example:
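# a sketch combining two parameters documented in this section
quantization_param(conv1, bias_mode=double_scale_initialization, precision_mode=a8_w4)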
Glob syntax is also supported to change multiple layers at the same time. For example, to change all layers whose
name starts with conv, use:
quantization_param({conv*}, bias_mode=double_scale_initialization)
1. bias_mode
2. precision_mode
3. quantization_groups
4. force_range_out
5. max_elementwise_feed_repeat
6. max_bias_feed_repeat
7. null_channels_cutoff_factor
8. output_encoding_vector
9. gpu_policy
bias_mode
Sets the layer's bias behavior. There are 2 available bias modes. The modes are:
Some layers are 16-bit by default (for example, Depthwise), while others are not. Switching a layer to 16-bit, while
improving quantization, can have a slightly adverse effect on allocation. If a network exhibits degradation due to
quantization, it is strongly recommended to set this parameter for all layers with biases.
All layers that have weights and biases support the double_scale_initialization mode.
Example command:
quantization_param(conv3, bias_mode=double_scale_initialization)
Changed in version 2.8: This parameter was named use_16bit_bias. This name is now deprecated.
Changed in version 3.3: double_scale_initialization is now the default bias mode for multiple layers.
precision_mode
Precision mode sets the bits available for the layers’ weights and activation representation. There are three precision
modes that could be set on the model layers using a model script command:
• a8_w8 - which means 8-bit activations and 8-bit weights. (This is the default)
• a8_w4 - which means 8-bit activations and 4-bit weights. Can be used to reduce memory consumption. Sup-
ported on all layers that have weights. Compression levels automatically assign 4-bit to layers in the model,
according to the level.
• a16_w16 - sets 16-bit activations and weights to improve accuracy results. Supported in three cases:
– On any output node (output_layer_X)
– On the full model, in case all its layers are supported (Hailo-8 family only)
Example commands:
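These mirror command forms shown earlier in this guide:
quantization_param(conv1, precision_mode=a8_w4)
quantization_param(output_layer1, precision_mode=a16_w16)
The layers that support 16-bit precision include: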
• Activations
• Average Pooling
• Concat
• Const Input
• Convolution
• Deconvolution
• Depth to Space
• Depthwise Convolution
• External Padding
• Feature Shuffle
• Feature Split
• Fully Connected (dense) [its output(s) must also be 16-bit, or model output layers]
• Max Pooling
• Normalization
• Output Layer
• Reduce Max*
• Reduce Sum*
• Resize*
• Reshape
• Shortcut
• Slice
• Space to Depth
Note: Layers with (*) are supported as long as they are not part of a Softmax chain.
max_bias_feed_repeat
The range is 1-32 (integer only) and the default value is 32.
This parameter determines the precision of the biases. A lower number will result in higher throughput at the cost
of reduced precision. This parameter can be switched to 1 for all or some layers, in order to see if higher throughput
can be achieved. If this results in high quantization degradation, the source of the degradation should be examined
and this parameter should be increased for that layer.
This parameter is not applicable for layers that use the double_scale_initialization bias mode.
Example command:
quantization_param(conv5, max_bias_feed_repeat=1)
quantization_groups
This parameter allows splitting the weights of a layer into groups and quantizing each separately for greater accuracy.
When using this command, the weights of layers with more than one quantization group are automatically sorted to
improve accuracy.
Using more than one group is supported only by Conv and Dense layers (not by Depthwise or Deconv layers). In
addition, it is not supported if the layer is of the conv-and-add kind or is the last layer of the model (or one of the last
layers if there are multiple outputs).
Example command:
quantization_param(conv1, quantization_groups=4)
force_range_out
This command forces the specified range on the output of the given layers in the quantization process.
The expected value for this parameter is a pair of floats [min, max], where min <= 0, max >= 0, and min < max; zero
must be within the specified range.
Example command:
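# a sketch; forcing a [0, 1] output range (e.g. for a sigmoid output) is shown for illustration
quantization_param(conv1, force_range_out=[0.0, 1.0])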
max_elementwise_feed_repeat
This command is applicable only for conv-and-add layers. The range is 1-4 (integer only) and the default value is 4.
This parameter determines the precision of the elements in the “add” input of the conv-and-add. A lower number will
result in higher throughput at the cost of reduced precision. For networks with many conv-and-add operations, it is
recommended to switch this parameter to 1 for all conv-and-add layers, to determine if it’s possible to achieve higher
throughput. If this results in high quantization degradation, the source of the degradation should be examined and
this parameter should be increased for that layer.
Example command:
quantization_param(conv5, max_elementwise_feed_repeat=1)
null_channels_cutoff_factor
This command is applicable only for layers with fused batch normalization. The default value is 1e-4.
This is used to zero out the weights of the so-called "dead channels". These are channels whose variance is below a
certain threshold. The low variance is usually a result of the activation function eliminating the results of the layer
(for example, a ReLU activation that zeros negative inputs). The weights are zeroed out to avoid outliers that shift
the dynamic range of the quantization but do not contribute to the results of the network. The variance threshold is
defined by null_channels_cutoff_factor * bn_epsilon, where bn_epsilon is the epsilon from the fused batch
normalization of this layer.
Example command:
quantization_param(conv4, null_channels_cutoff_factor=1e-2)
output_encoding_vector
This command changes the last layer's output format to include a different multiplicative scale for each feature. It
raises the accuracy of the model in some cases, at the expense of slightly higher CPU utilization, since the output
tensor has to be multiplied by a different factor per feature when converting the model outputs back from uint8 or
uint16 to floating-point precision (a.k.a. dequantization).
This command mostly helps when channels with different ranges are concatenated together (for example, some
features represent class, and others represent scores).
Example command:
model_optimization_config(globals, output_encoding_vector=enabled)
allocator_param(enable_muxer=False)
gpu_policy
To enhance your model’s inference capabilities, our client runner supports utilizing multiple GPUs. This functionality
accommodates diverse computational needs by offering four distinct modes of operation: auto, data_parallelization,
model_parallelization, and single. Here’s a succinct overview of each mode:
• data_parallelization: This configuration operates in a data-parallel manner. It deploys
the same model across multiple GPUs, with each GPU handling a different segment of the data. The batch size
remains consistent across all GPUs, ensuring parallel processing of data batches.
• model_parallelization: In contrast to the data parallelization approach, the model parallelization mode employs
a model parallel strategy. The model is segmented and distributed across multiple GPUs, with different parts
of the model running on different GPUs. This setup is beneficial for large models that exceed the memory
capacity of a single GPU.
• single: Designed for simplicity, the single mode confines the inference process to just one GPU, regardless of
the number of available GPUs. This mode is particularly useful for tasks that do not require extensive parallel
processing capabilities.
• auto: Default configuration. As the default setting, the auto mode intelligently determines the optimal GPU
usage strategy based on the available resources and the specific requirements of the task at hand. It defaults
to data_parallelization mode if the conditions allow, offering a balance between performance and resource
utilization.
Example command:
runner = ClientRunner(har=model_path)
with runner.infer_context(InferenceContext.SDK_NATIVE, gpu_policy=DistributionStrategy.DATA_P) as ctx:
    output = runner.infer(ctx, input_data)  # assumed body, mirroring the earlier inference example
pre_quantization_optimization
All the features of this command optimize the model before the quantization process. Some of these commands
modify the model structure, and occur before the rest of the commands.
• dead_channels_removal
• zero_static_channels
• zero_static_channels per-layer
• se_optimization
• equalization
• equalization per-layer
• dead_layers_removal
• weights_clipping
• activation_clipping
• ew_add_fusing
• layer_decomposition
• smart_softmax_stats
• defuse
• resolution_reduction
• resolution_reduction per-layer
• global_avgpool_reduction
• add_shortcut_layer
• matmul_correction
• matmul_equalization
• matmul_decomposition
• switch_concat_with_add
• split_ew_mult_by_bit_significance
• use_prequantized_weights
• split_fused_activation
• quarot
dead_channels_removal
Dead channels removal is channel pruning, which removes from the model any channel with both null weights and
null activation output. This might reduce memory consumption and improve inference time.
Example commands:
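# a sketch following the policy form used by other sub-commands in this section
pre_quantization_optimization(dead_channels_removal, policy=enabled)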
Note: This operation will modify the structure of the model’s graph
Parameters:
zero_static_channels
Zero static channels will zero out the weights of channels that have zero variances to improve quantization.
Example commands:
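# a sketch following the same policy form
pre_quantization_optimization(zero_static_channels, policy=enabled)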
Note: This operation does not modify the structure of the model’s graph
Parameters:
zero_static_channels per-layer
This sub-command allows configuring the zero static channels behavior per layer.
Example commands:
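# a sketch; the layer name is illustrative
pre_quantization_optimization(zero_static_channels, layers=conv1, policy=disabled)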
Note:
• If the policy is allowed, the layer gets the behavior from the generic algorithm policy.
Parameters:
se_optimization
This feature can modify the Squeeze and Excite block to run more efficiently on the Hailo chip. A more detailed
explanation of the TSE algorithm can be found here https://arxiv.org/pdf/2107.02145.pdf
Example commands:
# Apply TSE to the first 3 S&E blocks with tile heights of 9 for the 1st block, 7 for the 2nd and 5 for the 3rd
# Apply TSE to S&E blocks that start with the avgpool1 and avgpool2 layers, with tile heights of 7 and 5 respectively
Note: This operation will modify the structure of the model’s graph
Parameters:
equalization
This sub-command allows configuring the global equalization behavior during the pre-quantization process; this com-
mand replaces the old equalize parameter of the quantize() API.
Example command:
pre_quantization_optimization(equalization, policy=disabled)
Parameters:
equalization per-layer
This sub-command allows configuring the equalization behavior per layer. The allowed policy means the behavior derives
from the algorithm config.
Example commands:
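# a sketch; the layer name is illustrative
pre_quantization_optimization(equalization, layers=conv1, policy=disabled)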
Note:
• Enabling 1 layer won’t enable the related layers (it has to be done manually)
Parameters:
dead_layers_removal
Example command:
pre_quantization_optimization(dead_layers_removal, policy=disabled)
Parameters:
weights_clipping
By default, the model optimization does not clip layers' weights during quantization. This command allows changing this
behavior for selected layers and applying weights clipping when running the quantization API. This command may be
useful in order to decrease quantization-related degradation in case of outlier weight values. It is only applicable to
layers that have weights.
• disabled mode doesn’t take clipping values, and disables any weights clipping mode previously set to the
layer.
• mmse_if4b similar to mmse, when the layer uses 4bit weights, and disables clipping when it uses 8-bit
weights. (This is the default)
Example commands:
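# sketches using the two modes documented above; the layer names are illustrative
pre_quantization_optimization(weights_clipping, layers={conv*}, mode=mmse_if4b)
pre_quantization_optimization(weights_clipping, layers=conv7, mode=disabled)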
Note: The dynamic range of the weights is symmetric even if the clipping values are not symmetric.
Parameters:
activation_clipping
By default, the model optimization does not clip layers’ activations during quantization. This command can be used
to change this behavior for selected layers and apply activation clipping when running the quantization API. This
command may be useful in order to decrease quantization related degradation in case of outlier activation values.
• disabled mode doesn’t take clipping values, and disables any activation clipping mode previously set to the
layer (This is the default).
Note: Percentiles based activation clipping requires several iterations of statistics collection, so quantization might
take a longer time to finish.
Example commands:
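# a sketch assuming the percentile-based mode mentioned in the note above; the values are illustrative
pre_quantization_optimization(activation_clipping, layers={*}, mode=percentile, clipping_values=[0.01, 99.99])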
Parameters:
ew_add_fusing
When EW add fusing is enabled, ew_add layers will be fused into conv-and-add layers. Layers with incompatible
precision modes won't be fused.
Example commands:
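# a sketch following the policy form
pre_quantization_optimization(ew_add_fusing, policy=disabled)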
Parameters:
layer_decomposition
This sub-command allows toggling layers to decomposition mode, which means 16-bit layers will be implemented
with 8-bit layers.
Example commands:
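# a hypothetical per-layer invocation; the layer name is illustrative
pre_quantization_optimization(layer_decomposition, layers=conv5, policy=enabled)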
Parameters:
smart_softmax_stats
SmartSoftmaxConfig is an algorithm that collects the stats on a softmax block in an efficient way.
Example commands:
Parameters:
defuse
INPUT FEATURES
Defuses input features for a selected dense or conv layer to a selected number of splits. It can also be used to disable
defusing of a layer.
Example commands:
MHA
Allows defusing a multi-head attention block, represented by its first matmul, to a selected number of splits.
Example commands:
Parameters:
resolution_reduction
Reduces the model resolution in all input layers in order to optimize the model more efficiently. Marginally affects
accuracy. Not supported on models that contain Fully-connected, Matmul or Cross-correlation layers, or when the
resolution is too small.
Example commands:
# This will enable the algorithm, optimizing over an input shape of [128, 128]
pre_quantization_optimization(resolution_reduction, shape=[128, 128])
Note: This operation doesn’t modify the structure of the model’s graph
Parameters:
resolution_reduction per-layer
Sub-command for configuring resolution reduction per input layer, affecting its connected component. Reduces the
resolution in order to optimize more efficiently. Marginally affects accuracy. Not supported on components containing
Fully-connected, Matmul or Cross-correlation layers, or when the resolution is too small.
Example commands:
# This will enable the algorithm for the input_layer1 connected component, optimizing over an input shape of [128, 128]
pre_quantization_optimization(resolution_reduction, layers=input_layer1, shape=[128, 128])
Note: This operation doesn’t modify the structure of the model’s graph
Parameters:
global_avgpool_reduction
This command allows reducing the spatial dimensions for global avgpool layers using an additional avgpool layer. The
kernel size of the added avgpool layer will be [1, h // division_factors[0], w // division_factors[1], 1].
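A hypothetical invocation; the layer name and factors are illustrative, assuming the division_factors argument named above:
pre_quantization_optimization(global_avgpool_reduction, layers=avgpool1, division_factors=[2, 2])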
Parameters:
add_shortcut_layer
Adds an activation layer between "layer" and "target" and removes the original edge between them. The activation is
linear by default.
Before: layer -> target. After: layer -> act -> target.
Example commands:
Parameters:
matmul_correction
Parameters:
matmul_equalization
Parameters:
matmul_decomposition
This sub-command allows toggling Matmul layers to decomposition mode, which means 16-bit layers will be imple-
mented with 8-bit layers.
Example commands:
Parameters:
switch_concat_with_add
If there is a concat followed by a conv, this feature converts the concat and the conv to 2 convs with an ew-add between
them.
Example commands:
pre_quantization_optimization(switch_concat_with_add, layers=concat1, policy=enabled)
pre_quantization_optimization(switch_concat_with_add, layers={concat*}, policy=enabled)
Note:
Parameters:
split_ew_mult_by_bit_significance
This command allows splitting element-wise multiplication layers by bit significance to allow higher precision.
pre_quantization_optimization(split_ew_mult_by_bit_significance, layers=ew_mult1, num_splits=2)
Parameters:
use_prequantized_weights
Parameters:
split_fused_activation
Example commands:
# This will split the activation fused on conv1 into a conv1 layer with linear activation and a standalone activation layer.
pre_quantization_optimization(split_fused_activation, layers=conv1, policy=enabled)
Parameters:
quarot
Parameters:
post_quantization_optimization
All the features of this command optimize the model after the quantization process.
post_quantization_optimization(<feature>, <**kwargs>)
• bias_correction
• bias_correction per-layer
• train_encoding
• finetune
• adaround
• adaround per-layer
• mix_precision_search
bias_correction
This sub-command allows configuring the global bias correction behavior during the post-quantization process; this
command replaces the old ibc parameter of the quantize() API.
Example command:
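# a sketch matching the per-layer form shown below
post_quantization_optimization(bias_correction, policy=enabled)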
Note: Bias correction is recommended when the model contains small kernels or depth-wise layers
Parameters:
bias_correction per-layer
This sub-command allows enabling or disabling the Iterative Bias Correction (IBC) algorithm on a per-layer basis.
Allowed policy means the behavior derives from the algorithm config
Example commands:
# This will disable IBC for conv layers and enable for the other layers
post_quantization_optimization(bias_correction, policy=enabled)
post_quantization_optimization(bias_correction, layers={conv*}, policy=disabled)
Parameters:
train_encoding
Parameters:
Parameters (cont.):
Advanced parameters:
finetune
This sub-command enables knowledge-distillation-based fine-tuning of the quantized graph.
Example commands:
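# a sketch; parameters beyond policy are illustrative
post_quantization_optimization(finetune, policy=enabled, learning_rate=0.0001)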
Parameters:
Parameters (cont.):
Advanced parameters:
adaround
The Adaround algorithm optimizes layers' quantization by training the rounding of the kernel layer-by-layer. To enable it,
use a high optimization_level (>=3), or use the explicit command:
post_quantization_optimization(adaround, policy=enabled)
It is used by the highest optimization level to recover any degradation caused by quantization; as such, it is time
consuming and requires a strong system to run.
– DALI is an external package used by the AdaRound algorithm to accelerate the running time
(see the warning raised during the run for more information).
– Lowering the batch size can reduce RAM consumption but will increase the running time
(default is 32).
– Enabling compression on the disk reduces disk space usage at the expense of increased running time
(default is disabled).
– Using a smaller dataset for Adaround reduces memory consumption but might affect the final
accuracy (default is 1024).
– Disabling bias training can help reduce running time but might affect the final accuracy (default is true).
– Reducing the number of epochs can help reduce the running time of the algorithm but might affect
the final accuracy (default is 320).
Parameters:
Parameters (cont.):
Advanced parameters:
adaround per-layer
This sub-command allows toggling layers in the adaround algorithm individually.
Example commands:
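# a sketch; the layer name is illustrative
post_quantization_optimization(adaround, layers=conv1, policy=disabled)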
Parameters:
mix_precision_search
This algorithm aims to identify the optimal precision configuration for a model by utilizing the signal to noise ratio
(SNR). SNR quantifies the extent to which a signal is corrupted by noise. In this context, it aids in determining the trade-
off between the compression applied on operations and the error (or noise) introduced as a result of this compression.
Parameters:
Calling compile() compiles the model without loading it to the device, returning a binary that contains the com-
piled model: a HEF file.
Note: The default compilation target is Hailo-8. To compile for a different architecture (Hailo-8R, for example), use
hw_arch='hailo8r' as a parameter in the translation phase. For an example, see the tutorial referenced in the
next note. Hailo-15 uses hw_arch='hailo15h' and Hailo-10 uses hw_arch='hailo10h'.
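For orientation, a minimal sketch of the compilation step (the file paths are placeholders):
from hailo_sdk_client import ClientRunner

runner = ClientRunner(har='model_quantized.har')  # runner holding a quantized model
hef = runner.compile()                            # returns the HEF binary
with open('model.hef', 'wb') as f:
    f.write(hef)                                  # save the HEF for deployment with HailoRT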
See also:
After compiling a model that originated from an ONNX model, as described in the previous section, one may choose
to extract a new ONNX model that contains the entire network of the original model, with the nodes delimited by
the start and end node arguments replaced by the compiled HEF, by calling get_hailo_runtime_model().
This is required if you wish to run inference using OnnxRT with HailoRT.
– Post-processing
• The start_nodes will completely separate the pre-processing from the Main model. No connections from the
pre-processing are allowed into the main model, unless they are marked as start_nodes.
• The end_nodes need to separate the main model from the post-processing completely.
get_hailo_runtime_model() returns an ONNX model, which you can either pass directly to an ONNXRT
session, or first save to a file and then load into a session.
hef = runner.compile() # the returned HEF is not needed when working with ONNXRT
onnx_model = runner.get_hailo_runtime_model() # only possible on a compiled model
onnx.save(onnx_model, onnx_file_path) # save the model to a file
See also:
The Parsing Tutorial shows how to load a network from an existing model and set the start and end node argu-
ments.
Changed in version 3.9: Added context switch support using an allocation script command. The context switch mech-
anism allows running a big model by automatically switching between several contexts that together constitute the full
model.
First, to get a runner loaded with compiled model, use one of the options: calling compile(), loading a compiled
HAR using load_har(), or setting the HEF using hef().
To run inference on the model, enter the context manager infer_context() and call infer() to get the
results.
Note: Inference using the TensorFlow inference is not yet supported on the Hailo-15 platform.
The compilation related model script commands affect the resources allocation stage of the compilation. Except for
the recommended Performance Mode command, most of these commands are for advanced and edge cases, as the
Dataflow Compiler already maximizes the performance by taking many factors into account.
Note: This section uses terminology that is related to the Hailo neural core. Full description of the core architecture
is not in the scope of this guide.
Usage
The script is a separate file which can be given to the load_model_script() method of the ClientRunner
class.
For example:
client_runner.load_model_script('x.alls')
compiled_model = client_runner.compile()
After the compilation process, in addition to the binary .hef file, the compiled HAR (Hailo ARchive) file is created. This
HAR file contains the final compilation results, as well as the automatic model script (.auto.alls) file, that contains
the exact instructions for the compiler for creating the same binary file (for the specific Dataflow Compiler version).
This model script can be used to compile the model again (from the corresponding quantized HAR file), for a quick
compilation.
Extraction of the automatic model script out of the compiled HAR file is done with the command:
Definition
context_switch_param(param=value)
Example
context_switch_param(mode=enabled)
Description This command modifies context switch policy and sets several parameters related to it:
• mode – Context switch mode. Set to enabled to enable context switch: Automatic partition of the
given model to several contexts will be applied. Set to disabled to disable context switch. Set to
allowed to let the compiler decide if multi context is required. Defaults to allowed.
• allow_auto_merge_in_multicontext – Set to True to allow auto-merging of layers in
multi-context mode. Defaults to False. Should be used in conjunction with Performance Mode or re-
sources_params set to higher utilization.
Allocator Parameters
Definition
allocator_param(param=value)
Example
allocator_param(automatic_ddr=False)
• timeout – Compilation timeout for the whole run. By default, the timeout is calculated dynamically
based on the model size. The timeout is in seconds by default. It can be given a postfix of 's', 'm', or 'h' for
seconds, minutes or hours respectively, e.g. timeout=3h will result in a 3-hour timeout.
• automatic_ddr – when enabled, DDR portals that buffer data in the host’s RAM over PCIe are added
automatically when required. DDR portals are added when the data needed to be buffered on some
network edge exceeds a threshold. In addition, DDR portal is added only when there are enough re-
sources on-chip to accommodate it. Defaults to True. Set to False to ensure the HEF compatibility to
platforms that don’t support it, such as Ethernet based platforms.
• enable_lcu_ecc – When enabled, ECC calculation is enabled. To reduce power, disable the calcula-
tion.
Definition
resources_param(param=value)
Example
context0.resources_param(max_utilization=0.25)
Description This sets several resources calculation flow parameters described below.
• strategy – Resources calculation strategy. When set to greedy, more resources are added
to the slowest layers iteratively (maximum-FPS search), to reach the highest possible network
FPS (per context). Defaults to greedy.
• max_utilization – Number between 0.0 and 1.0. Threshold for greedy strategy.
Maximum-FPS search will be stopped when on-chip utilization of any resource (control, com-
pute, memory) exceeds the given threshold. The parameter overrides default thresholds but
not the user provided thresholds specified above.
Two formats are supported – the first one affects all contexts, and the second one only affects the
chosen context (see example #2).
Place
Definition
place(cluster_number, layers)
Example
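# a hypothetical example; the cluster number and layer name are placeholders
place(2, conv4)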
Description This points the allocator to place layers in a specific cluster_number. Layers which are not in-
cluded in any place command will be assigned to a cluster by the Allocator automatically.
Shortcut
Definition
shortcut(layer_from, layers_to)
Examples
Description This command adds a shortcut layer between directly connected layers. The layers_to parameter
can be a single layer or a list of layers. The shortcut layer copies its input to its output.
Portal
Definition
portal(layer_from, layer_to)
Example
Description This command adds a portal layer between two directly connected layers. When two layers are con-
nected using a portal, the data from the source layer leaves the cluster before it gets back in and reaches the
target layer. The main use case for this command is to solve edge cases when two layers are manually placed in
the same cluster. When two layers are in different clusters, there is no need to manually add a portal between
them.
L4 Portal
Definition
l4_portal(layer_from, layer_to)
Example
Description This command adds a L4-portal layer between two directly connected layers. This command is essen-
tially the same as portal, with the key difference that the data will be buffered in L4 memory, as opposed to
a regular portal which buffers the data in L3 memory. The main use case for this command is when a large
amount of data needs to be buffered between two endpoints, and it is required for this data to be buffered in
another memory hierarchy.
DDR Portal
Definition
ddr(layer_from, layer_to)
Example
Description This command adds a DDR portal layer between two directly connected layers. This command is essen-
tially the same as portal, with the key difference that the data will be buffered in the host, as opposed to a
regular portal which buffers the data in on-chip memory. Note that this command is supported only in HEF
compilations and will work only on supported platforms (i.e. when using the PCIe interface).
Concatenation
Definition
concat(layers_from, layer_to)
Example
Description Add a concat layer between several input layers and an output layer. This command is used to split
a “large” concat layer into several steps (For example, three concat layers with two inputs instead of a single
concat layer with four inputs).
Note: For now this command only supports two input layers (in the argument layers_from).
De-fuse
Definition
Examples
Description Defusing splits a logical layer into multiple physical layers in order to increase performance. This com-
mand orders the Allocator to defuse the given layer to defuse_number physical layers that share the same
job, plus an additional concat layer that merges all outputs together (and an input feature splitter in the case of a
feature splitter). Like most mechanisms, the defuse mechanism happens automatically, so no user intervention
is required.
• Feature defuse: Each physical layer calculates part of the output features. Supported layers: Conv, De-
conv, Maxpool, Depthwise conv, Avgpool, Dense, Bilinear resize, NN resize.
• Spatial defuse: Each physical layer calculates part of the output columns. Supported layers: Conv, Deconv,
Depthwise conv, Avgpool, Argmax, NN resize.
• Input features defuse: Each physical layer receives a part of the input features. Supported layers: Max-
pool, Depthwise conv, Avgpool, NN resize, Bilinear resize.
For Feature defuse, don’t use the defuse_type argument (see examples).
Merge
Definition
merge(layer1, layer2)
Examples
Description Merging is a mechanism that uses the same hardware resources to compute two layers. The FPS
of the merged layer will be lower than that of the two original layers, but unless it is a bottleneck layer, merging could
save resources and result in a higher total FPS. It is supported for a subset of layers and connectivity types. Auto-
matic merging of layers is performed on a single context when needed, and can be affected with the alloca-
tor_param(merge_min_layer_utilization) command.
Compilation Parameters
Definition
compilation_param(layer, param=value)
Example
compilation_param(conv1_d0, resources_allocation_strategy=manual_scs_selection, number_of_subclusters=8, use_16x4_sc=enabled)
Description This will update the given layer’s compilation param. The command in the example sets the number of
subclusters of a specific layer to 8. In addition, it forces 16x4 mode, which means that each subcluster handles
16 columns and 4 output features at once. This is instead of the default of 8 and 8 respectively.
• use_16x4_sc – uses 16-pixel by 4-feature multiplication instead of the default 8 pixels by 8
features. This is useful when the number of features is smaller than 8. A table of supported layers is
given below (layers that are not mentioned are not supported).
• no_contexts – change to True in order to accumulate all the needed inputs for each output row
computation in the L3 memory. A table of supported layers is given below (layers that are not mentioned
are not supported).
• number_of_subclusters – force the usage of a specific number of subclusters. Make sure the
resource allocation strategy value is set to manual_scs_selection. This is only
applicable to Conv and Dense layers.
• fps – force a layer to reach this throughput, possibly higher than the FPS used for the rest of the model.
This parameter is useful to reduce the model’s latency, however it is not likely to contribute to the model’s
throughput which is dominated by the bottleneck layer.
compilation_param({conv*}, resources_allocation_strategy=min_scs_match_fps)
will change the resources allocation strategy of all the layers that start with conv.
Kernel type | Kernel size (HxW)            | Stride (HxW)       | Dilation (HxW)                         | Padding
Conv        | 1x1                          | 1x1                | 1x1                                    | SAME, SAME_TENSORFLOW, VALID
Conv        | 3x3                          | 1x1, 2x1           | 1x1; 2x2, 3x3, 4x4 (stride=1x1 only)   | SAME, SAME_TENSORFLOW
Conv        | 5x5                          | 1x1, 2x1           | 1x1                                    | SAME, SAME_TENSORFLOW
Conv        | 7x7                          | 1x1, 1x2, 2x1, 2x2 | 1x1                                    | SAME, SAME_TENSORFLOW
Conv        | 1x3, 1x5, 1x7                | 1x1                | 1x1                                    | SAME, SAME_TENSORFLOW
Conv        | 3x5, 3x7, 5x3, 5x7, 7x3, 7x5 | 1x1                | 1x1                                    | SAME, SAME_TENSORFLOW
Conv        | 3x4, 5x4, 7x4, 9x4           | 1x1                | 1x1                                    | SAME, SAME_TENSORFLOW
Conv        | 3x6, 5x6, 7x6, 9x6           | 1x1                | 1x1                                    | SAME, SAME_TENSORFLOW
Conv        | 3x8, 5x8, 7x8, 9x8           | 1x1                | 1x1                                    | SAME, SAME_TENSORFLOW
Conv        | 9x9                          | 1x1                | 1x1                                    | SAME, SAME_TENSORFLOW
DW          | 3x3                          | 1x1                | 1x1, 2x2                               | SAME, SAME_TENSORFLOW
DW          | 5x5                          | 1x1                | 1x1                                    | SAME, SAME_TENSORFLOW
HEF Parameters
Definition
hef_param(should_use_sequencer=value, params_load_time_compression=value)
Example
hef_param(should_use_sequencer=True, params_load_time_compression=True)
Description This will configure the HEF build. The command in the example enables the use of the Sequencer and weights compression for optimized device configuration.
• should_use_sequencer – Using the Sequencer allows faster configuration loading to the device over PCIe during network activation, but removes Ethernet support for the created HEF. Defaults to True.
Outputs Multiplexing
Definition
output_mux(layers)
Example
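A hypothetical example, assuming two output layers named conv14 and conv20 (the names and the list form are illustrative, following the output_mux(layers) definition above):
output_mux([conv14, conv20])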
Description The outputs of the given layers will be multiplexed into a single tensor before being sent back from the device to the host. Unlike concat layers, output mux inputs do not have to share the same width, height, or numerical scale.
From TF
Definition
layer = from_tf(original_name)
Example
my_conv = from_tf('conv1/BiasAdd')
Description This command allows the use of the original (TF/ONNX) layer name, in order to make sure that the correct layers are addressed, as the HN layer names and the original layer names differ.
Note: Despite its name, this command supports original names from both TF and ONNX.
Buffers
Definition
Example
Description This command sets the size of the inter-layer buffer in units of layer_from’s output rows. Two
variants are supported. The first variant sets the total number of rows to buffer. The second variant sets two
such buffer sizes, in case the compiler adds a cluster transition between these layers. The first size sets the
number of rows to buffer before the cluster transition, and the second number sets the number of rows after
the transition. If there is no cluster transition, only the first number is used. The second variant is mainly used
in autogenerated scripts returned by save_autogen_allocation_script().
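The Definition and Example entries are missing above; based on the description, the two variants presumably take the following forms (layer names and row counts are illustrative):
buffers(conv1, conv2, 4)
buffers(conv1, conv2, 4, 6)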
Feature Splitter
Definition
feature_splitter(layer_from, layers_to)
Example
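A hypothetical example, assuming an existing feature splitter named fs1 and two of its successor layers (all names illustrative):
feature_splitter(fs1, [conv3, conv4])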
Description Add a feature splitter layer between an existing feature splitter layer and some of its outputs. This
command is used to break up a “large” feature splitter layer with many outputs into several steps.
Shape Splitter
Definition
Example
Description Add a shape splitter layer between an existing shape splitter layer and some of its outputs. This com-
mand is used to break up a “large” shape splitter layer with many outputs into several steps.
Platform Param
Definition
platform_param(param=value)
Examples
platform_param(targets=[ethernet])
platform_param(hints=[low_pcie_bandwidth])
Description This sets several parameters regarding the platform hosting Hailo, as described below:
• targets – a list or a single value of hosting-target restrictions, such as Ethernet, which requires disabling a set of features.
• hints – a list of hints or a single hint about the hosting platform, such as Low PCIe bandwidth, which optimizes performance for specific scenarios.
Currently supported hints: low_pcie_bandwidth – adjusts the compiler to reduce the PCIe bandwidth by disabling or changing decision thresholds regarding when PCIe should be used.
Performance Param
Definition
performance_param(compiler_optimization_level=max)
Description Setting this parameter enters performance mode, in which the compiler tries as hard as it can to find a solution that fits in a single context, with the highest performance. This method of compilation requires a significantly longer time to complete, because the compiler tries to use very high utilization levels that might not allocate successfully. If allocation fails, it automatically tries lower utilization levels, until it finds the highest possible utilization.
• compiler_optimization_level – supports 0, 1 (default), 2, and max. 0 returns the first feasible solution found; 1 returns the best solution under default utilization; 2 (or max) exhaustively searches for the best utilization.
Remove Node
Definition
remove_node(layer_name)
Example
remove_node(conv1)
Description Removes a layer from the network. This command is useful for removing layers that appear in the HN but are not needed. It should be used internally only, and with caution.
5.5. Supported Layers
The following section describes the layers and parameter ranges that the Dataflow Compiler supports. However, Parser support (translating the original model to Hailo's internal representation) varies across frameworks, so a layer that is supported might not be supported on all frameworks, and vice versa. Therefore, please also refer to the summary tables, which detail all layers and their corresponding APIs in PyTorch/ONNX and TensorFlow.
5.5.1. Convolution
Convolution layers are supported with any integer values of kernel size, stride, dilation. Padding types supported are:
VALID, SAME, and SAME_TENSORFLOW. The following table displays the current optimized params.
‘W’ refers to the width of the layer’s input tensor; in this case, the kernel width is equal to the image width.
Note: Convolution kernel with elementwise addition supports the addition of two tensors only.
1. Models that contain Conv3D layer must have rank-4 input and output (4 dimensions at most), so the Conv3D
layer must reside inside a “2D” model.
2. The input to the first 3D Conv needs to be created using a Concat layer on the Disparity dimension (after
Unsqueeze).
3. The last Conv3D in the chain must have output_features = 1 (HxWxDx1), followed by a Squeeze operation, then
a Conv2D or a Resize layer.
Note: Number of weights per layer <= 8MB (for all Conv layers).
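A minimal PyTorch sketch of a Conv3D chain satisfying the three constraints above (all names, shapes, and channel counts are illustrative):

import torch
import torch.nn as nn

# Two rank-4 feature maps (batch, channels, H, W), e.g. from a stereo pair.
left = torch.randn(1, 16, 32, 32)
right = torch.randn(1, 16, 32, 32)

# Constraint 2: build the disparity dimension with Unsqueeze + Concat.
x = torch.cat([left.unsqueeze(2), right.unsqueeze(2)], dim=2)  # (1, 16, 2, 32, 32)

conv3d_1 = nn.Conv3d(16, 8, kernel_size=3, padding=1)
conv3d_2 = nn.Conv3d(8, 1, kernel_size=3, padding=1)  # Constraint 3: output_features = 1
x = conv3d_2(conv3d_1(x))                             # (1, 1, 2, 32, 32)
x = x.squeeze(1)                                      # Squeeze the features dim -> (1, 2, 32, 32)
x = nn.Conv2d(2, 8, kernel_size=3, padding=1)(x)      # followed by a Conv2D (Constraints 1 and 3)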
“Any other” means any kernel size or stride between 2 and the tensor’s dimensions, for example 2 ≤ kh ≤ H where
kh is the kernel height and H is the height of the layer’s input tensor.
5.5.3. Dense
The Dense kernel is supported only after a Dense layer, a Conv layer, a Max Pooling layer, or a Global Average Pooling layer, or as the first layer of the network.
When a Dense layer follows a Conv or a Max Pooling layer, the data is reshaped into a single vector. The height of the reshaped image in this case is limited to 255 rows.
5.5.4. Average Pooling
Average Pooling layers are supported with any integer values of kernel size, stride, and dilation. Padding types supported are: VALID, SAME, and SAME_TENSORFLOW. The following table displays the current optimized params.
‘W’ means the width of the layer’s input tensor; in other words, in this case the kernel width equals the image width. ‘h’ means any height, from 1 up to the input tensor height.
5.5.5. Concat
This layer requires 4-dimensional input tensors (batch, height, width, features), and concatenates them in the features
dimension. It supports up to 4 inputs.
5.5.6. Deconvolution
5.5.7. Depthwise Convolution
Depthwise Convolution layers are supported with any integer values of kernel size, stride, and dilation. Padding types supported are: VALID, SAME, and SAME_TENSORFLOW. A Depthwise 1x1 stride 1x1 kernel with elementwise addition supports the addition of two tensors only.
For Conv 1x1/1, 1x1/2, 3x3/1, and 7x7/2, any number of output features is supported. For all other supported Conv
kernels, only OF%8=0 or OF<8 is supported, where OF is the number of output features in each group.
Group Deconvolution is supported with all supported Deconvolution kernels. Only OF%8=0 or OF<8 is supported,
where OF is the number of output features in each group.
2. Two tensors with the same batch and spatial dimensions, one tensor has features dimension 1.
3. Two tensors with the same batch and feature dimensions, one of them has spatial dimension [1, 1].
4. Two tensors with the same batch dimension, one of them has feature and spatial dimension [1, 1, 1].
Note: The resize layer can broadcast a tensor from (batch, 1, 1, F) to (batch, height, width, F), where F is the number
of features. This may be useful before the Elementwise Multiplication layer.
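For illustration, a hedged PyTorch sketch of the shape combinations in cases 2–4 above (the guide uses a channels-last (batch, height, width, features) layout; the tensors are random placeholders):

import torch

a = torch.randn(1, 8, 8, 16)  # (batch, height, width, features)
b = torch.randn(1, 8, 8, 1)   # case 2: features dimension is 1
c = torch.randn(1, 1, 1, 16)  # case 3: spatial dimensions are [1, 1]
d = torch.randn(1, 1, 1, 1)   # case 4: feature and spatial dimensions are [1, 1, 1]

for other in (b, c, d):
    print((a * other).shape)  # broadcasting yields (1, 8, 8, 16) in all cases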
1. Bias addition after Conv, Deconv, Depthwise Conv and Dense layers. Bias addition is always fused into another
layer.
2. Elementwise addition and subtraction: When possible, elementwise add / sub is fused into a Conv layer as detailed above. Elementwise add / sub is supported on both “Conv like” and “Dense like” tensors, with shapes in the formats shown under Elementwise Multiplication and Division.
Input normalization is supported as the first layer of the network. It normalizes the data by subtracting the given
mean of each feature and dividing by the given standard deviation.
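For example, a typical model-script command for ImageNet-style RGB normalization (the mean/std values are illustrative):
normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])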
The Batch Normalization layer is supported. When possible, it is fused into another layer, such as Conv or Dense. Otherwise, it is a standalone layer.
Calculating Batch Normalization statistics at runtime using the Hailo device is not supported.
5.5.15. Resize
Two methods are supported: Nearest Neighbor (NN) and Bilinear. In both methods, the scaling of rows and columns
can be different.
1. When the column and row scales are floats (rows are also limited to <= 4096) and the new sizes are integers, where half_pixels and align_corners satisfy one of the following: align_corners=True & half_pixels=False, align_corners=False & half_pixels=True, or align_corners=False & half_pixels=False.
2. When the input shape is (batch, H, 1, F) and the output shape is (batch, rH, W, F). The number of features F stays the same and the height ratio r is an integer. This case is also known as “broadcasting” (NN only).
3. When the input shape is (batch, H, W, 1) and the output shape is (batch, H, W, F). The height H and the width W stay the same. This case is also known as “features broadcasting” (NN only).
Note: align_corners: If True, the centers of the 4 corner pixels of the input and output tensors are aligned, preserving the values at the corner pixels. See the definitions in PyTorch and TensorFlow.
half_pixel: Relevant for PyTorch / ONNX, as defined on the ONNX Operators page under coordinate_transformation_mode.
Depth to space rearranges data from depth (features) into blocks of spatial data.
Two modes are supported (see the ONNX operators spec for more info: https://github.com/onnx/onnx/blob/main/docs/Operators.md#depthtospace):
1. “DCR” mode – the default mode, where elements along the depth dimension of the input tensor are rearranged in the following order: depth, column, and then row.
2. “CRD” mode – elements along the depth dimension of the input tensor are rearranged in the following order: column, row, and then depth.
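A hedged NumPy sketch of the two modes, following the ONNX DepthToSpace reference (NCHW layout; bs is the block size):

import numpy as np

def depth_to_space(x, bs, mode="DCR"):
    b, c, h, w = x.shape
    if mode == "DCR":  # depth, column, row order
        t = x.reshape(b, bs, bs, c // (bs * bs), h, w).transpose(0, 3, 4, 1, 5, 2)
    else:              # "CRD": column, row, depth order
        t = x.reshape(b, c // (bs * bs), bs, bs, h, w).transpose(0, 1, 4, 2, 5, 3)
    return t.reshape(b, c // (bs * bs), h * bs, w * bs)

x = np.arange(32).reshape(1, 8, 2, 2)
print(depth_to_space(x, 2, "DCR").shape)  # (1, 2, 4, 4)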
Depth to space is only supported when IF % (BW · BH) = 0, where IF is the number of input features, BW is the width of the depth-to-space block, and BH is the height of the block.
Space to depth rearranges blocks of spatial data into the depth (features) dimension.
1. “Classic” variant – the inverse of the Depth to Space kernel. It is identical to Tensorflow’s space_to_depth operation. Supports an MxN block size, where M and N are integers.
2. “Focus” variant – supports the 2x2 block size only. Used by models such as YOLOv5 and YOLOP. It is defined by the following Tensorflow code:
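(The referenced code did not survive extraction; the snippet below is a reconstruction based on the standard 2x2 Focus slicing used by YOLOv5, so treat it as a sketch.)

import tensorflow as tf

def focus(x):
    # 2x2 space-to-depth ("Focus"): sample every other pixel into four groups
    # and stack them along the features dimension (NHWC layout).
    return tf.concat(
        [x[:, ::2, ::2, :],     # top-left pixels
         x[:, 1::2, ::2, :],    # bottom-left pixels
         x[:, ::2, 1::2, :],    # top-right pixels
         x[:, 1::2, 1::2, :]],  # bottom-right pixels
        axis=-1)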
5.5.18. Softmax
Softmax is supported in the following cases:
1. After a “Dense like” layer with output shape (batch, features). In this case, Softmax is applied to the whole tensor.
2. After another layer, if the input tensor of the Softmax layer has a single column (but multiple features). In this case, Softmax is applied row by row.
3. After another layer, even if it has multiple columns. In this case, Softmax is applied pixel by pixel on the features dimension. This case is implemented by breaking the Softmax layer into other layers.
5.5.19. LogSoftmax
5.5.20. Argmax
The Argmax kernel is supported if it is the last layer of the network, and the layer before it has a 4-dimensional output shape (batch, height, width, features).
Reduce Max is supported along the features dimension, and only if the layer before it has a 4-dimensional output shape (batch, height, width, features).
If the layer before it has a 4-dimensional output shape (batch, height, width, features), the Reduce Sum layer is supported along all axes. If the layer before it has a 2-dimensional output shape (batch, features), the Reduce Sum layer is supported along the features dimension.
Reduce Sum Square is supported along the features or the spatial dimensions.
The Feature shuffle kernel is supported if F % G = 0, where F is the number of features and G is the number of feature groups.
This layer requires 4-dimensional input tensors (batch, height, width, features), and splits the feature dimension into
sequential parts. Only static splitting is supported, i.e. the coordinates cannot be data dependent.
5.5.28. Slice
This layer requires 4-dimensional input tensors (batch, height, width, features), and crops a sequential part in each
coordinate in the height, width, and features dimensions. Only static cropping is supported, i.e., the coordinates
cannot be data dependent.
5.5.29. Reshape
“Conv like” to “Dense like” Reshape Reshaping from a Conv or Max Pooling output with shape (batch, height, W′, F′) to a Dense layer input with shape (batch, F), where F = W′ · F′.
“Dense like” to “Conv like” Reshape Reshaping a tensor from (batch, F) to (batch, 1, W′, F′), where F = W′ · F′ and F′ % 8 = 0.
Features to Columns Reshape Reshaping a tensor from (batch, height, 1, F) to (batch, height, W′, F′), where F = W′ · F′.
Transpose, on the other hand, permutes the order of the dimensions without changing the data.
This layer implements zero padding as a separate layer, to support custom padding schemes that are not one of the three schemes supported as part of other layers (VALID, SAME, and SAME_TENSORFLOW).
5.5.31. Matmul
This layer implements data-driven matrix multiplication X × Y = Z. Input sizes should obey matrix multiplication rules.
This layer is a major building block for Transformer models. It receives (K, Q, V) matrices, and implements the formula:
Softmax((Qi · Ki^T) / √dk) · Vi
where Qi, Ki, Vi are the matrices that result from multiplying the input matrices Q, K, V by the learned matrices Wi^Q, Wi^K, Wi^V respectively, and i ranges from 0 to #heads − 1. The per-head results are then concatenated and multiplied by a learned weights matrix W0.
# Keep the previous shape; this code is PyTorch, in which the input shape is channels-first.
b, in_channels, h, w = prev_output.shape
# Flatten the spatial dimensions (step implied by the permutes below):
# [b, channels, h, w] -> [b, channels, h*w]
x = prev_output.flatten(2)
# Transpose [b, channels, h*w] -> [b, h*w, channels] or [h*w, b, channels]
if self.batch_first:
    x = x.permute(0, 2, 1)
else:
    x = x.permute(2, 0, 1)
# self.q, self.k and self.v were defined as Linear transformations, for example:
# `self.v = nn.Linear(channels, self.d_v)`
# Transpose [b, h*w, channels] or [h*w, b, channels] -> [b, channels, h*w],
# then reshape [b, channels, h*w] -> [b, channels, h, w]
_, _, out_channels = mha_output.shape
if self.batch_first:
    unflattened = mha_output.permute(0, 2, 1).reshape(b, out_channels, h, w)
else:
    unflattened = mha_output.permute(1, 2, 0).reshape(b, out_channels, h, w)
RNNs (Recurrent Neural Networks) and LSTMs (Long Short Term Memory) are mainly used on sequential or time
series data. By using a feedback loop and an internal state, they utilize information from prior inputs to influence the
current output and update the state. The sequence length of an RNN or LSTM block is the number of past or future
inputs that affect the current one.
Since Hailo does not allow feedback loops, these layers are supported via the unrolling technique, which duplicates each RNN or LSTM block sequence-length times. Therefore, high sequence lengths (more than 10 for forward/backward, or 5 for bidirectional) may lead to performance degradation. A sketch of the idea is given after the list below.
• Forward: Current input utilizes information from previous inputs; Supported for RNN and LSTM.
• Bidirectional: Current input utilizes information from previous and future inputs; Supported for LSTM.
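A hedged PyTorch sketch of the unrolling idea (all sizes illustrative): instead of a loop that feeds the state back, the same cell is duplicated once per time step, so the resulting graph is purely feed-forward:

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=16, hidden_size=32)
seq = torch.randn(4, 1, 16)  # (seq_len, batch, input_size)

# Looped form (feedback; not mappable to the device):
h = torch.zeros(1, 32)
for t in range(seq.shape[0]):
    h = cell(seq[t], h)

# Unrolled form (what the compiler effectively builds): the cell appears
# sequence-length times as independent layers, chained through h0..h3.
h0 = cell(seq[0], torch.zeros(1, 32))
h1 = cell(seq[1], h0)
h2 = cell(seq[2], h1)
h3 = cell(seq[3], h2)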
5.5.34. Transpose
• Transpose of the Height <-> Width dimensions is supported only for tensors whose complete quantized size is smaller than 1.5 MB. This type of transpose is not optimal for performance, since it requires buffering the whole tensor, creating a “pipeline stop” that raises the latency of the model.
5.5.35. Activations
Note: Activations are usually fused into the layer before them, however they are also supported as standalone layers
when they can’t be fused.
• Elementwise min/max are supported for two input tensors of the same shape.
5.5.38. L2 Operators
• ReduceL2 is supported.
• L2Normalization is supported.
The Hailo Dataflow Compiler supports symmetric padding, as supported by other frameworks such as Caffe. As the SAME padding in Tensorflow is not symmetric, the only way to achieve this sort of padding is by explicitly using tf.pad followed by a convolution operation with padding='VALID'. The following code snippet shows how this would be done in Tensorflow (the padding generated by this code is supported by the Dataflow Compiler):
from math import ceil

pad_total_h = kernel_h - 1
if strides_h == 1:
    pad_beg_h = int(ceil(pad_total_h / 2.0))
else:
    pad_beg_h = pad_total_h // 2
pad_end_h = pad_total_h - pad_beg_h
# The width padding is computed the same way as the height padding
pad_total_w = kernel_w - 1
if strides_w == 1:
    pad_beg_w = int(ceil(pad_total_w / 2.0))
else:
    pad_beg_w = pad_total_w // 2
pad_end_w = pad_total_w - pad_beg_w
inputs = tf.pad(
    inputs,
    [[0, 0], [pad_beg_h, pad_end_h], [pad_beg_w, pad_end_w], [0, 0]])
The Hailo Dataflow Compiler offers several command line tools that can be executed from the Linux shell. Before
using them, the virtual environment needs to be activated. This is explained in the tutorials.
hailo --help
The --help flag can also be used to display the help message of a specific tool. For example, the following prints the help message of the Profiler:
hailo profiler --help
The command line tools cover major parts of the Dataflow Compiler’s functionality, as an alternative to using the
Python API directly:
• The hailo parser command line tool is used to translate ONNX / TF models into Hailo archive (HAR) files.
Note: Consult Translating Tensorflow and ONNX models and hailo parser {tf, onnx} --help for further details on the corresponding parser arguments.
• The hailo optimize command line tool is used to optimize models’ performance.
Note: Consult Model Optimization and hailo optimize --help for further details on quantization arguments.
• The hailo compiler command line tool is used to compile the models (in HAR format) into a hardware
representation.
Note: Consult Compilation and hailo compiler --help for further details on compilation arguments.
The list below describes the Hailo command line interface functions for visualization and analysis:
• The hailo analyze-noise command is used to analyze per-layer quantization noise. Consult Model
Optimization Workflow for further details.
• The hailo params-csv command is used to generate a CSV report with weights statistics, which is useful for analyzing the quantization.
6.1.3. Tutorials
• The hailo tutorial command opens Jupyter with the tutorial notebooks folder. Select one of the tutorials
to run.
The Model Profiler analyzes the expected performance of a compiled model on hardware and displays the optimization analysis.
The user has to set the path of the HAR file to profile; additional optional parameters may be needed.
Profiler Modes:
– For a runner (or HAR file) in Native state (before quantization): presents a model overview.
– For a runner (or HAR file) in Quantized state (after optimization): presents optimization details.
– For a runner (or HAR file) in Compiled state (or Quantized state + --hef flag): presents optimization details and compilation data (see the notes below).
• Profiler with Runtime Data: By using the --runtime-data <JSON_FILE> flag with a runner (or HAR
file) in Compiled state (or Quantized + --hef), the profiler will show full compilation and performance data.
The JSON file is generated using hailortcli run2 -m raw measure-fw-actions set-net
<HEF-PATH> command on the target platform. In case HailoRT is installed on the same machine as the
Dataflow Compiler, the --collect-runtime-data profiler argument can be used to run the compiled
model on this platform and display the full report. See example at the bottom of the Inference Tutorial.
• Accuracy Profiler: By default, when running after quantization, only partial noise/accuracy data is displayed. The user can add the full analysis information by running the profiler on a HAR file that is the result of the hailo analyze-noise <har-path> --data-path <data-path> tool. Another option is to add model_optimization_config(checker_cfg, policy=enabled, analyze_mode=advanced) to the model script before the optimization stage. See the example in the Layer Noise Analysis Tutorial.
Note: For single-context networks, the profiler report calculates the proposed FPS and latency of the whole model.
However, on hosts with low PCIe bandwidth, it might not reflect actual performance. If the performance is worse than
the profiler report values, it is recommended to try and disable DDR buffers.
Note: For single-context networks, the --stream-fps argument can be used to normalize the power and bandwidth values according to the FPS of the input stream.
Note: For big models (when the compilation results in multi-context), performance data will not be available since it
depends on various runtime factors. To present performance data for those models, use the Profiler with Runtime
Data mode.
• Model Overview – Presents a summary of the model and its performance (runtime data will be required for
presenting the performance of big models)
• Optimization Details – Presents global Optimization-related information, and also per-layer statistics, both
native and quantized, used for gaining insights about degradation factors
• Compilation & Runtime Details – The percentage of the device(s) resources to be used by the target model,
and per-layer resources information. Presents simulated performance information for small models, and,
when runtime data file is provided, presents the measured performance for small and big models (see the
note above).
The following sections describe all tabs of the report and define the fields in each one:
6.2.2. Header
Device The device that the model is compiled for. Hailo-8 for example.
Profiler Parameters An icon on the top right corner, shows information about the Dataflow Compiler version and
profiling mode.
Model Parameters The number of model parameters (weights and biases), without any hardware-related over-
heads.
Operations per Input Tensor (OPS) Total operations per input image.
Input Tensors Shapes The resolution of the model’s input image (for example, HxWxC = 224x224x3).
Output Tensors Shapes The resolution of the model’s output shape (for example, HxWxC = 1x1x1000).
Model Graph
Graph representation of the model that is parsed using the Hailo Parser. If the model is in an FP-Optimized or more advanced state, it shows the model with the requested model modifications and further optimizations. Allows scrolling, zooming in/out, and selecting a specific layer to display Kernel, Activation, and Batch Norm information.
Performance Details
Throughput / FPS The overall network FPS, per batch size (for small models, the same FPS is achieved across all batch sizes). The selected FPS-per-batch affects the values below.
Latency The number of milliseconds it takes the network to process an image / batch of images.
# of Contexts The amount of consecutive allocations that are used for the compilation of the model on the device.
Small models require 1 context. Large models consist of 2 or more contexts.
Operations per Second (OP/s) The total operations per second, based on the FPS rate.
Total NN Core Power Consumption The estimated power consumption of the neural core, in watts, at a standard 25°C. This field excludes power consumed by the chip top and interfaces. It only appears for small models (that fit into a single context), with an accuracy of ±20%.
Input Throughput (Input BW) The model’s total input tensor throughput (bytes per second), based on the FPS rate.
Output Throughput (Output BW) The model’s total tensor output throughput (bytes per second), based on the FPS
rate.
Optimization Level Complexity of the optimization algorithm that was used to quantize the model.
Compression Level Level of weights compression to 4-bit that was used (0 corresponds to 0% 4-bit weights, and 5
corresponds to 100% 4-bit weights).
Calibration Set Size Calibration set size that was used to optimize the model.
Input Conversion Input color/format conversions that were added to the model using a model script command.
Input Resize Input resize that was added to the model using a model script command.
Transpose Model (H<->W) The model was transposed using a model script command.
Normalization Input normalization that was added to the model using a model script command.
Post Processing Post-processing that was added to the model using a model script command. Can be either a single
Op (like Softmax or Sigmoid) or a complex method (like NMS).
A plot of signal-to-noise ratio between the full precision and quantized model. The SNR value is measured at the out-
put layer(s) of the model and in each measurement, only a single layer is quantized. This graph shows the sensitivity
of each layer to quantization measured in dB. The most sensitive layers (< 10dB) are highlighted. In cases where there
are multiple output layers, multiple graphs will be shown.
Layer(s) with low SNR could be improved using the following techniques.
Model Graph
Layer Analytics
This view is the default view on the bottom-right side; it can be switched to the Table view with the icon in its top-right corner.
Layer Details
For each layer, the fields presented below describe its properties:
Layer Type The type of operation performed by this layer (for example, convolution or max pooling).
Parameters The number of layer parameters (weights and biases), required by the layer.
Input Shape The shape of the input tensor processed by this layer.
Output Shape The shape of the output tensor processed by this layer.
Kernel Shape The shape of the kernel weights matrix (for example: A Conv layer with 3x3x64x64 means 3x3 kernel,
64 input features and 64 output features).
Groups The number of groups the kernel is split into. In most cases, groups are calculated independently. For example, convolution layers with more than one group are called “group convolution” layers.
Activation Specifies the activation function type that is performed on the output of the layer.
Batch Norm Specifies whether batch normalization was used during training on this layer.
Original Names The original name(s) of the layer(s) in the original model file (TF or ONNX), that are merged into this
layer (for example a Conv layer, a Batch Norm and an Activation).
Optimization Details
Displays per-layer statistics, collected by passing the calibration set through the model:
SNR (on layer) Signal-to-noise ratio between the full precision and quantized model, measured at this layer's output, when all the layers are quantized. It helps to understand the SNR at this point of the quantized model, considering that all previous layers have been quantized. Expect a low on-layer SNR at the final nodes of the model, compared to the on-layer SNR at the beginning. Note the difference between this measure and the SNR chart in the top drawer, which shows the SNR at the model's outputs when only one layer is quantized at a time.
Bits (Input, Weights, Bias, Output) The amount of bits used to represent the [Input, Weights, Bias, Output] of the
layer.
MO Algorithms Which algorithms were used on this layer in the Optimization phase of the model: Equalization,
FineTune, Bias Correction, and AdaRound.
Weight Histogram This histogram shows the full precision weights distribution. Outliers in the distribution might
cause degradation. Kernel Ranges are the minimum and maximum values of the weights of the layer.
Activations Histogram This histogram shows the full precision activations distribution. Outliers in the distribution
might cause degradation. Input Ranges are the minimum and maximum values at the layer’s inputs (before
quantization). Output Ranges are the minimum and maximum values at the layer’s outputs (before quantiza-
tion).
Scatter Plot This graph shows the difference, for representative activation values, between the full precision and quantized model, as measured at the output of the layer. Better quantization means the trend should be closer to a slope of 1 (which represents zero quantization noise). If a layer has some outliers (far from the slope=1 line), or the values resemble a “cloud” instead of a straight line, it may point to quantization errors. Use the following techniques to try and improve it.
Model Table
You can switch to this view by using the icon in the top-right corner on the right side of the tab. You can select which fields are displayed, and scroll horizontally if not all the fields are visible.
The displayed fields are all the fields that appear on the Layer Analytics / Layer Details, plus the SNR
(per layer), and Bits information.
Displays the performance information of the model. Available for small (single context) models, or for big (multi
context) models with runtime data.
The fields are the same fields from the Model Overview tab / Performance Details section.
For large (multi context) models only. A timeline graph that shows the consecutive loading and execution of the
model’s contexts on the device, to complete a single inference (of a specific batch of images). Available only when
--runtime-data is provided.
Each context consists of five phases:
• Config time – Time required to fetch weights and configurations over the DDR interface.
• Load time – Some of the fetched data needs to be prepared and loaded into the resources of the device.
• Inference time – The time it takes for the first layer to complete processing the batch.
• Drainage time – The time it takes for the last layer to complete processing the batch, measured from the end
of the Inference time.
• Overhead time – Initializing / finalizing the resources before / after the inference.
For large (multi context) models only. A graph describing the DDR bandwidth utilized by each context of the network
(averaged over the context length). Available only when --runtime-data is provided.
• Weights/Configs – The weights of the next context, and its configuration registers.
• DDR Buffers – Some contexts might include long skip connections, so the DDR is being used for buffering this
large amount of data.
• Inter-context tensors – The intermediate tensors that are passed between the contexts.
• Boundary tensors – The boundary (edge) tensors that are fed into the model, and the outputs of the model.
Unlike the graphs on the Model Overview and Optimization Details tabs, this graph shows the model as it results from compilation. It may include slightly different layers, such as added shortcuts and inter-context nodes.
The graph starts with a “Context View” that shows the different contexts that the model was compiled into. By choos-
ing a context, the layers that are included in it can be observed. Also, the right side of the screen will show the
“Context Details” view. When a layer is selected, the right side of the screen will show the “Table View” with this layer
highlighted.
Context Analytics
This is the view on the right side of the screen, when a context is selected. You can switch from this view to the Table
View (per-layer) by using the icon on the top-right corner of this region.
The Context Analytics section displays information regarding the whole context in general - statistics and utilization.
In the case of a small (single-context) model, since it has only one context, the performance details of the whole model are determined from this context.
Note: The following section uses a terminology that is related to the internal structure of the neural core.
• Context Utilization
– Compute Usage The percentage of the device compute resources to be used by the target network. Can
be expanded to view breakdown to sub-clusters (SCs), input aligners (IAs), and activation/pooling
units (APUs).
– Memory Usage The percentage of the device memory resources to be used by the target network. This
figure includes both weights and intermediate results memory. Can be expanded to view breakdown
to L2 (sub-cluster resource), L3 (cluster resource), and L4 (device resource) memories.
– Control Usage The percentage of the device control (LCU = Layer Controller Unit) resources to be used
by the target network.
• Frames Per Second Breakdown of the FPS of the context’s layers, with the lowest (bottleneck) layer high-
lighted.
• Latency Breakdown Displays a simulation of the layers as if they were running on the device, for three input tensors.
Table View
This view can be accessed by using the icon on the top right corner on the right side of the tab. Select which fields are
displayed, and scroll horizontally if not all the fields are visible.
The displayed fields consist of some of the fields that appear on the Layer Analytics tab / Layer Details: Layer Name (this column stays even when scrolling), Layer Type, Input Shape, Output Shape, Kernel Shape, Stride, Dilation, Groups, MACs, and Parameters.
FPS How many frames per second this layer processes.
LCUs How many Layer Controllers this layer requires (for producing the layer’s FPS).
Subclusters How many sub-clusters this layer requires (for producing the layer’s FPS).
Latency How much time it takes from the moment the layer starts processing input data until the first output is generated.
Power The expected power to be consumed by the hardware resources that run this layer (an estimation; for
producing the layer’s FPS).
APUs How many Activation and Pooling Units this layer requires (for producing the layer’s FPS).
IAs How many Input Aligners this layer requires (for producing the layer’s FPS).
L3 weight cuts The relative amount of L3 (cluster-level) memory required by the layer’s weights.
L3 output cuts The relative amount of L3 (cluster-level) memory required for holding the layer’s outputs.
defuse_mode Whether this layer was defused into multiple sub-layers, and how.
ew_add_enabled Whether this layer was merged with a nearby element-wise add operation.
active_mac_util The utilization of this layer's compute; the relative amount of cycles in which the multiply-and-accumulate units are working.
width_align_util The ratio between the real layer’s computed features, to the actual computed features that
include padding in the width dimension.
feature_align_util The ratio between the real layer’s computed features, to the actual computed features that
include padding in the features dimension.
balance_fps_util What fraction of the time this layer is working. The layer with the lowest FPS has balance_fps_util = 1. Other layers are idle at times, therefore their utilization is lower.
mac_layers_util Of this layer’s subclusters, how many are used for the calculation of the output features (not
including intermediate helper operations).
effective_mac_util The product of the previous factors; the effective (actual) MAC utilization of this layer, considering all the factors above.
The Dataflow Compiler Studio allows users to parse and visualize neural network graphs efficiently.
hailo dfc-studio
On the welcome page, you can start a new project, import a project from the file system, or select an existing project that is already open on this machine. After creating a project, add and load a new model file, then select the desired hardware architecture. The supported formats for the original model are .onnx and .tflite.
Note: Each project can contain multiple models, but each model is parsed separately.
Note: It's also possible to load a .HAR file, but this allows visualization of the Hailo graph only; it won't be possible to change start and end nodes and trigger parsing again. In addition, the original graph will not be available.
Each model opens in a new, separate tab. Click ‘Continue’ to initiate the parsing process (assuming .onnx/.tflite files are loaded). A red dot on the tab indicates any unsaved changes.
You will be presented with a side-by-side view of the original graph and Hailo's graph, using the start/end nodes suggested by the Hailo parser.
In the bottom-left corner, you'll find a legend that illustrates the color code for each type of node: start/end nodes, supported nodes, unsupported nodes, nodes outside the parsing scope, and nodes awaiting parsing (the latter occurs if there is an attempt to modify the start/end nodes). You can navigate through the start/end nodes by using the directional arrows within the legend. To alter the start/end nodes, simply right-click on the node you wish to designate and initiate the parsing again to refresh the display. To continue with your workflow, you can save the .HAR file and move on to the subsequent stages using the command line interface (CLI).
Note: Future releases will unlock additional features. Stay tuned for updates.
Note: Currently, it's impossible to make changes to other tabs (models) while parsing is in progress in another tab. This will be fixed in a future release.
7. Additional Topics
In order to adjust the Dataflow Compiler behavior, the following optional environment variables can be set:
• HAILO_CLIENT_LOGS_ENABLED: Set to false to disable the log files of the Dataflow Compiler.
• HAILO_SDK_LOG_DIR: Defines which directory to write the logs into. Defaults to the working directory.
• HAILO_SET_MEMORY_GROWTH: Set to false if VRAM allocation problems occur. It disables the memory growth flag, which affects the way TensorFlow allocates and manages its memory. More information is provided here.
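For example, to redirect the logs from the shell (the path is illustrative):
export HAILO_SDK_LOG_DIR=/tmp/hailo_logs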
Part II
API Reference
8.1. hailo_sdk_client.runner.client_runner
Parameters
• hn – Hailo network description (HN), as a file-like object, string, dict, or HailoNN. Use
None if you intend to parse the network description from Tensorflow later. Notice: This
flag will be deprecated soon.
property model_script
property modifications_meta_data
force_weightless_model(weightless=True)
DFC API to force the model to work in weightless mode.
When this mode is enabled, the software emulation graph can be received from get_tf_graph()
even when the parameters are not loaded.
Note: This graph cannot be used for running inference unless the model does not require weights.
Parameters weightless (bool) – Set to True to enable weightless mode. Defaults to True.
set_keras_model(model: hailo_model_optimization.flows.inference_flow.SimulationTrainingModel)
Set Keras model after quantization-aware training. This method allows you to set the model after editing
it externally. After setting the model, new quantized weights are generated.
Example
Parameters
Example
load_model_script(model_script=None, append=False)
DFC API for manipulation of the model build params. This method loads a script and applies it to the
existing HN, i.e., modifies the specific params in each layer, and sets the model build script for later use.
Parameters
1. Model modification related commands – These commands are executed during opti-
mization.
2. Quantization related commands – Some of these commands modify the HN, so af-
ter the modification, each layer (possibly) has new quantization parameters. Other
commands are executed during optimization.
3. Allocation and compilation related commands – These commands are executed during
compilation.
• append (boolean) – Whether to append the commands to a previous script (if exists)
or use only the new script. Addition is allowed only in native mode. Defaults to False.
load_params(params, params_kind=None)
Load network params (weights).
Parameters
• params – If a string, this is treated as the path of the npz file to load. If a dict, this is treated as the params themselves, where the keys are strings and the values are numpy arrays.
save_params(path, params_kind='native')
Save all model params to a npz file.
Parameters
compile()
DFC API for compiling current model to Hailo hardware.
Returns Data of the HEF that contains the hardware representation of this model.
Example
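A hypothetical end-to-end sketch, using only methods documented in this reference (the paths, model name, and calib_data are illustrative):

from hailo_sdk_client import ClientRunner

runner = ClientRunner(hw_arch="hailo8")
runner.translate_onnx_model("model.onnx", "my_model")
runner.optimize(calib_data)  # calib_data: e.g. a numpy array of calibration images
hef = runner.compile()       # returns the HEF binary data
with open("my_model.hef", "wb") as f:
    f.write(hef)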
Parameters
• device_ids (list of str, optional) – device IDs to create VDevice from, call
Device.scan() to get a list of all available devices. Excludes ‘params’.
• nms_score_threshold (float, optional) – score threshold filtering for on
device nms. Relevant only when nms is used.
• gpu_policy (str, optional) – Sets the GPU policy for emulation-based inference. AUTO will distribute the inference across the available GPUs using a mirrored (data-parallel) strategy.
Raises
Example
Parameters
Note: Using a non-default start_node_names requires the model to be shape inference com-
patible, meaning either it has a real input shape, or, in the case of a dynamic input shape, the
net_input_shapes field is provided to specify the input shapes of the given start nodes. The order
of the output nodes is determined by the order of the end_node_names.
Returns The first item is the HN JSON as a string. The second item is the params dict.
Parameters
• model_path (str) – Path of the file to parse. Supported formats: SavedModel (TF2):
Saved model export from Keras, file named saved_model.pb|pbtxt from the model dir.
TFLite: Tensorflow lite model, converted from ckpt/frozen/Keras to file with .tflite suffix.
Note:
• The order of the output nodes is determined by the order of the end_node_names.
• TF1 model support will be deprecated in the future (April 2024); we recommend moving to TFLite.
Returns The first item is the HN JSON, as a string. The second item is the params dict.
Example
Parameters
Example
join_action (JoinAction, optional): Type of action to run in addition to joining the models:
Example
Parameters
• hef_filename (str, optional) – HEF file path. If given, the HEF file is used. If
not given and the HEF from the previous compilation is cached, the cached HEF is used;
Otherwise, the automatic mapping tool is used. Use compile() to generate and set
the HEF. Only in post-placement mode. Defaults to None.
• stream_fps (float, optional) – FPS used for power and bandwidth calcula-
tion.
Returns The first item is a JSON with the profiling result summary. The second item is a CSV table with detailed profiling information about all model layers. The third item is the latency data. The fourth item is the accuracy data.
Example
save_autogen_allocation_script(path)
DFC API for retrieving listed operations of the last allocation in .alls format.
property model_name
Get the current model (network) name.
property model_optimization_commands
property hw_arch
property state
Get the current model state.
property hef
Get the latest HEF compilation.
property nms_config_file
property nms_engine
property nms_meta_arch
get_params(keys=None)
Get the native (non-quantized) params the runner uses.
Parameters keys (list of str, optional) – List of params to retrieve. If not specified,
all params are retrieved.
get_params_translated(keys=None)
Get the quantized params the SDK uses.
Parameters keys (list of str, optional) – List of params to retrieve. If not specified,
all params are retrieved.
get_params_fp_optimized(keys=None)
Get the fp optimized params.
Parameters keys (list of str, optional) – List of params to retrieve. If not specified,
all params are retrieved.
get_params_statistics(keys=None)
Get the optimization statistics. During the optimization stage, we gather statistics about the model and
the optimization algorithms. This method returns this information in a ModelParams structure.
Parameters keys (list of str, optional) – List of params to retrieve. If not specified,
all params are retrieved.
get_hn_str()
Get the HN JSON after serialization to a formatted string.
get_hn_dict()
Get the HN of the current model as a dictionary.
get_hn()
Get the HN of the current model as a dictionary.
get_hn_model()
Get the HailoNN object of the current model.
get_native_hn_str()
Get the native HN JSON after serialization to a formatted string.
get_native_hn_dict()
Get the native HN of the current model as a dictionary.
get_native_hn()
Get the native HN of the current model as a dictionary.
get_native_hn_model()
Get the native HailoNN object of the current model.
get_fp_hn_str()
Get the full-precision HN JSON after serialization to a formatted string.
get_fp_hn_dict()
Get the full-precision HN of the current model as a dictionary.
get_fp_hn_model()
Get the full-precision HailoNN object of the current model.
set_hn(hn)
Set the HN of the current model.
Parameters hn – Hailo network description (HN), as a file-like object, string, dict or HailoNN.
save_hn(path)
Save the HN of the current model.
Parameters
load_har(har=None)
Set the current model properties using a given Hailo Archive file.
Parameters har (str or HailoArchive) – Path to the Hailo Archive file or an initialized
HailoArchive object to restore.
model_summary()
Prints a summary of the model layers.
optimize_full_precision(calib_data=None, data_type=None)
Optimize the model in full precision:
1. Fuse various layers (e.g., conv and elementwise-add, fold batch normalization, etc.), including folding of the fused layers' params.
2. Apply model modification commands from the model script (e.g., resize input, transpose, color conversion, etc.).
3. Run structural optimization algorithms (e.g., dead channel removal, tiling squeeze & excite, etc.).
Parameters
Parameters
• dataset – data for analysis. The type depends on the data_type parameter.
• data_type (optional, InferenceDataType) – dataset’s data type, based on
enum values:
• Quantize the model’s params, using optional pre-process and post-process algorithms.
Parameters
• calib_data – Calibration data for Equalization and quantization process. The type
depends on the data_type parameter.
get_hailo_runtime_model()
Generate a model that allows running the full ONNX graph using ONNX Runtime, including the parts that are offloaded to the Hailo-8 (between the start and end nodes) and the parts that are not.
save_parsing_report(report_path)
Save the parsing report to a given path.
Parameters
• config_path (string, optional) – Path to save the generated config file. De-
faults to ‘{meta_arch}_nms_config.json’.
init_lora_model(lora_weights_mapping)
Establish the LoRA model basic state.
load_lora_weights(lora_weights_path, lora_adapter_name)
Add a LoRA weights set (single adapter only) to a quantized Hailo model.
Parameters
property use_service
property original_model_meta
8.2. hailo_sdk_client.exposed_definitions
class hailo_sdk_client.exposed_definitions.JoinAction(value)
Bases: enum.Enum
NONE = 'none'
Join the graphs without any connection between them.
AUTO_JOIN_INPUTS = 'auto_join_inputs'
Automatically detects inputs for both graphs and combines them into one. This only works when both
networks have a single input of the same shape.
AUTO_CHAIN_NETWORKS = 'auto_chain_networks'
Automatically detects the output of this model and the input of the other model, and connects them. Only works when this model has a single output, and the other model has a single input, of the same shape.
CUSTOM = 'custom'
Supply a custom dictionary join_action_info, which specifies which nodes from this model need
to be connected to which of the nodes in the other graph. If keys and values are inputs, we join the inputs.
If keys are outputs, and values are inputs, we chain the networks as described in the dictionary.
class hailo_sdk_client.exposed_definitions.JoinOutputLayersOrder(value)
Bases: enum.Enum
Enum-like class to determine the output order of a model after joining with another model.
NEW_OUTPUTS_LAST = 'new_outputs_last'
First are the outputs of this model that remained outputs, then the outputs of the other model. The order in each sub-list is equal to the original order.
NEW_OUTPUTS_FIRST = 'new_outputs_first'
First are the outputs of the other model, then the outputs of this model that remained outputs. The order in each sub-list is equal to the original order.
NEW_OUTPUTS_IN_PLACE = 'new_outputs_in_place'
If the models are chained, the outputs of the other model are inserted, in their original order, to the
output list of this model instead of the first output which is no longer an output. If the models are joined
by inputs, the other model’s outputs are added last.
class hailo_sdk_client.exposed_definitions.NNFramework(value)
Bases: enum.Enum
TENSORFLOW = 'tf'
Tensorflow 1.x
TENSORFLOW2 = 'tf2'
Tensorflow 2.x
TENSORFLOW_LITE = 'tflite'
Tensorflow Lite
ONNX = 'onnx'
ONNX
class hailo_sdk_client.exposed_definitions.States(value)
Bases: str, enum.Enum
UNINITIALIZED = 'uninitialized'
Uninitialized state when generating a new ClientRunner
ORIGINAL_MODEL = 'original_model'
ClientRunner state after setting the original model path (ONNX/TF model)
HAILO_MODEL = 'hailo_model'
ClientRunner state after parsing (calling the translate_onnx_model()/translate_tf_model()
API)
FP_OPTIMIZED_MODEL = 'fp_optimized_model'
ClientRunner state after calling the optimize_full_precision() API. This state includes
all the full-precision optimization such as model modification commands.
QUANTIZED_MODEL = 'quantized_model'
ClientRunner state after calling the optimize() API. This state includes quantized weights.
QUANTIZED_BASE_MODEL = 'quanzited_base_model'
ClientRunner state after calling, for example, the load_lora_weights() API. This state in-
cludes layers (e.g. LoRA layers) with non-quantized weights, that were added as a fine-tune to a quantized
base.
QUANTIZED_SLIM_MODEL = 'quantized_slim_model'
ClientRunner state after calling the optimize() API and saving in compilation only mode. This
state includes only the necessary information for compilation (for example quantized weights but not
full-precision information).
COMPILED_MODEL = 'compiled_model'
ClientRunner state after compilation (calling the compile() API).
COMPILED_SLIM_MODEL = 'compiled_slim_model'
ClientRunner state after compilation of a quantized slim model (calling the compile() API). This
state allows only evaluation (profiling, inference).
class hailo_sdk_client.exposed_definitions.InferenceContext(value)
Bases: enum.Enum
SDK_NATIVE = 'sdk_native'
SDK_NATIVE context is for inference of the original model (without any modification).
SDK_FP_OPTIMIZED = 'sdk_fp_optimized'
SDK_FP_OPTIMIZED context includes all model modifications in floating point (such as normalization, NMS, and so on).
SDK_QUANTIZED = 'sdk_quantized'
SDK_QUANTIZED context is for inference of the quantized model. Used to measure degradation caused
by quantization.
SDK_HAILO_HW = 'sdk_hailo_hw'
SDK_HAILO_HW inference context to run on the Hailo-HW.
SDK_BIT_EXACT = 'sdk_bit_exact'
SDK_BIT_EXACT (preview) bit-exact emulation. Currently, not all layers and modes are supported.
This protocol represents a Context Info object that encapsulates the values needed for context inference. To create a Context Info object, run the infer context creation API.
infer_context: hailo_sdk_client.exposed_definitions.InferenceContext
The InferenceContext used for the infer API.
open: bool
Whether the context is open.
graph_export: None
SdkGraphExport: internal object used by the SDK.
gpu_policy: hailo_model_optimization.acceleras.utils.acceleras_definitions.DistributionStrategy
The GPU distribution policy to use.
__init__(*args, **kwargs)
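A hedged usage sketch of an inference context, following the pattern used in the tutorials; the HAR path and the input data below are placeholders:

import numpy as np
from hailo_sdk_client import ClientRunner, InferenceContext

runner = ClientRunner(har='model.har')  # placeholder HAR path
dataset = np.zeros((8, 224, 224, 3), dtype=np.float32)  # placeholder inputs

# Run the original, unmodified model in native emulation.
with runner.infer_context(InferenceContext.SDK_NATIVE) as ctx:
    native_results = runner.infer(ctx, dataset)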
class hailo_sdk_client.exposed_definitions.Dims(value)
Bases: str, enum.Enum
An enumeration.
BATCH = 'batch'
STACK = 'stack'
CHANNELS = 'channels'
HEIGHT = 'height'
WIDTH = 'width'
GROUPS = 'groups'
HEADS = 'groups'
DISPARITY = 'groups'
8.3. hailo_sdk_client.hailo_archive.hailo_archive
8.4. hailo_sdk_client.tools.hn_modifications
hailo_sdk_client.tools.hn_modifications.translate_rgb_dataset(rgb_dataset, ...)
Translate a given dataset of RGB-format images to YUV- or BGR-format images. This function is useful when the model expects YUV or BGR images while the calibration images used for quantization are in RGB.
Parameters
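A minimal usage sketch, assuming the dataset is a NumPy array of RGB images; the keyword selecting the target format is an assumption and should be checked against the full parameter list:

import numpy as np
from hailo_sdk_client.tools.hn_modifications import translate_rgb_dataset

rgb_dataset = np.zeros((64, 224, 224, 3), dtype=np.uint8)  # placeholder calibration set

# Hypothetical keyword name for the target color format.
yuv_dataset = translate_rgb_dataset(rgb_dataset, conversion='yuv')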
9.1. hailo_sdk_common.model_params.model_params
Dict-like class that contains all parameters used by a model, such as weights and biases.
9.2. hailo_sdk_common.hailo_nn.hailo_nn
stable_toposort(key=None)
Get a generator over the model’s layers, topologically sorted.
Example
... }
... }'''
>>> hailo_nn = HailoNN.from_hn(example_hn)
>>> for layer in hailo_nn.stable_toposort():
... print('The layer name is "{}"'.format(layer.name))
The layer name is "in"
The layer name is "out"
Parameters
Returns The first item is the HN, as a string or a dictionary, depending on the json_dump
argument. The second item contains the model’s parameters as a dictionary.
set_input_tensors_shapes(inputs_shapes)
Set the tensor shape (resolution) for each input layer.
Parameters inputs_shapes (dict) – Each key is the name of an input layer, and each value is the new shape to assign to it. Currently doesn't support changing the number of features.
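A hedged sketch of changing an input resolution on the hailo_nn object from the example above; the layer name and the shape layout are assumptions for illustration:

# 'in' is a placeholder input-layer name; the shape layout follows the
# model's input tensor format and is an assumption here.
hailo_nn.set_input_tensors_shapes({'in': [1, 480, 640, 3]})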
static from_fp(fp)
Get Hailo model from a file.
static from_hn(hn_json)
Get Hailo model from HN raw JSON data.
9.3. hailo_sdk_common.hailo_nn.hn_definitions
class hailo_sdk_common.hailo_nn.hn_definitions.NMSMetaArchitectures(value)
Bases: str, enum.Enum
SSD = 'ssd'
Single Shot Detection meta architecture.
CENTERNET = 'centernet'
Centernet meta architecture.
YOLOV5 = 'yolov5'
Yolov5 meta architecture.
YOLOX = 'yolox'
Yolox meta architecture.
YOLOV5_SEG = 'yolov5_seg'
Yolov5 seg meta architecture.
YOLOV6 = 'yolov6'
Yolov6 meta architecture.
YOLOV8 = 'yolov8'
Yolov8 meta architecture.
DAMOYOLO = 'damoyolo'
Damoyolo meta architecture.
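These values correspond to the meta_arch option of the nms_postprocess model-script command. A hedged sketch of applying one through a ClientRunner; the HAR path and the NMS config path are placeholders, and the exact command arguments should be checked against the Model Scripts chapter:

from hailo_sdk_client import ClientRunner

runner = ClientRunner(har='model.har')  # placeholder HAR path

# Model-script command passed as a string; paths are placeholders.
runner.load_model_script('nms_postprocess("nms_config.json", meta_arch=yolov8, engine=cpu)\n')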
Bibliography
[Meller2019] Eldad Meller, Alexander Finkelstein, Uri Almog and Mark Grobman. “Same, same but different: Recover-
ing neural network quantization error through weight factorization.” International Conference on Machine
Learning, 2019. http://proceedings.mlr.press/v97/meller19a/meller19a.pdf
[Finkelstein2019] Alexander Finkelstein, Uri Almog and Mark Grobman. “Fighting quantization bias with bias.” Confer-
ence on Computer Vision and Pattern Recognition Workshops, 2019. https://arxiv.org/pdf/1906.03193.pdf
[Finkelstein2022] Alex Finkelstein, Ella Fuchs, Idan Tal, Mark Grobman, Niv Vosco and Eldad Meller. “QFT: Post-training quantization via fast joint finetuning of all degrees of freedom.” European Conference on Computer Vision, 2022. https://arxiv.org/pdf/2212.02634.pdf
[Nagel2020] Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos and Tijmen Blankevoort. “Up or Down?
Adaptive Rounding for Post-Training Quantization.” International Conference on Machine Learning, 2020.
https://arxiv.org/pdf/2004.10568.pdf
[Vosco2021] Niv Vosco, Alon Shenkler and Mark Grobman. “Tiled Squeeze-and-Excite: Channel Atten-
tion With Local Spatial Context.” International Conference on Computer Vision Workshops, 2021.
https://openaccess.thecvf.com/content/ICCV2021W/NeurArch/papers/Vosco_Tiled_Squeeze-and-Excite_
Channel_Attention_With_Local_Spatial_Context_ICCVW_2021_paper.pdf
Python Module Index

hailo_sdk_client.exposed_definitions, 165
hailo_sdk_client.hailo_archive.hailo_archive, 167
hailo_sdk_client.runner.client_runner, 155
hailo_sdk_client.tools.hn_modifications, 167
hailo_sdk_common.hailo_nn.hailo_nn, 168
hailo_sdk_common.hailo_nn.hn_definitions, 169
hailo_sdk_common.model_params.model_params, 168