ML For Embedded Systems at The Edge - NXP and Arm
• Today – Machine learning for embedded systems at the edge – Arm and NXP
• June 30 – tinyML development with TensorFlow Lite for Microcontrollers and CMSIS-NN – Arm
• August 11 – Getting started with Arm Cortex-M software development and Arm Development Studio – Arm
• August 25 – Efficient ML across Arm from Cortex-M to Web Assembly – Edge Impulse
Visit: developer.arm.com/solutions/machine-learning-on-arm/ai-virtual-tech-talks
1
SPEAKERS
Kobus Marneweck, Senior Product Manager, Arm
Anthony Huereca, Embedded Systems Engineer, NXP Semiconductors
2
AGENDA
• ML on the edge
• eIQ deployment
− Arm support for TFLµ
− TensorFlow
− Glow
− Getting started
• The future
• Wrap-up
3
Machine Learning on the Edge
Image Classification
• Identify what camera is looking at
− Coffee pods
− Empty vs full trucks
− Factory defects on manufacturing line
− Produce on supermarket scale
• Personalization based on facial recognition
− Appliances
− Home
− Toys
− Auto
• Security video analysis

Audio Analysis
− Keyword actions
§ “Alexa”/“Hey Google”
− Voice commands
− Alarm analytics
§ Breaking glass
§ Crying baby

Anomaly Detection
− Identify factory issues before they become catastrophic
− Smartwatch health monitoring
− Motor performance monitoring
− Sensor analysis
5
MACHINE LEARNING PROCESS
6
INFERENCE ON THE EDGE
Two possibilities:
• Inference on the Cloud
• Inference on the Edge
7
NXP Enablement for Machine Learning
DIY
9
ARM CORTEX-M PORTFOLIO
• Cortex-M3 – Mainstream performance efficiency
• Cortex-M4 – Flexibility, control and DSP
• Cortex-M7 – Maximum performance, control and DSP
• Cortex-M33 – High performance, control and DSP, TrustZone
• Cortex-M55 – Helium vector extensions, performance efficiency, optimized for DSP & ML
10
CORTEX-M7: HIGHEST PERFORMANCE CORTEX-M
Performance
− Floating-point Unit (FPU) – single precision (SP) and double precision (DP); sustained 2x 32-bit or 2x 16-bit MACs per cycle
− Digital signal processing (DSP) extension
11
CORTEX-M33: NEXT-GENERATION CORTEX-M WITH TRUSTZONE SECURITY
12
eIQ
[Diagram: model preparation options – a model from PyTorch or another framework, quantization, the Glow compiler, a test model, and custom scripts]
14
eIQ – EDGE INTELLIGENCE
Collection of Libraries and Development Tools for Building Machine Learning Apps Targeting NXP MCUs and App Processors

Deploying open-source inference engines
• Integration and optimization of neural net (NN) inference engines (Arm NN, Arm CMSIS-NN, OpenCV, TFLite, ONNX, etc.)
• End-to-end examples demonstrating customer use-cases (e.g. camera → inference engine)
• Support for emerging neural net compilers (e.g. Glow)
• Suite of classical ML algorithms such as support vector machine (SVM) and random forest
• BYOM – Bring Your Own Model

Integrated into Yocto Linux BSP and MCUXpresso SDK
• No separate SDK or release to download
• i.MX: New layer meta-imx-machinelearning in Yocto
• MCU: Integrated in MCUXpresso SDK middleware

Supporting materials for ease of use
• Documentation: eIQ White Paper, Release Notes, eIQ User’s Guide, Demo User’s Guide
• Guidelines for importing pretrained models based on popular NN frameworks (e.g. TensorFlow, Caffe)
• Training collateral for CAS, DFAEs and customers (e.g. lectures, hands-on, video)
15
eIQ DEMO
16
eIQ Deployment Overview with CMSIS-NN

NXP eIQ INFERENCE ENGINES & LIBRARIES
Compute engines and supported devices:
• Cortex-M: i.MX RT600, i.MX RT1050, i.MX RT1060, i.MX RT1170
• DSP: i.MX RT600
• Cortex-A: i.MX 8M Plus, i.MX 8QM, i.MX 8QXP, i.MX 8M Quad/Nano, i.MX 8M Mini
• GPU: i.MX 8M Plus, i.MX 8QM, i.MX 8QXP, i.MX 8M Quad/Nano
• NPU (ML accelerator): i.MX 8M Plus
• µNPU: future MCU
18
eIQ ADVANTAGES
• eIQ implements performance enhancements with CMSIS-NN for Cortex-M cores and DSPs
− Up to 2.4x improvement in TensorFlow Lite inference time over the original code
• eIQ inference engines work out of the box and are already tested and optimized
− Get up and running in minutes instead of weeks
NXP eIQ enablement flow:
• Input: the customer’s or a third-party pre-trained model, trained on a CPU, GPU, or in the Cloud
• On the PC: optional optimizations (quantization/pruning) and conversion of the model
• Import the eIQ project, click the Compile button, then click the Program Output button
• Use the model on the i.MX RT device: the eIQ inference engine produces a prediction
Inference engines available with eIQ for i.MX RT:
• CMSIS-NN – Can be used for several different model frameworks
• TensorFlow Lite – Used for TensorFlow model frameworks
• Glow – Machine Learning compiler for several different model frameworks (Coming in July)
20
Arm support for TFLµ
• Developed by Arm
• API to implement common model layers such as convolution, fully-connected, pooling, activation, etc., efficiently at a low level (see the sketch below)
• Conversion scripts (provided by Arm) to convert models into CMSIS-NN API calls
• CMSIS-NN optimizes the implementation of inference engines like TFLite Micro (https://www.tensorflow.org/lite/microcontrollers)
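To make the low-level API concrete, here is a minimal sketch (not from the original deck) of one quantized fully-connected layer followed by ReLU and softmax, using the CMSIS-NN q7 kernels. The dimensions, weight tables, and shift values are hypothetical placeholders; in practice they come from Arm’s conversion scripts.

#include "arm_nnfunctions.h"

#define IN_DIM 128
#define OUT_DIM 10

/* Weight and bias tables would be generated by the conversion scripts. */
static const q7_t fc_weights[IN_DIM * OUT_DIM] = { /* ... */ };
static const q7_t fc_bias[OUT_DIM] = { /* ... */ };

static q7_t input[IN_DIM];    /* quantized input features */
static q7_t output[OUT_DIM];  /* class scores */
static q15_t scratch[IN_DIM]; /* working buffer required by the kernel */

void Classify(void)
{
    /* bias_shift (0) and out_shift (7) depend on the fixed-point format
       chosen during quantization; the values here are placeholders. */
    arm_fully_connected_q7(input, fc_weights, IN_DIM, OUT_DIM,
                           0, 7, fc_bias, output, scratch);
    arm_relu_q7(output, OUT_DIM);
    arm_softmax_q7(output, OUT_DIM, output);
}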
22
CMSIS-NN OPTIMIZED FOR PERFORMANCE
• Optimised for Cortex-M CPUs
− Available now through open source license
• Consistent interface to all Cortex-M CPUs
− Extending to Armv8-M
• Open-source, via Apache 2.0 license
− https://github.com/ARM-software/CMSIS_5
[Charts: relative throughput and energy efficiency (inferences per joule) across four networks on Armv7-M vs. Armv8.1-M, showing up to 4.9x higher energy efficiency]
23
TOOLS & TFLµ OPERATOR SUPPORT – CMSIS-NN AND ETHOS microNPU
[Diagram: starting from an input .TF file, a microNPU optimization step produces a modified .TF file with custom operators optimized for the Ethos microNPU; those operators run on the Ethos microNPU via the Ethos-U driver, while the remaining operators run on the Cortex-M using CMSIS-NN optimized or reference kernels]
24
eIQ TensorFlow
• Developed by Google
− TensorFlow → Training and Inference
− TensorFlow Lite eIQ → NXP’s implementation of TF Lite for MCUs
− TensorFlow Lite Micro → TensorFlow’s implementation of TF Lite for MCUs
26
TENSORFLOW LITE CONVERSION PROCESS
[Diagram: TensorFlow model conversion into a TensorFlow Lite file for use with eIQ]
27
TENSORFLOW LITE CODE FLOW
• Import model
#include "mobilenet_model.h"
model = tflite::FlatBufferModel::BuildFromBuffer(mobilenet_model, mobilenet_model_len);
• Get input
/* Extract image from camera to data buffer. */
CSI2Image(data, Rec_w, Rec_h, pExtract, true);
/* Resize image to input tensor size. */
ResizeImage(interpreter->tensor(input), data, Rec_h, Rec_w, image_height, image_width, image_channels, &s);
• Run inference
interpreter->Invoke();
• Get results
std::vector<std::pair<float, int>> top_results;
GetTopN<float>(interpreter->typed_output_tensor<float>(0), output_size, s->number_of_results, threshold, &top_results, true);
auto result = top_results.front();     // Best result
const float confidence = result.first; // Confidence level
const int index = result.second;       // Index of the highest-scoring class
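TensorFlow Lite for Microcontrollers follows the same import → input → invoke → results pattern, but allocates every tensor from a static arena instead of the heap. A minimal sketch, assuming a hypothetical model array g_model (a .tflite file converted to a C array) and an arena size tuned to the model:

#include <cstdint>
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model[]; // model data as a C array

constexpr int kArenaSize = 64 * 1024; // tune per model
static uint8_t tensor_arena[kArenaSize];

void RunInference() {
  static tflite::MicroErrorReporter error_reporter;
  const tflite::Model* model = tflite::GetModel(g_model);
  static tflite::AllOpsResolver resolver; // or register only the ops the model uses
  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize, &error_reporter);
  interpreter.AllocateTensors(); // carve all tensors out of the arena

  TfLiteTensor* input = interpreter.input(0);
  /* ... fill input->data with camera or sensor data ... */

  interpreter.Invoke();

  TfLiteTensor* output = interpreter.output(0);
  /* ... read class scores from output->data ... */
}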
28
G E M M L O W P A S S E M B LY- C O D E D D S P O P T I M I Z AT I O N B E N E F I T S F O R T E N S O R F L O W L I T E
300
100
29
eIQ Glow
• Developed by Facebook
• Glow is a compiler that turns a model into a machine-executable binary for the target device
− Both the model and the inference engine are compiled into the generated binary
− The generated binary is integrated into an SDK software project
− Can make use of compiler optimizations
− Supports ONNX (universal model format) and Caffe2 models
31
PERFORMANCE COMPARISON USING CIFAR-10 MODEL ON RT1050
[Chart: inference time of Glow with CMSIS-NN vs. optimized TensorFlow Lite for a CIFAR-10 model on the i.MX RT1050]
32
OPTIMIZATIONS FOR GLOW
Glow inference time on RT685 (in milliseconds):

Configuration                                MNIST model   CIFAR-10 model
Floating-point model                         104.63        213.78
Floating-point model using HiFi4 DSP         3.02          13.36
Quantized model                              59.77         165.37
Quantized model using CMSIS-NN               28.52         89.95
Quantized model using CMSIS-NN + HiFi4 DSP   2.50          6.70
33
GLOW
[Diagram: a model (.pb, .onnx, .yml) is compiled by Glow into a bundle (.o object file plus .inc and .weights files) that is added to an eIQ project and deployed to an i.MX RT device]
34
ADD COMPILED CODE TO PROJECT
• Run model
• Get result (both steps are sketched below)
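As an illustration, here is a minimal sketch (not from the original deck) of those two steps for a Glow bundle compiled under the hypothetical name lenet_mnist. The entry-point name, offset macros, and memory sizes all come from the auto-generated bundle header and its companion .weights/.inc file, so the exact symbols depend on the model:

#include <stdint.h>
#include "lenet_mnist.h" /* auto-generated bundle header (hypothetical name) */

/* The compiled model operates on three memory regions whose sizes and
   alignment are given by macros in the generated header. */
GLOW_MEM_ALIGN(LENET_MNIST_MEM_ALIGN)
static uint8_t constant_weights[LENET_MNIST_CONSTANT_MEM_SIZE] = {
    /* contents of the generated .weights/.inc file */
};
GLOW_MEM_ALIGN(LENET_MNIST_MEM_ALIGN)
static uint8_t mutable_weights[LENET_MNIST_MUTABLE_MEM_SIZE]; /* inputs and outputs */
GLOW_MEM_ALIGN(LENET_MNIST_MEM_ALIGN)
static uint8_t activations[LENET_MNIST_ACTIVATIONS_MEM_SIZE]; /* scratch memory */

void RunModel(void)
{
    /* Input and output tensors live at fixed offsets inside mutable_weights;
       the offset macro names follow the model's placeholder names. */
    float *input = (float *)(mutable_weights + LENET_MNIST_input);
    float *output = (float *)(mutable_weights + LENET_MNIST_output);

    /* ... fill input[] with the data to classify (run model step) ... */

    lenet_mnist(constant_weights, mutable_weights, activations);

    /* ... read class scores from output[] (get result step) ... */
}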
35
GLOW MEMORY USAGE
36
Getting eIQ
38
eIQ EXAMPLES
39
eIQ FOLDER STRUCTURE
40
eIQ APP NOTES
Coming Soon:
• Transfer Learning and Datasets
41
INFERENCE TIMES
42
MEMORY REQUIREMENTS
43
The future
[Chart: relative control code performance across the Cortex-M family. Cortex-M0, Cortex-M0+, Cortex-M1, Cortex-M23, Cortex-M3, Cortex-M4, Cortex-M33, Cortex-M35P and Cortex-M7 cover signal conditioning and the ML foundation; Cortex-M55 and Cortex-M55 + Ethos-U55 (multiple performance points available) target ML performance and efficiency]
• Cortex-M55: up to 5x higher signal processing performance (CFFT in int32)
• Cortex-M55: up to 15x higher ML performance* (matrix multiplication in int8)
• Cortex-M55 & Ethos-U55: up to 480x higher ML performance* (matrix multiplication in int8)
46
Summary
• NXP eIQ
• TensorFlow Lite
• Glow
• CMSIS-NN
Book:
You Look Like a Thing and I Love You: How Artificial Intelligence Works and Why It's Making the World a Weirder Place
48
GIT REPOS
• TensorFlow Lite
− https://github.com/tensorflow/tensorflow/tree/v1.13.1/tensorflow/lite
• TensorFlow Lite for Microcontrollers
− https://www.tensorflow.org/lite/microcontrollers
• CMSIS-NN
− https://github.com/ARM-software/CMSIS_5/tree/master/CMSIS/NN
− CIFAR-10: https://github.com/ARM-software/ML-examples/tree/master/cmsisnn-cifar10
− KWS: https://github.com/ARM-software/ML-KWS-for-MCU
• Glow
− https://github.com/pytorch/glow
49
NXP eIQ RESOURCES
50
Virtual Tech Talks Series
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
شكراً
תודה
Join our next virtual tech talk:
AI Virtual Tech Talks Series
tinyML development with TensorFlow Lite for Microcontrollers and CMSIS-NN
Tuesday 30 June
Register here: developer.arm.com/solutions/machine-learning-on-arm/ai-virtual-tech-talks
Confidential © 2020 Arm Limited
The Arm trademarks featured in this presentation are registered
trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks
featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks