Copyright © 2017 Intel Corporation 1
Vadim Pisarevsky, Software Engineering Manager, Intel Corp.
May 2017
Making OpenCV Code Run Fast
OpenCV at a glance
What: The most popular computer vision library, http://opencv.org
License: BSD
Supported languages: C/C++, Java, Python
Size: >950 K lines of code
SourceForge statistics: 13.6 M downloads (does not include GitHub traffic)
GitHub statistics: >7500 forks, >4000 patches merged over 6 years (~2.5 patches per working day before Intel, ~5 patches per working day at Intel)
Accelerated with: SSE, AVX, NEON, IPP, MKL, OpenCL, CUDA, parallel_for_, OpenVX, Halide (planned)
Current versions: 2.4.13.2 (Dec 2016), 3.2 (Dec 2016)
Upcoming releases: 2.4.14 (2017), 3.3 (Jun 2017)
OpenCV, CV & Hardware Evolution: 2000 => 2017
OpenCV
  2000: OpenCV 1.0 alpha; C API, 1 module, Windows
  2017: OpenCV 3.2; C++ API; 30+30 modules, Windows/Linux/Android/iOS/QNX, etc.
CPU
  2000: 32-bit single-core, ~1 GFlop
  2017: 32/64-bit many-core, 300+ GFlops, ~100 GFlops in a cellphone!
GPU as accelerator
  2000: -
  2017: OpenCL, CUDA; 0.5-1+ TFlops
Other accelerators
  2000: FPGA (manually coded)
  2017: OpenCL-capable FPGA, various DSPs, etc.
Vision algorithms
  2000: Traditional vision, simple image processing, detection & tracking, contours; “empirical, low-profile computer vision”
  2017: Sophisticated traditional vision, 3D vision, computational photography, deep learning, hybrid algorithms; “learning-based, extensive computer vision”
Cameras, sensors
  2000: Analog surveillance cameras (recording only), webcams
  2017: Computer vision in every cellphone, every street crossing, every mall, coming to every car; 3D sensors, lidars, etc.
Computing model
  2000: Desktop
  2017: Edge, Cloud, Fog; Desktop for R&D only
OpenCV Acceleration Options
• CUDA modules
• OpenVX (immediate mode): OpenCV optimized for custom hardware
• Universal intrinsics: NEON/SSE/AVX2…
• Carotene HAL: OpenCV optimized for ARM CPU
• IPP, MKL: OpenCV optimized for x86/x64 CPU
• OpenVX (graphs): OpenCV optimized for custom hardware
• T-API (OpenCL): GPU-optimized OpenCV
• OpenCV HAL
• Halide scripts: any Halide-supported hardware
Diagram legend: user-programmable tools; collections of fixed functions; active development area
T-API: heterogeneous compute with OpenCV is easy!
• OpenCV 3.x includes T-API by default:
  • Asynchronous: can run GPU & CPU code in parallel
  • 100s of open-source OpenCL kernels

CPU version (cv::Mat):

    #include "opencv2/opencv.hpp"
    using namespace cv;

    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;                  // expects an image path
        Mat img, gray;
        img = imread(argv[1], 1);                // 1 = load as a 3-channel color image
        if (img.empty()) return 1;               // the file could not be read
        imshow("original", img);
        cvtColor(img, gray, COLOR_BGR2GRAY);
        GaussianBlur(gray, gray, Size(7, 7), 1.5);
        Canny(gray, gray, 0, 50);
        imshow("edges", gray);
        waitKey();
        return 0;
    }

T-API version (cv::UMat):

    #include "opencv2/opencv.hpp"
    using namespace cv;

    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;                  // expects an image path
        Mat img; UMat gray;                      // UMat is the only change in the declarations
        img = imread(argv[1]);
        if (img.empty()) return 1;               // the file could not be read
        imshow("original", img);
        cvtColor(img, gray, COLOR_BGR2GRAY);     // result lands in the UMat
        GaussianBlur(gray, gray, Size(7, 7), 1.5);
        Canny(gray, gray, 0, 50);
        imshow("edges", gray);                   // automatic sync point
        waitKey();
        return 0;
    }
T-API: under the hood
Very little “boilerplate code”! (just ~30 lines of code)
    void mykernel(cv::InputArray input, cv::OutputArray output, params …) { }
Flow inside such a function:
• Use OpenCL? If not, retrieve cv::Mat and run the C++ code.
• If yes, get clmem (use zero-copy if possible).
• Retrieve/compile the OpenCL kernel & “enqueue” it.
• Enqueued successfully? If yes, finish; otherwise retrieve cv::Mat and run the C++ code.
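The control flow on this slide can be sketched in plain C++. This is a minimal model that uses only the standard library, not OpenCV's actual dispatch code: `runWithFallback`, `Image`, and the three `scale*` functions are hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for an image buffer.
using Image = std::vector<float>;

// An accelerated kernel returns false if it could not be launched
// (for example, no device is available or the kernel failed to compile).
using AcceleratedKernel = std::function<bool(const Image&, Image&)>;
using CpuKernel = std::function<void(const Image&, Image&)>;

// Try the accelerated path first and fall back to the C++ path.
// Returns true when the accelerated path produced the result.
bool runWithFallback(bool useAccelerator,
                     const AcceleratedKernel& accelerated,
                     const CpuKernel& cpu,
                     const Image& input, Image& output)
{
    assert(cpu && "a CPU fallback is required");
    if (useAccelerator && accelerated && accelerated(input, output))
        return true;       // "Finish": the accelerated kernel did the work
    cpu(input, output);    // otherwise run the reference C++ code
    return false;
}

// Reference CPU implementation used for illustration: multiply by 2.
void scaleCpu(const Image& in, Image& out)
{
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = 2.0f * in[i];
}

// Stand-in for a kernel that fails to launch.
bool scaleBrokenAccelerator(const Image&, Image&) { return false; }

// Stand-in for a kernel that succeeds (it simply reuses the CPU code here).
bool scaleWorkingAccelerator(const Image& in, Image& out)
{
    scaleCpu(in, out);
    return true;
}
```

Calling `runWithFallback(true, scaleBrokenAccelerator, scaleCpu, in, out)` still fills `out`, which is the point of the design: the caller gets a result either way and only the speed differs.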
T-API execution model
• Supports multiple devices
• Asynchronous execution with no explicit synchronization required
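One way to picture "no explicit synchronization" is a buffer that only queues work and drains the queue when the host first reads the data. The sketch below is a single-threaded model under that assumption; `DeferredBuffer`, `enqueue`, and `hostView` are hypothetical names, and nothing here reflects how OpenCV or an OpenCL runtime is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical model: operations are queued, not executed, until the
// host asks to see the data.
class DeferredBuffer {
public:
    using Op = std::function<void(std::vector<int>&)>;

    explicit DeferredBuffer(std::vector<int> data) : data_(std::move(data)) {}

    // Queue an in-place operation; returns immediately without running it.
    void enqueue(Op op)
    {
        assert(op);
        pending_.push(std::move(op));
    }

    std::size_t pendingCount() const { return pending_.size(); }

    // Host access is the implicit synchronization point: all queued
    // operations run, in order, before the data is returned.
    const std::vector<int>& hostView()
    {
        while (!pending_.empty()) {
            pending_.front()(data_);
            pending_.pop();
        }
        return data_;
    }

private:
    std::vector<int> data_;
    std::queue<Op> pending_;
};
```

In this model the caller never writes a "wait" call; reading through `hostView()` plays the role that `imshow` plays in the T-API listing, where the comment marks it as the automatic sync point.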
T-API showcase: Pedestrian Detector
Pipeline: capture video frame; per-frame detector; optical flow-based tracker (do temporal filtering, follow pedestrians, detect new ones)
Per-frame detector stages: Feature Pyramid Builder; sliding window + cascade classifier; non-maxima suppression (filtering out duplicates)
Feature Pyramid Builder stages: build pyramid; RGB2Luv; HOG feature maps; integrals of HOG maps
Performance profile of per-frame detector (CPU): Feature Pyramid Builder (65%), Classifier + Non-max (35%)
• Feature Pyramid Builder is the ideal “kernel” to optimize:
  • Expensive
  • Regular, easy to parallelize & vectorize
  • Reusable (e.g., for cars)
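The "integrals of HOG maps" stage is a good illustration of why this part of the pipeline is so regular. A minimal sketch of an integral image (summed-area table) follows, assuming a single-channel integer map stored row by row; `Integral`, `buildIntegral`, and `boxSum` are illustrative names, not the detector's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Summed-area table with one extra row and column of zeros, so that
// at(y, x) is the sum of all pixels above and to the left of (y, x).
struct Integral {
    std::size_t rows, cols;        // size of the source image
    std::vector<long long> sum;    // (rows + 1) x (cols + 1) table

    long long at(std::size_t y, std::size_t x) const
    {
        return sum[y * (cols + 1) + x];
    }
};

Integral buildIntegral(const std::vector<int>& img, std::size_t rows, std::size_t cols)
{
    assert(img.size() == rows * cols);
    Integral it{rows, cols, std::vector<long long>((rows + 1) * (cols + 1), 0)};
    for (std::size_t y = 0; y < rows; ++y) {
        long long rowSum = 0;      // running sum of the current row
        for (std::size_t x = 0; x < cols; ++x) {
            rowSum += img[y * cols + x];
            it.sum[(y + 1) * (cols + 1) + (x + 1)] =
                it.sum[y * (cols + 1) + (x + 1)] + rowSum;
        }
    }
    return it;
}

// Sum of the rectangle [y0, y1) x [x0, x1) from four table lookups.
long long boxSum(const Integral& it, std::size_t y0, std::size_t x0,
                 std::size_t y1, std::size_t x1)
{
    assert(y0 <= y1 && y1 <= it.rows && x0 <= x1 && x1 <= it.cols);
    return it.at(y1, x1) - it.at(y0, x1) - it.at(y1, x0) + it.at(y0, x0);
}
```

Once the table is built, the sum over any window costs four lookups regardless of window size, which is what makes a sliding-window classifier over many scales affordable.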
Feature Pyramid Builder optimization with T-API
• Duplicate CPU branch
• Make an OpenCL-compatible copy (cv::UMat) of each internal buffer (cv::Mat)
• Use available OpenCL-optimized functions (e.g. cv::resize, cv::integral)
• Create OpenCL kernels for the other parts (RGB2Luv, HOG): ~700 LoC
• Debug-Profile-Optimize: repeat until happy

Part                     CPU, ms (1080p)  OCL, ms (1080p)  CPU, ms (720p)  OCL, ms (720p)  Acceleration (1080p)  Acceleration (720p)
All                      200              140              107             87              42%                   23%
Feature Pyramid Builder  130              70               60              40              85%                   50%

Test machine: Core i5 (Skylake), 2-core 2.5 GHz, Intel HD530 GPU
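The percentages in the table can be checked with a few lines of arithmetic. The sketch below assumes acceleration is defined as (CPU time / OCL time - 1) x 100, which is one definition consistent with all four table entries to within rounding; the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Acceleration in percent: how much faster the OpenCL run is
// relative to the CPU run (assumed definition, see the lead-in).
double accelerationPercent(double cpuMs, double oclMs)
{
    assert(cpuMs > 0.0 && oclMs > 0.0);
    return (cpuMs / oclMs - 1.0) * 100.0;
}

// Total time after accelerating only one part of the pipeline:
// the untouched remainder plus the new time of the accelerated part.
double totalAfterAcceleration(double totalCpuMs, double partCpuMs, double partOclMs)
{
    assert(partCpuMs <= totalCpuMs);
    return totalCpuMs - partCpuMs + partOclMs;
}
```

With the table's numbers, `totalAfterAcceleration(200, 130, 70)` gives 140 and `totalAfterAcceleration(107, 60, 40)` gives 87, matching the "All" row. The whole-pipeline gain therefore comes entirely from the Feature Pyramid Builder, and the classifier and non-maxima suppression time limits how far the total can drop.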
Halide: write once, schedule everywhere!
• Many acceleration options are available (CPU, GPU, DSPs, FPGA, etc.)
• Coding kernels with native tools is a huge investment with a high maintenance cost
  • Long time to market
  • Big commitment because of low portability
• OpenCV cannot be optimized for every single accelerator
• OpenCL is neither performance-portable nor easy to use
• Let’s generate OpenCL or LLVM code automatically from a high-level algorithm description!
• Let’s separate the platform-agnostic algorithm description from the platform-specific “pragmas” (vectorization, tiling …)!
Halide! (http://halide-lang.org)
Diagram: the Algorithm Description (Function 1, Function 2, …) feeds a CPU Scheduler (tiling, vectorization, pipelining) that produces CPU code (SSE, AVX…, NEON), and a GPU Scheduler (tiling, vectorization, pipelining) that produces GPU code (OpenCL, CUDA).
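The split between algorithm and schedule can be illustrated without Halide itself. The sketch below is a plain C++ analogy, not Halide syntax or its API: `PixelFn` plays the role of the algorithm description, and `realizeRowMajor` and `realizeTiled` are two hypothetical "schedules" that visit the same pixels in different orders.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// The "algorithm": what each output pixel is, with no loop order implied.
using PixelFn = std::function<int(std::size_t x, std::size_t y)>;

// Schedule 1: plain row-major traversal.
std::vector<int> realizeRowMajor(const PixelFn& f, std::size_t width, std::size_t height)
{
    std::vector<int> out(width * height);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            out[y * width + x] = f(x, y);
    return out;
}

// Schedule 2: the same pixels, visited tile by tile.
std::vector<int> realizeTiled(const PixelFn& f, std::size_t width, std::size_t height,
                              std::size_t tile)
{
    assert(tile > 0);
    std::vector<int> out(width * height);
    for (std::size_t ty = 0; ty < height; ty += tile)
        for (std::size_t tx = 0; tx < width; tx += tile)
            for (std::size_t y = ty; y < std::min(ty + tile, height); ++y)
                for (std::size_t x = tx; x < std::min(tx + tile, width); ++x)
                    out[y * width + x] = f(x, y);
    return out;
}
```

Both functions return identical buffers for the same `PixelFn`; only the memory access pattern changes. That is the property that lets a schedule be tuned per platform without touching the algorithm.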
Halide: first impressions & results
• Same code for CPU & GPU
• Halide includes a very efficient loop-handling engine
• Almost any known DNN can be implemented entirely in Halide
• The language is quite limited (insufficient to cover OpenVX 1.0)
• In some cases the produced code is inefficient
• The whole infrastructure is immature
Plans
• Halide backend in the OpenCV DNN module (in progress)
• Extend the language (if operator, etc.)
• Improve the performance of the generated code
• Fix/improve the infrastructure (nicer frontend, better support for offline compilation)

Kernel                                 OpenCV, ms (CPU)  Halide, ms (CPU)  Halide, ms (GPU)
RGB=>Gray                              0.44              0.54 (-20%)       0.58 (-25%)
Canny                                  3.3               1.4+2 (-3%)       2.4+2 (-25%)
DNN: AlexNet                           29 (w. MKL)       24 (+20%)         47 (-40%)
DNN: ENet (512x256)                    ~250 (w. MKL)     60 (+320%)        44 (+470%)
HOG-based pedestrian detector (1080p)  200               75+70 (+38%)      140 – 700 ms
OpenCV + OpenVX
• OpenVX-based HAL in OpenCV
  ✓ [Done] Immediate-mode OpenVX calls to accelerate simple functions:
    • cv::boxFilter(const cv::Mat&, …) => vxuBox3x3(vx_image, …) etc.
    • tested with Khronos’ sample implementation and Intel IAP
  • [TBD] Graphs for DNN acceleration
✓ [Done] Mixing OpenVX + OpenCV at user app level
  • vx_image <=> cv::Mat, OpenVX C++ wrappers, sample code:
  • https://github.com/opencv/opencv/tree/master/samples/openvx
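For readers unfamiliar with the operation being offloaded when cv::boxFilter is routed to vxuBox3x3, here is a minimal reference sketch of a 3x3 box (mean) filter in plain C++. It is not the OpenVX or OpenCV implementation. The border handling (border pixels copied unchanged) and the truncating division are assumptions made to keep the example short, and `box3x3` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 box (mean) filter on an 8-bit single-channel image stored row by row.
// Assumptions: border pixels are copied unchanged and the mean is truncated.
std::vector<std::uint8_t> box3x3(const std::vector<std::uint8_t>& src,
                                 std::size_t width, std::size_t height)
{
    assert(src.size() == width * height);
    std::vector<std::uint8_t> dst(src);          // start from a copy (keeps borders)
    if (width < 3 || height < 3)
        return dst;                              // no interior pixels to filter
    for (std::size_t y = 1; y + 1 < height; ++y) {
        for (std::size_t x = 1; x + 1 < width; ++x) {
            unsigned sum = 0;
            for (std::size_t dy = 0; dy < 3; ++dy)
                for (std::size_t dx = 0; dx < 3; ++dx)
                    sum += src[(y + dy - 1) * width + (x + dx - 1)];
            dst[y * width + x] = static_cast<std::uint8_t>(sum / 9);
        }
    }
    return dst;
}
```

A function this small and regular is a natural candidate for a fixed-function accelerator call, which is why simple filters were the first ones wired through the immediate-mode path.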
OpenCV Acceleration Options Comparison
HAL functions
  +: Get used automatically (zero effort); vendor-specific implementation is possible
  -: Little coverage (mostly image processing); usually CPU-only
HAL intrinsics
  +: Super-flexible, widely applicable and widely available
  -: Low-level, CPU only
T-API
  +: Can potentially deliver top speed
  -: OpenCL is not performance-portable; lots of expertise needed
OpenVX
  +: Can be tailored for any hardware (CPU, GPU, DSP, FPGA)
  -: Inflexible, not easy to use, difficult to extend
Halide
  +: Decent performance; relatively easy to use
  -: Not as flexible as OpenCL or C++
[Chart: Performance vs. Ease-of-use; plotted: HAL functions, HAL intrinsics, Halide, T-API (custom), T-API (built-in), OpenVX (graphs), OpenVX (graphs for DNN)]
[Chart: Flexibility vs. Coverage; plotted: HAL functions, HAL intrinsics, Halide, T-API (custom), T-API (built-in), OpenVX (graphs)]
Summary
• Modern OpenCV provides several acceleration paths
• Custom kernels are essential for user apps; existing OpenCV (and OpenVX) functionality is not enough
• Universal intrinsics (http://docs.opencv.org/master/df/d91/group__core__hal__intrin.html) are the best solution for CPU
• T-API (OpenCL; http://opencv.org/platforms/opencl.html) is the way to go for GPU acceleration
• Halide looks very promising and can become a viable alternative to plain C++ and OpenCL for “regular” algorithms; OpenCV 3.3 will include a Halide-accelerated deep learning module
Resources
• OpenCV: http://opencv.org
• Intel CV SDK: https://software.intel.com/en-us/computer-vision-sdk - the home of Intel-optimized OpenCV & OpenVX
• Halide: http://halide-lang.org
• Insights on the OpenCV 3.x feature roadmap, EVS2016 talk by Gary Bradski: https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-opencv