OneAPI Introduction and Parallelization
Refer to https://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel
software products.
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance
varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn
more at [intel.com].
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and
roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.
© Intel Corporation
SATG 2
Welcome to the Parallelism Area!
Amdahl’s Law
“The speedup of a program
using multiple processors
in parallel computing is limited
by the sequential fraction of
the program.”
- Gene Amdahl
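Speedup = (s + p) / (s + p / N)
= 1 / (s + p / N)
where s is the sequential fraction, p the parallel fraction (s + p = 1), and N the number of processors.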
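To make the formula concrete, here is a minimal C++ sketch that evaluates it. The function names `amdahl_speedup` and `amdahl_limit` are illustrative and not part of the slides; the sketch assumes s + p = 1.

```cpp
#include <cassert>

// Amdahl's law with s + p = 1: speedup = 1 / (s + p / N).
// `serial_fraction` is s, `n` is the processor count N.
double amdahl_speedup(double serial_fraction, int n) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(n >= 1);
    const double parallel_fraction = 1.0 - serial_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / n);
}

// Upper bound on the speedup as N grows without limit: 1 / s.
double amdahl_limit(double serial_fraction) {
    assert(serial_fraction > 0.0 && serial_fraction <= 1.0);
    return 1.0 / serial_fraction;
}
```

For example, with a sequential fraction of 0.1, `amdahl_speedup(0.1, 10)` is roughly 5.3, and no processor count can push the speedup past `amdahl_limit(0.1)`, which is about 10.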
The ‘Free Lunch’ is over
Parallelism on the Intel Architecture
Developer
Tools
Architecture
Code Optimization Principles
Stage 1: Use Optimized Libraries
oneAPI – A Tools Development Framework
Programming Challenges for Multiple Architectures
▪ Application workloads need diverse hardware
▪ Requires separate programming models and toolchains for each architecture
[Figure: CPU, GPU, FPGA, and other accelerators (XPUs), each with its own programming model(s)]
Introducing oneAPI
Application Workloads Need Diverse Hardware
oneAPI Industry Initiative
Break the Chains of Proprietary Lock-in
▪ A cross-architecture language based on C++ and SYCL standards
▪ Powerful libraries designed for acceleration of domain-specific functions
▪ Low-level hardware interface
▪ Open to promote community and industry collaboration
▪ Enables code reuse across architectures and vendors
The productive, smart path to freedom for accelerated computing from the economic and technical burdens of proprietary programming models
[Figure: application workloads and middleware & frameworks on top of the oneAPI industry specification (direct programming, API-based programming, low-level hardware interface), targeting XPUs: CPU, GPU, FPGA, other accelerators]
The open source and Intel DPC++/C++ compiler supports Intel CPUs, GPUs, and FPGAs.
Codeplay announced a DPC++ compiler that targets Nvidia GPUs.
Powerful oneAPI Libraries
▪ Designed for acceleration of key domain-specific functions
▪ Pre-optimized for each target platform for maximum performance
Libraries shown:
• Intel® oneAPI Math Kernel Library (oneMKL)
• Intel® oneAPI Deep Neural Network Library (oneDNN)
• Intel® oneAPI Video Processing Library (oneVPL)
• Intel® oneAPI Data Analytics Library (oneDAL)
Intel® oneAPI Products
[Figure: product timeline; label "Available Now (CPU, GPU)" at "Today"]
Domain-specific toolkits for specialized workloads:
▪ Intel® oneAPI Tools for HPC: deliver fast Fortran, OpenMP & MPI applications that scale
▪ Intel® oneAPI Tools for IoT: build efficient, reliable solutions that run at network's edge
▪ Intel® oneAPI Rendering Toolkit: create performant, high-fidelity visualization applications
Streamlining Product Line
[Figure: Composer Edition and Professional Edition map to the Intel® oneAPI Base & HPC Toolkit (Single-Node)]
Intel® oneAPI Tools for HPC (Intel® oneAPI Base & HPC Toolkit): Deliver Fast Applications that Scale
What is it?
A toolkit that adds to the Intel® oneAPI Base Toolkit for building high-performance, scalable parallel code on C++, Fortran, OpenMP & MPI from enterprise to cloud, and HPC to AI applications.
Who needs this product?
▪ OEMs/ISVs
▪ C++, Fortran, OpenMP, MPI Developers
Direct Programming
• Intel® Fortran Compiler Classic
• Intel® Fortran Compiler (Beta)
• Intel® oneAPI DPC++/C++ Compiler
• Intel® DPC++ Compatibility Tool
• Intel® Distribution for Python*
• Intel® FPGA Add-on for oneAPI Base Toolkit
API-Based Programming
• Intel® oneAPI DPC++ Library
• Intel® oneAPI Math Kernel Library
• Intel® oneAPI Data Analytics Library
• Intel® oneAPI Threading Building Blocks
• Intel® oneAPI Video Processing Library
• Intel® oneAPI Collective Communications Library
Analysis & Debug Tools
• Intel® Trace Analyzer & Collector
• Intel® Cluster Checker
• Intel® VTune™ Profiler
• Intel® Advisor
• Intel® Distribution for GDB*
Visualization Applications
Learn more: intel.com/oneAPI-RenderKit
Intel® oneAPI AI Analytics Toolkit
Who Uses It?
Data scientists, AI researchers, ML and DL developers, AI application developers
▪ Deep learning performance for training and inference with Intel optimized DL frameworks and tools
▪ Drop-in acceleration for data analytics and machine learning workflows with compute-intensive Python packages
Components shown:
• Deep Learning: Model Zoo for Intel® Architecture
• Data Analytics & Machine Learning: NumPy, SciPy, Pandas
• Samples and End2End Workloads
Supported Hardware Architectures (1)
(1) Hardware support varies by individual tool. Architecture support will be expanded over time.
Get the toolkit via these locations: Intel Installer, Docker, Apt, Yum, Conda, Intel® DevCloud
Learn more: software.intel.com/oneapi/ai-kit
Intel® Distribution of OpenVINO™ toolkit
Powered by oneAPI
Edge AI &
Vision Alliance
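The Seven Levels of Parallelism
1 Inter-Node Parallelism
2 Intra-Node Parallelism
3 Core / Thread Level Parallelism
4 Thread-Level Parallelism with Simultaneous MultiThreading (SMT)
5 Instruction-level Parallelism
6 Data Level Parallelism
7 Accelerator / Offloading Parallelism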
Inter-Node Parallelism
What can I do?
▪ Employ communication-avoiding algorithms (e.g. neighborhood collectives)
▪ Overlap compute and communication where possible
▪ Load balance work between ranks
▪ In case of a well scaling application, add more nodes ☺
Which Tools?
▪ Identify scalability issues with the VTune™ Profiler Application Performance Snapshot & the Trace Analyzer and Collector
▪ Fine-tune collective operations with the Intel MPI autotuner
▪ Where applicable
• Math Kernel Library (oneMKL)
• Data Analytics Library (oneDAL)
• Collective Communications Library (oneCCL)
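As an illustration of the "load balance work between ranks" advice, the following C++ sketch splits a number of equally expensive work items across ranks so that no rank gets more than one item above any other. The names `Range` and `block_range` are hypothetical, and the sketch assumes uniform cost per item; irregular workloads would need a dynamic or weighted scheme instead.

```cpp
#include <cassert>
#include <cstddef>

// Half-open index range [begin, end) owned by one rank.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Static block distribution: the first (n_items % n_ranks) ranks receive
// one extra item, so rank workloads differ by at most one item.
Range block_range(std::size_t n_items, int n_ranks, int rank) {
    assert(n_ranks > 0 && rank >= 0 && rank < n_ranks);
    const std::size_t ranks = static_cast<std::size_t>(n_ranks);
    const std::size_t r = static_cast<std::size_t>(rank);
    const std::size_t base = n_items / ranks;
    const std::size_t extra = n_items % ranks;
    const std::size_t begin = r * base + (r < extra ? r : extra);
    const std::size_t size = base + (r < extra ? 1 : 0);
    return Range{begin, begin + size};
}
```

In an MPI program, each process would typically call such a helper with its own rank and the communicator size, then work only on its own range.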
Intra-Node Parallelism
▪ Dual-socket system: most common datacenter node
▪ Cache Coherent Non-Uniform Memory Access (ccNUMA) system
▪ Shared-memory based parallel programming, addressed by multiple processes (e.g. MPI) and / or multiple threads per process (e.g. OpenMP)
▪ Possibility of accelerators connected via PCI or similar (e.g. CXL)
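The shared-memory model with multiple threads per process can be sketched in standard C++. The slides name OpenMP for this; the example below uses `std::thread` instead so that it needs no extra compiler support, and the function name `parallel_sum` is illustrative. All threads read the same vector and each writes only its own slot of a shared result array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` with `n_threads` threads that share the process's memory.
double parallel_sum(const std::vector<double>& data, unsigned n_threads) {
    assert(n_threads >= 1);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + n_threads - 1) / n_threads;  // ceil(n / n_threads)
    std::vector<double> partial(n_threads, 0.0);  // one result slot per thread
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Each thread reads its own slice and writes its own slot once.
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0.0);
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    // Sequential combine of the per-thread partial sums.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

On a ccNUMA node, where each thread runs relative to the memory holding its slice also matters, which is why the later slides discuss pinning.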
Core / Thread Level Parallelism
Thread-Level Parallelism with Simultaneous MultiThreading (SMT)
▪ X86 ISA Microprocessor SMT Core
▪ Partitioned front end with multiple buffers that keep the architectural state
▪ Shared back end with execution blocks
▪ SMT increases Instruction Level Parallelism (ILP) and complements Out-of-Order Execution
▪ Implicit programming model: threading
SMT also known as Hyperthreading
[Figure: core without SMT compared with an SMT core holding Architectural State A and Architectural State B]
Thread-Level Parallelism with Hyperthreading (SMT)
What can I do?
▪ Enable SMT in the BIOS
▪ Use pinning strategies to leverage / not leverage SMT
• OMP_PLACES / KMP_AFFINITY
• I_MPI_PIN_CELL etc.
▪ Effective SMT usage requires instruction diversity between threads
Which Tools?
▪ Identify issues with the VTune™ Profiler
• Microarchitecture Analysis
▪ MKL BLAS3 routines leverage SMT
▪ Use the Pinning Simulator for Intel® MPI Library
▪ Use the Intel Compiler OpenMP Runtime environment variables / APIs for pinning
Instruction-level Parallelism (ILP)
▪ X86 ISA Microprocessor SMT Core
▪ Complex Instruction Set Computer
(CISC) Frontend with Reduced
Instruction Set Computer (RISC)
Backend
▪ SuperScalar Out-of-Order Backend –
increasing ILP (Speculative Execution)
▪ Von Neumann Architecture (L1+) and
Harvard Architecture (L1) Core
▪ Pipelining (parallelism over time)
• Through Front-end & Back-end stages
• Through Execution Units (ALUs)
[Figure: core pipeline block diagram. Front end: 32KB L1 I$, pre-decode, instruction queue, decoders, branch prediction unit, μop cache, μop queue. Out-of-order (OOO) back end: scheduler, fill buffers.]
Instruction-level Parallelism (ILP)
What can I do?
▪ ILP is mostly done by the compiler: use the right compiler
▪ Effective ILP usage requires instruction diversity wherever possible
▪ Some instructions stall the pipeline longer than others, e.g. try to replace multiple divisions by one reciprocal and multiplications
Which Tools?
▪ Identify issues with the VTune™ Profiler
• Microarchitecture Analysis
▪ Intel® DPC++/C++ Compiler, Intel® C++ Compiler Classic, Intel® Fortran Compiler Classic, & Intel® Fortran Compiler (Beta)
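The advice about divisions can be sketched as follows: when many values are divided by the same divisor, compute the reciprocal once and multiply. The function name `scale_by_reciprocal` is illustrative. Note that for general floating-point divisors the result may differ from true division in the last bit, which is why compilers usually do not make this change on their own under strict floating-point settings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Divides every element by `divisor` using one division and
// values.size() multiplications instead of values.size() divisions.
void scale_by_reciprocal(std::vector<double>& values, double divisor) {
    assert(divisor != 0.0);
    const double inv = 1.0 / divisor;  // the only division
    for (std::size_t i = 0; i < values.size(); ++i) {
        values[i] *= inv;
    }
}
```

With a power-of-two divisor such as 4.0 the reciprocal is exact, so the results match plain division bit for bit; with other divisors, check whether the small rounding difference is acceptable for the application.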
Data Level Parallelism
[Figure: growth of SIMD vector width over time (64 bit, 256 bit, 512 bit; years shown: 1996, 1999, 2011), with instruction-set additions such as 64/128-bit vector integer, graphics manipulation, encryption and decryption, key generation, FMA, vector shifts, gather, string/XML processing, POP-count, and CRC. Inset: the loop for (i=0;i<MAX;i++) performs the additions for a[3], a[2], a[1], a[0] in one vector operation.]
Data Level Parallelism
What can I do?
▪ Use the most advanced Vector ISA available on your machine (the compiler default is SSE2)
▪ Add code pragmas to the non-vectorized loops in order to feed the compiler with domain knowledge
▪ Re-arrange code to achieve
• loops with known trip count at runtime
• branch-free loops to increase efficiency
▪ Tell the compiler to use full 512-bit vector width in case of compute heavy loops
Which Tools?
▪ Intel® DPC++/C++ Compiler, Intel® C++ Compiler Classic, Intel® Fortran Compiler Classic, & Intel® Fortran Compiler (Beta)
▪ Use the Intel® Advisor to
• Identify vector code candidates
• Determine vector efficiencies
• Determine code efficiencies with the Roofline Analysis
• Step-by-step guide towards efficient vector code
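The "re-arrange code" advice can be illustrated with a small C++ sketch. Both functions below are hypothetical examples, not taken from the slides: the first updates `y[i]` inside an `if`, the second expresses the same update as a select so that the loop body has no control flow and the trip count `n` is fixed on loop entry. Whether either version is actually vectorized depends on the compiler and its flags, so a vectorization report or Intel® Advisor should confirm it.

```cpp
#include <cassert>
#include <cstddef>

// Branchy form: the update is guarded by an if statement.
void axpy_positive_branchy(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (x[i] > 0.0) {
            y[i] += a * x[i];
        }
    }
}

// Branch-free form: both candidates are computed and one is selected,
// which maps naturally onto a vector blend / masked operation.
// A hint such as `#pragma omp simd` could be placed above the loop
// (assumption: requires OpenMP SIMD support to be enabled in the compiler).
void axpy_positive_branchfree(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const double updated = y[i] + a * x[i];
        y[i] = (x[i] > 0.0) ? updated : y[i];
    }
}
```

Both functions produce the same result for every element; the second simply states the computation in a form that is easier for a vectorizer to handle.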
Accelerator / Offloading Parallelism
▪ Offload Advisor: Efficiency
▪ VTune: GPU Offload Analysis
The Seven Levels of Parallelism
Parallelism on the Intel Architecture Supported by the Intel® oneAPI Base & HPC Toolkits
[Figure labels: Inter-Node Parallelism; Intel oneAPI Tools for HPC]
Performance Analysis Tools for Diagnosis
[Figure: decision flowchart (Intel® Parallel Studio). Decision nodes: "Effective threading?" and "Memory bandwidth sensitive?", each with Y/N branches. Actions: Thread, Vectorize, Optimize Bandwidth.]
Summary
Code Modernization is not always ‘easy & quick’ (a.k.a. “rewrite”).
Common methodology is to (1) analyze, then (2) optimize.
(Recorded) Tools Webinars
▪ Intel® oneAPI technical webinar (2 days), co-hosted with Zuse Institute Berlin
▪ oneAPI webinar on March 2nd and 3rd, 2021
▪ https://www.zib.de/workshops/2021/oneapi
▪ https://www.hlrn.de/doc/display/PUB/Joint+NHR@ZIB+-+INTEL+++oneAPI+Workshop
▪ Intel® oneAPI Rendering Toolkit webinar (1 day):
▪ Rendering 08.06.2021:
https://www.ai-spektrum.de/veranstaltungen/intel-oneapi-rendering-toolkit-workshop.html
▪ Make use of simulated data and visualize it in high-fidelity, photorealistic fashion
▪ Intel® oneAPI AI Analytics Toolkit webinar (1 day):
▪ AI webinar on 15.06.2021:
https://www.ai-spektrum.de/veranstaltungen/artificial-intelligence-on-intelr-platforms-using-intelr-oneapi-ai-analytics-toolkit-openvino-workshop.html
▪ Intel performance-optimized AI toolkits for classic and machine learning
▪ Windows:
https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-inspector/top/windows.html
https://www.intel.com/content/www/us/en/developer/community/parallel-universe-magazine/overview.html?s=Newest
https://www.intel.com/content/www/us/en/developer/overview.html