OneAPI Introduction and Parallelization

Uploaded by

José Adrián Munguía Rivera

Performance Optimization for Software Developers
with oneAPI & Parallelization on Intel® Architecture
Edmund Preiss
Business Development Manager
Software and Advanced Technology Group (SATG)
NOTICES AND DISCLAIMERS

Refer to https://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel
software products.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance
varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn
more at [intel.com].

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and
roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.
© Intel Corporation

SATG 2
Welcome to the Parallelism Era!

[Figures: microprocessor frequency over time (history); microprocessor frequency versus power consumption]
Source: © 2014, James Reinders, Intel, used with permission

… performance is not only the computer architect’s job anymore
… performance increase is increasingly the job of the software developer

Amdahl’s Law
“The speedup of a program
using multiple processors
in parallel computing is limited
by the sequential fraction of
the program.”

- Gene Amdahl

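Speedup = (s + p) / (s + p / N)
        = 1 / (s + p / N)
where s is the sequential fraction, p the parallel fraction (s + p = 1), and N the number of processors.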
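
The formula can be checked numerically with a few lines of C++. This is a minimal sketch; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the slides.

```cpp
#include <cassert>

// Amdahl's Law: speedup on n processors when a fraction s of the
// runtime is sequential and the remaining p = 1 - s is parallel.
// (Function names are hypothetical, chosen for this sketch.)
double amdahl_speedup(double s, int n) {
    assert(s >= 0.0 && s <= 1.0 && n >= 1);
    const double p = 1.0 - s;
    return 1.0 / (s + p / n);
}

// Upper bound on the speedup as n grows without limit: 1 / s.
double amdahl_limit(double s) {
    assert(s > 0.0 && s <= 1.0);
    return 1.0 / s;
}
```

For example, `amdahl_speedup(0.1, 8)` is roughly 4.7, and with a 10% sequential fraction no processor count can push the speedup past `amdahl_limit(0.1)`, which is 10.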

The ‘Free Lunch’ is over

“Parallelism == > Performance”

(leads to)

Optimization: it is the developer's job to make sure the above statement becomes reality!

Parallelism on the Intel Architecture

Developer

Tools

Architecture

Code Optimization Principles
Stage 1: Use Optimized Libraries

Stage 2: Compile with Architecture-specific Optimizations

Stage 3: Analysis and Tuning

Stage 4: Check Correctness

oneAPI – A Tools Development Framework

▪ Open Industry Specification
▪ Intel Product

Programming Challenges for Multiple Architectures

▪ Growth in specialized workloads
▪ Variety of data-centric hardware required
▪ Requires separate programming models and toolchains for each architecture
▪ Software development complexity limits freedom of architectural choice

[Diagram: Application Workloads Need Diverse Hardware. Workload types: Scalar, Vector, Spatial, Matrix. Middleware & Frameworks sit on top of a separate programming model for each of CPU, GPU, FPGA, and other accelerators (XPUs).]

Introducing oneAPI

Cross-architecture programming that delivers freedom to choose the best hardware
▪ Based on industry standards and open specifications
▪ Exposes cutting-edge performance features of latest hardware
▪ Compatible with existing high-performance languages and programming models including C++, OpenMP, Fortran, and MPI

[Diagram: Application Workloads Need Diverse Hardware. Workload types: Scalar, Vector, Spatial, Matrix. Middleware & Frameworks; oneAPI as Industry Initiative and Intel Product; XPUs: CPU, GPU, FPGA, other accelerators.]

oneAPI Industry Initiative
Break the Chains of Proprietary Lock-in

▪ A cross-architecture language based on C++ and SYCL standards
▪ Powerful libraries designed for acceleration of domain-specific functions
▪ Low-level hardware abstraction layer
▪ Open to promote community and industry collaboration
▪ Enables code reuse across architectures and vendors

The productive, smart path to freedom for accelerated computing from the economic and technical burdens of proprietary programming models

[Diagram: Application Workloads Need Diverse Hardware; Middleware & Frameworks; oneAPI Industry Specification. Direct Programming: Data Parallel C++ (DPC++). API-Based Programming: libraries for Math, Threading, Analytics/ML, DNN, ML Comm, Video Processing. Low-Level Hardware Interface. XPUs: CPU, GPU, FPGA, other accelerators.]

Visit oneapi.com for more details

Data Parallel C++
Standards-based, Cross-architectural Language
DPC++ = ISO C++ and Khronos SYCL

Parallelism, productivity, and performance for CPUs and accelerators
▪ Delivers accelerated computing by exposing hardware features
▪ Allows code reuse across hardware targets, while permitting custom tuning for specific accelerators
▪ Provides an open, cross-industry solution to single-architecture proprietary lock-in

Based on C++ and SYCL
▪ Delivers C++ productivity benefits, using common, familiar C and C++ constructs
▪ Incorporates SYCL from the Khronos Group to support data parallelism and heterogeneous programming

Community Project to drive language enhancements
▪ Provides extensions to simplify data parallel programming
▪ Continues evolution through open and cooperative development

Apply your skills to the next innovation, not to rewriting software for the next hardware platform

[Diagram: Direct Programming: Data Parallel C++ = Community Extensions + Khronos SYCL + ISO C++]

The open source and Intel DPC++/C++ compiler supports Intel CPUs, GPUs, and FPGAs.
Codeplay announced a DPC++ compiler that targets Nvidia GPUs.

Powerful oneAPI Libraries
▪ Designed for acceleration of key domain-specific functions
▪ Pre-optimized for each target platform for maximum performance

Libraries:
▪ Intel® oneAPI Math Kernel Library (oneMKL)
▪ Intel® oneAPI Deep Neural Network Library (oneDNN)
▪ Intel® oneAPI Video Processing Library (oneVPL)
▪ Intel® oneAPI Data Analytics Library (oneDAL)
▪ Intel® oneAPI Threading Building Blocks (oneTBB)
▪ Intel® oneAPI Collective Communications Library (oneCCL)
▪ Intel® oneAPI DPC++ Library (oneDPL)

oneAPI – A Tools Development Framework

▪ Open Industry Specification
▪ Intel Product

Intel® oneAPI Product
Built on a Rich Heritage of Tools Based on Intel® Xeon® Processors, Now Expanded to XPUs

A complete set of advanced compilers, libraries, and porting, analysis & debugger tools
▪ Accelerates compute by exploiting cutting-edge hardware features
▪ Interoperable with existing programming models and code bases (C++, Fortran, Python, OpenMP, etc.), so developers can be confident that existing applications work seamlessly with oneAPI
▪ Eases transitions to new systems and accelerators; using a single code base frees developers to invest more time on innovation

Available Now
Visit software.intel.com/oneapi for more details

[Diagram: Application Workloads Need Diverse Hardware; Middleware & Frameworks; Intel® oneAPI Product: Compatibility Tool, Languages, Libraries, Analysis & Debug Tools; Low-Level Hardware Interface; XPUs: CPU, GPU, FPGA.]

Some capabilities may differ per architecture and custom-tuning will still be required. Other accelerators to be supported in the future.

Intel Compiler Transition: Classic to LLVM
Start Your Migration Now

Intel® oneAPI DPC++/C++ Compiler
▪ Expand to XPUs
▪ Modern LLVM Infrastructure
▪ Use for all new projects
▪ Migrate legacy projects

Intel® Fortran Compiler (Beta)
▪ Test drive now & provide feedback
▪ Prepare for migration

[Charts: features & performance over time, with "Today" marked. C++: Intel® C++ Compiler Classic (CPU only) and Intel® oneAPI DPC++/C++ Compiler (CPU, GPU, FPGA). Fortran: Intel® Fortran Compiler Classic (CPU only) and Intel® Fortran Compiler (Beta) (CPU, GPU).]

Performance, Quality, and Support Continues

Intel® oneAPI Toolkits
A Complete Set of Proven Developer Tools Expanded from CPU to XPU

Native Code Developers
▪ A core set of high-performance tools for building C++, Data Parallel C++ applications & oneAPI library-based applications

Specialized Workloads
▪ Intel® oneAPI Tools for HPC: Deliver fast Fortran, OpenMP & MPI applications that scale
▪ Intel® oneAPI Tools for IoT: Build efficient, reliable solutions that run at network’s edge
▪ Intel® oneAPI Rendering Toolkit: Create performant, high-fidelity visualization applications

Data Scientists & AI Developers
▪ Intel® AI Analytics Toolkit: Accelerate machine learning & data science pipelines with optimized DL frameworks & high-performing Python libraries
▪ Intel® Distribution of OpenVINO™ Toolkit: Deploy high performance inference & applications from edge to cloud

Streamlining Product Line

▪ Composer Edition, Professional Edition -> Intel® oneAPI Base & HPC Toolkit, Single-Node
▪ Cluster Edition -> Intel® oneAPI Base & HPC Toolkit, Multi-Node

Hardware coverage:
▪ Editions: CPU Focused (Core & Xeon)
▪ oneAPI toolkits: CPU (Core & Xeon), GPU (Intel), FPGA (Intel)

Intel® oneAPI Tools for HPC: Deliver Fast Applications that Scale
(Intel® oneAPI Base & HPC Toolkit)

What is it?
A toolkit that adds to the Intel® oneAPI Base Toolkit for building high-performance, scalable parallel code on C++, Fortran, OpenMP & MPI from enterprise to cloud, and HPC to AI applications.

Who needs this product?
▪ OEMs/ISVs
▪ C++, Fortran, OpenMP, MPI Developers

Why is this important?
▪ Accelerate performance on Intel® Xeon® & Core™ Processors and Accelerators
▪ Deliver fast, scalable, reliable parallel code with less effort; built on industry standards

Components (Intel® oneAPI HPC Toolkit + Intel® oneAPI Base Toolkit)
▪ Direct Programming: Intel® C++ Compiler Classic; Intel® Fortran Compiler Classic; Intel® Fortran Compiler (Beta); Intel® oneAPI DPC++/C++ Compiler; Intel® DPC++ Compatibility Tool; Intel® Distribution for Python*; Intel® FPGA Add-on for oneAPI Base Toolkit
▪ API-Based Programming: Intel® MPI Library; Intel® oneAPI DPC++ Library; Intel® oneAPI Math Kernel Library; Intel® oneAPI Data Analytics Library; Intel® oneAPI Threading Building Blocks; Intel® oneAPI Video Processing Library; Intel® oneAPI Collective Communications Library; Intel® oneAPI Deep Neural Network Library; Intel® Integrated Performance Primitives
▪ Analysis & Debug Tools: Intel® Inspector; Intel® Trace Analyzer & Collector; Intel® Cluster Checker; Intel® VTune™ Profiler; Intel® Advisor; Intel® Distribution for GDB*

Learn More: intel.com/oneAPI-HPCKit

Intel® oneAPI Rendering Toolkit
Render Your Vision in Highest Fidelity

Powerful Libraries for High-Fidelity Visualization Applications
▪ Deliver high-performance, high-fidelity visualization applications on Intel® architecture
▪ Create amazing visual, hyper-realistic renderings via ray tracing with global illumination
▪ Access all system memory space to create renderings using the largest data sets
▪ Flexible, cost efficient development using open source libraries

Intel oneAPI Rendering & Ray Tracing Libraries
▪ Intel® Embree: High-Performance, Feature-Rich Ray Tracing & Photorealistic Rendering (image 1)
▪ Intel® Open Image Denoise: AI-Accelerated Denoiser for Superior Visual Quality (image 2)
▪ Intel® OpenSWR: High-Performance, Scalable, OpenGL*-Compatible Rasterizer (image 3)
▪ Intel® Open Volume Kernel Library: Render & Simulate 3D Spatial Data Processing (image 4)
▪ Intel® OSPRay: Scalable, Portable, Distributed Rendering API
▪ Intel® OSPRay Studio: Real-time rendering through a graphical user interface with this new scene graph application
▪ Intel® OSPRay for Hydra: Connect the Rendering Toolkit libraries to Universal Scene Description Hydra Rendering subsystem via plugin (image 5)

Intel® Embree, part of Intel® oneAPI Rendering Toolkit, won an Academy Award® Technical Achievement Award in 2021.

Image credits:
1 Avengers: Infinity War - Digital Domain, Marvel Studios, Chaos Group V-Ray
2 Scene courtesy of Frank Meinl
3 Model from Leigh Orf at University of Wisconsin. For more tornado visualization, visit Leigh Orf's site
4 Smoke volume, data courtesy OpenVDB example repository
5 Moana Island Scene, Walt Disney Animation Studios, publicly available dataset: 15fps+, ~160 billion prims

Learn More: intel.com/oneAPI-RenderKit

Intel® oneAPI AI Analytics Toolkit
Accelerate end-to-end AI and data analytics pipelines with libraries optimized for Intel® architectures

Who Uses It?
Data scientists, AI researchers, ML and DL developers, AI application developers

Top Features/Benefits
▪ Deep learning performance for training and inference with Intel optimized DL frameworks and tools
▪ Drop-in acceleration for data analytics and machine learning workflows with compute-intensive Python packages

Components
▪ Deep Learning: Intel® Optimization for TensorFlow; Intel® Optimization for PyTorch; Intel® Low Precision Optimization Tool; Model Zoo for Intel® Architecture
▪ Data Analytics & Machine Learning: Accelerated Data Frames (Intel® Distribution of Modin, OmniSci Backend); Intel® Distribution for Python (XGBoost, Scikit-learn, Daal-4Py, NumPy, SciPy, Pandas)
▪ Samples and End2End Workloads
▪ Supported Hardware Architectures (1): CPU, GPU

(1) Hardware support varies by individual tool. Architecture support will be expanded over time.
Other names and brands may be claimed as the property of others.

Get the Toolkit via: Intel Installer, Docker, Apt, Yum, Conda, Intel® DevCloud

Learn More: software.intel.com/oneapi/ai-kit

Intel® Distribution of OpenVINO™ toolkit
Powered by oneAPI

Deliver High-Performance Deep Learning Inference
A toolkit to accelerate development of high-performance deep learning inference & computer vision in vision/AI applications used from edge to cloud. It enables deep learning on hardware accelerators & easy deployment across Intel® CPUs, GPUs, FPGAs, VPUs.

Who needs this product?
▪ Computer vision, deep learning software developers
▪ Data scientists
▪ OEMs, ISVs, System Integrators

Usages
Security surveillance, robotics, retail, healthcare, AI, office automation, transportation, non-vision use cases (speech, NLP, audio, text) & more

Toolkit contents
▪ Deep Learning
• Intel® Deep Learning Deployment Toolkit: Model Optimizer (Convert & Optimize) -> IR -> Inference Engine (Optimized Inference); IR = Intermediate Representation file
• Open Model Zoo: 40+ Pretrained Models, Samples, Model Downloader
• Deep Learning Workbench: Calibration Tool, Model Analyzer, Benchmark App, Accuracy Checker, Aux. Capabilities
▪ Traditional Computer Vision
• Optimized Libraries & Code Samples: OpenCV, OpenVX, Samples (for Intel CPU & GPU/Intel® Processor Graphics)
▪ Tools & Libraries
• Increase Media/Video/Graphics Performance: Intel® Media SDK (Open Source Version), OpenCL™ Drivers & Runtimes (for GPU/Intel® Processor Graphics)
• Optimize Intel® FPGA (Linux* only): FPGA RunTime Environment (OpenCL™), Bitstreams

Edge AI & Vision Alliance

Parallelism on the Intel Architecture
Enabled by Intel® oneAPI Products

The Seven Levels of Parallelism

1 Inter-Node Parallelism
2 Intra-Node Parallelism
3 Core / Thread Level Parallelism
4 Thread-Level Parallelism with SMT
5 Instruction-level Parallelism
6 Data Level Parallelism
7 Accelerator / Offloading Parallelism

Inter-Node Parallelism

▪ Most common scenario of HPC / Cluster applications
▪ Nodes connected by fast interconnect called fabric
▪ Partitioned Global Address Space (PGAS) & Distributed Memory Programming
▪ Message Passing Interface (MPI) – most common approach

Inter-Node Parallelism

What can I do?
▪ Employ communication-avoiding algorithms (e.g. neighborhood collectives)
▪ Overlap compute and communication where possible
▪ Load balance work between ranks
▪ In case of well scaling application, add more nodes ☺

Which Tools?
▪ Identify scalability issues with the VTune™ Profiler Application Performance Snapshot & the Trace Analyzer and Collector
▪ Fine tune collective operations with Intel MPI – autotuner
▪ Where applicable
• Math Kernel Library (oneMKL)
• Data Analytics Library (oneDAL)
• Collective Communications Library (oneCCL)
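
Load balancing between ranks usually starts with an even static split of the work. The following is a minimal sketch of such a split in plain C++; `block_range` is a hypothetical helper written for this illustration, not part of MPI or any Intel library.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Returns the half-open index range [begin, end) owned by `rank` when
// `n` work items are split as evenly as possible over `ranks` ranks.
// The first (n % ranks) ranks receive one extra item, so the load of
// any two ranks differs by at most one item.
std::pair<std::size_t, std::size_t> block_range(std::size_t n,
                                                std::size_t ranks,
                                                std::size_t rank) {
    assert(ranks > 0 && rank < ranks);
    const std::size_t base = n / ranks;
    const std::size_t extra = n % ranks;
    const std::size_t begin = rank * base + (rank < extra ? rank : extra);
    const std::size_t size = base + (rank < extra ? 1 : 0);
    return {begin, begin + size};
}
```

In an MPI program, `rank` and `ranks` would typically come from `MPI_Comm_rank` and `MPI_Comm_size`. When the cost per item varies, a static split like this is only a starting point and the imbalance still has to be measured.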

Inter-Node Parallelism

[Screenshots: Application Performance Snapshot HTML export (left); VTune HPC Analysis (right)]

Intra-Node Parallelism

▪ Dual-Socket system – most common datacenter node
▪ Cache Coherent Non-Uniform Memory Access (ccNUMA) System
▪ Shared Memory based Parallel Programming – addressed by multiple Processes (e.g. MPI) and / or multiple Threads per Process (e.g. OpenMP)
▪ Possibility of accelerators connected via PCI or similar (e.g. CXL)

Intra-Node Parallelism

What can I do?
▪ Tune for best Memory Locality (NUMA optimization)
▪ Employ proper process pinning
▪ Avoid frequent involvement of Operating System (OS) Kernel, e.g. by using scalable memory allocators

Which Tools?
▪ Identify issues with the VTune™ Profiler
• Hotspots
• Memory Access
• HPC Performance Characterization
▪ Identify issues with the VTune™ Profiler Application Performance Snapshot
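
One common NUMA optimization is first-touch initialization. The sketch below assumes a first-touch page placement policy (the usual default on Linux), threads that are pinned, and a compiler with OpenMP enabled; without OpenMP the pragmas are ignored and the code runs serially with the same result. The function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Allocates n doubles without initializing them on the calling thread,
// then zero-fills them in parallel. Under a first-touch policy (assumed
// here), each page ends up on the NUMA node of the thread that wrote it.
std::unique_ptr<double[]> first_touch_alloc(std::ptrdiff_t n) {
    assert(n >= 0);
    std::unique_ptr<double[]> data(new double[static_cast<std::size_t>(n)]);
    double* p = data.get();
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) p[i] = 0.0;
    return data;
}

// Reads the array with the same static schedule, so each thread
// typically works on the part of the array it initialized.
double parallel_sum(const double* p, std::ptrdiff_t n) {
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : total)
    for (std::ptrdiff_t i = 0; i < n; ++i) total += p[i];
    return total;
}
```

A `std::vector<double>(n)` would zero the whole array on the constructing thread, which is why the sketch allocates raw storage and touches it inside the parallel loop.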

Core / Thread Level Parallelism

▪ Multi-Core CPU with 10s of cores interconnected by a ring or mesh
▪ Some levels of private cache (L1/L2) and shared cache (L3)
▪ Dynamically scaling core frequencies
▪ Shared Memory based Parallel Programming

Core / Thread Level Parallelism

What can I do?
▪ Tune for best Memory Access (Cache-Blocking, non-temporal stores, …)
▪ Load-Balance Threading
▪ Utilize performance optimized libraries wherever possible

Which Tools?
▪ Identify issues with the VTune™ Profiler
• Hotspots
▪ Math Kernel Library
▪ Threading Building Blocks
▪ Intel C++ / Fortran Compiler – OpenMP Runtime
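
Cache blocking can be illustrated with a matrix transpose that works tile by tile. This is a minimal sketch; the function name and the default tile size of 32 are assumptions for illustration, and a good tile size depends on the cache sizes of the target CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes an n x n row-major matrix one tile at a time, so that the
// rows of `in` and `out` touched by a tile can stay in cache while the
// tile is processed.
std::vector<double> transpose_blocked(const std::vector<double>& in,
                                      std::size_t n,
                                      std::size_t block = 32) {
    assert(in.size() == n * n && block > 0);
    std::vector<double> out(n * n);
    for (std::size_t ii = 0; ii < n; ii += block) {
        for (std::size_t jj = 0; jj < n; jj += block) {
            const std::size_t i_end = std::min(ii + block, n);
            const std::size_t j_end = std::min(jj + block, n);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    out[j * n + i] = in[i * n + j];
        }
    }
    return out;
}
```

The result is identical to a plain two-loop transpose; only the order of memory accesses changes.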

Thread-Level Parallelism with Simultaneous MultiThreading (SMT)

▪ X86 ISA Microprocessor SMT Core
▪ Partitioned Front End with multiple buffers that keep the architectural state
▪ Shared Backend with execution blocks
▪ SMT drives an increase in Instruction Level Parallelism (ILP), while complementing Out of Order Execution
▪ Implicit Programming Model - Threading

SMT also known as Hyperthreading

[Diagram: execution port usage over time (processor cycles) without SMT, for two architectural states A and B that share one scheduler and execution ports 0 to 7. Port functions shown: integer and vector ALU, Shift, LEA, JMP, MUL, FMA, DIV, Shuffle, Load, Store Data, Store Address, Store Buffer, Memory Control.]

Thread-Level Parallelism with Hyperthreading (SMT)

What can I do?
▪ Enable SMT in the BIOS
▪ Use pinning strategies to leverage / not leverage SMT
• OMP_PLACES / KMP_AFFINITY
• I_MPI_PIN_CELL etc.
▪ Effective SMT usage requires instruction diversity between threads

Which Tools?
▪ Identify issues with the VTune™ Profiler
• Microarchitecture Analysis
▪ MKL BLAS3 routines leverage SMT
▪ Use the Pinning Simulator for Intel® MPI Library
▪ Use the Intel Compiler OpenMP Runtime environment variables / APIs for pinning

Instruction-level Parallelism (ILP)

▪ X86 ISA Microprocessor SMT Core
▪ Complex Instruction Set Computer (CISC) Frontend with Reduced Instruction Set Computer (RISC) Backend
▪ SuperScalar Out-of-Order Backend – increasing ILP (Speculative Execution)
▪ Von Neumann Architecture (L1+) and Harvard Architecture (L1) Core
▪ Pipelining (parallelism over time)
• Through Front-end & Back-end stages
• Through Execution Units (ALUs)

Instruction-level Parallelism (ILP)

[Diagram: core pipeline. Front End: 32KB L1 I$, Pre decode, Inst Q, Decoders, Branch Prediction Unit, μop Cache, μop Queue. In order: Allocate/Rename/Retire with Load Buffer, Store Buffer, Reorder Buffer. OOO: Scheduler feeding Ports 0 to 7 (INT and VEC units: ALU, Shift, LEA, JMP, MUL, FMA, Shuffle, DIV; Store Data, Load/STA, STA). Memory: Memory Control, Fill Buffers, 32KB L1 D$, 1MB L2$.]

Instruction-level Parallelism (ILP)

What can I do?
▪ ILP is mostly done by the Compiler – use the right compiler
▪ Effective ILP usage requires instruction diversity wherever possible
▪ Some instructions stall the pipeline more than others – e.g. try to replace multiple divisions by one reciprocal and multiplications

Which Tools?
▪ Identify issues with the VTune™ Profiler
• Microarchitecture Analysis
▪ Intel® DPC++/C++ Compiler, Intel® C++ Compiler Classic, Intel® Fortran Compiler Classic, & Intel® Fortran Compiler (Beta)
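
The reciprocal advice can be sketched as follows: when many values are divided by the same divisor, compute the reciprocal once and multiply. The function name is illustrative.

```cpp
#include <cassert>
#include <vector>

// Divides every element of v by the same divisor d using one division
// (for the reciprocal) followed by one multiplication per element.
void scale_by_reciprocal(std::vector<double>& v, double d) {
    assert(d != 0.0);
    const double inv = 1.0 / d;   // the only division
    for (double& x : v) x *= inv; // multiplications are typically cheaper
}
```

Note that `x * (1.0 / d)` is not always bit-identical to `x / d`, because the reciprocal is rounded before the multiplication. That is one reason compilers usually apply this rewrite on their own only under relaxed floating-point settings.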

Data Level Parallelism

▪ X86 ISA Extensions
▪ Flynn's taxonomy: Single Instruction Multiple Data (SIMD)
▪ Code candidates: Loops – to split iterations in chunks of SIMD width – reducing the trip count
▪ Vectorization can happen implicitly, with the compiler generating the vector instructions

[Diagram: Fused Multiply Add (FMA) as an AVX-512 vector instruction, D = A x B + C, applied to eight FP64 elements per operand at once]
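
A fused multiply-add computes a × b + c with a single rounding. The sketch below uses the standard `std::fma` in a simple loop over arrays; whether the compiler turns the loop into vector FMA instructions depends on the target ISA and compiler options, which are not specified here. The function name is illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// d[i] = a[i] * b[i] + c[i], with one rounding per element.
std::vector<double> fma_arrays(const std::vector<double>& a,
                               const std::vector<double>& b,
                               const std::vector<double>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<double> d(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        d[i] = std::fma(a[i], b[i], c[i]);
    return d;
}
```

With 512-bit vectors, eight FP64 elements fit into one register, so one vector FMA can cover eight iterations of this loop.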

Data Level Parallelism

[Timeline of SIMD extensions]
64 bit
▪ 1996 MMX: 57 instr; integer types only; 8 registers
128 bit
▪ 1999 SSE: 70 instr; single-precision vectors; streaming operations
▪ 2001 SSE2: 144 instr; double-precision vectors; 8/16/32/64/128-bit vector integer
▪ 2004 SSE3: 13 instr; complex data
▪ 2006 SSSE3: 32 instr; decode
▪ 2007 SSE4.1: 47 instr; video; graphics building blocks; advanced vector instr
▪ 2008 SSE4.2: 8 instr; string/XML processing; POP-Count; CRC
▪ 2009 AES-NI: 7 instr; encryption; decryption; key generation
256 bit
▪ 2011 AVX: ~100 new instr.; ~300 legacy SSE instr updated; 256-bit vector; 3 and 4-operand instructions
▪ 2012 AVX2: Int. AVX expands to 256 bit; improved bit manip.; fma; vector shifts; gather
512 bit
▪ 2013 AVX-512: 512-bit vector; 16 new 512-bit registers; 8 opmask registers

for (i=0;i<MAX;i++)
    c[i]=a[i]+b[i];

[Diagram: one SIMD add computes c[3], c[2], c[1], c[0] from a[3..0] + b[3..0] in a single instruction]

Data Level Parallelism

Vectorization approaches, from ease of use (top) to programmer control (bottom):
▪ Performance Libraries (e.g. IPP and MKL)
▪ Compiler: Fully automatic vectorization
▪ Compiler: Auto vectorization hints (#pragma ivdep, …)
▪ Explicit vectorization with OpenMP* 4.0 and later (SIMD Directive)
▪ SIMD intrinsic class (F32vec4 add)
▪ Vector intrinsic (_mm_add_ps())
▪ Assembler code (addps)
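
The middle of this list, explicit vectorization with the OpenMP SIMD directive, could look like the sketch below for the c[i] = a[i] + b[i] loop. It assumes a compiler with OpenMP SIMD support switched on (for example `-fopenmp-simd` with GCC or Clang, `-qopenmp-simd` with the Intel compilers); otherwise the pragma is ignored and the loop still produces the same result. The function name is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise sum c[i] = a[i] + b[i]. The SIMD directive tells the
// compiler that the iterations are independent and may be executed
// in chunks of the SIMD width.
std::vector<float> add_simd(const std::vector<float>& a,
                            const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    const float* pa = a.data();
    const float* pb = b.data();
    float* pc = c.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    #pragma omp simd
    for (std::ptrdiff_t i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];
    return c;
}
```

Moving further down the list, the same loop body would be written with the `_mm_add_ps` intrinsic, which adds four packed single-precision values per call.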

Data Level Parallelism

What can I do?
▪ Use the most advanced Vector ISA available on your machine – the compiler default is SSE2
▪ Add code pragmas to the non-vectorized loops in order to feed compiler with domain knowledge
▪ Re-arrange code to achieve
• loops with known trip count at runtime
• branch-free loops to increase efficiency
▪ Tell the compiler to use full 512bit vector width in case of compute heavy loops

Which Tools?
▪ Intel® DPC++/C++ Compiler, Intel® C++ Compiler Classic, Intel® Fortran Compiler Classic, & Intel® Fortran Compiler (Beta)
▪ Use the Intel® Advisor to
• Identify vector code candidates
• Determine vector efficiencies
• Determine code efficiencies with the Roofline Analysis
• Step-by-Step guide towards efficient vector code

Programming Guidelines for Vectorization
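
As an illustration of the branch-free advice, the sketch below counts elements above a threshold without an `if` in the loop body: the comparison result (0 or 1) is added directly. The trip count is known when the loop starts, which also matches the first sub-point. The function name is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the elements of x that are greater than threshold.
// The loop body has no branch: the boolean comparison result is
// converted to 0 or 1 and accumulated.
std::size_t count_above(const std::vector<float>& x, float threshold) {
    std::size_t count = 0;
    const std::size_t n = x.size();  // trip count fixed before the loop
    for (std::size_t i = 0; i < n; ++i)
        count += static_cast<std::size_t>(x[i] > threshold);
    return count;
}
```

Whether a given compiler vectorizes this loop still has to be checked, for example with a vectorization report or a tool such as Intel® Advisor.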

Data Level Parallelism

[Screenshots: Advisor Vectorized Loop Summary (top left); Advisor Roofline (bottom right)]

Data Level Parallelism – legacy instructions
The X87 Coprocessor
▪ FP extension of the 8086 ISA - later (80486) integrated
▪ Today, x87 instructions are rarely used. If they are, however, numerical results might differ (80-bit intermediate precision)!
▪ Avoid these instructions in favor of reproducible results for both scalar & vector code

[Image caption: What is / was that?]

Accelerator / Offloading Parallelism

▪ GPGPU – denser compute
▪ Slice -> SubSlice -> Execution Units (EUs)
▪ Each EU has
• 8 Threads (similar to SMT)
• 28KB Register File
• 2x 4-wide SIMD units
▪ Offload based Programming enabled by oneAPI

Accelerator / Offloading Parallelism

Intel® oneAPI Accelerator Programming
▪ C/C++
• Intel® DPC++ Compiler
• Intel® C++ Compiler with OpenMP Offloading
▪ Fortran
• Intel® Fortran Compiler with OpenMP Offloading

Data Parallel C++ (DPC++) - Hello World

    #include <CL/sycl.hpp>  // newer releases use <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>
    using namespace sycl;

    int main() {
      std::vector<float> A(1024), B(1024), C(1024);
      // some data initialization
      {
        // Buffers manage the host vectors for the lifetime of this scope.
        buffer bufA {A}, bufB {B}, bufC {C};
        queue q;
        q.submit([&](handler &h) {
          auto A = bufA.get_access(h, read_only);
          auto B = bufB.get_access(h, read_only);
          auto C = bufC.get_access(h, write_only);
          // One work-item per element.
          h.parallel_for(1024, [=](auto i){
            C[i] = A[i] + B[i];
          });
        });
      } // buffer destruction waits for the kernel and writes results back to C
      for (int i = 0; i < 1024; i++)
        std::cout << "C[" << i << "] = " << C[i] << std::endl;
    }

Accelerator / Offloading Parallelism

What can I do?
▪ Enable your code today for Intel Integrated Graphics GPUs and upcoming datacenter GPGPUs
▪ Reduce data transfers between host and device
▪ Minimize communication / synchronization in between individual threads

Which Tools? – Intel® oneAPI
▪ Intel® DPC++/C++ Compiler & Intel® Fortran Compiler (Beta)
▪ Intel® oneMKL
▪ Use the Intel® Advisor to
• Find offloading candidates with the Offload Advisor
• Estimate code speedup of GPU with the Offload Advisor
▪ Use Intel® VTune™ Profiler to
• Analyze the offload performance

Accelerator / Offloading Parallelism

[Screenshots: Offload Advisor – Efficiency; VTune – GPU Offload Analysis]

The Seven Levels of Parallelism
Parallelism on the Intel Architecture Supported by the Intel® oneAPI Base & HPC Toolkits

▪ Inter-Node Parallelism
▪ Intra-Node Parallelism
▪ Core / Thread Level Parallelism
▪ Thread-Level Parallelism with SMT
▪ Instruction-level Parallelism
▪ Data Level Parallelism
▪ Accelerator / Offloading Parallelism

Intel oneAPI Tools for HPC (Intel® oneAPI HPC Toolkit + Intel® oneAPI Base Toolkit)
▪ Direct Programming: Intel® C++ Compiler Classic; Intel® Fortran Compiler Classic; Intel® Fortran Compiler (Beta); Intel® oneAPI DPC++/C++ Compiler; Intel® DPC++ Compatibility Tool; Intel® Distribution for Python*; Intel® FPGA Add-on for oneAPI Base Toolkit
▪ API-Based Programming: Intel® MPI Library; Intel® oneAPI DPC++ Library; Intel® oneAPI Math Kernel Library; Intel® oneAPI Data Analytics Library; Intel® oneAPI Threading Building Blocks; Intel® oneAPI Video Processing Library; Intel® oneAPI Collective Communications Library; Intel® oneAPI Deep Neural Network Library; Intel® Integrated Performance Primitives
▪ Analysis Tools: Intel® Inspector; Intel® Trace Analyzer & Collector; Intel® Cluster Checker; Intel® VTune™ Profiler; Intel® Advisor; Intel® Distribution for GDB*

Performance Analysis Tools for Diagnosis
Intel® Parallel Studio

[Flowchart, starting from Intel® VTune™ Amplifier’s Application Performance Snapshot]
▪ Cluster scalable?
• No: Tune MPI with Intel® Trace Analyzer & Collector and Intel® MPI Tuner
• Yes: continue with the next question
▪ Effective threading?
• No: Thread, using Intel® VTune™ Profiler
• Yes: continue with the next question
▪ Memory bandwidth sensitive?
• Yes: Optimize bandwidth, using Intel® VTune™ Profiler
• No: Vectorize, using Intel® Advisor

Summary
Code Modernization is not always ‘easy & quick’ (a.k.a. “rewrite”).
Common methodology is to (1) analyze, then (2) optimize.

Two themes consistently re-occur – Task and Data Parallelism.

To achieve best CPU performance, make sure your programs are efficiently vectorized and well parallelized.
The latter in particular also applies to GPUs.

(Recorded) Tools Webinars
▪ Intel® oneAPI technical webinar (2 days), co-hosted with Zuse Institute Berlin
• oneAPI Webinar on March 2nd and 3rd, 2021
• https://www.zib.de/workshops/2021/oneapi
• https://www.hlrn.de/doc/display/PUB/Joint+NHR@ZIB+-+INTEL+++oneAPI+Workshop
▪ Intel® oneAPI Rendering Toolkit webinar (1 day)
• Rendering 08.06.2021: https://www.ai-spektrum.de/veranstaltungen/intel-oneapi-rendering-toolkit-workshop.html
• Make use of simulated data and visualize them in high-fidelity, photorealistic fashion
▪ Intel® oneAPI AI Analytics Toolkit webinar (1 day)
• AI webinar on 15.06.2021: https://www.ai-spektrum.de/veranstaltungen/artificial-intelligence-on-intelr-platforms-using-intelr-oneapi-ai-analytics-toolkit-openvino-workshop.html
• Intel performance optimized AI toolkits for classic and machine learning

*Other names and brands may be claimed as the property of others.
Selected Links: Intel Tools User's Manuals
• Intel® C++ Compiler Classic Developer Guide and Reference
https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top.html

• Intel Data Parallel C++
https://link.springer.com/content/pdf/10.1007%2F978-1-4842-5574-2.pdf

• Intel VTune Cookbook
This Cookbook introduces methodologies and use-case recipes to analyse the performance of your code with VTune Profiler, a tool that helps you identify ineffective algorithm and hardware usage and provides tuning advice.
https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top.html

• Intel® Advisor Cookbook
The Intel® Advisor Cookbook contains step-by-step instructions to help effectively use more cores, vectorization, or heterogeneous processing using Intel Advisor.
https://www.intel.com/content/www/us/en/develop/documentation/advisor-cookbook/top.html

• Intel Inspector User Guide(s) for Linux and for Windows
Get Started with Intel® Inspector - Windows* OS. This document explains how to get started with the Intel® Inspector workflows.
▪ Linux:
https://www.intel.com/content/www/us/en/develop/documentation/inspector-user-guide-linux/top.html
▪ Windows:
https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-inspector/top/windows.html

• Microsoft Visual Studio* Integration of Intel VTune
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/launch/microsoft-visual-studio-integration.html

*Other names and brands may be claimed as the property of others.
Other Useful Links
• Intel Software Developer Manuals
These manuals describe the architecture and programming environment of the Intel® 64 and IA-32 architectures.

Intel® 64 and IA-32 Architectures Software Developer Manuals

• Intel Software Development Tools Magazine ‘Intel Parallel Universe’

https://www.intel.com/content/www/us/en/developer/community/parallel-universe-magazine/overview.html?s=Newest

• Intel Developer Zone

https://www.intel.com/content/www/us/en/developer/overview.html

QUESTIONS?
