03 Computing With DSPs and AI Engines
2. AI Engine Overview
[Public]
Sum of Products
y[n] = \sum_{k=0}^{7} c_k \cdot x[n-k]
[Figure: sum-of-products structure with eight coefficients c0 to c7 producing the output y[n]]
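The sum of products above can be sketched in a few lines of Python. This is a minimal illustration, not how a DSP block or AI Engine computes it: the function name, the coefficient values, and the choice to treat samples before index 0 as zero are all assumptions made for the example.

```python
def sum_of_products(coeffs, samples, n):
    """Return y[n] = sum over k of coeffs[k] * samples[n - k].

    Samples outside the available range are treated as zero (an assumption
    made for this sketch).
    """
    total = 0
    for k, c in enumerate(coeffs):
        if 0 <= n - k < len(samples):
            total += c * samples[n - k]
    return total


# Hypothetical coefficients c0..c7 and a unit impulse as input.
coeffs = [1, 2, 3, 4, 5, 6, 7, 8]
samples = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Feeding an impulse returns the coefficients in order, followed by zeros.
impulse_response = [sum_of_products(coeffs, samples, n) for n in range(len(samples))]
print(impulse_response)
```

Each output sample needs eight multiplies and seven additions, which is the multiply-accumulate pattern that DSP blocks and AI Engines are built to run in parallel.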
https://docs.amd.com/v/u/en-US/ug579-ultrascale-dsp
• Faster throughput
  • More results in a narrower time slot (e.g., higher frames per second)
• Lower latency
  • First output available in a shorter time span (e.g., 100 ms -> 10 ms)
• Higher density
  • Larger image resolutions, more antennas, more cameras, etc.
• Higher accuracy, lower errors
  • More complex algorithms
• Increasing AI (inference) content in applications
  • Object (car, person, etc.) detection
  • Modulation detection
  • Adaptive beamforming
AI Engine Overview
[Figure: device floorplans of the 16nm Generation (Zynq® UltraScale+ MPSoC) and the 7nm Generation. Both show PL regions surrounded by GT and IO blocks with a Processing System; the 7nm Generation also shows a PMC and an AI Engine Array.]
[Figure: a cache-based architecture (Block 0, Block 1, cores D0 with L0, L1 and L2 levels, DRAM) compared with an AI Engine array (a grid of AI Engine and Memory tiles)]
Cache-based architecture
Fixed, shared Interconnect
• Blocking limits compute
• Timing not deterministic
Data Replicated
• Robs bandwidth
• Reduces capacity
AI Engine array
Interconnect
• Non-blocking
• Deterministic
Local, Distributed Memory
• No cache misses
• Higher bandwidth
• Less capacity required
AI Engine Tile
AI Engine Tiles and Kernels
https://www.xilinx.com/products/technology/ai-engine.html
https://docs.xilinx.com/r/en-US/am009-versal-ai-engine/AI-Engine-Architecture
AI Engine Evolution
[Figure: AI Engine variants arranged by target application: AIE and AIEv2, and the ML-optimized AIE-ML and AIE-MLv2 for machine learning]
Feature | AIE | AIE-ML | AIE-MLv2
Data Type Native Support | int8/16/32, cint16/32, FP32 | int8/16/32, cint16, bfloat16 | int8/16/32, cint16, bfloat16, MX6, MX9
Interface Tiles | PL or NoC interface tiles | PL or NoC interface tiles | Single type of interface tile (PL & NoC)
• The AI Engine tiles are configured to form a modified Kahn process network
• Each kernel within a tile executes when its inputs become available
• The program code in each tile is executed sequentially
• Multiple kernels can be placed on a tile
• Multiple tiles can execute in parallel
• Tiles communicate through bounded channels (stream or memory)
  • Unbounded (i.e., infinite) channels cannot be realized in hardware
• Reading from and writing to a channel is a blocking process
  • Execution stalls when attempting to read from an empty channel or write to a full channel
• Processes are deterministic
  • Same input always produces exactly the same output
The presence of data to be read and/or space for data to be written determines the order of execution
[Figure: dataflow graph with four tasks T1 to T4, external inputs a, b and c, internal channels d, e, f and g, and output h]
Data flow:
➢ When inputs a, b and c arrive simultaneously
  • T2, T3 and T4 are stalled, waiting for all inputs
  • Only T1 executes to produce d and e
➢ T2 and T3 execute in parallel after T1 to produce f and g
➢ T4 executes after receiving f and g to produce h
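The firing order described above can be reproduced with a small Python sketch of the dataflow rule: a task runs only when every input channel holds data and every output channel has space. The wiring below (b into T1, a and d into T2, c and e into T3, f and g into T4) is an assumption inferred from the description, and the scheduler itself is a hypothetical illustration rather than how the AI Engine tools schedule kernels.

```python
from collections import deque

# Assumed wiring: task name -> (input channels, output channels).
TASKS = {
    "T1": (["b"], ["d", "e"]),
    "T2": (["a", "d"], ["f"]),
    "T3": (["c", "e"], ["g"]),
    "T4": (["f", "g"], ["h"]),
}


def run(initial_tokens, capacity=1):
    """Fire tasks step by step; tasks listed in the same step run in parallel."""
    channels = {name: deque() for ins, outs in TASKS.values() for name in ins + outs}
    for name in initial_tokens:
        channels[name].append(name)

    schedule = []
    while True:
        # A task is ready when all inputs hold data and all outputs have space.
        ready = [
            task
            for task, (ins, outs) in TASKS.items()
            if all(channels[i] for i in ins)
            and all(len(channels[o]) < capacity for o in outs)
        ]
        if not ready:
            break
        for task in ready:  # consume one token from every input
            for i in TASKS[task][0]:
                channels[i].popleft()
        for task in ready:  # produce one token on every output
            for o in TASKS[task][1]:
                channels[o].append(o)
        schedule.append(ready)
    return schedule, channels


schedule, channels = run(["a", "b", "c"])
print(schedule)
```

Because firing depends only on which channels hold data, the same inputs give the same schedule on every run, which is the determinism property listed in the bullets.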
Number of multiply-accumulate operations per cycle per tile
https://docs.xilinx.com/r/en-US/am009-versal-ai-engine/Functional-Overview
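The VC1902 has 400 AI Engine tiles. If the AIE array runs at 1.3 GHz, the peak theoretical compute capability is
400 tiles * 256 OPs/cycle/tile * 1.3e9 cycles/sec = 133.12e12 int8 OPs/sec ≈ 133 int8 TOPS
(256 OPs/cycle corresponds to 128 int8 multiply-accumulates per cycle per tile, with each multiply-accumulate counted as two operations.)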
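The same arithmetic can be written as a short Python helper, which makes it easy to try other tile counts or clock rates. The function and variable names are hypothetical; the numbers are the ones quoted above.

```python
def peak_tops(num_tiles, ops_per_cycle_per_tile, clock_hz):
    """Peak theoretical throughput in tera-operations per second (TOPS)."""
    return num_tiles * ops_per_cycle_per_tile * clock_hz / 1e12


# VC1902 figures from the text: 400 tiles, 256 int8 OPs/cycle/tile, 1.3 GHz.
vc1902_int8_tops = peak_tops(400, 256, 1.3e9)
print(f"{vc1902_int8_tops:.2f} int8 TOPS")
```

This is a peak figure: it assumes every tile issues a full-width int8 operation on every cycle, so sustained throughput in a real design will be lower.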
[Figure: a dataflow graph mapped onto AI Engine and memory tiles. Labels: Dataflow Graph, Streaming, Streaming Multicast, Cascade Streaming, Memory Interface, Stream Interface, Cascade Interface.]
[Figure: rows of AI Engine and Memory tiles above a row of switches, with AXI-MM connections to the PS / PMC, the NoC, and external DRAM]
• AI Engine to programmable logic
• AI Engine to NoC
• PS manages config / debug / trace
AI Engine to PL Interface
Communication Direction | #AXI Stream per Column | Bandwidth per Column | Bandwidth on VC1902
PL → AIE interface array (→ North) | 8 | 32 GB/s | ~1.3 TB/s
AIE interface array → PL (→ South) | 6 | 24 GB/s | ~1 TB/s
(Some columns are not available)
Note: Bandwidth is calculated for a 1 GHz AI Engine clock at the -1L speed grade (0.7 V); higher bandwidth is available with faster speed grades.
Note: The VC1902 has 50 columns, of which 39 are connected to the PL.
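The bandwidth figures above can be cross-checked with a short Python calculation. The 4 bytes per stream per cycle is an assumption derived from the quoted numbers (32 GB/s divided by 8 streams at 1 GHz), and the function names are hypothetical.

```python
BYTES_PER_STREAM_PER_CYCLE = 4  # assumption: 32 GB/s / 8 streams / 1 GHz


def column_bandwidth_gb_s(streams_per_column, clock_hz):
    """Bandwidth of one interface column in GB/s."""
    return streams_per_column * BYTES_PER_STREAM_PER_CYCLE * clock_hz / 1e9


def device_bandwidth_tb_s(streams_per_column, clock_hz, connected_columns):
    """Aggregate bandwidth over all PL-connected columns in TB/s."""
    return column_bandwidth_gb_s(streams_per_column, clock_hz) * connected_columns / 1e3


CLOCK_HZ = 1e9          # 1 GHz AI Engine clock, as in the note
CONNECTED_COLUMNS = 39  # VC1902 columns connected to the PL

north_tb_s = device_bandwidth_tb_s(8, CLOCK_HZ, CONNECTED_COLUMNS)
south_tb_s = device_bandwidth_tb_s(6, CLOCK_HZ, CONNECTED_COLUMNS)
print(f"North: {north_tb_s:.3f} TB/s, South: {south_tb_s:.3f} TB/s")
```

The results, 1.248 TB/s and 0.936 TB/s, round to the ~1.3 TB/s and ~1 TB/s quoted for the VC1902.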
Latency 48 us 7.5 us
Development Tools
• DSP
  • Vivado (Verilog, VHDL)
  • Vitis HLS
  • Vitis Model Composer
• AI Engine
  • Vitis
  • Vitis Model Composer
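[Figure: Vitis design flow. Design sources: AIE Kernels and Graph, PL Kernels (HLS), RTL Kernels, and the PS App (XRT, Graph API). Verification: AIE Simulation, HLS Cosimulation, RTL Verification. Platforms: Vitis HW Platform (Fixed.xsa from the Vivado HW Build, with Timing Closure) and Vitis SW Platform (AIE driver, Linux® + rootfs). Other labels: SIM Build, SSW, VMA.]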
Summary
DSP blocks are the “traditional” way of implementing math operations on programmable logic
DSP48 on UltraScale -> DSP58 on Versal
Allows fine-grained bit-width selection up to the maximum supported width
Endnotes
VER-045: Based on 3rd party benchmark testing commissioned by AMD in February 2024, on the AMD
Versal adaptive SoC with AMD Vitis for AI design tool versus traditional programmable software
implementation with Vivado software and Vitis Model Composer tool, version 2023.1 in a signal processing
application FIR implementation. Results will vary depending on design specifications. (VER-45).
VER-046: Based on 3rd party benchmark testing commissioned by AMD in February 2024, on the AMD
Versal adaptive SoC with AMD Vitis for AI design tool versus traditional programmable software
implementation with Vivado software and Vitis Model Composer tool, version 2023.1 in a signal processing
application FIR implementation. Results will vary depending on design specifications. (VER-46)
© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Artix, Kintex, Kria, Spartan, UltraScale+, Versal,
Vitis, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of Advanced Micro Devices, Inc. Other product names used in this
publication are for identification purposes only and may be trademarks of their respective owners. Certain AMD technologies may require third-party enablement or
activation. Supported features may vary by operating system. Please confirm with the system manufacturer for specific features. No technology or product can be
completely secure.