
Computing with

DSPs and AI Engines


Louie Valeña
Algorithm Specialist
Adaptive and Embedded Computing Group
[Public]

Agenda

1. DSP Slice Review
2. AI Engine Overview

DSP Slice Review


Sum of Products

An N-tap FIR filter forms each output as a weighted sum of the current and past input samples:

$y[n] = \sum_{k=0}^{N-1} c_k \cdot x[n-k]$

[Diagram: 8-tap transversal structure — x[n] feeds a chain of seven z⁻¹ delay elements; the eight taps are scaled by coefficients c0–c7 and summed to produce y[n]]
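The sum of products above can be sketched in software as a direct-form FIR filter. This is a minimal scalar C++ reference model for illustration — not AI engine kernel code:

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR: y[n] = sum_{k=0}^{N-1} c[k] * x[n-k],
// treating samples before the start of the input as zero.
std::vector<double> fir(const std::vector<double>& c,
                        const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < c.size() && k <= n; ++k) {
            y[n] += c[k] * x[n - k];  // one multiply-accumulate per tap
        }
    }
    return y;
}
```

A quick sanity check: feeding a unit impulse through the filter returns the coefficient sequence itself, since only one tap is nonzero at each step.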


DSP48E2 Slice in UltraScale+ Devices

https://docs.amd.com/v/u/en-US/ug579-ultrascale-dsp


FIR Filter Implemented with DSP48E2 Slices

• 8 DSP48E2 slices implement an 8-tap FIR filter
• No fabric resources used
• Note the timing at which the coefficients are applied
• Throughput is limited by the clock frequency

https://docs.amd.com/v/u/en-US/ug579-ultrascale-dsp

What is Driving the Need for More Compute?

• Faster throughput
  • More results in a narrower time slot (e.g., higher frames per second)
• Lower latency
  • First output available in a shorter time span (e.g., 100 ms → 10 ms)
• Higher density
  • Larger image resolutions, more antennas, more cameras, etc.
• Higher accuracy, lower errors
  • More complex algorithms
• Increasing AI (inference) content in applications
  • Object (car, person, etc.) detection
  • Modulation detection
  • Adaptive beamforming


AI Engine Overview


Conceptual Device with High Compute

• Applications need (1) cost reduction, (2) power reduction, (3) more compute, (4) more programmability
• How does AMD meet the demands of these evolving applications?

[Diagram: 16nm generation (Zynq® UltraScale+ MPSoC) — a processing system beside an array of programmable logic (PL) blocks, ringed by GT and IO; 7nm generation — a processing system & PMC, PL blocks, and GT/IO, with an AI Engine array added to the die]

Goal: Increase Compute Density and Silicon Efficiency


AI Engine Reinvents Multi-Core Compute

Traditional multi-core (cache-based architecture):
• Cores share a fixed interconnect and an L0/L1/L2/DRAM cache hierarchy
• Blocking on the shared interconnect limits compute
• Timing is not deterministic
• Data is replicated across the cache levels, which robs bandwidth and reduces capacity

AI Engine array:
• Dedicated interconnect between engines — non-blocking and deterministic
• Local, distributed memory next to each engine
• No cache misses
• Higher bandwidth, with less memory capacity required

AI Engine Tiles and Kernels

• Versal devices that contain AI engines have the engines physically laid out as an array of tiles
• Each AI engine tile has 16 KB of program memory and 32 KB of local data memory
  • It can also access the local data memory in adjacent tiles (for a total of 128 KB)
• Each AI engine tile can exchange data with any other tile in the array
  • Connections are determined by a "dataflow graph" and set during program load
• Each AI engine contains scalar and vector processors
• A kernel is a C/C++ function running on an AI engine tile

https://www.xilinx.com/products/technology/ai-engine.html


Scalar and Vector Processors

• Vector processor: SIMD — only intrinsics can run on this vector processor
• Scalar processor: 32-bit RISC processor — runs standard C/C++ code

https://docs.xilinx.com/r/en-US/am009-versal-ai-engine/AI-Engine-Architecture

Glossary
RISC: Reduced Instruction Set Computer
SIMD: Single Instruction, Multiple Data


AI Engine Evolution

[Roadmap diagram: vertical axis — target application, ranging from signal processing to ML-optimized machine learning; horizontal axis — 2020/21/22, 2023/24, 2025…. AIE (with AIEv2) sits toward the signal-processing side; AIE-ML and AIE-MLv2 sit toward the machine-learning side]


AIE / AIE-ML / AIE-MLv2 Comparison Table

Feature | AIE | AIE-ML | AIE-MLv2
Array structure | Checkerboard | All rows identical | All rows identical
Cascade interface | 384 bits wide, horizontal direction | 512 bits wide, horizontal and vertical directions | 512 bits wide, horizontal and vertical directions
Tile stream interfaces | 2 × 32-bit in and 2 × 32-bit out | 1 × 32-bit in and 1 × 32-bit out | 1 × 32-bit in and 1 × 32-bit out
Memory load/store (per cycle) | 512/256 bits | 512/256 bits | 1024/512 bits local, 512/512 bits neighbor
int8 × int8 MACs | 128 | 256 | 512
Native data type support | int8/16/32, cint16/32, FP32 | int8/16/32, cint16, bfloat16 | int8/16/32, cint16, bfloat16, MX6, MX9
Tile local memory | 32 KB | 64 KB | 64 KB
Tile local memory DMA | 32-bit streams, 128-bit data memory interface | 32-bit streams, 128-bit data memory interface | 64-bit streams, 256-bit data memory interface
Memory tiles | No | 512 KB, 16 banks | 512 KB, 8 banks
Interface tiles | PL or NoC interface tiles | PL or NoC interface tiles | Single type of interface tile (PL & NoC)


Kahn Process Network (KPN) [a.k.a. Data Flow Graph]

• The AI engine tiles are configured to form a modified Kahn process network
• Each kernel within a tile executes when its inputs become available
• The program code in each tile is executed sequentially
• Multiple kernels can be placed on a tile
• Multiple tiles can execute in parallel
• Tiles communicate through bounded channels (stream or memory)
• Unbounded (i.e., infinite) channels cannot be realized in hardware
• Reading from and writing to a channel is a blocking process
• Execution stalls when attempting to read from an empty channel or write to a full channel
• Processes are deterministic
  • The same input always produces exactly the same output
• The presence of data to be read and/or space for data to be written determines the order of execution

[Diagram: kernels T1–T4 with external inputs a, b and c; T1 produces d and e; T2 produces f; T3 produces g; T4 consumes f and g to produce h]

Data flow:
• When inputs a, b and c arrive simultaneously, T2, T3 and T4 are stalled, waiting for all of their inputs; only T1 executes, producing d and e
• T2 and T3 execute in parallel after T1 to produce f and g
• T4 executes after receiving f and g to produce h

AI Engine Programming: A Kahn Process Network Evolution (WP552)
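The firing rules above can be modeled with a toy single-threaded scheduler. This is an illustrative sketch only — the kernel names T1–T4 come from the slide, but the exact channel assignments (T1: b → d, e; T2: a, d → f; T3: c, e → g; T4: f, g → h) are an assumption chosen to reproduce the slide's firing order, and nothing here is the AI engine runtime:

```cpp
#include <map>
#include <string>
#include <vector>

// Toy dataflow scheduler: a kernel fires only when every one of its
// input channels holds a token; firing consumes one token from each
// input and deposits one token on each output.
struct Kernel {
    std::string name;
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
};

std::vector<std::string> run(std::vector<Kernel> kernels,
                             std::map<std::string, int> tokens) {
    std::vector<std::string> order;
    bool fired = true;
    while (fired) {                      // keep sweeping until nothing can fire
        fired = false;
        for (const Kernel& k : kernels) {
            bool ready = true;
            for (const auto& in : k.inputs)
                if (tokens[in] == 0) ready = false;
            if (!ready) continue;        // kernel stalls on an empty input
            for (const auto& in : k.inputs) --tokens[in];    // consume
            for (const auto& out : k.outputs) ++tokens[out]; // produce
            order.push_back(k.name);
            fired = true;
        }
    }
    return order;
}
```

With tokens on a, b and c, only T1 is ready at first (T2, T3 and T4 each stall on a missing input), and T4 can fire only once both f and g exist — the same ordering the slide describes.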



Supported Data Types in the Vector Processor

[Table from AM009, with callouts: the number of multiply-accumulate operations per cycle per tile; complex data types; not just for artificial intelligence!]

https://docs.xilinx.com/r/en-US/am009-versal-ai-engine/Functional-Overview

128 int8 MACs/cycle

Applications with a lot of sum-of-product operations can greatly benefit from using AI engines!

16 multiply-adds × 8 accumulators = 128 multiply-accumulate operations per cycle = 256 OPs/cycle

The VC1902 has 400 AI engine tiles. If the AIE array is running at 1.3 GHz, then the peak theoretical compute capability would be:

400 × 256 OPs/cycle × 1.3e9 cycles/sec = 133.12e12 int8 OPs/sec = 133 int8 TOPS

The vector processor must be kept "well-fed" with data to achieve high throughput.
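The peak-TOPS arithmetic above is easy to check in code. The figures (400 tiles, 128 int8 MACs per cycle, 1.3 GHz) are taken from the slide; counting each MAC as two operations (one multiply plus one add) is the convention that yields the slide's 256 OPs/cycle:

```cpp
#include <cstdint>

// Peak int8 compute of an AIE array: tiles x MACs/cycle x 2 ops per MAC
// x clock rate. Uses 64-bit integers so the result (~1.3e14) is exact.
std::int64_t peak_int8_ops_per_sec(std::int64_t tiles,
                                   std::int64_t macs_per_cycle,
                                   std::int64_t clock_hz) {
    return tiles * macs_per_cycle * 2 * clock_hz;
}
```

For the VC1902 case: peak_int8_ops_per_sec(400, 128, 1'300'000'000) = 133,120,000,000,000 operations per second, i.e., the 133 int8 TOPS quoted above.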

Data Movement Within AI Engine Array

• Memory communication (neighbor): adjacent engines exchange data through shared memory buffers (ping-pong buffers B0/B1 and B2/B3 in the diagram), supporting dataflow pipelines and dataflow graphs
• Streaming communication (non-neighbor): engines that are not adjacent exchange data over stream interfaces, including multicast to multiple destinations
• Cascade streaming: adjacent engines pass partial results directly over the cascade interface

[Diagram legend: memory interface, stream interface, cascade interface]


Getting Data to/from AI Engine Array

• TB/s of interface bandwidth
  • AI Engine to programmable logic
  • AI Engine to NoC
• Leveraging NoC connectivity
  • PS manages config / debug / trace
  • AI Engine to DRAM (no PL required)

[Diagram: AI engine tiles connect through switches to AI engine interface tiles (async CDC, DMA, AXI-MM, AXI-S), which in turn connect to the PS/PMC, the NoC (and external DRAM), and the programmable logic]

Glossary
CDC: Clock Domain Crossing
DMA: Direct Memory Access
PS: Processor System
PMC: Platform Management Controller
DRAM: Dynamic Random Access Memory
PL: Programmable Logic


AI Engine to PL Interface

• AXI4-Stream switches in the PL interface tile communicate directly with the PL
• They handle most of the data movement to/from the AI Engine array
• Configurable bit widths (32b/64b/128b)

PL ↔ AIE interface (some columns are not available):

Direction | AXI4 Streams per Column | Bandwidth per Column | Bandwidth on VC1902
PL → AIE (north) | 8 | 32 GB/s | ~1.3 TB/s
AIE → PL (south) | 6 | 24 GB/s | ~1 TB/s

Within the AI Engine grid (all columns are available):

Direction | AXI4 Streams per Interconnect | Bandwidth per Interconnect | Bandwidth on VC1902
North | 6 | 24 GB/s | 1.2 TB/s
South | 4 | 16 GB/s | 800 GB/s

Note: Bandwidth calculated with a 1 GHz AI Engine clock at -1L speed grade (0.7 V); higher bandwidth is available with faster speed grades.
Note: 50 columns on VC1902, of which 39 are connected to PL.
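The per-column figures follow directly from the stream count, the 32-bit stream width, and the 1 GHz clock. A small sketch of that arithmetic (the device totals in the table come from multiplying by the number of usable columns — 39 PL-connected columns for the PL interface, which rounds to the slide's ~1.3 TB/s):

```cpp
#include <cstdint>

// Aggregate bandwidth of one column of AXI4-Stream interfaces:
// streams x 4 bytes per 32-bit beat x one beat per clock cycle.
std::int64_t column_bw_bytes_per_sec(std::int64_t streams,
                                     std::int64_t clock_hz) {
    return streams * 4 * clock_hz;
}
```

At 1 GHz, 8 northbound streams per column give 32 GB/s; across 39 PL-connected VC1902 columns that is 1,248 GB/s, shown rounded as ~1.3 TB/s in the table.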


32K FFT, 8 GSPS, CINT16: AIE+PL Architecture

[Diagram: the FFT is partitioned across DSP Library functions, AIE API code, and custom HDL code]

SSR: Super Sample Rate (number of samples processed per cycle)


32K FFT, 8 GSPS, CINT16 Data

Internal data format: CINT32 for AIE; CINT27 for PL
Twiddle factors: CINT16 for AIE; CINT24 for PL

Metric | PL (540 MHz), 1D FFT Structure | AIE + PL (1250 MHz), 2D FFT Structure | Comments
Resources | 171 DSPs, 153,299 LUTs, 54,190 FFs | 50 AIEs (16 for compute), 12 DSPs, 8,052 LUTs (for 16-pt FFT), 6,612 FFs (for 16-pt FFT) | ~3.5 DSPs : 1 AIE
Block RAM | 100 Block RAM | 16 Block RAM | UltraRAM / significant Block RAM savings
Latency | 48 µs | 7.5 µs |
Dynamic Power | 9.58 W | 6.682 W (6.138 W for AIE, 0.528 W for 16-pt FFT) | Up to 30% lower dynamic power

See Endnotes VER-045, VER-046



Code Required to Develop AI Engine Kernels

• AI engine kernel code
  • C/C++ code that will execute on the AI engine
• Graph code
  • C++ code that describes the connectivity (i.e., data flow) between AI engine kernels and their "environment"
  • Multiple independent graphs are possible
• Testbench/control code
  • C++ code that configures, initializes, runs, and terminates graphs


Development Tools

• DSP
• Vivado (Verilog, VHDL)
• Vitis HLS
• Vitis Model Composer

• AI Engine
• Vitis
• Vitis Model Composer


Vitis Flow for Versal Adaptive SoC

[Flow diagram: five input domains — AIE (kernels, graph), PL (HLS kernels), PL (RTL kernels), Platform (Vitis HW platform, Vitis SW platform, Linux® + rootfs), and PS (XRT, Graph API, AIE driver). Each domain is verified in its own flow (AIE simulation, HLS cosimulation, RTL verification, PS app). PL and AIE integration (v++ --link) feeds the Vivado HW build, simulation build, and timing closure; v++ --package then generates the binary, which can run on the device, in HW emulation (AIESim, QEMU, SIM), or in Vivado ML, with profiling and debug throughout]

Vitis Export to Vivado

• Goal: decouple the Vitis and Vivado environments without any dependency between the two
  • Separate "Vitis work" in Vitis from "Vivado work" in Vivado
• Vitis generates a file called a Vitis Metadata Archive (VMA)
  • The VMA can be generated before the Vitis AIE/HLS design is finalized
• Vivado uses the VMA file to generate the XSA file
  • The Vitis v++ linker uses this XSA to iterate on the AIE/HLS design
• When the Vitis design is finalized, the final VMA file is generated (v++ --export_archive) and imported into Vivado

[Flow diagram: RTL files → Vivado → Extensible.xsa → Vitis (AIE/HLS files) → VMA → Vivado → Fixed.xsa]


Summary

 DSP blocks are the "traditional" way of implementing math operations in programmable logic
  • DSP48 on UltraScale → DSP58 on Versal
  • Allows fine-grained bit-width selection up to the maximum supported width

 AI engines provide scalable, hardened compute capability to Versal devices
  • Ideal for vectorizable (SIMD) operations with multiple parallel outputs
  • Use a modified KPN to define the data flow between kernels
  • C/C++ programmable with fast compilation
  • Provide more compute capability (higher TOPS/Watt) than a "straight" DSP implementation


Endnotes

VER-045: Based on 3rd party benchmark testing commissioned by AMD in February 2024, on the AMD
Versal adaptive SoC with AMD Vitis for AI design tool versus traditional programmable software
implementation with Vivado software and Vitis Model Composer tool, version 2023.1 in a signal processing
application FIR implementation. Results will vary depending on design specifications. (VER-45).

VER-046: Based on 3rd party benchmark testing commissioned by AMD in February 2024, on the AMD
Versal adaptive SoC with AMD Vitis for AI design tool versus traditional programmable software
implementation with Vivado software and Vitis Model Composer tool, version 2023.1 in a signal processing
application FIR implementation. Results will vary depending on design specifications. (VER-46)


Disclaimer and Attribution


DISCLAIMER: The information contained herein is for informational purposes only and is subject to change without notice. While every precaution has been taken
in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or
otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the
contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular
purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to
any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD products are as set forth in a signed
agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18u.

© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Artix, Kintex, Kria, Spartan, UltraScale+, Versal,
Vitis, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of Advanced Micro Devices, Inc. Other product names used in this
publication are for identification purposes only and may be trademarks of their respective owners. Certain AMD technologies may require third-party enablement or
activation. Supported features may vary by operating system. Please confirm with the system manufacturer for specific features. No technology or product can be
completely secure.

