
Computing with

DSPs and AI Engines


Louie Valeña
Algorithm Specialist
Adaptive and Embedded Computing Group
[Public]

Agenda

1. DSP Slice Review
2. AI Engine Overview

DSP Slice Review


Sum of Products

An N-tap FIR filter forms each output as a weighted sum of the current and past input samples:

$y[n] = \sum_{k=0}^{N-1} c_k \cdot x[n-k]$

[Diagram: 8-tap transversal structure — x[n] feeds a chain of seven z⁻¹ delay elements; the eight taps are scaled by coefficients c0–c7 and summed to produce y[n]]
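The sum of products above can be sketched in software as a direct-form FIR filter. This is a minimal scalar C++ reference model for illustration — not AI engine kernel code:

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR: y[n] = sum_{k=0}^{N-1} c[k] * x[n-k],
// treating samples before the start of the input as zero.
std::vector<double> fir(const std::vector<double>& c,
                        const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < c.size() && k <= n; ++k) {
            y[n] += c[k] * x[n - k];  // one multiply-accumulate per tap
        }
    }
    return y;
}
```

A quick sanity check: feeding a unit impulse through the filter returns the coefficient sequence itself, since only one tap is nonzero at each step.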


DSP48E2 Slice in UltraScale+ Devices

https://docs.amd.com/v/u/en-US/ug579-ultrascale-dsp


FIR Filter Implemented with DSP48E2 Slices

• 8 DSP48E2 slices implement an 8-tap FIR filter
• No fabric resources used
• Note the timing at which the coefficients are applied
• Throughput is limited by the clock frequency

https://docs.amd.com/v/u/en-US/ug579-ultrascale-dsp

What is Driving the Need for More Compute?

• Faster throughput
  • More results in a narrower time slot (e.g., higher frames per second)
• Lower latency
  • First output available in a shorter time span (e.g., 100 ms → 10 ms)
• Higher density
  • Larger image resolutions, more antennas, more cameras, etc.
• Higher accuracy, lower errors
  • More complex algorithms
• Increasing AI (inference) content in applications
  • Object (car, person, etc.) detection
  • Modulation detection
  • Adaptive beamforming


AI Engine Overview


Conceptual Device with High Compute

• Applications need (1) cost reduction, (2) power reduction, (3) more compute, (4) more programmability
• How does AMD meet the demands of these evolving applications?

[Diagram: 16nm generation (Zynq® UltraScale+ MPSoC) — a processing system beside an array of programmable logic (PL) blocks, ringed by GT and IO; 7nm generation — a processing system & PMC, PL blocks, and GT/IO, with an AI Engine array added to the die]

Goal: Increase Compute Density and Silicon Efficiency


AI Engine Reinvents Multi-Core Compute

Traditional multi-core (cache-based architecture):
• Cores share a fixed interconnect and an L0/L1/L2/DRAM cache hierarchy
• Blocking on the shared interconnect limits compute
• Timing is not deterministic
• Data is replicated across the cache levels, which robs bandwidth and reduces capacity

AI Engine array:
• Dedicated interconnect between engines — non-blocking and deterministic
• Local, distributed memory next to each engine
• No cache misses
• Higher bandwidth, with less memory capacity required

AI Engine Tiles and Kernels

• Versal devices that contain AI engines have the engines physically laid out as an array of tiles
• Each AI engine tile has 16 KB of program memory and 32 KB of local data memory
  • It can also access the local data memory in adjacent tiles (for a total of 128 KB)
• Each AI engine tile can exchange data with any other tile in the array
  • Connections are determined by a "dataflow graph" and set during program load
• Each AI engine contains scalar and vector processors
• A kernel is a C/C++ function running on an AI engine tile

https://www.xilinx.com/products/technology/ai-engine.html


Scalar and Vector Processors

• Vector processor: SIMD — only intrinsics can run on this vector processor
• Scalar processor: 32-bit RISC processor — runs standard C/C++ code

https://docs.xilinx.com/r/en-US/am009-versal-ai-engine/AI-Engine-Architecture

Glossary
RISC: Reduced Instruction Set Computer
SIMD: Single Instruction, Multiple Data


AI Engine Evolution

[Roadmap diagram: vertical axis — target application, ranging from signal processing to ML-optimized machine learning; horizontal axis — 2020/21/22, 2023/24, 2025…. AIE (with AIEv2) sits toward the signal-processing side; AIE-ML and AIE-MLv2 sit toward the machine-learning side]


AIE / AIE-ML / AIE-MLv2 Comparison Table

Feature | AIE | AIE-ML | AIE-MLv2
Array structure | Checkerboard | All rows identical | All rows identical
Cascade interface | 384 bits wide, horizontal direction | 512 bits wide, horizontal and vertical directions | 512 bits wide, horizontal and vertical directions
Tile stream interfaces | 2 × 32-bit in and 2 × 32-bit out | 1 × 32-bit in and 1 × 32-bit out | 1 × 32-bit in and 1 × 32-bit out
Memory load/store (per cycle) | 512/256 bits | 512/256 bits | 1024/512 bits local, 512/512 bits neighbor
int8 × int8 MACs | 128 | 256 | 512
Native data type support | int8/16/32, cint16/32, FP32 | int8/16/32, cint16, bfloat16 | int8/16/32, cint16, bfloat16, MX6, MX9
Tile local memory | 32 KB | 64 KB | 64 KB
Tile local memory DMA | 32-bit streams, 128-bit data memory interface | 32-bit streams, 128-bit data memory interface | 64-bit streams, 256-bit data memory interface
Memory tiles | No | 512 KB, 16 banks | 512 KB, 8 banks
Interface tiles | PL or NoC interface tiles | PL or NoC interface tiles | Single type of interface tile (PL & NoC)


Kahn Process Network (KPN) [a.k.a. Data Flow Graph]

• The AI engine tiles are configured to form a modified Kahn process network
• Each kernel within a tile executes when its inputs become available
• The program code in each tile is executed sequentially
• Multiple kernels can be placed on a tile
• Multiple tiles can execute in parallel
• Tiles communicate through bounded channels (stream or memory)
• Unbounded (i.e., infinite) channels cannot be realized in hardware
• Reading from and writing to a channel is a blocking process
• Execution stalls when attempting to read from an empty channel or write to a full channel
• Processes are deterministic
  • The same input always produces exactly the same output
• The presence of data to be read and/or space for data to be written determines the order of execution

[Diagram: kernels T1–T4 with external inputs a, b and c; T1 produces d and e; T2 produces f; T3 produces g; T4 consumes f and g to produce h]

Data flow:
• When inputs a, b and c arrive simultaneously, T2, T3 and T4 are stalled, waiting for all of their inputs; only T1 executes, producing d and e
• T2 and T3 execute in parallel after T1 to produce f and g
• T4 executes after receiving f and g to produce h

AI Engine Programming: A Kahn Process Network Evolution (WP552)
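The firing rules above can be modeled with a toy single-threaded scheduler. This is an illustrative sketch only — the kernel names T1–T4 come from the slide, but the exact channel assignments (T1: b → d, e; T2: a, d → f; T3: c, e → g; T4: f, g → h) are an assumption chosen to reproduce the slide's firing order, and nothing here is the AI engine runtime:

```cpp
#include <map>
#include <string>
#include <vector>

// Toy dataflow scheduler: a kernel fires only when every one of its
// input channels holds a token; firing consumes one token from each
// input and deposits one token on each output.
struct Kernel {
    std::string name;
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
};

std::vector<std::string> run(std::vector<Kernel> kernels,
                             std::map<std::string, int> tokens) {
    std::vector<std::string> order;
    bool fired = true;
    while (fired) {                      // keep sweeping until nothing can fire
        fired = false;
        for (const Kernel& k : kernels) {
            bool ready = true;
            for (const auto& in : k.inputs)
                if (tokens[in] == 0) ready = false;
            if (!ready) continue;        // kernel stalls on an empty input
            for (const auto& in : k.inputs) --tokens[in];    // consume
            for (const auto& out : k.outputs) ++tokens[out]; // produce
            order.push_back(k.name);
            fired = true;
        }
    }
    return order;
}
```

With tokens on a, b and c, only T1 is ready at first (T2, T3 and T4 each stall on a missing input), and T4 can fire only once both f and g exist — the same ordering the slide describes.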



Supported Data Types in the Vector Processor

[Table from AM009, with callouts: the number of multiply-accumulate operations per cycle per tile; complex data types; not just for artificial intelligence!]

https://docs.xilinx.com/r/en-US/am009-versal-ai-engine/Functional-Overview

128 int8 MACs/cycle

Applications with a lot of sum-of-product operations can greatly benefit from using AI engines!

16 multiply-adds × 8 accumulators = 128 multiply-accumulate operations per cycle = 256 OPs/cycle

The VC1902 has 400 AI engine tiles. If the AIE array is running at 1.3 GHz, then the peak theoretical compute capability would be:

400 × 256 OPs/cycle × 1.3e9 cycles/sec = 133.12e12 int8 OPs/sec = 133 int8 TOPS

The vector processor must be kept "well-fed" with data to achieve high throughput.
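The peak-TOPS arithmetic above is easy to check in code. The figures (400 tiles, 128 int8 MACs per cycle, 1.3 GHz) are taken from the slide; counting each MAC as two operations (one multiply plus one add) is the convention that yields the slide's 256 OPs/cycle:

```cpp
#include <cstdint>

// Peak int8 compute of an AIE array: tiles x MACs/cycle x 2 ops per MAC
// x clock rate. Uses 64-bit integers so the result (~1.3e14) is exact.
std::int64_t peak_int8_ops_per_sec(std::int64_t tiles,
                                   std::int64_t macs_per_cycle,
                                   std::int64_t clock_hz) {
    return tiles * macs_per_cycle * 2 * clock_hz;
}
```

For the VC1902 case: peak_int8_ops_per_sec(400, 128, 1'300'000'000) = 133,120,000,000,000 operations per second, i.e., the 133 int8 TOPS quoted above.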

Data Movement Within AI Engine Array

• Memory communication (neighbor): adjacent engines exchange data through shared memory buffers (ping-pong buffers B0/B1 and B2/B3 in the diagram), supporting dataflow pipelines and dataflow graphs
• Streaming communication (non-neighbor): engines that are not adjacent exchange data over stream interfaces, including multicast to multiple destinations
• Cascade streaming: adjacent engines pass partial results directly over the cascade interface

[Diagram legend: memory interface, stream interface, cascade interface]


Getting Data to/from AI Engine Array

• TB/s of interface bandwidth
  • AI Engine to programmable logic
  • AI Engine to NoC
• Leveraging NoC connectivity
  • PS manages config / debug / trace
  • AI Engine to DRAM (no PL required)

[Diagram: AI engine tiles connect through switches to AI engine interface tiles (async CDC, DMA, AXI-MM, AXI-S), which in turn connect to the PS/PMC, the NoC (and external DRAM), and the programmable logic]

Glossary
CDC: Clock Domain Crossing
DMA: Direct Memory Access
PS: Processor System
PMC: Platform Management Controller
DRAM: Dynamic Random Access Memory
PL: Programmable Logic


AI Engine to PL Interface

• AXI4-Stream switches in the PL interface tile communicate directly with the PL
• They handle most of the data movement to/from the AI Engine array
• Configurable bit widths (32b/64b/128b)

PL ↔ AIE interface (some columns are not available):

Direction | AXI4 Streams per Column | Bandwidth per Column | Bandwidth on VC1902
PL → AIE (north) | 8 | 32 GB/s | ~1.3 TB/s
AIE → PL (south) | 6 | 24 GB/s | ~1 TB/s

Within the AI Engine grid (all columns are available):

Direction | AXI4 Streams per Interconnect | Bandwidth per Interconnect | Bandwidth on VC1902
North | 6 | 24 GB/s | 1.2 TB/s
South | 4 | 16 GB/s | 800 GB/s

Note: Bandwidth calculated with a 1 GHz AI Engine clock at -1L speed grade (0.7 V); higher bandwidth is available with faster speed grades.
Note: 50 columns on VC1902, of which 39 are connected to PL.
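The per-column figures follow directly from the stream count, the 32-bit stream width, and the 1 GHz clock. A small sketch of that arithmetic (the device totals in the table come from multiplying by the number of usable columns — 39 PL-connected columns for the PL interface, which rounds to the slide's ~1.3 TB/s):

```cpp
#include <cstdint>

// Aggregate bandwidth of one column of AXI4-Stream interfaces:
// streams x 4 bytes per 32-bit beat x one beat per clock cycle.
std::int64_t column_bw_bytes_per_sec(std::int64_t streams,
                                     std::int64_t clock_hz) {
    return streams * 4 * clock_hz;
}
```

At 1 GHz, 8 northbound streams per column give 32 GB/s; across 39 PL-connected VC1902 columns that is 1,248 GB/s, shown rounded as ~1.3 TB/s in the table.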


32K FFT, 8 GSPS, CINT16: AIE+PL Architecture

[Diagram: the FFT is partitioned across DSP Library functions, AIE API code, and custom HDL code]

SSR: Super Sample Rate (number of samples processed per cycle)


32K FFT, 8 GSPS, CINT16 Data

Internal data format: CINT32 for AIE; CINT27 for PL
Twiddle factors: CINT16 for AIE; CINT24 for PL

Metric | PL (540 MHz), 1D FFT Structure | AIE + PL (1250 MHz), 2D FFT Structure | Comments
Resources | 171 DSPs, 153,299 LUTs, 54,190 FFs | 50 AIEs (16 for compute), 12 DSPs, 8,052 LUTs (for 16-pt FFT), 6,612 FFs (for 16-pt FFT) | ~3.5 DSPs : 1 AIE
Block RAM | 100 Block RAM | 16 Block RAM | UltraRAM / significant Block RAM savings
Latency | 48 µs | 7.5 µs |
Dynamic Power | 9.58 W | 6.682 W (6.138 W for AIE, 0.528 W for 16-pt FFT) | Up to 30% lower dynamic power

See Endnotes VER-045, VER-046



Code Required to Develop AI Engine Kernels

• AI engine kernel code
  • C/C++ code that will execute on the AI engine
• Graph code
  • C++ code that describes the connectivity (i.e., data flow) between AI engine kernels and their "environment"
  • Multiple independent graphs are possible
• Testbench/control code
  • C++ code that configures, initializes, runs, and terminates graphs


Development Tools

• DSP
• Vivado (Verilog, VHDL)
• Vitis HLS
• Vitis Model Composer

• AI Engine
• Vitis
• Vitis Model Composer


Vitis Flow for Versal Adaptive SoC

[Flow diagram: five input domains — AIE (kernels, graph), PL (HLS kernels), PL (RTL kernels), Platform (Vitis HW platform, Vitis SW platform, Linux® + rootfs), and PS (XRT, Graph API, AIE driver). Each domain is verified in its own flow (AIE simulation, HLS cosimulation, RTL verification, PS app). PL and AIE integration (v++ --link) feeds the Vivado HW build, simulation build, and timing closure; v++ --package then generates the binary, which can run on the device, in HW emulation (AIESim, QEMU, SIM), or in Vivado ML, with profiling and debug throughout]

Vitis Export to Vivado

• Goal: decouple the Vitis and Vivado environments without any dependency between the two
  • Separate "Vitis work" in Vitis from "Vivado work" in Vivado
• Vitis generates a file called a Vitis Metadata Archive (VMA)
  • The VMA can be generated before the Vitis AIE/HLS design is finalized
• Vivado uses the VMA file to generate the XSA file
  • The Vitis v++ linker uses this XSA to iterate on the AIE/HLS design
• When the Vitis design is finalized, the final VMA file is generated (v++ --export_archive) and imported into Vivado

[Flow diagram: RTL files → Vivado → Extensible.xsa → Vitis (AIE/HLS files) → VMA → Vivado → Fixed.xsa]


Summary

 DSP blocks are the "traditional" way of implementing math operations in programmable logic
  • DSP48 on UltraScale → DSP58 on Versal
  • Allows fine-grained bit-width selection up to the maximum supported width

 AI engines provide scalable, hardened compute capability to Versal devices
  • Ideal for vectorizable (SIMD) operations with multiple parallel outputs
  • Use a modified KPN to define the data flow between kernels
  • C/C++ programmable with fast compilation
  • Provide more compute capability (higher TOPS/Watt) than a "straight" DSP implementation


Endnotes

VER-045: Based on 3rd party benchmark testing commissioned by AMD in February 2024, on the AMD
Versal adaptive SoC with AMD Vitis for AI design tool versus traditional programmable software
implementation with Vivado software and Vitis Model Composer tool, version 2023.1 in a signal processing
application FIR implementation. Results will vary depending on design specifications. (VER-45).

VER-046: Based on 3rd party benchmark testing commissioned by AMD in February 2024, on the AMD
Versal adaptive SoC with AMD Vitis for AI design tool versus traditional programmable software
implementation with Vivado software and Vitis Model Composer tool, version 2023.1 in a signal processing
application FIR implementation. Results will vary depending on design specifications. (VER-46)


Disclaimer and Attribution


DISCLAIMER: The information contained herein is for informational purposes only and is subject to change without notice. While every precaution has been taken
in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or
otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the
contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular
purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to
any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD products are as set forth in a signed
agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18u.

© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Artix, Kintex, Kria, Spartan, UltraScale+, Versal,
Vitis, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of Advanced Micro Devices, Inc. Other product names used in this
publication are for identification purposes only and may be trademarks of their respective owners. Certain AMD technologies may require third-party enablement or
activation. Supported features may vary by operating system. Please confirm with the system manufacturer for specific features. No technology or product can be
completely secure.

