
Deep Learning Systems

Andres Rodriguez
Why AI Systems
Compute demand growing much faster than supply

Demand growth: Doubles every 3.4 months

Supply growth: Doubles every 2-3 years

Supply sells out before it arrives

What to improve (lots of opportunities in each area)…

1. Algorithms

2. Compilers

3. Hardware
https://github.com/karlrupp/microprocessor-trend-data
DL Models
• Moore's Law's 2x growth every two years is slowing down
• State-of-the-art (SOTA) models double every 3-4 months
• Need to co-design: algorithms, SW, HW
DL training and inference overview
A deep learning model is a computational graph: input tensors (2D, 3D, ...) flow through operations (conv, matrix multiply, pooling, ReLU, LSTM, data reorder, ...) to produce an output, e.g., the probability of a pedestrian. The data flow dictates how the tensors flow through the graph.

Inference (aka serving): Forward once

Training: Forward and backward many times
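
A minimal sketch of the distinction, using PyTorch as one example framework (an assumption for illustration; the toy model, sizes, and learning rate are not from the slides): inference runs a single forward pass, training repeats forward and backward passes to update the weights.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

# Inference (serving): forward once, no gradients needed
x = torch.randn(8, 16)
with torch.no_grad():
    y = model(x)

# Training: forward and backward many times
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
target = torch.randn(8, 1)
for step in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)  # forward
    loss.backward()                                  # backward
    optimizer.step()                                 # weight update
```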


Main types of DL workloads
Workload: Recommender systems (ads/feeds), e.g., DLRM
  Hardware affinity: CPU for embeddings and vector search; GPU for MLP
  Details: Highest priority for hyperscalers and a growing % of AI inference; FB: 50% of training cycles, 80% of AI inference cycles

Workload: Language (translation, speech recognition / generation), e.g., GPT
  Hardware affinity: GPU for large models; CPU for small models
  Details: Explosive growth in the past 6 months; requires high compute and high memory bandwidth

Workload: Vision (image/video classification, object detection), e.g., ResNet-50
  Hardware affinity: GPU for large models; CPU for small models or large data samples
  Details: Requires high compute and high memory bandwidth; requires CPU memory capacity for models with prodigious data samples, e.g., MRI/CT images

Workload: Traditional machine learning
  Hardware affinity: CPU
  Details: Large memory; faster cores are preferable over thousands of cores

• GPU (and dedicated AI processors): Large compute and memory-bandwidth capacity


• CPU: Irregular memory accesses; Large memory; Faster cores
Types of topologies
• Multilayer perceptron
• Convolutional neural network – widely used in industry
• Recurrent neural network
• Transformer network
• Graph neural network – gaining some adoption in industry
• Adversarial network
• Autoencoder
• Bayesian neural networks
• Spiking neural networks
LLM Overview
APPLICATIONS
• Text generation: coherent and grammatically correct sentences
• Art and design: novel works of art and design
• Music generation: composing melodies & harmonies
• Medicine: (potential) new drugs and treatments
• Finance: (potential) predict market trends

CHARACTERISTICS
• Two components: 1) input token processing; 2) token generation
• Token: vector representation of a word or part of a word
• Every generated token requires reading the entire model from memory
• 10B params → 20 GB; 1T params → 2 TB; halved with 8-bit weights
• 2nd+ token generation – bandwidth intensive
• Xeon AMX, NV Tensor Cores – low utilization
• Latency – faster than a speed reader, <100 ms
• HBM can provide high benefits
• 1st token – long input sequences or large batch sizes – computationally intensive

ETHICAL CONSIDERATIONS
• Simple to generate realistic fake content
• Training data may contain copyrighted material
• Biases in the training data show up in generated content – not unique to LLMs

Andrej Karpathy, Microsoft Build 2023
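
A back-of-envelope sketch of why 2nd+ token generation is bandwidth bound (pure arithmetic; the 2 TB/s memory-bandwidth figure is an illustrative assumption, not from the slides):

```python
# Rough upper bound on single-stream decode speed when every generated
# token must read all weights from memory (bandwidth-bound regime).

def model_bytes(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param

def max_tokens_per_sec(num_params: float, bytes_per_param: float,
                       mem_bw_bytes_per_sec: float) -> float:
    return mem_bw_bytes_per_sec / model_bytes(num_params, bytes_per_param)

# 10B parameters at 2 bytes/param (FP16/BF16) -> ~20 GB of weights
print(model_bytes(10e9, 2) / 1e9, "GB")             # 20.0 GB

# Assumed accelerator with 2 TB/s of memory bandwidth (illustrative)
print(max_tokens_per_sec(10e9, 2, 2e12), "tok/s")   # ~100 tokens/s upper bound

# 8-bit weights halve the footprint and roughly double the bound
print(max_tokens_per_sec(10e9, 1, 2e12), "tok/s")   # ~200 tokens/s
```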
Types of computations
Operational intensity: OI = numOps / bytesReadWritten (see the worked example after this list)

• Compute-intensive

• Bandwidth-intensive

• Memory-intensive
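
A minimal sketch computing OI for two FP32 operations (the matrix sizes are illustrative assumptions): a large GEMM is compute-intensive, an elementwise ReLU is bandwidth-intensive.

```python
# Operational intensity OI = numOps / bytesReadWritten (FP32 = 4 bytes/element)

def gemm_oi(M, N, K):
    ops = 2 * M * N * K                        # multiply-accumulate counted as 2 ops
    bytes_moved = 4 * (M * K + K * N + M * N)  # read A, read B, write C
    return ops / bytes_moved

def relu_oi(n):
    ops = n                                    # one max(0, x) per element
    bytes_moved = 4 * (n + n)                  # read input, write output
    return ops / bytes_moved

print(gemm_oi(1024, 1024, 1024))  # ~170 ops/byte -> compute-intensive
print(relu_oi(1024 * 1024))       # 0.125 ops/byte -> bandwidth-intensive
```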
Recommenders: complex and diverse models

https://research.fb.com/wp-content/uploads/2020/06/DeepRecSys-A-System-for-Optimizing-End-To-End-At-Scale-Neural-Recommendation-Inference.pdf
Naïve HW and SW

• OI = 1/3 for a naïve elementwise op (2 reads and 1 write per operation)

Improve HW – SRAM

Improve SW – fuse computations

• Do more compute while the data is already in the processor
• GEMM and convolution (more compute-intensive)
• Activation functions (ReLU, sigmoid) (minimal OI)
• Improve OI by fusing the activation into the GEMM or convolution (see the sketch below)
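
A minimal NumPy sketch of the idea (array sizes are illustrative): the unfused version writes the GEMM result to memory and reads it back for the ReLU, while a fused kernel applies the activation while the result is still on hand, removing one full read and one full write of the intermediate tensor.

```python
import numpy as np

M, K, N = 1024, 1024, 1024
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

# Unfused: the intermediate C is written to memory, then read back for ReLU
C = A @ B                       # write M*N elements
out_unfused = np.maximum(C, 0)  # read M*N elements, write M*N elements again

# "Fused": activation applied to the GEMM result in one expression
# (a real fused kernel applies ReLU in registers before storing)
out_fused = np.maximum(A @ B, 0)

# Extra bytes moved by the unfused version for the intermediate tensor:
extra_bytes = 2 * M * N * 4     # one extra read + one extra write of C in FP32
print(extra_bytes / 1e6, "MB of avoidable traffic")  # ~8.4 MB
```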
Accessing DRAM is extremely expensive

(Energy figures based on 45 nm technology)
Better HW – use SRAM
• SRAM – much faster but…
• More expensive (in $, power, and area – 6 transistors vs 1 transistor)
• Assuming a layer's input, weights, and outputs fit in SRAM, the OI is much higher
Hardware feature → DL function

• Processor level: MAC units (multiply + accumulator), high core-to-core BW → Conv, GEMM

• Node level: processor + hierarchical SRAM + DRAM → Conv, GEMM, activations

• Server level: {GPUs, CPU sockets} with high inter-node BW → model / data parallelism

• Cluster level: high inter-server network BW, physical network topology → data parallelism
Increasing ops per clock cycle
• Vector – SIMD, SIMT
• Matrix – dataflow
• Nvidia Tensor Cores (2017), Intel TMUL (2021), all prevalent ASICs

Figure adapted from V. Sze, Y. Chen, T. Yang, and J. Emer. Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE, Dec. 2017.
Popular numerical precisions

• FP32 is usually the default numerical precision for training and inference

• FP16 is becoming the default for inference

• BF16 has been shown to provide virtually the same accuracy as FP32 for training and inference

• Simulated on various workloads, achieving virtually the same accuracy

• No hyper-parameter changes compared to FP32 on the simulated workloads

• INT8 has been shown to provide inference accuracy similar to FP32 for some models – popular in vision models

• Others… FP8, BF32, TF32/BF19
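
A minimal NumPy sketch (an illustration, not from the slides) of how BF16 relates to FP32: BF16 keeps FP32's 8 exponent bits but only 7 mantissa bits, so it can be emulated by dropping the low 16 bits of an FP32 value.

```python
import numpy as np

def fp32_to_bf16_trunc(x: np.ndarray) -> np.ndarray:
    """Emulate BF16 by truncating FP32's low 16 mantissa bits (round-toward-zero)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

x = np.array([3.14159265, 1e-3, 65504.0, 1e38], dtype=np.float32)
print(fp32_to_bf16_trunc(x))
# BF16 keeps FP32's dynamic range (8 exponent bits) with ~2-3 decimal digits
# of precision (7 mantissa bits). FP16 has 5 exponent / 10 mantissa bits:
# more precision but far less range (max ~65504), which is why BF16 is
# attractive for training without hyper-parameter changes.
```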


BF16 Accuracy Virtually Matches FP32
(Charts: FP32 vs. BF16 training curves for AlexNet, ResNet-50, SR-GAN Generator, and SR-GAN Discriminator)

Source: Kalamkar et al., 2019. https://arxiv.org/pdf/1905.12322.pdf


Low-precision (INT8) inference

Quantization flow: FP32 model → quantize model → INT8 model; INT8 primitives take INT8 inputs and a scale, with some layers kept in FP32

Techniques to reduce the INT8 accuracy loss:

• Symmetric quantization (zero shift)

• KL divergence to find a min/max threshold

• Quantize conv & inner product with channel-wise scales

• Offline calibration to pre-compute activation scaling factors

• Some layers run in FP32

• Quantization-aware training (retrain with the INT8 constraint)

Even with these techniques, models may still have unacceptable INT8 accuracy loss

The majority of the INT8 literature is focused on CNNs
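
A minimal NumPy sketch of symmetric (zero-shift) INT8 quantization with channel-wise scales; the tensor shape and calibration-by-max are illustrative assumptions (real flows often pick the threshold with KL divergence rather than the plain max).

```python
import numpy as np

def symmetric_int8_quantize(w: np.ndarray, axis: int = 0):
    """Per-channel symmetric quantization: no zero-point, scale = max|w| / 127."""
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    max_abs = np.abs(w).max(axis=reduce_axes, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 128).astype(np.float32)  # e.g., inner-product weights
q, scale = symmetric_int8_quantize(w, axis=0)    # one scale per output channel
w_hat = dequantize(q, scale)
print("max abs quantization error:", np.abs(w - w_hat).max())
```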


Supervised learning overview
The goal of supervised learning

• $\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} C(f_{\mathbf{w}}(x_n), y_n)$

The weights of the model are updated iteratively

• $\mathbf{w}_{t+1} = \mathbf{w}_t + \Delta\mathbf{w}$

• The optimizer's task is to find an appropriate Δw


Stochastic Gradient Descent

Minibatch #1: fprop → cost → bprop → Δw (repeated over the samples in the minibatch), then a weight update
Minibatch #2: fprop → cost → bprop → Δw, then a weight update
… and so on across minibatches
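
A minimal from-scratch NumPy sketch of this loop (the toy linear-regression problem, learning rate, and batch size are illustrative assumptions): each minibatch runs fprop, computes the cost, back-propagates the gradient, and applies the weight update $\mathbf{w}_{t+1} = \mathbf{w}_t + \Delta\mathbf{w}$.

```python
import numpy as np

# Toy linear regression trained with minibatch SGD (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)
true_w = rng.normal(size=(8, 1)).astype(np.float32)
y = X @ true_w + 0.01 * rng.normal(size=(1000, 1)).astype(np.float32)

w = np.zeros((8, 1), dtype=np.float32)
lr, batch_size = 0.1, 32

for epoch in range(10):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        pred = X[b] @ w                              # fprop
        cost = ((pred - y[b]) ** 2).mean()           # cost
        grad = 2 * X[b].T @ (pred - y[b]) / len(b)   # bprop
        w -= lr * grad                               # weight update

print("final cost:", float(((X @ w - y) ** 2).mean()))
```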
Flat vs Sharp minima

https://arxiv.org/abs/1609.04836
Pedigree of 1st order methods: $\Delta\mathbf{w} = -\lambda(t) \cdot h(\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}))$

Popular in industry:

• Momentum accelerates SGD in the direction of an exponentially decaying average of past gradients

• Adam uses an adaptive learning rate for each weight: the momentum (first-moment) estimate normalized by the square root of the second-moment (squared-gradient) estimate

• LAMB uses a local learning rate for each layer
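
A minimal NumPy sketch of the three update rules as single-step functions (hyper-parameter defaults are illustrative assumptions, and the LAMB step is simplified); each is one instance of Δw = −λ(t)·h(∇w L(w)).

```python
import numpy as np

def momentum_step(w, grad, state, lr=0.01, beta=0.9):
    # v accumulates an exponentially decaying average of past gradients
    state["v"] = beta * state.get("v", 0.0) + grad
    return w - lr * state["v"], state

def adam_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", 0.0) + (1 - b1) * grad        # first moment
    v = b2 * state.get("v", 0.0) + (1 - b2) * grad ** 2   # second moment
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
    state.update(t=t, m=m, v=v)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), state

def lamb_like_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # LAMB idea (simplified): take an Adam-style direction, then scale it
    # per layer by the "trust ratio" ||w|| / ||update||.
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", 0.0) + (1 - b1) * grad
    v = b2 * state.get("v", 0.0) + (1 - b2) * grad ** 2
    state.update(t=t, m=m, v=v)
    update = (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    trust = np.linalg.norm(w) / max(np.linalg.norm(update), 1e-12)
    return w - lr * trust * update, state

w, state = np.ones(4), {}
g = np.array([0.1, -0.2, 0.3, -0.4])
w, state = adam_step(w, g, state)
print(w)
```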
Distributed Training

Recommendation: use hybrid (data + model) parallelism for large models


Communication Primitives
• Most common: AllReduce, AllToAll, and AllGather
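
A minimal NumPy sketch (the worker count and tensor shapes are illustrative) of what the two most common primitives compute, simulated on a single process: AllReduce gives every worker the elementwise sum of all workers' tensors; AllGather gives every worker the concatenation of all workers' shards.

```python
import numpy as np

def all_reduce(worker_tensors):
    """Every worker ends up with the elementwise sum of all workers' tensors."""
    total = np.sum(worker_tensors, axis=0)
    return [total.copy() for _ in worker_tensors]

def all_gather(worker_shards):
    """Every worker ends up with the concatenation of all workers' shards."""
    gathered = np.concatenate(worker_shards, axis=0)
    return [gathered.copy() for _ in worker_shards]

# 4 workers, each holding its own local gradient (data parallelism)
grads = [np.full(3, fill_value=i, dtype=np.float32) for i in range(4)]
print(all_reduce(grads)[0])   # [6. 6. 6.] on every worker

# 4 workers, each holding one shard of a tensor
shards = [np.arange(2) + 10 * i for i in range(4)]
print(all_gather(shards)[0])  # [ 0  1 10 11 20 21 30 31]
```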
Physical Node Interconnects

Recommendation: a fully-connected topology gives the lowest communication time for the key communication primitives
Hardware is Easy – Compilers are Key

• Fuse nodes

• Data memory layout

• Backend code gen
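
A minimal sketch of handing a model to a graph compiler, assuming PyTorch 2.x's torch.compile as one example (the toy model is the same illustrative one used earlier): the compiler stack is what fuses nodes, picks data/memory layouts, and generates backend code.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

# torch.compile traces the computational graph and lowers it through a
# compiler stack that can fuse ops (e.g., Linear + ReLU), choose memory
# layouts, and generate backend-specific code.
compiled_model = torch.compile(model)

x = torch.randn(8, 16)
with torch.no_grad():
    y = compiled_model(x)  # first call triggers compilation; later calls reuse it
print(y.shape)
```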
