
Deep Learning Systems

Andres Rodriguez
Why AI Systems
Compute demand growing much faster than supply

Demand growth: Doubles every 3.4 months

Supply growth: Doubles every 2-3 years

Supply sells out before it arrives

What to improve (lots of opportunities in each area)…

1. Algorithms

2. Compilers

3. Hardware
https://github.com/karlrupp/microprocessor-trend-data
DL Models
• Moore's Law's 2x growth every two years is slowing down
• State-of-the-art (SOTA) models double every 3-4 months
• Need to co-design: algorithms, SW, HW
DL training and inference overview
A deep learning model is a computational graph: input tensors (2D, 3D, ...) flow through operations (conv, matrix multiply, pooling, ReLU, LSTM, data reorder, ...) to produce an output, e.g., the probability of a pedestrian. The data flow dictates how the tensors flow through the graph.

Inference (aka serving): Forward once

Training: Forward and backward many times
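
A minimal sketch of the distinction, using PyTorch as one example framework (an assumption for illustration; the toy model, sizes, and learning rate are not from the slides): inference runs a single forward pass, training repeats forward and backward passes to update the weights.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

# Inference (serving): forward once, no gradients needed
x = torch.randn(8, 16)
with torch.no_grad():
    y = model(x)

# Training: forward and backward many times
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
target = torch.randn(8, 1)
for step in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)  # forward
    loss.backward()                                  # backward
    optimizer.step()                                 # weight update
```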


Main types of DL workloads
Workload: Recommender systems (ads/feeds), e.g., DLRM
  Hardware affinity: CPU for embeddings and vector search; GPU for MLP
  Details: Highest priority for hyperscalers and a growing % of AI inference; FB: 50% of training cycles, 80% of AI inference cycles

Workload: Language (translation, speech recognition / generation), e.g., GPT
  Hardware affinity: GPU for large models; CPU for small models
  Details: Explosive growth in the past 6 months; requires high compute and high memory bandwidth

Workload: Vision (image/video classification, object detection), e.g., ResNet-50
  Hardware affinity: GPU for large models; CPU for small models or large data samples
  Details: Requires high compute and high memory bandwidth; requires CPU memory capacity for models with prodigious data samples, e.g., MRI/CT images

Workload: Traditional machine learning
  Hardware affinity: CPU
  Details: Large memory; faster cores are preferable over thousands of cores

• GPU (and dedicated AI processors): Large compute and memory-bandwidth capacity


• CPU: Irregular memory accesses; Large memory; Faster cores
Types of topologies
• Multilayer perceptron
• Convolutional neural network – widely used in industry
• Recurrent neural network
• Transformer network
• Graph neural network – gaining some adoption in industry
• Adversarial network
• Autoencoder
• Bayesian neural networks
• Spiking neural networks
LLM Overview
APPLICATIONS
• Text generation: coherent and grammatically correct sentences
• Art and design: novel works of art and design
• Music generation: composing melodies & harmonies
• Medicine: (potential) new drugs and treatments
• Finance: (potential) predict market trends

CHARACTERISTICS
• Two components: 1) input token processing; 2) token generation
• Token: vector representation of a word or part of a word
• Every generated token requires reading the entire model from memory
• 10B params → 20 GB; 1T params → 2 TB; halved with 8-bit weights
• 2nd+ token generation – bandwidth intensive
• Xeon AMX, NV Tensor Cores – low utilization
• Latency – faster than a speed reader, <100 ms
• HBM can provide high benefits
• 1st token – long input sequences or large batch sizes – computationally intensive

ETHICAL CONSIDERATIONS
• Simple to generate realistic fake content
• Training data may contain copyrighted material
• Biases in the training data show up in generated content – not unique to LLMs

Andrej Karpathy, Microsoft Build 2023
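
A back-of-envelope sketch of why 2nd+ token generation is bandwidth bound (pure arithmetic; the 2 TB/s memory-bandwidth figure is an illustrative assumption, not from the slides):

```python
# Rough upper bound on single-stream decode speed when every generated
# token must read all weights from memory (bandwidth-bound regime).

def model_bytes(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param

def max_tokens_per_sec(num_params: float, bytes_per_param: float,
                       mem_bw_bytes_per_sec: float) -> float:
    return mem_bw_bytes_per_sec / model_bytes(num_params, bytes_per_param)

# 10B parameters at 2 bytes/param (FP16/BF16) -> ~20 GB of weights
print(model_bytes(10e9, 2) / 1e9, "GB")             # 20.0 GB

# Assumed accelerator with 2 TB/s of memory bandwidth (illustrative)
print(max_tokens_per_sec(10e9, 2, 2e12), "tok/s")   # ~100 tokens/s upper bound

# 8-bit weights halve the footprint and roughly double the bound
print(max_tokens_per_sec(10e9, 1, 2e12), "tok/s")   # ~200 tokens/s
```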
Types of computations
Operational intensity: OI = numOps / bytesReadWritten (see the worked example after this list)

• Compute-intensive

• Bandwidth-intensive

• Memory-intensive
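
A minimal sketch computing OI for two FP32 operations (the matrix sizes are illustrative assumptions): a large GEMM is compute-intensive, an elementwise ReLU is bandwidth-intensive.

```python
# Operational intensity OI = numOps / bytesReadWritten (FP32 = 4 bytes/element)

def gemm_oi(M, N, K):
    ops = 2 * M * N * K                        # multiply-accumulate counted as 2 ops
    bytes_moved = 4 * (M * K + K * N + M * N)  # read A, read B, write C
    return ops / bytes_moved

def relu_oi(n):
    ops = n                                    # one max(0, x) per element
    bytes_moved = 4 * (n + n)                  # read input, write output
    return ops / bytes_moved

print(gemm_oi(1024, 1024, 1024))  # ~170 ops/byte -> compute-intensive
print(relu_oi(1024 * 1024))       # 0.125 ops/byte -> bandwidth-intensive
```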
Recommenders: complex and diverse models

https://research.fb.com/wp-content/uploads/2020/06/DeepRecSys-A-System-for-Optimizing-End-To-End-At-Scale-Neural-Recommendation-Inference.pdf
Naïve HW and SW

• OI = 1/3 for a naïve elementwise op (2 reads and 1 write per operation)

Improve HW – SRAM

Improve SW – fuse computations

• Do more compute while the data is already in the processor
• GEMM and convolution (more compute-intensive)
• Activation functions (ReLU, sigmoid) (minimal OI)
• Improve OI by fusing the activation into the GEMM or convolution (see the sketch below)
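
A minimal NumPy sketch of the idea (array sizes are illustrative): the unfused version writes the GEMM result to memory and reads it back for the ReLU, while a fused kernel applies the activation while the result is still on hand, removing one full read and one full write of the intermediate tensor.

```python
import numpy as np

M, K, N = 1024, 1024, 1024
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

# Unfused: the intermediate C is written to memory, then read back for ReLU
C = A @ B                       # write M*N elements
out_unfused = np.maximum(C, 0)  # read M*N elements, write M*N elements again

# "Fused": activation applied to the GEMM result in one expression
# (a real fused kernel applies ReLU in registers before storing)
out_fused = np.maximum(A @ B, 0)

# Extra bytes moved by the unfused version for the intermediate tensor:
extra_bytes = 2 * M * N * 4     # one extra read + one extra write of C in FP32
print(extra_bytes / 1e6, "MB of avoidable traffic")  # ~8.4 MB
```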
Accessing DRAM is extremely expensive

(Energy figures based on 45 nm technology)
Better HW – use SRAM
• SRAM – much faster but…
• More expensive (in $, power, and area – 6 transistors vs 1 transistor)
• Assuming a layer's input, weights, and outputs fit in SRAM, the OI is much higher
Hardware feature → DL function

• Processor level: MAC units (multiply + accumulator), high core-to-core BW → Conv, GEMM

• Node level: processor + hierarchical SRAM + DRAM → Conv, GEMM, activations

• Server level: {GPUs, CPU sockets} with high inter-node BW → model / data parallelism

• Cluster level: high inter-server network BW, physical network topology → data parallelism
Increasing ops per clock cycle
• Vector – SIMD, SIMT
• Matrix – dataflow
• Nvidia Tensor Cores (2017), Intel TMUL (2021), all prevalent ASICs

Figure adapted from V. Sze, Y. Chen, T. Yang, and J. Emer. Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE, Dec. 2017.
Popular numerical precisions

• FP32 is usually the default numerical precision for training and inference

• FP16 is becoming the default for inference

• BF16 has been shown to provide virtually the same accuracy as FP32 for training and inference

• Simulated on various workloads, achieving virtually the same accuracy

• No hyper-parameter changes compared to FP32 on the simulated workloads

• INT8 has been shown to provide inference accuracy similar to FP32 for some models – popular in vision models

• Others… FP8, BF32, TF32/BF19
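
A minimal NumPy sketch (an illustration, not from the slides) of how BF16 relates to FP32: BF16 keeps FP32's 8 exponent bits but only 7 mantissa bits, so it can be emulated by dropping the low 16 bits of an FP32 value.

```python
import numpy as np

def fp32_to_bf16_trunc(x: np.ndarray) -> np.ndarray:
    """Emulate BF16 by truncating FP32's low 16 mantissa bits (round-toward-zero)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

x = np.array([3.14159265, 1e-3, 65504.0, 1e38], dtype=np.float32)
print(fp32_to_bf16_trunc(x))
# BF16 keeps FP32's dynamic range (8 exponent bits) with ~2-3 decimal digits
# of precision (7 mantissa bits). FP16 has 5 exponent / 10 mantissa bits:
# more precision but far less range (max ~65504), which is why BF16 is
# attractive for training without hyper-parameter changes.
```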


BF16 Accuracy Virtually Matches FP32
(Charts: FP32 vs. BF16 training curves for AlexNet, ResNet-50, SR-GAN Generator, and SR-GAN Discriminator)

Source: Kalamkar et al., 2019. https://arxiv.org/pdf/1905.12322.pdf


Low-precision (INT8) inference

Quantization flow: FP32 model → quantize model → INT8 model; INT8 primitives take INT8 inputs and a scale, with some layers kept in FP32

Techniques to reduce the INT8 accuracy loss:

• Symmetric quantization (zero shift)

• KL divergence to find a min/max threshold

• Quantize conv & inner product with channel-wise scales

• Offline calibration to pre-compute activation scaling factors

• Some layers run in FP32

• Quantization-aware training (retrain with the INT8 constraint)

Even with these techniques, models may still have unacceptable INT8 accuracy loss

The majority of the INT8 literature is focused on CNNs
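
A minimal NumPy sketch of symmetric (zero-shift) INT8 quantization with channel-wise scales; the tensor shape and calibration-by-max are illustrative assumptions (real flows often pick the threshold with KL divergence rather than the plain max).

```python
import numpy as np

def symmetric_int8_quantize(w: np.ndarray, axis: int = 0):
    """Per-channel symmetric quantization: no zero-point, scale = max|w| / 127."""
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    max_abs = np.abs(w).max(axis=reduce_axes, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 128).astype(np.float32)  # e.g., inner-product weights
q, scale = symmetric_int8_quantize(w, axis=0)    # one scale per output channel
w_hat = dequantize(q, scale)
print("max abs quantization error:", np.abs(w - w_hat).max())
```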


Supervised learning overview
The goal of supervised learning

• $\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} C(f_{\mathbf{w}}(x_n), y_n)$

The weights of the model are updated iteratively

• $\mathbf{w}_{t+1} = \mathbf{w}_t + \Delta\mathbf{w}$

• The optimizer's task is to find an appropriate Δw


Stochastic Gradient Descent

Minibatch #1: fprop → cost → bprop → Δw (repeated over the samples in the minibatch), then a weight update
Minibatch #2: fprop → cost → bprop → Δw, then a weight update
… and so on across minibatches
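
A minimal from-scratch NumPy sketch of this loop (the toy linear-regression problem, learning rate, and batch size are illustrative assumptions): each minibatch runs fprop, computes the cost, back-propagates the gradient, and applies the weight update $\mathbf{w}_{t+1} = \mathbf{w}_t + \Delta\mathbf{w}$.

```python
import numpy as np

# Toy linear regression trained with minibatch SGD (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)
true_w = rng.normal(size=(8, 1)).astype(np.float32)
y = X @ true_w + 0.01 * rng.normal(size=(1000, 1)).astype(np.float32)

w = np.zeros((8, 1), dtype=np.float32)
lr, batch_size = 0.1, 32

for epoch in range(10):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        pred = X[b] @ w                              # fprop
        cost = ((pred - y[b]) ** 2).mean()           # cost
        grad = 2 * X[b].T @ (pred - y[b]) / len(b)   # bprop
        w -= lr * grad                               # weight update

print("final cost:", float(((X @ w - y) ** 2).mean()))
```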
Flat vs Sharp minima

https://arxiv.org/abs/1609.04836
Pedigree of 1st order methods: $\Delta\mathbf{w} = -\lambda(t) \cdot h(\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}))$

Popular in industry:

• Momentum accelerates SGD in the direction of an exponentially decaying average of past gradients

• Adam uses an adaptive learning rate for each weight: the momentum (first-moment) estimate normalized by the square root of the second-moment (squared-gradient) estimate

• LAMB uses a local learning rate for each layer
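
A minimal NumPy sketch of the three update rules as single-step functions (hyper-parameter defaults are illustrative assumptions, and the LAMB step is simplified); each is one instance of Δw = −λ(t)·h(∇w L(w)).

```python
import numpy as np

def momentum_step(w, grad, state, lr=0.01, beta=0.9):
    # v accumulates an exponentially decaying average of past gradients
    state["v"] = beta * state.get("v", 0.0) + grad
    return w - lr * state["v"], state

def adam_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", 0.0) + (1 - b1) * grad        # first moment
    v = b2 * state.get("v", 0.0) + (1 - b2) * grad ** 2   # second moment
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
    state.update(t=t, m=m, v=v)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), state

def lamb_like_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # LAMB idea (simplified): take an Adam-style direction, then scale it
    # per layer by the "trust ratio" ||w|| / ||update||.
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", 0.0) + (1 - b1) * grad
    v = b2 * state.get("v", 0.0) + (1 - b2) * grad ** 2
    state.update(t=t, m=m, v=v)
    update = (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    trust = np.linalg.norm(w) / max(np.linalg.norm(update), 1e-12)
    return w - lr * trust * update, state

w, state = np.ones(4), {}
g = np.array([0.1, -0.2, 0.3, -0.4])
w, state = adam_step(w, g, state)
print(w)
```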
Distributed Training

Recommendation: use hybrid (data + model) parallelism for large models


Communication Primitives
• Most common: AllReduce, AllToAll, and AllGather
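
A minimal NumPy sketch (the worker count and tensor shapes are illustrative) of what the two most common primitives compute, simulated on a single process: AllReduce gives every worker the elementwise sum of all workers' tensors; AllGather gives every worker the concatenation of all workers' shards.

```python
import numpy as np

def all_reduce(worker_tensors):
    """Every worker ends up with the elementwise sum of all workers' tensors."""
    total = np.sum(worker_tensors, axis=0)
    return [total.copy() for _ in worker_tensors]

def all_gather(worker_shards):
    """Every worker ends up with the concatenation of all workers' shards."""
    gathered = np.concatenate(worker_shards, axis=0)
    return [gathered.copy() for _ in worker_shards]

# 4 workers, each holding its own local gradient (data parallelism)
grads = [np.full(3, fill_value=i, dtype=np.float32) for i in range(4)]
print(all_reduce(grads)[0])   # [6. 6. 6.] on every worker

# 4 workers, each holding one shard of a tensor
shards = [np.arange(2) + 10 * i for i in range(4)]
print(all_gather(shards)[0])  # [ 0  1 10 11 20 21 30 31]
```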
Physical Node Interconnects

Recommendation: a fully-connected topology gives the lowest communication time for the key communication primitives
Hardware is Easy – Compilers are Key

• Fuse nodes

• Data memory layout

• Backend code gen
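
A minimal sketch of handing a model to a graph compiler, assuming PyTorch 2.x's torch.compile as one example (the toy model is the same illustrative one used earlier): the compiler stack is what fuses nodes, picks data/memory layouts, and generates backend code.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

# torch.compile traces the computational graph and lowers it through a
# compiler stack that can fuse ops (e.g., Linear + ReLU), choose memory
# layouts, and generate backend-specific code.
compiled_model = torch.compile(model)

x = torch.randn(8, 16)
with torch.no_grad():
    y = compiled_model(x)  # first call triggers compilation; later calls reuse it
print(y.shape)
```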
