Introduction to AI Systems
Andres Rodriguez
Why AI Systems
Compute demand growing much faster than supply
1. Algorithms
2. Compilers
3. Hardware
https://github.com/karlrupp/microprocessor-trend-data
DL Models
• Moore’s Law 2x biannual growth is slowing down
• State-of-the-art (SotA) models double every 3-4 months
• Need to co-design: algorithms, SW, HW
DL training and inference overview
[Figure: a deep learning model is a computational graph mapping input tensors (2D, 3D, …) to outputs (e.g., probability of a pedestrian). Nodes are operations (conv, matrix multiply, pooling, ReLU, LSTM, data reorder, …); the data flow dictates how tensors flow through the graph.]
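The graph described above can be sketched in a few lines. This is an illustrative toy (all names here are assumptions, not from the lecture): a linear chain of operation nodes through which a tensor flows, ending in a probability such as "prob of pedestrian".

```python
import numpy as np

def matmul(x, w):
    return x @ w

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_graph(x, graph):
    # Execute a linear chain of (op, extra_args) nodes; the list order
    # is the data flow that dictates how the tensor moves through ops.
    for op, args in graph:
        x = op(x, *args)
    return x

rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 8))
w2 = rng.standard_normal((8, 1))
graph = [(matmul, (w1,)), (relu, ()), (matmul, (w2,)), (sigmoid, ())]
prob = run_graph(rng.standard_normal((1, 4)), graph)  # a value in (0, 1)
```

Real frameworks build the same structure as an explicit graph IR so compilers can analyze and rewrite it.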
Generative AI applications:
• Text generation: coherent and grammatically correct sentences
• Art and design: novel works of art and design
• Music generation: composing melodies & harmonies
• Medicine (potential): new drugs and treatments
• Finance (potential): predicting market trends
LLM inference:
• Two components: 1) input-token processing; 2) token generation
• Token: a vector representation of a word or part of a word
• Every generated token requires reading the entire model from memory
• 10B params ≈ 20 GB; 1T params ≈ 2 TB (at 2 bytes per parameter); half that with 8 bits
• Compute-intensive, bandwidth-intensive, and memory-intensive
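The model-size figures above follow from simple arithmetic. A back-of-the-envelope sketch (the helper name is illustrative), assuming 2 bytes per parameter for FP16/BF16 and 1 byte for 8-bit formats:

```python
def model_bytes(params, bytes_per_param=2):
    """Memory needed just to hold the weights (2 bytes/param = FP16/BF16)."""
    return params * bytes_per_param

GB, TB = 1e9, 1e12
print(model_bytes(10e9) / GB)                      # 10B params -> 20 GB
print(model_bytes(1e12) / TB)                      # 1T params  -> 2 TB
print(model_bytes(10e9, bytes_per_param=1) / GB)   # 8-bit: half -> 10 GB
```

Since every generated token reads the whole model, these byte counts divided by memory bandwidth bound the token rate.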
Recommenders: complex and diverse models
https://research.fb.com/wp-content/uploads/2020/06/DeepRecSys-A-System-for-Optimizing-End-To-End-At-Scale-Neural-Recommendation-Inference.pdf
Naïve HW and SW
Better HW – use SRAM (energy figures based on 45 nm technology)
• SRAM – much faster but…
• More expensive (in $, power, and area – 6 transistors per bit vs 1 for DRAM)
• Assuming a layer’s input, weights, and outputs fit in SRAM, the operational intensity (OI) is much higher
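Operational intensity is FLOPs divided by bytes moved. A sketch for one fully-connected layer (function and 2-byte element size are illustrative assumptions) shows why reuse matters: growing the batch reuses the weights and drives OI up, which is what keeping operands resident in SRAM enables.

```python
def fc_oi(m, k, n, bytes_per_elem=2):
    """OI of an (m,k) x (k,n) fully-connected layer if each operand
    is moved exactly once (input, weights, output)."""
    flops = 2 * m * k * n  # each multiply-accumulate counts as 2 ops
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

print(fc_oi(1, 1024, 1024))    # batch 1: OI ~ 1, bandwidth-bound
print(fc_oi(256, 1024, 1024))  # batch 256: far higher OI, compute-bound
```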
HW level                           HW feature         DL function
Server level {GPUs, CPU sockets}   High internode BW  Model / data parallelism
Figure adapted from V. Sze, Y. Chen, T. Yang, and J. Emer. Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE, Dec. 2017.
Popular numerical precisions
• BF16 shown to provide virtually the same accuracy for training and inference as FP32
• INT8 shown to provide similar accuracy for inference as FP32 for some models – popular in vision models
Even with these techniques, models may still have unacceptable INT8 accuracy loss
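To make the accuracy-loss discussion concrete, here is a minimal symmetric per-tensor INT8 quantization sketch (a generic recipe, not the specific techniques the slide refers to): map FP32 values into [-127, 127], round, then dequantize to measure the error introduced.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric scale chosen so the largest magnitude maps to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int8(w)
max_err = np.abs(w - dequantize(q, s)).max()  # bounded by scale / 2
```

Outlier weights inflate the scale and hence the rounding error for all other values, which is one reason some models lose unacceptable accuracy at INT8.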
• $\min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}) = \sum_{n=1}^{N} C(f_{\boldsymbol{w}}(x_n), y_n)$
Flat vs Sharp minima
https://arxiv.org/abs/1609.04836
Pedigree of 1st-order methods: $\Delta\boldsymbol{w} = -\lambda(t) \cdot h(\nabla_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}))$
Popular in industry
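The update rule above covers a whole family of optimizers by choice of $h$: plain SGD takes $h$ as the identity, sign-SGD takes $h = \mathrm{sign}$, and $\lambda(t)$ is the learning-rate schedule. A minimal sketch (names illustrative, constant schedule) minimizing $\mathcal{L}(\boldsymbol{w}) = \tfrac{1}{2}\|\boldsymbol{w}\|^2$, whose gradient is $\boldsymbol{w}$:

```python
import numpy as np

def step(w, grad, t, lr=lambda t: 0.1, h=lambda g: g):
    # Generic 1st-order update: delta_w = -lr(t) * h(grad)
    return w - lr(t) * h(grad)

w = np.array([4.0, -2.0])
for t in range(100):
    w = step(w, w, t)   # grad of 0.5*||w||^2 is w
# w has contracted toward the minimum at the origin
```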
Recommendation: a fully-connected topology gives the lowest communication time for the key communication primitives
Hardware is Easy – Compilers are Key
• Fuse nodes
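Node fusion is the classic example of such a compiler pass: merge an elementwise op into its producer so the intermediate tensor never touches memory. A toy sketch (the op names and fusion table are illustrative, not a real compiler):

```python
# Producer/consumer pairs that a backend could execute as one kernel.
FUSIBLE = {("matmul", "relu"): "matmul_relu",
           ("conv", "relu"): "conv_relu"}

def fuse_nodes(graph):
    """Greedily fuse adjacent ops in a linear graph of op names."""
    fused, i = [], 0
    while i < len(graph):
        pair = tuple(graph[i:i + 2])
        if pair in FUSIBLE:
            fused.append(FUSIBLE[pair])
            i += 2
        else:
            fused.append(graph[i])
            i += 1
    return fused

print(fuse_nodes(["conv", "relu", "matmul", "relu", "pool"]))
# -> ['conv_relu', 'matmul_relu', 'pool']
```

Each fusion removes one intermediate-tensor write and read, raising operational intensity with no change to the math.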