Lecture #3
21st January 2025
Previous Class
▪ DNN Training and Inference
▪ CNN Basics
2
Today’s Agenda
▪ Example of convolution
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models
3
CNN Parameters (Recap)
4
Conv Layer
▪ Filters are 4-dimensional
– R × S × C × M
▪ What are R, S, C, M?
– R: Height of the filter
– S: Width of the filter
– C: Number of channels
– M: Number of filters
5
Conv Layer Implementation
▪ Naïve 7-layer for-loop implementation:
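A minimal Python sketch of such a loop nest, assuming an N×C×H×W input layout and M×C×R×S filters (variable names are illustrative, not necessarily those on the slide):

import numpy as np

def conv_layer(ifmap, filters, stride=1):
    # ifmap: N x C x H x W input feature maps; filters: M x C x R x S
    N, C, H, W = ifmap.shape
    M, _, R, S = filters.shape
    E = (H - R) // stride + 1          # output height
    F = (W - S) // stride + 1          # output width
    ofmap = np.zeros((N, M, E, F))
    for n in range(N):                         # 1: batch
        for m in range(M):                     # 2: output channels (filters)
            for e in range(E):                 # 3: output rows
                for f in range(F):             # 4: output columns
                    for c in range(C):         # 5: input channels
                        for r in range(R):     # 6: filter rows
                            for s in range(S): # 7: filter columns
                                ofmap[n, m, e, f] += (
                                    ifmap[n, c, e * stride + r, f * stride + s]
                                    * filters[m, c, r, s])
    return ofmap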
6
Conv Layer Computation
7
Conv Layer Computation
▪ Both channel 1 and channel 2 of the input feature map (on the left) are used to generate channel 1 of the output
11
Conv Layer Computation
▪ We can flatten both the filters and the input feature map
– The computation is vector-matrix multiplication
12
Conv Layer Computation
▪ We can flatten both filters and input feature maps in the same way as before
▪ Naturally generate flattened output feature maps of two channels
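A minimal sketch of this flattening (im2col-style), assuming unit stride and no padding; the names and layout are illustrative:

import numpy as np

def conv_as_gemm(ifmap, filters):
    # ifmap: C x H x W (one input); filters: M x C x R x S
    C, H, W = ifmap.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1
    # Each column of the data matrix is one flattened receptive field.
    cols = np.zeros((C * R * S, E * F))
    for e in range(E):
        for f in range(F):
            cols[:, e * F + f] = ifmap[:, e:e + R, f:f + S].reshape(-1)
    # Each row of the weight matrix is one flattened filter.
    W_mat = filters.reshape(M, C * R * S)
    # One matrix-matrix product produces all M flattened output channels.
    return (W_mat @ cols).reshape(M, E, F)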
14
Conv Layer Computation
▪ What are R, S, C, M?
▪ R: Height of the filter
▪ S: Width of the filter
▪ C: Number of channels
▪ M: Number of filters
15
A Quantitative Example
An Example
17
Converting Filter Traces into Matrix
18
Filter Flattened into a Vector
▪ The matrix of weights for the convolutional layer can be flattened into a vector, K = [1., 2., 3., 4.]
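A tiny sketch of this example; the 2×2 input patch below is an assumed value, used only to show that applying the flattened filter reduces to a dot product:

import numpy as np

filt = np.array([[1., 2.],
                 [3., 4.]])
K = filt.reshape(-1)                 # flattened filter: [1., 2., 3., 4.]

patch = np.array([[5., 6.],          # assumed 2x2 input patch
                  [7., 8.]])
out = K @ patch.reshape(-1)          # 1*5 + 2*6 + 3*7 + 4*8 = 70.0
print(K, out)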
19
Fully Connected Layer
22
Fully Connected Layer
23
Fully Connected Layer
24
Flattened Fully Connected Layer
▪ After flattening, having a batch size of N turns the matrix-vector operation into a
matrix-matrix multiplication
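A minimal sketch of that observation, assuming M×K weights and a batch of N flattened inputs stored as a K×N matrix:

import numpy as np

def fc_layer(weights, inputs, bias):
    # weights: M x K, inputs: K x N (N flattened samples), bias: M
    # With N > 1 this is a matrix-matrix multiplication (GEMM).
    return weights @ inputs + bias[:, None]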
25
Flattened Fully Connected Layer
▪ After flattening, having a batch size of N turns the matrix-vector operation into a
matrix-matrix multiplication
▪ How much temporal locality (reuse of data within a short time window) does this implementation have?
– None
29
Tiled Fully Connected Layer
▪ Matrix multiplication is tiled to fit in cache
▪ Computation ordered to maximize reuse of data in cache
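A minimal sketch of tiled matrix multiplication, assuming a tile size T chosen so that three T×T tiles fit in cache (an illustration of the idea, not how an optimized BLAS is written):

import numpy as np

def tiled_matmul(A, B, T=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i0 in range(0, M, T):              # tile over rows of A / C
        for j0 in range(0, N, T):          # tile over columns of B / C
            for k0 in range(0, K, T):      # tile over the shared dimension
                # Each small tile is reused many times while it is in cache.
                C[i0:i0 + T, j0:j0 + T] += (
                    A[i0:i0 + T, k0:k0 + T] @ B[k0:k0 + T, j0:j0 + T])
    return C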
30
Tiled Fully Connected Layer
▪ Implementation: Matrix Multiplication (GEMM)
– CPU: OpenBLAS, Intel MKL, etc.
– GPU: cuBLAS, cuDNN, etc.
▪ The library notes the shape of the matrix multiplication and selects an implementation optimized for that shape
▪ Optimization usually involves proper tiling for the storage hierarchy
31
GV100 – “Tensor Core”
▪ New opcodes
– Matrix Multiply Accumulate (HMMA)
▪ FP16 operands
– 48 inputs / 16 outputs
▪ 64 multiplies
▪ 64 adds
▪ 120 TFLOPS (FP16)
▪ 400 GFLOPS/W (FP16)
32
Tensor Processing Unit
33
Today’s Agenda
▪ Example of convolution
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models
34
Non-linear Operation
35
More Activation Functions
36
Today’s Agenda
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models
37
Pooling (Pool) Layer
▪ Reduce resolution of each channel independently
▪ Overlapping or non-overlapping
– Depends on stride
▪ Increases translational invariance and noise resilience
38
Translational Invariance
Case-1 vs. Case-2: the output fmaps are similar
39
Translational Invariance
▪ Provides the same output independent of the location of the object within the
image
▪ Pooling helps to provide the invariance
40
Pooling Layer Implementation
▪ Naïve 6-layer for-loop implementation for max-pool
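A minimal Python sketch of such a loop nest, assuming an N×C×H×W input and a non-overlapping R×S window (stride equal to the window size); names are illustrative:

import numpy as np

def maxpool_layer(ifmap, R=2, S=2, stride=2):
    N, C, H, W = ifmap.shape
    E = (H - R) // stride + 1
    F = (W - S) // stride + 1
    ofmap = np.full((N, C, E, F), -np.inf)
    for n in range(N):                     # 1: batch
        for c in range(C):                 # 2: channels (pooled independently)
            for e in range(E):             # 3: output rows
                for f in range(F):         # 4: output columns
                    for r in range(R):     # 5: window rows
                        for s in range(S): # 6: window columns
                            ofmap[n, c, e, f] = max(
                                ofmap[n, c, e, f],
                                ifmap[n, c, e * stride + r, f * stride + s])
    return ofmap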
41
Today’s Agenda
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models
42
Normalization Layer
▪ Batch Normalization
– Normalizes activations towards mean = 0 and std dev = 1 based on the statistics of the training data set
– Placed between the conv/FC layer and the activation function
▪ Believed to be key to getting high accuracy and faster training for DNNs
43
Normalization Layer
▪ The normalized values are further scaled and shifted
– The parameters are learnt through training
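A minimal sketch of the scale-and-shift step at inference time; gamma (scale) and beta (shift) are the learned parameters, and mean/var are the stored statistics:

import numpy as np

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize to mean 0, std dev 1
    return gamma * x_hat + beta               # learned scale and shift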
44
Today’s Agenda
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models
45
Commonly used Dataset
▪ MNIST
– Digit Classification
– 28×28 pixels (B&W)
– 10 Classes
– 60,000 training
– 10,000 testing
46
LeNet-5
▪ Conv layers: 2
▪ Fully connected layers: 2
▪ Weights: 60k
▪ MACs: 341k
▪ Sigmoid used for non-linearity
47
LeNet-5
48
ImageNet
▪ Image classification
– 256×256
– Colour images
– 1000 classes
– 1.3M training
– 100,000 testing
▪ For the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
– Accuracy of the classification task reported based on top-1/top-5 error
– What is top-K error?
49
AlexNet (Krizhevsky et al., NeurIPS 2012)
▪ ILSVRC12 Winner
▪ Uses local response normalization (LRN)
▪ Structure
– 5 conv layers
– 3 fully connected layers
– Weights: 61M
– MACs: 724M
– ReLU used for non-linearity
50
AlexNet: Large Sizes with Varying Shapes
52
VGG-16 (Simonyan et al., ICLR 2015)
▪ Conv layers: 13
▪ FC layers: 3
▪ Weights: 138M
▪ MACs: 15.5G
▪ There is a 19-layer version too (VGG-19)
53
Stacked Filters
▪ Deeper networks mean more weights
▪ Use a stack of smaller filters (3×3) to cover the same receptive field with fewer filter weights
54
Stacked Filters
▪ Deeper networks mean more weights
▪ Use a stack of smaller filters (3×3) to cover the same receptive field with fewer filter weights
▪ Non-linear activations inserted between each filter
– 5×5 filter (25 weights) → two 3×3 filters (18 weights)
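A small worked check of the weight and receptive-field claim above (per filter, single channel):

stacked = 2 * (3 * 3)            # two stacked 3x3 filters -> 18 weights
single = 5 * 5                   # one 5x5 filter -> 25 weights
# With stride 1, each extra 3x3 layer grows the receptive field by 2,
# so two layers cover 3 + 2 = 5 pixels, matching the 5x5 filter.
receptive_field = 3 + (3 - 1)
print(stacked, single, receptive_field)   # 18 25 5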
56
Deep into Inception
57
1×1 Convolution
58
Inception V1
▪ Apply 1×1 before ‘large’ convolution filters
▪ Reduce weights such that the entire DNN can be trained on one GPU
▪ Number of multiplications reduced from 854M → 358M
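A worked example with assumed dimensions (illustrative only, not the slide's exact 854M/358M figures): a 5×5 convolution over a 28×28×192 input producing 32 output channels, with and without a 1×1 bottleneck down to 16 channels:

H = W = 28
C_in, C_mid, C_out = 192, 16, 32

direct = H * W * C_out * (5 * 5 * C_in)        # 5x5 conv applied directly
bottleneck = (H * W * C_mid * (1 * 1 * C_in)   # 1x1 reduction to C_mid channels
              + H * W * C_out * (5 * 5 * C_mid))
# Roughly 120M vs 12M multiplies for these assumed dimensions.
print(f"{direct / 1e6:.1f}M vs {bottleneck / 1e6:.1f}M multiplies")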
61
THANK YOU