3 Lecture 21 01 25

The document outlines a lecture on deep neural networks (DNNs), covering topics such as convolution, pooling, normalization layers, and commonly used datasets and models. It includes detailed explanations of convolutional layer parameters, fully connected layers, and various DNN architectures like LeNet-5, AlexNet, and VGG-16. The lecture emphasizes the importance of techniques like batch normalization and the use of smaller stacked filters for efficient training and accuracy in DNNs.


E0 294: Systems for Machine Learning

Lecture #3
21st January 2025
Previous Class
▪ DNN Training and Inference
▪ CNN Basics

2
Today’s Agenda
▪ Example of convolution
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models

3
CNN Parameters (Recap)

4
Conv Layer
▪ Filters are 4-dimensional
– R × S × C × M
▪ What are R, S, C, M?
– R: Height of the filter
– S: Width of the filter
– C: Number of channels
– M: Number of filters

5
Conv Layer Implementation
▪ Naïve 7-layer for-loop implementation:
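The listing on the slide is not reproduced in this text, so the following is a minimal NumPy sketch of the seven nested loops, assuming stride 1, no padding, and illustrative array names (inp, filters, out):

import numpy as np

# Shapes (small hypothetical example): N input fmaps of C x H x W,
# M filters of C x R x S, producing N output fmaps of M x E x F.
N, C, H, W = 2, 3, 8, 8
M, R, S = 4, 3, 3
E, F = H - R + 1, W - S + 1            # output height/width (stride 1, no padding)

inp = np.random.rand(N, C, H, W)
filters = np.random.rand(M, C, R, S)
out = np.zeros((N, M, E, F))

for n in range(N):                     # loop 1: batch
    for m in range(M):                 # loop 2: output channels (filters)
        for e in range(E):             # loop 3: output rows
            for f in range(F):         # loop 4: output columns
                for c in range(C):     # loop 5: input channels
                    for r in range(R):         # loop 6: filter rows
                        for s in range(S):     # loop 7: filter columns
                            out[n, m, e, f] += inp[n, c, e + r, f + s] * filters[m, c, r, s]

In practice this loop nest is exactly what convolution libraries reorder, tile, and vectorize.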

6
Conv Layer Computation

7
Conv Layer Computation

8
Conv Layer Computation

9
Conv Layer Computation

10
Conv Layer Computation

▪ Both channel 1 and channel 2 of the input on the left are used to generate channel 1 of the output

11
Conv Layer Computation

▪ We can flatten both the filters and the input feature map
– The computation is vector-matrix multiplication
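As a hedged illustration of this flattening (an im2col-style layout, not necessarily the exact ordering drawn on the slide), each receptive field of the input is unrolled into one column so that the convolution becomes a vector-matrix product:

import numpy as np

C, H, W = 2, 4, 4
R, S = 2, 2
E, F = H - R + 1, W - S + 1            # stride 1, no padding

inp = np.random.rand(C, H, W)
filt = np.random.rand(C, R, S)

# Build the flattened input: one column per output position,
# each column holds the C*R*S input values that position sees.
cols = np.zeros((C * R * S, E * F))
for e in range(E):
    for f in range(F):
        patch = inp[:, e:e + R, f:f + S]
        cols[:, e * F + f] = patch.ravel()

flat_filter = filt.ravel()             # flattened filter: 1 x (C*R*S)
out = flat_filter @ cols               # vector-matrix multiplication
out_fmap = out.reshape(E, F)           # back to a 2-D output channel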
12
Conv Layer Computation

▪ Filter 2 takes (channel 1, channel 2) of the input feature map, but produces channel 2 of the output feature map
▪ Similarly, Filter 1 takes (channel 1, channel 2) of the input and produces channel 1 of the output
13
Conv Layer Computation

▪ We can flatten both filters and input feature maps in the same way as before
▪ This naturally generates flattened output feature maps for the two channels

14
Conv Layer Computation

▪ What are R, S, C, M?
▪ R: Height of the filter
▪ S: Width of the filter
▪ C: Number of channels
▪ M: Number of filters

15
A Quantitative Example
An Example

17
Converting Filter Traces into Matrix

18
Filter Flattened into a Vector
▪ The matrix of weights for the convolutional layer can be flattened into a vector
K = [1., 2., 3., 4.]
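Continuing the slide's example, a minimal check that the flattened filter K = [1., 2., 3., 4.] applied as a dot product over every 2×2 window reproduces the convolution (the 3×3 input values below are made up for illustration):

import numpy as np

K = np.array([1., 2., 3., 4.])                  # flattened 2x2 filter
inp = np.arange(9, dtype=float).reshape(3, 3)   # hypothetical 3x3 input fmap

out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        patch = inp[i:i + 2, j:j + 2].ravel()   # flatten the 2x2 window
        out[i, j] = K @ patch                   # dot product with K
print(out)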

19
Fully Connected Layer: Extracting Parallelism
Batch of N Input fmaps
▪ A batch has N input fmaps, so we apply the same M filters of size C×H×W and generate N output fmaps of size 1×1×M

21
Fully Connected Layer

▪ We flatten three dimensions into one
– C (number of channels of the input fmap) and H×W (the filter size)
– All the filter weights for one output channel are flattened into one row of length CHW; we have M such rows, one per output channel
▪ To perform matrix multiplication, the input fmaps also need to be flattened to have CHW rows
– After multiplication, the output contains one point for each of the M output channels
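A minimal sketch of this flattening with made-up sizes: the M filters form an M × CHW weight matrix, a batch of N flattened input fmaps forms a CHW × N matrix, and one multiplication produces all M output points for every input in the batch.

import numpy as np

N, C, H, W, M = 4, 3, 7, 7, 10           # batch, input fmap dims, number of filters

filters = np.random.rand(M, C, H, W)     # FC filters cover the whole input fmap
inputs  = np.random.rand(N, C, H, W)

Wmat = filters.reshape(M, C * H * W)     # each row: one flattened filter
X    = inputs.reshape(N, C * H * W).T    # each column: one flattened input fmap

out = Wmat @ X                           # M x N: one point per output channel per input
assert out.shape == (M, N)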

22
Fully Connected Layer

▪ We flatten three dimensions into one
– C (number of channels of the input fmap) and H×W (the filter size)
– All the filter weights for one output channel are flattened into one row of length CHW; we have M such rows, one per output channel
▪ To perform matrix multiplication, the input fmaps also need to be flattened to have CHW rows
– After multiplication, the output contains one point for each of the M output channels

23
Fully Connected Layer

▪ We flatten three dimensions into one
– C (number of channels of the input fmap) and H×W (the filter size)
– All the filter weights for one output channel are flattened into one row of length CHW; we have M such rows, one per output channel
▪ To perform matrix multiplication, the input fmaps also need to be flattened to have CHW rows
– After multiplication, the output contains one point for each of the M output channels

24
Flattened Fully Connected Layer

▪ After flattening, having a batch size of N turns the matrix-vector operation into a
matrix-matrix multiplication

25
Flattened Fully Connected Layer

▪ After flattening, having a batch size of N turns the matrix-vector operation into a
matrix-matrix multiplication

26
Flattened Fully Connected Layer

▪ After flattening, having a batch size of N turns the matrix-vector operation into a
matrix-matrix multiplication

27
Flattened Fully Connected Layer

▪ After flattening, having a batch size of N turns the matrix-vector operation into a
matrix-matrix multiplication

28
Flattened Fully Connected Layer

▪ After flattening, having a batch size of N turns the matrix-vector operation into a
matrix-matrix multiplication
▪ How much temporal locality (reuse of data within a time frame) does this implementation have?
– None
29
Tiled Fully Connected Layer
▪ Matrix multiplication is tiled to fit in cache
▪ Computation is ordered to maximize reuse of data in the cache
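A minimal Python sketch of the tiling idea (the tile size T and the triple-blocked loop order below are illustrative, not the course's reference implementation): each T×T tile of the operands is reused across a whole block of the output while it is likely still resident in cache.

import numpy as np

def tiled_matmul(A, B, T=32):
    """Blocked GEMM: C = A @ B computed tile by tile."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i0 in range(0, M, T):
        for j0 in range(0, N, T):
            for k0 in range(0, K, T):
                # Each tile of A and B is reused for the whole output block.
                C[i0:i0 + T, j0:j0 + T] += (
                    A[i0:i0 + T, k0:k0 + T] @ B[k0:k0 + T, j0:j0 + T]
                )
    return C

A = np.random.rand(100, 80)
B = np.random.rand(80, 60)
assert np.allclose(tiled_matmul(A, B), A @ B)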

30
Tiled Fully Connected Layer
▪ Implementation: Matrix Multiplication (GEMM)
– CPU: OpenBLAS, Intel MKL, etc.
– GPU: cuBLAS, cuDNN, etc.
▪ The library notes the shape of the matrix multiplication and selects an implementation optimized for that shape
▪ Optimization usually involves proper tiling to the storage hierarchy

31
GV100 – “Tensor Core”

▪ New opcodes
– Matrix Multiply Accumulate (HMMA)
▪ FP16 operands
– 48 inputs / 16 outputs
▪ 64 multiplies
▪ 64 adds
▪ 120 TFLOPS (FP16)
▪ 400 GFLOPS/W (FP16)
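A rough back-of-the-envelope check of where these numbers come from (an interpretation, not NVIDIA documentation): one HMMA step computes D = A×B + C on 4×4 matrices.

# One 4x4x4 matrix multiply-accumulate, D = A @ B + C
inputs  = 3 * (4 * 4)   # A, B and C operands: 48 input values
outputs = 4 * 4         # 16 output values
muls    = 4 * 4 * 4     # 64 multiplies
adds    = 4 * 4 * 4     # 64 adds (3 per dot product + 1 to accumulate C, times 16 outputs)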
32
Tensor Processing Unit

33
Today’s Agenda
▪ Example of convolution
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models

34
Non-linear Operation

35
More Activation Functions
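The plots for these slides are not reproduced in this text; as a minimal sketch, a few commonly used non-linearities written directly in NumPy (this particular selection is illustrative and may differ from the figure):

import numpy as np

def relu(x):
    # max(0, x): the most common choice in modern CNNs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # like ReLU, but keeps a small slope for negative inputs
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    # squashes values into (0, 1); used in LeNet-5-era networks
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x), leaky_relu(x), sigmoid(x), np.tanh(x))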

36
Today’s Agenda
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models

37
Pooling (Pool) Layer
▪ Reduce resolution of each channel independently
▪ Overlapping or non-overlapping
– Depends on stride
▪ Increases translational invariance and noise resilience

38
Translational Invariance

Case 1 vs Case 2: the output fmaps are similar

39
Translational Invariance
▪ Provides the same output independent of the location of the object within the
image
▪ Pooling helps to provide the invariance

40
Pooling Layer Implementation
▪ Naïve 6-layer for-loop implementation for max-pool
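The listing on the slide is not reproduced in this text; below is a minimal NumPy sketch of the six nested loops for 2×2 max-pooling with stride 2 (non-overlapping), using illustrative array names:

import numpy as np

N, C, H, W = 2, 3, 8, 8
R = S = 2                              # pooling window
stride = 2                             # non-overlapping
E, F = H // stride, W // stride

inp = np.random.rand(N, C, H, W)
out = np.full((N, C, E, F), -np.inf)

for n in range(N):                     # loop 1: batch
    for c in range(C):                 # loop 2: channels (pooled independently)
        for e in range(E):             # loop 3: output rows
            for f in range(F):         # loop 4: output columns
                for r in range(R):     # loop 5: window rows
                    for s in range(S): # loop 6: window columns
                        val = inp[n, c, e * stride + r, f * stride + s]
                        out[n, c, e, f] = max(out[n, c, e, f], val)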

41
Today’s Agenda
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models

42
Normalization Layer
▪ Batch Normalization
– Normalizes activations towards mean = 0 and std dev = 1, based on the statistics of the training data set
– Put between conv/FC and activation function
▪ Believed to be key to getting high accuracy and faster training for DNNs

43
Normalization Layer
▪ The normalized values are further scaled and shifted
– The parameters are learnt through training
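A minimal sketch of batch normalization at training time, assuming NCHW activations and per-channel statistics; gamma and beta are the learned scale and shift mentioned above, and eps is a small constant to avoid division by zero:

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); statistics computed per channel over N, H, W
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var  = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)        # mean 0, std dev 1
    return gamma * x_hat + beta                    # learned scale and shift

x = np.random.rand(8, 4, 6, 6)
gamma = np.ones((1, 4, 1, 1))
beta  = np.zeros((1, 4, 1, 1))
y = batch_norm(x, gamma, beta)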

44
Today’s Agenda
▪ Non-linear operation
▪ Pooling operation
▪ Normalization layer
▪ Commonly used dataset
▪ Commonly used DNN models

45
Commonly used Dataset
▪ MNIST
– Digit Classification
– 28×28 pixels (B&W)
– 10 Classes
– 60,000 training
– 10,000 testing

46
LeNet-5
▪ Conv layers: 2
▪ Fully connected layers: 2
▪ Weights: 60k
▪ MACs: 341k
▪ Sigmoid used for non-linearity

47
LeNet-5

48
ImageNet
▪ Image classification
– 256×256 colour images
– 1000 classes
– 1.3M training images
– 100,000 testing images
▪ For the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
– Accuracy of the classification task is reported based on top-1/top-5 error
– What is top-K error?
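To answer the question on the slide: a prediction is counted as correct under the top-K metric if the true label appears among the K highest-scoring classes. A minimal sketch with made-up scores:

import numpy as np

def top_k_error(scores, labels, k=5):
    # scores: (num_samples, num_classes), labels: (num_samples,)
    topk = np.argsort(scores, axis=1)[:, -k:]          # k best-scoring classes per sample
    correct = np.any(topk == labels[:, None], axis=1)  # is the true label among them?
    return 1.0 - correct.mean()

scores = np.random.rand(100, 1000)    # hypothetical ILSVRC-style scores
labels = np.random.randint(0, 1000, size=100)
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=5))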

49
AlexNet (Krizhevsky et al., NeurIPS 2012)
▪ ILSVRC12 Winner
▪ Uses local response normalization (LRN)
▪ Structure
– 5 conv layers
– 3 fully connected layers
– Weights: 61M
– MACs: 724M
– ReLU used for non-linearity

50
AlexNet (Krizhevsky et al., NeurIPS 2012)
▪ ILSVRC12 Winner
▪ Uses local response normalization (LRN)
▪ Structure
– 5 conv layers
– 3 fully connected layers
– Weights: 61M
– MACs: 724M
– ReLU used for non-linearity

51
AlexNet: Large Sizes with Varying Shapes

52
VGG-16 (Simonyan et al., ICLR 2015)
▪ Conv layers: 13
▪ FC layers: 3
▪ Weights: 138M
▪ MACs: 15.5G
▪ There is a 19-layer version too (VGG-19)

53
Stacked Filters
▪ Deeper networks mean more weights
▪ Use a stack of smaller filters (3×3) to cover the same receptive field with fewer filter weights

54
Stacked Filters
▪ Deeper networks mean more weights
▪ Use a stack of smaller filters (3×3) to cover the same receptive field with fewer filter weights

55
Stacked Filters
▪ Deeper networks mean more weights
▪ Use a stack of smaller filters (3×3) to cover the same receptive field with fewer filter weights
▪ Non-linear activations are inserted between the stacked filters
– 5×5 filter (25 weights) → two 3×3 filters (18 weights)
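A quick check of the weight counts, generalized to a hypothetical layer with C input and C output channels (the 25-vs-18 comparison on the slide is the per-channel-pair case):

C = 64                                  # hypothetical channel count
weights_5x5     = 5 * 5 * C * C         # one 5x5 conv layer: 25 * C^2
weights_two_3x3 = 2 * (3 * 3 * C * C)   # two stacked 3x3 layers, same 5x5 receptive field: 18 * C^2
print(weights_5x5, weights_two_3x3)     # 102400 vs 73728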

56
Deep into Inception

57
1×1 Convolution

58
1×1 Convolution

59
1×1 Convolution

60
Inception V1
▪ Apply 1×1 before ‘large’ convolution filters
▪ Reduce weights such that the entire DNN can be trained on one GPU
▪ Number of multiplications reduced from 854M → 358M
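A worked example of how the 1×1 "bottleneck" cuts multiplications (the channel counts and fmap size below are illustrative, not the actual Inception layer sizes behind the 854M → 358M figure):

H = W = 28                                   # hypothetical output fmap size
C_in, C_out, C_mid = 192, 128, 64            # hypothetical channel counts

direct     = H * W * C_out * (5 * 5 * C_in)               # 5x5 conv directly on C_in channels
bottleneck = (H * W * C_mid * (1 * 1 * C_in)              # 1x1 conv to reduce channels first
              + H * W * C_out * (5 * 5 * C_mid))          # then 5x5 conv on the reduced channels
print(direct, bottleneck)                                 # ~482M vs ~170M multiplies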

61
THANK YOU
