
Deep Learning Basics

簡韶逸 Shao-Yi Chien


Department of Electrical Engineering
National Taiwan University

1
References and Slide Credits
• Slides from Deep Learning for Computer Vision, Prof. Yu-Chiang Frank Wang, National Taiwan University
• Slides from Machine Learning, Prof. Hung-Yi Lee, EE, National Taiwan University
• Slides from CE 5554 / ECE 4554: Computer Vision, Prof. J.-B. Huang, Virginia Tech
• http://cs231n.stanford.edu/syllabus.html
• Marc'Aurelio Ranzato, Tutorial in CVPR 2014
• Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning
• https://www.deeplearningbook.org/
• Bishop, Pattern Recognition and Machine Learning
• Reference papers
2
Outline
• Introduction to neural networks
• Go deeper
• Introduction to convolutional neural networks (CNN)
• Modern CNN models

3
History of Neural Network and
Deep Learning [Prof. Hung-Yi Lee]

• 1958: Perceptron (linear model)
• 1969: Perceptron has limitations
• 1980s: Multi-layer perceptron
• Not significantly different from today's DNNs
• 1986: Backpropagation
• Usually more than 3 hidden layers was not helpful
• 1989: 1 hidden layer is "good enough", so why deep?
• 2006: RBM initialization (breakthrough)
• 2009: GPU
• 2011: Started to become popular in speech recognition
• 2012: Won the ILSVRC image competition
[Photo: Geoffrey Hinton]

LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey, “Deep learning,” Nature, 2015.
4
How Powerful?
Object Recognition

Not deep-learning

Deep-learning based

Source:
https://devblogs.nvidia.com/parallelforall/mocha-jl-deep-learning-julia/
https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
5
Biological neuron and Perceptrons

A biological neuron An artificial neuron (Perceptron)


- a linear classifier
Simple, Complex and Hypercomplex cells

David H. Hubel and Torsten Wiesel

Suggested a hierarchy of feature detectors in the visual cortex, with higher-level features responding to patterns of activation in lower-level cells, and propagating activation upwards to still higher-level cells.

David Hubel's Eye, Brain, and Vision
Hubel/Wiesel Architecture and Multi-layer Neural Network

Hubel and Wiesel's architecture Multi-layer Neural Network


- A non-linear classifier
Hierarchical Representation Learning
• Successive model layers learn deeper intermediate representations.

9
Recap: Linear Classification
• Linear Classifier
• Let’s take the input image as x, and the linear classifier as W.
We need y = Wx + b as a 10-dimensional output vector, indicating the score for each class.
• For example, an image with 2 x 2 pixels & 3 classes of interest
we need to learn a linear classifier W (plus a bias b),
so that desirable outputs y = Wx + b can be expected.

Image credit: Stanford CS231n 10
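As a concrete sketch of this slide (a NumPy example; the pixel values and weights below are made up for illustration, with the 2 x 2 image flattened into a 4-vector and 3 classes):

import numpy as np

x = np.array([56.0, 231.0, 24.0, 2.0])           # 2 x 2 image flattened to 4 pixel values (illustrative)
W = np.array([[ 0.2, -0.5,  0.1,  2.0],          # one row of weights per class (3 x 4)
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])                   # one bias per class

scores = W @ x + b                               # y = Wx + b: one score per class
print(scores)                                    # the class with the highest score is the prediction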


Multi-Layer Perceptron: A Nonlinear Classifier

11
Multi-Layer Perceptron: A Nonlinear Classifier (cont’d)

12
Layer 1 in MLP

13
Layer 2 in MLP

14
Multi-Layer Perceptron: A Nonlinear Classifier (cont’d)

15
Let’s Get a Closer Look…

• A single neuron 1

0.5

0
一5 0 5
output of neuron

activity of neuron

inputs to neuron
16
Input-Output Function of a Single Neuron

x(z1, z2) = 1 / (1 + exp(−w1·z1 − w2·z2))

[Figures, slides 17-26: response surface and contour plot of x(z1, z2) over z1, z2 ∈ [−5, 5] for the weight vectors
w = [0, 1], [0.2, 1], [0.3, 0.9], [0.5, 0.9], [0.6, 0.8], [0.8, 0.6], [0.9, 0.5], [0.9, 0.3], [1, 0.2], [1, 0]]
Input-Output Function of a Single Neuron (cont'd)

x(z1, z2) = 1 / (1 + exp(−w1·z1 − w2·z2))

[Figures, slides 27-31: the same plots for w = [0, 1], [0, 2], [0, 3], [0, 4], [0, 5]]
Input-Output Function of a Single Neuron (cont'd)

• w = [0, 1]
[Figure: response surface and contours; the direction of w sets the direction of the boundary, and the magnitude of w sets the steepness of the boundary]

x(z1, z2) = 1 / (1 + exp(−w1·z1 − w2·z2))
32
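A minimal NumPy sketch of this single-neuron function; the test point and weight vectors are chosen to illustrate the two effects above:

import numpy as np

def neuron(z, w):
    # x(z1, z2) = 1 / (1 + exp(-(w1*z1 + w2*z2)))
    return 1.0 / (1.0 + np.exp(-(w @ z)))

z = np.array([0.0, 1.0])                         # a point one unit along the z2 axis
for w in ([0.0, 1.0], [1.0, 0.0], [0.0, 3.0], [0.0, 5.0]):
    # the direction of w sets which way the boundary faces;
    # its magnitude sets how sharply the output moves from 0 to 1
    print(w, float(neuron(z, np.array(w))))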
Weight Space of a Single Neuron

• W = [2, 2]
[Figure: grid of neuron response surfaces over (z1, z2) for weight vectors W = (W1, W2), with W1 ranging over −2, 0, 2, 4 and W2 over 2, 0, −2; each point in weight space gives a different classifier]
33
Training a Single Neuron

[Figure: training data: inputs in the (z1, z2) plane with class labels for the two classes]
34
Training a Single Neuron

• Desired result of training:
neuron outputs ≈ 1 for training points of one class,
neuron outputs ≈ 0 for training points of the other class
[Figure: training data: inputs in the (z1, z2) plane with class labels for the two classes]
35
Training a Single Neuron

• Desired result of training:
neuron outputs ≈ 1 for training points of one class,
neuron outputs ≈ 0 for training points of the other class
[Figure: training data: inputs in the (z1, z2) plane with class labels for the two classes]
• Objective function: the "surprise" when observing the training labels
(the relative entropy between the labels and the neuron outputs);
minimising it encourages the neuron output to match the training data
36
Training a Single Neuron

[Figure: training data: inputs in the (z1, z2) plane with class labels for the two classes]
• Objective function: choose the weights that minimise the network's surprise about the training data
• Gradient = prediction error × feature
• Iteratively step down the objective (the gradient points uphill); a sketch follows below
37
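A minimal sketch of this training loop (sigmoid neuron, cross-entropy objective, gradient = prediction error x feature); the toy data, learning rate, and iteration count are assumptions, not taken from the slides:

import numpy as np

rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(+1.5, 1.0, size=(50, 2)),   # class-1 inputs
               rng.normal(-1.5, 1.0, size=(50, 2))])  # class-0 inputs
t = np.concatenate([np.ones(50), np.zeros(50)])       # class labels

w = np.zeros(2)
for it in range(20):
    x = 1.0 / (1.0 + np.exp(-Z @ w))                  # neuron outputs for all training points
    p = np.clip(x, 1e-12, 1 - 1e-12)                  # numerical safety for the logs
    objective = -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))   # "surprise" about the labels
    grad = Z.T @ (x - t)                              # = prediction error x feature
    w -= 0.01 * grad                                  # step down the objective
    print(it, round(float(objective), 2), w.round(2))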


Training a Single Neuron

[Figures, slides 38-44: gradient-descent iterations; each slide shows the training data and decision boundary in the (z1, z2) plane, the neuron's output surface, and the objective versus iteration, as the weights move through
w = [0, −1] → [0.4, −0.7] → [0.9, −0.2] → [1.1, 0.1] → [1.4, 0.4] → [5.2, 12.6] → [9.7, 25.3]]
Overfitting and Weight Decay

[Figure: training data: inputs with class labels for the two classes]
• Objective function: the data-fit term as before, plus a regulariser
• The regulariser discourages the network from using extreme weights
• Weight decay: shrinks the weights towards zero (see the sketch below)
45
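The same training loop as the sketch above, with an L2 penalty added to the objective; the penalty contributes an extra term to the gradient that pulls the weights towards zero (the data and the regularisation strength lam are assumptions):

import numpy as np

rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(+1.5, 1.0, size=(50, 2)),
               rng.normal(-1.5, 1.0, size=(50, 2))])
t = np.concatenate([np.ones(50), np.zeros(50)])

w, lr, lam = np.zeros(2), 0.01, 0.1                   # lam = regularisation strength (assumed)
for it in range(50):
    x = 1.0 / (1.0 + np.exp(-Z @ w))
    p = np.clip(x, 1e-12, 1 - 1e-12)
    objective = (-np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))
                 + 0.5 * lam * np.sum(w ** 2))        # data fit + regulariser
    grad = Z.T @ (x - t) + lam * w                    # the lam*w term biases the weights towards zero
    w -= lr * grad
print(w.round(2))                                     # regularised weights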
Training a Single Neuron (cont'd)

[Figures, slides 46-51: training with and without the regulariser; each slide shows the two decision boundaries and the original and regularised objectives versus iteration, as the weights move through
wreg = [0, −1] → [0.4, −0.7] → [0.8, −0.2] → [1.2, 0.5] → [1, 1.1] → [1, 1.1]
w = [0, −1] → [0.4, −0.7] → [0.8, −0.3] → [1.4, 0.5] → [1.9, 1.7] → [2.5, 4];
the regularised weights settle while the unregularised weights keep growing]
Single Hidden Layer Neural Networks

[Figure: network diagram with an input layer, one hidden layer, and an output layer]
52
Sampling Random Neural Network Classifiers

[Figure: output surface and contours over (z1, z2) for a network with randomly sampled weights]
53
Training a Neural Network with a Single Hidden Layer

objective function:
likelihood same as before

regulariser discourages extreme weights

54
Training a Neural Network with a Single Hidden Layer
Networks with hidden layers can be fit by gradient descent, using an algorithm called back-propagation.

objective function:
likelihood same as before

regulariser discourages extreme weights

55
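A minimal sketch of fitting a single-hidden-layer network with back-propagation (sigmoid units, cross-entropy likelihood, L2 regulariser); the XOR-like toy data, number of hidden units, learning rate, and iteration count are all assumptions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
Z = rng.uniform(-5, 5, size=(200, 2))                 # toy inputs
t = (Z[:, 0] * Z[:, 1] > 0).astype(float)             # XOR-like labels: not linearly separable

H, lr, lam = 8, 0.5, 1e-3                             # hidden units, step size, regulariser (assumed)
W1 = 0.5 * rng.standard_normal((2, H)); b1 = np.zeros(H)
W2 = 0.5 * rng.standard_normal(H);      b2 = 0.0

for it in range(3000):
    # forward pass
    h = sigmoid(Z @ W1 + b1)                          # hidden-layer activations
    x = sigmoid(h @ W2 + b2)                          # network output
    # backward pass (back-propagation of the error signal)
    d_out = x - t                                     # error at the output unit
    gW2 = h.T @ d_out + lam * W2
    gb2 = d_out.sum()
    d_hid = np.outer(d_out, W2) * h * (1 - h)         # error propagated to the hidden units
    gW1 = Z.T @ d_hid + lam * W1
    gb1 = d_hid.sum(axis=0)
    # gradient step on the regularised objective
    W1 -= lr * gW1 / len(t); b1 -= lr * gb1 / len(t)
    W2 -= lr * gW2 / len(t); b2 -= lr * gb2 / len(t)

print(((x > 0.5) == (t > 0.5)).mean())                # training accuracy of the fitted network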
Training a Neural Network with a Single Hidden Layer

[Figures, slides 56-62: the network's output surface and decision contours over (z1, z2) at successive stages of training]
Hierarchical Models with Many Layers

[Figure: network diagram with an input layer, multiple hidden layers, and an output layer]
63
Convolutional Neural Networks (CNN):
Local Connectivity
Hidden layer

Input layer

Global connectivity Local connectivity

• # input units (neurons): 7


• # hidden units: 3
• Number of parameters
• Global connectivity: 21
• Local connectivity: 9
67
Convolutional Neural Networks (CNN):
Weight Sharing
Hidden layer

[Figure: without weight sharing, the nine connections have distinct weights w1 … w9; with weight sharing, the same filter weights w1, w2, w3 are reused at every position]

Input layer

Without weight sharing With weight sharing

• # input units (neurons): 7


• # hidden units: 3
• Number of parameters
– Without weight sharing: 9
– With weight sharing: 3 (see the sketch below)
68
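A sketch of both cases for the 7-input example above, assuming each hidden unit looks at a window of 3 consecutive inputs with stride 2 (which reproduces the parameter counts of 9 without sharing and 3 with sharing); the numeric values are arbitrary:

import numpy as np

x = np.arange(1.0, 8.0)                               # 7 input units

# Local connectivity without sharing: each of the 3 hidden units has its own 3 weights -> 9 parameters
W_local = np.arange(1.0, 10.0).reshape(3, 3)
h_local = np.array([W_local[i] @ x[2 * i: 2 * i + 3] for i in range(3)])

# With weight sharing: the same 3 weights (one filter) are reused at every position -> 3 parameters
w_shared = np.array([1.0, 0.0, -1.0])
h_shared = np.array([w_shared @ x[2 * i: 2 * i + 3] for i in range(3)])

print(h_local, h_shared)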
CNN with Multiple Input Channels
Hidden layer

Input layer Channel 1

Channel 2

Single input channel Multiple input channels

Filter weights Filter weights 69


CNN with Multiple Output Maps

Hidden layer Map 1


Map 2

Input layer

Single output map Multiple output maps

Filter 1 Filter 2

Filter weights Filter weights 70


Generalized to 2D Cases:

71
Ref: Marc'Aurelio Ranzato, Tutorial in CVPR2014
Generalized to 2D Cases:

72
Ref: Marc'Aurelio Ranzato, Tutorial in CVPR2014
Generalized to 2D Cases:

73
Ref: Marc'Aurelio Ranzato, Tutorial in CVPR2014
Convolutional Layer

Input Output

74
Convolutional Layer

Input Output

75
Convolutional Layer

Input Output

76
Convolutional Layer

Input Output

77
Convolutional Layer

Input Output

78
Convolutional Layer

Input Output

79
Convolutional Layer

Input Output

80
81
Ref: Marc'Aurelio Ranzato, Tutorial in CVPR2014
Putting them together → CNN
• Local connectivity
• Weight sharing
• Handling multiple input channels
• Handling multiple output maps
Weight sharing

Local connectivity

# input channels # output (activation) maps


85
Image credit: A. Karpathy
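A naive NumPy sketch that puts the four ingredients together as one convolutional layer (explicit loops for clarity, not an efficient implementation; the input and filter shapes are assumptions, chosen to match the 32 x 32 x 3 input and 5 x 5 x 3 filters on a later slide):

import numpy as np

def conv_layer(x, filters, biases, stride=1):
    # x: H x W x C_in input volume; filters: K x K x C_in x C_out; returns the output maps
    H, W, C_in = x.shape
    K, _, _, C_out = filters.shape
    H_out = (H - K) // stride + 1
    W_out = (W - K) // stride + 1
    out = np.zeros((H_out, W_out, C_out))
    for m in range(C_out):                            # one output (activation) map per filter
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i * stride:i * stride + K, j * stride:j * stride + K, :]  # local connectivity
                out[i, j, m] = np.sum(patch * filters[..., m]) + biases[m]          # shared weights
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))                  # e.g. a 32 x 32 RGB image (3 input channels)
filters = rng.standard_normal((5, 5, 3, 6))           # six 5 x 5 x 3 filters -> 6 output maps
print(conv_layer(x, filters, np.zeros(6)).shape)      # (28, 28, 6): spatial size shrinks 32 -> 28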
Convolution Layer in CNN

86
Putting them together (cont’d)
• The brain/neuron view of CONV layer

90
Putting them together (cont’d)
• The brain/neuron view of CONV layer

91
Putting them together (cont’d)
• The brain/neuron view of CONV layer

92
Putting them together (cont’d)
• Image input with 32 x 32 pixels convolved repeatedly with 5 x 5 x 3
filters shrinks volumes spatially (32 -> 28 -> 24 -> …).

93
Variations of Convolution

• Zero Padding
• Output is the same size as input (doesn’t shrink as the network gets deeper).

94
Variations of Convolution

• Stride
• Step size across signals

95
Variations of Convolution

• Stride
• Step size across signals

96
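The effect of padding and stride on the output size can be summarised by the standard relation output = (W − F + 2P) / S + 1 for input width W, filter size F, zero padding P, and stride S; this formula is not written on the slides, so the sketch below is an added note:

def conv_output_size(W, F, P=0, S=1):
    # floor division, as most frameworks do when (W - F + 2P) is not a multiple of S
    return (W - F + 2 * P) // S + 1

print(conv_output_size(32, 5))                        # 28: a 5 x 5 filter shrinks 32 -> 28
print(conv_output_size(32, 5, P=2))                   # 32: zero padding keeps the output the same size
print(conv_output_size(32, 5, P=2, S=2))              # 16: stride 2 halves the spatial size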
Nonlinearity Layer in CNN

99
Nonlinearity Layer
• E.g., ReLU (Rectified Linear Unit)
• Pixel by pixel computation of max(0, x)

100
Receptive Field
• For convolution with kernel size n x n,
each entry in the output layer depends on an n x n receptive field in the input layer.

• Each successive convolution adds n-1 to the receptive field size.
With a total of L layers, the receptive field size is 1 + L * (n-1).

• Thus, for large images, we need many layers for each entry in the output to "see" the entire input image.
Possible solution → downsample the image/feature map (see the pooling layer next)

Slide credit: UMich EECS 498-007 103
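As a quick check of the 1 + L(n − 1) rule (the 224-pixel input width below is an assumed example):

def receptive_field(L, n):
    return 1 + L * (n - 1)                            # receptive field after L layers of n x n convs

print(receptive_field(2, 3))                          # 5: two 3 x 3 convs together see a 5 x 5 window
print(receptive_field(112, 3))                        # 225: ~112 layers needed to cover a 224-wide input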


Pooling Layer in CNN

104
Pooling Layer
• Makes the representations smaller and more manageable
• Operates over each activation map independently
• E.g., Max Pooling

105
Pooling Layer for 2D Cases
• Reduces the spatial size and provides spatial invariance

106
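A minimal sketch of 2 x 2 max pooling with stride 2 on one activation map (each map is pooled independently); the input values are made up:

import numpy as np

def max_pool_2x2(a):
    H, W = a.shape
    # group the map into non-overlapping 2 x 2 blocks and keep the maximum of each block
    return a[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool_2x2(a))                                # [[6. 8.]
                                                      #  [3. 4.]]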
Fully Connected (FC) Layer in CNN

109
FC Layer
• Contains neurons that connect to the entire input volume,
as in ordinary neural networks

110
FC Layer
• Contains neurons that connect to the entire input volume,
as in ordinary neural networks

111
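A minimal sketch of an FC layer: the final activation volume is flattened into one long vector and multiplied by a full weight matrix, exactly as in the earlier y = Wx + b classifier (the 7 x 7 x 64 volume and 10 classes are assumptions):

import numpy as np

rng = np.random.default_rng(0)
volume = rng.standard_normal((7, 7, 64))              # final activation volume (assumed shape)
x = volume.reshape(-1)                                # flatten to a 7*7*64 = 3136-dimensional vector

W = 0.01 * rng.standard_normal((10, x.size))          # every output neuron connects to the whole volume
b = np.zeros(10)
print((W @ x + b).shape)                              # (10,) class scores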
CNN

112
LeNet
• Presented by Yann LeCun during the 1990s for reading digits
• Has the elements of modern architectures

113
LeNet [LeCun et al. 1998]

LeNet-1 from 1993

Gradient-based learning applied to document recognition


[LeCun, Bottou, Bengio, Haffner 1998] 114
New Driving Forces
• CPU/GPU computing
• Personal super computer
• Internet → big data → large datasets become available

115
AlexNet [Krizhevsky et al., 2012]
• Repopularized CNNs by winning the ImageNet Challenge 2012
• 7 hidden layers, 650,000 neurons, 60M parameters
• Error rate of 16% vs. 26% for the 2nd-place entry

116
Krizhevsky et al. “ImageNet classification with deep convolutional neural networks,” NIPS, 2012.
AlexNet

• Parameters
• Convolution: 1.89M parameters = 7.56MB
• Fully connected: 58.62M parameters = 234.49MB
• Computation
• Convolution: 591M Floating MAC
• Fully connected: 58.62M Floating MAC
• Full-HD 30fps: 805 GFLOPS (no overlap)
117
Krizhevsky et al. “ImageNet classification with deep convolutional neural networks,” NIPS, 2012.
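The memory figures follow from the parameter counts at 4 bytes per float32 weight, and the Full-HD figure is roughly reproduced by tiling the frame with non-overlapping windows; a quick check, assuming 224 x 224 windows and counting one operation per multiply-accumulate (these assumptions are mine, not stated on the slide):

print(1.89e6 * 4 / 1e6)                               # 7.56 MB for the convolutional weights
print(58.62e6 * 4 / 1e6)                              # 234.48 MB for the fully connected weights

windows = 1920 * 1080 / (224 * 224)                   # ~41 non-overlapping windows per Full-HD frame
print((591e6 + 58.62e6) * windows * 30 / 1e9)         # ~805 G operations/s at 30 frames/s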
Deep or Not?
• Depth of the network is critical for performance.

118
Ultra Deep Network

• AlexNet (2012): 8 layers, 16.4% error
• VGG (2014): 19 layers, 7.3% error
• GoogleNet (2014): 22 layers, 6.7% error

Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf


Ultra Deep Network
[Prof. H.-Y. Lee]

• AlexNet (2012): 8 layers, 16.4% error
• VGG (2014): 19 layers, 7.3% error
• GoogleNet (2014): 22 layers, 6.7% error
• Residual Net (2015): 152 layers, 3.57% error
[Image: Taipei 101, 101 layers, shown for scale]
• These ultra deep networks have a special structure.
VGG (2014)
• Parameters:
• Convolution: ~14M, 56MB
• Fully connected: ~124M, 496MB
• Computation:
• Convolution: 15.52G Floating MAC
• Fully connected: 123.63M Floating MAC
• Full-HD 30fps: 19.3 TFLOPS (no overlap)

125
Simonyan and Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv:1409.1556v6, Sept. 2014
ResNet (2016)
• Can we just keep increasing the number of layers?
• How can we train very deep networks?
  - Residual learning (a sketch follows below)

Ref: He, Kaiming, et al. "Deep residual learning for image recognition." CVPR, 2016.
126
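A minimal sketch of the residual idea: each block computes a small correction F(x) and adds the input back through a shortcut, y = F(x) + x, so even a very deep stack stays close to an identity mapping when the weights are small. The block below uses plain per-channel linear layers instead of real 3 x 3 convolutions, and all sizes are assumptions:

import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def residual_block(x, W1, W2):
    # F(x) is a small two-layer transform; the block outputs relu(F(x) + x),
    # so the layers only have to learn a residual around the identity mapping
    return relu(x + W2 @ relu(W1 @ x))

rng = np.random.default_rng(0)
C = 64                                                # feature dimension (assumed)
x = rng.standard_normal(C)

y = x
for _ in range(20):                                   # stack 20 blocks with small random weights
    W1 = 0.01 * rng.standard_normal((C, C))
    W2 = 0.01 * rng.standard_normal((C, C))
    y = residual_block(y, W1, W2)

# the input signal still flows through the shortcuts almost unchanged
print(np.linalg.norm(y - relu(x)) / np.linalg.norm(relu(x)))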
DenseNet (2017)
• Shorter connections (like ResNet) help
• Why not just connect them all?

128
ResNeXt (2017)
• Deeper and wider → better… what else?
• Increase cardinality

ResNet block ResNeXt block

132
Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." CVPR, 2017.
Squeeze-and-Excitation Net (SENet)
• How to improve accuracy without much overhead?
• Feature recalibration (channel attention)

133
Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." CVPR, 2018.
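A minimal sketch of the squeeze-and-excitation operation on an H x W x C activation volume: "squeeze" each channel to one number by global average pooling, run the channel descriptor through a small bottleneck gate ("excitation"), then rescale every channel by its gate value. The shapes, reduction ratio, and random weights are assumptions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def se_block(x, W1, W2):
    # x: H x W x C feature maps
    s = x.mean(axis=(0, 1))                           # squeeze: global average pool -> one value per channel
    e = sigmoid(W2 @ np.maximum(0.0, W1 @ s))         # excitation: bottleneck MLP -> per-channel gates in (0, 1)
    return x * e                                      # recalibration: rescale each channel by its gate

rng = np.random.default_rng(0)
C, r = 64, 16                                         # channels and reduction ratio (assumed)
x = rng.standard_normal((28, 28, C))
W1 = 0.1 * rng.standard_normal((C // r, C))           # C -> C/r
W2 = 0.1 * rng.standard_normal((C, C // r))           # C/r -> C
print(se_block(x, W1, W2).shape)                      # (28, 28, 64): same shape, channels re-weighted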
Various Deep Learning Models…

131
Ref: Bianco et al., "Benchmark Analysis of Representative Deep Neural Network Architectures," arXiv:1810.00736.
