
Deep Learning Basics

Lecture 6: Convolutional NN
Princeton University COS 495
Instructor: Yingyu Liang
Review: convolutional layers
Convolution: two-dimensional case

Input (3x4):
a b c d
e f g h
i j k l

Kernel/filter (2x2):
w x
y z

Feature map (first two entries):
aw + bx + ey + fz    bw + cx + fy + gz
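As a concrete check of the arithmetic above, here is a minimal NumPy sketch (not from the slides) that computes the 2x3 feature map of a 3x4 input with a 2x2 kernel; the numeric arrays stand in for the letter grids a..l and w..z.

import numpy as np

# 3x4 input and 2x2 kernel, standing in for the letter grids above
inp = np.arange(12, dtype=float).reshape(3, 4)   # plays the role of a..l
ker = np.array([[1.0, 2.0],                      # plays the role of w, x
                [3.0, 4.0]])                     # plays the role of y, z

kh, kw = ker.shape
out = np.zeros((inp.shape[0] - kh + 1, inp.shape[1] - kw + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # e.g. out[0, 0] corresponds to aw + bx + ey + fz above
        out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * ker)

print(out)   # 2x3 feature map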
Convolutional layers
• The same kernel weights are shared across all output nodes
• 𝑛 input nodes, kernel size 𝑘, 𝑚 output nodes

Figure from Deep Learning, by Goodfellow, Bengio, and Courville
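To make the weight sharing concrete, here is a small 1D sketch with made-up sizes: the same 𝑘 kernel weights produce every output node, so the layer has 𝑘 parameters rather than the 𝑛 × 𝑚 of a fully connected layer.

import numpy as np

n, k = 8, 3                       # hypothetical sizes: n input nodes, kernel size k
x = np.random.randn(n)            # input
w = np.random.randn(k)            # the SAME k weights are reused for every output node

m = n - k + 1                     # number of output nodes (stride 1, no padding)
y = np.array([np.dot(w, x[i:i + k]) for i in range(m)])

print(m, y.shape)                 # m output nodes computed from only k parameters;
                                  # a fully connected layer would need n * m weights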


Terminology

Figure from Deep Learning, by Goodfellow, Bengio, and Courville
Case study: LeNet-5
LeNet-5
• Proposed in "Gradient-based learning applied to document recognition", by Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner, in Proceedings of the IEEE, 1998
• Applies convolution to 2D images (MNIST) and is trained with backpropagation
• Structure: 2 convolutional layers (with pooling) + 3 fully connected layers

• Input size: 32x32x1
• Convolution kernel size: 5x5
• Pooling: 2x2
LeNet-5, layer by layer:
• Convolutional layer 1: filter 5x5, stride 1x1, #filters: 6
• Pooling layer 1: 2x2, stride 2
• Convolutional layer 2: filter 5x5x6, stride 1x1, #filters: 16
• Pooling layer 2: 2x2, stride 2
• Fully connected layer 1: weight matrix 400x120
• Fully connected layer 2: weight matrix 120x84
• Output layer: weight matrix 84x10

Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
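The layer sizes in the walkthrough above can be reproduced with a short sketch. The following tf.keras version is only an illustration, not the original 1998 implementation: it assumes valid (no-padding) convolutions, tanh units, and plain average pooling, and it omits LeNet-5's sparse connection table between S2 and C3.

import tensorflow as tf

# A minimal LeNet-5-style sketch; layer shapes match the walkthrough above.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, kernel_size=5, activation='tanh',
                           input_shape=(32, 32, 1)),               # 32x32x1 -> 28x28x6
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),      # -> 14x14x6
    tf.keras.layers.Conv2D(16, kernel_size=5, activation='tanh'),  # -> 10x10x16
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),      # -> 5x5x16
    tf.keras.layers.Flatten(),                                     # -> 400
    tf.keras.layers.Dense(120, activation='tanh'),                 # weight matrix 400x120
    tf.keras.layers.Dense(84, activation='tanh'),                  # weight matrix 120x84
    tf.keras.layers.Dense(10, activation='softmax'),               # weight matrix 84x10
])
model.summary()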
Software platforms for CNN
Updated in April 2016; check online for more recent ones
Platform: Marvin (marvin.is)
LeNet in Marvin: convolutional layer
LeNet in Marvin: pooling layer
LeNet in Marvin: fully connected layer
Platform: Caffe (caffe.berkeleyvision.org)
LeNet in Caffe
Platform: Tensorflow (tensorflow.org)
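As an illustration of what a convolutional layer looks like in TensorFlow (a sketch with made-up tensor sizes, not the code shown on the original slides):

import tensorflow as tf

# Hypothetical batch of 32x32 grayscale images and six 5x5 filters
images  = tf.random.normal([1, 32, 32, 1])   # batch x height x width x channels
filters = tf.random.normal([5, 5, 1, 6])     # height x width x in_channels x out_channels

feature_maps = tf.nn.conv2d(images, filters, strides=1, padding='VALID')
print(feature_maps.shape)                    # (1, 28, 28, 6)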
Others
• Theano – CPU/GPU symbolic expression compiler in Python (from the MILA lab at University of Montreal)
• Torch – provides a Matlab-like environment for state-of-the-art machine learning algorithms in Lua
• Lasagne – a lightweight library to build and train neural networks in Theano

• See: http://deeplearning.net/software_links/
Optimization: momentum
Basic algorithms
• Minimize the (regularized) empirical loss
  $\hat{L}_R(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t) + R(\theta)$
  where the hypothesis is parametrized by $\theta$

• Gradient descent
  $\theta_{t+1} = \theta_t - \eta_t \nabla \hat{L}_R(\theta_t)$
Mini-batch stochastic gradient descent
• Instead of one data point, work with a small batch of $b$ points
  $(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$

• Update rule
  $\theta_{t+1} = \theta_t - \eta_t \nabla \left[ \frac{1}{b} \sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i}) + R(\theta_t) \right]$
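A minimal NumPy sketch of one mini-batch SGD step, assuming (hypothetically, just to make the gradient concrete) a squared loss on a linear model and an L2 regularizer:

import numpy as np

def grad_loss(theta, x, y):
    # hypothetical per-example gradient: squared loss for a linear model x @ theta
    return (x @ theta - y) * x

def sgd_step(theta, batch_x, batch_y, lr, reg=0.0):
    # average the per-example gradients over the mini-batch of b points
    g = np.mean([grad_loss(theta, x, y) for x, y in zip(batch_x, batch_y)], axis=0)
    g += reg * theta              # gradient of R(theta) = (reg / 2) * ||theta||^2
    return theta - lr * g         # theta_{t+1} = theta_t - eta_t * gradient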
Momentum
• Drawback of SGD: can be slow when gradient is small

• Observation: when the gradient is consistent across consecutive steps, one can take larger steps
• Metaphor: a marble rolling down a gentle slope
Momentum

Contour: loss function; Path: SGD with momentum; Arrow: stochastic gradient

Figure from Deep Learning, by Goodfellow, Bengio, and Courville

Momentum
• Work with a small batch of $b$ points $(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$
• Keep a momentum variable $v_t$, and set a decay rate $\alpha$

• Update rule
  $v_t = \alpha v_{t-1} - \eta_t \nabla \left[ \frac{1}{b} \sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i}) + R(\theta_t) \right]$
  $\theta_{t+1} = \theta_t + v_t$

• Practical guide: $\alpha$ is set to 0.5 until the initial learning stabilizes and then is increased to 0.9 or higher.
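A sketch of the same update with momentum, under the same hypothetical squared-loss setup as the SGD sketch above:

import numpy as np

def grad_loss(theta, x, y):
    # same hypothetical squared-loss gradient as in the SGD sketch
    return (x @ theta - y) * x

def momentum_step(theta, v, batch_x, batch_y, lr, alpha, reg=0.0):
    # mini-batch gradient of the regularized loss
    g = np.mean([grad_loss(theta, x, y) for x, y in zip(batch_x, batch_y)], axis=0)
    g += reg * theta              # regularizer R(theta) = (reg / 2) * ||theta||^2
    v = alpha * v - lr * g        # v_t = alpha * v_{t-1} - eta_t * gradient
    theta = theta + v             # theta_{t+1} = theta_t + v_t
    return theta, v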
