Deep Learning Basics Lecture 6 Convolutional NN
Lecture 6: Convolutional NN
Princeton University COS 495
Instructor: Yingyu Liang
Review: convolutional layers
Convolution: two dimensional case
Input (3×4):
a b c d
e f g h
i j k l

Kernel/filter (2×2):
w x
y z

Feature map (2×3), each entry the sum of the kernel times a 2×2 window of the input:
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz
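As a concrete illustration, here is a minimal NumPy sketch of the sliding-window computation above (cross-correlation with no padding and stride 1, as is standard in deep learning libraries); the function name and the numeric input/kernel values are illustrative assumptions, not from the lecture:

import numpy as np

def conv2d_valid(inp, kernel):
    # "Valid" 2-D convolution (cross-correlation): slide the kernel over
    # every window of the input and sum the elementwise products.
    H, W = inp.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(inp[i:i + kH, j:j + kW] * kernel)
    return out

# 3x4 input and 2x2 kernel, mirroring the a..l / w..z example above;
# out[0, 0] plays the role of aw + bx + ey + fz.
inp = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])
print(conv2d_valid(inp, kernel))  # 2x3 feature map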
Convolutional layers
• 1-D illustration: 𝑛 input nodes, kernel size 𝑘, 𝑚 output nodes
• The same kernel weights are shared across all output nodes (sketched below)
• See: http://deeplearning.net/software_links/
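A small sketch of weight sharing in a 1-D convolutional layer (variable names and sizes are illustrative assumptions): each of the 𝑚 = 𝑛 − 𝑘 + 1 output nodes reuses the same 𝑘 kernel weights.

import numpy as np

def conv1d_layer(x, w):
    # 1-D convolutional layer: every output node uses the same k kernel
    # weights w, so the layer has k parameters instead of n * m.
    n, k = len(x), len(w)
    m = n - k + 1  # number of output nodes
    return np.array([np.dot(w, x[i:i + k]) for i in range(m)])

x = np.random.randn(8)   # n = 8 input nodes
w = np.random.randn(3)   # k = 3 shared kernel weights
y = conv1d_layer(x, w)   # m = 6 output nodes
print(len(w), "shared weights vs", len(x) * len(y), "for a fully connected layer")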
Optimization: momentum
Basic algorithms
• Minimize the (regularized) empirical loss
  $L_R(\theta) = \frac{1}{n}\sum_{t=1}^{n} l(\theta, x_t, y_t) + R(\theta)$
  where the hypothesis is parametrized by $\theta$
• Gradient descent
  $\theta_{t+1} = \theta_t - \eta_t \nabla L_R(\theta_t)$
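A minimal sketch of this update, assuming a hypothetical ridge-regression loss so that the gradient of $L_R$ can be written out explicitly (the data, step size, and function names are assumptions for illustration):

import numpy as np

def gradient_descent(grad_LR, theta0, eta, num_steps):
    # Plain gradient descent with a constant step size eta:
    # theta_{t+1} = theta_t - eta * grad L_R(theta_t)
    theta = theta0
    for _ in range(num_steps):
        theta = theta - eta * grad_LR(theta)
    return theta

# Hypothetical ridge-regression example:
# L_R(theta) = (1/n) sum_t (theta^T x_t - y_t)^2 + lam * ||theta||^2
X = np.random.randn(100, 5)
y = X @ np.ones(5) + 0.1 * np.random.randn(100)
lam = 0.01
grad_LR = lambda th: 2 * X.T @ (X @ th - y) / len(y) + 2 * lam * th
theta_hat = gradient_descent(grad_LR, np.zeros(5), eta=0.1, num_steps=200)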
Mini-batch stochastic gradient descent
• Instead of one data point, work with a small batch of 𝑏 points
  $(x_{tb+1}, y_{tb+1}), \dots, (x_{tb+b}, y_{tb+b})$
• Update rule
  $\theta_{t+1} = \theta_t - \eta_t \nabla\left[\frac{1}{b}\sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i}) + R(\theta_t)\right]$
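A sketch of the mini-batch update, assuming the per-example loss gradient and the regularizer gradient are supplied as functions (the function names and the squared-loss usage example are assumptions for illustration):

import numpy as np

def minibatch_sgd(grad_l, grad_R, X, Y, theta0, eta, b, num_epochs):
    # Each step averages the per-example loss gradients over a batch of b
    # points, adds the regularizer gradient, and takes a step of size eta.
    theta = theta0
    n = len(X)
    for _ in range(num_epochs):
        perm = np.random.permutation(n)
        for start in range(0, n - b + 1, b):
            idx = perm[start:start + b]
            g = np.mean([grad_l(theta, X[i], Y[i]) for i in idx], axis=0)
            theta = theta - eta * (g + grad_R(theta))
    return theta

# Hypothetical usage with squared loss l(theta, x, y) = (theta^T x - y)^2
# and regularizer R(theta) = lam * ||theta||^2:
X = np.random.randn(200, 5)
Y = X @ np.ones(5)
grad_l = lambda th, x, y: 2 * (th @ x - y) * x
grad_R = lambda th: 2 * 0.01 * th
theta_hat = minibatch_sgd(grad_l, grad_R, X, Y, np.zeros(5), eta=0.05, b=10, num_epochs=20)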
Momentum
• Drawback of SGD: can be slow when gradient is small
• Idea: keep a velocity $v_t$ that accumulates past gradients and move along it:
  $\theta_{t+1} = \theta_t + v_t$
Momentum
• Keep a momentum variable 𝑣𝑡 , and set a decay rate 𝛼
• Update rule
  $v_t = \alpha v_{t-1} - \eta_t \nabla\left[\frac{1}{b}\sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i}) + R(\theta_t)\right]$
  $\theta_{t+1} = \theta_t + v_t$
• Practical guide: 𝛼 is set to 0.5 until the initial learning stabilizes and
then is increased to 0.9 or higher.
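A sketch of momentum SGD following the update rule and the practical guide above; the batch-gradient function, the quadratic usage example, and the point at which 𝛼 is raised from 0.5 to 0.9 are assumptions for illustration:

import numpy as np

def sgd_momentum(grad_batch, theta0, eta, num_steps, switch_at=100):
    # Momentum update: v_t = alpha * v_{t-1} - eta * (mini-batch gradient),
    # then theta_{t+1} = theta_t + v_t.  Per the practical guide, alpha
    # starts at 0.5 and is raised to 0.9 after the initial phase
    # (switch_at is an assumed choice, not from the lecture).
    theta = theta0
    v = np.zeros_like(theta0)
    for t in range(num_steps):
        alpha = 0.5 if t < switch_at else 0.9
        v = alpha * v - eta * grad_batch(theta)
        theta = theta + v
    return theta

# Hypothetical usage: grad_batch returns the mini-batch gradient of the
# regularized loss; here a simple quadratic with minimum at 3 stands in.
grad_batch = lambda th: 2.0 * (th - 3.0)
print(sgd_momentum(grad_batch, np.zeros(1), eta=0.01, num_steps=300))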