
NAS via RNN + RL

Shusen Wang

http://wangshusen.github.io/
Prerequisites

• Recurrent neural networks (RNNs).


• Policy-based reinforcement learning.
RNN for Generating CNN Architectures

Reference:
• Zoph & Le. Neural architecture search with reinforcement learning. In ICLR, 2017.

The controller is an RNN. It starts from an initial hidden state 𝐡₀ and an initial input 𝐱₀.

• RNN update: 𝐡₁ = tanh(𝐖 ⋅ [𝐡₀; 𝐱₀] + 𝐛).
• Softmax classifier: dense layer + softmax activation.
• The classifier maps 𝐡₁ to a probability distribution over candidate values, e.g., 𝐩₁ = [0.15, 0.6, 0.2, 0.05].
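For concreteness, here is a minimal NumPy sketch of one controller step, assuming a plain (Elman) RNN cell; the weight names (W, b, W_out, b_out) and all dimensions are illustrative choices, not taken from the slides:

```python
import numpy as np

def rnn_step(h_prev, x_prev, W, b):
    """Simple RNN update: h = tanh(W . [h_prev; x_prev] + b)."""
    hx = np.concatenate([h_prev, x_prev])      # stack previous state and previous input
    return np.tanh(W @ hx + b)

def softmax_classifier(h, W_out, b_out):
    """Dense layer + softmax activation: map hidden state h to a probability vector p."""
    logits = W_out @ h + b_out
    e = np.exp(logits - logits.max())          # subtract max for numerical stability
    return e / e.sum()

# Hypothetical sizes: hidden dim 8, input (embedding) dim 4, 4 candidate filter counts.
rng = np.random.default_rng(0)
h0, x0 = np.zeros(8), rng.normal(size=4)
W, b = rng.normal(size=(8, 12)) * 0.1, np.zeros(8)
W_out, b_out = rng.normal(size=(4, 8)) * 0.1, np.zeros(4)

h1 = rnn_step(h0, x0, W, b)
p1 = softmax_classifier(h1, W_out, b_out)      # probability distribution over the 4 candidates
```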
Predict Number of Filters

• Candidate numbers of filters: 24, 36, 48, 64.
• 𝐩₁ = [0.15, 0.6, 0.2, 0.05] is the predicted distribution over the four candidates, i.e., P(24) = 0.15, P(36) = 0.6, P(48) = 0.2, P(64) = 0.05.
• Take an action by random sampling or argmax.
• Here the action is 𝐚₁ = [0, 1, 0, 0], i.e., the first convolutional layer has 36 filters.
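A minimal sketch of the sampling step, with the candidate list and helper names as my own assumptions; the action can be drawn stochastically (sampling from 𝐩₁) or greedily (argmax):

```python
import numpy as np

FILTER_COUNTS = [24, 36, 48, 64]               # candidate values shown on the slide

def choose_action(p, greedy=False, rng=np.random.default_rng()):
    """Return a one-hot action vector (and its index) drawn from probability vector p."""
    idx = int(np.argmax(p)) if greedy else int(rng.choice(len(p), p=p))
    a = np.zeros(len(p))
    a[idx] = 1.0
    return a, idx

p1 = np.array([0.15, 0.6, 0.2, 0.05])
a1, idx = choose_action(p1, greedy=True)       # greedy pick -> index 1
print(a1, FILTER_COUNTS[idx])                  # [0. 1. 0. 0.] 36
```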
Embedding: map a one-hot vector to a dense vector.

• The action 𝐚₁ is embedded into a dense vector 𝐱₁, which becomes the RNN input at the next step.
• Likewise, 𝐚₄, 𝐚₇, 𝐚₁₀, … are embedded into 𝐱₄, 𝐱₇, 𝐱₁₀, …
• RNN update: 𝐡₂ = tanh(𝐖 ⋅ [𝐡₁; 𝐱₁] + 𝐛).
Predict Size of Filters

• The softmax classifier maps 𝐡₂ to 𝐩₂ = [0.5, 0.1, 0.4].
• Candidate filter sizes: 3×3, 5×5, 7×7, i.e., P(3×3) = 0.5, P(5×5) = 0.1, P(7×7) = 0.4.
• Step 1 predicts the number of filters; step 2 predicts the size of filters.
• Take an action by random sampling or argmax; here 𝐚₂ = [1, 0, 0], i.e., the filter size is 3×3.
• The action 𝐚₂ is embedded into a dense vector 𝐱₂, which is the RNN input at the next step.

An embedding layer can be shared within the same task.
• E.g., 𝐚₂, 𝐚₅, 𝐚₈, 𝐚₁₁ are all predictions of the filter size.
• They can share one embedding layer.

Different tasks do not share embedding layers.
• E.g., 𝐚₁ and 𝐚₂ belong to different tasks (number of filters vs. size of filters).
• They cannot share an embedding layer.
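One way to realize this sharing, sketched with hypothetical names and reusing the Embedding class from the sketch above: keep one embedding table per task (number of filters, filter size, stride) and reuse it at every layer, rather than one table per timestep.

```python
# One embedding table per task, reused for all layers (names and dims are illustrative).
embeddings = {
    "num_filters": Embedding(num_choices=4, dim=4),   # shared by a1, a4, a7, ...
    "filter_size": Embedding(num_choices=3, dim=4),   # shared by a2, a5, a8, ...
    "stride":      Embedding(num_choices=2, dim=4),   # shared by a3, a6, a9, ...
}

def embed_action(step, a_onehot):
    """Steps cycle through the three tasks: 1 -> num_filters, 2 -> filter_size, 3 -> stride."""
    task = ["num_filters", "filter_size", "stride"][(step - 1) % 3]
    return embeddings[task](a_onehot)
```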
Predict Stride

• The RNN update gives 𝐡₃; the softmax classifier maps 𝐡₃ to 𝐩₃ = [0.3, 0.7].
• Candidate strides: 2, 3, i.e., P(stride 2) = 0.3, P(stride 3) = 0.7.
• Take an action by random sampling or argmax; here 𝐚₃ = [0, 1], i.e., the stride is 3.

The first three actions fully specify the first convolutional layer:
• Filter number = 36
• Filter size = 3×3
• Stride = 3
Repeating this process, the controller RNN runs for 60 steps and outputs actions 𝐚₁, 𝐚₂, …, 𝐚₆₀ (with inputs 𝐱₀, 𝐱₁, …, 𝐱₅₉ and hidden states 𝐡₁, …, 𝐡₆₀).

• Every three consecutive actions (# of filters, size of filters, stride) define one convolutional layer.
• 𝐚₁, 𝐚₂, 𝐚₃ define the 1st conv layer; 𝐚₄, 𝐚₅, 𝐚₆ define the 2nd conv layer; and so on.
• 60 steps therefore define a CNN with 20 convolutional layers.
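Putting the pieces together, here is a rough sketch of the full unroll, reusing the hypothetical helpers sketched above; the layout of params (shared RNN weights plus one softmax classifier per task) is also my assumption, not from the slides:

```python
# Assumed layout: params = {"W": ..., "b": ..., "W_out": {task: ...}, "b_out": {task: ...}}.
CANDIDATES = {
    "num_filters": [24, 36, 48, 64],
    "filter_size": ["3x3", "5x5", "7x7"],
    "stride":      [2, 3],
}
TASKS = ["num_filters", "filter_size", "stride"]

def sample_architecture(params, num_layers=20, rng=np.random.default_rng()):
    """Unroll the controller RNN for num_layers * 3 steps and decode the actions."""
    h, x = np.zeros(8), np.zeros(4)                # hypothetical dimensions
    arch, log_probs, layer = [], [], {}
    for step in range(1, num_layers * 3 + 1):
        task = TASKS[(step - 1) % 3]
        h = rnn_step(h, x, params["W"], params["b"])
        p = softmax_classifier(h, params["W_out"][task], params["b_out"][task])
        a, idx = choose_action(p, rng=rng)         # random sampling during the search
        log_probs.append(np.log(p[idx]))           # kept for the REINFORCE update later
        layer[task] = CANDIDATES[task][idx]
        if task == "stride":                       # the third decision completes one conv layer
            arch.append(layer)
            layer = {}
        x = embed_action(step, a)                  # embedded action becomes the next input
    return arch, log_probs
```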
Training Controller RNN
How to train the controller RNN?

• The controller RNN outputs the hyper-parameters of a CNN.


• With the hyper-parameters at hand, instantiate a CNN.
• Train the CNN on a dataset, e.g., CIFAR-10, ImageNet, etc.
• Compute validation accuracy on a held-out dataset.
• Validation accuracy is the supervision for training the controller RNN.
How to train the controller RNN?

• The controller RNN generates a CNN.
• The generated CNN is trained on the training set.
• The trained CNN makes predictions on the validation set, which yields a validation accuracy.
• The validation accuracy is used to update the controller RNN.
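The outer search loop then looks roughly as follows; build_cnn, train_cnn, eval_accuracy, and update_controller are hypothetical placeholders (not code from the paper), and sample_architecture is the rollout sketched earlier:

```python
import numpy as np

# Hypothetical placeholders for building, training, and evaluating the generated CNN;
# a real implementation would train the child network from scratch, e.g., on CIFAR-10.
def build_cnn(arch):           return arch
def train_cnn(cnn, dataset):   pass
def eval_accuracy(cnn, split): return float(np.random.default_rng().uniform(0.5, 0.9))

def update_controller(params, log_probs, reward):
    pass   # REINFORCE update of the controller; see the sketch after the REINFORCE slides

def nas_outer_loop(params, num_iterations=10_000):
    """Each iteration: sample an architecture, train it, and use validation accuracy as the reward."""
    for _ in range(num_iterations):
        arch, log_probs = sample_architecture(params)     # controller RNN rollout
        cnn = build_cnn(arch)
        train_cnn(cnn, dataset="CIFAR-10")
        val_acc = eval_accuracy(cnn, split="validation")
        update_controller(params, log_probs, reward=val_acc)
    return params
```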
Challenges

• 𝑟: objective function (to maximize), i.e., the validation accuracy.
• 𝛉: optimization variable, i.e., the parameters of the controller RNN.

• If 𝑟 were a differentiable function of 𝛉, we could update 𝛉 by gradient ascent:
  𝛉 ← 𝛉 + 𝛽 ⋅ ∂𝑟 / ∂𝛉.
• However, the validation accuracy 𝑟 is not a differentiable function of the controller RNN parameters 𝛉, so we cannot use this gradient to update 𝛉.
• Therefore, they have to use reinforcement learning.
Reinforcement Learning

• Objective: Improve the controller RNN so that


validation accuracies improve over time.
• Rewards: validation accuracies.
• Policy function: the controller RNN.
• Improve the policy function by policy gradient ascent.
Policy Function

• State: the RNN's hidden state 𝐡ₜ and input 𝐱ₜ.
• Predicted distribution: 𝐩ₜ₊₁, the softmax output at the next step.
• Action: 𝐚ₜ₊₁, sampled from 𝐩ₜ₊₁.
• Policy function: 𝜋(𝐚ₜ₊₁ | 𝐡ₜ, 𝐱ₜ; 𝛉), the probability of taking action 𝐚ₜ₊₁ given the current state.

Example: if the current step predicts the filter size and the predicted distribution over {3×3, 5×5, 7×7} is [0.5, 0.1, 0.4], then
• 𝜋("3×3" | 𝐡ₜ, 𝐱ₜ; 𝛉) = 0.5,
• 𝜋("5×5" | 𝐡ₜ, 𝐱ₜ; 𝛉) = 0.1,
• 𝜋("7×7" | 𝐡ₜ, 𝐱ₜ; 𝛉) = 0.4.
Reward & Return

• Suppose the controller RNN runs 60 steps.


• The first 59 rewards are zeros: 𝑟% = 𝑟< = ⋯ = 𝑟AE = 0.
• The last reward is the validation accuracy: 𝑟F# = ValAcc.
Reward & Return

• Suppose the controller RNN runs 60 steps.


• The first 59 rewards are zeros: 𝑟% = 𝑟< = ⋯ = 𝑟AE = 0.
• The last reward is the validation accuracy: 𝑟F# = ValAcc.
• Return (aka cumulative reward) is defined as:
𝑢M = 𝑟M + 𝑟MO% + 𝑟MO< + ⋯ + 𝑟AE + 𝑟F# .
• Thus, all the returns are equal:
𝑢% = 𝑢< = ⋯ = 𝑢F# = ValAcc.
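In code form this is just a sparse reward vector whose suffix sums all equal ValAcc; a tiny sketch (the value 0.91 is made up):

```python
import numpy as np

val_acc = 0.91                                    # hypothetical validation accuracy
rewards = np.zeros(60)
rewards[-1] = val_acc                             # r1 = ... = r59 = 0, r60 = ValAcc
returns = np.cumsum(rewards[::-1])[::-1]          # u_t = r_t + r_{t+1} + ... + r_60
assert np.allclose(returns, val_acc)              # all 60 returns equal ValAcc
```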
REINFORCE Algorithm

• Approximate the policy gradient at step 𝑡 by:
  ∂ log 𝜋(𝐚ₜ₊₁ | 𝐡ₜ, 𝐱ₜ; 𝛉) / ∂𝛉 ⋅ 𝑢ₜ₊₁.

• Update the trainable parameters by gradient ascent over all 60 steps:
  𝛉 ← 𝛉 + 𝛽 ⋅ ∑ₜ [ ∂ log 𝜋(𝐚ₜ₊₁ | 𝐡ₜ, 𝐱ₜ; 𝛉) / ∂𝛉 ] ⋅ 𝑢ₜ₊₁.
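A minimal sketch of one such update, assuming ∂ log 𝜋/∂𝛉 for each chosen action was obtained by backpropagation during the rollout; the parameter vector and the gradients below are random stand-ins, not the paper's implementation:

```python
import numpy as np

def reinforce_update(theta, grads_log_pi, val_acc, beta=1e-3):
    """theta <- theta + beta * sum_t (d log pi(a_{t+1} | h_t, x_t; theta) / d theta) * u_{t+1}."""
    grad = np.zeros_like(theta)
    for g in grads_log_pi:          # one gradient per step; every return u equals ValAcc here
        grad += g * val_acc
    return theta + beta * grad      # gradient ascent step

# Toy usage: 60 steps, a 5-dimensional parameter vector, made-up gradients.
rng = np.random.default_rng(0)
theta = np.zeros(5)
grads = [rng.normal(size=5) for _ in range(60)]
theta = reinforce_update(theta, grads, val_acc=0.91)
```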
Recap

• Run the controller RNN to generate the hyper-parameters of


the 20 convolutional layers.
• Instantiate a CNN, train the CNN, and then obtain a validation
accuracy (to be used as the reward).
• The REINFORCE algorithm uses the reward to update the policy
function (i.e., the controller RNN).
• Repeat this process thousands of times.
NAS is expensive!

• To update the controller RNN once, we need to train a CNN
from scratch. (Doing this once is already expensive.)
• Training the controller RNN well takes 10,000+ updates → training 10,000+ CNNs
from scratch. (Extremely expensive!)
• The controller RNN itself has hyper-parameters that need tuning.
• E.g., # of layers, size of 𝐱, size of 𝐡, etc.
• Hyper-parameter tuning makes the overall time cost many times
higher!
Thank You!

http://wangshusen.github.io/
