
CS 182/282A Designing, Visualizing and Understanding Deep Neural Networks

Spring 2021 Sergey Levine Discussion 4

This discussion will cover CNN architectures, batch normalization, weight initializations, ensembles
and dropout.

1 Convolutional Neural Networks Architectures


We will survey some of the most famous convolutional neural network architectures.

LeNet. Among the earlier CNN architectures, LeNet is the most widely known. LeNet was used mostly
for handwritten digit recognition on the MNIST dataset. Importantly, LeNet used a series of alternating
convolutional and pooling layers, followed by several fully connected (FC) layers.

AlexNet. The AlexNet architecture popularized CNNs in computer vision, when it won the ImageNet
ILSVRC Challenge in 2012 by a large margin. AlexNet has a similar architectural design as LeNet, except
that it is bigger (more neurons) and deeper (more layers). In addition, AlexNet demonstrated the benefits of
using the ReLU activation and dropout for vision tasks, as well as the use of GPUs for accelerated training.

VGGNet This network was the runner-up in ILSVRC 2014 to GoogLeNet, and showed the benefit of
(a) increasing the number of layers, and (b) using only small 3×3 convolutional filters stacked on top of
each other. A downside is that this network has roughly 138 million parameters, so in general, consider
using Residual Nets (see next item).

ResNet These networks use skip connections to allow inputs and gradients to propagate faster throughout
the network (either forward or backward). Residual networks achieved state-of-the-art image recognition
results in mid-2016, and the general backbone is still commonly used today. They have substantially
fewer parameters than VGG. The exact number depends on which "ResNet-X" is used, where "X"
represents the number of layers; PyTorch offers pretrained models for 18, 34, 50, 101, and 152 layers. For
reference, ResNet-152 has about 60 million parameters.
Problem 1: Vanishing Gradients in ResNet

What features of ResNet, in addition to better initialization techniques and BN, alleviate the vanishing
gradient problem?

Solution 1: Vanishing Gradients

In backpropagation, gradients also flow through the skip connections. This means that even if the
gradients through the intermediate layers vanish, the identity mapping of the skip connection still
carries the upstream gradient through, alleviating the vanishing gradient problem.
In the original paper, the authors point out that the problem is mostly addressed by better initialization
and Batch Normalization.
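
To make this concrete, consider a residual block that computes h = F(x) + x (writing L for the loss). By the chain rule,

\[
\frac{\partial L}{\partial x}
= \frac{\partial L}{\partial h}\left(I + \frac{\partial F}{\partial x}\right)
= \frac{\partial L}{\partial h} + \frac{\partial L}{\partial h}\,\frac{\partial F}{\partial x},
\]

so even if ∂F/∂x is very small, the identity term passes the upstream gradient through unchanged.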



2 Batch Normalization
The main idea behind Batch Normalization is to transform every sampled batch of data so that it has
µ = 0, σ² = 1. Using Batch Normalization typically makes networks significantly more robust to poor
initialization. It is based on the intuition that it is better for layers to receive unit-Gaussian inputs at
initialization. However, the reason why Batch Normalization works is not entirely understood, and there
are conflicting views on whether it reduces internal covariate shift, smooths the optimization landscape,
or helps for other reasons.
In practice, when using batch normalization, we add a BatchNorm layer immediately after each FC or
convolutional layer, either before or after the non-linearity. The key observation is that normalization is a
relatively simple differentiable operation, so it does not add much additional complexity to the network.
Concretely, Batch Normalization proceeds by first computing the empirical mean and variance of some
mini-batch B of size m from the training set:

\[
\mu_B = \frac{1}{m} \sum_{i=1}^{m} a_i,
\qquad
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (a_i - \mu_B)^2
\]

Then, for a given layer of the network, each dimension a^(k) is normalized appropriately,

\[
\bar{a}^{(k)}_i = \frac{a^{(k)}_i - \mu^{(k)}_B}{\sqrt{\big(\sigma^{(k)}_B\big)^2 + \epsilon}}
\]

where ε is added for numerical stability.


In practice, after normalizing the input, we pass the result through a learnable affine transformation with
scale γ and bias β, so we have

\[
\bar{a}^{(k)}_i = \gamma \, \frac{a^{(k)}_i - \mu^{(k)}_B}{\sqrt{\big(\sigma^{(k)}_B\big)^2 + \epsilon}} + \beta
\]

Intuitively, γ and β allow us to restore the original activations if we would like, and during training they
can learn a distribution other than the standard Gaussian if that works better.
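
Below is a minimal NumPy sketch of this training-time forward pass; the function name, the (m, D) input shape, and the default ε are illustrative choices (at test time, running averages of µ_B and σ_B² would be used instead, which is omitted here).

import numpy as np

def batchnorm_forward(a, gamma, beta, eps=1e-5):
    """Training-time BatchNorm on a mini-batch a of shape (m, D)."""
    mu = a.mean(axis=0)                    # empirical mean mu_B, per feature
    var = a.var(axis=0)                    # empirical variance sigma_B^2, per feature
    a_hat = (a - mu) / np.sqrt(var + eps)  # normalize each dimension
    return gamma * a_hat + beta            # learnable scale and shift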
Problem 2: Examining the BatchNorm Layer

1. Draw out the computational graph of the BatchNorm layer.

2. Given the derivatives ∂f/∂y_i of the output of the BatchNorm layer, compute the derivatives with
respect to the parameters γ and β.



Solution 2: Examining the BatchNorm Layer

Note: Students should already have derived this in homework.

1. (Computational graph figure omitted here.)

2. For convenience of notation, for i ∈ {1, . . . , m}, where m is the size of the mini-batch, let

\[
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\qquad \text{and} \qquad
y_i = \gamma \hat{x}_i + \beta.
\]

First, we calculate the derivative with respect to γ,

\[
\frac{\partial f}{\partial \gamma}
= \sum_{i=1}^{m} \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \gamma}
= \sum_{i=1}^{m} \frac{\partial f}{\partial y_i} \cdot \hat{x}_i.
\]

We sum from i = 1 to m since each of the m examples in the mini-batch contributes to the loss through
the shared parameter γ.

Second, we calculate the derivative with respect to β. Similar to our previous calculation,

\[
\frac{\partial f}{\partial \beta}
= \sum_{i=1}^{m} \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \beta}
= \sum_{i=1}^{m} \frac{\partial f}{\partial y_i}.
\]
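
In NumPy these two gradients are one line each. Here dout stands for the upstream gradients ∂f/∂y_i and xhat for the normalized values x̂_i from the forward pass; the shapes and dummy values below are only for illustration.

import numpy as np

m, D = 32, 64                         # illustrative mini-batch size and feature dimension
dout = np.random.randn(m, D)          # upstream gradients df/dy_i
xhat = np.random.randn(m, D)          # normalized activations from the forward pass

dgamma = np.sum(dout * xhat, axis=0)  # sum over the m examples in the mini-batch
dbeta = np.sum(dout, axis=0)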

Problem 3: (Challenge) Examining the BatchNorm Layer

From the previous question, given the derivatives ∂f/∂y_i of the output of the BatchNorm layer,
compute the derivatives with respect to the inputs x_i.
Please note the derivation can be tedious, but it is still a very good exercise!



Solution 3: (Challenge) Examining the BatchNorm Layer

We calculate the derivative with respect to x_i. To do this, we note,

\[
\frac{\partial f}{\partial x_i}
= \frac{\partial f}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial x_i}
+ \frac{\partial f}{\partial \mu} \frac{\partial \mu}{\partial x_i}
+ \frac{\partial f}{\partial \sigma^2} \frac{\partial \sigma^2}{\partial x_i}.
\]

Let us compute the individual pieces,

\[
\frac{\partial f}{\partial \hat{x}_i}
= \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \hat{x}_i}
= \frac{\partial f}{\partial y_i} \cdot \gamma
\]

\[
\frac{\partial \hat{x}_i}{\partial x_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}
\]

\[
\frac{\partial f}{\partial \sigma^2}
= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{\partial \hat{x}_i}{\partial \sigma^2}
= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot (-0.5) \cdot (\sigma^2 + \epsilon)^{-1.5}
= -0.5 \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot (\sigma^2 + \epsilon)^{-1.5}
\]

\[
\frac{\partial \sigma^2}{\partial x_i} = \frac{2(x_i - \mu)}{m}
\]

\[
\frac{\partial f}{\partial \mu}
= \frac{\partial f}{\partial \hat{x}} \cdot \frac{\partial \hat{x}}{\partial \mu}
+ \frac{\partial f}{\partial \sigma^2} \cdot \frac{\partial \sigma^2}{\partial \mu}
= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}
+ \frac{\partial f}{\partial \sigma^2} \cdot \frac{1}{m} \sum_{i=1}^{m} -2(x_i - \mu)
\]

Since \(\sum_{i=1}^{m} (x_i - \mu) = 0\), the second term vanishes, leaving

\[
\frac{\partial f}{\partial \mu}
= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}
\]

\[
\frac{\partial \mu}{\partial x_i} = \frac{1}{m}
\]

Putting everything together,

\[
\frac{\partial f}{\partial x_i}
= \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}}
+ \frac{\partial f}{\partial \mu} \cdot \frac{1}{m}
+ \frac{\partial f}{\partial \sigma^2} \cdot \frac{2(x_i - \mu)}{m}.
\]
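
A minimal NumPy sketch of the full backward pass that follows these formulas; the function name, the (m, D) shapes, and the default ε are illustrative choices, not part of the original derivation.

import numpy as np

def batchnorm_backward(dout, x, gamma, eps=1e-5):
    """Backward pass for BatchNorm, following the derivation above.

    dout:  upstream gradients df/dy_i, shape (m, D)
    x:     inputs to the BatchNorm layer, shape (m, D)
    gamma: learnable scale, shape (D,)
    """
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + eps)

    dxhat = dout * gamma                                              # df/dxhat_i
    dvar = np.sum(dxhat * (x - mu), axis=0) * -0.5 * (var + eps) ** -1.5
    dmu = np.sum(-dxhat / np.sqrt(var + eps), axis=0)
    dx = dxhat / np.sqrt(var + eps) + dmu / m + dvar * 2.0 * (x - mu) / m

    dgamma = np.sum(dout * xhat, axis=0)
    dbeta = np.sum(dout, axis=0)
    return dx, dgamma, dbeta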



3 Ensembles
Definition 1 (Ensemble). Ensembles (e.g., bagging or boosting) combine several models trained on the
same task into a single predictor by aggregating their predictions.

Intuition The intuition for ensembles comes from the recognition that neural networks have many param-
eters and therefore often high variance; with multiple learners, we can average out this variance.

Ensemble Methods There are two ways we typically proceed with ensemble methods:

1. Prediction Averaging. Train N neural networks independently, then average their predictions
(either probabilistically or by majority vote); a minimal sketch follows this list.
2. Parameter Averaging. Parameter averaging does not work in the same way as prediction averaging;
it is typically only done in the context of snapshot ensembles, averaging parameters over one training
trajectory rather than over independent runs.
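
Below is a minimal sketch of prediction averaging; the predict_proba interface on each model is an assumption for illustration, not a specific library API.

import numpy as np

def ensemble_predict(models, X):
    """Average the class probabilities of N independently trained models."""
    # Each model is assumed to expose predict_proba(X) -> (num_samples, num_classes).
    probs = np.mean([model.predict_proba(X) for model in models], axis=0)
    return probs.argmax(axis=1)  # final prediction from the averaged probabilities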

In practice, we do not need to reshuffle our dataset (and resample with replacement), since there is already
a lot of randomness in neural network training from weight initialization, minibatch shuffling, and SGD.

Making Ensemble Methods Faster Unfortunately, a downside of ensemble methods is that they can
be very slow. Two common ways to speed them up:

1. Ensemble only the classification layers (e.g., the final FC layers), sharing the earlier layers across
ensemble members.

2. Snapshot ensembles. Save parameter snapshots over the course of SGD optimization and use each
snapshot as a model (a toy sketch follows this list).
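
A self-contained toy sketch of the snapshot mechanics, using a linear model trained by SGD on random data purely for illustration (the task, snapshot interval, and variable names are all arbitrary choices).

import copy
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=(256,))
w = np.zeros(10)                                  # toy "model": a single linear layer

snapshots = []
for step in range(1, 1001):
    idx = rng.integers(0, len(X), size=32)        # sample a mini-batch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / 32  # gradient of the mean squared error
    w -= 0.01 * grad                              # SGD update
    if step % 200 == 0:                           # periodically save a parameter snapshot
        snapshots.append(copy.deepcopy(w))

# Each saved snapshot acts as one ensemble member; its predictions are averaged
# just as in the prediction-averaging sketch above.
preds = np.mean([X @ w_snap for w_snap in snapshots], axis=0)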



4 Dropout
Definition 2 (Dropout). Dropout is a popular technique for regularizing neural networks by randomly
removing nodes with probability 1 − p_keep during the forward pass at training time. The model is left
unchanged at test time.

Intuition Dropout can be thought of as representing an ensemble of neural networks, since each forward
pass effectively uses a different network (a random subset of nodes is removed).

Activation Scaling A caveat about dropout is that, because no units are dropped at test time, we must
divide the kept activations by p during training (inverted dropout) so that the expected activation matches
between training and test time. Below is sample code demonstrating how dropout works in practice for a
3-layer network.
import numpy as np

# NOTE: W1, b1, W2, b2, W3, b3 are assumed to be defined (and trained) elsewhere.

def dropout_train(X, p):
    """
    Forward pass for a 3-layer network with (inverted) dropout.
    NOTE: For simplicity, we do not include the backward pass or parameter update.

    X: Input
    p: Probability of keeping a unit active (e.g., higher p leads to less dropout)
    """
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p
    H1 *= U1                                  # drop (and rescale) the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p
    H2 *= U2                                  # drop (and rescale) the activations
    out = np.dot(W3, H2) + b3
    return out

def predict(X):
    """ Forward pass at test time: no masking or rescaling is needed. """
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out

Problem 4: Dropout Review

Explain why Dropout could improve performance and when we should use it.

Solution 4: Dropout Review

Dropout removes random activations during training, which prevents the model from overfitting to
specific features and encourages redundant representations. Dropout should be used to make the network
more robust and to lower its variance.

5 Weight Initialization
One common cause of poor model performance is poor weight initialization. In class, we
discussed two types of weight initialization:

1. Basic initialization: Ensure activations are reasonable and they do not grow or shrink in later layers
(for example, Gaussian random weights or Xavier initialization)
2. Advanced initialization: Work with the eigenvalues of Jacobians



Problem 5: Deriving Xavier Initialization

Let our activation be the tanh activation, which is approximately linear for small inputs (so that
Var(a) = Var(z), where z is the output of a linear layer and a = tanh(z) ≈ z). We furthermore
assume that weights and inputs are i.i.d. and centered at zero, and that biases are initialized to zero.
We would like the magnitude of the variance to remain constant from layer to layer. Derive the Xavier
initialization, which initializes each weight as

\[
W_{ij} \sim \mathcal{N}\!\left(0, \frac{1}{D_a}\right),
\]

where D_a is the dimensionality of a.

Solution 5: Deriving Xavier Initialization

Note that, since we assume the bias is initialized at 0,

\[
z_i = \sum_{j=1}^{D_a} W_{ij} a_j.
\]

We can then compute Var(a_i),

\[
\begin{aligned}
\mathrm{Var}(a_i) &= \mathrm{Var}(z_i) && \text{linearity of tanh} \\
&= \mathrm{Var}\left( \sum_{j=1}^{D_a} W_{ij} a_j \right) \\
&= \sum_{j=1}^{D_a} \mathrm{Var}(W_{ij} a_j) && \text{variance of an independent sum} \\
&= \sum_{j=1}^{D_a} \mathbb{E}[W_{ij}]^2 \mathrm{Var}(a_j) + \mathbb{E}[a_j]^2 \mathrm{Var}(W_{ij}) + \mathrm{Var}(W_{ij})\mathrm{Var}(a_j) \\
&= D_a \, \mathrm{Var}(W_{ij}) \, \mathrm{Var}(a_j).
\end{aligned}
\]

For the variance to remain constant across layers (Var(a_i) = Var(a_j)), we therefore need

\[
\mathrm{Var}(W_{ij}) = \frac{1}{D_a}.
\]
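
As a quick numerical sanity check of this result (the dimensions, sample count, and seed below are arbitrary), drawing W_ij ∼ N(0, 1/D_a) indeed keeps the variance of z = W a equal to that of a.

import numpy as np

rng = np.random.default_rng(0)
D_a = 1024
a = rng.normal(size=(10000, D_a))                         # i.i.d. inputs with unit variance
W = rng.normal(0.0, np.sqrt(1.0 / D_a), size=(D_a, D_a))  # Xavier-initialized weights
z = a @ W                                                 # pre-activations of the next layer
print(a.var(), z.var())                                   # both should be close to 1.0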



6 Aside: The ReLU Activation and Its Relatives
Definition 3 (ReLU Activation). The ReLU activation is defined as ReLU(x) = max(0, x), and it is a
popular activation function.

On top of the ReLU activation, there exist close relatives, such as:

• Leaky ReLU. Instead of defining the output as 0 for all x < 0, Leaky ReLU uses a small linear
function αx (with a small slope such as α = 0.01).
• ELU. Instead of defining the output as 0 for all x < 0, ELU uses α(e^x − 1) for some α.

Please note we did not cover these explicitly in lecture, but they are good to know.
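
For reference, here are one-line NumPy implementations of these activations; the α defaults are common choices, not prescribed by the discussion.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)            # small linear slope for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * np.expm1(x))  # alpha * (e^x - 1) for x < 0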
Problem 6: (Review) Forward and Backward Pass for ReLU

Compute the output of the forward pass of a ReLU layer with input x as given below:

\[
y = \mathrm{ReLU}(x),
\qquad
x = \begin{bmatrix}
1.5 & 2.2 & 1.3 & 6.7 \\
4.3 & -0.3 & -0.2 & 4.9 \\
-4.5 & 1.4 & 5.5 & 1.8 \\
0.1 & -0.5 & -0.1 & 2.2
\end{bmatrix}
\]

With the gradients with respect to the outputs dL/dy given below, compute the gradient of the loss with
respect to the input x using the backward pass for a ReLU layer:

\[
\frac{dL}{dy} = \begin{bmatrix}
4.5 & 1.2 & 2.3 & 1.3 \\
-1.3 & -6.3 & 4.1 & -2.9 \\
-0.5 & 1.2 & 3.5 & 1.2 \\
-6.1 & 0.5 & -4.1 & -3.2
\end{bmatrix}
\]

Solution 6: (Review) Forward and Backward Pass for ReLU

Applying the ReLU treats every entry independently, zeroing it out if the entry is less than 0.

\[
y = \begin{bmatrix}
1.5 & 2.2 & 1.3 & 6.7 \\
4.3 & 0 & 0 & 4.9 \\
0 & 1.4 & 5.5 & 1.8 \\
0.1 & 0 & 0 & 2.2
\end{bmatrix}
\]

Similarly, the backward pass zeros out the entries of the gradient at the same positions that were zeroed
out in the forward pass.

\[
\frac{dL}{dx} = \begin{bmatrix}
4.5 & 1.2 & 2.3 & 1.3 \\
-1.3 & 0 & 0 & -2.9 \\
0 & 1.2 & 3.5 & 1.2 \\
-6.1 & 0 & 0 & -3.2
\end{bmatrix}
\]
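
The result can be checked in a couple of lines of NumPy (variable names here are our own).

import numpy as np

x = np.array([[ 1.5,  2.2,  1.3,  6.7],
              [ 4.3, -0.3, -0.2,  4.9],
              [-4.5,  1.4,  5.5,  1.8],
              [ 0.1, -0.5, -0.1,  2.2]])
dL_dy = np.array([[ 4.5,  1.2,  2.3,  1.3],
                  [-1.3, -6.3,  4.1, -2.9],
                  [-0.5,  1.2,  3.5,  1.2],
                  [-6.1,  0.5, -4.1, -3.2]])

y = np.maximum(0, x)     # forward pass: zero out negative entries
dL_dx = dL_dy * (x > 0)  # backward pass: gradients flow only where x > 0
print(y)
print(dL_dx)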

Problem 7: ReLU Potpourri

1. What advantages do ReLU activations have over sigmoid activations?
2. ReLU layers have non-negative outputs. What is a negative consequence of this property? What
layer types were developed to address this issue?



Solution 7: ReLU Potpourri

1. (1) Computing ReLU and its gradient is more computationally efficient than computing the
sigmoid and its gradient. (2) ReLU reduces the likelihood of vanishing gradients: the derivative
of the sigmoid function is always less than 1, so multiplying such gradients over many layers
quickly drives the overall gradient toward 0. However, this advantage is less pronounced when
Batch Normalization is used to keep layer inputs centered.
2. ReLU suffers from the dying ReLU problem, where a unit always outputs 0, no matter what
the input is. Once a ReLU ends up in this state, it is unlikely to recover, since its gradient
is also 0 and gradient descent methods will not alter its weights. Layer types developed to
address this include Leaky ReLU and ELU.



7 Summary
• Recall the main ConvNet architectures (LeNet, AlexNet, GoogLeNet, VGGNet, ResNet). In particular,
recall why bottleneck layers in ResNet are important.
• Batch Normalization proceeds by first computing the empirical mean and variance of a mini-batch,
then normalizing each activation and applying the learnable scale γ and shift β.

• Ensembles group several models into a single model. To make this quicker, we can either ensemble
only the classification layers or use snapshot ensembles.
• Dropout randomly removes nodes during training and intuitively represents an ensemble of networks,
since each forward pass is effectively a different network.
