Dis4 Sol
This discussion will cover CNN architectures, batch normalization, weight initializations, ensembles
and dropout.
LeNet. Among the earlier CNN architectures, LeNet is the most widely known. LeNet was used mostly
for handwritten digit recognition on the MNIST dataset. Importantly, LeNet used a series of alternating convolutional
and pooling layers, followed by several fully connected (FC) layers.
AlexNet. The AlexNet architecture popularized CNNs in computer vision, when it won the ImageNet
ILSVRC Challenge in 2012 by a large margin. AlexNet has a similar architectural design as LeNet, except
that it is bigger (more neurons) and deeper (more layers). In addition, AlexNet demonstrated the benefits of
using the ReLU activation and dropout for vision tasks, as well as the use of GPUs for accelerated training.
VGGNet. This network was the runner-up to GoogLeNet in ILSVRC 2014, and showed the benefit of
(a) increasing the number of layers, and (b) using only convolutional operators stacked on each other. A
downside is that this network has roughly 138 million parameters, so in general, consider using Residual
Nets (see next item).
ResNet. These networks use skip connections to allow inputs and gradients to propagate faster through
the network (either forwards or backwards). Residual networks were state of the art for image recognition
in mid-2016, and the general backbone is still commonly used today. They have substantially
fewer parameters than VGG. The exact number depends on which “ResNet-X” is used, where “X”
represents the number of layers; PyTorch offers pretrained models for 18, 34, 50, 101, and 152 layers. For reference,
ResNet-152 has about 60 million parameters.
Problem 1: Vanishing Gradients in ResNet
What features of ResNet, in addition to better initialization techniques and BN, alleviate the vanishing
gradient problem?
In backpropagation, gradients also flow through the skip connections. This means that even
if the gradients passing through the weight layers between a skip connection's endpoints vanish, the identity
mapping of the skip connection still carries the gradient through unchanged, which alleviates the vanishing gradient problem.
In the original paper, the authors point out that the problem is mostly solved with better initialization
and Batch Normalization.
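To make this concrete, the following small NumPy sketch (not part of the original discussion) builds a single residual block y = x + F(x) and shows that its Jacobian always contains an identity term contributed by the skip connection, so the gradient cannot vanish even when the residual branch contributes almost nothing.

import numpy as np

# A single residual block y = x + F(x), where F is a small two-layer MLP.
# The Jacobian dy/dx = I + W2 @ diag(ReLU'(W1 x)) @ W1 always contains the
# identity from the skip connection, so it stays well-conditioned even when
# the residual branch (here, with tiny weights) is nearly zero.
np.random.seed(0)
D = 4
W1 = 0.01 * np.random.randn(D, D)   # deliberately small weights
W2 = 0.01 * np.random.randn(D, D)
x = np.random.randn(D)

h = np.maximum(0, W1 @ x)           # F(x): hidden layer with ReLU
y = x + W2 @ h                      # skip connection adds the input back

relu_grad = (W1 @ x > 0).astype(float)
J = np.eye(D) + W2 @ np.diag(relu_grad) @ W1
print(np.round(J, 3))               # close to the identity, far from zero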
Problem 2: Examining the BatchNorm Layer
1. Intuitively, γ and β allow us to restore the original activations if we would like to, and during training
they can learn some other distribution that works better than the standard Gaussian (zero mean, unit variance).
2. For convenience of notation, for $i \in \{1, \ldots, m\}$, where $m$ is the batch size (the number of examples in the batch), let
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
and
$$y_i = \gamma \hat{x}_i + \beta.$$
First, we calculate the derivative with respect to $\gamma$,
$$\frac{\partial f}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial f}{\partial y_i} \cdot \hat{x}_i.$$
From the previous question, given the derivatives $\frac{\partial f}{\partial y_i}$ with respect to the outputs of the BatchNorm layer, compute the derivatives with respect to the inputs $x_i$.
Please note the derivation can be tedious, but it is still a very good exercise!
$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial x_i} + \frac{\partial f}{\partial \mu} \frac{\partial \mu}{\partial x_i} + \frac{\partial f}{\partial \sigma^2} \frac{\partial \sigma^2}{\partial x_i}$$
Let us compute the individual pieces,
$$\frac{\partial f}{\partial \hat{x}_i} = \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \hat{x}_i} = \frac{\partial f}{\partial y_i} \cdot \gamma$$
$$\frac{\partial \hat{x}_i}{\partial x_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}$$
$$\begin{aligned}
\frac{\partial f}{\partial \sigma^2} &= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{\partial \hat{x}_i}{\partial \sigma^2} \\
&= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot (-0.5) \cdot (\sigma^2 + \epsilon)^{-1.5} \\
&= -0.5 \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot (\sigma^2 + \epsilon)^{-1.5}
\end{aligned}$$
$$\frac{\partial \sigma^2}{\partial x_i} = \frac{2(x_i - \mu)}{m}$$
$$\begin{aligned}
\frac{\partial f}{\partial \mu} &= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{\partial \hat{x}_i}{\partial \mu} + \frac{\partial f}{\partial \sigma^2} \cdot \frac{\partial \sigma^2}{\partial \mu} \\
&= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial f}{\partial \sigma^2} \cdot \frac{1}{m} \sum_{i=1}^{m} -2(x_i - \mu) \\
&= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial f}{\partial \sigma^2} \cdot (-2) \cdot 0 \qquad \text{(since } \textstyle\sum_{i=1}^{m}(x_i - \mu) = 0\text{)} \\
&= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}
\end{aligned}$$
$$\frac{\partial \mu}{\partial x_i} = \frac{1}{m}$$
Putting the pieces together,
$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial f}{\partial \mu} \cdot \frac{1}{m} + \frac{\partial f}{\partial \sigma^2} \cdot \frac{2(x_i - \mu)}{m}$$
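The result can also be checked numerically. Below is a minimal NumPy sketch (not part of the original handout) that implements the formulas above and compares the analytic gradient with a centered finite-difference estimate; the function names and the small batch size are illustrative.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, D) batch; gamma, beta: (D,) learnable scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta
    return out, (x, x_hat, mu, var, gamma, eps)

def batchnorm_backward(dout, cache):
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dxhat = dout * gamma                                                  # df/dx_hat
    dvar = np.sum(dxhat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)  # df/dsigma^2
    dmu = np.sum(dxhat * -1.0 / np.sqrt(var + eps), axis=0)               # df/dmu
    dx = dxhat / np.sqrt(var + eps) + dmu / m + dvar * 2.0 * (x - mu) / m
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    return dx, dgamma, dbeta

# Compare against a centered finite-difference estimate of df/dx for
# f(x) = sum(dout * BN(x)), using a small random batch (m = 4, D = 3).
np.random.seed(0)
x = np.random.randn(4, 3)
gamma, beta = np.random.randn(3), np.random.randn(3)
dout = np.random.randn(4, 3)

out, cache = batchnorm_forward(x, gamma, beta)
dx, _, _ = batchnorm_backward(dout, cache)

h = 1e-6
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    fp = np.sum(batchnorm_forward(xp, gamma, beta)[0] * dout)
    fm = np.sum(batchnorm_forward(xm, gamma, beta)[0] * dout)
    dx_num[idx] = (fp - fm) / (2 * h)

print(np.max(np.abs(dx - dx_num)))   # should be tiny, on the order of 1e-8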
Intuition The intuition for ensembles comes from the recognition that neural networks have many parameters
and therefore often high variance. If we train multiple learners, averaging them can reduce this variance.
Ensemble Methods There are two ways we typically proceed with ensemble methods:
1. Prediction Averaging. Train N neural networks independently, then average their predictions
(either by averaging probabilities or by majority vote); see the sketch below.
2. Parameter Averaging. Parameter averaging does not work in the same way as prediction averaging:
we only average parameters in the context of snapshot ensembles, i.e., over snapshots taken along a
single training trajectory, not across independent runs.
In practice, we do not need to reshuffle our dataset (or resample with replacement, as in bagging), since there is already
a lot of randomness in neural network training from weight initialization, minibatch shuffling, and SGD.
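As an illustration of prediction averaging, here is a minimal sketch (not from the handout); the `models` argument is assumed to be a list of callables that map a batch of inputs to class logits.

import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_predict(models, X):
    """Average the softmax probabilities of N independently trained models."""
    probs = np.mean([softmax(m(X)) for m in models], axis=0)  # probabilistic averaging
    return probs.argmax(axis=1)                               # final class prediction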
Making Ensemble Methods Faster Unfortunately, a downside to ensemble methods is that they can
be very slow. Two ways to speed them up:
1. Partial ensemble. Share the feature-extraction backbone and ensemble only the classification layers, so most of the network is trained and evaluated only once.
2. Snapshot ensemble. Save out parameter snapshots over the course of SGD optimization and use each
snapshot as a model (see the sketch below).
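Here is a minimal sketch of the bookkeeping for a snapshot ensemble (illustrative only; `grad_fn` is a hypothetical function returning gradients for the current parameter dictionary). The saved snapshots can then be combined with a predictor such as `ensemble_predict` above.

import copy

def train_with_snapshots(params, grad_fn, lr, num_steps, snapshot_every):
    """Run plain SGD on a dict of parameter arrays, saving periodic snapshots."""
    snapshots = []
    for step in range(1, num_steps + 1):
        grads = grad_fn(params)                       # hypothetical gradient computation
        for k in params:
            params[k] -= lr * grads[k]                # plain SGD update
        if step % snapshot_every == 0:
            snapshots.append(copy.deepcopy(params))   # save a snapshot as one "model"
    return snapshots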
Intuition Dropout can be thought of as representing an ensemble of neural networks: since random nodes
are removed, each forward pass is effectively a different neural network.
Activation Scaling A caveat about dropout is that we must divide the activations by p during training
(so-called inverted dropout), since we do not apply dropout at test time; dividing by p keeps the expected
activations the same in both settings, and at test time no dimensions are forced to 0. Below is
sample code to demonstrate how Dropout works in practice for a 3-layer network.
import numpy as np

def dropout_train(X, p):
    """
    Forward pass for a 3-layer network with (inverted) dropout.
    NOTE: For simplicity, we do not include the backward pass or parameter update,
    and the weights W1, W2, W3 and biases b1, b2, b3 are assumed to be defined elsewhere.
    X: Input
    p: Probability of keeping a unit active (e.g., higher p leads to less dropout)
    """
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p (inverted dropout)
    H1 *= U1                                  # drop (and rescale) the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p
    H2 *= U2                                  # drop (and rescale) the activations
    out = np.dot(W3, H2) + b3
    return out

def predict(X):
    """Forward pass at test time: no dropout masks and no rescaling needed."""
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out
Explain why Dropout could improve performance and when we should use it
Dropout removes random activations during training, which prevents the model from overfitting to
specific features and encourages redundant representations. Dropout should be used when we want to make
the network more robust and lower its variance (i.e., when it is overfitting).
5 Weight Initialization
One of the reasons for poor model performance can be attributed to poor weight initialization. In class, we
discussed two types of weight initialization,
1. Basic initialization: Ensure activations are reasonable and they do not grow or shrink in later layers
(for example, Gaussian random weights or Xavier initialization)
2. Advanced initialization: Work with the eigenvalues of Jacobians
Let our activation be the tanh activation, which is approximately linear for small inputs (i.e.,
Var(a) = Var(z), where a is the input to a layer and z is the output of applying the activation followed by a
linear layer). We furthermore assume that weights and inputs are i.i.d. and centered at zero, and biases are initialized to zero.
We would like the magnitude of the variance to remain constant from layer to layer. Derive the Xavier
initialization, which initializes each weight as
$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{1}{D_a}\right),$$
where $D_a$ is the number of inputs to the layer (the fan-in).
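The derivation itself is not written out above; here is a short sketch under the stated assumptions (approximately linear tanh, i.i.d. zero-mean weights and inputs, zero biases):
$$z_j = \sum_{i=1}^{D_a} W_{ji} \tanh(a_i) \approx \sum_{i=1}^{D_a} W_{ji} a_i$$
Since the $W_{ji}$ and $a_i$ are independent and zero-mean,
$$\mathrm{Var}(z_j) = \sum_{i=1}^{D_a} \mathrm{Var}(W_{ji})\,\mathrm{Var}(a_i) = D_a\,\mathrm{Var}(W)\,\mathrm{Var}(a).$$
Requiring $\mathrm{Var}(z_j) = \mathrm{Var}(a_i)$ so the variance stays constant across layers gives $\mathrm{Var}(W) = \frac{1}{D_a}$, i.e., $W_{ij} \sim \mathcal{N}\!\left(0, \frac{1}{D_a}\right)$.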
• Leaky ReLU. Instead of defining the ReLU as 0 for all x < 0, Leaky ReLU defines it as a small linear
component of x.
• ELU. Instead of defining the ReLU as 0 for all x < 0, ELU defines it as α(e^x − 1) for some α > 0.
Please note we did not cover the above explicitly in lecture, but they are good to know.
Problem 6: (Review) Forward and Backward Pass for ReLU
Compute the output of the forward pass of a ReLU layer with input x as given below:
$$y = \mathrm{ReLU}(x)$$
$$x = \begin{bmatrix} 1.5 & 2.2 & 1.3 & 6.7 \\ 4.3 & -0.3 & -0.2 & 4.9 \\ -4.5 & 1.4 & 5.5 & 1.8 \\ 0.1 & -0.5 & -0.1 & 2.2 \end{bmatrix}$$
Applying the ReLU treats every entry independently, zeroing it out if the entry is less than 0.
$$y = \begin{bmatrix} 1.5 & 2.2 & 1.3 & 6.7 \\ 4.3 & 0 & 0 & 4.9 \\ 0 & 1.4 & 5.5 & 1.8 \\ 0.1 & 0 & 0 & 2.2 \end{bmatrix}$$
Similarly, the backward pass zeros out the entries of the upstream gradient at the same positions that were
zeroed out in the forward pass.
$$\frac{dL}{dx} = \begin{bmatrix} 4.5 & 1.2 & 2.3 & 1.3 \\ -1.3 & 0 & 0 & -2.9 \\ 0 & 1.2 & 3.5 & 1.2 \\ -6.1 & 0 & 0 & -3.2 \end{bmatrix}$$
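For reference, a few lines of NumPy reproduce both passes. The upstream gradient used here is illustrative: the entries marked 9.9 sit at positions where x < 0 and are arbitrary placeholders, since they get zeroed out in the backward pass anyway.

import numpy as np

x = np.array([[ 1.5,  2.2,  1.3,  6.7],
              [ 4.3, -0.3, -0.2,  4.9],
              [-4.5,  1.4,  5.5,  1.8],
              [ 0.1, -0.5, -0.1,  2.2]])
dL_dy = np.array([[ 4.5, 1.2, 2.3,  1.3],
                  [-1.3, 9.9, 9.9, -2.9],
                  [ 9.9, 1.2, 3.5,  1.2],
                  [-6.1, 9.9, 9.9, -3.2]])   # placeholder values at masked positions

y = np.maximum(0, x)        # forward: elementwise max(0, x)
dL_dx = dL_dy * (x > 0)     # backward: pass the gradient only where x > 0
print(y)
print(dL_dx)                # matches the matrices worked out above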
1. What advantages does using ReLU activations have over sigmoid activations?
2. ReLU layers have non-negative outputs. What is a negative consequence of this property? What
layer types were developed to address this issue?
1. (1) Computing ReLU and its gradient is more computationally efficient than computing the sigmoid
and its gradient. (2) ReLU reduces the likelihood of vanishing gradients: the derivative of the sigmoid
is always less than 1 (at most 0.25), so multiplying such gradients over many layers quickly drives the
overall gradient toward 0, whereas ReLU has gradient 1 for positive inputs. However, this is less of an
advantage given that one can use Batch Normalization to keep layer inputs centered.
2. ReLU suffers from the dying ReLU problem, where a unit always outputs 0 no matter what
the input is. Once a ReLU ends up in this state, it is unlikely to recover, since its gradient
is also 0 and gradient-descent methods will not alter its weights. Layer types developed to
address this include Leaky ReLU and ELU.
• Ensembles combine several models into a single predictor. To make this faster, we can either ensemble
only the classification layers or use snapshot ensembles.
• Dropout is a method for randomly removing nodes during training, and intuitively it represents an ensemble of networks,
since each forward pass is effectively a different network.