Dis4 Sol
This discussion will cover CNN architectures, batch normalization, weight initializations, ensembles
and dropout.
LeNet. Among the earlier CNN architectures, LeNet is the most widely known. LeNet was used mostly
for handwritten digit recognition on the MNIST dataset. Importantly, LeNet used a series of alternating convolutional
and pooling layers, followed by several fully connected (FC) layers.
AlexNet. The AlexNet architecture popularized CNNs in computer vision, when it won the ImageNet
ILSVRC Challenge in 2012 by a large margin. AlexNet has a similar architectural design as LeNet, except
that it is bigger (more neurons) and deeper (more layers). In addition, AlexNet demonstrated the benefits of
using the ReLU activation and dropout for vision tasks, as well as the use of GPUs for accelerated training.
VGGNet. This network was the runner-up to GoogLeNet in ILSVRC 2014, and showed the benefit of
(a) increasing the number of layers, and (b) using only convolutional operators stacked on each other. A
downside is that this network has roughly 138 million parameters, so in general, consider using Residual
Nets (see next item).
ResNet. These networks use skip connections to allow inputs and gradients to propagate faster through
the network (either forwards or backwards). Residual networks were state of the art for image recognition
in mid-2016, and the general backbone is still commonly used today. They have substantially
fewer parameters than VGG. The exact number depends on which “ResNet-X” is used, where “X”
represents the number of layers; PyTorch offers pretrained models for 18, 34, 50, 101, and 152 layers. For reference,
ResNet-152 has about 60 million parameters.
Problem 1: Vanishing Gradients in ResNet
What features of ResNet, in addition to better initialization techniques and BN, alleviate the vanishing
gradient problem?
In backpropagation, gradients also flow through the skip connections. This means that even
if the gradients passing through the weight layers between a skip connection's endpoints vanish, the identity
mapping of the skip connection still carries the gradient through unchanged, which alleviates the vanishing gradient problem.
In the original paper, the authors point out that the problem is mostly solved with better initialization
and Batch Normalization.
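To make this concrete, the following small NumPy sketch (not part of the original discussion) builds a single residual block y = x + F(x) and shows that its Jacobian always contains an identity term contributed by the skip connection, so the gradient cannot vanish even when the residual branch contributes almost nothing.

import numpy as np

# A single residual block y = x + F(x), where F is a small two-layer MLP.
# The Jacobian dy/dx = I + W2 @ diag(ReLU'(W1 x)) @ W1 always contains the
# identity from the skip connection, so it stays well-conditioned even when
# the residual branch (here, with tiny weights) is nearly zero.
np.random.seed(0)
D = 4
W1 = 0.01 * np.random.randn(D, D)   # deliberately small weights
W2 = 0.01 * np.random.randn(D, D)
x = np.random.randn(D)

h = np.maximum(0, W1 @ x)           # F(x): hidden layer with ReLU
y = x + W2 @ h                      # skip connection adds the input back

relu_grad = (W1 @ x > 0).astype(float)
J = np.eye(D) + W2 @ np.diag(relu_grad) @ W1
print(np.round(J, 3))               # close to the identity, far from zero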
Problem 2: Examining the BatchNorm Layer
1. Intuitively, γ and β allow us to restore the original activations if we would like to, and during training
they can learn some other distribution that works better than the standard Gaussian (zero mean, unit variance).
2. For convenience of notation, for $i \in \{1, \ldots, m\}$, where $m$ is the batch size (the number of examples in the batch), let
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
and
$$y_i = \gamma \hat{x}_i + \beta.$$
First, we calculate the derivative with respect to $\gamma$,
$$\frac{\partial f}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial f}{\partial y_i} \cdot \hat{x}_i.$$
From the previous question, given the derivatives $\frac{\partial f}{\partial y_i}$ with respect to the outputs of the BatchNorm layer, compute the derivatives with respect to the inputs $x_i$.
Please note the derivation can be tedious, but it is still a very good exercise!
$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial x_i} + \frac{\partial f}{\partial \mu} \frac{\partial \mu}{\partial x_i} + \frac{\partial f}{\partial \sigma^2} \frac{\partial \sigma^2}{\partial x_i}$$
Let us compute the individual pieces,
$$\frac{\partial f}{\partial \hat{x}_i} = \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \hat{x}_i} = \frac{\partial f}{\partial y_i} \cdot \gamma$$
$$\frac{\partial \hat{x}_i}{\partial x_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}$$
$$\begin{aligned}
\frac{\partial f}{\partial \sigma^2} &= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{\partial \hat{x}_i}{\partial \sigma^2} \\
&= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot (-0.5) \cdot (\sigma^2 + \epsilon)^{-1.5} \\
&= -0.5 \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot (\sigma^2 + \epsilon)^{-1.5}
\end{aligned}$$
$$\frac{\partial \sigma^2}{\partial x_i} = \frac{2(x_i - \mu)}{m}$$
$$\begin{aligned}
\frac{\partial f}{\partial \mu} &= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{\partial \hat{x}_i}{\partial \mu} + \frac{\partial f}{\partial \sigma^2} \cdot \frac{\partial \sigma^2}{\partial \mu} \\
&= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial f}{\partial \sigma^2} \cdot \frac{1}{m} \sum_{i=1}^{m} -2(x_i - \mu) \\
&= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial f}{\partial \sigma^2} \cdot (-2) \cdot 0 \qquad \text{(since } \textstyle\sum_{i=1}^{m}(x_i - \mu) = 0\text{)} \\
&= \sum_{i=1}^{m} \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}
\end{aligned}$$
$$\frac{\partial \mu}{\partial x_i} = \frac{1}{m}$$
Putting the pieces together,
$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial f}{\partial \mu} \cdot \frac{1}{m} + \frac{\partial f}{\partial \sigma^2} \cdot \frac{2(x_i - \mu)}{m}$$
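The result can also be checked numerically. Below is a minimal NumPy sketch (not part of the original handout) that implements the formulas above and compares the analytic gradient with a centered finite-difference estimate; the function names and the small batch size are illustrative.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, D) batch; gamma, beta: (D,) learnable scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta
    return out, (x, x_hat, mu, var, gamma, eps)

def batchnorm_backward(dout, cache):
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dxhat = dout * gamma                                                  # df/dx_hat
    dvar = np.sum(dxhat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)  # df/dsigma^2
    dmu = np.sum(dxhat * -1.0 / np.sqrt(var + eps), axis=0)               # df/dmu
    dx = dxhat / np.sqrt(var + eps) + dmu / m + dvar * 2.0 * (x - mu) / m
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    return dx, dgamma, dbeta

# Compare against a centered finite-difference estimate of df/dx for
# f(x) = sum(dout * BN(x)), using a small random batch (m = 4, D = 3).
np.random.seed(0)
x = np.random.randn(4, 3)
gamma, beta = np.random.randn(3), np.random.randn(3)
dout = np.random.randn(4, 3)

out, cache = batchnorm_forward(x, gamma, beta)
dx, _, _ = batchnorm_backward(dout, cache)

h = 1e-6
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    fp = np.sum(batchnorm_forward(xp, gamma, beta)[0] * dout)
    fm = np.sum(batchnorm_forward(xm, gamma, beta)[0] * dout)
    dx_num[idx] = (fp - fm) / (2 * h)

print(np.max(np.abs(dx - dx_num)))   # should be tiny, on the order of 1e-8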
Intuition The intuition for ensembles comes from the recognition that neural networks have many parameters
and therefore often high variance. If we train multiple learners, averaging them can reduce this variance.
Ensemble Methods There are two ways we typically proceed with ensemble methods:
1. Prediction Averaging. Train N neural networks independently, then average their predictions
(either by averaging probabilities or by majority vote); see the sketch below.
2. Parameter Averaging. Parameter averaging does not work in the same way as prediction averaging:
we only average parameters in the context of snapshot ensembles, i.e., over snapshots taken along a
single training trajectory, not across independent runs.
In practice, we do not need to reshuffle our dataset (or resample with replacement, as in bagging), since there is already
a lot of randomness in neural network training from weight initialization, minibatch shuffling, and SGD.
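As an illustration of prediction averaging, here is a minimal sketch (not from the handout); the `models` argument is assumed to be a list of callables that map a batch of inputs to class logits.

import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_predict(models, X):
    """Average the softmax probabilities of N independently trained models."""
    probs = np.mean([softmax(m(X)) for m in models], axis=0)  # probabilistic averaging
    return probs.argmax(axis=1)                               # final class prediction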
Making Ensemble Methods Faster Unfortunately, a downside to ensemble methods is that they can
be very slow. Two ways to speed them up:
1. Partial ensemble. Share the feature-extraction backbone and ensemble only the classification layers, so most of the network is trained and evaluated only once.
2. Snapshot ensemble. Save out parameter snapshots over the course of SGD optimization and use each
snapshot as a model (see the sketch below).
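Here is a minimal sketch of the bookkeeping for a snapshot ensemble (illustrative only; `grad_fn` is a hypothetical function returning gradients for the current parameter dictionary). The saved snapshots can then be combined with a predictor such as `ensemble_predict` above.

import copy

def train_with_snapshots(params, grad_fn, lr, num_steps, snapshot_every):
    """Run plain SGD on a dict of parameter arrays, saving periodic snapshots."""
    snapshots = []
    for step in range(1, num_steps + 1):
        grads = grad_fn(params)                       # hypothetical gradient computation
        for k in params:
            params[k] -= lr * grads[k]                # plain SGD update
        if step % snapshot_every == 0:
            snapshots.append(copy.deepcopy(params))   # save a snapshot as one "model"
    return snapshots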
Intuition Dropout can be thought of as representing an ensemble of neural networks: since random nodes
are removed, each forward pass is effectively a different neural network.
Activation Scaling A caveat about dropout is that we must divide the activations by p during training
(so-called inverted dropout), since we do not apply dropout at test time; dividing by p keeps the expected
activations the same in both settings, and at test time no dimensions are forced to 0. Below is
sample code to demonstrate how Dropout works in practice for a 3-layer network.
import numpy as np

def dropout_train(X, p):
    """
    Forward pass for a 3-layer network with (inverted) dropout.
    NOTE: For simplicity, we do not include the backward pass or parameter update,
    and the weights W1, W2, W3 and biases b1, b2, b3 are assumed to be defined elsewhere.
    X: Input
    p: Probability of keeping a unit active (e.g., higher p leads to less dropout)
    """
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p (inverted dropout)
    H1 *= U1                                  # drop (and rescale) the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p
    H2 *= U2                                  # drop (and rescale) the activations
    out = np.dot(W3, H2) + b3
    return out

def predict(X):
    """Forward pass at test time: no dropout masks and no rescaling needed."""
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out
Explain why Dropout could improve performance and when we should use it
Dropout removes random activations during training, which prevents the model from overfitting to
specific features and encourages redundant representations. Dropout should be used when we want to make
the network more robust and lower its variance (i.e., when it is overfitting).
5 Weight Initialization
One of the reasons for poor model performance can be attributed to poor weight initialization. In class, we
discussed two types of weight initialization,
1. Basic initialization: Ensure activations are reasonable and they do not grow or shrink in later layers
(for example, Gaussian random weights or Xavier initialization)
2. Advanced initialization: Work with the eigenvalues of Jacobians
Let our activation be the tanh activation, which is approximately linear for small inputs (i.e.,
Var(a) = Var(z), where a is the input to a layer and z is the output of applying the activation followed by a
linear layer). We furthermore assume that weights and inputs are i.i.d. and centered at zero, and biases are initialized to zero.
We would like the magnitude of the variance to remain constant from layer to layer. Derive the Xavier
initialization, which initializes each weight as
$$W_{ij} \sim \mathcal{N}\!\left(0, \frac{1}{D_a}\right),$$
where $D_a$ is the number of inputs to the layer (the fan-in).
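The derivation itself is not written out above; here is a short sketch under the stated assumptions (approximately linear tanh, i.i.d. zero-mean weights and inputs, zero biases):
$$z_j = \sum_{i=1}^{D_a} W_{ji} \tanh(a_i) \approx \sum_{i=1}^{D_a} W_{ji} a_i$$
Since the $W_{ji}$ and $a_i$ are independent and zero-mean,
$$\mathrm{Var}(z_j) = \sum_{i=1}^{D_a} \mathrm{Var}(W_{ji})\,\mathrm{Var}(a_i) = D_a\,\mathrm{Var}(W)\,\mathrm{Var}(a).$$
Requiring $\mathrm{Var}(z_j) = \mathrm{Var}(a_i)$ so the variance stays constant across layers gives $\mathrm{Var}(W) = \frac{1}{D_a}$, i.e., $W_{ij} \sim \mathcal{N}\!\left(0, \frac{1}{D_a}\right)$.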
• Leaky ReLU. Instead of defining the ReLU as 0 for all x < 0, Leaky ReLU defines it as a small linear
component of x.
• ELU. Instead of defining the ReLU as 0 for all x < 0, ELU defines it as α(e^x − 1) for some α > 0.
Please note we did not cover the above explicitly in lecture, but they are good to know.
Problem 6: (Review) Forward and Backward Pass for ReLU
Compute the output of the forward pass of a ReLU layer with input x as given below:
$$y = \mathrm{ReLU}(x)$$
$$x = \begin{bmatrix} 1.5 & 2.2 & 1.3 & 6.7 \\ 4.3 & -0.3 & -0.2 & 4.9 \\ -4.5 & 1.4 & 5.5 & 1.8 \\ 0.1 & -0.5 & -0.1 & 2.2 \end{bmatrix}$$
Applying the ReLU treats every entry independently, zeroing it out if the entry is less than 0.
$$y = \begin{bmatrix} 1.5 & 2.2 & 1.3 & 6.7 \\ 4.3 & 0 & 0 & 4.9 \\ 0 & 1.4 & 5.5 & 1.8 \\ 0.1 & 0 & 0 & 2.2 \end{bmatrix}$$
Similarly, the backward pass zeros out the entries of the upstream gradient at the same positions that were
zeroed out in the forward pass.
$$\frac{dL}{dx} = \begin{bmatrix} 4.5 & 1.2 & 2.3 & 1.3 \\ -1.3 & 0 & 0 & -2.9 \\ 0 & 1.2 & 3.5 & 1.2 \\ -6.1 & 0 & 0 & -3.2 \end{bmatrix}$$
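For reference, a few lines of NumPy reproduce both passes. The upstream gradient used here is illustrative: the entries marked 9.9 sit at positions where x < 0 and are arbitrary placeholders, since they get zeroed out in the backward pass anyway.

import numpy as np

x = np.array([[ 1.5,  2.2,  1.3,  6.7],
              [ 4.3, -0.3, -0.2,  4.9],
              [-4.5,  1.4,  5.5,  1.8],
              [ 0.1, -0.5, -0.1,  2.2]])
dL_dy = np.array([[ 4.5, 1.2, 2.3,  1.3],
                  [-1.3, 9.9, 9.9, -2.9],
                  [ 9.9, 1.2, 3.5,  1.2],
                  [-6.1, 9.9, 9.9, -3.2]])   # placeholder values at masked positions

y = np.maximum(0, x)        # forward: elementwise max(0, x)
dL_dx = dL_dy * (x > 0)     # backward: pass the gradient only where x > 0
print(y)
print(dL_dx)                # matches the matrices worked out above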
1. What advantages does using ReLU activations have over sigmoid activations?
2. ReLU layers have non-negative outputs. What is a negative consequence of this property? What
layer types were developed to address this issue?
1. (1) Computing ReLU and its gradient is more computationally efficient than computing the sigmoid
and its gradient. (2) ReLU reduces the likelihood of vanishing gradients: the derivative of the sigmoid
is always less than 1 (at most 0.25), so multiplying such gradients over many layers quickly drives the
overall gradient toward 0, whereas ReLU has gradient 1 for positive inputs. However, this is less of an
advantage given that one can use Batch Normalization to keep layer inputs centered.
2. ReLU suffers from the dying ReLU problem, where a unit always outputs 0 no matter what
the input is. Once a ReLU ends up in this state, it is unlikely to recover, since its gradient
is also 0 and gradient-descent methods will not alter its weights. Layer types developed to
address this include Leaky ReLU and ELU.
• Ensembles combine several models into a single predictor. To make this faster, we can either ensemble
only the classification layers or use snapshot ensembles.
• Dropout is a method for randomly removing nodes during training, and intuitively it represents an ensemble of networks,
since each forward pass is effectively a different network.