Lecture 5-6
Gradient of the loss $\ell$ with respect to the weights $\mathbf{W}^t$ of layer $t$ in a $d$-layer network:

$$\frac{\partial \ell}{\partial \mathbf{W}^t} = \frac{\partial \ell}{\partial \mathbf{h}^d} \cdot \underbrace{\frac{\partial \mathbf{h}^d}{\partial \mathbf{h}^{d-1}} \cdots \frac{\partial \mathbf{h}^{t+1}}{\partial \mathbf{h}^t}}_{\text{multiplication of } d-t \text{ matrices}} \cdot \frac{\partial \mathbf{h}^t}{\partial \mathbf{W}^t}$$
Two Issues for Deep Neural Networks

The troublesome factor is the product of $d-t$ Jacobian matrices:

$$\prod_{i=t}^{d-1} \frac{\partial \mathbf{h}^{i+1}}{\partial \mathbf{h}^{i}}$$

Depending on the size of its factors, this product can explode or vanish, as illustrated in the sketch below.
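A small numeric illustration (mine, not from the slides): multiplying many random matrices whose entries are scaled slightly down or up makes the product's norm collapse or blow up.

import numpy as np

rng = np.random.default_rng(0)

def product_norm(scale, depth=50, dim=4):
    # multiply `depth` random dim x dim matrices; return the product's norm
    P = np.eye(dim)
    for _ in range(depth):
        P = P @ (scale * rng.standard_normal((dim, dim)))
    return np.linalg.norm(P)

print(product_norm(scale=0.2))   # shrinks toward 0: the vanishing regime
print(product_norm(scale=2.0))   # blows up: the exploding regime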
With ReLU as the activation function:

$$\sigma(x) = \max(0, x) \quad \text{and} \quad \sigma'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
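These two formulas translate directly to numpy (an illustrative sketch; the function names are mine):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # 1 where x > 0, 0 elsewhere
    return (x > 0).astype(float)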
Gradient Exploding

• Elements of $\prod_{i=t}^{d-1} \frac{\partial \mathbf{h}^{i+1}}{\partial \mathbf{h}^{i}} = \prod_{i=t}^{d-1} \mathrm{diag}\!\left(\sigma'(\mathbf{W}^i \mathbf{h}^{i-1})\right) (\mathbf{W}^i)^T$ may come from $\prod_{i=t}^{d-1} (\mathbf{W}^i)^T$, since with ReLU each entry of $\mathrm{diag}(\sigma'(\cdot))$ is either 0 or 1. If $d-t$ is large, this product of weight matrices can produce very large values: the gradient explodes.
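A quick sketch of this effect (my illustration, not from the slides): with ReLU, the diag factors are 0/1 masks, so the product is dominated by the weight matrices, and weights drawn only slightly too large make it grow fast.

import numpy as np

rng = np.random.default_rng(1)
dim, depth = 8, 30

P = np.eye(dim)
for _ in range(depth):
    W = rng.normal(0.0, 2.0 / np.sqrt(dim), (dim, dim))  # std a bit larger than needed
    relu_mask = (rng.random(dim) > 0.2).astype(float)    # stand-in for diag(sigma'): 0/1 entries
    P = np.diag(relu_mask) @ W.T @ P

print(np.linalg.norm(P))   # grows roughly exponentially with depth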
Gradient Vanishing

• Elements of $\prod_{i=t}^{d-1} \frac{\partial \mathbf{h}^{i+1}}{\partial \mathbf{h}^{i}} = \prod_{i=t}^{d-1} \mathrm{diag}\!\left(\sigma'(\mathbf{W}^i \mathbf{h}^{i-1})\right) (\mathbf{W}^i)^T$ are products of $d-t$ small values when $\sigma'$ is small, as with squashing activations such as the sigmoid, whose derivative never exceeds 1/4. For example,

$$0.8^{100} \approx 2 \times 10^{-10}$$
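The arithmetic is easy to check:

print(0.8 ** 100)   # about 2e-10: a hundred mildly small factors all but erase the gradient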
Issues with Gradient Vanishing

When gradients vanish, training makes little progress: the layers closest to the input receive almost no learning signal, which limits how deep a network can effectively be trained.
[Figure: data complexity axis, ranging from simple to complex]
Weight Decay

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \; J(\mathbf{w}; \mathbf{X}, \mathbf{y}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$$
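The slides don't show it, but in PyTorch this penalty is usually applied through the optimizer's weight_decay argument rather than by adding the term to the loss; a minimal sketch:

import torch
from torch import nn

net = nn.Linear(10, 1)
# weight_decay adds lambda * w to each parameter's gradient, which is
# exactly the gradient of the (lambda / 2) * ||w||^2 penalty above
trainer = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-3)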
Dropout
Motivation
A good model should be robust to perturbations; dropout applies this idea to the hidden units by injecting noise into them during training.
$$
\begin{aligned}
\mathbf{h} &= \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) \\
\mathbf{h}' &= \mathrm{dropout}(\mathbf{h}) \\
\mathbf{o} &= \mathbf{W}_2 \mathbf{h}' + \mathbf{b}_2 \\
\mathbf{y} &= \mathrm{softmax}(\mathbf{o})
\end{aligned}
$$
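A minimal PyTorch rendering of these four lines (the layer sizes and class name are my assumptions; the slides don't specify them):

import torch
from torch import nn

class DropoutMLP(nn.Module):
    def __init__(self, num_inputs=784, num_hiddens=256, num_outputs=10, p=0.5):
        super().__init__()
        self.lin1 = nn.Linear(num_inputs, num_hiddens)
        self.lin2 = nn.Linear(num_hiddens, num_outputs)
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        h = torch.relu(self.lin1(x))    # h = sigma(W1 x + b1)
        h_prime = self.dropout(h)       # h' = dropout(h)
        o = self.lin2(h_prime)          # o = W2 h' + b2
        return torch.softmax(o, dim=1)  # y = softmax(o)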
Add Noise without Bias
Dropout replaces each activation $x_i$ with a random variable $x_i'$ that is unbiased, $E[x'] = x$:

$$x_i' = \begin{cases} 0 & \text{with probability } p \\[4pt] \dfrac{x_i}{1-p} & \text{otherwise} \end{cases}$$

Check: $E[x_i'] = p \cdot 0 + (1-p) \cdot \dfrac{x_i}{1-p} = x_i$.
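An empirical check of the unbiasedness claim (illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x, p, n = 3.0, 0.5, 1_000_000
keep = rng.random(n) > p                      # keep with probability 1 - p
x_prime = np.where(keep, x / (1 - p), 0.0)    # the two cases above
print(x_prime.mean())                         # ~3.0, matching E[x'] = x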
Dropout using numpy?
Let's do it in a Jupyter notebook.
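One way such a notebook cell might look (a sketch; the name dropout_layer_np is mine):

import numpy as np

def dropout_layer_np(X, dropout):
    # zero each element with probability `dropout`; rescale survivors
    assert 0 <= dropout <= 1
    if dropout == 1:
        return np.zeros(X.shape)
    if dropout == 0:
        return X
    mask = (np.random.rand(*X.shape) > dropout).astype(X.dtype)
    return mask * X / (1.0 - dropout)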
Dropout using pytorch tensors
Code difference - numpy vs pytorch

# in pytorch
assert 0 <= dropout <= 1
# Here X is a tensor; torch.zeros_like takes the tensor itself, not X.shape
if dropout == 1:
    return torch.zeros_like(X)

# in numpy
# Here X.shape is a tuple, which is what np.zeros expects
if dropout == 1:
    return np.zeros(X.shape)
Dropout from scratch – using dropout_layer func
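A sketch of such a dropout_layer function, consistent with the numpy/pytorch notes above (it may differ in detail from the notebook's version):

import torch

def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    if dropout == 1:                 # drop everything
        return torch.zeros_like(X)
    if dropout == 0:                 # keep everything
        return X
    # keep each element with probability 1 - dropout, then rescale so E[x'] = x
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)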
Dropout in Inference

$$\mathbf{h}' = \mathrm{dropout}(\mathbf{h}) = \mathbf{h}$$

At inference time dropout simply returns its input: nothing is dropped, so predictions are deterministic.
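In PyTorch, nn.Dropout makes this switch automatically through the train()/eval() modes:

import torch
from torch import nn

layer = nn.Dropout(p=0.5)
x = torch.ones(8)

layer.train()        # training mode: about half the entries are zeroed, the rest doubled
print(layer(x))
layer.eval()         # evaluation mode: dropout is the identity
print(layer(x))      # x unchanged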