Seminar 3: Automatic Differentiation
Forward mode
Figure 1: Illustration of the forward chain rule to calculate the derivative of the function $v_i$ with respect to $w_k$.
Reverse mode
Figure 2: Illustration of the reverse chain rule to calculate the derivative of the function $L$ with respect to the node $v_i$.
Toy example
Example
$$f(x_1, x_2) = x_1 x_2 + \sin x_1$$
Let's calculate the derivatives $\dfrac{\partial f}{\partial x_i}$ using forward and reverse modes.
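The same calculation can be reproduced with an AD library. Below is a minimal sketch, assuming a recent PyTorch with `torch.func`: reverse mode returns the whole gradient from one backward pass, while forward mode returns one directional derivative (JVP) per input direction. The numeric values should match $\partial f/\partial x_1 = x_2 + \cos x_1$ and $\partial f/\partial x_2 = x_1$.

```python
# A minimal sketch, assuming a recent PyTorch (torch.func).
import torch
from torch.func import grad, jvp

def f(x1, x2):
    return x1 * x2 + torch.sin(x1)

x1, x2 = torch.tensor(2.0), torch.tensor(3.0)

# Reverse mode: one backward pass gives the full gradient (df/dx1, df/dx2).
df_dx1, df_dx2 = grad(f, argnums=(0, 1))(x1, x2)

# Forward mode: one JVP per input direction gives one directional derivative.
_, d1 = jvp(f, (x1, x2), (torch.tensor(1.0), torch.tensor(0.0)))
_, d2 = jvp(f, (x1, x2), (torch.tensor(0.0), torch.tensor(1.0)))

print(df_dx1, df_dx2)  # x2 + cos(x1) = 3 + cos(2),  x1 = 2
print(d1, d2)          # same values, obtained one input direction at a time
```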
Example №1
$$f(X) = \mathrm{tr}(AX^{-1}B)$$
$$\nabla f = -X^{-T} A^T B^T X^{-T}$$
Example №2
$$\nabla^2 g = \|x\|_2^{-1} x x^T + \|x\|_2 I_n$$
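As a sanity check for Example №1, here is a minimal sketch, assuming PyTorch is available, that compares the closed-form gradient of $f(X) = \mathrm{tr}(AX^{-1}B)$ with the autograd result on random data.

```python
# A minimal sketch, assuming PyTorch is available.
import torch
torch.manual_seed(0)

n = 5
A, B = torch.randn(n, n), torch.randn(n, n)
X = (torch.randn(n, n) + n * torch.eye(n)).requires_grad_(True)  # keep X well-conditioned

f = torch.trace(A @ torch.linalg.inv(X) @ B)
grad_autograd, = torch.autograd.grad(f, X)

# Closed-form gradient: -X^{-T} A^T B^T X^{-T}
Xinv_T = torch.linalg.inv(X.detach()).T
grad_formula = -Xinv_T @ A.T @ B.T @ Xinv_T
print(torch.allclose(grad_autograd, grad_formula, atol=1e-5))  # expected: True
```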
Which of the AD modes (forward/reverse) would you choose for the following computational graph of primitive arithmetic operations?
Figure 4: Which mode would you choose for calculating gradients there?
Question
Find the derivatives $\dfrac{\partial L}{\partial A}$ and $\dfrac{\partial L}{\partial b}$.
Figure 5: $x$ could be found as a solution of the linear system.
$$dL = \left\langle \frac{\partial L}{\partial x}, dx \right\rangle = \left\langle \frac{\partial L}{\partial A}, dA \right\rangle + \left\langle \frac{\partial L}{\partial b}, db \right\rangle$$
Given the linear system, we have:
$$Ax = b$$
$$dA\,x + A\,dx = db \quad\rightarrow\quad dx = A^{-1}(db - dA\,x)$$
Figure 6: $x$ could be found as a solution of the linear system.
$$\left\langle \frac{\partial L}{\partial x}, A^{-1}(db - dA\,x) \right\rangle = \left\langle \frac{\partial L}{\partial A}, dA \right\rangle + \left\langle \frac{\partial L}{\partial b}, db \right\rangle$$
Moving $A^{-1}$ and $x$ to the other side of each inner product on the left:
$$\left\langle -A^{-T} \frac{\partial L}{\partial x} x^T, dA \right\rangle + \left\langle A^{-T} \frac{\partial L}{\partial x}, db \right\rangle = \left\langle \frac{\partial L}{\partial A}, dA \right\rangle + \left\langle \frac{\partial L}{\partial b}, db \right\rangle$$
Therefore:
$$\frac{\partial L}{\partial A} = -A^{-T} \frac{\partial L}{\partial x} x^T, \qquad \frac{\partial L}{\partial b} = A^{-T} \frac{\partial L}{\partial x}$$
Interestingly, the most computationally intensive part here is the matrix inverse, which is the same as for the forward pass. Sometimes it is even possible to store the result itself, which makes the backward pass even cheaper.
Figure 7: $x$ could be found as a solution of the linear system.
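A minimal sketch, assuming PyTorch is available, that solves $Ax = b$, picks an arbitrary scalar loss $L(x)$ (the cubic loss below is just an illustrative choice), and compares the autograd gradients with the closed-form expressions above.

```python
# A minimal sketch, assuming PyTorch is available.
import torch
torch.manual_seed(0)

n = 4
A = (torch.randn(n, n) + n * torch.eye(n)).requires_grad_(True)
b = torch.randn(n, requires_grad=True)

x = torch.linalg.solve(A, b)   # forward pass: x solves Ax = b
L = (x ** 3).sum()             # an arbitrary scalar loss of x (illustrative choice)
dL_dA, dL_db = torch.autograd.grad(L, (A, b))

# Closed-form expressions derived above; for this loss dL/dx = 3 x^2.
dL_dx = 3 * x.detach() ** 2
Ainv_T = torch.linalg.inv(A.detach()).T
print(torch.allclose(dL_dA, -torch.outer(Ainv_T @ dL_dx, x.detach()), atol=1e-5))  # True
print(torch.allclose(dL_db, Ainv_T @ dL_dx, atol=1e-5))                            # True
```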
Question
Suppose we have the rectangular matrix $W \in \mathbb{R}^{m \times n}$, which has a singular value decomposition:
$$W = U \Sigma V^T, \quad U^T U = I, \quad V^T V = I, \quad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_{\min(m,n)})$$
Find the derivative $\dfrac{\partial R}{\partial W}$.
1. Write down the differential of the decomposition:
$$W = U \Sigma V^T$$
$$dW = dU\, \Sigma V^T + U\, d\Sigma\, V^T + U \Sigma\, dV^T$$
$$U^T dW V = U^T dU\, \Sigma V^T V + U^T U\, d\Sigma\, V^T V + U^T U \Sigma\, dV^T V$$
$$U^T dW V = U^T dU\, \Sigma + d\Sigma + \Sigma\, dV^T V$$
2. Since $U^T U = I$, differentiating gives $dU^T U + U^T dU = 0$, i.e.
$$(U^T dU)^T + U^T dU = 0 \;\rightarrow\; \mathrm{diag}(U^T dU) = (0, \ldots, 0)$$
The same logic could be applied to the matrix $V$:
$$\mathrm{diag}(dV^T V) = (0, \ldots, 0)$$
3. At the same time, the matrix $d\Sigma$ is diagonal, which means (look at step 1) that
$$\mathrm{diag}(U^T dW V) = d\Sigma$$
Here, on both sides we have diagonal matrices.
4. Consider the differential of the loss with respect to $\Sigma$:
$$dL = \left\langle \frac{\partial L}{\partial \Sigma}, d\Sigma \right\rangle = \left\langle \frac{\partial L}{\partial \Sigma}, \mathrm{diag}(U^T dW V) \right\rangle = \mathrm{tr}\left( \left(\frac{\partial L}{\partial \Sigma}\right)^T \mathrm{diag}(U^T dW V) \right)$$
5. Since we have diagonal matrices inside the product, the trace of the diagonal part of the matrix equals the trace of the whole matrix:
$$dL = \mathrm{tr}\left( \left(\frac{\partial L}{\partial \Sigma}\right)^T \mathrm{diag}(U^T dW V) \right) = \mathrm{tr}\left( \left(\frac{\partial L}{\partial \Sigma}\right)^T U^T dW V \right) = \left\langle \frac{\partial L}{\partial \Sigma}, U^T dW V \right\rangle = \left\langle U \frac{\partial L}{\partial \Sigma} V^T, dW \right\rangle$$
Matching this with $dL = \left\langle \dfrac{\partial L}{\partial W}, dW \right\rangle$:
$$\left\langle U \frac{\partial L}{\partial \Sigma} V^T, dW \right\rangle = \left\langle \frac{\partial L}{\partial W}, dW \right\rangle \quad\rightarrow\quad \frac{\partial L}{\partial W} = U \frac{\partial L}{\partial \Sigma} V^T$$
This nice result allows us to connect the gradients $\dfrac{\partial L}{\partial W}$ and $\dfrac{\partial L}{\partial \Sigma}$.
Let's make sure numerically that we have correctly calculated the derivatives in problems 2-3.
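For the SVD problem, here is a minimal sketch, assuming PyTorch is available, that checks $\frac{\partial L}{\partial W} = U \frac{\partial L}{\partial \Sigma} V^T$ for a loss that depends on $W$ only through its singular values. The choice $L = \sum_i \sigma_i$ (so that $\partial L/\partial \Sigma = I$) is just an illustrative example.

```python
# A minimal sketch, assuming PyTorch is available.
import torch
torch.manual_seed(0)

m, n = 6, 4
W = torch.randn(m, n, requires_grad=True)

L = torch.linalg.svdvals(W).sum()          # L depends on W only through its singular values
grad_autograd, = torch.autograd.grad(L, W)

U, S, Vh = torch.linalg.svd(W.detach(), full_matrices=False)
grad_formula = U @ torch.eye(n) @ Vh       # U (dL/dSigma) V^T with dL/dSigma = I
print(torch.allclose(grad_autograd, grad_formula, atol=1e-5))  # expected: True
```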
Gradient checkpointing

Feedforward Architecture
Figure 9: Computation graph (forward and backward passes) for obtaining gradients for a simple feed-forward neural network with $n$ layers. The activations are marked with an $f$. The gradients of the loss with respect to the activations and parameters are marked with $b$.
Important
The results obtained for the $f$ nodes are needed to compute the $b$ nodes.
Vanilla backpropagation
Figure 10: Computation graph for obtaining gradients for a simple feed-forward neural network with $n$ layers. The purple color indicates nodes that are stored in memory.
• High memory usage. The memory usage grows linearly with the number of layers in the neural network.
Memory-poor backpropagation
Figure 11: Computation graph for obtaining gradients for a simple feed-forward neural network with $n$ layers. The purple color indicates nodes that are stored in memory.
• Computationally inefficient. The number of node evaluations scales with $n^2$, whereas vanilla backprop scaled as $n$: each of the $n$ nodes is recomputed on the order of $n$ times.
Checkpointed backpropagation
Figure 12: Computation graph for obtaining gradients for a simple feed-forward neural network with $n$ layers. The purple color indicates nodes that are stored in memory (checkpoints).
• Trade-off between the vanilla and memory-poor approaches. The strategy is to mark a subset of the neural net activations as checkpoint nodes, which will be stored in memory.
• Faster recalculation of activations $f$. We only need to recompute the nodes between a $b$ node and the last checkpoint preceding it when computing that $b$ node during backprop.
• Memory consumption depends on the number of checkpoints. More effective than the vanilla approach (a minimal code sketch follows below).
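A minimal sketch of checkpointed backprop for a simple feed-forward network, assuming PyTorch is available and using `torch.utils.checkpoint.checkpoint_sequential`. The layer sizes, the number of blocks, and the number of segments are arbitrary illustrative choices.

```python
# A minimal sketch, assuming PyTorch is available.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy feed-forward network of 16 identical blocks (sizes are illustrative).
layers = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(16)])
x = torch.randn(32, 512, requires_grad=True)

# Split the network into 4 segments; only the activations at segment boundaries
# (the checkpoints) are kept in memory, the rest are recomputed during backward.
out = checkpoint_sequential(layers, 4, x)
out.sum().backward()
```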
Gradient checkpointing visualization
Hutchinson Trace Estimation
This example illustrates the estimation of the Hessian trace of a neural network using Hutchinson's method, which is an algorithm to obtain such an estimate from matrix-vector products:
Let $X \in \mathbb{R}^{d \times d}$ and let $v \in \mathbb{R}^d$ be a random vector such that $\mathbb{E}[vv^T] = I$. Then,
$$\mathrm{Tr}(X) = \mathbb{E}[v^T X v] \approx \frac{1}{V} \sum_{i=1}^{V} v_i^T X v_i.$$
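A minimal sketch, assuming PyTorch is available, of the Hutchinson estimator applied to a Hessian: the Hessian is never formed explicitly, only Hessian-vector products via double backprop. The quadratic test loss and the sample count are illustrative choices.

```python
# A minimal sketch, assuming PyTorch is available.
import torch

def hutchinson_hessian_trace(loss, params, num_samples=200):
    """Estimate tr(H) of `loss` w.r.t. `params` from Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(num_samples):
        # Rademacher probe vectors v with E[v v^T] = I
        vs = [(torch.rand_like(p) < 0.5).to(p.dtype) * 2 - 1 for p in params]
        # Hessian-vector products via a second backward pass
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        estimate += sum((v * hvp).sum() for v, hvp in zip(vs, hvps)).item()
    return estimate / num_samples

# Sanity check on a quadratic loss 0.5 x^T H x, whose Hessian is exactly H.
H = torch.randn(10, 10)
H = H @ H.T
x = torch.randn(10, requires_grad=True)
loss = 0.5 * x @ H @ x
print(hutchinson_hessian_trace(loss, [x]), torch.trace(H).item())  # approximately equal
```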