Gradient Notes
Christopher Yeh
1 Jacobian
Consider a vector-valued function f : Rn → Rm . Then the Jacobian is the matrix
$$
J = \frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$
where element-wise $J_{ij} = \frac{\partial f_i}{\partial x_j}$.
If f : Rn → R is a scalar-valued function with vector inputs, then its gradient is just a special
case of the Jacobian with shape 1 × n.
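As a quick sanity check of the definition, here is a minimal NumPy sketch that approximates a Jacobian with central finite differences and compares it against the analytic matrix (the helper name `numerical_jacobian` and the example function are illustrative, not from the notes):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of f: R^n -> R^m at x.

    Returns a matrix of shape (m, n), matching J_ij = df_i / dx_j.
    """
    x = np.asarray(x, dtype=float)
    m = f(x).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# Example: f(x) = (x0 * x1, sin(x0)) has analytic Jacobian
#   [[x1,      x0],
#    [cos(x0),  0]]
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x = np.array([2.0, 3.0])
J = numerical_jacobian(f, x)
J_analytic = np.array([[3.0, 2.0], [np.cos(2.0), 0.0]])
```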
Consider the cross-entropy loss applied to a softmax output $\hat{y} = \mathrm{softmax}(\theta)$, where $y$ is a one-hot label vector with $y_k = 1$ for the true class $k$, so the loss is a scalar-valued function of $\theta$.
$$
J(\theta) = \mathrm{CE}(y, \hat{y})
= -\sum_c y_c \log \hat{y}_c
= -\log \hat{y}_k
= -\log\!\left(\frac{e^{\theta_k}}{\sum_c e^{\theta_c}}\right)
= \log \sum_c e^{\theta_c} - \theta_k
$$
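The equivalence between the $-\log \hat{y}_k$ form and the log-sum-exp form is easy to confirm numerically; a small sketch (the specific $\theta$ and class index are arbitrary examples):

```python
import numpy as np

theta = np.array([1.0, 2.0, 0.5])
k = 1  # index of the true class (y is one-hot with y_k = 1)

yhat = np.exp(theta) / np.exp(theta).sum()         # softmax
ce = -np.log(yhat[k])                              # -log yhat_k
lse_form = np.log(np.exp(theta).sum()) - theta[k]  # log-sum-exp minus theta_k
```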
$$
\frac{\partial J}{\partial W} =
\begin{bmatrix}
\frac{\partial J}{\partial W_{1,1}} & \cdots & \frac{\partial J}{\partial W_{1,m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial J}{\partial W_{n,1}} & \cdots & \frac{\partial J}{\partial W_{n,m}}
\end{bmatrix}
$$
Clearly, $\frac{\partial W_{ij}}{\partial W_{kl}} = 1$ only when $i = k$ and $j = l$, and $0$ otherwise. Thus
$$
\frac{\partial z_k}{\partial W_{ij}} = \mathbb{1}[k = i]\, x_j.
$$
Another way we can write this is
$$
\frac{\partial z}{\partial W_{ij}} =
\begin{bmatrix}
0 \\ \vdots \\ 0 \\ x_j \\ 0 \\ \vdots \\ 0
\end{bmatrix}
\leftarrow i\text{th element}
$$
By the chain rule,
$$
\frac{\partial J}{\partial W_{ij}} = \sum_k \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial W_{ij}} = r_i\, x_j
$$
where the summation comes from the Chain Rule: every change to $W_{ij}$ influences each $z_k$, which in turn influences $J$, so the total effect of $W_{ij}$ on $J$ is the sum of the influences of each $z_k$ on $J$.
Thus the full matrix $\frac{\partial J}{\partial W}$ is the outer product
$$
\frac{\partial J}{\partial W} = r^T x^T
$$
(recall that $r$ is a row vector).
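A quick numerical check of the outer-product form, using a concrete loss $J(z) = \frac{1}{2}\lVert z \rVert^2$ so that $r = \partial J / \partial z = z^T$ (the loss choice and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)          # column vector x in R^3

def J(W):
    z = W @ x
    return 0.5 * (z ** 2).sum()     # scalar loss; here r = dJ/dz = z^T

z = W @ x
r = z.reshape(1, -1)                # row vector r = dJ/dz, shape (1, 4)
analytic = r.T @ x.reshape(1, -1)   # outer product r^T x^T, shape (4, 3)

# central-difference check of dJ/dW, one entry at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)
```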
Thus the full matrix $\frac{\partial J}{\partial W}$ is the outer product
$$
\frac{\partial J}{\partial W} = x^T r
$$
(recall that both $x$ and $r$ are row vectors).
Let $B = XY$ for matrices $X$ and $Y$, and let $A$ be a scalar function of $B$. Now we can use the (multi-path) chain rule to take the derivative of $A$ w.r.t. each element of $X$ and $Y$.
$$
\frac{\partial A}{\partial X_{i,j}}
= \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial X_{i,j}}
= \sum_{k,l} \frac{\partial A}{\partial B_{k,l}}\, \mathbb{1}[k = i]\, Y_{j,l}
= \sum_l \frac{\partial A}{\partial B_{i,l}} Y_{j,l}
= \frac{\partial A}{\partial B_{i,:}} \cdot Y_{j,:}
$$
$$
\frac{\partial A}{\partial Y_{i,j}}
= \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial Y_{i,j}}
= \sum_{k,l} \frac{\partial A}{\partial B_{k,l}}\, \mathbb{1}[l = j]\, X_{k,i}
= \sum_k \frac{\partial A}{\partial B_{k,j}} X_{k,i}
= \frac{\partial A}{\partial B_{:,j}} \cdot X_{:,i}
$$
$$
\frac{\partial A}{\partial X} = \frac{\partial A}{\partial B} \cdot Y^T
\qquad \text{and} \qquad
\frac{\partial A}{\partial Y} = X^T \cdot \frac{\partial A}{\partial B}
$$
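These matrix forms can be checked numerically with any concrete scalar function of $B$; a small sketch using $A(B) = \sum_{ij} B_{ij}^2$, whose gradient $\partial A/\partial B = 2B$ is known in closed form (the function choice and shapes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 3))
Y = rng.standard_normal((3, 4))

def A(B):
    return (B ** 2).sum()           # scalar function of B; dA/dB = 2B

B = X @ Y
dA_dB = 2 * B
dA_dX = dA_dB @ Y.T                 # dA/dX = dA/dB . Y^T, shape (2, 3)
dA_dY = X.T @ dA_dB                 # dA/dY = X^T . dA/dB, shape (3, 4)

# central-difference check of dA/dX
eps = 1e-6
num_dX = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        dX = np.zeros_like(X)
        dX[i, j] = eps
        num_dX[i, j] = (A((X + dX) @ Y) - A((X - dX) @ Y)) / (2 * eps)
```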
6 Scalar Function of Matrix-Vector Broadcast Sum w.r.t. Vector
Let $A$ be a scalar that is a function of a matrix $B \in \mathbb{R}^{n \times m}$, and suppose $\frac{\partial A}{\partial B}$ is known. Let $B = X + y$ be some broadcasted sum between a matrix $X$ and a row-vector $y \in \mathbb{R}^{1 \times m}$. We want to find $\frac{\partial A}{\partial y}$.
Intuitively, notice that any change in y directly and linearly affects every row of B. Each
row of B in turn affects A. Therefore,
$$
\frac{\partial A}{\partial y}
= \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y}
= \sum_i \frac{\partial A}{\partial B_i} \cdot I
= \sum_i \frac{\partial A}{\partial B_i}
$$
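In NumPy terms this is just a sum of $\partial A/\partial B$ over its rows, which can be verified against finite differences; a minimal sketch with an arbitrary choice $A(B) = \sum_{ij} \sin B_{ij}$, so $\partial A/\partial B = \cos B$ (the function and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))
y = rng.standard_normal((1, 3))           # row vector, broadcast over rows of X

def A(B):
    return np.sin(B).sum()                # scalar function; dA/dB = cos(B)

B = X + y                                 # broadcasted sum
dA_dB = np.cos(B)
dA_dy = dA_dB.sum(axis=0, keepdims=True)  # sum of the rows of dA/dB

# central-difference check of dA/dy
eps = 1e-6
num_dy = np.zeros_like(y)
for j in range(y.shape[1]):
    dy = np.zeros_like(y)
    dy[0, j] = eps
    num_dy[0, j] = (A(X + (y + dy)) - A(X + (y - dy))) / (2 * eps)
```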
Now suppose instead that $B = X \odot y$ is a broadcasted elementwise product between a matrix $X \in \mathbb{R}^{n \times m}$ and a row-vector $y \in \mathbb{R}^{1 \times m}$. We first find $\frac{\partial A}{\partial y}$. Intuitively, any change in $y$ directly affects every row of $B$ by a factor of the same row in $X$. Each row of $B$ in turn affects $A$. Therefore,
$$
\frac{\partial A}{\partial y}
= \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y}
= \sum_i \frac{\partial A}{\partial B_i} \odot X_i
$$
Next we find $\frac{\partial A}{\partial X}$. We can find this element-wise, then compose the entire gradient. Note that changing $X_{ij}$ only affects $B_{ij}$, by a scale of $y_j$. No other indices in $B$ are affected.
$$
\frac{\partial A}{\partial X_{ij}} = \frac{\partial A}{\partial B_{ij}}\, y_j
$$
$$
\frac{\partial A}{\partial X} = y \odot \frac{\partial A}{\partial B}
$$
where the product broadcasts $y$ across the rows of $\frac{\partial A}{\partial B}$.
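Both broadcast-product gradients can be checked the same way as the broadcast sum; a minimal sketch, again using the arbitrary choice $A(B) = \sum_{ij} \sin B_{ij}$ so that $\partial A/\partial B = \cos B$ (function choice and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 3))
y = rng.standard_normal((1, 3))     # row vector, broadcast over rows of X

def A(B):
    return np.sin(B).sum()          # scalar function; dA/dB = cos(B)

B = X * y                           # broadcasted elementwise product
dA_dB = np.cos(B)
dA_dy = (dA_dB * X).sum(axis=0, keepdims=True)  # sum_i (dA/dB)_i ⊙ X_i
dA_dX = y * dA_dB                   # y broadcast elementwise against dA/dB

# central-difference check of dA/dy
eps = 1e-6
num_dy = np.zeros_like(y)
for j in range(y.shape[1]):
    dy = np.zeros_like(y)
    dy[0, j] = eps
    num_dy[0, j] = (A(X * (y + dy)) - A(X * (y - dy))) / (2 * eps)
```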