
Gradient Notes

Christopher Yeh

June 20, 2024

1 Jacobian
Consider a vector-valued function f : R^n → R^m. Then the Jacobian is the matrix

$$
J = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}
  = \begin{bmatrix}
      \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
      \vdots & \ddots & \vdots \\
      \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
    \end{bmatrix}
$$

where element-wise J_ij = ∂f_i/∂x_j.
If f : R^n → R is a scalar-valued function with vector inputs, then its gradient is just a special case of the Jacobian, with shape 1 × n.
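As a quick sanity check, the definition is easy to verify numerically. Below is a minimal finite-difference Jacobian sketch in NumPy; the helper name numerical_jacobian and the example function f are arbitrary illustrative choices, not anything defined above.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f : R^n -> R^m at x by central differences."""
    x = np.asarray(x, dtype=float)
    m = f(x).shape[0]
    J = np.zeros((m, x.shape[0]))
    for j in range(x.shape[0]):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# Example: f(x) = [x0 * x1, sin(x0)] has Jacobian [[x1, x0], [cos(x0), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x = np.array([0.5, 2.0])
print(numerical_jacobian(f, x))                          # numerical estimate
print(np.array([[x[1], x[0]], [np.cos(x[0]), 0.0]]))     # analytic Jacobian
```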

2 Softmax Cross-Entropy Loss w.r.t. Logits


We want to compute the gradient for the cross-entropy loss J ∈ R between our predicted
softmax probabilities ŷ and the true one-hot probabilities y. Both y and ŷ are vectors of the
same length. They can be either row or column vectors; the result is the same.
We are given the following:
1. ŷ = softmax(θ)
2. y is a one-hot vector, where y_k = 1 and y_c = 0 for all c ≠ k
3. y, ŷ, θ ∈ R^n
The cross-entropy loss J is computed as follows. The second line expands out the softmax function.
$$
\begin{aligned}
J(\theta) = \mathrm{CE}(y, \hat{y}) &= -\sum_c y_c \log \hat{y}_c = -\log \hat{y}_k \\
&= -\log\left(\frac{e^{\theta_k}}{\sum_c e^{\theta_c}}\right) = \log \sum_c e^{\theta_c} - \theta_k
\end{aligned}
$$

The gradient of the loss w.r.t. the logits θ is


$$
\frac{\partial J}{\partial \theta_i} = \frac{e^{\theta_i}}{\sum_c e^{\theta_c}} - \mathbb{1}[i = k] = \hat{y}_i - y_i
\quad\Longrightarrow\quad
\nabla_\theta J = \hat{y} - y
$$
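As a sanity check, the claimed gradient ŷ − y can be compared against finite differences. The sketch below assumes NumPy; the helper names softmax and cross_entropy and the dimension 5 are arbitrary illustrative choices.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())    # shift by max for numerical stability
    return e / e.sum()

def cross_entropy(theta, k):
    return -np.log(softmax(theta)[k])  # J(theta) = -log(softmax(theta)_k)

theta = np.random.randn(5)
k = 2
y = np.eye(5)[k]                       # one-hot true label
analytic = softmax(theta) - y          # claimed gradient: y_hat - y

# Central-difference estimate of dJ/dtheta_i.
eps = 1e-6
numeric = np.array([
    (cross_entropy(theta + eps * np.eye(5)[i], k)
     - cross_entropy(theta - eps * np.eye(5)[i], k)) / (2 * eps)
    for i in range(5)
])
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-9 or smaller
```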

3 Matrix times column vector w.r.t. matrix


Given z = Wx and r = ∂J/∂z, what is ∂J/∂W?
1. z ∈ R^n and x ∈ R^m are column vectors
2. W ∈ R^{n×m} is a matrix
3. J ∈ R is some scalar function of z
4. r ∈ R^{1×n} is the Jacobian of J w.r.t. z
Note on notation: Technically, J is a scalar-valued function taking nm inputs (the entries of W). This means the Jacobian ∂J/∂W should be a 1 × nm vector, which isn't very useful. Instead, we will let ∂J/∂W be an n × m matrix where (∂J/∂W)_ij = ∂J/∂W_ij:

$$
\frac{\partial J}{\partial W} =
\begin{bmatrix}
  \frac{\partial J}{\partial W_{1,1}} & \cdots & \frac{\partial J}{\partial W_{1,m}} \\
  \vdots & \ddots & \vdots \\
  \frac{\partial J}{\partial W_{n,1}} & \cdots & \frac{\partial J}{\partial W_{n,m}}
\end{bmatrix}
$$

Naively, we can write


$$
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W} = r\,\frac{\partial z}{\partial W}
$$
However, it is unclear how to derive ∂z/∂W, since this is the gradient of a vector w.r.t. a matrix. This gradient would have to be 3-dimensional, and multiplying the vector r by this 3-D tensor is not well-defined. Thus, we instead have to take the element-wise derivative ∂J/∂W_ij.
Note that z_k is the dot product between the k-th row of W and x.

$$
z_k = \sum_{l=1}^{m} W_{kl} x_l
\qquad\Longrightarrow\qquad
\frac{\partial}{\partial W_{ij}} z_k = \sum_{l=1}^{m} x_l \frac{\partial}{\partial W_{ij}} W_{kl}
$$

Clearly, ∂W_kl/∂W_ij = 1 only when i = k and j = l, and 0 otherwise. Thus, ∂z_k/∂W_ij = 1[k = i] x_j.
Another way we can write this is

$$
\frac{\partial z}{\partial W_{ij}} =
\begin{bmatrix} 0 \\ \vdots \\ 0 \\ x_j \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\quad \text{($x_j$ in the $i$-th position)}
$$

Now we can compute


$$
\frac{\partial J}{\partial W_{ij}} = \sum_k \frac{\partial J}{\partial z_k}\frac{\partial z_k}{\partial W_{ij}} = \sum_k r_k\, \mathbb{1}[k = i]\, x_j = r_i x_j
$$

where the summation comes from the Chain Rule. (Every change to W_ij influences each z_k, which in turn influences J, so the total effect of W_ij on J is the sum of the influences of each z_k on J.)
Thus the full matrix ∂J/∂W is the outer product ∂J/∂W = r^T x^T (recall that r is a row vector).

4 Row vector times matrix w.r.t. matrix


The problem setup is basically the same as the previous case, except with row vectors instead
of column vectors.
Given z = xW and r = ∂J/∂z, what is ∂J/∂W?
1. z ∈ R^{1×n} and x ∈ R^{1×m} are row vectors
2. W ∈ R^{m×n} is a matrix
3. J ∈ R is some scalar function of z
4. r ∈ R^{1×n} is the Jacobian of J w.r.t. z
Similar to the previous case, we have
$$
z_l = \sum_{k=1}^{m} x_k W_{kl}
\qquad\Longrightarrow\qquad
\frac{\partial}{\partial W_{ij}} z_l = \sum_{k=1}^{m} x_k \frac{\partial}{\partial W_{ij}} W_{kl} = \mathbb{1}[j = l]\, x_i
$$

Now we can compute


$$
\frac{\partial J}{\partial W_{ij}} = \sum_l \frac{\partial J}{\partial z_l}\frac{\partial z_l}{\partial W_{ij}} = \sum_l r_l\, \mathbb{1}[j = l]\, x_i = x_i r_j
$$

Thus the full matrix ∂J/∂W is the outer product ∂J/∂W = x^T r (recall that both x and r are row vectors).
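The same kind of check works here (again a sketch with the arbitrary loss J(z) = Σ_l z_l², so r = 2z):

```python
import numpy as np

m, n = 4, 3
W = np.random.randn(m, n)
x = np.random.randn(1, m)              # row vector

J = lambda W_: np.sum((x @ W_) ** 2)   # arbitrary scalar loss J(z) = sum(z^2)
r = 2 * (x @ W)                        # dJ/dz, shape (1, n)
analytic = x.T @ r                     # claimed: dJ/dW = x^T r, shape (m, n)

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-7 or smaller
```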

5 Scalar Function of Matrix Multiplication w.r.t. Matrix
Let B = XY be some matrix multiplication. Let A be a scalar that is a function of B, where ∂A/∂B is known. We want to find ∂A/∂X and ∂A/∂Y.
1. Let X ∈ R^{n×m} and Y ∈ R^{m×p}.
2. This means that B, ∂A/∂B ∈ R^{n×p}.
Note on notation: Technically, A is a scalar-valued function taking np inputs (the entries of B). This means the Jacobian ∂A/∂B should be a 1 × np vector, which isn't very useful. Instead, we will let ∂A/∂B be an n × p matrix where (∂A/∂B)_ij = ∂A/∂B_ij. We define ∂A/∂X and ∂A/∂Y similarly.
Naively, we can write ∂A/∂X = (∂A/∂B)(∂B/∂X). However, it is unclear how to derive ∂B/∂X, since this is the gradient of a matrix w.r.t. another matrix. This gradient would have to be 4-dimensional, and multiplying the matrix ∂A/∂B by this 4-D tensor is not well-defined. Thus, we instead take the element-wise derivative ∂A/∂X_ij.
First, we compute the derivatives for each element of B w.r.t. each element of X and Y .
$$
\frac{\partial}{\partial X_{i,j}} B_{k,l} = \frac{\partial}{\partial X_{i,j}} \left( X_{k,:} \cdot Y_{:,l} \right) = \mathbb{1}[k = i]\, Y_{j,l}
$$
$$
\frac{\partial}{\partial Y_{i,j}} B_{k,l} = \frac{\partial}{\partial Y_{i,j}} \left( X_{k,:} \cdot Y_{:,l} \right) = \mathbb{1}[l = j]\, X_{k,i}
$$

Now we can use the (multi-path) chain rule to take the derivative of A w.r.t. each element
of X and Y .
 
$$
\frac{\partial A}{\partial X_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial X_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}}\, \mathbb{1}[k = i]\, Y_{j,l} = \sum_{l} \frac{\partial A}{\partial B_{i,l}} Y_{j,l} = \left(\frac{\partial A}{\partial B}\right)_{i,:} \cdot Y_{j,:}
$$
$$
\frac{\partial A}{\partial Y_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial Y_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}}\, \mathbb{1}[l = j]\, X_{k,i} = \sum_{k} \frac{\partial A}{\partial B_{k,j}} X_{k,i} = \left(\frac{\partial A}{\partial B}\right)_{:,j} \cdot X_{:,i}
$$

Combining these element-wise derivatives yields the matrix equations

$$
\frac{\partial A}{\partial X} = \frac{\partial A}{\partial B} \cdot Y^T
\qquad\text{and}\qquad
\frac{\partial A}{\partial Y} = X^T \cdot \frac{\partial A}{\partial B}
$$
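These two identities can be verified numerically. The sketch below picks the arbitrary scalar A(B) = Σ C ⊙ B for a fixed matrix C, so that ∂A/∂B = C is known exactly; the helper name finite_diff is an illustrative choice.

```python
import numpy as np

def finite_diff(f, M, eps=1e-6):
    """Element-wise central-difference gradient of a scalar function f w.r.t. matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        dM = np.zeros_like(M)
        dM[idx] = eps
        G[idx] = (f(M + dM) - f(M - dM)) / (2 * eps)
    return G

n, m, p = 3, 4, 2
X, Y, C = np.random.randn(n, m), np.random.randn(m, p), np.random.randn(n, p)

A = lambda B: np.sum(C * B)            # arbitrary scalar with dA/dB = C
dA_dB = C

print(np.allclose(dA_dB @ Y.T, finite_diff(lambda X_: A(X_ @ Y), X)))  # dA/dX = dA/dB · Y^T
print(np.allclose(X.T @ dA_dB, finite_diff(lambda Y_: A(X @ Y_), Y)))  # dA/dY = X^T · dA/dB
```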

6 Scalar Function of Matrix-Vector Broadcast Sum w.r.t. Vector
Let A be a scalar that is a function of a matrix B ∈ R^{n×m}, and suppose ∂A/∂B is known. Let B = X + y be some broadcasted sum between a matrix X and a row vector y ∈ R^{1×m}. We want to find ∂A/∂y.
Intuitively, notice that any change in y directly and linearly affects every row of B. Each
row of B in turn affects A. Therefore,
$$
\frac{\partial A}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \cdot I = \sum_i \frac{\partial A}{\partial B_i}
$$

where B_i is the i-th row of B.


Alternatively, we can write this broadcasted sum properly as
$$
B = X + \mathbf{1} \cdot y
$$
where 1 is an n-dimensional column vector of ones. Then we can use the gradient rules derived earlier to get the equivalent result.
$$
\frac{\partial A}{\partial y} = \mathbf{1}^T \cdot \frac{\partial A}{\partial B} = \sum_i \frac{\partial A}{\partial B_i}
$$
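A quick numerical check of the broadcast-sum rule, again using the arbitrary scalar A(B) = Σ C ⊙ B so that ∂A/∂B = C:

```python
import numpy as np

n, m = 3, 4
X, C = np.random.randn(n, m), np.random.randn(n, m)
y = np.random.randn(1, m)              # row vector to be broadcast

A = lambda y_: np.sum(C * (X + y_))    # NumPy broadcasts y_ over every row of X
dA_dB = C

analytic = np.ones((1, n)) @ dA_dB     # 1^T · dA/dB, i.e. the sum of the rows of dA/dB

eps = 1e-6
numeric = np.array([[(A(y + eps * np.eye(m)[j]) - A(y - eps * np.eye(m)[j])) / (2 * eps)
                     for j in range(m)]])
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-9 or smaller
```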

7 Scalar Function of Matrix-Vector Broadcast Product
Let A be a scalar that is a function of a matrix B ∈ R^{n×m}, and suppose ∂A/∂B is known. Let B = y ⊙ X be a broadcasted Hadamard (element-wise) product between a row vector y ∈ R^{1×m} and a matrix X. In other words, the i-th row of B is computed by the Hadamard product B_i = y ⊙ X_i. We want to find ∂A/∂y and ∂A/∂X.

We first find ∂A/∂y. Intuitively, any change in y directly affects every row of B by a factor of the same row in X. Each row of B in turn affects A. Therefore,
$$
\frac{\partial A}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \odot X_i
$$

Next we find ∂A/∂X. We can find this element-wise, then compose the entire gradient. Note that changing X_ij only affects B_ij, by a scale of y_j. No other indices in B are affected.

$$
\frac{\partial A}{\partial X_{ij}} = y_j\, \frac{\partial A}{\partial B_{ij}}
\qquad\Longrightarrow\qquad
\frac{\partial A}{\partial X} = y \odot \frac{\partial A}{\partial B}
$$
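The same style of check for the broadcast product, with the arbitrary scalar A(B) = Σ C ⊙ B so that ∂A/∂B = C:

```python
import numpy as np

n, m = 3, 4
X, C = np.random.randn(n, m), np.random.randn(n, m)
y = np.random.randn(1, m)                       # row vector

A = lambda y_, X_: np.sum(C * (y_ * X_))        # B = y ⊙ X broadcast row-wise, dA/dB = C

dA_dy = np.sum(C * X, axis=0, keepdims=True)    # claimed: sum_i (dA/dB_i ⊙ X_i)
dA_dX = y * C                                   # claimed: y ⊙ dA/dB

eps = 1e-6
num_dy = np.array([[(A(y + eps * np.eye(m)[j], X) - A(y - eps * np.eye(m)[j], X)) / (2 * eps)
                    for j in range(m)]])
dX = np.zeros_like(X)
dX[1, 2] = eps
num_dX_12 = (A(y, X + dX) - A(y, X - dX)) / (2 * eps)

print(np.max(np.abs(dA_dy - num_dy)))           # should be ~1e-9 or smaller
print(abs(dA_dX[1, 2] - num_dX_12))             # should be ~1e-10 or smaller
```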
