
Gradient Notes

Christopher Yeh

June 20, 2024

1 Jacobian
Consider a vector-valued function f : R^n → R^m. Then the Jacobian is the matrix

$$
J = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}
  = \begin{bmatrix}
      \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
      \vdots & \ddots & \vdots \\
      \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
    \end{bmatrix}
$$

where element-wise J_ij = ∂f_i/∂x_j.
If f : R^n → R is a scalar-valued function with vector inputs, then its gradient is just a special case of the Jacobian, with shape 1 × n.
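As a quick sanity check, the definition is easy to verify numerically. Below is a minimal finite-difference Jacobian sketch in NumPy; the helper name numerical_jacobian and the example function f are arbitrary illustrative choices, not anything defined above.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f : R^n -> R^m at x by central differences."""
    x = np.asarray(x, dtype=float)
    m = f(x).shape[0]
    J = np.zeros((m, x.shape[0]))
    for j in range(x.shape[0]):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# Example: f(x) = [x0 * x1, sin(x0)] has Jacobian [[x1, x0], [cos(x0), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x = np.array([0.5, 2.0])
print(numerical_jacobian(f, x))                          # numerical estimate
print(np.array([[x[1], x[0]], [np.cos(x[0]), 0.0]]))     # analytic Jacobian
```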

2 Softmax Cross-Entropy Loss w.r.t. Logits


We want to compute the gradient for the cross-entropy loss J ∈ R between our predicted
softmax probabilities ŷ and the true one-hot probabilities y. Both y and ŷ are vectors of the
same length. They can be either row or column vectors; the result is the same.
We are given the following:
1. ŷ = softmax(θ)
2. y is a one-hot vector, where y_k = 1 and y_c = 0 for all c ≠ k
3. y, ŷ, θ ∈ R^n
The cross-entropy loss J is computed as follows. The second line expands out the softmax function.
$$
\begin{aligned}
J(\theta) = \mathrm{CE}(y, \hat{y}) &= -\sum_c y_c \log \hat{y}_c = -\log \hat{y}_k \\
&= -\log\left(\frac{e^{\theta_k}}{\sum_c e^{\theta_c}}\right) = \log \sum_c e^{\theta_c} - \theta_k
\end{aligned}
$$

The gradient of the loss w.r.t. the logits θ is


$$
\frac{\partial J}{\partial \theta_i} = \frac{e^{\theta_i}}{\sum_c e^{\theta_c}} - \mathbb{1}[i = k] = \hat{y}_i - y_i
\quad\Longrightarrow\quad
\nabla_\theta J = \hat{y} - y
$$
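As a sanity check, the claimed gradient ŷ − y can be compared against finite differences. The sketch below assumes NumPy; the helper names softmax and cross_entropy and the dimension 5 are arbitrary illustrative choices.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())    # shift by max for numerical stability
    return e / e.sum()

def cross_entropy(theta, k):
    return -np.log(softmax(theta)[k])  # J(theta) = -log(softmax(theta)_k)

theta = np.random.randn(5)
k = 2
y = np.eye(5)[k]                       # one-hot true label
analytic = softmax(theta) - y          # claimed gradient: y_hat - y

# Central-difference estimate of dJ/dtheta_i.
eps = 1e-6
numeric = np.array([
    (cross_entropy(theta + eps * np.eye(5)[i], k)
     - cross_entropy(theta - eps * np.eye(5)[i], k)) / (2 * eps)
    for i in range(5)
])
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-9 or smaller
```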

3 Matrix times column vector w.r.t. matrix


Given z = Wx and r = ∂J/∂z, what is ∂J/∂W?
1. z ∈ R^n and x ∈ R^m are column vectors
2. W ∈ R^{n×m} is a matrix
3. J ∈ R is some scalar function of z
4. r ∈ R^{1×n} is the Jacobian of J w.r.t. z
Note on notation: Technically, J is a scalar-valued function taking nm inputs (the entries of W). This means the Jacobian ∂J/∂W should be a 1 × nm vector, which isn't very useful. Instead, we will let ∂J/∂W be an n × m matrix where (∂J/∂W)_ij = ∂J/∂W_ij:

$$
\frac{\partial J}{\partial W} =
\begin{bmatrix}
  \frac{\partial J}{\partial W_{1,1}} & \cdots & \frac{\partial J}{\partial W_{1,m}} \\
  \vdots & \ddots & \vdots \\
  \frac{\partial J}{\partial W_{n,1}} & \cdots & \frac{\partial J}{\partial W_{n,m}}
\end{bmatrix}
$$

Naively, we can write


$$
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W} = r\,\frac{\partial z}{\partial W}
$$
However, it is unclear how to derive ∂z/∂W, since this is the gradient of a vector w.r.t. a matrix. This gradient would have to be 3-dimensional, and multiplying the vector r by this 3-D tensor is not well-defined. Thus, we instead have to take the element-wise derivative ∂J/∂W_ij.
Note that z_k is the dot product between the k-th row of W and x.

$$
z_k = \sum_{l=1}^{m} W_{kl} x_l
\qquad\Longrightarrow\qquad
\frac{\partial}{\partial W_{ij}} z_k = \sum_{l=1}^{m} x_l \frac{\partial}{\partial W_{ij}} W_{kl}
$$

Clearly, ∂W_kl/∂W_ij = 1 only when i = k and j = l, and 0 otherwise. Thus, ∂z_k/∂W_ij = 1[k = i] x_j.
Another way we can write this is

$$
\frac{\partial z}{\partial W_{ij}} =
\begin{bmatrix} 0 \\ \vdots \\ 0 \\ x_j \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\quad \text{($x_j$ in the $i$-th position)}
$$

Now we can compute


$$
\frac{\partial J}{\partial W_{ij}} = \sum_k \frac{\partial J}{\partial z_k}\frac{\partial z_k}{\partial W_{ij}} = \sum_k r_k\, \mathbb{1}[k = i]\, x_j = r_i x_j
$$

where the summation comes from the Chain Rule. (Every change to W_ij influences each z_k, which in turn influences J, so the total effect of W_ij on J is the sum of the influences of each z_k on J.)
Thus the full matrix ∂J/∂W is the outer product ∂J/∂W = r^T x^T (recall that r is a row vector).

4 Row vector times matrix w.r.t. matrix


The problem setup is basically the same as the previous case, except with row vectors instead
of column vectors.
Given z = xW and r = ∂J/∂z, what is ∂J/∂W?
1. z ∈ R^{1×n} and x ∈ R^{1×m} are row vectors
2. W ∈ R^{m×n} is a matrix
3. J ∈ R is some scalar function of z
4. r ∈ R^{1×n} is the Jacobian of J w.r.t. z
Similar to the previous case, we have
$$
z_l = \sum_{k=1}^{m} x_k W_{kl}
\qquad\Longrightarrow\qquad
\frac{\partial}{\partial W_{ij}} z_l = \sum_{k=1}^{m} x_k \frac{\partial}{\partial W_{ij}} W_{kl} = \mathbb{1}[j = l]\, x_i
$$

Now we can compute


$$
\frac{\partial J}{\partial W_{ij}} = \sum_l \frac{\partial J}{\partial z_l}\frac{\partial z_l}{\partial W_{ij}} = \sum_l r_l\, \mathbb{1}[j = l]\, x_i = x_i r_j
$$

Thus the full matrix ∂J/∂W is the outer product ∂J/∂W = x^T r (recall that both x and r are row vectors).
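The same kind of check works here (again a sketch with the arbitrary loss J(z) = Σ_l z_l², so r = 2z):

```python
import numpy as np

m, n = 4, 3
W = np.random.randn(m, n)
x = np.random.randn(1, m)              # row vector

J = lambda W_: np.sum((x @ W_) ** 2)   # arbitrary scalar loss J(z) = sum(z^2)
r = 2 * (x @ W)                        # dJ/dz, shape (1, n)
analytic = x.T @ r                     # claimed: dJ/dW = x^T r, shape (m, n)

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (J(W + dW) - J(W - dW)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-7 or smaller
```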

5 Scalar Function of Matrix Multiplication w.r.t. Matrix
Let B = XY be some matrix multiplication. Let A be a scalar that is a function of B, where ∂A/∂B is known. We want to find ∂A/∂X and ∂A/∂Y.
1. Let X ∈ R^{n×m} and Y ∈ R^{m×p}.
2. This means that B, ∂A/∂B ∈ R^{n×p}.
Note on notation: Technically, A is a scalar-valued function taking np inputs (the entries of B). This means the Jacobian ∂A/∂B should be a 1 × np vector, which isn't very useful. Instead, we will let ∂A/∂B be an n × p matrix where (∂A/∂B)_ij = ∂A/∂B_ij. We define ∂A/∂X and ∂A/∂Y similarly.
Naively, we can write ∂A/∂X = (∂A/∂B)(∂B/∂X). However, it is unclear how to derive ∂B/∂X, since this is the gradient of a matrix w.r.t. another matrix. This gradient would have to be 4-dimensional, and multiplying the matrix ∂A/∂B by this 4-D tensor is not well-defined. Thus, we instead take the element-wise derivative ∂A/∂X_ij.
First, we compute the derivatives for each element of B w.r.t. each element of X and Y .
$$
\frac{\partial}{\partial X_{i,j}} B_{k,l} = \frac{\partial}{\partial X_{i,j}} \left( X_{k,:} \cdot Y_{:,l} \right) = \mathbb{1}[k = i]\, Y_{j,l}
$$
$$
\frac{\partial}{\partial Y_{i,j}} B_{k,l} = \frac{\partial}{\partial Y_{i,j}} \left( X_{k,:} \cdot Y_{:,l} \right) = \mathbb{1}[l = j]\, X_{k,i}
$$

Now we can use the (multi-path) chain rule to take the derivative of A w.r.t. each element
of X and Y .
 
$$
\frac{\partial A}{\partial X_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial X_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}}\, \mathbb{1}[k = i]\, Y_{j,l} = \sum_{l} \frac{\partial A}{\partial B_{i,l}} Y_{j,l} = \left(\frac{\partial A}{\partial B}\right)_{i,:} \cdot Y_{j,:}
$$
$$
\frac{\partial A}{\partial Y_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial Y_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}}\, \mathbb{1}[l = j]\, X_{k,i} = \sum_{k} \frac{\partial A}{\partial B_{k,j}} X_{k,i} = \left(\frac{\partial A}{\partial B}\right)_{:,j} \cdot X_{:,i}
$$

Combining these element-wise derivatives yields the matrix equations

$$
\frac{\partial A}{\partial X} = \frac{\partial A}{\partial B} \cdot Y^T
\qquad\text{and}\qquad
\frac{\partial A}{\partial Y} = X^T \cdot \frac{\partial A}{\partial B}
$$
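These two identities can be verified numerically. The sketch below picks the arbitrary scalar A(B) = Σ C ⊙ B for a fixed matrix C, so that ∂A/∂B = C is known exactly; the helper name finite_diff is an illustrative choice.

```python
import numpy as np

def finite_diff(f, M, eps=1e-6):
    """Element-wise central-difference gradient of a scalar function f w.r.t. matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        dM = np.zeros_like(M)
        dM[idx] = eps
        G[idx] = (f(M + dM) - f(M - dM)) / (2 * eps)
    return G

n, m, p = 3, 4, 2
X, Y, C = np.random.randn(n, m), np.random.randn(m, p), np.random.randn(n, p)

A = lambda B: np.sum(C * B)            # arbitrary scalar with dA/dB = C
dA_dB = C

print(np.allclose(dA_dB @ Y.T, finite_diff(lambda X_: A(X_ @ Y), X)))  # dA/dX = dA/dB · Y^T
print(np.allclose(X.T @ dA_dB, finite_diff(lambda Y_: A(X @ Y_), Y)))  # dA/dY = X^T · dA/dB
```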

6 Scalar Function of Matrix-Vector Broadcast Sum w.r.t. Vector
Let A be a scalar that is a function of a matrix B ∈ R^{n×m}, and suppose ∂A/∂B is known. Let B = X + y be some broadcasted sum between a matrix X and a row vector y ∈ R^{1×m}. We want to find ∂A/∂y.
Intuitively, notice that any change in y directly and linearly affects every row of B. Each
row of B in turn affects A. Therefore,
$$
\frac{\partial A}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \cdot I = \sum_i \frac{\partial A}{\partial B_i}
$$

where B_i is the i-th row of B.


Alternatively, we can write this broadcasted sum properly as
$$
B = X + \mathbf{1} \cdot y
$$
where 1 is an n-dimensional column vector of ones. Then we can use the gradient rules derived earlier to get the equivalent result.
$$
\frac{\partial A}{\partial y} = \mathbf{1}^T \cdot \frac{\partial A}{\partial B} = \sum_i \frac{\partial A}{\partial B_i}
$$
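A quick numerical check of the broadcast-sum rule, again using the arbitrary scalar A(B) = Σ C ⊙ B so that ∂A/∂B = C:

```python
import numpy as np

n, m = 3, 4
X, C = np.random.randn(n, m), np.random.randn(n, m)
y = np.random.randn(1, m)              # row vector to be broadcast

A = lambda y_: np.sum(C * (X + y_))    # NumPy broadcasts y_ over every row of X
dA_dB = C

analytic = np.ones((1, n)) @ dA_dB     # 1^T · dA/dB, i.e. the sum of the rows of dA/dB

eps = 1e-6
numeric = np.array([[(A(y + eps * np.eye(m)[j]) - A(y - eps * np.eye(m)[j])) / (2 * eps)
                     for j in range(m)]])
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-9 or smaller
```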

7 Scalar Function of Matrix-Vector Broadcast Product
Let A be a scalar that is a function of a matrix B ∈ R^{n×m}, and suppose ∂A/∂B is known. Let B = y ⊙ X be a broadcasted Hadamard (element-wise) product between a row vector y ∈ R^{1×m} and a matrix X. In other words, the i-th row of B is computed by the Hadamard product B_i = y ⊙ X_i. We want to find ∂A/∂y and ∂A/∂X.

We first find ∂A/∂y. Intuitively, any change in y directly affects every row of B by a factor of the same row in X. Each row of B in turn affects A. Therefore,
$$
\frac{\partial A}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \odot X_i
$$

Next we find ∂A/∂X. We can find this element-wise, then compose the entire gradient. Note that changing X_ij only affects B_ij, by a scale of y_j. No other indices in B are affected.

$$
\frac{\partial A}{\partial X_{ij}} = y_j\, \frac{\partial A}{\partial B_{ij}}
\qquad\Longrightarrow\qquad
\frac{\partial A}{\partial X} = y \odot \frac{\partial A}{\partial B}
$$
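The same style of check for the broadcast product, with the arbitrary scalar A(B) = Σ C ⊙ B so that ∂A/∂B = C:

```python
import numpy as np

n, m = 3, 4
X, C = np.random.randn(n, m), np.random.randn(n, m)
y = np.random.randn(1, m)                       # row vector

A = lambda y_, X_: np.sum(C * (y_ * X_))        # B = y ⊙ X broadcast row-wise, dA/dB = C

dA_dy = np.sum(C * X, axis=0, keepdims=True)    # claimed: sum_i (dA/dB_i ⊙ X_i)
dA_dX = y * C                                   # claimed: y ⊙ dA/dB

eps = 1e-6
num_dy = np.array([[(A(y + eps * np.eye(m)[j], X) - A(y - eps * np.eye(m)[j], X)) / (2 * eps)
                    for j in range(m)]])
dX = np.zeros_like(X)
dX[1, 2] = eps
num_dX_12 = (A(y, X + dX) - A(y, X - dX)) / (2 * eps)

print(np.max(np.abs(dA_dy - num_dy)))           # should be ~1e-9 or smaller
print(abs(dA_dX[1, 2] - num_dX_12))             # should be ~1e-10 or smaller
```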
