Matrix Calculus Tutorial
Tanmay Devale
1 Kronecker Product
Let A be an m × n matrix and B be a p × q matrix. The Kronecker (or tensor) product of A and B, denoted A ⊗ B, is the mp × nq matrix C with elements defined by $c_{\alpha\beta} = a_{ij} b_{kl}$, where $\alpha = p(i-1) + k$ and $\beta = q(j-1) + l$.
For example, consider
\[
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\quad\text{and}\quad
B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{pmatrix}.
\]
Then
\[
A \otimes B
= \begin{pmatrix} a_{11} B & a_{12} B \\ a_{21} B & a_{22} B \end{pmatrix}
= \begin{pmatrix}
a_{11} b_{11} & a_{11} b_{12} & a_{12} b_{11} & a_{12} b_{12} \\
a_{11} b_{21} & a_{11} b_{22} & a_{12} b_{21} & a_{12} b_{22} \\
a_{11} b_{31} & a_{11} b_{32} & a_{12} b_{31} & a_{12} b_{32} \\
a_{21} b_{11} & a_{21} b_{12} & a_{22} b_{11} & a_{22} b_{12} \\
a_{21} b_{21} & a_{21} b_{22} & a_{22} b_{21} & a_{22} b_{22} \\
a_{21} b_{31} & a_{21} b_{32} & a_{22} b_{31} & a_{22} b_{32}
\end{pmatrix}.
\]
Say
\[
A = \begin{pmatrix} 2 & 0 \\ 1 & 3 \end{pmatrix}
\quad\text{and}\quad
B = \begin{pmatrix} 5 & -1 \\ -1 & 4 \end{pmatrix}.
\]
Then
\[
A \otimes B
= \begin{pmatrix} 2B & 0B \\ 1B & 3B \end{pmatrix}
= \begin{pmatrix}
10 & -2 & 0 & 0 \\
-2 & 8 & 0 & 0 \\
5 & -1 & 15 & -3 \\
-1 & 4 & -3 & 12
\end{pmatrix}.
\]
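As a sanity check, Kronecker products can also be computed numerically; NumPy's kron implements exactly the block construction above. A minimal sketch reproducing this example:

    import numpy as np

    # A and B from the worked example above.
    A = np.array([[2, 0],
                  [1, 3]])
    B = np.array([[5, -1],
                  [-1, 4]])

    # np.kron builds the mp x nq block matrix whose (i, j) block is a_ij * B.
    print(np.kron(A, B))
    # [[10 -2  0  0]
    #  [-2  8  0  0]
    #  [ 5 -1 15 -3]
    #  [-1  4 -3 12]]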
1.1 Exercise
1. For each of the following pairs of matrices, find A ⊗ B.
   (a) $A = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 0 \\ 2 & 7 \end{pmatrix}$
   (b) $A = \begin{pmatrix} 1 & -1 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 0 & 5 \end{pmatrix}$
   (c) $A = \begin{pmatrix} 3 & 6 \end{pmatrix}$ and $B = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}$
   (d) $A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$ and $B = \begin{pmatrix} -1 & 3 \end{pmatrix}$
   (e) $A = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$ and $B = \begin{pmatrix} 3 \\ 8 \end{pmatrix}$
2. Is A ⊗ B = B ⊗ A? Provide a proof or a counterexample.
2 Matrix Differentiation
We are going to use the following notation:
1. x denotes scalars
2. ⃗x denotes vectors (specifically, column vectors)
3. X denotes matrices
We are interested in the following nine derivatives:
\[
\begin{array}{ccc}
\frac{dy}{dx} & \frac{d\vec{y}}{dx} & \frac{dY}{dx} \\[4pt]
\frac{dy}{d\vec{x}} & \frac{d\vec{y}}{d\vec{x}} & \frac{dY}{d\vec{x}} \\[4pt]
\frac{dy}{dX} & \frac{d\vec{y}}{dX} & \frac{dY}{dX}
\end{array}
\]
Table 1: Derivatives of interest
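As a preview of the table's first row (derivatives with respect to a scalar x), these derivatives are taken entry by entry. A minimal SymPy sketch, using an illustrative matrix of my own rather than one from the exercises:

    import sympy as sp

    x = sp.symbols('x')

    # dY/dx is taken entrywise: differentiate each entry w.r.t. the scalar x.
    Y = sp.Matrix([[x**2, sp.exp(x)],
                   [sp.sin(x), sp.log(x)]])
    print(Y.diff(x))  # Matrix([[2*x, exp(x)], [cos(x), 1/x]])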
2.1.1 Exercise
Given the notation as specified above, find the derivative of y, ⃗y, and Y w.r.t. x.
1. $y = \sin(x^2)$, $\vec{y} = \begin{pmatrix} x^2 \\ \cos(x) \\ 2x^2 \end{pmatrix}$, $Y = \begin{pmatrix} x + 1 & \cos(x) \\ \sin(x) & x - 1 \end{pmatrix}$
2. $y = e^{x^2}$, $\vec{y} = \begin{pmatrix} \ln(x) \\ \sin(x) \\ \cos(x^2) \end{pmatrix}$, $Y = \begin{pmatrix} x^3 + x^2 + 1 & e^x \\ 2x & \sin(x) \end{pmatrix}$
3. $y = \ln(x^{10})$, $\vec{y} = \begin{pmatrix} \cos^2(x) \\ \sin(\pi x) \\ \tan(x) \\ x^4 + x + 2023 \end{pmatrix}$, $Y = \begin{pmatrix} \sin(x^2) & \pi x & e^{\pi} \\ \cos(\sec(x)) & \csc(x^3) & x \end{pmatrix}$
2.2.1 Exercise
1. Given the notation as specified above, find the derivative of y, ⃗y, and Y w.r.t. ⃗x, where $\vec{x} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}$.
   (a) $y = \sin(x + yz)$, $\vec{y} = \begin{pmatrix} e^{xyz} \\ x^2 z \\ xyz \end{pmatrix}$, $Y = \begin{pmatrix} x^2 yz & x y^2 z \\ \ln(xyz) & e^{x y^2 z} \end{pmatrix}$
   (b) $y = 5xyz$, $\vec{y} = \begin{pmatrix} \ln(xyz) \\ y \\ x\cos(z) \end{pmatrix}$, $Y = \begin{pmatrix} x^3 + x^2 + yz & \pi\cos(x + y + z) \\ xyz & 2\sin(\cos(x + y))\,z \end{pmatrix}$
2. Consider functions $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^n \to \mathbb{R}^m$.
   (a) Show that for $\vec{x} \in \mathbb{R}^n$,
   \[
   \frac{d(f(\vec{x}) + g(\vec{x}))}{d\vec{x}} = \frac{df(\vec{x})}{d\vec{x}} + \frac{dg(\vec{x})}{d\vec{x}}
   \]
   (b) Show that for $\vec{x} \in \mathbb{R}^n$ and $a \in \mathbb{R}$,
   \[
   \frac{d\,a f(\vec{x})}{d\vec{x}} = a \frac{df(\vec{x})}{d\vec{x}}
   \]
3. The quadratic form $x^T A x$ is a form we will encounter often. In this question, we are interested in $\frac{dx^T A x}{dx}$. Assume that A is not a function of x.
   (a) Evaluate $x^T A x$ when $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ and the $(i, j)$th element of A is $A_{ij}$. Why do you think $x^T A x$ is called the quadratic form?
   (b) Which definition of the derivative do we need in order to evaluate $\frac{dx^T A x}{dx}$?
   (c) Assume $x \in \mathbb{R}^2$ and $A \in \mathbb{R}^{2\times 2}$. Evaluate $\frac{dx^T A x}{dx}$.
   (d) Generalize the previous result to when $x \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n\times n}$ and evaluate $\frac{dx^T A x}{dx}$. Can you express the result in matrix form?
   (e) What happens when A is a symmetric matrix? (A symbolic check follows this list.)
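For comparison once you have worked parts (c)-(e), here is a SymPy sketch for the 2 × 2 case (the concrete A is an arbitrary choice of mine). It differentiates the expanded quadratic form directly and checks the result against $x^T(A + A^T)$, the closed form stated as identity 3(e) in Section 4:

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')
    x = sp.Matrix([x1, x2])
    A = sp.Matrix([[1, 2],
                   [3, 4]])          # arbitrary and deliberately non-symmetric

    q = (x.T * A * x)[0, 0]          # the scalar x^T A x
    print(sp.expand(q))              # x1**2 + 5*x1*x2 + 4*x2**2, quadratic in x

    # Row vector of partial derivatives, matching the layout used in Section 4.
    grad = sp.Matrix([[q.diff(x1), q.diff(x2)]])
    print(grad)                                 # Matrix([[2*x1 + 5*x2, 5*x1 + 8*x2]])
    print(sp.simplify(grad - x.T * (A + A.T)))  # Matrix([[0, 0]]): they agree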
2.3.1 Exercise
1. Given the notation as specified above, find the derivative of y, ⃗y, and Y w.r.t. X.
   (a) $y = 4x_3 + 3x_2 + 2x_1 + x_0$, $\vec{y} = \begin{pmatrix} e^{x_0 x_1} \\ e^{x_2 x_3} \end{pmatrix}$, $Y = \begin{pmatrix} \sin(x_0 + 2x_1) & 2x_1 + x_3 \\ 2x_0 + x_2 & \cos(2x_2 + x_3) \end{pmatrix}$ w.r.t. $X = \begin{pmatrix} x_0 & x_1 \\ x_2 & x_3 \end{pmatrix}$ (a worked sketch for this part's scalar y follows the exercise)
   (b) $y = \ln(x_5^2 x_4^3 x_3 x_2^2 x_1^0)$, $\vec{y} = \begin{pmatrix} \sin(x_5) + \cos(x_4 x_3) + x_2^2 + 2x_1 x_0 \\ e^{i\pi} + 1 \end{pmatrix}$, $Y = \begin{pmatrix} x_1 x_5 & 2x_4 x_0 \\ 2x_4 + x_1 & \tan(2x_2 + x_4) \\ \cot(x_0) & \csc(x_4 + x_1) \end{pmatrix}$ w.r.t. $X = \begin{pmatrix} x_0 & x_1 & x_2 \\ x_3 & x_4 & x_5 \end{pmatrix}$
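To make the scalar-by-matrix case concrete, here is a SymPy sketch for part (a)'s scalar y; arranging the entrywise partial derivatives in the shape of X is one common convention, and that arrangement choice is mine:

    import sympy as sp

    x0, x1, x2, x3 = sp.symbols('x0 x1 x2 x3')
    X = sp.Matrix([[x0, x1],
                   [x2, x3]])
    y = 4*x3 + 3*x2 + 2*x1 + x0

    # Differentiate the scalar y w.r.t. each entry of X, keeping X's shape.
    dy_dX = X.applyfunc(lambda entry: sp.diff(y, entry))
    print(dy_dX)  # Matrix([[1, 2], [3, 4]])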
3 Chain Rule
3.1 The basics
Recall that for $h(x) = f(g(x))$ the chain rule is
\[
\frac{dh}{dx} = \frac{df}{dg}\frac{dg}{dx}
\]
For the multivariate case $h(x) = f(g_1(x), g_2(x), g_3(x))$, the chain rule extends to
\[
\frac{dh}{dx} = \frac{\partial f}{\partial g_1}\frac{dg_1}{dx} + \frac{\partial f}{\partial g_2}\frac{dg_2}{dx} + \frac{\partial f}{\partial g_3}\frac{dg_3}{dx}
\]
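A quick symbolic check of this expansion (a SymPy sketch; the concrete choices of f and the g_i are mine):

    import sympy as sp

    x = sp.symbols('x')
    g1, g2, g3 = sp.sin(x), x**2, sp.exp(x)

    # Take f(g1, g2, g3) = g1*g2 + g3, so df/dg1 = g2, df/dg2 = g1, df/dg3 = 1.
    h = g1*g2 + g3
    direct = sp.diff(h, x)
    term_by_term = g2*sp.diff(g1, x) + g1*sp.diff(g2, x) + 1*sp.diff(g3, x)
    print(sp.simplify(direct - term_by_term))  # 0: both derivatives agree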
3.1.1 Exercises
Evaluate $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ for each of the following:

Our previous operations can be thought of as adding up all of the components that contribute to the change of h. Building on this, we can extend the chain rule to also work in matrix calculus. For a detailed proof of why the chain rule still holds in matrix calculus, please refer to reference 3.
3.1.2 Exercise
Consider $\vec{x} \in \mathbb{R}^p$, $\vec{y} \in \mathbb{R}^r$, $\vec{z} \in \mathbb{R}^n$. Which of the following are true?
\[
\frac{d\vec{z}}{d\vec{x}} = \frac{d\vec{z}}{d\vec{y}}\frac{d\vec{y}}{d\vec{x}}
\quad\text{or}\quad
\frac{d\vec{z}}{d\vec{x}} = \frac{d\vec{y}}{d\vec{x}}\frac{d\vec{z}}{d\vec{y}}
\]
Consider a scalar-valued function $J = f(g(\vec{w}))$, where $g$ maps the vector $\vec{w}$ to a scalar and $f$ is a scalar function. In such a case, we can apply the vector chain rule to obtain the following:
\[
\nabla J = \frac{\partial J}{\partial \vec{w}} = \nabla g(\vec{w}) \underbrace{f'(g(\vec{w}))}_{\text{scalar}}
\]
In this case, the order of multiplication does not matter, because one of the factors in the product is a scalar. Note that this result is used frequently in machine learning, because many loss functions in machine learning are computed by applying a scalar function $f(\cdot)$ to the dot product of $\vec{w}$ with a training point $\vec{a}$. In other words, we have $g(\vec{w}) = \vec{w} \cdot \vec{a}$. Note that $\vec{w} \cdot \vec{a}$ can also be written as $\vec{w}^T (I) \vec{a}$, where $I$ represents the identity matrix. This is in the form of one of the matrix identities listed in Section 4. In such a case, one can use the chain rule to obtain the following:
\[
\frac{\partial J}{\partial \vec{w}} = \underbrace{[f'(g(\vec{w}))]}_{\text{scalar}} \vec{a}
\]
This result is extremely useful: it can be used to compute the derivatives of the loss functions of many models, such as least-squares regression, SVMs, and logistic regression. The vector $\vec{a}$ is simply replaced with the vector of the training point at hand, and the function $f(\cdot)$ defines the specific form of the loss function for the model at hand.
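As an illustration, here is a numerical sketch of this pattern; the particular f, a logistic-style loss, is my own choice, and the chain-rule gradient $f'(\vec{w} \cdot \vec{a})\,\vec{a}$ is checked against finite differences:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=5)   # weight vector
    a = rng.normal(size=5)   # one training point

    f  = lambda z: np.log(1 + np.exp(-z))   # a logistic-style scalar loss
    fp = lambda z: -1 / (1 + np.exp(z))     # its derivative f'(z)

    grad = fp(w @ a) * a                    # chain rule: dJ/dw = f'(w . a) a

    # Finite-difference check, one coordinate at a time.
    eps = 1e-6
    numeric = np.array([(f((w + eps*e) @ a) - f((w - eps*e) @ a)) / (2*eps)
                        for e in np.eye(5)])
    print(np.max(np.abs(grad - numeric)))   # ~1e-10: the two gradients agree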
3.2.1 Exercise
1. Evaluate $\frac{d}{dx}\sigma(x)$, where $\sigma(x) = \frac{1}{1+e^{-x}}$. Is this function familiar? What is it commonly called? (A symbolic check follows this list.)
2. Express your answer to the previous question using only $\sigma(x)$.
3. Consider a weight vector $\vec{w}$ and a sample point $\vec{x}$. We perform the affine transformation $z = \vec{w}^T \vec{x}$ and then apply $\sigma(z)$. Find $\frac{\partial \sigma(z)}{\partial \vec{w}}$.
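Answers to these exercises can be checked symbolically; a minimal SymPy sketch, where the candidate closed form in the last line is one well-known possibility:

    import sympy as sp

    x = sp.symbols('x')
    sigma = 1 / (1 + sp.exp(-x))

    d = sp.simplify(sp.diff(sigma, x))
    print(d)                                   # closed form of sigma'(x)
    print(sp.simplify(d - sigma*(1 - sigma)))  # 0, so sigma' = sigma*(1 - sigma)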
4 Matrix Identities
Assume identity 3(a) and prove all other identities.
1. $\frac{\partial c}{\partial \vec{x}} = 0^T$
2. $\frac{\partial \vec{u}\vec{v}}{\partial \vec{x}} = \vec{u}\frac{\partial \vec{v}}{\partial \vec{x}} + \frac{\partial \vec{u}}{\partial \vec{x}}\vec{v}$
3. Note the change in notation. Let $u$, $v$, $x$ be variable column vectors, let $a$, $b$ be constant column vectors, and let $A$ be a constant matrix. Then:
   (a) $\frac{\partial u^T A v}{\partial x} = u^T A \frac{\partial v}{\partial x} + v^T A^T \frac{\partial u}{\partial x}$
   (b) $\frac{\partial u^T v}{\partial x} = u^T \frac{\partial v}{\partial x} + v^T \frac{\partial u}{\partial x}$
   (c) $\frac{\partial a^T x}{\partial x} = a^T$
   (d) $\frac{\partial b^T A x}{\partial x} = b^T A$
   (e) $\frac{\partial x^T A x}{\partial x} = x^T (A + A^T)$ (checked numerically after this list)
   (f) $\frac{\partial \lVert x \rVert^2}{\partial x} = 2x^T$
   (g) $\frac{\partial a^T u}{\partial x} = a^T \frac{\partial u}{\partial x}$
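These identities lend themselves to numerical spot checks. A sketch for identity 3(e) using finite differences (the random test setup is my own):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    A = rng.normal(size=(n, n))   # constant and deliberately non-symmetric
    x = rng.normal(size=n)

    f = lambda v: v @ A @ v       # the quadratic form v^T A v

    analytic = x @ (A + A.T)      # identity 3(e): the derivative x^T (A + A^T)

    eps = 1e-6
    numeric = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(n)])
    print(np.max(np.abs(analytic - numeric)))   # ~1e-9, i.e. they agree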
5 References
1. Weisstein, Eric W. "Kronecker Product." From MathWorld–A Wolfram Web Resource. https://mathworld.wolfram.com/KroneckerProduct.html
2. Taboga, Marco (2021). "Kronecker product", Lectures on matrix algebra. https://www.statlect.com/matrix-algebra/Kronecker-product