Matrix Differentiation

1 Matrix-vector differentiation

1.1 Theory
To calculate most of the derivatives that arise in practice, a small table of standard derivatives and
conversion rules is sufficient. It is most convenient to work in terms of the «differential»: with it, you do not
have to think about intermediate dimensions, but can simply apply the standard rules.
Note: This section describes the matrix-vector differentiation technique itself. For a more detailed
description of the mathematical theory behind this technique, see section A.

Conversion rules:

d(A) = 0
d(αX) = α(dX)
d(AXB) = A(dX)B
d(X + Y) = dX + dY
d(Xᵀ) = (dX)ᵀ
d(XY) = (dX)Y + X(dY)
d⟨X, Y⟩ = ⟨dX, Y⟩ + ⟨X, dY⟩
d(X/ϕ) = (ϕ dX − (dϕ)X) / ϕ²

Table of standard derivatives:

d⟨A, X⟩ = ⟨A, dX⟩
d⟨Ax, x⟩ = ⟨(A + Aᵀ)x, dx⟩
d⟨Ax, x⟩ = 2⟨Ax, dx⟩ (if A = Aᵀ)
d(Det(X)) = Det(X)⟨X⁻ᵀ, dX⟩
d(X⁻¹) = −X⁻¹(dX)X⁻¹

Here A and B are fixed matrices, α is a fixed scalar, X and Y are arbitrary differentiable matrix functions (consistent
in dimensions so that all operations make sense), and ϕ is an arbitrary differentiable scalar function.
One of the most important rules is the composition (chain) rule. Let g(Y) and f(X) be two differentiable
functions whose differentials dg(Y) and df(X) are known. To calculate the differential of the
composition ϕ(X) := g(f(X)), just as in the scalar case, you need to:
• take the expression for the differential dg(Y);
• substitute the value f(X) in place of Y, and the value df(X) in place of dY.

Example
Consider the function ϕ(x) := ln⟨Ax, x⟩, where A ∈ Sⁿ₊₊. In this case,

g(y) := ln(y), dg(y) = dy/y; f(x) := ⟨Ax, x⟩, df(x) = 2⟨Ax, dx⟩.

Substituting into dg(y) the expression f(x) = ⟨Ax, x⟩ in place of y and the expression df(x) = 2⟨Ax, dx⟩
in place of dy, we obtain

dϕ(x) = 2⟨Ax, dx⟩ / ⟨Ax, x⟩ (in the notation with capital «D»: Dϕ(x)[h] = 2⟨Ax, h⟩ / ⟨Ax, x⟩).
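This formula is easy to sanity-check numerically. Below is a minimal sketch in Python with NumPy (the setup, names, and step size are our own choices, not from the text): it compares Dϕ(x)[h] = 2⟨Ax, h⟩/⟨Ax, x⟩ against a central finite difference of ϕ along a random direction h.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)          # a symmetric positive definite A (A in S^n_++)
    x = rng.standard_normal(n)
    h = rng.standard_normal(n)

    phi = lambda y: np.log(y @ A @ y)
    eps = 1e-6
    fd = (phi(x + eps * h) - phi(x - eps * h)) / (2 * eps)   # central difference
    exact = 2 * (A @ x) @ h / (x @ A @ x)                    # Dphi(x)[h] = 2<Ax,h>/<Ax,x>
    print(abs(fd - exact))                                   # tiny: the two agree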

Usually, all matrix-vector functions that arise in practice are composed from the table functions by standard
operations on them. Owing to the universality of the rules above, differentiating arbitrarily
complex functions of this kind becomes as easy as differentiating one-dimensional functions.
The resulting expression must eventually be reduced to one of the canonical forms:

• input scalar, output scalar: df(x) = f′(x)dx (f′(x): scalar; dx: scalar);
• input vector, output scalar: df(x) = ⟨∇f(x), dx⟩ (∇f(x): vector; dx: vector);
• input vector, output vector: df(x) = Jf(x)dx (Jf(x): matrix; dx: vector);
• input matrix, output scalar: df(X) = ⟨∇f(X), dX⟩ (∇f(X): matrix; dX: matrix).

The remaining input/output combinations will not interest us. The object ∇f(x) (a vector for a function of a vector argument
and a matrix for a function of a matrix argument) is called the gradient. The matrix Jf(x) is called the Jacobi
matrix.
You can find the second derivative of the function f(X) using the following «algorithm»:
◦ calculate the first derivative of the function; fix the increment dX in the expression for df(X) and
denote it by dX₁;
◦ calculate the derivative of the function g(X) = df(X), treating dX₁ as fixed (constant). The new
increment is denoted by dX₂.

Example
Let us return to the function ϕ(x) = ln⟨Ax, x⟩, where A ∈ Sⁿ₊₊. We have already calculated its first
derivative: dϕ(x) = 2⟨Ax, dx⟩/⟨Ax, x⟩. Denote dx by dx₁ and consider the new function

g(x) = 2⟨Ax, dx₁⟩ / ⟨Ax, x⟩.

Find the derivative of g(x), assuming that dx₁ is a constant vector:

d²ϕ(x) = d(2⟨Ax, dx₁⟩/⟨Ax, x⟩) = [d(2⟨Ax, dx₁⟩)⟨Ax, x⟩ − 2⟨Ax, dx₁⟩ d⟨Ax, x⟩] / ⟨Ax, x⟩²
= [2⟨A dx₁, dx₂⟩⟨Ax, x⟩ − 2⟨Ax, dx₁⟩ · 2⟨Ax, dx₂⟩] / ⟨Ax, x⟩²
= ⟨(2A/⟨Ax, x⟩ − 4Axxᵀ A/⟨Ax, x⟩²) dx₁, dx₂⟩.

(In the notation with capital D: D²ϕ(x)[h₁, h₂] = ⟨(2A/⟨Ax, x⟩ − 4Axxᵀ A/⟨Ax, x⟩²) h₁, h₂⟩.)

For the second derivative, the canonical form for a scalar function of a vector argument is

d²f(x) = ⟨∇²f(x) dx₁, dx₂⟩.

The matrix ∇²f(x) is called the Hessian. For twice continuously differentiable functions, the Hessian is a
symmetric matrix.
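Returning to the example ϕ(x) = ln⟨Ax, x⟩, the Hessian formula just derived can be checked against finite differences of the gradient. A minimal sketch, assuming NumPy (all names are our own):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)                        # A in S^n_++
    x = rng.standard_normal(n)

    # Gradient and claimed Hessian of phi(x) = ln<Ax, x> from the example above.
    grad = lambda y: 2 * A @ y / (y @ A @ y)
    q, Ax = x @ A @ x, A @ x
    H = 2 * A / q - 4 * np.outer(Ax, Ax) / q**2

    # Finite-difference Hessian: differentiate the gradient column by column.
    eps = 1e-6
    H_fd = np.column_stack([(grad(x + eps * e) - grad(x - eps * e)) / (2 * eps)
                            for e in np.eye(n)])
    print(np.max(np.abs(H - H_fd)))                    # small: the formulas agree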

1.2 Problems
Problem 1 (Quadratic function). Find the first and second derivatives df(x) and d²f(x), as well as the
gradient ∇f(x) and the Hessian ∇²f(x), of the function

f(x) := (1/2)⟨Ax, x⟩ − ⟨b, x⟩ + c, x ∈ Rⁿ,

where A ∈ Sⁿ, b ∈ Rⁿ, c ∈ R.

Solution. Find the first derivative:
 
df(x) = d((1/2)⟨Ax, x⟩ − ⟨b, x⟩ + c) = (1/2)d⟨Ax, x⟩ − d⟨b, x⟩ = (1/2) · 2⟨Ax, dx⟩ − ⟨b, dx⟩ = ⟨Ax − b, dx⟩.

Note that df (x) is already written in the canonical form df (x) = ⟨∇f (x), dx⟩, so

∇f (x) = Ax − b .

Now find the second derivative:

d²f(x) = d⟨Ax − b, dx₁⟩ = ⟨d(Ax − b), dx₁⟩ = ⟨d(Ax), dx₁⟩ = ⟨A dx₂, dx₁⟩.

To find the Hessian, we bring d²f(x) to the canonical form d²f(x) = ⟨∇²f(x) dx₁, dx₂⟩:

d²f(x) = ⟨A dx₁, dx₂⟩ ⇒ ∇²f(x) = A.
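A quick numerical check of both answers, as a NumPy sketch with random test data of our own. Since f is quadratic, the second central difference reproduces ⟨A dx₁, dx₂⟩ exactly up to rounding, for any step:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    M = rng.standard_normal((n, n))
    A = (M + M.T) / 2                    # a symmetric A (A in S^n)
    b, x, h = rng.standard_normal((3, n))
    c = 1.7
    f = lambda y: 0.5 * y @ A @ y - b @ y + c

    eps = 1e-6
    fd1 = (f(x + eps * h) - f(x - eps * h)) / (2 * eps)
    print(abs(fd1 - (A @ x - b) @ h))    # gradient check: df(x) = <Ax - b, dx>

    t = 0.5                              # any step works: f is quadratic, so the
    fd2 = (f(x + t * h) - 2 * f(x) + f(x - t * h)) / t**2   # 2nd difference is exact
    print(abs(fd2 - h @ A @ h))          # Hessian check: d^2 f = <A dx1, dx2>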

Problem 2. Find the first and second derivatives df(x) and d²f(x), as well as the gradient ∇f(x) and
the Hessian ∇²f(x), of the function

f(x) := (1/2)∥Ax − b∥₂², x ∈ Rⁿ,

where A ∈ Rᵐˣⁿ, b ∈ Rᵐ.
Solution. Find the first derivative:
 
df(x) = d((1/2)∥Ax − b∥₂²) = {d(∥x∥₂²) = d⟨x, x⟩ = 2⟨x, dx⟩} = (1/2) · 2⟨Ax − b, d(Ax − b)⟩ = ⟨Ax − b, A dx⟩.

To find the gradient, we bring df(x) to the canonical form df(x) = ⟨∇f(x), dx⟩:

df(x) = ⟨Aᵀ(Ax − b), dx⟩ ⇒ ∇f(x) = Aᵀ(Ax − b).

Now find the second derivative:

d²f(x) = d⟨Ax − b, A dx₁⟩ = ⟨d(Ax − b), A dx₁⟩ = ⟨A dx₂, A dx₁⟩ = ⟨dx₂, AᵀA dx₁⟩.

To find the Hessian, we bring d²f(x) to the canonical form d²f(x) = ⟨∇²f(x) dx₁, dx₂⟩:

d²f(x) = ⟨AᵀA dx₁, dx₂⟩ ⇒ ∇²f(x) = AᵀA.
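As before, a small NumPy sketch (our own setup, not part of the text) confirms both formulas; the second difference is again exact because f is quadratic:

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 7, 4
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    x, h = rng.standard_normal((2, n))
    f = lambda y: 0.5 * np.sum((A @ y - b) ** 2)       # (1/2)||Ay - b||_2^2

    eps = 1e-6
    fd1 = (f(x + eps * h) - f(x - eps * h)) / (2 * eps)
    print(abs(fd1 - A.T @ (A @ x - b) @ h))            # grad f(x) = A^T(Ax - b)

    t = 1.0
    fd2 = (f(x + t * h) - 2 * f(x) + f(x - t * h)) / t**2
    print(abs(fd2 - h @ (A.T @ A) @ h))                # Hess f(x) = A^T A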

Problem 3 (The cube of the Euclidean norm). Find the first and second derivatives df(x) and d²f(x),
as well as the gradient ∇f(x) and the Hessian ∇²f(x), of the function

f(x) := (1/3)∥x∥₂³, x ∈ Rⁿ.
Solution. Find the first derivative:
 
df(x) = d((1/3)∥x∥₂³) = (1/3) d⟨x, x⟩^{3/2} = (1/3) · (3/2)⟨x, x⟩^{1/2} d⟨x, x⟩ = (1/2)∥x∥₂ (2⟨x, dx⟩) = ∥x∥₂⟨x, dx⟩.

To find the gradient, we bring df(x) to the canonical form df(x) = ⟨∇f(x), dx⟩:

df(x) = ⟨∥x∥₂ x, dx⟩ ⇒ ∇f(x) = ∥x∥₂ x.

Now find the second derivative:

d²f(x) = d(∥x∥₂ ⟨x, dx₁⟩) = d(∥x∥₂)⟨x, dx₁⟩ + ∥x∥₂ d⟨x, dx₁⟩, where d(∥x∥₂) = d(⟨x, x⟩^{1/2});
= (1/2)⟨x, x⟩^{−1/2} (2⟨x, dx₂⟩)⟨x, dx₁⟩ + ∥x∥₂⟨dx₂, dx₁⟩
= ∥x∥₂⁻¹⟨x, dx₂⟩⟨x, dx₁⟩ + ∥x∥₂⟨dx₂, dx₁⟩.

To find the Hessian, we bring d²f(x) to the canonical form d²f(x) = ⟨∇²f(x) dx₁, dx₂⟩:

d²f(x) = ∥x∥₂⁻¹⟨dx₁, x⟩⟨x, dx₂⟩ + ∥x∥₂⟨dx₁, dx₂⟩ = ⟨(∥x∥₂⁻¹ xxᵀ + ∥x∥₂ Iₙ) dx₁, dx₂⟩
⇒ ∇²f(x) = ∥x∥₂⁻¹ xxᵀ + ∥x∥₂ Iₙ.

Note that the resulting formula for the Hessian (and for the second derivative) is valid only for x ≠ 0, since the value
∥x∥₂⁻¹ is undefined at x = 0. This restriction arose because at the very beginning we used the product rule, which
involved the derivative d(∥x∥₂), and that derivative does not exist at the point x = 0. Nevertheless, it can be shown that the function
f in question is everywhere twice continuously differentiable, and its second derivative at the point x = 0 is zero.
Thus, the derived formula is, in fact, true for all values of x, with the caveat that at the point
x = 0 the value ∥x∥₂⁻¹x must be understood as 0 (its limit as x → 0).
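For x ≠ 0, the Hessian formula can be verified by differencing the gradient ∇f(x) = ∥x∥₂x column by column. A minimal sketch, assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 5
    x = rng.standard_normal(n)
    nx = np.linalg.norm(x)
    grad = lambda y: np.linalg.norm(y) * y             # grad f(x) = ||x||_2 x
    H = np.outer(x, x) / nx + nx * np.eye(n)           # claimed Hessian

    eps = 1e-6
    H_fd = np.column_stack([(grad(x + eps * e) - grad(x - eps * e)) / (2 * eps)
                            for e in np.eye(n)])
    print(np.max(np.abs(H - H_fd)))                    # small: formula confirmed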

Problem 4 (Euclidean norm). Find the first and second derivatives df(x) and d²f(x), as well as the
gradient ∇f(x) and the Hessian ∇²f(x), of the function

f(x) := ∥x∥₂, x ∈ Rⁿ \ {0}.

Solution. Find the first derivative:


df(x) = d(∥x∥₂) = d(⟨x, x⟩^{1/2}) = (1/2)⟨x, x⟩^{−1/2} d⟨x, x⟩ = (1/2)∥x∥₂⁻¹ · 2⟨x, dx⟩ = ∥x∥₂⁻¹⟨x, dx⟩.
To find the gradient, we bring df(x) to the canonical form df(x) = ⟨∇f(x), dx⟩:

df(x) = ⟨∥x∥₂⁻¹ x, dx⟩ ⇒ ∇f(x) = ∥x∥₂⁻¹ x.

Now find the second derivative:

d²f(x) = d(∥x∥₂⁻¹⟨x, dx₁⟩) = d(∥x∥₂⁻¹)⟨x, dx₁⟩ + ∥x∥₂⁻¹ d⟨x, dx₁⟩
= −∥x∥₂⁻² d(∥x∥₂)⟨x, dx₁⟩ + ∥x∥₂⁻¹⟨dx₂, dx₁⟩
= −∥x∥₂⁻² (∥x∥₂⁻¹⟨x, dx₂⟩)⟨x, dx₁⟩ + ∥x∥₂⁻¹⟨dx₂, dx₁⟩
= ∥x∥₂⁻¹⟨dx₂, dx₁⟩ − ∥x∥₂⁻³⟨x, dx₂⟩⟨x, dx₁⟩.

To find the Hessian, we bring d²f(x) to the canonical form d²f(x) = ⟨∇²f(x) dx₁, dx₂⟩:

d²f(x) = ∥x∥₂⁻¹(⟨dx₁, dx₂⟩ − ∥x∥₂⁻²⟨dx₁, x⟩⟨x, dx₂⟩) = ⟨∥x∥₂⁻¹(Iₙ − ∥x∥₂⁻² xxᵀ) dx₁, dx₂⟩
⇒ ∇²f(x) = ∥x∥₂⁻¹(Iₙ − ∥x∥₂⁻² xxᵀ).
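A small NumPy check of the Hessian formula (our own sketch); the last line also illustrates that ∇²f(x)x = 0, i.e. the norm has zero curvature along the ray through x, since f is linear along that ray:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 5
    x = rng.standard_normal(n)
    nx = np.linalg.norm(x)
    H = (np.eye(n) - np.outer(x, x) / nx**2) / nx      # claimed Hessian of ||x||_2

    grad = lambda y: y / np.linalg.norm(y)
    eps = 1e-6
    H_fd = np.column_stack([(grad(x + eps * e) - grad(x - eps * e)) / (2 * eps)
                            for e in np.eye(n)])
    print(np.max(np.abs(H - H_fd)))                    # matches finite differences
    print(np.linalg.norm(H @ x))                       # ~0: H annihilates x itself

Geometrically, ∇²f(x) is ∥x∥₂⁻¹ times the orthogonal projector onto the hyperplane orthogonal to x.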

Problem 5 (Logistic function). Find the first and second derivatives df(x) and d²f(x), as well as the
gradient ∇f(x) and the Hessian ∇²f(x), of the function

f(x) := ln(1 + exp(⟨a, x⟩)), x ∈ Rⁿ,

where a ∈ Rⁿ.
Solution. Find the first derivative:
 
df(x) = d(ln(1 + exp(⟨a, x⟩))) = {d(ln(x)) = dx/x} = d(1 + exp(⟨a, x⟩)) / (1 + exp(⟨a, x⟩)) = d(exp(⟨a, x⟩)) / (1 + exp(⟨a, x⟩))
= {d(exp(x)) = exp(x)dx} = exp(⟨a, x⟩) d⟨a, x⟩ / (1 + exp(⟨a, x⟩)) = exp(⟨a, x⟩)⟨a, dx⟩ / (1 + exp(⟨a, x⟩)) = ⟨a, dx⟩ / (1 + exp(−⟨a, x⟩))
= σ(⟨a, x⟩)⟨a, dx⟩.

Here σ : R → R denotes the sigmoid function:

σ(x) := 1 / (1 + exp(−x)).

To find the gradient, we bring df(x) to the canonical form df(x) = ⟨∇f(x), dx⟩:

df(x) = ⟨σ(⟨a, x⟩)a, dx⟩ ⇒ ∇f(x) = σ(⟨a, x⟩)a.

So the gradient ∇f(x) is a vector collinear to the vector a, with the coefficient σ(⟨a, x⟩) ∈ (0, 1). Depending on the
point x, only the length of the vector ∇f(x) changes, not its direction.
Now find the second derivative:

d²f(x) = d(σ(⟨a, x⟩)⟨a, dx₁⟩) = d(σ(⟨a, x⟩))⟨a, dx₁⟩ = {d(σ(x)) = σ′(x)dx} = (σ′(⟨a, x⟩) d⟨a, x⟩)⟨a, dx₁⟩
= σ′(⟨a, x⟩)⟨a, dx₂⟩⟨a, dx₁⟩ = {σ′(x) = σ(x)(1 − σ(x))} = σ(⟨a, x⟩)(1 − σ(⟨a, x⟩))⟨a, dx₂⟩⟨a, dx₁⟩.

To find the Hessian, we bring d²f(x) to the canonical form d²f(x) = ⟨∇²f(x) dx₁, dx₂⟩:

d²f(x) = σ(⟨a, x⟩)(1 − σ(⟨a, x⟩))⟨dx₁, a⟩⟨a, dx₂⟩ = ⟨(σ(⟨a, x⟩)(1 − σ(⟨a, x⟩)) aaᵀ) dx₁, dx₂⟩
⇒ ∇²f(x) = σ(⟨a, x⟩)(1 − σ(⟨a, x⟩)) aaᵀ.

Note that the Hessian ∇²f(x) is a rank-one matrix proportional to aaᵀ, with the coefficient σ(⟨a, x⟩)(1 −
σ(⟨a, x⟩)) ∈ (0, 0.25). The point x affects only the proportionality coefficient.
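Both formulas are easy to verify numerically. A minimal sketch, assuming NumPy (names and step sizes are our own choices):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 5
    a, x, h = rng.standard_normal((3, n))
    sigma = lambda s: 1.0 / (1.0 + np.exp(-s))
    f = lambda y: np.log1p(np.exp(a @ y))              # ln(1 + exp(<a, y>))

    eps = 1e-6
    fd1 = (f(x + eps * h) - f(x - eps * h)) / (2 * eps)
    print(abs(fd1 - sigma(a @ x) * (a @ h)))           # grad f = sigma(<a,x>) a

    s = sigma(a @ x)
    eps2 = 1e-4                                        # larger step for the 2nd difference
    fd2 = (f(x + eps2 * h) - 2 * f(x) + f(x - eps2 * h)) / eps2**2
    print(abs(fd2 - s * (1 - s) * (a @ h) ** 2))       # Hess f = s(1-s) aa^T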

Problem 6 (Logarithm of the determinant). Find the first and second derivatives df(X) and d²f(X), as
well as the gradient ∇f(X), of the function

f(X) := ln(Det(X)),

defined on the set Sⁿ₊₊ in the space Sⁿ.

Solution. Find the first derivative:

df(X) = d(ln Det(X)) = {d(ln(x)) = dx/x} = d(Det(X)) / Det(X) = Det(X)⟨X⁻¹, dX⟩ / Det(X) = ⟨X⁻¹, dX⟩.

Note that df(X) is already written in the canonical form df(X) = ⟨∇f(X), dX⟩ (in this example we are working
in the space of symmetric matrices Sⁿ, so the transpose sign on X⁻ᵀ can be omitted). So,

∇f(X) = X⁻¹.

Now find the second derivative:

d²f(X) = d⟨X⁻¹, dX₁⟩ = ⟨d(X⁻¹), dX₁⟩ = ⟨−X⁻¹(dX₂)X⁻¹, dX₁⟩ = −⟨X⁻¹(dX₂)X⁻¹, dX₁⟩.

The result is a bilinear form in the increments dX₁ and dX₂ in the matrix space.
Consider

D²f(X)[H, H] = −⟨X⁻¹HX⁻¹, H⟩.

We show that D²f(X)[H, H] ≤ 0 for all X ∈ Sⁿ₊₊ and H ∈ Sⁿ, i.e. that the function f is a
concave function. Indeed, decomposing X⁻¹ = X^{−1/2}X^{−1/2}, we rewrite D²f(X)[H, H] in the following
form:

D²f(X)[H, H] = −⟨X^{−1/2}HX^{−1/2}, X^{−1/2}HX^{−1/2}⟩ = −∥X^{−1/2}HX^{−1/2}∥_F².

From this it is clear that D²f(X)[H, H] is indeed nonpositive.
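A short NumPy sketch (our own, not from the text) checks the gradient against a finite difference of ln Det, and the sign of D²f(X)[H, H]:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 4
    M = rng.standard_normal((n, n))
    X = M @ M.T + n * np.eye(n)                        # X in S^n_++
    H = rng.standard_normal((n, n)); H = (H + H.T) / 2 # a symmetric increment

    f = lambda Y: np.linalg.slogdet(Y)[1]              # ln Det(Y), computed stably
    eps = 1e-6
    fd = (f(X + eps * H) - f(X - eps * H)) / (2 * eps)
    Xi = np.linalg.inv(X)
    print(abs(fd - np.sum(Xi * H)))                    # <X^{-1}, H> = Tr(X^{-1} H)
    print(-np.trace(Xi @ H @ Xi @ H) <= 0)             # D^2 f(X)[H,H] <= 0: concavity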

Problem 7. Find the derivative df(X) and the gradient ∇f(X) of the function

f(X) := ∥AX − B∥_F, X ∈ Rᵏˣⁿ,

where A ∈ Rᵐˣᵏ, B ∈ Rᵐˣⁿ.


Solution. Calculate d(∥X∥_F) separately:

d(∥X∥_F) = d(⟨X, X⟩^{1/2}) = {d(x^{1/2}) = (1/2)x^{−1/2} dx} = (1/2)⟨X, X⟩^{−1/2} d⟨X, X⟩
= (1/2)∥X∥_F⁻¹ · 2⟨X, dX⟩ = ∥X∥_F⁻¹⟨X, dX⟩.

Now we use the resulting formula to find df(X):

df(X) = d(∥AX − B∥_F) = ∥AX − B∥_F⁻¹ ⟨AX − B, d(AX − B)⟩ = ∥AX − B∥_F⁻¹ ⟨AX − B, A dX⟩.

To find the gradient, we bring df(X) to the canonical form df(X) = ⟨∇f(X), dX⟩:

df(X) = ⟨∥AX − B∥_F⁻¹ Aᵀ(AX − B), dX⟩ ⇒ ∇f(X) = ∥AX − B∥_F⁻¹ Aᵀ(AX − B).
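A quick numerical check of the gradient formula, as a NumPy sketch with random data of our own choosing:

    import numpy as np

    rng = np.random.default_rng(8)
    m, k, n = 6, 4, 3
    A = rng.standard_normal((m, k))
    B = rng.standard_normal((m, n))
    X, H = rng.standard_normal((2, k, n))
    f = lambda Y: np.linalg.norm(A @ Y - B)            # Frobenius norm by default

    eps = 1e-6
    fd = (f(X + eps * H) - f(X - eps * H)) / (2 * eps)
    R = A @ X - B
    G = A.T @ R / np.linalg.norm(R)                    # claimed gradient
    print(abs(fd - np.sum(G * H)))                     # df(X) = <grad f(X), dX>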

Problem 8. Find the derivative df(X) and the gradient ∇f(X) of the function

f(X) := Tr(AXBX⁻¹), X ∈ Rⁿˣⁿ, Det(X) ≠ 0,

where A, B ∈ Rⁿˣⁿ.

Solution. For convenience, we rewrite the trace as a scalar product:

f(X) = ⟨Iₙ, AXBX⁻¹⟩.

Find the first derivative:

df(X) = d⟨Iₙ, AXBX⁻¹⟩ = ⟨Iₙ, d(AXBX⁻¹)⟩ = ⟨Iₙ, (d(AXB))X⁻¹ + (AXB) d(X⁻¹)⟩
= ⟨Iₙ, (A(dX)B)X⁻¹ + (AXB)(−X⁻¹(dX)X⁻¹)⟩ = ⟨Iₙ, A(dX)BX⁻¹ − AXBX⁻¹(dX)X⁻¹⟩.

To find the gradient, we bring df(X) to the canonical form df(X) = ⟨∇f(X), dX⟩:

df(X) = ⟨Iₙ, A(dX)BX⁻¹⟩ − ⟨Iₙ, AXBX⁻¹(dX)X⁻¹⟩
= ⟨AᵀX⁻ᵀBᵀ, dX⟩ − ⟨X⁻ᵀBᵀXᵀAᵀX⁻ᵀ, dX⟩
= ⟨AᵀX⁻ᵀBᵀ − X⁻ᵀBᵀXᵀAᵀX⁻ᵀ, dX⟩
⇒ ∇f(X) = AᵀX⁻ᵀBᵀ − X⁻ᵀBᵀXᵀAᵀX⁻ᵀ.
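The gradient formula can be checked numerically. A minimal NumPy sketch (the diagonal shift used to keep X invertible is our own choice):

    import numpy as np

    rng = np.random.default_rng(9)
    n = 4
    A, B, H = rng.standard_normal((3, n, n))
    X = rng.standard_normal((n, n)) + n * np.eye(n)    # almost surely invertible

    f = lambda Y: np.trace(A @ Y @ B @ np.linalg.inv(Y))
    Xit = np.linalg.inv(X).T
    G = A.T @ Xit @ B.T - Xit @ B.T @ X.T @ A.T @ Xit  # claimed gradient

    eps = 1e-6
    fd = (f(X + eps * H) - f(X - eps * H)) / (2 * eps)
    print(abs(fd - np.sum(G * H)))                     # df(X) = <grad f(X), dX>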

Problem 9. Consider the scalar-argument function

ϕ(α) := f(x + αp), α ∈ R,

where x, p ∈ Rⁿ and f : Rⁿ → R is a twice continuously differentiable function. Find the first and second
derivatives ϕ′(α) and ϕ″(α) and express them in terms of the gradient ∇f(·) and the Hessian ∇²f(·).
Solution. In this problem, one must keep in mind that the differentiation is performed with respect to α, while x is a
constant vector.
Find the first derivative:

dϕ(α) = d_α(f(x + αp)) = {df(x) = ⟨∇f(x), dx⟩} = ⟨∇f(x + αp), d_α(x + αp)⟩
= ⟨∇f(x + αp), (dα)p⟩ = ⟨∇f(x + αp), p⟩ dα.

Here the last equality follows from the fact that dα is a scalar. Note that we have brought dϕ(α) to the
canonical form dϕ(α) = ϕ′(α)dα. Hence,

ϕ′(α) = ⟨∇f(x + αp), p⟩.

Now find the second derivative:

d²ϕ(α) = d_α(⟨∇f(x + αp), p⟩ dα₁) = ⟨d_α(∇f(x + αp)), p⟩ dα₁ = {d(∇f(x)) = ∇²f(x)dx}
= ⟨∇²f(x + αp) d_α(x + αp), p⟩ dα₁ = ⟨∇²f(x + αp)(dα₂)p, p⟩ dα₁
= ⟨∇²f(x + αp)p, p⟩ dα₁ dα₂.

Thus, from the canonical form d²ϕ(α) = ϕ″(α) dα₁ dα₂, we get

ϕ″(α) = ⟨∇²f(x + αp)p, p⟩.
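A sanity check on a function with a known gradient and Hessian, e.g. a quadratic as in Problem 1 (a NumPy sketch with our own test data):

    import numpy as np

    rng = np.random.default_rng(10)
    n = 5
    M = rng.standard_normal((n, n))
    A = M @ M.T                          # f below has grad Ay - b and Hessian A
    b, x, p = rng.standard_normal((3, n))
    f = lambda y: 0.5 * y @ A @ y - b @ y
    phi = lambda al: f(x + al * p)

    alpha, eps = 0.7, 1e-5
    y = x + alpha * p
    fd1 = (phi(alpha + eps) - phi(alpha - eps)) / (2 * eps)
    print(abs(fd1 - (A @ y - b) @ p))    # phi'(alpha) = <grad f(y), p>

    t = 0.5                              # exact for a quadratic f
    fd2 = (phi(alpha + t) - 2 * phi(alpha) + phi(alpha - t)) / t**2
    print(abs(fd2 - p @ A @ p))          # phi''(alpha) = <Hess f(y) p, p>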

Problem 10. Consider the scalar-argument function

ϕ(α) := ∥r(x + αp)∥₂, α ∈ R₊, r(x + αp) ≠ 0,

where x, p ∈ Rⁿ and r : Rⁿ → Rᵐ is a differentiable map. Find the derivative ϕ′(α) and express it in terms of
the Jacobi matrix J_r(·).

Solution. In this problem, as in the previous one, one must constantly remember that the differentiation
is performed with respect to α, while x is a constant vector.
Find the first derivative:

dϕ(α) = d_α(∥r(x + αp)∥₂) = {d(∥x∥₂) = ⟨x, dx⟩/∥x∥₂} = ⟨r(x + αp), d_α(r(x + αp))⟩ / ∥r(x + αp)∥₂ = {dr(x) = J_r(x)dx}
= ⟨r(x + αp), J_r(x + αp) d_α(x + αp)⟩ / ∥r(x + αp)∥₂ = ⟨r(x + αp), J_r(x + αp)(dα)p⟩ / ∥r(x + αp)∥₂
= (⟨r(x + αp), J_r(x + αp)p⟩ / ∥r(x + αp)∥₂) dα.

From here,

ϕ′(α) = ⟨r(x + αp), J_r(x + αp)p⟩ / ∥r(x + αp)∥₂.
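A quick check on a simple nonlinear map r of our own choosing (here r(y) = sin(y) componentwise, so J_r(y) = diag(cos y)), as a NumPy sketch:

    import numpy as np

    rng = np.random.default_rng(11)
    n = 4
    x, p = rng.standard_normal((2, n))
    r = lambda y: np.sin(y)                  # J_r(y) = diag(cos(y))
    phi = lambda al: np.linalg.norm(r(x + al * p))

    alpha, eps = 0.3, 1e-6
    fd = (phi(alpha + eps) - phi(alpha - eps)) / (2 * eps)
    y = x + alpha * p
    exact = r(y) @ (np.cos(y) * p) / np.linalg.norm(r(y))   # <r, J_r p> / ||r||_2
    print(abs(fd - exact))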

A Derivatives: theory
A.1 Definition
Let’s start by recalling the concept of a derivative.
For a function of a single variable f : R → R, its derivative at the point x is denoted by f ′ (x) and is
determined from the equality:

f (x + h) = f (x) + f ′ (x)h + o(h) for all sufficiently small h.

In other words, having fixed a point x, we want to approximate the change of the function, f(x + h) − f(x),
in a neighborhood of this point by a linear function of h, and f′(x)h is the best way to do this.
Let us now consider a more general situation.
Let U and V be finite-dimensional linear spaces with norms. The main examples of such spaces for us
will be numbers: R, vectors: Rn and matrices: Rn×m , as well as their combinations (Cartesian products).
Consider the function f : X → V , where X ⊆ U .

Definition A.1 (Differentiability). Let x ∈ X be an interior point of the set X, and let L : U → V be
a linear operator. We say that the function f is differentiable at the point x with derivative
L if the following decomposition holds for all sufficiently small h ∈ U:

f(x + h) = f(x) + L[h] + o(∥h∥). (A.1)

If for no linear operator L : U → V the function f is differentiable at the point x with derivative
L, we say that f is not differentiable at the point x. If the point x is not an interior point of
the set X, then the notion of differentiability of the function f at the point x is left undefined.

Remark A.2. The expression o(∥h∥) has its standard meaning:

f(x + h) − f(x) − L[h] = o(∥h∥) ⟺ lim_{h→0} ∥f(x + h) − f(x) − L[h]∥ / ∥h∥ = 0.

Remark A.3. Since the spaces U and V under consideration are finite-dimensional (and in a finite-
dimensional space all norms are topologically equivalent), it does not matter which specific norms are
used in the definition given above: if a function f is differentiable at x with the derivative L for one
choice of norms, then f will also be differentiable at x with the derivative L for any other choice of
norms.

Proposition A.4. Suppose that the function f is differentiable at x with the derivative L1 and is also
differentiable at x with the derivative L2 . Then L1 = L2 .

Thus, if the function f is differentiable at the point x, then its derivative L is defined in a unique way.
We will denote it with the symbol df (x).

Remark A.5. The object df depends on two arguments: the point x ∈ X at which we approximate
the function, and the increment h ∈ U laid off from that fixed point:

df : X × U → V, linear in the second argument, i.e. in «h».

Remark A.6. There are different notations for the derivative of the function f at the point x:

Df (x)[h] ≡ df (x)[h] ≡ Df (x)[∆x] ≡ df (x)[∆x].

They all mean the same thing. When working with the definition of the derivative, it is convenient to
specify the increment (h or ∆x) explicitly in square brackets. When calculating derivatives in practice,
using already known derivatives and the conversion rules, the increment in square brackets is usually
omitted: df(x), or even just df when it is clear what is meant.

So, the derivative of the function at the point x is the linear operator df(x) that best approximates
the increment of the function:
f(x + h) − f(x) ≈ Df(x)[h].
Another well-known and important concept is the directional derivative. It turns out
that, knowing the derivative of the function f, we can easily calculate its derivative along any direction h.

Proposition A.7. Let f be differentiable at the point x. Choose an arbitrary direction h. Then

Df(x)[h] = ∂f(x)/∂h := lim_{t→+0} [f(x + th) − f(x)] / t.

That is, to calculate ∂f(x)/∂h, the derivative of the function f along the direction h, it suffices to apply
Df(x)[·] to this direction.
The set of vectors

eᵢ = (0, …, 0, 1, 0, …, 0) ∈ Rⁿ, i = 1, …, n,

with the single 1 in the i-th position, is called the standard basis in Rⁿ.
If for some i the function has a (two-sided) derivative along the direction eᵢ, then it is called the partial
derivative with respect to the i-th coordinate:

∂f(x)/∂xᵢ := lim_{t→0} [f(x + teᵢ) − f(x)] / t = Df(x)[eᵢ].
Note that a function can be non-differentiable even if it has derivatives along all directions.

Example A.8. Consider the function f(x) = ∥x∥₂. Find its derivative along the direction h at the point
x = 0:

∂∥x∥₂/∂h |_{x=0} = lim_{t→+0} (∥0 + th∥₂ − ∥0∥₂)/t = lim_{t→+0} t∥h∥₂/t = ∥h∥₂.

If the function f(x) = ∥x∥₂ were differentiable at zero, then, by Proposition A.7,

Df(0)[h] = ∂f(0)/∂h = ∥h∥₂,

but the function h ↦ ∥h∥₂ is not linear, which contradicts the fact that the derivative is a linear operator.
So ∥x∥₂ is not differentiable at zero, although it has derivatives along all directions.

Function gradient; Jacobi matrix.

• In the case U = Rⁿ, V = R, the linear function Df(x)[h] can always be represented as a scalar
product with some vector:

Df(x)[h] = ⟨a_x, h⟩, where a_x ∈ Rⁿ is different for each x.

The vector a_x is called the gradient of the function f at the point x and is denoted by ∇f(x).
In the standard basis, the gradient of the function is represented as the vector of partial derivatives:

∇f(x) = (∂f/∂x₁(x), …, ∂f/∂xₙ(x)) ∈ Rⁿ.

Like all vectors in this text, it is a column vector.

• If U = Rⁿˣᵐ, V = R, the linear function Df(x)[H] can always be represented as a scalar product with
some matrix:

Df(x)[H] = ⟨A_x, H⟩, A_x, H ∈ Rⁿˣᵐ.

This matrix is also called the gradient of the function at the point x, ∇f(x) = A_x, and in the standard basis
(consisting of matrices that are all zeros except for a single one) it is written as the matrix of partial derivatives:

∇f(x) = (∂f/∂x_{ij}(x)), i = 1, …, n, j = 1, …, m.

• If U = Rᵐ, V = Rⁿ, the linear operator Df(x)[·] can, after fixing bases, always be represented by a matrix:

Df(x)[h] = J_x h, J_x ∈ Rⁿˣᵐ.

The matrix J_x is called the Jacobi matrix of the function f. In the standard basis, it consists of the partial
derivatives:

J_x = (∂fᵢ/∂xⱼ(x)), i = 1, …, n, j = 1, …, m.

Proposition A.9 (Differential calculus). Let U and V be vector spaces, X a subset of U, and x ∈ X
an interior point of X. The following properties hold:
(a) (Derivative of a constant) Let f : X → V be a constant function, i.e. there is v ∈ V such that
f(x′) = v for all x′ ∈ X. Then f is differentiable at x, and df(x) = 0.
(b) (Derivative of the identity function) Let f : X → V be the identity function, i.e. f(x′) = x′
for all x′ ∈ X. Then f is differentiable at x, and its derivative is also the identity function:
Df(x)[h] = h for all h ∈ U.
(c) (Linearity) Let f : X → V and g : X → V be functions differentiable at x, and let
c₁, c₂ ∈ R be numbers. Then the function (c₁f + c₂g) is also differentiable at x, and

d(c₁f + c₂g)(x) = c₁ df(x) + c₂ dg(x).

(d) (Product rule) Let α : X → R and f : X → V be functions. If α and f are differentiable at x, then the
function αf is also differentiable at x, and

D(αf)(x)[h] = (Dα(x)[h])f(x) + α(x)(Df(x)[h])

for all h ∈ U.
(e) (Composition rule) Let Y be a subset of V and f : X → Y a function. Also let W be a vector
space and g : Y → W a function. If f is differentiable at x, and g is differentiable at f(x), then their
composition (g ∘ f) : X → W (defined by (g ∘ f)(x) = g(f(x))) is also differentiable at x, and

D(g ∘ f)(x) = Dg(f(x)) ∘ Df(x), or, in more detail, D(g ∘ f)(x)[h] = Dg(f(x))[Df(x)[h]].

(f) (Quotient rule) Let α : X → R and f : X → V be functions. If α and f are differentiable at x,
and if α does not vanish on X, then the function (1/α)f is also differentiable at x, and

D((1/α)f)(x)[h] = [α(x)(Df(x)[h]) − (Dα(x)[h])f(x)] / α(x)²

for all h ∈ U.

Proof. The first four properties are proved directly from the definition, and the last one is derived from the product
and composition rules.
Note that the product rule in Proposition A.9 is stated only for the case when one of the functions is scalar. This is
natural, since in a vector space only multiplication by a scalar is defined, not multiplication by an arbitrary
element of the space. However, in some special cases the product rule remains true even when both
functions are non-scalar. For example, the following statement holds.

Proposition A.10. Let U be a vector space, X a subset of U, and x ∈ X an interior point of X. Let
f : X → Rᵐˣⁿ and g : X → Rⁿˣᵏ be matrix-valued functions. Suppose that f and g are differentiable
at x. Then the function fg is also differentiable at the point x, and

D(fg)(x)[h] = (Df(x)[h])g(x) + f(x)(Dg(x)[h])

for all h ∈ U. (Here the multiplication operation means matrix multiplication, so the order of the
factors is important.)

A.2 Second derivative


Let the function f : X → V be differentiable at each point x ∈ X ⊆ U .
Consider the derivative of the function f with a fixed increment h1 ∈ U as a function of x:

g(x) = Df (x)[h1 ].

Definition A.11. If the function g has a derivative at some point x, then it is called the second
derivative of the function f at the point x:

D²f(x)[h₁, h₂] := Dg(x)[h₂].

It can be shown that D²f(x)[h₁, h₂] is a bilinear function of h₁ and h₂.

By analogy, the third derivative D³f(x)[h₁, h₂, h₃], the fourth, and higher-order derivatives are defined.
If the derivative df(x) is a continuous function of x, then f is said to be continuously differentiable.
If the second derivative D²f(x) is continuous in x, then f is twice continuously differentiable.
For functions f : Rⁿ → R, the second derivative, like any bilinear form, can be represented by a
matrix:

D²f(x)[h₁, h₂] = ⟨H_x h₁, h₂⟩, H_x ∈ Rⁿˣⁿ.

The matrix H_x is called the Hessian of the function f at the point x and is usually denoted by ∇²f(x).
In the standard basis, this matrix consists of the second partial derivatives:

∇²f(x) = (∂²f/∂xᵢ∂xⱼ(x)), i, j = 1, …, n.

For a twice continuously differentiable function, its Hessian is a symmetric matrix:

∇²f(x) ∈ Sⁿ.

A.3 Taylor formula

For a twice continuously differentiable function, the Taylor formula holds:

f(x + h) = f(x) + Df(x)[h] + (1/2)D²f(x)[h, h] + o(∥h∥²).

For a function f : Rⁿ → R it can be written using the gradient and the Hessian:

f(x + h) = f(x) + ⟨∇f(x), h⟩ + (1/2)⟨∇²f(x)h, h⟩ + o(∥h∥²).

If the function has continuous derivatives up to order k inclusive, then the Taylor formula can
be written up to the k-th derivative:

f(x + h) = f(x) + Df(x)[h] + (1/2!)D²f(x)[h, h] + (1/3!)D³f(x)[h, h, h] + … + (1/k!)Dᵏf(x)[h, …, h] + o(∥h∥ᵏ).
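The o(∥h∥²) remainder can be observed numerically: for a smooth function, the second-order Taylor error should decay like the cube of the step. A sketch assuming NumPy, reusing the gradient and Hessian of ϕ(x) = ln⟨Ax, x⟩ from Section 1:

    import numpy as np

    rng = np.random.default_rng(12)
    n = 4
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)
    x, h = rng.standard_normal((2, n))
    f = lambda y: np.log(y @ A @ y)                    # smooth near x (A is PD)

    q, Ax = x @ A @ x, A @ x
    g = 2 * Ax / q                                     # gradient (Section 1 example)
    Hs = 2 * A / q - 4 * np.outer(Ax, Ax) / q**2       # Hessian (Section 1 example)

    for s in [1e-1, 1e-2, 1e-3]:
        taylor = f(x) + g @ (s * h) + 0.5 * (s * h) @ Hs @ (s * h)
        print(s, abs(f(x + s * h) - taylor))           # error decays like s^3 = o(s^2)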

A.4 Calculating tabular derivatives


Note: Throughout the following, ∥ · ∥ denotes (for short) the Euclidean norm for vectors and the spectral
(operator) norm for matrices.

Example A.12 (Linear function). Let c ∈ Rⁿ, and let f : Rⁿ → R be the function f(x) := ⟨c, x⟩. We show
that f is differentiable at an arbitrary point x ∈ Rⁿ and find its derivative df(x) : Rⁿ → R. To do this,
we fix an arbitrary increment of the argument h ∈ Rⁿ and calculate the corresponding increment of the
function:

f(x + h) − f(x) = ⟨c, x + h⟩ − ⟨c, x⟩ = ⟨c, h⟩.

Note that the mapping h ↦ ⟨c, h⟩ is linear. So the decomposition (A.1) holds for the function f with
L[h] := ⟨c, h⟩. Thus, the function f is differentiable at an arbitrary point x ∈ Rⁿ with the
derivative Df(x)[h] = ⟨c, h⟩.

Example A.13 (Quadratic form). Let A ∈ Rⁿˣⁿ, and let f : Rⁿ → R be the function f(x) := ⟨Ax, x⟩. We
fix an arbitrary point x ∈ Rⁿ and an arbitrary increment of the argument h ∈ Rⁿ and calculate the
corresponding increment of the function:

f(x + h) − f(x) = ⟨A(x + h), x + h⟩ − ⟨Ax, x⟩ = ⟨(A + Aᵀ)x, h⟩ + ⟨Ah, h⟩.

Note that the mapping h ↦ ⟨(A + Aᵀ)x, h⟩ is linear, and ⟨Ah, h⟩ = o(∥h∥), since the following chain
of inequalities is valid for all h ∈ Rⁿ:

|⟨Ah, h⟩| ≤ ∥h∥∥Ah∥ ≤ ∥A∥∥h∥².

Here the first inequality follows from the Cauchy-Bunyakovsky inequality, and the second
from the consistency of the matrix and vector norms. Thus, the function f is differentiable at an
arbitrary point x ∈ Rⁿ with the derivative Df(x)[h] = ⟨(A + Aᵀ)x, h⟩.

Example A.14 (Inverse matrix). Let S := {X ∈ Rⁿˣⁿ : Det(X) ≠ 0} be the set of all square
nondegenerate matrices of size n. Consider the function f : S → S that returns for each matrix X ∈ S
its inverse: f(X) := X⁻¹. We show that f is differentiable at any point X ∈ S. To do
this, fix an arbitrary sufficiently small increment H ∈ Rⁿˣⁿ (satisfying X + H ∈ S and
∥H∥ < 1/∥X⁻¹∥) and consider the corresponding increment of the function:

f(X + H) − f(X) = (X + H)⁻¹ − X⁻¹ = (X(Iₙ + X⁻¹H))⁻¹ − X⁻¹ = ((Iₙ + X⁻¹H)⁻¹ − Iₙ)X⁻¹.

Let us evaluate (Iₙ + X⁻¹H)⁻¹ separately. To do this, we expand this matrix into a Neumann series:ᵃ

(Iₙ + X⁻¹H)⁻¹ = Iₙ − X⁻¹H + Σ_{k=2}^∞ (−X⁻¹H)ᵏ.

Note that the series on the right-hand side of the last equality is absolutely convergent, since ∥X⁻¹H∥ < 1 for
sufficiently small H. We show that the sum of this series is o(∥H∥):

∥Σ_{k=2}^∞ (−X⁻¹H)ᵏ∥ ≤ Σ_{k=2}^∞ ∥(−X⁻¹H)ᵏ∥ ≤ Σ_{k=2}^∞ ∥X⁻¹∥ᵏ∥H∥ᵏ = ∥X⁻¹∥²∥H∥² / (1 − ∥X⁻¹∥·∥H∥).

Here the first inequality follows from the triangle inequality for the norm, and the second
from the submultiplicativity of the norm; then the sum of the geometric series is computed. Thus,

(Iₙ + X⁻¹H)⁻¹ = Iₙ − X⁻¹H + o(∥H∥).

Substituting this expression into the formula above for the increment of the function, we get

f(X + H) − f(X) = −X⁻¹HX⁻¹ + o(∥H∥).

Thus, the function f is differentiable at an arbitrary point X ∈ S with the derivative Df(X)[H] =
−X⁻¹HX⁻¹.

ᵃ We mean the expansion (Iₙ − A)⁻¹ = Σ_{k=0}^∞ Aᵏ, valid for any matrix A ∈ Rⁿˣⁿ such that ∥A∥ < 1. This formula
is a generalization of the well-known formula for the sum of a geometric series: (1 − q)⁻¹ = Σ_{k=0}^∞ qᵏ for any |q| < 1.
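The conclusion Df(X)[H] = −X⁻¹HX⁻¹ can also be observed numerically: the remainder (X + sH)⁻¹ − X⁻¹ + X⁻¹(sH)X⁻¹ should shrink like s², i.e. o(s). A NumPy sketch with random data of our own:

    import numpy as np

    rng = np.random.default_rng(13)
    n = 4
    X = rng.standard_normal((n, n)) + n * np.eye(n)    # almost surely invertible
    H = rng.standard_normal((n, n))
    Xi = np.linalg.inv(X)

    for s in [1e-1, 1e-2, 1e-3]:
        err = np.linalg.norm(np.linalg.inv(X + s * H) - (Xi - Xi @ (s * H) @ Xi))
        print(s, err)                                  # decays like s^2: remainder is o(s)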

Remark A.15. The derived formula for the derivative of the function X⁻¹ can be obtained very
simply by using the following trick. Consider the differential of the identity matrix, d(Iₙ). On the one hand,
since the matrix is constant, d(Iₙ) = 0. On the other hand, by the product rule, d(Iₙ) = d(XX⁻¹) =
(dX)X⁻¹ + X d(X⁻¹). Equating the two expressions, we get d(X⁻¹) = −X⁻¹(dX)X⁻¹, or, in another
form, d(X⁻¹)[H] = −X⁻¹HX⁻¹. Note, however, that this argument is not a complete proof of
the identity, since it assumes, but does not prove, the existence of the differential d(X⁻¹).

Example A.16 (Determinant of a matrix). Let f : Rⁿˣⁿ → R be the function f(X) := Det(X).
Consider an arbitrary point X ∈ Rⁿˣⁿ and an arbitrary increment H ∈ Rⁿˣⁿ. We will
assume that the matrix X is invertible. Write out the corresponding increment of the function:

f(X + H) − f(X) = Det(X + H) − Det(X) = Det(X(Iₙ + X⁻¹H)) − Det(X) = Det(X)(Det(Iₙ + X⁻¹H) − 1).

Let us evaluate Det(Iₙ + X⁻¹H) separately. To do this, we use the fact that the determinant of a
matrix equals the product of its eigenvalues. Let λ₁(X⁻¹H), …, λₙ(X⁻¹H) be the eigenvalues of
the matrix X⁻¹H (numbered in any order and possibly complex). Note that the eigenvalues of the
matrix Iₙ + X⁻¹H are 1 + λ₁(X⁻¹H), …, 1 + λₙ(X⁻¹H). Therefore

Det(Iₙ + X⁻¹H) = Π_{i=1}^n [1 + λᵢ(X⁻¹H)] = 1 + Σ_{i=1}^n λᵢ(X⁻¹H) + (Σ_{1≤i<j≤n} λᵢ(X⁻¹H)λⱼ(X⁻¹H) + …),

where the ellipsis hides the sums of all possible triples λᵢ(X⁻¹H)λⱼ(X⁻¹H)λₖ(X⁻¹H), all possible quadruples,
and so on. Note that the expression in parentheses is o(∥H∥). This follows from the triangle
inequality and the fact that for an arbitrary matrix A ∈ Rⁿˣⁿ, all its eigenvalues are bounded in absolute value
by its norm ∥A∥. (Indeed, let λ ∈ C be an eigenvalue of the matrix A, and let x ∈ Cⁿ \ {0} be a
corresponding eigenvector: Ax = λx. Then |λ|∥x∥ = ∥Ax∥ ≤ ∥A∥∥x∥.) So

Det(Iₙ + X⁻¹H) = 1 + Σ_{i=1}^n λᵢ(X⁻¹H) + o(∥H∥) = 1 + Tr(X⁻¹H) + o(∥H∥).

Substituting the resulting expression into the formula above for the increment of the function, we get

f(X + H) − f(X) = Det(X) Tr(X⁻¹H) + o(∥H∥).

Thus, for any invertible matrix X ∈ Rⁿˣⁿ, the function f is differentiable at the point X with the
derivative Df(X)[H] = Det(X) Tr(X⁻¹H) = Det(X)⟨X⁻ᵀ, H⟩.

Remark A.17. It can be shown that the function f(X) = Det(X) is differentiable
everywhere on Rⁿˣⁿ, and not just on the subset of invertible matrices. The general formula for the
derivative in this case is called the Jacobi formula and reads: Df(X)[H] = Tr(Adj(X)H), where
Adj(X) is the adjugate matrix of X. Note that if X is a nondegenerate matrix, then Adj(X) =
Det(X)X⁻¹, and the Jacobi formula turns into the formula proved above, Df(X)[H] = Det(X) Tr(X⁻¹H).
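Numerically, the first-order expansion Det(X + sH) ≈ Det(X)(1 + Tr(X⁻¹(sH))) exhibits the expected o(s) remainder, decaying like s². A minimal NumPy sketch (our own random data):

    import numpy as np

    rng = np.random.default_rng(14)
    n = 4
    X = rng.standard_normal((n, n)) + n * np.eye(n)    # almost surely invertible
    H = rng.standard_normal((n, n))
    dX, Xi = np.linalg.det(X), np.linalg.inv(X)

    for s in [1e-1, 1e-2, 1e-3]:
        approx = dX * (1 + np.trace(Xi @ (s * H)))     # Det(X)(1 + Tr(X^{-1} sH))
        print(s, abs(np.linalg.det(X + s * H) - approx))   # decays like s^2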
