
Some Important Properties for Matrix Calculus

Dawen Liang
Carnegie Mellon University
[email protected]

1 Introduction
Matrix computation plays an essential role in many machine learning algorithms, and matrix calculus is among the most commonly used tools. In this note we show that the familiar properties of differential calculus all carry over to matrix calculus¹. At the end, an example on least-squares linear regression is presented.

2 Notation
A matrix is represented by a bold upper-case letter, e.g. X, where the subscripts in X_{m,n} indicate that the matrix has m rows and n columns. A vector is represented by a bold lower-case letter, e.g. x, and in this note is always an n × 1 column vector. An important concept for an n × n matrix A_{n,n} is the trace Tr(A), which is defined as the sum of the diagonal elements:
\mathrm{Tr}(A) = \sum_{i=1}^{n} A_{ii} \qquad (1)

where A_{ii} denotes the element at the ith row and ith column.

3 Properties
The derivative of a function with respect to a matrix is usually referred to as the gradient, denoted ∇. Consider a function f : R^{m×n} → R; the gradient of f(A) w.r.t. A_{m,n} is:
\nabla_A f(A) = \frac{\partial f(A)}{\partial A} =
\begin{bmatrix}
  \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\
  \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\
  \vdots & \vdots & \ddots & \vdots \\
  \frac{\partial f}{\partial A_{m1}} & \frac{\partial f}{\partial A_{m2}} & \cdots & \frac{\partial f}{\partial A_{mn}}
\end{bmatrix}

This definition is very similar to the ordinary derivative, so a few simple properties hold (the matrix A below is a square matrix with the same dimension as the vectors):

\nabla_x\, b^T A x = b^T A \qquad (2)
¹ Some of the detailed derivations which are omitted in this note can be found at http://www.cs.berkeley.edu/~jduchi/projects/matrix_prop.pdf

\nabla_A\, XAY = Y^T X^T \qquad (3)
\nabla_x\, x^T A x = A x + A^T x \qquad (4)
\nabla_{A^T} f(A) = (\nabla_A f(A))^T \qquad (5)
where superscript T denotes the transpose of a matrix or a vector.
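
As an aside (not part of the original note), Eq. 4 can be sanity-checked numerically with finite differences. The Python/NumPy sketch below uses an arbitrary random square matrix, an arbitrary test point, and a hand-picked step size eps; it is only an illustration, not a derivation.

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # square matrix, as required for Eq. 4
x = rng.standard_normal(n)

f = lambda v: v @ A @ v           # f(x) = x^T A x, a scalar

# Central finite-difference approximation of the gradient at x
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

closed_form = A @ x + A.T @ x     # Eq. 4: Ax + A^T x
print(np.allclose(num_grad, closed_form, atol=1e-5))  # True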
Now let us turn to the derivative of the trace. First of all, a few useful properties of the trace itself:
\mathrm{Tr}(A) = \mathrm{Tr}(A^T) \qquad (6)
\mathrm{Tr}(ABC) = \mathrm{Tr}(BCA) = \mathrm{Tr}(CAB) \qquad (7)
\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B) \qquad (8)
which are all easily derived. Note that the second property (cyclic invariance) can be extended to a product of an arbitrary number of matrices; a quick numerical illustration is sketched below.
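
As a quick illustration (not in the original note), the sketch below checks the cyclic property (Eq. 7) numerically with arbitrarily chosen non-square random matrices whose shapes make all three products square:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

t1 = np.trace(A @ B @ C)   # Tr(ABC), a 2x2 product
t2 = np.trace(B @ C @ A)   # Tr(BCA), a 3x3 product
t3 = np.trace(C @ A @ B)   # Tr(CAB), a 4x4 product
print(np.isclose(t1, t2) and np.isclose(t2, t3))  # True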
Thus, for the derivatives,
\nabla_A \mathrm{Tr}(AB) = B^T \qquad (9)
Proof:
Simply expand Tr(AB) according to the trace definition (Eq. 1) and differentiate element by element.
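
For readers who want a concrete check of Eq. 9, here is a minimal finite-difference sketch; the shapes, the random seed, and the step size are arbitrary choices, not taken from the note:

import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))   # shapes chosen so that AB is square

f = lambda A: np.trace(A @ B)

# Element-wise central differences for the matrix gradient
eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n))
        E[i, j] = eps
        num_grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(num_grad, B.T, atol=1e-5))  # Eq. 9: the gradient is B^T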

\nabla_A \mathrm{Tr}(ABA^T C) = CAB + C^T A B^T \qquad (10)

Proof:

\begin{aligned}
\nabla_A \mathrm{Tr}(ABA^T C)
&= \nabla_A \mathrm{Tr}\big(\underbrace{(AB)}_{u(A)}\,\underbrace{(A^T C)}_{v(A^T)}\big) \\
&= \nabla_{A:u(A)} \mathrm{Tr}\big(u(A)\,v(A^T)\big) + \nabla_{A:v(A^T)} \mathrm{Tr}\big(u(A)\,v(A^T)\big) \\
&= \big(v(A^T)\big)^T \nabla_A u(A) + \Big(\nabla_{A^T:v(A^T)} \mathrm{Tr}\big(u(A)\,v(A^T)\big)\Big)^T \\
&= C^T A B^T + \big((u(A))^T \nabla_{A^T} v(A^T)\big)^T \\
&= C^T A B^T + (B^T A^T C^T)^T \\
&= CAB + C^T A B^T
\end{aligned}

Here we make use of the product rule for derivatives: (u(x)v(x))' = u'(x)v(x) + u(x)v'(x). The notation ∇_{A:u(A)} means taking the derivative with respect to A only through u(A); the same applies to ∇_{A^T:v(A^T)}. The chain rule is also used. Note that the conversion from ∇_{A:v(A^T)} to ∇_{A^T:v(A^T)} is based on Eq. 5.
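
The same element-wise finite-difference scheme gives a quick numerical check of Eq. 10; this is only an illustrative sketch with arbitrary random square matrices of a common size:

import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

f = lambda A: np.trace(A @ B @ A.T @ C)

eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        num_grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

closed_form = C @ A @ B + C.T @ A @ B.T   # Eq. 10: CAB + C^T A B^T
print(np.allclose(num_grad, closed_form, atol=1e-4))  # True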

4 An Example on Least-square Linear Regression


Now we will derive the solution of least-squares linear regression in matrix form, using the properties shown above. We know that least-squares linear regression has a closed-form solution (often referred to as the normal equation).

Assume we have N data points {x^{(i)}, y^{(i)}}_{1:N}, and the linear regression function h_θ(x) = θ^T x is parametrized by θ. We can arrange the data in matrix form:
X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(N)})^T \end{bmatrix}, \qquad
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}

Thus the error can be represented as:

X\theta - y = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(N)}) - y^{(N)} \end{bmatrix}

The squared error E(θ), following the usual scalar definition, is:


E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2

which is equivalent to the matrix form:


E(\theta) = \frac{1}{2} (X\theta - y)^T (X\theta - y)
Take the derivative:
\begin{aligned}
\nabla_\theta E(\theta)
&= \frac{1}{2} \nabla \underbrace{(X\theta - y)^T (X\theta - y)}_{1 \times 1 \text{ matrix, thus } \mathrm{Tr}(\cdot) = (\cdot)} \\
&= \frac{1}{2} \nabla \mathrm{Tr}\big(\theta^T X^T X \theta - y^T X \theta - \theta^T X^T y + y^T y\big) \\
&= \frac{1}{2} \Big(\nabla \mathrm{Tr}(\theta^T X^T X \theta) - \nabla \mathrm{Tr}(y^T X \theta) - \nabla \mathrm{Tr}(\theta^T X^T y)\Big) \\
&= \frac{1}{2} \Big(\nabla \mathrm{Tr}(\theta\, I\, \theta^T X^T X) - (y^T X)^T - X^T y\Big)
\end{aligned}
The constant term y^T y vanishes since it does not depend on θ. The first term can be computed using Eq. 10, with A = θ, B = I, and C = X^T X (note that in this case C = C^T). Plugging back into the derivation:
\begin{aligned}
\nabla_\theta E(\theta) &= \frac{1}{2}\big(X^T X \theta + X^T X \theta - 2 X^T y\big) \\
&= \frac{1}{2}\big(2 X^T X \theta - 2 X^T y\big)
\end{aligned}

Setting the gradient to 0:

X^T X \theta = X^T y \;\Longrightarrow\; \theta_{LS} = (X^T X)^{-1} X^T y

The normal equation is obtained in matrix form.
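
To connect the result back to code, the sketch below builds a small synthetic data set (the sizes, the true parameter vector, and the noise level are arbitrary choices) and compares the normal-equation solution with NumPy's built-in least-squares solver. Solving the linear system with np.linalg.solve, rather than forming the explicit inverse (X^T X)^{-1}, is a standard numerical choice.

import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = rng.standard_normal((N, d))                      # rows are the (x^(i))^T
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(N)    # noisy targets

# Normal equation: X^T X theta = X^T y
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_ls, theta_ref))   # True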
