Linear Algebra Tutorial


Linear algebra for computer vision

Bharath Hariharan
January 15, 2020

1 Vector spaces
Definition 1 A vector space V is a nonempty set of objects v, with two operations defined on them:
multiplication by a scalar c (belonging to a field; here let’s assume c is a real number), denoted as cv, and
addition of two vectors, denoted as u + v that satisfy the following properties:
1. The vector space is closed under both addition and scalar multiplication. That is, if u, v ∈ V and c is
a real number, then cu ∈ V and u + v ∈ V .

2. Vector addition is commutative and associative. That is, u + v = v + u, and u + (v + w) = (u + v) + w
for all u, v, w ∈ V .
3. There is a zero vector 0 ∈ V s.t u + 0 = u for all u ∈ V .
4. For every vector u ∈ V , there exists −u ∈ V s.t u + (−u) = 0.

5. Scalar multiplication distributes over addition: c(u + v) = cu + cv, and (c + d)u = cu + du.
6. c(du) = (cd)u.
7. 1u = u.

A vector space is best thought of as a generalization of the Cartesian plane. Consider the Cartesian
plane, which is the set of all points (x, y), where x and y are real numbers. Define addition to be
element-wise addition: (x, y) + (x′, y′) = (x + x′, y + y′). Similarly, define scalar multiplication to be
element-wise: c(x, y) = (cx, cy). Define the zero vector to be (0, 0). For u = (x, y), define
−u = (−1)u = (−x, −y). Test each of the properties described above and make sure that they are indeed true.
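The check suggested above can be sketched in a few lines of Python. This is only a spot check on a handful of sample values (the vectors u, v, w and scalars c, d below are arbitrary choices, not part of the notes), so it illustrates the properties rather than proving them.

```python
# Vectors in the Cartesian plane are modelled as 2-tuples of floats.
def add(u, v):
    """Element-wise addition: (x, y) + (x', y') = (x + x', y + y')."""
    return (u[0] + v[0], u[1] + v[1])

def scale(c, u):
    """Element-wise scalar multiplication: c(x, y) = (cx, cy)."""
    return (c * u[0], c * u[1])

def neg(u):
    """Additive inverse, defined as (-1)u."""
    return scale(-1.0, u)

ZERO = (0.0, 0.0)

# Arbitrary sample values, chosen to be exactly representable as floats.
u, v, w = (1.0, 2.0), (-3.0, 0.5), (4.0, -1.5)
c, d = 2.0, -0.5

checks = {
    "commutative": add(u, v) == add(v, u),
    "associative": add(u, add(v, w)) == add(add(u, v), w),
    "zero vector": add(u, ZERO) == u,
    "additive inverse": add(u, neg(u)) == ZERO,
    "distributes over vectors": scale(c, add(u, v)) == add(scale(c, u), scale(c, v)),
    "distributes over scalars": scale(c + d, u) == add(scale(c, u), scale(d, u)),
    "c(du) = (cd)u": scale(c, scale(d, u)) == scale(c * d, u),
    "1u = u": scale(1.0, u) == u,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```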
Points (x, y) in the Cartesian plane can be thought of in computer science parlance as numeric arrays of
size 2. We can in fact produce a more general example by considering the set of numeric arrays of size d,
denoted as $\mathbb{R}^d$. Here $\mathbb{R}$ denotes the fact that components of each array are real numbers,
and d denotes the number of components in each array. Thus, each element in $\mathbb{R}^d$ is represented as
$[x_1, x_2, \ldots, x_d]$. Addition and scalar multiplication are element-wise as above, and the zero vector
is the vector of all zeros.
What about the set of two dimensional numeric arrays? A two dimensional array, or a matrix, has rows
and columns. An n × m matrix has n rows and m columns. If we consider the set of all n × m matrices,
then we can denote this set as $\mathbb{R}^{n \times m}$ as before. Again, we can define addition and scalar
multiplication element-wise. Convince yourself that this is indeed a vector space. Observe that the set of
gray-scale images of size n × m is exactly this vector space.

2 Bases and dimensionality


The point (x, y) on the Cartesian plane can be thought of as “x units along the X axis and y units along the
Y axis”. Or, more precisely, (x, y) = x(1, 0) + y(0, 1). In other words, any vector u in the Cartesian plane
$\mathbb{R}^2$ can be represented as a linear combination of only two vectors, (1, 0) and (0, 1).
Similarly, consider the vector space $\mathbb{R}^d$. Consider the vectors $e_1 = [1, 0, 0, \ldots, 0]$,
$e_2 = [0, 1, 0, \ldots, 0]$, $e_3 = [0, 0, 1, \ldots, 0]$ and so on. Thus $e_i$ has a 1 in the i-th position
and is 0 everywhere else. Now consider any vector $u = [x_1, x_2, \ldots, x_d]$. Then $u = \sum_i x_i e_i$.
Again, any vector in $\mathbb{R}^d$ can be represented as a linear combination of the $e_i$'s.
What about the vector space of all n × m images, $\mathbb{R}^{n \times m}$? Recall that every element in this
vector space is an n × m matrix. Consider the matrix $e_{ij}$, which has a 1 in the (i, j)-th position and is
zero everywhere else. Then any vector
\[ u = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix} \]
in $\mathbb{R}^{n \times m}$ can be written as $\sum_{i,j} x_{ij} e_{ij}$.
Thus it seems that the set of vectors $B_2 = \{(1, 0), (0, 1)\}$ in $\mathbb{R}^2$, the set
$B_d = \{e_i ;\ i = 1, \ldots, d\}$ in $\mathbb{R}^d$ and the set
$B_{n \times m} = \{e_{ij} ;\ i = 1, \ldots, n;\ j = 1, \ldots, m\}$ in $\mathbb{R}^{n \times m}$ are all
special in some way. Let us make this more concrete.
We first need two definitions.
Definition 2 Let V be a vector space, and suppose U ⊂ V . Then a vector v ∈ V is said to be in the span
of U if it is a linear combination of the vectors in U , that is, if $v = \sum_{i=1}^{n} \alpha_i u_i$ for
some $u_i \in U$ and some scalars $\alpha_i$. The span of U is the set of all such vectors v which can be
expressed as linear combinations of vectors in U .
Thus, in $\mathbb{R}^2$, the span of the set $B_2 = \{(0, 1), (1, 0)\}$ is all of $\mathbb{R}^2$, since every
vector in $\mathbb{R}^2$ can be expressed as a linear combination of vectors in $B_2$.

Definition 3 Let V be a vector space. A set of vectors $U = \{u_1, \ldots, u_n\} \subset V$ is linearly
dependent if there exist scalars $\alpha_1, \ldots, \alpha_n$, not all of them 0, such that
$\sum_i \alpha_i u_i = 0$. If no such $\alpha_i$'s exist, then the set of vectors U is linearly independent.

Consider, for example, the set {(1, 0), (0, 1), (1, −1)}. Then, because (1, 0) − (0, 1) − (1, −1) = 0, this
set is in fact linearly dependent. An equivalent definition for a linearly independent set is that no vector
in the set is a linear combination of the others. This is because if $u_1$ is a linear combination of
$u_2, \ldots, u_n$, then:
\begin{align}
u_1 &= \sum_{i=2}^{n} \alpha_i u_i \tag{1} \\
\Rightarrow 0 &= \sum_{i=2}^{n} \alpha_i u_i - u_1 \tag{2} \\
\Rightarrow 0 &= \sum_{i=1}^{n} \alpha_i u_i \quad \text{where } \alpha_1 = -1 \tag{3}
\end{align}
which in turn implies that $U = \{u_1, u_2, \ldots, u_n\}$ is linearly dependent.
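As a small illustration of the example above, the following Python sketch evaluates the linear combination with coefficients 1, −1, −1 and confirms that it is the zero vector. The helper name combine is made up for this sketch.

```python
def combine(coeffs, vectors):
    """Return the linear combination sum_i coeffs[i] * vectors[i] as a tuple."""
    dim = len(vectors[0])
    return tuple(sum(a * v[k] for a, v in zip(coeffs, vectors)) for k in range(dim))

U = [(1, 0), (0, 1), (1, -1)]
alphas = [1, -1, -1]  # not all zero

# The combination is the zero vector, so U is linearly dependent.
print(combine(alphas, U))

# Equivalent view: one vector is a linear combination of the others,
# here (1, -1) = 1*(1, 0) + (-1)*(0, 1).
print(combine([1, -1], U[:2]) == U[2])
```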


Note that the sets $B_2$, $B_d$ and $B_{n \times m}$ above are all linearly independent. Let us prove this via
contradiction for $B_d$. Suppose that $B_d$ is in fact linearly dependent. Then, there exist
$\alpha_1, \ldots, \alpha_d$, not all 0, s.t.
\begin{align}
\sum_{i=1}^{d} \alpha_i e_i &= 0 \tag{5} \\
\Rightarrow [\alpha_1, \alpha_2, \ldots, \alpha_d] &= 0 \tag{6} \\
\Rightarrow \alpha_i &= 0 \ \forall i \tag{7}
\end{align}
which contradicts our assumption.

Definition 4 Let V be a vector space. A set of vectors U ⊂ V is a basis for V if:


1. The span of U is V , that is, every vector in V can be written as a linear combination of vectors from
U , and
2. U is linearly independent.

Thus $B_2$ is a basis for $\mathbb{R}^2$, $B_d$ is a basis for $\mathbb{R}^d$ and $B_{n \times m}$ is a basis
for $\mathbb{R}^{n \times m}$. However, note that a given vector space can have more than a single basis. For
example, $B_2' = \{(0, 1), (1, 1)\}$ is also a basis for $\mathbb{R}^2$: any vector
(x, y) = x(1, 1) + (y − x)(0, 1), and it can be shown that (1, 1) and (0, 1) are linearly independent.
However, here is a crucial fact: all basis sets for a vector space have the same number of elements.

Definition 5 The number of elements in a basis for a vector space is called the dimensionality of the
vector space.

Thus the dimensionality of $\mathbb{R}^2$ is 2, the dimensionality of $\mathbb{R}^d$ is d and the
dimensionality of $\mathbb{R}^{n \times m}$ is nm. In general, the dimensionality of vector spaces can be
infinite, but in computer vision we will only encounter finite-dimensional vector spaces.

2.1 Coordinate representation for vectors


A basis gives us a way to represent vectors in a unified way independent of the vector space. Let V be
a vector space. Fix a basis $B = \{b_1, \ldots, b_n\}$ for V . Then any vector x in V can be represented as
$x = \sum_{i=1}^{n} \alpha_i b_i$ for some $\alpha_i$. We represent this vector as a column of numbers:
\[ \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{bmatrix} \]
Note that such a representation can be used only when we have chosen a basis. For some vector spaces
such as $\mathbb{R}^d$, if no basis is mentioned, we will use $B_d$; this is called the canonical basis.
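To make the dependence on the chosen basis concrete, here is a minimal Python sketch that computes the coordinates of a vector in the plane with respect to two different bases. The function coordinates is a hypothetical helper written for this sketch; it solves the 2 × 2 system with Cramer's rule and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

def coordinates(x, b1, b2):
    """Coefficients (a1, a2) such that a1*b1 + a2*b2 == x, via Cramer's rule."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent, so they do not form a basis")
    a1 = Fraction(x[0] * b2[1] - b2[0] * x[1], det)
    a2 = Fraction(b1[0] * x[1] - x[0] * b1[1], det)
    return (a1, a2)

x = (3, 5)

# Canonical basis: the coordinates are just the components of x.
print(coordinates(x, (1, 0), (0, 1)))

# Basis {(1, 1), (0, 1)}: (x, y) = x*(1, 1) + (y - x)*(0, 1), so we expect (3, 2).
print(coordinates(x, (1, 1), (0, 1)))
```

The same point of the plane gets different columns of numbers under different bases, which is why a coordinate representation is only meaningful once the basis is fixed.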

3 Norms, distances and angles


For many vector spaces, we also want to have a notion of lengths and distances, and similarities between two
vectors. For example, we may want to ask, how far is the point (4, 5) from the point (0, 0)? In the Cartesian
plane, there is a natural notion of distance. For example, to get to (4, 5) from (0, 0), we will have to go 4
units along X and 5 units along Y; then we can use Pythagoras' theorem to compute the distance.
The generalization of this computation is a norm. Although there are many different norms for many
vector spaces, for our purposes in computer vision, we mainly deal with the $L_2$ norm for $\mathbb{R}^d$,
which is denoted by $\|\cdot\|_2$ or $\|\cdot\|$. This norm is defined as follows:
$\|x\| = \sqrt{x_1^2 + x_2^2 + \ldots + x_d^2}$.
The norm of a vector x is its length. The distance between two vectors x and y is the length of x − y.
Often, we will talk of the direction of a vector x, which is just x multiplied by a scalar to make its norm 1:
$\frac{x}{\|x\|}$. Such a vector is also called a unit vector.
Another useful quantity is the dot product or inner product between two vectors, denoted as
$\langle \cdot, \cdot \rangle$. Again, there are many possible inner products we can use. However, the most
common inner product for $\mathbb{R}^d$ is given by
$\langle x, y \rangle = x_1 y_1 + x_2 y_2 + \ldots + x_d y_d$. Observe that, with this definition of inner
product, $\langle x, x \rangle = \|x\|^2$.
This inner product generalizes the dot product you may have encountered in 3D geometry. It can be
shown that with this definition, $\langle x, y \rangle = \|x\| \|y\| \cos\theta$, where θ is the angle between
x and y. Thus an inner product of 0 between two non-zero vectors indicates that they are perpendicular (or
orthogonal) to each other. This connection with the angle also motivates the use of this inner product as a
measure of similarity between two vectors.
As a matter of notation, note that vectors written as column vectors can also be thought of as d × 1
matrices. As such, the inner product described above is also equivalent to $x^T y$, and is commonly represented
in this way.
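The quantities in this section can be sketched directly from their definitions. The function names below (dot, norm, distance, direction, angle) are chosen for this sketch; the first printed value answers the question posed at the start of the section, namely the distance from (0, 0) to (4, 5).

```python
import math

def dot(x, y):
    """Inner product <x, y> = x_1 y_1 + ... + x_d y_d."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """L2 norm: the square root of <x, x>."""
    return math.sqrt(dot(x, x))

def distance(x, y):
    """Distance between x and y: the norm of x - y."""
    return norm([a - b for a, b in zip(x, y)])

def direction(x):
    """Unit vector x / ||x|| (x must be non-zero)."""
    n = norm(x)
    return [a / n for a in x]

def angle(x, y):
    """Angle between non-zero x and y, from <x, y> = ||x|| ||y|| cos(theta)."""
    cos_theta = dot(x, y) / (norm(x) * norm(y))
    # Clamp to guard against rounding pushing the value slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))

print(distance((4, 5), (0, 0)))        # equals sqrt(4^2 + 5^2) = sqrt(41)
print(direction((3, 4)))               # a unit vector along (3, 4)
print(dot((1, 0), (0, 1)))             # 0, so the two vectors are orthogonal
print(math.degrees(angle((1, 0), (1, 1))))
```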

4 Linear transformations
Suppose we have two vector spaces U and V . We are interested in functions that map from one to the
other. Perhaps the most important class of these functions in terms of their practical uses as well as ease of
understanding is the class of linear transformations.

Definition 6 Consider two vector spaces U and V . A function f : U → V is a linear transformation if
$f(\alpha u_1 + \beta u_2) = \alpha f(u_1) + \beta f(u_2)$ for all scalars α, β and for all $u_1, u_2 \in U$.
Let f : U → V be a linear transformation. Suppose that we have fixed a basis $B_U = \{b_1, \ldots, b_m\}$ for
U , and a basis $B_V = \{a_1, \ldots, a_n\}$ for V . Consider the vectors $f(b_j)$, j = 1, . . . , m. Since these
are vectors in V , they can be expressed as a linear combination of vectors in $B_V$. Thus:
\[ f(b_j) = \sum_{i=1}^{n} M_{ij} a_i \tag{8} \]
for some coefficients $M_{ij}$.


Now consider an arbitrary vector u in U . It should be expressible as a linear combination of the basis
vectors, so $u = \sum_{j=1}^{m} u_j b_j$. We can now figure out what f (u) should be:
\begin{align}
f(u) &= f\Big(\sum_{j=1}^{m} u_j b_j\Big) \tag{9} \\
&= \sum_{j=1}^{m} u_j f(b_j) \quad \text{(by linearity of } f\text{)} \tag{10} \\
&= \sum_{j=1}^{m} \sum_{i=1}^{n} M_{ij} u_j a_i \tag{11}
\end{align}
Now we can express u as a column vector of coefficients
$\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix}$. If we express f (u) as a column vector
similarly, we can see that:
\[ f(u) = M u \tag{12} \]
where
\[ M = \begin{bmatrix} M_{11} & \cdots & M_{1m} \\ \vdots & \ddots & \vdots \\ M_{n1} & \cdots & M_{nm} \end{bmatrix}. \]
Thus every linear transformation can be expressed as a matrix multiplication. The matrix
encodes how each basis vector gets transformed; the linearity of the transformation means that this
information is enough to predict how everything else will be transformed. In particular, the j-th column of
this matrix is the transformed j-th basis vector (Equation (8)). You should also be able to prove that every
matrix multiplication is a linear transformation.
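The construction above can be sketched as follows: apply f to each basis vector, use the results as the columns of M, and compare M u with f (u). The map f below is a made-up linear map from $\mathbb{R}^2$ to $\mathbb{R}^3$ used purely for illustration, and both spaces are assumed to use their canonical bases.

```python
def matrix_of(f, basis):
    """Matrix whose j-th column is f(basis[j]), expressed in the canonical basis."""
    columns = [f(b) for b in basis]
    n = len(columns[0])
    return [[col[i] for col in columns] for i in range(n)]

def matvec(M, u):
    """Matrix-vector product: (M u)_i = sum_j M_ij u_j."""
    return [sum(M[i][j] * u[j] for j in range(len(u))) for i in range(len(M))]

def f(u):
    """A hypothetical linear map from R^2 to R^3 (illustration only)."""
    x, y = u
    return [x + 2 * y, 3 * x, -y]

canonical_basis = [[1, 0], [0, 1]]
M = matrix_of(f, canonical_basis)
print(M)  # columns are f(e_1) and f(e_2)

u = [5, -7]
print(matvec(M, u), f(u))  # the two results agree
```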

4.1 Change of basis


A special case of a linear transformation is when we want to change the basis. In this case, U and V above
are the same vector space, but with different bases. Given a vector u represented in the basis $B_U$, we want
to know its representation in the basis $B_V$. Again, this can be represented using a matrix multiplication
M u. Here, the j-th column of M is the j-th basis vector in $B_U$, namely $b_j$, represented in the basis
$B_V$.
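As a sketch of this recipe, take $B_U$ to be the canonical basis of the plane and $B_V$ to be the basis {(0, 1), (1, 1)} that appeared earlier as an alternative basis. The helper coords_in_BV hard-codes the identity (x, y) = (y − x)(0, 1) + x(1, 1); it and the other names are illustrative choices, not part of the notes.

```python
def coords_in_BV(p):
    """Coordinates of p = (x, y) in B_V = {(0, 1), (1, 1)}: (x, y) = (y - x)*(0, 1) + x*(1, 1)."""
    x, y = p
    return [y - x, x]

def matvec(M, u):
    return [sum(m * ui for m, ui in zip(row, u)) for row in M]

B_U = [[1, 0], [0, 1]]  # canonical basis

# The j-th column of M is the j-th vector of B_U written in the basis B_V.
columns = [coords_in_BV(b) for b in B_U]
M = [[col[i] for col in columns] for i in range(2)]
print(M)

u = [3, 5]                    # a vector given in the canonical basis
u_in_BV = matvec(M, u)
print(u_in_BV)

# Sanity check: rebuild the original point from its B_V coordinates.
B_V = [(0, 1), (1, 1)]
rebuilt = [sum(c * b[k] for c, b in zip(u_in_BV, B_V)) for k in range(2)]
print(rebuilt)
```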

4.2 Rank and nullity


Consider a linear transformation from vector space U to V corresponding to the matrix M . As can be seen
above, the output of this transformation is a linear combination of the matrix column vectors
\[ m_j = \begin{bmatrix} M_{1j} \\ \vdots \\ M_{nj} \end{bmatrix}. \]
Thus, the set of all possible outputs is the span of the matrix columns. Note that this span need not be the
whole space V . It can be a subset of V , but still a vector space (it should be easy to verify that the set of
all linear combinations of a set of vectors is itself a vector space, a subset of the original space; often
called a subspace). The dimensionality of this output subspace is called the rank of the matrix M . If the rank
is as large as it can possibly be, that is, equal to the smaller of the number of rows and the number of
columns of M , the matrix M is considered full rank. For a square matrix this means the rank equals the
dimensionality of V .
As another useful property, consider the set of vectors u in U s.t M u = 0. Again, it can be shown that
this set is a vector space, and is thus a subspace of U . The dimensionality of this subspace is called the
nullity of the matrix M .
One of the most useful theorems of linear algebra is that rank + nullity = number of columns of
M.
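A small numerical illustration of rank and nullity, under the assumption that rank is computed by Gaussian elimination (one standard way to do it; the notes do not prescribe a method). The example matrix has 3 columns and rank 1, so its null space should have dimension 2; the two null vectors listed were found by hand.

```python
from fractions import Fraction

def rank(M):
    """Rank of M via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in M]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r

def matvec(M, u):
    return [sum(m * ui for m, ui in zip(row, u)) for row in M]

M = [[1, 2, 3],
     [2, 4, 6]]

r = rank(M)
nullity = len(M[0]) - r  # rank + nullity = number of columns
print(r, nullity)

# Two linearly independent vectors that M maps to 0 (found by hand).
for v in [(2, -1, 0), (3, 0, -1)]:
    print(v, matvec(M, v))
```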

4.3 Properties of matrices


Because matrices represent linear transformations between vector spaces, a lot of linear algebra is devoted
to studying them. For our purposes, there are several properties of matrices that you must know:
1. The matrix I is such that $I_{ii} = 1$ for all i, and $I_{ij} = 0$ if j ≠ i. This is the identity matrix and
represents an identity transformation: Ix = x.
2. Matrices can be multiplied together as follows. If C = AB, then $C_{ij} = \sum_k A_{ik} B_{kj}$. Note that
this means that A and B can only be multiplied together if the number of columns of A matches the number
of rows of B. The product of an n × m matrix and an m × l matrix is n × l.

3. Matrix multiplication is associative (A(BC) = (AB)C) and distributes over addition (A(B + C) =
AB + AC) but it is not commutative in general (i.e., AB need not equal BA).
4. For a matrix A, we can construct another matrix B s.t. $B_{ij} = A_{ji}$. B is called the transpose of A and
is denoted as $A^T$.
5. If $A = A^T$, A is symmetric. If $A = -A^T$, A is skew-symmetric.

6. For a matrix A, if B is such that AB = I, B is a right-inverse of A. If BA = I, B is a left-inverse of
A. If BA = AB = I, B is an inverse of A. Inverses, when they exist, are unique and are denoted as
$A^{-1}$.
7. Square matrices A such that $AA^T = A^T A = I$ are called orthonormal matrices (they are also widely
known as orthogonal matrices). Note that for any orthonormal matrix R, if $r_j$ is the j-th column of R, then
$r_j^T r_j = 1$ and $r_j^T r_i = 0$ for i ≠ j. Thus, columns of an orthonormal matrix have unit norm and are
orthogonal to each other.
8. The trace of a matrix is the sum of its diagonal elements (i.e., $\mathrm{trace}(A) = \sum_i A_{ii}$).
9. Another useful function of a square matrix is the determinant, denoted as det(A). The determinant
can be difficult to compute. For a 2 × 2 matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the
determinant is ad − bc. For a diagonal matrix it is the product of the diagonal entries (which are also its
eigenvalues). For square matrices that are not full rank, the determinant is 0. The determinant of a product
satisfies det(AB) = det(A) det(B).
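Several of the properties in the list above can be checked on small examples. The sketch below implements the product, transpose, trace and 2 × 2 determinant straight from the formulas in the list; the matrices A, B and R are arbitrary examples picked for this sketch.

```python
def matmul(A, B):
    """C = AB with C_ij = sum_k A_ik B_kj; needs cols(A) == rows(B)."""
    if len(A[0]) != len(B):
        raise ValueError("number of columns of A must match number of rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    """B_ij = A_ji."""
    return [list(col) for col in zip(*A)]

def trace(A):
    """Sum of the diagonal elements."""
    return sum(A[i][i] for i in range(len(A)))

def det2(A):
    """Determinant ad - bc of a 2 x 2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

I = [[1, 0], [0, 1]]
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

print(matmul(A, B), matmul(B, A))              # different: multiplication is not commutative
print(matmul(A, I) == A)                       # the identity leaves A unchanged
print(trace(A), det2(A))
print(det2(matmul(A, B)), det2(A) * det2(B))   # det(AB) = det(A) det(B)

R = [[0, -1], [1, 0]]                          # rotation by 90 degrees
print(matmul(R, transpose(R)) == I, det2(R))   # orthonormal, with determinant 1
```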

5 Understanding and decomposing linear transformations


Because matrices and linear transforms are so important, it is worthwhile delving into them a bit more. In
particular, let us focus on linear transformations $f : \mathbb{R}^n \to \mathbb{R}^n$, which correspond to n × n matrices M .
Consider an input vector x, and the corresponding output vector M x. We want to understand how M x
relates to x, and what that means in terms of properties of M .
Table 1 shows three kinds of 2 × 2 matrices and their effect on a set of 2D points spread evenly on the
unit circle:

Column 1: Input points (no matrix applied)
Column 2: Scaling, $\begin{bmatrix} 0.3 & 0 \\ 0 & 1.4 \end{bmatrix}$
Column 3: Rotation, $\begin{bmatrix} 0.5 & 0.866 \\ -0.866 & 0.5 \end{bmatrix}$
Column 4: General transformation, $\begin{bmatrix} 0.38 & -0.88 \\ 1.18 & -0.63 \end{bmatrix}$
[Plots of the input points and of the transformed points for each column]

Table 1: Three different matrices and their action on a set of points.

1. Column 2 shows a scaling (non-isometric), which involves stretching or compressing the axes to different
extents. Matrices that perform such operations are diagonal matrices of the form
\[ \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_n \end{bmatrix}. \]
2. Column 3 shows a rotation which, as the name suggests, involves a rotation about the origin. Rotation
matrices R are orthonormal ($R^T R = I$) and in addition have determinant 1. Close cousins are
reflections, which also involve orthonormal matrices but with determinant −1.
3. Column 4 shows a general linear transformation.
It can be seen that the transformation in Column 4 involves some scaling along an arbitrary axis as well
as some rotation. It turns out that a version of this holds for any linear transformation. In particular, every
linear transformation can be decomposed into a rotation/reflection, followed by a non-isometric scaling,
followed by another rotation/reflection. This decomposition is called a singular value decomposition:
Definition 7 Every matrix M can be written as $M = U \Sigma V^T$, where U and V are orthonormal matrices
and Σ is a diagonal matrix. This is known as the singular value decomposition (SVD) of matrix M .
The values in the diagonal of Σ are called the singular values of the matrix M .
Thus $M x = U(\Sigma(V^T x))$. In other words, applying M amounts to rotating using $V^T$, scaling using Σ and
rotating again using U . For any matrix M , Σ is unique once we fix the usual convention that the singular
values are non-negative and sorted in decreasing order. If the singular values are also distinct and non-zero,
U and V are unique up to a sign flip for each column (the i-th columns of U and V must be flipped together).
Singular value decomposition is one of the most versatile, commonly used tools of linear algebra and it is
worthwhile remembering that this exists. Apart from being an interpretable decomposition of a matrix, it
offers several interesting properties.
1. The rank of the matrix M is simply the number of non-zero singular values.
2. Given a matrix $M = U \Sigma V^T$ of rank r, if we want the closest matrix $M'$ of rank r − k (closest in the
Frobenius norm), then one can simply zero out the k smallest non-zero singular values (smallest in absolute
value) in Σ to produce $\Sigma'$. $M'$ is then $U \Sigma' V^T$.
3. If $u_i$ is the i-th column of U , $v_i$ is the i-th column of V and $\sigma_i$ is the i-th diagonal element
of Σ, then it is easy to show that $M = \sum_i \sigma_i u_i v_i^T$.
4. Consider what happens when M is applied to the vector $v_j$. Since $v_i^T v_j = 0$ if i ≠ j and
$v_i^T v_j = 1$ otherwise, we have:
\[ M v_j = \sum_i \sigma_i u_i v_i^T v_j = \sigma_j u_j \tag{13} \]
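Properties 3 and 4 in the list above can be illustrated without an SVD routine by going in the opposite direction: pick orthonormal U and V and singular values by hand, build $M = \sum_i \sigma_i u_i v_i^T$, and check that $M v_j = \sigma_j u_j$. The specific U, V and singular values below are arbitrary choices for this sketch. Keeping only the largest term gives a rank-1 matrix, whose determinant is 0.

```python
def column(A, j):
    return [row[j] for row in A]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def rank_one(sigma, u, v):
    """The matrix sigma * u v^T."""
    return [[sigma * a * b for b in v] for a in u]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Hand-picked orthonormal matrices (their columns are unit length and orthogonal).
U = [[0.6, -0.8],
     [0.8, 0.6]]
V = [[0.8, 0.6],
     [-0.6, 0.8]]
sigma = [3.0, 1.0]

# M = sum_i sigma_i u_i v_i^T
terms = [rank_one(sigma[i], column(U, i), column(V, i)) for i in range(2)]
M = add(terms[0], terms[1])

# M v_j should equal sigma_j u_j, up to floating-point rounding.
for j in range(2):
    lhs = matvec(M, column(V, j))
    rhs = [sigma[j] * a for a in column(U, j)]
    print(j, lhs, rhs)

# Zeroing out the smallest singular value leaves a rank-1 matrix.
M1 = terms[0]
det_M1 = M1[0][0] * M1[1][1] - M1[0][1] * M1[1][0]
print(det_M1)  # approximately 0
```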

Consider now square, symmetric matrices, that is $M = M^T$. If $M = U \Sigma V^T$, then this means that
$U \Sigma V^T = V \Sigma U^T$, which is satisfied when U = V (this choice is possible when M has no negative
eigenvalues; otherwise corresponding columns of U and V differ in sign). In this case the singular value
decomposition coincides with another matrix decomposition, the eigenvalue decomposition:
Definition 8 Every real symmetric matrix M can be written as $M = U \Lambda U^T$, where U is an orthonormal
matrix and Λ is a diagonal matrix. This is known as the eigenvalue decomposition of matrix M . The values in
the diagonal of Λ are called the eigenvalues of the matrix M .
As above, if $u_j$ is the j-th column of U (or the j-th eigenvector) and $\lambda_j$ is the j-th eigenvalue,
then:
\[ M u_j = \lambda_j u_j \tag{14} \]
Thus, eigenvectors of M are vectors which when multiplied by M stay on the same line: they are scaled by
$\lambda_j$, so their norm is scaled by $|\lambda_j|$ and their direction is flipped if $\lambda_j$ is negative.
Many other square matrices also have an eigenvalue decomposition, of the more general form
$M = U \Lambda U^{-1}$, but the eigenvalues and eigenvectors may be complex and U need not be orthonormal;
some square matrices have no such decomposition at all. If the matrix is symmetric, then the eigenvectors and
eigenvalues will be real, and if in addition the eigenvalues are non-negative, the eigenvalue decomposition
coincides with the SVD.
An interesting fact is that the eigenvectors of a symmetric d × d matrix can always be chosen to form an
orthonormal basis for $\mathbb{R}^d$. Similarly, the column vectors of V in an SVD of a d × d matrix form a
basis for $\mathbb{R}^d$.
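For a symmetric 2 × 2 matrix the eigenvalue decomposition can be written in closed form, which makes for a compact sketch. The function eig_sym_2x2 below is a hypothetical helper for this special case only (it is not a general eigen-solver); it is then used to check $M u_j = \lambda_j u_j$ and $M = U \Lambda U^T$ on an example matrix chosen for illustration.

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenpairs (lambda, unit eigenvector) of [[a, b], [b, c]], largest eigenvalue first."""
    if b == 0:
        pairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    mean, radius = (a + c) / 2, math.hypot((a - c) / 2, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        v = (b, lam - a)  # solves (a - lam) v_0 + b v_1 = 0
        n = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / n, v[1] / n)))
    return pairs

M = [[2.0, 1.0],
     [1.0, 2.0]]
pairs = eig_sym_2x2(M[0][0], M[0][1], M[1][1])

# Check M u_j = lambda_j u_j for each eigenpair.
for lam, u in pairs:
    Mu = [M[0][0] * u[0] + M[0][1] * u[1], M[1][0] * u[0] + M[1][1] * u[1]]
    print(lam, u, Mu)

# Rebuild M as U Lambda U^T = sum_j lambda_j u_j u_j^T.
rebuilt = [[sum(lam * u[i] * u[k] for lam, u in pairs) for k in range(2)] for i in range(2)]
print(rebuilt)
```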

6 Matrices, vectors and optimization


One of the primary use-cases of linear algebra techniques will appear when we try to solve equations or
optimization problems.

6.1 Solving linear equations


Consider, for example, any set of linear equations in 2 variables:

\begin{align}
a_{11} x + a_{12} y &= b_1 \tag{15} \\
a_{21} x + a_{22} y &= b_2 \tag{16} \\
&\ \,\vdots \tag{17} \\
a_{n1} x + a_{n2} y &= b_n \tag{18}
\end{align}

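This set of equations can be written using matrices and vectors, where we assemble all the unknowns into a
single vector $\mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}$:
\begin{align}
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ \vdots & \vdots \\ a_{n1} & a_{n2} \end{bmatrix} \mathbf{x} &= \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \tag{19} \\
\Rightarrow A\mathbf{x} &= \mathbf{b} \tag{20}
\end{align}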

In fact, any set of linear equations in d variables can be written as a matrix vector equation Ax = b,
where A is a matrix of coefficients, b is a vector of constants and x is a vector of the variables. In
general, if A is d × d and full rank, then $A^{-1}$ exists and the solution to these equations is simply
$x = A^{-1} b$. However, what if A is not full rank or different from d × d?

Over-constrained systems What happens if A is n × d, where n > d, and is full rank? In this case,
there are more constraints than there are variables, so there may not in fact be a solution. Instead, we look
for a solution in the least squares sense: we try to optimize:
\[ \min_x \|Ax - b\|^2 \tag{21} \]
Now, we have:
\begin{align}
\|Ax - b\|^2 &= (Ax - b)^T (Ax - b) \tag{22} \\
&= (x^T A^T - b^T)(Ax - b) \tag{23} \\
&= x^T A^T A x - b^T A x - x^T A^T b + b^T b \tag{24} \\
&= x^T A^T A x - 2 b^T A x + b^T b \tag{25}
\end{align}

We now must minimize this function over x. This can be done by computing the derivative of this objective
w.r.t. each component of x and setting it to 0. In vector notation, the vector of derivatives of a function
f (x) with respect to each component of x is called the gradient $\nabla_x f(x)$:
\[ \nabla_x f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_d} \end{bmatrix} \tag{26} \]
We will use two identities here that are easy to prove:
\begin{align}
\nabla_x c^T x &= c \tag{27} \\
\nabla_x x^T Q x &= (Q + Q^T) x \tag{28}
\end{align}

This gives us:
\[ \nabla_x (x^T A^T A x - 2 b^T A x + b^T b) = 2 A^T A x - 2 A^T b \tag{29} \]
Setting this to 0 gives us the normal equations, which are now precisely a set of d equations:
\[ A^T A x = A^T b \tag{30} \]
Because A has full rank d, the d × d matrix $A^T A$ is invertible, so these can be solved the usual way,
giving us the least squares solution $x = (A^T A)^{-1} A^T b$.
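As a worked instance of the normal equations, the sketch below fits a line y ≈ s·t + c to four points, which gives an over-constrained system with n = 4 equations and d = 2 unknowns. The data values are made up for illustration, and the 2 × 2 normal equations are solved with Cramer's rule rather than by forming an explicit inverse.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def solve_2x2(M, rhs):
    """Solve M x = rhs for a 2 x 2 matrix M using Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det]

# Made-up measurements: y is roughly 2 t + 1 with small errors.
ts = [0.0, 1.0, 2.0, 3.0]
b = [1.1, 2.9, 5.2, 6.8]

# Each row of A encodes one equation s * t_i + c * 1 = y_i in the unknowns x = [s, c].
A = [[t, 1.0] for t in ts]
At = transpose(A)

# Normal equations: (A^T A) x = A^T b
x = solve_2x2(matmul(At, A), matvec(At, b))
print(x)

# At the least squares solution the gradient 2 A^T (A x - b) vanishes.
residual = [p - q for p, q in zip(matvec(A, x), b)]
gradient = [2 * g for g in matvec(At, residual)]
print(gradient)  # approximately [0, 0]
```

Note that the residual Ax − b itself is not zero here, since no line passes through all four points; only its projection onto the columns of A vanishes.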

Under-constrained equations What if the rank of A is smaller than d, the number of variables? In this case,
there might be multiple possible solutions, and the system is under-constrained. It is also possible that no
solution exists. In particular, if $x_1$ is a solution (i.e., $Ax_1 = b$), and $Ax_2 = 0$, then $x_1 + x_2$ is
also a solution.
We can get a particular solution as follows. First we do an SVD of A to get:
\begin{align}
U \Sigma V^T x &= b \tag{31} \\
\Leftrightarrow \Sigma V^T x &= U^T b \tag{32}
\end{align}
Next, let $y = V^T x$, so that x = V y. Then,
\[ \Sigma y = U^T b \tag{34} \]
Because Σ is a diagonal matrix, this equation can be solved trivially, if a solution exists (note that
wherever a diagonal entry of Σ is 0, the corresponding entry of the RHS must be 0 for a solution to exist).
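The non-uniqueness described above is easy to see on a small example. The sketch below uses a made-up 2 × 3 system (two equations, three unknowns) with a particular solution x1 and a null-space vector x2, both found by hand, and checks that x1 + t·x2 solves the system for several values of t. It does not compute an SVD.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Two equations in three unknowns, so the rank is at most 2 < d = 3.
A = [[1, 1, 0],
     [0, 1, 1]]
b = [2, 3]

x1 = [1, 1, 2]    # a particular solution: A x1 = b
x2 = [1, -1, 1]   # a null-space vector:   A x2 = 0

print(matvec(A, x1), matvec(A, x2))

# Every x1 + t * x2 is also a solution.
for t in (-2, 0, 1, 10):
    x = [p + t * q for p, q in zip(x1, x2)]
    print(t, x, matvec(A, x) == b)
```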

6.2 Optimization problems


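Another common use-case of linear algebra is for solving optimization problems. Consider the problem:
$\min_x x^T Q x$, subject to the constraint that $\|x\| = 1$. Here Q is symmetric. To solve this, let us
express x as a linear combination of the eigenvectors of Q: $x = \sum_i \alpha_i v_i$. Then, we have
\begin{align}
x^T Q x &= \Big(\sum_i \alpha_i v_i^T\Big) Q \Big(\sum_j \alpha_j v_j\Big) \tag{35} \\
&= \Big(\sum_i \alpha_i v_i^T\Big) \Big(\sum_j \alpha_j Q v_j\Big) \tag{36} \\
&= \Big(\sum_i \alpha_i v_i^T\Big) \Big(\sum_j \alpha_j \lambda_j v_j\Big) \tag{37} \\
&= \sum_{i,j} \alpha_i \alpha_j \lambda_j v_i^T v_j \tag{38} \\
&= \sum_i \alpha_i^2 \lambda_i \tag{39}
\end{align}
The last step uses the fact that the eigenvectors of a symmetric matrix are orthonormal, so $v_i^T v_j$ is 1
if i = j and 0 otherwise. Thus, the objective function is a linear combination of the $\lambda_i$ with
non-negative weights $\alpha_i^2$, and these weights sum to 1 because $\|x\|^2 = \sum_i \alpha_i^2 = 1$. The
only way to minimize this is to put the maximum weight on the smallest eigenvalue and 0 weight on everything
else. The maximum weight we can put is 1, since $\|x\| = 1$. Thus the solution to the minimization is $v_*$,
the eigenvector corresponding to the smallest eigenvalue $\lambda_*$.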
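This conclusion can be checked numerically in two dimensions by brute force. The sketch below uses an example symmetric matrix Q (chosen for illustration) whose smallest eigenvalue is 1 with unit eigenvector (1, −1)/√2, evaluates $x^T Q x$ on a fine grid of unit vectors, and compares the best grid point with that eigenpair. The grid search is only a sanity check, not how one would solve the problem in practice.

```python
import math

Q = [[2.0, 1.0],
     [1.0, 2.0]]

def quad_form(Q, x):
    """The objective x^T Q x."""
    return sum(x[i] * Q[i][j] * x[j] for i in range(2) for j in range(2))

# Unit vectors (cos t, sin t) sampled evenly around the unit circle.
n_samples = 3600
candidates = [(math.cos(2 * math.pi * k / n_samples), math.sin(2 * math.pi * k / n_samples))
              for k in range(n_samples)]

best = min(candidates, key=lambda x: quad_form(Q, x))
best_value = quad_form(Q, best)

# Known eigenpair of this Q: Q (1, -1) = (1, -1), so lambda = 1.
v_min = (1 / math.sqrt(2), -1 / math.sqrt(2))
lambda_min = quad_form(Q, v_min)

# |<best, v_min>| is 1 when the two unit vectors are parallel (up to sign).
alignment = abs(best[0] * v_min[0] + best[1] * v_min[1])
print(best_value, lambda_min, alignment)
```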
