Linear Algebra For Business Analytics
Marcel Scharth
The University of Sydney Business School
This reference material for Business Analytics students covers basic concepts from
linear algebra that are helpful in applications. You can use this guide either to learn
the essentials, or as a reference that you can always go back to as you come across
these concepts. This material draws on Klein (2013) and Boyd and Vandenberghe
(2016), with the latter being particularly well suited as a complete resource for business
analytics applications.
Here, we follow a practical approach and do not cover abstract linear algebra, which
includes concepts that would be part of a traditional linear algebra course such as vector
spaces. While powerful, this level of abstraction is not a requirement for our purposes.
Instead, you will find that investing in the material below considerably simplifies the
task of understanding practical methods.
Contents
1. Vectors
1.6. Norm
1.7. Distance
2. Matrices
2.3. Transpose
2.11. Trace†
3. Differentiation
4. Random vectors
References
The † symbol indicates subsections that are less important and can be initially skipped.
You can always go back to them when you come across these concepts.
1. Vectors
A vector is an ordered list of numbers, written as a column, for example
\[
\begin{bmatrix} -1 \\ 0 \\ 2.5 \\ -7.2 \end{bmatrix}
\quad \text{or} \quad
\begin{pmatrix} -1 \\ 0 \\ 2.5 \\ -7.2 \end{pmatrix}.
\]
For example,
\[
a = \begin{bmatrix} 5 \\ 1 \\ -2 \\ 1 \\ -3 \end{bmatrix},
\qquad
a = (5, 1, -2, 1, -3).
\]
More generally, we write
\[
a = (a_1, \dots, a_i, \dots, a_n).
\]
A vector with n entries, each belonging to R (the set of real numbers), is called an
n-vector over R. We denote the set of n-vectors over R as $\mathbb{R}^n$. For example,
\[
a = (a_1, a_2, a_3) \in \mathbb{R} \times \mathbb{R} \times \mathbb{R} \equiv \mathbb{R}^3.
\]
We can also define a vector as a function from a finite set D to R, for example a
function from D = {0, 1, 2, . . . , d − 1} to R. The vector (6, −4, −3.7) is the function
\[
0 \mapsto 6, \quad 1 \mapsto -4, \quad 2 \mapsto -3.7.
\]
This last definition is useful as it matches how we work with vectors in Python.
For example, if we store the above vector as a Python list called a, the command
a[0] returns 6, the first element of the vector. More formally, we say that the above
definition lends itself to representation in a data structure (a format for organising and
storing data). Python objects such as lists, dictionaries, and NumPy arrays are data
structures that can represent vectors.
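As a minimal sketch, the vector above can be stored either as a Python list or as a
NumPy array; the variable names below are for illustration only.

import numpy as np

a = [6, -4, -3.7]          # Python list
a_np = np.array(a)         # NumPy array

print(a[0])                # 6, the first element (index 0)
print(a_np[0])             # 6.0
print(len(a), a_np.shape)  # 3 (3,)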
Zero vector. A zero vector has all elements equal to zero, 0 = (0, 0, ..., 0). 0n indicates
a zero vector with dimension n.
Unit vector. A unit vector has all elements equal to zero, except one element which is
equal to one: ei = [0, ..., 0, 1, 0, ..., 0]T (all zeros except for a 1 in the i-th position).
Ones vector. A ones vector has all elements equal to one, 1 = [1, 1, ..., 1]T. 1n indicates
a ones vector with dimension n. We also use the notation ι for this type of vector.
Sparsity. A vector is said to be sparse if many of its elements are equal to zero.
Addition. Let a and b be two vectors with the same size n. The sum c = a + b is the
vector with elements ci = ai + bi.
Linear combination. Let a and b be n-vectors and β1 and β2 be scalars. The
n-vector
β1 a + β2 b
is called a linear combination of a and b. The scalars β1 and β2 are the coefficients
of the linear combination.
Properties. The following are useful properties of the inner product
aT b = a1 b1 + a2 b2 + . . . + an bn, which follow easily from the definition:
\[
(\alpha a)^T b = \alpha (a^T b), \qquad a^T b = b^T a, \qquad a^T(b + c) = a^T b + a^T c.
\]
Examples. With
\[
n = 4, \qquad
x = \begin{bmatrix} 3 \\ 4 \\ 2 \\ 7 \end{bmatrix}, \qquad
\mathbf{1} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix},
\]
we have
\[
\frac{1}{4}\,\iota^T x = \frac{1}{4}\left(1 \times 3 + 1 \times 4 + 1 \times 2 + 1 \times 7\right) = 4.
\]
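A short NumPy sketch of this calculation, with the array values taken from the example
above:

import numpy as np

x = np.array([3, 4, 2, 7])
ones = np.ones(4)

# Sample average as an inner product with the ones vector: (1/n) * ι'x
avg = (1 / 4) * ones @ x
print(avg)        # 4.0
print(x.mean())   # same result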
A linear function of x can be written as
\[
f(x) = a^T x = a_1 x_1 + a_2 x_2 + \dots + a_n x_n,
\]
where x is an n-vector. Here, a is fixed, and the argument x can be any n-vector.
For example, in a linear regression model f(x) is the regression function, x contains the
predictor values, and a contains the model parameters. Adding a constant term gives
\[
f(x) = a^T x + b,
\]
for a scalar b.
1.6. Norm. The Euclidean norm of an n-vector a is
\[
\|a\|_2 = \sqrt{a^T a} = \left( a_1^2 + a_2^2 + \dots + a_n^2 \right)^{1/2}.
\]
This is the distance from the origin to the point a, or the length of the vector. The
normalized vector $a/\|a\|_2$ has unit norm.
Example.
Let x be a vector with sample average zero. Then $\|x\|_2^2$ is the sum of squares of the
elements of x and $s_x^2 = \|x\|_2^2/n$ is the sample variance.
More generally, the Minkowski (or p) norm is
\[
\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}
\]
for p ≥ 1. This is a generalisation of the previous norms. We include this here because
the scikit-learn package in Python sometimes refers to the Minkowski norm by
default, even when p = 2 (in which case it coincides with the Euclidean norm).
1.7. Distance.
The Euclidean distance between two vectors x and y is the norm of the difference
vector x − y:
\[
\mathrm{dist}(x, y) = \|x - y\|_2 = \sqrt{(x - y)^T (x - y)}
= \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}.
\]
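A small NumPy sketch of norms and distances; the vectors x and y below are arbitrary
illustrative values, and np.linalg.norm computes the Euclidean norm by default.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

norm_x = np.linalg.norm(x)                # sqrt(1 + 4 + 9)
dist_xy = np.linalg.norm(x - y)           # Euclidean distance between x and y
dist_manual = np.sqrt((x - y) @ (x - y))  # same computation written out

print(norm_x, dist_xy, dist_manual)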
Two vectors are orthogonal, written a ⊥ b, if and only if their inner product is zero,
aT b = 0.
Example:
\[
a = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
b = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad
\|a\| = \|b\| = \sqrt{2}, \quad a^T b = 0,
\]
\[
c = \begin{bmatrix} 1 \\ -3 \end{bmatrix}, \quad
d = \begin{bmatrix} 0.6 \\ 0.2 \end{bmatrix}, \quad
\|c\| = \sqrt{10} \approx 3.16, \quad
\|d\| = \sqrt{0.4} \approx 0.63, \quad
c^T d = 0, \quad a^T c = -2.
\]
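We can check the calculations for c and d numerically; a small sketch using NumPy:

import numpy as np

c = np.array([1.0, -3.0])
d = np.array([0.6, 0.2])

print(np.linalg.norm(c))   # ~3.16
print(np.linalg.norm(d))   # ~0.63
print(c @ d)               # approximately zero (up to floating-point rounding)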
2. Matrices
A matrix is a rectangular array of numbers, for example
\[
\begin{bmatrix}
0 & 1 & -2.3 & 0.1 \\
1.3 & 4 & -0.1 & 0 \\
4.1 & -1 & 0 & 1.7
\end{bmatrix}.
\]
The size (or dimensions) of a matrix is the number of rows and the number of columns.
The matrix above has 3 rows and 4 columns, so the size is 3 × 4 (read "3-by-4").
We represent an m × n matrix as
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{in} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mj} & \cdots & a_{mn}
\end{bmatrix},
\]
with $A \in \mathbb{R}^{m \times n}$.
The transpose of a column vector is the corresponding row vector and vice-versa.
\[
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}^T
= \begin{bmatrix} a_1 & a_2 & \dots & a_m \end{bmatrix},
\qquad
\begin{bmatrix} a_1 & a_2 & \dots & a_m \end{bmatrix}^T
= \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}.
\]
2.3. Transpose. The transpose of an m × n matrix A, denoted A^T, is the n × m matrix
obtained by interchanging the rows and columns of A:
\[
A^T = \begin{bmatrix}
a_{11} & a_{21} & \cdots & a_{i1} & \cdots & a_{m1} \\
a_{12} & a_{22} & \cdots & a_{i2} & \cdots & a_{m2} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{1j} & a_{2j} & \cdots & a_{ij} & \cdots & a_{mj} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{in} & \cdots & a_{mn}
\end{bmatrix}.
\]
Example.
\[
A = \begin{bmatrix}
2 & 3 & 4 \\
1 & 2 & -1 \\
3 & -2 & 5 \\
-2 & 4 & 1
\end{bmatrix},
\qquad
A^T = \begin{bmatrix}
2 & 1 & 3 & -2 \\
3 & 2 & -2 & 4 \\
4 & -1 & 5 & 1
\end{bmatrix}.
\]
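In NumPy, the transpose of an array is available as the .T attribute; the sketch below
uses the example matrix A above.

import numpy as np

A = np.array([[ 2,  3,  4],
              [ 1,  2, -1],
              [ 3, -2,  5],
              [-2,  4,  1]])

print(A.shape)    # (4, 3)
print(A.T)        # the 3 x 4 transpose, matching the example above
print(A.T.shape)  # (3, 4)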
The sum of two matrices A and B is the matrix with elements
\[
(A + B)_{ij} = a_{ij} + b_{ij}.
\]
This can only be performed if A and B have the same dimensions.
Example.
\[
\begin{bmatrix} 2 & 3 \\ 3 & -2 \end{bmatrix}
+ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
= \begin{bmatrix} 2 + 1 & 3 + 2 \\ 3 + 3 & -2 + 4 \end{bmatrix}
= \begin{bmatrix} 3 & 5 \\ 6 & 2 \end{bmatrix}
\]
Example.
\[
\begin{bmatrix} 1 & 4 \\ 7 & -3 \\ 2 & -5 \end{bmatrix}
\times \begin{bmatrix} 2 \\ 1 \end{bmatrix}
= \begin{bmatrix} 1 \times 2 + 4 \times 1 \\ 7 \times 2 - 3 \times 1 \\ 2 \times 2 - 5 \times 1 \end{bmatrix}
= \begin{bmatrix} 6 \\ 11 \\ -1 \end{bmatrix}
\]
Example.
\[
\begin{bmatrix} 2 & 3 \\ 3 & -2 \end{bmatrix}
\times \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
= \begin{bmatrix} 2 \times 1 + 3 \times 3 & 2 \times 2 + 3 \times 4 \\ 3 \times 1 - 2 \times 3 & 3 \times 2 - 2 \times 4 \end{bmatrix}
= \begin{bmatrix} 11 & 16 \\ -3 & -2 \end{bmatrix}
\]
(AB)T = BT AT
(AB)C = A(BC)
A(B + C) = AB + AC
(A + B)C = AC + BC
Example.
\[
A = \begin{bmatrix} 1 & 0 \\ 5 & -1 \\ 3 & 2 \end{bmatrix},
\qquad
B = \begin{bmatrix} 2 & -1 \\ 3 & 6 \end{bmatrix},
\qquad
\iota_3 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix},
\]
\[
C = AB = \begin{bmatrix} 2 & -1 \\ 7 & -11 \\ 12 & 9 \end{bmatrix},
\qquad
\iota^T C = \begin{bmatrix} 21 & -3 \end{bmatrix}.
\]
BA is not defined.
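The sketch below reproduces this example with NumPy, where @ denotes matrix
multiplication; attempting B @ A raises an error because the inner dimensions do not
match.

import numpy as np

A = np.array([[1, 0], [5, -1], [3, 2]])
B = np.array([[2, -1], [3, 6]])
ones3 = np.ones(3)

C = A @ B            # matrix product, shape (3, 2)
print(C)             # [[ 2  -1] [ 7 -11] [12   9]]
print(ones3 @ C)     # column sums: [21. -3.]

# B @ A is not defined: inner dimensions (2 and 3) do not match
try:
    B @ A
except ValueError as e:
    print("B @ A failed:", e)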
Outer product. The outer product of an m-vector a and an n-vector b is the m × n
matrix
\[
ab^T = \begin{bmatrix}
a_1 b_1 & a_1 b_2 & \dots & a_1 b_n \\
a_2 b_1 & a_2 b_2 & \dots & a_2 b_n \\
\vdots & \vdots & & \vdots \\
a_m b_1 & a_m b_2 & \dots & a_m b_n
\end{bmatrix}.
\]
A symmetric matrix A with the property that xT Ax > 0 for any nonzero vector x is said
to be positive definite.
The diagonal elements of a matrix are the elements aij such that i = j (same row and
column index).
An identity matrix of order n is a matrix with all diagonal elements equal to one
(aii = 1 for i = 1, . . . , n), and all non-diagonal elements equal to zero, that is
\[
I_n = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix}
= \mathrm{diag}(1, \dots, 1).
\]
For example,
\[
I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},
\qquad
I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]
Properties.
Let A be an m × n matrix.
\[
I_n^2 = I_n, \qquad I_m A = A, \qquad A I_n = A.
\]
Diagonal matrix. A diagonal matrix is a square matrix with zeros in all the non-
diagonal positions.
\[
D = \begin{bmatrix}
d_1 & 0 & \cdots & 0 & 0 \\
0 & d_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & d_{n-1} & 0 \\
0 & 0 & \cdots & 0 & d_n
\end{bmatrix}
= \mathrm{diag}(d_1, \dots, d_n).
\]
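In NumPy, np.eye and np.diag construct identity and diagonal matrices; the diagonal
entries and the matrix A below are illustrative values only.

import numpy as np

I3 = np.eye(3)                  # 3 x 3 identity matrix
D = np.diag([2.0, -1.0, 0.5])   # diagonal matrix diag(2, -1, 0.5)

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(np.allclose(A @ I3, A))   # True: A I_n = A
print(D)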
A system of m linear equations in n unknowns can be written compactly as
\[
Ax = b,
\]
where
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \dots & a_{mn}
\end{bmatrix},
\qquad
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},
\qquad
b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}.
\]
The study of linear systems is a fundamental part of linear algebra, which allows us to
determine whether a system has a unique solution, infinitely many solutions, or no
solution, and to obtain a solution if one exists.
Inverse. A square matrix A is invertible if there exists a matrix B such that
\[
AB = I_n.
\]
If that is the case, then we call B the inverse of A and use the notation A^{-1}.
There are several methods for calculating a matrix inverse, but we will leave the details
in the background. In practice, we often do not need to explicitly compute the matrix
inverse to evaluate expressions in which it appears (for example, the OLS formula).
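As a rough illustration of this point, the sketch below solves a small made-up system
both with an explicit inverse and with np.linalg.solve, which solves the system directly
and is generally the preferred approach.

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_inv = np.linalg.inv(A) @ b     # via the explicit inverse
x_solve = np.linalg.solve(A, b)  # solves Ax = b directly (preferred)

print(x_inv, x_solve)            # both give the same solution
print(np.allclose(x_inv, x_solve))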
Properties.
(A−1 )−1 = A
(αA)−1 = (1/α)A−1
(AT )−1 = (A−1 )T
(AB)−1 = B −1 A−1
2.11. Trace†. The trace of a square n × n matrix A is the sum of its diagonal elements,
\[
\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}.
\]
Properties.
tr(αA) = α tr(A)
tr(A + B) = tr(A) + tr(B)
tr(AT ) = tr(A)
tr(AB) = tr(BA)
3. Differentiation
Let f be a scalar-valued function of an n-vector x. The gradient of f is the vector of
partial derivatives
\[
\nabla f(x) = \begin{bmatrix}
\dfrac{\partial f(x)}{\partial x_1} \\
\dfrac{\partial f(x)}{\partial x_2} \\
\vdots \\
\dfrac{\partial f(x)}{\partial x_n}
\end{bmatrix}.
\]
There are several convenient rules for differentiating linear algebra operations with
respect to vectors. The two following rules appear in the derivation of the least squares
estimates of a linear regression.
\[
\frac{d(x' a)}{dx} = a,
\qquad
\frac{d(x' A x)}{dx} = (A + A')x.
\]
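The second rule can be checked numerically. The sketch below compares the analytic
gradient (A + A')x with a central finite-difference approximation, for an arbitrary
illustrative matrix A and point x.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

analytic = (A + A.T) @ x   # d(x'Ax)/dx = (A + A')x

eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    e = np.zeros(3)
    e[i] = eps
    # central difference of f(x) = x'Ax in the i-th coordinate
    numeric[i] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True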
4. Random vectors
Let X = (X1, . . . , Xn) be a random vector. The expected value of X is the vector of
expected values of its elements,
\[
E(X) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_n) \end{bmatrix}.
\]
Let a and b be non-random scalars and Y be another random vector with dimension
n. Then
E(aX + bY ) = aE(X) + bE(Y ),
which follows from the linearity of expectations.
Similarly, for a non-random n-vector a and a non-random matrix A, and with Var(X)
denoting the covariance matrix of X,
\[
E(a^T X) = a^T E(X), \qquad E(AX) = A\,E(X),
\]
\[
\mathrm{Var}(a^T X) = a^T \mathrm{Var}(X)\, a, \qquad \mathrm{Var}(AX) = A\,\mathrm{Var}(X)\, A^T.
\]
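These identities can be illustrated by simulation. The sketch below draws a large sample
from a normal random vector with an assumed covariance matrix Sigma and compares
the sample covariance of AX with A Var(X) A^T (the specific Sigma and A are
illustrative only).

import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed covariance of X
A = np.array([[1.0, 1.0], [2.0, -1.0]])

X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=100_000)  # rows are draws
Y = X @ A.T                                  # each row is A times a draw of X

print(np.cov(Y, rowvar=False))               # sample covariance of AX
print(A @ Sigma @ A.T)                       # theoretical A Var(X) A'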
The linear regression model makes the following assumptions.
1. Linearity: if X = x, then
\[
Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon,
\]
where the error term satisfies E(ε) = 0 and Var(ε) = σ² In.
Let $\{(y_i, x_i)\}_{i=1}^n$ be a sample. The ordinary least squares (OLS) method obtains
the coefficient values that minimise the residual sum of squares (RSS):
\[
\hat{\beta} = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2.
\]
The partial derivatives of the RSS with respect to the coefficients are
\[
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_0} = -2 \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big),
\]
\[
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_{i1} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big),
\]
\[
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_2} = -2 \sum_{i=1}^{n} x_{i2} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big),
\]
\[
\vdots
\]
\[
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_p} = -2 \sum_{i=1}^{n} x_{ip} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big).
\]
Note that each partial derivative j contains a sum that is the inner product of the j-th
column of X with the vector of residuals (y − Xβ). We can therefore write the above
equations using the compact notation
\[
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta} = -2 X^T (y - X\beta),
\]
where
\[
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\]
and X is the n × (p + 1) matrix whose first column is a column of ones and whose
remaining columns contain the predictor values x_{ij}.
The least squares estimate $\hat{\beta}$ therefore satisfies the system of linear equations
\[
X^T X \hat{\beta} = X^T y,
\]
so that, provided X^T X is invertible,
\[
\hat{\beta} = (X^T X)^{-1} X^T y.
\]
Alternatively, we can work directly in matrix form:
\[
\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
= (y - X\beta)^T (y - X\beta)
= y^T y - 2\beta^T X^T y + \beta^T X^T X \beta.
\]
The gradient is
\[
\frac{d\,\mathrm{RSS}(\beta)}{d\beta}
= \frac{d(y^T y)}{d\beta} - \frac{d(2\beta^T X^T y)}{d\beta} + \frac{d(\beta^T X^T X \beta)}{d\beta}
= 0 - 2X^T y + 2X^T X \beta.
\]
Setting the gradient to zero, we obtain, as above,
\[
X^T X \hat{\beta} = X^T y,
\]
leading to
\[
\hat{\beta} = (X^T X)^{-1} X^T y.
\]
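As a sketch (with simulated data and made-up coefficient values), the OLS estimate can
be computed by solving the normal equations X^T X β̂ = X^T y directly, rather than
forming the inverse; np.linalg.lstsq gives the same answer.

import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # first column of ones
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                  # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                 # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)           # library least squares

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))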
To study the statistical properties of the OLS estimator, substitute Y = Xβ + ε into the
formula:
\[
\begin{aligned}
\hat{\beta} &= (X^T X)^{-1} X^T Y \\
&= (X^T X)^{-1} X^T (X\beta + \varepsilon) \\
&= (X^T X)^{-1} X^T X \beta + (X^T X)^{-1} X^T \varepsilon \\
&= \beta + (X^T X)^{-1} X^T \varepsilon.
\end{aligned}
\]
Below, all the results are conditional on the predictor values in the X matrix. We omit
this conditioning from the notation for simplicity.
Expected value.
\[
\begin{aligned}
E(\hat{\beta}) &= E\!\left( \beta + (X^T X)^{-1} X^T \varepsilon \right) \\
&= \beta + E\!\left[ (X^T X)^{-1} X^T \varepsilon \right] \\
&= \beta + (X^T X)^{-1} X^T E(\varepsilon) \\
&= \beta + (X^T X)^{-1} X^T 0 \\
&= \beta.
\end{aligned}
\]
Variance.
\[
\begin{aligned}
\mathrm{Var}(\hat{\beta}) &= \mathrm{Var}\!\left( \beta + (X^T X)^{-1} X^T \varepsilon \right) \\
&= \mathrm{Var}\!\left( (X^T X)^{-1} X^T \varepsilon \right) \\
&= E\!\left( (X^T X)^{-1} X^T \varepsilon \varepsilon^T X (X^T X)^{-1} \right) \\
&= (X^T X)^{-1} X^T E(\varepsilon \varepsilon^T) X (X^T X)^{-1} \\
&= (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1}.
\end{aligned}
\]
References
Boyd, S. and L. Vandenberghe (2016). Vectors, matrices, and least squares. Available:
stanford.edu/class/ee103/mma.pdf.
Klein, P. N. (2013). Coding the matrix: Linear algebra through applications to computer
science. Newtonian Press.