
Linear Algebra for Business Analytics

Marcel Scharth
The University of Sydney Business School

This version: 11/8/2017

This reference material for Business Analytics students covers basic concepts from
linear algebra that are helpful in applications. You can use this guide either to learn
the essentials, or as a reference that you can always go back to as you come across
these concepts. This material draws on Klein (2013) and Boyd and Vandenberghe
(2016), with the latter being particularly well suited as a complete resource for business
analytics applications.
Here, we follow a practical approach and do not cover abstract linear algebra, which
includes concepts that would be part of a traditional linear algebra course such as vector
spaces. While powerful, this level of abstraction is not a requirement for our purposes.
Instead, you will find that investing in the material below considerably simplifies things
from the perspective of understanding practical methods.

Contents

1. Vectors
1.1. What is a vector?
1.2. Special vectors
1.3. Vector operations
1.4. Inner Product
1.5. Linear functions
1.6. Norm
1.7. Distance
1.8. Orthogonal vectors
2. Matrices
2.1. What is a matrix?
2.2. Row and column vectors
2.3. Transpose
2.4. Addition and scalar multiplication
2.5. Matrix-vector multiplication
2.6. Matrix-matrix multiplication
2.7. Square matrices
2.8. Identity and diagonal matrices
2.9. Systems of Linear Equations
2.10. Matrix inverse
2.11. Trace†
3. Differentiation
4. Random vectors
5. Application: linear regression and least squares
5.1. Multiple Linear Regression (MLR) model
5.2. Least squares
5.3. Sampling properties
References

The † symbol indicates subsections that are less important and can be initially skipped.
You can always go back to them when you come across these concepts.

1. Vectors

1.1. What is a vector?

A vector is an ordered finite list of numbers. We typically write vectors as vertical
arrays, surrounded by square or curved brackets, as in
\[
\begin{bmatrix} -1 \\ 0 \\ 2.5 \\ -7.2 \end{bmatrix}
\quad \text{or} \quad
\begin{pmatrix} -1 \\ 0 \\ 2.5 \\ -7.2 \end{pmatrix}
\]
\[
a = \begin{bmatrix} 5 \\ -2 \\ -3 \end{bmatrix}, \qquad
a = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\]
We also write vectors as numbers separated by commas.

a = (5, −2, −3), a = (1, 1).

The elements or entries of a vector are the values in the array.

The size (or dimension) of a vector is the number of elements it contains.

A vector of size n is called an n−vector.
\[
a = \begin{bmatrix} a_1 \\ \vdots \\ a_i \\ \vdots \\ a_n \end{bmatrix},
\qquad
a = (a_1, \ldots, a_i, \ldots, a_n)
\]

A vector with n entries, each belonging to R (the set of real numbers), is called an
n-vector over R. We denote the set of n-vectors over R as Rn.

a = (a1, a2, a3) ∈ (R, R, R) ≡ R3

We can also define a vector as a function from a finite set D to R. For example, a
function from D = {0, 1, 2, . . . , d − 1} to R. The vector

a = (6, −4, −3.7)

is the function

0 ↦ 6
1 ↦ −4
2 ↦ −3.7,

where ↦ reads “maps to”.

This last definition is useful as it matches how we work with vectors in Python.
For example, if we store the above vector as a Python list called a, the command
a[0] returns 6, the first element of the vector. More formally, we say that the above
definition lends itself to representation in a data structure (a format for organising and
storing data). Python objects such as lists, dictionaries, and NumPy arrays are data
structures that can represent vectors.
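
As a brief sketch (the variable names below are ours, purely for illustration), this is how
the vector above can be stored and indexed in Python:

    import numpy as np

    # The vector a = (6, -4, -3.7) as a Python list and as a NumPy array.
    a_list = [6, -4, -3.7]
    a_array = np.array([6, -4, -3.7])

    print(a_list[0])     # 6, the value that index 0 maps to
    print(a_array[2])    # -3.7
    print(a_array.size)  # 3, the size (dimension) of the vector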

1.2. Special vectors.

Zero vector. A zero vector has all elements equal to zero, 0 = (0, 0, ..., 0). 0n indicates
a zero vector with dimension n.

Unit vector. A unit vector has all elements equal to zero, except one element which is
equal to one: ei = [0, ..., 0, 1, 0, ..., 0]T (all zeros except for a 1 in the i-th position).

Ones vector. A ones vector has all elements equal to one, 1 = [1, 1, ..., 1]T . 1n indicates
a ones vector with dimension n. We also use the notation ι for this type of vector.

Sparsity. A vector is said to be sparse if many of its elements are equal to zero.

1.3. Vector operations.

Vector equality. a = b ⇐⇒ ai = bi for all i = 1, 2, ..., n.

Scalar-vector multiplication. Let α denote a scalar. The vector αa is the vector
with elements {αai }. For example, let a = (5, −2, −3). Then

0.5 a = (0.5 × 5, 0.5 × −2, 0.5 × −3) = (2.5, −1, −1.5)

Addition. Let a and b be two vectors with the same size n. The sum c = a + b is the
vector with elements ci = ai + bi .

Let a = (5, −2, −3) and b = (−1, 2, 4). Then,

c = a + b = (5, −2, −3) + (−1, 2, 4) = (5 − 1, −2 + 2, −3 + 4) = (4, 0, 1).

Linear combination. Let a and b be n−vectors and β1 and β2 be scalars. The
n−vector
β1 a + β2 b
is called a linear combination of a and b. The scalars β1 and β2 are the coefficients
of the linear combination.

Let a = (5, −2, −3), b = (−1, 2, 4), β1 = 2, and β2 = 3.

2a + 3b = (2 × 5, 2 × −2, 2 × −3) + (3 × −1, 3 × 2, 3 × 4) = (7, 2, 18)
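
As an illustrative sketch (the variable names are ours), the operations above can be
reproduced with NumPy arrays:

    import numpy as np

    a = np.array([5, -2, -3])
    b = np.array([-1, 2, 4])

    print(0.5 * a)        # [ 2.5 -1.  -1.5], scalar-vector multiplication
    print(a + b)          # [4 0 1], elementwise addition
    print(2 * a + 3 * b)  # [ 7  2 18], linear combination with coefficients 2 and 3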

1.4. Inner Product.

We define the dot or inner product of two n-dimensional vectors a and b as
\[
a^T b = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n = \sum_{i=1}^{n} a_i b_i
\]

Example: a = (2, −1, 3) and b = (5, −2, −3), then

aT b = 2 × 5 + (−1) × (−2) + 3 × (−3) = 3

Some authors use the notation ⟨a, b⟩ for inner products.

Properties. The following are useful properties of inner products that follow easily
from the definition.

(αa)T b = α(aT b)

aT b = bT a

aT (b + c) = aT b + aT c

Examples.

Sum. ιT a = a1 + a2 + . . . + an is the sum of the elements of a.

Average. (1/n)(ιT a) is the average of the elements of a.

Sum of squares. aT a = a1² + . . . + an² is the sum of squares of the elements of a.
\[
n = 4, \quad x = \begin{bmatrix} 3 \\ 4 \\ 2 \\ 7 \end{bmatrix}, \quad
\iota = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}
\quad \Rightarrow \quad
\frac{1}{4}\,\iota^T x = \frac{1}{4}(1 \times 3 + 1 \times 4 + 1 \times 2 + 1 \times 7) = 4.
\]
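
A short sketch of these inner product examples with NumPy (the variable names are ours):

    import numpy as np

    a = np.array([2, -1, 3])
    b = np.array([5, -2, -3])
    x = np.array([3, 4, 2, 7])
    iota = np.ones(4)

    print(a @ b)         # 3, the inner product a^T b
    print(iota @ x)      # 16.0, the sum of the elements of x
    print(iota @ x / 4)  # 4.0, the average of the elements of x
    print(x @ x)         # 78, the sum of squares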

1.5. Linear functions.

The notation f : Rn −→ R means that f is a function that maps an n−vector to a
real number. If x is an n−vector, then f (x) (a scalar) is the value of the function at
x. In this setting, we refer to x as the argument of the function.

Let x and y be n−vectors and α and β be scalars. A linear function is a function
that satisfies the property
\[
f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)
\]
We can always represent a linear function as an inner product. Let a be an n−vector.
Then we can write any linear function in the form
\[
f(x) = a^T x = a_1 x_1 + a_2 x_2 + \ldots + a_n x_n,
\]
where x is an n−vector. Here, a is fixed, and the argument x can be any n−vector.

For example, in a linear regression model f (x) is the regression function, x are the
predictor values, and a are the model parameters.

An affine function is a linear function plus a constant, that is

f (x) = aT x + b,

for a scalar b.
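
The sketch below illustrates a linear and an affine function written as inner products;
the coefficient values are arbitrary choices of ours, not part of the text:

    import numpy as np

    a = np.array([1.0, -2.0, 0.5])   # fixed coefficient vector
    b = 3.0                          # constant of the affine function

    def f_linear(x):
        return a @ x        # f(x) = a^T x

    def f_affine(x):
        return a @ x + b    # f(x) = a^T x + b

    x = np.array([2.0, 1.0, 4.0])
    y = np.array([0.0, 1.0, 1.0])

    # Linearity: f(2x + 3y) equals 2 f(x) + 3 f(y).
    print(f_linear(2 * x + 3 * y), 2 * f_linear(x) + 3 * f_linear(y))
    print(f_affine(x))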

1.6. Norm.

The Euclidean norm or ℓ2 -norm of a vector is
\[
\lVert a \rVert_2 = \sqrt{a^T a} = \left( \sum_{i=1}^{n} a_i^2 \right)^{1/2}.
\]
This is the distance from the origin to the point a, or the length of the vector.

The normalized vector a / ∥a∥2 has unit norm.

Example.

Let x be a vector with sample average zero. Then the squared norm ∥x∥2² is the sum of
squares of x and s²x = ∥x∥2²/n is the sample variance.

General definition. A norm ∥ · ∥ is a function that satisfies the following properties:

(1) ∥a∥ ≥ 0 (non-negativity).
(2) ∥a∥ = 0 only if a = 0 (definiteness).
(3) ∥αa∥ = |α| × ∥a∥ (homogeneity).
(4) ∥a + b∥ ≤ ∥a∥ + ∥b∥ (triangle inequality).

The ℓ1 norm of a vector is
\[
\lVert a \rVert_1 = |a_1| + |a_2| + \ldots + |a_n| = \sum_{i=1}^{n} |a_i|.
\]

The Chebyshev or ℓ∞ norm is given by

∥a∥∞ = max{|a1 |, |a2 |, . . . , |an |}.

The Minkowski norm of order p is
\[
\lVert a \rVert_p = \left( \sum_{i=1}^{n} |a_i|^p \right)^{1/p},
\]
for p ≥ 1. This is a generalisation of the previous norms. We include this here because
the scikit-learn package in Python sometimes refers to the Minkowski norm by default,
even though with p = 2 it coincides with the Euclidean norm.
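
These norms are available through numpy.linalg.norm; a minimal sketch (the vector is our
own example):

    import numpy as np

    a = np.array([1.0, -3.0])

    print(np.linalg.norm(a))              # 3.16..., the Euclidean (l2) norm
    print(np.linalg.norm(a, ord=1))       # 4.0, the l1 norm
    print(np.linalg.norm(a, ord=np.inf))  # 3.0, the Chebyshev norm
    print(np.linalg.norm(a, ord=3))       # Minkowski norm of order p = 3
    print(a / np.linalg.norm(a))          # normalized vector with unit norm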

1.7. Distance.

The Euclidean distance between two vectors x and y is the norm of the difference
vector x − y:
\[
\mathrm{dist}(x, y) = \lVert x - y \rVert_2 = \sqrt{(x-y)^T (x-y)} = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}
\]

Every norm ∥ · ∥ induces a distance metric ∥x − y∥.

1.8. Orthogonal vectors.

Two vectors are orthogonal, written a ⊥ b, if and only if their inner product is zero,
aT b = 0.

Example:
\[
a = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
b = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad
\lVert a \rVert = \lVert b \rVert = \sqrt{2}, \quad a^T b = 0.
\]
\[
c = \begin{bmatrix} 1 \\ -3 \end{bmatrix}, \quad
d = \begin{bmatrix} 0.6 \\ 0.2 \end{bmatrix}, \quad
\lVert c \rVert = \sqrt{10} \approx 3.16, \quad
\lVert d \rVert = \sqrt{0.4} \approx 0.63, \quad
c^T d = 0, \quad a^T c = -2.
\]
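
A quick numerical check of the distance and orthogonality examples (a sketch using the
vectors defined above):

    import numpy as np

    a = np.array([1.0, 1.0])
    b = np.array([1.0, -1.0])
    c = np.array([1.0, -3.0])
    d = np.array([0.6, 0.2])

    print(np.linalg.norm(a - b))  # 2.0, the Euclidean distance between a and b
    print(a @ b)                  # 0.0, so a and b are orthogonal
    print(c @ d)                  # 0.0, so c and d are orthogonal
    print(a @ c)                  # -2.0, a and c are not orthogonal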

2. Matrices

2.1. What is a matrix?

A matrix is a rectangular two-dimensional array of numbers such as
\[
A = \begin{bmatrix}
0 & 1 & -2.3 & 0.1 \\
1.3 & 4 & -0.1 & 7 \\
4.1 & -1 & 0 & 1.7
\end{bmatrix}
\]
The size (or dimensions) of a matrix are the number of rows and columns. The
matrix above has 3 rows and 4 columns, so the size is 3 × 4 (it reads 3-by-4).

We represent an (m × n) matrix as
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{in} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mj} & \cdots & a_{mn}
\end{bmatrix},
\]
with A ∈ Rm×n .

We also represent a matrix as A = {aij }. In a design matrix in regression analysis, the
index i = 1, 2, ..., m refers to the statistical units, and the index j = 1, 2, ..., n to the
variables or attributes.
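
A sketch of the 3 × 4 example matrix as a NumPy array (two-dimensional arrays are the
natural data structure for matrices):

    import numpy as np

    A = np.array([[0.0, 1.0, -2.3, 0.1],
                  [1.3, 4.0, -0.1, 7.0],
                  [4.1, -1.0, 0.0, 1.7]])

    print(A.shape)  # (3, 4), the size of the matrix
    print(A[1, 3])  # 7.0, the element in row 2 and column 4 (indices start at 0)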

2.2. Row and column vectors.

A column vector is an m × 1 matrix. We do not distinguish between vectors and column
vectors.
\[
a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}
\]
In the same way, a row vector is a 1 × m matrix.
\[
a = \begin{bmatrix} a_1 & a_2 & \ldots & a_m \end{bmatrix}
\]
The transpose of a column vector is the corresponding row vector and vice-versa.
\[
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}^T
= \begin{bmatrix} a_1 & a_2 & \ldots & a_m \end{bmatrix}
\]
\[
\begin{bmatrix} a_1 & a_2 & \ldots & a_m \end{bmatrix}^T
= \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}
\]

We can represent a matrix X as a partitioned matrix whose generic block is the 1 × n
row vector xiT = [xi1 , xi2 , ..., xij , ..., xin ], which contains the profile of the i-th row unit,
\[
X = \begin{bmatrix} x_1^T \\ \vdots \\ x_i^T \\ \vdots \\ x_m^T \end{bmatrix}.
\]
Alternatively, we can partition it as
\[
X = [x_1, x_2, ..., x_j, ..., x_n],
\]
where xj is the m × 1 column vector referring to the j-th variable or attribute.

2.3. Transpose.

The transpose of an m × n matrix A yields an n × m matrix that interchanges the
rows and columns of A.
\[
A^T = \begin{bmatrix}
a_{11} & a_{21} & \cdots & a_{i1} & \cdots & a_{m1} \\
a_{12} & a_{22} & \cdots & a_{i2} & \cdots & a_{m2} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
a_{1j} & a_{2j} & \cdots & a_{ij} & \cdots & a_{mj} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{in} & \cdots & a_{mn}
\end{bmatrix}
\]
The transpose has the property that (AT )T = A.

Example.
\[
A = \begin{bmatrix}
2 & 3 & 4 \\
1 & 2 & -1 \\
3 & -2 & 5 \\
-2 & 4 & 1
\end{bmatrix}, \qquad
A^T = \begin{bmatrix}
2 & 1 & 3 & -2 \\
3 & 2 & -2 & 4 \\
4 & -1 & 5 & 1
\end{bmatrix}
\]

2.4. Addition and scalar multiplication.

Scalar multiplication. αA = {αaij }.

Matrix addition. If A is an m × n matrix and B an m × n matrix, then

A + B = {aij + bij } .

This can only be performed if A and B have the exact same dimensions.

Note that (A + B)T = AT + BT .

Example.
\[
\begin{bmatrix} 2 & 3 \\ 3 & -2 \end{bmatrix}
+ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
= \begin{bmatrix} 2+1 & 3+2 \\ 3+3 & -2+4 \end{bmatrix}
= \begin{bmatrix} 3 & 5 \\ 6 & 2 \end{bmatrix}
\]

2.5. Matrix-vector multiplication.

The product of an m × n matrix A with an n-vector b is an m-vector c with element
i equal to the inner product of row i of A with b,
\[
c_i = a_i^T b = \sum_{j=1}^{n} a_{ij} b_j,
\]
where aTi denotes the i-th row of A.

Example.
\[
\begin{bmatrix} 1 & 4 \\ 7 & -3 \\ 2 & -5 \end{bmatrix}
\begin{bmatrix} 2 \\ 1 \end{bmatrix}
= \begin{bmatrix} 1 \times 2 + 4 \times 1 \\ 7 \times 2 - 3 \times 1 \\ 2 \times 2 - 5 \times 1 \end{bmatrix}
= \begin{bmatrix} 6 \\ 11 \\ -1 \end{bmatrix}
\]
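
The same matrix-vector product in NumPy, as a sketch:

    import numpy as np

    A = np.array([[1, 4],
                  [7, -3],
                  [2, -5]])
    b = np.array([2, 1])

    print(A @ b)  # [ 6 11 -1], each entry is the inner product of a row of A with b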

2.6. Matrix-matrix multiplication.

The product of an m × p matrix A with a p × n matrix B is an m × n matrix C with
element ij equal to the inner product of row i of A with column j of B,
\[
c_{ij} = a_i^T b_j = \sum_{k=1}^{p} a_{ik} b_{kj},
\]
where aTi denotes the i-th row of A and bj denotes the j-th column of B.

The matrix partitions that we use in the multiplication are
\[
A = \begin{bmatrix} a_1^T \\ \vdots \\ a_i^T \\ \vdots \\ a_m^T \end{bmatrix},
\qquad
B = [b_1, ..., b_j, ..., b_n].
\]
The multiplication AB is only defined when the column dimension of A (an m × p
matrix) equals the row dimension of B (a p × n matrix).

Example.
\[
\begin{bmatrix} 2 & 3 \\ 3 & -2 \end{bmatrix}
\times \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
= \begin{bmatrix} 2 \times 1 + 3 \times 3 & 2 \times 2 + 3 \times 4 \\ 3 \times 1 - 2 \times 3 & 3 \times 2 - 2 \times 4 \end{bmatrix}
= \begin{bmatrix} 11 & 16 \\ -3 & -2 \end{bmatrix}
\]

Properties of matrix multiplication.

(AB)T = BT AT
(AB)C = A(BC)
A(B + C) = AB + AC
(A + B)C = AC + BC

Unlike in scalar multiplication, the order of multiplication matters for matrices: in
general, AB ≠ BA. Moreover, remember that if m ≠ n, BA is not even defined.

Example.
\[
A = \begin{bmatrix} 1 & 0 \\ 5 & -1 \\ 3 & 2 \end{bmatrix}, \qquad
B = \begin{bmatrix} 2 & -1 \\ 3 & 6 \end{bmatrix}, \qquad
\iota_3 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
\]
\[
C = AB = \begin{bmatrix} 2 & -1 \\ 7 & -11 \\ 12 & 9 \end{bmatrix}, \qquad
\iota_3^T C = \begin{bmatrix} 21 & -3 \end{bmatrix}.
\]
BA is not defined.

Vector outer product. If a is an m-vector and b is an n-vector, the outer product
abT is the m × n matrix
\[
ab^T = \begin{bmatrix}
a_1 b_1 & a_1 b_2 & \ldots & a_1 b_n \\
a_2 b_1 & a_2 b_2 & \ldots & a_2 b_n \\
\vdots & \vdots & & \vdots \\
a_m b_1 & a_m b_2 & \ldots & a_m b_n
\end{bmatrix}
\]
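
A sketch reproducing the matrix-matrix example above and illustrating the outer product
(the vectors in the outer product are our own):

    import numpy as np

    A = np.array([[1, 0],
                  [5, -1],
                  [3, 2]])
    B = np.array([[2, -1],
                  [3, 6]])

    C = A @ B
    print(C)               # [[ 2 -1] [ 7 -11] [12 9]]
    print(np.ones(3) @ C)  # [21. -3.], the column totals iota^T C

    a = np.array([1, 2])
    b = np.array([3, 4, 5])
    print(np.outer(a, b))  # the 2-by-3 outer product a b^T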

2.7. Square matrices.

A square matrix has the same number of rows and columns, m = n.

Symmetric matrix. A square matrix A is symmetric if AT = A.

Quadratic form. Let A be an n-dimensional square matrix and x an n × 1 vector.
The scalar xT Ax is called a quadratic form.

A symmetric matrix A with the property that xT Ax > 0 for any nonzero vector x is
said to be positive definite.

2.8. Identity and diagonal matrices.

The diagonal elements of a matrix are the elements aij such that i = j (same row and
column index).

An identity matrix of order n is a matrix with all diagonal elements equal to one
(aii = 1 for i = 1, . . . , n), and all non-diagonal elements equal to zero, that is
\[
I_n = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix} = \mathrm{diag}(1, ..., 1)
\]
For example,
\[
I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]

Properties. Let A be an m × n matrix.

In² = In
Im A = A
AIn = A

Diagonal matrix. A diagonal matrix is a square matrix with zeros in all the non-
diagonal positions.
\[
D = \begin{bmatrix}
d_1 & 0 & \cdots & 0 & 0 \\
0 & d_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & d_{n-1} & 0 \\
0 & 0 & \cdots & 0 & d_n
\end{bmatrix} = \mathrm{diag}(d_1, ..., d_n)
\]
Let D be an n × n diagonal matrix and A an n × p matrix. The operation DA
multiplies each row i of A by the diagonal element di of D.
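
A sketch of identity and diagonal matrices in NumPy (the matrices are our own examples):

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

    I3 = np.eye(3)                   # 3-by-3 identity matrix
    D = np.diag([1.0, 10.0, 100.0])  # diagonal matrix diag(1, 10, 100)

    print(np.allclose(I3 @ A, A))  # True: I_m A = A
    print(D @ A)                   # row i of A is scaled by the diagonal element d_i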

2.9. Systems of Linear Equations.

Consider a system of m linear equations in n variables:
\[
\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n &= b_1 \\
a_{21} x_1 + a_{22} x_2 + \ldots + a_{2n} x_n &= b_2 \\
&\;\;\vdots \\
a_{m1} x_1 + a_{m2} x_2 + \ldots + a_{mn} x_n &= b_m
\end{aligned}
\]

This system has a compact representation in matrix notation,
\[
Ax = b,
\]
where
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \ldots & a_{1n} \\
a_{21} & a_{22} & \ldots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \ldots & a_{mn}
\end{bmatrix}, \quad
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad
b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}.
\]

The study of linear systems is a fundamental part of linear algebra, which allows us to
determine whether a system has a unique solution, infinitely many solutions, or no
solution, and to obtain a solution if one exists.

Example.

We can write the system
\[
\begin{aligned}
2x_1 + 2x_2 + x_3 &= 9 \\
2x_1 - x_2 + 2x_3 &= 6 \\
x_1 - x_2 + 2x_3 &= 5
\end{aligned}
\]
as
\[
Ax = b,
\]
where
\[
A = \begin{bmatrix} 2 & 2 & 1 \\ 2 & -1 & 2 \\ 1 & -1 & 2 \end{bmatrix}, \qquad
b = \begin{bmatrix} 9 \\ 6 \\ 5 \end{bmatrix}.
\]
The unique solution is x = (1, 2, 3).
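
A sketch of how this system can be solved numerically:

    import numpy as np

    A = np.array([[2.0, 2.0, 1.0],
                  [2.0, -1.0, 2.0],
                  [1.0, -1.0, 2.0]])
    b = np.array([9.0, 6.0, 5.0])

    x = np.linalg.solve(A, b)  # solves Ax = b without explicitly forming an inverse
    print(x)                   # [1. 2. 3.]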

2.10. Matrix inverse.

An n × n matrix A is invertible if there exists a matrix B such that

AB = In

If that is the case, then we call B the inverse of A and use the notation A−1 .

There are several methods for calculating a matrix inverse, but we will leave the details
in the background. In practice, we often do not need to explicitly compute the matrix
inverse to evaluate expressions in which it appears (for example, the OLS formula below).

Properties.

(A−1 )−1 = A
(αA)−1 = (1/α)A−1
(AT )−1 = (A−1 )T
(AB)−1 = B−1 A−1
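
A sketch computing the inverse of the matrix from the previous example and checking the
definition (in practice, np.linalg.solve is usually preferred to forming the inverse
explicitly):

    import numpy as np

    A = np.array([[2.0, 2.0, 1.0],
                  [2.0, -1.0, 2.0],
                  [1.0, -1.0, 2.0]])
    b = np.array([9.0, 6.0, 5.0])

    A_inv = np.linalg.inv(A)
    print(np.allclose(A @ A_inv, np.eye(3)))  # True: A A^{-1} = I_n
    print(A_inv @ b)                          # [1. 2. 3.], the same solution as before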

2.11. Trace† .

The trace of a square matrix is the sum of its diagonal elements. If A is n × n,
\[
\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}.
\]

Properties.

tr(αA) = α tr(A)
tr(A + B) = tr(A) + tr(B)
tr(AT ) = tr(A)
tr(AB) = tr(BA)

3. Differentiation

Let f : Rn −→ R be a function. The gradient of f is the vector of partial derivatives
of the function with respect to each of its arguments.
\[
\nabla f(x) = \begin{bmatrix}
\dfrac{\partial f(x)}{\partial x_1} \\[4pt]
\dfrac{\partial f(x)}{\partial x_2} \\
\vdots \\
\dfrac{\partial f(x)}{\partial x_n}
\end{bmatrix}
\]
We also use the notation:
\[
\frac{d f(x)}{dx} = \left( \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n} \right)
\]

There are several convenient rules for differentiating linear algebra operations with
respect to vectors. The two following rules appear in the derivation of the least squares
estimates of a linear regression.

Let x and a be n−vectors and A an n × n matrix. Then,
\[
\frac{d(x^T a)}{dx} = a
\]
\[
\frac{d(x^T A x)}{dx} = (A + A^T)x
\]

The Matrix Cookbook, which is freely available online, contains a comprehensive catalogue of such rules.
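
As a quick sanity check of the second rule (a sketch with arbitrary test values generated
by us), we can compare the analytical gradient with a finite-difference approximation:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 3
    A = rng.normal(size=(n, n))
    x = rng.normal(size=n)

    def f(x):
        return x @ A @ x  # quadratic form x^T A x

    # Analytical gradient from the rule d(x^T A x)/dx = (A + A^T) x.
    grad_analytical = (A + A.T) @ x

    # Numerical gradient by central finite differences.
    eps = 1e-6
    grad_numerical = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                               for e in np.eye(n)])

    print(np.allclose(grad_analytical, grad_numerical, atol=1e-5))  # True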

4. Random vectors

A random vector or multivariate random variable is a vector with entries that
are scalar-valued random variables.

Let X = (X1 , X2 , . . . , Xn ) be a random vector. The mean vector, or expected value,
of X is an n-vector over R defined as
\[
E(X) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_n) \end{bmatrix}
\]

Let a and b be non-random scalars and Y be another random vector with dimension
n. Then
E(aX + bY ) = aE(X) + bE(Y ),
which follows from the linearity of expectations.

For a non-random n−vector a,

E(aT X) = aT E(X).

Let A be a non-random matrix with n columns.

E(AX) = AE(X)

We define the variance of the random vector as the square matrix
\[
\mathrm{Var}(X) = \begin{bmatrix}
\mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \ldots & \mathrm{Cov}(X_1, X_n) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \ldots & \mathrm{Cov}(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \ldots & \mathrm{Var}(X_n)
\end{bmatrix}
\]
Var(X) is also known as the variance-covariance or covariance matrix of X.

Var(X) = E(XX T ) − E(X)E(X)T

Let a be a non-random vector and b a scalar. Then,

Var(a + bX) = b2 Var(X)

For a non-random n−vector a,

Var(aT X) = aT Var(X)a.

For a non-random matrix A with n columns,

Var(AX) = AVar(X)AT .
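
A sketch checking these identities on simulated data (the mean vector, covariance matrix,
and transformation A are arbitrary choices of ours):

    import numpy as np

    rng = np.random.default_rng(1)

    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    X = rng.multivariate_normal(mu, Sigma, size=100_000)  # draws of a 2-dimensional random vector

    A = np.array([[1.0, 1.0],
                  [2.0, -1.0]])
    AX = X @ A.T  # each row is A applied to one draw of X

    # Sample estimates of E(AX) and Var(AX) versus the identities A E(X) and A Var(X) A^T.
    print(AX.mean(axis=0), A @ mu)
    print(np.cov(AX.T))
    print(A @ Sigma @ A.T)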

5. Application: linear regression and least squares

5.1. Multiple Linear Regression (MLR) model.

The classical MLR model is characterised by the following set of assumptions.

1. Linearity: if X = x, then

Y = β0 + β1 x1 + . . . + βp xp + ε

for some population parameters β0 , β1 , . . . , βp and a random error ε.


2. The conditional mean of ε given X is zero, E(ε|X) = 0.
3. Constant error variance: Var(ε|X) = σ 2 .
4. Independence: the observations are independent.
5. The distribution of X1 , . . . , Xp is arbitrary.
6. There is no perfect multicollinearity (no column of X is a linear combination
of other columns).

The model equation for an observation indexed by i is

Yi = β0 + β1 xi1 + β2 xi2 + . . . + βp xip + εi



We can therefore write the model of a sample of n observations as

Y1 = β0 + β1 x11 + β2 x12 + . . . + βp x1p + ε1


Y2 = β0 + β1 x21 + β2 x22 + . . . + βp x2p + ε2
..
.
Yn = β0 + β1 xn1 + β2 xn2 + . . . + βp xnp + εn

Compact matrix notation:
\[
Y = X\beta + \varepsilon,
\]
where
\[
Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \quad
X = \begin{bmatrix}
1 & x_{11} & x_{12} & \ldots & x_{1p} \\
1 & x_{21} & x_{22} & \ldots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \ldots & x_{np}
\end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]

Assumptions 3 and 4 imply that

Var(ε) = σ 2 In

5.2. Least squares.

Let {(yi , xi )}, i = 1, . . . , n, be a sample. The ordinary least squares (OLS) method obtains
the coefficient values that minimise the residual sum of squares (RSS):
\[
\widehat{\beta} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2
\]

The partial derivatives of the RSS with respect to the coefficients are
\[
\begin{aligned}
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) \\
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_{i1} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) \\
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_2} &= -2 \sum_{i=1}^{n} x_{i2} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) \\
&\;\;\vdots \\
\frac{\partial \mathrm{RSS}(\beta)}{\partial \beta_p} &= -2 \sum_{i=1}^{n} x_{ip} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)
\end{aligned}
\]

Note that each partial derivative j contains a sum that is the inner product of the j-th
column of X with the vector of residuals (y − Xβ). We can therefore write the above
equations using the compact notation

\[
\nabla \mathrm{RSS}(\beta) = -2 X^T (y - X\beta),
\]
where
\[
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
\]

We obtain the first order condition by setting the gradient to zero,
\[
\nabla \mathrm{RSS}(\beta) = -2 X^T y + 2 X^T X \beta = 0.
\]
The least squares estimate β̂ therefore satisfies the system of linear equations:
\[
X^T X \widehat{\beta} = X^T y
\]

Note that X T X is a (p + 1) × (p + 1) matrix and X T y is a (p + 1)-vector, such that


this expression has the form of Section 2.9.

If (X T X) is invertible, which is the case if Assumption 6 of no perfect multicollinearity


is satisfied, left multiplication with (X T X)−1 gives the unique solution

βb = (X T X)−1 X T y.
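
As a sketch (with data simulated by us), the estimate can be computed by solving the
normal equations X^T X β = X^T y, which avoids explicitly forming the inverse:

    import numpy as np

    rng = np.random.default_rng(42)

    # Simulate a small regression problem: n observations, p predictors.
    n, p = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.5, size=n)

    # Solve the normal equations X^T X beta = X^T y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)

    # Equivalent, more numerically stable least squares routine.
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.allclose(beta_hat, beta_lstsq))  # True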

Solution using vector differentiation rules.
\[
\begin{aligned}
\mathrm{RSS}(\beta) &= \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \\
&= (y - X\beta)^T (y - X\beta) \\
&= y^T y - 2\beta^T X^T y + \beta^T X^T X \beta
\end{aligned}
\]

The gradient is
\[
\frac{d(\mathrm{RSS}(\beta))}{d\beta}
= \frac{d(y^T y)}{d\beta} - \frac{d(2\beta^T X^T y)}{d\beta} + \frac{d(\beta^T X^T X \beta)}{d\beta}
= 0 - 2X^T y + 2X^T X \beta
\]

The first order condition is therefore
\[
\frac{d(\mathrm{RSS}(\beta))}{d\beta} = -2X^T y + 2X^T X\beta = 0.
\]
Therefore, as above,
\[
X^T X \widehat{\beta} = X^T y,
\]
leading to
\[
\widehat{\beta} = (X^T X)^{-1} X^T y.
\]

5.3. Sampling properties.

We first obtain the following convenient representation of the estimator.


\[
\begin{aligned}
\widehat{\beta} &= (X^T X)^{-1} X^T Y \\
&= (X^T X)^{-1} X^T (X\beta + \varepsilon) \\
&= (X^T X)^{-1} X^T X \beta + (X^T X)^{-1} X^T \varepsilon \\
&= \beta + (X^T X)^{-1} X^T \varepsilon
\end{aligned}
\]

Below, all the results are conditional on the predictor values in the X matrix. We omit
this conditioning from the notation for simplicity.

Expected value.
\[
\begin{aligned}
E(\widehat{\beta}) &= E\!\left(\beta + (X^T X)^{-1} X^T \varepsilon\right) \\
&= \beta + E\!\left[(X^T X)^{-1} X^T \varepsilon\right] \\
&= \beta + (X^T X)^{-1} X^T E(\varepsilon) \\
&= \beta + (X^T X)^{-1} X^T 0 \\
&= \beta
\end{aligned}
\]
The least squares estimator is unbiased under the model assumptions.

Variance.
\[
\begin{aligned}
\mathrm{Var}(\widehat{\beta}) &= \mathrm{Var}\!\left(\beta + (X^T X)^{-1} X^T \varepsilon\right) \\
&= \mathrm{Var}\!\left((X^T X)^{-1} X^T \varepsilon\right) \\
&= E\!\left((X^T X)^{-1} X^T \varepsilon \varepsilon^T X (X^T X)^{-1}\right) \\
&= (X^T X)^{-1} X^T E(\varepsilon \varepsilon^T) X (X^T X)^{-1} \\
&= (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1}
\end{aligned}
\]

References

Boyd, S. and L. Vandenberghe (2016). Vectors, matrices, and least squares. Available:
stanford.edu/class/ee103/mma.pdf.
Klein, P. N. (2013). Coding the matrix: Linear algebra through applications to computer
science. Newtonian Press.
