
Lecture 2: Probability and linear algebra basics

Statistical Learning (BST 263)

Jeffrey W. Miller

Department of Biostatistics
Harvard T.H. Chan School of Public Health

1 / 21
Outline

Linear algebra basics

Probability basics

Random vectors

2 / 21
Outline

Linear algebra basics

Probability basics

Random vectors

3 / 21
Linear algebra in this course

A little bit of linear algebra is essential for understanding many machine learning methods.
- E.g., linear regression, logistic regression, LDA, QDA, PCA, GAMs, kernel ridge, SVMs, K-means.

Linear algebra is not a prerequisite for this course, so I made the following slides to give you the basic concepts needed. You will need to study this material carefully if you are not already familiar with it.

4 / 21
Matrices and transposes
A is an m × n real matrix, written A ∈ R^{m×n}, if

    A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}

where a_{ij} ∈ R. The (i, j)th entry of A is A_{ij} = a_{ij}.

The transpose of A ∈ R^{m×n} is defined as

    A^T = \begin{pmatrix} A_{11} & A_{21} & \cdots & A_{m1} \\ A_{12} & A_{22} & \cdots & A_{m2} \\ \vdots & & \ddots & \vdots \\ A_{1n} & A_{2n} & \cdots & A_{mn} \end{pmatrix} ∈ R^{n×m}.

In other words, (A^T)_{ij} = A_{ji}.

Note: x ∈ R^n is considered to be a column vector in R^{n×1}.
5 / 21
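These definitions map directly onto numpy arrays; a minimal sketch (numpy is assumed here, it is not part of the slides), where `A.T` is the transpose:

```python
import numpy as np

# A 2x3 real matrix, i.e., an element of R^{2x3}.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# The transpose swaps rows and columns: (A^T)_{ij} = A_{ji}.
At = A.T
assert At.shape == (3, 2)
assert At[2, 1] == A[1, 2]  # entry-by-entry check of the definition
```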
Sums and products of matrices

The sum of matrices A ∈ R^{m×n} and B ∈ R^{m×n} is the matrix A + B ∈ R^{m×n} such that

    (A + B)_{ij} = A_{ij} + B_{ij}.

The product of matrices A ∈ R^{m×n} and B ∈ R^{n×ℓ} is the matrix AB ∈ R^{m×ℓ} such that

    (AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.

6 / 21
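The entrywise definition of the product agrees with numpy's built-in `@` operator; a quick sketch (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))  # A in R^{2x3}
B = rng.normal(size=(3, 4))  # B in R^{3x4}

# Built-in matrix product: AB in R^{2x4}.
AB = A @ B

# The same product computed entrywise from the definition
# (AB)_{ij} = sum_k A_{ik} B_{kj}.
AB_manual = np.zeros((2, 4))
for i in range(2):
    for j in range(4):
        AB_manual[i, j] = sum(A[i, k] * B[k, j] for k in range(3))

assert np.allclose(AB, AB_manual)
```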
Basic matrix properties

In the following properties, it is assumed that the matrix dimensions are compatible. (For example, if we write A + B then it is assumed that A and B are the same size.)

(AB)C = A(BC)
- Consequently, we can write ABC without specifying the order in which the multiplications are performed.
A(B + C) = AB + AC
(B + C)A = BA + CA
Except in special circumstances, AB is not equal to BA.
(AB)^T = B^T A^T
(A + B)^T = A^T + B^T

7 / 21
Identity, inverse, and trace
The n × n identity matrix, denoted I_{n×n} or I for short, is

    I = I_{n×n} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} ∈ R^{n×n}.

IA = A = AI
If it exists, the inverse of A, denoted A^{-1}, is a matrix such that A^{-1}A = I and AA^{-1} = I.
If A^{-1} exists, we say that A is invertible.
(A^{-1})^T = (A^T)^{-1}
(AB)^{-1} = B^{-1}A^{-1}
The trace of a square matrix A ∈ R^{n×n}, denoted tr(A), is defined as tr(A) = \sum_{i=1}^{n} A_{ii}.
tr(AB) = tr(BA) if AB is a square matrix.
8 / 21
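Two of these identities are easy to check numerically; a small sketch (numpy assumed, with random matrices that are invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 3))

# tr(AB) = tr(BA), even though AB is 3x3 while BA is 5x5.
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# (MN)^{-1} = N^{-1} M^{-1} for invertible square matrices.
M = rng.normal(size=(3, 3))
N = rng.normal(size=(3, 3))
assert np.allclose(np.linalg.inv(M @ N), np.linalg.inv(N) @ np.linalg.inv(M))
```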
Symmetric and definite matrices

A is symmetric if A = A^T.

A is symmetric positive semi-definite (SPSD) if and only if A = B^T B for some B ∈ R^{m×n} and some m.

A is symmetric positive definite (SPD) if and only if A is SPSD and A^{-1} exists.

There are many equivalent definitions of SPSD and SPD (which is why I wrote “if and only if”). I believe the definitions above are the easiest to understand and use.

9 / 21
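The definition A = B^T B makes it easy to construct an SPSD matrix and verify the equivalent characterizations (symmetry, nonnegative eigenvalues, x^T A x ≥ 0); a sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(5, 3))

# A = B^T B is SPSD by construction.
A = B.T @ B
assert np.allclose(A, A.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(A) >= -1e-10)  # eigenvalues >= 0 (up to rounding)

# x^T A x = ||Bx||^2 >= 0 for any x: another way to see semi-definiteness.
x = rng.normal(size=3)
assert x @ A @ x >= -1e-10
```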
Outline

Linear algebra basics

Probability basics

Random vectors

10 / 21
Discrete random variables
Informally, a random variable (r.v.) is a quantity that probabilistically takes any one of a range of values.
Notation: Uppercase for r.v.s, lowercase for values taken.

A random variable X is discrete if it takes values in a countable set X = {x_1, x_2, . . .}.
Examples: Bernoulli, Binomial, Poisson, Geometric.

The density of a discrete r.v. is the function p(x) = P(X = x) = probability that X equals x.
- Sometimes, p(x) is called the probability mass function in the discrete case, but “density” is technically correct also.

Properties (discrete case):

    0 ≤ p(x) ≤ 1,    \sum_{x∈X} p(x) = 1,    P(X ∈ A) = \sum_{x∈A} p(x).

11 / 21
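These properties can be checked concretely for one of the listed examples; a sketch using a Binomial(n = 10, q = 0.3) density, with p(x) = C(n, x) q^x (1 − q)^{n−x} (the parameter values are illustrative, not from the slides):

```python
from math import comb

# Density (pmf) of X ~ Binomial(n=10, q=0.3), supported on {0, 1, ..., n}.
n, q = 10, 0.3
p = {x: comb(n, x) * q**x * (1 - q) ** (n - x) for x in range(n + 1)}

assert all(0 <= px <= 1 for px in p.values())   # 0 <= p(x) <= 1
assert abs(sum(p.values()) - 1.0) < 1e-12       # sum over X is 1

# P(X in A) = sum of p(x) over A, e.g. A = {0, 1, 2}.
prob_at_most_2 = sum(p[x] for x in [0, 1, 2])
```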
Continuous random variables
A random variable X ∈ R is continuous if there is a function p(x) ≥ 0 such that P(X ∈ A) = \int_A p(x) dx for all A ⊆ R.
- (We will ignore measure-theoretic technicalities in this course.)
Examples: Normal, Uniform, Beta, Gamma, Exponential.

p(x) is called the density of X.

Careful! p(x) is not the probability that X equals x.
Note that \int_R p(x) dx = 1, but p(x) can be > 1.

The same definitions apply to random vectors X ∈ R^n, with R^n in place of R.

The cumulative distribution function (c.d.f.) of X ∈ R is

    F(x) = P(X ≤ x) = \int_{-∞}^{x} p(x') dx'.

12 / 21
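The "p(x) can be > 1" point is easy to see with a Normal density with small standard deviation; a sketch (numpy assumed; the trapezoid sum is computed by hand so it works on any numpy version):

```python
import numpy as np

# Density of Normal(0, sigma^2) with sigma = 0.1: values exceed 1,
# yet the density still integrates to 1.
sigma = 0.1
x = np.linspace(-1.0, 1.0, 200001)
p = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

assert p.max() > 1.0  # a density value, not a probability

# Trapezoid-rule approximation of the integral of p over R
# (the mass outside [-1, 1] is negligible at 10 standard deviations).
integral = float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(x)))
assert abs(integral - 1.0) < 1e-6
```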
Joint distributions of multiple random variables/vectors
p(x, y) denotes the joint density of X ∈ X and Y ∈ Y.
- P(X = x, Y = y) = p(x, y) if X and Y are discrete.
- P(X ∈ A, Y ∈ B) = \int_{A×B} p(x, y) dx dy if X and Y are continuous.
- P(X = x, Y ∈ B) = \int_B p(x, y) dy if X is discrete and Y is continuous.

The density of X can be recovered from the joint density by marginalizing over Y:
- p(x) = \sum_{y∈Y} p(x, y) if Y is discrete,
- p(x) = \int_Y p(x, y) dy if Y is continuous.

Note: It is common to use “p” to denote all densities and follow the convention that X is taking the value x, Y is taking the value y, etc.
13 / 21
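For discrete variables, a joint density is just a table, and marginalizing is summing over an axis; a sketch with a made-up joint table (the numbers are illustrative only):

```python
import numpy as np

# Joint density p(x, y) for discrete X in {0,1,2} and Y in {0,1},
# stored with rows indexed by x and columns by y.
pxy = np.array([[0.10, 0.20],
                [0.25, 0.15],
                [0.05, 0.25]])
assert abs(pxy.sum() - 1.0) < 1e-12   # a valid joint density sums to 1

# Marginalize over Y to recover p(x) = sum_y p(x, y), and similarly p(y).
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)
assert np.allclose(px, [0.30, 0.40, 0.30])
assert abs(py.sum() - 1.0) < 1e-12
```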
Conditional densities and Independence
If p(y) > 0 then the conditional density of X given Y = y is

    p(x|y) = p(x, y) / p(y).

X and Y are independent if p(x, y) = p(x)p(y) for all x, y.

X_1, . . . , X_n are independent if

    p(x_1, . . . , x_n) = p(x_1) · · · p(x_n)

for all x_1, . . . , x_n.

X_1, . . . , X_n are conditionally independent given Y if

    p(x_1, . . . , x_n | y) = p(x_1|y) · · · p(x_n|y)

for all x_1, . . . , x_n, y.
14 / 21
Expectations (a.k.a. expected values)

Suppose h(x) is a real-valued function of x.

The expectation of h(X), denoted E(h(X)), is
- E(h(X)) = \sum_{x∈X} h(x) p(x) if X is discrete,
- E(h(X)) = \int_X h(x) p(x) dx if X is continuous.

The conditional expectation of h(X) given Y = y is
- E(h(X) | Y = y) = \sum_{x∈X} h(x) p(x|y) if X is discrete,
- E(h(X) | Y = y) = \int_X h(x) p(x|y) dx if X is continuous.

E(h(X)|Y) is defined as g(Y) where g(y) = E(h(X)|Y = y).

Law of iterated expectations: E(E(h(X)|Y)) = E(h(X)).

15 / 21
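The law of iterated expectations can be verified exactly on a small discrete joint table, with h(x) = x; a sketch (the joint table is illustrative, not from the slides):

```python
import numpy as np

# Joint density p(x, y) for X in {0,1,2}, Y in {0,1} (rows = x, cols = y).
pxy = np.array([[0.10, 0.20],
                [0.25, 0.15],
                [0.05, 0.25]])
xs = np.array([0.0, 1.0, 2.0])
px = pxy.sum(axis=1)   # marginal of X
py = pxy.sum(axis=0)   # marginal of Y

# g(y) = E(X | Y = y) = sum_x x * p(x|y), with p(x|y) = p(x,y)/p(y).
g = np.array([np.sum(xs * pxy[:, j] / py[j]) for j in range(2)])

# E(E(X|Y)) = sum_y g(y) p(y) equals E(X) = sum_x x p(x).
assert np.isclose(np.sum(g * py), np.sum(xs * px))
```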
Outline

Linear algebra basics

Probability basics

Random vectors

16 / 21
Random vectors

If Z_1, . . . , Z_n ∈ R are random variables, then

    Z = (Z_1, . . . , Z_n)^T

is a random vector in R^n.

The expectation of a random vector Z ∈ R^n is

    E(Z) = (E(Z_1), . . . , E(Z_n))^T.

17 / 21
Random vectors
The covariance matrix of a random vector Z ∈ R^n is the matrix Cov(Z) ∈ R^{n×n} with (i, j)th entry

    Cov(Z)_{ij} = Cov(Z_i, Z_j)

where

    Cov(Z_i, Z_j) = E[(Z_i − E(Z_i))(Z_j − E(Z_j))] = E(Z_i Z_j) − E(Z_i)E(Z_j).

Equivalently,

    Cov(Z) = E[(Z − E(Z))(Z − E(Z))^T] = E(ZZ^T) − E(Z)E(Z)^T.

Recall that Z ∈ R^n is considered to be a column vector in R^{n×1}, so ZZ^T is a matrix in R^{n×n}.
18 / 21
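The identity Cov(Z) = E(ZZ^T) − E(Z)E(Z)^T holds exactly for empirical averages as well, which gives a quick numerical check against numpy's covariance estimator; a sketch (numpy assumed, samples stored as rows):

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(100000, 2))    # rows are i.i.d. draws of Z in R^2
Z[:, 1] = 0.5 * Z[:, 0] + Z[:, 1]   # introduce some correlation

# Plug empirical averages into Cov(Z) = E(ZZ^T) - E(Z)E(Z)^T.
m = Z.mean(axis=0)
C = (Z.T @ Z) / len(Z) - np.outer(m, m)

# Algebraically identical to numpy's (biased) covariance estimator.
assert np.allclose(C, np.cov(Z.T, bias=True), atol=1e-8)
```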
Random vectors

Cov(Z) is always SPSD.

If Z ∈ R^n is a random vector, then

    E(AZ + b) = A E(Z) + b

and

    Cov(AZ + b) = A Cov(Z) A^T

for any fixed (i.e., nonrandom) A ∈ R^{m×n} and b ∈ R^m.

If Y, Z ∈ R^n are independent random vectors, then

    Cov(Y + Z) = Cov(Y) + Cov(Z).

19 / 21
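Both transformation rules can be checked by Monte Carlo: for standard normal Z, Cov(Z) = I, so Cov(AZ + b) should be close to A A^T and E(AZ + b) close to b. A sketch (numpy assumed; the loose tolerances allow for sampling error):

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(200000, 3))    # i.i.d. standard normal rows: Cov(Z) ≈ I
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([5.0, -3.0])

W = Z @ A.T + b                     # each row is AZ + b for one draw of Z

# Cov(AZ + b) = A Cov(Z) A^T = A A^T here, and E(AZ + b) = A·0 + b = b.
C_emp = np.cov(W.T, bias=True)
assert np.allclose(C_emp, A @ A.T, atol=0.05)
assert np.allclose(W.mean(axis=0), b, atol=0.05)
```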
Multivariate normal distribution
If µ ∈ R^n and C ∈ R^{n×n} is SPSD, then Z ∼ N(µ, C) denotes that Z is multivariate normal with E(Z) = µ and Cov(Z) = C.

Standard multivariate normal: If Z_1, . . . , Z_n ∼ N(0, 1) independently and Z = (Z_1, . . . , Z_n)^T, then Z ∼ N(0, I).

Affine transformation property: If Z ∼ N(µ, C) then AZ + b ∼ N(Aµ + b, ACA^T) for any fixed A ∈ R^{m×n}, b ∈ R^m, µ ∈ R^n, and SPSD C ∈ R^{n×n}.

Any multivariate normal distribution can be obtained via an affine transformation (AZ + b) of Z ∼ N(0, I_{n×n}) for an appropriate choice of n, A, and b.

20 / 21
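The last point is how multivariate normal sampling is done in practice: factor C = AA^T (e.g., by Cholesky decomposition, which requires C to be SPD) and apply the affine transformation to standard normal draws. A sketch assuming numpy and an illustrative SPD choice of C:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])            # an SPD covariance matrix

A = np.linalg.cholesky(C)             # lower-triangular A with C = A A^T
Z = rng.normal(size=(200000, 2))      # rows are draws of Z ~ N(0, I)
X = Z @ A.T + mu                      # X = AZ + mu ~ N(mu, A A^T) = N(mu, C)

# Sample mean and covariance match mu and C up to Monte Carlo error.
assert np.allclose(X.mean(axis=0), mu, atol=0.05)
assert np.allclose(np.cov(X.T), C, atol=0.05)
```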
Multivariate normal distribution

Sum property: If Y ∼ N(µ_1, C_1) and Z ∼ N(µ_2, C_2) independently, then Y + Z ∼ N(µ_1 + µ_2, C_1 + C_2).

Density: If Z = (Z_1, . . . , Z_n)^T ∼ N(µ, C) and C^{-1} exists, then Z has density

    p(z) = \frac{1}{(2π)^{n/2} \det(C)^{1/2}} \exp\left( -\tfrac{1}{2} (z − µ)^T C^{-1} (z − µ) \right)

for all z ∈ R^n.

21 / 21
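The density formula can be evaluated directly and sanity-checked by numerically integrating it over a grid, which should give total probability close to 1; a sketch with illustrative µ and C (numpy assumed):

```python
import numpy as np

# Evaluate the N(mu, C) density from the formula above, for n = 2.
mu = np.array([0.0, 0.0])
C = np.array([[1.0, 0.5],
              [0.5, 2.0]])
Cinv = np.linalg.inv(C)

def mvn_density(z):
    d = z - mu
    norm_const = (2 * np.pi) ** (len(z) / 2) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * d @ Cinv @ d) / norm_const

# Riemann-sum the density over a grid covering essentially all of the mass.
xs = np.linspace(-6.0, 6.0, 241)
step = xs[1] - xs[0]
total = sum(mvn_density(np.array([x, y])) for x in xs for y in xs) * step**2
assert abs(total - 1.0) < 1e-2
```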
