Background material crib-sheet

Iain Murray <[email protected]>, October 2003

Here is a summary of results with which you should be familiar. If anything here
is unclear you should do some further reading and exercises.

1 Probability Theory
Chapter 2, sections 2.1–2.3 of David MacKay’s book covers this material:
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

The probability that a discrete variable A takes value a satisfies 0 ≤ P(A = a) ≤ 1.

Probabilities of alternatives add: P(A = a or a′) = P(A = a) + P(A = a′)   [Alternatives]


The probabilities of all outcomes must sum to one: ∑_{all possible a} P(A = a) = 1   [Normalisation]

P(A = a, B = b) is the joint probability that both A = a and B = b occur.   [Joint Probability]

Variables can be “summed out” of joint distributions:   [Marginalisation]


P(A = a) = ∑_{all possible b} P(A = a, B = b)

P(A = a|B = b) is the probability that A = a occurs given the knowledge that B = b.   [Conditional Probability]

P(A = a, B = b) = P(A = a) P(B = b|A = a) = P(B = b) P(A = a|B = b)   [Product Rule]

The following hold, for all a and b, if and only if A and B are independent:   [Independence]

P(A = a|B = b) = P(A = a)
P(B = b|A = a) = P(B = b)
P(A = a, B = b) = P(A = a) P(B = b).
Otherwise the product rule above must be used.

Bayes’ rule can be derived from the above:   [Bayes Rule]

P(A = a|B = b, H) = P(B = b|A = a, H) P(A = a|H) / P(B = b|H) ∝ P(A = a, B = b|H)

Note that here, as with any expression, we are free to condition the whole
thing on any set of assumptions, H, we like. Note that ∑_a P(A = a, B = b|H) =
P(B = b|H) gives the normalising constant of proportionality.
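
These rules are easy to check numerically. A quick sketch in Python with numpy
(the joint table below is made up purely for illustration):

    # Sanity-check normalisation, marginalisation and Bayes rule
    # on a small made-up joint table P(A=a, B=b).
    import numpy as np

    P_AB = np.array([[0.10, 0.20],    # rows index values of A
                     [0.30, 0.40]])   # columns index values of B

    assert np.isclose(P_AB.sum(), 1.0)        # normalisation
    P_A = P_AB.sum(axis=1)                    # marginalisation: P(A=a)
    P_B = P_AB.sum(axis=0)                    # P(B=b)
    P_A_given_B = P_AB / P_B                  # conditional: P(A=a|B=b)
    P_B_given_A = (P_AB.T / P_A).T            # conditional: P(B=b|A=a)
    # Bayes rule: P(A=a|B=b) = P(B=b|A=a) P(A=a) / P(B=b)
    bayes = (P_B_given_A.T * P_A).T / P_B
    assert np.allclose(bayes, P_A_given_B)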

All the above theory basically still applies to continuous variables if sums are
converted into integrals¹. The probability that X lies between x and x+dx is
p(x) dx, where p(x) is a probability density function taking values in [0, ∞).   [Continuous variables]
P(x1 < X < x2) = ∫_{x1}^{x2} p(x) dx ,   ∫_{−∞}^{∞} p(x) dx = 1   and   p(x) = ∫_{−∞}^{∞} p(x, y) dy   [Continuous versions of some results]
The expectation or mean under a probability distribution is:   [Expectations]

⟨f(a)⟩ = ∑_a P(A = a) f(a)   or   ⟨f(x)⟩ = ∫_{−∞}^{∞} p(x) f(x) dx
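
For instance, a minimal discrete expectation in Python (values and probabilities
made up):

    # <f(a)> = sum_a P(A=a) f(a) for a small made-up distribution
    import numpy as np

    a_vals = np.array([1.0, 2.0, 3.0])
    P_A    = np.array([0.2, 0.5, 0.3])
    f      = lambda a: a**2
    mean_f = np.sum(P_A * f(a_vals))   # 0.2*1 + 0.5*4 + 0.3*9 = 4.9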

2 Linear Algebra
This is designed as a prequel to Sam Roweis’s “matrix identities” sheet:
http://www.cs.toronto.edu/~roweis/notes/matrixid.pdf

Scalars are individual numbers, vectors are columns of numbers, matrices are
rectangular grids of numbers, eg:

x = 3.4,    x = [ x1 ]      A = [ A11 A12 ··· A1n ]
                [ x2 ]          [ A21 A22 ··· A2n ]
                [ ⋮  ]          [  ⋮   ⋮   ⋱   ⋮  ]
                [ xn ]          [ Am1 Am2 ··· Amn ]

In the above example x is 1 × 1, x is n × 1 and A is m × n.   [Dimensions]


The transpose operator, ^T (' in Matlab), swaps the rows and columns:   [Transpose]

x^T = [ x1 x2 ··· xn ] ,   (A^T)_ij = A_ji ,   (x^T)^T = x

Quantities whose inner dimensions match may be “multiplied” by summing over
this index. The outer dimensions give the dimensions of the answer.   [Multiplication]
Ax has elements (Ax)_i = ∑_{j=1}^n A_ij x_j   and   (AA^T)_ij = ∑_{k=1}^n A_ik (A^T)_kj = ∑_{k=1}^n A_ik A_jk

All the following are allowed (the dimensions of the answer are also shown):   [Check Dimensions]

x^T x    x x^T    Ax       A A^T    A^T A    x^T A x
1×1      n×n      m×1      m×m      n×n      1×1
scalar   matrix   vector   matrix   matrix   scalar

while xx, AA and xA do not make sense for m ≠ n ≠ 1. Can you see why?

An exception to the above rule is that we may write: xA. Every element of the
matrix A is multiplied by the scalar x.   [Multiplication by scalar]

Simple and valid manipulations:   [Easily proved results]

(AB)C = A(BC)    A(B+C) = AB+AC    (A+B)^T = A^T + B^T    (AB)^T = B^T A^T

Note that AB ≠ BA in general.
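
If in doubt, numpy will check the dimension rules and these identities for you.
A small sketch (sizes chosen arbitrarily):

    # Shapes follow the inner/outer dimension rule; (AB)^T = B^T A^T.
    import numpy as np

    m, n = 3, 2
    x = np.random.randn(n, 1)                  # n x 1 column vector
    A = np.random.randn(m, n)                  # m x n matrix
    B = np.random.randn(n, n)

    assert (x.T @ x).shape == (1, 1)           # scalar
    assert (x @ x.T).shape == (n, n)           # matrix
    assert (A @ x).shape == (m, 1)             # vector
    assert (A @ A.T).shape == (m, m)
    assert np.allclose((A @ B).T, B.T @ A.T)   # transpose of a product
    # A @ A raises ValueError: inner dimensions n and m do not match.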


¹ Integrals are the equivalent of sums for continuous variables. Eg: ∑_{i=1}^n f(x_i) ∆x becomes
the integral ∫_a^b f(x) dx in the limit ∆x → 0, n → ∞, where ∆x = (b−a)/n and x_i = a + i∆x.
Find an A-level text book with some diagrams if you have not seen this before.

2.1 Square Matrices
Now consider the square n × n matrix B.

All off-diagonal elements of diagonal matrices are zero. The “Identity matrix”,
which leaves vectors and matrices unchanged on multiplication, is diagonal with
each non-zero element equal to one.   [Diagonal matrices, the Identity]

B_ij = 0 if i ≠ j ⇔ “B is diagonal”
I_ij = 0 if i ≠ j and I_ii = 1 ∀i ⇔ “I is the identity matrix”
Ix = x ,   IB = B = BI ,   x^T I = x^T
Some square matrices have inverses:   [Inverses]

B^-1 B = B B^-1 = I ,   (B^-1)^-1 = B ,

which have these properties:

(BC)^-1 = C^-1 B^-1 ,   (B^-1)^T = (B^T)^-1
Linear simultaneous equations could be solved (inefficiently) this way:   [Solving Linear equations]

if Bx = y then x = B^-1 y
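
In practice a dedicated solver beats forming B^-1 explicitly. A sketch in Python
with numpy (numbers made up):

    # Solve Bx = y: np.linalg.solve is faster and more accurate than
    # computing the inverse, though both agree here.
    import numpy as np

    B = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    y = np.array([9.0, 8.0])
    x = np.linalg.solve(B, y)              # preferred route
    x_inv = np.linalg.inv(B) @ y           # the "inefficient" way above
    assert np.allclose(x, x_inv)
    assert np.allclose(B @ x, y)           # here x = (2, 3)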
Some other commonly used matrix definitions include:

B_ij = B_ji ⇔ “B is symmetric”   [Symmetry]
Trace(B) = Tr(B) = ∑_{i=1}^n B_ii = “sum of diagonal elements”   [Trace]
Cyclic permutations are allowed inside trace. The trace of a scalar is a scalar:   [A Trace Trick]

Tr(BCD) = Tr(DBC) = Tr(CDB) ,   x^T B x = Tr(x^T B x) = Tr(x x^T B)

The determinant² is written Det(B) or |B|. It is a scalar regardless of n.   [Determinants]

|BC| = |B||C| ,   |x| = x ,   |xB| = x^n |B| ,   |B^-1| = 1/|B|.

It determines if B can be inverted: |B| = 0 ⇒ B^-1 undefined. If the vector to
every point of a shape is pre-multiplied by B then the shape’s area or volume
increases by a factor of |B|. It also appears in the normalising constant of
a Gaussian. For a diagonal matrix the volume scaling factor is simply the
product of the diagonal elements. In general the determinant is the product of
the eigenvalues.
B e^(i) = λ^(i) e^(i) ⇔ “λ^(i) is an eigenvalue of B with eigenvector e^(i)”   [Eigenvalues, Eigenvectors]

|B| = ∏ eigenvalues ,   Trace(B) = ∑ eigenvalues

If B is real and symmetric (eg a covariance matrix) the eigenvectors are orthog-
onal (perpendicular) and so form a basis (can be used as axes).
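
These eigenvalue facts are easy to verify numerically, eg in Python with numpy
(random symmetric matrix for illustration):

    # det(B) = product of eigenvalues, Tr(B) = sum of eigenvalues,
    # and a real symmetric B has orthonormal eigenvectors (Q^T Q = I).
    import numpy as np

    C = np.random.randn(4, 4)
    B = C @ C.T                               # real, symmetric
    eigvals, Q = np.linalg.eigh(B)            # eigh: for symmetric matrices
    assert np.allclose(np.linalg.det(B), np.prod(eigvals))
    assert np.allclose(np.trace(B), np.sum(eigvals))
    assert np.allclose(Q.T @ Q, np.eye(4))    # orthogonal eigenvectors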
² This section is only intended to give you a flavour so you understand other
references and Sam’s crib sheet. More detailed history and overview is here:
http://www.wikipedia.org/wiki/Determinant

3 Differentiation
Any good A-level maths text book should cover this material and have plenty of exer-
cises. Undergraduate text books might cover it quickly in less than a chapter.
The gradient of a straight line y = mx + c is a constant: y′ = (y(x+∆x) − y(x))/∆x = m   [Gradient]

Many functions look like straight lines over a small enough range. The gradient   [Differentiation]
of this line, the derivative, is not constant, but a new function:

y′(x) = dy/dx = lim_{∆x→0} (y(x+∆x) − y(x))/∆x ,   which could be differentiated again:   y″ = d^2y/dx^2 = dy′/dx

The following results are well known (c is a constant):   [Standard derivatives]

f(x):    c    cx    cx^n         log_e(x)    exp(x)
f′(x):   0    c     cnx^(n−1)    1/x         exp(x)

At a maximum or minimum the function is rising on one side and falling on the   [Optimisation]
other. In between the gradient must be zero. Therefore maxima and minima satisfy:

df(x)/dx = 0   or, for a vector of variables x,   df(x)/dx = 0 ⇔ df(x)/dx_i = 0 ∀i

If we can’t solve this we can evolve our variable x, or variables x, on a computer
using gradient information until we find a place where the gradient is zero.
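
A minimal sketch of that idea, gradient descent on f(x) = (x − 3)^2 (step size
chosen arbitrarily):

    # Evolve x downhill using gradient information until f'(x) ≈ 0.
    f_prime = lambda x: 2.0 * (x - 3.0)    # derivative of f(x) = (x - 3)^2

    x, step = 0.0, 0.1
    for _ in range(100):
        x -= step * f_prime(x)             # move against the gradient
    # x is now very close to 3, where the gradient is zero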

A function may be approximated by a straight line³ about any point a.   [Approximation]

f(a + x) ≈ f(a) + x f′(a) ,   eg: log(1 + x) ≈ log(1 + 0) + x · 1/(1 + 0) = x

The derivative operator is linear:   [Linearity]

d(f(x) + g(x))/dx = df(x)/dx + dg(x)/dx ,   eg: d(x + exp(x))/dx = 1 + exp(x).

Dealing with products is slightly more involved:   [Product Rule]

d(u(x)v(x))/dx = v du/dx + u dv/dx ,   eg: d(x · exp(x))/dx = exp(x) + x exp(x).

The “chain rule” df(u)/dx = (du/dx) · (df(u)/du) allows results to be combined.   [Chain Rule]

For example:   d exp(ay^m)/dy = (d(ay^m)/dy) · (d exp(ay^m)/d(ay^m))   “with u = ay^m”
                              = amy^(m−1) · exp(ay^m)
If you can’t show the following you could do with some practice:   [Exercise]

d/dz [ (1/(b + cz)) exp(az) + e ] = exp(az) ( a/(b + cz) − c/(b + cz)^2 )

Note that a, b, c and e are constants, that 1/u = u^-1 and this is hard if you haven’t
done differentiation (for a long time). Again, get a text book.
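
You can at least check your answer numerically, eg with a central finite
difference in Python (constants chosen arbitrarily):

    # Compare the claimed derivative against (f(z+dz) - f(z-dz)) / (2 dz).
    import numpy as np

    a, b, c, e = 1.3, 0.7, 2.1, 5.0
    f       = lambda z: np.exp(a*z) / (b + c*z) + e
    f_prime = lambda z: np.exp(a*z) * (a/(b + c*z) - c/(b + c*z)**2)

    z, dz = 0.4, 1e-6
    numeric = (f(z + dz) - f(z - dz)) / (2 * dz)   # central difference
    assert np.isclose(numeric, f_prime(z))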
³ More accurate approximations can be made. Look up Taylor series.
