Background Material Crib-Sheet
Here is a summary of results with which you should be familiar. If anything here is unclear you should do some further reading and exercises.
1 Probability Theory
Chapter 2, sections 2.1–2.3 of David MacKay’s book covers this material:
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html
The following hold, for all a and b, if and only if A and B are independent: [Independence]

P(A=a | B=b) = P(A=a)
P(B=b | A=a) = P(B=b)
P(A=a, B=b) = P(A=a) P(B=b) .

Otherwise the product rule, P(A=a, B=b) = P(A=a | B=b) P(B=b), must be used.
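As a quick numerical illustration (not part of the sheet), here is a Python/NumPy sketch with a made-up 2x2 joint distribution that happens to factorise; the table values and variable names are mine.

# A minimal sketch: checking the independence conditions for a small joint P(A, B).
import numpy as np

joint = np.array([[0.08, 0.12],      # rows index a, columns index b
                  [0.32, 0.48]])     # chosen so that P(A,B) = P(A) P(B)

P_A = joint.sum(axis=1)              # marginal P(A=a), summing over b
P_B = joint.sum(axis=0)              # marginal P(B=b), summing over a

# Independence: the joint equals the outer product of the marginals.
print(np.allclose(joint, np.outer(P_A, P_B)))     # True for this table

# P(A=a | B=b) = P(A=a, B=b) / P(B=b); equals P(A=a) under independence.
P_A_given_B = joint / P_B                          # divides each column by P(B=b)
print(np.allclose(P_A_given_B, P_A[:, None]))      # True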
P(A=a | B=b, H) = P(B=b | A=a, H) P(A=a | H) / P(B=b | H)  ∝  P(A=a, B=b | H) [Bayes rule]

Note that here, as with any expression, we are free to condition the whole thing on any set of assumptions, H, we like. Note that Σ_a P(A=a, B=b | H) = P(B=b | H) gives the normalising constant of proportionality.
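A minimal sketch of this in Python/NumPy (the prior and likelihood values are made up by me): form the joint, then normalise over a.

import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # P(A=a | H) over three values of a
likelihood = np.array([0.9, 0.1, 0.5])     # P(B=b | A=a, H) for one observed b

joint = likelihood * prior                 # P(A=a, B=b | H)
evidence = joint.sum()                     # sum_a P(A=a, B=b | H) = P(B=b | H)
posterior = joint / evidence               # P(A=a | B=b, H)

print(posterior, posterior.sum())          # normalised: sums to 1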
All the above theory basically still applies to continuous variables if sums are converted into integrals. [Continuous variables] The probability that X lies between x and x+dx is p(x) dx, where p(x) is a probability density function with values in [0, ∞).
P(x_1 < X < x_2) = ∫_{x_1}^{x_2} p(x) dx ,    ∫_{−∞}^{∞} p(x) dx = 1    and    p(x) = ∫_{−∞}^{∞} p(x, y) dy . [Continuous versions of some results]
The expectation or mean under a probability distribution is: [Expectations]

⟨f(a)⟩ = Σ_a P(A=a) f(a)    or    ⟨f(x)⟩ = ∫_{−∞}^{∞} p(x) f(x) dx
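A minimal Python/NumPy sketch (my own example distributions): the discrete expectation as a weighted sum, and the continuous one approximated on a grid, which is just the sum turning into an integral.

import numpy as np

# Discrete: <f(a)> = sum_a P(A=a) f(a), e.g. the mean of a fair six-sided die.
P = np.full(6, 1/6)
a = np.arange(1, 7)
print(np.sum(P * a))                 # 3.5

# Continuous: <x^2> under a standard Gaussian, approximated by a grid sum.
x = np.linspace(-8.0, 8.0, 10_001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(np.sum(p * x**2) * dx)         # approximately 1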
2 Linear Algebra
This is designed as a prequel to Sam Roweis’s “matrix identities” sheet:
http://www.cs.toronto.edu/~roweis/notes/matrixid.pdf
Scalars are individual numbers, vectors are columns of numbers, and matrices are rectangular grids of numbers, e.g.:

x = 3.4 (a scalar),    x = [x_1, x_2, ..., x_n]^T (an n×1 column vector),    A = [A_ij] (an m×n matrix with entries A_11 ... A_1n in its first row down to A_m1 ... A_mn in its last).
Quantities whose inner dimensions match may be “multiplied” by summing over this index. The outer dimensions give the dimensions of the answer. [Multiplication]

Ax has elements (Ax)_i = Σ_{j=1}^{n} A_ij x_j    and    (AA^T)_ij = Σ_{k=1}^{n} A_ik (A^T)_kj = Σ_{k=1}^{n} A_ik A_jk
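A minimal Python/NumPy sketch (the random sizes, seed and matrices are mine): spelling out the index sums above and checking them against matrix multiplication.

import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# (Ax)_i = sum_j A_ij x_j
Ax = np.array([sum(A[i, j] * x[j] for j in range(n)) for i in range(m)])
print(np.allclose(Ax, A @ x))        # True

# (A A^T)_ij = sum_k A_ik A_jk
AAt = np.array([[sum(A[i, k] * A[j, k] for k in range(n)) for j in range(m)]
                for i in range(m)])
print(np.allclose(AAt, A @ A.T))     # True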
All the following are allowed (the dimensions of the answer are also shown): [Check Dimensions]

x^T x     x x^T     A x      A A^T     A^T A     x^T A x
1×1       n×n       m×1      m×m       n×n       1×1
scalar    matrix    vector   matrix    matrix    scalar

while xx, AA and xA do not make sense for m ≠ n ≠ 1 (and x^T A x additionally needs m = n). Can you see why?
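A minimal Python/NumPy sketch (m = 3, n = 4 are arbitrary choices of mine): the shapes NumPy reports mirror the table above, and the forbidden products raise an error.

import numpy as np

m, n = 3, 4
A = np.ones((m, n))
x = np.ones((n, 1))                   # column vector

print((x.T @ x).shape)                # (1, 1)  scalar
print((x @ x.T).shape)                # (n, n)  matrix
print((A @ x).shape)                  # (m, 1)  vector
print((A @ A.T).shape)                # (m, m)  matrix
print((A.T @ A).shape)                # (n, n)  matrix
# x.T @ A @ x would additionally need m == n.

try:
    A @ A                             # inner dimensions n and m do not match
except ValueError as err:
    print("AA fails:", err)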
An exception to the above rule is that we may write xA when x is a scalar: every element of the matrix A is multiplied by the scalar x. [Multiplication by scalar]
2.1 Square Matrices
Now consider the square n × n matrix B.
All off-diagonal elements of diagonal matrices are zero. The “Identity matrix”, which leaves vectors and matrices unchanged on multiplication, is diagonal with each non-zero element equal to one. [Diagonal matrices, the Identity]

B_ij = 0 if i ≠ j   ⇔   “B is diagonal”
I_ij = 0 if i ≠ j and I_ii = 1 ∀i   ⇔   “I is the identity matrix”
Ix = x      IB = B = BI      x^T I = x^T
Some square matrices have inverses: [Inverses]

B^{-1} B = B B^{-1} = I ,     (B^{-1})^{-1} = B ,

which have these properties:

(BC)^{-1} = C^{-1} B^{-1}      (B^{-1})^T = (B^T)^{-1}
Linear simultaneous equations could be solved (inefficiently) this way: [Solving Linear equations]

if Bx = y then x = B^{-1} y
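A minimal Python/NumPy sketch (the 3×3 matrix and right-hand side are mine): solving Bx = y both via the explicit inverse and with np.linalg.solve, which is the usual, more efficient route.

import numpy as np

B = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])

x_inv = np.linalg.inv(B) @ y          # the "inefficient" route via B^{-1}
x_solve = np.linalg.solve(B, y)       # solves Bx = y directly

print(np.allclose(x_inv, x_solve))    # True
print(np.allclose(B @ x_solve, y))    # True: x really satisfies Bx = y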
Some other commonly used matrix definitions include:

B_ij = B_ji   ⇔   “B is symmetric” [Symmetry]

Trace(B) = Tr(B) = Σ_{i=1}^{n} B_ii = “sum of diagonal elements” [Trace]
Cyclic permutations are allowed inside a trace, and the trace of a scalar is a scalar, e.g. x^T B x = Tr(x^T B x) = Tr(B x x^T). [A Trace Trick]

The determinant of a square matrix, written |B|, is a scalar with these properties: [Determinants]

|BC| = |B||C| ,    |x| = x ,    |xB| = x^n |B| ,    |B^{-1}| = 1/|B| .
It determines whether B can be inverted: |B| = 0 ⇒ B^{-1} undefined. If the vector to every point of a shape is pre-multiplied by B then the shape’s area or volume is scaled by a factor of |B|. It also appears in the normalising constant of a Gaussian. For a diagonal matrix the volume scaling factor is simply the product of the diagonal elements. In general the determinant is the product of the eigenvalues.
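A minimal Python/NumPy sketch (random matrices, sizes and seed are my own choices) spot-checking the trace trick and the determinant facts above.

import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))
x = rng.standard_normal((n, 1))
s = 2.5                                            # a scalar

# Cyclic permutation inside a trace; x^T B x is a 1x1 "scalar".
print(np.allclose(x.T @ B @ x, np.trace(B @ x @ x.T)))                        # True

# Determinant properties.
print(np.isclose(np.linalg.det(B @ C), np.linalg.det(B) * np.linalg.det(C)))  # True
print(np.isclose(np.linalg.det(s * B), s**n * np.linalg.det(B)))              # True
print(np.isclose(np.linalg.det(np.linalg.inv(B)), 1 / np.linalg.det(B)))      # True

# For a diagonal matrix the determinant is the product of the diagonal elements.
d = rng.standard_normal(n)
print(np.isclose(np.linalg.det(np.diag(d)), d.prod()))                        # True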
B e^{(i)} = λ^{(i)} e^{(i)}   ⇔   “λ^{(i)} is an eigenvalue of B with eigenvector e^{(i)}” [Eigenvalues, Eigenvectors]

|B| = ∏ eigenvalues      Trace(B) = Σ eigenvalues

If B is real and symmetric (e.g. a covariance matrix) the eigenvectors are orthogonal (perpendicular) and so form a basis (can be used as axes).
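A minimal Python/NumPy sketch (the random covariance-like matrix is mine): np.linalg.eigh handles real symmetric matrices, and the eigenvalue facts above can be checked directly.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
B = A.T @ A                                   # real, symmetric (covariance-like)

eigvals, eigvecs = np.linalg.eigh(B)          # columns of eigvecs are the e^(i)

# B e^(i) = lambda^(i) e^(i) for each column.
print(np.allclose(B @ eigvecs, eigvecs * eigvals))          # True

# |B| = product of eigenvalues, Trace(B) = sum of eigenvalues.
print(np.isclose(np.linalg.det(B), eigvals.prod()))         # True
print(np.isclose(np.trace(B), eigvals.sum()))               # True

# Symmetric case: the eigenvectors are orthogonal.
print(np.allclose(eigvecs.T @ eigvecs, np.eye(3)))          # True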
² This section is only intended to give you a flavour so you understand other references and Sam’s crib sheet. A more detailed history and overview is here: http://www.wikipedia.org/wiki/Determinant
3 Differentiation
Any good A-level maths text book should cover this material and have plenty of exer-
cises. Undergraduate text books might cover it quickly in less than a chapter.
The gradient of a straight line y = mx + c is a constant, y' = (y(x+∆x) − y(x))/∆x = m. [Gradient]

Many functions look like straight lines over a small enough range. The gradient of this line, the derivative, is not constant, but a new function: [Differentiation]

y'(x) = dy/dx = lim_{∆x→0} (y(x+∆x) − y(x))/∆x ,   which could be differentiated again:   y'' = d^2y/dx^2 = dy'/dx
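A minimal Python sketch (the example function y(x) = x^3 and the point x = 2 are mine): the limit definition above as a finite-difference approximation that improves as ∆x shrinks.

def y(x):
    return x ** 3

x0 = 2.0
exact = 3 * x0 ** 2                          # dy/dx = 3x^2, so 12 at x = 2

for dx in [1e-1, 1e-3, 1e-5]:
    approx = (y(x0 + dx) - y(x0)) / dx       # (y(x+dx) - y(x)) / dx
    print(dx, approx, abs(approx - exact))   # error shrinks as dx -> 0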
At a maximum or minimum the function is rising on one side and falling on the other. In between the gradient must be zero. Therefore maxima and minima satisfy: [Optimisation]

df(x)/dx = 0 (scalar x)    or    df(x)/dx = 0  ⇔  df(x)/dx_i = 0 ∀i (vector x)

If we can’t solve this we can evolve our variable x, or vector of variables x, on a computer using gradient information until we find a place where the gradient is zero.
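A minimal Python sketch of “evolving x using gradient information”: plain gradient descent on f(x) = (x − 3)^2 + 1; the function, step size and iteration count are my own choices, not the sheet’s.

def df(x):
    return 2 * (x - 3)                # gradient of f(x) = (x - 3)^2 + 1

x = 0.0                               # starting guess
step_size = 0.1
for _ in range(100):
    x = x - step_size * df(x)         # move downhill

print(x)                              # close to 3, where the gradient is zero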
The “chain rule”, df(u)/dx = (du/dx) (df(u)/du), allows results to be combined. [Chain Rule]

For example, with u = ay^m:

d exp(ay^m)/dy = (d(ay^m)/dy) · (d exp(ay^m)/d(ay^m)) = a m y^{m−1} · exp(ay^m)
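A minimal Python sketch checking this chain-rule example symbolically; the use of sympy is my choice, not the sheet’s.

import sympy as sp

y, a, m = sp.symbols('y a m')
expr = sp.exp(a * y**m)

derivative = sp.diff(expr, y)
claimed = a * m * y**(m - 1) * sp.exp(a * y**m)

print(sp.simplify(derivative - claimed))   # 0, so the two expressions agree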
If you can’t show the following you could do with some practice: [Exercise]

d/dz [ exp(az)/(b + cz) + e ] = exp(az) ( a/(b + cz) − c/(b + cz)^2 )

Note that a, b, c and e are constants, that 1/u = u^{-1}, and this is hard if you haven’t done differentiation (for a long time). Again, get a text book.
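If you want to check your answer after working it by hand, here is a minimal Python sketch using sympy (again a tool choice of mine).

import sympy as sp

z, a, b, c, e = sp.symbols('z a b c e')
expr = sp.exp(a * z) / (b + c * z) + e

lhs = sp.diff(expr, z)                                          # derivative to show
rhs = sp.exp(a * z) * (a / (b + c * z) - c / (b + c * z)**2)    # stated result

print(sp.simplify(lhs - rhs))      # 0 confirms the stated result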
³ More accurate approximations can be made. Look up Taylor series.