Maths Primer
Machine Intelligence
Outline
1 Linear Algebra
Transpose, Inverse, Rank and Trace
Determinant
Eigenanalysis
Matrix Gradient
2 Analysis
Metrics
Jacobi and Hessian
Taylor Series
Optimization
3 Probability Theory
Combinatorics
Random Variables and Vectors
Conditional Probabilities and Independence
Expectations and Moments
Linear Algebra
Transpose, Inverse, Rank and Trace
(A B)⊤ = B⊤ A⊤
Linear independence
A set of vectors {a1, ..., aN} is linearly independent if
    ∑_{i=1}^{N} αi ai = 0
holds only if all αi = 0. This means none of the vectors can be expressed
as a linear combination of the others.
Rank
The rank rank(A) of a matrix A is the maximum number of linearly
independent rows (or columns).
Trace
The trace of a square matrix A ∈ R^{N×N} is defined as Tr(A) = ∑_{i=1}^{N} aii.
It holds:
Tr(A B) = Tr(B A)
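These identities are easy to check numerically. Below is a minimal sketch with NumPy; the matrices A and B are arbitrary examples, not taken from the slides.

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])   # rank deficient: row3 = 2*row2 - row1
B = np.random.rand(3, 3)

# transpose of a product: (A B)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)

# rank: maximum number of linearly independent rows (or columns)
print(np.linalg.matrix_rank(A))               # 2

# trace and its cyclic property Tr(A B) = Tr(B A)
print(np.trace(A))                            # 15.0
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```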
Determinant
Note:
Determinant of the identity matrix: det(I) = 1
Determinant of a transposed matrix: det(A) = det(A⊤)
Determinant of a product of two matrices: det(A B) = det(A) det(B)
Laplace expansion along row i:
    det(A) = ∑_{j=1}^{N} Aij Cij
Row i can be any row, the result is always the same. The cofactors Cij
are defined as Cij = (−1)^{i+j} det([A]_∅ij), where [A]_∅ij is the submatrix
that remains when the i-th row and j-th column are removed:
    [A]_∅ij =
    ⎛ A11        ···  A1,j−1      A1,j+1      ···  A1N      ⎞
    ⎜  ⋮               ⋮           ⋮                ⋮       ⎟
    ⎜ Ai−1,1     ···  Ai−1,j−1    Ai−1,j+1    ···  Ai−1,N   ⎟
    ⎜ Ai+1,1     ···  Ai+1,j−1    Ai+1,j+1    ···  Ai+1,N   ⎟
    ⎜  ⋮               ⋮           ⋮                ⋮       ⎟
    ⎝ AN1        ···  AN,j−1      AN,j+1      ···  ANN      ⎠
For a 2×2 matrix A = (a b; c d):
    |A| = ad − bc
For a 3×3 matrix A = (a b c; d e f; g h i), expanding along the first row:
    |A| = a · |e f; h i| − b · |d f; g i| + c · |d e; g h|
The inverse of A can be expressed as
    A⁻¹ = adj[A] / det(A),
where the adjoint adj[A] of A is the matrix whose elements are the
cofactors:
    (adj[A])ij = Cji
The determinant of an inverse matrix is given by
    det(A⁻¹) = 1 / det(A)
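A quick numerical sanity check of the determinant rules and the Laplace expansion; the invertible 3×3 matrix A below is an arbitrary choice.

```python
import numpy as np

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])   # arbitrary invertible matrix
B = np.random.rand(3, 3)
detA = np.linalg.det(A)

# Laplace expansion along row i = 0: det(A) = sum_j A[0, j] * C_0j
expansion = 0.0
for j in range(3):
    minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)   # [A]_{∅0j}
    expansion += (-1) ** j * A[0, j] * np.linalg.det(minor)
assert np.isclose(expansion, detA)

# det(A^T) = det(A), det(A B) = det(A) det(B), det(A^{-1}) = 1/det(A)
assert np.isclose(np.linalg.det(A.T), detA)
assert np.isclose(np.linalg.det(A @ B), detA * np.linalg.det(B))
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / detA)
```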
Eigendecomposition of a Matrix
A x = λx
(A − λI) x = 0
Characteristic Equation: det(A − λI) = 0
Polynomial of order N
N (not necessarily distinct) solutions
Number of non-zero Eigenvalues: rank(A)
In general: Eigenvalues are complex
For symmetric matrices (A = A⊤ ): Eigenvalues are real
Determinant: det(A) = ∏_{i=1}^{N} λi
Trace: Tr(A) = ∑_{i=1}^{N} λi
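These relations can be verified numerically; the symmetric example matrix S below is an assumption chosen for illustration.

```python
import numpy as np

S = np.array([[2., 1.],
              [1., 3.]])       # symmetric -> real eigenvalues

eigvals, eigvecs = np.linalg.eigh(S)   # eigh is for symmetric/Hermitian matrices

# A v = lambda v for every eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(S @ v, lam * v)

# det(A) = product of eigenvalues, Tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(S), np.prod(eigvals))
assert np.isclose(np.trace(S), np.sum(eigvals))
```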
Eigendecomposition of a Matrix in R³
Example
    A = (  0  −1  1
          −3  −2  3
          −2  −2  3 )
The characteristic polynomial det(A − λI) = −(λ − 1)²(λ + 1) yields λ1 = λ2 = 1 and λ3 = −1.
⇒ λ1, λ2 Eigenspace: {(x1, x2, x3) | −x1 − x2 + x3 = 0}
⇒ λ3 Eigenspace: {(t, 3t, 2t) | t ∈ R}
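A numerical check of this example (the matrix is the one above; the test vectors are sample elements of the stated eigenspaces):

```python
import numpy as np

A = np.array([[ 0., -1., 1.],
              [-3., -2., 3.],
              [-2., -2., 3.]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                        # approximately [1, 1, -1] (ordering may differ)

# (1, 3, 2) spans the eigenspace of lambda3 = -1
v = np.array([1., 3., 2.])
assert np.allclose(A @ v, -v)

# any vector with -x1 - x2 + x3 = 0 lies in the eigenspace of lambda1 = lambda2 = 1
u = np.array([1., 2., 3.])            # -1 - 2 + 3 = 0
assert np.allclose(A @ u, u)
```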
Matrix Gradient
The gradient of a function f : RN → R is given by
    ∇f ≡ (∂f/∂x1, ..., ∂f/∂xN)⊤
Examples:
linear:     f : x ↦ a⊤x      ∇f(x) = a
quadratic:  f : x ↦ x⊤Ax     ∇f(x) = (A⊤ + A)x
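A finite-difference check of the quadratic-form gradient; the matrix A, the test point x, and the helper numerical_gradient are illustrative assumptions, not part of the slides.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

A = np.array([[1., 2.],
              [0., 3.]])
x = np.array([0.5, -1.0])

f = lambda v: v @ A @ v                  # quadratic form x^T A x
analytic = (A.T + A) @ x                 # gradient given on the slide
assert np.allclose(numerical_gradient(f, x), analytic, atol=1e-4)
```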
Analysis
Infimum, Supremum
Let D be a subset of R. A number K is called supremum (infimum) of
D, if K is the smallest upper bound (largest lower bound) of D:
x ≤ K (x ≥ K), ∀ x ∈ D
Examples:
For the closed interval D = [a, b], a ≤ b : sup D = b, inf D = a.
For D = {n/(n+1) | n ∈ N}: sup D = 1.
Metric Space
Metric
A metric (or distance function) on a set X is a non-negative mapping
d : X × X → R+
    (x, y) ↦ d(x, y)
with the following characteristics
1 Positive definiteness: d(x, y) = 0 iff x = y, d(x, y) > 0 otherwise
2 Symmetry: d(x, y) = d(y, x), ∀ x, y ∈ X
3 Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z), ∀ x, y, z ∈ X
Taylor Series
Taylor Series in R
Let f : I → R be an infinitely often differentiable function, and x0 ∈ I.
Then the Taylor series around x0 is defined as
    f(x) = ∑_{n=0}^{∞} (1/n!) (dⁿf/dxⁿ)|_{x=x0} (x − x0)ⁿ
         = f(x0) + f′(x0) · (x − x0) + ½ f″(x0) · (x − x0)² + ...
Taylor Series in RN
Let f be an infinitely smooth scalar-valued function with domain in RN :
    f(x) = f(x0) + ∇f(x0)⊤ (x − x0) + ½ (x − x0)⊤ Hf(x0) (x − x0) + ...
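A minimal 1D illustration of the idea: comparing f with its first- and second-order Taylor approximations around x0 (the function exp and the expansion point are arbitrary choices).

```python
import numpy as np

f, df, d2f = np.exp, np.exp, np.exp      # exp is its own derivative
x0 = 0.0
x = 0.3

t1 = f(x0) + df(x0) * (x - x0)                    # first-order approximation
t2 = t1 + 0.5 * d2f(x0) * (x - x0) ** 2           # second-order approximation

print(f(x), t1, t2)    # 1.3499  1.3  1.345  -> higher order, closer to f(x)
```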
Local Extrema
A critical point x0 of f (i.e. ∇f(x0) = 0) is
a minimum of f, if all Eigenvalues of (Hf)(x0) are positive
(the Hessian is positive definite)
a maximum of f, if all Eigenvalues of (Hf)(x0) are negative
(the Hessian is negative definite)
a saddle point (no extremum) of f, if (Hf)(x0) has both positive and negative Eigenvalues (the Hessian is indefinite); if some Eigenvalues are zero, the test is inconclusive
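As a sketch of this test, the critical point of f(x, y) = x² − y² at the origin can be classified from the signs of the Hessian's eigenvalues (the example function is an arbitrary illustration):

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a critical point at the origin
H = np.array([[ 2., 0.],
              [ 0., -2.]])              # Hessian at the critical point

eigvals = np.linalg.eigvalsh(H)         # real eigenvalues of the symmetric Hessian
if np.all(eigvals > 0):
    print("minimum")
elif np.all(eigvals < 0):
    print("maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("saddle point")               # printed for this example
else:
    print("test inconclusive")          # some eigenvalues are zero
```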
Convexity
Convex Functions
Let U ⊂ R^N be open and convex. A function f : U → R is called (strictly)
convex, if for all x1, x2 ∈ U with x1 ≠ x2 and all 0 < λ < 1:
    f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)
(with strict inequality < for strict convexity)
Concave Functions
f is called concave, if (−f ) is convex.
Optimization
For a constrained problem min_w f(w) subject to hi(w) ≤ 0 ∀i, an optimal point w and its Lagrange multipliers λi satisfy, besides stationarity of the Lagrangian ∇_w [f(w) + ∑_i λi hi(w)] = 0,
    hi(w) ≤ 0, ∀i
    λi ≥ 0, ∀i
    λi · hi(w) = 0, ∀i,
which are known as the Karush-Kuhn-Tucker (KKT) conditions.
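A tiny worked check under the standard form assumed above: minimize f(w) = (w − 2)² subject to h(w) = w − 1 ≤ 0. The constrained optimum is w* = 1 with multiplier λ = 2, and the KKT conditions can be verified directly (this example is illustrative, not from the slides).

```python
import numpy as np

f  = lambda w: (w - 2.0) ** 2            # objective
df = lambda w: 2.0 * (w - 2.0)           # its derivative
h  = lambda w: w - 1.0                   # constraint h(w) <= 0
dh = lambda w: 1.0                       # derivative of the constraint

w_star, lam = 1.0, 2.0                   # candidate solution and multiplier

assert np.isclose(df(w_star) + lam * dh(w_star), 0.0)   # stationarity
assert h(w_star) <= 0.0                                  # primal feasibility
assert lam >= 0.0                                        # dual feasibility
assert np.isclose(lam * h(w_star), 0.0)                  # complementary slackness
```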
Probability Theory
Combinatorics
Consider a set consisting of n elements. The power set is the set of all
subsets, its cardinality is 2n .
Permutation: arrangement of n elements in a certain order
# without repetitions: Pn = n!
# with repetitions (k ≤ n repeated elements): Pn^(k) = n!/k!
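A quick sketch of these counts in Python; the example word "AABC" (one letter repeated twice) is an arbitrary choice.

```python
import math
from itertools import permutations

n = 4
print(2 ** n)                     # 16 subsets in the power set of a 4-element set
print(math.factorial(n))          # 24 permutations without repetitions

word = "AABC"                     # n = 4 elements, 'A' repeated k = 2 times
k = 2
print(math.factorial(len(word)) // math.factorial(k))   # 12
print(len(set(permutations(word))))                      # 12 distinct arrangements
```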
Random Variable
A random variable (vector) X is a mapping
    Ω → R_N ⊂ R^N,   ω ↦ X(ω) ≡ X
Its cumulative distribution function (cdf) at a point z is given by
    FX(z) = P(X ≤ z)
In practice, the cdf is computed from the known pdf using the inverse relationship
    FX(z) = ∫_{−∞}^{z} pX(t) dt
Example (discrete random vector in R²):
    FX(z) = 0     for (z1 < 1) ∨ (z2 < 1)
            1/4   for (1 ≤ z1 < 2) ∧ (1 ≤ z2 < 2)
            1/2   for (1 ≤ z1 < 2) ∧ (2 ≤ z2)
            3/4   for (2 ≤ z1) ∧ (1 ≤ z2 < 2)
            1     for (2 ≤ z1) ∧ (2 ≤ z2)
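The cdf values above are consistent with two independent components, each taking the values 1 and 2 with probability 1/2 (an assumption inferred from the table). A short sketch computing the cdf from that pmf:

```python
from itertools import product

# joint pmf of X = (X1, X2): each component uniform on {1, 2}, independent (assumed)
pmf = {(x1, x2): 0.25 for x1, x2 in product([1, 2], repeat=2)}

def cdf(z1, z2):
    """F_X(z) = P(X1 <= z1 and X2 <= z2)."""
    return sum(p for (x1, x2), p in pmf.items() if x1 <= z1 and x2 <= z2)

print(cdf(0.5, 0.5))   # 0.0
print(cdf(1.5, 1.5))   # 0.25
print(cdf(1.5, 2.5))   # 0.5
print(cdf(2.5, 1.5))   # 0.75
print(cdf(2.5, 2.5))   # 1.0
```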
Conditional Probabilities
Consider two discrete random variables X and Y. The conditional
probability of Y given X:
    P(Y = y | X = x) = P(X = x, Y = y) / P(X = x),    P(X = x) ≠ 0
For continuous random variables, in terms of densities:
    p(y|x) = p(x, y) / p(x)    almost everywhere in X
Independence
X and Y are independent iff P(X = x, Y = y) = P(X = x) P(Y = y) for all x, y (for densities: p(x, y) = p(x) p(y)).
Marginals
The marginal distribution of X is obtained by summing (integrating) the joint distribution over y: P(X = x) = ∑_y P(X = x, Y = y), resp. p(x) = ∫ p(x, y) dy.
Bayes’ Theorem
    P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)
                     = P(X = x | Y = y) P(Y = y) / ∑_k P(X = x | Y = yk) P(Y = yk)
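A small numerical illustration of Bayes' theorem; the prior and likelihood values below are made up for the example.

```python
import numpy as np

# hypothetical binary Y, single observation X = x; the numbers are made up
prior      = np.array([0.01, 0.99])   # P(Y = y1), P(Y = y2)
likelihood = np.array([0.99, 0.05])   # P(X = x | Y = y1), P(X = x | Y = y2)

evidence  = np.sum(likelihood * prior)        # P(X = x), the denominator
posterior = likelihood * prior / evidence     # P(Y = y_k | X = x)
print(posterior)                              # [0.1667  0.8333]
assert np.isclose(posterior.sum(), 1.0)
```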
Decomposition
Expectations
Moments
Examples:
1st order: ⟨Xi⟩ = ∫ p(xi) xi dxi ... mean value µi, µ = (µ1, ..., µn)
2nd order: ⟨Xi Xj⟩ ... correlation between Xi, Xj
3rd order: ⟨Xi Xj Xk⟩ ... e.g. skewness
Correlation Matrix
    RX ≡ ⟨X X⊤⟩
Symmetry: RX = RX⊤
Positive semidefinite: a⊤ RX a ≥ 0, ∀a
⇒ all eigenvalues are real and nonnegative
⇒ eigenvectors can be chosen mutually orthogonal
Covariance Matrix
    CX ≡ ⟨(X − µX)(X − µX)⊤⟩ = ⟨X X⊤⟩ − µX µX⊤ = RX − µX µX⊤
For independent (hence uncorrelated) random vectors X and Y, the cross-correlation factorizes:
    RXY = ⟨X Y⊤⟩ = ⟨X⟩⟨Y⊤⟩ = µX µY⊤
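A numerical sketch of these relations on sampled data; the two-dimensional Gaussian sample is an arbitrary example, and the identities hold up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(100_000, 2))   # samples as rows

mu = X.mean(axis=0)                       # sample mean vector
R  = X.T @ X / len(X)                     # correlation matrix  <X X^T>
C  = (X - mu).T @ (X - mu) / len(X)       # covariance matrix

assert np.allclose(C, R - np.outer(mu, mu))          # C_X = R_X - mu mu^T
assert np.allclose(R, R.T)                            # symmetry
assert np.all(np.linalg.eigvalsh(R) >= -1e-10)        # positive semidefinite
```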