Maths of Machine Learning
Femi William
Linear algebra is the study of vectors. At the most general level, vectors
are ordered finite lists of numbers.
Data Structures in ML
• Scalars
• Vectors
• Matrices
• Tensors
Vectors
• Vector definition
Computer science: a vector is a one-dimensional array of ordered real-valued scalars
Mathematics: a vector is a quantity possessing both magnitude and direction, represented
by an arrow whose direction indicates the vector's direction and whose length is
proportional to its magnitude
• Vectors are written in column form or in row form
Denoted by bold-font lower-case letters
• For a general vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ with $n$ elements, the vector lies in the $n$-dimensional space $\mathbb{R}^n$
• Vector addition
We add the coordinates, and follow the directions
given by the two vectors that are added
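A minimal sketch of this using NumPy (not something the slides prescribe; the vectors are hypothetical examples):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Adding vectors adds their coordinates element-wise
w = u + v
print(w)  # [4. 1.]
```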
Vectors
• The geometric interpretation of vectors as points in space allows us to consider a
training set of input examples in ML as a collection of points in space
Hence, classification can be viewed as discovering how to separate two clusters of
points belonging to different classes (left picture)
o Rather than distinguishing images containing cars, planes, buildings, for example
Or, it can help to visualize zero-centering and normalization of training data (right
picture)
Example of LA
Norm of a Vector
Vectors
• The general $\ell_p$ norm of a vector is obtained as: $\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
Below we review the most common norms, obtained for $p = 2$ and $p = 1$
• For $p = 2$ we have the $\ell_2$ norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{\mathbf{x}^T \mathbf{x}}$
Also called Euclidean norm
It is the most often used norm
The $\ell_2$ norm is often denoted just as $\|\mathbf{x}\|$ with the subscript 2 omitted
• For $p = 1$ we have the $\ell_1$ norm: $\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$
Uses the absolute values of the elements
Discriminates between zero and non-zero elements
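A minimal sketch of these norms with NumPy (the vector is a hypothetical example):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# L2 (Euclidean) norm: square root of the sum of squared elements
l2 = np.linalg.norm(x)          # 5.0
# L1 norm: sum of the absolute values of the elements
l1 = np.linalg.norm(x, ord=1)   # 7.0
# General Lp norm for an arbitrary p
p = 3
lp = np.sum(np.abs(x) ** p) ** (1.0 / p)

print(l2, l1, lp)
```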
• For one linear equation ax = b, where the unknown is x and a and b are constants, there are 3 possibilities:
If a ≠ 0, there is a unique solution x = b/a
If a = 0 and b ≠ 0, there is no solution
If a = 0 and b = 0, every x is a solution
• With more than one equation and more than one unknown, the same idea is used to solve the system
For example, a system of linear equations can be written in matrix form as
A X = B
and, provided A is invertible, solved as X = A⁻¹B
• To find A⁻¹ we need the determinant of A
From the earlier 2×2 example: (2 × −2) − (3 × 1) = −4 − 3 = −7, so the determinant is −7
Given a specific B, the solution is then X = A⁻¹B (a short numerical sketch follows below)
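A hedged illustration of this with NumPy; the specific A and B from the slide are not reproduced here, so the matrix and right-hand side below are hypothetical stand-ins (the matrix does have determinant −7):

```python
import numpy as np

# Hypothetical 2x2 system; its determinant is (2)(-2) - (3)(1) = -7
A = np.array([[2.0, 3.0],
              [1.0, -2.0]])
B = np.array([7.0, 0.0])  # illustrative right-hand side

print(np.linalg.det(A))          # -7.0 (up to floating-point error)
X = np.linalg.inv(A) @ B         # X = A^-1 B
# In practice np.linalg.solve is preferred over forming the inverse explicitly
X_solve = np.linalg.solve(A, B)
print(X, X_solve)                # both give [2. 1.]
```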
Vector Projection
Vectors
• Hyperplane is a subspace whose dimension is one less than that of its ambient
space
In a 2D space, a hyperplane is a straight line (i.e., 1D)
In a 3D space, a hyperplane is a plane (i.e., 2D)
In a d-dimensional vector space, a hyperplane has d − 1 dimensions, and divides the space
into two half-spaces
• A hyperplane is a generalization of the concept of a plane to high-dimensional spaces
• In ML, hyperplanes are decision boundaries used for linear classification
Data points falling on either side of the hyperplane are attributed to different classes, as sketched below
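A minimal sketch of using a hyperplane w·x + b = 0 as a linear decision boundary (the parameters w, b and the points are hypothetical):

```python
import numpy as np

# Hypothetical hyperplane in 2D: w . x + b = 0
w = np.array([1.0, -2.0])
b = 0.5

points = np.array([[3.0, 1.0],    # w.x + b =  1.5 -> one side
                   [0.0, 2.0]])   # w.x + b = -3.5 -> other side

# The sign of w . x + b tells us on which side of the hyperplane a point lies
scores = points @ w + b
labels = np.sign(scores)
print(scores, labels)
```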
• Scalar multiplication
• Matrix multiplication
Defined only if the number of columns of the left matrix is the same as the number of
rows of the right matrix
Note that matrix multiplication is in general not commutative, i.e., AB ≠ BA, as illustrated below
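A minimal NumPy sketch of scalar and matrix multiplication (the values are hypothetical), including the dimension requirement:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])       # 2 x 2
B = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0]])  # 2 x 3

# Scalar multiplication scales every element
print(2 * A)

# Matrix multiplication: the 2 columns of A must match the 2 rows of B
C = A @ B                        # result is 2 x 3
print(C.shape)

# B @ A is not even defined here (3 columns vs 2 rows),
# and even for square matrices AB != BA in general
```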
Some properties
• Identity matrix ( In ): has ones on the main diagonal, and zeros elsewhere
• The minor $M_{ij}$ of a matrix is the determinant of the submatrix obtained by removing the row
and column associated with the indices i and j
• Trace of a matrix is the sum of all diagonal elements
• For an m×n matrix A and an n-dimensional vector x, the matrix-vector product Ax is a column
vector of length m, whose ith element is the dot product of the ith row of A with x
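A short sketch of the trace and the matrix-vector product as row-wise dot products (hypothetical values):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # m x n = 2 x 3
x = np.array([1.0, 0.0, -1.0])    # length n = 3

# Matrix-vector product: length-m vector whose i-th element is row_i(A) . x
y = A @ x
print(y)                          # [-2. -2.]

# Trace of a square matrix: the sum of its diagonal elements
S = np.array([[1.0, 9.0],
              [9.0, 4.0]])
print(np.trace(S))                # 5.0
```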
• For an m×n matrix, the rank of the matrix is the largest number of linearly independent
columns
• The matrix B from the previous example has rank 1, since its two columns are linearly
dependent
$\mathbf{B} = \begin{bmatrix} 2 & 4 \\ -1 & -2 \end{bmatrix}$
• A matrix C with two linearly independent columns has rank 2
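A quick check of these rank claims with NumPy (the matrix C below is a hypothetical example of independent columns):

```python
import numpy as np

B = np.array([[2.0, 4.0],
              [-1.0, -2.0]])
print(np.linalg.matrix_rank(B))   # 1, the columns are linearly dependent

# Hypothetical C with two linearly independent columns
C = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(np.linalg.matrix_rank(C))   # 2
```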
• For a square n×n matrix A with rank n, $\mathbf{A}^{-1}$ is its inverse matrix if their product is the
identity matrix I, i.e., $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$
• Pseudo-inverse of a matrix
Also known as Moore-Penrose pseudo-inverse
• For matrices that are not square, the inverse does not exist
Therefore, a pseudo-inverse is used
• If the matrix has more rows than columns (m > n) and full column rank, the pseudo-inverse is $\mathbf{A}^{+} = (\mathbf{A}^{T}\mathbf{A})^{-1}\mathbf{A}^{T}$ and $\mathbf{A}^{+}\mathbf{A} = \mathbf{I}$
• If the matrix has more columns than rows (m < n) and full row rank, the pseudo-inverse is $\mathbf{A}^{+} = \mathbf{A}^{T}(\mathbf{A}\mathbf{A}^{T})^{-1}$ and $\mathbf{A}\mathbf{A}^{+} = \mathbf{I}$
E.g., for a matrix A with dimension m×n, a pseudo-inverse $\mathbf{A}^{+}$ of size n×m can be found, so that $\mathbf{A}^{+}\mathbf{A} = \mathbf{I}$ (or $\mathbf{A}\mathbf{A}^{+} = \mathbf{I}$)
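A minimal sketch with NumPy's pinv (the matrix values are hypothetical):

```python
import numpy as np

# Tall matrix: m = 3 rows, n = 2 columns, full column rank
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

A_pinv = np.linalg.pinv(A)        # shape n x m = 2 x 3
print(A_pinv.shape)

# For full column rank, pinv(A) equals (A^T A)^-1 A^T, and A^+ A = I
left = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A_pinv, left))            # True
print(np.allclose(A_pinv @ A, np.eye(2)))   # True
```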
Calculating Eigenvalues & Eigenvectors
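A minimal numerical sketch of the computation (the symmetric matrix below is a hypothetical example; eigenvalues are the roots of the characteristic equation det(A − λI) = 0):

```python
import numpy as np

# Hypothetical symmetric matrix
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Eigenvalues solve det(A - lambda*I) = 0; eigenvectors v satisfy A v = lambda * v
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)          # the eigenvalues
print(eigvecs)          # one eigenvector per column

# Verify A v = lambda v for the first eigenpair
v = eigvecs[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))   # True
```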
• Standard rules (e.g., the sum, product, quotient, and chain rules) are used for computing the derivatives of explicit functions
Higher Order Derivatives
Differential Calculus
Partial Derivatives
Differential Calculus
Gradient
Differential Calculus
• When there is no ambiguity, the notations $\nabla f$ or $\nabla_{\mathbf{x}} f$ are often used for the gradient
instead of $\nabla_{\mathbf{x}} f(\mathbf{x})$
The gradient symbol $\nabla$ is pronounced "nabla", although $\nabla_{\mathbf{x}} f$ is more often read as
"gradient of f with respect to x"
• In ML, the gradient descent algorithm relies on the opposite direction of the
gradient of the loss function with respect to the model parameters for
minimizing the loss function
Adversarial examples can be created by adding a perturbation in the direction of the
gradient of the loss with respect to the input examples, in order to maximize the loss function
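A minimal sketch of a gradient-descent update (the quadratic loss and learning rate below are hypothetical choices):

```python
import numpy as np

# Hypothetical loss: L(w) = ||w - w_true||^2, with gradient 2 * (w - w_true)
w_true = np.array([1.0, -2.0])

def loss_grad(w):
    return 2.0 * (w - w_true)

w = np.zeros(2)       # initial parameters
lr = 0.1              # learning rate

# Move in the direction opposite to the gradient to decrease the loss
for _ in range(100):
    w = w - lr * loss_grad(w)

print(w)              # close to w_true = [1, -2]
```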
Optimization
• Optimization and machine learning have related, but somewhat different goals
Goal in optimization: minimize an objective function
o For a set of training examples, reduce the training error
Goal in ML: find a suitable model that predicts well on new data examples
o For a set of testing examples, reduce the generalization error
• For a given empirical function g (dashed purple curve), optimization algorithms
attempt to find the point of minimum empirical risk
• The expected function f (blue curve) is
obtained given a limited amount of training
data examples
• ML algorithms attempt to find the point of
minimum expected risk, based on minimizing
the error on a set of testing examples
o Which may be at a different location than the
minimum of the training examples
o And which may not be minimal in a formal sense
Stationary Points
Optimization
• A function of a single variable is concave if every line segment joining two points
on its graph does not lie above the graph at any point
• Symmetrically, a function of a single variable is convex if every line segment
joining two points on its graph does not lie below the graph at any point
• In mathematical terms, the function f is a convex function if for all points $x_1, x_2$ and for all $\lambda \in [0, 1]$: $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
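A small numeric sketch of this inequality for the (convex) function f(x) = x², at hypothetical points:

```python
# Convexity check: f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)
def f(x):
    return x ** 2

x1, x2 = -1.0, 3.0
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    assert lhs <= rhs + 1e-12   # holds for every lam in [0, 1]
print("convexity inequality holds at the sampled points")
```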
Minimum vs Maximum
Constrained Optimization
Optimization
• The points that satisfy the constraints form the feasible region
• Various optimization algorithms have been developed for handling optimization
problems based on whether the constraints are equalities, inequalities, or a
combination of equalities and inequalities
Lagrange Multipliers
Optimization
• The constrained problem is converted into an unconstrained one by forming the Lagrangian,
where each constraint is added to the objective function, weighted by a new variable
• These variables are called Lagrange multipliers and ensure that the constraints are
properly enforced
They are chosen just large enough to ensure that the constraints are satisfied
• This is a saddle-point optimization problem, where one wants to minimize the Lagrangian with
respect to the original variables and simultaneously maximize it with respect to the Lagrange multipliers
The saddle point of the Lagrangian gives the optimal solution to the original constrained optimization
problem (a standard textbook form of the Lagrangian is sketched below)
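For reference, a common textbook form, assuming an objective $f(\mathbf{x})$ and inequality constraints $c_i(\mathbf{x}) \le 0$ (the exact notation used on the slide may differ):

```latex
% Lagrangian for minimizing f(x) subject to c_i(x) <= 0
L(\mathbf{x}, \boldsymbol{\alpha}) \;=\; f(\mathbf{x}) \;+\; \sum_{i} \alpha_i \, c_i(\mathbf{x}),
\qquad \alpha_i \ge 0
% The solution is the saddle point: minimize over x, maximize over alpha
```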
Projections
Optimization
• This means that a vector is projected onto the closest vector that belongs to
the set
• For example, in the figure, the blue circle represents
a convex set
The points inside the circle project to themselves
o E.g., for a yellow vector inside the set, its closest point in the set is the vector itself:
the distance between the vector and its projection is 0
The points outside the circle project to the closest
point inside the circle
o E.g., for a yellow vector outside the set, its closest point in the set is
the red vector
o Among all vectors in the set, the red vector has the
smallest distance to the yellow vector (a small sketch follows below)
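A minimal sketch of projecting onto a Euclidean ball of radius r, a common convex set in ML (e.g., in projected gradient descent); the vectors are hypothetical:

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Project x onto the closest point of the L2 ball {v : ||v|| <= radius}."""
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x                    # points inside the set project to themselves
    return x * (radius / norm)      # points outside are pulled back to the boundary

print(project_onto_ball(np.array([0.3, 0.1])))   # unchanged, already inside the set
print(project_onto_ball(np.array([3.0, 4.0])))   # [0.6, 0.8], on the boundary
```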
• Intuition:
In a process, several outcomes are possible
When the process is repeated a large number of times, each outcome occurs with a
relative frequency, or probability
If a particular outcome occurs more often, we say it is more probable
• Probability arises in two contexts
In actual repeated experiments
o Example: You record the color of 1,000 cars driving by. 57 of them are green. You estimate the
probability of a car being green as 57/1,000 = 0.057.
In idealized conceptions of a repeated process
o Example: You consider the behavior of an unbiased six-sided die. The expected probability of
rolling a 5 is 1/6 = 0.1667.
o Example: You need a model for how people’s heights are distributed. You choose a normal
distribution to represent the expected relative probabilities.
• Probability distribution that acts on many variables at the same time is known as a
joint probability distribution
• Given any values x and y of two random variables X and Y, what is the probability
that X = x and Y = y simultaneously?
$P(X = x, Y = y)$ denotes the joint probability
We may also write $P(x, y)$ for brevity
• Bernoulli distribution
Binary random variable with states $x \in \{0, 1\}$
The random variable encodes a coin flip which comes up 1 with probability p and 0
with probability $1 - p$
Notation: $X \sim \mathrm{Bernoulli}(p)$
(Figure: example probability mass function with p = 0.3)
• Uniform distribution
The probability of each of the k possible values is $1/k$
Notation: $X \sim \mathrm{U}(k)$
• Binomial distribution
Performing a sequence of n independent experiments, each of which has probability p of
succeeding, where $p \in [0, 1]$
The probability of getting k successes in n trials is $P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$
Notation: $X \sim \mathrm{Bin}(n, p)$
(Figure: example probability mass function with n = 10, p = 0.2)
• Poisson distribution
A number of events occurring independently in a fixed interval of time with a known rate λ
A discrete random variable with states $k \in \{0, 1, 2, \ldots\}$ has probability $P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$
The rate λ is the average number of occurrences of the event per interval
Notation: $X \sim \mathrm{Poisson}(\lambda)$
(Figure: example probability mass function with λ = 5)
• Gaussian distribution
The most well-studied distribution
o Referred to as normal distribution or, informally, bell-shaped distribution
Defined by the mean μ and variance σ²
Notation: $X \sim \mathcal{N}(\mu, \sigma^{2})$
The density for a measurement x is $p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$
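A minimal sketch of drawing samples from these distributions with NumPy (parameter values follow the figure annotations above; the uniform range is a hypothetical die example):

```python
import numpy as np

rng = np.random.default_rng(0)

bern  = rng.binomial(n=1, p=0.3, size=5)        # Bernoulli(p=0.3)
unif  = rng.integers(low=1, high=7, size=5)     # discrete uniform over {1,...,6}
binom = rng.binomial(n=10, p=0.2, size=5)       # Bin(n=10, p=0.2)
poiss = rng.poisson(lam=5, size=5)              # Poisson(lambda=5)
gauss = rng.normal(loc=0.0, scale=1.0, size=5)  # N(mu=0, sigma^2=1)

print(bern, unif, binom, poiss, gauss, sep="\n")
```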
• Multinoulli distribution
It is an extension of the Bernoulli distribution, from binary class to multi-class
Multinoulli distribution is also called categorical distribution or generalized Bernoulli
distribution
Multinoulli is a discrete probability distribution that describes the possible results of a
random variable that can take on one of k possible categories
o A categorical random variable is a discrete variable with more than two possible outcomes (such
as the roll of a die)
For example, in multi-class classification in machine learning, we have a set of data
examples, and corresponding to each data example is a k-class label represented with one-
hot encoding
o One-hot encoding is also called a 1-of-k vector, where one element has the value 1 and all other
elements have the value 0
o Let's denote the probabilities for assigning the k class labels to a data example by $\pi_1, \pi_2, \ldots, \pi_k$
o We know that $\pi_j \ge 0$ and $\sum_{j=1}^{k} \pi_j = 1$ for the different classes
o The multinoulli probability of a data example with one-hot label $\mathbf{y} = [y_1, \ldots, y_k]$ is $\prod_{j=1}^{k} \pi_j^{\,y_j}$
o Similarly, we can calculate the probability of all data examples as the product of the individual
probabilities, assuming the examples are independent
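A short sketch of the multinoulli probability of a one-hot-encoded label (the class probabilities and the label are hypothetical):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])     # class probabilities, sum to 1
y = np.array([0, 1, 0])            # one-hot label: the example belongs to class 2

# Multinoulli probability: product over classes of pi_j ** y_j
prob = np.prod(pi ** y)
print(prob)                        # 0.5, i.e., the probability of the labeled class
```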
Statistics
Variance
Probability
• Variance gives a measure of how much the values of a function of a random variable X deviate from
the expected value as we sample values of X from its probability distribution
• When the variance is low, the values cluster near the expected value
• Variance is commonly denoted with $\sigma^{2}$, and is defined as $\mathrm{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^{2}\right]$
Expanding the square, we have $\mathrm{Var}(X) = \mathbb{E}[X^{2}] - (\mathbb{E}[X])^{2}$
This is similar to the formula for calculating the variance of a sample of n observations: $s^{2} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^{2}$
• The square root of the variance is the standard deviation
Denoted $\sigma$
• Covariance gives a measure of how much two random variables are linearly
related to each other
• If $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$
Then, the covariance is: $\mathrm{Cov}(X, Y) = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]$
Compare to the covariance of actual samples: $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
• The covariance measures the tendency for X and Y to deviate from their means in the
same (or opposite) directions at the same time
Picture from: Jeff Howbert — Machine Learning Math Essentials
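A quick numeric sketch of sample variance and covariance with NumPy (the data values are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])      # perfectly linearly related to x

print(np.var(x))            # sample variance with the 1/n convention
print(np.std(x))            # standard deviation, the square root of the variance

# np.cov uses the 1/(n-1) convention by default; bias=True switches to 1/n
print(np.cov(x, y, bias=True)[0, 1])    # covariance of x and y
```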
Correlation
Probability
• I.e., the correlation coefficient is the covariance normalized by the standard deviations of the two variables: $\mathrm{corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
• The diagonal elements of the covariance matrix are the variances of the elements of
the vector
References
1. A. Zhang, Z. C. Lipton, M. Li, A. J. Smola, Dive into Deep Learning, https://d2l.ai, 2020.
2. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2017.
3. M. P. Deisenroth, A. A. Faisal, C. S. Ong, Mathematics for Machine Learning, Cambridge University Press, 2020.
4. J. Howbert, Machine Learning Math Essentials, presentation.
5. B. Keng, Manifolds: A Gentle Introduction, blog.
6. M. J. Osborne, Mathematical Methods for Economic Theory (link).
The end.