
CS 7267

Machine Learning

Essential Mathematics for Machine Learning

Femi William

Computer Science, Kennesaw State University



Lecture Outline
• Linear algebra
 Vectors
 Matrices
 Eigen decomposition
• Differential calculus
• Optimization algorithms
• Probability
 Random variables
 Probability distributions
• Statistics

Notation
• Scalar (integer or real): a
• Vector (bold-font, lower-case): 𝐱
• Matrix (bold-font, upper-case): 𝐀
• Tensor (bold-font, upper-case, special font face): 𝖷
• Random variable (normal font, upper-case): X
• Set membership: x ∈ 𝒜 means x is a member of the set 𝒜
• Cardinality: |𝒜| is the number of items in the set 𝒜
• Norm of vector 𝐱: ‖𝐱‖
• Dot product of vectors 𝐱 and 𝐲: 𝐱 ∙ 𝐲 or 𝐱ᵀ𝐲
• Set of real numbers: ℝ
• Real-number space of dimension n: ℝⁿ
• Function (map) f: A → B assigns a unique value to each input value
• Function f: ℝⁿ → ℝ maps an n-dimensional vector into a scalar

Notation
• Element-wise product of matrices A and B: A ⊙ B
• Pseudo-inverse of matrix A: A⁺
• n-th derivative of function f with respect to x: dⁿf/dxⁿ
• Gradient of function f with respect to 𝐱: ∇ₓ f
• Hessian matrix of function f: ∇²f
• Random variable X has distribution P: X ∼ P
• Probability of X given Y: P(X | Y)
• Gaussian distribution with mean μ and variance σ²: 𝒩(μ, σ²)
• Expectation of f(x) with respect to P(x): 𝔼ₓ∼P[f(x)]
• Variance of f(x): Var(f(x))
• Covariance of f(x) and g(x): Cov(f(x), g(x))
• Correlation coefficient for X and Y: ρ(X, Y)
• Kullback-Leibler divergence for distributions P and Q: D_KL(P ‖ Q)
• Cross-entropy for distributions P and Q: H(P, Q)

What is Linear Algebra?
Linear algebra is the branch of mathematics concerning linear equations

Linear algebra is the study of vectors. At the most general level, vectors
are ordered finite lists of numbers.

• Linear algebra is to machine learning as flour is to baking.

• Every machine learning model is based on linear algebra, just as

every cake is based on flour.

Data Structures in ML
 Scalar
 Vectors
 Matrices
 Tensor

Vectors
Vectors
• Vector definition
 Computer science: vector is a one-dimensional array of ordered real-valued scalars
 Mathematics: vector is a quantity possessing both magnitude and direction, represented
by an arrow indicating the direction, and the length of which is proportional to the
magnitude
• Vectors are written in column form or in row form
 Denoted by bold-font lower-case letters

• For a general vector 𝐱 with n elements x₁, x₂, …, xₙ, the vector lies in the n-dimensional space ℝⁿ

Geometry of Vectors
Vectors
• First interpretation of a vector: point in space
 E.g., in 2D we can visualize the data points with
respect to a coordinate origin

• Second interpretation of a vector: direction in space


 E.g., the vector 𝐯 = [3, 2] has a direction of 3 steps to the right
and 2 steps up
 An arrow over the symbol (e.g., 𝑣⃗) is sometimes used to indicate that the
vectors have a direction
 All vectors in the figure have the same direction

• Vector addition
 We add the coordinates, and follow the directions
given by the two vectors that are added

Geometry of Vectors

Vectors
• The geometric interpretation of vectors as points in space allows us to consider a
training set of input examples in ML as a collection of points in space
 Hence, classification can be viewed as discovering how to separate two clusters of
points belonging to different classes (left picture)
o Rather than distinguishing images containing cars, planes, buildings, for example
 Or, it can help to visualize zero-centering and normalization of training data (right
picture)

Dot Product and Angles
Vectors

• Dot product of vectors: 𝐮 ∙ 𝐯 = 𝐮ᵀ𝐯 = ∑ᵢ uᵢ vᵢ

 It is also referred to as the inner product, or scalar product, of vectors
 The dot product is also often denoted by ⟨𝐮, 𝐯⟩
• The dot product is a symmetric operation: 𝐮 ∙ 𝐯 = 𝐯 ∙ 𝐮
• Geometric interpretation of a dot product: the
angle θ between two vectors
 I.e., the dot product over the norms of the vectors
is

  𝐮 ∙ 𝐯 = ‖𝐮‖ ‖𝐯‖ cos(θ),  i.e.,  cos(θ) = (𝐮 ∙ 𝐯) / (‖𝐮‖ ‖𝐯‖)

• If two vectors are orthogonal (θ = 90°, cos(θ) = 0), then 𝐮 ∙ 𝐯 = 0

• Also, in ML the term cos(θ) is sometimes employed as a measure of closeness of two
vectors/data instances, and it is referred to as cosine similarity

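A minimal NumPy sketch of the dot product and cosine similarity (the vectors below are illustrative, not taken from the slides):

```python
import numpy as np

u = np.array([3.0, 2.0])
v = np.array([1.0, 4.0])

dot = np.dot(u, v)                                          # u . v = 3*1 + 2*4 = 11
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))   # cosine similarity
theta = np.degrees(np.arccos(cos_theta))                    # angle between u and v, in degrees

print(dot, cos_theta, theta)
```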
Applications – Linear Algebra in Action

Example of LA

Norm of a Vector
Vectors

• A vector norm is a function ‖𝐱‖ that maps a vector to a scalar value

 The norm is a measure of the size of the vector
• The norm should satisfy the following properties:
 Scaling: ‖α𝐱‖ = |α| ‖𝐱‖
 Triangle inequality: ‖𝐱 + 𝐲‖ ≤ ‖𝐱‖ + ‖𝐲‖
 Must be non-negative: ‖𝐱‖ ≥ 0, with equality only for the zero vector

• The general ℓp norm of a vector is obtained as: ‖𝐱‖ₚ = (∑ᵢ₌₁ⁿ |xᵢ|ᵖ)^(1/p)

 On the next page we will review the most common norms, obtained for p = 2, p = 1, and p = ∞

Norm of a Vector
Vectors

• For p = 2 we have the ℓ2 norm: ‖𝐱‖₂ = √(∑ᵢ₌₁ⁿ xᵢ²) = √(𝐱ᵀ𝐱)
 Also called the Euclidean norm
 It is the most often used norm
 The ℓ2 norm is often denoted just as ‖𝐱‖, with the subscript 2 omitted

• For p = 1 we have the ℓ1 norm: ‖𝐱‖₁ = ∑ᵢ₌₁ⁿ |xᵢ|
 Uses the absolute values of the elements
 Discriminates between zero and non-zero elements

• For p = ∞ we have the ℓ∞ norm: ‖𝐱‖∞ = maxᵢ |xᵢ|
 Known as the infinity norm, or max norm
 Outputs the absolute value of the largest element

• The ℓ0 "norm" outputs the number of non-zero elements
 It is not an ℓp norm, and it is not really a norm function either (it is incorrectly called a norm)
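A short NumPy sketch of the ℓ1, ℓ2, and ℓ∞ norms (the example vector is illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l2   = np.linalg.norm(x)               # Euclidean norm: sqrt(9 + 16 + 1)
l1   = np.linalg.norm(x, ord=1)        # sum of absolute values = 8
linf = np.linalg.norm(x, ord=np.inf)   # largest absolute value = 4
l0   = np.count_nonzero(x)             # "l0 norm": number of non-zero elements = 3

print(l2, l1, linf, l0)
```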
Solving simultaneous equations

For one linear equation ax = b, where the unknown is x and a and b
are constants, there are 3 possibilities:
 If a ≠ 0, there is a unique solution x = b/a
 If a = 0 and b ≠ 0, there is no solution
 If a = 0 and b = 0, every x is a solution
With more than one equation and more than one unknown:
 We can use the same elimination idea as for a single equation to
solve the system
 For example, in matrix form the system is written as

  A X = B

 so X = A⁻¹B
 To find A⁻¹

 we need to find the determinant of matrix A

 From the earlier example, with A = [2 3; 1 −2]:

  det(A) = (2 × −2) − (3 × 1) = −4 − 3 = −7

 So the determinant is −7
 Given a right-hand side B, we can then form A⁻¹ and compute the solution X = A⁻¹B
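A NumPy sketch of this procedure for A = [2 3; 1 −2]; the slide does not give a specific B, so the right-hand side below is only illustrative:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [1.0, -2.0]])
B = np.array([7.0, 1.0])         # illustrative right-hand side (not from the slide)

det_A = np.linalg.det(A)         # (2)(-2) - (3)(1) = -7
X_inv = np.linalg.inv(A) @ B     # X = A^{-1} B
X     = np.linalg.solve(A, B)    # preferred in practice: solve without forming A^{-1}

print(det_A, X_inv, X)
```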
Vector Projection
Vectors

• Orthogonal projection of a vector 𝐲 onto a vector 𝐱


 The projection can take place in any space of
dimensionality ≥ 2
 The unit vector in the direction of 𝐱 is 𝐱 / ‖𝐱‖
o A unit vector has norm equal to 1
 The length of the projection of 𝐲 onto 𝐱 is ‖𝐲‖ cos(θ) = (𝐱 ∙ 𝐲) / ‖𝐱‖
 The orthogonal projection is the vector proj_𝐱(𝐲) = ((𝐱 ∙ 𝐲) / ‖𝐱‖²) 𝐱

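A small NumPy sketch of the orthogonal projection of 𝐲 onto 𝐱 (the vectors are illustrative):

```python
import numpy as np

x = np.array([4.0, 0.0])
y = np.array([2.0, 3.0])

x_unit = x / np.linalg.norm(x)               # unit vector in the direction of x
length = np.dot(x, y) / np.linalg.norm(x)    # length of the projection of y onto x
proj   = (np.dot(x, y) / np.dot(x, x)) * x   # the orthogonal projection vector itself

print(x_unit, length, proj)   # [1. 0.]  2.0  [2. 0.]
```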
Hyperplanes
Hyperplanes

• A hyperplane is a subspace whose dimension is one less than that of its ambient
space
 In a 2D space, a hyperplane is a straight line (i.e., 1D)
 In a 3D space, a hyperplane is a plane (i.e., 2D)
 In a d-dimensional vector space, a hyperplane has d − 1 dimensions, and divides the space
into two half-spaces
• A hyperplane is a generalization of the concept of a plane to high-dimensional spaces
• In ML, hyperplanes are decision boundaries used for linear classification
 Data points falling on either side of the hyperplane are attributed to different classes

Matrices
Matrices
• A matrix is a rectangular array of real-valued scalars arranged in m horizontal rows
and n vertical columns
 Each element aᵢⱼ belongs to the ith row and jth column
 The elements are denoted aᵢⱼ, [A]ᵢⱼ, A(i, j), or A[i, j]

• For such a matrix A, the size (dimension) is m×n or (m, n)

 Matrices are denoted by bold-font upper-case letters

Matrices
Matrices
• Addition or subtraction: (A ± B)ᵢⱼ = aᵢⱼ ± bᵢⱼ

• Scalar multiplication: (cA)ᵢⱼ = c·aᵢⱼ

• Matrix multiplication: (AB)ᵢⱼ = ∑ₖ aᵢₖ bₖⱼ

 Defined only if the number of columns of the left matrix is the same as the number of
rows of the right matrix
 Note that matrix multiplication is not commutative: in general, AB ≠ BA

Matrices
Matrices

• Transpose of a matrix, Aᵀ: has the rows and columns exchanged, i.e., (Aᵀ)ᵢⱼ = aⱼᵢ

 Some properties: (Aᵀ)ᵀ = A, (A + B)ᵀ = Aᵀ + Bᵀ, (AB)ᵀ = BᵀAᵀ

• Square matrix: has the same number of rows and columns

• Identity matrix (Iₙ): has ones on the main diagonal, and zeros elsewhere

 E.g., the identity matrix of size 3×3 is I₃ = [1 0 0; 0 1 0; 0 0 1]

Matrices
Matrices

• Determinant of a matrix, denoted by det(A) or |A|, is a real-valued scalar encoding


certain properties of the matrix
 E.g., for a matrix of size 2×2: det(A) = a₁₁a₂₂ − a₁₂a₂₁

 For larger-size matrices the determinant is calculated as det(A) = ∑ⱼ (−1)^(i+j) aᵢⱼ Mᵢⱼ (expansion along row i)

 In the above, Mᵢⱼ is a minor of the matrix, i.e., the determinant of the submatrix obtained by removing the row and column
associated with the indices i and j
• Trace of a matrix is the sum of all diagonal elements: tr(A) = ∑ᵢ aᵢᵢ

• A matrix for which Aᵀ = A is called a symmetric matrix

Matrices
Matrices

• Elementwise multiplication of two matrices A and B is called the Hadamard


product or elementwise product
 The math notation is A ⊙ B, where (A ⊙ B)ᵢⱼ = aᵢⱼ bᵢⱼ

Matrix-Vector Products
Matrices
• Consider a matrix A ∈ ℝᵐˣⁿ and a vector 𝐱 ∈ ℝⁿ
• The matrix can be written in terms of its row vectors (e.g., 𝐚₁ᵀ is the first row)

• The matrix-vector product A𝐱 is a column vector of length m, whose ith element is the
dot product 𝐚ᵢᵀ𝐱

• Note the size: (m×n)·(n×1) = m×1


Matrix-Matrix Products
Matrices

• To multiply two matrices A ∈ ℝᵐˣᵏ and B ∈ ℝᵏˣⁿ, we compute C = AB

• We can consider the matrix-matrix product as dot-products of rows in A and columns

in B, i.e., cᵢⱼ = 𝐚ᵢᵀ𝐛ⱼ

• Size: (m×k)·(k×n) = m×n
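A NumPy sketch of the matrix-vector and matrix-matrix products and their resulting sizes (the matrices are illustrative):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)     # 2x3 matrix
x = np.array([1.0, 0.0, -1.0])     # vector of length 3
B = np.ones((3, 4))                # 3x4 matrix

Ax = A @ x     # matrix-vector product: shape (2,)
AB = A @ B     # matrix-matrix product: (2x3)(3x4) -> shape (2, 4)

print(Ax.shape, AB.shape)   # (2,) (2, 4)
```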
Linear Dependence
Matrices

• For the following matrix:

  𝐁 = [ 2  −1
        4  −2 ]

• Notice that for the two columns 𝐛₁ = [2, 4]ᵀ and 𝐛₂ = [−1, −2]ᵀ, we can write 𝐛₁ = −2·𝐛₂
 This means that the two columns are linearly dependent
• The weighted sum a₁𝐛₁ + a₂𝐛₂ is referred to as a linear combination of the vectors 𝐛₁ and 𝐛₂
 In this case, a linear combination of the two vectors exists for which 𝐛₁ + 2·𝐛₂ = 0
• A collection of vectors 𝐯₁, …, 𝐯ₖ is linearly dependent if there exist coefficients a₁, …, aₖ, not all
equal to zero, so that a₁𝐯₁ + a₂𝐯₂ + ⋯ + aₖ𝐯ₖ = 0

• If there is no such linear dependence, the vectors are linearly independent

Matrix Rank
Matrices

• For an m×n matrix, the rank of the matrix is the largest number of linearly independent
columns
• The matrix B from the previous example has rank(B) = 1, since the two columns are linearly
dependent

  𝐁 = [ 2  −1
        4  −2 ]

• A matrix C with two linearly independent columns has rank(C) = 2

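The rank of the slide's matrix B can be checked with NumPy; a minimal sketch (the second matrix is illustrative):

```python
import numpy as np

B = np.array([[2.0, -1.0],
              [4.0, -2.0]])
print(np.linalg.matrix_rank(B))   # 1: the columns are linearly dependent (b1 = -2 * b2)

C = np.array([[1.0, 0.0],
              [0.0, 1.0]])        # illustrative matrix with independent columns
print(np.linalg.matrix_rank(C))   # 2
```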
Inverse of a Matrix
Matrices

• For a square n×n matrix A with rank n, A⁻¹ is its inverse matrix if their product is the
identity matrix I, i.e., A·A⁻¹ = A⁻¹·A = I

• Properties of inverse matrices: (A⁻¹)⁻¹ = A, (AB)⁻¹ = B⁻¹A⁻¹, (Aᵀ)⁻¹ = (A⁻¹)ᵀ

• If det(A) = 0 (i.e., rank(A) < n), then the inverse does not exist

 A matrix that is not invertible is called a singular matrix
• Note that finding the inverse of a large matrix is computationally expensive
 In addition, it can lead to numerical instability
• If the inverse of a matrix is equal to its transpose (A⁻¹ = Aᵀ), the matrix is said to be an
orthogonal matrix

Pseudo-Inverse of a Matrix
Matrices

• Pseudo-inverse of a matrix, A⁺
 Also known as the Moore-Penrose pseudo-inverse
• For matrices that are not square, the inverse does not exist
 Therefore, a pseudo-inverse is used
• If A has more rows than columns (m > n, full column rank), then the pseudo-inverse is A⁺ = (AᵀA)⁻¹Aᵀ and A⁺A = I
• If A has more columns than rows (m < n, full row rank), then the pseudo-inverse is A⁺ = Aᵀ(AAᵀ)⁻¹ and AA⁺ = I

 E.g., for a matrix A with dimension m×n, a pseudo-inverse A⁺ of size n×m can be found, so that A⁺A = Iₙ

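A NumPy sketch of the Moore-Penrose pseudo-inverse for a non-square matrix (the matrix is illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])              # 3x2 matrix (m > n, full column rank)

A_pinv = np.linalg.pinv(A)              # 2x3 pseudo-inverse
closed = np.linalg.inv(A.T @ A) @ A.T   # same result via (A^T A)^{-1} A^T

print(np.allclose(A_pinv, closed))          # True
print(np.allclose(A_pinv @ A, np.eye(2)))   # True: A+ A = I (2x2)
```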
Tensors
Tensors

• Tensors are n-dimensional arrays of scalars


 Vectors are first-order tensors, 𝐯 ∈ ℝⁿ
 Matrices are second-order tensors, 𝐀 ∈ ℝᵐˣⁿ
 E.g., a fourth-order tensor is 𝖳 ∈ ℝ^(n₁×n₂×n₃×n₄)
• Tensors are denoted with upper-case letters of a special font face (e.g., 𝖷)
• RGB images are third-order tensors, i.e., they are 3-dimensional arrays
 The 3 axes correspond to width, height, and channel
 E.g., 224 × 224 × 3
 The channel axis corresponds to the color channels (red, green, and blue)

Eigen Decomposition
Eigen Decomposition
• Eigen decomposition is decomposing a matrix into a set of eigenvalues and
eigenvectors
• Eigenvalues of a square matrix A are scalars λ and eigenvectors are non-zero vectors 𝐯
that satisfy A𝐯 = λ𝐯

• Eigenvalues are found by solving the characteristic equation det(A − λI) = 0

• If a matrix A has n linearly independent eigenvectors 𝐯⁽¹⁾, …, 𝐯⁽ⁿ⁾ with corresponding

eigenvalues λ₁, …, λₙ, the eigen decomposition of A is given by A = VΛV⁻¹

 Columns of the matrix V are the eigenvectors, i.e., V = [𝐯⁽¹⁾, …, 𝐯⁽ⁿ⁾]

 Λ is a diagonal matrix of the eigenvalues, i.e., Λ = diag(λ₁, …, λₙ)
• To find the inverse of the matrix A, we can use A⁻¹ = VΛ⁻¹V⁻¹
 This involves simply finding the inverse of the diagonal matrix Λ

Eigen Decomposition
Eigen Decomposition

• The geometric interpretation of the eigenvalues and eigenvectors is that they allow us to

stretch the space in specific directions
 Left figure: the two eigenvectors 𝐯⁽¹⁾ and 𝐯⁽²⁾ are shown for a matrix, where the two vectors are
unit vectors (i.e., they have a length of 1)
 Right figure: the vectors 𝐯⁽¹⁾ and 𝐯⁽²⁾ are multiplied with the eigenvalues λ₁ and λ₂
o We can see how the space is scaled in the direction of the larger eigenvalue
• E.g., this is used for dimensionality reduction with PCA (principal component
analysis), where the eigenvectors corresponding to the largest eigenvalues are used
for extracting the most important data dimensions

Picture from: Goodfellow (2017) – Deep Learning


Eigenvalue Problem
Eigenvalue Problem
• Going back to our example: A·𝐯 = λ·𝐯

  A = [2 3; 2 1],   𝐯 = [3; 2],   A𝐯 = [12; 8] = 4·[3; 2] = 4𝐯

• Therefore, (3, 2) is an eigenvector of the square matrix A, and 4 is an
eigenvalue of A.
• The question is:
Given a matrix A, how can we calculate the eigenvectors and eigenvalues of A?

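A NumPy check of this example (the rescaling at the end is only to compare directions, since numpy returns unit-norm eigenvectors):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
v = np.array([3.0, 2.0])

print(A @ v)                       # [12.  8.] = 4 * v, so A v = 4 v

eigvals, eigvecs = np.linalg.eig(A)
print(np.sort(eigvals))            # [-1.  4.]  (the two eigenvalues of A)
idx = int(np.argmax(eigvals))      # index of the eigenvalue 4
print(eigvecs[:, idx] / eigvecs[1, idx] * 2)   # rescaled eigenvector: [3. 2.]
```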
How to Calculate Eigenvalues & Eigenvectors
Calculating Eigenvectors & Eigenvalues
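One standard way to carry out this calculation for the example matrix A = [2 3; 2 1]:
 Solve the characteristic equation det(A − λI) = 0: (2 − λ)(1 − λ) − 3·2 = λ² − 3λ − 4 = (λ − 4)(λ + 1) = 0, so λ₁ = 4 and λ₂ = −1
 For λ₁ = 4, solve (A − 4I)𝐯 = 0: [−2 3; 2 −3]𝐯 = 0 gives 2v₁ = 3v₂, so 𝐯₁ = (3, 2) up to scaling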

Going through the same procedure for the second eigenvalue:


Calculating Eigenvectors & Eigenvalues
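For the second eigenvalue λ₂ = −1, the same procedure gives: (A + I)𝐯 = 0, i.e., [3 3; 2 2]𝐯 = 0, so v₁ = −v₂ and 𝐯₂ = (1, −1) up to scaling.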
Properties of Eigenvectors and Eigenvalues
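As an illustration, two standard properties can be checked numerically for the example matrix: the sum of the eigenvalues equals the trace, and their product equals the determinant (a sketch, not the slide's own list):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
eigvals = np.linalg.eigvals(A)

print(np.isclose(eigvals.sum(),  np.trace(A)))        # True: sum of eigenvalues = trace
print(np.isclose(eigvals.prod(), np.linalg.det(A)))   # True: product of eigenvalues = det
```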
Singular Value Decomposition
Singular Value Decomposition

• Singular value decomposition (SVD) provides another way to factorize a matrix,


into singular vectors and singular values
 SVD is more generally applicable than eigen decomposition
 Every real matrix has an SVD, but the same is not true of the eigen decomposition
o E.g., if a matrix is not square, the eigen decomposition is not defined, and we must use SVD
• The SVD of an m×n matrix A is given by A = UΣVᵀ

 U is an m×m matrix, Σ is an m×n diagonal matrix, and V is an n×n matrix


 The elements along the diagonal of Σ are known as the singular values of A
 The columns of U are known as the left-singular vectors
 The columns of V are known as the right-singular vectors
• For a non-square matrix A, the squares of the singular values σᵢ are the eigenvalues λᵢ of
AᵀA, i.e., σᵢ² = λᵢ
• Applications of SVD include computing the pseudo-inverse of non-square
matrices, matrix approximation, and determining the matrix rank
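A NumPy sketch of the SVD and its link to the eigenvalues of AᵀA (the matrix is illustrative):

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])          # 2x3 non-square matrix

U, s, Vt = np.linalg.svd(A)              # A = U @ diag(s) @ Vt
print(U.shape, s.shape, Vt.shape)        # (2, 2) (2,) (3, 3)

# Squares of the singular values equal the non-zero eigenvalues of A^T A
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eig_AtA[:2]))    # True
```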
Differential Calculus
Differential Calculus

• The following rules are used for computing the derivatives of explicit functions

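Commonly used rules include:
 d/dx(c) = 0 (constant rule)
 d/dx(xⁿ) = n·xⁿ⁻¹ (power rule)
 d/dx(eˣ) = eˣ and d/dx(ln x) = 1/x
 (f + g)′ = f′ + g′ (sum rule)
 (f·g)′ = f′·g + f·g′ (product rule)
 (f/g)′ = (f′·g − f·g′) / g² (quotient rule)
 (f(g(x)))′ = f′(g(x))·g′(x) (chain rule)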
Higher Order Derivatives
Differential Calculus

• The derivative of the first derivative of a function f(x) is the second derivative of f(x),
denoted f″(x) or d²f/dx²

• The second derivative quantifies how the rate of change of f(x) is itself changing

 E.g., in physics, if the function describes the displacement of an object, the first derivative
gives the velocity of the object (i.e., the rate of change of the position)
 The second derivative gives the acceleration of the object (i.e., the rate of change of the
velocity)
• If we apply the differentiation operation n times, we obtain the n-th
derivative of f(x), denoted f⁽ⁿ⁾(x) or dⁿf/dxⁿ
Partial Derivatives
Differential Calculus

• So far, we looked at functions of a single variable, where y = f(x)


• Functions that depend on many variables are called multivariate functions
• Let y = f(x₁, x₂, …, xₙ) be a multivariate function with n variables
 The input is an n-dimensional vector 𝐱 = [x₁, x₂, …, xₙ]ᵀ and the output is a scalar y
 The mapping is f: ℝⁿ → ℝ
• The partial derivative of y with respect to its ith parameter xᵢ is ∂y/∂xᵢ

• To calculate ∂y/∂xᵢ (∂ is pronounced "del", or we can just say "partial derivative"), we can

treat x₁, …, xᵢ₋₁, xᵢ₊₁, …, xₙ as constants and calculate the derivative of y only with respect to xᵢ
• For notation of partial derivatives, the following are equivalent: ∂y/∂xᵢ = ∂f/∂xᵢ = fₓᵢ = Dᵢf
Gradient
Differential Calculus

• We can concatenate partial derivatives of a multivariate function with respect to


all its input variables to obtain the gradient vector of the function
• The gradient of the multivariate function f(𝐱) with respect to the n-dimensional
input vector 𝐱 is a vector of n partial derivatives

  ∇ₓ f(𝐱) = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ]ᵀ

• When there is no ambiguity, the notations ∇f(𝐱) or ∇f are often used for the gradient
instead of ∇ₓ f(𝐱)
 The symbol for the gradient is the Greek letter ∇ (pronounced "nabla"), although
the expression is more often read as "gradient of f with respect to x"
• In ML, the gradient descent algorithm relies on the opposite direction of the
gradient of the loss function with respect to the model parameters for
minimizing the loss function
 Adversarial examples can be created by adding a perturbation in the direction of the
gradient of the loss with respect to input examples, for maximizing the loss function

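A minimal NumPy sketch of a gradient and one gradient-descent step, using a finite-difference approximation (the function f and the step size are illustrative):

```python
import numpy as np

def f(x):
    """Example multivariate function f: R^2 -> R."""
    return x[0] ** 2 + 3 * x[1] ** 2

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x = np.array([1.0, -2.0])
g = numerical_gradient(f, x)      # ~ [2*x0, 6*x1] = [2, -12]
x_new = x - 0.1 * g               # gradient-descent step: move against the gradient
print(g, x_new)
```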
Optimization
Optimization

• Optimization is concerned with optimizing an objective function — finding the


value of an argument that minimizes or maximizes the function
 Most optimization algorithms are formulated in terms of minimizing a function
 Maximization is accomplished via minimizing the negative of the objective function (e.g.,
to maximize f(x), minimize −f(x))
 In minimization problems, the objective function is often referred to as a cost function,
loss function, or error function
• Optimization is very important for machine learning
 The performance of optimization algorithms affects the model's training efficiency
• Most optimization problems in machine learning are nonconvex
 Meaning that the loss function is not a convex function
 Nonetheless, the design and analysis of algorithms for solving convex problems has
been very instructive for advancing the field of machine learning

Optimization
Optimization

• Optimization and machine learning have related, but somewhat different goals
 Goal in optimization: minimize an objective function
o For a set of training examples, reduce the training error
 Goal in ML: find a suitable model to predict on unseen data examples
o For a set of testing examples, reduce the generalization error
• For a given empirical function g (dashed purple curve), optimization algorithms
attempt to find the point of minimum empirical risk
• The empirical function g is obtained from the limited amount of training
data examples, while the expected function f (blue curve) is the risk over
the true data distribution
• ML algorithms attempt to find the point of
minimum expected risk, based on minimizing
the error on a set of testing examples
o Which may be at a different location than the
minimum of the training examples
o And which may not be minimal in a formal sense

Stationary Points
Optimization

• Stationary points (or critical points) of a differentiable function f(x) of one variable

are the points where the derivative of the function is zero, i.e., f′(x) = 0
• The stationary points can be:
 Minimum, a point where the derivative changes from negative to positive
 Maximum, a point where the derivative changes from positive to negative
 Saddle point, a point where the derivative has the same sign (either positive or negative) on both sides of the point
• The minimum and maximum points are collectively known as extremum points
• The nature of stationary points can be
determined based on the second derivative
of f(x) at the point
 If f″(x) > 0, the point is a minimum
 If f″(x) < 0, the point is a maximum
 If f″(x) = 0, the test is inconclusive: the point can be a saddle
point, but it may not be
• The same concept also applies to gradients
of multivariate functions
Local Minima
Optimization

• The challenges in optimizing a model's parameters in ML include local

minima, saddle points, and vanishing gradients
• For an objective function f(x), if its value at a point x is the minimum of the objective
function over the entire domain of x, then it is the global minimum
• If the value of f at x is smaller than the values of the objective function at any other
points in the vicinity of x, then it is a local minimum
 The objective functions in ML usually
have many local minima
o When the solution of the optimization
algorithm is near the local minimum, the
gradient of the loss function approaches or
becomes zero (vanishing gradients)
o Therefore, the obtained solution in the final
iteration can be a local minimum, rather
than the global minimum

Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html


Convex Optimization
Optimization

• A function of a single variable is concave if every line segment joining two points
on its graph does not lie above the graph at any point
• Symmetrically, a function of a single variable is convex if every line segment
joining two points on its graph does not lie below the graph at any point

Picture from: https://mjo.osborne.economics.utoronto.ca/index.php/tutorial/index/1/cv1/t


Convex Functions
Optimization

• In mathematical terms, the function f is a convex function if for all points x₁, x₂ and for all
λ ∈ [0, 1]: f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂)
Minimum vs Maximum

Constrained Optimization
Optimization

• The optimization problem that involves a set of constraints which need to be


satisfied to optimize the objective function is called constrained optimization
• E.g., minimize an objective function f(𝐱) subject to a set of constraint functions
cᵢ(𝐱) ≤ 0 for i = 1, …, N
• The points that satisfy the constraints form the feasible region
• Various optimization algorithms have been developed for handling optimization
problems based on whether the constraints are equalities, inequalities, or a
combination of equalities and inequalities

Lagrange Multipliers
Optimization

• One approach to solving optimization problems is to substitute the initial problem


with optimizing another related function
• The Lagrange function for optimization of the constrained problem on the previous
page is defined as

  L(𝐱, α₁, …, α_N) = f(𝐱) + ∑ᵢ αᵢ cᵢ(𝐱),  where αᵢ ≥ 0
• The variables αᵢ are called Lagrange multipliers and ensure that the constraints are
properly enforced
 They are chosen just large enough to ensure that cᵢ(𝐱) ≤ 0 for all i
• This is a saddle-point optimization problem where one wants to minimize L with
respect to 𝐱 and simultaneously maximize L with respect to the αᵢ
 The saddle point of L gives the optimal solution to the original constrained optimization
problem

Projections
Optimization

• An alternative strategy for satisfying constraints is projections


• E.g., gradient clipping in NNs can require that the norm of the gradient 𝐠 is bounded
by a constant value c
• Approach:
 At each iteration during training
 If the norm of the gradient ‖𝐠‖ ≤ c, then the update is 𝐠
 If the norm of the gradient ‖𝐠‖ > c, then the update is c·𝐠/‖𝐠‖
• Note that since 𝐠/‖𝐠‖ is a unit vector (i.e., it has a norm = 1), the vector c·𝐠/‖𝐠‖ has a norm = c
• Such clipping is the projection of the gradient 𝐠 onto the ball of radius c
 Projection onto the unit ball is obtained for c = 1
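A NumPy sketch of gradient clipping as a projection onto the ball of radius c (the gradient vector and c are illustrative):

```python
import numpy as np

def clip_gradient(g, c):
    """Project g onto the ball of radius c: rescale only if ||g|| > c."""
    norm = np.linalg.norm(g)
    if norm > c:
        return (c / norm) * g
    return g

g = np.array([3.0, 4.0])          # ||g|| = 5
print(clip_gradient(g, c=2.0))    # rescaled to norm 2: [1.2 1.6]
print(clip_gradient(g, c=10.0))   # unchanged: [3. 4.]
```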
Projections
Optimization

• More generally, a projection of a vector 𝐱 onto a set 𝒳 is defined as

  Proj_𝒳(𝐱) = argmin_{𝐱′ ∈ 𝒳} ‖𝐱 − 𝐱′‖
• This means that the vector 𝐱 is projected onto the closest vector 𝐱′ that belongs to
the set 𝒳
• For example, in the figure, the blue circle represents
a convex set 𝒳
 The points inside the circle project to themselves
o E.g., for the yellow vector inside the circle, its closest point in the set is the vector itself:
the distance between them is 0
 The points outside the circle project to the closest
point inside the circle
o E.g., for the yellow vector outside the circle, its closest point in the set is
the red vector
o Among all vectors in the set 𝒳, the red vector has the
smallest distance to the yellow vector, i.e., it minimizes ‖𝐱 − 𝐱′‖

Picture from: http://d2l.ai/chapter_optimization/convexity.html


Probability
Probability

• Intuition:
 In a process, several outcomes are possible
 When the process is repeated a large number of times, each outcome occurs with a
relative frequency, or probability
 If a particular outcome occurs more often, we say it is more probable
• Probability arises in two contexts
 In actual repeated experiments
o Example: You record the color of 1,000 cars driving by. 57 of them are green. You estimate the
probability of a car being green as 57/1,000 = 0.057.
 In idealized conceptions of a repeated process
o Example: You consider the behavior of an unbiased six-sided die. The expected probability of
rolling a 5 is 1/6 = 0.1667.
o Example: You need a model for how people’s heights are distributed. You choose a normal
distribution to represent the expected relative probabilities.

Slide credit: Jeff Howbert — Machine Learning Math Essentials


Random variables
Probability

• A random variable X is a variable that can take on different values

 Example: X = the outcome of rolling a die
o Possible values of X comprise the sample space, or outcome space, {1, 2, 3, 4, 5, 6}
o We denote the event of "seeing a 5" as {X = 5} or X = 5
o The probability of the event is P({X = 5}) or P(X = 5)
o Also, P(5) can be used to denote the probability that X takes the value of 5
• A probability distribution is a description of how likely a random variable is to
take on each of its possible states
 A compact notation is common, where P(X) is the probability distribution over the random
variable X
o Also, the notation X ∼ P can be used to denote that the random variable X has probability distribution P
• Random variables can be discrete or continuous
 Discrete random variables have a finite number of states: e.g., the sides of a die
 Continuous random variables have an infinite number of states: e.g., the height of a person

Slide credit: Jeff Howbert — Machine Learning Math Essentials


Discrete Variables
Probability

• A probability distribution over discrete


variables may be described using a
probability mass function (PMF)
 E.g., sum of two dice

• A probability distribution over continuous

variables may be described using a
probability density function (PDF)
 E.g., waiting time between eruptions of Old
Faithful
 A PDF p(x) gives the probability of an infinitesimal
region with volume δx, i.e., p(x)·δx
 To find the probability over an interval [a, b],
we can integrate the PDF as follows: P(a ≤ X ≤ b) = ∫ₐᵇ p(x) dx

Picture from: Jeff Howbert — Machine Learning Math Essentials


Marginal Probability Distribution
Probability

• The marginal probability distribution is the probability distribution of a single

variable
 It is calculated based on the joint probability distribution
 I.e., using the sum rule: P(X = x) = ∑_y P(X = x, Y = y)
o For continuous random variables, the summation is replaced with integration, p(x) = ∫ p(x, y) dy
 This process is called marginalization

Slide credit: Jeff Howbert — Machine Learning Math Essentials


Joint Probability Distribution
Probability

• Probability distribution that acts on many variables at the same time is known as a
joint probability distribution
• Given any values x and y of two random variables X and Y, what is the probability
that X = x and Y = y simultaneously?
 P(X = x, Y = y) denotes the joint probability
 We may also write P(x, y) for brevity

Slide credit: Jeff Howbert — Machine Learning Math Essentials


Conditional Probability Distribution
Probability

• Conditional probability distribution is the probability distribution of one variable


provided that another variable has taken a certain value
 Denoted P(Y = y | X = x)
• Note that: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)

Slide credit: Jeff Howbert — Machine Learning Math Essentials


Probability Distributions
Probability

• Bernoulli distribution
 Binary random variable with states {0, 1}
 The random variable encodes a coin flip
which comes up 1 with probability p and 0
with probability 1 − p
 Notation: X ∼ Bernoulli(p) (the figure uses p = 0.3)

• Uniform distribution
 The probability of each of the n possible values is 1/n
 Notation: X ∼ U(1, n)
 Figure: the uniform probability mass function

Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html


Probability Distributions
Probability

• Binomial distribution
 Performing a sequence of n independent
experiments, each of which has probability p of
succeeding, where p ∈ [0, 1]
 The probability of getting k successes in n trials
is P(X = k) = (n choose k) pᵏ (1 − p)ⁿ⁻ᵏ
 Notation: X ∼ Binomial(n, p) (the figure uses n = 10, p = 0.2)
• Poisson distribution
 A number of events occurring independently in
a fixed interval of time with a known rate λ
 A discrete random variable with states k = 0, 1, 2, … has
probability P(X = k) = λᵏ e^(−λ) / k!
 The rate λ is the average number of occurrences
of the event
 Notation: X ∼ Poisson(λ) (the figure uses λ = 5)

Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html
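A NumPy sketch of sampling from these distributions and comparing the sample means with the expected values (parameters follow the figures: p = 0.3 and 0.2, n = 10, λ = 5):

```python
import numpy as np

rng = np.random.default_rng(0)

bernoulli = rng.binomial(n=1, p=0.3, size=10_000)    # Bernoulli(p) = Binomial(1, p)
binomial  = rng.binomial(n=10, p=0.2, size=10_000)   # Binomial(10, 0.2), mean n*p = 2
poisson   = rng.poisson(lam=5, size=10_000)          # Poisson(5), mean 5

print(bernoulli.mean(), binomial.mean(), poisson.mean())   # ~0.3, ~2.0, ~5.0
```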


Probability Distributions
Probability

• Gaussian distribution
 The most well-studied distribution
o Referred to as the normal distribution or, informally, the bell-shaped distribution
 Defined with the mean μ and variance σ²
 Notation: X ∼ 𝒩(μ, σ²)
 For a random variable with n independent measurements x₁, …, xₙ, the density is
p(𝐱) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) · exp(−(xᵢ − μ)² / (2σ²))
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html


Probability Distributions
Probability

• Multinoulli distribution
 It is an extension of the Bernoulli distribution, from binary class to multi-class
 The multinoulli distribution is also called the categorical distribution or generalized Bernoulli
distribution
 Multinoulli is a discrete probability distribution that describes the possible results of a
random variable that can take on one of k possible categories
o A categorical random variable is a discrete variable with more than two possible outcomes (such
as the roll of a die)
 For example, in multi-class classification in machine learning, we have a set of data
examples 𝐱⁽¹⁾, …, 𝐱⁽ᵐ⁾, and corresponding to the data example 𝐱⁽ⁱ⁾ is a k-class label 𝐲⁽ⁱ⁾ representing a one-
hot encoding
o One-hot encoding is also called a 1-of-k vector, where one element has the value 1 and all other
elements have the value 0
o Let's denote the probabilities for assigning the k class labels to a data example by π₁, …, πₖ
o We know that πⱼ ≥ 0 and ∑ⱼ πⱼ = 1 for the different classes
o The multinoulli probability of a data example with one-hot label 𝐲 is P(𝐲 | π) = ∏ⱼ πⱼ^(yⱼ)
o Similarly, we can calculate the probability of all data examples as the product of the individual probabilities
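A small NumPy sketch of a one-hot label and the multinoulli probability of a data example (the class probabilities are illustrative):

```python
import numpy as np

pi = np.array([0.7, 0.2, 0.1])    # class probabilities: pi_j >= 0 and sum to 1
y  = np.array([0, 1, 0])          # one-hot label: this example belongs to class 2

prob = np.prod(pi ** y)           # multinoulli probability = pi_2 = 0.2
print(prob)

# Sampling a categorical (multinoulli) random variable
rng = np.random.default_rng(0)
samples = rng.choice(len(pi), p=pi, size=10_000)
print(np.bincount(samples) / len(samples))   # relative frequencies ~ [0.7, 0.2, 0.1]
```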
Statistics
Statistics

Variance
Probability

• Variance gives a measure of how much the values of a function f(x) deviate from
the expected value as we sample values of X from P(x)

  Var(f(x)) = 𝔼[(f(x) − 𝔼[f(x)])²]
• When the variance is low, the values of f(x) cluster near the expected value
• Variance is commonly denoted with σ²
 For f(x) = x, we have σ² = 𝔼[(X − μ)²], where μ = 𝔼[X]
 This is similar to the formula for calculating the variance of a sample of observations:
s² = (1/(n − 1)) ∑ᵢ (xᵢ − x̄)²
• The square root of the variance is the standard deviation
 Denoted σ

Slide credit: Jeff Howbert — Machine Learning Math Essentials


Covariance
Probability

• Covariance gives a measure of how much two random variables are linearly
related to each other

  Cov(X, Y) = 𝔼[(X − 𝔼[X])(Y − 𝔼[Y])]
• If μ_X = 𝔼[X] and μ_Y = 𝔼[Y]
 Then, the covariance is: Cov(X, Y) = 𝔼[(X − μ_X)(Y − μ_Y)]
 Compare to the covariance of actual samples: (1/(n − 1)) ∑ᵢ (xᵢ − x̄)(yᵢ − ȳ)
• The covariance measures the tendency for X and Y to deviate from their means in the
same (or opposite) directions at the same time

(Figure: scatter plots of Y vs. X illustrating no covariance and high covariance)
Picture from: Jeff Howbert — Machine Learning Math Essentials
Correlation
Probability

• The correlation coefficient is the covariance normalized by the standard deviations of

the two variables

  ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)

 It is also called Pearson's correlation coefficient
 The values are in the interval [−1, 1]
 It only reflects linear dependence between variables, and it does not measure non-linear
dependencies between the variables

Picture from: Jeff Howbert — Machine Learning Math Essentials
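A NumPy sketch of sample variance, covariance, and the Pearson correlation coefficient (the generated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)
y = 2.0 * x + rng.normal(0.0, 0.5, size=1000)   # y is linearly related to x, plus noise

var_x  = x.var(ddof=1)           # sample variance (divide by n - 1)
cov_xy = np.cov(x, y)[0, 1]      # sample covariance of x and y
rho    = np.corrcoef(x, y)[0, 1] # Pearson correlation, in [-1, 1]

print(var_x, cov_xy, rho)        # rho close to 1 (strong linear dependence)
```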


Covariance Matrix
Probability

• The covariance matrix of a multivariate random variable 𝐗 = [X₁, …, Xₙ]ᵀ is an n×n matrix Σ,

such that Σᵢⱼ = Cov(Xᵢ, Xⱼ)

• I.e., Σ = 𝔼[(𝐗 − 𝔼[𝐗])(𝐗 − 𝔼[𝐗])ᵀ]

• The diagonal elements of the covariance matrix are the variances of the elements of
the vector, Σᵢᵢ = Var(Xᵢ)

• Also note that the covariance matrix is symmetric, since Cov(Xᵢ, Xⱼ) = Cov(Xⱼ, Xᵢ)

References
1. A. Zhang, Z. C. Lipton, M. Li, A. J. Smola, Dive into Deep Learning, https://d2l.ai,
2020.
2. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2017.
3. M. P. Deisenroth, A. A. Faisal, C. S. Ong, Mathematics for Machine Learning,
Cambridge University Press, 2020.
4. Jeff Howbert — Machine Learning Math Essentials presentation
5. Brian Keng – Manifolds: A Gentle Introduction blog
6. Martin J. Osborne – Mathematical Methods for Economic Theory (link)

The end.

