Maths of Machine Learning
Femi William
Linear algebra is the study of vectors. At the most general level, vectors
are ordered finite lists of numbers.
Data Structures in ML
• Scalars
• Vectors
• Matrices
• Tensors
Vectors
• Vector definition
Computer science: a vector is a one-dimensional array of ordered real-valued scalars
Mathematics: a vector is a quantity possessing both magnitude and direction, represented
by an arrow whose direction indicates the vector's direction and whose length is
proportional to its magnitude
• Vectors are written in column form or in row form
Denoted by bold-font lower-case letters
• For a general vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ with $n$ elements, the vector lies in the $n$-dimensional space $\mathbb{R}^n$
• Vector addition
We add the coordinates, and follow the directions
given by the two vectors that are added
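A minimal sketch of this using NumPy (not something the slides prescribe; the vectors are hypothetical examples):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Adding vectors adds their coordinates element-wise
w = u + v
print(w)  # [4. 1.]
```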
Vectors
• The geometric interpretation of vectors as points in space allows us to consider a
training set of input examples in ML as a collection of points in space
Hence, classification can be viewed as discovering how to separate two clusters of
points belonging to different classes (left picture)
o Rather than distinguishing images containing cars, planes, buildings, for example
Or, it can help to visualize zero-centering and normalization of training data (right
picture)
Example of LA
Norm of a Vector
Vectors
• The general $\ell_p$ norm of a vector is obtained as: $\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
Below we review the most common norms, obtained for $p = 2$ and $p = 1$
• For $p = 2$ we have the $\ell_2$ norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{\mathbf{x}^T \mathbf{x}}$
Also called Euclidean norm
It is the most often used norm
The $\ell_2$ norm is often denoted just as $\|\mathbf{x}\|$ with the subscript 2 omitted
• For $p = 1$ we have the $\ell_1$ norm: $\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$
Uses the absolute values of the elements
Discriminates between zero and non-zero elements
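A minimal sketch of these norms with NumPy (the vector is a hypothetical example):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# L2 (Euclidean) norm: square root of the sum of squared elements
l2 = np.linalg.norm(x)          # 5.0
# L1 norm: sum of the absolute values of the elements
l1 = np.linalg.norm(x, ord=1)   # 7.0
# General Lp norm for an arbitrary p
p = 3
lp = np.sum(np.abs(x) ** p) ** (1.0 / p)

print(l2, l1, lp)
```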
• For one linear equation ax = b, where the unknown is x and a and b are constants, there are 3 possibilities:
If a ≠ 0, there is a unique solution x = b/a
If a = 0 and b ≠ 0, there is no solution
If a = 0 and b = 0, every x is a solution
• With more than one equation and more than one unknown, the same idea is used to solve the system
For example, a system of linear equations can be written in matrix form as
A X = B
and, provided A is invertible, solved as X = A⁻¹B
• To find A⁻¹ we need the determinant of A
From the earlier 2×2 example: (2 × −2) − (3 × 1) = −4 − 3 = −7, so the determinant is −7
Given a specific B, the solution is then X = A⁻¹B (a short numerical sketch follows below)
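A hedged illustration of this with NumPy; the specific A and B from the slide are not reproduced here, so the matrix and right-hand side below are hypothetical stand-ins (the matrix does have determinant −7):

```python
import numpy as np

# Hypothetical 2x2 system; its determinant is (2)(-2) - (3)(1) = -7
A = np.array([[2.0, 3.0],
              [1.0, -2.0]])
B = np.array([7.0, 0.0])  # illustrative right-hand side

print(np.linalg.det(A))          # -7.0 (up to floating-point error)
X = np.linalg.inv(A) @ B         # X = A^-1 B
# In practice np.linalg.solve is preferred over forming the inverse explicitly
X_solve = np.linalg.solve(A, B)
print(X, X_solve)                # both give [2. 1.]
```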
Vector Projection
Vectors
• Hyperplane is a subspace whose dimension is one less than that of its ambient
space
In a 2D space, a hyperplane is a straight line (i.e., 1D)
In a 3D space, a hyperplane is a plane (i.e., 2D)
In a d-dimensional vector space, a hyperplane has d − 1 dimensions, and divides the space
into two half-spaces
• A hyperplane is a generalization of the concept of a plane to high-dimensional spaces
• In ML, hyperplanes are decision boundaries used for linear classification
Data points falling on either side of the hyperplane are attributed to different classes, as sketched below
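A minimal sketch of using a hyperplane w·x + b = 0 as a linear decision boundary (the parameters w, b and the points are hypothetical):

```python
import numpy as np

# Hypothetical hyperplane in 2D: w . x + b = 0
w = np.array([1.0, -2.0])
b = 0.5

points = np.array([[3.0, 1.0],    # w.x + b =  1.5 -> one side
                   [0.0, 2.0]])   # w.x + b = -3.5 -> other side

# The sign of w . x + b tells us on which side of the hyperplane a point lies
scores = points @ w + b
labels = np.sign(scores)
print(scores, labels)
```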
• Scalar multiplication
• Matrix multiplication
Defined only if the number of columns of the left matrix is the same as the number of
rows of the right matrix
Note that matrix multiplication is in general not commutative, i.e., AB ≠ BA, as illustrated below
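A minimal NumPy sketch of scalar and matrix multiplication (the values are hypothetical), including the dimension requirement:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])       # 2 x 2
B = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0]])  # 2 x 3

# Scalar multiplication scales every element
print(2 * A)

# Matrix multiplication: the 2 columns of A must match the 2 rows of B
C = A @ B                        # result is 2 x 3
print(C.shape)

# B @ A is not even defined here (3 columns vs 2 rows),
# and even for square matrices AB != BA in general
```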
Some properties
• Identity matrix ( In ): has ones on the main diagonal, and zeros elsewhere
• The minor $M_{ij}$ of a matrix is the determinant of the submatrix obtained by removing the row
and column associated with the indices i and j
• Trace of a matrix is the sum of all diagonal elements
• For an m×n matrix A and an n-dimensional vector x, the matrix-vector product Ax is a column
vector of length m, whose ith element is the dot product of the ith row of A with x
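A short sketch of the trace and the matrix-vector product as row-wise dot products (hypothetical values):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # m x n = 2 x 3
x = np.array([1.0, 0.0, -1.0])    # length n = 3

# Matrix-vector product: length-m vector whose i-th element is row_i(A) . x
y = A @ x
print(y)                          # [-2. -2.]

# Trace of a square matrix: the sum of its diagonal elements
S = np.array([[1.0, 9.0],
              [9.0, 4.0]])
print(np.trace(S))                # 5.0
```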
• For an m×n matrix, the rank of the matrix is the largest number of linearly independent
columns
• The matrix B from the previous example has rank 1, since its two columns are linearly
dependent
$\mathbf{B} = \begin{bmatrix} 2 & 4 \\ -1 & -2 \end{bmatrix}$
• A matrix C with two linearly independent columns has rank 2
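A quick check of these rank claims with NumPy (the matrix C below is a hypothetical example of independent columns):

```python
import numpy as np

B = np.array([[2.0, 4.0],
              [-1.0, -2.0]])
print(np.linalg.matrix_rank(B))   # 1, the columns are linearly dependent

# Hypothetical C with two linearly independent columns
C = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(np.linalg.matrix_rank(C))   # 2
```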
• For a square n×n matrix A with rank n, $\mathbf{A}^{-1}$ is its inverse matrix if their product is the
identity matrix I, i.e., $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$
• Pseudo-inverse of a matrix
Also known as Moore-Penrose pseudo-inverse
• For matrices that are not square, the inverse does not exist
Therefore, a pseudo-inverse is used
• If the matrix has more rows than columns (m > n) and full column rank, the pseudo-inverse is $\mathbf{A}^{+} = (\mathbf{A}^{T}\mathbf{A})^{-1}\mathbf{A}^{T}$ and $\mathbf{A}^{+}\mathbf{A} = \mathbf{I}$
• If the matrix has more columns than rows (m < n) and full row rank, the pseudo-inverse is $\mathbf{A}^{+} = \mathbf{A}^{T}(\mathbf{A}\mathbf{A}^{T})^{-1}$ and $\mathbf{A}\mathbf{A}^{+} = \mathbf{I}$
E.g., for a matrix A with dimension m×n, a pseudo-inverse $\mathbf{A}^{+}$ of size n×m can be found, so that $\mathbf{A}^{+}\mathbf{A} = \mathbf{I}$ (or $\mathbf{A}\mathbf{A}^{+} = \mathbf{I}$)
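A minimal sketch with NumPy's pinv (the matrix values are hypothetical):

```python
import numpy as np

# Tall matrix: m = 3 rows, n = 2 columns, full column rank
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

A_pinv = np.linalg.pinv(A)        # shape n x m = 2 x 3
print(A_pinv.shape)

# For full column rank, pinv(A) equals (A^T A)^-1 A^T, and A^+ A = I
left = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A_pinv, left))            # True
print(np.allclose(A_pinv @ A, np.eye(2)))   # True
```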
Calculating Eigenvalues & Eigenvectors
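A minimal numerical sketch of the computation (the symmetric matrix below is a hypothetical example; eigenvalues are the roots of the characteristic equation det(A − λI) = 0):

```python
import numpy as np

# Hypothetical symmetric matrix
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Eigenvalues solve det(A - lambda*I) = 0; eigenvectors v satisfy A v = lambda * v
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)          # the eigenvalues
print(eigvecs)          # one eigenvector per column

# Verify A v = lambda v for the first eigenpair
v = eigvecs[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))   # True
```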
• Standard rules (e.g., the sum, product, quotient, and chain rules) are used for computing the derivatives of explicit functions
Higher Order Derivatives
Differential Calculus
Partial Derivatives
Differential Calculus
Gradient
Differential Calculus
• When there is no ambiguity, the notations $\nabla f$ or $\nabla_{\mathbf{x}} f$ are often used for the gradient
instead of $\nabla_{\mathbf{x}} f(\mathbf{x})$
The gradient symbol $\nabla$ is pronounced "nabla", although $\nabla_{\mathbf{x}} f$ is more often read as
"gradient of f with respect to x"
• In ML, the gradient descent algorithm relies on the opposite direction of the
gradient of the loss function with respect to the model parameters for
minimizing the loss function
Adversarial examples can be created by adding a perturbation in the direction of the
gradient of the loss with respect to the input examples, in order to maximize the loss function
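A minimal sketch of a gradient-descent update (the quadratic loss and learning rate below are hypothetical choices):

```python
import numpy as np

# Hypothetical loss: L(w) = ||w - w_true||^2, with gradient 2 * (w - w_true)
w_true = np.array([1.0, -2.0])

def loss_grad(w):
    return 2.0 * (w - w_true)

w = np.zeros(2)       # initial parameters
lr = 0.1              # learning rate

# Move in the direction opposite to the gradient to decrease the loss
for _ in range(100):
    w = w - lr * loss_grad(w)

print(w)              # close to w_true = [1, -2]
```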
Optimization
• Optimization and machine learning have related, but somewhat different goals
Goal in optimization: minimize an objective function
o For a set of training examples, reduce the training error
Goal in ML: find a suitable model that predicts well on new data examples
o For a set of testing examples, reduce the generalization error
• For a given empirical function g (dashed purple curve), optimization algorithms
attempt to find the point of minimum empirical risk
• The expected function f (blue curve) is
obtained given a limited amount of training
data examples
• ML algorithms attempt to find the point of
minimum expected risk, based on minimizing
the error on a set of testing examples
o Which may be at a different location than the
minimum of the training examples
o And which may not be minimal in a formal sense
Stationary Points
Optimization
• A function of a single variable is concave if every line segment joining two points
on its graph does not lie above the graph at any point
• Symmetrically, a function of a single variable is convex if every line segment
joining two points on its graph does not lie below the graph at any point
• In mathematical terms, the function f is a convex function if for all points $x_1, x_2$ and for all $\lambda \in [0, 1]$: $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
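A small numeric sketch of this inequality for the (convex) function f(x) = x², at hypothetical points:

```python
# Convexity check: f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)
def f(x):
    return x ** 2

x1, x2 = -1.0, 3.0
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    assert lhs <= rhs + 1e-12   # holds for every lam in [0, 1]
print("convexity inequality holds at the sampled points")
```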
Minimum vs Maximum
Constrained Optimization
Optimization
• The points that satisfy the constraints form the feasible region
• Various optimization algorithms have been developed for handling optimization
problems based on whether the constraints are equalities, inequalities, or a
combination of equalities and inequalities
Lagrange Multipliers
Optimization
• The constrained problem is converted into an unconstrained one by forming the Lagrangian,
where each constraint is added to the objective function, weighted by a new variable
• These variables are called Lagrange multipliers and ensure that the constraints are
properly enforced
They are chosen just large enough to ensure that the constraints are satisfied
• This is a saddle-point optimization problem, where one wants to minimize the Lagrangian with
respect to the original variables and simultaneously maximize it with respect to the Lagrange multipliers
The saddle point of the Lagrangian gives the optimal solution to the original constrained optimization
problem (a standard textbook form of the Lagrangian is sketched below)
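For reference, a common textbook form, assuming an objective $f(\mathbf{x})$ and inequality constraints $c_i(\mathbf{x}) \le 0$ (the exact notation used on the slide may differ):

```latex
% Lagrangian for minimizing f(x) subject to c_i(x) <= 0
L(\mathbf{x}, \boldsymbol{\alpha}) \;=\; f(\mathbf{x}) \;+\; \sum_{i} \alpha_i \, c_i(\mathbf{x}),
\qquad \alpha_i \ge 0
% The solution is the saddle point: minimize over x, maximize over alpha
```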
Projections
Optimization
• This means that a vector is projected onto the closest vector that belongs to
the set
• For example, in the figure, the blue circle represents
a convex set
The points inside the circle project to themselves
o E.g., for a yellow vector inside the set, its closest point in the set is the vector itself:
the distance between the vector and its projection is 0
The points outside the circle project to the closest
point inside the circle
o E.g., for a yellow vector outside the set, its closest point in the set is
the red vector
o Among all vectors in the set, the red vector has the
smallest distance to the yellow vector (a small sketch follows below)
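A minimal sketch of projecting onto a Euclidean ball of radius r, a common convex set in ML (e.g., in projected gradient descent); the vectors are hypothetical:

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Project x onto the closest point of the L2 ball {v : ||v|| <= radius}."""
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x                    # points inside the set project to themselves
    return x * (radius / norm)      # points outside are pulled back to the boundary

print(project_onto_ball(np.array([0.3, 0.1])))   # unchanged, already inside the set
print(project_onto_ball(np.array([3.0, 4.0])))   # [0.6, 0.8], on the boundary
```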
• Intuition:
In a process, several outcomes are possible
When the process is repeated a large number of times, each outcome occurs with a
relative frequency, or probability
If a particular outcome occurs more often, we say it is more probable
• Probability arises in two contexts
In actual repeated experiments
o Example: You record the color of 1,000 cars driving by. 57 of them are green. You estimate the
probability of a car being green as 57/1,000 = 0.057.
In idealized conceptions of a repeated process
o Example: You consider the behavior of an unbiased six-sided die. The expected probability of
rolling a 5 is 1/6 = 0.1667.
o Example: You need a model for how people’s heights are distributed. You choose a normal
distribution to represent the expected relative probabilities.
• Probability distribution that acts on many variables at the same time is known as a
joint probability distribution
• Given any values x and y of two random variables X and Y, what is the probability
that X = x and Y = y simultaneously?
$P(X = x, Y = y)$ denotes the joint probability
We may also write $P(x, y)$ for brevity
• Bernoulli distribution
Binary random variable with states $x \in \{0, 1\}$
The random variable encodes a coin flip which comes up 1 with probability p and 0
with probability $1 - p$
Notation: $X \sim \mathrm{Bernoulli}(p)$
(Figure: example probability mass function with p = 0.3)
• Uniform distribution
The probability of each of the k possible values is $1/k$
Notation: $X \sim \mathrm{U}(k)$
• Binomial distribution
Performing a sequence of n independent experiments, each of which has probability p of
succeeding, where $p \in [0, 1]$
The probability of getting k successes in n trials is $P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$
Notation: $X \sim \mathrm{Bin}(n, p)$
(Figure: example probability mass function with n = 10, p = 0.2)
• Poisson distribution
A number of events occurring independently in a fixed interval of time with a known rate λ
A discrete random variable with states $k \in \{0, 1, 2, \ldots\}$ has probability $P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$
The rate λ is the average number of occurrences of the event per interval
Notation: $X \sim \mathrm{Poisson}(\lambda)$
(Figure: example probability mass function with λ = 5)
• Gaussian distribution
The most well-studied distribution
o Referred to as normal distribution or, informally, bell-shaped distribution
Defined by the mean μ and variance σ²
Notation: $X \sim \mathcal{N}(\mu, \sigma^{2})$
The density for a measurement x is $p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$
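A minimal sketch of drawing samples from these distributions with NumPy (parameter values follow the figure annotations above; the uniform range is a hypothetical die example):

```python
import numpy as np

rng = np.random.default_rng(0)

bern  = rng.binomial(n=1, p=0.3, size=5)        # Bernoulli(p=0.3)
unif  = rng.integers(low=1, high=7, size=5)     # discrete uniform over {1,...,6}
binom = rng.binomial(n=10, p=0.2, size=5)       # Bin(n=10, p=0.2)
poiss = rng.poisson(lam=5, size=5)              # Poisson(lambda=5)
gauss = rng.normal(loc=0.0, scale=1.0, size=5)  # N(mu=0, sigma^2=1)

print(bern, unif, binom, poiss, gauss, sep="\n")
```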
• Multinoulli distribution
It is an extension of the Bernoulli distribution, from binary class to multi-class
Multinoulli distribution is also called categorical distribution or generalized Bernoulli
distribution
Multinoulli is a discrete probability distribution that describes the possible results of a
random variable that can take on one of k possible categories
o A categorical random variable is a discrete variable with more than two possible outcomes (such
as the roll of a die)
For example, in multi-class classification in machine learning, we have a set of data
examples, and corresponding to each data example is a k-class label represented with one-
hot encoding
o One-hot encoding is also called a 1-of-k vector, where one element has the value 1 and all other
elements have the value 0
o Let's denote the probabilities for assigning the k class labels to a data example by $\pi_1, \pi_2, \ldots, \pi_k$
o We know that $\pi_j \ge 0$ and $\sum_{j=1}^{k} \pi_j = 1$ for the different classes
o The multinoulli probability of a data example with one-hot label $\mathbf{y} = [y_1, \ldots, y_k]$ is $\prod_{j=1}^{k} \pi_j^{\,y_j}$
o Similarly, we can calculate the probability of all data examples as the product of the individual
probabilities, assuming the examples are independent
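A short sketch of the multinoulli probability of a one-hot-encoded label (the class probabilities and the label are hypothetical):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])     # class probabilities, sum to 1
y = np.array([0, 1, 0])            # one-hot label: the example belongs to class 2

# Multinoulli probability: product over classes of pi_j ** y_j
prob = np.prod(pi ** y)
print(prob)                        # 0.5, i.e., the probability of the labeled class
```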
Statistics
Variance
Probability
• Variance gives a measure of how much the values of a function of a random variable X deviate from
the expected value as we sample values of X from its probability distribution
• When the variance is low, the values cluster near the expected value
• Variance is commonly denoted with $\sigma^{2}$, and is defined as $\mathrm{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^{2}\right]$
Expanding the square, we have $\mathrm{Var}(X) = \mathbb{E}[X^{2}] - (\mathbb{E}[X])^{2}$
This is similar to the formula for calculating the variance of a sample of n observations: $s^{2} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^{2}$
• The square root of the variance is the standard deviation
Denoted $\sigma$
• Covariance gives a measure of how much two random variables are linearly
related to each other
• If $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$
Then, the covariance is: $\mathrm{Cov}(X, Y) = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]$
Compare to the covariance of actual samples: $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
• The covariance measures the tendency for X and Y to deviate from their means in the
same (or opposite) directions at the same time
Picture from: Jeff Howbert — Machine Learning Math Essentials
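A quick numeric sketch of sample variance and covariance with NumPy (the data values are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])      # perfectly linearly related to x

print(np.var(x))            # sample variance with the 1/n convention
print(np.std(x))            # standard deviation, the square root of the variance

# np.cov uses the 1/(n-1) convention by default; bias=True switches to 1/n
print(np.cov(x, y, bias=True)[0, 1])    # covariance of x and y
```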
Correlation
Probability
• I.e., the correlation coefficient is the covariance normalized by the standard deviations of the two variables: $\mathrm{corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
• The diagonal elements of the covariance matrix are the variances of the elements of
the vector
References
1. A. Zhang, Z. C. Lipton, M. Li, A. J. Smola, Dive into Deep Learning, https://d2l.ai, 2020.
2. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2017.
3. M. P. Deisenroth, A. A. Faisal, C. S. Ong, Mathematics for Machine Learning, Cambridge University Press, 2020.
4. J. Howbert, Machine Learning Math Essentials, presentation.
5. B. Keng, Manifolds: A Gentle Introduction, blog.
6. M. J. Osborne, Mathematical Methods for Economic Theory (link).
The end.