1 2.-Maths ML

machine learning mathematics

Uploaded by aapatil.sknsits
© All Rights Reserved

8/30/22

Module 1: Introduction to Machine Learning


• Part 1.1: Introduction to ML
• Part 1.2: Well-posed learning problem
• Part 1.3: Types of Machine Learning & Applications
• Part 1.4: Mathematics for Machine Learning


Dr. Chandra Prakash

[ Slides adapted from Dr. Alex Vakanski ]

Mathematical Foundations
• Mathematics for Machine Learning
  – Linear algebra
    • Vectors
    • Matrices
    • Eigen decomposition
  – Differential calculus
  – Optimization algorithms
  – Probability
    • Random variables
    • Probability distributions
  – Information theory

So far
• ML is concerned with general-purpose methodologies that can be applied to many datasets to extract valuable/meaningful patterns, ideally without much domain-specific expertise.
• Three core concepts of ML:
  – Data
  – Model
  – Learning
• Mitchell (1997): A model is said to learn from data if its performance on a given task improves after the data is taken into account.
  – The goal is to find good models that generalize well to yet unseen data, which we may care about in the future.
• Learning can be understood as a way to automatically find patterns and structure in data by optimizing the parameters of the model.

Areas of math essential to machine learning
• Machine learning is part of both statistics and computer science
  – Probability
  – Statistical inference
  – Validation
  – Estimates of error, confidence intervals
• Linear algebra
  – Hugely useful for compact representation of linear transformations on data
  – Dimensionality reduction techniques
• Optimization theory

Why worry about the math?
• There are lots of easy-to-use machine learning packages out there.
  – After this course, you will know how to apply several of the most general-purpose algorithms.
• HOWEVER: to get really useful results, you need good mathematical intuitions about certain general machine learning principles, as well as the inner workings of the individual algorithms.


Map of Mathematics
Source: https://www.youtube.com/watch?v=OmJ-4B-mS-Y

Foundations and Pillars of Machine Learning
Source: Mathematics for Machine Learning, book by A. Aldo Faisal, Cheng Soon Ong, and Marc Peter Deisenroth

Notation
• 𝑎, 𝑏, 𝑐  Scalar (integer or real)
• 𝐱, 𝐲, 𝐳  Vector (bold font, lower case)
• 𝐀, 𝐁, 𝐂  Matrix (bold font, upper case)
• A, B, C  Tensor (bold font, upper case, special font face)
• 𝑋, 𝑌, 𝑍  Random variable (normal font, upper case)
• 𝑎 ∈ 𝒜  Set membership: 𝑎 is a member of set 𝒜
• |𝒜|  Cardinality: number of items in set 𝒜
• ‖𝐯‖  Norm of vector 𝐯
• 𝐮 ⋅ 𝐯 or ⟨𝐮, 𝐯⟩  Dot product of vectors 𝐮 and 𝐯
• ℝ  Set of real numbers
• ℝⁿ  Real-number space of dimension n
• 𝑦 = 𝑓(𝑥) or 𝑥 ↦ 𝑓(𝑥)  Function (map): assigns a unique value 𝑓(𝑥) to each input value 𝑥
• 𝑓: ℝⁿ → ℝ  Function (map): maps an n-dimensional vector into a scalar
• 𝐀 ⊙ 𝐁  Element-wise product of matrices A and B
• 𝐀⁺  Pseudo-inverse of matrix A
• dⁿ𝑓/d𝑥ⁿ  n-th derivative of function f with respect to x
• ∇ₓ𝑓(𝐱)  Gradient of function f with respect to x
• 𝐇_𝑓  Hessian matrix of function f
• 𝑋 ~ 𝑃  Random variable 𝑋 has distribution 𝑃
• 𝑃(𝑋|𝑌)  Probability of 𝑋 given 𝑌
• 𝒩(𝜇, 𝜎²)  Gaussian distribution with mean 𝜇 and variance 𝜎²
• 𝔼_{𝑋~𝑃}[𝑓(𝑋)]  Expectation of 𝑓(𝑋) with respect to 𝑃(𝑋)
• Var(𝑓(𝑋))  Variance of 𝑓(𝑋)
• Cov(𝑓(𝑋), 𝑔(𝑌))  Covariance of 𝑓(𝑋) and 𝑔(𝑌)
• corr(𝑋, 𝑌)  Correlation coefficient for 𝑋 and 𝑌
• D_KL(𝑃‖𝑄)  Kullback–Leibler divergence for distributions 𝑃 and 𝑄
• CE(𝑃, 𝑄)  Cross-entropy for distributions 𝑃 and 𝑄

Linear Algebra
• Example system of linear equations:
  2a + 3b = 8
  10a + 1b = 13
• In matrix form (constant matrix × variable vector = output vector):
  [ 2  3] [a]   [ 8]
  [10  1] [b] = [13]

Vectors
• Vector definition
  – Computer science: a vector is a one-dimensional array of ordered real-valued scalars
  – Mathematics: a vector is a quantity possessing both magnitude and direction, represented by an arrow indicating the direction, and the length of which is proportional to the magnitude
• Vectors are written in column form or in row form
  – Denoted by bold-font lower-case letters
  – E.g., 𝐱 = [1, 7, 0, 1]ᵀ, or in general 𝐱 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]ᵀ
• For a general vector with 𝑛 elements, the vector lies in the 𝑛-dimensional space, 𝐱 ∈ ℝⁿ
• Types of vectors:
  – Geometric
  – Polynomials


Geometry of Vectors
• First interpretation of a vector: a point in space
  – E.g., in 2D we can visualize the data points with respect to a coordinate origin
• Second interpretation of a vector: a direction in space
  – E.g., the vector 𝐯 = [3, 2]ᵀ has a direction of 3 steps to the right and 2 steps up
  – The notation 𝐯⃗ is sometimes used to indicate that a vector has a direction
  – All vectors in the figure have the same direction
• Vector addition
  – We add the coordinates, and follow the directions given by the two vectors that are added
• The geometric interpretation of vectors as points in space allows us to consider a training set of input examples in ML as a collection of points in space
  – Hence, classification can be viewed as discovering how to separate two clusters of points belonging to different classes (left picture), rather than, for example, distinguishing images containing cars, planes, or buildings
  – It can also help to visualize zero-centering and normalization of training data (right picture)
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html#geometry-of-vectors

Dot Product and Angles
• Dot product of vectors: 𝐮 ⋅ 𝐯 = 𝐮ᵀ𝐯 = Σᵢ 𝑢ᵢ𝑣ᵢ
  – It is also referred to as the inner product, or scalar product, of vectors
  – The dot product 𝐮 ⋅ 𝐯 is also often denoted by ⟨𝐮, 𝐯⟩
• The dot product is a symmetric operation: 𝐮 ⋅ 𝐯 = 𝐮ᵀ𝐯 = 𝐯ᵀ𝐮 = 𝐯 ⋅ 𝐮
• Geometric interpretation of a dot product: the angle between two vectors
  – 𝐮 ⋅ 𝐯 = ‖𝐮‖ ‖𝐯‖ cos 𝜃, i.e., the dot product over the norms of the vectors is cos 𝜃:
    cos 𝜃 = (𝐮 ⋅ 𝐯) / (‖𝐮‖ ‖𝐯‖)
• If two vectors are orthogonal, 𝜃 = 90°, i.e., cos(𝜃) = 0, then 𝐮 ⋅ 𝐯 = 0
• In ML, the term cos 𝜃 = (𝐮ᵀ𝐯) / (‖𝐮‖ ‖𝐯‖) is sometimes employed as a measure of closeness of two vectors/data instances, and is referred to as cosine similarity

Norm / Magnitude of a Vector
• A vector norm is a function that maps a vector to a scalar value
  – The norm is a measure of the size / length of the vector
• A norm 𝑓 should satisfy the following properties:
  – Scaling: 𝑓(𝛼𝐱) = |𝛼| 𝑓(𝐱)
  – Triangle inequality: 𝑓(𝐱 + 𝐲) ≤ 𝑓(𝐱) + 𝑓(𝐲)
  – Non-negativity: 𝑓(𝐱) ≥ 0
• The general ℓₚ norm of a vector 𝐱 is obtained as:
  ‖𝐱‖ₚ = (Σᵢ₌₁ⁿ |𝑥ᵢ|ᵖ)^(1/𝑝)
  – The most common norms are obtained for 𝑝 = 1, 2, and ∞
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html#geometry-of-vectors
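As an illustrative sketch (NumPy code is my addition, not part of the original slides), the dot product, vector norms, and cosine similarity above can be computed as:

```python
import numpy as np

u = np.array([3.0, 2.0])
v = np.array([1.0, 4.0])

# Dot product: u . v = sum_i u_i * v_i
dot = np.dot(u, v)            # 3*1 + 2*4 = 11

# l2 norms (vector lengths)
norm_u = np.linalg.norm(u)    # sqrt(13)
norm_v = np.linalg.norm(v)    # sqrt(17)

# Cosine similarity: cos(theta) = (u . v) / (||u|| ||v||)
cos_theta = dot / (norm_u * norm_v)
theta = np.arccos(cos_theta)  # angle between u and v, in radians

# l1, l2, and infinity norms of a vector
x = np.array([1.0, -7.0, 0.0, 1.0])
l1 = np.linalg.norm(x, 1)         # sum of absolute values = 9
l2 = np.linalg.norm(x, 2)         # sqrt(51)
linf = np.linalg.norm(x, np.inf)  # largest absolute value = 7
```

Cosine similarity close to 1 indicates nearly aligned data instances; close to 0 indicates near-orthogonal ones.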

Vector Projection
• Orthogonal projection of a vector 𝐲 onto a vector 𝐱
  – The projection can take place in any space of dimensionality ≥ 2
  – The unit vector in the direction of 𝐱 is 𝐱 / ‖𝐱‖
    • A unit vector has norm equal to 1
  – The length of the projection of 𝐲 onto 𝐱 is ‖𝐲‖ cos 𝜃
  – The orthogonal projection is the vector
    proj_𝐱(𝐲) = (𝐱 ‖𝐲‖ cos 𝜃) / ‖𝐱‖

Hyperplanes
• A hyperplane is a subspace whose dimension is one less than that of its ambient space
  – In a 2D space, a hyperplane is a straight line (i.e., 1D)
  – In 3D, a hyperplane is a plane (i.e., 2D)
  – In a d-dimensional vector space, a hyperplane has 𝑑 − 1 dimensions, and divides the space into two half-spaces
• A hyperplane is a generalization of the concept of a plane to high-dimensional space
• In ML, hyperplanes are decision boundaries used for linear classification
  – Data points falling on either side of the hyperplane are attributed to different classes
Slide credit: Jeff Howbert, Machine Learning Math Essentials
Picture from: https://kgpdag.wordpress.com/2015/08/12/svm-simplified/
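The orthogonal projection can be sketched numerically (NumPy example added for illustration, not from the original slides):

```python
import numpy as np

x = np.array([2.0, 0.0])   # vector projected onto
y = np.array([3.0, 3.0])   # vector being projected

# Orthogonal projection of y onto x: proj_x(y) = (x.y / x.x) * x
proj = (np.dot(x, y) / np.dot(x, x)) * x

# Its length equals ||y|| cos(theta)
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
length = np.linalg.norm(y) * cos_theta   # = 3.0 here

# The residual y - proj is orthogonal to x (their dot product is zero)
residual = y - proj
```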


Hyperplanes
• For example, for a given vector 𝐰 = [2, 1]ᵀ, we can use the dot product to find the hyperplane for which 𝐰 ⋅ 𝐯 = 1
  – I.e., all vectors with 𝐰 ⋅ 𝐯 > 1 can be classified as one class, and all vectors with 𝐰 ⋅ 𝐯 < 1 can be classified as another class
• Solving 𝐰 ⋅ 𝐯 = 1, the solution is the set of points that lie on the line orthogonal to the vector 𝐰
  – That is the line 2𝑥 + 𝑦 = 1
  – The orthogonal projection of such a 𝐯 onto 𝐰 has length ‖𝐯‖ cos 𝜃 = 1 / ‖𝐰‖
• In a 3D space, if we have a vector 𝐰 = [1, 2, 3]ᵀ and try to find all points that satisfy 𝐰 ⋅ 𝐯 = 1, we obtain a plane that is orthogonal to the vector 𝐰
  – The inequalities 𝐰 ⋅ 𝐯 > 1 and 𝐰 ⋅ 𝐯 < 1 again define the two subspaces that are created by the plane
• The same concept applies to higher-dimensional spaces as well
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html#hyperplanes
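A minimal sketch of hyperplane-based classification with 𝐰 = [2, 1]ᵀ (NumPy code added for illustration; the threshold rule for points exactly on the boundary is a choice, not prescribed by the slides):

```python
import numpy as np

w = np.array([2.0, 1.0])           # normal vector of the hyperplane w.v = 1
points = np.array([[1.0, 1.0],     # w.v = 3 -> one side
                   [0.0, 1.0],     # w.v = 1 -> exactly on the line 2x + y = 1
                   [0.0, 0.0]])    # w.v = 0 -> other side

scores = points @ w                # dot product of each point with w
# Assign class +1 where w.v > 1, else -1 (boundary points fall in class -1 here)
labels = np.where(scores > 1.0, 1, -1)
```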

Matrices
• A matrix is a rectangular array of real-valued scalars arranged in m horizontal rows and n vertical columns
  – It can also be viewed as a tuple of vectors
  – Each element 𝑎ᵢⱼ belongs to the ith row and jth column
  – The elements are denoted 𝑎ᵢⱼ, 𝐀ᵢⱼ, 𝐀[𝑖,𝑗], or 𝐀(𝑖,𝑗)
• For a matrix 𝐀 ∈ ℝ^(𝑚×𝑛), the size (dimension) is 𝑚×𝑛 or (𝑚, 𝑛)
  – Matrices are denoted by bold-font upper-case letters

Matrix operations
• Addition or subtraction: (A ± B)ᵢⱼ = Aᵢⱼ ± Bᵢⱼ
• Scalar multiplication: (cA)ᵢⱼ = c ⋅ Aᵢⱼ
• Matrix multiplication: (AB)ᵢⱼ = Aᵢ₁B₁ⱼ + Aᵢ₂B₂ⱼ + ⋯ + AᵢₙBₙⱼ
  – Defined only if the number of columns of the left matrix is the same as the number of rows of the right matrix
  – Note that in general 𝐀𝐁 ≠ 𝐁𝐀
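The basic matrix operations can be sketched as follows (NumPy example added for illustration, not part of the original slides):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

S = A + B      # element-wise addition
C = 2 * A      # scalar multiplication
P = A @ B      # matrix multiplication: rows of A dotted with columns of B
Q = B @ A      # note: matrix multiplication is not commutative, P != Q
H = A * B      # element-wise (Hadamard) product, not matrix multiplication
```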

Matrices (continued)
• Transpose of a matrix: 𝐀ᵀ has the rows and columns exchanged, (𝐀ᵀ)ᵢⱼ = 𝐀ⱼᵢ
  – Some properties:
    𝐀 + 𝐁 = 𝐁 + 𝐀          𝐀(𝐁 + 𝐂) = 𝐀𝐁 + 𝐀𝐂
    (𝐀 + 𝐁)ᵀ = 𝐀ᵀ + 𝐁ᵀ     𝐀(𝐁𝐂) = (𝐀𝐁)𝐂
    (𝐀ᵀ)ᵀ = 𝐀              (𝐀𝐁)ᵀ = 𝐁ᵀ𝐀ᵀ
• Square matrix: has the same number of rows and columns
• Identity matrix (Iₙ): has ones on the main diagonal and zeros elsewhere
  – E.g., the identity matrix of size 3×3:
    I₃ = [1 0 0; 0 1 0; 0 0 1]
• Determinant of a matrix, denoted by det(A) or |𝐀|, is a real-valued scalar encoding certain properties of the matrix
  – E.g., for a matrix of size 2×2: det([a b; c d]) = ad − bc
  – For larger matrices, the determinant is calculated by cofactor expansion:
    det(𝐀) = Σⱼ 𝑎ᵢⱼ (−1)^(𝑖+𝑗) det(𝐀₍ᵢ,ⱼ₎)
  – Here, 𝐀₍ᵢ,ⱼ₎ is a minor of the matrix, obtained by removing the row and column associated with the indices i and j
• Trace of a matrix is the sum of all diagonal elements: Tr(𝐀) = Σᵢ 𝑎ᵢᵢ
• A matrix for which 𝐀 = 𝐀ᵀ is called a symmetric matrix


Hadamard Product
• Element-wise multiplication of two matrices A and B is called the Hadamard product or element-wise product
  – The math notation is ⊙: (𝐀 ⊙ 𝐁)ᵢⱼ = 𝐀ᵢⱼ 𝐁ᵢⱼ

Matrix-Vector Products
• Consider a matrix 𝐀 ∈ ℝ^(𝑚×𝑛) and a vector 𝐱 ∈ ℝⁿ
• The matrix can be written in terms of its row vectors (e.g., 𝐚₁ᵀ is the first row)
• The matrix-vector product is a column vector of length m, whose ith element is the dot product 𝐚ᵢᵀ𝐱
• Note the sizes: 𝐀 (𝑚×𝑛) ⋅ 𝐱 (𝑛×1) = 𝐀𝐱 (𝑚×1)

Matrix-Matrix Products
• To multiply two matrices 𝐀 ∈ ℝ^(𝑛×𝑘) and 𝐁 ∈ ℝ^(𝑘×𝑚), we can consider the matrix-matrix product as dot products of rows in 𝐀 and columns in 𝐁
• Sizes: 𝐀 (𝑛×𝑘) ⋅ 𝐁 (𝑘×𝑚) = 𝐂 (𝑛×𝑚)

Linear Dependence
• Consider the following matrix:
  𝐁 = [2 −1; 4 −2]
• Notice that for the two columns 𝐛₁ = [2, 4]ᵀ and 𝐛₂ = [−1, −2]ᵀ, we can write 𝐛₁ = −2 ⋅ 𝐛₂
  – This means that the two columns are linearly dependent
  – In this case, a linear combination of the two vectors exists for which 𝐛₁ + 2 ⋅ 𝐛₂ = 𝟎
• A weighted sum 𝑎₁𝐛₁ + 𝑎₂𝐛₂ is referred to as a linear combination of the vectors 𝐛₁ and 𝐛₂
• A collection of vectors 𝐯₁, 𝐯₂, …, 𝐯ₖ is linearly dependent if there exist coefficients 𝑎₁, 𝑎₂, …, 𝑎ₖ, not all equal to zero, such that 𝑎₁𝐯₁ + 𝑎₂𝐯₂ + ⋯ + 𝑎ₖ𝐯ₖ = 𝟎
• If there is no such linear dependence, the vectors are linearly independent

Matrix Rank
• For an 𝑛×𝑚 matrix, the rank of the matrix is the largest number of linearly independent columns
• The matrix B from the previous example has rank(𝐁) = 1, since its two columns are linearly dependent
  𝐁 = [2 −1; 4 −2]
• The matrix C below has rank(𝐂) = 2, since it has two linearly independent columns, and the remaining columns are linear combinations of them (e.g., 𝐜₄ = −1 ⋅ 𝐜₁ and 𝐜₅ = −1 ⋅ 𝐜₃)
  𝐂 = [ 1  3  0 −1  0;
       −1  0  1  1 −1;
        0 −3  1  0 −1;
        2  3 −1 −2  1]

Inverse of a Matrix
• For a square 𝑛×𝑛 matrix A with rank 𝑛, 𝐀⁻¹ is its inverse matrix if their product is the identity matrix I
• Properties of inverse matrices:
  (𝐀⁻¹)⁻¹ = 𝐀
  (𝐀𝐁)⁻¹ = 𝐁⁻¹𝐀⁻¹
• If det(𝐀) = 0 (i.e., rank(𝐀) < 𝑛), then the inverse does not exist
  – A matrix that is not invertible is called a singular matrix
• Note that finding the inverse of a large matrix is computationally expensive
  – In addition, it can lead to numerical instability
• If the inverse of a matrix is equal to its transpose (𝐀⁻¹ = 𝐀ᵀ), the matrix is said to be an orthogonal matrix
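Rank, determinant, and inverse can be checked numerically (NumPy example added for illustration, not part of the original slides):

```python
import numpy as np

# B has linearly dependent columns (b1 = -2 * b2), so its rank is 1
B = np.array([[2.0, -1.0],
              [4.0, -2.0]])
rank_B = np.linalg.matrix_rank(B)   # 1

# det(B) = 2*(-2) - (-1)*4 = 0, so B is singular and has no inverse
det_B = np.linalg.det(B)

# A full-rank square matrix does have an inverse
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
A_inv = np.linalg.inv(A)
I = A @ A_inv                       # identity matrix, up to rounding error
```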


Pseudo-Inverse of a Matrix
• Also known as the Moore–Penrose pseudo-inverse
• For matrices that are not square, the inverse does not exist
  – Therefore, a pseudo-inverse is used
• For 𝐀 ∈ ℝ^(𝑚×𝑛): if 𝑚 > 𝑛, the pseudo-inverse is 𝐀⁺ = (𝐀ᵀ𝐀)⁻¹𝐀ᵀ, and 𝐀⁺𝐀 = 𝐈
• If 𝑚 < 𝑛, the pseudo-inverse is 𝐀⁺ = 𝐀ᵀ(𝐀𝐀ᵀ)⁻¹, and 𝐀𝐀⁺ = 𝐈
  – E.g., for a matrix 𝐗 of dimension 2×4, a pseudo-inverse 𝐗⁺ of size 4×2 can be found, so that 𝐗(2×4) 𝐗⁺(4×2) = 𝐈(2×2)

Tensors
• Tensors are n-dimensional arrays of scalars
  – Vectors are first-order tensors, 𝐯 ∈ ℝⁿ
  – Matrices are second-order tensors, 𝐀 ∈ ℝ^(𝑚×𝑛)
  – E.g., a fourth-order tensor is T ∈ ℝ^(𝑛₁×𝑛₂×𝑛₃×𝑛₄)
• Tensors are denoted with upper-case letters of a special font face (e.g., X, Y, Z)
• RGB images are third-order tensors, i.e., they are 3-dimensional arrays
  – The 3 axes correspond to width, height, and channel
  – E.g., 224 × 224 × 3
  – The channel axis corresponds to the color channels (red, green, and blue)
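The pseudo-inverse formulas above can be verified numerically (NumPy example added for illustration, not part of the original slides):

```python
import numpy as np

# A tall (m > n) matrix: no ordinary inverse exists
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])            # shape (3, 2), so m > n

# Moore-Penrose pseudo-inverse via NumPy
A_pinv = np.linalg.pinv(A)            # shape (2, 3)

# For m > n and full column rank, A+ = (A^T A)^-1 A^T and A+ A = I
A_pinv_explicit = np.linalg.inv(A.T @ A) @ A.T
left_identity = A_pinv @ A            # 2x2 identity, up to rounding
```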

Manifolds
• Earlier we learned that hyperplanes generalize the concept of planes to high-dimensional spaces
  – Similarly, manifolds can be informally imagined as a generalization of the concept of surfaces to high-dimensional spaces
• To begin with an intuitive explanation, the surface of the Earth is an example of a two-dimensional manifold embedded in a three-dimensional space
  – This is true because the Earth looks locally flat, so on a small scale it is like a 2-D plane
  – However, if we keep walking on the Earth in one direction, we will eventually end up back where we started
  – This means that Earth is not really flat; it only looks locally like a Euclidean plane, but at large scales it folds up on itself, and has a different global structure than a flat plane
• Manifolds are studied in mathematics under topological spaces
• An n-dimensional manifold is defined as a topological space with the property that each point has a neighborhood that is homeomorphic to the Euclidean space of dimension n
  – This means that a manifold locally resembles Euclidean space near each point
  – Informally, a Euclidean space is locally smooth: it does not have holes, edges, or other sudden changes, and it does not have intersecting neighborhoods
  – Although manifolds can have a very complex structure on a large scale, the resemblance to Euclidean space on a small scale allows us to apply standard math concepts
• Examples of 2-dimensional manifolds are shown in the figure
  – The surfaces in the figure have been conveniently cut up into little rectangles that were glued together
  – Those small rectangles locally look like flat Euclidean planes
• Examples of one-dimensional manifolds
  – Upper figure: a circle is a 1-D manifold embedded in 2-D, where each arc of the circle locally resembles a line segment
  – Lower figures: other examples of 1-D manifolds
  – Note that a figure-eight is not a manifold, because it has an intersection point (it is not locally Euclidean there)
• It is hypothesized that in the real world, high-dimensional data (such as images) lie on low-dimensional manifolds embedded in the high-dimensional space
  – E.g., in ML, assume we have a training set of images of size 224×224×3 pixels
  – Learning an arbitrary function in such a high-dimensional space would be intractable
  – Despite that, all images of the same class ("cats", for example) might lie on a low-dimensional manifold
  – This is what makes function learning and image classification feasible
• Example:
  – The data points have 3 dimensions (left figure), i.e., the input space of the data is 3-dimensional
  – The data points lie on a 2-dimensional manifold, shown in the right figure
  – Most ML algorithms extract lower-dimensional data features that make it possible to distinguish between various classes of high-dimensional input data
  – The low-dimensional representations of the input data are called embeddings
Picture from: http://bjlkeng.github.io/posts/manifolds/


Eigen Decomposition
• Eigen decomposition is decomposing a matrix into a set of eigenvalues and eigenvectors
• Eigenvalues of a square matrix 𝐀 are scalars 𝜆, and eigenvectors are non-zero vectors 𝐯, that satisfy
  𝐀𝐯 = 𝜆𝐯
• Eigenvalues are found by solving the following equation:
  det(𝐀 − 𝜆𝐈) = 0
• If a matrix 𝐀 has n linearly independent eigenvectors 𝐯₁, …, 𝐯ₙ with corresponding eigenvalues 𝜆₁, …, 𝜆ₙ, the eigen decomposition of 𝐀 is given by
  𝐀 = 𝐕𝚲𝐕⁻¹
  – The columns of the matrix 𝐕 are the eigenvectors, i.e., 𝐕 = [𝐯₁, …, 𝐯ₙ]
  – 𝚲 is a diagonal matrix of the eigenvalues, i.e., 𝚲 = diag(𝜆₁, …, 𝜆ₙ)
• To find the inverse of the matrix A, we can use 𝐀⁻¹ = 𝐕𝚲⁻¹𝐕⁻¹
  – This involves simply finding the inverse 𝚲⁻¹ of a diagonal matrix
• Decomposing a matrix into eigenvalues and eigenvectors allows us to analyze certain properties of the matrix
  – If all eigenvalues are positive, the matrix is positive definite
  – If all eigenvalues are positive or zero-valued, the matrix is positive semidefinite
  – If all eigenvalues are negative or zero-valued, the matrix is negative semidefinite
• Positive semidefinite matrices are interesting because they guarantee that ∀𝐱, 𝐱ᵀ𝐀𝐱 ≥ 0
• Eigen decomposition can also simplify many linear-algebraic computations
  – The determinant of A can be calculated as det(𝐀) = 𝜆₁ ⋅ 𝜆₂ ⋯ 𝜆ₙ
  – If any of the eigenvalues are zero, the matrix is singular (it does not have an inverse)
• However, not every matrix can be decomposed into eigenvalues and eigenvectors
  – Also, in some cases the decomposition may involve complex numbers
  – Still, every real symmetric matrix is guaranteed to have an eigen decomposition 𝐀 = 𝐕𝚲𝐕⁻¹, where 𝐕 is an orthogonal matrix
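A short numerical sketch of the decomposition 𝐀 = 𝐕𝚲𝐕⁻¹ (NumPy example added for illustration, not part of the original slides):

```python
import numpy as np

# A real symmetric matrix, which always has a real eigen decomposition
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, V = np.linalg.eig(A)   # columns of V are the eigenvectors
L = np.diag(eigvals)            # diagonal matrix of eigenvalues

# Reconstruct A from its eigen decomposition: A = V L V^-1
A_rec = V @ L @ np.linalg.inv(V)

# det(A) equals the product of the eigenvalues
det_from_eigs = np.prod(eigvals)

# Each eigenpair satisfies A v = lambda v
v0 = V[:, 0]
```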

Eigen Decomposition: Geometric Interpretation
• The geometric interpretation of the eigenvalues and eigenvectors is that they allow us to stretch the space in specific directions
  – Left figure: the two eigenvectors 𝐯₁ and 𝐯₂ are shown for a matrix, where the two vectors are unit vectors (i.e., they have a length of 1)
  – Right figure: the vectors 𝐯₁ and 𝐯₂ are multiplied by the eigenvalues 𝜆₁ and 𝜆₂
  – We can see how the space is scaled in the direction of the larger eigenvalue 𝜆₁
• E.g., this is used for dimensionality reduction with PCA (principal component analysis), where the eigenvectors corresponding to the largest eigenvalues are used for extracting the most important data dimensions

Singular Value Decomposition
• Singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values
  – SVD is more generally applicable than eigen decomposition
  – Every real matrix has an SVD, but the same is not true of the eigen decomposition
    • E.g., if a matrix is not square, the eigen decomposition is not defined, and we must use SVD
• The SVD of an 𝑚×𝑛 matrix 𝐀 is given by
  𝐀 = 𝐔𝐃𝐕ᵀ
  – 𝐔 is an 𝑚×𝑚 matrix, 𝐃 is an 𝑚×𝑛 matrix, and 𝐕 is an 𝑛×𝑛 matrix
  – The elements along the diagonal of 𝐃 are known as the singular values of A
  – The columns of 𝐔 are known as the left-singular vectors
  – The columns of 𝐕 are known as the right-singular vectors
• For a non-square matrix 𝐀, the squares of the singular values 𝜎ᵢ are the eigenvalues 𝜆ᵢ of 𝐀ᵀ𝐀, i.e., 𝜎ᵢ² = 𝜆ᵢ for 𝑖 = 1, 2, …, 𝑛
• Applications of SVD include computing the pseudo-inverse of non-square matrices, matrix approximation, and determining the rank of a matrix
Picture from: Goodfellow (2017) – Deep Learning
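The SVD of a non-square matrix, and the link between singular values and the eigenvalues of 𝐀ᵀ𝐀, can be sketched as follows (NumPy example added for illustration, not part of the original slides):

```python
import numpy as np

# SVD applies to non-square matrices, where eigen decomposition is undefined
A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])     # shape (2, 3)

U, s, Vt = np.linalg.svd(A)          # s holds the singular values

# Reconstruct A = U D V^T, with D an m x n matrix carrying s on its diagonal
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)
A_rec = U @ D @ Vt

# Squares of the singular values are eigenvalues of A^T A
eigs = np.linalg.eigvalsh(A.T @ A)   # ascending order; includes a zero eigenvalue
```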

Matrix Norms
• Frobenius norm: the square root of the summed squares of the elements of matrix 𝐗
  ‖𝐗‖_F = sqrt(Σᵢ₌₁ᵐ Σⱼ₌₁ⁿ 𝑥ᵢⱼ²)
  – This norm is similar to the Euclidean norm of a vector
• Spectral norm: the largest singular value of matrix 𝐗
  ‖𝐗‖₂ = 𝜎_max(𝐗)
  – The singular values of 𝐗 are 𝜎₁, 𝜎₂, …, 𝜎ₘ
• 𝐿₂,₁ norm: the sum of the Euclidean norms of the columns of matrix 𝐗
  ‖𝐗‖₂,₁ = Σⱼ₌₁ⁿ sqrt(Σᵢ₌₁ᵐ 𝑥ᵢⱼ²)
• Max norm: the largest element (in absolute value) of matrix X
  ‖𝐗‖_max = maxᵢ,ⱼ |𝑥ᵢⱼ|

Differential Calculus
• For a function 𝑓: ℝ → ℝ, the derivative of f is defined as
  𝑓′(𝑥) = lim_{ℎ→0} (𝑓(𝑥 + ℎ) − 𝑓(𝑥)) / ℎ
• If 𝑓′(𝑎) exists, f is said to be differentiable at a
• If f is differentiable at every 𝑐 ∈ [𝑎, 𝑏], then f is differentiable on this interval
  – We can also interpret the derivative 𝑓′(𝑥) as the instantaneous rate of change of 𝑓(𝑥) with respect to x
  – I.e., for a small change in x, what is the rate of change of 𝑓(𝑥)
• Given 𝑦 = 𝑓(𝑥), where x is an independent variable and y is a dependent variable, the following expressions are equivalent:
  𝑓′(𝑥) = 𝑓′ = 𝑑𝑦/𝑑𝑥 = 𝑑𝑓/𝑑𝑥 = (𝑑/𝑑𝑥)𝑓(𝑥) = 𝐷𝑓(𝑥) = 𝐷ₓ𝑓(𝑥)
• The symbols 𝑑/𝑑𝑥, D, and 𝐷ₓ are differentiation operators that indicate the operation of differentiation
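Both the matrix norms and the limit definition of the derivative can be checked numerically (NumPy example added for illustration, not part of the original slides):

```python
import numpy as np

# Frobenius and spectral norms of a matrix
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.linalg.norm(X, 'fro')       # sqrt(1 + 4 + 9 + 16) = sqrt(30)
spectral = np.linalg.norm(X, 2)      # largest singular value of X
max_norm = np.max(np.abs(X))         # largest absolute element = 4

# Derivative as a limit: f'(x) ~ (f(x + h) - f(x)) / h for small h
f = lambda x: x**2
x0, h = 3.0, 1e-6
approx = (f(x0 + h) - f(x0)) / h     # close to the exact derivative 2*x0 = 6
```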


Differentiation Rules and Higher-Order Derivatives
• Standard rules (such as the sum, product, quotient, and chain rules, together with the derivatives of elementary functions) are used for computing the derivatives of explicit functions
• The derivative of the first derivative of a function 𝑓(𝑥) is the second derivative of 𝑓(𝑥):
  𝑑²𝑓/𝑑𝑥² = (𝑑/𝑑𝑥)(𝑑𝑓/𝑑𝑥)
• The second derivative quantifies how the rate of change of 𝑓(𝑥) is itself changing
  – E.g., in physics, if the function describes the displacement of an object, the first derivative gives the velocity of the object (i.e., the rate of change of the position)
  – The second derivative gives the acceleration of the object (i.e., the rate of change of the velocity)
• If we apply the differentiation operation n times, we obtain the n-th derivative of 𝑓(𝑥):
  𝑓⁽ⁿ⁾(𝑥) = 𝑑ⁿ𝑓/𝑑𝑥ⁿ = (𝑑/𝑑𝑥)ⁿ 𝑓(𝑥)

Taylor Series
• The Taylor series provides a method to approximate a function 𝑓(𝑥) at a point 𝑥₀ if we know its derivatives 𝑓(𝑥₀), 𝑓′(𝑥₀), 𝑓″(𝑥₀), …, 𝑓⁽ⁿ⁾(𝑥₀)
• For instance, for 𝑛 = 2, the second-order approximation of a function 𝑓(𝑥) is
  𝑓(𝑥) ≈ 𝑓(𝑥₀) + 𝑓′(𝑥₀)(𝑥 − 𝑥₀) + (1/2) 𝑓″(𝑥₀)(𝑥 − 𝑥₀)²
• Similarly, the approximation of 𝑓(𝑥) with a Taylor polynomial of degree n is
  𝑓(𝑥) ≈ Σᵢ₌₀ⁿ (1/𝑖!) 𝑓⁽ⁱ⁾(𝑥₀)(𝑥 − 𝑥₀)ⁱ
• For example, the figure shows the first-order, second-order, and fifth-order polynomial approximations of the exponential function 𝑓(𝑥) = 𝑒ˣ at the point 𝑥₀ = 0

Geometric Interpretation
• To provide a geometric interpretation of the derivative, consider the first-order Taylor series approximation of 𝑓(𝑥) at 𝑥 = 𝑥₀:
  𝑓(𝑥) ≈ 𝑓(𝑥₀) + 𝑓′(𝑥₀)(𝑥 − 𝑥₀)
• The expression approximates the function 𝑓(𝑥) by a line which passes through the point (𝑥₀, 𝑓(𝑥₀)) and has slope 𝑓′(𝑥₀) (i.e., the value of 𝑑𝑓/𝑑𝑥 at the point 𝑥₀)
• Therefore, the first derivative of a function is also the slope of the tangent line to the curve of the function
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/single-variable-calculus.html
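The exponential-function example above can be reproduced numerically: since all derivatives of 𝑒ˣ at 0 equal 1, its degree-n Taylor polynomial is Σᵢ xⁱ/i! (Python code added for illustration, not part of the original slides):

```python
import math

# n-degree Taylor polynomial of f(x) = e^x around x0 = 0
def taylor_exp(x, n):
    return sum(x**i / math.factorial(i) for i in range(n + 1))

x = 0.5
first_order = taylor_exp(x, 1)    # 1 + 0.5 = 1.5
second_order = taylor_exp(x, 2)   # 1 + 0.5 + 0.125 = 1.625
fifth_order = taylor_exp(x, 5)    # much closer to the true value
exact = math.exp(x)               # ~1.6487
```

Higher-degree polynomials give progressively better approximations near x0.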

Partial Derivatives
• So far, we looked at functions of a single variable, where 𝑓: ℝ → ℝ
• Functions that depend on many variables are called multivariate functions
• Let 𝑦 = 𝑓(𝐱) = 𝑓(𝑥₁, 𝑥₂, …, 𝑥ₙ) be a multivariate function with n variables
  – The input is an n-dimensional vector 𝐱 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]ᵀ and the output is a scalar y
  – The mapping is 𝑓: ℝⁿ → ℝ
• The partial derivative of y with respect to its ith parameter 𝑥ᵢ is
  ∂𝑦/∂𝑥ᵢ = lim_{ℎ→0} [𝑓(𝑥₁, …, 𝑥ᵢ + ℎ, …, 𝑥ₙ) − 𝑓(𝑥₁, …, 𝑥ᵢ, …, 𝑥ₙ)] / ℎ
• To calculate ∂𝑦/∂𝑥ᵢ (∂ is pronounced "del", or we can just say "partial derivative"), we can treat 𝑥₁, …, 𝑥ᵢ₋₁, 𝑥ᵢ₊₁, …, 𝑥ₙ as constants and calculate the derivative of y only with respect to 𝑥ᵢ
• For notation of partial derivatives, the following are equivalent:
  ∂𝑦/∂𝑥ᵢ = ∂𝑓/∂𝑥ᵢ = (∂/∂𝑥ᵢ)𝑓(𝐱) = 𝑓ₓᵢ = 𝑓ᵢ = 𝐷ᵢ𝑓 = 𝐷ₓᵢ𝑓

Gradient
• We can concatenate the partial derivatives of a multivariate function with respect to all its input variables to obtain the gradient vector of the function
• The gradient of the multivariate function 𝑓(𝐱) with respect to the n-dimensional input vector 𝐱 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]ᵀ is a vector of n partial derivatives:
  ∇ₓ𝑓(𝐱) = [∂𝑓(𝐱)/∂𝑥₁, ∂𝑓(𝐱)/∂𝑥₂, …, ∂𝑓(𝐱)/∂𝑥ₙ]ᵀ
• When there is no ambiguity, the notations ∇𝑓(𝐱) or ∇ₓ𝑓 are often used for the gradient instead of ∇ₓ𝑓(𝐱)
  – The symbol for the gradient is the Greek letter ∇ (pronounced "nabla"), although ∇ₓ𝑓(𝐱) is more often pronounced "gradient of f with respect to x"
• In ML, the gradient descent algorithm relies on the direction opposite the gradient of the loss function ℒ with respect to the model parameters 𝜃, i.e., −∇_𝜃ℒ, for minimizing the loss function
  – Adversarial examples can be created by adding a perturbation in the direction of the gradient of the loss ℒ with respect to the input examples 𝑥, i.e., ∇ₓℒ, for maximizing the loss function
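As a minimal sketch of gradient descent on a simple quadratic function (NumPy example added for illustration; the function, learning rate, and step count are my choices, not from the slides):

```python
import numpy as np

# f(x) = x1^2 + 2*x2^2, with gradient grad f = [2*x1, 4*x2]
def grad(x):
    return np.array([2.0 * x[0], 4.0 * x[1]])

x = np.array([3.0, -2.0])    # starting point
lr = 0.1                     # learning rate
for _ in range(200):
    x = x - lr * grad(x)     # step in the direction opposite the gradient

# x converges toward the unique minimizer [0, 0]
```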


Hessian Matrix
• To calculate the second-order partial derivatives of a multivariate function, we need to calculate the derivatives for all combinations of input variables
• That is, for a function 𝑓(𝐱) with an n-dimensional input vector 𝐱 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]ᵀ, there are 𝑛² second partial derivatives, for any choice of i and j:
  ∂²𝑓/(∂𝑥ᵢ∂𝑥ⱼ) = (∂/∂𝑥ᵢ)(∂𝑓/∂𝑥ⱼ)
• The second partial derivatives are assembled in a matrix called the Hessian:
        [ ∂²𝑓/∂𝑥₁∂𝑥₁  …  ∂²𝑓/∂𝑥₁∂𝑥ₙ ]
  𝐇_𝑓 = [      ⋮      ⋱       ⋮      ]
        [ ∂²𝑓/∂𝑥ₙ∂𝑥₁  …  ∂²𝑓/∂𝑥ₙ∂𝑥ₙ ]
• Computing and storing the Hessian matrix for functions with high-dimensional inputs can be computationally prohibitive
  – E.g., the loss function for a ResNet50 model with approximately 23 million parameters has a Hessian of 23M × 23M ≈ 529T (trillion) entries

Jacobian Matrix
• The concept of derivatives can be further generalized to vector-valued functions (or vector fields) 𝑓: ℝⁿ → ℝᵐ
• For an n-dimensional input vector 𝐱 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]ᵀ ∈ ℝⁿ, the vector of functions is given as
  𝐟(𝐱) = [𝑓₁(𝐱), 𝑓₂(𝐱), …, 𝑓ₘ(𝐱)]ᵀ ∈ ℝᵐ
• The matrix of first-order partial derivatives of the vector-valued function 𝐟(𝐱) is an 𝑚×𝑛 matrix called the Jacobian:
      [ ∂𝑓₁(𝐱)/∂𝑥₁  …  ∂𝑓₁(𝐱)/∂𝑥ₙ ]
  𝐉 = [     ⋮       ⋱       ⋮     ]
      [ ∂𝑓ₘ(𝐱)/∂𝑥₁  …  ∂𝑓ₘ(𝐱)/∂𝑥ₙ ]
  – For example, in robotics, a robot's Jacobian matrix gives the partial derivatives of the translational and angular velocities of the robot end-effector with respect to the joint (i.e., axis) velocities
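The Jacobian can be approximated column by column with finite differences (NumPy example added for illustration; the function 𝐟 is a made-up example, not from the slides):

```python
import numpy as np

# Vector-valued function f: R^2 -> R^2, f(x) = [x1 * x2, x1 + x2]
def f(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def jacobian(f, x, h=1e-6):
    """Finite-difference approximation of the m x n Jacobian J[i, j] = df_i/dx_j."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xh = x.copy()
        xh[j] += h                  # perturb only the j-th input
        J[:, j] = (f(xh) - fx) / h
    return J

x0 = np.array([2.0, 3.0])
J = jacobian(f, x0)   # analytic Jacobian at x0 is [[x2, x1], [1, 1]] = [[3, 2], [1, 1]]
```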

Integral Calculus
• For a function 𝑓(𝑥) defined on the domain [𝑎, 𝑏], the definite integral of the function is denoted
  ∫ₐᵇ 𝑓(𝑥) 𝑑𝑥
• The geometric interpretation of the integral is the area between the horizontal axis and the graph of 𝑓(𝑥) between the points a and b
  – In this figure, the integral is the sum of the blue areas (where 𝑓(𝑥) > 0) minus the pink area (where 𝑓(𝑥) < 0)

Optimization
• Optimization is concerned with optimizing an objective function: finding the value of an argument that minimizes or maximizes the function
  – Most optimization algorithms are formulated in terms of minimizing a function 𝑓(𝑥)
  – Maximization is accomplished via minimizing the negative of an objective function (e.g., minimize −𝑓(𝑥))
  – In minimization problems, the objective function is often referred to as a cost function, loss function, or error function
• Optimization is very important for machine learning
  – The performance of optimization algorithms affects the model's training efficiency
• Most optimization problems in machine learning are nonconvex
  – Meaning that the loss function is not a convex function
  – Nonetheless, the design and analysis of algorithms for solving convex problems has been very instructive for advancing the field of machine learning
Picture from: https://mjo.osborne.economics.utoronto.ca/index.php/tutorial/index/1/clc/t

Optimization vs. Machine Learning
• Optimization and machine learning have related, but somewhat different goals
  – Goal in optimization: minimize an objective function
    • For a set of training examples, reduce the training error
  – Goal in ML: find a suitable model to predict on data examples
    • For a set of testing examples, reduce the generalization error
• For a given empirical function g (dashed purple curve), optimization algorithms attempt to find the point of minimum empirical risk
• The expected function f (blue curve) is obtained given a limited amount of training data examples
• ML algorithms attempt to find the point of minimum expected risk, based on minimizing the error on a set of testing examples
  – This point may be at a different location than the minimum over the training examples, and it may not be minimal in a formal sense

Stationary Points
• Stationary points (or critical points) of a differentiable function 𝑓(𝑥) of one variable are the points where the derivative of the function is zero, i.e., 𝑓′(𝑥) = 0
• The stationary points can be:
  – A minimum: a point where the derivative changes from negative to positive
  – A maximum: a point where the derivative changes from positive to negative
  – A saddle point: the derivative has the same sign on both sides of the point
• The minimum and maximum points are collectively known as extremum points
• The nature of a stationary point can be determined from the second derivative of 𝑓(𝑥) at the point:
  – If 𝑓″(𝑥) > 0, the point is a minimum
  – If 𝑓″(𝑥) < 0, the point is a maximum
  – If 𝑓″(𝑥) = 0, the test is inconclusive; the point can be a saddle point, but it may not be
• The same concept also applies to gradients of multivariate functions
Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html


Local Minima Saddle Points


• Among the challenges in optimization of model’s parameters in ML involve local minima, • The gradient of a function 𝑓(𝑥) at a saddle point is 0, but the point is not a minimum or
saddle points, vanishing gradients maximum point
• For an objective function 𝑓(𝑥), if the value at a point x is the minimum of the objective function – The optimization algorithms may stall at saddle points, without reaching a minima
over the entire domain of x, then it is the global minimum • Note also that the point of a function at which the sign of the curvature changes is called an
• If the value of 𝑓(𝑥) at x is smaller than the values of the objective function at any other points in inflection point
the vicinity of x, then it is the local minimum – An inflection point (𝑓′′ 𝑥 = 0) can also be a saddle point, but it does not have to be
• For the 2D function (right figure), the saddle point is at (0,0)
§ The objective functions in ML usually have many local minima
o When the solution of the optimization algorithm is near the local minimum, the gradient of the loss function approaches or becomes zero (vanishing gradients)
o Therefore, the obtained solution in the final iteration can be a local minimum, rather than the global minimum
– The point looks like a saddle (see the saddle point in the right figure), and gives the minimum with respect to x, and the maximum with respect to y

Optimization Optimization x
Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html

Convex Optimization Convex Functions


• A function of a single variable is concave if every line segment joining two points on its graph does not lie above the graph at any point
• Symmetrically, a function of a single variable is convex if every line segment joining two points on its graph does not lie below the graph at any point
• In mathematical terms, the function f is a convex function if for all points x₁, x₂ and for all λ ∈ [0,1]
λf(x₁) + (1 − λ)f(x₂) ≥ f(λx₁ + (1 − λ)x₂)

Optimization Optimization
Picture from: https://mjo.osborne.economics.utoronto.ca/index.php/tutorial/index/1/cv1/t

Convex Functions Convex Functions


• One important property of convex functions is that they do not have local minima
– Every local minimum of a convex function is a global minimum
– I.e., every point at which the gradient of a convex function = 0 is the global minimum
– The figure below illustrates two convex functions, and one nonconvex function (labels: convex, non-convex, convex)
• Another important property of convex functions is stated by Jensen’s inequality
• Namely, if we let α₁ = λ and α₂ = 1 − λ, the definition of a convex function becomes
α₁f(x₁) + α₂f(x₂) ≥ f(α₁x₁ + α₂x₂)
• The Danish mathematician Johan Jensen showed that this can be generalized to all αᵢ that are non-negative real numbers with ∑ᵢ αᵢ = 1, as follows:
α₁f(x₁) + α₂f(x₂) + ⋯ + αₙf(xₙ) ≥ f(α₁x₁ + α₂x₂ + ⋯ + αₙxₙ)
• This inequality is also identical to
𝔼ₓ[f(x)] ≥ f(𝔼ₓ[x])
– I.e., the expectation of a convex function is larger than the convex function of the expectation
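Jensen’s inequality is easy to verify numerically; a small sketch (standard library only, not from the slides) for the convex function f(x) = x²:

```python
import random

def jensen_gap(f, xs, alphas):
    """Return sum_i alpha_i f(x_i) - f(sum_i alpha_i x_i); non-negative when f is convex."""
    lhs = sum(a * f(x) for a, x in zip(alphas, xs))
    rhs = f(sum(a * x for a, x in zip(alphas, xs)))
    return lhs - rhs

random.seed(0)
f = lambda x: x ** 2                                  # a convex function
xs = [random.uniform(-5, 5) for _ in range(10)]       # arbitrary points
w = [random.random() for _ in range(10)]
alphas = [wi / sum(w) for wi in w]                    # non-negative weights summing to 1

assert jensen_gap(f, xs, alphas) >= 0                 # E[f(x)] >= f(E[x])
```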
Optimization Optimization
Picture from: http://d2l.ai/chapter_optimization/convexity.html
Convex Sets Derivatives and Convexity


• A set 𝒳 in a vector space is a convex set if for any a, b ∈ 𝒳 the line segment connecting a and b is also in 𝒳
• I.e., for all λ ∈ [0,1], we have
λ·a + (1 − λ)·b ∈ 𝒳 for all a, b ∈ 𝒳
• In the figure, each point represents a 2D vector
§ The left set is nonconvex, and the other two sets are convex
• Properties of convex sets include:
§ If 𝒳 and 𝒴 are convex sets, then 𝒳 ∩ 𝒴 is also convex
§ If 𝒳 and 𝒴 are convex sets, then 𝒳 ∪ 𝒴 is not necessarily convex
• A twice-differentiable function of a single variable f: ℝ → ℝ is convex if and only if its second derivative is non-negative everywhere
– Or, we can write, d²f/dx² ≥ 0
– For example, f(x) = x² is convex, since f′(x) = 2x and f′′(x) = 2, meaning that f′′(x) > 0
• A twice-differentiable function of many variables f: ℝⁿ → ℝ is convex if and only if its Hessian matrix is positive semi-definite everywhere
– Or, we can write, H_f ≽ 0
– This is equivalent to stating that all eigenvalues of the Hessian matrix are non-negative (i.e., ≥ 0)
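The eigenvalue criterion can be checked numerically; a sketch (assuming NumPy) for quadratics f(x) = xᵀAx, whose Hessian is A + Aᵀ:

```python
import numpy as np

def is_convex_quadratic(A):
    """f(x) = x^T A x is convex iff its Hessian A + A^T is positive semi-definite,
    i.e., all eigenvalues of the (symmetric) Hessian are >= 0."""
    H = A + A.T
    eigvals = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals >= -1e-12))   # small tolerance for round-off

assert is_convex_quadratic(np.eye(2))                 # f = x1^2 + x2^2 is convex
assert not is_convex_quadratic(np.diag([1.0, -1.0]))  # f = x1^2 - x2^2 has a saddle point
```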

Optimization Optimization
Picture from: http://d2l.ai/chapter_optimization/convexity.html

Constrained Optimization Lagrange Multipliers


• The optimization problem that involves a set of constraints which need to be satisfied to optimize the objective function is called constrained optimization
• E.g., for a given objective function f(x) and a set of constraint functions cᵢ(x):
minimize over x: f(x)
subject to cᵢ(x) ≤ 0 for all i ∈ {1, 2, …, N}
• The points that satisfy the constraints form the feasible region
• Various optimization algorithms have been developed for handling optimization problems based on whether the constraints are equalities, inequalities, or a combination of equalities and inequalities
• One approach to solving optimization problems is to substitute the initial problem with optimizing another related function
• The Lagrange function for optimization of the constrained problem above is defined as
L(x, α) = f(x) + ∑ᵢ αᵢ cᵢ(x), where αᵢ ≥ 0
• The variables αᵢ are called Lagrange multipliers and ensure that the constraints are properly enforced
– They are chosen just large enough to ensure that cᵢ(x) ≤ 0 for all i ∈ {1, 2, …, N}
• This is a saddle-point optimization problem where one wants to minimize L(x, α) with respect to x and simultaneously maximize L(x, α) with respect to αᵢ
– The saddle point of L(x, α) gives the optimal solution to the original constrained optimization problem

Optimization Optimization

Projections Projections
• An alternative strategy for satisfying constraints is projections
• E.g., gradient clipping in NNs can require that the norm of the gradient is bounded by a constant value c
• Approach: at each iteration during training
– If the norm of the gradient ‖g‖ ≥ c, then the update is g_new ← c · g_old / ‖g_old‖
– If the norm of the gradient ‖g‖ < c, then the update is g_new ← g_old
• Note that since g_old / ‖g_old‖ is a unit vector (i.e., it has a norm = 1), the vector c · g_old / ‖g_old‖ has a norm = c
• Such clipping is the projection of the gradient g onto the ball of radius c
– Projection on the unit ball is for c = 1
• More generally, a projection of a vector x onto a set 𝒳 is defined as
Proj_𝒳(x) = argmin_{x′ ∈ 𝒳} ‖x − x′‖₂
• This means that the vector x is projected onto the closest vector x′ that belongs to the set 𝒳
• For example, in the figure, the blue circle represents a convex set 𝒳
§ The points inside the circle project to themselves
o E.g., when x is the yellow vector inside the circle, its closest point x′ in the set 𝒳 is itself: the distance between x and x′ is ‖x − x′‖₂ = 0
§ The points outside the circle project to the closest point inside the circle
o E.g., when x is the yellow vector outside the circle, its closest point x′ in the set 𝒳 is the red vector
o Among all vectors in the set 𝒳, the red vector x′ has the smallest distance to x, i.e., it minimizes ‖x − x′‖₂
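The clipping update above can be sketched as (illustrative, assuming NumPy):

```python
import numpy as np

def clip_gradient(g, c):
    """Project gradient g onto the ball of radius c:
    rescale g to norm c when ||g|| >= c, otherwise leave it unchanged."""
    norm = np.linalg.norm(g)
    if norm >= c:
        return c * g / norm
    return g

g = np.array([3.0, 4.0])                      # norm 5, outside the unit ball
clipped = clip_gradient(g, 1.0)
assert np.isclose(np.linalg.norm(clipped), 1.0)          # projected onto the ball boundary
assert np.allclose(clip_gradient(np.array([0.1, 0.2]), 1.0), [0.1, 0.2])  # already inside
```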
Optimization Optimization
Picture from: http://d2l.ai/chapter_optimization/convexity.html
First-order vs Second-order Optimization Lower Bound and Infimum


• First-order optimization algorithms use the gradient of a function for finding the extrema points
– Methods: gradient descent, proximal algorithms, optimal gradient schemes
– The disadvantage is that they can be slow and inefficient
• Second-order optimization algorithms use the Hessian matrix of a function for finding the extrema points
– This is since the Hessian matrix holds the second-order partial derivatives
– Methods: Newton’s method, conjugate gradient method, Quasi-Newton method, Gauss-Newton method, BFGS (Broyden-Fletcher-Goldfarb-Shanno) method, Levenberg-Marquardt method, Hessian-free method
– The second-order derivatives can be thought of as measuring the curvature of the loss function
– Recall also that the second-order derivative can be used to determine whether a stationary point is a maximum (f′′(x) < 0) or a minimum (f′′(x) > 0)
– This information is richer than the information provided by the gradient
– Disadvantage: computing the Hessian matrix is computationally expensive, and even prohibitive for high-dimensional data
• A lower bound of a subset 𝒮 of a partially ordered set 𝒳 is an element a of 𝒳, such that a ≤ s for all s ∈ 𝒮
– E.g., for the subset 𝒮 = {2, 3, 6, 8} of the natural numbers ℕ, lower bounds are the numbers 2, 1, 0, −3, and all other numbers ≤ 2
• The infimum of a subset 𝒮 of a partially ordered set 𝒳 is the greatest lower bound in 𝒳, denoted inf_{s∈𝒮} s
– It is the maximal quantity h such that h ≤ s for all s ∈ 𝒮
– E.g., the infimum of the set 𝒮 = {2, 3, 6, 8} is h = 2, since it is the greatest lower bound
• Example: consider the subset of positive real numbers (excluding zero) ℝ⁺ = {x ∈ ℝ: x > 0}
– The subset ℝ⁺ does not have a minimum, because for every small positive number, there is another even smaller positive number
– On the other hand, all real negative numbers and 0 are lower bounds on the subset ℝ⁺
– 0 is the greatest lower bound of all lower bounds, and therefore, the infimum of ℝ⁺ is 0

Optimization Optimization

Upper Bound and Supremum Lipschitz Function


• An upper bound of a subset 𝒮 of a partially ordered set 𝒳 is an element b of 𝒳, such that b ≥ s for all s ∈ 𝒮
– E.g., for the subset 𝒮 = {2, 3, 6, 8} of the natural numbers ℕ, upper bounds are the numbers 8, 9, 40, and all other natural numbers ≥ 8
• The supremum of a subset 𝒮 of a partially ordered set 𝒳 is the least upper bound in 𝒳, denoted sup_{s∈𝒮} s
– It is the minimal quantity g such that g ≥ s for all s ∈ 𝒮
– E.g., the supremum of the subset 𝒮 = {2, 3, 6, 8} is g = 8, since it is the least upper bound
• Example: for the subset of negative real numbers (excluding zero) ℝ⁻ = {x ∈ ℝ: x < 0}
– All real positive numbers and 0 are upper bounds
– 0 is the least upper bound, and therefore, the supremum of ℝ⁻ is 0
• A function f(x) is a Lipschitz continuous function if a constant ρ > 0 exists, such that for all points x₁, x₂
|f(x₁) − f(x₂)| ≤ ρ |x₁ − x₂|
• Such a function is also called a ρ-Lipschitz function
• Intuitively, a Lipschitz function cannot change too fast
– I.e., if the points x₁ and x₂ are close (the distance |x₁ − x₂| is small), then f(x₁) and f(x₂) are also close (the distance |f(x₁) − f(x₂)| is also small)
• The smallest real number that bounds the change |f(x₁) − f(x₂)| for all points x₁, x₂ is the Lipschitz constant ρ of the function f(x)
– For a ρ-Lipschitz function f(x), the first derivative f′(x) is bounded everywhere by ρ
• E.g., the function f(x) = log(1 + exp(x)) is 1-Lipschitz over ℝ
– Since f′(x) = exp(x) / (1 + exp(x)) = 1 / (exp(−x) + 1) ≤ 1
– I.e., ρ = 1

Optimization Optimization

Lipschitz Continuous Gradient Probability


• A differentiable function f(x) has a Lipschitz continuous gradient if a constant ρ > 0 exists, such that for all points x₁, x₂
‖∇f(x₁) − ∇f(x₂)‖ ≤ ρ ‖x₁ − x₂‖
• For a function f(x) with a ρ-Lipschitz gradient, the second derivative f′′(x) is bounded everywhere by ρ
• E.g., consider the function f(x) = x²
– f(x) = x² is not a Lipschitz continuous function, since f′(x) = 2x, so when x → ∞ then f′(x) → ∞, i.e., the derivative is not bounded everywhere
– Since f′′(x) = 2, the gradient f′(x) is 2-Lipschitz everywhere, since the second derivative is bounded everywhere by 2
• Intuition:
– In a process, several outcomes are possible
– When the process is repeated a large number of times, each outcome occurs with a relative frequency, or probability
– If a particular outcome occurs more often, we say it is more probable
• Probability arises in two contexts
– In actual repeated experiments
• Example: You record the color of 1,000 cars driving by. 57 of them are green. You estimate the probability of a car being green as 57/1,000 = 0.057.
– In idealized conceptions of a repeated process
• Example: You consider the behavior of an unbiased six-sided die. The expected probability of rolling a 5 is 1/6 = 0.1667.
• Example: You need a model for how people’s heights are distributed. You choose a normal distribution to represent the expected relative probabilities.
Optimization Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials
Probability Random variables


• Solving machine learning problems requires to deal with uncertain quantities, as well as with • A random variable 𝑋 is a variable that can take on different values
stochastic (non-deterministic) quantities – Example: 𝑋 = rolling a die
– Probability theory provides a mathematical framework for representing and quantifying uncertain • Possible values of 𝑋 comprise the sample space, or outcome space, 𝒮 = 1, 2, 3, 4, 5, 6
quantities • We denote the event of “seeing a 5” as 𝑋 = 5
• There are different sources of uncertainty: • The probability of the event is 𝑃(𝑋 = 5)
– Inherent stochasticity in the system being modeled • Also, 𝑃 5 can be used to denote the probability that 𝑋 takes the value of 5
• For example, most interpretations of quantum mechanics describe the dynamics of subatomic • A probability distribution is a description of how likely a random variable is to take on each of
particles as being probabilistic its possible states
– Incomplete observability
– A compact notation is common, where 𝑃 𝑋 is the probability distribution over the random variable 𝑋
• Even deterministic systems can appear stochastic when we cannot observe all of the variables that
• Also, the notation X~𝑃 𝑋 can be used to denote that the random variable 𝑋 has probability
drive the behavior of the system
distribution 𝑃 𝑋
– Incomplete modeling
• Random variables can be discrete or continuous
• When we use a model that must discard some of the information we have observed, the discarded
– Discrete random variables have finite number of states: e.g., the sides of a die
information results in uncertainty in the model’s predictions
– Continuous random variables have infinite number of states: e.g., the height of a person
• E.g., discretization of real-numbered values, dimensionality reduction, etc.

Probability Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials

Axioms of probability Discrete Variables


• The probability of an event 𝒜 in the given sample space 𝒮, denoted as 𝑃 𝒜 , • A probability distribution over discrete variables may
must satisfy the following properties: be described using a probability mass function (PMF)
– E.g., sum of two dice
– Non-negativity
• For any event 𝒜 ∈ 𝒮, 𝑃 𝒜 ≥ 0
• A probability distribution over continuous variables
– All possible outcomes
may be described using a probability density function
(PDF)
• Probability of the entire sample space is 1, P(𝒮) = 1
– Additivity of disjoint events
• For all events 𝒜₁, 𝒜₂ ∈ 𝒮 that are mutually exclusive (𝒜₁ ∩ 𝒜₂ = ∅), the probability that either event happens is equal to the sum of their individual probabilities: P(𝒜₁ ∪ 𝒜₂) = P(𝒜₁) + P(𝒜₂)
• The probability of a random variable P(X) must obey the axioms of probability over the possible values in the sample space 𝒮
– E.g., waiting time between eruptions of Old Faithful
– A PDF gives the probability of an infinitesimal region with volume δX
– To find the probability over an interval [a, b], we can integrate the PDF as follows:
P(X ∈ [a, b]) = ∫ₐᵇ P(X) dX


Probability Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials Picture from: Jeff Howbert — Machine Learning Math Essentials

Multivariate Random Variables Joint Probability Distribution


• We may need to consider several random variables at a time • Probability distribution that acts on many variables at the same time is known as
– If several random processes occur in parallel or in sequence a joint probability distribution
– E.g., to model the relationship between several diseases and symptoms • Given any values x and y of two random variables 𝑋 and 𝑌, what is the
– E.g., to process images with millions of pixels (each pixel is one random variable) probability that 𝑋 = x and 𝑌 = y simultaneously?
• Next, we will study probability distributions defined over multiple random – 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦) denotes the joint probability
variables – We may also write 𝑃(𝑥, 𝑦) for brevity
– These include joint, conditional, and marginal probability distributions
• The individual random variables can also be grouped together into a random
vector, because they represent different properties of an individual statistical
unit
• A multivariate random variable is a vector of multiple random variables 𝐗 = [X₁, X₂, …, Xₙ]ᵀ
Probability Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials Slide credit: Jeff Howbert — Machine Learning Math Essentials
Marginal Probability Distribution Conditional Probability Distribution


• Marginal probability distribution is the probability distribution of a single variable
– It is calculated based on the joint probability distribution P(X, Y)
– I.e., using the sum rule: P(X = x) = ∑_y P(X = x, Y = y)
• For continuous random variables, the summation is replaced with integration, P(X = x) = ∫ P(X = x, Y = y) dy
– This process is called marginalization
• Conditional probability distribution is the probability distribution of one variable provided that another variable has taken a certain value
– Denoted P(X = x | Y = y)
• Note that: P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
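The sum rule and conditioning can be sketched on a small discrete joint table (hypothetical numbers, not from the slides):

```python
# Hypothetical joint distribution P(X, Y) over X in {0, 1} and Y in {0, 1}
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def marginal_x(joint, x):
    # Sum rule (marginalization): P(X = x) = sum_y P(X = x, Y = y)
    return sum(p for (xi, yi), p in joint.items() if xi == x)

def conditional_x_given_y(joint, x, y):
    # P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
    p_y = sum(p for (xi, yi), p in joint.items() if yi == y)
    return joint[(x, y)] / p_y

assert abs(marginal_x(joint, 0) - 0.3) < 1e-12                   # 0.1 + 0.2
assert abs(conditional_x_given_y(joint, 1, 1) - 0.4 / 0.6) < 1e-12
```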

Probability Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials Slide credit: Jeff Howbert — Machine Learning Math Essentials

Bayes’ Theorem Independence


• Bayes’ theorem allows us to calculate conditional probabilities for one variable when conditional probabilities for another variable are known • Two random variables 𝑋 and 𝑌 are independent if the occurrence of 𝑌 does not reveal any information about the occurrence of 𝑋
P(X | Y) = P(Y | X) P(X) / P(Y)
– E.g., two successive rolls of a die are independent
• Therefore, we can write: P(X | Y) = P(X)
• Also known as Bayes’ rule – The following notation is used: 𝑋 ⊥ 𝑌

• Multiplication rule for the joint distribution is used: 𝑃 𝑋, 𝑌 = 𝑃 𝑌| 𝑋 𝑃 𝑋 – Also note that for independent random variables: 𝑃 𝑋, 𝑌 = 𝑃 𝑋 𝑃 𝑌

• By symmetry, we also have: 𝑃 𝑌, 𝑋 = 𝑃 𝑋| 𝑌 𝑃 𝑌 • In all other cases, the random variables are dependent
– E.g., duration of successive eruptions of Old Faithful
• The terms are referred to as: – Getting a king on successive draws form a deck (the drawn card is not replaced)
– 𝑃 𝑋 , the prior probability, the initial degree of belief for 𝑋
• Two random variables 𝑋 and 𝑌 are conditionally independent given another random variable 𝑍
– 𝑃 𝑋| 𝑌 , the posterior probability, the degree of belief after incorporating the knowledge of 𝑌
if and only if 𝑃 𝑋, 𝑌|𝑍 = 𝑃 𝑋|𝑍 𝑃 𝑌|𝑍
– 𝑃 𝑌| 𝑋 , the likelihood of 𝑌 given 𝑋
– This is denoted as 𝑋 ⊥ 𝑌|𝑍
– P(Y), the evidence
– Bayes’ theorem: posterior probability = (likelihood × prior probability) / evidence
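A numeric sketch of the rule (hypothetical probabilities, chosen only for illustration):

```python
# Hypothetical values: prior P(X), likelihood P(Y|X), and likelihood P(Y|not X)
p_x = 0.01
p_y_given_x = 0.9
p_y_given_not_x = 0.05

# Evidence via the sum rule: P(Y) = P(Y|X) P(X) + P(Y|not X) P(not X)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' theorem: posterior = likelihood * prior / evidence
posterior = p_y_given_x * p_x / p_y
assert abs(posterior - 0.009 / 0.0585) < 1e-12   # approx. 0.1538
```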

Probability Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials

Continuous Multivariate Distributions Expected Value


• Same concepts of joint, marginal, and conditional probabilities apply for continuous random variables
• The probability distributions use integration of continuous random variables, instead of summation of discrete random variables
– Example: a three-component Gaussian mixture probability distribution in two dimensions
• The expected value or expectation of a function f(X) with respect to a probability distribution P(X) is the average (mean) when X is drawn from P(X)
• For a discrete random variable X, it is calculated as
𝔼_{X∼P}[f(X)] = ∑_X P(X) f(X)
• For a continuous random variable X, it is calculated as
𝔼_{X∼P}[f(X)] = ∫ P(X) f(X) dX
– When the identity of the distribution is clear from the context, we can write 𝔼_X[f(X)]
– If it is clear which random variable is used, we can write just 𝔼[f(X)]
• Mean is the most common measure of central tendency of a distribution
– For a random variable: f(Xᵢ) = Xᵢ ⇒ µ = 𝔼[X] = ∑ᵢ P(Xᵢ) · Xᵢ
– This is similar to the mean of a sample of observations: µ = (1/n) ∑ᵢ Xᵢ
– Other measures of central tendency: median, mode

Probability Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials Slide credit: Jeff Howbert — Machine Learning Math Essentials
Variance Covariance
• Variance gives the measure of how much the values of the function f(X) deviate from the expected value as we sample values of X from P(X)
Var[f(X)] = 𝔼[(f(X) − 𝔼[f(X)])²]
• When the variance is low, the values of f(X) cluster near the expected value
• Variance is commonly denoted with σ²
– The above equation is similar for a function f(Xᵢ) = Xᵢ − µ
– We have σ² = ∑ᵢ P(Xᵢ) · (Xᵢ − µ)²
– This is similar to the formula for calculating the variance of a sample of observations: σ² = (1/(n−1)) ∑ᵢ (Xᵢ − µ)²
• The square root of the variance is the standard deviation
– Denoted σ = √Var[X]
• Covariance gives the measure of how much two random variables are linearly related to each other
Cov[f(X), g(Y)] = 𝔼[(f(X) − 𝔼[f(X)]) (g(Y) − 𝔼[g(Y)])]
• If f(Xᵢ) = Xᵢ − µ_X and g(Yᵢ) = Yᵢ − µ_Y
– Then, the covariance is: Cov[X, Y] = ∑ᵢ P(Xᵢ, Yᵢ) · (Xᵢ − µ_X) · (Yᵢ − µ_Y)
– Compare to the covariance of actual samples: Cov[X, Y] = (1/(n−1)) ∑ᵢ (Xᵢ − µ_X)(Yᵢ − µ_Y)
• The covariance measures the tendency for X and Y to deviate from their means in the same (or opposite) directions at the same time
(Figure: two scatter plots with axes X, Y — left: no covariance; right: high covariance)
Probability Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials Picture from: Jeff Howbert — Machine Learning Math Essentials

Correlation Covariance Matrix


• Correlation coefficient is the covariance normalized by the standard deviations of the two variables
corr(X, Y) = Cov[X, Y] / (σ_X · σ_Y)
– It is also called Pearson’s correlation coefficient and it is denoted ρ(X, Y)
– The values are in the interval [−1, 1]
– It only reflects linear dependence between variables, and it does not measure non-linear dependencies between the variables
• Covariance matrix of a multivariate random variable 𝐗 with states 𝐱 ∈ ℝⁿ is an n×n matrix, such that
Cov[𝐗]ᵢⱼ = Cov[𝐱ᵢ, 𝐱ⱼ]
• I.e.,
Cov[𝐗] =
⎡ Cov[𝐱₁, 𝐱₁]  Cov[𝐱₁, 𝐱₂]  ⋯  Cov[𝐱₁, 𝐱ₙ] ⎤
⎢ Cov[𝐱₂, 𝐱₁]  ⋱          ⋯  Cov[𝐱₂, 𝐱ₙ] ⎥
⎣ Cov[𝐱ₙ, 𝐱₁]  Cov[𝐱ₙ, 𝐱₂]  ⋯  Cov[𝐱ₙ, 𝐱ₙ] ⎦
• The diagonal elements of the covariance matrix are the variances of the elements of the vector: Cov[𝐱ᵢ, 𝐱ᵢ] = Var[𝐱ᵢ]
• Also note that the covariance matrix is symmetric, since Cov[𝐱ᵢ, 𝐱ⱼ] = Cov[𝐱ⱼ, 𝐱ᵢ]
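These properties can be verified with NumPy’s `np.cov` (a sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))     # 3 random variables, 100 observations each

C = np.cov(X)                     # 3x3 sample covariance matrix
assert C.shape == (3, 3)
assert np.allclose(C, C.T)        # the covariance matrix is symmetric
# The diagonal holds the variances of the individual variables (np.cov uses ddof=1)
assert np.allclose(np.diag(C), [np.var(row, ddof=1) for row in X])
```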

Probability Probability
Picture from: Jeff Howbert — Machine Learning Math Essentials

Probability Distributions Probability Distributions


• Bernoulli distribution
– Binary random variable X with states {0, 1}
– The random variable can encode a coin flip which comes up 1 with probability p and 0 with probability 1 − p
– Notation: X ∼ Bernoulli(p)  (figure: p = 0.3)
• Uniform distribution
– The probability of each value i ∈ {1, 2, …, n} is pᵢ = 1/n
– Notation: X ∼ U(n)
– Figure: n = 5, pᵢ = 0.2
• Binomial distribution
– Performing a sequence of n independent experiments, each of which has probability p of succeeding, where p ∈ [0, 1]
– The probability of getting k successes in n trials is
P(X = k) = (n choose k) pᵏ (1 − p)ⁿ⁻ᵏ
– Notation: X ∼ Binomial(n, p)  (figure: n = 10, p = 0.2)
• Poisson distribution
– A number of events occurring independently in a fixed interval of time with a known rate λ
– A discrete random variable X with states k ∈ {0, 1, 2, …} has probability P(X = k) = λᵏ e⁻λ / k!
– The rate λ is the average number of occurrences of the event
– Notation: X ∼ Poisson(λ)  (figure: λ = 5)
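The Binomial and Poisson PMFs above can be evaluated with the standard library; the sketch below also checks that each is a proper probability distribution (sums to 1):

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) = lam^k e^(-lam) / k!
    return lam**k * exp(-lam) / factorial(k)

assert abs(sum(binomial_pmf(k, 10, 0.2) for k in range(11)) - 1.0) < 1e-12
assert abs(sum(poisson_pmf(k, 5.0) for k in range(100)) - 1.0) < 1e-9
```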

Probability Probability
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html
Probability Distributions Probability Distributions


• Gaussian distribution
– The most well-studied distribution
• Referred to as normal distribution or informally bell-shaped distribution
– Defined with the mean µ and variance σ²
– Notation: X ∼ 𝒩(µ, σ²)
– For a random variable X with n independent measurements, the density is
P_X(x) = (1 / √(2πσ²)) e^(−(x − µ)² / (2σ²))
• Multinoulli distribution
– It is an extension of the Bernoulli distribution, from binary class to multi-class
– Multinoulli distribution is also called categorical distribution or generalized Bernoulli distribution
– Multinoulli is a discrete probability distribution that describes the possible results of a random variable that can take on one of k possible categories
• A categorical random variable is a discrete variable with more than two possible outcomes (such as the roll of a die)
– For example, in multi-class classification in machine learning, we have a set of data examples x₁, x₂, …, xₙ, and corresponding to the data example xᵢ is a k-class label yᵢ = (y_{i1}, y_{i2}, …, y_{ik}) representing one-hot encoding
• One-hot encoding is also called a 1-of-k vector, where one element has the value 1 and all other elements have the value 0
• Let’s denote the probabilities for assigning the class labels to a data example by p₁, p₂, …, p_k
• We know that 0 ≤ pⱼ ≤ 1 and ∑ⱼ pⱼ = 1 for the different classes j = 1, 2, …, k
• The multinoulli probability of the data example xᵢ is P(xᵢ) = p₁^{y_{i1}} · p₂^{y_{i2}} ⋯ p_k^{y_{ik}} = ∏ⱼ pⱼ^{y_{ij}}
• Similarly, we can calculate the probability of all data examples as ∏ᵢ ∏ⱼ pⱼ^{y_{ij}}

Probability Probability
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html

Information Theory Self-information


• Information theory studies encoding, decoding, transmitting, and manipulating information • The basic intuition behind information theory is that learning that an unlikely event has occurred
– It is a branch of applied mathematics that revolves around quantifying how much information is is more informative than learning that a likely event has occurred
present in different signals – E.g., a message saying “the sun rose this morning” is so uninformative that it is unnecessary to be sent
• As such, information theory provides fundamental language for discussing the information – But, a message saying “there was a solar eclipse this morning” is very informative
processing in computer systems • Based on that intuition, Shannon defined the self-information of an event 𝑋 as
– E.g., machine learning applications use the cross-entropy loss, derived from information theoretic 𝐼 𝑋 = −log 𝑃 𝑋
considerations
• A seminal work in this field is the paper A Mathematical Theory of Communication by Claude E. Shannon, which introduced the concept of information entropy for the first time – 𝐼(𝑋) is the self-information, and 𝑃(𝑋) is the probability of the event 𝑋
– Information theory was originally invented to study sending messages over a noisy channel, such as communication via radio transmission
– For example, if we want to send the code “0010” over a channel
– The event “0010” is a series of codes of length n (in this case, the length is n = 4)
– Each code is a bit (0 or 1), and occurs with probability 1/2; for this event, P = (1/2)⁴
I(“0010”) = −log₂ P(“0010”) = −log₂ (1/2)⁴ = −log₂ 1 + log₂ 2⁴ = 0 + 4 = 4 bits

Information Theory Information Theory

Entropy Entropy
• For a discrete random variable X that follows a probability distribution P with a probability mass function P(X), the expected amount of information through entropy (or Shannon entropy) is
H(X) = 𝔼_{X∼P}[I(X)] = −𝔼_{X∼P}[log P(X)]
• Based on the expectation definition 𝔼_{X∼P}[f(X)] = ∑_X P(X) f(X), we can rewrite the entropy as
H(X) = −∑_X P(X) log P(X)
• If X is a continuous random variable that follows a probability distribution P with a probability density function P(X), the entropy is
H(X) = −∫_X P(X) log P(X) dX
– For continuous random variables, the entropy is also called differential entropy
• Intuitively, we can interpret the self-information (I(X) = −log P(X)) as the amount of surprise we have at seeing a particular outcome
– We are less surprised when seeing a more frequent event
• Similarly, we can interpret the entropy (H(X) = 𝔼_{X∼P}[I(X)]) as the average amount of surprise from observing a random variable X
– Therefore, distributions that are closer to a uniform distribution have high entropy
– Because there is little surprise when we draw samples from a uniform distribution, since all samples have similar values
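Self-information and entropy can be computed directly from a PMF; a small sketch in bits (base-2 logarithm):

```python
from math import log2

def self_information(p):
    # I(X) = -log2 P(X), in bits
    return -log2(p)

def entropy(pmf):
    # H(X) = -sum_x P(x) log2 P(x); terms with P(x) = 0 contribute nothing
    return -sum(p * log2(p) for p in pmf if p > 0)

assert self_information((1 / 2) ** 4) == 4.0          # the "0010" example: 4 bits
assert entropy([0.25, 0.25, 0.25, 0.25]) == 2.0       # uniform over 4 states: high entropy
assert entropy([0.9, 0.05, 0.03, 0.02]) < 2.0         # skewed distribution: lower entropy
```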

Information Theory Information Theory
Kullback–Leibler Divergence Kullback–Leibler Divergence


• Kullback-Leibler (KL) divergence (or relative entropy) provides a measure of how different two • KL divergence is non-negative: 𝐷RS 𝑃||𝑄 ≥ 0
probability distribution are • 𝐷RS 𝑃||𝑄 = 0 if and only if 𝑃(𝑋) and 𝑄 𝑋 are the same distribution
• For two probability distributions 𝑃(𝑋) and 𝑄 𝑋 over the same random variable 𝑋, the KL • The most important property of KL divergence is that it is non-symmetric, i.e.,
divergence is 𝐷RS 𝑃||𝑄 ≠ 𝐷RS 𝑄||𝑃
𝑃(𝑋)
𝐷RS 𝑃||𝑄 = 𝔼O~Q log • Because 𝐷RS is non-negative and measures the difference between distributions, it is often
𝑄 𝑋
considered as a “distance metric” between two distributions
• For discrete random variables, this formula is equivalent to
– However, KL divergence is not a true distance metric, because it is not symmetric
Q(O) Þ(O)
𝐷RS 𝑃||𝑄 = ∑O 𝑃 𝑋 log Þ O = − ∑O 𝑃 𝑋 log Q O – The asymmetry means that there are important consequences to the choice of whether to use
𝐷lm 𝑃||𝑄 or 𝐷lm 𝑄 ||𝑃
• When base 2 logarithm is used, 𝐷RS provides the amount of information in bits
• An alternative divergence which is non-negative and symmetric is the Jensen-Shannon
– In machine learning, the natural logarithm is used (with base e): the amount of information is provided in nats
divergence, defined as
• KL divergence can be considered as the amount of information lost when the distribution 𝑄 is 1 1
used to approximate the distribution 𝑃 𝐷áâ 𝑃||𝑄 = 𝐷RS 𝑃||𝑀 + 𝐷RS 𝑄||𝑀
2 2
– E.g., in GANs, 𝑃 is the distribution of true data, 𝑄 is the distribution of synthetic data #
– In the above, M is the average of the two distributions, 𝑀 = 𝑃+𝑄
$
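A sketch computing KL and Jensen-Shannon divergences for two discrete distributions, illustrating that KL is non-symmetric while JS is symmetric:

```python
from math import log

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), in nats
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # D_JS(P || Q) = 0.5 D_KL(P || M) + 0.5 D_KL(Q || M), with M = 0.5 (P + Q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = [0.1, 0.4, 0.5]
Q = [0.8, 0.15, 0.05]

assert kl(P, Q) >= 0 and kl(Q, P) >= 0      # non-negativity
assert kl(P, P) == 0                         # identical distributions
assert abs(kl(P, Q) - kl(Q, P)) > 1e-6       # KL is not symmetric
assert abs(js(P, Q) - js(Q, P)) < 1e-12      # JS is symmetric
```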
Information Theory Information Theory

Cross-entropy Maximum Likelihood


• Cross-entropy is closely related to the KL divergence, and it is defined as the summation of the entropy H(P) and KL divergence D_KL(P‖Q):
CE(P, Q) = H(P) + D_KL(P‖Q)
• Alternatively, the cross-entropy can be written as
CE(P, Q) = −𝔼_{X∼P}[log Q(X)]
• In machine learning, let’s assume a classification problem based on a set of data examples x₁, x₂, …, xₙ that need to be classified into k classes
– For each data example xᵢ we have a class label yᵢ
• The true labels y follow the true distribution P
– The goal is to train a classifier (e.g., a NN) parameterized by θ, that outputs a predicted class label ŷᵢ for each data example xᵢ
• The predicted labels ŷ follow the estimated distribution Q
– The cross-entropy loss between the true distribution P and the estimated distribution Q is calculated as: CE(y, ŷ) = −𝔼_{X∼P}[log Q(X)] = −∑_X P(X) log Q(X) = −∑ᵢ yᵢ log ŷᵢ
• The further away the true and estimated distributions are, the greater the cross-entropy loss is
• Cross-entropy is closely related to maximum likelihood estimation
• In ML, we want to find a model with parameters θ that maximizes the probability that the data is assigned the correct class, i.e., argmax_θ P(model | data)
– For the classification problem above, we want to find parameters θ so that for the data examples x₁, x₂, …, xₙ the probability of outputting the class labels y₁, y₂, …, yₙ is maximized
• I.e., for some data examples, the predicted class ŷⱼ will be different than the true class yⱼ, but the goal is to find θ that results in an overall maximum probability
• From Bayes’ theorem, argmax P(model | data) is proportional to argmax P(data | model):
P(θ | x₁, x₂, …, xₙ) = P(x₁, x₂, …, xₙ | θ) P(θ) / P(x₁, x₂, …, xₙ)
– This is true since P(x₁, x₂, …, xₙ) does not depend on the parameters θ
– Also, we can assume that we have no prior assumption on which set of parameters θ are better than any others
• Recall that P(data | model) is the likelihood; therefore, the maximum likelihood estimate of θ is based on solving
argmax_θ P(x₁, x₂, …, xₙ | θ)
Information Theory Information Theory

Maximum Likelihood Challenges of Machine Learning


• For a total number of n observed data examples x₁, x₂, …, xₙ, the predicted class label for the data example xᵢ is ŷᵢ
– Using the multinoulli distribution, the probability of predicting the true class label yᵢ = (y_{i1}, y_{i2}, …, y_{ik}) is 𝒫(xᵢ | θ) = ∏ⱼ ŷᵢⱼ^{yᵢⱼ}, where j ∈ {1, 2, …, k}
– E.g., for a problem with 3 classes [car, house, tree] and an image of a car xᵢ, the true label is yᵢ = (1, 0, 0), and let’s assume a predicted label ŷᵢ = (0.7, 0.1, 0.2); then the probability is 𝒫(xᵢ | θ) = ∏ⱼ ŷᵢⱼ^{yᵢⱼ} = 0.7¹ · 0.1⁰ · 0.2⁰ = 0.7 · 1 · 1 = 0.7
• Assuming that the data examples are independent, the likelihood of the data given the model parameters θ can be written as 𝒫(x₁, x₂, …, xₙ | θ) = 𝒫(x₁ | θ) ⋯ 𝒫(xₙ | θ) = ∏ⱼ ŷ₁ⱼ^{y₁ⱼ} · ∏ⱼ ŷ₂ⱼ^{y₂ⱼ} ⋯ ∏ⱼ ŷₙⱼ^{yₙⱼ} = ∏ᵢ ∏ⱼ ŷᵢⱼ^{yᵢⱼ}
• The log-likelihood is often used because it simplifies numerical calculations, since it transforms a product with many terms into a summation, e.g., log(a₁^{b₁} · a₂^{b₂}) = b₁ log a₁ + b₂ log a₂
– log 𝒫(x₁, x₂, …, xₙ | θ) = log ∏ᵢ ∏ⱼ ŷᵢⱼ^{yᵢⱼ} = ∑ᵢ ∑ⱼ yᵢⱼ log ŷᵢⱼ
– The negative of the log-likelihood allows us to use minimization approaches, i.e., −log 𝒫(x₁, x₂, …, xₙ | θ) = −∑ᵢ ∑ⱼ yᵢⱼ log ŷᵢⱼ = CE(y, ŷ)
• Thus, maximizing the likelihood is the same as minimizing the cross-entropy
• Well-posed problem vs ill-posed problem
– A well-posed problem is one where specifications are complete and available
– E.g., given only input-output examples, functions such as y = x₁ ∗ x₂, y = x₁ / x₂, or y = x₁^{x₂} may all be consistent with the data
– Solving an ill-posed problem requires more examples
• Huge data: availability of quality data is a challenge
• High computation power: GPU, TPU
• Complexity of the algorithms
• Bias / variance: variance is the error of the model
Information Theory

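The ill-posed-problem bullets above can be made concrete with a small sketch, using the hypothetical hypotheses y = x1 * x2, y = x1 / x2, and y = x1 ^ x2 from the slide:

```python
# Three hypothetical functions that could explain observed (x1, x2) -> y pairs
hypotheses = {
    "product":  lambda x1, x2: x1 * x2,
    "quotient": lambda x1, x2: x1 / x2,
    "power":    lambda x1, x2: x1 ** x2,
}

# With a single example (x1=1, x2=1, y=1) every hypothesis fits: ill posed
consistent = [name for name, h in hypotheses.items() if h(1, 1) == 1]
print(consistent)  # prints ['product', 'quotient', 'power']

# A second example (x1=2, x2=2, y=4) rules out the quotient, but still
# cannot separate product from power: more examples are needed
consistent = [name for name, h in hypotheses.items() if h(2, 2) == 4]
print(consistent)  # prints ['product', 'power']
```

Even two examples leave the hypothesis space ambiguous here, which is exactly why ill-posed problems require more data.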

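As a numeric check of the maximum-likelihood discussion above, here is a minimal sketch (plain Python; function names are illustrative) verifying on the slide's 3-class example that the negative log-likelihood of a one-hot label equals the cross-entropy loss:

```python
import math

def neg_log_likelihood(y_true, y_pred):
    # -log of the multinoulli probability  prod_j y_pred[j] ** y_true[j]
    prob = 1.0
    for t, p in zip(y_true, y_pred):
        prob *= p ** t
    return -math.log(prob)

def cross_entropy(y_true, y_pred):
    # CE(y, y_hat) = -sum_j y_j * log(y_hat_j); zero targets contribute nothing
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# The slide's example: true label (car) = (1, 0, 0), prediction = (0.7, 0.1, 0.2)
y_true, y_pred = [1, 0, 0], [0.7, 0.1, 0.2]
print(neg_log_likelihood(y_true, y_pred) == cross_entropy(y_true, y_pred))  # prints True
```

Both reduce to −log 0.7 ≈ 0.357, so minimizing the cross-entropy is indeed the same as maximizing the likelihood.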
Why worry about the math?

• These intuitions will allow you to:
  – Choose the right algorithm(s) for the problem
  – Make good choices on parameter settings and validation strategies
  – Recognize over- or underfitting
  – Troubleshoot poor / ambiguous results
  – Put appropriate bounds of confidence / uncertainty on results
  – Do a better job of coding algorithms or incorporating them into more complex analysis pipelines

References

1. Dr. Alex Vakanski, CS 404/504, Special Topics: Adversarial Machine Learning
2. A. Zhang, Z. C. Lipton, M. Li, A. J. Smola, Dive into Deep Learning, https://d2l.ai, 2020.
3. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2017.
4. M. P. Deisenroth, A. A. Faisal, C. S. Ong, Mathematics for Machine Learning, Cambridge University Press, 2020.
5. Jeff Howbert, Machine Learning Math Essentials presentation
6. Brian Keng, Manifolds: A Gentle Introduction blog
7. Martin J. Osborne, Mathematical Methods for Economic Theory (link)

Next time

• Module 2: Understanding Data and Feature Engineering in ML
  – Part 2.1: Research Steps: Problem Formulation and Design Cycle
  – Part 2.2: Data Collection and Creation
  – Part 2.3: Feature Engineering
    • Data Preprocessing and Exploratory Data Analysis
    • Data Visualization, Dimension Reduction, Feature Extraction (PCA, SVD), Feature Selection
  – Part 2.4: Evaluation Parameters
    • Accuracy, Precision, Recall
  – Part 2.5: Python for Feature Engineering
