2. Maths for ML
Areas of math essential to machine learning
• Probability
• Statistical inference
• Validation
• Estimates of error, confidence intervals
• Linear algebra
  – Hugely useful for compact representation of linear transformations on data
  – Dimensionality reduction techniques
• Optimization theory

Why worry about the math?
• Machine learning is part of both statistics and computer science
• There are lots of easy-to-use machine learning packages out there
  – After this course, you will know how to apply several of the most general-purpose algorithms
• HOWEVER, to get really useful results, you need good mathematical intuitions about certain general machine learning principles, as well as the inner workings of the individual algorithms
source : https://www.youtube.com/watch?v=OmJ-4B-mS-Y
Notation
• 𝑎, 𝑏, 𝑐  Scalar (integer or real)
• 𝐱, 𝐲, 𝐳  Vector (bold font, lower case)
• 𝐀, 𝐁, 𝐂  Matrix (bold font, upper case)
• A, B, C  Tensor (bold font, upper case)
• 𝑋, 𝑌, 𝑍  Random variable (normal font, upper case)
• 𝑎 ∈ 𝒜  Set membership: 𝑎 is a member of set 𝒜
• |𝒜|  Cardinality: number of items in set 𝒜
• ‖𝐯‖  Norm of vector 𝐯
• 𝐮 ⋅ 𝐯 or ⟨𝐮, 𝐯⟩  Dot product of vectors 𝐮 and 𝐯
• ℝ  Set of real numbers
• ℝⁿ  Real-number space of dimension n
• 𝑦 = 𝑓(𝑥) or 𝑥 ↦ 𝑓(𝑥)  Function (map): assigns a unique value 𝑓(𝑥) to each input value 𝑥
• 𝑓: ℝⁿ → ℝ  Function (map): maps an n-dimensional vector to a scalar
• 𝐀 ⊙ 𝐁  Element-wise product of matrices 𝐀 and 𝐁
• 𝐀⁺  Pseudo-inverse of matrix 𝐀
• dⁿ𝑓/d𝑥ⁿ  n-th derivative of function 𝑓 with respect to 𝑥
• ∇_𝐱 𝑓(𝐱)  Gradient of function 𝑓 with respect to 𝐱
• 𝐇_𝑓  Hessian matrix of function 𝑓
• 𝑋 ∼ 𝑃  Random variable 𝑋 has distribution 𝑃
• 𝑃(𝑋|𝑌)  Probability of 𝑋 given 𝑌
• 𝒩(𝜇, 𝜎²)  Gaussian distribution with mean 𝜇 and variance 𝜎²
• 𝔼_{𝑋∼𝑃}[𝑓(𝑋)]  Expectation of 𝑓(𝑋) with respect to 𝑃(𝑋)
• Var(𝑓(𝑋))  Variance of 𝑓(𝑋)
• Cov(𝑓(𝑋), 𝑔(𝑌))  Covariance of 𝑓(𝑋) and 𝑔(𝑌)
• corr(𝑋, 𝑌)  Correlation coefficient for 𝑋 and 𝑌
• 𝐷_KL(𝑃‖𝑄)  Kullback–Leibler divergence between distributions 𝑃 and 𝑄
• 𝐶𝐸(𝑃, 𝑄)  Cross-entropy between distributions 𝑃 and 𝑄
Vectors
• Vector addition
  – We add the coordinates, and follow the directions given by the two vectors that are added
• The normalized dot product, cos 𝜃 = (𝐮 ⋅ 𝐯) / (‖𝐮‖ ‖𝐯‖), is often used to measure the similarity between vectors/data instances, and it is referred to as cosine similarity
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html#geometry-of-vectors
Slide credit: Jeff Howbert — Machine Learning Math Essentials
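A minimal NumPy sketch of these two operations (vector addition and cosine similarity); the vectors u and v are arbitrary examples chosen for illustration, not values from the slides:

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Vector addition: add the coordinates
w = u + v                       # array([4., 1.])

# Cosine similarity: dot product normalized by the vector norms
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(w, cos_sim)
```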
Hyperplanes
Picture from: https://kgpdag.wordpress.com/2015/08/12/svm-simplified/
• For example, for a given data point 𝐰 = [2, 1]ᵀ, we can use the dot product to find the hyperplane for which 𝐰 ⋅ 𝐯 = 1
  – I.e., all vectors with 𝐰 ⋅ 𝐯 > 1 can be classified as one class, and all vectors with 𝐰 ⋅ 𝐯 < 1 can be classified as another class
  – Solving 𝐰 ⋅ 𝐯 = 1, we obtain the line 2𝑣₁ + 𝑣₂ = 1 that acts as the decision boundary
• In a 3D space, if we have a vector 𝐰 = [1, 2, 3]ᵀ and try to find all points that satisfy 𝐰 ⋅ 𝐯 = 1, we obtain a plane that is orthogonal to the vector 𝐰
  – The inequalities 𝐰 ⋅ 𝐯 > 1 and 𝐰 ⋅ 𝐯 < 1 again define the two subspaces that are created by the plane
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html#hyperplanes
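As a small illustration of how the dot product with 𝐰 = [2, 1]ᵀ splits the plane into two classes, here is a sketch in NumPy; the sample points are made up for illustration:

```python
import numpy as np

w = np.array([2.0, 1.0])          # normal vector defining the hyperplane w·v = 1
points = np.array([[1.0, 1.0],    # w·v = 3   -> one class
                   [0.0, 0.5],    # w·v = 0.5 -> other class
                   [0.5, 0.0]])   # w·v = 1   -> exactly on the hyperplane

scores = points @ w               # dot product of each point with w
labels = np.where(scores > 1, +1, np.where(scores < 1, -1, 0))
print(scores, labels)             # [3.  0.5 1. ] [ 1 -1  0]
```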
Matrices
• A matrix
  – Is a rectangular array of real-valued scalars arranged in m horizontal rows and n vertical columns
  – Can also be viewed as a tuple of vectors
  – Each element 𝑎ᵢⱼ belongs to the i-th row and j-th column
  – The elements are denoted 𝑎ᵢⱼ or 𝐀ᵢⱼ or 𝐀[𝑖, 𝑗]
• Addition or subtraction: (𝐀 ± 𝐁)ᵢⱼ = 𝐀ᵢⱼ ± 𝐁ᵢⱼ
• Scalar multiplication: (𝑐𝐀)ᵢⱼ = 𝑐 ⋅ 𝐀ᵢⱼ
• Transpose of a matrix: 𝐀ᵀ has the rows and columns exchanged, (𝐀ᵀ)ᵢⱼ = 𝐀ⱼᵢ
  – Some properties:
    𝐀 + 𝐁 = 𝐁 + 𝐀          𝐀(𝐁 + 𝐂) = 𝐀𝐁 + 𝐀𝐂
    (𝐀 + 𝐁)ᵀ = 𝐀ᵀ + 𝐁ᵀ      𝐀(𝐁𝐂) = (𝐀𝐁)𝐂
    (𝐀ᵀ)ᵀ = 𝐀               (𝐀𝐁)ᵀ = 𝐁ᵀ𝐀ᵀ
• Square matrix: has the same number of rows and columns
• Identity matrix 𝐈ₙ: has ones on the main diagonal and zeros elsewhere
  – E.g., identity matrix of size 3×3:  𝐈₃ = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
• Determinant of a matrix, denoted det(𝐀) or |𝐀|, is a real-valued scalar encoding certain properties of the matrix
  – E.g., for a matrix of size 2×2:  det([[𝑎, 𝑏], [𝑐, 𝑑]]) = 𝑎𝑑 − 𝑏𝑐
  – For larger matrices the determinant is calculated by cofactor expansion:
    det(𝐀) = Σⱼ 𝑎ᵢⱼ (−1)^(𝑖+𝑗) det(𝐀₍ᵢ,ⱼ₎)
  – In the above, 𝐀₍ᵢ,ⱼ₎ is a minor of the matrix obtained by removing the row and column associated with the indices i and j
• Trace of a matrix is the sum of all diagonal elements: Tr(𝐀) = Σᵢ 𝑎ᵢᵢ
• A matrix for which 𝐀 = 𝐀ᵀ is called a symmetric matrix
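The following NumPy sketch checks a few of the listed identities on small example matrices; it is illustrative only, and the random 3×3 matrices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

print(np.allclose((A @ B).T, B.T @ A.T))            # (AB)^T = B^T A^T
print(np.isclose(np.linalg.det(np.eye(3)), 1.0))    # det(I_3) = 1
print(np.isclose(np.trace(A), A.diagonal().sum()))  # Tr(A) = sum of diagonal elements

# 2x2 determinant formula: ad - bc
M = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.isclose(np.linalg.det(M), 1 * 4 - 2 * 3))
```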
• The matrix-vector product 𝐀𝐱 is a column vector of length m, whose i-th element is the dot product 𝐚ᵢᵀ𝐱
  – Note the sizes: 𝐀 (m×n) ⋅ 𝐱 (n×1) = 𝐀𝐱 (m×1)
• We can consider the matrix-matrix product 𝐀𝐁 as dot products of the rows in 𝐀 and the columns in 𝐁
• Notice that for the two columns 𝐛₁ = [2, 4]ᵀ and 𝐛₂ = [−1, −2]ᵀ, we can write 𝐛₁ = −2 ⋅ 𝐛₂
  – This means that the two columns are linearly dependent
  – In this case, a linear combination of the two vectors exists for which 𝐛₁ + 2 ⋅ 𝐛₂ = 𝟎
• The weighted sum 𝑎₁𝐛₁ + 𝑎₂𝐛₂ is referred to as a linear combination of the vectors 𝐛₁ and 𝐛₂
• A collection of vectors 𝐯₁, 𝐯₂, …, 𝐯ₖ is linearly dependent if there exist coefficients 𝑎₁, 𝑎₂, …, 𝑎ₖ, not all equal to zero, such that 𝑎₁𝐯₁ + 𝑎₂𝐯₂ + ⋯ + 𝑎ₖ𝐯ₖ = 𝟎
• Matrix inverse: the inverse 𝐀⁻¹ of a square matrix 𝐀 satisfies 𝐀⁻¹𝐀 = 𝐀𝐀⁻¹ = 𝐈
  – For a product of invertible matrices: (𝐀𝐁)⁻¹ = 𝐁⁻¹𝐀⁻¹
• The matrix 𝐂 below has rank(𝐂) = 2, since it has only two linearly independent columns
  – I.e., 𝐜₄ = −1 ⋅ 𝐜₁, 𝐜₅ = −1 ⋅ 𝐜₃, and 𝐜₂ = 3 ⋅ 𝐜₁ + 3 ⋅ 𝐜₃

        [  1   3   0  −1   0 ]
    𝐂 = [ −1   0   1   1  −1 ]
        [  0   3   1   0  −1 ]
        [  2   3  −1  −2   1 ]

• If det(𝐀) = 0 (i.e., rank(𝐀) < n), then the inverse does not exist
  – A matrix that is not invertible is called a singular matrix
• Note that finding the inverse of a large matrix is computationally expensive
  – In addition, it can lead to numerical instability
• If the inverse of a matrix is equal to its transpose, the matrix is said to be an orthogonal matrix: 𝐀⁻¹ = 𝐀ᵀ
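A short NumPy sketch of these facts, using the rank-2 matrix 𝐂 from the slide (as reconstructed above), a simple singular matrix, and a 2-D rotation as an example of an orthogonal matrix:

```python
import numpy as np

C = np.array([[ 1,  3,  0, -1,  0],
              [-1,  0,  1,  1, -1],
              [ 0,  3,  1,  0, -1],
              [ 2,  3, -1, -2,  1]], dtype=float)
print(np.linalg.matrix_rank(C))           # 2: only two linearly independent columns

# A singular (non-invertible) square matrix has det = 0
S = np.array([[1.0, 2.0], [2.0, 4.0]])
print(np.isclose(np.linalg.det(S), 0.0))  # True -> np.linalg.inv(S) would raise an error

# Orthogonal matrix: inverse equals transpose (e.g., a 2-D rotation matrix)
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(np.linalg.inv(Q), Q.T))  # True
```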
Manifolds
• Earlier we learned that hyperplanes generalize the concept of planes to high-dimensional spaces
  – Similarly, manifolds can be informally imagined as a generalization of the concept of surfaces to high-dimensional spaces
• To begin with an intuitive explanation, the surface of the Earth is an example of a two-dimensional manifold embedded in a three-dimensional space
  – This is true because the Earth looks locally flat, so on a small scale it is like a 2-D plane
  – However, if we keep walking on the Earth in one direction, we will eventually end up back where we started
  – This means that the Earth is not really flat; it only looks locally like a Euclidean plane, but at large scales it folds up on itself, and has a different global structure than a flat plane
• Manifolds are studied in mathematics under topological spaces
• An n-dimensional manifold is defined as a topological space with the property that each point has a neighborhood that is homeomorphic to the Euclidean space of dimension n
  – This means that a manifold locally resembles Euclidean space near each point
  – Informally, a Euclidean space is locally smooth; it does not have holes, edges, or other sudden changes, and it does not have intersecting neighborhoods
  – Although manifolds can have a very complex structure on a large scale, the resemblance to Euclidean space on a small scale allows us to apply standard math concepts
• Examples of 2-dimensional manifolds are shown in the figure
  – The surfaces in the figure have been conveniently cut up into little rectangles that were glued together
  – Those small rectangles locally look like flat Euclidean planes
Picture from: http://bjlkeng.github.io/posts/manifolds/
• Examples of one-dimensional manifolds
  – Upper figure: a circle is a 1-D manifold embedded in 2-D, where each arc of the circle locally resembles a line segment
  – Lower figures: other examples of 1-D manifolds
  – Note that a figure eight is not a manifold, because it has an intersecting point (it is not locally Euclidean)
• It is hypothesized that in the real world, high-dimensional data (such as images) lie on low-dimensional manifolds embedded in the high-dimensional space
  – E.g., in ML, let's assume we have a training set of images with size 224×224×3 pixels
  – Learning an arbitrary function in such a high-dimensional space would be intractable
  – Despite that, all images of the same class ("cats" for example) might lie on a low-dimensional manifold
  – This allows function learning and image classification
• Example:
  – The data points have 3 dimensions (left figure), i.e., the input space of the data is 3-dimensional
  – The data points lie on a 2-dimensional manifold, shown in the right figure
• Most ML algorithms extract lower-dimensional data features that enable distinguishing between various classes of high-dimensional input data
  – The low-dimensional representations of the input data are called embeddings
Picture from: http://bjlkeng.github.io/posts/manifolds/
Norms
• The singular values of 𝐗 are 𝜎₁, 𝜎₂, …, 𝜎ᵣ
• 𝑳₂,₁ norm: the sum of the Euclidean norms of the columns of matrix 𝐗
    ‖𝐗‖₂,₁ = Σⱼ √(Σᵢ 𝑥ᵢⱼ²)
• Max norm: the largest element (in absolute value) of matrix 𝐗
    ‖𝐗‖max = maxᵢ,ⱼ |𝑥ᵢⱼ|

Derivatives
• The derivative measures, for a small change in 𝑥, the rate of change of 𝑓(𝑥)
• Given 𝑦 = 𝑓(𝑥), where 𝑥 is an independent variable and 𝑦 is a dependent variable, the following expressions are equivalent:
    𝑓′(𝑥) = 𝑓′ = d𝑦/d𝑥 = d𝑓/d𝑥 = (d/d𝑥) 𝑓(𝑥) = 𝐷𝑓(𝑥) = 𝐷ₓ𝑓(𝑥)
• The symbols d/d𝑥, 𝐷, and 𝐷ₓ are differentiation operators that indicate the operation of differentiation
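As a quick illustration of the derivative as a rate of change, this sketch compares a finite-difference estimate with the analytic derivative; the function f(x) = x² is an arbitrary example, not one from the slides:

```python
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 3.0
print(numerical_derivative(f, x))   # ~6.0
print(2 * x)                        # analytic derivative: d/dx x^2 = 2x = 6.0
```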
Optimization
(Figures only on these slides; pictures from http://d2l.ai/chapter_optimization/optimization-intro.html, https://mjo.osborne.economics.utoronto.ca/index.php/tutorial/index/1/cv1/t, and http://d2l.ai/chapter_optimization/convexity.html)
Projections
• An alternative strategy for satisfying constraints is projection
• E.g., gradient clipping in NNs can require that the norm of the gradient is bounded by a constant value c
• Approach: at each iteration during training
  – If the norm of the gradient ‖𝐠‖ ≥ c, then the update is 𝐠_new ← c ⋅ 𝐠_old / ‖𝐠_old‖
  – If the norm of the gradient ‖𝐠‖ < c, then the update is 𝐠_new ← 𝐠_old
  – Note that since 𝐠_old / ‖𝐠_old‖ is a unit vector (i.e., it has norm 1), the vector c ⋅ 𝐠_old / ‖𝐠_old‖ has norm c
• Such clipping is the projection of the gradient 𝐠 onto the ball of radius c (see the sketch after this slide)
  – Projection onto the unit ball corresponds to c = 1
• More generally, the projection of a vector 𝐱 onto a set 𝒳 is defined as
    Proj_𝒳(𝐱) = argmin_{𝐱′ ∈ 𝒳} ‖𝐱 − 𝐱′‖₂
• This means that the vector 𝐱 is projected onto the closest vector 𝐱′ that belongs to the set 𝒳
• For example, in the figure, the blue circle represents a convex set 𝒳
  – The points inside the circle project to themselves
    o E.g., if 𝐱 is the yellow vector, its closest point 𝐱′ in the set 𝒳 is itself: the distance between 𝐱 and 𝐱′ is ‖𝐱 − 𝐱′‖ = 0
  – The points outside the circle project to the closest point inside the circle
    o E.g., if 𝐱 is the yellow vector, its closest point 𝐱′ in the set 𝒳 is the red vector
    o Among all vectors in the set 𝒳, the red vector 𝐱′ has the smallest distance ‖𝐱 − 𝐱′‖₂ to 𝐱
Picture from: http://d2l.ai/chapter_optimization/convexity.html
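A minimal NumPy sketch of the clipping rule above, i.e., projecting a gradient vector onto the ball of radius c; the gradient values are arbitrary examples:

```python
import numpy as np

def clip_gradient(g, c):
    """Project g onto the ball of radius c: rescale only if ||g|| >= c."""
    norm = np.linalg.norm(g)
    if norm >= c:
        return c * g / norm      # unit vector g/||g|| scaled to norm c
    return g                     # already inside the ball: leave unchanged

g = np.array([3.0, 4.0])         # ||g|| = 5
print(clip_gradient(g, c=1.0))   # [0.6 0.8], norm = 1
print(clip_gradient(g, c=10.0))  # unchanged, since norm < 10
```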
Optimization
• Example: for the subset of negative real numbers ℝ⁻ = {𝑥 ∈ ℝ: 𝑥 < 0}
  – All positive real numbers and 0 are upper bounds
  – 0 is the least upper bound, and therefore the supremum of ℝ⁻

• The smallest real number 𝜌 that bounds the change |𝑓(𝑥₁) − 𝑓(𝑥₂)| ≤ 𝜌 |𝑥₁ − 𝑥₂| for all points 𝑥₁, 𝑥₂ is the Lipschitz constant of the function 𝑓(𝑥)
  – For a 𝜌-Lipschitz function 𝑓(𝑥), the first derivative 𝑓′(𝑥) is bounded everywhere by 𝜌
• E.g., the function 𝑓(𝑥) = log(1 + exp(𝑥)) is 1-Lipschitz over ℝ
  – Since 𝑓′(𝑥) = exp(𝑥) / (1 + exp(𝑥)) = 1 / (exp(−𝑥) + 1) ≤ 1
  – I.e., 𝜌 = 1
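A quick numerical check (illustrative only) that the softplus derivative stays below 1, consistent with 𝜌 = 1:

```python
import numpy as np

x = np.linspace(-10, 10, 1001)
softplus_grad = 1.0 / (np.exp(-x) + 1.0)   # f'(x) = 1 / (exp(-x) + 1)
print(softplus_grad.max() <= 1.0)          # True: the derivative is bounded by rho = 1
```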
Probability
Slide credit: Jeff Howbert — Machine Learning Math Essentials
Bayes' theorem
• Multiplication rule for the joint distribution: 𝑃(𝑋, 𝑌) = 𝑃(𝑌|𝑋) 𝑃(𝑋)
• By symmetry, we also have: 𝑃(𝑌, 𝑋) = 𝑃(𝑋|𝑌) 𝑃(𝑌)
• The terms are referred to as:
  – 𝑃(𝑋), the prior probability, the initial degree of belief for 𝑋
  – 𝑃(𝑋|𝑌), the posterior probability, the degree of belief after incorporating the knowledge of 𝑌
  – 𝑃(𝑌|𝑋), the likelihood of 𝑌 given 𝑋
  – 𝑃(𝑌), the evidence
• Bayes' theorem: posterior probability = (likelihood × prior probability) / evidence

Independence
• Note that for independent random variables: 𝑃(𝑋, 𝑌) = 𝑃(𝑋) 𝑃(𝑌)
• In all other cases, the random variables are dependent
  – E.g., duration of successive eruptions of Old Faithful
  – Getting a king on successive draws from a deck (the drawn card is not replaced)
• Two random variables 𝑋 and 𝑌 are conditionally independent given another random variable 𝑍 if and only if 𝑃(𝑋, 𝑌|𝑍) = 𝑃(𝑋|𝑍) 𝑃(𝑌|𝑍)
  – This is denoted as 𝑋 ⊥ 𝑌 | 𝑍
Slide credit: Jeff Howbert — Machine Learning Math Essentials
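A tiny worked example of Bayes' theorem in Python; the prior, likelihood, and evidence values are hypothetical numbers chosen only to illustrate the formula:

```python
# Hypothetical values: P(X) prior, P(Y|X) likelihood, P(Y) evidence
prior = 0.01          # P(X)
likelihood = 0.9      # P(Y | X)
evidence = 0.05       # P(Y)

posterior = likelihood * prior / evidence   # Bayes' theorem: P(X | Y)
print(posterior)      # 0.18
```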
Expectation
• The expectation of a function 𝑓(𝑋) with respect to a continuous distribution 𝑃(𝑋) is
    𝔼_{𝑋∼𝑃}[𝑓(𝑋)] = ∫ 𝑃(𝑋) 𝑓(𝑋) d𝑋
  – When the identity of the distribution is clear from the context, we can write 𝔼_𝑋[𝑓(𝑋)]
  – If it is also clear which random variable is used, we can write just 𝔼[𝑓(𝑋)]
• The mean is the most common measure of central tendency of a distribution
  – For a discrete random variable: 𝑓(𝑋ᵢ) = 𝑋ᵢ ⇒ µ = 𝔼[𝑋] = Σᵢ 𝑃(𝑋ᵢ) ⋅ 𝑋ᵢ
  – This is similar to the mean of a sample of observations: µ = (1/𝑛) Σᵢ 𝑋ᵢ
  – Other measures of central tendency: median, mode
Slide credit: Jeff Howbert — Machine Learning Math Essentials
Variance and covariance
• Variance gives the measure of how much the values of the function 𝑓(𝑋) deviate from the expected value as we sample values of 𝑋 from 𝑃(𝑋)
    Var(𝑓(𝑋)) = 𝔼[(𝑓(𝑋) − 𝔼[𝑓(𝑋)])²]
• When the variance is low, the values of 𝑓(𝑋) cluster near the expected value
• Variance is commonly denoted by 𝜎²
  – For the function 𝑓(𝑋ᵢ) = 𝑋ᵢ − µ, we have 𝜎² = Σᵢ 𝑃(𝑋ᵢ) ⋅ (𝑋ᵢ − µ)²
  – This is similar to the formula for calculating the variance of a sample of observations: 𝜎² = (1/(𝑛−1)) Σᵢ (𝑋ᵢ − µ)²
• The square root of the variance is the standard deviation: 𝜎 = √Var(𝑋)

• Covariance gives the measure of how much two random variables are linearly related to each other
    Cov(𝑓(𝑋), 𝑔(𝑌)) = 𝔼[(𝑓(𝑋) − 𝔼[𝑓(𝑋)]) (𝑔(𝑌) − 𝔼[𝑔(𝑌)])]
• If 𝑓(𝑋ᵢ) = 𝑋ᵢ − µ_𝑋 and 𝑔(𝑌ᵢ) = 𝑌ᵢ − µ_𝑌
  – Then the covariance is: Cov(𝑋, 𝑌) = Σᵢ 𝑃(𝑋ᵢ, 𝑌ᵢ) ⋅ (𝑋ᵢ − µ_𝑋)(𝑌ᵢ − µ_𝑌)
  – Compare to the covariance of actual samples: Cov(𝑋, 𝑌) = (1/(𝑛−1)) Σᵢ (𝑋ᵢ − µ_𝑋)(𝑌ᵢ − µ_𝑌)
• The covariance measures the tendency for 𝑋 and 𝑌 to deviate from their means in the same (or opposite) directions at the same time
  (Figure: scatter plots of 𝑌 vs. 𝑋 illustrating no covariance and high covariance)
Slide credit: Jeff Howbert — Machine Learning Math Essentials
Picture from: Jeff Howbert — Machine Learning Math Essentials
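A short NumPy sketch of the sample formulas above; the data are randomly generated for illustration, and ddof=1 gives the 1/(n−1) normalization used in the sample estimates:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=1000)
Y = 0.5 * X + rng.normal(scale=0.1, size=1000)   # Y is correlated with X

print(np.var(X, ddof=1))        # sample variance with 1/(n-1) normalization
print(np.std(X, ddof=1))        # standard deviation = sqrt(variance)
print(np.cov(X, Y, ddof=1))     # 2x2 sample covariance matrix of X and Y
print(np.corrcoef(X, Y))        # correlation coefficients (close to 1 here)
```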
Probability distributions
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html
Entropy
• For a discrete random variable 𝑋 that follows a probability distribution 𝑃 with a probability mass function 𝑃(𝑋), the expected amount of information is given by the entropy (or Shannon entropy)
    𝐻(𝑋) = 𝔼_{𝑋∼𝑃}[𝐼(𝑋)] = −𝔼_{𝑋∼𝑃}[log 𝑃(𝑋)]
• Based on the expectation definition 𝔼_{𝑋∼𝑃}[𝑓(𝑋)] = Σ_𝑋 𝑃(𝑋) 𝑓(𝑋), we can rewrite the entropy as
    𝐻(𝑋) = −Σ_𝑋 𝑃(𝑋) log 𝑃(𝑋)
• If 𝑋 is a continuous random variable that follows a probability distribution 𝑃 with a probability density function 𝑃(𝑋), the entropy is
    𝐻(𝑋) = −∫ 𝑃(𝑋) log 𝑃(𝑋) d𝑋
  – For continuous random variables, the entropy is also called differential entropy
• Intuitively, we can interpret the self-information 𝐼(𝑋) = −log 𝑃(𝑋) as the amount of surprise we have at seeing a particular outcome
  – We are less surprised when seeing a more frequent event
• Similarly, we can interpret the entropy 𝐻(𝑋) = 𝔼_{𝑋∼𝑃}[𝐼(𝑋)] as the average amount of surprise from observing a random variable 𝑋
  – Therefore, distributions that are closer to a uniform distribution have high entropy, since all outcomes are (nearly) equally likely and no single outcome is much less surprising than the others
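A small NumPy sketch computing 𝐻(𝑋) = −Σ 𝑃(𝑋) log 𝑃(𝑋) for two example distributions, showing that the one closer to uniform has higher entropy; the distributions are illustrative:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats: H = -sum p*log(p), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

uniform = [0.25, 0.25, 0.25, 0.25]
peaked  = [0.97, 0.01, 0.01, 0.01]
print(entropy(uniform))   # log(4) ~ 1.386, the maximum for 4 outcomes
print(entropy(peaked))    # much smaller
```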
Cross-entropy and maximum likelihood
• Cross-entropy between distributions 𝑃 and 𝑄: 𝐶𝐸(𝑃, 𝑄) = −𝔼_{𝑋∼𝑃}[log 𝑄(𝑋)]
• In machine learning, let's assume a classification problem based on a set of data examples 𝑥₁, 𝑥₂, …, 𝑥ₙ that need to be classified into k classes
  – For each data example 𝑥ᵢ we have a class label 𝑦ᵢ
  – The true labels 𝐲 follow the true distribution 𝑃
  – The goal is to train a classifier (e.g., a NN) parameterized by 𝜃 that outputs a predicted class label ŷᵢ for each data example 𝑥ᵢ
  – The predicted labels 𝐲̂ follow the estimated distribution 𝑄
• The cross-entropy loss between the true distribution 𝑃 and the estimated distribution 𝑄 is calculated as
    𝐶𝐸(𝐲, 𝐲̂) = −𝔼_{𝑋∼𝑃}[log 𝑄(𝑋)] = −Σ_𝑋 𝑃(𝑋) log 𝑄(𝑋) = −Σᵢ 𝑦ᵢ log ŷᵢ
• The further apart the true and estimated distributions are, the greater the cross-entropy loss is
  – I.e., for some data examples the predicted class ŷⱼ will be different from the true class 𝑦ⱼ, but the goal is to find 𝜃 that results in an overall maximum probability
• From Bayes' theorem, argmax 𝑃(model | data) is proportional to argmax 𝑃(data | model)
    𝑃(𝜃 | 𝑥₁, 𝑥₂, …, 𝑥ₙ) = 𝑃(𝑥₁, 𝑥₂, …, 𝑥ₙ | 𝜃) 𝑃(𝜃) / 𝑃(𝑥₁, 𝑥₂, …, 𝑥ₙ)
  – This is true since 𝑃(𝑥₁, 𝑥₂, …, 𝑥ₙ) does not depend on the parameters 𝜃
  – Also, we can assume that we have no prior preference for one set of parameters 𝜃 over any other
• Recall that 𝑃(data | model) is the likelihood; therefore, the maximum likelihood estimate of 𝜃 is based on solving
    argmax_𝜃 𝑃(𝑥₁, 𝑥₂, …, 𝑥ₙ | 𝜃)
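A minimal sketch of the cross-entropy loss −Σᵢ 𝑦ᵢ log ŷᵢ for a single example with a one-hot true label; the probability values are made up for illustration:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])           # one-hot true label (class 1)
y_pred = np.array([0.1, 0.7, 0.2])           # predicted class probabilities

eps = 1e-12                                  # avoid log(0) for numerical safety
ce = -np.sum(y_true * np.log(y_pred + eps))  # cross-entropy loss
print(ce)                                    # -log(0.7) ~ 0.357
```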
Next time
• Module 2: Understanding Data and Feature Engineering in ML
– Part 2.1: Research Steps: Problem Formulation and Design Cycle
– Part 2.2: Data Collection and Creation
– Part 2.3: Feature Engineering
• Data Preprocessing and Exploratory Data Analysis
• Data Visualization; Dimension Reduction; Feature Extraction: PCA, SVD; Feature Selection
– Part 2.4: Evaluation Parameters
• Accuracy, Precision, Recall
– Part 2.5: Python for Feature Engineering