
COMP 3057 – Introduction to Artificial Intelligence and Machine Learning
Lecture 2 – Applied Mathematics
Dr. Umair Zuneera
In this Lecture …
 Why learn Mathematics for Machine Learning?
 Concepts in Linear Algebra
 Scalars, Vectors, Matrices, Tensors
 Matrix operations
 Norms

 Eigen Decomposition
 Singular Value Decomposition
 Principal Component Analysis
 Concepts in Probability theory
 Machine Learning terminologies
Introduction
 Machine learning is inherently data driven
 Machine Learning automates the process of Data
Analysis and makes data-informed predictions in
real-time without any human intervention
 Learning can be understood as a way to automatically find patterns and structure in data by optimizing the parameters of the model.
Introduction

 While not all data is numerical, it is often useful to consider data in a numerical format.
 We think of data as vectors
Introduction

 In this lecture, we will cover some basic mathematics (Linear Algebra, Probability, Optimization, etc.), as this is the foundation for learning any machine learning technique
 If you want to understand machine learning algorithms in depth, then a good grasp of Linear Algebra will help you a lot.
What is Linear Algebra?

 We represent numerical data as vectors and represent a table of such data as a matrix
 The study of vectors and matrices is called linear algebra
Linear Algebra

 Linear Algebra is a branch of mathematics that deals with linear equations such as:
a₁x₁ + a₂x₂ + ... + aₙxₙ = b
 In vector notation, we say aᵀx = b
 We call this a linear transformation of x.

Linear Algebra
Scalar
 A scalar is a number or 0th order tensor.
 Examples: temperature, distance, speed, or mass.
 Here all the quantities have a magnitude but no
“direction”, other than the fact that it may be
positive or negative.
 We have dealt with scalars all our lives; most everyday calculations are on scalar numbers.
Scalar
 Usually represented as lower-case variable names
written in italics:
 For e.g., let s ∈ ℝ be the slope of the line (a real-valued scalar)
 For e.g., let n ∈ ℕ be the number of units (a natural-number scalar)
 There are multiple scalar types in Python, such as int, float, complex, bytes, Unicode.
Scalar - Tutorial

a = 2          # Scalar 1
b = 5          # Scalar 2
print(a + b)   # 7    Addition
print(a - b)   # -3   Subtraction
print(a * b)   # 10   Multiplication
print(a / b)   # 0.4  Division
Vector
 A vector is a list of numbers or a 1st order tensor. There are two ways in which you can interpret what this means:
 A point in space, where each number represents the vector's component along a dimension.
 A magnitude and a direction. In this way, a vector is an arrow pointing from the origin to the endpoint given by the list of numbers.
Vector
 An example of a vector is x = [x₁, x₂, ..., xₙ]
 Each element of the vector is identified as x₁, x₂, and so on
 If each element of x belongs to ℝ, and the vector has n elements, then it is denoted as x ∈ ℝⁿ
Vector Addition and Subtraction
 Vectors can be added and subtracted. We can
think of it as adding two line segments end-to-
end, maintaining distance and direction.
 a = [4, 3], b = [1, 2]
 c=a+b
 c = [4, 3] + [1, 2]
 c = [4+1, 3+2]
 c = [5, 5]
 This is how vector addition works; similarly, you can do vector subtraction as well.
Vector Multiplication
 Dot product (scalar product)
 The dot product generates a scalar value from the product of two vectors.

 Cross product
 The cross product generates a vector from the
product of two vectors.
Vector Multiplication (Dot Product)

 Dot product (scalar product)


 The dot product generates a scalar value from the product of two vectors.
Vector Multiplication (Cross Product)

 Cross product
 The cross product generates a vector from the
product of two vectors.
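A short NumPy sketch of both products (the 3-D vectors here are illustrative assumptions, not from the slides):

import numpy as np

a = np.array([1, 0, 0])
b = np.array([0, 1, 0])

print(np.dot(a, b))    # 0        dot product: a scalar
print(np.cross(a, b))  # [0 0 1]  cross product: a vector perpendicular to a and b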
Matrices

 A matrix, like a vector, is a collection of numbers. The difference is that a vector is like a list, whereas a matrix is like a table.
 It is a 2-D array of numbers (each element is identified by two indices)
 A matrix is represented by an upper-case bold typeface.
Matrices

 An m×n matrix has m rows and n columns, and therefore m·n elements.
 If a real-valued matrix A has a height of m and a width of n, then it is represented as A ∈ ℝᵐˣⁿ
Matrix Transpose

 The matrix transpose is a flipped version of the original matrix across its main diagonal: (Aᵀ)ᵢ,ⱼ = Aⱼ,ᵢ
Tensors

 An array with more than two axes
 Variable number of axes
 We denote a tensor named "A" with the typeface A
 The element of A at coordinates (i, j, k) is written Aᵢ,ⱼ,ₖ
Multiplying Matrices and Vectors

 For matrices A and B to multiply, the number of columns of A must be equal to the number of rows of B
 If A is of shape m×n and B is of shape n×p, then C = AB is of shape m×p
 The product operation is defined by Cᵢ,ⱼ = Σₖ Aᵢ,ₖ Bₖ,ⱼ
Multiplying Matrices and Vectors
 Hadamard Product:
 The element-wise product, denoted by A ⊙ B
 Dot product:
 The dot product between two vectors x and y of the same dimensionality is the matrix product xᵀy
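A small NumPy sketch of these operations (the matrices and vectors are illustrative assumptions); it also shows the transpose mentioned earlier:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)   # matrix product:   [[19 22] [43 50]]
print(A * B)   # Hadamard product: [[ 5 12] [21 32]]
print(A.T)     # transpose:        [[1 3] [2 4]]

x = np.array([1, 2])
y = np.array([3, 4])
print(x @ y)   # dot product x^T y = 11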
Norms
 Machine learning uses tensors as the basic units
of representation
 Vectors, Matrices, etc.
 Two reasons to use norms:
1. To estimate how “big” a vector/tensor is
(Norms can be thought of as a mapping from a
vector/tensor to a single number -> scalar (+ve
number))
For example, v = (3, 4), ||v|| = √(3² + 4²) = 5
Norms

2. To estimate "how close" one tensor is to another
 For e.g. how similar two images are to each other
Norms
 Norm is a generalization of the notion of “length” to vectors,
matrices and tensors
 Mathematically, a norm is any function f that satisfies:
1. f(x) = 0 implies x = 0
Norms
2. f(x + y) ≤ f(x) + f(y) (the triangle inequality)
3. f(αx) = |α| f(x) for any scalar α (linearity); e.g. scaling v = (v₁, v₂) by 2 gives (2v₁, 2v₂), and the norm doubles
Some standard Norms

 Euclidean Norm: ||x||₂ = √(Σᵢ xᵢ²)
 Also called the 2-norm or the L² norm
 Corresponds to our usual notion of distance
 1-norm or L¹ norm: ||x||₁ = Σᵢ |xᵢ|
Some standard Norms

 p-norm or Lᵖ norm: ||x||ₚ = (Σᵢ |xᵢ|ᵖ)^(1/p), for p ≥ 1
 ∞-norm or max norm: ||x||∞ = maxᵢ |xᵢ|
Some standard Norms

 Matrices: Frobenius Norm: ||A||F = √(Σᵢ,ⱼ Aᵢ,ⱼ²)
 For e.g., for A = [[1, 2], [3, 4]], ||A||F = √(1 + 4 + 9 + 16) = √30 ≈ 5.48
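A small NumPy check of these norms (assuming the v = (3, 4) and A = [[1, 2], [3, 4]] examples above):

import numpy as np

v = np.array([3, 4])
A = np.array([[1, 2], [3, 4]])

print(np.linalg.norm(v, 2))       # 5.0    Euclidean / L2 norm
print(np.linalg.norm(v, 1))       # 7.0    L1 norm
print(np.linalg.norm(v, np.inf))  # 4.0    max / infinity norm
print(np.linalg.norm(A, 'fro'))   # 5.477  Frobenius norm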
Eigen decomposition

 Sometimes we can understand things better by


breaking them apart.
 Decomposing them into their constituent parts,
allowing us to find the non-obvious and universal
properties.
 Eigen decomposition provides us with a tool to
decompose a matrix by discovering the
eigenvalues and the eigenvectors.
Eigen Decomposition

 For example, a matrix is singular if and only if any of its eigenvalues is zero.
 (Recall: a singular matrix is a matrix with determinant = 0)
Eigen decomposition

 Multiplying a matrix by a vector can also be interpreted as a linear transformation
 In most cases, this transformation will change the direction of the vector.
 Let's assume we have a matrix (A) and a vector (v), which we can multiply: Av
 The matrix changed the direction of the vector
Eigen decomposition
 When performing eigen decomposition, we are looking for a vector whose direction will not be changed by the matrix-vector multiplication; only its magnitude will be scaled up or down.
 Therefore the effect of the matrix on the vector is
the same as the effect of a scalar on the vector.
Eigen decomposition

 This effect can be described, more formally, by the fundamental eigenvalue equation: Av = λv
 After rearranging and factoring the vector (v) out, we get the following equation: (A − λI)v = 0
Eigen decomposition

 We first need to discover the scalars (λ) that shift the matrix (A) just enough to make the matrix-vector multiplication equal zero
 Geometrically, we find a matrix that squishes space into a lower
dimension with an area or volume of zero. We can achieve this
squishing effect when the matrix determinant equals zero.

 The scalars (λ) we have to discover are called the eigenvalues


 We can envision the eigenvalues as some kind of keys
unlocking the matrix to get access to the eigenvectors.
Eigen Decomposition – Eigen Values

 we need to find the eigenvalues before we can unlock


the eigenvectors.
 An M x M matrix has M eigenvalues
and M eigenvectors — each eigenvalue has a related
eigenvector, which is why they come in pairs.
Eigen Decomposition – Eigen Values

 Imagine we have a 2 by 2 matrix and we want to compute the eigenvalues
 By using the equation we derived earlier, we can calculate the characteristic polynomial det(A − λI) = 0 and solve for the eigenvalues.
 Here the eigenvalues come out as λ₁ = 3 and λ₂ = 2, meaning the matrix scales the associated eigenvectors by factors of 3 and 2 respectively.
Eigen Decomposition – Eigen
Vectors
 We can use the eigenvalues we discovered earlier to reveal the eigenvectors.
 Based on the fundamental eigenvalue equation, we simply plug in the i-th eigenvalue and retrieve the i-th eigenvector from the null space of the shifted matrix (A − λᵢI).
Eigen Decomposition – Eigen
Vectors
 Let's continue our example from before and use the already discovered eigenvalues λ₁ = 3 and λ₂ = 2, starting with λ₁:
Eigen Vectors

 We inserted our first eigenvalue into the equation (1), shifted the matrix by that value (2), and finally solved for the eigenvector (3). The eigenvector for that eigenvalue is whatever vector spans the resulting null space.
 To reconstruct the original matrix (A), we can use the following equation: A = V diag(λ) V⁻¹, where the columns of V are the eigenvectors and diag(λ) is the diagonal matrix of eigenvalues.
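The slides' specific 2×2 matrix is not reproduced here, so the NumPy sketch below assumes A = [[4, -2], [1, 1]], which happens to have the eigenvalues 3 and 2 mentioned above:

import numpy as np

A = np.array([[4.0, -2.0],
              [1.0,  1.0]])

eigvals, V = np.linalg.eig(A)     # eigenvalues and eigenvectors (columns of V)
print(eigvals)                    # e.g. [3. 2.] (order may vary)

# Reconstruct A = V diag(lambda) V^-1
A_rebuilt = V @ np.diag(eigvals) @ np.linalg.inv(V)
print(np.allclose(A, A_rebuilt))  # True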
Singular Value Decomposition (SVD)

 The Singular Value Decomposition (SVD) of a


matrix is a factorization of that matrix into three
matrices.
 It has some interesting algebraic properties and
conveys important geometrical and theoretical
insights about linear transformations.
 The SVD of matrix A is given by the formula: A = U D Vᵀ,
where U is defined to be an m × m matrix, D to be an m × n matrix, and V to be an n × n matrix.
 Remark: U and V are both orthogonal matrices, and D is a diagonal matrix.
Singular Value Decomposition (SVD)
A = U D Vᵀ
 D: The elements along the diagonal of D are known as the singular values of the matrix A, where the nonzero singular values of A are the square roots of the eigenvalues of AᵀA (the same is true for AAᵀ);
 U: The columns of U are known as the left-singular vectors of A, i.e. eigenvectors of AAᵀ;
 V: The columns of V are known as the right-singular vectors of A, i.e. eigenvectors of AᵀA.
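A minimal NumPy sketch of the factorization (the 3×2 matrix is an illustrative assumption):

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])         # a 3x2 matrix

U, s, Vt = np.linalg.svd(A)        # s holds the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)   # embed the singular values in an m x n matrix

print(np.allclose(A, U @ D @ Vt))  # True: A = U D V^T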
Principal Component Analysis
 The goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a data set consisting of many variables correlated with each other, either heavily or lightly, while retaining the variation present in the dataset to the maximum extent.
 Principal Component Analysis can supply the user
with a lower-dimensional picture, a projection or
"shadow" of this object when viewed from its most
informative viewpoint.
Principal Component Analysis

 We need to find a direction (a vector, call it u) onto which to project the data so as to minimize the projection error
Principal Component Analysis

 Reduce from n dimensions to k dimensions: find k vectors onto which to project the data, so as to minimize the projection error.
Principal Component Analysis
 Target: to reduce n dimensions to k dimensions
 Step 1: Compute the covariance matrix Σ = (1/m) Σᵢ x⁽ⁱ⁾ (x⁽ⁱ⁾)ᵀ
 Step 2: Compute the eigenvectors of Σ (you can use techniques such as eigendecomposition or Singular Value Decomposition), giving U ∈ ℝⁿˣⁿ
 Keep the first k columns of U (call it Uₖ) as the directions to project onto
Principal Component Analysis

 The basic transformation is from x ∈ ℝⁿ to z ∈ ℝᵏ: z = Uₖᵀ x
Principal Component Analysis
(Summary)
 Find the covariance matrix Σ
 Then we find its eigenvectors U (and keep the first k columns, Uₖ)
 Reduced coordinates: z = Uₖᵀ x
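A hedged sketch of these steps in NumPy (the toy data matrix is an assumption; in practice a library such as sklearn.decomposition.PCA would normally be used):

import numpy as np

# Toy data: m samples (rows), n features (columns)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
X = X - X.mean(axis=0)            # center the data

Sigma = (X.T @ X) / X.shape[0]    # Step 1: covariance matrix
U, S, Vt = np.linalg.svd(Sigma)   # Step 2: eigenvectors of Sigma (columns of U)

k = 1
U_k = U[:, :k]                    # keep the first k columns
Z = X @ U_k                       # reduced coordinates z = U_k^T x for each sample
print(Z.shape)                    # (6, 1)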
Expected Value
 The expectation, or expected value, of some function f(x) with respect to a probability distribution P(x) is the average, or mean value, that f takes on when x is drawn from P.
 For a discrete random variable X:
E[X] = Σᵢ₌₁ⁿ Xᵢ P(Xᵢ) = X₁P(X₁) + X₂P(X₂) + ... + XₙP(Xₙ)
Variance

 For a discrete probability distribution, the variance can be computed by:
σ² = Variance = Σᵢ₌₁ⁿ [Xᵢ − E(X)]² P(Xᵢ)
Covariance

 Covariance measures the direction of the


relationship between two variables.
 A positive covariance means that both variables
tend to be high or low at the same time. A
negative covariance means that when one
variable is high, the other tends to be low.
 Covariance of x and y: Cov(x, y) = E[(x − E[x])(y − E[y])]
 Covariance matrix of the vector x: Cov(x)ᵢ,ⱼ = Cov(xᵢ, xⱼ), with the variances Var(xᵢ) on the diagonal
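A small NumPy sketch of these statistics on assumed toy data:

import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
probs  = np.array([0.1, 0.2, 0.3, 0.4])       # a discrete distribution P(X)

mean = np.sum(values * probs)                 # E[X] = 3.0
var  = np.sum((values - mean) ** 2 * probs)   # variance = 1.0
print(mean, var)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(np.cov(x, y))   # 2x2 covariance matrix of x and y (positive covariance)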
Multivariate Normal/Gaussian
Distribution
 The multivariate Gaussian distribution is mainly used for anomaly detection, i.e. finding if there are any erroneous data points
 The multivariate Gaussian distribution is given by:
p(x; μ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
 Where x is the vector, μ is the mean, and Σ is the covariance matrix of the vectors
Multivariate Normal/Gaussian
Distribution
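A hedged check of the density formula using SciPy (the mean and covariance here are assumptions for illustration):

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

x = np.array([0.5, -0.3])
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # density p(x; mu, Sigma)

# Anomaly detection idea: flag x as anomalous when p(x) falls below a threshold epsilon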
Exponential Distributions
 Exponential distributions have a sharp peak at x = 0:
p(x; λ) = λ exp(−λx) if x ≥ 0, and 0 otherwise.
Laplace Distributions
 Laplace distribution:
 The probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point μ:
Laplace(x; μ, γ) = (1 / (2γ)) exp(−|x − μ| / γ)
where x is the random variable, μ is the location parameter, and γ is the scale parameter.
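A short NumPy sketch for drawing samples from these two distributions (the parameter values are assumptions):

import numpy as np

rng = np.random.default_rng(0)

exp_samples = rng.exponential(scale=1.0, size=1000)        # scale = 1/lambda
lap_samples = rng.laplace(loc=0.0, scale=1.0, size=1000)   # loc = mu, scale = gamma

print(exp_samples.mean())   # roughly 1.0 (= 1/lambda)
print(lap_samples.mean())   # roughly 0.0 (= mu)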
What is Machine Learning

“A computer program is said to learn from


experience E with respect to some task T and some
performance measure P, if its performance on T, as
measured by P, improves with experience E.”
(Tom Mitchell, 1997)
Machine Learning – Task (T)

 Classification
 Regression
 Machine Translation
 Anomaly Detection
 Denoising
 Density Estimation (or Probability Mass Function Estimation)

Machine Learning – Performance
Measure P
 Accuracy
 Error Rate
 We prefer to know how well a machine
learning algorithm performs on data
that it has not seen before.
 We evaluate these performance
measures using a test set of data that is
separate from the data (i.e. training set
of data) used for training the machine
learning system.
Machine Learning – Experience E

 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning
 More learning paradigms, e.g.
 Semi-supervised Learning
 Active Learning
 Transfer Learning
Capacity, Overfitting and
Underfitting
 The ability of a machine to perform well on
previously unobserved inputs, i.e. test data, is
called generalization.
 Generalization refers to your model's ability to
adapt properly to new, previously unseen data,
drawn from the same distribution as the one used
to create the model.
Capacity, Overfitting and
Underfitting
 In this example: the model will attempt to learn
the relationship on the training data and can be
evaluated on the test data.
 In this case, 70% of the data is used for training
and 30% for testing
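A minimal sketch of such a 70/30 split using scikit-learn (the feature matrix X and labels y are assumed placeholders):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)
y = np.arange(10)                  # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # 70% train, 30% test

print(len(X_train), len(X_test))   # 7 3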
Capacity, Overfitting and
Underfitting
 Overfitting:
 Overfitting happens when a model learns the detail
and noise in the training data to the extent that it
negatively impacts the performance of the model on
new data
 This means that the noise or random fluctuations in
the training data is picked up and learned as concepts
by the model.
 The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Capacity, Overfitting and
Underfitting
 Underfitting
 Underfitting refers to a model that can neither model the
training data nor generalize to new data.
 An underfit machine learning model is not a suitable model
and will be obvious as it will have poor performance on the
training data.
 Underfitting is often not discussed as it is easy to detect
given a good performance metric. The remedy is to move
on and try alternate machine learning algorithms.
Nevertheless, it does provide a good contrast to the
problem of overfitting.
A Good Fit in Machine Learning
 Ideally, you want to select a model at the good
spot between underfitting and overfitting.
 The good spot is the point just before the error on
the test dataset starts to increase where the
model has good skill on both the training dataset
and the unseen test dataset.
What is Optimization?

 Whenever you face a situation in which you have a lot of options to select from, each option has a 'cost' associated with it, and you want to make an optimal choice
 You wish to pick the option with the:
 Minimum cost
 Maximum reward
 You can convert a max problem to a min problem by negating the objective (maximizing f(x) is the same as minimizing −f(x))
What is Optimization?

 In the previous example, we only saw 6 options to choose from
 What if the options are infinite?
 In a real-life situation, you want to pick a value x which minimizes the cost function f(x)
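A hedged sketch of one way to do this numerically: plain gradient descent on an assumed cost function f(x) = (x − 3)² (the function and step size are illustrative, not from the slides):

# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3
def f(x):
    return (x - 3) ** 2

def grad_f(x):
    return 2 * (x - 3)

x = 0.0            # initial guess
lr = 0.1           # learning rate (step size)
for _ in range(100):
    x = x - lr * grad_f(x)   # step in the direction of decreasing cost

print(x, f(x))     # x is close to 3, f(x) is close to 0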
What is Optimization?

 In 3D …
 As the dimensions increase, it is harder to select
options
What is Optimization
3 Important Variables in
Optimization
1. Decision variable
2. Cost function
3. Constraints:
 Equality constraints: g(x) = 0
 Inequality constraints: h(x) ≤ 0
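A hedged sketch of a constrained minimization using SciPy (the cost function and constraints below are illustrative assumptions, not the slides' example):

import numpy as np
from scipy.optimize import minimize

# Minimize f(x) = x1^2 + x2^2 subject to x1 + x2 = 1 (equality) and x1 >= 0 (inequality)
cost = lambda x: x[0] ** 2 + x[1] ** 2
constraints = [
    {'type': 'eq',   'fun': lambda x: x[0] + x[1] - 1},   # equality: x1 + x2 - 1 = 0
    {'type': 'ineq', 'fun': lambda x: x[0]},              # inequality: x1 >= 0
]

result = minimize(cost, x0=np.array([0.0, 0.0]), constraints=constraints)
print(result.x)   # approximately [0.5, 0.5]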
What is Optimization?
Linear Program
References
 Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), Chapter 2, MIT Press, 2016.
 Mike X Cohen, PhD. Linear Algebra: Theory, Intuition, Code.
 3Blue1Brown, Eigenvectors and eigenvalues.
 Prof. Andrew Ng's YouTube Lectures.
