
Maths for Machine Learning

Revision notes

Table of Contents

Equations of a straight line
Parallel and perpendicular lines, hyperplane, vector form of a hyperplane
Vectors, halfspaces
Transpose, dot product, unit vector
Distance between two points, norm, angle between two vectors, projection, intersection
Distance between: hyperplane and origin, point and hyperplane, parallel hyperplanes; circle
Rotating coordinate axes, limit, function
Important functions for ML
Continuity, tangent and derivative
Finding optima
Partial derivative
Gradient descent
Variants of Gradient descent
Constrained optimization, method of Lagrange multipliers
Principal component analysis
Optimization without Gradient descent
Eigenvector and eigenvalue
● Equations of a straight line:

1. Slope-intercept form of a straight line is given as: y = mx + c

where m is the slope of the line and c is the y-intercept.

Also, m = tan θ, where θ is the angle which the line makes with the positive x-axis.

2. Point-slope form: y − y1 = m(x − x1)

where m is the slope of the line and (x1, y1) are the coordinates of a point on the line.

3. Two-point form: y − y1 = ((y2 − y1) / (x2 − x1)) (x − x1)

where (x1, y1) and (x2, y2) are the coordinates of two points on the line.

4. Intercept form: x/a + y/b = 1

where a and b are the intercepts of the line on the x-axis and y-axis respectively.

5. General form: ax + by + c = 0

where a, b, and c are real numbers.
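As a quick illustration, here is a minimal Python sketch that builds the slope-intercept form from two points using the two-point form above (the point values are arbitrary, chosen only for the example):

```python
# Hypothetical example: build the line through two chosen points
# using the two-point and slope-intercept forms described above.
def line_through(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)      # slope from the two-point form
    c = y1 - m * x1                # y-intercept, so y = m*x + c
    return m, c

m, c = line_through((1, 3), (4, 9))
print(m, c)            # 2.0 1.0 -> the line y = 2x + 1
print(m * 2 + c)       # y-value on the line at x = 2 -> 5.0
```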

● Two lines are parallel to each other if their slopes are equal.
Let’s consider two lines y = m1x + c1 and y = m2x + c2.

The above two lines are parallel if m1 = m2.

The above two lines are perpendicular to each other if m1 · m2 = −1.

● A hyperplane is a linear surface in n dimensions.

The general equation of a hyperplane is given as:

w1x1 + w2x2 + … + wnxn + w0 = 0

where w1, w2, …, wn are called the weights/coefficients and x1, x2, …, xn are the features.

The equation of a plane in 3-D is given as: w1x1 + w2x2 + w3x3 + w0 = 0

● Vector form of a hyperplane is: wᵀx + w0 = 0

where w = [w1, w2, …, wn]ᵀ and x = [x1, x2, …, xn]ᵀ.

● Vectors can be interpreted as coordinates as well as a line segment from the origin to the
coordinate.
For example, a vector v = [x, y, z]ᵀ can be considered as the coordinates of a point P(x, y, z) as
well as a line segment from the origin to the point P(x, y, z).

● Half-spaces: In geometry, a half-space is either of the two parts into which a plane divides
three-dimensional Euclidean space (more generally, into which a hyperplane divides the space).

Example: Let’s assume that a hyperplane wᵀx + w0 = 0 is classifying the data points of
two different classes in a space.

Let’s say we have a point x0 in the space.

Now, if:

wᵀx0 + w0 > 0 => the point is in the +ve halfspace

wᵀx0 + w0 < 0 => the point is in the -ve halfspace.
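A minimal NumPy sketch of this sign test (the weights w, bias w0, and the query point are arbitrary values chosen for illustration):

```python
import numpy as np

# Hypothetical hyperplane w^T x + w0 = 0 and a query point x0.
w = np.array([2.0, -1.0, 0.5])
w0 = -1.0
x0 = np.array([1.0, 0.5, 2.0])

score = w @ x0 + w0          # signed value of the hyperplane equation at x0
if score > 0:
    print("x0 lies in the +ve halfspace")
elif score < 0:
    print("x0 lies in the -ve halfspace")
else:
    print("x0 lies on the hyperplane")
```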

● The transpose operation changes a column vector into a row vector and vice versa.
For example,

if a = [a1, a2, a3]ᵀ is a column vector, then aᵀ = [a1  a2  a3] is a row vector.

● The dot product of two vectors a and b is given as:

a · b = a1b1 + a2b2 + … + anbn = aᵀb

where a = [a1, a2, …, an]ᵀ and b = [b1, b2, …, bn]ᵀ.

Geometrically, it is the product of the magnitudes of the two vectors and the cosine of the
angle between them,

i.e. a · b = ||a|| ||b|| cos θ

where θ is the angle between the two vectors.

If the dot product of two vectors is zero, then the vectors are perpendicular to each other.

● A unit vector is a vector that has a magnitude of 1.

To convert a vector into a unit vector, we divide the vector by its magnitude,

i.e. unit vector â = a / ||a||

➔ We can multiply any scalar value with the unit vector to get the desired magnitude
(equal to that scalar value) in the same direction.
➔ All vectors with the same unit vector are parallel
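A small NumPy sketch of these operations (the vectors are arbitrary illustrative values):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 1.0])

dot = a @ b                      # dot product a . b = 3*2 + 4*1 = 10
norm_a = np.linalg.norm(a)       # magnitude ||a|| = 5
a_hat = a / norm_a               # unit vector in the direction of a
print(dot, norm_a, a_hat)        # 10.0 5.0 [0.6 0.8]
print(np.linalg.norm(a_hat))     # 1.0, as expected for a unit vector
```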

● The distance between two points having coordinates (x1, y1) and (x2, y2) in the x-y plane is

given as: d = √((x2 − x1)² + (y2 − y1)²)

● The norm or magnitude of a vector is calculated by taking the square root of its dot product with
itself,

i.e. ||a|| = √(a · a) = √(a1² + a2² + … + an²)

It represents the length of a vector, or the distance of the coordinate from the origin.

● The angle between two vectors is given as: cos θ = (a · b) / (||a|| ||b||)

● Projection of vector a on b = (a · b) / ||b||
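Continuing the NumPy sketch above, the angle between two vectors and the projection length can be computed as follows (same illustrative vectors):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 1.0])

cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.arccos(cos_theta)                 # angle between a and b, in radians
proj_a_on_b = (a @ b) / np.linalg.norm(b)    # scalar length of the projection of a on b
print(np.degrees(theta), proj_a_on_b)        # ~26.57 degrees, ~4.47
```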

● At the point of intersection of two lines, both lines pass through the same point, so its coordinates satisfy both equations.

Example: Let’s say we have two lines, y = x+2 and y = 2x+1.


We need to find the point of intersection of these two lines.
We assume that the lines intersect at a single point (a,b).

Therefore, this point will satisfy both the line’s equations.
i.e. b = a + 2 — (i)
b = 2a + 1 — (ii)
Solving the above two equations, we get a = 1 and b = 3.
Therefore, the given two lines intersect at the point (1,3).
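The same intersection can be found numerically by writing both equations in the form ax + by = c and solving the 2×2 linear system (a short NumPy sketch of this worked example):

```python
import numpy as np

# y = x + 2   ->  -x + y = 2
# y = 2x + 1  -> -2x + y = 1
A = np.array([[-1.0, 1.0],
              [-2.0, 1.0]])
c = np.array([2.0, 1.0])

x, y = np.linalg.solve(A, c)
print(x, y)   # 1.0 3.0 -> the lines intersect at (1, 3)
```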

● Distance of a hyperplane from the origin:

Let’s assume a hyperplane wᵀx + w0 = 0.

Its distance from the origin is given as: d = |w0| / ||w||

● Distance of a point p from a hyperplane wᵀx + w0 = 0 is given as:

d = |wᵀp + w0| / ||w||

i.e. just substitute the point into the hyperplane’s equation and divide by the square root of the
sum of the squared coefficients (the norm of the w vector).

● Distance between two parallel hyperplanes

Given two parallel hyperplanes wᵀx + b1 = 0 and wᵀx + b2 = 0,
the distance between them is given as: d = |b1 − b2| / ||w||
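A short NumPy sketch of these distance formulas (the hyperplane coefficients and the point are arbitrary illustrative values):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -10.0     # hyperplane 3x + 4y - 10 = 0
p = np.array([2.0, 5.0])                # an arbitrary point

dist_origin = abs(w0) / np.linalg.norm(w)          # distance of hyperplane from origin
dist_point = abs(w @ p + w0) / np.linalg.norm(w)   # distance of p from the hyperplane

b1, b2 = -10.0, 5.0                                # two parallel hyperplanes w^T x + b = 0
dist_parallel = abs(b1 - b2) / np.linalg.norm(w)

print(dist_origin, dist_point, dist_parallel)      # 2.0 3.2 3.0
```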

● The equation of a circle in the x-y plane is given as:

(x − a)² + (y − b)² = r²

where (a, b) are the coordinates of the center of the circle and r is the radius of the circle.

Therefore, a circle with its center at the origin is given as: x² + y² = r²

Points inside the circle give negative values when substituted into the circle equation
(x² + y² − r²), and points outside the circle give positive values.

● Let's say we initially have a coordinate system x-y and a point P(x0, y0) in this system.

If the coordinate system is rotated by an angle θ in the anti-clockwise direction, then the
coordinates of point P with respect to the new coordinate system will be:

x0′ = x0 cos θ + y0 sin θ and y0′ = −x0 sin θ + y0 cos θ
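A quick NumPy check of this rotation formula (the point and the angle are arbitrary):

```python
import numpy as np

x0, y0 = 1.0, 2.0
theta = np.deg2rad(90)   # rotate the axes 90 degrees anti-clockwise

x0_new = x0 * np.cos(theta) + y0 * np.sin(theta)
y0_new = -x0 * np.sin(theta) + y0 * np.cos(theta)
print(round(x0_new, 6), round(y0_new, 6))   # 2.0 -1.0
```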

● A limit is the value toward which an expression converges as one or more variables approach
certain values. It is denoted as:

lim(x→a) f(x) = L

A left-hand limit is the limit of a function as the variable approaches from the left-hand side. It is
denoted as: lim(x→a⁻) f(x)

On the other hand, a right-hand limit is the limit of a function as the variable approaches from the
right-hand side: lim(x→a⁺) f(x)

● A Function is a relationship between inputs and outputs where each input is related to exactly
one output.

The domain of a function is the set of input values for f, in which the function is real and
defined.
The set of all the outputs of a function is known as the range of the function.

● Some important functions for Machine Learning:

Function Name    Domain                    Range
Hyperbola        R − {0}                   R − {0}
Exponent         R                         R⁺
Modulus          R                         R⁺ (i.e. [0, ∞))
Exponential      R                         R⁺ (i.e. (0, ∞))
Sigmoid          R                         (0, 1)
sine             R                         [−1, 1]
tangent          R − {(2n+1)π/2, n ∈ Z}    R

(Plots omitted.)
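As an illustration, the sigmoid function from this table can be implemented and checked numerically (a minimal NumPy sketch):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))   # values approach 0 for large negative x and 1 for large positive x
```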

● f(x) is continuous at a point x = a if

lim(x→a⁻) f(x) = lim(x→a⁺) f(x) = f(a)

Example:

[Figure 1 and Figure 2: plots of a continuous function and of a function that is discontinuous at x = 1.]

1. The function in Figure 1 is continuous everywhere in its domain.

2. In Figure 2, at x = 1, the left-hand and right-hand limits are not equal.

Therefore, the function given in Figure 2 is not continuous at x = 1.

● A tangent is a straight line that touches a curve at a single point, matching the curve's slope at that point.

Example: the line y = 6x − 9 is the tangent line to a curve such as y = x² at the point (3, 9).

● The rate of change of a function with respect to a variable is called the derivative of the
function with respect to that variable.

● The derivative of a function f(x) with respect to the variable x is denoted as: df/dx or f′(x)

Differentiation using first principles:

The derivative of a function f(x) at a point x = a is given as:

f′(a) = lim(h→0) [f(a + h) − f(a)] / h
● Common derivatives:

d/dx (c) = 0, d/dx (xⁿ) = n·xⁿ⁻¹, d/dx (eˣ) = eˣ, d/dx (ln x) = 1/x,
d/dx (sin x) = cos x, d/dx (cos x) = −sin x

● Rules of differentiation:

1. Sum/Difference rule: d/dx [f(x) ± g(x)] = f′(x) ± g′(x)

2. Constant multiple rule: d/dx [c·f(x)] = c·f′(x)

3. Product rule: d/dx [f(x)·g(x)] = f′(x)·g(x) + f(x)·g′(x)

4. Quotient rule: d/dx [f(x)/g(x)] = [f′(x)·g(x) − f(x)·g′(x)] / g(x)²

5. Chain rule: d/dx f(g(x)) = f′(g(x))·g′(x)
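To make these rules concrete, here is a small Python sketch that numerically checks the product rule on an arbitrary pair of functions using a finite-difference approximation (the functions and the evaluation point are illustrative choices):

```python
import math

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Arbitrary illustrative functions: f(x) = x**2, g(x) = sin(x)
f = lambda x: x ** 2
g = lambda x: math.sin(x)

x = 1.3
product = lambda t: f(t) * g(t)

lhs = numerical_derivative(product, x)                # d/dx [f(x) g(x)], numerically
rhs = 2 * x * math.sin(x) + x ** 2 * math.cos(x)      # f'(x) g(x) + f(x) g'(x)
print(abs(lhs - rhs) < 1e-5)                          # True: the product rule holds numerically
```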

● The derivative of a function gives us the slope of the tangent line to the function at any point
on the graph.

If the derivative/slope of the tangent at a certain point is positive, then the function is
increasing.

If the derivative/slope of the tangent at a certain point is negative, then the function is
decreasing.

If the slope of the tangent is zero, then the function is neither decreasing nor increasing at that
point.

● The Second derivative of a function represents its concavity.


If the second derivative is positive, then the function is concave upwards.
If the second derivative is negative, then the function is concave downwards.

● Steps to find the optima:

1. Given a function f(x), first calculate its derivative f′(x).
2. Set f′(x) = 0 to obtain the stationary points x = c.
3. Calculate f″(x) at each stationary point x = c (i.e. f″(c)).
4. We get the following situations:
i) If f″(c) > 0, then f(x) has a minimum value at x = c.
ii) If f″(c) < 0, then f(x) has a maximum value at x = c.
iii) If f″(c) = 0, then f(x) may or may not have a maximum or minimum at x = c.

Example: Let’s find the optima of the function

Step 1: Calculate the first derivative.

Step 2: Put 𝑓 ’(𝑥) = 0 to obtain the stationary points.

Step 3: Calculate f″(x) at the stationary point.

Step 4: Since f″(1) > 0,

a minimum therefore exists at x = 1.
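The same procedure can be automated with SymPy. The function below, f(x) = x² − 2x + 3, is an assumed illustrative choice (not necessarily the one used in the original notes); it is picked so that, as in the example above, the minimum falls at x = 1:

```python
import sympy as sp

x = sp.symbols('x')
f = x**2 - 2*x + 3          # assumed illustrative function

f1 = sp.diff(f, x)                        # Step 1: first derivative -> 2x - 2
stationary = sp.solve(sp.Eq(f1, 0), x)    # Step 2: stationary points -> [1]
f2 = sp.diff(f, x, 2)                     # Step 3: second derivative -> 2

for c in stationary:
    curvature = f2.subs(x, c)
    kind = "minimum" if curvature > 0 else "maximum" if curvature < 0 else "inconclusive"
    print(c, kind, f.subs(x, c))          # 1 minimum 2
```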

● A local minimum/maximum can be defined as a point where the function has its minimum/maximum
value with respect to its vicinity/surroundings (often marked as Lmin and Lmax on a plot).

Global minima/maxima can be defined as the minimum/maximum value across the whole
domain. It is also called absolute maxima/minima.

● A function f(x) is said to be differentiable at a point if its derivative exists there; informally, it must satisfy:

1. f(x) is continuous at that point, and
2. the graph of f(x) has no sharp corner or kink there (it is smooth).

Example: f(x) = |x| is not differentiable at x = 0 because it has a sharp corner (it is not smooth) at this
point.

● We can also have functions with more than one variable, e.g. f(x, y), a function of the two variables x and y.

A partial derivative of a function of several variables is its derivative with respect to one of
those variables, with the others held constant.
For example, the partial derivatives of f(x, y) with respect to x and y are written as:

∂f/∂x and ∂f/∂y
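A brief SymPy sketch of partial derivatives, using an assumed example function f(x, y) = x²y + y³ chosen purely for illustration:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + y**3          # assumed illustrative function of two variables

df_dx = sp.diff(f, x)        # treat y as a constant -> 2*x*y
df_dy = sp.diff(f, y)        # treat x as a constant -> x**2 + 3*y**2
print(df_dx, df_dy)
```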

● ∇ is called the del (nabla) operator. Applied to a function of several variables, it produces the
gradient: a vector whose components are the partial derivatives with respect to each variable.

Let us assume a function f(w0, w1, w2, …, wn) that has more than one input variable.

Then,

∇f(w0, w1, w2, …, wn) = [∂f/∂w0, ∂f/∂w1, ∂f/∂w2, …, ∂f/∂wn]ᵀ

Optima of f can be found by setting ∇f equal to the zero vector of the same dimension as ∇f.

There can be points where ∇f = 0 that are neither maxima nor minima. These are called
saddle points.

● Gradient descent is an iterative algorithm to reach the optima of a function.


Let’s say we have a function z = f(x,y) and are trying to find the minimum of this function.

We start by initializing x0 and y0 randomly. Then we keep updating x and y until the
partial derivatives are very close to 0, or until a fixed number of iterations is reached.

Algorithm:

Step 1: Initially, pick x0 and y0 randomly.

Step 2: Compute ∂f/∂x and ∂f/∂y at (x0, y0).

Step 3: The new values of x0 and y0, which are closer to the optimum, are given as:

x1 = x0 − η (∂f/∂x)|(x0, y0) and y1 = y0 − η (∂f/∂y)|(x0, y0)

Step 4: Repeat step 3 either for some k (constant) iterations or until a point where
∂f/∂x ≈ 0 and ∂f/∂y ≈ 0.

Here, η (eta) is the learning rate and decides the step size of our iterations. If we set it to a very
small value, the updates happen very slowly. If we set it to a large value, it may overshoot the
minimum.

Therefore, the update rule of the Gradient descent algorithm is:

x_new = x_old − η ∂f/∂x and y_new = y_old − η ∂f/∂y
If we want to maximize a function, we can convert the maximization problem into a
minimization problem by adding a negative sign,
i.e. maximizing f(x, y) is equivalent to minimizing −f(x, y).
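A minimal Python sketch of this algorithm on an assumed example function f(x, y) = (x − 2)² + (y + 1)², whose minimum is at (2, −1); the function, learning rate, and iteration count are illustrative choices:

```python
# Gradient descent on f(x, y) = (x - 2)**2 + (y + 1)**2  (assumed example)
def df_dx(x, y):
    return 2 * (x - 2)

def df_dy(x, y):
    return 2 * (y + 1)

eta = 0.1          # learning rate (step size)
x, y = 5.0, 4.0    # Step 1: arbitrary starting point

for _ in range(200):                       # Step 4: repeat for a fixed number of iterations
    gx, gy = df_dx(x, y), df_dy(x, y)      # Step 2: partial derivatives at the current point
    if abs(gx) < 1e-8 and abs(gy) < 1e-8:  # ...or stop once the gradient is nearly zero
        break
    x, y = x - eta * gx, y - eta * gy      # Step 3: update rule

print(round(x, 4), round(y, 4))            # approaches the minimum at (2, -1)
```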

● Variants of Gradient descent:


Let’s assume that we are minimizing a loss function f = (1/n) Σᵢ₌₁ⁿ fᵢ(w) while training a model,
where fᵢ is the loss on the i-th data point and w are the model parameters.

The update using (batch) Gradient descent is given by:

w_new = w_old − η · (1/n) Σᵢ₌₁ⁿ ∇fᵢ(w_old)

We use all the data points for one update, which leads to a high computation time if our
dataset is very large.

So we have some variants of Gradient descent which can help with this problem.

1. Mini-batch Gradient descent calculates the partial derivatives using only a small random
subset of the data points for each update, while performing many updates,

i.e. w_new = w_old − η · (1/|B|) Σ_{i∈B} ∇fᵢ(w_old)

where B is a random sample (mini-batch) of our data points.

This gives a large speed-up while training the model, with almost similar
accuracy.

2. Stochastic Gradient descent updates the parameters using one training example at a
time,

i.e. w_new = w_old − η · ∇f_k(w_old)

where k is a random index from 1 to n.

Each update is much cheaper than in batch GD, but the number of updates needed to reach the
minimum is larger and the path to the minimum is noisier.
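A compact NumPy sketch contrasting these variants on an assumed linear-regression-style loss (the data, model, and hyperparameters are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # assumed toy dataset
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def gradient(w, Xb, yb):
    """Gradient of the mean squared error on the (mini-)batch (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
eta, batch_size = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)   # mini-batch: random subset B
    w -= eta * gradient(w, X[idx], y[idx])           # batch_size = 1 would give SGD
print(np.round(w, 2))                                # close to [ 1. -2.  0.5]
```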

● For a constrained optimization problem, we have an objective function that we are trying to
optimize (say, min_{x, y} f(x, y)), and this objective function is subject to some constraints.

The constraint may be an equality constraint (g(x, y) = 0), or we can also have inequality
constraints like g(x, y) < c.

● The method of Lagrange multipliers is a method of finding the local minima or local maxima
of a function subject to equality or inequality constraints.

We want to solve the problem x*, y* = argmin_{x, y} f(x, y)

subject to the constraint g(x, y) = c.

In order to solve the above problem, we combine the constraint and the objective
function. We can write the constraint as g(x, y) − c = 0 and then rewrite our problem as:

x*, y* = argmin_{x, y} [ f(x, y) + λ(g(x, y) − c) ] = argmin_{x, y} L(x, y, λ)

Here λ is called a Lagrange multiplier (λ ≥ 0) and the function L(x, y, λ) is called the
Lagrangian function.

Example: min_{w, w0} Σᵢ₌₁ⁿ −yᵢ(wᵀxᵢ + w0), subject to the constraint ||w||² = 1.

We can rewrite the constraint as ||w||² − 1 = 0.

Using a Lagrange multiplier, we can convert it into an unconstrained optimization problem,

i.e. L = min_{w, w0} Σᵢ₌₁ⁿ −yᵢ(wᵀxᵢ + w0) + λ(||w||² − 1), λ ≥ 0

We can solve for the optimal value using the Gradient Descent algorithm.
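For small problems with an equality constraint, the stationary points of the Lagrangian can also be found symbolically. Here is a SymPy sketch on an assumed toy problem (minimize x² + y² subject to x + y = −1, chosen only for illustration and not taken from the notes):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)

# Assumed illustrative problem: minimize f = x^2 + y^2 subject to g = x + y + 1 = 0
f = x**2 + y**2
g = x + y + 1
L = f + lam * g                      # Lagrangian L(x, y, lambda)

# Stationary points: set all partial derivatives of L to zero
eqs = [sp.diff(L, v) for v in (x, y, lam)]
sol = sp.solve(eqs, [x, y, lam], dict=True)
print(sol)   # [{x: -1/2, y: -1/2, lambda: 1}] -> constrained minimum at (-1/2, -1/2)
```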

● Dimensionality reduction techniques convert high-dimensional data to fewer dimensions,
which can then be visualized using simpler plots, while preserving as much of the information
in the feature columns as possible.

● Principal Component Analysis (PCA)

Principal component analysis, or PCA, is a dimensionality-reduction method that is often used
to reduce the number of features in large data sets while retaining most of the information.

PCA finds a new set of axes to represent the data so that a few principal components
contain most of the information.

The main objective is that whenever we go from a higher dimension (d) to a lower
dimension (d′), we want to preserve those directions which have high variance (i.e. high
information).

For example: imagine we have two features f1 and f2, with the data spread diagonally across
both axes. We want to project the given 2-D data to 1-D.

One thing we can observe here is that both axes have a good amount of variance. It will be
difficult to decide which feature to drop.

Therefore, we rotate the entire coordinate system to obtain a new pair of axes.


Let’s call these new dimensions f1′ and f2′. Now we can see that the variability along f1′ is more
than along f2′, so we can drop the f2′ axis and keep only f1′.

Mathematics of PCA:

Step 1: Feature-wise centering: we compute the mean of each feature and subtract it from
every value of that feature, so that the data is centered at the origin.

Step 2: Standardize: this is done to bring all the features of our data onto the same scale.
Standard scaling (z-scoring) is preferred here because it is less affected by outliers than
min-max scaling.

Combining step 1 and step 2, each value of a feature becomes:

x′ = (x − μ) / σ

where μ and σ are the mean and standard deviation of that feature.

Step 3: Rotate the coordinate axes:

There can be many possible angles of rotation; we consider a candidate axis and optimize it to
find the best one.
Let u be a unit vector in the direction of the new axis.
Our optimization problem is to find the unit vector u such that the variance of the projections
xᵢ′ of all the points onto u is maximum.

We can also think of it in terms of the length of projections.

Consider a vector x in the space representing one of the points in our data.

The projection of x on the unit vector u is given as: uᵀx / ||u|| = uᵀx (since ||u|| = 1).

The best u will be the one for which the sum of the lengths of the projections of all such
points xᵢ onto u is maximum,

i.e. max_u Σᵢ uᵀxᵢ

However, the projections may take negative values.

Therefore, we take the magnitude,

i.e. max_u Σᵢ |uᵀxᵢ|

Since the above objective function is not differentiable, we rewrite it using squares:

max_u Σᵢ (uᵀxᵢ)²

Now, since u is a unit vector, we introduce the constraint ||u|| = 1 (i.e. uᵀu = 1).

Using a Lagrange multiplier, our loss function becomes:

L(u, λ) = max_u Σᵢ (uᵀxᵢ)² − λ(uᵀu − 1)

We can optimize the above Lagrangian loss function using Gradient descent.

● Optimization without Gradient descent:

We know that (uᵀxᵢ)² = (uᵀxᵢ)(xᵢᵀu) = uᵀ(xᵢxᵢᵀ)u, using the identity (uᵀxᵢ)ᵀ = xᵢᵀu.

Therefore, our loss function can be written as:

L = (1/n) Σᵢ₌₁ⁿ uᵀxᵢxᵢᵀu − λ(uᵀu − 1) = uᵀ [ (1/n) Σᵢ xᵢxᵢᵀ ] u − λ(uᵀu − 1)

(Dividing the objective by the constant n does not change the optimal u, and for centered data
(1/n) Σᵢ (uᵀxᵢ)² is exactly the variance of the projections.)

Writing the constraint as uᵀu − 1 = 0 and using Σᵢ xᵢxᵢᵀ = XᵀX, where X is the n × d matrix
whose rows are the centered data points xᵢᵀ, we get

L = uᵀ (XᵀX / n) u − λ(uᵀu − 1)

Let V = XᵀX / n.

Hence, L = uᵀVu − λ(uᵀu − 1)

V is the pairwise covariance matrix of all the features of our feature matrix X.

Taking partial derivatives w.r.t. u and λ,

∂L/∂u = 2Vu − 2λu and ∂L/∂λ = −(uᵀu − 1)

After putting both the partial derivatives equal to zero, we get

Vu = λu — (i) and uᵀu = 1

By carefully observing equation (i), we can conclude that u is an eigenvector and λ is an
eigenvalue of the matrix V.

Eigenvector and Eigenvalue:

For a square matrix A, an eigenvector is a non-zero vector v which, when multiplied by A, is only
scaled (its direction is unchanged, up to sign):

A·v = λ·v

The vector v is called an eigenvector and the scaling factor λ is called the eigenvalue.

A matrix can have multiple eigenvectors; for a symmetric matrix such as the covariance matrix V,
eigenvectors corresponding to distinct eigenvalues are orthogonal to each other.
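A quick NumPy check of A·v = λ·v on an assumed symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # assumed symmetric example matrix

eigvals, eigvecs = np.linalg.eig(A)  # columns of eigvecs are the eigenvectors
v, lam = eigvecs[:, 0], eigvals[0]

print(np.allclose(A @ v, lam * v))   # True: A v = lambda v
print(eigvals)                       # eigenvalues 3 and 1 (order may vary)
print(eigvecs[:, 0] @ eigvecs[:, 1]) # ~0: eigenvectors of a symmetric matrix are orthogonal
```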

Conclusion:

To find the best vector u, which is the direction of maximum variance (or maximum
information) and along which we should rotate our existing coordinates, we follow the steps
below:

Step 1: Find the covariance matrix of the given (centered) feature matrix X.

Step 2: Calculate the eigenvectors and eigenvalues of the covariance matrix.
The eigenvectors give the candidate directions for u, and the corresponding eigenvalues
measure the importance (variance) of each direction.

The eigenvector associated with the largest eigenvalue indicates the direction in which the
data has the most variance.

Therefore, we can select our principal components in the direction of the eigenvectors
having large eigenvalues and drop the principal components having relatively small
eigenvalues.
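Putting these steps together, here is a minimal NumPy sketch of PCA on an assumed toy dataset (the data, dimensions, and number of retained components are illustrative; real projects would typically use a library implementation such as scikit-learn's PCA):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [0.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])   # assumed toy data, 3 features

# Step 1 & 2: center (and optionally standardize) the features
Xc = X - X.mean(axis=0)

# Step 3: covariance matrix and its eigendecomposition
V = Xc.T @ Xc / len(Xc)                 # feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(V)    # eigh: for symmetric matrices, eigenvalues ascending

order = np.argsort(eigvals)[::-1]       # sort directions by decreasing variance
components = eigvecs[:, order[:2]]      # keep the top-2 principal components

X_reduced = Xc @ components             # project the data onto the new axes
print(eigvals[order])                   # variances along each principal direction
print(X_reduced.shape)                  # (200, 2)
```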
