Maths For ML Revision
Revision notes
Table of Contents
Vectors, halfspaces
Distance between two points, norm, angle b/w two vectors, projection, intersection
Distance b/w: hyperplane and origin, point and hyperplane, parallel hyperplanes
Circle
Finding optima
Partial derivative
Gradient descent
● Equations of a straight line:
1. Slope-intercept form: y = mx + c
2. Point-slope form: y − y1 = m(x − x1)
where m is the slope of the line and (x1, y1) are the coordinates of a point on the line.
3. Two-point form: y − y1 = ((y2 − y1) / (x2 − x1)) (x − x1)
4. Intercept form: x/a + y/b = 1
where a and b are the intercepts of the line on the x-axis and y-axis respectively.
5. General form: ax + by + c = 0
● Two lines are called parallel to each other if the values of their slopes are equal.
Let's consider two lines y = m1x + c1 and y = m2x + c2.
They are parallel if m1 = m2 (and c1 ≠ c2). For example, y = 2x + 1 and y = 2x − 5 are parallel
since both have slope 2.
● Vectors can be interpreted as coordinates as well as a line segment from the origin to the
coordinate.
For example, a vector (x, y, z) can be considered as the coordinates of a point P(x, y, z) as
well as a line segment from the origin to point P(x, y, z).
● Half Spaces: In geometry, a half-space is either of the two parts into which a plane divides the
three-dimensional Euclidean space.
● The transpose operation changes a column vector into a row vector and vice versa.
For example, if v = [x, y, z]ᵀ is a column vector, then vᵀ = [x y z] is a row vector, and
(vᵀ)ᵀ = v.
● Dot product of two vectors a and b is given as a·b = aᵀb = a1·b1 + a2·b2 + … + an·bn.
Geometrically, it is the product of the magnitudes of the two vectors and the cosine of the
angle between them.
i.e. a·b = ||a|| ||b|| cos θ, where θ is the angle between a and b.
If the dot product of two vectors is zero, then the vectors are perpendicular to each other.
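As a quick numerical check of these dot-product properties, here is a small Python sketch using numpy (the example vectors are arbitrary choices, not taken from the notes):

    import numpy as np

    a = np.array([1.0, 2.0, 2.0])
    b = np.array([2.0, -1.0, 0.0])

    dot = np.dot(a, b)                                  # algebraic dot product, here 0.0
    cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.degrees(np.arccos(cos_theta))            # angle between a and b, here 90 degrees

    print(dot, theta)
    print(np.isclose(dot, 0.0))                         # True: the two vectors are perpendicular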
➔ We can multiply a unit vector by any scalar value to get a vector of the desired magnitude
(equal to that scalar value) in the same direction.
➔ All vectors with the same unit vector are parallel.
● Norm or Magnitude of a vector is calculated by taking the square root of dot product with
itself.
i.e. ||a|| = √(a·a)
● Projection of vector a on vector b = (a·b) / ||b||
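A minimal numpy sketch of the norm, the unit vector, and the projection described above, again with arbitrary example vectors:

    import numpy as np

    a = np.array([3.0, 4.0])
    b = np.array([1.0, 0.0])

    norm_a = np.sqrt(np.dot(a, a))        # square root of the dot product with itself, here 5.0
    unit_a = a / norm_a                   # unit vector in the direction of a
    scaled = 10 * unit_a                  # vector of magnitude 10 in the same direction as a

    proj_a_on_b = np.dot(a, b) / np.linalg.norm(b)    # projection of a on b, here 3.0

    print(norm_a, unit_a, scaled, proj_a_on_b)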
● At the point of intersection of two lines, both lines will have the same coordinates.
Therefore, this point will satisfy both the line’s equations.
i.e. b = a+2 — i)
b = 2a+1 — ii)
Solving above two equations, we get, a = 1 and b = 3.
Therefore, the given two lines intersect at the point (1,3).
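The same intersection can be computed programmatically by writing the two line equations as a linear system; a short numpy sketch for this example:

    import numpy as np

    # b = a + 2   ->  -1*a + 1*b = 2
    # b = 2a + 1  ->  -2*a + 1*b = 1
    A = np.array([[-1.0, 1.0],
                  [-2.0, 1.0]])
    rhs = np.array([2.0, 1.0])

    a, b = np.linalg.solve(A, rhs)        # unique solution because the lines are not parallel
    print(a, b)                           # 1.0 3.0 -> the lines intersect at (1, 3)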
● Distance between a point p and a hyperplane wᵀx + w0 = 0 is given as |wᵀp + w0| / ||w||,
i.e. just put the point in the hyperplane's equation and divide by the square root of the
summation of the coefficients' squares (or the norm of the w vector).
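A small numpy sketch of this distance calculation; the hyperplane coefficients and the point below are made-up illustrations:

    import numpy as np

    w = np.array([3.0, 4.0])              # coefficients of the hyperplane 3x + 4y - 5 = 0
    w0 = -5.0
    p = np.array([1.0, 1.0])              # the point whose distance we want

    distance = abs(np.dot(w, p) + w0) / np.linalg.norm(w)
    print(distance)                       # |3 + 4 - 5| / 5 = 0.4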
● Equation of a circle: (x − h)² + (y − k)² = r²,
where (h, k) are the coordinates of the center of the circle and r is the radius of the circle.
Therefore, a circle with the center at origin is given as: x² + y² = r²
Points inside the circle give negative values when substituted in the expression x² + y² − r²,
and points outside the circle give positive values.
● Let's say we have a coordinate system x-y initially and a point P(x0, y0) in this system.
If the coordinate system is rotated by an angle θ in the anti-clockwise direction, then the
coordinates of point P with respect to the new coordinate system will be:
x' = x0 cos θ + y0 sin θ and y' = −x0 sin θ + y0 cos θ
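A short numpy sketch of this axis rotation, using an arbitrary point and angle for illustration:

    import numpy as np

    x0, y0 = 1.0, 0.0                     # point P in the original x-y system
    theta = np.radians(90)                # axes rotated 90 degrees anti-clockwise

    # coordinates of P with respect to the rotated axes
    x_new = x0 * np.cos(theta) + y0 * np.sin(theta)
    y_new = -x0 * np.sin(theta) + y0 * np.cos(theta)

    print(round(x_new, 6), round(y_new, 6))   # (0.0, -1.0)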
● A limit is a value toward which an expression converges as one or more variables approach
certain values. It is denoted as: lim (x → a) f(x) = L
A left-hand limit means the limit of a function as it approaches from the left-hand side. It is
denoted as: lim (x → a⁻) f(x)
On the other hand, a right-hand limit, lim (x → a⁺) f(x), means the limit of a function as it
approaches from the right-hand side.
● A Function is a relationship between inputs and outputs where each input is related to exactly
one output.
The domain of a function is the set of input values for f, in which the function is real and
defined.
The set of all the outputs of a function is known as the range of the function.
Function            Domain                          Range
Hyperbola (1/x)     R − {0}                         R − {0}
Exponent            R+                              R
Modulus (|x|)       R                               R+
Exponential (eˣ)    R                               R+
Sigmoid             R                               (0, 1)
Sine                R                               [−1, 1]
Tangent             R − {(2k+1)π/2, k an integer}   R
● f(x) is continuous at a point x = a if the left-hand limit, the right-hand limit, and the value of
the function at x = a all exist and are equal, i.e. lim (x → a⁻) f(x) = lim (x → a⁺) f(x) = f(a).
Example: (Figure 1 and Figure 2 showed example graphs; in Figure 2, the function is examined
for continuity at x = 1.)
Example: In the above-given graph, the line y = 6x − 9 is a tangent line to the curve y = x² at
the point (3, 9).
● The rate of change of a function with respect to a variable is called the derivative of the
function with respect to that variable.
● Derivative of a function f(x) with respect to variable x is denoted as: df/dx or f′(x)
● Common derivatives: d/dx (c) = 0, d/dx (xⁿ) = n·xⁿ⁻¹, d/dx (eˣ) = eˣ, d/dx (ln x) = 1/x,
d/dx (sin x) = cos x, d/dx (cos x) = −sin x
● Rules of differentiation:
1. Sum/Difference rule: (f ± g)′ = f′ ± g′
2. Constant multiple rule: (c·f)′ = c·f′
3. Product rule: (f·g)′ = f′·g + f·g′
4. Quotient rule: (f/g)′ = (f′·g − f·g′) / g²
5. Chain rule: if y = f(g(x)), then dy/dx = f′(g(x)) · g′(x)
● The derivative of a function gives us the slope of the tangent line to the function at any point
on the graph.
If the derivative/slope of the tangent at a certain point is positive, then the function is
increasing.
If the derivative/slope of the tangent at a certain point is negative, then the function is
decreasing.
If the slope of the tangent is zero, then the function is neither decreasing nor increasing at that
point.
Example: Let's find the optima of a function f(x).
Step 1: Calculate f′(x).
Step 2: Solve f′(x) = 0 to get the critical points.
Step 3: Calculate f″(x) at each critical point: if f″ > 0 the point is a local minimum, and if
f″ < 0 it is a local maximum.
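Since the specific function of the original example is not shown here, the sketch below assumes f(x) = x³ − 3x for illustration and walks through the same steps symbolically with sympy:

    import sympy as sp

    x = sp.Symbol('x')
    f = x**3 - 3*x                # assumed example function, not the one from the original notes

    f1 = sp.diff(f, x)            # Step 1: f'(x) = 3x^2 - 3
    critical = sp.solve(f1, x)    # Step 2: f'(x) = 0 gives x = -1 and x = 1
    f2 = sp.diff(f, x, 2)         # Step 3: f''(x) = 6x decides the nature of each point

    for c in critical:
        kind = "local minimum" if f2.subs(x, c) > 0 else "local maximum"
        print(c, kind)            # -1 -> local maximum, 1 -> local minimum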
● Local minima/maxima can be defined as a point where the function has minimum/maximum
value with respect to its vicinity/surrounding. We have marked it as Lmin and Lmax in the
image above.
Global minima/maxima can be defined as the minimum/maximum value across the whole
domain. It is also called absolute maxima/minima.
Example: f(x) = |x| is not differentiable at x = 0 as it has a sharp point (not smooth) at this
point.
● We can also have functions of more than one variable.
Example: f(x, y), a function of the two variables x and y.
A partial derivative of a function of several variables is its derivative with respect to one of
those variables, with the others held constant.
For example, the partial derivatives of f(x, y) with respect to x and y are written as ∂f/∂x and
∂f/∂y respectively.
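Partial derivatives can also be approximated numerically with finite differences; the Python sketch below uses an assumed example function f(x, y) = x² + 3xy, not one from the notes:

    def f(x, y):
        return x**2 + 3*x*y               # assumed example function f(x, y) = x^2 + 3xy

    def partial_x(f, x, y, h=1e-6):
        # derivative with respect to x, holding y constant (central difference)
        return (f(x + h, y) - f(x - h, y)) / (2 * h)

    def partial_y(f, x, y, h=1e-6):
        # derivative with respect to y, holding x constant (central difference)
        return (f(x, y + h) - f(x, y - h)) / (2 * h)

    print(partial_x(f, 1.0, 2.0))         # ~8, since df/dx = 2x + 3y
    print(partial_y(f, 1.0, 2.0))         # ~3, since df/dy = 3x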
● ∇ (nabla) is called the del or gradient operator. Applied to a function, it gives a vector of the
derivatives with respect to each single variable, also called the partial derivatives.
Let us assume a function f(w0, w1, w2, …, wn) that has more than one input variable.
Then, ∇f = [∂f/∂w0, ∂f/∂w1, ∂f/∂w2, …, ∂f/∂wn]
Optima for f can be found by setting ∇f equal to a zero vector of the same dimensions as ∇f.
We can have points where ∇f = 0, but those points may not be maxima or minima. Those are
called Saddle points.
● Gradient descent is an iterative algorithm for finding a (local) minimum of a function, say
f(x, y).
We start by initializing x0 and y0 randomly. Then we keep updating x and y till the point
where the partial derivatives are very close to 0, or for some fixed number of iterations.
Algorithm:
Step 1: Initialize x0 and y0 randomly and choose a learning rate η.
Step 2: Compute the partial derivatives ∂f/∂x and ∂f/∂y at (x0, y0).
Step 3: The new values of x0 and y0, which are closer to the optimum, are given as:
x0 ← x0 − η (∂f/∂x) and y0 ← y0 − η (∂f/∂y)
Step 4: Repeat steps 2 and 3 either for some k (constant) iterations or till a point where
∂f/∂x ≈ 0 and ∂f/∂y ≈ 0.
Here, η (eta) is the learning rate and decides the step size of our iterations. If we set its value
to very small, then the updates will happen very slowly. If we set it to a large value, it may
overshoot the minima.
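A minimal Python sketch of this algorithm, assuming the function f(x, y) = x² + y² and an illustrative learning rate and starting point (none of these values come from the notes):

    def dfdx(x, y):
        return 2 * x                      # partial derivative of f(x, y) = x^2 + y^2 w.r.t. x

    def dfdy(x, y):
        return 2 * y                      # partial derivative of f(x, y) = x^2 + y^2 w.r.t. y

    eta = 0.1                             # learning rate (illustrative value)
    x0, y0 = 4.0, -3.0                    # random-ish initialization

    for _ in range(1000):
        gx, gy = dfdx(x0, y0), dfdy(x0, y0)
        if abs(gx) < 1e-8 and abs(gy) < 1e-8:     # stop when the partial derivatives are ~0
            break
        x0, y0 = x0 - eta * gx, y0 - eta * gy     # the Step 3 update

    print(x0, y0)                         # converges towards the minimum at (0, 0)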
If we want to maximize some function, we can convert the maximization problem into a
minimization problem by adding a negative sign,
i.e. maximizing f(x, y) is equivalent to minimizing −f(x, y).
In standard gradient descent we use all the data points for one update, which leads to a high
computation time if our dataset is very large.
So we have some variants of gradient descent which can help us with this problem.
1. Mini-batch Gradient descent calculates the partial derivatives using only a few data points
sampled randomly from our data set, while performing many updates.
i.e. each update uses the gradient computed on a small random mini-batch of the data instead
of the full dataset.
We get a very large speed improvement while training our model, with almost the same
accuracy.
2. Stochastic Gradient descent updates the parameters for each training example one
by one.
i.e. each update uses the gradient computed on a single training example.
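A rough numpy sketch contrasting mini-batch and stochastic updates on a simple least-squares objective; the synthetic data, model, and learning rate are all illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=100)
    y = 3.0 * X + rng.normal(scale=0.1, size=100)   # synthetic data with true slope ~3

    def grad(w, xb, yb):
        # gradient of the mean squared error (yb - w*xb)^2 with respect to w on a batch
        return np.mean(-2 * xb * (yb - w * xb))

    eta = 0.05                                      # learning rate (illustrative value)

    # Mini-batch gradient descent: a few randomly sampled points per update
    w_mb = 0.0
    for _ in range(300):
        idx = rng.choice(len(X), size=10, replace=False)
        w_mb -= eta * grad(w_mb, X[idx], y[idx])

    # Stochastic gradient descent: one training example per update
    w_sgd = 0.0
    for i in rng.permutation(len(X)):
        w_sgd -= eta * grad(w_sgd, X[i:i+1], y[i:i+1])

    print(w_mb, w_sgd)                              # both end up close to the true slope of 3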
● For a constrained optimization problem, we have an objective function that we are trying to
optimize (say, min over x, y of f(x, y)), and this objective function is subject to some
constraints.
The constraint may be an equality constraint ( 𝑔(𝑥, 𝑦) = 0 ) or we can also have inequality
constraints like 𝑔(𝑥, 𝑦) < 𝑐.
● The method of Lagrange multipliers is a method of finding the local minima or local maxima
of a function subject to equality or inequality constraints.
We want to solve the problem: x*, y* = argmin over x, y of f(x, y), subject to the constraint.
In order to solve the above problem, we combine both the constraint and the objective
function. We can write the constraint as g(x, y) − c = 0 and then rewrite our problem as:
x*, y* = argmin over x, y of L(x, y, λ), where L(x, y, λ) = f(x, y) + λ(g(x, y) − c)
Here λ is called a Lagrange multiplier (λ ≥ 0) and the function L(x, y, λ) is called the
Lagrangian function.
Example: minimize over w, w0 the quantity Σ (i = 1 to n) −yi(wᵀxi + w0), subject to the
constraint ||w||² = 1.
We can rewrite the constraint as ||w||² − 1 = 0.
i.e. L = min over w, w0 of [ Σ (i = 1 to n) −yi(wᵀxi + w0) + λ(||w||² − 1) ], λ ≥ 0
We can solve for the optimal value using the Gradient Descent algorithm.
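As a small worked illustration (this specific problem is not from the notes), the sympy sketch below minimizes f(x, y) = x² + y² subject to the equality constraint x + y + 1 = 0 by setting the partial derivatives of the Lagrangian to zero:

    import sympy as sp

    x, y, lam = sp.symbols('x y lam')
    f = x**2 + y**2                       # illustrative objective, not from the notes
    g = x + y + 1                         # equality constraint g(x, y) = x + y + 1 = 0

    L = f + lam * g                       # the Lagrangian L(x, y, lambda)
    stationary = [sp.diff(L, v) for v in (x, y, lam)]
    sol = sp.solve(stationary, [x, y, lam])
    print(sol)                            # {x: -1/2, y: -1/2, lam: 1}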
● Principal Component Analysis (PCA)
PCA is the act of finding new axes to represent the data so that a few principal components
may contain most of the information.
The main objective here is that whenever we are going from a higher dimension (d) to a lower
dimension (d′), we want to preserve those dimensions which have high variance (or high
information).
For example: Imagine we have two features 𝑓1 and 𝑓2, and the data is spread as shown in
the image below. We want to project the given 2-D data to 1-D.
One thing we can observe here is that both axes have a good amount of variance. It will be
difficult to decide which feature to drop.
Mathematics of PCA:
Step 1: Featurewise Centering: We compute the mean of each feature and subtract it from
each value of that feature to centre the mean of the data at origin.
Step 2: Standardize: This is done in order to keep all the features of our data on the same
scale. It is best to use standard scaling because it is less affected by the outliers.
Combining step 1 and step 2, we get the value of our feature vector x as: x ← (x − μ) / σ,
where μ is the feature-wise mean and σ is the feature-wise standard deviation.
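A small numpy sketch of these two preprocessing steps on an arbitrary example feature matrix:

    import numpy as np

    X = np.array([[170.0, 65.0],
                  [160.0, 55.0],
                  [180.0, 75.0]])         # 3 samples, 2 features (illustrative values)

    mu = X.mean(axis=0)                   # feature-wise mean
    sigma = X.std(axis=0)                 # feature-wise standard deviation

    X_scaled = (X - mu) / sigma           # centred and standardised feature matrix
    print(X_scaled.mean(axis=0))          # ~0 for every feature
    print(X_scaled.std(axis=0))           # 1 for every feature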
There can be many angles of rotation possible, we’ll consider a random axis and optimize it to
get the best axis.
Let's consider a unit vector in the direction of the new axis.
Our optimization problem is finding a unit vector u such that the variance of all the x′i (the
data points expressed along the new axis) is maximum.
We can also think of it in terms of the length of projections.
Consider a vector x in the space representing one of the points in our data.
The best u will be where the summation of the length of projections of all such points(xi) on
the vector u is maximum.
i.e. the length of the projection of xi on u = (uᵀxi) / ||u|| = uᵀxi (since ||u|| = 1)
i.e. the best u maximizes (1/n) Σ (i = 1 to n) (uᵀxi)²
Now, since u is a unit vector, we introduce the constraint ||u||² = 1, i.e. uᵀu − 1 = 0, and form
the Lagrangian: L(u, λ) = (1/n) Σ (i = 1 to n) (uᵀxi)² − λ(uᵀu − 1)
We can optimize the above Lagrangian loss function using Gradient descent, or solve it directly
by setting its derivative with respect to u to zero.
We know that, (1/n) Σ (i = 1 to n) (uᵀxi)² = (1/n) uᵀXᵀXu
Thus, L(u, λ) = (1/n) uᵀXᵀXu − λ(uᵀu − 1)
Setting ∂L/∂u = 0, we get, (2/n) XᵀXu − 2λu = 0
Or, ((1/n) XᵀX) u = λu
Let V = (1/n) XᵀX.
V is the pairwise covariance matrix of all the features of our feature matrix X (since the data
has been centred).
Hence, Vu = λu — (i)
By carefully observing the equation (i), we can conclude that 𝑢 is the eigen vector and λ is the
eigen value of matrix 𝑉.
For a square matrix A, an eigenvector is a non-zero vector v such that when this vector is
multiplied by the matrix A, we get a new vector in the same direction with a (possibly)
different magnitude.
The vector v is called the eigenvector and the scaling factor λ is called the eigenvalue.
i.e. Av = λv
A matrix can have multiple eigenvectors; for a symmetric matrix (such as the covariance
matrix V), the eigenvectors are orthogonal to each other.
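A short numpy check of the relation Av = λv on a small symmetric example matrix (chosen arbitrarily for illustration):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])            # symmetric matrix, so its eigenvectors are orthogonal

    eigvals, eigvecs = np.linalg.eigh(A)  # eigh is meant for symmetric matrices
    v, lam = eigvecs[:, 0], eigvals[0]    # first eigenvector and its eigenvalue

    print(np.allclose(A @ v, lam * v))                # True: A v = lambda v
    print(np.dot(eigvecs[:, 0], eigvecs[:, 1]))       # ~0: the eigenvectors are orthogonal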
Conclusion:
To find the best vector u, which is the direction of maximum variance (or maximum
information) and along which we should rotate our existing coordinates, we follow the steps
given below:
Step 1: Centre (and standardize) the data and compute the covariance matrix V.
Step 2: Then calculate the eigenvectors and eigenvalues of the covariance matrix.
The eigenvector is the direction of best 𝑢 and eigen value is the importance of that
vector.
The eigenvector associated with the largest eigenvalue indicates the direction in which the
data has the most variance.
Therefore, we can select our principal components in the direction of the eigenvectors
having large eigenvalues and drop the principal components having relatively small
eigenvalues.
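Putting these steps together, here is a compact numpy sketch of PCA via the eigendecomposition of the covariance matrix, using randomly generated data purely as an illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)   # make one feature nearly redundant

    # Step 1: feature-wise centring (standardising could be added in the same way)
    Xc = X - X.mean(axis=0)

    # Step 2: covariance matrix and its eigenvectors / eigenvalues
    V = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(V)

    # Sort by decreasing eigenvalue and keep the top d' = 2 principal components
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:2]]

    X_reduced = Xc @ components           # project the data onto the new axes
    print(eigvals[order])                 # variance captured along each direction
    print(X_reduced.shape)                # (200, 2)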