Mathematics For ML



Dr. Deepti
Vector
WHY IS DATA REPRESENTED AS VECTORS IN
MACHINE LEARNING?

• Vectors are commonly used in machine learning because they offer a convenient way to organize data.
Often one of the very first steps in building a machine learning model is vectorizing the data. Vectors
also form the basis of several machine learning techniques.
• Science and mathematics are used to describe the world around us. Many quantities require only
one measurement to describe them, e.g. the length of a string, the area of a shape, or the
temperature of a surface. Such quantities are called scalars. Any quantity that can be
represented by a single number (positive or negative) is a scalar. This value is known as its magnitude.
• On the other hand, some quantities require at least two measurements to describe them.
Along with a magnitude, they have a “direction” associated with them, e.g. velocity or force. These
quantities are known as “vectors”.
• When we say that a person ran 2 km, it is a scalar; when we say that a person ran 2
km north-east from their initial position, it is a vector.
SCALAR QUANTITY
• Let's say you are collecting some data about a group of students in a class. You are
measuring the height and weight of each student and the data collected for 5 students is as
follows:

• Each individual measurement here is a scalar quantity. So height or weight viewed on its
own is a scalar.
• However, when you look at the observation for each student as a whole, i.e.
height and weight together, you can think of it as a vector.
• There is no real “direction” in this vector as per the standard
definition. But when we represent quantities in more than one dimension (here
two: height and weight), the observations have a sense of orientation with
respect to each other.
Representing Multiple Vectors
• When we look at 5 students (5 vectors), they have a magnitude and an
idea of “direction” with respect to each other.

• Since most useful datasets have more than one attribute, their observations can be
represented as “vectors”. Every observation in a given data set can be thought
of as a vector. All possible observations of a data set constitute a “vector
space”.
• The benefit of representing data as vectors is that we can leverage vector
algebra to find patterns or relationships within our data.
How can vectors help?
• Looking at the vectors in vector space, you can quickly compare them to
check if there is a relationship.
• For example, similar vectors have a smaller angle between
them, i.e. their orientations are close to each other. In our sample data, the
students (5.4, 54) and (5, 50) are quite similar.
• The angle between the vectors indicates “similarity” between them. The
vectors in the same direction (close to 0 degrees angle) are similar while
vectors in the opposite direction (close to 180 degrees angle) are
dissimilar.
• If the vectors are at 90 degrees to each other (orthogonal), then there is no
relationship between them.
Cosine Similarity
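• Cosine similarity is a metric that gives the cosine of the angle between
two vectors. It can be calculated from the “dot product” of the 2 vectors.
Mathematically,
cos(θ) = (A · B) / (‖A‖ ‖B‖)
where A · B is the dot product and ‖A‖, ‖B‖ are the magnitudes of the two vectors.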

• In data science, the dot product and the cosine of the angle between two
vectors tell you how similar or dissimilar those vectors are.
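As a minimal sketch of the idea, the following Python computes the cosine similarity of the two students mentioned earlier, (5.4, 54) and (5, 50). The function name `cosine_similarity` and the extra orthogonal and opposite example vectors are illustrative assumptions, not part of the original slides.

```python
import math

def cosine_similarity(a, b):
    """Return cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The two students from the sample data, as (height, weight) vectors.
student_a = (5.4, 54)
student_b = (5, 50)

# Illustrative extra cases (assumed vectors, not from the data set).
same_direction = cosine_similarity(student_a, student_b)  # close to 1
orthogonal = cosine_similarity((1, 0), (0, 1))            # exactly 0
opposite = cosine_similarity((1, 2), (-1, -2))            # close to -1

print(same_direction, orthogonal, opposite)
```

The two students point in the same direction (both have weight equal to ten times height), so their similarity is essentially 1 even though their magnitudes differ.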
Text Vectorization
• The process of converting or transforming a data set into a set of vectors is
called vectorization. It is easy to represent a data set as vectors when its
attributes are already numeric. What about textual data?
• “Word embedding” is the process of representing words or text as vectors.
• There are a number of techniques for representing text as vectors.
One of the simplest is the count vectorizer. Below are the
first few lines of the book “A Tale of Two Cities” by Charles
Dickens:
• It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness
How can the above 4 sentences be converted
into vectors?
• Step 1: Get the unique words from your collection of text. The total text you have is called the “corpus”.
• The unique words here (ignoring case and punctuation) are:
['age', 'best', 'foolishness', 'it', 'of', 'the', 'times', 'was', 'wisdom', 'worst']
• This is a vocabulary of 10 words from a corpus containing 24 words.
• Step 2: For each sentence, create a list of 10 zeroes.
• Step 3: For each sentence, read the words one by one. For each word,
count its total occurrences in the sentence. Then find the position of the word
in the vocabulary list above and replace the zero at that position with this count.
• For our corpus, the vectors we got are:
• “It was the best of times” = [0 1 0 1 1 1 1 1 0 0]
• “it was the worst of times” = [0 0 0 1 1 1 1 1 0 1]
• “it was the age of wisdom” = [1 0 0 1 1 1 0 1 1 0]
• “it was the age of foolishness” = [1 0 1 1 1 1 0 1 0 0]
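The three steps can be sketched in plain Python as follows. This is a minimal illustration rather than a library implementation; the helper names `tokenize` and `count_vector` are assumptions introduced here.

```python
import re
from collections import Counter

corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness",
]

def tokenize(sentence):
    """Lower-case the sentence and keep only alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())

# Step 1: unique words of the corpus, sorted alphabetically.
vocabulary = sorted({word for sentence in corpus for word in tokenize(sentence)})

def count_vector(sentence, vocabulary):
    """Steps 2 and 3: one count per vocabulary position (0 if the word is absent)."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

vectors = [count_vector(sentence, vocabulary) for sentence in corpus]

for sentence, vector in zip(corpus, vectors):
    print(sentence, vector)
```

Sorting the vocabulary fixes the position of each word, so every sentence maps to a vector of the same length and the vectors can be compared directly.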
How Calculus enables Machine Learning and AI

• Calculus is the mathematical study of continuous change.


• Calculus is also used to gain a more precise understanding of the nature
of space, time, and motion.
• Calculus helps in finding out the relationship between two variables
(quantities) by measuring how one variable changes when there is a change in
another variable and how these changes accumulate over time.
What is calculus and why is it needed?

• What a model learns is often expressed as a mathematical function
that captures the relationship between the entities or dimensions involved. For example,
say you are driving a car and measuring its speed as time
passes.
• There is a relationship between speed and time. As time passes, the
speed of the car increases at a constant rate. Every minute, the speed of the car
increases by 5 units. This relationship can be expressed as follows:
• Speed is a Function of time
• Speed = f(time)
• S = 5 * t

• This relationship is the equation of a straight line:
• y = mx + b
• where m is the slope and b is the intercept of the line.
• In our example, the equation becomes S = 5t + 0.
• Focus on the quantity known as the “slope”.
Example
• As time goes by, the speed of the car changes.
• We want to know how much the speed changes for a given change in time. For example, between t = 1 and t = 2,
the speed changes from 5 to 10. This can be written as:
(change in speed) / (change in time) = (10 − 5) / (2 − 1) = 5
• In our case, dividing the “change in speed” by the “change in time” gives
the rate of change.
Slope/ Gradient
• The rate of change is also known as the slope or gradient.
• The greater the rate of change, the greater the inclination of the line (and hence
the greater the slope).
• Compare the slopes of the orange and blue lines in the graph below:
• In our example above, the speed of the car increases at a constant rate.
What happens if this rate of change is not constant? What if the car is
accelerating and the rate of change in speed is different every minute?

• How do we measure the rate of change in this case, when it is different for
different time periods? We cannot define a single rate of change in speed for
this kind of data.
• Notice that the rate of change of speed climbs as time passes. The
mathematical function that captures this relationship is not a straight line but
a “curved line”.
How to measure the “slope” or “inclination” of a line which is
curved?
• The property of “curviness” means that the rate of change is not constant.
• Here comes the beauty of calculus.
• You can imagine the curve as a collection of many “very small
straight line segments”.
Cont…

• We will now calculate the slope of a “very small” line segment of the
curve. This slope represents a “very small” change in speed with respect to a
“very small” change in time.
• Let's denote a “very small” change in speed as “ds”.
• Similarly, denote a “very small” change in time as “dt”.
• The slope of this “very small” line segment is ds/dt.
• Since “dt” is already very small, any term containing a higher power of dt, such as (dt)², is
extremely small and can be neglected.
• We can make ds and dt infinitesimally small, so that this “very small” line
segment becomes a point. The rate of change at this point or instant
of time is the slope of the tangent to the curve at that point.
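To make the shrinking-segment idea concrete, here is a small numerical sketch. It assumes a hypothetical curved relationship s = t², which is not given in the slides. For that curve ds = 2t·dt + (dt)², so the segment slope ds/dt equals 2t + dt, and the leftover dt disappears as the segment gets smaller.

```python
def speed(t):
    # Hypothetical curved relationship, assumed for illustration: s = t^2
    return t ** 2

def segment_slope(f, t, dt):
    """Slope ds/dt of the small straight segment from t to t + dt."""
    ds = f(t + dt) - f(t)
    return ds / dt

t = 3.0
# Shrink dt and watch the segment slope approach the tangent slope 2 * t = 6.
slopes = [segment_slope(speed, t, dt) for dt in (1.0, 0.1, 0.001)]

for dt, slope in zip((1.0, 0.1, 0.001), slopes):
    print(f"dt = {dt}: slope = {slope:.4f}")
```

The slopes move from 7 toward 6 as dt shrinks, which is the rate of change at the instant t = 3 for this assumed curve.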


• In calculus terminology, the process of finding the rate of change of one
variable with respect to another is known as “differentiation”. In other
words, if
• y = f(x)
• then the process of finding dy/dx is known as “differentiation” or
“differentiating”. The ratio dy/dx is called “the differential coefficient of y
with respect to x” or the “derivative”. Remember, it is nothing but the rate of
change of y with respect to x.
• Keep in mind that “dx” means a very small part of x.
• What if y is a constant? What will dy/dx be in this case? It will be 0, because the
value of y does not change.
Application of Differential Calculus in Machine Learning

• The most prominent application of calculus in machine learning is the gradient
descent algorithm in linear regression (and neural networks).
• Linear regression involves using data to calculate a line that best fits that data,
and then using that line to predict scores on one variable from another.
• Prediction is simply the process of estimating scores of the outcome (or
dependent) variable based on the scores of the predictor (or independent)
variable. To generate the regression line, we look for a line of best fit.
• The line that best explains the relationship between the independent and dependent
variable(s) is said to be the best-fit line. The difference between the
predicted value and the actual (observed) value gives the error. The formula used to calculate
this error is called the cost function.
Cont…
• The line of best fit is expressed mathematically as
• Y = mx + c
• where m and c are the slope and intercept of the line respectively. These are the
two coefficients that the gradient descent algorithm has to find.
• The error depends on the coefficients of the line. If the values of the
coefficients are not optimal, the error will be larger. The cost function, i.e. the
amount of error of the linear regression model, depends on the values of the
coefficients chosen.
• This is where calculus is used. We can find the rate of change of the error with
respect to the values of the coefficients. The value for which this rate of
change is 0 (the bottom of the curve) is the optimal value.
• So differentiation is used to find the minimum point of the curve. This point
gives the optimal value of the coefficient we were looking for.
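The following is a minimal sketch of how gradient descent could find m and c for the line Y = mx + c. Several choices here are assumptions not stated in the slides: the cost function is taken to be the mean squared error, the data points are hypothetical values generated from y = 2x + 1, and the learning rate and iteration count are arbitrary.

```python
# Hypothetical data generated from y = 2x + 1 (no noise), for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def cost(m, c):
    """Mean squared error between predictions m*x + c and the actual values."""
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(m, c):
    """Rate of change of the cost with respect to m and with respect to c."""
    n = len(xs)
    d_m = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys)) / n
    d_c = sum(2 * (m * x + c - y) for x, y in zip(xs, ys)) / n
    return d_m, d_c

m, c = 0.0, 0.0        # arbitrary starting coefficients
learning_rate = 0.05   # assumed step size
for _ in range(5000):
    d_m, d_c = gradients(m, c)
    # Move each coefficient against its gradient, i.e. downhill on the cost curve.
    m -= learning_rate * d_m
    c -= learning_rate * d_c

print(f"m = {m:.4f}, c = {c:.4f}, cost = {cost(m, c):.2e}")
```

At the end of the loop both gradients are close to 0, which corresponds to the bottom of the cost curve described above.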
Integration
• Integration is simply the reverse of differentiation.
• In differentiation, we break things up into smaller and smaller parts.
• In integration, we accumulate, or add up, all the smaller parts together. The
symbol for integration is ∫.

• What is the sum of all the small parts of x? It is x itself: ∫ dx = x.


Why Differentiation and then Integration?

• There are many things that can not be understood until and unless you break
them up into smaller parts, do some operation for each smaller part and then
accumulate or add up the results.
Example
• Let us take a rectangle:
• We know that area of a rectangle is:
• Length * Width (y *x)
• Can you prove it?
• Think in terms of calculus. Imagine a smaller rectangle formed by taking a small bit
of the width (dx).
• The whole rectangle can be thought of as the sum of all the smaller rectangles of
width dx:
Cont…

• So now we have ‘x’ small rectangles, each with width dx and
length y.
• If you imagine dx to be very, very small, each mini rectangle is eventually
reduced to a line (width close to 0 and length y). The whole rectangle is
just a collection of ‘lines’, each of length ‘y’. How many lines? x.
• Total area = Length of Line 1 + Length of Line 2 + ... + Length of Line x
= y + y + ... + y (x times)
• Area = y * x
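The accumulation argument can be checked numerically. The sketch below slices a rectangle into thin strips of width dx and adds up their areas; the function name `rectangle_area_by_strips` and the dimensions 4 by 3 are illustrative assumptions.

```python
def rectangle_area_by_strips(x, y, n_strips):
    """Add up n_strips thin rectangles, each of width dx and length y."""
    dx = x / n_strips
    return sum(y * dx for _ in range(n_strips))

# Assumed example dimensions: width x = 4, length y = 3.
area = rectangle_area_by_strips(x=4.0, y=3.0, n_strips=1000)
print(area)
```

However many strips are used, the total stays at y * x (up to floating-point rounding), which is what integrating the constant y over the width gives.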
Gradient Descent
Stochastic Gradient Descent

• “Gradient descent is an iterative
algorithm, that starts from a random
point on a function and travels down
its slope in steps until it reaches the
lowest point of that function.”
• The objective of the gradient descent
algorithm is to find the value of “x” such
that “y” is minimum.
The steps of the algorithm are
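A common way to write these steps, offered here as a sketch under stated assumptions rather than as the slide's own listing: pick a starting x, compute the slope dy/dx there, move x a small step against the slope, and repeat until the steps become negligible. The function y = (x − 3)² + 1, the learning rate, and the stopping tolerance below are all hypothetical choices, and the start point is fixed instead of random so the run is repeatable.

```python
def f(x):
    # Hypothetical function to minimise, assumed for illustration.
    return (x - 3) ** 2 + 1

def df(x):
    # Its derivative dy/dx, i.e. the slope at x.
    return 2 * (x - 3)

def gradient_descent(start, learning_rate=0.1, tolerance=1e-8, max_steps=10_000):
    """Walk downhill from `start` until the step size falls below `tolerance`."""
    x = start
    for _ in range(max_steps):
        step = learning_rate * df(x)
        if abs(step) < tolerance:
            break
        x -= step  # move against the slope
    return x

x_min = gradient_descent(start=-5.0)
print(f"x = {x_min:.6f}, y = {f(x_min):.6f}")
```

The learning rate controls the step length: too large a value can overshoot the minimum, while too small a value makes convergence slow.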
Vector
Matrix
Matrix Operations
Linear Regression
Eigen Vector
Understanding Eigen Value and Eigen Vector
• Determinant
Calculation of Determinant
Eigen Value
Example
DESCRIPTIVE STATISTICS
