Tut02 - Calculus Crash Course

The document discusses finding good parameter values for a model that minimize loss. It introduces the concept of calculus and derivatives for finding the local minimum of a loss function. The gradient of a multivariate function provides the rate of change in all directions and can be used with gradient descent to minimize loss by moving in the opposite direction of the gradient at each step.


Finding good parameters

"Is there a simple way to find parameter values that give me the least loss?"
"Oh, of course! You are already familiar with it. It is called calculus."
"Clearly, we don't agree on the definition of the word 'simple'."

We suspect that the following law holds (actually, just a linear model):
salary ≈ wᵀx + b, where x is the feature vector encoding age, highest education and gender.

Age   Highest education (HS, BT, MT, PhD)   Gender (M, F, O)   True Salary
45    0 0 1 0                               1 0 0              64475
22    0 1 0 0                               0 1 0              34179
28    1 0 0 0                               1 0 0              34573
34    0 0 1 0                               0 1 0              50882
47    1 0 0 0                               0 1 0              38660
55    0 0 1 0                               1 0 0              71487
49    0 0 0 1                               0 1 0              79430
27    0 1 0 0                               0 0 1              34355
25    0 1 0 0                               1 0 0              43837

Use the average loss to find good params:
L(w, b) = (1/n) Σᵢ (wᵀxᵢ + b − yᵢ)², where xᵢ is the feature vector for the i-th person in our training set and yᵢ is their true salary.
Calculus Crash Course
It is said that both Gottfried Leibniz and Isaac Newton felt that the other person's work was a bit … derivative.
Derivatives
For this simple case, we can exactly calculate the discrepancy between the true change in function value and the derivative-based estimate, i.e., the smaller the movement, the smaller the discrepancy.


[Figure: plot of the function over roughly x ∈ [−2, 2], with three cases of step size marked to show that smaller steps give smaller discrepancies.]
Derivatives – Behind the Scenes
f(x + δ) ≈ f(x) + δ·f′(x)
"Can be obtained as a corollary of Taylor's theorem."
"Rearranging this equation gives me f(x + δ) − f(x) ≈ δ·f′(x)."
This holds only if δ is "small"; otherwise the approximation may not hold.
If δ has the same sign as f′(x), then f(x + δ) > f(x).
If δ has the opposite sign as f′(x), then f(x + δ) < f(x).
The derivative tells us two things:
Its sign tells us in which direction the function value will increase.
Interpret +ve as "right" and −ve as "left".
E.g., f′(a) < 0 tells me that f will increase if I decrease x a bit from a.
Its magnitude gives an idea of how much the function value will change.
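To see this on a computer, here is a minimal Python sketch (not from the original slides; the function f below is just an arbitrary smooth example) comparing the true change f(x + δ) − f(x) with the estimate δ·f′(x):

```python
# A minimal sketch (not from the slides): numerically checking the
# first-order approximation f(x + delta) ≈ f(x) + delta * f'(x).
import math

def f(x):
    return math.sin(x) + 0.5 * x**2   # any smooth function will do

def f_prime(x):
    return math.cos(x) + x            # its exact derivative

x = 0.8
for delta in (0.5, 0.1, 0.01):
    true_change = f(x + delta) - f(x)
    estimate = delta * f_prime(x)
    print(delta, true_change, estimate, abs(true_change - estimate))
# The discrepancy shrinks as delta shrinks, as the slide claims.
```

The gap between the two quantities shrinks roughly quadratically in δ, which is exactly the "smaller the movement, smaller the discrepancy" behaviour from the earlier plot.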
Stationary points
These are places where the derivative vanishes, i.e., f′(x) is 0.
These could be a local minimum, a global minimum, a local maximum, a global maximum, or a saddle point.
"This point does not look flat to me."
"Yup! In fact, it looks like the function value will increase both to the left and to the right of it."
The derivative being zero at a point is the function's way of telling us that it looks flat around that point.
"Look! At this small scale, the function looks almost like a straight line! I bet the slope of this line is f′(x₀). This is why at small scales we have f(x₀ + δ) ≈ f(x₀) + δ·f′(x₀)."
[Figure: tangent line/plane. Zoomed-in views show that the function really does look flat at a stationary point. What about some other value of x where f′(x) ≠ 0?]
"Remember kids, such effects are visible only when we look at very small scales. Let me zoom in to show you."
Stationary points
We can find out whether a stationary point is a max. or a min. using the 2nd derivative f″(x), where we define f″(x) = d/dx f′(x).
The sign of f″(x) tells us in which direction f′(x) will increase; its magnitude gives us an idea of how much f′(x) will change.
If f′(x) = 0 and f″(x) < 0, then the derivative moves from +ve to −ve around this point, i.e., a maximum.
If f′(x) = 0 and f″(x) > 0, then the derivative moves from −ve to +ve around this point, i.e., a minimum.
If f′(x) = 0 and f″(x) = 0, then this may be a min or a max or a saddle – higher derivatives, e.g., f‴(x), are needed.
Also, there is no general way of telling whether a max/min is local or global.
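As a quick illustration (not from the original slides), here is the second-derivative test applied to the example f(x) = x³ − 3x, which has stationary points at x = ±1:

```python
# A minimal sketch (not from the slides): classifying the stationary points
# of f(x) = x**3 - 3*x using the second-derivative test.
def f_prime(x):
    return 3 * x**2 - 3      # f'(x)

def f_double_prime(x):
    return 6 * x             # f''(x)

for x0 in (-1.0, 1.0):       # both are stationary: f'(x0) = 0
    if f_double_prime(x0) > 0:
        kind = "local minimum"
    elif f_double_prime(x0) < 0:
        kind = "local maximum"
    else:
        kind = "inconclusive (need higher derivatives)"
    print(x0, kind)
# Prints: -1.0 local maximum, then 1.0 local minimum
```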
Rules of derivatives
In the following we have f, g: ℝ → ℝ.
Sum Rule: (f + g)′(x) = f′(x) + g′(x)
Scaling Rule: (c·f)′(x) = c·f′(x) if c is a constant
Product Rule: (f·g)′(x) = f′(x)·g(x) + f(x)·g′(x)
Quotient Rule: (f/g)′(x) = (f′(x)·g(x) − f(x)·g′(x)) / g(x)²
Chain Rule: (f∘g)′(x) = f′(g(x))·g′(x)
The chain rule is commonly used when f is a function of g and g is dependent on x.
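If you ever doubt one of these rules, a finite-difference check settles it quickly. A minimal sketch (not from the slides; the particular f and g are arbitrary examples) for the product and chain rules:

```python
# A minimal sketch (not from the slides): checking the product and chain
# rules against a central finite-difference derivative.
import math

h = 1e-6
def num_deriv(fn, x):
    return (fn(x + h) - fn(x - h)) / (2 * h)   # central difference

f = math.sin
g = lambda x: x**2 + 1
x = 1.3

# Product rule: (f*g)' = f'*g + f*g'
product = lambda t: f(t) * g(t)
print(num_deriv(product, x), math.cos(x) * g(x) + f(x) * 2 * x)

# Chain rule: (f o g)'(x) = f'(g(x)) * g'(x)
composed = lambda t: f(g(t))
print(num_deriv(composed, x), math.cos(g(x)) * 2 * x)
```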
Exercise
 Melbo claims that the following function is continuous and differentiable at the given point for a secret value of its two constants. Find them!
 Use the following identity and the derivative rules to show the claimed result.


Multivariate Functions

[Figure: three surface plots of functions of two variables, illustrating a saddle, a minimum, and a maximum.]
Multivariate derivatives aka Gradients
Consider a function f(x, y) from ℝ² to ℝ.
Trick: convert the problem into analyzing functions as we already know them.
∂f/∂x at a point is the derivative of f w.r.t. x, treating y as a constant.
The sign of ∂f/∂x tells us whether f will go up or down if we increase x slightly (keeping y constant).
The magnitude of ∂f/∂x tells us how sharply f changes upon changing x.
∂f/∂y at a point is the derivative of f w.r.t. y, treating x as a constant.
The sign of ∂f/∂y tells us whether f will go up or down if we increase y slightly (keeping x constant).
The magnitude of ∂f/∂y tells us how sharply f changes upon changing y.
The vector (∂f/∂x, ∂f/∂y) is called the gradient ∇f of f.
"Aha! So, a gradient is like a bunch of coordinate-wise derivatives arranged in the form of a vector!"
For a function whose input has d coordinates, we simply repeat the process for each coordinate to define the gradient as ∇f = (∂f/∂x₁, …, ∂f/∂x_d).
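Here is a minimal Python sketch (not from the slides; the example function is arbitrary) of exactly this recipe: differentiate along one coordinate at a time, holding the other fixed.

```python
# A minimal sketch (not from the slides): approximating the gradient of a
# two-variable function by coordinate-wise central differences.
def f(x, y):
    return x**2 + 3 * x * y + y**2     # an arbitrary example function

def gradient(f, x, y, h=1e-6):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # y held constant
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # x held constant
    return (dfdx, dfdy)

print(gradient(f, 1.0, 2.0))
# Exact gradient is (2x + 3y, 3x + 2y) = (8, 7) at the point (1, 2)
```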
Gradients
"The gradient vector only tells me how the function value changes if I change one coordinate keeping all other coordinates fixed. What if I want to change multiple coordinates at the same time?"

[Figure: the surface of f near a point (x₀, y₀), with the two axis-aligned slices f(x, y₀) and f(x₀, y) highlighted.]

Stationary points are those points where the gradient vector is all zeros, i.e., the function looks flat in all directions.
As in the 1D case, maxima, minima, and saddle points are all stationary.
Steepest ascent
f(x + δ·u) ≈ f(x) + δ·⟨∇f(x), u⟩, where ‖u‖ = 1 and δ > 0, if δ is "small".
(A fancy way of saying that, for a small step, the change in f is roughly the dot product of the step with the gradient.)
Claim: the direction of the gradient vector offers the biggest increase in function value out of all directions.
Proof: Suppose we are only allowed to take a step v of length δ, i.e., ‖v‖ = δ. Recall that ⟨∇f(x), v⟩ = ‖∇f(x)‖·‖v‖·cos θ, where θ is the angle between the vectors ∇f(x) and v. To get the max increase in value, we must increase cos θ as much as possible, which happens when θ = 0, i.e., v and ∇f(x) point in the same direction.
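An easy empirical check (not from the slides; the function and the point are arbitrary choices): sample many random unit directions and see which one increases f the most for a small step.

```python
# A minimal sketch (not from the slides): for a small step, moving along
# the gradient direction should increase f the most.
import math, random

def f(x, y):
    return x**2 + 3 * x * y + y**2

def grad_f(x, y):
    return (2 * x + 3 * y, 3 * x + 2 * y)   # exact gradient

x0, y0, delta = 1.0, 2.0, 1e-3
gx, gy = grad_f(x0, y0)
norm = math.hypot(gx, gy)

best_dir, best_gain = None, -float("inf")
for _ in range(1000):
    theta = random.uniform(0, 2 * math.pi)
    ux, uy = math.cos(theta), math.sin(theta)          # random unit direction
    gain = f(x0 + delta * ux, y0 + delta * uy) - f(x0, y0)
    if gain > best_gain:
        best_dir, best_gain = (ux, uy), gain

print("best random direction:", best_dir)
print("gradient direction:   ", (gx / norm, gy / norm))
# The best random direction should be very close to the gradient direction.
```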
Steepest descent
f(x + δ·u) ≈ f(x) + δ·⟨∇f(x), u⟩, where ‖u‖ = 1 and δ > 0, if δ is "small".
(A fancy way of saying the same first-order approximation as before.)
"I have a feeling this result will be very useful when we wish to minimize loss functions."
"Indeed! This simple-looking 2-line result is the key to powerful ML algorithms such as gradient descent and backpropagation."
Claim: the direction opposite to the gradient vector offers the biggest decrease in function value out of all directions.
Proof: Suppose we are only allowed to take a step v of length δ, i.e., ‖v‖ = δ. Recall that ⟨∇f(x), v⟩ = ‖∇f(x)‖·‖v‖·cos θ, where θ is the angle between the vectors ∇f(x) and v. To get the max decrease in value, we must decrease cos θ as much as possible, which happens when θ = π, i.e., v and ∇f(x) point in opposite directions.
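This is all gradient descent does: repeatedly take a small step against the gradient. A minimal sketch (not from the slides; the convex example function and the step size are arbitrary choices):

```python
# A minimal sketch (not from the slides): plain gradient descent on the
# convex example f(x, y) = (x - 3)**2 + 2*(y + 1)**2, minimized at (3, -1).
def grad_f(x, y):
    return (2 * (x - 3), 4 * (y + 1))

x, y = 0.0, 0.0          # arbitrary starting point
step = 0.1               # step size ("learning rate")
for t in range(200):
    gx, gy = grad_f(x, y)
    x, y = x - step * gx, y - step * gy   # move opposite to the gradient

print(x, y)              # ≈ (3.0, -1.0)
```

In ML, f would be the average loss and the input would be the parameter vector w; the step size has to be small enough for the first-order approximation above to remain trustworthy.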
A Toy Example – Function output values
In this discrete toy example, we can calculate the gradient at any grid point using finite differences over its neighbouring cells.

Function values f(x, y) on the grid (x = 0…8 left to right, y = 0…6 bottom to top):

y = 6:  3 3 3 3 3 3 3 3 3
y = 5:  2 2 2 3 4 3 3 2 1
y = 4:  1 1 1 3 3 3 1 1 1
y = 3:  1 0 1 1 2 1 1 0 1
y = 2:  1 1 1 3 3 3 1 1 1
y = 1:  1 2 3 3 4 3 2 2 2
y = 0:  3 3 3 3 3 3 3 3 3
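A minimal sketch of one such finite-difference scheme (central differences are an assumption here, not necessarily the exact formula the slide used):

```python
# A minimal sketch (not from the slides): central-difference gradients on
# the toy grid above (grid[y][x], with y = 0 as the bottom row).
grid = [
    [3, 3, 3, 3, 3, 3, 3, 3, 3],   # y = 0
    [1, 2, 3, 3, 4, 3, 2, 2, 2],   # y = 1
    [1, 1, 1, 3, 3, 3, 1, 1, 1],   # y = 2
    [1, 0, 1, 1, 2, 1, 1, 0, 1],   # y = 3
    [1, 1, 1, 3, 3, 3, 1, 1, 1],   # y = 4
    [2, 2, 2, 3, 4, 3, 3, 2, 1],   # y = 5
    [3, 3, 3, 3, 3, 3, 3, 3, 3],   # y = 6
]

def grad(x, y):
    dfdx = (grid[y][x + 1] - grid[y][x - 1]) / 2   # change along x
    dfdy = (grid[y + 1][x] - grid[y - 1][x]) / 2   # change along y
    return (dfdx, dfdy)

print(grad(1, 3))   # (0.0, 0.0) -> stationary (a minimum here)
print(grad(4, 3))   # (0.0, 0.0) -> also stationary (a saddle)
print(grad(2, 2))   # a non-zero gradient pointing "uphill"
```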
A Toy Example – Gradients
Computing the gradient at every grid point in the same way, we can visualize these gradients using simple arrows as well.

[Figure: the grid with a gradient arrow at each point; the saddle, the minima, and the maxima are marked.]
A Toy Example – Gradients
Gradients converge toward a maximum from all directions.
Gradients diverge away from a minimum in all directions.
At saddle points, different things happen along different directions.

[Figure: the same arrow plot, annotated with these three behaviours.]
Rules of gradients
In the following we have f, g: ℝᵈ → ℝ and x ∈ ℝᵈ.
Sum Rule: ∇(f + g)(x) = ∇f(x) + ∇g(x)
Scaling Rule: ∇(c·f)(x) = c·∇f(x) if c is a constant that does not vary with x
Product Rule: ∇(f·g)(x) = g(x)·∇f(x) + f(x)·∇g(x)
Quotient Rule: ∇(f/g)(x) = (g(x)·∇f(x) − f(x)·∇g(x)) / g(x)²
Chain Rule: for h: ℝ → ℝ, ∇(h∘f)(x) = h′(f(x))·∇f(x)
A few useful identities
If c is a constant that does not vary with x: ∇ₓ c = 0.
If a is a constant vector that does not vary with x: ∇ₓ (aᵀx) = a.
If A is a constant square matrix that does not vary with x: ∇ₓ (xᵀAx) = (A + Aᵀ)x, which equals 2Ax if A is symmetric.

If a is a constant vector that does not vary with x: ∇ₓ ‖x − a‖² = 2(x − a).
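A minimal numerical check of the matrix identity (not from the slides; the random A and x are just a test case):

```python
# A minimal sketch (not from the slides): numerically verifying the identity
# grad_x (x^T A x) = (A + A^T) x on a small random example.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))       # a constant, not necessarily symmetric, matrix
x = rng.standard_normal(3)

def f(x):
    return x @ A @ x

h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
analytic = (A + A.T) @ x

print(np.allclose(numeric, analytic, atol=1e-4))   # True
```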


Exercise
 Find the gradient of the multivariate function …
 Hint: try to derive this from first principles (easy if using one of the identities).
 Find the gradient of the affine multivariate function defined as …
 Find the gradient of ‖x − a‖², where a is a constant vector.
 Hint: use the fact that for any vector v we always have ‖v‖² = vᵀv, and the fact that the dot product distributes over addition, i.e., aᵀ(b + c) = aᵀb + aᵀc.
 Alternatively, use the chain rule with g(x) = x − a.
Applying Calculus to Find My Salary
Let's simplify by hiding the bias: take x ← (x, 1) and w ← (w, b), so that wᵀx + b becomes just wᵀx.
"Be careful though – the similar-looking expression xᵀx is equal to the squared Euclidean norm ‖x‖² and is just a scalar, whereas x·xᵀ is a square symmetric matrix whose (i, j)-th entry is xᵢxⱼ."
Let ℓ(w) = (y − wᵀx)² be the loss on a single data point; note that wᵀx is just a scalar. Expanding, ℓ(w) = y² − 2y·(xᵀw) + wᵀ(x·xᵀ)w.
∇_w y² = 0, since the y's are the "ground-truth" labels and don't depend on w.
∇_w (−2y·xᵀw) = −2y·x, using the scaling rule and an identity.
∇_w (wᵀ(x·xᵀ)w) = 2(x·xᵀ)w, using an identity and the fact that dot products are symmetric, i.e., aᵀb = bᵀa.
Applying Calculus to Find My Salary
By the sum rule, the gradient of the average loss is ∇L(w) = (1/n) Σᵢ (2xᵢxᵢᵀw − 2yᵢxᵢ).
As L is differentiable, the minimum must be one of the stationary points, i.e., a solution of (Σᵢ xᵢxᵢᵀ)w = Σᵢ yᵢxᵢ.
We arrange the feature vectors in a matrix X (one row per person) and the true outputs/ground-truth labels in a vector y, so this condition reads XᵀXw = Xᵀy.
If XᵀX is invertible, then the solution must be w = (XᵀX)⁻¹Xᵀy.
"But what if XᵀX is not invertible?"
"Then we need to apply other learning techniques such as regularization, which we will study later."
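To make this concrete, here is a minimal sketch (not from the slides) that builds X and y from the toy salary table at the start and solves the least-squares problem. For this particular table XᵀX is in fact not invertible (each one-hot group sums to the bias column), so the sketch uses a least-squares solver rather than a plain matrix inverse:

```python
# A minimal sketch (not from the slides): least-squares fit on the toy
# salary table, with the bias hidden by appending a constant 1 feature.
import numpy as np

# columns: age, education one-hot (HS, BT, MT, PhD), gender one-hot (M, F, O), bias
X = np.array([
    [45, 0, 0, 1, 0, 1, 0, 0, 1],
    [22, 0, 1, 0, 0, 0, 1, 0, 1],
    [28, 1, 0, 0, 0, 1, 0, 0, 1],
    [34, 0, 0, 1, 0, 0, 1, 0, 1],
    [47, 1, 0, 0, 0, 0, 1, 0, 1],
    [55, 0, 0, 1, 0, 1, 0, 0, 1],
    [49, 0, 0, 0, 1, 0, 1, 0, 1],
    [27, 0, 1, 0, 0, 0, 0, 1, 1],
    [25, 0, 1, 0, 0, 1, 0, 0, 1],
], dtype=float)
y = np.array([64475, 34179, 34573, 50882, 38660, 71487, 79430, 34355, 43837], dtype=float)

# X^T X is singular here, so solve the least-squares problem via SVD instead
# of forming (X^T X)^{-1} X^T y directly.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)
print(X @ w)   # predicted salaries for the training rows
```

np.linalg.lstsq returns the minimum-norm solution among the many w that make the gradient zero; regularization, mentioned above, is another way of breaking that tie.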
Summary
Calculus techniques play a major role in designing ML algorithms.
For a function f: ℝ → ℝ, its derivative tells us how the function output value will vary if we make a tiny change to its input value.
The sign tells us whether the output will go up or down if we increase the input a little bit.
The magnitude tells us how sensitive the output is to changes in the input.
Stationary points (max, min, saddle) are where the derivative is zero.
For a function f: ℝᵈ → ℝ, its gradient ∇f plays the same role, and its i-th coordinate tells us how the output value will change if the i-th coordinate of the input is changed keeping all other coordinates fixed.
The gradient gives us the direction of steepest ascent for the function output value, and the direction opposite to it is the direction of steepest descent.
Stay Amazing!
Hang out with you in the next one!
