Om Namo Bhagavate Vasudevaya
23ECE216
Machine Learning
Partial Derivatives
Dr. Binoy B Nair
1
Mathematical Preliminaries
Requirements for ML (covered in previous semesters):
• Linear Algebra: matrix multiplication, matrix inversion, eigenvalues and eigenvectors
• Probability and Statistics: random variables, distributions
• Information Theory: entropy, mutual information, KL divergence
• Mathematical Optimization: formulation, optimality conditions, numerical optimization methods
2
Identifying Critical Points
• Differentiation allows for the identification of critical points
such as minima, maxima, and inflection points.
• Critical points are defined as points where the derivative
is 0 (or non-existent).
• Considering a function f(x), a critical point is a point xᵢ where f'(xᵢ) = 0.
• The point xᵢ is a maximum when f''(xᵢ) < 0.
• The point xᵢ is a minimum when f''(xᵢ) > 0.
• When f''(xᵢ) = 0 the second-derivative test is inconclusive; xᵢ may be a point of
inflection, and higher-order derivatives must be examined (see Example 2).
3
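To make the classification rule concrete, here is a minimal Python sketch (using the sympy library; the example polynomial is the function from Example 1 below, and the variable names are illustrative) that solves f'(x) = 0 and inspects the sign of f''(x):

# Sketch: locate and classify critical points of a univariate function with sympy.
import sympy as sp

x = sp.symbols('x', real=True)
f = x**2 + 3*x - 5            # example function (see Example 1 below)

f1 = sp.diff(f, x)            # first derivative
f2 = sp.diff(f, x, 2)         # second derivative

for xi in sp.solve(sp.Eq(f1, 0), x):      # critical points: f'(x) = 0
    curvature = f2.subs(x, xi)
    if curvature > 0:
        kind = "minimum"
    elif curvature < 0:
        kind = "maximum"
    else:
        kind = "inconclusive (check higher-order derivatives)"
    print(f"x* = {xi}: {kind}, f(x*) = {f.subs(x, xi)}")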
Stationary points
Figure showing the three types of stationary points
(a) inflection point (b) minimum (c) maximum
4
Relative (Local) and Global Optimum …contd.
[Figure: a function f(x) plotted on the interval [a, b], with its stationary points marked]
A1, A2, A3 = relative maxima
A2 = global maximum (A2 is also the global optimum)
B1, B2 = relative minima
B1 = global minimum
5
Example 1
Find the optimum value of the function f(x) = x² + 3x − 5 and also state if the
function attains a maximum or a minimum.
Sol:
For maxima or minima, put f'(x) = 2x + 3 = 0 to get x* = −3/2.
Also f''(x) = 2, which is positive; hence the point x* = −3/2
is a point of minimum and the function attains a minimum
value of −29/4 at this point.
6
Example 2
Find the optimum value of the function f(x) = (x − 2)⁴ and also state if the
function attains a maximum or a minimum.
Solution:
f'(x) = 4(x − 2)³ = 0, or x = x* = 2 for maxima or minima.
f''(x*) = 12(x* − 2)² = 0 at x* = 2
f'''(x*) = 24(x* − 2) = 0 at x* = 2
f''''(x*) = 24 at x* = 2
The first non-zero derivative f⁽ⁿ⁾(x*) = 24 is positive and its order n = 4 is even; hence the point
x = x* = 2 is a point of minimum and the function attains a minimum value of 0 at this point.
7
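The higher-order derivative test used above can also be run symbolically. A minimal sympy sketch, assuming the first ten derivatives are enough to find a non-vanishing one:

# Sketch: higher-order derivative test at a stationary point, using sympy.
import sympy as sp

x = sp.symbols('x', real=True)
f = (x - 2)**4                      # function from Example 2
x_star = sp.Rational(2)             # stationary point, f'(2) = 0

n, d = 1, sp.diff(f, x)
while d.subs(x, x_star) == 0 and n < 10:   # find the first non-vanishing derivative
    n += 1
    d = sp.diff(d, x)

value = d.subs(x, x_star)
if n % 2 == 0:
    print("order", n, "even:", "minimum" if value > 0 else "maximum")
else:
    print("order", n, "odd: point of inflection")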
Example 3
Analyze the function f(x) = 12x⁵ − 45x⁴ + 40x³ + 5 and classify the stationary
points as maxima, minima and points of inflection.
Solution:
f(x) = 12x⁵ − 45x⁴ + 40x³ + 5
f'(x) = 60x⁴ − 180x³ + 120x² = 0
or x⁴ − 3x³ + 2x² = 0
or x = 0, 1, 2
Consider the point x = x* = 0:
f''(x*) = 240(x*)³ − 540(x*)² + 240x* = 0 at x* = 0
f'''(x*) = 720(x*)² − 1080x* + 240 = 240 at x* = 0
8
Example 3 …contd.
Since the third derivative is non-zero, x = x* = 0 is neither a point of maximum nor
of minimum; it is a point of inflection.
Consider x = x* = 1:
f''(x*) = 240(x*)³ − 540(x*)² + 240x* = −60 at x* = 1
Since the second derivative is negative, the point x = x* = 1 is a point of local maximum with
a maximum value of f(x) = 12 − 45 + 40 + 5 = 12.
Consider x = x* = 2:
f''(x*) = 240(x*)³ − 540(x*)² + 240x* = 240 at x* = 2
Since the second derivative is positive, the point x = x* = 2 is a point of local minimum with
a minimum value of f(x) = −11.
9
Differentiation in Machine Learning
• Machine learning often seeks to determine the minimum
of a cost/loss function.
• As such, iterative techniques pioneered in numerical
computing find widespread usage in various ML
algorithms.
• At its core, most such methods aim at finding critical
points where the derivative is 0.
• Differentiation tools are also used in probabilistic
parameter estimation, most notably in maximum
likelihood estimation, to identify the set of most likely
parameters for a data-generating probability distribution.
10
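As a toy illustration of such an iterative scheme, here is a minimal one-dimensional gradient-descent sketch (the starting point, step size, and iteration count are arbitrary choices) that repeatedly steps against the derivative until it is driven toward 0:

# Sketch: one-dimensional gradient descent on f(x) = x**2 + 3x - 5.
def f_prime(x):
    return 2 * x + 3          # derivative of x**2 + 3x - 5

x, eta = 10.0, 0.1            # arbitrary starting point and learning rate
for _ in range(200):
    x -= eta * f_prime(x)     # move against the derivative

print(x)                      # converges toward the critical point x* = -1.5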
Partial Derivatives
In general, if f is a function of two variables x and y,
suppose we let only x vary while keeping y fixed, say y = b,
where b is a constant.
Then we are really considering a function of a single
variable x, namely, g(x) = f (x, b). If g has a derivative at a,
then we call it the partial derivative of f with respect to x
at (a, b) and denote it by fx(a, b). Thus
(1)   fx(a, b) = g'(a)    where g(x) = f(x, b)
11
Partial Derivatives
By the definition of a derivative, we have
g'(a) = lim(h→0) [g(a + h) − g(a)] / h
and so Equation 1 becomes
(2)   fx(a, b) = lim(h→0) [f(a + h, b) − f(a, b)] / h
12
Partial Derivatives
Similarly, the partial derivative of f with respect to y at
(a, b), denoted by fy(a, b), is obtained by keeping x fixed
(x = a) and finding the ordinary derivative at b of the
function G(y) = f (a, y):
(3)   fy(a, b) = lim(h→0) [f(a, b + h) − f(a, b)] / h
13
Partial Derivatives
If we now let the point (a, b) vary in Equations 2 and 3,
fx and fy become functions of two variables:
fx(x, y) = lim(h→0) [f(x + h, y) − f(x, y)] / h
fy(x, y) = lim(h→0) [f(x, y + h) − f(x, y)] / h
14
Partial Derivatives
There are many alternative notations for partial derivatives.
For instance, instead of fx we can write f1 or D1f (to indicate
differentiation with respect to the first variable) or ∂f / ∂x.
But here ∂f / ∂x can’t be interpreted as a ratio of differentials.
15
Partial Derivatives
To compute partial derivatives, all we have to do is
remember from Equation 1 that the partial derivative with
respect to x is just the ordinary derivative of the function g
of a single variable that we get by keeping y fixed.
Thus we have the following rule for z = f (x, y):
1. To find fx, regard y as a constant and differentiate f (x, y) with respect to x.
2. To find fy, regard x as a constant and differentiate f (x, y) with respect to y.
16
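The limit definitions above also give a direct numerical check: for a small step h, the difference quotients approximate fx and fy. A small sketch (the function is the one used in Example 1 below; the step size h is an arbitrary choice):

# Sketch: approximate partial derivatives with forward difference quotients.
def f(x, y):
    return x**3 + x**2 * y**3 - 2 * y**2    # function used in Example 1 below

def fx(x, y, h=1e-6):
    return (f(x + h, y) - f(x, y)) / h      # hold y fixed, vary x

def fy(x, y, h=1e-6):
    return (f(x, y + h) - f(x, y)) / h      # hold x fixed, vary y

print(fx(2.0, 1.0), fy(2.0, 1.0))           # approximately 16 and 8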
Example 1
If f(x, y) = x³ + x²y³ − 2y², find fx(2, 1) and fy(2, 1).
Solution:
Holding y constant and differentiating with respect to x,
we get
fx(x, y) = ∂/∂x [x³ + x²y³ − 2y²] = 3x² + 2xy³
and so fx(2, 1) = 3(2)² + 2(2)(1)³ = 16
17
Example 1
If f(x, y) = x³ + x²y³ − 2y², find fx(2, 1) and fy(2, 1).
Solution:
Holding x constant and differentiating with respect to y,
we get
fy(x, y) = ∂/∂y [x³ + x²y³ − 2y²] = 3x²y² − 4y
and so fy(2, 1) = 3(2)²(1)² − 4(1) = 8
18
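The same partial derivatives can be reproduced symbolically; a minimal sympy sketch:

# Sketch: symbolic partial derivatives for Example 1 with sympy.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**3 + x**2 * y**3 - 2 * y**2

fx = sp.diff(f, x)                 # 3*x**2 + 2*x*y**3
fy = sp.diff(f, y)                 # 3*x**2*y**2 - 4*y

print(fx, fy)
print(fx.subs({x: 2, y: 1}), fy.subs({x: 2, y: 1}))   # 16, 8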
Example 2
If f(x, y) = 4 − x² − 2y², find fx(1, 1) and fy(1, 1).
Solution:
We have
fx(x, y) = −2x    fy(x, y) = −4y
fx(1, 1) = −2     fy(1, 1) = −4
19
Interpretations of Partial Derivatives
20
Interpretations of Partial Derivatives
Partial derivatives can be interpreted as rates of change.
If z = f (x, y), then ∂z / ∂x represents the rate of change of z
with respect to x when y is fixed.
Similarly, ∂z / ∂y represents the rate of change of z with
respect to y when x is fixed.
21
Functions of More Than Two Variables
22
Functions of More Than Two Variables
Partial derivatives can also be defined for functions of three
or more variables.
For example, if f is a function of three variables x, y, and z,
then its partial derivative with respect to x is defined as
fx(x, y, z) = lim(h→0) [f(x + h, y, z) − f(x, y, z)] / h
and it is found by regarding y and z as constants and
differentiating f (x, y, z) with respect to x.
23
Functions of More Than Two Variables
If w = f (x, y, z), then fx = ∂w / ∂x can be interpreted as the
rate of change of w with respect to x when y and z are held
fixed.
In general, if u is a function of n variables,
u = f (x1, x2, …, xn), its partial derivative with respect to the
i-th variable xi is
∂u/∂xi = lim(h→0) [f(x1, …, xi−1, xi + h, xi+1, …, xn) − f(x1, …, xi, …, xn)] / h
and we also write
∂u/∂xi = ∂f/∂xi = fxi = fi = Di f
24
Example 3
Find fx, fy, and fz if f (x, y, z) = e^(xy) ln z.
Solution:
Holding y and z constant and differentiating with respect
to x, we have
fx = y e^(xy) ln z
Similarly,
fy = x e^(xy) ln z    and    fz = e^(xy) / z
25
Higher Derivatives
26
Higher Derivatives
If f is a function of two variables, then its partial derivatives
fx and fy are also functions of two variables, so we can
consider their partial derivatives (fx)x, (fx)y, (fy)x, and (fy)y,
which are called the second partial derivatives of f.
If z = f (x, y), we use the following notation:
(fx)x = fxx = ∂²f / ∂x²        (fx)y = fxy = ∂²f / ∂y ∂x
(fy)x = fyx = ∂²f / ∂x ∂y      (fy)y = fyy = ∂²f / ∂y²
27
Higher Derivatives
Thus the notation fxy (or ∂²f / ∂y ∂x) means that we first
differentiate with respect to x and then with respect to y,
whereas in computing fyx the order is reversed.
28
Example 4
Find the second partial derivatives of
f(x, y) = x³ + x²y³ − 2y²
Solution:
In Example 1 we found that
fx(x, y) = 3x² + 2xy³
fy(x, y) = 3x²y² − 4y
Therefore
fxx = ∂/∂x (3x² + 2xy³) = 6x + 2y³
29
Example 4 – Solution cont’d
fxy = ∂/∂y (3x² + 2xy³) = 6xy²
fyx = ∂/∂x (3x²y² − 4y) = 6xy²
fyy = ∂/∂y (3x²y² − 4y) = 6x²y − 4
30
Higher Derivatives
Notice that fxy = fyx in Example 4. This is not just a
coincidence.
It turns out that the mixed partial derivatives fxy and fyx are
equal for most functions that one meets in practice.
The following theorem, which was discovered by the
French mathematician Alexis Clairaut (1713–1765), gives
conditions under which we can assert that fxy = fyx.
Clairaut's Theorem: Suppose f is defined on a disk D that contains the point (a, b).
If the functions fxy and fyx are both continuous on D, then fxy(a, b) = fyx(a, b).
31
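A quick symbolic check of Clairaut's Theorem on the function from Example 4, as a sympy sketch:

# Sketch: verify that the mixed partials f_xy and f_yx agree (Clairaut's Theorem).
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**3 + x**2 * y**3 - 2 * y**2

fxy = sp.diff(f, x, y)     # differentiate w.r.t. x, then y
fyx = sp.diff(f, y, x)     # differentiate w.r.t. y, then x

print(fxy, fyx, sp.simplify(fxy - fyx) == 0)   # 6*x*y**2, 6*x*y**2, True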
Higher Derivatives
Partial derivatives of order 3 or higher can also be defined.
For instance, fxyy = (fxy)y = ∂/∂y (∂²f / ∂y ∂x) = ∂³f / ∂y² ∂x,
and using Clairaut’s Theorem it can be shown that
fxyy = fyxy = fyyx if these functions are continuous.
32
Moving to Higher Dimensions
Multivariate Calculus
33
Multivariate Calculus
We can generalize the concepts from univariate
differentiation to higher dimensions by studying multi-
variate (or multivariable) differentiation.
34
Gradient Vector
We can compute the partial derivative for all dimensions
𝑥𝑖 ∈ 𝑥 and collect all partial derivatives in a gradient vector:
∇f(x₁, …, xₙ) = [ ∂f/∂x₁ (x₁, …, xₙ)   ∂f/∂x₂ (x₁, …, xₙ)   …   ∂f/∂xₙ (x₁, …, xₙ) ]
35
Example 5
Consider the function f(x, y) = x²y + 3 ln(xy). Find the
partial derivatives and the gradient.
Solution
The partial derivatives and the gradient are computed as
follows:
∂f/∂x (x, y) = ∂/∂x [x²y + 3 ln(xy)] = 2xy + (3/(xy))·y = 2xy + 3/x
∂f/∂y (x, y) = ∂/∂y [x²y + 3 ln(xy)] = x² + (3/(xy))·x = x² + 3/y
∇f(x, y) = [ ∂f/∂x (x, y)   ∂f/∂y (x, y) ] = [ 2xy + 3/x   x² + 3/y ]
36
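The gradient in Example 5 can be reproduced with symbolic differentiation. A minimal sympy sketch (declaring x and y as positive keeps ln(xy) well defined):

# Sketch: gradient of f(x, y) = x**2*y + 3*ln(x*y) from Example 5, via sympy.
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = x**2 * y + 3 * sp.log(x * y)

grad = [sp.simplify(sp.diff(f, v)) for v in (x, y)]
print(grad)                                # [2*x*y + 3/x, x**2 + 3/y]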
Jacobian Matrix
For a differentiable vector-valued function f : ℝⁿ → ℝᵐ, the
Jacobian is defined as the m × n matrix whose i-th row is the transposed gradient ∇fᵢᵀ of the component fᵢ:

    [ ∂f₁/∂x₁   ∂f₁/∂x₂   ⋯   ∂f₁/∂xₙ ]
J = [ ∂f₂/∂x₁   ∂f₂/∂x₂   ⋯   ∂f₂/∂xₙ ]
    [    ⋮         ⋮       ⋱      ⋮    ]
    [ ∂fₘ/∂x₁   ∂fₘ/∂x₂   ⋯   ∂fₘ/∂xₙ ]
37
Example 6
Consider the multivariate function f(x, y) = [ f₁(x, y), f₂(x, y) ]ᵀ = [ x²y, 5x + sin(y) ]ᵀ.
Find its Jacobian.
Solution
J = [ ∂f₁/∂x   ∂f₁/∂y ]   [ 2xy   x²     ]
    [ ∂f₂/∂x   ∂f₂/∂y ] = [ 5     cos(y) ]
38
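The Jacobian in Example 6 can be reproduced with sympy's Matrix.jacobian helper; a minimal sketch:

# Sketch: Jacobian of f(x, y) = [x**2*y, 5x + sin(y)] from Example 6, via sympy.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = sp.Matrix([x**2 * y, 5 * x + sp.sin(y)])   # vector-valued function, m = 2, n = 2

J = f.jacobian([x, y])
print(J)        # Matrix([[2*x*y, x**2], [5, cos(y)]])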
Example 7
Consider the function f(x, y, z) = [ u(x, y, z), v(x, y, z), w(x, y, z) ]ᵀ where
u(x, y, z) = 9x²y² + z e^x,   v(x, y, z) = xy + x²y³ + 2z,   w(x, y, z) = e^y cos(x) sin(z).
Find the Jacobian matrix and evaluate it at (0, 0, 0).
39
Example 7 - Solution
f(x, y, z) = [ u(x, y, z), v(x, y, z), w(x, y, z) ]ᵀ

    [ ∂u/∂x   ∂u/∂y   ∂u/∂z ]   [ 18xy² + z e^x         18x²y               e^x               ]
J = [ ∂v/∂x   ∂v/∂y   ∂v/∂z ] = [ y + 2xy³              x + 3x²y²           2                 ]
    [ ∂w/∂x   ∂w/∂y   ∂w/∂z ]   [ −e^y sin(x) sin(z)    e^y cos(x) sin(z)   e^y cos(x) cos(z) ]
40
Example 7 - Solution
At (0,0,0), we get:
             [ 18·0·0² + 0·e⁰       18·0²·0             e⁰               ]   [ 0  0  1 ]
J(0, 0, 0) = [ 0 + 2·0·0³           0 + 3·0²·0²         2                ] = [ 0  0  2 ]
             [ −sin(0) sin(0) e⁰    cos(0) sin(0) e⁰    cos(0) cos(0) e⁰ ]   [ 0  0  1 ]
41
Gradient and Jacobian in ML
• The gradient and the Jacobian are bread-and-butter tools of
both classical numerical approximation algorithms and
machine learning.
• In particular, they are the core ingredients of an
algorithm called gradient descent which enables us to
iteratively search for a local minimum of a differentiable
function.
• For instance, neural network learning relies on gradient
descent via the backpropagation algorithm.
42
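A minimal multivariate gradient-descent sketch (the quadratic objective, learning rate, and iteration count below are illustrative choices, not taken from the slides):

# Sketch: gradient descent x <- x - eta * grad f(x) on a simple quadratic bowl.
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 1.0]])          # positive definite, so a unique minimum exists
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x - b                # gradient of f(x) = 0.5*x^T A x - b^T x

x, eta = np.zeros(2), 0.1           # arbitrary start and learning rate
for _ in range(500):
    x -= eta * grad(x)

print(x, np.linalg.solve(A, b))     # both approach the minimizer A^{-1} b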
Hessian Matrix
For a twice-differentiable scalar-valued function f : ℝⁿ → ℝ,
the Hessian matrix H is defined as the matrix containing all
combinations of second-order partial derivatives:

    [ ∂²f/∂x₁²       ∂²f/∂x₁∂x₂    ⋯   ∂²f/∂x₁∂xₙ ]
H = [ ∂²f/∂x₂∂x₁     ∂²f/∂x₂²      ⋯   ∂²f/∂x₂∂xₙ ]
    [     ⋮              ⋮          ⋱       ⋮      ]
    [ ∂²f/∂xₙ∂x₁     ∂²f/∂xₙ∂x₂    ⋯   ∂²f/∂xₙ²   ]

Note that ∂xᵢ² means ∂xᵢ ∂xᵢ.
43
Example 8
Consider the function f(x, y) = x³ − 2xy − y⁶. Compute the
Hessian.
Solution:
The Hessian is computed as follows:
    [ ∂²f/∂x²     ∂²f/∂x∂y ]   [ 6x    −2    ]
H = [ ∂²f/∂y∂x    ∂²f/∂y²  ] = [ −2    −30y⁴ ]
44
Example 9
Consider the function f(x, y) = e^x cos(y). Compute the
Hessian.
Solution:
The Hessian is computed as follows:
    [ ∂²f/∂x²     ∂²f/∂x∂y ]   [ e^x cos(y)    −e^x sin(y) ]
H = [ ∂²f/∂y∂x    ∂²f/∂y²  ] = [ −e^x sin(y)   −e^x cos(y) ]
45
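The Hessians in Examples 8 and 9 can be reproduced with sympy's hessian helper; a minimal sketch:

# Sketch: Hessian matrices for Examples 8 and 9 via sympy.hessian.
import sympy as sp

x, y = sp.symbols('x y', real=True)

H8 = sp.hessian(x**3 - 2*x*y - y**6, (x, y))
print(H8)    # Matrix([[6*x, -2], [-2, -30*y**4]])

H9 = sp.hessian(sp.exp(x) * sp.cos(y), (x, y))
print(H9)    # Matrix([[exp(x)*cos(y), -exp(x)*sin(y)], [-exp(x)*sin(y), -exp(x)*cos(y)]])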
Hessian in Machine Learning
• The Hessian matrix and its approximations are
frequently used to assess the curvature of loss
landscapes in neural networks.
• Similar to the univariate second-order derivative of a
function, the Hessian enables us to identify saddle points
and directions of higher and lower curvature.
• In particular, the ratio between the largest and the smallest
eigenvalue of H, i.e. κ = λmax / λmin, defines the condition
number.
• A large condition number implies slower convergence
while a condition number of 1 enables gradient descent
to quickly converge in all curvature directions.
46
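For instance, the eigenvalues of a Hessian and the resulting condition number can be computed numerically; a small numpy sketch with an illustrative positive-definite matrix:

# Sketch: condition number of a Hessian from its eigenvalues.
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 2.0]])                  # illustrative symmetric positive definite Hessian

eigvals = np.linalg.eigvalsh(H)             # sorted ascending for symmetric matrices
kappa = eigvals[-1] / eigvals[0]            # lambda_max / lambda_min
print(eigvals, kappa)                       # large kappa -> slower gradient-descent convergence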
References
Chapter 14: Partial Derivatives, Cengage Learning.
Stephan Rabanser, Tutorial: Multivariate Differentiation,
ECE421 – Introduction To Machine Learning (Fall 2022),
University of Toronto & Vector Institute for AI.
47