Calculus for Machine Learning
Understanding the Language of Mathematics

Jason Brownlee, Founder
Disclaimer
The information contained within this eBook is strictly for educational purposes. If you wish to
apply ideas contained in this eBook, you are taking full responsibility for your actions.
The author has made every effort to ensure that the information within this book was accurate
at the time of publication. The author does not assume and hereby disclaims any liability to any
party for any loss, damage, or disruption caused by errors or omissions, whether such errors or
omissions result from accident, negligence, or any other cause.
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or
mechanical, recording or by any information storage and retrieval system, without written
permission from the author.
Credits
Founder: Jason Brownlee
Authors: Stefania Cristina and Mehreen Saeed
Lead Editor: Adrian Tam
Technical reviewers: Andrei Cheremskoy, Darci Heikkinen, and Arun Koshy
Copyright
Calculus for Machine Learning
© 2022 MachineLearningMastery.com. All Rights Reserved.
Edition: v1.00
Contents

Copyright
Preface
Introduction

I Foundations

1 What is Calculus?
  Calculus · Applications of calculus · Further reading · Summary

2 Rate of Change
  Rate of change · The importance of measuring the rate of change · Further reading · Summary

3 Why it Works?
  Calculus in machine learning · Why calculus in machine learning works · Further reading · Summary

6 Evaluating Limits
  Rules for limits · Limits for polynomials · Limits for rational functions · Case for functions with a discontinuity · The sandwich theorem · Evaluating limits with Python · Further reading · Summary

7 Function Derivatives
  What is the derivative of a function · Differentiation examples · Differentiability and continuity · Further reading · Summary

8 Continuous Functions
  Prerequisites · An informal definition of continuous functions · A formal definition · Examples · Connection of continuity with function derivatives · Intermediate value theorem · Extreme value theorem · Continuous functions and optimization · Further reading · Summary

13 Applications of Derivatives
  Applications of derivatives in real-life · Applications of derivatives in optimization algorithms · Further reading · Summary

V Approximation

27 Approximation
  What is approximation? · Approximation when form of function is known · Approximation when form of function is unknown · Further reading · Summary
Preface

When we try to understand machine learning algorithms, it is quite difficult to avoid calculus.
This book is to help you refresh the calculus you learned, or give you a quick start on just
enough calculus to move forward.
When calculus is brought up, the first thing that comes to mind for many is difficult math.
In reality, evaluating a calculus problem is just a matter of following some rules. The most
important thing in studying calculus is to remember the physical nature it represents. At
times, calculus can be abstract, but it is often not hard to visualize.
So why must we learn about calculus in studying machine learning? In many machine
learning algorithms, we have a goal of what the machine should do and we expect it would
behave in a certain way. Calculus is the tool for us to model the algorithm behavior. It allows
us to see how the behavior of the algorithm would change if a parameter is changed. It also
gives us insight on which direction we should fine-tune the algorithm or whether the algorithm
is achieving the best it can do, even if it doesn’t perfectly fit the training data.
As a practitioner, you need to know that calculus is a tool for modeling. After reading this
book, you should know why we cannot use an accuracy measure as a loss function in training a
neural network: accuracy is not a differentiable function. You will also be able to explain
why a larger neural network trains disproportionately slower, by counting the number of
differentiations we need to compute in backpropagation. Furthermore, if you understand
calculus, you can convert an idea for a machine learning algorithm into code.
This is a book on the theoretical side of machine learning, but it does not aim to be
comprehensive. The objective of this book is to provide you with the background to understand
API documentation and other people's work on machine learning. It gives you an overview so
that you can go deeper with more advanced calculus books if you would like to.
The earlier chapters of this book focus on the foundation. They introduce the notation
and terminology, as well as the concepts of calculus, but are deliberately kept to the minimum.
The later chapters introduce some examples where calculus is applied. The examples are in
Python, so you may try to run them on your computer. We will see how we can build a neural
network and a support vector machine from scratch in the last few chapters, in which some
calculus evaluation has to be done in order to program them correctly. We hope this will give
you some insight and pave the way for you to better understand the machine learning literature.
Introduction
What to expect?
This book will teach you the basic concepts of calculus. You will learn not only the univariate
calculus that you will see in elementary textbooks, but also the multivariate calculus that we
often encounter in the machine learning literature. Our focus leans toward differential
calculus, as we will find it more useful in machine learning. After reading and working through
the book, you will know:
⊲ Calculus arose from studying how to add up many infinitesimal amounts
⊲ What a limit is, and how differentiation results from taking limits
⊲ Integration is the reverse of differentiation
⊲ The physical meaning of differentiation is the rate of change, or the slope if put into a
geometric perspective
⊲ The many rules for evaluating the differentiation of a function
⊲ What the differentiation of multivariate functions and vector-valued functions is, and
how to evaluate them
⊲ Differentiation is a tool for optimization, and the method of Lagrange multipliers lets
us perform function optimization with constraints
⊲ Differentiation is a tool for finding approximations to a function
⊲ How we apply calculus in coding a neural network from scratch
⊲ How we apply calculus to implement a support vector machine
We do not want this book to be a substitute for a formal calculus course. In fact, the
textbooks on calculus should give you more detail, more exercise, and more examples. They
are beneficial if you need a deeper understanding. However, this book can be a complement to
the textbooks to help you make connections to machine learning applications.
⊲ Part II: Limits and Differential Calculus. Define what the limit of a function is and,
from there, see how infinitesimal quantities are measured. Then we learn how to find the
derivative of a function, i.e., differentiation. While differentiation is defined from
limits, we will discover the rules that allow us to find the derivatives of many functions
more easily.
⊲ Part III: Multivariate Calculus. Extending from simple differential calculus, we will
see the differentiation of more complex functions, namely, those with multiple variables
or those with vectors as their values. This is where we learn the terms that we often
encounter in the machine learning literature, such as partial derivatives, gradient vectors,
the Jacobian, the Hessian, and the Laplacian.
⊲ Part IV: Mathematical Programming. We will see one important use of calculus:
optimization. The gradient descent algorithm can be useful for unconstrained function
optimization, but here we focus on the case of optimization with constraints. We
will see how we can convert an optimization problem with constraints into one without,
and then, in turn, use differentiation to solve it.
⊲ Part V: Approximation. Another use of differentiation is to find an approximate
function. This can be useful if we prefer the function to be written as a polynomial.
In fact, with the technique of Taylor series, we can approximate any function with a
polynomial as long as the function is differentiable.
⊲ Part VI: Calculus in Machine Learning. In the final part of this book, we study the
cases of training a neural network and a support vector machine classifier. Both need
some calculus evaluated to get them done correctly. The backpropagation in neural
network training involves a Jacobian matrix, while a support vector machine classifier
is in fact a constrained optimization problem. The chapters in this part will lay out
the mathematical derivations and then convert them into Python code.
These parts are not designed to tell you everything, but to give you an understanding of how
these techniques work and how to use them, so that you can learn by doing and get results
fast.
1 What is Calculus?

Overview
This tutorial is divided into two parts; they are:
⊲ Calculus
⊲ Applications of Calculus
1.1 Calculus
Calculus is a Latin word for stone, or pebble.
The use of this word has seeped into mathematics from the ancient practice of using little
stones to perform calculations, such as addition and multiplication. While the use of this word
has, with time, disappeared from the title of many methods of calculation, one important branch
of mathematics retained it so much that we now refer to it as The Calculus.
“Calculus, like other forms of mathematics, is much more than a language; it’s also an incredibly powerful system of reasoning.”
— Page xii, Infinite Powers, 2020.
Calculus matured from geometry.
At the start, geometry was concerned with straight lines, planes and angles, reflecting its
utilitarian origins in the construction of ramps and pyramids, among other uses. Nonetheless,
geometers found themselves tool-less for the study of circles, spheres, cylinders and cones.
The surface areas and volumes of these curved shapes were found to be much more difficult to
analyze than rectilinear shapes made of straight lines and flat planes. Despite its reputation for
being complicated, the method of calculus grew out of a quest for simplicity, by breaking down
complicated problems into simpler parts.
“Back around 250 BCE in ancient Greece, it was a hot little mathematical startup devoted to the mystery of curves.”
— Page 3, Infinite Powers, 2020.
In order to do so, calculus revolved around the controlled use of infinity as the bridge
between the curved and the straight.
“The Infinity Principle. To shed light on any continuous shape, object, motion, process, or phenomenon — no matter how wild and complicated it may appear — reimagine it as an infinite series of simpler parts, analyze those, and then add the results back together to make sense of the original whole.”
— Page xvi, Infinite Powers, 2020.
To grasp this concept a little better, imagine yourself traveling on a spaceship to the moon.
As you glance out at the moon from earth, its outline looks undoubtedly curved. But as you
approach, and smaller parts of the outline start filling up the viewing port, the curvature
eases and becomes less defined. Eventually, the curvature becomes so slight that the
infinitesimally small parts of the outline appear as a straight line. If we were to slice the
circular shape of the moon along these infinitesimally small parts of its outline, and then
arrange the infinitely small slices into a rectangle, we would be able to calculate its area:
by multiplying its width by its height.
This is the essence of calculus: the breakthrough that if one looks at a curved shape
through a microscope, the portion of its curvature being zoomed upon will appear straight and
flat. Hence, analyzing a curved shape is, in principle, made possible by putting together its
many straight pieces.
Calculus can, therefore, be considered to comprise two phases: cutting and rebuilding.
“In mathematical terms, the cutting process always involves infinitely fine subtraction, which is used to quantify the differences between the parts. Accordingly, this half of the subject is called differential calculus. The reassembly process always involves infinite addition, which integrates the parts back into the original whole. This half of the subject is called integral calculus.”
— Page xv, Infinite Powers, 2020.
With this in mind, let us revisit our simple example. Suppose that we have sliced the circular
shape of the moon into smaller pieces, and rearranged the pieces alongside one another.
The shape that we have formed is similar to a rectangle having a width equal to half the
circle circumference, C/2, and a height equal to the circle radius, r.
[Figure: the circle sliced into pieces and rearranged into a near-rectangle of width C/2 and height r]
To flatten out the curvature further, we can slice the circle into thinner pieces.
The thinner the slices, the more the curvature flattens out until we reach the limit of infinitely
many slices, where the shape is now perfectly rectangular.
We have cut the slices out of the circular shape, and rearranging them into a rectangle
does not change their area. Hence, calculating the area of the circle is equivalent to calculating
the area of the resulting rectangle: A = rC/2.
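The slicing argument can be checked numerically. The sketch below (standard library only, with the radius chosen arbitrarily for illustration) cuts a circle into n thin pie slices, treats each slice as a triangle, and watches the total approach A = rC/2 = πr² as n grows:

```python
import math

# Cut a circle of radius r into n thin pie slices and treat each slice as a
# triangle: the total area approaches A = r * C/2 = pi * r^2 as n grows.
def sliced_area(r, n):
    # Each slice spans an angle of 2*pi/n, so its triangular area is
    # (1/2) * r^2 * sin(2*pi/n).
    return n * 0.5 * r**2 * math.sin(2 * math.pi / n)

r = 3.0
for n in (10, 100, 1000):
    print(n, sliced_area(r, n))

print(math.pi * r**2)  # the exact area the slices converge to
```

With only 1000 slices the approximation already agrees with πr² to several decimal places, which is the Infinity Principle at work.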
Curves are not only a characteristic of geometric shapes, but also appear in nature in the
form of parabolic arcs traced by projectiles, or the elliptical orbits of planets around the sun.
“And so began the second great obsession: a fascination with the mysteries of motion on Earth and in the solar system.”
— Page xix, Infinite Powers, 2020.
1.2 Applications of calculus
And with curves and motion, the next natural question concerns their rate of change.
“With the mysteries of curves and motion now settled, calculus moved on to its third lifelong obsession: the mystery of change.”
— Page xxii, Infinite Powers, 2020.
It is through the application of the Infinity Principle that calculus allows us to study motion
and change too, by approximating these into many infinitesimal steps.
It is for this reason that calculus has come to be considered the language of the universe.
“Without calculus, we wouldn’t have cell phones, computers, or microwave ovens. We wouldn’t have radio. Or television. Or ultrasound for expectant mothers, or GPS for lost travelers. We wouldn’t have split the atom, unraveled the human genome, or put astronauts on the moon. We might not even have the Declaration of Independence.”
— Page vii, Infinite Powers, 2020.
More interesting still is the integral role of calculus in machine learning. It underlies
important algorithms, such as gradient descent, which requires computing the gradient of a
function and is often essential to train machine learning models. This makes calculus one of
the fundamental mathematical tools in machine learning.
1.3 Further reading

Books
Steven Strogatz. Infinite Powers. How Calculus Reveals the Secrets of the Universe. Mariner
Books, 2020.
https://www.amazon.com/dp/0358299284/
Mark Ryan. Calculus Essentials For Dummies. Wiley, 2019.
https://www.amazon.com/dp/1119591201/
Mark Ryan. Calculus For Dummies. 2nd ed. Wiley, 2016.
https://www.amazon.com/dp/1119293499/
Michael Spivak. The Hitchhiker’s Guide to Calculus. American Mathematical Society, 2019.
https://www.amazon.com/dp/1470449625/
Marc Peter Deisenroth. Mathematics for Machine Learning. Cambridge University Press, 2020.
https://www.amazon.com/dp/110845514X
Articles
Calculus. Wikipedia.
https://en.wikipedia.org/wiki/Calculus
1.4 Summary
In this tutorial, you discovered the origins of calculus and its applications. Specifically, you
learned:
⊲ That calculus is the mathematical study of change that is based on a cutting and
rebuilding strategy.
⊲ That calculus has permitted many discoveries and the creation of many modern-day
devices as we know them, and is also a fundamental mathematical tool in machine
learning.
In the next chapter, we are going to see the algebraic meaning of calculus, namely, the rate of
change.
2 Rate of Change
The measurement of the rate of change is an integral concept in differential calculus, which
concerns the mathematics of change and infinitesimals. It allows us to find the relationship
between two changing variables and how these affect one another. The measurement of the
rate of change is also essential for machine learning, such as in applying gradient descent as the
optimization algorithm to train a neural network model.
In this tutorial, you will discover the rate of change as one of the key concepts in calculus,
and the importance of measuring it. After completing this tutorial, you will know:
⊲ How the rate of change of linear and nonlinear functions is measured.
⊲ Why the measurement of the rate of change is an important concept in different fields.
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Rate of Change
⊲ The Importance of Measuring the Rate of Change
2.1 Rate of change

[Figure 2.1: Line plot of the linear function y = 2x]
In this graphical representation of the object’s movement, the rate of change is represented by
the slope of the line, or its gradient. Since the line can be seen to rise 2 units for each single
unit that it runs to the right, then its rate of change, or its slope, is equal to 2.
“Rates and slopes have a simple connection. The previous rate examples can be graphed on an x-y coordinate system, where each rate appears as a slope.”
— Page 38, Calculus Essentials For Dummies, 2019.
Tying everything together, we see that:
rate of change = δy/δx = rise/run = slope
If we had to consider two particular points, P1 = (2, 4) and P2 = (8, 16), on this straight line,
we may confirm the slope to be equal to:
slope = δy/δx = (y2 − y1)/(x2 − x1) = (16 − 4)/(8 − 2) = 2
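As a quick check, the slope between the two points can be computed directly; a minimal sketch, where `slope` is a helper introduced here for illustration:

```python
# Rate of change between two points P1 and P2 on the line y = 2x.
def slope(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

print(slope((2, 4), (8, 16)))  # 2.0
```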
For this particular example, the rate of change, represented by the slope, is positive since the
direction of the line is increasing rightwards. However, the rate of change can also be negative
if the direction of the line decreases, which means that the value of y would be decreasing as
the value of x increases. Furthermore, when the value of y remains constant as x increases, we
would say that we have a zero rate of change. If, instead, the value of x remains constant as y
increases, we would consider the rate of change to be infinite, because the slope of a vertical
line is considered undefined.
So far, we have considered the simplest example of having a straight line, and hence a linear
function, with an unchanging slope. Nonetheless, not all functions are this simple, and if they
were, there would be no need for calculus.
“Calculus is the mathematics of change, so now is a good time to move on to parabolas, curves with changing slopes.”
— Page 39, Calculus Essentials For Dummies, 2019.
[Figure 2.2: Line plot of the parabola y = x²/4, with short straight segments marking the slope at points such as (1, 0.25), (2, 1), (3, 2.25), (4, 4), (5, 6.25), and (6, 9)]
Recall that the method of calculus allows us to analyze a curved shape by cutting it into
many infinitesimal straight pieces arranged alongside one another. If we had to consider one of
such pieces at some particular point, P , on the curved shape of the parabola, we see that we
find ourselves calculating again the rate of change as the slope of a straight line. It is important
to keep in mind that the rate of change on a parabola depends on the particular point, P , that
we happened to consider in the first place.
For example, if we had to consider the straight line that passes through point, P = (2, 1),
we find that the rate of change at this point on the parabola is:
rate of change = δy/δx = 1/1 = 1
If we had to consider a different point on the same parabola, at P = (6, 9), we find that the
rate of change at this point is equal to:
rate of change = δy/δx = 3/1 = 3
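These point-by-point slopes can be verified with a small finite-difference approximation; a sketch, where `approx_slope` is a helper introduced here for illustration:

```python
def f(x):
    return x**2 / 4  # the parabola from Figure 2.2

def approx_slope(f, x, h=1e-6):
    # Central difference: the slope of a very short chord centered at x.
    return (f(x + h) - f(x - h)) / (2 * h)

print(round(approx_slope(f, 2), 6))  # 1.0, the rate of change at P = (2, 1)
print(round(approx_slope(f, 6), 6))  # 3.0, the rate of change at P = (6, 9)
```

Shrinking the chord width h is exactly the limiting process that defines the derivative, which we study formally in Part II.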
The straight line that touches the curve at some particular point, P , is known as the tangent
line, whereas the process of calculating the rate of change of a function is also known as finding
its derivative.
“A derivative is simply a measure of how much one thing changes compared to another — and that’s a rate.”
— Page 37, Calculus Essentials For Dummies, 2019.
While we have considered a simple parabola for this example, we may similarly use calculus
to analyze more complicated nonlinear functions. The concept of computing the instantaneous
rate of change at different tangential points on the curve remains the same.
2.2 The importance of measuring the rate of change

We meet one such example when we come to train a neural network using the gradient descent
algorithm. As the optimization algorithm, gradient descent iteratively descends an error
function towards its global minimum, each time updating the neural network weights to better
model the training data. The error function is typically nonlinear and can contain many local
minima and saddle points. In order to find its way downhill, the gradient descent algorithm
computes the instantaneous slope at different points on the error function, until it reaches a
point at which the error is lowest and the rate of change is zero.
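The downhill search can be sketched on a toy error function (the function below is an assumption for illustration, not one from a real network): repeatedly step a weight against the local slope until the rate of change is close to zero.

```python
# Gradient descent on a toy error function E(w) = (w - 3)^2 + 1, whose
# instantaneous slope is dE/dw = 2 * (w - 3).
def grad(w):
    return 2 * (w - 3)

w = 0.0    # initial weight
lr = 0.1   # learning rate
for _ in range(100):
    w -= lr * grad(w)  # step downhill, against the slope

print(round(w, 4))  # 3.0, the point where the rate of change is zero
```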
“But a rate can be anything per anything.”
— Page 38, Calculus Essentials For Dummies, 2019.
Within the context of training a neural network, for instance, we have seen that the error
gradient is computed as the change in error with respect to a specific weight in the neural
network.
There are many different fields in which the measurement of the rate of change is an
important concept too. A few examples are:
⊲ In physics, speed is computed as the change in position per unit time.
⊲ In signal digitization, sampling rate is computed as the number of signal samples per
second.
⊲ In computing, bit rate is the number of bits the computer processes per unit time.
⊲ In finance, exchange rate refers to the value of one currency with respect to another.
“In either case, every rate is a derivative, and every derivative is a rate.”
— Page 38, Calculus Essentials For Dummies, 2019.
2.3 Further reading
Books
Mark Ryan. Calculus Essentials For Dummies. Wiley, 2019.
https://www.amazon.com/dp/1119591201/
Steven Strogatz. Infinite Powers. How Calculus Reveals the Secrets of the Universe. Mariner
Books, 2020.
https://www.amazon.com/dp/0358299284/
Mark Ryan. Calculus For Dummies. 2nd ed. Wiley, 2016.
https://www.amazon.com/dp/1119293499/
Michael Spivak. The Hitchhiker’s Guide to Calculus. American Mathematical Society, 2019.
https://www.amazon.com/dp/1470449625/
Marc Peter Deisenroth. Mathematics for Machine Learning. Cambridge University Press, 2020.
https://www.amazon.com/dp/110845514X
2.4 Summary
In this tutorial, you discovered the rate of change as one of the key concepts in calculus, and
the importance of measuring it. Specifically, you learned:
⊲ The measurement of the rate of change is an integral concept in differential calculus
that allows us to find the relationship of one changing variable with respect to another.
⊲ This is an important concept that can be applied to many fields, one of which is machine
learning.
In the next chapter, we will drill deeper into why calculus is helpful in building machine
learning algorithms.
3 Why it Works?
Calculus is one of the core mathematical concepts in machine learning that permits us to
understand the internal workings of different machine learning algorithms. One of the important
applications of calculus in machine learning is the gradient descent algorithm, which, in tandem
with backpropagation, allows us to train a neural network model.
In this tutorial, you will discover the integral role of calculus in machine learning. After
completing this tutorial, you will know:
⊲ Calculus plays an integral role in understanding the internal workings of machine learning
algorithms, such as the gradient descent algorithm for minimizing an error function.
⊲ Calculus provides us with the necessary tools to optimize complex objective functions
as well as functions with multidimensional inputs, which are representative of different
machine learning applications.
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Calculus in Machine Learning
⊲ Why Calculus in Machine Learning Works
3.1 Calculus in machine learning

“A very simple type of function is a linear mapping from a single input to a single output.”
— Page 187, Deep Learning, 2019.
Such a linear function can be represented by the equation of a line having a slope, m, and
a y-intercept, c:
y = mx + c
Varying each of the parameters, m and c, produces different linear models that define different
input-output mappings.
[Figure 3.1: Line plot of different line models produced by varying the slope and intercept]
The process of learning the mapping function, therefore, involves the approximation of
these model parameters, or weights, that result in the minimum error between the predicted
and target outputs. This error is calculated by means of a loss function, cost function, or error
function, as often used interchangeably, and the process of minimizing the loss is referred to as
function optimization.
We can apply differential calculus to the process of function optimization. In order to
understand better how differential calculus can be applied to function optimization, let us
return to our specific example of having a linear mapping function.
Say that we have some dataset of single input features, x, and their corresponding target
outputs, y. In order to measure the error on the dataset, we shall be taking the sum of squared
errors (SSE), computed between the predicted and target outputs, as our loss function.
Carrying out a parameter sweep across different values for the model weights, w0 = m and
w1 = c, generates individual error profiles that are convex in shape (i.e., like the letter U, as
in Figure 3.2).
[Figure 3.2: Line plots of error (SSE) profiles generated when sweeping across a range of values for the slope and intercept]
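A sweep like this can be sketched in a few lines of plain Python. The data-generating line y = 0.3x + 0.2 and the swept range below are assumptions for illustration:

```python
# Sweep the slope w1 = m across a range with the intercept held fixed at its
# true value, and record the SSE on a small synthetic dataset drawn from
# y = 0.3x + 0.2 (an illustrative choice).
xs = [0.1 * i for i in range(11)]
ys = [0.3 * x + 0.2 for x in xs]

def sse(m, c):
    # Sum of squared errors between predictions m*x + c and targets y.
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys))

profile = [(round(m, 2), sse(m, 0.2)) for m in
           [-0.2 + 0.05 * i for i in range(17)]]
best_m, best_sse = min(profile, key=lambda p: p[1])
print(best_m)  # 0.3, where the U-shaped SSE profile bottoms out
```

Plotting `profile` would reproduce the convex, U-shaped curve of Figure 3.2, with its minimum at the true slope.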
Combining the individual error profiles generates a three-dimensional error surface that is
also convex in shape. This error surface is contained within a weight space, which is defined by
the swept ranges of values for the model weights, w0 and w1 .
[Figure 3.3: Three-dimensional plot of the error (SSE) surface generated when both slope and intercept are varied]
Moving across this weight space is equivalent to moving between different linear models.
Our objective is to identify the model that best fits the data among all possible alternatives.
The best model is characterized by the lowest error on the dataset, which corresponds with the
lowest point on the error surface.
3.2 Why calculus in machine learning works
“A convex or bowl-shaped error surface is incredibly useful for learning a linear function to model a dataset because it means that the learning process can be framed as a search for the lowest point on the error surface. The standard algorithm used to find this lowest point is known as gradient descent.”
— Page 194, Deep Learning, 2019.
The gradient descent algorithm, as the optimization algorithm, will seek to reach the lowest
point on the error surface by following its gradient downhill. This descent is based upon the
computation of the gradient, or slope, of the error surface.
This is where differential calculus comes into the picture.
“Calculus, and in particular differentiation, is the field of mathematics that deals with rates of change.”
— Page 198, Deep Learning, 2019.
More formally, let us denote the function that we would like to optimize by:
error = f(weights)
By computing the rate of change, or the slope, of the error with respect to the weights, the
gradient descent algorithm can decide on how to change the weights in order to keep reducing
the error.
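This decision rule can be sketched for the linear model above (the tiny synthetic dataset and learning rate below are assumptions for illustration): each step moves the weights against the slope of the error.

```python
# Gradient descent for the linear model y = m*x + c under the SSE loss.
# Partial derivatives of the loss:
#   dSSE/dm = sum of 2*(m*x + c - y)*x
#   dSSE/dc = sum of 2*(m*x + c - y)
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2 * x + 1 for x in xs]   # targets from y = 2x + 1 (illustrative)

m, c, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    dm = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys))
    dc = sum(2 * (m * x + c - y) for x, y in zip(xs, ys))
    m -= lr * dm   # each weight moves against its own slope of the error
    c -= lr * dc

print(round(m, 3), round(c, 3))  # 2.0 1.0
```

Because the SSE surface here is convex, the iterates settle at the single lowest point, recovering the slope and intercept that generated the data.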
“In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points surrounded by very flat regions.”
— Page 84, Deep Learning, 2016.
Hence, within the context of deep learning, we often accept a suboptimal solution that may not
necessarily correspond to a global minimum, so long as it corresponds to a very low value of
f (x).
[Figure: a plot of f(x) showing a local minimum that performs nearly as well as the global one (an acceptable halting point), alongside an ideal global minimum that might not be achievable]
If the function we are working with takes multiple inputs, calculus also provides us with
the concept of partial derivatives; or in simpler terms, a method to calculate the rate of change
of y with respect to changes in each one of the inputs, x, while holding the remaining inputs
constant.
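A numerical sketch of a partial derivative (the function below is assumed purely for illustration): nudge one input at a time while holding the other constant.

```python
# Numerical partial derivatives of an illustrative f(x0, x1) = x0**2 + 3*x1.
def f(x0, x1):
    return x0**2 + 3 * x1

def partial(f, args, i, h=1e-6):
    # Central difference in coordinate i, all other inputs held constant.
    up, down = list(args), list(args)
    up[i] += h
    down[i] -= h
    return (f(*up) - f(*down)) / (2 * h)

point = (2.0, 5.0)
print(round(partial(f, point, 0), 6))  # 4.0, since df/dx0 = 2*x0
print(round(partial(f, point, 1), 6))  # 3.0, since df/dx1 = 3
```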
“This is why each of the weights is updated independently in the gradient descent algorithm: the weight update rule is dependent on the partial derivative of the SSE for each weight, and because there is a different partial derivative for each weight, there is a separate weight update rule for each weight.”
— Page 200, Deep Learning, 2019.
Hence, if we consider again the minimization of an error function, calculating the partial
derivative of the error with respect to each specific weight permits each weight to be updated
independently of the others.
This also means that the gradient descent algorithm may not follow a straight path down
the error surface. Rather, each weight will be updated in proportion to the local gradient of
the error curve. Hence, one weight may be updated by a larger amount than another, as much
as needed for the gradient descent algorithm to reach the function minimum.
3.3 Further reading
Books
John D. Kelleher. Deep Learning. Illustrated edition. The MIT Press Essential Knowledge series.
MIT Press, 2019.
https://www.amazon.com/dp/0262537559/
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
3.4 Summary
In this tutorial, you discovered the integral role of calculus in machine learning. Specifically,
you learned:
⊲ Calculus plays an integral role in understanding the internal workings of machine learning
algorithms, such as the gradient descent algorithm that minimizes an error function
based on the computation of the rate of change.
⊲ The concept of the rate of change in calculus can also be exploited to minimize more
complex objective functions that are not necessarily convex in shape.
⊲ The calculation of the partial derivative, another important concept in calculus, permits
us to work with functions that take multiple inputs.
In the next chapter, we are going to review what we should know before we move on to learn
about how to manipulate the math.
Chapter 4. A Brief Tour of Calculus Prerequisites
We have previously seen that calculus is one of the core mathematical concepts in machine
learning that permits us to understand the internal workings of different machine learning
algorithms.
Calculus, in turn, builds on several fundamental concepts that derive from algebra and
geometry. The importance of having these fundamentals at hand will become even more
important as we work our way through more advanced topics of calculus, such as the evaluation
of limits and the computation of derivatives, to name a few.
In this tutorial, you will discover several prerequisites that will help you work with calculus.
After completing this tutorial, you will know:
⊲ Linear and nonlinear functions are central to calculus and machine learning, and many
calculus problems involve their use.
⊲ Fundamental concepts from algebra and trigonometry provide the foundations for
calculus, and will become especially important as we tackle more advanced calculus
topics.
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ The Concept of a Function
⊲ Fundamentals of Pre-Algebra and Algebra
⊲ Fundamentals of Trigonometry
“Examples are all around us: The average daily temperature for your city depends on, and is a function of, the time of year; the distance an object has fallen is a function of how much time has elapsed since you dropped it; the area of a circle is a function of its radius; and the pressure of an enclosed gas is a function of its temperature.”
— Page 43, Calculus For Dummies, 2016.
In machine learning, a neural network learns a function by which it can represent the relationship
between features in the input, the independent variable, and the expected output, the dependent
variable. In such a scenario, therefore, the learned function defines a deterministic mapping
between the input values and one or more output values. We can represent this mapping as
follows:
Output(s) = function(Input)
More formally, however, a function is often represented by y = f (x), which translates to y is a
function of x. This notation specifies x as the independent input variable that we already know,
whereas y is the dependent output variable that we wish to find. For example, if we consider
the squaring function, f(x) = x^2, then inputting a value of 3 would produce an output of 9:

y = f(3) = 9
“By the graph of the function f we mean the collection of all points (x, f(x)).”
— Page 13, The Hitchhiker’s Guide to Calculus, 2019.
When graphing a function, the independent input variable is placed on the x-axis, while the
dependent output variable goes on the y-axis. A graph helps to illustrate the relationship between
the independent and dependent variables better: is the graph (and, hence, the relationship) rising or falling, and at what rate?
A straight line is one of the simplest functions that can be graphed on the coordinate plane.
Take, for example, the graph of the line y = (1/3)x − 5/3:

Figure: the line y = (1/3)x − 5/3 plotted on the x-y coordinate plane, passing through the points (−4, −3), (−1, −2), (2, −1), (5, 0), (8, 1), (11, 2) and (14, 3).
This straight line can be described by a linear function, so called because the output changes
proportionally to any change in the input. The linear function that describes this straight line
can be represented in slope-intercept form, where the slope is denoted by m, and the y-intercept
by c:
f(x) = mx + c = (1/3)x − 5/3
We saw how to calculate the slope when we addressed the topic of rate of change in the last chapter. If we had to consider the special case of setting the slope to zero, the resulting horizontal line would be described by f(x) = c = −5/3.
Within the context of machine learning, the calculation defined by such a linear function
is implemented by every neuron in a neural network. Specifically, each neuron receives a set
of n inputs, x_i, from the previous layer of neurons or from the training data, and calculates a weighted sum of these inputs (where the weight, w_i, is the more common term for the slope, m, in machine learning) to produce an output, z:

z = Σ_{i=1}^{n} x_i × w_i
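This weighted sum can be sketched in a few lines of NumPy (the inputs and weights below are made-up values for illustration, not from the book):

```python
import numpy as np

# Hypothetical inputs and weights for a neuron with n = 3 inputs
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.4, 0.3, -0.1])

# z = x_1*w_1 + x_2*w_2 + ... + x_n*w_n, the weighted sum the neuron computes
z = np.dot(x, w)
print(z)  # approximately -0.36
```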
The process of training a neural network involves learning the weights that best represent the patterns in the input dataset, a process carried out by the gradient descent algorithm.
In addition to the linear function, there exists another family of nonlinear functions.
The simplest of all nonlinear functions is the parabola, which may be described by:

y = f(x) = x^2
When graphed, we find that this is an even function, because it is symmetric about the y-axis,
and never falls below the x-axis.
Figure: plot of f(x) = x^2, a parabola symmetric about the y-axis.
Nonetheless, nonlinear functions can take many different shapes. Consider, for instance, the exponential function of the form f(x) = b^x, which grows or decays monotonically, depending on the value of the base b:
Figure: plots of the exponential functions g(x) = 2^x and g(x) = 10^x.
Or the logarithmic function of the form f(x) = log_2 x, which is similar to the exponential function but with the x- and y-axes switched:

Figure: plots of f(x) = 2^x and g(x) = log_2 x, mirror images of one another across the line y = x.
Of particular interest for deep learning are the logistic, tanh, and the rectified linear units
(ReLU) nonlinear functions, which serve as activation functions:
logistic(z) = 1 / (1 + e^(−z))
tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
ReLU(z) = max(0, z)

Figure 4.5: Line plots of the logistic, tanh and ReLU functions
The importance of these activation functions lies in the introduction of a nonlinear mapping
into the processing of a neuron. If we had to rely solely on the linear regression performed by each
neuron in calculating a weighted sum of the inputs, then we would be restricted to learning only
a linear mapping from the inputs to the outputs. However, many real-world relationships are
more complex than this, and a linear mapping would not accurately model them. Introducing a
nonlinearity to the output, z, of the neuron, allows the neural network to model such nonlinear
relationships:
Output = activation(z)
“…a neuron, the fundamental building block of neural networks and deep learning, is defined by a simple two-step sequence of operations: calculating a weighted sum and then passing the result through an activation function.”
— Page 76, Deep Learning, 2019.
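The three activation functions can be sketched directly from their formulas (a minimal NumPy version for illustration, not the book's code):

```python
import numpy as np

def logistic(z):
    # logistic(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    # ReLU(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(logistic(z))
print(tanh(z))
print(relu(z))
```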
Nonlinear functions appear elsewhere in the process of training a neural network too, in
the form of error functions. A nonlinear error function can be generated by calculating the error
between the predicted and the target output values as the weights of the model change. Its
shape can be as simple as a parabola, but most often it is characterized by many local minima
and saddle points. The gradient descent algorithm descends this nonlinear error function by
calculating the slope of the tangent line that touches the curve at some particular instance:
another important concept in calculus that permits us to analyze complex curved functions by
cutting them into many infinitesimal straight pieces arranged alongside one another.
“Algebra is the language of calculus. You can’t do calculus without knowing algebra any more than you can write Chinese poetry without knowing Chinese.”
— Page 29, Calculus For Dummies, 2016.
4.2 Fundamentals of pre-algebra and algebra
There are several fundamental concepts of algebra that turn out to be useful for calculus, such
as those concerning fractions, powers, square roots, and logarithms.
Let’s first start by revising the basics for working with fractions.
⊲ Division by Zero: The denominator of a fraction can never be equal to zero. For
example, the result of a fraction such as 5/0 is undefined. The intuition behind this is that if 5/0 were some number x, then we would end up with 5 = 0 × x = 0, and hence all numbers would be equal to zero.
⊲ Reciprocal: The reciprocal of a fraction is its multiplicative inverse. In simpler terms,
to find the reciprocal of a fraction, flip it upside down. Hence, the reciprocal of 3/4,
for instance, becomes 4/3.
⊲ Multiplication of Fractions: Multiplication between fractions is as straightforward as
multiplying across the numerators, and multiplying across the denominators:
a/b × c/d = ac/bd
⊲ Division of Fractions: The division of fractions is very similar to multiplication, but with
an additional step; the reciprocal of the second fraction is first found before multiplying.
Hence, considering again two generic fractions:
a/b ÷ c/d = a/b × d/c = ad/bc
⊲ Addition of Fractions: An important first step is to find a common denominator between
all fractions to be added. Any common denominator will do, but we usually find the
least common denominator. Finding the least common denominator is, at times, as
simple as multiplying the denominators of all individual fractions:
a/b + c/d = (ad + cb)/bd
⊲ Subtraction of Fractions: The subtraction of fractions follows a similar procedure as
for the addition of fractions:
a/b − c/d = (ad − cb)/bd
⊲ Canceling in Fractions: Fractions with an unbroken chain of multiplications across the
entire numerator, as well as across the entire denominator, can be simplified by canceling
out any common terms that appear in both the numerator and the denominator:
a^3 b^2 / (ac) = a^2 b^2 / c
The next important prerequisite for calculus revolves around exponents, or powers as they are
also commonly referred to. There are several rules to keep in mind when working with powers
too.
⊲ The Power of Zero: The result of any number (whether rational or irrational, negative
or positive, except for zero itself) raised to the power of zero, is equal to one:
x^0 = 1
⊲ Negative Powers: A base number raised to a negative power turns into a fraction, but
does not change sign:
x^(−a) = 1/x^a
⊲ Fractional Powers: A base number raised to a fractional power can be converted into
a root problem:

x^(a/b) = (b√x)^a = b√(x^a), where b√ denotes the b-th root
⊲ Addition of Powers: If two (or more) equivalent base terms are being multiplied to one
another, then their powers may be added:
x^a × x^b = x^(a+b)
⊲ Subtraction of Powers: Similarly, if two equivalent base terms are being divided, then
their power may be subtracted:
x^a / x^b = x^(a−b)
⊲ Power of Powers: If a power is also raised to a power, then the two powers may be
multiplied by one another:
(x^a)^b = x^(ab)
⊲ Distribution of Powers: Whether the base numbers are being multiplied or divided,
the power may be distributed to each variable. However, it cannot be distributed if the
base numbers are, otherwise, being added or subtracted:
(xyz)^a = x^a y^a z^a
(x/y)^a = x^a / y^a
⊲ Common mistake to avoid: √(a^2 + b^2) ≠ a + b
There are also rules for working with logarithms:
⊲ Identities: log_c 1 = 0 and log_c c = 1
⊲ Products: log_c(ab) = log_c a + log_c b
⊲ Quotients: log_c(a/b) = log_c a − log_c b
⊲ Exponents: log_c(a^b) = b log_c a
⊲ Change of base: log_a b = log_c b / log_c a
⊲ Anti-logarithm: log_a(a^b) = b and a^(log_a b) = b
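These identities can be spot-checked with Python's math module (a quick sketch with arbitrary sample values, not from the book):

```python
import math

a, b, c = 8.0, 2.0, 2.0

# Product rule: log_c(ab) = log_c(a) + log_c(b)
assert math.isclose(math.log(a * b, c), math.log(a, c) + math.log(b, c))

# Exponent rule: log_c(a^b) = b * log_c(a)
assert math.isclose(math.log(a ** b, c), b * math.log(a, c))

# Change of base: log_a(b) = log_c(b) / log_c(a)
assert math.isclose(math.log(b, a), math.log(b, c) / math.log(a, c))

print("all logarithm identities hold for the sample values")
```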
Finally, knowing how to solve quadratic equations can also come in handy in calculus. If
the quadratic equation is factorizable, then the easiest method to solve it is to express the sum
of terms in product form. For example, the following quadratic equation can be factored as
follows:
x^2 − 9 = (x + 3)(x − 3) = 0
Setting each factor to zero permits us to find a solution to this equation, which in this case is x = ±3. Alternatively, the following quadratic formula may be used:

x = (−b ± √(b^2 − 4ac)) / (2a)

If we had to consider the same quadratic equation as above, then we would set the coefficient values to a = 1, b = 0, and c = −9, which would again result in x = ±3 as our solution.
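A small helper (our own, for illustration) applies the quadratic formula; note the coefficients a = 1, b = 0, c = −9 for x^2 − 9 = 0:

```python
import math

def solve_quadratic(a, b, c):
    # Apply x = (-b ± sqrt(b^2 - 4ac)) / (2a); return only the real roots
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return ()
    root = math.sqrt(discriminant)
    return ((-b + root) / (2 * a), (-b - root) / (2 * a))

# x^2 - 9 = 0 has a = 1, b = 0, c = -9
print(solve_quadratic(1, 0, -9))  # (3.0, -3.0)
```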
4.3 Fundamentals of trigonometry

Consider a 3-4-5 right-angled triangle, with the side of length 3 opposite the angle x, the side of length 4 adjacent to it, and a hypotenuse of length 5. The sine, cosine and tangent of x are the ratios:

sin x = O/H = 3/5
cos x = A/H = 4/5
tan x = O/A = 3/4

where O, A and H denote the lengths of the opposite side, the adjacent side, and the hypotenuse, respectively.

The sine, cosine and tangent functions only work with right-angled triangles, and hence can only be used in the calculation of acute angles that are smaller than 90°. Nonetheless, if we had to work within the unit circle on the x-y coordinate plane, then we would be able to apply trigonometry to all angles between 0° and 360°:
Figure: the unit circle, marked with the points (0, 1), (0, −1), (−√3/2, 1/2) and (√3/2, 1/2); one radian is approximately 57°.
The unit circle has its center at the origin of the x-y coordinate plane, and a radius of one
unit. Rotations around the unit circle are performed in a counterclockwise manner, starting
from the positive x-axis. The cosine of the rotated angle would then be given by the x-coordinate
of the point that hits the unit circle, whereas the y-coordinate specifies the sine of the rotated
angle. It is also worth noting that the quadrants are symmetrical, and hence a point in one
quadrant has symmetrical counterparts in the other three.
The graphed sine, cosine and tangent functions appear as follows:
Figure 4.8: Line plots of the sine, cosine and tangent functions
All functions are periodic, with the sine and cosine functions sharing the same shape, albeit displaced by 90° from one another. The sine and cosine functions may, indeed,
be easily sketched from the calculated x- and y-coordinates as one rotates around the unit circle.
The tangent function may also be sketched similarly, since for any angle θ this function may be
defined by:
tan θ = sin θ / cos θ = y/x
The tangent function is undefined at ±90°, since the cosine in the denominator returns
a value of zero at this angle. Hence, we draw vertical asymptotes at these angles, which are
imaginary lines that the curve approaches but never touches.
One final note concerns the inverse of these trigonometric functions. Taking the sine
function as an example, its inverse is denoted by sin−1 . This is not to be mistaken for the
cosecant function, which is rather the reciprocal of sine, and hence not the same as its inverse.
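Python's math module illustrates the distinction (a quick sketch; the 3-4-5 triangle values are from the example above):

```python
import math

# sin of the angle x in the 3-4-5 triangle above is 3/5
angle = math.asin(3 / 5)          # the inverse sine recovers the angle itself
print(math.degrees(angle))        # roughly 36.87 degrees

cosecant = 1 / math.sin(angle)    # the reciprocal of sine, not its inverse
print(cosecant)                   # roughly 5/3
```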
4.4 Further reading
Books
Mark Ryan. Calculus For Dummies. 2nd ed. Wiley, 2016.
https://www.amazon.com/dp/1119293499/
Michael Spivak. The Hitchhiker’s Guide to Calculus. American Mathematical Society, 2019.
https://www.amazon.com/dp/1470449625/
John D. Kelleher. Deep Learning. Illustrated edition. The MIT Press Essential Knowledge series.
MIT Press, 2019.
https://www.amazon.com/dp/0262537559/
4.5 Summary
In this tutorial, you discovered several prerequisites for working with calculus.
Specifically, you learned:
⊲ Linear and nonlinear functions are central to calculus and machine learning, and many
calculus problems involve their use.
⊲ Fundamental concepts from algebra and trigonometry provide the foundations for
calculus, and will become especially important as we tackle more advanced calculus
topics.
Starting from the next chapter, we will see how we can manipulate the math in calculus. We will start with the process of evaluating a limit.
Part II: Limits and Differential Calculus

Chapter 5. Limits and Continuity
There is no denying that calculus is a difficult subject. However, if you learn the fundamentals,
you will not only be able to grasp the more complex concepts but also find them fascinating.
To understand machine learning algorithms, you need to understand concepts such as the gradient of a function, the Hessian matrix, and optimization. The concept of limits and continuity serves as a foundation for all these topics.
In this tutorial, you will discover how to evaluate the limit of a function, and how to
determine if a function is continuous or not. After reading this tutorial, you will be able to:
⊲ Determine if a function f (x) has a limit as x approaches a certain value
⊲ Evaluate the limit of a function f (x) as x approaches a
⊲ Determine if a function is continuous at a point or in an interval
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Limits
◦ Determine if the limit of a function exists for a certain point
◦ Compute the limit of a function for a certain point
◦ Formal definition of a limit
◦ Examples of limits
◦ Left and right hand limits
⊲ Continuity
◦ Definition of continuity
◦ Determine if a function is continuous at a point or within an interval
◦ Examples of continuous functions
5.1 A simple example

Consider the function f(x) = 1 + x, evaluated at points closer and closer to x = −1:

x: −1.003, −1.002, −1.001, −1, −0.999, −0.998, −0.997
f(x) = 1 + x: −0.003, −0.002, −0.001, 0, 0.001, 0.002, 0.003

Figure 5.1: Plotting f(x) = 1 + x
Now consider the function:

g(x) = (1 − x^2) / (1 + x)

We can simplify the expression for g(x) as:

g(x) = (1 − x)(1 + x) / (1 + x)

If the denominator is not zero then g(x) can be simplified as:

g(x) = 1 − x, if x ≠ −1

However, at x = −1, the denominator is zero and we cannot divide by zero. So it looks like there is a hole in the function at x = −1. Despite the presence of this hole, g(x) gets closer and closer to 2 as x gets closer and closer to −1, as shown in the figure:
x: −1.003, −1.002, −1.001, −1, −0.999, −0.998, −0.997
g(x) = (1 − x^2)/(1 + x): 2.003, 2.002, 2.001, ?, 1.999, 1.998, 1.997

Figure 5.2: Plotting g(x) = 1 − x
This is the basic idea of a limit. If g(x) is defined in an open interval that does not include −1, and g(x) gets closer and closer to 2 as x approaches −1, we write this as:

lim_{x→−1} g(x) = 2

In general, for any function f(x), if f(x) gets closer and closer to a value L as x gets closer and closer to k, we define the limit of f(x) as x approaches k to be L. This is written as:

lim_{x→k} f(x) = L
This gives rise to the notion of one-sided limits. The left hand limit is defined on an interval
to the left of −1, which does not include −1, e.g., (−1.003, −1). As we approach −1 from the
left, g(x) gets closer to 2.
5.1 A simple example 32
Similarly, the right hand limit is defined on an open interval to the right of −1 and does not include −1, e.g., (−1, −0.997). As we approach −1 from the right, the right hand limit of g(x) is 2. Both the left and right hand limits are written as follows:

lim_{x→−1⁻} g(x) = 2 and lim_{x→−1⁺} g(x) = 2
We say that f(x) has a limit L as x approaches k, if both its left and right hand limits are equal. Therefore, this is another way of testing whether a function has a limit at a specific point, i.e.,

lim_{x→k⁻} f(x) = lim_{x→k⁺} f(x) = L
If we want to compute left and right hand limits using a computer, it is not difficult to achieve.
But first, we have to understand that computers cannot evaluate at any number since computers
use floating point arithmetic, namely, not all numbers can be represented. The smallest floating
point number that a computer can use is called the machine epsilon, and it can be found using
a simple trick to trigger a floating point rounding error:
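The snippet that belongs here is not reproduced in this excerpt; one common version of the trick (a sketch, under that assumption) halves a candidate value until adding it to 1.0 no longer changes the result:

```python
# Start from 1.0 and keep halving while 1 + epsilon/2 still differs from 1
epsilon = 1.0
while 1.0 + epsilon / 2 > 1.0:
    epsilon = epsilon / 2
print(epsilon)  # 2.220446049250313e-16 on a typical 64-bit float
```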
Alternatively, we can use a numpy function in Python to get the same result. We can make use of this concept to approximate the left and right hand limits. In the example above, where g(x) = 1 − x, we can compute them as follows:
import numpy as np

def g(x):
    return 1 - x

x = -1
epsilon = np.finfo(float).eps

# Approach -1 from the left and from the right by the smallest possible step
print('Left hand limit:', g(x - epsilon))
print('Right hand limit:', g(x + epsilon))
The np.finfo(float).eps is to get the machine epsilon for the float type. Most likely it is
a 64-bit floating point in your computer. We may also use a 32-bit floating point by saying
np.finfo(np.float32).eps instead. The above code evaluates g just to the left and right of x = −1; both values are close to the limit of 2.
Figure: the formal definition of a limit: f(x) stays between L − ε and L + ε whenever x lies within a distance δ of k.
Figure: plots of three example functions and their limits:

f1(x) = |x|, with lim_{x→0} f1(x) = lim_{x→0} |x| = 0
f2(x) = x^2 + 3x + 1, with lim_{x→0} f2(x) = 1
f3(x) = 1/x if x > 0, and 0 otherwise, with lim_{x→∞} f3(x) = 0

The limit of f2(x) exists for all values of x, e.g., lim_{x→1} f2(x) = 1 + 3 + 1 = 5.
Consider the step function H(x), which equals 0 for x < 0 and 1 otherwise. As we get closer and closer to 0 from the left, the function remains a zero. However, as soon as we reach x = 0, H(x) jumps to 1, and hence H(x) does not have a limit as x approaches zero. This function has a left hand limit equal to zero and a right hand limit equal to 1.

The left and right hand limits do not agree as x → 0, hence H(x) does not have a limit as x approaches 0. Here, we used the equality of left and right hand limits as a test to check if a function has a limit at a particular point.
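This disagreement can be checked numerically with the machine-epsilon approach from earlier (H below is our own definition of the step function, for illustration):

```python
import numpy as np

def H(x):
    # Step function: 0 for x < 0, 1 otherwise
    return 0 if x < 0 else 1

epsilon = np.finfo(float).eps
left = H(0 - epsilon)   # approaching 0 from the left
right = H(0 + epsilon)  # approaching 0 from the right
print(left, right)  # 0 1
```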
Figure: line plots of the step function H(x) = 0 for x < 0 and H(x) = 1 otherwise, alongside further example functions.
5.5 Continuity
If you have understood the notion of a limit, then it is easy to understand continuity. A function
f(x) is continuous at a point a, if the following three conditions are met:
1. f (a) should exist
2. f (x) has a limit as x approaches a
3. The limit of f (x) as x → a is equal to f (a)
If all of the above hold true, then the function is continuous at the point a. Some examples
follow:
Examples of continuity
The concept of continuity is closely related to limits. If the function is defined at a point, has
no jumps at that point, and has a limit at that point, then it is continuous at that point. The
figure below shows some examples, which are explained below:
f4(x) = x^2 is continuous for all values of x.

For g(x) from earlier, we now have a function that is continuous for all values of x. We come up with this by knowing that 1 − x^2 = (1 − x)(1 + x), and hence if x ≠ −1, g(x) would be the same as 1 − x.

f3(x) is continuous everywhere, except at x = 0, as the value of f3(x) has a big jump at x = 0. Hence, there is a discontinuity at x = 0.
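The three continuity conditions can be sketched with SymPy (a quick sketch; is_continuous_at is our own helper, not a SymPy function):

```python
from sympy import limit
from sympy.abc import x

def is_continuous_at(f, a):
    # 1. f(a) should exist (be finite)
    value = f.subs(x, a)
    if not value.is_finite:
        return False
    # 2. the left and right hand limits must exist and agree
    left = limit(f, x, a, dir='-')
    right = limit(f, x, a, dir='+')
    if left != right:
        return False
    # 3. the limit must equal f(a)
    return left == value

print(is_continuous_at(x**2, 0))   # True
print(is_continuous_at(1 / x, 0))  # False
```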
5.6 Further reading
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
5.7 Summary
In this tutorial, you discovered calculus concepts on limits and continuity. Specifically, you
learned:
⊲ Whether a function has a limit when approaching a point
⊲ Whether a function is continuous at a point or within an interval
As we understand what a limit of a mathematical function is about, we will see how we can
evaluate it in the next chapter.
Chapter 6. Evaluating Limits
The concept of the limit of a function dates back to Greek scholars such as Eudoxus and
Archimedes. While they never formally defined limits, many of their calculations were based
upon this concept. Isaac Newton made the notion of a limit central to his calculus, and Cauchy later refined it into a formal definition. Limits form the basis of calculus, which in turn defines the foundation of many
machine learning algorithms. Hence, it is important to understand how limits of different types
of functions are evaluated.
In this tutorial, you will discover how to evaluate the limits of different types of functions.
After completing this tutorial, you will know:
⊲ The different rules for evaluating limits
⊲ How to evaluate the limit of polynomials and rational functions
⊲ How to evaluate the limit of a function with discontinuities
⊲ The Sandwich Theorem
Let’s get started.
Overview
This tutorial is divided into 3 parts; they are:
⊲ Rules for limits
◦ Examples of evaluating limits using the rules for limits
◦ Limits for polynomials
◦ Limits for rational expressions
⊲ Limits for functions with a discontinuity
⊲ The Sandwich Theorem
6.1 Rules for limits
Example 1: Evaluate lim_{x→2} (x^3 + 3x + 2). We can use lim_{x→2} x = 2:

lim_{x→2} (x^3 + 3x + 2) = lim_{x→2} x^3 + lim_{x→2} 3x + 2   (sum rule)
= 2^3 + 3 × 2 + 2   (power and constant multiple rules)
= 8 + 6 + 2
= 16

Example 2: Evaluate lim_{x→3} (x^2 − 2x + 1). We can use lim_{x→3} x = 3:

lim_{x→3} (x^2 − 2x + 1) = lim_{x→3} x^2 − lim_{x→3} 2x + 1   (difference rule)
= 3^2 − 2 × 3 + 1   (power and constant multiple rules)
= 9 − 6 + 1
= 4
Example 3: Evaluate lim_{x→1} (x^3 + 3)/x^2. By the quotient rule, this is (1^3 + 3)/1^2 = 4.

Example 4: Evaluate lim_{x→0} √(x + 1). By direct substitution, this is √(0 + 1) = 1.
Here are a few examples that use the basic rules to evaluate a limit. Note that these rules
apply to functions which are defined at a point as x approaches that point.
6.2 Limits for polynomials

For a polynomial P(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0, the limit rules give:

lim_{x→k} P(x) = a_n k^n + a_{n−1} k^{n−1} + · · · + a_1 k + a_0 = P(k)
Hence, we can evaluate the limit of a polynomial via direct substitution, e.g.,

lim_{x→1} (x^4 + 3x^3 + 2) = 1^4 + 3(1)^3 + 2 = 6
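As a quick check (using SymPy, which the book uses elsewhere in this chapter), direct substitution agrees with the symbolic limit:

```python
from sympy import limit
from sympy.abc import x

expression = x**4 + 3*x**3 + 2
result = limit(expression, x, 1)
print(result)  # 6
```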
Next, consider the rational function g1(x) = (x^2 + 1)/(x − 1). Here, we can apply the quotient rule or, easier still, substitute x = 0 to evaluate the limit. However, this function has no limit when x approaches 1. See the first graph in the figure below.
Now consider g2(x) = (x^2 − 4)/(x − 2). At x = 2 we are faced with a problem. The denominator is zero, and hence the function is undefined at x = 2. We can see from the figure that the graph of this function and (x + 2) is the same, except at the point x = 2, where there is a hole. In this case, we can cancel out the common factors and still evaluate the limit for x → 2 as:

lim_{x→2} (x^2 − 4)/(x − 2) = lim_{x→2} (x − 2)(x + 2)/(x − 2) = lim_{x→2} (x + 2) = 4
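SymPy reaches the same value even though the function is undefined at the point itself (a quick check, not from the book):

```python
from sympy import limit
from sympy.abc import x

# The function is undefined at x = 2, yet the limit exists
expression = (x**2 - 4) / (x - 2)
result = limit(expression, x, 2)
print(result)  # 4
```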
The following figure shows the above two examples, as well as a third similar example of g3(x):

g1(x) = (x^2 + 1)/(x − 1), with lim_{x→0} g1(x) = −1 (no limit when x → 1)
g2(x) = (x^2 − 4)/(x − 2), with lim_{x→2} g2(x) = lim_{x→2} (x + 2) = 4
g3(x) = (x^2 − 1)/(x^2 − x), with lim_{x→1} g3(x) = lim_{x→1} (x − 1)(x + 1)/(x(x − 1)) = lim_{x→1} (x + 1)/x = 2
Now consider the function:

h(x) = (x^2 + x)/x, if x ≠ 0
h(x) = 0, if x = 0

The function h(x) has a discontinuity at x = 0, as shown in the figure below. When evaluating lim_{x→0} h(x), we have to see what happens to h(x) when x is close to 0 (and not when x = 0). As we approach x = 0 from either side, h(x) approaches 1, and hence lim_{x→0} h(x) = 1.
The function m(x) shown in the figure below is another interesting case. This function is
also defined for all real numbers but the limit does not exist when x → 0.
Figure: line plots of the two functions:

h(x) = (x^2 + x)/x if x ≠ 0; h(x) = 0 if x = 0
m(x) = (x^2 + 1)/x if x ≠ 0; m(x) = 0 if x = 0
The Sandwich Theorem (also called the Squeeze Theorem) applies when a function g(x) is sandwiched between two functions f(x) and h(x) near a point k, that is, f(x) ≤ g(x) ≤ h(x), with

lim_{x→k} f(x) = lim_{x→k} h(x) = L

As x → k, g(x) remains squeezed between f(x) and h(x), and hence:

lim_{x→k} g(x) = L

Figure: g(x) sandwiched between f(x) and h(x) around the point k.
Using this theorem we can evaluate the limits of many complex functions. A well known
example involves the sine function:
lim_{x→0} x^2 sin(1/x)
We know that sin(1/x) always lies between −1 and +1. Using this fact, we can sandwich the function:

−x^2 ≤ x^2 sin(1/x) ≤ x^2

Since lim_{x→0} (−x^2) = lim_{x→0} x^2 = 0, the Sandwich Theorem gives lim_{x→0} x^2 sin(1/x) = 0.
from sympy import limit, sqrt, pprint
from sympy.abc import x

expression = sqrt(x + 1)
result = limit(expression, x, -1)
print("Limit of")
pprint(expression)
print("at x=-1 is", result)
Limit of
_______
╲╱ x + 1
at x=-1 is 0
Note that in the code above, the variable x must be defined as a symbol in SymPy. Hence, we use the one imported from sympy.abc for this purpose. The limit is evaluated algebraically rather than numerically, hence its result should be exact. We can see that it can also evaluate the limit at a point where the function is undefined, such as the case we saw in the previous section:
lim_{x→0} x^2 sin(1/x)

from sympy import limit, sin, pprint
from sympy.abc import x

expression = x**2 * sin(1/x)
result = limit(expression, x, 0)
print("Limit of")
pprint(expression)
print("at x=0 is", result)

Program 6.2: Evaluate the limit from the example of sandwich theorem
Limit of
2 ⎛1⎞
x ⋅sin⎜─⎟
⎝x⎠
at x=0 is 0
6.7 Further reading
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
6.8 Summary
In this tutorial, you discovered how limits for different types of functions can be evaluated.
Specifically, you learned:
⊲ Rules for evaluating limits for different functions.
⊲ Evaluating limits of polynomials and rational functions
⊲ Evaluating limits when discontinuities are present in a function
In the next chapter we will connect the limit to differential calculus.
Chapter 7. Function Derivatives
The concept of the derivative is the building block of many topics of calculus. It is important
for understanding integrals, gradients, Hessians, and much more.
In this tutorial, you will discover the definition of a derivative, its notation and how you
can compute the derivative based upon this definition. You will also discover why the derivative
of a function is also a function.
After completing this tutorial, you will know:
⊲ The definition of the derivative of a function
⊲ How to compute the derivative of a function based upon the definition
⊲ Why some functions do not have a derivative at a point
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ The definition and notation used for derivatives of functions
⊲ How to compute the derivative of a function using the definition
⊲ Why some functions do not have a derivative at a point
f′(x) = df/dx = lim_{∆x→0} (f(x + ∆x) − f(x))/∆x = lim_{∆x→0} ∆f/∆x
Figure 7.1: Derivative of f is the rate of change of f. The interval ∆x between x and x + ∆x is made smaller and smaller so that ∆x → 0; ∆f = f(x + ∆x) − f(x) is the corresponding change in f.
In the figure, ∆x represents a change in the value of x. We keep making the interval
between x and (x + ∆x) smaller and smaller until it is infinitesimal. Hence, we have the limit ∆x → 0. The numerator f(x + ∆x) − f(x) represents the corresponding change in the value of the function f over the interval ∆x. This makes the derivative of a function f at a point x the rate of change of f at that point.

An important point to note is that ∆x, the change in x, can be negative or positive. Hence:

0 < |∆x| < ε,

where ε is an infinitesimally small value.
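The limit definition suggests a numerical approximation: pick a small but finite ∆x and evaluate the difference quotient. A sketch (numerical_derivative is our own helper, not from the book):

```python
def numerical_derivative(f, x, dx=1e-8):
    # Difference quotient (f(x + dx) - f(x)) / dx with a small but finite dx
    return (f(x + dx) - f(x)) / dx

# For f(x) = x^2, the true derivative at x = 3 is 2*3 = 6
print(numerical_derivative(lambda t: t * t, 3.0))  # close to 6.0
```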
7.2 Differentiation examples

Example 1: m(x) = 2x + 5
Let’s start with a simple example of a linear function m(x) = 2x + 5. We can see that m(x)
changes at a constant rate. We can differentiate this function as follows.
$$\begin{aligned}
m(x) &= 2x + 5 \\
m(x + \Delta x) &= 2(x + \Delta x) + 5 \\
m'(x) &= \lim_{\Delta x \to 0} \frac{m(x + \Delta x) - m(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{2x + 2\Delta x + 5 - 2x - 5}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{2\Delta x}{\Delta x} = 2
\end{aligned}$$
Figure 7.2: Derivative of m(x) = 2x + 5. The rate of change of a linear function is a constant.
The above figure shows how the function m(x) is changing, and it also shows that no matter which value of x we choose, the rate of change of m(x) always remains 2. We can verify the above result with SymPy:
from sympy import symbols, diff, pprint

x = symbols('x')
expression = 2*x + 5
result = diff(expression, x)
print("Derivative of")
pprint(expression)
print("with respect to x is")
pprint(result)
Derivative of
2⋅x + 5
with respect to x is
2
Example 2: g(x) = x²
Suppose we have the function g(x) = x². The derivation below shows how the derivative of g(x) with respect to x is calculated, and the figure shows a plot of the function and its derivative.
$$\begin{aligned}
g(x) &= x^2 \\
g(x + \Delta x) &= (x + \Delta x)^2 = x^2 + 2x\Delta x + (\Delta x)^2 \\
g'(x) &= \lim_{\Delta x \to 0} \frac{g(x + \Delta x) - g(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{x^2 + 2x\Delta x + (\Delta x)^2 - x^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{2x\Delta x + (\Delta x)^2}{\Delta x} \\
&= \lim_{\Delta x \to 0} (2x + \Delta x) = 2x
\end{aligned}$$
Figure 7.3: Derivative of g(x) = x²: The rate of change is positive when x > 0, negative when x < 0, and 0 when x = 0. For example, the rate of change of g(x) at x = 2 is the value of g′(x) at x = 2, i.e., g′(2) = 4.
Since g′(x) = 2x, we have g′(0) = 0, g′(1) = 2, g′(2) = 4, g′(−1) = −2, and g′(−2) = −4.
From the figure, we can see that the value of g(x) is very large for large negative values of x. When x < 0, increasing x decreases g(x), and hence g′(x) < 0 for x < 0. The graph flattens out at x = 0, where the derivative or rate of change of g(x) becomes zero. When x > 0, g(x) increases quadratically with the increase in x, and hence the derivative is also positive. We can verify the above result with SymPy:
from sympy import symbols, diff, pprint

x = symbols('x')
expression = x**2
result = diff(expression, x)
print("Derivative of")
pprint(expression)
print("with respect to x is")
pprint(result)
Derivative of
2
x
with respect to x is
2⋅x
$$h'(x) = \frac{-1}{x^2}$$
7.3 Differentiability and continuity
Figure 7.4: Derivative of h(x) = 1/x. Where h(x) is decreasing, h′(x) = −1/x² is negative.
from sympy import symbols, diff, pprint

x = symbols('x')
expression = 1/x
result = diff(expression, x)
print("Derivative of")
pprint(expression)
print("with respect to x is")
pprint(result)
Derivative of
1
─
x
with respect to x is
-1
───
2
x
Some functions fail to have a derivative at certain points. Three such examples are:
$$g(x) = \frac{1 - x}{1 + x} \qquad
H(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{otherwise} \end{cases} \qquad
m(x) = \begin{cases} \dfrac{x^2 + 1}{x}, & \text{if } x \neq 0 \\ 0, & \text{if } x = 0 \end{cases}$$
(Figure: plots of g(x), H(x), and m(x).)
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
7.5 Summary
In this tutorial, you discovered the function derivatives and the fundamentals of function
differentiation. Specifically, you learned:
⊲ The definition and notation of a function derivative
⊲ How to differentiate a function using the definition
⊲ When a function is not differentiable
In the next chapter, we will explore more properties related to the continuity of a function.
Continuous Functions
8
Many areas of calculus require an understanding of continuous functions. The characteristics
of continuous functions, and the study of points of discontinuity are of great interest to the
mathematical community. Because of their important properties, continuous functions have
practical applications in machine learning algorithms and optimization methods.
In this tutorial, you will discover what continuous functions are, their properties, and two important theorems in the study of optimization algorithms, namely the intermediate value theorem and the extreme value theorem.
After completing this tutorial, you will know:
⊲ Definition of continuous functions
⊲ Intermediate value theorem
⊲ Extreme value theorem
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Definition of continuous functions
◦ Informal definition
◦ Formal definition
⊲ Theorems
◦ Intermediate value theorem
◦ Extreme value theorem
8.1 Prerequisites
This tutorial requires an understanding of the concept of limits. To refresh your memory, you can take a look at limits and continuity (Chapter 5), where continuous functions are also briefly defined. In this tutorial, we'll go into more detail.
We'll also make use of interval notation, where square brackets denote closed intervals (which include the boundary points) and parentheses denote open intervals (which do not include the boundary points), for example,
⊲ [a, b] means a ≤ x ≤ b
⊲ (a, b) means a < x < b
⊲ [a, b) means a ≤ x < b
From the above, you can note that an interval can be open on one side and closed on the other.
As a last point, we’ll only be discussing real functions defined over real numbers. We won’t
be discussing complex numbers or functions defined on the complex plane.
(Figure: the line f(x) = 2x + 1 is possible to draw without lifting the pen, while the ceil function is not.)
The ceil function has a value of 1 on the interval (0, 1], for example, ceil(0.5) = 1, ceil(0.7) = 1, and so on. As a result, the function is continuous over the domain (0, 1]. If we adjust the interval to (0, 2], ceil(x) jumps to 2 as soon as x > 1. To plot ceil(x) for the domain (0, 2], we must now lift our hand and start plotting again at x = 2. As a result, the ceil function isn't a continuous function.
If a function is continuous over the entire domain of real numbers, then it is a continuous function as a whole; otherwise, it is not continuous as a whole. For the latter type of function, we can check over which intervals it is continuous.
8.4 Examples
Some examples are listed below and also shown in the figure:
⊲ f (x) = 1/x is not continuous as it is not defined at x = 0. However, the function is
continuous for the domain x > 0.
⊲ All polynomial functions are continuous functions.
⊲ The trigonometric functions sin(x) and cos(x) are continuous and oscillate between the
values −1 and 1.
⊲ The trigonometric function tan(x) is not continuous as it is undefined at x = π/2,
x = −π/2, etc.
⊲ √x is not continuous as it is not defined for x < 0.
⊲ |x| is continuous everywhere.
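Some of these claims can be spot-checked in SymPy; the singularities helper used below is an assumption on our part (it is not one of the book's own listings):

```python
from sympy import symbols, singularities

x = symbols('x')

# 1/x has a point of discontinuity (a singularity) at x = 0
print(singularities(1/x, x))

# A polynomial is continuous everywhere: no singularities
print(singularities(x**4 + x**3 + 2*x + 1, x))
```
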
8.5 Connection of continuity with function derivatives
(Figure: plots of the example functions 1/x, x⁴ + x³ + 2x + 1, sin(x) and cos(x), tan(x), √x, and |x|.)
Figure 8.3: Illustration of intermediate value theorem (left) and extreme value theorem (right)
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
8.10 Summary
In this tutorial, you discovered the concept of continuous functions. Specifically, you learned:
⊲ What are continuous functions
⊲ The formal and informal definitions of continuous functions
⊲ Points of discontinuity
⊲ Intermediate value theorem
⊲ Extreme value theorem
⊲ Why continuous functions are important
While differentiation is ultimately about evaluating limits, we will see in the next chapter that we can find the derivative of a function more easily and quickly by remembering a few rules.
Derivatives of Powers and
Polynomials
9
Among the most frequently used functions in machine learning and data science algorithms are polynomials, or functions involving powers of x. It is therefore important to understand how the derivatives of such functions are calculated.
In this tutorial, you will discover how to compute the derivative of powers of x and
polynomials. After completing this tutorial, you will know:
⊲ General rule for computing the derivative of polynomials
⊲ General rule for finding the derivative of a function that involves any non-zero real
powers of x
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
1. The derivative of a function that involves integer powers of x
2. Differentiation of a function that has any real non-zero power of x
9.1 Derivative of the sum of two functions
Here we have a general rule that says that the derivative of the sum of two functions is the
sum of the derivatives of the individual functions.
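Written symbolically, the sum rule just described is:

$$\frac{d}{dx}\big(u(x) + v(x)\big) = \frac{du(x)}{dx} + \frac{dv(x)}{dx}$$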
We will also need the binomial theorem:
$$(a + b)^n = a^n + \binom{n}{1} a^{n-1} b + \binom{n}{2} a^{n-2} b^2 + \cdots + \binom{n}{n-1} a b^{n-1} + b^n$$
We'll derive a simple rule for finding the derivative of a function that involves $x^n$, where n is an integer and n > 0. Let's go back to the definition of a derivative and apply it to $kx^n$, where k is a constant.
$$\begin{aligned}
f(x) &= kx^n \\
f(x + h) &= k(x + h)^n \\
&= k\left(x^n + \binom{n}{1} x^{n-1} h + \binom{n}{2} x^{n-2} h^2 + \cdots + \binom{n}{n-1} x h^{n-1} + h^n\right)
\end{aligned}$$
Then the derivative is
$$\begin{aligned}
f'(x) &= \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \\
&= \lim_{h \to 0} \frac{k\left(x^n + \binom{n}{1} x^{n-1} h + \binom{n}{2} x^{n-2} h^2 + \cdots + \binom{n}{n-1} x h^{n-1} + h^n\right) - kx^n}{h} \\
&= \lim_{h \to 0} k\left(\binom{n}{1} x^{n-1} + \binom{n}{2} x^{n-2} h + \cdots + \binom{n}{n-1} x h^{n-2} + h^{n-1}\right) \\
&= k n x^{n-1}
\end{aligned}$$
9.3 How to differentiate a polynomial?
Following are some examples of applying this rule, and we can verify them with SymPy:
⊲ Derivative of x² is 2x
⊲ Derivative of 3x⁵ is 15x⁴
⊲ Derivative of 4x⁹ is 36x⁸
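A short SymPy sketch that checks all three at once (the loop is our own arrangement of the book's usual diff/pprint pattern):

```python
from sympy import symbols, diff, pprint

x = symbols('x')
# Differentiate each expression and pretty-print the result
for expression in (x**2, 3*x**5, 4*x**9):
    result = diff(expression, x)
    print("Derivative of")
    pprint(expression)
    print("with respect to x is")
    pprint(result)
```
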
Derivative of
2
x
with respect to x is
2⋅x
Derivative of
5
3⋅x
with respect to x is
4
15⋅x
Derivative of
9
4⋅x
with respect to x is
8
36⋅x
For a general polynomial $P(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$, applying the above rule term by term gives its derivative $P'(x)$ as:
$$P'(x) = n a_n x^{n-1} + (n-1) a_{n-1} x^{n-2} + \cdots + a_1$$
This shows that the derivative of a polynomial of degree n is in fact a polynomial of degree (n − 1).
9.4 Examples
Some examples are shown below, where the polynomial function and its derivatives are all
plotted together. The blue curve shows the function itself, while the red curve is the derivative
of that function.
(Figure: plots of f(x) = 2x² − 3x with f′(x) = 4x − 3; f(x) = 2x⁸ + x⁶ + x + 2 with f′(x) = 16x⁷ + 6x⁵ + 1; and f(x) = 5x³ + 2x + 1 with f′(x) = 15x² + 2.)
$$f(x) = kx^a \qquad f'(x) = k a x^{a-1}$$
Here are a few examples, which are plotted along with their derivatives. Again, the blue curve
denotes the function itself, and the red curve denotes the corresponding derivative:
(Figure: plots of f(x) = 2x⁻¹ with f′(x) = −2x⁻²; f(x) = 5x^0.1 with f′(x) = 0.5x^−0.9; and f(x) = √x with f′(x) = 0.5x^−0.5.)
We can verify these results with SymPy. To prevent the fraction −3/4 from being converted to a floating point number, so that we can match our expressions above, we provide these functions as strings and ask SymPy to parse them:
Derivative of
a
k⋅x
with respect to x is
a - 1
a⋅k⋅x
Derivative of
0.2
x
with respect to x is
-0.8
0.2⋅x
Derivative of
π
x
with respect to x is
-1 + π
π⋅x
Derivative of
9.6 Further reading 64
1
────
3/4
x
with respect to x is
-3
──────
7/4
4⋅x
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
9.7 Summary
In this tutorial, you discovered how to differentiate a polynomial function and functions involving
a sum of non-integer powers of x. Specifically, you learned:
⊲ Derivative of the sum of two functions
⊲ Derivative of a constant multiplied by an integer power of x
⊲ Derivative of a polynomial function
⊲ Derivative of a sum of expressions involving non-integer powers of x
In the next chapter, we will see how we can find derivatives involving not powers of x, but sin(x)
and cos(x).
Derivative of the Sine and Cosine
10
Many machine learning algorithms involve an optimization process for different purposes.
Optimization refers to the problem of minimizing or maximizing an objective function by altering
the value of its inputs.
Optimization algorithms rely on the use of derivatives in order to understand how to alter
(increase or decrease) the input values to the objective function, in order to minimize or maximize
it. It is, therefore, important that the objective function under consideration is differentiable.
The two fundamental trigonometric functions, the sine and cosine, offer a good opportunity
to understand the maneuvers that might be required in finding the derivatives of differentiable
functions. These two functions become especially important if we think of them as the
fundamental building blocks of more complex functions.
In this tutorial, you will discover how to find the derivative of the sine and cosine functions.
After completing this tutorial, you will know:
⊲ How to find the derivative of the sine and cosine functions by applying several rules
from algebra, trigonometry and limits.
⊲ How to find the derivative of the sine and cosine functions in Python.
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ The Derivative of the Sine Function
⊲ The Derivative of the Cosine Function
⊲ Finding Derivatives in Python
10.1 The derivative of the sine function
Figure 10.1: Representing angle, h, on the unit circle. The arc subtending h has length h, with sin h and cos h as the sides of the corresponding right triangle on a circle of radius 1.
We will be comparing the area of different sectors and triangles, with sides subtending the angle h, in an attempt to infer how $\frac{\sin h}{h}$ behaves as the value of h approaches zero. For this purpose, consider first the area of sector OAB:
Figure 10.2: Finding the area of sector, OAB
The area of a sector can be defined in terms of the circle radius, r, and the length of the
arc AB, h. Since the circle under consideration is the unit circle, then r = 1:
$$\text{area of sector } OAB = \frac{rh}{2} = \frac{h}{2}$$
We can compare the area of the sector OAB that we have just found, to the area of the
triangle OAB within the same sector.
Figure 10.3: Finding the area of triangle, OAB
The area of this triangle is defined in terms of its height, BC = sin h, and the length of its
base, OA = 1:
$$\text{area of triangle } OAB = \frac{(BC)(OA)}{2} = \frac{\sin h}{2}$$
Since we can clearly see that the area of the triangle OAB is smaller than the area of the sector that it is contained within, we may say that:
$$\frac{\sin h}{2} < \frac{h}{2} \quad\Longrightarrow\quad \frac{\sin h}{h} < 1$$
This is the first piece of information that we have obtained regarding the behavior of $\frac{\sin h}{h}$, which tells us that its upper limit value will not exceed 1.
Let us now proceed to consider a second triangle, OAB′, that is characterized by a larger area than that of sector OAB. We can use this triangle to provide us with the second piece of information about the behavior of $\frac{\sin h}{h}$, which is its lower limit value:
Figure 10.4: Comparing similar triangles, OAB and OAB′
Applying the properties of similar triangles to relate OAB ′ to OCB, gives us information
regarding the length, B ′ A, that we need to compute the area of the triangle:
$$\frac{B'A}{OA} = \frac{BC}{OC} = \frac{\sin h}{\cos h}$$
Hence, the area of triangle OAB ′ may be computed as:
$$\text{area of triangle } OAB' = \frac{(B'A)(OA)}{2} = \frac{\sin h}{2\cos h}$$
Comparing the area of triangle OAB ′ to that of sector OAB, we can see that the former is now
larger:
$$\frac{h}{2} < \frac{\sin h}{2\cos h} \quad\Longrightarrow\quad \cos h < \frac{\sin h}{h}$$
This is the second piece of information that we needed, which tells us that the lower limit value of $\frac{\sin h}{h}$ does not drop below cos h. We also know that as h approaches 0, the value of cos h approaches 1.
Hence, putting the two pieces of information together, we find that as h becomes smaller and smaller, the value of $\frac{\sin h}{h}$ itself is squeezed to 1 by its lower and upper limits. This is, indeed, referred to as the squeeze or sandwich theorem.
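In symbols, this gives us the first of the two limits we need:

$$\lim_{h \to 0} \frac{\sin h}{h} = 1$$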
Let’s now proceed to tackle the second limit. By applying standard algebraic rules:
$$\lim_{h \to 0} \frac{\cos h - 1}{h} = \lim_{h \to 0} \frac{\cos h - 1}{h} \cdot \frac{\cos h + 1}{\cos h + 1}$$
We can manipulate the second limit as follows:
$$\lim_{h \to 0} \frac{\cos h - 1}{h} = \lim_{h \to 0} \frac{\cos^2 h - 1}{h(\cos h + 1)}$$
We can then express this limit in terms of sine, by applying the Pythagorean identity from
trigonometry, sin2 h = 1 − cos2 h:
$$\lim_{h \to 0} \frac{\cos h - 1}{h} = \lim_{h \to 0} \frac{-\sin^2 h}{h(\cos h + 1)}$$
Followed by the application of another limit law, which states that the limit of a product is
equal to the product of the separate limits:
$$\lim_{h \to 0} \frac{\cos h - 1}{h} = \lim_{h \to 0} \frac{\sin h}{h} \cdot \lim_{h \to 0} \frac{-\sin h}{\cos h + 1}$$
We have already tackled the first limit of this product, and we have found that this has a value
of 1.
The second limit of this product is characterized by a cos h in the denominator, which
approaches a value of 1 as h becomes smaller. Hence, the denominator of the second limit
approaches a value of 2 as h approaches 0. The sine term in the numerator, on the other hand,
attains a value of 0 as h approaches 0. This drives not only the second limit, but also the entire
product limit to 0:
$$\lim_{h \to 0} \frac{\cos h - 1}{h} = 0$$
Putting everything together, we may finally arrive at the following conclusion:
$$\begin{aligned}
\sin'(x) &= \lim_{h \to 0} \frac{\sin h}{h} \cos x + \lim_{h \to 0} \frac{\cos h - 1}{h} \sin x \\
&= (1)(\cos x) + (0)(\sin x) = \cos x
\end{aligned}$$
10.2 The derivative of the cosine function
We can use the derivative of the sine function in order to compute directly the rate of
change, or slope, of the tangent line at this peak on the graph:
sin′ (π/2) = cos(π/2) = 0
We find that this result corresponds well with the fact that the peak of the sine function is,
indeed, a stationary point with zero rate of change.
A similar exercise can be easily carried out to compute the rate of change of the tangent
line at different angles, for both the sine and cosine functions.
10.3 Finding derivatives in Python
We can now proceed to define a variable x in symbolic form, which means that we can
work with x without having to assign it a value.
Next, we can find the derivative of the sine and cosine functions with respect to x, using the diff function.
We find that the diff function correctly returns cos(x) as the derivative of sine, and -sin(x)
as the derivative of cosine.
The diff function can take multiple derivatives too. For example, we can find the second
derivative for both sine and cosine by passing x twice.
This means that, in finding the second derivative, we are taking the derivative of the
derivative of each function. For example, to find the second derivative of the sine function,
we take the derivative of cos(x), its first derivative. We can find the second derivative for the
cosine function by similarly taking the derivative of -sin(x), its first derivative.
We can, alternatively, pass the number 2 to the diff function to indicate that we are interested
in finding the second derivative.
Tying all of this together, the complete example of finding the derivative of the sine and
cosine functions is listed below.
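A minimal sketch of such a complete example, assuming the standard SymPy API described above (not necessarily the original listing):

```python
from sympy import symbols, sin, cos, diff

# Define x in symbolic form
x = symbols('x')

# First derivatives of sine and cosine
print(diff(sin(x), x))      # cos(x)
print(diff(cos(x), x))      # -sin(x)

# Second derivatives, by passing x twice
print(diff(sin(x), x, x))   # -sin(x)
print(diff(cos(x), x, x))   # -cos(x)

# Alternatively, pass the number 2 for the second derivative
print(diff(sin(x), x, 2))   # -sin(x)
print(diff(cos(x), x, 2))   # -cos(x)
```
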
Books
Michael Spivak. The Hitchhiker’s Guide to Calculus. American Mathematical Society, 2019.
https://www.amazon.com/dp/1470449625/
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
10.5 Summary
In this tutorial, you discovered how to find the derivative of the sine and cosine functions.
Specifically, you learned:
⊲ How to find the derivative of the sine and cosine functions by applying several rules
from algebra, trigonometry and limits.
⊲ How to find the derivative of the sine and cosine functions in Python.
This completes the most common building blocks for finding the derivative of a function. In the next chapter, we will see how the derivative of a function involving powers, multiplication, and division can be broken down using these building blocks.
The Power, Product, and
Quotient Rules
11
Optimization, as one of the core processes in many machine learning algorithms, relies on the
use of derivatives in order to decide in which manner to update a model’s parameter values, to
maximize or minimize an objective function.
This tutorial will continue exploring the different techniques by which we can find the derivatives of functions. In particular, we will be exploring the power, product, and quotient rules, which we can use to arrive at the derivatives of functions faster than if we had to find every derivative from first principles. Hence, for functions that are especially challenging, keeping such rules at hand to find their derivatives will become increasingly important.
In this tutorial, you will discover the power, product and quotient rules to find the derivative
of functions.
After completing this tutorial, you will know:
⊲ The power rule to follow when finding the derivative of a variable base, raised to a fixed
power.
⊲ How the product rule allows us to find the derivative of a function that is defined as
the product of another two (or more) functions.
⊲ How the quotient rule allows us to find the derivative of a function that is the ratio of
two differentiable functions.
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ The Power Rule
⊲ The Product Rule
⊲ The Quotient Rule
11.1 The power rule
$$f(x) = x^2 \qquad f'(x) = 2x$$
For the purpose of understanding better where this rule comes from, let’s take the longer route
and find the derivative of f (x) by starting from the definition of a derivative:
$$f'(x) = \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
Here, we substitute for f(x) = x² and then proceed to simplify the expression:
$$\begin{aligned}
f'(x) &= \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} \\
&= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} \\
&= \lim_{h \to 0} \frac{2xh + h^2}{h} \\
&= \lim_{h \to 0} (2x + h)
\end{aligned}$$
As h approaches a value of 0, then this limit approaches 2x, which tallies with the result that
we have obtained earlier using the power rule.
If applied to f(x) = x, the power rule gives us a value of 1. That is because, when we bring a value of 1 in front of x and then subtract 1 from the power, we are left with a value of 0 in the exponent. Since x⁰ = 1, then f′(x) = (1)(x⁰) = 1.
“
The best way to understand this derivative is to realize that f (x) = x is a line
that fits the form y = mx + b because f (x) = x is the same as f (x) = 1x + 0 (or
y = 1x + 0). The slope (m) of this line is 1, so the derivative equals 1. Or you can
just memorize that the derivative of x is 1. But if you forget both of these ideas,
”
you can always use the power rule.
— Page 131, Calculus For Dummies, 2016.
The power rule can be applied to any power, be it positive, negative, or a fraction. We can also
apply it to radical functions by first expressing their exponent (or power) as a fraction:
$$f(x) = \sqrt{x} = x^{1/2} \qquad f'(x) = \frac{1}{2} x^{-1/2} = \frac{1}{2\sqrt{x}}$$
These examples can be verified with SymPy:
Program 11.1: Verifying the examples of finding derivatives using power rule
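A minimal version of such a program, assuming the same SymPy helpers used in earlier chapters (a sketch, not necessarily the original listing):

```python
from sympy import symbols, sqrt, diff, pprint

x = symbols('x')
# Verify the power rule on x**2 and on the radical sqrt(x)
for expression in (x**2, sqrt(x)):
    result = diff(expression, x)
    print("Derivative of")
    pprint(expression)
    print("with respect to x is")
    pprint(result)
```
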
Derivative of
2
x
with respect to x is
2⋅x
Derivative of
√x
with respect to x is
1
────
2⋅√x
In order to investigate how to go about finding the derivative of f(x) = u(x)v(x), let's first start with finding the derivative of the product of u(x) and v(x) directly:
Now let’s investigate what happens if we, otherwise, had to compute the derivatives of the
functions separately first and then multiply them afterwards:
It is clear that the second result does not tally with the first one, and that is because we have
not applied the product rule.
The product rule tells us that the derivative of the product of two functions can be found
as:
f ′ (x) = u′ (x)v(x) + u(x)v ′ (x)
11.2 The product rule 77
We can arrive at the product rule if we work our way through by applying the properties of limits, starting again with the definition of a derivative:
$$f'(x) = \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
We know that f (x) = u(x)v(x) and, hence, we can substitute for f (x) and f (x + h):
With this new tool in hand, let’s reconsider finding f ′ (x) when u(x) = 2x and v(x) = x3 :
The resulting derivative now correctly matches the derivative of the product, (u(x)v(x))′ , that
we have obtained earlier.
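A quick SymPy check of this example; u(x) = 2x and v(x) = x³ are from the text, while the comparison with the naive (incorrect) approach is our own addition:

```python
from sympy import symbols, diff

x = symbols('x')
u = 2*x
v = x**3

# Derivative of the product, computed directly
direct = diff(u * v, x)              # 8*x**3

# Product rule: u'v + uv'
rule = diff(u, x)*v + u*diff(v, x)   # 8*x**3

# Naive (incorrect): product of the separate derivatives
naive = diff(u, x) * diff(v, x)      # 6*x**2

print(direct, rule, naive)
```
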
This was a fairly simple example that we could have computed directly in the first place.
However, we might have more complex problems involving functions that cannot be multiplied
directly, to which we can easily apply the product rule. For example:
$$f(x) = x^2 \sin x$$
$$f'(x) = (x^2)'(\sin x) + (x^2)(\sin x)' = 2x \sin x + x^2 \cos x$$
This can be verified using SymPy:
from sympy import symbols, sin, diff, pprint

x = symbols('x')
u = x**2
v = sin(x)
f = u * v
result = diff(f, x)
print("Derivative of")
pprint(f)
print("with respect to x is")
pprint(result)
Derivative of
2
x ⋅sin(x)
with respect to x is
2
x ⋅cos(x) + 2⋅x⋅sin(x)
We can even extend the product rule to more than two functions. For example, say f (x) is now
defined as the product of three functions, u(x), v(x) and w(x):
f (x) = u(x)v(x)w(x)
We can apply the product rule as follows:
$$f'(x) = u'(x)v(x)w(x) + u(x)v'(x)w(x) + u(x)v(x)w'(x)$$
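We can sketch a quick check of this three-function form in SymPy; the particular choices of u, v, and w below are our own:

```python
from sympy import symbols, sin, exp, diff, simplify

x = symbols('x')
u, v, w = x**2, sin(x), exp(x)

# Left: differentiate the product directly; right: apply the extended rule
lhs = diff(u * v * w, x)
rhs = diff(u, x)*v*w + u*diff(v, x)*w + u*v*diff(w, x)

# The difference simplifies to zero, confirming the rule
print(simplify(lhs - rhs))  # 0
```
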
We can derive the quotient rule from first principles as we have done for the product rule, that
is by starting off with the definition of a derivative and applying the properties of limits. Or we
can take a shortcut and derive the quotient rule using the product rule itself. Let’s take this
route this time around:
$$f(x) = \frac{u(x)}{v(x)} \quad\longrightarrow\quad u(x) = f(x)v(x)$$
Applying the product rule to u(x) = f(x)v(x) gives u′(x) = f′(x)v(x) + f(x)v′(x), which we can solve for f′(x). One final step substitutes for f(x) to arrive at the quotient rule:
$$f'(x) = \frac{u'(x) - \dfrac{u(x)}{v(x)} v'(x)}{v(x)} = \frac{u'(x)v(x) - u(x)v'(x)}{v^2(x)}$$
We saw how to find the derivatives of the sine and cosine functions in the previous chapter. Using the quotient rule, we can now find the derivative of the tangent function too:
$$f(x) = \tan x = \frac{\sin x}{\cos x}$$
Applying the quotient rule and simplifying the resulting expression:
$$\begin{aligned}
f'(x) &= \frac{(\sin x)' \cos x - \sin x (\cos x)'}{\cos^2 x} \\
&= \frac{\cos x \cos x - \sin x(-\sin x)}{\cos^2 x} \\
&= \frac{\cos^2 x + \sin^2 x}{\cos^2 x}
\end{aligned}$$
From the Pythagorean identity in trigonometry, we know that cos2 x + sin2 x = 1, hence:
$$f'(x) = \frac{1}{\cos^2 x} = \sec^2 x$$
Therefore, using the quotient rule, we have easily found that the derivative of tangent is the
squared secant function. This is the same as what we can verify with SymPy:
from sympy import symbols, sin, cos, diff, simplify, pprint

x = symbols('x')
f = sin(x) / cos(x)
result = diff(f, x)
print("Derivative of")
pprint(f)
print("with respect to x is")
pprint(simplify(result))
Derivative of
sin(x)
──────
cos(x)
with respect to x is
1
───────
2
cos (x)
Books
Mark Ryan. Calculus For Dummies. 2nd ed. Wiley, 2016.
https://www.amazon.com/dp/1119293499/
Articles
Power rule. Wikipedia.
https://en.wikipedia.org/wiki/Power_rule
Product rule. Wikipedia.
https://en.wikipedia.org/wiki/Product_rule
Quotient rule. Wikipedia.
https://en.wikipedia.org/wiki/Quotient_rule
11.5 Summary
In this tutorial, you discovered how to apply the power, product and quotient rules to find the
derivative of functions. Specifically, you learned:
⊲ The power rule for finding the derivative of a variable base, raised to a fixed power.
⊲ How the product and quotient rules allow us to find the derivative of a function that is defined as the product or ratio of two (or more) other functions, respectively.
In the next chapter, we will see what can be done to find limits of a function that cannot be
evaluated directly.
Indeterminate Forms and
l’Hôpital’s Rule
12
Indeterminate forms are often encountered when evaluating limits of functions, and limits in
turn play an important role in mathematics and calculus. They are essential for learning about
derivatives, gradients, Hessians, and a lot more.
In this tutorial, you will discover how to evaluate the limits of indeterminate forms and l'Hôpital's rule for solving them. After completing this tutorial, you will know:
⊲ How to evaluate the limits of functions having indeterminate types of the form 0/0 and
∞/∞
⊲ l’Hôpital’s rule for evaluating indeterminate types
⊲ How to convert more complex indeterminate types and apply l’Hôpital’s rule to them
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ The indeterminate forms of type 0/0 and ∞/∞
◦ How to apply l’Hôpital’s rule to these types
◦ Solved examples of these two indeterminate types
⊲ More complex indeterminate types
◦ How to convert the more complex indeterminate types to 0/0 and ∞/∞ forms
◦ Solved examples of such types
12.1 Prerequisites
This tutorial requires a basic understanding of the following two topics:
⊲ Limits and Continuity (Chapter 5)
⊲ Evaluating limits (Chapter 6)
12.2 What are indeterminate forms?
The above rule can only be applied if the expression in the denominator does not approach zero as x approaches a. A more complicated situation arises if the numerator and denominator both approach zero as x approaches a. This is called an indeterminate form of type 0/0.
Similarly, there are indeterminate forms of the type ∞/∞, which arise when both the numerator and denominator approach infinity as x approaches a. For both types, l'Hôpital's rule applies:

“If we have an indeterminate type of the form 0/0 or ∞/∞, i.e.,
$$\lim_{x \to a} f(x) = 0 \text{ and } \lim_{x \to a} g(x) = 0 \quad\text{or}\quad \lim_{x \to a} f(x) = \pm\infty \text{ and } \lim_{x \to a} g(x) = \pm\infty,$$
then
$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}$$”
⊲ $\lim_{x \to 0} \frac{\sin x}{x}$: can apply the rule as it is a 0/0 form
⊲ $\lim_{x \to \infty} \frac{e^x}{1/(x + 1)}$: cannot apply l'Hôpital's rule as it is not an ∞/∞ form
⊲ $\lim_{x \to \infty} \frac{e^x}{x}$: can apply l'Hôpital's rule as it is an ∞/∞ form
Example 1: 0/0
Evaluate $\lim_{x \to 2} \frac{\ln(x - 1)}{x - 2}$ (See the left graph in Figure 12.1)
$$\begin{aligned}
\lim_{x \to 2} \frac{\ln(x - 1)}{x - 2} &= \lim_{x \to 2} \frac{\frac{d}{dx} \ln(x - 1)}{\frac{d}{dx} (x - 2)} && \text{apply l'Hôpital's rule to type } 0/0 \\
&= \lim_{x \to 2} \frac{1/(x - 1)}{1} = 1
\end{aligned}$$
Example 2: ∞/∞
Evaluate $\lim_{x \to \infty} \frac{\ln x}{x}$ (See the right graph in Figure 12.1)
$$\begin{aligned}
\lim_{x \to \infty} \frac{\ln x}{x} &= \lim_{x \to \infty} \frac{\frac{d}{dx} \ln x}{\frac{d}{dx} x} && \text{apply l'Hôpital's rule to type } \infty/\infty \\
&= \lim_{x \to \infty} \frac{1/x}{1} = 0
\end{aligned}$$
Figure 12.1: Graphs of Examples 1 and 2: $\lim_{x \to 2} \frac{\ln(x - 1)}{x - 2} = 1$ and $\lim_{x \to \infty} \frac{\ln x}{x} = 0$
from sympy import symbols, ln, limit, pprint, oo

x = symbols('x')
expression = ln(x - 1)/(x - 2)
result = limit(expression, x, 2)
print("Limit of")
pprint(expression)
print("at x = 2 is", result)
print()

expression = ln(x)/x
result = limit(expression, x, oo)
print("Limit of")
pprint(expression)
print("at x = infinity is", result)
Limit of
log(x - 1)
──────────
x - 2
at x = 2 is 1
Limit of
log(x)
──────
x
at x = infinity is 0
The more complex indeterminate types, the conditions under which they arise, and how to handle each are summarized below:
⊲ Type ∞/∞: $\lim_{x \to a} \frac{f(x)}{g(x)}$, where $\lim_{x \to a} f(x) = \infty$ and $\lim_{x \to a} g(x) = \infty$. No conversion needed.
⊲ Type 0 · ∞: $\lim_{x \to a} f(x)g(x)$, where $\lim_{x \to a} f(x) = 0$ and $\lim_{x \to a} g(x) = \infty$. Convert to the quotient $\lim_{x \to a} \frac{f(x)}{1/g(x)}$.
⊲ Type ∞ − ∞: $\lim_{x \to a} \big(f(x) - g(x)\big)$, where $\lim_{x \to a} f(x) = \infty$ and $\lim_{x \to a} g(x) = \infty$. Take out a common factor, or rationalize.
⊲ Types 0⁰, ∞⁰, and 1^∞: $\lim_{x \to a} f(x)^{g(x)}$, where $(f, g)$ approaches $(0, 0)$, $(\infty, 0)$, or $(1, \infty)$ respectively. Use $f(x)^{g(x)} = e^{g(x) \ln f(x)}$, or let $y = f(x)^{g(x)}$ so that $\ln y = g(x) \ln f(x)$.
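As a quick check of the 0⁰ row, consider $\lim_{x \to 0^+} x^x$ (an example of our own choosing): rewriting $x^x = e^{x \ln x}$ and noting that $x \ln x \to 0$ gives a limit of 1, which SymPy confirms:

```python
from sympy import symbols, limit

x = symbols('x')

# x**x is a 0^0 form as x -> 0+; via x**x = exp(x*log(x)) the limit is 1
print(limit(x**x, x, 0, '+'))  # 1
```
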
12.6 Examples
The following examples show how you can convert one indeterminate form to either 0/0 or ∞/∞
form and apply l’Hôpital’s rule to solve the limit. After the worked out examples you can also
look at the graphs of all the functions whose limits are calculated.
Example 3: 0 · ∞

Evaluate $\lim_{x\to\infty} x\sin\frac{1}{x}$ (see the first graph in Figure 12.2).

$$\begin{aligned}
\lim_{x\to\infty} x\sin\frac{1}{x} &= \lim_{x\to\infty} \frac{\sin(1/x)}{1/x} &&\text{convert type } 0\cdot\infty \text{ to } 0/0 \\
&= \lim_{x\to\infty} \frac{\frac{d}{dx}\sin(1/x)}{\frac{d}{dx}(1/x)} &&\text{apply l'H\^opital's rule} \\
&= \lim_{x\to\infty} \frac{(-1/x^2)\cos(1/x)}{-1/x^2} \\
&= \lim_{x\to\infty} \cos\frac{1}{x} \\
&= 1
\end{aligned}$$
Example 4: ∞ − ∞

Evaluate $\lim_{x\to 0} \left(\frac{1}{1-\cos x} - \frac{1}{x}\right)$ (see the second graph in Figure 12.2).

$$\begin{aligned}
\lim_{x\to 0} \left(\frac{1}{1-\cos x} - \frac{1}{x}\right) &= \lim_{x\to 0} \frac{x - 1 + \cos x}{x(1-\cos x)} &&\text{convert type } \infty-\infty \text{ to type } 0/0 \\
&= \lim_{x\to 0} \frac{\frac{d}{dx}(x - 1 + \cos x)}{\frac{d}{dx}\big(x(1-\cos x)\big)} &&\text{apply l'H\^opital's rule} \\
&= \lim_{x\to 0} \frac{1 - \sin x}{x\sin x + (1 - \cos x)} \\
&= \infty
\end{aligned}$$
Example 5: 1^∞

Evaluate $\lim_{x\to\infty} (1+x)^{1/x}$ (see the third graph in Figure 12.2).

$$\lim_{x\to\infty} (1+x)^{1/x} = \lim_{x\to\infty} e^{y} = e^{\lim_{x\to\infty} y}, \quad\text{where } y = \frac{\ln(1+x)}{x} \qquad (*)$$

$$\begin{aligned}
\lim_{x\to\infty} y &= \lim_{x\to\infty} \frac{\ln(1+x)}{x} &&\text{type } \infty/\infty \\
&= \lim_{x\to\infty} \frac{\frac{d}{dx}\ln(1+x)}{\frac{d}{dx}x} &&\text{apply l'H\^opital's rule} \\
&= \lim_{x\to\infty} \frac{1/(1+x)}{1} \\
&= 0
\end{aligned}$$

Substituting back into (*), $\lim_{x\to\infty} (1+x)^{1/x} = e^0 = 1$.
Figure 12.2: Graphs of Examples 3, 4, and 5: $\lim_{x\to\infty} x\sin\frac{1}{x} = 1$, $\lim_{x\to 0}\left(\frac{1}{1-\cos x} - \frac{1}{x}\right) = \infty$, and $\lim_{x\to\infty} (1+x)^{1/x} = 1$
from sympy import symbols, sin, cos, limit, oo, pprint

x = symbols('x')

expression = x * sin(1/x)
result = limit(expression, x, oo)
print("Limit of")
pprint(expression)
print("at x = infinity is", result)
print()

expression = 1/(1 - cos(x)) - 1/x
result = limit(expression, x, 0)
print("Limit of")
pprint(expression)
print("at x = 0 is", result)
print()

expression = (1+x)**(1/x)
result = limit(expression, x, oo)
print("Limit of")
pprint(expression)
print("at x = infinity is", result)
Limit of
⎛1⎞
x⋅sin⎜─⎟
⎝x⎠
at x = infinity is 1
Limit of
1 1
────────── - ─
1 - cos(x) x
at x = 0 is oo
Limit of
x _______
╲╱ x + 1
at x = infinity is 1
12.7 Further reading

Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
12.8 Summary
In this tutorial, you discovered the concept of indeterminate forms and how to evaluate them.
Specifically, you learned:
⊲ Indeterminate forms of type 0/0 and ∞/∞
⊲ l’Hôpital’s rule for evaluating types 0/0 and ∞/∞
⊲ Indeterminate forms of type 0 · ∞, ∞ − ∞, and power forms, and how to evaluate them
In the next chapter, we are going to see how knowing the derivative of a function can help us.
Applications of Derivatives
13
The derivative defines the rate at which one variable changes with respect to another.
It is an important concept that comes in extremely useful in many applications: in everyday
life, the derivative can tell you at which speed you are driving, or help you predict fluctuations
on the stock market; in machine learning, derivatives are important for function optimization.
This tutorial will explore different applications of derivatives, starting with the more familiar
ones before moving to machine learning. We will be taking a closer look at what the derivatives
tell us about the different functions we are studying. In this tutorial, you will discover different
applications of derivatives. After completing this tutorial, you will know:
⊲ The use of derivatives can be applied to real-life problems that we find around us.
⊲ The use of derivatives is essential in machine learning, for function optimization.
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Applications of Derivatives in Real-Life
⊲ Applications of Derivatives in Optimization Algorithms
“Derivatives answer questions like “How fast?” “How steep?” and “How sensitive?” These are all questions about rates of change in one form or another.”
— Page 141, Infinite Powers, 2020.
This rate of change is denoted by δy/δx, hence defining a change in the dependent variable,
δy, with respect to a change in the independent variable, δx.
13.1 Applications of derivatives in real-life 90
Let’s start off with one of the most familiar applications of derivatives that we can find
around us.
“Every time you get in your car, you witness differentiation.”
— Page 178, Calculus For Dummies, 2016.
When we say that a car is moving at 100 kilometers an hour, we would have just stated
its rate of change. The common term that we often use is speed or velocity, although it would
be best that we first distinguish between the two.
In everyday life, we often use speed and velocity interchangeably to describe the rate of
change of a moving object. However, this is not mathematically correct, because speed is
always positive, whereas velocity introduces a notion of direction and, hence, can exhibit both
positive and negative values. Hence, in the ensuing explanation, we shall consider velocity as
the more technical concept, defined as:
$$\text{velocity} = \frac{\delta y}{\delta t}$$
This means that velocity gives the change in the car’s position, δy, within an interval of time,
δt. In other words, velocity is the first derivative of position with respect to time.
The car’s velocity can remain constant, such as if the car keeps on traveling at 100 kilometers
an hour consistently, or it can also change as a function of time. In case of the latter, this means
that the velocity function itself is changing as a function of time, or in simpler terms, the car
can be said to be accelerating. Acceleration is defined as the first derivative of velocity, v, and
the second derivative of position, y, with respect to time:
$$\text{acceleration} = \frac{\delta v}{\delta t} = \frac{\delta^2 y}{\delta t^2}$$
We can graph the position, velocity and acceleration curves to visualize them better. Suppose
that the car’s position, as a function of time, is given by
y(t) = t3 − 8t2 + 40t
Figure 13.1: Line plot of the car’s position (meters) against time (seconds)
The graph indicates that the car’s position changes slowly at the beginning of the journey,
slowing down slightly until around t = 2.7s, at which point its rate of change picks up and
continues increasing until the end of the journey. This is depicted by the graph of the car’s
velocity:
Figure 13.2: Line plot of the car’s velocity (meters/second) against time (seconds)
Notice that the car retains a positive velocity throughout the journey, and this is because
it never changes direction. Hence, if we had to imagine ourselves sitting in this moving car, the
speedometer would be showing us the values that we have just plotted on the velocity graph
(since the velocity remains positive throughout, otherwise we would have to find the absolute
value of the velocity to work out the speed). If we had to apply the power rule to y(t) to find
its derivative, then we would find that the velocity is defined by the following function:
v(t) = y ′ (t) = 3t2 − 16t + 40
We can also plot the acceleration graph:
Figure 13.3: Line plot of the car’s acceleration (meters/second squared) against time (seconds)
13.2 Applications of derivatives in optimization algorithms 92
We find that the graph is now characterized by negative acceleration in the time interval
t = [0, 2.7] seconds, approximately. This is because acceleration is the derivative of velocity, and within this
time interval the car’s velocity is decreasing. If we had to, again, apply the power rule to v(t) to
find its derivative, then we would find that the acceleration is defined by the following function:
a(t) = v ′ (t) = 6t − 16
Putting all functions together, we have the following:
y(t) = t3 − 8t2 + 40t
v(t) = 3t2 − 16t + 40
a(t) = 6t − 16
If we substitute for t = 10 seconds, we can use these three functions to find that by the end
of the journey, the car has traveled 600 m, its velocity is 180 m/s, and it is accelerating at 44
m/s². We can verify that all of these values tally with the graphs that we have just plotted.
We have framed this particular example within the context of finding a car’s velocity and
acceleration. But there is a plethora of real-life phenomena that change with time (or variables
other than time), which can be studied by applying the concept of derivatives as we have just
done for this particular example. To name a few:
⊲ Growth rate of a population (be it a collection of humans, or a colony of bacteria) over
time, which can be used to predict changes in population size in the near future.
⊲ Changes in temperature as a function of location, which can be used for weather
forecasting.
⊲ Fluctuations of the stock market over time, which can be used to predict future stock
market behavior.
Derivatives also provide salient information in solving optimization problems, as we shall be
seeing next.
from sympy import symbols, sin, diff, pprint

x = symbols('x')

f = -x * sin(x)
d1 = diff(f, x)
d2 = diff(f, x, x)
print("Function")
pprint(f)
print("has first derivative")
pprint(d1)
print("and second derivative")
pprint(d2)
Function
-x⋅sin(x)
has first derivative
-x⋅cos(x) - sin(x)
and second derivative
x⋅sin(x) - 2⋅cos(x)
We can plot these three functions for different values of x to visualize them:
Figure 13.4: Line plot of f(x), its first derivative f′(x), and its second derivative f′′(x)
Similar to what we have observed earlier for the car example, the graph of the first derivative
indicates how f (x) is changing and by how much. For example, a positive derivative indicates
that f (x) is an increasing function, whereas a negative derivative tells us that f (x) is now
decreasing. Hence, if in its search for a function minimum, the optimization algorithm performs
small changes to the input based on its learning rate, ε:

$$x_{\text{new}} = x - \varepsilon f'(x)$$

then the algorithm can reduce f(x) by moving in the opposite direction (by inverting the sign)
of the derivative.
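As a minimal sketch of this update rule, applied to the f(x) = −x sin(x) function from the listing above (the learning rate and starting point here are hand-picked for illustration):

```python
import math

def f(x):
    return -x * math.sin(x)

def df(x):
    # first derivative of -x*sin(x): -x*cos(x) - sin(x)
    return -x * math.cos(x) - math.sin(x)

epsilon = 0.1  # learning rate, hand-picked for this sketch
x = 1.0        # starting guess
for _ in range(100):
    x = x - epsilon * df(x)  # x_new = x - eps * f'(x)

print(round(x, 3))  # converges to the local minimum near x = 2.029
```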
13.3 Further reading 94
“We can think of the second derivative as measuring curvature.”
— Page 86, Deep Learning, 2016.
For example, if the algorithm arrives at a critical point at which the first derivative is zero, it
cannot distinguish between this point being a local maximum, a local minimum, a saddle point or
a flat region based on f ′ (x) alone. However, when the second derivative intervenes, the algorithm
can tell that the critical point in question is a local minimum if the second derivative is greater
than zero. For a local maximum, the second derivative is smaller than zero. Hence, the second
derivative can inform the optimization algorithm on which direction to move. Unfortunately,
this test remains inconclusive for saddle points and flat regions, for which the second derivative
is zero in both cases.
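The second-derivative test can be sketched in SymPy on a few illustrative functions (chosen here for demonstration, not taken from the text):

```python
from sympy import symbols, diff, solve

x = symbols('x')

# x**2 has a minimum, -x**2 a maximum; x**3 leaves the test inconclusive
for f in [x**2, -x**2, x**3]:
    d1 = diff(f, x)
    d2 = diff(f, x, x)
    for point in solve(d1, x):           # critical points: f'(x) = 0
        curvature = d2.subs(x, point)
        if curvature > 0:
            kind = "local minimum"
        elif curvature < 0:
            kind = "local maximum"
        else:
            kind = "inconclusive (saddle point or flat region)"
        print(f, "at x =", point, ":", kind)
```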
Optimization algorithms based on gradient descent do not make use of second order
derivatives and are, therefore, known as first-order optimization algorithms. Optimization
algorithms, such as Newton’s method, that exploit the use of second derivatives, are otherwise
called second-order optimization algorithms.
Books
Steven Strogatz. Infinite Powers. How Calculus Reveals the Secrets of the Universe. Mariner
Books, 2020.
https://www.amazon.com/dp/0358299284/
Mark Ryan. Calculus For Dummies. 2nd ed. Wiley, 2016.
https://www.amazon.com/dp/1119293499/
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
13.4 Summary
In this tutorial, you discovered different applications of derivatives. Specifically, you learned:
⊲ The use of derivatives can be applied to real-life problems that we find around us.
⊲ The use of derivatives is essential in machine learning, for function optimization.
In the next chapter, we will take a closer look at one application of differentiation, namely, the
slope of a curve.
Slopes and Tangents
14
The slope of a line, and its relationship to the tangent line of a curve is a fundamental concept
in calculus. It is important for a general understanding of function derivatives.
In this tutorial, you will discover what is the slope of a line and what is a tangent to a
curve. After completing this tutorial, you will know:
⊲ The slope of a line
⊲ The average rate of change of f (x) on an interval with respect to x
⊲ The slope of a curve
⊲ The tangent line to a curve at a point
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ The slope of a line and a curve
⊲ The tangent line to a curve
Figure 14.1: Slope of a line calculated from two points on the line, A(x₁, y₁) and B(x₂, y₂): rise y₂ − y₁ over run x₂ − x₁
A straight line can be uniquely defined by two points on the line. The slope of a line is the
same everywhere on the line; hence, any line can also be uniquely defined by the slope and one
point on the line. From the known point we can move to any other point on the line according
to the ratio defined by the slope of the line.
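The idea that a slope plus one known point pins down the whole line can be sketched in a few lines of Python (the point values here are hypothetical, for illustration only):

```python
# two known points on the line
x1, y1 = 1.0, 3.0
x2, y2 = 4.0, 9.0

slope = (y2 - y1) / (x2 - x1)   # rise over run = 6/3 = 2

# from the known point A, reach any other point via the slope
def y_on_line(x):
    return y1 + slope * (x - x1)

print(y_on_line(4.0))  # recovers y2 = 9.0
print(y_on_line(0.0))  # the y-intercept: 1.0
```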
Figure 14.2: Rate of change of a curve over an interval (left: secant lines through points on the curve) vs. at a point (right: moving B closer to A so that h → 0)
◦ The slope of f(x) = x² at x = 1 is 2.
⊲ The slope of f(x) = 2x + 1 is a constant value equal to 2. We can see that f(x) defines
a straight line.
⊲ The slope of f(x) = k (where k is a constant) is zero, as the function does not change
anywhere. Hence its average rate of change at any point is zero.
Figure 14.3: f(x) = 1/x, with the points (1, 1) and (−1, −1) marked
import numpy as np

def f(x):
    return 1/x

epsilon = np.finfo(np.float32).eps
for x in [1, -1]:
    slope = (f(x+epsilon) - f(x))/epsilon
    y = f(x)
    c = y - slope * x
    print("Slope at x={} is {}".format(x, slope))
    print("Tangent line is y={:f}x{:+f}".format(slope, c))
f (x) = x2
Shown below is the curve and the tangent lines at the points x = 2, x = −2, x = 0. At x = 0,
the tangent line is parallel to the x-axis as the slope of f (x) at x = 0 is zero. This is how we
compute the equation of the tangent line at x = 2:
Figure 14.4: f(x) = x², with tangent lines at the points (−2, 4), (0, 0), and (2, 4)
14.5 Examples of tangent lines 100
import numpy as np

def f(x):
    return x**2

epsilon = np.finfo(np.float32).eps
for x in [2, -2]:
    slope = (f(x+epsilon) - f(x))/epsilon
    y = f(x)
    c = y - slope * x
    print("Slope at x={} is {}".format(x, slope))
    print("Tangent line is y={:f}x{:+f}".format(slope, c))
f (x) = x3 + 2x + 1
This function is shown below, along with its tangent lines at x = 0, x = 2 and x = −2. Below
are the steps to derive an equation of the tangent line at x = 0.
Figure 14.5: f(x) = x³ + 2x + 1, with tangent lines at x = −2, x = 0, and x = 2
import numpy as np

def f(x):
    return x**3 + 2*x + 1

epsilon = np.finfo(np.float32).eps
for x in [2, 0, -2]:
    slope = (f(x+epsilon) - f(x))/epsilon
    y = f(x)
    c = y - slope * x
    print("Slope at x={} is {}".format(x, slope))
    print("Tangent line is y={:f}x{:+f}".format(slope, c))
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
14.7 Summary
In this tutorial, you discovered the concept of the slope of a curve at a point and the tangent
line to a curve at a point. Specifically, you learned:
⊲ What is the slope of a line
⊲ What is the average rate of change of a curve over an interval with respect to x
⊲ Slope of a curve at a point
⊲ Tangent to a curve at a point
In the next chapter, we will introduce integral calculus as the reverse of differential calculus.
Differential and Integral Calculus
15
Integral calculus was one of the greatest discoveries of Newton and Leibniz. Their work
independently led to the proof and recognition of the importance of the fundamental theorem
of calculus, which linked integrals to derivatives. With the discovery of integrals, areas and
volumes could thereafter be studied.
Integral calculus is the second half of the calculus journey that we will be exploring. In
this tutorial, you will discover the relationship between differential and integral calculus. After
completing this tutorial, you will know:
⊲ The concepts of differential and integral calculus are linked together by the fundamental
theorem of calculus.
⊲ By applying the fundamental theorem of calculus, we can compute the integral to find
the area under a curve.
⊲ In machine learning, the application of integral calculus can provide us with a metric
to assess the performance of a classifier.
Let’s get started.
Overview
This tutorial is divided into four parts; they are:
⊲ The link between differential and integral calculus
⊲ The fundamental theorem of calculus
⊲ Integration Example
⊲ Application of Integration in Machine Learning
it to different functions from first principles. We have even understood how to apply rules to
arrive at the derivative faster. But we are only halfway through the journey.
“From a twenty-first-century vantage point, calculus is often seen as the mathematics of change. It quantifies change using two big concepts: derivatives and integrals. Derivatives model rates of change … Integrals model the accumulation of change …”
— Page 141, Infinite Powers, 2020.
Recall having said that calculus comprises two phases: cutting and rebuilding.
The cutting phase breaks down a curved shape into infinitesimally small and straight pieces
that can be studied separately, such as by applying derivatives to model their rate of change,
or slope. This half of the calculus journey is called differential calculus, and we have already
looked into it in some detail.
The rebuilding phase gathers the infinitesimally small and straight pieces, and sums them
back together in an attempt to study the original whole. In this manner, we can determine the
area or volume of regular and irregular shapes after having cut them into infinitely thin slices.
This second half of the calculus journey is what we shall be exploring next. It is called integral
calculus.
The important theorem that links the two concepts together is called the fundamental
theorem of calculus.
Figure 15.1: Line plot of the car’s position and velocity against time
In computing the derivative we had solved the forward problem, where we found the velocity
from the slope of the position graph at any time, t. But what if we would like to solve the
backward problem, where we are given the velocity graph, v(t), and wish to find the distance
traveled? The solution to this problem is to calculate the area under the curve (the shaded
region) up to time, t:
15.2 The fundamental theorem of calculus 105
Figure 15.2: The shaded region is the area under the curve
We do not have a specific formula to define the area of the shaded region directly. But
we can apply the mathematics of calculus to cut the shaded region under the curve into many
infinitely thin rectangles, for which we have a formula:
Figure 15.3: Cutting the shaded region into many rectangles of width ∆t
If we consider the i-th rectangle, chosen arbitrarily to span the time interval ∆tᵢ, we can
define its area as its length times its width:

$$\text{area}_i = v(t_i)\,\Delta t_i$$
We can have as many rectangles as necessary in order to span the interval of interest, which
in this case is the shaded region under the curve. For simplicity, let’s denote this closed interval
by [a, b]. Finding the area of this shaded region (and, hence, the distance traveled) then reduces
to finding the sum of the n rectangles. We can express this sum even more compactly as a
Riemann sum, using sigma notation:

$$\sum_{i=1}^{n} v(t_i)\,\Delta t_i = v(t_1)\,\Delta t_1 + v(t_2)\,\Delta t_2 + \cdots + v(t_n)\,\Delta t_n$$
If we cut (or divide) the region under the curve by a finite number of rectangles, then we find
that the Riemann sum gives us an approximation of the area, since the rectangles will not fit
the area under the curve exactly. If we had to position the rectangles so that their upper left
or upper right corners touch the curve, the Riemann sum gives us either an underestimate or
an overestimate of the true area, respectively. If the midpoint of each rectangle had to touch
the curve, then the part of the rectangle protruding above the curve roughly compensates for
the gap between the curve and neighboring rectangles:
Figure 15.4: Approximating the area under the curve with left, right, and midpoint sums
The solution to finding the exact area under the curve, is to reduce the rectangles’ width so
much that they become infinitely thin (recall the infinity principle in calculus). In this manner,
the rectangles would be covering the entire region, and in summing their areas we would be
finding the definite integral.
“The definite integral (“simple” definition): The exact area under a curve between t = a and t = b is given by the definite integral, which is defined as the limit of a Riemann sum …”
— Page 227, Calculus For Dummies, 2016.
The definite integral can, then, be defined by the Riemann sum as the number of rectangles, n,
approaches infinity. Let’s also denote the area under the curve by A(t). Then:

$$A(t) = \int_a^b v(t)\,dt = \lim_{n\to\infty} \sum_{i=1}^{n} v(t_i)\,\Delta t_i$$
Note that the notation now changes into the integral symbol, ∫, replacing the sigma, Σ. The reason
behind this change is, merely, to indicate that we are summing over a huge number of thinly
sliced rectangles. The expression on the left-hand side reads as “the integral of v(t) from a to
b”, and the process of finding the integral is called integration.
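To make this limit concrete, here is a sketch (using NumPy) of left Riemann sums for the velocity curve v(t) = 3t² − 16t + 40 from the car example; as n grows, the sums approach the exact distance traveled, 600 m:

```python
import numpy as np

def v(t):
    return 3*t**2 - 16*t + 40   # the car's velocity from Chapter 13

a, b = 0.0, 10.0
for n in [10, 100, 10000]:
    dt = (b - a) / n
    t = np.arange(a, b, dt)     # left endpoints of n rectangles
    area = np.sum(v(t) * dt)    # left Riemann sum
    print(n, round(area, 2))    # the sums approach 600 as n grows
```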
By the second part of the fundamental theorem of calculus,

$$\int_a^b v(t)\,dt = F(b) - F(a)$$

Here, F(t) is any antiderivative of v(t), and the integral is defined as the subtraction of the
antiderivative evaluated at a and b.
Hence, the second part of the theorem computes the integral by subtracting the area under
the curve between some starting point, C, and the lower limit, a, from the area between the
same starting point, C, and the upper limit, b. This, effectively, calculates the area of interest
between a and b.
Since the constant, C, defines the point on the x-axis at which the sweep starts, the simplest
antiderivative to consider is the one with C = 0. Nonetheless, any antiderivative with any value
of C can be used, which simply sets the starting point to a different position on the x-axis.
The indefinite integral does not define the limits between which the area under the curve is
being calculated. The constant, C, is included to compensate for the lack of information about
the limits, or the starting point of the sweep.
If we do have knowledge of the limits, then we can simply apply the second fundamental
theorem of calculus to compute the definite integral:
$$\int_2^3 3x^2\,dx = 3^3 - 2^3 = 19$$
We can simply set C to zero, because it will not change the result in this case.
We can also find the integration using SymPy for the exact solution or apply numerical
integration using NumPy for an approximation:
15.3 Integration example 109
import numpy as np
from sympy import symbols, integrate, pprint

x = symbols('x')

f = 3 * x**2
result = integrate(f, x)
print("Antiderivative of")
pprint(f)
print("is")
pprint(result)
print()

result = integrate(f, (x, 2, 3))
print("Integration of")
pprint(f)
print("for x=2 to x=3 is", result)
print()

dx = 0.001
x = np.arange(2, 3, dx)
y = 3 * x**2
result = (y * dx).sum()
print("Numerically using left sum:", result)

x = np.arange(2, 3, dx) + dx
y = 3 * x**2
result = (y * dx).sum()
print("Numerically using right sum:", result)

x = np.arange(2, 3, dx) + dx/2
y = 3 * x**2
result = (y * dx).sum()
print("Numerically using midpoint sum:", result)
Antiderivative of
2
3⋅x
is
3
x
Integration of
2
3⋅x
for x=2 to x=3 is
19
“But you can use this adding-up-areas-of-rectangles scheme to add up tiny bits of anything — distance, volume, or energy, for example. In other words, the area under the curve doesn’t have to stand for an actual area.”
— Page 214, Calculus For Dummies, 2016.
One of the important steps of successfully applying machine learning techniques includes the
choice of appropriate performance metrics. In deep learning, for instance, it is common practice
to measure precision and recall.
“Precision is the fraction of detections reported by the model that were correct, while recall is the fraction of true events that were detected.”
— Page 423, Deep Learning, 2016.
It is also common practice to, then, plot the precision and recall on a Precision-Recall (PR)
curve, placing the recall on the x-axis and the precision on the y-axis. It would be desirable that
a classifier is characterized by both high recall and high precision, meaning that the classifier
can detect many of the true events correctly. Such a good classification performance would be
characterized by a higher area under the PR curve.
You can probably already tell where this is going. The area under the PR curve can, indeed,
be calculated by applying integral calculus, permitting us to characterize the performance of
the classifier.
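As a toy sketch of this (the precision and recall values below are hypothetical, for illustration only), the area under a PR curve can be approximated with the trapezoidal rule:

```python
import numpy as np

# hypothetical PR curve, sorted by increasing recall
recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.95, 0.9, 0.8, 0.6, 0.45])

# trapezoidal rule: integrate precision with respect to recall
auc = np.sum((precision[1:] + precision[:-1]) / 2 * np.diff(recall))
print(round(auc, 3))  # 0.795 for these made-up values
```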
Books
Steven Strogatz. Infinite Powers. How Calculus Reveals the Secrets of the Universe. Mariner
Books, 2020.
https://www.amazon.com/dp/0358299284/
Mark Ryan. Calculus For Dummies. 2nd ed. Wiley, 2016.
https://www.amazon.com/dp/1119293499/
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
David Guichard et al. Single and Multivariable Calculus: Early Transcendentals. 2020.
https://www.whitman.edu/mathematics/multivariable/multivariable.pdf
Michael Spivak. The Hitchhiker’s Guide to Calculus. American Mathematical Society, 2019.
https://www.amazon.com/dp/1470449625/
15.6 Summary
In this tutorial, you discovered the relationship between differential and integral calculus.
Specifically, you learned:
⊲ The concepts of differential and integral calculus are linked together by the fundamental
theorem of calculus.
⊲ By applying the fundamental theorem of calculus, we can compute the integral to find
the area under a curve.
⊲ In machine learning, the application of integral calculus can provide us with a metric
to assess the performance of a classifier.
In the previous chapters, we learned how to do differentiation on a function with single variable.
Starting from the next chapter, we will see how to do the same in a function with multiple
variables.
III
Multivariate Calculus
Introduction to Multivariate
Calculus
16
It is often desirable to study functions that depend on many variables.
Multivariate calculus provides us with the tools to do so by extending the concepts that
we find in calculus, such as the computation of the rate of change, to multiple variables. It
plays an essential role in the process of training a neural network, where the gradient is used
extensively to update the model parameters.
In this tutorial, you will discover a gentle introduction to multivariate calculus. After
completing this tutorial, you will know:
⊲ A multivariate function depends on several input variables to produce an output.
⊲ The gradient of a multivariate function is computed by finding the derivative of the
function in different directions.
⊲ Multivariate calculus is used extensively in neural networks to update the model
parameters.
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ Revisiting the Concept of a Function
⊲ Derivatives of Multivariate Functions
⊲ Application of Multivariate Calculus in Machine Learning
Such a function, which takes a single independent variable and defines a one-to-one mapping
between the input and output, is called a univariate function.
For example, let’s say that we are attempting to forecast the weather based on the
temperature alone. In this case, the weather is the dependent variable that we are trying
to forecast, which is a function of the temperature as the input variable. Such a problem can,
therefore, be easily framed into a univariate function.
However, let’s say that we now want to base our weather forecast on the humidity level and
the wind speed too, in addition to the temperature. We cannot do so by means of a univariate
function, where the output depends solely on a single input.
Hence, we turn our attention to multivariate functions, so called because these functions
can take several variables as input.
Formally, we can express a multivariate function as a mapping between several real input
variables, n, to a real output:

$$f: \mathbb{R}^n \mapsto \mathbb{R}$$
For example, consider the following parabolic surface:

$$f(x, y) = x^2 + 2y^2$$

This is a multivariate function that takes two variables, x and y, as input (hence n = 2) to
produce an output. We can visualize it by graphing its values for x and y between −1 and 1.
Figure 16.1: Three-dimensional plot of a parabolic surface
Similarly, we can have multivariate functions that take more variables as input. Visualizing
them, however, may be difficult due to the number of dimensions involved.
We can even generalize the concept of a function further by considering functions that map
multiple inputs, n, to multiple outputs, m:
$$f: \mathbb{R}^n \mapsto \mathbb{R}^m$$
“The generalization of the derivative to functions of several variables is the gradient.”
— Page 146, Mathematics for Machine Learning, 2020.
The technique to finding the gradient of a function of several variables involves varying each
one of the variables at a time, while keeping the others constant. In this manner, we would
be taking the partial derivative of our multivariate function with respect to each variable, each
time.
“The gradient is then the collection of these partial derivatives.”
— Page 146, Mathematics for Machine Learning, 2020.
In order to visualize this technique better, let’s start off by considering a simple univariate
quadratic function of the form:
g(x) = x2
Figure 16.2: Line plot of a univariate quadratic function
Finding the derivative of this function at some point, x, requires the application of the equation
for g ′ (x) that we have defined earlier. We can, alternatively, take a shortcut by using the power
rule to find that:
g ′ (x) = 2x
Furthermore, if we had to imagine slicing open the parabolic surface considered earlier, with
a plane passing through y = 0, we realize that the resulting cross-section of f (x, y) is the
quadratic curve, g(x) = x2 . Hence, we can calculate the derivative (or the steepness, or slope)
of the parabolic surface in the direction of x, by taking the derivative of f (x, y) but keeping y
constant. We refer to this as the partial derivative of f(x, y) with respect to x, and denote it
by ∂ to signify that there are more variables in addition to x, but these are not being considered
for the time being. Therefore, the partial derivative with respect to x of f(x, y) is:

$$\frac{\partial}{\partial x}(x^2 + 2y^2) = g'(x) = 2x$$
We can similarly hold x constant (or, in other words, find the cross-section of the parabolic
surface by slicing it with a plane passing through a constant value of x) to find the partial
derivative of f (x, y) with respect to y, as follows:
∂/∂y (x² + 2y²) = 4y
What we have essentially done is that we have found the univariate derivative of f (x, y) in each
of the x and y directions. Combining the two univariate derivatives as the final step, gives us
the multivariate derivative (or the gradient):
df/d(x, y) = [∂f(x, y)/∂x, ∂f(x, y)/∂y] = [2x, 4y]
The same technique remains valid for functions of higher dimensions.
We can also find the derivatives of multivariate functions in SymPy:
from sympy import symbols, diff, pprint

x, y = symbols('x y')
f = x**2 + 2 * y**2
dx = diff(f, x)
dy = diff(f, y)
print("Derivative of")
pprint(f)
print("with respect to x is")
pprint(dx)
print("and with respect to y is")
pprint(dy)
Derivative of
2 2
x + 2⋅y
with respect to x is
2⋅x
and with respect to y is
4⋅y
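As a quick sanity check, we can approximate the same partial derivatives numerically by varying one variable at a time while keeping the other constant, exactly as described above (a sketch; the step size h and the sample point are arbitrary choices):

```python
# Numerical cross-check of the symbolic partials of f(x, y) = x**2 + 2*y**2.
def f(x, y):
    return x**2 + 2 * y**2

h = 1e-6           # small step for the finite-difference approximation
x0, y0 = 3.0, 2.0  # an arbitrary sample point

# Vary one variable at a time, keeping the other constant
dfdx = (f(x0 + h, y0) - f(x0, y0)) / h   # approximates 2*x0 = 6
dfdy = (f(x0, y0 + h) - f(x0, y0)) / h   # approximates 4*y0 = 8

print(dfdx, dfdy)
```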
In training a neural network, we compute an error function and seek to follow its gradient downhill. If this error function were univariate, and hence a function of a single independent weight, then optimizing it would simply involve computing its univariate derivative.
However, a neural network comprises many weights (each attributed to a different neuron)
of which the error is a function. Hence, updating the weight values requires that the gradient
of the error curve is calculated with respect to all of these weights.
This is where the application of multivariate calculus comes into play. The gradient of the
error curve is calculated by finding the partial derivative of the error with respect to each weight;
or in other terms, finding the derivative of the error function by keeping all weights constant
except the one under consideration. This allows each weight to be updated independently of
the others, to reach the goal of finding an optimal set of weights.
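This weight-update idea can be sketched with plain gradient descent on our toy surface f(x, y) = x² + 2y², standing in for a real error function (the learning rate and starting point are arbitrary choices):

```python
# Gradient descent on the toy surface f(x, y) = x**2 + 2*y**2,
# whose partial derivatives are 2x and 4y. Each "weight" (here x
# and y) is updated independently using its own partial derivative.
x, y = 2.0, 1.0   # initial weights (arbitrary)
lr = 0.1          # learning rate (arbitrary)
for _ in range(100):
    gx, gy = 2 * x, 4 * y   # gradient components
    x -= lr * gx
    y -= lr * gy

print(x, y)  # both approach 0, the minimizer of f
```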
Books
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning.
Cambridge, 2020.
https://www.amazon.com/dp/110845514X
David Guichard et al. Single and Multivariable Calculus: Early Transcendentals. 2020.
https://www.whitman.edu/mathematics/multivariable/multivariable.pdf
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
John D. Kelleher. Deep Learning. Illustrated edition. The MIT Press Essential Knowledge series.
MIT Press, 2019.
https://www.amazon.com/dp/0262537559/
16.5 Summary
In this tutorial, you discovered a gentle introduction to multivariate calculus. Specifically, you
learned:
⊲ A multivariate function depends on several input variables to produce an output.
⊲ The gradient of a multivariate function is computed by finding the derivative of the
function in different directions.
⊲ Multivariate calculus is used extensively in neural networks to update the model
parameters.
In the next chapter, we will learn more about vector-valued functions.
Vector-Valued Functions
17
Vector-valued functions are often encountered in machine learning, computer graphics and
computer vision algorithms. They are particularly useful for defining the parametric equations
of space curves. It is important to gain a basic understanding of vector-valued functions to
grasp more complex concepts.
In this tutorial, you will discover what vector-valued functions are, how to define them and
some examples. After completing this tutorial, you will know:
⊲ Definition of vector-valued functions
⊲ Derivatives of vector-valued functions
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Definition and examples of vector-valued functions
⊲ Differentiating vector-valued functions
Given the unit vectors i, j, k parallel to the x-, y-, and z-axes respectively, we can write a three-dimensional vector-valued function as:

r(t) = x(t)i + y(t)j + z(t)k
A circle
Let's start with a simple example of a vector function in 2D space, defined by the component functions:

x(t) = cos(t)
y(t) = sin(t)

that is, r1(t) = cos(t)i + sin(t)j.
The space curve defined by the parametric equations is a circle in 2D space as shown in the
figure. If we vary t from −π to π, we’ll generate all the points that lie on the circle.
A helix
We can extend the circle function r1(t) above to easily generate a helix in 3D space. We just need to add a value along the z-axis that changes with t. Hence, we have the following function:
r2 (t) = cos(t)i + sin(t)j + tk
A twisted cubic
We can also define a curve called the twisted cubic with an interesting shape as:
r3(t) = ti + t²j + t³k
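To trace these space curves numerically, we can sample t and evaluate the component functions with NumPy (a sketch; plotting the resulting arrays with matplotlib would reproduce the figures):

```python
import numpy as np

t = np.linspace(-np.pi, np.pi, 200)  # sample values of the parameter t

# Circle: r1(t) = cos(t)i + sin(t)j
circle = np.stack([np.cos(t), np.sin(t)], axis=1)

# Helix: r2(t) = cos(t)i + sin(t)j + tk
helix = np.stack([np.cos(t), np.sin(t), t], axis=1)

# Twisted cubic: r3(t) = ti + t^2 j + t^3 k
cubic = np.stack([t, t**2, t**3], axis=1)

# Every sampled point of the circle lies at unit distance from the origin
print(np.allclose(np.linalg.norm(circle, axis=1), 1.0))  # True
```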
Figure 17.1: Space curves of the circle, the helix, and the twisted cubic
A circle
The parametric equation of a circle in 2D is given by r1(t) = cos(t)i + sin(t)j, with x(t) = cos(t) and y(t) = sin(t). Its derivative is therefore computed by taking the corresponding derivatives of x(t) and y(t), as shown below:

x′(t) = − sin(t)
y′(t) = cos(t)
A helix
Similar to the previous example, we can compute the derivative of r2(t) as:

r2′(t) = − sin(t)i + cos(t)j + k
A twisted cubic
The derivative of r3 (t) is given by:
r3(t) = ti + t²j + t³k
r3′(t) = i + 2tj + 3t²k
All the above examples are shown in the figure, where the derivatives are plotted in red.
Note the circle’s derivative also defines a circle in space.
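Since differentiation is applied componentwise, we can check these derivatives in SymPy by treating each component as an ordinary function of t (a sketch):

```python
from sympy import symbols, cos, sin, diff

t = symbols('t')

# Components of r2(t) = cos(t)i + sin(t)j + tk
r2 = [cos(t), sin(t), t]
r2_prime = [diff(c, t) for c in r2]
print(r2_prime)  # [-sin(t), cos(t), 1]

# Components of r3(t) = ti + t**2 j + t**3 k
r3 = [t, t**2, t**3]
r3_prime = [diff(c, t) for c in r3]
print(r3_prime)  # [1, 2*t, 3*t**2]
```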
r1′(t) = − sin(t)i + cos(t)j      r2′(t) = − sin(t)i + cos(t)j + k      r3′(t) = i + 2tj + 3t²k
Figure 17.2: Parametric functions and their derivatives
The cardioid:
r6 (t) = cos(t)(1 − cos(t))i + sin(t)(1 − cos(t))j
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
17.8 Summary
In this tutorial, you discovered what vector functions are and how to differentiate them.
Specifically, you learned:
⊲ Definition of vector functions
⊲ Parametric curves
⊲ Differentiating vector functions
In the next chapter, we will introduce the concept of partial derivatives that is closely related
to what we learned in this chapter.
Partial Derivatives and Gradient
Vectors
18
Partial derivatives and gradient vectors are used very often in machine learning algorithms for
finding the minimum or maximum of a function. Gradient vectors are used in the training of
neural networks, logistic regression, and many other classification and regression problems.
In this tutorial, you will discover partial derivatives and the gradient vector. After
completing this tutorial, you will know:
⊲ Function of several variables
⊲ Level sets, contours and graphs of a function of two variables
⊲ Partial derivatives of a function of several variables
⊲ Gradient vector and its meaning
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ Function of several variables
⊲ Definition of partial derivatives
⊲ Gradient vector
Consider the function f1(x, y) = x + y. Here, x and y are the independent variables, and their sum determines the value of the function. The domain of this function is the set of all points on the xy Cartesian plane. The plot of this function would require plotting in 3D space, with two axes for the input points (x, y) and the third representing the values of f. Here is another example of a function of two variables:

f2(x, y) = x × x + y × y
To keep things simple, we’ll do examples of functions of two variables. Of course, in machine
learning you’ll encounter functions of hundreds of variables. The concepts related to functions
of two variables can be extended to those cases.
A level set of f1 consists of all points (x, y) satisfying, for some constant c:

x + y = c

For f2, consider the level set given by:

x × x + y × y = 1
We can see that any point that lies on a circle of radius 1 with center at (0, 0) satisfies the above
expression. Hence, this level set consists of all points that lie on this circle. Similarly, any level
set of f2 satisfies the following expression (c is any real constant ≥ 0):
x×x+y×y =c
Hence, all level sets of f2 are circles with center at (0, 0), each level set having its own radius.
The graph of the function f(x, y) is the set of all points (x, y, f(x, y)). It is also called a surface, z = f(x, y). The graphs of f1 and f2 are shown on the left side of Figure 18.1.
Figure 18.1: The graphs (left) and contours (right) of f1 and f2
∂f1 /∂x represents the rate of change of f1 with respect to x. For any function f (x, y), ∂f /∂x
represents the rate of change of f with respect to variable x.
Similar is the case for ∂f /∂y. It represents the rate of change of f with respect to y. You
can look at the formal definition of partial derivatives in Chapter 16.
When we find the partial derivatives with respect to all independent variables, we end up
with a vector. This vector is called the gradient vector of f, denoted by ∇f(x, y). General expressions for the gradients of f1 and f2 are given by (here i, j are unit vectors parallel to the coordinate axes):

∇f1(x, y) = (∂f1/∂x)i + (∂f1/∂y)j = i + j
∇f2(x, y) = (∂f2/∂x)i + (∂f2/∂y)j = 2xi + 2yj
From the general expression of the gradient, we can evaluate the gradient at different points in
space. In case of f1 the gradient vector is a constant, i.e.,
i+j
No matter where we are in the three dimensional space, the direction and magnitude of the
gradient vector remains unchanged.
For the function f2 , ∇f2 (x, y) changes with values of (x, y). For example, at (1, 1) and
(2, 1) the gradient of f2 is given by the following vectors:
∇f2 (1, 1) = 2i + 2j
∇f2 (2, 1) = 4i + 2j
We can reproduce the same result by first finding the partial derivatives in SymPy and then
evaluating for its numerical value at the point (x, y):
from sympy import symbols, diff, pprint

x, y = symbols('x y')
f2 = x**2 + y**2
df2dx = diff(f2, x)
df2dy = diff(f2, y)
print("Partial derivative of")
pprint(f2)
print("with respect to x is")
pprint(df2dx)
print("and with respect to y is")
pprint(df2dy)
print("gradient at (1,1) is ({},{})".format(df2dx.subs([(x,1),(y,1)]),
                                           df2dy.subs([(x,1),(y,1)])))
print("gradient at (2,1) is ({},{})".format(df2dx.subs([(x,2),(y,1)]),
                                           df2dy.subs([(x,2),(y,1)])))
Partial derivative of
2 2
x + y
with respect to x is
2⋅x
and with respect to y is
2⋅y
gradient at (1,1) is (2,2)
gradient at (2,1) is (4,2)
Figure 18.2: The contours and the direction of gradient vectors: Gradient vectors at
various points shown with red arrows, tangent to the contour is in green
We can verify that the gradient vector we found before is indeed in the direction of the maximum
rate of change by exhaustive search. In the following code, we search from 0◦ (i.e., positive
direction along the x-axis) to 360° at 5° increments. The unit vector in the direction of angle θ is cos θ i + sin θ j, so for a small step, the corresponding sizes along the x- and y-axes can be found, and the derivative of f(x, y) in that direction approximated from first principles, as follows:

import numpy as np

def f(x, y):
    return x**2 + y**2  # the function f2 under study

x, y = 1, 1
step = 0.001
angles = np.arange(0, 360, 5)  # 0 to 360 degrees at 5-degree steps
maxdf, maxangle = -np.inf, 0
for angle in angles:
    rad = angle * np.pi / 180  # convert degrees to radians
    dx, dy = np.cos(rad)*step, np.sin(rad)*step
    df = (f(x+dx, y+dy) - f(x, y))/step
    if df > maxdf:
        maxdf, maxangle = df, angle
    print(f"Rate of change at {angle} degrees = {df}")
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
18.6 Summary
In this tutorial, you discovered functions of several variables, partial derivatives, and the gradient vector. Specifically, you learned:
⊲ Function of several variables
◦ Contours of a function of several variables
◦ Level sets of a function of several variables
⊲ Partial derivatives of a function of several variables
⊲ Gradient vector and its meaning
In the next chapter we will learn about the derivative of a derivative.
Higher-Order Derivatives
19
Higher-order derivatives can capture information about a function that first-order derivatives
on their own cannot capture.
First-order derivatives can capture important information, such as the rate of change, but
on their own they cannot distinguish between local minima or maxima, where the rate of change
is zero for both. Several optimization algorithms address this limitation by exploiting the use
of higher-order derivatives, such as in Newton’s method where the second-order derivatives are
used to reach the local minimum of an optimization function.
In this tutorial, you will discover how to compute higher-order univariate and multivariate
derivatives.
After completing this tutorial, you will know:
⊲ How to compute the higher-order derivatives of univariate functions.
⊲ How to compute the higher-order derivatives of multivariate functions.
⊲ How the second-order derivatives can be exploited in machine learning by second-order
optimization algorithms.
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ Higher-Order Derivatives of Univariate Functions
⊲ Higher-Order Derivatives of Multivariate Functions
⊲ Application in Machine Learning
A second-order derivative can, for example, measure the acceleration of a moving object, or it can help an optimization algorithm distinguish between a local maximum and a local minimum.
Computing higher-order (second, third or higher) derivatives of univariate functions is not that difficult.
“The second derivative of a function is just the derivative of its first derivative. The third derivative is the derivative of the second derivative, the fourth derivative is the derivative of the third, and so on.”
— Page 147, Calculus For Dummies, 2016.
Hence, computing higher-order derivatives simply involves differentiating the function
repeatedly. In order to do so, we can simply apply our knowledge of the power rule. Let’s
consider the function f(x) = x³ + 2x² − 4x + 1 as an example. Then:
⊲ First derivative: f′(x) = 3x² + 4x − 4
⊲ Second derivative: f″(x) = 6x + 4
⊲ Third derivative: f‴(x) = 6
⊲ Fourth derivative: f⁽⁴⁾(x) = 0
⊲ Fifth derivative: f⁽⁵⁾(x) = 0, etc.
What we have done here is that we have first applied the power rule to f (x) to obtain
its first derivative, f ′ (x), then applied the power rule to the first derivative in order to obtain
the second, and so on. The derivative will, eventually, go to zero as differentiation is applied
repeatedly.
In SymPy, the higher-order derivatives can be found using the same diff() function:
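For example, using the optional order argument of diff() (a sketch; calling diff() repeatedly would work equally well):

```python
from sympy import symbols, diff, pprint

x = symbols('x')
f = x**3 + 2*x**2 - 4*x + 1

print("Function")
pprint(f)
# The third argument of diff() is the order of the derivative
for n, name in enumerate(["First", "Second", "Third", "Fourth", "Fifth"], 1):
    print(f"{name} derivative")
    pprint(diff(f, x, n))
```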
Function
3 2
x + 2⋅x - 4⋅x + 1
First derivative
2
3⋅x + 4⋅x - 4
Second derivative
2⋅(3⋅x + 2)
Third derivative
6
Fourth derivative
0
Fifth derivative
0
The application of the product and quotient rules also remains valid in obtaining higher-order
derivatives, but their computation can become messier and messier as the order increases. The
general Leibniz rule simplifies the task in this aspect, by generalizing the product rule to:
(fg)^(n) = Σ_{k=0}^{n} C(n, k) f^(n−k) g^(k),  where C(n, k) = n!/(k!(n−k)!)

Here, the term n!/(k!(n−k)!) is the binomial coefficient from the binomial theorem, while f^(k) and g^(k) denote the k-th derivative of the functions f and g, respectively.
Therefore, finding the first and second derivatives (and, hence, substituting for n = 1 and n = 2, respectively) by the general Leibniz rule gives us:

(fg)′ = f′g + fg′
(fg)″ = f″g + 2f′g′ + fg″

Notice the familiar first derivative as defined by the product rule. The Leibniz rule can also be used to find higher-order derivatives of rational functions, since a quotient can be effectively expressed as a product of the form fg⁻¹.
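We can spot-check the general Leibniz rule in SymPy by comparing the direct n-th derivative of a product against the summation (a sketch with the arbitrary choices f = x², g = sin x and n = 3):

```python
from sympy import symbols, sin, diff, binomial, simplify

x = symbols('x')
f, g = x**2, sin(x)   # arbitrary smooth functions for the check
n = 3                 # order of the derivative

# Left-hand side: differentiate the product n times directly
lhs = diff(f * g, x, n)

# Right-hand side: the general Leibniz summation
rhs = sum(binomial(n, k) * diff(f, x, n - k) * diff(g, x, k)
          for k in range(n + 1))

print(simplify(lhs - rhs))  # 0, so the two sides agree
```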
“To take a ‘derivative,’ we must take a partial derivative with respect to x or y, and there are four ways to do it: x then x, x then y, y then x, y then y.”
— Page 371, Single and Multivariable Calculus, 2020.
Let's consider the multivariate function f(x, y) = x² + 3xy + 4y², for which we would like to find the second partial derivatives. The process starts with finding its first-order partial derivatives:

∂f/∂x = fx = 2x + 3y
∂f/∂y = fy = 3x + 8y
The four, second-order partial derivatives are then found by repeating the process of finding
the partial derivatives, of the partial derivatives. The own partial derivatives are the most
straightforward to find, since we simply repeat the partial differentiation process, with respect
to either x or y, a second time:
∂²f/∂x² = ∂/∂x (2x + 3y) = fxx = 2
∂²f/∂y² = ∂/∂y (3x + 8y) = fyy = 8
The cross partial derivative fxy is found by taking the partial derivative of the previously found fx (that is, the partial derivative with respect to x) with respect to y. Similarly, taking the partial derivative of fy with respect to x gives us fyx:

∂²f/∂x∂y = ∂/∂y (2x + 3y) = fxy = 3
∂²f/∂y∂x = ∂/∂x (3x + 8y) = fyx = 3
It is not by accident that the cross partial derivatives give the same result. This is defined by
Clairaut’s theorem, which states that as long as the cross partial derivatives are continuous,
then they are equal. The above can be verified using SymPy as follows:
pprint(fxx)
print(”f_yy =”)
pprint(fyy)
print(”f_xy =”)
pprint(fxy)
print(”f_yx =”)
pprint(fyx)
Function
2 2
x + 3⋅x⋅y + 4⋅y
f_x =
2⋅x + 3⋅y
f_y =
3⋅x + 8⋅y
f_xx =
2
f_yy =
8
f_xy =
3
f_yx =
3
“Second-order information, on the other hand, allows us to make a quadratic approximation of the objective function and approximate the right step size to reach a local minimum …”
— Page 87, Algorithms for Optimization, 2019.
In the univariate case, Newton’s method uses a second-order Taylor series expansion to
perform the quadratic approximation around some point on the objective function. The update
rule for Newton’s method, which is obtained by setting the derivative to zero and solving for
the root, involves a division operation by the second derivative. If Newton’s method is extended
to multivariate optimization, the derivative is replaced by the gradient, while the reciprocal of
the second derivative is replaced with the inverse of the Hessian matrix.
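A univariate sketch of this update rule, applied to the hypothetical objective f(x) = x⁴ − 3x² (not from the text), where the second derivative sets the step size:

```python
# Newton's method in one dimension: x_{k+1} = x_k - f'(x_k) / f''(x_k).
# The objective f(x) = x**4 - 3*x**2 is a hypothetical example.
def fprime(x):
    return 4 * x**3 - 6 * x    # first derivative of f

def fsecond(x):
    return 12 * x**2 - 6       # second derivative of f

x = 1.0                        # starting point near the positive minimum
for _ in range(10):
    x = x - fprime(x) / fsecond(x)

print(x)  # converges to sqrt(3/2) ~ 1.2247, the local minimizer
```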
We shall be covering the Hessian and Taylor Series approximations, which leverage the use
of higher-order derivatives, in separate tutorials.
Books
Mark Ryan. Calculus For Dummies. 2nd ed. Wiley, 2016.
https://www.amazon.com/dp/1119293499/
David Guichard et al. Single and Multivariable Calculus: Early Transcendentals. 2020.
https://www.whitman.edu/mathematics/multivariable/multivariable.pdf
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
19.5 Summary
In this tutorial, you discovered how to compute higher-order univariate and multivariate
derivatives. Specifically, you learned:
⊲ How to compute the higher-order derivatives of univariate functions.
⊲ How to compute the higher-order derivatives of multivariate functions.
⊲ How the second-order derivatives can be exploited in machine learning by second-order
optimization algorithms.
In the next chapter, we will study the chain rule. With it, we can find the derivatives of a function
with respect to implicit variables.
The Chain Rule
20
The chain rule allows us to find the derivative of composite functions.
It is computed extensively by the backpropagation algorithm, in order to train feedforward
neural networks. By applying the chain rule in an efficient manner while following a specific
order of operations, the backpropagation algorithm calculates the error gradient of the loss
function with respect to each weight of the network.
In this tutorial, you will discover the chain rule of calculus for univariate and multivariate
functions.
After completing this tutorial, you will know:
⊲ A composite function is the combination of two (or more) functions.
⊲ The chain rule allows us to find the derivative of a composite function.
⊲ The chain rule can be generalized to multivariate functions, and represented by a tree
diagram.
⊲ The chain rule is applied extensively by the backpropagation algorithm in order to
calculate the error gradient of the loss function with respect to each weight.
Let’s get started.
Overview
This tutorial is divided into four parts; they are:
⊲ Composite Functions
⊲ The Chain Rule
⊲ The Generalized Chain Rule
⊲ Application in Machine Learning
20.1 Prerequisites
For this tutorial, we assume that you already know derivatives of univariate and multivariate functions, as covered in the previous chapters.

20.2 Composite functions
“A composite function is the combination of two functions.”
— Page 49, Calculus For Dummies, 2016.
Consider two functions of a single independent variable, f (x) = 2x − 1 and g(x) = x3 . Their
composite function can be defined as follows:
h = g(f (x))
In this operation, g is a function of f . This means that g is applied to the result of applying
the function, f , to x, producing h.
Let’s consider a concrete example using the functions specified above to understand this
better. Suppose that f (x) and g(x) are two systems in cascade, receiving an input x = 5:
Since f (x) is the first system in the cascade (because it is the inner function in the composite),
its output is worked out first:
f (5) = (2 × 5) − 1 = 9
This result is then passed on as input to g(x), the second system in the cascade (because it is
the outer function in the composite) to produce the net result of the composite function:
g(9) = 93 = 729
We could have, alternatively, computed the net result in one go, if we had performed the following computation:

h = g(f(x)) = (2x − 1)³ = (2 × 5 − 1)³ = 729
The composition of functions can also be considered as a chaining process, to use a more familiar
term, where the output of one function feeds into the next one in the chain.
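This cascade can be written directly in code (a sketch of the two systems above):

```python
def f(x):
    return 2 * x - 1   # inner function

def g(x):
    return x ** 3      # outer function

def h(x):
    return g(f(x))     # composite: apply f first, then g

print(f(5))  # 9
print(h(5))  # 729
```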
“With composite functions, the order matters.”
— Page 49, Calculus For Dummies, 2016.
Keep in mind that the composition of functions is a non-commutative process, which means
that swapping the order of f (x) and g(x) in the cascade (or chain) does not produce the same
results. Hence:
g(f(x)) ≠ f(g(x))
The composition of functions can also be extended to the multivariate case:
h = g(r, s, t) = g(r(x, y), s(x, y), t(x, y)) = g(f (x, y))
Here, f (x, y) is a vector-valued function of two independent variables (or inputs), x and y. It is
made up of three components (for this particular example) that are r(x, y), s(x, y), t(x, y), and
which are also known as the component functions of f . This means that f (x, y) will map two
inputs to three outputs, and will then feed these three outputs into the consecutive system in
the chain, g(r, s, t), to produce h.
For the composite function h = g(f(x)) = (2x − 1)³ considered earlier, the power rule gives the derivative of the outer function (ignoring the inside) as 3(2x − 1)², while the derivative of the inner function is (2x − 1)′ = 2. The derivative of the composite function as defined by the chain rule is, then, the following:

h′(x) = 3(2x − 1)² × 2 = 6(2x − 1)²
We have, hereby, considered a simple example, but the concept of applying the chain rule
to more complicated functions remains the same.
20.4 The generalized chain rule
Recall that we employ the use of partial derivatives when we are finding the gradient of a function
of multiple variables.
We can also visualize the workings of the chain rule by a tree diagram. Suppose that we have a composite function of two independent variables, x1 and x2, defined as follows:

h = g(u1(x1, x2), u2(x1, x2))

Here, u1 and u2 act as the intermediate variables. Its tree diagram would be represented as follows:
Figure 20.2: Representing the chain rule by a tree diagram
In order to derive the formula for each of the inputs, x1 and x2, we can start from the left hand side of the tree diagram and follow its branches rightwards. In this manner, we find that we form the following two formulae (the branches being summed up are color coded in the figure for simplicity):

∂h/∂x1 = (∂h/∂u1)(∂u1/∂x1) + (∂h/∂u2)(∂u2/∂x1)
∂h/∂x2 = (∂h/∂u1)(∂u1/∂x2) + (∂h/∂u2)(∂u2/∂x2)
Notice how the chain rule relates the net output, h, to each of the inputs, xi , through the
intermediate variables, uj . This is a concept that the backpropagation algorithm applies
extensively to optimize the weights of a neural network.
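We can confirm the two formulae in SymPy with a hypothetical choice of intermediate functions, say u1 = x1·x2 and u2 = x1 + x2 with h = u1² + u2³ (any differentiable choices would do):

```python
from sympy import symbols, diff, simplify

x1, x2, u1, u2 = symbols('x1 x2 u1 u2')

h = u1**2 + u2**3          # h in terms of the intermediate variables
U1, U2 = x1 * x2, x1 + x2  # intermediate variables in terms of x1, x2

# Chain rule: dh/dx1 = (dh/du1)(du1/dx1) + (dh/du2)(du2/dx1)
chain = diff(h, u1) * diff(U1, x1) + diff(h, u2) * diff(U2, x1)
chain = chain.subs([(u1, U1), (u2, U2)])

# Direct differentiation of the fully substituted composite
direct = diff(h.subs([(u1, U1), (u2, U2)]), x1)

print(simplify(chain - direct))  # 0, so the chain rule agrees
```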
Example 1
Let’s raise the bar a little by considering the following composite function:
h = g(f(x)) = √(x² − 10)
The inner function, u = x² − 10, will act as the intermediate variable, and its value will be fed into the input of the outer function.
The first step is to find the derivative of the outer part of the composite function, while
ignoring whatever is inside. For this purpose, we can apply the power rule:
dh/du = (1/2)u^(−1/2) = (1/2)(x² − 10)^(−1/2)
The next step is to find the derivative of the inner part of the composite function, this time
ignoring whatever is outside. We can apply the power rule here too:
du/dx = 2x
Putting the two parts together and simplifying, we have:
dh/dx = (dh/du)(du/dx) = (1/2)(x² − 10)^(−1/2) (2x) = x / √(x² − 10)
We can verify this result with SymPy:
from sympy import symbols, sqrt, diff, pprint

x = symbols('x')
f = x**2 - 10
g = sqrt(f)
result = diff(g, x)
print("Function")
pprint(g)
print("has derivative with respect to x")
pprint(result)
Function
_________
╱ 2
╲╱ x - 10
has derivative with respect to x
x
────────────
_________
╱ 2
╲╱ x - 10
Example 2
Let’s repeat the procedure, this time with a different composite function:
h = cos(x³ − 1)
We will again use, u, the output of the inner function, as our intermediate variable. The outer
function in this case is cos(u). Finding its derivative, again ignoring the inside, gives us:

dh/du = (cos(u))′ = − sin(u) = − sin(x³ − 1)

The inner function is x³ − 1. Hence, its derivative becomes:

du/dx = (x³ − 1)′ = 3x²
Putting the two parts together, we obtain the derivative of the composite function:
dh/dx = (dh/du)(du/dx) = −3x² sin(x³ − 1)
We can verify this result with SymPy:
from sympy import symbols, cos, diff, pprint

x = symbols('x')
u = x**3 - 1
h = cos(u)
result = diff(h, x)
print("Function")
pprint(h)
print("has derivative")
pprint(result)
Function
⎛ 3 ⎞
cos⎝x - 1⎠
has derivative
2 ⎛ 3 ⎞
-3⋅x ⋅sin⎝x - 1⎠
Example 3
Let’s now raise the bar a little further by considering a more challenging composite function:
h = cos(x√(x² − 10))
If we observe this closely, we realize that not only do we have nested functions for which we will
need to apply the chain rule multiple times, but we also have a product to which we will need
to apply the product rule.
We find that the outermost function is a cosine. In finding its derivative by the chain rule,
we shall be using the intermediate variable, u:
dh/du = (cos(u))′ = − sin(u) = − sin(x√(x² − 10))
√
Inside the cosine, we have the product, x x2 − 10, to which we will be applying the product
rule to find its derivative (notice that we are always moving from the outside to the inside, in
order to discover the operation that needs to be tackled next):
du/dx = (x√(x² − 10))′ = √(x² − 10) + x(√(x² − 10))′
One of the components in the resulting term is (√(x² − 10))′, to which we shall be applying the chain rule again. Indeed, we have already done so above, and hence we can simply re-utilize the result:

(√(x² − 10))′ = x(x² − 10)^(−1/2)
Putting all the parts together, we obtain the derivative of the composite function:
dh/dx = (dh/du)(du/dx) = − sin(x√(x² − 10)) · (√(x² − 10) + x²/√(x² − 10))
which simplifies to:

dh/dx = − sin(x√(x² − 10)) · (2x² − 10)/√(x² − 10)
We can verify this result with SymPy:
from sympy import symbols, sqrt, cos, diff, simplify, pprint

x = symbols('x')
u = x * sqrt(x**2 - 10)
h = cos(u)
result = diff(h, x)
print("Function")
pprint(h)
print("has derivative")
pprint(simplify(result))

Program 20.3: Find the derivative of h = cos(x√(x² − 10))
Function
⎛ _________⎞
⎜ ╱ 2 ⎟
cos⎝x⋅╲╱ x - 10 ⎠
has derivative
⎛ _________⎞
⎛ 2⎞ ⎜ ╱ 2 ⎟
2⋅⎝5 - x ⎠⋅sin⎝x⋅╲╱ x - 10 ⎠
──────────────────────────────
_________
╱ 2
╲╱ x - 10
Output 20.3: Derivative of h = cos(x√(x² − 10))
Example 4

Consider now a multivariate composite function, h = s² + t³, where s = xy and t = 2x − y are both functions of the two independent variables x and y. We can find its partial derivatives with SymPy:

from sympy import symbols, diff, pprint

x, y = symbols('x y')
s = x*y
t = 2*x - y
h = s**2 + t**3
dhdx = diff(h, x)
dhdy = diff(h, y)
print("Function")
pprint(h)
print("Derivative with respect to x")
pprint(dhdx)
print("Derivative with respect to y")
pprint(dhdy)
Function
2 2 3
x ⋅y + (2⋅x - y)
Derivative with respect to x
2 2
2⋅x⋅y + 6⋅(2⋅x - y)
Derivative with respect to y
2 2
2⋅x ⋅y - 3⋅(2⋅x - y)
Example 5
Let’s repeat this again, this time with a multivariate function of three independent variables,
r, s and t, with each of these variables being dependent on another two independent variables,
x and y:
h = g(r, s, t) = r² − rs + t³

where the functions are r = x cos y, s = xe^y, and t = x + y.
This time round, r, s and t will act as our intermediate variables. The formulae that we
will be working with, defined with respect to each input, are the following:
∂h/∂x = (∂h/∂r)(∂r/∂x) + (∂h/∂s)(∂s/∂x) + (∂h/∂t)(∂t/∂x)
∂h/∂y = (∂h/∂r)(∂r/∂y) + (∂h/∂s)(∂s/∂y) + (∂h/∂t)(∂t/∂y)
From these formulae, we can see that we will now need to find nine different partial derivatives:
∂h/∂r = 2r − s        ∂r/∂x = cos y        ∂h/∂s = −r
∂s/∂x = e^y           ∂h/∂t = 3t²          ∂t/∂x = 1
∂r/∂y = −x sin y      ∂s/∂y = xe^y         ∂t/∂y = 1
Again, we proceed to substitute these terms in the formulae for ∂h/∂x and ∂h/∂y:

∂h/∂x = (∂h/∂r)(∂r/∂x) + (∂h/∂s)(∂s/∂x) + (∂h/∂t)(∂t/∂x) = (2r − s) cos y − re^y + 3t²
∂h/∂y = (∂h/∂r)(∂r/∂y) + (∂h/∂s)(∂s/∂y) + (∂h/∂t)(∂t/∂y) = (2r − s)(−x sin y) − r(xe^y) + 3t²
And subsequently substitute for r, s and t to find the derivatives:
∂h
= (2x cos y − xey ) cos y − (x cos y)ey + 3(x + y)2
∂x
∂h
= (2x cos y − xey )(−x sin y) − (x cos y)(xey ) + 3(x + y)2
∂y
These may be simplified a little further (hint: apply the trigonometric identity 2 sin y cos y = sin 2y to ∂h/∂y):

∂h/∂x = 2x cos y (cos y − e^y) + 3(x + y)²
∂h/∂y = −x²(sin 2y − e^y sin y + e^y cos y) + 3(x + y)²
We can verify this result with SymPy:
from sympy import symbols, cos, exp, diff, pprint

x, y = symbols('x y')
r = x*cos(y)
s = x*exp(y)
t = x + y
h = r**2 - r*s + t**3
dhdx = diff(h, x)
dhdy = diff(h, y)
print("Function")
pprint(h)
print("Derivative with respect to x")
pprint(dhdx)
print("Derivative with respect to y")
pprint(dhdy)
Function
2 y 2 2 3
- x ⋅e ⋅cos(y) + x ⋅cos (y) + (x + y)
Derivative with respect to x
y 2 2
- 2⋅x⋅e ⋅cos(y) + 2⋅x⋅cos (y) + 3⋅(x + y)
Derivative with respect to y
2 y 2 y 2 2
x ⋅e ⋅sin(y) - x ⋅e ⋅cos(y) - 2⋅x ⋅sin(y)⋅cos(y) + 3⋅(x + y)
No matter how complex the expression is, the procedure to follow remains similar:
“Your last computation tells you the first thing to do.”
— Page 143, Calculus For Dummies, 2016.
Hence, start by tackling the outer function first, then move inwards to the next one. You may
need to apply other rules along the way, as we have seen for Example 3. Do not forget to take
the partial derivatives if you are working with multivariate functions.
We can apply the chain rule to a neural network through the use of the backpropagation algorithm, in a very similar manner to how we have applied it to the tree diagram above.
“An area where the chain rule is used to an extreme is deep learning, where the function value y is computed as a many-level function composition.”
— Page 159, Mathematics for Machine Learning, 2020.
A neural network can, indeed, be represented by a massive nested composite function. For
example:
y = fK (fK−1 (. . . (f1 (x)) . . .))
Here, x are the inputs to the neural network (for example, the images) whereas y are the outputs
(for example, the class labels). Every function, fi for i = 1, . . . , K, is characterized by its own
weights.
Applying the chain rule to such a composite function allows us to work backwards through
all of the hidden layers making up the neural network, and efficiently calculate the error gradient
of the loss function with respect to each weight, wi, of the network until we arrive at the input.
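This backward pass can be sketched numerically for a toy two-function composition. The functions, weights, and squared loss below are illustrative assumptions (not an example from the text), and the chain-rule gradient is checked against a finite-difference estimate:

```python
import numpy as np

# toy composition y = f2(f1(x)): f1(x) = w1*x, f2(z) = tanh(w2*z)
x, w1, w2 = 0.8, 0.5, -1.2

def forward(w1):
    z = w1 * x             # hidden value
    y = np.tanh(w2 * z)    # network output
    return z, y

z, y = forward(w1)

# chain rule for the loss L = y^2: dL/dw1 = dL/dy * dy/dz * dz/dw1
dL_dy = 2 * y
dy_dz = w2 * (1 - np.tanh(w2 * z)**2)
dz_dw1 = x
grad = dL_dy * dy_dz * dz_dw1

# check against a central finite-difference estimate
eps = 1e-6
num = (forward(w1 + eps)[1]**2 - forward(w1 - eps)[1]**2) / (2 * eps)
assert abs(grad - num) < 1e-6
```

Each factor in `grad` corresponds to one edge of the composition, worked from the loss back toward the weight.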
Books
Mark Ryan. Calculus For Dummies. 2nd ed. Wiley, 2016.
https://www.amazon.com/dp/1119293499/
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning.
Cambridge, 2020.
https://www.amazon.com/dp/110845514X
David Guichard et al. Single and Multivariable Calculus: Early Transcendentals. 2020.
https://www.whitman.edu/mathematics/multivariable/multivariable.pdf
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
20.9 Summary
In this tutorial, you discovered the chain rule of calculus for univariate and multivariate functions.
Specifically, you learned:
⊲ A composite function is the combination of two (or more) functions.
⊲ The chain rule allows us to find the derivative of a composite function.
⊲ The chain rule can be generalized to multivariate functions, and represented by a tree
diagram.
Overview
This tutorial is divided into three parts; they are:
⊲ Partial Derivatives in Machine Learning
⊲ The Jacobian Matrix
⊲ Other Uses of the Jacobian
Consider a function that maps n real inputs to m real outputs:

f : Rⁿ → Rᵐ
For example, consider a neural network that classifies grayscale images into several classes. The
function being implemented by such a classifier would map the n pixel values of each single-
channel input image, to m output probabilities of belonging to each of the different classes.
In training a neural network, the backpropagation algorithm is responsible for sharing back
the error calculated at the output layer, among the neurons comprising the different hidden
layers of the neural network, until it reaches the input.
“The fundamental principle of the backpropagation algorithm in adjusting the weights
in a network is that each weight in a network should be updated in proportion to
the sensitivity of the overall error of the network to changes in that weight.”
— Page 222, Deep Learning, 2016.
This sensitivity of the overall error of the network to changes in any one particular weight
is measured in terms of the rate of change, which, in turn, is calculated by taking the partial
derivative of the error with respect to the same weight.
For simplicity, suppose that one of the hidden layers of some particular network consists
of just a single neuron, k. We can represent this in terms of a simple computational graph:
input ──wk──> (k) ──zk──> output
Again, for simplicity, let’s suppose that a weight, wk , is applied to an input of this neuron
to produce an output, zk , according to the function that this neuron implements (including the
nonlinearity). Then, the weight of this neuron can be connected to the error at the output of
the network as follows (the following formula is formally known as the chain rule of calculus,
see Chapter 20):
d(error)/dwk = d(error)/dzk · dzk/dwk

Here, the derivative dzk/dwk first connects the weight, wk, to the output, zk, while the
derivative d(error)/dzk subsequently connects the output, zk, to the network error.
It is more often the case that we would have many connected neurons populating the network,
each attributed a different weight. Since we are more interested in such a scenario, we can
generalize beyond the scalar case to consider multiple inputs and multiple outputs:
d(error)/dwk^(i) = d(error)/dzk^(1) · dzk^(1)/dwk^(i) + d(error)/dzk^(2) · dzk^(2)/dwk^(i) + d(error)/dzk^(3) · dzk^(3)/dwk^(i) + · · ·
This sum of terms can be represented more compactly as follows:
d(error)/dwk^(i) = Σ_j d(error)/dzk^(j) · dzk^(j)/dwk^(i)
The above equation holds for each i. If we list them out, we can make the left-hand side a
vector spanning each i. Similarly, the first fraction after the summation sign on the right can
also be represented as a vector spanning each j. The second fraction after the summation sign,
however, can be represented as a matrix in which each row is for a different i and each column is
for a different j. We can use the vector notation and introduce the del operator, ∇, to represent
the gradient of the error with respect to the weights wk or the outputs zk. Then, if vectors are
represented as columns, we can rewrite the above in the form of a matrix multiplication:
∇wk(error) = (∂zk/∂wk)ᵀ ∇zk(error)
“The back-propagation algorithm consists of performing such a Jacobian-gradient
product for each operation in the graph.”
— Page 207, Deep Learning, 2016.
This means that the backpropagation algorithm can relate the sensitivity of the network error
to changes in the weights, through a multiplication by the Jacobian matrix, (∂zk/∂wk)ᵀ.
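The Jacobian-gradient product can be sketched with NumPy; the shapes below (3 outputs z, 4 weights w) and the random values are arbitrary illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((3, 4))   # Jacobian dz/dw: row j, column i holds dz_j/dw_i
grad_z = rng.standard_normal(3)   # gradient of the error w.r.t. the outputs z
grad_w = J.T @ grad_z             # gradient of the error w.r.t. the weights w

# element i of grad_w is the sum over j of d(error)/dz_j * dz_j/dw_i
assert grad_w.shape == (4,)
assert np.isclose(grad_w[2], sum(grad_z[j] * J[j, 2] for j in range(3)))
```

The transpose turns the v × u Jacobian into a u × v matrix, so the product maps a gradient over outputs back to a gradient over weights.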
Hence, what does this Jacobian matrix contain?
Consider first a function that maps u real inputs to a single real output:

f : Rᵘ → R

Then, for an input vector, x, of length u, the Jacobian of size 1 × u can be defined as follows:

J = df(x)/dx = [∂f(x)/∂x1 · · · ∂f(x)/∂xu]
Now, consider another function that maps u real inputs, to v real outputs:
f : Rᵘ → Rᵛ

Then, for the same input vector, x, of length u, the Jacobian is now a v × u matrix, J ∈ Rᵛˣᵘ,
that is defined as follows:

               ⎡ ∂f1(x)/∂x1  · · ·  ∂f1(x)/∂xu ⎤
J = df(x)/dx = ⎢      ⋮        ⋱         ⋮     ⎥
               ⎣ ∂fv(x)/∂x1  · · ·  ∂fv(x)/∂xu ⎦
Reframing the Jacobian matrix into the machine learning problem considered earlier, while
retaining the same number of u real inputs and v real outputs, we find that this matrix would
hold the partial derivative of each of the v outputs with respect to each of the u inputs.
from sympy import symbols, exp, Matrix, pprint

x, y, p, q, r, s, t, u = symbols('x y p q r s t u')

def sigmoid(x):
    return 1/(1+exp(-x))

# Vector-valued function
f = Matrix([sigmoid(p*x+q*y), sigmoid(r*x+s*y), sigmoid(t*x+u*y)])
variables = Matrix([x, y])
# Find and print the Jacobian
pprint(f.jacobian(variables))
which is:
    ⎡ pe^(−(px+qy))/(1 + e^(−(px+qy)))²   qe^(−(px+qy))/(1 + e^(−(px+qy)))² ⎤
J = ⎢ re^(−(rx+sy))/(1 + e^(−(rx+sy)))²   se^(−(rx+sy))/(1 + e^(−(rx+sy)))² ⎥
    ⎣ te^(−(tx+uy))/(1 + e^(−(tx+uy)))²   ue^(−(tx+uy))/(1 + e^(−(tx+uy)))² ⎦
“In the single variable case, there’s typically just one reason to want to change the
variable: to make the function “nicer” so that we can find an antiderivative. In the
two variable case, there is a second potential reason: the two-dimensional region
over which we need to integrate is somehow unpleasant, and we want the region in
terms of u and v to be nicer—to be a rectangle, for example.”
— Page 412, Single and Multivariable Calculus, 2020.
When performing a substitution between two (or possibly more) variables, the process
starts with a definition of the variables between which the substitution is to occur. For example,
x = f (u, v) and y = g(u, v). This is then followed by a conversion of the integral limits depending
on how the functions, f and g, will transform the u-v plane into the x-y plane. Finally, the
absolute value of the Jacobian determinant is computed and included, to act as a scaling factor
between one coordinate space and another.
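The scaling factor for the familiar polar-coordinate substitution can be verified with SymPy; a minimal sketch:

```python
from sympy import Matrix, cos, simplify, sin, symbols

# substitution to polar coordinates: x = r*cos(theta), y = r*sin(theta)
r, theta = symbols('r theta', positive=True)
F = Matrix([r*cos(theta), r*sin(theta)])

J = F.jacobian(Matrix([r, theta]))
# the absolute Jacobian determinant is the scaling factor r,
# which is why dx dy becomes r dr dtheta under this substitution
assert simplify(J.det()) == r
```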
Books
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
David Guichard et al. Single and Multivariable Calculus: Early Transcendentals. 2020.
https://www.whitman.edu/mathematics/multivariable/multivariable.pdf
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning.
Cambridge, 2020.
https://www.amazon.com/dp/110845514X
John D. Kelleher. Deep Learning. Illustrated edition. The MIT Press Essential Knowledge series.
MIT Press, 2019.
https://www.amazon.com/dp/0262537559/
Articles
Jacobian matrix and determinant. Wikipedia.
https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
Integration by substitution. Wikipedia.
https://en.wikipedia.org/wiki/Integration_by_substitution
21.5 Summary
In this tutorial, you discovered a gentle introduction to the Jacobian. Specifically, you learned:
⊲ The Jacobian matrix collects all first-order partial derivatives of a multivariate function
that can be used for backpropagation.
⊲ The Jacobian determinant is useful in changing between variables, where it acts as a
scaling factor between one coordinate space and another.
In the next chapter, we will see another matrix notation in calculus that is very similar to
the Jacobian.
Hessian Matrices
22
Hessian matrices belong to a class of mathematical structures that involve second order
derivatives. They are often used in machine learning and data science algorithms for optimizing
a function of interest.
In this tutorial, you will discover Hessian matrices, their corresponding discriminants, and
their significance. All concepts are illustrated via an example.
After completing this tutorial, you will know:
⊲ Hessian matrices
⊲ Discriminants computed via Hessian matrices
⊲ What information is contained in the discriminant
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ Definition of a function’s Hessian matrix and the corresponding discriminant
⊲ Example of computing the Hessian matrix, and the discriminant
⊲ What the Hessian and discriminant tell us about the function of interest
22.1 Prerequisites
For this tutorial, we assume that you already know:
⊲ Derivative of functions (Chapter 7)
⊲ Function of several variables, partial derivatives and gradient vectors (Chapter 18)
⊲ Higher order derivatives (Chapter 19)
22.2 What is a Hessian matrix?
For a function of n variables, f : Rⁿ → R, the Hessian is the n × n matrix of second order partial derivatives:

     ⎡ ∂²f/∂x1²     ∂²f/∂x1∂x2   · · ·  ∂²f/∂x1∂xn ⎤
Hf = ⎢ ∂²f/∂x2∂x1   ∂²f/∂x2²     · · ·  ∂²f/∂x2∂xn ⎥
     ⎢      ⋮            ⋮         ⋱         ⋮     ⎥
     ⎣ ∂²f/∂xn∂x1   ∂²f/∂xn∂x2   · · ·  ∂²f/∂xn²   ⎦

For a two-variable function f(x, y), that is f : R² → R:

          ⎡ ∂²f/∂x²    ∂²f/∂x∂y ⎤   ⎡ fxx  fxy ⎤
Hf(x,y) = ⎢                     ⎥ = ⎢          ⎥
          ⎣ ∂²f/∂x∂y   ∂²f/∂y²  ⎦   ⎣ fxy  fyy ⎦

Figure 22.1: Hessian of a function of n variables (left). Hessian of f(x, y) (right)
We already know from our tutorial on gradient vectors that the gradient is a vector of first
order partial derivatives. The Hessian is, similarly, a matrix of second order partial derivatives
formed from all pairs of variables in the domain of f.
The discriminant of f is the determinant of its Hessian:

         | fxx  fxy |
det Hf = |          | = fxx fyy − fxy²
         | fxy  fyy |

The definition of the determinant for any square matrix in general can be found in many linear
algebra books.
22.4 Examples of Hessian matrices and discriminants

Consider the function:

g(x, y) = x³ + 2y² + 3xy²

Its discriminant is:

     | 6x   6y     |
Dg = |             | = 6x(4 + 6x) − 36y² = 36x² + 24x − 36y²
     | 6y   4 + 6x |
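Since the code listing for this example is not reproduced here, the following is a sketch (an assumption) of how the Hessian and discriminant could be computed with SymPy's `hessian` helper, producing output of the form shown next:

```python
from sympy import hessian, pprint, symbols

x, y = symbols('x y')
g = x**3 + 2*y**2 + 3*x*y**2

H = hessian(g, (x, y))    # matrix of second order partial derivatives
D = H.det().expand()      # discriminant = determinant of the Hessian
print("Function"); pprint(g)
print("Hessian"); pprint(H)
print("Discriminant"); pprint(D)
for px, py in [(0, 0), (1, 0), (0, 1), (-1, 0)]:
    print(f"Discriminant at ({px},{py}) =", D.subs({x: px, y: py}))
```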
Function
x³ + 3⋅x⋅y² + 2⋅y²
Hessian
⎡6⋅x    6⋅y  ⎤
⎣6⋅y  6⋅x + 4⎦
Discriminant
36⋅x² + 24⋅x - 36⋅y²
Discriminant at (0,0) = 0
Discriminant at (1,0) = 60
Discriminant at (0,1) = -36
Discriminant at (-1,0) = 12
Example: g(x, y)
For the function g(x, y):
1. We cannot draw any conclusions for the point (0, 0)
2. gxx(1, 0) = 6 > 0 and Dg(1, 0) = 60 > 0, hence (1, 0) is a local minimum
3. The point (0, 1) is a saddle point as Dg(0, 1) < 0
4. gxx(−1, 0) = −6 < 0 and Dg(−1, 0) = 12 > 0, hence (−1, 0) is a local maximum
The figure below shows a graph of the function g(x, y) and its corresponding contours.
22.7 Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
⊲ Optimization
⊲ Eigenvalues of the Hessian matrix
⊲ Inverse of Hessian matrix and neural network training
Further reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
22.8 Summary
In this tutorial, you discovered what Hessian matrices are. Specifically, you learned:
⊲ Hessian matrix
⊲ Discriminant of a function
While we list out the entire matrix in these two chapters, in the next chapter, we will learn
about the Laplacian operator that can make our notation more concise.
The Laplacian
23
The Laplace operator was first applied to the study of celestial mechanics, or the motion of
objects in outer space, by Pierre-Simon de Laplace, and as such has been named after him.
The Laplace operator has since been used to describe many different phenomena, from
electric potentials, to the diffusion equation for heat and fluid flow, and quantum mechanics. It
has also been recast into discrete space, where it has been used in applications related to
image processing and spectral clustering.
In this tutorial, you will discover a gentle introduction to the Laplacian.
After completing this tutorial, you will know:
⊲ The definition of the Laplace operator and how it relates to divergence.
⊲ How the Laplace operator relates to the Hessian.
⊲ How the continuous Laplace operator has been recast into discrete space, and applied
to image processing and spectral clustering.
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ The Laplacian
◦ The Concept of Divergence
◦ The Continuous Laplacian
⊲ The Discrete Laplacian
23.1 Prerequisites
For this tutorial, we assume that you already know what are:
⊲ The gradient of a function (Chapter 18)
⊲ Higher-order derivatives (Chapter 19)
23.2 The Laplacian
“Roughly speaking, divergence measures the tendency of the fluid to collect or disperse
at a point …”
— Page 432, Single and Multivariable Calculus, 2020.
Figure 23.1: Part of the vector field of (cos x, sin y)
Using the nabla (or del) operator, ∇, the divergence is denoted by ∇· and produces a scalar
value when applied to a vector field, measuring the quantity of fluid at each point. In Cartesian
coordinates, the divergence of a vector field, F = hf, g, hi, is given by:
∇ · F = ⟨∂/∂x, ∂/∂y, ∂/∂z⟩ · ⟨f, g, h⟩ = ∂f/∂x + ∂g/∂y + ∂h/∂z
Although the divergence computation involves the application of the divergence operator
(rather than a multiplication operation), the dot in its notation is reminiscent of the dot product,
which involves the multiplication of the components of two equal-length sequences (in this case,
∇ and F) and the summation of the resulting terms.
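As a small check of this definition, the divergence of the two-dimensional field (cos x, sin y) plotted in Figure 23.1 can be computed term by term with SymPy; a minimal sketch:

```python
from sympy import cos, diff, sin, symbols

x, y = symbols('x y')
# two-dimensional vector field F = (cos x, sin y), as in Figure 23.1
f, g = cos(x), sin(y)

# divergence: sum of each component differentiated by its own variable
divergence = diff(f, x) + diff(g, y)
assert divergence == cos(y) - sin(x)
```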
Then, for a two-variable function, f(x, y), the Laplacian (that is, the divergence of the
gradient) of f can be defined by the sum of unmixed second partial derivatives:

∇ · ∇f = ∇²f = ∂²f/∂x² + ∂²f/∂y²
It can, equivalently, be considered as the trace (tr) of the function’s Hessian, H(f). The trace
is the sum of the elements on the main diagonal of a square n × n matrix, which in this
case is the Hessian, and is also the sum of its eigenvalues. As you may notice from Chapter 22,
and in particular Figure 22.1, the Hessian matrix contains the unmixed second partial
derivatives on its diagonal:

∇²f = tr(H(f))
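This identity is easy to verify symbolically; a minimal SymPy sketch, using an arbitrary example function (an assumption, not from the text):

```python
from sympy import diff, hessian, simplify, symbols

x, y = symbols('x y')
f = x**2 * y + y**3   # arbitrary two-variable example function

# Laplacian as the sum of unmixed second partial derivatives
laplacian = diff(f, x, 2) + diff(f, y, 2)
# trace of the Hessian sums exactly those diagonal entries
assert simplify(laplacian - hessian(f, (x, y)).trace()) == 0
```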
An important property of the trace of a matrix is its invariance to a change of basis. We have
already defined the Laplacian in Cartesian coordinates. In polar coordinates, we would define
it as follows:
∇²f = ∂²f/∂r² + (1/r) ∂f/∂r + (1/r²) ∂²f/∂θ²
The invariance of the trace to a change of basis means that the Laplacian can be defined in
different coordinate spaces, but it would give the same value at some point (x, y) in the Cartesian
coordinate space, and at the same point (r, θ) in the polar coordinate space.
Recall that we had also mentioned that the second derivative can provide us with information
regarding the curvature of a function. Hence, intuitively, we can consider the Laplacian to also
provide us with information regarding the local curvature of a function, through this summation
of second derivatives.
The continuous Laplace operator has been used to describe many physical phenomena,
such as electric potentials, and the diffusion equation for heat flow.
In this case, the discrete Laplacian operator (or filter) is constructed by combining two
one-dimensional second derivative filters into a single two-dimensional one, commonly written
as the 3 × 3 kernel:

⎡ 0   1  0 ⎤
⎢ 1  −4  1 ⎥
⎣ 0   1  0 ⎦
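A sketch of applying such a discrete Laplacian filter with NumPy, assuming the common 3 × 3 kernel with −4 at the center; the test image f(x, y) = x² + y² is chosen because its Laplacian is exactly 4 everywhere, and the discrete operator is exact for quadratics:

```python
import numpy as np

# 3x3 discrete Laplacian kernel: the 1-D second derivative filter
# [1, -2, 1] applied along x plus the same filter applied along y
K = np.array([[0, 1, 0],
              [1, -4, 1],
              [0, 1, 0]], dtype=float)

# small test "image": f(x, y) = x^2 + y^2
ys, xs = np.mgrid[0:6, 0:6]
img = (xs**2 + ys**2).astype(float)

# 'valid' 2-D convolution (K is symmetric, so correlation == convolution)
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(K * img[i:i+3, j:j+3])

assert np.allclose(out, 4.0)   # exact for this quadratic image
```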
In machine learning, the information provided by the discrete Laplace operator as derived from
a graph can be used for the purpose of data clustering.
Consider a graph G = (V, E), having a finite set of vertices, V, and edges, E (i.e., an
abstract structure of edges connecting vertices). Its Laplacian matrix, L, can be defined in
terms of the degree matrix, D, containing information about the connectivity of each vertex,
and the adjacency matrix, A, which indicates pairs of vertices that are adjacent in the graph:
L=D−A
Spectral clustering can be carried out by applying some standard clustering method (such
as k-means) on the eigenvectors of the Laplacian matrix, hence partitioning the graph nodes
(or the data points) into subsets.
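These steps can be sketched with NumPy on a toy graph; the graph below (two triangles joined by a single edge) is an illustrative assumption:

```python
import numpy as np

# adjacency matrix of a toy undirected graph: two triangles {0,1,2}
# and {3,4,5} joined by the single edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
assert np.isclose(eigvals[0], 0.0)   # smallest eigenvalue is always 0

# the eigenvector of the second-smallest eigenvalue (the Fiedler
# vector) already separates the two triangles by sign
fiedler = eigvecs[:, 1]
assert np.sign(fiedler[0]) != np.sign(fiedler[3])
```

Running k-means on a few of these eigenvectors (rather than eyeballing signs) is what spectral clustering does in practice.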
One issue that can arise in doing so relates to a problem of scalability with large datasets,
where the eigen-decomposition (or the extraction of the eigenvectors) of the Laplacian matrix
may be prohibitive. The use of deep learning has been proposed¹ to address this problem, where
a deep neural network is trained such that its outputs approximate the eigenvectors of the
graph Laplacian. The neural network, in this case, is trained using a constrained optimization
approach, to enforce the orthogonality of the network outputs.
Books
David Guichard et al. Single and Multivariable Calculus: Early Transcendentals. 2020.
https://www.whitman.edu/mathematics/multivariable/multivariable.pdf
Al Bovik, ed. Handbook of Image and Video Processing. 2nd ed. Academic Press, 2005.
https://www.amazon.com/dp/0121197921
Articles
Laplace operator. Wikipedia.
https://en.wikipedia.org/wiki/Laplace_operator
Divergence. Wikipedia.
https://en.wikipedia.org/wiki/Divergence
Discrete Laplace operator. Wikipedia.
https://en.wikipedia.org/wiki/Discrete_Laplace_operator
Laplacian matrix. Wikipedia.
https://en.wikipedia.org/wiki/Laplacian_matrix
¹ https://arxiv.org/pdf/1801.01587.pdf
Papers
Uri Shaham et al. “SpectralNet: Spectral Clustering Using Deep Neural Networks”. In: Proc.
ICLR. 2018.
https://arxiv.org/pdf/1801.01587.pdf
23.5 Summary
In this tutorial, you discovered a gentle introduction to the Laplacian. Specifically, you learned:
⊲ The definition of the Laplace operator and how it relates to divergence.
⊲ How the Laplace operator relates to the Hessian.
⊲ How the continuous Laplace operator has been recast into discrete space, and applied
to image processing and spectral clustering.
This concludes our study in multivariate calculus. The next chapter will start our journey in
exploring one particular use of calculus in optimization.
IV
Mathematical Programming
Introduction to Optimization and
Mathematical Programming
24
Whether it is a supervised learning problem or an unsupervised problem, there will be some
optimization algorithm working in the background. Almost any classification, regression or
clustering problem can be cast as an optimization problem.
In this tutorial, you will discover what optimization is and the concepts related to it. After
completing this tutorial, you will know:
⊲ What mathematical programming or optimization is
⊲ Difference between maximization and minimization problems
⊲ Difference between local and global optimal solutions
⊲ Difference between constrained and unconstrained optimization
⊲ Difference between linear and nonlinear programming
⊲ Examples of optimization
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Various introductory topics related to optimization
◦ Constrained vs. unconstrained optimization
◦ Equality vs. inequality constraints
◦ Feasible region
⊲ Examples of optimization in machine learning
Figure 24.1: Local and global maximum points
Feasible region
All the points in space where the constraints on the problem hold true comprise the feasible
region. An optimization algorithm searches for optimal points in the feasible region. The feasible
region for the two types of constraints is shown in the figure of the next section.
For an unconstrained optimization problem, the entire domain of the function is a feasible
region.
[Figure: feasible regions for equality and inequality constraints, shown as contour and
surface plots of f(x, y) = 2x − 3y + 4 (left) and f(x, y) = x² + y (right)]
24.7 Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
24.8 Further reading
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
24.9 Summary
In this tutorial, you discovered what a mathematical programming or optimization problem is.
Specifically, you learned:
⊲ Maximization vs. minimization
⊲ Constrained vs. unconstrained optimization
⊲ Why optimization is important in machine learning
In the next chapter, we will see how an optimization problem can be solved.
The Method of Lagrange
Multipliers
25
The method of Lagrange multipliers is a simple and elegant method of finding the local minima
or local maxima of a function subject to equality or inequality constraints. Lagrange multipliers
are also called undetermined multipliers. In this tutorial we’ll talk about this method when
given equality constraints.
In this tutorial, you will discover the method of Lagrange multipliers and how to find the
local minimum or maximum of a function when equality constraints are present.
After completing this tutorial, you will know:
⊲ How to find points of local maximum or minimum of a function with equality constraints
⊲ Method of Lagrange multipliers with equality constraints
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Method of Lagrange multipliers with equality constraints
⊲ Two solved examples
25.1 Prerequisites
For this tutorial, we assume that you already know what are:
⊲ Derivative of functions (Chapter 7)
⊲ Function of several variables, partial derivatives and gradient vectors (Chapter 18)
⊲ Introduction to optimization (Chapter 24)
⊲ Gradient descent (Chapter 29)
25.2 The method of Lagrange multipliers with equality constraints
minimize f (x)
subject to g1 (x) = 0
g2 (x) = 0
..
.
gn (x) = 0
The method of Lagrange multipliers first constructs a function called the Lagrange function,
given by the following expression:

L(x, λ) = f(x) + λ1 g1(x) + λ2 g2(x) + · · · + λn gn(x)

where λ = [λ1, λ2, . . . , λn]ᵀ is the vector of Lagrange multipliers.

To find the points of local minimum of f(x) subject to the equality constraints, we find the
stationary points of the Lagrange function L(x, λ), i.e., we solve the following equations:
∇x L = 0
∂L/∂λi = 0   for i = 1, · · · , n
Hence, we get a total of m + n equations to solve, where
⊲ m = number of variables in domain of f
⊲ n = number of equality constraints.
In short, the points of local minimum would be the solution of the following equations:
∂L/∂xj = 0   for j = 1, · · · , m
gi(x) = 0    for i = 1, · · · , n
minimize f(x, y) = x² + y²
subject to x + 2y − 1 = 0

The Lagrange function is:

L(x, y, λ) = x² + y² + λ(x + 2y − 1)
[Figure: graph of f(x, y) with the constraint x + 2y − 1 = 0; the contours show the
local minimum at (1/5, 2/5)]
Figure 25.1: Graph of function (left). Contours, constraint and local minima (right)
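The stationary points of the Lagrange function above can also be found symbolically; a minimal SymPy sketch that solves all three partial derivative equations at once:

```python
from sympy import Rational, diff, solve, symbols

x, y, lam = symbols('x y lambda')
L = x**2 + y**2 + lam*(x + 2*y - 1)

# stationary points: all first partial derivatives vanish
sol = solve([diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)[0]
print(sol)

assert sol[x] == Rational(1, 5) and sol[y] == Rational(2, 5)
```

This recovers the point (1/5, 2/5), matching the numerical solution found with SciPy below.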
This constrained optimization problem can also be solved numerically using SciPy, as follows:
import numpy as np
from scipy.optimize import minimize
def objective(x):
return x[0]**2 + x[1]**2
def constraint(x):
return x[0]+2*x[1]-1
# initial guesses
x0 = np.array([3,3])
# optimize
bounds = ((-10,10), (-10,10))
constraints = [{"type":"eq", "fun":constraint}]
solution = minimize(objective, x0, method='SLSQP',
bounds=bounds, constraints=constraints)
x = solution.x
# show solution
print('Objective:', objective(x))
print('Solution:', x)
SciPy has several algorithms for constrained optimization. The above uses SLSQP (Sequential
Least-Squares Programming). The objective function can be defined with a single argument
of a vector of arbitrary length. Each constraint is similarly defined as a function, and only
those arguments for which the constraint function returns zero are considered feasible. The SLSQP
algorithm requires a range for each element of the vector argument to search over, as well as an
initial “guessed” solution to start. The above code will produce the following solution:
Objective: 0.19999999999999998
Solution: [0.19999999 0.4 ]
The numerical solution matches the one found by the method of Lagrange multipliers. However,
not all problems can be solved using the SLSQP algorithm (e.g., when the problem is not in the
form of quadratic programming). The method of Lagrange multipliers, however, can be applied
to a wider range of problems.
minimize g(x, y) = x2 + 4y 2
subject to x+y =0
x2 + y 2 − 1 = 0
The solution of this problem can be found by first constructing the Lagrange function:
L(x, y, λ1 , λ2 ) = x2 + 4y 2 + λ1 (x + y) + λ2 (x2 + y 2 − 1)
We have 4 equations to solve:
∂L/∂x = 2x + λ1 + 2xλ2 = 0
∂L/∂y = 8y + λ1 + 2yλ2 = 0
∂L/∂λ1 = x + y = 0
∂L/∂λ2 = x² + y² − 1 = 0
Solving the above system of equations gives us two solutions for (x, y), i.e. we get the two points:
(1/√2, −1/√2) and (−1/√2, 1/√2)
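These stationary points can also be obtained symbolically by solving the same four equations with SymPy; a minimal sketch:

```python
from sympy import diff, simplify, solve, symbols

x, y, l1, l2 = symbols('x y lambda1 lambda2', real=True)
L = x**2 + 4*y**2 + l1*(x + y) + l2*(x**2 + y**2 - 1)

# solve all four stationarity equations simultaneously
sols = solve([diff(L, v) for v in (x, y, l1, l2)], [x, y, l1, l2], dict=True)
assert len(sols) == 2
for s in sols:
    # each stationary point satisfies both constraints
    assert simplify(s[x] + s[y]) == 0
    assert simplify(s[x]**2 + s[y]**2 - 1) == 0
```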
This problem can also be solved using SciPy, but the algorithm will produce only one of the
solutions:
import numpy as np
from scipy.optimize import minimize
def objective(x):
return x[0]**2 + 4*x[1]**2
def constraint1(x):
return x[0]+x[1]
def constraint2(x):
return x[0]**2 + x[1]**2 - 1
# initial guesses
x0 = np.array([3,3])
# optimize
bounds = ((-10,10), (-10,10))
constraints = [{"type":"eq", "fun":constraint1}, {"type":"eq", "fun":constraint2}]
solution = minimize(objective, x0, method='SLSQP',
bounds=bounds, constraints=constraints)
x = solution.x
# show solution
print('Objective:', objective(x))
print('Solution:', x)
Objective: 2.5000000000173994
Solution: [-0.70710678 0.70710678]
The function along with its constraints and local minimum are shown below.
[Figure: graph of g(x, y) with the constraints x + y = 0 and x² + y² = 1; the contours
show the two local minima at (1/√2, −1/√2) and (−1/√2, 1/√2)]
Figure 25.2: Graph of function (left). Contours, constraint and local minima (right)
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
25.7 Summary
In this tutorial, you discovered the method of Lagrange multipliers. Specifically, you
learned:
⊲ Lagrange multipliers and the Lagrange function
⊲ How to solve an optimization problem when equality constraints are given
In the next chapter, we will see how the same method can be applied to the case of having
inequalities as constraints.
Lagrange Multipliers with
Inequality Constraints
26
In the previous chapter, we introduced the method of Lagrange multipliers to find local minima
or local maxima of a function with equality constraints. The same method can be applied to
those with inequality constraints as well.
In this tutorial, you will discover the method of Lagrange multipliers applied to find the
local minimum or maximum of a function when inequality constraints are present, optionally
together with equality constraints.
After completing this tutorial, you will know:
⊲ How to find points of local maximum or minimum of a function with inequality constraints
⊲ Method of Lagrange multipliers with equality and inequality constraints
Let’s get started.
Overview
This tutorial is divided into four parts; they are:
⊲ Constrained optimization and Lagrangians
⊲ The complementary slackness condition
⊲ Example 1: Mean-variance portfolio optimization
⊲ Example 2: Water-filling algorithm
26.1 Prerequisites
For this tutorial, we assume that you already have reviewed:
⊲ Derivative of functions (Chapter 7)
⊲ Function of several variables, partial derivatives and gradient vectors (Chapter 18)
⊲ Introduction to optimization (Chapter 24)
⊲ The method of Lagrange multipliers (Chapter 25)
26.2 Constrained optimization and Lagrangians
min f (X)
subject to g(X) = 0
h(X) ≥ 0
k(X) ≤ 0
where X is a scalar or a vector value. Here, g(X) = 0 is the equality constraint, and h(X) ≥ 0,
k(X) ≤ 0 are inequality constraints. Note that we always use ≥ and ≤ rather than > and < in
optimization problems because the former define a closed set in mathematics from which we
should look for the value of X. There can be many constraints of each type in an optimization
problem.
The equality constraints are easy to handle but the inequality constraints are not. Therefore,
one way to make it easier to tackle is to convert the inequalities into equalities, by introducing
slack variables:
min f (X)
subject to g(X) = 0
h(X) − s2 = 0
k(X) + t2 = 0
When something is negative, adding a suitable positive quantity to it will make it equal to
zero, and vice versa. That quantity is the slack variable; the s² and t² above are examples. We
deliberately write these terms as squares to denote that they must not be negative.
With the slack variables introduced, we can use the Lagrange multipliers approach to solve
it, in which the Lagrange function or Lagrangian is defined as:

L(X, λ, θ, φ) = f(X) − λ g(X) − θ(h(X) − s²) − φ(k(X) + t²)
It is useful to know that, at the optimal solution X∗ to the problem, each inequality constraint
either holds with equality (its slack variable is zero) or it does not. Inequality constraints
whose equality holds are called active constraints; the others are inactive constraints. In this
sense, you can consider that the equality constraints are always active.
From this point onward, the complementary slackness condition has to be considered. We have
two slack variables, s and t, and the corresponding Lagrange multipliers are θ and φ. We now
have to consider whether each slack variable is zero (in which case the corresponding inequality
constraint is active) or the Lagrange multiplier is zero (the constraint is inactive). There are
four possible cases:
1. θ = φ = 0 and s² > 0, t² > 0
2. θ ≠ 0 but φ = 0, and s² = 0, t² > 0
3. θ = 0 but φ ≠ 0, and s² > 0, t² = 0
4. θ ≠ 0 and φ ≠ 0, and s² = t² = 0
For case 1, using ∂L/∂λ = 0, ∂L/∂w1 = 0 and ∂L/∂w2 = 0 we get

w2 = 1 − w1
0.5w1 + 0.3w2 = λ
0.3w1 + 0.2w2 = λ

from which we get w1 = −1, w2 = 2, λ = 0.1. But with ∂L/∂θ = 0, we get s² = −1, which has
no solution (s² cannot be negative). Thus this case is infeasible.
For case 2, with ∂L/∂θ = 0 we get w1 = 0. Hence from ∂L/∂λ = 0, we know w2 = 1.
And with ∂L/∂w2 = 0, we find λ = 0.2 and from ∂L/∂w1 we get θ = 0.1. In this case, the
objective function is 0.1.
For case 3, with ∂L/∂φ = 0 we get w1 = 1. Hence from ∂L/∂λ = 0, we know w2 = 0. And
with ∂L/∂w2 = 0, we get λ = 0.3 and from ∂L/∂w1 we get φ = 0.2. In this case, the objective
function is 0.25.
For case 4, we get w1 = 0 from ∂L/∂θ = 0 but w1 = 1 from ∂L/∂φ = 0. Hence this case
is infeasible.
Comparing the objective function from case 2 and case 3, we see that the value from case
2 is lower. Hence that is taken as our solution to the optimization problem, with the optimal
solution attained at w1 = 0, w2 = 1.
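The feasible cases above can be checked quickly in code by evaluating the objective at the two candidate solutions (a small sketch; the function f below restates the objective of this example):

```python
def f(w1, w2):
    # the objective function of this example
    return 0.25*w1**2 + 0.1*w2**2 + 0.3*w1*w2

# the two feasible candidates found by the case analysis
candidates = {"case 2": (0, 1), "case 3": (1, 0)}
values = {name: f(*w) for name, w in candidates.items()}
best = min(values, key=values.get)
print(best, values[best])   # case 2 attains the lower objective, 0.1
```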
This problem can also be solved by SciPy using the SLSQP method:
import numpy as np
from scipy.optimize import minimize

def objective(x):
    return 0.25*x[0]**2 + 0.1*x[1]**2 + 0.3*x[0]*x[1]

def constraint1(x):
    # Equality constraint: the result is required to be zero
    return x[0] + x[1] - 1

def constraint2(x):
    # Inequality constraint: the result is required to be non-negative
    return x[0]

def constraint3(x):
    # Inequality constraint: the result is required to be non-negative
    return 1 - x[0]

# initial guesses
x0 = np.array([0, 1])

# optimize
bounds = ((0, 1), (0, 1))
constraints = [
    {"type": "eq", "fun": constraint1},
    {"type": "ineq", "fun": constraint2},
    {"type": "ineq", "fun": constraint3},
]
solution = minimize(objective, x0, method='SLSQP',
                    bounds=bounds, constraints=constraints)
x = solution.x

# show solution
print('Objective:', objective(x))
print('Solution:', x)
Objective: 0.1
Solution: [0. 1.]
As an exercise, you can retry the above with σ12 = −0.15. The solution would be 0.0038, attained
at w1 = 5/13, with the two inequality constraints inactive.
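As a numeric check on this exercise, assuming σ12 = −0.15 means the cross term of the objective flips to −0.3w1w2 (an assumption, since the text does not restate the objective), SciPy reproduces the same optimum:

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    # the modified objective, with the cross term flipped to -0.3*w1*w2
    return 0.25*x[0]**2 + 0.1*x[1]**2 - 0.3*x[0]*x[1]

cons = [{"type": "eq", "fun": lambda x: x[0] + x[1] - 1}]
sol = minimize(objective, [0.5, 0.5], method='SLSQP',
               bounds=((0, 1), (0, 1)), constraints=cons)
print(sol.fun, sol.x)   # about 0.0038 at w1 = 5/13
```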
26.5 Example 2: Water-filling algorithm

Assume we are using a battery that can give only 1 watt of power, and this power has to be
distributed to k channels (denoted as p1, · · · , pk). Each channel may have a different attenuation,
so at the end, the signal power is discounted by a gain gi for each channel. Then the maximum
total capacity we can achieve by using these k channels is formulated as the optimization problem
max f*(p1, · · · , pk) = Σ_{i=1}^{k} log2(1 + gi pi / ni)
subject to Σ_{i=1}^{k} pi = 1
p1, · · · , pk ≥ 0
For convenience of differentiation, we notice log2 x = log x / log 2 and log(1 + gi pi / ni) =
log(ni + gi pi) − log(ni), hence the objective function can be replaced with

f(p1, · · · , pk) = Σ_{i=1}^{k} log(ni + gi pi)
in the sense that if f attains its maximum, f* also attains its maximum. Assume we have
k = 3 channels, with noise levels of 1.0, 0.9, 1.0 respectively, and channel gains of 0.9, 0.8,
0.7. Then, with slack variables s1, s2, s3 for the inequality constraints, the Lagrangian is
L(p1 , p2 , p3 , λ, θ1 , θ2 , θ3 )
= log(1 + 0.9p1 ) + log(0.9 + 0.8p2 ) + log(1 + 0.7p3 )
− λ(p1 + p2 + p3 − 1)
− θ1(p1 − s1²) − θ2(p2 − s2²) − θ3(p3 − s3²)

from which, for example,

∂L/∂θ2 = s2² − p2
∂L/∂θ3 = s3² − p3
This problem is an example where SciPy cannot find the optimal solution. The issue lies
in the use of logarithms in the objective function. Hence, it is more accurate if we solve it
exactly using the method of Lagrange multipliers.
import numpy as np
from scipy.optimize import minimize

def objective(x):
    return np.log(1+0.9*x[0]) + np.log(0.9+0.8*x[1]) + np.log(1+0.7*x[2])

def constraint1(x):
    # Equality constraint: total power must sum to one
    return x[0] + x[1] + x[2] - 1

def constraint2(x):
    # Inequality constraints: each power is required to be non-negative
    return x[0]

def constraint3(x):
    return x[1]

def constraint4(x):
    return x[2]

# initial guesses
x0 = np.array([0.4, 0.4, 0.4])

# optimize
bounds = ((0, 1), (0, 1), (0, 1))
constraints = [
    {"type": "eq", "fun": constraint1},
    {"type": "ineq", "fun": constraint2},
    {"type": "ineq", "fun": constraint3},
    {"type": "ineq", "fun": constraint4},
]
solution = minimize(objective, x0, method='SLSQP',
                    bounds=bounds, constraints=constraints)
x = solution.x

# show solution
print('Objective:', objective(x))
print('Solution:', x)
Objective: 0.4252677354043441
Solution: [0. 0. 1.]
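For comparison, the exact solution implied by the Lagrange conditions can be sketched in code. Stationarity gives gi/(ni + gi pi) = λ for every channel with pi > 0, i.e., pi = 1/λ − ni/gi; assuming all three channels receive positive power (which turns out to hold for these numbers), the power constraint fixes 1/λ:

```python
import numpy as np

g = np.array([0.9, 0.8, 0.7])   # channel gains from the example
n = np.array([1.0, 0.9, 1.0])   # noise levels from the example

# "water level" 1/lambda from sum(1/lambda - n_i/g_i) = 1
level = (1 + np.sum(n / g)) / len(g)
p = level - n / g               # optimal power per channel
print(p, np.sum(np.log(n + g * p)))
```

All three powers come out positive, so the assumption is self-consistent, and the resulting objective is larger than the value SciPy reported above.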
Some formulations define the inequality constraints to be non-negative instead. In that case,
you may see the Lagrangian function written with the signs of the inequality terms flipped,
but it requires θ ≥ 0 and φ ≥ 0.
The Lagrangian function is also useful in the primal-dual approach for finding the
maximum or minimum. This is particularly helpful if the objective or the constraints are
nonlinear, in which case the solution may not be easily found.
Some books that cover this topic are:
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge, 2004.
https://amzn.to/34mvCr1
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
26.7 Summary
In this tutorial, you discovered how the method of Lagrange multipliers can be applied to
inequality constraints. Specifically, you learned:
⊲ Lagrange multipliers and the Lagrange function in presence of inequality constraints
⊲ How to use KKT conditions to solve an optimization problem when inequality constraints
are given
In the next chapter, we will see a different application of calculus.
V
Approximation
Approximation
27
When it comes to machine learning tasks such as classification or regression, approximation
techniques play a key role in learning from the data. Many machine learning methods
approximate a function or a mapping between the inputs and outputs via a learning algorithm.
In this tutorial, you will discover what approximation is and its importance in machine
learning and pattern recognition. After completing this tutorial, you will know:
⊲ What is approximation
⊲ Importance of approximation in machine learning
Let’s get started.
Overview
This tutorial is divided into 3 parts; they are:
⊲ What is approximation?
⊲ Approximation when the form of function is not known
⊲ Approximation when the form of function is known
2. The function itself is unknown and hence a model or learning algorithm is used to closely
find a function that can produce outputs close to the unknown function’s outputs.
Approximation in regression
Regression involves the prediction of an output variable when given a set of inputs. In regression,
the function that truly maps the input variables to outputs is not known. It is assumed that
some linear or nonlinear regression model can approximate the mapping of inputs to outputs.
For example, we may have data related to consumed calories per day and the corresponding
blood sugar. To describe the relationship between the calorie input and blood sugar output,
we can assume a straight line relationship/mapping function. The straight line is therefore the
approximation of the mapping of inputs to outputs. A learning method such as the method of
least squares is used to find this line.
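A minimal sketch of such a least squares fit, using hypothetical calorie and blood sugar observations (the numbers below are made up for illustration):

```python
import numpy as np

# hypothetical (caloric intake, blood sugar) observations
calories = np.array([1500, 1800, 2000, 2200, 2500, 2800])
sugar = np.array([90, 100, 110, 118, 130, 140])

# fit a straight line sugar = m*calories + c by least squares
m, c = np.polyfit(calories, sugar, deg=1)
print(m, c)   # slope and intercept of the approximating line
```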
1
https://en.wikipedia.org/wiki/Newton%27s_method
Figure 27.1: A straight line approximation to the relationship between caloric count and
blood sugar
Approximation in classification
A classic example of models that approximate functions in classification problems is that of
neural networks. It is assumed that the neural network as a whole can approximate a true
function that maps the inputs to the class labels. Gradient descent or some other learning
algorithm is then used to learn that function approximation by adjusting the weights of the
neural network.
Figure 27.2: A neural network approximates an underlying function that maps inputs
to outputs
(Figure: a scatter plot of data points, where circular clusters are assumed.)
Books
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
https://amzn.to/36yvG9w
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
27.5 Summary
In this tutorial, you discovered what approximation is. Specifically, you learned:
⊲ Approximation
⊲ Approximation when the form of a function is known
⊲ Approximation when the form of a function is unknown
In the next chapter, we will see a concrete example, the Taylor series.
Taylor Series
28
Taylor series expansion is an awesome concept, not only in the world of mathematics, but also
in optimization theory, function approximation, and machine learning. It is widely applied in
numerical computations when estimates of a function's values at different points are required.
In this tutorial, you will discover Taylor series and how to approximate the values of a
function around different points using its Taylor series expansion. After completing this tutorial,
you will know:
⊲ Taylor series expansion of a function
⊲ How to approximate functions using Taylor series expansion
Let’s get started.
Overview
This tutorial is divided into 3 parts; they are:
⊲ Power series and Taylor series
⊲ Taylor polynomials
⊲ Function approximation using Taylor polynomials
If f(x) has derivatives of all orders on a given interval, then the Taylor series generated by
f(x) at x = a is given by:

Σ_{n=0}^{∞} [f^(n)(a)/n!] (x − a)^n = f(a) + f′(a)(x − a) + [f″(a)/2!](x − a)² + · · · + [f^(k)(a)/k!](x − a)^k + · · ·

ck = f^(k)(a)/k!

The second line of the above expression gives the value of the k-th coefficient.
If we set a = 0, then we have an expansion called the Maclaurin series expansion of f(x).
For f(x) = 1/x, setting a = 1 gives

1/x = Σ_{n=0}^{∞} (−1)^n (x − 1)^n = 1 − (x − 1) + (x − 1)² − · · · + (−1)^k (x − 1)^k + · · ·

Setting a = 3,

1/x = Σ_{n=0}^{∞} (−1)^n (x − 3)^n / 3^(n+1) = 1/3 − (x − 3)/3² + (x − 3)²/3³ − · · · + (−1)^k (x − 3)^k / 3^(k+1) + · · ·
At a = 1, the order 2 Taylor polynomial is

P2(x) = 1 − (x − 1) + (x − 1)²

At a = 3,

P3(x) = 1/3 − (x − 3)/3² + (x − 3)²/3³
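The quality of these approximations can be checked numerically; a small sketch, using the general term (−1)^n (x − a)^n / a^(n+1) of the series for 1/x:

```python
def taylor_1_over_x(x, a, order=2):
    # Taylor polynomial of f(x) = 1/x about x = a:
    # sum of (-1)^n (x - a)^n / a^(n+1) for n = 0, ..., order
    return sum((-1)**n * (x - a)**n / a**(n + 1) for n in range(order + 1))

print(taylor_1_over_x(1.1, a=1), 1/1.1)   # close near a = 1
print(taylor_1_over_x(3.2, a=3), 1/3.2)   # close near a = 3
```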
28.5 Approximation via Taylor polynomials
Figure 28.1: The actual function 1/x (green) and its order 2 approximation (red), near
a = 1 (left) and a = 3 (right)
e^x = 1 + x + x²/2! + x³/3! + · · ·

The polynomial of order k generated for the function e^x around the point x = 0 is given by:

e^x ≈ 1 + x + x²/2! + x³/3! + · · · + x^k/k!
The plots below show polynomials of different orders that estimate the value of ex around x = 0.
We can see that as we move away from zero, we need more terms to approximate ex more
accurately. The green line representing the actual function is hiding behind the blue line of the
approximating polynomial of order 7.
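The effect can be reproduced with a few lines of code; a sketch of the order-k Maclaurin polynomial of e^x:

```python
import math

def exp_taylor(x, k):
    # order-k Maclaurin polynomial of e^x
    return sum(x**n / math.factorial(n) for n in range(k + 1))

# near zero a low order suffices; farther away, higher orders are needed
print(exp_taylor(1, 7), math.e)
print(exp_taylor(3, 2), exp_taylor(3, 7), math.exp(3))
```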
28.7 Taylor series in machine learning
Figure 28.2: Polynomials of varying degrees (orders 2, 3, 4, and 7) that approximate e^x,
compared with the actual function
Books
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
https://amzn.to/36yvG9w
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https : / / ocw . mit . edu / resources / res - 18 - 001 -
calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
28.9 Summary
In this tutorial, you discovered what the Taylor series expansion of a function about a point is.
Specifically, you learned:
⊲ Power series and Taylor series
⊲ Taylor polynomials
⊲ How to approximate functions around a value using Taylor polynomials
In the next chapter, we will see some examples in machine learning that benefited directly from
calculus.
VI
Calculus in Machine Learning
Gradient Descent Procedure
29
The gradient descent procedure is a method of paramount importance in machine learning.
It is often used for minimizing error functions in classification and regression problems. It is
also used in training neural networks and deep learning architectures.
In this tutorial, you will discover the gradient descent procedure. After completing this
tutorial, you will know:
⊲ Gradient descent method
⊲ Importance of gradient descent in machine learning
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ Gradient descent procedure
⊲ Solved example of gradient descent procedure
The notation
We have the following variables:
⊲ t = Iteration number
⊲ T = Total iterations
29.2 Example of gradient descent
f(x, y) = x² + 2y²
(Figure: surface and contour plots of f(x, y) over −10 ≤ x, y ≤ 10.)
3. At t = 2,
If you keep running the above iterations, the procedure will eventually end up at the point where
the function attains its minimum, i.e., (0, 0). At iteration t = 1, the algorithm is illustrated
in the figure below:
Figure 29.2: Illustration of the gradient descent procedure: from the point (4, 3), take a
step in the direction opposite the gradient vector (the direction of the negative gradient)
29.3 How many iterations to run?
The initial change at t = 0 is a zero vector. For this problem ∆x[0] = (0, 0).
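The whole procedure for this example can be sketched in a few lines of code (the learning rate of 0.1 here is an assumption for illustration):

```python
import numpy as np

def gradient(x):
    # gradient of f(x, y) = x^2 + 2*y^2
    return np.array([2 * x[0], 4 * x[1]])

x = np.array([4.0, 3.0])   # the starting point of the example
eta = 0.1                  # learning rate (assumed)
for t in range(100):
    x = x - eta * gradient(x)   # step opposite the gradient
print(x)   # approaches the minimum at (0, 0)
```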
Books
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https : / / ocw . mit . edu / resources / res - 18 - 001 -
calculus-online-textbook-spring-2005/textbook/).
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
29.8 Summary
In this tutorial, you discovered the algorithm for gradient descent. Specifically, you learned:
⊲ Gradient descent procedure
⊲ How to apply gradient descent procedure to find the minimum of a function
⊲ How to transform a maximization problem into a minimization problem
In the next chapter, we will learn about neural networks, which are a famous use case of gradient
descent.
Calculus in Neural Networks
30
An artificial neural network is a computational model that approximates a mapping between
inputs and outputs. It is inspired by the structure of the human brain, in that it is similarly
composed of a network of interconnected neurons that propagate information upon receiving
sets of stimuli from neighboring neurons. Training a neural network involves a process that
employs the backpropagation and gradient descent algorithms in tandem. As we will be seeing,
both of these algorithms make extensive use of calculus.
In this tutorial, you will discover how aspects of calculus are applied in neural networks.
After completing this tutorial, you will know:
⊲ An artificial neural network is organized into layers of neurons and connections, where
the latter are attributed a weight value each.
⊲ Each neuron implements a nonlinear function that maps a set of inputs to an output
activation.
⊲ In training a neural network, calculus is used extensively by the backpropagation and
gradient descent algorithms.
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ An Introduction to the Neural Network
⊲ The Mathematics of a Neuron
⊲ Training the Network
“
A neural network is a computational model that is inspired by the structure of the
”
human brain.
— Page 65, Deep Learning, 2019.
The human brain consists of a massive network of interconnected neurons (around one hundred
billion of them), with each comprising a cell body, a set of fibers called dendrites, and an axon:
(Figure: a biological neuron, with a cell body, dendrites acting as input channels, and an
axon connecting to another neuron.)
The dendrites act as the input channels to a neuron, whereas the axon acts as the output
channel. Therefore, a neuron would receive input signals through its dendrites, which in turn
would be connected to the (output) axons of other neighboring neurons. In this manner, a
sufficiently strong electrical pulse (also called an action potential) can be transmitted along the
axon of one neuron, to all the other neurons that are connected to it. This permits signals to
be propagated along the structure of the human brain.
“
So, a neuron acts as an all-or-none switch, that takes in a set of inputs and either
”
outputs an action potential or no output.
— Page 66, Deep Learning, 2019.
An artificial neural network is analogous to the structure of the human brain, because (1)
it is similarly composed of a large number of interconnected neurons that, (2) seek to propagate
information across the network by, (3) receiving sets of stimuli from neighboring neurons and
mapping these to outputs, to be fed to the next layer of neurons.
The structure of an artificial neural network is typically organized into layers of neurons
(recall the depiction of a tree diagram). For example, the following diagram illustrates a fully-
connected neural network, where all the neurons in one layer are connected to all the neurons
in the next layer:
Figure 30.2: A fully-connected, feedforward neural network with an input layer, two
hidden layers, and an output layer
30.2 The mathematics of a neuron
The inputs are presented on the left hand side of the network, and the information propagates
(or flows) rightward towards the outputs at the opposite end. Since the information is, hereby,
propagating in the forward direction through the network, then we would also refer to such a
network as a feedforward neural network.
The layers of neurons in between the input and output layers are called hidden layers,
because they are not directly accessible. Each connection (represented by an arrow in the
diagram) between two neurons is attributed a weight, which acts on the data flowing through
the network, as we will see shortly.
This weighted sum calculation that we have performed so far is a linear operation. If every
neuron had to implement this particular calculation alone, then the neural network would be
restricted to learning only linear input-output mappings.
“
However, many of the relationships in the world that we might want to model are
nonlinear, and if we attempt to model these relationships using a linear model, then
”
the model will be very inaccurate.
— Page 77, Deep Learning, 2019.
Hence, a second operation is performed by each neuron that transforms the weighted sum
by the application of a nonlinear activation function, a(·):

output = a(z) = a( Σ_{i=1}^{n} xi wi + b )
We can represent the operations performed by each neuron even more compactly, if we
integrate the bias term into the sum as another weight, w0 (notice that the sum now starts
from 0):

y = a(z) = a( Σ_{i=0}^{n} xi wi )
(Figure: a single neuron: the inputs x1, · · · , xn are weighted by w1, · · · , wn, summed
together with the bias weight w0, and passed through the activation function a(·) to produce
the output.)
Therefore, each neuron can be considered to implement a nonlinear function that maps a
set of inputs to an output activation.
(Figure: a network of two neurons in series: the input x produces z1 = xw1 and activation
a1 in the first neuron, then z2 = a1w2 and activation a2 in the second, with error terms
δ1 and δ2.)
The first application of the chain rule connects the overall error of the network to the input,
z2, of the activation function a2 of the second neuron, and subsequently to the weight, w2, as
follows:

∂(error)/∂w2 = [∂(error)/∂a2] × [∂a2/∂z2] × [∂z2/∂w2] = δ2 × ∂z2/∂w2
You may notice that the application of the chain rule involves, among other terms, a multiplication
by the partial derivative of the neuron's activation function with respect to its input, z2. There
are different activation functions to choose from, such as the sigmoid or the logistic functions. If
we take the logistic function as an example, then its partial derivative would be computed
as follows:

∂a2/∂z2 = ∂logistic(z2)/∂z2 = logistic(z2) × (1 − logistic(z2))
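This derivative is easy to verify numerically; a quick sketch:

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

z = 0.5
analytic = logistic(z) * (1 - logistic(z))
numeric = (logistic(z + 1e-6) - logistic(z - 1e-6)) / 2e-6
print(analytic, numeric)   # the two values agree
```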
Hence, assuming a squared-error loss, we can compute δ2 as follows:

δ2 = [∂(error)/∂a2] × [∂a2/∂z2] = (a2 − t2) × logistic(z2) × (1 − logistic(z2))
Here, t2 is the expected activation, and in finding the difference between t2 and a2 we are,
therefore, computing the error between the activation generated by the network and the expected
ground truth.
Since we are computing the derivative of the activation function, it should, therefore, be
continuous and differentiable over the entire space of real numbers. In the case of deep neural
networks, the error gradient is propagated backwards over a large number of hidden layers.
This can cause the error signal to rapidly diminish to zero, especially if the maximum value of
the derivative function is already small to begin with (for instance, the derivative of the logistic
function has a maximum value of 0.25). This is known as the vanishing gradient problem. The
ReLU function has been so popularly used in deep learning to alleviate this problem, because
its derivative in the positive portion of its domain is equal to 1.
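The shrinking effect is easy to see: even in the best case, every logistic layer multiplies the error signal by at most 0.25, so the signal decays exponentially with depth (a toy illustration):

```python
# repeatedly multiply by the logistic derivative's maximum value, 0.25,
# to see how quickly the error signal shrinks with depth
signal = 1.0
for layer in range(10):
    signal *= 0.25
print(signal)   # after 10 layers, the signal is below 1e-6
```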
The next weight backwards is deeper into the network and, hence, the application of the
chain rule can similarly be extended to connect the overall error to the weight, w1, as follows:

∂(error)/∂w1 = [∂(error)/∂a2] × [∂a2/∂z2] × [∂z2/∂a1] × [∂a1/∂z1] × [∂z1/∂w1] = δ1 × ∂z1/∂w1
Once we have computed the gradient of the network error with respect to each weight, then
the gradient descent algorithm can be applied to update each weight for the next forward
propagation at time, t + 1. For the weight, w1 , the weight update rule using gradient descent
would be specified as follows:
w1^(t+1) = w1^t − η × δ1 × ∂z1/∂w1
Even though we have hereby considered a simple network, the process that we have gone through
can be extended to evaluate more complex and deeper ones, such as convolutional neural networks
(CNNs).
30.4 Further reading 209
Books
John D. Kelleher. Deep Learning. Illustrated edition. The MIT Press Essential Knowledge series.
MIT Press, 2019.
https://www.amazon.com/dp/0262537559/
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
https://amzn.to/36yvG9w
30.5 Summary
In this tutorial, you discovered how aspects of calculus are applied in neural networks. Specifically,
you learned:
⊲ An artificial neural network is organized into layers of neurons and connections, where
the latter are each attributed a weight value.
⊲ Each neuron implements a nonlinear function that maps a set of inputs to an output
activation.
⊲ In training a neural network, calculus is used extensively by the backpropagation and
gradient descent algorithms.
In the next chapter, we will combine what we learned in these two chapters and implement the
neural network training process from scratch.
Implementing a Neural Network
in Python
31
Differential calculus is an important tool in machine learning algorithms. In neural networks in
particular, the gradient descent algorithm depends on the gradient, which is a quantity computed
by differentiation.
In this tutorial, we will see how the backpropagation technique is used in finding the
gradients in neural networks.
After completing this tutorial, you will know
⊲ What is a total differential and total derivative
⊲ How to compute the total derivatives in neural networks
⊲ How backpropagation helped in computing the total derivatives
Let’s get started
Overview
This tutorial is divided into 5 parts; they are:
⊲ Total differential and total derivatives
⊲ Algebraic representation of a multilayer perceptron model
⊲ Finding the gradient by backpropagation
⊲ Matrix form of gradient equations
⊲ Implementing backpropagation
to u while assuming the other variable v is a constant. Therefore, we use ∂ instead of d as the
symbol for differentiation to signify the difference.
However, what if the u and v in f(u, v) are both functions of x? In other words, we can write
u(x) and v(x), and f(u(x), v(x)). So x determines the value of u and v and, in turn, determines
f(u, v). In this case, it is perfectly fine to ask what df/dx is, as f is eventually determined by x.
This is the concept of total derivatives. In fact, for a multivariate function f(t, u, v) =
f(t(x), u(x), v(x)), we always have

df/dx = (∂f/∂t)(dt/dx) + (∂f/∂u)(du/dx) + (∂f/∂v)(dv/dx)

The above notation is called the total derivative because it is a sum of the partial derivatives. In
essence, it is applying the chain rule to find the differentiation.
If we take away the dx part in the above equation, what we get is an approximate change
in f with respect to x, i.e.,

df = (∂f/∂t)dt + (∂f/∂u)du + (∂f/∂v)dv

We call this notation the total differential.
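As a sanity check on the total derivative formula, here is a numeric verification with an example function of our own choosing (t = x, u = x², v = sin x, so df/dx of f = tu + v should equal 3x² + cos x):

```python
import math

def f(t, u, v):
    return t * u + v

def df_dx(x):
    # total derivative: df/dx = (df/dt)(dt/dx) + (df/du)(du/dx) + (df/dv)(dv/dx)
    t, u, v = x, x**2, math.sin(x)
    return u * 1 + t * (2 * x) + 1 * math.cos(x)

x = 0.7
h = 1e-6
numeric = (f(x + h, (x + h)**2, math.sin(x + h)) -
           f(x - h, (x - h)**2, math.sin(x - h))) / (2 * h)
print(df_dx(x), numeric)   # both equal 3x^2 + cos(x)
```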
Figure 31.1: An example of a neural network, with an input layer, two hidden layers,
and an output layer
31.3 Finding the gradient by backpropagation
If we denote the input to the network as xi where i = 1, · · · , n0, and the network's output
as ŷi where i = 1, · · · , n3, then we can write

h1i = f1( Σ_{j=1}^{n0} w^(1)_ij xj + b^(1)_i )    for i = 1, · · · , n1
h2i = f2( Σ_{j=1}^{n1} w^(2)_ij h1j + b^(2)_i )    for i = 1, · · · , n2
ŷi = f3( Σ_{j=1}^{n2} w^(3)_ij h2j + b^(3)_i )    for i = 1, · · · , n3
Here the activation function at layer k is denoted as fk. The outputs of the first hidden layer are
denoted as h1i for the i-th unit, and similarly the outputs of the second hidden layer are denoted
as h2i. The weights and bias of unit i in layer k are denoted as w^(k)_ij and b^(k)_i respectively.
In the above, we can see that the output of layer k − 1 feeds into layer k. Therefore,
while ŷi is expressed as a function of h2j, h2j is in turn a function of h1j and, ultimately, a
function of xj.
The above describes the construction of a neural network in terms of algebraic equations.
Training the neural network also requires a loss function to be specified, so that we can minimize
it in the training loop. Depending on the application, we commonly use cross entropy for
categorization problems or mean squared error for regression problems. For example, with the
target variables as yi, the mean squared error loss function is specified as

L = Σ_{i=1}^{n3} (yi − ŷi)²
By gradient descent, we update each weight and bias as

w^(k)_ij = w^(k)_ij − η ∂L/∂w^(k)_ij
b^(k)_i = b^(k)_i − η ∂L/∂b^(k)_i

where η is the learning rate, because ŷi (and hence L) can eventually be written as a function of
each w^(k)_ij and b^(k)_i. Let us see, one by one, how these derivatives can be found.
We begin with the loss metric. If we consider the loss of a single data point, we have

L = Σ_{i=1}^{n3} (yi − ŷi)²
∂L/∂ŷi = 2(ŷi − yi)    for i = 1, · · · , n3

Here we see that the loss function depends on all outputs ŷi, and therefore we can find a partial
derivative ∂L/∂ŷi for each of them.
Now let’s look at the output layer:
n2
for i = 1, · · · , n3
X (3) (3)
ŷi = f3 ( wij h2j + bi )
j=1
∂L ∂L ∂ ŷi
(3)
= (3)
i = 1, · · · , n3 ; j = 1, · · · , n2
∂wij ∂ ŷi ∂wij
n2
∂L ′ X (3) (3)
= f3 ( wij h2j + bi )h2j
∂ ŷi j=1
∂L ∂L ∂ ŷi
(3)
= i = 1, · · · , n3
∂bi ∂ ŷi ∂b(3)
i
n2
∂L ′ X (3) (3)
= f3 ( wij h2j + bi )
∂ ŷi j=1
Because the weight w^(3)_ij at layer 3 applies to input h2j and affects output ŷi only, we can
write the derivative ∂L/∂w^(3)_ij as the product of the two derivatives (∂L/∂ŷi)(∂ŷi/∂w^(3)_ij).
The case of the bias b^(3)_i is similar. In the above, we make use of ∂L/∂ŷi, which we already
derived previously.
But in fact, we can also write the partial derivative of L with respect to the output of the
second layer, h2j. It is not used for the update of the weights and bias on layer 3, but we will
see its importance later:

∂L/∂h2j = Σ_{i=1}^{n3} (∂L/∂ŷi)(∂ŷi/∂h2j) = Σ_{i=1}^{n3} (∂L/∂ŷi) f3′( Σ_{j=1}^{n2} w^(3)_ij h2j + b^(3)_i ) w^(3)_ij    for j = 1, · · · , n2

This one is interesting and different from the previous partial derivatives. Note that h2j is an
output of layer 2, and each and every output of layer 2 will affect the output ŷi of layer 3.
Therefore, to find ∂L/∂h2j, we need to add up the contribution of every output at layer 3;
thus the summation sign in the equation above. We can consider ∂L/∂h2j as a total derivative,
in which we applied the chain rule (∂L/∂ŷi)(∂ŷi/∂h2j) for every output i and then summed
them up.
Similarly, for layer 2:

∂L/∂w^(2)_ij = (∂L/∂h2i)(∂h2i/∂w^(2)_ij) = (∂L/∂h2i) f2′( Σ_{j=1}^{n1} w^(2)_ij h1j + b^(2)_i ) h1j    for i = 1, · · · , n2; j = 1, · · · , n1

∂L/∂b^(2)_i = (∂L/∂h2i)(∂h2i/∂b^(2)_i) = (∂L/∂h2i) f2′( Σ_{j=1}^{n1} w^(2)_ij h1j + b^(2)_i )    for i = 1, · · · , n2

∂L/∂h1j = Σ_{i=1}^{n2} (∂L/∂h2i)(∂h2i/∂h1j) = Σ_{i=1}^{n2} (∂L/∂h2i) f2′( Σ_{j=1}^{n1} w^(2)_ij h1j + b^(2)_i ) w^(2)_ij    for j = 1, · · · , n1
In the equations above, we are reusing ∂L/∂h2i, which we derived earlier. Again, this derivative
is computed as a sum of several products from the chain rule. Also similar to the previous layer,
we derived ∂L/∂h1j as well. It is not used to train w^(2)_ij nor b^(2)_i, but will be used for the
layer prior. So for layer 1, we have
h1i = f1( Σ_{j=1}^{n0} w^(1)_ij xj + b^(1)_i )    for i = 1, · · · , n1

∂L/∂w^(1)_ij = (∂L/∂h1i)(∂h1i/∂w^(1)_ij) = (∂L/∂h1i) f1′( Σ_{j=1}^{n0} w^(1)_ij xj + b^(1)_i ) xj    for i = 1, · · · , n1; j = 1, · · · , n0

∂L/∂b^(1)_i = (∂L/∂h1i)(∂h1i/∂b^(1)_i) = (∂L/∂h1i) f1′( Σ_{j=1}^{n0} w^(1)_ij xj + b^(1)_i )    for i = 1, · · · , n1
and this completes all the derivatives needed for training the neural network using the gradient
descent algorithm.
31.4 Matrix form of gradient equations
Recall how we derived the above: we first started from the loss function L and found the
derivatives one by one, in the reverse order of the layers. We write down the derivatives on layer
k and reuse them for the derivatives on layer k − 1. While computing the output ŷi from the
input xi proceeds from layer 0 forward, computing the gradients proceeds in the reverse order.
Hence the name “backpropagation”.
In matrix notation, each layer computes ak = fk(Wk ak−1 + bk), where ak is the vector of outputs
of layer k, and we assume a0 = x is the input vector and a3 = ŷ is the output vector. Also denote
zk = Wk ak−1 + bk for convenience of notation.
Under such notation, we can represent ∂L/∂ak as a vector (and likewise ∂L/∂zk and ∂L/∂bk),
and ∂L/∂Wk as a matrix. Then, if ∂L/∂ak is known, we have

∂L/∂zk = ∂L/∂ak ⊙ fk′(zk)
∂L/∂Wk = (∂L/∂zk) · ak−1⊤
∂L/∂bk = ∂L/∂zk
∂L/∂ak−1 = (∂zk/∂ak−1)⊤ · (∂L/∂zk) = Wk⊤ · (∂L/∂zk)

where ⊙ denotes the element-wise product, and ∂zk/∂ak−1 is a Jacobian matrix, as both zk
and ak−1 are vectors; this Jacobian matrix happens to be Wk.
import numpy as np
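The activation functions and their derivatives referenced in this chapter are not shown above; a minimal sketch consistent with the description (the names sigmoid, dsigmoid, relu, and drelu are the ones used later in the chapter):

```python
import numpy as np

def sigmoid(z):
    # clip the input to avoid overflow in np.exp
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def dsigmoid(z):
    # derivative of the sigmoid function
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    return np.maximum(0, z)

def drelu(z):
    # derivative of ReLU: 1 where the input is positive, 0 otherwise
    return (z > 0).astype(float)
```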
We deliberately clip the input of the sigmoid function to between −500 and +500 to avoid
overflow. Otherwise, these functions are trivial. Then for classification, we care about accuracy,
but the accuracy function is not differentiable. Therefore, we use the cross entropy function as
loss for training:
def cross_entropy(y, yhat, epsilon=1e-10):
    """Binary cross entropy loss

    Args:
        y, yhat (np.array): nx1 matrices, in which n is the number of data instances
    Returns:
        average cross entropy value of shape 1x1, averaging over the n instances
    """
    # epsilon is a small constant to avoid taking log of zero
    return ( -(y.T @ np.log(yhat.clip(epsilon)) +
               (1-y.T) @ np.log((1-yhat).clip(epsilon))
             ) / y.shape[0] )
In the above, we assume the output and the target variables are row matrices in numpy.
Hence we use the dot product operator @ to compute the sum and divide by the number of
elements in the output. Note that this design is to compute the average cross entropy over a
batch of samples.
Then we can implement our multilayer perceptron model. To make it easier to read, we
want to create the model by providing the number of neurons at each layer as well as the
activation function at each layer. But at the same time, we also need the differentiation of
the activation functions, as well as the differentiation of the loss function, for the training. The
loss function itself, however, is not required, but is useful for us to track the progress. We create
a class to encapsulate the entire model, and define each layer k according to the formula:
ak = fk (zk ) = fk (ak−1 Wk + bk )
class mlp:
    '''Multilayer perceptron using numpy
    '''
    def __init__(self, layersizes, activations, derivatives, lossderiv):
        """remember config, then initialize array to hold NN parameters
        without init"""
        # hold NN config
        self.layersizes = layersizes
        self.activations = activations
        self.derivatives = derivatives
        self.lossderiv = lossderiv
        # parameters, each is a 2D numpy array
        L = len(self.layersizes)
        self.z = [None] * L
        self.W = [None] * L
        self.b = [None] * L
        self.a = [None] * L
        self.dz = [None] * L
        self.dW = [None] * L
        self.db = [None] * L
        self.da = [None] * L
The variables z, W, b, and a in this class are for the forward pass, and the variables dz, dW,
db, and da are their respective gradients, to be computed in the backpropagation. All these
variables are presented as numpy arrays.
As we will see later, we are going to test our model using data generated by scikit-learn,
so we will see our data as numpy arrays of shape (number of samples, number of features).
Therefore, each sample is presented as a row of a matrix, and in the function forward(), the
weight matrix is right-multiplied to each input a to the layer. While the activation function and
dimension of each layer can be different, the process is the same. Thus we transform the neural
network’s input x to its output by a loop in the forward() function. The network’s output is
simply the output of the last layer.
To train the network, we need to run the backpropagation after each forward pass. The
backpropagation is to compute the gradient of the weight and bias of each layer, starting from
the output layer to the input layer. With the equations we derived above, the backpropagation
function is implemented as:
class mlp:
...
The only difference here is that we compute db not for one training sample, but for the
entire batch. Since the loss function is the cross entropy averaged across the batch, we compute
db also by averaging across the samples.
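The backpropagation listing itself is abbreviated above. As a sanity check on the chain-rule recursion it implements (dz = da · f′(z), dW = a⊤dz, da of the previous layer = dz W⊤), here is a self-contained numeric example that compares one backpropagated gradient against a finite-difference estimate. The one-hidden-layer network and the mean-output “loss” are made up purely for this check:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def drelu(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))                      # batch of 8 samples, 3 features
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

def loss(W1_):
    # toy loss: mean of the network output over the batch
    a1_ = relu(x @ W1_ + b1)
    return (a1_ @ W2 + b2).mean()

# forward pass
z1 = x @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2

# backward pass, following dz -> dW, db, then da of the previous layer
dz2 = np.ones_like(z2) / z2.size                 # d(mean)/dz2
dW2 = a1.T @ dz2
db2 = dz2.sum(axis=0, keepdims=True)
da1 = dz2 @ W2.T
dz1 = da1 * drelu(z1)
dW1 = x.T @ dz1

# finite-difference check on a single weight
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
```

If the chain rule was applied correctly, `numeric` and `dW1[0, 0]` agree to several decimal places.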
Up to here, we completed our model. The update() function simply applies the gradients
found by the backpropagation to the parameters W and b using the gradient descent update rule.
To test out our model, we make use of scikit-learn to generate a classification dataset:
and then we build our model: Input is two-dimensional and output is one dimensional (logistic
regression). We make two hidden layers of 4 and 3 neurons respectively:
[Figure: a 2-4-3-1 network with inputs x1 and x2, hidden neurons h11–h14 and h21–h23, and output ŷ.]
Figure 31.2: Neural network model for binary classification
# Build a model
model = mlp(layersizes=[2, 4, 3, 1],
            activations=[relu, relu, sigmoid],
            derivatives=[drelu, drelu, dsigmoid],
            lossderiv=d_cross_entropy)
model.initialize()
yhat = model.forward(X)
loss = cross_entropy(y, yhat)
score = accuracy_score(y, (yhat > 0.5))
print(f"Before training - loss value {loss} accuracy {score}")
Program 31.6: Create a neural network model and run one forward pass
Now we train our network. To make things simple, we perform full-batch gradient descent
with fixed learning rate:
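The training listing is omitted here. As a stand-in with the same structure (forward pass, gradient, parameter update, repeated over the full batch with a fixed learning rate), the following self-contained sketch trains a plain logistic regression on made-up data; the data, learning rate, and iteration count are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                       # made-up 2D inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

w, b = np.zeros((2, 1)), 0.0
eta = 0.1                                           # fixed learning rate
for epoch in range(200):                            # full-batch gradient descent
    yhat = 1 / (1 + np.exp(-(X @ w + b)))           # forward pass (sigmoid)
    dz = (yhat - y) / len(X)                        # gradient of mean cross entropy w.r.t. z
    w -= eta * (X.T @ dz)                           # update, like model.update(eta)
    b -= eta * dz.sum()

accuracy = float(((yhat > 0.5) == (y > 0.5)).mean())
```

The loop body is exactly the forward–backward–update cycle the mlp class performs; only the model inside is simpler.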
Although not perfect, we see improvement from training. At least in the example above,
we can see the accuracy rose above 80% at iteration 145, but then the model
diverged. That can be improved by reducing the learning rate, which we didn’t implement
above. Nonetheless, this shows how we computed the gradients by backpropagation and the chain
rule.
The complete code is as follows:
    Args:
        y, yhat (np.array): nx1 matrices where n is the number of data instances
    Returns:
        average cross entropy value of shape 1x1, averaging over the n instances
    """
    return ( -(y.T @ np.log(yhat.clip(epsilon)) +
               (1-y.T) @ np.log((1-yhat).clip(epsilon))
             ) / y.shape[0] )
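The def line and the epsilon constant of this listing fall outside the excerpt. A self-contained version consistent with the fragment above (the epsilon value is an assumption) would be:

```python
import numpy as np

epsilon = 1e-12   # assumed small constant to keep log() away from zero

def cross_entropy(y, yhat):
    """Average binary cross entropy over n samples; y and yhat are nx1 arrays."""
    n = y.shape[0]
    return float(-(y.T @ np.log(yhat.clip(epsilon)) +
                   (1 - y.T) @ np.log((1 - yhat).clip(epsilon))) / n)
```

For a perfect prediction the value approaches zero, while confident wrong predictions are penalized heavily.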
class mlp:
    '''Multilayer perceptron using numpy
    '''
    def __init__(self, layersizes, activations, derivatives, lossderiv):
        """remember config, then initialize array to hold NN parameters
        without init"""
        # hold NN config
        self.layersizes = tuple(layersizes)
        self.activations = tuple(activations)
        self.derivatives = tuple(derivatives)
        self.lossderiv = lossderiv
        assert len(self.layersizes)-1 == len(self.activations), \
            "number of layers and the number of activation functions do not match"
        assert len(self.activations) == len(self.derivatives), \
            "number of activation functions and number of derivatives do not match"
        assert all(isinstance(n, int) and n >= 1 for n in layersizes), \
            "Only positive integral number of perceptrons is allowed in each layer"
        # parameters, each is a 2D numpy array
        L = len(self.layersizes)
        self.z = [None] * L
        self.W = [None] * L
        self.b = [None] * L
        self.a = [None] * L
        self.dz = [None] * L
        self.dW = [None] * L
        self.db = [None] * L
        self.da = [None] * L
    def forward(self, x):
        """Feed forward through the layers to compute the network output

        Args:
            x (numpy.ndarray): Input data to feed forward
        """
        self.a[0] = x
        for l, func in enumerate(self.activations, 1):
            # z = a W + b, then apply the layer's activation function
            self.z[l] = self.a[l-1] @ self.W[l] + self.b[l]
            self.a[l] = func(self.z[l])
        return self.a[-1]

    def update(self, eta):
        """Apply one gradient descent step to the weights and biases

        Args:
            eta (float): Learning rate
        """
        for l in range(1, len(self.W)):
            self.W[l] -= eta * self.dW[l]
            self.b[l] -= eta * self.db[l]
# Build a model
model = mlp(layersizes=[2, 4, 3, 1],
            activations=[relu, relu, sigmoid],
            derivatives=[drelu, drelu, dsigmoid],
            lossderiv=d_cross_entropy)
model.initialize()
yhat = model.forward(X)
loss = cross_entropy(y, yhat)
31.6 Further readings
31.7 Summary
In this tutorial, you learned how differentiation is applied to training a neural network.
Specifically, you learned:
⊲ What is a total differential and how it is expressed as a sum of partial differentials
⊲ How to express a neural network as equations and derive the gradients by differentiation
⊲ How backpropagation helped us to express the gradients of each layer in the neural
network
⊲ How to convert the gradients into code to make a neural network model
In the next chapter, we will see the use of calculus in a different machine learning model.
Training a Support Vector Machine: The Separable Case
32
This tutorial is designed for anyone looking for a deeper understanding of how Lagrange
multipliers are used in building up the model for support vector machines (SVMs). SVMs
were initially designed to solve binary classification problems and later extended and applied to
regression and unsupervised learning. They have shown their success in solving many complex
machine learning classification problems.
In this tutorial, we’ll look at the simplest SVM that assumes that the positive and negative
examples can be completely separated via a linear hyperplane.
After completing this tutorial, you will know:
⊲ How the hyperplane acts as the decision boundary
⊲ Mathematical constraints on the positive and negative examples
⊲ What is the margin and how to maximize the margin
⊲ Role of Lagrange multipliers in maximizing the margin
⊲ How to determine the separating hyperplane for the separable case
Let’s get started.
Overview
This tutorial is divided into three parts; they are:
⊲ Formulation of the mathematical model of SVM
⊲ Solution of finding the maximum margin hyperplane via the method of Lagrange
multipliers
⊲ Solved example to demonstrate all concepts
[Figure: positive and negative examples in the (x1, x2) plane, the separating hyperplane w1 x1 + w2 x2 + w0 = 0, and the margin on either side of it.]
The support vector machine is designed to discriminate data points belonging to two different
classes. One set of points is labeled as +1, also called the positive class. The other set of points
is labeled as −1, also called the negative class. For now, we’ll make a simplifying assumption
that points from both classes can be discriminated via a linear hyperplane.
The SVM assumes a linear decision boundary between the two classes and the goal is to
find a hyperplane that gives the maximum separation between the two classes. For this reason,
the alternate term maximum margin classifier is also sometimes used to refer to an SVM. The
perpendicular distance between the closest data point and the decision boundary is referred to
as the margin. As the margin completely separates the positive and negative examples and does
not tolerate any errors, it is also called the hard margin.
The mathematical expression for a hyperplane is given below, with wj being the coefficients
and w0 being the arbitrary constant that determines the distance of the hyperplane from the
origin:
w⊤ xi + w0 = 0
For the i-th 2-dimensional point (xi1 , xi2 ) the above expression is reduced to:
w1 xi1 + w2 xi2 + w0 = 0
We can use a neat trick to write a uniform equation for both set of points by using ti ∈ {−1, +1}
to denote the class label of data point xi :
ti (w⊤ xi + w0 ) ≥ +1
Plugging the above into the Lagrange function gives us the following optimization problem, also called
the dual:
Ld = −(1/2) Σi Σk αi αk ti tk xi⊤ xk + Σi αi
The nice thing about the above is that we have an expression for w in terms of Lagrange
multipliers. The objective function involves no w term. There is a Lagrange multiplier associated
with each data point. The computation of w0 is also explained later.
αi ≥ 0
ti y(xi ) − 1 ≥ 0
αi (ti y(xi ) − 1) = 0
The KKT conditions dictate that for each data point one of the following is true:
⊲ The Lagrange multiplier is zero, i.e., αi = 0. This point, therefore, plays no role in
classification; or,
⊲ ti y(xi ) = 1 and αi > 0: In this case, the data point has a role in deciding the value of
w. Such a point is called a support vector.
For w0 , we can select any support vector xs and solve
ts y(xs ) = 1
giving us:
ts (Σi αi ti xs⊤ xi + w0) = 1
32.7 A solved example
For the above set of points, we can see that (1, 2), (2, 1) and (0, 0) are points closest to the
separating hyperplane and hence, act as support vectors. Points far away from the boundary
(e.g. (−3, 1)) do not play any role in determining the classification of the points.
Books
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
https://amzn.to/36yvG9w
Joel Hass, Christopher Heil, and Maurice Weir. Thomas’ Calculus. 14th ed. Based on the original
works of George B. Thomas. Pearson, 2017.
https://www.amazon.com/dp/0134438981/
Articles
Christopher J. C. Burges. “A Tutorial on Support Vector Machines for Pattern Recognition”.
Data mining and Knowledge Discovery, 2, 1998, pp. 121–167.
https://www.di.ens.fr/~mallat/papiers/svmtutorial.pdf
32.9 Summary
In this tutorial, you discovered how to use the method of Lagrange multipliers to solve the problem
of maximizing the margin via a quadratic programming problem with inequality constraints.
Specifically, you learned:
⊲ The mathematical expression for a separating linear hyperplane
⊲ The maximum margin as a solution of a quadratic programming problem with inequality
constraint
⊲ How to find a linear hyperplane between positive and negative examples using the
method of Lagrange multipliers
In the next chapter, we will consider the case that the SVM cannot separate the positive and
negative classes perfectly.
33
Training a Support Vector Machine: The Non-Separable Case
This tutorial is an extension of the previous chapter and explains the non-separable case. In
real-life problems, positive and negative training examples may not be completely separable by
a linear decision boundary. This tutorial explains how a soft margin can be built that tolerates
a certain amount of error.
In this tutorial, we’ll cover the basics of a linear SVM. We won’t go into details of nonlinear
SVMs derived using the kernel trick. The content is enough to understand the basic mathematical
model behind an SVM classifier.
After completing this tutorial, you will know:
⊲ Concept of a soft margin
⊲ How to maximize the margin while allowing mistakes in classification
⊲ How to formulate the optimization problem and compute the Lagrange dual
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ The solution of the SVM problem for the case where positive and negative examples
are not linearly separable
◦ The separating hyperplane and the corresponding relaxed constraints
◦ The quadratic optimization problem for finding the soft margin
⊲ A worked example
⊲ x+ : Positive example
⊲ x− : Negative example
⊲ i: Subscript used to index the training points. 0 ≤ i < m
⊲ j: Subscript to index a dimension of the data point. 1 ≤ j ≤ n
⊲ t: Label of data points. It is an m-dimensional vector
⊲ ⊤: Transpose operator
⊲ w: Weight vector denoting the coefficients of the hyperplane. It is an n-dimensional
vector
⊲ α: Vector of Lagrange multipliers, an m-dimensional vector
⊲ µ: Vector of Lagrange multipliers, again an m-dimensional vector
⊲ ξ: Error in classification. An m-dimensional vector
w⊤ xi+ + w0 ≥ 1 − ξi
w⊤ xi− + w0 ≤ −1 + ξi
Combining the above two constraints by using the class label ti ∈ {−1, +1} we have the following
constraint for all points:
ti (w⊤ xi + w0 ) ≥ 1 − ξi
The variable ξ allows more flexibility in our model. It has the following interpretations:
⊲ ξi = 0: This means that xi is correctly classified and this data point is on the correct
side of the hyperplane and away from the margin.
⊲ 0 < ξi < 1: When this condition is met, xi lies on the correct side of the hyperplane
but inside the margin.
⊲ ξi > 1: Satisfying this condition implies that xi is misclassified.
Hence, ξ quantifies the errors in the classification of training points. We can define the soft
error as:
Esoft = Σi ξi
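The three cases above can be checked numerically: given any hyperplane, the smallest slack satisfying the constraint is ξi = max(0, 1 − ti y(xi)). The hyperplane and points below are made up for illustration:

```python
import numpy as np

w, w0 = np.array([1.0, 1.0]), -1.0          # made-up hyperplane y(x) = x1 + x2 - 1
x = np.array([[3.0, 3.0],                   # far on the correct side
              [1.2, 0.3],                   # correct side but inside the margin
              [-1.0, 0.0]])                 # wrong side of the hyperplane
t = np.array([1.0, 1.0, 1.0])               # all labeled positive
y = x @ w + w0
xi = np.maximum(0.0, 1.0 - t * y)           # minimal slack for each point
```

The first point gets ξ = 0, the second falls strictly between 0 and 1, and the misclassified point gets ξ > 1, matching the three interpretations above.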
33.3 The quadratic programming problem
The overall quadratic programming problem is, therefore, given by the following expression:
min over w of (1/2) ‖w‖² + C Σi ξi
subject to ti (w⊤ xi + w0) ≥ +1 − ξi, ∀i
ξi ≥ 0, ∀i
minimized. If C is kept large, then the soft margin Σi ξi would automatically be small. If
C is close to zero, then we are allowing the soft margin to be large, making the overall product
small.
In short, a large value of C means we have a high penalty on errors and hence our model
is not allowed to make too many mistakes in classification. A small value of C allows the errors
to grow.
0 = C − αi − µi
Substituting the above into the Lagrange function gives us the following optimization problem, also
called the dual:
Ld = −(1/2) Σi Σk αi αk ti tk xi⊤ xk + Σi αi
Σi αi ti = 0, and
0 ≤ αi ≤ C, ∀i
Similar to the separable case, we have an expression for w in terms of the Lagrange multipliers.
The objective function involves no w term. There are Lagrange multipliers αi and µi associated
with each data point.
[Figure: soft-margin SVM in the (x1, x2) plane with separating hyperplane y(x1, x2) = w1 x1 + w2 x2 + w0 = 0 and margin boundaries y(x1, x2) = ±1. Points on the margin have ξ = 0 and 0 < α < C; points inside the margin or misclassified (ξ > 1) have α = C; points away from the margin have ξ = 0 and α = 0.]
A positive value of y(x) implies the class of x is +1 and a negative value means the class is −1.
Hence, the predicted class of a test point is the sign of y(x).
αi ≥ 0
ti y(xi ) − 1 + ξi ≥ 0
αi (ti y(xi ) − 1 + ξi ) = 0
µi ≥ 0
ξi ≥ 0
µi ξi = 0
33.8 A solved example
[Figure: solved example with 2D positive and negative training points, giving w = (1, 1) and w0 = −1/3. A table lists each data point x, its label t, and its multiplier α.]
Compute w:
Compute w0:
From (1, 2): 1 + 2 + w0 = 1, so w0 = −2
From (2, 1): 2 + 1 + w0 = 1, so w0 = −2
From (−2, −2): −2 − 2 + w0 = −1, so w0 = 3
Take the average: w0 = (−2 − 2 + 3)/3 = −1/3
Shown above is a solved example for 2D training points to illustrate all the concepts. A few
things to note about this solution are:
⊲ The training data points and their corresponding labels act as input
⊲ The user defined constant C is set to 10
⊲ The solution satisfies all the constraints; however, it is not the optimal solution
⊲ We have to make sure that all the α lie between 0 and C
⊲ The sum of alphas of all negative examples should equal the sum of alphas of all positive
examples
⊲ The points (1, 2), (2, 1) and (−2, −2) lie on the soft margin on the correct side of the
hyperplane. Their α values have been arbitrarily set to 3, 3 and 6 respectively to balance
the problem and satisfy the constraints.
⊲ The points with α = C = 10 lie either inside the margin or on the wrong side of the
hyperplane
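The feasibility conditions on α noted above can be checked with a few lines of code, using the α values quoted for the three margin points (whose multipliers already balance on their own; the remaining points’ multipliers are omitted here for brevity):

```python
import numpy as np

# Multipliers quoted in the text for (1, 2), (2, 1), (-2, -2)
alpha = np.array([3.0, 3.0, 6.0])
t = np.array([1.0, 1.0, -1.0])               # labels: two positive, one negative
C = 10.0

balanced = bool(np.isclose(alpha @ t, 0.0))  # sum of alpha_i * t_i must be 0
in_box = bool(np.all((alpha >= 0) & (alpha <= C)))
```

Both conditions hold: 3 + 3 − 6 = 0 and every α lies in [0, C].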
Books
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
https://amzn.to/36yvG9w
Articles
Christopher J. C. Burges. “A Tutorial on Support Vector Machines for Pattern Recognition”.
Data mining and Knowledge Discovery, 2, 1998, pp. 121–167.
https://www.di.ens.fr/~mallat/papiers/svmtutorial.pdf
33.10 Summary
In this tutorial, you discovered the method of Lagrange multipliers for finding the soft margin
in an SVM classifier.
Specifically, you learned:
⊲ How to formulate the optimization problem for the non-separable case
⊲ How to find the hyperplane and the soft margin using the method of Lagrange multipliers
⊲ How to find the equation of the separating hyperplane for very simple problems
In the next chapter, we will see how we can implement what we learned above in Python.
Implementing a Support Vector Machine in Python
34
The mathematics that powers a support vector machine (SVM) classifier is beautiful. It is
important to not only learn the basic model of an SVM but also know how you can implement
the entire model from scratch. This is a continuation of our series of tutorials on SVMs. In
the previous two chapters, we discussed the mathematical model behind a linear SVM. In this
tutorial, we’ll show how you can build an SVM linear classifier using the optimization routines
shipped with Python’s SciPy library.
After completing this tutorial, you will know:
⊲ How to use SciPy’s optimization routines
⊲ How to define the objective function
⊲ How to define bounds and linear constraints
⊲ How to implement your own SVM classifier in Python
Let’s get started.
Overview
This tutorial is divided into two parts; they are:
⊲ The optimization problem of an SVM
⊲ Solution of the optimization problem in Python
◦ Define the objective function
◦ Define the bounds and linear constraints
⊲ Solve the problem with different C values
import numpy as np
# For optimization
from scipy.optimize import Bounds, BFGS
from scipy.optimize import LinearConstraint, minimize
# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
# For generating dataset
import sklearn.datasets as dt
Numerical optimization rarely returns exact zeros, so we need the following constant as our
own threshold to detect alphas that are numerically close to zero.
...
ZERO = 1e-7
Next, let’s define a very simple dataset, the corresponding labels, and a simple routine for
plotting this data. Optionally, if a string of alphas is given to the plotting function, then it
will also label all support vectors with their corresponding alpha values. Just to recall, support
vectors are those points for which α > 0.
dat = np.array([[0,3], [-1,0], [1,2], [2,1], [3,3], [0,0], [-1,-1], [-3,1], [3,1]])
labels = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1])
plot_x(dat, labels)
plt.show()
# Objective function
def lagrange_dual(alpha, x, t):
    result = 0
    ind_sv = np.where(alpha > ZERO)[0]
    for i in ind_sv:
        for k in ind_sv:
            result = result + alpha[i]*alpha[k]*t[i]*t[k]*np.dot(x[i, :], x[k, :])
    result = 0.5*result - sum(alpha)
    return result
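As a design note, the double loop above can be replaced by a vectorized computation of the same objective; whenever every α exceeds the ZERO threshold the two agree exactly. The function name here is an assumption, not the book's:

```python
import numpy as np

def lagrange_dual_vec(alpha, x, t):
    # 0.5 * (alpha*t)^T K (alpha*t) - sum(alpha), with K the Gram matrix of dot products
    at = alpha * t
    K = x @ x.T
    return 0.5 * at @ K @ at - alpha.sum()
```

The vectorized form avoids the Python-level double loop, which matters once the dataset grows beyond a handful of points.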
The LinearConstraint() class requires all constraints to be written in matrix form, which is:
[t0 t1 . . . tm] [α0 α1 . . . αm]⊤ = 0
The first matrix is the first parameter in the LinearConstraint() class. The left and right
bounds are the second and third arguments. This produces the LinearConstraint object that
will be used later when we call minimize().
...
linear_constraint = LinearConstraint(labels, [0], [0])
Bounds(array([0., 0., 0., 0., 0., 0., 0., 0., 0.]), array([10, 10, 10, 10, 10, 10, 10, 10, 10]))
                      bounds=bounds_alpha)
    # The optimized value of alpha lies in result.x
    alpha = result.x
    return alpha
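The head of the minimize() call is cut off in the fragment above. A self-contained sketch of what the full optimize_alpha() routine could look like follows; the trust-constr method and the random initial guess are assumptions, not necessarily the book's exact choices:

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, minimize

ZERO = 1e-7

def lagrange_dual(alpha, x, t):
    # same objective as defined earlier
    result = 0
    ind_sv = np.where(alpha > ZERO)[0]
    for i in ind_sv:
        for k in ind_sv:
            result = result + alpha[i]*alpha[k]*t[i]*t[k]*np.dot(x[i, :], x[k, :])
    return 0.5*result - sum(alpha)

def optimize_alpha(x, t, C):
    m = len(x)
    np.random.seed(1)
    alpha_0 = np.random.rand(m) * C                    # assumed initial guess
    linear_constraint = LinearConstraint(t, [0], [0])  # sum of alpha_i t_i = 0
    bounds_alpha = Bounds(np.zeros(m), np.full(m, C))  # 0 <= alpha_i <= C
    result = minimize(lagrange_dual, alpha_0, args=(x, t),
                      method='trust-constr',
                      constraints=[linear_constraint],
                      bounds=bounds_alpha)
    return result.x
```

Whatever solver is used, the returned alphas should satisfy the equality constraint and the box bounds to within solver tolerance.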
w⊤ x + w0 = 0
For the hyperplane, we need the weight vector w and the constant w0. The weight vector is
given by:
w = Σi αi ti xi
If there are too many training points, it’s best to use only support vectors with α > 0 to compute
the weight vector.
For w0 , we’ll compute it from each support vector s, for which αs < C, and then take the
average. For a single support vector xs , w0 is given by:
w0 = ts − w⊤ xs
A support vector’s α cannot be numerically exactly equal to C. Hence, we can subtract a small
constant from C to find all support vectors with αs < C. This is done in the get_w0() function:
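The get_w0() listing (and a companion that computes w) is not reproduced above; a minimal sketch consistent with the two formulas, where the function names and the C-minus-small-constant guard are assumptions, would be:

```python
import numpy as np

ZERO = 1e-7

def get_w(alpha, t, x):
    # w = sum over support vectors of alpha_i * t_i * x_i
    ind_sv = np.where(alpha > ZERO)[0]
    return np.sum((alpha[ind_sv] * t[ind_sv])[:, None] * x[ind_sv], axis=0)

def get_w0(alpha, t, x, w, C):
    # average t_s - w.x_s over support vectors with alpha_s strictly below C
    ind_sv = np.where((alpha > ZERO) & (alpha < C - ZERO))[0]
    w0 = 0.0
    for s in ind_sv:
        w0 += t[s] - np.dot(x[s], w)
    return w0 / len(ind_sv)
```

Averaging over all qualifying support vectors, rather than picking one, smooths out small numerical errors in the individual alphas.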
Let’s write the corresponding function that can take as argument an array of test points along
with w and w0 and classify various points. We have also added a second function for calculating
the misclassification rate:
Putting them together, the following is the complete code to produce an SVM from a given
dataset:
import numpy as np
# For optimization
from scipy.optimize import Bounds, BFGS
from scipy.optimize import LinearConstraint, minimize
# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
# For generating dataset
import sklearn.datasets as dt

ZERO = 1e-7
# Objective function
def lagrange_dual(alpha, x, t):
    result = 0
    ind_sv = np.where(alpha > ZERO)[0]
    for i in ind_sv:
        for k in ind_sv:
            result = result + alpha[i]*alpha[k]*t[i]*t[k]*np.dot(x[i, :], x[k, :])
    result = 0.5*result - sum(alpha)
    return result
dat = np.array([[0, 3], [-1, 0], [1, 2], [2, 1], [3, 3], [0, 0], [-1, -1], [-3, 1], [3, 1]])
labels = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1])
plot_x(dat, labels)
plt.show()
display_SVM_result(dat, labels, 100)
plt.show()
fig = plt.figure(figsize=(8,25))
i = 0
C_array = [1e-2, 100, 1e5]
for C in C_array:
    fig.add_subplot(311+i)
    display_SVM_result(dat, labels, C)
    i = i + 1
The above is a nice example, which shows that increasing C decreases the margin. A high
value of C adds a stricter penalty on errors. A smaller value allows a wider margin and more
misclassification errors. Hence, C defines a trade-off between the maximization of the margin and
classification errors.
import numpy as np
# For optimization
from scipy.optimize import Bounds, BFGS
from scipy.optimize import LinearConstraint, minimize
# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
# For generating dataset
import sklearn.datasets as dt
ZERO = 1e-7
# Objective function
def lagrange_dual(alpha, x, t):
    result = 0
    ind_sv = np.where(alpha > ZERO)[0]
    for i in ind_sv:
        for k in ind_sv:
            result = result + alpha[i]*alpha[k]*t[i]*t[k]*np.dot(x[i, :], x[k, :])
    result = 0.5*result - sum(alpha)
    return result
fig = plt.figure(figsize=(15,8))
i = 0
C_array = [1e-2, 100, 1e5]
for C in C_array:
    fig.add_subplot(221+i)
    display_SVM_result(dat, labels, C)
    i = i + 1
plt.show()
Books
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
https://amzn.to/36yvG9w
Articles
Christopher J. C. Burges. “A Tutorial on Support Vector Machines for Pattern Recognition”.
Data mining and Knowledge Discovery, 2, 1998, pp. 121–167.
https://www.di.ens.fr/~mallat/papiers/svmtutorial.pdf
APIs
SciPy’s optimization library.
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
scikit-learn’s sample generation library (sklearn.datasets).
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
NumPy random number generator.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
34.9 Summary
In this tutorial, you discovered how to implement an SVM classifier from scratch.
Specifically, you learned:
⊲ How to write the objective function and constraints for the SVM optimization problem
⊲ How to write code to determine the hyperplane from Lagrange multipliers
⊲ The effect of C on determining the margin
This marks the end of this book. We hope you now find yourself capable of understanding the
literature on machine learning algorithms when it mentions calculus.
VII
Appendix
Notations in Mathematics
A
If you do not come from a mathematical background, you may find the notation used in
mathematics confusing. Believe it or not, mathematicians are fond of rigorous formulation and
logic, but the notation used is sometimes ambiguous.
In the following, we list all the notations you can find in the previous chapters and
explain them. We hope this will make them easier to follow.
Delta. The Greek letter δ is usually used to mean a change in something. Therefore we usually
see notations such as δx = xn+1 − xn. Sometimes the uppercase delta is used instead, for example ∆x.
Multiplication. Multiplication is so common that we prefer to omit the operator when there
is no confusion. For example, in the equation y = mx + c, we write m and x together to mean
m multiplied by x. Sometimes we would like to make the multiplication explicit, so we may
write m × x or m · x, or even (m)(x).
Vectors. If we want to emphasize that a symbol is a vector, we may write it in bold or with
an arrow, such as x or ~x. But we may just write it as x if there is no confusion or ambiguity.
The vectors in mathematics are analogous to one-dimensional arrays in programming. Hence we
may sometimes write w = ⟨w0, w1, w2⟩, but we may also use round or square brackets, such as
(w0, w1, w2) or [w0, w1, w2]. Under the geometrical context, we may have unit vectors defined,
and all other vectors can be written using unit vectors. For example, we may see a vector in
two-dimensional space as xi + yj, with i and j being unit vectors along the horizontal and vertical
axes.
Norm of vectors. For a vector v = ⟨x, y, z⟩, quite often we want to know how “long” it is,
which is defined as √(x² + y² + z²). This operation is so common that we have the notation ‖v‖2
to mean taking the square of each element of this vector, summing them up, and taking the square
root of the sum. Hence ‖v‖2² is essentially the sum of the squares of the elements (without taking the
square root afterwards). Sometimes we write ‖v‖ instead of ‖v‖2 for a cleaner notation. And if
we write ‖v‖k, it means to sum the k-th power of the absolute value of each element and take the k-th root. Hence
‖v‖1 is just the sum of the absolute values of all elements, and we abuse this notation to make ‖v‖0 mean how
many nonzero elements there are in vector v.
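These norms map directly onto numpy:

```python
import numpy as np

v = np.array([3.0, 4.0, 12.0])
l2 = np.linalg.norm(v)        # ||v||_2 = sqrt(9 + 16 + 144) = 13
l2_squared = l2 ** 2          # ||v||_2 squared = sum of squares = 169
l1 = np.linalg.norm(v, 1)     # ||v||_1 = sum of absolute values = 19
l0 = np.count_nonzero(v)      # "||v||_0" = number of nonzero elements = 3
```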
Matrices. A matrix is usually represented by an uppercase letter, and sometimes we write it
in bold, such as W. If we write its elements out, we usually use square or round brackets, such
as
[w00 w01 w02 w03]      (w00 w01 w02 w03)
[w10 w11 w12 w13]  or  (w10 w11 w12 w13).
Multiplication of vectors or matrix. If necessary, we will use a dot to mean the vector
multiplication, such as w · x. Because of the nature of matrix multiplication, we may sometimes
consider vectors as column matrices and write w⊤ x instead. We are careful to write w × x for
vectors as it means a different kind of multiplication (the cross product instead of dot product).
On the contrary, we may see A · B, A × B, or even AB to mean the same multiplication for
matrices. One special kind of multiplication for matrices is called the Hadamard product, which
is to multiply elementwise and denoted as A ⊙ B or A ⊗ B.
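In numpy, these different products map to distinct operators, which is a quick way to keep the notation straight:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

matmul = A @ B        # matrix multiplication, A.B or AB in the notation above
hadamard = A * B      # elementwise (Hadamard) product
```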
Sets. We usually see the symbol R to mean all real numbers. Similarly, we may use Z to mean
all integers (positive and negative) and N to mean all natural numbers (positive integers only).
When we want to represent a vector of n real numbers, we use Rn for that. When there is a
set X of multiple elements, we can say x is one of them by writing x ∈ X. If we mean for any x in the
set, we can write ∀x ∈ X or simply ∀x.
Functions. We usually write a function as f(x) to mean it takes the value x, and hence f(g(x)) is a
composite function in which g takes the value x and f takes the result of g. But to define a function
accurately, we may write f : Rk → Rn to mean that function f takes a vector of k real numbers as
its input and produces a vector of n real numbers.
Summation, products, and factorials. We write Σ_{k=0}^{3} f(k) to mean f(0) + f(1) + f(2) + f(3),
and Π_{j=2}^{4} xj to mean x2 × x3 × x4. These are the math expressions that resemble a loop in programming.
Factorial is a particular product that is used a lot, and hence we use the notation 4! to mean
Π_{k=1}^{4} k = 1 × 2 × 3 × 4.
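In Python, these constructs correspond directly to a loop or to the built-in helpers:

```python
import math

def f(k):
    return k * k

total = sum(f(k) for k in range(0, 4))          # sum from k = 0 to 3 of f(k)
x = [10, 20, 2, 3, 4]                           # x_0 .. x_4
product = math.prod(x[j] for j in range(2, 5))  # product from j = 2 to 4 of x_j
factorial = math.factorial(4)                   # 4! = 1*2*3*4
```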
Differentiation. For a function g(x), its derivative can be written as any of the following:
g′(x)    dg/dx    (d/dx) g(x)    ġ
Newton used ġ in his work but Leibniz used dg/dx. For a higher order derivative, we may use
g′′(x) and g′′′(x) for the second and third order derivatives, and subsequently g^(n)(x) for the n-th order.
Leibniz’s notation is more convenient, as we write dⁿg/dxⁿ or (dⁿ/dxⁿ) g(x) for the n-th order. Newton’s
notation, however, uses g with two, three, or four dots above it for the second, third, and fourth order
derivatives respectively.
For partial derivatives, we use (∂/∂x) g(x, y), and for higher order we use (∂ⁿ/∂xⁿ) g(x, y) or
(∂ⁿ/∂xᵐ ∂yⁿ⁻ᵐ) g(x, y). But sometimes we use the more convenient notation gx, or gxx and gxy for the
second order.
How to Setup a Workstation for Python
B
It can be difficult to install a Python machine learning environment on some platforms. Python
itself must be installed first and then there are many packages to install, and it can be confusing
for beginners. In this tutorial, you will discover how to setup a Python machine learning
development environment using Anaconda. After completing this tutorial, you will have a
working Python environment to begin learning, practicing, and developing machine learning
software. These instructions are suitable for Windows, Mac OS X, and Linux platforms. I will
demonstrate them on Windows, so you may see some Windows dialogs and file extensions.
B.1 Overview
In this tutorial, we will cover the following steps:
1. Download Anaconda
2. Install Anaconda
3. Start and Update Anaconda
Note: The specific versions may differ as the software and libraries are updated
frequently.
2. Click “Products” from the menu and click “Individual Edition” to go to the download
page https://www.anaconda.com/products/individual-d/.
This will download the Anaconda Python package to your workstation. It will automatically
give you the installer according to your OS (Windows, Linux, or MacOS). The file is about 480
MB. You should have a file with a name like:
Anaconda3-2021.05-Windows-x86_64.exe
B.3 Install Anaconda
Installation is quick and painless. There should be no tricky questions or sticking points.
The installation should take less than 10 minutes and take a bit more than 5 GB of
space on your hard drive.
You can use the Anaconda Navigator and graphical development environments later; for
now, I recommend starting with the Anaconda command line environment called conda1. Conda
is fast and simple, it’s hard for error messages to hide, and you can quickly confirm your environment
is installed and working correctly.
1. Open a terminal or CMD.exe prompt (command line window).
2. Confirm conda is installed correctly, by typing:
conda -V
conda 4.10.1
python -V
1 https://conda.pydata.org/docs/index.html
Python 3.8.8
If the commands do not work or have an error, please check the documentation for help
for your platform. See some of the resources in the “Further Reading” section.
4. Confirm your conda environment is up-to-date, type:
You may need to install some packages and confirm the updates.
5. Confirm your SciPy environment.
The script below will print the version number of the key SciPy libraries you require
for machine learning development, specifically: SciPy, NumPy, Matplotlib, Pandas,
Statsmodels, and Scikit-learn. You can type “python” and type the commands in
directly. Alternatively, I recommend opening a text editor and copy-pasting the script
into your editor.
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)
Program B.1: Code to check that key Python libraries are installed
Save the script as a file with the name: versions.py. On the command line, change
your directory to where you saved the script and type:
python versions.py
scipy: 1.7.1
numpy: 1.20.3
matplotlib: 3.4.2
pandas: 1.3.2
statsmodels: 0.12.2
sklearn: 0.24.2
⊲ Anaconda Navigator
https://docs.continuum.io/anaconda/navigator.html
⊲ Using conda
https://conda.pydata.org/docs/using/
B.6 Summary
Congratulations, you now have a working Python development environment for machine learning.
You can now learn and practice machine learning on your workstation.
How to Solve Calculus Problems
C
Calculus is a topic of mathematics. For problems in calculus, we have various ways to check if
your solution is correct. If you want to use a computer to verify your solution, or use a computer
to solve a calculus problem for you, there are several approaches you can try.
For more complicated expressions, you may need to learn to write in the Wolfram Language.
The above example would be written as
D[x^2 Sin[Cos[x]], x]
C.3 SymPy
In Python, we have a CAS library, SymPy, that can do basic evaluations. If you haven’t installed
this library yet, you can do so using pip:
pip install sympy
SymPy is a library that allows you to define symbols in Python. Some common functions are
also provided by SymPy to help you specify the problem. For example, the same differentiation
problem as mentioned above can be solved using:
from sympy import Symbol, sin, cos, diff

x = Symbol("x")
expression = x**2 * sin(cos(x))
print(expression)
print(diff(expression))
x**2*sin(cos(x))
-x**2*sin(x)*cos(cos(x)) + 2*x*sin(cos(x))
In SymPy, we need to define variables as Symbol objects and use them to build an expression.
The syntax for defining an expression is the same as Python arithmetic: we need to explicitly use
* for all multiplications, and exponents are introduced using **. Once you have the symbols
defined, you can find limits using the limit() function, differentiation using the diff() function,
and integration using the integrate() function. Partial derivatives are also supported, for example,
y = tanh(wx + b)
from sympy import symbols, tanh, diff

w, x, b = symbols("w x b")
y = tanh(w*x + b)
print(y)
print(diff(y, w))
print(diff(y, b))
tanh(b + w*x)
x*(1 - tanh(b + w*x)**2)
1 - tanh(b + w*x)**2
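Besides diff(), the limit() and integrate() functions mentioned above follow the same pattern.
A minimal sketch (the expressions here are our own illustrations, not examples from the text):

```python
from sympy import Symbol, sin, limit, integrate

x = Symbol("x")

# Limit of sin(x)/x as x approaches 0
print(limit(sin(x)/x, x, 0))

# Indefinite integral of x**2 with respect to x (SymPy omits the constant)
print(integrate(x**2, x))

# Definite integral of x**2 from 0 to 1
print(integrate(x**2, (x, 0, 1)))
```

This prints 1, x**3/3, and 1/3 respectively.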
The symbols used to build an expression are not limited to single letters, so you can write
w1, w2, w3 = symbols("w1 w2 w3") if you prefer. Since single-letter symbols are so common, you
can avoid defining them by importing them from sympy.abc instead. Moreover, if you find this
notation too clumsy to read, you can choose to "pretty print" a SymPy expression using
the pprint() function. This is illustrated in the following rewrite:
from sympy import pprint

y = tanh(w*x + b)
pprint(y)
pprint(diff(y, w))
pprint(diff(y, b))
tanh(b + w⋅x)
⎛ 2 ⎞
x⋅⎝1 - tanh (b + w⋅x)⎠
2
1 - tanh (b + w⋅x)
SymPy allows you to do much more than this. For example, solving equations, simplifying
expressions, and plotting are also supported. To learn more, you should start with its tutorial and
full documentation:
⊲ SymPy tutorial (https://docs.sympy.org/latest/tutorial/index.html)
⊲ SymPy documentation (https://docs.sympy.org/latest/index.html)
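For a taste of those capabilities, here is a short sketch of equation solving and simplification
using SymPy's solve() and simplify() functions (the expressions are our own examples):

```python
from sympy import Symbol, Eq, solve, simplify, sin, cos

x = Symbol("x")

# Solve the equation x**2 = 2 for x
print(solve(Eq(x**2, 2), x))

# Simplify the trigonometric identity sin(x)**2 + cos(x)**2
print(simplify(sin(x)**2 + cos(x)**2))
```

This prints [-sqrt(2), sqrt(2)] and 1.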
How Far You Have Come
You made it. Well done. Take a moment and look back at how far you have come. You now
know:
⊲ What calculus is and how it emerged as a branch of mathematics
⊲ What other branches of mathematics are related to calculus
⊲ The two main topics of calculus are differentiation and integration, and they are the reverse
of each other
⊲ The rate of change of a quantity, or the slope in geometry, can be represented by the
differentiation of a function
⊲ Differentiation is done by taking limits, but there are a number of rules to help us do
it faster
⊲ Calculus is not only for functions of a single variable. We have multivariate calculus
for vector-valued functions and functions of multiple variables
⊲ With calculus, we can solve constrained optimization problems using the method of
Lagrange multipliers
⊲ With calculus, we can approximate a function using its Taylor series expansion
⊲ How gradient descent uses the derivative of a function to determine the direction of
optimization
⊲ How the backpropagation procedure in neural networks gets its name from using the chain
rule
⊲ How the support vector machine finds its solution using the method of Lagrange multipliers
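The gradient descent point above can be seen in a few lines of code. This toy sketch (our own
example, using SymPy's diff() and lambdify() functions) minimizes a simple quadratic by stepping
against the derivative:

```python
from sympy import Symbol, diff, lambdify

# f(x) = (x - 3)**2 has its minimum at x = 3
x = Symbol("x")
f = (x - 3)**2
grad = lambdify(x, diff(f, x))  # turn df/dx into a numeric function

value = 0.0                     # starting guess
for _ in range(100):
    value -= 0.1 * grad(value)  # step against the gradient

print(round(value, 4))          # converges toward the minimizer x = 3
```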
Don’t make light of this. You have come a long way in a short amount of time. You have
developed important and valuable foundational skills in calculus. You can now confidently:
⊲ Understand the calculus notation in machine learning papers.
⊲ Implement the calculus expressions of machine learning algorithms into code.
⊲ Describe the calculus operations of your machine learning models.
The sky’s the limit.
Thank You!
We want to take a moment and sincerely thank you for letting us help you start your calculus
journey. We hope you keep learning and have fun as you continue to master machine learning.