Mathematics of Machine Learning
Mathematics of Machine Learning
Tivadar Danka
I Introduction 3
1 Introduction 7
1.1 What is this book about? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 How to read this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
II Linear algebra 13
2 Vectors in theory 15
2.1 Representing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 What is a vector space? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Examples of vector spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Linear basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Vectors in practice 31
3.1 Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 NumPy arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Is NumPy really faster than Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
i
6.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
ii
14.5 Orthogonal projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
14.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
18 Numbers 211
18.1 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
19 Sequences 217
19.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
19.2 Famous convergent sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
19.3 The big and small O notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
19.4 Real numbers are sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
iii
23.4 Numerical differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
23.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
iv
32 Minima and maxima in multiple dimensions 349
32.1 Local extrema in multiple dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
38 Distributions 399
38.1 Discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
38.2 Law of total probability, revisited once more . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
38.3 Real-valued distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
38.4 Notable real-valued distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
38.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
39 Densities 417
39.1 Density functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
39.2 Classification of real-valued random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
v
41 The Law of Large Numbers 437
41.1 Tossing coins… . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
41.2 …rolling dices… . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
41.3 …and all the rest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
41.4 The strong law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
VI Statistics 447
XI Appendix 457
42 It’s just logic 459
42.1 Mathematical logic 101 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
42.2 Logical connectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
42.3 The propositional calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
42.4 Variables and predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
42.5 Existential and universal quantification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
42.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
46 Bibliography 487
Bibliography 489
vi
Mathematics of Machine Learning
Hey there!
First of all, thank you! Your support is what makes this book possible! You are an absolute legend.
In the early access program, I’ll release the sections of this book as I write them. During our time together, my goal is
to guide you through the inner workings of machine learning, from high school mathematics to backpropagation. Each
week, a new chapter will be published, and I’ll be there for you to discuss your thoughts about the book.
For this purpose, I have created a Discord server, where I’ll be available for you at all times. You can join here: https:
//discord.gg/JC2RFpzun6
To give you a heads up, I am focusing on the content first, appearances second. So, some figures might be clumsy, and
the editing might not be perfect. Don’t worry, though. These will be fixed before the full release.
I am planning to finish writing the content until the end of 2023. After the content is finalized, I’ll focus on editing and
formatting. This is especially important regarding the pdf version of the book. To avoid spending an excessive
amount of time with LaTeX customization, prettifying the pdf is going to take place at a later time.
• The latest version of the book each week, in an interactive Jupyter Book format + pdf.
• Exclusive access to a new chapter each week as I finish them.
• A personal hotline to me where you can share your feedback with me to build the best learning resource for you.
Writing a book is a long and challenging project. I want to do this the right way, so I decided to dedicate 100% of my
time and energy. However, I can’t do this without your support. I created the Early Access Program for those wishing to
join me in this journey. With you signing up for the Early Access Program, I’ll get
• your financial support so I can work on this project full time,
• and your continual feedback, which is essential for me to write the best book on the subject for you.
CONTENTS 1
Mathematics of Machine Learning
There are two main ways. You can contact me at the Discord server of the early access (join here: https://discord.gg/
JC2RFpzun6), or you can shoot a message on Twitter, you can find me here: https://twitter.com/TivadarDanka.
Acknowledgements
This book is dedicated to my mother, who I lost while making this book. Thanks, Mom! You are inside every line I write.
2 CONTENTS
Part I
Introduction
3
Mathematics of Machine Learning
5
Mathematics of Machine Learning
6
CHAPTER
ONE
INTRODUCTION
Why do we have to learn mathematics? - This is a question I am asked and think about almost every day.
On the surface, advanced mathematics doesn’t impact software engineering and machine learning in a production setting.
You don’t have to calculate gradients, solve linear equations, or find eigenvalues by hand. Basic and advanced algorithms
are abstracted away into libraries and APIs, performing all the hard work for you.
Nowadays, implementing a state-of-the-art deep neural network is almost equivalent to instantiating an object in Tensor-
Flow, loading the pre-trained weights, and letting the data blaze through the model. Just like all technological advances,
this is a double-edged sword. On the one hand, frameworks that accelerate prototyping and development enable machine
learning in practice. Without them, we wouldn’t have seen the explosion in deep learning that we witnessed in the last
decade.
On the other hand, high-level abstractions are barriers between us and the underlying technology. User-level knowledge
is only sufficient when one is treading on familiar paths. (Or until something breaks.)
If you are not convinced, let’s do a thought experiment! Imagine moving to a new country without speaking the language
and knowing the way of life. However, you have a smartphone and a reliable internet connection.
How do you start exploring?
With Google Maps and a credit card, you can do many awesome things there: explore the city, eat in excellent restaurants,
have a good time. You can do the groceries every day without speaking a word: just put the stuff in your basket and swipe
your card at the cashier.
After a few months, you’ll start to pick up some language as well—simple things like saying greetings or introducing
yourself. You are off to a good start!
There are built-in solutions for everyday tasks that just work—food ordering services, public transportation, etc. However,
at some point, they will break down. For instance, you need to call the delivery person who dropped off your package at
the wrong door. This requires communication.
You may also want to do more. Get a job, or perhaps even start your own business. For that, you need to communicate
with others effectively.
Learning the language when you plan to live somewhere for a few months is unnecessary. However, if you want to stay
there for the rest of your life, it is one of the best investments you can make.
Now replace the country with machine learning and the language with mathematics.
Fact is, algorithms are written in the language of mathematics. To work with algorithms on a professional level, you have
to speak it.
7
Mathematics of Machine Learning
There is a similarity between knowing one’s way about a town and mastering a field of knowledge; from any
given point one should be able to reach any other point. One is even better informed if one can immediately
take the most convenient and quickest path from the one point to the other. — George Pólya and Gábor Szegő,
in the introduction of the legendary book Problems and Theorems in Analysis
The above quote is one of my all-time favorites. For me, it implies that knowledge rests on many pillars. Like a chair
has four legs, a well-rounded machine learning engineer also has several skills that enable them to be effective in their
job. Each of us focuses on a balanced constellation of skills, and for many, mathematics is a great addition. You can start
machine learning without advanced mathematics, but at some point in your career, getting familiar with the mathematical
background of machine learning can help you bring your skills to the next level.
In my opinion, there are two paths to mastery in deep learning. One starts from the practical parts, the other starts from
theory. Both are perfectly viable, and eventually, they intertwine. This book is for those who started on the practical,
application-oriented path, like data scientists, machine learning engineers, or even software developers interested in the
topic.
This book is not a 100% pure mathematical treatise. At points, I will make some shortcuts to balance between clarity
and mathematical correctness. My goal is to give you the “Eureka!” moments and help you understand the bigger picture
instead of getting you ready for a PhD in mathematics.
Most machine learning books I have read fall into one of the two categories.
1. Focus on practical applications, unclear or imprecise with mathematical concepts.
2. Focus on theory, involving heavy mathematics with almost no real applications.
I want this book to offer the best of both: a sound introduction of basic and advanced mathematical concepts, keeping
machine learning in sight at all times. My goal is not only to cover the bare fundamentals but to give a breadth of knowl-
edge. In my experience, to master a subject, one needs to go both deep and wide. Covering only the very essentials of
mathematics would be like a tightrope walk. Instead of performing a balancing act every time you encounter a mathe-
matical subject in the future, I want you to gain a stable footing. Such confidence can bring you very far and set you apart
from others.
During our journey, we are going to follow this roadmap. (You might need to zoom in, as the figure is relatively large.)
Before we start, let’s take a brief look into each part.
We are going to begin our journey with linear algebra. In machine learning, data is represented by vectors. Essentially,
training a learning algorithm is finding more descriptive representations of data through a series of transformations.
Linear algebra is the study of vector spaces and their transformations.
Simply speaking, a neural network is just a function mapping the data to a high-level representation. Linear transforma-
tions are the fundamental building blocks of these. Developing a good understanding of them will go a long way, as they
are everywhere in machine learning.
8 Chapter 1. Introduction
Mathematics of Machine Learning
1.1.2 Calculus
While linear algebra shows how to describe predictive models, calculus has the tools to fit them to the data. When you
train a neural network, you are almost certainly using gradient descent, which is rooted in calculus and the study of
differentiation.
Besides differentiation, its “inverse” is also a central part of calculus: integration.
Integrals are used to express essential quantities such as expected value, entropy, mean squared error, and many more.
They provide the foundations for probability and statistics.
When doing machine learning, we deal with functions with millions of variables.
10 Chapter 1. Introduction
Mathematics of Machine Learning
In higher dimensions, things work differently. This is where multivariable calculus comes in, where differentiation and
integration are adapted to these spaces.
How to draw conclusions from experiments and observations? How to describe and discover patterns in them?
These are answered by probability theory and statistics, the logic of scientific thinking.
Linear algebra, calculus, and probability theory form the foundations of mathematics in machine learning. They are just
the starting points. The most exciting stuff comes after we are familiar with them! Advanced statistics, optimization
techniques, backpropagation, the internals of neural networks. In the second part of the book, we will take a detailed look
at all of those.
Mathematics follows a definition-theorem-proof structure that might be difficult to follow at first. If you are unfamiliar
with such a flow, don’t worry. I’ll give a gentle introduction right now.
In essence, mathematics is the study of abstract objects (such as functions) through their fundamental properties. Instead
of empirical observations, mathematics is based on logic, making it universal. A correct mathematical result is set in
stone, remaining valid forever. (Or, until the axioms of logic change.) If we want to use the powerful tool of logic, the
mathematical objects need to be precisely defined. Definitions are presented in boxes like this below.
Given a definition, results are formulated as if A, then B statements, where A is the premise, and B is the conclusion. Such
results are called theorems. For instance, if a function is differentiable, then it is also continuous. If a function is convex,
then it has global minima. If we have a function, then we can approximate it with arbitrary precision using a single-layer
neural network. You get the pattern. Theorems are the core of mathematics.
We must provide a sound logical argument to accept the validity of a proposition, one that deduces the conclusion from
the premise. This is called a proof, responsible for the steep learning curve of mathematics. Contrary to other scientific
disciplines, proofs in mathematics are indisputable statements, set in stone forever. On a practical note, look out for these
boxes.
To enhance the learning experience, I’ll often make good-to-know but not absolutely essential information into remarks.
The most effective way of learning is building things and putting theory into practice. In mathematics, this is the only way
to learn. What this means to you is need to read through the text carefully. Don’t take anything for granted just because
it is written down. Think through every sentence, take apart every argument and calculation. Try to prove theorems by
yourself before reading the proofs.
12 Chapter 1. Introduction
Part II
Linear algebra
13
CHAPTER
TWO
VECTORS IN THEORY
I want to point out that the class of abstract linear spaces is no larger than the class of spaces whose elements
are arrays. So what is gained by abstraction? First of all, the freedom to use a single symbol for an array;
this way we can think of vectors as basic building blocks, unencumbered by components. The abstract view
leads to simple, transparent proofs of results. — Peter D. Lax, in Chapter 1 of his book Linear Algebra and
its Applications
The mathematics of machine learning rest upon three pillars: linear algebra, mathematical analysis, and probability theory.
Linear algebra describes how to represent and manipulate data; mathematical analysis helps us define and fit models; while
probability theory helps interpret them.
These are building on top of each other, and we will start at the beginning: representing and manipulating data.
To guide us throughout this section, we will look at the famous Iris dataset. This contains the measurements from three
species of Iris: the lengths and widths of sepals and petals. Each data point includes these four measurements, and we
also know the corresponding species (Iris setosa, Iris virginica, Iris versicolor).
The dataset can be loaded right away from scikit-learn, so let’s take a look!
data = load_iris()
X, y = data["data"], data["target"]
X[:10]
15
Mathematics of Machine Learning
Before going into the mathematical definitions, let’s establish a common vocabulary first. The measurements themselves
are stored in a tabular format. Rows represent samples, and columns represent measurements. A particular measurement
type is often called feature. As X.shape tells us, the Iris dataset has 150 data points and four features.
X.shape
(150, 4)
For a given sample, the corresponding species is called the label. In our case, this is either Iris setosa, Iris virginica, or
Iris versicolor. Here, the labels are encoded with the numbers 0, 1, and 2.
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In its entirety, the Iris dataset forms a matrix, and the data points form vectors. Simply speaking, matrices are tables,
while vectors are tuples of numbers. (Tuples are are just finite sequence of numbers, like (1.297, −2.35, 32.3, 29.874).)
However, this view doesn’t give us the big picture. Moreover, since we have more than three features, we cannot visualize
the dataset easily. As humans cannot see in more than three dimensions, visualization breaks down.
Besides representing the data points in a compact form, we want to perform operations on them, like addition and scalar
multiplication. Why do we need to add data points together? To give you a simple example, it is often beneficial if the
features are on the same scale. If a given feature is distributed on a smaller scale than the others, it will have less influence
on the predictions.
Think about this: if somebody is whispering to you something from the next room while speakers blast loud music right
next to your ear, you won’t hear anything from what the person is saying to you. Large-scale features are the blasting
music, while the smaller ones are the whisper. You may obtain much more information from the whisper, but you need
to quiet down the music first.
To see this phenomenon in action, let’s take a look at the distribution of the features of our dataset!
You can see in Fig. 2.1 that some are more stretched-out (like the sepal length), while others are narrower (like sepal
width). In practical scenarios, this can hurt the predictive performance of our algorithms.
To solve it, we can remove the mean and the standard deviation of a dataset. If the dataset consists of the vectors
𝑥1 , 𝑥2 , … , 𝑥150 , we can calculate their mean by
1 150
𝜇= ∑ 𝑥 ∈ ℝ4
150 𝑖=1 𝑖
In other words, the mean describes the average of samples, while the standard deviation represents the average distance
from the mean. The larger the standard deviation is, the more spread out the samples are.
With these quantities, the scaled dataset can be described as
𝑥1 − 𝜇 𝑥2 − 𝜇 𝑥 −𝜇
, , … , 150 ,
𝜎 𝜎 𝜎
where both the substraction and the division are taken elementwise.
If you are familiar with Python and NumPy, this is how it is done. (Don’t worry if you are not, everything you need to
know about them will be explained in the next chapter, with example code.)
X_scaled = (X - X.mean(axis=0))/X.std(axis=0)
X_scaled[:10]
If you compare this modified version with the original, you can see that its features are on the same scale. From a (very)
abstract point of view, machine learning is nothing else but a series of learned data transformations, turning raw data into
a form where prediction is simple.
In a mathematical setting, manipulating data and modeling its relations to the labels arise from the concept of vector spaces
and transformations between them. Let’s take the first steps by making the definition of vector spaces precise!
Representing multiple measurements as a tuple (𝑥1 , 𝑥2 , … , 𝑥𝑛 ) is a natural idea that has a ton of merits. The tuple form
suggests that the components belong together, giving a clear and concise way to store information.
However, this comes at a cost: now we have to deal with more complex objects. Despite having to deal with objects like
(𝑥1 , … , 𝑥𝑛 ) instead of numbers, there are similarities. For instance, if 𝑥 = (𝑥1 , … , 𝑥𝑛 ) and 𝑦 = (𝑦1 , … , 𝑦𝑛 ) are two
arbitrary tuples,
• they can be added together by 𝑥 + 𝑦 = (𝑥1 + 𝑦1 , … , 𝑥𝑛 + 𝑦𝑛 ),
• and they can be multiplied with scalars: if 𝑐 ∈ ℝ, then 𝑐𝑥 = (𝑐𝑥1 , … , 𝑐𝑥𝑛 ).
It’s almost like using a number.
These operations have clear geometric interpretations as well. Adding two vectors together is the same as translation,
while multiplying with a scalar is a simple stretching. (Or squeezing, if |𝑐| < 1.)
On the other hand, if we want to follow our geometric intuition (which we definitely do), it is unclear how to define vector
multiplication. The definition
𝑥𝑦 = (𝑥1 𝑦1 , … , 𝑥𝑛 𝑦𝑛 )
might make sense algebraically, but it is unclear what it means in a geometric sense.
When we think about vectors and vector spaces, we are thinking about a mathematical structure that fits our intuitive
views and expectations. So, let’s turn these into the definition!
At first sight, this definition is certainly too complex to comprehend. It seems like just a bunch of sets, operations, and
properties thrown together. However, to help us build a mental model, we can imagine a vector as an arrow, starting from
the null vector. (Recall that the null vector 0 is that special one for which 𝑥 + 0 = 𝑥 holds for all 𝑥. Thus, it can be
considered as an arrow with zero length; the origin.)
To further familiarize ourselves with the concept, let’s see some examples of vector spaces!
Examples are one of the best ways of building insight and understanding of seemingly difficult concepts like vector
spaces. We, humans, usually think in terms of models instead of abstractions. (Yes, this includes pure mathematicians.
Even though they might deny it.)
Example 1. The most ubiquitous instance of the vector space is (ℝ𝑛 , ℝ, +, ⋅), the same one we used to motivate the
definition itself. (ℝ𝑛 refers to the n-fold Cartesian product of the set of real numbers. If you are unfamiliar with this
notion, check the set theory tutorial in the Appendix.)
(ℝ𝑛 , ℝ, +, ⋅) is the canonical model, the one we use to guide us throughout our studies. If 𝑛 = 2, we are simply talking
about the familiar Euclidean plane.
Using ℝ2 or ℝ3 for visualization can help a lot. What works here will usually work in the general case, although sometimes
this can be dangerous. Math relies on both intuition and logic. We develop ideas using our intuition, but we confirm them
with our logic.
Example 2. Not all vector spaces take the form of a collection of finite tuples. An example of this is the space of
polynomials with real coefficients, defined by
𝑛
ℝ[𝑥] = { ∑ 𝑝𝑖 𝑥𝑖 ∶ 𝑝𝑖 ∈ ℝ, 𝑛 = 0, 1, … }.
𝑖=0
(𝑐𝑝)(𝑥) = 𝑐𝑝(𝑥).
With these operations, (ℝ[𝑥], ℝ, +, ⋅) is a vector space. Although most of the time we percieve polynomials as functions,
they can be represented as tuples of coefficients as well:
𝑛
∑ 𝑝𝑖 𝑥𝑖 ⟷ (𝑝0 , … , 𝑝𝑛 ).
𝑖=0
Note that 𝑛, the degree of the polynomial, is unbounded. As a consequence, this vector space has a significantly richer
structure than ℝ𝑛 .
Example 3. The previous example can be further generalized. Let 𝐶([0, 1]) denote the set of all continuous real functions
𝑓 ∶ [0, 1] → ℝ. Then (𝐶(ℝ), ℝ, +, ⋅) is a vector space, where the addition and scalar multiplication are defined just as in
the previous example:
Although our vector spaces contain infinitely many vectors, we can reduce the complexity by finding special subsets that
can express any other vector.
To make this idea precise, let’s consider our recurring example ℝ𝑛 . There, we have a special vector set
𝑒1 = (1, 0, … , 0)
𝑒2 = (0, 1, … , 0)
⋮
𝑒𝑛 = (0, 0, … , 1)
Let’s zoom out from the special case ℝ𝑛 and start talking about general vector spaces. From our motivating example
regarding bases, we have seen that sums of the form
𝑛
∑ 𝑥𝑖 𝑣𝑖 ,
𝑖=1
where the 𝑣𝑖 -s are vectors and the 𝑥𝑖 coefficients are scalars, play a crucial role. These are called linear combinations. A
linear combination is called trivial if all of the coefficients are zero.
Given a set of vectors, the same vector can potentially be expressed as a linear combination in multiple ways. For example,
if 𝑣1 = (1, 0), 𝑣2 = (0, 1), and 𝑣3 = (1, 1), then
(2, 1) = 2𝑣1 + 𝑣2 = 𝑣1 + 𝑣3 .
This suggests that the set 𝑆 = {𝑣1 , 𝑣2 , 𝑣3 } is redundant, as it contains duplicate information. The concept of linear
dependence and independence makes this precise.
for some nonzero 𝑣𝑘 , then by subtracting 𝑣𝑘 , we obtain that the null vector can be obtained as a nontrivial linear combi-
nation
𝑛
0 = ∑ 𝑥 𝑖 𝑣𝑖
𝑖=1
for some scalars 𝑥𝑖 , where 𝑥𝑘 = −1. This is an equivalent definition of linear dependence. With this, we have proved
the following theorem.
Theorem 1.4.1
Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } be a subset of its vectors.
(a) 𝑆 is linearly dependent if and only if the null vector 0 can be obtained as a nontrivial linear combination.
𝑛
(b) 𝑆 is linearly independent if and only if whenever 0 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 , all coefficients 𝑥𝑖 are zero.
Linear combinations provide a way to take a small set of vectors and generate a whole lot of others from them. For a
set of vectors 𝑆, taking all of its possible linear combinations is called spanning, and the generated set is called the span.
Formally, it is defined by
𝑛
span(𝑆) = { ∑ 𝑥𝑖 𝑣𝑖 ∶ 𝑛 ∈ ℕ, 𝑣𝑖 ∈ 𝑆, 𝑥𝑖 is a scalar}.
𝑖=1
Note that the vector set 𝑆 is not necessarily finite. To help illustrate the concept of span, we can visualize the process in
three dimensions. The span of two linearly independent vectors is a plane.
When we are talking about spans of a finite vector set {𝑣1 , … , 𝑣𝑛 }, we denote the span as
span(𝑣1 , … , 𝑣𝑛 ).
Proposition 1.4.1
Let 𝑉 be a vector space and 𝑆, 𝑆1 , 𝑆2 ⊆ 𝑉 be subsets of its vectors.
(a) If 𝑆1 ⊆ 𝑆2 , then span(𝑆1 ) ⊆ span(𝑆2 ).
(b) span(span(𝑆)) = span(𝑆).
Proof. The property (a) follows directly from the definition. To prove (b), we have to show that span(𝑆) ⊆ span(span(𝑆))
and span(span(𝑆)) ⊆ span(𝑆). The former follows from the definition. For the latter, let 𝑥 ∈ span(span(𝑆)). Then
𝑛
𝑥 = ∑ 𝛼𝑖 𝑥𝑖
𝑖=1
𝑚
for some 𝑥𝑖 ∈ span(𝑆). Because of 𝑥𝑖 being in the span of 𝑆, we have 𝑥𝑖 = ∑𝑗=1 𝛽𝑖,𝑗 𝑠𝑗 for some 𝑠𝑗 ∈ 𝑆. Thus,
𝑛
𝑥 = ∑ 𝛼 𝑖 𝑥𝑖
𝑖=1
𝑛 𝑚
= ∑ 𝛼𝑖 ∑ 𝛽𝑖,𝑗 𝑠𝑗
𝑖=1 𝑗=1
𝑚 𝑛
= ∑ ( ∑ 𝛼𝑖 𝛽𝑖,𝑗 )𝑠𝑗 ,
𝑗=1 𝑖=1
Because of span(span(𝑆)) = span(𝑆), if 𝑆 is linearly dependent, we can remove the redundant vectors and still keep the
span the same.
𝑛−1
Think about it: if 𝑆 = {𝑣1 , … , 𝑣𝑛 } and, say, 𝑣𝑛 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 , then 𝑣𝑛 ∈ span(𝑆\{𝑣𝑛 }). So,
Among sets of vectors, those that generate the entire vector space are special. Remember that we started the discussion
about linear combinations to find subsets that can be used to express any vector. After all this setup, we are ready to make
a formal introduction. Any set of vectors 𝑆 that have the property span(𝑆) = 𝑉 is called a generating set for 𝑉 .
𝑆 can be thought of as a “lossless compression” of 𝑉 , as it contains all the information needed to reconstruct any element
in 𝑉 , yet it is smaller than the entire space. Thus, it is natural to aim to reduce the size of the generating set as much as
possible. This leads us to one of the most important concepts in linear algebra: minimal generating sets, or bases, as we
prefer to call them.
With all the intuition we have built so far, let’s jump into the definition right away!
It can be shown that these defining properties mean that every vector 𝑥 can be uniquely written as a linear combination of
𝑆. (This is left as an exercise for the reader)
Let’s see some examples! In ℝ3 , the set {(1, 0, 0), (0, 1, 0), (0, 0, 1)} is a basis, but so is {(1, 1, 1), (1, 1, 0), (0, 1, 1)}.
So, there can be more than one basis for the same vector space.
For ℝ𝑛 , the most commonly used basis is {𝑒1 , … , 𝑒𝑛 }, where 𝑒𝑖 is a vector whose all coordinates are 0, except the 𝑖-th
one, which is 1. This is called the standard basis.
In terms of the “information” contained in a set of vectors, bases hit the sweet spot. Adding any new vector to a basis
set would introduce redundancy; removing any of its elements would cause the set to be incomplete. These notions are
formalized in the two theorems below.
Theorem 1.4.2
Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } a subset of vectors. The following are equivalent.
(a) 𝑆 is a basis.
(b) 𝑆 is linearly independent and for any 𝑥 ∈ 𝑉 \𝑆, the vector set 𝑆 ∪ {𝑥} is linearly dependent. In other words, 𝑆 is a
maximal linearly independent set.
Proof. To show the equivalence of two propositions, we have to prove two things: that (a) implies (b); and that (b) implies
(a). Let’s start with the first one!
𝑛
(a) ⟹ (b) If 𝑆 is a basis, then any 𝑥 ∈ 𝑉 can be written as 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 for some 𝑥𝑖 ∈ ℝ. Thus, by definition,
𝐸 ∪ {𝑥} is linearly dependent.
(b) ⟹ (a). Our goal is to show that any 𝑥 can be written as a linear combination of the vectors in 𝑆. By our assumption,
𝑆 ∪ {𝑥} is linearly dependent, so 0 can be written as a nontrivial linear combination:
𝑛
0 = 𝛼𝑥 + ∑ 𝛼𝑖 𝑣𝑖 ,
𝑖=1
where not all coefficients are zero. Because 𝑆 is linearly independent, 𝛼 cannot be zero. (As it would imply the linear
dependence of 𝑆, which would go against our assumptions.) Thus,
𝑛
𝛼𝑖
𝑥 = ∑− 𝑣,
𝑖=1
𝛼 𝑖
Theorem 1.4.3
Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } a basis. Then for any 𝑣 ∈ 𝑆, the span of 𝑆\{𝑣} is a proper subset of 𝑉 .
Proof. We are going to prove by contradiction. Without the loss of generality, we can assume that 𝑣 = 𝑣1 . If
span(𝑆\{𝑣1 }) = 𝑉 , then
𝑛
𝑣1 = ∑ 𝑥𝑖 𝑣𝑖 .
𝑖=2
This means that 𝑆 = {𝑣1 , … , 𝑣𝑛 } is not linearly independent, contradicting our assumptions. □
In other words, the above results mean that a basis is a maximal linearly independent and a minimal generating set at the
same time.
𝑛
Given a basis 𝑆 = {𝑣1 , … , 𝑣𝑛 }, we implictly write the vector 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 as 𝑥 = (𝑥1 , … , 𝑥𝑛 ). Since this decompo-
sition is unique, we can do this without issues. The coefficients 𝑥𝑖 are also called coordinates. (Note that the coordinates
strongly depend on the basis. Given two different bases, the coordinates of the same vector can be different.)
As we have seen previously, bases are not unique, as a single vector space can have many different bases. A very natural
question that arises in this context is the following. If 𝑆1 and 𝑆2 are two bases for 𝑉 , then does |𝑆1 | = |𝑆2 | hold? (Where
|𝑆| denotes the cardinality of the set 𝑆, that is, its “size”.)
In other words, can we do better if we select our basis more cleverly? It turns out that we cannot, and the sizes of any two
basis sets is equal. We are not going to prove this, but here is the theorem in its entirety.
Theorem 1.4.4
Let 𝑉 be a vector space and 𝑆1 , 𝑆2 be two of its bases. Then |𝑆1 | = |𝑆2 |.
This gives us a way to define the dimension of a vector space, which is simply the cardinality of its basis. We’ll
denote the dimension of 𝑉 as dim(𝑉 ). For example, ℝ𝑛 is 𝑛-dimensional, as shown by the standard basis
{(1, 0, … , 0), … , (0, 0, … , 1)}.
If you recall the previous theorems, we almost always assumed that a basis is finite. You might ask the question: is this
always true? The answer is no. Examples 2 and 3 show that this is not the case. For instance, the countably infinite set
{1, 𝑥, 𝑥2 , 𝑥3 , … } is a basis for ℝ[𝑥]. So, according to the theorem above, no finite basis can exist there.
This marks an important distinction between vector spaces: those with finite bases are called finite-dimensional. I have
some good news: all finite-dimensional real vector spaces are essentially ℝ𝑛 . (Recall that we call a vector space real if its
scalars are the real numbers.)
To see why, suppose that 𝑉 is an 𝑛-dimensional real vector space with basis {𝑣1 , … , 𝑣𝑛 }, and define the mapping 𝜑 ∶
𝑉 → ℝ𝑛 by
𝑛
𝜑 ∶ ∑ 𝑥𝑖 𝑣𝑖 → (𝑥1 , … , 𝑥𝑛 ).
𝑖=1
𝜑 is invertible and preserves the structure of 𝑉 , that is, the addition and scalar multiplication operations. Indeed, if
𝑢, 𝑣 ∈ 𝑉 and 𝛼, 𝛽 ∈ ℝ, then 𝜑(𝛼𝑢 + 𝛽𝑣) = 𝛼𝜑(𝑥) + 𝛽𝜑(𝑦). Such mappings are called isomorphisms. The word itself
is derived from ancient Greek, with isos meaning same and morphe meaning shape. Even though this sounds abstract, the
existence of an isomorphism between two vector spaces mean that they have the same structure. So, ℝ𝑛 is not just an
example of finite dimensional real vector spaces, it is a universal model of them. (Note that if the scalars are not the real
numbers, the isomorphism to ℝ𝑛 is not true.)
Considering that we’ll almost exclusively deal with finite dimensional real vector spaces, this is good news. Using ℝ𝑛 is
not just a heuristic, it is a good mental model.
If every finite-dimensional real vector space is essentially the same as ℝ𝑛 , what do we gain from abstraction? Sure, we can
just work with ℝ𝑛 without talking about bases, but to develop a deep understanding of the core mathematical concepts in
machine learning, we need the abstraction.
Let’s look ahead briefly and see an example. If you have some experience with neural networks, you know that matrices
play an essential role in defining its layers. Without any context, matrices are just a table of numbers with seemingly
arbitrary rules of computation. Have you ever wondered why matrix multiplication is defined the way it is?
Although we haven’t precisely defined matrices yet, you have probably encountered them previously. For two matrices
Even though we can visualize this to make it easier to understand, the definition seems random.
Why not just take the componentwise product (𝑎𝑖,𝑗 𝑏𝑖,𝑗 )𝑛𝑖,𝑗=1 ? The definition becomes crystal clear once we look at a
matrix as a tool to describe linear transformations between vector spaces, as the elements of the matrix describe the
images of basis vectors. In this context, multiplication of matrices is just the composition of linear transformations.
Instead of just putting out the definition and telling you how to use it, I want you to understand why it is defined that way.
In the next chapters, we are going to learn every nook and cranny of matrix multiplication.
At this point, you might ask the question: for a given vector space, are we guaranteed to find a basis? Without such a
guarantee, the previous setup might be wasted. (As there might not be a basis to work with.)
Fortunately, this is not the case. As the proof is extremely difficult, we will not show this, but this is so important that we
should at least state the theorem. If you are interested in how this can be done, I included a proof sketch. Feel free to
skip this, as it is not going to be essential for our purposes.
Theorem 1.4.5
Every vector space has a basis.
Proof. (Sketch.) The proof of this uses an advanced technique called transfinite induction, which is way beyond our
scope. Instead of being precise, let’s just focus on building intuition about how to construct a basis for any vector space.
For our vector space 𝑉 , we will build a basis one by one. Given any non-null vector 𝑣1 , if span(𝑆1 ) ≠ 𝑉 , the set
𝑆1 = {𝑣1 } is not yet a basis. Thus, we can find a vector 𝑣2 ∈ 𝑉 \span(𝑆1 ) so that 𝑆2 ∶= 𝑆1 ∪ {𝑣2 } is still linearly
independent.
Is 𝑆2 a basis? If not, we can continue the process. In case the process stops in finitely many steps, we are done. However,
this is not guaranteed. Think about ℝ[𝑥], the vector space of polynomials, which is not finite-dimensional, as we have
seen before. This is where we need to employ some set-theoretical heavy machinery. (Which we don’t have.)
If the process doesn’t stop, we need to find a set 𝑆ℵ0 that contains all 𝑆𝑖 as a subset. (Finding this 𝑆ℵ0 set is the tricky
part.) Is 𝑆ℵ0 a basis? If not, we continue the process.
This is difficult to show, but the process eventually stops, and we can’t add any more vectors to our linearly independent
vector set that won’t destroy the independence property. When this happens, we have found a maximal linearly independent
set, that is, a basis. ≈ □
For finite dimensional vector spaces, the above process is easy to describe. In fact, one of the pillars of linear algebra is
the so-called Gram-Schmidt process, used to explicitly construct special bases for vector spaces. As several quintessential
results rely on this, we are going to study it in detail during the next chapters.
2.5 Subspaces
Before we move on, there is one more thing we need to talk about, one that will come in handy when talking about linear
transformations. (But again, linear transformations are at the heart of machine learning. Everything we learn is to get to
know them better.) For a given vector space 𝑉 , we are often interested in one of its subsets that is a vector space in its
entirety. This is described by the concept of subspaces.
By definition, subspaces are vector spaces themselves, so we can define their dimension as well. There are at least two
subspaces of each vector space: itself and {0}. These are called trivial subspaces. Besides those, the span of a set of
vectors is always a subspace. One such example is illustrated in Fig. 2.5.
One of the most important aspects of subspaces is that we can use them to create more subspaces. This notion is made
precise below.
𝑈1 + 𝑈2 = {𝑢1 + 𝑢2 ∶ 𝑢1 ∈ 𝑈1 , 𝑢2 ∈ 𝑈2 }.
You can easily verify that 𝑈1 + 𝑈2 is a subspace indeed, moreover 𝑈1 + 𝑈2 = span(𝑈1 ∪ 𝑈2 ). (See one of the
exercises at the end of the chapter.) Subspaces and their direct sum play an essential role in several topics, such as matrix
decompositions. For example, we’ll see later that many of them are equivalent to decomposing a linear space into a sum
of vector spaces.
The ability to select a basis whose subsets span certain given subspaces often comes in handy. This is formalized by the
next result.
Theorem 1.5.1
Let 𝑉 be a vector space and 𝑈1 , 𝑈2 ⊆ 𝑉 its subspaces such that 𝑈1 + 𝑈2 = 𝑉 . Then for any bases {𝑝1 , … , 𝑝𝑘 } ⊆ 𝑈1
and {𝑞1 , … , 𝑞𝑙 } ⊆ 𝑈2 , their union is a basis in 𝑉 .
Proof. This follows directly from the direct sum’s definition. If 𝑉 = 𝑈1 + 𝑈2 , then any 𝑥 ∈ 𝑉 can be written in the
form 𝑥 = 𝑎 + 𝑏, where 𝑎 ∈ 𝑈1 and 𝑏 ∈ 𝑈2 .
In turn, since 𝑝1 , … , 𝑝𝑘 forms a basis in 𝑈1 and 𝑞1 , … , 𝑞𝑙 forms a basis in 𝑈2 , the vectors 𝑎 and 𝑏 can be written as
𝑘 𝑙
𝑎 = ∑ 𝛼𝑖 𝑝𝑖 , 𝑏 = ∑ 𝛽𝑖 𝑞𝑖 .
𝑖=1 𝑖=1
Thus, any 𝑥 is
𝑘 𝑙
𝑥 = ∑ 𝛼𝑖 𝑝𝑖 + ∑ 𝛽𝑖 𝑞𝑖 ,
𝑖=1 𝑖=1
With vector spaces, we are barely scratching the surface. Bases are essential, but they only provide the skeleton for the
vector spaces encountered in practice. To properly represent and manipulate data, we need to build a geometric structure
around this skeleton. How to measure the “distance” between two measurements? What about their similarity?
Besides all that, there is an even more crucial question: how on earth will we represent vectors inside a computer? In the
next chapter, we will take a look at the data structures of Python, laying the foundation for the data manipulations and
transformations we’ll do later.
2.6 Problems
Problem 1. Not all vector spaces are infinite. There are some that only contain a finite number of vectors, as we shall see
next in this problem. Define the set
ℤ2 ∶= {0, 1},
2.6. Problems 29
Mathematics of Machine Learning
(b) Show that (ℤ𝑛2 , ℤ2 , +, ⋅) is also a vector space, where ℤ𝑛2 is the 𝑛-fold Cartesian product
ℤ𝑛2 = ℤ 2 × ⋯ × ℤ2 ,
⏟⏟⏟⏟⏟
𝑛 times
𝑥 + 𝑦 = (𝑥1 + 𝑦1 , … , 𝑥𝑛 + 𝑦𝑛 ), 𝑥, 𝑦 ∈ ℤ𝑛2 ,
𝑐𝑥 = (𝑐𝑥1 , … , 𝑐𝑥𝑛 ), 𝑐 ∈ ℤ2 .
is a bijective and linear. (A function 𝑓 ∶ 𝑋 → 𝑌 is bijective if every 𝑦 ∈ 𝑌 has exactly one 𝑥 ∈ 𝑋 for which 𝑓(𝑥) = 𝑦.)
In general, a linear and bijective function 𝑓 ∶ 𝑈 → 𝑉 between vector spaces is called an isomorphism. Given the existence
of such function, we call the vector spaces 𝑈 and 𝑉 isomorphic, meaning that they have an identical algebraic structure.
Combining (a) and (b), we obtain that ℝ[𝑋] is isomorphic with its proper subspace 𝑥ℝ[𝑋]. This is quite an interesting
phenomenon: a vector space that is algebraically identical to its proper subspace.
THREE
VECTORS IN PRACTICE
So far, we have mostly talked about the theory of vectors and vector spaces. However, our ultimate goal is to build
computational models for discovering and analyzing patterns in data. To put theory into practice, we will take a look at
how vectors are represented in computations.
In computer science, there is a stark contrast between how we think about mathematical structures and how we represent
them inside a computer. Until this point, our goal was to develop a mathematical framework that enables us to reason
about the structure of data and its transformations effectively. We want a language that is
• expressive,
• easy to speak,
• and as compact as possible.
However, our goals change when we aim to do computations instead of pure logical reasoning. We want implementations
that are
• easy to work with,
• memory-efficient,
• and fast to access, manipulate and transform.
These are often contradicting requirements, and particular situations might prefer one over the other. For instance, if
we have plenty of memory but want to perform lots of computations, we can sacrifice size for speed. Because of all
the potential use-cases, there are multiple formats to represent the same mathematical concepts. These are called data
structures.
Different programming languages implement vectors differently. Because Python is ubiquitous in data science and ma-
chine learning, it’ll be our language of choice. In this chapter, we are going to study all the possible data structures in
Python to see which one is suitable to represent vectors for high performance computations.
3.1 Tuples
In standard Python, two built-in data structures can be used to represent vectors: tuples and lists. Let’s start with tuples!
They can be simply defined by enumerating their elements between two parentheses, separating them with commas.
print(v_tuple)
31
Mathematics of Machine Learning
type(v_tuple)
tuple
A single tuple can hold elements of various types. Even though we’ll exclusively deal with floats in computational linear
algebra, this property is extremely useful for general purpose programming.
We can access the elements of a tuple by indexing. Just like in (almost) all other programming languages, indexing starts
from zero. This is in contrast with mathematics, where we often start indexing from one. (Don’t tell this to anybody else,
but it used to drive me crazy. I am a mathematician first.)
v_tuple[0]
The number of elements in a tuple can be accessed by calling the built-in len function.
len(v_tuple)
v_tuple[1:4]
Slicing works by specifying the first and last elements with an optional step size, using the syntax ob-
ject[first:last:step].
Tuples are rather inflexible, as you cannot change their components. Attempting to do so results in a TypeError,
Python’s standard way of telling you that the object does not support the method you are trying to call (item assignment).
v_tuple[0] = 2
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-5d8791fb815d> in <module>
----> 1 v_tuple[0] = 2
Besides that, extending the tuple with additional elements is also not supported. As we cannot change the state of a tuple
object in any way after it has been instantiated, they are immutable. Depending on the use-case, immutability can be an
advantage and a disadvantage as well. Immutable objects eliminate accidental changes, but each operation requires the
creation of a new object, resulting in a computational overhead. Thus, tuples are not going to be optimal to represent large
amounts of data in complex computations.
This issue is solved by lists. Let’s take a look at them, and the new problems they introduce!
3.2 Lists
Lists are the workhorses of Python. In contrast with tuples, lists are extremely flexible and easy to use, albeit this comes at
the cost of runtime. Similarly to tuples, a list object can be created by enumerating its objects between square brackets,
separated by commas.
type(v_list)
list
Just like tuples, accessing the elements of a list is done by indexing or slicing.
We can do all kinds of operations on a list: overwrite its elements, append items, or even remove others.
v_list
This example illustrates that lists can hold elements of various types as well. Adding and removing elements can be done
with methods like append, push, pop, and remove.
Before trying that, let’s quickly take note of the memory address of our example list, which can be accessed by calling
the id function.
v_list_addr = id(v_list)
v_list_addr
140531937533320
This number simply refers to an address in my computer’s memory, where the v_list object is located. Quite literally,
as this book is compiled on my personal computer.
Now, we are going to perform a few simple operations on our list and show that the memory address doesn’t change.
Thus, no new object is created.
True
3.2. Lists 33
Mathematics of Machine Learning
id(v_list) == v_list_addr # removing elements still doesn't create any new objects
True
Unfortunately, adding lists together achieves a result that is completely different from our expectations.
[1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
Instead of adding the corresponding elements together, like we want vectors to behave, the lists are concatenated. This
feature is handy when writing general-purpose applications. However, this is not well-suited for scientific computations.
“Scalar multiplication” also has strange results.
3*[1, 2, 3]
[1, 2, 3, 1, 2, 3, 1, 2, 3]
Multiplying a list with an integer repeats the list by the specified number of times. Given the behavior of the + operator
on lists, this seems logical as multiplication with an integer is repeated addition:
𝑎 ⋅ 𝑏 = 𝑏⏟⏟
+⏟⋯⏟
+⏟𝑏 .
𝑎 times
Overall, lists can do much more than we need to represent vectors. Although we potentially want to change elements
of our vectors, we don’t need to add or remove elements from them, and we also don’t need to store objects other than
floats. Can we sacrifice these extra features and obtain an implementation suitable for our purposes yet has a lightning-fast
computational performance? Yes. Enter NumPy arrays.
Even though Python’s built-in data structures are amazing, they are optimized for ease of use, not for scientific computa-
tion. This problem was realized early on the language’s development and was addressed by the NumPy library.
One of the main selling points of Python is how fast and straightforward it is to write code, even for complex tasks. This
comes at the price of speed. However, in machine learning, speed is crucial for us. When training a neural network, a
small set of operations are repeated millions of times. Even a small percentage of improvement in performance can save
hours, days, or even weeks in case of extremely large models.
Regarding speed, the C language is at the other end of the spectrum. C code is hard to write but executes blazing fast
when done correctly. As Python is written in C, a tried and true method for achieving fast performance is to call functions
written in C from Python. In a nutshell, this is what NumPy provides: C arrays and operations, all in Python.
To get a glimpse into the deep underlying issues with Python’s built-in data structures, we should put numbers and arrays
under our magnifying glass. Inside a computer’s memory, objects are represented as fixed-length 0-1 sequences. Each
component is called a bit. Bits are usually grouped into 8, 16, 32, 64, or even 128 sized chunks. Depending on what we
want to represent, identical sequences can mean different things. For instance, the 8-bit sequence 00100110 can represent
the integer 38 or the ASCII character “&”.
By specifying the data type, we can decode binary objects. 32-bit integers are called int32 types, 64-bit floats are
float64, and so on.
Since a single bit contains very little information, memory is addressed by dividing it into 32 or 64 bit sized chunks and
numbering them consecutively. This address is a hexadecimal number, starting from 0. (For simplicity, let’s assume that
the memory is addressed by 64 bits. This is customary in modern computers.)
A natural way to store a sequence of related objects (with matching data type) is to place them next to each other in the
memory. This data structure is called an array.
By storing the memory address of the first object, say 0x23A0, we can instantly retrieve the k-th element by accessing
the memory at 0x23A0 + k.
We call this the static array or often the C array because this is how it is done in the magnificent C language. Although
this implementation of arrays is lightning fast, it is relatively inflexible. First, you can only store objects of a single type.
Second, you have to know the size of your array in advance, as you cannot use memory addresses that overextend the
pre-allocated part. Thus, before you start working with your array, you have to allocate memory for it. (That is, reserve
space so that other programs won’t overwrite it.)
However, in Python, you can store arbitrarily large and different objects in the same list, with the option of removing and
adding elements to it.
l.append(lambda x: x)
[5575186299632655785383929568162090376495105,
'a string',
<function __main__.<lambda>(x)>]
In the example above, l[0] is an integer so large that it doesn’t fit into 128 bits. Also, there are all kinds of objects in
our list, including a function. How is this possible?
Python’s list provides a flexible data structure by
1. overallocating the memory and,
2. keeping memory addresses to the objects’ in the list instead of the objects themselves.
(At least in the most widespread CPython implementation.)
By checking the memory addresses of each object in our list l, we can see that they are all over the memory.
[id(x) for x in l]
Due to the overallocation, deletion or insertion can always be done simply by shifting the remaining elements. Since the
list stores the memory address of its elements, all types of objects can be stored within a single structure.
However, this comes at a cost. Because the objects are not contiguous in memory, we lose locality of reference, meaning
that since we frequently access distant locations of the memory, our reads are much slower. Thus, looping over a Python
list is not efficient, making it unsuitable for scientific computation.
So, NumPy arrays are essentially the good old C arrays in Python, with the user-friendly interface of Python lists. (If you
have ever worked with C, you know how big of a blessing this is.) Let’s see how to work with them!
First, we import the numpy library. (To save on the characters, it is customary to import it as np.)
import numpy as np
The main data structure is the np.ndarray, short for n-dimensional array. We can use the np.array function to
create NumPy arrays from standard Python containers or initialized from scratch. (Yes, I know. This is confusing, but
you’ll get used to it. Just take a mental note that np.ndarray is the class, and np.array is the function you use to
create NumPy arrays from Python objects.)
We can even initialize NumPy arrays using random numbers. Later, when talking about probability theory, we’ll discuss
this functionality in detail, as the library covers a wide range of probability distributions.
np.random.rand(10)
Most importantly, when we have a given array, we can initialize another one with the same dimensions using the np.
zeros_like, np.ones_like, and np.empty_like functions.
np.zeros_like(X)
Just like Python lists, NumPy arrays support item assignments and slicing.
X[0] = 1545.215
X
X[1:4]
However, as expected, you can only store a single data type within each ndarray. When trying to assign a string as the
first element, we get an error message.
X[0] = "str"
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-d996c0d86300> in <module>
----> 1 X[0] = "str"
As you might have guessed, every ndarray has a data type attribute that can be accessed at ndarray.dtype. If a
conversion can be made between the value to be assigned and the data type, it is automatically performed, making the
item assignment successful.
X.dtype
dtype('float64')
val = 23
type(val)
int
X[0] = val
NumPy arrays are iterable, just like other container types in Python.
for x in X:
print(x)
23.0
4.5
-4.1
42.1414
-3.14
2.001
Let’s talk about vectors once more. From now on, we are going to use NumPy ndarray-s to model vectors.
The addition and scalar multiplication operations are supported by default and perform as expected.
np.zeros(shape=3) + 1
Because of the dynamic typing of Python, we can (often) plug in NumPy arrays into functions intended for scalars.
def f(x):
return 3*x**2 - x**4
f(v_1)
array([-208. , 2. , -12.1141])
So far, NumPy arrays satisfy almost everything we require to represent vectors. There is only one box to be ticked:
performance. To investigate this, we measure the execution time with Python’s built-in timeit tool.
In its first argument, timeit takes a function to be executed and timed. Instead of passing a function object, it also accepts
executable statements as a string. Since function calls have a significant computational overhead in Python, we are passing
code rather than a function object in order to be precise with the time measurements.
Below, we compare adding together two NumPy arrays vs. Python lists containing a thousand zeros.
t_add_builtin = timeit(
"[x + y for x, y in zip(v_1, v_2)]",
setup=f"size={size}; v_1 = [0 for _ in range(size)]; v_2 = [0 for _ in␣
↪range(size)]",
number=n_runs
)
t_add_numpy = timeit(
"v_1 + v_2",
setup=f"import numpy as np; size={size}; v_1 = np.zeros(shape=size); v_2 = np.
↪zeros(shape=size)",
number=n_runs
)
If you are already familiar with some deep learning frameworks, you might ask: why are we studying NumPy instead
of them? The answer is simple: because all state-of-the-art libraries are built on its legacy. Modern tensor libraries are
essentially clones of NumPy, with GPU support. Thus, most NumPy knowledge translates directly to TensorFlow and
PyTorch. If you understand how it works on a fundamental level, you’ll have a headstart in more advanced frameworks.
Moreover, our goal is to implement our neural network from scratch by the end of the book. To understand every nook
and cranny, we don’t want to use built-in algorithms like backpropagation. We’ll create our own!
NumPy is designed to be faster than vanilla Python. Is this really the case? Not all the time. If you use it wrong, it might
even hurt performance! To know when it is beneficial to use NumPy, we will look at why exactly it is faster in practice.
To simplify the investigation, our toy problem will be random number generation. Suppose that we need just a single
random number. Should we use NumPy? Let’s test it! We are going to compare it with the built-in random number
generator by running both ten million times, measuring the execution time.
n_runs = 10000000
t_builtin = timeit(random_py, number=n_runs)
t_numpy = timeit(random_np, number=n_runs)
For generating a single random number, NumPy is significantly slower. Why is this the case? What if we need an array
instead of a single number? Will this also be slower?
This time, let’s generate a list/array of a thousand elements.
size = 1000
n_runs = 10000
t_builtin_list = timeit(
"[random_py() for _ in range(size)]",
setup=f"from random import random as random_py; size={size}",
number=n_runs
)
t_numpy_array = timeit(
"random_np(size)",
setup=f"from numpy.random import random as random_np; size={size}",
number=n_runs
)
(Again, I don’t want to wrap the timed expressions in lambdas since function calls have an overhead in Python. I want to
be as precise as possible, so I pass them as strings to the timeit function.)
Things are looking much different now. When generating an array of random numbers, NumPy wins hands down.
There are some curious things about this result as well. First, we generated a single random number 10 000 000 times.
Second, we generated an array of 1000 random numbers 10 000 times. In both cases, we have 10 000 000 random
numbers in the end. Using the built-in method, it took ~2x time when we put them in a list. However, with NumPy, we
see a ~30x speedup compared to itself when working with arrays!
To see what happens behind the scenes, we are going to profile the code using cProfiler. With this, we’ll see exactly how
many times a given function was called and how much time we spent inside them.
(To make profiling work from Jupyter Notebooks, we need to do some Python magic first. Feel free to disregard the
contents of the next cell; this is just to make sure that the output of the profiling is printed inside the notebook.)
Let’s take a look at the built-in function first. In the following function, we create 10 000 000 random numbers, just as
before.
def builtin_random_single(n_runs):
for _ in range(n_runs):
random_py()
From Jupyter Notebooks, where this book is written, cProfiler can be called with the magic command %prun.
n_runs = 10000000
%prun builtin_random_single(n_runs)
There are two important columns here for our purposes. ncalls shows how many times a function was called, while
tottime is the total time spent in a function, excluding time spent in subfunctions.
The built-in function random.random() was called 10 000 000 times as expected, and the total time spent in that
function was 0.407 seconds. (If you are running this notebook locally, this number is going to be different.)
What about the NumPy version? The results are surprising.
def numpy_random_single(n_runs):
for _ in range(n_runs):
random_np()
%prun numpy_random_single(n_runs)
Similarly as before, the numpy.random.random() function was indeed called 10 000 000 times as expected. Yet,
the script spent significantly more time in this function than in the Python built-in random before. Thus, it is more costly
per call.
When we start working with large arrays and lists, things change dramatically. Next, we generate a list/array of 1000
random numbers, while measuring the execution time.
size = 1000
n_runs = 10000
As we see, about 60% of the time was spent on the list comprehensions: 10 000 calls, 0.641s total. (Note that tottime
doesn’t count subfunction calls like calls to random.random() here.)
Now we are ready to see why NumPy is faster when used right.
With each of the 10 000 function calls, we get a numpy.ndarray of 1000 random numbers. The reason why NumPy
is fast when used right is that its arrays are extremely efficient to work with. They are like C arrays instead of Python lists.
As we have seen, there are two significant differences between them.
• Python lists are dynamic, so for instance, you can append and remove elements. NumPy arrays have fixed lengths,
so you cannot add or delete without creating a new one.
• Python lists can hold several data types simultaneously, while a NumPy array can only contain one.
So, NumPy arrays are less flexible but significantly more performant. When this additional flexibility is not needed,
NumPy outperforms Python.
To see precisely at which size does NumPy overtakes Python in random number generation, we can compare the two by
measuring the execution times for several sizes.
runtime_builtin = [
timeit(
"[random_py() for _ in range(size)]",
setup=f"from random import random as random_py; size={size}",
number=100000
)
for size in sizes
]
runtime_numpy = [
timeit(
"random_np(size)",
setup=f"from numpy.random import random as random_np; size={size}",
(continues on next page)
with plt.style.context("seaborn-white"):
plt.figure(figsize=(10, 5))
plt.plot(sizes, runtime_builtin, label="built-in")
plt.plot(sizes, runtime_numpy, label="NumPy")
plt.xlabel("array size")
plt.ylabel("time (seconds)")
plt.title("Runtime of random array generation")
plt.legend()
Around 20, NumPy starts to beat Python in performance. Of course, this number might be different for other operations
like calculating the sine or adding numbers together, but the tendency will be the same. Python will slightly outperform
NumPy for small input sizes, but NumPy wins by a large margin as the size grows.
FOUR
Let’s revisit the Iris dataset introduced in the previous chapter! I want to test your intuition. I plotted the petal widths
against the petal lengths while hiding the class labels in the following figure.
Fig. 4.1: Petal width plotted against petal length in the Iris dataset.
Even without knowing any labels, we can intuitively point out that there are probably at least two classes. Can you
summarize your reasoning in a single sentence?
There are many valid arguments, but the most prevalent one is that the two clusters are far away from each other. As this
example illustrates, the concept of distance plays an essential role in machine learning. In this chapter, we will translate
the notion of distance to the language of mathematics and put it into the context of vector spaces.
47
Mathematics of Machine Learning
Previously, we have seen that vectors are essentially arrows, starting from the null vector. Besides their direction, vectors
also have magnitude. For example, as we have learned in high school mathematics, the magnitude in the Euclidean plane
is defined by
𝑥 = (𝑥1 , 𝑥2 ) ∈ ℝ2 ,
𝑦 = (𝑦1 , 𝑦2 ) ∈ ℝ2 .
The magnitude formula √𝑥21 + 𝑥22 can be generalized to higher dimensions simply by
However, just from this formula, it is not clear why it is defined this way. What does the square root of a sum of squares
have to do with distance and magnitude? Behind the scenes, it is just the Pythagorean theorem.
Recall that the Pythagorean theorem states that in right triangles, the squared length of the hypotenuse equals the sum of
squared lengths of other sides, as illustrated by Fig. 4.3.
To put this into an algebraic form, it states that 𝑎2 + 𝑏2 = 𝑐2 , when 𝑐 is the hypotenuse of the right triangle, and 𝑎 and
𝑏 are its two other sides. If we apply this to a two-dimensional vector 𝑥 = (𝑥1 , 𝑥2 ), we can see that the Pythagorean
theorem gives its magnitude ‖𝑥‖2 = √𝑥21 + 𝑥22 .
This can be generalized to higher dimensions. To see what is happening, we are going to check the three-dimensional
case, as illustrated by Fig. 4.4. Here, we can apply the Pythagorean theorem twice to obtain the magnitude!
For each vector 𝑥 = (𝑥1 , 𝑥2 , 𝑥3 ), we can take a look at the triangle determined by (0, 0, 0), (𝑥1 , 0, 0), and (𝑥1 , 𝑥2 , 0) first.
The length of the hypotenuse can be calculated by √𝑥21 + 𝑥22 . However, the points (0, 0, 0), (𝑥1 , 𝑥2 , 0), and (𝑥1 , 𝑥2 , 𝑥3 )
form a right triangle. Applying the Pythagorean theorem once again we obtain
which is the Euclidean norm. This is exactly what is going on in the general 𝑛-dimensional case.
The notions of magnitude and distance are critical in machine learning, as we can use them to determine the similarity
between data points, measure and control the complexity of neural networks, and many more.
Is the above method the only viable way to measure magnitude and distance? Certainly not. Because Manhattan’s street
layout is essentially a rectangular grid, its residents are famed for measuring distances in blocks. If something is two
blocks to the north and three blocks east, it means that you have to travel two intersections to the north and three to
the east to find it. This gives rise to a mathematically perfectly valid notion of measurement called Manhattan distance,
defined by
When using the Manhattan distance, the shortest path between two points is not unique.
Fig. 4.5: For the Manhattan distance, the shortest path between two points is not unique.
Besides the Euclidean and Manhattan distances, there are several other metrics. Once again, we are going to step away
from the concrete examples to take an abstract viewpoint.
If we talk about measurements and metrics in general, what are the properties that we expect from all of them? What
makes a measurement distance? Essentially, there are three such traits: the distance should
• be nonnegative,
• preserve scaling,
• and the distance straight from point 𝐴 to 𝐵 is always equal or smaller than touching any other point 𝐶.
These are formalized by the notion of norms.
on ℝ𝑛 . The function ‖ ⋅ ‖𝑝 is called the 𝑝-norm. Showing that these are indeed norms is a bit technical. Thus, we won’t
go into the details. (The triangle inequality requires some work, but the other three properties are easy to see.)
We have already seen two special cases: the Euclidean norm (𝑝 = 2), and the Manhattan norm (𝑝 = 1). Both of them
frequently appear in machine learning. For instance, the familiar mean squared error is just the scaled Euclidean distance
between prediction and ground truth:
1 1 𝑛
MSE(𝑦, 𝑦)̂ = ‖𝑦 − 𝑦‖̂ 22 = ∑(𝑦𝑖 − 𝑦𝑖̂ )2
𝑛 𝑛 𝑖=1
As mentioned before, the 2-norm, along with the 1-norm, is commonly used to control the complexity of models dur-
𝑚
ing training. To give a concrete example, suppose that we are fitting a polynomial 𝑓(𝑥) = ∑𝑖=0 𝑞𝑖 𝑥𝑖 to the data
{(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 )}. To obtain a model that generalizes well to new data, we prefer our models to be as simple
as possible. Thus, instead of using the plain mean squared error, we might consider minimizing the loss
where the term ‖𝑞‖𝑝 is responsible for keeping the coeffients of the polynomial 𝑓(𝑥) small, and 𝜆 controlls the strength
of regularization. Usually, 𝑝 is either 1 or 2, but other values from [1, ∞) are also valid.
Example 2. Let’s stay in ℝ𝑛 for a bit more! The so-called ∞-norm is defined by
Showing that ‖ ⋅ ‖∞ is indeed a norm is a simple task and left to the reader for practice. (This is perhaps one of the most
notorious sentences written in mathematical textbooks, but trust me, this is truly easy. Give it a shot! If you don’t see it,
try the special case ℝ2 .)
This is called the ∞-norm, and is strongly related to the 𝑝-norm that we have just seen. In fact, if we let the value 𝑝 grow
infinitely, ‖𝑥‖𝑝 will be very close to ‖𝑥‖∞ , ultimately reaching it in the limit.
|𝑥𝑖 |
Since ‖𝑥‖∞ ≤ 1 by definition,
𝑛 𝑝 1/𝑝
|𝑥𝑖 |
1 ≤ (∑( ) ) ≤ 𝑛1/𝑝 ,
𝑖=1
‖𝑥‖∞
holds. Because lim𝑝→∞ 𝑛1/𝑝 = 1, we can conclude that lim𝑝→∞ ‖𝑥‖𝑝 = ‖𝑥‖∞ . This is the reason why the ∞-norm is
considered a 𝑝-norm with 𝑝 = ∞.
If you are not familiar with taking limits of sequences, don’t worry. We’ll cover everything in detail when studying
single-variable calculus.
Example 3. ∞-norms can be generalized for function spaces. Remember 𝐶([0, 1]), the vector space of functions con-
tinuous on [0, 1]? We introduced this when talking about examples of vector spaces. There, ‖ ⋅ ‖∞ can be defined as
This norm can be defined on other function spaces, like 𝐶(ℝ), the space of continuous real functions. Since the maximum
is not guaranteed to exist (like for the sigmoid function in 𝐶(ℝ)), the maximum is replaced with supremum. Hence, the
∞-norm is often called the supremum norm.
If you imagine the function as a landscape, the supremum norm is the height of the highest peak or the depth of the
deepest trench. (Whichever is larger in absolute value.)
When encountering this norm for the first time, it might seem challenging to understand what this has to do with any
notion of magnitude. However, ‖𝑓 − 𝑔‖∞ is a natural way to measure the distance between two functions 𝑓 and 𝑔, and in
general, magnitude is just the distance from 0.
(subsection:linear-algebra/normed-spaces/unit-spheres)
Fig. 4.7: The distance between two functions, given by the supremum norm.
For each norm, the unit sphere 𝑆 = {𝑥 ∶ ‖𝑥‖ = 1} plays a special role. Not only does every norm uniquely determine its
unit sphere, but the other way around as well: given a sphere, a corresponding norm can be constructed.
To be more precise, if you give me a 𝑆 ⊆ 𝑉 that contains 0, is bounded, strictly convex, and symmetric set, I can
construct you the norm for which this is the unit sphere. We are not going to prove this, but it helps to illustrate that
norms essentially define a “geometry” on vector spaces.
We can visualize this in ℝ2 . (In two dimensions, spheres are called circles, so we’ll refer to them as such.)
Besides measuring the magnitude of vectors, we are also interested in measuring the distance between them. If you are
at the location 𝑥 in some normed space, how far is 𝑦? In normed vector spaces, we can define the distance between any
𝑥 and 𝑦 by
𝑑(𝑥, 𝑦) = ‖𝑥 − 𝑦‖.
This is called the norm-induced metric. Thus, norms measure the distance from the zero vector, and the metric 𝑑 measures
the norm of the difference.
In general, we say that a function 𝑑 ∶ 𝑉 × 𝑉 → [0, ∞) is a metric if the following hold.
Given the properties of norms, we can quickly check that 𝑑(𝑥, 𝑦) = ‖𝑥 − 𝑦‖ is indeed a metric. Due to the linear structure
of vector spaces, the norm-generated metric is invariant to translation. That is, for any 𝑥, 𝑦, 𝑧 ∈ 𝑉 , we have
In other words, it doesn’t matter where you start: the distance only depends on your displacement. This is not true for any
metric. Thus, norm-induced metrics are special. In our studies, we only deal with these special cases. Because of this,
we won’t even talk about metrics, just norms.
4.3 Conclusion
In itself, a vector space is just a skeleton that provides a way to represent data. On top of this, norms define a geometric
structure that reveals properties such as magnitude and distance. Both of these are essential in machine learning. For
instance, some unsupervised learning algorithms separate data points into clusters based on their mutual distances from
each other.
There is yet another way to enhance the geometric structure of vector spaces: inner products, also called dot products.
We are going to put this concept under our magnifying glass in the next section.
4.4 Problems
0 if 𝑥 = 𝑦,
𝑑(𝑥, 𝑦) = {
1 otherwise.
Problem 2. Let 𝑆𝑛 be the set of all ASCII strings of 𝑛 character length and define the Hamming distance ℎ(𝑥, 𝑦) for any
two 𝑥, 𝑦 ∈ 𝑆𝑛 by the number of corresponding positions where 𝑥 and 𝑦 are different.
For instance,
ℎ("001101", "101110") = 2,
ℎ("metal", "petal") = 1.
Show that ℎ satisfies the three defining properties of a metric. (Note that 𝑆𝑛 is not a vector space, so technically, the
Hamming distance is not a metric.)
Problem 3. Let ‖ ⋅ ‖ be a norm on the vector space ℝ𝑛 , and define the mapping 𝑓 ∶ ℝ𝑛 → ℝ𝑛 ,
Show that
‖𝑥‖∗ ∶= ‖𝑓(𝑥)‖
is a norm on ℝ𝑛 .
4.4. Problems 55
Mathematics of Machine Learning
FIVE
In the previous chapter, we imbued our vector spaces with norms, measuring the magnitude of vectors and the distance
between points. In machine learning, these concepts can be used, for instance, to identify clusters in unlabeled datasets.
However, without context, distance is often not enough. Following our geometric intuition, we can aspire to measure the
similarity of data points. This is done by the inner product. (Also known as the dot product.)
You can recall the inner product as a quantity that we used to measure the angle between two vectors in high school
geometry classes. Given two vectors 𝑥 = (𝑥1 , 𝑥2 ), 𝑦 = (𝑦1 , 𝑦2 ) from the plane, we defined their inner product by
⟨𝑥, 𝑦⟩ = 𝑥1 𝑦1 + 𝑥2 𝑦2 ,
holds, where 𝛼 is the angle between 𝑥 and 𝑦. (In fact, there are two such angles, but their cosine is equal.) Thus, the angle
itself can be extracted by
⟨𝑥, 𝑦⟩
𝛼 = arccos .
‖𝑥‖‖𝑦‖
We can use the inner products to determine if two vectors are orthogonal, as this happens if and only if ⟨𝑥, 𝑦⟩ = 0 holds.
During our earlier encounters with mathematics, geometric intuition (such as orthogonality) came first, on which we built
tools such as the inner product. However, if we zoom out and take an abstract viewpoint, things are exactly the opposite.
As we’ll see soon, inner products emerge quite naturally, giving rise to the general concept of orthogonality.
In general, this is the formal definition of an inner product.
57
Mathematics of Machine Learning
As a special case, ⟨0, 0⟩ = 0. Just like we have seen for norms, a bit more is true: if ⟨𝑥, 𝑥⟩ = 0, then 𝑥 = 0. This follows
from positive definiteness and (5.2).
In addition, due to symmetry and the linearity of the first variable, inner products are also linear in the second variable.
Because of this, they are called bilinear.
To familiarize ourselves with the concept, let’s see some examples!
Example 1. As usual, the canonical and most prevalent example of inner product spaces is ℝ𝑛 , where the inner product
⟨⋅, ⋅⟩ is defined by
𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑥𝑖 𝑦𝑖 , 𝑥 = (𝑥1 , … , 𝑥𝑛 ), 𝑦 = (𝑦1 , … , 𝑦𝑛 ).
𝑖=1
This bilinear function is often called the dot product. Equipped with this, ℝ𝑛 is called the n-dimensional Euclidean space.
This is a central concept in machine learning, as data is most frequently represented in Euclidean spaces. Thus, we are
going to explore the structure of this space in great detail throughout this book.
Example 2. Besides Euclidean spaces, there are other inner product spaces that play a significant role in mathematics and
machine learning. If you are familiar with integration, in certain function spaces the bilinear function
∞
⟨𝑓, 𝑔⟩ = ∫ 𝑓(𝑥)𝑔(𝑥)𝑑𝑥
−∞
defines an inner product space with a very rich and beautiful structure.
The symmetry and linearity of ⟨𝑓, 𝑔⟩ is clear. Only the positive definiteness seems to be an issue. For instance, if 𝑓 is
defined by
1 if 𝑥 = 0,
𝑓(𝑥) = {
0 otherwise,
then 𝑓 ≠ 0, but ⟨𝑓, 𝑓⟩ = 0. This problem can be circumvented by “overloading” the equality operator and letting 𝑓 = 𝑔
∞
if and only if ∫−∞ |𝑓(𝑥) − 𝑔(𝑥)|2 𝑑𝑥 = 0. Even though function spaces such as this play an important role in mathematics
and machine learning, their study falls outside of our scope.
holds.
Proof. At this point, we don’t know much about the inner product except its core defining properties. So, we are going to
use a little trick. For any 𝜆 ∈ ℝ, the positive definiteness implies that ⟨𝑥 + 𝜆𝑦, 𝑥 + 𝜆𝑦⟩ ≥ 0. On the other hand, because
of the bilinearity (that is, linearity in both variables) and symmetry, we have
⟨𝑥 + 𝜆𝑦, 𝑥 + 𝜆𝑦⟩ = ⟨𝑥, 𝑥⟩ + 2𝜆⟨𝑥, 𝑦⟩ + 𝜆2 ⟨𝑦, 𝑦⟩, (5.3)
58 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning
which is a quadratic polynomial in 𝜆. In general, we know that for any quadratic polynomial of the form 𝑎𝑥2 + 𝑏𝑥 + 𝑐,
the roots are given by the formula
√
−𝑏 ± 𝑏2 − 4𝑎𝑐
𝑥1,2 = .
2𝑎
Since ⟨𝑥 + 𝜆𝑦, 𝑥 + 𝜆𝑦⟩ ≥ 0, the polynomial defined by (5.3) must have at most one real root. Thus, the discriminant
𝑏2 − 4𝑎𝑐 is negative or zero. Plugging in the coefficients of (5.3) into the discriminant formula implies
|⟨𝑥, 𝑦⟩|2 − ⟨𝑥, 𝑥⟩⟨𝑦, 𝑦⟩ ≤ 0,
which is what we had to show. □
The Cauchy-Schwarz inequality is probably one of the most useful tools in studying inner product spaces. One application
we are going to see next is to show how inner products define norms.
‖𝑥‖ = √⟨𝑥, 𝑥⟩
is a norm on 𝑉 .
Proof. According to the definition of norms, we have to show that three properties hold: positive definiteness, homo-
geneity, and the triangle inequality. The first two follow easily from the same properties of inner products. The triangle
inequality follows from the Cauchy-Schwarz inequality:
‖𝑥 + 𝑦‖2 = ⟨𝑥 + 𝑦, 𝑥 + 𝑦⟩
= ‖𝑥‖2 + ‖𝑦‖2 + 2⟨𝑥, 𝑦⟩
≤ ‖𝑥‖2 + ‖𝑦‖2 + 2‖𝑥‖‖𝑦‖
2
= (‖𝑥‖ + ‖𝑦‖) ,
from which the triangle inequality follows. □
Thus, inner product spaces are normed spaces as well. They have the algebraic and geometric structure we need to
represent, manipulate, and transform data.
Most importantly, Theorem 4.1.2 can be reversed! That is, given a norm ‖ ⋅ ‖, we can define a matching inner product.
In other words, one can generate an inner product from a norm, not just the other way around.
5.2 Orthogonality
In vector spaces other than ℝ2 , the concept of orthogonality and angles are not clear at all. For instance, in spaces where
vectors are functions, there is no intuitive way to define the angles between two function. However, as the formula (5.1)
suggests in the special case ℝ2 , these can be generalized.
To illustrate how inner products and orthogonality define geometry on vector spaces, let’s see how the classic Pythagorean
theorem looks in this new form. Recall that the “original” version states that in right triangles, 𝑎2 + 𝑏2 = 𝑐2 , where 𝑐 is
the length of the hypotenuse, while 𝑎 and 𝑏 are the lengths of the other two sides.
In inner product spaces, this generalizes to the following way.
Proof. Given the definition of inner products and orthogonality, the proof is trivial. Due to the bilinearity, we have
⟨𝑥 + 𝑦, 𝑥 + 𝑦⟩ = ⟨𝑥, 𝑥 + 𝑦⟩ + ⟨𝑦, 𝑥 + 𝑦⟩
= ⟨𝑥, 𝑥⟩ + 2⟨𝑥, 𝑦⟩ + ⟨𝑦, 𝑦⟩,
Why is this the Pythagorean theorem in another form? Because the norm and the inner product is related by ⟨𝑥, 𝑥⟩ = ‖𝑥‖2 ,
the equation (5.5) is equivalent to
By looking at the general definition, it is hard to get an insight into what an inner product does. However, by using the
concept of orthogonality, we can visualize what does ⟨𝑥, 𝑦⟩ represent for any 𝑥 and 𝑦.
Intuitively, any 𝑥 can be decomposed into the sum of two vectors 𝑥𝑜 + 𝑥𝑝 , where 𝑥𝑜 is orthogonal to 𝑦 and 𝑥𝑝 is parallel
to it.
How can we find 𝑥𝑝 and 𝑥𝑜 ? Since 𝑥𝑝 has the same direction as 𝑦, it can be written in the form 𝑥𝑝 = 𝑐𝑦 for some 𝑐.
Because 𝑥𝑝 and 𝑥𝑜 sum up to 𝑥, we also have 𝑥𝑜 = 𝑥 − 𝑥𝑝 = 𝑥 − 𝑐𝑦.
Since 𝑥𝑜 is orthogonal to 𝑦, the constant 𝑐 can be determined by solving the equation
⟨𝑥 − 𝑐𝑦, 𝑦⟩ = 0.
60 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning
By using the bilinearity of the inner product, we can express 𝑐 from this equation. Thus, we have
⟨𝑥, 𝑦⟩
𝑐= .
⟨𝑦, 𝑦⟩
So,
⟨𝑥, 𝑦⟩
𝑥𝑝 = 𝑦,
⟨𝑦, 𝑦⟩
(5.6)
⟨𝑥, 𝑦⟩
𝑥𝑜 = 𝑥 − 𝑦.
⟨𝑦, 𝑦⟩
We call 𝑥𝑝 the orthogonal projection of 𝑥 onto 𝑦. This is a common transformation, so we are going to introduce the
notation
⟨𝑥, 𝑦⟩
proj𝑦 (𝑥) = 𝑦. (5.7)
⟨𝑦, 𝑦⟩
From this, we can see that the scaling ratio between 𝑦 and proj𝑦 (𝑥) can be described by inner products.
So far, we have seen that we can use inner products to define the orthogonality relation between two vectors. Can we use
it to measure (and in some cases, even define) the angle? The answer is yes! In the following, we are going to see how,
arriving at the formula (5.1) already familiar from basic geometry.
To build our intuition, let’s select two arbitrary 𝑛-dimensional vectors 𝑥, 𝑦 ∈ ℝ𝑛 . The inner product of the sum 𝑥 + 𝑦 can
be calculated using the bilinearity property.
With this, we obtain that
⟨𝑥 + 𝑦, 𝑥 + 𝑦⟩ = ‖𝑥 + 𝑦‖2
(5.8)
= ‖𝑥‖2 + ‖𝑦‖2 + 2⟨𝑥, 𝑦⟩.
On the other hand, considering that 𝑥, 𝑦, and 𝑥+𝑦 form a triangle, we can use the law of cosines to express ⟨𝑥+𝑦, 𝑥+𝑦⟩ =
‖𝑥 + 𝑦‖2 in a different form.
Here, the law of cosines imply
Given our geometric interpretation of inner products as orthogonal projections, let’s focus on the case when both 𝑥 and 𝑦
have unit norms. In this special case, the orthogonal projection equals to
Thus, ⟨𝑥, 𝑦⟩ precisely describes the signed magnitude of the orthogonal projection. (It can be negative when proj𝑦 (𝑥)
and 𝑦 have an opposite direction.)
62 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning
With this in mind, we can see that the inner product equals to the cosine of the angle enclosed by the two vectors. Let’s
draw a picture to illustrate! (Recall that in right triangles, the cosine is the ratio of the length of the adjacent side and the
hypotenuse. In this case, the adjacent side has a length of ⟨𝑥, 𝑦⟩, while the hypotenuse is of unit length.)
In machine learning, this quantity is frequently used to measure the similarity of two vectors.
Because any vector 𝑥 can be scaled to unit norm with the transformation 𝑥 ↦ 𝑥/‖𝑥‖, we define the cosine similarity by
𝑥 𝑦
cos(𝑥, 𝑦) = ⟨ , ⟩. (5.11)
‖𝑥‖ ‖𝑦‖
If 𝑥 and 𝑦 represent the feature vectors of two data samples, cos(𝑥, 𝑦) tells us how much the features move together. Note
that because of the scaling, two samples with a high cosine similarity can be far from each other. So, this reveals nothing
about their relative positions in the feature space.
Through the lenses of similarity, orthogonality means that one vector does not contain “information” about the other. We
will make this notion more precise when learning about correlation, but there are clear implications regarding the structure
of inner product spaces. Recall that with our introduction of basis vectors, our motivation was to find a minimal set of
vectors that can be used to express any other vector. With the introduction of orthogonality, we can go a step further.
Fig. 5.4: The inner product of two unit vectors equals the cosine of their angle.
Orthogonal and orthonormal bases are extremely convenient to use. If a basis is orthogonal, we can easily obtain an
orthonormal basis by simply scaling its vectors to unit norm. Thus, we’ll use orthonormal basis vectors most of the time.
Why do we love orthonormal bases so much? To see this, let {𝑣1 , … , 𝑣𝑛 } be an arbitrary basis and 𝑥 be an arbitrary
𝑛
vector. We know that 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 , but how do we find the coefficients 𝑥𝑖 ? There is a general method involving linear
equations that we’ll see later, but if {𝑣𝑖 }𝑛𝑖=1 is orthonormal, the situation is much simpler.
This is made more precise in the following theorem.
Theorem 4.4.1
Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } be an orthonormal basis of 𝑉 . Then, for any 𝑥 ∈ 𝑉 ,
𝑛
𝑥 = ∑⟨𝑥, 𝑣𝑖 ⟩𝑣𝑖 (5.12)
𝑖=1
holds.
𝑛
Proof. Because 𝑣1 , … , 𝑣𝑛 is a basis, 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 for some scalars 𝑥𝑖 . However, due to the linearity of the inner
product,
𝑛 𝑛
⟨𝑥, 𝑣𝑗 ⟩ = ⟨ ∑ 𝑥𝑖 𝑣𝑖 , 𝑣𝑗 ⟩ = ∑ 𝑥𝑖 ⟨𝑣𝑖 , 𝑣𝑗 ⟩ = 𝑥𝑗
𝑖=1 𝑖=1
64 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning
Thus, the coefficients can be calculated by taking the inner product. In other words, for orthonormal bases, 𝑥𝑗 depends
only on the 𝑗-th basis vector.
As another consequence of the orthonormality, calculating the norm is also easier, as we can always express it in terms of
the coefficients. To be more precise, we have
𝑛 𝑛
‖𝑥‖2 = ⟨𝑥, 𝑥⟩ = ⟨ ∑ 𝑥𝑖 𝑣𝑖 , ∑ 𝑥𝑗 𝑣𝑗 ⟩
𝑖=1 𝑗=1
𝑛 𝑛
= ∑ ∑ 𝑥𝑖 𝑥𝑗 ⟨𝑣𝑖 , 𝑣𝑗 ⟩ (5.13)
𝑖=1 𝑗=1
𝑛
= ∑ 𝑥2𝑖 .
𝑖=1
This is called Parsival’s identity. So, given 𝑥 in terms of an orthonormal basis, its norm is easy to find. It is not a coincident
that this formula resembles the Euclidean norm so much! (Note that here, ‖ ⋅ ‖ is a general norm.) In fact, the squared
Euclidean norm
𝑛
‖𝑥‖22 = ∑ 𝑥2𝑖 , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛
𝑖=1
Orthogonal bases are awesome and all, but how do we find them?
There is a general method called the Gram-Schmidt orthogonalization process that solves this problem. The algorithm
takes any set of basis vectors {𝑣1 , … , 𝑣𝑛 } and outputs an orthonormal basis {𝑒1 , … , 𝑒𝑛 } such that
span(𝑣1 , … , 𝑣𝑘 ) = span(𝑒1 , … , 𝑒𝑘 ), 𝑘 = 1, … , 𝑛,
that is, the subspaces generated by the first 𝑘 vectors of both sets match.
How to do that? The process is straightforward. Let’s focus on finding an orthogonal system first, which we can normalize
later to achieve orthonormality. We are going to build our set {𝑒1 , … , 𝑒𝑛 } iteratively. It is clear that
𝑒1 ∶= 𝑣1
is a good choice. Now, our goal is to find 𝑒2 such that 𝑒2 ⟂ 𝑒1 and together, they span the same subspace as {𝑣1 , 𝑣2 }.
Remember when we talked about the geometric interpretation of orthogonality? The orthogonal component of 𝑣2 with
respect to 𝑒1 will be a good choice for 𝑒2 . Thus, let
𝑒2 ∶ = 𝑣2 − proj𝑒 (𝑣2 )
1
⟨𝑣 , 𝑒 ⟩
= 𝑣2 − 2 1 𝑒1 .
⟨𝑒1 , 𝑒1 ⟩
From the definition, it is clear that 𝑒2 ⟂ 𝑒1 , and it is also clear that {𝑒1 , 𝑒2 } spans the same subspace as {𝑣1 , 𝑣2 }.
In the next step, we perform the same process. We project 𝑣3 onto the subspace generated by 𝑒1 and 𝑒2 , then define 𝑒3
as the difference of 𝑣3 and the projection . That is,
⟨𝑣3 , 𝑒1 ⟩ ⟨𝑣 , 𝑒 ⟩
𝑒3 ∶ = 𝑣3 − 𝑒1 − 3 2 𝑒2
⟨𝑒1 , 𝑒1 ⟩ ⟨𝑒2 , 𝑒2 ⟩
= 𝑣3 − proj𝑒 ,𝑒 (𝑣3 ).
1 2
With this, we essentially remove the “contributions” of 𝑒1 and 𝑒2 towards 𝑣3 , thus obtaining an 𝑒3 that is orthogonal to
the previous ones.
In general, if we have 𝑒1 , … , 𝑒𝑘 , the vector 𝑒𝑘+1 can be found by
where
𝑘
⟨𝑥, 𝑒𝑖 ⟩
proj𝑒 (𝑥) = ∑ 𝑒 (5.14)
1 ,…,𝑒𝑘
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩ 𝑖
is the generalized orthogonal projection operator, projecting a vector to the subspace generated by {𝑒1 , … , 𝑒𝑘 }. To check
that 𝑒𝑘+1 ⟂ 𝑒1 , … , 𝑒𝑘 , we have
𝑘
⟨𝑣𝑘+1 , 𝑒𝑖 ⟩
⟨𝑒𝑘+1 , 𝑒𝑗 ⟩ = ⟨𝑣𝑘+1 − ∑ 𝑒 ,𝑒 ⟩
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩ 𝑖 𝑗
𝑘
⟨𝑣𝑘+1 , 𝑒𝑖 ⟩
= ⟨𝑣𝑘+1 , 𝑒𝑗 ⟩ − ∑ ⟨𝑒𝑖 , 𝑒𝑗 ⟩
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩
= ⟨𝑣𝑘+1 , 𝑒𝑗 ⟩ − ⟨𝑣𝑘+1 , 𝑒𝑗 ⟩
= 0,
due to the orthogonality of the 𝑒𝑖 -s and the linearity of the inner product. Since {𝑒1 , … , 𝑒𝑘 } spans the same subspace as
{𝑣1 , … , 𝑣𝑘 } and 𝑒𝑘+1 is a linear combination of 𝑣𝑘+1 and 𝑒1 , … , 𝑒𝑘 (where the coefficient of 𝑣𝑘+1 is nonzero),
also follows.
This can be repeated until we run out of vectors and find {𝑒1 , … , 𝑒𝑛 }.
For the sake of further reference, mathematical correctness, and a tiny bit of OCD, let’s summarize the all the above in a
single theorem.
span(𝑒1 , … , 𝑒𝑘 ) = span(𝑣1 , … , 𝑣𝑘 )
As a consequence, we can state that each finite inner product space has an orthonormal basis. We can even construct it
explicitly via the Gram-Schmidt process.
Going one step further, we can view Theorem 4.4.2 and its proof as an algorithm.
66 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning
span(𝑒1 , … , 𝑒𝑘 ) = span(𝑣1 , … , 𝑣𝑘 )
⟨𝑣 , 𝑒 ⟩
= 𝑣 2 − 2 1 𝑒1
⟨𝑒1 , 𝑒1 ⟩
𝑐⟨𝑣1 , 𝑣1 ⟩
= 𝑣2 − 𝑣
⟨𝑣1 , 𝑣1 ⟩ 1
= 𝑣2 − 𝑐𝑣1
= 0.
This is true in the general case as well: when the process encounters an input vector that is linearly dependent from the
previous ones, a zero vector is added to the output.
Earlier, we have seen that given a fixed vector 𝑦 ∈ 𝑉 , we can decompose any 𝑥 ∈ 𝑉 as 𝑥 = 𝑥𝑜 + 𝑥𝑝 , where 𝑥𝑜 is
orthogonal to 𝑦, while 𝑥𝑝 is parallel to it. (We used this to provide a geometric motivation for inner products.)
This is an essential tool, and in this section, we will see that an analogue of this decomposition still holds true when 𝑦 is
replaced with an arbitrary subspace 𝑆 ⊂ 𝑉 . To see this, let’s talk about the orthogonality of subspaces.
For example, the 𝑥-axis and the 𝑦-axis are orthogonal subspaces in ℝ2 . (Just as the 𝑥-𝑦 plane and the 𝑧-axis in ℝ3 .)
Similarly, we can talk about the orthogonality of a vector and a subspace: 𝑥 is orthogonal to the subspace 𝑆, 𝑥 ⟂ 𝑆 in
symbols, if 𝑥 is orthogonal to all vectors of 𝑆.
One of the most straightforward and essential ways to construct orthogonal subspaces is to take the orthogonal complement.
𝑆 ⟂ ∶= {𝑥 ∈ 𝑉 ∶ 𝑥 ⟂ 𝑆} (5.15)
Theorem 4.5.1
Let 𝑉 be an arbitrary inner product space and 𝑆 ⊆ 𝑉 one of its subspaces. 𝑆 ⟂ is orthogonal to 𝑆, and a subspace.
Moreover, 𝑆 ∩ 𝑆 ⟂ = {0}.
Proof. According to the definition of subspaces, we only have to show that 𝑆 ⟂ is closed with respect to addition and
scalar multiplication. As the inner product is bilinear, this is straightforward:
Recall the decomposition of any 𝑥 ∈ 𝑉 into a parallel and an orthogonal component with respect to a fixed vector 𝑦? In
terms of subspaces, we can restate this as
𝑉 = span(𝑦) + span(𝑦)⟂ ,
that is, 𝑉 can be written as the direct sum of the vector space spanned by 𝑦, and its orthogonal complement. This is an
extremely powerful result, as this allows us to decouple 𝑥 from 𝑦. For instance, if we think about vectors as a collection
of features (just like the sepal and petal width and length measurements in our favourite Iris dataset), 𝑦 can represent a
certain trait that we want to exclude from our analysis.
With the notion of orthogonal complements, we can make this mathematically precise. We can also be more general. In
fact, the decomposition
𝑉 = 𝑆 + 𝑆⟂
holds for any subspace 𝑆! We are going to see at least two proofs for this. One right now, another a bit later when talking
about orthogonal projections.
Theorem 4.5.2
Let 𝑉 be an arbitrary finite dimensional inner product space and 𝑆 ⊂ 𝑉 its subspace. Then
𝑉 = 𝑆 + 𝑆⟂
holds.
Proof. Let 𝑒1 , … , 𝑒𝑘 ∈ 𝑆 be an orthonormal basis of 𝑆. This is guaranteed to exist, and we can even construct it from
an arbitrary basis using the Gram-Schmidt process.
Like during its proof, we can define the generalized orthogonal projection (5.14), given by
𝑘
proj𝑒 (𝑥) = ∑⟨𝑥, 𝑒𝑖 ⟩𝑒𝑖 .
1 ,…,𝑒𝑘
𝑖=1
68 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning
Since proj𝑒 ,…,𝑒 (𝑥) is the linear combination of 𝑒𝑖 ∈ 𝑆-s, it belongs to 𝑆. On the other hand, the bilinearity of the inner
1 𝑘
product gives that 𝑥 − proj𝑒 ,…,𝑒 (𝑥) ∈ 𝑆 ⟂ . Indeed, as we have
1 𝑘
𝑘
⟨𝑥 − proj𝑒 (𝑥), 𝑒𝑗 ⟩ = ⟨𝑥 − ∑⟨𝑥, 𝑒𝑖 ⟩𝑒𝑖 , 𝑒𝑗 ⟩
1 ,…,𝑒𝑘
𝑖=1
𝑘
= ⟨𝑥, 𝑒𝑗 ⟩ − ∑⟨𝑥, 𝑒𝑖 ⟩⟨𝑒𝑖 , 𝑒𝑗 ⟩
𝑖=1
= ⟨𝑥, 𝑒𝑗 ⟩ − ⟨𝑥, 𝑒𝑗 ⟩
= 0,
the vector 𝑥 − proj𝑒 ,…,𝑒 (𝑥) is orthogonal to each 𝑒𝑗 . Thus, since 𝑒1 , … , 𝑒𝑘 is an orthonormal basis of 𝑆, it is also
1 𝑘
orthogonal to 𝑆, hence 𝑥 − proj𝑒 ,…,𝑒 (𝑥) ∈ 𝑆 ⟂ .
1 𝑘
The fact that every 𝑥 ∈ 𝑉 can be decomposed as the sum of a vector from 𝑆 and a vector from 𝑆 ⟂ , as given by (5.16),
means that 𝑉 = 𝑆 + 𝑆 ⟂ , which is what we had to prove. □
5.6 Problems
𝑎𝑖,𝑗 ∶= ⟨𝑣𝑖 , 𝑣𝑗 ⟩.
𝑛 𝑛
where 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 and 𝑦 = ∑𝑖=1 𝑦𝑖 𝑣𝑖 .
Problem 3. Let 𝑉 be a finite-dimensional real inner product space.
(a) Let 𝑦 ∈ 𝑉 be an arbitrary vector. Show that
𝑓 ∶ 𝑉 → ℝ, 𝑥 ↦ ⟨𝑥, 𝑦⟩
is a linear function. (That is, 𝑓(𝛼𝑢 + 𝛽𝑣) = 𝛼𝑓(𝑥) + 𝛽𝑓(𝑦) holds for all 𝑢, 𝑣 ∈ 𝑉 and 𝛼, 𝛽 ∈ ℝ).
(b) Let 𝑓 ∶ 𝑉 → ℝ an arbitrary linear function. Show that there exists an 𝑦 ∈ 𝑉 such that
5.6. Problems 69
Mathematics of Machine Learning
(Note that (b) is the reverse of (a), and a much more interesting result.)
Problem 4. Let 𝑉 be a real inner product space and let ‖𝑥‖ = √⟨𝑥, 𝑥⟩ be the generated norm. Show that
This is called the parallelogram law, because if we think of 𝑥 and 𝑦 as two the sides determining a parallelogram, (5.17)
relates the length of its sides to the length of its diagonals.
Problem 5. Let 𝑉 be a real inner product space and let 𝑥1 , 𝑥2 ∈ 𝑉 . Show that if
⟨𝑥1 , 𝑦⟩ = ⟨𝑥2 , 𝑦⟩
70 Chapter 5. Inner products, angles, and lots of reasons to care about them
CHAPTER
SIX
Now that we started to understand the geometric structure of vector spaces, it’s time to put the theory into practice once
again. In this chapter, we’ll take a hands-down look at norms, inner products, and NumPy array operations in general.
The last time we translated theory to code, we left off at finding an ideal representation for vectors: NumPy arrays. Let’s
initialize two instances to play around with.
import numpy as np
In linear algebra, and in most of machine learning, almost all operations involve looping through the vector components
one by one. For instance, adding together two vectors can be implemented like this.
for i in range(len(x_plus_y)):
x_plus_y[i] = x[i] + y[i]
return x_plus_y
add(x, y)
Of course, this is far from optimal. (It may not even work if the vectors have different dimensions.)
For example, addition is massively parallelizable, and our implementation does not take advantage of that. With two
threads, we can do two additions simultaneously. So, adding together two-dimensional vectors would require just one
step, as one would compute x[0] + y[0], while the other x[1] + y[1]. Raw Python does not have access to such
high-performance computing tools, but NumPy does, through its functions that are implemented in C. In turn, C uses the
LAPACK (Linear Algebra PACKage) library, which makes calls to BLAS (Basic Linear Algebra Subprograms). BLAS
is optimized at the assembly level.
So, whenever it is possible, we should strive to work with vectors in a NumPythonic way. (Yes, I just made up that term.)
For vector addition, this is simply the + operator, as we have seen earlier.
71
Mathematics of Machine Learning
By the way, you shouldn’t ever compare floats with the == operator, as internal rounding errors can occur due to the float
representation. The example below illustrates this.
False
To compare arrays, NumPy provides the functions np.allclose and np.equal. These compare arrays elementwise,
returning a boolean array. From this, the built-in all function can be used to see if all the elements match.
True
In the following, we’ll briefly review how to work with NumPy arrays in practice.
At this point, there are two operations that we want to do with our vectors: apply a function elementwise, or take the
sum/product of the elements. Since the +, *, and ** operators are implemented for our arrays, certain functions carry
over from scalars, as the example shows below.
def just_a_quadratic_polynomial(x):
return 3*x**2 + 1
However, we can’t just plug in ndarray-s to every function. For instance, let’s take a look at Python’s built-in exp from
its math module.
exp(x)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-a34f98e51bbc> in <module>
1 from math import exp
2
----> 3 exp(x)
for i in range(len(x)):
x_exp[i] = exp(x[i])
return x_exp
(Recall that np.empty_like(x) creates an uninitialized array that matches the dimensions of x.)
naive_exp(x)
A bit less naive implementation would use comprehensions to achieve the same effect.
bit_less_naive_exp(x)
Even though comprehensions are more concise and readable, they still don’t avoid the core issue: for loops in Python.
This problem is solved by NumPy’s famous ufuncs, that is, “functions that operate element by element on whole arrays”.
Since they are implemented in C, they are blazing fast. For instance, the exponential function 𝑓(𝑥) = 𝑒𝑥 is given by
np.exp.
np.exp(x)
all(np.equal(naive_exp(x), np.exp(x)))
True
all(np.equal(bit_less_naive_exp(x), np.exp(x)))
True
Again, there are more advantages of using NumPy functions and operations than simplicity. In machine learning, we care
a lot about speed, and as we are about to see, NumPy delivers once more.
n_runs = 10000
(continues on next page)
t_naive_exp = timeit(
"np.array([exp(x_i) for x_i in x])",
setup=f"import numpy as np; from math import exp; x = np.ones({size})",
number=n_runs
)
t_numpy_exp = timeit(
"np.exp(x)",
setup=f"import numpy as np; from math import exp; x = np.ones({size})",
number=n_runs
)
For further reference, you can find the list of available ufuncs here.
What about operations that aggregate the elements and return a single value? Not surprisingly, these can be found within
NumPy as well. For instance, let’s take a look at the sum. In terms of mathematical formulas, we are looking to implement
the function
𝑛
sum(𝑥) = ∑ 𝑥𝑖 , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 .
𝑖=1
for x_i in x:
val += x_i
return val
naive_sum(x)
13.799999999999999
sum(x)
13.799999999999999
The story is the same: NumPy can do this better using its own data structures. We can either call the function np.sum,
or use the array method np.ndarray.sum.
np.sum(x)
13.799999999999999
x.sum()
13.799999999999999
Y’know by now that I love timing functions, so let’s compare the performances once more.
t_naive_sum = timeit(
"sum(x)",
setup=f"import numpy as np; x = np.ones({size})",
number=n_runs
)
t_numpy_sum = timeit(
"np.sum(x)",
setup=f"import numpy as np; x = np.ones({size})",
number=n_runs
)
np.prod(x)
-543.996
On quite a few occasions, we need to find the maximum or minimum of an array. We can do this using the np.max and
np.min functions. (Similarly to the others, these are also available as array methods.) The rule of thumb is if you want
to perform any array operation, use NumPy functions.
Now that we have reviewed how to perform operations on our vectors efficiently, it’s time to dive deep into the really
interesting part: norms and distances.
Let’s start with the most important one: the Euclidean norm, also known as the 2-norm, defined by
𝑛 1/2
‖𝑥‖2 = ( ∑ 𝑥2𝑖 ) , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 .
𝑖=1
Note that our euclidean_norm function is dimension-agnostic, that is, it works for arrays of every dimension.
euclidean_norm(x)
4.036087214122113
euclidean_norm(y)
10.261578825892242
But wait, didn’t I just mention that we should use NumPy functions whenever possible? Norms are important enough to
have their own functions: np.linalg.norm.
np.linalg.norm(x)
4.036087214122113
With a quick inspection, we can check that these match for our vector x.
np.equal(euclidean_norm(x), np.linalg.norm(x))
True
However, the Euclidean norm is just a special case of 𝑝-norms. Recall that for any 𝑝 ∈ [0, ∞), we defined the 𝑝-norm
by the formula
𝑛 1/𝑝
𝑝
‖𝑥‖𝑝 = ( ∑ |𝑥𝑖 | ) , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 ,
𝑖=1
and
for 𝑝 = ∞. It is a good practice to keep the number of functions in a codebase minimal to reduce maintenance costs.
Can we compact all 𝑝-norms into a single Python function that takes the value of 𝑝 as an argument? Sure. We only have
a small issue: representing ∞. Python and NumPy both provide their own representations, but we will go with NumPy’s
np.inf. Surprisingly, this is a float type.
type(np.inf)
float
Since ∞ can have multiple other representations, such as Python’s built-in math.inf, we can make our function more
robust by using the np.isinf function to check if an object represents ∞ or not.
A quick check shows that p_norm works as intended.
However, once again, NumPy is one step ahead of us. In fact, the familiar np.linalg.norm already does this out of
the box. We can achieve the same with less code by passing the value of 𝑝 as the argument ord, short for order. For
ord = 2, we obtain the good old 2-norm.
Somewhat surprisingly, distances don’t have their own NumPy functions. However, as the most common distance metrics
are generated from norms, we can often write our own. For instance, here is the Euclidean distance.
Besides norms and distances, the third component that defines the geometry of our vector spaces is the inner product.
During our journey, we’ll almost exclusively use the dot product, defined in the vector space ℝ𝑛 by
𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑥𝑖 𝑦𝑖 , 𝑥, 𝑦 ∈ ℝ𝑛 .
𝑖=1
By now, you can easily smash out a Python function that calculates this. In principle, the one-liner below should work.
dot_product(x, y)
4.5
When the dimension of the vectors doesn’t match, the function throws an exception as we expect.
dot_product(x, y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-086e67f8bc3e> in <module>
2 y = np.array([1.9, 2.5])
3
----> 4 dot_product(x, y)
<ipython-input-39-bb4e36bf420e> in dot_product(x, y)
1 def dot_product(x: np.ndarray, y: np.ndarray):
----> 2 return np.sum(x*y)
ValueError: operands could not be broadcast together with shapes (4,) (2,)
However, upon further attempts to break the code, a strange thing occurs. Our function dot_product should fail when
called with an 𝑛-dimensional and a one-dimensional vector, and this is not what happens.
dot_product(x, y)
3.0
I always advocate breaking solutions in advance to avoid later surprises, and the above example excellently illustrates the
usefulness of this principle.
Beyond the scenes, NumPy is doing something called broadcasting. When performing an operation on two arrays with
mismatching shapes, it tries to guess the correct sizes and reshape them so the operation can go through. Check out what
takes place when calculating x*y.
x*y
NumPy guessed that we want to multiply all elements of x by the scalar y[0], so it transforms y = np.array([2.
0]) into the four-dimensional vector np.array([2.0, 2.0, 2.0, 2.0]), then calculates the elementwise
product.
Broadcasting is extremely useful because it allows us to write much simpler code by automagically performing transfor-
mations. Still, if you are unaware of how and when broadcasting is done, it can seriously bite you in the back. Just like
in our case, as the inner product of a four-dimensional and a one-dimensional vector is not defined.
To avoid writing excessive checks for edge cases (or missing them altogether), we calculate the inner product in practice
using the np.dot function.
np.dot(x, y)
4.5
When attempting to call np.dot with misaligned arrays, it fails as supposed to. Even in cases when broadcasting bails
out our custom implementation.
np.dot(x, y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-45-513b3ef24556> in <module>
2 y = np.array([2.0])
3
----> 4 np.dot(x, y)
Now that we have a basic arsenal of array operations and functions, it is time to do something with them!
One of the most fundamental algorithms in linear algebra is the Gram-Schmidt orthogonalization process, used to turn a
set of linearly independent vectors into an orthonormal set.
To be more precise, for our input of a set of linearly independent vectors 𝑣1 , … , 𝑣𝑛 ∈ ℝ𝑛 , the Gram-Schmidt process
finds the output set of vectors 𝑒1 , … , 𝑒𝑛 ∈ ℝ𝑛 such that
• ‖𝑒𝑖 ‖ = 1 and ⟨𝑒𝑖 , 𝑒𝑗 ⟩ = 0 for all 𝑖 ≠ 𝑗 (that is, the vectors are orthonormal).
• and span(𝑒1 , … , 𝑒𝑘 ) = span(𝑣1 , … , 𝑣𝑘 ) for all 𝑘 = 1, … , 𝑛.
If you are having trouble recalling how this is done, feel free to revisit the section where we first described the algorithm.
The learning process is a spiral, where we keep revisiting old concepts from new perspectives. For the Gram-Schmidt
process, this is our second iteration, where we put the mathematical formulation into code.
Since we are talking about a sequence of vectors, we need a suitable data structure for this purpose vectors. There are
several possibilities for this in Python. For now, we are going with the conceptually simplest, albeit computationally rather
suboptimal one: lists. (Later, we’ll revisit this algorithm using multidimensional arrays, enabling us to write super concise
code, but let’s not get ahead of ourselves.)
vectors
The first component of the algorithm is the orthogonal projection operator, defined by
𝑘
⟨𝑥, 𝑒𝑖 ⟩
proj𝑒 (𝑥) = ∑ 𝑒.
1 ,…,𝑒𝑘
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩ 𝑖
for e in to:
e_norm_square = np.dot(e, e)
p_x += np.dot(x, e)*e/e_norm_square
return p_x
To check if it works, let’s look at a simple example and visualize the results. If you are reading the Jupyter Notebook
version of the book, feel free to change the inputs and experiment with this. (Don’t worry if you don’t understand the
visualization code, it is not essential for now.)
x = np.array([1.0, 2.0])
e = np.array([2.0, 1.0])
with plt.style.context("seaborn-white"):
plt.figure(figsize=(7, 7))
plt.xlim([-0, 3])
plt.ylim([-0, 3])
plt.arrow(0, 0, x[0], x[1], head_width=0.1, color="r", label="x")
plt.arrow(0, 0, e[0], e[1], head_width=0.1, color="g", label="e")
plt.arrow(x_to_e[0], x_to_e[1], x[0] - x_to_e[0], x[1] - x_to_e[1], linestyle="--
↪")
plt.legend()
True
When writing code for production, a couple of visualizations and ad-hoc checks are not enough. An extensive set of unit
tests is customarily written to ensure that a function works as intended. We are skipping this to keep our discussion on
track, but feel free to add some of your tests. After all, mathematics and programming are not a spectator’s sport.
With the projection(x: np.ndarray, to: List[np.ndarray]) function available to us, we are ready
to knock the Gram-Schmidt algorithm out of the park.
return output
gram_schmidt(test_vectors)
So, we have just created our first algorithm from scratch. This is like the base camp for Mount Everest. We have gone a
long way, but there is much more to go until we create a neural network from scratch. Until then, the journey is packed
with beautiful sections, and this is one of them. Take a while to appreciate this, then move on when you are ready.
which causes numerical issues. If any v is approximately zero, its norm np.linalg.norm(v, ord=2) is going to
be really small, and division with such small numbers can lead to issues.
This issue also affects the projection function. Take a look at the definition below:
for e in to:
e_norm_square = np.dot(e, e)
p_x += np.dot(x, e)*e/e_norm_square
return p_x
If e is (close to) zero, which can happen if the input vectors are linearly dependent, then e_norm_square is small.
In the following chapter, we will be meeting with the single most important objects in machine learning: matrices.
6.5 Problems
1 𝑛
MSE(𝑥, 𝑦) = ∑(𝑥 − 𝑦𝑖 )2 , 𝑥, 𝑦 ∈ ℝ𝑛
𝑛 𝑖=1 𝑖
both with and without using NumPy functions and methods. (The vectors 𝑥 and 𝑦 should be represented by NumPy arrays
in both cases.)
Problem 2. Compare the performances of the built-in maximum function max and NumPy’s np.max using timeit.
timeit, like we did above. Try running a different number of experiments and changing the array sizes to figure out
the breakeven point between the two performances.
Problem 3. Instead of implementing the general 𝑝-norm as we earlier in this chapter, we can change things around to
obtain the version below.
However, this doesn’t work for 𝑝 = ∞. What is the problem with it?
Problem 4. Let 𝑤 ∈ ℝ𝑛 be a vector with nonnegative elements. Use NumPy to implement the weighted 𝑝-norm by
𝑛 1/𝑝
‖𝑥‖𝑤 𝑝
𝑝 = ( ∑ 𝑤𝑖 |𝑥𝑖 | ) , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 .
𝑖=1
Can you come up with a scenario where this can be useful in machine learning?
Problem 5. Implement the cosine similarity function, defined by the formula
𝑥 𝑦
cos(𝑥, 𝑦) = ⟨ , ⟩, 𝑥, 𝑦 ∈ ℝ𝑛 .
‖𝑥‖ ‖𝑦‖
SEVEN
I am quite sure that you were already familiar with the notion of matrices before reading this book. Matrices are one
of the most important data structures that are able to represent systems of equations, graphs, mappings between vector
spaces, and many more. Matrices are the fundamental building blocks of neural networks.
At first look, we define a matrix as a table of numbers. If the matrix 𝐴 has, for instance, 𝑛 rows and 𝑚 columns of real
numbers, we write
When we don’t want to write out the entire matrix as (7.1), we use the abbreviation 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 .
The set of all 𝑛 × 𝑚 real matrices is denoted by ℝ𝑛×𝑚 . We will exclusively talk about real matrices, but when it is not
the case, this notation is modified accordingly. For instance, ℤ𝑛×𝑚 denotes the set of integer matrices.
Matrices can be added and multiplied together, or multiplied by a scalar.
𝑐𝐴 ∶= (𝑐𝑎𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
.
𝑙
𝐴𝐵 ∶= (∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
.
𝑘=1
Scalar multiplication and addition is clear, but matrix multiplication is not the simplest-to-understand operation ever.
Fortunately, visualization can help. In essence, the (𝑖, 𝑗)-th element is the dot product of the 𝑖-th row of 𝐴 and the 𝑗-th
column of 𝐵.
Besides addition and multiplication, there is another operation that is worth mentioning: transposition.
85
Mathematics of Machine Learning
𝐴𝑇 = (𝑎𝑗,𝑖 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑚×𝑛
Transposition simply means “flipping” the matrix, replacing rows with columns. For example,
𝑎 𝑏 𝑎 𝑐
𝐴=[ ], 𝐴𝑇 = [ ].
𝑐 𝑑 𝑏 𝑑
As opposed to addition and multiplication, transposition is a unary operation. (Unary means that it takes one argument.
Binary operations take two arguments, and so on.) Although transposition is easy to understand, there is much more
behind the surface. Later, we will give a geometric interpretation involving inner products, but for now, let’s move on to
study the invertibility of linear transformations.
Matrix multiplication is one of the most frequently used operations in computing. As it can be performed extremely fast,
it is common to even vectorize certain algorithms just to express it in terms of matrix multiplications.
Thus, the more we know about it the better. To get a grip on the operation itself, we can take a look at it from a few
different angles. Let’s start with a special case!
In machine learning, taking the product of a matrix and a column vector is a fundamental building block of certain models.
For instance, this is linear regression in itself, or the famous fully connected layer in neural networks.
To see what happens in this case, let 𝐴 ∈ ℝ𝑛×𝑚 be a matrix. If we treat 𝑥 ∈ ℝ𝑚 as a column vector 𝑥 ∈ ℝ𝑚×1 , then 𝐴𝑥
can be written as
𝑚
𝑎1,1 𝑎1,2 … 𝑎1,𝑚 𝑥1 ∑ 𝑎1,𝑗 𝑥𝑗
⎡𝑎 ⎤ ⎡ ⎤ ⎡ 𝑗=1
𝑚 ⎤
𝑎2,2 … 𝑎2,𝑚 𝑥 ∑𝑗=1 𝑎2,𝑗 𝑥𝑗 ⎥
𝐴𝑥 = ⎢ 2,1 ⎥⎢ 2⎥ = ⎢⎢ ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥
𝑚
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑚 .⎦ ⎣𝑥𝑚 ⎦ ⎣∑ 𝑎𝑛,𝑗 𝑥𝑗 ⎦
𝑗=1
Based on this, the matrix 𝐴 describes a function that takes a piece of data 𝑥, then transforms it into the form 𝐴𝑥.
This is the same as taking the linear combination of 𝐴‘s columns, that is,
With a bit more suggestive notation, by denoting the 𝑖-th column as 𝑎𝑖 , we can write
𝑥1 𝑛 𝑎1,𝑖
⎡𝑎
⎢ 1 𝑎2 … 𝑎𝑛 ⎤ ⎡ ⎤
⎥ ⎢ ⋮ ⎥ = ∑ 𝑥𝑖 𝑎𝑖 , 𝑎𝑖 = ⎡
⎢ ⋮ ⎥.
⎤ (7.2)
⎣ ⎦ ⎣𝑥𝑛 ⎦ 𝑖=1 ⎣𝑎𝑛,𝑖 ⎦
If we replace the vector 𝑥 with a matrix 𝐵, the columns in the product matrix 𝐴𝐵 are linear combinations of 𝐴‘s columns,
where the coefficients are determined by 𝐵.
You should really appreciate that certain operations on the data can be written in the form 𝐴𝑥. Elevating this simple
property to a higher level of abstraction, we can say that the data has the same representation as the function. If you are
familiar with programming languages like Lisp, you know how beautiful this is.
There is one more way to think about the matrix product: taking the columnwise inner products. If 𝑎𝑖 = (𝑎𝑖,1 , … , 𝑎𝑖,𝑛 )
denotes the 𝑖-th column of 𝐴, then 𝐴𝑥 can be written as
that is, the transformation 𝑥 ↦ 𝐴𝑥 projects the input 𝑥 to the row vectors of 𝐴, then compacts the results in a vector.
Because of the well-defined matrix operations, we can do algebra on matrices just as with numbers. However, there are
some differences. As manipulating matrix expressions is an essential skill, let’s take a look at its fundamental rules!
𝐴 + (𝐵 + 𝐶) = (𝐴 + 𝐵) + 𝐶
𝐴(𝐵𝐶) = (𝐴𝐵)𝐶
𝐴(𝐵 + 𝐶) = 𝐴𝐵 + 𝐴𝐶
(𝐴 + 𝐵)𝐶 = 𝐴𝐶 + 𝐵𝐶
As the proof is extremely technical and boring, we are going to skip it. However, there are a few things to note. Most
importantly, matrix multiplication is not commutative; that is, 𝐴𝐵 does not always equal to 𝐵𝐴. (It might not even be
defined.) For instance, consider
1 1 1 0
𝐴=[ ], 𝐵=[ ].
1 1 0 2
1 2 1 1
𝐴𝐵 = [ ], 𝐵𝐴 = [ ],
1 2 2 2
(𝐴 + 𝐵)𝑇 = 𝐴𝑇 + 𝐵𝑇
holds.
(b) Let 𝐴 ∈ ℝ𝑛×𝑙 , 𝐵 ∈ ℝ𝑙×𝑚 be arbitrary matrices. Then
(𝐴𝐵)𝑇 = 𝐵𝑇 𝐴𝑇
holds.
To do computational work with matrices inside a computer, we are looking for a data structure that represents a matrix A
and supports
• accessing elements by A[i, j],
• assigning elements by A[i, j] = value,
• addition and multiplication with the + and * operators,
and works lightning fast. These requirements only specify the interface of our matrix data structure, not the concrete
implementation. An obvious choice would be a list of lists, but as discussed in our section about representing vectors in
computations, this is highly suboptimal. Can we leverage the C array structure to store a matrix?
Yes, and this is precisely what NumPy does, providing a fast and convenient representation for matrices in the form of
multidimensional arrays. Before learning to use NumPy’s machinery for our purposes, let’s look a bit deeper into the
heart of the issue.
At first glance, there seems to be a problem: a computer’s memory is one-dimensional, thus addressed (indexed) by a
single key, not two as we want. Thus, we can’t just shove a matrix into the memory. The solution is to flatten the matrix
and place each consecutive row next to each other, like Fig. 7.2 illustrates in the 3 × 3 case.
By storing the rows of any 𝑛 × 𝑚 matrix in a contiguous array, we get all the benefits of the array data structure at the
low cost of a simple index transformation defined by
(𝑖, 𝑗) ↦ 𝑖𝑚 + 𝑗.
To demonstrate what’s happening, let’s conjure up a prototypical Matrix class in Python that uses a single list to store
all the values, yet supports accessing elements by row and column indices. For the sake of illustration, let’s imagine that
a Python list is actually a static array. (At least until this presentation is over.) This is for educational purposes only, as at
the moment, we only care about understanding the flattening process, not performance.
Take a moment to review the code below. I’ll explain everything line by line. (If you are not familiar with classes in
Python, I encourage you to also check the introductory OOP section in the appendix.)
class Matrix:
def __init__(self, shape: Tuple[int, int]):
if len(shape) != 2:
raise ValueError("The shape of a Matrix object must be a two-dimensional␣
↪tuple.")
def __repr__(self):
array_form = [[self[i, j] for j in range(self.shape[1])] for i in range(self.
↪shape[0])]
The Matrix object is initialized with the __init__ method. This is called when an object is created, like we are about
to do now.
M = Matrix(shape=(3, 4))
Upon initialization, we supply the dimensions of the matrix in the form of a two-dimensional tuple, passed for the
shape argument. In our concrete example, M is a 3 × 4 matrix, represented by an array of length 12. For simplicity, our
simple Matrix is filled up with zeros by default.
Overall, the __init__ method does three things:
• checking the validity of shape,
• storing shape in an attribute,
• and initializing a list of size shape[0]*shape[1], serving as our data storage.
The second method, suggestively named _linear_idx, is responsible for translating between the row-column indices
of the matrix and the linear index for our internal one-dimensional representation. (In Python, it is customary to prefix
methods with an underscore if they are not intended to called externally. Many other languages, such as Java, support
hidden methods. Python is not one of them, so we have to make do with such polite suggestions instead of strictly enforced
rules.)
We can implement item retrieval via indexing by providing the __getitem__ method, expecting a two-dimensional
integer tuple as the key. For any key = (i, j), the method
• calculates the linear index using our _linear_idx method,
• then retrieves the element located at the given linear index from the list.
Item assignment happens similarly, as given by the __setitem__ magic method. Let’s try these out to see if they work.
M[1, 2] = 3.14
M[1, 2]
3.14
By providing a __repr__ method, we specify how a Matrix object is represented as a string. So, we can print it out
to the standard output in a pretty form.
Pretty awesome. Now that we understand some of the internals, it is time to see how we can achieve much more with
NumPy.
As foreshadowed earlier, NumPy provides an excellent out-of-the-box representation for matrices in the form of multi-
dimensional arrays. (These are often called tensors, but I’ll just stick to the naming array.)
I have some fantastic news: these are the same np.ndarray objects we have been using! We can create one by simply
providing a list of lists during initialization.
import numpy as np
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Everything works the same as we have seen so far. Operations are performed elementwise, and you can plug them into
functions like np.exp.
A + B
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
A*B
np.exp(A)
Since we are working with multidimensional arrays, the transposition operator can be defined. Here, this is conveniently
implemented as the np.transpose function, but can also be accessed at the np.ndarray.T attribute.
np.transpose(A)
array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])
A.T
array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])
As expected, we can get and set elements with the indexing operator []. The indexing starts from zero. (Don’t even get
me started.)
A[1, 2] # 1st row, 2nd column (if we index rows and columns from zero)
Entire rows and columns can be accessed using slicing. Instead of putting out the exact definitions, I’ll just leave a few
examples and let you figure it out with your internal pattern matching engine. (That is, your intelligence.)
array([ 2, 6, 10])
array([4, 5, 6, 7])
array([4, 5, 6, 7])
When used as an iterable, a two-dimensional array yields its rows at every step.
for row in A:
print(row)
[0 1 2 3]
[4 5 6 7]
[ 8 9 10 11]
Initializing arrays can be done with the familiar np.zeros, np.ones, and other functions.
np.zeros(shape=(4, 5))
As you have guessed, that shape argument specifies the dimensions of the array. We are going to explore this property
next.
Let’s initialize an example multidimensional array with three rows and four columns.
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
The shape of an array, stored inside the attribute np.ndarray.shape, is a tuple object describing its dimensions.
In our example, since we have a 3 × 4 matrix, the shape equals (3, 4).
A.shape
(3, 4)
This innocently looking attribute determines what kind of operations you can perform with your arrays. Let me tell you,
as a machine learning engineer, shape mismatches will be the bane of your existence. You want to calculate the product
of two matrices A and B? The second dimension of A must match the first dimension of B. Pointwise products? Matching
or broadcastable shapes are required. Understanding shapes are vital.
However, we have just learned that multidimensional arrays are linear arrays in disguise. Because of this, we can reshape
an array by slicing the linear view differently. For example, A can be reshaped into arrays with shapes (12, 1), (6,
2), (4, 3), (3, 4), (2, 6), and (1, 12).
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])
The np.ndarray.reshape method returns a newly constructed array object but doesn’t change A. In other words,
reshaping is not destructive in NumPy.
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Reshaping is hard to wrap your head around for the first time. To help you visualize the process, Fig. 7.3 shows precisely
what happens in our case.
If you are unaware of the exact dimension along a specific axis, you can get away by inputting -1 there during the
reshaping. Since the product of dimensions is constant, NumPy is smart enough to figure out the missing one for you.
This trick will get you out of trouble all the time, so it is worth taking note.
A.reshape(-1, 2)
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])
We won’t go into the details now, but as you probably guessed, multidimensional arrays can have more than two di-
mensions. The range of permitted shapes for the operations will be even more complicated then. So, building a solid
understanding now will provide a massive headstart in the future.
Without a doubt, one of the most important operations regarding matrices is multiplication. Computing determinants and
eigenvalues? Matrix multiplication. Passing data through a fully connected layer? Matrix multiplication. Convolution?
Matrix multiplication. We will see how these seemingly different things can be traced back to matrix multiplication, but
first, let’s discuss the operation itself from a computational perspective.
First, recap the mathematical definition. If 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
and 𝐵 = (𝑏𝑖,𝑗 )𝑚,𝑙
𝑖,𝑗=1 ∈ ℝ
𝑚×𝑙
are two arbitrary
matrices, then their product is defined by the formula,
𝑚 𝑛,𝑙
𝐴𝐵 = ( ∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 ) ∈ ℝ𝑛×𝑙 ,
𝑘=1 𝑖,𝑗=1
which comes from the composition of the linear transformations determined by 𝐴 and 𝐵. Notice that the element in the
𝑖-th row and 𝑗-th column of 𝐴𝐵 is the dot product of 𝐴‘s 𝑖-th row and 𝐵‘s 𝑗-th column.
We can put this into code using the tools we have learned so far.
return AB
Let’s test our function with an example that is easy to verify by hand.
A = np.ones(shape=(4, 6))
B = np.ones(shape=(6, 3))
matrix_multiplication(A, B)
np.matmul(A, B)
This yields the same result as our custom function. We can test it out by generating a bunch of random matrices and
checking if the results match.
for _ in range(100):
n, m, l = np.random.randint(1, 100), np.random.randint(1, 100), np.random.
↪randint(1, 100)
A = np.random.rand(n, m)
B = np.random.rand(m, l)
According to this small test, our matrix_multiplication function yields the same result as NumPy’s built-in one.
We are happy, but don’t forget: always use your chosen framework’s implementations in practice, may it be NumPy,
TensorFlow, or PyTorch.
Since writing numpy.matmul is cumbersome when lots of multiplications are present, NumPy offers a way to abbreviate
using the @ operator.
A = np.ones(shape=(4, 6))
B = np.ones(shape=(6, 3))
True
Besides composing linear transformations, matrix multiplication also describes the image of vectors under them. Recall
that if a transformation is given by the matrix 𝐴 ∈ ℝ𝑛×𝑚 and the input is given by 𝑥 ∈ ℝ𝑚 , then by treating 𝑥 as a
column vector 𝑥 ∈ ℝ𝑚×1 , the image of 𝑥 under 𝐴 can be calculated by
𝑚
𝑎1,1 𝑎1,2 … 𝑎1,𝑚 𝑥1 ∑ 𝑎1,𝑗 𝑥𝑗
⎡𝑎 ⎤ ⎡ ⎤ ⎡ 𝑗=1
𝑚 ⎤
𝑎2,2 … 𝑎2,𝑚 𝑥 ∑𝑗=1 𝑎2,𝑗 𝑥𝑗 ⎥
𝐴𝑥 = ⎢ 2,1 ⎥⎢ 2⎥ = ⎢⎢ ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥
𝑚
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑚 .⎦ ⎣𝑥𝑚 ⎦ ⎣∑ 𝑎𝑛,𝑗 𝑥𝑗 ⎦
𝑗=1
Mathematically speaking, looking at 𝑥 as a column vector is perfectly natural. Think of it as extending ℝ𝑚 with a dummy
dimension, thus obtaining ℝ𝑚×1 . This form also comes naturally by considering that the columns of a matrix are images
of the basis vectors by their very definition.
In practice, things are not as simple as they look. Implicitly, we have made a choice here: to represent datasets as a
horizontal stack of column vectors. To elaborate further, let’s consider two data points with four features and a matrix
that maps these into a three-dimensional feature space. That is, let 𝑥1 , 𝑥2 ∈ ℝ4 and let 𝐴 ∈ ℝ3×4 .
(I specifically selected the numbers so that calculations are easily verifiable by hand.) To be sure, we double-check the
shapes.
A.shape
(3, 4)
x1.shape
(4,)
np.matmul(A, x1)
array([ 0, 8, 16])
The result is correct. However, when we have a bunch of input data points, we prefer to calculate the images using a single
operation. This way, we can take advantage of vectorized code, locality of reference, and all the juicy computational magic
we have seen so far.
We can achieve this by horizontally stacking the column vectors, each one representing a data point. Mathematically
speaking, we want to perform the calculation
2 −1
0 1 2 3 ⎡ 0 1
⎡4 0 1⎤ ⎡
⎢ 5 6 7⎤ ⎢ ⎥
⎥ ⎢0 0 ⎥ ⎢ 8 1⎥
= ⎤
⎣8 9 10 11⎦ ⎣16 1⎦
⎣0 0 ⎦
in code. Upon looking up the NumPy documentation, we quickly find that the np.hstack function might be the tool
for the job, at least according to its official documentation. Yay!
Not yay. What happened? np.hstack treats one-dimensional arrays differently, and even though the math works out
perfectly by creatively abusing the notation, we don’t get away that easily in the trenches of real-life computations. Thus,
we have to reshape our inputs manually. Meet the true skill gap between junior and senior machine learning engineers:
correctly shaping multidimensional arrays.
data
array([[ 2, -1],
[ 0, 1],
[ 0, 0],
[ 0, 0]])
np.matmul(A, data)
array([[ 0, 1],
[ 8, 1],
[16, 1]])
We made an extremely impactful choice in the previous section: representing individual data points as column vectors.
I am writing this with bold letters to emphasize its importance.
Why? Because we could have gone the other way and treated samples as row vectors. With our current choice, we ended
up with a multidimensional array of shape
as opposed to
The former is called batch-last, while the latter is called batch-first format. Popular frameworks like TensorFlow and
PyTorch use batch-first, but we are going with batch-last. The reasons go back to the very definition of matrices, where
columns are the images of basis vectors under the given linear transformation. This way, we can write multiplication from
left to right, like 𝐴𝑥 and 𝐴𝐵.
Should we define matrices as rows of basis vector images, everything turns upside down. This way, if 𝑓 and 𝑔 are linear
transformations with “matrices” 𝐴 and 𝐵, the “matrix” of the composed transformation 𝑓 ∘ 𝑔 would be 𝐵𝐴. This makes
the math complicated and ugly.
On the other hand, batch-first makes the data easier to store and read. Think about a situation when you have thousands
of data points in a single CSV file. Due to how input-output is implemented, files are read line-by-line, so it is natural
and convenient to have a single line correspond to a single sample.
No good choices here; there are sacrifices either way. Since the math works out much easier for batch-last, we will use
that format. However, in practice, you’ll find that batch-first is more common. With this textbook, I don’t intend to give
you just a manual. My goal is to help you understand the internals of machine learning. If I succeed, you’ll be able to
apply your knowledge to translate between batch-first and batch-last seamlessly.
7.7 Problems
6 −2
−1 2
𝐴=[ ], 𝐵=⎡ ⎤
⎢ 2 −6⎥ .
1 5
⎣−3 2 ⎦
(b)
1 2 3 7 8
𝐴=[ ], 𝐵=[ ].
4 5 6 9 10
Problem 2. The famous Fibonacci numbers are defined by the recursive sequence
𝐹0 = 0,
𝐹1 = 1,
𝐹𝑛 = 𝐹𝑛−1 + 𝐹𝑛−2 .
(a) Write a recursive function that computes the 𝑛-th Fibonacci number. (Expect it to be really slow.)
(b) Show that
𝑛
1 1 𝐹 𝐹𝑛
[ ] = [ 𝑛+1 ],
1 0 𝐹𝑛 𝐹𝑛−1
and use this identity to write a non-recursive function that computes the 𝑛-th Fibonacci number.
Use Python’s builtin timeit function to measure the execution of both functions. Which one is faster?
Problem 3. Let 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚 𝑛,𝑚
𝑖,𝑗=1 , 𝐵 = (𝑏𝑖,𝑗 )𝑖,𝑗=1 . be two 𝑛 × 𝑚 matrices. Their Hadamard product is defined by
Implement a function that takes two identically shaped NumPy arrays, then performs the Hadamard product on them.
(There are two ways to do this: with for loops and with NumPy operations. It is instructive to implement both.)
Problem 4. Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix. Functions of the form
𝐵(𝑥, 𝑦) = 𝑥𝑇 𝐴𝑦, 𝑥, 𝑦 ∈ ℝ𝑛
are called bilinear forms. Implement a function that takes two vectors and a matrix (all represented by NumPy arrays),
then calculates the corresponding bilinear form.
7.7. Problems 99
Mathematics of Machine Learning
EIGHT
LINEAR TRANSFORMATIONS
“Why do my eyes hurt?”“You’ve never used them before.” - Morpheus to Neo, when waking up from the Matrix
for the first time
In most linear algebra courses, the curriculum is all about matrices. In machine learning, we work with them all the time.
Here is the thing: matrices don’t tell the whole story. It is hard to understand the patterns by looking only at matrices.
For instance, why is matrix multiplication defined in such a complex way as it is? Why are relations like 𝐵 = 𝑇 −1 𝐴𝑇
important? Why are some matrices invertible and some are not?
To really understand what is going on, we have to look at what gives rise to matrices: linear transformations. Like for
Neo, this might hurt a bit, but it will greatly reward us later down the line. Let’s get to it!
With the introduction of inner products, orthogonality, and orthogonal/orthonormal bases, we know everything about
the structure of our feature spaces. However, in machine learning, our interest mainly lies in transforming the data. To
illustrate this, we should take another look at the Iris dataset. Each sample is represented by a four-dimensional vector
in its raw form, belonging to one of the three classes. Since human perception is limited to three dimensions, we can’t
visualize these directly. However, we can map each feature against every other.
To simplify the data and gain predictive insight, we can train a simple neural network and check out how it maps the
dataset into a new feature space. Since the Iris set contains three class labels, the result is three-dimensional. Similarly to
the raw data, we are going to visualize this by plotting the features pairwise.
Even though we only passed the data through a function 𝑓 ∶ ℝ4 → ℝ3 without adding new information, the transformed
dataset looks much more descriptive than the original one.
From this viewpoint, a neural network is just a function composed of smaller parts (known as layers), transforming
the data to a new feature space in every step. One of the key components of models in machine learning are linear
transformations. You probably encountered them as functions of the form 𝑓(𝑥) = 𝐴𝑥, but this is only one way to look
at them. This section will start from a geometric viewpoint, then move towards the algebraic representation that you are
probably already familiar with. To understand how neural networks can learn powerful high-level representations of the
data, looking at the geometry of transforms is essential.
Let’s not hesitate a moment further, and jump into the definition right away!
101
Mathematics of Machine Learning
Fig. 8.1: The Iris dataset, visualized by plotting every feature against every other feature. Colors are according to class
labels, while the diagonals represent the density estimation of each feature.
This is why linear algebra is called linear algebra. In essence, a linear transformation is a mapping between two vector
spaces that preserves the algebraic structure: addition and scalar multiplication. (Functions between vector spaces are
often called transformations, so we will use this terminology.)
Remark 7.1.1
Linearity is essentially comprising two properties in one: 𝑓(𝑥 + 𝑦) = 𝑓(𝑥) + 𝑓(𝑦) and 𝑓(𝑎𝑥) = 𝑎𝑓(𝑥) for all vectors
𝑥, 𝑦 and all scalars 𝑎. From these two, (8.1) follows by
Two properties immediately jump out from the definition. First, since
𝑓(0) = 0 holds for every linear transformation. In addition, the composition of linear transformations is still linear, as
To show that rotations are indeed linear, I am pull the definition out from the hat: the rotation of a planar vector 𝑥 =
(𝑥1 , 𝑥2 ) with the angle 𝛼 is described by
from which (8.1) is easily confirmed. I know that this looks like magic, but trust me, the rotation formula will be explained
in detail. You can sweat it out with some basic trigonometry, or wait until we do this later with matrices.
In general, linear transformations have a strong connection with the geometry of the space. Later we are going to study the
linear transformations of ℝ2 in detail, with an emphasis on geometric ones such as this. (Note that rotations are slightly
more complicated in higher dimensions, as they will require an axis to rotate around.)
Example 3. In any vector space 𝑉 and a nonzero vector 𝑣 ∈ 𝑉 , the translation defined by 𝑓(𝑥) = 𝑥 + 𝑣 is not linear, as
𝑓(0) = 𝑣 ≠ 0.
We’ll see more examples later in the section. For now, let’s move to some general properties of linear transformations.
is always a subspace of 𝑉 . This is easy to check: if 𝑣1 , 𝑣2 ∈ im𝑓, then there are 𝑢1 , 𝑢2 ∈ 𝑈 such that 𝑓(𝑢1 ) = 𝑣1 and
𝑓(𝑢2 ) = 𝑣2 . Thus,
To add one more level of abstraction, we will see that the set of all linear transformations is a vector space.
Theorem 7.1.1
Let 𝑈 and 𝑉 be two vector spaces over the same field 𝐹 . Then the set of all linear transformations
is also a vector space over 𝐹 , with the usual definitions for function addition and scalar multiplication.
The proof of this is just a boring checklist, going through the items of the definition of vector spaces. (I recommend you
to walk through it at least once to solidify your understanding of vector spaces, but there is really nothing special there.)
The definition of linear transformations, as we saw it, might seem a bit abstract. However, there is a simple and expressive
way to characterize them.
To see this, let 𝑓 ∶ 𝑈 → 𝑉 be a linear transformation between two vector spaces 𝑈 and 𝑉 . Suppose that {𝑢1 , … , 𝑢𝑚 }
𝑚
is a basis in 𝑈 , while {𝑣1 , … , 𝑣𝑛 } is a basis in 𝑉 . Since every 𝑥 ∈ 𝑈 can be written in the form 𝑥 = ∑𝑖=1 𝑥𝑖 𝑢𝑖 , the
linearity of 𝑓 implies
𝑚 𝑚
𝑓( ∑ 𝑥𝑗 𝑢𝑗 ) = ∑ 𝑥𝑗 𝑓(𝑢𝑗 ), (8.3)
𝑗=1 𝑗=1
meaning that 𝑓(𝑥) is a linear combination of 𝑓(𝑢1 ), … , 𝑓(𝑢𝑚 ). In other words, every linear transformation is completely
determined by the images of basis vectors. To expand this idea, suppose that for every 𝑢𝑗 , we have
𝑛
𝑓(𝑢𝑗 ) = ∑ 𝑎𝑖,𝑗 𝑣𝑖
𝑖=1
meaning that linear transformations are represented by matrices. This connection is heavily utilized throughout machine
learning.
𝑚
Expanding (8.3) further, for every 𝑥 = ∑𝑗=1 𝑥𝑗 𝑢𝑗 we have
𝑚
𝑓(𝑥) = ∑ 𝑥𝑗 𝑓(𝑢𝑗 )
𝑗=1
𝑚 𝑛
= ∑ 𝑥𝑗 ∑ 𝑎𝑖,𝑗 𝑣𝑖
𝑗=1 𝑖=1
𝑛 𝑚
= ∑ ( ∑ 𝑎𝑖,𝑗 𝑥𝑗 )𝑣𝑖 .
𝑖=1 𝑗=1
1 0 … 0
⎡0 1 … 0⎤
𝐼=⎢ ⎥. (8.4)
⎢⋮ ⋮ ⋱ ⋮⎥
⎣0 0 … 1⎦
To summarize, for a matrix 𝐴, a linear transformation can be given by 𝑥 ↦ 𝐴𝑥. In fact, the mapping
𝑓 ↦ 𝐴𝑓,𝑃
defines a one-to-one correspondence between the space of linear transformations 𝐿(𝑈 , 𝑉 ) defined by (8.2) and the set of
𝑛 × 𝑚 matrices, where 𝑛 and 𝑚 are the corresponding dimensions.
Functions can be added and composed. Because of the connection between linear transformations and matrices, matrix
operations are inherited from the corresponding function operations.
With this principle in mind, we defined matrix addition so that the the matrix of the sum of two linear transformations is
the sum of the corresponding matrices. Mathematically speaking, if 𝑓, 𝑔 ∶ 𝑈 → 𝑉 are two linear transformations with
𝑛
(𝑓 + 𝑔)(𝑢𝑗 ) = 𝑓(𝑢𝑗 ) + 𝑔(𝑢𝑗 ) = ∑(𝑎𝑖,𝑗 + 𝑏𝑖,𝑗 )𝑣𝑖 .
𝑖=1
Thus, the corresponding matrices can be added together elementwise: 𝐴 + 𝐵 = (𝑎𝑖𝑗 + 𝑏𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 .
Multiplication between matrices is defined by the composition of the corresponding transformations. To see how, we
study a special case first. (In general, it is a good idea to look at special cases first, as they often reduce the complexity and
allows you to see patterns without information overload.) So, let 𝑓, 𝑔 ∶ 𝑈 → 𝑈 be two linear transformations, mapping
𝑈 onto itself. To determine the elements of the matrix corresponding to 𝑓 ∘ 𝑔, we have to express 𝑓(𝑔(𝑢𝑗 )) in terms of
all the basis vectors 𝑢1 , … , 𝑢𝑛 . For this, we have
𝑛
(𝑓𝑔)(𝑢𝑗 ) = 𝑓(𝑔(𝑢𝑗 )) = 𝑓( ∑ 𝑏𝑘,𝑗 𝑢𝑘 )
𝑘=1
𝑛
= ∑ 𝑏𝑘,𝑗 𝑓(𝑢𝑘 )
𝑘=1
𝑛 𝑛
= ∑ 𝑏𝑘,𝑗 ∑ 𝑎𝑖,𝑘 𝑢𝑖
𝑘=1 𝑖=1
𝑛 𝑛
= ∑ ( ∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 )𝑢𝑖 .
𝑖=1 𝑘=1
𝑛
By considering how we defined a transformation’s matrix, the scalar ∑𝑘=1 𝑎𝑖,𝑘 𝑏𝑘,𝑗 is the element in the 𝑖-th row and 𝑗-th
𝑛 𝑛
column of 𝑓 ∘ 𝑔‘s matrix. Thus, matrix multiplication can be defined by 𝐴𝐵 = ( ∑𝑘=1 𝑎𝑖,𝑘 𝑏𝑘,𝑗 )𝑖,𝑗=1 .
In the general case, we can only define the product of matrices if the corresponding linear transformations can be com-
posed. That is, if 𝑓 ∶ 𝑈 → 𝑉 , then 𝑔 must start from 𝑉 . Translating this into the language of the matrices, the number
of columns in 𝐴 must match the number of rows in 𝐵. So, for any 𝐴 ∈ ℝ𝑛×𝑚 and 𝐵 ∈ ℝ𝑚×𝑙 , their product is defined by
𝑚 𝑛,𝑙
𝐴𝐵 = ( ∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 ) ∈ ℝ𝑛×𝑙 .
𝑘=1 𝑖,𝑗=1
Regarding linear transformations, the question of invertibility is extremely important. For example, have you encountered
a system of equations like this?
2𝑥1 + 𝑥2 = 5
𝑥1 − 3𝑥2 = −8
If we define
2 1 5 𝑥1
𝐴=[ ], 𝑏=[ ], 𝑥=[ ],
1 −3 −8 𝑥2
the above system can be written in the form 𝐴𝑥 = 𝑏. These are called linear equations.
How would you write the solution of such an equation? If there were be a matrix 𝐴−1 such that 𝐴−1 𝐴 is the identity
matrix 𝐼 (defined by (8.4)), then multiplying the equation 𝐴𝑥 = 𝑏 from the left by 𝐴−1 would yield the solution in the
form 𝑥 = 𝐴−1 𝑏.
The matrix 𝐴−1 is called the inverse matrix. It might not always exist, but when it does, it is extremely important for
several reasons. We’ll talk about linear equations later, but first, let’s study the fundamentals of invertibility! Here is the
general definition.
𝑓 −1 (𝑓(𝑢)) = 𝑢,
𝑓(𝑓 −1 (𝑣)) = 𝑣
Not all linear transformations are invertible. For instance, if 𝑓 maps all vectors to the zero vector, you cannot define an
inverse.
There are certain conditions that guarantee the existence of the inverse. One of the most important ones connect the
concept of basis with invertibility.
The following proof is straightforward, but can be a bit overwhelming. Feel free to skip this at the first reading, you can
always revisit it later.
Proof. As usual, the proof of the if and only if type theorems consist of two parts, as these statements involve two
implications.
(a) First, we prove that 𝑓 is invertible, then 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 ) is a basis. That is, we need to show that 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 )
is linearly indepenendent and every 𝑣 ∈ 𝑉 can be written as their linear combination.
Since 𝑓 is invertible, 𝑓(0) = 0, moreover there are no nonzero vectors 𝑢 ∈ 𝑈 such that 𝑓(𝑢) = 0. In other words, 0
cannot be written as the nontrivial linear combination of 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 ), from which Theorem 1.4.1 implies the linear
independence.
On the other hand, since 𝑓 is surjective, every 𝑣 ∈ 𝑉 can be obtained as 𝑣 = 𝑓(𝑢) for some 𝑢 ∈ 𝑈 . As 𝑢1 , … , 𝑢𝑛 is a
𝑛
basis, 𝑢 = ∑𝑖=1 𝛼𝑖 𝑢𝑖 . Thus,
𝑣 = 𝑓(𝑢)
𝑛
= 𝑓( ∑ 𝛼𝑖 𝑢𝑖 )
𝑖=1
𝑛
= ∑ 𝛼𝑖 𝑓(𝑢𝑖 ),
𝑖=1
which shows the surjectivity. Regarding the injectivity, if 𝑣 = 𝑓(𝑥) = 𝑓(𝑦) for some 𝑥, 𝑦 ∈ 𝑈 , then, since both 𝑥 and 𝑦
can be written as a linear combination of the 𝑢𝑖 basis vectors, we would have
𝑛 𝑛
𝑣 = 𝑓(𝑥) = 𝑓( ∑ 𝑥𝑖 𝑢𝑖 ) = ∑ 𝑥𝑖 𝑓(𝑢𝑖 )
𝑖=1 𝑖=1
and
𝑛 𝑛
𝑣 = 𝑓(𝑦) = 𝑓( ∑ 𝑦𝑖 𝑢𝑖 ) = ∑ 𝑦𝑖 𝑓(𝑢𝑖 ).
𝑖=1 𝑖=1
𝑛
Thus, 0 = ∑𝑖=1 (𝑥𝑖 − 𝑦𝑖 )𝑢𝑖 , and since 𝑢1 , … , 𝑢𝑛 is a basis in U, 𝑥𝑖 = 𝑦𝑖 must hold. Hence 𝑓 is injective. □
A consequence of this theorem is that a linear transformation 𝑓 ∶ 𝑈 → 𝑉 is not invertible if the dimensions of 𝑈 and 𝑉
are different. We can look at invertibility from the aspect of matrices as well. For any 𝐴 ∈ 𝐹 𝑛×𝑛 , if the corresponding
linear transformation is invertible, there exists a matrix 𝐴−1 ∈ 𝔽𝑛×𝑛 such that 𝐴−1 𝐴 = 𝐴𝐴−1 = 𝐼. Not surprisingly,
we call 𝐴−1 the inverse of 𝐴. If a matrix is not square, it is not invertible in the classical sense.
Regarding the invertibility of a linear transformation, two special sets play an essential role: the kernel and the image.
Let’s see them!
im𝑓 ∶= {𝑓(𝑢) ∶ 𝑢 ∈ 𝑈 }
and
Often, we write im𝐴 and ker 𝐴 for some matrix 𝐴, referring to the linear transformation defined by 𝑥 ↦ 𝐴𝑥. Due to the
linearity of 𝑓, it is easy to see that im𝑓 is a subspace of 𝑉 and ker 𝑓 is a subspace of 𝑈 . As mentioned, they are closely
connected with invertibility, as we shall see next.
Proof. (a) If 𝑓 is injective, there can only be one vector in 𝑈 that is mapped to 0. Since 𝑓(0) = 0 for any linear
transformation, ker 𝑓 = {0}.
On the other hand, if there are two different vectors 𝑥, 𝑦 ∈ 𝑈 such that 𝑓(𝑥) = 𝑓(𝑦), then 𝑓(𝑥 − 𝑦) = 𝑓(𝑥) − 𝑓(𝑦) = 0,
so 𝑥 − 𝑦 ∈ ker 𝑓. Thus, ker 𝑓 = {0} implies 𝑥 = 𝑦, which gives the injectivity.
Because matrices define linear transformations, it makes sense to talk about the inverse of a matrix.
Algebraically speaking, the inverse of an 𝐴 ∈ ℝ𝑛×𝑛 is the matrix 𝐴−1 ∈ ℝ𝑛×𝑛 such that 𝐴−1 𝐴 = 𝐴𝐴−1 = 𝐼 holds.
The connection between linear transforms and matrices imply that 𝐴−1 is the matrix of 𝑓 −1 , so no surprise here.
Don’t worry if this section about invertibility feels a bit too much of algebra. Later, when talking about the determinant
of a transformation, we are going to study invertibility from a geometric perspective. In terms of matrices, later we are
going to see a general method to calculate the inverse matrix.
Previously in this section, we have seen that any linear transformation can be described with the images of the basis vectors.
This gave us the matrix representation that we use all the time. However, this very much depends on the choice of basis.
Different bases yield different matrices for the same transformation.
For instance, let’s take a look at 𝑓 ∶ ℝ2 → ℝ2 that maps 𝑒1 = (1, 0) to the vector (2, 1) and 𝑒2 = (0, 1) to (1, 2). Its
matrix in the standard orthonormal basis 𝐸 = {𝑒1 , 𝑒2 } is given by
2 1
𝐴𝑓,𝐸 = [ ]. (8.5)
1 2
2 1 1 3 2 1 −1 −1
[ ][ ] = [ ], [ ][ ] = [ ].
1 2 1 3 1 2 1 1
In other words, 𝑓(𝑝1 ) = 3𝑝1 + 0𝑝2 and 𝑓(𝑝2 ) = 0𝑝1 + 𝑝2 . This is visualized by Fig. 8.6.
This means that if 𝑃 = {𝑝1 , 𝑝2 } is our basis (thus, if writing (𝑎, 𝑏) means 𝑎𝑝1 + 𝑏𝑝2 ), the matrix of 𝑓 becomes
3 0
𝐴𝑓,𝑃 = [ ].
0 1
In this form, 𝐴𝑓,𝑃 is a diagonal matrix. (That is, its elements below and above the diagonal are zero.) As you can
see, having the right basis can significantly simplify the linear transformation. For instance, in 𝑛 dimensions, applying a
transformation in diagonal form requires only 𝑛 operations, as
𝑑1 0 … 0 𝑥1 𝑑1 𝑥1
⎡0 𝑑2 … 0 ⎤ ⎡ 𝑥2 ⎤ ⎡ 𝑑2 𝑥2 ⎤
⎢ ⎥⎢ ⎥ = ⎢ ⎥
⎢⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥
⎣0 0 … 𝑑𝑛 ⎦ ⎣𝑥𝑛 ⎦ ⎣𝑑𝑛 𝑥𝑛 ⎦
We have just seen that the matrix of a linear transformation depends on our choice of basis. However, there is a special
relation between matrices of the same transformation. We’ll explore this next. Let 𝑓 ∶ 𝑈 → 𝑈 be a linear transformation,
and let 𝑃 = {𝑝1 , … , 𝑝𝑛 }, 𝑄 = {𝑞1 , … , 𝑞𝑛 } be two bases. As before, 𝐴𝑓,𝑆 denotes the matrix of 𝑓 in some basis 𝑆.
Suppose that we know 𝐴𝑓,𝑃 , but we have our vectors represented in terms of the other basis 𝑄. How do we calculate the
images our vectors under the linear transformation? A natural idea is to first transform our vector representations from 𝑄
to 𝑃 , apply 𝐴𝑓,𝑃 , then transform the representations back. In the following, we are going to make this precise.
Let 𝑡 ∶ 𝑈 → 𝑈 be a transformation defined by 𝑝𝑖 ↦ 𝑞𝑖 for all 𝑖 ∈ {1, … , 𝑛}. Since 𝑃 and 𝑄 are bases (so the sets are
linearly independent), 𝑡 is invertible. Suppose that the matrix 𝐴𝑓,𝑄 = (𝑎𝑄 𝑛
𝑖,𝑗 )𝑖,𝑗=1 is known to us, that is,
𝑛
𝑓(𝑞𝑗 ) = 𝐴𝑓,𝑄 𝑞𝑗 = ∑ 𝑎𝑄
𝑖,𝑗 𝑞𝑖
𝑖=1
In other words, the matrix of the composed transformation 𝑡−1 𝑓𝑡 in the basis 𝑃 is the same as the matrix of 𝑓 in 𝑄. In
terms of formulas,
where 𝑇 denotes the matrix of 𝑡 in 𝑃 . (For notational simplicity, we omit the subscript. Most often, we don’t care what
base it is in.)
We’ll call 𝑇 the change of basis matrix. These types of relations are prevalent in linear algebra, so we’ll take the time to
introduce a definition formally.
𝐵 = 𝑇 −1 𝐴𝑇
In these terms, (8.6) says that the matrices of a given linear transformation are all similar to each other. This holds true
the other way around: if matrices are similar to each other, then they are coming from the same linear transformation.
With this under our belt, we can finish up with the example (8.5). In this case, 𝑇 and 𝑇 −1 can be written as
1 −1 1/2 1/2
𝑇 =[ ], 𝑇 −1 = [ ].
1 1 −1/2 1/2
(Later, we’ll see a general method to compute the inverse of any matrix, but for now, you can verify this by hand.) Thus,
1/2 1/2 2 1 1 −1 3 0
[ ][ ][ ]=[ ]. (8.7)
−1/2 1/2 1 2 1 1 0 1
Fig. 8.7 shows how (8.7) looks like in geometric terms. From this example, we can see that a properly selected similarity
transformation can diagonalize certain matrices. Is this a coincidence? Spoiler alert: no. In a later chapter, we will see
exactly when and how this can be done.
We have just seen that a linear transformation can be described by the image of a basis set. From a geometric viewpoint,
they are functions mapping parallelepipeds to parallelepipeds.
Because of the linearity, you can imagine this as distorting the grid determined by the bases.
In two dimensions, we have seen a few examples of geometric maps such as scaling and rotation as linear transformations.
Now we can put them into matrix form. There are five of them in particular that we will study: stretching, shearing,
rotation, reflection, and projection.
These simple transformations are not only essential to build intuition, but they are also frequently applied in computer
vision. Flipping, rotating, and stretching are essential parts of image augmentation pipelines, greatly enhancing the per-
formance of models.
Fig. 8.8: How linear transforms distort the grid determined by the basis vectors.
8.6.1 Stretching
The simplest one is a generalization of scaling. We have seen a variant of this in Example 1 above. In matrix form, this
is given by
𝑐1 0
𝐴=[ ], 𝑐1 , 𝑐2 ∈ ℝ.
0 𝑐2
Linear transformations such as this can be visualized by plotting the image of the unit square determined by the standard
basis 𝑒1 = (1, 0), 𝑒2 = (0, 1).
8.6.2 Rotation
cos 𝛼 − sin 𝛼
𝑅𝛼 = [ ].
sin 𝛼 cos 𝛼
To see why, recall that each column of the transformation’s matrix describes the image of the basis vectors. The rotation
of (1, 0) is given by (cos 𝛼, sin 𝛼), while the rotation of (0, 1) is (cos(𝛼 + 𝜋/2), sin(𝛼 + 𝜋/2)). This is illustrated by
Fig. 8.10.
Like above, we can visualize the image of the unit square to gain a geometric insight into what is happening.
8.6.3 Shearing
Another essential geometric transform is shearing, which is frequently applied in physics. A shearing force is a pair of
forces with opposite directions, acting on the same body.
Its matrix is given by
1 𝑎
𝑆=[ ].
0 1
8.6.4 Reflection
Until this point, all the transformations we have seen in the Euclidean plane had preserved the “orientation” of the space.
However, this is not always the case. The transformation given by the matrices
−1 0 1 0
𝑅1 = [ ], 𝑅2 = [ ]
0 1 0 −1
0 −1 1 0 0 1
𝑅= [ ] [ ]=[ ]
1
⏟ ⏟ 0 0 −1 1 0
rotation with 𝜋/2 =𝑅2
maps 𝑒1 to 𝑒2 and 𝑒2 to 𝑒1 .
These types of transformations play an essential role in understanding determinants, as we will soon see in the next chapter.
In general, reflections can be easily defined in higher dimensional spaces. For instance,
1 0 0
𝑅=⎡
⎢0 1 0⎤⎥
⎣0 0 −1⎦
is a reflection in ℝ3 that flips 𝑒3 to the opposite direction. It is just like looking in the mirror: it turns left to right and
right to left.
Reflections can flip orientations multiple times. The transformation given by
1 0 0
𝑅=⎡
⎢0 −1 0⎤⎥
⎣0 0 −1⎦
flips 𝑒2 and 𝑒3 , changing the orientation twice. Later, we’ll see that the “number of changes in orientation” of a given
transformation is one of its essential descriptors.
One of the most important transformations (not only in two dimensions) is the orthogonal projection. We have seen this
already when talking about inner products and their geometric representation. By taking a closer look, it turns out that they
are linear transformations.
Recall from (5.6) that the orthogonal projection of 𝑥 to some 𝑦 can be written as
⟨𝑥, 𝑦⟩
proj𝑦 (𝑥) = 𝑦. (8.8)
⟨𝑦, 𝑦⟩
The bilinearity of ⟨⋅, ⋅⟩ immediately implies that proj𝑦 (𝑥) is also linear. With a bit of algebra, we can rewrite this in terms
of matrices. We have
⟨𝑥, 𝑦⟩
proj𝑦 (𝑥) = 𝑦 (8.9)
⟨𝑦, 𝑦⟩
𝑥 𝑦 + 𝑥 𝑦 𝑦1
= 1 1 2 2 2 (8.10)
[ ]
‖𝑦‖ 𝑦2
1 2
𝑦 𝑦1 𝑦2 𝑥1
= [ 1 ] [(8.11)
],
‖𝑦‖ 2 𝑦 𝑦
1 2 𝑦22 𝑥2
thus,
1 𝑦2 𝑦1 𝑦2
proj𝑦 = [ 1 ].
‖𝑦‖ 𝑦1 𝑦2
2 𝑦22
Notice that proj𝑦 (𝑒2 ) = 𝑦𝑦2 proj𝑦 (𝑒1 ), so the images of the standard basis vectors are not linearly independent. As a
1
consequence, the image of the plane under proj𝑦 is span(𝑦), which is a one-dimensional subspace. From this example,
we can see that the image of a vector space under a linear transformation is not necessarily of the same dimension as the
starting space.
With these examples and knowledge under our belt, we have a basic understanding of linear transformations, the most
basic building blocks of neural networks. In the next chapter, we will study how linear transformations affect the geometric
structure of the vector space.
8.7 Problems
Problem 1. Show that if 𝐴 ∈ ℝ𝑛×𝑛 is an invertible matrix, then (𝐴−1 )𝑇 = (𝐴𝑇 )−1 .
Problem 2. Let 𝑅𝛼 be the two-dimensional rotation matrix defined by
cos 𝛼 − sin 𝛼
𝑅𝛼 = [ ].
sin 𝛼 cos 𝛼
𝑑1 0 … 0
𝐷=⎡
⎢0 𝑑2 … 0⎤ ⎥,
⎣0 0 … 𝑑𝑛 ⎦
where all of its elements are zero outside the diagonal. Show that
and
𝑑1 𝑎1,1 𝑑1 𝑎1,2 … 𝑑1 𝑎1,𝑛
⎡𝑑 𝑎 𝑑2 𝑎2,2 … 𝑑2 𝑎2,𝑛 ⎤
𝐴𝐷 = ⎢ 2 2,1 ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣𝑑𝑛 𝑎𝑛,1 𝑑𝑛 𝑎𝑛,2 … 𝑑𝑛 𝑎𝑛,𝑛 ⎦
‖𝑥‖∗ ∶= ‖𝐴𝑥‖
is a norm on ℝ𝑛 .
Problem 5. Let 𝑈 be a normed space and 𝑓 ∶ 𝑈 → 𝑈 be a linear transformation.
If
‖𝑥‖∗ ∶= ‖𝑓(𝑥)‖
⟨𝑥, 𝑦⟩ = 𝑥𝑇 𝐴𝑦, 𝑥, 𝑦 ∈ ℝ𝑛 .
Problem 7. Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix. 𝐴 is called positive definite if 𝑥𝑇 𝐴𝑥 > 0 for every nonzero 𝑥 ∈ ℝ𝑛 .
Show that 𝐴 is positive definite if and only if
⟨𝑥, 𝑦⟩ ∶= 𝑥𝑇 𝐴𝑦
is an inner product.
Problem 8. Let 𝐴 ∈ ℝ𝑛×𝑚 be a matrix, and denote its columns by 𝑎1 , … , 𝑎𝑛 ∈ ℝ𝑛 .
(a) Show that for all 𝑥 ∈ ℝ𝑚 , we have 𝐴𝑥 ∈ span(𝑎1 , … , 𝑎𝑛 ).
(b) Let 𝐵 ∈ ℝ𝑚×𝑘 , and denote the columns of 𝐴𝐵 by 𝑣1 , … , 𝑣𝑘 ∈ ℝ𝑛 . Show that
𝑣1 , … , 𝑣𝑘 ∈ span(𝑎1 , … , 𝑎𝑛 ).
⟨𝐴𝑥, 𝑦⟩ = ⟨𝑥, 𝐴𝑇 𝑦⟩
NINE
In the previous sections, we have seen that linear transformations can be thought as distorting the grid determined by the
basis vectors.
Following our geometric intuition, we suspect that measuring how much a transformation distorts volume and distance
can provide some valuable insight. As we will see in this chapter, this is exactly the case. Transformations that preserve
distance or norm are special, giving rise to methods such as Principal Component Analysis.
Let’s go back to the Euclidean plane one more time. Consider any linear transformation 𝐴, mapping the unit square to a
parallelogram.
The area of this parallelogram describes how 𝐴 scales the unit square. Let’s call it 𝜆 for now; that is,
area(𝐴(𝐶)) = 𝜆 ⋅ area(𝐶),
123
Mathematics of Machine Learning
where 𝐶 = [0, 1] × [0, 1] is the unit square, and 𝐴(𝐶) is its image
Due to linearity, 𝜆 also matches the scaling ratio between the area of any rectangle (with parallel sides to the coordinate
axes) and its image under 𝐴. As Fig. 9.2 shows, we can approximate any planar object as the union of rectangles.
If all rectangles are scaled by 𝜆, then unions of rectangles also scale by that factor. Thus, it follows that 𝜆 is also the
scaling ratio between any planar object 𝐸 and its image 𝐴(𝐸) = {𝐴𝑥 ∶ 𝑥 ∈ 𝐸}.
This quantity 𝜆 reveals a lot about the transformation itself, but there is a question remaining: how can we calculate it?
𝑥1 𝑦1
𝐴=[ ],
𝑥2 𝑦2
thus its columns 𝑥 = (𝑥1 , 𝑥2 ) and 𝑦 = (𝑦1 , 𝑦2 ) describe the two sides of the parallelogram. This is the image of the unit
square.
Our area scaling factor 𝜆 equals the area of this parallelogram, so our goal is to calculate this.
The area of any parallelogram can be calculated by multiplying the length of the base (‖𝑥‖ in this case) with the height
ℎ. (You can easily see this by cutting off a triangle at the right side of the parallelogram and putting it to the left side,
rearranging it as a rectangle.) ℎ is unknown, but with basic trigonometry, we can see that ℎ = sin 𝛼‖𝑦‖, where 𝛼 is the
angle between 𝑥 and 𝑦.
Thus,
This is almost the dot product of 𝑥 and 𝑦. (Recall that the dot product can be written as ⟨𝑥, 𝑦⟩ = ‖𝑥‖‖𝑦‖ cos 𝛼.) However,
the sin 𝛼 part is not a match.
Fortunately, there is a clever trick we can use to turn this into a dot product! Since sin 𝛼 = cos (𝛼 − 𝜋2 ), we have
𝜋
area = cos (𝛼 − )‖𝑥‖‖𝑦‖.
2
The issue is, the angle between 𝑥 and 𝑦 is not 𝛼 − 𝜋2 . However, we can solve this easily by applying a rotation. Applying
the transformation
0 1
𝑅−𝜋/2 = [ ],
−1 0
The quantity ⟨𝑥, 𝑦rot ⟩ can be calculated using only the elements of the matrix 𝐴:
⟨𝑥, 𝑦rot ⟩ = 𝑥1 𝑦2 − 𝑥2 𝑦1 .
Notice that ⟨𝑥, 𝑦rot ⟩ can be negative! This happens when the angle between 𝑦 = 𝐴𝑒2 and 𝑥 = 𝐴𝑒1 , measured from a
counter-clockwise direction, is larger than 𝜋, as this implies cos (𝛼 − 𝜋2 ) < 0.
Hence, the quantity ⟨𝑥, 𝑦rot ⟩ is called the signed area of the parallelogram.
In two dimensions, we call this the determinant of the linear transformation. That is, for any given linear transforma-
tion/matrix 𝐴 ∈ ℝ2×2 , its determinant is defined by
𝑎 𝑏
det 𝐴 = 𝑎𝑑 − 𝑐𝑏, 𝐴=[ ]. (9.1)
𝑐 𝑑
The determinant is often written as |𝐴|, but we’ll avoid this notation. We’ll deal with determinants for any matrix 𝐴 ∈
ℝ𝑛×𝑛 , but let’s stay with the 2 × 2 case just a bit to build intuition.
The determinant also reveals the orientation of the vectors: positive determinant means positive orientation, negative de-
terminant means negative orientation. (As mentioned earlier, positive orientation means that the angle measured between
𝑥 and 𝑦 in a counter-clockwise direction is between 0 and 𝜋.) This is demonstrated in Fig. 9.4 below.
Overall,
holds. Even though we have only shown this in two dimensions, this holds in general. (Although we don’t know how to
define the determinant there yet.)
So, if 𝑒1 and 𝑒2 is a basis on the plane, equations (9.1) and (9.2) tell us that the determinant in two dimensions equals to
Based on the example of the Euclidean plane, we have built enough geometric intuition on understanding how linear trans-
formations distort volume and change the orientation of the space. These are described by the concept of determinants,
which we have defined in the special case (9.1). We are going to move on to study the concept in its full generality.
To introduce the formal definition of the determinant, we will take a route that is different from the usual. Most commonly,
the determinant of a linear transformation 𝐴 is defined straight away with a complicated formula, then all of its geometric
properties are shown.
Instead of this, we will deduce the determinant formula by generalizing the geometric notion we have learned in the
previous section. Here, we are roughly going to follow the outline of [[Lax07]].
We set the foundations by introducing some key notations. Let 𝐴 = (𝑎𝑖,𝑗 )𝑛𝑖,𝑗=1 ∈ ℝ𝑛×𝑛 be a matrix with columns
𝑎1 , … , 𝑎𝑛 . When we introduced the notion of matrices as linear transformations, we have seen that the 𝑖-th column is the
image of the 𝑖-th basis vector. For simplicity, let’s assume that 𝑒1 , 𝑒2 , … , 𝑒𝑛 is the standard orthonormal basis, that is, 𝑒𝑖
is the vector whose 𝑖-th coordinate is 1 and the rest is 0. Thus, 𝐴𝑒𝑖 = 𝑎𝑖 .
During our explorations in the Euclidean plane, we have seen that the determinant is the orientation of the images of basis
vectors, times the area of the parallelogram defined by them. Following this logic, we could define the determinant for
𝑛 × 𝑛 matrices by
Two questions surface immediately. First, how do we define the orientation of multiple vectors in the 𝑛-dimensional
space? Second, how can we even calculate the area?
Instead of finding the answers for these questions, we are going to put a twist into the story: first, we’ll find a convenient
formula for determinants, then use it to define orientation.
To make the relation between the determinant and the columns of the matrix 𝑎𝑖 = 𝐴𝑒𝑖 more explicit, we’ll write
det 𝐴 = det(𝑎1 , … , 𝑎𝑛 ).
Thinking about determinants this way, det is just a function of multiple variables:
det ∶ ℝ 𝑛 × ⋯ × ℝ𝑛 → ℝ.
⏟⏟⏟⏟⏟
𝑛 times
holds. We are not going to prove this, but as the determinant represents the signed volume, you can convince yourself by
checking out Fig. 9.5.
A consequence of linearity is that we can express the determinant as a linear combination of determinants for the standard
𝑛
basis vectors 𝑒1 , … , 𝑒𝑛 . For instance, consider the following. Since 𝐴𝑒1 = 𝑎1 = ∑𝑖=1 𝑎𝑖,1 𝑒𝑖 , we have
𝑛
det(𝑎1 , 𝑎2 , … , 𝑎𝑛 ) = ∑ 𝑎𝑖,1 det(𝑒𝑖 , 𝑎2 , … , 𝑎𝑛 ).
𝑖=1
𝑛
Going one step further and using that 𝑎2 = ∑𝑗=1 𝑎𝑗,2 𝑒𝑗 , we start noticing a pattern. With the linearity, we have
𝑛 𝑛
det(𝑎1 , 𝑎2 , … , 𝑎𝑛 ) = ∑ ∑ 𝑎𝑖,1 𝑎𝑗,2 det(𝑒𝑖 , 𝑒𝑗 , 𝑎3 , … , 𝑎𝑛 ).
𝑖=1 𝑗=1
We can see that the row indices in the coefficients 𝑎𝑖1 𝑎𝑗2 match the indices of 𝑒𝑘 -s in det(𝑒𝑖 , 𝑒𝑗 , 𝑎3 , … , 𝑎𝑛 ). In the general
case, this pattern can be formalized in terms of permutations. Expanding the determinant of 𝐴, we have
𝑛
det(𝑎1 , … , 𝑎𝑛 ) = ∑ [ ∏ 𝑎𝜎(𝑖),𝑖 ] det(𝑒𝜎(1) , … , 𝑒𝜎(𝑛) ).
𝜎∈𝑆𝑛 𝑖=1
𝑛
This formula is not the easiest one to understand. You can think about each term ∏𝑖=1 𝑎𝜎(𝑖),𝑖 as placing 𝑛 chess rooks
on a 𝑛 × 𝑛 board such that none of them can capture each other.
𝑛
The formula ∑𝜎∈𝑆 [ ∏𝑖=1 𝑎𝜎(𝑖),𝑖 ] det(𝑒𝜎(1) , … , 𝑒𝜎(𝑛) ) combines all the possible ways we can do this.
𝑛
When the det notation is not convenient, we denote determinants by putting the elements of the matrix inside a big absolute
value sign:
𝑎 𝑎1,2 … 𝑎1,𝑛
∣ 1,1 ∣
𝑎 𝑎2,2 … 𝑎2,𝑛 ∣
det 𝐴 = ∣ 2,1 .
∣ ⋮ ⋮ ⋱ ⋮ ∣
∣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑛 ∣
When I was a young math student, the determinant formula (9.5) was presented as-is in my first linear algebra class.
Without explaining the connection to volume and orientation, it took me years to properly understand it. I still think that
the determinant is one of the most complex concepts in linear algebra, especially when presented without a geometric
motivation to the definition.
Now that you have a basic understanding of the determinant, you might ask: how can we calculate it in practice? Summing
over the set of all permutations and calculating their sign is not an easy operation from a computational perspective.
Good news: there is a recursive formula for the determinant. Bad news: for an 𝑛 × 𝑛 matrix, it involves 𝑛 pieces of
(𝑛 − 1) × (𝑛 − 1) matrices. Still, it is a big step from the permutation formula. Let’s see it!
where 𝐴𝑖,𝑗 is the (𝑛 − 1) × (𝑛 − 1) matrix obtained from 𝐴 by removing its 𝑖-th row and 𝑗-th column.
Instead of a proof, we are going to provide an example to demonstrate the formula. For 3 × 3 matrices, this is how it
looks:
𝑎 𝑏 𝑐
𝑒 𝑓 𝑑 𝑓 𝑑 𝑒
∣𝑑 𝑒 𝑓∣ = 𝑎 ∣ ∣−𝑏∣ ∣+𝑐∣ ∣.
ℎ 𝑖 𝑔 𝑖 𝑔 ℎ
𝑔 ℎ 𝑖
When working with determinants, we prefer to create basic building blocks and rules for combining them. (Like we have
seen this pattern so many times, even when deducing the formula (9.5).) These rules are manifested by the fundamental
properties of determinants, which we will discuss now. Usually, the proofs are some heavy computations based on the
formulas (9.5) and (9.6), so I am going to be a bit unorthodox here. Instead of providing fully fleshed-out proofs, I’ll give
intuitive explanations. After all, we want to build algorithms using mathematics, not building mathematics.
The first property is concerned with the relation of composition and the determinant.
Theorem 8.4.1
Let 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 be two matrices. Then
The explanation for this is quite simple. If we think about the matrices 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 as linear transformations, we have
just seen that det(𝐴) and det(𝐵) determine how they scale the unit cube.
Since the composition of these linear transformations is the matrix product 𝐴𝐵, the linear transformation 𝐴𝐵 scales the
unit cube to a parallelepiped with signed volume det(𝐴) det(𝐵). (Because applying 𝐴𝐵 is the same as applying 𝐵 first,
then applying 𝐴 on the result.)
Thus, by our understanding of the determinant, as the scaling factor of 𝐴𝐵 is also det(𝐴𝐵), (9.7) holds.
We can do the actual proof of this, for example, by induction based on the recursive formula (9.6), leading to a long and
involved calculation.
An immediate corollary of the product rule is a special relation between the determinants of a matrix and its inverse.
Theorem 8.4.2
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary invertible matrix. Then det 𝐴−1 = (det 𝐴)−1 .
Because of this, we can also conclude that the determinant is preserved by the similarity relation.
Theorem 8.4.3
Let 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 be two similar matrices with 𝐵 = 𝑇 −1 𝐴𝑇 for some 𝑇 ∈ ℝ𝑛×𝑛 . Then det 𝐴 = det 𝐵.
Another important consequence is that the determinant is independent of the basis the matrix is in. If 𝐴 ∶ 𝑈 → 𝑈 is a
linear transformation and 𝑃 = {𝑝1 , … , 𝑝𝑛 }, 𝑅 = {𝑟1 , … , 𝑟𝑛 } are two bases of 𝑈 , then we know that the matrices of the
transformation are related by
𝐴𝑃 = 𝑇 −1 𝐴𝑅 𝑇 ,
where 𝐴𝑆 is the matrix of the transformation 𝐴 in a basis 𝑆 and 𝑇 ∈ ℝ𝑛×𝑛 is the change of basis matrix. Using the
previous theorem, this implies that det 𝐴𝑃 = det 𝐴𝑅 . Thus, the determinant is properly defined for linear transformations,
not just matrices!
There is an essential duality relation regarding determinants: you can swap the rows and columns of a matrix, keeping all
determinant-related identities true.
Theorem 8.4.4
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary matrix. Then det 𝐴 = det 𝐴𝑇 .
Proof. Suppose that 𝐴 = (𝑎𝑖,𝑗 )𝑛𝑖,𝑗=1 . Let’s denote the elements of its transpose by 𝑎𝑡𝑖,𝑗 = 𝑎𝑗,𝑖 . According to (9.5), we
have
𝑛
det 𝐴𝑇 = ∑ sign(𝜎) ∏ 𝑎𝑡𝜎(𝑖),𝑖
𝜎∈𝑆𝑛 𝑖=1
𝑛
= ∑ sign(𝜎) ∏ 𝑎𝑖,𝜎(𝑖) .
𝜎∈𝑆𝑛 𝑖=1
𝑛
Now comes the trick. Since the product ∏𝑖=1 𝑎𝑖,𝜎(𝑖) iterates through all 𝑖-s, and the order of the terms doesn’t matter, we
might as well order the terms as 𝑖 = 𝜎−1 (1), … , 𝜎−1 (𝑛). Since sign(𝜎−1 ) = sign(𝜎), by continuing the above calculation,
we have
𝑛 𝑛
∑ sign(𝜎) ∏ 𝑎𝑖,𝜎(𝑖) = ∑ sign(𝜎−1 ) ∏ 𝑎𝜎−1 (𝑗),𝑗 .
𝜎∈𝑆𝑛 𝑖=1 𝜎∈𝑆𝑛 𝑗=1
Because every permutation is invertible and 𝜎 ↦ 𝜎−1 is a bijection, summing over 𝜎 ∈ 𝑆𝑛 is the same as summing over
𝜎−1 ∈ 𝑆𝑛 . Combining all of the above, we obtain that
𝑛
det 𝐴 = ∑ sign(𝜎−1 ) ∏ 𝑎𝜎−1 (𝑗),𝑗
𝜎∈𝑆𝑛 𝑗=1
𝑛
= ∑ sign(𝜎) ∏ 𝑎𝜎(𝑗),𝑗
𝜎∈𝑆𝑛 𝑗=1
𝑇
= det 𝐴 ,
Theorem 8.4.5
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary matrix and let 𝐴𝑖,𝑗 denote the matrix which can be obtained by swapping the 𝑖-th and 𝑗-th
column of 𝐴. Then
or in other words, swapping any two columns of 𝐴 will change the sign of the determinant. Similarly, swapping two rows
also changes the sign of the determinant.
Proof. This follows from a clever application of (9.7), noticing that 𝐴𝑖,𝑗 = 𝐴𝐼 𝑖,𝑗 , where 𝐼 𝑖,𝑗 is obtained from the
identity matrix by swapping its 𝑖-th and 𝑗-th column. det 𝐼 𝑖,𝑗 is a determinant of the form det(𝑒𝜎(1) , … , 𝑒𝜎(𝑛) ), where 𝜎
is a permutation simply swapping 𝑖 and 𝑗. (That is, 𝜎 is a transposition.) Thus,
Theorem 8.4.6
Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix that has two identical rows or columns. Then det 𝐴 = 0.
Proof. Suppose that the 𝑖-th and the 𝑗-th columns are matching. Since the two columns equal, det 𝐴𝑖,𝑗 = det 𝐴. However,
applying the previous theorem, we obtain det 𝐴𝑖,𝑗 = − det 𝐴 This can only be true if det 𝐴 = 0. Again, transposing the
matrix gives the statement for rows. □
As yet another consequence, we obtain an essential connection between linearly dependent vector systems and determi-
nants.
Theorem 8.4.7
Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix. Then its columns are linearly dependent if and only if det 𝐴 = 0. Similarly, the rows of 𝐴
are linearly dependent if and only if det 𝐴 = 0.
Proof. (i) First, we are going to show that linearly dependent columns (or rows) imply det 𝐴 = 0. As usual, let’s denote
𝑛
the columns of 𝐴 as 𝑎1 , … , 𝑎𝑛 and for the sake of simplicity, assume that 𝑎1 = ∑𝑖=2 𝛼𝑖 𝑎𝑖 . Since the determinant is a
linear function of the columns, we have
𝑛
det(𝑎1 , 𝑎2 , … , 𝑎𝑛 ) = ∑ 𝛼𝑖 det(𝑎𝑖 , 𝑎2 , … , 𝑎𝑛 ).
𝑖=2
Because of the previous theorem, all terms det(𝑎𝑖 , 𝑎2 , … , 𝑎𝑛 ) are zero, implying det 𝐴 = 0, which is what we had to
show. If the rows are linearly dependent, we apply the above to obtain that det 𝐴 = det 𝐴𝑇 = 0.
(ii) Now, let’s show that det 𝐴 = 0 means linearly dependent columns. Instead of the exact proof, which is rather involved,
we should have an intuitive explanation instead.
Recall that the determinant is orientation times volume of the parallelepiped given by the columns. Since the orientation
is ±1, det 𝐴 implies that the volume of the parallelepiped is 0. This can only happen if the 𝑛 columns lie in an 𝑛 − 1-
dimensional subspace, meaning that they are linearly dependent. □
Corollary 8.4.1
Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix with a constant zero column (or row). Then det 𝐴 = 0.
As the determinant is the signed volume of the basis vectors’ image, it can be zero in certain cases. These transformations
are rather special. When can it happen? Let’s go back to the Euclidean plane to build some intuition.
There, we have
𝑥1 𝑦1
∣ ∣ = 𝑥1 𝑦2 − 𝑥2 𝑦1 = 0,
𝑥2 𝑦2
or in other words, 𝑥𝑦1 = 𝑥𝑦2 . There is one more interpretation of this: the vector (𝑦1 , 𝑦2 ) is a scalar multiple of (𝑥1 , 𝑥2 ).
1 2
Thinking in terms of linear transformations, this means that the images of 𝑒1 and 𝑒2 lie on a subspace of ℝ2 . As we shall
see next, this is closely connected with the invertibility of the transformation.
Proof. When we introduced the concept of invertibility, we have seen that 𝐴 is invertible if and only if its columns
𝑎1 , … , 𝑎𝑛 form a basis. Thus, they are linearly independent. Since linear independence of columns is equivalent to a
nonzero determinant, the result follows. □
9.6 Problems
𝑎 … 𝑐𝑎1,𝑖 … 𝑎1,𝑛
∣ 1,1 ∣
∣ 𝑎2,1 … 𝑐𝑎2,𝑖 … 𝑎2,𝑛 ∣
= 𝑐 det 𝐴
∣ ⋮ ⋱ ⋮ ⋱ ⋮ ∣
∣𝑎𝑛,1 … 𝑐𝑎𝑛,𝑖 … 𝑎𝑛,𝑛 ∣
𝑎 𝑎1,2 … 𝑎1,𝑛
∣ 1,1 ∣
⋮ ⋮ ⋱ ⋮
∣ ∣
∣𝑐𝑎𝑖,1 𝑐𝑎𝑖,2 … 𝑐𝑎𝑖,𝑛 ∣ = 𝑐 det 𝐴
∣ ⋮ ⋮ ⋱ ⋮ ∣
∣ 𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑛 ∣
det(𝑐𝐴) = 𝑐𝑛 det 𝐴.
Problem 3. Let 𝐴 ∈ ℝ𝑛×𝑛 be an upper triangular matrix. (That is, all elements below the diagonal are zero.) Show that
𝑛
det 𝐴 = ∏ 𝑎𝑖,𝑖 .
𝑖=1
Show that the same holds for lower triangular matrices. (That is, matrices where elements above the diagonal are zero.)
Problem 4. Let 𝑀 ∈ ℝ𝑛×𝑚 be a matrix with the block structure
𝐴 𝐵
𝑀 =[ ],
0 𝐶
TEN
LINEAR EQUATIONS
In practice, we can translate several problems to linear equations. For example, a cash dispenser has $900 in $20 and
$50 bills. We know that there are twice as many $20 bills than $50. The question is, how many pieces of each bill the
machine has?
If we denote the number of $20 bills by 𝑥1 and the number of $50 bills by 𝑥2 , we obtain the equations
𝑥1 − 2𝑥2 = 0
20𝑥1 + 50𝑥2 = 900.
For two variables like we have now, these are easily solvable by expressing one in terms of other. Here, the first equation
would imply 𝑥1 = 2𝑥2 . Plugging it back to the second equation, we obtain 90𝑥2 = 900, which gives 𝑥2 = 10. Arriving
full circle, we can substitute this into 𝑥1 = 2𝑥2 , yielding the solutions
𝑥1 = 20
𝑥2 = 10.
However, for thousands of variables like in real applications, we need a bit more craft. This is where linear algebra comes
in. By introducing the matrix and vectors
1 −2 𝑥1 0
𝐴=[ ], 𝑥=[ ], 𝑏=[ ],
20 50 𝑥2 900
the equation can be written in the form 𝐴𝑥 = 𝑏. That is, in terms of linear transformations, we can reformulate the
question: which vector 𝑥 is mapped to 𝑏 by the transformation 𝐴? This question is central in linear algebra. We are going
to dedicate this section to solving these.
135
Mathematics of Machine Learning
Earlier, we have applied these repeatedly to eliminate variables progressively in our simple example. We can easily do
the same for 𝑛 variables! First, let’s see what we are talking about!
A system of linear equations is often written in the short form 𝐴𝑥 = 𝑏, where 𝐴 is called its coefficient matrix. If the
vector 𝑥 satisfies 𝐴𝑥 = 𝑏, it is called a *solution.
Speaking of the solutions, are there even any, and if so, how can we find them?
𝑎𝑘1
If 𝑎11 is nonzero, we can multiply the first equation of (10.1) with 𝑎11 and subtract it from the 𝑘-th equation.
This way, 𝑥1 will be eliminated from all but the first row, obtaining
We can repeat the above process and use the second equation to get rid of the 𝑥2 variable in the third equation, and
so forth. This can be done 𝑛 − 1 times in total, ultimately leading to an equation system 𝐴(𝑛−1) 𝑥 = 𝑏(𝑛−1) where all
coefficients below the diagonal of 𝐴(𝑛−1) is zero:
Notice that the 𝑘-th elimination step only affects the coefficients from the 𝑘 + 1-th row. Now we can work backwards:
(𝑛−1) (𝑛−1)
the last equation 𝑎𝑛𝑛 𝑥𝑛 = 𝑏𝑛 can be used to find 𝑥𝑛 . This can be substituted to the 𝑛 − 1-th equation, yielding
𝑥𝑛−1 . Continuing like this, we can eventually find all 𝑥1 , … , 𝑥𝑛 , obtaining a solution for our linear system.
This process is called Gaussian elimination, and it’s kind of a big deal. It is not only useful for solving linear equations, it
can be used to calculate determinants, factor matrices into the product of simpler ones, and many more. We’ll talk about
all of these in detail, but let’s focus on equations for a little more.
Unfortunately, not all linear equations can be solved. For instance, consider the system
𝑥1 + 𝑥 2 = 1
2𝑥1 + 2𝑥2 = −1.
To build a deeper understanding of Gaussian elimination, let’s consider the simple equation system
𝑥1 + 0𝑥2 − 3𝑥3 = 6
2𝑥1 + 1𝑥2 + 5𝑥3 = 2
−2𝑥1 − 3𝑥2 + 8𝑥3 = 2.
To keep track of our progress (and, since we are lazy, to avoid writing too much), we record the intermediate results as
1 0 −3 6
2 1 5 2
−2 −3 8 2
with the coefficient matrix 𝐴 on the left side and 𝑏 on the other. To get a good grip on the method, I encourage you to
follow along and do the calculations for yourself by hand.
After eliminating the first variable from the second and third equations, we have
1 0 −3 6
0 1 11 −10 ,
0 −3 2 14
1 0 −3 6
0 1 11 −10 .
0 0 35 −16
If you followed the description of Gaussian elimination carefully, you might have noticed that the process can break down.
We might accidentally divide with zero during any elimination step!
For instance, after the first step given by equation (10.2), the new coefficients are of the form
𝑎𝑖1
(𝑎𝑖𝑗 − 𝑎1𝑗 ),
𝑎11
(𝑘−1) (𝑘−1)
which is invalid if 𝑎11 = 0. In general, the 𝑘-th step involves division with 𝑎𝑘𝑘 . Since 𝑎𝑘𝑘 is defined recursively,
describing it in terms of 𝐴 is not straightforward. For this, we introduce the concept of principal minors, the upper left
subdeterminants of a matrix.
𝑎11 𝑎12
𝐴1 = [𝑎11 ] , 𝐴2 = [ ],
𝑎21 𝑎22
𝑀𝑘 ∶= det 𝐴𝑘 .
The first and last principal minors are special, as 𝑀1 = 𝑎11 and 𝑀𝑛 = det 𝐴. With principal minors, we can describe
when Gaussian elimination is possible. In fact, it turns out that
(1) 𝑀2 (𝑛−1) 𝑀𝑛
𝑎11 = 𝑀1 , 𝑎22 = , … , 𝑎𝑛𝑛 =
𝑀1 𝑀𝑛−1
(𝑘−1) 𝑀𝑘
and in general, 𝑎𝑘𝑘 = 𝑀𝑘−1 .
Theorem 9.1.1
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary square matrix, and let 𝑀𝑘 be its 𝑘-th principal minor. If 𝑀𝑘 ≠ 0 for all 𝑘 = 1, 2, … , 𝑛−1,
then Gaussian elimination can be successfully performed.
(𝑘−1)
As the proof is a bit involved, we are not going to do it here. (The difficult step is to show 𝑎𝑘𝑘 = 𝑀𝑘 /𝑀𝑘−1 , the rest
follows immediately.) Point is, if none of the principal minors are zero, the algorithm finishes.
We can simplify this requirement a bit, and describe the Gaussian elimination in terms of the determinant, not the principal
minors.
Theorem 9.1.2
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary square matrix. If det 𝐴 ≠ 0 holds, then all principal minors are nonzero as well.
As a consequence, if the determinant is nonzero, the Gaussian elimination is successful. A simple and nice requirement.
To get a grip on how fast the Gaussian elimination algorithm executes, let’s do a little complexity analysis. As described
by (10.2), the first elimination step involves an addition and a multiplication for each element, except for those in the first
row. That is 2𝑛(𝑛 − 1) operations in total.
The next step is essentially the first step, done on the (𝑛 − 1) × (𝑛 − 1) matrix obtained from 𝐴(1) by removing its first
row and column. This time, we have 2(𝑛 − 1)(𝑛 − 2) operations.
Following this train of thought, we quickly get that the total number of operations are
𝑛
∑ 2(𝑛 − 𝑖 + 1)(𝑛 − 𝑖),
𝑖=1
which doesn’t look that friendly. Since we are looking for the order of complexity instead of an exact number, we can be
generous and suppose that at each elimination step, we are performing 𝑂(𝑛2 ) operations. So, we have a time complexity
of
𝑛
∑ 𝑂(𝑛2 ) = 𝑂(𝑛3 ),
𝑖=1
meaning that we need around 𝑐𝑛3 operations for Gaussian elimination, where 𝑐 is an arbitrary positive constant. This
might seem a lot, but in the beautiful domain of algorithms, this is good. 𝑂(𝑛3 ) is polynomial time, and we can be
much-much worse.
𝐴𝑥 = 𝑏, 𝐴 ∈ ℝ𝑛×𝑛 , 𝑥, 𝑏 ∈ ℝ𝑛 ,
Gaussian elimination can be successfully performed if the principal minors 𝑀1 , … , 𝑀𝑛−1 are nonzero. Notice one caveat
about the result: 𝑀𝑛 = det 𝐴 can be zero as well. Turns out, this is quite an important detail.
If you have closely followed the discussion leading up to this point, you can see that we missed a crucial point: are there
any solutions at all for a given linear equation?
1. there are no solutions,
2. there is exactly one solution,
3. and there are multiple solutions.
All of these are relevant to us from a certain perspective, but let’s start with the most straightforward one: when do we
have exactly one solution? The answer is simple: when 𝐴 is invertible, the solution can be explicitly written as 𝑥 = 𝐴−1 𝑏.
Speaking in terms of linear transformations, we can find a unique vector 𝑥 that is mapped to 𝑏. We summarize this idea
in the following theorem.
Theorem 9.2.1
Let 𝐴 ∈ ℝ𝑛×𝑛 be an invertible matrix. Then for any 𝑏 ∈ ℝ𝑛 , the equation 𝐴𝑥 = 𝑏 has a unique solution that can be
written as 𝑥 = 𝐴−1 𝑏.
If 𝐴 is invertible, then det 𝐴 is nonzero. Thus, using what we have learned previously, Gaussian elimination can be
performed, yielding the unique solution. Nice and simple.
If 𝐴 is not invertible, the two remaining possibilities are in play: no vector is mapped to 𝑏, which means there are no
solutions, or multiple vectors mapped to 𝑏, giving numerous solutions.
Do you remember how we used the kernel of a linear transformation to describe its invertibility? It turns out that ker 𝐴
can also be used to find all solutions for a linear system.
Theorem 9.2.2
Let 𝐴 ∈ ℝ𝑛×𝑛 an arbitrary matrix and let 𝑥0 ∈ ℝ𝑛 be a solution of the linear equation 𝐴𝑥 = 𝑏, where 𝑏 ∈ ℝ𝑛 . Then the
set of all solutions can be written as
Proof. We have to show two things: (a) if 𝑥 ∈ 𝑥0 + ker 𝐴, then 𝑥 is a solution; and (b) if 𝑥 is a solution, then
𝑥 ∈ 𝑥0 + ker 𝐴.
(a) Suppose that 𝑥 ∈ 𝑥0 + ker 𝐴, that is, 𝑥 = 𝑥0 + 𝑦 for some 𝑦 ∈ ker 𝐴. Then
𝐴𝑥 = 𝐴(𝑥0 + 𝑦) = 𝐴𝑥
⏟0 + 𝐴𝑦
⏟ = 𝑏,
=𝑏 =0
𝐴(𝑥 − 𝑥0 ) = 𝐴𝑥 − 𝐴𝑥0 = 𝑏 − 𝑏 = 0.
Thus, (a) and (b) imply that 𝑥0 + ker 𝐴 is the set of all solutions. □
In theory, this theorem provides an excellent way to find all solutions for linear equations, generalizing far beyond finite-
dimensional vector spaces. (Note that the proof goes through verbatim for all vector spaces and linear transformations.)
For instance, this exact result is used to describe all solutions of an inhomogeneous linear differential equation.
So far, we have seen that the invertibility of a matrix 𝐴 ∈ ℝ𝑛×𝑛 is key to solving linear equations. However, we haven’t
found a way to compute the inverse of a matrix yet.
Let’s recap what the inverse is in terms of linear transformations. If the columns of 𝐴 are denoted by the vectors
𝑎1 , … , 𝑎𝑛 ∈ ℝ𝑛 , then 𝐴 is the linear transformation that maps the standard basis vectors to the these vectors:
𝐴 ∶ 𝑒𝑖 ↦ 𝑎 𝑖 , 𝑖 = 1, … , 𝑛.
𝐴−1 ∶ 𝑎𝑖 ↦ 𝑒𝑖 , 𝑖 = 1, … , 𝑛
I know, this seems paradoxical: to find the solution of 𝐴𝑥 = 𝑏, we need the inverse 𝐴−1 . To find the inverse, we need
to solve 𝑛 equations. The answer is Gaussian elimination, which gives us an exact computational method to obtain 𝐴−1 .
In the next chapter, we are going to put this into practice and write our matrix-inverting algorithm from scratch. Pretty
awesome.
ELEVEN
THE LU DECOMPOSITION
In the previous chapter, I promised that you’d never have to solve a linear equation by hand. As it turns out, this task is
perfectly suitable for computers. In this chapter, we will dive deep into the art of solving linear equations, developing the
tools from scratch.
We start by describing the process of Gaussian elimination in terms of matrices. Why would we even do that? Because
matrix multiplication can be performed extremely fast in modern computers. Expressing any algorithm in terms of
matrices is a sure way
At the start, our linear equation 𝐴𝑥 = 𝑏 is given by the coefficient matrix
𝐴(𝑛−1) is upper diagonal, that is, all elements below its diagonal are zero.
Gaussian elimination performs this task one step at a time, focusing on consecutive columns. After the first elimination
step, this is turned into the equation (10.3), described by the coefficient matrix
Can we obtain 𝐴(1) from 𝐴 via multiplication with some matrix, that is, can we find a 𝐺1 ∈ ℝ𝑛×𝑛 such that 𝐴(1) = 𝐺1 𝐴
holds?
Yes. By defining 𝐺1 as
1 0 0 … 0
⎡ − 𝑎21 1 0 … 0⎤
⎢ 𝑎𝑎11 ⎥
𝐺1 = ⎢ − 𝑎31 0 1 … 0⎥ , (11.1)
⎢ 11 ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
𝑎𝑛1
⎣− 𝑎11 0 0 … 1⎦
143
Mathematics of Machine Learning
we can see that 𝐴(1) = 𝐺1 𝐴 is the same as performing the first step of Gaussian elimination. 𝐺1 is lower diagonal; that
is, all elements above its diagonal are zero. In fact, except for the first column, all elements below the diagonal are zero
as well. (Note that 𝐺1 depends on 𝐴.)
By analogously defining
1 0 0 … 0
⎡0 1 0 … 0⎤
⎢ 𝑎32
(1) ⎥
⎢ 0⎥ ,
𝐺2 = ⎢0 − (1)
𝑎22
1 …
⎥ (11.2)
⎢⋮ ⋮ ⋱ ⋮ ⎥
⎢ 𝑎𝑛2
(1) ⎥
⎣0 − (1)
𝑎22
0 … 1⎦
we obtain 𝐴(2) = 𝐺2 𝐴(1) = 𝐺2 𝐺1 𝐴, a matrix that is upper diagonal in the first two column. (That is, all elements are
zero below the diagonal, but only in the first two columns.)
We can continue this process until obtaining the upper triangular matrix
The algorithm is starting to shape up nicely. The 𝐺𝑖 matrices are invertible, with inverses
1 0 0 … 0
1 0 0 … 0 ⎡0
⎡ 𝑎21 1 0 … 0⎤
1 0 … 0⎤ ⎢ ⎥
⎢ 𝑎𝑎11 ⎥ ⎢
(1)
𝑎32
0⎥ , … ,
𝐺−1
1 = ⎢ 𝑎31 0 1 … 0⎥ , 𝐺−1
2 = ⎢0 (1)
𝑎22
1 …
⎥
⎢ 11 ⎥ ⎢⋮ ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥ ⋮ ⋱ ⋮
𝑎𝑛1 ⎢ (1) ⎥
⎣ 𝑎11 0 0 … 1⎦ 𝑎𝑛2
⎣0 (1)
𝑎22
0 … 1⎦
and so on. Thus, by multiplying with their inverses one by one, we can express 𝐴 as
𝐴 = 𝐺−1 −1
1 … 𝐺𝑛−1 𝐴
(𝑛−1)
.
1 0 0 … 0
⎡ 𝑎21 1 0 … 0⎤
⎢ 𝑎11 (1)
⎥
⎢ 𝑎31 𝑎32
1 … 0⎥ ,
𝐿 = ⎢ 𝑎11 (1)
𝑎22 ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎢𝑎 (1)
𝑎𝑛2
(2)
𝑎𝑛3
⎥
𝑛1
… 1
⎣ 𝑎11 (1)
𝑎22
(2)
𝑎33 ⎦
which is lower diagonal. By defining the upper diagonal matrix 𝑈 ∶= 𝐴(𝑛−1) , we obtain the famous LU decomposition,
factoring 𝐴 into a lower and an upper diagonal matrix:
𝐴 = 𝐿𝑈 .
Notice that with this algorithm, we perform two tasks for the price of one:
• factorizing 𝐴 into the product of an upper diagonal and a lower diagonal matrix,
• and performing Gaussian elimination.
From a computational standpoint, the LU decomposition is an extremely important tool. Good news: it is relatively easy
to and fast compute. Since it is just a refashioned Gaussian elimination, its complexity is 𝑂(𝑛3 ), just as we saw this
earlier. Bad news: it is not always available. Since it is tied to Gaussian elimination, we can characterize its existence
in similar terms. Recall that for the Gaussian elimination to successfully finish, the principal minors are required to be
nonzero. This is directly transferred to the LU decomposition.
𝐴 = 𝐿𝑈 , 𝐿, 𝑈 ∈ ℝ𝑛×𝑛 ,
where 𝐿 is a lower diagonal, and 𝑈 is an upper diagonal matrix. Moreover, the elements along the diagonal of 𝐿 are equal
to 1.
The gist is the same: everything is fine if we avoid division with zero during the algorithm.
After all the preparations, we are ready to put things into practice!
To summarize the LU decomposition algorithm as described above, we essentially repeat two steps:
1. calculate the elimination matrices of the input,
2. and multiply the input with the elimination matrices, feeding the output back into the first step.
The plan is clear: first, we write a function that computes the elimination matrices and their inverses; then, we iteratively
perform the elimination steps using matrix multiplication.
import numpy as np
def compute_elimination_matrices(
A: np.ndarray,
step: int,
):
"""
Computes the step-th elimination matrix and its inverse.
return L, U
A = 10*np.random.rand(4, 4) - 5
L, U = LU(A)
print(f"Lower:\n{L}\n\nUpper:\n{U}")
Lower:
[[ 1. 0. 0. 0. ]
[ 1.66305674 1. 0. 0. ]
[ 2.24433177 0.93030038 1. 0. ]
[-2.87123912 -2.08743287 -1.66626783 1. ]]
Upper:
[[ 1.7020027 4.93736796 1.92012691 4.20560581]
[ 0. -8.09470627 -7.71885461 -6.91590661]
[ 0. 0. 4.19060713 0.58861142]
[ 0. 0. 0. 1.15142459]]
np.allclose(np.matmul(L, U), A)
True
Overall, the LU decomposition is a highly versatile tool, used as a stepping stone in the implementations of essential
algorithms. One of them is computing the inverse matrix, as we shall see next.
So far, we have talked a lot about the inverse matrix. We explored the question of invertibility from several angles, in
terms of
• the kernel and the image,
• the determinant,
• and the solvability of linear equations.
However, we haven’t yet talked about actually computing the inverse. With the LU decomposition, we obtain a tool that
can be used for this purpose. How? By plugging in a lower triangular matrix into the Gaussian elimination process, we
get its inverse as a side effect. So, we
1. calculate the LU decomposition 𝐴 = 𝐿𝑈 ,
2. invert the lower triangular matrices 𝐿 and 𝑈 𝑇 ,
3. use the identity (𝑈 −1 )𝑇 = (𝑈 𝑇 )−1 to get 𝑈 −1 ,
4. multiply 𝐿−1 and 𝑈 −1 to finally obtain 𝐴−1 = 𝑈 −1 𝐿−1 .
That’s a plan! Let’s start with inverting lower triangular matrices.
Let 𝐿 ∈ ℝ𝑛×𝑛 be an arbitrary lower triangular matrix. Following the same process that led to (11.3), we obtain
𝐷 = 𝐺𝑛−1 … 𝐺1 𝐿,
We can implement this very similarly to the LU decomposition; we can even reuse our com-
pute_elimination_matrices function.
return np.matmul(D_inv, G)
With this done, we are ready to invert any matrix. (That is actually invertible.)
We are almost at the finish line. Every component is ready, the only thing left to do is to put them together. We can do
this within a few lines of code.
A = np.random.rand(3, 3)
A_inv = invert(A)
print(f"A:\n{A}\n\nA⁻¹:\n{A_inv}\n\nAA⁻¹:\n{np.matmul(A, A_inv)}")
A:
[[0.50099494 0.38379137 0.90459645]
[0.90899798 0.68654876 0.17644865]
[0.02572362 0.47884098 0.89581146]]
A⁻¹:
[[ 1.59422999 0.26850537 -1.66275189]
[-2.43329707 1.27870628 2.20529206]
[ 1.2548991 -0.69122123 -0.01474888]]
AA⁻¹:
[[ 1.00000000e+00 4.71048498e-15 8.53893767e-18]
[-6.79235268e-15 1.00000000e+00 1.92026231e-16]
[ 3.12838119e-14 -1.43648206e-14 1.00000000e+00]]
To test the correctness of our invert function, we quickly check the results on a few randomly generated matrices.
for _ in range(1000):
n = np.random.randint(1, 10)
A = np.random.rand(n, n)
A_inv = invert(A)
if not np.allclose(np.matmul(A, A_inv), np.eye(n), atol=1e-5):
print("Test failed.")
Of course, our implementation far from optimal. When working with NumPy arrays, we can turn to the built-in functions.
In NumPy, this is np.linalg.inv.
A = np.random.rand(3, 3)
A_inv = np.linalg.inv(A)
A:
[[0.89317567 0.90111789 0.82417549]
[0.04126326 0.12167803 0.65452453]
[0.52059519 0.49859182 0.34184381]]
NumPy's A⁻¹:
[[ -59.83217022 21.6188274 102.86029659]
[ 68.63452126 -25.99985543 -115.69420373]
[ -8.98735362 4.99835821 15.02326035]]
AA⁻¹:
[[ 1.00000000e+00 -4.52177990e-15 -1.03125295e-14]
[ 1.46455735e-16 1.00000000e+00 -5.46033371e-16]
[-2.30941798e-15 2.59457335e-16 1.00000000e+00]]
n_runs = 100
size = 100
A = np.random.rand(size, size)
A ~200x improvement. Nice! Why is NumPy that much faster? There are two main reasons. First, it directly calls the
SGETRI function from LAPACK, which is extremely fast. Second, according to its documentation, SGETRI uses a
faster algorithm:
This method inverts U and then computes inv(A) by solving the system
inv(A)*L = inv(U) for inv(A).
So, NumPy calls the LAPACK function, which uses LU factorization in turn. (I am not particularly adept in digging
through Fortran code that is older than I am, so let me know if I am wrong here. Nevertheless, the fact that state-of-the-
art frameworks still make calls to this ancient library is a testament to its power. Never underestimate old technologies
like LAPACK and Fortran.)
11.3 Problems
Problem 1. Show that the product of upper triangular matrices is upper triangular. Similarly, the product of lower
triangular matrices is lower triangular. (We have used these facts extensively in this section but didn’t give a proof. So,
this is an excellent time to convince yourself about this if you haven’t already.)
Problem 2. Write a function that, given an invertible square matrix 𝐴 ∈ ℝ𝑛×𝑛 and a vector 𝑏 ∈ ℝ𝑛 , finds the solution of
the linear equation 𝐴𝑥 = 𝑏. (This can be done with a one-liner if you use one of the tools we have built here.)
TWELVE
DETERMINANTS IN PRACTICE
In the theory and practice of mathematics, the development of concepts usually has a simple flow. Definitions first arise
from vague geometric or algebraic intuitions, eventually crystallized in mathematical formalism.
However, mathematical definitions often disregard practicalities. Often for a very good reason, mind you! Keeping
practical considerations out of sight gives us the power to reason about structure effectively. This is the strength of
abstraction. Eventually, if meaningful applications are found, the development flows toward computational questions,
putting speed and efficiency onto the horizon.
An epitome of this is neural networks themselves. From theoretical constructs to state-of-the-art algorithms that run on
your smartphone, machine learning research followed this same arc.
This is also what we experience in this book on a microscopic level. Among many other examples, think about deter-
minants. We introduced the determinant as the orientation of column vectors and the parallelepiped volume defined by
them. Still, we haven’t really worked on computing them in practice. Sure, we gave a formula or two, but it is hard to
decide which one is the most convoluted. All of them are.
On the other hand, the mathematical study of determinants yielded a ton of useful results: invertibility of linear transfor-
mations, characterization of Gaussian elimination, and many more. (And even more to come.)
In this chapter, we are ready to pay off our debts and develop tools to actually compute determinants. As before, we will
take a straightforward approach and use one of the previously derived determinant formulas. Spoiler alert: this is far from
optimal, so we’ll find a way to compute the determinant with high speed.
Let’s recall what we know about determinants so far. Given a matrix 𝐴 ∈ ℝ𝑛×𝑛 , its determinant det 𝐴 quantifies the
volume distortion of the linear transformation 𝑥 → 𝐴𝑥. That is, if 𝑒1 , … , 𝑒𝑛 is the standard orthonormal basis, then
informally speaking,
We have derived two formulas to compute this quantity. Initially, we described the determinant in terms of summing
over all permutations:
This is difficult even to understand, let alone programmatically compute. So, a recursive formula is derived, which we can
also use. It states that
𝑛
det 𝐴 = ∑(−1)𝑗+1 𝑎1,𝑗 det 𝐴1,𝑗 ,
𝑗=1
151
Mathematics of Machine Learning
where 𝐴𝑖,𝑗 is the matrix obtained by deleting the 𝑖-th row and 𝑗-th column of 𝐴. Which one would you rather use? Take
a few minutes to figure out your reasoning.
Unfortunately, there are no right choices here. With the permutation formula, one has to find a way to generate all
permutations first, then calculate their signs. Moreover, there are 𝑛! unique permutations in 𝑆𝑛 , so this sum has a lot of
terms. Using this formula seems extremely difficult, so we are going with the recursive version. Recursion has its issues
(as we are about to see very soon), but it is easy to handle from a coding standpoint. Let’s get to work!
under our magnifying glass. If 𝐴 is an 𝑛 × 𝑛 matrix, then 𝐴1,𝑗 (obtained from 𝐴 by deleting its first row and 𝑗-th column)
is of size (𝑛 − 1) × (𝑛 − 1). This is a recursive step. For each 𝑛 × 𝑛 determinant, we have to calculate 𝑛 pieces of
(𝑛 − 1) × (𝑛 − 1) determinants, and so on.
By the end, we have a lot of 1 × 1 determinants, which are trivial to calculate. So, we have a boundary condition, and
with that, we are ready to put these together inside a function.
import numpy as np
if n == 1:
return A[0, 0]
else:
return sum([(-1)**i*A[0, i]*det(np.delete(A[1:], i, axis=1)) for i in␣
↪range(n)])
Let’s test the det function out on a small example. For 2 × 2 matrices, we can easily calculate the determinants using
the rule
𝑎 𝑏
det [ ] = 𝑎𝑑 − 𝑏𝑐.
𝑐 𝑑
det(A) # should be -2
-2
It seems to work. So far, so good. What is the issue? Recursion. Let’s calculate the determinant of a small 10 × 10
matrix, measuring the time it takes.
A = np.random.rand(10, 10)
t_det = timeit(lambda: det(A), number=1)
Thirty-one long and unbearable seconds. For such a simple task, this feels like an eternity.
For 𝑛 × 𝑛 inputs, we call the det function recursively 𝑛 times, on (𝑛 − 1) × (𝑛 − 1) inputs. That is, if 𝑎𝑛 denotes the
time complexity of our algorithm for an 𝑛 × 𝑛 matrix, then, due to the recursive step, we have
𝑎𝑛 = 𝑛𝑎𝑛−1 ,
which explodes really fast. In fact, 𝑎𝑛 = 𝑂(𝑛!), which is the dreaded factorial complexity. Unlike some other recursive
algorithms, caching doesn’t help either. There are two reasons for this: sub-matrices rarely match, and numpy.ndarray
objects are mutable thus not hashable.
In practice, 𝑛 can be in the millions, so this formula is utterly useless. What can we do? Simple: LU decomposition.
Besides the two formulas, we saw lots of useful properties of matrices and determinants. Can we apply what we have
learned so far to simplify the problem?
Let’s consider the LU decomposition. According to this, if det 𝐴 ≠ 0, then 𝐴 = 𝐿𝑈 , where 𝐿 is lower triangular and 𝑈
is upper triangular. Since the determinant behaves nicely with respect to matrix multiplication, see (9.7), we have
Seemingly, we made our situation worse: instead of one determinant, we have to deal with two. However, 𝐿 and 𝑈 are
rather special, as they are triangular. It turns out that computing a triangular matrix’s determinant is extremely easy. We
just have to multiply the elements in the diagonal together!
Proof. Suppose that 𝐴 is lower triangular. (That is, all elements above its diagonal are zero.) According to the recursive
formula for det 𝐴, we have
𝑛
det 𝐴 = ∑(−1)𝑗+1 𝑎1,𝑗 det 𝐴1,𝑗 .
𝑗=1
𝐴1,1 = (𝑎𝑖𝑗 )𝑛𝑖,𝑗=2 is also lower triangular. By iterating the previous step, we obtain
holds as well. □
Back to our original problem. Since the diagonal 𝐿 is constant 1, as guaranteed by the LU decomposition, we have
𝑛
det 𝐴 = det 𝑈 = ∏ 𝑢𝑖𝑖 .
𝑖=1
So, the algorithm to compute the determinant is quite simple: get the LU decomposition, then calculate the product of
𝑈 ‘s diagonal. Let’s put this into practice!
import nbimporter
from scripts.LU import LU
A = np.random.rand(1000, 1000)
The time it takes to calculate the determinant of a 1000 x 1000 matrix: 41.
↪27983342299922
Forty-one seconds, but for a 1000 × 1000 matrix this time. This can be even faster if we use a better implementation of
the LU decomposition algorithm. (For instance, scipy.linalg.lu, which relies on our old friend LAPACK.)
I get emotional just by looking at this result. See how far we can go with a bit of linear algebra? This is why understanding
the fundamentals such as Gaussian elimination is essential. Machine learning and deep learning are still very new fields,
and even though an insane amount of research power is being put into it, moments like these happen all the time. Simple
ideas often give birth to new paradigms.
12.3 Problems
Before we wrap this chapter up, let’s go back to the very beginning. Even though we have lots of reasons against using
the determinant formula, we have one for it: it is a good exercise, and implementing it will deepen your understanding.
So, in this section, you are going to build
det 𝐴 = ∑ sign(𝜎)𝑎𝜎(1)1 … 𝑎𝜎(𝑛)𝑛 ,
𝜎∈𝑆𝑛
[2, 0, 1]
where | ⋅ | denotes the number of elements in the set. Essentially, inversion describes the number of times a permutation
reverses the order of a pair of numbers.
Turns out, the sign of 𝜎 can be written as
sign(𝜎) = (−1)inversion(𝜎) .
Implement a function that first calculates the inversion number, then the sign of an arbitrary permutation. (Permutations
are represented like in the previous problem.)
Problem 3. Put the solutions for Problem 1. and Problem 2. together and calculate the determinant of a matrix using the
permutation formula. What do you think the time complexity of this algorithm is?
12.4 Solutions
Problem 1.
new_permutations = []
for p in prev_permutations:
for i in range(len(p)+1):
p_new = deepcopy(p)
p_new.insert(i, n)
new_permutations.append(p_new)
return new_permutations
Problem 2.
return inversions
Problem 3.
THIRTEEN
So far, we have seen three sides of linear transformations: functions, matrices, and transforms that distort the grid of the
underlying vector space. In the Euclidean plane, we saw some examples that shed some light into the geometric nature of
them.
Following this line of thought, let’s consider the linear transformation given by the matrix
2 1
𝐴=[ ]. (13.1)
1 2
Since the columns of 𝐴 are the images of the standard basis vectors 𝑒1 = (1, 0) and 𝑒2 = (0, 1), we can visualize the
effect of 𝐴. (Check here if you don’t recall this fact.)
Fig. 13.1: Images of the standard basis vectors under the linear transformation given by 𝐴.
This seems to shear, stretch, and rotate the entire grid. However, there are special directions along which 𝐴 is simply
a stretching. For instance, consider the vector 𝑢1 = (1, 1). By a simple calculation, you can verify that 𝐴𝑢1 = 3𝑢1 .
Because of the linearity, this means that if a vector 𝑥 is in span(𝑢1 ), its image under 𝐴 is 3𝑥.
Another one is 𝑢2 = (−1, 1), where we have 𝐴𝑢2 = 𝑢2 . Thus, any 𝑥 ∈ span(𝑢2 ) is left in place.
157
Mathematics of Machine Learning
Fig. 13.2: Images of 𝑢1 = (1, 1) and 𝑢2 = (−1, 1) under the linear transformation given by 𝐴.
3 0
𝐴𝑢1 ,𝑢2 = [ ],
0 1
that is, 𝐴𝑢1 ,𝑢2 is diagonal. We love diagonal matrices in practice because multiplication with a diagonal matrix is much
faster, as it requires 𝑂(𝑛) operations, opposed to 𝑂(𝑛2 ).
Is this a general phenomena? Are these even useful? The answer is yes to both questions. What we have just seen is
formalized by the concept of eigenvalues and eigenvectors. The terminology originates from the german word “eigen”
meaning “own”, resulting in one of the ugliest naming conventions in mathematics.
Although we have formally defined eigenvalues and eigenvectors for linear transformations, we often talk about them
in context of matrices. (Because, as we have seen, matrices and linear transformations are the same.) Let’s start by
translating the definition into the language of matrices.
If 𝐴 ∈ ℝ𝑛×𝑛 is a matrix, Definition 12.1 translates to the following: the scalar 𝜆 and the vector 𝑥 ∈ ℝ\{0} is an
eigenvalue-eigenvector pair of the matrix if
𝐴𝑥 = 𝜆𝑥 (13.2)
holds. This can be simplified: as the linear transformation 𝑥 ↦ 𝜆𝑥 corresponds to the matrix 𝜆𝐼, (13.2) is equivalent to
(𝐴 − 𝜆𝐼)𝑥 = 0 (13.3)
If you recall how matrices arise from linear transformations, you might ask the question: won’t the eigenvalues depend on
the choice of the matrix?
The following theorem states that this is not the case: the eigenvalues for a linear transformation and its matrices are the
same.
𝐵𝑥′ = 𝜆𝑥′
(𝐴 − 𝜆𝐼)𝑥 = (𝐴 − 𝜆𝑇 𝑇 −1 )𝑥
= 𝑇 (𝑇 −1 𝐴𝑇 − 𝜆𝐼)𝑇 −1 𝑥
= 0.
Since 𝑇 is invertible, 𝑇 [(𝑇 −1 𝐴𝑇 − 𝜆𝐼)𝑇 −1 𝑥] = 0 can only happen if (𝑇 −1 𝐴𝑇 − 𝜆𝐼)𝑇 −1 𝑥 = 0. (Recall the relation
of the kernel and invertibility.) This looks almost like (13.3), just a bit more complicated. Let me use some suggestive
parentheses to highlight the similarities:
[𝑇 −1 𝐴𝑇 − 𝜆𝐼][𝑇 −1 𝑥] = 0.
Note that
So, with the selection 𝑥′ = 𝑇 −1 𝑥, we have
𝑇 −1 𝐴𝑇 𝑥′ = 𝜆𝑥′ ,
In other words, the eigenvalues of similar matrices are the same. Consequently, we can talk about the eigenvalues of
matrices, not just linear transformations. The above theorem implies that the eigenvalues of a transformation and its
corresponding matrix are the same. Moreover, the eigenvalues of the matrix don’t depend on the choice of basis.
To be more precise, suppose that 𝐴 ∶ 𝑈 → 𝑈 is a linear transformation and 𝑃 , 𝑄 are bases of 𝑈 . The matrix of 𝐴 in
some basis 𝑆 is denoted by 𝐴𝑄 . We know that there is a transformation matrix 𝑇 ∈ ℝ𝑛×𝑛 such that
𝐴𝑄 = 𝑇 −1 𝐴𝑃 𝑇 .
Even though the definition of eigenvalues-eigenvectors is easy to understand given the geometric interpretation we just
saw, it does not give us any tools to find them in practice. Using them to get simpler representations of matrices is one
thing, but we are stuck at square one without a method to find them.
First, let’s focus on the eigenvalues. Suppose that for some 𝜆, there is a nonzero vector 𝑥 such that 𝐴𝑥 = 𝜆𝑥. The
transformation defined by 𝑥 → 𝜆𝑥 is a linear one, and its matrix is diagonal:
𝜆 0 … 0 𝑥1
⎡0 𝜆 … 0⎤ ⎡𝑥 ⎤
𝜆𝑥 = ⎢ ⎥ ⎢ 2⎥ .
⎢⋮ ⋮ ⋱ ⋮⎥⎢ ⋮ ⎥
⎣ 0 0 … 𝜆⎦ ⎣𝑥𝑛 ⎦
Because linear transformations can be added and subtracted, the defining equation 𝐴𝑥 = 𝜆𝑥 is equivalent to
(𝐴 − 𝜆𝐼)𝑥 = 0,
where 𝐼 denotes the identity transformation, as defined by (8.4). In another words, the transformation 𝐴 − 𝜆𝐼 maps a
nonzero vector to 0, meaning that it is not invertible, as Theorem 7.4.2 implies. We can characterize this with determinants:
we need to find all 𝜆-s such that
det(𝐴 − 𝜆𝐼) = 0.
Theorem 12.2.1
Let 𝐴 ∶ 𝑈 → 𝑈 be an arbitrary linear transformation. Then 𝜆 is its eigenvalue if and only if det(𝐴 − 𝜆𝐼) = 0.
Although we are one step closer, finding eigenvalues based on this still seems complicated. In the following, we are going
to see what det(𝐴 − 𝜆𝐼) really is and how we can find the solutions of det(𝐴 − 𝜆𝐼) = 0 in practice.
Before going into the generalities, let’s revisit the example (13.1). There, we have
2−𝜆 1
det(𝐴 − 𝜆𝐼) = ∣ ∣
1 2−𝜆
= (2 − 𝜆)2 − 1
= 𝜆2 − 4𝜆 + 3.
𝜆2 − 4𝜆 + 3 = 0,
which we can do easily. Recall that the solutions of any quadratic equation 𝑎𝑥2 + 𝑏𝑥 + 𝑐 = 0 are
√
−𝑏 ± 𝑏2 − 4𝑎𝑐
𝑥1,2 = .
2𝑎
Applying this, we have 𝜆1 = 3 and 𝜆2 = 1 as solutions. There are no other ones, so 1 and 3 are the only two eigenvalues
for 𝐴.
Let’s see what happens in the general case!
As the example above suggests, if the underlying vector space 𝑈 is 𝑛-dimensional, that is, 𝐴 is an 𝑛×𝑛 matrix, det(𝐴−𝜆𝐼)
is an 𝑛-th degree polynomial in 𝜆.
To see this, let’s write det(𝐴 − 𝜆𝐼) explicitly in terms of matrices. With this in mind, we have
𝑎 −𝜆 𝑎12 … 𝑎1𝑛
∣ 11 ∣
𝑎 𝑎22 − 𝜆 … 𝑎2𝑛 ∣
det(𝐴 − 𝜆𝐼) = ∣ 21 .
∣ ⋮ ⋮ ⋱ ⋮ ∣
∣ 𝑎𝑛1 𝑎𝑛2 … 𝑎𝑛𝑛 − 𝜆∣
If you consider the formula to calculate the determinant given by (9.5), you can see that every term is a polynomial.
Depending on how many fixed points 𝜎 has (that is, points where 𝜎(𝑖) = 𝑖), the degree of this polynomial varies between
0 and 𝑛.
(Alternatively, you can see that det(𝐴 − 𝜆𝐼) is a polynomial of degree 𝑛 by using the recursive formula (9.6) and applying
induction.)
The roots of the characteristic polynomial are the eigenvalues. If 𝑈 is an 𝑛-dimensional complex vector space (that is,
the set of scalars is ℂ), the fundamental theorem of algebra guarantees that det(𝐴 − 𝜆𝐼) = 0 has exactly
𝑛 roots.
As a consequence, every matrix 𝐴 ∈ ℂ𝑛×𝑛 has at least one eigenvalue. Note that roots can have higher algebraic
multiplicity. For instance, the characteristic polynomial for the matrix
1 0 0
𝐵=⎡
⎢0 1 0⎤⎥
⎣0 0 2⎦
is (1 − 𝜆)2 (2 − 𝜆). So, its roots are 1 (with algebraic multiplicity 2) and 2.
If we restrict ourselves to real matrices and real vector spaces, the existence of eigenvalues and eigenvectors are not
guaranteed. For instance, consider
1 2
𝐶=[ ].
1 1
Its characteristic polynomial is −𝜆2 − 1, which doesn’t have any real roots, only complex ones: 𝜆1 = 𝑖 and 𝜆2 = −𝑖.
Mathematically speaking, if we want to stay within the confines of real vector spaces, 𝐶 has no eigenvalues. However, we
are here to do machine learning, not algebra. Thus, we are going to be a bit imprecise and treat real matrices as complex
ones. We don’t often need complex numbers to describe mathematical models of a dataset, but they frequently appear
during the analysis of matrices.
When an eigenvalue 𝜆 is identified, we can set out to find the corresponding eigenvectors; that is, vectors 𝑥 where (𝐴 −
𝜆𝐼)𝑥 = 0. In more precise terms, we are looking for \ker (A - \lambda I) $.
As we have mentioned before, the kernel of any linear transformation is a subspace. As it might be more than one
dimensional, identifying it often involves an implicit description like 𝑥1 + 𝑥2 = 0.
Let’s check what happens with our recurring example
2 1
𝐴=[ ].
1 2
Previously, we have seen that 𝜆1 = 3 and 𝜆2 = 1 are the eigenvalues. To identify the corresponding eigenvectors for,
say, 𝜆1 , we have to find all solutions for the linear equation (𝐴 − 𝜆1 𝐼)𝑥 = 0. Expanding this, we have
−𝑥1 + 𝑥2 = 0
𝑥1 − 𝑥2 = 0.
Definition 12.3.1
Let 𝑓 ∶ 𝑉 → 𝑉 be an arbitrary linear transformation, and 𝜆 its eigenvalue. The subspace of eigenvectors defined by
𝑈𝜆 = {𝑥 ∶ 𝐴𝑥 = 𝜆𝑥}
Eigenspaces play an important role in understanding the structure of linear transformations. First, we can note that a
linear transformation keeps its eigenspaces invariant. (That is, if 𝑥 is in the 𝑈𝜆 eigenspace, then 𝑓(𝑥) ∈ 𝑈𝜆 as well.) This
property makes it possible for us to restrict linear transformations to their eigenspaces.
To illustrate the concept of eigenspaces, let’s revisit the already familiar matrix
2 1
𝐴=[ ]
1 2
one more time. Its eigenvalues are 𝜆1 = 3 and 𝜆2 = 1, and by solving the equation (𝐴 − 𝜆1 𝐼)𝑥 = 0, the eigenspace of
𝜆1 is
𝑈𝜆1 = {𝑥 ∈ ℝ2 ∶ 𝑥1 = 𝑥2 }.
Similarly, you can check that 𝑈𝜆2 = {𝑥 ∈ ℝ2 ∶ 𝑥1 = −𝑥2 }. (If you go back to Fig. 13.2, you can visualize 𝑈𝜆1 and
𝑈𝜆2 .)
Eigenspaces are not necessarily one dimensional. For instance, consider one of the the previous examples
1 0 0
𝐵=⎡
⎢0 1 0⎤⎥,
⎣0 0 2⎦
with two eigenvalues 𝜆1 = 1 and 𝜆2 = 2. Substituting 𝜆1 back into the equation and solving for (𝐵 − 𝐼)𝑥 = 0, we
obtain that
𝑈𝜆1 = {𝑥 ∈ ℝ3 ∶ 𝑥3 = 0},
Theorem 12.3.1
Let 𝑓 ∶ 𝑉 → 𝑉 be a linear transformation, let 𝐴 ∈ ℝ𝑛×𝑛 be its matrix in some basis, and let 𝑈𝜆1 , … , 𝑈𝜆𝑘 be the
eigenspaces of 𝑓. The following are equivalent.
(a) There is a matrix 𝑇 ∈ ℝ𝑛×𝑛 such that
Λ = 𝑇 −1 𝐴𝑇 ,
𝑉 = 𝑈𝜆1 + ⋯ + 𝑈𝜆𝑘 .
Proof. (a) ⟹ (b). If 𝐴 is the matrix of 𝑓 in some basis, then a a similarity transformation is equivalent to a change in
basis.
That is, the new matrix Λ = 𝑇 −1 𝐴𝑇 is the matrix of 𝑓 in a different basis, say 𝑢1 , … , 𝑢𝑛 .
If Λ is diagonal, it can be written in the form
𝜆1 0 … 0
⎡0 𝜆2 … 0⎤
Λ=⎢ ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣0 0 … 𝜆𝑛 ⎦
(Note that the 𝜆𝑖 -s are not mutually different.) Thus, Λ𝑢𝑖 = 𝜆𝑖 𝑢𝑖 , meaning that 𝑢1 , … , 𝑢𝑛 is a basis from the eigenvectors
of 𝑓.
(b) ⟹ (a). If 𝑢1 , … , 𝑢𝑛 is a basis from the eigenvectors of 𝑓, then its matrix Λ in that basis is diagonal. Thus, 𝐴 is
similar to Λ, which is what we had to show.
(b) ⟹ (c). By definition, the direct sum of the eigenspaces contains all linear combinations of the form
𝑛
𝑥 = ∑ 𝑥𝑖 𝑢𝑖 .
𝑖=1
Even though this theorem does not give us any useful recipes on how to diagonalize a matrix, it provides us with an
extremely valuable insight: diagonalization is equivalent to finding an eigenvector basis. This is not always possible, but
when it is, we are cooking with gas.
In the next chapter, we will take a deep dive into this topic, providing multiple ways to simplify matrices. If our journey
in linear algebra is akin to a mountain climb, we will reach the peak soon.
13.4 Problems
Problem 1. Let 𝐴 ∈ ℝ𝑛×𝑛 be an upper or lower trangular matrix. Show that the eigenvalues of 𝐴 are its diagonal
elements.
FOURTEEN
So far, we have aspired to develop a geometric view of linear algebra. Vectors are mathematical objects defined by their
direction and magnitude. In the spaces of vectors, the concept of distance and orthogonality gives rise to a geometric
structure.
Linear transformations, the building blocks of machine learning, are just mappings that distort this structure: rotating,
stretching, skewing the geometry. However, there are types of transformations that preserve some of the structure. In
practice, these provide valuable insight, and additionally, they are much easier to work with. In this section, we will take
a look at the most important ones, those that we’ll encounter in machine learning.
In machine learning, the most important stage is the Euclidean space ℝ𝑛 . This is where data is represented and manipu-
lated. There, the entire geometric structure is defined by the inner product
𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑥𝑖 𝑦𝑖 ,
𝑖=1
giving rise to the notion of magnitude, direction (in the form of angles), and orthogonality. Because of this, transforma-
tions that can be related to the inner product are special. For instance, if ⟨𝑓(𝑥), 𝑓(𝑥)⟩ = ⟨𝑥, 𝑥⟩ holds for all 𝑥 ∈ ℝ𝑛 and
the linear transformation 𝑓 ∶ ℝ𝑛 → ℝ𝑛 , we know that 𝑓 leaves the norm invariant. That is, distance in the original and
the transformed feature space have the same meaning.
First, we will establish a general relation between images of vectors under a transform and their inner product. This is
going to be the foundation for our discussions in this chapter.
165
Mathematics of Machine Learning
Proof. Suppose that 𝐴 ∈ ℝ𝑛×𝑛 is the matrix of 𝑓 in the standard orthonormal basis. For any 𝑥 = (𝑥1 , … , 𝑥𝑛 ) and
𝑦 = (𝑦1 , … , 𝑦𝑛 ), the inner product is defined by
𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑥𝑖 𝑦𝑖 ,
𝑖=1
Using this form, we can express ⟨𝐴𝑥, 𝑦⟩ in terms of 𝑎𝑖𝑗 -s, 𝑥𝑖 -s, and 𝑦𝑖 -s. For this, we have
𝑛 𝑛
⟨𝐴𝑥, 𝑦⟩ = ∑ ( ∑ 𝑎𝑖𝑗 𝑥𝑗 )𝑦𝑖
𝑖=1 𝑗=1
𝑛 𝑛
= ∑ ( ∑ 𝑎𝑖𝑗 𝑦𝑖 ) 𝑥𝑗
𝑗=1 ⏟⏟⏟⏟⏟
𝑖=1
𝑗-th component of 𝐴𝑇 𝑦
= ⟨𝑥, 𝐴𝑇 𝑦⟩.
This shows that the transformation given by 𝑓 ∗ ∶ 𝑥 ↦ 𝐴𝑇 𝑥 satisfies (14.1) and (14.2), which is what we had to show. □
Why is the quantity ⟨𝐴𝑥, 𝑦⟩ that important for us? Because inner products define the geometric structure of a vector
space. Recall the equation (5.12), allowing us to fully describe any vector using only the inner products with respect to
an orthonormal basis. In addition, ⟨𝑥, 𝑥⟩ = ‖𝑥‖2 defines the notion of distance and magnitude. Because of this, (14.1)
and (14.2) will be quite useful for us.
As we are about to see, transformations that preserve the inner product are rather special, and these relations provide us
a way to characterize them both algebraically and geometrically.
As a consequence, an orthogonal 𝑓 preserves the norm: ‖𝑓(𝑥)‖2 = ⟨𝑓(𝑥), 𝑓(𝑥)⟩ = ⟨𝑥, 𝑥⟩ = ‖𝑥‖2 . Because the angle
enclosed by two vectors is defined by their inner product, see (5.11), the property ⟨𝑓(𝑥), 𝑓(𝑦)⟩ = ⟨𝑥, 𝑦⟩ means that an
orthogonal transform also preserves angles.
We can translate the definition to the language of matrices as well. In practice, we are always going to work with matrices,
so this characterization is essential.
holds for all 𝑥. By letting 𝑥 = (𝐴𝑇 𝐴 − 𝐼)𝑦, the positive definiteness of the inner product implies that (𝐴𝑇 𝐴 − 𝐼)𝑦 = 0
for all 𝑦. Thus, 𝐴𝑇 𝐴 = 𝐼, which means that 𝐴𝑇 is the inverse of 𝐴.
(b) If 𝐴𝑇 = 𝐴−1 , we have
⟨𝐴𝑥, 𝐴𝑦⟩ = ⟨𝑥, 𝐴𝑇 𝐴𝑦⟩
= ⟨𝑥, 𝐴−1 𝐴𝑦⟩
= ⟨𝑥, 𝑦⟩,
showing that 𝑓 is orthogonal. □
The fact that 𝐴𝑇 = 𝐴−1 has a profound implication regarding the columns of 𝐴. If you think back to the definition of
matrix multiplication, the element in the 𝑖-th row and 𝑗-th column of 𝐴𝐵 is the inner product of 𝐴‘s 𝑖-th row and 𝐵‘s 𝑗-th
column.
To be more precise, if the 𝑖-th column is denoted by 𝑎𝑖 = (𝑎1,𝑖 , 𝑎2,𝑖 , … , 𝑎𝑛,𝑖 ), then we have
𝑛
𝐴𝑇 𝐴 = (⟨𝑎𝑖 , 𝑎𝑗 ⟩) = 𝐼,
𝑖,𝑗=1
that is,
1 if 𝑖 = 𝑗,
⟨𝑎𝑖 , 𝑎𝑗 ⟩ = {
0 otherwise.
In other words, the columns of 𝐴 form an orthonormal system. This fact should not come as a surprise since orthogonal
transformations preserve magnitude and orthogonality, and the columns of 𝐴 are the images of the standard orthonormal
basis 𝑒1 , … , 𝑒𝑛 .
In machine learning, performing an orthogonal transformation on our features is equivalent to looking at them from
another perspective, without distortion. You might know it already, but this is what Principal Component Analysis is
doing.
Besides orthogonal transformations, there is another important class: transformations whose adjoints are themselves. Bear
with me a bit, and we’ll see an example soon.
As always, we are going to translate this to the language of the matrices. If 𝐴 is the matrix of 𝑓 in the standard orthonormal
basis, we know that 𝐴𝑇 is the matrix of the adjoint. For self-adjoint transformations, it implies that 𝐴𝑇 = 𝐴. Matrices
such as these are called symmetric, and they have a lot of pleasant properties.
For us, the most important one is that symmetric matrices can be diagonalized! (That is, transformed to a diagonal matrix
with a similarity transform.) The following theorem makes this precise.
𝐴 = 𝑈 Λ𝑈 𝑇 (14.4)
holds.
Note that the eigenvalues 𝜆1 ≥ ⋯ ≥ 𝜆𝑛 are not necessarily distinct from each other.
Proof. (Sketch) Since the proof is pretty involved, we are better off getting to know the main ideas behind it, without all
the mathematical details.
The main steps are the following.
1. If the matrix 𝐴 is symmetric, all of its eigenvalues are real.
2. Using this, it can be shown that an orthonormal basis can be formed from the eigenvectors of 𝐴.
3. Writing the matrix of the transformation 𝑥 → 𝐴𝑥 in this orthonormal basis yields a diagonal matrix. Hence, a
change of basis yields (14.4).
Showing that the eigenvalues are real requires some complex numbers magic (which is beyond our scope). The tough part
is the second step. Once that has been done, moving to the third one is straightforward, as we have seen when talking
about eigenspaces and their bases. □
We still don’t have a hands-on way to diagonalize matrices, but this theorem gets us one step closer: at least we know it
is possible for symmetric matrices. This is an important stepping stone, as we’ll be able to reduce the general case to the
symmetric one.
The requirement for a matrix to be symmetric seems like a very special one. However, in practice, we can symmetrize
matrices in several different ways. For any matrix 𝐴 ∈ ℝ𝑛×𝑚 , the products 𝐴𝐴𝑇 and 𝐴𝑇 𝐴 will be symmetric. For
𝑇
square matrices, the average 𝐴+𝐴
2 also works. So, symmetric matrices are more common than you think.
The orthogonal matrix 𝑈 and the corresponding orthonormal basis {𝑢1 , … , 𝑢𝑛 } that diagonalizes a symmetric matrix 𝐴
has a special property that is going to be very important for us later, when we discuss the Principal Component Analysis
of data samples.
Theorem 13.3.2
Let 𝐴 ∈ ℝ𝑛×𝑛 be a real symmetric matrix and let 𝜆1 ≥ ⋯ ≥ 𝜆𝑛 be its real eigenvalues in decreasing order. Moreover,
let 𝑈 ∈ ℝ𝑛×𝑛 be the ortogonal matrix that diagonalizes 𝐴, with the corresponding orthonormal basis {𝑢1 , … , 𝑢𝑛 }.
Then
arg max 𝑥𝑇 𝐴𝑥 = 𝑢1 ,
‖𝑥‖=1
and
max 𝑥𝑇 𝐴𝑥 = 𝜆1 .
‖𝑥‖=1
Proof. Since {𝑢1 , … , 𝑢𝑛 } is an orthonormal basis, any 𝑥 can be expressed as a linear combination of them:
𝑛
𝑥 = ∑ 𝑥𝑖 𝑢𝑖 , 𝑥𝑖 ∈ ℝ.
𝑖=1
1 if 𝑖 = 𝑗,
𝑢𝑇𝑗 𝑢𝑖 = {
0 otherwise.
In other words, 𝑢𝑇𝑗 𝑢𝑖 vanishes when 𝑖 ≠ 𝑗. Continuing the above calculation with this observation,
𝑛 𝑛
𝑥𝑇 𝐴𝑥 = ∑ 𝑥𝑖 𝑥𝑗 𝜆𝑖 𝑢𝑇𝑗 𝑢𝑖 = ∑ 𝑥2𝑖 𝜆𝑖 .
𝑖,𝑗=1 𝑖=1
𝑛 𝑛
When ‖𝑥‖2 = ∑𝑖=1 𝑥2𝑖 = 1, the sum ∑𝑖=1 𝑥2𝑖 𝜆𝑖 is a weighted average of the eigenvalues 𝜆𝑖 . So,
𝑛 𝑛
∑ 𝑥2𝑖 𝜆𝑖 ≤ ∑ 𝑥2𝑖 max 𝜆𝑘 = max 𝜆𝑘 = 𝜆1 ,
𝑘=1,…,𝑛 𝑘=1,…,𝑛
𝑖=1 𝑖=1
from which 𝑥𝑇 𝐴𝑥 ≤ 𝜆1 follows. (Recall that we can assume without loss in generality that the eigenvalues are decreasing.)
On the other hand, by plugging in 𝑥 = 𝑢1 , we can see that 𝑢𝑇1 𝐴𝑢1 = 𝜆1 , so the maximum is indeed attained. From these
two, the theorem follows. □
Remark 13.3.1
In other words, Theorem 13.3.2 gives that the function 𝑥 ↦ 𝑥𝑇 𝐴𝑥 assumes its maximum value at 𝑢1 , and that maximum
value is 𝑢𝑇1 𝐴𝑢1 = 𝜆1 . The quantity 𝑥𝑇 𝐴𝑥 seems quite mysterious as well, so let’s clarify this a bit. If we think in terms
of features, the vectors 𝑢1 , … , 𝑢𝑛 can be thought of as mixtures of the “old” features 𝑒1 , … , 𝑒𝑛 . When we have actual
observations (that is, data), we can use the above process to diagonalize the covariance matrix. So, if 𝐴 denotes this
covariance matrix, 𝑢𝑇1 𝐴𝑢1 is the variance of the new feature 𝑢1 .
Thus, this theorem says that 𝑢1 is the unique feature that maximizes the variance. So, among all the possible choices for
new features, 𝑢1 conveys the most information about the data.
At this point, we don’t have all the tools to see, but in connection to the Principal Component Analysis, this says that the
first principal vector is the one that maximizes variance.
Theorem 13.3.3
Let 𝐴 ∈ ℝ𝑛×𝑛 be a real symmetric matrix, let 𝜆1 ≥ ⋯ ≥ 𝜆𝑛 be its real eigenvalues in decreasing order. Moreover, let
𝑈 ∈ ℝ𝑛×𝑛 be the ortogonal matrix that diagonalizes 𝐴, with the corresponding orthonormal basis {𝑢1 , … , 𝑢𝑛 }.
Then for all 𝑘 = 1, … , 𝑛, we have
arg max 𝑥𝑇 𝐴𝑥 = 𝑢𝑘 ,
‖𝑥‖ = 1
𝑥 ⟂ {𝑢1 , … , 𝑢𝑘−1 }
and
max 𝑥𝑇 𝐴𝑥 = 𝜆𝑘 .
‖𝑥‖ = 1
𝑥 ⟂ {𝑢1 , … , 𝑢𝑘−1 }
Proof. The proof is almost identical to the previous one. Since 𝑥 is required to be orthogonal to 𝑢1 , … , 𝑢𝑘−1 , it can be
expressed as
𝑛
𝑥 = ∑ 𝑥𝑖 𝑢𝑖 .
𝑖=𝑘
On the other hand, similarly as before, 𝑢𝑇𝑘 𝐴𝑢𝑘 = 𝜆𝑘 , so the theorem follows. □
So, we can diagonalize any real symmetric matrix with an orthogonal transformation. That’s great, but what if our matrix
is not symmetric? After all, this is a rather special case.
How can we do the same for a general matrix? We’ll use a very strong tool, straight from the mathematician’s toolkit:
wishful thinking. We pretend to have the solution, then reverse engineer it. To be specific, let 𝐴 ∈ ℝ𝑛×𝑚 be any real
matrix. (It might not be square.) Since 𝐴 is not symmetric, we have to relax our wishes for factoring it into the form
𝑈 Λ𝑈 𝑇 . The most straightforward way is to assume that the orthogonal matrices to the left and to the right of Λ are not
each other’s transposes.
Thus, we are looking for orthogonal matrices 𝑈 ∈ ℝ𝑛×𝑛 and 𝑉 ∈ ℝ𝑚×𝑚 such that
𝐴 = 𝑈 Λ𝑉 𝑇
Here comes the reverse-engineering part. First, as we have discussed earlier, 𝐴𝐴𝑇 and 𝐴𝑇 𝐴 are symmetric matrices.
Second, we can simplify them by using the orthogonality of 𝑈 and 𝑉 , obtaining
𝐴𝐴𝑇 = (𝑈 Λ𝑉 𝑇 )(𝑉 Λ𝑈 𝑇 )
= 𝑈 Λ2 𝑈 𝑇 .
Similarly, we have 𝐴𝑇 𝐴 = 𝑉 Λ2 𝑉 𝑇 . Good news: we can actually find 𝑈 and 𝑉 by applying the spectral decomposition
theorem to 𝐴𝐴𝑇 and 𝐴𝑇 𝐴 respectively. Thus, the factorization 𝐴 = 𝑈 Λ𝑉 𝑇 is valid! This form is called the singular
value decomposition (SVD), one of the pinnacle achievements of linear algebra.
Of course, we are not done yet, we only know where to look. Let’s make this mathematically precise!
Proof. Since 𝐴𝑇 𝐴 ∈ ℝ𝑚×𝑚 is a real symmetric matrix, we can apply the spectral decomposition theorem to obtain a
diagonal Λ ∈ ℝ𝑚×𝑚 and orthogonal 𝑉 ∈ ℝ𝑚×𝑚 such that
𝐴𝑇 𝐴 = 𝑉 Λ𝑉 𝑇
for some orthogonal matrix 𝑈 . If this indeed holds, we can select 𝑈 ∶= 𝐴𝑉 Λ−1 , obtaining the singular value decompo-
sition.
Stating that 𝐴𝑉 = 𝑈 Λ for an orthogonal 𝑈 and diagonal Λ is equivalent to saying that the columns of 𝐴𝑉 are orthogonal.
(Since multiplying 𝑈 with a diagonal matrix from the right is the same as scaling the columns of 𝑈 .) In turn, this is
equivalent to showing that (𝐴𝑉 )𝑇 (𝐴𝑉 ) is diagonal. Thus, we have
(𝐴𝑉 )𝑇 (𝐴𝑉 ) = 𝑉 𝑇 𝐴
⏟ 𝑇𝐴 𝑉
=𝑉 Λ𝑉 𝑇
𝑇 𝑇
= 𝑉 (𝑉 Λ𝑉 )𝑉
= (𝑉 𝑇 𝑉 )Λ(𝑉 𝑇 𝑉 )
= Λ,
which is diagonal. Now, everything is ready to reap the rewards of our work. By selecting 𝑈 ∶= 𝐴𝑉 Λ−1 , we have
𝑈 𝑇 𝐴𝑉 = Λ, which is diagonal. Thus, since 𝑈 and 𝑉 are orthogonal, we finally have
𝐴 = 𝑈 Λ𝑉 𝑇 ,
Let’s take a moment to appreciate the power of the singular value decomposition. The columns of 𝑈 and 𝑉 are orthogonal
matrices, which are rather special transformations. As they leave the inner products and the norm invariant, the structure
of the underlying vector spaces are preserved. The diagonal Λ is also special, as it is just a stretching in the direction of
the bases. It is very surprising that any linear transformation is the composition of these three special ones.
Besides mapping out the fine structure of linear transformations, SVD offers a lot more. For instance, it generalizes the
notion of eigenvectors, a concept that was defined only for square matrices. With this, we have
𝐴𝑉 = 𝑈 Λ,
which we can take a look column-wise. Here, Λ is diagonal, but its number of elements depend on the smaller one of 𝑛
or 𝑚.
So, if 𝑢𝑖 is the 𝑖-th column of 𝑈 , 𝑣𝑖 is the 𝑖-th column of 𝑉 , the identity 𝐴𝑉 = 𝑈 Λ is translated to
This resembles closely to the definition of eigenvalue-eigenvector pairs, except that instead of one vector, we have two.
The 𝑢𝑖 and 𝑣𝑖 are the so-called left and right singular vectors, while the scalars 𝜆𝑖 are called singular values.
Linear transformations are essentially manipulations of data, revealing other (hopefully more useful) representations.
Intuitively, we think about them as one-to-one mappings, faithfully preserving all the “information” from the input.
This is often not the case, to such an extent that sometimes a lossy compression of the data is highly beneficial. To give
you a concrete example, consider a dataset with a million features, out of which only a couple hundred are useful. What
we can do is to identify the important features and throw away the rest, obtaining a representation that is more compact,
thus easier to work with.
This notion is formalized by the concept of orthogonal projections. We have already met them upon our first encounter
with the inner products (see (5.7)). Projections also played a fundamental role in the Gram-Schmidt process, used to or-
thogonalize an arbitrary basis. Because we are already somewhat familiar with orthogonal projections, a formal definition
is in due.
Let’s revisit the examples we have seen so far to get a grip on the definition!
Example 1. The simplest one is the orthogonal projection to a single vector. That is, if 𝑢 ∈ ℝ𝑛 is an arbitrary vector, the
transformation
⟨𝑥, 𝑢⟩
proj𝑢 (𝑥) = 𝑢
⟨𝑢, 𝑢⟩
is the orthogonal projection to (the subspace spanned by) 𝑢. (We have talked about this when discussing the geometric in-
terpretation of inner products, where this definition was deduced from a geometric intuition.) Applying this transformation
repeatedly, we get
⟨𝑥,𝑢⟩
⟨ ⟨𝑢,𝑢⟩ 𝑢, 𝑢⟩
proj𝑢 (proj𝑢 (𝑥)) = 𝑢
⟨𝑢, 𝑢⟩
⟨𝑥,𝑢⟩
⟨𝑢,𝑢⟩ ⟨𝑢, 𝑢⟩
= 𝑢
⟨𝑢, 𝑢⟩
⟨𝑥, 𝑢⟩
= 𝑢
⟨𝑢, 𝑢⟩
= proj𝑢 (𝑥).
Thus, faithfully to its name, proj𝑢 is indeed a projection. To see that it is indeed orthogonal, let’s examine its kernel and
image! Since the value of proj𝑢 (𝑥) is a scalar multiple of 𝑢, its image is
im(proj𝑢 ) = span(𝑢).
⟨𝑥,𝑢⟩
Its kernel, the set of vectors mapped to zero by proj𝑢 , is also easy to find, as ⟨𝑢,𝑢⟩ 𝑢 = 0 can only happen if ⟨𝑥, 𝑢⟩ = 0,
that is, if 𝑥 ⟂ 𝑢. In other words,
ker(proj𝑢 ) = span(𝑢)⟂ ,
where span(𝑢)⟂ denotes the orthogonal complement of span(𝑢). This means that proj𝑢 is indeed an orthogonal projection.
We can also describe proj𝑢 (𝑥) in terms of matrices. By writing out proj𝑢 (𝑥) component-wise, we have
⟨𝑥, 𝑢⟩𝑢1
⟨𝑥, 𝑢⟩ 1 ⎡ ⟨𝑥, 𝑢⟩𝑢2 ⎤
proj𝑢 (𝑥) = 𝑢= ⎢ ⎥,
⟨𝑢, 𝑢⟩ ‖𝑢‖2 ⎢ ⋮ ⎥
⎣⟨𝑥, 𝑢⟩𝑢𝑛 ⎦
where 𝑢 = (𝑢1 , … , 𝑢𝑛 ). This looks like some kind of matrix multiplication! As we have seen earlier, multiplying a
matrix and a vector can be described in terms of rowwise dot products. (See (7.3).)
So, according to this interpretation of matrix multiplication, we have
⟨𝑥, 𝑢⟩𝑢1
1 ⎡ ⟨𝑥, 𝑢⟩𝑢2 ⎤ 𝑢𝑢𝑇
proj𝑢 (𝑥) = ⎢ ⎥= 𝑥. (14.5)
‖𝑢‖2 ⎢ ⋮ ⎥ ‖𝑢‖2
⎣⟨𝑥, 𝑢⟩𝑢𝑛 ⎦
Note that the scaling with ‖𝑢‖2 can be incorporated into the “matrix” product by writing
𝑢𝑢𝑇 𝑢 𝑢𝑇
= ⋅ ,
‖𝑢‖2 ‖𝑢‖ ‖𝑢‖
The matrix 𝑢 ∈ ℝ𝑛×𝑛 , obtained from the product of the vector 𝑢 ∈ ℝ𝑛(×1) and its transpose 𝑢𝑇 ∈ ℝ1×𝑛 , is a rather
special one. They are called rank-1 projection matrices, and they frequently appear in mathematics.
Example 2. As we have seen when introducing the Gram-Schmidt orthogonalization process, the previous example can
be generalized by projecting to multiple vectors.
If 𝑢1 , … , 𝑢𝑘 ∈ ℝ𝑛 is a set of linearly independent and pairwise orthogonal vectors, then the linear transformation
𝑘
⟨𝑥, 𝑢𝑖 ⟩
proj𝑢 (𝑥) = ∑ 𝑢
1 ,…,𝑢𝑘
𝑖=1
⟨𝑢𝑖 , 𝑢𝑖 ⟩ 𝑖
is an orthogonal projection onto the subspace span(𝑢1 , … , 𝑢𝑘 ). This is easy to see, and I recommend the reader to do this
as an exercise. (This can be found in the problems section as well.)
From (14.5), we can determine the matrix form of proj𝑢 as well:
1 ,…,𝑢𝑘
𝑘
𝑢𝑖 𝑢𝑇𝑖
proj𝑢 (𝑥) = ( ∑ ) 𝑥.
1 ,…,𝑢𝑘 ‖𝑢𝑖 ‖2
⏟⏟⏟⏟⏟
𝑖=1
∈ℝ𝑛×𝑛
This is good to know, as projection matrices are often needed in the implementation of certain algorithms.
Now that we have seen a few examples, it is time to discuss orthogonal projections in more general terms. There are lots
of reasons why these special transformations are useful, and we’ll explore them in this section. First, let’s start with the
most important one: an orthogonal projection also enables to decompose vectors in terms of a given subspace plus an
orthogonal vector.
Theorem 13.5.1
Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be a projection. Then 𝑉 = ker 𝑃 + im𝑃 , that is, every vector 𝑥 ∈ 𝑉
can be written as
𝑥 = (𝑥 − 𝑃 𝑥) + 𝑃 𝑥.
𝑃 (𝑥 − 𝑃 𝑥) = 𝑃 𝑥 − 𝑃 (𝑃 𝑥)
= 𝑃𝑥 − 𝑃𝑥
= 0,
that is, 𝑥 − 𝑃 𝑥 ∈ ker 𝑃 . By definition, 𝑃 𝑥 ∈ im𝑃 , so 𝑉 = ker 𝑉 + im𝑉 , which proves our main proposition.
If 𝑃 is an orthogonal projection, then again by definition, 𝑥im ⟂ 𝑥ker , which is what we had to show. □
In addition, orthogonal projections are self-adjoint. This might not sound like a big deal, but self-adjointness leads to
several very pleasant properties.
Theorem 13.5.2
Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be an orthogonal projection. Then 𝑃 is self-adjoint.
⟨𝑃 𝑥, 𝑦⟩ = ⟨𝑥, 𝑃 𝑦⟩
holds for any 𝑥, 𝑦 ∈ 𝑉 . In the previous result, we have seen that 𝑥 and 𝑦 can be written as
and
Since 𝑃 2 = 𝑃 , we have
⟨𝑃 𝑥, 𝑦⟩ = ⟨𝑃 𝑥ker 𝑃 + 𝑃 𝑥im𝑃 , 𝑦ker + 𝑦im𝑃 ⟩
= ⟨𝑥im𝑃 , 𝑦ker 𝑃 + 𝑦im𝑃 ⟩
= ⟨𝑥 im𝑃 , 𝑦ker 𝑃 ⟩ +⟨𝑥im𝑃 , 𝑦im𝑃 ⟩
⏟⏟⏟⏟⏟
=0
= ⟨𝑥im𝑃 , 𝑦im𝑃 ⟩.
Similarly, it can be shown that ⟨𝑥, 𝑃 𝑦⟩ = ⟨𝑥im𝑃 , 𝑦im𝑃 ⟩. These two identities imply ⟨𝑃 𝑥, 𝑦⟩ = ⟨𝑥, 𝑃 𝑦⟩, which is what
we had to show. □
One straightforward consequence of self-adjointness is that the kernel of orthogonal projections is the orthogonal com-
plement of its image.
Theorem 13.5.3
Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be an orthogonal projection. Then
ker 𝑃 = (im𝑃 )⟂ .
Proof. To prove the equality of these two sets, we need to show that (a) ker 𝑃 ⊆ (im𝑃 )⟂ , and (b) (im𝑃 )⟂ ⊆ ker 𝑃 .
(a) Let 𝑥 ∈ ker 𝑃 , that is, suppose that 𝑃 𝑥 = 0. We need to show that for any 𝑦 ∈ im𝑃 , we have ⟨𝑥, 𝑦⟩ = 0. For this,
let 𝑦0 ∈ 𝑉 such that 𝑃 𝑦0 = 𝑦. (This is guaranteed to exist, since we took 𝑦 from the image of 𝑃 .) Then
⟨𝑥, 𝑦⟩ = ⟨𝑥, 𝑃 𝑦0 ⟩
= ⟨𝑃 𝑥, 𝑦0 ⟩
= ⟨0, 𝑦0 ⟩
= 0,
where we used that 𝑃 is self-adjoint. Thus, 𝑥 ∈ (im𝑃 )⟂ also holds, implying ker 𝑃 ⊆ (im𝑃 )⟂ .
(b) Now let 𝑥 ∈ (im𝑃 )⟂ . Then for any 𝑦 ∈ 𝑉 , we have ⟨𝑥, 𝑃 𝑦⟩ = 0. However,
⟨𝑃 𝑥, 𝑦⟩ = ⟨𝑥, 𝑃 𝑦⟩ = 0.
Specially, with the choice 𝑦 = 𝑃 𝑦, we have ⟨𝑃 𝑥, 𝑃 𝑥⟩ = 0. Due to the positive definiteness of the inner product, this
implies that 𝑃 𝑥 = 0, that is, 𝑥 ∈ ker 𝑃 . □
Summing up all the above, if 𝑃 is an orthogonal projection of the inner product space 𝑉 , then
𝑉 = im𝑃 + (im𝑃 )⟂ .
Do you recall that when we first encountered the concept of orthogonal complements, how we proved that 𝑉 = 𝑆 + 𝑆 ⟂
for any finite dimensional inner product space 𝑉 and its subspace 𝑆? With the use of a special orthogonal projection. We
are getting close to see the general pattern here.
Because of the kernel of an orthogonal projection 𝑃 is an orthogonal complement of the image, the transformation 𝐼 − 𝑃
is an orthogonal projection as well, with the roles of image and kernel reversed.
Theorem 13.5.4
Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be an orthogonal projection. Then 𝐼 − 𝑃 is an orthogonal projection
as well, and
The proof is so simple that this is left as an exercise for the reader.
One more thing to mention. If the image spaces of two orthogonal projections match, then the projections themselves
are equal. This is a very strong uniqueness property, as if you think about it, this is not true for other classes of linear
transformations.
Proof. Because of ker 𝑃 = (im𝑃 )⟂ , the equality of the image spaces also imply that ker 𝑃 = ker 𝑄.
Since 𝑉 = ker 𝑃 + im𝑃 , every 𝑥 ∈ 𝑉 can be decomposed as
This decomposition and the equality of kernel and image spaces give that
With an identical argument, we have 𝑄𝑥 = 𝑥im𝑃 , thus 𝑃 𝑥 = 𝑄𝑥 on all vectors 𝑥 ∈ 𝑉 . This proves 𝑃 = 𝑄. □
In other words, given a subspace, there can be only one orthogonal projection to it. But is there any at all? Yes, and in the
next section, we will see that it can be desrcribed in geometric terms.
Orthogonal projections have an extremely pleasant and mathematically useful property. In some sense, if 𝑃 ∶ 𝑉 → 𝑉 is
an orthogonal projection, 𝑃 𝑥 provides the optimal approximation of 𝑥 among all vectors in im𝑃 . To make this precise,
we can state the following.
𝑃 ∶ 𝑥 → arg min ‖𝑥 − 𝑦‖
𝑦∈𝑆
is an orthogonal projection to 𝑆.
In other words, since orthogonal projections to a given subspace are unique (as implied by Theorem 13.5.5), 𝑃 𝑥 is the
closest vector to 𝑥 in the subspace 𝑆. Thus, we can denote this as 𝑃𝑆 , emphasizing the uniqueness.
Besides having an explicit way to describe orthogonal projections, there is one extra benefit. Recall that previously, we
have shown that
𝑉 = im𝑃 + (im𝑃 )⟂
holds. Since for any subspace 𝑆 an orthogonal projection 𝑃𝑆 exists whose image set is 𝑆, it also follows that 𝑉 = 𝑆 + 𝑆 ⟂ .
Although we have seen this earlier when talking about orthogonal complements, it is interesting to see a proof that doesn’t
require the construction of an orthonormal basis in 𝑆.
Interestingly, this is a point where mathematical analysis and linear algebra intersects. We don’t have the tools for it
yet, but using the concept of convergence, the above theorems can be generalized to infinite dimensional spaces. Infinite
dimensional spaces are not particularly relevant for machine learning in practice, yet they provide a beautiful mathematical
framework for the study of functions. Who knows, one day these advanced tools will provide a significant breakthrough
in machine learning.
14.6 Problems
Problem 1. Let 𝑢1 , … , 𝑢𝑘 ∈ ℝ𝑛 be a set of linearly independent and pairwise orthogonal vectors. Show that the linear
transformation
𝑘
⟨𝑥, 𝑢𝑖 ⟩
proj𝑢 (𝑥) = ∑ 𝑢
1 ,…,𝑢𝑘
𝑖=1
⟨𝑢𝑖 , 𝑢𝑖 ⟩ 𝑖
is an orthogonal projection.
Problem 2. Let 𝑢1 , … , 𝑢𝑘 ∈ ℝ𝑛 be a set of linearly independent vectors, and define the linear transformation
𝑘
⟨𝑥, 𝑢𝑖 ⟩
fakeproj𝑢 (𝑥) = ∑ 𝑢.
1 ,…,𝑢𝑘
𝑖=1
⟨𝑢𝑖 , 𝑢𝑖 ⟩ 𝑖
Is this a projection? (Hint: study the special case 𝑘 = 2 and ℝ3 . You can visualize this if needed.)
Problem 3. Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be an orthogonal projection. Show that 𝐼 − 𝑃 is an
orthogonal projection as well, and
holds.
FIFTEEN
COMPUTING EIGENVALUES
In the last chapter, we have reached the singular value decomposition, one of the pinnacle results of linear algebra. We
laid all of our theoretical groundwork to get us to this point.
However, one thing is missing: computing the SVD in practice. Without this, we can’t reap all the rewards this powerful
tool offers. In this chapter, we’ll develop two methods for this purpose. One offers a deep insight into the behavior of
eigenvectors, but it doesn’t work in practice. The other offers excellent performance, but it is hard to understand what is
happening behind the formulas. Let’s start with the first one, illuminating how the eigenvectors determine the effects of
a linear transformation!
If you recall, we discovered the SVD by tracing the problem back to the spectral decomposition of symmetric matrices. In
turn, we can obtain the spectral decomposition by finding an orthonormal basis from the eigenvectors of our matrix. The
plan is the following: first, we define a procedure that finds an orthonormal set of eigenvectors for symmetric matrices.
Then, use this to compute the SVD for arbitrary matrices.
A naive way would be find the eigenvalues by solving the polynomial equation det(𝐴 − 𝜆𝐼) = 0 for 𝜆, then compute the
corresponding eigenvectors by solving the linear equations (𝐴 − 𝜆𝐼)𝑥 = 0.
However, there are problems with this approach. For an 𝑛 × 𝑛 matrix, the characteristic polynomial 𝑝(𝜆) = det(𝐴 − 𝜆𝐼)
is a polynomial of degree 𝑛. Even if we could effectively evaluate det(𝐴 − 𝜆𝐼) for any lambda, there are serious issues.
Unfortunately, unlike for the quadratic equation 𝑎𝑥2 + 𝑏𝑥 + 𝑐 = 0, there are no formulas for finding the solutions when
𝑛 > 4. (It is not that mathematicians were just not clever enough to find them. No such formula exists.)
How can we find an alternative approach? Once again, we use the wishful thinking approach that worked so well before.
Let’s pretend that we know the eigenvalues, play around with them, and see if this gives us some useful insight.
For the sake of simplicity, assume that 𝐴 is a small symmetric 2×2 matrix, say with eigenvalues 𝜆1 = 4 and 𝜆2 = 2. Since
𝐴 is symmetric, we can even find a set of corresponding eigenvectors 𝑢1 , 𝑢2 such that 𝑢1 and 𝑢2 form an orthonormal basis.
(That is, both have an unit norm and they are orthogonal to each other.) This is guaranteed by the spectral decomposition
theorem.
Thus, any 𝑥 ∈ ℝ2 can be written as 𝑥 = 𝑥1 𝑢1 + 𝑥2 𝑢2 for some nonzero scalars 𝑥1 , 𝑥2 . What happens if we apply the
transformation 𝐴 to our vector 𝑥? Because 𝑢𝑖 are eigenvectors, we have
𝐴𝑥 = 𝐴(𝑥1 𝑢1 + 𝑥2 𝑢2 )
= 𝑥1 𝐴𝑢1 + 𝑥2 𝐴𝑢2
= 𝑥1 𝜆1 𝑢1 + 𝑥2 𝜆2 𝑢2
= 4𝑥1 𝑢1 + 2𝑥2 𝑢2 .
179
Mathematics of Machine Learning
Equation (15.1) is great news for us! All we have to do is repeatedly apply the transformation 𝐴 to identify the eigenvector
for the dominant eigenvalue 𝜆1 . There is one small caveat though: we have to know the value of 𝜆1 . We’ll deal with this
later, but first, let’s record this milestone in the form of a theorem.
Theorem 14.1.1 (Finding the eigenvector for the dominant eigenvalue with power iteration. )
Let 𝐴 ∈ ℝ𝑛×𝑛 be a real symmetric matrix. Suppose that
(a) the eigenvalues of 𝐴 are 𝜆1 > ⋯ > 𝜆𝑛 (that is, 𝜆1 is the dominant eigenvalue),
(b) and the corresponding eigenvectors 𝑢1 , … , 𝑢𝑛 form an orthonormal basis.
𝑛
Let 𝑥 ∈ ℝ𝑛 be a vector such that when written as the linear combination 𝑥 = ∑𝑖=1 𝑥𝑖 𝑢𝑖 , the coefficient 𝑥1 ∈ ℝ is
nonzero. Then
𝐴𝑘 𝑥
lim = 𝑥1 𝑢1 . (15.2)
𝑘→∞ 𝜆𝑘
1
Before we jump into the proof, some explanations are in order. Recall that if 𝐴 is symmetric, the spectral decomposition
theorem guarantees that it can be diagonalized with a similarity transformation. In its proof (sketch), we mentioned that
a symmetric matrix has
• real eigenvalues,
• and an orthonormal basis from its eigenvectors.
Thus, the assumptions (a) and (b) are guaranteed, except for one caveat: the eigenvalues are not necessarily distinct.
However, this rarely causes problems in practice. There are multiple reasons for this, but most importantly, matrices with
repeated eigenvalues are so rare that they form a zero-probability set. Thus, stumbling upon one is highly unlikely.
Thus,
𝑛 𝑘
𝐴𝑘 𝑥 𝜆
𝑘
= 𝑥 𝑢
1 1 + ∑ 𝑥𝑖 ( 𝑖 ) 𝑢𝑖 .
𝜆1 𝑖=2
𝜆1
Since 𝜆1 is the dominant eigenvalue, |𝜆𝑖 /𝜆1 | < 1 for 𝑖 = 2, … , 𝑛, so (𝜆𝑖 /𝜆1 )𝑘 → 0 as 𝑘 → ∞. Hence,
𝐴𝑘 𝑥
lim = 𝑥1 𝑢1 .
𝑘→∞ 𝜆𝑘
1
Now, let’s fix the small issue that requires us to know 𝜆1 . Since 𝜆1 is the largest eigenvalue, the previous theorem shows
that 𝐴𝑘 𝑥 equals to 𝑥1 𝜆𝑘1 𝑢1 plus some term that is much smaller, at least compared to this dominant term. We can extract
this quantity by taking the supremum norm ‖𝐴𝑘 𝑥‖∞ . (Recall that for any 𝑦 = (𝑦1 , … , 𝑦𝑘 ), the supremum norm is defined
by ‖𝑦‖∞ = max{|𝑦1 |, … , |𝑦𝑛 |}. Keep in mind that the 𝑦𝑖 -s are the coefficients of 𝑦 in the original basis of our vector
space, which is not necessarily our eigenvector basis 𝑢1 , … , 𝑢𝑛 .)
By factoring out |𝜆1 |𝑘 from 𝐴𝑘 𝑥, we have
𝑛 𝑘
𝜆𝑖
‖𝐴𝑘 𝑥‖∞ = |𝜆1 |𝑘 ∥𝑥1 𝑢1 + ∑ 𝑥𝑖 ( ) 𝑢𝑖 ∥ .
𝑖=2
𝜆1 ∞
𝑘
𝑛
Intuitively speaking, the remainder term ∑𝑖=2 𝑥𝑖 ( 𝜆𝜆𝑖 ) 𝑢𝑖 is small, thus we can approximate the norm as
1
In other words, instead of scaling with 𝜆𝑘1 , we can scale with ‖𝐴𝑘 𝑥‖∞ .
So, we are ready to describe our general eigenvector-finding procedure fully. First, we initialize a vector 𝑥0 randomly,
then we define the recursive sequence
𝐴𝑥𝑘−1
𝑥𝑘 = , 𝑘 = 1, 2, …
‖𝐴𝑥𝑘−1 ‖∞
Using the linearity of 𝐴, we can see that, in fact,
𝐴𝑘 𝑥 0
𝑥𝑘 = ,
‖𝐴𝑘 𝑥0 ‖∞
but scaling has an additional side benefit, as we don’t have to use large numbers at any computational step. With this,
(15.2) implies that
𝐴𝑘 𝑥 0
lim 𝑥𝑘 = = 𝑢1 .
𝑘→∞ ‖𝐴𝑘 𝑥0 ‖∞
That is, we can extract the eigenvector for the dominant eigenvalue without actually knowing the eigenvalue itself.
15.1. Power iteration for calculating the eigenvectors of real symmetric matrices 181
Mathematics of Machine Learning
Let’s put the power iteration method into practice! The input of our power_iteration function is a square matrix
A, and we expect the output to be an eigenvector corresponding to the dominant eigenvalue.
Since this is an iterative process, we should define a condition that defines when the process should terminate. If the
consecutive members of the sequence {𝑥𝑘 }∞ 𝑘=1 are sufficiently close together, we arrived at a solution. That is, if, say
we can stop and return the current value. However, this might never happen. For those cases, we define a cutoff point,
say 𝑘 = 100000, when we terminate the computation even if there is no convergence.
To give us a bit more control, we can also manually define the initialization vector x_init.
import numpy as np
def power_iteration(
A: np.ndarray,
n_max_steps: int = 100000,
convergence_threshold: float = 1e-10,
x_init: np.ndarray = None,
normalize: bool = False
):
n, m = A.shape
return x
To test the method, we should use an input for which the correct output is easy to calculate by hand. Our usual recurring
example
2 1
𝐴=[ ].
1 2
should be perfect, as we already know a lot about it. Previously, we have seen that its eigenvalues are 𝜆1 = 3 and 𝜆2 = 1,
with corresponding eigenvectors 𝑢1 = (1, 1) and 𝑢2 = (−1, 1).
Let’s see if our function correctly recovers (a scalar multiple of) 𝑢1 = (1, 1)!
u_1
array([[-0.70710678],
[-0.70710678]])
Success! To recover the eigenvalue, we can simply apply the linear transformation and compute the proportions.
A @ u_1 / u_1
array([[3.],
[3.]])
Can we modify the power iteration algorithm to recover the other eigenvalues as well? In theory, yes. In practice, no. Let
me elaborate!
To get a grip on how to generalize the idea, let’s take another look at the equation (15.3), saying that
𝑛
𝐴𝑘 𝑥 = ∑ 𝑥𝑖 𝜆𝑘𝑖 𝑢𝑖 .
𝑖=1
𝐴𝑘 𝑥
One of the conditions for 𝜆𝑘
to converge was that 𝑥 should have a nonzero component of the eigenvector 𝑢1 , that is,
1
𝑥1 ≠ 0.
What if 𝑥1 = 0? In that case, we have
𝐴𝑘 𝑥 = 𝑥2 𝜆𝑘2 𝑢2 + ⋯ + 𝑥𝑛 𝜆𝑘2 𝑢𝑛 ,
implying that
𝐴𝑘 𝑥
lim = 𝑥2 𝑢2 .
𝑘→∞ 𝜆𝑘
2
15.1. Power iteration for calculating the eigenvectors of real symmetric matrices 183
Mathematics of Machine Learning
holds.
The proof goes just like what we have seen a few times already. The question is, how can we eliminate the 𝑢1 , … , 𝑢𝑙−1
components from any vector? The answer is simple: orthogonal projections.
For the sake of simplicity, let’s take a look at extracting the second dominant eigenvector with power iteration. Recall
that the transformation
𝑛
describes the orthogonal projection of any 𝑥 to 𝑢1 . In concrete terms, if 𝑥 = ∑𝑖=1 𝑥𝑖 𝑢𝑖 , then
𝑛
proj𝑢 (𝑥) = proj𝑢 ( ∑ 𝑥𝑖 𝑢𝑖 )
1 1
𝑖=1
𝑛
= ∑ 𝑥𝑖 proj𝑢 (𝑢𝑖 )
1
𝑖=1
= 𝑥1 𝑢1 .
This is the exact opposite of what we are looking for! However, at this point, we can see that 𝐼 − proj𝑢 is going to be
1
suitable for our purposes. This is still an orthogonal projection. Moreover, we have
𝑛 𝑛
(𝐼 − proj𝑢 )( ∑ 𝑥𝑖 𝑢𝑖 ) = ∑ 𝑥𝑖 𝑢𝑖 ,
1
𝑖=1 𝑖=2
that is, 𝐼 −proj𝑢 eliminates the 𝑢1 component of 𝑥. Thus, if we initialize the power iteration with 𝑥∗ = (𝐼 −proj𝑢 )(𝑥),
1 1
𝐴𝑘 𝑥∗
the sequence ‖𝐴𝑘 𝑥∗ ‖∞ will converge to 𝑢2 , the second dominant eigenvector.
How to compute (𝐼 − proj𝑢 )(𝑥) in practice? Recall that in the standard orthonormal basis, the matrix of proj𝑢 can
1 1
be written as 𝑢1 𝑢𝑇1 . (Keep in mind that the 𝑢𝑖 vectors form an orthonormal basis, so ‖𝑢1 ‖ = 1.) Thus, the matrix of
𝐼 − proj𝑢 is 𝐼 − 𝑢1 𝑢𝑇1 , which we can easily compute.
1
which we use as the initial vector of the second round of power iteration, yielding the second dominant eigenvector 𝑢2 .
3. Project 𝑥(2) to the orthogonal complement of the subspace spanned by 𝑢1 and 𝑢2 , thus obtaining
𝑥(3) = (𝐼 − proj𝑢 )(𝑥(2) )
2
= (𝐼 − proj𝑢 )(𝑥(1) ),
1 ,𝑢2
which we use as the initial vector of the third round of power iteration, yielding the third dominant eigenvector 𝑢3 .
4. Project 𝑥(3) to…
You get the pattern. To implement this in practice, we add the find_eigenvectors function.
for _ in range(n):
ev = power_iteration(A, x_init=x_init)
proj = get_orthogonal_complement_projection(ev)
x_init = proj @ x_init
eigenvectors.append(ev)
return eigenvectors
find_eigenvectors(A, x_init)
[array([[0.65505892],
[0.65505892]]),
array([[ 0.12565508],
[-0.12565508]])]
The result is as we expected. (Don’t be surprised that the eigenvectors are not normalized, as we haven’t explicitly done
so in the find_eigenvectors function.)
We are ready to actually diagonalize symmetric matrices. Recall that the diagonalizing orthogonal matrix 𝑈 can be
obtained by vertically stacking the eigenvectors one by one.
return U, U @ A @ U.T
15.1. Power iteration for calculating the eigenvectors of real symmetric matrices 185
Mathematics of Machine Learning
diagonalize_symmetric_matrix(A, x_init)
Awesome!
What is the problem then? What you haven’t seen is that I had to run find_eigenvectors 16 times to finally find
an initial vector that yields the expected result. This is because power iteration is numerically extremely unstable. Notice
that we get a completely different result by perturbing our initial vector by 0.0001.
[array([[0.65511639],
[0.65511639]]),
array([[0.28417351],
[0.28417351]])]
where 𝜆𝑖 and 𝑢𝑖 are eigenvalue-eigenvector pairs of the symmetric matrix 𝐴, reflecting how eigenvectors and eigenvalues
determine the behavior of the transformation.
If the power iteration is not usable in practice, how can we compute the eigenvalues? We will see this in the next section.
The algorithm used in practice to compute the eigenvalues is the so-called QR algorithm, proposed independently by
John G. R. Francis and the soviet mathematician Vera Kublanovskaya. This is where all of the lessons we have learned in
linear algebra converge. Describing the QR algorithm is very simple, as it is the iteration of a matrix decomposition and
a multiplication step.
However, understanding why it works is a different question. Behind the scenes, the QR algorithm combines many tools
we have learned earlier. To start, let’s revisit the good old Gram-Schmidt orthogonalization process.
If you recall, we have encountered the Gram-Schmidt orthogonalization process when introducing the concept of orthog-
onal bases.
In essence, this algorithm takes an arbitrary basis 𝑣1 , … , 𝑣𝑛 and turns it into an orthonormal one 𝑒1 , … , 𝑒𝑛 such that
𝑒1 , … , 𝑒𝑘 spans the same subspace as 𝑣1 , … , 𝑣𝑘 . Since we last met this, we have gained a lot of perspective about linear
algebra, so we are ready to see the bigger picture.
𝑒1 = 𝑣 1 ,
𝑒𝑘 = 𝑣𝑘 − proj𝑒 (𝑣𝑘 ),
1 ,…,𝑒𝑘−1
𝑒1 = 𝑣 1
⟨𝑒1 , 𝑣2 ⟩
𝑒2 = 𝑣 2 − 𝑒
⟨𝑒1 , 𝑒1 ⟩ 1
⋮
⟨𝑒1 , 𝑣𝑛 ⟩ ⟨𝑒𝑛−1 , 𝑣𝑛 ⟩
𝑒𝑛 = 𝑣 𝑛 − 𝑒 −⋯− 𝑒 .
⟨𝑒1 , 𝑒1 ⟩ 1 ⟨𝑒𝑛−1 , 𝑒𝑛−1 ⟩ 𝑛−1
A pattern is starting to emerge. By arranging the 𝑒1 , … , 𝑒𝑛 terms into one side, we obtain
𝑣1 = 𝑒 1
⟨𝑒1 , 𝑣2 ⟩
𝑣2 = 𝑒 + 𝑒2
⟨𝑒1 , 𝑒1 ⟩ 1
⋮
⟨𝑒1 , 𝑣𝑛 ⟩ ⟨𝑒𝑛−1 , 𝑣𝑛 ⟩
𝑣𝑛 = 𝑒1 + ⋯ + 𝑒 + 𝑒𝑛 .
⟨𝑒1 , 𝑒1 ⟩ ⟨𝑒𝑛−1 , 𝑒𝑛−1 ⟩ 𝑛−1
This is starting to resemble some kind of matrix multiplication! Recall that matrix multiplication can be viewed as taking
the linear combination of columns. (Check (7.2) if you are uncertain about this.)
By horizontally concatenating the column vectors 𝑣𝑘 to form the matrix 𝐴 and similarly defining the vector 𝑄 from the
𝑒𝑘 -s, we obtain that
𝐴 = 𝑄∗ 𝑅 ∗
for some upper triangular 𝑅, defined by the coefficients of 𝑒𝑘 in 𝑣𝑘 according to the Gram-Schmidt orthogonalization.
To be more precise, define
𝐴=⎡
⎢𝑣1 … 𝑣𝑛 ⎤
⎥, 𝑄∗ = ⎡
⎢𝑒1 … 𝑒𝑛 ⎤
⎥,
⎣ ⎦ ⎣ ⎦
and
⟨𝑒1 ,𝑣2 ⟩ ⟨𝑒1 ,𝑣𝑛 ⟩
1 ⟨𝑒1 ,𝑒1 ⟩ … ⟨𝑒1 ,𝑒1 ⟩
⎡ ⟨𝑒2 ,𝑣𝑛 ⟩ ⎤
𝑅∗ = ⎢0 1 … ⟨𝑒2 ,𝑒2 ⟩ ⎥ .
⎢ ⎥
⎢⋮ ⋮ ⋱ ⋮ ⎥
⎣0 0 ⋮ 1 ⎦
The result 𝐴 = 𝑄∗ 𝑅∗ is almost what we call the QR factorization. The columns of 𝑄∗ are orthogonal (but not or-
thonormal), while 𝑅∗ is upper triangular. We can easily orthonormalize 𝑄∗ by factoring out the norms columnwise, thus
obtaining
⟨𝑒1 ,𝑣2 ⟩ ⟨𝑒1 ,𝑣𝑛 ⟩
‖𝑒1 ‖ √⟨𝑒1 ,𝑒1 ⟩
… √⟨𝑒1 ,𝑒1 ⟩ ⎤
⎡ ⟨𝑒2 ,𝑣𝑛 ⟩
𝑄=⎡ 𝑅=⎢ 0 ‖𝑒2 ‖ … ⎥
𝑒1 𝑒𝑛 ⎤
⎢ ‖𝑒1 ‖ … ‖𝑒𝑛 ‖ ⎥ , ⎢ √⟨𝑒2 ,𝑒2 ⟩ ⎥ .
⎣ ⎦ ⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣ 0 0 ⋮ ‖𝑒𝑛 ‖ ⎦
It is easy to see that 𝐴 = 𝑄𝑅 still holds. This result is called the QR decomposition, and we have just proved the following
theorem.
𝐴 = 𝑄𝑅
holds.
As we are about to see, the QR decomposition is an extremely useful and versatile tool. (Like all other matrix decompo-
sitions are.) Before we move forward to discuss how it can be used to compute the eigenvalues in practice, let’s put what
we have seen so far into code!
The QR decomposition algorithm is essentially Gram-Schmidt orthogonalization, where we explicitly memorize some
coefficients and form a matrix from them. (Recall our earlier implementation if you feel overwhelmed.)
for e in to:
coeff = projection_coeff(x, e)
coeffs.append(coeff)
p_x += coeff*e
if return_coeffs:
return p_x, coeffs
else:
return p_x
Now we can put these together to obtain the QR factorization of an arbitrary square matrix. (Surprisingly, this works for
non-square matrices as well, but we won’t be concerned with this.)
Q_columns.append(A_columns[0])
R_columns.append([1] + (m-1)*[0])
for i, a in enumerate(A_columns[1:]):
p, coeffs = projection(a, Q_columns, return_coeffs=True)
next_q = a - p
next_r = coeffs + [1] + max(0, m - i - 2)*[0]
Q_columns.append(next_q)
R_columns.append(next_r)
A = np.random.rand(3, 3)
Q, R = QR(A)
There are three things to check: (a) that 𝐴 = 𝑄𝑅, (b) 𝑄 is an orthogonal matrix, (c) and that 𝑅 is upper triangular.
np.allclose(A, Q @ R)
True
np.allclose(Q.T @ Q, np.eye(3))
True
np.allclose(R, np.triu(R))
True
Success! There is only one more question left. How does this help us in calculating the eigenvalues? Let’s see that now.
Surprisingly, we can discover the eigenvalues of a matrix 𝐴 by a simple iterative process. First, we find the QR decom-
position
𝐴 = 𝑄 1 𝑅1 ,
𝐴1 = 𝑅1 𝑄1 ,
that is, we simply reverse the order of 𝑄 and 𝑅. Then, we start it all over and find the QR decomposition of 𝐴1 , and so
on, defining the sequence
𝐴𝑘−1 = 𝑄𝑘 𝑅𝑘 (QR decomposition)
(15.4)
𝐴𝑘 = 𝑅𝑘 𝑄𝑘 (definition).
In the long run, the diagonal elements of 𝐴𝑘 will get closer and closer to the eigenvalues of 𝐴. This is called the QR
algorithm, which is so simple that I didn’t believe it when I first saw it.
With all of our tools, we can implement the QR algorithm in a few lines.
array([[3.00000000e+00, 2.39107046e-16],
[0.00000000e+00, 1.00000000e+00]])
We are almost at the state-of-the-art. Unfortunately, the vanilla QR algorithm has some issues, as it can fail to converge.
A simple example is given by the matrix
0 1
𝐴=[ ].
1 0
array([[0., 1.],
[1., 0.]])
where 𝛼𝑘 is some scalar. There are multiple approaches to defining the shifts themselves (Rayleigh quotient shift, Wilkin-
son shift, etc.), but the details lie much deeper than our study.
Functions
191
CHAPTER
SIXTEEN
FUNCTIONS IN THEORY
Mathematicians are like Frenchmen: whatever you say to them they translate into their own language and
forthwith it is something entirely different. — Johann Wolfgang von Goethe
Everyone has an intuitive understanding of what functions are. At one point or another, all of us have encountered
this concept. For most of us, a function is a curve drawn with a continuous line onto a representation of the Cartesian
coordinate system.
However, in mathematics, intuitions can often lead us to false conclusions. Often, there is a difference between what
something is and how do you think about it, what your mental models are. To give an example from a real-life scenario in
machine learning, consider the following piece of code.
import numpy as np
193
Mathematics of Machine Learning
Returns:
loss: numpy.float.
Cross entropy loss of the predictions.
"""
exp_x = np.exp(X)
probs = exp_x / np.sum(exp_x, axis=1, keepdims=True)
log_probs = - np.log([probs[i, y[i]] for i in range(len(probs))])
loss = np.mean(log_probs)
return loss
Suppose that you wrote this function, and it is in your codebase somewhere. Depending on our needs, we might think of
it as cross-entropy loss, but in reality, this is a 579 character long string in the Python language, eventually processed by
an interpreter. However, when working with it, we often use a mental model that compacts this information into easily
usable chunks. Like the three words cross-entropy loss. When we reason about high-level processes like training a neural
network, abstractions such as this allow us to move further and step bigger.
But sometimes, things don’t go our way. When this function throws an error and crashes the computations, cross-entropy
loss will not cut it. Then, it is time to unravel the definition and put everything under a magnifying glass. What could have
hindered your thinking before is now essential.
These principles are also true for theory, not just for practice. Mathematics is a balancing act between logical precision
and a clear understanding, two often contradicting objectives.
Let’s go back to our starting point: functions in a mathematical sense. One possible mental model, as mentioned, is a
curve drawn with a continuous line. It allows us to reason about functions visually and intuitively answer some questions.
However, this particular mental model can go very wrong. To give an example, is this a function below?
Even though this curve is drawn with a continuous line, this is not a function. To avoid confusion later, we have to build
the foundations of our discussion if we were to talk about mathematical objects. In this chapter, our goal is to establish a
basic dictionary to properly understand the objects we are working with in machine learning.
Let’s dive straight into the deep water and see the exact mathematical definition of functions! (Don’t worry if you don’t
understand it for the first read. I’ll explain everything in detail. This is the usual experience when encountering a definition
for the first time.)
𝑓 ∶ 𝑋 → 𝑌,
which is short for 𝑓 is a function from 𝑋 to 𝑌 . Note that 𝑋 and 𝑌 can be any set. In the examples we encounter, these
are usually the set of real numbers or vectors, but there is no such restriction.
To visualize the definition, we can draw two sets and arrows pointing from elements of 𝑋 to elements of 𝑌 . Each element
(𝑥, 𝑦) ∈ 𝑓 represents an arrow, pointing from 𝑥 to 𝑦.
The only criteria is that there can be at most one arrow starting from any 𝑥 ∈ 𝑋. This is why Fig. 16.2 is not a function.
Defining a function as a subset is mathematically precise but very low level. To be more useful, we can introduce an
abstraction by defining functions with formulas, such as
𝑓 ∶ ℝ → ℝ, 𝑥 ↦ 𝑥2 ,
or simply 𝑓(𝑥) = 𝑥2 in short. This is how most of us think about functions when working with them.
Now that we are familiar with the definition, we should get to know some of the most basic structural properties of
functions.
We saw that, in essence, functions are arrows between sets. At this point, we don’t know anything useful about them.
When is a function invertible? How to find their minima and maxima? Why should we even care? Probably you have a
bunch of questions here. Slowly but surely, we will cover all of these.
The first steps in our journey are concerned with the sets from which arrows start and point. There are two important sets
in a function’s life: its domain and image.
and
In other words, the domain is the subset of 𝑋 where arrows start; the image is the subset of 𝑌 where arrows point.
Why is this important? For one, these are directly related to the invertibility of a function. If you consider the “points and
arrows” mental representation, inverting a function is as simple as flipping the direction of the arrows. When can we do
it? In some cases, doing this might not even result in a function, as in the figure below.
To put the study of functions on top of solid theoretical foundations, we introduce the concept of injective, surjective and
bijective functions.
Fig. 16.5: This function is not invertible. Reversing the arrows doesn’t give a well-defined function.
In terms of arrows, injectivity means that every element of the image has at most one arrow pointing to it, while surjectivity
is that every element indeed has at least one arrow. When both are satisfied, we have a bijective function, one that can be
inverted properly. When the inverse 𝑓 −1 exists, it is unique. Both 𝑓 −1 ∘ 𝑓 and 𝑓 ∘ 𝑓 −1 equal to the identity function in
their respective domains.
𝑓 ∶ ℝ → ℝ, 𝑓(𝑥) = 𝑥2
is not injective nor surjective. (Ponder on this a bit if you don’t understand it right away. It helps if you draw a figure.)
On the contrary,
𝑔 ∶ ℝ → ℝ, 𝑔(𝑥) = 𝑥3
Functions, just like numbers, have operations defined on them. Two numbers can be multiplied and added together, but
can you the same with functions? Without any difficulty, they can be added together and multiplied with a scalar as
(𝑓 + 𝑔)(𝑥) ∶= 𝑓(𝑥) + 𝑔(𝑥),
(𝑐𝑓)(𝑥) ∶= 𝑐𝑓(𝑥),
where 𝑐 is some scalar.
Another essential operation is composition. Let’s consider the famous logistic regression for a minute! The estimator itself
is defined by
where
1
𝜎(𝑥) =
1 + 𝑒−𝑥
is the sigmoid function. The estimator 𝑓(𝑥) is the composition of two functions: 𝑙(𝑥) = 𝑎𝑥 + 𝑏 and the sigmoid function,
so
𝑓(𝑥) = 𝜎(𝑙(𝑥)).
This is how we can illustrate function composition with points and arrows.
To give one more example, a neural network with several hidden layers is just the composition of a bunch of functions.
The output of each layer is fed into the next one, which is exactly how composition is defined.
In general, if 𝑓 ∶ 𝐵 → 𝐶 and 𝑔 ∶ 𝐴 → 𝐵 are two functions, then their composition is formally defined by
𝑓 ∘ 𝑔 ∶ 𝐴 → 𝐶, 𝑥 ↦ 𝑓(𝑔(𝑥)).
𝑓 ∶ 𝑋 → ℝ, 𝑔 ∶ 𝑋 → ℝ,
Believe it or not, this is yet another form of function composition. Why? Define the function add by
add ∶ 𝑋 × 𝑋 → ℝ, add(𝑥1 , 𝑥2 ) = 𝑥1 + 𝑥2 .
Composition is an extremely powerful tool. In fact, so powerful that given a small set of cleverly defined building blocks,
“almost every function” can be obtained as the composition of these blocks. (I put “almost every function” in quotes
because if we want to say mathematically precise here, long detours are needed. To keep ourselves focused, let’s allow
ourselves to be a little hand-wavy here.)
So far, we have seen that functions are defined as arrows drawn between elements of two sets. This, although being
mathematically rigorous, does not give us useful mental models to reason about them. As you’ll surely see by the end of
our journey, in mathematics, the key is often to find the right way to look at things. Regarding functions, one of the most
common and useful mental models is their graph.
If 𝑓 ∶ ℝ → ℝ is a function mapping a real number to a real number, we can visualize it using its graph, defined by
This set of points can be drawn in the two-dimensional plane. For instance, in the case of the famous rectified linear unit
(ReLU)
0 if 𝑥 < 0
ReLU(𝑥) = {
𝑥 if 0 ≤ 𝑥,
Fig. 16.9: Functions as a transformation of the space. Here, the vectors of the space are rotated around the origin.
Fig. 16.10: Image operations as transformations, as done by the Albumentations library. Source of the image: Albu-
mentations: Fast and Flexible Image Augmentations by Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya,
Alex Parinov, Mikhail Druzhinin and Alexandr A. Kalinin.
16.5 Problems
SEVENTEEN
FUNCTIONS IN PRACTICE
In our study of functions, we started from arrows between sets and ended up with mental models such as formulas and
graphs. For pure mathematical purposes, these models are perfectly enough to conduct thorough investigations. However,
once we leave the realm of theory and start putting things into practice, we must think about how functions are represented
in programming languages.
In Python, functions are defined using a straightforward syntax. For instance, this is how the square(𝑥) = 𝑥2 function
can be implemented.
def square(x):
return x**2
type(square)
function
square(12)
144
Python is well-known for its simplicity, and functions are no exception. However, this doesn’t mean that they are limited
in features, quite the contrary: you can achieve a lot with the clever use of functions.
There are three operations that we want to do on functions: composition, addition, multiplication. The easiest way is
to call the functions themselves and fall back to the operations defined for the number types. To see an example, let’s
implement the cube(𝑥) = 𝑥3 function and add/multiply/compose it with square.
def cube(x):
return x**3
x = 2
205
Mathematics of Machine Learning
12
square(x)*cube(x) # multiplication
32
square(cube(x)) # composition
64
However, there is a major problem. If you take another look at the function operations, you can notice that they take
functions and return functions. For instance, the composition is defined by
compose ∶ 𝑓, 𝑔 ↦ (𝑥
⏟⏟↦ 𝑓(𝑔(𝑥)))
⏟⏟ ⏟⏟⏟ ,
the composed function
with a function as a result. We did no such thing by simply passing the return value to the outer function. There is no
function object to represent the composition.
In Python, functions are first-class objects, meaning that we can pass them to other functions and return them from
functions. (This is an absolutely fantastic feature, but if this is the first time you encounter this, it might take some time
to get used to.) Thus, we can implement the compose function above by using the first-class function feature.
return composition
square_cube_composition(2)
64
Addition and multiplication can be done just like this. (They are even assigned as an exercise problem.)
The standard way of function definitions is not a good fit for an application that is essential for us: parametrized functions.
Think about the case of linear functions of the form 𝑎𝑥 + 𝑏, where 𝑎 and 𝑏 are parameters. On the first try, we can do
something like this.
Passing the parameters as arguments seems to work, but there are serious underlying issues. For instance, functions can
have a lot of parameters. Even if we compact parameters into multidimensional arrays, we might need to deal with dozens
of such arrays. Passing them around manually is error-prone, and we usually have to work with multiple functions. For
example, neural networks are composed of several layers. Each layer is a parameterized function, and their composition
yields a predictive model.
We can solve this issue by applying the classical object-oriented principle of encapsulation, implementing functions as
callable objects. In Python, we can do this by implementing the magic __call__ method for the class.
class Linear:
def __init__(self, a, b):
self.a = a
self.b = b
3.2
This way, we can store, access, and modify the parameters using attributes.
f.a, f.b
(2, -1)
Since there can be a lot of parameters, we should implement a method that collects them together in a dictionary.
class Linear:
def __init__(self, a, b):
self.a = a
self.b = b
def parameters(self):
return {"a": self.a, "b": self.b}
f = Linear(2, -1)
f.parameters()
Interactivity is one of the most useful features of Python. In practice, we frequently find ourselves working in the REPL,
inspecting objects and calling functions by hand. We often add a concise string representation for our classes for these
situations.
By default, printing a Linear instance results in a cryptic message.
<__main__.Linear at 0x7fae3c29b0f0>
This is not very useful. Besides the class name and its location in the memory, we haven’t received any information.
We can change this by implementing the __repr__ method responsible for returning the string representation for our
object.
class Linear:
def __init__(self, a, b):
self.a = a
self.b = b
def __repr__(self):
return f"Linear(a={self.a}, b={self.b})"
def parameters(self):
return {"a": self.a, "b": self.b}
f = Linear(2, -1)
Linear(a=2, b=-1)
This looks much better! Adding a pretty string representation seems like a small thing, but this can go a long way when
doing machine learning engineering in the trenches.
The Linear class that we have just seen is only the tip of the iceberg. There are hundreds of function families that are
used in machine learning. We’ll implement many of them eventually, and to keep the interfaces consistent, we are going
to add a base class from which all others will be inherited.
class Function:
def __init__(self):
pass
def parameters(self):
return dict()
With this, we can implement functions and function families in the following way.
import numpy as np
sigmoid = Sigmoid()
sigmoid(2)
0.8807970779778823
Even though we haven’t implemented the parameters method for the Sigmoid class, it is inherited from the base
class.
sigmoid.parameters()
{}
For now, let’s keep the base class as simple as possible. During the course of this book, we’ll progressively enhance the
Function base class to cover all the methods a neural network and its layers need. (For instance, gradients.)
(subsection:functions/functions-in-practice/composition)
Recall how we did function composition when working with plain Python functions? Syntactically, that can work with our
Function class as well, although there is a huge issue: the return value is not a Function type.
composed(2)
0.7615941559557646
isinstance(composed, Function)
False
composed.parameters()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-28-02b9088725fc> in <module>
----> 1 composed.parameters()
To fix the issue, we implement function composition as a child of the Function base class. Recall that composition is
a function, taking two functions as input and returning one:
compose ∶ 𝑓, 𝑔 ↦ (𝑥
⏟⏟↦ 𝑓(𝑔(𝑥)))
⏟⏟ ⏟⏟⏟ .
the composed function
class Composition(Function):
def __init__(self, *functions):
self.functions = functions
for f in self.functions:
x = f(x)
return x
composed(2)
0.9525741268224334
composed.parameters()
{}
isinstance(composed, Function)
True
17.3 Problems
Problem 1. Implement the add function that takes the functions 𝑓 and 𝑔, and returns their sum 𝑓 + 𝑔. (You can do this
following the example of composition.)
17.4 Solutions
Problem 1.
return sum
EIGHTEEN
NUMBERS
It’s like asking why is Ludwig van Beethoven’s Ninth Symphony beautiful. If you don’t see why, someone can’t
tell you. I know numbers are beautiful. If they aren’t beautiful, nothing is. — Paul Erdős
When I was about to take my first mathematical analysis course at the university, coming straight from high school, I
wondered why we would spend several lectures on real numbers. At the time, I was confident in my knowledge and
thought that I knew what numbers were. This was my first painful encounter with the Dunning–Kruger effect. Suffice to
say, after a few classes, I was left confused about numbers, taking a while to understand them finally.
If you look at numbers with a magnifying glass, they become extremely complex. In this chapter, we are going to see why
and how to make sense of them. To look ahead and keep machine learning in our sight, consider that gradient descent
(you know, the optimization algorithm that is used everywhere) is not be possible for functions that are not differentiable.
In turn, a function 𝑓 is differentiable at 𝑥 if the limit
𝑓(𝑥) − 𝑓(𝑦)
lim
𝑦→𝑥 𝑥−𝑦
exists. To understand limits, we must understand real numbers first.
Another good reason to dig deep into the patterns and structures of numbers: they are beautiful. (Like said above by Paul
Erdős, one of the greatest mathematicians ever.) There is a particular joy of understanding seemingly familiar things on
a deep level. Even though you might not use this knowledge every day, it teaches you perspective about the objects you
encounter during your work.
So, let’s get started!
18.1 Numbers
There are five famous classes of numbers that one has to know in order to become adept in mathematics:
• natural numbers, denoted by ℕ,
• integers, denoted by ℤ,
• rational numbers, denoted by ℚ,
• real numbers, denoted by ℝ,
• and finally complex numbers, denoted by ℂ.
These classes are increasing in order, that is,
ℕ ⊆ ℤ ⊆ ℚ ⊆ ℝ ⊆ ℂ.
In this section, we are going to be concerned with the first four. (Complex numbers will get own chapter.)
211
Mathematics of Machine Learning
ℕ ∶= {1, 2, 3, … }.
Sometimes zero is included; sometimes it is not. Believe it or not, after a few thousand years, mathematicians still cannot
decide whether or not 0 is a natural number. This problem might sound comical, but trust me, I have seen senior professors
almost go into a fistfight upon debating this issue. For some people, this is almost a religious question.
I don’t particularly care, and neither should you. I propose to use the more common and practical definition, which is the
one without zero. When we really need to talk about the natural numbers AND zero, I will use the notation
ℕ0 = {0, 1, 2, … }.
The cardinality of the set of natural numbers is countably infinite. In fact, countability is defined as |ℕ|.
To be able to express negative and zero quantities, we extend natural numbers to obtain the set of integers, defined by
ℤ = {… , −2, −1, 0, 1, 2, … }.
Relatively straightforward so far. Integers are also countable: one can enumerate all of its elements by
One significant advantage of integers over natural numbers is that they contain the additive inverse for each element. That
is, in plain English, if 𝑛 ∈ ℤ, then so does −𝑛 ∈ ℤ. This makes it possible to define all kinds of algebraic structures over
the integers, giving us mathematical tools to reason about phenomena modeled by them.
Note that if 𝑛, 𝑚 ∈ ℤ, then 𝑛 + 𝑚 ∈ ℤ. In mathematical terminology, we say that ℤ is closed to addition.
To summarize, ℤ is
• closed to addition,
• and every element has an additive inverse.
These two properties will guide us on how to go from natural numbers to real numbers: each extension is constructed so
that these two properties hold, but for different and different operations.
So, we obtained ℤ from ℕ by extending it with zero and the additive inverses for each element. What about the multi-
plicative inverses? This idea leads us to the concept of rational numbers, numbers that can be written as a ratio of two
integers. It is defined by
𝑝
ℚ={ ∶ 𝑝, 𝑞 ∈ ℤ, 𝑞 ≠ 0},
𝑞
is both closed to multiplication and every element (except zero) has a multiplicative inverse. This is not just a “l’art pour
l’art” mathematical construction. Rational numbers model quantities that we encounter in real life. 0.798 kilometers, 3.4
kilograms of grain, etc.
It might be surprising, but ℚ is also countable. One easy way to prove is to notice that it can be obtained as the countable
union of countable sets:
𝑝
ℚ = ∪𝑝∈ℤ { ∶ 𝑞 ∈ ℤ\{0}}.
𝑞
Fig. 18.1: Enumeration of rational numbers. This shows that ℚ is indeed countable.
Since the union of countable sets is countable, ℚ is countable as well. Another (and perhaps more visual) way to see this
is to simply enumerate them in a sequence. Something like this:
1
Rational numbers can be written in decimal form, like 2 = 0.5 for example. In general, the following is true.
Theorem
Any rational number 𝑥 can be represented as a:
(a) finite decimal
𝑥 = 𝑥0 .𝑥1 … 𝑥𝑛 , 𝑥𝑖 ∈ {0, 1, 2, … , 9}
where the decimals between the two dots repeat infinitely. (This can be just a single digit as well.)
Note that the decimal representation is not unique: for example, 1.0 and 0.9̇ is equal.
The above theorem fully characterizes rational numbers. But what about the numbers with an infinite decimal form that
is not repeating? Like the famous mathematical constant 𝜋 describing the half circumference of the unit circle, that is
𝜋 = 3.14159265358979323846264338327950288419716939937510...,
with no repeating patterns. These are called irrational numbers, and together with rationals, they make up the real numbers.
The simplest way to imagine real numbers is a line, where each point represents a number.
If we temporarily let a little bit of mathematical correctness slide, we can say that
Real numbers are also the first in our journey that is not countable, and we will prove this! Its proof is so beautiful that it
belongs to The Book, a collection of the most elegant and beautiful mathematical proofs.
Theorem
ℝ is not countable.
Proof. To show that ℝ is not countable, we take an indirect approach: we suppose that it is countable and demonstrate
that this leads to a contradiction. This method is called an indirect proof, a top-tier tool in a mathematician’s toolkit.
Since [0, 1) ⊆ ℝ, it is enough to show that [0, 1) is not countable. If it is countable, we can enumerate it:
[0, 1) = {𝑎1 , 𝑎2 , … }.
5, if 𝑎𝑛𝑛 ≠ 5,
𝑎𝑛𝑛
̂ ∶= {
1, if 𝑎𝑛𝑛 = 5.
𝑎̂ ∶= 0.𝑎11
̂ 𝑎22
̂ 𝑎33
̂ …
be found in the sequence {𝑎1 , 𝑎2 , … }? No, because the i-th decimal of 𝑎𝑖 and 𝑎̂ must be different for all 𝑖 ∈ ℕ! We
have constructed 𝑎̂ by changing the i-th decimal of 𝑎𝑖 .
To summarize, our assumption that [0, 1) can be enumerated leads us to a contradiction because we have found an element
that cannot possibly be in our enumeration. So, [0, 1) is not countable, hence ℝ is not countable as well. This is what we
needed to show! □
The method of proof that you have seen above is called Cantor’s diagonal argument. This is a beautiful and powerful
idea, and although we won’t encounter it anymore, it is the key to proving several difficult theorems. (Like Gödel’s famous
incompleteness theorems that threw a huge monkey wrench into the machinery of mathematics at the beginning of the
20th century.)
Notice that the way we introduced real numbers broke the pattern we have observed before. Integers were constructed
by extending the natural numbers with additive inverses and closing them to addition. Rationals were obtained the same
way, except doing it for multiplication. As we shall see later, real numbers follow a similar process: we obtain it from
rationals by closing them to limits.
NINETEEN
SEQUENCES
Sequences lie at the very heart of mathematics. Sequences and their limits describe long-term behavior, like the (occa-
sional) convergence of gradient descent to a local optimum. By definition, a sequence is an enumeration of mathematical
objects.
The elements of a sequence can be any mathematical object, like sets, functions, or Hilbert spaces. (Whatever those might
be.) For us, sequences are composed of numbers. We formally denote them as
{𝑎𝑛 }∞
𝑛=1 , 𝑎𝑛 ∈ ℝ.
For simplicity, the subscripts and the superscripts are often omitted, so don’t panic if you see {𝑎𝑛 }, as it is just an
abbreviation. (Or 𝑎𝑛 . Mathematicians love abbreviations.) If all elements of the sequence belongs to a set 𝐴, we often
write {𝑎𝑛 } ⊆ 𝐴.
Sequences can be bidirectional as well, those are denoted as {𝑎𝑛 }∞ 𝑛=−∞ . We don’t need them for now, but they will
frequently appear when talking about probability distributions later.
19.1 Convergence
One of the most important aspects of sequences is their asymptotic behavior, or in other words, what they do in the long
term. A particular property we often look for is convergence. In plain English, the sequence {𝑎𝑛 } converges to 𝑎 if no
matter how small of an interval (𝑎 − 𝜀, 𝑎 + 𝜀) we define (where 𝜀 can be really small), eventually all of the elements of
{𝑎𝑛 } fall into it.
The following is the mathematically precise definition of convergence.
|𝑎𝑛 − 𝑎| < 𝜀
holds for all indices 𝑛 > 𝑛0 . 𝑎 is said to be the limit of {𝑎𝑛 } and we write
lim 𝑎𝑛 = 𝑎
𝑛→∞
or
𝑎𝑛 → 𝑎 (𝑛 → ∞).
217
Mathematics of Machine Learning
Note that the cutoff index 𝑛0 depends on 𝜀. We could write 𝑛0 (𝜀) to emphasize this dependency, but we will rarely do
so. To avoid referencing and naming the cutoff index 𝑛0 all the time, we often simply say that a given property “holds for
all 𝑛 large enough”. (Did I mention that mathematicians love abbreviations?)
In plain English, the definition means that no matter how small of an interval you enclose 𝑎 in, all members of the sequence
will eventually fall into it.
Although mathematically extremely precise and correct, this definition doesn’t give us a lot of tools to show if a sequence
is convergent or not. First, we have to conjure up the limit 𝑎 and then construct the cutoff indexes. For example, consider
1
𝑎𝑛 ∶= .
𝑛
To make our job easier, we can plot this to visualize the situation.
Here, we can explicitly construct the cutoff index 𝑛0 for every 𝜀. Since we want to have
1
< 𝜀,
𝑛
we can reorganize the inequality to obtain
1
< 𝑛.
𝜀
So,
1
𝑛0 ∶= ⌊ ⌋ + 1
𝜀
will do the job.
We had it easy in this example, but this is pretty much as far as we can go with the definition. For example, how do you
show the convergence of
−1
1 1 1
𝑎𝑛 ∶= ( + +⋯+ )
𝑛 𝑛+1 2𝑛
with the definition only? You don’t. There are more advanced tools for this, as we shall see. (By the way, lim𝑛→∞ 𝑎𝑛 =
1
ln 2 . We will show this later when talking about integrals.) For sequences that are defined recursively and there is no
analytic formula available, like
⃗ ∞
{𝐿(𝑤⃗ 𝑛 , 𝑥,⃗ 𝑦)} 𝑛=1 ,
where 𝐿 is the loss function for a neural network with weights 𝑤⃗ 𝑛 and training data (𝑥,⃗ 𝑦),
⃗ we have even more complica-
tions. There is no need to worry about them yet, so let’s go one step at a time.
In essence, the study of convergence for a particular sequence comes down to breaking it into simpler and simpler parts
until the limit is known.
1. Is this a “famous” sequence where the limit is known? If yes, we are done. If not, go to the next step.
2. Can you decompose it into simpler parts? If yes, is the convergence known for them? If the convergence is
unknown, can you simplify it further?
We can do this because convergence has some particularly nice properties, as summarized in the theorem below.
lim (𝑎𝑛 + 𝑏𝑛 ) = 𝑎 + 𝑏,
𝑛→∞
(b)
(c)
lim 𝑎𝑛 𝑏𝑛 = 𝑎𝑏,
𝑛→∞
The properties (a) and (b) together are called linearity of convergence. If you recall the definition of linear transformations
As we shall see later, continuity of functions also provides a great tool to study convergence properties of a sequence. In
fact, continuity is nothing else than the interchangeability of limits and functions:
One essential property of convergent sequences is that under certain circumstances, they preserve inequalities. This will
be true to function limits as well, so it will be important for us later as we’ll see.
|𝑎−𝛼|
Proof. We are going to do this indirectly. If lim𝑛→∞ 𝑎𝑛 < 𝛼, then by the definition of convergence, |𝑎𝑛 − 𝑎| < 2
for all large 𝑛. This means that those 𝑎𝑛 -s are actually below 𝛼, contradicting our assumptions. □
This proof is straightforward to understand if you draw a figure and visualize what happens, so I encourage you to do it.
The identical result is true if we replace ≥ with ≤ in the above, and the proof goes through word by word.
Note that if 𝑎𝑛 > 𝛼 for all 𝑛, lim𝑛→∞ 𝑎𝑛 > 𝑎 is not guaranteed! The best example to show this is 𝑎𝑛 ∶= 1/𝑛, which
converges to 0, although all of its terms are positive.
As a corollary, we obtain a tool that will be very useful for showing the convergence of particular sequences.
In other words, squeezing {𝑏𝑛 } between two convergent sequences that have the same limit implies convergence of 𝑏𝑛 to
the joint limit.
Because convergence behaves nicely with respect to certain operations, we study sequences by decomposing them into
building blocks. Let’s see the most important ones that will be useful for us later!
Example 1. For any 𝑥 ≥ 0,
⎧0 if 0 ≤ 𝑥 < 1,
{
𝑛
lim 𝑥 = 1 if 𝑥 = 1, (19.1)
𝑛→∞ ⎨
{∞ if 𝑥 > 1.
⎩
If you think about it for a minute, this is easy to see. The 𝑥 = 0 and 𝑥 = 1 cases are trivial. Regarding the others,
because taking the logarithm turns exponentiation into multiplication, we have log 𝑥𝑛 = 𝑛 log 𝑥. So,
−∞ if 0 < 𝑥 < 1,
lim 𝑛 log 𝑥 = {
𝑛→∞ ∞ if 𝑥 > 1.
1 if 𝑥 > 0,
lim 𝑥1/𝑛 = { (19.2)
𝑛→∞ 0 if 𝑥 = 0.
Similarly to the previous example, this can be shown with the use of logarithms.
Convergence is everywhere. We just met this concept for the first time, so we don’t see its importance just yet. However,
it is central to mathematics and machine learning.
Just to look ahead and give a few examples, differentiation is defined by a limit:
𝑓(𝑥) − 𝑓(𝑦)
𝑓 ′ (𝑥) ∶= lim .
𝑦→𝑥 𝑥−𝑦
Regarding derivatives, integrals (the “inverse” of differentiation) are limits of convergent sequences. For instance,
1 𝑛
𝑘2
∫ 𝑥2 𝑑𝑥 = lim ∑ .
0
𝑛→∞
𝑘=1
𝑛3
Because integrals are limits, so does every quantity calculated with integration, such as expected values, like for the
standard normal distribution,
∞
1 2
𝔼[𝒩(0, 1)] = ∫ 𝑥 √ 𝑒−𝑥 /2 𝑑𝑥.
−∞ 2𝜋
Convergence is also central to probability and statistics. There are two famous theorems: the Law of Large Numbers,
stating that
1 𝑛
∑ 𝑋𝑘 = 𝜇
lim
𝑛→∞ 𝑛
𝑘=1
in distribution, for independent and identically distributed random variables 𝑋1 , 𝑋2 , … with finite expected value 𝔼[𝑋𝑖 ] =
𝜇 and variance var(𝑋𝑖 ) = 𝜎2 . They are both very important in machine learning and neural networks, for instance, the
Law of Large Numbers is one of the fundamental ideas behind stochastic gradient descent.
Even the gradient descent optimization process is a recursively defined sequence of model weights, converging towards
an optimum where the model best fits the data.
We will talk about all of these in detail. So even if you don’t understand these right now, don’t worry. It’ll be clear soon.
Before finishing up with sequences, we shall discuss what happens when a sequence is not convergent.
We have talked about how convergent sequences are everywhere, and they are at the core of mathematics and machine
learning. However, not all sequences are convergent.
Think about the following example:
𝑎𝑛 ∶= sin(𝑛).
The sequence {𝑎𝑛 } is said to be ∞-divergent if for every arbitrarily large number 𝑥, there is a cutoff index 𝑛0 such that
𝑎𝑛 > 𝑥
0 ≤ 𝑎𝑛 ≤ 𝑐𝑛
19.2.3 Subsequences
Sometimes, when working with sequences, we don’t need the entire thing, just its subsequence. We will not do anything
special with them just yet, but here is the formal definition.
19.2.4 Series
There are a special class of sequences we need to mention: series. That is, sequences of the form
∞
𝑆𝑛 = ∑ 𝑎𝑘 , 𝑎𝑘 ∈ ℝ.
𝑘=1
This is what we mean when we write infinite sums, as they are defined by
∞ 𝑛
∑ 𝑎𝑘 ∶= lim ∑ 𝑎𝑘 .
𝑛→∞
𝑘=1 𝑘=1
Although we won’t go into details, the literature on series is huge. It is not an overstatement to say that almost the entire
development of mathematical analysis in the 19th and 20th century was motivated by expressing functions in series form.
If you have some experience with computer science, you are probably familiar with the big O/small O notation. There, it
is used to express the runtime of algorithms, but it is not limited to that. In general, it is used to compare the long-term
behavior of sequences. Let’s start with the definitions first, and then I’ll explain the intuition and some use-cases.
In plain English, “𝑏𝑛 is big O 𝑎𝑛 ” means that 𝑏𝑛 grows roughly at the same rate as 𝑎𝑛 , while “𝑏𝑛 is small o 𝑎𝑛 ” says that
𝑏𝑛 is an order of magnitude smaller than 𝑎𝑛 .
So, when we say that the runtime of an algorithm is 𝑂(𝑛) steps where 𝑛 is the input size, we mean that the algorithm
will finish in 𝐶𝑛 steps. Often, we don’t care about the constant multiplier since it doesn’t mean an order of magnitude
difference in the long run.
Now that we have familiarized ourselves with the concept of convergent sequences, we shall take another look at rational
and real numbers. When extending the classes of numbers going from ℕ to ℝ, we pick an operation, close the set with
respect to it, and add inverse elements to that operation.
Extending ℕ with additive inverses −𝑛 for all 𝑛 ∈ ℕ yields ℤ. Extending ℤ with multiplicative inverses 1/𝑛 for all 𝑛 and
closing it for multiplication yields ℚ. The pattern is seemingly different in the case of ℝ, but this is not the case. After
understanding what convergence is, we have the tools to see why.
Consider the following sequence:
𝑛
1
𝑎𝑛 ∶= (1 + ) , 𝑛 = 1, 2, … .
𝑛
Since rational numbers are closed to addition and multiplication, we see that 𝑎𝑛 is rational. However,
𝑛
1
lim (1 + ) = 𝑒,
𝑛→∞ 𝑛
TWENTY
(Almost) everything in machine learning is described with real numbers. Features, losses, parameters, probabilities.
Every model is a mapping between ℝ𝑛 and ℝ𝑚 . Because our tooling is built on top of this, it is essential to understand
how real numbers are structured. In mathematical terms, this is called topology.
According to the Cambridge English Dictionary, the word “topology” means
the way the parts of something are organized or connected.
From a mathematical perspective, topology studies the local properties of structures and spaces. In machine learning,
we are often interested in global properties like minima and maxima but only have local tools to search for them. One
example is the derivative of functions. Derivatives describe the slope of the tangent plane, and as Fig. 20.1 illustrates, this
doesn’t change if the function is modified away from the point where the derivative is taken.
In mathematics, local properties are handled in terms of sequences and neighborhoods. We have learned about sequences
in the last chapter, and now we tackle the subject of neighborhoods.
We are going to focus on three fundamental aspects:
• open and closed sets,
• behavior of sequences within sets,
• and their smallest and largest elements, upper and lower bounds.
Our main goal with mathematical analysis is to understand gradient descent, a fundamental tool for training models. To
do that, we need to understand limits. For that, sequences and real numbers, leading deep into the rabbit hole where we
are now.
Think of it as learning the Python language versus learning TensorFlow or PyTorch. Since we want to do machine learning,
we ultimately want to learn a high-level framework. However, if we lack the understanding of the basic keywords in Python
like import or def, we are not ready to learn and productively use advanced tools. Sequences, open and closed sets,
limits, and others are the fundamental building blocks of mathematical analysis, the language of optimization.
Let’s start our discussion with open and closed sets! (In this chapter, when we refer to something as a subset or set, it is
implicitly assumed to be within ℝ.)
225
Mathematics of Machine Learning
Before we start analyzing the properties of open and closed sets, here are some key examples for building up useful mental
models.
Example 1. Intervals of the form (𝑎, 𝑏) = {𝑥 ∈ ℝ ∶ 𝑎 < 𝑥 < 𝑏} are open. This can be easily seen by picking any
𝑥 ∈ (𝑎, 𝑏) and letting 𝜀 = min{|𝑥 − 𝑎|/2, |𝑥 − 𝑏|/2}. Essentially, we take the distance from the closest endpoint and cut
that in half. Any point that is closer to 𝑥 than half-distance of the closest endpoint will be also in (𝑎, 𝑏).
Example 2. Intervals of the form [𝑎, 𝑏] = {𝑥 ∈ ℝ ∶ 𝑎 ≤ 𝑥 ≤ 𝑏} are closed. Indeed, its complement is ℝ\[𝑎, 𝑏] =
(−∞, 𝑎) ∪ (𝑏, ∞). Using the reasoning above, it is easy to see that (−∞, 𝑎) ∪ (𝑏, ∞) is open.
Example 3. Intervals of the form (𝑎, 𝑏] = {𝑥 ∈ ℝ ∶ 𝑎 < 𝑥 ≤ 𝑏} are neither open, nor closed. To see that it is not open,
observe that no interval containing 𝑏 is fully within (𝑎, 𝑏], since 𝑏 is an endpoint. For similar reasons, its complement
ℝ\(𝑎, 𝑏] = (−∞, 𝑎] ∪ (𝑏, ∞) is not open.
An important takeaway from the last example is that if a set is not closed, it doesn’t mean that it is open and vice versa.
We can rephrase the definition of openness by introducing the concept of neighborhoods. The neighborhoods of a given
point 𝑥 are the open intervals (𝑎, 𝑏) containing 𝑥. With this terminology, any set 𝐴 is open if, for any 𝑥 ∈ 𝐴, there exists
a neighborhood of 𝑥 that is fully contained within 𝐴. From this aspect, openness means that there is still “room to move”
from any point.
The most fundamental property of open and closed sets is their behavior under union and intersection.
Theorem 19.1.1
Let {𝐴𝛾 }𝛾∈Γ be an arbitrary collection of sets.
(a) If each 𝐴𝛾 is open, then ∪𝛾∈Γ 𝐴𝛾 is also open.
(b) If each 𝐴𝛾 is closed, then ∩𝛾∈Γ 𝐴𝛾 is also closed.
Proof. (a) Suppose that 𝐴𝛾 , 𝛾 ∈ Γ are open sets and let 𝑥 ∈ ∪𝛾∈Γ 𝐴𝛾 . Because 𝑥 is in the union, there is some 𝛾0 ∈ Γ
such that 𝑥 ∈ 𝐴𝛾0 . Because 𝐴𝛾0 is open, there is a small neighborhood (𝑎, 𝑏) of 𝑥 such that (𝑎, 𝑏) ⊆ 𝐴𝛾0 . Because of
this, (𝑎, 𝑏) ⊆ ∪𝛾∈Γ 𝐴𝛾 , which is what we had to show.
(b) Now let 𝐴𝛾 , 𝛾 ∈ Γ be closed sets. In this case, De Morgan’s laws imply that ℝ\(∩𝛾∈Γ 𝐴𝛾 ) = ∪𝛾∈Γ (ℝ\𝐴𝛾 ). Since
each 𝐴𝛾 is closed, ℝ\𝐴𝛾 is open. As we have previously seen, the union of open sets is open. □
Closedness and openness of a set influence its behavior regarding set sequences. The first fundamental result regarding
this is Cantor’s axiom.
This seemingly simple proposition is a profound property of real numbers, one that ultimately follows from their mathe-
matical construction. Cantor’s axiom is not true, for instance, if we talk about subsets of ℚ instead of ℝ. Think about a
sequence of rational numbers 𝑎𝑛 → 𝜋 that approximates 𝜋 from below, and another sequence 𝑏𝑛 → 𝜋 that approximates
pi from above, that is,
𝑎𝑛 < 𝜋 < 𝑏𝑛 , 𝑎𝑛 , 𝑏𝑛 ∈ ℚ.
The intersection of the intervals [𝑎𝑛 , 𝑏𝑛 ] only contains 𝜋, which is not rational. Thus, in the space of rational numbers,
∩∞
𝑛=1 [𝑎𝑛 , 𝑏𝑛 ] = ∅, therefore Cantor’s axiom doesn’t hold there.
There is an old proverb about losing the war because of a nail in a horseshoe. It goes something like this:
For want of a nail the shoe was lost.For want of a shoe the horse was lost.For want of a horse the rider was
lost.For want of a rider the message was lost.For want of a message the battle was lost.For want of a battle
the kingdom was lost.And all for the want of a horseshoe nail.
Think about Cantor’s axiom as the nail in the horseshoe. Without it, we can’t talk about taking limits of sequences.
Without limits, there are no gradients. Without gradients, there is no gradient descent, and consequently, we can’t fit
machine learning models.
Originally, we have defined open sets in terms of small open intervals like (𝑥 − 𝜀, 𝑥 + 𝜀). We called a set open if you
could squeeze in such a small interval for each of its points. By taking a step of abstraction, we can rephrase the definition
in terms of norms.
From this viewpoint, an interval (𝑥 − 𝜀, 𝑥 + 𝜀) is the same as a one-dimensional open ball. Given a normed space 𝑉 with
the norm ‖ ⋅ ‖, the ball of radius 𝑟 > 0 centered at 𝑥 is defined by
Equivalently, a ball of radius 𝑟 is the set of points with distance less than 𝑟 from the center point. In the Euclidean spaces
ℝ𝑛 , with the norm ‖𝑥‖ = √𝑥21 + ⋯ + 𝑥2𝑛 , this matches our intuitive understanding. This is illustrated in Fig. 20.2.
However, in one dimension, the Euclidean norm simplifies to ‖𝑥‖ = |𝑥|. Thus, we have
𝐵(𝑥, 𝑟) = {𝑦 ∈ ℝ ∶ |𝑥 − 𝑦| < 𝑟}
= (𝑥 − 𝑟, 𝑥 + 𝑟).
We don’t often think about the interval (𝑥 − 𝜀, 𝑥 + 𝜀) as the one-dimensional ball 𝐵(𝑥, 𝜀). However, making this
connection will make it easy to later extend the topology of ℝ to ℝ𝑛 , which is where we want to work eventually.
With norms and balls, we can rephrase the definition of open sets in the following way.
Closed sets can be characterized in terms of their sequences. The following theorem shows an equivalent definition of
closed sets, giving us a helpful way of thinking about them.
Proof. To prove that the two statements are equivalent, we have to show two things: that (a) implies (b) and that (b)
implies (a). Don’t worry if this proof seems too complicated when you read it the first time. If you don’t understand it
right away, I suggest thinking about 𝐴 as a closed interval and drawing a figure. You can also skip it since I will refer back
to this fact every time we need it later.
First, let’s see that (a) implies (b). Thus, suppose that 𝐴 is closed and {𝑎𝑛 }∞
𝑛=1 ⊆ 𝐴 is a convergent sequence, 𝑎 ∶=
lim𝑛→∞ 𝑎𝑛 . We have to show that 𝑎 ∈ 𝐴, and we are going to do this by contradiction. The plan is the following: assume
that 𝑎 ∉ 𝐴 and deduce that {𝑎𝑛 } must eventually leave 𝐴.
Indeed, suppose that 𝑎 ∈ ℝ\𝐴. Because of 𝐴 is closed, ℝ\𝐴 is open, so there is a small neighborhood (𝑎 − 𝜀, 𝑎 + 𝜀) ⊆
ℝ\𝐴. In plain English, this means that we can separate 𝑎 from 𝐴. This contradicts the fact that {𝑎𝑛 } ⊆ 𝐴 and 𝑎𝑛 → 𝑎,
because according to the definition of convergence, eventually all members of the sequence have to fall into (𝑎 − 𝜀, 𝑎 + 𝜀).
This is a contradiction, so 𝑎 ∈ 𝐴.
Second, we will show that (b) implies (a), that is, if the limit of every convergent sequence of 𝐴 is also in 𝐴, then the
set is closed. Our goal is to show that ℝ\𝐴 is open. More precisely, if 𝑥 ∈ ℝ\𝐴, we want to find a small neighborhood
(𝑥 − 𝜀, 𝑥 + 𝜀) that is disjoint from 𝐴. Again, we can show this via contradiction.
Suppose that no matter how small 𝜀 > 0 is, we can find an 𝑎 ∈ 𝐴 ∩ (𝑥 − 𝜀, 𝑥 + 𝜀). Thus, we can define a sequence
{𝑎𝑛 }∞𝑛=1 such that 𝑎𝑛 ∈ 𝐴 ∩ (𝑥 − 1/𝑛, 𝑥 + 1/𝑛). Due to the construction, lim𝑛→∞ 𝑎𝑛 = 𝑎, and as 𝐴 is closed to taking
limits according to the premise (b), this would imply that 𝑎 ∈ 𝐴, which is a contradiction. This is what we had to show.
□
This result also explains the origins of the terminology closed. A closed set is such because it is closed to limits.
From a (very) high-level view, machine learning can be described as an optimization problem. For inputs 𝑥 and predictions
𝑦, we are looking at a parametrized family of functions 𝑓(𝑥, 𝑤), where our parameters are condensed in the variable 𝑤.
Given a set of samples and observations, our goal is to find the minimum of the set
and the parameter configuration 𝑤 where the optimum is attained. To make sure that our foundations are not missing this
building block, we are going to take some time to study this.
Being bounded means that we can include the set in a large interval [𝑚, 𝑀 ]. For optimization, there are a few essential
quantities that are related to bounds: minimal and maximal elements, smallest upper bounds, and largest lower bounds.
Let’s start with formalizing the concept of the smallest and largest element within a set.
We won’t go into great details here, but the infimum and supremum always exist. However, it is essential to note that there
is a sequence {𝑎𝑛 }∞
𝑛=1 ⊆ 𝐴 such that 𝑎𝑛 → inf 𝐴. (This is true for the supremum as well.)
With the concept of infimum and supremum, we can formalize the optimization problem for machine learning described
by (20.1) as
where this number represents the smallest possible value of the loss function and 𝑤 is the parameter of our model.
However, there is a significant issue with the 𝑤 ∈ ℝ𝑛 part. First, our parameter space is high dimensional. In practice, 𝑛
can be in the millions. Besides that, we are looking at an unbounded parameter space, where such an optimum might not
even exist. Finally, is there even a parameter 𝑤 where the infimum is attained? After all, this is what we are primarily
interested in.
We can restrict the parameter space to a closed and bounded set to fix these issues. These sets are so prevalent that they
have their own name: compact sets.
We love compact sets. Even though their definition seems straightforward, these two properties have profound conse-
quences regarding optimization. At this point, we are not ready to talk about this in detail, but we can find minima or
maxima in practice because continuous functions behave nicely on compact sets.
There is a key result about compact sets that will constantly resurface during our studies of functions: the Bolzano-
Weierstrass theorem.
Because 𝐴 is compact, there exists an interval 𝐼1 ∶= [𝑚, 𝑀 ] that contains 𝐴 in its entirety. By cutting this interval in
half, we obtain [𝑚, (𝑚 + 𝑀 )/2] and [(𝑚 + 𝑀 )/2, 𝑀 ]. At least one of these will contain infinitely many points from
{𝑎𝑛 }, let that be 𝐼2 . Repeating this process will yield a sequence of closed intervals 𝐼1 ⊇ 𝐼2 ⊇ 𝐼3 …. The length of 𝐼𝑘
is (𝑀 − 𝑚)/2𝑘 , so eventually these will get really small.
Due to the construction of these intervals, we can also define a subsequence {𝑎𝑛𝑘 }∞
𝑘=1 by selecting 𝑎𝑛𝑘 such that 𝑎𝑛𝑘 ∈ 𝐼𝑘 .
The technique we used here is called lion catching. How does a mathematician catch a lion in the desert? By cutting it in
half. The lion will be located at one half or the other. This section can be cut in half repeatedly until the area becomes
small. Thus, the lion will be trapped there eventually.
20.4 Problems
Problem 1. Let 𝐴 ⊆ ℝ be an arbitrary set. Show that there exists a sequence {𝑎𝑛 }∞ 𝑛=1 ⊆ 𝐴 such that lim𝑛→∞ 𝑎𝑛 →
sup 𝐴. (An identical statement is true for inf 𝐴 as well that can be shown in the same way.)
TWENTYONE
If I ask you to conjure up a random function from your mind, I am almost sure that you will show one that is both
continuous and differentiable. (Unless you have a weird taste as most mathematicians do.)
However, the vast majority of functions are neither. In terms of cardinality, if you count all real functions 𝑓 ∶ ℝ ↦ ℝ,
it turns out that there are 2𝑐 of them in total, but the subset of continuous ones have cardinality 𝑐. It is hard to imagine
such quantities: 𝑐 and 2𝑐 are both infinite, but, well, 2𝑐 is more infinite. Yeah, I know. Set theory is weird.
Overall, as we shall see, continuity and differentiability allow us to do meaningful work with functions. For instance,
the usual gradient descent-based optimization for neural networks doesn’t work if the loss function and the layers are not
differentiable. That alone would throw a huge monkey wrench into the cogs of machine learning since this is used all the
time in the deep learning part of the field.
This chapter explores how these concepts work together and ultimately enable us to train neural networks.
Recall that in the section about sequences, we defined limits of convergent sequences. Intuitively, limits capture the notion
that eventually, all elements get as close to the limit as we wish. This concept can be extended to functions as well.
lim 𝑓(𝑥) = 𝑎
𝑥→𝑥0
lim 𝑓(𝑥𝑛 ) = 𝑎
𝑛→∞
holds.
233
Mathematics of Machine Learning
1 if 𝑥 = 0,
𝑓(𝑥) = {
0 otherwise.
1 if 𝑥 ∈ ℚ,
𝐷(𝑥) = { (21.1)
0 otherwise.
This is the (in)famous Dirichlet function, which is hard to imagine and impossible to plot: its value is 1 at rationals and
0 at irrationals. Not surprisingly, lim𝑥→𝑥0 𝐷(𝑥) does not exist for all 𝑥0 , because rational and irrational numbers are
“dense”: every number 𝑥0 can be obtained as a limit of rationals and as a limit of irrationals.
Since limits of functions are defined as the common limit of sequences, many of its properties are inherited from se-
quences. How sequences behave under operations determines how function limits behave.
(b)
lim 𝑐𝑓(𝑥) = 𝑐 lim 𝑓(𝑥) for all 𝑐 ∈ ℝ,
𝑥→𝑥0 𝑥→𝑥0
(c)
lim 𝑓(𝑥)𝑔(𝑥) = lim 𝑓(𝑥) lim 𝑔(𝑥),
𝑥→𝑥0 𝑥→𝑥0 𝑥→𝑥0
(d) if 𝑓(𝑥) ≠ 0 in some small interval (𝑥0 − 𝜀, 𝑥0 + 𝜀) around 𝑥0 and lim𝑥→𝑥0 𝑓(𝑥) ≠ 0, then
1 1
lim = .
𝑥→𝑥0 𝑓(𝑥) lim𝑥→𝑥0 𝑓(𝑥)
Similarly as we have seen for convergent sequences, (a) and (b) above is referred to as the linearity of limits.
For technical reasons, we often want to be a bit more strict regarding the sequences 𝑥𝑛 when defining limits. There are
two particular cases: when we restrict the sequences to be strictly smaller or larger than the target. This is formalized by
the definition of left and right limits.
lim 𝑓(𝑥𝑛 ) = 𝑎
𝑛→∞
holds. This is called the right limit of 𝑓 at 𝑥0 . Similarly, the left limit lim𝑥→𝑥0 − 𝑓(𝑥) can be defined by restricting the
sequences {𝑥𝑛 } to be 𝑥𝑛 < 𝑥0 for all 𝑛.
Remember how the big and small O notation expressed asymptotic properties of sequences? We’ll have a similar tool for
functions as well.
If you have a sharp eye (and some experience in mathematics), you might have already posed the question: won’t showing
convergence of {𝑓(𝑥𝑛 )} for all sequences 𝑥𝑛 → 𝑥0 be difficult?
Indeed, it is often not the most convenient way to think about function limits. Another equivalent definition expresses
limits in terms of smaller and smaller neighborhoods around the point in question.
|𝑓(𝑥) − 𝑎| < 𝜀
holds.
Proof. (a) ⟹ (b). We are going to do this indirectly, so we assume that (a) holds and (b) is not true. The negation
of (b) states that there is a 𝜀 > 0 such that for every 𝛿 > 0, there is an 𝑥 ∈ (𝑥0 − 𝛿, 𝑥0 ) ∪ (𝑥0 , 𝑥0 + 𝛿) such that
|𝑓(𝑥) − 𝑎| > 𝜀. (If you don’t see why this is the negation, check out the introductory section about logic.)
Now we define a sequence that will contradict (a). If we select 𝛿 = 1/𝑛, we can let 𝑥𝑛 be the one in (𝑥0 − 𝛿, 𝑥0 ) ∪
(𝑥0 , 𝑥0 +𝛿) such that |𝑓(𝑥𝑛 )−𝑎| > 𝜀, as guaranteed by our assumption that (b) is false. Due to its construction, {𝑥𝑛 }∞
𝑛=1
does not converge to 𝑎. This contradicts (a), which completes our indirect proof.
(b) ⟹ (a). Let {𝑥𝑛 }∞ 𝑛=1 be an arbitrary sequence that converges to 𝑥0 . If 𝑛 is large enough (that is, larger than some
cutoff index 𝑁 ), 𝑥𝑛 will fall into (𝑥0 − 𝛿, 𝑥0 + 𝛿). Since (b) says that |𝑓(𝑥𝑛 ) − 𝑎| < 𝜀 here for all such 𝑛, we have
lim𝑛→∞ 𝑓(𝑥𝑛 ) = 𝑎 by the definition of convergence. □
In plain English, this theorem says that 𝑓(𝑥) gets arbitrarily close to lim𝑥→𝑥0 𝑓(𝑥) if 𝑥 is close enough to 𝑥0 . Definitions
similar to (b) are called epsilon-delta definitions.
There is yet another equivalent definition which, although might seem trivial, is a useful mental model when thinking
about limits.
Theorem 20.1.3
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function and 𝑥0 ∈ ℝ. Then the following are equivalent.
(a) lim𝑥→𝑥0 𝑓(𝑥) = 𝑎.
(b) There exists a function error(𝑥) such that lim𝑥→𝑥0 error(𝑥) = 0 and
𝑓(𝑥) = 𝑎 + error(𝑥)
holds.
Proof. (a) ⟹ (b). Due to how limits behave with respect to operations, it is easy to see that
error(𝑥) ∶= 𝑓(𝑥) − 𝑎
Often, we don’t need to know the exact limits of a function; it is enough to know that the limit is above or below a specific
bound.
To give a specific example, we will look slightly ahead and talk about differentiation. I’ll explain everything in the next
section in detail, but the derivative of a function 𝑓 at the point 𝑥0 is defined as the limit
𝑓(𝑥) − 𝑓(𝑥0 )
lim .
𝑥→𝑥0 𝑥 − 𝑥0
If the function is increasing, we have
𝑓(𝑥) − 𝑓(𝑥0 )
≥ 0,
𝑥 − 𝑥0
which, given the things we are about to see, implies that the derivative is positive.
Theorem 20.1.4
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function. If 𝑓(𝑥) ≥ 𝛼 for all 𝑥 ∈ (𝑎 − 𝛿, 𝑎) ∪ (𝑎, 𝑎 + 𝛿) and some 𝛼 ∈ ℝ lower bound,
then lim𝑥→𝑎 𝑓(𝑥) ≥ 𝛼 if the limit exists.
Proof. Due to the definition of function limits, this is the immediate consequence of the transfer principle for convergent
sequences. □
There are a few limit relations that come up all the time in calculations. These are the building blocks for calculating more
complicated limits, as they can often be reduced to a form for which the limit is known.
We won’t include the proofs here, as they are not that useful for our purposes. (Which is understanding how machine
learning algorithms work.)
Theorem 20.1.5
(a)
sin 𝑥
lim = 1. (21.2)
𝑥→0 𝑥
(b)
(c)
21.2 Continuity
With the extension of limits from sequences to functions, we saw that if the limit exists, it is not necessarily equal to the
function’s value at the given point. However, when it does, the function is much easier to handle. This is called continuity.
holds.
In other words, continuity means that if 𝑥 is close to 𝑦, then 𝑓(𝑥) will also be close to 𝑓(𝑦). This is how most of our
mental models work. This is also what we want from many machine learning models. For example, if 𝑓 is a model that
takes images and decides if they feature a cat or not, we would expect that after changing a few pixels on 𝑥, the prediction
would stay the same. (However, this is definitely not the case in general, which is exploited by certain adversarial attacks.)
We can rephrase the above definition a bit by unpacking the limits. If you think it through a bit, it is easy to see that
continuity of 𝑓 at 𝑎 is equivalent to
for all convergent sequences 𝑎𝑛 → 𝑎. We are going to use this very frequently.
As usual, we’ll see some examples first. We’ll revisit the ones we saw when discussing limits.
Example 1. Let’s revisit
1 if 𝑥 = 0,
𝑓(𝑥) = {
0 otherwise.
While 𝑓(𝑥) is not continuous at 0 since lim𝑥→0 𝑓(𝑥) = 0 ≠ 𝑓(0) as we have seen before, 𝑓(𝑥) is continuous everywhere
else. (Since it is constant 0.)
Note that even though the function is not continuous at 0, the limit does exist!
Example 2. What about the Dirichlet function 𝐷(𝑥)? (See (21.1) for the definition.) Since the limits doesn’t even exist,
this is a nowhere continuous function.
Example 3. Define
𝑥 if 𝑥 ∈ ℚ,
𝑓(𝑥) = {
−𝑥 otherwise.
Surprisingly, 𝑓(𝑥) is continuous at 0, but nowhere else. As you can see, (almost) nothing is off the table with continuity.
Functions, in general, can be wild objects, and without certain regularity conditions, optimizing them is extremely hard.
In essence, this chapter aims to understand when and how we can optimize functions that we used to do when training a
machine learning model.
One final example!
Example 4. We call a function an elementary function, if it can be obtained by taking a finite sum, product, and combi-
nation of
• constant functions,
• power functions 𝑥, 𝑥2 , 𝑥3 , …,
• n-th root functions 𝑥1/2 , 𝑥1/3 , 𝑥1/4 , …,
• exponential functions 𝑎𝑥 ,
• logarithms log𝑎 𝑥,
• trigonometric and inverse trigonometric functions sin 𝑥, cos 𝑥, arcsin 𝑥, arccos 𝑥,
−1 −1
• hyperbolic and inverse hyperbolic functions sinh 𝑥, cosh 𝑥, sinh 𝑥, cosh 𝑥.
For instance,
1 − 3𝑥 + 5𝑥4
𝑓(𝑥) = sin(𝑥2 + 𝑒𝑥 ) − √
2𝑥 − 𝑥
is an elementary function. Elementary functions are continuous wherever they are defined. This is going to be extremely
useful for us since showing the continuity of a complicated function like 𝑓(𝑥) is hard with the definition alone. This way,
if it is elementary, we know it is continuous. This will also be true for multivariate functions. (Like a neural network.)
A typical pattern in mathematics, as you have seen when discussing the properties of convergent sequences, is to study
certain properties on basic building blocks first, then show how it behaves with respect to operations.
As the previous example regarding the continuity of elementary functions illustrates, we are going to follow a similar
pattern here.
Theorem 20.3.1
Let 𝑓 and 𝑔 be two functions.
(a) If 𝑓 and 𝑔 are continuous at 𝑎, then 𝑓 + 𝑔 and 𝑓𝑔 are also continuous at 𝑎.
(b) If 𝑓 and 𝑔 are continuous at 𝑎 and 𝑔(𝑎) ≠ 𝑎, then 𝑓/𝑔 is also continuous at 𝑎.
(c) If 𝑔 is continuous at 𝑎 and 𝑓 is continuous at 𝑔(𝑎), then 𝑓 ∘ 𝑔 is also continuous at 𝑎.
Proof. (a) and (b) follows directly from the properties of limits.
To see (c), we simply let {𝑎𝑛 }∞
𝑛=1 be a sequence that converges to 𝑎. Then, using that 𝑓 is continuous at 𝑔(𝑎) and 𝑔 is
continuous at 𝑎, we have
So far, we have only defined continuity at a single point. In general, a function 𝑓 ∶ ℝ → ℝ is continuous on the set 𝐴
simply if it is continuous at its every point.
We have arrived at the point that partly explains why we love continuous functions and compact sets. The reason is simple:
functions that are continuous on compact sets are bounded there and attain their optima.
Theorem 20.4.1
Let 𝑓 be continuous on a compact set 𝐾. There exists 𝛼, 𝛽 ∈ 𝐾 such that 𝑓(𝛼) ≤ 𝑓(𝑥) ≤ 𝑓(𝛽) holds for all 𝑥 ∈ 𝐾.
which is what we had to show. An identical argument shows the existence of a 𝛽 ∈ 𝐾 such that 𝑓(𝑥) ≤ 𝑓(𝛽) for all
𝑥 ∈ 𝐾. □
1
This statement is not true for sets that are not closed and bounded. For example, 𝑓(𝑥) = 𝑥 is continuous on (0, 1], but
has no upper bound.
We conclude our study of continuity with this theorem.
Now that we are familiar with function limits and continuous functions, we are ready to tackle the first directly relevant
subject for machine learning: differentiation. We should take a look at how to analyze functions and what makes a function
“behave nicely”.
If you think through what machine learning is really about, you’ll find that it is quite straightforward from a bird’s view.
In essence, all we want to do is
1. design parametrized functions to explain the relations between data and observations,
2. find the parameters that give the best fit to our data.
To find models that are expressive enough yet it is easy to work with them, we need to restrict ourselves to functions that
satisfy certain properties. The two most important of those are continuity and differentiability. Now that we have seen
what continuity is, we can move on to study differentiable functions.
In the following few chapters, we will exclusively deal with univariate real functions. This is just to introduce concepts
without adding many layers of complexity at once. In later chapters, we are going to turn towards multivariate functions
slowly, and by the time we get to machine learning, we will master their use.
TWENTYTWO
DIFFERENTIATION IN THEORY
I turn with terror and horror from this lamentable scourge of continuous functions with no derivatives. —
Charles Hermite
In the history of science, a few milestones are as significant as inventing the wheel. Even among these, differentiation is
a highlight. With its invention, Newton essentially created mechanics as we know it. Differentiation enables space travel,
function optimization, or even epidemiological models.
Instead of jumping straight into the mathematical definition, let’s start our discussion with a straightforward example: a
point-like object moving along a straight line. Its movement is fully described by the time-distance plot, which shows its
distance from the starting point at a given time.
Our goal is to calculate the object’s speed at a given time. In high school, we learned that
distance
average speed = .
time
To put this into a quantitative form, if 𝑓(𝑥) denotes the time-distance function, and 𝑡1 < 𝑡2 are two arbitrary points in
time then,
𝑓(𝑡2 ) − 𝑓(𝑡1 )
average speed between 𝑡1 and 𝑡2 = .
𝑡2 − 𝑡 1
𝑓(𝑡2 )−𝑓(𝑡1 )
Expressions like 𝑡2 −𝑡1 are called differential quotients. Note that if the object moves backwards, the average speed
is negative.
The average speed has a simple geometric interpretation. If you replace the object’s motion with a constant velocity
motion moving at the average speed, you’ll end up at exactly the same place. In graphical terms, this is equivalent of
connecting (𝑡1 , 𝑓(𝑡1 )) and (𝑡2 , 𝑓(𝑡2 )) with a single line. The average speed is just the slope of this line.
Given this, we can calculate the exact speed at a single time point 𝑡, which we’ll denote with 𝑣(𝑡). The idea is simple: the
average speed in the small time-interval between 𝑡 and 𝑡 + Δ𝑡 should get closer and closer to 𝑣(𝑡) if Δ𝑡 is small enough.
(Δ𝑡 can be negative as well.)
So,
𝑓(𝑡 + Δ𝑡) − 𝑓(𝑡)
𝑣(𝑡) = lim ,
∆𝑡→0 Δ𝑡
if the above limit exists.
Following our geometric intuition, 𝑣(𝑡) is simply the slope of the tangent line of 𝑓 at 𝑡. Keeping this in mind, we are
ready to introduce the formal definition.
243
Mathematics of Machine Learning
𝑑𝑓 𝑓(𝑥0 ) − 𝑓(𝑥)
(𝑥 ) = lim
𝑑𝑥 0 𝑥→𝑥0 𝑥0 − 𝑥
In other words, if 𝑓 describes a time-distance function of a moving object, then the derivative is simply its speed.
Don’t let the change in notation from 𝑡 and 𝑡 + Δ𝑡 to 𝑥0 and 𝑥 confuse you, this means exactly the same as before. Similar
to continuity, differentiability is a local property. However, we’ll be more interested in functions that are differentiable
(almost) everywhere. In those cases, the derivative is a function, often denoted with 𝑓 ′ (𝑥).
Sometimes it is confusing that 𝑥 can denote the variable of 𝑓 and the exact point where the derivative is taken. Here is a
quick glossary of terms to clarify the difference between derivative and derivative function.
𝑑𝑓
• 𝑑𝑥 (𝑥0 ): derivative of 𝑓 with respect to the variable 𝑥 at the point 𝑥0 . This is a scalar, also denoted with 𝑓 ′ (𝑥0 ).
𝑑𝑓
• 𝑑𝑥 : derivative function of 𝑓 with respect to the variable 𝑥. This is a function, also denoted with 𝑓 ′ .
Let’s see some examples!
Example 1. 𝑓(𝑥) = 𝑥. For any 𝑥, we have
Thus, 𝑓(𝑥) = 𝑥 is differentiable everywhere and its derivative is the constant function 𝑓 ′ (𝑥) = 1.
Example 2. 𝑓(𝑥) = 𝑥2 . Here, we have
𝑓(𝑥) − 𝑓(𝑦) 𝑥2 − 𝑦2
lim = lim
𝑦→𝑥 𝑥−𝑦 𝑦→𝑥 𝑥 − 𝑦
(𝑥 − 𝑦)(𝑥 + 𝑦)
= lim
𝑦→𝑥 𝑥−𝑦
= lim 𝑥 + 𝑦
𝑦→𝑥
= 2𝑥.
So, 𝑓(𝑥) = 𝑥2 is differentiable everywhere and 𝑓 ′ (𝑥) = 2𝑥. Later, when talking about elementary functions, we’ll see
the general case 𝑓(𝑥) = 𝑥𝑘 .
Example 3. 𝑓(𝑥) = |𝑥| at 𝑥 = 0. For this, we have
|𝑦| 1 if 𝑦 > 0,
={
𝑦 −1 if 𝑦 < 0,
this limit does not exist. This is our first example of a non-differentiable function. However, |𝑥| is differentiable everywhere
else.
It is worth drawing a picture here to enhance our understanding of differentiability. Recall that the value of the derivative
at a given point equals the slope of the tangent line to the function’s graph. Since |𝑥| has a sharp corner at 0, the tangent
line is not well-defined.
Differentiability means no sharp corners in the graph, so differentiable functions are often called smooth. This is one
reason we prefer differentiable functions: the rate of change is tractable.
Next, we’ll see an equivalent definition of differentiability, involving local approximation with a linear function. From this
perspective, differentiability means manageable behavior: no wrinkles, corners, sharp changes in value.
To really understand derivatives and differentiation, we are going to take a look at it from another point of view: local
linear approximations.
Approximation is a very natural idea in mathematics. Have you ever thought about what happens when you punch
sin(2.18) into a calculator? We cannot express the function sin with finitely many additions and multiplications, so
we have to approximate it. In practice, functions of the form 𝑝(𝑥) = 𝑝0 + 𝑝1 𝑥 + … 𝑝𝑛 𝑥𝑛 , called polynomials, can be
evaluated easily. They are just a finite combination of additions and multiplications.
Can we just replace functions with polynomials to make computations easier? (Even at the cost of perfect precision.)
It turns out that we can. We will not go into details here, but every continuous function can be approximated by a
polynomial with arbitrary precision on a compact set. Actually,
𝑁
𝑥2𝑛+1
sin 𝑥 ≈ ∑(−1)𝑛 ,
𝑛=0
(2𝑛 + 1)!
In essence, differentiation is just a local approximation with a linear function. The following theorem makes this clear.
Recall that the small O notation means that the function is an order of magnitude smaller around 𝑥0 than the function
|𝑥 − 𝑥0 |.
If exists, the 𝛼 in the above theorem is the derivative 𝑓 ′ (𝑥0 ). In other words, 𝑓(𝑥) is can be locally written as
Proof. To show the equivalence of two statements, we have to prove that differentiation implies the desired property and
vice versa. Although this might seem complicated, it is straightforward and entirely depends on how functions can be
written as their limit plus an error term.
(a) ⟹ (b). The existence of the limit
𝑓(𝑥) − 𝑓(𝑥0 )
lim = 𝑓 ′ (𝑥0 )
𝑥→𝑥0 𝑥 − 𝑥0
implies that we can write the slope of the approximating tangent in the form
𝑓(𝑥) − 𝑓(𝑥0 )
= 𝑓 ′ (𝑥0 ) + error(𝑥),
𝑥 − 𝑥0
where lim𝑥→𝑥0 error(𝑥) = 0. (Recall that one equivalent form of limits states exactly this.)
With some simple algebra, we obtain
Since the error term tends to zero as 𝑥 goes to 𝑥0 , error(𝑥)(𝑥 − 𝑥0 ) = 𝑜(|𝑥 − 𝑥0 |), which is what we wanted to show.
(b) ⟹ (a). Now, repeat what we did in the previous part, just in reverse order. We can rewrite
in the form
𝑓(𝑥) − 𝑓(𝑥0 )
= 𝛼 + 𝑜(1),
𝑥 − 𝑥0
𝑓(𝑥) − 𝑓(𝑥0 )
lim = 𝛼.
𝑥→𝑥0 𝑥 − 𝑥0
One huge advantage of this form is that it will be easily generalized to multivariate functions. Even though we are far
from it, we can get a glimpse. Multivariate functions map vectors to scalars, so the ratio
𝑓(𝑥) − 𝑓(𝑥0 )
, 𝑥, 𝑥0 ∈ ℝ𝑛
𝑥 − 𝑥0
is not even defined. (Since we can’t divide with a vector.) However, the expression
𝑓(𝑥) = 𝑓(𝑥0 ) + ∇𝑓(𝑥0 )𝑇 (𝑥 − 𝑥0 ) + 𝑜(|𝑥 − 𝑥0 |)
makes perfect sense, since ∇𝑓(𝑥0 )𝑇 (𝑥−𝑥0 ) is a scalar. Here, ∇𝑓(𝑥0 ) denotes the gradient of 𝑓, that is, the multivariable
version of derivatives. ∇𝑓(𝑥0 ) is an n-dimensional vector. Don’t worry if you are not familiar with this notation, we’ll
cover everything in due time. The take-home message is that this alternative definition will be more convenient for us in
the future.
As the following theorem states, differentiation is a more strict condition than continuity.
Note that the previous theorem is not true the other way around: a function can be continuous, but not differentiable. (As
the example 𝑓(𝑥) = |𝑥| at 𝑥 = 0 shows.)
This can be taken to the extremes: there are functions that are everywhere continuous but nowhere differentiable. One of
the first examples was provided by Weierstrass (from the Bolzano-Weierstrass theorem). The function itself is defined by
the infinite sum
∞
𝑊 (𝑥) = ∑ 𝑎𝑛 cos(𝑏𝑛 𝜋𝑥),
𝑛=0
One last thing to do before we move on is to talk about higher-order derivatives. Because derivatives are functions, it is a
completely natural idea to calculate the derivative of derivatives. As we will see when studying the basics of optimization
in the next chapter, the second derivatives contain quite a lot of essential information regarding minima and maxima.
The 𝑛-th derivative of 𝑓 is denoted with 𝑓 (𝑛) , where 𝑓 (0) = 𝑓. There are a few rules regarding them that are worth
keeping in mind. Although, we have to note that a derivative function is not always differentiable, as the example
0 if 𝑥 < 0,
𝑓(𝑥) = {
𝑥2 otherwise.
Theorem 21.4.1
Let 𝑓 ∶ ℝ → ℝ and 𝑔 ∶ ℝ → ℝ be two arbitrary functions.
(a) (𝑓 + 𝑔)(𝑛) = 𝑓 (𝑛) + 𝑔(𝑛)
𝑛
(b) (𝑓𝑔)(𝑛) = ∑𝑘=0 (𝑛𝑘)𝑓 (𝑛−𝑘) 𝑔(𝑘)
Now, we assume that it is true for 𝑛 and deduce the 𝑛 + 1 case. For this, we have
𝑛
′ 𝑛 ′
(𝑓𝑔)(𝑛+1) = ((𝑓𝑔)(𝑛) ) = ∑ ( )(𝑓 (𝑛−𝑘) 𝑔(𝑘) )
𝑘=0
𝑘
𝑛
𝑛
= ∑ ( )[𝑓 (𝑛−𝑘+1) 𝑔(𝑘) + 𝑓 (𝑛−𝑘) 𝑔(𝑘+1) ]
𝑘=0
𝑘
𝑛 𝑛
𝑛 𝑛
= ∑ ( )𝑓 (𝑛−𝑘+1) 𝑔(𝑘) + ∑ ( )𝑓 (𝑛−𝑘) 𝑔(𝑘+1)
𝑘=0
𝑘 𝑘=0
𝑘
𝑛 𝑛−1
𝑛 𝑛 𝑛 𝑛
= ( )𝑓 (𝑛+1) 𝑔 + [ ∑ ( )𝑓 (𝑛+1−𝑘) 𝑔(𝑘) ] + [ ∑ ( )𝑓 (𝑛−𝑘) 𝑔(𝑘+1) ] + ( )𝑓𝑔(𝑛+1) .
0 𝑘=1
𝑘 𝑘=0
𝑘 𝑛
TWENTYTHREE
DIFFERENTIATION IN PRACTICE
During our first encounter with differentiation, we seen that computing derivatives by the definition
𝑓(𝑥0 ) − 𝑓(𝑥)
𝑓 ′ (𝑥0 ) = lim
𝑥→𝑥0 𝑥0 − 𝑥
can be really hard in practice if we encounter convoluted functions such as 𝑓(𝑥) = cos(𝑥) sin(𝑒𝑥 ). Similar to convergent
sequences and limits, using the definition of differentiation won’t get us far—the complexity piles on fast. So we have to
find ways to decompose the complexity into its fundamental building blocks.
First, we’ll look at the simplest of operations: scalar multiplication, addition, multiplication, and division.
𝑓(𝑥)𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥0 ) 𝑓(𝑥)𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥) + 𝑓(𝑥0 )𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥0 )
lim = lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥 0 𝑥 − 𝑥0
𝑓(𝑥)𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥) 𝑓(𝑥0 )𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥0 )
= lim + lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥0 𝑥 − 𝑥0
𝑓(𝑥) − 𝑓(𝑥0 ) 𝑔(𝑥) − 𝑔(𝑥0 )
= lim [ 𝑔(𝑥)] + 𝑓(𝑥0 ) lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥0 𝑥 − 𝑥0
= 𝑓 ′ (𝑥0 )𝑔′ (𝑥0 ) + 𝑓(𝑥0 )𝑔′ (𝑥0 ),
253
Mathematics of Machine Learning
For (d), we are going to start with the special case of (1/𝑔)′ . We have
1 1
𝑔(𝑥) − 𝑔(𝑥0 ) 1 𝑔(𝑥0 ) − 𝑔(𝑥)
lim = lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥0𝑔(𝑥)𝑔(𝑥0 ) 𝑥 − 𝑥0
′
𝑔 (𝑥0 )
=− ,
𝑔(𝑥0 )2
from which the general case follows by applying (c) to 𝑓 and 1/𝑔. □
There is one operation which we haven’t covered in the previous theorem: function composition. In the study of neural
networks, composition plays an essential role. Each layer can be thought of as a function, which are composed together
to form the entire network.
holds.
Proof. First, we rewrite the differential quotient into the following form:
Because 𝑔 is differentiable at 𝑥0 , it is also continuous there, so lim𝑥→𝑥0 𝑔(𝑥) = 𝑔(𝑥0 ). So, the first term can be rewritten
as
𝑓(𝑔(𝑥)) − 𝑓(𝑔(𝑥0 )) 𝑓(𝑦) − 𝑓(𝑔(𝑥0 ))
lim = lim = 𝑓 ′ (𝑔(𝑥0 )).
𝑥→𝑥0 𝑔(𝑥) − 𝑔(𝑥0 ) 𝑦→𝑔(𝑥0 ) 𝑦 − 𝑔(𝑥0 )
𝑓(𝑔(𝑥)) − 𝑓(𝑔(𝑥0 ))
lim = 𝑓 ′ (𝑔(𝑥0 ))𝑔′ (𝑥0 ),
𝑥→𝑥0 𝑥 − 𝑥0
As neural networks are just a huge composed functions, their derivative is calculated with the repeated application of the
chain rule. (Although the derivatives of its layers are vectors and matrices since they are multivariable functions.)
Following the already familiar pattern, now we calculate the derivatives for the most important class: the elementary
functions. There are a few ones that we will encounter all the time, like in the mean squared error, cross-entropy, Kullback-
Leibler divergence, etc.
You don’t necessarily have to know how to prove these. I’ll include the proof of (a), but feel free to skip it, especially if
this is your first encounter with calculus. What you have to remember, though, are the derivatives themselves. (However,
I’ll refer back to this part when necessary.)
Proof. (a) It is easy to see that for 𝑛 = 0, the derivative (𝑥0 )′ = 0. The case 𝑛 = 1 is also simple: calculating the
differential quotient shows that (𝑥)′ = 1. For the case 𝑛 ≥ 2, we are going to employ a small trick. Writing out the
differential quotient for 𝑓(𝑥) = 𝑥𝑛 , we obtain
𝑥𝑛 − 𝑥𝑛0
,
𝑥 − 𝑥0
which we want to simplify. If you don’t have a lot of experience in math, it might seem like magic, but 𝑥𝑛 − 𝑦𝑛 can be
written as
𝑥𝑛 − 𝑦𝑛 = (𝑥 − 𝑦)(𝑥𝑛−1 + 𝑥𝑛−2 𝑦 + ⋯ + 𝑥𝑦𝑛−2 + 𝑦𝑛−1 )
𝑛−1
= (𝑥 − 𝑦) ∑ 𝑥𝑛−1−𝑘 𝑦𝑘 .
𝑘=0
The case 𝑛 < 0 follows from 𝑥−𝑛 = 1/𝑥𝑛 using the rules of differentiation. □
With these rules under our belt, we can calculate the derivatives for some of the most famous activation functions.
The most classical one, the sigmoid function is defined by
1
𝜎(𝑥) = .
1 + 𝑒−𝑥
Since it is an elementary function, it is differentiable everywhere. To calculate its derivative, we can use the quotient rule:
′
′ 1 𝑒−𝑥
𝜎 (𝑥) = ( ) =
1 + 𝑒−𝑥 (1 + 𝑒−𝑥 )2
−𝑥 (23.1)
1 𝑒
=
1 + 𝑒−𝑥 1 + 𝑒−𝑥
= 𝜎(𝑥)(1 − 𝜎(𝑥)).
Another popular activation function is the ReLU, defined by
𝑥 if 𝑥 > 0,
ReLU(𝑥) = {
0 otherwise.
Let’s plot its graph first!
def relu(x):
if x > 0:
return x
else:
return 0
import numpy as np
import matplotlib.pyplot as plt
xs = np.linspace(-5, 5, 1000)
with plt.style.context("seaborn-white"):
plt.figure(figsize=(15, 5))
plt.title("Graph of the ReLU function")
plt.plot(xs, [relu(x) for x in xs], label="ReLU")
plt.legend()
1 if 𝑥 > 0,
ReLU′ (𝑥) = {
0 if 𝑥 < 0.
Even though ReLU is not differentiable at 0, this is not a problem in practice. When performing backpropagation, it is
extremely unlikely that ReLU′ (𝑥) will receive 0 as its input. Even if this is the case, the derivative can be artificially
extended to zero by defining it as 0.
Now that we have several tools under our belt to calculate derivatives, it’s time to think about implementations. Since we
have our own Function base class, a natural idea is to implement the derivative as a method. This is a simple solution
that is in line with object-oriented principles as well, so we should go for it!
class Function:
def __init__(self):
pass
def parameters(self):
return dict()
To see a concrete example, let’s revisit the Sigmoid function, whose derivative is given by (23.1):
class Sigmoid(Function):
def __call__(self, x):
return 1/(1 + np.exp(-x))
A simple implementation, yet colorful functionality. Now that we have the Sigmoid and its derivative in place, let’s plot
them together!
sigmoid = Sigmoid()
with plt.style.context("seaborn-white"):
plt.figure(figsize=(15, 5))
plt.title("Sigmoid and its derivative")
plt.plot(xs, [sigmoid(x) for x in xs], label="Sigmoid")
plt.plot(xs, [sigmoid.prime(x) for x in xs], label="Sigmoid prime")
plt.legend()
At this point, I probably emphasized the importance of function compositions and the chain rule dozens of times. We
finally reached a point when we are ready to implement a simple neural network and compute its derivative! (Of course,
our methods will be far more refined in the end, but still, this is a milestone.)
How to calculate the derivative for a composition of 𝑛 functions? To see the pattern, let’s start map out the first few cases.
For 𝑛 = 2, we have the good old chain rule
′
(𝑓2 (𝑓1 (𝑥))) = 𝑓2′ (𝑓1 (𝑥)) ⋅ 𝑓1′ (𝑥).
For 𝑛 = 3, we have
′
(𝑓3 (𝑓2 (𝑓1 (𝑥)))) = 𝑓3′ (𝑓2 (𝑓1 (𝑥))) ⋅ 𝑓2′ (𝑓1 (𝑥)) ⋅ 𝑓1′ (𝑥).
Among the multitude of parentheses, we can notice a pattern. First, we should calculate the value of the composed
function 𝑓3 ∘ 𝑓2 ∘ 𝑓1 at 𝑥 while storing the intermediate results, then pass these to the appropriate derivatives and take the
product of the result.
class Composition(Function):
def __init__(self, *functions):
self.functions = functions
for f in self.functions:
x = f(x)
return x
for f in self.functions:
x = f(x)
forward_pass.append(x)
return derivative
To see if our implementation works, we should test it on a simple test case, say for
𝑓1 (𝑥) = 2𝑥,
𝑓2 (𝑥) = 3𝑥,
𝑓3 (𝑥) = 4𝑥.
The derivative of the composition (𝑓3 ∘ 𝑓2 ∘ 𝑓1 )(𝑥) = 24𝑥 should be constant 24.
class Linear(Function):
def __init__(self, a, b):
self.a = a
self.b = b
def parameters(self):
return {"a": self.a, "b": self.b}
with plt.style.context("seaborn-white"):
plt.figure(figsize=(15, 5))
plt.title("The derivative of f(x) = 24x")
plt.plot(xs, ys, label="f prime")
plt.legend()
Success! Even though we only deal with single-variable functions for now, our Composition is going to be the skeleton
for neural networks.
So far, we have seen that in the cases when at least some formula is available for the function in question, we can apply
the rules of differentiation to obtain the derivative.
However, in practice, this is often not the case. For instance, think about the case when the function represents a recorded
audio signal.
If we can’t compute the derivative exactly, a natural idea is to approximate it, that is, provide an estimate that is sufficiently
close to the real value.
For the sake of example, suppose that we don’t know the exact formula of our function to be differentiated, which is
secretly the good old sine function.
import numpy as np
def f(x):
return np.sin(x)
𝑓(𝑥0 ) − 𝑓(𝑥)
𝑓 ′ (𝑥0 ) = lim .
𝑥→𝑥0 𝑥0 − 𝑥
Since we can’t take limits inside a computer (as computers can’t deal with infinity), the second best thing to do is to
approximate this by
𝑓(𝑥 + ℎ) − 𝑓(𝑥)
Δℎ 𝑓(𝑥) = , ℎ ∈ (0, ∞),
ℎ
where ℎ is an arbitrarily small but fixed quantity. Δℎ 𝑓(𝑥) is called the forward difference quotient. In theory, Δℎ 𝑓(𝑥) ≈
𝑓 ′ (𝑥) holds when ℎ is sufficiently small. Let’s see how they perform!
def f_prime(x):
return np.cos(x)
with plt.style.context("seaborn-white"):
plt.figure(figsize=(15, 5))
plt.title("Approximating the derivative with finite differences")
for h in hs:
ys = [delta(f, h, x) for x in xs]
plt.plot(xs, ys, label=f"h = {h}")
plt.plot(xs, f_prime_ys, label="the true derivative")
plt.legend()
Although the Δℎ 𝑓(𝑥) functions seem to be close 𝑓 ′ (𝑥) well when ℎ is small, there are a plethora of potential issues.
For one, Δℎ 𝑓(𝑥) = 𝑓(𝑥+ℎ)−𝑓(𝑥)
ℎ only approximates the derivative from the right of 𝑥, as ℎ > 0. To solve this, one might
use the backward difference quotient
𝑓(𝑥) − 𝑓(𝑥 − ℎ)
∇ℎ 𝑓(𝑥) = ,
ℎ
but that seem to have the same problems. The crux of the issue is that if 𝑓 is differentiable at some 𝑥0 , then
𝑓(𝑥 + ℎ) − 𝑓(𝑥 − ℎ)
𝛿ℎ 𝑓(𝑥) = , ℎ ∈ (0, ∞),
2ℎ
∆ℎ 𝑓(𝑥)+∇ℎ 𝑓(𝑥)
which is the average of forward and backward differences: 𝛿ℎ 𝑓(𝑥) = 2 . These three approximators are
called finite differences.
Even though symmetric differences are provably better, the approximation errors can be significantly amplified on the
long run.
All things considered, we are not going to use finite differences for machine learning in practice. However, as we’ll see,
the gradient descent method is simply a forward difference approximation of a special differential equation.
23.5 Problems
𝑒𝑥 − 𝑒−𝑥
tanh(𝑥) = .
𝑒𝑥 + 𝑒−𝑥
TWENTYFOUR
If someone gave you a function 𝑓 ∶ ℝ → ℝ defined by some tractable formula, how would you find its minima and
maxima? Take a moment and conjure up some ideas before moving on.
The first idea that comes to mind for most people is to evaluate the function for all possible values and simply find the
optimum. This method immediately breaks down due to multiple reasons. We can only perform finite evaluations, so this
would be impossible. Even if we try to define a discrete search grid cleverly and evaluate only there, this method takes an
unreasonable amount of time.
Another idea is to use some kind of inequality to provide an ad hoc upper or lower bound, then see if this bound can be
attained. However, this is nearly impossible for complex functions, like losses for neural networks.
However, derivatives provide an extremely useful way to optimize functions. Throughout the following sections, we will
study the relationship between derivatives and optimal points, and algorithms on how to find them.
Intuitively, the notion of minima and maxima is simple. Take a look at the example below.
Peaks of hills are the maxima, and bottoms of valleys are the minima. Minima and maxima are collectively called extremal
or optimal points. As our example demonstrates, we have to distinguish between local and global optima. The graph has
two valleys, and although both have a bottom, one of them is lower than the other.
We can graphically mark the local and global optima in our example based on this.
The really interesting part is finding these, as we’ll see next.
Let’s consider again our simple example above to demonstrate how derivatives are connected to local minima and maxima.
If we use a bit of geometric intuition, we can observe that the tangents are horizontal at the peaks of hills and the bottoms
of valleys.
In terms of derivatives, since they describe the slope of the tangent, it means that the derivative should be 0 there.
If we think about the function as the description of a motion along the real line, derivatives say that the motion stops there
and changes direction. It slows down first, stops, then immediately starts in the opposite direction. For instance, in the
local maxima case, the function increases up until that point, where it starts decreasing.
Again, we can describe this monotonicity behavior in terms of derivatives. Notice that when the function increases, the
derivative is positive (the object in motion has a positive speed). On the other hand, decreasing parts have a negative
derivative.
We can go ahead and put these intuitions into a mathematical form. First, we’ll start with the definitions of monotonicity
and their relation to the derivative. Then, we’ll connect all the dots and see how this comes together for characterizing the
optima.
263
Mathematics of Machine Learning
≥ 𝑓(𝑥), if 𝑥 ∈ (𝑎 − 𝛿, 𝑎),
𝑓(𝑎) {
≤ 𝑓(𝑥), if 𝑥 ∈ (𝑎, 𝑎 + 𝛿),
The locally decreasing and strictly locally decreasing properties are defined similarly, with the inequalities reversed.
For differentiable functions, the behavior of the derivative describes their local behavior in terms of monotonicity.
Theorem 23.1.1
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function that is differentiable at some 𝑎 ∈ ℝ.
(a) If 𝑓 ′ (𝑎) ≥ 0, then 𝑓 is locally increasing at 𝑎.
(b) If 𝑓 ′ (𝑎) > 0, then 𝑓 is strictly locally increasing at 𝑎.
(c) If 𝑓 ′ (𝑎) ≤ 0, then 𝑓 is locally decreasing at 𝑎.
(d) If 𝑓 ′ (𝑎) < 0, then 𝑓 is strictly locally decreasing at 𝑎.
Proof. We will only show (a), since the rest of the proofs go the same way. Due to how limits are defined,
𝑓(𝑥) − 𝑓(𝑎)
lim = 𝑓 ′ (𝑎) ≥ 0
𝑥→𝑎 𝑥−𝑎
means that once 𝑥 gets close enough to 𝑎, that is, 𝑥 is from a small neighborhood (𝑎 − 𝛿, 𝑎 + 𝛿),
𝑓(𝑥) − 𝑓(𝑎)
≥ 0, 𝑥 ∈ (𝑎 − 𝛿, 𝑎 + 𝛿)
𝑥−𝑎
holds. If 𝑥 > 𝑎, then because of the differential quotient is nonnegative, 𝑓(𝑥) ≥ 𝑓(𝑎) must hold. Similarly, for 𝑥 < 𝑎,
the nonnegativity of the differential quotient implies that 𝑓(𝑥) ≤ 𝑓(𝑎).
The proof for (b), (c), and (d) is almost identical, with the obvious changes in the inequalities. □
The propositions related to not strict monotonicity are true the other way around as well
Theorem 23.1.2
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function that is differentiable at some 𝑎 ∈ ℝ.
(a) If 𝑓 is locally increasing at 𝑎, then 𝑓 ′ (𝑎) ≥ 0.
(b) If 𝑓 is locally decreasing at 𝑎, then 𝑓 ′ (𝑎) ≤ 0.
Proof. Similarly as before, we will only show the proof of (a), since (b) can be done in the same way. If 𝑓 is locally
increasing at 𝑎, then the differential quotient is positive:
𝑓(𝑥) − 𝑓(𝑎)
≥ 0.
𝑥−𝑎
Using the transfer principle of limits, we obtain
𝑓(𝑥) − 𝑓(𝑎)
𝑓 ′ (𝑎) = lim ≥ 0,
𝑥→𝑎 𝑥−𝑎
which is what we had to prove. □
As we have seen in the introduction, the tangent at the extremal points is horizontal. Now it is time to put this introduction
into a mathematically correct form.
Extremal points have their global versions as well. The sad truth is, even though we always want global optimums, we
only have the tools to find local ones.
Note that a global optimum is also a local optimum, but not the other way around.
Theorem 23.2.1
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function that is differentiable at some 𝑎 ∈ ℝ. If 𝑓 has a local minima or maxima at 𝑎, then
𝑓 ′ (𝑎) = 0.
Proof. According to our previous theorem, if 𝑓 ′ (𝑎) ≠ 0, then it is either strictly increasing or decreasing locally. Since
this contradicts our assumption that 𝑎 is a local optimum, the theorem is proven. □
(In case you are interested, this was the principle of contraposition in action. From the negation of the conclusion, we
have shown the negation of the premises.)
It is very important to emphasize that the theorem is not true the other way around. For instance, the function 𝑓(𝑥) = 𝑥3
is strictly increasing everywhere, yet 𝑓 ′ (0) = 0.
Fig. 24.6: Graph of 𝑓(𝑥) = 𝑥3 as a counterexample to show that 𝑓 ′ (0) = 0 doesn’t imply local optimum.
In general, we call this behavior inflection. So, 𝑓(𝑥) = 𝑥3 is said to have an inflection point at 0. Inflection means a change
in behavior, which reflects the switch in its derivative from decreasing to increasing in this case. (The multidimensional
analogue of inflection is called a “saddle”, as we shall see later.)
So, we are not at our end goal yet, as the other half of the promised characterization is missing. The derivative is zero at
the local extremal points, but can we come up with a criterion that implies the existence of minima or maxima?
With the utilization of second derivatives, this is possible.
Let’s take a second look at our example, considering the local behavior of 𝑓 ′ this time, not just its sign. In the figure
below, the derivative is plotted along with our function.
The pattern seems simple: an increasing derivative implies a local minimum, a decreasing one means a local maximum.
This aligns with our intuition about derivative as speed: local maximum means that the object is going in a positive
direction, then stops and starts reversing.
Proof. Once again, we will only prove (a), since the proof of (b) is almost identical.
First, as we have seen when discussing the relation between derivatives and monotonicity, 𝑓 ′′ (𝑎) > 0 implies that 𝑓 ′ is
strictly locally increasing at 𝑎. Since 𝑓 ′ (𝑎) = 0, this means that
≤0 if 𝑥 ∈ (𝑎 − 𝛿, 𝑎]
𝑓 ′ (𝑥) {
≥0 if 𝑥 ∈ [𝑎, 𝑎 + 𝛿)
for some 𝛿 > 0. Because of the previously referenced theorem, 𝑓 is locally decreasing in (𝑎 − 𝛿, 𝑎] and locally increasing
in [𝑎, 𝑎 + 𝛿). This can only happen if 𝑎 is a local minimum. □
In some cases, we can extract a lot of information about the derivatives without explicitly calculating them. These results
are extremely useful in cases where we don’t have an explicit formula for the function or the formula might be too huge.
(Like in the case of neural networks.) In the following, we’ll get to know the Lagrange’s mean value theorems, connecting
the function’s behavior at the endpoints and inside an interval.
First, we start with a special case that states that the function attains the same value at the end of some interval [𝑎, 𝑏], then
its derivative is zero somewhere inside the interval.
Proof. If you are a visual person, take a look at the figure below. This is what we need to show.
To be mathematically precise, there are two cases. First, if 𝑓 is constant on [𝑎, 𝑏], then its derivative is zero on the entire
interval.
If 𝑓 is not constant, then it attains some value 𝑐 inside (𝑎, 𝑏) that is not equal to 𝑓(𝑎) = 𝑓(𝑏). For simplicity, suppose
that 𝑐 > 𝑓(𝑎). (The argument that follows goes through in the 𝑐 < 𝑓(𝑎) case with some obvious changes.) Since 𝑓 is
continuous, it attains its maximum there at a point 𝜉 ∈ [𝑎, 𝑏]. According to what we have just seen regarding the relation
of local maxima and the derivative, 𝑓 ′ (𝜉) = 0, which is what we had to show. □
Rolle’s theorem is an important stepping stone towards Lagrange’s mean value theorem, which we will show in the fol-
lowing.
𝑓(𝑏) − 𝑓(𝑎)
𝑓 ′ (𝜉) =
𝑏−𝑎
holds.
Proof. Again, let’s start with a visualization to get a grip on the theorem.
𝑓(𝑏)−𝑓(𝑎)
Recall that 𝑏−𝑎 is the slope of the line going through (𝑎, 𝑓(𝑎)) and (𝑏, 𝑓(𝑏)). This line is described by the function
𝑓(𝑏) − 𝑓(𝑎)
(𝑥 − 𝑎) + 𝑓(𝑎).
𝑏−𝑎
Using this, we introduce the function
𝑓(𝑏) − 𝑓(𝑎)
𝑔(𝑥) ∶= 𝑓(𝑥) − ( (𝑥 − 𝑎) + 𝑓(𝑎)).
𝑏−𝑎
We can apply Rolle’s theorem to 𝑔(𝑥), since 𝑔(𝑎) = 𝑔(𝑏) = 0. Thus, for some 𝜉 ∈ (𝑎, 𝑏), we have
𝑓(𝑏) − 𝑓(𝑎)
𝑔( 𝜉) = 0 = 𝑓 ′ (𝜉) − ,
𝑏−𝑎
𝑓(𝑏)−𝑓(𝑎)
implying 𝑓 ′ (𝜉) = 𝑏−𝑎 , which is what we had to show. □
Why are mean value theorems so important? In mathematics, they serve as a cornerstone in several results. To give you
one example, think about integration. (Perhaps you are familiar with this concept already. Don’t worry if not, we are
going to study it in detail later.) Integration is essentially the inverse of differentiation: if 𝐹 ′ (𝑥) = 𝑓(𝑥), then
𝑏
∫ 𝑓(𝑥)𝑑𝑥 = 𝐹 (𝑏) − 𝐹 (𝑎),
𝑎
24.5 Problems
TWENTYFIVE
When we encountered the concept of the derivative for the first time, we saw several of its faces. The derivative can be
thought of as
1. speed (if the function describes a time-distance graph of a moving object),
2. the slope of the tangent line of a function,
3. and the best linear approximator at a given point.
To understand how gradient descent works, we’ll see yet another interpretation: derivatives as vectors. For any differen-
tiable function 𝑓(𝑥), the derivative 𝑓 ′ (𝑥) can be thought of as a one-dimensional vector. If 𝑓 ′ (𝑥) is positive, it points
to the right. If it is negative, it points to the left. We can visualize this by drawing a horizontal vector to every point of
𝑓(𝑥)-s graph, where the length represents |𝑓 ′ (𝑥)| and the direction represents the sign.
Do you recall how monotonicity is characterized by the sign of the derivative? Negative derivative means decreasing,
positive means increasing. In other words, this implies that the derivative, as a vector, points towards the direction of the
increase.
Imagine yourself as a hiker on the x-y plane, where y signifies the height. How would you climb a mountain ahead of
you? By taking a step towards the direction of increase, that is, following the derivative. If you are not there yet, you can
still take another (perhaps smaller) step in the right direction, over and over again until you arrive. If you are right at the
top, the derivative is zero, so you won’t move anywhere.
This process is illustrated by Fig. 25.2.
What you have seen here is the gradient ascent in action. Now that we understand the main idea, we are ready to tackle
the mathematical details.
277
Mathematics of Machine Learning
Based on our intuition, the process is quite simple. First, we conjure up an arbitrary starting point 𝑥0 , then define the
sequence
where ℎ ∈ (0, ∞) is a parameter of our gradient descent algorithm, called the learning rate. In English, the formula
𝑥𝑛 + ℎ𝑓 ′ (𝑥𝑛 ) describes taking a small step from 𝑥𝑛 towards the direction of the increase, with step size ℎ𝑓 ′ (𝑥𝑛 ).
If things go our way, the sequence 𝑥𝑛 converges to a local maximum of 𝑓. However, things do not always go our way.
We’ll discuss these when talking about the issues of gradient descent.
But what about finding minima? In machine learning, we are trying to minimize loss functions. There is a simple trick:
′
the minima of 𝑓(𝑥) is the maxima of −𝑓(𝑥). So, since ( − 𝑓) = −𝑓 ′ , the definition of the approximating sequence 𝑥𝑛
changes to
𝑥𝑛+1 ∶= 𝑥𝑛 − ℎ𝑓 ′ (𝑥𝑛 ).
At this point, we have all the knowledge to implement the gradient descent algorithm. As usual, I encourage you to try
implementing your version before looking at mine. Coding is one of the most effective ways to learn.
import numpy as np
import nbimporter
from tools.function import Function
def gradient_descent(
f: Function,
x_init: float, # the initial guess
learning_rate: float = 0.1, # the learning rate
n_iter: int = 1000, # number of steps
return_all: bool = False # if true, returns all intermediate values
):
xs = [x_init] # we store the intermediate results for visualization
for n in range(n_iter):
x = xs[-1]
grad = f.prime(x)
x_next = x - learning_rate*grad
xs.append(x_next)
if return_all:
return xs
else:
return x
Let’s test the gradient descent out on a simple example, say 𝑓(𝑥) = 𝑥2 ! If all goes according to plan, the algorithm should
find the minimum 𝑥 = 0 in no time.
class Square(Function):
def __call__(self, x):
return x**2
f = Square()
gradient_descent(f, x_init=5.0)
7.688949513507002e-97
The result is as expected: our gradient_descent function successfully finds the minimum.
To visualize what happens, we can plot the process in its entirety.
(The plot_gradient_descent is a helper function to reduce boilerplate code in the book. Don’t worry about the
details. It just plots the result of the gradient descent.)
Even though the idea behind the gradient descent is sound, there are several issues. During our journey in machine
learning, we’ll see most of them fixed by variants of the algorithm, but it is worth looking at the potential problems of the
base version at this point.
First, the base gradient descent can get infinitely stuck at a local minima. To illustrate this, let’s take a look at the
𝑓(𝑥) = cos(𝑥) + 𝑥2 function, whose global minima is at 𝑥 = 0.
class CosPlusSquare(Function):
def __call__(self, x):
return np.sin(x) + 0.5*x
(continues on next page)
f = CosPlusSquare()
Note that if the initial point 𝑥0 is selected poorly, the algorithm is much less effective. This sensitivity to the initial
conditions is another weakness. It might not seem that much of an issue in a simple one-variable case that we have just
seen. However, this is a significant headache in the million-dimensional parameter spaces that we encounter when training
neural networks. Several methods can help to alleviate the issue, and we are going to see them when talking about weight
initialization for neural networks.
The starting point is not the only parameter of the algorithm; it depends on the learning rate ℎ as well. There are several
potential mistakes here: a too large learning rate results in the algorithm bouncing all around the space, never finding an
optimum. On the other hand, a too small one results in an extremely slow convergence.
In the case of 𝑓(𝑥) = 𝑥2 , starting the gradient descent from 𝑥0 = 1.0 with a learning rate of ℎ = 1.05, the algorithm
diverges, with 𝑥𝑛 oscillating at a larger and larger amplitude.
f = Square()
Can you come up with some solution ideas to these problems? No need to work anything out, just take a few minutes to
brainstorm and make a mental note about what comes to mind. In the later chapters, we’ll see several proposed solutions
for all of these problems, but putting some time into this is a very useful exercise.
We are almost at the end of our journey of introductory calculus. During the lectures so far, we mostly spent our time
with getting to know differentiation and its importance in optimization. However, there is a counterpart of differentiation
that will be essential for understanding probability and statistics: integration. In a sense, integration is the inverse of
differentiation, and later will be used to express quantities like expected values and loss functions. Let’s take a look at
them!
TWENTYSIX
INTEGRATION IN THEORY
When we first encountered the concept of the derivative, we introduced it through an example from physics. As Newton
created it, the derivative describes the speed of a moving object as calculated from its time-distance graph. In other words,
the speed can be derived from the time-distance information.
Can the distance be reconstructed given the speed? In a sense, this is the inverse of differentiation.
Questions such as these are hard to answer if we only look at the most general case, so let’s consider a special one. Suppose
that our object is moving with a constant speed 𝑣(𝑡) = 𝑣0 𝑚 𝑠 , for a duration of 𝑇 seconds. With some elementary logic,
we can conclude that the total distance traveled is 𝑣0 𝑇 meters.
When taking a look at the time-speed plot, we can immediately see that the distance is the area under the time-speed
function graph 𝑣(𝑡) = 𝑣0 . The graph of 𝑣(𝑡) describes a rectangle with width 𝑣0 and length 𝑇 , hence its area is indeed
𝑣0 𝑇 .
Does the area under 𝑣(𝑡) equal the distance traveled in the general case? For instance, what happens when the time-speed
plot looks something like this?
The speed is not constant here. In this case, we can do a simple trick: partition the time interval [0, 𝑇 ] into smaller ones
and approximate the object’s motion as a constant-speed motion on each of these intervals.
If the time intervals [𝑡𝑖 , 𝑡𝑖+1 ] are sufficiently granular, the distance travelled will roughly match a constant velocity motion
with the average speed at [𝑡𝑖 , 𝑡𝑖+1 ]. That is, if we introduce the notation
we should have
𝑛
∑ 𝑣𝑖 (𝑡𝑖 − 𝑡𝑖−1 ) ≈ total distance traveled during [0, 𝑇 ].
𝑖=1
Let’s think about this whole process as approximating the function 𝑣(𝑡) with a stepwise constant function 𝑣approx (𝑡). From
this angle, we have
𝑛
∑ 𝑣𝑖 (𝑡𝑖 − 𝑡𝑖−1 ) = area under 𝑣approx (𝑡),
𝑖=1
but we also
(Very) loosely speaking, if the granularity of the time intervals [𝑡𝑖 , 𝑡𝑖+1 ] gets infinitesimally small, the approximations
turn into equality. Thus,
total distance traveled during [0, 𝑇 ] = area under 𝑣approx (𝑡) in [0, 𝑇 ].
There are two key points that we need to remember: if 𝑠(𝑡) is the distance traveled and 𝑣(𝑡) is the speed, then
287
Mathematics of Machine Learning
289
Mathematics of Machine Learning
converge if the partition of [0, 𝑇 ] gets more granular? Does the limit depend on the partitions? Can we even define the
area under the “graph” for all functions? Like the Dirichlet function, defined by
1 if 𝑥 is rational
𝐷(𝑥) = { (26.1)
0 otherwise,
𝑛
How do we calculate limits of ∑𝑖=1 𝑣𝑖 (𝑡𝑖 − 𝑡𝑖−1 ) in practice? In addition, what does all of this have to do with machine
learning?
Fasten your seatbelts! Here comes the rigorous study of integration, clearing up all of these questions above.
Let’s build a solid theoretical foundation for the above intuitive explanation! Let 𝑓 ∶ [𝑎, 𝑏] → ℝ be an arbitrary bounded
function, and our goal is to calculate the signed area under the graph. (Note that the signed area is negative if the graph
goes below the 𝑥 axis. In the time-speed graph example above, this is equivalent to moving backward, thus decreasing
the distance traveled from the starting point.)
Let 𝑎 = 𝑥0 < 𝑥1 < ⋯ < 𝑥𝑛 = 𝑏 an arbitrary partition of the interval [𝑎, 𝑏]. For notational convenience, we’ll denote
this partition as 𝑋 = {𝑥0 , … , 𝑥𝑛 } as well. The granularity (or mesh) of 𝑋 is defined by
which is the length of the biggest gap in 𝑋. Note that the partition is not necessarily uniform, so |𝑥𝑖 − 𝑥𝑖−1 | is not
constant.
We are going to use an argument similar to the squeeze principle to make the approximation idea rigorous. (You know,
the one where we replaced the speed of a moving object with a piecewise constant one.) Instead of using the averages of
𝑓(𝑥) on each interval [𝑥𝑖−1 , 𝑥𝑖 ], we are going to provide an upper and lower estimation by using
𝑚𝑖 ∶= inf 𝑓(𝑥)
𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ]
and
𝑀𝑖 ∶= sup 𝑓(𝑥).
𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ]
Mathematically speaking, the infimum and the supremum are much easier to work with than the average. Now we can
approximate 𝑓(𝑥) with a piecewise constant function from both above and below. This is visualized by Fig. 26.4.
Our plan is to squeeze the area between the lower and upper sums
𝑛
𝐿[𝑓, 𝑋] ∶= ∑ 𝑚𝑖 (𝑥𝑖 − 𝑥𝑖−1 ) (26.2)
𝑖=1
Fig. 26.4: Estimating the area under the curve of 𝑓 using the partition 𝑋
and
𝑛
𝑈 [𝑓, 𝑋] ∶= ∑ 𝑀𝑖 (𝑥𝑖 − 𝑥𝑖−1 ), (26.3)
𝑖=1
then study if these two match. (As usual, the dependence on 𝑓 and 𝑋 will be omitted if it is clear from the context.)
It is clear from the construction that
We need to introduce some basic facts about refining partitions to construct mathematically correct arguments regarding
the convergence of the approximating sums 𝐿[𝑓, 𝑋] and 𝑈 [𝑓, 𝑋].
Proposition 25.1.1
Let 𝑓 ∶ [𝑎, 𝑏] → ℝ be a bounded function and 𝑋 and 𝑌 be two partitions of [𝑎, 𝑏]. Suppose that 𝑌 is a refinement of 𝑋.
Then
and
Proof. We are going to show 𝐿[𝑓, 𝑋] ≤ 𝐿[𝑓, 𝑌 ], as (26.5) follows from a similar argument. Suppose that 𝑥𝑖−1 ≤ 𝑦𝑗 ≤
⋯ ≤ 𝑦𝑙 ≤ 𝑥𝑖 . Mathematically speaking, we have
𝑙
Since 𝑥𝑖 − 𝑥𝑖−1 = ∑𝑘=𝑗+1 𝑦𝑘 − 𝑦𝑘−1 , the above implies that
𝑙
inf 𝑓(𝑥)(𝑥𝑖 − 𝑥𝑖−1 ) = ∑ inf 𝑓(𝑥)(𝑦𝑘 − 𝑦𝑘−1 )
𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ] 𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ]
𝑘=𝑗+1
(26.6)
𝑙
≤ ∑ inf 𝑓(𝑥)(𝑦𝑘 − 𝑦𝑘−1 ).
𝑥∈[𝑦𝑘−1 ,𝑦𝑘 ]
𝑘=𝑗+1
Don’t worry if these mathematical formalisms make this hard to follow. Just take a look at Fig. 26.6 below, which
summarizes all that we have done so far.
Since 𝐿[𝑓, 𝑋] and 𝐿[𝑥, 𝑌 ] are composed from parts like in (26.6), summing over 𝑖 in the above immediately yields
𝐿[𝑓, 𝑋] ≤ 𝐿[𝑥, 𝑌 ]. □
We are almost there. There is one thing left for us to show: that for any two partitions, the lower sum is always smaller
than the upper sum. Hence, the squeeze principle could be applied to show that the lower and upper sums converge to the
same limit in some instances.
For, we need a simple but fundamental fact about partitions.
Proposition 25.1.2
Let 𝑋 and 𝑌 be two partitions of [𝑎, 𝑏]. Then there is a partition 𝑍 that is a refinement for both 𝑋 and 𝑌 .
The above 𝑍 is called a mutual refinement of 𝑋 and 𝑌 . We can show a fundamental relation between the upper and lower
sums with this idea.
Proposition 25.1.3
Let 𝑓 ∶ [𝑎, 𝑏] → ℝ be a bounded real function and let 𝑋 and 𝑌 be two partitions of the interval [𝑎, 𝑏]. Then
𝐿[𝑓, 𝑋] ≤ 𝑈 [𝑓, 𝑌 ]
holds.
Proof. Let 𝑍 be a mutual refinement of 𝑋 and 𝑌 , as guaranteed by the previous result. Then, (26.4) and (26.5) implies
that
𝐿[𝑓, 𝑋] ≤ 𝐿[𝑓, 𝑍] ≤ 𝑈 [𝑓, 𝑍] ≤ 𝑈 [𝑓, 𝑌 ],
which is what we wanted to show. □
This value is called the Riemann integral (or just the integral) of 𝑓 over [𝑎, 𝑏], denoted by
𝑏
∫ 𝑓(𝑥)𝑑𝑥.
𝑎
𝑏
The function 𝑓 in ∫𝑎 𝑓(𝑥)𝑑𝑥 is called the integrand. How do we calculate the integral itself? The hard way is to define a
sequence of partitions 𝑋𝑛 and show that
lim 𝐿[𝑓, 𝑋𝑛 ] = lim 𝑈 [𝑓, 𝑋𝑛 ],
𝑛→∞ 𝑛→∞
𝑏
so this number is necessarily ∫𝑎 𝑓(𝑥)𝑑𝑥. We’ll see the easy way soon, but let’s see an example demonstrating this process.
1
Let’s calculate ∫0 𝑥2 𝑑𝑥! The simplest is to use the uniform partition 𝑋𝑛 = {𝑖/𝑛}𝑛𝑖=0 , obtaining
𝑛 2
𝑖−1 1
𝐿[𝑥2 , 𝑋𝑛 ] = ∑ ( )
𝑖=1
𝑛 𝑛
1 𝑛
= ∑(𝑖 − 1)2 .
𝑛3 𝑖=1
𝑛 𝑛(𝑛+1)(2𝑛+1)
Since ∑𝑘=1 𝑘2 = 6 (as it can be shown by induction), it is easy to see that
1
lim 𝐿[𝑥2 , 𝑋𝑛 ] = .
𝑛→∞ 3
1 1
With a similar argument, you can check that lim𝑛→∞ 𝑈 [𝑥2 , 𝑋𝑛 ] = 3 as well, thus, ∫0 𝑥2 𝑑𝑥 exists and
1
1
∫ 𝑥2 𝑑𝑥 = .
0 3
Although this method works for simple cases such as 𝑓(𝑥) = 𝑥2 , it breaks down for more complex functions, as calculating
limits of upper and lower sums can be difficult. In addition, selecting the right partition is also a challenge. For instance,
𝜋
can you calculate ∫0 sin(𝑥)𝑑𝑥 by the definition?
Because we are lazy (just like any good mathematician), we want to find a general method to calculate integrals. We’ll
see this in the next section.
Lower and upper sums are needed to make the notion of an integral mathematically precise. Combined with the squeeze
principle, they are used to provide a definition.
However, other tools become available once we know that a function is integrable. Such as the general approximating
sum, as we are about to see next.
Theorem 25.3.1
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function, and let 𝑋𝑛 = {𝑥0,𝑛 , … , 𝑥𝑛,𝑛 } be a sequence of partitions on [𝑎, 𝑏] such that
|𝑋𝑛 | → 0. Then 𝑓 is integrable if and only if the limit
𝑛
lim ∑ 𝑓(𝜉𝑖 )(𝑥𝑖,𝑛 − 𝑥𝑖−1,𝑛 )
𝑛→∞
𝑖=1
We will not prove the above theorem, as the proof is technical and doesn’t provide any valuable insight. However, the
point is clear: local infima and suprema in lower and upper sums can be replaced with any local value.
For simplicity, we’ll denote this sum by
𝑛
𝑆[𝑓, 𝑋, 𝜉𝑋 ] = ∑ 𝑓(𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ) (26.7)
𝑖=1
Now that we understand the mathematical definition of the integral, it is time to find some tools that enable its use in
practice. The most important result is the Newton-Leibniz formula, named after Isaac Newton and Gottfried Wilhelm
Leibniz, the inventors of calculus. (Fun fact: these men discovered calculus independently and were mortal enemies
throughout their lives.)
Theorem 25.4.1 (The fundamental theorem of calculus, a.k.a. the Newton-Leibniz formula.)
Let 𝑓 ∶ ℝ → ℝ a function that is integrable on some [𝑎, 𝑏] and suppose that there is an 𝐹 ∶ ℝ → ℝ such that 𝐹 ′ (𝑥) = 𝑓(𝑥).
Then
𝑏
∫ 𝑓(𝑥)𝑑𝑥 = 𝐹 (𝑏) − 𝐹 (𝑎) (26.8)
𝑎
holds.
𝑥
In other words, by defining 𝑥 ↦ 𝐹 (𝑎) + ∫𝑎 𝑓(𝑥)𝑑𝑥, we can effectively reconstruct a function from its derivative.
Proof. Let 𝑎 = 𝑥0 < 𝑥1 < ⋯ < 𝑥𝑛 = 𝑏 be an arbitrary partition of [𝑎, 𝑏] According to Lagrange's mean value theorem,
there is a 𝜉𝑖 ∈ (𝑥𝑖−1 , 𝑥𝑖 ) for all 𝑖 = 1, … , 𝑛 such that
Thus, we can sum these numbers up, eliminating all but the first and last elements:
𝑛 𝑛
∑ 𝑓(𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ) = ∑ 𝐹 (𝑥𝑖 ) − 𝐹 (𝑥𝑖−1 )
𝑖=1 𝑖=1
= 𝐹 (𝑏) − 𝐹 (𝑎).
On the other hand, due to the properties of lower and upper sums, we have
𝑛
𝐿[𝑓, 𝑋] ≤ ∑ 𝑓(𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ) ≤ 𝑈 [𝑓, 𝑋]
𝑖=1
Note that integration is insensitive towards changing the values of 𝑓(𝑥) at countably many points. To be more precise,
suppose that 𝑓 ∶ ℝ → ℝ is a function that is integrable on [−1, 1]. Let’s change its value at a single point and define
𝑓(0) + 1 if 𝑥 = 0
𝑓 ∗ (𝑥) = {
𝑓(𝑥) otherwise.
and
holds. We can select the partition such that 𝑥𝑘 − 𝑥𝑘−1 < 𝜀 = min{𝑚, 𝑀 } for some arbitrary 𝜀 > 0, thus, ∣𝐿[𝑓, 𝑋] −
𝐿[𝑓 ∗ , 𝑋]∣ and ∣𝑈 [𝑓, 𝑋] − 𝑈 [𝑓 ∗ , 𝑋]∣ can be made as small as needed. This implies that
𝑏 𝑏
∫ 𝑓(𝑥)𝑑𝑥 = ∫ 𝑓 ∗ (𝑥)𝑑𝑥.
𝑎 𝑎
Hence, saying that integration is the inverse of differentiation is mathematically a bit imprecise. Given a differentiable
𝑥
function 𝐹 (𝑥), its derivative is unique, but there are infinitely many functions whose integral 𝐹 (𝑎) + ∫𝑎 𝑔(𝑦)𝑑𝑦 recon-
structs 𝐹 .
After all this theory, you might ask: what does integration have to do with machine learning? Without being mathemati-
cally rigorous, here is a (very) brief overview of what’s to come.
First, you can think about integration as a continuous generalization of the arithmetic mean. As you can see, for equidistant
partitions, an approximating sum
1 𝑛
𝑆[𝑓, 𝑋, 𝜉] = ∑ 𝑓(𝜉𝑖 )
𝑛 𝑖=1
is exactly the average of 𝑓(𝜉1 ), … , 𝑓(𝜉𝑛 ). In machine learning, averages are frequently used to express quantities. Think
about it: overall loss functions are often averages of certain individual losses. On a fine enough scale, averages become
integrals.
Along with linear algebra and calculus, the central pillar of machine learning is probability theory and statistics, which gives
us a way to model the world based on our observations. Probability and statistics are the logic of science and decision-
making. There, integration is used to express probabilities, expected value, information, and many more. Without a
rigorous theory of integration, we cannot build probabilistic models beyond a certain point.
TWENTYSEVEN
INTEGRATION IN PRACTICE
Even though we understand what an integral is, we are far from computing them in practice. As opposed to differentiation,
analytically evaluating an integral can be really difficult and often downright impossible. The formula (26.8) suggests that
the key is to find the function whose derivative is the integrand, called the antiderivative or primitive function. This is
harder than you think. Nevertheless, there are several tools for this, and we are going to devote this section to study the
most important ones.
The key is often finding the antiderivative, so we introduce the notation
𝐹 = ∫ 𝑓𝑑𝑥,
for the functions where 𝐹 ′ = 𝑓. Note that since (𝐹 + some constant)′ = 𝐹 ′ , the antiderivative ∫ 𝑓𝑑𝑥 is not uniquely
determined. However, this is not an issue for us, as the Newton-Leibniz formula states that
𝑏
∫ 𝑓(𝑥)𝑑𝑥 = 𝐹 (𝑏) − 𝐹 (𝑎).
𝑎
As we have seen this several times (for instance when discussing the rules of differentiation), the relations of an operation
with addition, multiplication, and possibly others are extremely useful for gaining insight and develop practical tools.
This is the same for integration as well. Similarly as before, the linearity of the integral is our main tool to evaluate them.
Proof. (a) If 𝑓 and 𝑔 is integrable, then for any 𝜀 > 0, there are partitions 𝑋𝑓 , 𝑋𝑔 such that
𝑏 𝑏
∫ 𝑓(𝑥)𝑑𝑥 − 𝜀 ≤ 𝐿[𝑓, 𝑋𝑓 ] ≤ 𝑈 [𝑓, 𝑋𝑓 ] ≤ ∫ 𝑓(𝑥)𝑑𝑥 + 𝜀
𝑎 𝑎
299
Mathematics of Machine Learning
and
𝑏 𝑏
∫ 𝑔(𝑥)𝑑𝑥 − 𝜀 ≤ 𝐿[𝑔, 𝑋𝑔 ] ≤ 𝑈 [𝑔, 𝑋𝑔 ] ≤ ∫ 𝑔(𝑥)𝑑𝑥 + 𝜀,
𝑎 𝑎
where the lower and upper sums are defined by (26.2) and (26.3). So, for the mutual refinement 𝑋 = 𝑋𝑓 ∪ 𝑋𝑔 , we have
𝑏 𝑏
∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥 − 2𝜀 ≤ 𝐿[𝑓, 𝑋] + 𝐿[𝑔, 𝑋]
𝑎 𝑎
≤ 𝑆[𝑓, 𝑋, 𝜉𝑋 ] + 𝑈 [𝑔, 𝑋, 𝜉𝑋 ]
≤ 𝑈 [𝑓, 𝑋𝑓 ] + 𝑈 [𝑔, 𝑋𝑔 ]
𝑏 𝑏
≤ ∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥 + 2𝜀,
𝑎 𝑎
where 𝑆 is defined by (26.7). From this definition, it can also be seen that
Thus,
𝑏 𝑏
∣ ∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥 − 𝑆[𝑓 + 𝑔, 𝑋, 𝜉𝑋 ]∣ ≤ 𝜀,
𝑎 𝑎
implying that
𝑏 𝑏
lim 𝑆[𝑓 + 𝑔, 𝑋, 𝜉𝑋 ] = ∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥
|𝑋|→0 𝑎 𝑎
𝑏
= ∫ 𝑓(𝑥) + 𝑔(𝑥)𝑑𝑥.
𝑎
Our Theorem regarding the approximating sum 𝑆 implies that 𝑓 + 𝑔 is integrable on [𝑎, 𝑏] and
𝑏 𝑏 𝑏
∫ 𝑓(𝑥) + 𝑔(𝑥) = ∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥.
𝑎 𝑎 𝑎
As we have learned when studying the rules of differentiation, for an arbitrary 𝑓 and 𝑔, we have
(𝑓𝑔)′ = 𝑓 ′ 𝑔 + 𝑓𝑔′ .
𝑓𝑔 = ∫ 𝑓 ′ 𝑔 + 𝑓𝑔′ 𝑑𝑥
holds. Rearranging the equation a bit, we obtain the formula of partial integration:
∫ 𝑓 ′ 𝑔 = 𝑓𝑔 − ∫ 𝑓𝑔′ . (27.1)
holds.
How is this useful for us? Consider a situation where finding the antiderivative of 𝑓 and the derivative of 𝑔 is easy, but
the antiderivative of the product 𝑓𝑔 is hard. For example, can you quickly calculate
∫ 𝑥 log 𝑥𝑑𝑥?
Applying (27.1) with the roles 𝑓 ′ (𝑥) = 𝑥 and 𝑔(𝑥) = log 𝑥 immediately yields
1 2 1
∫ 𝑥 log 𝑥 = 𝑥 log 𝑥 − ∫ 𝑥𝑑𝑥
2 2
1 2 1 2
= 𝑥 log 𝑥 − 𝑥 .
2 4
As the partial integration formula is the “opposite” of the differentiation rule for products, there is an analogue for the
chain formula as well. Recall that for two differentiable functions, we had
holds.
This is called integration by substitution. To give you an example of its use, consider
∫ 𝑥 sin 𝑥2 𝑑𝑥.
Partial integration and substitution are our main weapons when calculating integrals on paper. Most of the integrals one
might encounter can be solved with the creative (and possible iterated) application of these two rules. The recipe is simple:
find the antiderivative, then use the Newton-Leibniz formula to compute the value of the integral.
However, there is a serious issue: antiderivatives can be extremely hard to find, maybe even impossible. This makes
integrals difficult to compute symbolically. For instance, consider
2
∫ 𝑒−𝑥 𝑑𝑥,
2 2
where the function 𝑒−𝑥 describes the well-known Gaussian bell curve. As surprising as it is, ∫ 𝑒−𝑥 𝑑𝑥 cannot be de-
scribed with a closed formula! (That is, one that uses a finite number of operations and only elementary functions.) It’s
not that mathematicians were not clever enough to discover that, this is proven to be impossible.
Thus, computing integrals is much simpler to do numerically. This is in stark contrast with differentiation, which is easy
to do symbolically, but hard numerically.
Instead of using symbolic computation to get the exact value of an integral, we will resolve to approximation once again.
Previously, Theorem 25.3.1 showed us that an integral is the limit of the Riemann-sums:
𝑏 𝑛
∫ 𝑓(𝑥)𝑑𝑥 = lim ∑ 𝑓(𝜉𝑖 )(𝑥𝑖,𝑛 − 𝑥𝑖−1,𝑛 ), (27.2)
𝑛→∞
𝑎 𝑖=1
where 𝑋𝑛 = {𝑥0,𝑛 , … , 𝑥𝑛,𝑛 } is a partition of [𝑎, 𝑏] and 𝜉𝑖 ∈ [𝑥𝑖−1,𝑛 , 𝑥𝑖,𝑛 ] are arbitrary intermediate values.
𝑛 𝑏
In other words, if 𝑛 is large enough, the sum ∑𝑖=1 𝑓(𝜉𝑖 )(𝑥𝑖,𝑛 − 𝑥𝑖−1,𝑛 ) is close to ∫𝑎 𝑓(𝑥)𝑑𝑥. There are two crucial
issues: first, how to select the partition and the intermediate values; second, how fast is the convergence?
If we want to make (27.2) useful, we have to devise a concrete method that prescribes the 𝑥𝑖 -s, 𝜉𝑖 -s, and tells us how
large of an 𝑛 should we select. This is an extremely rich subject that has been the focus of studies ever since the since the
introduction of integration. So, there is a lot to talk about here. To keep things simple, let’s just focus on the essentials.
The most straightforward method is to select an uniform partition, then approximate the area under the function curve
with a sequence of trapezoids. That is, let 𝑋 = {𝑎, 𝑎 + 𝑏−𝑎 𝑏−𝑎
𝑛 , 𝑎 + 2 𝑛 , … , 𝑏}, and
This is called the trapezoidal rule. It might seem complicated, but (27.3) is just a weighted sum of the 𝑓(𝑥𝑖 ) values.
Its rate of convergence is quadratic, as stated by the following theorem.
Fig. 27.1: Approximating the area under a function with successive trapezoids.
1 𝑛 𝑓(𝑥𝑖−1 ) + 𝑓(𝑥𝑖 )
𝐼𝑛 ∶= ∑
𝑛 𝑖=1 2
There are other methods, for instance, Simpson’s rule approximates the function with a piecewise quadratic one. (Instead
of a piecewise linear one, like we did for the trapezoidal rule.) Since the approximation is more accurate, the convergence
is also faster: Simpson’s rule converges at a 𝑂(𝑛−4 ) rate. Without going into details, it is given by
⌊𝑛/2⌋
𝑏−𝑎
𝑆𝑛 = ∑ 𝑓(𝑥2𝑖−2, ) + 4𝑓(𝑥2𝑖−𝑖 ) + 𝑓(𝑥2𝑖 ), (27.4)
3𝑛 𝑖=1
𝑏
with ∣ ∫𝑎 𝑓(𝑥)𝑑𝑥 − 𝑆𝑛 ∣ = 𝑂(𝑛−4 ), where 𝑥𝑖 is again the equidistant partition 𝑥𝑖 = 𝑎 + 𝑖 𝑏−𝑎
𝑛 .
The formula (27.4) can be difficult to unpack, but the essence remains the same: we compute the function’s values at
given points, then take their weighted sum.
To show you how straightforward the trapezoidal rule is, let’s implement in practice! To keep it simple, we are imple-
menting this as a function that takes another function as its input.
return I_n
This can be made even simpler with NumPy, but I’ll leave this to you as an exercise. Let’s test it on an example instead!
With the use of the Newton-Leibniz formula, you can verify that
1
1
∫ 𝑥2 𝑑𝑥 = .
0 3
(We even computed this with our bare hands, using lower and upper sums.) After plugging in the function lambda x:
x**2 into trapezoidal_rule, we can see that this method is indeed correct.
with plt.style.context("seaborn-white"):
plt.figure(figsize=(7, 7))
ns = range(1, 25, 1)
Is = [trapezoidal_rule(lambda x: x**2, 0, 1, n) for n in ns]
plt.scatter(ns, Is)
27.5 Conclusion
Congratulations! With all these knowledge about integration under your belt, you finished with the technically most
challenging subject so far.
However, we are just getting started. In machine learning, things are happening in spaces with millions of dimensions.
So, we need to generalize all the tools we have developed so far. Fortunately, a solid knowledge of single-variable calculus
is an excellent guide for multivariable functions as well. Some concepts work similarly, some have to be re-thought.
Have some rest, maybe briefly review what we have done so far, then go dive deep into the next chapter: multivariable
calculus.
TWENTYEIGHT
Young man, in mathematics you don’t understand things. You just get used to them. — John von Neumann
In the practice of machine learning, we use gradient descent so much that we get used to it. We hardly ever question why
it works.
What’s usually told is the mountain-climbing analogue: to find the peak (or the bottom) of a bumpy terrain, one has to
look at the direction of the steepest ascent (or descent), and take a step in that direction. This direction is desribed by
the gradient, and the iterative process of finding local extrema by following the gradient is called gradient ascent/descent.
(Ascent for finding peaks, descent for finding valleys.)
However, this is not a mathematically precise explanation. There are several questions left unanswered, and based on our
mountain-climbing intuition, it’s not even clear if the algorithm works.
Without a precise understanding of gradient descent, we are practically flying blind. In this chapter, our goal is to look
behind gradient descent and reveal the magic behind it.
Understanding the “whys” of the gradient descent starts with one of the most beautiful areas of mathematics: differential
equations.
Equations play an essential role in mathematics. This is common wisdom, but there is a deep truth behind it. Quite often,
equations arise from modeling systems such as interactions in a biochemical network, economic processes, and thousands
more. For instance, modelling the metabolic processes in organisms yields linear equations of the form
𝐴𝑥 = 𝑏, 𝐴 ∈ ℝ𝑛×𝑛 , 𝑥, 𝑏 ∈ ℝ𝑛
where the vectors 𝑥 and 𝑏 represent the concentration of molecules (where 𝑥 is the unknown), and the matrix 𝐴 represents
the interactions between them. Linear equations are easy to solve, and we understand quite a lot about them.
However, the equations we have seen so far are unfit to model dynamical systems, as they lack a time component. To
describe, for example, the trajectory of a space station orbiting around Earth, we have to describe our models in terms of
functions and their derivatives.
For instance, the trajectory of a swinging pendulum can be described by the equation
𝑔
𝑥′′ (𝑡) + sin 𝑥(𝑡) = 0, (28.1)
𝐿
where
• 𝑥(𝑡) describes the angle of the pendulum from the vertical,
307
Mathematics of Machine Learning
• 𝐿 is the length of the (massless) rod that our object of mass 𝑚 hangs on,
• and 𝑔 is the gravitational acceleration constant ≈ 9.7𝑚/𝑠2 .
According to the original interpretation of differentiation, if 𝑥(𝑡) describes the movement of the pendulum at time 𝑡, then
𝑥′ (𝑡) and 𝑥′′ (𝑡) describe the velocity and the acceleration of it, where the differentiation is taken with respect to the time
𝑡.
(In fact, the differential equation (28.1) is a direct consequence of Newton’s second law of motion.)
Equations involving functions and their derivatives, such as (28.1), are called ordinary differential equations, or ODEs
in short. Without any overexaggeration, their study has been the main motivating force of mathematics since the 17th
century. Trust me when I say this, differential equations are one of the most beautiful objects in mathematics. As we are
about to see, the gradient descent algorithm is, in fact, an approximate solution of differential equations.
The first part of this chapter will serve as a quickstart to differential equations. I am mostly going to follow the fantastic
Nonlinear Dynamics and Chaos book by Steven Strogatz [[Str00]]. If you ever feel the desire to dig deep into dynamical
systems, I wholeheartedly recommend this book to you. (This is one of my favorite math books ever, it reads like a novel.
The quality and clarity of its exposition serves as a continuous inspiration for my writing.)
Let’s dive straight into the deep waters and start with an example to get a grip on differential equations. Quite possibly,
the simplest example is the equation
𝑥′ (𝑡) = 𝑥(𝑡),
where the differentiation is taken with respect to the time variable 𝑡. If, for example, 𝑥(𝑡) is the size of a bacterial colony,
the equation 𝑥′ (𝑡) = 𝑥(𝑡) describes its population dynamics if the growth is unlimited. Think about 𝑥′ (𝑡) as the rate
at which the population grows: if there are no limitations in space and nutrients, every bacterial cell can freely replicate
whenever possible. Thus, since every cell can freely divide, the speed of growth matches the colony’s size.
In plain English, the solutions of the equation 𝑥′ (𝑡) = 𝑥(𝑡) are functions whose derivatives are themselves. After a bit of
thinking, we can come up with a family of solutions: 𝑥(𝑡) = 𝑐𝑒𝑡 , where 𝑐 ∈ ℝ is an arbitrary constant. (Recall that 𝑒𝑡 is
an elementary function, and we have seen that its derivative is itself.)
If you are a visual person, some of the solutions are plotted on Fig. 28.2.
There are two key takeaways here: differential equations describe dynamical processes that change in time, and they can
have multiple solutions. Each solution is determined by two factors: the equation itself 𝑥′ (𝑡) = 𝑥(𝑡), and an initial
condition 𝑥(0) = 𝑥∗ . If we specify 𝑥(0) = 𝑥∗ , then the value of 𝑐 is given by
𝑥(0) = 𝑐𝑒0 = 𝑐 = 𝑥∗ .
Thus, ODEs have a bundle of solutions, each one determined by the initial condition. So, it’s time to discuss differential
equations in more general terms!
When it is clear, the dependence on 𝑡 is often omitted, so we only write 𝑥′ = 𝑓(𝑥). (Some resources denote the time
derivative by 𝑥,̇ a notation that can be originated from Newton. We will not use this, though it is good to know.)
The term “first-order homogeneous ordinary differential equation” doesn’t exactly roll off the tongue, and it is overloaded
with heavy terminology. So, let’s unpack what is going on here.
The differential equation part is clear: it is a functional equation that involves derivatives. Since the time 𝑡 is the only
variable, the differential equation is ordinary. (As opposed to differential equations involving multivariable functions
and partial derivatives, but more on those later.) As only the first derivative is present, the equation becomes first-order.
Second-order would involve second derivatives, and so on. Finally, since the right-hand side 𝑓(𝑥) doesn’t explicitly depend
on the time variable 𝑡, the equation is homogeneous in time. Homogeneity means that the rules governing our dynamical
system don’t change over time.
Don’t let the 𝑓(𝑥(𝑡)) part scare you! For instance, in our example 𝑥′ (𝑡) = 𝑥(𝑡), the role of 𝑓 is cast to the identity
function 𝑓(𝑥) = 𝑥. In general, 𝑓(𝑥) establishes a relation between the quantity 𝑥(𝑡) (which can be position, density, etc)
and its derivative, that is, its rate of change.
As we have seen, we think in terms of differential equations and initial conditions that pinpoint solutions among a bundle
of functions. Let’s put this into a proper mathematical definition!
𝑥′ = 𝑓(𝑥)
{
𝑥(𝑡0 ) = 𝑥0
is called an initial value problem. If a function 𝑥(𝑡) satisfies both conditions, it is said to be a solution to the initial value
problem.
Most often, we select 𝑡0 to be 0. After all, we have the freedom to select the origin of the time as we want.
Unfortunately, things are not as simple as they seem. In general, differential equations and initial value problems are tough
to solve. Except for a few simple ones, we cannot find exact solutions. (And when I say we, I include every person on
the planet.) In these cases, there are two things that we can do: either we construct approximate solutions via numeric
methods or turn to qualitative methods that study the behavior of the solutions without actually finding them.
We’ll talk about both, but let’s turn to the qualitative methods first. As we’ll see, looking from a geometric perspective
gives us a deep insight into how differential equations work.
When finding analytic solutions is not feasible, we look for a qualitative understanding of the solutions, focusing on the
local and long-term behavior instead of formulas.
Imagine that given a differential equation
𝑥′ (𝑡) = 𝑓(𝑥(𝑡)),
you are interested in a particular solution that assumes the value 𝑥∗ at time 𝑡0 . For instance, you could be studying the
dynamics of a bacterial colony and want to provide a predictive model to fit your latest measurement 𝑥(𝑡0 ) = 𝑥∗ . In the
short term, where will your solutions go?
We can immediately notice that if 𝑥(𝑡0 ) = 𝑥∗ and 𝑓(𝑥∗ ) = 0, then the constant function 𝑥(𝑡) = 𝑥∗ is a solution! These
are called equilibrium solutions, and they are extremely important. So, let’s make a formal definition!
be a first order homogeneous ODE, and let 𝑥∗ ∈ ℝ be an arbitrary point. If 𝑓(𝑥∗ ) = 0, then 𝑥∗ is called an equilibrium
point of the equation 𝑥′ = 𝑓(𝑥).
For equilibrium points, the constant function 𝑥(𝑡) = 𝑥∗ is a solution of (28.3). This is called an equilibrium solution.
Think about our recurring example, the simplest ODE 𝑥′ (𝑡) = 𝑥(𝑡). As mentioned, we can interpret this equation as a
model of unrestricted population growth under ideal conditions. In that case, 𝑓(𝑥) = 𝑥, and this is zero only for 𝑥 = 0.
Therefore, the constant 𝑥(𝑡) = 0 function is a solution. This makes perfect sense: if a population has zero individuals, no
change is going to happen in its size. In other words, the system is in equilibrium.
Like a pendulum that stopped moving and reached its resting point at the bottom. However, pendulums have two equilibria:
one at the top and one at the bottom. (Let’s suppose that the mass is held by a massless rod. Otherwise, it would collapse)
At the bottom, you can push the hanging mass all you want, it’ll return to rest. However, at the top, any small push would
disrupt the equilibrium state, to which it would never return.
To shed light on this phenomenon, let’s look at another example: the famous logistic equation
From a population dynamics perspective, if our favorite equation 𝑥′ (𝑡) = 𝑥(𝑡) describes the unrestricted growth of a
bacterial colony, the logistic equation models the population growth under a resource constraint. If we assume that 1
is the total capacity of our population, the growth becomes more difficult as the size approaches this limit. Thus, the
population’s rate of change 𝑥′ (𝑡) can be modelled as 𝑥(𝑡)(1 − 𝑥(𝑡)), where the term 1 − 𝑥(𝑡) slows down the process as
the colony nears the sustain capacity.
We can write the logistic equation in the general form (28.2) by casting the role 𝑓(𝑥) = 𝑥(1 − 𝑥). Do you recall our
theorem about the relation of derivatives and monotonicity? Translated to the differential equation 𝑥′ = 𝑓(𝑥), this reveals
the flow of our solutions! To be specific,
⎧1 if 𝑥′ (0) > 0,
{
lim 𝑥(𝑡) = 0
⎨ if 𝑥′ (0) = 0, (28.5)
𝑡→∞
{−∞ if 𝑥′ (0) < 0.
⎩
With a little bit of calculation (whose details are not essential for us), we can obtain that we can write the solutions as
1
𝑥(𝑡) = ,
1 + 𝑐𝑒−𝑡
where 𝑐 ∈ ℝ is an arbitrary constant. For 𝑐 = 1, this is the famous Sigmoid function. You can check by hand that these
are indeed solutions. We can even plot them, as shown in Fig. 28.4 below.
As we can see in Fig. 28.4, the monotonicity of the solutions are as we predicted in (28.5).
We can characterize the equilibria based on the long-term behavior of nearby solutions. (In the case of our logistic
equation, the equilibria are 0 and 1.) This can be connected to the local behavior of 𝑓: if it decreases around the
equilibrium 𝑥∗ , it attracts the nearby solutions. On the other hand, if 𝑓 increases around 𝑥∗ , then the nearby solutions are
repelled.
This gives rise to the concept of stable and unstable equilibria.
Fig. 28.3: The flow of solutions for 𝑥′ = 𝑥(1 − 𝑥), visualized on the phase portrait. (The arrows represent the direction
where the solutions for given initial values are headed.)
𝑥∗ is called a stable equilibrium if there is a neighborhood (𝑥∗ −𝜀, 𝑥∗ +𝜀) around 𝑥∗ such that for all 𝑥0 ∈ (𝑥∗ −𝜀, 𝑥∗ +𝜀),
the solution of the initial value problem
𝑥′ = 𝑓(𝑥)
{
𝑥(0) = 𝑥0
In the case of the logistic ODE 𝑥′ = 𝑥(1 − 𝑥), 𝑥∗ = 1 is a stable and 𝑥∗ = 0 is an unstable equilibrium. This makes
sense given its population dynamics interpretation: the equilibrium 𝑥∗ = 1 means that the population is at maximum
capacity. If the size is slightly above or below the capacity 1, some specimens die due to starvation, or the colony reaches
its constraints. On the other hand, no matter how small the population is, it won’t ever go extinct in this ideal model.
Recall how the derivatives characterize the monotonicity of differentiable functions? With this, we have a simple tool that
can help us decide whether a given equilibrium is stable or not.
Theorem 27.1.1
Let 𝑥′ = 𝑓(𝑥) be a first-order homogeneous ordinary differential equation, and suppose that 𝑓 is differentiable. Moreover,
let 𝑥∗ be a equilibrium point of the equation.
If 𝑓 ′ (𝑥∗ ) < 0, then 𝑥∗ is a stable equilibrium.
The concept of stable equilibrium is fundamental, even in the most general cases. At this point, it’s time to take a few
steps backward and remind ourselves why we are here: to understand gradient descent. If stable equilibria remind you of
a local minimum which a gradient descent process converges towards, it is not an accident. We are ready to see what’s
behind the scenes.
Now, let’s talk about maximizing a function 𝐹 ∶ ℝ → ℝ. Suppose that 𝐹 is twice differentiable, and we denote its
derivative by 𝐹 ′ = 𝑓. Luckily, the local maxima of 𝐹 can be found with the help of its second derivative by looking for
𝑥∗ where 𝑓(𝑥∗ ) = 0 and 𝑓 ′ (𝑥∗ ) < 0.
Does this look familiar? If 𝑓(𝑥∗ ) = 0 indeed holds, then 𝑥(𝑡) = 𝑥∗ is an equilibrium solution; and since 𝑓 ′ (𝑥∗ ) < 0, it
attracts the nearby solutions as well. This means that if 𝑥0 is drawn from the basin of attraction and 𝑥(𝑡) is the solution
of the initial value problem
𝑥′ = 𝑓(𝑥)
{ (28.6)
𝑥(0) = 𝑥0 ,
then lim𝑡→∞ 𝑥(𝑡) = 𝑥∗ . In other words, the solution converges towards 𝑥∗ , a local maxima of 𝐹 ! This is gradient ascent
in a continuous version.
We are happy, but there is an issue. We’ve talked about how hard solving differential equations are. For a general 𝐹 , we
have no prospects to actually find the solutions. Fortunately, we can approximate them.
When studying differentiation in practice, we have seen that derivatives can be approximated numerically by the forward
difference
𝑥(𝑡 + ℎ) − 𝑥(𝑡)
𝑥′ (𝑡) ≈ .
ℎ
If 𝑥(𝑡) is indeed the solution for the initial value problem (28.6), we are in luck! Using forward differences, we can take
a small step from 0 and approximate 𝑥(ℎ) by substituting the forward difference into the differential equation. To be
precise, we have
𝑥(ℎ) − 𝑥(0)
≈ 𝑓(𝑥(0)),
ℎ
from which
thus by defining 𝑥2 ∶= 𝑥1 + ℎ𝑓(𝑥1 ), we have 𝑥2 ≈ 𝑥(2ℎ). Notice that in 𝑥2 , two kinds of approximation errors are
accumulated: first the forward difference, then the approximation error of the previous step.
This motivates us to define the recursive sequence
𝑥0 ∶= 𝑥(0),
(28.7)
𝑥𝑛+1 ∶= 𝑥𝑛 + ℎ𝑓(𝑥𝑛 ),
which approximates 𝑥(𝑛ℎ) with 𝑥𝑛 , as this is implied by the very definition. This recursive sequence is the gradient
ascent itself, and the small step ℎ is the learning rate! Check (25.1) if you don’t believe me. (28.7) is called the Euler
method.
Without going into the details, if ℎ is small enough and 𝑓 “behaves properly”, the Euler method will converge to the
equilibrium solution 𝑥∗ . (Whatever proper behavior might mean.)
We only have one more step: to turn everything into gradient descent instead of ascent. This is extremely simple, as
gradient descent is just applying gradient ascent to −𝑓. Think about it: minimizing a function 𝑓 is the same as maximizing
its negative −𝑓. And with that, we are done! The famous gradient descent is a consequence of dynamical systems
converging towards their stable equilibria, and this is beautiful.
To see the gradient ascent (that is, the Euler method) in action, we should go back to our good old example: the logistic
equation (28.4). So, suppose that we want to find the local maxima of the function
1 2 1 3
𝐹 (𝑥) = 𝑥 − 𝑥 ,
2 3
plotted in Fig. 28.5.
First, we can use what we learned and find the maxima using the derivative 𝑓(𝑥) = 𝐹 ′ (𝑥) = 𝑥(1 − 𝑥), concluding that
there is a local maximum at 𝑥∗ = 1. (Don’t just take my word, check out Theorem 23.3.1 and work it out!)
Since 𝑓(𝑥∗ ) = 𝐹 ′ (𝑥∗ ) = 0 and 𝑓 ′ (𝑥∗ ) < 0, the point 𝑥∗ is a stable equilibrium of the logistic equation
𝑥′ = 𝑥(1 − 𝑥).
Thus, if the initial value 𝑥(0) = 𝑥0 is sufficiently close to 𝑥∗ = 1, the solution 𝑥(𝑡) of the initial value problem
𝑥′ = 𝑥(1 − 𝑥),
{
𝑥(0) = 𝑥0 ,
then lim𝑡→∞ 𝑥(𝑡) = 𝑥∗ . (In fact, we can select any initial value 𝑥0 from the infinite interval (0, ∞), and the convergence
will hold.) Upon discretization via the Euler method, we obtain the recursive sequence
𝑥0 = 𝑥(0),
𝑥𝑛+1 = 𝑥𝑛 + ℎ𝑥𝑛 (1 − 𝑥𝑛 ).
Fig. 28.6: Solving 𝑥′ = 𝑥(1 − 𝑥) via the Euler-method. (For visualization purposes, the initial value was set at 𝑡0 = −5.)
To sum up what we’ve seen so far, our entire goal was to understand the very principles of gradient descent, the most
important optimization algorithm in machine learning. Its main principle is straightforward: to find a local minimum of
a function, first find the direction of decrease, then take a small step towards there. This seemingly naive algorithm has a
foundation that lies deep within differential equations. Turns out that if we look at our functions as rules determining a
dynamical system, local extrema correspond to equilibrium states. These dynamical systems are described by differential
equations, and the local maxima are equilibrium states that solutions towards them. From this viewpoint, the gradient
descent algorithm is nothing else than a numerical solution to this equation.
What we’ve seen so far only covers the single-variable case, and as I have probably told this many times, machine learning
is done in millions of dimensions. Still, the intuition we built up will be our guide in the study of multivariable functions
and high-dimensional spaces. There, the principles are the same, but the objects of study are much more complex. The
main challenge in multivariable calculus is to manage the complexity, and this is where our good friends, vectors and
matrices will do much of the heavy lifting.
Multivariable calculus is where linear algebra and the study of functions come together, providing the skeleton for building
and training neural networks. Let’s jump into it!
Multivariable functions
319
CHAPTER
TWENTYNINE
MULTIVARIABLE FUNCTIONS
How different is multivariable calculus from its single-variable counterpart? When I was a student, I had a professor who
used to say something like, “multivariable and single-variable functions behave the same, you just have to write more”.
Well, this couldn’t be further from the truth. Just think about what we are doing in machine learning: training models
with gradient descent; that is, finding a configuration of parameters that minimize a parametric function. In one variable
(which is not a realistic assumption), we can do this with the derivative. How can we extend the derivative to multiple
dimensions?
The inputs of multivariable functions are vectors. Thus, given a function 𝑓 ∶ ℝ𝑛 → ℝ, we can’t just define
𝑑𝑓 𝑓(x0 ) − 𝑓(x)
(x ) = lim , x0 , x ∈ ℝ𝑛
𝑑x 0 x→x0 x0 − x
to the analogue of Definition 21.1.1. Why? Because the division with the vector x0 − x is not defined.
As we’ll see, differentiation in multiple dimensions is much more complicated. Think about it: in one dimension, there
are only two directions, left and right. This is not true even for two dimensions, with an infinite number of directions at
each point.
So, what are multivariable functions anyway?
We introduced functions as general mappings between two sets. However, we’ve only discussed functions that map real
numbers to real numbers. Simple scalar-scalar functions are great for conveying ideas, but the world around us is much
more complex than what we could describe with them. At the other end of the spectrum, set-set functions are way too
general to be useful.
In practice, three categories are special enough to be analyzed mathematically but general enough to describe the patterns
in science and engineering: those that
1. map scalars to vectors, that is, 𝑓 ∶ ℝ → ℝ𝑛 (curves),
2. map vectors to scalars, that is, 𝑓 ∶ ℝ𝑛 → ℝ (scalar fields),
3. and those that map vectors to vectors, that is, 𝑓 ∶ ℝ𝑛 → ℝ𝑚 (vector fields).
The scalar-vector variants are called curves, the vector-scalar ones are surfaces, and the vector-vector functions are what
we call vector fields. This nomenclature looks a bit abstract, so let’s see some examples.
Scalar-vector functions, or curves in their more user-friendly name, are the mathematical representations of movement.
A space station orbiting around Earth describes a curve. So does the trajectory of a stock in the market.
To give you a concrete example, the scalar-vector function
cos(𝑡)
𝑓(𝑡) = [ ]
sin(𝑡)
321
Mathematics of Machine Learning
cos(𝑡)
𝑔(𝑡) = ⎡ ⎤
⎢ sin(𝑡) ⎥
⎣ 𝑡 ⎦
represents a motion that spirals upward, as illustrated by Fig. 29.2. These curves are called open.
Because of their inherent ability to describe trajectories, scalar-vector functions are essential in mathematics and science.
Are you familiar with Newton’s second law of motion, stating that force equals mass times acceleration? This is described
by the equation, 𝐹 = 𝑚𝑎, which is an instance of an ordinary differential equation. All of its solutions are curves.
On the surface, scalar-vector functions have little to do with machine learning, but that’s not the case. Even though we
won’t deal with them extensively, they have a serious presence behind the scenes. For instance, gradient descent is a
discretized curve.
Vector-scalar functions will be our focus for the next few chapters. When I write “multivariable function”, I’ll most often
refer to a vector-scalar function.
Think about a map of a mountain landscape. This maps the height - a scalar - to each coordinate, thereby defining the
surface. This is just a function 𝑓 ∶ ℝ2 → ℝ in mathematical terms. Thinking about scalar fields as surfaces is useful for
building geometric intuition, giving us a way to visualize them.
Let’s clear up the notation first. If 𝑓 ∶ ℝ𝑛 → ℝ is a function of 𝑛 variables, we might write 𝑓(x) for an x ∈ ℝ𝑛 or
𝑓(𝑥1 , … , 𝑥𝑛 ) for 𝑥𝑖 ∈ ℝ if we want to emphasize the dependence on its variables. A function of 𝑛 variables is the same
as a function of a single vector variable. I know this seems confusing, but trust me, you’ll get used to it in no time.
To give a concrete example for a vector-scalar function, let’s consider pressure. Pressure is the ratio of the magnitude of
the force and the area of the surface of contact:
𝐹
𝑝= .
𝐴
must match for all possible choices for 𝑥𝑛 and 𝑦𝑛 . This is not the case. Consider 𝑥𝑛 = 𝛼2 /𝑛 and 𝑦𝑛 = 𝛼/𝑛 for any 𝛼
real number. With this choice, we have
𝑥𝑛 𝛼2 /𝑛
lim = = 𝛼.
𝑛→∞ 𝑦𝑛 𝛼/𝑛
Thus, the above limit is not defined. All we did here is approach zero along slightly different trajectories, yet the result is
a total mess. In one variable, we have to flex our intellectual muscles to produce such examples; in multiple variables, a
simple 𝑥/𝑦 will do the trick.
Vector-vector functions are called vector fields. For example, consider our solar system, modeled by ℝ3 . Each point
is affected by a gravitational force, which is a vector. Thus, the gravitational pull can be described by a 𝑓 ∶ ℝ3 → ℝ3
function, hence the name vector field.
Although they are often hidden in the background, vector fields play an essential role in machine learning. Remember
when we discussed why does gradient descent work? (At least in one variable.) All the differential equations we have
encountered there are equivalent to vector fields.
Why? Consider the differential equation 𝑥′ = 𝑓(𝑥). If 𝑥(𝑡) describes the trajectory of a moving object, then its derivative
𝑥′ (𝑡) is its speed. Thus, we can interpret the equation 𝑥′ (𝑡) = 𝑓(𝑥(𝑡)) as prescribing the speed of our object at every
position. It’s not that spectacular when our object is moving in one dimension (like we assumed in the previous chapter),
but if the trajectory 𝑥 ∶ ℝ → ℝ2 describes a motion on the plane, the function 𝑓 ∶ ℝ2 → ℝ2 can be visualized neatly.
For example, consider the population dynamics of a simple predator-prey system. Predators feed on the prey, thus, their
numbers can grow in the abundance of food. In turn, over-consumption decreases the prey population, causing a famine
among the predators and decreasing their numbers. This leads to a growth in the prey, and the cycle starts over again.
If 𝑥1 (𝑡) and 𝑥2 (𝑡) are the size of the prey and predator populations, respectively, then their dynamics are described by
the famous Lotka-Volterra equations:
𝑥′1 = 𝑥1 − 𝑥1 𝑥2
𝑥′2 = 𝑥1 𝑥2 − 𝑥2 .
If we represent the trajectory as the scalar-vector function
𝑥1 (𝑡)
x ∶ ℝ → ℝ2 , x(𝑡) = [ ],
𝑥2 (𝑡)
then the derivative
𝑥′1 (𝑡)
x′ (𝑡) = [ ]
𝑥′2 (𝑡)
is given by the vector-vector function
𝑥1 − 𝑥1 𝑥2
𝑓 ∶ ℝ2 → ℝ 2 , 𝑓(𝑥1 , 𝑥2 ) = [ ].
𝑥1 𝑥2 − 𝑥2
𝑓 can be visualized by drawing a vector onto each point of the plane, as illustrated by Fig. 29.4.
Vector fields have serious applications in machine learning. As we shall see soon, the multivariable derivative (called
gradient) defines a vector field. Moreover, as indicated by the single-variable case, the gradient descent algorithm will be
the discretized trajectory determined by the vector field of the gradient.
One of the most important functions in mathematics is the linear function. In one variable, it takes the form 𝑙(𝑥) = 𝑎𝑥+𝑏,
where 𝑎 and 𝑏 are arbitrary real numbers.
We’ve seen linear functions several times already. For instance, Theorem 21.2.1 gives that differentiation is equivalent to
finding the best linear approximation.
Linear functions, that is, functions of the form
𝑛
𝑓(𝑥1 , … , 𝑥𝑛 ) = 𝑏 + ∑ 𝑎𝑖 𝑥𝑖 , 𝑏, 𝑎𝑖 ∈ ℝ
𝑖=1
⟨m, x − v0 ⟩ = 0 (29.1)
This is a linear function of the single variable 𝑥1 in its full glory. The coefficient − 𝑚 1
𝑚2 describes the slope, while 𝑚2 ⟨m, v0 ⟩
1
In other words, linear functions are equivalent to vector equations of the form (29.1), at least in one variable.
What happens if we apply the same argument in higher dimensional spaces? In ℝ𝑛+1 , the normal vector equation
defines a hyperplane, that is, an 𝑚-dimensional plane. (One dimension less than the embedding plane, which is ℝ𝑛+1 in
our case.) Unraveling (29.2), we obtain
𝑛
1 𝑚𝑖
𝑥𝑛+1 = ⟨m, v0 ⟩ − ∑ 𝑥.
𝑚𝑛+1 𝑖=1
𝑚𝑛+1 𝑖
originates from the normal vector equation of the 𝑛-dimensional plane, embedded in the 𝑛 + 1-dimensional space. This
can also be written in the vectorized form
𝑓(x) = 𝑏 + ⟨a, x⟩
(29.3)
= a𝑇 x + 𝑏, a, x ∈ ℝ𝑛 , 𝑏 ∈ ℝ,
which is how we’ll mostly use it in the future. (Note that when looking at the matrix representation of a vector u ∈ ℝ𝑛 ,
we always use the column form ℝ𝑛×1 . Moreover, a is not the normal vector of the plane.)
Before we move on to study the inner workings of multivariable calculus, I want to emphasize how seriously multiple
dimensions complicate things in machine learning.
First, let’s talk about optimization. If all else fails, optimizing a single-variable function 𝑓 ∶ [𝑎, 𝑏] → ℝ can be as simple
as partitioning [𝑎, 𝑏] into a grid of 𝑛 points, evaluate the function at each point, then find the minima/maxima.
We cannot* do this *in higher dimensions. To see why, consider ResNet18, the famous convolutional network archi-
tecture. It has precisely 11689512 parameters. Thus, training is equivalent to optimizing a function of a whopping
11689512-variable function. If we were to construct a grid with just two points along every dimension, we would have
211689512 points to evaluate the function at.
For comparison, the number of atoms in our observable universe is around 1082 . A number that is dwarfed by the size of
our grid. Thus, grid search is currently impossible on such an enormous grid. We are forced to devise clever algorithms
that can tackle the size and complexity of large dimensional spaces.
In high dimensions, a strange thing starts to happen with balls. Recall that by definition, the n-dimensional ball of radius
𝑟 around the point x0 ∈ ℝ𝑛 is defined by
and we denote its volume by 𝑉𝑛 (𝑟). (The volume depends only on the radius and the dimension, not the center.)
Heuristically, this means that if you randomly select a point from the unit ball, its distance from the center will be close
to 1 in high dimensions.
In other words, distance doesn’t behave as you would intuitively expect. Another way of looking at the issue would be to
study the effects of taking one step in each possible direction, starting from the origin and arriving at the point
1 = (1, 1, … , 1) ∈ ℝ𝑛 ,
which goes to infinity as the number of dimensions grows. That is, the diagonal of the unit cube is really big.
These two phenomena can cause significant headaches in practice. More parameters result in more expressive models but
also make training much more difficult. This is called the curse of dimensionality.
THIRTY
Now that we understand why multivariate functions and high-dimensional spaces are more complex than the single-variable
case we studied earlier, it’s time to see how to do things in the general case.
To recap quickly, our goal in machine learning is to optimize functions with millions of variables. For instance, think
about a neural network 𝑁 (x, w), where x ∈ ℝ𝑛 is the input data and the vector w ∈ ℝ𝑚 compresses all of the weight
parameters. In the case of, say, the binary cross-entropy loss, we have the loss function
𝑑
𝐿(w) = − ∑ 𝑁 (x𝑖 , w) log 𝑦𝑖 ,
𝑘=1
where x𝑖 is the 𝑖-th data point with ground truth 𝑦𝑖 . (I told you that we have to write much more in multivariable calculus.)
Training the neural network is the same as finding a global minimum of 𝐿(w), if it exists.
We have already seen how we can do optimization in a single variable:
• figure out the direction of increase by calculating the derivative,
• take a small step,
• then iterate.
For this to work in multiple variables, we need to generalize the concept of the derivative.
We quickly discovered the issue: since division with a vector is not defined, the difference quotient
𝑓(x) − 𝑓(y)
x−y
Let’s take a look at multivariable functions more closely! For the sake of simplicity, let 𝑓 ∶ ℝ2 → ℝ be our function of
two variables. To emphasize the dependence on the individual variables, we often write
𝑓(𝑥1 , 𝑥2 ), 𝑥1 , 𝑥2 ∈ ℝ.
We can quickly notice that by fixing one of the variables, we obtain the two single-variable functions! That is, if 𝑥1 , 𝑥2 ∈
ℝ2 is fixed, then we have
𝑥 ↦ 𝑓(𝑥, 𝑥2 ),
𝑥 ↦ 𝑓(𝑥1 , 𝑥),
329
Mathematics of Machine Learning
where 𝑥 ∈ ℝ is a scalar. Think about this as slicing the function graph with a plane parallel to the 𝑥 − 𝑧 or the 𝑦 − 𝑧, as
illustrated by Fig. 30.1. The part cut out by the plane is a single-variable function.
We can define the derivative of these functions by the limit of difference quotients. These are called the partial derivatives:
𝜕𝑓 𝑓(𝑥, 𝑥2 ) − 𝑓(𝑥1 , 𝑥2 )
(𝑥 , 𝑥 ) = lim ,
𝜕𝑥1 1 2 𝑥→𝑥1 𝑥 − 𝑥1
𝜕𝑓 𝑓(𝑥1 , 𝑥) − 𝑓(𝑥1 , 𝑥2 )
(𝑥 , 𝑥 ) = lim .
𝜕𝑥2 1 2 𝑥→𝑥2 𝑥 − 𝑥2
𝜕𝑓 𝜕𝑓
(Keep in mind that 𝑥1 signifies the variable in 𝜕𝑥 , but an actual scalar value in the argument of 𝜕𝑥1 (𝑥1 , 𝑥2 ). This can
1
be quite confusing, but you’ll soon learn to make sense of it.)
The definition is similar for general multivariable functions; we just have to write much more. There, the partial derivative
of 𝑓 ∶ ℝ𝑛 → ℝ at the point x = (𝑥1 , … , 𝑥𝑛 ) with respect to the 𝑖-th variable is defined by
𝑖-th variable
𝜕𝑓 𝑓(𝑥1 , … , 𝑥
⏞ , … , 𝑥𝑛 ) − 𝑓(𝑥1 , … , 𝑥𝑖 , … , 𝑥𝑛 ) (30.1)
(𝑥 , … , 𝑥𝑛 ) = lim .
𝜕𝑥𝑖 1 𝑥→𝑥𝑖 𝑥 − 𝑥𝑖
One of the biggest challenges in multivariable calculus is to manage the ever-growing notational complexity. Just take a
look at the difference quotient above:
𝑓(𝑥1 , … , 𝑥, … , 𝑥𝑛 ) − 𝑓(𝑥1 , … , 𝑥𝑖 , … , 𝑥𝑛 )
.
𝑥 − 𝑥𝑖
This is not the prettiest to look at, and this kind of notational complexity can pile up fast. Fortunately, we can use linear
algebra to the rescue! Not only can we compact the variables into the vector x = (𝑥1 , … , 𝑥𝑛 ), we can use the standard
basis
e𝑖 = (0, … , 0, 1⏟ , 0, … , 0)
𝑖-th component
If the above limit exists, we say that 𝑓 is partially differentiable with respect to the 𝑖-th variable 𝑥𝑖 .
𝜕
The partial derivative is again a vector-scalar function. Because of this, it is often written as 𝜕𝑥𝑖 𝑓, reflecting on the fact
𝜕
that the symbol " 𝜕𝑥 "can be thought of as a function that maps functions to functions. I know, this is a bit abstract, but
𝑖
you’ll get used to it quickly.
As usual, there are several alternative notations for the partial derivatives. Among others, the symbols
• 𝑓𝑥𝑖 (x),
• 𝐷𝑖 𝑓(x),
• 𝜕𝑖 𝑓(x)
𝜕𝑓
denote the 𝑖-th partial derivative of 𝑓 at x. For simplicity, we’ll use the old school 𝜕𝑥𝑖 (x).
30.1.1 Examples
It’s best to start with a few examples to illustrate the concept of partial derivatives.
Example 1. Let
To calculate, say, 𝜕𝑓/𝜕𝑥1 , we fix the second variable and treat 𝑥2 as a constant. Formally, we obtain the single-variable
function
𝑓 1 (𝑥) ∶= 𝑥2 + 𝑥22 , 𝑥2 ∈ ℝ,
𝜕𝑓 𝑑𝑓
(𝑥 , 𝑥 ) = (𝑥 ) = 2𝑥1 .
𝜕𝑥1 1 2 𝑑𝑥 1
𝜕𝑓
(𝑥 , 𝑥 ) = 2𝑥2 .
𝜕𝑥2 1 2
Once you are comfortable with the mental gymnastics of fixing variables, you’ll be able to perform partial differentiation
without writing out all the intermediate steps.
𝑓(𝑥1 , 𝑥2 ) = sin(𝑥21 + 𝑥2 ).
By fixing 𝑥2 , we obtain a composite function. Thus the chain rule is used to calculate the first partial derivative:
𝜕𝑓
(𝑥 , 𝑥 ) = 2𝑥1 cos(𝑥21 + 𝑥2 ).
𝜕𝑥1 1 2
Similarly, we obtain that
𝜕𝑓
(𝑥 , 𝑥 ) = cos(𝑥21 + 𝑥2 ).
𝜕𝑥2 1 2
(I highly advise you to carry out the above calculations step by step as an exercise, even if you understand all the inter-
mediate steps.)
Example 3. Finally, let’s see a function that is partially differentiable in one variable but not in the other. Define the
function
−1 if 𝑥2 < 0,
𝑓(𝑥1 , 𝑥2 ) = {
1 else.
If a function is partially differentiable in every variable, we can compact the derivatives together in a single vector to form
the gradient.
A few remarks are in order. First, the symbol ∇ is called nabla, a symbol that was conceived to denote gradients.
Second, the gradient can be thought of as a vector-vector function. To see that, consider the already familiar function
𝑓(𝑥1 , 𝑥2 ) = 𝑥21 + 𝑥22 . The gradient of 𝑓 is
2𝑥1
∇𝑓(𝑥1 , 𝑥2 ) = [ ],
2𝑥2
or
∇𝑓(x) = 2x
Fig. 30.2: The vector field given by the gradient of 𝑥21 + 𝑥22 .
in vectorized form. We can visualize this by drawing the vector ∇𝑓(𝑥1 , 𝑥2 ) at each point (𝑥1 , 𝑥2 ) ∈ ℝ2 .
Thus, you can think about ∇𝑓 as a vector-vector function ∇𝑓 ∶ ℝ𝑛 → ℝ𝑛 . The gradient at a given point x is obtained by
evaluating this function, yielding (∇𝑓)(x).
For clarity, the parentheses are omitted, arriving at the all familiar notation ∇𝑓(x).
The partial derivatives of a vector-scalar function 𝑓 ∶ ℝ𝑛 → ℝ are vector-scalar functions themselves. Thus, we can
perform partial differentiation one more time!
If they exist, the second order partial derivatives are defined by
𝜕2𝑓 𝜕 𝜕𝑓
(a) ∶= ( (a)). (30.2)
𝜕𝑥𝑖 𝜕𝑥𝑗 𝜕𝑥𝑖 𝜕𝑥𝑗
𝜕2 𝑓
(When the second partial differentiation takes place with respect to the same variable, (30.2) is abbreviated by 𝜕𝑥2𝑖
(a).)
The definition begs the question: is the order of differentiation interchangeable? That is, does
𝜕2𝑓 𝜕2𝑓
(a) = (a)
𝜕𝑥𝑖 𝜕𝑥𝑗 𝜕𝑥𝑗 𝜕𝑥𝑖
hold? The answer is quite surprising: the order is interchangeable under some mild assumptions, but not in the general
case. There is a famous theorem about it which we won’t prove, but it’s essential to know.
Theorem 29.1.1
Let 𝑓 ∶ ℝ𝑛 → ℝ be an arbitrary vector-scalar function and let a ∈ ℝ𝑛 . If there is a small ball 𝐵(𝜀, a) ⊆ ℝ𝑛 centered at
a such that 𝑓 has continuous second-order partial derivatives at all points of 𝐵(𝜀, a), then
𝜕2𝑓 𝜕2𝑓
(a) = (a)
𝜕𝑥𝑖 𝜕𝑥𝑗 𝜕𝑥𝑗 𝜕𝑥𝑖
Partial derivatives seem to generalize the notion of differentiability for multivariable functions. However, something is
missing. Let’s revisit the single-variable case for a moment.
Recall that according to Theorem 21.2.1, the differentiability of a single-variable function 𝑓 ∶ ℝ → ℝ at a given point 𝑎
is equivalent to a local approximation of 𝑓 by the linear function
If 𝑥 is close to 𝑎, 𝑙(𝑥) is also close to 𝑓(𝑥). Moreover, this is the best linear approximation we can do around 𝑎. In a
single variable, this is equivalent to differentiation.
𝑓(x)−𝑓(y)
This gives us an idea: even though difference quotients like x−y does not exist in multiple variables, the best local
approximation with a multivariable linear function does!
Thus, the notion of total differentiability is born.
holds for all a ∈ 𝐵(𝜀, a), where 𝜀 > 0 and 𝐵(𝜀, a) is defined by
(In other words, 𝐵(𝜀, a) is a ball of radius 𝜀 > 0 around a.) When exists, the vector 𝐷𝑓 (a) is called the total derivative
of 𝑓 at a.
Recall that when it is not stated explicitly, we prefer to work with column vectors, because we want to write our linear
transformations in the form 𝐴x, where 𝐴 ∈ ℝ𝑚×𝑛 and x ∈ 𝑛 × 1. Thus, the “dimensionology” of the formula
𝑓(x)
⏟ = 𝑓(a)
⏟ +𝐷⏟𝑓 (a) (x
⏟ − a) +𝑜(‖x − a‖) ∈ ℝ1×1
∈ℝ1×1 ∈ℝ1×1 ∈ℝ1×𝑛 ∈ℝ𝑛×1
holds for all a in some 𝐵(𝜀, a). (That is, 𝐷𝑓 (a) = ∇𝑓(a)𝑇 .)
In other words, the equation (30.4) gives that the coefficients of the best linear approximation are equal to the partial
derivatives.
Proof. Because 𝑓 is totally differentiable at a, the definition gives that 𝑓 can be written in the form
where 𝐷𝑓 (a) = (𝑑1 , … , 𝑑𝑛 ) is the vector that describes the coefficients of the linear part.
Our goal is to show that
𝑓(a + ℎe𝑖 ) − 𝑓(a)
lim = 𝑑𝑖 ,
ℎ→0 ℎ
where e𝑖 is the unit (column) vector whose 𝑖-th component is 1, while the others are 0.
Let’s do a quick calculation! Based on what we know, we have
𝑓(a + ℎe𝑖 ) − 𝑓(a) 𝐷𝑓 (a)ℎe𝑖 + 𝑜(‖ℎe𝑖 ‖)
=
ℎ ℎ
= 𝐷𝑓 (a)e𝑖 + 𝑜(1)
= 𝑑𝑖 + 𝑜(1),
𝑓(a+ℎe𝑖 )−𝑓(a)
thus confirming that limℎ→0 ℎ = 𝑑𝑖 , which is what we had to show. □
What’s all the hassle with total differentiation, then? Theorem 29.2.1 tells us that total differentiability is a stronger
condition than partial differentiability.
Surprisingly, the other direction is not true: the existence of partial derivatives does not imply total differentiability, as
the example
1 if 𝑥 = 0 or 𝑦 = 0,
𝑓(𝑥, 𝑦) = {
0 otherwise
illustrates. This function has all its partial derivatives at 0, yet the total derivative does not exist. (You can convince
yourself by either drawing a figure, or noting that the function 1 − d𝑇 x can never be 𝑜(‖x‖), regardless of the choice of
d.)
So far, we have talked about two kinds of derivatives: partial derivatives that describe the rate of change along a fixed
axis, and total derivatives that give the best linear approximation of the function at a given point.
Partial derivatives are only concerned with a few particular directions. However, this is not the end of the story in multiple
variables. With the standard orthonormal basis vectors e𝑖 , the partial derivatives are defined by
As we have seen earlier, these describe the rate of change along the dimensions. However, the standard orthonormal
vectors are just a few special directions.
What about an arbitrary direction v? Can we define the derivative along these? Sure! There is nothing stopping us to
replace e𝑖 with v in (30.5). Thus, the directional derivatives are born.
Good news: the directional derivatives can be described in terms of the gradient.
Theorem 29.3.1
Let 𝑓 ∶ ℝ𝑛 → ℝ be a function of 𝑛 variables. If 𝑓 is totally differentiable at a, then its directional derivatives exist in all
directions, and
𝜕𝑓
(a) = ∇𝑓(a)𝑇 v.
𝜕v
around a. Thus,
𝑓(a + ℎv) − 𝑓(a) ℎ∇𝑓(a)𝑇 v + 𝑜(ℎ)
=
ℎ ℎ
= ∇𝑓(a)𝑇 v + 𝑜(1),
giving that
𝜕𝑓 𝑓(a + ℎv) − 𝑓(a)
(a) = lim
𝜕v ℎ→0 ℎ
𝑇
= lim ∇𝑓(a) v + 𝑜(1)
ℎ→0
= ∇𝑓(a)𝑇 v,
as we needed to show. □
In one variable, we have learned that if the derivative of 𝑓 is positive at some 𝑎, then 𝑓 increases around 𝑎. (If the
derivative is negative, 𝑓 decreases.) If we think about the derivative 𝑓 ′ (𝑎) as a one-dimensional vector, the above result
says that the derivative points towards the direction of increase.
Is this true in higher dimensions? Yes, and this is what makes gradient descent work.
I know, (30.6) is pretty overloaded, so let’s unpack it. First, let’s start with the mysterious arg max. For a given function
𝑓,
denotes the values that maximizes 𝑓 on the set 𝑆. As the maximum may not be unique, arg max can yield a set. (The
definition of arg min is the same, but with minimum instead of maximum.)
Thus, in English, (30.6) states that the unit direction that maximizes the directional derivative at a is the normalized
gradient. Now we are ready to see the proof!
Proof. Do you remember the Cauchy-Schwarz inequality? It was a long time ago, so let’s recall it! In the vector space
ℝ𝑛 , the Cauchy-Schwarz inequality tells us that for any x, y ∈ ℝ𝑛 ,
x𝑇 y ≤ ‖x‖‖y‖.
𝜕𝑓
(a) = ∇𝑓(a)𝑇 v.
𝜕v
Combined with the Cauchy-Schwarz inequality, we get that
𝜕𝑓
(a) = ∇𝑓(a)𝑇 v
𝜕v
≤ ‖∇𝑓(a)‖‖v‖.
𝜕𝑓
(a) ≤ ‖∇𝑓(a)‖ (30.7)
𝜕v
follows. Thus, the directional derivatives must be less or equal than the gradient’s norm. (At least, along a direction vector
with unit length.)
However, by letting v0 = ∇𝑓(a)/‖∇𝑓(a)‖, we obtain that
𝜕𝑓
(a) = ∇𝑓(a)𝑇 v0
𝜕v0
∇𝑓(a)𝑇 ∇𝑓(a)
=
‖∇𝑓(a)‖
‖∇𝑓(a)‖2
=
‖∇𝑓(a)‖
= ‖∇𝑓(a)‖.
∇𝑓(a) ∇𝑓(a)
Thus, with the choice v0 = ‖∇𝑓(a)‖ , equality can be attained in (30.7). This means that ‖∇𝑓(a)‖ maximizes the directional
derivative at a, which is what we had to prove. □
With that, we have the basics of differentiation in multiple variables under our belt. To sum up, we have learned that the
difference quotient definition of the derivative does not generalize directly for multiple variables, but we can fix all but
one variables to make the difference quotient work, thus obtaining partial derivatives.
On the other hand, the linear approximation definition works in multiple dimensions, but instead of
It’s been a long time since we’ve put theory into code. So, let’s take a look at multivariable functions!
Last time, we built a Function base class with two main methods: one for computing the derivative (Function.
prime), one for getting the dictionary of parameters (Function.parameters).
This won’t be much of a surprise: the multivariate function base class is not much different. For clarity, we’ll appropriately
rename the prime method to grad.
class MultivariableFunction:
def __init__(self):
pass
def grad(self):
pass
def parameters(self):
return dict()
Let’s see a few examples right away. The simplest one is the squared Euclidean norm 𝑓(x) = ‖x‖2 , a closed relative to
the mean squared error function. Its gradient is given by
∇𝑓(x) = 2x,
thus everything is ready to implement it. As we’ve used NumPy arrays to represent vectors, we’ll use them as the input
as well.
import numpy as np
class SquaredNorm(MultivariableFunction):
def __call__(self, x: np.array):
return np.mean(x**2)
Note that SquaredNorm is different from 𝑓(x) = ‖x‖2 in a mathematical sense, as it accepts any NumPy array, not
just an 𝑛-dimensional vector. This is not a problem now, but will be one later, so keep that in mind.
Another can be given by the parametric linear function
𝑔(𝑥, 𝑦) = 𝑎𝑥 + 𝑏𝑦,
class Linear(MultivariableFunction):
def __init__(self, a: float, b: float):
self.a = a
self.b = b
def parameters():
return {"a": self.a, "b": self.b}
Note that as we are working with column vectors, the input x is an array of shape (2, 1).
To check if our implementation works correctly, we can quickly test it out on a simple example.
g = Linear(a=1, b=-1)
Perhaps we might have overlooked this question until now, but trust me, specifying the input and output shapes is of
crucial importance. When doing mathematics, we can be flexible in our notation and treat any vector 𝑥 ∈ ℝ𝑛 as a column
or row vector, but this is painfully not the case in practice.
Correctly keeping track of array shapes is of utmost importance, and can save you hundreds of hours. No joke.
For now, that’s basically all to our MultivariableFunction class. Later, when implementing our neural networks
from scratch, we’ll add other methods for utility. However, regarding “mathematical functionality”, we are almost done.
THIRTYONE
In a single variable, defining higher-order derivatives is simple. We simply have to keep repeating differentiation:
′
𝑓 ′′ (𝑥) = (𝑓 ′ (𝑥)) ,
′
𝑓 ′′′ (𝑥) = (𝑓 ′′ (𝑥)) ,
and so on. However, this is not that straightforward with multivariable functions. So far, we have only talked about
gradients, the generalization of the derivative for vector-scalar functions.
As ∇𝑓(a) is a column vector, the gradient is a vector-vector function ∇ ∶ ℝ𝑛 → ℝ𝑛 . We only know how to compute the
derivative of vector-scalar functions. It’s time to change that!
Curves, often describing the solutions of dynamical systems, are one of the most important objects in mathematics. We
don’t use them explicitly in machine learning, but they are underneath algorithms such as gradient descent. (Where we
traverse a discretized curve leading to a local minimum.)
Formally, a curve - that is, a scalar-vector function - is given by a function
𝛾1 (𝑡)
⎡ 𝛾 (𝑡) ⎤
𝛾 ∶ ℝ → ℝ𝑛 , 𝛾(𝑡) = ⎢ 2 ⎥ ∈ ℝ𝑛(×1) ,
⎢ ⋮ ⎥
⎣𝛾𝑛 (𝑡)⎦
where the 𝛾𝑖 ∶ ℝ → ℝ functions are good old single-variable scalar-scalar functions. As the independent variable often
represents time, it is customary to denote it with 𝑡.
We can differentiate 𝛾 componentwise:
𝛾1′ (𝑡)
⎡ 𝛾 ′ (𝑡) ⎤
𝛾 ′ (𝑡) ∶= ⎢ 2 ⎥ ∈ ℝ𝑛(×1) .
⎢ ⋮ ⎥
⎣𝛾𝑛′ (𝑡)⎦
If we indeed imagine 𝛾(𝑡) as a trajectory in space, 𝛾 ′ (𝑡) is the tangent vector to 𝛾 at 𝑡. Since the differentiation is
componentwise, Theorem 21.2.1 implies that if 𝛾 is differentiable at some 𝑎,
there. The equation (31.1) is a true vectorized formula: some components are vectors, and some are scalars. Yet, this is
simple and makes perfect sense to us. Hiding the complexities of vectors and matrices is the true power of linear algebra.
341
Mathematics of Machine Learning
It is easy to see that for any two 𝛾, 𝜂 ∶ ℝ → ℝ𝑛 , differentiation is additive, as (𝛾 + 𝜂)′ = 𝛾 ′ + 𝜂′ . What happens when
we compose a scalar-vector function with a vector-scalar one?
This situation is commonplace in machine learning. If, say, 𝐿 ∶ ℝ𝑛 → ℝ describes the loss function and 𝛾 ∶ ℝ → ℝ𝑛
is our trajectory in the parameter space ℝ𝑛 , the composite function 𝑓(𝛾(𝑡)) describes the model loss at time 𝑡. Thus, to
compute (𝑓 ∘ 𝛾)′ , we have to generalize the chain rule.
Theorem 30.1.1 (The chain rule for composing scalar-vector and vector-scalar functions.)
Let 𝛾 ∶ ℝ → ℝ𝑛 and 𝑓 ∶ ℝ𝑛 → ℝ be arbitrary functions. If 𝛾 is differentiable at some 𝑎 ∈ ℝ and 𝑓 is differentiable at
𝛾(𝑎), then 𝑓 ∘ 𝛾 ∶ ℝ → ℝ is also differentiable at 𝑎 and
there.
Thus,
𝑓(𝛾(𝑡)) − 𝑓(𝛾(𝑎))
(𝑓 ∘ 𝛾)′ (𝑎) = lim
𝑡→𝑎 𝑡−𝑎
= ∇𝑓(𝛾(𝑎))𝑇 𝛾 ′ (𝑎),
which is what we had to prove. □
Now, our task is to extend the derivative for vector-vector functions. Let f ∶ ℝ𝑛 → ℝ𝑚 be an arbitrary vector-vector
function. By writing out the output of f explicitly, we can decompose it into multiple components:
𝑓1 (x)
f(x) = ⎡
⎢ ⋮ ⎥∈ℝ
⎤ 𝑚(×1)
⎣𝑓𝑚 (x)⎦
where 𝑓𝑖 ∶ ℝ𝑛 → ℝ are vector-scalar functions.
The natural idea is to compute the partial derivatives for 𝑓𝑖 , compacting them into a matrix. And so we shall!
I have good news: the best local linear approximation of f around a is given by
if the best local linear approximation exists. Thus, the Jacobian is a proper generalization of the gradient.
We can use the Jacobian to generalize the notion of second-order derivatives for vector-scalar functions: by computing
the Jacobian of the gradient, we obtain a special matrix, the analogue of the second derivative.
In other words,
holds by definition. Moreover, if 𝑓 behaves nicely (for instance, all second-order partial derivatives exist and are contin-
uous), Theorem 29.1.1 implies that the Hessian is symmetric; that is, 𝐻𝑓 (a) = 𝐻𝑓 (a)𝑇 .
One last generalization, I promise. Recall that the existence of the gradient (that is, partial differentiability) doesn’t imply
total differentiability for vector-scalar functions, as the example
1 if 𝑥 = 0 or 𝑦 = 0,
𝑓(𝑥, 𝑦) = {
0 otherwise
shows at zero.
This is true for vector-vector functions as well, as the Jacobian is the generalization of the gradient, not the total derivative.
It is best to rip the band-aid off quickly and define the total derivative for vector-vector functions. The definition will be
a bit abstract, but trust me, the investment will pay off when talking about the chain rule. (Which is the foundation of
backpropagation, the algorithm that makes gradient descent computationally feasible.)
Let f ∶ ℝ𝑛 → ℝ𝑚 be an arbitrary vector-vector function. We say that 𝑓 is totally differentiable (or sometimes just
differentiable in short) at a ∈ ℝ𝑛 if there exists a matrix 𝐷f (a) ∈ ℝ𝑚×𝑛 such that
holds for all a ∈ 𝐵(𝜀, a), where 𝜀 > 0 and 𝐵(𝜀, a) is defined by
(In other words, 𝐵(𝜀, a) is a ball of radius 𝜀 > 0 around a.) When exists, the matrix 𝐷f (a) is called the total derivative
of 𝑓 at a.
Notice that Definition 30.3.1 is almost verbatim to Definition 29.2.1, except that the “derivative” is a matrix this time.
You are probably not surprised to hear that its relation with the Jacobian is the same as the gradient and the total derivative
in the vector-scalar case.
𝐷f (a) = 𝐽f (a)𝑇 .
The proof is almost identical to the one of Theorem 29.2.1, with more complex notations. I strongly recommend you to
work it out line by line, as this kind of mental gymnastics helps significantly to get used to matrices in practice.
Componentwise, the total derivative can be written as
𝜕𝑓1 𝜕𝑓1 𝜕𝑓1
𝜕𝑥1 (a) 𝜕𝑥2 (a) … 𝜕𝑥𝑛 (a)
⎡ 𝜕𝑓2 𝜕𝑓2 𝜕𝑓2 ⎤
𝐷f (a) = ⎢ 𝜕𝑥1 (a) 𝜕𝑥2 (a) … 𝜕𝑥𝑛 (a) ⎥ 𝑚×𝑛
⎢ ⋮ ⋮ ⋱ ⋮ ⎥∈ℝ .
⎢ ⎥
𝜕𝑓𝑚 𝜕𝑓𝑚 𝜕𝑓𝑚
⎣ 𝜕𝑥1 (a) 𝜕𝑥 0
2
(a) … 𝜕𝑥𝑛 (a)⎦
and
∇𝑓1 (a)𝑇
⎡ ∇𝑓 (a)𝑇 ⎤
𝐷f (a) = ⎢ 2 ⎥.
⎢ ⋮ ⎥
𝑇
∇𝑓
⎣ 𝑚 (a) ⎦
We have generalized the notion of derivatives as far as possible for us. Now it’s time to study their relations with the
two essential function operations: addition and composition. (As there is no vector multiplication in higher dimensional
spaces, the product and ratio of f, g ∶ ℝ𝑛 → ℝ𝑚 is undefined.)
Let’s start with the simpler one: addition.
there.
which implies
Linearity is always nice, but what we need is the ultimate generalization of the chain rule. We previously saw the special
case of composing a scalar-vector and a vector-vector function (see Theorem 30.1.1), but we need to go one step further.
The multivariable chain rule is extremely important in machine learning. A neural network is a composite function, each
layer forming a component. During gradient descent, we use the chain rule to calculate its derivative.
holds.
To our advantage, the derivative of a composed function (31.3) is given by the product of two matrices. Since matrix
multiplication can be done lightning fast, this is good news.
We will see two proofs for Theorem 30.4.2. One is done with a faster-than-light engine, while the other shows much more
by reducing the general case to Theorem 30.1.1. Both provide a ton of insight. Let’s start with the heavy machinery.
(f ∘ g)1 (x)
(f ∘ g)(x) = ⎡ ⎤ 𝑙
⎢ (f ∘ g)2 (x) ⎥ ∈ ℝ , x ∈ ℝ𝑛 .
⎣⋮ (f ∘ g)𝑙 (x)⎦
𝜕(f ∘ g)𝑖
(𝐷f∘g (a)) = (a).
𝑖,𝑗 𝜕𝑥𝑗
and the vector-scalar function 𝑓𝑖 ∶ ℝ𝑚 → ℝ. Thus, the chain rule for the composition of scalar-vector and vector-scalar
functions (given by Theorem 30.1.1) can be applied:
𝜕(f ∘ g)𝑖 𝜕
(a) = ∇𝑓𝑖 (g(a))𝑇 g(a),
𝜕𝑥𝑗 𝜕𝑥𝑗
𝜕
where 𝜕𝑥𝑗 g(a) is the componentwise derivative
𝜕𝑔1 (a)
⎡ 𝜕𝑔𝜕𝑥(a)
𝑗 ⎤
𝜕 ⎢ 𝜕𝑥 2
⎥
g(a) = ⎢ 𝑗
⎥.
𝜕𝑥𝑗 ⎢ ⋮ ⎥
𝜕𝑔𝑚 (a)
⎣ 𝜕𝑥𝑗 ⎦
With the concept of total derivatives for vector-vector functions and the general chain rule under our belt, we are ready
to actually do things with multivariable functions. Thus, our next stop lays the foundations of optimization.
THIRTYTWO
In a single variable, we have successfully used the derivatives to find local optima of differentiable functions.
Recall that if 𝑓 ∶ ℝ → ℝ is differentiable everywhere, then Theorem 23.3.1 gives that
(a) 𝑓 ′ (𝑎) = 0 and 𝑓 ′′ (𝑎) > 0 implies a local minimum,(b) and 𝑓 ′ (𝑎) = 0 and 𝑓 ′′ (𝑎) < 0 implies a local maximum.
(A simple 𝑓 ′ (𝑎) = 0 is not enough to conclude the local extremum, as the example 𝑓(𝑥) = 𝑥3 shows at 0).
Can we do something similar in multiple variables? Right from the start, there seem to be an issue: the derivative is not
a scalar (thus, we can’t equate it to zero).
This is easy to solve: the analogue of the condition 𝑓 ′ (𝑎) = 0 is ∇𝑓(a) = (0, 0, … , 0) for multivariate functions. For
simplicity, the zero vector (0, 0, … , 0) will also be denoted by 0. Don’t worry, this won’t be confusing; it’s all clear from
the context. Introducing a new notation for the zero vector would just add more complexity.
We can even visualize this. In a single variable, we have already seen this: as Fig. 32.1 illustrates, 𝑓 ′ (𝑎) = 0 implies that
the tangent line is horizontal.
349
Mathematics of Machine Learning
In multiple variables, the situation is similar: ∇𝑓(a) = 0 implies that the best local linear approximation (30.3) is constant;
that is, the tangent plane is horizontal. (As visualized by Fig. 32.2.)
So, what does ∇𝑓(a) = 0 imply? Similarly to the single-variable case, we have three options:
1. local minima,
2. local maxima,
3. neither.
The functions
𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 ,
𝑔(𝑥, 𝑦) = −(𝑥2 + 𝑦2 ),
ℎ(𝑥, 𝑦) = 𝑥2 − 𝑦2
at (0, 0) provide an example for all three, as Fig. 32.3, Fig. 32.4, and Fig. 32.5 show. (Keep in mind that a local extremum
might be global.)
To put things into order, let’s start formulating definitions and theorems.
holds.
For the sake of precision, let’s define local extrema in multiple dimensions.
As the example of 𝑥2 − 𝑦2 shows, a critical point is not necessarily a local extremum, but a local extremum is always a
critical point. The next result, which is the analogue of Theorem 23.2.1, makes this mathematically precise.
Theorem 31.1.1
Let 𝑓 ∶ ℝ𝑛 → ℝ be an arbitrary vector-scalar function, and suppose that 𝑓 is partially differentiable with respect to all
variables at some a ∈ ℝ.
If 𝑓 has a local extremum at a, then ∇𝑓(a) = 0.
Proof. This is a direct consequence of Theorem 23.2.1, as if a = (𝑎1 , … , 𝑎𝑛 ) is a local extremum of the vector-scalar
function 𝑓, then it is a local extremum of the single-variable functions 𝑓(a + ℎe𝑖 ), where e𝑖 is the vector whose 𝑖-th
component is 1, while the others are zero.
According to the very definition of the partial derivative given by Definition 29.1.1,
𝑑 𝜕𝑓
𝑓(a + ℎe𝑖 ) = (a).
𝑑ℎ 𝜕𝑥𝑖
𝜕𝑓
(a) = 0
𝜕𝑥𝑖
So, how can we find the local extrema with the derivative? As we have already suggested, studying the second derivative
will help us pinpoint the extrema among critical points. Unfortunately, things are much more complicated in 𝑛 variables,
so let’s focus on the two-variable case first.
We will not prove this, but some remarks are in order. First, as the determinant of the Hessian can be zero, Theorem
31.1.3 does not cover all possible cases.
It’s probably best to see a few examples, so let’s revisit the previously seen functions
𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 ,
𝑔(𝑥, 𝑦) = −(𝑥2 + 𝑦2 ),
ℎ(𝑥, 𝑦) = 𝑥2 − 𝑦2 .
All three have a critical point at 0, so the Hessians can provide a clearer picture. The Hessians are given by the matrices
2 0 −2 0 2 0
𝐻𝑓 (𝑥, 𝑦) = [ ], 𝐻𝑔 (𝑥, 𝑦) = [ ], 𝐻ℎ (𝑥, 𝑦) = [ ].
0 2 0 −2 0 −2
𝜕2 𝑓
For functions of two variables, Theorem 31.1.3 says that it is enough to study det 𝐻𝑓 (a) and 𝜕𝑦2 (a).
2
In the case of 𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 , we have 𝐻𝑓 (0, 0) = 4 and 𝜕𝜕𝑦𝑓2 (0, 0) = 2, giving that 0 is a local minimum of
𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 . Similarly, we can conclude that 0 is a local maximum of 𝑔(𝑥, 𝑦) = −(𝑥2 + 𝑦2 ). (Which shouldn’t
surprise you, as 𝑔 = −𝑓.)
Finally, for ℎ(𝑥, 𝑦) = 𝑥2 − 𝑦2 , the second derivative test confirms that 0 is indeed a saddle point.
So, what’s up with the general case? Unfortunately, just studying the determinant of the Hessian matrix is not enough.
We need to bring in the heavy hitters: eigenvalues. Here is the second derivative test in its full glory.
That’s right: if any of the eigenvalues are zero, then the test is inconclusive. You might recall from linear algebra that
in practice, computing the eigenvalues is not as fast as computing the second-order derivatives, but there are plenty of
numerical methods. (Like the QR-algorithm.)
To sum it up, the method of optimizing (differentiable) multivariable functions is a simple two-step process:
1. find the critical points by solving the equation ∇𝑓(x) = 0,
2. then use the second derivative test to determine which critical points are extrema.
Do we use this method in practice to optimize functions? No. Why? Most importantly because computing the eigenvalues
of the Hessian for a vector-scalar function of millions of variables is extremely hard. Why is the second derivative test so
important? Because understanding the behavior of functions around their extremal points is essential to truly understand
gradient descent. Believe it or not, this is the key behind the theoretical guarantees for gradient descent.
Speaking of gradient descent, now is the time to dig deep into the algorithm that powers neural networks.
THIRTYTHREE
Gradient descent is one of the most important algorithms in machine learning. We have talked about this a lot, although
up until this point, we have only seen it for single-variable functions. (Which is, let’s admit it, not the most practical
use-case.)
However, now we have all the tools we need to talk about gradient descent in its general form. Let’s get to it!
Suppose that we have a differentiable vector-scalar function 𝑓 ∶ ℝ𝑛 → ℝ that we want to maximize. This can describe the
loss function of a neural network, the return on investment of an investing strategy, or any other quantity.
Calculating the gradient and finding the critical points is often not an option, as solving the equation ∇𝑓(x) = 0 can be
computationally unfeasible. Thus, we resort to an iterative solution.
The algorithm is the same as for single-variable functions:
1. we start from a random point,
2. calculate the gradient,
3. take a step towards its direction.
This is called gradient ascent. We can formalize it in the following way.
x𝑛+1 ∶= x𝑛 + ℎ∇𝑓(x𝑛 ).
If we want to minimize 𝑓, we might as well maximize −𝑓. The only effect of this is a sign change for the gradient. In this
form, the algorithm is called gradient descent, and this is what’s widely used for training neural networks.
x𝑛+1 ∶= x𝑛 − ℎ∇𝑓(x𝑛 ).
355
Mathematics of Machine Learning
import numpy as np
import nbimporter
from tools.function import MultivariableFunction
def gradient_descent(
f: MultivariableFunction,
x_init: np.array, # the initial guess
learning_rate: float = 0.1, # the learning rate
n_iter: int = 1000, # number of steps
):
x = x_init
for n in range(n_iter):
grad = f.grad(x)
x = x - learning_rate*grad
return x
Notice that it is almost identical to the single variable version. To see if it works correctly, let’s test it out on the squared
Euclidean norm function! (The one that we implemented a few chapters earlier.)
squared_norm = SquaredNorm()
local_minimum = gradient_descent(
f=squared_norm,
x_init=np.array([10.0, -15.0])
)
local_minimum
There is nothing special to it, really. The issues with multivariable gradient descent are the same as what we discussed at
the single-variable version: it can get stuck in local minima, it is sensitive to our choice of learning rate, and the gradient
can be computationally hard to calculate in high dimensions.
For machine learning, we’ll have to solve all of these issues.
Probability theory
357
CHAPTER
THIRTYFOUR
WHAT IS PROBABILITY?
When going about our lives, we almost always think in binary terms. A statement is either true or false. An outcome has
either occurred or not.
In practice though, we rarely have the comfort of certainty. We have to operate with incomplete information. When a
scientist observes the outcome of an experiment, can she verify her hypothesis with 100% certainty? No. Because she
does not have complete control over all the variables of the experiment (like the weather or the alignment of stars), the
observed effect might be unintentional. Each result will either strengthen or weaken our belief in the hypothesis, but none
will provide ultimate proof.
In machine learning, our job is not to simply provide a prediction about some class label, but to formulate a mathematical
model that summarizes our knowledge about the data in a way that conveys information about the degree of our certainty
in the prediction as well.
So, fitting a parametric function 𝑓 ∶ ℝ𝑛 → ℝ𝑚 to model the relation between the data and the variable to be predicted is
not enough. We will need an entirely new vocabulary to formulate such models. We need to think in terms of probabilities.
First, let’s talk about how we think. On the most basic level, our knowledge about the world is stored in propositions. In
a mathematical sense, a proposition is a declaration that is either true or false. (In binary terms, true is denoted by 1 and
false is denoted by 0.)
“The sky is blue.“
“There are infinitely many prime numbers.“
“1 + 1 = 3.“
“I got the flu.“
Propositions are often abbreviated as variables such as 𝐴 = "it's raining outside".
Determining the truth value of a given proposition using evidence and reasoning is called inference. To be able to formulate
valid arguments and understand how inference works, we’ll take a quick visit in the world of mathematical logic.
359
Mathematics of Machine Learning
So, we have propositions like 𝐴 = "it's raining outside", or 𝐵 = "the sidewalk is wet". We need more expression power:
propositions are building blocks, and we want to combine them, yielding more complex propositions.
We can formulate complex propositions from smaller building blocks with logical connectives. Consider the proposition
“if it is raining outside, then the sidewalk is wet”. This is the combination of 𝐴 and 𝐵, strung together by the implication
connective.
There are four essential connectives:• NOT (¬), also known as negation,• AND (∧), also known as conjunction,• OR (∨),
also known as disjunction,• THEN (→), also known as implication.
Connectives are defined by the truth values of the resulting propositions. For instance, if 𝐴 is true, then ¬𝐴 is false; if 𝐴
is false, then ¬𝐴 is true. Denoting true by 1 and false by 0, we can describe connectives with truth tables. Here is the one
for negation.
𝐴 ¬𝐴
0 1
1 0
AND (∧) and OR (∨) connect two propositions. 𝐴 ∧ 𝐵 is true if both 𝐴 and 𝐵 are true, while 𝐴 ∨ 𝐵 is true if either one
is.
𝐴 𝐵 𝐴∧𝐵 𝐴∨𝐵
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 1
The implication connective THEN (→) formalizes the deduction of a conclusion 𝐵 from a premise 𝐴. By definition,
𝐴 → 𝐵 is true if 𝐵 is true or both 𝐴 and 𝐵 are false. An example: if “it’s raining outside”, THEN “the sidewalk is wet”.
𝐴 𝐵 𝐴→𝐵
0 0 1
0 1 1
1 0 0
1 1 1
Note that 𝐴 → 𝐵 does not imply 𝐵 → 𝐴. This common logical fallacy is called affirming the consequent, and we’ve all
fell victim to it at some point in our lives. To see a concrete example: if “it’s raining outside”, then “the sidewalk is wet”,
but not the other way around. The sidewalk can be wet for other reasons, like someone spilling a barrel of water.
Connectives correspond to set operations. Why? Let’s take a look at the formal definition of set operations.
Definition 33.1.1 (The (reasonably) formal definition of set operations and relations.)
Let 𝐴 and 𝐵 be two sets.
(a) The union of 𝐴 and 𝐵 is defined by
𝐴 ∪ 𝐵 ∶= {𝑥 ∶ (𝑥 ∈ 𝐴) ∨ (𝑥 ∈ 𝐵)},
𝐴 ∩ 𝐵 ∶= {𝑥 ∶ (𝑥 ∈ 𝐴) ∧ (𝑥 ∈ 𝐵)},
(𝑥 ∈ 𝐴) → (𝑥 ∈ 𝐵)
that is, Ω\𝐴 contains all elements that are in Ω, but not in 𝐴.
If you carefully read through the definitions, you can see how connectives and set operations relate. ∧ is intersection, ∨ is
union, ¬ is the complement, and → is the subset relation. This is illustrated by Fig. 35.1. (I’ve slightly abused the notation
here, as statements like 𝐴 ∧ 𝐵 ⟺ 𝐴 ∩ 𝐵 is mathematically incorrect. 𝐴 and 𝐵 cannot be a proposition and a set at
the same time, and thus equivalence is not precise. )
Why is this important? Because probability operates on sets, and sets play the role of propositions. We’ll see this later,
but first, let’s dive deep into how mathematical logic formalizes scientific thinking.
Let’s refine the inference process of mathematical logic. A proposition is either true or false, fair and square. How can
we determine that in practice? Say, how do we find the truth value of the proposition “there are infinitely many prime
numbers“?
Using evidence and deduction. Like Sherlock Holmes solving a crime by connecting facts, we rely on knowledge of the
form “if 𝐴, then 𝐵“. Our knowledge about the world is stored in true implications. For example,
• “If it is raining, then the sidewalk is wet.“
Classical logic has a fatal flaw: it is unable to deal with uncertainty. Think about the simple proposition “it is raining
outside”. If we are unable to actually observe the weather but have some indirect evidence (like the fact that the sidewalk
is wet, or the sky is cloudy, or it’s autumn out there), “it is raining outside” is probable, but not certain.
We need to tool to measure the truth value on a 0 − 1 scale. This is where probabilities come in.
In a mathematical sense, probability is a function that assigns a numerical value between zero and one to various sets that
represent events. (You can think about events as propositions.) Events are subsets of the event space, often denoted with
the capital Greek letter omega (Ω). This is illustrated in Fig. 34.2.
This sounds quite abstract, so let’s see a simple example: rolling a fair six-sided dice. We can encode all possi-
ble outcomes with the event space Ω = {1, 2, 3, 4, 5, 6}. Events such as 𝐴 = "the outcome is even" or 𝐵 =
"the outcome is larger than 3" are represented by the sets
𝐴 = {2, 4, 6},
𝐵 = {4, 5, 6}.
𝑃 ({1}) = ⋯ = 𝑃 ({6}) = 6.
There are two properties that make such a function 𝑃 a proper measure of probability:
1. the probability of the event space is one,
2. and the probability of the union of disjoint events is the sum of probabilities.
THIRTYFIVE
In the previous chapter, we have talked about probability as the extension of mathematical logic. Just like formal logic,
probability has its axioms, which we need to understand to work with probability models. In this chapter, we are going
to seek the answer to a fundamental question: what is the mathematical model of probability and how to work with it?
Probabilities are defined in context of experiments and outcomes. To talk about probabilities, we need to define what do
we assign probabilities to. Formally speaking, we denote the probability of the event 𝐴 by 𝑃 (𝐴). First, we’ll talk about
what events are.
Let’s revisit the six-sided example from the previous chapter. There are six different mutually exclusive outcomes, and
together they form the event space, denoted by Ω:
Ω ∶= {1, 2, 3, 4, 5, 6}.
In general, the event space is the collection of all mutually exclusive outcomes. It can be any set.
Returning to our dice-rolling example, what kind of events can we assign probabilities to? Obviously, the individual
outcomes come to mind. However, we can think of events like “the result is an odd number”, “the result is 2 or 6”, or
“the result is not 1”. Following this logic, our expectations are that for any two events 𝐴 and 𝐵,
• 𝐴 or 𝐵,
• 𝐴 and 𝐵,
• and not 𝐴
are events as well. These can be translated to the language of set theory, and are formalized by the notion of event algebras.
365
Mathematics of Machine Learning
Since events are modelled by sets, logical concepts like and, or, and not can be translated to set operations. That is,
• the joint occurrence of events 𝐴 and 𝐵 is equivalent to 𝐴 ∩ 𝐵,
• 𝐴 or 𝐵 is equivalent to 𝐴 ∪ 𝐵,
• and not 𝐴 is equivalent to Ω\𝐴.
In the literature, event algebras are frequently referred to as 𝜎-algebras. We’ll use the former terminology, but keep this
in mind for your later studies.
An immediate consequence of the definition is that for any events 𝐴1 , 𝐴2 , ⋯ ∈ Σ, their intersection ∩∞
𝑛=1 𝐴𝑛 is also a
member of Σ. Indeed, as De Morgan’s laws suggest,
Ω\( ∩∞ ∞
𝑛=1 𝐴𝑛 ) = ∪𝑛=1 (Ω\𝐴𝑛 ).
= Ω\ ∪∞
𝑛=1 ( Ω\𝐴
⏟ 𝑛 ).
∈Σ
Ω = {1, 2, 3, 4, 5, 6}, Σ = 2Ω .
Even though this is one of the simplest examples, it will serve as a prototype and a building block for constructing more
complicated event spaces.
Example 2. Tossing a coin 𝑛 times. A single toss has two possible outcomes: heads or tails. For simplicity, we are going
to encode heads with 0 and tails with 1. Since we are tossing the coin 𝑛 times, the result of an experiment will be an
𝑛-long sequence of ones and zeros. Like this: (0, 1, 1, 1, … , 0, 1). Thus, the complete event space is Ω = {0, 1}𝑛 .
(We are not talking about probabilities just yet, but feel free to spend some time figuring out how to assign them to these
events. Don’t worry if this is not clear; we will go through it in detail.)
Just like in the previous example, the event algebra 2Ω is a good choice. This covers all events that we need, for instance,
“the number of tails is 𝑘“.
In practice, event algebras are rarely given explicitly. Sure, for simple cases such as the above, it is possible.
What about cases where the event spaces are not countable? For instance, suppose that we are picking a random number
between 0 and 1. Then, Ω = [0, 1], but selecting Σ = 2[0,1] is extremely problematic. Recall that we want to assign a
probability to every event in Σ. The power set 2[0,1] is so large that very strange things can occur. In certain scenarios,
we can cut up sets into a finite number of pieces and reassemble two identical copies of the set from its pieces. (If you are
interested in more, check out the Banach-Tarski paradox.)
To avoid weird things like the above mentioned, we need another way to describe event algebras.
Let’s start with a simple yet fundamental property of event algebras that we’ll soon use to give a friendly description of
event algebras.
Proof. As we saw in the definition of event algebras, there are three properties we need to verify to show that Σ1 ∩ Σ2 is
an event algebra. This is very simple to check, so I suggest taking a shot by yourself first before reading my explanation.
(a) As both Σ1 and Σ2 are event algebras, Ω ∈ Σ1 and Ω ∈ Σ2 both hold. Thus, by definition of the intersection,
Ω ∈ Σ 1 ∩ Σ2 .
(b) Let 𝐴 ∈ Σ1 ∩ Σ2 . As both of them are event algebras, Ω\𝐴 ∈ Σ1 and Ω\𝐴 ∈ Σ2 . Thus, Ω\𝐴 is an element of the
intersection as well.
(c) Let 𝐴1 , 𝐴2 , ⋯ ∈ Σ1 ∩ Σ2 be arbitrary events. We can use the exact same argument as before: as both Σ1 and Σ2 are
event algebras, ∪∞ 𝑛=1 𝐴𝑛 ∈ Σ1 and ∪𝑛=1 𝐴𝑛 ∈ Σ2 . So, the union is also a member of the intersection. □
∞
With all that, we are ready to describe event algebras with a generating set.
(By smallest, we mean that if Σ is an event algebra containing 𝑆, then 𝜎(𝑆) ⊆ Σ.)
Proof. Our previous result shows that the intersection of event algebras is also an event algebra. So, let’s take all event
algebras that contain 𝑆 and take their intersection. Formally, we define
Right away, we can use this to precisely construct the event algebra for an extremely common task: picking a number
between 0 and 1.
Example 3. Selecting a random number between 0 and 1. It is clear that the event space is Ω = [0, 1]. What about
the events? In this situation, we want to ask questions like the probability of a random number 𝑋 falling between some
𝑎, 𝑏 ∈ [0, 1]. That is, events like (𝑎, 𝑏), (𝑎, 𝑏], [𝑎, 𝑏), [𝑎, 𝑏]. (Whether or not we want strict inequality regarding 𝑎 and 𝑏.)
So, a proper event algebra can be given by the algebra generated by events of the form (𝑎, 𝑏]. That is,
This Σ has a rich structure. For instance, it contains simple events like {𝑥}, where 𝑥 ∈ [0, 1], but more complex ones
like “𝑋 is a rational number” or “𝑋 is an irrational number”. Give yourself a few minutes to see why this is true. Don’t
worry if you don’t see the solution, we’ll work this out in the problems section. (If you think this through, you’ll also see
why we chose intervals of the form (𝑎, 𝑏] instead of others like (𝑎, 𝑏) or [𝑎, 𝑏].)
Now that we understand what events and event algebras are, we can take our first detailed look at probability. In the next
section, we will introduce its precise mathematical definition.
From all the examples we have seen so far, it is clear that most commonly, we define probability spaces on ℕ or on ℝ.
When Ω ⊆ ℕ, the choice of event algebra is clear, as Σ = 2Ω will always work.
However, as suggested in Example 3 above, selecting Σ = 2Ω when Ω ⊆ ℝ can lead to some weird stuff. Because we are
interested in the probability of events like [𝑎, 𝑏], our standard choice is going to be the generated event algebra
called the Borel-algebra, named after the famous French mathematician Émile Borel. Due to its construction, ℬ contains
all events that are important to us, such as intervals and unions of intervals. Elements of ℬ are called Borel-sets.
Because event algebras are closed to unions, you can see that all types of intervals can be found in ℬ(ℝ). This is summa-
rized by the following theorem.
Theorem 34.1.3
For all 𝑎, 𝑏 ∈ ℝ, the sets [𝑎, 𝑏], (𝑎, 𝑏], [𝑎, 𝑏), (−∞, 𝑎], (−∞, 𝑎), (𝑎, ∞), [𝑎, ∞) are elements of ℬ(ℝ).
As an exercise, try to come up with the proof by yourself. One trick to get the ideas flowing is to start drawing some
figures. If you can visualize what happens, you’ll discover a proof quickly.
Proof. In general, for a given set 𝑆, we can show that it belongs to ℬ(ℝ) by writing it as the union/intersection/difference
of known Borel sets. First, we have
(𝑎, ∞) = ∪∞
𝑛=1 (𝑎, 𝑛),
so (𝑎, ∞) ∈ ℬ(ℝ). With a similar argument, we see that (−∞, 𝑎) ∈ ℬ(ℝ). Next,
so (−∞, 𝑎], [𝑎, ∞) ∈ ℬ(ℝ) for all 𝑎. From these, the sets [𝑎, 𝑏], (𝑎, 𝑏], [𝑎, 𝑏) can be produced by intersections. □
Let’s recap what we have learned so far! In the language of mathematics, experiments with intrinsic uncertainty are
described with outcomes, event spaces, and events. The collection of all possible mutually exclusive outcomes of an
experiment is the event space Ω. Its certain subsets are the events, to which we want to assign probabilities. The events
form the so-called event algebra Σ. We denote the probability of an event 𝐴 with 𝑃 (𝐴)
Intuitively speaking, we have three reasonable expectations about probability.
1. 𝑃 (Ω) = 1, that is, the probability that at least one outcome occurs is 1. In other words, our event space is a
complete description of the experiment.
2. 𝑃 (∅) = 0, that is, the probability that none of the outcomes occur is 0. Again, this means that our event space is
complete.
3. The probability that either of two events occurs for two mutually exclusive events is the sum of the individual
probabilities.
Definition 34.2.1
Let Ω be an event space and Σ be an event algebra over Ω. We say that the function 𝑃 ∶ Σ → [0, 1] is a probability
measure on Σ if the following properties hold.
(a) 𝑃 (Ω) = 1.
∞
(b) If 𝐴1 , 𝐴2 , … are mutually disjoint events (that is, 𝐴𝑖 ∩ 𝐴𝑗 = ∅ for all 𝑖 ≠ 𝑗), then 𝑃 (∪∞
𝑛=1 𝐴𝑛 ) = ∑𝑛=1 𝑃 (𝐴𝑛 ).
This property is called the 𝜎-additivity of probability measures.
Along with the probability measure 𝑃 , the structure (Ω, Σ, 𝑃 ) is said to form a probability space.
As usual, let’s see some concrete examples first! We are going to continue with the ones we worked out when discussing
event algebras.
Example 1, continued. Rolling a six-sided dice. Recall that the event space and algebra were defined by
Ω = {1, 2, 3, 4, 5, 6}, Σ = 2Ω .
If we don’t have any extra knowledge about our dice, it is reasonable to assume that each outcome is equally probable.
That is, since there are six possible outcomes, we have
1
𝑃 ({1}) = ⋯ = 𝑃 ({6}) = .
6
Notice that in this case, knowing the probabilities for the individual outcomes is enough to determine the probability of
any event. This is due to the (𝜎-)additivity of the probability. For instance, the event “the outcome of the dice roll is an
odd number” is described by
3
𝑃 ({1, 3, 5}) = 𝑃 ({1}) + 𝑃 ({3}) + 𝑃 ({5}) = .
6
In English, the probability of any event can be written down with the following formula:
favorable outcomes
𝑃 (event) = .
all possible outcomes
You might remember this from your elementary and high school studies (depending on the curriculum in your country).
This is a useful formula, but there is a caveat: it only works if we assume that each outcome has equal probability.
In the case when our dice is not uniformly weighted, the occurrences of individual outcomes are not equal. (Just think of
a lead dice, where one side is significantly heavier than the others.) For now, we are not going to be concerned with this
case. Later, this generalization will be discussed in detail.
Example 2, continued. Tossing a coin 𝑛 times. Here, our event space and algebra was Ω = {0, 1}𝑛 and Σ = 2Ω . For
simplicity, let’s assume that 𝑛 = 5.
What is the probability of a particular result, say HHTTT? Going step by step, the probability that the first toss will be
heads is 1/2. That is,
1
𝑃 (first toss is heads) =
2
Since the first toss is independent of the second,
1
𝑃 (second toss is heads) =
2
as well. To combine this and calculate the probability that the first two tosses are both heads, we can think like the
following. Among the outcomes where the first toss is heads, exactly half of them will have the second toss heads as well.
So, we are looking for the half of the half. That is,
𝑃 (first two tosses are heads) = 𝑃 (first toss is heads)𝑃 (second toss is heads)
1
= .
4
Going further with the same logic, we obtain that
1 1
𝑃 (HHTTT) = = .
25 32
If we look a bit deeper, we can notice that this follows the previously seen “favorable/all” formula. Indeed, as we can see
with a bit of combinatorics, there are 25 total possibilities, all of them having equal probability.
Considering this, what is the probability that out of our five tosses, exactly two of them are heads? In the language of
sets, we can encode each five-toss experiment as a subset of {1, 2, 3, 4, 5}, the elements signifying the toss that resulted
in heads. (So, for example, {1, 4, 5} would encode the outcome HTTHH.) With this, the experiments when there are
two heads are exactly the two-element subsets of {1, 2, 3, 4, 5}.
From our combinatorics studies, we know that the number of subsets with given elements is
𝑛
( ),
𝑘
where 𝑛 is the size of our set and 𝑘 is the desired size of the subsets. So, in total, there are (52) number of occurrences
with exactly two heads. Thus, following the “favorable/all” formula, we have
5 1 10
𝑃 (two heads out of five tosses) = ( ) = .
2 32 32
One more example, and we are ready to move forward.
Example 3. Selecting a random number between 0 and 1. Here, our event space was Ω = [0, 1], and our event algebra
was the generated algebra
Σ = 𝜎({(𝑎, 𝑏] ∶ 0 ≤ 𝑎 < 𝑏 ≤ 1}).
Without any further information, it is reasonable to assume that every number can be selected with an equal probability.
What does this even mean for an infinite event space like Ω = [0, 1]? We can’t divide 1 into infinitely many equal parts.
So, instead of thinking about individual outcomes, we should start thinking about events. Let’s denote our randomly
selected number with 𝑋. If all number is “equally likely”, what is 𝑃 (𝑋 ∈ (0, 1/2])? Intuitively, given our equally likely
hypothesis, this probability should be proportional to the size of [0, 1/2]. Thus,
𝑃 (𝑋 ∈ 𝐼) = |𝐼|,
where 𝐼 is some interval and |𝐼| is its length. For instance,
𝑃 (𝑎 < 𝑋 < 𝑏) = 𝑃 (𝑎 ≤ 𝑋 < 𝑏)
= 𝑃 (𝑎 < 𝑋 ≤ 𝑏)
= 𝑃 (𝑎 ≤ 𝑋 ≤ 𝑏)
= 𝑏 − 𝑎.
By giving the probabilities on the generating set of the event algebra, the probabilities for all other events can be deduced.
For instance,
𝑃 (𝑋 = 𝑥) = 𝑃 (0 ≤ 𝑋 ≤ 𝑥) − 𝑃 (0 ≤ 𝑋 < 𝑥)
=𝑥−𝑥
= 0.
Thus, the probability of picking a given number is surprisingly zero. There is an important lesson here: events with zero
probability can happen. This sounds counterintuitive at first, but based on the above example, you can see that it is true.
Now that we are familiar with the mathematical model of probability, we can start working with them. Manipulating
expressions of probabilities gives us the ability to deal with more and more complex scenarios.
If you recall, probability measures had three simple defining properties:
(a) 𝑃 (Ω) = 1,
(b) 𝑃 (∅) = 0, and
∞
(c) 𝑃 (∪∞
𝑛=1 𝐴𝑘 ) = ∑𝑛=1 𝑃 (𝐴𝑘 ), if the events 𝐴𝑘 are mutually disjoint.
From these properties, many others can be deduced. For simplicity, here is a theorem summarizing the most important
ones.
Theorem 34.2.1
Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴, 𝐵 ∈ Σ be two arbitrary events.
(a) 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∪ 𝐵).
(b) 𝑃 (𝐴) = 𝑃 (𝐴 ∩ 𝐵) + 𝑃 (𝐴\𝐵). Specifically, 𝑃 (Ω\𝐴) + 𝑃 (𝐴) = 1.
(c) if 𝐴 ⊆ 𝐵, then 𝑃 (𝐴) ≤ 𝑃 (𝐵).
The proof of this is so simple that it is left to the reader as a simple exercise. All of these follow from the additivity of
probability measures with respect to disjoint events. (If you don’t see the solution, try drawing Venn diagrams.)
Another fundamental tool is the law of total probability, which is used all the time when dealing with more complex
events.
∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴 ∩ 𝐴𝑛 ) (35.2)
𝑛=1
holds. We call mutually disjoint events whose union is the entire event space partitions.
Proof. This simply follows from the 𝜎-additivity of probability measures. Feel free to give the proof a shot by yourself
to test your understanding.
If you can’t see this, no worries. Here is a brief explanation. Since 𝐴1 , 𝐴2 , … are mutually disjoint, 𝐴 ∩ 𝐴1 , 𝐴 ∩ 𝐴2 , …
are mutually disjoint as well. Moreover, since ∪∞ 𝑛=1 𝐴𝑛 = Ω, we also have
∞
∪𝑛=1 𝐴𝑛 ∩ 𝐴 = ( ∪ ∞
𝑛=1 𝐴𝑛 ) ∩ 𝐴
=Ω∩𝐴
= 𝐴.
Let’s see an example right away! Suppose that we toss two dice. What is the probability that the sum of the results is 7?
First, we should properly describe the probability space. For notational simplicity, let’s denote the result of the throws
with 𝑋 and 𝑌 . What we are looking for is 𝑃 (𝑋 + 𝑌 = 7). Modeling the toss with two dice is the simplest if we impose
order among the dice: we designate the first and the second dice. With this in mind, the event space Ω is described by
the Cartesian product
Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}
= {(𝑖, 𝑗) ∶ 𝑖, 𝑗 ∈ {1, 2, 3, 4, 5, 6}},
and the outcomes are tuples of the form (𝑖, 𝑗). (That is, the tuple (𝑖, 𝑗) encodes the elementary event {𝑋 = 𝑖, 𝑌 = 𝑗}.)
Since the tosses are independent of each other,
11 1
𝑃 (𝑋 = 𝑖, 𝑌 = 𝑗) = = .
66 36
(When it is clear, we omit the brackets of the event {𝑋 = 𝑖, 𝑌 = 𝑗}.)
Since the first throw falls between 1 and 6, we can partition the event space by forming
𝐴𝑛 ∶= {𝑋 = 𝑛}, 𝑛 = 1, … , 6.
However, if we know that 𝑋 + 𝑌 = 7 and 𝑋 = 𝑛, we know that 𝑌 = 7 − 𝑛 must hold as well. So, continuing the
calculation above,
6
𝑃 (𝑋 + 𝑌 = 7) = ∑ 𝑃 ({𝑋 + 𝑌 = 7} and {𝑋 = 𝑛})
𝑛=1
6
= ∑ 𝑃 (𝑋 = 𝑛, 𝑌 = 7 − 𝑛)
𝑛=1
6
1
=∑
𝑛=1
36
1
= .
6
So, the law of total probability helps us deal with complex events by decomposing them into simpler ones. We have seen
this pattern dozens of times now, and once again, it proves to be essential.
As yet another consequence of 𝜎-additivity, we can calculate the probability of an increasing sequence of events by taking
the limit.
𝑃 (∪∞
𝑛=1 𝐴𝑛 ) = lim 𝑃 (𝐴𝑛 ) (35.3)
𝑛→∞
Proof. Since the events are increasing, that is, 𝐴𝑛−1 ⊆ 𝐴𝑛 , we can write 𝐴𝑛 as
∪∞ ∞
𝑛=1 𝐴𝑛 = ∪𝑛=1 𝐴𝑛 \𝐴𝑛−1 , 𝐴0 ∶= ∅,
which gives
∞
𝑃 (∪∞
𝑛=1 𝐴𝑛 ) = ∑ 𝑃 (𝐴𝑛 \𝐴𝑛−1 )
𝑛=1
𝑛
= lim ∑ 𝑃 (𝐴𝑘 \𝐴𝑘−1 )
𝑛→∞
𝑘=1
𝑛
= lim ∑ 𝑃 (𝐴𝑘 ) − 𝑃 (𝐴𝑘−1 )
𝑛→∞
𝑘=1
= lim 𝑃 (𝐴𝑛 ),
𝑛→∞
We can state an analogue of the above theorem for a decreasing sequence of events.
𝑃 (∩∞
𝑛=1 𝐴𝑛 ) = lim 𝑃 (𝐴𝑛 ) (35.4)
𝑛→∞
Now that we have a mathematical definition of a probabilistic model, it is time to take a step toward space where machine
learning is done: ℝ𝑛 .
In machine learning, every data point is an elementary outcome, located somewhere in the Euclidean space ℝ𝑛 . Because
of this, we are interested in modeling experiments there.
How can we define a probability space there? Similarly as we did with on real line, we describe a convenient event algebra
by generating. There, we can use the higher dimensional counterpart of the intervals (𝑎, 𝑏): 𝑛-dimensional spheres. For
this, we define the set
where the norm ‖ ⋅ ‖ denotes the usual Euclidean norm. (The 𝐵 denotes the word “ball”. In mathematics, 𝑛-dimensional
spheres are often called balls.) Similarly to the real line, the Borel-algebra is defined by
As we saw on real line, the structure of ℬ(ℝ𝑛 ) is richer than what the definition suggests at first glance. Here, the analogue
of interval is a rectangle, defined by
Similarly, we can define [𝑎, 𝑏], (𝑎, 𝑏], [𝑎, 𝑏), and others.
Theorem 34.3.1
For any 𝑎, 𝑏 ∈ ℝ𝑛 , the sets [𝑎, 𝑏], [𝑎, 𝑏), (𝑎, 𝑏], (𝑎, ∞), [𝑎, ∞), (−∞, 𝑎), (−∞, 𝑏] are elements of ℬ(ℝ𝑛 ).
Proof. The proof goes along the same line as the counterpart for ℬ(ℝ). As such, it is left as an exercise to the reader.
As a hint, first we can show that (𝑎, 𝑏) can be written as a countable union of balls. We can also show that this holds true
for sets like
ℝ × ⋯ × (−∞,
⏟ 𝑎𝑖 ) × ⋯ × ℝ
𝑖-th component
As an example, let’s throw a few darts at a rectangular wall. Suppose that we are terrible darts players and hitting any
point on the wall is equally likely.
We can model this event space with Ω = [0, 1] × [0, 1] ⊆ ℝ2 , representing our wall. What are the possible events?
For instance, there is a circular darts board hanging on the wall, and we want to find the probability of hitting it. In this
scenario, we can restrict the Borel sets defined by (35.5) to
Now that the event space and algebra is clear, we need to think about assigning probabilities. Our assumption is that
hitting any point is equally likely. So, by generalizing the allfavorable outcomes
possible outcomes formula we have seen in the discrete case, we
define the probability measure by
volume(𝐴)
𝑃 (𝐴) = .
volume(Ω)
Now that we know how to work with probabilities, it is time to study how can we assign probabilities to real-life events.
First, we are going to take a look at the frequentist interpretation, explaining probabilities with relative frequencies. (If
you are one of those people who are religious about this question, calm down. We’ll discuss the Bayesian interpretation
in detail, but it is not time yet.)
Let’s go back to the beginning and consider the coin-tossing experiment, the most basic example possible. If I toss a fair
coin 1000 times, how many of them will be heads? Most people immediately answer 500, but this is not correct. There is
no right answer, any number of heads between 0 and 1000 can happen. Of course, most probably it will be around 500,
but with a very small probability, there can be zero heads as well.
In general, the probability of an event describes its relative frequence among infinitely many attempts. That is,
number of occurences
𝑃 (event) ≈ .
number of attempts
When the number of attempts goes towards infinity, the relative frequency of occurrences converges to the true underlying
probability. In other words, if 𝑋𝑖 quantitatively describes our 𝑖-th attempt by
then
𝑋1 + ⋯ + 𝑋𝑛
𝑃 (event) = lim .
𝑛→∞ 𝑛
We can illustrate this by doing a quick simulation using the coin-tossing example. Don’t worry if you don’t understand
the code, we’ll talk about it in detail in the next chapters.
import numpy as np
from scipy.stats import randint
(continues on next page)
n_tosses = 1000
# coin tosses: 0 for tails and 1 for heads
coin_tosses = [randint.rvs(low=0, high=2) for _ in range(n_tosses)]
averages = [np.mean(coin_tosses[:k+1]) for k in range(n_tosses)]
with plt.style.context("seaborn-white"):
plt.figure(figsize=(16, 8))
plt.title("Relative frequency of the coin tosses")
plt.xlabel("Relative frequency")
plt.ylabel("Number of tosses")
The relative frequency quite nicely stabilizes around 1/2, which is the true probability of our fair coin landing on its heads.
Is this an accident? No.
We will make all of this mathematically precise when talking about the law of large numbers. For now, you can think
about estimating probabilities this way. In the next chapter, we will introduce the Bayesian viewpoint, a probabilistic
framework for updating our models given new observations.
35.5 Problems
Problem 1. Let’s roll two six-sided dice! Describe the event space, event algebra, and the corresponding probabilities for
this experiment.
Problem 2. Let Ω = [0, 1], and the corresponding event algebra be the generated algebra
Show that the event algebras generated by these sets are the same, that is,
THIRTYSIX
CONDITIONAL PROBABILITY
In the previous chapter, we learned the foundations of probability. Now we can speak in terms of outcomes, events, and
chances. However, in real-life applications, these basic tools are not enough to build useful predictive models.
To illustrate this, let’s build a probabilistic spam filter! For every email we receive, we want to estimate the probability
𝑃 (email is spam). The closer this is to 1, the more likely that we are looking at a spam email.
Based on our inbox, we might calculate the relative frequency of spam emails and obtain that
This looks much more useful for our spam filtering efforts. By checking for the presence of the phrase “act now”, we can
confidently classify an email as spam.
Of course, there is much more to spam filtering, but this simple example demonstrates the importance of probabilities
conditional on other events. To put this into mathematical form, we introduce the following definition.
𝑃 (𝐴 ∩ 𝐵)
𝑃 (𝐵 ∣ 𝐴) ∶= .
𝑃 (𝐴)
You can think about 𝑃 (𝐵|𝐴) as restricting the event space to 𝐴, as illustrated by Fig. 36.1.
When there are more conditions, say 𝐴1 and 𝐴2 , the definition takes the form
𝑃 (𝐵 ∩ 𝐴1 ∩ 𝐴2 )
𝑃 (𝐵 ∣ 𝐴1 , 𝐴2 ) = ,
𝑃 (𝐴1 ∩ 𝐴2 )
and so on.
379
Mathematics of Machine Learning
To bring this concept closer, let’s revisit the simple dice-rolling experiment. Suppose that your friend rolls a six-sided
dice and tells you that the outcome is an odd number. Given this information, what is the probability that the result is 3?
For simplicity, let’s denote the outcome of the roll with 𝑋. Mathematically speaking, this can be calculated by
36.1 Independence
The idea behind conditional probability is that observing certain events changes the probability of others. Is this always
the case though?
In probabilistic modeling, recognizing when observing an event doesn’t influence another is equally important. This
motivates the concept of independence.
𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐴)𝑃 (𝐵)
holds.
Equivalently, this can be formulated in terms of conditional probabilities. By the definition, if 𝐴 and 𝐵 are independent,
we have
𝑃 (𝐴 ∩ 𝐵)
𝑃 (𝐵 ∣ 𝐴) =
𝑃 (𝐴)
𝑃 (𝐴)𝑃 (𝐵)
=
𝑃 (𝐴)
= 𝑃 (𝐵).
To see an example, let’s go back to coin tossing, and suppose that we toss a coin two times. Let the result of the first and
second toss be denoted by 𝑋1 and 𝑋2 respectively. What is the probability that both of these tosses are heads? As we
saw this when discussing the probability space given by this experiment, we can see that
1
𝑃 (𝑋1 = heads and 𝑋2 = heads) = 𝑃 (𝑋1 = heads)𝑃 (𝑋2 = heads) = .
4
That is, the two events are independent of each other.
Regarding probability, there are many common misconceptions. One is about the interpretation of independence. Sup-
pose that I toss a fair coin ten times, all of them resulting in heads. What is the probability that my next toss will be
heads?
Most would immediately conclude that this must be very small since having eleven heads in a row is highly unlikely.
However, once we have the ten results available, we no longer talk about the probability of eleven coin tosses, just the last
one! Since the coin tosses are independent of each other, the chance of heads for the eleventh toss (given the results of
the previous ten) is still 50%. This phenomenon is called the gambler’s fallacy, and I am pretty sure that at some point in
your life, you fell victim to it. (I sure did.)
In practical scenarios, working with conditional probabilities might be easier. (For instance, sometimes we can estimate
them directly, while the standard probabilities are difficult to gauge.)
Because of this, we need tools to work with them.
Remember the law of total probability? We can use conditional probabilities to put it into a slightly different form.
∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴|𝐴𝑘 )𝑃 (𝐴𝑘 ). (36.1)
𝑘=1
Proof. The proof is the trivial application of the law of total probability and the definition of conditional probabilities: as
𝑃 (𝐴 ∩ 𝐴𝑘 ) = 𝑃 (𝐴|𝐴𝑘 )𝑃 (𝐴𝑘 ),
∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴 ∩ 𝐴𝑘 )
𝑘=1
∞
= ∑ 𝑃 (𝐴|𝐴𝑘 )𝑃 (𝐴𝑘 )
𝑘=1
Why is this useful for us? Let’s demonstrate this with an example. Suppose that we have three urns containing red and
blue colored balls. The first one contains 4 blue, the second one contains 2 red and 2 blue, while the last one contains 1
red and 3 blue balls.
We randomly pick one. However, picking the first one is twice as likely as picking the other two. (That is, we pick the first
urn 50% of the time, while the second and the third 25%-25% of the time.) From that urn, we also randomly pick a ball.
What is the probability that we select a red ball? Without using the law of total probability, this is difficult to compute.
Let’s denote the color of the selected ball by 𝑋 and suppose that the event 𝐴𝑛 describes picking the 𝑛-th urn. Then, we
have
3 3
𝑃 (𝑋 = red) = ∑ 𝑃 ({𝑋 = red} ∩ 𝐴𝑘 ) = ∑ 𝑃 (𝑋 = red ∣ 𝐴𝑘 )𝑃 (𝐴𝑘 ).
𝑘=1 𝑘=1
Without using conditional probabilities, calculating 𝑃 ({𝑋 = red} ∩ 𝐴𝑘 ) is difficult. (Since we are not picking each urn
with equal probability.) However, we can simply calculate the conditionals by counting the number of red balls in each
urn. That is, we have
𝑃 (𝑋 = red ∣ 𝐴1 ) = 0
2
𝑃 (𝑋 = red ∣ 𝐴2 ) =
4
1
𝑃 (𝑋 = red ∣ 𝐴3 ) = .
4
Since 𝑃 (𝐴1 ) = 1/2, 𝑃 (𝐴2 ) = 1/4, and 𝑃 (𝐴3 ) = 1/4, the probability we are looking for is
3
𝑃 (𝑋 = red) = ∑ 𝑃 (𝑋 = red ∣ 𝐴𝑘 )𝑃 (𝐴𝑘 )
𝑘=1
1 21 11
=0 + +
2 44 44
3
= .
16
Note that because the urns are not selected with equal probability,
number of red balls
𝑃 (𝑋 = red) ≠ ,
number of balls
as one would naively guess. This is because the urns are not selected with uniform probability, as frequently happens in
statistics.
Another useful property of the conditional probability that, due to their definition, we can use them to express the joint
probability of events:
𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐵 ∣ 𝐴)𝑃 (𝐴).
Even though this sounds trivial, there are cases when we can estimate/compute the conditional probability, but not the
joint probability. In fact, this simple identity can be generalized for an arbitrary number of conditions. This is called the
chain rule. (Despite its name, it has nothing to do with the chain rule for differentiation.)
holds.
In essence, machine learning is about turning observations into predictive models. Probability theory gives us a language
to express our models. For instance, going back to our spam filter example from the beginning of the chapter, we can
notice that 5% of our emails are spam. However, this is not enough information to filter out spam emails. Upon inspection,
we have observed that 95% of mails that contain the phrase “act now” are spam. (But only 1% of the all mails contain
“act now”.)
In the language of conditional probabilities, we have concluded that
𝑃 (spam ∣ contains "act now") = 0.95.
With this, we can start looking for emails containing the phrase “act now” and discard them with 95% confidence. Is this
spam filter effective? Not really, since there can be other frequent keywords in spam mails that we don’t check. How can
we check this?
For one, we can take a look at the conditional probability 𝑃 (contains "act now" ∣ spam), describing the frequency of the
“act now” keyword among all the spam emails. A low frequency means that we are missing out on other keywords that
we can use for filtering.
Generally speaking, we often want to compute/estimate the quantity 𝑃 (𝐴|𝐵), but our observations only allow us to infer
𝑃 (𝐵|𝐴). So, we need a way to reverse the condition and the event. With a bit of algebra, we can do this easily.
To see how it works in action, let’s put it to the test in our spam filtering example. Given the information we know, we
have
𝑃 (spam ∣ contains "act now") = 0.95,
𝑃 (contains "act now") = 0.01,
𝑃 (spam) = 0.05.
So, according to the Bayes formula,
𝑃 (spam ∣ contains "act now")𝑃 (contains "act now")
𝑃 (contains "act now" ∣ spam) =
𝑃 (spam)
0.95 × 0.01
=
0.05
19
= .
100
Thus, by filtering only for the phrase “act now”, we are missing a lot of spam.
We can take the Bayes formula one step further by combining it with the law of total probability. (See the equation (36.1).)
𝑃 (𝐴 ∣ 𝐵)𝑃 (𝐵)
𝑃 (𝐵 ∣ 𝐴) = ∞
∑𝑘=1 𝑃 (𝐴 ∣ 𝐴𝑘 )𝑃 (𝐴𝑘 )
holds.
Proof. The proof immediately follows from the Bayes formula (36.3) and the law of total probability (36.1). □
Historically, probability was introduced as the relative frequency of observed events. However, the invention of conditional
probabilities and the Bayes formula enabled another interpretation that slowly became prevalent in statistics and machine
learning.
In pure English, the Bayes formula can be thought of as updating our probabilistic models using new observations. Suppose
that we are interested in the event 𝐵. Without observing anything, we can formulate a probabilistic model by assigning
a probability to 𝐵, that is, estimating 𝑃 (𝐵). This is what we call the prior. However, observing another event 𝐴 might
change our probabilistic model.
Thus, we would like to estimate the posterior probability 𝑃 (𝐵 ∣ 𝐴). We can’t do this directly, but thanks to our prior
model, we can tell 𝑃 (𝐴 ∣ 𝐵). The quantity 𝑃 (𝐴 ∣ 𝐵) is called the likelihood. Combining these with the Bayes formula,
we can see that the posterior is proportional to the likelihood and the prior.
Fig. 36.3: The Bayes formula, as the product of the likelihood and prior.
Let’s see a concrete example that will make the idea clear. Suppose that we are creating a diagnostic test for an exotic
disease. How likely is the disease present in a random person?
Without knowing any specifics about the situation, we can only use statistics to formulate the probability model. Let’s say
that only 2% of the population is affected. So, our probabilistic model is
However, once someone produces a positive test, things change. The goal is to estimate the posterior probability
𝑃 (infected|positive test), a more accurate model.
Since no medical test is perfect, false positives and false negatives can happen. From the manufacturer, we know that it
gives true positives 99% of the time, but the chance for a false positive is 5%. In probabilistic terms, we have
So, the chance of being infected upon producing a positive test is surprisingly 29%. (Given these specific true and false
positive rates.)
These probabilistic thinking principles are also valid for machine learning. If we abstract away the process of learning
from data, we are essentially 1) making observations, 2) updating our models given the new observations, and 3) and start
the process over again. The Bayes theorem gives a concrete tool for the job.
As we have seen before, probability theory is the extension of mathematical logic. So far, we have discussed how logical
connectives correspond to set operations, and how probability generalizes the truth value by adding the component of
uncertainty. What about the probabilistic inference process? Can we generalize classical inference and use probabilistic
reasoning to construct arguments? Yes.
To illustrate, let’s start with a story. It’s 6:00 AM. The alarm clock is blasting, but you are having a hard time getting out
of bed. You don’t feel well. Your muscles are weak, and your head is exploding. After a brief struggle, you manage to
call a doctor and list all the symptoms. Your sore throat makes speaking painful.
“It’s probably just the flu“, she says.
Interactions like this are everyday occurrences. Yet, we hardly think about the reasoning process behind them. After all,
you could have been hungover. Similarly, if the police find a murder weapon at your house, they’ll suspect that you are
the killer. The two are related, but not the same. For instance, the murder weapon could have been planted.
The bulk of humanity’s knowledge is obtained in this manner: we collect evidence, then explain it with various hypotheses.
How do we infer the underlying cause from observing the effect? Most importantly, how can we avoid fooling ourselves
into false conclusions?
Let’s focus on “muscle fatigue, headache, sore throat → flu“. This is certainly not true in an absolute sense, as these
symptoms resemble how you would feel after shouting and drinking excessively during a metal concert. Which is far from
the flu. Yet, a positive diagnosis of flu is plausible. Given the evidence at hand, our belief is increased in the hypothesis.
Unfortunately, classical logic cannot deal with plausible. Only with absolute. Probability theory solves this problem by
measuring plausibility on a 0 − 1 scale, instead of being stuck at the extremes. Zero is impossible. One is certain. All the
values in between represent degrees of uncertainty.
Essentially, 𝑃 (𝐵 ∣ 𝐴) = 1 means that 𝐴 → 𝐵 is true, while 𝑃 (𝐵 ∣ 𝐴) = 0 means that it is not. We can take this analogy
further: a small 𝑃 (𝐵 ∣ 𝐴) means that 𝐴 → 𝐵 is likely false, and a large 𝑃 (𝐵 ∣ 𝐴) means that it is likely to be true. This
is illustrated by Fig. 36.5.
Thus, the “probabilistic modus ponens” goes like this:
1. 𝑃 (𝐵 ∣ 𝐴) ≈ 1.
2. 𝐴.
3. Therefore, 𝐵 is probable.
This is quite a relief, as now we have a solid theoretical justification for most of our decisions. Thus, the diagnostic process
that kicked up our investigation makes a lot more sense now:
1. 𝑃 (flu ∣ headache, muscle fatigue, sore throat) ≈ 1.
2. “Headache and muscle fatigue”.
3. Therefore, “flu” is probable.
However, one burning question remains. How do we know that 𝑃 (flu ∣ headache, muscle fatigue, sore throat) ≈ 1 holds?
Let’s focus on the probabilistic version of “headache, sore throat, muscle fatigue → flu“. We know that this is not certain,
only plausible. Yet, the reverse implication “flu \to headache, sore throat, muscle fatigue“ is almost certain.
When naively arguing that the evidence implies the hypothesis, we have the opposite in mind. Instead of applying the
modus ponens, we use the faulty argument
1. 𝐴 → 𝐵.
2. 𝐵.
3. Therefore 𝐴.
We have talked about this before. This logical fallacy is called affirming the consequent, and it’s completely wrong from
a purely logical standpoint. However, the Bayes theorem provides its probabilistic version. The proposition 𝐴 → 𝐵
translates to 𝑃 (𝐵 ∣ 𝐴) = 1, which implies that when 𝐵 is observed, 𝐴 is more likely. Why? Because then, we have
𝑃 (𝐵 ∣ 𝐴)𝑃 (𝐴)
𝑃 (𝐴 ∣ 𝐵) =
𝑃 (𝐵)
𝑃 (𝐴)
=
𝑃 (𝐵)
> 𝑃 (𝐵).
This is good news, as reversing the implication is not totally wrong. Instead, we have the “probabilistic affirming the
consequent”:
1. 𝐴 → 𝐵.
2. 𝐵.
3. Therefore, 𝐴 is more probable.
With this, the probabilistic reasoning process makes perfect sense. To recall, the issue with arguments like “if you have
muscle fatigue, sore throat, and a headache, then you have the flu“ is that the symptoms can be caused by other conditions;
and in rare cases, the flu does not carry all of these these symptoms.
Yet, this kind of thinking can surprisingly effective in real-life decision-making. Probability and conditional probability
extends our reasoning toolkit with inductive methods in three steps:
1. generalized the binary 0 − 1 truth values to allow the representation of uncertainty,
2. defined the analogue of “if 𝐴, then 𝐵“ -type implications using conditional probability,
3. and came up with a method to infer the cause from observing the effect.
These three ideas are seriously powerful, and their inception has enabled science to perform unbelievable feats.
(If you are interested in learning more about the relation of probability theory and logic, I recommend you the great book
[[JBP03]].)
Before we finish with conditional probability, we’ll touch on an important problem. Regarding probability, we often have
seemingly contradictory phenomena, going against our intuitive expectations. These are called paradoxes. To master
probabilistic thinking, we need to resolve them and eliminate common fallacies from our thinking processes. So far, we
have already seen the gambler’s fallacy when talking about the concept of independence. Now, we’ll discuss the famous
Monty Hall paradox.
In the ’60s, there was a TV show in the United States, called Let’s Make a Deal. As a contestant, you faced three closed
doors, one having a car behind it (that you could take home), while the rest were empty. You had the opportunity to open
one.
Fig. 36.6: Three closed doors, one of which contains a reward behind it.
Suppose that after selecting door No. 1, Monty Hall, the show host, opens the third door, showing that it was not the
winning one. Now, you have the opportunity to change your mind and open door No. 2 instead of the first one. Do you
take it?
At first glance, your chances are 50%-50%, so you might not be better off with switching. However, this is not true!
To set things straight, let’s do a careful probabilistic analysis! Let 𝐴𝑖 denote the event that the price is behind the 𝑖-th
door, while 𝐵𝑖 is the event of Monty opening the 𝑖-th door. Before Monty opens the third one, our model is
1
𝑃 (𝐴1 ) = 𝑃 (𝐴2 ) = 𝑃 (𝐴3 ) = ,
3
Fig. 36.7: Monty opened the third door for you. Do you switch?
and we want to calculate 𝑃 (𝐴1 ∣ 𝐵3 ) and 𝑃 (𝐴2 ∣ 𝐵3 ). By thinking from the perspective of the show host, which door
would you open? If you know that the price is behind the 1st door, you open the 2nd and 3rd one with equal probability.
However, if the price is actually behind the 2nd door (and the contestant selected the 1st one), you always open the 3rd
one. That is,
1
𝑃 (𝐵3 ∣ 𝐴1 ) = 𝑃 (𝐵2 ∣ 𝐴1 ) = ,
2
𝑃 (𝐵3 ∣ 𝐴2 ) = 1.
Thus, by applying the Bayes formula, we have
𝑃 (𝐵3 ∣ 𝐴1 )𝑃 (𝐴1 )
𝑃 (𝐴1 ∣ 𝐵3 ) =
𝑃 (𝐵3 )
1/6
= ,
𝑃 (𝐵3 )
and
𝑃 (𝐵3 ∣ 𝐴2 )𝑃 (𝐴2 )
𝑃 (𝐴2 ∣ 𝐵3 ) =
𝑃 (𝐵3 )
1/3
= .
𝑃 (𝐵3 )
In conclusion, 𝑃 (𝐴2 ∣ 𝐵3 ) is twice as large as 𝑃 (𝐴1 ∣ 𝐵3 ), from which we deduce
1 2
𝑃 (𝐴1 ∣ 𝐵3 ) = , 𝑃 (𝐴2 ∣ 𝐵3 ) = .
3 3
So, you should always switch doors. Surprising, isn’t it? Here, the paradox is that contrary to what we might expect,
changing our minds is the better option. With clear probabilistic thinking, we can easily resolve this.
THIRTYSEVEN
RANDOM VARIABLES
Having a probability space to model our experiments and observations is fine and all, but in almost all of the cases, we are
interested in a quantitative measure of the outcome. To give you an example, let’s consider an already familiar situation:
tossing coins. Suppose that we are tossing a fair coin 𝑛 times, but we are only interested in the number of heads. How do
we model the probability space this time?
By taking things one step at a time, first we construct an event space by enumerating all possible outcomes in a single set,
just like we already did:
Ω = {0, 1}𝑛 , Σ = 2Ω .
Since the coin is fair, each outcome 𝜔 has the probability 𝑃 (𝜔) = 21𝑛 . This probability space (Ω, Σ, 𝑃 ) is nice and simple
so far. Using the additivity of probability measures, we can calculate the probability of eny event. That is, for any 𝐴 ∈ Σ,
we have
|𝐴|
𝑃 (𝐴) = ,
|Ω|
where | ⋅ | denotes the number of elements in a given set.
However, as mentioned, we are only interested in the number of heads. Should we just incorporate this information
somewhere in the probability space? Sure, we could do that, but that would couple the elementary outcomes (that is, a
series of heads or tails) with the measurements. This can significantly complicate our model.
Instead of overloading this probability space to directly deal with the desired measurements, we can do something much
simpler: introduce a function 𝑋 ∶ Ω → ℕ, mapping outcomes to measurements.
These functions are called random variables, and they are at the very center of probability theory and statistics. By
collecting data, we are observing random variables, and by fitting predictive models, we approximate them using the
observations. Now that we understand why we need them, we are going to make this notion mathematically precise.
In their ultimate form, random variables are special mappings between probability spaces and event spaces. By taking one
step at a time, we’ll deal with so-called discrete random variables (such as the above example) first, real random variables
second, and the general case last.
Following our motivating example describing the number of heads in 𝑛 coin tosses, we can create a formal definition.
𝑆𝑘 = {𝜔 ∈ Ω ∶ 𝑋(𝜔) = 𝑥𝑘 }
391
Mathematics of Machine Learning
You might ask, why are we requiring the sets {𝜔 ∈ Ω ∶ 𝑋(𝜔) = 𝑥𝑘 } to be events? It seems like just another technical
condition, but this plays an essential role. Ultimately, we are defining random variables because we are want to measure
the probabities of our observations. This condition assures that we can do this.
To simplify our notations, we write
𝑃 (𝑋 = 𝑥𝑘 ) ∶= 𝑃 ({𝜔 ∈ Ω ∶ 𝑋(𝜔) = 𝑥𝑘 })
𝑋 = number of heads.
this is not needed. Often such a thing is not even possible. Regarding our random variables, we are not interested in
knowing the entire mapping, but rather questions such as the probability of 𝑘 heads among 𝑛 tosses.
If we record the “timestamps” where the outcome is heads, we can encode each 𝜔 as a subset of {1, 2, … , 𝑛}. For instance,
if the 1st, 3rd, and 37th tosses are heads and the rest are tails, this is {1, 3, 37}. To calculate the probability of 𝑘 heads,
we need to count the number of 𝑘-sized subsets for a set of 𝑛 elements. This is given by the binomial coefficient (𝑛𝑘). So,
𝑛 1
𝑃 (𝑋 = 𝑘) = ( ) 𝑛 .
𝑘 2
What if our measurements are not discrete? For instance, suppose that we have a class of students in front of us. We are
interested in the distribution of their body height. So, we pick one student by random and measure their height with our
shiny new tool, capable to measure the height with perfect precision.
In this case, discrete random variables are not enough, but we can define something similar.
are events for all 𝑎, 𝑏 ∈ ℝ. (That is, 𝑋 −1 ((𝑎, 𝑏)) ∈ Σ for all 𝑎, 𝑏 ∈ ℝ.)
Let’s unwrap this definition. First of all, 𝑋 is a mapping from the event space Ω to the set of real numbers ℝ.
Similarly to the discrete case, we are interested in the probabilities of events like 𝑋 −1 ((𝑎, 𝑏)). Again, for simplicity, we
write
You can imagine 𝑋 −1 ((𝑎, 𝑏) as the subset of Ω that are mapped to (𝑎, 𝑏). (In general, sets of the form 𝑋 −1 (𝐴) are called
inverse images.)
Fig. 37.1: A real-valued random variable is a mapping from the event space to the set of real numbers.
Let’s see an example right away. Suppose that we are throwing darts at a circular board on the wall. (For simplicity,
assume that we are so good that we always hit the board.) As we have seen when discussing event algebras in higher
dimensions, we can model this by selecting
Ω = 𝐵(0, 1)
= {𝑥 ∈ ℝ2 ∶ ‖𝑥‖ < 1}
and
Σ = ℬ(𝐵(0, 1))
= 𝜎({𝐴 ∩ 𝐵(0, 1) ∶ 𝐴 ∈ ℬ(ℝ𝑛 )),
while
area(𝐴) area(𝐴)
𝑃 (𝐴) = = .
area(Ω) 𝜋
Since dart boards are subdivided by concentric circles, scoring is determined by the distance from the center. So, we
might as well define our random variable by
⎧0 if 𝑟 ≤ 0,
{
𝑃 (𝑋 < 𝑟) = ⎨𝑟2 if 0 < 𝑟 < 1,
{1 otherwise.
⎩
What if we have more than one measurement? For instance, in the case of the famous Iris dataset (one that we have seen
a few times so far), we have four measurements. Sure, we can just define four random variables, but then we cannot take
advantage of all the heavy machinery we built so far: linear algebra and multivariate calculus.
For this, we will take a look at random variables in the general case.
𝑋 −1 (𝐸) ∶= {𝜔 ∈ Ω1 ∶ 𝑋(𝜔) ∈ 𝐸}
In mathematical literature, random variables are usually denoted with either capital latin letters such as 𝑋, 𝑌 , or Greek
letters. (Mostly starting from 𝜉.)
Random variables essentially push probability measures forward from abstract probability spaces to more tractable ones.
On the event space (Ω2 , Σ2 ), we can define a probability measure 𝑃2 by
𝑃2 (𝐸) ∶= 𝑃1 (𝑋 −1 (𝐸)), 𝐸 ∈ Σ2 ,
making it possible to transform one probability space to another, while keeping the underlying probabilistic model intact.
This general case covers all the mathematical objects we are interested in for machine learning. Staying with the Iris
dataset, the random variable
describes the generating distribution for the dataset, while for classification tasks, we are interested in approximating the
random variable
𝑌 ∶ set of iris flowers → {setosa, versicolor, virginica},
iris flower ↦ class label.
Now we will take a deeper look into why random variables are defined this way. This will be a bit technical, so feel free
to skip it. It won’t adversely affect your ability to work with random variables.
So, random variables are functions, mapping the probability space onto a measurement space. The only question is, why
are the sets 𝑋 −1 (𝐸) so special? Let’s revisit one of our motivating examples: picking a random student and measuring
their height. We are interested in questions such as the probability of a student having a body height between 155 cm and
185 cm. (If you prefer using the imperial metric system, then 155 cm is roughly 5.09 feet and 185 cm is around 6.07
feet.) Translating this to formulas, we are interested in
(In the above formula, I wrote the same thing using two different notations.)
So, how is 𝑋 −1 ([155, 185]) an event? To find this out, let’s look at inverse images in general.
We like inverse images of sets because they behave nicely under set operations. This is formalized by the following
theorem.
Theorem
Let 𝑓 ∶ 𝐸 → 𝐻 be a function between the two sets 𝐸 and 𝐻. For any 𝐴1 , 𝐴2 , ⋯ ⊆ 𝐻, the following hold.
(a) 𝑓 −1 (∪∞ ∞
𝑛=1 𝐴𝑛 ) = ∪𝑛=1 𝑓
−1
(𝐴𝑛 ),
(b) 𝑓 −1 (𝐴1 \𝐴2 ) = 𝑓 −1 (𝐴1 )\𝑓 −1 (𝐴2 ),
(c) 𝑓 −1 (∩∞ ∞
𝑛=1 𝐴𝑛 ) = ∩𝑛=1 𝑓
−1
(𝐴𝑛 ).
Proof. (a) We can easily see this by simply writing out the definitions. That is, we have
𝑓 −1 (∪∞ ∞
𝑛=1 𝐴𝑛 ) = {𝑥 ∈ 𝐸 ∶ 𝑓(𝑥) ∈ ∪𝑛=1 𝐴𝑛 }
= ∪∞
𝑛=1 {𝑥 ∈ 𝐸 ∶ 𝑓(𝑥) ∈ 𝐴𝑛 }
= ∪∞
𝑛=1 𝑓
−1
(𝐴𝑛 ),
which is what we had to show. (If you are not comfortable with working with sets, feel free to review the chapter on
introductory set theory.)
(b) This can be done in the same manner as (a).
(c) The De Morgan laws imply that
𝐻\( ∪∞ ∞
𝑛=1 𝐴𝑛 ) = ∩𝑛=1 (𝐻\𝐴𝑛 )
Why is this important? Recall that the Borel sets, our standard event algebra on real numbers, is defined by
These contain all events that we are interested in regarding the measurements. Combined with our previous result, we
can reveal what is not in plain sight about random variables.
Theorem
Let (Ω, Σ, 𝑃 ) be a probability space and 𝑋 ∶ Ω → ℝ be a random variable, and let 𝐴 ∈ ℬ, where ℬ is defined by (37.1).
Then 𝑋 −1 (𝐴) ∈ Σ.
That is, we can measure the probability of 𝑋 −1 (𝐴) for any Borel set 𝐴. Without this, our random variables would not be
that useful. To make our notations more intuitive, we write
𝑃 (𝑋 ∈ 𝐴) ∶= 𝑃 (𝑋 −1 (𝐴)).
In plain English, 𝑃 (𝑋 ∈ 𝐴) is the probability of our measurement 𝑋 falling into the set 𝐴.
Now that we understand what all of this means, let’s see the simple proof!
Proof. This is a simple consequence of the fact that ℬ is the event algebra generated by sets of the form (−∞, 𝑥], and
the inverse images behave nicely under set operations (as the previous result suggests). □
When building probabilistic models of the external world, the assumption of independence significantly simplifies the
subsequent mathematical analysis. Recall that on a probability space (Ω, Σ, 𝑃 ) the events 𝐴, 𝐵 ∈ Σ are independent if
𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐴)𝑃 (𝐵),
or equivalently,
𝑃 (𝐴|𝐵) = 𝑃 (𝐴).
In plain English, observing one event doesn’t change our probabilistic belief about the other. Since a random variable 𝑋
is described by events of the form 𝑋 −1 (𝐸), we can generalize the notion of independence to random variables.
Let 𝑋, 𝑌 ∶ Ω1 → Ω2 be two random variables between the probability space (Ω1 , Σ1 , 𝑃 ) and event algebra (Ω2 , Σ2 ).
We say that 𝑋 and 𝑌 are independent if for every 𝐴, 𝐵 ∈ Σ2 ,
𝑃 (𝑋 ∈ 𝐴, 𝑌 ∈ 𝐵) = 𝑃 (𝑋 ∈ 𝐴)𝑃 (𝑌 ∈ 𝐵)
holds.
Again, think about two coin tosses. 𝑋1 describes the first coin toss, 𝑋2 describes the other. Since the tosses are inde-
pendent, no observation of the first one reveals any extra information about the second one. This is formalized by the
definition above.
On the other hand, to show two dependent random variables, consider the following. We’ll roll a six-sided dice, and denote
the result by 𝑋. After that, we roll with 𝑋 pieces of six-sided dices, and denote the sum total of their values by 𝑌 .
𝑋 and 𝑌 are dependent on each other. For instance, consider that 𝑃 (𝑋 = 1, 𝑌 ≥ 7) = 0, but neither 𝑃 (𝑋 = 1) and
𝑃 (𝑌 ≥ 7) are zero.
37.5 Problems
Problem 1. Let 𝑋 and 𝑌 be two independent random variables, and let 𝑎, 𝑏 ∈ ℝ be two arbitrary constants. Show that
𝑋 − 𝑎 and 𝑌 − 𝑏 are also independent from each other.
THIRTYEIGHT
DISTRIBUTIONS
Let’s recap what we have learned so far. In probability theory, our goal is to
1. model real-life scenarios affected by uncertainty,
2. and to analyze them using mathematical tools such as calculus.
For the latter purpose, probability spaces are not easy to work with. A probability measure is a function defined on an
event algebra, so we can’t really use calculus there.
Random variables bring us one step closer to the solution, but they can be also difficult to work with. Even though a real
random variable 𝑋 ∶ Ω → ℝ maps an abstract probablity space to the set of real numbers, there are some complications.
Ω can be anything, and if you recall, we might not even have a tractable formula for 𝑋.
For example, if 𝑋 denotes the lifetime of a lightbulb, we don’t have a formula. So again, we can’t use calculus. However,
there is a way to represent the information contained by a random variable in a sequence, a vector-scalar function, or a
scalar-scalar function.
Enter probability distributions and density functions.
Consider a simple experiment, like tossing a fair coin 𝑛 times and counting the number of heads, denoting it with 𝑋. As
we have seen before, 𝑋 is a discrete random variable with
(𝑛𝑘) 21𝑛 if 𝑘 = 0, 1, … , 𝑛,
𝑃 (𝑋 = 𝑘) = {
0 otherwise.
where we used the (𝜎-)additivity of probability. The sequence {𝑃 (𝑋 = 𝑘)}𝑛𝑘=0 is all the information we need.
As a consequence, instead of working with 𝑋 ∶ Ω → ℕ, we can forget about it and use only {𝑃 (𝑋 = 𝑘)}𝑛𝑘=0 . Why is
this good for us?
Because sequences are awesome. As opposed to the mysterious random variables, we have a lot of tools to work with
them. Most importantly, we can represent them in a programming language as an array of numbers. We can’t do such a
thing with pure random variables.
399
Mathematics of Machine Learning
𝑝𝑋 (𝑥𝑘 ) = 𝑃 (𝑋 = 𝑘𝑘 )
is called the probability mass function (or PMF in short) of the discrete random variable 𝑋.
In general, a sequence of real numbers defines a discrete distribution if its elements are nonnegative and it sums up to one.
Remark 37.1.1
Note that if the random variable assumes only finitely many variables (such as in our coin tossing example before), only
finitely many values are nonzero in the distribution.
As recently hinted, every discrete random variable 𝑋 defines the distribution {𝑃 (𝑋 = 𝑥𝑘 )}∞𝑘=1 , where {𝑥1 , 𝑥2 , … } are
the possible values 𝑋 can take. This is true the other way around: given a discrete distribution p = {𝑝𝑘 }∞𝑘=1 , we can
construct a random variable 𝑋 whose PMF is p.
Thus, the probability mass function of 𝑋 is also referred to as its distribution. I know, it is a bit confusing, as the word
“distribution” is quite overloaded in math. You’ll get used to it.
These discrete probability distributions are well-suited for performing quantitative analysis, as opposed to the base form
of random variables. As an additional benefit, think about how distributions generalize random variables. No matter if
we talk about coin tosses or medical tests, the rate of success is given by the above discrete probability distribution.
Before moving on to discussing the basic properties of discrete distributions, let’s see some examples!
Let’s start the long line of examples with the most basic probability distribution possible: the Bernoulli distribution,
describing a simple coin-tossing experiment. We are tossing a coin having probability 𝑝 of coming up heads and probability
1 − 𝑝 of coming up tails. The experiment is encoded in the random variable 𝑋 that takes the value 1 if the toss results in
heads, 0 otherwise.
Thus,
⎧1 − 𝑝 if 𝑘 = 0,
{
𝑃 (𝑋 = 𝑘) = ⎨𝑝 if 𝑘 = 1,
{0 otherwise.
⎩
When a random variable 𝑋 is distributed according to this, we write
𝑋 ∼ Bernoulli(𝑝),
We can generate random values using the rvs method of the bernoulli object. (Just like for any other distribution
from scipy.)
[0, 1, 1, 0, 0, 0, 0, 0, 1, 0]
In scipy, the probability mass function is implemented in the pmf method. We can even visualize the distribution using
Matplotlib. (Don’t worry if you don’t understand the code below. As we will routinely do things like this, I’ll introduce
you to the necessary libraries when the time comes.)
with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, len(params), figsize=(4*len(params), 4), sharey=True)
fig.suptitle("The Bernoulli distribution")
for ax, p in zip(axs, params):
x = range(2)
y = [bernoulli.pmf(k=k, p=p) for k in x]
ax.bar(x, y)
ax.set_title(f"p = {p}")
ax.set_ylabel("P(X = k)")
ax.set_xlabel("k")
If you are interested in the details, feel free to check out the SciPy documentation for further methods!
Let’s take our previous coin-tossing example one step further. Suppose that we toss the same coin 𝑛 times, and 𝑋 denotes
the number of heads out of 𝑛 tosses. What is the probability of getting exactly 𝑘 heads?
Say, 𝑛 = 5 and 𝑘 = 3. For example, the configuration 11010 (where 0 denotes tails and 1 denotes heads) has the
probability 𝑝3 (1 − 𝑝)2 , as there are three heads and two tails from five independent tosses.
How many such configurations are available? Selecting the position of the three heads is the same as selecting a three-
element subset out of a set of five elements. Thus, there are (𝑛𝑘) possibilities.
Combining this, we have
(𝑛𝑘)𝑝𝑘 (1 − 𝑝)𝑛−𝑘 if 𝑘 = 0, 1, … , 𝑛,
𝑃 (𝑋 = 𝑘) = {
0 otherwise.
This is called the binomial distribution, one of the most frequently encountered ones in probability and statistics. In
notation, we write
𝑋 ∼ Binomial(𝑛, 𝑝),
where the 𝑛 ∈ ℕ and 𝑝 ∈ [0, 1] are its two parameters. Let’s visualize the distribution!
with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, len(params), figsize=(4*len(params), 4), sharey=True)
fig.suptitle("The binomial distribution")
for ax, (n, p) in zip(axs, params):
x = range(n+1)
y = [binom.pmf(n=n, p=p, k=k) for k in x]
ax.bar(x, y)
ax.set_title(f"n = {n}, p = {p}")
ax.set_ylabel("P(X = k)")
ax.set_xlabel("k")
A bit more coin tossing. We toss the same coin until a heads turn up. Let 𝑋 denote the number of tosses needed. With
some elementary probabilistic thinking, we can deduce that
(1 − 𝑝)𝑘−1 𝑝 if 𝑘 = 1, 2, … , 𝑛,
𝑃 (𝑋 = 𝑘) = {
0 otherwise.
(Since if heads turn up first for the 𝑘-th toss, we tossed 𝑘 − 1 tails previously.) This is called the geometric distribution
and is denoted as
𝑋 ∼ Geo(𝑝),
with 𝑝 ∈ [0, 1] being the only parameter. Similarly, we can plot the histograms to visualize the distribution family.
with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, len(params), figsize=(5*len(params), 5), sharey=True)
fig.suptitle("The geometric distribution")
for ax, p in zip(axs, params):
x = range(1, 20)
y = [geom.pmf(p=p, k=k) for k in x]
ax.bar(x, y)
ax.set_title(f"p = {p}")
ax.set_ylabel("P(X = k)")
ax.set_xlabel("k")
Note that none of the probabilities 𝑃 (𝑋 = 𝑘) are zero, but as 𝑘 grows, they become extremely small. (The closer 𝑝 is to
1, the faster the decay.)
∞
It might not be immediately obvious that ∑𝑘=1 (1 − 𝑝)𝑘−1 𝑝 = 1. To do that, we’ll apply a magic trick. (You know. As
the famous Arthur C. Clarke quote goes, “Any sufficiently advanced mathematics is indistinguishable from magic.” Or
technology. It’s the same.)
In fact, for an arbitrary 𝑥 ∈ (−1, 1), the astounding identity
∞
1
∑ 𝑥𝑘 = (38.1)
𝑘=0
1−𝑥
Using the geometric series is one of the most common tricks up a mathematician’s sleeve. We’ll use this, for instance,
when talking about expected values for certain distributions.
Let’s discard the coin and roll a six-sided dice. We’ve seen this before: the probability of each outcome is the same, that
is,
1
𝑃 (𝑋 = 1) = 𝑃 (𝑋 = 2) = ⋯ = 𝑃 (𝑋 = 6) = ,
6
where 𝑋 denotes the outcome of the roll. This is a special instance of the uniform distribution.
In general, let 𝐴 = {𝑎1 , 𝑎2 , … , 𝑎𝑛 } be a finite set. The discrete random variable 𝑋 ∶ Ω → 𝐴 is uniformly distributed on
𝐴, that is,
𝑋 ∼ Uniform(𝐴),
if
1
𝑃 (𝑋 = 𝑎1 ) = 𝑃 (𝑋 = 𝑎2 ) = ⋯ = 𝑃 (𝑋 = 𝑎𝑛 ) = .
𝑛
Note that 𝐴 must be a finite set: no discrete uniform distribution exists on infinite sets.
Here is the probability mass function for rolling a six-sided dice. Not the most exciting one, I know.
with plt.style.context("seaborn-white"):
fig = plt.figure(figsize=(16, 8))
plt.title("The uniform distribution")
x = range(-1, 9)
y = [randint.pmf(k=k, low=1, high=7) for k in x]
plt.bar(x, y)
plt.ylim(0, 1)
plt.ylabel("P(X = k)")
plt.xlabel("k")
We’ve left the simplest one for the last: the single-point distribution. Let 𝑎 ∈ ℝ be an arbitrary real number. We say that
the random variable 𝑋 is distributed according to 𝛿(𝑎) if
1 if 𝑥 = 𝑎,
𝑃 (𝑋 = 𝑥) = {
0 otherwise.
That is, 𝑋 assumes 𝑎 with probability 1. Their corresponding cumulative distribution function is
1 if 𝑥 ≥ 𝑎,
𝐹𝑋 (𝑥) = {
0 otherwise,
With the help of discrete random variables, we can dress up the law of total probability in new clothes.
Proof. For any discrete random variable 𝑋 ∶ Ω → {𝑥1 , 𝑥2 , … }, the events {𝑋 = 𝑥𝑘 } partition the event space: they
are mutually disjoint, and their union gives Ω. Thus, the law of total probability can be applied, obtaining
∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴, 𝑋 = 𝑥𝑘 )
𝑘=−∞
∞
= ∑ 𝑃 (𝐴|𝑋 = 𝑥𝑘 )𝑃 (𝑋 = 𝑥𝑘 ),
𝑘=−∞
In other words, we can study events in the context of discrete random variables. This is extremely useful in practice.
(Soon, we’ll see that it’s not only for the discrete case.)
Let’s put (38.2) to work right away.
Since discrete probability distributions are represented by sequences, we can use a wide array of tools from mathematical
analysis to work with them. (This was the whole reason behind switching random variables to distributions.) As a
consequence, we can easily describe more complex random variables by constructing them from simpler ones.
For instance, consider rolling two dice, where we are interested in the distribution of the sum. So, we can write this as
the sum of random variables 𝑋1 and 𝑋2 , denoting the outcome of the first and second toss respectively. We know that
1
if 𝑘 = 1, 2, … , 6,
𝑃 (𝑋𝑖 = 𝑘) = { 6
0 otherwise
for 𝑖 = 1, 2. Using (38.2) and the fact that the two outcomes are independent, we have
6
𝑃 (𝑋1 + 𝑋2 = 𝑘) = ∑ 𝑃 (𝑋1 + 𝑋𝑘 = 𝑘|𝑋2 = 𝑙)𝑃 (𝑋2 = 𝑙)
𝑙=1
6
= ∑ 𝑃 (𝑋1 = 𝑘 − 𝑙)𝑃 (𝑋2 = 𝑙)
𝑙=1
If this looks familiar, it is not an accident. What you see here is the famous convolution operation in action.
Let 𝑎 = {𝑎𝑘 }∞ ∞
𝑘=−∞ and 𝑏 = {𝑏𝑘 }𝑘=−∞ be two arbitrary sequences. Their convolution is defined by
∞ ∞
𝑎 ∗ 𝑏 ∶= { ∑ 𝑎𝑘−𝑙 𝑏𝑙 } .
𝑙=−∞ 𝑘=−∞
∞
That is, the 𝑘-th element of the sequence 𝑎 ∗ 𝑏 is defined by the sum ∑𝑙=−∞ 𝑎𝑘−𝑙 𝑏𝑙 . This might be hard to imagine, but
thinking about the probabilistic interpretation makes the definition clear. The random variable 𝑋1 + 𝑋2 can assume the
value 𝑘 if 𝑋1 = 𝑘 − 𝑙 and 𝑋2 = 𝑙, for all possible 𝑙 ∈ ℤ.
∞
This trick is often extremely useful, as when 𝑎𝑘 and 𝑏𝑘 is explicitly given, sometimes ∑𝑙=−∞ 𝑎𝑙 𝑏𝑘−𝑙 is simpler to calculate
∞
than ∑𝑙=−∞ 𝑎𝑘−𝑙 𝑏𝑙 , and vice versa.
Convolution is supported by NumPy, so with its help, we can visualize the distribution of our 𝑋1 + 𝑋2 .
import numpy as np
with plt.style.context("seaborn-white"):
plt.figure(figsize=(16, 8))
plt.bar(range(0, len(sum_dist)), sum_dist)
plt.title("Distribution of X₁ + X₂")
plt.ylabel("P(X + Y = k)")
plt.xlabel("k")
Let’s talk about the general case. The pattern is clear, so we can formulate a theorem.
that is,
𝑝𝑋+𝑌 = 𝑝𝑋 ∗ 𝑝𝑌 .
Proof. The proof is a straightforward application of the law of total probability (38.2):
∞
𝑃 (𝑋 + 𝑌 = 𝑘) = ∑ 𝑃 (𝑋 + 𝑌 = 𝑘|𝑌 = 𝑙)𝑃 (𝑌 = 𝑙)
𝑙=−∞
∞
= ∑ 𝑃 (𝑋 = 𝑘 − 𝑙)𝑃 (𝑌 = 𝑙)
𝑙=−∞
= (𝑝𝑋 ∗ 𝑝𝑌 )(𝑘),
Another example of random variable sums is the binomial distribution itself. Instead of thinking about the number of
successes of an experiment out of 𝑛 independent tries, we can model the core experiment as a Bernoulli distribution. That
is, if 𝑋𝑖 is a Bernoulli(𝑝) distributed random variable describing the success of the 𝑖-th attempt, we have
𝑃 (𝑋1 + ⋯ + 𝑋𝑛 = 𝑘) = ∑ 𝑃 (𝑋1 = 𝑖1 , … , 𝑋𝑛 = 𝑖𝑛 )
𝑖1 +…𝑖𝑛 =𝑘
= ∑ 𝑃 (𝑋1 = 𝑖1 ) … 𝑃 (𝑋𝑛 = 𝑖𝑛 )
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
𝑖1 +…𝑖𝑛 =𝑘 𝑋1 ,…,𝑋𝑛 are independent
= ∑ 𝑝 (1 − 𝑝)𝑛−𝑘
𝑘
𝑖1 +…𝑖𝑛 =𝑘
𝑛
= ( )𝑝𝑘 (1 − 𝑝)𝑛−𝑘 ,
𝑘
where the sum ∑𝑖 traverses all tuples (𝑖1 , … , 𝑖𝑛 ) ∈ {0, 1}𝑛 for which 𝑖1 + ⋯ + 𝑖𝑛 = 𝑘. (As there are (𝑛𝑘) of
1 +…𝑖𝑛 =𝑘
such tuples, we have ∑𝑖 𝑝𝑘 (1 − 𝑝)𝑛−𝑘 = (𝑛𝑘)𝑝𝑘 (1 − 𝑝)𝑛−𝑘 in the last step.)
1 +…𝑖𝑛 =𝑘
So far, we have talked about discrete random variables; that is, random variables with countably many values. However,
not all experiments/observations/measurements are like this. For instance, the height of a person is a random variable
that can assume a continuum of values.
To give a tractable example, let’s pick a number 𝑋 from [0, 1], with each one having an “equal chance”. In this context,
equal chance means that
𝑃 (𝑎 < 𝑋 ≤ 𝑏) = |𝑏 − 𝑎|.
Can we describe 𝑋 with a single real function? Like in the discrete case, we can try
𝐹 (𝑥) = 𝑃 (𝑋 = 𝑥),
but this wouldn’t work. Why? Because for each 𝑥 ∈ 𝑋, we have 𝑃 (𝑋 = 𝑥) = 0. That is, picking a particular number 𝑥
has zero probability. Instead, we can try 𝐹𝑋 (𝑥) = 𝑃 (𝑋 ≤ 𝑥), which is
⎧0 if 𝑥 ≤ 0,
{
𝐹𝑋 (𝑥) = ⎨𝑥 if 0 < 𝑥 ≤ 1,
{1 otherwise.
⎩
We can plot this for visualization.
with plt.style.context('seaborn-white'):
plt.figure(figsize=(16, 8))
plt.plot(X, y)
In the following section, we will properly define and study this object in detail for all real-valued random variables.
What we have seen in our motivating example is an instance of a cumulative distribution functionor CDF in short. Let’s
jump into the formal definition right away.
𝐹𝑋 (𝑥) ∶= 𝑃 (𝑋 ≤ 𝑥) (38.3)
Again, let’s unpack this. Recall that in the definition of real-valued random variables, we have used the inverse images
𝑋 −1 ((𝑎, 𝑏))?
Something similar is going on here. 𝑃 (𝑋 ≤ 𝑥) is the abbreviation for 𝑃 (𝑋 −1 ((−∞, 𝑥])), which we are too lazy to
write. Similarly to 𝑋 −1 ((𝑎, 𝑏))), you can visualize 𝑋 −1 ((−∞, 𝑥])) by pulling the interval (−∞, 𝑥] back to Ω using the
mapping 𝑋.
Sets of the form 𝑋 −1 ((−∞, 𝑥]) are called the level sets of 𝑋.
According to the Oxford English Dictionary, the word cumulative means “increasing or increased in quantity, degree,
or force by successive additions”. For discrete random variables, using 𝑃 (𝑋 = 𝑘) was enough, but since real random
variables are more nuanced, we have to use the cumulative probabilities 𝑃 (𝑋 ≤ 𝑥) to meaningfully describe them.
Why do we like to work with distribution functions? Because they condense all the relevant information about a random
variable in a real function. For instance, we can express probabilities like
To give an example, let’s revisit the introduction, where we were selecting a random number between zero and one. There,
Cumulative distribution functions have three properties that characterize them: they are always nondecreasing, right-
continuous (whatever that might be), and their limits are 0 and 1 towards −∞ and ∞ respectively. You might have
guessed some of this from the definition, but here is the formal theorem that summarizes this.
holds.
Proof. The proofs are relatively straightforward. (a) follows from the fact that if 𝑥 < 𝑦, then we have
In other words, the event 𝑋 ≤ 𝑥 is a subset of 𝑋 ≤ 𝑦. Thus, due to the monotonicity of probability measures, we have
𝑃 (𝑋 ≤ 𝑥) ≤ 𝑃 (𝑋 ≤ 𝑦).
(b) Here, we need to show that lim𝑥→𝑥0 + 𝑃 (𝑋 ≤ 𝑥) = 𝑃 (𝑋 ≤ 𝑥0 ). For this, note that for any 𝑥𝑛 → 𝑥0 with 𝑥𝑛 > 𝑥0 ,
the event sequence {𝜔 ∈ Ω ∶ 𝑋(𝜔) ≤ 𝑥𝑛 } is decreasing, and
∩∞ −1 −1
𝑛=1 𝑋 ((−∞, 𝑥𝑛 ]) = 𝑋 ((−∞, 𝑥]).
Because of the upper continuity of probability measures (see (35.4)), the right continuity of 𝐹𝑋 follows.
(c) Again, this follows from the fact that
∩∞ −1
𝑛=1 𝑋 ((−∞, 𝑛]) = ∅
and
∪∞ −1
𝑛=1 𝑋 ((−∞, 𝑛]) = Ω.
Since 𝑃 (∅) = 0 and 𝑃 (Ω), the statement follows from the upper and lower continuity of probability measures. (See
(35.3) and (35.4).) □
that is, 𝑋 < 𝑥 instead of 𝑋 ≤ 𝑥. This doesn’t change the big picture, but some details are slightly different. For instance,
this change makes 𝐹𝑋 left-continuous instead of right-continuous. These minute details matter if you dig really deep, but
in machine learning, we’ll be fine without thinking too much about them.
Theorem 37.3.1 is true the other way around: if you give me a nondecreasing right-continuous function 𝐹 (𝑥) with
lim𝑥→−∞ 𝐹 (𝑥) = 0 and lim𝑥→∞ 𝐹 (𝑥) = 1, I can construct a random variable such that its distribution function matches
𝐹 (𝑥).
The discrete and real-valued case is not entirely disjoint: in fact, discrete random variables have cumulative distribution
functions as well. (But not the other way around, as real-valued random variables cannot be described with sequences.)
Say, if 𝑋 is a discrete random variable taking the values 𝑥1 , 𝑥2 , …, then its CDF is
𝐹𝑋 (𝑥) = ∑ 𝑃 (𝑋 = 𝑥𝑖 ),
𝑥𝑖 ≤𝑥
which is a piecewise continuous function. In the case of binomial distributions, here is what it looks like.
The strength or probability lies in its ability to translate real-world phenomena into coin tosses, dice rolls, dart throws,
lightbulb lifespans, and many more. This is possible because of distributions. Distributions are the ribbons stringing
together a vast bundle of random variables.
Let’s meet some of the most important ones!
We have already seen a special case of the uniform distribution: selecting a random number from the interval [0, 1], such
that all outcomes are “equally likely”. The general uniform distribution captures the same concept, except on an arbitrary
interval [𝑎, 𝑏]. That is, the random variable 𝑋 is uniformly distributed on the interval [𝑎, 𝑏], or 𝑋 ∼ Uniform(𝑎, 𝑏) in
symbols, if
1
𝑃 (𝛼 < 𝑋 ≤ 𝛽) = ∣[𝑎, 𝑏] ∩ (𝛼, 𝛽]∣
𝑏−𝑎
for all 𝛼 < 𝛽, where ∣[𝑐, 𝑑]∣ denotes the length of the interval [𝑐, 𝑑],
In other words, the probability of our random number falling into a given interval is proportional to the interval’s length.
This is how the condition “equally likely” makes sense: as there are uncountably many possible outcomes, the probability
of each individual outcome is zero, but equally long intervals have an equal chance.
In line with the definition, the distribution function of 𝑋 is
⎧0 if 𝑥 ≤ 𝑎,
{
𝐹𝑋 (𝑥) = ⎨ 𝑥−𝑎
𝑏−𝑎 if 𝑎 < 𝑥 ≤ 𝑏,
{1 otherwise.
⎩
Let’s turn our attention toward a different problem: lightbulbs. According to some mysterious (and probably totally
inaccurate) lore, lightbulbs possess the so-called memoryless property. That is, its expected lifespan is the same at any
point in its life.
To put this into a mathematical form, let 𝑋 be a random variable denoting the lifespan of a given lightbulb. The mem-
oryless property states that if the lightbulb has already lasted 𝑠 seconds, then the probability of lasting another 𝑡 is the
same as in the very first moment of its life. That is,
If we think about the probabilities as a function 𝑓(𝑡) = 𝑃 (𝑋 > 𝑡), (38.5) can be viewed as a functional equation. And
a famous one for that. Without going into the painful details the only continuous solution is the exponential function
𝑓(𝑡) = 𝑒𝑎𝑡 , where 𝑎 ∈ ℝ is a parameter.
As we are talking about the lifespan of a lightbulb here, the probability of it lasting forever is zero. That is,
lim 𝑃 (𝑋 > 𝑡) = 0
𝑡→∞
holds. Thus, as
⎧0 if 𝑎 < 0,
{
lim 𝑒𝑎𝑡 =
𝑡→∞ ⎨1 if 𝑎 = 0,
{∞ if 𝑎 > 0,
⎩
only the negative parameters are valid in our case. This characterizes the exponential distribution. In general, 𝑋 ∼ exp(𝜆)
for a 𝜆 > 0 if
0 if 𝑥 < 0,
𝐹𝑋 (𝑥) = {
1 − 𝑒−𝜆𝑥 if 0 ≤ 𝑥.
with plt.style.context('seaborn-white'):
plt.figure(figsize=(16, 8))
plt.legend()
The exponential distribution is extremely useful and frequently encountered in real-life applications. For instance, it
models the requests incoming to a server, customers standing in a queue, buses arriving at a bus stop, and many more.
We’ll talk more about special distributions in later chapters, and we’ll add quite a few others as well.
38.5 Conclusion
Distributions are the lifeblood of probability theory, and distributions can be represented with cumulative distribution
functions.
However, CDFs have a significant drawback: it’s hard to express the probability of more complex events with them.
Later, we’ll see several concrete examples of where CDFs fail. Without going into details, one example points towards
multidimensional distributions. (I hope that their existence and importance are not surprising you.) There, the distribution
functions can be used to express the probability of rectangle-shaped events, but not, say, spheres.
To be a tiny bit more precise, if 𝑋, 𝑌 ∼ Uniform(0, 1), then the probability
𝑃 (𝑋 2 + 𝑌 2 < 1)
cannot be directly expressed in terms of the two-dimensional CDF 𝐹𝑋,𝑌 (𝑥, 𝑦). (Whatever that may be.) Fortunately,
this is not our only tool.
Enter probability density functions.
THIRTYNINE
DENSITIES
Distribution functions are not our only tool to describe real-valued random variables. If you have studied probability
theory from a book/lecture/course written by a non-mathematician, you have probably seen a function like
1 𝑥2
𝑝(𝑥) = √ 𝑒− 2
2𝜋
referred to as “probability” at some point. Let me tell you, this is definitely not a probability. I have seen this mistake so
much that I decided to write short Twitter threads properly explaining probabilistic concepts, from which this book was
grown out of. So, I take this issue to heart.
Here is the problem with cumulative distribution functions: they represent global information about local objects. Let’s
unpack this idea. If 𝑋 is a real-valued random variable, the CDF
𝐹𝑋 (𝑥) = 𝑃 (𝑋 ≤ 𝑥)
describes the probability of 𝑋 being smaller than a given 𝑥. But what if we are interested in what happens around 𝑥?
Say, in the case of the uniform distribution (38.4), we have
𝑃 (𝑋 = 𝑥) = lim 𝑃 (𝑥 − 𝜀 < 𝑋 ≤ 𝑥)
𝜀→0
= lim (𝐹𝑋 (𝑥) − 𝐹𝑋 (𝑥 − 𝜀))
𝜀→0
= lim 𝜀
𝜀→0
= 0.
holds. Does this look familiar to you? Increments of 𝐹𝑋 on the right, probabilities on the left. Where have we seen
increments before?
417
Mathematics of Machine Learning
′
In the fundamental theorem of calculus, that’s where. That is, if 𝐹𝑋 is differentiable and its derivative is 𝐹𝑋 (𝑥) = 𝑓𝑋 (𝑥),
then
𝑏
∫ 𝑓𝑋 (𝑥)𝑑𝑥 = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎). (39.1)
𝑎
The function 𝑓𝑋 (𝑥) seems to be what we are looking for: it represents the local behavior of 𝑋 around 𝑥. But instead of
describing the probability, it describes its rate of change. This is called a probability density function.
By turning this argument around, we can define density functions using (39.1). Here is the mathematically precise version.
Again, (39.2) is the Newton-Leibniz formula (26.8) in disguise. The following theorem makes this connection precise.
Proof. This is just a simple application of the fundamental theorem of calculus. If the derivative indeed exists, then
𝑏
𝑑
∫ 𝐹 (𝑥)𝑑𝑥 = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎),
𝑎 𝑑𝑥 𝑋
𝑑
which means that 𝑓𝑋 (𝑥) = 𝑑𝑥 𝐹𝑋 (𝑥) is indeed a density function. □
∗ 𝑓𝑋 (𝑥) if 𝑥 ≠ 0,
𝑓𝑋 (𝑥) = {
𝑓𝑋 (0) + 1 if 𝑥 = 0.
∗ ∗
You can check it by hand that 𝑓𝑋 is still a density for 𝑋, yet 𝑓𝑋 ≠ 𝑓𝑋 .
One more thing before we move on. Recall that discrete random variables are characterized by probability mass functions:
these two objects are two sides of the same coin.
The probability mass function is analogous to the density function, yet we don’t have terminology for random variables
with the latter. We’ll fix this now.
Discrete and continuous random variables are the backbones of probability theory: the most interesting random variables
are falling into these classes. (Later in the chapter, we’ll see that there are more types, but these two are the most
important.)
Now we are ready to get our hands dirty and see some density functions in practice.
After all this introduction, let’s see a few concrete examples. So far, we have seen two real-valued non-discrete distribu-
tions: the uniform and the exponential.
Example 1. Let’s start with 𝑋 ∼ Uniform(0, 1). Can we apply Theorem 38.1.1 directly? Not without a little snag. Or
two, to be more precise.
Why? Because the distribution function
⎧0 if 𝑥 ≤ 0,
{
𝐹𝑋 (𝑥) =
⎨𝑥 if 0 < 𝑥 ≤ 1,
{1 if 1 < 𝑥
⎩
is not differentiable at 𝑥 = 0 and 𝑥 = 1. However, it is differentiable everywhere else, and its derivative
⎧0 if 𝑥 < 0,
′
{
𝐹𝑋 (𝑥) = ⎨1 if 0 < 𝑥 < 1,
{0 if 1 < 𝑥
⎩
is indeed a density function. (You can check this by hand.) This density is patched together from the derivative of 𝐹𝑋 (𝑥)
on the intervals (−∞, 0), (0, 1), and (1, ∞).
Example 2. In the case of the exponentially distributed random variable 𝑌 ∼ exp(𝜆), the function
0 if 𝑥 < 0,
𝑓𝑌 (𝑥) = {
𝜆𝑒−𝜆𝑥 if 0 ≤ 𝑥
is a proper density function, which we obtained by differentiating 𝐹𝑌 (𝑥) whenever possible. Again, the density 𝑓𝑋 (𝑥) is
patched together from the derivatives on the intervals (−∞, 0) and (0, ∞).
Example 3. Now, I am going to turn everything upside down. Let 𝑍 ∼ Bernoulli(1/2), which is a discrete random
variable with probability mass function
1
𝑝𝑍 (0) = 𝑝𝑍 (1) = ,
2
Remark 38.1.2 (The non-existence of density despite the lack of jump discontinuities.)
Unfortunately, the reverse direction of “jump discontinuity in the CDF ⟹ no PDF exists” is not true, I repeat, not true.
We can find random variables whose cumulative distribution functions are continuous, but their density does not exist.
One famous example is the Cantor function, also known as the Devil’s staircase. (Only follow this link if you are brave
enough, or well-trained in real analysis. Which is the same.)
So far, we have been focusing on two special kinds of real-valued random variables: discrete random variables and
continuous ones.
We’ve seen all kinds of objects describing them. Every real-valued random variable has a cumulative distribution function,
but while discrete ones are characterized by probability mass functions, the continuous ones are by density functions.
Are these two all that’s out there?
No. There are mixed cases. For instance, consider the following example. We are selecting a random number from [0, 1],
but we add a little twist to the picking process. First, we toss a fair coin, and if it comes up heads, we pick 0. Otherwise,
we pick uniformly between zero and one.
To describe this weird process, let’s introduce two random variables: let 𝑋 be the final outcome, and 𝑌 be the outcome
of the coin toss. Then, using the conditional version of the law of total probability (see Theorem 35.2.1), we have
𝑃 (𝑋 ≤ 𝑥) = 𝑃 (𝑋 ≤ 𝑥|𝑌 = heads)𝑃 (𝑌 = heads)
+ 𝑃 (𝑋 ≤ 𝑥|𝑌 = tails)𝑃 (𝑌 = tails).
As
0 if 𝑥 < 0,
𝑃 (𝑋 ≤ 𝑥|𝑌 = heads) = {
1 if 0 ≤ 𝑥,
⎧0 if 𝑥 < 0,
{
𝐹𝑋 (𝑥) = ⎨ 𝑥+1
2 if 0 ≤ 𝑥 < 1,
{1 if 1 ≤ 𝑥.
⎩
Ultimately, 𝐹𝑋 is the convex combination of two cumulative distribution functions. (A convex combination is a linear
combination where the coefficients are positive and their sum is 1.)
Thus, the random variable 𝑋 is not discrete, nor continuous. So, what is it?
It’s time to put order to chaos! In this section, we are going to provide a complete classification for our real-valued random
variables. This is a beautiful, albeit advanced topic, so feel free to skip it on a first read.
Let’s start at a seemingly distant topic: subsets of ℝ that are so small that they practically vanish.
Since ℝ is a one-dimensional object, we are usually talking about length here, but let’s forget that terminology and talk
about measure instead. We’ll denote the measure of a set 𝐴 ⊆ ℝ by 𝜆(𝐴), whatever that might be.
We are not going too deep into the details and will keep on using the notion of measure intuitively. For instance, the
measure of an interval [𝑎, 𝑏] is 𝜆([𝑎, 𝑏]) = 𝑏 − 𝑎.
Our measure 𝜆 has some fundamental properties, for instance,
∞
(a) 𝜆(∅) = 0,(b) 𝜆(𝐴) ≤ 𝜆(𝐵) if 𝐴 ⊆ 𝐵,(c) and 𝜆(∪𝑘=1 𝐴𝑘 ) = ∑𝑘=1 𝜆(𝐴𝑘 ) if 𝐴𝑖 ∩ 𝐴𝑗 = ∅.
This almost behaves like a probability measure, with one glaring exception: 𝜆(ℝ) = ∞. This is not an accident.
What is the measure of a finite set {𝑎1 , … , 𝑎𝑛 }? Intuitively, it is zero, and from this example, we’ll conjure up the concept
of sets of zero measure.
Proof. As 𝐴 ⊆ 𝐸, 𝜆(𝐴) ≤ 𝜆(𝐸) < 𝜀. This means that 𝜆(𝐴) is smaller than any positive real number, thus it must be
zero. □
𝜀 𝜀
𝐸 = ∪𝑛𝑘=1 (𝑎𝑘 − ,𝑎 + )
2𝑛 𝑘 2𝑛
𝜀 𝜀
𝐸 = ∪∞
𝑘=1 (𝑎𝑘 − ,𝑎 + )
2𝑘+1 𝑘 2𝑘+1
work perfectly, as
∞
𝜀
𝜆(𝐸) ≤ ∑ = 𝜀.
𝑘=1
2𝑘+1
For instance, as the set of integers and rational numbers are both countable, 𝜆(ℤ) = 𝜆(ℚ) = 0.
Overall, sets of zero measure are true to their name: they are small. (They are not necessarily countable though.) Why
are these important? We’ll see this in the next section.
∗ 𝑓𝑋 (𝑥) if 𝑥 ∉ ℚ,
𝑓𝑋 (𝑥) = {
0 if 𝑥 ∈ ℚ
is still a density function for 𝑋. Unfortunately, we don’t have the tools to show this, as it would require moving beyond
the good old Riemann integral, which is way beyond our scope.
The main difference between a discrete and continuous random variable is the set where they live. Fundamentally, they
are both real-valued random variables, but the range of a discrete variable is a set of measure zero.
Let’s introduce the concept of singular random variables to make this notion precise.
𝜆(𝑋(Ω)) = 0
holds.
All discrete random variables are singular, but not the other way around. For instance, Cantor function is a good example.
Why are singular random variables so special? Because every distribution can be written as the sum of a singular and a
continuous one! Here is the famous Lebesgue decomposition theorem.
𝐹𝑋 = 𝛼𝐹𝑋𝑠 + 𝛽𝐹𝑋𝑐 ,
where 𝛼 + 𝛽 = 1, and 𝐹𝑋 , 𝐹𝑋𝑠 , 𝐹𝑋𝑐 are the corresponding cumulative distribution functions.
We are not going to prove this here, but the gist is this: there are singular random variables, continuous ones, and their
sum.
FORTY
Let’s play a simple game. I toss a coin, and if it comes up heads, you win $1. If it is tails, you lose $2.
Up until now, we were dealing with questions like the probability of winning. Say, if 𝑋 describes your winnings per
round, we have
1
𝑃 (heads) = 𝑃 (tails) = .
2
Despite the equal chances, should you play this game? Let’s find out.
After 𝑛 rounds, your earnings can be calculated by the number of heads times $1 minus the number of tails times $2. If
we divide total earnings by 𝑛, we obtain your average winnings per round. That is,
total winnings
your average winnings =
𝑛
1 ⋅ #heads − 2 ⋅ #tails
=
𝑛
#heads #tails
=1⋅ −2⋅ ,
𝑛 𝑛
where #heads and #tails denote the number of heads and tails respectively.
Recall the frequentist interpretation of probability? According to our intuition, we should have
#heads 1
lim = 𝑃 (heads) = ,
𝑛→∞ 𝑛 2
#tails 1
lim = 𝑃 (tails) = .
𝑛→∞ 𝑛 2
This means that if you play long enough, your average winnings per round is
your average winnings = 1 ⋅ 𝑃 (heads) − 2 ⋅ 𝑃 (tails)
1
=− .
2
So, as you are losing half a dollar per round on average, you definitely shouldn’t play this game.
Let’s formalize this argument with a random variable. Say, if 𝑋 describes your winnings per round, we have
1
𝑃 (𝑋 = 1) = 𝑃 (𝑋 = −2) = ,
2
so the average winnings can be written as
425
Mathematics of Machine Learning
With a bit of a pattern matching, we find that for a general discrete random variable 𝑋, the formula looks like
In English, the expected value describes the average value of a random variable in the long run. The expected value is
also called the mean and is often denoted by 𝜇.
It’s time for examples.
Example 1. Expected value of the Bernoulli distribution. Let 𝑋 ∼ Bernoulli(𝑝). Its expected value is quite simple to
compute, as
𝔼[𝑋] = 0 ⋅ 𝑃 (𝑋 = 0) + 1 ⋅ 𝑃 (𝑋 = 1) = 𝑝.
We’ve seen this before: the introductory example with the simple game is the transformed Bernoulli distribution 3 ⋅
Bernoulli(1/2) − 2.
Example 2. Expected value of the binomial distribution. Let 𝑋 ∼ Binomial(𝑛, 𝑝). Then
𝑛
𝔼[𝑋] = ∑ 𝑘𝑃 (𝑋 = 𝑘)
𝑘=0
𝑛
𝑛
= ∑ 𝑘( )𝑝𝑘 (1 − 𝑝)𝑛−𝑘
𝑘=0
𝑘
𝑛
𝑛!
= ∑𝑘 𝑝𝑘 (1 − 𝑝)𝑛−𝑘 .
𝑘=0
𝑘!(𝑛 − 𝑘)!
𝑛!
The plan is the following: absorb that 𝑘 with the fraction 𝑘!(𝑛−𝑘)! , and adjust the sum such that its terms form the
probability mass function for Binomial(𝑛 − 1, 𝑝). As 𝑛 − 𝑘 = (𝑛 − 1) − (𝑘 − 1), we have
𝑛
𝑛!
𝔼[𝑋] = ∑ 𝑘 𝑝𝑘 (1 − 𝑝)𝑛−𝑘
𝑘=0
𝑘!(𝑛 − 𝑘)!
𝑛
(𝑛 − 1)!
= 𝑛𝑝 ∑ 𝑝𝑘−1 (1 − 𝑝)(𝑛−1)−(𝑘−1)
𝑘=1
(𝑘 − 1)!((𝑛 − 1) − (𝑘 − 1))!
𝑛−1
(𝑛 − 1)!
= 𝑛𝑝 ∑ 𝑝𝑘 (1 − 𝑝)(𝑛−1−𝑘)
𝑘=0
𝑘!(𝑛 − 1 − 𝑘)!
𝑛−1
= 𝑛𝑝 ∑ 𝑃 (Binomial(𝑛 − 1, 𝑝) = 𝑘)
𝑘=0
= 𝑛𝑝.
This computation might not look like the simplest, but once you get familiar with the trick, that’ll be like second nature
for you.
Example 3. Expected value of the geometric distribution. Let 𝑋 ∼ Geo(𝑝). We need to calculate
∞
𝔼[𝑋] = ∑ 𝑘(1 − 𝑝)𝑘−1 𝑝.
𝑘=1
Do you remember the geometric series (38.1)? This is almost it, except for the 𝑘 term, which throws a monkey wrench
into our gears. To fix that, we’ll use another magic trick. Recall that
∞
1
= ∑ 𝑥𝑘 .
1 − 𝑥 𝑘=0
𝑑 1 𝑑 ∞ 𝑘
= ∑𝑥
𝑑𝑥 1 − 𝑥 𝑑𝑥 𝑘=0
∞
𝑑 𝑘
=∑ 𝑥
𝑘=0
𝑑𝑥
∞
= ∑ 𝑘𝑥𝑘−1 ,
𝑘=1
where we used the linearity of the derivative and the pleasant analytic properties of the geometric series. Mathematicians
would scream upon the sight of switching the derivative and the sum, but don’t worry, everything here is correct as is.
(It’s just that mathematicians are really afraid of interchanging limits. For a good reason, if I may say so.)
On the other hand,
𝑑 1 1
= ,
𝑑𝑥 1 − 𝑥 (1 − 𝑥)2
thus
∞
1
∑ 𝑘𝑥𝑘−1 = .
𝑘=1
(1 − 𝑥)2
𝔼[𝑐] = 𝑐𝑃 (𝑐 = 𝑐) = 𝑐.
I know, this example looks silly, but it is quite useful, as we shall see this soon. (And many times later.)
One more example, before we move on. I was a mediocre no-limit Texas hold’em player a while ago, and the first time I
heard about the expected value was years before I studied probability theory.
According to the rules of Texas hold-em, each player holds two cards on their own, while five more shared cards are dealt.
The shared cards are available for everyone, and the player with the strongest hand wins.
Fig. 40.1 shows how the table looks before the last card (the river) is revealed.
There is money in the pot to be won, but to see the river, you have to call the opponent’s bet.
The question is, should you? Expected value to the rescue.
Let’s build a probabilistic model. We would win the pot with certain river cards but lose with all the others. If 𝑋 represents
our winnings, then
#winning cards
𝑃 (𝑋 = pot) = ,
#remaining cards
#losing cards
𝑃 (𝑋 = −bet) = .
#remaining cards
Thus, the expected value is
When is the expected value positive? With some algebra, we obtain that 𝔼[𝑋] > 0 if and only if
which is called positive pot odds. If this is satisfied, making the bet is the right call. You might lose a hand with positive
pot odds, but in the long term, your winnings will be positive.
Of course, pot odds are extremely hard to determine in practice. For instance, you don’t know what others hold, and
counting the cards that would win the pot for you is not possible unless you have a good read on the opponents. Poker is
much more than just math. Good players choose their bet specifically to throw off their opponents’ pot odds.
So far, we have only defined the expected value for discrete random variables. As 𝔼[𝑋] describes the average value of 𝑋
in the long run, it should exist for continuous random variables as well.
The interpretation of the expected value was simple: outcome times probability, summed over all potential values.
However, there is a snag with continuous random variables: we don’t have such a mass distribution, as the probabilities
of individual outcomes are zero: 𝑃 (𝑋 = 𝑥) = 0. Moreover, we can’t sum uncountably many values.
What can we do?
Wishful thinking. This is one of the most powerful techniques in mathematics, and I am not joking.
Here’s the plan. We’ll pretend that the expected value of a continuous random variable is well-defined, and let our
imagination run free. Say goodbye to mathematical precision, and allow our intuition to unfold.
Instead of the probability of a given outcome, we can talk about 𝑋 landing in a small interval. First, we divide up the set
of real numbers into really small parts. To be more precise, let 𝑥0 < 𝑥1 < ⋯ < 𝑥𝑛 be a granular partition of the real
line. If the partition is refined enough, we should have
𝑛
𝔼[𝑋] ≈ ∑ 𝑥𝑘 𝑃 (𝑥𝑘−1 < 𝑋 ≤ 𝑋𝑘 ). (40.1)
𝑘=1
These increments remind us of the difference quotients. We don’t quite have these inside the sum, but by a “fancy
multiplication with one”, we can achieve this:
𝑛 𝑛
𝐹𝑋 (𝑥𝑘 ) − 𝐹𝑋 (𝑥𝑘−1 )
∑ 𝑥𝑘 (𝐹𝑋 (𝑥𝑘 ) − 𝐹𝑋 (𝑥𝑘−1 )) = ∑ 𝑥𝑘 (𝑥𝑘 − 𝑥𝑘−1 ) .
𝑘=1 𝑘=1
𝑥𝑘 − 𝑥𝑘−1
If the 𝑥𝑖 -s are close to each other (and we can select them to be arbitrarily close), the difference quotients are close to the
derivative of 𝐹𝑋 , which is the density function 𝑓𝑋 . Thus,
𝑛 𝑛
𝐹𝑋 (𝑥𝑘 ) − 𝐹𝑋 (𝑥𝑘−1 ) ′
∑ 𝑥𝑘 (𝑥𝑘 − 𝑥𝑘−1 ) ≈ ∑ 𝑥𝑘 (𝑥𝑘 − 𝑥𝑘−1 )𝑓𝑋 (𝑥𝑘 ).
𝑘=1
𝑥𝑘 − 𝑥𝑘−1 𝑘=1
Although we were not exactly precise in our argument, all of the above can be made mathematically correct. (But we are
not going to do it here, as it is not relevant to us.) Thus, we finally obtain the formula of the expected value for continuous
random variables.
We can do this via partial integration: by letting 𝑓(𝑥) = 𝑥 and 𝑔′ (𝑥) = 𝑒−𝜆𝑥 , we have
∞
𝔼[𝑋] = ∫ 𝑥𝜆𝑒−𝜆𝑥 𝑑𝑥
0
∞
𝑥=∞
= [ − 𝑥𝑒−𝜆𝑥 ]𝑥=0 + ∫ 𝑒−𝜆𝑥 𝑑𝑥
⏟⏟⏟⏟⏟⏟⏟ 0
=0
𝑥=∞
1 −𝜆𝑥
=[− 𝑒 ]
𝜆 𝑥=0
1
= .
𝜆
As usual, the expected value has several useful properties. Most importantly, the expected value is linear with respect to
the random variable.
holds.
We are not going to prove this theorem here, but know that linearity is an essential tool. Do you recall the game that we
used to introduce the expected value for discrete random variables? I toss a coin, and if it comes up heads, you win $1.
Tails, you lose $2. If you think about it for a minute, this is the
𝑋 = 3 ⋅ Bernoulli(1/2) − 2
Remark 39.3.1
Notice that Theorem 39.3.1 did not say that 𝑋 and 𝑌 have to be both discrete or both continuous. Even though we have
only defined the expected value in such cases, there is a general definition that works for all random variables.
The snag is, it requires a familiarity with measure theory, falling way outside of our scope. Suffice to say, the theorem
works as is.
If the expected value of a sum is the sum of the expected values, does the same apply to the product? Not in general, but
fortunately, this works for independent random variables.
𝔼[𝑋𝑌 ] = 𝔼[𝑋]𝔼[𝑌 ]
holds.
This property is extremely useful, as we’ll see in the next section, where we’ll talk about the variance and the covariance.
One more property that’ll help us to calculate the expected value of functions of the random variable, such as 𝑋 2 or sin 𝑋.
Thus, calculating 𝔼[𝑋 2 ] for a continuous random variable can be done by simply taking
∞
𝔼[𝑋 2 ] = ∫ 𝑥2 𝑓𝑋 (𝑥)𝑑𝑥,
−∞
40.4 Variance
Plainly speaking, the expected value measures the average value of the random variable. However, even though both
Uniform(−1, 1) and Uniform(−100, 100) have zero expected value, the latter is much more spread out than the former.
Thus, 𝔼[𝑋] is not a good descriptor of the random variable 𝑋.
To add one more layer, we measure the average deviation from the expected value. These are done via the variance and
the standard deviation.
Std[𝑋] = √Var[𝑋].
Take note that in the literature, the expected value is often denoted by 𝜇, while the standard deviation is by 𝜎. Together,
they form two of the most important descriptors of a random variable.
Fig. 40.2 shows a visual interpretation of the mean and standard deviation in the case of a normal distribution. The mean
shows the average value, while the standard deviation can be interpreted as the average deviation from the mean. (We’ll
talk about the normal distribution in great detail later, so don’t worry if it is not yet familiar to you.)
The usual method of calculating variance is not taking the expected value of (𝑋 − 𝜇)2 , but taking the expected value of
𝑋 2 and subtracting 𝜇2 from it. This is shown by the following proposition.
Proposition 39.4.1
Let (Ω, Σ, 𝑃 ) be a probability space, and let 𝑋 ∶ Ω → ℝ be a random variable.
Then
Fig. 40.2: Mean (𝜇) and standard deviation (𝜎) of the standard normal distribution.
Proof. Let 𝜇 = 𝔼[𝑋]. Because of the linearity of the expected value, we have
Is the variance linear as well? No, but there are some important identities regarding scalar multiplication and addition.
Var[𝑎𝑋] = 𝑎2 Var[𝑋].
Now, as 𝑋 and 𝑌 are independent, 𝔼[𝑋𝑌 ] = 𝔼[𝑋]𝔼[𝑌 ]. Thus, due to the linearity of the expected value,
Expected value and variance measure a random variable in isolation. However, in real problems, we need to discover
relations between separate measurements. Say, 𝑋 describes the price of a given real estate, while 𝑌 measures its size.
These are certainly related, but one does not determine the other. For instance, the location might be a differentiator
between the prices.
The simplest statistical way of measuring similarity is the covariance and correlation.
Similarly to variance, the definition of covariance can be simplified to provide an easier way of calculating its exact value.
Proposition 39.5.1
Let (Ω, Σ, 𝑃 ) be a probability space, let 𝑋, 𝑌 ∶ Ω → ℝ be two random variables, , and let 𝜇𝑋 = 𝔼[𝑋], 𝜇𝑌 = 𝔼[𝑌 ] be
their expected values.
Then
Cov[𝑋, 𝑌 ] = 𝔼[𝑋𝑌 ] − 𝜇𝑋 𝜇𝑌 .
One of the most important properties of covariance and correlation is that they are zero for independent random variables.
Theorem 39.5.1
Let (Ω, Σ, 𝑃 ) be a probability space, and let 𝑋, 𝑌 ∶ Ω → ℝ be two independent random variables.
Then Cov[𝑋, 𝑌 ] = 0. (And consequently, Corr[𝑋, 𝑌 ] = 0 as well.)
The proof follows straight from the definition and Theorem 39.3.2, so this is left as an exercise for you.
Take note, as this is extra important: independence implies zero covariance, but zero covariance does not imply indepen-
dence. Here is an example.
Let 𝑋 be a discrete random variable with the probability mass function
1
𝑃 (𝑋 = −1) = 𝑃 (𝑋 = 0) = 𝑃 (𝑋 = 1) = ,
3
and let 𝑌 = 𝑋 2 . The expected value of 𝑋 is
𝔼[𝑋𝑌 ] = 𝔼[𝑋 3 ] = 0.
Thus,
Cov[𝑋, 𝑌 ] = 𝔼[𝑋𝑌 ] − 𝔼[𝑋]𝔼[𝑌 ]
= 𝔼[𝑋 3 ] − 𝔼[𝑋]𝔼[𝑋 2 ]
2
=0−0⋅
3
= 0.
However, 𝑋 and 𝑌 are not independent, as 𝑌 = 𝑋 2 is a function of 𝑋.
(I shamelessly stole this example from a brilliant Stack Overflow thread, which you should read for more on this question.)
40.6 Problems
FORTYONE
We’ll continue our journey with a quite remarkable and famous result: the law of large numbers. You have probably
already heard several faulty arguments invoking the law of large numbers. For instance, gamblers are often convinced
that their bad luck will end soon because of the said law. This is one of the most frequently misused mathematical terms,
and we are here to clean that up.
We’ll do this in two passes. First, we are going to see an intuitive interpretation, then add the technical but important
mathematical details. I’ll try to be gentle.
First, let’s toss some coins again. If we toss coins repeatedly, what is the relative frequency of heads on the long run?
We should have a pretty good guess already: the average number of heads should converge to 𝑃 (heads) = 𝑝 as well.
Why? Because we have seen this when studying the frequentist interpretation of probability.
Our simulation showed that the relative frequency of heads does indeed converge to the true probability. This time, we’ll
carry the simulation a bit further.
First, to formulate the problem, let’s introduce the independent random variables 𝑋1 , 𝑋2 , … that are distributed along
Bernoulli(𝑝), where 𝑋𝑖 = 0 if the toss results in tails, while 𝑋𝑖 = 1 if it is heads. We are interested in the long-term
behavior of
𝑋 + ⋯ + 𝑋𝑛
𝑋̄ 𝑛 = 1 .
𝑛
𝑋̄ 𝑛 is called the sample average. We have already seen that the sample average gets closer and closer to 𝑝 as 𝑛 grows.
Let’s see the simulation one more time, before we go any further. (The parameter 𝑝 is selected to be 1/2 for the sake of
the example.)
import numpy as np
from scipy.stats import bernoulli
n_tosses = 1000
idx = range(n_tosses)
437
Mathematics of Machine Learning
with plt.style.context("seaborn-white"):
plt.figure(figsize=(16, 8))
plt.title("Relative frequency of the coin tosses")
plt.xlabel("Relative frequency")
plt.ylabel("Number of tosses")
Nothing new so far. However, if you have a sharp eye, you might ask the question: is this just an accident? After all, we
are studying the average
𝑋 + ⋯ + 𝑋𝑛
𝑋̄ 𝑛 = 1 ,
𝑛
which is (almost) a binomially distributed random variable! To be more precise, if 𝑋𝑖 ∼ Bernoulli(𝑝), then
1
𝑋̄ 𝑛 ∼ Binomial(𝑛, 𝑝).
𝑛
(We have seen this earlier when discussing the sums of discrete random variables.)
At this point, it is far from guaranteed that this distribution will be concentrated around a single value. So, let’s do some
more simulations. This time, we’ll toss a coin a thousand times thousand times to see the distribution of the averages.
Quite meta, I know.
with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
fig.suptitle("The distribution of sample averages")
for ax, i in zip(axs, [5, 100, 999]):
x = [k/i for k in range(i+1)]
y = more_coin_toss_averages[:, i]
ax.hist(y, bins=x)
ax.set_title(f"n = {i}")
In other words, the probability of 𝑋𝑛 falling far from 𝑝 becomes smaller and smaller. For any small 𝜀, we can formulate
the probability of “𝑋̄ 𝑛 falls farther from 𝑝 than 𝜀” as 𝑃 (|𝑋̄ 𝑛 − 𝑝| > 𝜀).
Thus, mathematically speaking, our guess is that the limit
holds.
Again, is this just an accident, and were we just lucky to study an experiment where this is true? Would the same work
for random variables other than Bernoulli ones? What will the sample averages converge to? (If they converge at all.)
We’ll find out.
Let’s play dice. To keep things simple, we are interested in the average value of a roll on the long run. To build a proper
probabilistic model, let’s introduce random variables!
A single roll is uniformly distributed on {1, 2, … , 6}, and each roll is independent from the others. So, let 𝑋1 , 𝑋2 , … be
independent random variables, each distributed according to Uniform({1, 2, … , 6}).
How does the sample average 𝑋̄ 𝑛 behave? Simulation time. We’ll randomly generate 1000 rolls, then explore how 𝑋̄ 𝑛
behaves.
n_rolls = 1000
(continues on next page)
with plt.style.context("seaborn-white"):
plt.figure(figsize=(16, 8))
plt.title("Sample averages of rolling a six-sided dice")
plt.xlim(-10, n_rolls+10)
plt.ylim(0, 6)
The first thing to note is that these are suspiciously close to 3.5. This is not a probability, but the expected value:
For Bernoulli(𝑝) distributed random variables, the expected value coincides with the probability 𝑝. However, this time,
𝑋̄ 𝑛 does not have a nice and explicit distribution like in the case of coin tosses, where the sample averages were binomially
distributed. So, let’s roll some more dices to estimate how 𝑋̄ 𝑛 is distributed.
with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
fig.suptitle("The distribution of sample averages")
for ax, i in zip(axs, [5, 100, 999]):
x = [6*k/i for k in range(i+1)]
y = more_dice_roll_averages[:, i]
ax.hist(y, bins=x)
ax.set_title(f"n = {i}")
It seems like that once more, the distribution of 𝑋̄ 𝑛 is concentrated around 𝔼[𝑋1 ]. Our intuition tells us that this is not
an accident; that this phenomenon is true for a wide range of random variables.
Let me spoil the surprise: this is indeed the case, and we’ll see this now.
This time, let 𝑋1 , 𝑋2 , … be a sequence of independent and identically distributed random variables. Not coin tosses, not
dice rolls, but any distribution. We saw that the sample average 𝑋̄ 𝑛 seem to converge to the joint expected value of the
𝑋𝑖 -s:
Note the quotation marks: 𝑋̄ 𝑛 is not a number, but a random variable. Thus, we can’t (yet) speak about their convergence.
In mathematically precise terms, what we saw previously is that for large enough 𝑛-s, the sample average 𝑋̄ 𝑛 is highly
unlikely to fall far from the joint expected value 𝜇 = 𝔼[𝑋1 ]; that is,
lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| > 𝜀) = 0 (41.1)
𝑛→∞
which does not look friendly at all. (The symbol ⌊𝑥⌋ denotes the largest integer that is smaller than 𝑥.)
Thus, our plan is the following.
1. Find a way to estimate 𝑃 (|𝑋̄ 𝑛 − 𝜇| > 𝜀) in a way that is independent from the distribution of the 𝑋𝑖 -s.
2. Use the upper estimate to show lim𝑛→∞ 𝑃 (|𝑋̄ 𝑛 − 𝜇| > 𝜀) = 0.
Let’s go.
First, the upper estimates. There are two general inequalities that’ll help us to deal with 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀)
𝔼[𝑋]
𝑃 (𝑋 ≥ 𝑡) ≤
𝑡
holds for any 𝑡 ∈ (0, ∞).
Proof. We have to separate the discrete and the continuous cases. The proofs are almost identical, so I’ll only do the
discrete case here, while the continuous is left for you as an exercise to test your understanding.
So, let 𝑋 ∶ Ω → {𝑥1 , 𝑥2 , … } be a discrete random variable (where 𝑥𝑘 ≥ 0 for all 𝑘), and 𝑡 ∈ (0, ∞) be an arbitrary
positive real number. Then
∞
𝔼[𝑋] = ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 )
𝑘=1
= ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 ) + ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 ),
𝑘∶𝑥𝑘 <𝑡 𝑘∶𝑥𝑘 ≥𝑡
where the sum ∑𝑘∶𝑥 only accounts for 𝑘-s with 𝑥𝑘 < 𝑡, and similarly, ∑𝑘∶𝑥 only accounts for 𝑘-s with 𝑥𝑘 ≥ 𝑡.
𝑘 <𝑡 𝑘 ≥𝑡
As the 𝑥𝑘 -s are nonnegative by assumption, we can estimate 𝔼[𝑋] from below by omitting one of them. Thus,
𝔼[𝑋] = ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 ) + ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 )
𝑘∶𝑥𝑘 <𝑡 𝑘∶𝑥𝑘 ≥𝑡
≥ ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 )
𝑘∶𝑥𝑘 ≥𝑡
≥ 𝑡 ∑ 𝑃 (𝑋 = 𝑥𝑘 )
𝑘∶𝑥𝑘 ≥𝑡
= 𝑡𝑃 (𝑋 ≥ 𝑡),
𝔼[𝑋]
𝑃 (𝑋 ≥ 𝑡) ≤
𝑡
follows. □
The law of large numbers is only one step away from Markov’s inequality. This last step is so useful that it deserves to be
its own theorem. Meet the famous inequality of Chebyshev.
Let (Ω, Σ, 𝑃 ) be a probability space and let 𝑋 → ℝ be a nonnegative random variable with finite variance 𝜎2 = Var[𝑋] <
∞ and expected value 𝔼[𝑋] = 𝜇.
Then
𝜎2
𝑃 (|𝑋 − 𝜇| ≥ 𝑡) ≤
𝑡2
holds for all 𝑡 ∈ (0, ∞).
𝔼[|𝑋 − 𝜇|2 ]
𝑃 (|𝑋 − 𝜇| ≥ 𝑡) ≤
𝑡2
𝜎2
= 2
𝑡
which is what we had to show. □
And with that, we are ready to precisely formulate and prove the law of large numbers.
After all this setup, the (weak) law of large numbers is just a small step away. Here it is in its full glory.
𝑋 + ⋯ + 𝑋𝑛
𝑋̄ 𝑛 = 1
𝑛
be their sample average. Then
lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀) = 0
𝑛→∞
𝑋 + ⋯ + 𝑋𝑛
Var[𝑋̄ 𝑛 ] = Var[ 1 ]
𝑛
1
= 2 Var[𝑋1 + ⋯ + 𝑋𝑛 ]
𝑛
1
= 2 (Var[𝑋1 ] + ⋯ + Var[𝑋𝑛 ])
𝑛
𝑛𝜎2
= 2
𝑛
𝜎2
= .
𝑛
Now, by using Chebyshev's inequality, we obtain
Var[𝑋̄ 𝑛 ]
𝑃 (|𝑋𝑛 − 𝜇| ≥ 𝜀) ≤
𝜀2
2
𝜎
= 2.
𝑛𝜀
Thus,
0 ≤ lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀)
𝑛→∞
𝜎2
≤ lim = 0,
𝑛→∞ 𝑛𝜀2
hence
lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀) = 0,
𝑛→∞
Theorem 40.3.3 is not all that can be said about the sample averages. There is a stronger result, showing that the sample
averages are in fact converge to the mean with probability 1.
Why is Theorem 40.3.3 called the “weak” law? Think about the statement
for a moment. For a given 𝜔 ∈ Ω, this doesn’t tell us anything about the convergence of a concrete sample average
𝑋 (𝜔) + ⋯ + 𝑋𝑛 (𝜔)
𝑋̄ 𝑛 (𝜔) = 1 ,
𝑛
it just tells us that in a probabilistic sense, 𝑋̄ 𝑛 is concentrated around the joint expected value 𝜇. In a sense, (41.2) is a
weaker version of
𝑃 ( lim 𝑋̄ 𝑛 = 𝜇) = 1,
𝑛→∞
hence the terminology weak law of large numbers. Do we have a stronger result than Theorem 40.3.3? Yes, we do.
Let 𝑋1 , 𝑋2 , … be a sequence of independent and identically distributed random variables with finite expected value
𝜇 = 𝔼[𝑋1 ] and variance 𝜎2 = Var[𝑋1 ], and let
𝑋 + ⋯ + 𝑋𝑛
𝑋̄ 𝑛 = 1
𝑛
be their sample average. Then
𝑃 ( lim 𝑋̄ 𝑛 = 𝜇) = 1.
𝑛→∞
We are not going to prove this, just know that the sample average will converge to the mean with probability one.
lim 𝑃 (|𝑋𝑛 − 𝑋| ≥ 𝜀) = 0
𝑛→∞
𝑃
for all 𝜀 > 0. Convergence in probability is denoted by 𝑋𝑛 −
→ 𝑋.
(b) 𝑋𝑛 converges almost surely towards 𝑋 if
𝑃 ( lim 𝑋𝑛 = 𝑋) = 1
𝑛→∞
a. s.
holds. Almost sure convergence is denoted by 𝑋𝑛 −−→ 𝑋.
Thus, the weak and strong laws of large numbers state that in certain cases, the sample averages converge to the expected
value both in probability and almost surely.
Statistics
447
Part VII
449
Part VIII
Neural networks
451
Part IX
Advanced optimization
453
Part X
Convolutional networks
455
Part XI
Appendix
457
CHAPTER
FORTYTWO
The rules of logic are to mathematics what those of structure are to architecture. — Bertrand Russell
“Mathematics is a language”, one of my professors used to say all the time. “Learning mathematics starts with building up
a basic vocabulary.”
What he forgot to add is that mathematics is the language of thinking. I often get the question: do you need to know
mathematics to be a software engineer/data scientist/random technical professional? My answer is simple. If you regularly
have to solve problems in your profession, then mathematics is extremely beneficial for you. You don’t have to think
effectively, but you are better off.
The learning curve of mathematics is steep. You have experienced it yourself, and the difficulty may have deterred you
from reaching a familiarity with its fundamentals. I have good news for you: if we treat learning mathematics as learning
a foreign language, we can start by building up a basic vocabulary instead of diving into poems and novels. Like as my
professor suggested.
Logic and clear thinking lie at the very foundations of mathematics. But what are those? How would you explain what
“logic” is?
Our thinking processes are formalized by the field of mathematical logic. In logic, we work with propositions; that is,
statements that are either true or false. “It is raining outside.” “The sidewalk is wet.” These are both valid propositions.
To be able to reason about propositions effectively, we often denote them with roman capital letters, such as
𝐴 = it is raining outside,
𝐵 = the sidewalk is wet.
Each proposition has a corresponding truth value, which is either true or false. These are often abbreviated as 1 and 0.
Although this seems like no big deal, finding the truth value can be extremely hard. Think about the proposition
This is the famous P = NP conjecture, one of the longest-standing unsolved problems in mathematics. The statement is
easy to understand, but solving the problem (that is, finding the truth value of the corresponding proposition) has eluded
even the smartest minds.
In essence, the entire body of our scientific knowledge lies in propositions whose truth values we have identified. So, how
do we do that in practice?
459
Mathematics of Machine Learning
In themselves, propositions are not enough to provide an effective framework for reasoning. Mathematics (and the entire
science) is the collection of complex propositions formulated from smaller building blocks with logical connectives. Each
connective takes one or more propositions and transforms their truth value.
“If it is raining outside, then the sidewalk is wet.” This is the combination of two propositions, strung together by the
implication connective. There are four essential connectives: negation, disjunction, conjunction, and implication. We will
take a close look at each one.
Negation flips the truth value of a proposition to its opposite. It is denoted by the mathematical symbol ¬: if 𝐴 is a
proposition, then ¬𝐴 is its negation. In general, connectives are defined by truth tables that enumerate all possible truth
values of the resulting expression, given its inputs. In writing, this looks complicated, so here is the truth table of ¬ to
illustrate the concept.
𝐴 ¬𝐴
0 1
1 0
When expressing propositions in a natural language, negation translates to the word “not”. For instance, the negation of
the proposition “the screen is black” is “the screen is not black”. (Not “the screen is white”.)
Logical conjunction is equivalent to grammatical conjunction “and”, denoted by the symbol ∧. The proposition 𝐴 ∧ 𝐵 is
true if and only if both 𝐴 and 𝐵 are true. For example, when we say that “the table is set and the food is ready”, we mean
to convey that both conjuncts are true. Here is the truth table:
𝐴 𝐵 𝐴∧𝐵
0 0 0
0 1 0
1 0 0
1 1 1
Disjunction is known as “or” in the English language and is denoted by the symbol ∨. The proposition 𝐴 ∨ 𝐵 is true
whenever either one is:
𝐴 𝐵 𝐴∨𝐵
0 0 0
0 1 1
1 0 1
1 1 1
Disjunction is inclusive, unlike the exclusive or we frequently use in our natural language. When you say “I am traveling
by train or car”, both cannot be true. The disjunction connective is not exclusive.
Finally, the implication connective (→) formalizes the deduction of a conclusion 𝐵 from a premise 𝐴: “if 𝐴, then 𝐵.”
The implication is true when the conclusion is true, or both the premise and the conclusion are false.
𝐴 𝐵 𝐴→𝐵
0 0 1
0 1 1
1 0 0
1 1 1
One example would be the famous quote from Descartes: “I think, therefore I am.” Translated to the language of formal
logic, this is simply
Sentences of the form “if 𝐴, then 𝐵“ are called conditionals. It’s not all just philosophy. Science is the collection of
propositions like “if 𝑋 is a closed system, then the entropy of 𝑋 cannot decrease”. (As the 2nd law of thermodynamics
states.)
The entire body of scientific knowledge is made of 𝐴 → 𝐵 propositions, and scientific research is equivalent to pursuing
the truth value of implications. When solving problems in practice, we rely on theorems (that is, implications) that turn
our premises into conclusions.
If you got the feeling that the connectives are akin to arithmetic operations, you are correct. Connectives yield propositions.
Thus connectives can again be applied, resulting in complex expressions like ¬(𝐴∨𝐵)∧𝐶. Constructing such expressions
and deductive arguments is called the propositional calculus.
Just like arithmetic operations, expressions made up of propositions and connectives also have identities. Think about the
famous algebraic identity
(𝑎 + 𝑏)(𝑎 − 𝑏) = 𝑎2 − 𝑏2 ,
that is one of the most frequently used symbolic expressions. Such an identity means we can write one thing in another
form. In mathematical logic, we call these logical equivalences.
𝑃 ≡ 𝑄.
To show you an example, let’s look at our first theorem, one that establishes logical equivalences for the conjunction
connective.
Proof. Showing these properties is done by drawing up their truth tables. We will do this for (a), while the rest is left for
you as an exercise. (I highly suggest you to do this, as performing a task by yourself is an excellent learning opportunity.)
𝐴 𝐵 𝐶 𝐴∧𝐵 𝐵∧𝐶 (𝐴 ∧ 𝐵) ∧ 𝐶 𝐴 ∧ (𝐵 ∧ 𝐶)
0 0 0 0 0 0 0
0 0 1 0 0 0 0
0 1 0 0 0 0 0
0 1 1 0 1 0 0
1 0 0 0 0 0 0
1 0 1 0 0 0 0
1 1 0 1 0 0 0
1 1 1 1 1 1 1
provides a proof. □
A few remarks are in order. First, we should read the truth table from left to right columns. Strictly speaking, we can
omit the columns for 𝐴 ∧ 𝐵 and 𝐵 ∧ 𝐶. However, including them saves the mental gymnastics.
Second, because of the associativity, we can freely write 𝐴 ∧ 𝐵 ∧ 𝐶, as the order of operations is irrelevant.
Finally, note that our first theorem is a premise and a conclusion, connected by the implication connective. If we denote
them by
𝑃 = 𝐴, 𝐵, 𝐶 are propositions,
𝑄 = (𝐴 ∧ 𝐵) ∧ 𝐶 ≡ 𝐴 ∧ (𝐵 ∧ 𝐶),
then the first part of our theorem is just the proposition 𝑃 → 𝑄, one that we have proven to be true via laying out the
truth table. This shows the immense power of the propositional calculus we are building here.
Theorem 41.3.1 has an analogue for disjunction. This is stated below for the sake of completeness, but the proof is left to
you as an exercise.
Just as arithmetic operations, connectives have order of precedence as well: ¬, ∧, ∨, →. This means that, for instance,
((¬𝐴) ∧ 𝐵) ∨ 𝐶 can be written as ¬𝐴 ∧ (𝐵 ∨ 𝐶).
In our calculus of propositions, one of the most important rules is De Morgan’s laws, describing how conjunction and
disjunction behave with respect to negation.
Proof. As usual, we can prove De Morgan’s laws by laying out the two truth tables
𝐴 𝐵 ¬𝐴
¬𝐵 𝐴∧𝐵 ¬(𝐴 ∧ 𝐵)
¬𝐴 ∨ ¬𝐵
0 0 1
1 0 1
1
0 1 1
0 0 1
1
1 0 0
1 0 1
1
1 1 0
0 1 0
0
and
𝐴 𝐵 ¬𝐴
¬𝐵 𝐴∨𝐵 ¬(𝐴 ∨ 𝐵)
¬𝐴 ∧ ¬𝐵
0 0 1
1 0 1
1
0 1 1
0 1 0
0
1 0 0
1 1 0
0
1 1 0
0 1 0
0
The propositional calculus we have established so far is the mathematical formalization of thinking. One thing is missing,
though: deduction, or as Wikipedia puts it, “the mental process of drawing inferences in which the truth of their premises
ensures the truth of their conclusion”. This is given via the famous rule of modus ponens.
𝐴 𝐵 𝐴→𝐵
0 0 1
0 1 1
1 0 0
1 1 1
By looking at its rows, we can see that when 𝐴 is true and the implication 𝐴 → 𝐵 is true, 𝐵 is true as well, as the principle
of modus ponens indicates. □
As modus ponens sounds extremely abstract, here is a concrete example. From common sense, we know that the impli-
cation “if it’s raining, then the sidewalk is wet” is true. If we observe from a roof window that it’s indeed raining, we can
confidently conclude that the sidewalk is wet, even without looking at it.
In symbolic notation, we can write
𝐴 → 𝐵, 𝐴 ⊢ 𝐵,
where the turnstile symbol ⊢ essentially reads as “proves”. Thus, the modus ponens says that 𝐴 → 𝐵 and 𝐴 prove 𝐵.
Modus ponens is how we use our theorems. It is always in the background.
Remark
This is a great opportunity to point out one of the most frequent logical fallacies: reversing the implication. When debating
about a given topic, participants often resort to the faulty argument
𝐴 → 𝐵, 𝐵 ⊢ 𝐴.
Of course, this is not true. For instance, consider our favorite example:
Clearly, 𝐴 → 𝐵 holds, but 𝐵 → 𝐴 does not. There are other reasons for a wet sidewalk. For instance, someone
accidentally spilled a barrel of water on it.
So, mathematics is about propositions, implications, and their truth values. We have seen that we can formulate proposi-
tions and reason about pretty complicated expressions using our propositional calculus. However, the language we have
built up so far is not suitable for propositions with variables.
For instance, think about the sentence
Because the truth value depends on 𝑥, this is not a well-formed proposition. Sentences with variables are called predicates,
and we denote them by emphasizing the dependence on their variables; for instance
𝑃 (𝑥) ∶ 𝑥 ≥ 0,
or
Each predicate has a domain from which its variables can be taken. You can think about a predicate 𝑃 (𝑥) as a function
that maps its domain to the set {0, 1}, representing its truth value. (Although, strictly speaking, we don’t have functions
available as tools when defining the very foundation of our formal language. However, we are not philosophers or set
theorists, so we don’t have to be concerned about such details.)
Predicates define truth sets, that is, subsets of the domain where the predicate is true. Formally, they are denoted by
{𝑥 ∈ 𝐷 ∶ 𝑃 (𝑥)}, (42.1)
all the time. These are called comprehensions, and they are inspired by the so-called set-builder notation given by (42.1).
Predicates are a big step towards properly formalizing mathematical thinking, but we are not quite there yet. To give you
an example from machine learning, let’s talk about finding the minima of loss functions. (That is, training a model.)
A point 𝑥 is said to be the global minimum of a function 𝑓(𝑥) if for all other 𝑦 in its domain 𝐷, 𝑓(𝑥) ≤ 𝑓(𝑦) holds. For
instance, the point 𝑥 = 0 is a minima of the function 𝑓(𝑥) = 𝑥2 .
How would you express this in our formal language? For one, we could say that
where we fix 𝑓(𝑥) = 𝑥2 and 𝑥 = 0. There are two parts of this sentence: for all 𝑦 ∈ 𝐷, and 𝑓(𝑥) ≤ 𝑓(𝑦) is true. The
second one is a predicate:
where 𝑦 ∈ ℝ.
The second part seems new, as we have never seen the words "for all" in our formal language before. They express a kind
of quantification about when the predicate 𝑃 (𝑦) is true.
In mathematical logic, there are two quantifiers we need to be happy: the universal quantifier "for all" denoted by the
symbol ∀, and the existential quantifier "there exists" denoted by ∃.
For example, consider the sentence “all of my friends are mathematicians”. By defining the set 𝐹 to be set of my friends
and the predicate on this domain as
𝑀 (𝑥) ∶ 𝑥 is a mathematician,
∀𝑥 ∈ 𝐹 , 𝑀 (𝑥).
Remember that the domain of the predicate 𝑀 (𝑥) is 𝐹 . We could omit that, but it’s much more user-friendly this way.
Similarly, “I have at least one friend who is a mathematician” translates to
∃𝑥 ∈ 𝐹 , 𝑀 (𝑥).
When there is a more complex proposition behind the quantifier, we mark its scope with parentheses:
Note that as (∀𝑥 ∈ 𝐹 , 𝑀 (𝑥)) and (∃𝑥 ∈ 𝐹 , 𝑀 (𝑥)) have a single truth value, they propositions, not predicates! Thus,
quantifiers turn predicates into propositions. Just like any other propositions, logical connectives can be applied to them.
Among all the operations, negation is the most interesting here. To see why, let’s consider the previous example: “all of
my friends are mathematicians”. At first, you might say that its negation is “none of my friends are mathematicians”, but
that is not correct. Think about it: I can have mathematician friends, as long as not all of them are mathematicians. Thus,
42.6 Problems
𝐴 𝐵 𝐴⊕𝐵
0 0 0
0 1 1
1 0 1
1 1 0
Show that
(a) 𝐴 ⊕ 𝐵 ≡ (¬𝐴 ∧ 𝐵) ∨ (𝐴 ∧ ¬𝐵),
(b) and 𝐴 ⊕ 𝐵 ≡ (¬𝐴 ∨ ¬𝐵) ∧ (𝐴 ∨ 𝐵)
holds.
FORTYTHREE
We’ve come a long way from the start: we studied propositions, logical connectives, predicates, quantifiers, and all the
formal logic. This was to be able to talk about mathematics. However, ultimately, we want to do mathematics.
As the only exact science, mathematics is built on top of definitions, theorems, and proofs. We precisely define objects,
formulate conjectures about them, then prove those with logically correct arguments. You can think of mathematics as a
colossal building made of propositions, implications, and modus ponens. If one theorem fails, all others that build upon
it fail too.
In other fields of science, the modus operandi is to hypothesize, experiment, and validate. However, experiments are not
𝑛
enough in mathematics. For instance, think about the famous Fermat numbers, that is, numbers of the form 𝐹𝑛 ∶= 22 +1.
Fermat conjectured them to be all prime numbers, as 𝐹0 , 𝐹1 , 𝐹2 , 𝐹3 , and 𝐹4 are all primes.
Five affirmative “experiments” might have been enough to accept the hypothesis as true in certain fields of science. Not in
mathematics. Around the 1700s, Euler showed that 𝐹5 = 4294967297 is not a prime, as 4294967297 = 641×6700417.
(Imagine calculating that in the 18th century, long before the age of computing.)
So far, we’ve seen some definitions, theorems, and even proofs when talking about mathematical logic. It’s time to put
them under the magnification glass and see what they are!
Ambiguity is the drawback of natural languages. How would you define, say, the concept of “hot”? Upon several attempts,
you would soon discover that no two people have the same definition.
In mathematics, there is no room for ambiguity. Every object and every property must be precisely defined. It’s best to
look at a good example instead of philosophizing about it.
For example, 2 ∣ 10 and 5 ∣ 10, but 7 ∤ 10. (Crossed symbols mean the negation of the said property.)
In terms of our formal language, the definition of “𝑎 is a divisor of 𝑏” can be written as
𝑎 ∣ 𝑏 ∶ ∃𝑘 ∈ ℤ, 𝑏 = 𝑘𝑎. (43.1)
Don’t let the 𝑎 ∣ 𝑏 notation deceive you; this is a predicate in disguise. We could have denoted 𝑎 ∣ 𝑏 by
divisor(𝑎, 𝑏) ∶ ∃𝑘 ∈ ℤ, 𝑏 = 𝑘𝑎.
467
Mathematics of Machine Learning
Although every mathematical definition can be formalized, we’ll prefer our natural language because it is much easier to
understand. (At least for humans. Not so much for computers.)
Like building blocks, definitions build on top of each other.
(If you have a sharp eye for details, you noticed that even Definition 42.1.1 is built upon other concepts such as numbers,
multiplication, and equality. We haven’t defined them precisely, just assumed they are there. Since our goal is not to
re-build mathematics from scratch, we’ll let this one slide.)
Again, it’s best to see an example here. Let’s see what even and odd numbers are!
One more time, with our formal language. For an integer 𝑛 ∈ ℤ, the predicates
even(𝑛) ∶ 2 ∣ 𝑛
and
odd(𝑛) ∶ 2 ∤ 𝑛
In other words, primes have no integer divisors other than themselves. The first few primes are 2, 3, 5, 7, 11, 13, 17, and
many more. Non-prime integers are called composite numbers.
The definition of primality can be written as
This might look complicated, but we can decompose it into parts, as shown by Fig. 43.1.
Primes play an essential role in our everyday lives! For instance, many mainstream cryptographic methods use large
primes to cipher and decipher messages. Without them, you wouldn’t be able to initiate financial transactions securely.
Their usefulness is guaranteed by their various properties, established in the form of theorems. We’ll see a few of them
soon enough, but first, let’s talk about what theorems really are.
Fig. 43.1: Definition of primality in first-order language, decomposed into its parts.
So, a definition is essentially a predicate whose truth set consists of our objects of interest. The whole point of mathematics
is to find true propositions involving those objects, most often in the form 𝐴 → 𝐵. Consider the following theorem that
is a cornerstone of optimization for machine learning.
Don’t worry if you are unfamiliar with the concepts of convexity and local minimum; it’s beside the point. The gist is that
Theorem 42.2.1 can be written as
∀𝑓 ∈ 𝐹 , (𝐶(𝑓) → 𝑀 (𝑓)),
where 𝐹 denotes the set of all functions ℝ𝑛 → ℝ, and the predicates 𝐶(𝑓) and 𝑀 (𝑓) are defined by
𝐶(𝑓) ∶ 𝑓 is convex,
𝑀 (𝑓) ∶ ∃𝑥∗ , 𝑥∗ is a global minimum of 𝑓.
Notice the structure of the theorem: “Let 𝑥 ∈ 𝐴. If 𝐵(𝑥), then 𝐶(𝑥).” With the first sentence, we are setting the domains
of the predicates 𝐴(𝑥) and 𝐵(𝑥), and putting an universal quantifier in front of the conditional “if 𝐵(𝑥), then 𝐶(𝑥).”
Now that we understand what theorems are, it’s time to look at proofs. We have just seen that theorems are true propo-
sitions. Proofs are deductions that establish the truth of a proposition. Let’s see an example instead of talking like a
philosopher!
The proof of Theorem 42.2.1 is not within our reach yet, so let’s look at something much simpler: the sum of even
numbers.
Proof. Since 𝑛 is even, 2 ∣ 𝑛. According to Definition 42.1.1, this means that there exists an integer 𝑘 ∈ ℤ such that
𝑛 = 2𝑘.
Similarly, as 𝑚 is also even, there exists an integer 𝑙 ∈ ℤ such that 𝑚 = 2𝑙.
Summing up the two, we obtain that
𝑛 + 𝑚 = 2𝑘 + 2𝑙
= 2(𝑘 + 𝑙),
giving that 𝑛 + 𝑚 is indeed even. □
(The square symbol □ is just there to mark the end of the proof. It reads as “quod erat demonstrandum”, meaning “what
was to be shown”.)
If you read the above proof carefully, you might notice that it is a chain of implications and modus ponens. These two
form the backbone of our deductive skills. What is proven is set in stone.
Understanding what proofs are is one of the biggest skill gaps in mathematics. Don’t worry if you don’t get it immediately;
this is a deep concept. You’ll get used to proofs eventually.
43.4 Equivalences
The building blocks of mathematics are propositions of the form 𝐴 → 𝐵; at least, this is what I emphasized throughout
this chapter.
I was not precise. The proposition 𝐴 → 𝐵 translates to “if 𝐴, then 𝐵“, but sometimes, we know much more. Quite
frequently, 𝐴 and 𝐵 have the same truth values. In natural language, we express this by saying “𝐴 if and only if 𝐵“.
(Although this is much rarer than the simple conditional.)
In logic, we express this relation with the biconditional connective ↔, defined by
𝐴 ↔ 𝐵 ≡ (𝐴 → 𝐵) ∧ (𝐵 → 𝐴).
Theorems of the “if and only if” type are called equivalences, and they play an essential role in mathematics. When
proving an equivalence, we must show both 𝐴 → 𝐵 and 𝐵 → 𝐴.
To see an example, let’s go back to elementary geometry. As you have probably learned in high school, we can describe
geometric objects on the plane with vectors that are represented by a tuple of two real numbers.
This way, geometric properties can be translated into analytic ones, and we can often prove hard theorems by a simple
calculation.
For instance, let’s talk about orthogonality, one of the most important concepts in mathematics. This is how orthogonality
is defined for two planar vectors.
For the sake of simplicity, we always assume that the enclosed angle is between 0 and 𝜋. (An angle of 𝜋 radians is 180
degrees, but we’ll always use radians.)
However, measuring the angle enclosed by two arbitrary vectors is not as easy as it sounds. We need a tractable formula,
and this is where the dot product comes in.
𝑎 ⋅ 𝑏 ∶= |𝑎||𝑏| cos 𝛼,
where 𝛼 is the angle enclosed by the two vectors, and | ⋅ | denotes the magnitude of a vector.
Dot products give an equivalent definition of orthogonality in the form of an “if and only if” theorem.
Theorem 42.4.1
Let 𝑎 = (𝑎1 , 𝑎2 ) and 𝑏 = (𝑏1 , 𝑏2 ) be two nonzero vectors on the plane. Then 𝑎 and 𝑏 orthogonal if and only if 𝑎 ⋅ 𝑏 = 0.
𝑎 ⋅ 𝑏 = |𝑎||𝑏| cos 𝛼 = 0
can only hold if cos 𝛼 = 0. In turn, this means that 𝛼 = 𝜋/2; that is, 𝑎 ⟂ 𝑏. (Recall that we assumed the enclosed angle
𝛼 to be between 0 and 𝜋.) □
There is no way around it: proving theorems is hard. Some took the smartest of minds decades, and some conjectures
remain unresolved after a century. (That is, they are not proven nor disproven.)
A few basic yet powerful tools can get one push through the difficulties. In the following, we’ll look at the three most
important ones: proof by induction, proof by contradiction, and the principle of contraposition.
How do you climb a set of stairs? Simple. You climb the first step, then climb the next one and so on.
You might be surprised, but this is something we frequently use in mathematics all the time. Let’s illuminate this by an
example.
𝑛(𝑛 + 1)
1+2+⋯+𝑛= (43.2)
2
holds.
Proof. For 𝑛 = 1, the case is clear: the left-hand side of (43.2) evaluates to 1, while the right-hand side is
1(1 + 1)
= 1.
2
Thus, our proposition holds for 𝑛 = 1.
Here comes the magic, that is, the induction step. Let’s assume that (43.2) holds for a given 𝑛; that is, we have
𝑛(𝑛 + 1)
1+2+⋯+𝑛= .
2
This is what’s called the induction hypothesis. Using this assumption, we are going to prove that (43.2) holds for 𝑛 + 1 as
well. In other words, our goal is to show that
(𝑛 + 1)(𝑛 + 2)
1 + 2 + ⋯ + 𝑛 + (𝑛 + 1) = .
2
Due to our induction hypothesis, we have
1 + 2 + ⋯ + 𝑛 + (𝑛 + 1) = [1 + 2 + ⋯ + 𝑛] + (𝑛 + 1)
𝑛(𝑛 + 1)
= + (𝑛 + 1).
2
Continuing the calculation, we obtain
To sum up what happened, let’s denote the equation (43.2) by the predicate
𝑛(𝑛 + 1)
𝑆(𝑛) ∶ 1 + 2 + ⋯ + 𝑛 = .
2
Proof by induction consists of two main steps. First, we establish that 𝑆(1) is true. Then, we show that for arbitrary 𝑛,
the implication 𝑆(𝑛) → 𝑆(𝑛 + 1) holds. Starting from the induction step, this implies that 𝑆(𝑛) is indeed true for all 𝑛:
the chain of implications
𝑆(1) → 𝑆(2),
𝑆(2) → 𝑆(3),
𝑆(3) → 𝑆(4),
⋮
combined with 𝑆(1) and the almighty modus ponens yields the truth of 𝑆(𝑛).
We took the first step 𝑆(1), then proved that we can take the next step from anywhere.
Induction is not simple to grasp, so here is another example. (It is slightly more complex than the previous one.) Follow
through with the proof and see if you can identify the marks of induction.
For simplicity, we’ll only prove the existence of the prime factorization, not the unicity.
𝑛 + 1 = 𝑎𝑏
for some 𝑎, 𝑏 ∈ ℤ. Since 𝑎, 𝑏 ≤ 𝑛, we can apply the induction hypothesis! Spelling it out, it means that we can write
them as 𝛼 𝛼
𝑎 = 𝑝1 1 … 𝑝𝑙 𝑙 ,
𝛽 𝛽
𝑏 = 𝑞1 1 … 𝑞𝑚𝑚 ,
where the 𝑝𝑖 , 𝑞𝑖 are the primes and the 𝛼𝑖 , 𝛽𝑖 are the exponents. Thus,
𝑛 + 1 = 𝑎𝑏
𝛼 𝛼 𝛽 𝛽
= 𝑝1 1 … 𝑝𝑙 𝑙 𝑞1 1 … 𝑞𝑚𝑚 ,
Induction is like a power tool in mathematics. It is extremely powerful, and when it is applicable, it’ll almost always do
the job.
Sometimes, it is easier to prove theorems by assuming that their conclusion is false, then deduce a contradiction.
Again, it’s best to see a quick example. Let’s revisit our good old friends, the prime numbers.
Theorem 42.5.3
There are infinitely many prime numbers.
𝑝𝑖 ∤ 𝑝1 𝑝2 … 𝑝𝑛 + 1.
This holds indeed, as by definition, 𝑝1 𝑝2 … 𝑝𝑛 + 1 = 𝑝𝑖 𝑘 + 1, where 𝑘 is simply the product of the prime numbers other
than 𝑝𝑖 .
Since no 𝑝𝑖 is a divisor of 𝑝1 𝑝2 … 𝑝𝑛 + 1, it must be a prime. We have found a new prime that is not on our list! This
means that our assumption (that there are finitely many prime numbers) has led to a contradiction.
Thus, there must be infinitely many prime numbers. □
If you have a sharp eye, you probably noticed that the above example is not of the form 𝐴 → 𝐵; it’s just a simple
proposition:
In these cases, showing that ¬𝐴 is false yields the desired conclusion. However, this technique works for 𝐴 → 𝐵 -style
propositions as well.
43.5.3 Contraposition
The final technique we will study is contraposition, a clever method that puts a twist into the classic 𝐴 → 𝐵-style thinking.
We should get to know the implication connective a bit better to see what it is. As it turns out, 𝐴 → 𝐵 can be written in
terms of negation and disjunction.
Theorem 42.5.4
Theorem 42.5.5
Let 𝑛 ∈ ℤ be an integer. If 2 ∤ 𝑛, then 4 ∤ 𝑛.
Proof. We should prove this via contraposition. Thus, assume that 4 ∣ 𝑛. This means that
𝑛 = 4𝑘
for some integer 𝑘 ∈ ℤ. However, this implies that
𝑛 = 2(2𝑘),
which shows that 2 ∣ 𝑛. Due to the principle of contraposition, (4 ∣ 𝑛) → (2 ∣ 𝑛) is logically equivalent to (2 ∤ 𝑛) →
(4 ∤ 𝑛), which is what we had to prove. □
Contraposition is not only useful in mathematics, it is a valuable thinking tool in general. Let’s consider our recurring
proposition: “if it is raining outside, then the sidewalk is wet”. We know this to be true, but this also means that “if the
sidewalk is not wet, then it is not raining”. (Because otherwise, the sidewalk would be wet.)
You perform these types of arguments every day without even noticing it. Now you have a name for them and can start
to apply this pattern consciously.
FORTYFOUR
In other words, general set theory is pretty trivial stuff really, but, if you want to be a mathematician, you need
some and here it is; read it, absorb it, and forget it. — Paul R. Halmos
Although Paul Halmos said the above a long time ago, it has remained quite accurate. Except for one part: set theory is
not only necessary for mathematicians, but for computer scientists, data scientists, and software engineers as well.
You might have heard about or studied set theory before. It is hard to see why it is so essential for machine learning, but
trust me, set theory is the very foundation of mathematics. Deep down, everything is a set or a function between sets. (As
we will see later, even functions are defined as sets.)
Think about the relation of set theory and machine learning like grammar and poetry. To write beautiful poetry, one
needs to be familiar with the rules of the language. For example, data points are represented as vectors in vector spaces,
often constructed as the Cartesian product of sets. (Don’t worry if you are not familiar with Cartesian products, we’ll get
there soon.) Or, to really understand probability theory, you need to be familiar with event spaces, which are systems of
sets closed under certain operations.
So, what are sets anyway?
On the surface level, a set is just a collection of things. We define sets by enumerating their elements like
Two sets are equal if they have the same elements. Given any element, we can always tell if it is a member of a given set
or not. When every element of 𝐴 is also an element of 𝐵, we say that 𝐴 is a subset of 𝐵, or in notation,
𝐴 ⊆ 𝐵.
If we have a set, we can define subsets by specifying a property that all of its elements satisfy, for example
(The % denotes the modulo operator.) This latter method is called the set-builder notation, and if you are familiar with
the Python programming language, you can see this inspired list comprehensions. There, one would write something like
this.
print(even_numbers)
477
Mathematics of Machine Learning
{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
↪ 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82,␣
2𝐴 ∶= {𝐵 ∶ 𝐵 ⊆ 𝐴}. (44.1)
Describing more complex sets with only these two methods (listing its members or using the set-builder notation) will be
extremely difficult. To make the job easier, we define operations on sets.
The most basic operations are the union, intersection, and difference. You are probably familiar with these, as they are
encountered frequently as early as high school. Even you feel familiar with them, check out the formal definition next.
We can easily visualize these with Venn diagrams, as you can see below.
We can express set operations in plain English as well. For example, 𝐴 ∪ 𝐵 means that “𝐴 or 𝐵“. Similarly, 𝐴 ∩ 𝐵 means
“𝐴 and 𝐵“, while 𝐴\𝐵 is “𝐴 but not 𝐵“. When talking about probabilities, these will be useful for translating events to
the language of set theory.
These set operations also have a lot of pleasant properties. For example, they behave nicely with respect to parentheses.
Theorem 43.2.1
Let 𝐴, 𝐵, and 𝐶 be three sets. The union operation is
(a) associative, that is, 𝐴 ∪ (𝐵 ∪ 𝐶) = (𝐴 ∪ 𝐵) ∪ 𝐶,
(b) commutative, that is, 𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴.
Moreover, the intersection operation is also associative and commutative.
Finally,
(c) the union is distributive with respect to the intersection, that is, 𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶),
(d) and the intersection is distributive with respect to the union, that is, 𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶).
Union and intersection can be defined for an arbitrary number of operands. That is, if 𝐴1 , 𝐴2 , … , 𝐴𝑛 are sets,
𝐴1 ∪ ⋯ ∪ 𝐴𝑛 ∶= (𝐴1 ∪ ⋯ ∪ 𝐴𝑛−1 ) ∪ 𝐴𝑛 ,
and similar for the intersection. Note that this is a recursive definition! Because of associativity, the order of parentheses
doesn’t matter.
The associativity and commutativity might seem too abstract and trivial at the same time. However, this is not the case
for all operations, so it is worth emphasizing to get used to the concepts. If you are curious, noncommutative operations
are right under our noses. A simple example is string concatenation.
a = "string"
b = "concatenation"
a + b == b + a
False
One of the fundamental rules describes how set difference, union, and intersection behave together regarding set opera-
tions. These are called De Morgan’s laws.
Proof. For simplicity, we are going to prove this using Venn diagrams. Although drawing a picture is not a “proper”
mathematical proof, this is not a problem. We are here to understand things, not to get hung up on philosophy.
Here is the illustration.
Based on this, you can easily see both (a) and (b). □
Note that De Morgan’s laws can be generalized to cover any number of sets. So, for any Γ index set,
𝐴\(∩𝛾∈Γ 𝐵𝛾 ) = ∪𝛾∈Γ (𝐴\𝐵𝛾 ),
𝐴\(∪𝛾∈Γ 𝐵𝛾 ) = ∩𝛾∈Γ (𝐴\𝐵𝛾 )
One of the most fundamental ways to construct new sets is the Cartesian product.
The elements of the product are called tuples. Note that this operation is not associative nor commutative! To see this,
consider that for example,
{1} × {2} ≠ {2} × {1}
and
({1} × {2}) × {3} ≠ {1} × ({2} × {3}).
The Cartesian product for an arbitrary number of sets is defined with a recursive definition, just like we did with the union
and intersection. So, if 𝐴1 , 𝐴2 , … , 𝐴𝑛 are sets, then
𝐴1 × ⋯ × 𝐴𝑛 ∶= (𝐴1 × ⋯ × 𝐴𝑛−1 ) × 𝐴𝑛 .
Here, the elements are tuples of tuples of tuples of…, but to avoid writing an excessive number of parentheses, we can
abbreviate it as (𝑎1 , … , 𝑎𝑛 ). When the operands are the same, we usually write 𝐴𝑛 instead of 𝐴 × ⋯ × 𝐴.
One of the most common examples is the Cartesian plane, which you probably have seen before.
To give you a machine learning-related example, let’s take a look at how we are usually given the data! Let’s focus on the
famous Iris dataset, a subset of ℝ4 . Here, the axes represent sepal length, sepal width, petal length, and petal width.
As the example demonstrates, Cartesian products are useful because they combine related information into a single math-
ematical structure. This is a recurring pattern in mathematics: building complex things from simpler building blocks and
Fig. 44.4: The sepal width, plotted against the sepal length in the Iris dataset. Source: scikit-learn documentation.
abstracting away the details by turning the result into yet another building block. (As one would do for creating complex
software as well.)
Let’s return to a remark I made earlier: naively defining sets as collections of things is not going to cut it. In the following,
we are going to see why. Prepare for some mind-twisting mathematics.
As we have seen, sets can be made of sets. For instance, {ℕ, ℤ, ℝ} is a collection of the most commonly used number
sets. We might as well define the set of all sets, which we’ll denote with Ω.
With that, we can use the set-builder notation to describe the following collection of sets:
𝑆 ∶= {𝐴 ∈ Ω ∶ 𝐴 ∉ 𝐴}.
In plain English, 𝑆 is a collection of sets that are not elements of themselves. Although this is weird, it looks valid. We
used the property “𝐴 ∉ 𝐴” to filter the set of all sets. What is the problem?
For one, we can’t decide if 𝑆 is an element of 𝑆 or not. If 𝑆 ∈ 𝑆, then by the defining property, 𝑆 ∉ 𝑆. On the other
hand, if 𝑆 ∉ 𝑆, then by the definition, 𝑆 ∈ 𝑆. This is definitely very weird.
We can diagnose the issue by decomposing the set-builder notation. In general terms, it can be written as
𝑥 ∈ 𝐴 ∶ 𝑇 (𝑥),
where 𝐴 is some set and 𝑇 (𝑥) is a property, that is, a true or false statement about 𝑥. In the definition {𝐴 ∈ Ω ∶ 𝐴 ∉ 𝐴},
our abstract property is defined by
true if 𝐴 ∉ 𝐴,
𝑇 (𝐴) = {
false otherwise.
This is perfectly valid, so the problem must be in the other part: the set Ω. It turns out, the set of all sets is not a set. So,
defining sets as a collection of things is not enough. Since sets are at the very foundations of mathematics, this discovery
threw a giant monkey wrench into the machine around the late 19th - early 20th century, and it took lots of years and
brilliant minds to fix it.
Fortunately, as machine learning practitioners, we don’t have to care about such low-level details as the axioms of set
theory. For us, it is enough to know that a solid foundation exists somewhere. (Hopefully.)
FORTYFIVE
This section is just a draft of a future Python quickstart for those who are new to the language. For now, I’ll just link a
few tutorials that are relevant for understanding the course material.
45.1 Variables
45.3.1 Tuples
45.3.2 Lists
45.3.3 Dictionaries
45.3.4 Comprehensions
45.4 Functions
45.5 Decorators
485
Mathematics of Machine Learning
FORTYSIX
BIBLIOGRAPHY
487
Mathematics of Machine Learning
[JBP03] E.T. Jaynes, G.L. Bretthorst, and Cambridge University Press. Probability Theory: The Logic of Science. Cam-
bridge University Press, 2003.
[Lax07] Peter D. Lax. Linear Algebra and Its Applications. Wiley-Interscience, second edition, 2007.
[Str00] Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry and Engi-
neering. Westview Press, 2000.
489