Math For Data Science
The ideas presented in the text are made concrete by interpreting them
in Python code. The standard Python data science packages are used, and
a Python index lists the functions used in the text. Because Python is used
to highlight concepts, the supporting code snippets are often written from
scratch, even when they don’t need to be.
Omar Hijab
Spring 2025
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The MNIST Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Averages and Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Two Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6 High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2 Linear Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.1 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.2 Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.3 Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.4 Span and Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
2.5 Zero Variance Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.6 Pseudo-Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.7 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.8 Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.9 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.1 Single-Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.2 Entropy and Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.3 Multi-Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
4.4 Back Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.2 Binomial Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.3 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
5.4 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
5.5 Chi-squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
5.6 Multinomial Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
6 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
6.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
6.2 Z-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
6.3 T -test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
6.4 Chi-Squared Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
A.1 Permutations and Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . 467
A.2 The Binomial Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
A.3 The Exponential Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
A.4 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
A.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
A.6 Asymptotics and Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
A.7 Existence of Minimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
A.8 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
List of Figures
3.25 Original and projections: n = 784, 600, 350, 150, 50, 10, 1. . . . . 195
3.26 The full MNIST dataset (2d projection). . . . . . . . . . . . . . . . . . . . 196
3.27 The Iris dataset (2d projection). . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
1.1 Introduction
from sklearn import datasets

iris = datasets.load_iris()
iris["feature_names"]
This returns
['sepal length','sepal width','petal length','petal width'].
To return the data and the classes, the code is
dataset = iris["data"]
labels = iris["target"]
dataset, labels
This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points
x1 , x2 , . . . , xN
If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus), and
Python (numpy, pandas, scipy, sympy, matplotlib). It may help to read the
code examples and the important math principles first, then dive
into details as needed.
To illustrate concepts and make them concrete as they are introduced, we use
Python code throughout. We run Python code in a jupyter notebook.
jupyter is an IDE, an integrated development environment. jupyter
supports many languages, including Python, Sage, Julia, and R. A useful
jupyter feature is the ability to measure the execution time of a jupyter cell
by including at the start of the cell
%%time
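For example, a minimal cell (an added illustration, not from the text) is

%%time
# the reported wall time is for the entire cell
sum(i*i for i in range(10**6))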
To avoid interfering with the Python packages the OS uses, carry out the above
steps within a venv, a virtual environment. Then several venvs may be set up
side-by-side, and, at any time, any venv may be deleted without impacting any
others, or the OS.
Exercises
def uniq(a):
return [x for i, x in enumerate(a) if x not in a[:i] ]
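For example, uniq([1, 2, 2, 3, 1]) returns [1, 2, 3]: duplicates are dropped, and the first occurrence of each element is kept.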
The MNIST2 dataset consists of 60,000 training images. Since this dataset is
for demonstration purposes, these images are coarse.
Each image consists of 28 × 28 = 784 pixels, and each pixel shading is a
byte, an integer between 0 and 255 inclusive. Therefore each image is a point
x in Rd = R784 . Attached to each image is its label, a digit 0, 1, . . . , 9.
We assume the dataset has been downloaded to your laptop as a CSV file
mnist.csv. Then each row in the file consists of the pixels for a single image.
Since the image’s label is also included in the row, each row consists of 785
integers. There are many sources and formats online for this dataset.
The code
# read_csv is from pandas
mnist = read_csv("mnist.csv").to_numpy()
mnist.shape,dataset.shape,labels.shape
returns
Fig. 1.4 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
2 The National Institute of Standards and Technology (NIST) is a physical sciences laboratory and non-regulatory agency of the United States Department of Commerce.
Here is an exercise. The top left image in Figure 1.4 is given by a 784-
dimensional point which is imported as an array pixels.
pixels = dataset[1].reshape((28,28))
grid()
scatter(2,3,s = 50)
show()
2. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.
pixels = dataset[1].reshape((28,28))
grid()
for i in range(28):
    for j in range(28):
        scatter(i, j, s=pixels[i,j])
show()
imshow(pixels, cmap="gray_r")
Depending on the numpy version, a computed scalar such as a mean may display as
np.float64(5.843333333333335)
rather than as the plain float
5.843333333333335
To obtain the simpler display, set
set_printoptions(legacy="1.25")
We end the section by discussing the Python import command. The last
code snippet assumed the star import from matplotlib.pyplot import *. It can
be rewritten as
plt.imshow(pixels, cmap="gray_r")
after import matplotlib.pyplot as plt, or as
imshow(pixels, cmap="gray_r")
after from matplotlib.pyplot import imshow. In the third version, only the
command imshow is imported. Which import style is used depends on the situation.
In this text, we usually use the first style, as it is visually lightest. To help
with online searches, in the Python index, Python commands are listed under
their full package path.
Exercises
Exercise 1.2.1 Run the code in this section on your laptop (all code is run
within jupyter).
Exercise 1.2.2 The first image in the MNIST dataset is an image of the
digit 5. What is the 43,120th image?
Exercise 1.2.3 Figure 1.6 is not oriented the same way as the top-left image
in Figure 1.4. Modify the code returning Figure 1.6 to match the top-left
image in Figure 1.4.
L = [x_1,x_2,...,x_N].
The set of all possible samples is the population, or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population
Or, the sample space consists of all integers and we take N = 5 samples from
this population
Or, the sample space consists of all rational numbers and we take N = 5
samples from this population
Or, the sample space consists of all Python strings and we take N = 5 samples
from this population
L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population
from random import choice

def hexcolor():
    chars = '0123456789abcdef'
    return "#" + ''.join([choice(chars) for _ in range(6)])
v = x − µ = (c − a, d − b).
Then µ is the tail of v, and x is the head of v. For example, the vector joining
µ = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean µ of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − µ, k = 1, 2, . . . , N .
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,
(Figure: the points x1, . . . , x5, their mean µ, and the centered vectors v1, . . . , v5 based at the origin 0.)
Let us go back to vector spaces. When we work with vector spaces, numbers
are referred to as scalars, because 2v, 3v, −v, . . . are scaled versions of v.
When we multiply a vector v by a scalar r to get the scaled vector rv, we call
this vector scaling. This is to distinguish this multiplication from the inner
and outer products we see below.
For example, the samples in the list L1 form a vector space, the set of all
real numbers R. Even though one can add integers, the set Z of all integers
does not form a vector space because multiplying an integer by 1/2 does
not result in an integer. The set Q of all rational numbers (fractions) is a
vector space, so L3 is a sampling from a vector space. The set of strings is
not a vector space because even though one can add strings, addition is not
commutative:
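# for example, with any two distinct strings
'abc' + 'xyz' == 'xyz' + 'abc'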
returns False.
the average is
µ = (1.23 + 4.29 − 3.3 + 555)/4 = 139.305.
In Python, averages are computed using numpy.mean. For a scalar dataset,
the code
dataset = array([1.23,4.29,-3.3,555])
mu = mean(dataset)
mu
For a dataset of points in the plane, the average is computed similarly. For example, consider
dataset = array([[1,3,-2,0],[2,4,11,66]])
Here the x-components of the four points are the first row, and the y-
components are the second row. With this, the code
mu = mean(dataset, axis=1)
mu
Had the points been arranged as rows of an N × d array instead, the mean would be computed with
mean(dataset, axis=0)
N = 20
def row(N): return array([random() for _ in range(N) ])
# 2xN array
dataset = array([ row(N), row(N) ])
mu = mean(dataset,axis=1)
grid()
scatter(*mu)
scatter(*dataset)
show()
H, H, T, T, T, H, T, . . .
If we add the vectorized samples f (x) using vector addition in the plane
(§1.4), the first component of the mean (1.3.2) is an average of ones and
zeroes, with ones matching heads, resulting in the proportion p̂ of heads.
Similarly, the second component is the proportion of tails. Hence (1.3.2) is
the pair (p̂, 1 − p̂), where p̂ is the proportion of heads in N tosses.
More generally, if the label of a sample falls into d categories, we may let
f (x) be a vector with d components consisting of zeros and ones, according
to the category of the sample. This is one-hot encoding (see §2.4 and §7.6).
For example, suppose we take a sampling of size N from the Iris dataset,
and we look at the classes of the resulting samples. Since there are three
classes, in this case, we can define f (x) to equal (1, 0, 0), (0, 1, 0), or (0, 0, 1),
according to which class x belongs to. Then the mean (1.3.2) is a triple
p̂ = (p̂1 , p̂2 , p̂3 ) of proportions of each class in the sampling. Of course, p̂1 +
p̂2 + p̂3 = 1, so p̂ is a probability vector (§5.6).
(Figure: the vectorization f maps the sample space into a vector space.)
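For example, here is a sketch (added here; it assumes sklearn's load_iris and numpy's eye) that one-hot encodes the Iris classes and averages them, recovering the class proportions p̂ = (1/3, 1/3, 1/3):

from numpy import mean, eye
from sklearn import datasets

labels = datasets.load_iris()["target"]   # class labels 0, 1, 2
onehot = eye(3)[labels]                   # one-hot encoded samples f(x)
mean(onehot, axis=0)                      # each proportion is 1/3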
When there are only two possibilities, two classes, it’s simpler to encode
the classes as follows,
f (x) = 1 if x is heads, and f (x) = 0 if x is tails.
Even when the samples are already scalars or vectors, we may still want
to vectorize them. For example, suppose x1 , x2 , . . . , xN are the prices of a
sample of printers from across the country. Then the average price (1.3.1) is
well-defined. Nevertheless, we may set
f (x) = 1 if x costs more than $100, and f (x) = 0 if x costs $100 or less.
Then the mean (1.3.2) is the sample proportion p̂ of printers that cost more
than $100.
In §6.4, we use vectorization to derive the chi-squared tests.
Exercises
Exercise 1.3.2 What is the average petal length in the Iris dataset?
Exercise 1.3.3 What is the average shading of the pixels in the first image
in the MNIST dataset?
x = arange(0,1,.2)
plot(x,f(x))
scatter(x,f(x))
We start with the geometry of vectors in two dimensions. This is the cartesian
plane R2 , also called 2-dimensional real space. The plane R2 is a vector space,
in the sense described in the previous section.
(Figure: vectors in the plane, illustrating a vector v and the addition of two vectors v1 and v2.)
Addition of vectors
v1 + v2 = (x1 + x2 , y1 + y2 ). (1.4.1)
Because points and vectors are interchangeable, the same formula is used
for addition P + P ′ of points P and P ′ .
This addition is the same as combining their shadows as in Figure 1.14.
In Python, lists and tuples do not add this way. Lists and tuples have to first
be converted into numpy arrays.
v1 = (1,2)
v2 = (3,4)
v1 + v2 == (1+3,2+4) # returns False
v1 = [1,2]
v2 = [3,4]
v1 + v2 == [1+3,2+4] # returns False
20 CHAPTER 1. DATASETS
v1 = array([1,2])
v2 = array([3,4])
v1 + v2 == array([1+3,2+4]) # returns True
Scaling of vectors
v = array([1,2])
3*v == array([3,6]) # returns True
(Figure 1.17: the scalings tv of a vector v form a line through the origin 0.)
Given a vector v, the scalings tv of v form a line passing through the origin
0 (Figure 1.17). This line is the span of v (more on this in §2.4). Scalings tv
of v are also called multiples of v.
If t and s are real numbers, it is easy to check t(sv) = (ts)v.
Thus scaling v by s, and then scaling the result by t, has the same effect as
scaling v by ts, in a single step. Because points and vectors are interchange-
able, the same formula tP is used for scaling points P by t.
We set −v = (−1)v, and define subtraction of vectors by
v1 − v2 = v1 + (−v2 ).
v1 = array([1,2])
v2 = array([3,4])
v1 - v2 == array([1-3,2-4]) # returns True
Subtraction of vectors
v1 − v2 = (x1 − x2 , y1 − y2 ) (1.4.2)
Distance Formula

The magnitude |v| of a vector v = (x, y) is its distance √(x² + y²) from the origin 0. In Python,

v = array([1,2])
norm(v) == sqrt(5) # returns True
(Figure 1.16: a vector (x, y) at distance r from the origin, making angle θ with the horizontal axis.)
The unit circle consists of the vectors which are distance 1 from the origin
0. When v is on the unit circle, the magnitude of v is 1, and we say v is a
1.4. TWO DIMENSIONS 23
unit vector. In this case, the line formed by the scalings of v intersects the
unit circle at ±v (Figure 1.17).
When v is a unit vector, r = 1, and (Figure 1.16) v = (x, y) = (cos θ, sin θ).
The unit circle intersects the horizontal axis at (1, 0), and (−1, 0), and
intersects the vertical axis at (0, 1), and (0, −1). These four points are equally
spaced on the unit circle (Figure 1.17).
By the distance formula, a vector v = (x, y) is a unit vector when
x2 + y 2 = 1.
More generally, any circle with center Q = (a, b) and radius r consists of
points (x, y) satisfying
(x − a)2 + (y − b)2 = r2 .
Let R be a point on the unit circle, and let t > 0. Since |tR| = t|R| = t, the scaled
point tR is on the circle with center (0, 0) and radius t. Moreover, it follows
a point P is on the circle of center Q and radius r iff P = Q + rR for some
R on the unit circle.
Given this, it is easy to check
|(1/r)v| = (1/r)|v| = (1/r)r = 1,
Now we discuss the dot product in two dimensions. We have two vectors
v1 and v2 in the plane R2 , with v1 = (x1 , y1 ) and v2 = (x2 , y2 ). The dot
product of v1 and v2 is given algebraically as
v1 · v2 = x1 x2 + y1 y2 ,
or geometrically as
v1 · v2 = |v1 | |v2 | cos θ,
where θ is the angle between v1 and v2 . To show that these are the same,
below we derive the dot product identity (1.4.5).
(Figure 1.18: the triangle formed by the vectors v1, v2, and v2 − v1.)
v1 = array([1,2])
v2 = array([3,4])
dot(v1,v2) == 1*3 + 2*4 # returns True
As a consequence of the dot product identity, we have code for the angle
between two vectors (there is also a built-in numpy.angle).
def angle(u,v):
a = dot(u,v)
b = dot(u,u)
c = dot(v,v)
theta = arccos(a / sqrt(b*c))
return degrees(theta)
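For example, angle(array([1,0]), array([0,1])) returns 90.0.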
Cauchy-Schwarz Inequality

The dot product of two vectors is absolutely less than or equal to the product of their lengths, |v1 · v2| ≤ |v1| |v2|.
To derive the dot product identity, we first derive Pythagoras' theorem for
general triangles (Figure 1.19),
c² = a² + b² − 2ab cos θ. (1.4.8)
(Figure 1.19: a triangle with sides a, b, c; the altitude f splits the side b into pieces d and e.)
a2 = d2 + f 2 and c2 = e2 + f 2 .
Also b = e + d, so e = b − d, so
e2 = (b − d)2 = b2 − 2bd + d2 .
c² = e² + f² = (b − d)² + f² = f² + d² + b² − 2db = a² + b² − 2ab cos θ,
where the last step uses f² + d² = a² and d = a cos θ,
so we get (1.4.8).
Next, connect Figures 1.18 and 1.19 by noting a = |v2 | and b = |v1 | and
c = |v2 − v1 |. By (1.4.6), expanding c² = |v2 − v1|² yields
c² = a² + b² − 2(x1 x2 + y1 y2). (1.4.9)
Comparing the terms in (1.4.8) and (1.4.9), we arrive at (1.4.5). This com-
pletes the proof of the dot product identity (1.4.5).
(Figure: the perpendicular vector v⊥, the points ±P⊥ on the unit circle, and squares with side lengths a, b, c.)
v · v ⊥ = (x, y) · (−y, x) = 0.
From Figure 1.21, we see points P and P ′ on the unit circle satisfy P ·P ′ = 0
iff P ′ = ±P ⊥ .
ax + by = 0, cx + dy = 0. (1.4.10)
Homogeneous System
ax + by = e, cx + dy = f, (1.4.13)
(x, y) = (e/a, 0), (x, y) = (0, e/b), (x, y) = (f /c, 0), (x, y) = (0, f /d)
x = (de − bf)/(ad − bc), y = (af − ce)/(ad − bc). (1.4.15)
Putting all this together, we conclude
Inhomogeneous System
In §2.9, we will understand the three cases in terms of the rank of A equal
to 2, 1, or 0.
In this case, we call u and v the rows of A. On the other hand, A may be
written as
A = \begin{pmatrix} a & c \\ b & d \end{pmatrix} = (u v), u = (a, b), v = (c, d).
In this case, we call u and v the columns of A. Many texts then write u and
v as
u = \begin{pmatrix} a \\ b \end{pmatrix}, v = \begin{pmatrix} c \\ d \end{pmatrix}. (1.4.16)
We do not do this when u and v are on their own, because then they are just
vectors. We only do this when u and v are being multiplied with matrices, or
are rows or columns of matrices.
In fact, when we do write (1.4.16), we are thinking of u and v as 2 × 1
matrices, not as vectors. This shows there are at least three ways to think
about a matrix: as rows, or as columns, or as a single block.
The simplest operations on matrices are addition and scaling. Addition is
as follows,
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, A′ = \begin{pmatrix} a′ & b′ \\ c′ & d′ \end{pmatrix} =⇒ A + A′ = \begin{pmatrix} a + a′ & b + b′ \\ c + c′ & d + d′ \end{pmatrix},
AA′ = \begin{pmatrix} u · u′ & u · v′ \\ v · u′ & v · v′ \end{pmatrix}.
U(θ)U(θ′) = \begin{pmatrix} cos θ & −sin θ \\ sin θ & cos θ \end{pmatrix} \begin{pmatrix} cos θ′ & −sin θ′ \\ sin θ′ & cos θ′ \end{pmatrix} = \begin{pmatrix} cos(θ + θ′) & −sin(θ + θ′) \\ sin(θ + θ′) & cos(θ + θ′) \end{pmatrix} = U(θ + θ′).
AA−1 = I = A−1 A.
(AB)−1 = B −1 A−1 .
(AB)t = B t At .
Ax = b
is
x = A−1 b,
since
Ax = AA−1 b = Ib = b.
With this, we can rewrite (1.4.13) as
\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} e \\ f \end{pmatrix}.
Orthogonal Matrices
Here we wrote u ⊗ v as a single block, and also in terms of rows and columns.
If we do this the other way, we get
v ⊗ u = \begin{pmatrix} ca & cb \\ da & db \end{pmatrix},
so
(u ⊗ v)t = v ⊗ u.
When u = v, u ⊗ v = v ⊗ v is a symmetric matrix.
Here is code for tensor.
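# the same helper reappears in §1.5 below
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])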
There is no need to use this, since the numpy built-in outer does the same
job,
A = outer(u,v)
det(u ⊗ v) = 0.
This is true no matter what the vectors u and v are. Check this yourself.
By definition of u ⊗ v,
so
v · Qv = (x, y) · (ax + by, bx + cy) = ax2 + 2bxy + cy 2 .
This is the quadratic form associated to the matrix Q.
Quadratic Form
If
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix} and v = (x, y),
then
v · Qv = ax² + 2bxy + cy².
Q = I =⇒ v · Qv = x² + y².
When Q is diagonal,
Q = \begin{pmatrix} a & 0 \\ 0 & c \end{pmatrix} =⇒ v · Qv = ax² + cy².
If Q = u ⊗ u, then v · Qv = (u · v)².
Exercises
ax + by = c, −bx + ay = d.
Exercise 1.4.2 Let u = (1, a), v = (b, 2), and w = (3, 4). Solve
u + 2v + 3w = 0
for a and b.
Exercise 1.4.3 Let u = (1, 2), v = (3, 4), and w = (5, 6). Find a and b such
that
au + bv = w.
Exercise 1.4.4 Let P be a nonzero point in the plane. What is (P⊥)⊥?
Exercise 1.4.5 Let A = \begin{pmatrix} 8 & −8 \\ −7 & −3 \end{pmatrix} and B = \begin{pmatrix} 3 & −2 \\ 2 & −2 \end{pmatrix}. Compute AB and BA.
Exercise 1.4.6 Let A = \begin{pmatrix} 9 & 2 \\ −36 & −8 \end{pmatrix}. Find a nonzero 2 × 2 matrix B satisfying AB = 0.
Exercise 1.4.8 If u = (a, b) and v = (c, d) and A = u ⊗ v, use (1.4.19) to
compute A2 .
Exercise 1.4.11 With Q and V 2 × 2 matrices, Q invertible, and t scalar, show
det(Q + tV)/det(Q) = 1 + t · trace(Q⁻¹V) + t² det(Q⁻¹V).
Exercise 1.4.12 What is the trace of A = \begin{pmatrix} 9 & 2 \\ −36 & −8 \end{pmatrix}?
u ∧ v = u ⊗ v − v ⊗ u.
Exercise 1.4.15 Calculate the areas of the triangles and the squares in Fig-
ure 1.21. From that, deduce Pythagoras’s theorem c2 = a2 + b2 .
Above |x| stands for the length of the vector x, or the distance of the point
x to the origin. When d = 2 and we are in two dimensions, this was defined
in §1.4. For general d, this is defined in §2.1. In this section we continue to
focus on two dimensions d = 2.
The mean or sample mean is
µ = (1/N) ∑_{k=1}^N x_k = (x1 + x2 + · · · + xN)/N. (1.5.1)
Point of Best-fit
The mean is the point of best-fit: The mean minimizes the mean-
square distance to the dataset (Figure 1.22).
Fig. 1.22 MSD for the mean (green) versus MSD for a random point (red).
Using (1.4.6),
so we have
M SD(x) = M SD(µ) + |x − µ|2 ,
which is clearly ≥ M SD(µ), deriving the above result.
Here is the code for Figure 1.22.
N, d = 20, 2
# d x N array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
mu = mean(dataset,axis=1)
p = array([random(),random()])
for v in dataset.T:
plot([mu[0],v[0]],[mu[1],v[1]],c='green')
plot([p[0],v[0]],[p[1],v[1]],c='red')
scatter(*mu)
scatter(*dataset)
grid()
show()
The variance of the dataset is
q = (1/N) ∑_{k=1}^N (x_k − µ)². (1.5.2)
The square root of the variance is the standard deviation σ = √q.
If a scalar dataset has mean zero and variance one, it is standard. Every
dataset x1 , x2 , . . . , xN may be standardized by first centering the dataset,
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,
then dividing each vk by the standard deviation σ.
For a dataset of points x1 , x2 , . . . , xN in the plane, with mean µ, center the dataset the same way,
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ.
Then the variance is the matrix (see §1.4 for tensor product)
Q = (v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN)/N. (1.5.3)
Since v ⊗ v is a symmetric matrix, the variance of a dataset is a symmetric
matrix. Below we see the variance is also nonnegative, in the sense v · Qv ≥ 0
for all vectors v. Later we see how to standardize vector datasets.
When i ̸= j, the entries Q = (qij ) of the variance matrix are called covari-
ances: qij is the covariance between the i-th feature and the j-th feature.
For example, suppose N = 5 and
x1 = (1, 2), x2 = (3, 4), x3 = (5, 6), x4 = (7, 8), x5 = (9, 10). (1.5.4)
Since
(±4, ±4) ⊗ (±4, ±4) = \begin{pmatrix} 16 & 16 \\ 16 & 16 \end{pmatrix},
(±2, ±2) ⊗ (±2, ±2) = \begin{pmatrix} 4 & 4 \\ 4 & 4 \end{pmatrix},
(0, 0) ⊗ (0, 0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},
Notice
Q = 8(1, 1) ⊗ (1, 1),
which, as we see below (§2.5), reflects the fact that the points of this dataset
lie on a line. Here the line is y = x + 1. Here is code from scratch for the
variance (matrix) of a dataset.
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])
N, d = 20, 2
# N x d array
dataset = array([ [random(),random()] for _ in range(N) ])
mu = mean(dataset,axis=0)
# center dataset
vectors = dataset - mu
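The snippet stops short of forming Q; presumably it continues along these lines, using the tensor helper above:

# variance matrix: average of the tensor products of the centered vectors
Q = sum([ tensor(v,v) for v in vectors ]) / N
Q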
N, d = 20, 2
# d x N array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
Q = cov(dataset,bias=True)
Q
This returns the same result as the previous code for Q. Notice that here there is
no need to compute the mean; it is taken care of automatically. The option
bias=True indicates division by N , returning the biased variance. To return
the unbiased variance and divide by N −1, change the option to bias=False,
or remove it, since bias=False is the default.
From (1.4.18), if Q is the variance matrix (1.5.3),
trace(Q) = (1/N) ∑_{k=1}^N |x_k − µ|². (1.5.5)
# dataset is d x N array
Q = cov(dataset,bias=True)
Q.trace()
P b = (b · u)u.
These vectors are all multiples of u, as they should be. The projected dataset
is two-dimensional.
Alternately, discarding u and retaining the scalar coefficients, we have the
one-dimensional dataset
v1 · u, v2 · u, . . . , vN · u.
# dataset is d x N array
Q = cov(dataset,bias=True)
This shows that the dataset lies on the line passing through m and perpen-
dicular to (1, −1).
(v − µ) · Q(v − µ) = k
(v − µ) · Q−1 (v − µ) = k
Fig. 1.24 Unit variance ellipses (blue) and unit inverse variance ellipses (red) with µ = 0.
If we write v = (x, y) for a vector in the plane, the variance ellipse equation
centered at µ = 0 is
v · Qv = ax2 + 2bxy + cy 2 = k.
def ellipse(Q,mu,padding=.5,levels=[1],render="var"):
    grid()
    scatter(*mu,c="red",s=5)
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d,e = mu
    delta = .01
    x = arange(d-padding,d+padding,delta)
    y = arange(e-padding,e+padding,delta)
    x, y = meshgrid(x, y)
    if render == "var" or render == "both":
        # matrix_text(Q,mu,padding,'blue')
        eq = a*(x-d)**2 + 2*b*(x-d)*(y-e) + c*(y-e)**2
        contour(x,y,eq,levels=levels,colors="blue",linewidths=.5)
    if render == "inv" or render == "both":
        draw_major_minor_axes(Q,mu)
        Q = inv(Q)
        # matrix_text(Q,mu,padding,'red')
        A, B, C = Q[0,0],Q[0,1],Q[1,1]
        eq = A*(x-d)**2 + 2*B*(x-d)*(y-e) + C*(y-e)**2
        contour(x,y,eq,levels=levels,colors="red",linewidths=.5)
With this code, ellipse(Q,mu) returns the unit variance ellipse in the unit
square centered at µ. The codes for the functions draw_major_minor_axes
and matrix_text are below.
Depending on whether render is var, inv, or both, the code renders the
variance ellipse (blue), the inverse variance ellipse (red), or both. The code
renders several ellipses, one for each level in the list levels. The default is
levels = [1], so the unit ellipse is returned. Also padding can be adjusted
to enlarge the plot.
The code for Figure 1.24 is
mu = array([0,0])
Q = array([[9,0],[0,4]])
ellipse(Q,mu,padding=4,render="both")
show()
Q = array([[9,2],[2,4]])
ellipse(Q,mu,padding=4,render="both")
show()
To use TEX to display the matrices in Figure 1.24, insert the function matrix_text below, after setting
rcParams['text.usetex'] = True
rcParams['text.latex.preamble'] = r'\usepackage{amsmath}'
def matrix_text(Q,mu,padding,color):
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d,e = mu
    valign = e + 3*padding/4
    if color == 'blue': halign = d - padding/2; tex = "$Q="
    else: halign = d; tex = "$Q^{-1}="
    # r"..." means raw string
    tex += r"\begin{pmatrix}" + str(round(a,2)) + "&" + str(round(b,2))
    tex += r"\\" + str(round(b,2)) + "&" + str(round(c,2))
    tex += r"\end{pmatrix}$"
    return text(halign,valign,tex,fontsize=15,color=color)
Fig. 1.25 Variance ellipses (blue) and inverse variance ellipses (red) for a dataset.
N = 50
# N x d array
dataset = array([ [random(),random()] for _ in range(N) ])
Q = cov(dataset.T,bias=True)
mu = mean(dataset,axis=0)
scatter(*dataset.T,s=5)
ellipse(Q,mu,render="var",padding=.5,levels=[.005,.01,.02])
show()
scatter(*dataset.T,s=5)
ellipse(Q,mu,render="inv",padding=.5,levels=[.5,1,2])
show()
x1 , x2 , . . . , xN , and y1 , y2 , . . . , yN .
Suppose the mean of this dataset is µ = (µx , µy ). Then, by the formula for
tensor product, the variance matrix is
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},
where
a = (1/N) ∑_{k=1}^N (xk − µx)², b = (1/N) ∑_{k=1}^N (xk − µx)(yk − µy), c = (1/N) ∑_{k=1}^N (yk − µy)².
From this, we see a is the variance of the x-features, and c is the variance
of y-features. We also see b is a measure of the correlation between the x and
y features.
Standardizing the dataset means to center the dataset and to place the x
and y features on the same scale. For example, the x-features may be close
to their mean µx , resulting in a small x variance a, while the y-features may
be spread far from their mean µy , resulting in a large y variance c.
When this happens, the different scales of the x's and y's distort the relation
between them, and b may not accurately reflect the correlation. To correct
for this, we center and re-scale:
x1 , x2 , . . . , xN → x′1 = (x1 − µx)/√a, x′2 = (x2 − µx)/√a, . . . , x′N = (xN − µx)/√a,
and
y1 , y2 , . . . , yN → y′1 = (y1 − µy)/√c, y′2 = (y2 − µy)/√c, . . . , y′N = (yN − µy)/√c.
This results in a new dataset v1 = (x′1, y′1), v2 = (x′2, y′2), . . . , vN = (x′N, y′N)
that is centered,
(v1 + v2 + · · · + vN)/N = 0,
with each feature standardized to have unit variance,
(1/N) ∑_{k=1}^N (x′k)² = 1, (1/N) ∑_{k=1}^N (y′k)² = 1.
where
ρ = (1/N) ∑_{k=1}^N x′k y′k = b/√(ac).
Fig. 1.26 Unit variance ellipse and unit inverse variance ellipse with standard Q.
When ρ = ±1, the dataset samples are perfectly correlated and lie on
a line passing through the mean. When ρ = 1, the line has slope 1, and
when ρ = −1, the line has slope −1. When ρ = 0, the dataset samples are
completely uncorrelated and are considered two independent one-dimensional
datasets.
In numpy, the correlation matrix Q′ is returned by
# dataset is d x N array
corrcoef(dataset)
u · Qu = max_{|v|=1} v · Qv.
Since the sine function varies between +1 and −1, we conclude the pro-
jected variance varies between
1 − ρ ≤ v · Qv ≤ 1 + ρ,
and
θ = π/4, v+ = (1/√2, 1/√2) =⇒ v+ · Qv+ = 1 + ρ,
θ = 3π/4, v− = (−1/√2, 1/√2) =⇒ v− · Qv− = 1 − ρ.
Thus the best-aligned vector v+ is at 45◦ , and the worst-aligned vector is at
135◦ (Figure 1.26).
Actually, the above is correct only if ρ > 0. When ρ < 0, it’s the other
way. The correct answer is
1 − |ρ| ≤ v · Qv ≤ 1 + |ρ|,
Fig. 1.27 Positively and negatively correlated datasets (unit inverse ellipses).
Here are two randomly generated datasets. The dataset on the left in
Figure 1.27 is positively correlated. Its mean and variance are
µ = (0.53626891, 0.54147513), Q = \begin{pmatrix} 0.08016526 & 0.01359483 \\ 0.01359483 & 0.08589097 \end{pmatrix}.
The dataset on the right in Figure 1.27 is negatively correlated. Its mean and variance are
µ = (0.46979642, 0.48347168), Q = \begin{pmatrix} 0.08684941 & −0.00972569 \\ −0.00972569 & 0.09409118 \end{pmatrix}.
λ− ≤ v · Qv ≤ λ+ , |v| = 1.
If the inverse variance ellipse is not a circle, then Q is not a multiple of the
identity, and either v+ or w+ is nonzero. If v+ ̸= 0, v+ is the best-aligned
vector. If v+ = 0, w+ is the best-aligned vector.
If the inverse variance ellipse is not a circle, then Q is not a multiple of the
identity, and either v− or w− is nonzero. If v− ̸= 0, v− is the worst-aligned
vector. If v− = 0, w− is the worst-aligned vector.
If Q is a multiple of the identity, then any vector is best-aligned and worst-
aligned. All this follows from solutions of homogeneous 2×2 systems (1.4.10).
The general d × d case is in §3.2. For the 2 × 2 case discussed here, see the
exercises at the end of §3.2.
The code for rendering the major and minor axes of the inverse variance
ellipse uses (1.5.6) and (1.5.7),
def draw_major_minor_axes(Q,mu):
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d, e = mu
    label = { 1:"major", -1:"minor" }
    for pm in [1,-1]:
        lamda = (a+c)/2 + pm * sqrt(b**2 + (a-c)**2/4)
        sigma = sqrt(lamda)
        lenv = sqrt(b**2 +(a-lamda)**2)
        lenw = sqrt(b**2 +(c-lamda)**2)
        if lenv: deltaX, deltaY = b/lenv, (a-lamda)/lenv
        elif lenw: deltaX, deltaY = (lamda-c)/lenw, b/lenw
        elif pm == 1: deltaX, deltaY = 1, 0
        else: deltaX, deltaY = 0, 1
        axesX = [d+sigma*deltaX,d-sigma*deltaX]
        axesY = [e-sigma*deltaY,e+sigma*deltaY]
        plot(axesX,axesY,linewidth=.5,label=label[pm])
    legend()
Exercises
d = 10
# 100 x 2 array
dataset = array([ array([i+j,j]) for i in range(d) for j in range(d) ])
Compute the mean and variance, and plot the dataset and the mean.
Exercise 1.5.2 Let the dataset be the petal lengths against the petal widths
in the Iris dataset. Compute the mean and variance, and plot the dataset and
the mean.
Exercise 1.5.3 Project the dataset in Exercise 1.5.1 onto the line through
the vector (1, 2). What is the projected dataset? What is the reduced dataset?
Exercise 1.5.4 Project the dataset in Exercise 1.5.2 onto the line through
the vector (1, 2). What is the projected dataset? What is the reduced dataset?
Exercise 1.5.5 Plot the variance ellipse and inverse variance ellipses of the
dataset in Exercise 1.5.1.
Exercise 1.5.6 Plot the variance ellipse and inverse variance ellipses of the
dataset in Exercise 1.5.2.
Exercise 1.5.7 Plot the dataset in Exercise 1.5.1 together with its mean
and the line through the vector of best fit.
Exercise 1.5.8 Plot the dataset in Exercise 1.5.2 together with its mean
and the line through the vector of best fit.
Exercise 1.5.9 Standardize the dataset in Exercise 1.5.1. Plot the stan-
dardized dataset. What is the correlation matrix?
Exercise 1.5.10 Standardize the dataset in Exercise 1.5.2. Plot the stan-
dardized dataset. What is the correlation matrix?
Exercise 1.5.11 Let Q = \begin{pmatrix} a & b \\ b & a \end{pmatrix}. Show Q is nonnegative when a ≥ |b|. (Compute v · Qv with v = (cos θ, sin θ) as in the text.)
Although not used in later material, this section is here to boost intuition
about high dimensions. Draw four disks inside a square, and a fifth disk in
the center.
In Figure 1.29, the edge-length of the square is 4, and the radius of each
blue disk is 1. Draw the diagonal of the square. Then the diagonal passes
through two blue disks.
Since the length of the diagonal of the square is 4√2, and the diameters
of the two blue disks add up to 4, the portions of the diagonal outside the blue
disks add up to 4√2 − 4. Hence the radius of the red disk is
(1/4)(4√2 − 4) = √2 − 1.
In three dimensions, draw eight balls inside a cube, as in Figure 1.30, and
one ball in the center. Since the edge-length of the cube is 4, the radius of
each blue ball is 1. Since the length of the diagonal of the cube is 4√3, the
radius of the red ball is
(1/4)(4√3 − 4) = √3 − 1.
Now we repeat in d dimensions. Here the edge-length of the cube remains
4, the radius of each blue ball remains 1, and there are 2^d blue balls. Since
the length of the diagonal of the cube is 4√d, the same calculation results in
the radius of the red ball equal to r = √d − 1.
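For a quick numerical feel (an added illustration), r = √d − 1 already exceeds the blue radius 1 once d > 4:

from numpy import sqrt

for d in [2, 3, 4, 5, 100]:
    print(d, sqrt(d) - 1)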
# initialize figure
ax = axes()
# red disk
circle = Circle((2, 2), radius=sqrt(2)-1, color='red')
ax.add_patch(circle)
ax.set_axis_off()
ax.axis('equal')
show()
%matplotlib ipympl
from matplotlib.pyplot import *
from numpy import *
from itertools import product
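# assumed here (not shown in the excerpt): the unit-sphere mesh used by ball() below
u, v = meshgrid(linspace(0, 2*pi, 30), linspace(0, pi, 30))
x, y, z = cos(u)*sin(v), sin(u)*sin(v), cos(v)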
# initialize figure
ax = axes(projection="3d")
# render ball
def ball(a,b,c,r,color):
    return ax.plot_surface(a + r*x, b + r*y, c + r*z, color=color)
# red ball
ball(2,2,2,sqrt(3)-1,"red")
# cube grid
cube = ones((4,4,4),dtype=bool)
ax.voxels(cube, edgecolors='black',lw=.5,alpha=0)
ax.set_aspect("equal")
ax.set_axis_off()
show()
Ĝ = {(tx, 1 − t) : 0 ≤ t ≤ 1, x in G}.
Vol(Ĝ) = ∫_0^1 t^d Vol(G) dt = Vol(G) [t^{d+1}/(d+1)]_{t=0}^{t=1} = Vol(G)/(d + 1).
Thus
Vol(Ĝ) = Vol(G)/(d + 1).
Exercises
Exercise 1.6.1 Why is the diagonal length of the square 4√2?
Exercise 1.6.2 Why is the diagonal length of the cube 4√3?
Exercise 1.6.3 Why does dividing by 4 yield the red disk radius and the red
ball radius?
Exercise 1.6.4 Suspend the unit circle G : x2 +y 2 = 1 from its center. What
is the suspension Ĝ? Conclude area(unit disk) = length(unit circle)/2.
v = (t1 , t2 , . . . , td ).
The scalars are the components or the features of v. If there are d features,
we say the dimension of v is d. We call v a d-dimensional vector.
A point x is also a list of scalars, x = (t1 , t2 , . . . , td ). The relation between
points x and vectors v is discussed in §1.3. The set of all d-dimensional vectors
or points is d-dimensional space Rd .
In Python, we use numpy or sympy for vectors and matrices. In Python,
if L is a list, then numpy.array(L) or sympy.Matrix(L) return a vector or
matrix.
v = array([1,2,3])
v.shape
v = Matrix([1,2,3])
v.shape
The first v.shape returns (3,), and the second v.shape returns (3,1). In
either case, v is a 3-dimensional vector.
Vectors are added and scaled component by component: With
we have
together are the standard basis. Similarly, in Rd , we have the standard basis
e1 , e2 , . . . , ed .
# numpy vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])
A.shape
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.shape
B = array([u,v,w])
The transpose interchanges rows and columns: the rows of At are the columns
of A. In both numpy or sympy, the transpose of A is A.T.
A vector v may be written as a 1 × N matrix,
v = (t1 t2 . . . tN),
or as an N × 1 matrix (a column vector) with the same entries.
# 5x3 matrix
A = Matrix.hstack(u,v,w)
# column vector
b = Matrix([1,1,1,1,1])
# 5x4 matrix
M = Matrix.hstack(A,b)
In general, for any sympy matrix A, column vectors can be hstacked and
row vectors can be vstacked. For any matrix A, the code
returns True. Note we use the unpacking operator * to unpack the list, before
applying hstack.
In numpy, there is column_stack and row_stack, so the code
both return True. Here col refers to rows of At , hence refers to the columns
of A.
The number of rows is len(A), and the number of columns is len(A.T).
To access row i, use A[i]. To access column j, access row j of the transpose,
A.T[j]. To access the j-th entry in row i, use A[i,j].
In sympy, the number of rows in a matrix A is A.rows, and the number of
columns is A.cols, so
A.shape == (A.rows,A.cols)
A = zeros(2,3)
B = ones(2,2)
C = Matrix([[1,2],[3,4]])
D = B + C
E = 5 * C
F = eye(4)
A, B, C, D, E, F
returns
\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix}, \begin{pmatrix} 5 & 10 \\ 15 & 20 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.
A = diag(1,2,3,4)
B = diag(-1, ones(2, 2), Matrix([5, 7, 5]))
A, B
returns
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}, \begin{pmatrix} −1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 5 \\ 0 & 0 & 0 & 7 \\ 0 & 0 & 0 & 5 \end{pmatrix}.
It is straightforward to convert back and forth between numpy and sympy.
In the code
A = diag(1,2,3,4)
B = array(A)
C = Matrix(B)
Exercises
Exercise 2.1.1 A vector is one-hot encoded if all features are zero, except for
one feature which is one. For example, in R3 there are three one-hot encoded
vectors
(1, 0, 0), (0, 1, 0), (0, 0, 1).
A matrix is a permutation matrix if it is square and all rows and all columns
are one-hot encoded. How many 3 × 3 permutation matrices are there? What
about d × d?
2.2 Products
u · v = s1 t1 + s2 t2 + · · · + sd td . (2.2.1)
As in §1.4, we always have rows on the left, and columns on the right.
In Python,
u = array([1,2,3])
v = array([4, 5, 6])
u = Matrix([1,2,3])
v = Matrix([4, 5, 6])
sqrt(dot(v,v))
sqrt(v.T * v)
As in §1.4,
Dot Product
In two dimensions, this was equation (1.4.5) in §1.4. Since any two vectors
lie in a two-dimensional plane, this remains true in any dimension. More
precisely, (2.2.2) is taken as the definition of cos θ.
Based on this, we can compute the angle θ,
cos θ = (u · v)/(|u| |v|) = (u · v)/√((u · u)(v · v)).
def angle(u,v):
a = dot(u,v)
b = dot(u,u)
c = dot(v,v)
theta = arccos(a / sqrt(b*c))
return degrees(theta)
Cauchy-Schwarz Inequality
The dot product of two vectors is absolutely less than or equal to the product of their lengths, |u · v| ≤ |u| |v|.
|a + b| = (a + b) · v ≤ |a| + |b|.
Let A and B be two matrices. If the row dimension of A equals the column
dimension of B, the matrix-matrix product AB is defined. When this condition
holds, the entries in the matrix AB are the dot products of the rows of A with
the columns of B. In Python,
the code
A,B,dot(A,B)
A,B,A*B
returns
AB = \begin{pmatrix} 70 & 80 & 90 \\ 158 & 184 & 210 \end{pmatrix}.
Let A and B be matrices, and suppose the row dimension of A and the
column dimension of B both equal d. Then the matrix-matrix product AB
is defined. If A = (aij) and B = (bij), then we may write AB in summation notation as
(AB)_{ij} = ∑_{k=1}^d a_{ik} b_{kj}. (2.2.5)
When AB is square, summing the diagonal entries gives
trace(AB) = ∑_{i=1}^d (AB)_{ii} = ∑_{i=1}^d ∑_{k=1}^d a_{ik} b_{ki}.
dot(A,B).T == dot(B.T,A.T)
In terms of row vectors and column vectors, this is automatic. For example,
In Python,
dot(dot(A,u),v) == dot(u,dot(A.T,v))
dot(dot(A.T,u),v) == dot(u,dot(A,v))
As a consequence,
(u ⊗ v)ij = ui vj .
Then the identities (1.4.19) and (1.4.20) hold in general. Using the tensor
product, we have
Tensor Identity
AAt = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN . (2.2.10)
To derive this, let Q and Q′ be the symmetric matrices on the left and
right sides of (2.2.10). By Exercise 2.2.7, to establish (2.2.10), it is enough
to show x · Qx = x · Q′ x for every vector x. By (2.2.8),
At x = (v1 · x, v2 · x, . . . , vN · x).
Since |At x|2 is the sum of the squares of its components, this establishes
x · Qx = x · Q′ x, hence the result.
valid for any matrix A and vectors u, v with compatible shapes. The deriva-
tion of this identity is a simple calculation with components that we skip.
and
∥A∥² = trace(AᵗA) = trace(AAᵗ). (2.2.14)
By replacing A by At , the same results hold for rows.
Q = dot(vectors,vectors.T)/N
Q = cov(dataset,bias=True)
After downloading the Iris dataset as in §2.1, the mean, variance, and total variance are
µ = (5.84, 3.05, 3.76, 1.2),
Q = \begin{pmatrix} 0.68 & −0.04 & 1.27 & 0.51 \\ −0.04 & 0.19 & −0.32 & −0.12 \\ 1.27 & −0.32 & 3.09 & 1.29 \\ 0.51 & −0.12 & 1.29 & 0.58 \end{pmatrix}, 4.54.
x1 · ej , x2 · ej , . . . , xN · ej ,
consisting of the j-th feature of the samples. If qjj is the variance of this
scalar dataset, then q11 , q22 , . . . , qdd are the diagonal entries of the variance
matrix.
To standardize the dataset, we center it, and rescale the features to have
variance one, as follows. Let µ = (µ1 , µ2 , . . . , µd ) be the dataset mean. For
each sample point x = (t1 , t2 , . . . , td ), the standardized vector is
v = ((t1 − µ1)/√q11, (t2 − µ2)/√q22, . . . , (td − µd)/√qdd).
The entries of the corresponding correlation matrix Q′ = (q′ij) are then
q′ij = qij/√(qii qjj), i, j = 1, 2, . . . , d.
In Python (with StandardScaler imported from sklearn.preprocessing),
N, d = 10, 2
# Nxd array
dataset = array([ [random() for _ in range(d)] for _ in range(N) ])
# standardize dataset
standardized = StandardScaler().fit_transform(dataset)
Qcorr = corrcoef(dataset.T)
Qcov = cov(standardized.T,bias=True)
allclose(Qcov,Qcorr)
returns True.
Exercises
v = (1, 2, 3, . . . , n).
Let |v| = √(v · v) be the length of v. Then, for example, when n = 1, |v| = 1
and, when n = 2, |v| = √5. There is one other n for which |v| is a whole
number. Use Python to find it.
AB = u1 ⊗ v1 + u2 ⊗ v2 + · · · + ud ⊗ vd .
Let A be any matrix and b a vector. The goal is to solve the linear system
Ax = b. (2.3.1)
In this section, we use the inverse A−1 and the pseudo-inverse A+ to solve
(2.3.1).
Of course, the system (2.3.1) doesn’t even make sense unless
In what follows, we assume this equality is true and dimensions are appro-
priately compatible.
Even then, it’s very easy to construct matrices A and vectors b for which
the linear system (2.3.1) has no solutions at all! For example, take A the zero
matrix and b any non-zero vector. Because of this, we must take some care
when solving (2.3.1).
AB = I = BA. (2.3.2)
we have
(AB)−1 = B −1 A−1 .
Ax = b =⇒ x = A−1 b. (2.3.3)
Ax = A(A−1 b) = (AA−1 )b = Ib = b.
# solving Ax=b
x = A.inv() * b
# solving Ax=b
x = dot(inv(A) , b)
x+ = A+ b =⇒ Ax+ = b.
How do we use the above result? Given A and b, using Python, we compute
x = A+ b. Then we check, by multiplying in Python, equality of Ax and b.
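A minimal sketch of this check in numpy (assuming A and b are already defined as arrays):

from numpy import dot, allclose
from numpy.linalg import pinv

x = dot(pinv(A), b)      # the candidate x+ = A+ b
allclose(dot(A, x), b)   # True (up to rounding) exactly when Ax = b is solvable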
The rest of the section consists of examples of solving linear systems. The
reader is encouraged to work out the examples below in Python. However,
because some linear systems have more than one solution, and the implemen-
tations of Python on your laptop and on my laptop may differ, our solutions
may differ.
It can be shown that if the entries of A are integers, then the entries of A+
are fractions. This fact is reflected in sympy, but not in numpy, as the default
in numpy is to work with floats.
Let
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
# arrange as columns
A = column_stack([u,v,w])
pinv(A)
returns
A+ = (1/150) \begin{pmatrix} −37 & −20 & −3 & 14 & 31 \\ −10 & −5 & 0 & 5 & 10 \\ 17 & 10 & 3 & −4 & −11 \end{pmatrix}.
Alternatively, in sympy,
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.pinv()
For
b3 = (−9, −3, 3, 9, 10),
we have
x+ = A+ b3 = (1/15)(82, 25, −32).
However, for this x+ , we have
We solve
Bx = u, Bx = v, Bx = w
by constructing the candidates
B+ u, B+ v, B+ w,
Let
C = Aᵗ = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}
and let f = (0, −5, −10). By Exercise 2.6.8, C+ = (A+)ᵗ, so
C+ = (A+)ᵗ = (1/150) \begin{pmatrix} −37 & −10 & 17 \\ −20 & −5 & 10 \\ −3 & 0 & 3 \\ 14 & 5 & −4 \\ 31 & 10 & −11 \end{pmatrix}
and
x+ = C+ f = (1/50)(32, 35, 38, 41, 44).
Once we confirm equality of Cx+ and f , which is the case, we obtain a
solution x+ of Cx = f .
We solve
Dx = a, Dx = b, Dx = c, Dx = d, Dx = e,
D+ a, D+ b, D+ c, D+ d, D+ e,
x+ = (1, 0), x+ = (2, 1), x+ = (3, 2), x+ = (4, 3), x+ = (5, 4).
Exercises
Exercise 2.3.2 With R(d) as in Exercise 2.2.9, find the formula for the
inverse and pseudo-inverse of R(d), whichever exists. Here d = 1, 2, 3, . . . .
t1 v1 + t2 v2 + · · · + td vd . (2.4.1)
and let A be the matrix with columns u, v, w, as in (2.3.4). Let x be the vector
(r, s, t) = (1, 2, 3). Then an explicit calculation shows (do this calculation!)
the matrix-vector product Ax equals ru + sv + tw,
Ax = ru + sv + tw.
The code
returns
x = (t1 , t2 , . . . , td ).
Then
Ax = t1 v1 + t2 v2 + · · · + td vd , (2.4.2)
In other words,
t1 v1 + t2 v2 + · · · + td vd
of the vectors. For example, span(b) of a single vector b is the line through
b, and span(u, v, w) is the set of all linear combinations ru + sv + tw.
Span Definition I
S = span(v1 , v2 , . . . , vd ).
Span Definition II
span(v1 , v2 , . . . , vd ) = span(w1 , w2 , . . . , wN ).
Thus there are many choices of spanning vectors for a given span.
For example, let u, v, w be the columns of A in (2.3.4). Let ⊂ mean “is
contained in”. Then
since adding a third vector can only increase the linear combination possibil-
ities. On the other hand, since w = 2v − u, we also have
It follows that
span(u, v, w) = span(u, v).
Let A be a matrix. The column space of A is the span of its columns. For
A as in (2.3.4), the column space of A is span(u, v, w). The code
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
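# presumably the call here is sympy's columnspace (named again in §2.9)
A.columnspace()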
returns a minimal list of vectors spanning the column space of A. The column
rank of A is the length of the list, i.e. the number of vectors returned.
For example, for A as in (2.3.4), this code returns the list
[u, v] = [ \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{pmatrix}, \begin{pmatrix} 6 \\ 7 \\ 8 \\ 9 \\ 10 \end{pmatrix} ].
Ax = t1 v1 + t2 v2 + · · · + td vd .
By (2.4.3),
The column space of a matrix A consists of all vectors of the form Ax.
A vector b is in the column space of A when Ax = b has a solution.
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])
# orth is from scipy.linalg
orth(A)
For example, let b3 = (−9, −3, 3, 9, 10) and let Ā = (A, b3 ). Using Python,
check the column rank of Ā is 3. Since the column rank of A is 2, we conclude
b3 is not in the column space of A, so b3 is not a linear combination of u, v,
w.
When (2.4.6) holds, b is a linear combination of the columns of A. However,
(2.4.6) does not tell us which linear combination. According to (2.4.3), finding
the specific linear combination is equivalent to solving Ax = b.
then
(r, s, t) = re1 + se2 + te3 .
This shows the vectors e1 , e2 , e3 span R3 , or
R3 = span(e1 , e2 , e3 ).
e1 = (1, 0, 0, . . . , 0, 0)
e2 = (0, 1, 0, . . . , 0, 0)
e3 = (0, 0, 1, . . . , 0, 0) (2.4.7)
... = ...
ed = (0, 0, 0, . . . , 0, 1)
Then e1 , e2 , . . . , ed span Rd , so
Rd is a span.
span(a, b, c, d, e) = span(a, f ).
For any matrix, the row rank equals the column rank.
Because of this, we refer to this common number as the rank of the matrix.
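As a quick numerical illustration (added here, not from the text), numpy's matrix_rank returns the same number for a matrix and its transpose:

from numpy import array
from numpy.linalg import matrix_rank

M = array([[1,2,3],[4,5,6],[7,8,9]])
matrix_rank(M), matrix_rank(M.T)   # returns (2, 2)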
t1 v1 + t2 v2 + · · · + td vd = 0.
ru + sv + tw = 1u − 2v + 1w = 0 (2.4.8)
If r ̸= 0, then
u = −(s/r)v − (t/r)w.
If s ̸= 0, then
v = −(r/s)u − (t/s)w.
If t ̸= 0, then
w = −(r/t)u − (s/t)v.
Hence linear dependence of u, v, w means one of the three vectors is a linear
combination of the other two vectors.
In general, a vanishing non-trivial linear combination of v1 , v2 , . . . , vd , or
linear dependence of v1 , v2 , . . . , vd , is the same as saying one of the vectors
is a linear combination of the remaining vectors.
In terms of matrices,
A.nullspace()
This says the null space of A consists of all multiples of (1, −2, 1). Since the
code
[r,s,t] = A.nullspace()[0]
null_space(A)
A Versus At A
Let A be any matrix. The null space of A equals the null space of
At A.
|Ax|2 = Ax · Ax = x · At Ax = 0,
t1 v1 + t2 v2 + · · · + td vd = 0.
Take the dot product of both sides with v1 . Since the dot product of any two
distinct vectors is zero, and each vector has length one, we obtain
t1 = t1 v1 · v1 = t1 v1 · v1 + t2 v2 · v1 + · · · + td vd · v1 = 0.
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
C = A.T
C.nullspace()
u⊥ = {v : u · v = 0} . (2.4.9)
Ax = (v1 · x, v2 · x, . . . , vN · x).
v1 · x = 0, v2 · x = 0, . . . , vN · x = 0,
Every vector in the row space is orthogonal to every vector in the null
space,
Actually, the above paragraph only established the first identity. For the
second identity, we need to use (2.7.9), as follows
rowspace = (rowspace⊥)⊥ = nullspace⊥.
Since the row space is the orthogonal complement of the null space, and
the null space of A equals the null space of At A, we conclude
A Versus At A
Let A be any matrix. Then the row space of A equals the row space
of At A.
Now replace A by At in this last result. Since the row space of At equals
the column space of A, and AAt is symmetric, we also have
A Versus AAt
Let A be any matrix. Then the column space of A equals the column
space of AAt .
A(x1 − x2 ) = b − b = 0, (2.4.11)
Let A be any matrix. The null space of A and the row space of A are
in the source space of A, and the column space of A is in the target
space of A.
This shows the null space of an invertible matrix is zero, hence the nullity is
zero.
Since the row space is the orthogonal complement of the null space, we
conclude the row space is all of Rd .
In §2.9, we see that the column rank and the row rank are equal. From
this, we see also the column space is all of Rd . In summary,
Let A be a d×d invertible matrix. Then the null space is zero, and the
row space and column space are both Rd . In particular, the nullity is
0, and the row rank and column rank are both d.
Exercises
Exercise 2.4.1 For what condition on a, b, c do the vectors (1, a), (2, b),
(3, c) lie on a line?
Compute Cx in two ways, first by row times column, then as a linear combi-
nation of the columns of C.
Exercise 2.4.3 Check that the array in Figure 2.1 matches with b1 , b2 as
explained in the text, and the vectors b1 and b2 are orthogonal.
Exercise 2.4.4 [32] Let a = (1, 1, 0, 0), b = (0, 0, 1, 1), c = (1, 0, 1, 0), d =
(0, 1, 0, 1). Check whether or not a, b, c, d are linearly independent by solving
ra + sb + tc + ud = 0. Is ra + sb + tc + ud = (0, 0, 0, 1) solvable? Do a, b, c, d
span R4 ?
Exercise 2.4.5 Let A = (u, v, w) be as in (2.3.4) and let b = (16, 17, 18, 19, 20).
Is b in the column space of A? If yes, solve b = ru + sv + tw.
Exercise 2.4.8 [32] Let A be a 64 × 17 matrix with rank 11. How many
linearly independent vectors x solve Ax = 0? How many linearly independent
vectors x solve At x = 0?
What are A(5, 3) and A(3, 5)? What are the source and target spaces for
A(N, d)?
Exercise 2.4.10 Calculate the column rank of the matrix A(N, d) for all
N ≥ 2 and all d ≥ 2. (Column rank is the length of the list columnspace
returns.)
Exercise 2.4.11 What is the nullity of the matrix A(N, d) for all N ≥ 2 and
all d ≥ 2?
Exercise 2.4.12 Show directly from the definition the vectors
V = \begin{pmatrix} 1 & a & a² \\ 1 & b & b² \\ 1 & c & c² \end{pmatrix}.
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,
v1 · u, v2 · u, . . . , vN · u.
v · Qv = 0.
v · x + b = 0.
A hyperplane in Rd is (d − 1)-dimensional, always one less than the ambient dimension. When b = 0, the
hyperplane orthogonal to v equals v ⊥ (2.4.9).
The hyperplane passes through a point µ if
v · µ + b = 0.
v · (x − µ) = 0.
v · (x − µ) = 0.
a(x − x0 ) + b(y − y0 ) = 0, or ax + by = c,
u · Qu = 0 and v · Qv = 0.
From this we see the dataset corresponding to this Q lies in two planes: the
plane orthogonal to u, and the plane orthogonal to v. But the intersection of
two planes is a line, so this dataset lies in a line, which means it is one-
dimensional.
Which line does this dataset lie in? Well, the line has to pass through the
mean, and is orthogonal to u and v. If we find a vector b satisfying b · u = 0
and b · v = 0, then the line will pass through the mean and will be parallel to
b. But we know how to find such a vector. Let A be the matrix with rows u, v.
Then b in the nullspace of A fulfills the requirements. We obtain b = (1, 1, 1).
Let Q be a variance matrix. Then the null space of Q equals the zero
variance directions of Q.
To see this, we use the quadratic equation from high school. If Q is sym-
metric, then u · Qv = v · Qu. For t scalar and u, v vectors, since Q ≥ 0, the
function
(v + tu) · Q(v + tu)
is nonnegative for all t scalar. Expanding this function into powers of t, we
see
t2 u · Qu + 2tu · Qv + v · Qv = at2 + 2bt + c
is nonnegative for all t scalar. Thus the parabola at2 + 2bt + c intersects the
horizontal axis in at most one root. This implies the discriminant b2 − ac is
not positive, b² − ac ≤ 0, which yields
(u · Qv)² ≤ (u · Qu)(v · Qv). (2.5.4)
Now we can derive the result. If v is a zero variance direction, then v ·Qv =
0. By (2.5.4), this implies u · Qv = 0 for all u, so Qv = 0, so v is in the null
space of Q. This derivation is valid for any nonnegative matrix Q, not just
variance matrices. Later (§3.2) we will see every nonnegative matrix is the
variance matrix of a dataset.
Based on the above result, here is code that returns zero variance direc-
tions.
N, d = 20, 2
# dxN array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
# null_space is from scipy.linalg
def zero_variance(dataset):
    Q = cov(dataset)
    return null_space(Q)
zero_variance(dataset)
(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15), (16, 17, 18, 19, 20).
Thus this dataset is orthogonal to three directions, hence lies in the intersec-
tion of three hyperplanes. Each hyperplane is one condition, so each hyper-
plane cuts the dimension down by one, so the dimension of this dataset is
5 − 3 = 2. Dimension of a dataset is discussed further in §2.9.
2.6 Pseudo-Inverse
What is the pseudo-inverse? In §2.3, we used both the inverse and the pseudo-
inverse to solve Ax = b, but we didn’t explain the framework behind them.
It turns out the framework is best understood geometrically.
Think of b and Ax as points, and measure the distance between them, and
think of x and the origin 0 as points, and measure the distance between them
(Figure 2.2).
(Figure 2.2: the point x is mapped by A to Ax; we measure the distance from Ax to b, and the distance from x to the origin 0.)
Even though the point x+ may not solve Ax = b, this procedure results
in a uniquely determined x+ : While there may be several points x∗ , there is
only one x+ . Figure 2.3 summarizes the situation for a 2 × 2 matrix A with
rank(A) = 1.
Fig. 2.3 The points x, Ax, the points x∗ , Ax∗ , and the point x+ .
The results in this section are as follows. Let A be any matrix. There is a
unique matrix A+ — the pseudo-inverse of A — with the following properties.
• the linear system Ax = b is solvable, when b = AA+ b.
• x+ = A+ b is a solution of
1. the linear system Ax = b, if Ax = b is solvable.
2. the regression equation At Ax = At b, always.
• In either case,
1. there is exactly one solution x∗ with minimum norm.
2. Among all solutions, x+ has minimum norm.
3. Every other solution is x∗ = x+ + v for some v in the null space of A.
At Ax = At b. (2.6.2)
Zero Residual
x + 6y + 11z = −9
2x + 7y + 12z = −3
3x + 8y + 13z = 3 (2.6.3)
4x + 9y + 14z = 9
5x + 10y + 15z = 10
Let b be any vector, not necessarily in the column space of A. To see how
close we can get to solving (2.3.1), we minimize the residual (2.6.1). We say
x∗ is a residual minimizer if
Regression Equation
To see this, let v be any vector, and t a scalar. Insert x = x∗ + tv into the
residual and expand in powers of t to obtain
108 CHAPTER 2. LINEAR GEOMETRY
At (Ax∗ − b) · v = (Ax∗ − b) · Av = 0.
At (Ax∗ − b) = 0,
Multiple Solutions
Since we know from above there is a residual minimizer in the row space
of A, we always have a minimum norm residual minimizer.
Let v be in the null space of A, and write
x∗ · v ≥ 0.
Since both ±v are in the null space of A, this implies ±x∗ · v ≥ 0, hence
x∗ · v = 0. Since the row space is the orthogonal complement of the null
space, the result follows.
Uniqueness
If x+ +
1 and x2 are minimum norm residual minimizers, then v = x1 − x2
+ +
+ +
is both in the row space and in the null space of A, so x1 − x2 = 0. Hence
x+ +
1 = x2 .
Putting the above all together, each vector b leads to a unique x+ . Defining
+
A by setting
x+ = A+ b,
we obtain A+ , the pseudo-inverse of A.
Notice if A is, for example, 5 × 4, then Ax = b implies x is a 4-vector and
b is a 5-vector. Then from x = A+ b, it follows A+ is 4 × 5. Thus the shape of
A+ equals the shape of At .
110 CHAPTER 2. LINEAR GEOMETRY
We know any two solutions of the linear system (2.3.1) differ by a vector in
the null space of A (2.4.11), and any two solutions of the regression equation
(2.6.2) differ by a vector in the null space of A (above).
If x is a solution of (2.3.1), then, by multiplying by At , x is a solution of
the regression equation (2.6.2). Since x+ = A+ b is a solution of the regression
equation, x+ = x + v for some v in the null space of A, so
Ax+ = A(x + v) = Ax + Av = b + 0 = b.
This shows x+ is a solution of the linear system. Since all other solutions
differ by a vector v in the null space of A, this establishes the result.
Now we can state when Ax = b is solvable,
Solvability of Ax = b
Properties of Pseudo-Inverse
A. AA+ A = A
B. A+ AA+ = A+
(2.6.8)
C. AA+ is symmetric
D. A+ A is symmetric
u = A+ Au + v. (2.6.9)
112 CHAPTER 2. LINEAR GEOMETRY
Au = AA+ Au.
A+ w = A+ AA+ w + v
for some v in the null space of A. But both A+ w and A+ AA+ w are in the
row space of A, hence so is v. Since v is in both the null space and the row
space, v is orthogonal to itself, so v = 0. This implies A+ AA+ w = A+ w.
Since w was any vector, we obtain B.
Since A+ b solves the regression equation, At AA+ b = At b for any vector b.
Hence At AA+ = At . With P = AA+ ,
(x − A+ Ax) · A+ Ay = 0.
x · P y = P x · P y = x · P tP y
Also we have
Exercises
Exercise 2.6.2 Let A(N, d) be as in Exercise 2.4.9, and let A = A(6, 4).
Let b = (1, 1, 1, 1, 1, 1). Write out Ax = b as a linear system. How many
equations, how many unknowns?
Exercise 2.6.4 Continuing with the same A and b, write out the correspond-
ing regression equation. How many equations, how many unknowns?
(At )+ = (A+ )t .
QQ+ = Q+ Q.
Exercise 2.6.11 Let A be any matrix. Then the null space of A equals the
null space of A+ A. Use (2.6.8).
Exercise 2.6.12 Let A be any matrix. Then the row space of A equals the
row space of A+ A.
Exercise 2.6.13 Let A be any matrix. Then the column space of A equals
the column space of AA+ .
114 CHAPTER 2. LINEAR GEOMETRY
2.7 Projections
Let u be a unit vector, and let b be any vector. Let span(u) be the line
through u (Figure 2.4). The projection of b onto span(u) is the vector v in
span(u) that is closest to b.
It turns out this closest vector v equals P b for some matrix P , the projec-
tion matrix. Since span(u) is a line, the projected vector P b is a multiple tu
of u.
From Figure 2.4, b − P b is orthogonal to u, so
0 = (b − P b) · u = b · u − P b · u = b · u − t u · u = b · u − t.
b − Pb
b
P b = tu
u
(b − P b) · u = 0 and (b − P b) · v = 0.
r = b · u, s = b · v.
b
b − Pb
u
Pb
Characterization of Projections
To prove this, suppose P is the projection matrix onto some span S. For
any v, by 1., P v is in S. By 2., P (P v) = P v. Hence P 2 = P . Also, for any u
and v, P v is in S, and u − P u is orthogonal to S. Hence
(u − P u) · P v = 0
which implies
u · P v = (P u) · (P v).
Switching u and v,
v · P u = (P v) · (P u),
Hence
u · (P v) = (P u) · v,
t
which implies P = P .
For the other direction, suppose P is a projection matrix, and let S be the
column space of P . Then a vector x is in S iff x is of the form x = P v. This
establishes 1. above. Since
P x = P (P v) + P 2 v = P v = x,
Let A be any matrix. Then the projection matrix onto the column
space of A is
P = AA+ . (2.7.2)
P b = t1 v1 + t2 v2 + · · · + td vd .
def project(A,b):
Aplus = pinv(A)
x = dot(Aplus,b) # reduced
return dot(A,x) # projected
Let A be a matrix and b a vector, and project onto the column space
of A. Then the projected vector is P b = AA+ b and the reduced vector
is x = A+ b.
For A as in (2.3.4) and b = (−9, −3, 3, 9, 10) the reduced vector onto the
column space of A is
118 CHAPTER 2. LINEAR GEOMETRY
1
x = A+ b = (82, 25, −32),
15
and the projected vector onto the column space of A is
P = A+ A. (2.7.3)
def project_to_ortho(U,b):
x = dot(U.T,b) # reduced
return dot(U,x) # projected
dataset vk in Rd , k = 1, 2, . . . , N
reduced U tv k in Rn , k = 1, 2, . . . , N
projected U U tv k in Rd , k = 1, 2, . . . , N
# projection of dataset
# onto column space of A
If S is a span in Rd , then
Rd = S ⊕ S ⊥ . (2.7.5)
v = P v + (v − P v),
An important example of (2.7.5) is the relation between the row space and
the null space of a matrix. In §2.4, we saw that, for any matrix A, the row
space and the null space are orthogonal complements.
Taking S = nullspace in (2.7.5), we have the important
If A is an N × d matrix,
and the null space and row space are orthogonal to each other.
2.7. PROJECTIONS 121
From this,
P = I − A+ A. (2.7.7)
For any matrix, the row rank plus the nullity equals the dimension of
the source space. If the matrix is N × d, r is the rank, and n is the
nullity, then
r + n = d.
But this was already done in §2.3, since P b = AA+ b = Ax+ where x+ = A+ b
is a residual minimizer.
Exercises
Exercise 2.7.4 Let P be the projection matrix onto the column space of a
matrix A. Use Exercise 2.7.3 to show trace(P ) equals the rank of A.
Exercise 2.7.6 Let A be the dataset matrix of the centered MNIST dataset,
so the shape of A is 60000 × 784. Using Exercise 2.7.4, show the rank of A
is 712.
Exercise 2.7.9 Let S be a span, and let P be the projection matrix onto S.
Use P to show ⊥
S ⊥ = S. (2.7.9)
(S ⊂ (S ⊥ )⊥ is easy. For S ⊃ (S ⊥ )⊥ , show |v − P v|2 = 0 when v in (S ⊥ )⊥ .)
Exercise 2.7.10 Let S be a span and suppose P and Q are both projection
matrices onto S. Show
(P − Q)2 = 0.
Conclude P = Q. Use Exercise 2.2.4.
2.8 Basis
To clarify this definition, suppose someone asks “Who is the shortest per-
son in the room?” There may be several shortest people in the room, but, no
matter how many shortest people there are, there is only one shortest height.
In other words, a span may have several bases, but a span’s dimension is
uniquely determined.
When a basis v1 , v2 , . . . , vN consists of orthogonal vectors, we say v1 , v2 ,
. . . , vN is an orthogonal basis. When v1 , v2 , . . . , vN are also unit vectors, we
say v1 , v2 , . . . , vN is an orthonormal basis.
spanning
orthogonal orthonormal
vectors basis
basis basis
linearly
orthogonal orthonormal
independent
Span of N Vectors
The dimension of Rd is d.
mu = mean(dataset,axis=0)
vectors = dataset - mu
matrix_rank(vectors)
In particular, since 712 < 784, approximately 10% of pixels are never
touched by any image. For example, a likely pixel to remain untouched is
at the top left corner (0, 0). For this dataset, there are 784 − 712 = 72 zero
variance directions.
We pose the following question: What is the least n for which the first n
images are linearly dependent? Since the dimension of the feature space is
784, we must have n ≤ 784. To answer the question, we compute the rank
of the first n vectors for n = 1, 2, 3, . . . , and continue until we have linear
dependence of v1 , v2 , . . . , vn .
If we load MNIST as dataset, as in §1.2, and run the code below, we
obtain n = 560 (Figure 2.8). matrix_rank is discussed in §2.9.
def find_first_defect(dataset):
d = len(dataset[0])
previous = 0
for n in range(len(dataset)):
r = matrix_rank(dataset[:n+1,:])
print((r,n+1),end=",")
if r == previous: break
if r == d: break
126 CHAPTER 2. LINEAR GEOMETRY
previous = r
This we call the dimension staircase. For example, Figure 2.9 is the di-
mension staircase for
2.8. BASIS 127
v1 = (1, 0, 0), v2 = (0, 1, 0), v3 = (1, 1, 0), v4 = (3, 4, 0), v5 = (0, 0, 1).
Ideally the code should be run in sympy using exact arithmetic. However,
this takes too long, so we use numpy.linalg.matrix_rank. Because datasets
consist of floats in numpy, the matrix_rank and dimensions are approximate
not exact. For more on this, see approximate rank in §3.2.
def dimension_staircase(dataset):
N = len(dataset)
rmax = matrix_rank(dataset)
dimensions = [ ]
for n in range(N):
r = matrix_rank(dataset[:n+1,:])
dimensions.append(r)
128 CHAPTER 2. LINEAR GEOMETRY
if r == rmax: break
title("number of vectors = " + str(n+1) + ", rank = " + str(rmax))
stairs(dimensions, range(1,n+3),linewidth=2,color='red')
grid()
show()
span(v1 , v2 , . . . , vN ) = span(v2 , v3 , . . . , vN ).
span(v1 , v2 , . . . , vN ) = span(b1 , b2 , . . . , bd ),
v1 is a linear combination of b1 , b2 , . . . , bd ,
v1 = t1 b1 + t2 b2 + · · · + td bd .
span(v1 , v2 , . . . , vN ) = span(v1 , b2 , b3 , . . . , bd ).
v2 = s1 v1 + t2 b2 + t3 b3 + · · · + td bd .
2.9. RANK 129
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , b3 , b4 , . . . , bd ).
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , v3 , b4 , b5 , . . . , bd ).
span(v1 , v2 , . . . , vN ) = · · · = span(v1 , v2 , . . . , vd ).
2.9 Rank
R3 R5
x b
A
At b
Ax
At
source space target space
By (2.4.2), the column space is in the target space, and the row space is
in the source space. Thus we always have
For A as in (2.3.4), the column rank is 2, the row rank is 2, and the nullity
is 1. Thus the column space is a 2-d plane in R5 , the row space is a 2-d plane
in R3 , and the null space is a 1-d line in R3 .
Rank Theorem
Let A be any matrix. Then
A.rank()
matrix_rank(A)
returns the rank of a matrix. The main result implies rank(A) = rank(At ),
so
For any N × d matrix, the rank is never greater than min(N, d).
C = CI = CAB = IB = B,
so B = C is the inverse of A.
The first two assertions are in §2.2. For the last assertion, assume U is a
square matrix. From §2.4, orthonormality of the rows implies linear indepen-
dence of the rows, so U is full-rank. If U also is a square matrix, then U is
invertible. Multiply by U −1 ,
U −1 = U −1 I = U −1 U U t = U t .
U U t = I = U tU (2.9.2)
is an orthogonal matrix.
Equivalently, we can say
Orthogonal Matrix
A matrix U is orthogonal iff its rows are an orthonormal basis iff its
columns are an orthonormal basis.
Since
U u · U v = u · U t U v = u · v,
U preserves dot products. Since lengths are dot products, U also preserves
lengths. Since angles are computed from dot products, U also preserves an-
gles. Summarizing,
2.9. RANK 133
As a consequence,
I = u1 ⊗ u1 + u2 ⊗ u2 + · · · + ud ⊗ ud . (2.9.3)
and
|u|2 = (u · u1 )2 + (u · u2 )2 + · · · + (u · ud )2 . (2.9.5)
Full-Rank Dataset
A dataset x1 , x2 , . . . , xN is full-rank iff x1 , x2 , . . . , xN spans the
feature space.
To derive the rank theorem, first we recall (2.7.6). Assume A has N rows
and d columns. By (2.7.6), every vector x in the source space Rd can be
written as a sum x = u + v with u in the null space, and v in the row space.
In other words, each vector x may be written as a sum x = u + v with Au = 0
and v in the row space.
From this, we have
Ax = A(u + v) = Au + Av = Av.
This shows the column space consists of vectors of the form Av with v in the
row space.
Let v1 , v2 , . . . , vr be a basis for the row space. From the previous para-
graph, it follows Av1 , Av2 , . . . , Avr spans the column space of A. We claim
Av1 , Av2 , . . . , Avr are linearly independent. To check this, we write
If v is the vector t1 v1 +t2 v2 +· · ·+tr vr , this shows v is in the null space. But v
is a linear combination of basis vectors of the row space, so v is also in the row
space. Since the row space is the orthogonal complement of the null space, we
must have v orthogonal to itself. Thus v = 0, or t1 v1 + t2 v2 + · · · + tr vr = 0.
But v1 , v2 , . . . , vr is a basis. By linear independence of v1 , v2 , . . . , vr , we
conclude t1 = 0, . . . , tr = 0. This establishes the claim, hence Av1 , Av2 , . . . ,
Avr is a basis for the column space. This shows r is the dimension of the
column space, which is by definition the column rank. Since by construction,
r is also the row rank, this establishes the rank theorem.
Exercises
A = u1 ⊗ v1 + u2 ⊗ v2 + · · · + ur ⊗ vr
1 a a2
V = 1 b b 2 .
1 c c2
137
138 CHAPTER 3. PRINCIPAL COMPONENTS
How does this compare with the distance between Av1 and Av2 , or |Av1 −
Av2 |?
If we let
v1 − v2
u= ,
|v1 − v2 |
then u is a unit vector, |u| = 1, and by linearity
|Av1 − Av2 |
|Au| = .
|v1 − v2 |
Here the maximum and minimum are taken over all unit vectors u.
Then σ1 is the distance of the furthest image from the origin, and σ2 is
the distance of the nearest image to the origin. It turns out σ1 and σ2 are
the top and bottom singular values of A.
To keep things simple, assume both the source space and the target space
are R2 ; then A is 2 × 2.
The unit circle (in red in Figure 3.1) is the set of vectors u satisfying
|u| = 1. The image of the unit circle (also in red in Figure 3.1) is the set of
vectors of the form
{Au : |u| = 1}.
The annulus is the set (the region between the dashed circles in Figure 3.1)
of vectors b satisfying
{b : σ2 < |b| < σ1 }.
It turns out the image is an ellipse, and this ellipse lies in the annulus.
Thus the numbers σ1 and σ2 constrain how far the image of the unit circle
is from the origin, and how near the image is to the origin.
To relate σ1 and σ2 to what we’ve seen before, let Q = At A. Then,
Now let Q = AAt , and let b be in the image. Then b = Au for some unit
vector u, and
This shows the image of the unit circle is the inverse variance ellipse (§1.5)
corresponding to the variance Q, with major axis length 2σ1 and minor axis
length 2σ2 .
These reflect vectors across the horizontal axis, and across the vertical axis.
140 CHAPTER 3. PRINCIPAL COMPONENTS
The SVD decomposition (§3.4) states that every matrix A can be written
as a product
ab
A= = U SV.
cd
Here S is a diagonal matrix as above, and U , V are orthogonal and rotation
matrices as above.
In more detail, apart from a possible reflection, there are scalings σ1 and
σ2 and angles α and β, so that A transforms vectors by first rotating by α,
then scaling by (σ1 , σ2 ), then by rotating by β (Figure 3.2).
V S U
In §1.5 and §2.5, we saw every variance matrix is nonnegative. In this section,
we see that every nonnegative matrix Q is the variance matrix of a specific
dataset. This dataset is called the principal components of Q.
Let A be a matrix. An eigenvector for A is a nonzero vector v such that
Av is aligned with v. This means
Av = λv (3.2.1)
singular:
σ, u, v
row column
any
rank rank
matrix square
eigen:
invertible symmetric
λ, v
non-
variance negative λ≥0
λ ̸= 0 positive λ>0
A = array([[2,1],[1,2]])
lamda, U = eig(A)
lamda
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda
v · Qv = v · λv = λv · v = λ.
µu · v = u · (µv) = u · Qv = v · Qu = v · (λu) = λu · v.
This implies
(µ − λ)u · v = 0.
If λ ̸= µ, we must have u · v = 0. We conclude:
λ1 ≥ λ2 ≥ · · · ≥ λd .
and scalars λ, µ satisfying Qu = λu, and Qv = µv. These are the eigenvalues
and eigenvectors. Define three matrices
λ0 u
U = (u, v), E= , V = .
0µ v
We conclude QU = U E. Multiplying by V ,
Diagonalization (EVD)
Q = U EV (3.2.3)
and V = U.T.
146 CHAPTER 3. PRINCIPAL COMPONENTS
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
v = U[:,0]
allclose(dot(Q,v), lamda[0]*v)
returns True.
The conclusion is: With the correct choice of orthonormal basis, the matrix
Q becomes a diagonal matrix E.
The orthonormal basis eigenvectors v1 , v2 , . . . , vd are the principal compo-
nents of the matrix Q. The eigenvalues and eigenvectors of Q, taken together,
are the eigendata of Q.
To obtain the diagonal matrix E,
E = diag(lamda)
init_printing()
# eigenvalues
Q.eigenvals()
3.2. EIGENVALUE DECOMPOSITION 147
# eigenvectors
Q.eigenvects()
U, E = Q.diagonalize()
This returns the diagonal E with the eigenvalues in increasing order. The
command init_printing pretty-prints the output.
If A is the matrix (2.3.4), and Q = At A is as in the regression equation
(2.6.4), then the eigenvalues are
√ √
λ1 = 620 + 10 3769, λ2 = 620 − 10 3769, λ3 = 0, (3.2.4)
The third row is a multiple of (1, −2, 1), which, as we know, is a basis for the
nullspace of A (§2.4).
rank(Q) = rank(E) = r.
For example, in (3.2.4), there are two positive eigenvalues, and the rank
of Q, which equals the rank of A, is two.
# dataset is Nxd
N, d = dataset.shape
Q = dot(dataset.T,dataset)/N
lamda = eigh(Q)[0]
approx_rank = d - approx_nullity
approx_rank, approx_nullity
This code returns 712 for the MNIST dataset, agreeing with the code in
§2.8.
Q = Matrix([[2,1],[1,2]])
U, E = Q.diagonalize()
display(U,E)
returns
1 1 10
U= , E= .
−1 1 03
Also,
Q = Matrix([[a,b ],[b,c]])
U, E = Q.diagonalize()
display(Q,U,E)
returns √ √
ab 1 a−c− D a−c+ D
Q= , U=
bc 2b 2b 2b
and
3.2. EIGENVALUE DECOMPOSITION 149
√
1 a+c− D 0 √
E= , D = (a − c)2 + 4b2 .
2 0 a+c+ D
0 0 . . . 0 1/λd
Q = U EV =⇒ Q+ = U E + V. (3.2.6)
Qx = b
has a solution x for every vector b iff all eigenvalues are nonzero, in
which case
1 1 1
x= (b · v1 )v1 + (b · v2 )v2 + · · · + (b · vd )vd . (3.2.7)
λ1 λ2 λd
150 CHAPTER 3. PRINCIPAL COMPONENTS
trace(Q) = λ1 + λ2 + · · · + λd . (3.2.8)
Q2 is symmetric with eigenvalues λ21 , λ22 , . . . , λ2d . Applying the last result to
Q2 , we have
√
√ λ1 v 1
λ2 v2
√
√ − λ2 v2
− λ1 v1
Q = λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd . (3.2.9)
where the maximum is over all unit vectors v. We say a unit vector b is best-fit
for Q or best-aligned with Q if the maximum is achieved at v = b: λ1 = b · Qb.
When Q is a variance matrix, this means the unit vector b is chosen so that
the variance b · Qb of the dataset projected onto b is maximized.
152 CHAPTER 3. PRINCIPAL COMPONENTS
A Calculation
Suppose λ, a, b, c, d are real numbers and let
λ + at + bt2
f (t) = .
1 + ct + dt2
If f (t) is maximized at t = 0, then a = λc.
λ1 ≥ v · Qv = v · (λv) = λv · v = λ.
λ1 = v1 · Qv1 ≥ v · Qv (3.2.11)
for all unit vectors v. Let u be any vector. Then for any real t,
3.2. EIGENVALUE DECOMPOSITION 153
v1 + tu
v=
|v1 + tu|
u · Qv1 = λ1 u · v1
u · (Qv1 − λ1 v1 ) = 0
Just as the maximum variance (3.2.10) is the top eigenvalue λ1 , the mini-
mum variance
λd = min v · Qv, (3.2.12)
|v|=1
v · Qv over all unit v in T , i.e. over all unit v orthogonal to v1 . This leads to
another eigenvalue λ2 with corresponding eigenvector v2 orthogonal to v1 .
Since λ1 is the maximum of v · Qv over all vectors in Rd , and λ2 is the
maximum of v · Qv over the restricted space T of vectors orthogonal to v1 ,
we must have λ1 ≥ λ2 .
Having found the top two eigenvalues λ1 ≥ λ2 and their orthonormal
eigenvectors v1 , v2 , we let S = span(v1 , v2 ) and T = S ⊥ be the orthogonal
complement of S. Then dim(T ) = d − 2, and we can repeat the process to
obtain λ3 and v3 in T . Continuing in this manner, we obtain eigenvalues
λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λd .
v1 , v2 , v3 , . . . , vd .
T = S⊥
v1
v3
v2
Sλ = {v : Qv = λv}
the eigenspace corresponding to λ. For example, suppose the top three eigen-
values are equal: λ1 = λ2 = λ3 , with b1 , b2 , b3 the corresponding eigenvectors.
Calling this common value λ, the eigenspace is Sλ = span(b1 , b2 , b3 ). Since
b1 , b2 , b3 are orthonormal, dim(Vλ ) = 3. In Python, the eigenspaces Vλ are
obtained by the matrix U above: The columns of U are an orthonormal basis
for the entire space, so selecting the columns corresponding to a specific λ
yields an orthonormal basis for Sλ .
Let (E,U) be the list of eigenvalues and matrix U whose columns are
the eigenvectors. Then the eigenvectors are the rows of U t . Here is code for
selecting just the eigenvectors corresponding to eigenvalue s.
lamda, U = eigh(Q)
V = U.T
V[isclose(lamda,s)]
156 CHAPTER 3. PRINCIPAL COMPONENTS
The function isclose(a,b) returns True when a and b are numerically close.
Using this boolean, we extract only those rows of V whose corresponding
eigenvalue is close to s.
The subspace Sλ is defined for any λ. However, dim(Sλ ) = 0 unless λ is
an eigenvalue, in which case dim(Sλ ) = m, where m is the multiplicity of λ.
The proof of the eigenvalue decomposition provides a systematic procedure
for finding eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd . Now we show there are no other
eigenvalues.
All this can be readily computed in Python. For the Iris dataset, we have
the variance matrix in (2.2.16). The eigenvalues are
4.54 = trace(Q) = λ1 + λ2 + λ3 + λ4 .
For the Iris dataset, the top eigenvalue is λ1 = 4.2, it has multiplicity 1, and
its corresponding list of eigenvectors contains only one eigenvector,
The top two eigenvalues account for 97.8% of the total variance.
The third eigenvalue is λ3 = 0.08 with eigenvector
The top three eigenvalues account for 99.5% of the total variance.
The fourth eigenvalue is λ4 = 0.02 with eigenvector
The top four eigenvalues account for 100% of the total variance. Here each
eigenvalue has multiplicity 1, since there are four distinct eigenvalues.
def row(i,d):
v = [0]*d
v[i] = 2
if i > 0: v[i-1] = -1
if i < d-1: v[i+1] = -1
if i == 0: v[d-1] += -1
if i == d-1: v[0] += -1
158 CHAPTER 3. PRINCIPAL COMPONENTS
return v
# using sympy
from sympy import Matrix
# using numpy
from numpy import *
m1 m2
x1 x2
To explain where these matrices come from, look at the mass-spring sys-
tems in Figures 3.6 and 3.7. Here we have springs attached to masses and
walls on either side. At rest, the springs are the same length. When per-
turbed, some springs are compressed and some stretched. In Figure 3.6, let
x1 and x2 denote the displacement of each mass from its rest position.
When extended by x, each spring fights back by exerting a force kx pro-
portional to the displacement x. Here k is the spring constant. For example,
look at the mass m1 . The spring to its left is extended by x1 , so exerts a force
of −kx1 . Here the minus indicates pulling to the left. On the other hand, the
spring to its right is extended by x2 − x1 , so it exerts a force +k(x2 − x1 ).
Here the plus indicates pulling to the right. Adding the forces from either
side, the total force on m1 is −k(2x1 − x2 ). For m2 , the spring to its left
exerts a force −k(x2 − x1 ), and the spring to its right exerts a force −kx2 ,
so the total force on m2 is −k(2x2 − x1 ). We obtain the force vector
3.2. EIGENVALUE DECOMPOSITION 159
2x1 − x2 2 −1 x1
−k = −k .
−x1 + 2x2 −1 2 x2
However, as you can see, the matrix here is not exactly Q(2).
m1 m2 m3 m4 m5
x1 x2 x3 x4 x5
But, again, the matrix here is not Q(5). Notice, if we place one mass and two
springs in Figure 3.6, we obtain the 1 × 1 matrix 2.
To obtain Q(2) and Q(5), we place the springs along a circle, as in Figures
3.8 and 3.9. Now we have as many springs as masses. Repeating the same
logic, this time we obtain Q(2) and Q(5). Notice if we place one mass and
one spring in Figure 3.8, d = 1, we obtain the 1 × 1 matrix Q(1) = 0: There
is no force if we move a single mass around the circle, because the spring is
not being stretched.
Thus the matrices Q(d) arise from mass-spring systems arranged on a
circle. From Newton’s law (force equals mass p times acceleration), one shows
the frequencies of the vibrating springs equal λk/m, where k is the spring
constant, m is the mass of each of the masses, and λ is an eigenvalue of Q(d).
This is the physical meaning of the eigenvalues of Q(d).
160 CHAPTER 3. PRINCIPAL COMPONENTS
m1 m2 m2
m1
m1 m1
m2
m2
m5 m5
m3 m4
m4 m3
p(t) = 2 − t − td−1 ,
and let
1
ω
ω2
v1 = .
ω3
..
.
d−1
ω
Then Qv1 is
1
2 − ω − ω d−1
ω
−1 + 2ω − ω 2
ω2
−ω + 2ω 2 − ω 3
Qv1 = = p(ω) = p(ω)v1 .
ω3
..
. ..
d−2 d−1
.
−ω + 2ω −1
ω d−1
Then
v0 = 1 = (1, 1, . . . , 1),
and, by the same calculation, we have
By (A.4.9),
Eigenvalues of Q(d)
2πk
λk = p(ω k ) = 2 − 2 cos , (3.2.15)
d
Q(2) = (4, 0)
Q(3) = (3, 3, 0)
Q(4) = (4, 2, 2, 0)
√ √ √ √ !
5 5 5 5 5 5 5 5
Q(5) = + , + , − , − ,0
2 2 2 2 2 2 2 2
Q(6) = (4, 3, 3, 1, 1, 0)
√ √ √ √
Q(8) = (4, 2 + 2, 2 + 2, 2, 2, 2 − 2, 2 − 2, 0)
√ √ √ √
5 5 5 5 3 5 3 5
Q(10) = 4, + , + , + , + ,
2 2 2 2 2 2 2 2
√ √ √ √ !
5 5 5 5 3 5 3 5
− , − , − , − ,0
2 2 2 2 2 2 2 2
√ √ √ √
Q(12) = 4, 2 + 3, 2 + 3, 3, 3, 2, 2, 1, 1, 2 − 3, 2 − 3, 0 .
The matrices Q(d) are circulant matrices. Each row in Q(d) is obtained
from the row above it in Q(d) by shifting the entries to the right. The trick of
using the roots of unity to compute the eigenvalues and eigenvectors works
for any circulant matrix.
3.2. EIGENVALUE DECOMPOSITION 163
Our last topic is the distribution of the eigenvalues for large d. How are
the eigenvalues scattered? Figure 3.10 plots the eigenvalues for Q(50) using
the code below.
d = 50
E = eigh(Q(d))[0]
stairs(E,range(d+1),label="numpy")
k = arange(d)
lamda = 2 - 2*cos(2*pi*k/d)
sorted = sort(lamda)
scatter(k,lamda,s=5,label="unordered")
scatter(k,sorted,c="red",s=5,label="increasing order")
grid()
legend()
show()
Figure 3.10 shows the eigenvalues tend to cluster near the top λ1 ≈ 4 and
the bottom λd = 0, they are sparser near the middle. Using the double-angle
formula,
πk
λk = 4 sin2 , k = 0, 1, 2, . . . , d − 1.
d
Solving for k/d in terms of λ, and multiplying by two to account for the
double multiplicity, we obtain the proportion of eigenvalues below threshold
λ,
164 CHAPTER 3. PRINCIPAL COMPONENTS
1√
#{k : λk ≤ λ} 2
≈ arcsin λ , 0 ≤ λ ≤ 4. (3.2.16)
d π 2
Here ≈ means asymptotic equality, see §A.6.
Equivalently, the derivative (4.1.23) of the arcsine law (3.2.16) exhibits the
eigenvalue clustering near the ends (Figure 3.11).
lamda = arange(0.1,3.9,.01)
density = 1/(pi*sqrt(lamda*(4-lamda)))
plot(lamda,density)
# r"..." means raw string
tex = r"$\displaystyle\frac1{\pi\sqrt{\lambda(4-\lambda)}}$"
text(.5,.45,tex,usetex=True,fontsize="x-large")
grid()
show()
Exercises
λ2 − λ trace(A) + det(A) = 0.
λ± = a ± ib
Exercise 3.2.11 With R(d) as in Exercise 2.2.9, find the eigenvalues and
eigenvectors of R(d).
d 4 · trace(Q(d)+ )
4 4+1
16 (4+1)(16+1)
256 (4+1)(16+1)(256+1)
3.3 Graphs
−3 7.4
2 0
Let wij be the weight on the edge (i, j) in a weighed directed graph. The
weight matrix of a weighed directed graph is the matrix W = (wij ).
If the graph is unweighed, then we set A = (aij ), where
(
1, if i and j adjacent,
aij = .
0, if not.
In this case, A consists of ones and zeros, and is called the adjacency matrix.
If the graph is undirected, then the adjacency matrix is symmetric,
aij = aji .
Sometimes graphs may have multiple edges between nodes, or loops, which
are edges starting and ending at the same node. A graph is simple if it has
no loops and no multiple edges. In this section, we deal only with simple
undirected unweighed graphs.
To summarize, a simple undirected graph G = (V, E) is a collection V
of nodes, and a collection of edges E, each edge corresponding to a pair of
nodes.
The number of nodes is the order n of the graph, and the number of edges
is the size m of the graph. In a (simple undirected) graph of order n, the
number of pairs of nodes is n-choose-2, so the number of edges satisfies
n 1
0≤m≤ = n(n − 1).
2 2
How many graphs of order n are there? Since graphs are built out of
edges, the answer depends on how many subsets of edges you can grab from
a maximum of n(n − 1)/2 edges. The number of subsets of a set with m
elements is 2m , so the number Gn of graphs with n nodes is
n
Gn = 2( 2 ) = 2n(n−1)/2 .
When m = 0, there are no edges, and we say the graph is empty. When
m = n(n − 1)/2, there are the maximum number of edges, and we say the
graph is complete. The complete graph with n nodes is written Kn (Figure
3.16).
Fig. 3.16 The complete graph K6 , the cycle graph C6 , and the wheel graph W6 .
3.3. GRAPHS 169
The cycle graph Cn with n nodes is as in Figure 3.16. The graph Cn has n
edges. The wheel graph is the cycle graph with one vertex added at the center
and connected to the spokes. The cycle graph C3 is a triangle.
d1 ≥ d2 ≥ d3 ≥ · · · ≥ dn
(d1 , d2 , d3 , . . . , dn )
Handshaking Lemma
n
X
d1 + d2 + · · · + dn = dk = 2m.
k=1
In any graph, there are at least two nodes with the same degree.
To see this, we consider two cases. First case, assume there are no isolated
nodes. Then the degree sequence is
n − 1 ≥ d1 ≥ d2 ≥ · · · ≥ dn ≥ 1.
n − 2 ≥ d1 ≥ d2 ≥ . . . dn−1 ≥ 1.
A graph is regular if all the node degrees are equal. If the node degrees are
all equal to k, we say the graph is k-regular. From the handshaking lemma,
for a k-regular graph, we have kn = 2m, so
1
m= kn.
2
For example, because 2m is even, there are no 3-regular graphs with 11 nodes.
Both Kn and Cn are regular, with Kn being (n − 1)-regular, and Cn being
2-regular.
A walk on a graph is a sequence of nodes v1 , v2 , v3 , . . . where each
consecutive pair vi , vi+1 of nodes are adjacent. For example, if v1 , v2 , v3 ,
v4 , v5 , v6 are the nodes (in any order) of the complete graph K6 , then
v1 → v2 → v3 → v4 → v2 is a walk. A path is a walk with no backtracking:
A path visits each node at most once.
3.3. GRAPHS 171
Two nodes a and b are connected if there is a walk starting at a and ending
at b. If a and b are connected, then there is a path starting at a and ending
at b, since we can cut out the cycles of the walk. A graph is connected if every
two nodes are connected. A graph is disconnected if it is not connected. For
example, Figure 3.16 may be viewed as two connected graphs K6 and C6 , or
a single disconnected graph K6 ∪ C6 .
A closed walk is a walk that ends where it starts. A cycle is a closed path.
If a graph has no cycles, it is a forest. A connected forest is a tree. In a tree,
any two nodes are connected by exactly one path.
A=1⊗1−I
Notice there are ones on the sub-diagonal, and ones on the super-diagonal,
and ones in the upper-right and lower-left corners.
172 CHAPTER 3. PRINCIPAL COMPONENTS
Euler’s Theorem
A connected graph has an Eulerian cycle if and only if every vertex
has an even degree.
Since the degree sequence of the graph in Figure 3.19 is (4, 4, 4, 4, 4, 2), the
graph is Eulerian.
For any adjacency matrix A, the sum of each row is equal to the degree of
the node corresponding to that row. This is the same as saying
d1
d2
A1 = . . . .
dn
A1 = k1,
v1 = 1 ≥ |vj |, j = 2, 3, . . . , n.
Since the sum a11 + a12 + · · · + a1n equals the degree d1 of node 1, this implies
Top Eigenvalue
A1 = (1 · 1)1 − 1 = n1 − 1 = (n − 1)1,
Ā = A(Ḡ) = 1 ⊗ 1 − I − A(G).
Now aik akj is either 0 or 1, and equals 1 exactly if there is a 2-step path from
i to j. Hence
3.3. GRAPHS 175
Notice a 2-step walk between i and j is the same as a 2-step path between i
and j.
When i = j, (A2 )ii is the number of 2-step paths connecting i and i, which
means number of edges. Since this counts edges twice, we have
1
trace(A2 ) = m = number of edges.
2
Similarly, (A3 )ij is the number of 3-step walks connecting i and j. Since
a 3-step walk from i to i is the same as a triangle, (A3 )ii is the number
of triangles in the graph passing through i. Since the trace is the sum of
the diagonal elements, trace(A3 ) counts the number of triangles. But this
overcounts by a factor of 3! = 6, since three labels may be rearranged in six
ways. Hence
1
trace(A3 ) = number of triangles.
6
Loops, Edges, Triangles
This is correct because for a complete graph, n(n − 1)/2 is the number of
edges.
Continuing,
Connected Graph
1000 0100
In general, the permutation matrix P has Pij = 1 if i → j, and Pij = 0
if not. If P is any permutation matrix, then Pik Pjk equals 1 if both i → k
and j → k. In other words, Pik Pjk = 1 if i = j and i → k, and Pik Pjk = 0
otherwise. Since i → k for exactly one k,
3.3. GRAPHS 177
n n
(
t
X
t
X 1, i = j,
(P P )ij = Pik Pkj = Pik Pjk =
k=1 k=1
0, i ̸= j.
Hence P is orthogonal,
P P t = I, P −1 = P t .
Using permutation matrices, we can say two graphs are isomorphic if their
adjacency matrices A, A′ satisfy
A′ = P AP −1 = P AP t
A graph is bipartite if the nodes can be divided into two groups, with
adjacency only between nodes across groups. If we call the two groups even
and odd, then odd nodes are never adjacent to odd nodes, and even nodes
are never adjacent to even nodes.
The complete bipartite graph is the bipartite graph with maximum num-
ber of edges: Every odd node is adjacent to every even node. The complete
bipartite graph with n odd nodes with m even nodes is written Kn,m . Then
the order of Kn,m is n + m.
Let a = (1, 1, . . . , 1, 0, 0, . . . , 0) be the vector with n ones and m zeros, and
let b = 1 − a. Then b has n zeros and m ones, and the adjacency matrix of
Kn,m is
A = A(Kn,m ) = a ⊗ b + b ⊗ a.
For example, the adjacency matrix of K5,3 is
178 CHAPTER 3. PRINCIPAL COMPONENTS
1 0 0 1 0 0 0 1 1 1 1 1
1 0 0 1 0 0 0 1 1 1 1 1
1 0 0 1 0 0 0 1 1 1 1 1
0 1 1 0 1 1 1 0 0 0 0 0
⊗ + ⊗ = .
0 1 1 0 1 1 1 0 0 0 0 0
0 1 1 0 1 1 1 0 0 0 0 0
0 1 1 0 1 1 1 0 0 0 0 0
0 1 1 0 1 1 1 0 0 0 0 0
Recall we have
(a ⊗ b)v = (b · v)a.
From this, we see the column space of A = a⊗b+b⊗a is span(a, b). Thus the
rank of A is 2, and the nullspace of A consists of the orthogonal complement
span(a, b)⊥ of span(a, b). Using this, we compute the eigenvalues of A.
Aa = nb, Ab = ma.
Hence
λv = Av = A(ra + sb) = rnb + sma.
Applying A again,
For
√ example, √
for the graph in Figure 3.21, the nonzero eigenvalues are λ =
± 3 × 5 = ± 15.
L = B t B.
Both the laplacian matrix and the adjacency matrix are n × n. What is the
connection between them?
Laplacian
Note the Laplacian does not depend on how the edges were directed, it
only depends on A.
For example, for the cycle graph C6 , the degree matrix is 2I, and the
laplacian is the matrix we saw in §3.2,
2 −1 0 0 0 −1
−1 2 −1 0 0 0
0 −1 2 −1 0 0
L = Q(6) = .
0 0 −1 2 −1 0
0 0 0 −1 2 −1
−1 0 0 0 −1 2
180 CHAPTER 3. PRINCIPAL COMPONENTS
A = 2I − Q(n),
2 cos(2πk/n), k = 0, 1, 2, . . . , n − 1,
Exercises
Exercise 3.3.1 [27] Consider the graph in Figure 3.22 below. What is the
order of the graph? What is the degree of vertex H? What is the degree of
vertex D? How many components does the graph have?
Exercise 3.3.2 [27] Which of the following degree sequences are possible for
a simple graph?
1. (4,4,4,3,3,3,3,2)
2. (5,3,2,2,2,1)
3. (8,7,6,5,5,3,1,1)
4. (6,6,6,5,4,4,4,3,3,3)
Exercise 3.3.3 [27] Construct a tree with five vertices such that the degree of
one vertex is 3. How many such (non-isomorphic) graphs can you construct?
Exercise 3.3.6 [27] Find the degree sequences of the cycle graph C6 , the
complete graph K8 , the complete bipartite graph K3,7 , and the wheel graph
W4 .
When this happens, v is a right singular vector and u is a left singular vector
associated to σ.
When (3.4.1) holds, so does
182 CHAPTER 3. PRINCIPAL COMPONENTS
The singular values of A and the singular values of At are the same.
Then Av = λv implies λ = 1 and v = (1, 0). Thus A has only one eigenvalue
equal to 1, and only one eigenvector. Set
11
Q = At A = .
12
0 = det(Q − λI) = λ2 − 3λ + 1.
Qv = At Av = At (σu) = σ 2 v. (3.4.2)
Thus v1 , u1 are right and left singular vectors corresponding to the singular
value σ1 of A. Similarly, if we set u2 = Av2 /σ2 , then v2 , u2 are right and left
singular vectors corresponding to the singular value σ2 of A.
We show v1 , v2 are orthonormal, and u1 , u2 are orthonormal. We already
know v1 , v2 are orthonormal, because they are orthonormal eigenvectors of
the symmetric matrix Q. Also
0 = λ1 v1 ·v2 = Qv1 ·v2 = (At Av1 )·v2 = (Av1 )·(Av2 ) = σ1 u1 ·σ2 u2 = σ1 σ2 u1 ·u2 .
A Versus Q = At A
Since the rank equals the dimension of the row space, the first part follows
from §2.4. If Av = σu and At u = σv, then
Qv = At Av = At (σu) = σAt u = σ 2 v,
so λ = σ 2 is an eigenvalue of Q.
Conversely,
√ If Qv = λv, then λ ≥ 0, so there are two cases. If λ > 0, set
σ = λ and u = Av/σ. Then
184 CHAPTER 3. PRINCIPAL COMPONENTS
Let A be any matrix, and let r be the rank of A. Then there are
r positive singular values σk , an orthonormal basis uk of the target
space, and an orthonormal basis vk of the source space, such that
Avk = σk uk , At uk = σk vk , k ≤ r, (3.4.3)
and
Avk = 0, At uk = 0 for k > r. (3.4.4)
Taken together, (3.4.3) and (3.4.4) say the number of positive singular
values is exactly r. Assume A is N × d, and let p = min(N, d) be the lesser
of N and d.
Since (3.4.4) holds as long as there are vectors uk and vk , there are p − r
zero singular values. Hence there are p = min(N, d) singular values altogether.
The proof of the result is very simple once we remember the rank of Q
equals the number of positive eigenvalues of Q. By the eigenvalue decom-
position, there is an orthonormal basis vk of the source space and positive
√ that Qvk = λk vk , k ≤ r, and Qvk = 0, k > r.
eigenvalues λk such
Setting σk = λk and uk = Avk /σk , k ≤ r, as in our first example, we
have (3.4.3), and, again as in our first example, uk , k ≤ r, are orthonormal.
By construction, vk , k > r, is an orthonormal basis for the null space of
A, and uk , k ≤ r, is an orthonormal basis for the column space of A.
Choose uk , k > r, any orthonormal basis for the nullspace of At . Since
the column space of A is the row space of At , the column space of A is the
orthogonal complement of the nullspace of At (2.7.6). Hence uk , k ≤ r, and
uk , k > r, are orthogonal. From this, uk , k ≤ r, together with uk , k > r,
form an orthonormal basis for the target space.
3.4. SINGULAR VALUE DECOMPOSITION 185
For our second example, let a and b be nonzero vectors, possibly of different
sizes, and let A be the matrix
A = a ⊗ b, At = b ⊗ a.
Thus there is only one positive singular value of A, equal to |a| |b|. All other
singular values are zero. This is not surprising since the rank of A is one.
Now think of the vector b as a single-row matrix B. Then, in a similar
manner, one sees the only positive singular value of B is σ = |b|.
Our third example is
0000
1 0 0 0
A= 0 1 0 0 .
(3.4.5)
0010
Then
0 1 0 0 1 0 0 0
0 0 1 0 0 1 0 0
At = Q = At A =
,
0 0 0 1 0 0 1 0
0 0 0 0 0 0 0 0
Since Q is diagonal symmetric, its rank is 3 and its eigenvalues are λ1 = 1,
λ2 = 1, λ3 = 1, λ4 = 0, and its eigenvectors are
1 0 0 0
0 1 0 0
0 , v2 = 0 , v3 1 , v4 = 0 .
v1 =
0 0 0 1
Diagonalization (SVD)
A = U SV. (3.4.6)
0 0 0 σ4 0 0
U, sigma, V = svd(A)
# sigma is a vector
print(U.shape,S.shape,V.shape)
print(U,S,V)
Given the relation between the singular values of A and the eigenvalues of
Q = At A, we also can conclude
For example, if dataset is the Iris dataset (ignoring the labels), the code
188 CHAPTER 3. PRINCIPAL COMPONENTS
# center dataset
m = mean(dataset,axis=0)
A = dataset - m
# rows of V are right
# singular vectors of A
V = svd(A)[2]
# columns of U are
# eigenvectors of Q
U = eigh(Q)[1]
# compare columns of U
# and rows of V
U, V
returns
0.36 −0.66 −0.58 0.32 0.36 −0.08 0.86 0.36
−0.08 −0.73 0.6 −0.32
, V = −0.66 −0.73 0.18 0.07
U =
0.86 0.18 0.07 −0.48 0.58 −0.6 −0.07 −0.55
0.36 0.07 0.55 0.75 0.32 −0.32 −0.48 0.75
This shows the columns of U are identical to the rows of V , except for the
third column of U , which is the negative of the third row of V .
Exercises
Exercise 3.4.1 Let b be a vector and let B be the matrix with the single
row b. Show σ = |b| is the only positive singular value.
Exercise 3.4.4 Let A be the 5 × 3 matrix (2.3.4). Use numpy and sympy to
compute the singular value decomposition of A (see Exercise 3.2.10).
190 CHAPTER 3. PRINCIPAL COMPONENTS
Qvk = λk vk , k = 1, 2, . . . , d.
λ1 ≥ λ2 ≥ · · · ≥ λd ,
in PCA one takes the most significant components, those components who
eigenvalues are near the top eigenvalue. For example, one can take the top
two eigenvalues λ1 ≥ λ2 and their eigenvectors v1 , v2 , and project the dataset
onto span(v1 , v2 ). The projected dataset can then be visualized as points in
the plane. Similarly, one can take the top three eigenvalues λ1 ≥ λ2 ≥ λ3
and their eigenvectors v1 , v2 , v3 and project the dataset onto span(v1 , v2 , v3 ).
This can then be visualized as points in three dimensions.
Recall the MNIST dataset consists of N = 60000 points in d = 784 di-
mensions. After we download the dataset,
mnist = read_csv("mnist.csv").to_numpy()
dataset = mnist[:,1:]
labels = mnist[:,0]
This results in Figures 3.23 and 3.24. Here we sort the array eig in
decreasing order, then we cumsum the array to obtain the cumulative sums.
Because the rank of the MNIST dataset is 712 (§2.9), the bottom 72 =
784 − 712 eigenvalues are exactly zero. A full listing shows that many more
eigenvalues are near zero, and the second column in Figure 3.23 shows the
top ten eigenvalues alone sum to almost 50% of the total variance.
Q = cov(dataset.T)
totvar = Q.trace()
192 CHAPTER 3. PRINCIPAL COMPONENTS
# cumulative sums
sums = cumsum(percent)
data = array([percent,sums])
print(data.T[:20].round(decimals=3))
d = len(lamda)
from matplotlib.pyplot import stairs
grid()
stairs(percent,range(d+1))
show()
def pca(dataset,n):
Q = cov(dataset.T)
# columns of U are
# eigenvectors of Q
lamda, U = eigh(Q)
# decreasing eigenvalue sort
order = lamda.argsort()[::-1]
# sorted top n columns of U
# are cols of Uproj
# U is dxd Uproj is dxn
Uproj = U[:,order[:n]]
P = dot(Uproj,Uproj.T)
return P
In the code, lamda is sorted in decreasing order, and the sorting order is
saved as order. To obtain the top n eigenvectors of U , we sort the first n
columns U[:,order[:n]] in the same order, resulting in the d × n matrix
t
Uproj . The code then returns the projection matrix P = Uproj Uproj (2.7.4).
Instead of working with the variance Q, as discussed at the start of the
section, we can work directly with the dataset, using SVD, to obtain the
eigenvectors.
def pca_with_svd(dataset,n):
# center dataset
mu = mean(dataset,axis=0)
vectors = dataset - mu
# rows of V are
# right singular vectors
194 CHAPTER 3. PRINCIPAL COMPONENTS
V = svd(vectors)[2]
# no need to sort, already decreasing order
Uproj = V[:n].T # top n rows as columns
P = dot(Uproj,Uproj.T)
return P
Let v = dataset[1] be the second image in the MNIST dataset, and let
Q be the variance of the dataset. Then the code below returns the image
compressed down to n = 784, 600, 350, 150, 50, 10, 1 dimensions, returning
Figure 3.25.
def display_image(v,row,col,i):
A = reshape(v,(28,28))
fig.add_subplot(row, col,i)
xticks([])
yticks([])
imshow(A,cmap="gray_r")
fig = figure(figsize=(10,5))
row, col = 2, 4
If you run out of memory trying this code, cut down the dataset from
60,000 points to 10,000 points or fewer. The code works with pca or with
pca_with_svd.
N = len(dataset)
n = 10
3.5. PRINCIPAL COMPONENT ANALYSIS 195
engine = PCA(n_components = n)
reduced = engine.fit_transform(dataset)
reduced.shape
and returns (N, n) = (60000, 10). The following code computes the projected
dataset
projected = engine.inverse_transform(reduced)
projected.shape
Fig. 3.25 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
fig = figure(figsize=(10,5))
row, col = 2, 4
196 CHAPTER 3. PRINCIPAL COMPONENTS
Now we project all vectors of the MNIST dataset onto two and three
dimensions, those corresponding to the top two or three eigenvalues. To start,
we compute reduced as above with n = 3, the top three components.
grid()
legend(loc='upper right')
show()
grid()
legend(loc='upper right')
show()
%matplotlib ipympl
from matplotlib.pyplot import *
ax = axes(projection="3d")
ax.set_aspect("equal")
ax.set_axis_off()
legend(loc='upper right')
show()
The three dimensional plot of the complete MNIST dataset is Figure 1.5
in §1.2. The command %matplotlib ipympl allows the figure to rotated and
scaled.
such that
def nearest_index(x,means):
i = 0
for j,m in enumerate(means):
n = means[i]
if norm(x - m) < norm(x - n): i = j
return i
def assign_clusters(dataset,means):
clusters = [ [ ] for m in means ]
for x in dataset:
i = nearest_index(x,means)
clusters[i].append(x)
return [ c for c in clusters if len(c) > 0 ]
def update_means(clusters):
return [ mean(c,axis=0) for c in clusters ]
d = 2
k,N = 7,100
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
This code returns the size the clusters after each iteration. Here is code
that plots a cluster.
def plot_cluster(mean,cluster,color,marker):
for v in cluster:
scatter(v[0],v[1], s=50, c=color, marker=marker)
scatter(mean[0], mean[1], s=100, c=color, marker='*')
d = 2
k,N = 7,100
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
figure(figsize=(4,4))
grid()
3.6. CLUSTER ANALYSIS 201
The material in this chapter lays the groundwork for Chapter 7. It assumes
the reader has some prior exposure, and the first section quickly reviews
basic material essential for our purposes. The overarching role of convexity
is emphasized repeatedly, both in the single-variable and multi-variable case.
The chain rule is treated extensively, in both interpretations, combinato-
rial (back-propagation) and geometric (time-derivatives). Both are crucial for
neural network training in Chapter 7.
Because it is used infrequently in the text, integration is treated separately
in an appendix (§A.5).
Even though parts of §4.5 are heavy-going, the material is necessary for
Chapter 7. Nevertheless, for a first pass, the reader should feel free to skim
this material and come back to it after the need is made clear.
Definition of Derivative
The derivative of f (x) at the point a is the slope of the line tangent
to the graph of f (x) at a.
203
204 CHAPTER 4. CALCULUS
Since a constant function f (x) = c is a line with slope zero, the derivative
of a constant is zero. Since f (x) = mx+b is a line with slope m, its derivative
is m.
Since the tangent line at a passes through the point (a, f (a)), and its slope
is f ′ (a), the equation of the tangent line at a is
y = f (x)
x
a
Using these properties, we determine the formula for f ′ (a). Suppose the
derivative is bounded between two extremes m and L at every point x in an
interval [a, b], say
m ≤ f ′ (x) ≤ L, a ≤ x ≤ b.
Then by A, the derivative of h(x) = f (x)−mx at x equals h′ (x) = f ′ (x)−m.
By assumption, h′ (x) ≥ 0 on [a, b], so, by B, h(b) ≥ h(a). Since h(a) =
f (a) − ma and h(b) = f (b) − mb, this leads to
f (b) − f (a)
≥ m.
b−a
Repeating this same argument with f (x) − Lx, and using C, leads to
f (b) − f (a)
≤ L.
b−a
4.1. SINGLE-VARIABLE CALCULUS 205
We have shown
f (b) − f (a)
m≤ ≤ L. (4.1.1)
b−a
Derivative Formula
f (x) − f (a)
f ′ (a) = lim . (4.1.3)
x→a x−a
dy dy du
= · .
dx du dx
To visualize the chain rule, suppose
u = f (x) = sin x,
y = g(u) = u2 .
x u y
f g
√
Suppose x = π/4. Then u = sin(π/4) = 1/ 2, and y = u2 = 1/2. Since
dy 2 du 1
= 2u = √ , = cos x = √ ,
du 2 dx 2
by the chain rule,
dy dy du 2 1
= · = √ · √ = 1.
dx du dx 2 2
Since the chain rule is important for machine learning, it is discussed in detail
in §4.4.
By the product rule,
Using the chain rule, the power rule can be √derived for any rational number n,
2
positive or negative. For example,
√ since ( x) = x, we can write x = f (g(x))
with f (x) = x2 and g(x) = x. By the chain rule,
√ √
1 = (x)′ = f ′ (g(x))g ′ (x) = 2g(x)g ′ (x) = 2 x( x)′ .
√
Solving for ( x)′ yields
√ 1
( x)′ = √ ,
2 x
4.1. SINGLE-VARIABLE CALCULUS 207
x, a = symbols('x, a')
f = x**a
returns
axa axa
, , axa−1 , axa−1 .
x x
The power rule can be combined with the chain rule. For example, if
un+1
u = 1 − p + cp, f (p) = un , g(u) = ,
(c − 1)(n + 1)
then
(1 − p + cp)n+1
F (p) = ,
(c − 1)(n + 1)
and
F ′ (p) = g ′ (u)u′ = un ,
hence
(1 − p + cp)n+1
F (p) = =⇒ F ′ (p) = f (p). (4.1.5)
(c − 1)(n + 1)
For example,
n!
(xn )′′ = (nxn−1 )′ = n(n − 1)xn−2 = xn−2 = P (n, 2)xn−2
(n − 2)!
More generally, the k-th derivative f (k) (x) is the derivatives taken k times,
so
(k) n!
(xn ) = n(n − 1)(n − 2) . . . (n − k + 1)xn−k = xn−k = P (n, k)xn−k .
(n − k)!
When k = 0, f (0) (x) = f (x), and, when k = 1, f (1) (x) = f ′ (x). The code
x, n = symbols('x, n')
diff(x**n,x,3)
def sym_legendre(n):
# symbolic variable
x = symbols('x')
# symbolic function
p = (x**2 - 1)**n
nfact = factorial(n,exact=True)
# symbolic nth derivative
return p.diff(x,n)/(nfact * 2**n)
For example,
4.1. SINGLE-VARIABLE CALCULUS 209
def num_legendre(n):
x = symbols('x')
f = sym_legendre(n)
return lambdify(x,f, 'numpy')
We use the above to derive the Taylor series. Suppose f (x) is given by a
finite or infinite sum
f (x) = c0 + c1 x + c2 x2 + c3 x3 + . . . (4.1.6)
Then f (0) = c0 . Taking derivatives, by the sum, product, and power rules,
Inserting x = 0, we obtain f ′ (0) = c1 , f ′′ (0) = 2c2 , f ′′′ (0) = 3 · 2c3 , f (4) (0) =
4 · 3 · 2c4 . This can be encapsulated by f (n) (0) = n!cn , n = 0, 1, 2, 3, 4, . . . ,
which is best written
f (n) (0)
= cn , n ≥ 0.
n!
Going back to (4.1.6), we derived
210 CHAPTER 4. CALCULUS
Taylor Series
y = log x ⇐⇒ x = ey .
log(ey ) = y, elog x = x.
4.1. SINGLE-VARIABLE CALCULUS 211
From here, we see the logarithm is defined only for x > 0 and is strictly
increasing (Figure 4.3).
Since e0 = 1,
log 1 = 0.
Since e∞ = ∞ (Figure A.3),
log ∞ = ∞.
log 0 = −∞.
We also see log x is negative when 0 < x < 1, and positive when x > 1.
ab = eb log a .
Then, by definition,
log(ab ) = b log a,
and c c
ab = eb log a = ebc log a = abc .
212 CHAPTER 4. CALCULUS
x = ey =⇒ 1 = x′ = (ey )′ = ey y ′ = xy ′ ,
so
1
y = log x =⇒ y′ = .
x
Derivative of the Logarithm
1
y = log x =⇒ y′ = . (4.1.9)
x
Since the derivative of log(1 + x) is 1/(1 + x), the chain rule implies
dn (n − 1)!
log(1 + x) = (−1)n−1 , n ≥ 1.
dxn (1 + x)n
x2 x3 x4
log(1 + x) = x − + − + .... (4.1.10)
2 3 4
0
x
For the parabola in Figure 4.4, y = x2 so, by the power rule, y ′ = 2x.
Since y ′ > 0 when x > 0 and y ′ < 0 when x < 0, this agrees with the
4.1. SINGLE-VARIABLE CALCULUS 213
√
(c = 1/ 3)
−1 −c c 1
x
0
max f (x).
x∗ ,a,b
In other words, to find the maximum of f (x), find the critical points x∗ ,
plug them and the endpoints a, b into f (x), and select whichever yields the
maximum value.
For example, since (x2 )′′ = 2 > 0 and (ex )′′ = ex > 0, x2 and e√ x
are
strictly convex everywhere, and x4 − 2x2 is strictly convex for |x| > 1/ 3.
Convexity of ex was also derived in (A.3.14). Since
(ex )(n) = ex , n ≥ 0,
f (x) − f (a)
f ′ (a) ≤ ≤ f ′ (x), a ≤ x ≤ b.
x−a
Since the tangent line at a is y = f ′ (a)(x − a) + f (a), rearranging this last
inequality, we obtain
216 CHAPTER 4. CALCULUS
For example, the function in Figure 4.6 is convex near x = a, and the
graph lies above its tangent line at a.
L
pL (x) = f (a) + f ′ (a)(x − a) + (x − a)2 . (4.1.13)
2
Then p′′L (x) = L. Moreover the graph of pL (x) is tangent to the graph of f (x)
at x = a, in the sense f (a) = pL (a) and f ′ (a) = p′L (a). Because of this, we
call pL (x) the upper tangent parabola.
When y is convex, we saw above the graph of y lies above its tangent line.
When m ≤ y ′′ ≤ L, we can specify the size of the difference between the
graph and the tangent line. In fact, the graph is constrained to lie above or
below the lower or upper tangent parabolas.
If m ≤ f ′′ (x) ≤ L on [a, b], the graph lies between the lower and upper
tangent parabolas pm (x) and pL (x),
m L
(x − a)2 ≤ f (x) − f (a) − f ′ (a)(x − a) ≤ (x − a)2 . (4.1.14)
2 2
a ≤ x ≤ b.
so g(x) is convex, so g(x) lies above its tangent line at x = a. Since g(a) = 0
and g ′ (a) = 0, the tangent line is 0, and we conclude g(x) ≥ 0, which is the
4.1. SINGLE-VARIABLE CALCULUS 217
x
a
Fig. 4.6 Tangent parabolas pm (x) (green), pL (x) (red), L > m > 0.
f ′ (b) − f ′ (a)
t= =⇒ L ≥ t ≥ m,
b−a
which implies
t2 − (m + L)t + mL = (t − m)(t − L) ≤ 0.
This yields
For gradient descent, we need the relation between a convex function and
its dual. If f (x) is convex, its convex dual is
Below we see g(p) is also convex. This may not always exist, but we will work
with cases where no problems arise.
To evaluate g(p), following (4.1.11), we compute the maximizer x∗ by
setting the derivative of (px − f (x)) equal to zero and solving for x.
Let a > 0. The simplest example is f (x) = ax2 /2. In this case, the maxi-
mum of px − f (x) occurs where (px − f (x))′ = 0, which leads to
′
1
0= px − ax2 = p − ax,
2
Going back to (4.1.16), for each p, the point x where px − f (x) equals the
maximum g(p) — the maximizer — depends on p. If we denote the maximizer
by x = x(p), then
g(p) = px(p) − f (x(p)).
Since the maximum occurs when the derivative is zero, we have
Hence
g(p) = px − f (x) ⇐⇒ p = f ′ (x).
Also, by the chain rule, differentiating with respect to p,
Thus f ′ (x) is the inverse function of g ′ (p). Since g(p) = px − f (x) is the same
as f (x) = px − g(p), we have
If g(p) is the convex dual of a convex f (x), then f (x) is the convex
dual of g(p).
4.1. SINGLE-VARIABLE CALCULUS 219
f ′ (g ′ (p)) = p.
f ′′ (g ′ (p))g ′′ (p) = 1.
We derived
Let f (x) be a strictly convex function, and let g(p) be the convex dual
of f (x). Then g(p) is strictly convex and
1
g ′′ (p) = , (4.1.18)
f ′′ (x)
n
This makes sense because the binomial coefficient k is defined for any
real number n (A.2.12), (A.2.13).
In summation notation,
∞
X
n n n−k k
(a + x) = a x . (4.1.19)
k
k=0
The only difference between the binomial theorem and (4.1.19) is the upper
limit of the summation, which is set to infinity. When n is a whole number,
by (A.2.10), we have
n
= 0, for k > n,
k
220 CHAPTER 4. CALCULUS
so
f (k) (0)
n(n − 1)(n − 2) . . . (n − k + 1) n−k n n−k
= a = a .
k! k! k
Writing out the Taylor series,
∞ ∞
X f (k) (0) X n n−k k
(a + x)n = = a x ,
k! k
k=0 k=0
a, b = 0, 3*pi
theta = arange(a,b,.01)
ax = axes()
ax.grid(True)
ax.axhline(0, color='black', lw=1)
plot(theta,sin(theta))
show()
It is often convenient to set the horizontal axis tick marks at the multiples
of π/2. For this, we use
def label(k):
if k == 0: return '$0$'
elif k == 1: return r'$\pi/2$'
222 CHAPTER 4. CALCULUS
def set_pi_ticks(a,b):
base = pi/2
m = floor(b/base)
n = ceil(a/base)
k = arange(n,m+1,dtype=int)
# multiples of base
return xticks(k*base, map(label,k) )
We review the derivative of sine and cosine. This is needed for the arcsine
law (3.2.16). Recall the angle θ in radians is the length of the subtended
arc (in red) in Figure 4.9. Following the figure, with P = (x, y), we have
x = cos θ, y = sin θ.
The key idea here is Archimedes’ axiom [12], which states:
Suppose two convex curves share common initial and terminal points. If one is inside
the other, then the inside curve is the shorter.
P 1−x
Q
1 y
θ
O x I
By the figure, there are three convex curves joining P and I: The line
segment P I, the red arc, and the polygonal curve P QI. By Archimedes’
axiom, the length of P I is less than the length of the red arc, which in turn
is less than the length of P QI. Since the length of P I is greater than y, this
implies
4.1. SINGLE-VARIABLE CALCULUS 223
y < θ < 1 − x + y,
or
sin θ < θ < 1 − cos θ + sin θ.
Dividing by θ (here we assume 0 < θ < π/2),
1 − cos θ sin θ
1− < < 1. (4.1.21)
θ θ
We use this to show (the definition of limit is in §A.6)
sin θ
lim = 1. (4.1.22)
θ→0 θ
Since sin θ is odd, it is enough to verify (4.1.22) for θ > 0.
To this end, since sin2 θ = 1 − cos2 θ, from (4.1.21),
which implies
1 − cos θ
lim = 0.
θ→0 θ
Taking the limit θ → 0 in (4.1.21), we obtain (4.1.22) for θ > 0.
From (A.4.6),
sin(θ + t) = sin θ cos t + cos θ sin t,
so
sin(θ + t) − sin θ cos t − 1 sin t
lim = lim sin θ · + cos θ · = cos θ.
t→0 t t→0 t t
Thus the derivative of sine is cosine,
Similarly,
(cos θ)′ = − sin θ.
Using the chain rule, we compute the derivative of the inverse arcsin x of
sin θ. Since
θ = arcsin x ⇐⇒ x = sin θ,
we have p
1 = x′ = (sin θ)′ = θ′ · cos θ = θ′ · 1 − x2 ,
or
1
(arcsin x)′ = θ′ = √ .
1 − x2
224 CHAPTER 4. CALCULUS
We
√ use this to compute the derivative of the arcsine law (3.2.16). With
x = λ/2, by the chain rule,
′
1√
2 2 1
arcsin λ = √ · x′
π 2 π 1 − x2
(4.1.23)
2 1 1 1
= p · √ = p .
π 1 − λ/4 4 λ π λ(4 − λ)
This shows the derivative of the arcsine law is the density in Figure 3.11.
Exercises
Exercise 4.1.2 With exp x = ex , what are the first derivatives of exp(exp x)
and exp(exp(exp x))?
1 2
Exercise 4.1.3 With a > 0, let f (x) = 2 ax − ex . Where is f (x) convex,
and where is it concave?
Exercise 4.1.5 For fixed α > 0 and β > 0, find the maximizer p̂ of
pα (1 − p)β−α , 0 ≤ p ≤ 1.
Exercise 4.1.6 Compute the maximum and minimum of the second deriva-
tive of cos θ over the interval [a, b] = [−π/4, π/4]. Use that to compute the
upper and lower tangent parabolas at θ = 0. Plots these parabolas against
cos θ. Repeat everything with [a, b] = [−π/2, π/2].
Exercise 4.1.7 Suppose f (x) ≥ 0 and f ′′ (x) ≤ 1/2 for all x. Show
p
|f ′ (a)| ≤ f (a).
Exercise 4.1.12 If the convex dual of f (x) is g(p), and t is a constant, what
is the convex dual of f (x) + t?
Exercise 4.1.13 If the convex dual of f (x) is g(p), and t is a constant, what
is the convex dual of f (x + t)?
Exercise 4.1.14 If the convex dual of f (x) is g(p), and t ̸= 0 is a constant,
what is the convex dual of f (tx)?
Exercise 4.1.15 If the convex dual of f (x) is g(p), and t ̸= 0 is a constant,
what is the convex dual of tf (x)?
Exercise 4.1.16 If a > 0 and
$$f(x) = \frac{1}{2}ax^2 + bx + c,$$
what is the convex dual?
Exercise 4.1.17 Show f (x) convex implies ef (x) convex.
This is also called absolute entropy to contrast with relative entropy which
we see below.
To graph H(p), we compute its first and second derivatives. Here the
independent variable is p. By the product rule,
$$H'(p) = \bigl(-p\log p - (1-p)\log(1-p)\bigr)' = -\log p + \log(1-p) = \log\frac{1-p}{p}.$$
Thus H ′ (p) = 0 when p = 1/2, H ′ (p) > 0 on p < 1/2, and H ′ (p) < 0 on
p > 1/2. Since this implies H(p) is increasing on p < 1/2, and decreasing on
p > 1/2, p = 1/2 is a global maximizer of H(p).
As p increases, 1−p decreases, so (1−p)/p decreases. Since log is increasing,
as p increases, H ′ (p) decreases. Thus H(p) is concave.
Taking the second derivative, by the chain rule and the quotient rule,
$$H''(p) = \left(\log\frac{1-p}{p}\right)' = -\frac{1}{p(1-p)},$$
which is negative, confirming again that H(p) is concave.
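To make this concrete, here is a small plotting sketch (not from the text; it assumes nothing beyond numpy and matplotlib) graphing H(p) together with its first derivative:

from numpy import arange, log
from matplotlib.pyplot import plot, grid, legend, show

p = arange(.01, 1, .01)                # avoid log(0) at the endpoints
H = -p*log(p) - (1-p)*log(1-p)         # absolute entropy
dH = log((1-p)/p)                      # first derivative, zero at p = 1/2

grid()
plot(p, H, label="$H(p)$")
plot(p, dH, label="$H'(p)$")
legend()
show()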
A crucial aspect of Figure 4.10 is its limiting values at the edges p = 0 and
p = 1,
$$H(0) = \lim_{p\to 0} H(p) \quad\text{and}\quad H(1) = \lim_{p\to 1} H(p).$$
To explain the meaning of the entropy function H(p), suppose a coin has
heads-bias or heads-probability p. If p is near 1, then we have confidence the
outcome of tossing the coin is heads, and, if p is near 0, we have confidence the
outcome of tossing the coin is tails. If p = 1/2, then we have least information.
Thus we can view the entropy as measuring a lack of information.
To formalize this, we define the information or absolute information
Then we have
$$p' = -\frac{-e^{-x}}{(1+e^{-x})^2} = \sigma(x)\bigl(1-\sigma(x)\bigr) = p(1-p). \qquad (4.2.4)$$
The logistic function, also called the expit function and the sigmoid function,
is studied further in §5.2, where it is used in coin-tossing and Bayes theorem.
The inverse of the logistic function is the logit function. The logit function
is found by solving p = σ(x) for x, obtaining
$$x = \sigma^{-1}(p) = \log\frac{p}{1-p}. \qquad (4.2.5)$$
The logit function is also called the log-odds function. Its derivative is
$$x' = \left(\log\frac{p}{1-p}\right)' = \frac{1-p}{p}\cdot\left(\frac{p}{1-p}\right)' = \frac{1-p}{p}\cdot\frac{1}{(1-p)^2} = \frac{1}{p(1-p)}.$$
Let
Z(x) = log (1 + ex ) . (4.2.6)
Then Z ′ (x) = σ(x) and Z ′′ (x) = σ ′ (x) = σ(1 − σ) > 0. This shows Z(x) is
strictly convex. We call Z(x) the cumulant-generating function, to be consis-
tent with random variable terminology (§5.3).
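As a quick numerical check of Z′(x) = σ(x) (a sketch, not from the text; expit is scipy's name for the logistic function):

from numpy import log, exp, allclose
from scipy.special import expit

def Z(x): return log(1 + exp(x))       # cumulant-generating function

x, h = 0.7, 1e-6
numeric = (Z(x+h) - Z(x-h))/(2*h)      # centered finite difference
print(allclose(numeric, expit(x)))     # True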
We compute the convex dual (§4.1) of Z(x). By (4.1.11), the maximum
$$\max_x\,\bigl(px - Z(x)\bigr)$$
0 ≤ p ≤ 1, and 0 ≤ q ≤ 1.
Then
I(q, q) = 0,
which agrees with our design goal of I(p, q) measuring the divergence between
the information in p and the information in q. Because I(p, q) is not symmetric
in p, q, we think of q as a base or reference probability, against which we
compare p.
Equivalently, instead of measuring relative information, we can measure
the relative entropy,
H(p, q) = −I(p, q).
Since − log(x) is strictly convex,
$$I(p,q) = -p\log\frac{q}{p} - (1-p)\log\frac{1-q}{1-p} > -\log\left(p\cdot\frac{q}{p} + (1-p)\cdot\frac{1-q}{1-p}\right) = -\log 1 = 0.$$
$$\frac{d^2}{dp^2}\,I(p,q) = I''(p) = \frac{1}{p(1-p)},$$
$$\frac{d^2}{dq^2}\,I(p,q) = \frac{p}{q^2} + \frac{1-p}{(1-q)^2}.$$
For more on this terminology confusion, see the remarks at the end of §5.6.
The code is as follows. Here the relative information I(p,q) is computed with scipy's entropy, and the grids start at .01 to avoid division by zero at the edges.

%matplotlib ipympl
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import entropy

# relative information I(p,q) of the coin biases p and q
def I(p,q): return entropy([p,1-p],[q,1-q])

ax = axes(projection='3d')
ax.set_axis_off()

p = arange(.01,1,.01)
q = arange(.01,1,.01)
p,q = meshgrid(p,q)

# surface
ax.plot_surface(p,q,I(p,q), cmap='cool')
# square
ax.plot([0,1,1,0,0],[0,0,1,1,0],linewidth=.5,c="k")
show()
Exercises
Exercise 4.2.3 Let 0 < q < 1 be a constant. What is the convex dual of
Exercise 4.2.5 The relative information I(p, q) has minimum zero when p =
q. Use the lower tangent parabola (4.1.12) of I(x, q) at q and Exercise 4.2.2
to show
I(p, q) ≥ 2(p − q)2 .
For q = 0.7, plot both I(p, q) and 2(p − q)2 as functions of 0 < p < 1.
Let
f (x) = f (x1 , x2 , . . . , xd )
be a scalar function of a point x = (x1 , x2 , . . . , xd ) in Rd , and suppose v is
a unit vector in Rd . Then, along the line x(t) = x + tv, g(t) = f (x + tv)
is a function of the single variable t. Hence its derivative g ′ (0) at t = 0 is
well-defined. Since g ′ (0) depends on the point x and on the direction v, this
rate of change is the directional derivative of f (x) at x in the direction v.
More explicitly, the directional derivative of f (x) at x in the direction v is
$$D_v f(x) = \frac{d}{dt}\Big|_{t=0} f(x+tv). \qquad (4.3.1)$$
In particular, taking v = e_k, the k-th standard basis vector, yields the partial derivative
$$\frac{\partial f}{\partial x_k}(x) = \frac{d}{dt}\Big|_{t=0} f(x+te_k).$$
The partial derivative in the k-th direction is just the one-dimensional deriva-
tive considering xk as the independent variable, with all other xj ’s constants.
Below we exhibit the multi-variable chain rule in two ways. The first in-
terpretation is geometric, and involves motion in time and directional deriva-
tives. This interpretation is relevant to gradient descent, §7.3.
The second interpretation is combinatorial, and involves repeated compo-
sitions of functions. This interpretation is relevant to computing gradients in
networks, specifically backpropagation §4.4, §7.2.
These two interpretations work together when training neural networks,
§7.4.
For the first interpretation of the chain rule, suppose the components x1 ,
x2 , . . . , xd are functions of a single variable t (usually time), so we have
The Rd -valued function x(t) = (x1 (t), x2 (t), . . . , xd (t)) represents a curve
or path in Rd , and the vector
$$\frac{d}{dt}\Big|_{t=0} f(x+tv) = \nabla f(x)\cdot v. \qquad (4.3.3)$$
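As an illustration of (4.3.3) (a sketch with an arbitrarily chosen f, point x, and direction v, none of which come from the text):

from numpy import array, sqrt, allclose

def f(x): return x[0]**2 + x[1]*x[2]          # sample scalar function on R^3
def grad_f(x): return array([2*x[0], x[2], x[1]])

x = array([1.0, 2.0, 3.0])
v = array([1.0, 1.0, 1.0])/sqrt(3)            # a unit direction
h = 1e-6
numeric = (f(x+h*v) - f(x-h*v))/(2*h)         # d/dt f(x+tv) at t = 0
print(allclose(numeric, grad_f(x) @ v))       # True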
$$\frac{d}{ds}\Big|_{s=0} f(W+sV) = \operatorname{trace}(V^t G) \quad\text{for all } V. \qquad (4.3.4)$$
(Figure: the input x feeds into three nodes with outputs r, s = g(x), and t; these feed into the sum node u = r + s + t, and the output is y = k(u).)
$$\frac{dy}{dr} = \frac{dy}{du}\cdot\frac{du}{dr} = -0.90 * 1 = -0.90,$$
and similarly,
$$\frac{dy}{ds} = \frac{dy}{dt} = -0.90.$$
By the chain rule,
$$\frac{dy}{dx} = \frac{dy}{dr}\cdot\frac{dr}{dx} + \frac{dy}{ds}\cdot\frac{ds}{dx} + \frac{dy}{dt}\cdot\frac{dt}{dx}.$$
By (4.2.4), s′ = s(1 − s) = 0.22, so
$$\frac{dr}{dx} = \cos x = 0.71, \qquad \frac{ds}{dx} = s(1-s) = 0.22, \qquad \frac{dt}{dx} = 2x = 1.57.$$
We obtain
$$\frac{dy}{dx} = -0.90*0.71 - 0.90*0.22 - 0.90*1.57 = -2.25.$$
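A finite-difference check of this value is sketched below. The concrete functions are assumptions read off from the figure (r = sin x, s = σ(x), t = x², u = r + s + t, y = cos u, at x = π/4); they are not spelled out in this excerpt, so treat the snippet as illustrative only.

from numpy import sin, cos, pi
from scipy.special import expit

def y_of(x):                            # assumed composite from the figure
    r, s, t = sin(x), expit(x), x**2
    return cos(r + s + t)

x, h = pi/4, 1e-6
print((y_of(x+h) - y_of(x-h))/(2*h))    # approximately -2.25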
The chain rule is discussed in further detail in §4.4.
∇f (x∗ ) = 0.
$$\begin{aligned} g(t) = f(x+tv) &= \frac{1}{2}(x+tv)\cdot Q(x+tv) - b\cdot(x+tv) \\ &= \frac{1}{2}x\cdot Qx - b\cdot x + t\,v\cdot(Qx-b) + \frac{1}{2}t^2\,v\cdot Qv \\ &= f(x) + t\,v\cdot(Qx-b) + \frac{1}{2}t^2\,v\cdot Qv. \end{aligned} \qquad (4.3.7)$$
From this follows
$$g'(t) = v\cdot(Qx-b) + t\,v\cdot Qv, \qquad g''(t) = v\cdot Qv.$$
This shows
Quadratic Convexity
∇f (x) = Qx − b. (4.3.8)
Moreover f (x) is convex everywhere when Q is a variance matrix.
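A small check of the gradient formula (4.3.8) by finite differences (a sketch with an arbitrarily chosen Q and b, not taken from the text):

from numpy import array, zeros, allclose

Q = array([[2.0, 1.0],[1.0, 3.0]])       # symmetric positive matrix
b = array([1.0, -1.0])

def f(x): return x @ Q @ x / 2 - b @ x

x = array([0.5, 2.0])
h = 1e-6
grad = zeros(2)
for k in range(2):                       # centered difference in each coordinate
    e = zeros(2); e[k] = h
    grad[k] = (f(x+e) - f(x-e))/(2*h)

print(allclose(grad, Q @ x - b))         # True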
By (2.2.2),
Dv f (x) = ∇f (x) · v = |∇f (x)| |v| cos θ,
where θ is the angle between the vector v and the gradient vector ∇f (x).
Since −1 ≤ cos θ ≤ 1, we conclude
To see this, pick any point a. Then, by properness, the sublevel set f (x) ≤
f (a) is bounded. By continuity of f (x), there is a minimizer x∗ (see §A.7).
Since for all x outside this sublevel set, we have f (x) > f (a), x∗ is a global
minimizer.
To see this, suppose f (x) is not proper. In this case, by (4.3.9), there
would be a level c and a sequence x1 , x2 , . . . in the row space of A satisfying
|xn | → ∞ and f (xn ) ≤ c for n ≥ 1.
Let x′n = xn/|xn|. Then the x′n are unit vectors in the row space of A, hence
x′n is a bounded sequence. From §A.7, this implies x′n subconverges to some
limit x′.
Properness of Residual
is proper on Rd .
As a consequence,
Let a be any point, and v any direction, and let g(t) = f (a + tv). Then
g ′ (0) = ∇f (a) · v.
Exercises
Exercise 4.3.1 Let I(p, q) be the relative information (4.2.9), and let Ipp ,
Ipq , Iqp , Iqq be the second partial derivatives. If Q is the second derivative
matrix
$$Q = \begin{pmatrix} I_{pp} & I_{pq} \\ I_{qp} & I_{qq} \end{pmatrix},$$
show
$$\det(Q) = \frac{(p-q)^2}{p(1-p)\,q^2(1-q)^2}.$$
Exercise 4.3.2 Let I(p, q) be the relative information (4.2.9). With x =
(p, q) and v = (ap(1 − p), bq(1 − q)), show
$$\frac{d^2}{dt^2}\Big|_{t=0} I(x+tv) = p(1-p)(a-b)^2 + b^2(p-q)^2.$$
Conclude that I(p, q) is a convex function of (p, q). Where is it not strictly
convex?
Exercise 4.3.3 Let J(x) = J(x1 , x2 , . . . , xd ) equal
$$J(x) = \frac{1}{2}(x_1-x_2)^2 + \frac{1}{2}(x_2-x_3)^2 + \cdots + \frac{1}{2}(x_{d-1}-x_d)^2 + \frac{1}{2}(x_d-x_1)^2.$$
Compute Q = D2 J.
Exercise 4.3.4 Let f (Q) = log det(Q) be the log of the determinant of a
positive 2 × 2 matrix Q (Exercise 3.2.6), and let V be a symmetric 2 × 2
matrix. Using Exercise 1.4.11, compute the second derivative of f (Q + tV )
at t = 0 as in (4.3.6). Using Exercise 1.4.10, conclude log det(Q) is a concave
function of Q.
$$r = f(x),\ y = g(r) \implies \frac{dy}{dx} = \frac{dy}{dr}\cdot\frac{dr}{dx}.$$
In this section, we work out the implications of the chain rule on repeated
compositions of functions.
Suppose
$$r = f(x) = \sin x, \qquad s = g(r) = \frac{1}{1+e^{-r}}, \qquad y = h(s) = s^2.$$
(Figure 4.15: the chain x → r → s → y, with the functions f, g, h applied along the edges.)
The chain in Figure 4.15 has five nodes and four edges. There is one input
node (no incoming edge from another node) and one output node (no
outgoing edge to another node). The outgoing signals at the first four nodes
are x, r, s, y. The incoming signals at the last four nodes are x, r, s, y.
Start with x = π/4. Evaluating the functions in order,
$$x = 0.785, \qquad r = \sin x = 0.707, \qquad s = g(r) = 0.670, \qquad y = s^2 = 0.449.$$
Notice these values are evaluated in the forward direction: x then r then s
then y. This is forward propagation.
From this, since dy/ds = 2s = 1.340,
$$\frac{dy}{dr} = \frac{dy}{ds}\cdot\frac{ds}{dr} = 1.340 * g'(r) = 1.340 * 0.221 = 0.296.$$
Repeating one more time,
$$\frac{dy}{dx} = \frac{dy}{dr}\cdot\frac{dr}{dx} = 0.296 * \cos x = 0.296 * 0.707 = 0.209.$$
Thus the derivatives are
$$\frac{dy}{dx} = 0.209, \qquad \frac{dy}{dr} = 0.296, \qquad \frac{dy}{ds} = 1.340.$$
Notice the derivatives are evaluated in the backward direction: First dy/dy =
1, then dy/ds, then dy/dr, then dy/dx. This is back propagation.
r = x2 ,
s = r 2 = x4 ,
y = s2 = x8 .
This is the same function h(x) = x2 composed with itself three times. With
x = 5, we have
func_chain = [f,g,h]
der_chain = [df,dg,dh]
Then we evaluate the output vector x = (x, r, s, y), leading to the first
version of forward propagation,
def forward_prop(x_in,func_chain):
x = [x_in]
while func_chain:
f = func_chain.pop(0) # first func
x_out = f(x_in)
x.append(x_out) # insert at end
x_in = x_out
return x
# dy/dy = 1
delta_out = 1
def backward_prop(delta_out,x,der_chain):
delta = [delta_out]
while der_chain:
# discard last output
x.pop(-1)
df = der_chain.pop(-1) # last der
der = df(x[-1])
# chain rule -- multiply by previous der
der = der * delta[0]
delta.insert(0,der) # insert at start
return delta
delta = backward_prop(delta_out,x,der_chain)
d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1
x = forward_prop(x_in,func_chain)
delta = backward_prop(delta_out,x,der_chain)
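For this run to produce numbers, h and its derivative dh must be defined; the definitions below are assumptions consistent with r = x², s = r², y = s² (a minimal sketch, not verbatim from the text):

def h(x): return x**2          # squaring function
def dh(x): return 2*x          # its derivative

d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1

x = forward_prop(x_in, func_chain)
print(x)                                        # [5, 25, 625, 390625]
delta = backward_prop(delta_out, x, der_chain)  # note: pops entries off x
print(delta)                                    # [625000, 62500, 1250, 1]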
Now we work with the network in Figure 4.16, using the multi-variable
chain rule (§4.3). The functions are
a = f (x, y) = x + y,
b = g(y, z) = max(y, z),
J = h(a, b) = ab.
J = (x + y) max(y, z),
Here there are three input nodes, labeled 0, 1, 2, three hidden nodes, 3,
4, 5, and an output node, 6. Starting with inputs (x, y, z) = (1, 2, 0), and
plugging in, we obtain the outgoing signals at the first six nodes,
(x, y, z, a, b, J) = (1, 2, 0, 3, 2, 6).
(Figure 4.16: the network with inputs x, y, z, hidden nodes a = x + y (a sum node) and b = max(y, z) (a max node), and output J = ab (a product node).)
(Figure 4.17: the two cases of the max node. When y > z, max(y, z) = y and ∂g/∂y = 1, ∂g/∂z = 0; when y < z, max(y, z) = z and ∂g/∂y = 0, ∂g/∂z = 1.)
$$\frac{\partial J}{\partial a} = b = 2, \qquad \frac{\partial J}{\partial b} = a = 3.$$
Then
$$\frac{\partial a}{\partial x} = 1, \qquad \frac{\partial a}{\partial y} = 1.$$
Let
$$\mathbf{1}(y > z) = \begin{cases} 1, & y > z,\\ 0, & y < z.\end{cases}$$
By Figure 4.17, since y = 2 and z = 0,
$$\frac{\partial b}{\partial y} = \mathbf{1}(y > z) = 1, \qquad \frac{\partial b}{\partial z} = \mathbf{1}(z > y) = 0.$$
By the chain rule,
$$\frac{\partial J}{\partial x} = \frac{\partial J}{\partial a}\cdot\frac{\partial a}{\partial x} = 2*1 = 2,$$
$$\frac{\partial J}{\partial y} = \frac{\partial J}{\partial a}\cdot\frac{\partial a}{\partial y} + \frac{\partial J}{\partial b}\cdot\frac{\partial b}{\partial y} = 2*1 + 3*1 = 5,$$
$$\frac{\partial J}{\partial z} = \frac{\partial J}{\partial b}\cdot\frac{\partial b}{\partial z} = 3*0 = 0.$$
(Figure 4.18: the network of Figure 4.16, annotated with the outgoing signals in blue and the derivatives in red.)
Hence we have
$$\left(\frac{\partial J}{\partial x}, \frac{\partial J}{\partial y}, \frac{\partial J}{\partial z}, \frac{\partial J}{\partial a}, \frac{\partial J}{\partial b}, \frac{\partial J}{\partial J}\right) = (2, 5, 0, 2, 3, 1).$$
The outputs (blue) and the derivatives (red) are displayed in Figure 4.18.
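These values can be checked numerically; the sketch below (not from the text) compares centered finite differences of J = (x + y) max(y, z) at (1, 2, 0) with the hand computation:

from numpy import array, zeros, allclose

def J(v):
    x, y, z = v
    return (x + y) * max(y, z)

v = array([1.0, 2.0, 0.0])
h = 1e-6
grad = zeros(3)
for k in range(3):
    e = zeros(3); e[k] = h
    grad[k] = (J(v+e) - J(v-e))/(2*h)

print(allclose(grad, [2, 5, 0]))   # True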
Summarizing, by the chain rule,
d = 7
w = [ [None]*d for _ in range(d) ]
More generally, in a weighted directed graph (§3.3), the weights wij are numeric scalars or None.
Once we have the outgoing vector x, for each node j, let
$$x^-_j = (w_{0j}x_0,\ w_{1j}x_1,\ w_{2j}x_2,\ \ldots,\ w_{d-1,j}x_{d-1}). \qquad (4.4.1)$$
This is the incoming signal list at node j. Here we adopt the convention that
None times anything is None, and any resulting None entry in the list is to be
discarded.
An activation function at node j is a function f_j of the incoming signal
list x^-_j. Then the outgoing signal at node j is
$$x_j = f_j(x^-_j). \qquad (4.4.2)$$
$$x^- = (x^-_0, x^-_1, x^-_2, \ldots, x^-_{d-1}).$$
For the network in Figure 4.16,
$$x^- = (x^-_0, x^-_1, x^-_2, x^-_3, x^-_4, x^-_5, x^-_6),$$
where
x0minus = [None,None,None,None,None,None,None]
x1minus = [None,None,None,None,None,None,None]
x2minus = [None,None,None,None,None,None,None]
x3minus = [x,y,None,None,None,None,None]
x4minus = [None,y,z,None,None,None,None]
x5minus = [None,None,None,a,b,None,None]
x6minus = [None,None,None,None,None,None,J]
After discarding the None entries, these reduce to

x0minus = [ ]
x1minus = [ ]
x2minus = [ ]
x3minus = [x,y]
x4minus = [y,z]
x5minus = [a,b]
x6minus = [J]
activate = [None]*d
def incoming(x,w,j):
    return [ w[i][j] * outgoing(x,w,i) for i in range(d) if w[i][j] != None ]
def outgoing(x,w,j):
if x[j] != None: return x[j]
else:
if activate[j] != None: return activate[j](*incoming(x,w,j))
else: return None
(Figure: a chain network with edge weights 5, −2, 7 and activation functions f, g, h.)
Let xin be the outgoing vector over the input nodes. If there are m input
nodes, and d nodes in total, then the length of xin is m, and the length of x
is d. In the example above, xin = (x, y, z).
We assume the list of nodes is ordered so that the initial portion of the list
of nodes is the list of input nodes.
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
d = len(w)
x = [None]*d
m = len(x_in)
x[:m] = x_in
for j in range(m,d): x[j] = outgoing(x,w,j)
return x
For this code to work, we assume there are no cycles in the graph: All back-
ward paths end at input nodes, and all forward paths end at output nodes.
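As a concrete illustration, the wiring below runs forward_prop on the network of Figure 4.16. The weight matrix, the choice of weight 1 on every edge, and the identity activation at the output node are assumptions made for this sketch; incoming, outgoing, and activate are the ones defined above.

d = 7
w = [ [None]*d for _ in range(d) ]
# edges: 0,1 -> 3 (sum), 1,2 -> 4 (max), 3,4 -> 5 (product), 5 -> 6 (output)
w[0][3] = w[1][3] = 1
w[1][4] = w[2][4] = 1
w[3][5] = w[4][5] = 1
w[5][6] = 1

activate = [None]*d
activate[3] = lambda u,v: u + v        # a = x + y
activate[4] = lambda u,v: max(u,v)     # b = max(y,z)
activate[5] = lambda u,v: u * v        # J = a*b
activate[6] = lambda u: u              # output node passes J through

x_in = [1, 2, 0]                       # (x, y, z)
x = forward_prop(x_in, w)
print(x)                               # [1, 2, 0, 3, 2, 6, 6]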
The output function J is a function of all node outputs. For Figure 4.16,
this means J is a function of x, y, z, a, b.
Therefore, at each node i, we have the derivatives
$$\delta_i = \frac{\partial J}{\partial x_i}(x), \qquad i = 0, 1, 2, \ldots, d-1.$$
Then δ = (δ0 , δ1 , δ2 , . . . , δd−1 ) is the gradient vector. We first compute the
derivatives of J with respect to the output nodes xout , and we assume these
derivatives are assembled into a vector δout .
In Figure 4.16, there is one output node J, and
$$\delta_J = \frac{\partial J}{\partial J} = 1.$$
Hence δout = (1).
We assume the list of nodes is ordered so that the terminal portion of the
list of nodes is the list of output nodes.
For each i, j, let
$$g_{ij} = \frac{\partial f_j}{\partial x_i}.$$
Then we have a d × d gradient matrix g = (gij ). When (i, j) is not an edge,
gij = 0.
These are the local derivatives, not the derivatives obtained by the chain
rule. For example, even though we saw above ∂J/∂y = 1, here the local
derivative is zero, since J does not depend directly on y.
For the example above, with (x1 , x2 , x3 , x4 , x5 , x6 ) = (x, y, z, a, b, J),
$$\frac{\partial J}{\partial x_i} = \sum_{i\to j} \frac{\partial J}{\partial x_j}\cdot\frac{\partial x_j}{\partial x_i} = \sum_{i\to j} \frac{\partial J}{\partial x_j}\cdot\frac{\partial f_j}{\partial x_i}\cdot w_{ij},$$
so
$$\delta_i = \sum_{i\to j} \delta_j\cdot g_{ij}\cdot w_{ij}.$$
The code is
def derivative(x,m,delta,g,i):
    if delta[i] != None: return delta[i]
    elif i >= d-m: return 1
    else:
        return sum([ derivative(x,m,delta,g,j) * g[i][j](*incoming(x,w,j)) * w[i][j] for j in range(d) if g[i][j] != None ])
def backward_prop(x,m,g):
d = len(g)
delta = [None]*d
for i in range(d): delta[i] = derivative(x,m,delta,g,i)
return delta[:-m]
m = 1
delta = backward_prop(x,m,g)
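Continuing the Figure 4.16 sketch from the forward pass above (again an illustration, with the local-derivative functions chosen by hand; each g[i][j] takes the incoming signal list at node j):

g = [ [None]*d for _ in range(d) ]
g[0][3] = lambda u,v: 1                      # d(u+v)/du
g[1][3] = lambda u,v: 1                      # d(u+v)/dv
g[1][4] = lambda u,v: 1 if u > v else 0      # d max(u,v)/du
g[2][4] = lambda u,v: 1 if v > u else 0      # d max(u,v)/dv
g[3][5] = lambda u,v: v                      # d(u*v)/du
g[4][5] = lambda u,v: u                      # d(u*v)/dv
g[5][6] = lambda u: 1                        # identity output node

m = 1
delta = backward_prop(x,m,g)
print(delta)                                 # [2, 5, 0, 2, 3, 1]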
In §7.2, we derive the third version of propagation, this for neural networks.
Exercises
Exercise 4.4.1 For the network in Figure 4.15, use the second version of
the propagation code and x = π/4 to compute the output vector and the
gradient vector
$$x = (x, r, s, y), \qquad \delta = \left(\frac{dy}{dx}, \frac{dy}{dr}, \frac{dy}{ds}, \frac{dy}{dy}\right).$$
Exercise 4.4.2 In Figure 4.20 below, the activation function at each neuron
is the sum of the squares of the incoming signals to that neuron. Starting with
x = 1, compute (x, a, b, c, d, p, q, J), and the corresponding derivatives of J.
Do this by hand and by coding. You should get
(Figure 4.20: a network with input x and neurons a, b, c, d, p, q feeding the output J.)
Exercise 4.4.3 Compute the outgoing vector x and gradient vector δ for the
network in Figure 4.21. The outgoing signal at each neuron is the sum of the
squares of the incoming signals at that neuron. Here the input node signal is
the variable t, so both x and δ will have powers of t in them.
$$f(x) = f(x_1, x_2) = \max(|x_1|, |x_2|), \qquad f(x) = f(x_1, x_2) = \frac{x_1^2}{4} + x_2^2$$
are scalar functions of points in R2 . A level set of f (x) is the set
E: f (x) = 1.
This is the level set corresponding to level 1. One can have level sets corre-
sponding to any level c, f (x) = c. In two dimensions, level sets are also called
contour lines.
For example, the variance ellipse x · Qx = 1 is a level set. The perimeters
(not their interiors) of the square and ellipse in Figure 4.22 are level sets
$$\max(|x_1|, |x_2|) = 1, \qquad \frac{x_1^2}{16} + \frac{x_2^2}{4} = 1.$$
E: f (x) ≤ 1.
This is the sublevel set corresponding to level 1. One can have sublevel sets
corresponding to any level c, f (x) ≤ c. Sublevel sets were used in the def-
inition of proper functions (4.3.9). For example, in Figure 4.22, the (blue)
interior of the square, together with the square itself, is a sublevel set. Sim-
ilarly, the interior of the ellipse, together with the ellipse itself, is a sublevel
set. The interiors of the ellipsoids, together with the ellipsoids themselves, in
Figure 4.31 are sublevel sets. We always consider the level set to be part of
the sublevel set.
The level set f (x) = 1 is the boundary of the sublevel set f (x) ≤ 1. Thus
the square and the ellipse in Figure 4.22 are boundaries of their respective
sublevel sets, and the unit variance ellipsoid x · Qx = 1 is the boundary of
the sublevel set x · Qx ≤ 1.
(Figure 4.24).
(Figure 4.24: the points x0 and x1, the vector v = x1 − x0, the point tv, and the convex combination (1 − t)x0 + tx1 on the segment joining x0 and x1.)
This says the line segment joining any two points (x0 , f (x0 )) and (x1 , f (x1 ))
on the graph of f(x) lies above the graph of f(x).¹ For example, in two
dimensions, the function $f(x) = f(x_1, x_2) = x_1^2 + x_2^2/4$ is convex because its
graph is the paraboloid in Figure 4.25.

¹ We only consider convex functions that are continuous.
If the inequality is strict for 0 < t < 1, then f (x) is strictly convex,
f ((1 − t)x0 + tx1 ) < (1 − t)f (x0 ) + tf (x1 ), for 0 < t < 1.
t1 x1 + t2 x2 + · · · + tN xN
t1 + t2 + · · · + tN = 1.
Fig. 4.25 Convex: The line segment lies above the graph.
Quadratic is Convex
This was derived in the previous section, but here we present a more
geometric proof.
To derive this result, let x0 and x1 be any points, and let v = x1 − x0 .
Then x0 + tv = (1 − t)x0 + tx1 and x1 = x0 + v. Let g0 = Qx0 − b. By (4.3.7),
$$f(x_0+tv) = f(x_0) + t\,v\cdot(Qx_0-b) + \frac{1}{2}t^2\,v\cdot Qv = f(x_0) + t\,v\cdot g_0 + \frac{1}{2}t^2\,v\cdot Qv. \qquad (4.5.3)$$
Inserting t = 1 in (4.5.3), we have f (x1 ) = f (x0 ) + v · g0 + v · Qv/2. Since
t2 ≤ t for 0 ≤ t ≤ 1 and v · Qv ≥ 0, by (4.5.3),
Here are some basic properties and definitions of sets that will be used
in this section and in the exercises. Let a be a point in Rd and let r be a
positive scalar. A closed ball of radius r and center a is the set of points x
satisfying |x − a|2 ≤ r2 . An open ball of radius r and center a is the set of
points x satisfying |x − a|2 < r2 .
Let E be any set in Rd . The complement of E is the set E c of points that
are not in E. If E and F are sets, the intersection E ∩ F is the set of points
that lie in both sets.
A point a is in the interior of E if there is a ball B centered at a contained
in E; this is usually written B ⊂ E. Here the ball may be either open or
closed, the interior is the same.
(Figure: points x1, x2, . . . , x7 in the plane and their convex hull.)
x = t 1 x1 + t 2 x2 + · · · + t N xN
from numpy.random import default_rng
from scipy.spatial import ConvexHull

rng = default_rng()
points = rng.random((30,2))    # assumed here: a random sample of points in the plane
hull = ConvexHull(points)
facet = hull.simplices[0]      # one facet (edge) of the hull
plot(points[facet, 0], points[facet, 1], 'r--')
grid()
show()
Let E be any convex set, and let x0 be any point. We search for a point
x∗ in E that is nearest to x0 . Since |x − x0 | is the distance between x and
x0 , the nearest point x∗ satisfies
|x∗ − x0 |2 = min |x − x0 |2 .
x in E
Here we minimize the distance squared, since the distance minimizer is the
same as the distance-squared minimizer.
If x0 is in E, then clearly x∗ = x0 is the unique distance minimizer. If x0 is
not in E, the results in §A.7 guarantee the existence of a distance minimizer
x∗. This means there is at least one point in E that is nearest to x0.
Let x0 be any point not in E. We show there is exactly one point x∗ in
E nearest to x0 . Let δ be the minimum distance-squared between E and x0 .
To this end, suppose x′ is another point in E at the same distance from x0
as x∗ . Then
|x∗ − x0 |2 = δ = |x′ − x0 |2 .
If xa = (x∗ + x′ )/2 is the average of x∗ and x′ , since E is convex, xa is in E,
hence |xa − x0|² ≥ δ. By expanding the squares, check that
$$4|x_a - x_0|^2 + |x^* - x'|^2 = 2|x^* - x_0|^2 + 2|x' - x_0|^2.$$
Since xa is in E, the left side is no less than 4δ + |x∗ − x′|². On the other hand,
the right side equals 4δ. This implies |x∗ − x′ |2 = 0, or x∗ = x′ , completing
the proof.
Given any point x0 and any convex set E, there is a unique point x∗
in E nearest to x0 .
If f (x) is a function, its graph is the set of points (x, y) in Rd+1 satisfying
y = f (x), and its epigraph is the set of points (x, y) satisfying y ≥ f (x).
If f (x) is defined on Rd , its sublevel sets are in Rd , and its epigraph is in
Rd+1 . Then f (x) is a convex function exactly when its epigraph is a convex
set (Figure 4.25). From convex functions, there are other ways to get convex
sets:
E: f (x) ≤ 1
is a convex set.
H: n · (x − x0 ) = 0. (4.5.4)
H: m · x + b = 0, (4.5.5)
with a nonzero vector m and scalar b. In this section, we use (4.5.4); in §7.6,
we use (4.5.5).
(Figure: a hyperplane through x0 with normal vector n; the hyperplane is the set n · (x − x0) = 0, and the two half-spaces are n · (x − x0) < 0 and n · (x − x0) > 0.)
The vector n is the normal vector to the hyperplane. Note replacing n by any
nonzero multiple of n leaves the hyperplane unchanged.
Separating Hyperplane I
(Figure: a convex set E, a point x0 outside E, the nearest point x∗ in E, the normal n = x0 − x∗, and points x and x∗ + tv in E.)
Expanding, we have
0 ≤ 2(x∗ − x0 ) · v + t|v|2 , 0 ≤ t ≤ 1.
Since this is true for small positive t, sending t → 0 results in v ·(x∗ −x0 ) ≥ 0.
Since n = x0 − x∗ , v = x − x∗ , we obtain
x in E =⇒ (x − x0 ) · n ≤ 0. (4.5.7)
$$\begin{cases} y \ge 0, & \text{if } p = 1,\\ y \le 0, & \text{if } p = 0,\end{cases} \qquad \text{for every sample } x. \qquad (4.5.8)$$
m · xk + b = 0, k = 1, 2, . . . , N. (4.5.9)
Separating Hyperplane II
To derive this result, from Exercise 4.5.9 both K0 and K1 have interiors.
Suppose there is a separating hyperplane m · x + b = 0. If x0 is any point
in the intersection K0 ∩ K1, then we have m · x0 + b ≤ 0 and m · x0 + b ≥ 0,
so m · x0 + b = 0. This shows the separating hyperplane passes through x0 .
Since K0 lies on one side of the hyperplane, x0 cannot be in the interior of
K0 . Similarly for K1 . Hence x0 cannot be in the interior of K0 ∩ K1 . This
implies K0 ∩ K1 has no interior.
Conversely, for the reverse direction, suppose K0 ∩ K1 has no interior.
There are two cases, whether K0 ∩ K1 is empty or not. If K0 ∩ K1 is empty,
then the minimum of |x1 − x0 |2 over all x1 in K1 and x0 in K0 is positive. If
we let
$$|x_1^* - x_0^*|^2 = \min_{x_0 \text{ in } K_0,\ x_1 \text{ in } K_1} |x_1 - x_0|^2, \qquad (4.5.10)$$
(Figure 4.32: the convex sets K0 and K1 with nearest points x0∗ and x1∗ and separating hyperplanes H0 and H1; the scaled sets tK0 and tK1 with separating hyperplane H.)
In the first case, since K0 and K1 don’t intersect, x∗1 is not in K0 , and x∗0
is not in K1 . Let m = x∗1 − x∗0 . Since x∗0 is the point in K0 closest to K1 ,
by separating hyperplane I, the hyperplane H0 : m · (x − x∗0 ) = 0 separates
K0 from x∗1 . Similarly, since x∗1 is the point in K1 closest to K0 , the hyper-
plane H1 : m · (x − x∗1 ) = 0 separates K1 from x∗0 . Thus (Figure 4.32) both
hyperplanes separate K0 from K1 .
In the second case, when K0 and K1 intersect, then the minimum in
(4.5.10) is zero, hence x∗0 = x∗1 = x∗ . Let 0 < t < 1, and let tK0 be K0
scaled towards its mean. Similarly, let tK1 be K1 scaled towards its mean.
By Exercise 4.5.10, both tK0 and tK1 lie in the interiors of K0 and K1
respectively, so tK0 and tK1 do not intersect. By applying the first case to
tK0 and tK1 , and choosing t close to 1, t → 1, we obtain a hyperplane H
separating K0 and K1 . We skip the details.
In Figure 4.22, at the corner of the square, there are multiple supporting
hyperplanes. However, at every other point a on the boundary of the square,
there is a unique (up to scalar multiple) supporting hyperplane. For the ellipse
or ellipsoid, at every point of the boundary, there is a unique supporting
hyperplane.
Now we derive the analogous concepts for convex functions.
We say a function f (x) is convex if g(t) = f (a + tv) is convex for every
point a and direction v. This is our third definition of convex; they are all
equivalent. This way a convex function of a vector variable is reduced to a
convex function of a scalar variable.
We say a function f (x) is strictly convex if g(t) = f (a + tv) is strictly
convex for every point a and direction v. This is the same as saying the
inequality (4.5.1) is strict for 0 < t < 1.
Let f (x) be a function and let a be a point at which there is a gradient
∇f (a). The tangent hyperplane for f (x) at a is
∇f (x∗ ) = 0. (4.5.13)
Let
$$\frac{\partial^2 f}{\partial x_i\,\partial x_j}, \qquad 1 \le i, j \le d,$$
be the second partial derivatives. Then the second derivative of f(x) is the symmetric matrix
$$D^2 f(x) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1\partial x_1} & \dfrac{\partial^2 f}{\partial x_1\partial x_2} & \cdots \\[2mm] \dfrac{\partial^2 f}{\partial x_2\partial x_1} & \dfrac{\partial^2 f}{\partial x_2\partial x_2} & \cdots \\[2mm] \cdots & \cdots & \cdots \\[2mm] \dfrac{\partial^2 f}{\partial x_d\partial x_1} & \dfrac{\partial^2 f}{\partial x_d\partial x_2} & \cdots \end{pmatrix}$$
Replacing x by x + tv in (4.3.3), we have
$$\frac{d}{dt}\,f(x+tv) = \nabla f(x+tv)\cdot v.$$
Differentiating and using the chain rule again,
$$\frac{d^2}{dt^2}\Big|_{t=0} f(x+tv) = v\cdot Qv. \qquad (4.5.14)$$
This implies
$$\frac{d^2}{dt^2}\Big|_{t=0} f(x+tv) = 0 \quad\text{only when } v = 0. \qquad (4.5.15)$$
If m ≤ D²f(x) ≤ L, then
$$\frac{m}{2}\,|x-a|^2 \le f(x) - f(a) - \nabla f(a)\cdot(x-a) \le \frac{L}{2}\,|x-a|^2. \qquad (4.5.16)$$
$$\frac{m}{2}\,|x-x^*|^2 \le f(x) - f(x^*) \le \frac{L}{2}\,|x-x^*|^2. \qquad (4.5.17)$$
Here the maximum is over all vectors x, and p = (p1 , p2 , . . . , pd ), the dual
variable, also has d features. We will work in situations where a maximizer
exists in (4.5.18).
Let Q > 0 be a positive matrix. The simplest example is
$$f(x) = \frac{1}{2}\,x\cdot Qx \implies g(p) = \frac{1}{2}\,p\cdot Q^{-1}p.$$
This is established by the identity
$$\frac{1}{2}(p-Qx)\cdot Q^{-1}(p-Qx) = \frac{1}{2}\,p\cdot Q^{-1}p - p\cdot x + \frac{1}{2}\,x\cdot Qx. \qquad (4.5.19)$$
To see this, since the left side of (4.5.19) is greater or equal to zero, we have
$$\frac{1}{2}\,p\cdot Q^{-1}p - p\cdot x + \frac{1}{2}\,x\cdot Qx \ge 0.$$
Since (4.5.19) equals zero iff p = Qx, we are led to (4.5.18).
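A quick numerical sketch of this dual pair (Q and p chosen arbitrarily; scipy.optimize.minimize maximizes p·x − f(x) by minimizing its negative):

from numpy import array, allclose
from numpy.linalg import inv
from scipy.optimize import minimize

Q = array([[2.0, 1.0],[1.0, 3.0]])
p = array([1.0, -2.0])

def f(x): return x @ Q @ x / 2
neg = lambda x: f(x) - p @ x                 # maximize p.x - f(x)

g_numeric = -minimize(neg, x0=[0.0, 0.0]).fun
g_formula = p @ inv(Q) @ p / 2               # claimed dual (1/2) p.Q^{-1}.p
print(allclose(g_numeric, g_formula))        # True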
Moreover, switching p · Q−1 p with x · Qx, we also have
Thus the convex dual of the convex dual of f (x) is f (x). In §5.6, we compute
the convex dual of the cumulant-generating function.
If x is a maximizer in (4.5.18), then the derivative is zero,
0 = ∇x (p · x − f (x)) =⇒ p = ∇x f (x).
p = ∇x f (x) ⇐⇒ x = ∇p g(p).
This yields
Using this, and writing out (4.5.16) for g(p) instead of f (x) (we skip the
details) yields
$$(p-q)\cdot(x-a) \ge \frac{mL}{m+L}\,|x-a|^2 + \frac{1}{m+L}\,|p-q|^2. \qquad (4.5.22)$$
This is derived by using (4.5.21); the details are in [3]. This result is used
in gradient descent.
For the exercises below, we use the properties of sets defined earlier in this
section: interior and boundary.
Exercises
Exercise 4.5.1 If a two-class dataset does not lie in a hyperplane and is sepa-
rable, then the means of the two classes are distinct. (Argue by contradiction:
Assume the means are equal, and look at levels of samples.)
Exercise 4.5.2 Let e0 = 0 and let e1 , e2 , . . . , ed be the one-hot encoded
basis in Rd . The d-simplex Σd is the convex hull of e0 , e1 , e2 , . . . , ed . Draw
pictures of Σ1 , Σ2 , and Σ3 . Show Σd is the suspension (§1.6) of Σd−1 from
ed . Conclude
$$\mathrm{Vol}(\Sigma_d) = \frac{1}{d!}, \qquad d = 0, 1, 2, 3, \ldots$$
(Since Σ0 is one point, we start with Vol(Σ0 ) = 1.)
Exercise 4.5.3 Let x1 , x2 , . . . , xd be positive scalars. Use convexity of exp
to show
$$\frac{1}{d}\sum_{i=1}^{d} x_i \ge (x_1 x_2 \cdots x_d)^{1/d}.$$
5.1 Probability
Then the event A1 consists of the outcomes (1, 6), (2, 5), (3, 4), (4, 3), (5, 2),
(6, 1). Here #(S) = 36 and #(A1 ) = 6.
Let A2 be the event of obtaining three heads when tossing a coin seven
times. Here #(S) = 27 = 128 and #(A2 ) = 35, which is the number of ways
you can choose three things out of seven things (§A.1):
$$\#(A_2) = \text{7-choose-3} = \binom{7}{3} = \frac{7\cdot 6\cdot 5}{1\cdot 2\cdot 3} = 35.$$
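This count is easy to check in Python (a one-line sketch using the standard library):

from math import comb
print(comb(7, 3))   # 35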
A probability on S satisfies
1. 0 ≤ P rob(A) ≤ 1 for every event A in S,
2. P rob(S) = 1,
3. (Additivity) If A, B, . . . are exclusive events in S,
Suppose the number of outcomes #(S) is finite. Then the simplest prob-
ability is the discrete uniform distribution, assigning to each event A the
proportion of outcomes in A,
$$Prob(A) = \frac{\#(A)}{\#(S)}. \qquad (5.1.2)$$
When this is so and #(S) = N , each outcome has probability 1/N , and we
say the outcomes are equally likely.
Here are examples of discrete uniform distributions.
1. A coin is fair if, after one toss, the two outcomes are equally likely. Then
P rob(heads) = P rob(tails) = 1/2.
2. A 6-sided die is fair if, after one roll, the outcomes are equally likely. Let
A be the event that the outcome is less than 3. Since the outcome is then
1 or 2, P rob(A) = 2/6 = 1/3.
3. With A1 as above, assuming the dice are fair, leads to P rob(A1 ) = 6/36 =
1/6.
4. With A2 as above, assuming the coin is fair, leads to P rob(A2 ) = 35/128.
5. With A3 as above, assuming the dice are fair, leads to P rob(A3 ) = 11/36.
Here are some consequences of the probability axioms. Since A and Ac are
exclusive and exhaustive, the first consequence is
Complementary Probabilities
Monotonicity of Probabilities
Additivity of Probabilities
Sub-Additivity of Probabilities
Since a single number a is a sub-interval [a, a] with zero length, the event A
of sampling X exactly equal to 0.5 is a null event. Since A is possible, A is
not impossible. Moreover Ac is a sure event, but Ac is not certain.
Let A∞,k be the event of infinite tuple outcomes with exactly k heads.
Then each outcome x in A∞,k is in An,k for some n. In fact, an outcome
x in A∞,k is necessarily in An,k for all sufficiently large n. This means the
following: if x is in A∞,k , then for some N ≥ 1, x is in An,k for every n greater
or equal than N . Therefore, for each outcome x in A∞,k , there is some N ,
depending on x, with x in
$$\bigcap_{n\ge N} A_{n,k} = A_{N,k}\cap A_{N+1,k}\cap A_{N+2,k}\cap\ldots.$$
This shows the event $A_{\infty,k}$ is part of the union of the events $\bigcap_{n\ge N} A_{n,k}$ over N = 1, 2, . . . .
Since $\bigcap_{n\ge N} A_{n,k}$ is part of $A_{n,k}$ for every n ≥ N, by monotonicity (5.1.3),
$$Prob\left(\bigcap_{n\ge N} A_{n,k}\right) \le \binom{n}{k}\,2^{-n}, \qquad\text{for every } n \ge N.$$
$$Prob(A\mid B) = \frac{Prob(A\text{ and }B)}{Prob(B)}. \qquad (5.1.7)$$
When A and B are independent, the conditional probability equals the uncon-
ditional probability.
Are A1 and A3 above independent? Since P rob(A1 ) = 6/36 = 1/6 and
and
$$Prob(B=0\text{ and }G=1) = Prob(G=1\mid 1\text{ child})\,Prob(1\text{ child}) = \frac{1}{2}\cdot 0.20 = 0.1,$$
and
$$Prob(B=1\text{ and }G=2) = Prob(G=2\mid 3\text{ children})\,Prob(3\text{ children}) = \frac{3}{8}\cdot 0.30 = .1125.$$
p = .5
n = 10
N = 20
v = binomial(n,p,N)
print(v)
returns
[9 6 7 4 4 4 3 3 7 5 6 4 6 9 4 5 4 7 6 7]
p = .5
for n in [5,50,500]: print(binomial(n,p,1))
This returns the count of heads after 5 tosses, 50 tosses, and 500 tosses,
3, 28, 266
The proportions are the count divided by the total number of tosses in the
experiment. For the above three experiments, the proportions after 5 tosses,
50 tosses, and 500 tosses, are
Fig. 5.3 100,000 sessions, with 5, 15, 50, and 500 tosses per session.
Now we repeat each experiment 100,000 times and we plot the results in
a histogram.
N = 100000
p = .5
for n in [5,50,500]:
data = binomial(n,p,N)
hist(data,bins=n,edgecolor ='Black')
grid()
show()
The takeaway from these graphs is the pair of fundamental results of probability:
For large sample size, the shape of the graph of the proportions or
counts is approximately normal. The normal distribution is studied in
§5.4. Another way of saying this is: For large sample size, the shape
of the sample mean histogram is approximately normal.
The law of large numbers is qualitative and the central limit theorem is
quantitative. While the law of large numbers says one thing is close to another,
it does not say how close. The central limit theorem provides a numerical
measure of closeness, using the normal distribution.
One may think that the LLN and the CLT above depend on some aspect
of the binomial distribution. After all, the binomial is a specific formula and
something about this formula may lead to the LLN and the CLT. To show
that this is not at all the case, to show that the LLN and the CLT are
universal, we bring in the petal lengths of the Iris dataset. This time the
experiment is not something we invent, it is a result of something arising in
nature, Iris petal lengths.
We begin by loading the Iris dataset,
iris = datasets.load_iris()
dataset = iris["data"]
iris["feature_names"]
This code shows the petal lengths are the third feature in the dataset, and
we compute the mean of the petal lengths using
petal_lengths = dataset[:,2]
mean(petal_lengths)
This returns the petal length population mean µ = 3.758. If we plot the
petal lengths in a histogram with 50 bins using the code
grid()
hist(petal_lengths,bins=50)
show()
Now we sample the Iris dataset randomly. More generally, we take a ran-
dom batch of samples of size n and take the mean of the samples in the batch.
For example, the following code grabs a batch of n = 5 petal lengths X1,
# n = batch_size
def random_batch_mean(n):
rng.shuffle(petal_lengths)
return mean(petal_lengths[:n])
random_batch_mean(5)
This code shuffles the dataset, then selects the first n petal lengths, then
returns their mean.
To sample a single petal length randomly 100,000 times, we run the code
N = 100000
n = 1
Since we are sampling single petal lengths, here we take n = 1. This code
returns the histogram in Figure 5.5.
In Figure 5.4, the bin heights add up to 150. In Figure 5.5, the bin heights
add up to 100,000. Moreover, while the shapes of the histograms are almost
identical, a careful examination shows the histograms are not identical. Nev-
ertheless, there is no essential difference between the two figures.
Fig. 5.6 Iris petal lengths batch means sampled 100,000 times, batch sizes 3, 5, 20.
Now repeat the same experiment, but with batches of various sizes, and
plot the resulting histograms. If we do this with batches of size n = 3, n = 5,
n = 20 using
figure(figsize=(8,4))
# three subplots
rows, cols = 1, 3
N = 100000
show()
Exercises
def sums(dataset,k):
if k == 1: return dataset
else:
s = sums(dataset,k-1)
return array([ a+b for a in dataset for b in s ])
for k in range(5):
s = sums(dataset,k)
grid()
hist(s,bins=50,edgecolor="k")
show()
for k = 1, 2, 3, 4, . . . . What does this code do? What does it return? What
pattern do you see? What if dataset were changed? What if the samples in
the dataset were vectors?
Exercise 5.1.7 Let A and B be any events, not necessarily exclusive. Show
Exercise 5.1.8 Let A and B be any events, not necessarily exclusive. Extend
(5.1.1) to show
Exercise 5.1.9 [30] There is a 60% chance an event A will occur. If A does
not occur, there is a 10% chance B occurs. What is the chance A or B occurs?
(Start with two events, then go from two to three events.) With a = P rob(Ac ),
b = P rob(B c ), c = P rob(C c ), this exercise is the same as Exercise A.3.4.
Exercise 5.1.11 Toss a coin infinitely many times, and let A1 be the out-
comes x = (x1 , x2 , . . . ) where the limit of the sample means
$$\lim_{n\to\infty} \frac{x_1+x_2+\cdots+x_n}{n}$$
equals 1. Show the event A1 is neither certain nor impossible. Here each xk is
1 or 0. More generally, let t = a/b be any fraction, and let At be the event
of outcomes where the limit of the sample means equals t. Show At is neither
certain nor impossible.
Suppose a coin is tossed repeatedly, landing heads or tails each time. After
tossing the coin 100 times, we obtain 53 heads. What can we say about this
coin? Can we claim the coin is fair? Can we claim the probability of obtaining
heads is .53?
Whatever claims we make about the coin, they should be reliable, in that
they should more or less hold up to repeated verification.
To obtain reliable claims, we therefore repeat the above experiment 20
times, obtaining for example the following count of heads
[57, 49, 55, 44, 55, 50, 49, 50, 53, 49, 53, 50, 51, 53, 53, 54, 48, 51, 50, 53].
On the other hand, suppose someone else repeats the same experiment 20
times with a different coin, and obtains
[69, 70, 79, 74, 63, 70, 68, 71, 71, 73, 65, 63, 68, 71, 71, 64, 73, 70, 78, 67].
In this case, one suspects the two coins are statistically distinct, and have
different probabilities of obtaining heads.
In this section, we study how the probabilities of coin-tossing behave, with
the goal of answering the question: Is a given coin fair?
p + q = 1.
the particular coin being tossed. When p = 1/2, P rob(H) = P rob(T ), and
we say the coin is fair.
If we toss the coin twice, we obtain one of four possibilities, HH, HT ,
T H, or T T . If we make the natural assumption that the coin has no memory,
that the result of the first toss has no bearing on the result of the second
toss, then the probabilities are
$$p^2 + pq + qp + q^2 = (p+q)^2 = 1^2 = 1.$$
We use (5.1.7) to compute the probability that we obtain heads on the sec-
ond toss given that we obtain tails on the first toss. Introduce the convenient
notation
$$X_n = \begin{cases} 1, & \text{if the } n\text{-th toss is heads},\\ 0, & \text{if the } n\text{-th toss is tails}.\end{cases}$$
Then Xn is a random variable (§5.3) and represents a numerical reward
function of the outcome (heads or tails) at the n-th toss.
With this notation, (5.2.1) may be rewritten
P rob(X1 = 1 and X2 = 1) = p2 ,
P rob(X1 = 1 and X2 = 0) = pq,
P rob(X1 = 0 and X2 = 1) = qp,
P rob(X1 = 0 and X2 = 0) = q 2 .
$$Prob(X_2=1\mid X_1=0) = \frac{Prob(X_1=0\text{ and }X_2=1)}{Prob(X_1=0)} = \frac{qp}{q} = p = Prob(X_2=1),$$
so
P rob(X2 = 1 | X1 = 0) = P rob(X2 = 1).
Thus X1 = 0 has no effect on the probability that X2 = 1, and similarly for
the other possibilities. This is often referred to as the independence of the
coin tosses. We conclude
P rob(Xn = 1) = p, P rob(Xn = 0) = q = 1 − p, n ≥ 1.
as it should be.
Assume we know p = Prob(Xn = 1). Since the number of ways of choosing
k heads from n tosses is the binomial coefficient $\binom{n}{k}$ (see §A.2), and the
probabilities of distinct tosses multiply, the probability of k heads in n tosses
is as follows.
$$Prob(S_n = k) = \binom{n}{k}\,p^k (1-p)^{n-k}. \qquad (5.2.3)$$
n, p, N = 5, .5, 10
k,n,p = 5, 10, .5
B = binom(n,p)
# probability of k heads
B.pmf(k)
k,n,p = 5, 10, .5
allclose(pmf1,pmf2)
returns True.
Be careful to distinguish between
numpy.random.binomial and scipy.stats.binom.
The former returns samples from a binomial distribution, while the latter
returns a binomial random variable. Samples are just numbers; random vari-
ables have cdf’s, pmf’s or pdf’s, etc.
Toss a coin n times, and let #n (p) be the number of outcomes where
the heads-proportion is p. Then
1 This result exhibits the entropy as the log of the number of combinations, or configura-
tions, or possibilities, which is the original definition of the physicist Boltzmann (1875).
In more detail, using Stirling’s approximation (A.1.6), one can derive the
asymptotic equality
$$\#_n(p) \approx \frac{1}{\sqrt{2\pi n}}\cdot\frac{1}{\sqrt{p(1-p)}}\cdot e^{nH(p)}, \qquad\text{for } n \text{ large}. \qquad (5.2.4)$$
Figure 5.7 is returned by the code below, which compares both sides of
the asymptotic equality (5.2.4) for n = 10 and 0 ≤ p ≤ 1.
from scipy.special import comb

# absolute entropy H(p) from Section 4.2 (natural log)
def H(p): return -p*log(p) - (1-p)*log(1-p)

n = 10
p = arange(0,1,.01)

def approx(n,p):
    return exp(n*H(p))/sqrt(2*n*pi*p*(1-p))

grid()
plot(p, comb(n,n*p), label="binomial coefficient")
plot(p, approx(n,p), label="entropy approximation")
title("number of tosses " + "$n=" + str(n) +"$", usetex=True)
legend()
show()
Assume a coin’s bias is q. Toss the coin n times, and let Pn (p, q) be
the probability of obtaining tosses where the heads-proportion is p.
Then
In more detail, using Stirling’s approximation (A.1.6), one can derive the
asymptotic equality
$$P_n(p,q) \approx \frac{1}{\sqrt{2\pi n}}\cdot\frac{1}{\sqrt{p(1-p)}}\cdot e^{nH(p,q)}, \qquad\text{for } n \text{ large}. \qquad (5.2.6)$$
If we set
$$f(p) = (1-p+cp)^n, \qquad F(p) = \frac{(1-p+cp)^{n+1}}{(c-1)(n+1)},$$
$$\frac{1}{n+1}\cdot\frac{c^{n+1}-1}{c-1} = \sum_{k=0}^{n} c^k\cdot\frac{1}{n+1}.$$
Notice the difference: In (5.2.3), we know the coin’s bias p, and obtain the
binomial distribution, while in (5.2.10), since we don’t know p, and there are
n + 1 possibilities 0 ≤ k ≤ n, we obtain the uniform distribution 1/(n + 1).
We now turn things around: Suppose we toss the coin n times, and obtain
k heads. How can we use this data to estimate the coin’s bias p?
To this end, we introduce the fundamental
Bayes Theorem I
$$Prob(A\mid B) = \frac{Prob(B\mid A)\cdot Prob(A)}{Prob(B)}. \qquad (5.2.11)$$
$$Prob(A\mid B) = \frac{Prob(A\text{ and }B)}{Prob(B)} = \frac{Prob(A\text{ and }B)}{Prob(A)}\cdot\frac{Prob(A)}{Prob(B)} = Prob(B\mid A)\cdot\frac{Prob(A)}{Prob(B)}.$$
$$Prob(p\mid S_n=k) = Prob(S_n=k\mid p)\cdot\frac{Prob(p)}{Prob(S_n=k)}. \qquad (5.2.12)$$
Summarizing,
n = 10
k = 7
grid()
p = arange(0,1,.01)
plot(p,f(p),color="blue",linewidth=.5)
show()
Because Bayes Theorem is so useful, here are two alternate forms. Suppose
A1 , A2 , . . . , Ad are several exclusive and exhaustive events, so
Then by the law of total probability (5.1.10) and the first version (5.2.11),
we have the second version
Bayes Theorem II
$$Prob(A_i\mid B) = \frac{Prob(B\mid A_i)\,Prob(A_i)}{Prob(B\mid A_1)\,Prob(A_1) + Prob(B\mid A_2)\,Prob(A_2) + \ldots}. \qquad (5.2.14)$$
$$Prob(A\mid B) = \frac{Prob(B\mid A)\,Prob(A)}{Prob(B\mid A)\,Prob(A) + Prob(B\mid A^c)\,Prob(A^c)}.$$
As an example, suppose 20% of the population are smokers, and the preva-
lence of lung cancer among smokers is 90%. Suppose also 80% of non-smokers
are cancer-free. Then what is the probability that someone who has cancer
is actually a smoker?
To use the second version, set A = smoker and B = cancer. This means
A is the event that a randomly sampled person is a smoker, and B is the
event that a randomly sampled person has cancer. Then
and
$$Prob(A\mid B) = \frac{Prob(B\mid A)\,Prob(A)}{Prob(B\mid A)\,Prob(A) + Prob(B\mid A^c)\,Prob(A^c)} = \frac{.9\times .2}{.9\times .2 + .2\times .8} = .52941.$$
Thus the probability that a person with lung cancer is indeed a smoker is
53%.
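This arithmetic is a one-line check in Python (a quick sketch):

num = .9 * .2              # Prob(B|A) Prob(A)
den = .9 * .2 + .2 * .8    # Prob(B|A) Prob(A) + Prob(B|A^c) Prob(A^c)
print(num/den)             # 0.52941...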
To describe the third version of Bayes theorem, bring in the logistic func-
tion. Let
$$p = \sigma(y) = \frac{1}{1+e^{-y}}. \qquad (5.2.15)$$
This is the logistic function or sigmoid function. The logistic function takes
as inputs real numbers y, and returns as outputs probabilities p (Figure 5.9),
and is plotted in Figure 5.10.
p = expit(y)
$$x^* = -\frac{w_0}{w} = -\frac{1}{2}\cdot\frac{-m_H^2+m_T^2}{m_H-m_T} = \frac{m_H+m_T}{2},$$
which is the midpoint of the line segment joining mH and mT .
(Figure: the cut-off point midway between mH and mT on the line.)
More generally, if the points x are in Rd , then the same question may be
asked, using the normal distribution with variance I in Rd (§5.5). In this
case, w is a nonzero vector, and w0 is still a scalar,
$$w = m_H - m_T, \qquad w_0 = -\frac{1}{2}|m_H|^2 + \frac{1}{2}|m_T|^2.$$
Then the cut-off or decision boundary between the two groups is the hyper-
plane
w · x + w0 = 0,
which is the hyperplane halfway between mH and mT , and orthogonal to the
vector joining mH and mT . Written this way, the probability
(Figure: the cut-off hyperplane halfway between mH and mT, orthogonal to the segment joining them.)
Exercises
Exercise 5.2.2 A coin with bias p is tossed. What is the probability of ob-
taining 5 heads in 8 tosses?
Exercise 5.2.3 A coin with bias p is tossed 8 times and 5 heads are obtained.
What is the most likely value for p?
Exercise 5.2.4 A coin with unknown bias p is tossed 8 times and 5 heads
are obtained. Assuming a uniform prior for p, what is the probability that
p lies between 0.5 and 0.7? (Use scipy.integrate.quad (§A.5) to integrate
(5.2.13) over 0.5 ≤ p ≤ 0.7.)
Exercise 5.2.5 A fair coin is tossed n times. Sometimes you get more heads
than tails, sometimes the reverse. If you’re really lucky, the number of heads
may equal exactly the number of tails. What is the least n for which the
probability of this happening is less than 10%?
Exercise 5.2.6 A fair coin is tossed n times. Sometimes you get more heads
than tails, sometimes the reverse. If you’re really lucky, the number of heads
may equal exactly the number of tails. What is the least n for which the
probability of this happening is less than 10%?
In §1.3, this was called vectorization. In this section, random variables are
scalar-valued. In §5.5 and §6.4, they are vector-valued.
for this quantity, then we are asking to compute the probability of the event
that X lies in the interval [a, b]. If we don’t know anything about X, then
we can’t figure out the probability, and there is nothing we can say. Knowing
something about X means knowing the distribution of X: Where X is more
likely to be and where X is less likely to be. Any quantity X where proba-
Then E(X) is the mean of the random variable X associated to the dataset.
Similarly,
$$E(X^2) = \frac{1}{N}\sum_{k=1}^{N} x_k^2$$
P (X = a) = p, P (X = b) = q, P (X = c) = r.
E(X) = ap + bq + cr.
V ar(X) = E(X 2 ) − µ2 .
The Bernoulli random variable is the outcome of a single toss of a coin with
bias p, and the binomial random variable is the outcome of n tosses of a coin
with bias p.
As we see below (5.3.23), the mean of a binomial random variable is np. If
we let the number of tosses grow without bound, n → ∞, while keeping the
mean fixed at λ = np, we obtain the Poisson random variable.
In the text and the exercises, we consider several continuous random vari-
ables,
unif orm, exponential, logistic, arcsine,
and
normal, chi-squared, student.
E(X) = x1 p1 + x2 p2 + . . . . (5.3.3)
E(1) = p1 + p2 + · · · = 1.
If
rjk = P rob(X = xj and Y = yk ), j, k = 1, 2, . . . ,
then, by additivity of probabilities (5.1.1),
$$p_j = Prob(X=x_j) = Prob(X=x_j\text{ and }Y=y_1) + Prob(X=x_j\text{ and }Y=y_2) + \cdots = r_{j1}+r_{j2}+\cdots = \sum_k r_{jk}.$$
Similarly,
$$q_k = r_{1k}+r_{2k}+\cdots = \sum_j r_{jk}.$$
We conclude
E(X + Y ) = E(X) + E(Y ).
Since we already know E(aX) = aE(X) (5.3.5), this derives linearity.
The variance measures the spread of X about its mean. Since the mean of
aX is aµ, the variance of aX is the mean of (aX − aµ)2 = a2 (X − µ)2 . Thus
V ar(aX) = a2 V ar(X).
However, the variance of a sum X + Y is not simply the sum of the variances
of X and Y : This only happens if X and Y are independent, see (5.3.21).
Using (5.3.2), we can view a dataset as the samples of a random variable
X. In this case, the mean and variance of X are the same as the mean and
variance of the dataset, as defined by (1.5.1) and (1.5.2).
When X is a constant, then X = µ, so V ar(X) = 0. Conversely, if
V ar(X) = 0, then by definition
This displays the variance in terms of the first moment E(X) and the second
moment E(X 2 ). Equivalently,
P rob(X = 1) = p, P rob(X = 0) = 1 − p.
$$E(X^2) = 1^2\cdot Prob(X=1) + 0^2\cdot Prob(X=0) = p.$$
From this,
$$M'(t) = E\bigl(Xe^{tX}\bigr).$$
When t = 0,
M ′ (0) = E(X) = µ.
Similarly, since the derivative of log x is 1/x, for the cumulant-generating
function,
$$Z'(0) = \frac{M'(0)}{M(0)} = E(X) = \mu.$$
The second derivative of M (t) is
$$M''(t) = E\bigl(X^2 e^{tX}\bigr),$$
Definition of Uncorrelated
Random variables X and Y are uncorrelated if
We investigate when X and Y are uncorrelated. Here a > 0, b > 0, and c > 0.
First, because the total probability equals 1,
a + 2b + c = 1. (5.3.15)
Also we have
and
E(X) = a − c, E(Y ) = a + b.
Now X and Y are uncorrelated if
a,b,c = symbols('a,b,c')
eq1 = a + 2*b + c - 1
eq2 = a - b - (a-c)*(a+b)
solutions = solve([eq1,eq2],a,b)
print(solutions)
Definition of Independence
for all positive powers n and m. When X and Y are discrete, this is
equivalent to the events X = x and Y = y being independent, for
every value x of X and every value y of Y .
a − b = (a − b)(a + b).
Expanding the exponentials into their series, and using (5.3.18), one can show
$$M_X(t) = \frac{1}{6}\cdot\frac{e^{7t}-e^{t}}{e^{t}-1}.$$
By Exercise 5.3.1 again,
$$M_{X+Y}(t) = \frac{1}{12}\cdot\frac{e^{13t}-e^{t}}{e^{t}-1}.$$
It follows, by (5.3.20),
$$\frac{1}{12}\cdot\frac{e^{13t}-e^{t}}{e^{t}-1} = \frac{1}{6}\cdot\frac{e^{7t}-e^{t}}{e^{t}-1}\cdot M_Y(t).$$
Factoring $e^{13t}-e^{t} = (e^{7t}-e^{t})(e^{6t}+1)$, we obtain
$$M_Y(t) = \frac{1}{2}\bigl(e^{6t}+1\bigr).$$
This says
$$Prob(Y=0) = \frac{1}{2}, \qquad Prob(Y=6) = \frac{1}{2},$$
and all other probabilities are zero.
S = X1 + X2 + · · · + Xn .
Then
The next simplest discrete random variable is the binomial random vari-
able,
S = X1 + X2 + · · · + Xn
obtained from n independent Bernoulli random variables.
Then S has values 0, 1, 2, . . . , n, and the probability mass function
$$p(k) = \binom{n}{k}\,p^k(1-p)^{n-k}, \qquad k = 0, 1, 2, \ldots, n. \qquad (5.3.22)$$
Since the cdf F(x) is the sum of the pmf p(k) for k ≤ x, the code

n, p = 8, .5
B = binom(n,p)
# print k, pmf, cdf for each k (loop completed here; the original listing is cut off)
for k in range(n+1): print(k, B.pmf(k), B.cdf(k))
returns
0 0.003906250000000007 0.00390625
1 0.031249999999999983 0.03515625
2 0.10937500000000004 0.14453125
3 0.21874999999999992 0.36328125
4 0.27343749999999994 0.63671875
5 0.2187499999999999 0.85546875
6 0.10937500000000004 0.96484375
7 0.031249999999999983 0.99609375
8 0.00390625 1.0
Since
$$E(\hat p_n) = p, \qquad Var(\hat p_n) = \frac{p(1-p)}{n}. \qquad (5.3.24)$$
By the binomial theorem, the moment-generating function is
$$E\bigl(e^{tS}\bigr) = \sum_{k=0}^{n} e^{tk}\binom{n}{k}\,p^k(1-p)^{n-k} = \bigl(pe^t + 1 - p\bigr)^n.$$
so the total probability is one. The Python code for a Poisson random variable
is
lamda = 1
P = poisson(lamda)
Here the integration is over the entire range of the random variable: If X
takes values in the interval [a, b], the integral is from a to b. For a normal
random variable, the range is (−∞, ∞). For a chi-squared random variable,
the range is (0, ∞). Below, when we do not specify the limits of integration,
the integral is taken over the whole range of X.
More generally, let f (x) be a function. The mean of f (X) or expectation
of f(X) is
$$E(f(X)) = \int f(x)\,p(x)\,dx. \qquad (5.3.28)$$
This only holds when the integral is over the complete range of X. When this
is not so,
$$Prob(a < X < b) = \int_a^b p(x)\,dx$$
is the green area in Figure 5.16. Thus
Since
$$F(x) = \frac{1}{2}x^2 \implies F'(x) = x,$$
by the fundamental theorem of calculus (A.5.2),
$$E(X) = \int_0^1 x\,dx = F(1) - F(0) = \frac{1}{2}.$$
In particular, if [a, b] = [−1, 1], then the mean is zero, the variance is 1/3,
and
$$E(f(X)) = \frac{1}{2}\int_{-1}^{1} f(x)\,dx.$$
When X is discrete,
$$F(x) = \sum_{x_k\le x} p_k.$$
When X is continuous,
$$F(x) = \int_{-\infty}^{x} p(z)\,dz.$$
Then each green area in Figure 5.16 is the difference between two areas,
F (b) − F (a).
              discrete                                     continuous
density       pmf                                          pdf
distribution  cdf                                          cdf
sum           cdf(x) = sum([pmf(k) for k in range(x+1)])   cdf(x) = integrate(pdf, x)
difference    pmf(k) = cdf(k) - cdf(k-1)                   pdf(x) = derivative(cdf, x)
Table 5.18 summarizes the situation. For the distribution on the left in
Figure 5.16, the cumulative distribution function is in Figure 5.17.
Let X and Y be independent uniform random variables on [0, 1], and let
Z = max(X, Y ). We compute the pdf p(x), the cdf F (x), and the mean of
Z. By definition of max(X, Y ),
$$Prob(\max(X,Y)\le x) = Prob(X\le x\text{ and }Y\le x) = Prob(X\le x)\,Prob(Y\le x) = x^2, \qquad 0\le x\le 1.$$
Hence
$$F(x) = Prob(\max(X,Y)\le x) = \begin{cases} 0, & \text{if } x < 0,\\ x^2, & \text{if } 0\le x\le 1,\\ 1, & \text{if } x > 1.\end{cases}$$
From this,
$$p(x) = F'(x) = \begin{cases} 0, & \text{if } x < 0,\\ 2x, & \text{if } 0\le x\le 1,\\ 0, & \text{if } x > 1.\end{cases}$$
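The mean of Z is then $\int_0^1 x\cdot 2x\,dx = 2/3$. A quick Monte Carlo sketch (not from the text) agrees:

from numpy import maximum
from numpy.random import default_rng

rng = default_rng(0)
X = rng.random(10**6)
Y = rng.random(10**6)
Z = maximum(X, Y)
print(Z.mean())      # approximately 2/3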
E(X n ) = E(Y n ), n ≥ 1.
for every interval [a, b], and equivalent to having the same moment-
generating functions,
MX (t) = MY (t)
for every t.
Then
$$E(\bar X_n) = \frac{1}{n}\bigl(E(X_1)+E(X_2)+\cdots+E(X_n)\bigr) = \frac{1}{n}\cdot n\mu = \mu.$$
We conclude the mean of the sample mean equals the population mean.
Now let σ 2 be the common variance of X1 , X2 , . . . , Xn . By (5.3.21), the
variance of Sn is nσ 2 , hence the variance of X̄n is σ 2 /n. Summarizing,
$$E(\bar X_n) = \mu, \qquad Var(\bar X_n) = \frac{\sigma^2}{n}, \qquad (5.3.34)$$
and
$$\sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \qquad (5.3.35)$$
is standard.
Exercises
$$M_X(t) = \frac{1}{b-a}\cdot\frac{e^{tb}-e^{ta}}{e^t-1}.$$
Exercise 5.3.2 Let A and B be events and let X and Y be the Bernoulli
random variables corresponding to A and B (5.3.10). Show that A and B are
independent (5.1.9) if and only if X and Y are independent (5.3.18).
Exercise 5.3.3 [30] Let X be a binomial random variable with mean 7 and
variance 3.5. What are P rob(X = 4) and P rob(X > 14)?
Exercise 5.3.4 The proportion of adults who own a cell phone in a certain
Canadian city is believed to be 90%. Thirty adults are selected at random
from the city. Let X be the number of people in the sample who own a cell
phone. What is the distribution of the random variable X?
Exercise 5.3.5 If two random samples of sizes n1 and n2 are selected inde-
pendently from two populations with means µ1 and µ2 , show the mean of the
sample mean difference X̄1 − X̄2 equals µ1 − µ2 . If σ1 and σ2 are standard
deviations of the two populations, then the standard deviation of X̄1 − X̄2
equals
$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$
Exercise 5.3.6 Check (5.3.30) and (5.3.31).
Exercise 5.3.7 [30] You arrive at the bus stop at 10:00am, knowing the bus
will arrive at some time uniformly distributed during the next 30 minutes.
What is the probability you have to wait longer than 10 minutes? Given that
the bus hasn’t arrived by 10:15am, what is the probability that you’ll have
to wait at least an additional 10 minutes?
Exercise 5.3.8 If X and Y satisfy (5.3.14), show X and 2Y −1 are identically
distributed for any a, b, c.
Exercise 5.3.9 Let B and G be the number of boys and the number of girls
in a randomly selected family with probabilities as in Table 5.2. Are B and
G independent? Are they identically distributed?
Exercise 5.3.10 If X and Y satisfy (5.3.14), use Python to verify (5.3.17)
and (5.3.19).
Exercise 5.3.11 If X and Y satisfy (5.3.14), compute V ar(X) and V ar(Y )
in terms of a, b, c. What condition on a, b, c maximizes V ar(X)? What
condition on a, b, c maximizes V ar(Y )?
Exercise 5.3.12 Let X be Poisson with parameter λ. Show the cumulant-
generating function is
Z(t) = λ(et − 1).
(Use the exponential series (A.3.12).)
Exercise 5.3.13 Let X be Poisson with parameter λ. Show both E(X) and
V ar(X) equal λ (Use (5.3.12).)
Exercise 5.3.14 Let X and Y be independent Poisson with parameter λ
and µ respectively. Show X + Y is Poisson with parameter λ + µ.
Exercise 5.3.15 If X1 , X2 , . . . , Xn are i.i.d. Poisson with parameter λ, show
S = X1 + X2 + · · · + Xn
$$E\bigl(\mathrm{relu}(S-n)\bigr) = e^{-n}\cdot\frac{n^{n+1}}{n!}.$$
(Use Exercise A.1.2.)
Exercise 5.3.18 Suppose X is a logistic random variable (5.3.33). Show the
probability density function of X is σ(x)(1 − σ(x)).
Exercise 5.3.19 Suppose X is a logistic random variable (5.3.33). Show the
mean of X is zero.
Exercise 5.3.20 Suppose X is a logistic random variable (5.3.33). Use
(A.3.16) with a = −e−x to show the variance of X is
$$4\sum_{n=1}^{\infty}\frac{(-1)^{n-1}}{n^2} = 4\left(1 - \frac{1}{4} + \frac{1}{9} - \frac{1}{16} + \ldots\right).$$
Exercise 5.3.25 For k and n fixed, compute the mean of the conditional
probability of a coin’s bias p given k heads in n tosses. The answer is not
k/n. (Use (5.2.13) with n, k replaced by n + 1, k + 1.)
grid()
z = arange(mu-3*sdev,mu+3*sdev,.01)
p = Z.pdf(z)
plot(z,p)
show()
The curious constant √(2π) in (5.4.1) is inserted to make the total area
under the graph equal to one. That this is so arises from the fact that 2π is
the circumference of the unit circle. Using Python, we see √(2π) is the correct
constant, since the code

# the total area I under exp(-z**2/2) (this definition of I is a completion; the original listing is cut off)
from scipy.integrate import quad
I, error = quad(lambda z: exp(-z**2/2), -inf, inf)

allclose(I, sqrt(2*pi))
returns True.
The mean of Z is
$$E(Z) = \int z\,p(z)\,dz,$$
with the integral computed using the fundamental theorem of calculus (A.5.2)
or Python.
E(Z) = 0, V ar(Z) = 1
From this, the odd moments of Z are zero, and the even moments are
$$E(Z^{2n}) = \frac{(2n)!}{2^n\,n!}, \qquad n = 0, 1, 2, \ldots$$
By separating the even and the odd factors, this simplifies to
For example,
$$\bar X_n = \frac{X_1+X_2+\cdots+X_n}{n}$$
be the sample mean. Then the event of outcomes where
$$\lim_{n\to\infty} \bar X_n = \mu$$
In other words, the LLN says the outcomes where the limiting sample
mean is not equal to µ form a null event. The event specified in the LLN is
sure, but not certain (see §5.1 and Exercise 5.1.11 for the distinction).
The LLN is qualitative: There is no measure of closeness in the LLN state-
ment. On the other hand, the CLT is more quantitative. The CLT says for
large sample size, the sample mean is approximately normal with mean µ
and variance σ 2 /n. More exactly,
Let
$$\bar Z_n = \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma}$$
be the standardized sample mean, and let Z be a standard normal
random variable. Then
$$\lim_{n\to\infty} Prob\bigl(a < \bar Z_n < b\bigr) = Prob(a < Z < b)$$
for every t.
Toss a coin n times, assume the coin’s bias is p, and let Sn be the number
of heads. Then, by (5.3.23), Sn is binomial with mean µ = np and standard
deviation σ = √(np(1 − p)). By the CLT, Sn is approximately normal with the
same mean and variance, so the cumulative distribution function of Sn ap-
proximately equals the cumulative distribution function of a normal random
variable with the same mean and variance.
The code
n, p = 100, pi/4
mu = n*p
sigma = sqrt(n*p*(1-p))
B = binom(n,p)
Z = norm(mu,sigma)
grid()
legend()
show()
Fig. 5.21 The binomial cdf and its CLT normal approximation.
a scalar dataset, and assume the dataset is standardized. Then its mean and
variance are zero and one,
$$\sum_{k=1}^{N} x_k = 0, \qquad \frac{1}{N}\sum_{k=1}^{N} x_k^2 = 1.$$
If the samples of the dataset are equally likely, then sampling the dataset
results in a random variable X, with expectations given by (5.3.2). It follows
that X is standard, and the moment-generating function of X is
$$E\bigl(e^{tX}\bigr) = \frac{1}{N}\sum_{k=1}^{N} e^{tx_k}.$$
Since the mean and variance of X are zero and 1, taking expectations of both
sides,
$$E\left(e^{tX/\sqrt{n}}\right) = 1 + \frac{t^2}{2n} + \ldots.$$
From this,
$$M_n(t) = \left(1 + \frac{t^2}{2n} + \ldots\right)^{n}.$$
By the compound-interest formula (A.3.8) (the missing terms . . . don't affect
the result)
$$\lim_{n\to\infty} M_n(t) = e^{t^2/2},$$
we expect the chance that Z < 0 should equal 1/2. In other words, because
of the symmetry of the curve, we expect to be 50% confident that Z < 0, or
0 is at the 50-th percentile level. So
When
$$\mathrm{Prob}(Z < z) = p,$$
we say z is the z-score corresponding to the p-value p. Equivalently, we say
our confidence that Z < z is p, or the percentile of z equals 100p. In Python,
the relation between z and p (Figure 5.22) is specified by
p = Z.cdf(z)
z = Z.ppf(p)
ppf is the percentile point function, and cdf is the cumulative distribution
function.
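As a quick illustration (an assumed example, not from the text), the two functions invert each other:

from scipy.stats import norm
Z = norm(0, 1)
print(Z.cdf(1.96))    # about 0.975
print(Z.ppf(0.975))   # about 1.96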
In Figure 5.23, the red areas are the lower tail p-value Prob(Z < z), the
two-tail p-value Prob(|Z| > z), and the upper tail p-value Prob(Z > z).
To go backward, suppose we are given Prob(|Z| < z) = p and we want
to compute the cutoff z. Then Prob(|Z| > z) = 1 − p, so Prob(Z > z) =
(1 − p)/2. This implies
$$\mathrm{Prob}(Z < z) = 1 - \frac{1-p}{2} = \frac{1+p}{2},$$
and
$$\mathrm{Prob}(|Z| < z) = \mathrm{Prob}(-z < Z < z) = \mathrm{Prob}(Z < z) - \mathrm{Prob}(Z < -z),$$
and
$$\mathrm{Prob}(Z > z) = 1 - \mathrm{Prob}(Z < z).$$
In Python,
# p = P(|Z| < z)
z = Z.ppf((1+p)/2)
p = Z.cdf(z) - Z.cdf(-z)
Now let’s zoom in closer to the graph and mark off z-scores 1, 2, 3 on the
horizontal axis to obtain specific colored areas as in Figure 5.25. These areas
are governed by the 68-95-99 rule (Table 5.24). Our confidence that |Z| < 1
equals the blue area 0.685, our confidence that |Z| < 2 equals the sum of the
blue plus green areas 0.955, and our confidence that |Z| < 3 equals the sum
of the blue plus green plus red areas 0.997. This is summarized in Table 5.24.
The possibility |Z| > 1 is called a 1-sigma event, |Z| > 2 a 2-sigma event,
and so on. So a 2-sigma event is 95.5% unlikely, or 4.5% likely. An event is
considered statistically significant if it’s a 2-sigma event or more. In other
words, something is significant if it’s unlikely. A six-sigma event |Z| > 6 is
two in a billion. You want a plane crash to be six-sigma.
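These sigma-level probabilities are easy to compute directly; a short check (an assumed example, not from the text):

from scipy.stats import norm
Z = norm(0, 1)
for k in [1, 2, 3, 6]:
    print(k, 2*(1 - Z.cdf(k)))   # Prob(|Z| > k): .32, .046, .0027, 2e-09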
Fig. 5.25 68%, 95%, 99% confidence cutoffs for standard normal.
These terms are defined for two-tail p-values. The same terms may be used
for upper-tail or lower tail p-values.
Figure 5.25 is not to scale, because a 1-sigma event should be where the
curve inflects from convex to concave (in the figure this happens closer to
2.7). Moreover, according to Table 5.24, the left-over white area should be
.3% (3 parts in 1,000), which is not what the figure suggests.
In general, the normal distribution is not centered at the origin, but else-
where. We say X is normal with mean µ and standard deviation σ if
X −µ
Z=
σ
is distributed according to a standard normal. We write N (µ, σ) for the nor-
mal with mean µ and standard deviation σ. As its name suggests, it is easily
checked that such a random variable X has mean µ and standard deviation
σ. For the normal distribution with mean µ and standard deviation σ, the
cutoffs are as in Figure 5.27. In Python, norm(mu,sigma) returns the normal
with mean mu and standard deviation sigma.
Here is a sample computation. Let X be a normal random variable with
mean µ and standard deviation σ, and suppose P rob(X < 7) = .15, and
P rob(X < 19) = .9. Given this data, we find µ and σ as follows.
With Z as above, we have
$$\mathrm{Prob}\big(Z < (7 - \mu)/\sigma\big) = .15, \qquad \mathrm{Prob}\big(Z < (19 - \mu)/\sigma\big) = .9.$$
a = Z.ppf(.15)
b = Z.ppf(.9)
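The remaining lines, solving (7 − µ)/σ = a and (19 − µ)/σ = b, are not shown above; a minimal completion (an assumption, not the book's listing):

sigma = (19 - 7) / (b - a)
mu = 7 - a*sigma
print(mu, sigma)   # approximately mu = 12.4, sigma = 5.2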
$$Z = \sqrt{n}\cdot\frac{\bar X - \mu}{\sigma},$$
Here are two examples. In the first example, suppose student grades are
normally distributed with mean µ = 80 and variance σ 2 = 16. This says the
average of all grades is 80, and the standard deviation is σ = 4. If a grade is
g, the standardized grade is
$$z = \frac{g - \mu}{\sigma} = \frac{g - 80}{4}.$$
A student is picked and their grade is g = 84. Is this significant? Is it highly
significant? In effect, we are asking: how unlikely is it to obtain such a grade?
Remember,
significant = unlikely
Since the standard deviation is 4, the student's z-score is
$$z = \frac{g - 80}{4} = \frac{84 - 80}{4} = 1.$$
What’s the upper-tail p-value corresponding to this z? It’s
1
P rob(Z > z) = P rob(Z > 1) = P rob(|Z| > 1) = .16,
2
or 16%. Since the upper-tail p-value is more than 5%, this student’s grade is
not significant.
For the second example, suppose a sample of n = 9 students are selected
and their sample average grade is ḡ = 84. Is this significant? Is it highly
significant? This time we take
$$z = \sqrt{n}\cdot\frac{\bar g - 80}{4} = 3\cdot\frac{84 - 80}{4} = 3.$$
What’s the upper-tail p-value corresponding to this z? It’s
or .13%. Since the upper-tail p-value is less than 1%, yes, this sample average
grade is both significant and highly significant.
The same grade, g = 84, is not significant for a single student, but is
significant for nine students. This is a reflection of the law of large numbers,
which says the sample mean approaches the population mean as the sample
size grows.
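Both upper-tail p-values above can be checked in a couple of lines (an assumed example, not from the text):

from numpy import sqrt
from scipy.stats import norm
Z = norm(0, 1)
print(1 - Z.cdf(1))                      # single student: about .16
print(1 - Z.cdf(sqrt(9)*(84 - 80)/4))    # nine students: about .0013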
Suppose student grades are normally distributed with mean 80 and vari-
ance 16. How many students should be sampled so that the chance that at
least one student’s grade lies below 70 is at least 50%?
To solve this, if p is the chance that a single student has a grade below 70,
then 1 − p is the chance that the student has a grade above 70. If n is the
sample size, (1 − p)n is the chance that all sample students have grades above
70. Thus the requested chance is 1 − (1 − p)n . The following code shows the
answer is n = 112.
z = 70
mean, sdev = 80, 4
p = Z(mean,sdev).cdf(z)      # here Z is scipy.stats.norm
for n in range(2,200):
    q = 1 - (1-p)**n         # chance at least one of n grades is below 70
    print(n, q)              # the least n with q >= .5 is n = 112
Here is the code for computing tail probabilities for the sample mean X̄
drawn from a normally distributed population with mean µ and standard
deviation σ. When n = 1, this applies to a single normal random variable.
########################
# P-values
########################

def pvalue(mean,sdev,n,xbar,type):
    Xbar = Z(mean,sdev/sqrt(n))
    if type == "lower-tail": p = Xbar.cdf(xbar)
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    # two-tail: measure the deviation |xbar - mean| symmetrically about the mean
    elif type == "two-tail": p = 2*(1 - Xbar.cdf(mean + abs(xbar - mean)))
    else:
        print("What's the tail type (lower-tail, upper-tail, two-tail)?")
        return
    print("sample size: ",n)
    print("mean,sdev,xbar: ",mean,sdev,xbar)
    print("mean,sdev,n,xbar: ",mean,sdev,n,xbar)
    print("p-value: ",p)
    z = sqrt(n) * (xbar - mean) / sdev
    print("z-score: ",z)
type = "upper-tail"
mean = 80
sdev = 4
n = 1
xbar = 90
pvalue(mean,sdev,n,xbar,type)
Exercises
Exercise 5.4.1 Let X be a normal random variable and suppose Prob(X <
1) = 0.3 and Prob(X < 2) = 0.4. What are the mean and variance of X?
Exercise 5.4.2 [27] Consider a normal distribution curve where the middle
90% of the area under the curve lies above the interval (4, 18). Use this
information to find the mean and the standard deviation of the distribution.
Exercise 5.4.3 Let Z be a normal random variable with mean 30.4 and
standard deviation 0.7. What is Prob(29 < Z < 31.1)?
Exercise 5.4.4 [27] Consider a normal distribution where the 70th percentile
is at 11 and the 25th percentile is at 2. Find the mean and the standard
deviation of the distribution.
We count the proportion of printers in the sample having speeds greater than
18 by setting
$$\hat p = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}.$$
Compute E(p̂) and Var(p̂). Use the CLT to compute the probability that
more than 50.9% of the printers have speeds greater than 18.
Exercise 5.4.9 [27] The level of nitrogen oxides in the exhaust of a particular
car model varies with mean 0.9 grams per mile and standard deviation 0.19
grams per mile . What sample size is needed so that the standard deviation
of the sampling distribution is 0.01 grams per mile?
Exercise 5.4.10 [27] The scores of students had a normal distribution with
mean µ = 559.7 and standard deviation σ = 28.2. What is the probability
that a single randomly chosen student scores 565 or higher? Now suppose
n = 30 students are sampled, assume i.i.d. What are the mean and standard
deviation of the sample mean score? What z-score corresponds to the mean
score of 565? What is the probability that the mean score is 565 or higher?
Exercise 5.4.11 Complete the square in the moment-generating function of
the standard normal pdf and use (5.4.3) to derive (5.4.4).
Exercise 5.4.12 Let Z be a standard normal random variable, and let
relu(x) be as in Exercise 5.3.17. Show
$$E(\mathrm{relu}(Z)) = \frac{1}{\sqrt{2\pi}}.$$
E(relu(Z̄n )) → E(relu(Z)), n → ∞.
Fig. 5.28 (X, Y ) inside the square and inside the disk.
P rob(X 2 + Y 2 ≤ 1)?
Since
$$\frac{1}{\sqrt{1-2u}} = E\big(e^{uU}\big) = \sum_{n=0}^{\infty} \frac{u^n}{n!}\,E(U^n),$$
But this equals the right side of (5.4.5). Thus the left sides of (5.4.5) and
(5.5.1) are equal. This shows
Going back to the question posed at the beginning of the section, we have
X and Y independent standard normal and we want
P rob(X 2 + Y 2 ≤ 1).
# here U is assumed to be scipy.stats.chi2
d = 2
u = 1
U(d).cdf(u)
returns 0.39.
Figure 5.29 is returned by the code

u = arange(0,15,.01)
for d in range(1,7):
    p = U(d).pdf(u)
    plot(u,p,label="d: " + str(d))
ylim(0,.6)
grid()
legend()
show()

² Geometrically, the p-value Prob(U > 1) is the probability that a normally distributed
point in d-dimensional space is outside the unit sphere.
and
$$\mathrm{Var}(U) = \sum_{k=1}^{d} \mathrm{Var}(Z_k^2) = \sum_{k=1}^{d} 2 = 2d.$$
We conclude
Because
$$\frac{1}{(1-2t)^{d/2}}\cdot\frac{1}{(1-2t)^{d'/2}} = \frac{1}{(1-2t)^{(d+d')/2}},$$
we obtain
X = (X1 , X2 , . . . , Xn )
in Rn .
Random vectors have means, variances, moment-generating functions,
and cumulant-generating functions, just like scalar-valued random variables.
Moreover we can have simple random samples of random vectors X1 , X2 ,
. . . , Xn .
If X is a random vector in Rd, its mean is the vector µ = E(X), and its variance is the d × d matrix
$$Q = E\big((X - \mu) \otimes (X - \mu)\big).$$
By (1.4.20),
$$w \cdot Qw = E\big(((X - \mu)\cdot w)^2\big). \qquad (5.5.3)$$
Thus the variance of a random vector is a nonnegative matrix.
A random vector is standard if µ = 0 and Q = I. If X is standard, then
In §2.2, we defined the mean and variance of a dataset (2.2.15). The mean
and variance there are the same as the mean and variance defined here, that
of a random variable.
To see this, we must build a random variable X corresponding to a dataset
x1 , x2 , . . . , xN . But this was done in (5.3.2). The moral is: every dataset may
be interpreted as a random variable.
Uncorrelated Chi-squared
|Z|2 (5.5.6)
Using this, we can plot the probability density function of a normal random
vector in R2 ,
%matplotlib ipympl
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import multivariate_normal as Z
# standard normal
mu = array([0,0])
Q = array([[1,0],[0,1]])
x = arange(-3,3,.01)
y = arange(-3,3,.01)
xy = cartesian_product(x,y)
# last axis of xy is fed into pdf
z = Z(mu,Q).pdf(xy)
ax = axes(projection='3d')
ax.set_axis_off()
x,y = meshgrid(x,y)
ax.plot_surface(x,y,z, cmap='cool')
show()
Then
$$M_{X,Y}(w) = E\big(e^{w\cdot(X,Y)}\big) = e^{w\cdot Qw/2} = M_X(u)\,M_Y(v)\,e^{(u\cdot Bv + v\cdot B^t u)/2}.$$
From this, X and Y are independent when B = 0. Thus, for normal random
vectors, independence and uncorrelatedness are the same.
Correlated Chi-squared
$$E = U^t Q U, \qquad Q^+ = U E^+ U^t,$$
and
$$E = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_r, 0, \dots, 0), \qquad
E^+ = \mathrm{diag}(1/\lambda_1, 1/\lambda_2, \dots, 1/\lambda_r, 0, \dots, 0),$$
so X · µ = 0.
By Exercise 2.6.7, Q+ = Q. Since X · µ = 0,
X · Q+ X = X · QX = X · (X − (X · µ)µ) = |X|2 .
We conclude
Singular Chi-squared
We use the above to derive the distribution of the sample variance. Let
X1 , X2 , . . . , Xn be a random sample, and let X̄ be the sample mean,
$$\bar X = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
Let S² be the sample variance,
$$S^2 = \frac{1}{n-1}\sum_{k=1}^{n} (X_k - \bar X)^2.$$
Let
$$\mathbf{1} = (1, 1, \dots, 1)$$
be in Rⁿ, and let µ = 𝟏/√n. Then µ is a unit vector and
$$Z\cdot\mu = \frac{1}{\sqrt n}\sum_{k=1}^{n} Z_k = \sqrt n\,\bar Z.$$
Since Z₁, Z₂, . . . , Zₙ are i.i.d. standard, Z · µ = √n Z̄ is standard.
Now let U = I − µ ⊗ µ and X = UZ. Then the mean of X is zero. Since Z has variance I, by Exercises 2.2.2 and
5.5.5,
$$\mathrm{Var}(X) = U^t I U = U^2 = U = I - \mu\otimes\mu.$$
By singular chi-squared above,
(n − 1)S 2 = |X|2
Exercises
Exercise 5.5.3 Continuing the previous problem with n = 20, use the CLT
to estimate the probability that fewer than 50% of the points lie in the unit
disk. Is this a 1-sigma event, a 2-sigma event, or a 3-sigma event?
Exercise 5.5.4 Let X be a random vector with mean zero and variance Q.
Show v is a zero variance direction (§2.5) for Q iff X · v = 0.
Exercise 5.5.5 Let µ and Q be the mean and variance of a random d-vector
X, and let A be any N × d matrix. Then AX is a random vector with mean
Aµ and variance AQAt .
$$\frac{Y_1^2}{\lambda_1} + \frac{Y_2^2}{\lambda_2} + \cdots + \frac{Y_r^2}{\lambda_r}$$
is chi-squared with degree r.
Exercise 5.5.7 If X is a random vector with mean zero and variance Q, then
u · Qv = E((X · u)(X · v)). (Insert w = u + v in (5.5.3).)
Exercise 5.5.8 Assume the classes of the Iris dataset are normally dis-
tributed with their means and variances (Exercise 2.2.8), and assume the
classes are equally likely. Using Bayes theorem (5.2.14), write a Python
function that returns the probabilities (p1 , p2 , p3 ) that a given iris x =
(t1 , t2 , t3 , t4 ) lies in each of the three classes. Feed your function the 150
samples of the Iris dataset. How many samples are correctly classified?
p1 + p2 + · · · + pd = 1.
This is called one-hot encoding since all slots in Y are zero except for one
“hot” slot.
For example, suppose X has three values 1, 2, 3, say X is the class of a
random sample from the Iris dataset. Then Y is R3 -valued, and we have
$$Y = \begin{cases} (1, 0, 0), & \text{if } X = 1,\\ (0, 1, 0), & \text{if } X = 2,\\ (0, 0, 1), & \text{if } X = 3.\end{cases}$$
More generally, let X have d values. Then with one-hot encoding, the
moment-generating function is
In particular, for a fair dice with d sides, the values are equally likely, so
the one-hot encoded cumulant-generating function is
When d = 2,
$$q_1 = \frac{e^{y_1}}{e^{y_1}+e^{y_2}} = \frac{1}{1+e^{-(y_1-y_2)}} = \sigma(y_1-y_2), \qquad
q_2 = \frac{e^{y_2}}{e^{y_1}+e^{y_2}} = \frac{1}{1+e^{-(y_2-y_1)}} = \sigma(y_2-y_1).$$
Because of this, the softmax function is the multinomial analog of the logistic
function, and we use the same symbol σ to denote both functions.
y = array([y1,y2,y3])
q = softmax(y)
or
σ(y) = σ(y + a1).
We say a vector y is centered if y is orthogonal to 1,
y · 1 = y1 + y2 + · · · + yd = 0.
This establishes
Define
log p = (log p1 , log p2 , . . . , log pd ).
Then the inverse of p = σ(z) is
y = Z1 + log p. (5.6.5)
The function
$$I(p) = p\cdot\log p = \sum_{k=1}^{d} p_k \log p_k \qquad (5.6.6)$$
This implies
$$p\cdot y = \sum_{k=1}^{d} p_k y_k = \sum_{k=1}^{d} p_k \log\big(e^{y_k}\big)
\le \log\left(\sum_{k=1}^{d} p_k e^{y_k}\right) = \log\left(\sum_{k=1}^{d} e^{y_k+\log p_k}\right) = Z(y + \log p).$$
For all y,
$$Z(y) = \max_{p}\,\big(p \cdot y - I(p)\big).$$
Since
$$D^2 I(p) = \mathrm{diag}\left(\frac{1}{p_1}, \frac{1}{p_2}, \dots, \frac{1}{p_d}\right),$$
we see I(p) is strictly convex, and H(p) is strictly concave.
In Python, the entropy is
p = array([p1,p2,p3])
entropy(p)
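For example (an assumed check, not from the text), the entropy of the uniform distribution on d faces is log d:

from numpy import array, log
from scipy.stats import entropy
p = array([1/3, 1/3, 1/3])
print(entropy(p), log(3))   # both about 1.0986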
Roll a d-faced dice n times, and let #n (p) be the number of outcomes
where the face-proportions are p = (p1 , p2 , . . . , pd ). Then
Now
$$\frac{\partial^2 Z}{\partial y_j\,\partial y_k} = \frac{\partial \sigma_j}{\partial y_k} = \begin{cases} \sigma_j - \sigma_j\sigma_k, & \text{if } j = k,\\ -\sigma_j\sigma_k, & \text{if } j \ne k.\end{cases}$$
Hence we have
yj ≤ c, j = 1, 2, . . . , d.
which implies
$$|y|^2 = \sum_{k=1}^{d} y_k^2 \le d(d-1)^2 c^2.$$
Setting C = √d (d − 1)c, we conclude
Let
log q = (log q1 , log q2 , . . . , log qd ).
Then
$$p\cdot\log q = \sum_{k=1}^{d} p_k\log q_k,$$
and
I(p, q) = I(p) − p · log q. (5.6.13)
Similarly, the relative entropy is
# scipy's entropy(p,q) computes the relative information sum(p*log(p/q))
p = array([p1,p2,p3])
q = array([q1,q2,q3])
entropy(p,q)
returns the relative information, not the relative entropy. Always check your
Python code’s conventions and assumptions. See below for more on this ter-
minology confusion.
Assume a d-faced dice’s bias is q. Roll the dice n times, and let Pn (p, q)
be the probability of obtaining outcomes where the proportion of faces
is p. Then
$$= \max_{y'}\big(p\cdot(y' - \log q) - Z(y')\big) = I(p) - p\cdot\log q = I(p, q).$$
This identity is the direct analog of (4.5.19). The identity (4.5.19) is used
in linear regression. Similarly, (5.6.15) is used in logistic regression.
The cross-information is
$$I_{\mathrm{cross}}(p, q) = -\sum_{k=1}^{d} p_k\log q_k,$$
Since I(p, σ(y)) and Icross (p, σ(y)) differ by the constant I(p), we also have
This is easily checked using the definitions of I(p, q) and σ(y, q).
H = −I Information Entropy
Absolute I(p) H(p)
Cross Icross (p, q) Hcross (p, q)
Relative I(p, q) H(p, q)
Curvature Convex Concave
Error I(p, q) with q = σ(z)
Table 5.33 The third row is the sum of the first and second rows, and the H column is
the negative of the I column.
How does one keep things straight? By remembering that it’s convex func-
tions that we like to minimize, not concave functions. In more vivid terms,
would you rather ski down a convex slope, or a concave slope?
In machine learning, loss functions are built to be minimized, and infor-
mation, in any form, is convex, while entropy, in any form, is concave. Table
5.33 summarizes the situation.
Exercises
6.1 Estimation
[Figure: a hypothesis H and a sample produce a p-value; if p > α, do not reject H; if p < α, reject H.]
d = 784
for _ in range(20):
    u = randn(d)
    v = randn(d)
    print(angle(u,v))   # angle(u,v) in degrees (assumed defined earlier in the text)
86.27806537791886
87.91436653824776
93.00098725550777
92.73766421951748
90.005139015804
87.99643434444482
89.77813370637857
96.09801014394806
90.07032573539982
89.37679070400239
91.3405728939376
86.49851399221568
87.12755619082597
88.87980905998855
89.80377324818076
91.3006921339982
91.43977096117017
88.52516224405458
86.89606919838387
90.49100744167357
d = 784
n = 1     # each component is a single fair coin toss (this matches the output below)
for _ in range(20):
    u = binomial(n,.5,d)
    v = binomial(n,.5,d)
    print(angle(u,v))
59.43464627897324
59.14345748418916
60.31453922165891
60.38024365702492
59.24709660805488
59.27165957992343
61.21424657806321
60.55756381536082
61.59468919876665
61.33296028237481
60.03925473033243
60.25732069941224
61.77018692842784
60.672901794058326
59.628519516164666
59.41272458020638
58.43172340007064
59.863796136907744
59.45156367988921
59.95835532791699
The difference between the two scenarios is the distribution. In the first
scenario, we have randn(d): the components are distributed according to
a standard normal. In the second scenario, we have binomial(1,.5,d) or
binomial(3,.5,d): the components are distributed according to one or three
fair coin tosses. To see how the distribution affects things, we bring in the
law of large numbers, which is discussed in §5.3.
Let X1 , X2 , . . . , Xd be a simple random sample from some population,
and let µ be the population mean. Recall this means X1 , X2 , . . . , Xd are
i.i.d. random variables, with µ = E(X). The sample mean is
X1 + X2 + · · · + Xd
X̄ = .
d
For large sample size d, the sample mean X̄ approximately equals the
population mean µ, X̄ ≈ µ.
We use the law of large numbers to explain the closeness of the vector
angles to specific values.
Assume u = (x1 , x2 , . . . , xd ), and v = (y1 , y2 , . . . , yd ) where all components
are selected independently of each other, and each is selected according to
the same distribution.
$$\frac{X_1Y_1 + X_2Y_2 + \cdots + X_dY_d}{d} \approx E(X_1Y_1),$$
so
$$U\cdot V = X_1Y_1 + X_2Y_2 + \cdots + X_dY_d \approx d\,E(X_1Y_1).$$
Similarly, U · U ≈ d E(X₁²) and V · V ≈ d E(Y₁²). Hence (check that the d's
cancel)
$$\cos(U, V) = \frac{U\cdot V}{\sqrt{(U\cdot U)(V\cdot V)}} \approx \frac{E(X_1Y_1)}{\sqrt{E(X_1^2)\,E(Y_1^2)}}.$$
Since X₁ and Y₁ are independent with mean µ and variance σ², we have
E(X₁Y₁) = µ² and E(X₁²) = E(Y₁²) = µ² + σ², so
$$\cos(\theta) = \frac{U\cdot V}{\sqrt{(U\cdot U)(V\cdot V)}} \approx \frac{\mu^2}{\mu^2+\sigma^2}.$$
In the second scenario, each component is a coin toss with bias p, so µ = p, σ² = p(1 − p), and
$$\frac{\mu^2}{\mu^2+\sigma^2} = \frac{p^2}{p^2 + p(1-p)} = p.$$
cos(θ) is approximately µ²/(µ² + σ²).
1 ≈ means the ratio of the two sides approaches 1 for large n, see §A.6.
6.2 Z-test
p = .7
n = 25
N = 1000
v = binomial(n,p,N)/n
hist(v,edgecolor ='Black')
show()
A confidence level of zero indicates that we have no faith at all that se-
lecting another sample will give similar results, while a confidence level of 1
indicates that we have no doubt at all that selecting another sample will give
similar results.
When we say p is within X̄ ± ϵ, or
$$|p - \bar X| < \epsilon,$$
then
$$(L, U) = (\bar X - \epsilon, \bar X + \epsilon)$$
is a confidence interval.
With the above setup, we have the population proportion p, and the four
sample characteristics
• sample size n
• sample proportion X̄,
• margin of error ϵ,
• confidence level α.
Suppose we do not know p, but we know n and X̄. We say the margin of
error is ϵ, at confidence level α, if
$$Z = \sqrt n\,\frac{\bar X - p}{\sqrt{p(1-p)}}$$
$$L, U = \bar X - \epsilon,\ \bar X + \epsilon.$$
$$\mathrm{Prob}(|Z| > z^*) = \alpha.$$
Let σ/√n be the standard error. By the central limit theorem,
$$\alpha \approx \mathrm{Prob}\left(\frac{|\bar X - p|}{\sqrt{p(1-p)}} > \frac{z^*}{\sqrt n}\right).$$
$$\frac{|\bar X - p|}{\sqrt{p(1-p)}} = \frac{z^*}{\sqrt n} \qquad (6.2.1)$$
##########################
# Confidence Interval - Z
##########################

def confidence_interval(xbar,sdev,n,alpha,type):
    Xbar = Z(xbar,sdev/sqrt(n))
    if type == "two-tail":
        U = Xbar.ppf(1-alpha/2)
        L = Xbar.ppf(alpha/2)
    elif type == "upper-tail":
        U = Xbar.ppf(1-alpha)
        L = xbar
    elif type == "lower-tail":
        L = Xbar.ppf(alpha)
        U = xbar
    else: print("what's the test type?"); return
    return L, U
alpha = .02
sdev = 228
n = 35
xbar = 95
type = "two-tail"   # assumed; the original sets type outside this excerpt
L, U = confidence_interval(xbar,sdev,n,alpha,type)
Now we can answer the questions posed at the start of the section. Here
are the answers.
1. When n = 20, α = .95, and X̄ = .7, we have [L, U ] = [.5, .9], so ϵ = .2.
2. When X̄ = .7, α = .95, and ϵ = .15, we run confidence_interval for
15 ≤ n ≤ 40, and select the least n for which ϵ < .15. We obtain n = 36.
3. When X̄ = .7, α = .99, and ϵ = .15, we run confidence_interval for
1 ≤ n ≤ 100, and select the least n for which ϵ < .15. We obtain n = 62.
4. When X̄ = .7, n = 20, and ϵ = .1, we have
$$z^* = \frac{\epsilon\sqrt n}{\sigma} = .976.$$
• Ha : µ ̸= 0.
Here the significance level is α = .02 and µ0 = 0. To decide whether to
reject H0 or not, compute the standardized test statistic
$$z = \sqrt n\cdot\frac{\bar x - \mu_0}{\sigma} = 2.465.$$
Since z is a sample from an approximately normal distribution Z, the p-value
Hypothesis Testing
µ < µ0 , µ > µ0 , µ ̸= µ0 .
In the Python code below, instead of working with the standardized statistic
Z, we work directly with X̄, which is normally distributed with mean µ₀ and
standard deviation σ/√n.
###################
# Hypothesis Z-test
###################
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
sdev = 2
alpha = .01
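The ztest function that consumes these parameters is not included in this excerpt; a minimal sketch consistent with the pvalue code above (an assumption, not the book's code):

def ztest(mu0, sdev, n, xbar, type, alpha):
    Xbar = Z(mu0, sdev/sqrt(n))                  # Z is scipy.stats.norm
    if type == "upper-tail":   p = 1 - Xbar.cdf(xbar)
    elif type == "lower-tail": p = Xbar.cdf(xbar)
    else:                      p = 2*(1 - Xbar.cdf(mu0 + abs(xbar - mu0)))
    print("p-value: ", p)
    print("reject H0" if p < alpha else "do not reject H0")

ztest(mu0, sdev, n, xbar, type, alpha)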
There are two types of possible errors we can make: a Type I error is when
H0 is true but we reject it, and a Type II error is when H0 is not true but
we fail to reject it.
                    H0 is true            H0 is false
do not reject H0    1 − α                 Type II error: β
reject H0           Type I error: α       Power: 1 − β
$$\mu_0 - \frac{z^*\sigma}{\sqrt n} < \bar x < \mu_0 + \frac{z^*\sigma}{\sqrt n}.$$
This calculation was for a two-tail test. When the test is upper-tail or
lower-tail, a similar calculation leads to the code
############################
# Type1 and Type2 errors - Z
############################

def type2_error(type,mu0,mu1,sdev,n,alpha):
    print("significance,mu0,mu1,sdev,n: ", alpha,mu0,mu1,sdev,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        zstar = Z.ppf(alpha)
        type2 = 1 - Z.cdf(delta + zstar)
    elif type == "upper-tail":
        zstar = Z.ppf(1-alpha)
        type2 = Z.cdf(delta + zstar)
    elif type == "two-tail":
        zstar = Z.ppf(1 - alpha/2)
        type2 = Z.cdf(delta + zstar) - Z.cdf(delta - zstar)
    else: print("what's the test type?"); return
    print("test type: ",type)
    print("zstar: ", zstar)
    print("delta: ", delta)
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)
mu0 = 120
mu1 = 122
sdev = 2
n = 10
alpha = .01
type = "upper-tail"
type2_error(type,mu0,mu1,sdev,n,alpha)
A type II error is when we do not reject the null hypothesis and yet it’s
false. The power of a test is the probability of rejecting the null hypothesis
when it’s false (Figure 6.3). If the probability of a type II error is β, then the
power is 1 − β.
Going back to the driving speed example, what is the chance that someone
driving at µ1 = 122 is not caught? This is a type II error; using the above
code, the probability is
6.3 T -test
Here C is a constant to make the total area under the graph equal to one
(Figure 6.4).
The distribution of a Student random variable T is the Student distribu-
tion with degree d, also called the t-distribution with degree d. The Student
distribution has pdf (6.3.1), and the probability that T lies in a small interval
[a, b] is
2 This terminology is due to the statistician R. A. Fisher.
with the integral computed via the fundamental theorem of calculus (A.5.2)
or Python.
The Student pdf (6.3.1) approaches the standard normal pdf (5.4.1) as
d → ∞ (Exercise 6.3.1).
for d in [3,4,7]:
    t = arange(-3,3,.01)
    plot(t,T(d).pdf(t),label="d = "+str(d))
plot(t,Z.pdf(t),"--",label=r"d = $\infty$")
grid()
legend()
show()
$$\sqrt n\cdot\frac{\bar X - \mu}{S} = \sqrt n\cdot\frac{\bar X - \mu}{\sqrt{\dfrac{1}{n-1}\sum_{k=1}^{n}(X_k - \bar X)^2}}.$$
$$X_k = \mu + \sigma Z_k,$$
$$S^2 = \sigma^2\,\frac{1}{n-1}\sum_{k=1}^{n}(Z_k - \bar Z)^2.$$
$$\sqrt n\cdot\frac{\bar X-\mu}{S} = \sqrt n\cdot\frac{\bar Z}{\sqrt{\dfrac{1}{n-1}\sum_{k=1}^{n}(Z_k-\bar Z)^2}} = \sqrt n\cdot\frac{\bar Z}{\sqrt{U/(n-1)}}.$$
##########################
# Confidence Interval - T
##########################

def confidence_interval(xbar,s,n,alpha,type):
    d = n-1
    if type == "two-tail":
        tstar = T(d).ppf(1-alpha/2)
        L = xbar - tstar * s / sqrt(n)
        U = xbar + tstar * s / sqrt(n)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        L = xbar
        U = xbar + tstar * s / sqrt(n)
    elif type == "lower-tail":
        tstar = T(d).ppf(alpha)
        L = xbar + tstar * s / sqrt(n)
        U = xbar
    else: print("what's the test type?"); return
    print("type: ",type)
    return L, U

³ Geometrically, the p-value Prob(T > 1) is the probability that a normally distributed
point in (d + 1)-dimensional spacetime is inside the light cone.
n = 10
xbar = 120
s = 2
alpha = .01
type = "upper-tail"
print("significance, s, n, xbar: ", alpha,s,n,xbar)
L,U = confidence_interval(xbar,s,n,alpha,type)
print("lower, upper: ", L,U)
Going back to the driving speed example from §6.2, instead of assuming
the population standard deviation is σ = 2, we compute the sample standard
deviation and find it’s S = 2. Recomputing with T (9), instead of Z, we
see (L, U ) = (120, 121.78), so the cutoff now is µ∗ = 121.78, as opposed to
µ∗ = 121.47 there.
###################
# Hypothesis T-test
###################
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
s = 2
alpha = .01
ttest(mu0, s, n, xbar,type)
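The ttest function itself is not shown in this excerpt; a minimal sketch consistent with its call above and the T(d) notation (an assumption, not the book's code):

def ttest(mu0, s, n, xbar, type):
    d = n - 1
    tstat = sqrt(n) * (xbar - mu0) / s       # standardized test statistic
    if type == "upper-tail":   p = 1 - T(d).cdf(tstat)
    elif type == "lower-tail": p = T(d).cdf(tstat)
    else:                      p = 2*(1 - T(d).cdf(abs(tstat)))
    print("t-statistic: ", tstat)
    print("p-value: ", p)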
########################
# Type1 and Type2 errors
########################

def type2_error(type,mu0,mu1,n,alpha):
    d = n-1
    print("significance,mu0,mu1,n: ", alpha,mu0,mu1,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        tstar = T(d).ppf(alpha)
        type2 = 1 - T(d).cdf(delta + tstar)
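    # (the remaining branches fall outside this excerpt; the completion below is
    #  an assumption mirroring the Z-test version above, not the book's code)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        type2 = T(d).cdf(delta + tstar)
    elif type == "two-tail":
        tstar = T(d).ppf(1 - alpha/2)
        type2 = T(d).cdf(delta + tstar) - T(d).cdf(delta - tstar)
    else: print("what's the test type?"); return
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)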
type2_error(type,mu0,mu1,n,alpha)
Exercises
Exercise 6.3.1 Use the compound-interest formula (A.3.8) to show the Stu-
dent pdf (6.3.1) equals the standard normal pdf (5.4.1) when d = ∞. Since
the formula for the constant C is not given, ignore C in your calculation.
i = 1, 2, . . . , d.
$$Z = \sqrt n\cdot\frac{\hat p - p}{\sqrt{p(1-p)}} \qquad (6.4.1)$$
is approximately standard normal for large enough sample size, and con-
sequently U = Z² is approximately chi-squared with degree one. The chi-
squared test generalizes this from d = 2 categories to d > 2 categories.
Given a category i, let #i denote the number of times Xk = i, 1 ≤ k ≤ n,
in a sample of size n. Then #i is the count that Xk = i, and p̂i = #i /n is the
observed frequency, in a sample of size n. Let pi be the expected frequency,
pi = P rob(Xk = i).
Goodness-Of-Fit Test
def goodness_of_fit(observed,expected):
    # assume len(observed) == len(expected)
    d = len(observed)
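    # (the rest of the function falls on the next page of the original; the lines
    #  below are an assumed completion consistent with the statistic (6.4.2) and
    #  the U(d-1) chi-squared cutoff used later, not the book's exact code)
    u = sum([ (observed[i] - expected[i])**2 / expected[i] for i in range(d) ])
    pvalue = 1 - U(d-1).cdf(u)
    return u, pvalue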
Suppose a dice is rolled n = 120 times, and the observed counts are
Notice
O1 + O2 + O3 + O4 + O5 + O6 = 120.
If the dice is fair, the expected counts are
u = 12.7.
The dice is fair if u is not large and the dice is unfair if u is large. At
significance level α, the large/not-large cutoff u∗ is
alpha = .05   # significance level (the cutoff u* = 11.07 below corresponds to alpha = .05)
d = 6
ustar = U(d-1).ppf(1-alpha)
Since this returns u∗ = 11.07 and u > u∗ , we can conclude the dice is not
fair.
$$V_i = \begin{cases} \dfrac{1}{\sqrt{p_i}}, & \text{if } X = i,\\[4pt] 0, & \text{if } X \ne i.\end{cases} \qquad (6.4.3)$$
we conclude
E(V ) = µ, E(V ⊗ V ) = I.
From this,
E(V ) = µ, V ar(V ) = I − µ ⊗ µ. (6.4.4)
Now define
Vk = vectp (Xk ) , k = 1, 2, . . . , n.
Since X1 , X2 , . . . , Xn are i.i.d, V1 , V2 , . . . , Vn are i.i.d. By (5.5.5), we conclude
the random vector
$$Z = \sqrt n\left(\frac{1}{n}\sum_{k=1}^{n} V_k - \mu\right)$$
obtaining (6.4.2).
rij = pi qj ,
or r = p ⊗ q.
For example, suppose 300 people are polled and the results are collected
in a contingency table (Figure 6.5).
Is a person’s gender correlated with their party affiliation, or are the two
variables independent? To answer this, let p̂ and q̂ be the observed frequencies
$$\hat p_i = \frac{\#\{k : X_k = i\}}{n}, \qquad \hat q_j = \frac{\#\{k : Y_k = j\}}{n},$$
and let r̂ be the joint observed frequencies
$$\hat r_{ij} = \frac{\#\{k : X_k = i \text{ and } Y_k = j\}}{n}.$$
Then r̂ is also a d × N matrix.
When the effects are independent, r = p ⊗ q, so, by the law of large
numbers, we should have
r̂ ≈ p̂ ⊗ q̂
for large sample size. The chi-squared independence test quantifies the dif-
ference of the two matrices r and r̂.
$$= -n + n\sum_{i,j=1}^{d,N}\frac{(\text{observed})^2}{\text{expected}}.$$
The code
def chi2_independence(table):
    n = sum(table)   # total sample size
    d = len(table)
    N = len(table.T)
    rowsum = array([ sum(table[i,:]) for i in range(d) ])
    colsum = array([ sum(table[:,j]) for j in range(N) ])
    expected = outer(rowsum,colsum)   # tensor product
    u = -n + n*sum([[ table[i,j]**2/expected[i,j] for j in range(N) ]
                    for i in range(d) ])
    deg = (d-1)*(N-1)
    pvalue = 1 - U(deg).cdf(u)
    return pvalue
table = array([[68,56,32],[52,72,20]])
chi2_independence(table)
returns a p-value of 0.0401, so, at the 5% significance level, the effects are
not independent.
equals (6.4.6).
Let u₁, u₂, . . . , u_d and v₁, v₂, . . . , v_N be orthonormal bases for R^d and
R^N respectively. By (2.9.8),
$$\|Z\|^2 = \mathrm{trace}(Z^t Z) = \sum_{i,j=1}^{d,N} (u_i\cdot Z v_j)^2. \qquad (6.4.8)$$
We will show ∥Z∥² is asymptotically chi-squared of degree (d − 1)(N − 1).
To achieve this, we show Z is asymptotically normal.
Let X and Y be discrete random variables with probability vectors
p = (p1 , p2 , . . . , pd ) and q = (q1 , q2 , . . . , qN ), and assume X and Y are in-
dependent.
Let
$$\mu = \big(\sqrt{p_1}, \sqrt{p_2}, \dots, \sqrt{p_d}\big), \qquad \nu = \big(\sqrt{q_1}, \sqrt{q_2}, \dots, \sqrt{q_N}\big).$$
and
W ≈0
over k = 1, 2, . . . , n, we see
$$Z^{CLT}_{ij} = \sqrt n\left(\frac{\hat r_{ij}}{\sqrt{p_i}\sqrt{q_j}} - \frac{\hat p_i\sqrt{q_j}}{\sqrt{p_i}} - \frac{\hat q_j\sqrt{p_i}}{\sqrt{q_j}} + \sqrt{p_i q_j}\right). \qquad (6.4.10)$$
$$\frac{\hat q_j - q_j}{\sqrt{\hat p_i\,\hat q_j}} \approx 0.$$
Z ≈ Z CLT .
We conclude
• the mean and variance of Z are asymptotically the same as those of M ,
• u · Zν ≈ 0, µ · Zv ≈ 0 for any u and v, and,
• Z ≈ normal.
In particular, since u·Zv and u′ ·Zv ′ are asymptotically uncorrelated when
u ⊥ u′ and v ⊥ v ′ , and Z is asymptotically normal, we conclude u · Zv and
u′ · Zv ′ are asymptotically independent when u ⊥ u′ and v ⊥ v ′ .
Now choose the orthonormal bases with u1 and v1 equal to µ and ν re-
spectively. Then
ui · Zvj , i = 1, 2, 3, . . . , d, j = 1, 2, 3, . . . , N
are independent normal random variables with mean zero, asymptotically for
large n, and variances according to the listing
4 The theoretical basis for this intuitively obvious result is Slutsky’s theorem [7].
Exercises
Exercise 6.4.4 Verify the goodness-of-fit test statistic (6.4.2) is the square
of (6.4.1) when d = 2.
Exercise 6.4.5 [30] Among 100 vacuum tubes tested, 41 had lifetimes of less
than 30 hours, 31 had lifetimes between 30 and 60 hours, 13 had lifetimes
between 60 and 90 hours, and 15 had lifetimes of greater than 90 hours.
Are these data consistent with the hypothesis that a vacuum tube’s lifetime
is exponentially distributed (Exercise 5.3.23) with a mean of 50 hours? At
what significance? Here p = (p1 , p2 , p3 , p4 ).
Exercise 6.4.7 [30] In a famous article (S. Russell, “A red sky at night. . . ”
Metropolitan Magazine London, 61, p. 15, 1926) the following dataset of
frequencies of sunset colors and whether each was followed by rain was pre-
sented. Test the hypothesis that whether it rains tomorrow is independent of
the color of today’s sunset.
Exercise 6.4.8 [30] A sample of 300 cars having mobile phones and one of
400 cars without phones were tracked for 1 year. The following table gives the
number of these cars involved in accidents over that year. Use the above to
test the hypothesis that having a mobile phone in your car and being involved
in an accident are independent. Use the 5 percent level of significance.
Accident No Accident
Mobile phone 22 278
No phone 26 374
7.1 Overview
Sometimes J(W ) is normalized by dividing by N , but this does not change the
results. With the dataset given, the mean error is a function of the weights.
A weight matrix W ∗ is optimal if it is a minimizer of the mean error,
In §4.4, we saw two versions of forward and back propagation. In this section
we see a third version. We begin by reviewing the definition of graph and
network as given in §3.3 and §4.4.
A graph consists of nodes and edges. Nodes are also called vertices, and an
edge is an ordered pair (i, j) of nodes. Because the ordered pair (i, j) is not
the same as the ordered pair (j, i), our graphs are directed.
The edge (i, j) is incoming at node j and outgoing at node i. If a node j
has no outgoing edges, then j is an output node. If a node i has no incoming
edges, then i is an input node. If a node is neither an input nor an output, it
is a hidden node.
We assume our graphs have no cycles: every forward path terminates at an
output node in a finite number of steps, and every backward path terminates
at an input node in a finite number of steps.
A graph is weighted if a scalar weight wij is attached to each edge (i, j). If
(i, j) is not an edge, we set wij = 0.
If a network has d nodes, the nodes are labeled 0, 1, 2, . . . , d − 1, and the
edges are completely specified by the d × d weight matrix W = (wij ).
A node with an attached activation function (4.4.2) is a neuron. A net-
work is a directed weighted graph where some nodes are neurons. In the next
paragraph, we define a special kind of network, a neural network.
j = 0, 1, . . . , d − 1.
Because wij = 0 if (i, j) is not an edge, the nonzero entries in the incoming
list at node j correspond to the edges incoming at node j.
A neural network is a network where every activation function is restricted
to be a function of the sum of the entries of the incoming list.
For example, all the networks in this section are neural networks, but the
network in Figure 4.16 is not a neural network.
Let
$$x_j^- = \sum_{i\to j} w_{ij}\,x_i \qquad (7.2.1)$$
be the sum of the incoming list at node j. Then, in a neural network, the
outgoing signal at node j is
$$x_j = f_j(x_j^-) = f_j\left(\sum_{i\to j} w_{ij}\,x_i\right). \qquad (7.2.2)$$
$$x = (x_0, x_1, \dots, x_{d-1}), \qquad x^- = (x_0^-, x_1^-, \dots, x_{d-1}^-).$$
In a network, in §4.4, x_j^- was a list or vector; in a neural network, x_j^- is a
scalar.
If node j is an input node, then x_j^- = None. If node j is an output node,
then x_j = None.
Let W be the weight matrix. If the network has d nodes, the activation
vector is
f = (f0 , f1 , . . . , fd−1 ).
Then a neural network may be written in vector-matrix form
x = f (W t x).
However, this representation is more useful when the network has structure,
for example in a dense shallow layer (7.2.12).
Neural Network
Every neural network is a combination of perceptrons.
[Figure: a perceptron with inputs x₁, x₂, x₃, weights w₁, w₂, w₃, activation f, and output y.]
$$y = f(w_0 + w_1x_1 + w_2x_2 + \cdots + w_dx_d) = f(w\cdot x + w_0).$$
The role of the bias is to shift the threshold in the activation function.
If x1 , x2 , . . . , xN is a dataset, then (x1 , 1), (x2 , 1), . . . , (xN , 1) is the aug-
mented dataset. If the original dataset is in Rd , then the augmented dataset
is in Rd+1 . In this regard, Exercise 7.2.1 is relevant.
By passing to the augmented dataset, a neural network with bias and d
input features can be thought of as a neural network without bias and d + 1
input features.
In §5.2, Bayes theorem is used to express a conditional probability in terms
of a perceptron,
Prob(H | x) = σ(w · x + w₀).
This is a basic example of how a perceptron computes probabilities.
[Figure: a perceptron with bias w₀, inputs x₁, x₂, x₃, weights w₁, w₂, w₃, activation f, and output y.]
Perceptrons gained wide exposure after Minsky and Papert’s famous 1969
book [22], from which Figure 7.3 is taken.
$$\frac{\partial y}{\partial w_i} = f'(y^-)\cdot\frac{\partial y^-}{\partial w_i} = f'(y^-)\cdot x_i.$$
The derivative of the output with respect to the incoming signal is the down-
stream derivative δ, so we obtain the formula
$$\frac{\partial y}{\partial w_i} = \delta\cdot x_i.$$
If there is a bias w₀, the corresponding input is x₀ = 1, and the formula is
still valid. We generalize this formula to any neural network, by explaining
each of the various terms, and leading to (7.2.11).
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
# activation functions
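# (the definitions are not included in this excerpt; the sketch below is an
#  assumption consistent with the names activate and der_dict used later,
#  not the book's exact code)
from numpy import exp, tanh

def id(z): return z                    # identity activation
def relu(z): return max(z, 0)
def sigmoid(z): return 1/(1 + exp(-z))

# derivative of each activation, keyed by the function itself,
# used by local() during back propagation
der_dict = {
    id:      lambda z: 1,
    relu:    lambda z: 1 if z > 0 else 0,
    sigmoid: lambda z: sigmoid(z)*(1 - sigmoid(z)),
    tanh:    lambda z: 1 - tanh(z)**2,
}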
[Figure 7.4: the example neural network, with input nodes 0, 1, neurons 2, 3, 4, 5, output nodes 6, 7, and weights w02, w03, w12, w13, w24, w25, w34, w35.]
$$x^- = (\text{None}, \text{None}, x_2^-, x_3^-, x_4^-, x_5^-, x_6^-, x_7^-),$$
$$x = (x_0, x_1, x_2, x_3, x_4, x_5, \text{None}, \text{None}).$$
Note x₄ = x₆⁻ and x₅ = x₇⁻. Figures 7.5 and 7.6 show the incoming and
outgoing signals.
[Figure: the outgoing signals x₀, x₁, x₂, x₃, x₄, x₅ on the network's edges.]
[Figure: the incoming signals x₂⁻, x₃⁻, x₄⁻, x₅⁻, x₆⁻, x₇⁻ at the network's nodes.]
The nodes may be labeled in any order. We identify input nodes and
output nodes from the weight matrix using the code
def is_output(i):
    for j in range(d):
        if w[i][j] != None: return False
    return True

def is_input(i):
    for j in range(d):
        if w[j][i] != None: return False
    return True
d = 8
[Figure: the example network with numeric weights 0.1, −0.3, −2.0, .22, and 1 attached to its edges.]
activate = [None]*d
activate[2] = relu
activate[3] = id
activate[4] = sigmoid
activate[5] = tanh
Now we modify the forward propagation code in §4.4 to work for neural
networks. The key diagram is Figure 7.9.
Assume the activation function at node j is activate[j]. By (7.2.1) and
(7.2.2), the code is
def incoming(x,w,j):
    return sum([ outgoing(x,w,i)*w[i][j] for i in range(d)
                 if w[i][j] != None ])

def outgoing(x,w,j):
    if x[j] != None: return x[j]
    else: return activate[j](incoming(x,w,j))
[Figure 7.9: nodes i and j joined by an edge with weight w_ij; the outgoing signal x_i at node i feeds into the incoming sum at node j.]
def forward_prop(x,w):
    d = len(x)
    xminus = [None]*d   # incoming sums; stays None at input nodes
    for j in range(d):
        if not is_output(j):
            x[j] = outgoing(x,w,j)
    for j in range(d):
        if not is_input(j):
            xminus[j] = sum([ x[i] * w[i][j] for i in range(d)
                              if w[i][j] != None ])
    return xminus, x
x = [None]*d
x[0] = 1.5
x[1] = 2.5
xminus, x = forward_prop(x,w)
print(xminus)
print(x)
Let y be the target output signals, defined only at the output nodes,
$$y = (y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7) = (\text{None}, \text{None}, \text{None}, \text{None}, \text{None}, \text{None}, 0.427, -0.288),$$
and let J(x⁻, y) be a function of x⁻ and y, measuring the error between the
target outputs y and the actual outputs. For Figure 7.4, we define the mean
square error function or mean square loss
$$J(x^-, y) = \frac{1}{2}(x_6^- - y_6)^2 + \frac{1}{2}(x_7^- - y_7)^2. \qquad (7.2.6)$$
The code for J is
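# (the listing is not included in this excerpt; the sketch below is an
#  assumption matching the call J(xminus,y) used in train_nn later)
def J(xminus, y):
    return sum([ (xminus[i] - y[i])**2 / 2 for i in range(d) if is_output(i) ])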
Since x₄ = x₆⁻ and x₅ = x₇⁻, equivalently
$$J = \frac{1}{2}(x_4 - y_6)^2 + \frac{1}{2}(x_5 - y_7)^2,$$
where
$$x_5 = f_5(x_5^-) = f_5(w_{25}x_2 + w_{35}x_3),$$
$$x_4 = f_4(x_4^-) = f_4(w_{24}x_2 + w_{34}x_3),$$
$$x_3 = f_3(x_3^-) = f_3(w_{03}x_0 + w_{13}x_1),$$
$$x_2 = f_2(x_2^-) = f_2(w_{02}x_0 + w_{12}x_1).$$
Therefore J is a function of the weights w25, w35, w24, w34, w03, w13, w02,
w12. For gradient descent, we will need the derivative of J with respect to
these weights, as we did above for the perceptron. For example, w25 appears
above only in x₅⁻, so, by the chain rule,
$$\frac{\partial J}{\partial w_{25}} = \frac{\partial J}{\partial x_5^-}\cdot\frac{\partial x_5^-}{\partial w_{25}}.$$
Since x₅⁻ = w₂₅x₂ + w₃₅x₃,
$$\frac{\partial x_5^-}{\partial w_{25}} = x_2.$$
If we denote
$$\delta_5 = \frac{\partial J}{\partial x_5^-},$$
we obtain
$$\frac{\partial J}{\partial w_{25}} = x_2\cdot\delta_5.$$
The goal is to code this formula in general, see (7.2.11), to be used in gradient
descent.
These are the downstream derivative, local derivative, and upstream derivative
at node j. (The terminology reflects the fact that derivatives are computed
backward.)
[Figure: at node i, the upstream derivative ∂J/∂x_i, the local derivative f_i′, and the downstream derivative ∂J/∂x_i⁻.]
From (7.2.2),
$$\frac{\partial x_j}{\partial x_j^-} = f_j'(x_j^-). \qquad (7.2.8)$$
By the chain rule and (7.2.8), the key relation between these derivatives is
$$\frac{\partial J}{\partial x_i^-} = \frac{\partial J}{\partial x_i}\cdot f_i'(x_i^-), \qquad (7.2.9)$$
or
downstream = upstream × local.
def local(x,w,i):
    return der_dict[activate[i]](incoming(x,w,i))
[Figure: the downstream derivatives δ₂, δ₃, δ₄, δ₅, δ₆, δ₇ at the network's nodes.]
Let
$$\delta_i = \frac{\partial J}{\partial x_i^-}, \qquad i = 0, 1, \dots, d-1.$$
If i is an input node, δᵢ is None. Then we have the downstream gradient
vector δ = (δ₀, δ₁, . . . , δ_{d−1}). Strictly speaking, we should write δᵢ⁻ for the
downstream derivatives. However, in §7.4, we don't need upstream deriva-
tives. Because of this, we will write δᵢ.
$$\delta_6 = \frac{\partial J}{\partial x_6^-} = (x_6^- - y_6) = -0.294.$$
Similarly,
$$\delta_7 = \frac{\partial J}{\partial x_7^-} = (x_7^- - y_7) = -0.666.$$
The code for this is
def delta_J(xminus,y):
    delta = [None]*d
    for i in range(d):
        if is_output(i): delta[i] = xminus[i] - y[i]
    return delta
$$\frac{\partial J}{\partial x_i^-} = \sum_{i\to j} \frac{\partial J}{\partial x_j^-}\cdot\frac{\partial x_j^-}{\partial x_i}\cdot\frac{\partial x_i}{\partial x_i^-}
= \sum_{i\to j} \frac{\partial J}{\partial x_j^-}\cdot w_{ij}\cdot f_i'(x_i^-).$$
The code is
def downstream(x,delta,w,i):
    if delta[i] != None: return delta[i]
    else:
        upstream = sum([ downstream(x,delta,w,j) * w[i][j]
                         for j in range(d) if w[i][j] != None ])
        return upstream * local(x,w,i)

def backward_prop(x,delta,w):
    d = len(x)
    for i in range(d):
        if not is_input(i): delta[i] = downstream(x,delta,w,i)
    return delta

delta = delta_J(xminus,y)
delta = backward_prop(x,delta,w)
print(delta)
returns
$$\frac{\partial x_j^-}{\partial w_{ij}} = x_i,$$
We have shown
$$\frac{\partial J}{\partial w_{ij}} = x_i\cdot\delta_j. \qquad (7.2.11)$$
[Figure: a dense shallow layer with inputs x₀, x₁, x₂, x₃ feeding neurons f₀, f₁, f₂ that output z₀, z₁, z₂.]
Our convention is wij denotes the weight on the edge from node i to node
j. With this convention, the formulas (7.2.1), (7.2.2) reduce to the matrix
multiplication formulas
z − = W t x, z = f (W t x). (7.2.12)
Exercises
Exercise 7.2.2 Verify the propagation computations in this web page. This
neural network has four input nodes (two of which are biases), four neurons,
and two output nodes. The two output nodes are to the right of nodes o1, o2,
and not shown, you have to include them. Here W is 10 × 10. Their neth1 and
outh1 are the incoming and outgoing signals at node h1, and their δo1 is the
downstream derivative at node o1. Don’t update the weights, just compute
x and δ.
Exercise 7.2.3 Verify the propagation computations in this web page. This
neural network has five input nodes (three of which are biases), three neurons,
and one output node. The output node is to the right of node 5, and not
shown, you have to include it. Here W is 9 × 9. Their I3 is x− 3 , and their O3
is x3 . Their Errj is δj . At nodes 3, 4, 5, the activation function fj is σ, so
fj′ = σ(1 − σ). Don’t update the weights, just compute x and δ.
This goal is so general that any concrete insight one provides towards this
goal is widely useful in many settings. The setting we have in mind is f = J,
where J is the mean error from §7.1.
Usually f (w) is a measure of cost or lack of compatibility. Because of this,
f (w) is called the loss function or cost function.
A neural network is a black box with inputs x and outputs y, depending on
unknown weights w. To train the network is to select weights w in response
to training data (x, y). The optimal weights w∗ are selected as minimizers
of a loss function f (w) measuring the error between predicted outputs and
actual outputs, corresponding to given training inputs.
From §4.3, if the loss function f (w) is continuous and proper, there is
a global minimizer w∗ . If f (w) is in addition strictly convex, w∗ is unique
(§4.5). When this happens, if the gradient of the loss function is g = ∇f (w),
then w∗ is the unique point satisfying g ∗ = ∇f (w∗ ) = 0.
$$\frac{g(b) - g(a)}{b - a} \approx g'(a).$$
Inserting a = w and b = w⁺, and solving for w⁺,
$$w^+ \approx w - \frac{g(w)}{g'(w)}.$$
Since the global minimizer w* satisfies f′(w*) = 0, we insert g(w) = f′(w)
in the above approximation,
$$w^+ \approx w - \frac{f'(w)}{f''(w)}.$$
$$w_{n+1} = w_n - \frac{f'(w_n)}{f''(w_n)}, \qquad n = 1, 2, \dots$$
def newton(loss,grad,curv,w,num_iter):
    g = grad(w)
    c = curv(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= g/c
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        c = curv(w)
        if allclose(g,0): break
    return trajectory
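The loss, grad, and curv functions are not defined in this excerpt; for the double-well example f(w) = w⁴ − 6w² + 2w used below, they would be (an assumed sketch):

def loss(w): return w**4 - 6*w**2 + 2*w
def grad(w): return 4*w**3 - 12*w + 2
def curv(w): return 12*w**2 - 12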
u0 = -2.72204813
w0 = 2.45269774
num_iter = 20
trajectory = newton(loss,grad,curv,w0,num_iter)
def plot_descent(a,b,loss,curv,delta,trajectory):
    w = arange(a,b,delta)
    plot(w,loss(w),color='red',linewidth=1)
    plot(w,curv(w),"--",color='blue',linewidth=1)
    plot(*trajectory,color='green',linewidth=1)
    scatter(*trajectory,s=10)
    title("num_iter= " + str(len(trajectory.T)))
    grid()
    show()

ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
f (w) = f (w1 , w2 , . . . ).
In other words,
In practice, the learning rate is selected by trial and error. Which learning
rate does the theory recommend?
Given an initial point w0 , the sublevel set at w0 (see §4.5) consists of all
points w where f (w) ≤ f (w0 ). Only the part of the sublevel set that is
connected to w0 counts.
In Figure 7.15, the sublevel set at w0 is the interval [u0 , w0 ]. The sublevel
set at w1 is the interval [b, w1 ]. Notice we do not include any points to the
left of b in the sublevel set at w1 , because points to the left of b are separated
from w1 by the gap at the point b.
Suppose the second derivative D2 f (w) is never greater than a constant L
on the sublevel set. This means
Fig. 7.15 Double well cost function and sublevel sets at w0 and at w1 .
To see this, fix w and let S be the sublevel set {w′ : f (w′ ) ≤ f (w)}. Since
the gradient pushes f down, for t > 0 small, w+ stays in S. Insert x = w+
and a = w into the right half of (4.5.16) and simplify. This leads to
$$f(w^+) \le f(w) - t\,|\nabla f(w)|^2 + \frac{t^2 L}{2}\,|\nabla f(w)|^2.$$
Since tL ≤ 1 when 0 ≤ t ≤ 1/L, we have t²L ≤ t. This derives (7.3.3).
The curvature of the loss function and the learning rate are inversely pro-
portional. Where the curvature of the graph of f (w) is large, the learning
rate 1/L is small, and gradient descent proceeds in small time steps.
For example, let f(w) = w⁴ − 6w² + 2w (Figures 7.14, 7.15, 7.16). Then
f′(w) = 4w³ − 12w + 2 and f″(w) = 12w² − 12.
Thus the inflection points (where f″(w) = 0) are ±1 and, in Figure 7.15, the
critical points are a, b, c.
Let u0 and w0 be the points satisfying f (w) = 5 as in Figure 7.16.
Then u0 = −2.72204813 and w0 = 2.45269774, so f ′′ (u0 ) = 76.914552 and
f ′′ (w0 ) = 60.188. Thus we may choose L = 76.914552. With this L, the
short-step gradient descent starting at w0 is guaranteed to converge to one
of the three critical points. In fact, the sequence converges to the right-most
critical point c (Figure 7.16).
This exposes a flaw in basic gradient descent. Gradient descent may con-
verge to a local minimizer, and miss the global minimizer. In §7.9, modified
gradient descent will address some of these shortcomings.
def gd(loss,grad,w,learning_rate,num_iter):
    g = grad(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= learning_rate * g
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        if allclose(g,0): break
    return trajectory
u0 = -2.72204813
w0 = 2.45269774
L = 76.914552
learning_rate = 1/L
num_iter = 100
trajectory = gd(loss,grad,w0,learning_rate,num_iter)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
xin → xout .
Here the network inputs xin are the outgoing signals at the input nodes, and
the network outputs xout are the incoming signals at the output nodes.
Given inputs xin and target outputs y, we seek to modify the weight matrix
W so that the input-output map is
xin → y.
The weights are modified using gradient descent. If J measures the error
between the network outputs xout and the targets y, the weight matrix W is
updated to a new matrix W + using (7.3.1),
W + = W − t∇W J.
∂J/∂wij = xi δj ,
or
∇W J = x ⊗ δ. (7.4.1)
For the network in Figure 7.4, the weight gradients are as in Figure 7.17.
[Figure 7.17: the weight gradients x_i δ_j attached to the corresponding edges of the network of Figure 7.4.]
The code is
def update_weights(x,delta,w,learning_rate):
    d = len(x)
    for i in range(d):
        for j in range(d):
            if w[i][j]:
                w[i][j] = w[i][j] - learning_rate*x[i]*delta[j]
    return w   # return the updated weights (train_nn below assigns the result)
The triple
def train_nn(x,y,w,learning_rate,n_iter):
    trajectory = []
    # local copy
    wlocal = [ row[:] for row in w ]
    for _ in range(n_iter):
        xminus, x = forward_prop(x,wlocal)
        cost = J(xminus,y)
        if isclose(0,cost): break
        trajectory.append(cost)
        delta = delta_J(xminus,y)
        delta = backward_prop(x,delta,wlocal)
        wlocal = update_weights(x,delta,wlocal,learning_rate)
    return x, xminus, wlocal, trajectory
Here n_iter is the maximum number of iterations allowed, and the iterations
stop if the cost J is close to zero.
Let W be the weight matrix (7.2.4). Then
x = [None]*d
x[0] = 1.5
x[1] = 2.5
y = [None]*d
y[6] = 0.427
y[7] = -0.288
lr = .045
n_iter = 1000
x, xminus, w, trajectory = train_nn(x,y,w,lr,n_iter)
len(trajectory)
x₆⁻ = 0.42688039547094403, x₇⁻ = −0.28800519549406556,
grid()
legend()
show()
Fig. 7.18 Cost trajectory and number of iterations as learning rate varies.
J(W ∗ ) ≤ J(W ),
[Figure 7.19: linear regression as a network, with inputs x₁, . . . , x₄, outputs z = W^t x, and loss J = |z − y|²/2.]
For linear regression without bias, the loss function is (7.5.1) with
$$J(x, y, W) = \frac{1}{2}|y - z|^2, \qquad z = W^t x. \qquad (7.5.3)$$
Then (7.5.1) is the mean square error or mean square loss, and the problem
of minimizing (7.5.1) is linear regression (Figure 7.19).
$$= \mathrm{trace}\big(V^t(x\otimes(z + sv - y))\big).$$
$$\left.\frac{d^2}{ds^2}\right|_{s=0} J(x, y, W + sV) = |v|^2 = |V^t x|^2. \qquad (7.5.6)$$
$$\left.\frac{d^2}{ds^2}\right|_{s=0} J(W + sV) = 0.$$
$$V^t x_k = 0, \qquad k = 1, 2, \dots, N. \qquad (7.5.7)$$
Recall the feature space is the vector space of all inputs x, and (§2.9) a
dataset is full-rank if the span of the dataset is the entire feature space. When
this happens, (7.5.7) implies V = 0. By (4.5.15), J(W ) is strictly convex.
To check properness of J(W ), by definition (4.3.9), we show there is a
bound C with
$$J(W) \le c \implies \|W\| \le C\sqrt d. \qquad (7.5.8)$$
Here ∥W ∥ is the norm of the matrix W (2.2.13). The exact formula for the
bound C, which is not important for our purposes, depends on the level c
and the dataset.
If J(W) ≤ c, by (7.5.1), (7.5.3), and the triangle inequality,
$$|W^t x_k| \le \sqrt{2c} + |y_k|, \qquad k = 1, 2, \dots, N.$$
|W t x| ≤ C(x). (7.5.9)
For linear regression with bias, the loss function is (7.5.2) with
$$J(x, y, W, b) = \frac{1}{2}|y - z|^2, \qquad z = W^t x + b. \qquad (7.5.10)$$
Here W is the weight matrix and b is a bias vector.
If we augment the dataset x1 , x2 , . . . , xN to (x1 , 1), (x2 , 1), . . . , (xN , 1),
then this corresponds to the augmented weight matrix
$$\begin{pmatrix} W \\ b^t \end{pmatrix}.$$
Applying the last result to the augmented dataset and appealing to Exer-
cise 7.2.1, we obtain
These are simple, clear geometric criteria for convergence of gradient de-
scent to the global minimum of J, valid for linear regression with or without
bias inputs.
Exercises
max p = max(p1 , p2 , . . . , pd ).
Then the i-th class may be defined as the samples with targets satisfying
pi = max p. Alternatively, the i-th class may be defined as the samples with
targets satisfying pi > 0.
When classes are assigned to targets, they need not be disjoint. Because
of this, they are called soft classes. Summarizing, a soft-class dataset is a
dataset x1 , x2 , . . . , xN with targets p1 , p2 , . . . , pN consisting of probability
vectors.
We start with logistic regression without bias inputs. For logistic regres-
sion, the loss function is
$$J(W) = \sum_{k=1}^{N} J(x_k, p_k, W), \qquad (7.6.1)$$
Here I(p, q) is the relative information, and q = σ(y) is the softmax function,
squashing the network’s output y = W t x into the probability q. When q =
σ(y), I(p, q) measures the information error between the desired target p and
the computed target q.
When p is one-hot encoded, by (5.6.16),
Because of this, in the literature, in the one-hot encoded case, (7.6.1) is called
the cross-entropy loss.
[Figure 7.20: logistic regression as a network, with inputs x₁, . . . , x₄, outputs y = W^t x, probabilities q = σ(y), and loss J = I(p, q).]
J(W ) is logistic loss or logistic error, and the problem of minimizing (7.6.1)
is logistic regression (Figure 7.20).
Since we will be considering both strict and one-hot encoded probabilities,
we work with I(p, q) rather than Icross (p, q). Table 5.33 is a useful summary
of the various information and entropy concepts.
$$W\mathbf 1 = 0, \qquad (7.6.2)$$
or
$$\sum_{j=1}^{d} w_{ij} = 0, \qquad i = 1, 2, \dots, d.$$
and, by (5.6.10),
$$\left.\frac{d}{ds}\right|_{s=0} J(x, y, W + sV) = \left.\frac{d}{ds}\right|_{s=0} I(p, \sigma(y + sv))
= v\cdot(q - p) = (V^t x)\cdot(q - p) = \mathrm{trace}\big((V^t x)\otimes(q - p)\big) = \mathrm{trace}\big(V^t(x\otimes(q - p))\big).$$
As before, this result is a special case of (7.4.1). Since q and p are probability
vectors, p · 1 = 1 = q · 1, hence the gradient G is centered.
Recall (§5.6) we have strict convexity of Z(y) along centered vectors y,
those vectors satisfying y · 1 = 0. Since y = W t x, y · 1 = x · W 1. Hence, to
force y · 1 = 0, it is natural to assume W is centered.
If we initiate gradient descent with a centered weight matrix W , since the
gradient G is also centered, all successive weight matrices will be centered.
To see this, given a vector v and probability vector q, set v̄ = Σ_{j=1}^d v_j q_j.
Then
$$\sum_{j=1}^{d} v_j^2 q_j - \left(\sum_{j=1}^{d} v_j q_j\right)^{2} = \sum_{j=1}^{d} (v_j - \bar v)^2 q_j.$$
vanishes, then, since the summands are nonnegative, (7.6.6) vanishes, for
every sample x = xk , p = pk , hence
V t xk = 0, k = 1, 2, . . . , N.
The convex hull is discussed in §4.5, see Figures 4.26 and 4.27. If Ki were
just the samples x whose corresponding targets p satisfy pi > 0 (with no
convex hull), then the intersection Ki ∩ Kj may be empty.
For example, if p were one-hot encoded, then x belongs to at most one Ki .
Thus taking the convex hull in the definition of Ki is crucial. This is clearly
seen in Figure 7.32: The samples never intersect, but the convex hulls may
do so.
To establish properness of J(W ), by definition (4.3.9), we show
for some C. The exact formula for the bound C, which is not important for
our purposes, depends on the level c and the dataset.
Suppose J(W ) ≤ c, with W 1 = 0 and let q = σ(y). Then I(p, q) =
J(x, p, W ) ≤ c for every sample x and corresponding target p.
Let x be a sample, let y = W t x, and suppose the corresponding target p
satisfies pi ≥ ϵ, for some class i, and some ϵ > 0. If j ̸= i, then
d
X
ϵ(yj − yi ) ≤ ϵ(Z(y) − yi ) ≤ pi (Z(y) − yi ) ≤ pk (Z(y) − yk ) = Z(y) − p · y.
k=1
By (5.6.15),
Z(y) − p · y = I(p, σ(y)) − I(p) ≤ c + log d.
Combining the last two inequalities,
ϵ(yj − yi ) ≤ c + log d.
Let x be any vector in feature space, and let y = W t x. Since span(Ki ∩Kj )
is full-rank, x is a linear combination of vectors in Ki ∩ Kj , for every i and j.
This implies, by (7.6.8), there is a bound C(x), depending on x but not on
W , such that
$$d\,|y_i| = |(d-1)y_i + y_i| = \Big|\sum_{j\ne i}(y_i - y_j)\Big| \le (d-1)\,C(x).$$
$$|w_{ji}| = |e_j\cdot W e_i| \le C, \qquad i, j = 1, 2, \dots, d.$$
By (2.2.13),
$$\|W\|^2 = \sum_{i,j}|w_{ij}|^2 \le d^2 C^2.$$
with
J(x, p, W, b) = I(p, q), q = σ(y), y = W t x + b.
Here W is the weight matrix and b is the bias vector. In keeping with our
prior convention, we call the weight (W, b) centered if W is centered and b is
centered. Then y is centered.
If the columns of W are (w1 , w2 , . . . , wd ), and b = (b1 , b2 , . . . , bd ), then
y = W t x + b is equivalent to levels corresponding to d hyperplanes (§4.5)
$$y_1 = w_1\cdot x + b_1,\quad y_2 = w_2\cdot x + b_2,\quad \dots,\quad y_d = w_d\cdot x + b_d. \qquad (7.6.11)$$
$$\begin{cases} y_i \ge 0, & \text{for } x \text{ in class } i,\\ y_i \le 0, & \text{for } x \text{ in class } j,\end{cases}
\qquad \text{for every } i = 1, 2, \dots, d \text{ and every } j \ne i. \qquad (7.6.12)$$
$$\begin{cases} y_i \ge 0, & \text{for } x \text{ in class } i,\\ y_i \le 0, & \text{for } x \text{ in class } j,\end{cases}
\qquad \text{for some } i = 1, 2, \dots, d \text{ and some } j \ne i. \qquad (7.6.13)$$
As special cases, there are corresponding results for strict targets and one-
hot encoded targets.
To begin the proof, suppose (W, b) satisfies (7.6.12). Then (Exercise 7.6.4)
$$\begin{cases} y_i \ge 0, & \text{for } x \text{ in } K_i,\\ y_j \le 0, & \text{for } x \text{ in } K_i \text{ and every } j \ne i,\end{cases}
\qquad \text{for every } i = 1, 2, \dots, d. \qquad (7.6.14)$$
From this, one obtains I(p, σ(y)) ≤ log d for every sample x and q = σ(y)
(Exercise 7.6.5). Since this implies J(W, b) ≤ N log d, the loss function is not
proper, hence not trainable.
$$r\epsilon\,|w_i - w_j| \le c + \log d.$$
Let
$$y_i = w_i\cdot x^*_{ij} + b_i, \qquad y_j = w_j\cdot x^*_{ij} + b_j.$$
Since x*ᵢⱼ is in Kᵢ ∩ Kⱼ, by (7.6.8),
Hence
Since W is centered,
$$d\,w_i = (d-1)w_i + w_i = (d-1)w_i - \sum_{j\ne i} w_j = \sum_{j\ne i}(w_i - w_j).$$
Hence
$$|w_i| + |b_i| \le \frac{1}{d}\sum_{j\ne i}\big(|w_i - w_j| + |b_i - b_j|\big).$$
A very special case is a two-class dataset. In this case, the result is com-
pelling:
We end the section by comparing the three regressions: linear, strict logis-
tic, and one-hot encoded logistic.
In classification problems, it is one-hot encoded logistic regression that is
relevant. Because of this, in the literature, logistic regression often defaults
to the one-hot encoded case.
In linear regression, not only do J(W ) and J(W, b) have minima, but so
does J(z, y). Properness ultimately depends on properness of a quadratic |z|2 .
In strict logistic regression, by (7.6.3), the critical point equation
∇y J(y, p) = 0
can always be solved, so there is at least one minimum for each J(y, p). Here
properness ultimately depends on properness of Z(y).
Exercises
$$J(w, w_0) = \sum_{k=1}^{N}(y_k - w\cdot x_k - w_0)^2$$
0 = w0 + w · x = w0 + w1 x1 + w2 x2 + · · · + wd xd .
Linear Regression
We work out the regression equation in the plane, when both features x
and y are scalar. In this case, w = (m, b) and
$$X = \begin{pmatrix} x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_N & 1\end{pmatrix}, \qquad
Y = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N\end{pmatrix}.$$
Then (x̄, ȳ) is the mean of the dataset. Also, let x and y denote the vectors
(x₁, x₂, . . . , x_N) and (y₁, y₂, . . . , y_N), and let, as in §1.5,
$$\mathrm{cov}(x, y) = \frac{1}{N}\sum_{k=1}^{N}(x_k - \bar x)(y_k - \bar y) = \frac{1}{N}\,x\cdot y - \bar x\bar y.$$
\[
\begin{aligned}
(x\cdot x)\,m + \bar x\, b &= x\cdot y,\\
m\,\bar x + b &= \bar y.
\end{aligned}
\]
The second equation says the regression line passes through the mean (x̄, ȳ).
Multiplying the second equation by x̄ and subtracting the result from the
first equation cancels b and leads to the slope formula below. This derives

The regression line in two dimensions passes through the mean (x̄, ȳ)
and has slope
\[
m = \frac{\operatorname{cov}(x, y)}{\operatorname{cov}(x, x)}.
\]
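As a quick check of this formula, the slope and intercept can be computed directly from the covariances and means; the tiny dataset below is made up for illustration.

from numpy import array, mean

# made-up illustrative data
x = array([1.0, 2.0, 3.0, 4.0, 5.0])
y = array([2.1, 3.9, 6.2, 8.1, 9.8])

def cov(u, v):
    # cov(u, v) = mean(u*v) - mean(u)*mean(v), as in Section 1.5
    return mean(u*v) - mean(u)*mean(v)

m = cov(x, y) / cov(x, x)       # slope of the regression line
b = mean(y) - m*mean(x)         # the line passes through the mean
print(m, b)

The same slope and intercept are returned by numpy.polyfit(x, y, 1), which fits a degree-one polynomial by least squares.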
from numpy import mean, sqrt
from pandas import read_csv

df = read_csv("longley.csv")
X = df["Population"].to_numpy()
Y = df["Employed"].to_numpy()

# center the data
X = X - mean(X)
Y = Y - mean(Y)

# scale to unit variance
varx = sum(X**2)/len(X)
vary = sum(Y**2)/len(Y)
X = X/sqrt(varx)
Y = Y/sqrt(vary)
After this, we compute the optimal weight w∗ and construct the polyno-
mial. The regression equation is solved using the pseudo-inverse (§2.3).
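Here is one possible sketch of this step, assuming the centered, scaled arrays X, Y from the snippet above; the helper names fit_poly and poly below are illustrative and may differ from the book's own definitions.

from numpy import column_stack
from numpy.linalg import pinv

def fit_poly(X, Y, d):
    # design matrix with columns 1, x, x^2, ..., x^(d-1); w* = pinv(A) Y
    A = column_stack([X**k for k in range(d)])
    return pinv(A) @ Y

def poly(x, d):
    # evaluate the fitted polynomial of degree d-1 at the points x
    w = fit_poly(X, Y, d)
    return sum([ w[k]*x**k for k in range(d) ])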
figure(figsize=(12,12))

# six subplots
rows, cols = 3, 2

# x interval
x = arange(xmin, xmax, .01)

for i in range(6):
    d = 3 + 2*i   # degree = d-1
    subplot(rows, cols, i+1)
    plot(X, Y, "o", markersize=2)
    plot([0], [0], marker="o", color="red", markersize=4)
    plot(x, poly(x, d), color="blue", linewidth=.5)
    xlabel("degree = %s" % str(d-1))
    grid()

show()
Running this code with degree 1 returns Figure 7.22. Taking too high a
power can lead to overfitting, for example for degree 12.
x p x p x p x p x p
0.5 0 .75 0 1.0 0 1.25 0 1.5 0
1.75 0 1.75 1 2.0 0 2.25 1 2.5 0
2.75 1 3.0 0 3.25 1 3.5 0 4.0 1
4.25 1 4.5 1 4.75 1 5.0 1 5.5 1
More generally, we may only know the amount of study time x, and the
probability p that the student passed, where now 0 ≤ p ≤ 1.
For example, the data may be as in Figure 7.24, where pk equals 1 or 0
according to whether they passed or not.
As stated, the samples of this dataset are scalars, and the dataset is one-
dimensional (Figure 7.25).
Plotting the dataset on the (x, p) plane, the goal is to fit a curve
p = σ(m∗ x + b∗ ) (7.7.4)
as in Figure 7.26.
Since this is logistic regression with bias, we can apply the two-class result
from the previous section: The dataset is one-dimensional, so a hyperplane is
just a point, a threshold. Neither class lies in a hyperplane, and the dataset is
not separable (Figure 7.25). Hence logistic regression with bias is trainable,
and gradient descent is guaranteed to converge to an optimal weight (m∗ , b∗ ).
Here is the descent code.
from numpy import *
from numpy.linalg import norm
from scipy.special import expit

X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75,
     3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

def gradient(m, b):
    # gradient of the logistic loss over the dataset
    return sum([ (expit(m*x + b) - p) * array([x, 1]) for x, p in zip(X, P) ], axis=0)

# gradient descent
w = array([0.0, 0.0])   # starting m, b
g = gradient(*w)
t = .01                 # learning rate
# iterate the descent step until the gradient is negligible
while norm(g) > 1e-6:
    w = w - t*g
    g = gradient(*w)
print(w)
m∗ = 1.49991537, b∗ = −4.06373862.
Even though we are done, we take the long way and apply logistic regression
without bias, by incorporating the bias into the dataset as an extra feature,
to better understand how things work.
To this end, we incorporate the bias and write the augmented dataset,
resulting in Figure 7.27. Since these vectors are not parallel, the dataset is
full-rank in R², hence J(m, b) is strictly convex.
[Figure 7.27: the augmented dataset (x, 1) in the plane.]
Let σ(z) be the sigmoid function (5.2.15). Then, as in the previous section,
the goal is to minimize the loss function
\[
J(m, b) = \sum_{k=1}^N I(p_k, q_k), \qquad q_k = \sigma(m x_k + b),
\tag{7.7.5}
\]
Once we have the minimizer (m∗ , b∗ ), we have the best-fit curve (7.7.4).
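For instance, with the dataset X, P from the descent code above, the loss (7.7.5) can be evaluated directly; here I(p, q) is taken to be the two-class cross entropy −p log q − (1 − p) log(1 − q) (the book's I may differ by the constant entropy of p).

from numpy import log
from scipy.special import expit

def I(p, q):
    # two-class cross entropy, handling the pure 0/1 targets separately
    if p == 0: return -log(1 - q)
    if p == 1: return -log(q)
    return -p*log(q) - (1 - p)*log(1 - q)

def J(m, b):
    return sum([ I(p, expit(m*x + b)) for x, p in zip(X, P) ])

print(J(0, 0), J(1.49991537, -4.06373862))   # the loss is smaller at the optimal weight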
If the targets p are one-hot encoded, the dataset is as follows.
x p x p x p x p x p
0.5 (1,0) .75 (1,0) 1.0 (1,0) 1.25 (1,0) 1.5 (1,0)
1.75 (1,0) 1.75 (0,1) 2.0 (1,0) 2.25 (0,1) 2.5 (1,0)
2.75 (0,1) 3.0 (1,0) 3.25 (0,1) 3.5 (1,0) 4.0 (0,1)
4.25 (0,1) 4.5 (0,1) 4.75 (0,1) 5.0 (0,1) 5.5 (0,1)
[Figure: computational graphs for the logistic loss, with inputs x, m, b, intermediate values y = mx + b and q = σ(y), and output J = I(p, q).]
Figure 7.26 is a plot of x against p. However, the dataset, with the bias
input included, has two inputs x, 1 and one output p, and should be plotted
in three dimensions (x, 1, p). Then (Figure 7.31) samples lie on the line (x, 1)
in the horizontal plane, and p is on the vertical axis.
The horizontal plane in Figure 7.31, which is the plane in Figure 7.27, is
feature space. The convex hulls K0 and K1 are in feature space, so the convex
hull K0 of the samples corresponding to p = 0 is the line segment joining
(.5, 1, 0) and (3.5, 1, 0), and the convex hull K1 of the samples corresponding
to p = 1 is the line segment joining (1.75, 1, 0) and (5.5, 1, 0). In Figure 7.31,
K0 is the line segment joining the green points, and K1 is the projection onto
feature space of the line segment joining the red points. Since K0 ∩ K1 is the
line segment joining (1.75, 1, 0) and (3.5, 1, 0), the span of K0 ∩ K1 is all of
feature space. By the results of the previous section, J(w) is proper.
The Iris dataset consists of 150 samples divided into three groups, leading
to three convex hulls K0, K1, K2 in R⁴. If the dataset is projected onto the
top two principal components, then the projections of these three hulls do
not pair-intersect (Figure 7.32). It follows that we have no guarantee that the
logistic loss is proper.
On the other hand, the MNIST dataset consists of 60,000 samples divided
into ten groups. If the MNIST dataset is projected onto the top two principal
components, the projections of the ten convex hulls K0 , K1 , . . . , K9 onto R2 ,
do intersect (Figure 7.33).
This does not guarantee that the ten convex hulls K0 , K1 , . . . , K9 in R784
intersect, but at least this is so for the 2d projection of the MNIST dataset.
Therefore the logistic loss of the 2d projection of the MNIST dataset is proper.
In this section, we work with loss functions that are smooth and strictly
convex. While this is not always the case, this assumption is a base case
against which we can test different optimization or training models.
By smooth and strictly convex, we mean there are positive constants m
and L satisfying
\[
m\,I \preceq D^2f(w) \preceq L\,I.
\tag{7.8.1}
\]
Recall this means the eigenvalues of the symmetric matrix D²f(w) are between
m and L. In this situation, the condition number r = m/L is between
zero and one: 0 < r ≤ 1.
In the previous section, we saw that basic gradient descent converged to
a critical point. If f (x) is strictly convex, there is exactly one critical point,
the global minimum. From this we have
\[
\frac{m}{2}\,|w - w^*|^2 \le f(w) - f(w^*) \le \frac{L}{2}\,|w - w^*|^2.
\tag{7.8.3}
\]
How far we are from our goal w∗ can be measured by the error E(w) =
|w − w∗ |2 . Another measure of error is E(w) = f (w) − f (w∗ ). The goal is to
drive the error between w and w∗ to zero.
When f (w) is smooth and strictly convex in the sense of (7.8.1), the es-
timate (7.8.3) shows these two error measures are equivalent. We use both
measures below.
Gradient Descent I
Let r = m/L and set E(w) = f (w)−f (w∗ ). Then the descent sequence
w0 , w1 , w2 , . . . given by (7.3.1) with learning rate
\[
t = \frac{1}{L}
\]
converges to w^* at the rate
\[
E(w_n) \le (1 - r)^n\, E(w_0), \qquad n = 1, 2, \dots
\tag{7.8.5}
\]
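Here is a small numerical check of GD-I on a quadratic loss f(w) = ½ w·Qw − b·w; the matrix Q, the vector b, and the starting point below are made up for illustration, and L, m are the extreme eigenvalues of Q.

from numpy import array, dot
from numpy.linalg import eigvalsh, solve

Q = array([[3.0, 1.0], [1.0, 2.0]])
b = array([1.0, 0.0])

def f(w): return 0.5*dot(w, Q @ w) - dot(b, w)
def grad(w): return Q @ w - b

lam = eigvalsh(Q)              # eigenvalues in increasing order
m, L = lam[0], lam[-1]
r = m/L
wstar = solve(Q, b)            # the unique minimizer

w = array([5.0, -3.0])
t = 1/L
E0 = f(w) - f(wstar)
for n in range(1, 21):
    w = w - t*grad(w)                            # basic descent step
    assert f(w) - f(wstar) <= (1 - r)**n * E0    # the rate (7.8.5)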
\[
g\cdot(w - w^*) \ge \frac{mL}{m+L}\,|w - w^*|^2 + \frac{1}{m+L}\,|g|^2.
\]
Using this and (7.3.1) with t = 2/(m + L), this implies the following.
Gradient Descent II
GD-II improves GD-I in two ways: Since m < L, the learning rate is larger,
\[
\frac{2}{m+L} > \frac{1}{L},
\]
and the convergence rate is smaller,
\[
\left(\frac{1-r}{1+r}\right)^2 < 1 - r,
\]
Let g be the gradient of the loss function at a point w. Then the line
passing through w in the direction of g is w − tg. When the loss function is
quadratic (4.3.8), f (w − tg) is a quadratic function of the scalar variable t.
In this case, the minimizer t along the line w − tg is explicitly computable as
\[
t = \frac{g\cdot g}{g\cdot Qg}.
\]
This leads to gradient descent with varying time steps t0 , t1 , t2 , . . . . As a
consequence, one can show the error is lowered as follows,
\[
E(w^+) = \left(1 - \frac{1}{(u\cdot Qu)(u\cdot Q^{-1}u)}\right) E(w), \qquad u = \frac{g}{|g|}.
\]
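A sketch of this exact line search on a quadratic (Q, b, and the starting point are made-up illustrations):

from numpy import array, dot
from numpy.linalg import solve

Q = array([[3.0, 1.0], [1.0, 2.0]])
b = array([1.0, 0.0])
wstar = solve(Q, b)

w = array([5.0, -3.0])
for n in range(20):
    g = Q @ w - b                      # gradient of the quadratic loss at w
    if dot(g, g) < 1e-20: break
    t = dot(g, g) / dot(g, Q @ g)      # minimizer of f(w - t g) over t
    w = w - t*g

print(w, wstar)                        # w is now close to the minimizer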
w◦ = w + s(w − w− ). (7.9.1)
Here s is the decay rate. The momentum term reflects the direction induced by
the previous step. Because this mimics the behavior of a ball rolling downhill,
gradient descent with momentum is also called heavy ball descent.
Here we have two hyperparameters, the learning rate and the decay rate.
\[
w_n = w^* + \rho^n v, \qquad Qv = \lambda v.
\tag{7.9.5}
\]
Inserting this into (7.9.3) and using Qw∗ = b leads to the quadratic equation
ρ2 = (1 − tλ + s)ρ − s.
\[
4s - (1 - \lambda t + s)^2 \ge \frac{(L - \lambda)(\lambda - m)}{mL}\,(1 - s)^2.
\tag{7.9.8}
\]
When (7.9.6) holds, the roots are conjugate complex numbers ρ, ρ̄, where
\[
\rho = x + iy = \frac{(1 - \lambda t + s) + i\sqrt{4s - (1 - \lambda t + s)^2}}{2}.
\tag{7.9.9}
\]
It follows the absolute value of ρ equals
\[
|\rho| = \sqrt{x^2 + y^2} = \sqrt{s}.
\]
To obtain the fastest convergence, we choose s and t to minimize |ρ| = √s,
while still satisfying (7.9.7). This forces (7.9.7) to be an equality,
\[
\frac{(1 - \sqrt{s})^2}{m} = t = \frac{(1 + \sqrt{s})^2}{L}.
\]
These are two equations in two unknowns s, t. Solving, we obtain
\[
\sqrt{s} = \frac{1 - \sqrt{r}}{1 + \sqrt{r}}, \qquad t = \frac{1}{L}\cdot\frac{4}{(1 + \sqrt{r})^2}.
\]
Let w̃n = wn −w∗ . Since Qwn −b = Qw̃n , (7.9.3) is a 2-step linear recursion
in the variables w̃n . Therefore the general solution depends on two constants
A, B.
Let λ1 , λ2 , . . . , λd be the eigenvalues of Q and let v1 , v2 , . . . , vd be the
corresponding orthonormal basis of eigenvectors.
Since (7.9.3) is a 2-step vector linear recursion, A and B are vectors, and
the general solution depends on 2d constants Ak , Bk , k = 1, 2, . . . , d.
If ρk , k = 1, 2, . . . , d, are the corresponding roots (7.9.9), then (7.9.5) is
a solution of (7.9.3) for each of 2d roots ρ = ρk , ρ = ρ̄k , k = 1, 2, . . . , d.
Therefore the linear combination
\[
w_n = w^* + \sum_{k=1}^d \big(A_k\,\rho_k^n + B_k\,\bar\rho_k^{\,n}\big)\, v_k, \qquad n = 0, 1, 2, \dots
\tag{7.9.10}
\]
\[
\begin{aligned}
A_k + B_k &= (w_0 - w^*)\cdot v_k,\\
A_k\,\rho_k + B_k\,\bar\rho_k &= (w_1 - w^*)\cdot v_k = (1 - t\lambda_k)(w_0 - w^*)\cdot v_k,
\end{aligned}
\]
Let
\[
C = \max_{\lambda}\ \frac{(L - m)^2}{(L - \lambda)(\lambda - m)}.
\tag{7.9.11}
\]
Using (7.9.8), one verifies the estimate
Suppose the loss function f (w) is quadratic (7.8.2), let r = m/L, and
set E(w) = |w − w∗ |2 . Let C be given by (7.9.11). Then the descent
sequence w0, w1, w2, ... given by (7.9.2) with learning rate and decay rate
\[
t = \frac{1}{L}\cdot\frac{4}{(1 + \sqrt{r})^2}, \qquad s = \left(\frac{1 - \sqrt{r}}{1 + \sqrt{r}}\right)^2,
\]
converges to w^* at the rate
\[
E(w_n) \le 4C\left(\frac{1 - \sqrt{r}}{1 + \sqrt{r}}\right)^{2n} E(w_0), \qquad n = 1, 2, \dots
\tag{7.9.12}
\]
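A sketch of heavy ball descent on a made-up quadratic, using the learning and decay rates from the statement above:

from numpy import array, sqrt
from numpy.linalg import eigvalsh, norm, solve

Q = array([[3.0, 1.0], [1.0, 2.0]])
b = array([1.0, 0.0])
lam = eigvalsh(Q)
m, L = lam[0], lam[-1]
r = m/L

t = (1/L) * 4/(1 + sqrt(r))**2            # learning rate
s = ((1 - sqrt(r))/(1 + sqrt(r)))**2      # decay rate

wstar = solve(Q, b)
w_prev = w = array([5.0, -3.0])
for n in range(100):
    # heavy ball step: gradient at w plus momentum s*(w - w_prev)
    w, w_prev = w - t*(Q @ w - b) + s*(w - w_prev), w

print(norm(w - wstar))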
For general smooth strictly convex losses, however, there are examples of
f(w) where heavy ball descent does not converge to w^*. Nevertheless, this
method is widely used.
\[
\begin{aligned}
w^\circ &= w + s(w - w^-),\\
w^+ &= w^\circ - t\,\nabla f(w^\circ).
\end{aligned}
\tag{7.9.13}
\]
we will show
V (w+ ) ≤ ρV (w). (7.9.15)
In fact, we see below (7.9.22), (7.9.23) that V is reduced by an additional
quantity proportional to the momentum term.
The choice t = 1/L is a natural choice from basic gradient descent (7.3.3).
The derivation of (7.9.15) below forces the choices for s and ρ.
Given a point w, while w+ is well-defined by (7.9.13), it is not clear what
w− means. There are two ways to insert meaning here. Either evaluate V (w)
along a sequence w0 , w1 , w2 , . . . and set, as before, wn− = wn−1 , or work
464 CHAPTER 7. MACHINE LEARNING
\[
V(w_0) = f(w_0) + \frac{L}{2}\,|w_0 - \rho w_0|^2 = f(w_0) + \frac{m}{2}\,|w_0|^2 \le 2 f(w_0).
\]
Moreover f (w) ≤ V (w). Iterating (7.9.15), we obtain
This derives
Let r = m/L and set E(w) = f (w) − f (w∗ ). Then the sequence w0 ,
w1 , w2 , . . . given by (7.9.13) with learning rate and decay rate
\[
t = \frac{1}{L}, \qquad s = \frac{1 - \sqrt{r}}{1 + \sqrt{r}}
\]
While the convergence rate for accelerated descent is slightly worse than
heavy ball descent, the value of accelerated descent is its validity for all convex
functions satisfying (7.8.1), and the fact, also due to Nesterov [23], that this
convergence rate is best-possible among all such functions.
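A sketch of the accelerated update (7.9.13) with the learning and decay rates above, again on a made-up quadratic:

from numpy import array, sqrt
from numpy.linalg import eigvalsh, norm, solve

Q = array([[3.0, 1.0], [1.0, 2.0]])
b = array([1.0, 0.0])
lam = eigvalsh(Q)
m, L = lam[0], lam[-1]
r = m/L
t, s = 1/L, (1 - sqrt(r))/(1 + sqrt(r))

wstar = solve(Q, b)
w_prev = w = array([5.0, -3.0])
for n in range(100):
    w_circ = w + s*(w - w_prev)                # momentum (look-ahead) point
    w, w_prev = w_circ - t*(Q @ w_circ - b), w

print(norm(w - wstar))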
Now we derive (7.9.15). Assume (w+ )− = w and w∗ = 0, f (w∗ ) = 0. We
know w◦ = (1 + s)w − sw− and w+ = w◦ − tg ◦ , where g ◦ = ∇f (w◦ ).
By the basic descent step (7.3.1) with w◦ replacing w, (7.3.3) implies
\[
f(w^+) \le f(w^\circ) - \frac{t}{2}\,|g^\circ|^2.
\tag{7.9.17}
\]
Here we used t = 1/L.
By (4.5.16) with x = w and a = w◦ ,
\[
f(w^\circ) \le f(w) - g^\circ\cdot(w - w^\circ) - \frac{m}{2}\,|w - w^\circ|^2.
\tag{7.9.18}
\]
By (4.5.16) with x = w∗ = 0 and a = w◦ ,
\[
f(w^\circ) \le g^\circ\cdot w^\circ - \frac{m}{2}\,|w^\circ|^2.
\tag{7.9.19}
\]
Multiply (7.9.18) by ρ and (7.9.19) by 1 − ρ and add, then insert the sum
into (7.9.17). After some simplification, this yields
\[
f(w^+) \le \rho f(w) + g^\circ\cdot(w^\circ - \rho w) - \frac{r}{2t}\big(\rho\,|w - w^\circ|^2 + (1 - \rho)\,|w^\circ|^2\big) - \frac{t}{2}\,|g^\circ|^2.
\tag{7.9.20}
\]
Since
(w◦ − ρw) − tg ◦ = w+ − ρw,
we have
\[
\frac{1}{2t}\,|w^+ - \rho w|^2 = \frac{1}{2t}\,|w^\circ - \rho w|^2 - g^\circ\cdot(w^\circ - \rho w) + \frac{t}{2}\,|g^\circ|^2.
\]
Adding this to (7.9.20) leads to
\[
V(w^+) \le \rho f(w) - \frac{r}{2t}\big(\rho\,|w - w^\circ|^2 + (1 - \rho)\,|w^\circ|^2\big) + \frac{1}{2t}\,|w^\circ - \rho w|^2.
\tag{7.9.21}
\]
Let
\[
R(a, b) = r\,\rho s^2|b|^2 + (1 - \rho)\,|a + sb|^2 - |(1 - \rho)a + sb|^2 + \rho\,|(1 - \rho)a + \rho b|^2,
\]
which is positive.
Chapter A
Appendices
Some of the material here is first seen in high school. Because repeating the
exposure leads to a deeper understanding, we review it in a manner useful to
us here.
We start with basic counting, and show how the factorial function leads
directly to the exponential. Given its convexity and its importance for entropy
(§5.2), the exponential is treated carefully (§A.3).
The other use of counting is in graph theory (§3.3), which lays the ground-
work for neural networks (§7.2).
Suppose we have three balls in a bag, colored red, green, and blue. Suppose
they are pulled out of the bag and arranged in a line. We then obtain six
possibilities, listed in Figure A.1.
Why are there six possibilities? Because there are three ways of choosing
the first ball, then two ways of choosing the second ball, then one way of
choosing the third ball, so the total number of ways is
6 = 3 × 2 × 1.
n! = n × (n − 1) × (n − 2) × · · · × 2 × 1.
Notice also
(n + 1)! = (n + 1) × n × (n − 1) × · · · × 2 × 1 = (n + 1) × n!,
Permutations of n Objects
We also have
1! = 1, 0! = 1.
It’s clear that 1! = 1. It’s less clear that 0! = 1, but it’s reasonable if you
think about it: The number of ways of selecting from zero balls results in
only one possibility — no balls. The code for n! is
from scipy.special import factorial
factorial(n, exact=True)
More generally, we can consider the selection of k balls from a bag contain-
ing n distinct balls. There are two varieties of selections that can be made:
Ordered selections and unordered selections. An ordered selection is a permutation.
In particular, when k = n, the number of ordered selections of n objects
from n objects is n!, the number of ways of permuting n objects.
def perm_tuples(a,b,k):
    if k==1: return [ (i,) for i in range(a,b+1) ]
    else:
        list1 = [ (i,*p) for i in range(a,b) for p in perm_tuples(i+1,b,k-1) ]
        list2 = [ (*p,i) for i in range(a,b) for p in perm_tuples(i+1,b,k-1) ]
        return list1 + list2
perm_tuples(1,5,2)
[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5),
 (2, 1), (3, 1), (4, 1), (5, 1), (3, 2), (4, 2), (5, 2), (4, 3), (5, 3), (5, 4)]
from scipy.special import perm

n, k = 5, 2
perm(n, k)
perm(n, k, exact=True) == len(perm_tuples(1, n, k))
Notice P (x, k) is defined for any real number x by the same formula,
def comb_tuples(a,b,k):
    if k==1: return [ (i,) for i in range(a,b+1) ]
    else: return [ (i, *p) for i in range(a,b) for p in comb_tuples(i+1,b,k-1) ]
comb_tuples(1,5,2)
[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
from scipy.special import comb

n, k = 5, 2
comb(n, k)
comb(n, k, exact=True) == len(comb_tuples(1, n, k))
\[
C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{(n - k)!\,k!}.
\]
Since n! is the product of the factors
\[
1, 2, 3, \dots, n - 1, n,
\]
each at most n, we have n! < n^n.
However, because half of the factors are less than n/2, we expect an approximation
smaller than n^n, maybe something like (n/2)^n or (n/3)^n.
To be systematic about it, assume
\[
n! \ \text{is approximately equal to}\ e\left(\frac{n}{e}\right)^n \ \text{for } n \text{ large},
\tag{A.1.1}
\]
for some constant e. We seek the best constant e that fits here. In this
approximation, we multiply by e so that (A.1.1) is an equality when n = 1.
Using the binomial theorem, in §A.3 we show
\[
3\left(\frac{n}{3}\right)^n \le n! \le 2\left(\frac{n}{2}\right)^n, \qquad n \ge 1.
\tag{A.1.2}
\]
Based on this, a constant e satisfying (A.1.1) must lie between 2 and 3,
2 ≤ e ≤ 3.
To figure out the best constant e to pick, we see how much both sides
of (A.1.1) increase when we replace n by n + 1. Write (A.1.1) with n + 1
replacing n, obtaining
\[
(n + 1)! \ \text{is approximately equal to}\ e\left(\frac{n+1}{e}\right)^{n+1} \ \text{for } n \text{ large}.
\tag{A.1.3}
\]
Dividing the left sides of (A.1.1), (A.1.3) yields
\[
\frac{(n + 1)!}{n!} = n + 1.
\]
Dividing the right sides yields
\[
\frac{e\big((n + 1)/e\big)^{n+1}}{e\,(n/e)^n} = (n + 1)\cdot\frac{1}{e}\cdot\left(1 + \frac{1}{n}\right)^n.
\tag{A.1.4}
\]
Exercises
(First break the sum into two sums, then write out the first few terms of each
sum separately, and notice all terms but one cancel.)
Similarly,
Thus
\[
\begin{aligned}
(a + x)^2 &= a^2 + 2ax + x^2\\
(a + x)^3 &= a^3 + 3a^2x + 3ax^2 + x^3\\
(a + x)^4 &= a^4 + 4a^3x + 6a^2x^2 + 4ax^3 + x^4\\
(a + x)^5 &= \star a^5 + \star a^4x + \star a^3x^2 + \star a^2x^3 + \star ax^4 + \star x^5.
\end{aligned}
\tag{A.2.4}
\]
and
\[
\binom{3}{0} = 1, \quad \binom{3}{1} = 3, \quad \binom{3}{2} = 3, \quad \binom{3}{3} = 1,
\]
and
\[
\binom{4}{0} = 1, \quad \binom{4}{1} = 4, \quad \binom{4}{2} = 6, \quad \binom{4}{3} = 4, \quad \binom{4}{4} = 1,
\]
and
\[
\binom{5}{0} = \star, \quad \binom{5}{1} = \star, \quad \binom{5}{2} = \star, \quad \binom{5}{3} = \star, \quad \binom{5}{4} = \star, \quad \binom{5}{5} = \star.
\]
is the coefficient of a^{n−k} x^k when you multiply out (a + x)^n. This is the binomial
coefficient. Here n is the degree of the binomial, and k, which specifies
the term in the resulting sum, varies from 0 to n (not 1 to n).
It is important to remember that, in this notation, the binomial (a + x)^2
expands into the sum of three terms a^2, 2ax, x^2. These are term 0, term 1,
and term 2. Alternatively, one says these are the zeroth term, the first term,
and the second term. Thus the second term in the expansion of the binomial
(a + x)^4 is 6a^2x^2, and the binomial coefficient \(\binom{4}{2} = 6\). In general, the binomial
(a + x)^n of degree n expands into a sum of n + 1 terms.
Since the binomial coefficient \(\binom{n}{k}\) is the coefficient of a^{n−k}x^k when you
multiply out (a + x)^n, we have the binomial theorem.
Binomial Theorem
For example, the term \(\binom{4}{2} a^2x^2\) corresponds to choosing two a's and two x's,
n = 0: 1
n = 1: 1 1
n = 2: 1 2 1
n = 3: 1 3 3 1
n = 4: 1 4 6 4 1
n = 5: 1 5 10 10 5 1
n = 6: ⋆ 6 15 20 15 6 ⋆
n = 7: 1 ⋆ 21 35 35 21 ⋆ 1
n = 8: 1 8 ⋆ 56 70 56 ⋆ 8 1
n = 9: 1 9 36 ⋆ 126 126 ⋆ 36 9 1
from numpy import zeros

N = 10
Comb = zeros((N,N), dtype=int)
Comb[0,0] = 1
for n in range(1,N):
    Comb[n,0] = Comb[n,n] = 1
    for k in range(1,n): Comb[n,k] = Comb[n-1,k] + Comb[n-1,k-1]
Comb
In Pascal’s triangle, the very top row has one number in it: This is the
zeroth row corresponding to n = 0 and the binomial expansion of (a+x)0 = 1.
The first row corresponds to n = 1; it contains the numbers (1, 1), which
correspond to the binomial expansion of (a + x)1 = 1a + 1x. We say the
zeroth entry (k = 0) in the first row (n = 1) is 1 and the first entry (k = 1)
in the first row is 1. Similarly, the zeroth entry (k = 0) in the second row
(n = 2) is 1, and the second entry (k = 2) in the second row (n = 2) is 1.
The second entry (k = 2) in the fourth row (n = 4) is 6. For every row, the
A.2. THE BINOMIAL THEOREM 477
entries are counted starting from k = 0, and end with k = n, so there are
n + 1 entries in row n. With this understood, the k-th entry in the n-th row
is the binomial coefficient n-choose-k. So 10-choose-2 is
\[
\binom{10}{2} = 45.
\]
We can learn a lot about the binomial coefficients from this triangle. First,
we have 1’s all along the left edge. Next, we have 1’s all along the right edge.
Similarly, one step in from the left or right edge, we have the row number.
Thus we have
\[
\binom{n}{0} = 1 = \binom{n}{n}, \qquad \binom{n}{1} = n = \binom{n}{n-1}, \qquad n \ge 1.
\]
Note also Pascal’s triangle has a left-to-right symmetry: If you read off
the coefficients in a particular row, you can’t tell if you’re reading them from
left to right, or from right to left. It’s the same either way: The fifth row is
(1, 5, 10, 10, 5, 1). In terms of our notation, this is written
\[
\binom{n}{k} = \binom{n}{n-k}, \qquad 0 \le k \le n;
\]
\[
\begin{aligned}
\binom{4}{0}a^4 + \binom{4}{1}a^3x + \binom{4}{2}a^2x^2 + \binom{4}{3}ax^3 + \binom{4}{4}x^4
&= \binom{3}{0}a^4 + \binom{3}{1}a^3x + \binom{3}{2}a^2x^2 + \binom{3}{3}ax^3\\
&\quad + \binom{3}{0}a^3x + \binom{3}{1}a^2x^2 + \binom{3}{2}ax^3 + \binom{3}{3}x^4.
\end{aligned}
\]
We conclude the sum of the binomial coefficients along the n-th row of Pascal's
triangle is 2^n (remember n starts from 0).
Now insert x = 1 and a = −1. You get
\[
0 = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \cdots \pm \binom{n}{n-1} \pm \binom{n}{n}.
\]
Hence: the alternating sum of the binomial coefficients along the n-th row of
Pascal's triangle is zero.
We now show
Binomial Coefficient
Let
\[
C(n, k) = \frac{n\cdot(n-1)\cdots(n-k+1)}{1\cdot 2\cdots k} = \frac{n!}{k!(n-k)!}.
\]
Then
\[
\binom{n}{k} = C(n, k), \qquad 0 \le k \le n.
\tag{A.2.10}
\]
\[
\begin{aligned}
C(n, k) + C(n, k-1) &= \frac{n!}{k!(n-k)!} + \frac{n!}{(k-1)!(n-k+1)!}\\
&= \frac{n!}{(k-1)!(n-k)!}\left(\frac{1}{k} + \frac{1}{n-k+1}\right)\\
&= \frac{n!(n+1)}{(k-1)!(n-k)!\,k\,(n-k+1)}\\
&= \frac{(n+1)!}{k!(n+1-k)!} = C(n+1, k).
\end{aligned}
\]
The formula (A.2.10) is easy to remember: There are k terms in the numerator
as well as the denominator, the factors in the denominator increase starting
from 1, and the factors in the numerator decrease starting from n.
In Python, the code
comb(n,k)
comb(n,k,exact=True)
The binomial coefficient \(\binom{n}{k}\) makes sense even for fractional n.
Rewriting this by pulling out the first two terms k = 0 and k = 1 leads to
\[
\left(1 + \frac{1}{n}\right)^n = 1 + 1 + \sum_{k=2}^{n}\frac{1}{k!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right).
\tag{A.3.1}
\]
From (A.3.1), we can tell a lot. First, since all terms are positive, we see
\[
\left(1 + \frac{1}{n}\right)^n \ge 2, \qquad n \ge 1.
\]
The matching upper bound is obtained using a geometric sum, as follows.
A geometric sum is a sum of the form
\[
s_n = 1 + a + a^2 + \cdots + a^{n-1} = \sum_{k=0}^{n-1} a^k.
\]
Multiplying by a,
\[
a s_n = a + a^2 + a^3 + \cdots + a^{n-1} + a^n = s_n + a^n - 1,
\]
yielding
\[
(a - 1)s_n = a^n - 1.
\]
When a ≠ 1, we may divide by a − 1, obtaining
\[
s_n = \sum_{k=0}^{n-1} a^k = 1 + a + a^2 + \cdots + a^{n-1} = \frac{a^n - 1}{a - 1}.
\tag{A.3.4}
\]
By (A.3.3), we arrive at
\[
2 \le \left(1 + \frac{1}{n}\right)^n \le 3, \qquad n \ge 1.
\tag{A.3.5}
\]
Since a bounded increasing sequence has a limit (§A.7), this establishes the
following strengthening of (A.1.5).
Euler’s Constant
The limit
\[
e = \lim_{n\to\infty}\left(1 + \frac{1}{n}\right)^n
\tag{A.3.6}
\]
exists and satisfies 2 ≤ e ≤ 3.
are in §A.6, see Exercises A.6.10 and A.6.11. Nevertheless, the intuition is
clear: (A.3.6) is saying there is a specific positive number e with
\[
\left(1 + \frac{1}{n}\right)^n \approx e
\]
for n large.
Since we’ve shown bn increases faster than an , and cn increases faster than
bn , we have derived (A.1.2).
Now let n increase without bound in (A.3.1). Using 1/∞ = 0, since the k-th term approaches
1/k!, and since the number of terms increases with n, we obtain the second
formula
\[
e = 1 + 1 + \sum_{k=2}^{\infty}\frac{1}{k!}\left(1 - \frac{1}{\infty}\right)\left(1 - \frac{2}{\infty}\right)\cdots\left(1 - \frac{k-1}{\infty}\right) = \sum_{k=0}^{\infty}\frac{1}{k!}.
\]
To summarize,
Euler’s Constant
Euler’s constant satisfies
\[
e = \sum_{k=0}^{\infty}\frac{1}{k!} = 1 + 1 + \frac{1}{2} + \frac{1}{6} + \frac{1}{24} + \frac{1}{120} + \frac{1}{720} + \dots
\tag{A.3.7}
\]
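A quick numerical comparison of the limit (A.3.6) and the series (A.3.7); the cutoffs n = 1000 and 20 terms below are arbitrary choices.

from numpy import exp
from scipy.special import factorial

n = 1000
limit_estimate = (1 + 1/n)**n
series_estimate = sum([ 1/factorial(k, exact=True) for k in range(20) ])
print(limit_estimate, series_estimate, exp(1))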
Depositing one dollar in a bank offering 100% interest returns two dollars
after one year. Depositing one dollar in a bank offering the same annual
interest compounded at mid-year returns
\[
\left(1 + \frac{1}{2}\right)^2 = 2.25
\]
dollars after one year.
Exponential Function
\[
\begin{aligned}
(1 - x) &= 1 - x\\
(1 - x)^2 &= 1 - 2x + x^2 \ge 1 - 2x\\
(1 - x)^3 &= (1 - x)(1 - x)^2 \ge (1 - x)(1 - 2x) = 1 - 3x + 2x^2 \ge 1 - 3x\\
(1 - x)^4 &= (1 - x)(1 - x)^3 \ge (1 - x)(1 - 3x) = 1 - 4x + 3x^2 \ge 1 - 4x\\
&\ \ \vdots
\end{aligned}
\]
from numpy import exp, linspace
from matplotlib.pyplot import grid, plot, show
x = linspace(-2, 2, 100)   # plotting range (chosen here)
grid()
plot(x, exp(x))
show()
\[
\left(1 + \frac{x}{n}\right)^n = 1 + x + \sum_{k=2}^{n}\frac{x^k}{k!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right).
\tag{A.3.11}
\]
Exponential Series
The exponential function is always positive and satisfies, for every real
number x,
\[
\exp x = \sum_{k=0}^{\infty}\frac{x^k}{k!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \frac{x^6}{720} + \dots
\tag{A.3.12}
\]
Law of Exponents
\[
(a_0 + a_1 + a_2 + a_3 + \dots)(b_0 + b_1 + b_2 + b_3 + \dots)
\]
Thus
\[
\left(\sum_{k=0}^{\infty} a_k\right)\left(\sum_{m=0}^{\infty} b_m\right) = \sum_{n=0}^{\infty}\left(\sum_{k=0}^{n} a_k b_{n-k}\right).
\]
Now insert
\[
a_k = \frac{x^k}{k!}, \qquad b_{n-k} = \frac{y^{n-k}}{(n-k)!}.
\]
Then the n-th term in the resulting sum equals, by the binomial theorem,
\[
\sum_{k=0}^{n} a_k b_{n-k} = \sum_{k=0}^{n} \frac{x^k}{k!}\,\frac{y^{n-k}}{(n-k)!} = \frac{1}{n!}\sum_{k=0}^{n}\binom{n}{k}x^k y^{n-k} = \frac{1}{n!}(x + y)^n.
\]
Thus
\[
\exp x\cdot\exp y = \left(\sum_{k=0}^{\infty}\frac{x^k}{k!}\right)\left(\sum_{m=0}^{\infty}\frac{y^m}{m!}\right) = \sum_{n=0}^{\infty}\frac{(x + y)^n}{n!} = \exp(x + y).
\]
Exponential Notation
Exercises
Exercise A.3.1 Assume a bank gives 50% annual interest on deposits. After
one year, what does $1 become? Do this when the money is compounded once,
twice, and at every instant during the year.
Exercise A.3.2 Assume a bank gives -50% annual interest on deposits. After
one year, what does $1 become? Do this when the money is compounded once,
twice, and at every instant during the year.
valid for a, b, c positive. This remains valid for any number of factors.
Exercise A.3.5 Use the previous exercise, (A.3.1), (A.3.3), and the identity
\[
1 + 2 + 3 + \cdots + (k - 2) + (k - 1) = \frac{k(k - 1)}{2}
\]
to derive the error estimate
\[
0 \le \sum_{k=0}^{n}\frac{1}{k!} - \left(1 + \frac{1}{n}\right)^n \le \frac{3}{2n}, \qquad n \ge 2.
\]
In §1.4, we study points in two dimensions, and we see how points can be
added and subtracted. In §2.1, we study points in any number of dimensions,
and there we also add and subtract points.
In two dimensions, each point has a shadow (Figure 1.13). By stacking
shadows, points in the plane can be multiplied and divided (Figure A.5). In
this sense, points in the plane behave like numbers, because they follow the
usual rules of arithmetic.
This ability of points in the plane to follow the usual rules of arithmetic
is unique to two dimensions (considering one dimension as part of two di-
mensions), and not present in any other dimension. When thought of in this
manner, points in the plane are called complex numbers, and the plane is the
complex plane.
Here is how one does this without any angle measurement: Mark Q = x′ P
at distance x′ along the vector OP joining O and P , and draw the circle
with radius y ′ and center Q. Then this circle intersects the unit circle at two
points, both called P ′′ .
[Figure A.5: multiplying and dividing points in the plane by the construction above.]
\[
\begin{aligned}
P'' = P P' &= (xx' - yy',\ x'y + xy'),\\
P'' = P/P' &= (xx' + yy',\ x'y - xy').
\end{aligned}
\tag{A.4.1}
\]
so (A.4.1) is equivalent to
P ′′ = x′ P ± y ′ P ⊥ . (A.4.2)
P P ′ = (xx′ − yy ′ , x′ y + xy ′ ), (A.4.3)
Because of this, we can write z = x instead of z = (x, 0), this only for points
in the plane, and we call the horizontal axis the real axis.
Similarly, let i = (0, 1). Then the point i is on the vertical axis, and, using
(A.4.1), one can check that ix = (x, 0)(0, 1) = (0, x) for any real x.
Thus the vertical axis consists of all points of the form ix. These are called
imaginary numbers, and the vertical axis is the imaginary axis.
Using i, any point P = (x, y) may be written
P = x + iy,
since
x + iy = (x, 0) + (y, 0)(0, 1) = (x, 0) + (0, y) = (x, y).
This leads to Figure A.6. In this way, real numbers x are considered complex
numbers with zero imaginary part, x = x + 0i.
[Figure A.6: the complex number 3 + 2i in the complex plane.]
Square Root of −1
and
\[
\frac{z}{z'} = \frac{x + iy}{x' + iy'} = \frac{(xx' + yy') + i(x'y - xy')}{x'^2 + y'^2}.
\]
In particular, one can always “move” the i from the denominator to the
numerator by the formula
\[
\frac{1}{z} = \frac{1}{x + iy} = \frac{x - iy}{x^2 + y^2} = \frac{\bar z}{|z|^2}.
\]
From this and (A.4.1), using (x, y) = (cos θ, sin θ), (x′ , y ′ ) = (cos θ′ , sin θ′ ),
we have the addition formulas
\[
\begin{aligned}
\sin(\theta + \theta') &= \sin\theta\cos\theta' + \cos\theta\sin\theta',\\
\cos(\theta + \theta') &= \cos\theta\cos\theta' - \sin\theta\sin\theta'.
\end{aligned}
\tag{A.4.6}
\]
We will need the roots of unity in §3.2. This generalizes square roots, cube
roots, etc.
A complex number ω is a root of unity if ω d = 1 for some power d. If d is
the power, we say ω is a d-th root of unity.
For example, the square roots of unity are ±1, since (±1)2 = 1. Here we
have
1 = cos 0 + i sin 0, −1 = cos π + i sin π.
The fourth roots of unity are ±1 and ±i, since (±1)4 = 1 and (±i)4 = 1.
Here we have
1 = cos 0 + i sin 0,
i = cos(π/2) + i sin(π/2),
−1 = cos π + i sin π,
−i = cos(3π/2) + i sin(3π/2).
If ω^d = 1, then
\[
(\omega^k)^d = (\omega^d)^k = 1^k = 1.
\]
\[
1, \omega, \omega^2, \dots, \omega^{d-1}
\]
\[
1^3 = 1, \qquad \omega^3 = 1, \qquad (\omega^2)^3 = 1.
\]
[Figure: the square, cube, and fourth roots of unity on the unit circle (ω² = 1, ω³ = 1, ω⁴ = 1).]
Summarizing,
Roots of Unity
ω = cos(2π/d) + i sin(2π/d),
1, ω, ω 2 , . . . , ω d−1 .
ω k = cos(2πk/d) + i sin(2πk/d), k = 0, 1, 2, . . . , d − 1.
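For example, the d-th roots of unity can be generated directly from this formula (d = 6 below is just an illustration):

from numpy import cos, sin, pi, arange

d = 6
k = arange(d)
omega = cos(2*pi*k/d) + 1j*sin(2*pi*k/d)   # the d-th roots of unity
print(omega.round(3))
print((omega**d).round(3))                 # each root raised to the d-th power is 1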
[Figure: the fifth, sixth, and fifteenth roots of unity on the unit circle (ω⁵ = 1, ω⁶ = 1, ω¹⁵ = 1).]
\[
p(z) = \prod_{k=1}^{d}(z - a_k) = (z - a_1)(z - a_2)\cdots(z - a_d).
\tag{A.4.10}
\]
from sympy import symbols, solve

z = symbols('z')
d = 5
solve(z**d - 1)
roots([a,b,c])
Since the cube roots of unity are the roots of p(z) = z^3 − 1, the code
roots([1,0,0,-1])
Exercises
Exercise A.4.1 Let P = (1, 2) and Q = (3, 4) and R = (5, 6). Calculate P Q,
P/Q, P R, P/R, QR, Q/R.
z = symbols('z')
roots = solve(z**d - 1)
A.5 Integration
[Figure: the area under the graph of f(x) over the interval a ≤ x ≤ b, with a thin strip of width dx at x.]
To derive this, let A(x) denote the area under the graph between the y-
axis and the vertical line at x. Then A(x) is the sum of the gray area and
the red area, A(a) is the gray area, and A(b) is the sum of four areas: gray,
red, green, and blue. It follows the integral (A.5.1) equals A(b) − A(a).
Since A(x + dx) is the sum of three areas, gray, red, green, it follows
A(x + dx) − A(x) is the green area. But the green area is approximately a
rectangle of width dx and height f (x). Hence the green area is approximately
f (x) × dx, or
A(x + dx) − A(x) ≈ f (x) dx.
As a consequence of this analysis,
When d = 2, a = −1, b = 1, this is 2/3, which is the area under the parabola
in Figure A.10.
When a = 0, b = 1,
\[
\int_0^1 t^d\,dt = \frac{1}{d + 1}.
\tag{A.5.3}
\]
When F (x) can’t be found, we can’t use the FTC. Instead we use Python
to evaluate the integral (A.5.1) as follows.
from scipy.integrate import quad
d = 2
a, b = -1, 1
quad(lambda x: x**d, a, b)
This not only returns the computed integral I but also an estimate of the
error between the computed integral and the theoretical value,
(0.6666666666666666, 7.401486830834376e-15).
quad refers to quadrature, which is another term for integration.
Another example is the area under one hump of the sine curve in Figure
A.11,
\[
\int_0^{\pi}\sin x\,dx = -\cos\pi - (-\cos 0) = -(-1) + 1 = 2.
\]
Here f (x) = sin x, F (x) = − cos x, F ′ (x) = f (x). The Python code quad
returns (2.0, 2.220446049250313e-14).
It is important to realize the integral (A.5.1) is the signed area under the
graph: Portions of areas that are below the x-axis are counted negatively. For
example,
\[
\int_0^{2\pi}\sin x\,dx = -\cos(2\pi) - (-\cos 0) = -1 + 1 = 0.
\]
Explicitly,
\[
\int_0^{2\pi}\sin x\,dx = \int_0^{\pi}\sin x\,dx + \int_{\pi}^{2\pi}\sin x\,dx = 2 - 2 = 0,
\]
so the areas under the first two humps in Figure A.11 cancel.
def plot_and_integrate(f, a, b, pi_ticks=False):
    # initialize figure
    ax = axes()
    ax.grid(True)
    # draw x-axis and y-axis
    ax.axhline(0, color='black', lw=1)
    ax.axvline(0, color='black', lw=1)
    # set x-axis ticks as multiples of pi/2
    if pi_ticks: set_pi_ticks(a, b)
    x = linspace(a, b, 100)
    plot(x, f(x))
    positive = f(x) >= 0
    negative = f(x) < 0
    ax.fill_between(x, f(x), 0, color='green', where=positive, alpha=.5)
    ax.fill_between(x, f(x), 0, color='red', where=negative, alpha=.5)
    I = quad(f, a, b, limit=1000)[0]
    title("integral equals " + str(I), fontsize=10)
    show()
plot_and_integrate(f,a,b,pi_ticks=True)
Above, the Python function set_pi_ticks(a,b) sets the x-axis tick mark
labels at the multiples of π/2. The code for set_pi_ticks is in §4.1.
The exercises are meant to be done using the code in this section. For the
infinite limits below, use numpy.inf.
Exercises
Exercise A.5.1 Plot and integrate f (x) = x2 + A sin(5x) over the interval
[−10, 10], for amplitudes A = 0, 1, 2, 4, 15. Note the integral doesn’t depend
on A. Why?
Exercise A.5.2 Plot and integrate (Figure A.12)
\[
\int_0^{3\pi}\frac{\sin x}{x}\,dx.
\]
Exercise A.5.3 Plot and integrate f (x) = exp(−x) over [a, b] with a = 0,
b = 1, 10, 100, 1000, 10000.
Exercise A.5.4 Plot and integrate f(x) = √(1 − x²) over [−1, 1].

Exercise A.5.5 Plot and integrate f(x) = 1/√(1 − x²) over [−1, 1].
Exercise A.5.6 Plot and integrate f (x) = (− log x)n over [0, 1] for n =
2, 3, 4. What is the answer for general n?
Exercise A.5.7 With k = 7, n = 10, plot and integrate using Python
\[
\int_0^1 x^k(1 - x)^{n-k}\,dx.
\]
\[
\frac{2}{\pi}\int_0^{\infty}\frac{\sin x}{x}\,dx.
\]
Exercise A.5.10 Use numpy.inf to plot the normal pdf and compute its
integral
\[
\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.
\]
Exercise A.5.11 Let σ(x) = 1/(1+e−x ). Plot and integrate f (x) = σ(x)(1−
σ(x)) over [−10, 10]. What is the answer for (−∞, ∞)?
Exercise A.5.12 Let Pn (x) be the Legendre polynomial of degree n (§4.1).
Use num_legendre (§4.1) to compute the integral
\[
\int_{-1}^{1} P_n(x)^2\,dx
\]
for n = 1, 2, 3, 4. What is the integral for general n? Hint – take the reciprocal
of the answer.
Asymptotic Vanishing
an ≈ 0 =⇒ can ≈ 0.
Convergence of Reciprocals
If an ≈ 1, then 1/an ≈ 1.
Asymptotic Equality
\[
a_n \approx b_n \iff \frac{a_n}{b_n} \approx 1 \iff \frac{a_n}{b_n} - 1 \approx 0.
\]
This is exactly what is meant in (A.1.6). While both sides in (A.1.6) in-
crease without bound, their ratio is close to one, for large n.
In general, an ≈ bn is not the same as an − bn ≈ 0: ratios and differences
behave differently. For example, based on (A.1.6), the following code
from numpy import sqrt, pi, e
def factorial(n):
    if n == 1: return 1
    else: return n * factorial(n-1)
def stirling(n):
    # Stirling's approximation, assumed here to be sqrt(2*pi*n)*(n/e)**n as in (A.1.6)
    return sqrt(2*pi*n) * (n/e)**n
a = factorial(100)
b = stirling(100)
a/b, a-b
returns
(1.000833677872004, 7.773919124995513 × 10^154).
The first entry is close to one, but the second entry is far from zero.
If, however, bn ≈ b for some nonzero constant b, then (Exercise A.6.7)
ratios and differences are the same,
an ≈ bn ⇐⇒ an − bn ≈ 0. (A.6.2)
\[
a = \lim_{n\to\infty} a_n.
\tag{A.6.3}
\]
As we saw above, limits and asymptotic equality are the same, as long as the
limit is not zero. When a is the limit of an , we also say an converges to a, or
an approaches a and we write an → a.
With this notation, asymptotic vanishing is an → 0, asymptotically one is
an → 1, and asymptotic equality is an /bn → 1.
Limits can be taken for sequences of points in Rd as well. Let an be a
sequence of points in Rd . We say an converges to a if an · v converges to a · v
for every vector v. Here we also write an → a and we write (A.6.3).
Exercises
an + bn → a + b, an bn → ab.
Several times in the text, we deal with minimizing functions, most notably for
the pseudo-inverse of a matrix (§2.3), for proper continuous functions (§4.3),
and for gradient descent (§7.3).
Previously, the technical foundations underlying the existence of minimiz-
ers were ignored. In this section, we review the foundational material sup-
porting the existence of minimizers.
For example, since y = e^x is an increasing function, the minimum
\[
\min_{0\le x\le 1} e^x = \min\{\,e^x \mid 0 \le x \le 1\,\}
\]
Completeness Property
\[
\lim_{n\to\infty} x_n.
\]
Here it is important that the indices n1 < n2 < n3 < . . . be strictly increas-
ing.
If a sequence x1 , x2 , . . . has a subsequence x′1 , x′2 , . . . converging to x∗ ,
then we say the sequence x1 , x2 , . . . subconverges to x∗ . For example, the
sequence 1, −1, 1, −1, 1, −1, . . . subconverges to 1 and also subconverges
to −1, as can be seen by considering the odd-indexed terms and the even-
indexed terms separately.
I0 ⊃ I1 ⊃ I2 ⊃ . . . ,
\[
x^* = \lim_{n\to\infty} x^*_n
\]
As we saw above, a minimizer may or may not exist, and, when the minimizer
does exist, there may be several minimizers.
A function y = f (x) is continuous if f (xn ) approaches f (x∗ ) whenever xn
approaches x∗ ,
xn → x∗ =⇒ f (xn ) → f (x∗ ),
Existence of Minimizers
\[
c = \frac{f(x_1) + m_1}{2}
\]
be the midpoint between m1 and f (x1 ).
There are two possibilities. Either c is a lower bound or not. In the first
case, define m2 = c and x2 = x1 . In the second case, there is a point x2 in
S satisfying f (x2 ) < c, and we define m2 = m1 . As a consequence, in either
case, we have f (x2 ) ≥ m2 , m1 ≤ m2 , and
\[
f(x_2) - m_2 \le \frac{1}{2}\big(f(x_1) - m_1\big).
\]
Let
\[
c = \frac{f(x_2) + m_2}{2}
\]
be the midpoint between m2 and f (x2 ).
There are two possibilities. Either c is a lower bound or not. In the first
case, define m3 = c and x3 = x2 . In the second case, there is a point x3 in
S satisfying f (x3 ) < c, and we define m3 = m2 . As a consequence, in either
case, we have f (x3 ) ≥ m3 , m2 ≤ m3 , and
\[
f(x_3) - m_3 \le \frac{1}{2^2}\big(f(x_1) - m_1\big).
\]
Continuing in this manner, we have a sequence x1 , x2 , . . . in S, and an
increasing sequence m1 ≤ m2 ≤ . . . of lower bounds, with
\[
f(x_n) - m_n \le \frac{2}{2^n}\big(f(x_1) - m_1\big).
\]
Since S is bounded, xn subconverges to some x∗ . Since f (x) is continuous,
f (xn ) subconverges to f (x∗ ). Since f (xn ) ≈ mn and mn is a lower bound for
all n, f (x∗ ) is a lower bound, hence x∗ is a minimizer.
A.8 SQL
select from
limit
select distinct
where/where not <column>
where <column> = <data> and/or <column> = <data>
order by <column1>,<column2>
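The running example here is a dict along the following lines; the dish name comes from the discussion below, while the price (in cents) and the quantity are illustrative assumptions.

item1 = {"dish": "Hummus", "price": 800, "quantity": 2}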
This is an unordered listing of key-value pairs. Here the keys are the strings
dish, price, and quantity. Keys need not be strings; they may be integers or
any unmutable Python objects. Since a Python list is mutable, a key cannot
be a list. Values may be any Python objects, so a value may be a list. In
a dict, values are accessed through their keys. For example, item1["dish"]
returns 'Hummus'.
A list-of-dicts is simply a Python list whose elements are Python dicts, for
example,
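continuing the sketch above, with a second item whose dish, price, and quantity are again illustrative assumptions (chosen so that the string length quoted below still works out):

item2 = {"dish": "Falafel", "price": 700, "quantity": 1}
L = [item1, item2]

Then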
len(L), L[0]["dish"]
returns
(2,'Hummus')
returns True.
A list-of-dicts L can be converted into a string using the json module, as
follows:
s = dumps(L)
Now print L and print s. Even though L and s “look” the same, L is a list,
and s is a string. To emphasize this point, note
• len(L) == 2 and len(s) == 99,
• L[0:2] == L and s[0:2] == '[{'
• L[8] returns an error and s[8] == ':'
To convert back the other way, use
L1 = loads(s)
Then L == L1 returns True. Strings having this form are called JSON strings,
and are easy to store in a database as VARCHARs (see Figure A.16).
The basic object in the Python package pandas is the dataframe (Figures
A.13, A.14, A.16, A.17). pandas can convert a dataframe df to many, many
other formats
df = DataFrame(L)
df
L1 = df.to_dict('records')
L == L1
returns True. Here the option 'records' returns a list-of-dicts; other options
return a dict-of-dicts or other combinations.
To convert a CSV file into a dataframe, use the code
menu_df = read_csv("menu.csv")
menu_df
To convert a dataframe back into a CSV file, use the code

df.to_csv("menu1.csv")
df.to_csv("menu2.csv", index=False)
To connect using sqlalchemy, we first collect the connection data into one
URI string,
protocol = "mysql+pymysql://"
credentials = "username:password"
server = "@servername"
port = ":3306"
uri = protocol + credentials + server + port
This string contains your database username, your database password, the
database server name, the server port, and the protocol. If the database is
"rawa", the URI is
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
df.to_sql('Menu',engine,if_exists='replace')
The if_exists = 'replace' option replaces the table Menu if it existed prior
to this command. Other options are if_exists='fail' and if_exists='append'.
The default is if_exists='fail', so

df.to_sql('Menu',engine)

fails if the table Menu already exists.
One benefit of this syntax is the automatic closure of the connection upon
completion. This completes the discussion of how to convert between dataframes
and SQL tables, and completes the discussion of conversions between any of
the objects in (A.8.2).
As an example how all this goes together, here is a task:
Given two CSV files menu.csv and orders.csv downloaded from a restaurant website
(Figure A.15), create three SQL tables Menu, OrdersIn, OrdersOut.
/* Menu */
dish varchar
price integer
/* ordersin */
orderid integer
created datetime
customerid integer
items json
/* ordersout */
orderid integer
subtotal integer
tip integer
tax integer
total integer
To achieve this task, we download the CSV files menu.csv and orders.csv,
then we carry out these steps. (price and tip in menu.csv and orders.csv
are in cents so they are INTs.)
1. Read the CSV files into dataframes menu_df and orders_df.
2. Convert the dataframes into list-of-dicts menu and orders.
3. Create a list-of-dicts OrdersIn with keys orderId, created, customerId
whose values are obtained from list-of-dicts orders.
4. Create a list-of-dicts OrdersOut with keys orderId, tip whose values are
obtained from list-of-dicts orders (tips are in cents so they are INTs).
5. Add a key items to OrdersIn whose values are JSON strings specifying
the items ordered in orders, using the prices in menu (these are in cents so
they are INTs). The JSON string is of a list-of-dicts in the form discussed
above L = [item1, item2] (see row 0 in Figure A.16).
Do this by looping over each order in the list-of-dicts orders, then loop-
ing over each item in the list-of-dicts menu, and extracting the quantity
ordered of the item item in the order order.
6. Add a key subtotal to OrdersOut whose values (in cents) are computed
from the above values.
Add a key tax to OrdersOut whose values (in cents) are computed using
the Connecticut tax rate 7.35%. Tax is applied to the sum of subtotal
and tip.
Add a key total to OrdersOut whose values (in cents) are computed
from the above values (subtotal, tax, tip).
7. Convert the list-of-dicts OrdersIn, OrdersOut to dataframes OrdersIn_df,
OrdersOut_df.
# step 1
from pandas import *

protocol = "https://"
server = "omar-hijab.org"
path = "/teaching/csv_files/restaurant/"
url = protocol + server + path

# read the two CSV files into dataframes
menu_df = read_csv(url + "menu.csv")
orders_df = read_csv(url + "orders.csv")
# step 2
menu = menu_df.to_dict('records')
orders = orders_df.to_dict('records')
# step 3
OrdersIn = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["created"] = r["created"]
    d["customerId"] = r["customerId"]
    OrdersIn.append(d)
# step 4
OrdersOut = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["tip"] = r["tip"]
    OrdersOut.append(d)
# step 5
from json import *
# step 6
for i,r in enumerate(OrdersOut):
    items = loads(OrdersIn[i]["items"])
    subtotal = sum([ item["price"]*item["quantity"] for item in items ])
    r["subtotal"] = subtotal
    tip = OrdersOut[i]["tip"]
    tax = int(.0735*(tip + subtotal))
    total = subtotal + tip + tax
    r["tax"] = tax
    r["total"] = total
# step 7
ordersin_df = DataFrame(OrdersIn)
ordersout_df = DataFrame(OrdersOut)
# step 8
engine = create_engine(uri)
dtype2 = {
    "orderId": sqlalchemy.Integer,
    "created": sqlalchemy.String(30),
    "customerId": sqlalchemy.Integer,
    "items": sqlalchemy.String(1000)
}
dtype3 = {
    "orderId": sqlalchemy.Integer,
    "tip": sqlalchemy.Integer,
    "subtotal": sqlalchemy.Integer,
    "tax": sqlalchemy.Integer,
    "total": sqlalchemy.Integer
}
# upload the dataframes as SQL tables (if_exists='replace' is one possible choice)
ordersin_df.to_sql('OrdersIn', engine, if_exists='replace', dtype=dtype2)
ordersout_df.to_sql('OrdersOut', engine, if_exists='replace', dtype=dtype3)
In this section, all work was done in Python on a laptop, no SQL was used on
the database, other than creating a table or downloading a table. Generally,
this is an effective workflow:
• Use SQL to do big manipulations on the database (joining and filtering).
• Use Python to do detailed computations on your laptop (analysis).
Now we consider the following simple problem. The total number of orders
is 3970. What is the total number of plates? To answer this, we loop through
all the orders, summing the number of plates in each order. The answer is
14,949 plates.
protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
connection = engine.connect()
num = 0
print(num)
def num_plates(item):
    dishes = loads(item)
    return sum( [ dish["quantity"] for dish in dishes ] )
Then we use map to apply this function to every element in the series
df["items"], resulting in another series. Then we sum the resulting series.
num = df["items"].map(num_plates).sum()
print(num)
Since the total number of plates is 14,949, and the total number of orders
is 3970, the average number of plates per order is 3.76.
References
[19] J. W. Longley. “An Appraisal of Least Squares Programs for the Elec-
tronic Computer from the Point of View of the User”. In: Journal of
the American Statistical Association 62.319 (1967), pp. 819–841.
[20] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming.
Springer, 2008.
[21] M. P. Deisenroth, A. A. Faisal, and C. S. Ong. Mathematics for Machine
Learning. Cambridge University Press, 2020.
[22] M. Minsky and S. Papert. Perceptrons, An Introduction to Computa-
tional Geometry. MIT Press, 1988.
[23] Y. Nesterov. Lectures on Convex Optimization. Springer, 2018.
[24] K. Pearson. “On the criterion that a given system of deviations from
the probable in the case of a correlated system of variables is such that
it can be reasonably supposed to have arisen from random sampling”.
In: Philosophical Magazine Series 5 50:302 (1900), pp. 157–175.
[25] R. Penrose. “A generalized inverse for matrices”. In: Proceedings of the
Cambridge Philosophical Society 51 (1955), pp. 406–413.
[26] B. T. Polyak. “Some methods of speeding up the convergence of itera-
tion methods”. In: USSR Computational Mathematics and Mathemat-
ical Physics 4(5) (1964), pp. 1–17.
[27] The WeBWorK Project. url: https://openwebwork.org/.
[28] S. Raschka. PCA in three simple steps. 2015. url: https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html.
[29] H. Robbins and S. Monro. “A Stochastic Approximation Method”. In:
The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407.
[30] S. M. Ross. Probability and Statistics for Engineers and Scientists, Sixth
Edition. Academic Press, 2021.
[31] M. J. Schervish. Theory of Statistics. Springer, 1995.
[32] G. Strang. Linear Algebra and its Applications. Brooks/Cole, 1988.
[33] Stanford University. CS224N: Natural Language Processing with Deep
Learning. url: https://web.stanford.edu/class/cs224n.
[34] I. Waldspurger. Gradient Descent With Momentum. 2022. url: https://www.ceremade.dauphine.fr/~waldspurger/tds/22_23_s1/advanced_gradient_descent.pdf.
[35] Wikipedia. Logistic Regression. url: https://en.wikipedia.org/wiki/Logistic_regression.
[36] Wikipedia. Seven Bridges of Königsberg. url: https://en.wikipedia.org/wiki/Seven_Bridges_of_Konigsberg.
[37] S. J. Wright and B. Recht. Optimization for Data Analysis. Cambridge
University Press, 2022.
Python Index
*, 9, 16
all, 199
append, 199
def.angle, 25, 68
def.assign_clusters, 199
def.backward_prop, 244, 253, 415
def.ball, 55
def.cartesian_product, 351
def.chi2_independence, 395
def.comb_tuples, 470
def.confidence_interval, 377, 387
def.derivative, 252
def.dimension_staircase, 127
def.downstream, 415
def.draw_major_minor_axes, 51
def.ellipse, 44
def.find_first_defect, 125
def.forward_prop, 244, 251, 411
def.gd, 426
def.goodness_of_fit, 391
def.H, 293
def.hexcolor, 11
def.incoming, 250, 410
def.is_input, 409
def.is_output, 409
def.J, 412
def.local, 413
def.matrix_text, 45
def.nearest_index, 199
def.newton, 420
def.num_legendre, 209
def.num_plates, 524
def.outgoing, 250, 410
def.pca, 193
def.pca_with_svd, 193
def.perm_tuples, 469
def.plot_and_integrate, 502
def.plot_cluster, 200
def.plot_descent, 421
def.poly, 450
def.project, 117
def.project_to_ortho, 118
def.pvalue, 341
def.random_batch_mean, 284
def.random_vector, 199
def.set_pi_ticks, 221
def.sym_legendre, 208
def.tensor, 33
def.train_nn, 428
def.ttest, 388
def.type2_error, 383, 389
def.uniq, 5
def.update_means, 199
def.update_weights, 427
def.zero_variance, 104