Statistics
Omar Hijab*
Copyright ©2022–2024 Omar Hijab. All Rights Reserved.
Preface
Detailed proofs and detailed code snippets are included throughout the
text for the same reason: There is value in understanding how things work,
and real understanding can only be achieved by going all the way.
Because SQL is usually part of a data scientist’s toolkit, an introduction
to using SQL from within Python is included in an appendix.
Throughout, we use iff to mean if and only if. To help navigate the text,
in each section, we use the ship’s wheel to indicate a break, a new idea,
or a change in direction.
Sections and figures are numbered sequentially within each chapter, and
equations are numbered sequentially within each section, so §3.3 is the third
section in the third chapter, Figure 7.11 is the eleventh figure in the seventh
chapter, and (3.2.1) is the first equation in the second section of the third
chapter.
⋆ under construction ⋆,
Contents

Preface

1 Datasets
1.1 Introduction
1.2 The MNIST Dataset
1.3 Averages and Vector Spaces
1.4 Two Dimensions
1.5 Complex Numbers
1.6 Mean and Covariance
1.7 High Dimensions

2 Linear Geometry
2.1 Vectors and Matrices
2.2 Products
2.3 Matrix Inverse
2.4 Span and Linear Independence
2.5 Zero Variance Directions
2.6 Pseudo-Inverse
2.7 Projections
2.8 Basis
2.9 Rank

3 Principal Components

4 Counting
4.1 Permutations and Combinations
4.2 Graphs
4.3 Binomial Theorem
4.4 Exponential Function

5 Probability
5.1 Binomial Probability
5.2 Probability
5.3 Random Variables
5.4 Normal Distribution
5.5 Chi-squared Distribution

6 Statistics
6.1 Estimation
6.2 Z-test
6.3 T-test
6.4 Two Means
6.5 Variances
6.6 Maximum Likelihood Estimates
6.7 Chi-Squared Tests

7 Calculus
7.1 Calculus
7.2 Entropy and Information
7.3 Multi-variable Calculus
7.4 Back Propagation
7.5 Convex Functions
7.6 Multinomial Probability

A Appendices
A.1 SQL
A.2 Minimizing Sequences
A.3 Keras Training

References
Python
Index
Chapter 1

Datasets
1.1 Introduction
Geometrically, a dataset is a sample of N points x1 , x2 , . . . , xN in d-
dimensional space Rd . Algebraically, a dataset is an N × d matrix.
Practically speaking, as we shall see, the following are all representations
of datasets
The Iris dataset contains 150 examples of four features of Iris flowers,
and there are three classes of Irises, Setosa, Versicolor, and Virginica, with
50 samples from each class.
The four features are sepal length and width, and petal length and width
(Figure 1.1). For each example, the class is the label corresponding to that
example, so the Iris dataset is labelled.
from sklearn import datasets

iris = datasets.load_iris()
dataset = iris["data"]
labels = iris["target"]
dataset, labels
This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points
x1 , x2 , . . . , xN
If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (Datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus),
and Python (numpy, pandas, scipy, sympy, matplotlib). It may help to
read the code examples and the important math principles first, then dive
into details as needed.
To illustrate concepts and make them concrete as they are introduced, we
use Python code throughout. We run Python code in a Jupyter notebook.
Jupyter is an IDE, an integrated development environment. Jupyter supports
many frameworks, including Python, Sage, Julia, and R. A useful Jupyter
feature is the ability to measure the execution time of a code cell by
including at the start of the cell
%%time
¹ The National Institute of Standards and Technology (NIST) is a physical sciences
laboratory and non-regulatory agency of the United States Department of Commerce.
dataset.shape, labels.shape
returns the shapes of the dataset and the labels. (This code requires keras,
tensorflow, and related modules, if not already installed.)
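One way to load the MNIST training set with keras is sketched below; the variable names dataset and labels match the snippet above, though this particular loading code is only an illustration.

from keras.datasets import mnist

# one way to obtain the 60000 training images and labels
(train_images, train_labels), _ = mnist.load_data()
dataset = train_images    # shape (60000, 28, 28): each image is a 28 x 28 grid of pixel shadings
labels = train_labels     # shape (60000,)
dataset.shape, labels.shape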
Since this dataset is for demonstration purposes, these images are coarse.
Since each image consists of 784 pixels, and each pixel shading is a number,
each image is a point x in Rd = R784 .
Figure 1.4: Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
For the second image in Figure 1.2, reducing dimension from d = 784 to
n = 600, 350, 150, 50, 10, and 1, we have the images in Figure 1.4.
Compressing each image to a point in n = 3 dimensions and plotting all
N = 60000 points yields Figure 1.5. All this is discussed in §3.4.
The top left image in Figure 1.4 is given by a 784-dimensional point which
is imported as an array pixels of shape (28,28).
pixels = dataset[1]
grid()
scatter(2,3)
show()
3. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.
Here is one possible code, returning Figure 1.6.
pixels = dataset[1]
grid()
for i in range(28):
    for j in range(28): scatter(i, j, s=pixels[i,j])
show()
imshow(pixels, cmap="gray_r")
We end the section by discussing the Python import command. The last
code snippet can be rewritten
plt.imshow(pixels, cmap="gray_r")
or as
imshow(pixels, cmap="gray_r")
L = [x_1,x_2,...,x_N].
The totality of samples is the population, or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population
Or, the sample space consists of all integers and we take N = 5 samples from
this population
Or, the sample space consists of all rational numbers and we take N = 5
samples from this population
Or, the sample space consists of all Python strings and we take N = 5
samples from this population
L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population
# choice here is the standard-library random.choice
from random import choice

def hexcolor():
    return "#" + ''.join([choice('0123456789abcdef') for _ in range(6)])
8. 1v = v and 0v = 0
9. r(sv) = (rs)v.
A vector is an arrow joining two points (Figure 1.8). Given two points
m = (a, b) and x = (c, d), the vector joining them is
v = x − m = (c − a, d − b).
Then m is the tail of v, and x is the head of v. For example, the vector
joining m = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean m of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − m, k = 1, 2, . . . , N .
The dataset v1, v2, . . . , vN is centered, its mean is zero,

(v1 + v2 + · · · + vN)/N = 0.
(Figure: the points x1, x2, . . . , x5 with mean m, and the centered vectors v1, . . . , v5 based at 0.)
v1 = x1 − m, v2 = x2 − m, . . . , vN = xN − m,
returns False.
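Exact equality with 0 typically fails in floating point, so allclose is the appropriate check. A minimal sketch (not from the text; imports are spelled out here, although elsewhere the text assumes numpy star imports):

from numpy import array, mean, allclose
from numpy.random import random

N = 20
dataset = array([ [random(), random()] for _ in range(N) ])
m = mean(dataset, axis=0)
vectors = dataset - m                    # centered dataset
allclose(mean(vectors, axis=0), 0)       # returns True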
Usually, we can’t take sample means from a population directly; instead, we
take the sample mean of a statistic associated to the population. A statistic
is a function f from the sample space to a space V of values.

(Figure: a statistic f mapping the sample space to V.)
dataset = array([1.23,4.29,-3.3,555])
mean(dataset)
mean(dataset, axis=0)
mean(dataset, axis=1)
N = 20
dataset = array([ [random(), random()] for _ in range(N) ])
m = mean(dataset, axis=0)
grid()
X = dataset[:,0]
Y = dataset[:,1]
scatter(X,Y)
scatter(*m)
show()
In this code, scatter expects two positional arguments, the x and the y
components of a point, or two lists of x and y components separately. The
unpacking operator * unpacks the mean m from one pair into its separate x
and y components *m. Also, for scatter, dataset is separated into its two
columns.
(Figure 1.12: a vector v with coordinates (3, 2), and vectors v1, v2.)
In the cartesian plane, each vector v has a shadow. This is the triangle
constructed by dropping the perpendicular from the tip of v to the x-axis, as
in Figure 1.13. This cannot be done unless one first draws a horizontal line
(the x-axis), then a vertical line (the y-axis). In this manner, each vector v
has cartesian coordinates v = (x, y). In Figure 1.12, the coordinates of v are
(3, 2). In particular, the vector 0 = (0, 0), the zero vector, corresponds to the
origin.
In the cartesian plane, vectors v1 = (x1 , y1 ) and v2 = (x2 , y2 ) are added
by adding their coordinates,
Addition of vectors
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then
v1 + v2 = (x1 + x2 , y1 + y2 ). (1.4.1)
Because points and vectors are interchangeable, the same formula is used
for addition P + P ′ of points P and P ′ .
v1 = (1,2)
v2 = (3,4)
v1 + v2 == (1+3,2+4) # returns False
v1 = [1,2]
v2 = [3,4]
v1 + v2 == [1+3,2+4] # returns False
v1 = array([1,2])
v2 = array([3,4])
v1 + v2 == array([1+3,2+4]) # returns array([True, True])

v = array([1,2])
3*v == array([3,6]) # returns array([True, True])
Scaling of vectors
Thus multiplying v by s, and then multiplying the result by t, has the same
effect as multiplying v by ts, in a single step. Because points and vectors are
interchangeable, the same formula is used for scaling tP points P by t.
tv
v
0 tv
v1 − v2 = v1 + (−v2 ).
This gives
Subtraction of vectors
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then
v1 − v2 = (x1 − x2 , y1 − y2 ) (1.4.2)
v1 = array([1,2])
v2 = array([3,4])
v1 - v2 == array([1-3,2-4]) # returns array([True, True])
Distance Formula
If v1 = (x1, y1) and v2 = (x2, y2), then the distance between v1 and v2 is

|v1 − v2| = √((x1 − x2)² + (y1 − y2)²).
(Figure 1.16: polar coordinates: the point (x, y) at distance r from 0, at angle θ.)
x = r cos θ, y = r sin θ.
In Python,
v = array([1,2])
norm(v) == sqrt(5)# returns True
The unit circle consists of the vectors which are distance 1 from the origin
0. When v is on the unit circle, the magnitude of v is 1, and we say v is a
unit vector. In this case, the line formed by the scalings of v intersects the
unit circle at ±v (Figure 1.17).
When v is a unit vector, r = 1, and (Figure 1.16), v = (cos θ, sin θ).
The unit circle intersects the horizontal axis at the vectors (1, 0), and
(−1, 0), and intersects the vertical axis at the vectors (0, 1), and (0, −1).
These four vectors are equally spaced on the unit circle (Figure 1.17).
(Figure 1.17: the unit circle, with a unit vector v and its negative −v.)

In coordinates, the unit circle consists of the vectors v = (x, y) satisfying

x² + y² = 1.
More generally, any circle with center (a, b) and radius r consists of vectors
v = (x, y) satisfying
(x − a)² + (y − b)² = r².
Let R be a point on the unit circle, and let t > 0. From this, we see the
scaled point tR is on the circle with center (0, 0) and radius t. Moreover, if
Q is any point, Q + tR is on the circle with center Q and radius t.
Given this, it is easy to check

|(1/r)v| = (1/r)|v| = (1/r)r = 1,
(Figure 1.18: vectors v1, v2 and their difference v2 − v1.)
Now we discuss the dot product in two dimensions. We have two vectors
v and v ′ in the plane R2 , with v1 = (x1 , y1 ) and v2 = (x2 , y2 ). The dot
product of v1 and v2 is given algebraically as
v1 · v2 = x1 x2 + y1 y2 ,
or geometrically as
v1 · v2 = |v1 | |v2 | cos θ,
where θ is the angle between v1 and v2. To show that these are the same,
below we derive the dot product identity.
v1 = array([1,2])
v2 = array([3,4])
dot(v1, v2)   # returns 11
As a consequence of the dot product identity, we have code for the angle
between two vectors,
def angle(u,v):
a = dot(u,v)
b = dot(u,u)
c = dot(v,v)
theta = arccos(a / sqrt(b*c))
return degrees(theta)
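For example, a small check (the particular vectors here are arbitrary):

u = array([1, 0])
v = array([1, 1])
angle(u, v)    # returns 45.0, up to floating point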
Cauchy-Schwarz Inequality

If u and v are any two vectors, then |u · v| ≤ |u| |v|.
a² = d² + f²  and  c² = e² + f².

Also b = e + d, so

b² = (e + d)² = e² + 2ed + d².

Then

c² = e² + f² = (b − d)² + f²
   = f² + d² + b² − 2db
   = a² + b² − 2ab cos θ,

so we get (1.4.6).
(Figure 1.19: a triangle with sides a, b, c; the altitude f splits the side b into segments d and e, with b = d + e.)
Next, connect Figures 1.18 and 1.19 by noting a = |v2 | and b = |v1 | and
c = |v2 − v1 |.
Now go back to deriving (1.4.4). By vector addition, we have
v2 − v1 = (x2 − x1 , y2 − y1 ),
v · v ⊥ = (x, y) · (−y, x) = 0.
(Figure: the vectors v, v⊥ and the points P, P⊥ on the unit circle, together with their negatives −v⊥, −P⊥.)
From Figure 1.21, we see points P and P ′ on the unit circle satisfy P ·P ′ =
0 iff P ′ = ±P ⊥ .
ax + by = 0, cx + dy = 0. (1.4.9)
In (1.4.9), multiply the first equation by d and the second by b and subtract,
obtaining

(ad − bc)x = d(ax + by) − b(cx + dy) = 0.

In (1.4.9), multiply the first equation by c and the second by a and subtract,
obtaining

(bc − ad)y = c(ax + by) − a(cx + dy) = 0.
From here, we see there are two cases: det(A) = 0 and det(A) ̸= 0. When
det(A) ̸= 0, the only solution of (1.4.9) is (x, y) = (0, 0). When det(A) = 0,
(x, y) = (−b, a) is a solution of both equations in (1.4.9). We have shown
Homogeneous System
ax + by = e, cx + dy = f, (1.4.11)
Inhomogeneous System
In this case, we call u and v the columns of A. This shows there are at least
three ways to think about a matrix: as rows, or as columns, or as a single
block.
The simplest operations on matrices are addition and scalar multiplication.
Addition is as follows,

A = [a b; c d],  A′ = [a′ b′; c′ d′]  =⇒  A + A′ = [a + a′  b + b′; c + c′  d + d′]

(here and below, semicolons separate matrix rows).
AA′ = [u · u′  u · v′; v · u′  v · v′].
U(θ)U(θ′) = [cos θ  −sin θ; sin θ  cos θ] [cos θ′  −sin θ′; sin θ′  cos θ′]
          = [cos(θ + θ′)  −sin(θ + θ′); sin(θ + θ′)  cos(θ + θ′)] = U(θ + θ′).
Orthogonal Matrices
def tensor(u,v):
return array([ [ a*b for b in v] for a in u ])
det(u ⊗ v) = 0.
This is true no matter what the vectors u and v are. Check this yourself.
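A quick numerical check of this claim (a sketch; the vectors are arbitrary):

from numpy import array, isclose
from numpy.linalg import det

u = array([1.0, 2.0])
v = array([3.0, 4.0])
isclose(det(tensor(u, v)), 0)    # returns True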
Notice by definition of u ⊗ v,
so
v · Qv = (x, y) · (ax + by, bx + cy) = ax2 + 2bxy + cy 2 .
This is the quadratic form associated to the matrix Q.
Quadratic Form

If

Q = [a b; b c]  and  v = (x, y),

then

v · Qv = ax² + 2bxy + cy².

Q = I  =⇒  v · Qv = x² + y².

When Q is diagonal,

Q = [a 0; 0 c]  =⇒  v · Qv = ax² + cy².
An important case is when Q = u ⊗ u. In this case, by (1.4.14),
(Figure: geometric construction of the product P′′ of points P, P′ in the plane, using the origin O and the unit point 1.)
This ability of points in the plane to follow the usual rules of arithmetic is
unique to one and two dimensions, and not present in any other dimension.
When thought of in this manner, points in the plane are called complex
numbers, and the plane is the complex plane.
P′′ = PP′ = (xx′ − yy′, x′y + xy′),
P′′ = P/P′ = (xx′ + yy′, x′y − xy′).          (1.5.1)

so (1.5.1) is equivalent to

P′′ = x′P ± y′P⊥.          (1.5.2)
P P ′ = (xx′ − yy ′ , x′ y + xy ′ ), (1.5.3)
P/P′ = (1/r′²) P P̄′ = (1/(x′² + y′²)) (xx′ + yy′, x′y − xy′).          (1.5.4)
Because of this, we can write z = x instead of z = (x, 0), this only for points
in the plane, and we call the horizontal axis the real axis.
Similarly, let i = (0, 1). Then the point i is on the vertical axis, and,
using (1.5.1), one can check that ix = (0, 1)(x, 0) = (0, x).
Thus the vertical axis consists of all points of the form ix. These are called
imaginary numbers, and the vertical axis is the imaginary axis.
Using i, any point P = (x, y) may be written
P = x + iy,
since x + iy = (x, 0) + (y, 0)(0, 1) = (x, 0) + (0, y) = (x, y). This leads to
Figure 1.23. In this way, real numbers x are considered complex numbers
with zero imaginary part, x = x + 0i.
(Figure 1.23: the complex plane, with the real axis through −1, 0, 1, 2, 3, the imaginary axis through 2i, and the point 3 + 2i.)
Square Root of −1

i² = (0, 1)(0, 1) = (−1, 0) = −1.

In the x + iy notation, by (1.5.1), the product is

zz′ = (x + iy)(x′ + iy′) = (xx′ − yy′) + i(x′y + xy′),

and
z/z′ = (x + iy)/(x′ + iy′) = ((xx′ + yy′) + i(x′y − xy′))/(x′² + y′²).
In particular, one can always “move” the i from the denominator by the
formula
1/z = 1/(x + iy) = (x − iy)/(x² + y²) = z̄/|z|².
Here x2 + y 2 = r2 = |z|2 is the absolute value squared of z, and z̄ is the
conjugate of z.
If (r, θ) and (r′ , θ′ ) are the polar coordinates of complex numbers P and
P ′ , and (r′′ , θ′′ ) are the polar coordinates of the product P ′′ = P P ′ ,
then
r′′ = rr′ and θ′′ = θ + θ′ .
From this and (1.5.1), using (x, y) = (cos θ, sin θ), (x′, y′) = (cos θ′, sin θ′),
we have the addition formulas

cos(θ + θ′) = cos θ cos θ′ − sin θ sin θ′,
sin(θ + θ′) = sin θ cos θ′ + cos θ sin θ′.
This formula is valid as long as x ≠ −r, and can be checked directly by
verifying Q² = P.
When P is on the unit circle, r = 1, so the formula reduces to

√P = ±( (1 + x)/√(2 + 2x) , y/√(2 + 2x) ).
We will need the roots of unity in §3.2. This generalizes square roots,
cube roots, etc.
A point ω is a root of unity if ω^d = 1 for some power d. If d is the power,
we say ω is a d-th root of unity.
For example, the square roots of unity are ±1, since (±1)2 = 1. Here we
have
1 = cos 0 + i sin 0, −1 = cos π + i sin π.
The fourth roots of unity are ±1, ±i, since (±1)4 = 1, (±i)4 = 1. Here
we have
1 = cos 0 + i sin 0,
i = cos(π/2) + i sin(π/2),
−1 = cos π + i sin π,
−i = cos(3π/2) + i sin(3π/2).
(Figure: the square roots (ω² = 1), cube roots (ω³ = 1), and fourth roots (ω⁴ = 1) of unity on the unit circle.)
If ω^d = 1, then

(ω^k)^d = (ω^d)^k = 1^k = 1.

1, ω, ω², . . . , ω^(d−1)

1³ = 1,  ω³ = 1,  (ω²)³ = 1.
ω = −1/4 + √5/4 + i√(5/8 + √5/8) = cos(2π/5) + i sin(2π/5).
(Figure: the fifth roots (ω⁵ = 1), sixth roots (ω⁶ = 1), and fifteenth roots (ω¹⁵ = 1) of unity on the unit circle.)
Summarizing,
Roots of Unity
If
ω = cos(2π/d) + i sin(2π/d),
the d-th roots of unity are
1, ω, ω², . . . , ω^(d−1).

ω^k = cos(2πk/d) + i sin(2πk/d),    k = 0, 1, 2, . . . , d − 1.
x = symbols('x')
d = 5
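One plausible way to continue this snippet with sympy (a sketch; the solve call below is an illustration, assuming the sympy star imports used in the text):

# the d-th roots of unity are the solutions of x**d = 1
solve(x**d - 1, x)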
import numpy as np
np.roots([a,b,c])
Since the cube roots of unity are the roots of the polynomial p(x) = x3 − 1,
the code
import numpy as np
np.roots([1,0,0,-1])
Above |x| stands for the length of the vector x, or the distance of the
point x to the origin. When d = 2 and we are in two dimensions, this was
defined in §1.4. For general d, this is defined in §2.1. In this section we
continue to focus on two dimensions d = 2.
m = (1/N) Σ_{k=1}^{N} xk = (x1 + x2 + · · · + xN)/N.
Point of Best-fit
The mean is the point of best-fit: The mean minimizes the mean-square
distance to the dataset (Figure 1.26).
Figure 1.26: MSD for the mean (green) versus MSD for a random point (red).
Using (1.4.8),

|a + b|² = |a|² + 2a · b + |b|²,

MSD(x) = MSD(m) + (2/N) Σ_{k=1}^{N} (xk − m) · (m − x) + |m − x|².
Since the vectors xk − m sum to zero, the middle term vanishes, so we have
MSD(x) = MSD(m) + |m − x|² ≥ MSD(m).
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
m = mean(dataset,axis=0)
p = array([random(),random()])
grid()
X = dataset[:,0]
Y = dataset[:,1]
scatter(X,Y)
for v in dataset:
plot([m[0],v[0]],[m[1],v[1]],c='green')
plot([p[0],v[0]],[p[1],v[1]],c='red')
show()
def tensor(u,v):
return array([ [ a*b for b in v] for a in u ])
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
m = mean(dataset,axis=0)
# center dataset
vectors = dataset - m
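The covariance matrix can then be assembled as the average of the tensor products of the centered vectors with themselves; a minimal sketch completing the snippet above (the assembly line itself is an illustration, assuming the usual numpy imports):

Q = sum([ tensor(v, v) for v in vectors ]) / N    # biased covariance, dividing by N
allclose(Q, cov(dataset.T, bias=True))            # returns True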
Since

(±4, ±4) ⊗ (±4, ±4) = [16 16; 16 16],
(±2, ±2) ⊗ (±2, ±2) = [4 4; 4 4],
(0, 0) ⊗ (0, 0) = [0 0; 0 0],
Notice
Q = 8(1, 1) ⊗ (1, 1),
which, as we see below (§2.5), reflects the fact that the points of this dataset
lie on a line. Here the line is y = x + 1.
The covariance matrix as written in (1.6.1) is the biased covariance matrix.
If the denominator is instead N − 1, the matrix is the unbiased covariance
matrix.
For datasets with large N , it doesn’t matter, since N and N − 1 are
almost equal. For simplicity, here we divide by N , and we only consider the
biased covariance matrix.
In practice, datasets are standardized before computing their covariance.
The covariance of standardized datasets — the correlation matrix — is the
same whether one starts with bias or not (§2.2).
In numpy, the Python covariance constructor is
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
Q = cov(dataset,bias=True,rowvar=False)
This returns the same result as the previous code for Q. Notice here there
is no need to compute the mean; this is taken care of automatically. The
same covariance matrix is returned by
Q = cov(dataset.T,bias=True)
We call (1.6.3) the total variance of the dataset. Thus the total variance
equals MSD(m).
In Python, the total variance is
Q = cov(dataset.T,bias=True)
Q.trace()
proj_u v = (v · u)u.

(Figure: the projection proj_u v of the vector v onto the line through the unit vector u.)
Because the reduced dataset and projected dataset are essentially the
same, we also refer to q as the variance of the projected dataset. Thus we
conclude (see §1.4 for v · Qv)
This shows that the dataset lies on the line passing through m and perpen-
dicular to (1, −1).
u · Qu = ax2 + 2bxy + cy 2 = 1.
The covariance ellipse and inverse covariance ellipses described above are
centered at the origin (0, 0). When a dataset has mean m and covariance Q,
the ellipses are drawn centered at m, as in Figures 1.30, 1.31, and 1.32.
Here is the code for Figure 1.28. The ellipses drawn here are centered at
the origin.
L, delta = 4, .1
x = arange(-L,L,delta)
y = arange(-L,L,delta)
X,Y = meshgrid(x, y)
a, b, c = 9, 0, 4
det = a*c - b**2
A, B, C = c/det, -b/det, a/det
def ellipse(a,b,c,levels,color):
contour(X,Y,a*X**2 + 2*b*X*Y + c*Y**2,levels,colors=color)
grid()
ellipse(a,b,c,[1],'blue')
ellipse(A,B,C,[1],'red')
show()
x1 , x2 , . . . , xN , and y1 , y2 , . . . , yN .
Suppose the mean of this dataset is m = (mx , my ). Then, by the formula for
tensor product, the covariance matrix is
Q = [a b; b c],

where
a = (1/N) Σ_{k=1}^{N} (xk − mx)²,   b = (1/N) Σ_{k=1}^{N} (xk − mx)(yk − my),   c = (1/N) Σ_{k=1}^{N} (yk − my)².
From this, we see a is the variance of the x-features, and c is the variance
of y-features. We also see b is a measure of the correlation between the x
and y features.
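These formulas are easy to check numerically; a sketch, assuming a two-dimensional dataset array as in the earlier snippets and the usual numpy imports:

X, Y = dataset[:,0], dataset[:,1]
mx, my = mean(X), mean(Y)
a = mean((X - mx)**2)
b = mean((X - mx)*(Y - my))
c = mean((Y - my)**2)
allclose(cov(dataset.T, bias=True), array([[a, b], [b, c]]))    # returns True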
Standardizing the dataset means to center the dataset and to place the x
and y features on the same scale. For example, the x-features may be close
to one another while the y-features are widely spread. The x-features are
replaced by

x1, x2, . . . , xN  →  x′k = (xk − mx)/√a,

and the y-features by

y1, y2, . . . , yN  →  y′k = (yk − my)/√c,    k = 1, 2, . . . , N.
This results in a new dataset v1 = (x′1, y′1), v2 = (x′2, y′2), . . . , vN = (x′N, y′N)
that is centered,

(v1 + v2 + · · · + vN)/N = 0,

with each feature standardized to have unit variance,

(1/N) Σ_{k=1}^{N} x′k² = 1,   (1/N) Σ_{k=1}^{N} y′k² = 1.
For example,

Q = [9 2; 2 4]  =⇒  ρ = b/√(ac) = 1/3  =⇒  Q′ = [1 1/3; 1/3 1].
corrcoef(dataset.T)
Here again, we input the transpose of the dataset if our default is vectors
as rows. Notice the 1/N cancels in the definition of ρ. Because of this,
corrcoef is the same whether we deal with biased or unbiased covariance
matrices.
u · Qu = max_{|v|=1} v · Qv.
Since the sine function varies between +1 and −1, we conclude the projected
variance varies between
1 − ρ ≤ v · Qv ≤ 1 + ρ,
and

θ = π/4,   v+ = (1/√2, 1/√2)   =⇒   v+ · Qv+ = 1 + ρ,
θ = 3π/4,  v− = (−1/√2, 1/√2)  =⇒   v− · Qv− = 1 − ρ.
Thus the best-aligned vector v+ is at 45°, and the worst-aligned vector is at
135° (Figure 1.29).
Actually, the above is correct only if ρ > 0. When ρ < 0, it’s the other
way. The correct answer is
1 − |ρ| ≤ v · Qv ≤ 1 + |ρ|,
Here are two randomly generated datasets. For the dataset in Figure
1.30, the mean and covariance are

(0.46563359, 0.59153958),   [0.09652275 0.00939796; 0.00939796 0.0674424].

For the dataset in Figure 1.31, the mean and covariance are

(0.48785572, 0.51945499),   [0.08266583 −0.00976249; −0.00976249 0.08298294].
Here is code for Figures 1.30, 1.31, and 1.32. The code incorporates the
formulas for λ± and v± .
N = 50
X = array([ random() for _ in range(N) ])
Y = array([ random() for _ in range(N) ])
scatter(X,Y,s=2)
m = mean([X,Y],axis=1)
Q = cov(X,Y,bias=True)
a, b, c = Q[0,0], Q[0,1], Q[1,1]
delta = .01
x = arange(0,1,delta)
y = arange(0,1,delta)
X,Y = meshgrid(x, y)
def ellipse(a,b,c,d,e,levels,color):
det = a*c - b**2
A, B, C = c/det, -b/det, a/det
# inverse covariance ellipse centered at (d,e)
Z = A*(X-d)**2 + 2*B*(X-d)*(Y-e) + C*(Y-e)**2
contour(X,Y,Z,levels,colors=color)
for pm in [+1,-1]:
lamda = (a+c)/2 + pm * sqrt(b**2 + (a-c)**2/4)
sigma = sqrt(lamda)
len = sqrt(b**2 +(a-lamda)**2)
axesX = [d+sigma*b/len,d-sigma*b/len]
axesY = [e-sigma*(a-lamda)/len,e+sigma*(a-lamda)/len]
plot(axesX,axesY,linewidth=.5)
grid()
levels = [.5,1,1.5,2]
ellipse(a,b,c,*m,levels,'red')
show()
(1/4)(4√2 − 4) = √2 − 1.
plot([-2,2],[-2,2],color='black')
axes.add_patch(square)
axes.add_patch(circle1)
axes.add_patch(circle2)
axes.add_patch(circle3)
axes.add_patch(circle4)
axes.add_patch(circle)
Since the edge-length of the cube is 4, the radius of each blue ball is 1.
Since the length of the diagonal of the cube is 4√3, the radius of the red
ball is

(1/4)(4√3 − 4) = √3 − 1.
Notice there are 8 blue balls.
In two dimensions, when a region is scaled by a factor t, its area increases
by the factor t2 . In three dimensions, when a region is scaled by a factor t,
its volume increases by the factor t3 . We conclude: In d dimensions, when
a region is scaled by a factor t, its (d-dimensional) volume increases by the
factor td . This is called the scaling principle.
In d dimensions, the edge-length of the cube remains 4, the radius of
each blue ball remains 1, and there are 2^d blue balls. Since the length of the
diagonal of the cube is 4√d, the same calculation results in the radius of the
red ball equal to r = √d − 1.
By the scaling principle, the volume of the red ball equals r^d times the
volume of the blue ball. We conclude the following:
• Since r = √d − 1 = 1 exactly when d = 4, we have: In four dimensions,
the red ball and the blue balls are the same size.

• Since there are 2^d blue balls, the ratio of the volume of the red ball
over the total volume of all the blue balls is r^d/2^d.

• Since r^d = 2^d exactly when r = 2, and since r = √d − 1 = 2 exactly
when d = 9, we have: In nine dimensions, the volume of the red ball
equals the sum total of the volumes of all blue balls.

• Since r = √d − 1 > 2 exactly when d > 9, we have: In ten or more
dimensions, the red ball sticks out of the cube.

• Since the length of the semi-diagonal is 2√d, for any dimension d, the
radius of the red ball r = √d − 1 is less than half the length of the
semi-diagonal. As the dimension grows without bound, the proportion
of the diagonal covered by the red ball converges to 1/2.
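These claims are easy to check numerically; a small sketch (not from the text):

from numpy import sqrt

for d in [4, 9, 10]:
    r = sqrt(d) - 1                # radius of the red ball in d dimensions
    print(d, r, r**d / 2**d)       # last column: red-ball volume over total blue-ball volume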
The code for Figure 1.35 is as follows. For 3d plotting, the module mayavi
is better than matplotlib.
pm1 = [-1,1]
for center in product(pm1,pm1,pm1):
# blue balls: color (0,0,1)
ball(*center,1,(0,0,1))
# black wire cube: color (0,0,0)
outline(color=(0,0,0))
Chapter 2

Linear Geometry
v = array([1,2,3])
v.shape
v = Matrix([1,2,3])
v.shape
The first v.shape returns (3,), and the second v.shape returns (3,1). In
either case, v is a 3-dimensional vector.
Vectors are added component by component: With
we have
together are the standard basis. Similarly, in Rd , we have the standard basis
e1 , e2 , . . . , ed .
A = array([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
A.shape
A = Matrix([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
A.shape
Note the transpose operation interchanges rows and columns: the rows of At
are the columns of A. In both numpy or sympy, the transpose of A is A.T.
A d-dimensional vector v may be written as a 1 × d matrix (a row vector)

v = (t1  t2  . . .  td),

or as a d × 1 matrix (a column vector), with the same entries arranged
vertically.
• A, B: any matrix
• Q: symmetric matrix
• P : projections
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
# 5x3 matrix
A = Matrix.hstack(u,v,w)
# column vector
b = Matrix([1,1,1,1,1])
# 5x4 matrix
M = Matrix.hstack(A,b)
In general, for any sympy matrix A, column vectors can be hstacked and
row vectors can be vstacked. For any matrix A, the code
returns True. Note we use the unpacking operator * to unpack the list, before
applying hstack.
In numpy, there is hstack and vstack, but we prefer column_stack and
row_stack, so the code
both return True. Here col refers to rows of At , hence refers to the columns
of A.
A.shape == (A.rows,A.cols)
A = zeros(2,3)
B = ones(2,2)
C = Matrix([[1,2],[3,4]])
D = B + C
E = 5 * C
F = eye(4)
A, B, C, D, E, F
returns

( [0 0 0; 0 0 0],  [1 1; 1 1],  [1 2; 3 4],  [2 3; 4 5],  [5 10; 15 20],  [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1] ).
A = diag(1,2,3,4)
B = diag(-1, ones(2, 2), Matrix([5, 7, 5]))
A, B
returns

( [1 0 0 0; 0 2 0 0; 0 0 3 0; 0 0 0 4],  [−1 0 0 0; 0 1 1 0; 0 1 1 0; 0 0 0 5; 0 0 0 7; 0 0 0 5] ).
It is straightforward to convert back and forth between numpy and sympy.
In the code
A = diag(1,2,3,4)
B = array(A)
C = Matrix(B)
For the Iris dataset, the mean (§1.3) is given by the following code.
iris = datasets.load_iris()
dataset = iris["data"]
m = mean(dataset,axis=0)
vectors = dataset - m
2.2 Products
Let t be a scalar, u, v, w be vectors, and let A, B be matrices. We already
know how to compute tu, tv, and tA, tB. In this section, we compute the dot
product u · v, the matrix-vector product Av, and the matrix-matrix product
AB.
These products are not defined unless the dimensions “match”. In numpy,
these products are written dot; in sympy, these products are written *.
In §1.4, we defined the dot product in two dimensions. We now generalize
to any dimension d. Suppose u, v are vectors in Rd . Then their dot product
u · v is the scalar obtained by multiplying corresponding features and then
summing the products. This only works if the dimensions of u and v agree.
In other words, if u = (s1 , s2 , . . . , sd ) and v = (t1 , t2 , . . . , td ), then
u · v = s1 t1 + s2 t2 + · · · + sd td . (2.2.1)
As in §1.4, we always have rows on the left, and columns on the right.
In Python,

u = array([1,2,3])
v = array([4, 5, 6])
dot(u,v)      # returns 32

u = Matrix([1,2,3])
v = Matrix([4, 5, 6])
u.T * v       # returns Matrix([[32]])

The length |v| = √(v · v) is then

sqrt(dot(v,v))

sqrt(v.T * v)
As in §1.4,

Dot Product

The dot product u · v (2.2.1) satisfies

u · v = |u| |v| cos θ,

where θ is the angle between u and v.

In two dimensions, this was equation (1.4.4) in §1.4. Since any two vectors
lie in a two-dimensional plane, this remains true in any dimension.
Based on this, we can compute the angle θ,

cos θ = (u · v)/(|u| |v|) = (u · v)/√((u · u)(v · v)).
Here is code for the angle θ,
def angle(u,v):
a = dot(u,v)
b = dot(u,u)
c = dot(v,v)
theta = arccos(a / sqrt(b*c))
return degrees(theta)
Cauchy-Schwarz Inequality

The dot product of two vectors is, in absolute value, less than or equal to
the product of their lengths,

|u · v| ≤ |u| |v|.
Vectors u and v satisfying u · v = 0 are orthogonal. With this understood,
the zero vector is orthogonal to every vector. The converse is true as well:
If u · v = 0 for every v, then in particular, u · u = 0, which implies u = 0.
Vectors v1 , . . . , vN are said to be orthonormal if they are both unit vectors
and orthogonal. Orthogonal nonzero vectors can be made orthonormal by
dividing each vector by its length.
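For example, a minimal sketch (the vectors below are arbitrary mutually orthogonal vectors):

from numpy import array
from numpy.linalg import norm

vectors = [array([1, 1, 0]), array([1, -1, 0]), array([0, 0, 2])]    # mutually orthogonal
orthonormal = [ v / norm(v) for v in vectors ]                       # now orthonormal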
|a + b| = (a + b) · v ≤ |a| + |b|.
A,B,dot(A,B)
A,B,A*B
returns

AB = [70 80 90; 158 184 210].
(Av)t = v t At .
dot(A,B).T == dot(B.T,A.T)
In terms of row vectors and column vectors, this is automatic. For exam-
ple,
(Au) · v = (Au)t v = (ut At )v = ut (At v) = u · (At v).
In Python,
dot(dot(A,u),v) == dot(u,dot(A.T,v))
dot(dot(A.T,u),v) == dot(u,dot(A,v))
As a consequence,¹

Then the identities (1.4.14) and (1.4.15) hold in general. Using the tensor
product, we have

¹ Iff is short for if and only if.
Tensor Identity
Let A be a matrix with rows v1 , v2 , . . . , vN . Then
At A = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN . (2.2.7)
Multiplying (2.2.7) by xt on the left and x on the right, and using (1.4.15),
we see (2.2.7) is equivalent to
By matrix-vector multiplication,
Ax = (v1 · x, v2 · x, . . . , vN · x).
Since |Ax|2 is the sum of the squares of its components, this derives (2.2.8).
and
∥A∥2 = trace(At A). (2.2.12)
By replacing A by At , the same results hold for columns.
Q = dot(vectors.T,vectors)/N
Q = cov(dataset,rowvar=False)
or
Q = cov(dataset.T)
After downloading the Iris dataset as in §2.1, the mean, covariance, and
total variance are

m = (5.84, 3.05, 3.76, 1.2),

Q = [0.68 −0.04 1.27 0.51; −0.04 0.19 −0.32 −0.12; 1.27 −0.32 3.09 1.29; 0.51 −0.12 1.29 0.58],          (2.2.14)

and total variance 4.54.
In Python,

from sklearn.preprocessing import StandardScaler

# standardize dataset
vectors = StandardScaler().fit_transform(dataset)
Qcorr = corrcoef(dataset.T)
Qcov = cov(vectors.T,bias=True)
allclose(Qcov,Qcorr)

returns True.
Ax = b. (2.3.1)
In this section, we use the inverse A−1 and the pseudo-inverse A+ to solve
(2.3.1).
However, it’s very easy to construct matrices A and vectors b for which
the linear system (2.3.1) has no solutions at all! For example, take A the
zero matrix and b any non-zero vector. Because of this, we must be careful
when solving (2.3.1).
Ax = b =⇒ x = A−1 b. (2.3.3)
# solving Ax=b
x = A.inv() * b
# solving Ax=b
x = dot(inv(A) , b)
x+ = A+ b =⇒ Ax+ = b.
• no solutions, or
• exactly one solution (for example, when A is invertible), or
• infinitely many solutions (for example, when A = 0 and b = 0).
The pseudo-inverse provides a single systematic procedure for deciding
among these three possibilities. The pseudo-inverse is available in numpy and
sympy as pinv. In this section, we focus on using Python to solve Ax = b,
postponing concepts to §2.6.
How do we use the above result? Given A and b, using Python, we
compute x = A+ b. Then we check, by multiplying in Python, equality of Ax
and b.
The rest of the section consists of examples of solving linear systems. The
reader is encouraged to work out the examples below in Python. However,
because some linear systems have more than one solution, and the imple-
mentation of Python on your laptop may be different than on my laptop, our
solutions may differ.
It can be shown that if the entries of A are integers, then the entries of
A+ are fractions. This fact is reflected in sympy, but not in numpy, as the
default in numpy is to work with floats.
Let
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
# arrange as columns
A = column_stack([u,v,w])
pinv(A)
returns

A+ = (1/150) [−37 −20 −3 14 31; −10 −5 0 5 10; 17 10 3 −4 −11].
Alternatively, in sympy,
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.pinv()
For
b3 = (−9, −3, 3, 9, 10),
we have

x+ = A+ b3 = (1/15)(82, 25, −32).
However, for this x+ , we have
We solve
Bx = u, Bx = v, Bx = w
by constructing the candidates
B + u, B + v, B + w,
Let

C = Aᵗ = [1 2 3 4 5; 6 7 8 9 10; 11 12 13 14 15]

and let f = (0, −5, −10). Then

C+ = (Aᵗ)+ = (A+)ᵗ = (1/150) [−37 −10 17; −20 −5 10; −3 0 3; 14 5 −4; 31 10 −11]

and

x+ = C+ f = (1/50)(32, 35, 38, 41, 44).
Once we confirm equality of Cx+ and f , which is the case, we obtain a
solution x+ of Cx = f .
returns
x = (t1 , t2 , . . . , td ).
Then
Ax = t1 v1 + t2 v2 + · · · + td vd , (2.4.2)
In other words,
t1 v1 + t2 v2 + · · · + td vd
of the vectors. For example, span(b) of a single vector b is the line through
b, and span(u, v, w) is the set of all linear combinations ru + sv + tw.
Span Definition I
The span of v1 , v2 , . . . , vd is the set S of all linear combinations of v1 ,
v2 , . . . , vd , and we write
S = span(v1 , v2 , . . . , vd ).
Span Definition II
Let A be the matrix with columns v1 , v2 , v3 , . . . , vd . Then
span(v1 , v2 , . . . , vd ) is the set S of all vectors of the form Ax.
span(v1 , v2 , . . . , vd ) = span(w1 , w2 , . . . , wN ).
Thus there are many choices of spanning vectors for a given span.
For example, let u, v, w be the columns of A in (2.3.4). Let ⊂ mean “is
contained in”. Then

span(u, v) ⊂ span(u, v, w),

since adding a third vector can only increase the linear combination possibilities.
On the other hand, since w = 2v − u, we also have

span(u, v, w) ⊂ span(u, v).

It follows that

span(u, v, w) = span(u, v).
Let A be a matrix. The column space of A is the span of its columns. For
A as in (2.3.4), the column space of A is span(u, v, w). The code
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.columnspace()

returns a minimal set of vectors spanning the column space of A. The column
rank of A is the number of vectors returned.
For example, for A as in (2.3.4), this code returns
Ax = t1 v1 + t2 v2 + · · · + td vd .
By (2.4.3),
orth(A)
This code returns two orthonormal vectors b1 /|b1 | and b2 /|b2 |, where
For example, let b3 = (−9, −3, 3, 9, 10) and let Ā = (A, b3 ). Using Python,
check the column rank of Ā is 3. Since the column rank of A is 2, we conclude
b3 is not in the column space of A: b3 is not a linear combination of u, v, w.
When (2.4.6) holds, b is a linear combination of the columns of A. How-
ever, (2.4.6) does not tell us which linear combination. According to (2.4.3),
finding the linear combination is equivalent to solving Ax = b.
d-dimensional Space
Rd is a span.
span(a, b, c, d, e) = span(a, f ).
t1 v1 + t2 v2 + · · · + td vd = 0.
ru + sv + tw = 1u − 2v + 1w = 0 (2.4.8)
u = −(s/r)v − (t/r)w.
v = −(r/s)u − (t/s)w.
If t ̸= 0, then
w = −(r/t)u − (s/t)v.
A.nullspace()
This says the null space of A consists of all multiples of (1, −2, 1). Since the
code
[r,s,t] = A.nullspace()[0]
null_space(A)
A Versus At A
Let A be any matrix. The null space of A equals the null space of At A.
|Ax|2 = Ax · Ax = x · At Ax = 0,
t1 v1 + t2 v2 + · · · + td vd = 0.
Take the dot product of both sides with v1 . Since the dot products of any
two vectors is zero, and each vector has length one, we obtain
t1 = t1 v1 · v1 = t1 v1 · v1 + t2 v2 · v1 + · · · + td vd · v1 = 0.
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
C = row_stack([u,v,w])
null_space(B)
Ax = (v1 · x, v2 · x, . . . , vN · x).
v1 · x = 0, v2 · x = 0, . . . , vN · x = 0,
Since the row space is the orthogonal complement of the null space, and
the null space of A equals the null space of At A, we conclude
A Versus At A
Let A be any matrix. Then the row space of A equals the row space
of At A.
Now replace A by At in this last result. Since the row space of At equals
the column space of A, and AAt is symmetric, we also have
A Versus AAt
Let A be any matrix. Then the column space of A equals the column
space of AAt .
• If x1 and x2 are in the null space, and r1 and r2 are scalars, then so is
r1 x1 + r2 x2 , because
m = (x1 + x2 + · · · + xN)/N.
Center the dataset (see §1.3)
v1 = x1 − m, v2 = x2 − m, . . . , vN = xN − m,
v1 · b, v2 · b, . . . , vN · b.
b · Qb = 0.
b · (x − m) = 0.
b · (x − m) = 0.
a(x − x0 ) + b(y − y0 ) = 0, or ax + by = c,
point in the plane, then (x, y, z) − (x0 , y0 , z0 ) is orthogonal to (a, b, c), so the
equation of the plane is
(a, b, c) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or ax + by + cz = d,
where d = ax0 + by0 + cz0 .
Suppose we have a dataset in R3 with mean m = (3, 2, 1), and covariance
Q = [1 1 1; 1 1 1; 1 1 1].          (2.5.2)
Let b = (2, −1, −1). Then Qb = 0, so b · Qb = 0. We conclude the dataset
lies in the plane
(2, −1, −1) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or 2x − y − z = 3.
In this case, the dataset is two-dimensional, as it lies in a plane.
If a dataset has covariance the 3 × 3 identity matrix I, then b · Ib is never
zero unless b = 0. Such a dataset is three-dimensional, it does not lie in a
plane.
Sometimes there may be several zero variance directions. For example,
for the covariance (2.5.2) and u = (2, −1, −1), v = (0, 1, −1), we have both
u · Qu = 0 and v · Qv = 0.
From this we see the dataset corresponding to this Q lies in two planes: The
plane orthogonal to u, and the plane orthogonal v. But the intersection of
two planes is a line, so this dataset lies in a line, which means it is one-
dimensional.
Which line does this dataset lie in? Well, the line has to pass through the
mean, and is orthogonal to u and v. If we find a vector b satisfying b · u = 0
and b · v = 0, then the line will pass through m and will be parallel to b. But
we know how to find such a vector. Let A be the matrix with rows u, v. Then
b in the nullspace of A fulfills the requirements. We obtain b = (1, 1, 1).
Based on the above result, here is code that returns zero variance direc-
tions.
def zero_variance(dataset):
Q = cov(dataset.T)
return null_space(Q)
(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15), (16, 17, 18, 19, 20).
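For example, applied to the four points above (a sketch; zero_variance is the function defined above, and null_space is scipy.linalg.null_space):

from numpy import array
from scipy.linalg import null_space

dataset = array([[ 1,  2,  3,  4,  5],
                 [ 6,  7,  8,  9, 10],
                 [11, 12, 13, 14, 15],
                 [16, 17, 18, 19, 20]])
zero_variance(dataset)    # a 5 x 4 array whose columns are orthonormal zero variance directions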
2.6 Pseudo-Inverse
What exactly is the pseudo-inverse? It turns out the answer is best under-
stood geometrically.
Think of b and Ax as points, and measure the distance between them,
and think of x and the origin 0 as points, and measure the distance between
them (Figure 2.1).
(Figure 2.1: the point x in the source space, its image Ax in the target space, and the target point b.)
Even though the point x+ may not solve Ax = b, this procedure (Figure
2.2) results in a uniquely determined x+ : While there may be several points
x∗ , there is only one x+ .
Figure 2.2: The points x, Ax, the points x∗ , Ax∗ , and the point x+ .
The results in this section are as follows. Let A be any matrix. There
is a unique matrix A+ — the pseudo-inverse of A — with the following
properties.
• x+ = A+ b is a solution of (2.3.1) whenever (2.3.1) is solvable.

• In either case, x+ is a solution of the regression equation

Aᵗ Ax = Aᵗ b.          (2.6.2)
Zero Residual
x is a solution of (2.3.1) iff the residual is zero.
x + 6y + 11z = −9
2x + 7y + 12z = −3
3x + 8y + 13z = 3          (2.6.3)
4x + 9y + 14z = 9
5x + 10y + 15z = 10
Let b be any vector, not necessarily in the column space of A. To see how
close we can get to solving (2.3.1), we minimize the residual (2.6.1). We say
x∗ is a residual minimizer if |Ax∗ − b| ≤ |Ax − b| for every x.
Regression Equation
x∗ is a residual minimizer iff x∗ solves the regression equation.
At (Ax∗ − b) · v = (Ax∗ − b) · Av = 0.
At (Ax∗ − b) = 0,
Multiple Solutions
Any two residual minimizers differ by a vector in the nullspace of A.
Since we know from above there is a residual minimizer in the row space
of A, we always have a minimum norm residual minimizer.
Let v be in the null space of A, and write
x∗ · v ≥ 0.
Since both ±v are in the null space of A, this implies ±x∗ · v ≥ 0, hence
x∗ · v = 0. Since the row space is the orthogonal complement of the null
space, the result follows.
Uniqueness
If x1+ and x2+ are minimum norm residual minimizers, then v = x1+ − x2+
is in both the null space and the row space of A, hence v = 0 and x1+ = x2+.
We know any two solutions of the linear system (2.3.1) differ by a vector in
the null space of A (2.4.10), and any two solutions of the regression equation
(2.6.2) differ by a vector in the null space of A (above).
If x is a solution of (2.3.1), then, by multiplying by At , x is a solution of
the regression equation (2.6.2). Since x+ = A+ b is a solution of the regression
equation, x+ = x + v for some v in the null space of A, so
Ax+ = A(x + v) = Ax + Av = b + 0 = b.
This shows x+ is a solution of the linear system. Since all other solutions
differ by a vector v in the null space of A, this establishes the result.
Now we can state when Ax = b is solvable,
Solvability of Ax = b
Properties of Pseudo-Inverse
A. AA+ A = A
B. A+ AA+ = A+
C. AA+ is symmetric          (2.6.8)
D. A+ A is symmetric
u = A+ Au + v. (2.6.9)
Au = AA+ Au.
A+ w = A+ AA+ w + v
for some v in the null space of A. But both A+ w and A+ AA+ w are in the
row space of A, hence so is v. Since v is in both the null space and the row
space, v is orthogonal to itself, so v = 0. This implies A+ AA+ w = A+ w.
Since w was any vector, we obtain B.
Since A+ b solves the regression equation, At AA+ b = At b for any vector
b. Hence At AA+ = At . Let P = AA+ . Now
(x − A+ Ax) · A+ Ay = 0.
x · P y = P x · P y = x · P tP y
Also we have
2.7 Projections
In this section, we study projection matrices P , and we show
(Figure 2.3: the projection Pb = tu of b onto the line span(u), and the orthogonal part b − Pb.)
Let u be a unit vector, and let b be any vector. Let span(u) be the line
through u (Figure 2.3). The projection of b onto span(u) is the vector v in
span(u) that is closest to b.
It turns out this closest vector v equals P b for some matrix P , the pro-
jection matrix. Since span(u) is a line, the projected vector P b is a multiple
tu of u.
From Figure 2.3, b − P b is orthogonal to u, so
0 = (b − P b) · u = b · u − P b · u = b · u − t u · u = b · u − t.
is already on the line. If U is the matrix with the single column u, we obtain
P = U U t.
To summarize, the projected vector is the vector (b · u)u, and the reduced
vector is the scalar b · u. If U is the matrix with the single column u, then
the reduced vector is U t b and the projected vector is U U t b.
(Figure: the projection Pb of b onto a plane, and the orthogonal part b − Pb.)
(b − P b) · u = 0 and (b − P b) · v = 0.
r = b · u, s = b · v.
2. P b = b if b is in S,
P = AA+ . (2.7.2)
establishing 3.
def project(A,b):
Aplus = pinv(A)
x = dot(Aplus,b) # reduced
return dot(A,x) # projected
For A as in (2.3.4) and b = (−9, −3, 3, 9, 10), the reduced vector onto the
column space of A is

x = A+ b = (1/15)(82, 25, −32),

and the projected vector onto the column space of A is

Pb = Ax = AA+ b = (−8, −3, 2, 7, 12).

The projection matrix onto the column space of A is

P = AA+ = (1/10) [6 4 2 0 −2; 4 3 2 1 0; 2 2 2 2 2; 0 1 2 3 4; −2 0 2 4 6].
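These values are easily checked numerically; a sketch, with A the numpy matrix built in §2.3:

from numpy import array, dot, allclose
from numpy.linalg import pinv

P = dot(A, pinv(A))                        # projection onto the column space of A
b = array([-9, -3, 3, 9, 10])
allclose(dot(P, b), [-8, -3, 2, 7, 12])    # returns True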
P = A+ A. (2.7.3)
def project_to_ortho(U,b):
x = dot(U.T,b) # reduced
return dot(U,x) # projected
dataset vk in Rd , k = 1, 2, . . . , N
reduced U t vk in Rn , k = 1, 2, . . . , N
projected U U t vk in Rd , k = 1, 2, . . . , N
If S is a span in Rd , then
Rd = S ⊕ S ⊥ . (2.7.5)
v = P v + (v − P v),
and the null space and row space are orthogonal to each other.
P = I − A+ A. (2.7.7)
But this was already done in §2.3, since P b = AA+ b = Ax+ where x+ = A+ b
is a residual minimizer.
2.8 Basis
(Figure: spanning vectors and linearly independent vectors; orthogonal and orthonormal vectors; bases, orthogonal bases, and orthonormal bases, and how they relate.)
Span of N Vectors
e1 = (1, 0, . . . , 0),
e2 = (0, 1, 0, . . . , 0),
... = ...
ed = (0, 0, . . . , 0, 1),
The dimension of Rd is d.
matrix_rank(vectors)
In particular, since 712 < 784, approximately 10% of pixels are never
touched by any image. For example, a likely pixel to remain untouched is
at the top left corner (0, 0). For this dataset, there are 72 = 784 − 712 zero
variance directions.
We pose the following question: What is the least n for which the first n
images are linearly dependent? Since the dimension of the feature space is
784, we must have n ≤ 784. To answer the question, we compute the rank
of the first n vectors for n = 1, 2, 3, . . . , and continue until we have linear
dependence of v1 , v2 , . . . , vn .
If we save the MNIST dataset as a centered array vectors, as in §2.1,
and run the code below, we obtain n = 560 (Figure 2.7). matrix_rank is
discussed in §2.9.
def find_first_defect(vectors):
    d = len(vectors[0])
    previous = 0
    for n in range(len(vectors)):
        r = matrix_rank(vectors[:n+1,:])
        print((r,n+1),end=",")
        if r == previous: break
        if r == d: break
        previous = r
This we call the dimension staircase. For example, Figure 2.8 is the
dimension staircase for
v1 = (1, 0, 0), v2 = (0, 1, 0), v3 = (1, 1, 0), v4 = (3, 4, 0), v5 = (0, 0, 1).
In Figure 2.8, we call the points (3, 2) and (4, 2) defects.
In the code, the staircase is drawn by stairs(Y, X), where the vertical values Y
and the horizontal points X satisfy len(X) == len(Y)+1. In Figure 2.8,
X = [1,2,3,4,5,6], and Y = [1,2,2,2,3].
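For instance, the staircase in Figure 2.8 can be drawn directly (a sketch, assuming the matplotlib star imports used throughout):

grid()
stairs([1, 2, 2, 2, 3], [1, 2, 3, 4, 5, 6])
show()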
With the MNIST dataset loaded as vectors, here is code returning Figure
2.9. This code is not efficient, but it works. It takes 57041 vectors in the
dataset to fill up 712 dimensions.
def dimension_staircase(vectors):
    d = vectors[0].size
    N = len(vectors)
    rmax = matrix_rank(vectors)
    dimensions = [ ]
    basis = [ ]
    for n in range(1,N):
        r = matrix_rank(vectors[:n,:])
        print((r,n),end=",")
        dimensions.append(r)
        if r == rmax: break
    stairs(dimensions, range(n+1))
span(v1 , v2 , . . . , vN ) = span(v2 , v3 , . . . , vN ).
span(v1 , v2 , . . . , vN ) = span(b1 , b2 , . . . , bd ),
v1 is a linear combination of b1 , b2 , . . . , bd ,
v1 = t1 b1 + t2 b2 + · · · + td bd .
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , b2 , b3 , . . . , bd ).
Repeating the same logic, v2 is a linear combination of v1 , b2 , b3 , . . . , bd ,
v2 = s1 v1 + t2 b2 + t3 b3 + · · · + td bd .
b2 = (1/t2)(v2 − s1 v1 − t3 b3 − · · · − td bd).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , b3 , b4 , . . . , bd ).
b3 = (1/t3)(v3 − s1 v1 − s2 v2 − t4 b4 − · · · − td bd).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , v3 , b4 , b5 , . . . , bd ).
span(v1 , v2 , . . . , vN ) = · · · = span(v1 , v2 , . . . , vd ).
2.9 Rank
If A is an N ×d matrix, then (Figure 2.10) x 7→ Ax is a linear transformation
that sends a vector x in Rd (the source space) to the vector Ax in RN (the
target space). The transpose At goes in the reverse direction: The linear
transformation b 7→ At b sends a vector b in RN (the target space) to the
vector At b in Rd (the source space).
It follows that for an N × d matrix, the dimension of the source space is
d, and the dimension of the target space is N ,
dim(source space) = d, dim(target space) = N.
(Figure 2.10: the linear transformation A from the source space R³ to the target space R⁵, and the transpose Aᵗ going in the reverse direction, with x, Ax, b, Aᵗb.)
By (2.4.2), the column space is in the target space, and the row space is
in the source space. Thus we always have
0 ≤ row rank ≤ d and 0 ≤ column rank ≤ N.
For A as in (2.3.4), the column rank is 2, the row rank is 2, and the nullity
is 1. Thus the column space is a 2-d plane in R5 , the row space is a 2-d plane
in R3 , and the null space is a 1-d line in R3 .
Rank Theorem
Let A be any matrix. Then
A.rank()
matrix_rank(A)
returns the rank of a matrix. The main result implies rank(A) = rank(At ),
so
For any N × d matrix, the rank is never greater than min(N, d).
C = CI = CAB = IB = B,
so B = C is the inverse of A.
The first two assertions are in §2.2. For the last assertion, assume U
is a square matrix. From §2.4, orthonormality of the rows implies linear
Orthogonal Matrix
A matrix U is orthogonal iff its rows are an orthonormal basis iff its
columns are an orthonormal basis.
Since
U u · U v = u · U t U v = u · v,
U preserves dot products. Since lengths are dot products, U also preserves
lengths. Since angles are computed from dot products, U also preserves
angles. Summarizing,
As a consequence,
I = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vd ⊗ vd .
and
|v|2 = |v · v1 |2 + |v · v2 |2 + · · · + |v · vd |2 . (2.9.4)
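A quick numerical check of these identities with an orthonormal basis of eigenvectors (a sketch; outer is numpy's tensor product, and the matrix is arbitrary):

from numpy import array, allclose, outer, eye
from numpy.linalg import eigh

Q = array([[2.0, 1.0], [1.0, 2.0]])
lamda, U = eigh(Q)                       # the columns of U are an orthonormal basis v1, v2
I2 = sum([ outer(v, v) for v in U.T ])   # v1 ⊗ v1 + v2 ⊗ v2
allclose(I2, eye(2))                     # returns True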
To derive the main result, first we recall (2.7.6). From the definition of
dimension, we can rewrite (2.7.6) as
Ax = A(u + v) = Au + Av = Av.
This shows the column space consists of vectors of the form Av with v in the
row space.
Let v1 , v2 , . . . , vr be a basis for the row space. From the previous para-
graph, it follows Av1 , Av2 , . . . , Avr spans the column space of A. We claim
Av1 , Av2 , . . . , Avr are linearly independent. To check this, we write
If v is the vector t1 v1 +t2 v2 +· · ·+tr vr , this shows v is in the null space. But v
is a linear combination of basis vectors of the row space, so v is also in the row
space. Since the row space is the orthogonal complement of the null space,
we must have v orthogonal to itself. Thus v = 0, or t1 v1 +t2 v2 +· · ·+tr vr = 0.
But v1 , v2 , . . . , vr is a basis. By linear independence of v1 , v2 , . . . , vr , we
conclude t1 = 0, . . . , tr = 0. This establishes the claim, hence Av1 , Av2 , . . . ,
Avr is a basis for the column space. This shows r is the dimension of the
column space, which is by definition the column rank. Since by construction,
r is also the row rank, this establishes the rank theorem.
Chapter 3
Principal Components
To keep things simple, assume both the source space and the target space
are R2 ; then A is 2 × 2.
The unit circle (in red in Figure 3.1) is the set of vectors u satisfying
|u| = 1. The image of the unit circle (also in red in Figure 3.1) is the set of
vectors of the form
{Au : |u| = 1}.
The annulus is the set (the region between the dashed circles in Figure 3.1)
of vectors b satisfying
{b : σ2 < |b| < σ1 }.
It turns out the image is an ellipse, and this ellipse lies in the annulus.
Thus the numbers σ1 and σ2 constrain how far the image of the unit circle
is from the origin, and how near the image is to the origin.
This shows the image of the unit circle is the inverse covariance ellipse (§1.6)
corresponding to the covariance Q, with major axis length 2σ1 and minor
axis length 2σ2 .
These reflect vectors across the horizontal axis, and across the vertical axis.
Recall an orthogonal matrix is a matrix U satisfying U t U = I = U U t
(2.9.2). Every orthogonal matrix U is a rotation V or a rotation times a
reflection V R.
The SVD decomposition (§3.3) states that every matrix A can be written
as a product
A = [a b; c d] = USV.
Here S is a diagonal matrix as above, and U , V are orthogonal and rotation
matrices as above.
In more detail, apart from a possible reflection, there are scalings σ1 and
σ2 and angles α and β, so that A transforms vectors by first rotating by α,
then scaling by (σ1 , σ2 ), then by rotating by β (Figure 3.2).
(Figure 3.2: the action of A = USV: first rotate by V, then scale by S, then rotate by U.)
Av = λv (3.2.1)
(Figure: singular data σ, u, v, with row rank and column rank, are defined for any matrix; eigendata λ, v are defined for square matrices; for covariance matrices λ ≥ 0, and for invertible matrices λ ≠ 0.)
A = array([[2,1],[1,2]])
lamda, U = eig(A)
lamda
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda
(A − λI)v = Av − λv = 0.
v · Qv = v · λv = λv · v = λ.
µu · v = u · (µv) = u · Qv = v · Qu = v · (λu) = λu · v.
This implies
(µ − λ)u · v = 0.
If λ ̸= µ, we must have u · v = 0. We conclude:
QU = U E. (3.2.3)
allclose(dot(Q,v), lamda*v)
returns True.
λ1 ≥ λ2 ≥ · · · ≥ λd .
Diagonalization (EVD)
In other words, with the correct choice of orthonormal basis, the matrix
Q becomes a diagonal matrix E.
The orthonormal basis eigenvectors v1 , v2 , . . . , vd are the principal compo-
nents of the matrix Q. The eigenvalues and eigenvectors of Q, taken together,
are the eigendata of Q. The code
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda, U
returns the eigenvalues [1, 3] and the matrix U = [u, v] with columns

u = (1/√2, −1/√2),   v = (1/√2, 1/√2).
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
V = U.T
E = diag(lamda)
allclose(Q, dot(U, dot(E, V)))

returns True.
init_printing()
# eigenvalues
Q.eigenvals()
# eigenvectors
Q.eigenvects()
U, E = Q.diagonalize()
rank(Q) = rank(E) = r.
Using sympy,
Q = Matrix([[2,1],[1,2]])
U, E = Q.diagonalize()
display(U,E)
returns

U = [1 1; −1 1],   E = [1 0; 0 3].
Also,
Q = Matrix([[a,b ],[b,c]])
U, E = Q.diagonalize()
display(Q,U,E)
returns

Q = [a b; b c],   U = [(a − c − √D)/(2b)   (a − c + √D)/(2b); 1   1],

and

E = (1/2) [a + c − √D   0; 0   a + c + √D],   D = (a − c)² + 4b².
display is used to pretty-print the output.
Pseudo-Inverse (EVD)
Qx = b
has a solution x for every vector b iff all eigenvalues are nonzero, in
which case
x = (1/λ1)(b · v1)v1 + (1/λ2)(b · v2)v2 + · · · + (1/λd)(b · vd)vd.          (3.2.5)
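A numerical check of (3.2.5) (a sketch; the matrix and vector below are arbitrary):

from numpy import array, dot, allclose
from numpy.linalg import eigh, solve

Q = array([[2.0, 1.0], [1.0, 2.0]])
b = array([1.0, 5.0])
lamda, U = eigh(Q)
# expand b in the eigenvector basis and divide by the eigenvalues, as in (3.2.5)
x = sum([ dot(b, U[:,k]) / lamda[k] * U[:,k] for k in range(len(lamda)) ])
allclose(x, solve(Q, b))    # returns True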
trace(Q) = λ1 + λ2 + · · · + λd . (3.2.6)
Q2 is symmetric with eigenvalues λ21 , λ22 , . . . , λ2d . Applying the last result to
Q2 , we have
(Figure: the vectors ±√λ1 v1 and ±√λ2 v2 along the principal axes.)
Q = λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd . (3.2.7)
v = (v · v1 ) v1 + (v · v2 ) v2 + · · · + (v · vd ) vd .
covariance matrix

(2/(2d)) (b1 ⊗ b1 + b2 ⊗ b2 + · · · + bd ⊗ bd)

equals Q/d.
where the maximum is over all unit vectors v. We say a unit vector b is best-fit
for Q or best-aligned with Q if the maximum is achieved at v = b: λ1 = b · Qb.
When Q is a covariance matrix, this means the unit vector b is chosen so that
the variance b · Qb of the dataset projected onto b is maximized.
An eigenvalue λ1 of Q is the top eigenvalue if λ1 ≥ λ for any other
eigenvalue. An eigenvalue λ1 of Q is the bottom eigenvalue if λ1 ≤ λ for any
other eigenvalue.
A Calculation
Suppose λ, a, b, c, d are real numbers and suppose we know
λ + at + bt2
≤ λ, for all t real.
1 + ct + dt2
Then a = λc.
λ1 ≥ v · Qv = v · (λv) = λv · v = λ.
λ1 = v1 · Qv1 ≥ v · Qv (3.2.9)
for all unit vectors v. Let u be any vector. Then for any real t,
v1 + tu
v=
|v1 + tu|
u · Qv1 = λ1 u · v1
156 CHAPTER 3. PRINCIPAL COMPONENTS
u · (Qv1 − λ1 v1 ) = 0
Just as the maximum variance (3.2.8) is the top eigenvalue λ1 , the mini-
mum variance
λd = min v · Qv, (3.2.10)
|v|=1
λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λd .
3.2. EIGENVALUE DECOMPOSITION 157
v1 , v2 , v3 , . . . , vd .
T = S⊥
S
v1
v3
v2
Sλ = {v : Qv = λv}
the eigenspace corresponding to λ. For example, suppose the top three eigen-
values are equal: λ1 = λ2 = λ3 , with b1 , b2 , b3 the corresponding eigenvectors.
Calling this common value λ, the eigenspace is Sλ = span(b1 , b2 , b3 ). Since
b1 , b2 , b3 are orthonormal, dim(Vλ ) = 3. In Python, the eigenspaces Vλ are
obtained by the matrix U above: The columns of U are an orthonormal basis
for the entire space, so selecting the columns corresponding to a specific λ
yields an orthonormal basis for Sλ .
Let (evs,U) be the list of eigenvalues and matrix U whose columns are
the eigenvectors. Then the eigenvectors are the rows of U t . Here is code for
selecting just the eigenvectors corresponding to eigenvalue s.
lamda, U = eigh(Q)
V = U.T
V[isclose(lamda,s)]
The function isclose(a,b) returns True when a and b are numerically close.
Using this boolean, we extract only those rows of V whose corresponding
eigenvalue is close to s.
3.2. EIGENVALUE DECOMPOSITION 159
All this can be readily computed in Python. For the Iris dataset, we have
the covariance matrix in (2.2.14). The eigenvalues are
4.54 = trace(Q) = λ1 + λ2 + λ3 + λ4 .
For the Iris dataset, the top eigenvalue is λ1 = 4.2, it has multiplicity 1, and
its corresponding list of eigenvectors contains only one eigenvector,
def row(i,d):
v = [0]*d
v[i] = 2
if i > 0: v[i-1] = -1
if i < d-1: v[i+1] = -1
if i == 0: v[d-1] += -1
if i == d-1: v[0] += -1
return v
# using sympy
from sympy import Matrix
# using numpy
from numpy import *
m1 m2
x1 x2
To explain where these matrices come from, look at the mass-spring sys-
tems in Figures 3.6 and 3.7. Here we have springs attached to masses and
162 CHAPTER 3. PRINCIPAL COMPONENTS
walls on either side. At rest, the springs are the same length. When per-
turbed, some springs are compressed and some stretched. In Figure 3.6, let
x1 and x2 denote the displacement of each mass from its rest position.
When extended by x, each spring fights back by exerting a force kx
proportional to the displacement x. For example, look at the mass m1 . The
spring to its left is extended by x1 , so exerts a force of −kx1 . Here the minus
indicates pulling to the left. On the other hand, the spring to its right is
extended by x2 − x1 , so it exerts a force +k(x2 − x1 ). Here the plus indicates
pulling to the right. Adding the forces from either side, the total force on
m1 is −k(2x1 − x2 ). For m2 , the spring to its left exerts a force −k(x2 − x1 ),
and the spring to its right exerts a force −kx2 , so the total force on m2 is
−k(2x2 − 2x1 ). We obtain the force vector
2x1 − x2 2 −1 x1
−k = −k .
−x1 + 2x2 −1 2 x2
However, as you can see, the matrix here is not exactly Q(2).
m1 m2 m3 m4 m5
x1 x2 x3 x4 x5
vector
2x1 − x2 2 −1 0 0 0 x1
−x1 + 2x2 − x3 −1 2 −1 0 0 x2
−k −x2 + 2x3 − x4 = −k 0 −1
2 −1 0 x3 .
−x3 + 2x4 − x5 0 0 −1 2 −1 x4
−x4 + 2x5 0 0 0 −1 2 x5
But, again, the matrix here is not Q(5). Notice, if we place one mass and
two springs in Figure 3.6, we obtain the 1 × 1 matrix 2.
To obtain Q(2) and Q(5), we place the springs along a circle, as in Figures
3.8 and 3.9. Now we have as many springs as masses. Repeating the same
logic, this time we obtain Q(2) and Q(5). Notice if we place one mass and
one spring in Figure 3.8, d = 1, we obtain the 1 × 1 matrix Q(1) = 0: There
is no force if we move a single mass around the circle, because the spring is
not being stretched.
m1 m2 m2
m1
m1 m1
m2
m2
m5 m5
m3 m4
m4 m3
p(t) = 2 − t − td−1 ,
and let
1
ω
ω2
v1 = .
ω3
..
.
ω d−1
Then Qv1 is
1
2 − ω − ω d−1
−1 + 2ω − ω 2
ω
−ω + 2ω 2 − ω 3
ω2
Qv1 = = p(ω) = p(ω)v1 .
.. ω3
. ..
d−2 d−1
.
−ω + 2ω −1 d−1
ω
vk = 1, ω k , ω 2k , ω 3k , . . . , ω (d−1)k .
By (1.5.7),
Eigenvalues of Q(d)
Q(2) = (4, 0)
Q(3) = (3, 3, 0)
Q(4) = (4, 2, 2, 0)
√ √ √ √ !
5 5 5 5 5 5 5 5
Q(5) = + , + , − , − ,0
2 2 2 2 2 2 2 2
Q(6) = (4, 3, 3, 1, 1, 0)
√ √ √ √
Q(8) = (4, 2 + 2, 2 + 2, 2, 2, 2 − 2, 2 − 2, 0)
√ √ √ √
5 5 5 5 3 5 3 5
Q(10) = 4, + , + , + , + ,
2 2 2 2 2 2 2 2
√ √ √ √ !
5 5 5 5 3 5 3 5
− , − , − , − ,0
2 2 2 2 2 2 2 2
√ √ √ √
Q(12) = 4, 2 + 3, 2 + 3, 3, 3, 2, 2, 1, 1, 2 − 3, 2 − 3, 0 .
3.2. EIGENVALUE DECOMPOSITION 167
The matrices Q(d) are circulant matrices. Each row in Q(d) is obtained
from the row above by shifting the entries to the right. The trick of using
the roots of unity to compute the eigenvalues and eigenvectors works for any
circulant matrix.
Our last topic is the distribution of the eigenvalues for large d. How are
the eigenvalues scattered? Figure 3.10 plots the eigenvalues for Q(50) using
the code below.
d = 50
lamda = eigh(Q(d))[0]
stairs(lamda,range(d+1),label="numpy")
k = arange(d)
lamda = 2 - 2*cos(2*pi*k/d)
sorted = sort(lamda)
scatter(k,lamda,s=5,label="unordered")
scatter(k,sorted,c="red",s=5,label="increasing order")
legend()
show()
Figure 3.10 shows the eigenvalues tend to cluster near the top λ1 ∼ 4 and
the bottom λd = 0 for d large. Using the double-angle formula,
2 πk
λk = 4 sin , k = 0, 1, 2, . . . , d − 1.
d
Solving for k/d in terms of λ, and multiplying by two to account for the dou-
ble multiplicity, we obtain2 the proportion of eigenvalues below threshhold
λ,
1√
#{k : λk ≤ λ} 2
≈ arcsin λ , 0 ≤ λ ≤ 4. (3.2.13)
d π 2
2
This is an approximate equality: The ratio of the two sides approaches 1 as d → ∞.
168 CHAPTER 3. PRINCIPAL COMPONENTS
Equivalently, the derivative of the arcsine law (3.2.13) exhibits (see (7.1.9))
the eigenvalue clustering near the ends (Figure 3.11).
lamda = arange(0.1,3.9,.01)
density = 1/(pi*sqrt(lamda*(4-lamda)))
plot(lamda,density)
# r"..." means raw string
f = r"$\displaystyle\frac1{\pi\sqrt{\lambda(4-\lambda)}}$"
text(.5,.45,f,usetex=True,fontsize="x-large")
show()
When this happens, v is a right singular vector and u is a left singular vector
associated to σ.
Some books allow singular values to be zero. Here we insist that sin-
gular values be positive. Contrast singular values with eigenvalues: While
eigenvalues may be negative or zero, for us singular values are positive.
The definition immediately implies
Then Av = λv implies λ = 1 and v = (1, 0). Thus A has only one eigenvalue
equal to 1. Set
t 1 1
Q=AA= .
1 2
Since Q is symmetric, Q has two eigenvalues λ1 , λ2 and corresponding eigen-
vectors v1 , v2 . Moreover, as we saw in an earlier section, v1 , v2 may be chosen
orthonormal.
The eigenvalues of Q are given by
0 = det(Q − λI) = λ2 − 3λ + 1.
Qv = At Av = At (σu) = σ 2 v. (3.3.2)
Thus v1 , u1 are right and left singular vectors corresponding to the singular
value σ1 of A. Similarly, if we set u2 = Av2 /σ2 , then v2 , u2 are right and left
singular vectors corresponding to the singular value σ2 of A.
We show v1 , v2 are orthonormal, and u1 , u2 are orthonormal. We already
know v1 , v2 are orthonormal, because we chose them that way, as eigenvectors
of the symmetric matrix Q. Also
0 = λ1 v1 ·v2 = Qv1 ·v2 = (At Av1 )·v2 = (Av1 )·(Av2 ) = σ1 u1 ·σ2 u2 = σ1 σ2 u1 ·u2 .
A Versus Q
Let A be any matrix. Then
Since the rank equals the dimension of the row space, the first part follows
from §2.4.
If Av = σu and At u = σv, then
Qv = At Av = At (σu) = σAt u = σ 2 v,
so v is an eigenvector of At A corresponding
√ to λ = σ 2 > 0. Conversely, If
Qv = λv and λ > 0, then set σ = λ and u = Av/σ. Then
Avk = σk uk , At uk = σk vk , k = 1, 2, . . . , r, (3.3.3)
and
Avk = 0, At uk = 0 for k > r.
The proof is very simple once we remember the rank of Q equals the
number of positive eigenvalues of Q. By the eigenvalue decomposition, there
is an orthonormal basis of the source space v1 , v2 , . . . and λ1 ≥ λ2 ≥ · · · ≥
λr > 0 such that √ Qvk = λk vk , k = 1, . . . , r, and Qvk = 0, k > r.
Setting σk = λk and uk = Avk /σk , k = 1, . . . , r, as in our first exam-
ple, we have (3.3.3), and, again as in our first example, u1 , u2 , . . . , ur are
orthonormal.
Assume A is N × d. Then the source space is Rd , and the target space
is RN . By construction, vr+1 , vr+2 , . . . , vd is an orthonormal basis for the
null space of A. Set u1 = Av1 /σ1 , u2 = Av2 /σ2 , . . . , ur = Avr /sr . Since
Avr+1 = 0, . . . , Avd = 0, u1 , u2 , . . . , ur is an orthonormal basis for the
column space of A.
Since the column space of A is the row space of At , the column space
of A is the orthogonal complement of the nullspace of At (2.7.6). Choose
ur+1 , ur+2 , . . . , uN any orthonormal basis for the nullspace of At . Then
{u1 , u2 , . . . , ur } and {ur+1 , ur+2 , . . . , uN } are orthogonal. From this, u1 , u2 ,
. . . , uN is an orthonormal basis for the target.
3.3. SINGULAR VALUE DECOMPOSITION 173
For our second example, let a and b be nonzero vectors, possibly of dif-
ferent sizes, and let A be the matrix
A = a ⊗ b, At = b ⊗ a.
Then
Av = (v · b)a = σu and At u = (u · a)b = σv.
Since the range of A equals span(a), the rank of A equals one.
Since σ > 0, v is a multiple of b and u is a multiple of a. If we write
v = tb and u = sa and plug in, we get
Thus there is only one singular value of A, equal to |a| |b|. This is not
surprising since the rank of A is one.
In a similar manner, one sees the only singular value of the 1 × n matrix
A = a equals σ = |a|.
Our third example is
0 0 0 0
1 0 0 0
A= 0 1 0 0 .
(3.3.4)
0 0 1 0
Then
0 1 0 0 1 0 0 0
0 0 1 0 0 1 0 0
At =
0
, Q = At A =
0 0 1 0 0 1 0
0 0 0 0 0 0 0 0
Since Q is diagonal symmetric, its rank is 3 and its eigenvalues are λ1 = 1,
λ2 = 1, λ3 = 1, λ4 = 0, and its eigenvectors are
1 0 0 0
0 1 0 0
v1 =
0 , v2 = 0 , v3 1 , v4 = 0 .
0 0 0 1
0 0 0 0 0 0
Here we have (N, d) = (6, 4), r = 3. In either case, S has the same shape
N × d as A.
Let U be the matrix with columns u1 , u2 , . . . , uN , and let V be the matrix
with rows v1 , v2 , . . . , vd . Then V t has columns v1 , v2 , . . . , vd .
Then U and V are orthogonal N × N and d × d matrices. By (3.3.1),
AV t = U S.
A = U SV.
Summarizing,
3.3. SINGULAR VALUE DECOMPOSITION 175
Diagonalization (SVD)
U, sigma, V = svd(A)
# sigma is a vector
print(U.shape,S.shape,V.shape)
print(U,S,V)
Given the relation between the singular values of A and the eigenvalues
of Q = At A, we also can conclude
# center dataset
m = mean(dataset,axis=0)
A = dataset - m
# rows of V are right
# singular vectors of A
V = svd(A)[2]
# columns of U are
# eigenvectors of Q
U = eigh(Q)[1]
# compare columns of U
# and rows of V
3.3. SINGULAR VALUE DECOMPOSITION 177
U, V
returns
0.36 −0.66 −0.58 0.32 0.36 −0.08 0.86 0.36
−0.08 −0.73 0.6 −0.32
, V = −0.66 −0.73 0.18 0.07
U =
0.86 0.18 0.07 −0.48 0.58 −0.6 −0.07 −0.55
0.36 0.07 0.55 0.75 0.32 −0.32 −0.48 0.75
This shows the columns of U are identical to the rows of V , except for the
third column of U , which is the negative of the third row of V .
Qvk = λk vk , k = 1, . . . , d.
Thus the principal components of a dataset are the right singular vectors
of the centered dataset matrix. This shows there are two approaches to the
principal components of a dataset: Either through EVD and eigenvectors
of the covariance matrix, or through SVD and right singular vectors of the
centered dataset matrix. We shall do both.
λ1 ≥ λ2 ≥ · · · ≥ λd ,
in PCA one takes the most significant components, those components who
eigenvalues are near the top eigenvalue. For example, one can take the top
two eigenvalues λ1 ≥ λ2 and their eigenvectors v1 , v2 , and project the dataset
onto the plane span(v1 , v2 ). The projected dataset can then be visualized
as points in the plane. Similarly, one can take the top three eigenvalues
λ1 ≥ λ2 ≥ λ3 and their eigenvectors v1 , v2 , v3 and project the dataset onto
the space span(v1 , v2 , v3 ). This can then be visualized as points in three
dimensions.
dataset = train_X.reshape((60000,784))
labels = train_y
Q = cov(dataset.T)
totvar = Q.trace()
# cumulative sums
sums = cumsum(percent)
data = array([percent,sums])
print(data.T[:20].round(decimals=3))
d = len(lamda)
3.4. PRINCIPAL COMPONENT ANALYSIS 181
stairs(percent,range(d+1))
The left column in Figure 3.12 lists the top twenty eigenvalues as a per-
centage of their sum. For example, the top eigenvalue λ1 is around 10% of
the total variance. The right column lists the cumulative sums of the eigen-
values, so the third entry in the right column is the sum of the top three
eigenvalues, λ1 + λ2 + λ3 = 22.97%.
This results in Figures 3.12 and 3.13. Here we sort the array eig in
decreasing order, then we cumsum the array to obtain the cumulative sums.
Because the rank of the MNIST dataset is 712 (§2.9), the bottom 72 =
784 − 712 eigenvalues are exactly zero. A full listing shows that many more
eigenvalues are near zero, and the second column in Figure 3.12 shows the
top ten eigenvalues alone sum to almost 50% of the total variance.
def pca(dataset,n):
Q = cov(dataset.T)
# columns of V are
# eigenvectors of Q
lamda, U = eigh(Q)
# decreasing eigenvalue sort
order = lamda.argsort()[::-1]
# sorted top n columns of U
# are cols of U
V = U[:,order[:n]]
P = dot(V,V.T)
return P
In the code, lamda is sorted in decreasing order, and the sorting order is
saved as order. To obtain the top n eigenvectors, we sort the first n columns
U[:,order[:n]] in the same order, resulting in the d × n matrix V . The
code then returns the projection matrix P = V V t (2.7.4).
Instead of working with the covariance Q, as discussed at the start of
the section, we can work directly with the dataset, using svd, to obtain the
eigenvectors.
# of dataset
def pca_with_svd(dataset,n):
# center dataset
m = mean(dataset,axis=0)
vectors = dataset - m
# rows of V are
# right singular vectors
V = svd(vectors)[2]
# no need to sort, already decreasing order
U = V[:n].T # top n rows as columns
P = dot(U,U.T)
return P
Let v = dataset[1] be the second image in the MNIST dataset, and let
Q be the covariance of the dataset. Then the code below returns the image
compressed down to n = 784, 600, 350, 150, 50, 10, 1 dimensions, returning
Figure 1.4.
figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4
If you run out of memory trying this code, cut down the dataset from
60,000 points to 10,000 points or fewer. The code works with pca or with
184 CHAPTER 3. PRINCIPAL COMPONENTS
pca_with_svd.
N = len(dataset)
n = 10
engine = PCA(n_components = n)
reduced = engine.fit_transform(dataset)
reduced.shape
and returns (N, n) = (60000, 10). The following code computes the projected
dataset
projected = engine.inverse_transform(reduced)
projected.shape
figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4
display_image(v,rows,cols,1)
Figure 3.14: Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
Now we project all vectors of the MNIST dataset onto two and three
dimensions, those corresponding to the top two or three eigenvalues. To
start, we compute reduced as above with n = 3, the top three components.
In the two-dimensional plotting code below, reduced is an array of shape
(60000,3), but we use only the top two components 0 and 1. When the
rows are plotted as a scatterplot, we obtain Figure 3.15. Note the rows are
plotted grouped by color, to match the legend, and each plot point’s color is
determined by the value of its label.
186 CHAPTER 3. PRINCIPAL COMPONENTS
grid()
legend(loc='upper right')
show()
grid()
legend(loc='upper right')
show()
%matplotlib notebook
from matplotlib.pyplot import *
from mpl_toolkits import mplot3d
P = axes(projection='3d')
P.set_axis_off()
legend(loc='upper right')
show()
The three dimensional plot of the complete MNIST dataset is Figure 1.5
in §1.2. The command %matplotlib notebook allows the figure to rotated
and scaled.
such that
3.5. CLUSTER ANALYSIS 189
def nearest_index(x,means):
i = 0
for j,m in enumerate(means):
n = means[i]
if norm(x - m) < norm(x - n): i = j
return i
def assign_clusters(dataset,means):
clusters = [ [ ] for m in means ]
for x in dataset:
i = nearest_index(x,means)
clusters[i].append(x)
return [ c for c in clusters if len(c) > 0 ]
def update_means(clusters):
return [ mean(c,axis=0) for c in clusters ]
d = 2
k,N = 7,100
190 CHAPTER 3. PRINCIPAL COMPONENTS
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
This code returns the size the clusters after each iteration. Here is code
that plots a cluster.
def plot_cluster(mean,cluster,color,marker):
for v in cluster:
scatter(v[0],v[1], s=50, c=color, marker=marker)
scatter(mean[0], mean[1], s=100, c=color, marker='*')
d = 2
k,N = 7,100
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
figure(figsize=(4,4))
grid()
for v in dataset: scatter(v[0],v[1],s=20,c='black')
show()
Counting
Some of the material in this chapter is first seen in high school. Because
repeating the exposure leads to a deeper understanding, we review it in a
manner useful to the later chapters.
Why are there six possibilities? Because they are three ways of choosing
193
194 CHAPTER 4. COUNTING
the first ball, then two ways of choosing the second ball, then one way of
choosing the third ball, so the total number of ways is
6 = 3 × 2 × 1.
n! = n × (n − 1) × (n − 2) × · · · × 2 × 1.
Notice also
(n + 1)! = (n + 1) × n × (n − 1) × · · · × 2 × 1 = (n + 1) × n!,
Permutations of n Objects
The number of ways of selecting n objects from a collection of n distinct
objects is n!.
We also have
1! = 1, 0! = 1.
It’s clear that 1! = 1. It’s less clear that 0! = 1, but it’s reasonable if you
think about it: The number of ways of selecting from zero balls results in
only one possibility — no balls.
More generally, we can consider the selection of k balls from a bag con-
taining n distinct balls. There are two varieties of selections that can be
made: Ordered selections and unordered selections. An ordered selection is
a permutation, and an unordered selection is a combination. In particular,
when k = n, n! is the number of ways of permuting n objects.
4.1. PERMUTATIONS AND COMBINATIONS 195
Notice P (x, k) is defined for any real number x by the same formula,
P (n, k) n!
C(n, k) = = .
k! (n − k)!k!
For example,
5×4
P (5, 2) = 5 × 4 = 20, C(5, 2) = = 10,
2×1
so we have twenty ordered pairs
(1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 3), (2, 4), (2, 5), (3, 1), (3, 2),
(3, 4), (3, 5), (4, 1), (4, 2), (4, 3), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4)
{1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}, {4, 5}.
1, 2, 3, . . . , n − 1, n,
n! < nn .
4.1. PERMUTATIONS AND COMBINATIONS 197
However, because half of the factors are less then n/2, we expect an approx-
imation smaller than nn , maybe something like (n/2)n or (n/3)n .
To be systematic about it, assume an approximation of the form1
n n
n! ∼ e , for n large, (4.1.1)
e
for some constant e. We seek the best constant e that fits here. In this
approximation, we multiply by e so that (4.1.1) is an equality when n = 1.
Using the binomial theorem, in §4.4 we show
n n n n
3 ≤ n! ≤ 2 , n ≥ 1. (4.1.2)
3 2
Based on this, a constant e satisfying (4.1.1) must lie between 2 and 3,
2 ≤ e ≤ 3.
To figure out the best constant e to pick, we see how much both sides
of (4.1.1) increase when we replace n by n + 1. Write (4.1.1) with n + 1
replacing n, obtaining
n+1
n+1
(n + 1)! ∼ e . (4.1.3)
e
Dividing the left sides of (4.1.1), (4.1.3) yields
(n + 1)!
= (n + 1).
n!
Dividing the right sides yields
n
e((n + 1)/e)n+1
1 1
= (n + 1) · · 1 + . (4.1.4)
e(n/e)n e n
To make these quotients match as closely as possible, we should choose
n
1
e∼ 1+ . (4.1.5)
n
Choosing n = 1, 2, 3, . . . , 100, . . . results in
4.2 Graphs
A graph consists of nodes and edges. For example, the graphs in Figure 4.2
each have four nodes and three edges. The left graph is directed, in that a
direction is specified for each edge. The graph on the right is undirected, no
direction is specified.
−3 7.4
2 0
Let wij be the weight on the edge (i, j) in a weighed directed graph. The
weight matrix of a weighed directed graph is the matrix W = (wij ).
If the graph is unweighed, then we set A = (aij ), where
(
1, if i and j adjacent,
aij = .
0, if not.
In this case, A consists of ones and zeros, and is called the adjacency matrix.
If the graph is also undirected, then the adjacency matrix is symmetric,
aij = aji .
When m = 0, there are no edges, and we say the graph is empty. When
m = n(n − 1)/2, there are the maximum number of edges, and we say the
graph is complete. The complete graph with n nodes is written Kn (Figure
4.5).
The cycle graph Cn with n nodes is as in Figure 4.5. The graph Cn has
n edges. The cycle graph C3 is a triangle.
d1 ≥ d2 ≥ d3 ≥ · · · ≥ dn
(d1 , d2 , d3 , . . . , dn )
Handshaking Lemma
If the order is n, the size is m, and the degrees are d1 , d2 , . . . , dn , then
n
X
d1 + d2 + · · · + dn = dk = 2m.
k=1
To see this, we consider two cases. First case, assume there are no isolated
nodes. Then the degree sequence is
n − 1 ≥ d1 ≥ d2 ≥ · · · ≥ dn ≥ 1.
n − 2 ≥ d1 ≥ d2 ≥ . . . dn−1 ≥ 1.
A graph is regular if all the node degrees are equal. If the node degrees are
all equal to k, we say the graph is k-regular. From the handshaking lemma,
for a k-regular graph, we have kn = 2m, so
1
m = kn.
2
For example, because 2m is even, there are no 3-regular graphs with 11 nodes.
Both Kn and Cn are regular, with Kn being (n − 1)-regular, and Cn being
2-regular.
A walk on a graph is a sequence of nodes v1 , v2 , v3 , . . . where each
consecutive pair vi , vi+1 of nodes are adjacent. For example, if v1 , v2 , v3 ,
v4 , v5 , v6 are the nodes (in any order) of the complete graph K6 , then v1 →
v2 → v3 → v4 → v2 is a walk. A path is a walk with no backtracking: A
path visits each node at most once. A closed walk is a walk that ends where
it starts. A cycle is a closed walk with no backtracking.
Two nodes a and b are connected if there is a walk starting at a and ending
at b. If a and b are connected, then there is a path starting at a and ending
at b, since we can cut out the cycles of the walk. A graph is connected if every
two nodes are connected. A graph is disconnected if it is not connected. For
4.2. GRAPHS 203
For example, the empty graph has adjacency matrix given by the zero matrix.
Since our graphs are undirected, the adjacency matrix is symmetric.
Let 1 be the vector 1 = (1, 1, 1, . . . , 1). The adjacency matrix of the
complete graph Kn is the n×n matrix A with all ones except on the diagonal.
If I is the n × n identity matrix, then this adjacency matrix is
A=1⊗1−I
Notice there are ones on the sub-diagonal, and ones on the super-diagonal,
and ones in the upper-right and lower-left corners.
204 CHAPTER 4. COUNTING
For any adjacency matrix A, the sum of each row is equal to the degree
of the node corresponding to that row. This is the same as saying
d1
d2
A1 = . . . .
dn
A1 = k1,
v1 = 1 ≥ |vj |, j = 2, 3, . . . , n.
Since the sum a11 + a12 + · · · + a1n equals the degree d1 of node 1, this implies
Top Eigenvalue
For a k-regular graph, k is the top eigenvalue of the adjacency matrix
A.
A1 = (1 · 1)1 − 1 = n1 − 1 = (n − 1)1,
2 cos(2πk/n), k = 0, 1, 2, . . . , n − 1,
Ā = A(Ḡ) = 1 ⊗ 1 − I − A(G).
Now aik akj is either 0 or 1, and equals 1 exactly if there is a 2-step path from
i to j. Hence
Notice a 2-step walk between i and j is the same as a 2-step path between i
and j.
When i = j, (A2 )ii is the number of 2-step paths connecting i and i,
which means number of edges. Since this counts edges twice, we have
1
trace(A2 ) = m = number of edges.
2
Similarly, (A3 )ij is the number of 3-step walks connecting i and j. Since
a 3-step walk from i to i is the same as a triangle, (A3 )ii is the number
of triangles in the graph passing through i. Since the trace is the sum of
the diagonal elements, trace(A3 ) counts the number of triangles. But this
overcounts by a factor of 3! = 6, since three labels may be rearranged in six
ways. Hence
1
trace(A3 ) = number of triangles.
6
4.2. GRAPHS 207
Connected Graph
Let A be the adjacency matrix. Then the graph is connected if for
every i ̸= j, there is a k with (Ak )ij > 0.
1 0 0 0 0 1 0 0
Hence P is orthogonal,
P P t = I, P −1 = P t .
4.2. GRAPHS 209
Using permutation matrices, we can say two graphs are isomorphic if their
adjacency matrices A, A′ satisfy
A′ = P AP −1 = P AP t
A graph is bipartite if the nodes can be divided into two groups, with
adjacency only between nodes across groups. If we call the two groups even
and odd, then odd nodes are never adjacent to odd nodes, and even nodes
are never adjacent to even nodes.
The complete bipartite graph is the bipartite graph with maximum num-
ber of edges: Every odd node is adjacent to every even node. The complete
bipartite graph with n odd nodes with m even nodes is written Knm . Then
the order of Kmn is n + m.
Let a = (1, 1, . . . , 1, 0, 0, . . . , 0) be the vector with n ones and m zeros,
and let b = 1 − a. Then b has n zeros and m ones, and the adjacency matrix
of Knm is
A = A(Knm ) = a ⊗ b + b ⊗ a.
210 CHAPTER 4. COUNTING
Recall we have
(a ⊗ b)v = (b · v)a.
From this, we see the column space of A = a⊗b+b⊗a is span(a, b). Thus the
rank of A is 2, and the nullspace of A consists of the orthogonal complement
span(a, b)⊥ of span(a, b). Using this, we compute the eigenvalues of A.
Since the nullspace is span(a, b)⊥ , any vector orthogonal to a and to b
is an eigenvector for λ = 0. Hence the eigenvalue λ = 0 has multiplicity
n + m − 2. Since trace(A) = 0, the sum of the eigenvalues is zero, and the
remaining two eigenvalues are ±λ ̸= 0.
Let v be an eigenvector for λ ̸= 0. Then v is orthogonal to the nullspace
of A, so v must be a linear combination of a and b, v = ra+sb. Since a·b = 0,
Aa = nb, Ab = ma.
Hence
λv = Av = A(ra + sb) = rnb + sma.
4.2. GRAPHS 211
Applying A again,
For
√ example, √
for the graph in Figure 4.8, the nonzero eigenvalues are λ =
± 3 × 5 = ± 15.
L = B t B.
Both the laplacian matrix and the adjacency matrix are n × n. What is the
connection between them?
Laplacian
The laplacian satisfies
L = D − A,
where D = diag(d1 , d2 , . . . , dn ) is the diagonal degree matrix.
212 CHAPTER 4. COUNTING
For example, for the cycle graph C6 , the degree matrix is 2I, and the
laplacian is the matrix we saw in §3.2,
2 −1 0 0 0 −1
−1 2 −1 0 0 0
0 −1 2 −1 0 0
L = Q(6) = .
0
0 −1 2 −1 0
0 0 0 −1 2 −1
−1 0 0 0 −1 2
Similarly,
Thus
(a + x)2 = a2 + 2ax + x2
(a + x)3 = a3 + 3a2 x + 3ax2 + x3
(4.3.4)
(a + x)4 = a4 + 4a3 x + 6a2 x2 + 4ax3 + x4
(a + x)5 = ⋆a5 + ⋆a4 x + ⋆a3 x2 + ⋆a2 x3 + ⋆ax4 + ⋆x5 .
and
3 3 3 3
= 1, = 3, = 3, =1
0 1 2 3
and
4 4 4 4 4
= 1, = 4, = 6, = 4, =1
0 1 2 3 4
214 CHAPTER 4. COUNTING
and
5 5 5 5 5 5
= ⋆, = ⋆, = ⋆, = ⋆, = ⋆, = ⋆.
0 1 2 3 4 5
With this notation, the number
n
(4.3.5)
k
is the coefficient of an−k xk when you multiply out (a + x)n . This is the
binomial coefficient. Here n is the degree of the binomial, and k, which
specifies the term in the resulting sum, varies from 0 to n (not 1 to n).
It is important to remember that, in this notation, the binomial (a + x)2
expands into the sum of three terms a2 , 2ax, x2 . These are term 0, term
1, and term 2. Alternatively, one says these are the zeroth term, the first
term, and the second term. Thus the second term in theexpansion of the
binomial (a + x)4 is 6a2 x2 , and the binomial coefficient 42 = 6. In general,
the binomial (a + x)n of degree n expands into a sum of n + 1 terms.
Since the binomial coefficient nk is the coefficient of an−k xk when you
Binomial Theorem
The binomial (a + x)n equals
n n n n−1 n n−2 2 n n−1 n n
a + a x+ a x + ··· + ax + x .
0 1 2 n−1 n
(4.3.6)
For example, the term 42 a2 x2 corresponds to choosing two a’s, and two x’s,
In Pascal’s triangle, the very top row has one number in it: This is the
zeroth row corresponding to n = 0 and the binomial expansion of (a+x)0 = 1.
The first row corresponds to n = 1; it contains the numbers (1, 1), which
correspond to the binomial expansion of (a + x)1 = 1a + 1x. We say the
zeroth entry (k = 0) in the first row (n = 1) is 1 and the first entry (k = 1)
in the first row is 1. Similarly, the zeroth entry (k = 0) in the second row
(n = 2) is 1, and the second entry (k = 2) in the second row (n = 2) is 1.
The second entry (k = 2) in the fourth row (n = 4) is 6. For every row, the
entries are counted starting from k = 0, and end with k = n, so there are
n + 1 entries in row n. With this understood, the k-th entry in the n-th row
is the binomial coefficient n-choose-k. So 10-choose-2 is
10
= 45.
2
216 CHAPTER 4. COUNTING
We can learn a lot about the binomial coefficients from this triangle.
First, we have 1’s all along the left edge. Next, we have 1’s all along the
right edge. Similarly, one step in from the left or right edge, we have the row
number. Thus we have
n n n n
=1= , =n= , n ≥ 1.
0 n 1 n−1
Note also Pascal’s triangle has a left-to-right symmetry: If you read off
the coefficients in a particular row, you can’t tell if you’re reading them from
left to right, or from right to left. It’s the same either way: The fifth row is
(1, 5, 10, 10, 5, 1). In terms of our notation, this is written
n n
= , 0 ≤ k ≤ n;
k n−k
Let’s work this out when n = 3. Then the left side is (a + x)4 . From (4.3.4),
we get
4 4 4 3 4 2 2 4 3 4 4
a + a x+ ax + ax + x
0 1 2 3 4
3 3 3 2 3 2 3 3
= (a + x) a + a x+ ax + x
0 1 2 3
3 4 3 3 3 2 2 3
= a + a x+ ax + ax3
0 1 2 3
3 3 3 2 2 3 3 3 4
+ a x+ ax + ax + x
0 1 2 3
3 4 3 3 3 3 3
= a + + a x+ + a2 x 2
0 1 0 2 1
3 3 3 3 4
+ + ax + x.
3 2 3
4.3. BINOMIAL THEOREM 217
This allows us to build Pascal’s triangle (Figure 4.9), where, apart from
the ones on either end, each term (“the child”) in a given row is the sum of
the two terms (“the parents”) located directly above in the previous row.
We conclude the sum of the binomial coefficients along the n-th row of Pas-
cal’s triangle is 2n (remember n starts from 0).
Now insert x = 1 and a = −1. You get
n n n n n
0= − + − ··· ± ± .
0 1 2 n−1 n
Hence: the alternating2 sum of the binomial coefficients along the n-th row
of Pascal’s triangle is zero.
We now show
2
Alternating means the plus-minus pattern + − + − + − . . . .
218 CHAPTER 4. COUNTING
Binomial Coefficient
The binomial coefficient nk equals C(n, k),
n n · (n − 1) · · · · · (n − k + 1) n!
= = , 1 ≤ k ≤ n.
k 1 · 2 · ··· · k k!(n − k)!
(4.3.10)
n! n!
C(n, k) + C(n, k − 1) = +
k!(n − k)! (k − 1)!(n − k + 1)!
n! 1 1
= +
(k − 1)!(n − k)! k n − k + 1
n!(n + 1)
=
(k − 1)!(n − k)!k(n − k + 1)
(n + 1)!
= = C(n + 1, k).
k!(n + 1 − k)!
The formula (4.3.10) is easy to remember: There are k terms in the numerator
as well as the denominator, the factors in the denominator increase starting
from 1, and the factors in the numerator decrease starting from n.
In Python, the code
4.4. EXPONENTIAL FUNCTION 219
comb(n,k)
comb(n,k,exact=True)
The binomial coefficient nk makes sense even for fractional n. This can
Rewriting this by pulling out the first two terms k = 0 and k = 1 leads to
n n
1 X 1 1 2 k−1
1+ =1+1+ 1− 1− ... 1 − . (4.4.1)
n k=2
k! n n n
220 CHAPTER 4. COUNTING
From (4.4.1), we can tell a lot. First, since all terms are positive, we see
n
1
1+ ≥ 2, n ≥ 1.
n
By (4.4.3), we arrive at
n
1
2≤ 1+ ≤ 3, n ≥ 1. (4.4.4)
n
Summarizing, we established the following strengthening of (4.1.5).
Euler’s Constant
The limit n
1
e = lim 1+ (4.4.5)
n→∞ n
exists and satisfies 2 ≤ e ≤ 3.
Since we’ve shown bn increases faster than an , and cn increases faster than
bn , we have derived (4.1.2).
To summarize,
Euler’s Constant
Euler’s constant satisfies
∞
X 1 1 1 1 1 1
e= =1+1+ + + + + + ...
k=0
k! 2 6 24 120 720
Depositing one dollar in a bank offering 100% interest returns two dollars
after one year. Depositing one dollar in a bank offering the same annual
interest compounded at mid-year returns
2
1
1+ = 2.25
2
Depositing one dollar in a bank offering the same annual interest com-
pounded at n intermediate time points returns (1 + 1/n)n dollars after one
year.
Passing to the limit, depositing one dollar in a bank and continuously
compounding at an annual interest rate of 100% returns e dollars after one
year. Because of this, (4.4.5) is often called the compound-interest formula.
Exponential Function
For any real number x, the limit
x n
exp x = lim 1+ (4.4.6)
n→∞ n
exists. In particular, exp 0 = 1 and exp 1 = e.
preceding one,
(1 − x) = 1−x
(1 − x)2 = 1 − 2x + x2 ≥ 1 − 2x
(1 − x)3 = (1 − x)(1 − x)2 ≥ (1 − x)(1 − 2x) = 1 − 3x + 2x2 ≥ 1 − 3x
(1 − x)4 = (1 − x)(1 − x)3 ≥ (1 − x)(1 − 3x) = 1 − 4x + 3x3 ≥ 1 − 4x
... ...
This shows the limit exp x in (4.4.6) is well-defined when x < 0, and
1
exp(−x) = , for all x.
exp x
4.4. EXPONENTIAL FUNCTION 225
Exponential Series
The exponential function is always positive and satisfies, for every real
number x,
∞
X xk x2 x3 x 4 x5 x6
exp x = =1+x+ + + + + + . . . (4.4.10)
k=0
k! 2 6 24 120 720
Law of Exponents
For real numbers x and y,
(a0 + a1 + a2 + a3 + . . . )(b0 + b1 + b2 + b3 + . . . )
Thus
∞
! ∞
! ∞ n
!
X X X X
ak bm = ak bn−k .
k=0 m=0 n=0 k=0
Now insert
xk y n−k
ak = , bn−k = .
k! (n − k)!
Then the n-th term in the resulting sum equals, by the binomial theorem,
n n n
X X xk y n−k 1 X n k n−k 1
ak bn−k = = x y = (x + y)n .
k=0 k=0
k! (n − k)! n! k=0 k n!
Thus
∞
! ∞
! ∞
X xk X ym X (x + y)n
exp x · exp y = = = exp(x + y).
k=0
k! m=0
m! n=0
n!
Exponential Notation
For any real number x,
ex = exp x.
Probability
[57, 49, 55, 44, 55, 50, 49, 50, 53, 49, 53, 50, 51, 53, 53, 54, 48, 51, 50, 53].
On the other hand, suppose someone else repeats the same experiment 20
times with a different coin, and obtains
[69, 70, 79, 74, 63, 70, 68, 71, 71, 73, 65, 63, 68, 71, 71, 64, 73, 70, 78, 67].
In this case, one suspects the two coins are statistically distinct, and have
different probabilities of obtaining heads.
In this section, we study how the probabilities of coin-tossing behave,
with the goal of answering the question: Is a given coin fair?
229
230 CHAPTER 5. PROBABILITY
P rob(X1 = 0 and X2 = 1) qp
P rob(X2 = 1 | X1 = 0) = = = p = P rob(X2 = 1),
P rob(X1 = 0) q
so
P rob(X2 = 1 | X1 = 0) = P rob(X2 = 1).
Thus X1 = 0 has no effect on the probability that X2 = 1, and similarly for
the other possibilities. This is often referred to as the independence of the
coin tosses. We conclude
Independent Coin-Tossing
P rob(Xn = 1) = p, P rob(Xn = 0) = q = 1 − p, n ≥ 1.
P (X = a) = p, P (X = b) = q, P (X = c) = r.
232 CHAPTER 5. PROBABILITY
E(X) = ap + bq + cr.
For example,
E(Xn ) = 1 · p + 0 · (1 − p) = p,
Let
Sn = X1 + X2 + · · · + Xn .
5.1. BINOMIAL PROBABILITY 233
Since Xk = 1 when the k-th toss is heads, and Xk = 0 when the k-th toss is
tails, Sn is the number of heads in n tosses.
The mean of Sn is
which is the same as saying P rob(r < p < r +dr) = dr. By (5.1.6), we obtain
Z 1
n k
P rob(Sn = k) = r (1 − r)n−k dr.
0 k
We now turn things around: Suppose we toss the coin n times, and obtain
k heads. How can we use this data to estimate the coin’s probability of heads
p?
To this end, we introduce the fundamental
Bayes Theorem
P rob(B | A) · P rob(A)
P rob(A | B) = . (5.1.8)
P rob(B)
236 CHAPTER 5. PROBABILITY
P rob(A and B)
P rob(A | B) =
P rob(B)
P rob(A and B) P rob(A)
= ·
P rob(A) P rob(B)
P rob(A)
= P rob(B | A) · .
P rob(B)
P rob(p = r)
P rob(p = r | Sn = k) = P rob(Sn = k | p = r) · . (5.1.9)
P rob(Sn = k)
Notice because of the extra factor (n + 1), this is not equal to (5.1.6).
In (5.1.6), p is fixed, and k is the variable. In (5.1.10), k is fixed, and r is
the variable. This a posteriori distribution for (n, k) = (10, 7) is plotted in
Figure 5.1. Notice this distribution is concentrated about k/n = 7/10 = .7.
5.1. BINOMIAL PROBABILITY 237
grid()
X = arange(0,1,.01)
plot(X,f(X),color="blue",linewidth=.5)
show()
Because Bayes Theorem is so useful, here are two alternate forms. First,
since
P rob(B | A) P rob(A)
P rob(A | B) = . (5.1.11)
P rob(B | A) P rob(A) + P rob(B | Ac ) P rob(Ac )
Let
1
p = σ(z) = . (5.1.12)
1 + e−z
This is the logistic function or sigmoid function (Figure 5.2). The logistic
function takes as inputs real numbers y, and returns as outputs probabilities
p (Figure 5.3). Think of the input z as an activation energy, and the output
p as the probability of activation. In Python, σ is the expit function.
p = expit(z)
5.1. BINOMIAL PROBABILITY 239
If the result is tails, select a point x at random with normal probability with
mean mT , or
2
P rob(x | T ) ∼ e−|x−mT | /2 .
This says the the groups are centered around the points mH and mT respec-
tively.
Given a point x, what is the probability x is in the heads group? In other
words, what is
P rob(H | x)?
This question is begging for Bayes theorem.
Let
1 1
w = mH − mT , w0 = − |mH |2 + |mT |2 .
2 2
Since P rob(H) = P rob(T ), here we have P rob(A) = P rob(Ac ). Inserting the
probabilities and simplifying leads to
P rob(x | H) P rob(H)
log = w · x + w0 . (5.1.14)
P rob(x | T ) P rob(T )
By (5.1.13), this leads to
P rob(H | x) = σ(w · x + w0 ).
240 CHAPTER 5. PROBABILITY
5.2 Probability
A probability is often described as
the extent to which an event is likely to occur, measured by the
ratio of the favorable outcomes to the whole number of outcomes
possible.
We explain what this means by describing the basic terminology:
• An experiment is a procedure that yields an outcome, out of a set of
possible outcomes. For example, tossing a coin is an experiment that
yields one of two outcomes, heads or tails, which we also write as 1 or
0. Rolling a six-sided die yields outcomes 1, 2, 3, 4, 5, 6. Rolling two
six-sided dice yields 36 outcomes (1, 1), (1, 2),. . . . Flipping a coin three
times yields 23 = 8 outcomes
or
000, 001, 010, 011, 100, 101, 110, 111.
(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1),
#(E) = 35, which is the number of ways you can choose three things
out of seven things:
7 7·6·5
#(E) = 7-choose-3 = = = 35.
3 1·2·3
1. 0 ≤ P rob(s) ≤ 1,
2. The sum of the probabilities of all outcomes equals one.
• Outcomes are equally likely when they have the same probability. When
this is so, we must have
#(E)
P rob(E) = .
#(S)
For example,
1. A coin is fair if the outcomes are equally likely. For one toss of a
fair coin, P rob(heads) = 1/2.
2. More generally, tossing a coin results in outcomes
P rob(head) = p, P rob(tail) = 1 − p,
p = .5
n = 10
N = 20
v = binomial(n,p,N)
print(v)
returns
[9 6 7 4 4 4 3 3 7 5 6 4 6 9 4 5 4 7 6 7]
p = .5
for n in [5,50,500]: print(binomial(n,p,1))
This returns the count of heads after 5 tosses, 50 tosses, and 500 tosses,
5.2. PROBABILITY 243
Figure 5.4: 100,000 sessions, with 5, 15, 50, and 500 tosses per session.
3, 28, 266
The proportions are the count divided by the total number of tosses in
the experiment. For the above three experiments, the proportions after 5
tosses, 50 tosses, and 500 tosses, are
Now we repeat each experiment 100,000 times and we plot the results in
a histogram.
N = 100000
p = .5
for n in [5,50,500]:
data = binomial(n,p,N)
hist(data,bins=n,edgecolor ='Black')
grid()
244 CHAPTER 5. PROBABILITY
show()
The takeaway from these graphs are the two fundamental results of prob-
ability:
2. Central Limit Theorem. For large sample size, the shape of the
graph of the proportions or counts is approximately normal. The nor-
mal distribution is studied in §5.4. Another way of saying this is: For
large sample size, the shape of the sample mean histogram is approxi-
mately normal.
The law of large numbers is qualitative and the central limit theorem
is quantitative. While the law of large numbers says one thing is close to
another, it does not say how close. The central limit theorem provides a
numerical measure of closeness, using the normal distribution.
Roll two six-sided dice. Let A be the event that at least one dice is an
even number, and let B be the event that the sum is 6. Then
A = {(2, ∗), (4, ∗), (6, ∗), (∗, 2), (∗, 4), (∗, 6)} .
B = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} .
The intersection of A and B is the event of outcomes in both events:
X x
for the chance that X lies in the interval [a, b], we are asking for P rob(a <
X < b). If we don’t know anything about X, then we can’t figure out the
probability, and there is nothing we can say. Knowing something about X
means knowing the distribution of X: Where X is more likely to be and
where X is less likely to be. In effect, a random variable is a quantity X
whose probabilities P rob(a < X < b) can be computed.
For example, take the Iris dataset and let X be the petal length of an iris
(Figure 5.6) selected at random. Here the number of samples is N = 150.
df = read_csv("iris.csv")
petal_length = df["Petal_length"].to_numpy()
N
1 X
m = E(X) = xk .
N k=1
N
2 1 X 2
E(X ) = x .
N k=1 k
In general, given any function f (x), we have the mean of f (x1 ), f (x2 ), . . . ,
f (xN ),
N
1 X
E(f (X)) = f (xk ). (5.3.1)
N k=1
If we let
(
1, 1 < x < 3,
f (x) =
0, otherwise,
N
1 X #{samples satisfying 1 < xk < 3}
E(f (X)) = f (xk ) = .
N k=1 N
But this is the probability that a randomly selected iris has petal length X
between 1 and 3,
P rob(1 < X < 3) = E(f (X)),
To see how the iris petal lengths are distributed, we plot a histogram,
grid()
hist(petal_length,bins=20)
show()
rng = default_rng()
# n = batch_size
250 CHAPTER 5. PROBABILITY
def random_batch_mean(n):
rng.shuffle(petal_length)
return mean(petal_length[:n])
random_batch_mean(5)
The five petal lengths are selected by first shuffling the petal lengths,
then selecting the first five petal_length[:5]. Now repeat this computation
100,000 times, for batch sizes 1, 5, 15, 50. The resulting histograms are in
Figure 5.8. Notice in the first subplot, the batch is size n = 1, so we recover
the base histogram Figure 5.7. Figure 5.8 is of course another illustration of
the central limit theorem.
N = 100000
5.3. RANDOM VARIABLES 251
for n in [1,5,15,50]:
Xbar = [ random_batch_mean(n) for _ in range(N)]
hist(Xbar,bins=50)
grid()
show()
P rob(X = 1) = p, P rob(X = 0) = 1 − p,
P rob(X = x) = px (1 − p)1−x , x = 0, 1.
p
1−p
0 1
5.10, when the probability P rob(a < X < b) is given by the green area in
Figure 5.10. Thus
0 a b 0 a b
Then the green areas in Figure 5.10 is the difference between two areas, hence
equal
cdfX (b) − cdfX (a).
For the bernoulli distribution in Figure 5.9, the cdf is in Figure 5.12.
Because the bernoulli random variable takes on only the values x = 0, 1,
these are the values where the cdf P rob(X ≤ x) jumps.
1
1−p
0 1
This is the population mean. It does not depend on a sampling of the popu-
lation.
For example, suppose the population consists of 100 balls, of which 30
are red, 20 are green, and 50 are blue. The cost of each ball is
$1, red,
X(ball) = $2, green,
$3, blue.
Then
#(red) 30
pred = P rob(red) = = = .3,
#(balls) 100
#(green) 20
pgreen = P rob(green) = = = .2,
#(balls) 100
#(blue) 50
pblue = P rob(blue) = = = .5.
#(balls) 100
Then the average cost of a ball equals
E(X) = pred · 1 + pgreen · 2 + pblue · 3
30 · 1 + 20 · 2 + 50 · 3 x1 + x2 + · · · + x100
= = .
100 100
254 CHAPTER 5. PROBABILITY
The variance is
V ar(X) = E (X − µ)2 = p1 (x1 − µ)2 + p2 (x2 − µ)2 + p3 (x3 − µ)2 + . . .
N
X
= pk (xk − µ)2 .
k=1
µ = E(X) = x1 p1 + x2 p2 = 1 · p + 0 · (1 − p) = p,
V ar(X) = σ 2 = E (X − µ)2
We conclude
E(X 2 ) = µ2 + σ 2 = (E(X))2 + V ar(X). (5.3.2)
Let X have mean µ and variance σ 2 , and write
X −µ
Z= .
σ
Then
1 E(X) − µ µ−µ
E(Z) = E(X − µ) = = = 0,
σ σ σ
and
1 σ2
E(Z 2 ) = E((X − µ) 2
) = = 1.
σ2 σ2
We conclude Z has mean zero and variance one.
A random variable is standard if its mean is zero and its variance is one.
The variable Z is the standardization of X. For example, the standardization
of the bernoulli random variable is
X −p
p .
p(1 − p)
p(1 − p)
0 1
t2 t3
et = 1 + t + + + ...
2! 3!
where t is any real number. The number e, Euler’s constant (§4.4), is ap-
proximately 2.7, as can be seen from
1 1 1 1
e = e1 = 1 + 1 + + + ··· = 1 + 1 + + + ...
2! 3! 2 6
Since X has real values, so does tX, so etX is also a random variable.
The moment generating function is the mean of etX ,
t2 t3
M (t) = MX (t) = E etX = 1 + tE(X) + E(X 2 ) + E(X 3 ) + . . .
2! 3!
For example, for the smartphone random variable X = 0, 1 with P rob(X =
1) = p, X 2 = X, X 3 = X, . . . , so
t2 t3 t2 t3
M (t) = 1 + tE(X) + E(X 2 ) + E(X 3 ) + · · · = 1 + tp + p + p + . . .
2! 3! 2! 3!
which equals
M (t) = (1 − p) + pet .
In §5.2, we discussed independence of events. Now we do the same for
random variables. Let X and Y be random variables. We say X and Y are
uncorrelated if the expectations multiply,
a,b,c = symbols('a,b,c')
eq1 = a + 2*b + c - 1
eq2 = a - b - (a-c)*(a+b)
solutions = solve([eq1,eq2],a,b)
print(solutions)
Since
12
1 X tk 1 e13t − et
MX+Y (t) = e = ,
12 k=1 12 et − 1
we obtain
1 e13t − et 1 e7t − et
= MY (t) · .
12 et − 1 6 et − 1
Factoring
we obtain
1
MY (t) = (e6t + 1).
2
This says
1 1
P rob(Y = 0) = , P rob(Y = 6) = ,
2 2
and all other probabilities are zero.
260 CHAPTER 5. PROBABILITY
E(X n ) = E(Y n ), n ≥ 1.
X1 , X2 , . . . , Xn x1 , x2 , . . . , xn
mean is
n
X1 + X 2 + · · · + Xn 1X
X̄ = = Xk .
n n k=1
Then
1 1 1
E(X̄) = E(X1 +X2 +· · ·+Xn ) = (E(X1 )+E(X2 )+· · ·+E(Xn )) = ·nµ = µ.
n n n
We conclude the mean of the sample mean equals the population mean.
Now let σ 2 be the common variance of X1 , X2 , . . . , Xn . Since σ 2 =
E(X 2 ) − E(X)2 , we have
E(Xk2 ) = µ2 + σ 2 .
When i ̸= j, by independence,
1 X
= E(Xi Xj )
n2 i,j
!
1 X X
= 2 E(Xi Xj ) + E(Xk2 )
n i̸=j k
1 2 2 2 1
= µ2 + σ 2 .
= 2
n(n − 1)µ + n(µ + σ )
n n
σ2
E(X̄) = µ and V ar(X̄) = .
n
Sn = X1 + X2 + · · · + Xn .
grid()
z = arange(mu-3*sdev,mu+3*sdev,.01)
0 a b
Then
X −µ
X ∼ N (µ, σ) ⇐⇒ Z= ∼ N (0, 1).
σ
A normal distribution is a standard normal distribution when µ = 0 and
σ = 1.
Sn = X1 + X2 + · · · + Xn
we expect the chance that Z < 0 should equal 1/2. In other words, because
of the symmetry of the curve, we expect to be 50% confident that Z < 0, or
0 is at the 50-th percentile level. So
p
p
z z
When
P rob(Z < z) = p,
we say z is the z-score z corresponding to the p-value p. Equivalently, we say
our confidence that Z < z is p, or the percentile of z equals 100p. In Python,
the relation between z and p (Figure 5.16) is specified by
p = Z.cdf(z)
z = Z.ppf(p)
ppf is the percentile point function, and cdf is the cumulative distribution
function.
In Figure 5.17, the red areas are the lower tail p-value P rob(Z < z), the
two-tail p-value P rob(|Z| > z), and the upper tail p-value P rob(Z > z).
−z 0 −z 0 z
0 z
and
P rob(|Z| < z) = P rob(−z < Z < z) = P rob(Z < z) − P rob(Z < z),
and
P rob(Z > z) = 1 − P rob(Z < z).
To go backward, suppose we are given P rob(|Z| < z) = p and we want
to compute the cutoff z. Then P rob(|Z| > z) = 1 − p, so P rob(Z > z) =
(1 − p)/2. This implies
In Python,
# p = P(|Z| < z)
z = Z.ppf((1+p)/2)
Now let’s zoom in closer to the graph and mark off 1, 2, 3 on the hor-
izontal axis to obtain specific colored areas as in Figure 5.18. These areas
are governed by the 68-95-99 rule (Table 5.19). Our confidence that |Z| < 1
equals the blue area 0.685, our confidence that |Z| < 2 equals the sum of
the blue plus green areas 0.955, and our confidence that |Z| < 3 equals the
sum of the blue plus green plus red areas 0.997. This is summarized in Table
5.19.
The possibility |Z| > 1 is called a 1-sigma event, |Z| > 2 a 2-sigma event,
and so on. So a 2-sigma event is 95.5% unlikely, or 4.5% likely. An event
is considered statistically significant if it’s a 2-sigma event or more. In other
words, something is significant if it’s unlikely. A six-sigma event |Z| > 6 is
2 in a billion. You want a plane crash to be six-sigma.
268 CHAPTER 5. PROBABILITY
−3 −2 −1 0 1 2 3
Figure 5.18: 68%, 95%, 99% confidence cutoffs for standard normal.
Figure 5.18 is not to scale, because a 1-sigma event should be where the
curve inflects from convex to concave (in the figure this happens closer to
2.7). Moreover, according to Table 5.19, the left-over white area should be
.03% (3 parts in 10,000), which is not what the figure suggests.
µ − 3σ µ−σ µ µ+σ µ + 3σ
a = Z.ppf(.15)
b = Z.ppf(.9)
Here are three examples. In the first example, suppose student grades are
normally distributed with mean 80 and variance 16. This says the average
of all grades is 80, and the SD is 4. If a grade is g, the standardized grade is
g − 80
z= .
4
5.4. NORMAL DISTRIBUTION 271
rng = default_rng()
x = 70
mean, sdev = 80, 4
p = Z(mean,sdev).cdf(x)
for n in range(2,200):
q = 1 - (1-p)**n
print(n, q)
Here is the code for computing tail probabilities for the sample mean X̄
drawn from a normally distributed population with mean µ and standard
deviation σ. When n = 1, this applies to a single normal random variable.
########################
# P-values
5.5. CHI-SQUARED DISTRIBUTION 273
########################
def pvalue(mean,sdev,n,xbar,type):
Xbar = Z(mean,sdev/sqrt(n))
if type == "lower-tail": p = Xbar.cdf(xbar)
elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
elif type == "two-tail": p = 2 *(1 - Xbar.cdf(abs(xbar)))
else:
print("What's the tail type?")
return
print("type: ",type)
print("mean,sdev,n,xbar: ",mean,sdev,n,xbar)
print("p-value: ",p)
z = sqrt(n) * (xbar - mean) / sdev
print("z-score: ",z)
type = "upper-tail"
mean = 80
sdev = 4
n = 1
xbar = 90
pvalue(mean,sdev,n,xbar,type)
1
MU (u) = E(euU ) = √ .
1 − 2u
Since ∞
1 uU
X un
√ =E e = E(U n ),
1 − 2u n=0
n!
comparing coefficients of un /n! shows
n n −1/2
E(U ) = (−2) n! , n = 0, 1, 2, . . . (5.5.1)
n
1
MU (t) = E(etU ) = .
(1 − 2t)d/2
Going back to the question posed at the beginning of the section, we have
X and Y independent standard normal and we want
d = 2
u = 1
U(d).cdf(u)
returns 0.39.
3
Geometrically, P rob(U < 1) is the probability that a normally distributed point is
inside the unit sphere in d-dimensional space.
5.5. CHI-SQUARED DISTRIBUTION 277
d
X
E(U 2 ) = E(Zk2 Zℓ2 )
k,ℓ=1
X d
X
= E(Zk2 )E(Zℓ2 ) + E(Zk4 )
k̸=ℓ k=1
2
= d(d − 1) · 1 + d · 3 = d + 2d.
Because
1 1 1
′ /2 = ,
d/2
(1 − 2t) (1 − 2t)d (1 − 2t)(d+d′ )/2
we obtain
X = (X1 , X2 , . . . , Xn )
in Rn .
If X is a random vector in Rd , its mean is the vector
Qij = E(Xi Xj ), 1 ≤ i, j ≤ d.
is never negative.
A random vector X is normal with mean µ and variance Q if for every
vector w, w · X is normal with mean w · µ and variance w · Qw.
Then µ is the mean of X, and Q is the variance X. The random vector
X is standard normal if µ = 0 and Q = I.
From §5.3, we see
Z = (Z1 , Z2 , . . . , Zd )
From this, X and Y are independent when B = 0. Thus, for normal random
vectors, independence and uncorrelatedness are the same.
and
Q+ = U S + U t .
If we set Y = U t X = (Y1 , Y2 , . . . , Yd ), then
so X · v = 0.
It is easy to check Q3 = Q and Q2 is symmetric, so (§2.3) Q+ = Q. Since
X · v = 0,
X · Q+ X = X · QX = X · (X − (v · X)v) = |X|2 .
We conclude
282 CHAPTER 5. PROBABILITY
Singular Chi-squared
We use the above to derive the distribution of the sample variance. Let
X1 , X2 , . . . , Xn be a random sample, and let X̄ be the sample mean,
X1 + X2 + · · · + X n
X̄ = .
n
Let S 2 be the sample variance,
(X1 − X̄)2 + (X2 − X̄)2 + · · · + (Xn − X̄)2
S2 = . (5.5.4)
n−1
Since (n − 1)S 2 is a sum-of-squares similar to (5.5.2), we expect (n − 1)S 2
to be chi-squared. In fact this is so, but the degree is n − 1, not n. We will
show
Now let
X = Z − (Z · v)v = (Z1 − Z̄, Z2 − Z̄, . . . .Zn − Z̄).
5.5. CHI-SQUARED DISTRIBUTION 283
E(X ⊗ X) = I − v ⊗ v.
Hence
(n − 1)S 2 = |X|2
is chi-squared with degree n − 1.
Now X and Z · v are uncorrelated, since
Statistics
6.1 Estimation
In statistics, like any science, we start with a guess or an assumption or
hypothesis, then we take a measurement, then we accept or modify our
guess/assumption based on the result of the measurement. This is common
sense, and applies to everything in life, not just statistics.
For example, suppose you see a sign on campus saying
There is a lecture in room B120.
How can you tell if this is true/correct or not? One approach is to go to
room B120 and look. Either there is a lecture or there isn’t. Problem solved.
But then someone might object, saying, wait, what if there is a lecture
in room B120 tomorrow? To address this, you go every day to room B120
and check, for 100 days. You find out that in 85 of the 100 days, there is a
lecture, and in 15 days, there is none. Based on this, you can say you are
85% confident there is a lecture there. Of course, you can never be sure, it
depends on which day you checked, you can only provide a confidence level.
Nevertheless, this kind of thinking allows us to quantify the probability that
our hypothesis is correct.
In general, the measurement is significant if it is unlikely. When we obtain
a significant measurement, then we are likely to reject our guess/assumption.
So
significance = 1 − confidence.
In practice, our guess/assumption allows us to calculate a p-value, which is
the probability that the measurement is not consistent with our assumption.
285
286 CHAPTER 6. STATISTICS
do not
reject H
p>α
hypothesis
sample p-value
H
p<α
reject H
Here is a geometric example. The null hypothesis and the alternate hy-
pothesis are
In §2.2, there is code (2.2) returning the angle angle(u,v) between two
vectors. To test this hypothesis, we run the code
6.1. ESTIMATION 287
N = 784
for _ in range(20):
u = randn(N)
v = randn(N)
print(angle(u,v))
86.27806537791886
87.91436653824776
93.00098725550777
92.73766421951748
90.005139015804
87.99643434444482
89.77813370637857
96.09801014394806
90.07032573539982
89.37679070400239
91.3405728939376
86.49851399221568
87.12755619082597
88.87980905998855
89.80377324818076
91.3006921339982
91.43977096117017
88.52516224405458
86.89606919838387
90.49100744167357
N = 784
for _ in range(20):
u = binomial(n,.5,N)
v = binomial(n,.5,N)
print(angle(u,v))
59.43464627897324
59.14345748418916
60.31453922165891
60.38024365702492
59.24709660805488
59.27165957992343
61.21424657806321
60.55756381536082
61.59468919876665
61.33296028237481
60.03925473033243
60.25732069941224
61.77018692842784
60.672901794058326
59.628519516164666
59.41272458020638
58.43172340007064
59.863796136907744
59.45156367988921
59.95835532791699
Here we see strong evidence that H0 is false, as the angles are now close to
60◦ .
6.1. ESTIMATION 289
The difference between the two scenarios is the distribution. In the first
scenario, we have randn(n): the components are distributed according to a
standard normal. In the second scenario, we have binomial(1,.5,N): the
components are distributed according to a fair coin toss. To see how the
distribution affects things, we bring in the law of large numbers, which is
discussed in §5.3.
Let X1 , X2 , . . . , Xn be a simple random sample from some population,
and let µ be the population mean. Recall this means X1 , X2 , . . . , Xn are
i.i.d. random variables, with µ = E(X). The sample mean is
X1 + X2 + · · · + X n
X̄ = .
n
Then we have the
We use the law of large numbers to explain the closeness of the vector
angles to specific values.
Assume u = (x1 , x2 , . . . , xn ), and v = (y1 , y2 , . . . , yn ) where all compo-
nents are selected independently of each other, and each is selected according
to the same distribution.
Let U = (X1 , X2 , . . . , Xn ), V = (Y1 , Y2 , . . . , Yn ), be the corresponding
random variables. Then X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Yn are independent
and identically distributed (i.i.d.), with population mean E(X) = E(Y ).
From this, X1 Y1 , X2 Y2 , . . . , Xn Yn are i.i.d. random variables with popu-
lation mean E(XY ). By the law of large numbers,1
X1 Y1 + X2 Y2 + · · · + Xn Yn
≈ E(XY ),
n
so
U · V = X1 Y1 + X2 Y2 + · · · + Xn Yn ≈ n E(XY ).
1
≈ means the ratio of the two sides approaches 1 as n grows without bound.
290 CHAPTER 6. STATISTICS
U ·V µ2
cos(θ) = p ≈ .
(U · U )(V · V ) µ2 + σ 2
µ2 p2
= = p.
µ2 + σ 2 p2 + p(1 − p)
µ2
cos(θ) is approximately .
µ2 + σ 2
6.2 Z-test
Suppose we want to estimate the proportion of American college students
who have a smart phone. Instead of asking every student, we take a sample
and make an estimate based on the sample.
6.2. Z-TEST 291
p = .7
n = 25
N = 1000
v = binomial(n,p,N)/n
hist(v,edgecolor ='Black')
show()
292 CHAPTER 6. STATISTICS
|p − X̄| < ϵ,
(L, U ) = (X̄ − ϵ, X̄ + ϵ)
is a confidence interval.
With the above setup, we have the population proportion p, and the four
sample characteristics
• sample size n
• margin of error ϵ,
• confidence level α.
Suppose we do not know p, but we know n and X̄. We say the margin of
error is ϵ, at confidence level α, if
L, U = X̄ − ϵ, X̄ + ϵ.
P rob(|Z| > z ∗ ) = α.
√
Let σ/ n be the standard error. By the central limit theorem,
!
|X̄ − p| z∗
α ≈ P rob p >√ .
p(1 − p) n
##########################
# Confidence Interval - Z
##########################

def confidence_interval(xbar,sdev,n,alpha,type):
    Xbar = Z(xbar,sdev/sqrt(n))
    if type == "two-tail":
        U = Xbar.ppf(1-alpha/2)
        L = Xbar.ppf(alpha/2)
    elif type == "upper-tail":
        U = Xbar.ppf(1-alpha)
        L = xbar
    elif type == "lower-tail":
        U = xbar
        L = Xbar.ppf(alpha)
    else: print("what's the test type?"); return
    return L, U
alpha = .02
sdev = 228
n = 35
xbar = 95
type = "two-tail"

L, U = confidence_interval(xbar,sdev,n,alpha,type)
Now we can answer the questions posed at the start of the section. Here
are the answers.
1. When n = 20, α = .95, and X̄ = .7, we have [L, U ] = [.5, .9], so ϵ = .2.
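This answer can be checked with the confidence_interval code above; here we
estimate the proportion's standard deviation by √(X̄(1 − X̄)) and take alpha = .05
for 95% confidence (these two conventions are assumptions, not from the text).

from numpy import sqrt

xbar, n, alpha = .7, 20, .05
L, U = confidence_interval(xbar, sqrt(xbar*(1-xbar)), n, alpha, "two-tail")
print(L, U)     # approximately (.5, .9)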
• H0 : µ = µ0
• Ha : µ ̸= µ0 or µ < µ0 or µ > µ0 .
• Ha : µ ̸= 0.
Here the significance level is α = .02 and µ0 = 0. To decide whether to
reject H0 or not, compute the standardized test statistic

    z = √n · (x̄ − µ0)/σ = 2.465.

Since z is a sample from an approximately normal distribution Z, the p-value
is Prob(|Z| > |z|).
Hypothesis Testing
There are three types of alternative hypotheses Ha :
µ < µ0 , µ > µ0 , µ ̸= µ0 .
In the Python code below, instead of working with the standardized statistic
Z, we work directly with X̄, which is normally distributed with mean µ0
and standard deviation σ/√n.
###################
# Hypothesis Z-test
###################

def ztest(mu0,sdev,n,xbar,type):
    # work directly with Xbar, normal with mean mu0 and sd sdev/sqrt(n)
    Xbar = Z(mu0,sdev/sqrt(n))
    if type == "two-tail": p = 2*Xbar.cdf(mu0 - abs(xbar - mu0))
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    elif type == "lower-tail": p = Xbar.cdf(xbar)
    else: print("what's the test type?"); return
    print("pvalue: ",p)
    if p < alpha: print("reject H0")
    else: print("do not reject H0")
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
sdev = 2
alpha = .01

ztest(mu0,sdev,n,xbar,type)
• Ha : µ > µ0
If a driver’s measured average speed is X̄ = 122, the above code rejects H0 .
This is consistent with the confidence interval cutoff we found above.
There are two types of possible errors we can make. A Type I error is
when H0 is true but we reject it, and a Type II error is when H0 is not true
but we fail to reject it.
                      H0 is true             H0 is false
do not reject H0      1 − α                  Type II error: β
reject H0             Type I error: α        Power: 1 − β
This calculation was for a two-tail test. When the test is upper-tail or
lower-tail, a similar calculation leads to the code
############################
# Type1 and Type2 errors - Z
############################

def type2_error(type,mu0,mu1,sdev,n,alpha):
    print("significance, mu0, mu1, sdev, n: ", alpha,mu0,mu1,sdev,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        zstar = Z.ppf(alpha)
        type2 = 1 - Z.cdf(delta + zstar)
    elif type == "upper-tail":
        zstar = Z.ppf(1-alpha)
        type2 = Z.cdf(delta + zstar)
    elif type == "two-tail":
        zstar = Z.ppf(1 - alpha/2)
        type2 = Z.cdf(delta + zstar) - Z.cdf(delta - zstar)
    else: print("what's the test type?"); return
    print("test type: ",type)
    print("zstar: ", zstar)
    print("delta: ", delta)
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)
mu0 = 120
mu1 = 122
sdev = 2
n = 10
alpha = .01
type = "upper-tail"
type2_error(type,mu0,mu1,sdev,n,alpha)
A type II error is when we do not reject the null hypothesis and yet it’s
false. The power of a test is the probability of rejecting the null hypothesis
when it’s false (Figure 6.3). If the probability of a type II error is β, then
the power is 1 − β.
Going back to the driving speed example, what is the chance that someone
driving at µ1 = 122 is not caught? This is a type II error, and its probability
is the type2 value computed by the code above.
6.3 T -test
Let X1 , X2 , . . . , Xn be a simple random sample from a population. We
repeat the analysis of the previous section when we know neither the population
mean µ nor the population variance σ². We only know the sample mean

    X̄ = (X1 + X2 + · · · + Xn)/n

and the sample variance

    S² = (1/(n − 1)) Σ_{k=1}^n (Xk − X̄)².
Here N is a constant to make the total area under the graph equal to one
(Figure 6.4). In other words, (6.3.1) is the pdf of the t-distribution.
When the interval [a, b] is not small, the correct formula is obtained by
integration, which means dividing [a, b] into many small intervals and sum-
ming. We will not use this density formula directly.
for d in [3,4,7]:
    t = arange(-3,3,.01)
    plot(t,T(d).pdf(t),label="d = "+str(d))

plot(t,Z.pdf(t),"--",label=r"d = $\infty$")
grid()
legend()
show()
    Xk = µ + σZk ,

so that

    √n · (X̄ − µ)/S = √n · Z̄ / √( (1/(n − 1)) Σ_{k=1}^n (Zk − Z̄)² ) = √n · Z̄ / √( U/(n − 1) ),

where U = Σ_{k=1}^n (Zk − Z̄)².
Using the last result with d = n − 1, we arrive at the main result in this
section.

² Geometrically, Prob(T > 1) is the probability that a normally distributed point is
inside the light cone in (d + 1)-dimensional spacetime.
##########################
# Confidence Interval - T
##########################

def confidence_interval(xbar,s,n,alpha,type):
    d = n-1
    if type == "two-tail":
        tstar = T(d).ppf(1-alpha/2)
        L = xbar - tstar * s / sqrt(n)
        U = xbar + tstar * s / sqrt(n)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        L = xbar
        U = xbar + tstar * s / sqrt(n)
    elif type == "lower-tail":
        tstar = T(d).ppf(alpha)
        L = xbar + tstar * s / sqrt(n)
        U = xbar
    else: print("what's the test type?"); return
    print("type: ",type)
    return L, U
n = 10
xbar = 120
s = 2
alpha = .01
type = "upper-tail"
print("significance, s, n, xbar: ", alpha,s,n,xbar)
L,U = confidence_interval(xbar,s,n,alpha,type)
print("lower, upper: ", L,U)
Going back to the driving speed example from §6.2, instead of assuming
the population standard deviation is σ = 2, we compute the sample standard
deviation and find it’s S = 2. Recomputing with T (9), instead of Z, we see
(L, U ) = (120, 121.78), so the cutoff now is µ∗ = 121.78, as opposed to
µ∗ = 121.47 there.
• H0 : µ = µ0
• Ha : µ ̸= µ0 .
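The ttest function called in the code below is not reproduced in this excerpt;
here is a minimal sketch, modeled on the Z-test code of §6.2, assuming T is the
scipy.stats t distribution and alpha is set globally as there.

from numpy import sqrt
from scipy.stats import t as T

def ttest(mu0,s,n,xbar,type):
    d = n-1
    tstat = sqrt(n)*(xbar - mu0)/s
    if type == "two-tail": p = 2*(1 - T(d).cdf(abs(tstat)))
    elif type == "upper-tail": p = 1 - T(d).cdf(tstat)
    elif type == "lower-tail": p = T(d).cdf(tstat)
    else: print("what's the test type?"); return
    print("pvalue: ",p)
    if p < alpha: print("reject H0")
    else: print("do not reject H0")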
###################
# Hypothesis T-test
###################

xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
s = 2
alpha = .01

ttest(mu0, s, n, xbar, type)
• H0 : µ = µ0
• Ha : µ > µ0
########################
# Type1 and Type2 errors
########################

def type2_error(type,mu0,mu1,s,n,alpha):
    d = n-1
    print("significance, mu0, mu1, n: ", alpha,mu0,mu1,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / s
    if type == "lower-tail":
        tstar = T(d).ppf(alpha)
        type2 = 1 - T(d).cdf(delta + tstar)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        type2 = T(d).cdf(delta + tstar)
    elif type == "two-tail":
        tstar = T(d).ppf(1 - alpha/2)
        type2 = T(d).cdf(delta + tstar) - T(d).cdf(delta - tstar)
    else: print("what's the test type?"); return
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)

type2_error(type,mu0,mu1,s,n,alpha)
Similarly,

    Var(SY²) = 2σY⁴/(m − 1).

Before, with a single mean, we used the result that

    T = Z / √(U/n)

is a t-distribution with degree n when U is chi-squared with degree n, Z is
N (0, 1), and Z and U are independent.
We apply the same result here, but we proceed more carefully. To
begin, X̄ and Ȳ are normal with means µX and µY and variances σ²/n and
σ²/m respectively. Hence

    ( (X̄ − Ȳ) − (µX − µY) ) / √( σ²/n + σ²/m ) ∼ N (0, 1).
Next,

    (n − 1)SX²/σ²    and    (m − 1)SY²/σ²

are chi-squared of degrees n − 1 and m − 1 respectively, so their sum

    (n − 1)SX²/σ² + (m − 1)SY²/σ²

is chi-squared with degree n + m − 2.
##################################
# Confidence Interval - Two means
##################################

import numpy as np
from scipy.stats import t
T = t

def confidence_interval(xbar,ybar,varx,vary,nx,ny,alpha):
    tstar = T.ppf(1-alpha/2, nx+ny-2)
    varp = (nx-1)*varx + (ny-1)*vary
    n = nx+ny-2
    varp = varp/n
    s_p = np.sqrt(varp)
    h = 1/nx + 1/ny
    L = xbar - ybar - tstar * s_p * np.sqrt(h)
    U = xbar - ybar + tstar * s_p * np.sqrt(h)
    return L, U
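For example, with hypothetical sample summaries (the numbers below are made up
for illustration, not taken from the text):

xbar, ybar = 10.3, 9.1       # sample means
varx, vary = 4.0, 5.0        # sample variances
nx, ny = 12, 15              # sample sizes
L, U = confidence_interval(xbar, ybar, varx, vary, nx, ny, .05)
print(L, U)                  # approximately (-0.5, 2.9)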
Now we turn to the question of what to do when the variances σX² and
σY² are not equal. In this case, by independence, the population variance of
X̄ − Ȳ is the sum of the population variances of X̄ and Ȳ, which is

    σB² = σX²/n + σY²/m.                                             (6.4.1)

Hence

    ( (X̄ − Ȳ) − (µX − µY) ) / √( σX²/n + σY²/m ) ∼ N (0, 1).

We want to replace the population variance (6.4.1) by the sample variance

    SB² = SX²/n + SY²/m.
Because SB2 is not a straight sum, but is a more complicated linear combina-
tion of variances, SB2 is not chi-squared.
Welch’s approximation is to assume it is chi-squared with degree r, and
to figure out the best r for this. More exactly, we seek the best choice of r
so that

    rSB²/σB² = rSB² / ( σX²/n + σY²/m )

is close to chi-squared with degree r. By construction, we multiplied SB² by
r/σB² so that its mean equals r,

    E( rSB²/σB² ) = (r/σB²) E(SB²) = r.

Since the variance of a chi-squared with degree r is 2r, we compute the
variance and set it equal to 2r,

    2r = Var( rSB²/σB² ) = (r²/(σB²)²) Var(SB²).                     (6.4.2)

By independence,

    Var(SB²) = Var(SX²/n) + Var(SY²/m) = (1/n²) Var(SX²) + (1/m²) Var(SY²).
But (n − 1)SX²/σX² and (m − 1)SY²/σY² are chi-squared, so

    Var(SB²) = 2σX⁴/(n²(n − 1)) + 2σY⁴/(m²(m − 1)).                  (6.4.3)

Combining (6.4.2) and (6.4.3), we arrive at Welch’s approximation for the
degrees of freedom,

    r = σB⁴ / ( σX⁴/(n²(n − 1)) + σY⁴/(m²(m − 1)) )
      = (σX²/n + σY²/m)² / ( σX⁴/(n²(n − 1)) + σY⁴/(m²(m − 1)) ).

In practice, this expression for r is never an integer, so one rounds it to the
closest integer, and the population variances σX² and σY² are replaced by the
sample variances SX² and SY².
We summarize the results.

Welch’s T-statistic

If we have independent simple random samples, then the statistic

    T = ( X̄ − Ȳ − (µX − µY) ) / √( SX²/n + SY²/m )

is approximately distributed according to a T-distribution with degrees
of freedom

    r = ( SX²/n + SY²/m )² / ( SX⁴/(n²(n − 1)) + SY⁴/(m²(m − 1)) ).
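A sketch of the corresponding code, computing Welch's degrees of freedom and the
resulting confidence interval (this function is an illustration consistent with
the box above, not code taken from the text):

import numpy as np
from scipy.stats import t

def welch_confidence_interval(xbar,ybar,varx,vary,nx,ny,alpha):
    sb2 = varx/nx + vary/ny
    # Welch's degrees of freedom, rounded to the closest integer
    r = round( sb2**2 / ( varx**2/(nx**2*(nx-1)) + vary**2/(ny**2*(ny-1)) ) )
    tstar = t(r).ppf(1-alpha/2)
    L = xbar - ybar - tstar*np.sqrt(sb2)
    U = xbar - ybar + tstar*np.sqrt(sb2)
    return L, U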
6.5 Variances
Let X1 , X2 , . . . , Xn be a normally distributed simple random sample with
mean 0 and variance 1.
Then we know

    U = X1² + X2² + · · · + Xn²

is chi-squared with degree n, and the chi-squared score χ²α,n is defined by

    Prob(U ≤ χ²α,n) = α.

More generally, when the population variance is σ², the sample variance satisfies

    (n − 1)S²/σ² ∼ χ²n−1 .

Let

    a = χ²α/2,n−1 ,    b = χ²1−α/2,n−1 .                             (6.5.1)

By definition of the score χ²α,n , we have

    Prob( a ≤ (n − 1)S²/σ² ≤ b ) = 1 − α,

which may be rewritten

    Prob( (n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a ) = 1 − α.
We conclude

Confidence Interval

A (1 − α)100% confidence interval for the population variance σ² is

    (n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a

where a and b are the χ²n−1 scores at significance α/2 and 1 − α/2.
##############################
# Confidence Interval - Chi2
##############################

from scipy.stats import chi2

def confidence_interval(s2,n,alpha):
    a = chi2.ppf(alpha/2,n-1)
    b = chi2.ppf(1-alpha/2,n-1)
    L = (n-1)*s2/b
    U = (n-1)*s2/a
    return L, U
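For example, with hypothetical inputs s2 = 4.2, n = 10, and alpha = .05 (these
values are assumptions chosen for illustration), the call

s2, n, alpha = 4.2, 10, .05
L, U = confidence_interval(s2, n, alpha)

returns approximately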
L, U = 1.99, 14.0
For hypothesis testing, given hypotheses

• H0 : σ = σ0
• Ha : σ ̸= σ0 ,

the standardized test statistic is (n − 1)S²/σ0², and one compares the p-value
of the standardized test statistic to the required significance score, whether
two-tail, upper-tail, or lower-tail.
Now we consider two populations with two variances. For this, we introduce
the F-distribution. If U1 , U2 are independent chi-squared distributions
with degrees n1 and n2 , then

    F = (U1/n1) / (U2/n2)

has the F-distribution with degrees n1 and n2 . In Python, its scores are

from scipy.stats import f

alpha = .05
a = f.ppf(alpha/2,dfn,dfd)
b = f.ppf(1-alpha/2,dfn,dfd)
Then

    Prob( aα < (SX²/SY²)(σY²/σX²) < bα ) = 1 − α,

which may be rewritten

    Prob( (1/bα)(SX²/SY²) < σX²/σY² < (1/aα)(SX²/SY²) ) = 1 − α.

Hence a (1 − α)100% confidence interval for the ratio σX/σY is (L, U) with

    L = (1/√bα)(SX/SY),    U = (1/√aα)(SX/SY).

L = 0.31389215230779993, U = 1.6621265193149342
    Xk = 0, 1, 2, . . . , d − 1.

For d = 2 categories,

    Z = √n · (X̄ − p) / √(p(1 − p))                                   (6.7.1)

is approximately standard normal for large enough sample size, and conse-
quently U = Z² is approximately chi-squared with degree one. Pearson’s test
generalizes this from d = 2 categories to d > 2 categories.
Given a category j, let #j denote the number of times Xk = j, 1 ≤ k ≤ n.
Then #j is the count that Xk = j, and p̂j = #j/n is the observed frequency,
in n samples. Let pj be the expected frequency. Then

    √n (p̂j − pj) = √n (#j/n − pj),    0 ≤ j < d,

are approximately normal for large n. Based on this, Pearson [22] showed
Goodness-Of-Fit Test

Let p̂ = (p̂1 , p̂2 , . . . , p̂d) be the observed frequencies and p =
(p1 , p2 , . . . , pd) the expected frequencies. Then, for large n,

    u = n Σ_{j=1}^d (p̂j − pj)²/pj                                    (6.7.2)

is approximately chi-squared with degree d − 1.
def goodness_of_fit(observed,expected):
    # assume len(observed) == len(expected)
    d = len(observed)
    n = sum(observed)
    u = sum([ (observed[i] - expected[i])**2/expected[i] for i in range(d) ])
    deg = d-1
    pvalue = 1 - U(deg).cdf(u)
    return pvalue
Suppose a dice is rolled n = 120 times, and the observed counts are
O1 , O2 , . . . , O6 . Notice

    O1 + O2 + O3 + O4 + O5 + O6 = 120.
d = 6
ustar = U(d-1).ppf(1-alpha)
Since this returns u∗ = 11.07 and u > u∗ , we can conclude the dice is not
fair.
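Since the observed counts themselves are not repeated here, the following
hypothetical counts (summing to 120, chosen only for illustration) show how
goodness_of_fit is called:

observed = [30, 14, 28, 10, 24, 14]     # hypothetical counts, sum = 120
expected = [20]*6                       # fair dice: 120/6 per face
pvalue = goodness_of_fit(observed, expected)
print(pvalue)     # well below alpha = .05, so this hypothetical dice is also judged unfair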
We now derive the goodness-of-fit test. For each category 0 ≤ j < d, let

    X̃kj = 1/√pj if Xk = j,    X̃kj = 0 if Xk ̸= j.

Then E(X̃kj) = √pj , and

    E(X̃ki X̃kj) = 1 if i = j,    E(X̃ki X̃kj) = 0 if i ̸= j.

If

    µ = (√p1 , √p2 , . . . , √pd)    and    X̃k = (X̃k1 , X̃k2 , . . . , X̃kd),

then

    E(X̃k) = µ,    E(X̃k ⊗ X̃k) = I.
From this,

    Var(X̃k) = E(X̃k ⊗ X̃k) − E(X̃k) ⊗ E(X̃k) = I − µ ⊗ µ.

From (5.3.8), we conclude the random vector

    Z = √n ( (1/n) Σ_{k=1}^n X̃k − µ )

has mean zero and variance I − µ ⊗ µ. By the central limit theorem, Z is
approximately normal for large n.
Since

    |µ|² = (√p1)² + (√p2)² + · · · + (√pd)² = p1 + p2 + · · · + pd = 1,

µ is a unit vector. By the singular chi-squared result in §5.5, |Z|² is approx-
imately chi-squared with degree d − 1. Using

    Zj = √n ( p̂j/√pj − √pj ),

we write |Z|² out,

    |Z|² = Σ_{j=1}^d Zj² = n Σ_{j=1}^d ( p̂j/√pj − √pj )² = n Σ_{j=1}^d (p̂j − pj)²/pj ,

obtaining (6.7.2).
Is a person’s gender correlated with their party affiliation, or are the two
variables independent? To answer this, we use the chi-squared independence
test. Suppose each sample is a pair (Xk , Yk), with Xk in one of d categories
and Yk in one of e categories, let p̂i and q̂j be the marginal frequencies of
Xk = i and of Yk = j, and let

    r̂ij = #{k : Xk = i, Yk = j} / n,    i = 1, 2, . . . , d,  j = 1, 2, . . . , e.
If X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Yn are independent, then, for large
n, the statistic

    n Σ_{i,j} (r̂ij − p̂i q̂j)² / (p̂i q̂j)                             (6.7.3)

is approximately chi-squared with degree (d − 1)(e − 1). Expanding the square,
(6.7.3) equals

    −n + n Σ_{i,j} (observed)² / expected,

where observed = r̂ij and expected = p̂i q̂j .
The code

def chi2_independence(table):
    observed = table
    n = sum(observed)
    d = len(observed)
    e = len(observed.T)
    p, q = sum(observed,axis=1)/n, sum(observed,axis=0)/n   # marginal frequencies
    expected = n*outer(p,q)
    u = sum((observed - expected)**2/expected)
    return 1 - U((d-1)*(e-1)).cdf(u)
table = array([[68,56,32],[52,72,20]])
chi2_independence(table)
returns a p-value of 0.0401, so, at the 5% significance level, the effects are
not independent.
Chapter 7

Calculus
7.1 Calculus
In this section, we focus on single-variable calculus, and in §7.3, we review
multi-variable calculus. Recall the slope of a line y = mx + b equals m.
Let y = f (x) be a function as in Figure 7.1, and let a be a fixed point. The
derivative of f (x) at the point a is the slope of the line tangent to the graph
of f (x) at a. Then the derivative at a point a is a number f ′ (a) possibly
depending on a.
Figure 7.1: The graph of y = f (x), with the point a marked on the x-axis.
Since the tangent line at a passes through the point (a, f (a)) and its slope
is f ′ (a), the tangent line is the graph of y = f (a) + f ′ (a)(x − a).
Using these properties, we determine the formula for f ′ (a). Suppose the
derivative is bounded between two extremes m and L at every point x in an
interval [a, b], say

    m ≤ f ′ (x) ≤ L,    a ≤ x ≤ b.

Then by A, the derivative of h(x) = f (x) − mx at x equals h′ (x) = f ′ (x) − m.
By assumption, h′ (x) ≥ 0 on [a, b], so, by B, h(b) ≥ h(a). Since h(a) =
f (a) − ma and h(b) = f (b) − mb, this leads to

    (f (b) − f (a)) / (b − a) ≥ m.

Repeating this same argument with f (x) − Lx, and using C, leads to

    (f (b) − f (a)) / (b − a) ≤ L.

We have shown

    m ≤ (f (b) − f (a)) / (b − a) ≤ L.                               (7.1.1)
Derivative Definition

    f ′ (a) = lim_{x→a} (f (x) − f (a)) / (x − a).                    (7.1.2)

    x −f→ u −g→ y
Using the chain rule, the power rule can be √derived for any rational number n,
2
positive or negative. For example,
√ since ( x) = x, we can write x = f (g(x))
2
with f (x) = x and g(x) = x. By the chain rule,
√ √
1 = (x)′ = f ′ (g(x))g ′ (x) = 2g(x)g ′ (x) = 2 x( x)′ .
√
Solving for ( x)′ yields
√ 1
( x)′ = √ ,
2 x
which is (7.1.3) with n = 1/2. In this generality, the variable x is restricted
to positive values only.
For example,
n!
(xn )′′ = (nxn−1 )′ = n(n − 1)xn−2 = xn−2 = P (n, 2)xn−2
(n − 2)!
We use the above to derive the Taylor series. Suppose f (x) is given by a
finite or infinite sum
f (x) = c0 + c1 x + c2 x2 + c3 x3 + . . . (7.1.4)
Then f (0) = c0 . Taking derivatives, by the sum, product, and power rules,
f ′ (x) = c1 + 2c2 x + 3c3 x2 + 4c4 x3 + . . .
f ′′ (x) = 2c2 + 3 · 2c3 x + 4 · 3c4 x2 + . . .
(7.1.5)
f ′′′ (x) = 3 · 2c3 + 4 · 3 · 2c4 x + . . .
f (4) (x) = 4 · 3 · 2c4 + . . .
Inserting x = 0, we obtain f ′ (0) = c1 , f ′′ (0) = 2c2 , f ′′′ (0) = 3 · 2c3 ,
f (4) (0) = 4 · 3 · 2c4 . This can be encapsulated by f (n) (0) = n!cn ,
n = 0, 1, 2, 3, 4, . . . , which is best written

    f (n) (0) / n! = cn ,    n ≥ 0.

Going back to (7.1.4), we derived the Taylor series

    f (x) = Σ_{n=0}^∞ ( f (n) (0) / n! ) xⁿ.

More generally, let a be a fixed point. Then any function f (x) can be
expanded in powers (x − a)ⁿ, and we have

    f (x) = Σ_{n=0}^∞ ( f (n) (a) / n! ) (x − a)ⁿ.
We review the derivative of sine and cosine. Recall the angle θ in radians
is the length of the subtended arc (in red) in Figure 7.3. Following the figure,
with P = (x, y), we have x = cos θ, y = sin θ. By the figure, the arclength θ
is greater than the diagonal, which in turn is greater than y. Moreover θ is
less than 1 − x + y, so

    y < θ < 1 − x + y.

Figure 7.3: The unit circle, with P = (x, y) = (cos θ, sin θ) and the arc of length θ.

Dividing by θ, this implies

    1 − (1 − cos θ)/θ < (sin θ)/θ < 1.                                (7.1.8)
From (1.5.5),

    sin(θ + ϕ) = sin θ cos ϕ + cos θ sin ϕ,

so

    lim_{ϕ→0} (sin(θ + ϕ) − sin θ)/ϕ = lim_{ϕ→0} ( sin θ · (cos ϕ − 1)/ϕ + cos θ · (sin ϕ)/ϕ ) = cos θ.

Thus the derivative of sine is cosine,

    (sin θ)′ = cos θ.

Similarly,

    (cos θ)′ = − sin θ.
Using the chain rule, we compute the derivative of the inverse arcsin x of
sin θ. Since

    θ = arcsin x  ⟺  x = sin θ,

we have

    1 = x′ = (sin θ)′ = θ′ · cos θ = θ′ · √(1 − x²),

or

    (arcsin x)′ = θ′ = 1/√(1 − x²).
We use this to compute the derivative of the arcsine law (3.2.13). With
x = √λ/2, by the chain rule,

    ( (2/π) arcsin(√λ/2) )′ = (2/π) · 1/√(1 − x²) · x′
                            = (2/π) · 1/√(1 − λ/4) · 1/(4√λ) = 1/( π√(λ(4 − λ)) ).       (7.1.9)
This shows the derivative of the arcsine law is the density in Figure 3.11.
For the parabola in Figure 7.4, y = x2 so, by the power rule, y ′ = 2x.
Since y ′ > 0 when x > 0 and y ′ < 0 when x < 0, this agrees with the
increase/decrease of the graph. In particular, the minimum of the parabola
occurs when y ′ = 0.
Figure 7.4 and Figure 7.5, with c = 1/√3.
Since

    (eˣ)⁽ⁿ⁾ = eˣ,    n ≥ 0,

writing the Taylor series centered at zero for the exponential function yields
the exponential series (4.4.10).
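As a quick numeric illustration of the exponential series, a few terms of the
partial sum already give exp(1) to several digits:

from math import exp, factorial

x, N = 1.0, 10
partial = sum(x**n/factorial(n) for n in range(N))
print(partial, exp(x))     # 2.71828152..., 2.71828182...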
For example, the function in Figure 7.6 is convex near x = a, and the
graph lies above its tangent line at a.
so g(x) is convex; hence g(x) lies above its tangent line at x = a. Since g(a) = 0
and g ′ (a) = 0, the tangent line is 0, and we conclude g(x) ≥ 0, which is the
left half of (7.1.10). Similarly, if f ′′ (x) ≤ L, then pL (x) − f (x) is convex,
leading to the right half of (7.1.10).
Figure 7.6: Tangent parabolas pm (x) (green), pL (x) (red), L > m > 0.
    t = (f ′ (b) − f ′ (a)) / (b − a)  ⟹  L ≥ t ≥ m,

which implies

    t² − (m + L)t + mL = (t − m)(t − L) ≤ 0.

This yields
y = log x ⇐⇒ x = ey .
log(ey ) = y, elog x = x.
From here, we see the logarithm is defined only for x > 0 and is strictly
increasing (Figure 7.7).
Since e0 = 1,
log 1 = 0.
Since e∞ = ∞ (Figure 4.10),
log ∞ = ∞.
log 0 = −∞.
We also see log x is negative when 0 < x < 1, and positive when x > 1.
Moreover, by the law of exponents,
ab = eb log a .
Then, by definition,

    log(a^b) = b log a,

and

    (a^b)^c = (e^{b log a})^c = e^{bc log a} = a^{bc}.
    x = e^y  ⟹  1 = x′ = (e^y)′ = e^y y′ = x y′ ,

so

    y = log x  ⟹  y′ = 1/x.

Derivative of the Logarithm

    y = log x  ⟹  y′ = 1/x.                                          (7.1.12)
For gradient descent, we need the relation between a convex function and
its dual. If f (x) is convex, its convex dual is

    g(p) = max_x ( px − f (x) ).

Below we see g(p) is also convex. This maximum may not always exist, but we will
work with cases where no problems arise.
Let q > 0. The simplest example is

    f (x) = (q/2) x²  ⟹  g(p) = (1/(2q)) p².
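This dual can be checked numerically by maximizing px − f (x) over a grid; the
values of q and p below are arbitrary choices for illustration.

import numpy as np

q, p = 3.0, 1.7
x = np.arange(-10, 10, .001)
g_numeric = np.max(p*x - 0.5*q*x**2)
print(g_numeric, p**2/(2*q))     # both approximately 0.4817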
For each p, the point x where px − f (x) equals the maximum g(p) (the
maximizer) depends on p. If we denote the maximizer by x = x(p), then

    0 = (px − f (x))′ = p − f ′ (x)    at x = x(p).

Hence

    g(p) = px − f (x)  ⟺  p = f ′ (x).

Also, by the chain rule, differentiating with respect to p,

    g ′ (p) = x(p) + (p − f ′ (x(p))) x′ (p) = x(p).

Thus f ′ (x) is the inverse function of g ′ (p). Since g(p) = px − f (x) is the
same as f (x) = px − g(p), we have

    f ′ (g ′ (p)) = p.

Differentiating again with respect to p,

    f ′′ (g ′ (p)) g ′′ (p) = 1.

We derived g ′′ (p) = 1/f ′′ (g ′ (p)): the second derivatives of dual convex
functions are reciprocals of each other. In particular, g ′′ (p) > 0, so g(p) is convex.
Notice the derivatives of σ and its inverse σ −1 are reciprocals. This result
holds in general, and is called the inverse function theorem.
The partition function is
Then Z ′ (z) = σ(z) and Z ′′ (z) = σ ′ (z) = σ(1 − σ) > 0. This shows Z(z) is
strictly convex.
The maximum
max(pz − Z(z))
z
which simplifies to I(p). Thus the convex dual of the partition function is the
information. The information is studied further in §7.2, and the multinomial
extension is in §7.6.
This makes sense because the binomial coefficient (n choose k) is defined for any
real number n (4.3.12), (4.3.13).
In summation notation,

    (a + x)ⁿ = Σ_{k=0}^∞ (n choose k) a^{n−k} xᵏ.                     (7.1.21)

The only difference between (4.3.7) and (7.1.21) is the upper limit of the
summation, which is set to infinity. When n is a whole number, by (4.3.10),
we have

    (n choose k) = 0,    for k > n,

so (7.1.21) is a sum of n + 1 terms, and equals (4.3.7) exactly. When n is not
a whole number, the sum (7.1.21) is an infinite sum.
Actually, in §5.5, we will need the special case a = 1, which we write in
slightly different notation,

    (1 + x)ᵖ = Σ_{n=0}^∞ (p choose n) xⁿ.                             (7.1.22)
f (x) = (a + x)n .
so

    f (k) (0)/k! = ( n(n − 1)(n − 2) · · · (n − k + 1) / k! ) a^{n−k} = (n choose k) a^{n−k}.

Writing out the Taylor series,

    (a + x)ⁿ = Σ_{k=0}^∞ ( f (k) (0)/k! ) xᵏ = Σ_{k=0}^∞ (n choose k) a^{n−k} xᵏ,

or
    (1 − t) log a + t log b ≤ log ((1 − t)a + tb),    0 ≤ t ≤ 1.

Since the inequality sign is reversed, this shows log x is concave.
Since x > 0, y ′′ < 0, which shows log x is in fact strictly concave everywhere
it is defined.
Since log x is strictly concave,
1
log = − log x
x
is strictly convex.
Thus H ′ (p) = 0 when p = 1/2, H ′ (p) > 0 on p < 1/2, and H ′ (p) < 0 on
p > 1/2. Since this implies H(p) is increasing on p < 1/2, and decreasing on
p > 1/2, p = 1/2 is a global maximum of the graph.
Notice as p increases, 1 − p decreases, so (1 − p)/p decreases. Since log is
increasing, as p increases, H ′ (p) decreases. Thus H(p) is concave.
Taking the second derivative, by the chain rule and the quotient rule,

    H ′′ (p) = ( log((1 − p)/p) )′ = −1/(p(1 − p)),
A crucial aspect of Figure 7.8 is its limiting values at the edges p = 0 and
p = 1,
H(0) = lim H(p) and H(1) = lim H(p).
p→0 p→1
= − lim 2p log(2p)
p→0
Then I ′ (p) is the inverse of the derivative σ(x) (7.1.16) of the dual Z(x)
(7.1.19) of I(p), as it should be (7.1.14).
Toss a coin n times, and let #n (p) be the number of outcomes where
the proportion of heads is p. Then we have the approximation
In more detail, using (4.1.6), one can derive the asymptotic equality

    #n (p) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p)},    for n large.        (7.2.5)
Figure 7.9 is returned by the code below, which compares both sides of
the asymptotic equality (7.2.5) for n = 10.
n = 10

def H(p): return - p*log(p) - (1-p)*log(1-p)

p = arange(.01,.99,.01)
grid()
plot(p, comb(n, n*p), label="binomial coefficient")
plot(p, exp(n*H(p))/sqrt(2*n*pi*p*(1-p)), label="entropy approximation")
title("number of tosses " + "$n=" + str(n) +"$", usetex=True)
legend()
show()
Then
I(q, q) = 0,
which agrees with our understanding that I(p, q) measures the difference in
information between p and q. Because I(p, q) is not symmetric in p, q, we
think of q as a base or reference probability, against which we compare p.
Equivalently, instead of measuring relative information, we can measure
the relative entropy,
H(p, q) = −I(p, q).
Since

    I(p, q) = −H(p) − p log(q) − (1 − p) log(1 − q)

and H(0) = 0 = H(1), I(p, q) is well-defined for p = 0 and p = 1. Moreover,

    d²/dp² I(p, q) = −H ′′ (p) = 1/(p(1 − p)),

    d²/dq² I(p, q) = p/q² + (1 − p)/(1 − q)²,

so I(p, q) is convex in p and in q separately.
In more detail, using (4.1.6), one can derive the asymptotic equality

    Pn (p, q) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p,q)},    for n large.    (7.2.7)
The law of large numbers (§6.1) states that the proportion of heads equals
approximately q for large n. Therefore, when p ̸= q, we expect the probability
that the proportion of heads equals p to become successively smaller as
n gets larger, and in fact to vanish when n = ∞. Since H(p, q) < 0 when p ̸= q,
(7.2.7) implies this is so. Thus (7.2.7) may be viewed as a quantitative
strengthening of the law of large numbers, in the setting of coin-tossing.
The partial derivative in the k-th direction is just the one-dimensional deriva-
tive considering xk as the independent variable, with all other xj ’s constants.
Below we exhibit the multi-variable chain rule in two ways. The first in-
terpretation is geometric, and involves motion in time and directional deriva-
tives. This interpretation is relevant to gradient descent, §8.3.
The second interpretation is combinatorial, and involves repeated com-
positions of functions. This interpretation is relevant to computing gradients
in networks, specifically backpropagation §7.4, §8.2.
These two interpretations work together when training neural networks,
§8.4.
For the first interpretation of the chain rule, suppose the components x1 ,
x2 , . . . , xd are functions of a single variable t (usually time), so we have
x1 = x1 (t), x2 = x2 (t), ..., xd = xd (t).
Inserting these into f (x1 , x2 , . . . , xd ), we obtain a function
f (t) = f (x1 (t), x2 (t), . . . , xd (t))
of a single variable t. Then we have
The Rd -valued function x(t) = (x1 (t), x2 (t), . . . , xd (t)) represents a curve
or path in Rd , and the vector
    d/dt|_{t=0} f (x + tv) = ∇f (x) · v.                              (7.3.3)

    d/ds|_{s=0} f (W + sV ) = trace(V t G),    for all V.             (7.3.4)
and similarly,

    dy/ds = dy/dt = −0.90.

By the chain rule,

    dy/dx = (dy/dr)·(dr/dx) + (dy/ds)·(ds/dx) + (dy/dt)·(dt/dx).

By (7.1.17), s′ = s(1 − s) = 0.22, so

    dr/dx = cos x = 0.71,    ds/dx = s(1 − s) = 0.22,    dt/dx = 2x = 1.57.

We obtain

    dy/dx = −0.90 ∗ 0.71 − 0.90 ∗ 0.22 − 0.90 ∗ 1.57 = −2.25.
The chain rule is discussed in further detail in §7.4.
∇f (x∗ ) = 0.
Quadratic Convexity

Let Q be a symmetric matrix and b a vector. The quadratic function

    f (x) = (1/2) x · Qx − b · x

has gradient

    ∇f (x) = Qx − b.                                                  (7.3.7)

Moreover f (x) is convex everywhere exactly when Q is a covariance
matrix, Q ≥ 0.
By (2.2.2),
Dv f (x) = ∇f (x) · v = |∇f (x)| |v| cos θ,
where θ is the angle between the vector v and the gradient vector ∇f (x).
Since −1 ≤ cos θ ≤ 1, we conclude
The derivatives are taken with respect to the outputs at each node of the
graph. In §8.2, we consider a third case, and compute outputs and derivatives
on a neural network.
To compute node outputs, we do forward propagation. To compute
derivatives, we do back propagation. Corresponding to the three cases, we
will code three versions of forward and back propagation. In all cases, back
propagation depends on the chain rule.
The chain rule (§7.1) states
dy dy dr
r = f (x), y = g(r) =⇒ = · .
dx dr dx
In this section, we work out the implications of the chain rule on repeated
compositions of functions.
Suppose

    r = f (x) = sin x,
    s = g(r) = 1/(1 + e⁻ʳ),
    y = h(s) = s².

Figure 7.12: The chain x →f r →g s →h y.
The chain in Figure 7.12 has four nodes and four edges. The outputs at
the nodes are x, r, s, y. Start with output x = π/4. Evaluating the functions
in order,
Notice these values are evaluated in the forward direction: x then r then s
then y. This is forward propagation.
Now we evaluate the derivatives of the output y with respect to x, r, s,

    dy/dx,    dy/dr,    dy/ds.
func_chain = [f,g,h]
der_chain = [df,dg,dh]
Then we evaluate the output vector x = (x, r, s, y), leading to the first
version of forward propagation,
def forward_prop(x_in,func_chain):
    x = [x_in]
    while func_chain:
        f = func_chain.pop(0)   # first func
        x_out = f(x_in)
        x.append(x_out)         # insert at end
        x_in = x_out
    return x
# dy/dy = 1
delta_out = 1
def backward_prop(delta_out,x,der_chain):
    delta = [delta_out]
    while der_chain:
        # discard last output
        x.pop(-1)
        df = der_chain.pop(-1)   # last der
        der = df(x[-1])
        # chain rule -- multiply by previous der
        der = der * delta[0]
        delta.insert(0,der)      # insert at start
    return delta
delta = backward_prop(delta_out,x,der_chain)
d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1
x = forward_prop(x_in,func_chain)
delta = backward_prop(delta_out,x,der_chain)
Now we work with the network in Figure 7.13, using the multi-variable
chain rule (§7.3). The functions are
a = f (x, y) = x + y,
b = g(y, z) = max(y, z),
J = h(a, b) = ab.
The composite function is
J = (x + y) max(y, z),
Figure 7.13: A network where x and y feed into +, y and z feed into max,
and the outputs a and b feed into ∗, producing J.
Here there are three input nodes x, y, z, and three hidden nodes +, max,
∗. Starting with inputs (x, y, z) = (1, 2, 0), and plugging in, we obtain node
outputs
(x, y, z, a, b, J) = (1, 2, 0, 3, 2, 6)
(Figure 7.15). This is forward propagation.
Then

    ∂a/∂x = 1,    ∂a/∂y = 1.
Figure 7.14: The two regions of the max node: when y > z, max(y, z) = y
and ∂g/∂y = 1, ∂g/∂z = 0; when y < z, max(y, z) = z and ∂g/∂y = 0,
∂g/∂z = 1.
Let

    1(y > z) = 1 if y > z,    1(y > z) = 0 if y < z.

By Figure 7.14, since y = 2 and z = 0,

    ∂b/∂y = 1(y > z) = 1,    ∂b/∂z = 1(z > y) = 0.
By the chain rule,

    ∂J/∂x = (∂J/∂a)(∂a/∂x) = 2 ∗ 1 = 2,
    ∂J/∂y = (∂J/∂a)(∂a/∂y) + (∂J/∂b)(∂b/∂y) = 2 ∗ 1 + 3 ∗ 1 = 5,
    ∂J/∂z = (∂J/∂b)(∂b/∂z) = 3 ∗ 0 = 0.

Hence we have

    ( ∂J/∂x, ∂J/∂y, ∂J/∂z, ∂J/∂a, ∂J/∂b, ∂J/∂J ) = (2, 5, 0, 2, 3, 1).
The outputs (blue) and the derivatives (red) are displayed in Figure 7.15.
d = 6
w = [ [None]*d for _ in range(d) ]
w[0][3] = w[1][3] = w[1][4] = w[2][4] = w[3][5] = w[4][5] = 1
More generally, in a weighed directed graph, the weights wij are numeric
scalars. In this case, for each node j, let

    x−j = (w1j x1 , w2j x2 , . . . , wdj xd).                         (7.4.1)

Then x−j is the list of node signals, each weighed accordingly. If (i, j) is
not an edge, then wij = 0, so xi does not appear in x−j: in other words, x−j
does not depend on xi. For example, if the incoming edges at node 5 are
(1, 5), (7, 5), (2, 5), then

    x5 = f5 (x−5) = f5 (w15 x1 , w75 x7 , w25 x2).
and
f4 (x, y) = x + y, f5 (y, z) = max(y, z), J(a, b) = ab.
Note there is nothing incoming at the input nodes, so there is no point
defining f1 , f2 , f3 .
activate = [None]*d

def incoming(x,w,j):
    return [ outgoing(x,w,i) * w[i][j] if w[i][j] else 0 for i in range(d) ]

def outgoing(x,w,j):
    if x[j] != None: return x[j]
    else: return activate[j](*incoming(x,w,j))
Let xin be the outgoing vector over the input nodes. If there are m input
nodes, and d nodes in total, then the length of xin is m, and the length of x
is d. In the example above, xin = (x, y, z).
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
    d = len(w)
    x = [None]*d
    m = len(x_in)
    x[:m] = x_in
    for j in range(m,d): x[j] = outgoing(x,w,j)
    return x
For this code to work, we assume there are no cycles in the graph: All
backward paths end at inputs.
Let xout be the output nodes. For Figure 7.13, this means xout = (J).
Then by forward propagation, J is also a function of all node outputs. For
Figure 7.13, this means J is a function of x, y, z, a, b.
Therefore, at each node i, we have the derivatives

    δi = ∂J/∂xi ,    i = 1, 2, . . . , d.
Then δ = (δ1 , δ2 , . . . , δd ) is the gradient vector. We first compute the deriva-
tives of J with respect to the output nodes xout , and we assume these deriva-
tives are assembled into a vector δout .
In Figure 7.13, there is one output node J, and

    δJ = ∂J/∂J = 1.
Hence δout = (1).
We assume the nodes are ordered so that the terminal portion of x equals
xout and the terminal portion of δ equals δout ,
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
    ∂J/∂xi = Σ_{i→j} (∂J/∂xj) · (∂xj/∂xi) = Σ_{i→j} (∂J/∂xj) · (∂fj/∂xi) · wij ,

so

    δi = Σ_{i→j} δj · gij · wij .
The code is
def derivative(x,delta,g,i):
    if delta[i] != None: return delta[i]
    else:
        return sum([ derivative(x,delta,g,j) * g[i][j](*incoming(x,g,j)) * w[i][j]
                     if g[i][j] != None else 0 for j in range(d) ])

def backward_prop(x,delta_out,g):
    d = len(g)
    delta = [None]*d
    m = len(delta_out)
    delta[d-m:] = delta_out
    for i in range(d-m): delta[i] = derivative(x,delta,g,i)
    return delta
Here we write the sublevel set of level 1. One can have sublevel sets corre-
sponding to any level c, f (x) ≤ c. For example, in Figure 7.16, the (blue)
interior of the square, together with the square itself, is a sublevel set. Sim-
ilarly, the interior of the ellipse, together with the ellipse itself, is a sublevel
set. The interiors of the ellipsoids, together with the ellipsoids themselves,
in Figure 7.22 are sublevel sets. Note we always consider the level set to be
part of the sublevel set.
The level set f (x) = 1 is the boundary of the sublevel set f (x) ≤ 1. Thus
the square and the ellipse in Figure 7.16 are boundaries of their respective
sublevel sets, and the covariance ellipsoid x · Qx = 1 is the boundary of the
sublevel set x · Qx ≤ 1.
A scalar function f (x) is convex if¹ for any two points x0 and x1 in Rd ,

    f ((1 − t)x0 + tx1) ≤ (1 − t)f (x0) + tf (x1),    0 ≤ t ≤ 1.      (7.5.1)

This says the line segment joining any two points (x0 , f (x0)) and (x1 , f (x1))
on the graph of f (x) lies above the graph of f (x). For example, in two
dimensions, the function f (x) = f (x1 , x2) = x1² + x2²/4 is convex because its
graph is the paraboloid in Figure 7.19.
More generally, given points x1 , x2 , . . . , xN , a linear combination

    t1 x1 + t2 x2 + · · · + tN xN

is a convex combination if the coefficients are nonnegative and sum to one.

¹ We only consider convex functions that are continuous.
Figure 7.19: Convex: The line segment lies above the graph.
Quadratic is Convex

If Q is a nonnegative matrix and b is a vector, then

    f (x) = (1/2) x · Qx − b · x

is convex.
This was derived in the previous section, but here we present a more
geometric proof.
To derive this result, let x0 and x1 be any points, and let v = x1 − x0 .
Then x0 + tv = (1 − t)x0 + tx1 and x1 = x0 + v. Let g0 = Qx0 − b. By (7.3.6),

    f (x0 + tv) = f (x0) + tv · (Qx0 − b) + (t²/2) v · Qv = f (x0) + tv · g0 + (t²/2) v · Qv.   (7.5.3)

Inserting t = 1 in (7.5.3), we have f (x1) = f (x0) + v · g0 + v · Qv/2. Since
t² ≤ t for 0 ≤ t ≤ 1 and v · Qv ≥ 0, by (7.5.3),

    f ((1 − t)x0 + tx1) = f (x0 + tv)
                        ≤ f (x0) + tv · g0 + (t/2) v · Qv
                        = (1 − t)f (x0) + t( f (x0) + v · g0 + (1/2) v · Qv )
                        = (1 − t)f (x0) + tf (x1).

When Q is invertible, then v · Qv > 0, and we have strict convexity.
from numpy.random import default_rng
from scipy.spatial import ConvexHull

rng = default_rng()
points = rng.random((30,2))   # random points in the plane (the number 30 is an assumption)
hull = ConvexHull(points)
facet = hull.simplices[0]
plot(points[facet, 0], points[facet, 1], 'r--')
grid()
show()
If f (x) is a function, its graph is the set of points (x, y) in Rd+1 satisfying
y = f (x), and its epigraph is the set of points (x, y) satisfying y ≥ f (x).
If f (x) is defined on Rd , its sublevel sets are in Rd , and its epigraph is in
Rd+1 . Then f (x) is a convex function exactly when its epigraph is a convex
set (Figure 7.19). From convex functions, there are other ways to get convex
sets:
E: f (x) ≤ 1
is a convex set.
H: n · (x − x0 ) = 0. (7.5.4)
n · (x − x0 ) < 0 n · (x − x0 ) = 0 n · (x − x0 ) > 0.
Separating Hyperplane
Let E be a convex set and let x∗ be a point not in E. Then there is a
hyperplane separating x∗ and E: For some x0 in E and nonzero n,
Expanding, we have
0 ≤ 2(x0 − x∗ ) · v + t|v|2 , 0 ≤ t ≤ 1.
Since this is true for small positive t, sending t → 0, results in v·(x0 −x∗ ) ≥ 0.
Setting n = x∗ − x0 , we obtain
x in E =⇒ (x − x0 ) · n ≤ 0. (7.5.6)
where the minimum is taken over all vectors x. A minimizer is the location of
the bottom of the graph of the function. For example, the parabola (Figure
7.4) and the relative information (Figure 7.10) both have global minimizers.
We say a function f (x) is strictly convex if g(t) = f (a + tv) is strictly
convex for every point a and direction v. This is the same as saying the
inequality (7.5.1) is strict for 0 < t < 1.
We say a function f (x) is proper if the sublevel set f (x) ≤ c is bounded
for every level c. Before we state this precisely, we contrast a level versus a
bound.
Let f (x) be a function. A level is a scalar c determining a sublevel set
f (x) ≤ c. A bound is a scalar C determining a bounded set |x|2 ≤ C.
To see this, suppose f (x) is not proper. In this case, by (7.5.9), there
would be a level c and a sequence x1 , x2 , . . . in the row space of A satisfying
|xn | → ∞ and f (xn ) ≤ c for n ≥ 1.
Let x′n = xn /|xn |. Then the x′n are unit vectors in the row space of A, hence
a bounded sequence. From §A.2, this implies x′n subconverges to some unit
vector x′ in the row space of A. Since f (xn) ≤ c,

    |Ax′n | = (1/|xn |)|Axn | ≤ (1/|xn |)(|Axn − b| + |b|) ≤ (1/|xn |)(√c + |b|).
Properness of Residual

When the N × d matrix A has rank d, the residual

    f (x) = |Ax − b|²

is proper on Rd .
To see this, pick any point a. Then, by properness, the sublevel set S
given by f (x) ≤ f (a) is bounded. By continuity of f (x), there is a minimizer
x∗ (see §A.2). Since for all x outside the sublevel set, we have f (x) > f (a),
x∗ is a global minimizer.
    f (x2) < (1/2)( f (x∗) + f (x1) ) = f (x∗),

contradicting the fact that x∗ is a global minimizer. Thus there cannot be
another global minimizer.
As a consequence,
Let a be any point, and v any direction, and let g(t) = f (a + tv). Then
g ′ (0) = ∇f (a) · v.
    ∂²f /∂xi ∂xj ,    1 ≤ i, j ≤ d,

    d/dt f (x + tv) = ∇f (x + tv) · v.

    d²/dt²|_{t=0} f (x + tv) = v · Qv.                                (7.5.15)
This implies
m L
|x − a|2 ≤ f (x) − f (a) − ∇f (a) · (x − a) ≤ |x − a|2 . (7.5.16)
2 2
m L
|x − x∗ |2 ≤ f (x) − f (x∗ ) ≤ |x − x∗ |2 . (7.5.17)
2 2
Here the maximum is over all vectors x, and p = (p1 , p2 , . . . , pd ), the dual
variable, also has d features. We will work in situations where a maximizer
exists in (7.5.18).
Let Q > 0 be a positive matrix. The simplest example is

    f (x) = (1/2) x · Qx  ⟹  g(p) = (1/2) p · Q⁻¹p.

This is established by the identity

    (1/2)(p − Qx) · Q⁻¹(p − Qx) = (1/2) p · Q⁻¹p − p · x + (1/2) x · Qx.   (7.5.19)

To see this, since the left side of (7.5.19) is greater or equal to zero, we have

    (1/2) p · Q⁻¹p − p · x + (1/2) x · Qx ≥ 0.

Since (7.5.19) equals zero iff p = Qx, we are led to (7.5.18).
Moreover, switching p · Q−1 p with x · Qx, we also have
Thus the convex dual of the convex dual of f (x) is f (x). In §7.6, we compute
the convex dual of the partition function.
0 = ∇x (p · x − f (x)) =⇒ p = ∇x f (x).
p = ∇x f (x) ⇐⇒ x = ∇p g(p).
This yields
Using this, and writing out (7.5.16) for g(p) instead of f (x) (we skip the
details) yields
    (p − q) · (x − a) ≥ (mL/(m + L)) |x − a|² + (1/(m + L)) |p − q|².    (7.5.22)
This is derived by using (7.5.21), the details are in [3]. This result is used
in gradient descent.
Then

    p · 1 = Σ_{k=1}^d pk = 1.
Because

    q1 = e^{z1}/(e^{z1} + e^{z2}) = 1/(1 + e^{−(z1−z2)}) = σ(z1 − z2),

    q2 = e^{z2}/(e^{z1} + e^{z2}) = 1/(1 + e^{−(z2−z1)}) = σ(z2 − z1).
Because of this, the softmax function is the multinomial analog of the logistic
function, and we use the same symbol σ to denote both functions.
z = array([z1,z2,z3])
q = softmax(z)
or
σ(z) = σ(z + a1).
To guarantee uniqueness of a global minimum of Z, we have to restrict
attention to the subspace of vectors z = (z1 , z2 , . . . , zd ) orthogonal to 1, the
vectors satisfying
z1 + z2 + · · · + zd = 0.
Now suppose z is orthogonal to 1. Since the exponential function is
convex,

    e^{Z(z)}/d = (1/d) Σ_{k=1}^d e^{zk} ≥ exp( (1/d) Σ_{k=1}^d zk ) = e⁰ = 1.

This establishes
z = Z1 + log p. (7.6.4)
The function

    I(p) = p · log p = Σ_{k=1}^d pk log pk                            (7.6.5)
This implies

    p · z = Σ_{k=1}^d pk zk = Σ_{k=1}^d pk log(e^{zk})
          ≤ log( Σ_{k=1}^d pk e^{zk} ) = log( Σ_{k=1}^d e^{zk + log pk} ) = Z(z + log p).
For all z,
Z(z) = max (p · z − I(p)) .
p
Since

    D²I(p) = diag( 1/p1 , 1/p2 , . . . , 1/pd ),

we see I(p) is strictly convex, and H(p) is strictly concave.
In Python, the entropy is
p = array([p_1,p_2,p_3])
entropy(p)
Now

    ∂²Z/∂zj ∂zk = ∂σj/∂zk = σj − σj σk if j = k,    ∂²Z/∂zj ∂zk = −σj σk if j ̸= k.
Hence we have

    v · Qv = Σ_{k=1}^d qk vk² − (v · q)² = Σ_{k=1}^d qk (vk − v̄)²,
    zj ≤ c,    j = 1, 2, . . . , d,

    −zj ≤ (d − 1)c,    j = 1, 2, . . . , d,

which implies

    |z|² = Σ_{k=1}^d zk² ≤ d(d − 1)²c².

Setting C = √d (d − 1)c, we conclude
Then

    p · log q = Σ_{k=1}^d pk log qk ,

and

    I(p, q) = I(p) − p · log q.                                       (7.6.12)
p = array([p1,p2,p3])
q = array([q1,q2,q3])
entropy(p,q)
returns the relative information, not the relative entropy. See below for more
on this terminology confusion.
    Z(z, q) = log( Σ_{k=1}^d e^{zk} qk ),

The cross-information is

    Icross (p, q) = − Σ_{k=1}^d pk log qk ,
Since I(p, σ(z)) and Icross (p, σ(z)) differ by the constant I(p), we also have
Here is the multinomial analog of (7.2.6). Suppose a dice has d faces, and
suppose the probability of rolling the k-th face in a single roll is qk . Then
q = (q1 , q2 , . . . , qd ) is a probability vector. Let p = (p1 , p2 , . . . , pd ) be another
probability vector. Roll the dice n times, and let Pn (p, q) be the probability
that the proportion of times the k-th face is rolled equals pk , k = 1, 2, . . . , d.
Then we have the approximation
How does one keep things straight? By remembering that it’s convex
functions that we like to minimize, not concave functions. In machine learn-
ing, loss functions are built to be minimized, and information, in any form,
is convex, while entropy, in any form, is concave. Table 7.25 summarizes the
situation.
H = −I        Information                Entropy
Absolute      I(p)                       H(p)
Cross         Icross (p, q)              Hcross (p, q)
Relative      I(p, q)                    H(p, q)
Curvature     Convex                     Concave
Error         I(p, q) with q = σ(z)

Table 7.25: The third row is the sum of the first and second rows, and the
H column is the negative of the I column.
Chapter 8
Machine Learning
8.1 Overview
This first section is an overview of the chapter. Here is a summary of the
structure of neural networks.
• In a directed graph, there are input nodes, output nodes, and hidden
nodes.
• A network is a weighed directed graph (§4.2) where the nodes are neu-
rons (§7.4).
Because wij = 0 if (i, j) is not an edge, the nonzero entries in the incoming
list at node j correspond to the edges incoming at node j.
A neural network is a network where every activation function is restricted
to be a function of the sum of the entries of the incoming list.
For example, all the networks in this section are neural networks, but the
network in Figure 7.13 is not a neural network.
Let

    x−j = Σ_{i→j} wij xi                                              (8.2.1)

be the sum of the incoming list at node j. Then, in a neural network, the
outgoing signal at node j is

    xj = fj (x−j) = fj ( Σ_{i→j} wij xi ).                            (8.2.2)
x = (x1 , x2 , . . . , xd ),
In a network, in §7.4, x− −
j was a list or vector; in a neural network, xj is a
scalar.
Let W be the weight matrix. If the network has d nodes, the activation
vector is
f = (f1 , f2 , . . . , fd ).
Then a neural network may be written in vector-matrix form
x = f (W t x).
However, this representation is more useful when the network has structure,
for example in a dense shallow layer (8.2.12).
y = f (w1 x1 + w2 x2 + · · · + wd xd ) = f (w · x)
Neural Network
Every neural network is a combination of perceptrons.
Figure 8.1: A perceptron: inputs x1 , x2 , x3 , weights w1 , w2 , w3 , activation f ,
output y.
y = f (w1 x1 + w2 x2 + · · · + wd xd + w0 ) = f (w · x + w0 ).
P rob(H | x) = σ(w · x + w0 ).
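In code, a perceptron is a single line of numpy; the weights, bias, and input
below are made up for illustration only.

import numpy as np

def sigmoid(z): return 1/(1 + np.exp(-z))

x = np.array([1.5, 2.5, -0.5])      # input
w = np.array([0.2, -0.1, 0.4])      # weights
w0 = 0.1                            # bias
y = sigmoid(np.dot(w, x) + w0)      # y = f(w . x + w0)
print(y)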
Perceptrons gained wide exposure after Minsky and Papert’s famous 1969
book [18], from which Figure 8.2 is taken.
Figure 8.3: A network with inputs x1 , x2 , neurons f3 , f4 , f5 , f6 , outputs x5 , x6 ,
and weights w13 , w14 , w23 , w24 , w35 , w36 , w45 , w46 .
Let xin and xout be the outgoing vectors corresponding to the input and
output nodes. Then the network in Figure 8.3 has outgoing vectors

    xin = (x1 , x2),    xout = (x5 , x6).

Here are the incoming and outgoing signals at each of the four neurons f3 ,
f4 , f5 , f6 .
def incoming(x,w,j):
    return sum([ outgoing(x,w,i)*w[i][j] if w[i][j] != None else 0 for i in range(d) ])

def outgoing(x,w,j):
    if x[j] != None: return x[j]
    else: return activate[j](incoming(x,w,j))
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
    d = len(w)
    x = [None]*d
    m = len(x_in)
    x[:m] = x_in
    for j in range(m,d): x[j] = outgoing(x,w,j)
    return x
activate = [None]*d
activate[2] = relu
activate[3] = id
activate[4] = sigmoid
activate[5] = tanh
x_in = [1.5,2.5]
x = forward_prop(x_in,w)
Let
y1 = 0.427, y2 = −0.288, y = (y1 , y2 )
be targets, and let J(xout , y) be a function of the outputs xout of the output
nodes, and the targets y. For example, for Figure 8.3, xout = (x5 , x6 ) and we
may take J to be the mean square error function or mean square loss

    J(xout , y) = (1/2)(x5 − y1)² + (1/2)(x6 − y2)².                  (8.2.6)
2 2
The code for this J is
def J(x_out,y):
    m = len(y)
    return sum([ (x_out[i] - y[i])**2/2 for i in range(m) ])
y0 = [0.132,-0.954]
y = [0.427, -0.288]
J(x_out,y0), J(x_out,y)
    ∂J/∂x−j ,    fj′ (x−j),    ∂J/∂xj .                               (8.2.7)

These are the downstream derivative, local derivative, and upstream derivative
at node j. (The terminology reflects the fact that derivatives are computed
backward.)
From (8.2.2),

    ∂xj/∂x−j = fj′ (x−j).                                             (8.2.8)

By the chain rule and (8.2.8), the key relation between these derivatives
is

    ∂J/∂x−i = (∂J/∂xi) · fi′ (x−i),                                   (8.2.9)

or

    downstream = upstream × local.
def local(x,w,i):
    return der_dict[activate[i]](incoming(x,w,i))
Let

    δi = ∂J/∂x−i ,    i = 1, 2, . . . , d.
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
    δ5 = ∂J/∂x−5 ,    δ6 = ∂J/∂x−6 ,    δout = (δ5 , δ6),

    ∂J/∂x5 = (x5 − y1) = −0.294.
    σ′ (x−5) = σ(x−5)(1 − σ(x−5)) = x5 (1 − x5) = 0.114.
Similarly,
δ6 = −0.059.
We conclude
δout = (−0.0337, −0.059).
The code for this is
def delta_out(x_out,y,w):
    d = len(w)
    m = len(y)
    return [ (x_out[i] - y[i]) * local(x,w,d-m+i) for i in range(m) ]

delta_out(x_out,y,w)
    ∂J/∂x−i = Σ_{i→j} (∂J/∂x−j) · (∂x−j/∂xi) · (∂xi/∂x−i)
            = ( Σ_{i→j} (∂J/∂x−j) · wij ) · fi′ (x−i).
The code is

def downstream(x,delta,w,i):
    if delta[i] != None: return delta[i]
    else:
        upstream = sum([ downstream(x,delta,w,j) * w[i][j] if w[i][j] != None else 0
                         for j in range(d) ])
        return upstream * local(x,w,i)

def backward_prop(x,y,w):
    d = len(w)
    delta = [None]*d
    m = len(y)
    x_out = x[d-m:]
    delta[d-m:] = delta_out(x_out,y,w)
    for i in range(d-m): delta[i] = downstream(x,delta,w,i)
    return delta

delta = backward_prop(x,y,w)

returns the vector δ of downstream derivatives.
    ∂x−j/∂wij = xi ,

see also Table 8.4. From this,

    ∂J/∂wij = (∂J/∂x−j) · (∂x−j/∂wij) = δj · xi .

We have shown

    ∂J/∂wij = xi · δj .                                               (8.2.11)
Our convention is to let wij denote the weight on the edge (i, j). With this
convention, the formulas (8.2.1), (8.2.2) reduce to the matrix multiplication
formulas
z − = W t x, z = f (W t x). (8.2.12)
Thus a dense shallow network can be thought of as a vector-valued percep-
tron. This allows for vectorized forward and back propagation.
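A minimal sketch of this vectorized form (the shapes and values below are
assumptions chosen only for illustration):

import numpy as np

def sigmoid(z): return 1/(1 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))     # weight matrix: 4 inputs, 3 outputs
x = rng.normal(size=4)          # input vector
z_minus = W.T @ x               # incoming sums, z- = W^t x
z = sigmoid(z_minus)            # outgoing signals, z = f(W^t x)
print(z)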
This goal is so general that any concrete insight one provides towards
this goal is widely useful in many settings. The setting we have in
mind is f = J, where J is the mean error from §8.1.
Usually f (w) is a measure of cost or lack of compatibility. Because of
this, f (w) is called the loss function or cost function.
A neural network is a black box with inputs x and outputs y, depending on
unknown weights w. To train the network is to select weights w in response
to training data (x, y). The optimal weights w∗ are selected as minimizers
of a loss function f (w) measuring the error between predicted outputs and
actual outputs, corresponding to given training inputs.
From §7.5, if the loss function f (w) is continuous and proper, there is
a global minimizer w∗ . If f (w) is in addition strictly convex, w∗ is unique.
Moreover, if the gradient of the loss function is g = ∇f (w), then w∗ is a
critical point, g ∗ = ∇f (w∗ ) = 0.
Inserting a = w and b = w+ , and solving for w+ ,

    w+ ≈ w − g(w)/g ′ (w).

Since the global minimizer w∗ satisfies f ′ (w∗) = 0, we insert g(w) = f ′ (w)
in the above approximation,

    w+ ≈ w − f ′ (w)/f ′′ (w).

Iterating,

    wn+1 = wn − f ′ (wn)/f ′′ (wn),    n = 1, 2, . . .
def newton(loss,grad,curv,w,num_iter):
    g = grad(w)
    c = curv(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= g/c
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        c = curv(w)
        if allclose(g,0): break
    return trajectory
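The loss, grad, and curv used below are those of the double-well example
f (w) = w⁴ − 6w² + 2w discussed later in this section (Figures 8.9, 8.10, 8.11):

def loss(w): return w**4 - 6*w**2 + 2*w
def grad(w): return 4*w**3 - 12*w + 2
def curv(w): return 12*w**2 - 12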
u0 = -2.72204813
w0 = 2.45269774
num_iter = 20
trajectory = newton(loss,grad,curv,w0,num_iter)
def plot_descent(a,b,loss,curv,delta,trajectory):
    w = arange(a,b,delta)
    plot(w,loss(w),color='red',linewidth=1)
    plot(w,curv(w),"--",color='blue',linewidth=1)
    plot(*trajectory,color='green',linewidth=1)
    scatter(*trajectory,s=10)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
    f (w) = f (w1 , w2 , . . . ).

    w1+ = w1 − t ∂f /∂w1 ,
    w2+ = w2 − t ∂f /∂w2 ,
    . . .

In other words,

    w+ = w − t ∇f (w).
In practice, the learning rate is selected by trial and error. Which learning
rate does the theory recommend?
Given an initial point w0 , the sublevel set at w0 (see §7.5) consists of all
points w where f (w) ≤ f (w0 ). Only the part of the sublevel set that is
connected to w0 counts.
Figure 8.10: Double well cost function and sublevel sets at w0 and at w1 .
In Figure 8.10, the sublevel set at w0 is the interval [u0 , w0 ]. The sublevel
set at w1 is the interval [b, w1 ]. Notice we do not include any points to the
left of b in the sublevel set at w1 , because points to the left of b are separated
from w1 by the gap at the point b.
Suppose the second derivative D²f (w) is never greater than a constant L
on the sublevel set. Then, for the gradient descent step w+ = w − t∇f (w)
with learning rate 0 ≤ t ≤ 1/L,

    f (w+) ≤ f (w) − (t/2)|∇f (w)|².                                  (8.3.3)

To see this, fix w and let S be the sublevel set {w′ : f (w′) ≤ f (w)}. Since
the gradient pushes f down, for t > 0 small, w+ stays in S. Insert x = w+
and a = w into the right half of (7.5.16) and simplify. This leads to

    f (w+) ≤ f (w) − t|∇f (w)|² + (t²L/2)|∇f (w)|².

Since tL ≤ 1 when 0 ≤ t ≤ 1/L, we have t²L ≤ t. This derives (8.3.3).
The curvature of the loss function and the learning rate are inversely
proportional. Where the curvature of the graph of f (w) is large, the learning
rate 1/L is small, and gradient descent proceeds in small time steps.
    f (wn+1) ≤ f (wn) − (1/(2L))|∇f (wn)|².

Since f (wn) and f (wn+1) both converge to f (w∗), and ∇f (wn) converges to
∇f (w∗), we conclude

    f (w∗) ≤ f (w∗) − (1/(2L))|∇f (w∗)|²,

hence ∇f (w∗) = 0.
For example, let f (w) = w⁴ − 6w² + 2w (Figures 8.9, 8.10, 8.11). Then

    f ′ (w) = 4w³ − 12w + 2,    f ′′ (w) = 12w² − 12.

Thus the inflection points (where f ′′ (w) = 0) are ±1 and, in Figure 8.10, the
critical points are a, b, c.
Let u0 and w0 be the points satisfying f (w) = 5 as in Figure 8.11.
Then u0 = −2.72204813 and w0 = 2.45269774, so f ′′ (u0 ) = 76.914552 and
f ′′ (w0 ) = 60.188. Thus we may choose L = 76.914552. With this L, the
short-step gradient descent starting at w0 is guaranteed to converge to one
of the three critical points. In fact, the sequence converges to the right-most
critical point c (Figure 8.10).
This exposes a flaw in basic gradient descent. Gradient descent may con-
verge to a local minimizer, and miss the global minimizer. In §8.8, modified
gradient descent will address some of these shortcomings.
def gd(loss,grad,w,learning_rate,num_iter):
    g = grad(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= learning_rate * g
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        if allclose(g,0): break
    return trajectory
u0 = -2.72204813
w0 = 2.45269774
L = 76.914552
learning_rate = 1/L
num_iter = 100
trajectory = gd(loss,grad,w0,learning_rate,num_iter)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
xin → xout .
Given inputs xin and target outputs y, we seek to modify the weight matrix
W so that the input-output map is
xin → y.
This is training.
Let (§8.2)

    x− = (x−1 , x−2 , . . . , x−d),    x = (x1 , x2 , . . . , xd),    δ = (δ1 , δ2 , . . . , δd)

be the incoming, outgoing, and downstream-derivative vectors over the nodes.
Let wij be the weight along an edge (i, j), let xi be the outgoing signal
from the i-th node, and let δj be the downstream derivative of the
output J with respect to the j-th node. Then the derivative ∂J/∂wij
equals xi δj . In this partial sense,
∇W J = x ⊗ δ. (8.4.2)
def update_weights(x,delta,w,learning_rate):
    d = len(w)
    for i in range(d):
        for j in range(d):
            if w[i][j]:
                w[i][j] = w[i][j] - learning_rate*x[i]*delta[j]
def train_nn(x_in,y,w0,learning_rate,n_iter):
    trajectory = []
    cost = 1
    # build a local copy
    w = [ row[:] for row in w0 ]
    d = len(w0)
    for _ in range(n_iter):
        x = forward_prop(x_in,w)
        delta = backward_prop(x,y,w)
        update_weights(x,delta,w,learning_rate)
        m = len(y)
        x_out = x[d-m:]
        cost = J(x_out,y)
        trajectory.append(cost)
        if allclose(0,cost): break
    return w, trajectory
Here n_iter is the maximum number of iterations allowed, and the iterations
stop if the cost J is close to zero.
The cost or error function J enters the code only through the function
delta_out, which is part of the function backward_prop.
Let W0 be the weight matrix (8.2.4). Then
x_in = [1.5,2.5]
learning_rate = .01
y0 = 0.4265356063
y1 = -0.2876478137
y = [y0,y1]
n_iter = 10000
w, trajectory = train_nn(x_in,y,w0,learning_rate,n_iter)
returns the cost trajectory, which can be plotted using the code
for lr in [.01,.02,.03,.035]:
    w, trajectory = train_nn(x_in,y,w0,lr,n_iter)
    n = len(trajectory)
    label = str(n) + ", " + str(lr)
    plot(range(n),trajectory,label=label)

grid()
legend()
show()
Figure 8.12: Cost trajectory and number of iterations as learning rate varies.
    J(W ) = Σ_{k=1}^N J(xk , yk , W ).                                (8.5.1)
We consider two kinds of loss functions J(x, y, W ), leading to

• linear regression,
• logistic regression.

With any loss function J, the goal is to minimize J. With this in mind,
from §7.5, we recall that a continuous, proper loss function J(W ) has a global
minimizer W ∗ ,

    J(W ∗) ≤ J(W ),    for all W.
    J(x, y, W ) = (1/2)|W t x − y|².                                  (8.5.2)
Then (8.5.1) is the mean square error or mean square loss, and the problem
of minimizing (8.5.1) is linear regression (Figure 8.13).
We use (7.3.4) to compute the gradient of J(x, y, W ). Let V be a weight
matrix, and let v = V t x, z = W t x. Then (W + sV )t x = z + sv, and the
directional derivative is

    d/ds|_{s=0} J(x, y, W + sV ) = d/ds|_{s=0} (1/2)|z + sv − y|²
                                 = v · (z − y) = (V t x) · (z − y)
                                 = trace( (V t x) ⊗ (z − y) ) = trace( V t (x ⊗ (z − y)) ).

By (7.3.4), this implies the weight gradient for mean square loss is

    ∇W J(x, y, W ) = x ⊗ (z − y) = x ⊗ (W t x − y).
Figure 8.13: The linear regression network: z = W t x, J = |z − y|²/2.
Since J(W ) is the sum of J(x, y, W ) over all samples, J(W ) is convex.
To check strict convexity of J(W ), suppose
    d²/ds²|_{s=0} J(W + sV ) = 0.
V t xk = 0, k = 1, 2, . . . , N. (8.5.5)
Recall the feature space is the vector space of all inputs x, and (§2.9)
a dataset is full-rank if the span of the dataset is the entire feature space.
When this happens, (8.5.5) implies V = 0, hence J(W ) is strictly convex.
To check properness of J(W ), by definition (7.5.9), we have to show
J(W ) ≤ c =⇒ ∥W ∥2 ≤ C. (8.5.6)
Here ∥W ∥ is the norm of the matrix W (2.2.11). The exact formula for the
bound C, which is not important for our purposes, depends on the level c
and the dataset.
If J(W ) ≤ c, by (8.5.1), (8.5.2), and the triangle inequality,

    |W t xk | ≤ √(2c) + |yk |,    k = 1, 2, . . . , N.
|W t x| ≤ C1 . (8.5.7)
Here I(p, q) is the relative information, measuring the error between the
desired target p and the computed target q, and q = σ(z) is the softmax
function, squashing the network’s output z = W t x into the probability q.
When p is one-hot encoded, by (7.6.15), I(p, q) = Icross (p, q).
Because of this, in the literature, in the one-hot encoded case, (8.5.1) is called
the cross-entropy loss.
In either case, strict or one-hot encoded, J(W ) is logistic loss or logistic
error, and the problem of minimizing (8.5.1) is logistic regression (Figure
8.14).
Since we will be considering both strict and one-hot encoded probabilities,
we work with I(p, q) rather than Icross (p, q). Table 7.25 is a useful summary
of the various information and entropy concepts.
Figure 8.14: The logistic regression network: z = W t x, q = σ(z), J = I(p, q).
    = trace( V t (x ⊗ (q − p)) ).

    Σ_{j=1}^d vj² qj − ( Σ_{j=1}^d vj qj )² = Σ_{j=1}^d (vj − v̄)² qj .

    (W + sV )t x = z + sv.
vanishes, then, since the summands are nonnegative, (8.5.12) vanishes, for
every sample x = xk , p = pk , hence
V t xk = 0, k = 1, 2, . . . , N.
The convex hull is discussed in §7.5, see Figures 7.20 and 7.21. If Ki were
just the samples x whose corresponding targets p satisfy pi > 0 (with no
convex hull), then the intersection Ki ∩ Kj may be empty.
For example, if p were one-hot encoded, then x belongs to at most one
Ki . Thus taking the convex hull in the definition of Ki is crucial. This is
clearly seen in Figure 8.26: The samples never intersect, but the convex hulls
may do so.
To establish properness of J(W ), by definition (7.5.9), we have to show

    J(W ) ≤ c and W 1 = 0  ⟹  ∥W ∥² ≤ C.

The exact formula for the bound C, which is not important for our purposes,
depends on the level c and the dataset.
Suppose J(W ) ≤ c, with W 1 = 0. Then I(p, σ(W t x)) = J(x, p, W ) ≤ c
for every sample x and corresponding target p.
Let x be a sample, let z = W t x, and suppose the corresponding target p
satisfies pi ≥ ϵ, for some class i, and some ϵ > 0. If j ̸= i, then

    ϵ(zj − zi) ≤ pi (Z(z) − zi) ≤ Σ_{k=1}^d pk (Z(z) − zk) = Z(z) − p · z.

Here we used zj < Z(z), and Z(z) − zk > 0 for all k. By (7.6.14),

    ϵ(zj − zi) ≤ c + log d.
Let x be any vector in feature space, and let z = W t x. Since span(Ki ∩Kj )
is full-rank for every i and j, x is a linear combination of vectors in Ki ∩ Kj .
This implies, by (8.5.14), there is a bound C2 , depending on x but not on
W , such that
|zi − zj | ≤ C2 , for every i and j, (8.5.15)
Since z · 1 = 0, z_i = −∑_{j≠i} z_j. Summing (8.5.15) over j ≠ i,
$$d\,|z_i| = |(d-1)z_i + z_i| = \Big|\sum_{j \ne i} (z_i - z_j)\Big| \le (d-1)\,C_2.$$
In strict regression, the equation
$$\nabla_z J(z, p) = 0$$
can always be solved, so there is at least one minimum for each J(z, p). Here properness ultimately depends on properness of the partition function Z(z).
In one-hot encoded regression, J(z, p) = I(p, σ(z)) and ∇z J(z, p) = 0
can never be solved, because q = σ(z) is always strict and p is one-hot
encoded, see (8.5.10). Nevertheless, trainability of J(W ) is achievable if
there is sufficient overlap between the sample categories.
In linear regression, the minimizer is expressible in terms of the regression
equation, and thus can be solved in principle using the pseudo-inverse. In
practice, when the dimensions are high, gradient descent may be the only
option for linear regression.
In logistic regression, the minimizer cannot be found in closed form, so
we have no choice but to apply gradient descent, even for low dimensions.
$$y = w \cdot x = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d.$$
Linear Regression
We work out the regression equation in the plane, when both features x
and y are scalar. In this case, w = (m, b) and
$$X = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}.$$
Then (x̄, ȳ) is the mean of the dataset. Also, let x and y denote the vectors (x1, x2, ..., xN) and (y1, y2, ..., yN), and let, as in §1.6,
$$\operatorname{cov}(x, y) = \frac{1}{N}\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y) = \frac{1}{N}\, x \cdot y - \bar x\,\bar y.$$
$$\frac{1}{N}(x \cdot x)\, m + \bar x\, b = \frac{1}{N}\, x \cdot y, \qquad m\,\bar x + b = \bar y.$$
The second equation says the regression line passes through the mean (x̄, ȳ).
Multiplying the second equation by x̄ and subtracting the result from the
first equation cancels the b and leads to
$$\operatorname{cov}(x, x)\, m = \operatorname{cov}(x, y).$$
This derives:

The regression line in two dimensions passes through the mean (x̄, ȳ) and has slope
$$m = \frac{\operatorname{cov}(x, y)}{\operatorname{cov}(x, x)}.$$
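A quick numpy check of this slope formula on made-up data (the values below are illustrative only):

from numpy import *

x = array([1.0, 2.0, 3.0, 4.0, 5.0])     # illustrative features
y = array([2.1, 3.9, 6.2, 8.1, 9.9])     # illustrative targets

covxy = mean(x*y) - mean(x)*mean(y)      # cov(x, y)
covxx = mean(x*x) - mean(x)**2           # cov(x, x)
m = covxy / covxx                        # slope of the regression line
b = mean(y) - m*mean(x)                  # intercept: the line passes through the mean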
from numpy import *
from pandas import read_csv

df = read_csv("longley.csv")
X = df["Population"].to_numpy()
Y = df["Employed"].to_numpy()

# center and scale to zero mean and unit variance
X = X - mean(X)
Y = Y - mean(Y)
varx = sum(X**2)/len(X)
vary = sum(Y**2)/len(Y)
X = X/sqrt(varx)
Y = Y/sqrt(vary)
After this, we compute the optimal weight w∗ and construct the polynomial. The regression equation is solved using the pseudo-inverse (§2.6).
def poly(x,d):
    # Aplus is the pseudo-inverse of the matrix of sample powers, b the targets (built in code not shown)
    wstar = dot(Aplus,b)
    return sum([ x**i*wstar[i] for i in range(d) ],axis=0)
from matplotlib.pyplot import *

# x ranges over the extent of the data
xmin, xmax = amin(X), amax(X)

figure(figsize=(12,12))
# six subplots
rows, cols = 3,2
# x interval
x = arange(xmin,xmax,.01)

for i in range(6):
    d = 3 + 2*i # degree = d-1
    subplot(rows, cols,i+1)
    plot(X,Y,"o",markersize=2)
    plot([0],[0],marker="o",color="red",markersize=4)
    plot(x,poly(x,d),color="blue",linewidth=.5)
    xlabel("degree = %s" % str(d-1))
    grid()

show()
Running this code with degree 1 returns Figure 8.15. Taking too high a
power can lead to overfitting, for example for degree 12.
Study times x and outcomes p (p = 1 if the student passed, p = 0 if not):

  x      p     x      p     x      p     x      p     x      p
 0.5     0    0.75    0    1.0     0    1.25    0    1.5     0
 1.75    0    1.75    1    2.0     0    2.25    1    2.5     0
 2.75    1    3.0     0    3.25    1    3.5     0    4.0     1
 4.25    1    4.5     1    4.75    1    5.0     1    5.5     1
More generally, we may only know the amount of study time x, and the
probability p that the student passed, where now 0 ≤ p ≤ 1.
For example, the data may be as in Figure 8.18, where pk equals 1 or 0
according to whether they passed or not.
As stated, the samples of this dataset are scalars, and the dataset is
one-dimensional (Figure 8.19).
Plotting the dataset on the (x, p) plane, the goal is to fit a curve as in
Figure 8.20.
To apply the results from the previous section, we incorporate the bias as an extra input fixed equal to 1.
Let σ(z) be the sigmoid function (5.1.12). Then, as in the previous sec-
tion, the goal is to minimize the loss function
$$J(m, b) = \sum_{k=1}^N I(p_k, q_k), \qquad q_k = \sigma(m x_k + b). \tag{8.6.4}$$
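In code, (8.6.4) can be evaluated directly; the following is a minimal sketch, assuming the scalar relative information reduces to −log q when p = 1 and −log(1 − q) when p = 0, and with X, P the study times and outcomes listed in the descent code below:

from numpy import log
from scipy.special import expit

def I(p, q):
    # scalar relative information for p equal to 0 or 1
    return -log(q) if p == 1 else -log(1 - q)

def J(m, b):
    return sum( I(p, expit(m*x + b)) for x, p in zip(X, P) )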
The same dataset with one-hot encoded targets, (1, 0) for p = 0 and (0, 1) for p = 1:

  x      p        x      p        x      p        x      p        x      p
 0.5    (1,0)    0.75   (1,0)    1.0    (1,0)    1.25   (1,0)    1.5    (1,0)
 1.75   (1,0)    1.75   (0,1)    2.0    (1,0)    2.25   (0,1)    2.5    (1,0)
 2.75   (0,1)    3.0    (1,0)    3.25   (0,1)    3.5    (1,0)    4.0    (0,1)
 4.25   (0,1)    4.5    (0,1)    4.75   (0,1)    5.0    (0,1)    5.5    (0,1)
[Figure 8.23: the two-output network, with weights ±m, ±b from the inputs x, 1, outputs ±z, softmax output q = σ(z), and loss J = I(p, q).]
Since here d = 2, the networks in Figures 8.23 and 8.24 are equivalent.
In Figure 8.23, σ is the softmax function, I is given by (7.6.5), and p, q are
probability vectors. In Figure 8.24, σ is the sigmoid function, I is given by
(7.2.3), and p, q are probability scalars.
[Figure 8.24: the single-output network, with weights m, b from the inputs x, 1, output z = mx + b, sigmoid output q = σ(z), and loss J = I(p, q).]
Figure 8.20 is a plot of x against p. However, the dataset, with the bias
input included, has two inputs x, 1 and one output p, and should be plotted
in three dimensions (x, 1, p). Then (Figure 8.25) samples lie on the line (x, 1)
in the horizontal plane, and p is on the vertical axis.
[Figure 8.25: the dataset plotted in three dimensions (x, 1, p), with the samples in the horizontal feature plane and p on the vertical axis.]
The horizontal plane in Figure 8.25, which is the plane in Figure 8.21, is
feature space. The convex hulls K0 and K1 are in feature space, so the convex
hull K0 of the samples corresponding to p = 0 is the line segment joining
(.5, 1, 0) and (3.5, 1, 0), and the convex hull K1 of the samples corresponding
to p = 1 is the line segment joining (1.75, 1, 0) and (5.5, 1, 0). In Figure 8.25,
K0 is the line segment joining the blue points, and K1 is the projection onto
feature space of the line segment joining the red points. Since K0 ∩ K1 is the
line segment joining (1.75, 1, 0) and (3.5, 1, 0), the span of K0 ∩ K1 is all of
feature space. By the results of the previous section, J(w) is proper.
Here is the descent code.
from numpy import *
from scipy.special import expit

X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
     2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

def gradient(m,b):
    return sum([ (expit(m*x+b) - p) * array([x,1]) for x,p in zip(X,P) ],axis=0)

# gradient descent
w = array([0,0]) # starting m,b
g = gradient(*w)
t = .01 # learning rate
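The descent loop itself is not shown in the excerpt above; a minimal sketch, assuming a fixed learning rate and a stopping test on the size of the gradient:

from numpy.linalg import norm

# iterate plain gradient descent until the gradient is negligible
while norm(g) > 1e-8:
    w = w - t*g
    g = gradient(*w)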
Running the descent returns the minimizer
$$m^* = 1.49991537, \qquad b^* = -4.06373862.$$
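The fitted pass probability for a study time x is then q = σ(m∗x + b∗), which crosses 1/2 near x = −b∗/m∗ ≈ 2.71 hours. For example, with expit as above,

q = expit(1.49991537*3.0 - 4.06373862)   # fitted probability of passing after 3 hours of study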
The Iris dataset consists of 150 samples divided into three groups, leading
to three convex hulls K0 , K1 , K2 in R4 . If the dataset is projected onto the
top two principal components, then the projections of these three hulls do
not pair-intersect (Figure 8.26). It follows we have no guarantee the logistic
loss is proper.
On the other hand, the MNIST dataset consists of 60,000 samples divided
into ten groups. If the MNIST dataset is projected onto the top two principal
components, the projections of the ten convex hulls K0, K1, ..., K9 onto R2 do intersect (Figure 8.27).
This does not guarantee that the ten convex hulls K0 , K1 , . . . , K9 in R784
intersect, but at least this is so for the 2d projection of the MNIST dataset.
Therefore the logistic loss of the 2d projection of the MNIST dataset is
proper.
$$z = w \cdot x + w_0$$
with
$$z \le 0 \ \text{ for } x \text{ in } A, \qquad z \ge 0 \ \text{ for } x \text{ in } B.$$
In the case of two classes, the results in §7.5 and §8.5 lead to the following
result [12].
• The two classes are not linearly separable: When the means of
the classes are distinct, the log loss J(w, w0 ) is trainable (strictly
convex and proper), and there is a unique minimizer (w∗ , w0∗ )
with w∗ ̸= 0, hence an optimal single-layer perceptron
q = σ(w∗ · x + w0∗ ).
¹ In the literature, the condition number is often defined as L/m.
Gradient Descent I
Let r = m/L and set E(w) = f(w) − f(w∗). Then the descent sequence w0, w1, w2, ... given by (8.3.1) with learning rate
$$t = \frac{1}{L}$$
converges to w∗ at the rate
$$E(w_n) \le (1 - r)^n\, E(w_0), \qquad n = 1, 2, \dots$$
$$g \cdot (w - w^*) \ge \frac{mL}{m+L}\,|w - w^*|^2 + \frac{1}{m+L}\,|g|^2.$$
Using this inequality, (8.3.1), and the choice t = 2/(m + L), this implies the following result.
Gradient Descent II
Let r = m/L and set E(w) = |w − w∗|². Then the descent sequence w0, w1, w2, ... given by (8.3.1) with learning rate
$$t = \frac{2}{m+L}$$
converges to w∗ at the rate
$$E(w_n) \le \left(\frac{1-r}{1+r}\right)^{\!2n} E(w_0), \qquad n = 1, 2, \dots \tag{8.7.6}$$
Let g be the gradient of the loss function at a point w. Then the line
passing through w in the direction of g is w − tg. When the loss function is
quadratic (7.3.7), f (w − tg) is a quadratic function of the scalar variable t.
In this case, the minimizer t along the line w − tg is explicitly computable as
$$t = \frac{g \cdot g}{g \cdot Qg}.$$
This leads to gradient descent with varying time steps t0 , t1 , t2 , . . . . As a
consequence, one can show the error is lowered as follows,
$$E(w^+) = \left(1 - \frac{1}{(u \cdot Qu)(u \cdot Q^{-1}u)}\right) E(w), \qquad u = \frac{g}{|g|}.$$
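Here is a minimal numpy sketch of this exact line search, for a quadratic loss of the form f(w) = ½ w·Qw − b·w; the matrix Q, the vector b, and the starting point below are illustrative only:

from numpy import array, dot
from numpy.linalg import norm

Q = array([[3.0, 1.0], [1.0, 2.0]])      # positive-definite matrix (illustrative)
b = array([1.0, -1.0])
w = array([0.0, 0.0])

for n in range(100):
    g = dot(Q, w) - b                    # gradient of f(w) = w.Qw/2 - b.w
    if norm(g) < 1e-12:
        break
    t = dot(g, g) / dot(g, dot(Q, g))    # exact minimizer along the line w - t*g
    w = w - t*g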
w◦ = w + s(w − w− ). (8.8.1)
Here s is the decay rate. The momentum term reflects the direction induced
by the previous step. Because this mimics the behavior of a ball rolling
downhill, gradient descent with momentum is also called heavy ball descent.
Then the descent sequence w0, w1, w2, ... is generated by
$$w^+ = w^\circ - t\,\nabla f(w). \tag{8.8.2}$$
Here we have two hyperparameters, the learning rate and the decay rate.
$$w_n = w^* + \rho^n v, \qquad Qv = \lambda v. \tag{8.8.5}$$
Inserting this into (8.8.3) and using Qw∗ = b leads to the quadratic equation
$$\rho^2 = (1 - t\lambda + s)\rho - s.$$
$$A_k + B_k = (w_0 - w^*) \cdot v_k, \qquad A_k \rho_k + B_k \bar\rho_k = (w_1 - w^*) \cdot v_k = (1 - t\lambda_k)(w_0 - w^*) \cdot v_k,$$
Let
$$C = \max_\lambda \frac{(L-m)^2}{(L-\lambda)(\lambda-m)}. \tag{8.8.11}$$
$$\begin{aligned}
|w_n - w^*|^2 &= \sum_{k=1}^d |(w_n - w^*) \cdot v_k|^2 \\
&= \sum_{k=1}^d |A_k \rho_k^n + B_k \bar\rho_k^n|^2 \\
&\le \sum_{k=1}^d (|A_k| + |B_k|)^2 |\rho_k|^{2n} \\
&\le 4Cs^n \sum_{k=1}^d |(w_0 - w^*) \cdot v_k|^2 \\
&= 4Cs^n\, |w_0 - w^*|^2.
\end{aligned}$$
Suppose the loss function f(w) is quadratic (8.7.2), let r = m/L, and set E(w) = |w − w∗|². Let C be given by (8.8.11). Then the descent sequence w0, w1, w2, ... given by (8.8.2) with learning rate and decay rate
$$t = \frac{1}{L} \cdot \frac{4}{(1+\sqrt r)^2}, \qquad s = \left(\frac{1-\sqrt r}{1+\sqrt r}\right)^{\!2},$$
converges to w∗ at the rate
$$E(w_n) \le 4C \left(\frac{1-\sqrt r}{1+\sqrt r}\right)^{\!2n} E(w_0), \qquad n = 1, 2, \dots \tag{8.8.12}$$
This rate is only guaranteed for f(w) quadratic (8.7.2). In fact, there are examples of non-quadratic f(w) where heavy ball descent does not converge to w∗. Nevertheless, this method is widely used.
For accelerated gradient descent, the gradient is instead evaluated at the momentum point w◦:
$$w^\circ = w + s(w - w^-), \qquad w^+ = w^\circ - t\,\nabla f(w^\circ). \tag{8.8.13}$$
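In code, the iteration (8.8.13) might look as follows; the quadratic stand-in loss, Q, b, and the hyperparameter values are illustrative only:

from numpy import array, dot

Q = array([[3.0, 1.0], [1.0, 2.0]])     # positive-definite matrix (illustrative)
b = array([1.0, -1.0])

def grad_f(w):
    # gradient of the stand-in loss f(w) = w.Qw/2 - b.w
    return dot(Q, w) - b

t, s = 0.1, 0.5                          # illustrative learning rate and decay rate
w_prev = array([0.0, 0.0])
w = array([0.0, 0.0])

for n in range(200):
    w_circ = w + s*(w - w_prev)          # momentum point w°
    w_new = w_circ - t*grad_f(w_circ)    # accelerated step: gradient evaluated at w°
    w_prev, w = w, w_new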
It turns out the function
$$V(w) = f(w) + \frac{L}{2}\,|w - \rho w^-|^2, \tag{8.8.14}$$
with a suitable choice of ρ, does the job. With the choices
$$t = \frac{1}{L}, \qquad s = \frac{1-\sqrt r}{1+\sqrt r}, \qquad \rho = 1 - \sqrt r,$$
we will show
$$V(w^+) \le \rho\, V(w). \tag{8.8.15}$$
Taking the initial momentum point to be w0⁻ = w0,
$$V(w_0) = f(w_0) + \frac{L}{2}|w_0 - \rho w_0|^2 = f(w_0) + \frac{m}{2}|w_0|^2 \le 2f(w_0).$$
Moreover f(w) ≤ V(w). Iterating (8.8.15), we obtain
$$V(w_n) \le \rho^n\, V(w_0), \qquad n = 1, 2, \dots$$
This derives: if f(w) satisfies (8.7.1), then, with the above choices of t and s, and with E(w) = f(w) − f(w∗), the accelerated descent sequence satisfies
$$E(w_n) \le 2\,(1 - \sqrt r)^n\, E(w_0), \qquad n = 1, 2, \dots$$
While the convergence rate for accelerated descent is slightly worse than
heavy ball descent, the value of accelerated descent is its validity for all
convex functions satisfying (8.7.1), and the fact, also due to Nesterov [19],
that this convergence rate is best-possible among all such functions.
Now we derive (8.8.15). Assume (w+ )− = w and w∗ = 0, f (w∗ ) = 0. We
know w◦ = (1 + s)w − sw− and w+ = w◦ − tg ◦ , where g ◦ = ∇f (w◦ ).
$$R(a, b) = r\rho s^2 |b|^2 + (1-\rho)|a + sb|^2 - |(1-\rho)a + sb|^2 + \rho\,|(1-\rho)a + \rho b|^2,$$
which is positive.
Appendices
A.1 SQL
Recall that matrices (§2.1), datasets, CSV files, spreadsheets, arrays, and dataframes are basically the same objects. Databases are collections of tables, where a table is another object similar to the above. Hence SQL tables join this list of interchangeable objects:

matrix ←→ dataset ←→ CSV file ←→ spreadsheet ←→ array ←→ dataframe ←→ SQL table.   (A.1.2)
select from
limit
select distinct
where/where not <column>
where <column> = <data> and/or <column> = <data>
order by <column1>,<column2>
insert into table (<column1>,<column2>,...) \
values (<data1>, <data2>, ...)
is null
update <table> set <column> = <data> where ...
like <regex> (%, _, [abc], [a-f], [!abc])
delete from <table> where ...
select min(<column>) from <table> (also max, count, avg)
where <column> in/not in (<data array>)
between/not between <data1> and <data2>
as
join (left, right, inner, full)
create database <database>
drop database <database>
create table <table>
truncate <table>
alter table <table> add <column> <datatype>
alter table <table> drop column <column>
insert into <table> select
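As an illustration of how such statements are issued from Python, here is a minimal sketch using sqlalchemy; the URI is of the kind constructed later in this appendix, the table and column names follow the Menu table used below, and the price threshold is a placeholder:

from sqlalchemy import create_engine, text

# URI of the form built later in this appendix
uri = "mysql+pymysql://username:password@servername:3306/rawa"
engine = create_engine(uri)

with engine.connect() as connection:
    result = connection.execute(
        text("select dish, price from Menu where price < 1500 order by price"))
    for row in result:
        print(row)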
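Consider, for example, a dict along the following lines (the price, in cents, and the quantity shown here are only illustrative; the keys are those discussed next):

item1 = { "dish":"Hummus", "price":800, "quantity":2 }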
This is an unordered listing of key-value pairs. Here the keys are the strings dish, price, and quantity. Keys need not be strings; they may be integers or any immutable Python objects. Since a Python list is mutable, a key cannot be a list. Values may be any Python objects, so a value may be a list. In a dict, values are accessed through their keys. For example, item1["dish"] returns 'Hummus'.
A list-of-dicts is simply a Python list whose elements are Python dicts, for example L = [item1, item2]. Then

len(L), L[0]["dish"]

returns

(2, 'Hummus')
returns True.
from json import dumps, loads

s = dumps(L)
Now print L and print s. Even though L and s “look” the same, L is a
list, and s is a string. To emphasize this point, note
L1 = loads(s)
Then L == L1 returns True. Strings having this form are called JSON
strings, and are easy to store in a database as VARCHARs (see Figure A.4).
The basic object in the Python module pandas is the dataframe (Figures
A.1, A.2, A.4, A.5). The pandas module can convert a dataframe df to
many, many other formats
from pandas import DataFrame

df = DataFrame(L)
df

L1 = df.to_dict('records')
L == L1

returns True. Here the option 'records' returns a list-of-dicts; other options return a dict-of-dicts or other combinations.
To convert a CSV file into a dataframe, use the code
menu_df = read_csv("menu.csv")
menu_df
To go the other way, to convert the dataframe df to the CSV file menu1.csv, use the code
df.to_csv("menu1.csv")
df.to_csv("menu2.csv",index=False)
protocol = "mysql+pymysql://"
credentials = "username:password"
server = "@servername"
port = ":3306"
uri = protocol + credentials + server + port
This string contains your database username, your database password, the database server name, the server port, and the protocol. If the database is "rawa", the URI is
database = "/rawa"
uri = protocol + credentials + server + port + database
import sqlalchemy

engine = sqlalchemy.create_engine(uri)
df.to_sql('Menu',engine,if_exists='replace')
df.to_sql('Menu',engine)
One benefit of this syntax is the automatic closure of the connection upon completion. This completes the discussion of how to convert between dataframes and SQL tables, and hence between any of the objects in (A.1.2).
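Going in the other direction, from an SQL table back to a dataframe, can be done with read_sql; a minimal sketch, with the URI constructed as above:

import sqlalchemy
from pandas import read_sql

engine = sqlalchemy.create_engine(uri)
with engine.connect() as connection:
    menu1_df = read_sql("select * from Menu", connection)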
/* Menu */
dish varchar
price integer
/* ordersin */
orderid integer
created datetime
customerid integer
items json
/* ordersout */
orderid integer
subtotal integer
tip integer
tax integer
total integer
To achieve this task, we download the CSV files menu.csv and orders.csv, then we carry out these steps. (price and tip in menu.csv and orders.csv are in cents, so they are INTs.)
5. Add a key items to OrdersIn whose values are JSON strings specifying
the items ordered in orders, using the prices in menu (these are in cents
so they are INTs). The JSON string is that of a list-of-dicts of the form discussed above, L = [item1, item2] (see row 0 in Figure A.4).
Do this by looping over each order in the list-of-dicts orders, then
looping over each item in the list-of-dicts menu, and extracting the
quantity ordered of the item item in the order order.
6. Add a key subtotal to OrdersOut whose values (in cents) are com-
puted from the above values.
Add a key tax to OrdersOut whose values (in cents) are computed
using the Connecticut tax rate 7.35%. Tax is applied to the sum of
subtotal and tip.
Add a key total to OrdersOut whose values (in cents) are computed
from the above values (subtotal, tax, tip).
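For example, an order with subtotal 2500 cents and tip 500 cents (numbers chosen only for illustration) has tax = int(0.0735 × 3000) = 220 cents and total = 2500 + 500 + 220 = 3220 cents.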
# step 1
from pandas import *
protocol = "https://"
server = "math.temple.edu"
path = "/~hijab/teaching/csv_files/restaurant/"
url = protocol + server + path
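# (step 1, continued) read the two CSV files into dataframes; the file
# names are the ones given in the text
menu_df = read_csv(url + "menu.csv")
orders_df = read_csv(url + "orders.csv")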
# step 2
menu = menu_df.to_dict('records')
orders = orders_df.to_dict('records')
# step 3
OrdersIn = [ ]
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["created"] = r["created"]
    d["customerId"] = r["customerId"]
    OrdersIn.append(d)
# step 4
OrdersOut = [ ]
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["tip"] = r["tip"]
    OrdersOut.append(d)
# step 5
from json import *
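# A sketch of step 5: build the items JSON string for each order.  The
# exact column layout of orders.csv is assumed here -- namely, that each
# order row lists, under each dish's name, the quantity ordered of that dish.
for i,r in enumerate(orders):
    items = []
    for item in menu:
        quantity = r.get(item["dish"], 0)
        if quantity:
            items.append({ "dish":item["dish"], "price":item["price"],
                "quantity":quantity })
    OrdersIn[i]["items"] = dumps(items)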
# step 6
for i,r in enumerate(OrdersOut):
    items = loads(OrdersIn[i]["items"])
    subtotal = sum([ item["price"]*item["quantity"] for item in items ])
    r["subtotal"] = subtotal
    tip = OrdersOut[i]["tip"]
    tax = int(.0735*(tip + subtotal))
    r["tax"] = tax
    r["total"] = subtotal + tip + tax
# step 7
ordersin_df = DataFrame(OrdersIn)
ordersout_df = DataFrame(OrdersOut)
# step 8
import sqlalchemy
from sqlalchemy import create_engine, text
engine = create_engine(uri)
dtype1 = { "dish":sqlalchemy.String(60),
    "price":sqlalchemy.Integer }
dtype2 = {
"orderId":sqlalchemy.Integer,
"created":sqlalchemy.String(30),
"customerId":sqlalchemy.Integer,
"items":sqlalchemy.String(1000)
}
dtype3 = {
"orderId":sqlalchemy.Integer,
"tip":sqlalchemy.Integer,
"subtotal":sqlalchemy.Integer,
"tax":sqlalchemy.Integer,
"total":sqlalchemy.Integer
}
protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
connection = engine.connect()
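The dataframes would then be written to the database using to_sql, passing the column types defined above; a minimal sketch, with table names as in the schema listing and if_exists='replace' so the step can be rerun:

menu_df.to_sql('Menu', engine, if_exists='replace', index=False, dtype=dtype1)
ordersin_df.to_sql('ordersin', engine, if_exists='replace', index=False, dtype=dtype2)
ordersout_df.to_sql('ordersout', engine, if_exists='replace', index=False, dtype=dtype3)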
num = 0
print(num)
def num_plates(item):
    dishes = loads(item)
    return sum( [ dish["quantity"] for dish in dishes ] )
Then we use map to apply this function to every element in the series df["items"], resulting in another series. Then we sum the resulting series.
num = df["items"].map(num_plates).sum()
print(num)
Since the total number of plates is 14,949, and the total number of orders
is 4970, the average number of plates per order is 3.76.
A.2 Minimizing Sequences

So far, questions surrounding the existence of minimizers were ignored. In this section, which may safely be skipped, we review the foundational material supporting the existence of minimizers.
The first issue that must be clarified is the difference between the min-
imum and the infimum. In a given situation, it is possible that there is
no minimum. By contrast, in any reasonable situation, there is always an
infimum.
For example, since y = e^x is an increasing function, the minimum
$$\min_{0 \le x \le 1} e^x = \min\{e^x \mid 0 \le x \le 1\}$$
equals e^0 = 1, and the minimizer is x = 0. By contrast, consider the minimum of y = 1/x over x > 0.
In this situation, the minimizer does not exist, but, since the values of 1/x are
arbitrarily close to 0, we say the infimum is 0. Since there is no minimizer,
there is no minimum value. Also, even though 0 is the infimum, we do not
say ∞ is the “infimizer”, since ∞ is not an actual number.
When S is a finite collection of real numbers, the minimum of S exists. When S is infinite, a minimum need not exist. When the minimum exists, we write m = min S.
If S is bounded below, then S has many lower bounds. The greatest
among these lower bounds is the infimum of S. A foundational axiom for
real numbers is that the infimum always exists. When m is the infimum of
S, we write m = inf S.
Existence of Infima
Any collection S of real numbers that is bounded below has an infimum:
There is a lower bound m for S that is greater than any other lower
bound b for S.
For example, for S = [0, 1], inf S = 0 and min S = 0, and, for S = (0, 1),
inf S = 0, but min S does not exist. For both these sets S, it is clear that 0
is the infimum. The power of the axiom comes from its validity for any set
S of scalars that is bounded below, no matter how complicated.
By definition, the infimum of S is the lower bound for S that is greater
than any other lower bound for S. From this, if min S exists, then inf S =
min S.
An error sequence is a decreasing sequence of nonnegative scalars
$$e_1 \ge e_2 \ge \cdots \ge 0.$$
We say the error sequence converges to zero if
$$\inf_{n \ge 1} e_n = 0.$$
Error Sequence
An error sequence e1 ≥ e2 ≥ · · · ≥ 0 converges to zero iff for any ϵ > 0,
there is an N > 0 with
0 ≤ en < ϵ, n ≥ N.
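For example, e_n = 1/n is an error sequence converging to zero: it is decreasing and nonnegative, and, given ϵ > 0, any N > 1/ϵ gives 0 ≤ e_n = 1/n ≤ 1/N < ϵ for n ≥ N.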
A sequence x1, x2, ... converges to x∗ if, for every ϵ > 0, there is an N > 0 with |x_n − x∗| < ϵ for n ≥ N. In this case we write lim_{n→∞} x_n = x∗, or we write x_n → x∗.
Note this definition of convergence is consistent with the previous definition, since an error sequence e1, e2, ... converges to zero (in the first sense) iff
$$\lim_{n \to \infty} e_n = 0.$$
Here it is important that the indices n1 < n2 < n3 < . . . be strictly increas-
ing.
If a sequence x1 , x2 , . . . has a subsequence x′1 , x′2 , . . . converging to x∗ ,
then we say the sequence x1 , x2 , . . . subconverges to x∗ . For example, the
sequence 1, −1, 1, −1, 1, −1, . . . subconverges to 1 and also subconverges
to −1, as can be seen by considering the odd-indexed terms and the even-
indexed terms separately.
Note a subsequence of an error sequence converging to zero is also an
error sequence converging to zero. As a consequence, if a sequence converges
to x∗ , then every subsequence of the sequence converges to x∗ . From this
it follows that the sequence 1, −1, 1, −1, 1, −1, . . . does not converge to
anything: it bounces back and forth between ±1.
$$I_0 \supset I_1 \supset I_2 \supset \cdots,$$
and the infimum
$$x^* = \inf_{n \ge 1} x^*_n$$
must exist.
A minimizer is a vector x∗ satisfying f (x∗ ) = m. As we saw above, a
minimizer may or may not exist, and, when the minimizer does exist, there
may be several minimizers.
A minimizing sequence for f (x) over S is a sequence x1 , x2 , . . . of vectors
in S such that the corresponding values f (x1 ), f (x2 ), . . . are decreasing and
converge to m = inf S f (x) as n → ∞. In other words, x1 , x2 , . . . is a
minimizing sequence for f (x) over S if
$$f(x_1) \ge f(x_2) \ge \cdots$$
and
$$\inf_S f(x) = \inf_{n \ge 1} f(x_n).$$
or
0 < f (x1 ) − m < (f (x0 ) − m)/2.
Existence of Minimizers
If f (x) is continuous on Rd and S is a bounded set in Rd , then there
is a minimizer x∗,
$$f(x^*) = \inf_{x \,\in\, S} f(x). \tag{A.2.2}$$
[1] Joshua Akey, Genome 560: Introduction to Statistical Genomics, 2008. https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture1.pdf.
[2] Christopher M. Bishop, Pattern Recognition and Machine Learning, Information Sci-
ence and Statistics, Springer, 2006.
[3] Sébastien Bubeck, Convex Optimization: Algorithms and Complexity, Foundations
and Trends in Machine Learning, vol. 8, Now Publishers, 2015.
[4] Harald Cramér, Mathematical Methods of Statistics, Princeton University Press, 1946.
[5] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, Mathematics for Machine Learning, Cambridge University Press, 2020.
[6] J. L. Doob, Probability and Statistics, Transactions of the American Mathematical
Society 36 (1934), 759-775.
[7] R. A. Fisher, The conditions under which χ2 measures the discrepancy between ob-
servation and hypothesis, Journal of the Royal Statistical Society 87 (1924), 442-450.
[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
[9] Google, Machine Learning. https://developers.google.com/machine-learning.
[10] Robert M. Gray, Toeplitz and Circulant Matrices: A Review, Foundations and Trends
in Communications and Information Theory 2 (2006), no. 3, 155-239.
[11] T. L. Heath, The Works of Archimedes, Cambridge University, 1897.
[12] Omar Hijab, A Note on Binary Classifiers, Preprint (2024).
[13] Nikolai Janakiev, Classifying the Iris Data Set with Keras, 2018. https://janakiev.com/blog/keras-iris.
[14] Lily Jiang, A Visual Explanation of Gradient Descent Methods, 2020. https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c.
[15] J. W. Longley, An Appraisal of Least Squares Programs for the Electronic Computer
from the Point of View of the User, Journal of the American Statistical Association
62.319 (1967), 819-841.
[16] David G. Luenberger and Yinyu Ye, Linear and Nonlinear Programming, Springer,
2008.
[17] Ioannis Mitliagkas, Theoretical principles for deep learning, lecture notes, 2019. https://mitliagkas.github.io/ift6085-dl-theory-class-2019/.
[18] Marvin Minsky and Seymour Papert, Perceptrons, An Introduction to Computational
Geometry, MIT Press, 1988.
[19] Yurii Nesterov, Lectures on Convex Optimization, Springer, 2018.
[20] Roger Penrose, A generalized inverse for matrices, Proceedings of the Cambridge
Philosophical Society 51 (1955), 406-413.
[21] Boris Teodorovich Polyak, Some methods of speeding up the convergence of iteration
methods, USSR Computational Mathematics and Mathematical Physics 4(5) (1964),
1-17.
[22] Karl Pearson, On the criterion that a given system of deviations from the probable in
the case of a correlated system of variables is such that it can be reasonably supposed
to have arisen from random sampling, Philosophical Magazine Series 5 50:302 (1900),
157-175.
[23] Sebastian Raschka, PCA in three simple steps, 2015. https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html.
[24] Herbert Robbins and Sutton Monro, A Stochastic Approximation Method, The Annals
of Mathematical Statistics 22 (1951), no. 3, 400 – 407.
[25] Sheldon M. Ross, Probability and Statistics for Engineers and Scientists, Sixth Edi-
tion, Academic Press, 2021.
[26] Mark J. Schervish, Theory of Statistics, Springer, 1995.
[27] Stanford University, CS224N: Natural Language Processing with Deep Learning. https://web.stanford.edu/class/cs224n.
[28] Irène Waldspurger, Gradient Descent With Momentum, 2022. https://www.ceremade.dauphine.fr/~waldspurger/tds/22_23_s1/advanced_gradient_descent.pdf.
[29] Wikipedia, Logistic Regression, 2015. https://en.wikipedia.org/wiki/Logistic_regression.
[30] Stephen J. Wright and Benjamin Recht, Optimization for Data Analysis, Cambridge
University, 2022.
Python
*, 9, 16
all, 189
append, 189
def.angle, 24, 75
def.assign_clusters, 189
def.backward_prop, 361, 369, 413
def.ball, 62
def.chi2_independence, 323
def.confidence_interval, 295, 307, 312, 315
def.delta_out, 412
def.derivative, 368
def.dimension_staircase, 130
def.downstream, 413
def.ellipse, 51, 58
def.find_first_defect, 128
def.forward_prop, 360, 367, 408
def.gd, 424
def.goodness_of_fit, 320
def.H, 347
def.hexcolor, 10
def.incoming, 366, 407
def.J, 409
def.local, 410
def.nearest_index, 189
def.newton, 417
def.num_plates, 479
def.outgoing, 366, 407
def.pca, 182
def.pca_with_svd, 182
def.plot_cluster, 190
def.plot_descent, 418
def.poly, 444
def.project, 121
def.project_to_ortho, 122
def.pvalue, 272
def.random_batch_mean, 249
def.random_vector, 189
def.tensor, 32
def.train_nn, 426
def.ttest, 308
def.type2_error, 301, 309
def.update_means, 189
def.update_weights, 426
def.zero_variance, 108
def.ztest, 299
diag, 175
dict, 467
display, 150
enumerate, 183
floor, 167
import, 8
itertools.product, 62
join, 10
json.dumps, 468
json.loads, 468
keras
    datasets
        mnist.load_data, 4
lambda, 365
list, 7
matplotlib.pyplot.axes, 187
matplotlib.pyplot.contour, 58
matplotlib.pyplot.figure, 183
matplotlib.pyplot.grid, 7
matplotlib.pyplot.hist, 243
matplotlib.pyplot.imshow, 8, 9
matplotlib.pyplot.legend, 167
matplotlib.pyplot.meshgrid, 58
matplotlib.pyplot.plot, 45
matplotlib.pyplot.scatter, 7
matplotlib.pyplot.show, 7
matplotlib.pyplot.stairs, 130
matplotlib.pyplot.subplot, 183
matplotlib.pyplot.text, 168
matplotlib.pyplot.title, 347
matplotlib.pyplot.xlabel, 445
numpy.allclose, 83
numpy.amax, 445
numpy.amin, 445
numpy.arange, 58, 167
numpy.arccos, 24, 75
numpy.argsort, 182
numpy.array, 8, 65
numpy.column_stack, 86, 100
numpy.corrcoef, 54
numpy.cov, 47
numpy.cumsum, 180
numpy.degrees, 24
numpy.dot, 73
numpy.exp, 347
numpy.isclose, 158
numpy.linalg.eig, 144
numpy.linalg.eigh, 144, 180
numpy.linalg.inv, 84
numpy.linalg.matrix_rank, 128
numpy.linalg.norm, 21, 189
numpy.linalg.pinv, 121
numpy.linalg.svd, 175
numpy.linspace, 62
numpy.log, 347
numpy.mean, 14, 15
numpy.meshgrid, 62
numpy.outer, 323
numpy.pi, 347
numpy.random.binomial, 242
numpy.random.default_rng, 249
numpy.random.default_rng.shuffle, 249
numpy.random.normal, 271
numpy.random.random, 45
numpy.random.randn, 286
numpy.reshape, 179
numpy.roots, 43
numpy.row_stack, 69
numpy.shape, 65
numpy.sqrt, 24
pandas.DataFrame, 468
pandas.DataFrame.drop, 72
pandas.DataFrame.to_csv, 470