Math For Data Science
The ideas presented in the text are made concrete by interpreting them
in Python code. The standard Python data science packages are used, and
a Python index lists the functions used in the text. Because Python is used
to highlight concepts, the supporting code snippets are often written from
scratch, even when they don’t need to be.
Omar Hijab
Spring 2025
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The MNIST Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Averages and Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Two Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6 High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2 Linear Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.1 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.2 Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.3 Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.4 Span and Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
2.5 Zero Variance Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.6 Pseudo-Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.7 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.8 Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.9 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.1 Single-Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.2 Entropy and Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.3 Multi-Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
4.4 Back Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
5.2 Binomial Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.3 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
5.4 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
5.5 Chi-squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
5.6 Multinomial Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
6 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
6.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
6.2 Z-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
6.3 T -test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
6.4 Chi-Squared Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
A.1 Permutations and Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . 467
A.2 The Binomial Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
A.3 The Exponential Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
A.4 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
A.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
A.6 Asymptotics and Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
A.7 Existence of Minimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
A.8 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
List of Figures
3.25 Original and projections: n = 784, 600, 350, 150, 50, 10, 1. . . . . 195
3.26 The full MNIST dataset (2d projection). . . . . . . . . . . . . . . . . . . . 196
3.27 The Iris dataset (2d projection). . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
1.1 Introduction
from sklearn import datasets

iris = datasets.load_iris()
iris["feature_names"]
This returns
['sepal length','sepal width','petal length','petal width'].
To return the data and the classes, the code is
dataset = iris["data"]
labels = iris["target"]
dataset, labels
This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points
x1 , x2 , . . . , xN
If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus), and
Python (numpy, pandas, scipy, sympy, matplotlib). It may help to read the
code examples and the important math principles first, then dive
into details as needed.
To illustrate concepts and make them concrete as they are introduced, we use
Python code throughout. We run Python code in a jupyter notebook.
jupyter is an IDE, an integrated development environment. jupyter
supports many languages, including Python, Sage, Julia, and R. A useful
jupyter feature is the ability to measure the execution time of a jupyter cell
by including at the start of the cell
%%time
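For example, a minimal cell (an added illustration, not from the text) is

%%time
# the reported wall time is for the entire cell
sum(i*i for i in range(10**6))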
To avoid interfering with the Python packages the OS uses, carry out the above
steps within a venv, a virtual environment. Then several venvs may be set up
side-by-side, and, at any time, any venv may be deleted without impacting any
others, or the OS.
Exercises
def uniq(a):
return [x for i, x in enumerate(a) if x not in a[:i] ]
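For example, uniq([1, 2, 2, 3, 1]) returns [1, 2, 3]: duplicates are dropped, and the first occurrence of each element is kept.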
The MNIST2 dataset consists of 60,000 training images. Since this dataset is
for demonstration purposes, these images are coarse.
Each image consists of 28 × 28 = 784 pixels, and each pixel shading is a
byte, an integer between 0 and 255 inclusive. Therefore each image is a point
x in Rd = R784 . Attached to each image is its label, a digit 0, 1, . . . , 9.
We assume the dataset has been downloaded to your laptop as a CSV file
mnist.csv. Then each row in the file consists of the pixels for a single image.
Since the image’s label is also included in the row, each row consists of 785
integers. There are many sources and formats online for this dataset.
The code
# read_csv is from pandas
mnist = read_csv("mnist.csv").to_numpy()
mnist.shape,dataset.shape,labels.shape
returns
Fig. 1.4 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
2 The National Institute of Standards and Technology (NIST) is a physical sciences laboratory and non-regulatory agency of the United States Department of Commerce.
Here is an exercise. The top left image in Figure 1.4 is given by a 784-
dimensional point which is imported as an array pixels.
pixels = dataset[1].reshape((28,28))
grid()
scatter(2,3,s = 50)
show()
2. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.
pixels = dataset[1].reshape((28,28))
grid()
for i in range(28):
    for j in range(28):
        scatter(i, j, s=pixels[i,j])
show()
imshow(pixels, cmap="gray_r")
Depending on the numpy version, a computed scalar such as a mean may display as
np.float64(5.843333333333335)
rather than as the plain float
5.843333333333335
To obtain the simpler display, set
set_printoptions(legacy="1.25")
We end the section by discussing the Python import command. The last
code snippet assumed the star import from matplotlib.pyplot import *. It can
be rewritten as
plt.imshow(pixels, cmap="gray_r")
after import matplotlib.pyplot as plt, or as
imshow(pixels, cmap="gray_r")
after from matplotlib.pyplot import imshow. In the third version, only the
command imshow is imported. Which import style is used depends on the situation.
In this text, we usually use the first style, as it is visually lightest. To help
with online searches, in the Python index, Python commands are listed under
their full package path.
Exercises
Exercise 1.2.1 Run the code in this section on your laptop (all code is run
within jupyter).
Exercise 1.2.2 The first image in the MNIST dataset is an image of the
digit 5. What is the 43,120th image?
Exercise 1.2.3 Figure 1.6 is not oriented the same way as the top-left image
in Figure 1.4. Modify the code returning Figure 1.6 to match the top-left
image in Figure 1.4.
L = [x_1,x_2,...,x_N].
The set of all possible samples is the population, or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population
Or, the sample space consists of all integers and we take N = 5 samples from
this population
Or, the sample space consists of all rational numbers and we take N = 5
samples from this population
Or, the sample space consists of all Python strings and we take N = 5 samples
from this population
L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population
from random import choice

def hexcolor():
    chars = '0123456789abcdef'
    return "#" + ''.join([choice(chars) for _ in range(6)])
v = x − µ = (c − a, d − b).
Then µ is the tail of v, and x is the head of v. For example, the vector joining
µ = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean µ of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − µ, k = 1, 2, . . . , N .
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,
(Figure: the points x1, . . . , x5, their mean µ, and the centered vectors v1, . . . , v5 based at the origin 0.)
Let us go back to vector spaces. When we work with vector spaces, numbers
are referred to as scalars, because 2v, 3v, −v, . . . are scaled versions of v.
When we multiply a vector v by a scalar r to get the scaled vector rv, we call
this vector scaling. This is to distinguish this multiplication from the inner
and outer products we see below.
For example, the samples in the list L1 form a vector space, the set of all
real numbers R. Even though one can add integers, the set Z of all integers
does not form a vector space because multiplying an integer by 1/2 does
not result in an integer. The set Q of all rational numbers (fractions) is a
vector space, so L3 is a sampling from a vector space. The set of strings is
not a vector space because even though one can add strings, addition is not
commutative:
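# for example, with any two distinct strings
'abc' + 'xyz' == 'xyz' + 'abc'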
returns False.
the average is
µ = (1.23 + 4.29 − 3.3 + 555)/4 = 139.305.
In Python, averages are computed using numpy.mean. For a scalar dataset,
the code
dataset = array([1.23,4.29,-3.3,555])
mu = mean(dataset)
mu
For a dataset of points in the plane, the average is computed similarly. For example, consider
dataset = array([[1,3,-2,0],[2,4,11,66]])
Here the x-components of the four points are the first row, and the y-
components are the second row. With this, the code
mu = mean(dataset, axis=1)
mu
Had the points been arranged as rows of an N × d array instead, the mean would be computed with
mean(dataset, axis=0)
N = 20
def row(N): return array([random() for _ in range(N) ])
# 2xN array
dataset = array([ row(N), row(N) ])
mu = mean(dataset,axis=1)
grid()
scatter(*mu)
scatter(*dataset)
show()
H, H, T, T, T, H, T, . . .
If we add the vectorized samples f (x) using vector addition in the plane
(§1.4), the first component of the mean (1.3.2) is an average of ones and
zeroes, with ones matching heads, resulting in the proportion p̂ of heads.
Similarly, the second component is the proportion of tails. Hence (1.3.2) is
the pair (p̂, 1 − p̂), where p̂ is the proportion of heads in N tosses.
More generally, if the label of a sample falls into d categories, we may let
f (x) be a vector with d components consisting of zeros and ones, according
to the category of the sample. This is one-hot encoding (see §2.4 and §7.6).
For example, suppose we take a sampling of size N from the Iris dataset,
and we look at the classes of the resulting samples. Since there are three
classes, in this case, we can define f (x) to equal (1, 0, 0), (0, 1, 0), or (0, 0, 1),
according to which class x belongs to. Then the mean (1.3.2) is a triple
p̂ = (p̂1 , p̂2 , p̂3 ) of proportions of each class in the sampling. Of course, p̂1 +
p̂2 + p̂3 = 1, so p̂ is a probability vector (§5.6).
(Figure: the vectorization f maps the sample space into a vector space.)
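For example, here is a sketch (added here; it assumes sklearn's load_iris and numpy's eye) that one-hot encodes the Iris classes and averages them, recovering the class proportions p̂ = (1/3, 1/3, 1/3):

from numpy import mean, eye
from sklearn import datasets

labels = datasets.load_iris()["target"]   # class labels 0, 1, 2
onehot = eye(3)[labels]                   # one-hot encoded samples f(x)
mean(onehot, axis=0)                      # each proportion is 1/3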
When there are only two possibilities, two classes, it’s simpler to encode
the classes as follows,
f (x) = 1 if x is heads, and f (x) = 0 if x is tails.
Even when the samples are already scalars or vectors, we may still want
to vectorize them. For example, suppose x1 , x2 , . . . , xN are the prices of a
sample of printers from across the country. Then the average price (1.3.1) is
well-defined. Nevertheless, we may set
f (x) = 1 if x costs more than $100, and f (x) = 0 if x costs $100 or less.
Then the mean (1.3.2) is the sample proportion p̂ of printers that cost more
than $100.
In §6.4, we use vectorization to derive the chi-squared tests.
Exercises
Exercise 1.3.2 What is the average petal length in the Iris dataset?
Exercise 1.3.3 What is the average shading of the pixels in the first image
in the MNIST dataset?
x = arange(0,1,.2)
plot(x,f(x))
scatter(x,f(x))
We start with the geometry of vectors in two dimensions. This is the cartesian
plane R2 , also called 2-dimensional real space. The plane R2 is a vector space,
in the sense described in the previous section.
(Figure: vectors in the plane, illustrating a vector v and the addition of two vectors v1 and v2.)
Addition of vectors
v1 + v2 = (x1 + x2 , y1 + y2 ). (1.4.1)
Because points and vectors are interchangeable, the same formula is used
for addition P + P ′ of points P and P ′ .
This addition is the same as combining their shadows as in Figure 1.14.
In Python, lists and tuples do not add this way. Lists and tuples have to first
be converted into numpy arrays.
v1 = (1,2)
v2 = (3,4)
v1 + v2 == (1+3,2+4) # returns False
v1 = [1,2]
v2 = [3,4]
v1 + v2 == [1+3,2+4] # returns False
20 CHAPTER 1. DATASETS
v1 = array([1,2])
v2 = array([3,4])
v1 + v2 == array([1+3,2+4]) # returns True
Scaling of vectors
v = array([1,2])
3*v == array([3,6]) # returns True
(Figure 1.17: the scalings tv of a vector v form a line through the origin 0.)
Given a vector v, the scalings tv of v form a line passing through the origin
0 (Figure 1.17). This line is the span of v (more on this in §2.4). Scalings tv
of v are also called multiples of v.
If t and s are real numbers, it is easy to check t(sv) = (ts)v.
Thus scaling v by s, and then scaling the result by t, has the same effect as
scaling v by ts, in a single step. Because points and vectors are interchange-
able, the same formula tP is used for scaling points P by t.
We set −v = (−1)v, and define subtraction of vectors by
v1 − v2 = v1 + (−v2 ).
v1 = array([1,2])
v2 = array([3,4])
v1 - v2 == array([1-3,2-4]) # returns True
Subtraction of vectors
v1 − v2 = (x1 − x2 , y1 − y2 ) (1.4.2)
Distance Formula

The magnitude |v| of a vector v = (x, y) is its distance √(x² + y²) from the origin 0. In Python,

v = array([1,2])
norm(v) == sqrt(5) # returns True
(Figure 1.16: a vector (x, y) at distance r from the origin, making angle θ with the horizontal axis.)
The unit circle consists of the vectors which are distance 1 from the origin
0. When v is on the unit circle, the magnitude of v is 1, and we say v is a
1.4. TWO DIMENSIONS 23
unit vector. In this case, the line formed by the scalings of v intersects the
unit circle at ±v (Figure 1.17).
When v is a unit vector, r = 1, and (Figure 1.16) v = (x, y) = (cos θ, sin θ).
The unit circle intersects the horizontal axis at (1, 0), and (−1, 0), and
intersects the vertical axis at (0, 1), and (0, −1). These four points are equally
spaced on the unit circle (Figure 1.17).
By the distance formula, a vector v = (x, y) is a unit vector when
x2 + y 2 = 1.
More generally, any circle with center Q = (a, b) and radius r consists of
points (x, y) satisfying
(x − a)2 + (y − b)2 = r2 .
Let R be a point on the unit circle, and let t > 0. Since |tR| = t|R| = t, the scaled
point tR is on the circle with center (0, 0) and radius t. Moreover, it follows
a point P is on the circle of center Q and radius r iff P = Q + rR for some
R on the unit circle.
Given this, it is easy to check
|(1/r)v| = (1/r)|v| = (1/r)r = 1,
Now we discuss the dot product in two dimensions. We have two vectors
v1 and v2 in the plane R2 , with v1 = (x1 , y1 ) and v2 = (x2 , y2 ). The dot
product of v1 and v2 is given algebraically as
v1 · v2 = x1 x2 + y1 y2 ,
or geometrically as
v1 · v2 = |v1 | |v2 | cos θ,
where θ is the angle between v1 and v2 . To show that these are the same,
below we derive the dot product identity (1.4.5).
(Figure 1.18: the triangle formed by the vectors v1, v2, and v2 − v1.)
v1 = array([1,2])
v2 = array([3,4])
dot(v1,v2) == 1*3 + 2*4 # returns True
As a consequence of the dot product identity, we have code for the angle
between two vectors (there is also a built-in numpy.angle).
def angle(u,v):
a = dot(u,v)
b = dot(u,u)
c = dot(v,v)
theta = arccos(a / sqrt(b*c))
return degrees(theta)
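For example, angle(array([1,0]), array([0,1])) returns 90.0.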
Cauchy-Schwarz Inequality

The dot product of two vectors is absolutely less than or equal to the product of their lengths, |v1 · v2| ≤ |v1| |v2|.
To derive the dot product identity, we first derive Pythagoras' theorem for
general triangles (Figure 1.19),
c² = a² + b² − 2ab cos θ. (1.4.8)
(Figure 1.19: a triangle with sides a, b, c; the altitude f splits the side b into pieces d and e.)
a2 = d2 + f 2 and c2 = e2 + f 2 .
Also b = e + d, so e = b − d, so
e2 = (b − d)2 = b2 − 2bd + d2 .
c² = e² + f² = (b − d)² + f² = f² + d² + b² − 2db = a² + b² − 2ab cos θ,
where the last step uses f² + d² = a² and d = a cos θ,
so we get (1.4.8).
Next, connect Figures 1.18 and 1.19 by noting a = |v2 | and b = |v1 | and
c = |v2 − v1 |. By (1.4.6), expanding c² = |v2 − v1|² yields
c² = a² + b² − 2(x1 x2 + y1 y2). (1.4.9)
Comparing the terms in (1.4.8) and (1.4.9), we arrive at (1.4.5). This com-
pletes the proof of the dot product identity (1.4.5).
(Figure: the perpendicular vector v⊥, the points ±P⊥ on the unit circle, and squares with side lengths a, b, c.)
v · v ⊥ = (x, y) · (−y, x) = 0.
From Figure 1.21, we see points P and P ′ on the unit circle satisfy P ·P ′ = 0
iff P ′ = ±P ⊥ .
ax + by = 0, cx + dy = 0. (1.4.10)
Homogeneous System
ax + by = e, cx + dy = f, (1.4.13)
(x, y) = (e/a, 0), (x, y) = (0, e/b), (x, y) = (f /c, 0), (x, y) = (0, f /d)
x = (de − bf)/(ad − bc), y = (af − ce)/(ad − bc). (1.4.15)
Putting all this together, we conclude
Inhomogeneous System
In §2.9, we will understand the three cases in terms of the rank of A equal
to 2, 1, or 0.
In this case, we call u and v the rows of A. On the other hand, A may be
written as
A = \begin{pmatrix} a & c \\ b & d \end{pmatrix} = (u v), u = (a, b), v = (c, d).
In this case, we call u and v the columns of A. Many texts then write u and
v as
u = \begin{pmatrix} a \\ b \end{pmatrix}, v = \begin{pmatrix} c \\ d \end{pmatrix}. (1.4.16)
We do not do this when u and v are on their own, because then they are just
vectors. We only do this when u and v are being multiplied with matrices, or
are rows or columns of matrices.
In fact, when we do write (1.4.16), we are thinking of u and v as 2 × 1
matrices, not as vectors. This shows there are at least three ways to think
about a matrix: as rows, or as columns, or as a single block.
The simplest operations on matrices are addition and scaling. Addition is
as follows,
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, A′ = \begin{pmatrix} a′ & b′ \\ c′ & d′ \end{pmatrix} =⇒ A + A′ = \begin{pmatrix} a + a′ & b + b′ \\ c + c′ & d + d′ \end{pmatrix},
AA′ = \begin{pmatrix} u · u′ & u · v′ \\ v · u′ & v · v′ \end{pmatrix}.
U(θ)U(θ′) = \begin{pmatrix} cos θ & −sin θ \\ sin θ & cos θ \end{pmatrix} \begin{pmatrix} cos θ′ & −sin θ′ \\ sin θ′ & cos θ′ \end{pmatrix} = \begin{pmatrix} cos(θ + θ′) & −sin(θ + θ′) \\ sin(θ + θ′) & cos(θ + θ′) \end{pmatrix} = U(θ + θ′).
AA−1 = I = A−1 A.
(AB)−1 = B −1 A−1 .
(AB)t = B t At .
Ax = b
is
x = A−1 b,
since
Ax = AA−1 b = Ib = b.
With this, we can rewrite (1.4.13) as
\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} e \\ f \end{pmatrix}.
Orthogonal Matrices
Here we wrote u ⊗ v as a single block, and also in terms of rows and columns.
If we do this the other way, we get
v ⊗ u = \begin{pmatrix} ca & cb \\ da & db \end{pmatrix},
so
(u ⊗ v)t = v ⊗ u.
When u = v, u ⊗ v = v ⊗ v is a symmetric matrix.
Here is code for tensor.
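# the same helper reappears in §1.5 below
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])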
There is no need to use this, since the numpy built-in outer does the same
job,
A = outer(u,v)
det(u ⊗ v) = 0.
This is true no matter what the vectors u and v are. Check this yourself.
By definition of u ⊗ v,
so
v · Qv = (x, y) · (ax + by, bx + cy) = ax2 + 2bxy + cy 2 .
This is the quadratic form associated to the matrix Q.
Quadratic Form
If
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix} and v = (x, y),
then
v · Qv = ax² + 2bxy + cy².
Q = I =⇒ v · Qv = x² + y².
When Q is diagonal,
Q = \begin{pmatrix} a & 0 \\ 0 & c \end{pmatrix} =⇒ v · Qv = ax² + cy².
If Q = u ⊗ u, then v · Qv = (u · v)².
Exercises
ax + by = c, −bx + ay = d.
Exercise 1.4.2 Let u = (1, a), v = (b, 2), and w = (3, 4). Solve
u + 2v + 3w = 0
for a and b.
Exercise 1.4.3 Let u = (1, 2), v = (3, 4), and w = (5, 6). Find a and b such
that
au + bv = w.
Exercise 1.4.4 Let P be a nonzero point in the plane. What is (P⊥)⊥?
Exercise 1.4.5 Let A = \begin{pmatrix} 8 & −8 \\ −7 & −3 \end{pmatrix} and B = \begin{pmatrix} 3 & −2 \\ 2 & −2 \end{pmatrix}. Compute AB and BA.
Exercise 1.4.6 Let A = \begin{pmatrix} 9 & 2 \\ −36 & −8 \end{pmatrix}. Find a nonzero 2 × 2 matrix B satisfying AB = 0.
Exercise 1.4.8 If u = (a, b) and v = (c, d) and A = u ⊗ v, use (1.4.19) to
compute A2 .
Exercise 1.4.11 With Q and V 2 × 2 matrices, Q invertible, and t scalar, show
det(Q + tV)/det(Q) = 1 + t · trace(Q⁻¹V) + t² det(Q⁻¹V).
Exercise 1.4.12 What is the trace of A = \begin{pmatrix} 9 & 2 \\ −36 & −8 \end{pmatrix}?
u ∧ v = u ⊗ v − v ⊗ u.
Exercise 1.4.15 Calculate the areas of the triangles and the squares in Fig-
ure 1.21. From that, deduce Pythagoras’s theorem c2 = a2 + b2 .
Above |x| stands for the length of the vector x, or the distance of the point
x to the origin. When d = 2 and we are in two dimensions, this was defined
in §1.4. For general d, this is defined in §2.1. In this section we continue to
focus on two dimensions d = 2.
The mean or sample mean is
µ = (1/N) ∑_{k=1}^N x_k = (x1 + x2 + · · · + xN)/N. (1.5.1)
Point of Best-fit
The mean is the point of best-fit: The mean minimizes the mean-
square distance to the dataset (Figure 1.22).
Fig. 1.22 MSD for the mean (green) versus MSD for a random point (red).
Using (1.4.6),
so we have
M SD(x) = M SD(µ) + |x − µ|2 ,
which is clearly ≥ M SD(µ), deriving the above result.
Here is the code for Figure 1.22.
N, d = 20, 2
# d x N array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
mu = mean(dataset,axis=1)
p = array([random(),random()])
for v in dataset.T:
plot([mu[0],v[0]],[mu[1],v[1]],c='green')
plot([p[0],v[0]],[p[1],v[1]],c='red')
scatter(*mu)
scatter(*dataset)
grid()
show()
The variance of the dataset is
q = (1/N) ∑_{k=1}^N (x_k − µ)². (1.5.2)
The square root of the variance is the standard deviation σ = √q.
If a scalar dataset has mean zero and variance one, it is standard. Every
dataset x1 , x2 , . . . , xN may be standardized by first centering the dataset,
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,
then dividing each vk by the standard deviation σ.
For a dataset of points x1 , x2 , . . . , xN in the plane, with mean µ, center the dataset the same way,
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ.
Then the variance is the matrix (see §1.4 for tensor product)
Q = (v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN)/N. (1.5.3)
Since v ⊗ v is a symmetric matrix, the variance of a dataset is a symmetric
matrix. Below we see the variance is also nonnegative, in the sense v · Qv ≥ 0
for all vectors v. Later we see how to standardize vector datasets.
When i ̸= j, the entries Q = (qij ) of the variance matrix are called covari-
ances: qij is the covariance between the i-th feature and the j-th feature.
For example, suppose N = 5 and
x1 = (1, 2), x2 = (3, 4), x3 = (5, 6), x4 = (7, 8), x5 = (9, 10). (1.5.4)
Since
(±4, ±4) ⊗ (±4, ±4) = \begin{pmatrix} 16 & 16 \\ 16 & 16 \end{pmatrix},
(±2, ±2) ⊗ (±2, ±2) = \begin{pmatrix} 4 & 4 \\ 4 & 4 \end{pmatrix},
(0, 0) ⊗ (0, 0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},
Notice
Q = 8(1, 1) ⊗ (1, 1),
which, as we see below (§2.5), reflects the fact that the points of this dataset
lie on a line. Here the line is y = x + 1. Here is code from scratch for the
variance (matrix) of a dataset.
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])
N, d = 20, 2
# N x d array
dataset = array([ [random(),random()] for _ in range(N) ])
mu = mean(dataset,axis=0)
# center dataset
vectors = dataset - mu
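The snippet stops short of forming Q; presumably it continues along these lines, using the tensor helper above:

# variance matrix: average of the tensor products of the centered vectors
Q = sum([ tensor(v,v) for v in vectors ]) / N
Q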
N, d = 20, 2
# d x N array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
Q = cov(dataset,bias=True)
Q
This returns the same result as the previous code for Q. Notice that here there is
no need to compute the mean; it is taken care of automatically. The option
bias=True indicates division by N , returning the biased variance. To return
the unbiased variance and divide by N −1, change the option to bias=False,
or remove it, since bias=False is the default.
From (1.4.18), if Q is the variance matrix (1.5.3),
trace(Q) = (1/N) ∑_{k=1}^N |x_k − µ|². (1.5.5)
# dataset is d x N array
Q = cov(dataset,bias=True)
Q.trace()
P b = (b · u)u.
These vectors are all multiples of u, as they should be. The projected dataset
is two-dimensional.
Alternately, discarding u and retaining the scalar coefficients, we have the
one-dimensional dataset
v1 · u, v2 · u, . . . , vN · u.
# dataset is d x N array
Q = cov(dataset,bias=True)
This shows that the dataset lies on the line passing through m and perpen-
dicular to (1, −1).
(v − µ) · Q(v − µ) = k
(v − µ) · Q−1 (v − µ) = k
Fig. 1.24 Unit variance ellipses (blue) and unit inverse variance ellipses (red) with µ = 0.
If we write v = (x, y) for a vector in the plane, the variance ellipse equation
centered at µ = 0 is
v · Qv = ax2 + 2bxy + cy 2 = k.
def ellipse(Q,mu,padding=.5,levels=[1],render="var"):
    grid()
    scatter(*mu,c="red",s=5)
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d,e = mu
    delta = .01
    x = arange(d-padding,d+padding,delta)
    y = arange(e-padding,e+padding,delta)
    x, y = meshgrid(x, y)
    if render == "var" or render == "both":
        # matrix_text(Q,mu,padding,'blue')
        eq = a*(x-d)**2 + 2*b*(x-d)*(y-e) + c*(y-e)**2
        contour(x,y,eq,levels=levels,colors="blue",linewidths=.5)
    if render == "inv" or render == "both":
        draw_major_minor_axes(Q,mu)
        Q = inv(Q)
        # matrix_text(Q,mu,padding,'red')
        A, B, C = Q[0,0],Q[0,1],Q[1,1]
        eq = A*(x-d)**2 + 2*B*(x-d)*(y-e) + C*(y-e)**2
        contour(x,y,eq,levels=levels,colors="red",linewidths=.5)
With this code, ellipse(Q,mu) returns the unit variance ellipse in the unit
square centered at µ. The codes for the functions draw_major_minor_axes
and matrix_text are below.
Depending on whether render is var, inv, or both, the code renders the
variance ellipse (blue), the inverse variance ellipse (red), or both. The code
renders several ellipses, one for each level in the list levels. The default is
levels = [1], so the unit ellipse is returned. Also padding can be adjusted
to enlarge the plot.
The code for Figure 1.24 is
mu = array([0,0])
Q = array([[9,0],[0,4]])
ellipse(Q,mu,padding=4,render="both")
show()
Q = array([[9,2],[2,4]])
ellipse(Q,mu,padding=4,render="both")
show()
To use TEX to display the matrices in Figure 1.24, insert the function matrix_text below, after setting
rcParams['text.usetex'] = True
rcParams['text.latex.preamble'] = r'\usepackage{amsmath}'
def matrix_text(Q,mu,padding,color):
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d,e = mu
    valign = e + 3*padding/4
    if color == 'blue': halign = d - padding/2; tex = "$Q="
    else: halign = d; tex = "$Q^{-1}="
    # r"..." means raw string
    tex += r"\begin{pmatrix}" + str(round(a,2)) + "&" + str(round(b,2))
    tex += r"\\" + str(round(b,2)) + "&" + str(round(c,2))
    tex += r"\end{pmatrix}$"
    return text(halign,valign,tex,fontsize=15,color=color)
Fig. 1.25 Variance ellipses (blue) and inverse variance ellipses (red) for a dataset.
N = 50
# N x d array
dataset = array([ [random(),random()] for _ in range(N) ])
Q = cov(dataset.T,bias=True)
mu = mean(dataset,axis=0)
scatter(*dataset.T,s=5)
ellipse(Q,mu,render="var",padding=.5,levels=[.005,.01,.02])
show()
scatter(*dataset.T,s=5)
ellipse(Q,mu,render="inv",padding=.5,levels=[.5,1,2])
show()
x1 , x2 , . . . , xN , and y1 , y2 , . . . , yN .
Suppose the mean of this dataset is µ = (µx , µy ). Then, by the formula for
tensor product, the variance matrix is
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},
where
a = (1/N) ∑_{k=1}^N (xk − µx)², b = (1/N) ∑_{k=1}^N (xk − µx)(yk − µy), c = (1/N) ∑_{k=1}^N (yk − µy)².
From this, we see a is the variance of the x-features, and c is the variance
of y-features. We also see b is a measure of the correlation between the x and
y features.
Standardizing the dataset means to center the dataset and to place the x
and y features on the same scale. For example, the x-features may be close
to their mean µx , resulting in a small x variance a, while the y-features may
be spread far from their mean µy , resulting in a large y variance c.
When this happens, the different scales of the x's and y's distort the relation
between them, and b may not accurately reflect the correlation. To correct
for this, we center and re-scale:
x1 , x2 , . . . , xN → x′1 = (x1 − µx)/√a, x′2 = (x2 − µx)/√a, . . . , x′N = (xN − µx)/√a,
and
y1 , y2 , . . . , yN → y′1 = (y1 − µy)/√c, y′2 = (y2 − µy)/√c, . . . , y′N = (yN − µy)/√c.
This results in a new dataset v1 = (x′1, y′1), v2 = (x′2, y′2), . . . , vN = (x′N, y′N)
that is centered,
(v1 + v2 + · · · + vN)/N = 0,
with each feature standardized to have unit variance,
(1/N) ∑_{k=1}^N (x′k)² = 1, (1/N) ∑_{k=1}^N (y′k)² = 1.
where
ρ = (1/N) ∑_{k=1}^N x′k y′k = b/√(ac).
Fig. 1.26 Unit variance ellipse and unit inverse variance ellipse with standard Q.
When ρ = ±1, the dataset samples are perfectly correlated and lie on
a line passing through the mean. When ρ = 1, the line has slope 1, and
when ρ = −1, the line has slope −1. When ρ = 0, the dataset samples are
completely uncorrelated and are considered two independent one-dimensional
datasets.
In numpy, the correlation matrix Q′ is returned by
# dataset is d x N array
corrcoef(dataset)
u · Qu = max_{|v|=1} v · Qv.
Since the sine function varies between +1 and −1, we conclude the pro-
jected variance varies between
1 − ρ ≤ v · Qv ≤ 1 + ρ,
and
θ = π/4, v+ = (1/√2, 1/√2) =⇒ v+ · Qv+ = 1 + ρ,
θ = 3π/4, v− = (−1/√2, 1/√2) =⇒ v− · Qv− = 1 − ρ.
Thus the best-aligned vector v+ is at 45◦ , and the worst-aligned vector is at
135◦ (Figure 1.26).
Actually, the above is correct only if ρ > 0. When ρ < 0, it’s the other
way. The correct answer is
1 − |ρ| ≤ v · Qv ≤ 1 + |ρ|,
Fig. 1.27 Positively and negatively correlated datasets (unit inverse ellipses).
Here are two randomly generated datasets. The dataset on the left in
Figure 1.27 is positively correlated. Its mean and variance are
µ = (0.53626891, 0.54147513), Q = \begin{pmatrix} 0.08016526 & 0.01359483 \\ 0.01359483 & 0.08589097 \end{pmatrix}.
The dataset on the right in Figure 1.27 is negatively correlated. Its mean and variance are
µ = (0.46979642, 0.48347168), Q = \begin{pmatrix} 0.08684941 & −0.00972569 \\ −0.00972569 & 0.09409118 \end{pmatrix}.
λ− ≤ v · Qv ≤ λ+ , |v| = 1.
If the inverse variance ellipse is not a circle, then Q is not a multiple of the
identity, and either v+ or w+ is nonzero. If v+ ̸= 0, v+ is the best-aligned
vector. If v+ = 0, w+ is the best-aligned vector.
If the inverse variance ellipse is not a circle, then Q is not a multiple of the
identity, and either v− or w− is nonzero. If v− ̸= 0, v− is the worst-aligned
vector. If v− = 0, w− is the worst-aligned vector.
If Q is a multiple of the identity, then any vector is best-aligned and worst-
aligned. All this follows from solutions of homogeneous 2×2 systems (1.4.10).
The general d × d case is in §3.2. For the 2 × 2 case discussed here, see the
exercises at the end of §3.2.
The code for rendering the major and minor axes of the inverse variance
ellipse uses (1.5.6) and (1.5.7),
def draw_major_minor_axes(Q,mu):
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d, e = mu
    label = { 1:"major", -1:"minor" }
    for pm in [1,-1]:
        lamda = (a+c)/2 + pm * sqrt(b**2 + (a-c)**2/4)
        sigma = sqrt(lamda)
        lenv = sqrt(b**2 +(a-lamda)**2)
        lenw = sqrt(b**2 +(c-lamda)**2)
        if lenv: deltaX, deltaY = b/lenv, (a-lamda)/lenv
        elif lenw: deltaX, deltaY = (lamda-c)/lenw, b/lenw
        elif pm == 1: deltaX, deltaY = 1, 0
        else: deltaX, deltaY = 0, 1
        axesX = [d+sigma*deltaX,d-sigma*deltaX]
        axesY = [e-sigma*deltaY,e+sigma*deltaY]
        plot(axesX,axesY,linewidth=.5,label=label[pm])
    legend()
Exercises
d = 10
# 100 x 2 array
dataset = array([ array([i+j,j]) for i in range(d) for j in range(d) ])
Compute the mean and variance, and plot the dataset and the mean.
Exercise 1.5.2 Let the dataset be the petal lengths against the petal widths
in the Iris dataset. Compute the mean and variance, and plot the dataset and
the mean.
Exercise 1.5.3 Project the dataset in Exercise 1.5.1 onto the line through
the vector (1, 2). What is the projected dataset? What is the reduced dataset?
Exercise 1.5.4 Project the dataset in Exercise 1.5.2 onto the line through
the vector (1, 2). What is the projected dataset? What is the reduced dataset?
Exercise 1.5.5 Plot the variance ellipse and inverse variance ellipses of the
dataset in Exercise 1.5.1.
Exercise 1.5.6 Plot the variance ellipse and inverse variance ellipses of the
dataset in Exercise 1.5.2.
Exercise 1.5.7 Plot the dataset in Exercise 1.5.1 together with its mean
and the line through the vector of best fit.
Exercise 1.5.8 Plot the dataset in Exercise 1.5.2 together with its mean
and the line through the vector of best fit.
Exercise 1.5.9 Standardize the dataset in Exercise 1.5.1. Plot the stan-
dardized dataset. What is the correlation matrix?
Exercise 1.5.10 Standardize the dataset in Exercise 1.5.2. Plot the stan-
dardized dataset. What is the correlation matrix?
Exercise 1.5.11 Let Q = \begin{pmatrix} a & b \\ b & a \end{pmatrix}. Show Q is nonnegative when a ≥ |b|. (Compute v · Qv with v = (cos θ, sin θ) as in the text.)
Although not used in later material, this section is here to boost intuition
about high dimensions. Draw four disks inside a square, and a fifth disk in
the center.
In Figure 1.29, the edge-length of the square is 4, and the radius of each
blue disk is 1. Draw the diagonal of the square. Then the diagonal passes
through two blue disks.
Since the length of the diagonal of the square is 4√2, and the diameters
of the two blue disks add up to 4, the portions of the diagonal outside the blue
disks add up to 4√2 − 4. Hence the radius of the red disk is
(1/4)(4√2 − 4) = √2 − 1.
In three dimensions, draw eight balls inside a cube, as in Figure 1.30, and
one ball in the center. Since the edge-length of the cube is 4, the radius of
each blue ball is 1. Since the length of the diagonal of the cube is 4√3, the
radius of the red ball is
(1/4)(4√3 − 4) = √3 − 1.
Now we repeat in d dimensions. Here the edge-length of the cube remains
4, the radius of each blue ball remains 1, and there are 2^d blue balls. Since
the length of the diagonal of the cube is 4√d, the same calculation results in
the radius of the red ball equal to r = √d − 1.
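For a quick numerical feel (an added illustration), r = √d − 1 already exceeds the blue radius 1 once d > 4:

from numpy import sqrt

for d in [2, 3, 4, 5, 100]:
    print(d, sqrt(d) - 1)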
# initialize figure
ax = axes()
# red disk
circle = Circle((2, 2), radius=sqrt(2)-1, color='red')
ax.add_patch(circle)
ax.set_axis_off()
ax.axis('equal')
show()
%matplotlib ipympl
from matplotlib.pyplot import *
from numpy import *
from itertools import product
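# assumed here (not shown in the excerpt): the unit-sphere mesh used by ball() below
u, v = meshgrid(linspace(0, 2*pi, 30), linspace(0, pi, 30))
x, y, z = cos(u)*sin(v), sin(u)*sin(v), cos(v)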
# initialize figure
ax = axes(projection="3d")
# render ball
def ball(a,b,c,r,color):
    return ax.plot_surface(a + r*x, b + r*y, c + r*z, color=color)
# red ball
ball(2,2,2,sqrt(3)-1,"red")
# cube grid
cube = ones((4,4,4),dtype=bool)
ax.voxels(cube, edgecolors='black',lw=.5,alpha=0)
ax.set_aspect("equal")
ax.set_axis_off()
show()
Ĝ = {(tx, 1 − t) : 0 ≤ t ≤ 1, x in G}.
Vol(Ĝ) = ∫_0^1 t^d Vol(G) dt = Vol(G) [t^{d+1}/(d+1)]_{t=0}^{t=1} = Vol(G)/(d + 1).
Thus
Vol(Ĝ) = Vol(G)/(d + 1).
Exercises
Exercise 1.6.1 Why is the diagonal length of the square 4√2?
Exercise 1.6.2 Why is the diagonal length of the cube 4√3?
Exercise 1.6.3 Why does dividing by 4 yield the red disk radius and the red
ball radius?
Exercise 1.6.4 Suspend the unit circle G : x2 +y 2 = 1 from its center. What
is the suspension Ĝ? Conclude area(unit disk) = length(unit circle)/2.
v = (t1 , t2 , . . . , td ).
The scalars are the components or the features of v. If there are d features,
we say the dimension of v is d. We call v a d-dimensional vector.
A point x is also a list of scalars, x = (t1 , t2 , . . . , td ). The relation between
points x and vectors v is discussed in §1.3. The set of all d-dimensional vectors
or points is d-dimensional space Rd .
In Python, we use numpy or sympy for vectors and matrices. In Python,
if L is a list, then numpy.array(L) or sympy.Matrix(L) return a vector or
matrix.
v = array([1,2,3])
v.shape
v = Matrix([1,2,3])
v.shape
The first v.shape returns (3,), and the second v.shape returns (3,1). In
either case, v is a 3-dimensional vector.
Vectors are added and scaled component by component: With
we have
together are the standard basis. Similarly, in Rd , we have the standard basis
e1 , e2 , . . . , ed .
# numpy vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])
A.shape
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.shape
B = array([u,v,w])
The transpose interchanges rows and columns: the rows of At are the columns
of A. In both numpy or sympy, the transpose of A is A.T.
A vector v may be written as a 1 × N matrix,
v = (t1 t2 . . . tN),
or as an N × 1 matrix (a column vector) with the same entries.
# 5x3 matrix
A = Matrix.hstack(u,v,w)
# column vector
b = Matrix([1,1,1,1,1])
# 5x4 matrix
M = Matrix.hstack(A,b)
In general, for any sympy matrix A, column vectors can be hstacked and
row vectors can be vstacked. For any matrix A, the code
returns True. Note we use the unpacking operator * to unpack the list, before
applying hstack.
In numpy, there is column_stack and row_stack, so the code
both return True. Here col refers to rows of At , hence refers to the columns
of A.
The number of rows is len(A), and the number of columns is len(A.T).
To access row i, use A[i]. To access column j, access row j of the transpose,
A.T[j]. To access the j-th entry in row i, use A[i,j].
In sympy, the number of rows in a matrix A is A.rows, and the number of
columns is A.cols, so
A.shape == (A.rows,A.cols)
A = zeros(2,3)
B = ones(2,2)
C = Matrix([[1,2],[3,4]])
D = B + C
E = 5 * C
F = eye(4)
A, B, C, D, E, F
returns
\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix}, \begin{pmatrix} 5 & 10 \\ 15 & 20 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.
A = diag(1,2,3,4)
B = diag(-1, ones(2, 2), Matrix([5, 7, 5]))
A, B
returns
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}, \begin{pmatrix} −1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 5 \\ 0 & 0 & 0 & 7 \\ 0 & 0 & 0 & 5 \end{pmatrix}.
It is straightforward to convert back and forth between numpy and sympy.
In the code
A = diag(1,2,3,4)
B = array(A)
C = Matrix(B)
Exercises
Exercise 2.1.1 A vector is one-hot encoded if all features are zero, except for
one feature which is one. For example, in R3 there are three one-hot encoded
vectors
(1, 0, 0), (0, 1, 0), (0, 0, 1).
A matrix is a permutation matrix if it is square and all rows and all columns
are one-hot encoded. How many 3 × 3 permutation matrices are there? What
about d × d?
2.2 Products
u · v = s1 t1 + s2 t2 + · · · + sd td . (2.2.1)
As in §1.4, we always have rows on the left, and columns on the right.
In Python,
u = array([1,2,3])
v = array([4, 5, 6])
u = Matrix([1,2,3])
v = Matrix([4, 5, 6])
sqrt(dot(v,v))
sqrt(v.T * v)
As in §1.4,
Dot Product
In two dimensions, this was equation (1.4.5) in §1.4. Since any two vectors
lie in a two-dimensional plane, this remains true in any dimension. More
precisely, (2.2.2) is taken as the definition of cos θ.
Based on this, we can compute the angle θ,
cos θ = (u · v)/(|u| |v|) = (u · v)/√((u · u)(v · v)).
def angle(u,v):
a = dot(u,v)
b = dot(u,u)
c = dot(v,v)
theta = arccos(a / sqrt(b*c))
return degrees(theta)
Cauchy-Schwarz Inequality
The dot product of two vectors is absolutely less than or equal to the product of their lengths, |u · v| ≤ |u| |v|.
|a + b| = (a + b) · v ≤ |a| + |b|.
Let A and B be two matrices. If the row dimension of A equals the column
dimension of B, the matrix-matrix product AB is defined. When this condition
holds, the entries in the matrix AB are the dot products of the rows of A with
the columns of B. In Python,
the code
A,B,dot(A,B)
A,B,A*B
returns
AB = \begin{pmatrix} 70 & 80 & 90 \\ 158 & 184 & 210 \end{pmatrix}.
Let A and B be matrices, and suppose the row dimension of A and the
column dimension of B both equal d. Then the matrix-matrix product AB
is defined. If A = (aij) and B = (bij), then we may write AB in summation notation as
(AB)_{ij} = ∑_{k=1}^d a_{ik} b_{kj}. (2.2.5)
When AB is square, summing the diagonal entries gives
trace(AB) = ∑_{i=1}^d (AB)_{ii} = ∑_{i=1}^d ∑_{k=1}^d a_{ik} b_{ki}.
dot(A,B).T == dot(B.T,A.T)
In terms of row vectors and column vectors, this is automatic. For example,
In Python,
dot(dot(A,u),v) == dot(u,dot(A.T,v))
dot(dot(A.T,u),v) == dot(u,dot(A,v))
As a consequence,
(u ⊗ v)ij = ui vj .
Then the identities (1.4.19) and (1.4.20) hold in general. Using the tensor
product, we have
Tensor Identity
AAt = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN . (2.2.10)
To derive this, let Q and Q′ be the symmetric matrices on the left and
right sides of (2.2.10). By Exercise 2.2.7, to establish (2.2.10), it is enough
to show x · Qx = x · Q′ x for every vector x. By (2.2.8),
At x = (v1 · x, v2 · x, . . . , vN · x).
Since |At x|2 is the sum of the squares of its components, this establishes
x · Qx = x · Q′ x, hence the result.
valid for any matrix A and vectors u, v with compatible shapes. The deriva-
tion of this identity is a simple calculation with components that we skip.
and
∥A∥² = trace(AᵗA) = trace(AAᵗ). (2.2.14)
By replacing A by At , the same results hold for rows.
Q = dot(vectors,vectors.T)/N
Q = cov(dataset,bias=True)
After downloading the Iris dataset as in §2.1, the mean, variance, and total variance are
µ = (5.84, 3.05, 3.76, 1.2),
Q = \begin{pmatrix} 0.68 & −0.04 & 1.27 & 0.51 \\ −0.04 & 0.19 & −0.32 & −0.12 \\ 1.27 & −0.32 & 3.09 & 1.29 \\ 0.51 & −0.12 & 1.29 & 0.58 \end{pmatrix}, 4.54.
x1 · ej , x2 · ej , . . . , xN · ej ,
consisting of the j-th feature of the samples. If qjj is the variance of this
scalar dataset, then q11 , q22 , . . . , qdd are the diagonal entries of the variance
matrix.
To standardize the dataset, we center it, and rescale the features to have
variance one, as follows. Let µ = (µ1 , µ2 , . . . , µd ) be the dataset mean. For
each sample point x = (t1 , t2 , . . . , td ), the standardized vector is
v = ((t1 − µ1)/√q11, (t2 − µ2)/√q22, . . . , (td − µd)/√qdd).
The entries of the corresponding correlation matrix Q′ = (q′ij) are then
q′ij = qij/√(qii qjj), i, j = 1, 2, . . . , d.
In Python (with StandardScaler imported from sklearn.preprocessing),
N, d = 10, 2
# Nxd array
dataset = array([ [random() for _ in range(d)] for _ in range(N) ])
# standardize dataset
standardized = StandardScaler().fit_transform(dataset)
Qcorr = corrcoef(dataset.T)
Qcov = cov(standardized.T,bias=True)
allclose(Qcov,Qcorr)
returns True.
Exercises
v = (1, 2, 3, . . . , n).
Let |v| = √(v · v) be the length of v. Then, for example, when n = 1, |v| = 1
and, when n = 2, |v| = √5. There is one other n for which |v| is a whole
number. Use Python to find it.
AB = u1 ⊗ v1 + u2 ⊗ v2 + · · · + ud ⊗ vd .
Let A be any matrix and b a vector. The goal is to solve the linear system
Ax = b. (2.3.1)
In this section, we use the inverse A−1 and the pseudo-inverse A+ to solve
(2.3.1).
Of course, the system (2.3.1) doesn’t even make sense unless
In what follows, we assume this equality is true and dimensions are appro-
priately compatible.
Even then, it’s very easy to construct matrices A and vectors b for which
the linear system (2.3.1) has no solutions at all! For example, take A the zero
matrix and b any non-zero vector. Because of this, we must take some care
when solving (2.3.1).
AB = I = BA. (2.3.2)
we have
(AB)−1 = B −1 A−1 .
Ax = b =⇒ x = A−1 b. (2.3.3)
Ax = A(A−1 b) = (AA−1 )b = Ib = b.
# solving Ax=b
x = A.inv() * b
# solving Ax=b
x = dot(inv(A) , b)
x+ = A+ b =⇒ Ax+ = b.
How do we use the above result? Given A and b, using Python, we compute
x = A+ b. Then we check, by multiplying in Python, equality of Ax and b.
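A minimal sketch of this check in numpy (assuming A and b are already defined as arrays):

from numpy import dot, allclose
from numpy.linalg import pinv

x = dot(pinv(A), b)      # the candidate x+ = A+ b
allclose(dot(A, x), b)   # True (up to rounding) exactly when Ax = b is solvable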
The rest of the section consists of examples of solving linear systems. The
reader is encouraged to work out the examples below in Python. However,
because some linear systems have more than one solution, and the implemen-
tations of Python on your laptop and on my laptop may differ, our solutions
may differ.
It can be shown that if the entries of A are integers, then the entries of A+
are fractions. This fact is reflected in sympy, but not in numpy, as the default
in numpy is to work with floats.
Let
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
# arrange as columns
A = column_stack([u,v,w])
pinv(A)
returns
A+ = (1/150) \begin{pmatrix} −37 & −20 & −3 & 14 & 31 \\ −10 & −5 & 0 & 5 & 10 \\ 17 & 10 & 3 & −4 & −11 \end{pmatrix}.
Alternatively, in sympy,
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.pinv()
For
b3 = (−9, −3, 3, 9, 10),
we have
x+ = A+ b3 = (1/15)(82, 25, −32).
However, for this x+ , we have
We solve
Bx = u, Bx = v, Bx = w
by constructing the candidates
B+ u, B+ v, B+ w,
Let
C = Aᵗ = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}
and let f = (0, −5, −10). By Exercise 2.6.8, C+ = (A+)ᵗ, so
C+ = (A+)ᵗ = (1/150) \begin{pmatrix} −37 & −10 & 17 \\ −20 & −5 & 10 \\ −3 & 0 & 3 \\ 14 & 5 & −4 \\ 31 & 10 & −11 \end{pmatrix}
and
x+ = C+ f = (1/50)(32, 35, 38, 41, 44).
Once we confirm equality of Cx+ and f , which is the case, we obtain a
solution x+ of Cx = f .
We solve
Dx = a, Dx = b, Dx = c, Dx = d, Dx = e,
D+ a, D+ b, D+ c, D+ d, D+ e,
x+ = (1, 0), x+ = (2, 1), x+ = (3, 2), x+ = (4, 3), x+ = (5, 4).
Exercises
Exercise 2.3.2 With R(d) as in Exercise 2.2.9, find the formula for the
inverse and pseudo-inverse of R(d), whichever exists. Here d = 1, 2, 3, . . . .
t1 v1 + t2 v2 + · · · + td vd . (2.4.1)
and let A be the matrix with columns u, v, w, as in (2.3.4). Let x be the vector
(r, s, t) = (1, 2, 3). Then an explicit calculation shows (do this calculation!)
the matrix-vector product Ax equals ru + sv + tw,
Ax = ru + sv + tw.
The code
returns
x = (t1 , t2 , . . . , td ).
Then
Ax = t1 v1 + t2 v2 + · · · + td vd , (2.4.2)
In other words,
t1 v1 + t2 v2 + · · · + td vd
of the vectors. For example, span(b) of a single vector b is the line through
b, and span(u, v, w) is the set of all linear combinations ru + sv + tw.
Span Definition I
S = span(v1 , v2 , . . . , vd ).
Span Definition II
span(v1 , v2 , . . . , vd ) = span(w1 , w2 , . . . , wN ).
Thus there are many choices of spanning vectors for a given span.
For example, let u, v, w be the columns of A in (2.3.4). Let ⊂ mean “is
contained in”. Then
since adding a third vector can only increase the linear combination possibil-
ities. On the other hand, since w = 2v − u, we also have
It follows that
span(u, v, w) = span(u, v).
Let A be a matrix. The column space of A is the span of its columns. For
A as in (2.3.4), the column space of A is span(u, v, w). The code
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
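# presumably the call here is sympy's columnspace (named again in §2.9)
A.columnspace()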
returns a minimal list of vectors spanning the column space of A. The column
rank of A is the length of the list, i.e. the number of vectors returned.
For example, for A as in (2.3.4), this code returns the list
[u, v] = [ \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{pmatrix}, \begin{pmatrix} 6 \\ 7 \\ 8 \\ 9 \\ 10 \end{pmatrix} ].
Ax = t1 v1 + t2 v2 + · · · + td vd .
By (2.4.3),
The column space of a matrix A consists of all vectors of the form Ax.
A vector b is in the column space of A when Ax = b has a solution.
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])
# orth is from scipy.linalg
orth(A)
For example, let b3 = (−9, −3, 3, 9, 10) and let Ā = (A, b3 ). Using Python,
check the column rank of Ā is 3. Since the column rank of A is 2, we conclude
b3 is not in the column space of A, so b3 is not a linear combination of u, v,
w.
When (2.4.6) holds, b is a linear combination of the columns of A. However,
(2.4.6) does not tell us which linear combination. According to (2.4.3), finding
the specific linear combination is equivalent to solving Ax = b.
then
(r, s, t) = re1 + se2 + te3 .
This shows the vectors e1 , e2 , e3 span R3 , or
R3 = span(e1 , e2 , e3 ).
e1 = (1, 0, 0, . . . , 0, 0)
e2 = (0, 1, 0, . . . , 0, 0)
e3 = (0, 0, 1, . . . , 0, 0) (2.4.7)
... = ...
ed = (0, 0, 0, . . . , 0, 1)
Then e1 , e2 , . . . , ed span Rd , so
Rd is a span.
span(a, b, c, d, e) = span(a, f ).
For any matrix, the row rank equals the column rank.
Because of this, we refer to this common number as the rank of the matrix.
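As a quick numerical illustration (added here, not from the text), numpy's matrix_rank returns the same number for a matrix and its transpose:

from numpy import array
from numpy.linalg import matrix_rank

M = array([[1,2,3],[4,5,6],[7,8,9]])
matrix_rank(M), matrix_rank(M.T)   # returns (2, 2)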
t1 v1 + t2 v2 + · · · + td vd = 0.
ru + sv + tw = 1u − 2v + 1w = 0 (2.4.8)
If r ̸= 0, then
u = −(s/r)v − (t/r)w.
If s ̸= 0, then
v = −(r/s)u − (t/s)w.
If t ̸= 0, then
w = −(r/t)u − (s/t)v.
Hence linear dependence of u, v, w means one of the three vectors is a linear
combination of the other two vectors.
In general, a vanishing non-trivial linear combination of v1 , v2 , . . . , vd , or
linear dependence of v1 , v2 , . . . , vd , is the same as saying one of the vectors
is a linear combination of the remaining vectors.
In terms of matrices,
A.nullspace()
This says the null space of A consists of all multiples of (1, −2, 1). Since the
code
[r,s,t] = A.nullspace()[0]
null_space(A)
A Versus At A
Let A be any matrix. The null space of A equals the null space of
At A.
|Ax|2 = Ax · Ax = x · At Ax = 0,
t1 v1 + t2 v2 + · · · + td vd = 0.
Take the dot product of both sides with v1 . Since the dot product of any two
distinct vectors is zero, and each vector has length one, we obtain
t1 = t1 v1 · v1 = t1 v1 · v1 + t2 v2 · v1 + · · · + td vd · v1 = 0.
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
C = A.T
C.nullspace()
u⊥ = {v : u · v = 0} . (2.4.9)
Ax = (v1 · x, v2 · x, . . . , vN · x).
v1 · x = 0, v2 · x = 0, . . . , vN · x = 0,
Every vector in the row space is orthogonal to every vector in the null
space,
Actually, the above paragraph only established the first identity. For the
second identity, we need to use (2.7.9), as follows
rowspace = (rowspace⊥)⊥ = nullspace⊥.
Since the row space is the orthogonal complement of the null space, and
the null space of A equals the null space of At A, we conclude
A Versus At A
Let A be any matrix. Then the row space of A equals the row space
of At A.
Now replace A by At in this last result. Since the row space of At equals
the column space of A, and AAt is symmetric, we also have
A Versus AAt
Let A be any matrix. Then the column space of A equals the column
space of AAt .
A(x1 − x2 ) = b − b = 0, (2.4.11)
Let A be any matrix. The null space of A and the row space of A are
in the source space of A, and the column space of A is in the target
space of A.
This shows the null space of an invertible matrix is zero, hence the nullity is
zero.
Since the row space is the orthogonal complement of the null space, we
conclude the row space is all of Rd .
In §2.9, we see that the column rank and the row rank are equal. From
this, we see also the column space is all of Rd . In summary,
Let A be a d×d invertible matrix. Then the null space is zero, and the
row space and column space are both Rd . In particular, the nullity is
0, and the row rank and column rank are both d.
Exercises
Exercise 2.4.1 For what condition on a, b, c do the vectors (1, a), (2, b),
(3, c) lie on a line?
Compute Cx in two ways, first by row times column, then as a linear combi-
nation of the columns of C.
Exercise 2.4.3 Check that the array in Figure 2.1 matches with b1 , b2 as
explained in the text, and the vectors b1 and b2 are orthogonal.
Exercise 2.4.4 [32] Let a = (1, 1, 0, 0), b = (0, 0, 1, 1), c = (1, 0, 1, 0), d =
(0, 1, 0, 1). Check whether or not a, b, c, d are linearly independent by solving
ra + sb + tc + ud = 0. Is ra + sb + tc + ud = (0, 0, 0, 1) solvable? Do a, b, c, d
span R4 ?
Exercise 2.4.5 Let A = (u, v, w) be as in (2.3.4) and let b = (16, 17, 18, 19, 20).
Is b in the column space of A? If yes, solve b = ru + sv + tw.
Exercise 2.4.8 [32] Let A be a 64 × 17 matrix with rank 11. How many
linearly independent vectors x solve Ax = 0? How many linearly independent
vectors x solve At x = 0?
What are A(5, 3) and A(3, 5)? What are the source and target spaces for
A(N, d)?
Exercise 2.4.10 Calculate the column rank of the matrix A(N, d) for all
N ≥ 2 and all d ≥ 2. (Column rank is the length of the list columnspace
returns.)
Exercise 2.4.11 What is the nullity of the matrix A(N, d) for all N ≥ 2 and
all d ≥ 2?
Exercise 2.4.12 Show directly from the definition the vectors
V = \begin{pmatrix} 1 & a & a² \\ 1 & b & b² \\ 1 & c & c² \end{pmatrix}.
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,
v1 · u, v2 · u, . . . , vN · u.
v · Qv = 0.
v · x + b = 0.
A hyperplane in Rd is (d − 1)-dimensional, always one less than the ambient dimension. When b = 0, the
hyperplane orthogonal to v equals v ⊥ (2.4.9).
The hyperplane passes through a point µ if
v · µ + b = 0.
v · (x − µ) = 0.
v · (x − µ) = 0.
a(x − x0 ) + b(y − y0 ) = 0, or ax + by = c,
u · Qu = 0 and v · Qv = 0.
From this we see the dataset corresponding to this Q lies in two planes: the
plane orthogonal to u, and the plane orthogonal to v. But the intersection of
two planes is a line, so this dataset lies in a line, which means it is one-
dimensional.
Which line does this dataset lie in? Well, the line has to pass through the
mean, and is orthogonal to u and v. If we find a vector b satisfying b · u = 0
and b · v = 0, then the line will pass through the mean and will be parallel to
b. But we know how to find such a vector. Let A be the matrix with rows u, v.
Then b in the nullspace of A fulfills the requirements. We obtain b = (1, 1, 1).
Let Q be a variance matrix. Then the null space of Q equals the zero
variance directions of Q.
To see this, we use the quadratic equation from high school. If Q is sym-
metric, then u · Qv = v · Qu. For t scalar and u, v vectors, since Q ≥ 0, the
function
(v + tu) · Q(v + tu)
is nonnegative for all t scalar. Expanding this function into powers of t, we
see
t2 u · Qu + 2tu · Qv + v · Qv = at2 + 2bt + c
is nonnegative for all t scalar. Thus the parabola at2 + 2bt + c intersects the
horizontal axis in at most one root. This implies the discriminant b2 − ac is
not positive, b² − ac ≤ 0, which yields
(u · Qv)² ≤ (u · Qu)(v · Qv). (2.5.4)
Now we can derive the result. If v is a zero variance direction, then v ·Qv =
0. By (2.5.4), this implies u · Qv = 0 for all u, so Qv = 0, so v is in the null
space of Q. This derivation is valid for any nonnegative matrix Q, not just
variance matrices. Later (§3.2) we will see every nonnegative matrix is the
variance matrix of a dataset.
Based on the above result, here is code that returns zero variance direc-
tions.
N, d = 20, 2
# dxN array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
# null_space is from scipy.linalg
def zero_variance(dataset):
    Q = cov(dataset)
    return null_space(Q)
zero_variance(dataset)
(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15), (16, 17, 18, 19, 20).
Thus this dataset is orthogonal to three directions, hence lies in the intersec-
tion of three hyperplanes. Each hyperplane is one condition, so each hyper-
plane cuts the dimension down by one, so the dimension of this dataset is
5 − 3 = 2. Dimension of a dataset is discussed further in §2.9.
2.6 Pseudo-Inverse
What is the pseudo-inverse? In §2.3, we used both the inverse and the pseudo-
inverse to solve Ax = b, but we didn’t explain the framework behind them.
It turns out the framework is best understood geometrically.
Think of b and Ax as points, and measure the distance between them, and
think of x and the origin 0 as points, and measure the distance between them
(Figure 2.2).
(Figure 2.2: the point x is mapped by A to Ax; we measure the distance from Ax to b, and the distance from x to the origin 0.)
Even though the point x+ may not solve Ax = b, this procedure results
in a uniquely determined x+ : While there may be several points x∗ , there is
only one x+ . Figure 2.3 summarizes the situation for a 2 × 2 matrix A with
rank(A) = 1.
Fig. 2.3 The points x, Ax, the points x∗ , Ax∗ , and the point x+ .
The results in this section are as follows. Let A be any matrix. There is a
unique matrix A+ — the pseudo-inverse of A — with the following properties.
• the linear system Ax = b is solvable, when b = AA+ b.
• x+ = A+ b is a solution of
1. the linear system Ax = b, if Ax = b is solvable.
2. the regression equation At Ax = At b, always.
• In either case,
1. there is exactly one solution x∗ with minimum norm.
2. Among all solutions, x+ has minimum norm.
3. Every other solution is x∗ = x+ + v for some v in the null space of A.
At Ax = At b. (2.6.2)
Zero Residual
x + 6y + 11z = −9
2x + 7y + 12z = −3
3x + 8y + 13z = 3 (2.6.3)
4x + 9y + 14z = 9
5x + 10y + 15z = 10
Let b be any vector, not necessarily in the column space of A. To see how
close we can get to solving (2.3.1), we minimize the residual (2.6.1). We say
x∗ is a residual minimizer if
Regression Equation
To see this, let v be any vector, and t a scalar. Insert x = x∗ + tv into the
residual and expand in powers of t to obtain
108 CHAPTER 2. LINEAR GEOMETRY
At (Ax∗ − b) · v = (Ax∗ − b) · Av = 0.
At (Ax∗ − b) = 0,
Multiple Solutions
Since we know from above there is a residual minimizer in the row space
of A, we always have a minimum norm residual minimizer.
Let v be in the null space of A, and write
x∗ · v ≥ 0.
Since both ±v are in the null space of A, this implies ±x∗ · v ≥ 0, hence
x∗ · v = 0. Since the row space is the orthogonal complement of the null
space, the result follows.
Uniqueness
If x+ +
1 and x2 are minimum norm residual minimizers, then v = x1 − x2
+ +
+ +
is both in the row space and in the null space of A, so x1 − x2 = 0. Hence
x+ +
1 = x2 .
Putting the above all together, each vector b leads to a unique x+ . Defining
+
A by setting
x+ = A+ b,
we obtain A+ , the pseudo-inverse of A.
Notice if A is, for example, 5 × 4, then Ax = b implies x is a 4-vector and
b is a 5-vector. Then from x = A+ b, it follows A+ is 4 × 5. Thus the shape of
A+ equals the shape of At .
110 CHAPTER 2. LINEAR GEOMETRY
We know any two solutions of the linear system (2.3.1) differ by a vector in
the null space of A (2.4.11), and any two solutions of the regression equation
(2.6.2) differ by a vector in the null space of A (above).
If x is a solution of (2.3.1), then, by multiplying by At , x is a solution of
the regression equation (2.6.2). Since x+ = A+ b is a solution of the regression
equation, x+ = x + v for some v in the null space of A, so
Ax+ = A(x + v) = Ax + Av = b + 0 = b.
This shows x+ is a solution of the linear system. Since all other solutions
differ by a vector v in the null space of A, this establishes the result.
Now we can state when Ax = b is solvable,
Solvability of Ax = b
Properties of Pseudo-Inverse
A. AA+ A = A
B. A+ AA+ = A+
(2.6.8)
C. AA+ is symmetric
D. A+ A is symmetric
u = A+ Au + v. (2.6.9)
112 CHAPTER 2. LINEAR GEOMETRY
Au = AA+ Au.
A+ w = A+ AA+ w + v
for some v in the null space of A. But both A+ w and A+ AA+ w are in the
row space of A, hence so is v. Since v is in both the null space and the row
space, v is orthogonal to itself, so v = 0. This implies A+ AA+ w = A+ w.
Since w was any vector, we obtain B.
Since A+ b solves the regression equation, At AA+ b = At b for any vector b.
Hence At AA+ = At . With P = AA+ ,
(x − A+ Ax) · A+ Ay = 0.
x · P y = P x · P y = x · P tP y
Also we have
Exercises
Exercise 2.6.2 Let A(N, d) be as in Exercise 2.4.9, and let A = A(6, 4).
Let b = (1, 1, 1, 1, 1, 1). Write out Ax = b as a linear system. How many
equations, how many unknowns?
Exercise 2.6.4 Continuing with the same A and b, write out the correspond-
ing regression equation. How many equations, how many unknowns?
(At )+ = (A+ )t .
QQ+ = Q+ Q.
Exercise 2.6.11 Let A be any matrix. Then the null space of A equals the
null space of A+ A. Use (2.6.8).
Exercise 2.6.12 Let A be any matrix. Then the row space of A equals the
row space of A+ A.
Exercise 2.6.13 Let A be any matrix. Then the column space of A equals
the column space of AA+ .
114 CHAPTER 2. LINEAR GEOMETRY
2.7 Projections
Let u be a unit vector, and let b be any vector. Let span(u) be the line
through u (Figure 2.4). The projection of b onto span(u) is the vector v in
span(u) that is closest to b.
It turns out this closest vector v equals P b for some matrix P , the projec-
tion matrix. Since span(u) is a line, the projected vector P b is a multiple tu
of u.
From Figure 2.4, b − P b is orthogonal to u, so
0 = (b − P b) · u = b · u − P b · u = b · u − t u · u = b · u − t.
b − Pb
b
P b = tu
u
(b − P b) · u = 0 and (b − P b) · v = 0.
r = b · u, s = b · v.
b
b − Pb
u
Pb
Characterization of Projections
To prove this, suppose P is the projection matrix onto some span S. For
any v, by 1., P v is in S. By 2., P (P v) = P v. Hence P 2 = P . Also, for any u
and v, P v is in S, and u − P u is orthogonal to S. Hence
(u − P u) · P v = 0
which implies
u · P v = (P u) · (P v).
Switching u and v,
v · P u = (P v) · (P u),
Hence
u · (P v) = (P u) · v,
t
which implies P = P .
For the other direction, suppose P is a projection matrix, and let S be the
column space of P . Then a vector x is in S iff x is of the form x = P v. This
establishes 1. above. Since
P x = P (P v) + P 2 v = P v = x,
Let A be any matrix. Then the projection matrix onto the column
space of A is
P = AA+ . (2.7.2)
P b = t1 v1 + t2 v2 + · · · + td vd .
def project(A,b):
Aplus = pinv(A)
x = dot(Aplus,b) # reduced
return dot(A,x) # projected
Let A be a matrix and b a vector, and project onto the column space
of A. Then the projected vector is P b = AA+ b and the reduced vector
is x = A+ b.
For A as in (2.3.4) and b = (−9, −3, 3, 9, 10) the reduced vector onto the
column space of A is
118 CHAPTER 2. LINEAR GEOMETRY
1
x = A+ b = (82, 25, −32),
15
and the projected vector onto the column space of A is
P = A+ A. (2.7.3)
def project_to_ortho(U,b):
x = dot(U.T,b) # reduced
return dot(U,x) # projected
dataset vk in Rd , k = 1, 2, . . . , N
reduced U tv k in Rn , k = 1, 2, . . . , N
projected U U tv k in Rd , k = 1, 2, . . . , N
# projection of dataset
# onto column space of A
If S is a span in Rd , then
Rd = S ⊕ S ⊥ . (2.7.5)
v = P v + (v − P v),
An important example of (2.7.5) is the relation between the row space and
the null space of a matrix. In §2.4, we saw that, for any matrix A, the row
space and the null space are orthogonal complements.
Taking S = nullspace in (2.7.5), we have the important
If A is an N × d matrix,
and the null space and row space are orthogonal to each other.
2.7. PROJECTIONS 121
From this,
P = I − A+ A. (2.7.7)
For any matrix, the row rank plus the nullity equals the dimension of
the source space. If the matrix is N × d, r is the rank, and n is the
nullity, then
r + n = d.
But this was already done in §2.3, since P b = AA+ b = Ax+ where x+ = A+ b
is a residual minimizer.
Exercises
Exercise 2.7.4 Let P be the projection matrix onto the column space of a
matrix A. Use Exercise 2.7.3 to show trace(P ) equals the rank of A.
Exercise 2.7.6 Let A be the dataset matrix of the centered MNIST dataset,
so the shape of A is 60000 × 784. Using Exercise 2.7.4, show the rank of A
is 712.
Exercise 2.7.9 Let S be a span, and let P be the projection matrix onto S.
Use P to show ⊥
S ⊥ = S. (2.7.9)
(S ⊂ (S ⊥ )⊥ is easy. For S ⊃ (S ⊥ )⊥ , show |v − P v|2 = 0 when v in (S ⊥ )⊥ .)
Exercise 2.7.10 Let S be a span and suppose P and Q are both projection
matrices onto S. Show
(P − Q)2 = 0.
Conclude P = Q. Use Exercise 2.2.4.
2.8 Basis
To clarify this definition, suppose someone asks “Who is the shortest per-
son in the room?” There may be several shortest people in the room, but, no
matter how many shortest people there are, there is only one shortest height.
In other words, a span may have several bases, but a span’s dimension is
uniquely determined.
When a basis v1 , v2 , . . . , vN consists of orthogonal vectors, we say v1 , v2 ,
. . . , vN is an orthogonal basis. When v1 , v2 , . . . , vN are also unit vectors, we
say v1 , v2 , . . . , vN is an orthonormal basis.
spanning
orthogonal orthonormal
vectors basis
basis basis
linearly
orthogonal orthonormal
independent
Span of N Vectors
The dimension of Rd is d.
mu = mean(dataset,axis=0)
vectors = dataset - mu
matrix_rank(vectors)
In particular, since 712 < 784, approximately 10% of pixels are never
touched by any image. For example, a likely pixel to remain untouched is
at the top left corner (0, 0). For this dataset, there are 784 − 712 = 72 zero
variance directions.
We pose the following question: What is the least n for which the first n
images are linearly dependent? Since the dimension of the feature space is
784, we must have n ≤ 784. To answer the question, we compute the rank
of the first n vectors for n = 1, 2, 3, . . . , and continue until we have linear
dependence of v1 , v2 , . . . , vn .
If we load MNIST as dataset, as in §1.2, and run the code below, we
obtain n = 560 (Figure 2.8). matrix_rank is discussed in §2.9.
def find_first_defect(dataset):
d = len(dataset[0])
previous = 0
for n in range(len(dataset)):
r = matrix_rank(dataset[:n+1,:])
print((r,n+1),end=",")
if r == previous: break
if r == d: break
126 CHAPTER 2. LINEAR GEOMETRY
previous = r
This we call the dimension staircase. For example, Figure 2.9 is the di-
mension staircase for
2.8. BASIS 127
v1 = (1, 0, 0), v2 = (0, 1, 0), v3 = (1, 1, 0), v4 = (3, 4, 0), v5 = (0, 0, 1).
Ideally the code should be run in sympy using exact arithmetic. However,
this takes too long, so we use numpy.linalg.matrix_rank. Because datasets
consist of floats in numpy, the matrix_rank and dimensions are approximate
not exact. For more on this, see approximate rank in §3.2.
def dimension_staircase(dataset):
N = len(dataset)
rmax = matrix_rank(dataset)
dimensions = [ ]
for n in range(N):
r = matrix_rank(dataset[:n+1,:])
dimensions.append(r)
128 CHAPTER 2. LINEAR GEOMETRY
if r == rmax: break
title("number of vectors = " + str(n+1) + ", rank = " + str(rmax))
stairs(dimensions, range(1,n+3),linewidth=2,color='red')
grid()
show()
span(v1 , v2 , . . . , vN ) = span(v2 , v3 , . . . , vN ).
span(v1 , v2 , . . . , vN ) = span(b1 , b2 , . . . , bd ),
v1 is a linear combination of b1 , b2 , . . . , bd ,
v1 = t1 b1 + t2 b2 + · · · + td bd .
span(v1 , v2 , . . . , vN ) = span(v1 , b2 , b3 , . . . , bd ).
v2 = s1 v1 + t2 b2 + t3 b3 + · · · + td bd .
2.9. RANK 129
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , b3 , b4 , . . . , bd ).
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , v3 , b4 , b5 , . . . , bd ).
span(v1 , v2 , . . . , vN ) = · · · = span(v1 , v2 , . . . , vd ).
2.9 Rank
R3 R5
x b
A
At b
Ax
At
source space target space
By (2.4.2), the column space is in the target space, and the row space is
in the source space. Thus we always have
For A as in (2.3.4), the column rank is 2, the row rank is 2, and the nullity
is 1. Thus the column space is a 2-d plane in R5 , the row space is a 2-d plane
in R3 , and the null space is a 1-d line in R3 .
Rank Theorem
Let A be any matrix. Then
A.rank()
matrix_rank(A)
returns the rank of a matrix. The main result implies rank(A) = rank(At ),
so
For any N × d matrix, the rank is never greater than min(N, d).
C = CI = CAB = IB = B,
so B = C is the inverse of A.
The first two assertions are in §2.2. For the last assertion, assume U is a
square matrix. From §2.4, orthonormality of the rows implies linear indepen-
dence of the rows, so U is full-rank. If U also is a square matrix, then U is
invertible. Multiply by U −1 ,
U −1 = U −1 I = U −1 U U t = U t .
U U t = I = U tU (2.9.2)
is an orthogonal matrix.
Equivalently, we can say
Orthogonal Matrix
A matrix U is orthogonal iff its rows are an orthonormal basis iff its
columns are an orthonormal basis.
Since
U u · U v = u · U t U v = u · v,
U preserves dot products. Since lengths are dot products, U also preserves
lengths. Since angles are computed from dot products, U also preserves an-
gles. Summarizing,
2.9. RANK 133
As a consequence,
I = u1 ⊗ u1 + u2 ⊗ u2 + · · · + ud ⊗ ud . (2.9.3)
and
|u|2 = (u · u1 )2 + (u · u2 )2 + · · · + (u · ud )2 . (2.9.5)
Full-Rank Dataset
A dataset x1 , x2 , . . . , xN is full-rank iff x1 , x2 , . . . , xN spans the
feature space.
To derive the rank theorem, first we recall (2.7.6). Assume A has N rows
and d columns. By (2.7.6), every vector x in the source space Rd can be
written as a sum x = u + v with u in the null space, and v in the row space.
In other words, each vector x may be written as a sum x = u + v with Au = 0
and v in the row space.
From this, we have
Ax = A(u + v) = Au + Av = Av.
This shows the column space consists of vectors of the form Av with v in the
row space.
Let v1 , v2 , . . . , vr be a basis for the row space. From the previous para-
graph, it follows Av1 , Av2 , . . . , Avr spans the column space of A. We claim
Av1 , Av2 , . . . , Avr are linearly independent. To check this, we write
If v is the vector t1 v1 +t2 v2 +· · ·+tr vr , this shows v is in the null space. But v
is a linear combination of basis vectors of the row space, so v is also in the row
space. Since the row space is the orthogonal complement of the null space, we
must have v orthogonal to itself. Thus v = 0, or t1 v1 + t2 v2 + · · · + tr vr = 0.
But v1 , v2 , . . . , vr is a basis. By linear independence of v1 , v2 , . . . , vr , we
conclude t1 = 0, . . . , tr = 0. This establishes the claim, hence Av1 , Av2 , . . . ,
Avr is a basis for the column space. This shows r is the dimension of the
column space, which is by definition the column rank. Since by construction,
r is also the row rank, this establishes the rank theorem.
Exercises
A = u1 ⊗ v1 + u2 ⊗ v2 + · · · + ur ⊗ vr
1 a a2
V = 1 b b 2 .
1 c c2
137
138 CHAPTER 3. PRINCIPAL COMPONENTS
How does this compare with the distance between Av1 and Av2 , or |Av1 −
Av2 |?
If we let
v1 − v2
u= ,
|v1 − v2 |
then u is a unit vector, |u| = 1, and by linearity
|Av1 − Av2 |
|Au| = .
|v1 − v2 |
Here the maximum and minimum are taken over all unit vectors u.
Then σ1 is the distance of the furthest image from the origin, and σ2 is
the distance of the nearest image to the origin. It turns out σ1 and σ2 are
the top and bottom singular values of A.
To keep things simple, assume both the source space and the target space
are R2 ; then A is 2 × 2.
The unit circle (in red in Figure 3.1) is the set of vectors u satisfying
|u| = 1. The image of the unit circle (also in red in Figure 3.1) is the set of
vectors of the form
{Au : |u| = 1}.
The annulus is the set (the region between the dashed circles in Figure 3.1)
of vectors b satisfying
{b : σ2 < |b| < σ1 }.
It turns out the image is an ellipse, and this ellipse lies in the annulus.
Thus the numbers σ1 and σ2 constrain how far the image of the unit circle
is from the origin, and how near the image is to the origin.
To relate σ1 and σ2 to what we’ve seen before, let Q = At A. Then,
Now let Q = AAt , and let b be in the image. Then b = Au for some unit
vector u, and
This shows the image of the unit circle is the inverse variance ellipse (§1.5)
corresponding to the variance Q, with major axis length 2σ1 and minor axis
length 2σ2 .
These reflect vectors across the horizontal axis, and across the vertical axis.
140 CHAPTER 3. PRINCIPAL COMPONENTS
The SVD decomposition (§3.4) states that every matrix A can be written
as a product
ab
A= = U SV.
cd
Here S is a diagonal matrix as above, and U , V are orthogonal and rotation
matrices as above.
In more detail, apart from a possible reflection, there are scalings σ1 and
σ2 and angles α and β, so that A transforms vectors by first rotating by α,
then scaling by (σ1 , σ2 ), then by rotating by β (Figure 3.2).
V S U
In §1.5 and §2.5, we saw every variance matrix is nonnegative. In this section,
we see that every nonnegative matrix Q is the variance matrix of a specific
dataset. This dataset is called the principal components of Q.
Let A be a matrix. An eigenvector for A is a nonzero vector v such that
Av is aligned with v. This means
Av = λv (3.2.1)
singular:
σ, u, v
row column
any
rank rank
matrix square
eigen:
invertible symmetric
λ, v
non-
variance negative λ≥0
λ ̸= 0 positive λ>0
A = array([[2,1],[1,2]])
lamda, U = eig(A)
lamda
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda
v · Qv = v · λv = λv · v = λ.
µu · v = u · (µv) = u · Qv = v · Qu = v · (λu) = λu · v.
This implies
(µ − λ)u · v = 0.
If λ ̸= µ, we must have u · v = 0. We conclude:
λ1 ≥ λ2 ≥ · · · ≥ λd .
and scalars λ, µ satisfying Qu = λu, and Qv = µv. These are the eigenvalues
and eigenvectors. Define three matrices
λ0 u
U = (u, v), E= , V = .
0µ v
We conclude QU = U E. Multiplying by V ,
Diagonalization (EVD)
Q = U EV (3.2.3)
and V = U.T.
146 CHAPTER 3. PRINCIPAL COMPONENTS
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
v = U[:,0]
allclose(dot(Q,v), lamda[0]*v)
returns True.
The conclusion is: With the correct choice of orthonormal basis, the matrix
Q becomes a diagonal matrix E.
The orthonormal basis eigenvectors v1 , v2 , . . . , vd are the principal compo-
nents of the matrix Q. The eigenvalues and eigenvectors of Q, taken together,
are the eigendata of Q.
To obtain the diagonal matrix E,
E = diag(lamda)
init_printing()
# eigenvalues
Q.eigenvals()
3.2. EIGENVALUE DECOMPOSITION 147
# eigenvectors
Q.eigenvects()
U, E = Q.diagonalize()
This returns the diagonal E with the eigenvalues in increasing order. The
command init_printing pretty-prints the output.
If A is the matrix (2.3.4), and Q = At A is as in the regression equation
(2.6.4), then the eigenvalues are
√ √
λ1 = 620 + 10 3769, λ2 = 620 − 10 3769, λ3 = 0, (3.2.4)
The third row is a multiple of (1, −2, 1), which, as we know, is a basis for the
nullspace of A (§2.4).
rank(Q) = rank(E) = r.
For example, in (3.2.4), there are two positive eigenvalues, and the rank
of Q, which equals the rank of A, is two.
# dataset is Nxd
N, d = dataset.shape
Q = dot(dataset.T,dataset)/N
lamda = eigh(Q)[0]
approx_rank = d - approx_nullity
approx_rank, approx_nullity
This code returns 712 for the MNIST dataset, agreeing with the code in
§2.8.
Q = Matrix([[2,1],[1,2]])
U, E = Q.diagonalize()
display(U,E)
returns
1 1 10
U= , E= .
−1 1 03
Also,
Q = Matrix([[a,b ],[b,c]])
U, E = Q.diagonalize()
display(Q,U,E)
returns √ √
ab 1 a−c− D a−c+ D
Q= , U=
bc 2b 2b 2b
and
3.2. EIGENVALUE DECOMPOSITION 149
√
1 a+c− D 0 √
E= , D = (a − c)2 + 4b2 .
2 0 a+c+ D
0 0 . . . 0 1/λd
Q = U EV =⇒ Q+ = U E + V. (3.2.6)
Qx = b
has a solution x for every vector b iff all eigenvalues are nonzero, in
which case
1 1 1
x= (b · v1 )v1 + (b · v2 )v2 + · · · + (b · vd )vd . (3.2.7)
λ1 λ2 λd
150 CHAPTER 3. PRINCIPAL COMPONENTS
trace(Q) = λ1 + λ2 + · · · + λd . (3.2.8)
Q2 is symmetric with eigenvalues λ21 , λ22 , . . . , λ2d . Applying the last result to
Q2 , we have
√
√ λ1 v 1
λ2 v2
√
√ − λ2 v2
− λ1 v1
Q = λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd . (3.2.9)
where the maximum is over all unit vectors v. We say a unit vector b is best-fit
for Q or best-aligned with Q if the maximum is achieved at v = b: λ1 = b · Qb.
When Q is a variance matrix, this means the unit vector b is chosen so that
the variance b · Qb of the dataset projected onto b is maximized.
152 CHAPTER 3. PRINCIPAL COMPONENTS
A Calculation
Suppose λ, a, b, c, d are real numbers and let
λ + at + bt2
f (t) = .
1 + ct + dt2
If f (t) is maximized at t = 0, then a = λc.
λ1 ≥ v · Qv = v · (λv) = λv · v = λ.
λ1 = v1 · Qv1 ≥ v · Qv (3.2.11)
for all unit vectors v. Let u be any vector. Then for any real t,
3.2. EIGENVALUE DECOMPOSITION 153
v1 + tu
v=
|v1 + tu|
u · Qv1 = λ1 u · v1
u · (Qv1 − λ1 v1 ) = 0
Just as the maximum variance (3.2.10) is the top eigenvalue λ1 , the mini-
mum variance
λd = min v · Qv, (3.2.12)
|v|=1
v · Qv over all unit v in T , i.e. over all unit v orthogonal to v1 . This leads to
another eigenvalue λ2 with corresponding eigenvector v2 orthogonal to v1 .
Since λ1 is the maximum of v · Qv over all vectors in Rd , and λ2 is the
maximum of v · Qv over the restricted space T of vectors orthogonal to v1 ,
we must have λ1 ≥ λ2 .
Having found the top two eigenvalues λ1 ≥ λ2 and their orthonormal
eigenvectors v1 , v2 , we let S = span(v1 , v2 ) and T = S ⊥ be the orthogonal
complement of S. Then dim(T ) = d − 2, and we can repeat the process to
obtain λ3 and v3 in T . Continuing in this manner, we obtain eigenvalues
λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λd .
v1 , v2 , v3 , . . . , vd .
T = S⊥
v1
v3
v2
Sλ = {v : Qv = λv}
the eigenspace corresponding to λ. For example, suppose the top three eigen-
values are equal: λ1 = λ2 = λ3 , with b1 , b2 , b3 the corresponding eigenvectors.
Calling this common value λ, the eigenspace is Sλ = span(b1 , b2 , b3 ). Since
b1 , b2 , b3 are orthonormal, dim(Vλ ) = 3. In Python, the eigenspaces Vλ are
obtained by the matrix U above: The columns of U are an orthonormal basis
for the entire space, so selecting the columns corresponding to a specific λ
yields an orthonormal basis for Sλ .
Let (E,U) be the list of eigenvalues and matrix U whose columns are
the eigenvectors. Then the eigenvectors are the rows of U t . Here is code for
selecting just the eigenvectors corresponding to eigenvalue s.
lamda, U = eigh(Q)
V = U.T
V[isclose(lamda,s)]
156 CHAPTER 3. PRINCIPAL COMPONENTS
The function isclose(a,b) returns True when a and b are numerically close.
Using this boolean, we extract only those rows of V whose corresponding
eigenvalue is close to s.
The subspace Sλ is defined for any λ. However, dim(Sλ ) = 0 unless λ is
an eigenvalue, in which case dim(Sλ ) = m, where m is the multiplicity of λ.
The proof of the eigenvalue decomposition provides a systematic procedure
for finding eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd . Now we show there are no other
eigenvalues.
All this can be readily computed in Python. For the Iris dataset, we have
the variance matrix in (2.2.16). The eigenvalues are
4.54 = trace(Q) = λ1 + λ2 + λ3 + λ4 .
For the Iris dataset, the top eigenvalue is λ1 = 4.2, it has multiplicity 1, and
its corresponding list of eigenvectors contains only one eigenvector,
The top two eigenvalues account for 97.8% of the total variance.
The third eigenvalue is λ3 = 0.08 with eigenvector
The top three eigenvalues account for 99.5% of the total variance.
The fourth eigenvalue is λ4 = 0.02 with eigenvector
The top four eigenvalues account for 100% of the total variance. Here each
eigenvalue has multiplicity 1, since there are four distinct eigenvalues.
def row(i,d):
v = [0]*d
v[i] = 2
if i > 0: v[i-1] = -1
if i < d-1: v[i+1] = -1
if i == 0: v[d-1] += -1
if i == d-1: v[0] += -1
158 CHAPTER 3. PRINCIPAL COMPONENTS
return v
# using sympy
from sympy import Matrix
# using numpy
from numpy import *
m1 m2
x1 x2
To explain where these matrices come from, look at the mass-spring sys-
tems in Figures 3.6 and 3.7. Here we have springs attached to masses and
walls on either side. At rest, the springs are the same length. When per-
turbed, some springs are compressed and some stretched. In Figure 3.6, let
x1 and x2 denote the displacement of each mass from its rest position.
When extended by x, each spring fights back by exerting a force kx pro-
portional to the displacement x. Here k is the spring constant. For example,
look at the mass m1 . The spring to its left is extended by x1 , so exerts a force
of −kx1 . Here the minus indicates pulling to the left. On the other hand, the
spring to its right is extended by x2 − x1 , so it exerts a force +k(x2 − x1 ).
Here the plus indicates pulling to the right. Adding the forces from either
side, the total force on m1 is −k(2x1 − x2 ). For m2 , the spring to its left
exerts a force −k(x2 − x1 ), and the spring to its right exerts a force −kx2 ,
so the total force on m2 is −k(2x2 − x1 ). We obtain the force vector
3.2. EIGENVALUE DECOMPOSITION 159
2x1 − x2 2 −1 x1
−k = −k .
−x1 + 2x2 −1 2 x2
However, as you can see, the matrix here is not exactly Q(2).
m1 m2 m3 m4 m5
x1 x2 x3 x4 x5
But, again, the matrix here is not Q(5). Notice, if we place one mass and two
springs in Figure 3.6, we obtain the 1 × 1 matrix 2.
To obtain Q(2) and Q(5), we place the springs along a circle, as in Figures
3.8 and 3.9. Now we have as many springs as masses. Repeating the same
logic, this time we obtain Q(2) and Q(5). Notice if we place one mass and
one spring in Figure 3.8, d = 1, we obtain the 1 × 1 matrix Q(1) = 0: There
is no force if we move a single mass around the circle, because the spring is
not being stretched.
Thus the matrices Q(d) arise from mass-spring systems arranged on a
circle. From Newton’s law (force equals mass p times acceleration), one shows
the frequencies of the vibrating springs equal λk/m, where k is the spring
constant, m is the mass of each of the masses, and λ is an eigenvalue of Q(d).
This is the physical meaning of the eigenvalues of Q(d).
160 CHAPTER 3. PRINCIPAL COMPONENTS
m1 m2 m2
m1
m1 m1
m2
m2
m5 m5
m3 m4
m4 m3
p(t) = 2 − t − td−1 ,
and let
1
ω
ω2
v1 = .
ω3
..
.
d−1
ω
Then Qv1 is
1
2 − ω − ω d−1
ω
−1 + 2ω − ω 2
ω2
−ω + 2ω 2 − ω 3
Qv1 = = p(ω) = p(ω)v1 .
ω3
..
. ..
d−2 d−1
.
−ω + 2ω −1
ω d−1
Then
v0 = 1 = (1, 1, . . . , 1),
and, by the same calculation, we have
By (A.4.9),
Eigenvalues of Q(d)
2πk
λk = p(ω k ) = 2 − 2 cos , (3.2.15)
d
Q(2) = (4, 0)
Q(3) = (3, 3, 0)
Q(4) = (4, 2, 2, 0)
√ √ √ √ !
5 5 5 5 5 5 5 5
Q(5) = + , + , − , − ,0
2 2 2 2 2 2 2 2
Q(6) = (4, 3, 3, 1, 1, 0)
√ √ √ √
Q(8) = (4, 2 + 2, 2 + 2, 2, 2, 2 − 2, 2 − 2, 0)
√ √ √ √
5 5 5 5 3 5 3 5
Q(10) = 4, + , + , + , + ,
2 2 2 2 2 2 2 2
√ √ √ √ !
5 5 5 5 3 5 3 5
− , − , − , − ,0
2 2 2 2 2 2 2 2
√ √ √ √
Q(12) = 4, 2 + 3, 2 + 3, 3, 3, 2, 2, 1, 1, 2 − 3, 2 − 3, 0 .
The matrices Q(d) are circulant matrices. Each row in Q(d) is obtained
from the row above it in Q(d) by shifting the entries to the right. The trick of
using the roots of unity to compute the eigenvalues and eigenvectors works
for any circulant matrix.
3.2. EIGENVALUE DECOMPOSITION 163
Our last topic is the distribution of the eigenvalues for large d. How are
the eigenvalues scattered? Figure 3.10 plots the eigenvalues for Q(50) using
the code below.
d = 50
E = eigh(Q(d))[0]
stairs(E,range(d+1),label="numpy")
k = arange(d)
lamda = 2 - 2*cos(2*pi*k/d)
sorted = sort(lamda)
scatter(k,lamda,s=5,label="unordered")
scatter(k,sorted,c="red",s=5,label="increasing order")
grid()
legend()
show()
Figure 3.10 shows the eigenvalues tend to cluster near the top λ1 ≈ 4 and
the bottom λd = 0, they are sparser near the middle. Using the double-angle
formula,
πk
λk = 4 sin2 , k = 0, 1, 2, . . . , d − 1.
d
Solving for k/d in terms of λ, and multiplying by two to account for the
double multiplicity, we obtain the proportion of eigenvalues below threshold
λ,
164 CHAPTER 3. PRINCIPAL COMPONENTS
1√
#{k : λk ≤ λ} 2
≈ arcsin λ , 0 ≤ λ ≤ 4. (3.2.16)
d π 2
Here ≈ means asymptotic equality, see §A.6.
Equivalently, the derivative (4.1.23) of the arcsine law (3.2.16) exhibits the
eigenvalue clustering near the ends (Figure 3.11).
lamda = arange(0.1,3.9,.01)
density = 1/(pi*sqrt(lamda*(4-lamda)))
plot(lamda,density)
# r"..." means raw string
tex = r"$\displaystyle\frac1{\pi\sqrt{\lambda(4-\lambda)}}$"
text(.5,.45,tex,usetex=True,fontsize="x-large")
grid()
show()
Exercises
λ2 − λ trace(A) + det(A) = 0.
λ± = a ± ib
Exercise 3.2.11 With R(d) as in Exercise 2.2.9, find the eigenvalues and
eigenvectors of R(d).
d 4 · trace(Q(d)+ )
4 4+1
16 (4+1)(16+1)
256 (4+1)(16+1)(256+1)
3.3 Graphs
−3 7.4
2 0
Let wij be the weight on the edge (i, j) in a weighed directed graph. The
weight matrix of a weighed directed graph is the matrix W = (wij ).
If the graph is unweighed, then we set A = (aij ), where
(
1, if i and j adjacent,
aij = .
0, if not.
In this case, A consists of ones and zeros, and is called the adjacency matrix.
If the graph is undirected, then the adjacency matrix is symmetric,
aij = aji .
Sometimes graphs may have multiple edges between nodes, or loops, which
are edges starting and ending at the same node. A graph is simple if it has
no loops and no multiple edges. In this section, we deal only with simple
undirected unweighed graphs.
To summarize, a simple undirected graph G = (V, E) is a collection V
of nodes, and a collection of edges E, each edge corresponding to a pair of
nodes.
The number of nodes is the order n of the graph, and the number of edges
is the size m of the graph. In a (simple undirected) graph of order n, the
number of pairs of nodes is n-choose-2, so the number of edges satisfies
n 1
0≤m≤ = n(n − 1).
2 2
How many graphs of order n are there? Since graphs are built out of
edges, the answer depends on how many subsets of edges you can grab from
a maximum of n(n − 1)/2 edges. The number of subsets of a set with m
elements is 2m , so the number Gn of graphs with n nodes is
n
Gn = 2( 2 ) = 2n(n−1)/2 .
When m = 0, there are no edges, and we say the graph is empty. When
m = n(n − 1)/2, there are the maximum number of edges, and we say the
graph is complete. The complete graph with n nodes is written Kn (Figure
3.16).
Fig. 3.16 The complete graph K6 , the cycle graph C6 , and the wheel graph W6 .
3.3. GRAPHS 169
The cycle graph Cn with n nodes is as in Figure 3.16. The graph Cn has n
edges. The wheel graph is the cycle graph with one vertex added at the center
and connected to the spokes. The cycle graph C3 is a triangle.
d1 ≥ d2 ≥ d3 ≥ · · · ≥ dn
(d1 , d2 , d3 , . . . , dn )
Handshaking Lemma
n
X
d1 + d2 + · · · + dn = dk = 2m.
k=1
In any graph, there are at least two nodes with the same degree.
To see this, we consider two cases. First case, assume there are no isolated
nodes. Then the degree sequence is
n − 1 ≥ d1 ≥ d2 ≥ · · · ≥ dn ≥ 1.
n − 2 ≥ d1 ≥ d2 ≥ . . . dn−1 ≥ 1.
A graph is regular if all the node degrees are equal. If the node degrees are
all equal to k, we say the graph is k-regular. From the handshaking lemma,
for a k-regular graph, we have kn = 2m, so
1
m= kn.
2
For example, because 2m is even, there are no 3-regular graphs with 11 nodes.
Both Kn and Cn are regular, with Kn being (n − 1)-regular, and Cn being
2-regular.
A walk on a graph is a sequence of nodes v1 , v2 , v3 , . . . where each
consecutive pair vi , vi+1 of nodes are adjacent. For example, if v1 , v2 , v3 ,
v4 , v5 , v6 are the nodes (in any order) of the complete graph K6 , then
v1 → v2 → v3 → v4 → v2 is a walk. A path is a walk with no backtracking:
A path visits each node at most once.
3.3. GRAPHS 171
Two nodes a and b are connected if there is a walk starting at a and ending
at b. If a and b are connected, then there is a path starting at a and ending
at b, since we can cut out the cycles of the walk. A graph is connected if every
two nodes are connected. A graph is disconnected if it is not connected. For
example, Figure 3.16 may be viewed as two connected graphs K6 and C6 , or
a single disconnected graph K6 ∪ C6 .
A closed walk is a walk that ends where it starts. A cycle is a closed path.
If a graph has no cycles, it is a forest. A connected forest is a tree. In a tree,
any two nodes are connected by exactly one path.
A=1⊗1−I
Notice there are ones on the sub-diagonal, and ones on the super-diagonal,
and ones in the upper-right and lower-left corners.
172 CHAPTER 3. PRINCIPAL COMPONENTS
Euler’s Theorem
A connected graph has an Eulerian cycle if and only if every vertex
has an even degree.
Since the degree sequence of the graph in Figure 3.19 is (4, 4, 4, 4, 4, 2), the
graph is Eulerian.
For any adjacency matrix A, the sum of each row is equal to the degree of
the node corresponding to that row. This is the same as saying
d1
d2
A1 = . . . .
dn
A1 = k1,
v1 = 1 ≥ |vj |, j = 2, 3, . . . , n.
Since the sum a11 + a12 + · · · + a1n equals the degree d1 of node 1, this implies
Top Eigenvalue
A1 = (1 · 1)1 − 1 = n1 − 1 = (n − 1)1,
Ā = A(Ḡ) = 1 ⊗ 1 − I − A(G).
Now aik akj is either 0 or 1, and equals 1 exactly if there is a 2-step path from
i to j. Hence
3.3. GRAPHS 175
Notice a 2-step walk between i and j is the same as a 2-step path between i
and j.
When i = j, (A2 )ii is the number of 2-step paths connecting i and i, which
means number of edges. Since this counts edges twice, we have
1
trace(A2 ) = m = number of edges.
2
Similarly, (A3 )ij is the number of 3-step walks connecting i and j. Since
a 3-step walk from i to i is the same as a triangle, (A3 )ii is the number
of triangles in the graph passing through i. Since the trace is the sum of
the diagonal elements, trace(A3 ) counts the number of triangles. But this
overcounts by a factor of 3! = 6, since three labels may be rearranged in six
ways. Hence
1
trace(A3 ) = number of triangles.
6
Loops, Edges, Triangles
This is correct because for a complete graph, n(n − 1)/2 is the number of
edges.
Continuing,
Connected Graph
1000 0100
In general, the permutation matrix P has Pij = 1 if i → j, and Pij = 0
if not. If P is any permutation matrix, then Pik Pjk equals 1 if both i → k
and j → k. In other words, Pik Pjk = 1 if i = j and i → k, and Pik Pjk = 0
otherwise. Since i → k for exactly one k,
3.3. GRAPHS 177
n n
(
t
X
t
X 1, i = j,
(P P )ij = Pik Pkj = Pik Pjk =
k=1 k=1
0, i ̸= j.
Hence P is orthogonal,
P P t = I, P −1 = P t .
Using permutation matrices, we can say two graphs are isomorphic if their
adjacency matrices A, A′ satisfy
A′ = P AP −1 = P AP t
A graph is bipartite if the nodes can be divided into two groups, with
adjacency only between nodes across groups. If we call the two groups even
and odd, then odd nodes are never adjacent to odd nodes, and even nodes
are never adjacent to even nodes.
The complete bipartite graph is the bipartite graph with maximum num-
ber of edges: Every odd node is adjacent to every even node. The complete
bipartite graph with n odd nodes with m even nodes is written Kn,m . Then
the order of Kn,m is n + m.
Let a = (1, 1, . . . , 1, 0, 0, . . . , 0) be the vector with n ones and m zeros, and
let b = 1 − a. Then b has n zeros and m ones, and the adjacency matrix of
Kn,m is
A = A(Kn,m ) = a ⊗ b + b ⊗ a.
For example, the adjacency matrix of K5,3 is
178 CHAPTER 3. PRINCIPAL COMPONENTS
1 0 0 1 0 0 0 1 1 1 1 1
1 0 0 1 0 0 0 1 1 1 1 1
1 0 0 1 0 0 0 1 1 1 1 1
0 1 1 0 1 1 1 0 0 0 0 0
⊗ + ⊗ = .
0 1 1 0 1 1 1 0 0 0 0 0
0 1 1 0 1 1 1 0 0 0 0 0
0 1 1 0 1 1 1 0 0 0 0 0
0 1 1 0 1 1 1 0 0 0 0 0
Recall we have
(a ⊗ b)v = (b · v)a.
From this, we see the column space of A = a⊗b+b⊗a is span(a, b). Thus the
rank of A is 2, and the nullspace of A consists of the orthogonal complement
span(a, b)⊥ of span(a, b). Using this, we compute the eigenvalues of A.
Aa = nb, Ab = ma.
Hence
λv = Av = A(ra + sb) = rnb + sma.
Applying A again,
For
√ example, √
for the graph in Figure 3.21, the nonzero eigenvalues are λ =
± 3 × 5 = ± 15.
L = B t B.
Both the laplacian matrix and the adjacency matrix are n × n. What is the
connection between them?
Laplacian
Note the Laplacian does not depend on how the edges were directed, it
only depends on A.
For example, for the cycle graph C6 , the degree matrix is 2I, and the
laplacian is the matrix we saw in §3.2,
2 −1 0 0 0 −1
−1 2 −1 0 0 0
0 −1 2 −1 0 0
L = Q(6) = .
0 0 −1 2 −1 0
0 0 0 −1 2 −1
−1 0 0 0 −1 2
180 CHAPTER 3. PRINCIPAL COMPONENTS
A = 2I − Q(n),
2 cos(2πk/n), k = 0, 1, 2, . . . , n − 1,
Exercises
Exercise 3.3.1 [27] Consider the graph in Figure 3.22 below. What is the
order of the graph? What is the degree of vertex H? What is the degree of
vertex D? How many components does the graph have?
Exercise 3.3.2 [27] Which of the following degree sequences are possible for
a simple graph?
1. (4,4,4,3,3,3,3,2)
2. (5,3,2,2,2,1)
3. (8,7,6,5,5,3,1,1)
4. (6,6,6,5,4,4,4,3,3,3)
Exercise 3.3.3 [27] Construct a tree with five vertices such that the degree of
one vertex is 3. How many such (non-isomorphic) graphs can you construct?
Exercise 3.3.6 [27] Find the degree sequences of the cycle graph C6 , the
complete graph K8 , the complete bipartite graph K3,7 , and the wheel graph
W4 .
When this happens, v is a right singular vector and u is a left singular vector
associated to σ.
When (3.4.1) holds, so does
182 CHAPTER 3. PRINCIPAL COMPONENTS
The singular values of A and the singular values of At are the same.
Then Av = λv implies λ = 1 and v = (1, 0). Thus A has only one eigenvalue
equal to 1, and only one eigenvector. Set
11
Q = At A = .
12
0 = det(Q − λI) = λ2 − 3λ + 1.
Qv = At Av = At (σu) = σ 2 v. (3.4.2)
Thus v1 , u1 are right and left singular vectors corresponding to the singular
value σ1 of A. Similarly, if we set u2 = Av2 /σ2 , then v2 , u2 are right and left
singular vectors corresponding to the singular value σ2 of A.
We show v1 , v2 are orthonormal, and u1 , u2 are orthonormal. We already
know v1 , v2 are orthonormal, because they are orthonormal eigenvectors of
the symmetric matrix Q. Also
0 = λ1 v1 ·v2 = Qv1 ·v2 = (At Av1 )·v2 = (Av1 )·(Av2 ) = σ1 u1 ·σ2 u2 = σ1 σ2 u1 ·u2 .
A Versus Q = At A
Since the rank equals the dimension of the row space, the first part follows
from §2.4. If Av = σu and At u = σv, then
Qv = At Av = At (σu) = σAt u = σ 2 v,
so λ = σ 2 is an eigenvalue of Q.
Conversely,
√ If Qv = λv, then λ ≥ 0, so there are two cases. If λ > 0, set
σ = λ and u = Av/σ. Then
184 CHAPTER 3. PRINCIPAL COMPONENTS
Let A be any matrix, and let r be the rank of A. Then there are
r positive singular values σk , an orthonormal basis uk of the target
space, and an orthonormal basis vk of the source space, such that
Avk = σk uk , At uk = σk vk , k ≤ r, (3.4.3)
and
Avk = 0, At uk = 0 for k > r. (3.4.4)
Taken together, (3.4.3) and (3.4.4) say the number of positive singular
values is exactly r. Assume A is N × d, and let p = min(N, d) be the lesser
of N and d.
Since (3.4.4) holds as long as there are vectors uk and vk , there are p − r
zero singular values. Hence there are p = min(N, d) singular values altogether.
The proof of the result is very simple once we remember the rank of Q
equals the number of positive eigenvalues of Q. By the eigenvalue decom-
position, there is an orthonormal basis vk of the source space and positive
√ that Qvk = λk vk , k ≤ r, and Qvk = 0, k > r.
eigenvalues λk such
Setting σk = λk and uk = Avk /σk , k ≤ r, as in our first example, we
have (3.4.3), and, again as in our first example, uk , k ≤ r, are orthonormal.
By construction, vk , k > r, is an orthonormal basis for the null space of
A, and uk , k ≤ r, is an orthonormal basis for the column space of A.
Choose uk , k > r, any orthonormal basis for the nullspace of At . Since
the column space of A is the row space of At , the column space of A is the
orthogonal complement of the nullspace of At (2.7.6). Hence uk , k ≤ r, and
uk , k > r, are orthogonal. From this, uk , k ≤ r, together with uk , k > r,
form an orthonormal basis for the target space.
3.4. SINGULAR VALUE DECOMPOSITION 185
For our second example, let a and b be nonzero vectors, possibly of different
sizes, and let A be the matrix
A = a ⊗ b, At = b ⊗ a.
Thus there is only one positive singular value of A, equal to |a| |b|. All other
singular values are zero. This is not surprising since the rank of A is one.
Now think of the vector b as a single-row matrix B. Then, in a similar
manner, one sees the only positive singular value of B is σ = |b|.
Our third example is
0000
1 0 0 0
A= 0 1 0 0 .
(3.4.5)
0010
Then
0 1 0 0 1 0 0 0
0 0 1 0 0 1 0 0
At = Q = At A =
,
0 0 0 1 0 0 1 0
0 0 0 0 0 0 0 0
Since Q is diagonal symmetric, its rank is 3 and its eigenvalues are λ1 = 1,
λ2 = 1, λ3 = 1, λ4 = 0, and its eigenvectors are
1 0 0 0
0 1 0 0
0 , v2 = 0 , v3 1 , v4 = 0 .
v1 =
0 0 0 1
Diagonalization (SVD)
A = U SV. (3.4.6)
0 0 0 σ4 0 0
U, sigma, V = svd(A)
# sigma is a vector
print(U.shape,S.shape,V.shape)
print(U,S,V)
Given the relation between the singular values of A and the eigenvalues of
Q = At A, we also can conclude
For example, if dataset is the Iris dataset (ignoring the labels), the code
188 CHAPTER 3. PRINCIPAL COMPONENTS
# center dataset
m = mean(dataset,axis=0)
A = dataset - m
# rows of V are right
# singular vectors of A
V = svd(A)[2]
# columns of U are
# eigenvectors of Q
U = eigh(Q)[1]
# compare columns of U
# and rows of V
U, V
returns
0.36 −0.66 −0.58 0.32 0.36 −0.08 0.86 0.36
−0.08 −0.73 0.6 −0.32
, V = −0.66 −0.73 0.18 0.07
U =
0.86 0.18 0.07 −0.48 0.58 −0.6 −0.07 −0.55
0.36 0.07 0.55 0.75 0.32 −0.32 −0.48 0.75
This shows the columns of U are identical to the rows of V , except for the
third column of U , which is the negative of the third row of V .
Exercises
Exercise 3.4.1 Let b be a vector and let B be the matrix with the single
row b. Show σ = |b| is the only positive singular value.
Exercise 3.4.4 Let A be the 5 × 3 matrix (2.3.4). Use numpy and sympy to
compute the singular value decomposition of A (see Exercise 3.2.10).
190 CHAPTER 3. PRINCIPAL COMPONENTS
Qvk = λk vk , k = 1, 2, . . . , d.
λ1 ≥ λ2 ≥ · · · ≥ λd ,
in PCA one takes the most significant components, those components who
eigenvalues are near the top eigenvalue. For example, one can take the top
two eigenvalues λ1 ≥ λ2 and their eigenvectors v1 , v2 , and project the dataset
onto span(v1 , v2 ). The projected dataset can then be visualized as points in
the plane. Similarly, one can take the top three eigenvalues λ1 ≥ λ2 ≥ λ3
and their eigenvectors v1 , v2 , v3 and project the dataset onto span(v1 , v2 , v3 ).
This can then be visualized as points in three dimensions.
Recall the MNIST dataset consists of N = 60000 points in d = 784 di-
mensions. After we download the dataset,
mnist = read_csv("mnist.csv").to_numpy()
dataset = mnist[:,1:]
labels = mnist[:,0]
This results in Figures 3.23 and 3.24. Here we sort the array eig in
decreasing order, then we cumsum the array to obtain the cumulative sums.
Because the rank of the MNIST dataset is 712 (§2.9), the bottom 72 =
784 − 712 eigenvalues are exactly zero. A full listing shows that many more
eigenvalues are near zero, and the second column in Figure 3.23 shows the
top ten eigenvalues alone sum to almost 50% of the total variance.
Q = cov(dataset.T)
totvar = Q.trace()
192 CHAPTER 3. PRINCIPAL COMPONENTS
# cumulative sums
sums = cumsum(percent)
data = array([percent,sums])
print(data.T[:20].round(decimals=3))
d = len(lamda)
from matplotlib.pyplot import stairs
grid()
stairs(percent,range(d+1))
show()
def pca(dataset,n):
Q = cov(dataset.T)
# columns of U are
# eigenvectors of Q
lamda, U = eigh(Q)
# decreasing eigenvalue sort
order = lamda.argsort()[::-1]
# sorted top n columns of U
# are cols of Uproj
# U is dxd Uproj is dxn
Uproj = U[:,order[:n]]
P = dot(Uproj,Uproj.T)
return P
In the code, lamda is sorted in decreasing order, and the sorting order is
saved as order. To obtain the top n eigenvectors of U , we sort the first n
columns U[:,order[:n]] in the same order, resulting in the d × n matrix
t
Uproj . The code then returns the projection matrix P = Uproj Uproj (2.7.4).
Instead of working with the variance Q, as discussed at the start of the
section, we can work directly with the dataset, using SVD, to obtain the
eigenvectors.
def pca_with_svd(dataset,n):
# center dataset
mu = mean(dataset,axis=0)
vectors = dataset - mu
# rows of V are
# right singular vectors
194 CHAPTER 3. PRINCIPAL COMPONENTS
V = svd(vectors)[2]
# no need to sort, already decreasing order
Uproj = V[:n].T # top n rows as columns
P = dot(Uproj,Uproj.T)
return P
Let v = dataset[1] be the second image in the MNIST dataset, and let
Q be the variance of the dataset. Then the code below returns the image
compressed down to n = 784, 600, 350, 150, 50, 10, 1 dimensions, returning
Figure 3.25.
def display_image(v,row,col,i):
A = reshape(v,(28,28))
fig.add_subplot(row, col,i)
xticks([])
yticks([])
imshow(A,cmap="gray_r")
fig = figure(figsize=(10,5))
row, col = 2, 4
If you run out of memory trying this code, cut down the dataset from
60,000 points to 10,000 points or fewer. The code works with pca or with
pca_with_svd.
N = len(dataset)
n = 10
3.5. PRINCIPAL COMPONENT ANALYSIS 195
engine = PCA(n_components = n)
reduced = engine.fit_transform(dataset)
reduced.shape
and returns (N, n) = (60000, 10). The following code computes the projected
dataset
projected = engine.inverse_transform(reduced)
projected.shape
Fig. 3.25 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
fig = figure(figsize=(10,5))
row, col = 2, 4
196 CHAPTER 3. PRINCIPAL COMPONENTS
Now we project all vectors of the MNIST dataset onto two and three
dimensions, those corresponding to the top two or three eigenvalues. To start,
we compute reduced as above with n = 3, the top three components.
grid()
legend(loc='upper right')
show()
grid()
legend(loc='upper right')
show()
%matplotlib ipympl
from matplotlib.pyplot import *
ax = axes(projection="3d")
ax.set_aspect("equal")
ax.set_axis_off()
legend(loc='upper right')
show()
The three dimensional plot of the complete MNIST dataset is Figure 1.5
in §1.2. The command %matplotlib ipympl allows the figure to rotated and
scaled.
such that
def nearest_index(x,means):
i = 0
for j,m in enumerate(means):
n = means[i]
if norm(x - m) < norm(x - n): i = j
return i
def assign_clusters(dataset,means):
clusters = [ [ ] for m in means ]
for x in dataset:
i = nearest_index(x,means)
clusters[i].append(x)
return [ c for c in clusters if len(c) > 0 ]
def update_means(clusters):
return [ mean(c,axis=0) for c in clusters ]
d = 2
k,N = 7,100
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
This code returns the size the clusters after each iteration. Here is code
that plots a cluster.
def plot_cluster(mean,cluster,color,marker):
for v in cluster:
scatter(v[0],v[1], s=50, c=color, marker=marker)
scatter(mean[0], mean[1], s=100, c=color, marker='*')
d = 2
k,N = 7,100
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
figure(figsize=(4,4))
grid()
3.6. CLUSTER ANALYSIS 201
The material in this chapter lays the groundwork for Chapter 7. It assumes
the reader has some prior exposure, and the first section quickly reviews
basic material essential for our purposes. The overarching role of convexity
is emphasized repeatedly, both in the single-variable and multi-variable case.
The chain rule is treated extensively, in both interpretations, combinato-
rial (back-propagation) and geometric (time-derivatives). Both are crucial for
neural network training in Chapter 7.
Because it is used infrequently in the text, integration is treated separately
in an appendix (§A.5).
Even though parts of §4.5 are heavy-going, the material is necessary for
Chapter 7. Nevertheless, for a first pass, the reader should feel free to skim
this material and come back to it after the need is made clear.
Definition of Derivative
The derivative of f (x) at the point a is the slope of the line tangent
to the graph of f (x) at a.
203
204 CHAPTER 4. CALCULUS
Since a constant function f (x) = c is a line with slope zero, the derivative
of a constant is zero. Since f (x) = mx+b is a line with slope m, its derivative
is m.
Since the tangent line at a passes through the point (a, f (a)), and its slope
is f ′ (a), the equation of the tangent line at a is
y = f (x)
x
a
Using these properties, we determine the formula for f ′ (a). Suppose the
derivative is bounded between two extremes m and L at every point x in an
interval [a, b], say
m ≤ f ′ (x) ≤ L, a ≤ x ≤ b.
Then by A, the derivative of h(x) = f (x)−mx at x equals h′ (x) = f ′ (x)−m.
By assumption, h′ (x) ≥ 0 on [a, b], so, by B, h(b) ≥ h(a). Since h(a) =
f (a) − ma and h(b) = f (b) − mb, this leads to
f (b) − f (a)
≥ m.
b−a
Repeating this same argument with f (x) − Lx, and using C, leads to
f (b) − f (a)
≤ L.
b−a
4.1. SINGLE-VARIABLE CALCULUS 205
We have shown
f (b) − f (a)
m≤ ≤ L. (4.1.1)
b−a
Derivative Formula
f (x) − f (a)
f ′ (a) = lim . (4.1.3)
x→a x−a
dy dy du
= · .
dx du dx
To visualize the chain rule, suppose
u = f (x) = sin x,
y = g(u) = u2 .
x u y
f g
√
Suppose x = π/4. Then u = sin(π/4) = 1/ 2, and y = u2 = 1/2. Since
dy 2 du 1
= 2u = √ , = cos x = √ ,
du 2 dx 2
by the chain rule,
dy dy du 2 1
= · = √ · √ = 1.
dx du dx 2 2
Since the chain rule is important for machine learning, it is discussed in detail
in §4.4.
By the product rule,
Using the chain rule, the power rule can be √derived for any rational number n,
2
positive or negative. For example,
√ since ( x) = x, we can write x = f (g(x))
with f (x) = x2 and g(x) = x. By the chain rule,
√ √
1 = (x)′ = f ′ (g(x))g ′ (x) = 2g(x)g ′ (x) = 2 x( x)′ .
√
Solving for ( x)′ yields
√ 1
( x)′ = √ ,
2 x
4.1. SINGLE-VARIABLE CALCULUS 207
x, a = symbols('x, a')
f = x**a
returns
axa axa
, , axa−1 , axa−1 .
x x
The power rule can be combined with the chain rule. For example, if
un+1
u = 1 − p + cp, f (p) = un , g(u) = ,
(c − 1)(n + 1)
then
(1 − p + cp)n+1
F (p) = ,
(c − 1)(n + 1)
and
F ′ (p) = g ′ (u)u′ = un ,
hence
(1 − p + cp)n+1
F (p) = =⇒ F ′ (p) = f (p). (4.1.5)
(c − 1)(n + 1)
For example,
n!
(xn )′′ = (nxn−1 )′ = n(n − 1)xn−2 = xn−2 = P (n, 2)xn−2
(n − 2)!
More generally, the k-th derivative f (k) (x) is the derivatives taken k times,
so
(k) n!
(xn ) = n(n − 1)(n − 2) . . . (n − k + 1)xn−k = xn−k = P (n, k)xn−k .
(n − k)!
When k = 0, f (0) (x) = f (x), and, when k = 1, f (1) (x) = f ′ (x). The code
x, n = symbols('x, n')
diff(x**n,x,3)
def sym_legendre(n):
# symbolic variable
x = symbols('x')
# symbolic function
p = (x**2 - 1)**n
nfact = factorial(n,exact=True)
# symbolic nth derivative
return p.diff(x,n)/(nfact * 2**n)
For example,
4.1. SINGLE-VARIABLE CALCULUS 209
def num_legendre(n):
x = symbols('x')
f = sym_legendre(n)
return lambdify(x,f, 'numpy')
We use the above to derive the Taylor series. Suppose f (x) is given by a
finite or infinite sum
f (x) = c0 + c1 x + c2 x2 + c3 x3 + . . . (4.1.6)
Then f (0) = c0 . Taking derivatives, by the sum, product, and power rules,
Inserting x = 0, we obtain f ′ (0) = c1 , f ′′ (0) = 2c2 , f ′′′ (0) = 3 · 2c3 , f (4) (0) =
4 · 3 · 2c4 . This can be encapsulated by f (n) (0) = n!cn , n = 0, 1, 2, 3, 4, . . . ,
which is best written
f (n) (0)
= cn , n ≥ 0.
n!
Going back to (4.1.6), we derived
210 CHAPTER 4. CALCULUS
Taylor Series
y = log x ⇐⇒ x = ey .
log(ey ) = y, elog x = x.
4.1. SINGLE-VARIABLE CALCULUS 211
From here, we see the logarithm is defined only for x > 0 and is strictly
increasing (Figure 4.3).
Since e0 = 1,
log 1 = 0.
Since e∞ = ∞ (Figure A.3),
log ∞ = ∞.
log 0 = −∞.
We also see log x is negative when 0 < x < 1, and positive when x > 1.
ab = eb log a .
Then, by definition,
log(ab ) = b log a,
and c c
ab = eb log a = ebc log a = abc .
212 CHAPTER 4. CALCULUS
x = ey =⇒ 1 = x′ = (ey )′ = ey y ′ = xy ′ ,
so
1
y = log x =⇒ y′ = .
x
Derivative of the Logarithm
1
y = log x =⇒ y′ = . (4.1.9)
x
Since the derivative of log(1 + x) is 1/(1 + x), the chain rule implies
dn (n − 1)!
log(1 + x) = (−1)n−1 , n ≥ 1.
dxn (1 + x)n
x2 x3 x4
log(1 + x) = x − + − + .... (4.1.10)
2 3 4
0
x
For the parabola in Figure 4.4, y = x2 so, by the power rule, y ′ = 2x.
Since y ′ > 0 when x > 0 and y ′ < 0 when x < 0, this agrees with the
4.1. SINGLE-VARIABLE CALCULUS 213
√
(c = 1/ 3)
−1 −c c 1
x
0
max f (x).
x∗ ,a,b
In other words, to find the maximum of f (x), find the critical points x∗ ,
plug them and the endpoints a, b into f (x), and select whichever yields the
maximum value.
For example, since (x2 )′′ = 2 > 0 and (ex )′′ = ex > 0, x2 and e√ x
are
strictly convex everywhere, and x4 − 2x2 is strictly convex for |x| > 1/ 3.
Convexity of ex was also derived in (A.3.14). Since
(ex )(n) = ex , n ≥ 0,
f (x) − f (a)
f ′ (a) ≤ ≤ f ′ (x), a ≤ x ≤ b.
x−a
Since the tangent line at a is y = f ′ (a)(x − a) + f (a), rearranging this last
inequality, we obtain
216 CHAPTER 4. CALCULUS
For example, the function in Figure 4.6 is convex near x = a, and the
graph lies above its tangent line at a.
L
pL (x) = f (a) + f ′ (a)(x − a) + (x − a)2 . (4.1.13)
2
Then p′′L (x) = L. Moreover the graph of pL (x) is tangent to the graph of f (x)
at x = a, in the sense f (a) = pL (a) and f ′ (a) = p′L (a). Because of this, we
call pL (x) the upper tangent parabola.
When y is convex, we saw above the graph of y lies above its tangent line.
When m ≤ y ′′ ≤ L, we can specify the size of the difference between the
graph and the tangent line. In fact, the graph is constrained to lie above or
below the lower or upper tangent parabolas.
If m ≤ f ′′ (x) ≤ L on [a, b], the graph lies between the lower and upper
tangent parabolas pm (x) and pL (x),
m L
(x − a)2 ≤ f (x) − f (a) − f ′ (a)(x − a) ≤ (x − a)2 . (4.1.14)
2 2
a ≤ x ≤ b.
so g(x) is convex, so g(x) lies above its tangent line at x = a. Since g(a) = 0
and g ′ (a) = 0, the tangent line is 0, and we conclude g(x) ≥ 0, which is the
4.1. SINGLE-VARIABLE CALCULUS 217
x
a
Fig. 4.6 Tangent parabolas pm (x) (green), pL (x) (red), L > m > 0.
f ′ (b) − f ′ (a)
t= =⇒ L ≥ t ≥ m,
b−a
which implies
t2 − (m + L)t + mL = (t − m)(t − L) ≤ 0.
This yields
For gradient descent, we need the relation between a convex function and
its dual. If f (x) is convex, its convex dual is
Below we see g(p) is also convex. This may not always exist, but we will work
with cases where no problems arise.
To evaluate g(p), following (4.1.11), we compute the maximizer x∗ by
setting the derivative of (px − f (x)) equal to zero and solving for x.
Let a > 0. The simplest example is f (x) = ax2 /2. In this case, the maxi-
mum of px − f (x) occurs where (px − f (x))′ = 0, which leads to
′
1
0= px − ax2 = p − ax,
2
Going back to (4.1.16), for each p, the point x where px − f (x) equals the
maximum g(p) — the maximizer — depends on p. If we denote the maximizer
by x = x(p), then
g(p) = px(p) − f (x(p)).
Since the maximum occurs when the derivative is zero, we have
Hence
g(p) = px − f (x) ⇐⇒ p = f ′ (x).
Also, by the chain rule, differentiating with respect to p,
Thus f ′ (x) is the inverse function of g ′ (p). Since g(p) = px − f (x) is the same
as f (x) = px − g(p), we have
If g(p) is the convex dual of a convex f (x), then f (x) is the convex
dual of g(p).
4.1. SINGLE-VARIABLE CALCULUS 219
f ′ (g ′ (p)) = p.
f ′′ (g ′ (p))g ′′ (p) = 1.
We derived
Let f (x) be a strictly convex function, and let g(p) be the convex dual
of f (x). Then g(p) is strictly convex and
1
g ′′ (p) = , (4.1.18)
f ′′ (x)
n
This makes sense because the binomial coefficient k is defined for any
real number n (A.2.12), (A.2.13).
In summation notation,
∞
X
n n n−k k
(a + x) = a x . (4.1.19)
k
k=0
The only difference between the binomial theorem and (4.1.19) is the upper
limit of the summation, which is set to infinity. When n is a whole number,
by (A.2.10), we have
n
= 0, for k > n,
k
220 CHAPTER 4. CALCULUS
so
f (k) (0)
n(n − 1)(n − 2) . . . (n − k + 1) n−k n n−k
= a = a .
k! k! k
Writing out the Taylor series,
∞ ∞
X f (k) (0) X n n−k k
(a + x)n = = a x ,
k! k
k=0 k=0
a, b = 0, 3*pi
theta = arange(a,b,.01)
ax = axes()
ax.grid(True)
ax.axhline(0, color='black', lw=1)
plot(theta,sin(theta))
show()
It is often convenient to set the horizontal axis tick marks at the multiples
of π/2. For this, we use
def label(k):
if k == 0: return '$0$'
elif k == 1: return r'$\pi/2$'
222 CHAPTER 4. CALCULUS
def set_pi_ticks(a,b):
base = pi/2
m = floor(b/base)
n = ceil(a/base)
k = arange(n,m+1,dtype=int)
# multiples of base
return xticks(k*base, map(label,k) )
We review the derivative of sine and cosine. This is needed for the arcsine
law (3.2.16). Recall the angle θ in radians is the length of the subtended
arc (in red) in Figure 4.9. Following the figure, with P = (x, y), we have
x = cos θ, y = sin θ.
The key idea here is Archimedes’ axiom [12], which states:
Suppose two convex curves share common initial and terminal points. If one is inside
the other, then the inside curve is the shorter.
P 1−x
Q
1 y
θ
O x I
By the figure, there are three convex curves joining P and I: The line
segment P I, the red arc, and the polygonal curve P QI. By Archimedes’
axiom, the length of P I is less than the length of the red arc, which in turn
is less than the length of P QI. Since the length of P I is greater than y, this
implies
4.1. SINGLE-VARIABLE CALCULUS 223
y < θ < 1 − x + y,
or
sin θ < θ < 1 − cos θ + sin θ.
Dividing by θ (here we assume 0 < θ < π/2),
1 − cos θ sin θ
1− < < 1. (4.1.21)
θ θ
We use this to show (the definition of limit is in §A.6)
sin θ
lim = 1. (4.1.22)
θ→0 θ
Since sin θ is odd, it is enough to verify (4.1.22) for θ > 0.
To this end, since sin2 θ = 1 − cos2 θ, from (4.1.21),
which implies
1 − cos θ
lim = 0.
θ→0 θ
Taking the limit θ → 0 in (4.1.21), we obtain (4.1.22) for θ > 0.
From (A.4.6),
sin(θ + t) = sin θ cos t + cos θ sin t,
so
sin(θ + t) − sin θ cos t − 1 sin t
lim = lim sin θ · + cos θ · = cos θ.
t→0 t t→0 t t
Thus the derivative of sine is cosine,
Similarly,
(cos θ)′ = − sin θ.
Using the chain rule, we compute the derivative of the inverse arcsin x of
sin θ. Since
θ = arcsin x ⇐⇒ x = sin θ,
we have p
1 = x′ = (sin θ)′ = θ′ · cos θ = θ′ · 1 − x2 ,
or
1
(arcsin x)′ = θ′ = √ .
1 − x2
224 CHAPTER 4. CALCULUS
We
√ use this to compute the derivative of the arcsine law (3.2.16). With
x = λ/2, by the chain rule,
′
1√
2 2 1
arcsin λ = √ · x′
π 2 π 1 − x2
(4.1.23)
2 1 1 1
= p · √ = p .
π 1 − λ/4 4 λ π λ(4 − λ)
This shows the derivative of the arcsine law is the density in Figure 3.11.
Exercises
Exercise 4.1.2 With exp x = ex , what are the first derivatives of exp(exp x)
and exp(exp(exp x))?
1 2
Exercise 4.1.3 With a > 0, let f (x) = 2 ax − ex . Where is f (x) convex,
and where is it concave?
Exercise 4.1.5 For fixed α > 0 and β > 0, find the maximizer p̂ of
pα (1 − p)β−α , 0 ≤ p ≤ 1.
Exercise 4.1.6 Compute the maximum and minimum of the second deriva-
tive of cos θ over the interval [a, b] = [−π/4, π/4]. Use that to compute the
upper and lower tangent parabolas at θ = 0. Plots these parabolas against
cos θ. Repeat everything with [a, b] = [−π/2, π/2].
Exercise 4.1.7 Suppose f (x) ≥ 0 and f ′′ (x) ≤ 1/2 for all x. Show
p
|f ′ (a)| ≤ f (a).
Exercise 4.1.12 If the convex dual of f (x) is g(p), and t is a constant, what
is the convex dual of f (x) + t?
Exercise 4.1.13 If the convex dual of f (x) is g(p), and t is a constant, what
is the convex dual of f (x + t)?
Exercise 4.1.14 If the convex dual of f (x) is g(p), and t ̸= 0 is a constant,
what is the convex dual of f (tx)?
Exercise 4.1.15 If the convex dual of f (x) is g(p), and t ̸= 0 is a constant,
what is the convex dual of tf (x)?
Exercise 4.1.16 If a > 0 and
$$f(x) = \frac{1}{2}ax^2 + bx + c,$$
what is the convex dual?
Exercise 4.1.17 Show f (x) convex implies ef (x) convex.
This is also called absolute entropy to contrast with relative entropy which
we see below.
To graph H(p), we compute its first and second derivatives. Here the
independent variable is p. By the product rule,
$$H'(p) = \bigl(-p\log p - (1-p)\log(1-p)\bigr)' = -\log p + \log(1-p) = \log\frac{1-p}{p}.$$
Thus H ′ (p) = 0 when p = 1/2, H ′ (p) > 0 on p < 1/2, and H ′ (p) < 0 on
p > 1/2. Since this implies H(p) is increasing on p < 1/2, and decreasing on
p > 1/2, p = 1/2 is a global maximizer of H(p).
As p increases, 1−p decreases, so (1−p)/p decreases. Since log is increasing,
as p increases, H ′ (p) decreases. Thus H(p) is concave.
Taking the second derivative, by the chain rule and the quotient rule,
$$H''(p) = \left(\log\frac{1-p}{p}\right)' = -\frac{1}{p(1-p)},$$
which is negative, confirming again that H(p) is concave.
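To make this concrete, here is a small plotting sketch (not from the text; it assumes nothing beyond numpy and matplotlib) graphing H(p) together with its first derivative:

from numpy import arange, log
from matplotlib.pyplot import plot, grid, legend, show

p = arange(.01, 1, .01)                # avoid log(0) at the endpoints
H = -p*log(p) - (1-p)*log(1-p)         # absolute entropy
dH = log((1-p)/p)                      # first derivative, zero at p = 1/2

grid()
plot(p, H, label="$H(p)$")
plot(p, dH, label="$H'(p)$")
legend()
show()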
A crucial aspect of Figure 4.10 is its limiting values at the edges p = 0 and
p = 1,
$$H(0) = \lim_{p\to 0} H(p) \quad\text{and}\quad H(1) = \lim_{p\to 1} H(p).$$
To explain the meaning of the entropy function H(p), suppose a coin has
heads-bias or heads-probability p. If p is near 1, then we have confidence the
outcome of tossing the coin is heads, and, if p is near 0, we have confidence the
outcome of tossing the coin is tails. If p = 1/2, then we have least information.
Thus we can view the entropy as measuring a lack of information.
To formalize this, we define the information or absolute information
Then we have
$$p' = -\frac{-e^{-x}}{(1+e^{-x})^2} = \sigma(x)\bigl(1-\sigma(x)\bigr) = p(1-p). \qquad (4.2.4)$$
The logistic function, also called the expit function and the sigmoid function,
is studied further in §5.2, where it is used in coin-tossing and Bayes theorem.
The inverse of the logistic function is the logit function. The logit function
is found by solving p = σ(x) for x, obtaining
$$x = \sigma^{-1}(p) = \log\frac{p}{1-p}. \qquad (4.2.5)$$
The logit function is also called the log-odds function. Its derivative is
$$x' = \left(\log\frac{p}{1-p}\right)' = \frac{1-p}{p}\cdot\left(\frac{p}{1-p}\right)' = \frac{1-p}{p}\cdot\frac{1}{(1-p)^2} = \frac{1}{p(1-p)}.$$
Let
Z(x) = log (1 + ex ) . (4.2.6)
Then Z ′ (x) = σ(x) and Z ′′ (x) = σ ′ (x) = σ(1 − σ) > 0. This shows Z(x) is
strictly convex. We call Z(x) the cumulant-generating function, to be consis-
tent with random variable terminology (§5.3).
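As a quick numerical check of Z′(x) = σ(x) (a sketch, not from the text; expit is scipy's name for the logistic function):

from numpy import log, exp, allclose
from scipy.special import expit

def Z(x): return log(1 + exp(x))       # cumulant-generating function

x, h = 0.7, 1e-6
numeric = (Z(x+h) - Z(x-h))/(2*h)      # centered finite difference
print(allclose(numeric, expit(x)))     # True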
We compute the convex dual (§4.1) of Z(x). By (4.1.11), the maximum
$$\max_x\,\bigl(px - Z(x)\bigr)$$
0 ≤ p ≤ 1, and 0 ≤ q ≤ 1.
Then
I(q, q) = 0,
which agrees with our design goal of I(p, q) measuring the divergence between
the information in p and the information in q. Because I(p, q) is not symmetric
in p, q, we think of q as a base or reference probability, against which we
compare p.
Equivalently, instead of measuring relative information, we can measure
the relative entropy,
H(p, q) = −I(p, q).
Since − log(x) is strictly convex,
$$I(p,q) = -p\log\frac{q}{p} - (1-p)\log\frac{1-q}{1-p} > -\log\left(p\cdot\frac{q}{p} + (1-p)\cdot\frac{1-q}{1-p}\right) = -\log 1 = 0.$$
$$\frac{d^2}{dp^2}\,I(p,q) = I''(p) = \frac{1}{p(1-p)},$$
$$\frac{d^2}{dq^2}\,I(p,q) = \frac{p}{q^2} + \frac{1-p}{(1-q)^2}.$$
For more on this terminology confusion, see the remarks at the end of §5.6.
The code is as follows. Here the relative information I(p,q) is computed with scipy's entropy, and the grids start at .01 to avoid division by zero at the edges.

%matplotlib ipympl
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import entropy

# relative information I(p,q) of the coin biases p and q
def I(p,q): return entropy([p,1-p],[q,1-q])

ax = axes(projection='3d')
ax.set_axis_off()

p = arange(.01,1,.01)
q = arange(.01,1,.01)
p,q = meshgrid(p,q)

# surface
ax.plot_surface(p,q,I(p,q), cmap='cool')
# square
ax.plot([0,1,1,0,0],[0,0,1,1,0],linewidth=.5,c="k")
show()
Exercises
Exercise 4.2.3 Let 0 < q < 1 be a constant. What is the convex dual of
Exercise 4.2.5 The relative information I(p, q) has minimum zero when p =
q. Use the lower tangent parabola (4.1.12) of I(x, q) at q and Exercise 4.2.2
to show
I(p, q) ≥ 2(p − q)2 .
For q = 0.7, plot both I(p, q) and 2(p − q)2 as functions of 0 < p < 1.
Let
f (x) = f (x1 , x2 , . . . , xd )
be a scalar function of a point x = (x1 , x2 , . . . , xd ) in Rd , and suppose v is
a unit vector in Rd . Then, along the line x(t) = x + tv, g(t) = f (x + tv)
is a function of the single variable t. Hence its derivative g ′ (0) at t = 0 is
well-defined. Since g ′ (0) depends on the point x and on the direction v, this
rate of change is the directional derivative of f (x) at x in the direction v.
More explicitly, the directional derivative of f (x) at x in the direction v is
$$D_v f(x) = \frac{d}{dt}\Big|_{t=0} f(x+tv). \qquad (4.3.1)$$
In particular, taking v = e_k, the k-th standard basis vector, yields the partial derivative
$$\frac{\partial f}{\partial x_k}(x) = \frac{d}{dt}\Big|_{t=0} f(x+te_k).$$
The partial derivative in the k-th direction is just the one-dimensional deriva-
tive considering xk as the independent variable, with all other xj ’s constants.
Below we exhibit the multi-variable chain rule in two ways. The first in-
terpretation is geometric, and involves motion in time and directional deriva-
tives. This interpretation is relevant to gradient descent, §7.3.
The second interpretation is combinatorial, and involves repeated compo-
sitions of functions. This interpretation is relevant to computing gradients in
networks, specifically backpropagation §4.4, §7.2.
These two interpretations work together when training neural networks,
§7.4.
For the first interpretation of the chain rule, suppose the components x1 ,
x2 , . . . , xd are functions of a single variable t (usually time), so we have
The Rd -valued function x(t) = (x1 (t), x2 (t), . . . , xd (t)) represents a curve
or path in Rd , and the vector
$$\frac{d}{dt}\Big|_{t=0} f(x+tv) = \nabla f(x)\cdot v. \qquad (4.3.3)$$
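As an illustration of (4.3.3) (a sketch with an arbitrarily chosen f, point x, and direction v, none of which come from the text):

from numpy import array, sqrt, allclose

def f(x): return x[0]**2 + x[1]*x[2]          # sample scalar function on R^3
def grad_f(x): return array([2*x[0], x[2], x[1]])

x = array([1.0, 2.0, 3.0])
v = array([1.0, 1.0, 1.0])/sqrt(3)            # a unit direction
h = 1e-6
numeric = (f(x+h*v) - f(x-h*v))/(2*h)         # d/dt f(x+tv) at t = 0
print(allclose(numeric, grad_f(x) @ v))       # True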
$$\frac{d}{ds}\Big|_{s=0} f(W+sV) = \operatorname{trace}(V^t G) \quad\text{for all } V. \qquad (4.3.4)$$
(Figure: the input x feeds into three nodes with outputs r, s = g(x), and t; these feed into the sum node u = r + s + t, and the output is y = k(u).)
$$\frac{dy}{dr} = \frac{dy}{du}\cdot\frac{du}{dr} = -0.90 * 1 = -0.90,$$
and similarly,
$$\frac{dy}{ds} = \frac{dy}{dt} = -0.90.$$
By the chain rule,
$$\frac{dy}{dx} = \frac{dy}{dr}\cdot\frac{dr}{dx} + \frac{dy}{ds}\cdot\frac{ds}{dx} + \frac{dy}{dt}\cdot\frac{dt}{dx}.$$
By (4.2.4), s′ = s(1 − s) = 0.22, so
$$\frac{dr}{dx} = \cos x = 0.71, \qquad \frac{ds}{dx} = s(1-s) = 0.22, \qquad \frac{dt}{dx} = 2x = 1.57.$$
We obtain
$$\frac{dy}{dx} = -0.90*0.71 - 0.90*0.22 - 0.90*1.57 = -2.25.$$
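A finite-difference check of this value is sketched below. The concrete functions are assumptions read off from the figure (r = sin x, s = σ(x), t = x², u = r + s + t, y = cos u, at x = π/4); they are not spelled out in this excerpt, so treat the snippet as illustrative only.

from numpy import sin, cos, pi
from scipy.special import expit

def y_of(x):                            # assumed composite from the figure
    r, s, t = sin(x), expit(x), x**2
    return cos(r + s + t)

x, h = pi/4, 1e-6
print((y_of(x+h) - y_of(x-h))/(2*h))    # approximately -2.25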
The chain rule is discussed in further detail in §4.4.
∇f (x∗ ) = 0.
$$\begin{aligned} g(t) = f(x+tv) &= \frac{1}{2}(x+tv)\cdot Q(x+tv) - b\cdot(x+tv) \\ &= \frac{1}{2}x\cdot Qx - b\cdot x + t\,v\cdot(Qx-b) + \frac{1}{2}t^2\,v\cdot Qv \\ &= f(x) + t\,v\cdot(Qx-b) + \frac{1}{2}t^2\,v\cdot Qv. \end{aligned} \qquad (4.3.7)$$
From this follows
$$g'(t) = v\cdot(Qx-b) + t\,v\cdot Qv, \qquad g''(t) = v\cdot Qv.$$
This shows
Quadratic Convexity
∇f (x) = Qx − b. (4.3.8)
Moreover f (x) is convex everywhere when Q is a variance matrix.
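A small check of the gradient formula (4.3.8) by finite differences (a sketch with an arbitrarily chosen Q and b, not taken from the text):

from numpy import array, zeros, allclose

Q = array([[2.0, 1.0],[1.0, 3.0]])       # symmetric positive matrix
b = array([1.0, -1.0])

def f(x): return x @ Q @ x / 2 - b @ x

x = array([0.5, 2.0])
h = 1e-6
grad = zeros(2)
for k in range(2):                       # centered difference in each coordinate
    e = zeros(2); e[k] = h
    grad[k] = (f(x+e) - f(x-e))/(2*h)

print(allclose(grad, Q @ x - b))         # True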
By (2.2.2),
Dv f (x) = ∇f (x) · v = |∇f (x)| |v| cos θ,
where θ is the angle between the vector v and the gradient vector ∇f (x).
Since −1 ≤ cos θ ≤ 1, we conclude
To see this, pick any point a. Then, by properness, the sublevel set f (x) ≤
f (a) is bounded. By continuity of f (x), there is a minimizer x∗ (see §A.7).
Since for all x outside this sublevel set, we have f (x) > f (a), x∗ is a global
minimizer.
To see this, suppose f (x) is not proper. In this case, by (4.3.9), there
would be a level c and a sequence x1 , x2 , . . . in the row space of A satisfying
|xn | → ∞ and f (xn ) ≤ c for n ≥ 1.
Let x′n = xn/|xn|. Then the x′n are unit vectors in the row space of A, hence
x′n is a bounded sequence. From §A.7, this implies x′n subconverges to some
limit x′.
Properness of Residual
is proper on Rd .
As a consequence,
Let a be any point, and v any direction, and let g(t) = f (a + tv). Then
g ′ (0) = ∇f (a) · v.
Exercises
Exercise 4.3.1 Let I(p, q) be the relative information (4.2.9), and let Ipp ,
Ipq , Iqp , Iqq be the second partial derivatives. If Q is the second derivative
matrix
$$Q = \begin{pmatrix} I_{pp} & I_{pq} \\ I_{qp} & I_{qq} \end{pmatrix},$$
show
$$\det(Q) = \frac{(p-q)^2}{p(1-p)\,q^2(1-q)^2}.$$
Exercise 4.3.2 Let I(p, q) be the relative information (4.2.9). With x =
(p, q) and v = (ap(1 − p), bq(1 − q)), show
$$\frac{d^2}{dt^2}\Big|_{t=0} I(x+tv) = p(1-p)(a-b)^2 + b^2(p-q)^2.$$
Conclude that I(p, q) is a convex function of (p, q). Where is it not strictly
convex?
Exercise 4.3.3 Let J(x) = J(x1 , x2 , . . . , xd ) equal
$$J(x) = \frac{1}{2}(x_1-x_2)^2 + \frac{1}{2}(x_2-x_3)^2 + \cdots + \frac{1}{2}(x_{d-1}-x_d)^2 + \frac{1}{2}(x_d-x_1)^2.$$
Compute Q = D2 J.
Exercise 4.3.4 Let f (Q) = log det(Q) be the log of the determinant of a
positive 2 × 2 matrix Q (Exercise 3.2.6), and let V be a symmetric 2 × 2
matrix. Using Exercise 1.4.11, compute the second derivative of f (Q + tV )
at t = 0 as in (4.3.6). Using Exercise 1.4.10, conclude log det(Q) is a concave
function of Q.
$$r = f(x),\ y = g(r) \implies \frac{dy}{dx} = \frac{dy}{dr}\cdot\frac{dr}{dx}.$$
In this section, we work out the implications of the chain rule on repeated
compositions of functions.
Suppose
$$r = f(x) = \sin x, \qquad s = g(r) = \frac{1}{1+e^{-r}}, \qquad y = h(s) = s^2.$$
(Figure 4.15: the chain x → r → s → y, with the functions f, g, h applied along the edges.)
The chain in Figure 4.15 has five nodes and four edges. There is one input
node (no incoming edge from another node) and one output node (no
outgoing edge to another node). The outgoing signals at the first four nodes
are x, r, s, y. The incoming signals at the last four nodes are x, r, s, y.
Start with x = π/4. Evaluating the functions in order,
$$x = 0.785, \qquad r = \sin x = 0.707, \qquad s = g(r) = 0.670, \qquad y = s^2 = 0.449.$$
Notice these values are evaluated in the forward direction: x then r then s
then y. This is forward propagation.
From this, since dy/ds = 2s = 1.340,
$$\frac{dy}{dr} = \frac{dy}{ds}\cdot\frac{ds}{dr} = 1.340 * g'(r) = 1.340 * 0.221 = 0.296.$$
Repeating one more time,
$$\frac{dy}{dx} = \frac{dy}{dr}\cdot\frac{dr}{dx} = 0.296 * \cos x = 0.296 * 0.707 = 0.209.$$
Thus the derivatives are
$$\frac{dy}{dx} = 0.209, \qquad \frac{dy}{dr} = 0.296, \qquad \frac{dy}{ds} = 1.340.$$
Notice the derivatives are evaluated in the backward direction: First dy/dy =
1, then dy/ds, then dy/dr, then dy/dx. This is back propagation.
r = x2 ,
s = r 2 = x4 ,
y = s2 = x8 .
This is the same function h(x) = x2 composed with itself three times. With
x = 5, we have
func_chain = [f,g,h]
der_chain = [df,dg,dh]
Then we evaluate the output vector x = (x, r, s, y), leading to the first
version of forward propagation,
def forward_prop(x_in,func_chain):
x = [x_in]
while func_chain:
f = func_chain.pop(0) # first func
x_out = f(x_in)
x.append(x_out) # insert at end
x_in = x_out
return x
# dy/dy = 1
delta_out = 1
def backward_prop(delta_out,x,der_chain):
delta = [delta_out]
while der_chain:
# discard last output
x.pop(-1)
df = der_chain.pop(-1) # last der
der = df(x[-1])
# chain rule -- multiply by previous der
der = der * delta[0]
delta.insert(0,der) # insert at start
return delta
delta = backward_prop(delta_out,x,der_chain)
d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1
x = forward_prop(x_in,func_chain)
delta = backward_prop(delta_out,x,der_chain)
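For this run to produce numbers, h and its derivative dh must be defined; the definitions below are assumptions consistent with r = x², s = r², y = s² (a minimal sketch, not verbatim from the text):

def h(x): return x**2          # squaring function
def dh(x): return 2*x          # its derivative

d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1

x = forward_prop(x_in, func_chain)
print(x)                                        # [5, 25, 625, 390625]
delta = backward_prop(delta_out, x, der_chain)  # note: pops entries off x
print(delta)                                    # [625000, 62500, 1250, 1]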
Now we work with the network in Figure 4.16, using the multi-variable
chain rule (§4.3). The functions are
a = f (x, y) = x + y,
b = g(y, z) = max(y, z),
J = h(a, b) = ab.
J = (x + y) max(y, z),
Here there are three input nodes, labeled 0, 1, 2, three hidden nodes, 3,
4, 5, and an output node, 6. Starting with inputs (x, y, z) = (1, 2, 0), and
plugging in, we obtain the outgoing signals at the first six nodes,
(x, y, z, a, b, J) = (1, 2, 0, 3, 2, 6).
(Figure 4.16: the network with inputs x, y, z, hidden nodes a = x + y (a sum node) and b = max(y, z) (a max node), and output J = ab (a product node).)
(Figure 4.17: the two cases of the max node. When y > z, max(y, z) = y and ∂g/∂y = 1, ∂g/∂z = 0; when y < z, max(y, z) = z and ∂g/∂y = 0, ∂g/∂z = 1.)
$$\frac{\partial J}{\partial a} = b = 2, \qquad \frac{\partial J}{\partial b} = a = 3.$$
Then
$$\frac{\partial a}{\partial x} = 1, \qquad \frac{\partial a}{\partial y} = 1.$$
Let
$$\mathbf{1}(y > z) = \begin{cases} 1, & y > z,\\ 0, & y < z.\end{cases}$$
By Figure 4.17, since y = 2 and z = 0,
$$\frac{\partial b}{\partial y} = \mathbf{1}(y > z) = 1, \qquad \frac{\partial b}{\partial z} = \mathbf{1}(z > y) = 0.$$
By the chain rule,
$$\frac{\partial J}{\partial x} = \frac{\partial J}{\partial a}\cdot\frac{\partial a}{\partial x} = 2*1 = 2,$$
$$\frac{\partial J}{\partial y} = \frac{\partial J}{\partial a}\cdot\frac{\partial a}{\partial y} + \frac{\partial J}{\partial b}\cdot\frac{\partial b}{\partial y} = 2*1 + 3*1 = 5,$$
$$\frac{\partial J}{\partial z} = \frac{\partial J}{\partial b}\cdot\frac{\partial b}{\partial z} = 3*0 = 0.$$
(Figure 4.18: the network of Figure 4.16, annotated with the outgoing signals in blue and the derivatives in red.)
Hence we have
$$\left(\frac{\partial J}{\partial x}, \frac{\partial J}{\partial y}, \frac{\partial J}{\partial z}, \frac{\partial J}{\partial a}, \frac{\partial J}{\partial b}, \frac{\partial J}{\partial J}\right) = (2, 5, 0, 2, 3, 1).$$
The outputs (blue) and the derivatives (red) are displayed in Figure 4.18.
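These values can be checked numerically; the sketch below (not from the text) compares centered finite differences of J = (x + y) max(y, z) at (1, 2, 0) with the hand computation:

from numpy import array, zeros, allclose

def J(v):
    x, y, z = v
    return (x + y) * max(y, z)

v = array([1.0, 2.0, 0.0])
h = 1e-6
grad = zeros(3)
for k in range(3):
    e = zeros(3); e[k] = h
    grad[k] = (J(v+e) - J(v-e))/(2*h)

print(allclose(grad, [2, 5, 0]))   # True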
Summarizing, by the chain rule,
d = 7
w = [ [None]*d for _ in range(d) ]
More generally, in a weighted directed graph (§3.3), the weights wij are numeric scalars or None.
Once we have the outgoing vector x, for each node j, let
$$x^-_j = (w_{0j}x_0,\ w_{1j}x_1,\ w_{2j}x_2,\ \ldots,\ w_{d-1,j}x_{d-1}). \qquad (4.4.1)$$
This is the incoming signal list at node j. Here we adopt the convention that
None times anything is None, and any resulting None entry in the list is to be
discarded.
An activation function at node j is a function f_j of the incoming signal
list x^-_j. Then the outgoing signal at node j is
$$x_j = f_j(x^-_j). \qquad (4.4.2)$$
$$x^- = (x^-_0, x^-_1, x^-_2, \ldots, x^-_{d-1}).$$
For the network in Figure 4.16,
$$x^- = (x^-_0, x^-_1, x^-_2, x^-_3, x^-_4, x^-_5, x^-_6),$$
where
x0minus = [None,None,None,None,None,None,None]
x1minus = [None,None,None,None,None,None,None]
x2minus = [None,None,None,None,None,None,None]
x3minus = [x,y,None,None,None,None,None]
x4minus = [None,y,z,None,None,None,None]
x5minus = [None,None,None,a,b,None,None]
x6minus = [None,None,None,None,None,None,J]
After discarding the None entries, these reduce to

x0minus = [ ]
x1minus = [ ]
x2minus = [ ]
x3minus = [x,y]
x4minus = [y,z]
x5minus = [a,b]
x6minus = [J]
activate = [None]*d
def incoming(x,w,j):
    return [ w[i][j] * outgoing(x,w,i) for i in range(d) if w[i][j] != None ]
def outgoing(x,w,j):
if x[j] != None: return x[j]
else:
if activate[j] != None: return activate[j](*incoming(x,w,j))
else: return None
(Figure: a chain network with edge weights 5, −2, 7 and activation functions f, g, h.)
Let xin be the outgoing vector over the input nodes. If there are m input
nodes, and d nodes in total, then the length of xin is m, and the length of x
is d. In the example above, xin = (x, y, z).
We assume the list of nodes is ordered so that the initial portion of the list
of nodes is the list of input nodes.
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
d = len(w)
x = [None]*d
m = len(x_in)
x[:m] = x_in
for j in range(m,d): x[j] = outgoing(x,w,j)
return x
For this code to work, we assume there are no cycles in the graph: All back-
ward paths end at input nodes, and all forward paths end at output nodes.
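As a concrete illustration, the wiring below runs forward_prop on the network of Figure 4.16. The weight matrix, the choice of weight 1 on every edge, and the identity activation at the output node are assumptions made for this sketch; incoming, outgoing, and activate are the ones defined above.

d = 7
w = [ [None]*d for _ in range(d) ]
# edges: 0,1 -> 3 (sum), 1,2 -> 4 (max), 3,4 -> 5 (product), 5 -> 6 (output)
w[0][3] = w[1][3] = 1
w[1][4] = w[2][4] = 1
w[3][5] = w[4][5] = 1
w[5][6] = 1

activate = [None]*d
activate[3] = lambda u,v: u + v        # a = x + y
activate[4] = lambda u,v: max(u,v)     # b = max(y,z)
activate[5] = lambda u,v: u * v        # J = a*b
activate[6] = lambda u: u              # output node passes J through

x_in = [1, 2, 0]                       # (x, y, z)
x = forward_prop(x_in, w)
print(x)                               # [1, 2, 0, 3, 2, 6, 6]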
The output function J is a function of all node outputs. For Figure 4.16,
this means J is a function of x, y, z, a, b.
Therefore, at each node i, we have the derivatives
$$\delta_i = \frac{\partial J}{\partial x_i}(x), \qquad i = 0, 1, 2, \ldots, d-1.$$
Then δ = (δ0 , δ1 , δ2 , . . . , δd−1 ) is the gradient vector. We first compute the
derivatives of J with respect to the output nodes xout , and we assume these
derivatives are assembled into a vector δout .
In Figure 4.16, there is one output node J, and
$$\delta_J = \frac{\partial J}{\partial J} = 1.$$
Hence δout = (1).
We assume the list of nodes is ordered so that the terminal portion of the
list of nodes is the list of output nodes.
For each i, j, let
$$g_{ij} = \frac{\partial f_j}{\partial x_i}.$$
Then we have a d × d gradient matrix g = (gij ). When (i, j) is not an edge,
gij = 0.
These are the local derivatives, not the derivatives obtained by the chain
rule. For example, even though we saw above ∂J/∂y = 1, here the local
derivative is zero, since J does not depend directly on y.
For the example above, with (x1 , x2 , x3 , x4 , x5 , x6 ) = (x, y, z, a, b, J),
$$\frac{\partial J}{\partial x_i} = \sum_{i\to j} \frac{\partial J}{\partial x_j}\cdot\frac{\partial x_j}{\partial x_i} = \sum_{i\to j} \frac{\partial J}{\partial x_j}\cdot\frac{\partial f_j}{\partial x_i}\cdot w_{ij},$$
so
$$\delta_i = \sum_{i\to j} \delta_j\cdot g_{ij}\cdot w_{ij}.$$
The code is
def derivative(x,m,delta,g,i):
    if delta[i] != None: return delta[i]
    elif i >= d-m: return 1
    else:
        return sum([ derivative(x,m,delta,g,j) * g[i][j](*incoming(x,w,j)) * w[i][j] for j in range(d) if g[i][j] != None ])
def backward_prop(x,m,g):
d = len(g)
delta = [None]*d
for i in range(d): delta[i] = derivative(x,m,delta,g,i)
return delta[:-m]
m = 1
delta = backward_prop(x,m,g)
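Continuing the Figure 4.16 sketch from the forward pass above (again an illustration, with the local-derivative functions chosen by hand; each g[i][j] takes the incoming signal list at node j):

g = [ [None]*d for _ in range(d) ]
g[0][3] = lambda u,v: 1                      # d(u+v)/du
g[1][3] = lambda u,v: 1                      # d(u+v)/dv
g[1][4] = lambda u,v: 1 if u > v else 0      # d max(u,v)/du
g[2][4] = lambda u,v: 1 if v > u else 0      # d max(u,v)/dv
g[3][5] = lambda u,v: v                      # d(u*v)/du
g[4][5] = lambda u,v: u                      # d(u*v)/dv
g[5][6] = lambda u: 1                        # identity output node

m = 1
delta = backward_prop(x,m,g)
print(delta)                                 # [2, 5, 0, 2, 3, 1]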
In §7.2, we derive the third version of propagation, this for neural networks.
Exercises
Exercise 4.4.1 For the network in Figure 4.15, use the second version of
the propagation code and x = π/4 to compute the output vector and the
gradient vector
$$x = (x, r, s, y), \qquad \delta = \left(\frac{dy}{dx}, \frac{dy}{dr}, \frac{dy}{ds}, \frac{dy}{dy}\right).$$
Exercise 4.4.2 In Figure 4.20 below, the activation function at each neuron
is the sum of the squares of the incoming signals to that neuron. Starting with
x = 1, compute (x, a, b, c, d, p, q, J), and the corresponding derivatives of J.
Do this by hand and by coding. You should get
(Figure 4.20: a network with input x and neurons a, b, c, d, p, q feeding the output J.)
Exercise 4.4.3 Compute the outgoing vector x and gradient vector δ for the
network in Figure 4.21. The outgoing signal at each neuron is the sum of the
squares of the incoming signals at that neuron. Here the input node signal is
the variable t, so both x and δ will have powers of t in them.
$$f(x) = f(x_1, x_2) = \max(|x_1|, |x_2|), \qquad f(x) = f(x_1, x_2) = \frac{x_1^2}{4} + x_2^2$$
are scalar functions of points in R2 . A level set of f (x) is the set
E: f (x) = 1.
This is the level set corresponding to level 1. One can have level sets corre-
sponding to any level c, f (x) = c. In two dimensions, level sets are also called
contour lines.
For example, the variance ellipse x · Qx = 1 is a level set. The perimeters
(not their interiors) of the square and ellipse in Figure 4.22 are level sets
$$\max(|x_1|, |x_2|) = 1, \qquad \frac{x_1^2}{16} + \frac{x_2^2}{4} = 1.$$
E: f (x) ≤ 1.
This is the sublevel set corresponding to level 1. One can have sublevel sets
corresponding to any level c, f (x) ≤ c. Sublevel sets were used in the def-
inition of proper functions (4.3.9). For example, in Figure 4.22, the (blue)
interior of the square, together with the square itself, is a sublevel set. Sim-
ilarly, the interior of the ellipse, together with the ellipse itself, is a sublevel
set. The interiors of the ellipsoids, together with the ellipsoids themselves, in
Figure 4.31 are sublevel sets. We always consider the level set to be part of
the sublevel set.
The level set f (x) = 1 is the boundary of the sublevel set f (x) ≤ 1. Thus
the square and the ellipse in Figure 4.22 are boundaries of their respective
sublevel sets, and the unit variance ellipsoid x · Qx = 1 is the boundary of
the sublevel set x · Qx ≤ 1.
(Figure 4.24).
(Figure 4.24: the points x0 and x1, the vector v = x1 − x0, the point tv, and the convex combination (1 − t)x0 + tx1 on the segment joining x0 and x1.)
This says the line segment joining any two points (x0 , f (x0 )) and (x1 , f (x1 ))
on the graph of f(x) lies above the graph of f(x).¹ For example, in two
dimensions, the function $f(x) = f(x_1, x_2) = x_1^2 + x_2^2/4$ is convex because its
graph is the paraboloid in Figure 4.25.

¹ We only consider convex functions that are continuous.
If the inequality is strict for 0 < t < 1, then f (x) is strictly convex,
f ((1 − t)x0 + tx1 ) < (1 − t)f (x0 ) + tf (x1 ), for 0 < t < 1.
t1 x1 + t2 x2 + · · · + tN xN
t1 + t2 + · · · + tN = 1.
Fig. 4.25 Convex: The line segment lies above the graph.
Quadratic is Convex
This was derived in the previous section, but here we present a more
geometric proof.
To derive this result, let x0 and x1 be any points, and let v = x1 − x0 .
Then x0 + tv = (1 − t)x0 + tx1 and x1 = x0 + v. Let g0 = Qx0 − b. By (4.3.7),
$$f(x_0+tv) = f(x_0) + t\,v\cdot(Qx_0-b) + \frac{1}{2}t^2\,v\cdot Qv = f(x_0) + t\,v\cdot g_0 + \frac{1}{2}t^2\,v\cdot Qv. \qquad (4.5.3)$$
Inserting t = 1 in (4.5.3), we have f (x1 ) = f (x0 ) + v · g0 + v · Qv/2. Since
t2 ≤ t for 0 ≤ t ≤ 1 and v · Qv ≥ 0, by (4.5.3),
Here are some basic properties and definitions of sets that will be used
in this section and in the exercises. Let a be a point in Rd and let r be a
positive scalar. A closed ball of radius r and center a is the set of points x
satisfying |x − a|2 ≤ r2 . An open ball of radius r and center a is the set of
points x satisfying |x − a|2 < r2 .
Let E be any set in Rd . The complement of E is the set E c of points that
are not in E. If E and F are sets, the intersection E ∩ F is the set of points
that lie in both sets.
A point a is in the interior of E if there is a ball B centered at a contained
in E; this is usually written B ⊂ E. Here the ball may be either open or
closed, the interior is the same.
(Figure: points x1, x2, . . . , x7 in the plane and their convex hull.)
x = t 1 x1 + t 2 x2 + · · · + t N xN
from numpy.random import default_rng
from scipy.spatial import ConvexHull

rng = default_rng()
points = rng.random((30,2))    # assumed here: a random sample of points in the plane
hull = ConvexHull(points)
facet = hull.simplices[0]      # one facet (edge) of the hull
plot(points[facet, 0], points[facet, 1], 'r--')
grid()
show()
Let E be any convex set, and let x0 be any point. We search for a point
x∗ in E that is nearest to x0 . Since |x − x0 | is the distance between x and
x0 , the nearest point x∗ satisfies
|x∗ − x0 |2 = min |x − x0 |2 .
x in E
Here we minimize the distance squared, since the distance minimizer is the
same as the distance-squared minimizer.
If x0 is in E, then clearly x∗ = x0 is the unique distance minimizer. If x0 is
not in E, the results in §A.7 guarantee the existence of a distance minimizer
x∗. This means there is at least one point in E that is nearest to x0.
Let x0 be any point not in E. We show there is exactly one point x∗ in
E nearest to x0 . Let δ be the minimum distance-squared between E and x0 .
To this end, suppose x′ is another point in E at the same distance from x0
as x∗ . Then
|x∗ − x0 |2 = δ = |x′ − x0 |2 .
If xa = (x∗ + x′ )/2 is the average of x∗ and x′ , since E is convex, xa is in E,
hence |xa − x0|² ≥ δ. By expanding the squares, check that
$$4|x_a - x_0|^2 + |x^* - x'|^2 = 2|x^* - x_0|^2 + 2|x' - x_0|^2.$$
Since xa is in E, the left side is no less than 4δ + |x∗ − x′|². On the other hand,
the right side equals 4δ. This implies |x∗ − x′ |2 = 0, or x∗ = x′ , completing
the proof.
Given any point x0 and any convex set E, there is a unique point x∗
in E nearest to x0 .
If f (x) is a function, its graph is the set of points (x, y) in Rd+1 satisfying
y = f (x), and its epigraph is the set of points (x, y) satisfying y ≥ f (x).
If f (x) is defined on Rd , its sublevel sets are in Rd , and its epigraph is in
Rd+1 . Then f (x) is a convex function exactly when its epigraph is a convex
set (Figure 4.25). From convex functions, there are other ways to get convex
sets:
E: f (x) ≤ 1
is a convex set.
H: n · (x − x0 ) = 0. (4.5.4)
H: m · x + b = 0, (4.5.5)
with a nonzero vector m and scalar b. In this section, we use (4.5.4); in §7.6,
we use (4.5.5).
(Figure: a hyperplane through x0 with normal vector n; the hyperplane is the set n · (x − x0) = 0, and the two half-spaces are n · (x − x0) < 0 and n · (x − x0) > 0.)
The vector n is the normal vector to the hyperplane. Note replacing n by any
nonzero multiple of n leaves the hyperplane unchanged.
Separating Hyperplane I
(Figure: a convex set E, a point x0 outside E, the nearest point x∗ in E, the normal n = x0 − x∗, and points x and x∗ + tv in E.)
Expanding, we have
0 ≤ 2(x∗ − x0 ) · v + t|v|2 , 0 ≤ t ≤ 1.
Since this is true for small positive t, sending t → 0 results in v ·(x∗ −x0 ) ≥ 0.
Since n = x0 − x∗ , v = x − x∗ , we obtain
x in E =⇒ (x − x0 ) · n ≤ 0. (4.5.7)
$$\begin{cases} y \ge 0, & \text{if } p = 1,\\ y \le 0, & \text{if } p = 0,\end{cases} \qquad \text{for every sample } x. \qquad (4.5.8)$$
m · xk + b = 0, k = 1, 2, . . . , N. (4.5.9)
Separating Hyperplane II
To derive this result, from Exercise 4.5.9 both K0 and K1 have interiors.
Suppose there is a separating hyperplane m · x + b = 0. If x0 is any point
in the intersection K0 ∩ K1, then we have m · x0 + b ≤ 0 and m · x0 + b ≥ 0,
so m · x0 + b = 0. This shows the separating hyperplane passes through x0 .
Since K0 lies on one side of the hyperplane, x0 cannot be in the interior of
K0 . Similarly for K1 . Hence x0 cannot be in the interior of K0 ∩ K1 . This
implies K0 ∩ K1 has no interior.
Conversely, for the reverse direction, suppose K0 ∩ K1 has no interior.
There are two cases, whether K0 ∩ K1 is empty or not. If K0 ∩ K1 is empty,
then the minimum of |x1 − x0 |2 over all x1 in K1 and x0 in K0 is positive. If
we let
$$|x_1^* - x_0^*|^2 = \min_{x_0 \text{ in } K_0,\ x_1 \text{ in } K_1} |x_1 - x_0|^2, \qquad (4.5.10)$$
(Figure 4.32: the convex sets K0 and K1 with nearest points x0∗ and x1∗ and separating hyperplanes H0 and H1; the scaled sets tK0 and tK1 with separating hyperplane H.)
In the first case, since K0 and K1 don’t intersect, x∗1 is not in K0 , and x∗0
is not in K1 . Let m = x∗1 − x∗0 . Since x∗0 is the point in K0 closest to K1 ,
by separating hyperplane I, the hyperplane H0 : m · (x − x∗0 ) = 0 separates
K0 from x∗1 . Similarly, since x∗1 is the point in K1 closest to K0 , the hyper-
plane H1 : m · (x − x∗1 ) = 0 separates K1 from x∗0 . Thus (Figure 4.32) both
hyperplanes separate K0 from K1 .
In the second case, when K0 and K1 intersect, then the minimum in
(4.5.10) is zero, hence x∗0 = x∗1 = x∗ . Let 0 < t < 1, and let tK0 be K0
scaled towards its mean. Similarly, let tK1 be K1 scaled towards its mean.
By Exercise 4.5.10, both tK0 and tK1 lie in the interiors of K0 and K1
respectively, so tK0 and tK1 do not intersect. By applying the first case to
tK0 and tK1 , and choosing t close to 1, t → 1, we obtain a hyperplane H
separating K0 and K1 . We skip the details.
In Figure 4.22, at the corner of the square, there are multiple supporting
hyperplanes. However, at every other point a on the boundary of the square,
there is a unique (up to scalar multiple) supporting hyperplane. For the ellipse
or ellipsoid, at every point of the boundary, there is a unique supporting
hyperplane.
Now we derive the analogous concepts for convex functions.
We say a function f (x) is convex if g(t) = f (a + tv) is convex for every
point a and direction v. This is our third definition of convex; they are all
equivalent. This way a convex function of a vector variable is reduced to a
convex function of a scalar variable.
We say a function f (x) is strictly convex if g(t) = f (a + tv) is strictly
convex for every point a and direction v. This is the same as saying the
inequality (4.5.1) is strict for 0 < t < 1.
Let f (x) be a function and let a be a point at which there is a gradient
∇f (a). The tangent hyperplane for f (x) at a is
∇f (x∗ ) = 0. (4.5.13)
Let
$$\frac{\partial^2 f}{\partial x_i\,\partial x_j}, \qquad 1 \le i, j \le d,$$
be the second partial derivatives. Then the second derivative of f(x) is the symmetric matrix
$$D^2 f(x) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1\partial x_1} & \dfrac{\partial^2 f}{\partial x_1\partial x_2} & \cdots \\[2mm] \dfrac{\partial^2 f}{\partial x_2\partial x_1} & \dfrac{\partial^2 f}{\partial x_2\partial x_2} & \cdots \\[2mm] \cdots & \cdots & \cdots \\[2mm] \dfrac{\partial^2 f}{\partial x_d\partial x_1} & \dfrac{\partial^2 f}{\partial x_d\partial x_2} & \cdots \end{pmatrix}$$
Replacing x by x + tv in (4.3.3), we have
$$\frac{d}{dt}\,f(x+tv) = \nabla f(x+tv)\cdot v.$$
Differentiating and using the chain rule again,
$$\frac{d^2}{dt^2}\Big|_{t=0} f(x+tv) = v\cdot Qv. \qquad (4.5.14)$$
This implies
$$\frac{d^2}{dt^2}\Big|_{t=0} f(x+tv) = 0 \quad\text{only when } v = 0. \qquad (4.5.15)$$
If m ≤ D²f(x) ≤ L, then
$$\frac{m}{2}\,|x-a|^2 \le f(x) - f(a) - \nabla f(a)\cdot(x-a) \le \frac{L}{2}\,|x-a|^2. \qquad (4.5.16)$$
$$\frac{m}{2}\,|x-x^*|^2 \le f(x) - f(x^*) \le \frac{L}{2}\,|x-x^*|^2. \qquad (4.5.17)$$
Here the maximum is over all vectors x, and p = (p1 , p2 , . . . , pd ), the dual
variable, also has d features. We will work in situations where a maximizer
exists in (4.5.18).
Let Q > 0 be a positive matrix. The simplest example is
$$f(x) = \frac{1}{2}\,x\cdot Qx \implies g(p) = \frac{1}{2}\,p\cdot Q^{-1}p.$$
This is established by the identity
$$\frac{1}{2}(p-Qx)\cdot Q^{-1}(p-Qx) = \frac{1}{2}\,p\cdot Q^{-1}p - p\cdot x + \frac{1}{2}\,x\cdot Qx. \qquad (4.5.19)$$
To see this, since the left side of (4.5.19) is greater or equal to zero, we have
$$\frac{1}{2}\,p\cdot Q^{-1}p - p\cdot x + \frac{1}{2}\,x\cdot Qx \ge 0.$$
Since (4.5.19) equals zero iff p = Qx, we are led to (4.5.18).
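A quick numerical sketch of this dual pair (Q and p chosen arbitrarily; scipy.optimize.minimize maximizes p·x − f(x) by minimizing its negative):

from numpy import array, allclose
from numpy.linalg import inv
from scipy.optimize import minimize

Q = array([[2.0, 1.0],[1.0, 3.0]])
p = array([1.0, -2.0])

def f(x): return x @ Q @ x / 2
neg = lambda x: f(x) - p @ x                 # maximize p.x - f(x)

g_numeric = -minimize(neg, x0=[0.0, 0.0]).fun
g_formula = p @ inv(Q) @ p / 2               # claimed dual (1/2) p.Q^{-1}.p
print(allclose(g_numeric, g_formula))        # True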
Moreover, switching p · Q−1 p with x · Qx, we also have
Thus the convex dual of the convex dual of f (x) is f (x). In §5.6, we compute
the convex dual of the cumulant-generating function.
If x is a maximizer in (4.5.18), then the derivative is zero,
0 = ∇x (p · x − f (x)) =⇒ p = ∇x f (x).
p = ∇x f (x) ⇐⇒ x = ∇p g(p).
This yields
Using this, and writing out (4.5.16) for g(p) instead of f (x) (we skip the
details) yields
$$(p-q)\cdot(x-a) \ge \frac{mL}{m+L}\,|x-a|^2 + \frac{1}{m+L}\,|p-q|^2. \qquad (4.5.22)$$
This is derived by using (4.5.21); the details are in [3]. This result is used
in gradient descent.
For the exercises below, we use the properties of sets defined earlier in this
section: interior and boundary.
Exercises
Exercise 4.5.1 If a two-class dataset does not lie in a hyperplane and is sepa-
rable, then the means of the two classes are distinct. (Argue by contradiction:
Assume the means are equal, and look at levels of samples.)
Exercise 4.5.2 Let e0 = 0 and let e1 , e2 , . . . , ed be the one-hot encoded
basis in Rd . The d-simplex Σd is the convex hull of e0 , e1 , e2 , . . . , ed . Draw
pictures of Σ1 , Σ2 , and Σ3 . Show Σd is the suspension (§1.6) of Σd−1 from
ed . Conclude
$$\mathrm{Vol}(\Sigma_d) = \frac{1}{d!}, \qquad d = 0, 1, 2, 3, \ldots$$
(Since Σ0 is one point, we start with Vol(Σ0 ) = 1.)
Exercise 4.5.3 Let x1 , x2 , . . . , xd be positive scalars. Use convexity of exp
to show
$$\frac{1}{d}\sum_{i=1}^{d} x_i \ge (x_1 x_2 \cdots x_d)^{1/d}.$$
5.1 Probability
Then the event A1 consists of the outcomes (1, 6), (2, 5), (3, 4), (4, 3), (5, 2),
(6, 1). Here #(S) = 36 and #(A1 ) = 6.
Let A2 be the event of obtaining three heads when tossing a coin seven
times. Here #(S) = 27 = 128 and #(A2 ) = 35, which is the number of ways
you can choose three things out of seven things (§A.1):
$$\#(A_2) = \text{7-choose-3} = \binom{7}{3} = \frac{7\cdot 6\cdot 5}{1\cdot 2\cdot 3} = 35.$$
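This count is easy to check in Python (a one-line sketch using the standard library):

from math import comb
print(comb(7, 3))   # 35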
A probability on S satisfies
1. 0 ≤ P rob(A) ≤ 1 for every event A in S,
2. P rob(S) = 1,
3. (Additivity) If A, B, . . . are exclusive events in S,
Suppose the number of outcomes #(S) is finite. Then the simplest prob-
ability is the discrete uniform distribution, assigning to each event A the
proportion of outcomes in A,
$$Prob(A) = \frac{\#(A)}{\#(S)}. \qquad (5.1.2)$$
When this is so and #(S) = N , each outcome has probability 1/N , and we
say the outcomes are equally likely.
Here are examples of discrete uniform distributions.
1. A coin is fair if, after one toss, the two outcomes are equally likely. Then
P rob(heads) = P rob(tails) = 1/2.
2. A 6-sided die is fair if, after one roll, the outcomes are equally likely. Let
A be the event that the outcome is less than 3. Since the outcome is then
1 or 2, P rob(A) = 2/6 = 1/3.
3. With A1 as above, assuming the dice are fair, leads to P rob(A1 ) = 6/36 =
1/6.
4. With A2 as above, assuming the coin is fair, leads to P rob(A2 ) = 35/128.
5. With A3 as above, assuming the dice are fair, leads to P rob(A3 ) = 11/36.
Here are some consequences of the probability axioms. Since A and Ac are
exclusive and exhaustive, the first consequence is
Complementary Probabilities
Monotonicity of Probabilities
Additivity of Probabilities
Sub-Additivity of Probabilities
Since a single number a is a sub-interval [a, a] with zero length, the event A
of sampling X exactly equal to 0.5 is a null event. Since A is possible, A is
not impossible. Moreover Ac is a sure event, but Ac is not certain.
Let A∞,k be the event of infinite tuple outcomes with exactly k heads.
Then each outcome x in A∞,k is in An,k for some n. In fact, an outcome
x in A∞,k is necessarily in An,k for all sufficiently large n. This means the
following: if x is in A∞,k , then for some N ≥ 1, x is in An,k for every n greater
or equal than N . Therefore, for each outcome x in A∞,k , there is some N ,
depending on x, with x in
$$\bigcap_{n\ge N} A_{n,k} = A_{N,k}\cap A_{N+1,k}\cap A_{N+2,k}\cap\ldots.$$
This shows the event $A_{\infty,k}$ is part of the union of the events $\bigcap_{n\ge N} A_{n,k}$ over N = 1, 2, . . . .
Since $\bigcap_{n\ge N} A_{n,k}$ is part of $A_{n,k}$ for every n ≥ N, by monotonicity (5.1.3),
$$Prob\left(\bigcap_{n\ge N} A_{n,k}\right) \le \binom{n}{k}\,2^{-n}, \qquad\text{for every } n \ge N.$$
$$Prob(A\mid B) = \frac{Prob(A\text{ and }B)}{Prob(B)}. \qquad (5.1.7)$$
When A and B are independent, the conditional probability equals the uncon-
ditional probability.
Are A1 and A3 above independent? Since P rob(A1 ) = 6/36 = 1/6 and
and
$$Prob(B=0\text{ and }G=1) = Prob(G=1\mid 1\text{ child})\,Prob(1\text{ child}) = \frac{1}{2}\cdot 0.20 = 0.1,$$
and
$$Prob(B=1\text{ and }G=2) = Prob(G=2\mid 3\text{ children})\,Prob(3\text{ children}) = \frac{3}{8}\cdot 0.30 = .1125.$$
p = .5
n = 10
N = 20
v = binomial(n,p,N)
print(v)
returns
[9 6 7 4 4 4 3 3 7 5 6 4 6 9 4 5 4 7 6 7]
p = .5
for n in [5,50,500]: print(binomial(n,p,1))
This returns the count of heads after 5 tosses, 50 tosses, and 500 tosses,
3, 28, 266
The proportions are the count divided by the total number of tosses in the
experiment. For the above three experiments, the proportions after 5 tosses,
50 tosses, and 500 tosses, are
Fig. 5.3 100,000 sessions, with 5, 15, 50, and 500 tosses per session.
Now we repeat each experiment 100,000 times and we plot the results in
a histogram.
N = 100000
p = .5
for n in [5,50,500]:
data = binomial(n,p,N)
hist(data,bins=n,edgecolor ='Black')
grid()
show()
The takeaway from these graphs is the pair of fundamental results of probability:
For large sample size, the shape of the graph of the proportions or
counts is approximately normal. The normal distribution is studied in
§5.4. Another way of saying this is: For large sample size, the shape
of the sample mean histogram is approximately normal.
The law of large numbers is qualitative and the central limit theorem is
quantitative. While the law of large numbers says one thing is close to another,
it does not say how close. The central limit theorem provides a numerical
measure of closeness, using the normal distribution.
One may think that the LLN and the CLT above depend on some aspect
of the binomial distribution. After all, the binomial is a specific formula and
something about this formula may lead to the LLN and the CLT. To show
that this is not at all the case, to show that the LLN and the CLT are
universal, we bring in the petal lengths of the Iris dataset. This time the
experiment is not something we invent, it is a result of something arising in
nature, Iris petal lengths.
We begin by loading the Iris dataset,
iris = datasets.load_iris()
dataset = iris["data"]
iris["feature_names"]
This code shows the petal lengths are the third feature in the dataset, and
we compute the mean of the petal lengths using
petal_lengths = dataset[:,2]
mean(petal_lengths)
This returns the petal length population mean µ = 3.758. If we plot the
petal lengths in a histogram with 50 bins using the code
grid()
hist(petal_lengths,bins=50)
show()
Now we sample the Iris dataset randomly. More generally, we take a ran-
dom batch of samples of size n and take the mean of the samples in the batch.
For example, the following code grabs a batch of n = 5 petal lengths X1,
# n = batch_size
def random_batch_mean(n):
rng.shuffle(petal_lengths)
return mean(petal_lengths[:n])
random_batch_mean(5)
This code shuffles the dataset, then selects the first n petal lengths, then
returns their mean.
To sample a single petal length randomly 100,000 times, we run the code
N = 100000
n = 1
Since we are sampling single petal lengths, here we take n = 1. This code
returns the histogram in Figure 5.5.
In Figure 5.4, the bin heights add up to 150. In Figure 5.5, the bin heights
add up to 100,000. Moreover, while the shapes of the histograms are almost
identical, a careful examination shows the histograms are not identical. Nev-
ertheless, there is no essential difference between the two figures.
Fig. 5.6 Iris petal lengths batch means sampled 100,000 times, batch sizes 3, 5, 20.
Now repeat the same experiment, but with batches of various sizes, and
plot the resulting histograms. If we do this with batches of size n = 3, n = 5,
n = 20 using
figure(figsize=(8,4))
# three subplots
rows, cols = 1, 3
N = 100000
show()
Exercises
def sums(dataset,k):
if k == 1: return dataset
else:
s = sums(dataset,k-1)
return array([ a+b for a in dataset for b in s ])
for k in range(5):
s = sums(dataset,k)
grid()
hist(s,bins=50,edgecolor="k")
show()
for k = 1, 2, 3, 4, . . . . What does this code do? What does it return? What
pattern do you see? What if dataset were changed? What if the samples in
the dataset were vectors?
Exercise 5.1.7 Let A and B be any events, not necessarily exclusive. Show
Exercise 5.1.8 Let A and B be any events, not necessarily exclusive. Extend
(5.1.1) to show
Exercise 5.1.9 [30] There is a 60% chance an event A will occur. If A does
not occur, there is a 10% chance B occurs. What is the chance A or B occurs?
(Start with two events, then go from two to three events.) With a = P rob(Ac ),
b = P rob(B c ), c = P rob(C c ), this exercise is the same as Exercise A.3.4.
Exercise 5.1.11 Toss a coin infinitely many times, and let A1 be the out-
comes x = (x1 , x2 , . . . ) where the limit of the sample means
$$\lim_{n\to\infty} \frac{x_1+x_2+\cdots+x_n}{n}$$
equals 1. Show the event A1 is neither certain nor impossible. Here each xk is
1 or 0. More generally, let t = a/b be any fraction, and let At be the event
of outcomes where the limit of the sample means equals t. Show At is neither
certain nor impossible.
Suppose a coin is tossed repeatedly, landing heads or tails each time. After
tossing the coin 100 times, we obtain 53 heads. What can we say about this
coin? Can we claim the coin is fair? Can we claim the probability of obtaining
heads is .53?
Whatever claims we make about the coin, they should be reliable, in that
they should more or less hold up to repeated verification.
To obtain reliable claims, we therefore repeat the above experiment 20
times, obtaining for example the following count of heads
[57, 49, 55, 44, 55, 50, 49, 50, 53, 49, 53, 50, 51, 53, 53, 54, 48, 51, 50, 53].
On the other hand, suppose someone else repeats the same experiment 20
times with a different coin, and obtains
[69, 70, 79, 74, 63, 70, 68, 71, 71, 73, 65, 63, 68, 71, 71, 64, 73, 70, 78, 67].
In this case, one suspects the two coins are statistically distinct, and have
different probabilities of obtaining heads.
In this section, we study how the probabilities of coin-tossing behave, with
the goal of answering the question: Is a given coin fair?
p + q = 1.
the particular coin being tossed. When p = 1/2, P rob(H) = P rob(T ), and
we say the coin is fair.
If we toss the coin twice, we obtain one of four possibilities, HH, HT ,
T H, or T T . If we make the natural assumption that the coin has no memory,
that the result of the first toss has no bearing on the result of the second
toss, then the probabilities are
$$p^2 + pq + qp + q^2 = (p+q)^2 = 1^2 = 1.$$
We use (5.1.7) to compute the probability that we obtain heads on the sec-
ond toss given that we obtain tails on the first toss. Introduce the convenient
notation
$$X_n = \begin{cases} 1, & \text{if the } n\text{-th toss is heads},\\ 0, & \text{if the } n\text{-th toss is tails}.\end{cases}$$
Then Xn is a random variable (§5.3) and represents a numerical reward
function of the outcome (heads or tails) at the n-th toss.
With this notation, (5.2.1) may be rewritten
P rob(X1 = 1 and X2 = 1) = p2 ,
P rob(X1 = 1 and X2 = 0) = pq,
P rob(X1 = 0 and X2 = 1) = qp,
P rob(X1 = 0 and X2 = 0) = q 2 .
$$Prob(X_2=1\mid X_1=0) = \frac{Prob(X_1=0\text{ and }X_2=1)}{Prob(X_1=0)} = \frac{qp}{q} = p = Prob(X_2=1),$$
so
P rob(X2 = 1 | X1 = 0) = P rob(X2 = 1).
Thus X1 = 0 has no effect on the probability that X2 = 1, and similarly for
the other possibilities. This is often referred to as the independence of the
coin tosses. We conclude
P rob(Xn = 1) = p, P rob(Xn = 0) = q = 1 − p, n ≥ 1.
as it should be.
Assume we know p = Prob(Xn = 1). Since the number of ways of choosing
k heads from n tosses is the binomial coefficient $\binom{n}{k}$ (see §A.2), and the
probabilities of distinct tosses multiply, the probability of k heads in n tosses
is as follows.
$$Prob(S_n = k) = \binom{n}{k}\,p^k (1-p)^{n-k}. \qquad (5.2.3)$$
n, p, N = 5, .5, 10
k,n,p = 5, 10, .5
B = binom(n,p)
# probability of k heads
B.pmf(k)
k,n,p = 5, 10, .5
allclose(pmf1,pmf2)
returns True.
Be careful to distinguish between
numpy.random.binomial and scipy.stats.binom.
The former returns samples from a binomial distribution, while the latter
returns a binomial random variable. Samples are just numbers; random vari-
ables have cdf’s, pmf’s or pdf’s, etc.
Toss a coin n times, and let #n (p) be the number of outcomes where
the heads-proportion is p. Then
1 This result exhibits the entropy as the log of the number of combinations, or configura-
tions, or possibilities, which is the original definition of the physicist Boltzmann (1875).
In more detail, using Stirling’s approximation (A.1.6), one can derive the
asymptotic equality
$$\#_n(p) \approx \frac{1}{\sqrt{2\pi n}}\cdot\frac{1}{\sqrt{p(1-p)}}\cdot e^{nH(p)}, \qquad\text{for } n \text{ large}. \qquad (5.2.4)$$
Figure 5.7 is returned by the code below, which compares both sides of
the asymptotic equality (5.2.4) for n = 10 and 0 ≤ p ≤ 1.
from scipy.special import comb

# absolute entropy H(p) from Section 4.2 (natural log)
def H(p): return -p*log(p) - (1-p)*log(1-p)

n = 10
p = arange(0,1,.01)

def approx(n,p):
    return exp(n*H(p))/sqrt(2*n*pi*p*(1-p))

grid()
plot(p, comb(n,n*p), label="binomial coefficient")
plot(p, approx(n,p), label="entropy approximation")
title("number of tosses " + "$n=" + str(n) +"$", usetex=True)
legend()
show()
Assume a coin’s bias is q. Toss the coin n times, and let Pn (p, q) be
the probability of obtaining tosses where the heads-proportion is p.
Then
In more detail, using Stirling’s approximation (A.1.6), one can derive the
asymptotic equality
$$P_n(p,q) \approx \frac{1}{\sqrt{2\pi n}}\cdot\frac{1}{\sqrt{p(1-p)}}\cdot e^{nH(p,q)}, \qquad\text{for } n \text{ large}. \qquad (5.2.6)$$
If we set
$$f(p) = (1-p+cp)^n, \qquad F(p) = \frac{(1-p+cp)^{n+1}}{(c-1)(n+1)},$$
$$\frac{1}{n+1}\cdot\frac{c^{n+1}-1}{c-1} = \sum_{k=0}^{n} c^k\cdot\frac{1}{n+1}.$$
Notice the difference: In (5.2.3), we know the coin’s bias p, and obtain the
binomial distribution, while in (5.2.10), since we don’t know p, and there are
n + 1 possibilities 0 ≤ k ≤ n, we obtain the uniform distribution 1/(n + 1).
We now turn things around: Suppose we toss the coin n times, and obtain
k heads. How can we use this data to estimate the coin’s bias p?
To this end, we introduce the fundamental
Bayes Theorem I
$$Prob(A\mid B) = \frac{Prob(B\mid A)\cdot Prob(A)}{Prob(B)}. \qquad (5.2.11)$$
$$Prob(A\mid B) = \frac{Prob(A\text{ and }B)}{Prob(B)} = \frac{Prob(A\text{ and }B)}{Prob(A)}\cdot\frac{Prob(A)}{Prob(B)} = Prob(B\mid A)\cdot\frac{Prob(A)}{Prob(B)}.$$
$$Prob(p\mid S_n=k) = Prob(S_n=k\mid p)\cdot\frac{Prob(p)}{Prob(S_n=k)}. \qquad (5.2.12)$$
Summarizing,
n = 10
k = 7
grid()
p = arange(0,1,.01)
plot(p,f(p),color="blue",linewidth=.5)
show()
Because Bayes Theorem is so useful, here are two alternate forms. Suppose
A1 , A2 , . . . , Ad are several exclusive and exhaustive events, so
Then by the law of total probability (5.1.10) and the first version (5.2.11),
we have the second version
Bayes Theorem II
$$Prob(A_i\mid B) = \frac{Prob(B\mid A_i)\,Prob(A_i)}{Prob(B\mid A_1)\,Prob(A_1) + Prob(B\mid A_2)\,Prob(A_2) + \ldots}. \qquad (5.2.14)$$
$$Prob(A\mid B) = \frac{Prob(B\mid A)\,Prob(A)}{Prob(B\mid A)\,Prob(A) + Prob(B\mid A^c)\,Prob(A^c)}.$$
As an example, suppose 20% of the population are smokers, and the preva-
lence of lung cancer among smokers is 90%. Suppose also 80% of non-smokers
are cancer-free. Then what is the probability that someone who has cancer
is actually a smoker?
To use the second version, set A = smoker and B = cancer. This means
A is the event that a randomly sampled person is a smoker, and B is the
event that a randomly sampled person has cancer. Then
and
$$Prob(A\mid B) = \frac{Prob(B\mid A)\,Prob(A)}{Prob(B\mid A)\,Prob(A) + Prob(B\mid A^c)\,Prob(A^c)} = \frac{.9\times .2}{.9\times .2 + .2\times .8} = .52941.$$
Thus the probability that a person with lung cancer is indeed a smoker is
53%.
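This arithmetic is a one-line check in Python (a quick sketch):

num = .9 * .2              # Prob(B|A) Prob(A)
den = .9 * .2 + .2 * .8    # Prob(B|A) Prob(A) + Prob(B|A^c) Prob(A^c)
print(num/den)             # 0.52941...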
To describe the third version of Bayes theorem, bring in the logistic func-
tion. Let
$$p = \sigma(y) = \frac{1}{1+e^{-y}}. \qquad (5.2.15)$$
This is the logistic function or sigmoid function. The logistic function takes
as inputs real numbers y, and returns as outputs probabilities p (Figure 5.9),
and is plotted in Figure 5.10.
p = expit(y)
$$x^* = -\frac{w_0}{w} = -\frac{1}{2}\cdot\frac{-m_H^2+m_T^2}{m_H-m_T} = \frac{m_H+m_T}{2},$$
which is the midpoint of the line segment joining mH and mT .
(Figure: the cut-off point midway between mH and mT on the line.)
More generally, if the points x are in Rd , then the same question may be
asked, using the normal distribution with variance I in Rd (§5.5). In this
case, w is a nonzero vector, and w0 is still a scalar,
$$w = m_H - m_T, \qquad w_0 = -\frac{1}{2}|m_H|^2 + \frac{1}{2}|m_T|^2.$$
Then the cut-off or decision boundary between the two groups is the hyper-
plane
w · x + w0 = 0,
which is the hyperplane halfway between mH and mT , and orthogonal to the
vector joining mH and mT . Written this way, the probability
(Figure: the cut-off hyperplane halfway between mH and mT, orthogonal to the segment joining them.)
Exercises
Exercise 5.2.2 A coin with bias p is tossed. What is the probability of ob-
taining 5 heads in 8 tosses?
Exercise 5.2.3 A coin with bias p is tossed 8 times and 5 heads are obtained.
What is the most likely value for p?
Exercise 5.2.4 A coin with unknown bias p is tossed 8 times and 5 heads
are obtained. Assuming a uniform prior for p, what is the probability that
p lies between 0.5 and 0.7? (Use scipy.integrate.quad (§A.5) to integrate
(5.2.13) over 0.5 ≤ p ≤ 0.7.)
Exercise 5.2.5 A fair coin is tossed n times. Sometimes you get more heads
than tails, sometimes the reverse. If you’re really lucky, the number of heads
may equal exactly the number of tails. What is the least n for which the
probability of this happening is less than 10%?
Exercise 5.2.6 A fair coin is tossed n times. Sometimes you get more heads
than tails, sometimes the reverse. If you’re really lucky, the number of heads
may equal exactly the number of tails. What is the least n for which the
probability of this happening is less than 10%?
In §1.3, this was called vectorization. In this section, random variables are
scalar-valued. In §5.5 and §6.4, they are vector-valued.
for this quantity, then we are asking to compute the probability of the event
that X lies in the interval [a, b]. If we don’t know anything about X, then
we can’t figure out the probability, and there is nothing we can say. Knowing
something about X means knowing the distribution of X: Where X is more
likely to be and where X is less likely to be. Any quantity X where proba-
Then E(X) is the mean of the random variable X associated to the dataset.
Similarly,
$$E(X^2) = \frac{1}{N}\sum_{k=1}^{N} x_k^2$$
P (X = a) = p, P (X = b) = q, P (X = c) = r.
E(X) = ap + bq + cr.
V ar(X) = E(X 2 ) − µ2 .
The Bernoulli random variable is the outcome of a single toss of a coin with
bias p, and the binomial random variable is the outcome of n tosses of a coin
with bias p.
As we see below (5.3.23), the mean of a binomial random variable is np. If
we let the number of tosses grow without bound, n → ∞, while keeping the
mean fixed at λ = np, we obtain the Poisson random variable.
In the text and the exercises, we consider several continuous random vari-
ables,
unif orm, exponential, logistic, arcsine,
and
normal, chi-squared, student.
E(X) = x1 p1 + x2 p2 + . . . . (5.3.3)
E(1) = p1 + p2 + · · · = 1.
If
rjk = P rob(X = xj and Y = yk ), j, k = 1, 2, . . . ,
then, by additivity of probabilities (5.1.1),
$$p_j = Prob(X=x_j) = Prob(X=x_j\text{ and }Y=y_1) + Prob(X=x_j\text{ and }Y=y_2) + \cdots = r_{j1}+r_{j2}+\cdots = \sum_k r_{jk}.$$
Similarly,
$$q_k = r_{1k}+r_{2k}+\cdots = \sum_j r_{jk}.$$
We conclude
E(X + Y ) = E(X) + E(Y ).
Since we already know E(aX) = aE(X) (5.3.5), this derives linearity.
The variance measures the spread of X about its mean. Since the mean of
aX is aµ, the variance of aX is the mean of (aX − aµ)2 = a2 (X − µ)2 . Thus
V ar(aX) = a2 V ar(X).
However, the variance of a sum X + Y is not simply the sum of the variances
of X and Y : This only happens if X and Y are independent, see (5.3.21).
Using (5.3.2), we can view a dataset as the samples of a random variable
X. In this case, the mean and variance of X are the same as the mean and
variance of the dataset, as defined by (1.5.1) and (1.5.2).
When X is a constant, then X = µ, so V ar(X) = 0. Conversely, if
V ar(X) = 0, then by definition
This displays the variance in terms of the first moment E(X) and the second
moment E(X 2 ). Equivalently,
P rob(X = 1) = p, P rob(X = 0) = 1 − p.
$$E(X^2) = 1^2\cdot Prob(X=1) + 0^2\cdot Prob(X=0) = p.$$
From this,
$$M'(t) = E\bigl(Xe^{tX}\bigr).$$
When t = 0,
M ′ (0) = E(X) = µ.
Similarly, since the derivative of log x is 1/x, for the cumulant-generating
function,
$$Z'(0) = \frac{M'(0)}{M(0)} = E(X) = \mu.$$
The second derivative of M (t) is
$$M''(t) = E\bigl(X^2 e^{tX}\bigr),$$
Definition of Uncorrelated
Random variables X and Y are uncorrelated if
We investigate when X and Y are uncorrelated. Here a > 0, b > 0, and c > 0.
First, because the total probability equals 1,
a + 2b + c = 1. (5.3.15)
Also we have
and
E(X) = a − c, E(Y ) = a + b.
Now X and Y are uncorrelated if
a,b,c = symbols('a,b,c')
eq1 = a + 2*b + c - 1
eq2 = a - b - (a-c)*(a+b)
solutions = solve([eq1,eq2],a,b)
print(solutions)
Definition of Independence
for all positive powers n and m. When X and Y are discrete, this is
equivalent to the events X = x and Y = y being independent, for
every value x of X and every value y of Y .
a − b = (a − b)(a + b).
Expanding the exponentials into their series, and using (5.3.18), one can show
$$M_X(t) = \frac{1}{6}\cdot\frac{e^{7t}-e^{t}}{e^{t}-1}.$$
By Exercise 5.3.1 again,
$$M_{X+Y}(t) = \frac{1}{12}\cdot\frac{e^{13t}-e^{t}}{e^{t}-1}.$$
It follows, by (5.3.20),
$$\frac{1}{12}\cdot\frac{e^{13t}-e^{t}}{e^{t}-1} = \frac{1}{6}\cdot\frac{e^{7t}-e^{t}}{e^{t}-1}\cdot M_Y(t).$$
Factoring $e^{13t}-e^{t} = (e^{7t}-e^{t})(e^{6t}+1)$, we obtain
$$M_Y(t) = \frac{1}{2}\bigl(e^{6t}+1\bigr).$$
This says
$$Prob(Y=0) = \frac{1}{2}, \qquad Prob(Y=6) = \frac{1}{2},$$
and all other probabilities are zero.
S = X1 + X2 + · · · + Xn .
Then
The next simplest discrete random variable is the binomial random vari-
able,
S = X1 + X2 + · · · + Xn
obtained from n independent Bernoulli random variables.
Then S has values 0, 1, 2, . . . , n, and the probability mass function
$$p(k) = \binom{n}{k}\,p^k(1-p)^{n-k}, \qquad k = 0, 1, 2, \ldots, n. \qquad (5.3.22)$$
Since the cdf F(x) is the sum of the pmf p(k) for k ≤ x, the code

n, p = 8, .5
B = binom(n,p)
# print k, pmf, cdf for each k (loop completed here; the original listing is cut off)
for k in range(n+1): print(k, B.pmf(k), B.cdf(k))
returns
0 0.003906250000000007 0.00390625
1 0.031249999999999983 0.03515625
2 0.10937500000000004 0.14453125
3 0.21874999999999992 0.36328125
4 0.27343749999999994 0.63671875
5 0.2187499999999999 0.85546875
6 0.10937500000000004 0.96484375
7 0.031249999999999983 0.99609375
8 0.00390625 1.0
Since
$$E(\hat p_n) = p, \qquad Var(\hat p_n) = \frac{p(1-p)}{n}. \qquad (5.3.24)$$
By the binomial theorem, the moment-generating function is
$$E\bigl(e^{tS}\bigr) = \sum_{k=0}^{n} e^{tk}\binom{n}{k}\,p^k(1-p)^{n-k} = \bigl(pe^t + 1 - p\bigr)^n.$$
so the total probability is one. The Python code for a Poisson random variable
is
lamda = 1
P = poisson(lamda)
Here the integration is over the entire range of the random variable: If X
takes values in the interval [a, b], the integral is from a to b. For a normal
random variable, the range is (−∞, ∞). For a chi-squared random variable,
the range is (0, ∞). Below, when we do not specify the limits of integration,
the integral is taken over the whole range of X.
More generally, let f (x) be a function. The mean of f (X) or expectation
of f(X) is
$$E(f(X)) = \int f(x)\,p(x)\,dx. \qquad (5.3.28)$$
This only holds when the integral is over the complete range of X. When this
is not so,
$$Prob(a < X < b) = \int_a^b p(x)\,dx$$
is the green area in Figure 5.16. Thus
Since
$$F(x) = \frac{1}{2}x^2 \implies F'(x) = x,$$
by the fundamental theorem of calculus (A.5.2),
$$E(X) = \int_0^1 x\,dx = F(1) - F(0) = \frac{1}{2}.$$
In particular, if [a, b] = [−1, 1], then the mean is zero, the variance is 1/3,
and
$$E(f(X)) = \frac{1}{2}\int_{-1}^{1} f(x)\,dx.$$
When X is discrete,
$$F(x) = \sum_{x_k\le x} p_k.$$
When X is continuous,
$$F(x) = \int_{-\infty}^{x} p(z)\,dz.$$
Then each green area in Figure 5.16 is the difference between two areas,
F (b) − F (a).
              discrete                                     continuous
density       pmf                                          pdf
distribution  cdf                                          cdf
sum           cdf(x) = sum([pmf(k) for k in range(x+1)])   cdf(x) = integrate(pdf, x)
difference    pmf(k) = cdf(k) - cdf(k-1)                   pdf(x) = derivative(cdf, x)
Table 5.18 summarizes the situation. For the distribution on the left in
Figure 5.16, the cumulative distribution function is in Figure 5.17.
Let X and Y be independent uniform random variables on [0, 1], and let
Z = max(X, Y ). We compute the pdf p(x), the cdf F (x), and the mean of
Z. By definition of max(X, Y ),
$$Prob(\max(X,Y)\le x) = Prob(X\le x\text{ and }Y\le x) = Prob(X\le x)\,Prob(Y\le x) = x^2, \qquad 0\le x\le 1.$$
Hence
$$F(x) = Prob(\max(X,Y)\le x) = \begin{cases} 0, & \text{if } x < 0,\\ x^2, & \text{if } 0\le x\le 1,\\ 1, & \text{if } x > 1.\end{cases}$$
From this,
$$p(x) = F'(x) = \begin{cases} 0, & \text{if } x < 0,\\ 2x, & \text{if } 0\le x\le 1,\\ 0, & \text{if } x > 1.\end{cases}$$
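The mean of Z is then $\int_0^1 x\cdot 2x\,dx = 2/3$. A quick Monte Carlo sketch (not from the text) agrees:

from numpy import maximum
from numpy.random import default_rng

rng = default_rng(0)
X = rng.random(10**6)
Y = rng.random(10**6)
Z = maximum(X, Y)
print(Z.mean())      # approximately 2/3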
E(X n ) = E(Y n ), n ≥ 1.
for every interval [a, b], and equivalent to having the same moment-
generating functions,
MX (t) = MY (t)
for every t.
Then
$$E(\bar X_n) = \frac{1}{n}\bigl(E(X_1)+E(X_2)+\cdots+E(X_n)\bigr) = \frac{1}{n}\cdot n\mu = \mu.$$
We conclude the mean of the sample mean equals the population mean.
Now let σ 2 be the common variance of X1 , X2 , . . . , Xn . By (5.3.21), the
variance of Sn is nσ 2 , hence the variance of X̄n is σ 2 /n. Summarizing,
$$E(\bar X_n) = \mu, \qquad Var(\bar X_n) = \frac{\sigma^2}{n}, \qquad (5.3.34)$$
and
$$\sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \qquad (5.3.35)$$
is standard.
Exercises
$$M_X(t) = \frac{1}{b-a}\cdot\frac{e^{tb}-e^{ta}}{e^t-1}.$$
Exercise 5.3.2 Let A and B be events and let X and Y be the Bernoulli
random variables corresponding to A and B (5.3.10). Show that A and B are
independent (5.1.9) if and only if X and Y are independent (5.3.18).
Exercise 5.3.3 [30] Let X be a binomial random variable with mean 7 and
variance 3.5. What are P rob(X = 4) and P rob(X > 14)?
Exercise 5.3.4 The proportion of adults who own a cell phone in a certain
Canadian city is believed to be 90%. Thirty adults are selected at random
from the city. Let X be the number of people in the sample who own a cell
phone. What is the distribution of the random variable X?
Exercise 5.3.5 If two random samples of sizes n1 and n2 are selected inde-
pendently from two populations with means µ1 and µ2 , show the mean of the
sample mean difference X̄1 − X̄2 equals µ1 − µ2 . If σ1 and σ2 are standard
deviations of the two populations, then the standard deviation of X̄1 − X̄2
equals
$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$
Exercise 5.3.6 Check (5.3.30) and (5.3.31).
Exercise 5.3.7 [30] You arrive at the bus stop at 10:00am, knowing the bus
will arrive at some time uniformly distributed during the next 30 minutes.
What is the probability you have to wait longer than 10 minutes? Given that
the bus hasn’t arrived by 10:15am, what is the probability that you’ll have
to wait at least an additional 10 minutes?
Exercise 5.3.8 If X and Y satisfy (5.3.14), show X and 2Y −1 are identically
distributed for any a, b, c.
Exercise 5.3.9 Let B and G be the number of boys and the number of girls
in a randomly selected family with probabilities as in Table 5.2. Are B and
G independent? Are they identically distributed?
Exercise 5.3.10 If X and Y satisfy (5.3.14), use Python to verify (5.3.17)
and (5.3.19).
Exercise 5.3.11 If X and Y satisfy (5.3.14), compute V ar(X) and V ar(Y )
in terms of a, b, c. What condition on a, b, c maximizes V ar(X)? What
condition on a, b, c maximizes V ar(Y )?
Exercise 5.3.12 Let X be Poisson with parameter λ. Show the cumulant-
generating function is
Z(t) = λ(et − 1).
(Use the exponential series (A.3.12).)
Exercise 5.3.13 Let X be Poisson with parameter λ. Show both E(X) and
V ar(X) equal λ (Use (5.3.12).)
Exercise 5.3.14 Let X and Y be independent Poisson with parameter λ
and µ respectively. Show X + Y is Poisson with parameter λ + µ.
Exercise 5.3.15 If X1 , X2 , . . . , Xn are i.i.d. Poisson with parameter λ, show
S = X1 + X2 + · · · + Xn
$$E\bigl(\mathrm{relu}(S-n)\bigr) = e^{-n}\cdot\frac{n^{n+1}}{n!}.$$
(Use Exercise A.1.2.)
Exercise 5.3.18 Suppose X is a logistic random variable (5.3.33). Show the
probability density function of X is σ(x)(1 − σ(x)).
Exercise 5.3.19 Suppose X is a logistic random variable (5.3.33). Show the
mean of X is zero.
Exercise 5.3.20 Suppose X is a logistic random variable (5.3.33). Use
(A.3.16) with a = −e−x to show the variance of X is
$$4\sum_{n=1}^{\infty}\frac{(-1)^{n-1}}{n^2} = 4\left(1 - \frac{1}{4} + \frac{1}{9} - \frac{1}{16} + \ldots\right).$$
Exercise 5.3.25 For k and n fixed, compute the mean of the conditional
probability of a coin’s bias p given k heads in n tosses. The answer is not
k/n. (Use (5.2.13) with n, k replaced by n + 1, k + 1.)
grid()
z = arange(mu-3*sdev,mu+3*sdev,.01)
p = Z.pdf(z)
plot(z,p)
show()
The curious constant √(2π) in (5.4.1) is inserted to make the total area
under the graph equal to one. That this is so arises from the fact that 2π is
the circumference of the unit circle. Using Python, we see √(2π) is the correct
constant, since the code

# the total area I under exp(-z**2/2) (this definition of I is a completion; the original listing is cut off)
from scipy.integrate import quad
I, error = quad(lambda z: exp(-z**2/2), -inf, inf)

allclose(I, sqrt(2*pi))
returns True.
The mean of Z is
$$E(Z) = \int z\,p(z)\,dz,$$
with the integral computed using the fundamental theorem of calculus (A.5.2)
or Python.
E(Z) = 0, V ar(Z) = 1
From this, the odd moments of Z are zero, and the even moments are
$$E(Z^{2n}) = \frac{(2n)!}{2^n\,n!}, \qquad n = 0, 1, 2, \ldots$$
By separating the even and the odd factors, this simplifies to
For example,
$$\bar X_n = \frac{X_1+X_2+\cdots+X_n}{n}$$
be the sample mean. Then the event of outcomes where
$$\lim_{n\to\infty} \bar X_n = \mu$$
In other words, the LLN says the outcomes where the limiting sample
mean is not equal to µ form a null event. The event specified in the LLN is
sure, but not certain (see §5.1 and Exercise 5.1.11 for the distinction).
The LLN is qualitative: There is no measure of closeness in the LLN state-
ment. On the other hand, the CLT is more quantitative. The CLT says for
large sample size, the sample mean is approximately normal with mean µ
and variance σ 2 /n. More exactly,
Let
$$\bar Z_n = \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma}$$
be the standardized sample mean, and let Z be a standard normal
random variable. Then
$$\lim_{n\to\infty} Prob\bigl(a < \bar Z_n < b\bigr) = Prob(a < Z < b)$$
for every t.
Toss a coin n times, assume the coin’s bias is p, and let Sn be the number
of heads. Then, by (5.3.23), Sn is binomial with mean µ = np and standard
deviation σ = √(np(1 − p)). By the CLT, Sn is approximately normal with the
same mean and variance, so the cumulative distribution function of Sn ap-
proximately equals the cumulative distribution function of a normal random
variable with the same mean and variance.
The code
n, p = 100, pi/4
mu = n*p
sigma = sqrt(n*p*(1-p))
B = binom(n,p)
Z = norm(mu,sigma)
grid()
legend()
show()
Fig. 5.21 The binomial cdf and its CLT normal approximation.
a scalar dataset, and assume the dataset is standardized. Then its mean and
variance are zero and one,
$$\sum_{k=1}^{N} x_k = 0, \qquad \frac{1}{N}\sum_{k=1}^{N} x_k^2 = 1.$$
If the samples of the dataset are equally likely, then sampling the dataset
results in a random variable X, with expectations given by (5.3.2). It follows
that X is standard, and the moment-generating function of X is
$$E\bigl(e^{tX}\bigr) = \frac{1}{N}\sum_{k=1}^{N} e^{tx_k}.$$
Since the mean and variance of X are zero and 1, taking expectations of both
sides,
$$E\left(e^{tX/\sqrt{n}}\right) = 1 + \frac{t^2}{2n} + \ldots.$$
From this,
$$M_n(t) = \left(1 + \frac{t^2}{2n} + \ldots\right)^{n}.$$
By the compound-interest formula (A.3.8) (the missing terms . . . don't affect
the result)
$$\lim_{n\to\infty} M_n(t) = e^{t^2/2},$$
we expect the chance that Z < 0 should equal 1/2. In other words, because
of the symmetry of the curve, we expect to be 50% confident that Z < 0, or
0 is at the 50-th percentile level. So
When
$$\mathrm{Prob}(Z < z) = p,$$
we say z is the z-score corresponding to the p-value p. Equivalently, we say
our confidence that Z < z is p, or the percentile of z equals 100p. In Python,
the relation between z and p (Figure 5.22) is specified by
p = Z.cdf(z)
z = Z.ppf(p)
ppf is the percentile point function, and cdf is the cumulative distribution
function.
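As a quick illustration (an assumed example, not from the text), the two functions invert each other:

from scipy.stats import norm
Z = norm(0, 1)
print(Z.cdf(1.96))    # about 0.975
print(Z.ppf(0.975))   # about 1.96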
In Figure 5.23, the red areas are the lower tail p-value Prob(Z < z), the
two-tail p-value Prob(|Z| > z), and the upper tail p-value Prob(Z > z).
To go backward, suppose we are given Prob(|Z| < z) = p and we want
to compute the cutoff z. Then Prob(|Z| > z) = 1 − p, so Prob(Z > z) =
(1 − p)/2. This implies
$$\mathrm{Prob}(Z < z) = 1 - \frac{1-p}{2} = \frac{1+p}{2},$$
and
$$\mathrm{Prob}(|Z| < z) = \mathrm{Prob}(-z < Z < z) = \mathrm{Prob}(Z < z) - \mathrm{Prob}(Z < -z),$$
and
$$\mathrm{Prob}(Z > z) = 1 - \mathrm{Prob}(Z < z).$$
In Python,
# p = P(|Z| < z)
z = Z.ppf((1+p)/2)
p = Z.cdf(z) - Z.cdf(-z)
Now let’s zoom in closer to the graph and mark off z-scores 1, 2, 3 on the
horizontal axis to obtain specific colored areas as in Figure 5.25. These areas
are governed by the 68-95-99 rule (Table 5.24). Our confidence that |Z| < 1
equals the blue area 0.685, our confidence that |Z| < 2 equals the sum of the
blue plus green areas 0.955, and our confidence that |Z| < 3 equals the sum
of the blue plus green plus red areas 0.997. This is summarized in Table 5.24.
The possibility |Z| > 1 is called a 1-sigma event, |Z| > 2 a 2-sigma event,
and so on. So a 2-sigma event is 95.5% unlikely, or 4.5% likely. An event is
considered statistically significant if it’s a 2-sigma event or more. In other
words, something is significant if it’s unlikely. A six-sigma event |Z| > 6 is
two in a billion. You want a plane crash to be six-sigma.
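These sigma-level probabilities are easy to compute directly; a short check (an assumed example, not from the text):

from scipy.stats import norm
Z = norm(0, 1)
for k in [1, 2, 3, 6]:
    print(k, 2*(1 - Z.cdf(k)))   # Prob(|Z| > k): .32, .046, .0027, 2e-09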
Fig. 5.25 68%, 95%, 99% confidence cutoffs for standard normal.
These terms are defined for two-tail p-values. The same terms may be used
for upper-tail or lower tail p-values.
Figure 5.25 is not to scale, because a 1-sigma event should be where the
curve inflects from convex to concave (in the figure this happens closer to
2.7). Moreover, according to Table 5.24, the left-over white area should be
.3% (3 parts in 1,000), which is not what the figure suggests.
In general, the normal distribution is not centered at the origin, but else-
where. We say X is normal with mean µ and standard deviation σ if
X −µ
Z=
σ
is distributed according to a standard normal. We write N (µ, σ) for the nor-
mal with mean µ and standard deviation σ. As its name suggests, it is easily
checked that such a random variable X has mean µ and standard deviation
σ. For the normal distribution with mean µ and standard deviation σ, the
cutoffs are as in Figure 5.27. In Python, norm(mu,sigma) returns the normal
with mean mu and standard deviation sigma.
Here is a sample computation. Let X be a normal random variable with
mean µ and standard deviation σ, and suppose P rob(X < 7) = .15, and
P rob(X < 19) = .9. Given this data, we find µ and σ as follows.
With Z as above, we have
$$\mathrm{Prob}\big(Z < (7 - \mu)/\sigma\big) = .15, \qquad \mathrm{Prob}\big(Z < (19 - \mu)/\sigma\big) = .9.$$
a = Z.ppf(.15)
b = Z.ppf(.9)
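The remaining lines, solving (7 − µ)/σ = a and (19 − µ)/σ = b, are not shown above; a minimal completion (an assumption, not the book's listing):

sigma = (19 - 7) / (b - a)
mu = 7 - a*sigma
print(mu, sigma)   # approximately mu = 12.4, sigma = 5.2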
$$Z = \sqrt{n}\cdot\frac{\bar X - \mu}{\sigma},$$
Here are two examples. In the first example, suppose student grades are
normally distributed with mean µ = 80 and variance σ 2 = 16. This says the
average of all grades is 80, and the standard deviation is σ = 4. If a grade is
g, the standardized grade is
$$z = \frac{g - \mu}{\sigma} = \frac{g - 80}{4}.$$
A student is picked and their grade is g = 84. Is this significant? Is it highly
significant? In effect, we are asking: how unlikely is it to obtain such a grade?
Remember,
significant = unlikely
Since the standard deviation is 4, the student's z-score is
$$z = \frac{g - 80}{4} = \frac{84 - 80}{4} = 1.$$
What’s the upper-tail p-value corresponding to this z? It’s
1
P rob(Z > z) = P rob(Z > 1) = P rob(|Z| > 1) = .16,
2
or 16%. Since the upper-tail p-value is more than 5%, this student’s grade is
not significant.
For the second example, suppose a sample of n = 9 students are selected
and their sample average grade is ḡ = 84. Is this significant? Is it highly
significant? This time we take
$$z = \sqrt{n}\cdot\frac{\bar g - 80}{4} = 3\cdot\frac{84 - 80}{4} = 3.$$
What’s the upper-tail p-value corresponding to this z? It’s
or .13%. Since the upper-tail p-value is less than 1%, yes, this sample average
grade is both significant and highly significant.
The same grade, g = 84, is not significant for a single student, but is
significant for nine students. This is a reflection of the law of large numbers,
which says the sample mean approaches the population mean as the sample
size grows.
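Both upper-tail p-values above can be checked in a couple of lines (an assumed example, not from the text):

from numpy import sqrt
from scipy.stats import norm
Z = norm(0, 1)
print(1 - Z.cdf(1))                      # single student: about .16
print(1 - Z.cdf(sqrt(9)*(84 - 80)/4))    # nine students: about .0013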
Suppose student grades are normally distributed with mean 80 and vari-
ance 16. How many students should be sampled so that the chance that at
least one student’s grade lies below 70 is at least 50%?
To solve this, if p is the chance that a single student has a grade below 70,
then 1 − p is the chance that the student has a grade above 70. If n is the
sample size, (1 − p)n is the chance that all sample students have grades above
70. Thus the requested chance is 1 − (1 − p)n . The following code shows the
answer is n = 112.
z = 70
mean, sdev = 80, 4
p = Z(mean,sdev).cdf(z)      # here Z is scipy.stats.norm
for n in range(2,200):
    q = 1 - (1-p)**n         # chance at least one of n grades is below 70
    print(n, q)              # the least n with q >= .5 is n = 112
Here is the code for computing tail probabilities for the sample mean X̄
drawn from a normally distributed population with mean µ and standard
deviation σ. When n = 1, this applies to a single normal random variable.
########################
# P-values
########################

def pvalue(mean,sdev,n,xbar,type):
    Xbar = Z(mean,sdev/sqrt(n))
    if type == "lower-tail": p = Xbar.cdf(xbar)
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    # two-tail: measure the deviation |xbar - mean| symmetrically about the mean
    elif type == "two-tail": p = 2*(1 - Xbar.cdf(mean + abs(xbar - mean)))
    else:
        print("What's the tail type (lower-tail, upper-tail, two-tail)?")
        return
    print("sample size: ",n)
    print("mean,sdev,xbar: ",mean,sdev,xbar)
    print("mean,sdev,n,xbar: ",mean,sdev,n,xbar)
    print("p-value: ",p)
    z = sqrt(n) * (xbar - mean) / sdev
    print("z-score: ",z)
type = "upper-tail"
mean = 80
sdev = 4
n = 1
xbar = 90
pvalue(mean,sdev,n,xbar,type)
Exercises
Exercise 5.4.1 Let X be a normal random variable and suppose Prob(X <
1) = 0.3 and Prob(X < 2) = 0.4. What are the mean and variance of X?
Exercise 5.4.2 [27] Consider a normal distribution curve where the middle
90% of the area under the curve lies above the interval (4, 18). Use this
information to find the mean and the standard deviation of the distribution.
Exercise 5.4.3 Let Z be a normal random variable with mean 30.4 and
standard deviation 0.7. What is Prob(29 < Z < 31.1)?
Exercise 5.4.4 [27] Consider a normal distribution where the 70th percentile
is at 11 and the 25th percentile is at 2. Find the mean and the standard
deviation of the distribution.
We count the proportion of printers in the sample having speeds greater than
18 by setting
$$\hat p = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}.$$
Compute E(p̂) and Var(p̂). Use the CLT to compute the probability that
more than 50.9% of the printers have speeds greater than 18.
Exercise 5.4.9 [27] The level of nitrogen oxides in the exhaust of a particular
car model varies with mean 0.9 grams per mile and standard deviation 0.19
grams per mile . What sample size is needed so that the standard deviation
of the sampling distribution is 0.01 grams per mile?
Exercise 5.4.10 [27] The scores of students had a normal distribution with
mean µ = 559.7 and standard deviation σ = 28.2. What is the probability
that a single randomly chosen student scores 565 or higher? Now suppose
n = 30 students are sampled, assume i.i.d. What are the mean and standard
deviation of the sample mean score? What z-score corresponds to the mean
score of 565? What is the probability that the mean score is 565 or higher?
Exercise 5.4.11 Complete the square in the moment-generating function of
the standard normal pdf and use (5.4.3) to derive (5.4.4).
Exercise 5.4.12 Let Z be a standard normal random variable, and let
relu(x) be as in Exercise 5.3.17. Show
$$E(\mathrm{relu}(Z)) = \frac{1}{\sqrt{2\pi}}.$$
E(relu(Z̄n )) → E(relu(Z)), n → ∞.
Fig. 5.28 (X, Y ) inside the square and inside the disk.
P rob(X 2 + Y 2 ≤ 1)?
Since
$$\frac{1}{\sqrt{1-2u}} = E\big(e^{uU}\big) = \sum_{n=0}^{\infty} \frac{u^n}{n!}\,E(U^n),$$
But this equals the right side of (5.4.5). Thus the left sides of (5.4.5) and
(5.5.1) are equal. This shows
Going back to the question posed at the beginning of the section, we have
X and Y independent standard normal and we want
P rob(X 2 + Y 2 ≤ 1).
# here U is assumed to be scipy.stats.chi2
d = 2
u = 1
U(d).cdf(u)
returns 0.39.
Figure 5.29 is returned by the code

u = arange(0,15,.01)
for d in range(1,7):
    p = U(d).pdf(u)
    plot(u,p,label="d: " + str(d))
ylim(0,.6)
grid()
legend()
show()

² Geometrically, the p-value Prob(U > 1) is the probability that a normally distributed
point in d-dimensional space is outside the unit sphere.
and
$$\mathrm{Var}(U) = \sum_{k=1}^{d} \mathrm{Var}(Z_k^2) = \sum_{k=1}^{d} 2 = 2d.$$
We conclude
Because
$$\frac{1}{(1-2t)^{d/2}}\cdot\frac{1}{(1-2t)^{d'/2}} = \frac{1}{(1-2t)^{(d+d')/2}},$$
we obtain
X = (X1 , X2 , . . . , Xn )
in Rn .
Random vectors have means, variances, moment-generating functions,
and cumulant-generating functions, just like scalar-valued random variables.
Moreover we can have simple random samples of random vectors X1 , X2 ,
. . . , Xn .
If X is a random vector in Rd, its mean is the vector µ = E(X), and its variance is the d × d matrix
$$Q = E\big((X - \mu) \otimes (X - \mu)\big).$$
By (1.4.20),
$$w \cdot Qw = E\big(((X - \mu)\cdot w)^2\big). \qquad (5.5.3)$$
Thus the variance of a random vector is a nonnegative matrix.
A random vector is standard if µ = 0 and Q = I. If X is standard, then
In §2.2, we defined the mean and variance of a dataset (2.2.15). The mean
and variance there are the same as the mean and variance defined here, that
of a random variable.
To see this, we must build a random variable X corresponding to a dataset
x1 , x2 , . . . , xN . But this was done in (5.3.2). The moral is: every dataset may
be interpreted as a random variable.
Uncorrelated Chi-squared
|Z|2 (5.5.6)
Using this, we can plot the probability density function of a normal random
vector in R2 ,
%matplotlib ipympl
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import multivariate_normal as Z
# standard normal
mu = array([0,0])
Q = array([[1,0],[0,1]])
x = arange(-3,3,.01)
y = arange(-3,3,.01)
xy = cartesian_product(x,y)
# last axis of xy is fed into pdf
z = Z(mu,Q).pdf(xy)
ax = axes(projection='3d')
ax.set_axis_off()
x,y = meshgrid(x,y)
ax.plot_surface(x,y,z, cmap='cool')
show()
Then
$$M_{X,Y}(w) = E\big(e^{w\cdot(X,Y)}\big) = e^{w\cdot Qw/2} = M_X(u)\,M_Y(v)\,e^{(u\cdot Bv + v\cdot B^t u)/2}.$$
From this, X and Y are independent when B = 0. Thus, for normal random
vectors, independence and uncorrelatedness are the same.
Correlated Chi-squared
$$E = U^t Q U, \qquad Q^+ = U E^+ U^t,$$
and
$$E = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_r, 0, \dots, 0), \qquad
E^+ = \mathrm{diag}(1/\lambda_1, 1/\lambda_2, \dots, 1/\lambda_r, 0, \dots, 0),$$
so X · µ = 0.
By Exercise 2.6.7, Q+ = Q. Since X · µ = 0,
X · Q+ X = X · QX = X · (X − (X · µ)µ) = |X|2 .
We conclude
Singular Chi-squared
We use the above to derive the distribution of the sample variance. Let
X1 , X2 , . . . , Xn be a random sample, and let X̄ be the sample mean,
$$\bar X = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
Let S² be the sample variance,
$$S^2 = \frac{1}{n-1}\sum_{k=1}^{n} (X_k - \bar X)^2.$$
Let
$$\mathbf{1} = (1, 1, \dots, 1)$$
be in Rⁿ, and let µ = 𝟏/√n. Then µ is a unit vector and
$$Z\cdot\mu = \frac{1}{\sqrt n}\sum_{k=1}^{n} Z_k = \sqrt n\,\bar Z.$$
Since Z₁, Z₂, . . . , Zₙ are i.i.d. standard, Z · µ = √n Z̄ is standard.
Now let U = I − µ ⊗ µ and X = UZ. Then the mean of X is zero. Since Z has variance I, by Exercises 2.2.2 and
5.5.5,
$$\mathrm{Var}(X) = U^t I U = U^2 = U = I - \mu\otimes\mu.$$
By singular chi-squared above,
(n − 1)S 2 = |X|2
Exercises
Exercise 5.5.3 Continuing the previous problem with n = 20, use the CLT
to estimate the probability that fewer than 50% of the points lie in the unit
disk. Is this a 1-sigma event, a 2-sigma event, or a 3-sigma event?
Exercise 5.5.4 Let X be a random vector with mean zero and variance Q.
Show v is a zero variance direction (§2.5) for Q iff X · v = 0.
Exercise 5.5.5 Let µ and Q be the mean and variance of a random d-vector
X, and let A be any N × d matrix. Then AX is a random vector with mean
Aµ and variance AQAt .
$$\frac{Y_1^2}{\lambda_1} + \frac{Y_2^2}{\lambda_2} + \cdots + \frac{Y_r^2}{\lambda_r}$$
is chi-squared with degree r.
Exercise 5.5.7 If X is a random vector with mean zero and variance Q, then
u · Qv = E((X · u)(X · v)). (Insert w = u + v in (5.5.3).)
Exercise 5.5.8 Assume the classes of the Iris dataset are normally dis-
tributed with their means and variances (Exercise 2.2.8), and assume the
classes are equally likely. Using Bayes theorem (5.2.14), write a Python
function that returns the probabilities (p1 , p2 , p3 ) that a given iris x =
(t1 , t2 , t3 , t4 ) lies in each of the three classes. Feed your function the 150
samples of the Iris dataset. How many samples are correctly classified?
p1 + p2 + · · · + pd = 1.
This is called one-hot encoding since all slots in Y are zero except for one
“hot” slot.
For example, suppose X has three values 1, 2, 3, say X is the class of a
random sample from the Iris dataset. Then Y is R3 -valued, and we have
$$Y = \begin{cases} (1, 0, 0), & \text{if } X = 1,\\ (0, 1, 0), & \text{if } X = 2,\\ (0, 0, 1), & \text{if } X = 3.\end{cases}$$
More generally, let X have d values. Then with one-hot encoding, the
moment-generating function is
In particular, for a fair dice with d sides, the values are equally likely, so
the one-hot encoded cumulant-generating function is
When d = 2,
$$q_1 = \frac{e^{y_1}}{e^{y_1}+e^{y_2}} = \frac{1}{1+e^{-(y_1-y_2)}} = \sigma(y_1-y_2), \qquad
q_2 = \frac{e^{y_2}}{e^{y_1}+e^{y_2}} = \frac{1}{1+e^{-(y_2-y_1)}} = \sigma(y_2-y_1).$$
Because of this, the softmax function is the multinomial analog of the logistic
function, and we use the same symbol σ to denote both functions.
y = array([y1,y2,y3])
q = softmax(y)
or
σ(y) = σ(y + a1).
We say a vector y is centered if y is orthogonal to 1,
y · 1 = y1 + y2 + · · · + yd = 0.
This establishes
Define
log p = (log p1 , log p2 , . . . , log pd ).
Then the inverse of p = σ(z) is
y = Z1 + log p. (5.6.5)
The function
$$I(p) = p\cdot\log p = \sum_{k=1}^{d} p_k \log p_k \qquad (5.6.6)$$
This implies
$$p\cdot y = \sum_{k=1}^{d} p_k y_k = \sum_{k=1}^{d} p_k \log\big(e^{y_k}\big)
\le \log\left(\sum_{k=1}^{d} p_k e^{y_k}\right) = \log\left(\sum_{k=1}^{d} e^{y_k+\log p_k}\right) = Z(y + \log p).$$
For all y,
$$Z(y) = \max_{p}\,\big(p \cdot y - I(p)\big).$$
Since
$$D^2 I(p) = \mathrm{diag}\left(\frac{1}{p_1}, \frac{1}{p_2}, \dots, \frac{1}{p_d}\right),$$
we see I(p) is strictly convex, and H(p) is strictly concave.
In Python, the entropy is
p = array([p1,p2,p3])
entropy(p)
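For example (an assumed check, not from the text), the entropy of the uniform distribution on d faces is log d:

from numpy import array, log
from scipy.stats import entropy
p = array([1/3, 1/3, 1/3])
print(entropy(p), log(3))   # both about 1.0986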
Roll a d-faced dice n times, and let #n (p) be the number of outcomes
where the face-proportions are p = (p1 , p2 , . . . , pd ). Then
Now
$$\frac{\partial^2 Z}{\partial y_j\,\partial y_k} = \frac{\partial \sigma_j}{\partial y_k} = \begin{cases} \sigma_j - \sigma_j\sigma_k, & \text{if } j = k,\\ -\sigma_j\sigma_k, & \text{if } j \ne k.\end{cases}$$
Hence we have
yj ≤ c, j = 1, 2, . . . , d.
which implies
$$|y|^2 = \sum_{k=1}^{d} y_k^2 \le d(d-1)^2 c^2.$$
Setting C = √d (d − 1)c, we conclude
Let
log q = (log q1 , log q2 , . . . , log qd ).
Then
$$p\cdot\log q = \sum_{k=1}^{d} p_k\log q_k,$$
and
I(p, q) = I(p) − p · log q. (5.6.13)
Similarly, the relative entropy is
# scipy's entropy(p,q) computes the relative information sum(p*log(p/q))
p = array([p1,p2,p3])
q = array([q1,q2,q3])
entropy(p,q)
returns the relative information, not the relative entropy. Always check your
Python code’s conventions and assumptions. See below for more on this ter-
minology confusion.
Assume a d-faced dice’s bias is q. Roll the dice n times, and let Pn (p, q)
be the probability of obtaining outcomes where the proportion of faces
is p. Then
$$= \max_{y'}\big(p\cdot(y' - \log q) - Z(y')\big) = I(p) - p\cdot\log q = I(p, q).$$
This identity is the direct analog of (4.5.19). The identity (4.5.19) is used
in linear regression. Similarly, (5.6.15) is used in logistic regression.
The cross-information is
$$I_{\mathrm{cross}}(p, q) = -\sum_{k=1}^{d} p_k\log q_k,$$
Since I(p, σ(y)) and Icross (p, σ(y)) differ by the constant I(p), we also have
This is easily checked using the definitions of I(p, q) and σ(y, q).
H = −I Information Entropy
Absolute I(p) H(p)
Cross Icross (p, q) Hcross (p, q)
Relative I(p, q) H(p, q)
Curvature Convex Concave
Error I(p, q) with q = σ(z)
Table 5.33 The third row is the sum of the first and second rows, and the H column is
the negative of the I column.
How does one keep things straight? By remembering that it’s convex func-
tions that we like to minimize, not concave functions. In more vivid terms,
would you rather ski down a convex slope, or a concave slope?
In machine learning, loss functions are built to be minimized, and infor-
mation, in any form, is convex, while entropy, in any form, is concave. Table
5.33 summarizes the situation.
Exercises
6.1 Estimation
[Figure: a hypothesis H and a sample produce a p-value; if p > α, do not reject H; if p < α, reject H.]
d = 784
for _ in range(20):
    u = randn(d)
    v = randn(d)
    print(angle(u,v))   # angle(u,v) in degrees (assumed defined earlier in the text)
86.27806537791886
87.91436653824776
93.00098725550777
92.73766421951748
90.005139015804
87.99643434444482
89.77813370637857
96.09801014394806
90.07032573539982
89.37679070400239
91.3405728939376
86.49851399221568
87.12755619082597
88.87980905998855
89.80377324818076
91.3006921339982
91.43977096117017
88.52516224405458
86.89606919838387
90.49100744167357
d = 784
n = 1     # each component is a single fair coin toss (this matches the output below)
for _ in range(20):
    u = binomial(n,.5,d)
    v = binomial(n,.5,d)
    print(angle(u,v))
59.43464627897324
59.14345748418916
60.31453922165891
60.38024365702492
59.24709660805488
59.27165957992343
61.21424657806321
60.55756381536082
61.59468919876665
61.33296028237481
60.03925473033243
60.25732069941224
61.77018692842784
60.672901794058326
59.628519516164666
59.41272458020638
58.43172340007064
59.863796136907744
59.45156367988921
59.95835532791699
The difference between the two scenarios is the distribution. In the first
scenario, we have randn(d): the components are distributed according to
a standard normal. In the second scenario, we have binomial(1,.5,d) or
binomial(3,.5,d): the components are distributed according to one or three
fair coin tosses. To see how the distribution affects things, we bring in the
law of large numbers, which is discussed in §5.3.
Let X1 , X2 , . . . , Xd be a simple random sample from some population,
and let µ be the population mean. Recall this means X1 , X2 , . . . , Xd are
i.i.d. random variables, with µ = E(X). The sample mean is
X1 + X2 + · · · + Xd
X̄ = .
d
For large sample size d, the sample mean X̄ approximately equals the
population mean µ, X̄ ≈ µ.
We use the law of large numbers to explain the closeness of the vector
angles to specific values.
Assume u = (x1 , x2 , . . . , xd ), and v = (y1 , y2 , . . . , yd ) where all components
are selected independently of each other, and each is selected according to
the same distribution.
$$\frac{X_1Y_1 + X_2Y_2 + \cdots + X_dY_d}{d} \approx E(X_1Y_1),$$
so
$$U\cdot V = X_1Y_1 + X_2Y_2 + \cdots + X_dY_d \approx d\,E(X_1Y_1).$$
Similarly, U · U ≈ d E(X₁²) and V · V ≈ d E(Y₁²). Hence (check that the d's
cancel)
$$\cos(U, V) = \frac{U\cdot V}{\sqrt{(U\cdot U)(V\cdot V)}} \approx \frac{E(X_1Y_1)}{\sqrt{E(X_1^2)\,E(Y_1^2)}}.$$
Since X₁ and Y₁ are independent with mean µ and variance σ², we have
E(X₁Y₁) = µ² and E(X₁²) = E(Y₁²) = µ² + σ², so
$$\cos(\theta) = \frac{U\cdot V}{\sqrt{(U\cdot U)(V\cdot V)}} \approx \frac{\mu^2}{\mu^2+\sigma^2}.$$
In the second scenario, each component is a coin toss with bias p, so µ = p, σ² = p(1 − p), and
$$\frac{\mu^2}{\mu^2+\sigma^2} = \frac{p^2}{p^2 + p(1-p)} = p.$$
cos(θ) is approximately µ²/(µ² + σ²).
1 ≈ means the ratio of the two sides approaches 1 for large n, see §A.6.
6.2 Z-test
p = .7
n = 25
N = 1000
v = binomial(n,p,N)/n
hist(v,edgecolor ='Black')
show()
A confidence level of zero indicates that we have no faith at all that se-
lecting another sample will give similar results, while a confidence level of 1
indicates that we have no doubt at all that selecting another sample will give
similar results.
When we say p is within X̄ ± ϵ, or
$$|p - \bar X| < \epsilon,$$
then
$$(L, U) = (\bar X - \epsilon, \bar X + \epsilon)$$
is a confidence interval.
With the above setup, we have the population proportion p, and the four
sample characteristics
• sample size n
• sample proportion X̄,
• margin of error ϵ,
• confidence level α.
Suppose we do not know p, but we know n and X̄. We say the margin of
error is ϵ, at confidence level α, if
$$Z = \sqrt n\,\frac{\bar X - p}{\sqrt{p(1-p)}}$$
$$L, U = \bar X - \epsilon,\ \bar X + \epsilon.$$
$$\mathrm{Prob}(|Z| > z^*) = \alpha.$$
Let σ/√n be the standard error. By the central limit theorem,
$$\alpha \approx \mathrm{Prob}\left(\frac{|\bar X - p|}{\sqrt{p(1-p)}} > \frac{z^*}{\sqrt n}\right).$$
$$\frac{|\bar X - p|}{\sqrt{p(1-p)}} = \frac{z^*}{\sqrt n} \qquad (6.2.1)$$
##########################
# Confidence Interval - Z
##########################

def confidence_interval(xbar,sdev,n,alpha,type):
    Xbar = Z(xbar,sdev/sqrt(n))
    if type == "two-tail":
        U = Xbar.ppf(1-alpha/2)
        L = Xbar.ppf(alpha/2)
    elif type == "upper-tail":
        U = Xbar.ppf(1-alpha)
        L = xbar
    elif type == "lower-tail":
        L = Xbar.ppf(alpha)
        U = xbar
    else: print("what's the test type?"); return
    return L, U
alpha = .02
sdev = 228
n = 35
xbar = 95
type = "two-tail"   # assumed; the original sets type outside this excerpt
L, U = confidence_interval(xbar,sdev,n,alpha,type)
Now we can answer the questions posed at the start of the section. Here
are the answers.
1. When n = 20, α = .95, and X̄ = .7, we have [L, U ] = [.5, .9], so ϵ = .2.
2. When X̄ = .7, α = .95, and ϵ = .15, we run confidence_interval for
15 ≤ n ≤ 40, and select the least n for which ϵ < .15. We obtain n = 36.
3. When X̄ = .7, α = .99, and ϵ = .15, we run confidence_interval for
1 ≤ n ≤ 100, and select the least n for which ϵ < .15. We obtain n = 62.
4. When X̄ = .7, n = 20, and ϵ = .1, we have
$$z^* = \frac{\epsilon\sqrt n}{\sigma} = .976.$$
• Ha : µ ̸= 0.
Here the significance level is α = .02 and µ0 = 0. To decide whether to
reject H0 or not, compute the standardized test statistic
$$z = \sqrt n\cdot\frac{\bar x - \mu_0}{\sigma} = 2.465.$$
Since z is a sample from an approximately normal distribution Z, the p-value
Hypothesis Testing
µ < µ0 , µ > µ0 , µ ̸= µ0 .
In the Python code below, instead of working with the standardized statistic
Z, we work directly with X̄, which is normally distributed with mean µ₀ and
standard deviation σ/√n.
###################
# Hypothesis Z-test
###################
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
sdev = 2
alpha = .01
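The ztest function that consumes these parameters is not included in this excerpt; a minimal sketch consistent with the pvalue code above (an assumption, not the book's code):

def ztest(mu0, sdev, n, xbar, type, alpha):
    Xbar = Z(mu0, sdev/sqrt(n))                  # Z is scipy.stats.norm
    if type == "upper-tail":   p = 1 - Xbar.cdf(xbar)
    elif type == "lower-tail": p = Xbar.cdf(xbar)
    else:                      p = 2*(1 - Xbar.cdf(mu0 + abs(xbar - mu0)))
    print("p-value: ", p)
    print("reject H0" if p < alpha else "do not reject H0")

ztest(mu0, sdev, n, xbar, type, alpha)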
There are two types of possible errors we can make: a Type I error is when
H0 is true but we reject it, and a Type II error is when H0 is not true but
we fail to reject it.
                    H0 is true            H0 is false
do not reject H0    1 − α                 Type II error: β
reject H0           Type I error: α       Power: 1 − β
$$\mu_0 - \frac{z^*\sigma}{\sqrt n} < \bar x < \mu_0 + \frac{z^*\sigma}{\sqrt n}.$$
This calculation was for a two-tail test. When the test is upper-tail or
lower-tail, a similar calculation leads to the code
############################
# Type1 and Type2 errors - Z
############################

def type2_error(type,mu0,mu1,sdev,n,alpha):
    print("significance,mu0,mu1,sdev,n: ", alpha,mu0,mu1,sdev,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        zstar = Z.ppf(alpha)
        type2 = 1 - Z.cdf(delta + zstar)
    elif type == "upper-tail":
        zstar = Z.ppf(1-alpha)
        type2 = Z.cdf(delta + zstar)
    elif type == "two-tail":
        zstar = Z.ppf(1 - alpha/2)
        type2 = Z.cdf(delta + zstar) - Z.cdf(delta - zstar)
    else: print("what's the test type?"); return
    print("test type: ",type)
    print("zstar: ", zstar)
    print("delta: ", delta)
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)
mu0 = 120
mu1 = 122
sdev = 2
n = 10
alpha = .01
type = "upper-tail"
type2_error(type,mu0,mu1,sdev,n,alpha)
A type II error is when we do not reject the null hypothesis and yet it’s
false. The power of a test is the probability of rejecting the null hypothesis
when it’s false (Figure 6.3). If the probability of a type II error is β, then the
power is 1 − β.
Going back to the driving speed example, what is the chance that someone
driving at µ1 = 122 is not caught? This is a type II error; using the above
code, the probability is
6.3 T -test
Here C is a constant to make the total area under the graph equal to one
(Figure 6.4).
The distribution of a Student random variable T is the Student distribu-
tion with degree d, also called the t-distribution with degree d. The Student
distribution has pdf (6.3.1), and the probability that T lies in a small interval
[a, b] is
2 This terminology is due to the statistician R. A. Fisher.
with the integral computed via the fundamental theorem of calculus (A.5.2)
or Python.
The Student pdf (6.3.1) approaches the standard normal pdf (5.4.1) as
d → ∞ (Exercise 6.3.1).
for d in [3,4,7]:
    t = arange(-3,3,.01)
    plot(t,T(d).pdf(t),label="d = "+str(d))
plot(t,Z.pdf(t),"--",label=r"d = $\infty$")
grid()
legend()
show()
$$\sqrt n\cdot\frac{\bar X - \mu}{S} = \sqrt n\cdot\frac{\bar X - \mu}{\sqrt{\dfrac{1}{n-1}\sum_{k=1}^{n}(X_k - \bar X)^2}}.$$
$$X_k = \mu + \sigma Z_k,$$
$$S^2 = \sigma^2\,\frac{1}{n-1}\sum_{k=1}^{n}(Z_k - \bar Z)^2.$$
$$\sqrt n\cdot\frac{\bar X-\mu}{S} = \sqrt n\cdot\frac{\bar Z}{\sqrt{\dfrac{1}{n-1}\sum_{k=1}^{n}(Z_k-\bar Z)^2}} = \sqrt n\cdot\frac{\bar Z}{\sqrt{U/(n-1)}}.$$
##########################
# Confidence Interval - T
##########################

def confidence_interval(xbar,s,n,alpha,type):
    d = n-1
    if type == "two-tail":
        tstar = T(d).ppf(1-alpha/2)
        L = xbar - tstar * s / sqrt(n)
        U = xbar + tstar * s / sqrt(n)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        L = xbar
        U = xbar + tstar * s / sqrt(n)
    elif type == "lower-tail":
        tstar = T(d).ppf(alpha)
        L = xbar + tstar * s / sqrt(n)
        U = xbar
    else: print("what's the test type?"); return
    print("type: ",type)
    return L, U

³ Geometrically, the p-value Prob(T > 1) is the probability that a normally distributed
point in (d + 1)-dimensional spacetime is inside the light cone.
n = 10
xbar = 120
s = 2
alpha = .01
type = "upper-tail"
print("significance, s, n, xbar: ", alpha,s,n,xbar)
L,U = confidence_interval(xbar,s,n,alpha,type)
print("lower, upper: ", L,U)
Going back to the driving speed example from §6.2, instead of assuming
the population standard deviation is σ = 2, we compute the sample standard
deviation and find it’s S = 2. Recomputing with T (9), instead of Z, we
see (L, U ) = (120, 121.78), so the cutoff now is µ∗ = 121.78, as opposed to
µ∗ = 121.47 there.
###################
# Hypothesis T-test
###################
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
s = 2
alpha = .01
ttest(mu0, s, n, xbar,type)
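The ttest function itself is not shown in this excerpt; a minimal sketch consistent with its call above and the T(d) notation (an assumption, not the book's code):

def ttest(mu0, s, n, xbar, type):
    d = n - 1
    tstat = sqrt(n) * (xbar - mu0) / s       # standardized test statistic
    if type == "upper-tail":   p = 1 - T(d).cdf(tstat)
    elif type == "lower-tail": p = T(d).cdf(tstat)
    else:                      p = 2*(1 - T(d).cdf(abs(tstat)))
    print("t-statistic: ", tstat)
    print("p-value: ", p)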
########################
# Type1 and Type2 errors
########################

def type2_error(type,mu0,mu1,n,alpha):
    d = n-1
    print("significance,mu0,mu1,n: ", alpha,mu0,mu1,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        tstar = T(d).ppf(alpha)
        type2 = 1 - T(d).cdf(delta + tstar)
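    # (the remaining branches fall outside this excerpt; the completion below is
    #  an assumption mirroring the Z-test version above, not the book's code)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        type2 = T(d).cdf(delta + tstar)
    elif type == "two-tail":
        tstar = T(d).ppf(1 - alpha/2)
        type2 = T(d).cdf(delta + tstar) - T(d).cdf(delta - tstar)
    else: print("what's the test type?"); return
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)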
type2_error(type,mu0,mu1,n,alpha)
Exercises
Exercise 6.3.1 Use the compound-interest formula (A.3.8) to show the Stu-
dent pdf (6.3.1) equals the standard normal pdf (5.4.1) when d = ∞. Since
the formula for the constant C is not given, ignore C in your calculation.
i = 1, 2, . . . , d.
$$Z = \sqrt n\cdot\frac{\hat p - p}{\sqrt{p(1-p)}} \qquad (6.4.1)$$
is approximately standard normal for large enough sample size, and con-
sequently U = Z² is approximately chi-squared with degree one. The chi-
squared test generalizes this from d = 2 categories to d > 2 categories.
Given a category i, let #i denote the number of times Xk = i, 1 ≤ k ≤ n,
in a sample of size n. Then #i is the count that Xk = i, and p̂i = #i /n is the
observed frequency, in a sample of size n. Let pi be the expected frequency,
pi = P rob(Xk = i).
Goodness-Of-Fit Test
def goodness_of_fit(observed,expected):
    # assume len(observed) == len(expected)
    d = len(observed)
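    # (the rest of the function falls on the next page of the original; the lines
    #  below are an assumed completion consistent with the statistic (6.4.2) and
    #  the U(d-1) chi-squared cutoff used later, not the book's exact code)
    u = sum([ (observed[i] - expected[i])**2 / expected[i] for i in range(d) ])
    pvalue = 1 - U(d-1).cdf(u)
    return u, pvalue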
Suppose a dice is rolled n = 120 times, and the observed counts are
Notice
O1 + O2 + O3 + O4 + O5 + O6 = 120.
If the dice is fair, the expected counts are
u = 12.7.
The dice is fair if u is not large and the dice is unfair if u is large. At
significance level α, the large/not-large cutoff u∗ is
alpha = .05   # significance level (the cutoff u* = 11.07 below corresponds to alpha = .05)
d = 6
ustar = U(d-1).ppf(1-alpha)
Since this returns u∗ = 11.07 and u > u∗ , we can conclude the dice is not
fair.
$$V_i = \begin{cases} \dfrac{1}{\sqrt{p_i}}, & \text{if } X = i,\\[4pt] 0, & \text{if } X \ne i.\end{cases} \qquad (6.4.3)$$
we conclude
E(V ) = µ, E(V ⊗ V ) = I.
From this,
E(V ) = µ, V ar(V ) = I − µ ⊗ µ. (6.4.4)
Now define
Vk = vectp (Xk ) , k = 1, 2, . . . , n.
Since X1 , X2 , . . . , Xn are i.i.d, V1 , V2 , . . . , Vn are i.i.d. By (5.5.5), we conclude
the random vector
$$Z = \sqrt n\left(\frac{1}{n}\sum_{k=1}^{n} V_k - \mu\right)$$
obtaining (6.4.2).
rij = pi qj ,
or r = p ⊗ q.
For example, suppose 300 people are polled and the results are collected
in a contingency table (Figure 6.5).
Is a person’s gender correlated with their party affiliation, or are the two
variables independent? To answer this, let p̂ and q̂ be the observed frequencies
$$\hat p_i = \frac{\#\{k : X_k = i\}}{n}, \qquad \hat q_j = \frac{\#\{k : Y_k = j\}}{n},$$
and let r̂ be the joint observed frequencies
$$\hat r_{ij} = \frac{\#\{k : X_k = i \text{ and } Y_k = j\}}{n}.$$
Then r̂ is also a d × N matrix.
When the effects are independent, r = p ⊗ q, so, by the law of large
numbers, we should have
r̂ ≈ p̂ ⊗ q̂
for large sample size. The chi-squared independence test quantifies the dif-
ference of the two matrices r and r̂.
$$= -n + n\sum_{i,j=1}^{d,N}\frac{(\text{observed})^2}{\text{expected}}.$$
The code
def chi2_independence(table):
    n = sum(table)   # total sample size
    d = len(table)
    N = len(table.T)
    rowsum = array([ sum(table[i,:]) for i in range(d) ])
    colsum = array([ sum(table[:,j]) for j in range(N) ])
    expected = outer(rowsum,colsum)   # tensor product
    u = -n + n*sum([[ table[i,j]**2/expected[i,j] for j in range(N) ]
                    for i in range(d) ])
    deg = (d-1)*(N-1)
    pvalue = 1 - U(deg).cdf(u)
    return pvalue
table = array([[68,56,32],[52,72,20]])
chi2_independence(table)
returns a p-value of 0.0401, so, at the 5% significance level, the effects are
not independent.
equals (6.4.6).
Let u₁, u₂, . . . , u_d and v₁, v₂, . . . , v_N be orthonormal bases for R^d and
R^N respectively. By (2.9.8),
$$\|Z\|^2 = \mathrm{trace}(Z^t Z) = \sum_{i,j=1}^{d,N} (u_i\cdot Z v_j)^2. \qquad (6.4.8)$$
We will show ∥Z∥² is asymptotically chi-squared of degree (d − 1)(N − 1).
To achieve this, we show Z is asymptotically normal.
Let X and Y be discrete random variables with probability vectors
p = (p1 , p2 , . . . , pd ) and q = (q1 , q2 , . . . , qN ), and assume X and Y are in-
dependent.
Let
$$\mu = \big(\sqrt{p_1}, \sqrt{p_2}, \dots, \sqrt{p_d}\big), \qquad \nu = \big(\sqrt{q_1}, \sqrt{q_2}, \dots, \sqrt{q_N}\big).$$
and
W ≈0
over k = 1, 2, . . . , n, we see
$$Z^{CLT}_{ij} = \sqrt n\left(\frac{\hat r_{ij}}{\sqrt{p_i}\sqrt{q_j}} - \frac{\hat p_i\sqrt{q_j}}{\sqrt{p_i}} - \frac{\hat q_j\sqrt{p_i}}{\sqrt{q_j}} + \sqrt{p_i q_j}\right). \qquad (6.4.10)$$
$$\frac{\hat q_j - q_j}{\sqrt{\hat p_i\,\hat q_j}} \approx 0.$$
Z ≈ Z CLT .
We conclude
• the mean and variance of Z are asymptotically the same as those of M ,
• u · Zν ≈ 0, µ · Zv ≈ 0 for any u and v, and,
• Z ≈ normal.
In particular, since u·Zv and u′ ·Zv ′ are asymptotically uncorrelated when
u ⊥ u′ and v ⊥ v ′ , and Z is asymptotically normal, we conclude u · Zv and
u′ · Zv ′ are asymptotically independent when u ⊥ u′ and v ⊥ v ′ .
Now choose the orthonormal bases with u1 and v1 equal to µ and ν re-
spectively. Then
ui · Zvj , i = 1, 2, 3, . . . , d, j = 1, 2, 3, . . . , N
are independent normal random variables with mean zero, asymptotically for
large n, and variances according to the listing
4 The theoretical basis for this intuitively obvious result is Slutsky’s theorem [7].
Exercises
Exercise 6.4.4 Verify the goodness-of-fit test statistic (6.4.2) is the square
of (6.4.1) when d = 2.
Exercise 6.4.5 [30] Among 100 vacuum tubes tested, 41 had lifetimes of less
than 30 hours, 31 had lifetimes between 30 and 60 hours, 13 had lifetimes
between 60 and 90 hours, and 15 had lifetimes of greater than 90 hours.
Are these data consistent with the hypothesis that a vacuum tube’s lifetime
is exponentially distributed (Exercise 5.3.23) with a mean of 50 hours? At
what significance? Here p = (p1 , p2 , p3 , p4 ).
Exercise 6.4.7 [30] In a famous article (S. Russell, “A red sky at night. . . ”
Metropolitan Magazine London, 61, p. 15, 1926) the following dataset of
frequencies of sunset colors and whether each was followed by rain was pre-
sented. Test the hypothesis that whether it rains tomorrow is independent of
the color of today’s sunset.
Exercise 6.4.8 [30] A sample of 300 cars having mobile phones and one of
400 cars without phones were tracked for 1 year. The following table gives the
number of these cars involved in accidents over that year. Use the above to
test the hypothesis that having a mobile phone in your car and being involved
in an accident are independent. Use the 5 percent level of significance.
Accident No Accident
Mobile phone 22 278
No phone 26 374
7.1 Overview
Sometimes J(W ) is normalized by dividing by N , but this does not change the
results. With the dataset given, the mean error is a function of the weights.
A weight matrix W ∗ is optimal if it is a minimizer of the mean error,
In §4.4, we saw two versions of forward and back propagation. In this section
we see a third version. We begin by reviewing the definition of graph and
network as given in §3.3 and §4.4.
A graph consists of nodes and edges. Nodes are also called vertices, and an
edge is an ordered pair (i, j) of nodes. Because the ordered pair (i, j) is not
the same as the ordered pair (j, i), our graphs are directed.
The edge (i, j) is incoming at node j and outgoing at node i. If a node j
has no outgoing edges, then j is an output node. If a node i has no incoming
edges, then i is an input node. If a node is neither an input nor an output, it
is a hidden node.
We assume our graphs have no cycles: every forward path terminates at an
output node in a finite number of steps, and every backward path terminates
at an input node in a finite number of steps.
A graph is weighted if a scalar weight wij is attached to each edge (i, j). If
(i, j) is not an edge, we set wij = 0.
If a network has d nodes, the nodes are labeled 0, 1, 2, . . . , d − 1, and the
edges are completely specified by the d × d weight matrix W = (wij ).
A node with an attached activation function (4.4.2) is a neuron. A net-
work is a directed weighted graph where some nodes are neurons. In the next
paragraph, we define a special kind of network, a neural network.
j = 0, 1, . . . , d − 1.
Because wij = 0 if (i, j) is not an edge, the nonzero entries in the incoming
list at node j correspond to the edges incoming at node j.
A neural network is a network where every activation function is restricted
to be a function of the sum of the entries of the incoming list.
For example, all the networks in this section are neural networks, but the
network in Figure 4.16 is not a neural network.
Let
$$x_j^- = \sum_{i\to j} w_{ij}\,x_i \qquad (7.2.1)$$
be the sum of the incoming list at node j. Then, in a neural network, the
outgoing signal at node j is
$$x_j = f_j(x_j^-) = f_j\left(\sum_{i\to j} w_{ij}\,x_i\right). \qquad (7.2.2)$$
$$x = (x_0, x_1, \dots, x_{d-1}), \qquad x^- = (x_0^-, x_1^-, \dots, x_{d-1}^-).$$
In a network, in §4.4, x_j^- was a list or vector; in a neural network, x_j^- is a
scalar.
If node j is an input node, then x_j^- = None. If node j is an output node,
then x_j = None.
Let W be the weight matrix. If the network has d nodes, the activation
vector is
f = (f0 , f1 , . . . , fd−1 ).
Then a neural network may be written in vector-matrix form
x = f (W t x).
However, this representation is more useful when the network has structure,
for example in a dense shallow layer (7.2.12).
Neural Network
Every neural network is a combination of perceptrons.
[Figure: a perceptron with inputs x₁, x₂, x₃, weights w₁, w₂, w₃, activation f, and output y.]
$$y = f(w_0 + w_1x_1 + w_2x_2 + \cdots + w_dx_d) = f(w\cdot x + w_0).$$
The role of the bias is to shift the threshold in the activation function.
If x1 , x2 , . . . , xN is a dataset, then (x1 , 1), (x2 , 1), . . . , (xN , 1) is the aug-
mented dataset. If the original dataset is in Rd , then the augmented dataset
is in Rd+1 . In this regard, Exercise 7.2.1 is relevant.
By passing to the augmented dataset, a neural network with bias and d
input features can be thought of as a neural network without bias and d + 1
input features.
In §5.2, Bayes theorem is used to express a conditional probability in terms
of a perceptron,
Prob(H | x) = σ(w · x + w₀).
This is a basic example of how a perceptron computes probabilities.
[Figure: a perceptron with bias w₀, inputs x₁, x₂, x₃, weights w₁, w₂, w₃, activation f, and output y.]
Perceptrons gained wide exposure after Minsky and Papert’s famous 1969
book [22], from which Figure 7.3 is taken.
$$\frac{\partial y}{\partial w_i} = f'(y^-)\cdot\frac{\partial y^-}{\partial w_i} = f'(y^-)\cdot x_i.$$
The derivative of the output with respect to the incoming signal is the down-
stream derivative δ, so we obtain the formula
$$\frac{\partial y}{\partial w_i} = \delta\cdot x_i.$$
If there is a bias w₀, the corresponding input is x₀ = 1, and the formula is
still valid. We generalize this formula to any neural network, by explaining
each of the various terms, and leading to (7.2.11).
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
# activation functions
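# (the definitions are not included in this excerpt; the sketch below is an
#  assumption consistent with the names activate and der_dict used later,
#  not the book's exact code)
from numpy import exp, tanh

def id(z): return z                    # identity activation
def relu(z): return max(z, 0)
def sigmoid(z): return 1/(1 + exp(-z))

# derivative of each activation, keyed by the function itself,
# used by local() during back propagation
der_dict = {
    id:      lambda z: 1,
    relu:    lambda z: 1 if z > 0 else 0,
    sigmoid: lambda z: sigmoid(z)*(1 - sigmoid(z)),
    tanh:    lambda z: 1 - tanh(z)**2,
}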
[Figure 7.4: the example neural network, with input nodes 0, 1, neurons 2, 3, 4, 5, output nodes 6, 7, and weights w02, w03, w12, w13, w24, w25, w34, w35.]
$$x^- = (\text{None}, \text{None}, x_2^-, x_3^-, x_4^-, x_5^-, x_6^-, x_7^-),$$
$$x = (x_0, x_1, x_2, x_3, x_4, x_5, \text{None}, \text{None}).$$
Note x₄ = x₆⁻ and x₅ = x₇⁻. Figures 7.5 and 7.6 show the incoming and
outgoing signals.
[Figure: the outgoing signals x₀, x₁, x₂, x₃, x₄, x₅ on the network's edges.]
[Figure: the incoming signals x₂⁻, x₃⁻, x₄⁻, x₅⁻, x₆⁻, x₇⁻ at the network's nodes.]
The nodes may be labeled in any order. We identify input nodes and
output nodes from the weight matrix using the code
def is_output(i):
    for j in range(d):
        if w[i][j] != None: return False
    return True

def is_input(i):
    for j in range(d):
        if w[j][i] != None: return False
    return True
d = 8
[Figure: the example network with numeric weights 0.1, −0.3, −2.0, .22, and 1 attached to its edges.]
activate = [None]*d
activate[2] = relu
activate[3] = id
activate[4] = sigmoid
activate[5] = tanh
Now we modify the forward propagation code in §4.4 to work for neural
networks. The key diagram is Figure 7.9.
Assume the activation function at node j is activate[j]. By (7.2.1) and
(7.2.2), the code is
def incoming(x,w,j):
    return sum([ outgoing(x,w,i)*w[i][j] for i in range(d)
                 if w[i][j] != None ])

def outgoing(x,w,j):
    if x[j] != None: return x[j]
    else: return activate[j](incoming(x,w,j))
[Figure 7.9: nodes i and j joined by an edge with weight w_ij; the outgoing signal x_i at node i feeds into the incoming sum at node j.]
def forward_prop(x,w):
    d = len(x)
    xminus = [None]*d   # incoming sums; stays None at input nodes
    for j in range(d):
        if not is_output(j):
            x[j] = outgoing(x,w,j)
    for j in range(d):
        if not is_input(j):
            xminus[j] = sum([ x[i] * w[i][j] for i in range(d)
                              if w[i][j] != None ])
    return xminus, x
x = [None]*d
x[0] = 1.5
x[1] = 2.5
xminus, x = forward_prop(x,w)
print(xminus)
print(x)
Let y be the target output signals, defined only at the output nodes,
$$y = (y_0, y_1, y_2, y_3, y_4, y_5, y_6, y_7) = (\text{None}, \text{None}, \text{None}, \text{None}, \text{None}, \text{None}, 0.427, -0.288),$$
and let J(x⁻, y) be a function of x⁻ and y, measuring the error between the
target outputs y and the actual outputs. For Figure 7.4, we define the mean
square error function or mean square loss
$$J(x^-, y) = \frac{1}{2}(x_6^- - y_6)^2 + \frac{1}{2}(x_7^- - y_7)^2. \qquad (7.2.6)$$
The code for J is
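# (the listing is not included in this excerpt; the sketch below is an
#  assumption matching the call J(xminus,y) used in train_nn later)
def J(xminus, y):
    return sum([ (xminus[i] - y[i])**2 / 2 for i in range(d) if is_output(i) ])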
Since x₄ = x₆⁻ and x₅ = x₇⁻, equivalently
$$J = \frac{1}{2}(x_4 - y_6)^2 + \frac{1}{2}(x_5 - y_7)^2,$$
where
$$x_5 = f_5(x_5^-) = f_5(w_{25}x_2 + w_{35}x_3),$$
$$x_4 = f_4(x_4^-) = f_4(w_{24}x_2 + w_{34}x_3),$$
$$x_3 = f_3(x_3^-) = f_3(w_{03}x_0 + w_{13}x_1),$$
$$x_2 = f_2(x_2^-) = f_2(w_{02}x_0 + w_{12}x_1).$$
Therefore J is a function of the weights w25, w35, w24, w34, w03, w13, w02,
w12. For gradient descent, we will need the derivative of J with respect to
these weights, as we did above for the perceptron. For example, w25 appears
above only in x₅⁻, so, by the chain rule,
$$\frac{\partial J}{\partial w_{25}} = \frac{\partial J}{\partial x_5^-}\cdot\frac{\partial x_5^-}{\partial w_{25}}.$$
Since x₅⁻ = w₂₅x₂ + w₃₅x₃,
$$\frac{\partial x_5^-}{\partial w_{25}} = x_2.$$
If we denote
$$\delta_5 = \frac{\partial J}{\partial x_5^-},$$
we obtain
$$\frac{\partial J}{\partial w_{25}} = x_2\cdot\delta_5.$$
The goal is to code this formula in general, see (7.2.11), to be used in gradient
descent.
These are the downstream derivative, local derivative, and upstream derivative
at node j. (The terminology reflects the fact that derivatives are computed
backward.)
[Figure: at node i, the upstream derivative ∂J/∂x_i, the local derivative f_i′, and the downstream derivative ∂J/∂x_i⁻.]
From (7.2.2),
$$\frac{\partial x_j}{\partial x_j^-} = f_j'(x_j^-). \qquad (7.2.8)$$
By the chain rule and (7.2.8), the key relation between these derivatives is
$$\frac{\partial J}{\partial x_i^-} = \frac{\partial J}{\partial x_i}\cdot f_i'(x_i^-), \qquad (7.2.9)$$
or
downstream = upstream × local.
def local(x,w,i):
    return der_dict[activate[i]](incoming(x,w,i))
[Figure: the downstream derivatives δ₂, δ₃, δ₄, δ₅, δ₆, δ₇ at the network's nodes.]
Let
$$\delta_i = \frac{\partial J}{\partial x_i^-}, \qquad i = 0, 1, \dots, d-1.$$
If i is an input node, δᵢ is None. Then we have the downstream gradient
vector δ = (δ₀, δ₁, . . . , δ_{d−1}). Strictly speaking, we should write δᵢ⁻ for the
downstream derivatives. However, in §7.4, we don't need upstream deriva-
tives. Because of this, we will write δᵢ.
$$\delta_6 = \frac{\partial J}{\partial x_6^-} = (x_6^- - y_6) = -0.294.$$
Similarly,
$$\delta_7 = \frac{\partial J}{\partial x_7^-} = (x_7^- - y_7) = -0.666.$$
The code for this is
def delta_J(xminus,y):
    delta = [None]*d
    for i in range(d):
        if is_output(i): delta[i] = xminus[i] - y[i]
    return delta
$$\frac{\partial J}{\partial x_i^-} = \sum_{i\to j} \frac{\partial J}{\partial x_j^-}\cdot\frac{\partial x_j^-}{\partial x_i}\cdot\frac{\partial x_i}{\partial x_i^-}
= \sum_{i\to j} \frac{\partial J}{\partial x_j^-}\cdot w_{ij}\cdot f_i'(x_i^-).$$
The code is
def downstream(x,delta,w,i):
    if delta[i] != None: return delta[i]
    else:
        upstream = sum([ downstream(x,delta,w,j) * w[i][j]
                         for j in range(d) if w[i][j] != None ])
        return upstream * local(x,w,i)

def backward_prop(x,delta,w):
    d = len(x)
    for i in range(d):
        if not is_input(i): delta[i] = downstream(x,delta,w,i)
    return delta

delta = delta_J(xminus,y)
delta = backward_prop(x,delta,w)
print(delta)
returns
$$\frac{\partial x_j^-}{\partial w_{ij}} = x_i,$$
We have shown
$$\frac{\partial J}{\partial w_{ij}} = x_i\cdot\delta_j. \qquad (7.2.11)$$
[Figure: a dense shallow layer with inputs x₀, x₁, x₂, x₃ feeding neurons f₀, f₁, f₂ that output z₀, z₁, z₂.]
Our convention is wij denotes the weight on the edge from node i to node
j. With this convention, the formulas (7.2.1), (7.2.2) reduce to the matrix
multiplication formulas
z − = W t x, z = f (W t x). (7.2.12)
Exercises
Exercise 7.2.2 Verify the propagation computations in this web page. This
neural network has four input nodes (two of which are biases), four neurons,
and two output nodes. The two output nodes are to the right of nodes o1, o2,
and not shown, you have to include them. Here W is 10 × 10. Their neth1 and
outh1 are the incoming and outgoing signals at node h1, and their δo1 is the
downstream derivative at node o1. Don’t update the weights, just compute
x and δ.
Exercise 7.2.3 Verify the propagation computations in this web page. This
neural network has five input nodes (three of which are biases), three neurons,
and one output node. The output node is to the right of node 5, and not
shown, you have to include it. Here W is 9 × 9. Their I3 is x− 3 , and their O3
is x3 . Their Errj is δj . At nodes 3, 4, 5, the activation function fj is σ, so
fj′ = σ(1 − σ). Don’t update the weights, just compute x and δ.
This goal is so general that any concrete insight one provides towards this
goal is widely useful in many settings. The setting we have in mind is f = J,
where J is the mean error from §7.1.
Usually f (w) is a measure of cost or lack of compatibility. Because of this,
f (w) is called the loss function or cost function.
A neural network is a black box with inputs x and outputs y, depending on
unknown weights w. To train the network is to select weights w in response
to training data (x, y). The optimal weights w∗ are selected as minimizers
of a loss function f (w) measuring the error between predicted outputs and
actual outputs, corresponding to given training inputs.
From §4.3, if the loss function f (w) is continuous and proper, there is
a global minimizer w∗ . If f (w) is in addition strictly convex, w∗ is unique
(§4.5). When this happens, if the gradient of the loss function is g = ∇f (w),
then w∗ is the unique point satisfying g ∗ = ∇f (w∗ ) = 0.
$$\frac{g(b) - g(a)}{b - a} \approx g'(a).$$
Inserting a = w and b = w⁺, and solving for w⁺,
$$w^+ \approx w - \frac{g(w)}{g'(w)}.$$
Since the global minimizer w* satisfies f′(w*) = 0, we insert g(w) = f′(w)
in the above approximation,
$$w^+ \approx w - \frac{f'(w)}{f''(w)}.$$
$$w_{n+1} = w_n - \frac{f'(w_n)}{f''(w_n)}, \qquad n = 1, 2, \dots$$
def newton(loss,grad,curv,w,num_iter):
    g = grad(w)
    c = curv(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= g/c
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        c = curv(w)
        if allclose(g,0): break
    return trajectory
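The loss, grad, and curv functions are not defined in this excerpt; for the double-well example f(w) = w⁴ − 6w² + 2w used below, they would be (an assumed sketch):

def loss(w): return w**4 - 6*w**2 + 2*w
def grad(w): return 4*w**3 - 12*w + 2
def curv(w): return 12*w**2 - 12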
u0 = -2.72204813
w0 = 2.45269774
num_iter = 20
trajectory = newton(loss,grad,curv,w0,num_iter)
def plot_descent(a,b,loss,curv,delta,trajectory):
    w = arange(a,b,delta)
    plot(w,loss(w),color='red',linewidth=1)
    plot(w,curv(w),"--",color='blue',linewidth=1)
    plot(*trajectory,color='green',linewidth=1)
    scatter(*trajectory,s=10)
    title("num_iter= " + str(len(trajectory.T)))
    grid()
    show()

ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
f (w) = f (w1 , w2 , . . . ).
In other words,
In practice, the learning rate is selected by trial and error. Which learning
rate does the theory recommend?
Given an initial point w0 , the sublevel set at w0 (see §4.5) consists of all
points w where f (w) ≤ f (w0 ). Only the part of the sublevel set that is
connected to w0 counts.
In Figure 7.15, the sublevel set at w0 is the interval [u0 , w0 ]. The sublevel
set at w1 is the interval [b, w1 ]. Notice we do not include any points to the
left of b in the sublevel set at w1 , because points to the left of b are separated
from w1 by the gap at the point b.
Suppose the second derivative D2 f (w) is never greater than a constant L
on the sublevel set. This means
Fig. 7.15 Double well cost function and sublevel sets at w0 and at w1 .
To see this, fix w and let S be the sublevel set {w′ : f (w′ ) ≤ f (w)}. Since
the gradient pushes f down, for t > 0 small, w+ stays in S. Insert x = w+
and a = w into the right half of (4.5.16) and simplify. This leads to
$$f(w^+) \le f(w) - t\,|\nabla f(w)|^2 + \frac{t^2 L}{2}\,|\nabla f(w)|^2.$$
Since tL ≤ 1 when 0 ≤ t ≤ 1/L, we have t²L ≤ t. This derives (7.3.3).
The curvature of the loss function and the learning rate are inversely pro-
portional. Where the curvature of the graph of f (w) is large, the learning
rate 1/L is small, and gradient descent proceeds in small time steps.
For example, let f(w) = w⁴ − 6w² + 2w (Figures 7.14, 7.15, 7.16). Then
f′(w) = 4w³ − 12w + 2 and f″(w) = 12w² − 12.
Thus the inflection points (where f″(w) = 0) are ±1 and, in Figure 7.15, the
critical points are a, b, c.
Let u0 and w0 be the points satisfying f (w) = 5 as in Figure 7.16.
Then u0 = −2.72204813 and w0 = 2.45269774, so f ′′ (u0 ) = 76.914552 and
f ′′ (w0 ) = 60.188. Thus we may choose L = 76.914552. With this L, the
short-step gradient descent starting at w0 is guaranteed to converge to one
of the three critical points. In fact, the sequence converges to the right-most
critical point c (Figure 7.16).
This exposes a flaw in basic gradient descent. Gradient descent may con-
verge to a local minimizer, and miss the global minimizer. In §7.9, modified
gradient descent will address some of these shortcomings.
def gd(loss,grad,w,learning_rate,num_iter):
    g = grad(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= learning_rate * g
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        if allclose(g,0): break
    return trajectory
u0 = -2.72204813
w0 = 2.45269774
L = 76.914552
learning_rate = 1/L
num_iter = 100
trajectory = gd(loss,grad,w0,learning_rate,num_iter)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
xin → xout .
Here the network inputs xin are the outgoing signals at the input nodes, and
the network outputs xout are the incoming signals at the output nodes.
Given inputs xin and target outputs y, we seek to modify the weight matrix
W so that the input-output map is
xin → y.
The weights are modified using gradient descent. If J measures the error
between the network outputs xout and the targets y, the weight matrix W is
updated to a new matrix W + using (7.3.1),
W + = W − t∇W J.
∂J/∂wij = xi δj ,
or
∇W J = x ⊗ δ. (7.4.1)
For the network in Figure 7.4, the weight gradients are as in Figure 7.17.
[Figure 7.17: the weight gradients x_i δ_j attached to the corresponding edges of the network of Figure 7.4.]
The code is
def update_weights(x,delta,w,learning_rate):
    d = len(x)
    for i in range(d):
        for j in range(d):
            if w[i][j]:
                w[i][j] = w[i][j] - learning_rate*x[i]*delta[j]
    return w   # return the updated weights (train_nn below assigns the result)
The triple
def train_nn(x,y,w,learning_rate,n_iter):
    trajectory = []
    # local copy
    wlocal = [ row[:] for row in w ]
    for _ in range(n_iter):
        xminus, x = forward_prop(x,wlocal)
        cost = J(xminus,y)
        if isclose(0,cost): break
        trajectory.append(cost)
        delta = delta_J(xminus,y)
        delta = backward_prop(x,delta,wlocal)
        wlocal = update_weights(x,delta,wlocal,learning_rate)
    return x, xminus, wlocal, trajectory
Here n_iter is the maximum number of iterations allowed, and the iterations
stop if the cost J is close to zero.
Let W be the weight matrix (7.2.4). Then
x = [None]*d
x[0] = 1.5
x[1] = 2.5
y = [None]*d
y[6] = 0.427
y[7] = -0.288
lr = .045
n_iter = 1000
x, xminus, w, trajectory = train_nn(x,y,w,lr,n_iter)
len(trajectory)
x₆⁻ = 0.42688039547094403, x₇⁻ = −0.28800519549406556,
grid()
legend()
show()
Fig. 7.18 Cost trajectory and number of iterations as learning rate varies.
J(W ∗ ) ≤ J(W ),
[Figure 7.19: linear regression as a network, with inputs x₁, . . . , x₄, outputs z = W^t x, and loss J = |z − y|²/2.]
For linear regression without bias, the loss function is (7.5.1) with
$$J(x, y, W) = \frac{1}{2}|y - z|^2, \qquad z = W^t x. \qquad (7.5.3)$$
Then (7.5.1) is the mean square error or mean square loss, and the problem
of minimizing (7.5.1) is linear regression (Figure 7.19).
$$= \mathrm{trace}\big(V^t(x\otimes(z + sv - y))\big).$$
$$\left.\frac{d^2}{ds^2}\right|_{s=0} J(x, y, W + sV) = |v|^2 = |V^t x|^2. \qquad (7.5.6)$$
$$\left.\frac{d^2}{ds^2}\right|_{s=0} J(W + sV) = 0.$$
$$V^t x_k = 0, \qquad k = 1, 2, \dots, N. \qquad (7.5.7)$$
Recall the feature space is the vector space of all inputs x, and (§2.9) a
dataset is full-rank if the span of the dataset is the entire feature space. When
this happens, (7.5.7) implies V = 0. By (4.5.15), J(W ) is strictly convex.
To check properness of J(W ), by definition (4.3.9), we show there is a
bound C with
$$J(W) \le c \implies \|W\| \le C\sqrt d. \qquad (7.5.8)$$
Here ∥W ∥ is the norm of the matrix W (2.2.13). The exact formula for the
bound C, which is not important for our purposes, depends on the level c
and the dataset.
If J(W) ≤ c, by (7.5.1), (7.5.3), and the triangle inequality,
$$|W^t x_k| \le \sqrt{2c} + |y_k|, \qquad k = 1, 2, \dots, N.$$
|W t x| ≤ C(x). (7.5.9)
For linear regression with bias, the loss function is (7.5.2) with
$$J(x, y, W, b) = \frac{1}{2}|y - z|^2, \qquad z = W^t x + b. \qquad (7.5.10)$$
Here W is the weight matrix and b is a bias vector.
If we augment the dataset x1 , x2 , . . . , xN to (x1 , 1), (x2 , 1), . . . , (xN , 1),
then this corresponds to the augmented weight matrix
$$\begin{pmatrix} W \\ b^t \end{pmatrix}.$$
Applying the last result to the augmented dataset and appealing to Exer-
cise 7.2.1, we obtain
These are simple, clear geometric criteria for convergence of gradient de-
scent to the global minimum of J, valid for linear regression with or without
bias inputs.
Exercises
max p = max(p1 , p2 , . . . , pd ).
Then the i-th class may be defined as the samples with targets satisfying
pi = max p. Alternatively, the i-th class may be defined as the samples with
targets satisfying pi > 0.
When classes are assigned to targets, they need not be disjoint. Because
of this, they are called soft classes. Summarizing, a soft-class dataset is a
dataset x1 , x2 , . . . , xN with targets p1 , p2 , . . . , pN consisting of probability
vectors.
We start with logistic regression without bias inputs. For logistic regres-
sion, the loss function is
$$J(W) = \sum_{k=1}^{N} J(x_k, p_k, W), \qquad (7.6.1)$$
Here I(p, q) is the relative information, and q = σ(y) is the softmax function,
squashing the network’s output y = W t x into the probability q. When q =
σ(y), I(p, q) measures the information error between the desired target p and
the computed target q.
When p is one-hot encoded, by (5.6.16),
Because of this, in the literature, in the one-hot encoded case, (7.6.1) is called
the cross-entropy loss.
[Figure 7.20: logistic regression as a network, with inputs x₁, . . . , x₄, outputs y = W^t x, probabilities q = σ(y), and loss J = I(p, q).]
J(W ) is logistic loss or logistic error, and the problem of minimizing (7.6.1)
is logistic regression (Figure 7.20).
Since we will be considering both strict and one-hot encoded probabilities,
we work with I(p, q) rather than Icross (p, q). Table 5.33 is a useful summary
of the various information and entropy concepts.
$$W\mathbf 1 = 0, \qquad (7.6.2)$$
or
$$\sum_{j=1}^{d} w_{ij} = 0, \qquad i = 1, 2, \dots, d.$$
and, by (5.6.10),
$$\left.\frac{d}{ds}\right|_{s=0} J(x, y, W + sV) = \left.\frac{d}{ds}\right|_{s=0} I(p, \sigma(y + sv))
= v\cdot(q - p) = (V^t x)\cdot(q - p) = \mathrm{trace}\big((V^t x)\otimes(q - p)\big) = \mathrm{trace}\big(V^t(x\otimes(q - p))\big).$$
As before, this result is a special case of (7.4.1). Since q and p are probability
vectors, p · 1 = 1 = q · 1, hence the gradient G is centered.
Recall (§5.6) we have strict convexity of Z(y) along centered vectors y,
those vectors satisfying y · 1 = 0. Since y = W t x, y · 1 = x · W 1. Hence, to
force y · 1 = 0, it is natural to assume W is centered.
If we initiate gradient descent with a centered weight matrix W , since the
gradient G is also centered, all successive weight matrices will be centered.
To see this, given a vector v and probability vector q, set v̄ = Σ_{j=1}^d v_j q_j.
Then
$$\sum_{j=1}^{d} v_j^2 q_j - \left(\sum_{j=1}^{d} v_j q_j\right)^{2} = \sum_{j=1}^{d} (v_j - \bar v)^2 q_j.$$
vanishes, then, since the summands are nonnegative, (7.6.6) vanishes, for
every sample x = xk , p = pk , hence
V t xk = 0, k = 1, 2, . . . , N.
The convex hull is discussed in §4.5, see Figures 4.26 and 4.27. If Ki were
just the samples x whose corresponding targets p satisfy pi > 0 (with no
convex hull), then the intersection Ki ∩ Kj may be empty.
For example, if p were one-hot encoded, then x belongs to at most one Ki .
Thus taking the convex hull in the definition of Ki is crucial. This is clearly
seen in Figure 7.32: The samples never intersect, but the convex hulls may
do so.
To establish properness of J(W ), by definition (4.3.9), we show
for some C. The exact formula for the bound C, which is not important for
our purposes, depends on the level c and the dataset.
Suppose J(W ) ≤ c, with W 1 = 0 and let q = σ(y). Then I(p, q) =
J(x, p, W ) ≤ c for every sample x and corresponding target p.
Let x be a sample, let y = W t x, and suppose the corresponding target p
satisfies pi ≥ ϵ, for some class i, and some ϵ > 0. If j ̸= i, then
d
X
ϵ(yj − yi ) ≤ ϵ(Z(y) − yi ) ≤ pi (Z(y) − yi ) ≤ pk (Z(y) − yk ) = Z(y) − p · y.
k=1
By (5.6.15),
Z(y) − p · y = I(p, σ(y)) − I(p) ≤ c + log d.
Combining the last two inequalities,
ϵ(yj − yi ) ≤ c + log d.
Let x be any vector in feature space, and let y = W t x. Since span(Ki ∩Kj )
is full-rank, x is a linear combination of vectors in Ki ∩ Kj , for every i and j.
This implies, by (7.6.8), there is a bound C(x), depending on x but not on
W , such that
$$d\,|y_i| = |(d-1)y_i + y_i| = \Big|\sum_{j\ne i}(y_i - y_j)\Big| \le (d-1)\,C(x).$$
$$|w_{ji}| = |e_j\cdot W e_i| \le C, \qquad i, j = 1, 2, \dots, d.$$
By (2.2.13),
$$\|W\|^2 = \sum_{i,j}|w_{ij}|^2 \le d^2 C^2.$$
with
J(x, p, W, b) = I(p, q), q = σ(y), y = W t x + b.
Here W is the weight matrix and b is the bias vector. In keeping with our
prior convention, we call the weight (W, b) centered if W is centered and b is
centered. Then y is centered.
If the columns of W are (w1 , w2 , . . . , wd ), and b = (b1 , b2 , . . . , bd ), then
y = W t x + b is equivalent to levels corresponding to d hyperplanes (§4.5)
$$y_1 = w_1\cdot x + b_1,\quad y_2 = w_2\cdot x + b_2,\quad \dots,\quad y_d = w_d\cdot x + b_d. \qquad (7.6.11)$$
$$\begin{cases} y_i \ge 0, & \text{for } x \text{ in class } i,\\ y_i \le 0, & \text{for } x \text{ in class } j,\end{cases}
\qquad \text{for every } i = 1, 2, \dots, d \text{ and every } j \ne i. \qquad (7.6.12)$$
$$\begin{cases} y_i \ge 0, & \text{for } x \text{ in class } i,\\ y_i \le 0, & \text{for } x \text{ in class } j,\end{cases}
\qquad \text{for some } i = 1, 2, \dots, d \text{ and some } j \ne i. \qquad (7.6.13)$$
As special cases, there are corresponding results for strict targets and one-
hot encoded targets.
To begin the proof, suppose (W, b) satisfies (7.6.12). Then (Exercise 7.6.4)
$$\begin{cases} y_i \ge 0, & \text{for } x \text{ in } K_i,\\ y_j \le 0, & \text{for } x \text{ in } K_i \text{ and every } j \ne i,\end{cases}
\qquad \text{for every } i = 1, 2, \dots, d. \qquad (7.6.14)$$
From this, one obtains I(p, σ(y)) ≤ log d for every sample x and q = σ(y)
(Exercise 7.6.5). Since this implies J(W, b) ≤ N log d, the loss function is not
proper, hence not trainable.
$$r\epsilon\,|w_i - w_j| \le c + \log d.$$
Let
$$y_i = w_i\cdot x^*_{ij} + b_i, \qquad y_j = w_j\cdot x^*_{ij} + b_j.$$
Since x*ᵢⱼ is in Kᵢ ∩ Kⱼ, by (7.6.8),
Hence
Since W is centered,
$$d\,w_i = (d-1)w_i + w_i = (d-1)w_i - \sum_{j\ne i} w_j = \sum_{j\ne i}(w_i - w_j).$$
Hence
$$|w_i| + |b_i| \le \frac{1}{d}\sum_{j\ne i}\big(|w_i - w_j| + |b_i - b_j|\big).$$
A very special case is a two-class dataset. In this case, the result is com-
pelling:
We end the section by comparing the three regressions: linear, strict logis-
tic, and one-hot encoded logistic.
In classification problems, it is one-hot encoded logistic regression that is
relevant. Because of this, in the literature, logistic regression often defaults
to the one-hot encoded case.
In linear regression, not only do J(W ) and J(W, b) have minima, but so
does J(z, y). Properness ultimately depends on properness of a quadratic |z|2 .
In strict logistic regression, by (7.6.3), the critical point equation
∇y J(y, p) = 0
can always be solved, so there is at least one minimum for each J(y, p). Here
properness ultimately depends on properness of Z(y).
Exercises
$$J(w, w_0) = \sum_{k=1}^{N}(y_k - w\cdot x_k - w_0)^2$$
0 = w0 + w · x = w0 + w1 x1 + w2 x2 + · · · + wd xd .
Linear Regression
We work out the regression equation in the plane, when both features x
and y are scalar. In this case, w = (m, b) and
$$X = \begin{pmatrix} x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_N & 1\end{pmatrix}, \qquad
Y = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N\end{pmatrix}.$$
Then (x̄, ȳ) is the mean of the dataset. Also, let x and y denote the vectors
(x₁, x₂, . . . , x_N) and (y₁, y₂, . . . , y_N), and let, as in §1.5,
$$\mathrm{cov}(x, y) = \frac{1}{N}\sum_{k=1}^{N}(x_k - \bar x)(y_k - \bar y) = \frac{1}{N}\,x\cdot y - \bar x\bar y.$$
\[
\begin{aligned}
(x\cdot x)\,m + \bar x\, b &= x\cdot y,\\
m\,\bar x + b &= \bar y.
\end{aligned}
\]
The second equation says the regression line passes through the mean (x̄, ȳ).
Multiplying the second equation by x̄ and subtracting the result from the
first equation cancels b and leads to the slope formula below. This derives

The regression line in two dimensions passes through the mean (x̄, ȳ)
and has slope
\[
m = \frac{\operatorname{cov}(x, y)}{\operatorname{cov}(x, x)}.
\]
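As a quick check of this formula, the slope and intercept can be computed directly from the covariances and means; the tiny dataset below is made up for illustration.

from numpy import array, mean

# made-up illustrative data
x = array([1.0, 2.0, 3.0, 4.0, 5.0])
y = array([2.1, 3.9, 6.2, 8.1, 9.8])

def cov(u, v):
    # cov(u, v) = mean(u*v) - mean(u)*mean(v), as in Section 1.5
    return mean(u*v) - mean(u)*mean(v)

m = cov(x, y) / cov(x, x)       # slope of the regression line
b = mean(y) - m*mean(x)         # the line passes through the mean
print(m, b)

The same slope and intercept are returned by numpy.polyfit(x, y, 1), which fits a degree-one polynomial by least squares.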
from numpy import mean, sqrt
from pandas import read_csv

df = read_csv("longley.csv")
X = df["Population"].to_numpy()
Y = df["Employed"].to_numpy()

# center the data
X = X - mean(X)
Y = Y - mean(Y)

# scale to unit variance
varx = sum(X**2)/len(X)
vary = sum(Y**2)/len(Y)
X = X/sqrt(varx)
Y = Y/sqrt(vary)
After this, we compute the optimal weight w∗ and construct the polyno-
mial. The regression equation is solved using the pseudo-inverse (§2.3).
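Here is one possible sketch of this step, assuming the centered, scaled arrays X, Y from the snippet above; the helper names fit_poly and poly below are illustrative and may differ from the book's own definitions.

from numpy import column_stack
from numpy.linalg import pinv

def fit_poly(X, Y, d):
    # design matrix with columns 1, x, x^2, ..., x^(d-1); w* = pinv(A) Y
    A = column_stack([X**k for k in range(d)])
    return pinv(A) @ Y

def poly(x, d):
    # evaluate the fitted polynomial of degree d-1 at the points x
    w = fit_poly(X, Y, d)
    return sum([ w[k]*x**k for k in range(d) ])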
figure(figsize=(12,12))

# six subplots
rows, cols = 3, 2

# x interval
x = arange(xmin, xmax, .01)

for i in range(6):
    d = 3 + 2*i   # degree = d-1
    subplot(rows, cols, i+1)
    plot(X, Y, "o", markersize=2)
    plot([0], [0], marker="o", color="red", markersize=4)
    plot(x, poly(x, d), color="blue", linewidth=.5)
    xlabel("degree = %s" % str(d-1))
    grid()

show()
Running this code with degree 1 returns Figure 7.22. Taking too high a
power can lead to overfitting, for example for degree 12.
x p x p x p x p x p
0.5 0 .75 0 1.0 0 1.25 0 1.5 0
1.75 0 1.75 1 2.0 0 2.25 1 2.5 0
2.75 1 3.0 0 3.25 1 3.5 0 4.0 1
4.25 1 4.5 1 4.75 1 5.0 1 5.5 1
More generally, we may only know the amount of study time x, and the
probability p that the student passed, where now 0 ≤ p ≤ 1.
For example, the data may be as in Figure 7.24, where pk equals 1 or 0
according to whether they passed or not.
As stated, the samples of this dataset are scalars, and the dataset is one-
dimensional (Figure 7.25).
Plotting the dataset on the (x, p) plane, the goal is to fit a curve
p = σ(m∗ x + b∗ ) (7.7.4)
as in Figure 7.26.
Since this is logistic regression with bias, we can apply the two-class result
from the previous section: The dataset is one-dimensional, so a hyperplane is
just a point, a threshold. Neither class lies in a hyperplane, and the dataset is
not separable (Figure 7.25). Hence logistic regression with bias is trainable,
and gradient descent is guaranteed to converge to an optimal weight (m∗ , b∗ ).
Here is the descent code.
from numpy import *
from numpy.linalg import norm
from scipy.special import expit

X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75,
     3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

def gradient(m, b):
    # gradient of the logistic loss over the dataset
    return sum([ (expit(m*x + b) - p) * array([x, 1]) for x, p in zip(X, P) ], axis=0)

# gradient descent
w = array([0.0, 0.0])   # starting m, b
g = gradient(*w)
t = .01                 # learning rate
# iterate the descent step until the gradient is negligible
while norm(g) > 1e-6:
    w = w - t*g
    g = gradient(*w)
print(w)
m∗ = 1.49991537, b∗ = −4.06373862.
Even though we are done, we take the long way and apply logistic regression
without bias, by incorporating the bias into the dataset as an extra feature,
to better understand how things work.
To this end, we incorporate the bias and write the augmented dataset,
resulting in Figure 7.27. Since these vectors are not parallel, the dataset is
full-rank in R², hence J(m, b) is strictly convex.
[Figure 7.27: the augmented dataset (x, 1) in the plane.]
Let σ(z) be the sigmoid function (5.2.15). Then, as in the previous section,
the goal is to minimize the loss function
\[
J(m, b) = \sum_{k=1}^N I(p_k, q_k), \qquad q_k = \sigma(m x_k + b),
\tag{7.7.5}
\]
Once we have the minimizer (m∗ , b∗ ), we have the best-fit curve (7.7.4).
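For instance, with the dataset X, P from the descent code above, the loss (7.7.5) can be evaluated directly; here I(p, q) is taken to be the two-class cross entropy −p log q − (1 − p) log(1 − q) (the book's I may differ by the constant entropy of p).

from numpy import log
from scipy.special import expit

def I(p, q):
    # two-class cross entropy, handling the pure 0/1 targets separately
    if p == 0: return -log(1 - q)
    if p == 1: return -log(q)
    return -p*log(q) - (1 - p)*log(1 - q)

def J(m, b):
    return sum([ I(p, expit(m*x + b)) for x, p in zip(X, P) ])

print(J(0, 0), J(1.49991537, -4.06373862))   # the loss is smaller at the optimal weight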
If the targets p are one-hot encoded, the dataset is as follows.
x p x p x p x p x p
0.5 (1,0) .75 (1,0) 1.0 (1,0) 1.25 (1,0) 1.5 (1,0)
1.75 (1,0) 1.75 (0,1) 2.0 (1,0) 2.25 (0,1) 2.5 (1,0)
2.75 (0,1) 3.0 (1,0) 3.25 (0,1) 3.5 (1,0) 4.0 (0,1)
4.25 (0,1) 4.5 (0,1) 4.75 (0,1) 5.0 (0,1) 5.5 (0,1)
[Figure: computational graphs for the logistic loss, with inputs x, m, b, intermediate values y = mx + b and q = σ(y), and output J = I(p, q).]
Figure 7.26 is a plot of x against p. However, the dataset, with the bias
input included, has two inputs x, 1 and one output p, and should be plotted
in three dimensions (x, 1, p). Then (Figure 7.31) samples lie on the line (x, 1)
in the horizontal plane, and p is on the vertical axis.
The horizontal plane in Figure 7.31, which is the plane in Figure 7.27, is
feature space. The convex hulls K0 and K1 are in feature space, so the convex
hull K0 of the samples corresponding to p = 0 is the line segment joining
(.5, 1, 0) and (3.5, 1, 0), and the convex hull K1 of the samples corresponding
to p = 1 is the line segment joining (1.75, 1, 0) and (5.5, 1, 0). In Figure 7.31,
K0 is the line segment joining the green points, and K1 is the projection onto
feature space of the line segment joining the red points. Since K0 ∩ K1 is the
line segment joining (1.75, 1, 0) and (3.5, 1, 0), the span of K0 ∩ K1 is all of
feature space. By the results of the previous section, J(w) is proper.
The Iris dataset consists of 150 samples divided into three groups, leading
to three convex hulls K0, K1, K2 in R⁴. If the dataset is projected onto the
top two principal components, then the projections of these three hulls do
not pair-intersect (Figure 7.32). It follows that we have no guarantee that the
logistic loss is proper.
On the other hand, the MNIST dataset consists of 60,000 samples divided
into ten groups. If the MNIST dataset is projected onto the top two principal
components, the projections of the ten convex hulls K0 , K1 , . . . , K9 onto R2 ,
do intersect (Figure 7.33).
This does not guarantee that the ten convex hulls K0 , K1 , . . . , K9 in R784
intersect, but at least this is so for the 2d projection of the MNIST dataset.
Therefore the logistic loss of the 2d projection of the MNIST dataset is proper.
In this section, we work with loss functions that are smooth and strictly
convex. While this is not always the case, this assumption is a base case
against which we can test different optimization or training models.
By smooth and strictly convex, we mean there are positive constants m
and L satisfying
\[
m\,I \preceq D^2f(w) \preceq L\,I.
\tag{7.8.1}
\]
Recall this means the eigenvalues of the symmetric matrix D²f(w) are between
m and L. In this situation, the condition number r = m/L is between
zero and one: 0 < r ≤ 1.
In the previous section, we saw that basic gradient descent converged to
a critical point. If f (x) is strictly convex, there is exactly one critical point,
the global minimum. From this we have
\[
\frac{m}{2}\,|w - w^*|^2 \le f(w) - f(w^*) \le \frac{L}{2}\,|w - w^*|^2.
\tag{7.8.3}
\]
How far we are from our goal w∗ can be measured by the error E(w) =
|w − w∗ |2 . Another measure of error is E(w) = f (w) − f (w∗ ). The goal is to
drive the error between w and w∗ to zero.
When f (w) is smooth and strictly convex in the sense of (7.8.1), the es-
timate (7.8.3) shows these two error measures are equivalent. We use both
measures below.
Gradient Descent I
Let r = m/L and set E(w) = f (w)−f (w∗ ). Then the descent sequence
w0 , w1 , w2 , . . . given by (7.3.1) with learning rate
\[
t = \frac{1}{L}
\]
converges to w^* at the rate
\[
E(w_n) \le (1 - r)^n\, E(w_0), \qquad n = 1, 2, \dots
\tag{7.8.5}
\]
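Here is a small numerical check of GD-I on a quadratic loss f(w) = ½ w·Qw − b·w; the matrix Q, the vector b, and the starting point below are made up for illustration, and L, m are the extreme eigenvalues of Q.

from numpy import array, dot
from numpy.linalg import eigvalsh, solve

Q = array([[3.0, 1.0], [1.0, 2.0]])
b = array([1.0, 0.0])

def f(w): return 0.5*dot(w, Q @ w) - dot(b, w)
def grad(w): return Q @ w - b

lam = eigvalsh(Q)              # eigenvalues in increasing order
m, L = lam[0], lam[-1]
r = m/L
wstar = solve(Q, b)            # the unique minimizer

w = array([5.0, -3.0])
t = 1/L
E0 = f(w) - f(wstar)
for n in range(1, 21):
    w = w - t*grad(w)                            # basic descent step
    assert f(w) - f(wstar) <= (1 - r)**n * E0    # the rate (7.8.5)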
\[
g\cdot(w - w^*) \ge \frac{mL}{m+L}\,|w - w^*|^2 + \frac{1}{m+L}\,|g|^2.
\]
Using this and (7.3.1) with t = 2/(m + L), this implies the following.
Gradient Descent II
GD-II improves GD-I in two ways: Since m < L, the learning rate is larger,
\[
\frac{2}{m+L} > \frac{1}{L},
\]
and the convergence rate is smaller,
\[
\left(\frac{1-r}{1+r}\right)^2 < 1 - r,
\]
Let g be the gradient of the loss function at a point w. Then the line
passing through w in the direction of g is w − tg. When the loss function is
quadratic (4.3.8), f (w − tg) is a quadratic function of the scalar variable t.
In this case, the minimizer t along the line w − tg is explicitly computable as
\[
t = \frac{g\cdot g}{g\cdot Qg}.
\]
This leads to gradient descent with varying time steps t0 , t1 , t2 , . . . . As a
consequence, one can show the error is lowered as follows,
\[
E(w^+) = \left(1 - \frac{1}{(u\cdot Qu)(u\cdot Q^{-1}u)}\right) E(w), \qquad u = \frac{g}{|g|}.
\]
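A sketch of this exact line search on a quadratic (Q, b, and the starting point are made-up illustrations):

from numpy import array, dot
from numpy.linalg import solve

Q = array([[3.0, 1.0], [1.0, 2.0]])
b = array([1.0, 0.0])
wstar = solve(Q, b)

w = array([5.0, -3.0])
for n in range(20):
    g = Q @ w - b                      # gradient of the quadratic loss at w
    if dot(g, g) < 1e-20: break
    t = dot(g, g) / dot(g, Q @ g)      # minimizer of f(w - t g) over t
    w = w - t*g

print(w, wstar)                        # w is now close to the minimizer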
w◦ = w + s(w − w− ). (7.9.1)
Here s is the decay rate. The momentum term reflects the direction induced by
the previous step. Because this mimics the behavior of a ball rolling downhill,
gradient descent with momentum is also called heavy ball descent.
Here we have two hyperparameters, the learning rate and the decay rate.
\[
w_n = w^* + \rho^n v, \qquad Qv = \lambda v.
\tag{7.9.5}
\]
Inserting this into (7.9.3) and using Qw∗ = b leads to the quadratic equation
ρ2 = (1 − tλ + s)ρ − s.
\[
4s - (1 - \lambda t + s)^2 \ge \frac{(L - \lambda)(\lambda - m)}{mL}\,(1 - s)^2.
\tag{7.9.8}
\]
When (7.9.6) holds, the roots are conjugate complex numbers ρ, ρ̄, where
\[
\rho = x + iy = \frac{(1 - \lambda t + s) + i\sqrt{4s - (1 - \lambda t + s)^2}}{2}.
\tag{7.9.9}
\]
It follows the absolute value of ρ equals
\[
|\rho| = \sqrt{x^2 + y^2} = \sqrt{s}.
\]
To obtain the fastest convergence, we choose s and t to minimize |ρ| = √s,
while still satisfying (7.9.7). This forces (7.9.7) to be an equality,
\[
\frac{(1 - \sqrt{s})^2}{m} = t = \frac{(1 + \sqrt{s})^2}{L}.
\]
These are two equations in two unknowns s, t. Solving, we obtain
\[
\sqrt{s} = \frac{1 - \sqrt{r}}{1 + \sqrt{r}}, \qquad t = \frac{1}{L}\cdot\frac{4}{(1 + \sqrt{r})^2}.
\]
Let w̃n = wn −w∗ . Since Qwn −b = Qw̃n , (7.9.3) is a 2-step linear recursion
in the variables w̃n . Therefore the general solution depends on two constants
A, B.
Let λ1 , λ2 , . . . , λd be the eigenvalues of Q and let v1 , v2 , . . . , vd be the
corresponding orthonormal basis of eigenvectors.
Since (7.9.3) is a 2-step vector linear recursion, A and B are vectors, and
the general solution depends on 2d constants Ak , Bk , k = 1, 2, . . . , d.
If ρk , k = 1, 2, . . . , d, are the corresponding roots (7.9.9), then (7.9.5) is
a solution of (7.9.3) for each of 2d roots ρ = ρk , ρ = ρ̄k , k = 1, 2, . . . , d.
Therefore the linear combination
\[
w_n = w^* + \sum_{k=1}^d \big(A_k\,\rho_k^n + B_k\,\bar\rho_k^{\,n}\big)\, v_k, \qquad n = 0, 1, 2, \dots
\tag{7.9.10}
\]
\[
\begin{aligned}
A_k + B_k &= (w_0 - w^*)\cdot v_k,\\
A_k\,\rho_k + B_k\,\bar\rho_k &= (w_1 - w^*)\cdot v_k = (1 - t\lambda_k)(w_0 - w^*)\cdot v_k,
\end{aligned}
\]
Let
\[
C = \max_{\lambda}\ \frac{(L - m)^2}{(L - \lambda)(\lambda - m)}.
\tag{7.9.11}
\]
Using (7.9.8), one verifies the estimate
Suppose the loss function f (w) is quadratic (7.8.2), let r = m/L, and
set E(w) = |w − w∗ |2 . Let C be given by (7.9.11). Then the descent
sequence w0, w1, w2, ... given by (7.9.2) with learning rate and decay rate
\[
t = \frac{1}{L}\cdot\frac{4}{(1 + \sqrt{r})^2}, \qquad s = \left(\frac{1 - \sqrt{r}}{1 + \sqrt{r}}\right)^2,
\]
converges to w^* at the rate
\[
E(w_n) \le 4C\left(\frac{1 - \sqrt{r}}{1 + \sqrt{r}}\right)^{2n} E(w_0), \qquad n = 1, 2, \dots
\tag{7.9.12}
\]
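A sketch of heavy ball descent on a made-up quadratic, using the learning and decay rates from the statement above:

from numpy import array, sqrt
from numpy.linalg import eigvalsh, norm, solve

Q = array([[3.0, 1.0], [1.0, 2.0]])
b = array([1.0, 0.0])
lam = eigvalsh(Q)
m, L = lam[0], lam[-1]
r = m/L

t = (1/L) * 4/(1 + sqrt(r))**2            # learning rate
s = ((1 - sqrt(r))/(1 + sqrt(r)))**2      # decay rate

wstar = solve(Q, b)
w_prev = w = array([5.0, -3.0])
for n in range(100):
    # heavy ball step: gradient at w plus momentum s*(w - w_prev)
    w, w_prev = w - t*(Q @ w - b) + s*(w - w_prev), w

print(norm(w - wstar))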
For general smooth strictly convex losses, however, there are examples of
f(w) where heavy ball descent does not converge to w^*. Nevertheless, this
method is widely used.
\[
\begin{aligned}
w^\circ &= w + s(w - w^-),\\
w^+ &= w^\circ - t\,\nabla f(w^\circ).
\end{aligned}
\tag{7.9.13}
\]
we will show
V (w+ ) ≤ ρV (w). (7.9.15)
In fact, we see below (7.9.22), (7.9.23) that V is reduced by an additional
quantity proportional to the momentum term.
The choice t = 1/L is a natural choice from basic gradient descent (7.3.3).
The derivation of (7.9.15) below forces the choices for s and ρ.
Given a point w, while w+ is well-defined by (7.9.13), it is not clear what
w− means. There are two ways to insert meaning here. Either evaluate V (w)
along a sequence w0 , w1 , w2 , . . . and set, as before, wn− = wn−1 , or work
464 CHAPTER 7. MACHINE LEARNING
\[
V(w_0) = f(w_0) + \frac{L}{2}\,|w_0 - \rho w_0|^2 = f(w_0) + \frac{m}{2}\,|w_0|^2 \le 2 f(w_0).
\]
Moreover f (w) ≤ V (w). Iterating (7.9.15), we obtain
This derives
Let r = m/L and set E(w) = f (w) − f (w∗ ). Then the sequence w0 ,
w1 , w2 , . . . given by (7.9.13) with learning rate and decay rate
\[
t = \frac{1}{L}, \qquad s = \frac{1 - \sqrt{r}}{1 + \sqrt{r}}
\]
While the convergence rate for accelerated descent is slightly worse than
heavy ball descent, the value of accelerated descent is its validity for all convex
functions satisfying (7.8.1), and the fact, also due to Nesterov [23], that this
convergence rate is best-possible among all such functions.
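A sketch of the accelerated update (7.9.13) with the learning and decay rates above, again on a made-up quadratic:

from numpy import array, sqrt
from numpy.linalg import eigvalsh, norm, solve

Q = array([[3.0, 1.0], [1.0, 2.0]])
b = array([1.0, 0.0])
lam = eigvalsh(Q)
m, L = lam[0], lam[-1]
r = m/L
t, s = 1/L, (1 - sqrt(r))/(1 + sqrt(r))

wstar = solve(Q, b)
w_prev = w = array([5.0, -3.0])
for n in range(100):
    w_circ = w + s*(w - w_prev)                # momentum (look-ahead) point
    w, w_prev = w_circ - t*(Q @ w_circ - b), w

print(norm(w - wstar))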
Now we derive (7.9.15). Assume (w+ )− = w and w∗ = 0, f (w∗ ) = 0. We
know w◦ = (1 + s)w − sw− and w+ = w◦ − tg ◦ , where g ◦ = ∇f (w◦ ).
By the basic descent step (7.3.1) with w◦ replacing w, (7.3.3) implies
\[
f(w^+) \le f(w^\circ) - \frac{t}{2}\,|g^\circ|^2.
\tag{7.9.17}
\]
Here we used t = 1/L.
By (4.5.16) with x = w and a = w◦ ,
\[
f(w^\circ) \le f(w) - g^\circ\cdot(w - w^\circ) - \frac{m}{2}\,|w - w^\circ|^2.
\tag{7.9.18}
\]
By (4.5.16) with x = w∗ = 0 and a = w◦ ,
\[
f(w^\circ) \le g^\circ\cdot w^\circ - \frac{m}{2}\,|w^\circ|^2.
\tag{7.9.19}
\]
Multiply (7.9.18) by ρ and (7.9.19) by 1 − ρ and add, then insert the sum
into (7.9.17). After some simplification, this yields
\[
f(w^+) \le \rho f(w) + g^\circ\cdot(w^\circ - \rho w) - \frac{r}{2t}\big(\rho\,|w - w^\circ|^2 + (1 - \rho)\,|w^\circ|^2\big) - \frac{t}{2}\,|g^\circ|^2.
\tag{7.9.20}
\]
Since
(w◦ − ρw) − tg ◦ = w+ − ρw,
we have
\[
\frac{1}{2t}\,|w^+ - \rho w|^2 = \frac{1}{2t}\,|w^\circ - \rho w|^2 - g^\circ\cdot(w^\circ - \rho w) + \frac{t}{2}\,|g^\circ|^2.
\]
Adding this to (7.9.20) leads to
\[
V(w^+) \le \rho f(w) - \frac{r}{2t}\big(\rho\,|w - w^\circ|^2 + (1 - \rho)\,|w^\circ|^2\big) + \frac{1}{2t}\,|w^\circ - \rho w|^2.
\tag{7.9.21}
\]
Let
\[
R(a, b) = r\,\rho s^2|b|^2 + (1 - \rho)\,|a + sb|^2 - |(1 - \rho)a + sb|^2 + \rho\,|(1 - \rho)a + \rho b|^2,
\]
which is positive.
Chapter A
Appendices
Some of the material here is first seen in high school. Because repeating the
exposure leads to a deeper understanding, we review it in a manner useful to
us here.
We start with basic counting, and show how the factorial function leads
directly to the exponential. Given its convexity and its importance for entropy
(§5.2), the exponential is treated carefully (§A.3).
The other use of counting is in graph theory (§3.3), which lays the ground-
work for neural networks (§7.2).
Suppose we have three balls in a bag, colored red, green, and blue. Suppose
they are pulled out of the bag and arranged in a line. We then obtain six
possibilities, listed in Figure A.1.
Why are there six possibilities? Because there are three ways of choosing
the first ball, then two ways of choosing the second ball, then one way of
choosing the third ball, so the total number of ways is
6 = 3 × 2 × 1.
n! = n × (n − 1) × (n − 2) × · · · × 2 × 1.
Notice also
(n + 1)! = (n + 1) × n × (n − 1) × · · · × 2 × 1 = (n + 1) × n!,
Permutations of n Objects
We also have
1! = 1, 0! = 1.
It’s clear that 1! = 1. It’s less clear that 0! = 1, but it’s reasonable if you
think about it: The number of ways of selecting from zero balls results in
only one possibility — no balls. The code for n! is
from scipy.special import factorial
factorial(n, exact=True)
More generally, we can consider the selection of k balls from a bag contain-
ing n distinct balls. There are two varieties of selections that can be made:
Ordered selections and unordered selections. An ordered selection is a permutation.
In particular, when k = n, the number of ordered selections of n objects
from n objects is n!, the number of ways of permuting n objects.
def perm_tuples(a,b,k):
    if k==1: return [ (i,) for i in range(a,b+1) ]
    else:
        list1 = [ (i,*p) for i in range(a,b) for p in perm_tuples(i+1,b,k-1) ]
        list2 = [ (*p,i) for i in range(a,b) for p in perm_tuples(i+1,b,k-1) ]
        return list1 + list2
perm_tuples(1,5,2)
[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5),
 (2, 1), (3, 1), (4, 1), (5, 1), (3, 2), (4, 2), (5, 2), (4, 3), (5, 3), (5, 4)]
from scipy.special import perm

n, k = 5, 2
perm(n, k)
perm(n, k, exact=True) == len(perm_tuples(1, n, k))
Notice P (x, k) is defined for any real number x by the same formula,
def comb_tuples(a,b,k):
    if k==1: return [ (i,) for i in range(a,b+1) ]
    else: return [ (i, *p) for i in range(a,b) for p in comb_tuples(i+1,b,k-1) ]
comb_tuples(1,5,2)
[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
from scipy.special import comb

n, k = 5, 2
comb(n, k)
comb(n, k, exact=True) == len(comb_tuples(1, n, k))
\[
C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{(n - k)!\,k!}.
\]
Since n! is the product of the factors
\[
1, 2, 3, \dots, n - 1, n,
\]
each at most n, we have n! < n^n.
However, because half of the factors are less than n/2, we expect an approximation
smaller than n^n, maybe something like (n/2)^n or (n/3)^n.
To be systematic about it, assume
\[
n! \ \text{is approximately equal to}\ e\left(\frac{n}{e}\right)^n \ \text{for } n \text{ large},
\tag{A.1.1}
\]
for some constant e. We seek the best constant e that fits here. In this
approximation, we multiply by e so that (A.1.1) is an equality when n = 1.
Using the binomial theorem, in §A.3 we show
\[
3\left(\frac{n}{3}\right)^n \le n! \le 2\left(\frac{n}{2}\right)^n, \qquad n \ge 1.
\tag{A.1.2}
\]
Based on this, a constant e satisfying (A.1.1) must lie between 2 and 3,
2 ≤ e ≤ 3.
To figure out the best constant e to pick, we see how much both sides
of (A.1.1) increase when we replace n by n + 1. Write (A.1.1) with n + 1
replacing n, obtaining
\[
(n + 1)! \ \text{is approximately equal to}\ e\left(\frac{n+1}{e}\right)^{n+1} \ \text{for } n \text{ large}.
\tag{A.1.3}
\]
Dividing the left sides of (A.1.1), (A.1.3) yields
\[
\frac{(n + 1)!}{n!} = n + 1.
\]
Dividing the right sides yields
\[
\frac{e\big((n + 1)/e\big)^{n+1}}{e\,(n/e)^n} = (n + 1)\cdot\frac{1}{e}\cdot\left(1 + \frac{1}{n}\right)^n.
\tag{A.1.4}
\]
Exercises
(First break the sum into two sums, then write out the first few terms of each
sum separately, and notice all terms but one cancel.)
Similarly,
Thus
\[
\begin{aligned}
(a + x)^2 &= a^2 + 2ax + x^2\\
(a + x)^3 &= a^3 + 3a^2x + 3ax^2 + x^3\\
(a + x)^4 &= a^4 + 4a^3x + 6a^2x^2 + 4ax^3 + x^4\\
(a + x)^5 &= \star a^5 + \star a^4x + \star a^3x^2 + \star a^2x^3 + \star ax^4 + \star x^5.
\end{aligned}
\tag{A.2.4}
\]
and
\[
\binom{3}{0} = 1, \quad \binom{3}{1} = 3, \quad \binom{3}{2} = 3, \quad \binom{3}{3} = 1,
\]
and
\[
\binom{4}{0} = 1, \quad \binom{4}{1} = 4, \quad \binom{4}{2} = 6, \quad \binom{4}{3} = 4, \quad \binom{4}{4} = 1,
\]
and
\[
\binom{5}{0} = \star, \quad \binom{5}{1} = \star, \quad \binom{5}{2} = \star, \quad \binom{5}{3} = \star, \quad \binom{5}{4} = \star, \quad \binom{5}{5} = \star.
\]
is the coefficient of a^{n−k} x^k when you multiply out (a + x)^n. This is the binomial
coefficient. Here n is the degree of the binomial, and k, which specifies
the term in the resulting sum, varies from 0 to n (not 1 to n).
It is important to remember that, in this notation, the binomial (a + x)^2
expands into the sum of three terms a^2, 2ax, x^2. These are term 0, term 1,
and term 2. Alternatively, one says these are the zeroth term, the first term,
and the second term. Thus the second term in the expansion of the binomial
(a + x)^4 is 6a^2x^2, and the binomial coefficient \(\binom{4}{2} = 6\). In general, the binomial
(a + x)^n of degree n expands into a sum of n + 1 terms.
Since the binomial coefficient \(\binom{n}{k}\) is the coefficient of a^{n−k}x^k when you
multiply out (a + x)^n, we have the binomial theorem.
Binomial Theorem
For example, the term \(\binom{4}{2} a^2x^2\) corresponds to choosing two a's and two x's,
n = 0: 1
n = 1: 1 1
n = 2: 1 2 1
n = 3: 1 3 3 1
n = 4: 1 4 6 4 1
n = 5: 1 5 10 10 5 1
n = 6: ⋆ 6 15 20 15 6 ⋆
n = 7: 1 ⋆ 21 35 35 21 ⋆ 1
n = 8: 1 8 ⋆ 56 70 56 ⋆ 8 1
n = 9: 1 9 36 ⋆ 126 126 ⋆ 36 9 1
from numpy import zeros

N = 10
Comb = zeros((N,N), dtype=int)
Comb[0,0] = 1
for n in range(1,N):
    Comb[n,0] = Comb[n,n] = 1
    for k in range(1,n): Comb[n,k] = Comb[n-1,k] + Comb[n-1,k-1]
Comb
In Pascal’s triangle, the very top row has one number in it: This is the
zeroth row corresponding to n = 0 and the binomial expansion of (a+x)0 = 1.
The first row corresponds to n = 1; it contains the numbers (1, 1), which
correspond to the binomial expansion of (a + x)1 = 1a + 1x. We say the
zeroth entry (k = 0) in the first row (n = 1) is 1 and the first entry (k = 1)
in the first row is 1. Similarly, the zeroth entry (k = 0) in the second row
(n = 2) is 1, and the second entry (k = 2) in the second row (n = 2) is 1.
The second entry (k = 2) in the fourth row (n = 4) is 6. For every row, the
A.2. THE BINOMIAL THEOREM 477
entries are counted starting from k = 0, and end with k = n, so there are
n + 1 entries in row n. With this understood, the k-th entry in the n-th row
is the binomial coefficient n-choose-k. So 10-choose-2 is
\[
\binom{10}{2} = 45.
\]
We can learn a lot about the binomial coefficients from this triangle. First,
we have 1’s all along the left edge. Next, we have 1’s all along the right edge.
Similarly, one step in from the left or right edge, we have the row number.
Thus we have
\[
\binom{n}{0} = 1 = \binom{n}{n}, \qquad \binom{n}{1} = n = \binom{n}{n-1}, \qquad n \ge 1.
\]
Note also Pascal’s triangle has a left-to-right symmetry: If you read off
the coefficients in a particular row, you can’t tell if you’re reading them from
left to right, or from right to left. It’s the same either way: The fifth row is
(1, 5, 10, 10, 5, 1). In terms of our notation, this is written
\[
\binom{n}{k} = \binom{n}{n-k}, \qquad 0 \le k \le n;
\]
\[
\begin{aligned}
\binom{4}{0}a^4 + \binom{4}{1}a^3x + \binom{4}{2}a^2x^2 + \binom{4}{3}ax^3 + \binom{4}{4}x^4
&= \binom{3}{0}a^4 + \binom{3}{1}a^3x + \binom{3}{2}a^2x^2 + \binom{3}{3}ax^3\\
&\quad + \binom{3}{0}a^3x + \binom{3}{1}a^2x^2 + \binom{3}{2}ax^3 + \binom{3}{3}x^4.
\end{aligned}
\]
We conclude the sum of the binomial coefficients along the n-th row of Pascal's
triangle is 2^n (remember n starts from 0).
Now insert x = 1 and a = −1. You get
\[
0 = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \cdots \pm \binom{n}{n-1} \pm \binom{n}{n}.
\]
Hence: the alternating sum of the binomial coefficients along the n-th row of
Pascal's triangle is zero.
We now show
Binomial Coefficient
Let
\[
C(n, k) = \frac{n\cdot(n-1)\cdots(n-k+1)}{1\cdot 2\cdots k} = \frac{n!}{k!(n-k)!}.
\]
Then
\[
\binom{n}{k} = C(n, k), \qquad 0 \le k \le n.
\tag{A.2.10}
\]
\[
\begin{aligned}
C(n, k) + C(n, k-1) &= \frac{n!}{k!(n-k)!} + \frac{n!}{(k-1)!(n-k+1)!}\\
&= \frac{n!}{(k-1)!(n-k)!}\left(\frac{1}{k} + \frac{1}{n-k+1}\right)\\
&= \frac{n!(n+1)}{(k-1)!(n-k)!\,k\,(n-k+1)}\\
&= \frac{(n+1)!}{k!(n+1-k)!} = C(n+1, k).
\end{aligned}
\]
The formula (A.2.10) is easy to remember: There are k terms in the numerator
as well as the denominator, the factors in the denominator increase starting
from 1, and the factors in the numerator decrease starting from n.
In Python, the code
comb(n,k)
comb(n,k,exact=True)
The binomial coefficient \(\binom{n}{k}\) makes sense even for fractional n.
Rewriting this by pulling out the first two terms k = 0 and k = 1 leads to
\[
\left(1 + \frac{1}{n}\right)^n = 1 + 1 + \sum_{k=2}^{n}\frac{1}{k!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right).
\tag{A.3.1}
\]
From (A.3.1), we can tell a lot. First, since all terms are positive, we see
\[
\left(1 + \frac{1}{n}\right)^n \ge 2, \qquad n \ge 1.
\]
The matching upper bound is obtained using a geometric sum, as follows.
A geometric sum is a sum of the form
\[
s_n = 1 + a + a^2 + \cdots + a^{n-1} = \sum_{k=0}^{n-1} a^k.
\]
Multiplying by a,
\[
a s_n = a + a^2 + a^3 + \cdots + a^{n-1} + a^n = s_n + a^n - 1,
\]
yielding
\[
(a - 1)s_n = a^n - 1.
\]
When a ≠ 1, we may divide by a − 1, obtaining
\[
s_n = \sum_{k=0}^{n-1} a^k = 1 + a + a^2 + \cdots + a^{n-1} = \frac{a^n - 1}{a - 1}.
\tag{A.3.4}
\]
By (A.3.3), we arrive at
\[
2 \le \left(1 + \frac{1}{n}\right)^n \le 3, \qquad n \ge 1.
\tag{A.3.5}
\]
Since a bounded increasing sequence has a limit (§A.7), this establishes the
following strengthening of (A.1.5).
Euler’s Constant
The limit
\[
e = \lim_{n\to\infty}\left(1 + \frac{1}{n}\right)^n
\tag{A.3.6}
\]
exists and satisfies 2 ≤ e ≤ 3.
are in §A.6, see Exercises A.6.10 and A.6.11. Nevertheless, the intuition is
clear: (A.3.6) is saying there is a specific positive number e with
\[
\left(1 + \frac{1}{n}\right)^n \approx e
\]
for n large.
Since we’ve shown bn increases faster than an , and cn increases faster than
bn , we have derived (A.1.2).
Now let n increase without bound in (A.3.1). Using 1/∞ = 0, since the k-th term approaches
1/k!, and since the number of terms increases with n, we obtain the second
formula
\[
e = 1 + 1 + \sum_{k=2}^{\infty}\frac{1}{k!}\left(1 - \frac{1}{\infty}\right)\left(1 - \frac{2}{\infty}\right)\cdots\left(1 - \frac{k-1}{\infty}\right) = \sum_{k=0}^{\infty}\frac{1}{k!}.
\]
To summarize,
Euler’s Constant
Euler’s constant satisfies
\[
e = \sum_{k=0}^{\infty}\frac{1}{k!} = 1 + 1 + \frac{1}{2} + \frac{1}{6} + \frac{1}{24} + \frac{1}{120} + \frac{1}{720} + \dots
\tag{A.3.7}
\]
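A quick numerical comparison of the limit (A.3.6) and the series (A.3.7); the cutoffs n = 1000 and 20 terms below are arbitrary choices.

from numpy import exp
from scipy.special import factorial

n = 1000
limit_estimate = (1 + 1/n)**n
series_estimate = sum([ 1/factorial(k, exact=True) for k in range(20) ])
print(limit_estimate, series_estimate, exp(1))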
Depositing one dollar in a bank offering 100% interest returns two dollars
after one year. Depositing one dollar in a bank offering the same annual
interest compounded at mid-year returns
\[
\left(1 + \frac{1}{2}\right)^2 = 2.25
\]
dollars after one year.
Exponential Function
\[
\begin{aligned}
(1 - x) &= 1 - x\\
(1 - x)^2 &= 1 - 2x + x^2 \ge 1 - 2x\\
(1 - x)^3 &= (1 - x)(1 - x)^2 \ge (1 - x)(1 - 2x) = 1 - 3x + 2x^2 \ge 1 - 3x\\
(1 - x)^4 &= (1 - x)(1 - x)^3 \ge (1 - x)(1 - 3x) = 1 - 4x + 3x^2 \ge 1 - 4x\\
&\ \ \vdots
\end{aligned}
\]
from numpy import exp, linspace
from matplotlib.pyplot import grid, plot, show
x = linspace(-2, 2, 100)   # plotting range (chosen here)
grid()
plot(x, exp(x))
show()
\[
\left(1 + \frac{x}{n}\right)^n = 1 + x + \sum_{k=2}^{n}\frac{x^k}{k!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right).
\tag{A.3.11}
\]
Exponential Series
The exponential function is always positive and satisfies, for every real
number x,
\[
\exp x = \sum_{k=0}^{\infty}\frac{x^k}{k!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \frac{x^6}{720} + \dots
\tag{A.3.12}
\]
Law of Exponents
\[
(a_0 + a_1 + a_2 + a_3 + \dots)(b_0 + b_1 + b_2 + b_3 + \dots)
\]
Thus
\[
\left(\sum_{k=0}^{\infty} a_k\right)\left(\sum_{m=0}^{\infty} b_m\right) = \sum_{n=0}^{\infty}\left(\sum_{k=0}^{n} a_k b_{n-k}\right).
\]
Now insert
\[
a_k = \frac{x^k}{k!}, \qquad b_{n-k} = \frac{y^{n-k}}{(n-k)!}.
\]
Then the n-th term in the resulting sum equals, by the binomial theorem,
\[
\sum_{k=0}^{n} a_k b_{n-k} = \sum_{k=0}^{n} \frac{x^k}{k!}\,\frac{y^{n-k}}{(n-k)!} = \frac{1}{n!}\sum_{k=0}^{n}\binom{n}{k}x^k y^{n-k} = \frac{1}{n!}(x + y)^n.
\]
Thus
\[
\exp x\cdot\exp y = \left(\sum_{k=0}^{\infty}\frac{x^k}{k!}\right)\left(\sum_{m=0}^{\infty}\frac{y^m}{m!}\right) = \sum_{n=0}^{\infty}\frac{(x + y)^n}{n!} = \exp(x + y).
\]
Exponential Notation
Exercises
Exercise A.3.1 Assume a bank gives 50% annual interest on deposits. After
one year, what does $1 become? Do this when the money is compounded once,
twice, and at every instant during the year.
Exercise A.3.2 Assume a bank gives -50% annual interest on deposits. After
one year, what does $1 become? Do this when the money is compounded once,
twice, and at every instant during the year.
valid for a, b, c positive. This remains valid for any number of factors.
Exercise A.3.5 Use the previous exercise, (A.3.1), (A.3.3), and the identity
\[
1 + 2 + 3 + \cdots + (k - 2) + (k - 1) = \frac{k(k - 1)}{2}
\]
to derive the error estimate
\[
0 \le \sum_{k=0}^{n}\frac{1}{k!} - \left(1 + \frac{1}{n}\right)^n \le \frac{3}{2n}, \qquad n \ge 2.
\]
In §1.4, we study points in two dimensions, and we see how points can be
added and subtracted. In §2.1, we study points in any number of dimensions,
and there we also add and subtract points.
In two dimensions, each point has a shadow (Figure 1.13). By stacking
shadows, points in the plane can be multiplied and divided (Figure A.5). In
this sense, points in the plane behave like numbers, because they follow the
usual rules of arithmetic.
This ability of points in the plane to follow the usual rules of arithmetic
is unique to two dimensions (considering one dimension as part of two di-
mensions), and not present in any other dimension. When thought of in this
manner, points in the plane are called complex numbers, and the plane is the
complex plane.
Here is how one does this without any angle measurement: Mark Q = x′ P
at distance x′ along the vector OP joining O and P , and draw the circle
with radius y ′ and center Q. Then this circle intersects the unit circle at two
points, both called P ′′ .
[Figure A.5: multiplying and dividing points in the plane by the construction above.]
\[
\begin{aligned}
P'' = P P' &= (xx' - yy',\ x'y + xy'),\\
P'' = P/P' &= (xx' + yy',\ x'y - xy').
\end{aligned}
\tag{A.4.1}
\]
so (A.4.1) is equivalent to
P ′′ = x′ P ± y ′ P ⊥ . (A.4.2)
P P ′ = (xx′ − yy ′ , x′ y + xy ′ ), (A.4.3)
Because of this, we can write z = x instead of z = (x, 0), this only for points
in the plane, and we call the horizontal axis the real axis.
Similarly, let i = (0, 1). Then the point i is on the vertical axis, and, using
(A.4.1), one can check that ix = (x, 0)(0, 1) = (0, x) for any real x.
Thus the vertical axis consists of all points of the form ix. These are called
imaginary numbers, and the vertical axis is the imaginary axis.
Using i, any point P = (x, y) may be written
P = x + iy,
since
x + iy = (x, 0) + (y, 0)(0, 1) = (x, 0) + (0, y) = (x, y).
This leads to Figure A.6. In this way, real numbers x are considered complex
numbers with zero imaginary part, x = x + 0i.
[Figure A.6: the complex number 3 + 2i in the complex plane.]
Square Root of −1
and
\[
\frac{z}{z'} = \frac{x + iy}{x' + iy'} = \frac{(xx' + yy') + i(x'y - xy')}{x'^2 + y'^2}.
\]
In particular, one can always “move” the i from the denominator to the
numerator by the formula
\[
\frac{1}{z} = \frac{1}{x + iy} = \frac{x - iy}{x^2 + y^2} = \frac{\bar z}{|z|^2}.
\]
From this and (A.4.1), using (x, y) = (cos θ, sin θ), (x′ , y ′ ) = (cos θ′ , sin θ′ ),
we have the addition formulas
\[
\begin{aligned}
\sin(\theta + \theta') &= \sin\theta\cos\theta' + \cos\theta\sin\theta',\\
\cos(\theta + \theta') &= \cos\theta\cos\theta' - \sin\theta\sin\theta'.
\end{aligned}
\tag{A.4.6}
\]
We will need the roots of unity in §3.2. This generalizes square roots, cube
roots, etc.
A complex number ω is a root of unity if ω d = 1 for some power d. If d is
the power, we say ω is a d-th root of unity.
For example, the square roots of unity are ±1, since (±1)2 = 1. Here we
have
1 = cos 0 + i sin 0, −1 = cos π + i sin π.
The fourth roots of unity are ±1 and ±i, since (±1)4 = 1 and (±i)4 = 1.
Here we have
1 = cos 0 + i sin 0,
i = cos(π/2) + i sin(π/2),
−1 = cos π + i sin π,
−i = cos(3π/2) + i sin(3π/2).
If ω^d = 1, then
\[
(\omega^k)^d = (\omega^d)^k = 1^k = 1.
\]
\[
1, \omega, \omega^2, \dots, \omega^{d-1}
\]
\[
1^3 = 1, \qquad \omega^3 = 1, \qquad (\omega^2)^3 = 1.
\]
[Figure: the square, cube, and fourth roots of unity on the unit circle (ω² = 1, ω³ = 1, ω⁴ = 1).]
Summarizing,
Roots of Unity
ω = cos(2π/d) + i sin(2π/d),
1, ω, ω 2 , . . . , ω d−1 .
ω k = cos(2πk/d) + i sin(2πk/d), k = 0, 1, 2, . . . , d − 1.
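For example, the d-th roots of unity can be generated directly from this formula (d = 6 below is just an illustration):

from numpy import cos, sin, pi, arange

d = 6
k = arange(d)
omega = cos(2*pi*k/d) + 1j*sin(2*pi*k/d)   # the d-th roots of unity
print(omega.round(3))
print((omega**d).round(3))                 # each root raised to the d-th power is 1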
[Figure: the fifth, sixth, and fifteenth roots of unity on the unit circle (ω⁵ = 1, ω⁶ = 1, ω¹⁵ = 1).]
\[
p(z) = \prod_{k=1}^{d}(z - a_k) = (z - a_1)(z - a_2)\cdots(z - a_d).
\tag{A.4.10}
\]
from sympy import symbols, solve

z = symbols('z')
d = 5
solve(z**d - 1)
roots([a,b,c])
Since the cube roots of unity are the roots of p(z) = z^3 − 1, the code
roots([1,0,0,-1])
Exercises
Exercise A.4.1 Let P = (1, 2) and Q = (3, 4) and R = (5, 6). Calculate P Q,
P/Q, P R, P/R, QR, Q/R.
z = symbols('z')
roots = solve(z**d - 1)
A.5 Integration
[Figure: the area under the graph of f(x) over the interval a ≤ x ≤ b, with a thin strip of width dx at x.]
To derive this, let A(x) denote the area under the graph between the y-
axis and the vertical line at x. Then A(x) is the sum of the gray area and
the red area, A(a) is the gray area, and A(b) is the sum of four areas: gray,
red, green, and blue. It follows the integral (A.5.1) equals A(b) − A(a).
Since A(x + dx) is the sum of three areas, gray, red, green, it follows
A(x + dx) − A(x) is the green area. But the green area is approximately a
rectangle of width dx and height f (x). Hence the green area is approximately
f (x) × dx, or
A(x + dx) − A(x) ≈ f (x) dx.
As a consequence of this analysis,
When d = 2, a = −1, b = 1, this is 2/3, which is the area under the parabola
in Figure A.10.
When a = 0, b = 1,
\[
\int_0^1 t^d\,dt = \frac{1}{d + 1}.
\tag{A.5.3}
\]
When F (x) can’t be found, we can’t use the FTC. Instead we use Python
to evaluate the integral (A.5.1) as follows.
from scipy.integrate import quad
d = 2
a, b = -1, 1
quad(lambda x: x**d, a, b)
This not only returns the computed integral I but also an estimate of the
error between the computed integral and the theoretical value,
(0.6666666666666666, 7.401486830834376e-15).
quad refers to quadrature, which is another term for integration.
Another example is the area under one hump of the sine curve in Figure
A.11,
\[
\int_0^{\pi}\sin x\,dx = -\cos\pi - (-\cos 0) = -(-1) + 1 = 2.
\]
Here f (x) = sin x, F (x) = − cos x, F ′ (x) = f (x). The Python code quad
returns (2.0, 2.220446049250313e-14).
It is important to realize the integral (A.5.1) is the signed area under the
graph: Portions of areas that are below the x-axis are counted negatively. For
example,
\[
\int_0^{2\pi}\sin x\,dx = -\cos(2\pi) - (-\cos 0) = -1 + 1 = 0.
\]
Explicitly,
\[
\int_0^{2\pi}\sin x\,dx = \int_0^{\pi}\sin x\,dx + \int_{\pi}^{2\pi}\sin x\,dx = 2 - 2 = 0,
\]
so the areas under the first two humps in Figure A.11 cancel.
def plot_and_integrate(f, a, b, pi_ticks=False):
    # initialize figure
    ax = axes()
    ax.grid(True)
    # draw x-axis and y-axis
    ax.axhline(0, color='black', lw=1)
    ax.axvline(0, color='black', lw=1)
    # set x-axis ticks as multiples of pi/2
    if pi_ticks: set_pi_ticks(a, b)
    x = linspace(a, b, 100)
    plot(x, f(x))
    positive = f(x) >= 0
    negative = f(x) < 0
    ax.fill_between(x, f(x), 0, color='green', where=positive, alpha=.5)
    ax.fill_between(x, f(x), 0, color='red', where=negative, alpha=.5)
    I = quad(f, a, b, limit=1000)[0]
    title("integral equals " + str(I), fontsize=10)
    show()
plot_and_integrate(f,a,b,pi_ticks=True)
Above, the Python function set_pi_ticks(a,b) sets the x-axis tick mark
labels at the multiples of π/2. The code for set_pi_ticks is in §4.1.
The exercises are meant to be done using the code in this section. For the
infinite limits below, use numpy.inf.
Exercises
Exercise A.5.1 Plot and integrate f (x) = x2 + A sin(5x) over the interval
[−10, 10], for amplitudes A = 0, 1, 2, 4, 15. Note the integral doesn’t depend
on A. Why?
Exercise A.5.2 Plot and integrate (Figure A.12)
\[
\int_0^{3\pi}\frac{\sin x}{x}\,dx.
\]
Exercise A.5.3 Plot and integrate f (x) = exp(−x) over [a, b] with a = 0,
b = 1, 10, 100, 1000, 10000.
Exercise A.5.4 Plot and integrate f(x) = √(1 − x²) over [−1, 1].

Exercise A.5.5 Plot and integrate f(x) = 1/√(1 − x²) over [−1, 1].
Exercise A.5.6 Plot and integrate f (x) = (− log x)n over [0, 1] for n =
2, 3, 4. What is the answer for general n?
Exercise A.5.7 With k = 7, n = 10, plot and integrate using Python
\[
\int_0^1 x^k(1 - x)^{n-k}\,dx.
\]
\[
\frac{2}{\pi}\int_0^{\infty}\frac{\sin x}{x}\,dx.
\]
Exercise A.5.10 Use numpy.inf to plot the normal pdf and compute its
integral
\[
\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx.
\]
Exercise A.5.11 Let σ(x) = 1/(1+e−x ). Plot and integrate f (x) = σ(x)(1−
σ(x)) over [−10, 10]. What is the answer for (−∞, ∞)?
Exercise A.5.12 Let Pn (x) be the Legendre polynomial of degree n (§4.1).
Use num_legendre (§4.1) to compute the integral
\[
\int_{-1}^{1} P_n(x)^2\,dx
\]
for n = 1, 2, 3, 4. What is the integral for general n? Hint – take the reciprocal
of the answer.
Asymptotic Vanishing
an ≈ 0 =⇒ can ≈ 0.
Convergence of Reciprocals
If an ≈ 1, then 1/an ≈ 1.
Asymptotic Equality
\[
a_n \approx b_n \iff \frac{a_n}{b_n} \approx 1 \iff \frac{a_n}{b_n} - 1 \approx 0.
\]
This is exactly what is meant in (A.1.6). While both sides in (A.1.6) in-
crease without bound, their ratio is close to one, for large n.
In general, an ≈ bn is not the same as an − bn ≈ 0: ratios and differences
behave differently. For example, based on (A.1.6), the following code
from numpy import sqrt, pi, e
def factorial(n):
    if n == 1: return 1
    else: return n * factorial(n-1)
def stirling(n):
    # Stirling's approximation, assumed here to be sqrt(2*pi*n)*(n/e)**n as in (A.1.6)
    return sqrt(2*pi*n) * (n/e)**n
a = factorial(100)
b = stirling(100)
a/b, a-b
returns
(1.000833677872004, 7.773919124995513 × 10^154).
The first entry is close to one, but the second entry is far from zero.
If, however, bn ≈ b for some nonzero constant b, then (Exercise A.6.7)
ratios and differences are the same,
an ≈ bn ⇐⇒ an − bn ≈ 0. (A.6.2)
\[
a = \lim_{n\to\infty} a_n.
\tag{A.6.3}
\]
As we saw above, limits and asymptotic equality are the same, as long as the
limit is not zero. When a is the limit of an , we also say an converges to a, or
an approaches a and we write an → a.
With this notation, asymptotic vanishing is an → 0, asymptotically one is
an → 1, and asymptotic equality is an /bn → 1.
Limits can be taken for sequences of points in Rd as well. Let an be a
sequence of points in Rd . We say an converges to a if an · v converges to a · v
for every vector v. Here we also write an → a and we write (A.6.3).
Exercises
an + bn → a + b, an bn → ab.
Several times in the text, we deal with minimizing functions, most notably for
the pseudo-inverse of a matrix (§2.3), for proper continuous functions (§4.3),
and for gradient descent (§7.3).
Previously, the technical foundations underlying the existence of minimiz-
ers were ignored. In this section, we review the foundational material sup-
porting the existence of minimizers.
For example, since y = e^x is an increasing function, the minimum
\[
\min_{0\le x\le 1} e^x = \min\{\,e^x \mid 0 \le x \le 1\,\}
\]
Completeness Property
\[
\lim_{n\to\infty} x_n.
\]
Here it is important that the indices n1 < n2 < n3 < . . . be strictly increas-
ing.
If a sequence x1 , x2 , . . . has a subsequence x′1 , x′2 , . . . converging to x∗ ,
then we say the sequence x1 , x2 , . . . subconverges to x∗ . For example, the
sequence 1, −1, 1, −1, 1, −1, . . . subconverges to 1 and also subconverges
to −1, as can be seen by considering the odd-indexed terms and the even-
indexed terms separately.
I0 ⊃ I1 ⊃ I2 ⊃ . . . ,
\[
x^* = \lim_{n\to\infty} x^*_n
\]
As we saw above, a minimizer may or may not exist, and, when the minimizer
does exist, there may be several minimizers.
A function y = f (x) is continuous if f (xn ) approaches f (x∗ ) whenever xn
approaches x∗ ,
xn → x∗ =⇒ f (xn ) → f (x∗ ),
Existence of Minimizers
\[
c = \frac{f(x_1) + m_1}{2}
\]
be the midpoint between m1 and f (x1 ).
There are two possibilities. Either c is a lower bound or not. In the first
case, define m2 = c and x2 = x1 . In the second case, there is a point x2 in
S satisfying f (x2 ) < c, and we define m2 = m1 . As a consequence, in either
case, we have f (x2 ) ≥ m2 , m1 ≤ m2 , and
\[
f(x_2) - m_2 \le \frac{1}{2}\big(f(x_1) - m_1\big).
\]
Let
\[
c = \frac{f(x_2) + m_2}{2}
\]
be the midpoint between m2 and f (x2 ).
There are two possibilities. Either c is a lower bound or not. In the first
case, define m3 = c and x3 = x2 . In the second case, there is a point x3 in
S satisfying f (x3 ) < c, and we define m3 = m2 . As a consequence, in either
case, we have f (x3 ) ≥ m3 , m2 ≤ m3 , and
\[
f(x_3) - m_3 \le \frac{1}{2^2}\big(f(x_1) - m_1\big).
\]
Continuing in this manner, we have a sequence x1 , x2 , . . . in S, and an
increasing sequence m1 ≤ m2 ≤ . . . of lower bounds, with
\[
f(x_n) - m_n \le \frac{2}{2^n}\big(f(x_1) - m_1\big).
\]
Since S is bounded, xn subconverges to some x∗ . Since f (x) is continuous,
f (xn ) subconverges to f (x∗ ). Since f (xn ) ≈ mn and mn is a lower bound for
all n, f (x∗ ) is a lower bound, hence x∗ is a minimizer.
A.8 SQL
select from
limit
select distinct
where/where not <column>
where <column> = <data> and/or <column> = <data>
order by <column1>,<column2>
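The running example here is a dict along the following lines; the dish name comes from the discussion below, while the price (in cents) and the quantity are illustrative assumptions.

item1 = {"dish": "Hummus", "price": 800, "quantity": 2}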
This is an unordered listing of key-value pairs. Here the keys are the strings
dish, price, and quantity. Keys need not be strings; they may be integers or
any unmutable Python objects. Since a Python list is mutable, a key cannot
be a list. Values may be any Python objects, so a value may be a list. In
a dict, values are accessed through their keys. For example, item1["dish"]
returns 'Hummus'.
A list-of-dicts is simply a Python list whose elements are Python dicts, for
example,
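continuing the sketch above, with a second item whose dish, price, and quantity are again illustrative assumptions (chosen so that the string length quoted below still works out):

item2 = {"dish": "Falafel", "price": 700, "quantity": 1}
L = [item1, item2]

Then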
len(L), L[0]["dish"]
returns
(2,'Hummus')
returns True.
A list-of-dicts L can be converted into a string using the json module, as
follows:
s = dumps(L)
Now print L and print s. Even though L and s “look” the same, L is a list,
and s is a string. To emphasize this point, note
• len(L) == 2 and len(s) == 99,
• L[0:2] == L and s[0:2] == '[{'
• L[8] returns an error and s[8] == ':'
To convert back the other way, use
L1 = loads(s)
Then L == L1 returns True. Strings having this form are called JSON strings,
and are easy to store in a database as VARCHARs (see Figure A.16).
The basic object in the Python package pandas is the dataframe (Figures
A.13, A.14, A.16, A.17). pandas can convert a dataframe df to many, many
other formats
df = DataFrame(L)
df
L1 = df.to_dict('records')
L == L1
returns True. Here the option 'records' returns a list-of-dicts; other options
return a dict-of-dicts or other combinations.
To convert a CSV file into a dataframe, use the code
menu_df = read_csv("menu.csv")
menu_df
To convert a dataframe back into a CSV file, use the code

df.to_csv("menu1.csv")
df.to_csv("menu2.csv", index=False)
To connect using sqlalchemy, we first collect the connection data into one
URI string,
protocol = "mysql+pymysql://"
credentials = "username:password"
server = "@servername"
port = ":3306"
uri = protocol + credentials + server + port
This string contains your database username, your database password, the
database server name, the server port, and the protocol. If the database is
"rawa", the URI is
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
df.to_sql('Menu',engine,if_exists='replace')
The if_exists = 'replace' option replaces the table Menu if it existed prior
to this command. Other options are if_exists='fail' and if_exists='append'.
The default is if_exists='fail', so

df.to_sql('Menu',engine)

fails if the table Menu already exists.
One benefit of this syntax is the automatic closure of the connection upon
completion. This completes the discussion of how to convert between dataframes
and SQL tables, and completes the discussion of conversions between any of
the objects in (A.8.2).
As an example how all this goes together, here is a task:
Given two CSV files menu.csv and orders.csv downloaded from a restaurant website
(Figure A.15), create three SQL tables Menu, OrdersIn, OrdersOut.
/* Menu */
dish varchar
price integer
/* ordersin */
orderid integer
created datetime
customerid integer
items json
/* ordersout */
orderid integer
subtotal integer
tip integer
tax integer
total integer
To achieve this task, we download the CSV files menu.csv and orders.csv,
then we carry out these steps. (price and tip in menu.csv and orders.csv
are in cents so they are INTs.)
1. Read the CSV files into dataframes menu_df and orders_df.
2. Convert the dataframes into list-of-dicts menu and orders.
3. Create a list-of-dicts OrdersIn with keys orderId, created, customerId
whose values are obtained from list-of-dicts orders.
4. Create a list-of-dicts OrdersOut with keys orderId, tip whose values are
obtained from list-of-dicts orders (tips are in cents so they are INTs).
5. Add a key items to OrdersIn whose values are JSON strings specifying
the items ordered in orders, using the prices in menu (these are in cents so
they are INTs). The JSON string is of a list-of-dicts in the form discussed
above L = [item1, item2] (see row 0 in Figure A.16).
Do this by looping over each order in the list-of-dicts orders, then loop-
ing over each item in the list-of-dicts menu, and extracting the quantity
ordered of the item item in the order order.
6. Add a key subtotal to OrdersOut whose values (in cents) are computed
from the above values.
Add a key tax to OrdersOut whose values (in cents) are computed using
the Connecticut tax rate 7.35%. Tax is applied to the sum of subtotal
and tip.
Add a key total to OrdersOut whose values (in cents) are computed
from the above values (subtotal, tax, tip).
7. Convert the list-of-dicts OrdersIn, OrdersOut to dataframes OrdersIn_df,
OrdersOut_df.
# step 1
from pandas import *

protocol = "https://"
server = "omar-hijab.org"
path = "/teaching/csv_files/restaurant/"
url = protocol + server + path

# read the two CSV files into dataframes
menu_df = read_csv(url + "menu.csv")
orders_df = read_csv(url + "orders.csv")
# step 2
menu = menu_df.to_dict('records')
orders = orders_df.to_dict('records')
# step 3
OrdersIn = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["created"] = r["created"]
    d["customerId"] = r["customerId"]
    OrdersIn.append(d)
# step 4
OrdersOut = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["tip"] = r["tip"]
    OrdersOut.append(d)
# step 5
from json import *
# step 6
for i,r in enumerate(OrdersOut):
    items = loads(OrdersIn[i]["items"])
    subtotal = sum([ item["price"]*item["quantity"] for item in items ])
    r["subtotal"] = subtotal
    tip = OrdersOut[i]["tip"]
    tax = int(.0735*(tip + subtotal))
    total = subtotal + tip + tax
    r["tax"] = tax
    r["total"] = total
# step 7
ordersin_df = DataFrame(OrdersIn)
ordersout_df = DataFrame(OrdersOut)
# step 8
engine = create_engine(uri)
dtype2 = {
    "orderId": sqlalchemy.Integer,
    "created": sqlalchemy.String(30),
    "customerId": sqlalchemy.Integer,
    "items": sqlalchemy.String(1000)
}
dtype3 = {
    "orderId": sqlalchemy.Integer,
    "tip": sqlalchemy.Integer,
    "subtotal": sqlalchemy.Integer,
    "tax": sqlalchemy.Integer,
    "total": sqlalchemy.Integer
}
# upload the dataframes as SQL tables (if_exists='replace' is one possible choice)
ordersin_df.to_sql('OrdersIn', engine, if_exists='replace', dtype=dtype2)
ordersout_df.to_sql('OrdersOut', engine, if_exists='replace', dtype=dtype3)
In this section, all work was done in Python on a laptop, no SQL was used on
the database, other than creating a table or downloading a table. Generally,
this is an effective workflow:
• Use SQL to do big manipulations on the database (joining and filtering).
• Use Python to do detailed computations on your laptop (analysis).
Now we consider the following simple problem. The total number of orders
is 3970. What is the total number of plates? To answer this, we loop through
all the orders, summing the number of plates in each order. The answer is
14,949 plates.
protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
connection = engine.connect()
num = 0
print(num)
def num_plates(item):
    dishes = loads(item)
    return sum( [ dish["quantity"] for dish in dishes ] )
Then we use map to apply this function to every element in the series
df["items"], resulting in another series. Then we sum the resulting series.
num = df["items"].map(num_plates).sum()
print(num)
Since the total number of plates is 14,949, and the total number of orders
is 3970, the average number of plates per order is 3.76.
References
[19] J. W. Longley. “An Appraisal of Least Squares Programs for the Elec-
tronic Computer from the Point of View of the User”. In: Journal of
the American Statistical Association 62.319 (1967), pp. 819–841.
[20] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming.
Springer, 2008.
[21] M. P. Deisenroth, A. A. Faisal, and C. S. Ong. Mathematics for Machine
Learning. Cambridge University Press, 2020.
[22] M. Minsky and S. Papert. Perceptrons, An Introduction to Computa-
tional Geometry. MIT Press, 1988.
[23] Y. Nesterov. Lectures on Convex Optimization. Springer, 2018.
[24] K. Pearson. “On the criterion that a given system of deviations from
the probable in the case of a correlated system of variables is such that
it can be reasonably supposed to have arisen from random sampling”.
In: Philosophical Magazine Series 5 50:302 (1900), pp. 157–175.
[25] R. Penrose. “A generalized inverse for matrices”. In: Proceedings of the
Cambridge Philosophical Society 51 (1955), pp. 406–413.
[26] B. T. Polyak. “Some methods of speeding up the convergence of itera-
tion methods”. In: USSR Computational Mathematics and Mathemat-
ical Physics 4(5) (1964), pp. 1–17.
[27] The WeBWorK Project. url: https://openwebwork.org/.
[28] S. Raschka. PCA in three simple steps. 2015. url: https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html.
[29] H. Robbins and S. Monro. “A Stochastic Approximation Method”. In:
The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407.
[30] S. M. Ross. Probability and Statistics for Engineers and Scientists, Sixth
Edition. Academic Press, 2021.
[31] M. J. Schervish. Theory of Statistics. Springer, 1995.
[32] G. Strang. Linear Algebra and its Applications. Brooks/Cole, 1988.
[33] Stanford University. CS224N: Natural Language Processing with Deep
Learning. url: https://web.stanford.edu/class/cs224n.
[34] I. Waldspurger. Gradient Descent With Momentum. 2022. url: https://www.ceremade.dauphine.fr/~waldspurger/tds/22_23_s1/advanced_gradient_descent.pdf.
[35] Wikipedia. Logistic Regression. url: https://en.wikipedia.org/wiki/Logistic_regression.
[36] Wikipedia. Seven Bridges of Königsberg. url: https://en.wikipedia.org/wiki/Seven_Bridges_of_Konigsberg.
[37] S. J. Wright and B. Recht. Optimization for Data Analysis. Cambridge
University Press, 2022.
Python Index
*, 9, 16
all, 199
append, 199
def.angle, 25, 68
def.assign_clusters, 199
def.backward_prop, 244, 253, 415
def.ball, 55
def.cartesian_product, 351
def.chi2_independence, 395
def.comb_tuples, 470
def.confidence_interval, 377, 387
def.derivative, 252
def.dimension_staircase, 127
def.downstream, 415
def.draw_major_minor_axes, 51
def.ellipse, 44
def.find_first_defect, 125
def.forward_prop, 244, 251, 411
def.gd, 426
def.goodness_of_fit, 391
def.H, 293
def.hexcolor, 11
def.incoming, 250, 410
def.is_input, 409
def.is_output, 409
def.J, 412
def.local, 413
def.matrix_text, 45
def.nearest_index, 199
def.newton, 420
def.num_legendre, 209
def.num_plates, 524
def.outgoing, 250, 410
def.pca, 193
def.pca_with_svd, 193
def.perm_tuples, 469
def.plot_and_integrate, 502
def.plot_cluster, 200
def.plot_descent, 421
def.poly, 450
def.project, 117
def.project_to_ortho, 118
def.pvalue, 341
def.random_batch_mean, 284
def.random_vector, 199
def.set_pi_ticks, 221
def.sym_legendre, 208
def.tensor, 33
def.train_nn, 428
def.ttest, 388
def.type2_error, 383, 389
def.uniq, 5
def.update_means, 199
def.update_weights, 427
def.zero_variance, 104