Statistics
Omar Hijab*
Copyright ©2022–2024 Omar Hijab. All Rights Reserved.
Preface
Detailed proofs and detailed code snippets are included throughout the
text for the same reason: There is value in understanding how things work,
and real understanding can only be achieved by going all the way.
Because SQL is usually part of a data scientist’s toolkit, an introduction
to using SQL from within Python is included in an appendix.
Throughout, we use iff to mean if and only if. To help navigate the text,
in each section, we use the ship’s wheel to indicate a break, a new idea,
or a change in direction.
Sections and figures are numbered sequentially within each chapter, and
equations are numbered sequentially within each section, so §3.3 is the third
section in the third chapter, Figure 7.11 is the eleventh figure in the seventh
chapter, and (3.2.1) is the first equation in the second section of the third
chapter.
⋆ under construction ⋆,
Contents

Preface

1 Datasets
1.1 Introduction
1.2 The MNIST Dataset
1.3 Averages and Vector Spaces
1.4 Two Dimensions
1.5 Complex Numbers
1.6 Mean and Covariance
1.7 High Dimensions

2 Linear Geometry
2.1 Vectors and Matrices
2.2 Products
2.3 Matrix Inverse
2.4 Span and Linear Independence
2.5 Zero Variance Directions
2.6 Pseudo-Inverse
2.7 Projections
2.8 Basis
2.9 Rank

3 Principal Components

4 Counting
4.1 Permutations and Combinations
4.2 Graphs
4.3 Binomial Theorem
4.4 Exponential Function

5 Probability
5.1 Binomial Probability
5.2 Probability
5.3 Random Variables
5.4 Normal Distribution
5.5 Chi-squared Distribution

6 Statistics
6.1 Estimation
6.2 Z-test
6.3 T-test
6.4 Two Means
6.5 Variances
6.6 Maximum Likelihood Estimates
6.7 Chi-Squared Tests

7 Calculus
7.1 Calculus
7.2 Entropy and Information
7.3 Multi-variable Calculus
7.4 Back Propagation
7.5 Convex Functions
7.6 Multinomial Probability

A Appendices
A.1 SQL
A.2 Minimizing Sequences
A.3 Keras Training

References
Python
Index
Chapter 1

Datasets
1.1 Introduction
Geometrically, a dataset is a sample of N points x1 , x2 , . . . , xN in d-
dimensional space Rd . Algebraically, a dataset is an N × d matrix.
Practically speaking, as we shall see, the following are all representations
of datasets
The Iris dataset contains 150 examples of four features of Iris flowers,
and there are three classes of Irises, Setosa, Versicolor, and Virginica, with
50 samples from each class.
The four features are sepal length and width, and petal length and width
(Figure 1.1). For each example, the class is the label corresponding to that
example, so the Iris dataset is labelled.
from sklearn import datasets

iris = datasets.load_iris()
dataset = iris["data"]
labels = iris["target"]
dataset, labels
This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points
x1 , x2 , . . . , xN
If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (Datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus),
and Python (numpy, pandas, scipy, sympy, matplotlib). It may help to
read the code examples and the important math principles first, then dive
into details as needed.
To illustrate concepts and make them concrete as they are introduced, we
use Python code throughout. We run Python code in a Jupyter notebook.
Jupyter is an IDE, an integrated development environment. Jupyter supports
many frameworks, including Python, Sage, Julia, and R. A useful Jupyter
feature is the ability to measure the execution time of a code cell by
including at the start of the cell
%%time
¹ The National Institute of Standards and Technology (NIST) is a physical sciences
laboratory and non-regulatory agency of the United States Department of Commerce.
dataset.shape, labels.shape
returns the shapes of the dataset and the labels. (This code requires keras,
tensorflow, and related modules, if not already installed.)
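One way to load the MNIST training set with keras is sketched below; the variable names dataset and labels match the snippet above, though this particular loading code is only an illustration.

from keras.datasets import mnist

# one way to obtain the 60000 training images and labels
(train_images, train_labels), _ = mnist.load_data()
dataset = train_images    # shape (60000, 28, 28): each image is a 28 x 28 grid of pixel shadings
labels = train_labels     # shape (60000,)
dataset.shape, labels.shape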
Since this dataset is for demonstration purposes, these images are coarse.
Since each image consists of 784 pixels, and each pixel shading is a number,
each image is a point x in Rd = R784 .
Figure 1.4: Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
For the second image in Figure 1.2, reducing dimension from d = 784 to
n = 600, 350, 150, 50, 10, and 1, we have the images in Figure 1.4.
Compressing each image to a point in n = 3 dimensions and plotting all
N = 60000 points yields Figure 1.5. All this is discussed in §3.4.
The top left image in Figure 1.4 is given by a 784-dimensional point which
is imported as an array pixels of shape (28,28).
pixels = dataset[1]
grid()
scatter(2,3)
show()
3. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.
Here is one possible code, returning Figure 1.6.
pixels = dataset[1]
grid()
for i in range(28):
    for j in range(28): scatter(i, j, s=pixels[i,j])
show()
imshow(pixels, cmap="gray_r")
We end the section by discussing the Python import command. The last
code snippet can be rewritten
plt.imshow(pixels, cmap="gray_r")
or as
imshow(pixels, cmap="gray_r")
L = [x_1,x_2,...,x_N].
The totality of samples is the population, or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population
Or, the sample space consists of all integers and we take N = 5 samples from
this population
Or, the sample space consists of all rational numbers and we take N = 5
samples from this population
Or, the sample space consists of all Python strings and we take N = 5
samples from this population
L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population
# choice here is the standard-library random.choice
from random import choice

def hexcolor():
    return "#" + ''.join([choice('0123456789abcdef') for _ in range(6)])
8. 1v = v and 0v = 0
9. r(sv) = (rs)v.
A vector is an arrow joining two points (Figure 1.8). Given two points
m = (a, b) and x = (c, d), the vector joining them is
v = x − m = (c − a, d − b).
Then m is the tail of v, and x is the head of v. For example, the vector
joining m = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean m of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − m, k = 1, 2, . . . , N .
The dataset v1, v2, . . . , vN is centered, its mean is zero,

(v1 + v2 + · · · + vN)/N = 0.
(Figure: the points x1, x2, . . . , x5 with mean m, and the centered vectors v1, . . . , v5 based at 0.)
v1 = x1 − m, v2 = x2 − m, . . . , vN = xN − m,
returns False.
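Exact equality with 0 typically fails in floating point, so allclose is the appropriate check. A minimal sketch (not from the text; imports are spelled out here, although elsewhere the text assumes numpy star imports):

from numpy import array, mean, allclose
from numpy.random import random

N = 20
dataset = array([ [random(), random()] for _ in range(N) ])
m = mean(dataset, axis=0)
vectors = dataset - m                    # centered dataset
allclose(mean(vectors, axis=0), 0)       # returns True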
Usually, we can’t take sample means from a population directly; instead, we
take the sample mean of a statistic associated to the population. A statistic
is a function f from the sample space to a space V of values.

(Figure: a statistic f mapping the sample space to V.)
dataset = array([1.23,4.29,-3.3,555])
mean(dataset)
mean(dataset, axis=0)
mean(dataset, axis=1)
N = 20
dataset = array([ [random(), random()] for _ in range(N) ])
m = mean(dataset, axis=0)
grid()
X = dataset[:,0]
Y = dataset[:,1]
scatter(X,Y)
scatter(*m)
show()
In this code, scatter expects two positional arguments, the x and the y
components of a point, or two lists of x and y components separately. The
unpacking operator * unpacks the mean m from one pair into its separate x
and y components *m. Also, for scatter, dataset is separated into its two
columns.
(Figure 1.12: a vector v with coordinates (3, 2), and vectors v1, v2.)
In the cartesian plane, each vector v has a shadow. This is the triangle
constructed by dropping the perpendicular from the tip of v to the x-axis, as
in Figure 1.13. This cannot be done unless one first draws a horizontal line
(the x-axis), then a vertical line (the y-axis). In this manner, each vector v
has cartesian coordinates v = (x, y). In Figure 1.12, the coordinates of v are
(3, 2). In particular, the vector 0 = (0, 0), the zero vector, corresponds to the
origin.
In the cartesian plane, vectors v1 = (x1 , y1 ) and v2 = (x2 , y2 ) are added
by adding their coordinates,
Addition of vectors
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then
v1 + v2 = (x1 + x2 , y1 + y2 ). (1.4.1)
Because points and vectors are interchangeable, the same formula is used
for addition P + P ′ of points P and P ′ .
v1 = (1,2)
v2 = (3,4)
v1 + v2 == (1+3,2+4) # returns False
v1 = [1,2]
v2 = [3,4]
v1 + v2 == [1+3,2+4] # returns False
v1 = array([1,2])
v2 = array([3,4])
v1 + v2 == array([1+3,2+4]) # returns array([True, True])

v = array([1,2])
3*v == array([3,6]) # returns array([True, True])
Scaling of vectors
Thus multiplying v by s, and then multiplying the result by t, has the same
effect as multiplying v by ts, in a single step. Because points and vectors are
interchangeable, the same formula is used for scaling tP points P by t.
tv
v
0 tv
v1 − v2 = v1 + (−v2 ).
This gives
Subtraction of vectors
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then
v1 − v2 = (x1 − x2 , y1 − y2 ) (1.4.2)
v1 = array([1,2])
v2 = array([3,4])
v1 - v2 == array([1-3,2-4]) # returns array([True, True])
Distance Formula
If v1 = (x1, y1) and v2 = (x2, y2), then the distance between v1 and v2 is

|v1 − v2| = √((x1 − x2)² + (y1 − y2)²).
(Figure 1.16: polar coordinates: the point (x, y) at distance r from 0, at angle θ.)
x = r cos θ, y = r sin θ.
In Python,
v = array([1,2])
norm(v) == sqrt(5)# returns True
The unit circle consists of the vectors which are distance 1 from the origin
0. When v is on the unit circle, the magnitude of v is 1, and we say v is a
unit vector. In this case, the line formed by the scalings of v intersects the
unit circle at ±v (Figure 1.17).
When v is a unit vector, r = 1, and (Figure 1.16), v = (cos θ, sin θ).
The unit circle intersects the horizontal axis at the vectors (1, 0), and
(−1, 0), and intersects the vertical axis at the vectors (0, 1), and (0, −1).
These four vectors are equally spaced on the unit circle (Figure 1.17).
(Figure 1.17: the unit circle, with a unit vector v and its negative −v.)

In coordinates, the unit circle consists of the vectors v = (x, y) satisfying

x² + y² = 1.
More generally, any circle with center (a, b) and radius r consists of vectors
v = (x, y) satisfying
(x − a)² + (y − b)² = r².
Let R be a point on the unit circle, and let t > 0. From this, we see the
scaled point tR is on the circle with center (0, 0) and radius t. Moreover, if
Q is any point, Q + tR is on the circle with center Q and radius t.
Given this, it is easy to check

|(1/r)v| = (1/r)|v| = (1/r)r = 1,
(Figure 1.18: vectors v1, v2 and their difference v2 − v1.)
Now we discuss the dot product in two dimensions. We have two vectors
v and v ′ in the plane R2 , with v1 = (x1 , y1 ) and v2 = (x2 , y2 ). The dot
product of v1 and v2 is given algebraically as
v1 · v2 = x1 x2 + y1 y2 ,
or geometrically as
v1 · v2 = |v1 | |v2 | cos θ,
where θ is the angle between v1 and v2. To show that these are the same,
below we derive the dot product identity.
v1 = array([1,2])
v2 = array([3,4])
dot(v1, v2)   # returns 11
As a consequence of the dot product identity, we have code for the angle
between two vectors,
def angle(u,v):
a = dot(u,v)
b = dot(u,u)
c = dot(v,v)
theta = arccos(a / sqrt(b*c))
return degrees(theta)
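For example, a small check (the particular vectors here are arbitrary):

u = array([1, 0])
v = array([1, 1])
angle(u, v)    # returns 45.0, up to floating point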
Cauchy-Schwarz Inequality

If u and v are any two vectors, then |u · v| ≤ |u| |v|.
a² = d² + f²  and  c² = e² + f².

Also b = e + d, so

b² = (e + d)² = e² + 2ed + d².

Then

c² = e² + f² = (b − d)² + f²
   = f² + d² + b² − 2db
   = a² + b² − 2ab cos θ,

so we get (1.4.6).
(Figure 1.19: a triangle with sides a, b, c; the altitude f splits the side b into segments d and e, with b = d + e.)
Next, connect Figures 1.18 and 1.19 by noting a = |v2 | and b = |v1 | and
c = |v2 − v1 |.
Now go back to deriving (1.4.4). By vector addition, we have
v2 − v1 = (x2 − x1 , y2 − y1 ),
v · v ⊥ = (x, y) · (−y, x) = 0.
(Figure: the vectors v, v⊥ and the points P, P⊥ on the unit circle, together with their negatives −v⊥, −P⊥.)
From Figure 1.21, we see points P and P ′ on the unit circle satisfy P ·P ′ =
0 iff P ′ = ±P ⊥ .
ax + by = 0, cx + dy = 0. (1.4.9)
In (1.4.9), multiply the first equation by d and the second by b and subtract,
obtaining

(ad − bc)x = d(ax + by) − b(cx + dy) = 0.

In (1.4.9), multiply the first equation by c and the second by a and subtract,
obtaining

(bc − ad)y = c(ax + by) − a(cx + dy) = 0.
From here, we see there are two cases: det(A) = 0 and det(A) ̸= 0. When
det(A) ̸= 0, the only solution of (1.4.9) is (x, y) = (0, 0). When det(A) = 0,
(x, y) = (−b, a) is a solution of both equations in (1.4.9). We have shown
Homogeneous System
ax + by = e, cx + dy = f, (1.4.11)
Inhomogeneous System
In this case, we call u and v the columns of A. This shows there are at least
three ways to think about a matrix: as rows, or as columns, or as a single
block.
The simplest operations on matrices are addition and scalar multiplication.
Addition is as follows,

A = [a b; c d],  A′ = [a′ b′; c′ d′]  =⇒  A + A′ = [a + a′  b + b′; c + c′  d + d′]

(here and below, semicolons separate matrix rows).
AA′ = [u · u′  u · v′; v · u′  v · v′].
U(θ)U(θ′) = [cos θ  −sin θ; sin θ  cos θ] [cos θ′  −sin θ′; sin θ′  cos θ′]
          = [cos(θ + θ′)  −sin(θ + θ′); sin(θ + θ′)  cos(θ + θ′)] = U(θ + θ′).
Orthogonal Matrices
def tensor(u,v):
return array([ [ a*b for b in v] for a in u ])
det(u ⊗ v) = 0.
This is true no matter what the vectors u and v are. Check this yourself.
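A quick numerical check of this claim (a sketch; the vectors are arbitrary):

from numpy import array, isclose
from numpy.linalg import det

u = array([1.0, 2.0])
v = array([3.0, 4.0])
isclose(det(tensor(u, v)), 0)    # returns True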
Notice by definition of u ⊗ v,
so
v · Qv = (x, y) · (ax + by, bx + cy) = ax2 + 2bxy + cy 2 .
This is the quadratic form associated to the matrix Q.
Quadratic Form

If

Q = [a b; b c]  and  v = (x, y),

then

v · Qv = ax² + 2bxy + cy².

Q = I  =⇒  v · Qv = x² + y².

When Q is diagonal,

Q = [a 0; 0 c]  =⇒  v · Qv = ax² + cy².
An important case is when Q = u ⊗ u. In this case, by (1.4.14),
(Figure: geometric construction of the product P′′ of points P, P′ in the plane, using the origin O and the unit point 1.)
This ability of points in the plane to follow the usual rules of arithmetic is
unique to one and two dimensions, and not present in any other dimension.
When thought of in this manner, points in the plane are called complex
numbers, and the plane is the complex plane.
P′′ = PP′ = (xx′ − yy′, x′y + xy′),
P′′ = P/P′ = (xx′ + yy′, x′y − xy′).          (1.5.1)

so (1.5.1) is equivalent to

P′′ = x′P ± y′P⊥.          (1.5.2)
P P ′ = (xx′ − yy ′ , x′ y + xy ′ ), (1.5.3)
P/P′ = (1/r′²) P P̄′ = (1/(x′² + y′²)) (xx′ + yy′, x′y − xy′).          (1.5.4)
Because of this, we can write z = x instead of z = (x, 0), this only for points
in the plane, and we call the horizontal axis the real axis.
Similarly, let i = (0, 1). Then the point i is on the vertical axis, and,
using (1.5.1), one can check that ix = (0, 1)(x, 0) = (0, x).
Thus the vertical axis consists of all points of the form ix. These are called
imaginary numbers, and the vertical axis is the imaginary axis.
Using i, any point P = (x, y) may be written
P = x + iy,
since x + iy = (x, 0) + (y, 0)(0, 1) = (x, 0) + (0, y) = (x, y). This leads to
Figure 1.23. In this way, real numbers x are considered complex numbers
with zero imaginary part, x = x + 0i.
(Figure 1.23: the complex plane, with the real axis through −1, 0, 1, 2, 3, the imaginary axis through 2i, and the point 3 + 2i.)
Square Root of −1

i² = (0, 1)(0, 1) = (−1, 0) = −1.

In the x + iy notation, by (1.5.1), the product is

zz′ = (x + iy)(x′ + iy′) = (xx′ − yy′) + i(x′y + xy′),

and
z/z′ = (x + iy)/(x′ + iy′) = ((xx′ + yy′) + i(x′y − xy′))/(x′² + y′²).
In particular, one can always “move” the i from the denominator by the
formula
1/z = 1/(x + iy) = (x − iy)/(x² + y²) = z̄/|z|².
Here x2 + y 2 = r2 = |z|2 is the absolute value squared of z, and z̄ is the
conjugate of z.
If (r, θ) and (r′ , θ′ ) are the polar coordinates of complex numbers P and
P ′ , and (r′′ , θ′′ ) are the polar coordinates of the product P ′′ = P P ′ ,
then
r′′ = rr′ and θ′′ = θ + θ′ .
From this and (1.5.1), using (x, y) = (cos θ, sin θ), (x′, y′) = (cos θ′, sin θ′),
we have the addition formulas

cos(θ + θ′) = cos θ cos θ′ − sin θ sin θ′,
sin(θ + θ′) = sin θ cos θ′ + cos θ sin θ′.
This formula is valid as long as x ≠ −r, and can be checked directly by
verifying Q² = P.
When P is on the unit circle, r = 1, so the formula reduces to

√P = ±( (1 + x)/√(2 + 2x) , y/√(2 + 2x) ).
We will need the roots of unity in §3.2. This generalizes square roots,
cube roots, etc.
A point ω is a root of unity if ω^d = 1 for some power d. If d is the power,
we say ω is a d-th root of unity.
For example, the square roots of unity are ±1, since (±1)2 = 1. Here we
have
1 = cos 0 + i sin 0, −1 = cos π + i sin π.
The fourth roots of unity are ±1, ±i, since (±1)4 = 1, (±i)4 = 1. Here
we have
1 = cos 0 + i sin 0,
i = cos(π/2) + i sin(π/2),
−1 = cos π + i sin π,
−i = cos(3π/2) + i sin(3π/2).
(Figure: the square roots (ω² = 1), cube roots (ω³ = 1), and fourth roots (ω⁴ = 1) of unity on the unit circle.)
If ω^d = 1, then

(ω^k)^d = (ω^d)^k = 1^k = 1.

1, ω, ω², . . . , ω^(d−1)

1³ = 1,  ω³ = 1,  (ω²)³ = 1.
ω = −1/4 + √5/4 + i√(5/8 + √5/8) = cos(2π/5) + i sin(2π/5).
(Figure: the fifth roots (ω⁵ = 1), sixth roots (ω⁶ = 1), and fifteenth roots (ω¹⁵ = 1) of unity on the unit circle.)
Summarizing,
Roots of Unity
If
ω = cos(2π/d) + i sin(2π/d),
the d-th roots of unity are
1, ω, ω², . . . , ω^(d−1).

ω^k = cos(2πk/d) + i sin(2πk/d),    k = 0, 1, 2, . . . , d − 1.
x = symbols('x')
d = 5
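One plausible way to continue this snippet with sympy (a sketch; the solve call below is an illustration, assuming the sympy star imports used in the text):

# the d-th roots of unity are the solutions of x**d = 1
solve(x**d - 1, x)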
import numpy as np
np.roots([a,b,c])
Since the cube roots of unity are the roots of the polynomial p(x) = x3 − 1,
the code
import numpy as np
np.roots([1,0,0,-1])
Above |x| stands for the length of the vector x, or the distance of the
point x to the origin. When d = 2 and we are in two dimensions, this was
defined in §1.4. For general d, this is defined in §2.1. In this section we
continue to focus on two dimensions d = 2.
m = (1/N) Σ_{k=1}^{N} xk = (x1 + x2 + · · · + xN)/N.
Point of Best-fit
The mean is the point of best-fit: The mean minimizes the mean-square
distance to the dataset (Figure 1.26).
Figure 1.26: MSD for the mean (green) versus MSD for a random point (red).
Using (1.4.8),

|a + b|² = |a|² + 2a · b + |b|²,

MSD(x) = MSD(m) + (2/N) Σ_{k=1}^{N} (xk − m) · (m − x) + |m − x|².
Since the vectors xk − m sum to zero, the middle term vanishes, so we have
MSD(x) = MSD(m) + |m − x|² ≥ MSD(m).
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
m = mean(dataset,axis=0)
p = array([random(),random()])
grid()
X = dataset[:,0]
Y = dataset[:,1]
scatter(X,Y)
for v in dataset:
plot([m[0],v[0]],[m[1],v[1]],c='green')
plot([p[0],v[0]],[p[1],v[1]],c='red')
show()
def tensor(u,v):
return array([ [ a*b for b in v] for a in u ])
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
m = mean(dataset,axis=0)
# center dataset
vectors = dataset - m
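The covariance matrix can then be assembled as the average of the tensor products of the centered vectors with themselves; a minimal sketch completing the snippet above (the assembly line itself is an illustration, assuming the usual numpy imports):

Q = sum([ tensor(v, v) for v in vectors ]) / N    # biased covariance, dividing by N
allclose(Q, cov(dataset.T, bias=True))            # returns True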
Since

(±4, ±4) ⊗ (±4, ±4) = [16 16; 16 16],
(±2, ±2) ⊗ (±2, ±2) = [4 4; 4 4],
(0, 0) ⊗ (0, 0) = [0 0; 0 0],
Notice
Q = 8(1, 1) ⊗ (1, 1),
which, as we see below (§2.5), reflects the fact that the points of this dataset
lie on a line. Here the line is y = x + 1.
The covariance matrix as written in (1.6.1) is the biased covariance matrix.
If the denominator is instead N − 1, the matrix is the unbiased covariance
matrix.
For datasets with large N , it doesn’t matter, since N and N − 1 are
almost equal. For simplicity, here we divide by N , and we only consider the
biased covariance matrix.
In practice, datasets are standardized before computing their covariance.
The covariance of standardized datasets — the correlation matrix — is the
same whether one starts with bias or not (§2.2).
In numpy, the Python covariance constructor is
N = 20
dataset = array([ [random(),random()] for _ in range(N) ])
Q = cov(dataset,bias=True,rowvar=False)
This returns the same result as the previous code for Q. Notice here there
is no need to compute the mean; this is taken care of automatically. The
same covariance matrix is returned by
Q = cov(dataset.T,bias=True)
We call (1.6.3) the total variance of the dataset. Thus the total variance
equals MSD(m).
In Python, the total variance is
Q = cov(dataset.T,bias=True)
Q.trace()
proj_u v = (v · u)u.

(Figure: the projection proj_u v of the vector v onto the line through the unit vector u.)
Because the reduced dataset and projected dataset are essentially the
same, we also refer to q as the variance of the projected dataset. Thus we
conclude (see §1.4 for v · Qv)
This shows that the dataset lies on the line passing through m and perpen-
dicular to (1, −1).
u · Qu = ax2 + 2bxy + cy 2 = 1.
The covariance ellipse and inverse covariance ellipses described above are
centered at the origin (0, 0). When a dataset has mean m and covariance Q,
the ellipses are drawn centered at m, as in Figures 1.30, 1.31, and 1.32.
Here is the code for Figure 1.28. The ellipses drawn here are centered at
the origin.
L, delta = 4, .1
x = arange(-L,L,delta)
y = arange(-L,L,delta)
X,Y = meshgrid(x, y)
a, b, c = 9, 0, 4
det = a*c - b**2
A, B, C = c/det, -b/det, a/det
def ellipse(a,b,c,levels,color):
contour(X,Y,a*X**2 + 2*b*X*Y + c*Y**2,levels,colors=color)
grid()
ellipse(a,b,c,[1],'blue')
ellipse(A,B,C,[1],'red')
show()
x1 , x2 , . . . , xN , and y1 , y2 , . . . , yN .
Suppose the mean of this dataset is m = (mx , my ). Then, by the formula for
tensor product, the covariance matrix is
Q = [a b; b c],

where
a = (1/N) Σ_{k=1}^{N} (xk − mx)²,   b = (1/N) Σ_{k=1}^{N} (xk − mx)(yk − my),   c = (1/N) Σ_{k=1}^{N} (yk − my)².
From this, we see a is the variance of the x-features, and c is the variance
of y-features. We also see b is a measure of the correlation between the x
and y features.
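These formulas are easy to check numerically; a sketch, assuming a two-dimensional dataset array as in the earlier snippets and the usual numpy imports:

X, Y = dataset[:,0], dataset[:,1]
mx, my = mean(X), mean(Y)
a = mean((X - mx)**2)
b = mean((X - mx)*(Y - my))
c = mean((Y - my)**2)
allclose(cov(dataset.T, bias=True), array([[a, b], [b, c]]))    # returns True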
Standardizing the dataset means to center the dataset and to place the x
and y features on the same scale. For example, the x-features may be close
to one another while the y-features are widely spread. The x-features are
replaced by

x1, x2, . . . , xN  →  x′k = (xk − mx)/√a,

and the y-features by

y1, y2, . . . , yN  →  y′k = (yk − my)/√c,    k = 1, 2, . . . , N.
This results in a new dataset v1 = (x′1, y′1), v2 = (x′2, y′2), . . . , vN = (x′N, y′N)
that is centered,

(v1 + v2 + · · · + vN)/N = 0,

with each feature standardized to have unit variance,

(1/N) Σ_{k=1}^{N} x′k² = 1,   (1/N) Σ_{k=1}^{N} y′k² = 1.
For example,

Q = [9 2; 2 4]  =⇒  ρ = b/√(ac) = 1/3  =⇒  Q′ = [1 1/3; 1/3 1].
corrcoef(dataset.T)
Here again, we input the transpose of the dataset if our default is vectors
as rows. Notice the 1/N cancels in the definition of ρ. Because of this,
corrcoef is the same whether we deal with biased or unbiased covariance
matrices.
u · Qu = max_{|v|=1} v · Qv.
Since the sine function varies between +1 and −1, we conclude the projected
variance varies between
1 − ρ ≤ v · Qv ≤ 1 + ρ,
and

θ = π/4,   v+ = (1/√2, 1/√2)   =⇒   v+ · Qv+ = 1 + ρ,
θ = 3π/4,  v− = (−1/√2, 1/√2)  =⇒   v− · Qv− = 1 − ρ.
Thus the best-aligned vector v+ is at 45°, and the worst-aligned vector is at
135° (Figure 1.29).
Actually, the above is correct only if ρ > 0. When ρ < 0, it’s the other
way. The correct answer is
1 − |ρ| ≤ v · Qv ≤ 1 + |ρ|,
Here are two randomly generated datasets. For the dataset in Figure
1.30, the mean and covariance are

(0.46563359, 0.59153958),   [0.09652275 0.00939796; 0.00939796 0.0674424].

For the dataset in Figure 1.31, the mean and covariance are

(0.48785572, 0.51945499),   [0.08266583 −0.00976249; −0.00976249 0.08298294].
Here is code for Figures 1.30, 1.31, and 1.32. The code incorporates the
formulas for λ± and v± .
N = 50
X = array([ random() for _ in range(N) ])
Y = array([ random() for _ in range(N) ])
scatter(X,Y,s=2)
m = mean([X,Y],axis=1)
Q = cov(X,Y,bias=True)
a, b, c = Q[0,0], Q[0,1], Q[1,1]
delta = .01
x = arange(0,1,delta)
y = arange(0,1,delta)
X,Y = meshgrid(x, y)
def ellipse(a,b,c,d,e,levels,color):
det = a*c - b**2
A, B, C = c/det, -b/det, a/det
# inverse covariance ellipse centered at (d,e)
Z = A*(X-d)**2 + 2*B*(X-d)*(Y-e) + C*(Y-e)**2
contour(X,Y,Z,levels,colors=color)
for pm in [+1,-1]:
lamda = (a+c)/2 + pm * sqrt(b**2 + (a-c)**2/4)
sigma = sqrt(lamda)
len = sqrt(b**2 +(a-lamda)**2)
axesX = [d+sigma*b/len,d-sigma*b/len]
axesY = [e-sigma*(a-lamda)/len,e+sigma*(a-lamda)/len]
plot(axesX,axesY,linewidth=.5)
grid()
levels = [.5,1,1.5,2]
ellipse(a,b,c,*m,levels,'red')
show()
(1/4)(4√2 − 4) = √2 − 1.
plot([-2,2],[-2,2],color='black')
axes.add_patch(square)
axes.add_patch(circle1)
axes.add_patch(circle2)
axes.add_patch(circle3)
axes.add_patch(circle4)
axes.add_patch(circle)
Since the edge-length of the cube is 4, the radius of each blue ball is 1.
Since the length of the diagonal of the cube is 4√3, the radius of the red
ball is

(1/4)(4√3 − 4) = √3 − 1.
Notice there are 8 blue balls.
In two dimensions, when a region is scaled by a factor t, its area increases
by the factor t2 . In three dimensions, when a region is scaled by a factor t,
its volume increases by the factor t3 . We conclude: In d dimensions, when
a region is scaled by a factor t, its (d-dimensional) volume increases by the
factor td . This is called the scaling principle.
In d dimensions, the edge-length of the cube remains 4, the radius of
each blue ball remains 1, and there are 2^d blue balls. Since the length of the
diagonal of the cube is 4√d, the same calculation results in the radius of the
red ball equal to r = √d − 1.
By the scaling principle, the volume of the red ball equals r^d times the
volume of the blue ball. We conclude the following:
• Since r = √d − 1 = 1 exactly when d = 4, we have: In four dimensions,
the red ball and the blue balls are the same size.

• Since there are 2^d blue balls, the ratio of the volume of the red ball
over the total volume of all the blue balls is r^d/2^d.

• Since r^d = 2^d exactly when r = 2, and since r = √d − 1 = 2 exactly
when d = 9, we have: In nine dimensions, the volume of the red ball
equals the sum total of the volumes of all blue balls.

• Since r = √d − 1 > 2 exactly when d > 9, we have: In ten or more
dimensions, the red ball sticks out of the cube.

• Since the length of the semi-diagonal is 2√d, for any dimension d, the
radius of the red ball r = √d − 1 is less than half the length of the
semi-diagonal. As the dimension grows without bound, the proportion
of the diagonal covered by the red ball converges to 1/2.
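These claims are easy to check numerically; a small sketch (not from the text):

from numpy import sqrt

for d in [4, 9, 10]:
    r = sqrt(d) - 1                # radius of the red ball in d dimensions
    print(d, r, r**d / 2**d)       # last column: red-ball volume over total blue-ball volume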
The code for Figure 1.35 is as follows. For 3d plotting, the module mayavi
is better than matplotlib.
pm1 = [-1,1]
for center in product(pm1,pm1,pm1):
# blue balls: color (0,0,1)
ball(*center,1,(0,0,1))
# black wire cube: color (0,0,0)
outline(color=(0,0,0))
Chapter 2

Linear Geometry
v = array([1,2,3])
v.shape
v = Matrix([1,2,3])
v.shape
The first v.shape returns (3,), and the second v.shape returns (3,1). In
either case, v is a 3-dimensional vector.
Vectors are added component by component: With
we have
together are the standard basis. Similarly, in Rd , we have the standard basis
e1 , e2 , . . . , ed .
A = array([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
A.shape
A = Matrix([[1,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
A.shape
Note the transpose operation interchanges rows and columns: the rows of At
are the columns of A. In both numpy or sympy, the transpose of A is A.T.
A d-dimensional vector v may be written as a 1 × d matrix (a row vector)

v = (t1  t2  . . .  td),

or as a d × 1 matrix (a column vector), with the same entries arranged
vertically.
• A, B: any matrix
• Q: symmetric matrix
• P : projections
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
# 5x3 matrix
A = Matrix.hstack(u,v,w)
# column vector
b = Matrix([1,1,1,1,1])
# 5x4 matrix
M = Matrix.hstack(A,b)
In general, for any sympy matrix A, column vectors can be hstacked and
row vectors can be vstacked. For any matrix A, the code
returns True. Note we use the unpacking operator * to unpack the list, before
applying hstack.
In numpy, there is hstack and vstack, but we prefer column_stack and
row_stack, so the code
both return True. Here col refers to rows of At , hence refers to the columns
of A.
A.shape == (A.rows,A.cols)
A = zeros(2,3)
B = ones(2,2)
C = Matrix([[1,2],[3,4]])
D = B + C
E = 5 * C
F = eye(4)
A, B, C, D, E, F
returns

( [0 0 0; 0 0 0],  [1 1; 1 1],  [1 2; 3 4],  [2 3; 4 5],  [5 10; 15 20],  [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1] ).
A = diag(1,2,3,4)
B = diag(-1, ones(2, 2), Matrix([5, 7, 5]))
A, B
returns

( [1 0 0 0; 0 2 0 0; 0 0 3 0; 0 0 0 4],  [−1 0 0 0; 0 1 1 0; 0 1 1 0; 0 0 0 5; 0 0 0 7; 0 0 0 5] ).
It is straightforward to convert back and forth between numpy and sympy.
In the code
A = diag(1,2,3,4)
B = array(A)
C = Matrix(B)
For the Iris dataset, the mean (§1.3) is given by the following code.
iris = datasets.load_iris()
dataset = iris["data"]
m = mean(dataset,axis=0)
vectors = dataset - m
2.2 Products
Let t be a scalar, u, v, w be vectors, and let A, B be matrices. We already
know how to compute tu, tv, and tA, tB. In this section, we compute the dot
product u · v, the matrix-vector product Av, and the matrix-matrix product
AB.
These products are not defined unless the dimensions “match”. In numpy,
these products are written dot; in sympy, these products are written *.
In §1.4, we defined the dot product in two dimensions. We now generalize
to any dimension d. Suppose u, v are vectors in Rd . Then their dot product
u · v is the scalar obtained by multiplying corresponding features and then
summing the products. This only works if the dimensions of u and v agree.
In other words, if u = (s1 , s2 , . . . , sd ) and v = (t1 , t2 , . . . , td ), then
u · v = s1 t1 + s2 t2 + · · · + sd td . (2.2.1)
As in §1.4, we always have rows on the left, and columns on the right.
In Python,

u = array([1,2,3])
v = array([4, 5, 6])
dot(u,v)      # returns 32

u = Matrix([1,2,3])
v = Matrix([4, 5, 6])
u.T * v       # returns Matrix([[32]])

The length |v| = √(v · v) is then

sqrt(dot(v,v))

sqrt(v.T * v)
As in §1.4,

Dot Product

The dot product u · v (2.2.1) satisfies

u · v = |u| |v| cos θ,

where θ is the angle between u and v.

In two dimensions, this was equation (1.4.4) in §1.4. Since any two vectors
lie in a two-dimensional plane, this remains true in any dimension.
Based on this, we can compute the angle θ,

cos θ = (u · v)/(|u| |v|) = (u · v)/√((u · u)(v · v)).
Here is code for the angle θ,
def angle(u,v):
a = dot(u,v)
b = dot(u,u)
c = dot(v,v)
theta = arccos(a / sqrt(b*c))
return degrees(theta)
Cauchy-Schwarz Inequality

The dot product of two vectors is, in absolute value, less than or equal to
the product of their lengths,

|u · v| ≤ |u| |v|.
Vectors u and v satisfying u · v = 0 are orthogonal. With this understood,
the zero vector is orthogonal to every vector. The converse is true as well:
If u · v = 0 for every v, then in particular, u · u = 0, which implies u = 0.
Vectors v1 , . . . , vN are said to be orthonormal if they are both unit vectors
and orthogonal. Orthogonal nonzero vectors can be made orthonormal by
dividing each vector by its length.
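For example, a minimal sketch (the vectors below are arbitrary mutually orthogonal vectors):

from numpy import array
from numpy.linalg import norm

vectors = [array([1, 1, 0]), array([1, -1, 0]), array([0, 0, 2])]    # mutually orthogonal
orthonormal = [ v / norm(v) for v in vectors ]                       # now orthonormal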
|a + b| = (a + b) · v ≤ |a| + |b|.
A,B,dot(A,B)
A,B,A*B
returns

AB = [70 80 90; 158 184 210].
(Av)t = v t At .
dot(A,B).T == dot(B.T,A.T)
In terms of row vectors and column vectors, this is automatic. For exam-
ple,
(Au) · v = (Au)t v = (ut At )v = ut (At v) = u · (At v).
In Python,
dot(dot(A,u),v) == dot(u,dot(A.T,v))
dot(dot(A.T,u),v) == dot(u,dot(A,v))
As a consequence,¹

Then the identities (1.4.14) and (1.4.15) hold in general. Using the tensor
product, we have

¹ Iff is short for if and only if.
Tensor Identity
Let A be a matrix with rows v1 , v2 , . . . , vN . Then
At A = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN . (2.2.7)
Multiplying (2.2.7) by xt on the left and x on the right, and using (1.4.15),
we see (2.2.7) is equivalent to
By matrix-vector multiplication,
Ax = (v1 · x, v2 · x, . . . , vN · x).
Since |Ax|2 is the sum of the squares of its components, this derives (2.2.8).
and
∥A∥2 = trace(At A). (2.2.12)
By replacing A by At , the same results hold for columns.
Q = dot(vectors.T,vectors)/N
Q = cov(dataset,rowvar=False)
or
Q = cov(dataset.T)
After downloading the Iris dataset as in §2.1, the mean, covariance, and
total variance are

m = (5.84, 3.05, 3.76, 1.2),

Q = [0.68 −0.04 1.27 0.51; −0.04 0.19 −0.32 −0.12; 1.27 −0.32 3.09 1.29; 0.51 −0.12 1.29 0.58],          (2.2.14)

and total variance 4.54.
In Python,

from sklearn.preprocessing import StandardScaler

# standardize dataset
vectors = StandardScaler().fit_transform(dataset)
Qcorr = corrcoef(dataset.T)
Qcov = cov(vectors.T,bias=True)
allclose(Qcov,Qcorr)

returns True.
Ax = b. (2.3.1)
In this section, we use the inverse A−1 and the pseudo-inverse A+ to solve
(2.3.1).
However, it’s very easy to construct matrices A and vectors b for which
the linear system (2.3.1) has no solutions at all! For example, take A the
zero matrix and b any non-zero vector. Because of this, we must be careful
when solving (2.3.1).
Ax = b =⇒ x = A−1 b. (2.3.3)
# solving Ax=b
x = A.inv() * b
# solving Ax=b
x = dot(inv(A) , b)
x+ = A+ b =⇒ Ax+ = b.
• no solutions, or
• exactly one solution (for example, when A is invertible), or
• infinitely many solutions (for example, when A = 0 and b = 0).
The pseudo-inverse provides a single systematic procedure for deciding
among these three possibilities. The pseudo-inverse is available in numpy and
sympy as pinv. In this section, we focus on using Python to solve Ax = b,
postponing concepts to §2.6.
How do we use the above result? Given A and b, using Python, we
compute x = A+ b. Then we check, by multiplying in Python, equality of Ax
and b.
The rest of the section consists of examples of solving linear systems. The
reader is encouraged to work out the examples below in Python. However,
because some linear systems have more than one solution, and the imple-
mentation of Python on your laptop may be different than on my laptop, our
solutions may differ.
It can be shown that if the entries of A are integers, then the entries of
A+ are fractions. This fact is reflected in sympy, but not in numpy, as the
default in numpy is to work with floats.
Let
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
# arrange as columns
A = column_stack([u,v,w])
pinv(A)
returns

A+ = (1/150) [−37 −20 −3 14 31; −10 −5 0 5 10; 17 10 3 −4 −11].
Alternatively, in sympy,
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.pinv()
For
b3 = (−9, −3, 3, 9, 10),
we have

x+ = A+ b3 = (1/15)(82, 25, −32).
However, for this x+ , we have
We solve
Bx = u, Bx = v, Bx = w
by constructing the candidates
B + u, B + v, B + w,
Let

C = Aᵗ = [1 2 3 4 5; 6 7 8 9 10; 11 12 13 14 15]

and let f = (0, −5, −10). Then

C+ = (Aᵗ)+ = (A+)ᵗ = (1/150) [−37 −10 17; −20 −5 10; −3 0 3; 14 5 −4; 31 10 −11]

and

x+ = C+ f = (1/50)(32, 35, 38, 41, 44).
Once we confirm equality of Cx+ and f , which is the case, we obtain a
solution x+ of Cx = f .
returns
x = (t1 , t2 , . . . , td ).
Then
Ax = t1 v1 + t2 v2 + · · · + td vd , (2.4.2)
In other words,
t1 v1 + t2 v2 + · · · + td vd
of the vectors. For example, span(b) of a single vector b is the line through
b, and span(u, v, w) is the set of all linear combinations ru + sv + tw.
Span Definition I
The span of v1 , v2 , . . . , vd is the set S of all linear combinations of v1 ,
v2 , . . . , vd , and we write
S = span(v1 , v2 , . . . , vd ).
Span Definition II
Let A be the matrix with columns v1 , v2 , v3 , . . . , vd . Then
span(v1 , v2 , . . . , vd ) is the set S of all vectors of the form Ax.
span(v1 , v2 , . . . , vd ) = span(w1 , w2 , . . . , wN ).
Thus there are many choices of spanning vectors for a given span.
For example, let u, v, w be the columns of A in (2.3.4). Let ⊂ mean “is
contained in”. Then

span(u, v) ⊂ span(u, v, w),

since adding a third vector can only increase the linear combination possibilities.
On the other hand, since w = 2v − u, we also have

span(u, v, w) ⊂ span(u, v).

It follows that

span(u, v, w) = span(u, v).
Let A be a matrix. The column space of A is the span of its columns. For
A as in (2.3.4), the column space of A is span(u, v, w). The code
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.columnspace()

returns a minimal set of vectors spanning the column space of A. The column
rank of A is the number of vectors returned.
For example, for A as in (2.3.4), this code returns
Ax = t1 v1 + t2 v2 + · · · + td vd .
By (2.4.3),
orth(A)
This code returns two orthonormal vectors b1 /|b1 | and b2 /|b2 |, where
For example, let b3 = (−9, −3, 3, 9, 10) and let Ā = (A, b3 ). Using Python,
check the column rank of Ā is 3. Since the column rank of A is 2, we conclude
b3 is not in the column space of A: b3 is not a linear combination of u, v, w.
When (2.4.6) holds, b is a linear combination of the columns of A. How-
ever, (2.4.6) does not tell us which linear combination. According to (2.4.3),
finding the linear combination is equivalent to solving Ax = b.
d-dimensional Space
Rd is a span.
span(a, b, c, d, e) = span(a, f ).
t1 v1 + t2 v2 + · · · + td vd = 0.
ru + sv + tw = 1u − 2v + 1w = 0 (2.4.8)
u = −(s/r)v − (t/r)w.
v = −(r/s)u − (t/s)w.
If t ̸= 0, then
w = −(r/t)u − (s/t)v.
A.nullspace()
This says the null space of A consists of all multiples of (1, −2, 1). Since the
code
[r,s,t] = A.nullspace()[0]
null_space(A)
A Versus At A
Let A be any matrix. The null space of A equals the null space of At A.
|Ax|2 = Ax · Ax = x · At Ax = 0,
t1 v1 + t2 v2 + · · · + td vd = 0.
Take the dot product of both sides with v1 . Since the dot products of any
two vectors is zero, and each vector has length one, we obtain
t1 = t1 v1 · v1 = t1 v1 · v1 + t2 v2 · v1 + · · · + td vd · v1 = 0.
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
C = row_stack([u,v,w])
null_space(B)
Ax = (v1 · x, v2 · x, . . . , vN · x).
v1 · x = 0, v2 · x = 0, . . . , vN · x = 0,
Since the row space is the orthogonal complement of the null space, and
the null space of A equals the null space of At A, we conclude
A Versus At A
Let A be any matrix. Then the row space of A equals the row space
of At A.
Now replace A by At in this last result. Since the row space of At equals
the column space of A, and AAt is symmetric, we also have
A Versus AAt
Let A be any matrix. Then the column space of A equals the column
space of AAt .
• If x1 and x2 are in the null space, and r1 and r2 are scalars, then so is
r1 x1 + r2 x2 , because
m = (x1 + x2 + · · · + xN)/N.
Center the dataset (see §1.3)
v1 = x1 − m, v2 = x2 − m, . . . , vN = xN − m,
v1 · b, v2 · b, . . . , vN · b.
b · Qb = 0.
b · (x − m) = 0.
b · (x − m) = 0.
a(x − x0 ) + b(y − y0 ) = 0, or ax + by = c,
point in the plane, then (x, y, z) − (x0 , y0 , z0 ) is orthogonal to (a, b, c), so the
equation of the plane is
(a, b, c) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or ax + by + cz = d,
where d = ax0 + by0 + cz0 .
Suppose we have a dataset in R3 with mean m = (3, 2, 1), and covariance
Q = [1 1 1; 1 1 1; 1 1 1].          (2.5.2)
Let b = (2, −1, −1). Then Qb = 0, so b · Qb = 0. We conclude the dataset
lies in the plane
(2, −1, −1) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or 2x − y − z = 3.
In this case, the dataset is two-dimensional, as it lies in a plane.
If a dataset has covariance the 3 × 3 identity matrix I, then b · Ib is never
zero unless b = 0. Such a dataset is three-dimensional, it does not lie in a
plane.
Sometimes there may be several zero variance directions. For example,
for the covariance (2.5.2) and u = (2, −1, −1), v = (0, 1, −1), we have both
u · Qu = 0 and v · Qv = 0.
From this we see the dataset corresponding to this Q lies in two planes: The
plane orthogonal to u, and the plane orthogonal v. But the intersection of
two planes is a line, so this dataset lies in a line, which means it is one-
dimensional.
Which line does this dataset lie in? Well, the line has to pass through the
mean, and is orthogonal to u and v. If we find a vector b satisfying b · u = 0
and b · v = 0, then the line will pass through m and will be parallel to b. But
we know how to find such a vector. Let A be the matrix with rows u, v. Then
b in the nullspace of A fulfills the requirements. We obtain b = (1, 1, 1).
Based on the above result, here is code that returns zero variance direc-
tions.
def zero_variance(dataset):
Q = cov(dataset.T)
return null_space(Q)
(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15), (16, 17, 18, 19, 20).
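For example, applied to the four points above (a sketch; zero_variance is the function defined above, and null_space is scipy.linalg.null_space):

from numpy import array
from scipy.linalg import null_space

dataset = array([[ 1,  2,  3,  4,  5],
                 [ 6,  7,  8,  9, 10],
                 [11, 12, 13, 14, 15],
                 [16, 17, 18, 19, 20]])
zero_variance(dataset)    # a 5 x 4 array whose columns are orthonormal zero variance directions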
2.6 Pseudo-Inverse
What exactly is the pseudo-inverse? It turns out the answer is best under-
stood geometrically.
Think of b and Ax as points, and measure the distance between them,
and think of x and the origin 0 as points, and measure the distance between
them (Figure 2.1).
(Figure 2.1: the point x in the source space, its image Ax in the target space, and the target point b.)
Even though the point x+ may not solve Ax = b, this procedure (Figure
2.2) results in a uniquely determined x+ : While there may be several points
x∗ , there is only one x+ .
Figure 2.2: The points x, Ax, the points x∗ , Ax∗ , and the point x+ .
The results in this section are as follows. Let A be any matrix. There
is a unique matrix A+ — the pseudo-inverse of A — with the following
properties.
• x+ = A+ b is a solution of (2.3.1) whenever (2.3.1) is solvable.

• In either case, x+ is a solution of the regression equation

Aᵗ Ax = Aᵗ b.          (2.6.2)
Zero Residual
x is a solution of (2.3.1) iff the residual is zero.
x + 6y + 11z = −9
2x + 7y + 12z = −3
3x + 8y + 13z = 3          (2.6.3)
4x + 9y + 14z = 9
5x + 10y + 15z = 10
Let b be any vector, not necessarily in the column space of A. To see how
close we can get to solving (2.3.1), we minimize the residual (2.6.1). We say
x∗ is a residual minimizer if |Ax∗ − b| ≤ |Ax − b| for every x.
Regression Equation
x∗ is a residual minimizer iff x∗ solves the regression equation.
At (Ax∗ − b) · v = (Ax∗ − b) · Av = 0.
At (Ax∗ − b) = 0,
Multiple Solutions
Any two residual minimizers differ by a vector in the nullspace of A.
Since we know from above there is a residual minimizer in the row space
of A, we always have a minimum norm residual minimizer.
Let v be in the null space of A, and write
x∗ · v ≥ 0.
Since both ±v are in the null space of A, this implies ±x∗ · v ≥ 0, hence
x∗ · v = 0. Since the row space is the orthogonal complement of the null
space, the result follows.
Uniqueness
If x1+ and x2+ are minimum norm residual minimizers, then v = x1+ − x2+
is in both the null space and the row space of A, hence v = 0 and x1+ = x2+.
We know any two solutions of the linear system (2.3.1) differ by a vector in
the null space of A (2.4.10), and any two solutions of the regression equation
(2.6.2) differ by a vector in the null space of A (above).
If x is a solution of (2.3.1), then, by multiplying by At , x is a solution of
the regression equation (2.6.2). Since x+ = A+ b is a solution of the regression
equation, x+ = x + v for some v in the null space of A, so
Ax+ = A(x + v) = Ax + Av = b + 0 = b.
This shows x+ is a solution of the linear system. Since all other solutions
differ by a vector v in the null space of A, this establishes the result.
Now we can state when Ax = b is solvable,
Solvability of Ax = b
Properties of Pseudo-Inverse
A. AA+ A = A
B. A+ AA+ = A+
C. AA+ is symmetric          (2.6.8)
D. A+ A is symmetric
u = A+ Au + v. (2.6.9)
Au = AA+ Au.
A+ w = A+ AA+ w + v
for some v in the null space of A. But both A+ w and A+ AA+ w are in the
row space of A, hence so is v. Since v is in both the null space and the row
space, v is orthogonal to itself, so v = 0. This implies A+ AA+ w = A+ w.
Since w was any vector, we obtain B.
Since A+ b solves the regression equation, At AA+ b = At b for any vector
b. Hence At AA+ = At . Let P = AA+ . Now
(x − A+ Ax) · A+ Ay = 0.
x · P y = P x · P y = x · P tP y
Also we have
2.7 Projections
In this section, we study projection matrices P , and we show
(Figure 2.3: the projection Pb = tu of b onto the line span(u), and the orthogonal part b − Pb.)
Let u be a unit vector, and let b be any vector. Let span(u) be the line
through u (Figure 2.3). The projection of b onto span(u) is the vector v in
span(u) that is closest to b.
It turns out this closest vector v equals P b for some matrix P , the pro-
jection matrix. Since span(u) is a line, the projected vector P b is a multiple
tu of u.
From Figure 2.3, b − P b is orthogonal to u, so
0 = (b − P b) · u = b · u − P b · u = b · u − t u · u = b · u − t.
is already on the line. If U is the matrix with the single column u, we obtain
P = U U t.
To summarize, the projected vector is the vector (b · u)u, and the reduced
vector is the scalar b · u. If U is the matrix with the single column u, then
the reduced vector is U t b and the projected vector is U U t b.
(Figure: the projection Pb of b onto a plane, and the orthogonal part b − Pb.)
(b − P b) · u = 0 and (b − P b) · v = 0.
r = b · u, s = b · v.
2. P b = b if b is in S,
P = AA+ . (2.7.2)
establishing 3.
def project(A,b):
Aplus = pinv(A)
x = dot(Aplus,b) # reduced
return dot(A,x) # projected
For A as in (2.3.4) and b = (−9, −3, 3, 9, 10), the reduced vector onto the
column space of A is

x = A+ b = (1/15)(82, 25, −32),

and the projected vector onto the column space of A is

Pb = Ax = AA+ b = (−8, −3, 2, 7, 12).

The projection matrix onto the column space of A is

P = AA+ = (1/10) [6 4 2 0 −2; 4 3 2 1 0; 2 2 2 2 2; 0 1 2 3 4; −2 0 2 4 6].
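These values are easily checked numerically; a sketch, with A the numpy matrix built in §2.3:

from numpy import array, dot, allclose
from numpy.linalg import pinv

P = dot(A, pinv(A))                        # projection onto the column space of A
b = array([-9, -3, 3, 9, 10])
allclose(dot(P, b), [-8, -3, 2, 7, 12])    # returns True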
P = A+ A. (2.7.3)
def project_to_ortho(U,b):
x = dot(U.T,b) # reduced
return dot(U,x) # projected
dataset vk in Rd , k = 1, 2, . . . , N
reduced U t vk in Rn , k = 1, 2, . . . , N
projected U U t vk in Rd , k = 1, 2, . . . , N
If S is a span in Rd , then
Rd = S ⊕ S ⊥ . (2.7.5)
v = P v + (v − P v),
and the null space and row space are orthogonal to each other.
P = I − A+ A. (2.7.7)
But this was already done in §2.3, since P b = AA+ b = Ax+ where x+ = A+ b
is a residual minimizer.
2.8 Basis
(Figure: spanning vectors and linearly independent vectors; orthogonal and orthonormal vectors; bases, orthogonal bases, and orthonormal bases, and how they relate.)
Span of N Vectors
e1 = (1, 0, . . . , 0),
e2 = (0, 1, 0, . . . , 0),
... = ...
ed = (0, 0, . . . , 0, 1),
The dimension of Rd is d.
matrix_rank(vectors)
In particular, since 712 < 784, approximately 10% of pixels are never
touched by any image. For example, a likely pixel to remain untouched is
at the top left corner (0, 0). For this dataset, there are 72 = 784 − 712 zero
variance directions.
We pose the following question: What is the least n for which the first n
images are linearly dependent? Since the dimension of the feature space is
784, we must have n ≤ 784. To answer the question, we compute the rank
of the first n vectors for n = 1, 2, 3, . . . , and continue until we have linear
dependence of v1 , v2 , . . . , vn .
If we save the MNIST dataset as a centered array vectors, as in §2.1,
and run the code below, we obtain n = 560 (Figure 2.7). matrix_rank is
discussed in §2.9.
def find_first_defect(vectors):
    d = len(vectors[0])
    previous = 0
    for n in range(len(vectors)):
        r = matrix_rank(vectors[:n+1,:])
        print((r,n+1),end=",")
        if r == previous: break
        if r == d: break
        previous = r
This we call the dimension staircase. For example, Figure 2.8 is the
dimension staircase for
v1 = (1, 0, 0), v2 = (0, 1, 0), v3 = (1, 1, 0), v4 = (3, 4, 0), v5 = (0, 0, 1).
In Figure 2.8, we call the points (3, 2) and (4, 2) defects.
In the code, the staircase is drawn by stairs(Y, X), where the vertical values Y
and the horizontal points X satisfy len(X) == len(Y)+1. In Figure 2.8,
X = [1,2,3,4,5,6], and Y = [1,2,2,2,3].
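For instance, the staircase in Figure 2.8 can be drawn directly (a sketch, assuming the matplotlib star imports used throughout):

grid()
stairs([1, 2, 2, 2, 3], [1, 2, 3, 4, 5, 6])
show()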
With the MNIST dataset loaded as vectors, here is code returning Figure
2.9. This code is not efficient, but it works. It takes 57041 vectors in the
dataset to fill up 712 dimensions.
def dimension_staircase(vectors):
    d = vectors[0].size
    N = len(vectors)
    rmax = matrix_rank(vectors)
    dimensions = [ ]
    basis = [ ]
    for n in range(1,N):
        r = matrix_rank(vectors[:n,:])
        print((r,n),end=",")
        dimensions.append(r)
        if r == rmax: break
    stairs(dimensions, range(n+1))
span(v1 , v2 , . . . , vN ) = span(v2 , v3 , . . . , vN ).
span(v1 , v2 , . . . , vN ) = span(b1 , b2 , . . . , bd ),
v1 is a linear combination of b1 , b2 , . . . , bd ,
v1 = t1 b1 + t2 b2 + · · · + td bd .
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , b2 , b3 , . . . , bd ).
Repeating the same logic, v2 is a linear combination of v1 , b2 , b3 , . . . , bd ,
v2 = s1 v1 + t2 b2 + t3 b3 + · · · + td bd .
b2 = (1/t2)(v2 − s1 v1 − t3 b3 − · · · − td bd).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , b3 , b4 , . . . , bd ).
b3 = (1/t3)(v3 − s1 v1 − s2 v2 − t4 b4 − · · · − td bd).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , v3 , b4 , b5 , . . . , bd ).
span(v1 , v2 , . . . , vN ) = · · · = span(v1 , v2 , . . . , vd ).
2.9 Rank
If A is an N ×d matrix, then (Figure 2.10) x 7→ Ax is a linear transformation
that sends a vector x in Rd (the source space) to the vector Ax in RN (the
target space). The transpose At goes in the reverse direction: The linear
transformation b 7→ At b sends a vector b in RN (the target space) to the
vector At b in Rd (the source space).
It follows that for an N × d matrix, the dimension of the source space is
d, and the dimension of the target space is N ,
dim(source space) = d, dim(target space) = N.
(Figure 2.10: the linear transformation A from the source space R³ to the target space R⁵, and the transpose Aᵗ going in the reverse direction, with x, Ax, b, Aᵗb.)
By (2.4.2), the column space is in the target space, and the row space is
in the source space. Thus we always have
0 ≤ row rank ≤ d and 0 ≤ column rank ≤ N.
For A as in (2.3.4), the column rank is 2, the row rank is 2, and the nullity
is 1. Thus the column space is a 2-d plane in R5 , the row space is a 2-d plane
in R3 , and the null space is a 1-d line in R3 .
Rank Theorem
Let A be any matrix. Then
A.rank()
matrix_rank(A)
returns the rank of a matrix. The main result implies rank(A) = rank(At ),
so
For any N × d matrix, the rank is never greater than min(N, d).
C = CI = CAB = IB = B,
so B = C is the inverse of A.
The first two assertions are in §2.2. For the last assertion, assume U
is a square matrix. From §2.4, orthonormality of the rows implies linear
Orthogonal Matrix
A matrix U is orthogonal iff its rows are an orthonormal basis iff its
columns are an orthonormal basis.
Since
U u · U v = u · U t U v = u · v,
U preserves dot products. Since lengths are dot products, U also preserves
lengths. Since angles are computed from dot products, U also preserves
angles. Summarizing,
As a consequence,
I = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vd ⊗ vd .
and
|v|2 = |v · v1 |2 + |v · v2 |2 + · · · + |v · vd |2 . (2.9.4)
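A quick numerical check of these identities with an orthonormal basis of eigenvectors (a sketch; outer is numpy's tensor product, and the matrix is arbitrary):

from numpy import array, allclose, outer, eye
from numpy.linalg import eigh

Q = array([[2.0, 1.0], [1.0, 2.0]])
lamda, U = eigh(Q)                       # the columns of U are an orthonormal basis v1, v2
I2 = sum([ outer(v, v) for v in U.T ])   # v1 ⊗ v1 + v2 ⊗ v2
allclose(I2, eye(2))                     # returns True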
To derive the main result, first we recall (2.7.6). From the definition of
dimension, we can rewrite (2.7.6) as
Ax = A(u + v) = Au + Av = Av.
This shows the column space consists of vectors of the form Av with v in the
row space.
Let v1 , v2 , . . . , vr be a basis for the row space. From the previous para-
graph, it follows Av1 , Av2 , . . . , Avr spans the column space of A. We claim
Av1 , Av2 , . . . , Avr are linearly independent. To check this, we write
If v is the vector t1 v1 +t2 v2 +· · ·+tr vr , this shows v is in the null space. But v
is a linear combination of basis vectors of the row space, so v is also in the row
space. Since the row space is the orthogonal complement of the null space,
we must have v orthogonal to itself. Thus v = 0, or t1 v1 +t2 v2 +· · ·+tr vr = 0.
But v1 , v2 , . . . , vr is a basis. By linear independence of v1 , v2 , . . . , vr , we
conclude t1 = 0, . . . , tr = 0. This establishes the claim, hence Av1 , Av2 , . . . ,
Avr is a basis for the column space. This shows r is the dimension of the
column space, which is by definition the column rank. Since by construction,
r is also the row rank, this establishes the rank theorem.
Chapter 3
Principal Components
To keep things simple, assume both the source space and the target space
are R2 ; then A is 2 × 2.
The unit circle (in red in Figure 3.1) is the set of vectors u satisfying
|u| = 1. The image of the unit circle (also in red in Figure 3.1) is the set of
vectors of the form
{Au : |u| = 1}.
The annulus is the set (the region between the dashed circles in Figure 3.1)
of vectors b satisfying
{b : σ2 < |b| < σ1 }.
It turns out the image is an ellipse, and this ellipse lies in the annulus.
Thus the numbers σ1 and σ2 constrain how far the image of the unit circle
is from the origin, and how near the image is to the origin.
This shows the image of the unit circle is the inverse covariance ellipse (§1.6)
corresponding to the covariance Q, with major axis length 2σ1 and minor
axis length 2σ2 .
These reflect vectors across the horizontal axis, and across the vertical axis.
Recall an orthogonal matrix is a matrix U satisfying U t U = I = U U t
(2.9.2). Every orthogonal matrix U is a rotation V or a rotation times a
reflection V R.
The SVD decomposition (§3.3) states that every matrix A can be written
as a product
A = [a b; c d] = USV.
Here S is a diagonal matrix as above, and U , V are orthogonal and rotation
matrices as above.
In more detail, apart from a possible reflection, there are scalings σ1 and
σ2 and angles α and β, so that A transforms vectors by first rotating by α,
then scaling by (σ1 , σ2 ), then by rotating by β (Figure 3.2).
(Figure 3.2: the action of A = USV: first rotate by V, then scale by S, then rotate by U.)
Av = λv (3.2.1)
(Figure: singular data σ, u, v, with row rank and column rank, are defined for any matrix; eigendata λ, v are defined for square matrices; for covariance matrices λ ≥ 0, and for invertible matrices λ ≠ 0.)
A = array([[2,1],[1,2]])
lamda, U = eig(A)
lamda
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda
(A − λI)v = Av − λv = 0.
v · Qv = v · λv = λv · v = λ.
µu · v = u · (µv) = u · Qv = v · Qu = v · (λu) = λu · v.
This implies
(µ − λ)u · v = 0.
If λ ̸= µ, we must have u · v = 0. We conclude:
QU = U E. (3.2.3)
allclose(dot(Q,v), lamda*v)
returns True.
λ1 ≥ λ2 ≥ · · · ≥ λd .
Diagonalization (EVD)
In other words, with the correct choice of orthonormal basis, the matrix
Q becomes a diagonal matrix E.
The orthonormal basis eigenvectors v1 , v2 , . . . , vd are the principal compo-
nents of the matrix Q. The eigenvalues and eigenvectors of Q, taken together,
are the eigendata of Q. The code
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda, U
returns the eigenvalues [1, 3] and the matrix U = [u, v] with columns

u = (1/√2, −1/√2),   v = (1/√2, 1/√2).
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
V = U.T
E = diag(lamda)
allclose(Q, dot(U, dot(E, V)))

returns True.
init_printing()
# eigenvalues
Q.eigenvals()
# eigenvectors
Q.eigenvects()
U, E = Q.diagonalize()
rank(Q) = rank(E) = r.
Using sympy,
Q = Matrix([[2,1],[1,2]])
U, E = Q.diagonalize()
display(U,E)
returns

U = [1 1; −1 1],   E = [1 0; 0 3].
Also,
Q = Matrix([[a,b ],[b,c]])
U, E = Q.diagonalize()
display(Q,U,E)
returns

Q = [a b; b c],   U = [(a − c − √D)/(2b)   (a − c + √D)/(2b); 1   1],

and

E = (1/2) [a + c − √D   0; 0   a + c + √D],   D = (a − c)² + 4b².
display is used to pretty-print the output.
Pseudo-Inverse (EVD)
Qx = b
has a solution x for every vector b iff all eigenvalues are nonzero, in
which case
x = (1/λ1)(b · v1)v1 + (1/λ2)(b · v2)v2 + · · · + (1/λd)(b · vd)vd.          (3.2.5)
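A numerical check of (3.2.5) (a sketch; the matrix and vector below are arbitrary):

from numpy import array, dot, allclose
from numpy.linalg import eigh, solve

Q = array([[2.0, 1.0], [1.0, 2.0]])
b = array([1.0, 5.0])
lamda, U = eigh(Q)
# expand b in the eigenvector basis and divide by the eigenvalues, as in (3.2.5)
x = sum([ dot(b, U[:,k]) / lamda[k] * U[:,k] for k in range(len(lamda)) ])
allclose(x, solve(Q, b))    # returns True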
trace(Q) = λ1 + λ2 + · · · + λd . (3.2.6)
Q2 is symmetric with eigenvalues λ21 , λ22 , . . . , λ2d . Applying the last result to
Q2 , we have
(Figure: the vectors ±√λ1 v1 and ±√λ2 v2 along the principal axes.)
Q = λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd . (3.2.7)
v = (v · v1 ) v1 + (v · v2 ) v2 + · · · + (v · vd ) vd .
covariance matrix

(2/(2d)) (b1 ⊗ b1 + b2 ⊗ b2 + · · · + bd ⊗ bd)

equals Q/d.
where the maximum is over all unit vectors v. We say a unit vector b is best-fit
for Q or best-aligned with Q if the maximum is achieved at v = b: λ1 = b · Qb.
When Q is a covariance matrix, this means the unit vector b is chosen so that
the variance b · Qb of the dataset projected onto b is maximized.
An eigenvalue λ1 of Q is the top eigenvalue if λ1 ≥ λ for any other
eigenvalue. An eigenvalue λ1 of Q is the bottom eigenvalue if λ1 ≤ λ for any
other eigenvalue.
A Calculation
Suppose λ, a, b, c, d are real numbers and suppose we know
λ + at + bt2
≤ λ, for all t real.
1 + ct + dt2
Then a = λc.
λ1 ≥ v · Qv = v · (λv) = λv · v = λ.
λ1 = v1 · Qv1 ≥ v · Qv (3.2.9)
for all unit vectors v. Let u be any vector. Then for any real t,
v1 + tu
v=
|v1 + tu|
u · Qv1 = λ1 u · v1
156 CHAPTER 3. PRINCIPAL COMPONENTS
u · (Qv1 − λ1 v1 ) = 0
Just as the maximum variance (3.2.8) is the top eigenvalue λ1 , the mini-
mum variance
λd = min v · Qv, (3.2.10)
|v|=1
λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λd .
3.2. EIGENVALUE DECOMPOSITION 157
v1 , v2 , v3 , . . . , vd .
T = S⊥
S
v1
v3
v2
Sλ = {v : Qv = λv}
the eigenspace corresponding to λ. For example, suppose the top three eigen-
values are equal: λ1 = λ2 = λ3 , with b1 , b2 , b3 the corresponding eigenvectors.
Calling this common value λ, the eigenspace is Sλ = span(b1 , b2 , b3 ). Since
b1 , b2 , b3 are orthonormal, dim(Vλ ) = 3. In Python, the eigenspaces Vλ are
obtained by the matrix U above: The columns of U are an orthonormal basis
for the entire space, so selecting the columns corresponding to a specific λ
yields an orthonormal basis for Sλ .
Let (evs,U) be the list of eigenvalues and matrix U whose columns are
the eigenvectors. Then the eigenvectors are the rows of U t . Here is code for
selecting just the eigenvectors corresponding to eigenvalue s.
lamda, U = eigh(Q)
V = U.T
V[isclose(lamda,s)]
The function isclose(a,b) returns True when a and b are numerically close.
Using this boolean, we extract only those rows of V whose corresponding
eigenvalue is close to s.
3.2. EIGENVALUE DECOMPOSITION 159
All this can be readily computed in Python. For the Iris dataset, we have
the covariance matrix in (2.2.14). The eigenvalues are
4.54 = trace(Q) = λ1 + λ2 + λ3 + λ4 .
For the Iris dataset, the top eigenvalue is λ1 = 4.2, it has multiplicity 1, and
its corresponding list of eigenvectors contains only one eigenvector,
def row(i,d):
v = [0]*d
v[i] = 2
if i > 0: v[i-1] = -1
if i < d-1: v[i+1] = -1
if i == 0: v[d-1] += -1
if i == d-1: v[0] += -1
return v
# using sympy
from sympy import Matrix
# using numpy
from numpy import *
m1 m2
x1 x2
To explain where these matrices come from, look at the mass-spring sys-
tems in Figures 3.6 and 3.7. Here we have springs attached to masses and
162 CHAPTER 3. PRINCIPAL COMPONENTS
walls on either side. At rest, the springs are the same length. When per-
turbed, some springs are compressed and some stretched. In Figure 3.6, let
x1 and x2 denote the displacement of each mass from its rest position.
When extended by x, each spring fights back by exerting a force kx
proportional to the displacement x. For example, look at the mass m1 . The
spring to its left is extended by x1 , so exerts a force of −kx1 . Here the minus
indicates pulling to the left. On the other hand, the spring to its right is
extended by x2 − x1 , so it exerts a force +k(x2 − x1 ). Here the plus indicates
pulling to the right. Adding the forces from either side, the total force on
m1 is −k(2x1 − x2 ). For m2 , the spring to its left exerts a force −k(x2 − x1 ),
and the spring to its right exerts a force −kx2 , so the total force on m2 is
−k(2x2 − 2x1 ). We obtain the force vector
2x1 − x2 2 −1 x1
−k = −k .
−x1 + 2x2 −1 2 x2
However, as you can see, the matrix here is not exactly Q(2).
m1 m2 m3 m4 m5
x1 x2 x3 x4 x5
vector
2x1 − x2 2 −1 0 0 0 x1
−x1 + 2x2 − x3 −1 2 −1 0 0 x2
−k −x2 + 2x3 − x4 = −k 0 −1
2 −1 0 x3 .
−x3 + 2x4 − x5 0 0 −1 2 −1 x4
−x4 + 2x5 0 0 0 −1 2 x5
But, again, the matrix here is not Q(5). Notice, if we place one mass and
two springs in Figure 3.6, we obtain the 1 × 1 matrix 2.
To obtain Q(2) and Q(5), we place the springs along a circle, as in Figures
3.8 and 3.9. Now we have as many springs as masses. Repeating the same
logic, this time we obtain Q(2) and Q(5). Notice if we place one mass and
one spring in Figure 3.8, d = 1, we obtain the 1 × 1 matrix Q(1) = 0: There
is no force if we move a single mass around the circle, because the spring is
not being stretched.
m1 m2 m2
m1
m1 m1
m2
m2
m5 m5
m3 m4
m4 m3
p(t) = 2 − t − td−1 ,
and let
1
ω
ω2
v1 = .
ω3
..
.
ω d−1
Then Qv1 is
1
2 − ω − ω d−1
−1 + 2ω − ω 2
ω
−ω + 2ω 2 − ω 3
ω2
Qv1 = = p(ω) = p(ω)v1 .
.. ω3
. ..
d−2 d−1
.
−ω + 2ω −1 d−1
ω
vk = 1, ω k , ω 2k , ω 3k , . . . , ω (d−1)k .
By (1.5.7),
Eigenvalues of Q(d)
Q(2) = (4, 0)
Q(3) = (3, 3, 0)
Q(4) = (4, 2, 2, 0)
√ √ √ √ !
5 5 5 5 5 5 5 5
Q(5) = + , + , − , − ,0
2 2 2 2 2 2 2 2
Q(6) = (4, 3, 3, 1, 1, 0)
√ √ √ √
Q(8) = (4, 2 + 2, 2 + 2, 2, 2, 2 − 2, 2 − 2, 0)
√ √ √ √
5 5 5 5 3 5 3 5
Q(10) = 4, + , + , + , + ,
2 2 2 2 2 2 2 2
√ √ √ √ !
5 5 5 5 3 5 3 5
− , − , − , − ,0
2 2 2 2 2 2 2 2
√ √ √ √
Q(12) = 4, 2 + 3, 2 + 3, 3, 3, 2, 2, 1, 1, 2 − 3, 2 − 3, 0 .
3.2. EIGENVALUE DECOMPOSITION 167
The matrices Q(d) are circulant matrices. Each row in Q(d) is obtained
from the row above by shifting the entries to the right. The trick of using
the roots of unity to compute the eigenvalues and eigenvectors works for any
circulant matrix.
Our last topic is the distribution of the eigenvalues for large d. How are
the eigenvalues scattered? Figure 3.10 plots the eigenvalues for Q(50) using
the code below.
d = 50
lamda = eigh(Q(d))[0]
stairs(lamda,range(d+1),label="numpy")
k = arange(d)
lamda = 2 - 2*cos(2*pi*k/d)
sorted = sort(lamda)
scatter(k,lamda,s=5,label="unordered")
scatter(k,sorted,c="red",s=5,label="increasing order")
legend()
show()
Figure 3.10 shows the eigenvalues tend to cluster near the top λ1 ∼ 4 and
the bottom λd = 0 for d large. Using the double-angle formula,
2 πk
λk = 4 sin , k = 0, 1, 2, . . . , d − 1.
d
Solving for k/d in terms of λ, and multiplying by two to account for the dou-
ble multiplicity, we obtain2 the proportion of eigenvalues below threshhold
λ,
1√
#{k : λk ≤ λ} 2
≈ arcsin λ , 0 ≤ λ ≤ 4. (3.2.13)
d π 2
2
This is an approximate equality: The ratio of the two sides approaches 1 as d → ∞.
168 CHAPTER 3. PRINCIPAL COMPONENTS
Equivalently, the derivative of the arcsine law (3.2.13) exhibits (see (7.1.9))
the eigenvalue clustering near the ends (Figure 3.11).
lamda = arange(0.1,3.9,.01)
density = 1/(pi*sqrt(lamda*(4-lamda)))
plot(lamda,density)
# r"..." means raw string
f = r"$\displaystyle\frac1{\pi\sqrt{\lambda(4-\lambda)}}$"
text(.5,.45,f,usetex=True,fontsize="x-large")
show()
When this happens, v is a right singular vector and u is a left singular vector
associated to σ.
Some books allow singular values to be zero. Here we insist that sin-
gular values be positive. Contrast singular values with eigenvalues: While
eigenvalues may be negative or zero, for us singular values are positive.
The definition immediately implies
Then Av = λv implies λ = 1 and v = (1, 0). Thus A has only one eigenvalue
equal to 1. Set
t 1 1
Q=AA= .
1 2
Since Q is symmetric, Q has two eigenvalues λ1 , λ2 and corresponding eigen-
vectors v1 , v2 . Moreover, as we saw in an earlier section, v1 , v2 may be chosen
orthonormal.
The eigenvalues of Q are given by
0 = det(Q − λI) = λ2 − 3λ + 1.
Qv = At Av = At (σu) = σ 2 v. (3.3.2)
Thus v1 , u1 are right and left singular vectors corresponding to the singular
value σ1 of A. Similarly, if we set u2 = Av2 /σ2 , then v2 , u2 are right and left
singular vectors corresponding to the singular value σ2 of A.
We show v1 , v2 are orthonormal, and u1 , u2 are orthonormal. We already
know v1 , v2 are orthonormal, because we chose them that way, as eigenvectors
of the symmetric matrix Q. Also
0 = λ1 v1 ·v2 = Qv1 ·v2 = (At Av1 )·v2 = (Av1 )·(Av2 ) = σ1 u1 ·σ2 u2 = σ1 σ2 u1 ·u2 .
A Versus Q
Let A be any matrix. Then
Since the rank equals the dimension of the row space, the first part follows
from §2.4.
If Av = σu and At u = σv, then
Qv = At Av = At (σu) = σAt u = σ 2 v,
so v is an eigenvector of At A corresponding
√ to λ = σ 2 > 0. Conversely, If
Qv = λv and λ > 0, then set σ = λ and u = Av/σ. Then
Avk = σk uk , At uk = σk vk , k = 1, 2, . . . , r, (3.3.3)
and
Avk = 0, At uk = 0 for k > r.
The proof is very simple once we remember the rank of Q equals the
number of positive eigenvalues of Q. By the eigenvalue decomposition, there
is an orthonormal basis of the source space v1 , v2 , . . . and λ1 ≥ λ2 ≥ · · · ≥
λr > 0 such that √ Qvk = λk vk , k = 1, . . . , r, and Qvk = 0, k > r.
Setting σk = λk and uk = Avk /σk , k = 1, . . . , r, as in our first exam-
ple, we have (3.3.3), and, again as in our first example, u1 , u2 , . . . , ur are
orthonormal.
Assume A is N × d. Then the source space is Rd , and the target space
is RN . By construction, vr+1 , vr+2 , . . . , vd is an orthonormal basis for the
null space of A. Set u1 = Av1 /σ1 , u2 = Av2 /σ2 , . . . , ur = Avr /sr . Since
Avr+1 = 0, . . . , Avd = 0, u1 , u2 , . . . , ur is an orthonormal basis for the
column space of A.
Since the column space of A is the row space of At , the column space
of A is the orthogonal complement of the nullspace of At (2.7.6). Choose
ur+1 , ur+2 , . . . , uN any orthonormal basis for the nullspace of At . Then
{u1 , u2 , . . . , ur } and {ur+1 , ur+2 , . . . , uN } are orthogonal. From this, u1 , u2 ,
. . . , uN is an orthonormal basis for the target.
3.3. SINGULAR VALUE DECOMPOSITION 173
For our second example, let a and b be nonzero vectors, possibly of dif-
ferent sizes, and let A be the matrix
A = a ⊗ b, At = b ⊗ a.
Then
Av = (v · b)a = σu and At u = (u · a)b = σv.
Since the range of A equals span(a), the rank of A equals one.
Since σ > 0, v is a multiple of b and u is a multiple of a. If we write
v = tb and u = sa and plug in, we get
Thus there is only one singular value of A, equal to |a| |b|. This is not
surprising since the rank of A is one.
In a similar manner, one sees the only singular value of the 1 × n matrix
A = a equals σ = |a|.
Our third example is
0 0 0 0
1 0 0 0
A= 0 1 0 0 .
(3.3.4)
0 0 1 0
Then
0 1 0 0 1 0 0 0
0 0 1 0 0 1 0 0
At =
0
, Q = At A =
0 0 1 0 0 1 0
0 0 0 0 0 0 0 0
Since Q is diagonal symmetric, its rank is 3 and its eigenvalues are λ1 = 1,
λ2 = 1, λ3 = 1, λ4 = 0, and its eigenvectors are
1 0 0 0
0 1 0 0
v1 =
0 , v2 = 0 , v3 1 , v4 = 0 .
0 0 0 1
0 0 0 0 0 0
Here we have (N, d) = (6, 4), r = 3. In either case, S has the same shape
N × d as A.
Let U be the matrix with columns u1 , u2 , . . . , uN , and let V be the matrix
with rows v1 , v2 , . . . , vd . Then V t has columns v1 , v2 , . . . , vd .
Then U and V are orthogonal N × N and d × d matrices. By (3.3.1),
AV t = U S.
A = U SV.
Summarizing,
3.3. SINGULAR VALUE DECOMPOSITION 175
Diagonalization (SVD)
U, sigma, V = svd(A)
# sigma is a vector
print(U.shape,S.shape,V.shape)
print(U,S,V)
Given the relation between the singular values of A and the eigenvalues
of Q = At A, we also can conclude
# center dataset
m = mean(dataset,axis=0)
A = dataset - m
# rows of V are right
# singular vectors of A
V = svd(A)[2]
# columns of U are
# eigenvectors of Q
U = eigh(Q)[1]
# compare columns of U
# and rows of V
3.3. SINGULAR VALUE DECOMPOSITION 177
U, V
returns
0.36 −0.66 −0.58 0.32 0.36 −0.08 0.86 0.36
−0.08 −0.73 0.6 −0.32
, V = −0.66 −0.73 0.18 0.07
U =
0.86 0.18 0.07 −0.48 0.58 −0.6 −0.07 −0.55
0.36 0.07 0.55 0.75 0.32 −0.32 −0.48 0.75
This shows the columns of U are identical to the rows of V , except for the
third column of U , which is the negative of the third row of V .
Qvk = λk vk , k = 1, . . . , d.
Thus the principal components of a dataset are the right singular vectors
of the centered dataset matrix. This shows there are two approaches to the
principal components of a dataset: Either through EVD and eigenvectors
of the covariance matrix, or through SVD and right singular vectors of the
centered dataset matrix. We shall do both.
λ1 ≥ λ2 ≥ · · · ≥ λd ,
in PCA one takes the most significant components, those components who
eigenvalues are near the top eigenvalue. For example, one can take the top
two eigenvalues λ1 ≥ λ2 and their eigenvectors v1 , v2 , and project the dataset
onto the plane span(v1 , v2 ). The projected dataset can then be visualized
as points in the plane. Similarly, one can take the top three eigenvalues
λ1 ≥ λ2 ≥ λ3 and their eigenvectors v1 , v2 , v3 and project the dataset onto
the space span(v1 , v2 , v3 ). This can then be visualized as points in three
dimensions.
dataset = train_X.reshape((60000,784))
labels = train_y
Q = cov(dataset.T)
totvar = Q.trace()
# cumulative sums
sums = cumsum(percent)
data = array([percent,sums])
print(data.T[:20].round(decimals=3))
d = len(lamda)
3.4. PRINCIPAL COMPONENT ANALYSIS 181
stairs(percent,range(d+1))
The left column in Figure 3.12 lists the top twenty eigenvalues as a per-
centage of their sum. For example, the top eigenvalue λ1 is around 10% of
the total variance. The right column lists the cumulative sums of the eigen-
values, so the third entry in the right column is the sum of the top three
eigenvalues, λ1 + λ2 + λ3 = 22.97%.
This results in Figures 3.12 and 3.13. Here we sort the array eig in
decreasing order, then we cumsum the array to obtain the cumulative sums.
Because the rank of the MNIST dataset is 712 (§2.9), the bottom 72 =
784 − 712 eigenvalues are exactly zero. A full listing shows that many more
eigenvalues are near zero, and the second column in Figure 3.12 shows the
top ten eigenvalues alone sum to almost 50% of the total variance.
def pca(dataset,n):
Q = cov(dataset.T)
# columns of V are
# eigenvectors of Q
lamda, U = eigh(Q)
# decreasing eigenvalue sort
order = lamda.argsort()[::-1]
# sorted top n columns of U
# are cols of U
V = U[:,order[:n]]
P = dot(V,V.T)
return P
In the code, lamda is sorted in decreasing order, and the sorting order is
saved as order. To obtain the top n eigenvectors, we sort the first n columns
U[:,order[:n]] in the same order, resulting in the d × n matrix V . The
code then returns the projection matrix P = V V t (2.7.4).
Instead of working with the covariance Q, as discussed at the start of
the section, we can work directly with the dataset, using svd, to obtain the
eigenvectors.
# of dataset
def pca_with_svd(dataset,n):
# center dataset
m = mean(dataset,axis=0)
vectors = dataset - m
# rows of V are
# right singular vectors
V = svd(vectors)[2]
# no need to sort, already decreasing order
U = V[:n].T # top n rows as columns
P = dot(U,U.T)
return P
Let v = dataset[1] be the second image in the MNIST dataset, and let
Q be the covariance of the dataset. Then the code below returns the image
compressed down to n = 784, 600, 350, 150, 50, 10, 1 dimensions, returning
Figure 1.4.
figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4
If you run out of memory trying this code, cut down the dataset from
60,000 points to 10,000 points or fewer. The code works with pca or with
184 CHAPTER 3. PRINCIPAL COMPONENTS
pca_with_svd.
N = len(dataset)
n = 10
engine = PCA(n_components = n)
reduced = engine.fit_transform(dataset)
reduced.shape
and returns (N, n) = (60000, 10). The following code computes the projected
dataset
projected = engine.inverse_transform(reduced)
projected.shape
figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4
display_image(v,rows,cols,1)
Figure 3.14: Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
Now we project all vectors of the MNIST dataset onto two and three
dimensions, those corresponding to the top two or three eigenvalues. To
start, we compute reduced as above with n = 3, the top three components.
In the two-dimensional plotting code below, reduced is an array of shape
(60000,3), but we use only the top two components 0 and 1. When the
rows are plotted as a scatterplot, we obtain Figure 3.15. Note the rows are
plotted grouped by color, to match the legend, and each plot point’s color is
determined by the value of its label.
186 CHAPTER 3. PRINCIPAL COMPONENTS
grid()
legend(loc='upper right')
show()
grid()
legend(loc='upper right')
show()
%matplotlib notebook
from matplotlib.pyplot import *
from mpl_toolkits import mplot3d
P = axes(projection='3d')
P.set_axis_off()
legend(loc='upper right')
show()
The three dimensional plot of the complete MNIST dataset is Figure 1.5
in §1.2. The command %matplotlib notebook allows the figure to rotated
and scaled.
such that
3.5. CLUSTER ANALYSIS 189
def nearest_index(x,means):
i = 0
for j,m in enumerate(means):
n = means[i]
if norm(x - m) < norm(x - n): i = j
return i
def assign_clusters(dataset,means):
clusters = [ [ ] for m in means ]
for x in dataset:
i = nearest_index(x,means)
clusters[i].append(x)
return [ c for c in clusters if len(c) > 0 ]
def update_means(clusters):
return [ mean(c,axis=0) for c in clusters ]
d = 2
k,N = 7,100
190 CHAPTER 3. PRINCIPAL COMPONENTS
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
This code returns the size the clusters after each iteration. Here is code
that plots a cluster.
def plot_cluster(mean,cluster,color,marker):
for v in cluster:
scatter(v[0],v[1], s=50, c=color, marker=marker)
scatter(mean[0], mean[1], s=100, c=color, marker='*')
d = 2
k,N = 7,100
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
figure(figsize=(4,4))
grid()
for v in dataset: scatter(v[0],v[1],s=20,c='black')
show()
Counting
Some of the material in this chapter is first seen in high school. Because
repeating the exposure leads to a deeper understanding, we review it in a
manner useful to the later chapters.
Why are there six possibilities? Because they are three ways of choosing
193
194 CHAPTER 4. COUNTING
the first ball, then two ways of choosing the second ball, then one way of
choosing the third ball, so the total number of ways is
6 = 3 × 2 × 1.
n! = n × (n − 1) × (n − 2) × · · · × 2 × 1.
Notice also
(n + 1)! = (n + 1) × n × (n − 1) × · · · × 2 × 1 = (n + 1) × n!,
Permutations of n Objects
The number of ways of selecting n objects from a collection of n distinct
objects is n!.
We also have
1! = 1, 0! = 1.
It’s clear that 1! = 1. It’s less clear that 0! = 1, but it’s reasonable if you
think about it: The number of ways of selecting from zero balls results in
only one possibility — no balls.
More generally, we can consider the selection of k balls from a bag con-
taining n distinct balls. There are two varieties of selections that can be
made: Ordered selections and unordered selections. An ordered selection is
a permutation, and an unordered selection is a combination. In particular,
when k = n, n! is the number of ways of permuting n objects.
4.1. PERMUTATIONS AND COMBINATIONS 195
Notice P (x, k) is defined for any real number x by the same formula,
P (n, k) n!
C(n, k) = = .
k! (n − k)!k!
For example,
5×4
P (5, 2) = 5 × 4 = 20, C(5, 2) = = 10,
2×1
so we have twenty ordered pairs
(1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 3), (2, 4), (2, 5), (3, 1), (3, 2),
(3, 4), (3, 5), (4, 1), (4, 2), (4, 3), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4)
{1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {3, 5}, {4, 5}.
1, 2, 3, . . . , n − 1, n,
n! < nn .
4.1. PERMUTATIONS AND COMBINATIONS 197
However, because half of the factors are less then n/2, we expect an approx-
imation smaller than nn , maybe something like (n/2)n or (n/3)n .
To be systematic about it, assume an approximation of the form1
n n
n! ∼ e , for n large, (4.1.1)
e
for some constant e. We seek the best constant e that fits here. In this
approximation, we multiply by e so that (4.1.1) is an equality when n = 1.
Using the binomial theorem, in §4.4 we show
n n n n
3 ≤ n! ≤ 2 , n ≥ 1. (4.1.2)
3 2
Based on this, a constant e satisfying (4.1.1) must lie between 2 and 3,
2 ≤ e ≤ 3.
To figure out the best constant e to pick, we see how much both sides
of (4.1.1) increase when we replace n by n + 1. Write (4.1.1) with n + 1
replacing n, obtaining
n+1
n+1
(n + 1)! ∼ e . (4.1.3)
e
Dividing the left sides of (4.1.1), (4.1.3) yields
(n + 1)!
= (n + 1).
n!
Dividing the right sides yields
n
e((n + 1)/e)n+1
1 1
= (n + 1) · · 1 + . (4.1.4)
e(n/e)n e n
To make these quotients match as closely as possible, we should choose
n
1
e∼ 1+ . (4.1.5)
n
Choosing n = 1, 2, 3, . . . , 100, . . . results in
4.2 Graphs
A graph consists of nodes and edges. For example, the graphs in Figure 4.2
each have four nodes and three edges. The left graph is directed, in that a
direction is specified for each edge. The graph on the right is undirected, no
direction is specified.
−3 7.4
2 0
Let wij be the weight on the edge (i, j) in a weighed directed graph. The
weight matrix of a weighed directed graph is the matrix W = (wij ).
If the graph is unweighed, then we set A = (aij ), where
(
1, if i and j adjacent,
aij = .
0, if not.
In this case, A consists of ones and zeros, and is called the adjacency matrix.
If the graph is also undirected, then the adjacency matrix is symmetric,
aij = aji .
When m = 0, there are no edges, and we say the graph is empty. When
m = n(n − 1)/2, there are the maximum number of edges, and we say the
graph is complete. The complete graph with n nodes is written Kn (Figure
4.5).
The cycle graph Cn with n nodes is as in Figure 4.5. The graph Cn has
n edges. The cycle graph C3 is a triangle.
d1 ≥ d2 ≥ d3 ≥ · · · ≥ dn
(d1 , d2 , d3 , . . . , dn )
Handshaking Lemma
If the order is n, the size is m, and the degrees are d1 , d2 , . . . , dn , then
n
X
d1 + d2 + · · · + dn = dk = 2m.
k=1
To see this, we consider two cases. First case, assume there are no isolated
nodes. Then the degree sequence is
n − 1 ≥ d1 ≥ d2 ≥ · · · ≥ dn ≥ 1.
n − 2 ≥ d1 ≥ d2 ≥ . . . dn−1 ≥ 1.
A graph is regular if all the node degrees are equal. If the node degrees are
all equal to k, we say the graph is k-regular. From the handshaking lemma,
for a k-regular graph, we have kn = 2m, so
1
m = kn.
2
For example, because 2m is even, there are no 3-regular graphs with 11 nodes.
Both Kn and Cn are regular, with Kn being (n − 1)-regular, and Cn being
2-regular.
A walk on a graph is a sequence of nodes v1 , v2 , v3 , . . . where each
consecutive pair vi , vi+1 of nodes are adjacent. For example, if v1 , v2 , v3 ,
v4 , v5 , v6 are the nodes (in any order) of the complete graph K6 , then v1 →
v2 → v3 → v4 → v2 is a walk. A path is a walk with no backtracking: A
path visits each node at most once. A closed walk is a walk that ends where
it starts. A cycle is a closed walk with no backtracking.
Two nodes a and b are connected if there is a walk starting at a and ending
at b. If a and b are connected, then there is a path starting at a and ending
at b, since we can cut out the cycles of the walk. A graph is connected if every
two nodes are connected. A graph is disconnected if it is not connected. For
4.2. GRAPHS 203
For example, the empty graph has adjacency matrix given by the zero matrix.
Since our graphs are undirected, the adjacency matrix is symmetric.
Let 1 be the vector 1 = (1, 1, 1, . . . , 1). The adjacency matrix of the
complete graph Kn is the n×n matrix A with all ones except on the diagonal.
If I is the n × n identity matrix, then this adjacency matrix is
A=1⊗1−I
Notice there are ones on the sub-diagonal, and ones on the super-diagonal,
and ones in the upper-right and lower-left corners.
204 CHAPTER 4. COUNTING
For any adjacency matrix A, the sum of each row is equal to the degree
of the node corresponding to that row. This is the same as saying
d1
d2
A1 = . . . .
dn
A1 = k1,
v1 = 1 ≥ |vj |, j = 2, 3, . . . , n.
Since the sum a11 + a12 + · · · + a1n equals the degree d1 of node 1, this implies
Top Eigenvalue
For a k-regular graph, k is the top eigenvalue of the adjacency matrix
A.
A1 = (1 · 1)1 − 1 = n1 − 1 = (n − 1)1,
2 cos(2πk/n), k = 0, 1, 2, . . . , n − 1,
Ā = A(Ḡ) = 1 ⊗ 1 − I − A(G).
Now aik akj is either 0 or 1, and equals 1 exactly if there is a 2-step path from
i to j. Hence
Notice a 2-step walk between i and j is the same as a 2-step path between i
and j.
When i = j, (A2 )ii is the number of 2-step paths connecting i and i,
which means number of edges. Since this counts edges twice, we have
1
trace(A2 ) = m = number of edges.
2
Similarly, (A3 )ij is the number of 3-step walks connecting i and j. Since
a 3-step walk from i to i is the same as a triangle, (A3 )ii is the number
of triangles in the graph passing through i. Since the trace is the sum of
the diagonal elements, trace(A3 ) counts the number of triangles. But this
overcounts by a factor of 3! = 6, since three labels may be rearranged in six
ways. Hence
1
trace(A3 ) = number of triangles.
6
4.2. GRAPHS 207
Connected Graph
Let A be the adjacency matrix. Then the graph is connected if for
every i ̸= j, there is a k with (Ak )ij > 0.
1 0 0 0 0 1 0 0
Hence P is orthogonal,
P P t = I, P −1 = P t .
4.2. GRAPHS 209
Using permutation matrices, we can say two graphs are isomorphic if their
adjacency matrices A, A′ satisfy
A′ = P AP −1 = P AP t
A graph is bipartite if the nodes can be divided into two groups, with
adjacency only between nodes across groups. If we call the two groups even
and odd, then odd nodes are never adjacent to odd nodes, and even nodes
are never adjacent to even nodes.
The complete bipartite graph is the bipartite graph with maximum num-
ber of edges: Every odd node is adjacent to every even node. The complete
bipartite graph with n odd nodes with m even nodes is written Knm . Then
the order of Kmn is n + m.
Let a = (1, 1, . . . , 1, 0, 0, . . . , 0) be the vector with n ones and m zeros,
and let b = 1 − a. Then b has n zeros and m ones, and the adjacency matrix
of Knm is
A = A(Knm ) = a ⊗ b + b ⊗ a.
210 CHAPTER 4. COUNTING
Recall we have
(a ⊗ b)v = (b · v)a.
From this, we see the column space of A = a⊗b+b⊗a is span(a, b). Thus the
rank of A is 2, and the nullspace of A consists of the orthogonal complement
span(a, b)⊥ of span(a, b). Using this, we compute the eigenvalues of A.
Since the nullspace is span(a, b)⊥ , any vector orthogonal to a and to b
is an eigenvector for λ = 0. Hence the eigenvalue λ = 0 has multiplicity
n + m − 2. Since trace(A) = 0, the sum of the eigenvalues is zero, and the
remaining two eigenvalues are ±λ ̸= 0.
Let v be an eigenvector for λ ̸= 0. Then v is orthogonal to the nullspace
of A, so v must be a linear combination of a and b, v = ra+sb. Since a·b = 0,
Aa = nb, Ab = ma.
Hence
λv = Av = A(ra + sb) = rnb + sma.
4.2. GRAPHS 211
Applying A again,
For
√ example, √
for the graph in Figure 4.8, the nonzero eigenvalues are λ =
± 3 × 5 = ± 15.
L = B t B.
Both the laplacian matrix and the adjacency matrix are n × n. What is the
connection between them?
Laplacian
The laplacian satisfies
L = D − A,
where D = diag(d1 , d2 , . . . , dn ) is the diagonal degree matrix.
212 CHAPTER 4. COUNTING
For example, for the cycle graph C6 , the degree matrix is 2I, and the
laplacian is the matrix we saw in §3.2,
2 −1 0 0 0 −1
−1 2 −1 0 0 0
0 −1 2 −1 0 0
L = Q(6) = .
0
0 −1 2 −1 0
0 0 0 −1 2 −1
−1 0 0 0 −1 2
Similarly,
Thus
(a + x)2 = a2 + 2ax + x2
(a + x)3 = a3 + 3a2 x + 3ax2 + x3
(4.3.4)
(a + x)4 = a4 + 4a3 x + 6a2 x2 + 4ax3 + x4
(a + x)5 = ⋆a5 + ⋆a4 x + ⋆a3 x2 + ⋆a2 x3 + ⋆ax4 + ⋆x5 .
and
3 3 3 3
= 1, = 3, = 3, =1
0 1 2 3
and
4 4 4 4 4
= 1, = 4, = 6, = 4, =1
0 1 2 3 4
214 CHAPTER 4. COUNTING
and
5 5 5 5 5 5
= ⋆, = ⋆, = ⋆, = ⋆, = ⋆, = ⋆.
0 1 2 3 4 5
With this notation, the number
n
(4.3.5)
k
is the coefficient of an−k xk when you multiply out (a + x)n . This is the
binomial coefficient. Here n is the degree of the binomial, and k, which
specifies the term in the resulting sum, varies from 0 to n (not 1 to n).
It is important to remember that, in this notation, the binomial (a + x)2
expands into the sum of three terms a2 , 2ax, x2 . These are term 0, term
1, and term 2. Alternatively, one says these are the zeroth term, the first
term, and the second term. Thus the second term in theexpansion of the
binomial (a + x)4 is 6a2 x2 , and the binomial coefficient 42 = 6. In general,
the binomial (a + x)n of degree n expands into a sum of n + 1 terms.
Since the binomial coefficient nk is the coefficient of an−k xk when you
Binomial Theorem
The binomial (a + x)n equals
n n n n−1 n n−2 2 n n−1 n n
a + a x+ a x + ··· + ax + x .
0 1 2 n−1 n
(4.3.6)
For example, the term 42 a2 x2 corresponds to choosing two a’s, and two x’s,
In Pascal’s triangle, the very top row has one number in it: This is the
zeroth row corresponding to n = 0 and the binomial expansion of (a+x)0 = 1.
The first row corresponds to n = 1; it contains the numbers (1, 1), which
correspond to the binomial expansion of (a + x)1 = 1a + 1x. We say the
zeroth entry (k = 0) in the first row (n = 1) is 1 and the first entry (k = 1)
in the first row is 1. Similarly, the zeroth entry (k = 0) in the second row
(n = 2) is 1, and the second entry (k = 2) in the second row (n = 2) is 1.
The second entry (k = 2) in the fourth row (n = 4) is 6. For every row, the
entries are counted starting from k = 0, and end with k = n, so there are
n + 1 entries in row n. With this understood, the k-th entry in the n-th row
is the binomial coefficient n-choose-k. So 10-choose-2 is
10
= 45.
2
216 CHAPTER 4. COUNTING
We can learn a lot about the binomial coefficients from this triangle.
First, we have 1’s all along the left edge. Next, we have 1’s all along the
right edge. Similarly, one step in from the left or right edge, we have the row
number. Thus we have
n n n n
=1= , =n= , n ≥ 1.
0 n 1 n−1
Note also Pascal’s triangle has a left-to-right symmetry: If you read off
the coefficients in a particular row, you can’t tell if you’re reading them from
left to right, or from right to left. It’s the same either way: The fifth row is
(1, 5, 10, 10, 5, 1). In terms of our notation, this is written
n n
= , 0 ≤ k ≤ n;
k n−k
Let’s work this out when n = 3. Then the left side is (a + x)4 . From (4.3.4),
we get
4 4 4 3 4 2 2 4 3 4 4
a + a x+ ax + ax + x
0 1 2 3 4
3 3 3 2 3 2 3 3
= (a + x) a + a x+ ax + x
0 1 2 3
3 4 3 3 3 2 2 3
= a + a x+ ax + ax3
0 1 2 3
3 3 3 2 2 3 3 3 4
+ a x+ ax + ax + x
0 1 2 3
3 4 3 3 3 3 3
= a + + a x+ + a2 x 2
0 1 0 2 1
3 3 3 3 4
+ + ax + x.
3 2 3
4.3. BINOMIAL THEOREM 217
This allows us to build Pascal’s triangle (Figure 4.9), where, apart from
the ones on either end, each term (“the child”) in a given row is the sum of
the two terms (“the parents”) located directly above in the previous row.
We conclude the sum of the binomial coefficients along the n-th row of Pas-
cal’s triangle is 2n (remember n starts from 0).
Now insert x = 1 and a = −1. You get
n n n n n
0= − + − ··· ± ± .
0 1 2 n−1 n
Hence: the alternating2 sum of the binomial coefficients along the n-th row
of Pascal’s triangle is zero.
We now show
2
Alternating means the plus-minus pattern + − + − + − . . . .
218 CHAPTER 4. COUNTING
Binomial Coefficient
The binomial coefficient nk equals C(n, k),
n n · (n − 1) · · · · · (n − k + 1) n!
= = , 1 ≤ k ≤ n.
k 1 · 2 · ··· · k k!(n − k)!
(4.3.10)
n! n!
C(n, k) + C(n, k − 1) = +
k!(n − k)! (k − 1)!(n − k + 1)!
n! 1 1
= +
(k − 1)!(n − k)! k n − k + 1
n!(n + 1)
=
(k − 1)!(n − k)!k(n − k + 1)
(n + 1)!
= = C(n + 1, k).
k!(n + 1 − k)!
The formula (4.3.10) is easy to remember: There are k terms in the numerator
as well as the denominator, the factors in the denominator increase starting
from 1, and the factors in the numerator decrease starting from n.
In Python, the code
4.4. EXPONENTIAL FUNCTION 219
comb(n,k)
comb(n,k,exact=True)
The binomial coefficient nk makes sense even for fractional n. This can
Rewriting this by pulling out the first two terms k = 0 and k = 1 leads to
n n
1 X 1 1 2 k−1
1+ =1+1+ 1− 1− ... 1 − . (4.4.1)
n k=2
k! n n n
220 CHAPTER 4. COUNTING
From (4.4.1), we can tell a lot. First, since all terms are positive, we see
n
1
1+ ≥ 2, n ≥ 1.
n
By (4.4.3), we arrive at
n
1
2≤ 1+ ≤ 3, n ≥ 1. (4.4.4)
n
Summarizing, we established the following strengthening of (4.1.5).
Euler’s Constant
The limit n
1
e = lim 1+ (4.4.5)
n→∞ n
exists and satisfies 2 ≤ e ≤ 3.
Since we’ve shown bn increases faster than an , and cn increases faster than
bn , we have derived (4.1.2).
To summarize,
Euler’s Constant
Euler’s constant satisfies
∞
X 1 1 1 1 1 1
e= =1+1+ + + + + + ...
k=0
k! 2 6 24 120 720
Depositing one dollar in a bank offering 100% interest returns two dollars
after one year. Depositing one dollar in a bank offering the same annual
interest compounded at mid-year returns
2
1
1+ = 2.25
2
Depositing one dollar in a bank offering the same annual interest com-
pounded at n intermediate time points returns (1 + 1/n)n dollars after one
year.
Passing to the limit, depositing one dollar in a bank and continuously
compounding at an annual interest rate of 100% returns e dollars after one
year. Because of this, (4.4.5) is often called the compound-interest formula.
Exponential Function
For any real number x, the limit
x n
exp x = lim 1+ (4.4.6)
n→∞ n
exists. In particular, exp 0 = 1 and exp 1 = e.
preceding one,
(1 − x) = 1−x
(1 − x)2 = 1 − 2x + x2 ≥ 1 − 2x
(1 − x)3 = (1 − x)(1 − x)2 ≥ (1 − x)(1 − 2x) = 1 − 3x + 2x2 ≥ 1 − 3x
(1 − x)4 = (1 − x)(1 − x)3 ≥ (1 − x)(1 − 3x) = 1 − 4x + 3x3 ≥ 1 − 4x
... ...
This shows the limit exp x in (4.4.6) is well-defined when x < 0, and
1
exp(−x) = , for all x.
exp x
4.4. EXPONENTIAL FUNCTION 225
Exponential Series
The exponential function is always positive and satisfies, for every real
number x,
∞
X xk x2 x3 x 4 x5 x6
exp x = =1+x+ + + + + + . . . (4.4.10)
k=0
k! 2 6 24 120 720
Law of Exponents
For real numbers x and y,
(a0 + a1 + a2 + a3 + . . . )(b0 + b1 + b2 + b3 + . . . )
Thus
∞
! ∞
! ∞ n
!
X X X X
ak bm = ak bn−k .
k=0 m=0 n=0 k=0
Now insert
xk y n−k
ak = , bn−k = .
k! (n − k)!
Then the n-th term in the resulting sum equals, by the binomial theorem,
n n n
X X xk y n−k 1 X n k n−k 1
ak bn−k = = x y = (x + y)n .
k=0 k=0
k! (n − k)! n! k=0 k n!
Thus
∞
! ∞
! ∞
X xk X ym X (x + y)n
exp x · exp y = = = exp(x + y).
k=0
k! m=0
m! n=0
n!
Exponential Notation
For any real number x,
ex = exp x.
Probability
[57, 49, 55, 44, 55, 50, 49, 50, 53, 49, 53, 50, 51, 53, 53, 54, 48, 51, 50, 53].
On the other hand, suppose someone else repeats the same experiment 20
times with a different coin, and obtains
[69, 70, 79, 74, 63, 70, 68, 71, 71, 73, 65, 63, 68, 71, 71, 64, 73, 70, 78, 67].
In this case, one suspects the two coins are statistically distinct, and have
different probabilities of obtaining heads.
In this section, we study how the probabilities of coin-tossing behave,
with the goal of answering the question: Is a given coin fair?
229
230 CHAPTER 5. PROBABILITY
P rob(X1 = 0 and X2 = 1) qp
P rob(X2 = 1 | X1 = 0) = = = p = P rob(X2 = 1),
P rob(X1 = 0) q
so
P rob(X2 = 1 | X1 = 0) = P rob(X2 = 1).
Thus X1 = 0 has no effect on the probability that X2 = 1, and similarly for
the other possibilities. This is often referred to as the independence of the
coin tosses. We conclude
Independent Coin-Tossing
P rob(Xn = 1) = p, P rob(Xn = 0) = q = 1 − p, n ≥ 1.
P (X = a) = p, P (X = b) = q, P (X = c) = r.
232 CHAPTER 5. PROBABILITY
E(X) = ap + bq + cr.
For example,
E(Xn ) = 1 · p + 0 · (1 − p) = p,
Let
Sn = X1 + X2 + · · · + Xn .
5.1. BINOMIAL PROBABILITY 233
Since Xk = 1 when the k-th toss is heads, and Xk = 0 when the k-th toss is
tails, Sn is the number of heads in n tosses.
The mean of Sn is
which is the same as saying P rob(r < p < r +dr) = dr. By (5.1.6), we obtain
Z 1
n k
P rob(Sn = k) = r (1 − r)n−k dr.
0 k
We now turn things around: Suppose we toss the coin n times, and obtain
k heads. How can we use this data to estimate the coin’s probability of heads
p?
To this end, we introduce the fundamental
Bayes Theorem
P rob(B | A) · P rob(A)
P rob(A | B) = . (5.1.8)
P rob(B)
236 CHAPTER 5. PROBABILITY
P rob(A and B)
P rob(A | B) =
P rob(B)
P rob(A and B) P rob(A)
= ·
P rob(A) P rob(B)
P rob(A)
= P rob(B | A) · .
P rob(B)
P rob(p = r)
P rob(p = r | Sn = k) = P rob(Sn = k | p = r) · . (5.1.9)
P rob(Sn = k)
Notice because of the extra factor (n + 1), this is not equal to (5.1.6).
In (5.1.6), p is fixed, and k is the variable. In (5.1.10), k is fixed, and r is
the variable. This a posteriori distribution for (n, k) = (10, 7) is plotted in
Figure 5.1. Notice this distribution is concentrated about k/n = 7/10 = .7.
5.1. BINOMIAL PROBABILITY 237
grid()
X = arange(0,1,.01)
plot(X,f(X),color="blue",linewidth=.5)
show()
Because Bayes Theorem is so useful, here are two alternate forms. First,
since
P rob(B | A) P rob(A)
P rob(A | B) = . (5.1.11)
P rob(B | A) P rob(A) + P rob(B | Ac ) P rob(Ac )
Let
1
p = σ(z) = . (5.1.12)
1 + e−z
This is the logistic function or sigmoid function (Figure 5.2). The logistic
function takes as inputs real numbers y, and returns as outputs probabilities
p (Figure 5.3). Think of the input z as an activation energy, and the output
p as the probability of activation. In Python, σ is the expit function.
p = expit(z)
5.1. BINOMIAL PROBABILITY 239
If the result is tails, select a point x at random with normal probability with
mean mT , or
2
P rob(x | T ) ∼ e−|x−mT | /2 .
This says the the groups are centered around the points mH and mT respec-
tively.
Given a point x, what is the probability x is in the heads group? In other
words, what is
P rob(H | x)?
This question is begging for Bayes theorem.
Let
1 1
w = mH − mT , w0 = − |mH |2 + |mT |2 .
2 2
Since P rob(H) = P rob(T ), here we have P rob(A) = P rob(Ac ). Inserting the
probabilities and simplifying leads to
P rob(x | H) P rob(H)
log = w · x + w0 . (5.1.14)
P rob(x | T ) P rob(T )
By (5.1.13), this leads to
P rob(H | x) = σ(w · x + w0 ).
240 CHAPTER 5. PROBABILITY
5.2 Probability
A probability is often described as
the extent to which an event is likely to occur, measured by the
ratio of the favorable outcomes to the whole number of outcomes
possible.
We explain what this means by describing the basic terminology:
• An experiment is a procedure that yields an outcome, out of a set of
possible outcomes. For example, tossing a coin is an experiment that
yields one of two outcomes, heads or tails, which we also write as 1 or
0. Rolling a six-sided die yields outcomes 1, 2, 3, 4, 5, 6. Rolling two
six-sided dice yields 36 outcomes (1, 1), (1, 2),. . . . Flipping a coin three
times yields 23 = 8 outcomes
or
000, 001, 010, 011, 100, 101, 110, 111.
(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1),
#(E) = 35, which is the number of ways you can choose three things
out of seven things:
7 7·6·5
#(E) = 7-choose-3 = = = 35.
3 1·2·3
1. 0 ≤ P rob(s) ≤ 1,
2. The sum of the probabilities of all outcomes equals one.
• Outcomes are equally likely when they have the same probability. When
this is so, we must have
#(E)
P rob(E) = .
#(S)
For example,
1. A coin is fair if the outcomes are equally likely. For one toss of a
fair coin, P rob(heads) = 1/2.
2. More generally, tossing a coin results in outcomes
P rob(head) = p, P rob(tail) = 1 − p,
p = .5
n = 10
N = 20
v = binomial(n,p,N)
print(v)
returns
[9 6 7 4 4 4 3 3 7 5 6 4 6 9 4 5 4 7 6 7]
p = .5
for n in [5,50,500]: print(binomial(n,p,1))
This returns the count of heads after 5 tosses, 50 tosses, and 500 tosses,
5.2. PROBABILITY 243
Figure 5.4: 100,000 sessions, with 5, 15, 50, and 500 tosses per session.
3, 28, 266
The proportions are the count divided by the total number of tosses in
the experiment. For the above three experiments, the proportions after 5
tosses, 50 tosses, and 500 tosses, are
Now we repeat each experiment 100,000 times and we plot the results in
a histogram.
N = 100000
p = .5
for n in [5,50,500]:
data = binomial(n,p,N)
hist(data,bins=n,edgecolor ='Black')
grid()
244 CHAPTER 5. PROBABILITY
show()
The takeaway from these graphs are the two fundamental results of prob-
ability:
2. Central Limit Theorem. For large sample size, the shape of the
graph of the proportions or counts is approximately normal. The nor-
mal distribution is studied in §5.4. Another way of saying this is: For
large sample size, the shape of the sample mean histogram is approxi-
mately normal.
The law of large numbers is qualitative and the central limit theorem
is quantitative. While the law of large numbers says one thing is close to
another, it does not say how close. The central limit theorem provides a
numerical measure of closeness, using the normal distribution.
Roll two six-sided dice. Let A be the event that at least one dice is an
even number, and let B be the event that the sum is 6. Then
A = {(2, ∗), (4, ∗), (6, ∗), (∗, 2), (∗, 4), (∗, 6)} .
B = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} .
The intersection of A and B is the event of outcomes in both events:
X x
for the chance that X lies in the interval [a, b], we are asking for P rob(a <
X < b). If we don’t know anything about X, then we can’t figure out the
probability, and there is nothing we can say. Knowing something about X
means knowing the distribution of X: Where X is more likely to be and
where X is less likely to be. In effect, a random variable is a quantity X
whose probabilities P rob(a < X < b) can be computed.
For example, take the Iris dataset and let X be the petal length of an iris
(Figure 5.6) selected at random. Here the number of samples is N = 150.
df = read_csv("iris.csv")
petal_length = df["Petal_length"].to_numpy()
N
1 X
m = E(X) = xk .
N k=1
N
2 1 X 2
E(X ) = x .
N k=1 k
In general, given any function f (x), we have the mean of f (x1 ), f (x2 ), . . . ,
f (xN ),
N
1 X
E(f (X)) = f (xk ). (5.3.1)
N k=1
If we let
(
1, 1 < x < 3,
f (x) =
0, otherwise,
N
1 X #{samples satisfying 1 < xk < 3}
E(f (X)) = f (xk ) = .
N k=1 N
But this is the probability that a randomly selected iris has petal length X
between 1 and 3,
P rob(1 < X < 3) = E(f (X)),
To see how the iris petal lengths are distributed, we plot a histogram,
grid()
hist(petal_length,bins=20)
show()
rng = default_rng()
# n = batch_size
250 CHAPTER 5. PROBABILITY
def random_batch_mean(n):
rng.shuffle(petal_length)
return mean(petal_length[:n])
random_batch_mean(5)
The five petal lengths are selected by first shuffling the petal lengths,
then selecting the first five petal_length[:5]. Now repeat this computation
100,000 times, for batch sizes 1, 5, 15, 50. The resulting histograms are in
Figure 5.8. Notice in the first subplot, the batch is size n = 1, so we recover
the base histogram Figure 5.7. Figure 5.8 is of course another illustration of
the central limit theorem.
N = 100000
5.3. RANDOM VARIABLES 251
for n in [1,5,15,50]:
Xbar = [ random_batch_mean(n) for _ in range(N)]
hist(Xbar,bins=50)
grid()
show()
P rob(X = 1) = p, P rob(X = 0) = 1 − p,
P rob(X = x) = px (1 − p)1−x , x = 0, 1.
p
1−p
0 1
5.10, when the probability P rob(a < X < b) is given by the green area in
Figure 5.10. Thus
0 a b 0 a b
Then the green areas in Figure 5.10 is the difference between two areas, hence
equal
cdfX (b) − cdfX (a).
For the bernoulli distribution in Figure 5.9, the cdf is in Figure 5.12.
Because the bernoulli random variable takes on only the values x = 0, 1,
these are the values where the cdf P rob(X ≤ x) jumps.
1
1−p
0 1
This is the population mean. It does not depend on a sampling of the popu-
lation.
For example, suppose the population consists of 100 balls, of which 30
are red, 20 are green, and 50 are blue. The cost of each ball is
$1, red,
X(ball) = $2, green,
$3, blue.
Then
#(red) 30
pred = P rob(red) = = = .3,
#(balls) 100
#(green) 20
pgreen = P rob(green) = = = .2,
#(balls) 100
#(blue) 50
pblue = P rob(blue) = = = .5.
#(balls) 100
Then the average cost of a ball equals
E(X) = pred · 1 + pgreen · 2 + pblue · 3
30 · 1 + 20 · 2 + 50 · 3 x1 + x2 + · · · + x100
= = .
100 100
254 CHAPTER 5. PROBABILITY
The variance is
V ar(X) = E (X − µ)2 = p1 (x1 − µ)2 + p2 (x2 − µ)2 + p3 (x3 − µ)2 + . . .
N
X
= pk (xk − µ)2 .
k=1
µ = E(X) = x1 p1 + x2 p2 = 1 · p + 0 · (1 − p) = p,
V ar(X) = σ 2 = E (X − µ)2
We conclude
E(X 2 ) = µ2 + σ 2 = (E(X))2 + V ar(X). (5.3.2)
Let X have mean µ and variance σ 2 , and write
X −µ
Z= .
σ
Then
1 E(X) − µ µ−µ
E(Z) = E(X − µ) = = = 0,
σ σ σ
and
1 σ2
E(Z 2 ) = E((X − µ) 2
) = = 1.
σ2 σ2
We conclude Z has mean zero and variance one.
A random variable is standard if its mean is zero and its variance is one.
The variable Z is the standardization of X. For example, the standardization
of the bernoulli random variable is
X −p
p .
p(1 − p)
p(1 − p)
0 1
t2 t3
et = 1 + t + + + ...
2! 3!
where t is any real number. The number e, Euler’s constant (§4.4), is ap-
proximately 2.7, as can be seen from
1 1 1 1
e = e1 = 1 + 1 + + + ··· = 1 + 1 + + + ...
2! 3! 2 6
Since X has real values, so does tX, so etX is also a random variable.
The moment generating function is the mean of etX ,
t2 t3
M (t) = MX (t) = E etX = 1 + tE(X) + E(X 2 ) + E(X 3 ) + . . .
2! 3!
For example, for the smartphone random variable X = 0, 1 with P rob(X =
1) = p, X 2 = X, X 3 = X, . . . , so
t2 t3 t2 t3
M (t) = 1 + tE(X) + E(X 2 ) + E(X 3 ) + · · · = 1 + tp + p + p + . . .
2! 3! 2! 3!
which equals
M (t) = (1 − p) + pet .
In §5.2, we discussed independence of events. Now we do the same for
random variables. Let X and Y be random variables. We say X and Y are
uncorrelated if the expectations multiply,
a,b,c = symbols('a,b,c')
eq1 = a + 2*b + c - 1
eq2 = a - b - (a-c)*(a+b)
solutions = solve([eq1,eq2],a,b)
print(solutions)
Since
12
1 X tk 1 e13t − et
MX+Y (t) = e = ,
12 k=1 12 et − 1
we obtain
1 e13t − et 1 e7t − et
= MY (t) · .
12 et − 1 6 et − 1
Factoring
we obtain
1
MY (t) = (e6t + 1).
2
This says
1 1
P rob(Y = 0) = , P rob(Y = 6) = ,
2 2
and all other probabilities are zero.
260 CHAPTER 5. PROBABILITY
E(X n ) = E(Y n ), n ≥ 1.
X1 , X2 , . . . , Xn x1 , x2 , . . . , xn
mean is
n
X1 + X 2 + · · · + Xn 1X
X̄ = = Xk .
n n k=1
Then
1 1 1
E(X̄) = E(X1 +X2 +· · ·+Xn ) = (E(X1 )+E(X2 )+· · ·+E(Xn )) = ·nµ = µ.
n n n
We conclude the mean of the sample mean equals the population mean.
Now let σ 2 be the common variance of X1 , X2 , . . . , Xn . Since σ 2 =
E(X 2 ) − E(X)2 , we have
E(Xk2 ) = µ2 + σ 2 .
When i ̸= j, by independence,
1 X
= E(Xi Xj )
n2 i,j
!
1 X X
= 2 E(Xi Xj ) + E(Xk2 )
n i̸=j k
1 2 2 2 1
= µ2 + σ 2 .
= 2
n(n − 1)µ + n(µ + σ )
n n
σ2
E(X̄) = µ and V ar(X̄) = .
n
Sn = X1 + X2 + · · · + Xn .
grid()
z = arange(mu-3*sdev,mu+3*sdev,.01)
0 a b
Then
X −µ
X ∼ N (µ, σ) ⇐⇒ Z= ∼ N (0, 1).
σ
A normal distribution is a standard normal distribution when µ = 0 and
σ = 1.
Sn = X1 + X2 + · · · + Xn
we expect the chance that Z < 0 should equal 1/2. In other words, because
of the symmetry of the curve, we expect to be 50% confident that Z < 0, or
0 is at the 50-th percentile level. So
p
p
z z
When
P rob(Z < z) = p,
we say z is the z-score z corresponding to the p-value p. Equivalently, we say
our confidence that Z < z is p, or the percentile of z equals 100p. In Python,
the relation between z and p (Figure 5.16) is specified by
p = Z.cdf(z)
z = Z.ppf(p)
ppf is the percentile point function, and cdf is the cumulative distribution
function.
In Figure 5.17, the red areas are the lower tail p-value P rob(Z < z), the
two-tail p-value P rob(|Z| > z), and the upper tail p-value P rob(Z > z).
−z 0 −z 0 z
0 z
and
P rob(|Z| < z) = P rob(−z < Z < z) = P rob(Z < z) − P rob(Z < z),
and
P rob(Z > z) = 1 − P rob(Z < z).
To go backward, suppose we are given P rob(|Z| < z) = p and we want
to compute the cutoff z. Then P rob(|Z| > z) = 1 − p, so P rob(Z > z) =
(1 − p)/2. This implies
In Python,
# p = P(|Z| < z)
z = Z.ppf((1+p)/2)
Now let’s zoom in closer to the graph and mark off 1, 2, 3 on the hor-
izontal axis to obtain specific colored areas as in Figure 5.18. These areas
are governed by the 68-95-99 rule (Table 5.19). Our confidence that |Z| < 1
equals the blue area 0.685, our confidence that |Z| < 2 equals the sum of
the blue plus green areas 0.955, and our confidence that |Z| < 3 equals the
sum of the blue plus green plus red areas 0.997. This is summarized in Table
5.19.
The possibility |Z| > 1 is called a 1-sigma event, |Z| > 2 a 2-sigma event,
and so on. So a 2-sigma event is 95.5% unlikely, or 4.5% likely. An event
is considered statistically significant if it’s a 2-sigma event or more. In other
words, something is significant if it’s unlikely. A six-sigma event |Z| > 6 is
2 in a billion. You want a plane crash to be six-sigma.
268 CHAPTER 5. PROBABILITY
−3 −2 −1 0 1 2 3
Figure 5.18: 68%, 95%, 99% confidence cutoffs for standard normal.
Figure 5.18 is not to scale, because a 1-sigma event should be where the
curve inflects from convex to concave (in the figure this happens closer to
2.7). Moreover, according to Table 5.19, the left-over white area should be
.03% (3 parts in 10,000), which is not what the figure suggests.
µ − 3σ µ−σ µ µ+σ µ + 3σ
a = Z.ppf(.15)
b = Z.ppf(.9)
Here are three examples. In the first example, suppose student grades are
normally distributed with mean 80 and variance 16. This says the average
of all grades is 80, and the SD is 4. If a grade is g, the standardized grade is
g − 80
z= .
4
5.4. NORMAL DISTRIBUTION 271
rng = default_rng()
x = 70
mean, sdev = 80, 4
p = Z(mean,sdev).cdf(x)
for n in range(2,200):
q = 1 - (1-p)**n
print(n, q)
Here is the code for computing tail probabilities for the sample mean X̄
drawn from a normally distributed population with mean µ and standard
deviation σ. When n = 1, this applies to a single normal random variable.
########################
# P-values
5.5. CHI-SQUARED DISTRIBUTION 273
########################
def pvalue(mean,sdev,n,xbar,type):
Xbar = Z(mean,sdev/sqrt(n))
if type == "lower-tail": p = Xbar.cdf(xbar)
elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
elif type == "two-tail": p = 2 *(1 - Xbar.cdf(abs(xbar)))
else:
print("What's the tail type?")
return
print("type: ",type)
print("mean,sdev,n,xbar: ",mean,sdev,n,xbar)
print("p-value: ",p)
z = sqrt(n) * (xbar - mean) / sdev
print("z-score: ",z)
type = "upper-tail"
mean = 80
sdev = 4
n = 1
xbar = 90
pvalue(mean,sdev,n,xbar,type)
1
MU (u) = E(euU ) = √ .
1 − 2u
Since ∞
1 uU
X un
√ =E e = E(U n ),
1 − 2u n=0
n!
comparing coefficients of un /n! shows
n n −1/2
E(U ) = (−2) n! , n = 0, 1, 2, . . . (5.5.1)
n
1
MU (t) = E(etU ) = .
(1 − 2t)d/2
Going back to the question posed at the beginning of the section, we have
X and Y independent standard normal and we want
d = 2
u = 1
U(d).cdf(u)
returns 0.39.
3
Geometrically, P rob(U < 1) is the probability that a normally distributed point is
inside the unit sphere in d-dimensional space.
5.5. CHI-SQUARED DISTRIBUTION 277
d
X
E(U 2 ) = E(Zk2 Zℓ2 )
k,ℓ=1
X d
X
= E(Zk2 )E(Zℓ2 ) + E(Zk4 )
k̸=ℓ k=1
2
= d(d − 1) · 1 + d · 3 = d + 2d.
Because
1 1 1
′ /2 = ,
d/2
(1 − 2t) (1 − 2t)d (1 − 2t)(d+d′ )/2
we obtain
X = (X1 , X2 , . . . , Xn )
in Rn .
If X is a random vector in Rd , its mean is the vector
Qij = E(Xi Xj ), 1 ≤ i, j ≤ d.
is never negative.
A random vector X is normal with mean µ and variance Q if for every
vector w, w · X is normal with mean w · µ and variance w · Qw.
Then µ is the mean of X, and Q is the variance X. The random vector
X is standard normal if µ = 0 and Q = I.
From §5.3, we see
Z = (Z1 , Z2 , . . . , Zd )
From this, X and Y are independent when B = 0. Thus, for normal random
vectors, independence and uncorrelatedness are the same.
and
Q+ = U S + U t .
If we set Y = U t X = (Y1 , Y2 , . . . , Yd ), then
so X · v = 0.
It is easy to check Q3 = Q and Q2 is symmetric, so (§2.3) Q+ = Q. Since
X · v = 0,
X · Q+ X = X · QX = X · (X − (v · X)v) = |X|2 .
We conclude
282 CHAPTER 5. PROBABILITY
Singular Chi-squared
We use the above to derive the distribution of the sample variance. Let
X1 , X2 , . . . , Xn be a random sample, and let X̄ be the sample mean,
X1 + X2 + · · · + X n
X̄ = .
n
Let S 2 be the sample variance,
(X1 − X̄)2 + (X2 − X̄)2 + · · · + (Xn − X̄)2
S2 = . (5.5.4)
n−1
Since (n − 1)S 2 is a sum-of-squares similar to (5.5.2), we expect (n − 1)S 2
to be chi-squared. In fact this is so, but the degree is n − 1, not n. We will
show
Now let
X = Z − (Z · v)v = (Z1 − Z̄, Z2 − Z̄, . . . .Zn − Z̄).
5.5. CHI-SQUARED DISTRIBUTION 283
E(X ⊗ X) = I − v ⊗ v.
Hence
(n − 1)S 2 = |X|2
is chi-squared with degree n − 1.
Now X and Z · v are uncorrelated, since
Statistics
6.1 Estimation
In statistics, like any science, we start with a guess or an assumption or
hypothesis, then we take a measurement, then we accept or modify our
guess/assumption based on the result of the measurement. This is common
sense, and applies to everything in life, not just statistics.
For example, suppose you see a sign on campus saying
There is a lecture in room B120.
How can you tell if this is true/correct or not? One approach is to go to
room B120 and look. Either there is a lecture or there isn’t. Problem solved.
But then someone might object, saying, wait, what if there is a lecture
in room B120 tomorrow? To address this, you go every day to room B120
and check, for 100 days. You find out that in 85 of the 100 days, there is a
lecture, and in 15 days, there is none. Based on this, you can say you are
85% confident there is a lecture there. Of course, you can never be sure, it
depends on which day you checked, you can only provide a confidence level.
Nevertheless, this kind of thinking allows us to quantify the probability that
our hypothesis is correct.
In general, the measurement is significant if it is unlikely. When we obtain
a significant measurement, then we are likely to reject our guess/assumption.
So
significance = 1 − confidence.
In practice, our guess/assumption allows us to calculate a p-value, which is
the probability that the measurement is not consistent with our assumption.
285
286 CHAPTER 6. STATISTICS
do not
reject H
p>α
hypothesis
sample p-value
H
p<α
reject H
Here is a geometric example. The null hypothesis and the alternate hy-
pothesis are
In §2.2, there is code (2.2) returning the angle angle(u,v) between two
vectors. To test this hypothesis, we run the code
6.1. ESTIMATION 287
N = 784
for _ in range(20):
u = randn(N)
v = randn(N)
print(angle(u,v))
86.27806537791886
87.91436653824776
93.00098725550777
92.73766421951748
90.005139015804
87.99643434444482
89.77813370637857
96.09801014394806
90.07032573539982
89.37679070400239
91.3405728939376
86.49851399221568
87.12755619082597
88.87980905998855
89.80377324818076
91.3006921339982
91.43977096117017
88.52516224405458
86.89606919838387
90.49100744167357
N = 784
for _ in range(20):
u = binomial(n,.5,N)
v = binomial(n,.5,N)
print(angle(u,v))
59.43464627897324
59.14345748418916
60.31453922165891
60.38024365702492
59.24709660805488
59.27165957992343
61.21424657806321
60.55756381536082
61.59468919876665
61.33296028237481
60.03925473033243
60.25732069941224
61.77018692842784
60.672901794058326
59.628519516164666
59.41272458020638
58.43172340007064
59.863796136907744
59.45156367988921
59.95835532791699
Here we see strong evidence that H0 is false, as the angles are now close to
60◦ .
6.1. ESTIMATION 289
The difference between the two scenarios is the distribution. In the first
scenario, we have randn(n): the components are distributed according to a
standard normal. In the second scenario, we have binomial(1,.5,N): the
components are distributed according to a fair coin toss. To see how the
distribution affects things, we bring in the law of large numbers, which is
discussed in §5.3.
Let X1 , X2 , . . . , Xn be a simple random sample from some population,
and let µ be the population mean. Recall this means X1 , X2 , . . . , Xn are
i.i.d. random variables, with µ = E(X). The sample mean is
X1 + X2 + · · · + X n
X̄ = .
n
Then we have the
We use the law of large numbers to explain the closeness of the vector
angles to specific values.
Assume u = (x1 , x2 , . . . , xn ), and v = (y1 , y2 , . . . , yn ) where all compo-
nents are selected independently of each other, and each is selected according
to the same distribution.
Let U = (X1 , X2 , . . . , Xn ), V = (Y1 , Y2 , . . . , Yn ), be the corresponding
random variables. Then X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Yn are independent
and identically distributed (i.i.d.), with population mean E(X) = E(Y ).
From this, X1 Y1 , X2 Y2 , . . . , Xn Yn are i.i.d. random variables with popu-
lation mean E(XY ). By the law of large numbers,1
X1 Y1 + X2 Y2 + · · · + Xn Yn
≈ E(XY ),
n
so
U · V = X1 Y1 + X2 Y2 + · · · + Xn Yn ≈ n E(XY ).
1
≈ means the ratio of the two sides approaches 1 as n grows without bound.
290 CHAPTER 6. STATISTICS
U ·V µ2
cos(θ) = p ≈ .
(U · U )(V · V ) µ2 + σ 2
µ2 p2
= = p.
µ2 + σ 2 p2 + p(1 − p)
µ2
cos(θ) is approximately .
µ2 + σ 2
6.2 Z-test
Suppose we want to estimate the proportion of American college students
who have a smart phone. Instead of asking every student, we take a sample
and make an estimate based on the sample.
6.2. Z-TEST 291
p = .7
n = 25
N = 1000
v = binomial(n,p,N)/n
hist(v,edgecolor ='Black')
show()
292 CHAPTER 6. STATISTICS
|p − X̄| < ϵ,
(L, U ) = (X̄ − ϵ, X̄ + ϵ)
is a confidence interval.
With the above setup, we have the population proportion p, and the four
sample characteristics
• sample size n
• margin of error ϵ,
• confidence level α.
Suppose we do not know p, but we know n and X̄. We say the margin of
error is ϵ, at confidence level α, if
L, U = X̄ − ϵ, X̄ + ϵ.
P rob(|Z| > z ∗ ) = α.
√
Let σ/ n be the standard error. By the central limit theorem,
!
|X̄ − p| z∗
α ≈ P rob p >√ .
p(1 − p) n
##########################
# Confidence Interval - Z
##########################

def confidence_interval(xbar,sdev,n,alpha,type):
    Xbar = Z(xbar,sdev/sqrt(n))
    if type == "two-tail":
        U = Xbar.ppf(1-alpha/2)
        L = Xbar.ppf(alpha/2)
    elif type == "upper-tail":
        U = Xbar.ppf(1-alpha)
        L = xbar
    elif type == "lower-tail":
        U = xbar
        L = Xbar.ppf(alpha)
    else: print("what's the test type?"); return
    return L, U
alpha = .02
sdev = 228
n = 35
xbar = 95
type = "two-tail"

L, U = confidence_interval(xbar,sdev,n,alpha,type)
Now we can answer the questions posed at the start of the section. Here
are the answers.
1. When n = 20, α = .95, and X̄ = .7, we have [L, U ] = [.5, .9], so ϵ = .2.
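This answer can be checked with the confidence_interval code above; here we
estimate the proportion's standard deviation by √(X̄(1 − X̄)) and take alpha = .05
for 95% confidence (these two conventions are assumptions, not from the text).

from numpy import sqrt

xbar, n, alpha = .7, 20, .05
L, U = confidence_interval(xbar, sqrt(xbar*(1-xbar)), n, alpha, "two-tail")
print(L, U)     # approximately (.5, .9)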
• H0 : µ = µ0
• Ha : µ ̸= µ0 or µ < µ0 or µ > µ0 .
• Ha : µ ̸= 0.
Here the significance level is α = .02 and µ0 = 0. To decide whether to
reject H0 or not, compute the standardized test statistic

    z = √n · (x̄ − µ0)/σ = 2.465.

Since z is a sample from an approximately normal distribution Z, the p-value
is Prob(|Z| > |z|).
Hypothesis Testing
There are three types of alternative hypotheses Ha :
µ < µ0 , µ > µ0 , µ ̸= µ0 .
In the Python code below, instead of working with the standardized statistic
Z, we work directly with X̄, which is normally distributed with mean µ0
and standard deviation σ/√n.
###################
# Hypothesis Z-test
###################

def ztest(mu0,sdev,n,xbar,type):
    # work directly with Xbar, normal with mean mu0 and sd sdev/sqrt(n)
    Xbar = Z(mu0,sdev/sqrt(n))
    if type == "two-tail": p = 2*Xbar.cdf(mu0 - abs(xbar - mu0))
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    elif type == "lower-tail": p = Xbar.cdf(xbar)
    else: print("what's the test type?"); return
    print("pvalue: ",p)
    if p < alpha: print("reject H0")
    else: print("do not reject H0")
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
sdev = 2
alpha = .01

ztest(mu0,sdev,n,xbar,type)
• Ha : µ > µ0
If a driver’s measured average speed is X̄ = 122, the above code rejects H0 .
This is consistent with the confidence interval cutoff we found above.
There are two types of possible errors we can make. A Type I error is
when H0 is true but we reject it, and a Type II error is when H0 is not true
but we fail to reject it.
                      H0 is true             H0 is false
do not reject H0      1 − α                  Type II error: β
reject H0             Type I error: α        Power: 1 − β
This calculation was for a two-tail test. When the test is upper-tail or
lower-tail, a similar calculation leads to the code
############################
# Type1 and Type2 errors - Z
############################

def type2_error(type,mu0,mu1,sdev,n,alpha):
    print("significance, mu0, mu1, sdev, n: ", alpha,mu0,mu1,sdev,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        zstar = Z.ppf(alpha)
        type2 = 1 - Z.cdf(delta + zstar)
    elif type == "upper-tail":
        zstar = Z.ppf(1-alpha)
        type2 = Z.cdf(delta + zstar)
    elif type == "two-tail":
        zstar = Z.ppf(1 - alpha/2)
        type2 = Z.cdf(delta + zstar) - Z.cdf(delta - zstar)
    else: print("what's the test type?"); return
    print("test type: ",type)
    print("zstar: ", zstar)
    print("delta: ", delta)
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)
mu0 = 120
mu1 = 122
sdev = 2
n = 10
alpha = .01
type = "upper-tail"
type2_error(type,mu0,mu1,sdev,n,alpha)
A type II error is when we do not reject the null hypothesis and yet it’s
false. The power of a test is the probability of rejecting the null hypothesis
when it’s false (Figure 6.3). If the probability of a type II error is β, then
the power is 1 − β.
Going back to the driving speed example, what is the chance that someone
driving at µ1 = 122 is not caught? This is a type II error, and its probability
is the type2 value computed by the code above.
6.3 T -test
Let X1 , X2 , . . . , Xn be a simple random sample from a population. We
repeat the analysis of the previous section when we know neither the population
mean µ nor the population variance σ². We only know the sample mean

    X̄ = (X1 + X2 + · · · + Xn)/n

and the sample variance

    S² = (1/(n − 1)) Σ_{k=1}^n (Xk − X̄)².
Here N is a constant to make the total area under the graph equal to one
(Figure 6.4). In other words, (6.3.1) is the pdf of the t-distribution.
When the interval [a, b] is not small, the correct formula is obtained by
integration, which means dividing [a, b] into many small intervals and sum-
ming. We will not use this density formula directly.
for d in [3,4,7]:
    t = arange(-3,3,.01)
    plot(t,T(d).pdf(t),label="d = "+str(d))

plot(t,Z.pdf(t),"--",label=r"d = $\infty$")
grid()
legend()
show()
    Xk = µ + σZk ,

so that

    √n · (X̄ − µ)/S = √n · Z̄ / √( (1/(n − 1)) Σ_{k=1}^n (Zk − Z̄)² ) = √n · Z̄ / √( U/(n − 1) ),

where U = Σ_{k=1}^n (Zk − Z̄)².
Using the last result with d = n − 1, we arrive at the main result in this
section.

² Geometrically, Prob(T > 1) is the probability that a normally distributed point is
inside the light cone in (d + 1)-dimensional spacetime.
##########################
# Confidence Interval - T
##########################

def confidence_interval(xbar,s,n,alpha,type):
    d = n-1
    if type == "two-tail":
        tstar = T(d).ppf(1-alpha/2)
        L = xbar - tstar * s / sqrt(n)
        U = xbar + tstar * s / sqrt(n)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        L = xbar
        U = xbar + tstar * s / sqrt(n)
    elif type == "lower-tail":
        tstar = T(d).ppf(alpha)
        L = xbar + tstar * s / sqrt(n)
        U = xbar
    else: print("what's the test type?"); return
    print("type: ",type)
    return L, U
n = 10
xbar = 120
s = 2
alpha = .01
type = "upper-tail"
print("significance, s, n, xbar: ", alpha,s,n,xbar)
L,U = confidence_interval(xbar,s,n,alpha,type)
print("lower, upper: ", L,U)
Going back to the driving speed example from §6.2, instead of assuming
the population standard deviation is σ = 2, we compute the sample standard
deviation and find it’s S = 2. Recomputing with T (9), instead of Z, we see
(L, U ) = (120, 121.78), so the cutoff now is µ∗ = 121.78, as opposed to
µ∗ = 121.47 there.
• H0 : µ = µ0
• Ha : µ ̸= µ0 .
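The ttest function called in the code below is not reproduced in this excerpt;
here is a minimal sketch, modeled on the Z-test code of §6.2, assuming T is the
scipy.stats t distribution and alpha is set globally as there.

from numpy import sqrt
from scipy.stats import t as T

def ttest(mu0,s,n,xbar,type):
    d = n-1
    tstat = sqrt(n)*(xbar - mu0)/s
    if type == "two-tail": p = 2*(1 - T(d).cdf(abs(tstat)))
    elif type == "upper-tail": p = 1 - T(d).cdf(tstat)
    elif type == "lower-tail": p = T(d).cdf(tstat)
    else: print("what's the test type?"); return
    print("pvalue: ",p)
    if p < alpha: print("reject H0")
    else: print("do not reject H0")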
###################
# Hypothesis T-test
###################

xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
s = 2
alpha = .01

ttest(mu0, s, n, xbar, type)
• H0 : µ = µ0
• Ha : µ > µ0
########################
# Type1 and Type2 errors
########################

def type2_error(type,mu0,mu1,s,n,alpha):
    d = n-1
    print("significance, mu0, mu1, n: ", alpha,mu0,mu1,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / s
    if type == "lower-tail":
        tstar = T(d).ppf(alpha)
        type2 = 1 - T(d).cdf(delta + tstar)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        type2 = T(d).cdf(delta + tstar)
    elif type == "two-tail":
        tstar = T(d).ppf(1 - alpha/2)
        type2 = T(d).cdf(delta + tstar) - T(d).cdf(delta - tstar)
    else: print("what's the test type?"); return
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)

type2_error(type,mu0,mu1,s,n,alpha)
Similarly,

    Var(SY²) = 2σY⁴/(m − 1).

Before, with a single mean, we used the result that

    T = Z / √(U/n)

is a t-distribution with degree n when U is chi-squared with degree n, Z is
N (0, 1), and Z and U are independent.
We apply the same result here, but we proceed more carefully. To
begin, X̄ and Ȳ are normal with means µX and µY and variances σ²/n and
σ²/m respectively. Hence

    ( (X̄ − Ȳ) − (µX − µY) ) / √( σ²/n + σ²/m ) ∼ N (0, 1).
Next,

    (n − 1)SX²/σ²    and    (m − 1)SY²/σ²

are chi-squared of degrees n − 1 and m − 1 respectively, so their sum

    (n − 1)SX²/σ² + (m − 1)SY²/σ²

is chi-squared with degree n + m − 2.
##################################
# Confidence Interval - Two means
##################################

import numpy as np
from scipy.stats import t
T = t

def confidence_interval(xbar,ybar,varx,vary,nx,ny,alpha):
    tstar = T.ppf(1-alpha/2, nx+ny-2)
    varp = (nx-1)*varx + (ny-1)*vary
    n = nx+ny-2
    varp = varp/n
    s_p = np.sqrt(varp)
    h = 1/nx + 1/ny
    L = xbar - ybar - tstar * s_p * np.sqrt(h)
    U = xbar - ybar + tstar * s_p * np.sqrt(h)
    return L, U
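For example, with hypothetical sample summaries (the numbers below are made up
for illustration, not taken from the text):

xbar, ybar = 10.3, 9.1       # sample means
varx, vary = 4.0, 5.0        # sample variances
nx, ny = 12, 15              # sample sizes
L, U = confidence_interval(xbar, ybar, varx, vary, nx, ny, .05)
print(L, U)                  # approximately (-0.5, 2.9)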
Now we turn to the question of what to do when the variances σX² and
σY² are not equal. In this case, by independence, the population variance of
X̄ − Ȳ is the sum of the population variances of X̄ and Ȳ, which is

    σB² = σX²/n + σY²/m.                                             (6.4.1)

Hence

    ( (X̄ − Ȳ) − (µX − µY) ) / √( σX²/n + σY²/m ) ∼ N (0, 1).

We want to replace the population variance (6.4.1) by the sample variance

    SB² = SX²/n + SY²/m.
Because SB2 is not a straight sum, but is a more complicated linear combina-
tion of variances, SB2 is not chi-squared.
Welch’s approximation is to assume it is chi-squared with degree r, and
to figure out the best r for this. More exactly, we seek the best choice of r
so that

    rSB²/σB² = rSB² / ( σX²/n + σY²/m )

is close to chi-squared with degree r. By construction, we multiplied SB² by
r/σB² so that its mean equals r,

    E( rSB²/σB² ) = (r/σB²) E(SB²) = r.

Since the variance of a chi-squared with degree r is 2r, we compute the
variance and set it equal to 2r,

    2r = Var( rSB²/σB² ) = (r²/(σB²)²) Var(SB²).                     (6.4.2)

By independence,

    Var(SB²) = Var(SX²/n) + Var(SY²/m) = (1/n²) Var(SX²) + (1/m²) Var(SY²).
But (n − 1)SX²/σX² and (m − 1)SY²/σY² are chi-squared, so

    Var(SB²) = 2σX⁴/(n²(n − 1)) + 2σY⁴/(m²(m − 1)).                  (6.4.3)

Combining (6.4.2) and (6.4.3), we arrive at Welch’s approximation for the
degrees of freedom,

    r = σB⁴ / ( σX⁴/(n²(n − 1)) + σY⁴/(m²(m − 1)) )
      = (σX²/n + σY²/m)² / ( σX⁴/(n²(n − 1)) + σY⁴/(m²(m − 1)) ).

In practice, this expression for r is never an integer, so one rounds it to the
closest integer, and the population variances σX² and σY² are replaced by the
sample variances SX² and SY².
We summarize the results.

Welch’s T-statistic

If we have independent simple random samples, then the statistic

    T = ( X̄ − Ȳ − (µX − µY) ) / √( SX²/n + SY²/m )

is approximately distributed according to a T-distribution with degrees
of freedom

    r = ( SX²/n + SY²/m )² / ( SX⁴/(n²(n − 1)) + SY⁴/(m²(m − 1)) ).
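A sketch of the corresponding code, computing Welch's degrees of freedom and the
resulting confidence interval (this function is an illustration consistent with
the box above, not code taken from the text):

import numpy as np
from scipy.stats import t

def welch_confidence_interval(xbar,ybar,varx,vary,nx,ny,alpha):
    sb2 = varx/nx + vary/ny
    # Welch's degrees of freedom, rounded to the closest integer
    r = round( sb2**2 / ( varx**2/(nx**2*(nx-1)) + vary**2/(ny**2*(ny-1)) ) )
    tstar = t(r).ppf(1-alpha/2)
    L = xbar - ybar - tstar*np.sqrt(sb2)
    U = xbar - ybar + tstar*np.sqrt(sb2)
    return L, U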
6.5 Variances
Let X1 , X2 , . . . , Xn be a normally distributed simple random sample with
mean 0 and variance 1.
Then we know

    U = X1² + X2² + · · · + Xn²

is chi-squared with degree n, and the chi-squared score χ²α,n is defined by

    Prob(U ≤ χ²α,n) = α.

More generally, when the population variance is σ², the sample variance satisfies

    (n − 1)S²/σ² ∼ χ²n−1 .

Let

    a = χ²α/2,n−1 ,    b = χ²1−α/2,n−1 .                             (6.5.1)

By definition of the score χ²α,n , we have

    Prob( a ≤ (n − 1)S²/σ² ≤ b ) = 1 − α,

which may be rewritten

    Prob( (n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a ) = 1 − α.
We conclude

Confidence Interval

A (1 − α)100% confidence interval for the population variance σ² is

    (n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a

where a and b are the χ²n−1 scores at significance α/2 and 1 − α/2.
##############################
# Confidence Interval - Chi2
##############################

from scipy.stats import chi2

def confidence_interval(s2,n,alpha):
    a = chi2.ppf(alpha/2,n-1)
    b = chi2.ppf(1-alpha/2,n-1)
    L = (n-1)*s2/b
    U = (n-1)*s2/a
    return L, U
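For example, with hypothetical inputs s2 = 4.2, n = 10, and alpha = .05 (these
values are assumptions chosen for illustration), the call

s2, n, alpha = 4.2, 10, .05
L, U = confidence_interval(s2, n, alpha)

returns approximately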
L, U = 1.99, 14.0
For hypothesis testing, given hypotheses

• H0 : σ = σ0
• Ha : σ ̸= σ0 ,

the standardized test statistic is (n − 1)S²/σ0², and one compares the p-value
of the standardized test statistic to the required significance score, whether
two-tail, upper-tail, or lower-tail.
Now we consider two populations with two variances. For this, we introduce
the F-distribution. If U1 , U2 are independent chi-squared distributions
with degrees n1 and n2 , then

    F = (U1/n1) / (U2/n2)

has the F-distribution with degrees n1 and n2 . In Python, its scores are

from scipy.stats import f

alpha = .05
a = f.ppf(alpha/2,dfn,dfd)
b = f.ppf(1-alpha/2,dfn,dfd)
Then

    Prob( aα < (SX²/SY²)(σY²/σX²) < bα ) = 1 − α,

which may be rewritten

    Prob( (1/bα)(SX²/SY²) < σX²/σY² < (1/aα)(SX²/SY²) ) = 1 − α.

Hence a (1 − α)100% confidence interval for the ratio σX/σY is (L, U) with

    L = (1/√bα)(SX/SY),    U = (1/√aα)(SX/SY).

L = 0.31389215230779993, U = 1.6621265193149342
    Xk = 0, 1, 2, . . . , d − 1.

For d = 2 categories,

    Z = √n · (X̄ − p) / √(p(1 − p))                                   (6.7.1)

is approximately standard normal for large enough sample size, and conse-
quently U = Z² is approximately chi-squared with degree one. Pearson’s test
generalizes this from d = 2 categories to d > 2 categories.
Given a category j, let #j denote the number of times Xk = j, 1 ≤ k ≤ n.
Then #j is the count that Xk = j, and p̂j = #j/n is the observed frequency,
in n samples. Let pj be the expected frequency. Then

    √n (p̂j − pj) = √n (#j/n − pj),    0 ≤ j < d,

are approximately normal for large n. Based on this, Pearson [22] showed
Goodness-Of-Fit Test

Let p̂ = (p̂1 , p̂2 , . . . , p̂d) be the observed frequencies and p =
(p1 , p2 , . . . , pd) the expected frequencies. Then, for large n,

    u = n Σ_{j=1}^d (p̂j − pj)²/pj                                    (6.7.2)

is approximately chi-squared with degree d − 1.
def goodness_of_fit(observed,expected):
    # assume len(observed) == len(expected)
    d = len(observed)
    n = sum(observed)
    u = sum([ (observed[i] - expected[i])**2/expected[i] for i in range(d) ])
    deg = d-1
    pvalue = 1 - U(deg).cdf(u)
    return pvalue
Suppose a dice is rolled n = 120 times, and the observed counts are
O1 , O2 , . . . , O6 . Notice

    O1 + O2 + O3 + O4 + O5 + O6 = 120.
d = 6
ustar = U(d-1).ppf(1-alpha)
Since this returns u∗ = 11.07 and u > u∗ , we can conclude the dice is not
fair.
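Since the observed counts themselves are not repeated here, the following
hypothetical counts (summing to 120, chosen only for illustration) show how
goodness_of_fit is called:

observed = [30, 14, 28, 10, 24, 14]     # hypothetical counts, sum = 120
expected = [20]*6                       # fair dice: 120/6 per face
pvalue = goodness_of_fit(observed, expected)
print(pvalue)     # well below alpha = .05, so this hypothetical dice is also judged unfair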
We now derive the goodness-of-fit test. For each category 0 ≤ j < d, let

    X̃kj = 1/√pj if Xk = j,    X̃kj = 0 if Xk ̸= j.

Then E(X̃kj) = √pj , and

    E(X̃ki X̃kj) = 1 if i = j,    E(X̃ki X̃kj) = 0 if i ̸= j.

If

    µ = (√p1 , √p2 , . . . , √pd)    and    X̃k = (X̃k1 , X̃k2 , . . . , X̃kd),

then

    E(X̃k) = µ,    E(X̃k ⊗ X̃k) = I.
From this,

    Var(X̃k) = E(X̃k ⊗ X̃k) − E(X̃k) ⊗ E(X̃k) = I − µ ⊗ µ.

From (5.3.8), we conclude the random vector

    Z = √n ( (1/n) Σ_{k=1}^n X̃k − µ )

has mean zero and variance I − µ ⊗ µ. By the central limit theorem, Z is
approximately normal for large n.
Since

    |µ|² = (√p1)² + (√p2)² + · · · + (√pd)² = p1 + p2 + · · · + pd = 1,

µ is a unit vector. By the singular chi-squared result in §5.5, |Z|² is approx-
imately chi-squared with degree d − 1. Using

    Zj = √n ( p̂j/√pj − √pj ),

we write |Z|² out,

    |Z|² = Σ_{j=1}^d Zj² = n Σ_{j=1}^d ( p̂j/√pj − √pj )² = n Σ_{j=1}^d (p̂j − pj)²/pj ,

obtaining (6.7.2).
Is a person’s gender correlated with their party affiliation, or are the two
variables independent? To answer this, we use the chi-squared independence
test. Suppose each sample is a pair (Xk , Yk), with Xk in one of d categories
and Yk in one of e categories, let p̂i and q̂j be the marginal frequencies of
Xk = i and of Yk = j, and let

    r̂ij = #{k : Xk = i, Yk = j} / n,    i = 1, 2, . . . , d,  j = 1, 2, . . . , e.
If X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Yn are independent, then, for large
n, the statistic

    n Σ_{i,j} (r̂ij − p̂i q̂j)² / (p̂i q̂j)                             (6.7.3)

is approximately chi-squared with degree (d − 1)(e − 1). Expanding the square,
(6.7.3) equals

    −n + n Σ_{i,j} (observed)² / expected,

where observed = r̂ij and expected = p̂i q̂j .
The code

def chi2_independence(table):
    observed = table
    n = sum(observed)
    d = len(observed)
    e = len(observed.T)
    p, q = sum(observed,axis=1)/n, sum(observed,axis=0)/n   # marginal frequencies
    expected = n*outer(p,q)
    u = sum((observed - expected)**2/expected)
    return 1 - U((d-1)*(e-1)).cdf(u)
table = array([[68,56,32],[52,72,20]])
chi2_independence(table)
returns a p-value of 0.0401, so, at the 5% significance level, the effects are
not independent.
Chapter 7

Calculus
7.1 Calculus
In this section, we focus on single-variable calculus, and in §7.3, we review
multi-variable calculus. Recall the slope of a line y = mx + b equals m.
Let y = f (x) be a function as in Figure 7.1, and let a be a fixed point. The
derivative of f (x) at the point a is the slope of the line tangent to the graph
of f (x) at a. Then the derivative at a point a is a number f ′ (a) possibly
depending on a.
Figure 7.1: The graph of y = f (x), with the point a marked on the x-axis.
Since the tangent line at a passes through the point (a, f (a)) and its slope
is f ′ (a), the tangent line is the graph of y = f (a) + f ′ (a)(x − a).
Using these properties, we determine the formula for f ′ (a). Suppose the
derivative is bounded between two extremes m and L at every point x in an
interval [a, b], say

    m ≤ f ′ (x) ≤ L,    a ≤ x ≤ b.

Then by A, the derivative of h(x) = f (x) − mx at x equals h′ (x) = f ′ (x) − m.
By assumption, h′ (x) ≥ 0 on [a, b], so, by B, h(b) ≥ h(a). Since h(a) =
f (a) − ma and h(b) = f (b) − mb, this leads to

    (f (b) − f (a)) / (b − a) ≥ m.

Repeating this same argument with f (x) − Lx, and using C, leads to

    (f (b) − f (a)) / (b − a) ≤ L.

We have shown

    m ≤ (f (b) − f (a)) / (b − a) ≤ L.                               (7.1.1)
Derivative Definition

    f ′ (a) = lim_{x→a} (f (x) − f (a)) / (x − a).                    (7.1.2)

    x −f→ u −g→ y
Using the chain rule, the power rule can be √derived for any rational number n,
2
positive or negative. For example,
√ since ( x) = x, we can write x = f (g(x))
2
with f (x) = x and g(x) = x. By the chain rule,
√ √
1 = (x)′ = f ′ (g(x))g ′ (x) = 2g(x)g ′ (x) = 2 x( x)′ .
√
Solving for ( x)′ yields
√ 1
( x)′ = √ ,
2 x
which is (7.1.3) with n = 1/2. In this generality, the variable x is restricted
to positive values only.
For example,
n!
(xn )′′ = (nxn−1 )′ = n(n − 1)xn−2 = xn−2 = P (n, 2)xn−2
(n − 2)!
We use the above to derive the Taylor series. Suppose f (x) is given by a
finite or infinite sum
f (x) = c0 + c1 x + c2 x2 + c3 x3 + . . . (7.1.4)
Then f (0) = c0 . Taking derivatives, by the sum, product, and power rules,
f ′ (x) = c1 + 2c2 x + 3c3 x2 + 4c4 x3 + . . .
f ′′ (x) = 2c2 + 3 · 2c3 x + 4 · 3c4 x2 + . . .
(7.1.5)
f ′′′ (x) = 3 · 2c3 + 4 · 3 · 2c4 x + . . .
f (4) (x) = 4 · 3 · 2c4 + . . .
Inserting x = 0, we obtain f ′ (0) = c1 , f ′′ (0) = 2c2 , f ′′′ (0) = 3 · 2c3 ,
f (4) (0) = 4 · 3 · 2c4 . This can be encapsulated by f (n) (0) = n!cn ,
n = 0, 1, 2, 3, 4, . . . , which is best written

    f (n) (0) / n! = cn ,    n ≥ 0.

Going back to (7.1.4), we derived the Taylor series

    f (x) = Σ_{n=0}^∞ ( f (n) (0) / n! ) xⁿ.

More generally, let a be a fixed point. Then any function f (x) can be
expanded in powers (x − a)ⁿ, and we have

    f (x) = Σ_{n=0}^∞ ( f (n) (a) / n! ) (x − a)ⁿ.
We review the derivative of sine and cosine. Recall the angle θ in radians
is the length of the subtended arc (in red) in Figure 7.3. Following the figure,
with P = (x, y), we have x = cos θ, y = sin θ. By the figure, the arclength θ
is greater than the diagonal, which in turn is greater than y. Moreover θ is
less than 1 − x + y, so

    y < θ < 1 − x + y.

Figure 7.3: The unit circle, with P = (x, y) = (cos θ, sin θ) and the arc of length θ.

Dividing by θ, this implies

    1 − (1 − cos θ)/θ < (sin θ)/θ < 1.                                (7.1.8)
From (1.5.5),

    sin(θ + ϕ) = sin θ cos ϕ + cos θ sin ϕ,

so

    lim_{ϕ→0} (sin(θ + ϕ) − sin θ)/ϕ = lim_{ϕ→0} ( sin θ · (cos ϕ − 1)/ϕ + cos θ · (sin ϕ)/ϕ ) = cos θ.

Thus the derivative of sine is cosine,

    (sin θ)′ = cos θ.

Similarly,

    (cos θ)′ = − sin θ.
Using the chain rule, we compute the derivative of the inverse arcsin x of
sin θ. Since

    θ = arcsin x  ⟺  x = sin θ,

we have

    1 = x′ = (sin θ)′ = θ′ · cos θ = θ′ · √(1 − x²),

or

    (arcsin x)′ = θ′ = 1/√(1 − x²).
We use this to compute the derivative of the arcsine law (3.2.13). With
x = √λ/2, by the chain rule,

    ( (2/π) arcsin(√λ/2) )′ = (2/π) · 1/√(1 − x²) · x′
                            = (2/π) · 1/√(1 − λ/4) · 1/(4√λ) = 1/( π√(λ(4 − λ)) ).       (7.1.9)
This shows the derivative of the arcsine law is the density in Figure 3.11.
For the parabola in Figure 7.4, y = x2 so, by the power rule, y ′ = 2x.
Since y ′ > 0 when x > 0 and y ′ < 0 when x < 0, this agrees with the
increase/decrease of the graph. In particular, the minimum of the parabola
occurs when y ′ = 0.
Figure 7.4 and Figure 7.5, with c = 1/√3.
Since

    (eˣ)⁽ⁿ⁾ = eˣ,    n ≥ 0,

writing the Taylor series centered at zero for the exponential function yields
the exponential series (4.4.10).
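As a quick numeric illustration of the exponential series, a few terms of the
partial sum already give exp(1) to several digits:

from math import exp, factorial

x, N = 1.0, 10
partial = sum(x**n/factorial(n) for n in range(N))
print(partial, exp(x))     # 2.71828152..., 2.71828182...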
For example, the function in Figure 7.6 is convex near x = a, and the
graph lies above its tangent line at a.
so g(x) is convex; hence g(x) lies above its tangent line at x = a. Since g(a) = 0
and g ′ (a) = 0, the tangent line is 0, and we conclude g(x) ≥ 0, which is the
left half of (7.1.10). Similarly, if f ′′ (x) ≤ L, then pL (x) − f (x) is convex,
leading to the right half of (7.1.10).
Figure 7.6: Tangent parabolas pm (x) (green), pL (x) (red), L > m > 0.
    t = (f ′ (b) − f ′ (a)) / (b − a)  ⟹  L ≥ t ≥ m,

which implies

    t² − (m + L)t + mL = (t − m)(t − L) ≤ 0.

This yields
y = log x ⇐⇒ x = ey .
log(ey ) = y, elog x = x.
From here, we see the logarithm is defined only for x > 0 and is strictly
increasing (Figure 7.7).
Since e0 = 1,
log 1 = 0.
Since e∞ = ∞ (Figure 4.10),
log ∞ = ∞.
log 0 = −∞.
We also see log x is negative when 0 < x < 1, and positive when x > 1.
Moreover, by the law of exponents,
ab = eb log a .
Then, by definition,

    log(a^b) = b log a,

and

    (a^b)^c = (e^{b log a})^c = e^{bc log a} = a^{bc}.
    x = e^y  ⟹  1 = x′ = (e^y)′ = e^y y′ = x y′ ,

so

    y = log x  ⟹  y′ = 1/x.

Derivative of the Logarithm

    y = log x  ⟹  y′ = 1/x.                                          (7.1.12)
For gradient descent, we need the relation between a convex function and
its dual. If f (x) is convex, its convex dual is

    g(p) = max_x ( px − f (x) ).

Below we see g(p) is also convex. This maximum may not always exist, but we will
work with cases where no problems arise.
Let q > 0. The simplest example is

    f (x) = (q/2) x²  ⟹  g(p) = (1/(2q)) p².
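This dual can be checked numerically by maximizing px − f (x) over a grid; the
values of q and p below are arbitrary choices for illustration.

import numpy as np

q, p = 3.0, 1.7
x = np.arange(-10, 10, .001)
g_numeric = np.max(p*x - 0.5*q*x**2)
print(g_numeric, p**2/(2*q))     # both approximately 0.4817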
For each p, the point x where px − f (x) equals the maximum g(p) (the
maximizer) depends on p. If we denote the maximizer by x = x(p), then

    0 = (px − f (x))′ = p − f ′ (x)    at x = x(p).

Hence

    g(p) = px − f (x)  ⟺  p = f ′ (x).

Also, by the chain rule, differentiating with respect to p,

    g ′ (p) = x(p) + (p − f ′ (x(p))) x′ (p) = x(p).

Thus f ′ (x) is the inverse function of g ′ (p). Since g(p) = px − f (x) is the
same as f (x) = px − g(p), we have

    f ′ (g ′ (p)) = p.

Differentiating again with respect to p,

    f ′′ (g ′ (p)) g ′′ (p) = 1.

We derived g ′′ (p) = 1/f ′′ (g ′ (p)): the second derivatives of dual convex
functions are reciprocals of each other. In particular, g ′′ (p) > 0, so g(p) is convex.
Notice the derivatives of σ and its inverse σ −1 are reciprocals. This result
holds in general, and is called the inverse function theorem.
The partition function is
Then Z ′ (z) = σ(z) and Z ′′ (z) = σ ′ (z) = σ(1 − σ) > 0. This shows Z(z) is
strictly convex.
The maximum
max(pz − Z(z))
z
which simplifies to I(p). Thus the convex dual of the partition function is the
information. The information is studied further in §7.2, and the multinomial
extension is in §7.6.
This makes sense because the binomial coefficient (n choose k) is defined for any
real number n (4.3.12), (4.3.13).
In summation notation,

    (a + x)ⁿ = Σ_{k=0}^∞ (n choose k) a^{n−k} xᵏ.                     (7.1.21)

The only difference between (4.3.7) and (7.1.21) is the upper limit of the
summation, which is set to infinity. When n is a whole number, by (4.3.10),
we have

    (n choose k) = 0,    for k > n,

so (7.1.21) is a sum of n + 1 terms, and equals (4.3.7) exactly. When n is not
a whole number, the sum (7.1.21) is an infinite sum.
Actually, in §5.5, we will need the special case a = 1, which we write in
slightly different notation,

    (1 + x)ᵖ = Σ_{n=0}^∞ (p choose n) xⁿ.                             (7.1.22)
f (x) = (a + x)n .
so

    f (k) (0)/k! = ( n(n − 1)(n − 2) · · · (n − k + 1) / k! ) a^{n−k} = (n choose k) a^{n−k}.

Writing out the Taylor series,

    (a + x)ⁿ = Σ_{k=0}^∞ ( f (k) (0)/k! ) xᵏ = Σ_{k=0}^∞ (n choose k) a^{n−k} xᵏ,

or
    (1 − t) log a + t log b ≤ log ((1 − t)a + tb),    0 ≤ t ≤ 1.

Since the inequality sign is reversed, this shows log x is concave.
Since x > 0, y ′′ < 0, which shows log x is in fact strictly concave everywhere
it is defined.
Since log x is strictly concave,
1
log = − log x
x
is strictly convex.
Thus H ′ (p) = 0 when p = 1/2, H ′ (p) > 0 on p < 1/2, and H ′ (p) < 0 on
p > 1/2. Since this implies H(p) is increasing on p < 1/2, and decreasing on
p > 1/2, p = 1/2 is a global maximum of the graph.
Notice as p increases, 1 − p decreases, so (1 − p)/p decreases. Since log is
increasing, as p increases, H ′ (p) decreases. Thus H(p) is concave.
Taking the second derivative, by the chain rule and the quotient rule,

    H ′′ (p) = ( log((1 − p)/p) )′ = −1/(p(1 − p)),
A crucial aspect of Figure 7.8 is its limiting values at the edges p = 0 and
p = 1,
H(0) = lim H(p) and H(1) = lim H(p).
p→0 p→1
= − lim 2p log(2p)
p→0
Then I ′ (p) is the inverse of the derivative σ(x) (7.1.16) of the dual Z(x)
(7.1.19) of I(p), as it should be (7.1.14).
Toss a coin n times, and let #n (p) be the number of outcomes where
the proportion of heads is p. Then we have the approximation
In more detail, using (4.1.6), one can derive the asymptotic equality

    #n (p) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p)},    for n large.        (7.2.5)
Figure 7.9 is returned by the code below, which compares both sides of
the asymptotic equality (7.2.5) for n = 10.
n = 10

def H(p): return - p*log(p) - (1-p)*log(1-p)

p = arange(.01,.99,.01)
grid()
plot(p, comb(n, n*p), label="binomial coefficient")
plot(p, exp(n*H(p))/sqrt(2*n*pi*p*(1-p)), label="entropy approximation")
title("number of tosses " + "$n=" + str(n) +"$", usetex=True)
legend()
show()
Then
I(q, q) = 0,
which agrees with our understanding that I(p, q) measures the difference in
information between p and q. Because I(p, q) is not symmetric in p, q, we
think of q as a base or reference probability, against which we compare p.
Equivalently, instead of measuring relative information, we can measure
the relative entropy,
H(p, q) = −I(p, q).
Since

    I(p, q) = −H(p) − p log(q) − (1 − p) log(1 − q)

and H(0) = 0 = H(1), I(p, q) is well-defined for p = 0 and p = 1. Moreover,

    d²/dp² I(p, q) = −H ′′ (p) = 1/(p(1 − p)),

    d²/dq² I(p, q) = p/q² + (1 − p)/(1 − q)²,

so I(p, q) is convex in p and in q separately.
In more detail, using (4.1.6), one can derive the asymptotic equality

    Pn (p, q) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p,q)},    for n large.    (7.2.7)
The law of large numbers (§6.1) states that the proportion of heads equals
approximately q for large n. Therefore, when p ̸= q, we expect the probability
that the proportion of heads equals p to become successively smaller as
n gets larger, and in fact to vanish when n = ∞. Since H(p, q) < 0 when p ̸= q,
(7.2.7) implies this is so. Thus (7.2.7) may be viewed as a quantitative
strengthening of the law of large numbers, in the setting of coin-tossing.
The partial derivative in the k-th direction is just the one-dimensional deriva-
tive considering xk as the independent variable, with all other xj ’s constants.
Below we exhibit the multi-variable chain rule in two ways. The first in-
terpretation is geometric, and involves motion in time and directional deriva-
tives. This interpretation is relevant to gradient descent, §8.3.
The second interpretation is combinatorial, and involves repeated com-
positions of functions. This interpretation is relevant to computing gradients
in networks, specifically backpropagation §7.4, §8.2.
These two interpretations work together when training neural networks,
§8.4.
For the first interpretation of the chain rule, suppose the components x1 ,
x2 , . . . , xd are functions of a single variable t (usually time), so we have
x1 = x1 (t), x2 = x2 (t), ..., xd = xd (t).
Inserting these into f (x1 , x2 , . . . , xd ), we obtain a function
f (t) = f (x1 (t), x2 (t), . . . , xd (t))
of a single variable t. Then we have
The Rd -valued function x(t) = (x1 (t), x2 (t), . . . , xd (t)) represents a curve
or path in Rd , and the vector
    d/dt|_{t=0} f (x + tv) = ∇f (x) · v.                              (7.3.3)

    d/ds|_{s=0} f (W + sV ) = trace(V t G),    for all V.             (7.3.4)
and similarly,

    dy/ds = dy/dt = −0.90.

By the chain rule,

    dy/dx = (dy/dr)·(dr/dx) + (dy/ds)·(ds/dx) + (dy/dt)·(dt/dx).

By (7.1.17), s′ = s(1 − s) = 0.22, so

    dr/dx = cos x = 0.71,    ds/dx = s(1 − s) = 0.22,    dt/dx = 2x = 1.57.

We obtain

    dy/dx = −0.90 ∗ 0.71 − 0.90 ∗ 0.22 − 0.90 ∗ 1.57 = −2.25.
The chain rule is discussed in further detail in §7.4.
∇f (x∗ ) = 0.
Quadratic Convexity

Let Q be a symmetric matrix and b a vector. The quadratic function

    f (x) = (1/2) x · Qx − b · x

has gradient

    ∇f (x) = Qx − b.                                                  (7.3.7)

Moreover f (x) is convex everywhere exactly when Q is a covariance
matrix, Q ≥ 0.
By (2.2.2),
Dv f (x) = ∇f (x) · v = |∇f (x)| |v| cos θ,
where θ is the angle between the vector v and the gradient vector ∇f (x).
Since −1 ≤ cos θ ≤ 1, we conclude
The derivatives are taken with respect to the outputs at each node of the
graph. In §8.2, we consider a third case, and compute outputs and derivatives
on a neural network.
To compute node outputs, we do forward propagation. To compute
derivatives, we do back propagation. Corresponding to the three cases, we
will code three versions of forward and back propagation. In all cases, back
propagation depends on the chain rule.
The chain rule (§7.1) states
dy dy dr
r = f (x), y = g(r) =⇒ = · .
dx dr dx
In this section, we work out the implications of the chain rule on repeated
compositions of functions.
Suppose

    r = f (x) = sin x,
    s = g(r) = 1/(1 + e⁻ʳ),
    y = h(s) = s².

Figure 7.12: The chain x →f r →g s →h y.
The chain in Figure 7.12 has four nodes and four edges. The outputs at
the nodes are x, r, s, y. Start with output x = π/4. Evaluating the functions
in order,
Notice these values are evaluated in the forward direction: x then r then s
then y. This is forward propagation.
Now we evaluate the derivatives of the output y with respect to x, r, s,

    dy/dx,    dy/dr,    dy/ds.
func_chain = [f,g,h]
der_chain = [df,dg,dh]
Then we evaluate the output vector x = (x, r, s, y), leading to the first
version of forward propagation,
def forward_prop(x_in,func_chain):
    x = [x_in]
    while func_chain:
        f = func_chain.pop(0)   # first func
        x_out = f(x_in)
        x.append(x_out)         # insert at end
        x_in = x_out
    return x
# dy/dy = 1
delta_out = 1
def backward_prop(delta_out,x,der_chain):
    delta = [delta_out]
    while der_chain:
        # discard last output
        x.pop(-1)
        df = der_chain.pop(-1)   # last der
        der = df(x[-1])
        # chain rule -- multiply by previous der
        der = der * delta[0]
        delta.insert(0,der)      # insert at start
    return delta
delta = backward_prop(delta_out,x,der_chain)
d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1
x = forward_prop(x_in,func_chain)
delta = backward_prop(delta_out,x,der_chain)
Now we work with the network in Figure 7.13, using the multi-variable
chain rule (§7.3). The functions are
a = f (x, y) = x + y,
b = g(y, z) = max(y, z),
J = h(a, b) = ab.
The composite function is
J = (x + y) max(y, z),
Figure 7.13: A network where x and y feed into +, y and z feed into max,
and the outputs a and b feed into ∗, producing J.
Here there are three input nodes x, y, z, and three hidden nodes +, max,
∗. Starting with inputs (x, y, z) = (1, 2, 0), and plugging in, we obtain node
outputs
(x, y, z, a, b, J) = (1, 2, 0, 3, 2, 6)
(Figure 7.15). This is forward propagation.
Then

    ∂a/∂x = 1,    ∂a/∂y = 1.
Figure 7.14: The two regions of the max node: when y > z, max(y, z) = y
and ∂g/∂y = 1, ∂g/∂z = 0; when y < z, max(y, z) = z and ∂g/∂y = 0,
∂g/∂z = 1.
Let

    1(y > z) = 1 if y > z,    1(y > z) = 0 if y < z.

By Figure 7.14, since y = 2 and z = 0,

    ∂b/∂y = 1(y > z) = 1,    ∂b/∂z = 1(z > y) = 0.
By the chain rule,

    ∂J/∂x = (∂J/∂a)(∂a/∂x) = 2 ∗ 1 = 2,
    ∂J/∂y = (∂J/∂a)(∂a/∂y) + (∂J/∂b)(∂b/∂y) = 2 ∗ 1 + 3 ∗ 1 = 5,
    ∂J/∂z = (∂J/∂b)(∂b/∂z) = 3 ∗ 0 = 0.

Hence we have

    ( ∂J/∂x, ∂J/∂y, ∂J/∂z, ∂J/∂a, ∂J/∂b, ∂J/∂J ) = (2, 5, 0, 2, 3, 1).
The outputs (blue) and the derivatives (red) are displayed in Figure 7.15.
d = 6
w = [ [None]*d for _ in range(d) ]
w[0][3] = w[1][3] = w[1][4] = w[2][4] = w[3][5] = w[4][5] = 1
More generally, in a weighed directed graph, the weights wij are numeric
scalars. In this case, for each node j, let

    x−j = (w1j x1 , w2j x2 , . . . , wdj xd).                         (7.4.1)

Then x−j is the list of node signals, each weighed accordingly. If (i, j) is
not an edge, then wij = 0, so xi does not appear in x−j: in other words, x−j
does not depend on xi. For example, if the incoming edges at node 5 are
(1, 5), (7, 5), (2, 5), then

    x5 = f5 (x−5) = f5 (w15 x1 , w75 x7 , w25 x2).
and
f4 (x, y) = x + y, f5 (y, z) = max(y, z), J(a, b) = ab.
Note there is nothing incoming at the input nodes, so there is no point
defining f1 , f2 , f3 .
activate = [None]*d

def incoming(x,w,j):
    return [ outgoing(x,w,i) * w[i][j] if w[i][j] else 0 for i in range(d) ]

def outgoing(x,w,j):
    if x[j] != None: return x[j]
    else: return activate[j](*incoming(x,w,j))
Let xin be the outgoing vector over the input nodes. If there are m input
nodes, and d nodes in total, then the length of xin is m, and the length of x
is d. In the example above, xin = (x, y, z).
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
    d = len(w)
    x = [None]*d
    m = len(x_in)
    x[:m] = x_in
    for j in range(m,d): x[j] = outgoing(x,w,j)
    return x
For this code to work, we assume there are no cycles in the graph: All
backward paths end at inputs.
Let xout be the output nodes. For Figure 7.13, this means xout = (J).
Then by forward propagation, J is also a function of all node outputs. For
Figure 7.13, this means J is a function of x, y, z, a, b.
Therefore, at each node i, we have the derivatives

    δi = ∂J/∂xi ,    i = 1, 2, . . . , d.
Then δ = (δ1 , δ2 , . . . , δd ) is the gradient vector. We first compute the deriva-
tives of J with respect to the output nodes xout , and we assume these deriva-
tives are assembled into a vector δout .
In Figure 7.13, there is one output node J, and

    δJ = ∂J/∂J = 1.
Hence δout = (1).
We assume the nodes are ordered so that the terminal portion of x equals
xout and the terminal portion of δ equals δout ,
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
    ∂J/∂xi = Σ_{i→j} (∂J/∂xj) · (∂xj/∂xi) = Σ_{i→j} (∂J/∂xj) · (∂fj/∂xi) · wij ,

so

    δi = Σ_{i→j} δj · gij · wij .
The code is
def derivative(x,delta,g,i):
    if delta[i] != None: return delta[i]
    else:
        return sum([ derivative(x,delta,g,j) * g[i][j](*incoming(x,g,j)) * w[i][j]
                     if g[i][j] != None else 0 for j in range(d) ])

def backward_prop(x,delta_out,g):
    d = len(g)
    delta = [None]*d
    m = len(delta_out)
    delta[d-m:] = delta_out
    for i in range(d-m): delta[i] = derivative(x,delta,g,i)
    return delta
Here we write the sublevel set of level 1. One can have sublevel sets corre-
sponding to any level c, f (x) ≤ c. For example, in Figure 7.16, the (blue)
interior of the square, together with the square itself, is a sublevel set. Sim-
ilarly, the interior of the ellipse, together with the ellipse itself, is a sublevel
set. The interiors of the ellipsoids, together with the ellipsoids themselves,
in Figure 7.22 are sublevel sets. Note we always consider the level set to be
part of the sublevel set.
The level set f (x) = 1 is the boundary of the sublevel set f (x) ≤ 1. Thus
the square and the ellipse in Figure 7.16 are boundaries of their respective
sublevel sets, and the covariance ellipsoid x · Qx = 1 is the boundary of the
sublevel set x · Qx ≤ 1.
A scalar function f (x) is convex if¹ for any two points x0 and x1 in Rd ,

    f ((1 − t)x0 + tx1) ≤ (1 − t)f (x0) + tf (x1),    0 ≤ t ≤ 1.      (7.5.1)

This says the line segment joining any two points (x0 , f (x0)) and (x1 , f (x1))
on the graph of f (x) lies above the graph of f (x). For example, in two
dimensions, the function f (x) = f (x1 , x2) = x1² + x2²/4 is convex because its
graph is the paraboloid in Figure 7.19.
More generally, given points x1 , x2 , . . . , xN , a linear combination

    t1 x1 + t2 x2 + · · · + tN xN

is a convex combination if the coefficients are nonnegative and sum to one.

¹ We only consider convex functions that are continuous.
Figure 7.19: Convex: The line segment lies above the graph.
Quadratic is Convex

If Q is a nonnegative matrix and b is a vector, then

    f (x) = (1/2) x · Qx − b · x

is convex.
This was derived in the previous section, but here we present a more
geometric proof.
To derive this result, let x0 and x1 be any points, and let v = x1 − x0 .
Then x0 + tv = (1 − t)x0 + tx1 and x1 = x0 + v. Let g0 = Qx0 − b. By (7.3.6),

    f (x0 + tv) = f (x0) + tv · (Qx0 − b) + (t²/2) v · Qv = f (x0) + tv · g0 + (t²/2) v · Qv.   (7.5.3)

Inserting t = 1 in (7.5.3), we have f (x1) = f (x0) + v · g0 + v · Qv/2. Since
t² ≤ t for 0 ≤ t ≤ 1 and v · Qv ≥ 0, by (7.5.3),

    f ((1 − t)x0 + tx1) = f (x0 + tv)
                        ≤ f (x0) + tv · g0 + (t/2) v · Qv
                        = (1 − t)f (x0) + t( f (x0) + v · g0 + (1/2) v · Qv )
                        = (1 − t)f (x0) + tf (x1).

When Q is invertible, then v · Qv > 0, and we have strict convexity.
from numpy.random import default_rng
from scipy.spatial import ConvexHull

rng = default_rng()
points = rng.random((30,2))   # random points in the plane (the number 30 is an assumption)
hull = ConvexHull(points)
facet = hull.simplices[0]
plot(points[facet, 0], points[facet, 1], 'r--')
grid()
show()
If f (x) is a function, its graph is the set of points (x, y) in Rd+1 satisfying
y = f (x), and its epigraph is the set of points (x, y) satisfying y ≥ f (x).
If f (x) is defined on Rd , its sublevel sets are in Rd , and its epigraph is in
Rd+1 . Then f (x) is a convex function exactly when its epigraph is a convex
set (Figure 7.19). From convex functions, there are other ways to get convex
sets:
E: f (x) ≤ 1
is a convex set.
H: n · (x − x0 ) = 0. (7.5.4)
n · (x − x0 ) < 0 n · (x − x0 ) = 0 n · (x − x0 ) > 0.
Separating Hyperplane
Let E be a convex set and let x∗ be a point not in E. Then there is a
hyperplane separating x∗ and E: For some x0 in E and nonzero n,
Expanding, we have
0 ≤ 2(x0 − x∗ ) · v + t|v|2 , 0 ≤ t ≤ 1.
Since this is true for small positive t, sending t → 0, results in v·(x0 −x∗ ) ≥ 0.
Setting n = x∗ − x0 , we obtain
x in E =⇒ (x − x0 ) · n ≤ 0. (7.5.6)
where the minimum is taken over all vectors x. A minimizer is the location of
the bottom of the graph of the function. For example, the parabola (Figure
7.4) and the relative information (Figure 7.10) both have global minimizers.
We say a function f (x) is strictly convex if g(t) = f (a + tv) is strictly
convex for every point a and direction v. This is the same as saying the
inequality (7.5.1) is strict for 0 < t < 1.
We say a function f (x) is proper if the sublevel set f (x) ≤ c is bounded
for every level c. Before we state this precisely, we contrast a level versus a
bound.
Let f (x) be a function. A level is a scalar c determining a sublevel set
f (x) ≤ c. A bound is a scalar C determining a bounded set |x|2 ≤ C.
To see this, suppose f (x) is not proper. In this case, by (7.5.9), there
would be a level c and a sequence x1 , x2 , . . . in the row space of A satisfying
|xn | → ∞ and f (xn ) ≤ c for n ≥ 1.
Let x′n = xn /|xn |. Then the x′n are unit vectors in the row space of A, hence
a bounded sequence. From §A.2, this implies x′n subconverges to some unit
vector x′ in the row space of A. Since f (xn) ≤ c,

    |Ax′n | = (1/|xn |)|Axn | ≤ (1/|xn |)(|Axn − b| + |b|) ≤ (1/|xn |)(√c + |b|).
Properness of Residual

When the N × d matrix A has rank d, the residual

    f (x) = |Ax − b|²

is proper on Rd .
To see this, pick any point a. Then, by properness, the sublevel set S
given by f (x) ≤ f (a) is bounded. By continuity of f (x), there is a minimizer
x∗ (see §A.2). Since for all x outside the sublevel set, we have f (x) > f (a),
x∗ is a global minimizer.
    f (x2) < (1/2)( f (x∗) + f (x1) ) = f (x∗),

contradicting the fact that x∗ is a global minimizer. Thus there cannot be
another global minimizer.
As a consequence,
Let a be any point, and v any direction, and let g(t) = f (a + tv). Then
g ′ (0) = ∇f (a) · v.
    ∂²f /∂xi ∂xj ,    1 ≤ i, j ≤ d,

    d/dt f (x + tv) = ∇f (x + tv) · v.

    d²/dt²|_{t=0} f (x + tv) = v · Qv.                                (7.5.15)
This implies
m L
|x − a|2 ≤ f (x) − f (a) − ∇f (a) · (x − a) ≤ |x − a|2 . (7.5.16)
2 2
m L
|x − x∗ |2 ≤ f (x) − f (x∗ ) ≤ |x − x∗ |2 . (7.5.17)
2 2
Here the maximum is over all vectors x, and p = (p1 , p2 , . . . , pd ), the dual
variable, also has d features. We will work in situations where a maximizer
exists in (7.5.18).
Let Q > 0 be a positive matrix. The simplest example is

    f (x) = (1/2) x · Qx  ⟹  g(p) = (1/2) p · Q⁻¹p.

This is established by the identity

    (1/2)(p − Qx) · Q⁻¹(p − Qx) = (1/2) p · Q⁻¹p − p · x + (1/2) x · Qx.   (7.5.19)

To see this, since the left side of (7.5.19) is greater or equal to zero, we have

    (1/2) p · Q⁻¹p − p · x + (1/2) x · Qx ≥ 0.

Since (7.5.19) equals zero iff p = Qx, we are led to (7.5.18).
Moreover, switching p · Q−1 p with x · Qx, we also have
Thus the convex dual of the convex dual of f (x) is f (x). In §7.6, we compute
the convex dual of the partition function.
0 = ∇x (p · x − f (x)) =⇒ p = ∇x f (x).
p = ∇x f (x) ⇐⇒ x = ∇p g(p).
This yields
Using this, and writing out (7.5.16) for g(p) instead of f (x) (we skip the
details) yields
    (p − q) · (x − a) ≥ (mL/(m + L)) |x − a|² + (1/(m + L)) |p − q|².    (7.5.22)
This is derived by using (7.5.21), the details are in [3]. This result is used
in gradient descent.
Then

    p · 1 = Σ_{k=1}^d pk = 1.
Because

    q1 = e^{z1}/(e^{z1} + e^{z2}) = 1/(1 + e^{−(z1−z2)}) = σ(z1 − z2),

    q2 = e^{z2}/(e^{z1} + e^{z2}) = 1/(1 + e^{−(z2−z1)}) = σ(z2 − z1).
Because of this, the softmax function is the multinomial analog of the logistic
function, and we use the same symbol σ to denote both functions.
z = array([z1,z2,z3])
q = softmax(z)
or
σ(z) = σ(z + a1).
To guarantee uniqueness of a global minimum of Z, we have to restrict
attention to the subspace of vectors z = (z1 , z2 , . . . , zd ) orthogonal to 1, the
vectors satisfying
z1 + z2 + · · · + zd = 0.
Now suppose z is orthogonal to 1. Since the exponential function is
convex,

    e^{Z(z)}/d = (1/d) Σ_{k=1}^d e^{zk} ≥ exp( (1/d) Σ_{k=1}^d zk ) = e⁰ = 1.

This establishes
z = Z1 + log p. (7.6.4)
The function

    I(p) = p · log p = Σ_{k=1}^d pk log pk                            (7.6.5)
This implies

    p · z = Σ_{k=1}^d pk zk = Σ_{k=1}^d pk log(e^{zk})
          ≤ log( Σ_{k=1}^d pk e^{zk} ) = log( Σ_{k=1}^d e^{zk + log pk} ) = Z(z + log p).
For all z,
Z(z) = max (p · z − I(p)) .
p
Since

    D²I(p) = diag( 1/p1 , 1/p2 , . . . , 1/pd ),

we see I(p) is strictly convex, and H(p) is strictly concave.
In Python, the entropy is
p = array([p_1,p_2,p_3])
entropy(p)
Now

    ∂²Z/∂zj ∂zk = ∂σj/∂zk = σj − σj σk if j = k,    ∂²Z/∂zj ∂zk = −σj σk if j ̸= k.
Hence we have

    v · Qv = Σ_{k=1}^d qk vk² − (v · q)² = Σ_{k=1}^d qk (vk − v̄)²,
    zj ≤ c,    j = 1, 2, . . . , d,

    −zj ≤ (d − 1)c,    j = 1, 2, . . . , d,

which implies

    |z|² = Σ_{k=1}^d zk² ≤ d(d − 1)²c².

Setting C = √d (d − 1)c, we conclude
Then

    p · log q = Σ_{k=1}^d pk log qk ,

and

    I(p, q) = I(p) − p · log q.                                       (7.6.12)
p = array([p1,p2,p3])
q = array([q1,q2,q3])
entropy(p,q)
returns the relative information, not the relative entropy. See below for more
on this terminology confusion.
    Z(z, q) = log( Σ_{k=1}^d e^{zk} qk ),

The cross-information is

    Icross (p, q) = − Σ_{k=1}^d pk log qk ,
Since I(p, σ(z)) and Icross (p, σ(z)) differ by the constant I(p), we also have
Here is the multinomial analog of (7.2.6). Suppose a dice has d faces, and
suppose the probability of rolling the k-th face in a single roll is qk . Then
q = (q1 , q2 , . . . , qd ) is a probability vector. Let p = (p1 , p2 , . . . , pd ) be another
probability vector. Roll the dice n times, and let Pn (p, q) be the probability
that the proportion of times the k-th face is rolled equals pk , k = 1, 2, . . . , d.
Then we have the approximation
How does one keep things straight? By remembering that it’s convex
functions that we like to minimize, not concave functions. In machine learn-
ing, loss functions are built to be minimized, and information, in any form,
is convex, while entropy, in any form, is concave. Table 7.25 summarizes the
situation.
H = −I        Information                Entropy
Absolute      I(p)                       H(p)
Cross         Icross (p, q)              Hcross (p, q)
Relative      I(p, q)                    H(p, q)
Curvature     Convex                     Concave
Error         I(p, q) with q = σ(z)

Table 7.25: The third row is the sum of the first and second rows, and the
H column is the negative of the I column.
Chapter 8
Machine Learning
8.1 Overview
This first section is an overview of the chapter. Here is a summary of the
structure of neural networks.
• In a directed graph, there are input nodes, output nodes, and hidden
nodes.
• A network is a weighed directed graph (§4.2) where the nodes are neu-
rons (§7.4).
Because wij = 0 if (i, j) is not an edge, the nonzero entries in the incoming
list at node j correspond to the edges incoming at node j.
A neural network is a network where every activation function is restricted
to be a function of the sum of the entries of the incoming list.
For example, all the networks in this section are neural networks, but the
network in Figure 7.13 is not a neural network.
Let

    x−j = Σ_{i→j} wij xi                                              (8.2.1)

be the sum of the incoming list at node j. Then, in a neural network, the
outgoing signal at node j is

    xj = fj (x−j) = fj ( Σ_{i→j} wij xi ).                            (8.2.2)
x = (x1 , x2 , . . . , xd ),
In a network, in §7.4, x− −
j was a list or vector; in a neural network, xj is a
scalar.
Let W be the weight matrix. If the network has d nodes, the activation
vector is
f = (f1 , f2 , . . . , fd ).
Then a neural network may be written in vector-matrix form
x = f (W t x).
However, this representation is more useful when the network has structure,
for example in a dense shallow layer (8.2.12).
y = f (w1 x1 + w2 x2 + · · · + wd xd ) = f (w · x)
Neural Network
Every neural network is a combination of perceptrons.
Figure 8.1: A perceptron: inputs x1 , x2 , x3 , weights w1 , w2 , w3 , activation f ,
output y.
y = f (w1 x1 + w2 x2 + · · · + wd xd + w0 ) = f (w · x + w0 ).
P rob(H | x) = σ(w · x + w0 ).
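In code, a perceptron is a single line of numpy; the weights, bias, and input
below are made up for illustration only.

import numpy as np

def sigmoid(z): return 1/(1 + np.exp(-z))

x = np.array([1.5, 2.5, -0.5])      # input
w = np.array([0.2, -0.1, 0.4])      # weights
w0 = 0.1                            # bias
y = sigmoid(np.dot(w, x) + w0)      # y = f(w . x + w0)
print(y)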
Perceptrons gained wide exposure after Minsky and Papert’s famous 1969
book [18], from which Figure 8.2 is taken.
Figure 8.3: A network with inputs x1 , x2 , neurons f3 , f4 , f5 , f6 , outputs x5 , x6 ,
and weights w13 , w14 , w23 , w24 , w35 , w36 , w45 , w46 .
Let xin and xout be the outgoing vectors corresponding to the input and
output nodes. Then the network in Figure 8.3 has outgoing vectors

    xin = (x1 , x2),    xout = (x5 , x6).

Here are the incoming and outgoing signals at each of the four neurons f3 ,
f4 , f5 , f6 .
def incoming(x,w,j):
    return sum([ outgoing(x,w,i)*w[i][j] if w[i][j] != None else 0 for i in range(d) ])

def outgoing(x,w,j):
    if x[j] != None: return x[j]
    else: return activate[j](incoming(x,w,j))
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
    d = len(w)
    x = [None]*d
    m = len(x_in)
    x[:m] = x_in
    for j in range(m,d): x[j] = outgoing(x,w,j)
    return x
activate = [None]*d
activate[2] = relu
activate[3] = id
activate[4] = sigmoid
activate[5] = tanh
x_in = [1.5,2.5]
x = forward_prop(x_in,w)
Let
y1 = 0.427, y2 = −0.288, y = (y1 , y2 )
be targets, and let J(xout , y) be a function of the outputs xout of the output
nodes, and the targets y. For example, for Figure 8.3, xout = (x5 , x6 ) and we
may take J to be the mean square error function or mean square loss

    J(xout , y) = (1/2)(x5 − y1)² + (1/2)(x6 − y2)².                  (8.2.6)
2 2
The code for this J is
def J(x_out,y):
    m = len(y)
    return sum([ (x_out[i] - y[i])**2/2 for i in range(m) ])
y0 = [0.132,-0.954]
y = [0.427, -0.288]
J(x_out,y0), J(x_out,y)
    ∂J/∂x−j ,    fj′ (x−j),    ∂J/∂xj .                               (8.2.7)

These are the downstream derivative, local derivative, and upstream derivative
at node j. (The terminology reflects the fact that derivatives are computed
backward.)
From (8.2.2),

    ∂xj/∂x−j = fj′ (x−j).                                             (8.2.8)

By the chain rule and (8.2.8), the key relation between these derivatives
is

    ∂J/∂x−i = (∂J/∂xi) · fi′ (x−i),                                   (8.2.9)

or

    downstream = upstream × local.
def local(x,w,i):
    return der_dict[activate[i]](incoming(x,w,i))
Let

    δi = ∂J/∂x−i ,    i = 1, 2, . . . , d.
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
    δ5 = ∂J/∂x−5 ,    δ6 = ∂J/∂x−6 ,    δout = (δ5 , δ6),

    ∂J/∂x5 = (x5 − y1) = −0.294.
    σ′ (x−5) = σ(x−5)(1 − σ(x−5)) = x5 (1 − x5) = 0.114.
Similarly,
δ6 = −0.059.
We conclude
δout = (−0.0337, −0.059).
The code for this is
def delta_out(x_out,y,w):
    d = len(w)
    m = len(y)
    return [ (x_out[i] - y[i]) * local(x,w,d-m+i) for i in range(m) ]

delta_out(x_out,y,w)
    ∂J/∂x−i = Σ_{i→j} (∂J/∂x−j) · (∂x−j/∂xi) · (∂xi/∂x−i)
            = ( Σ_{i→j} (∂J/∂x−j) · wij ) · fi′ (x−i).
The code is

def downstream(x,delta,w,i):
    if delta[i] != None: return delta[i]
    else:
        upstream = sum([ downstream(x,delta,w,j) * w[i][j] if w[i][j] != None else 0
                         for j in range(d) ])
        return upstream * local(x,w,i)

def backward_prop(x,y,w):
    d = len(w)
    delta = [None]*d
    m = len(y)
    x_out = x[d-m:]
    delta[d-m:] = delta_out(x_out,y,w)
    for i in range(d-m): delta[i] = downstream(x,delta,w,i)
    return delta

delta = backward_prop(x,y,w)

returns the vector δ of downstream derivatives.
    ∂x−j/∂wij = xi ,

see also Table 8.4. From this,

    ∂J/∂wij = (∂J/∂x−j) · (∂x−j/∂wij) = δj · xi .

We have shown

    ∂J/∂wij = xi · δj .                                               (8.2.11)
Our convention is to let wij denote the weight on the edge (i, j). With this
convention, the formulas (8.2.1), (8.2.2) reduce to the matrix multiplication
formulas
z − = W t x, z = f (W t x). (8.2.12)
Thus a dense shallow network can be thought of as a vector-valued percep-
tron. This allows for vectorized forward and back propagation.
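A minimal sketch of this vectorized form (the shapes and values below are
assumptions chosen only for illustration):

import numpy as np

def sigmoid(z): return 1/(1 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))     # weight matrix: 4 inputs, 3 outputs
x = rng.normal(size=4)          # input vector
z_minus = W.T @ x               # incoming sums, z- = W^t x
z = sigmoid(z_minus)            # outgoing signals, z = f(W^t x)
print(z)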
This goal is so general that any concrete insight one provides towards
this goal is widely useful in many settings. The setting we have in
mind is f = J, where J is the mean error from §8.1.
Usually f (w) is a measure of cost or lack of compatibility. Because of
this, f (w) is called the loss function or cost function.
A neural network is a black box with inputs x and outputs y, depending on
unknown weights w. To train the network is to select weights w in response
to training data (x, y). The optimal weights w∗ are selected as minimizers
of a loss function f (w) measuring the error between predicted outputs and
actual outputs, corresponding to given training inputs.
From §7.5, if the loss function f (w) is continuous and proper, there is
a global minimizer w∗ . If f (w) is in addition strictly convex, w∗ is unique.
Moreover, if the gradient of the loss function is g = ∇f (w), then w∗ is a
critical point, g ∗ = ∇f (w∗ ) = 0.
Inserting a = w and b = w+ , and solving for w+ ,

    w+ ≈ w − g(w)/g ′ (w).

Since the global minimizer w∗ satisfies f ′ (w∗) = 0, we insert g(w) = f ′ (w)
in the above approximation,

    w+ ≈ w − f ′ (w)/f ′′ (w).

Iterating,

    wn+1 = wn − f ′ (wn)/f ′′ (wn),    n = 1, 2, . . .
def newton(loss,grad,curv,w,num_iter):
    g = grad(w)
    c = curv(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= g/c
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        c = curv(w)
        if allclose(g,0): break
    return trajectory
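The loss, grad, and curv used below are those of the double-well example
f (w) = w⁴ − 6w² + 2w discussed later in this section (Figures 8.9, 8.10, 8.11):

def loss(w): return w**4 - 6*w**2 + 2*w
def grad(w): return 4*w**3 - 12*w + 2
def curv(w): return 12*w**2 - 12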
u0 = -2.72204813
w0 = 2.45269774
num_iter = 20
trajectory = newton(loss,grad,curv,w0,num_iter)
def plot_descent(a,b,loss,curv,delta,trajectory):
    w = arange(a,b,delta)
    plot(w,loss(w),color='red',linewidth=1)
    plot(w,curv(w),"--",color='blue',linewidth=1)
    plot(*trajectory,color='green',linewidth=1)
    scatter(*trajectory,s=10)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
    f (w) = f (w1 , w2 , . . . ).

    w1+ = w1 − t ∂f /∂w1 ,
    w2+ = w2 − t ∂f /∂w2 ,
    . . .

In other words,

    w+ = w − t ∇f (w).
In practice, the learning rate is selected by trial and error. Which learning
rate does the theory recommend?
Given an initial point w0 , the sublevel set at w0 (see §7.5) consists of all
points w where f (w) ≤ f (w0 ). Only the part of the sublevel set that is
connected to w0 counts.
Figure 8.10: Double well cost function and sublevel sets at w0 and at w1 .
In Figure 8.10, the sublevel set at w0 is the interval [u0 , w0 ]. The sublevel
set at w1 is the interval [b, w1 ]. Notice we do not include any points to the
left of b in the sublevel set at w1 , because points to the left of b are separated
from w1 by the gap at the point b.
Suppose the second derivative D²f (w) is never greater than a constant L
on the sublevel set. Then, for the gradient descent step w+ = w − t∇f (w)
with learning rate 0 ≤ t ≤ 1/L,

    f (w+) ≤ f (w) − (t/2)|∇f (w)|².                                  (8.3.3)

To see this, fix w and let S be the sublevel set {w′ : f (w′) ≤ f (w)}. Since
the gradient pushes f down, for t > 0 small, w+ stays in S. Insert x = w+
and a = w into the right half of (7.5.16) and simplify. This leads to

    f (w+) ≤ f (w) − t|∇f (w)|² + (t²L/2)|∇f (w)|².

Since tL ≤ 1 when 0 ≤ t ≤ 1/L, we have t²L ≤ t. This derives (8.3.3).
The curvature of the loss function and the learning rate are inversely
proportional. Where the curvature of the graph of f (w) is large, the learning
rate 1/L is small, and gradient descent proceeds in small time steps.
    f (wn+1) ≤ f (wn) − (1/(2L))|∇f (wn)|².

Since f (wn) and f (wn+1) both converge to f (w∗), and ∇f (wn) converges to
∇f (w∗), we conclude

    f (w∗) ≤ f (w∗) − (1/(2L))|∇f (w∗)|²,

hence ∇f (w∗) = 0.
For example, let f (w) = w⁴ − 6w² + 2w (Figures 8.9, 8.10, 8.11). Then

    f ′ (w) = 4w³ − 12w + 2,    f ′′ (w) = 12w² − 12.

Thus the inflection points (where f ′′ (w) = 0) are ±1 and, in Figure 8.10, the
critical points are a, b, c.
Let u0 and w0 be the points satisfying f (w) = 5 as in Figure 8.11.
Then u0 = −2.72204813 and w0 = 2.45269774, so f ′′ (u0 ) = 76.914552 and
f ′′ (w0 ) = 60.188. Thus we may choose L = 76.914552. With this L, the
short-step gradient descent starting at w0 is guaranteed to converge to one
of the three critical points. In fact, the sequence converges to the right-most
critical point c (Figure 8.10).
This exposes a flaw in basic gradient descent. Gradient descent may con-
verge to a local minimizer, and miss the global minimizer. In §8.8, modified
gradient descent will address some of these shortcomings.
def gd(loss,grad,w,learning_rate,num_iter):
    g = grad(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= learning_rate * g
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        if allclose(g,0): break
    return trajectory
u0 = -2.72204813
w0 = 2.45269774
L = 76.914552
learning_rate = 1/L
num_iter = 100
trajectory = gd(loss,grad,w0,learning_rate,num_iter)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
xin → xout .
Given inputs xin and target outputs y, we seek to modify the weight matrix
W so that the input-output map is
xin → y.
This is training.
Let (§8.2)

    x− = (x−1 , x−2 , . . . , x−d),    x = (x1 , x2 , . . . , xd),    δ = (δ1 , δ2 , . . . , δd)

be the incoming, outgoing, and downstream-derivative vectors over the nodes.
Let wij be the weight along an edge (i, j), let xi be the outgoing signal
from the i-th node, and let δj be the downstream derivative of the
output J with respect to the j-th node. Then the derivative ∂J/∂wij
equals xi δj . In this partial sense,
∇W J = x ⊗ δ. (8.4.2)
def update_weights(x,delta,w,learning_rate):
    d = len(w)
    for i in range(d):
        for j in range(d):
            if w[i][j]:
                w[i][j] = w[i][j] - learning_rate*x[i]*delta[j]
def train_nn(x_in,y,w0,learning_rate,n_iter):
    trajectory = []
    cost = 1
    # build a local copy
    w = [ row[:] for row in w0 ]
    d = len(w0)
    for _ in range(n_iter):
        x = forward_prop(x_in,w)
        delta = backward_prop(x,y,w)
        update_weights(x,delta,w,learning_rate)
        m = len(y)
        x_out = x[d-m:]
        cost = J(x_out,y)
        trajectory.append(cost)
        if allclose(0,cost): break
    return w, trajectory
Here n_iter is the maximum number of iterations allowed, and the iterations
stop if the cost J is close to zero.
The cost or error function J enters the code only through the function
delta_out, which is part of the function backward_prop.
Let W0 be the weight matrix (8.2.4). Then
x_in = [1.5,2.5]
learning_rate = .01
y0 = 0.4265356063
y1 = -0.2876478137
y = [y0,y1]
n_iter = 10000
w, trajectory = train_nn(x_in,y,w0,learning_rate,n_iter)
returns the cost trajectory, which can be plotted using the code
for lr in [.01,.02,.03,.035]:
    w, trajectory = train_nn(x_in,y,w0,lr,n_iter)
    n = len(trajectory)
    label = str(n) + ", " + str(lr)
    plot(range(n),trajectory,label=label)

grid()
legend()
show()
Figure 8.12: Cost trajectory and number of iterations as learning rate varies.
    J(W ) = Σ_{k=1}^N J(xk , yk , W ).                                (8.5.1)
We consider two kinds of loss functions J(x, y, W ), leading to

• linear regression,
• logistic regression.

With any loss function J, the goal is to minimize J. With this in mind,
from §7.5, we recall that a continuous, proper loss function J(W ) has a global
minimizer W ∗ ,

    J(W ∗) ≤ J(W ),    for all W.
    J(x, y, W ) = (1/2)|W t x − y|².                                  (8.5.2)
Then (8.5.1) is the mean square error or mean square loss, and the problem
of minimizing (8.5.1) is linear regression (Figure 8.13).
We use (7.3.4) to compute the gradient of J(x, y, W ). Let V be a weight
matrix, and let v = V t x, z = W t x. Then (W + sV )t x = z + sv, and the
directional derivative is

    d/ds|_{s=0} J(x, y, W + sV ) = d/ds|_{s=0} (1/2)|z + sv − y|²
                                 = v · (z − y) = (V t x) · (z − y)
                                 = trace( (V t x) ⊗ (z − y) ) = trace( V t (x ⊗ (z − y)) ).

By (7.3.4), this implies the weight gradient for mean square loss is

    ∇W J(x, y, W ) = x ⊗ (z − y) = x ⊗ (W t x − y).
Figure 8.13: The linear regression network: z = W t x, J = |z − y|²/2.
Since J(W ) is the sum of J(x, y, W ) over all samples, J(W ) is convex.
To check strict convexity of J(W ), suppose
    d²/ds²|_{s=0} J(W + sV ) = 0.
V t xk = 0, k = 1, 2, . . . , N. (8.5.5)
Recall the feature space is the vector space of all inputs x, and (§2.9)
a dataset is full-rank if the span of the dataset is the entire feature space.
When this happens, (8.5.5) implies V = 0, hence J(W ) is strictly convex.
To check properness of J(W ), by definition (7.5.9), we have to show
J(W ) ≤ c =⇒ ∥W ∥2 ≤ C. (8.5.6)
Here ∥W ∥ is the norm of the matrix W (2.2.11). The exact formula for the
bound C, which is not important for our purposes, depends on the level c
and the dataset.
If J(W ) ≤ c, by (8.5.1), (8.5.2), and the triangle inequality,

    |W t xk | ≤ √(2c) + |yk |,    k = 1, 2, . . . , N.
|W t x| ≤ C1 . (8.5.7)
Here I(p, q) is the relative information, measuring the error between the
desired target p and the computed target q, and q = σ(z) is the softmax
function, squashing the network’s output z = W t x into the probability q.
When p is one-hot encoded, by (7.6.15), I(p, q) = Icross (p, q).
Because of this, in the literature, in the one-hot encoded case, (8.5.1) is called
the cross-entropy loss.
In either case, strict or one-hot encoded, J(W ) is logistic loss or logistic
error, and the problem of minimizing (8.5.1) is logistic regression (Figure
8.14).
Since we will be considering both strict and one-hot encoded probabilities,
we work with I(p, q) rather than Icross (p, q). Table 7.25 is a useful summary
of the various information and entropy concepts.
Figure 8.14: The logistic regression network: z = W t x, q = σ(z), J = I(p, q).
    = trace( V t (x ⊗ (q − p)) ).

    Σ_{j=1}^d vj² qj − ( Σ_{j=1}^d vj qj )² = Σ_{j=1}^d (vj − v̄)² qj .

    (W + sV )t x = z + sv.
vanishes, then, since the summands are nonnegative, (8.5.12) vanishes, for
every sample x = xk , p = pk , hence
V t xk = 0, k = 1, 2, . . . , N.
The convex hull is discussed in §7.5, see Figures 7.20 and 7.21. If Ki were
just the samples x whose corresponding targets p satisfy pi > 0 (with no
convex hull), then the intersection Ki ∩ Kj may be empty.
For example, if p were one-hot encoded, then x belongs to at most one
Ki . Thus taking the convex hull in the definition of Ki is crucial. This is
clearly seen in Figure 8.26: The samples never intersect, but the convex hulls
may do so.
To establish properness of J(W ), by definition (7.5.9), we have to show

    J(W ) ≤ c and W 1 = 0  ⟹  ∥W ∥² ≤ C.

The exact formula for the bound C, which is not important for our purposes,
depends on the level c and the dataset.
Suppose J(W ) ≤ c, with W 1 = 0. Then I(p, σ(W t x)) = J(x, p, W ) ≤ c
for every sample x and corresponding target p.
Let x be a sample, let z = W t x, and suppose the corresponding target p
satisfies pi ≥ ϵ, for some class i, and some ϵ > 0. If j ̸= i, then

    ϵ(zj − zi) ≤ pi (Z(z) − zi) ≤ Σ_{k=1}^d pk (Z(z) − zk) = Z(z) − p · z.

Here we used zj < Z(z), and Z(z) − zk > 0 for all k. By (7.6.14),

    ϵ(zj − zi) ≤ c + log d.
Let x be any vector in feature space, and let z = W t x. Since span(Ki ∩Kj )
is full-rank for every i and j, x is a linear combination of vectors in Ki ∩ Kj .
This implies, by (8.5.14), there is a bound C2 , depending on x but not on
W , such that
|zi − zj | ≤ C2 , for every i and j, (8.5.15)
Since z · 1 = 0, z_i = −∑_{j≠i} z_j. Summing (8.5.15) over j ≠ i,
$$d\,|z_i| = |(d-1)z_i + z_i| = \Big|\sum_{j \ne i} (z_i - z_j)\Big| \le (d-1)\,C_2.$$
In strict regression, the equation
$$\nabla_z J(z, p) = 0$$
can always be solved, so there is at least one minimum for each J(z, p). Here properness ultimately depends on properness of the partition function Z(z).
In one-hot encoded regression, J(z, p) = I(p, σ(z)) and ∇z J(z, p) = 0
can never be solved, because q = σ(z) is always strict and p is one-hot
encoded, see (8.5.10). Nevertheless, trainability of J(W ) is achievable if
there is sufficient overlap between the sample categories.
In linear regression, the minimizer is expressible in terms of the regression
equation, and thus can be solved in principle using the pseudo-inverse. In
practice, when the dimensions are high, gradient descent may be the only
option for linear regression.
In logistic regression, the minimizer cannot be found in closed form, so
we have no choice but to apply gradient descent, even for low dimensions.
$$y = w \cdot x = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d.$$
Linear Regression
We work out the regression equation in the plane, when both features x
and y are scalar. In this case, w = (m, b) and
$$X = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}.$$
Then (x̄, ȳ) is the mean of the dataset. Also, let x and y denote the vectors (x1, x2, ..., xN) and (y1, y2, ..., yN), and let, as in §1.6,
$$\operatorname{cov}(x, y) = \frac{1}{N}\sum_{k=1}^N (x_k - \bar x)(y_k - \bar y) = \frac{1}{N}\, x \cdot y - \bar x\,\bar y.$$
$$\frac{1}{N}(x \cdot x)\, m + \bar x\, b = \frac{1}{N}\, x \cdot y, \qquad m\,\bar x + b = \bar y.$$
The second equation says the regression line passes through the mean (x̄, ȳ).
Multiplying the second equation by x̄ and subtracting the result from the
first equation cancels the b and leads to
$$\operatorname{cov}(x, x)\, m = \operatorname{cov}(x, y).$$
This derives:

The regression line in two dimensions passes through the mean (x̄, ȳ) and has slope
$$m = \frac{\operatorname{cov}(x, y)}{\operatorname{cov}(x, x)}.$$
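A quick numpy check of this slope formula on made-up data (the values below are illustrative only):

from numpy import *

x = array([1.0, 2.0, 3.0, 4.0, 5.0])     # illustrative features
y = array([2.1, 3.9, 6.2, 8.1, 9.9])     # illustrative targets

covxy = mean(x*y) - mean(x)*mean(y)      # cov(x, y)
covxx = mean(x*x) - mean(x)**2           # cov(x, x)
m = covxy / covxx                        # slope of the regression line
b = mean(y) - m*mean(x)                  # intercept: the line passes through the mean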
from numpy import *
from pandas import read_csv

df = read_csv("longley.csv")
X = df["Population"].to_numpy()
Y = df["Employed"].to_numpy()

# center and scale to zero mean and unit variance
X = X - mean(X)
Y = Y - mean(Y)
varx = sum(X**2)/len(X)
vary = sum(Y**2)/len(Y)
X = X/sqrt(varx)
Y = Y/sqrt(vary)
After this, we compute the optimal weight w∗ and construct the polynomial. The regression equation is solved using the pseudo-inverse (§2.6).
def poly(x,d):
    # Aplus is the pseudo-inverse of the matrix of sample powers, b the targets (built in code not shown)
    wstar = dot(Aplus,b)
    return sum([ x**i*wstar[i] for i in range(d) ],axis=0)
from matplotlib.pyplot import *

# x ranges over the extent of the data
xmin, xmax = amin(X), amax(X)

figure(figsize=(12,12))
# six subplots
rows, cols = 3,2
# x interval
x = arange(xmin,xmax,.01)

for i in range(6):
    d = 3 + 2*i # degree = d-1
    subplot(rows, cols,i+1)
    plot(X,Y,"o",markersize=2)
    plot([0],[0],marker="o",color="red",markersize=4)
    plot(x,poly(x,d),color="blue",linewidth=.5)
    xlabel("degree = %s" % str(d-1))
    grid()

show()
Running this code with degree 1 returns Figure 8.15. Taking too high a
power can lead to overfitting, for example for degree 12.
Study times x and outcomes p (p = 1 if the student passed, p = 0 if not):

  x      p     x      p     x      p     x      p     x      p
 0.5     0    0.75    0    1.0     0    1.25    0    1.5     0
 1.75    0    1.75    1    2.0     0    2.25    1    2.5     0
 2.75    1    3.0     0    3.25    1    3.5     0    4.0     1
 4.25    1    4.5     1    4.75    1    5.0     1    5.5     1
More generally, we may only know the amount of study time x, and the
probability p that the student passed, where now 0 ≤ p ≤ 1.
For example, the data may be as in Figure 8.18, where pk equals 1 or 0
according to whether they passed or not.
As stated, the samples of this dataset are scalars, and the dataset is
one-dimensional (Figure 8.19).
Plotting the dataset on the (x, p) plane, the goal is to fit a curve as in
Figure 8.20.
To apply the results from the previous section, we incorporate the bias as an extra input fixed equal to 1.
Let σ(z) be the sigmoid function (5.1.12). Then, as in the previous sec-
tion, the goal is to minimize the loss function
$$J(m, b) = \sum_{k=1}^N I(p_k, q_k), \qquad q_k = \sigma(m x_k + b). \tag{8.6.4}$$
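In code, (8.6.4) can be evaluated directly; the following is a minimal sketch, assuming the scalar relative information reduces to −log q when p = 1 and −log(1 − q) when p = 0, and with X, P the study times and outcomes listed in the descent code below:

from numpy import log
from scipy.special import expit

def I(p, q):
    # scalar relative information for p equal to 0 or 1
    return -log(q) if p == 1 else -log(1 - q)

def J(m, b):
    return sum( I(p, expit(m*x + b)) for x, p in zip(X, P) )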
The same dataset with one-hot encoded targets, (1, 0) for p = 0 and (0, 1) for p = 1:

  x      p        x      p        x      p        x      p        x      p
 0.5    (1,0)    0.75   (1,0)    1.0    (1,0)    1.25   (1,0)    1.5    (1,0)
 1.75   (1,0)    1.75   (0,1)    2.0    (1,0)    2.25   (0,1)    2.5    (1,0)
 2.75   (0,1)    3.0    (1,0)    3.25   (0,1)    3.5    (1,0)    4.0    (0,1)
 4.25   (0,1)    4.5    (0,1)    4.75   (0,1)    5.0    (0,1)    5.5    (0,1)
[Figure 8.23: the two-output network, with weights ±m, ±b from the inputs x, 1, outputs ±z, softmax output q = σ(z), and loss J = I(p, q).]
Since here d = 2, the networks in Figures 8.23 and 8.24 are equivalent.
In Figure 8.23, σ is the softmax function, I is given by (7.6.5), and p, q are
probability vectors. In Figure 8.24, σ is the sigmoid function, I is given by
(7.2.3), and p, q are probability scalars.
[Figure 8.24: the single-output network, with weights m, b from the inputs x, 1, output z = mx + b, sigmoid output q = σ(z), and loss J = I(p, q).]
Figure 8.20 is a plot of x against p. However, the dataset, with the bias
input included, has two inputs x, 1 and one output p, and should be plotted
in three dimensions (x, 1, p). Then (Figure 8.25) samples lie on the line (x, 1)
in the horizontal plane, and p is on the vertical axis.
[Figure 8.25: the dataset plotted in three dimensions (x, 1, p), with the samples in the horizontal feature plane and p on the vertical axis.]
The horizontal plane in Figure 8.25, which is the plane in Figure 8.21, is
feature space. The convex hulls K0 and K1 are in feature space, so the convex
hull K0 of the samples corresponding to p = 0 is the line segment joining
(.5, 1, 0) and (3.5, 1, 0), and the convex hull K1 of the samples corresponding
to p = 1 is the line segment joining (1.75, 1, 0) and (5.5, 1, 0). In Figure 8.25,
K0 is the line segment joining the blue points, and K1 is the projection onto
feature space of the line segment joining the red points. Since K0 ∩ K1 is the
line segment joining (1.75, 1, 0) and (3.5, 1, 0), the span of K0 ∩ K1 is all of
feature space. By the results of the previous section, J(w) is proper.
Here is the descent code.
from numpy import *
from scipy.special import expit

X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
     2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

def gradient(m,b):
    return sum([ (expit(m*x+b) - p) * array([x,1]) for x,p in zip(X,P) ],axis=0)

# gradient descent
w = array([0,0]) # starting m,b
g = gradient(*w)
t = .01 # learning rate
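The descent loop itself is not shown in the excerpt above; a minimal sketch, assuming a fixed learning rate and a stopping test on the size of the gradient:

from numpy.linalg import norm

# iterate plain gradient descent until the gradient is negligible
while norm(g) > 1e-8:
    w = w - t*g
    g = gradient(*w)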
Running the descent returns the minimizer
$$m^* = 1.49991537, \qquad b^* = -4.06373862.$$
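The fitted pass probability for a study time x is then q = σ(m∗x + b∗), which crosses 1/2 near x = −b∗/m∗ ≈ 2.71 hours. For example, with expit as above,

q = expit(1.49991537*3.0 - 4.06373862)   # fitted probability of passing after 3 hours of study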
The Iris dataset consists of 150 samples divided into three groups, leading
to three convex hulls K0 , K1 , K2 in R4 . If the dataset is projected onto the
top two principal components, then the projections of these three hulls do
not pair-intersect (Figure 8.26). It follows we have no guarantee the logistic
loss is proper.
On the other hand, the MNIST dataset consists of 60,000 samples divided
into ten groups. If the MNIST dataset is projected onto the top two principal
components, the projections of the ten convex hulls K0, K1, ..., K9 onto R2 do intersect (Figure 8.27).
This does not guarantee that the ten convex hulls K0 , K1 , . . . , K9 in R784
intersect, but at least this is so for the 2d projection of the MNIST dataset.
Therefore the logistic loss of the 2d projection of the MNIST dataset is
proper.
$$z = w \cdot x + w_0$$
with
$$z \le 0 \ \text{ for } x \text{ in } A, \qquad z \ge 0 \ \text{ for } x \text{ in } B.$$
In the case of two classes, the results in §7.5 and §8.5 lead to the following
result [12].
• The two classes are not linearly separable: When the means of
the classes are distinct, the log loss J(w, w0 ) is trainable (strictly
convex and proper), and there is a unique minimizer (w∗ , w0∗ )
with w∗ ̸= 0, hence an optimal single-layer perceptron
q = σ(w∗ · x + w0∗ ).
¹ In the literature, the condition number is often defined as L/m.
Gradient Descent I
Let r = m/L and set E(w) = f(w) − f(w∗). Then the descent sequence w0, w1, w2, ... given by (8.3.1) with learning rate
$$t = \frac{1}{L}$$
converges to w∗ at the rate
$$E(w_n) \le (1 - r)^n\, E(w_0), \qquad n = 1, 2, \dots$$
$$g \cdot (w - w^*) \ge \frac{mL}{m+L}\,|w - w^*|^2 + \frac{1}{m+L}\,|g|^2.$$
Using this inequality, (8.3.1), and the choice t = 2/(m + L), this implies the following result.
Gradient Descent II
Let r = m/L and set E(w) = |w − w∗|². Then the descent sequence w0, w1, w2, ... given by (8.3.1) with learning rate
$$t = \frac{2}{m+L}$$
converges to w∗ at the rate
$$E(w_n) \le \left(\frac{1-r}{1+r}\right)^{\!2n} E(w_0), \qquad n = 1, 2, \dots \tag{8.7.6}$$
Let g be the gradient of the loss function at a point w. Then the line
passing through w in the direction of g is w − tg. When the loss function is
quadratic (7.3.7), f (w − tg) is a quadratic function of the scalar variable t.
In this case, the minimizer t along the line w − tg is explicitly computable as
$$t = \frac{g \cdot g}{g \cdot Qg}.$$
This leads to gradient descent with varying time steps t0 , t1 , t2 , . . . . As a
consequence, one can show the error is lowered as follows,
$$E(w^+) = \left(1 - \frac{1}{(u \cdot Qu)(u \cdot Q^{-1}u)}\right) E(w), \qquad u = \frac{g}{|g|}.$$
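Here is a minimal numpy sketch of this exact line search, for a quadratic loss of the form f(w) = ½ w·Qw − b·w; the matrix Q, the vector b, and the starting point below are illustrative only:

from numpy import array, dot
from numpy.linalg import norm

Q = array([[3.0, 1.0], [1.0, 2.0]])      # positive-definite matrix (illustrative)
b = array([1.0, -1.0])
w = array([0.0, 0.0])

for n in range(100):
    g = dot(Q, w) - b                    # gradient of f(w) = w.Qw/2 - b.w
    if norm(g) < 1e-12:
        break
    t = dot(g, g) / dot(g, dot(Q, g))    # exact minimizer along the line w - t*g
    w = w - t*g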
w◦ = w + s(w − w− ). (8.8.1)
Here s is the decay rate. The momentum term reflects the direction induced
by the previous step. Because this mimics the behavior of a ball rolling
downhill, gradient descent with momentum is also called heavy ball descent.
Then the descent sequence w0, w1, w2, ... is generated by
$$w^+ = w^\circ - t\,\nabla f(w). \tag{8.8.2}$$
Here we have two hyperparameters, the learning rate and the decay rate.
$$w_n = w^* + \rho^n v, \qquad Qv = \lambda v. \tag{8.8.5}$$
Inserting this into (8.8.3) and using Qw∗ = b leads to the quadratic equation
$$\rho^2 = (1 - t\lambda + s)\rho - s.$$
$$A_k + B_k = (w_0 - w^*) \cdot v_k, \qquad A_k \rho_k + B_k \bar\rho_k = (w_1 - w^*) \cdot v_k = (1 - t\lambda_k)(w_0 - w^*) \cdot v_k,$$
Let
$$C = \max_\lambda \frac{(L-m)^2}{(L-\lambda)(\lambda-m)}. \tag{8.8.11}$$
$$\begin{aligned}
|w_n - w^*|^2 &= \sum_{k=1}^d |(w_n - w^*) \cdot v_k|^2 \\
&= \sum_{k=1}^d |A_k \rho_k^n + B_k \bar\rho_k^n|^2 \\
&\le \sum_{k=1}^d (|A_k| + |B_k|)^2 |\rho_k|^{2n} \\
&\le 4Cs^n \sum_{k=1}^d |(w_0 - w^*) \cdot v_k|^2 \\
&= 4Cs^n\, |w_0 - w^*|^2.
\end{aligned}$$
Suppose the loss function f(w) is quadratic (8.7.2), let r = m/L, and set E(w) = |w − w∗|². Let C be given by (8.8.11). Then the descent sequence w0, w1, w2, ... given by (8.8.2) with learning rate and decay rate
$$t = \frac{1}{L} \cdot \frac{4}{(1+\sqrt r)^2}, \qquad s = \left(\frac{1-\sqrt r}{1+\sqrt r}\right)^{\!2},$$
converges to w∗ at the rate
$$E(w_n) \le 4C \left(\frac{1-\sqrt r}{1+\sqrt r}\right)^{\!2n} E(w_0), \qquad n = 1, 2, \dots \tag{8.8.12}$$
This rate is only guaranteed for f(w) quadratic (8.7.2). In fact, there are examples of non-quadratic f(w) where heavy ball descent does not converge to w∗. Nevertheless, this method is widely used.
For accelerated gradient descent, the gradient is instead evaluated at the momentum point w◦:
$$w^\circ = w + s(w - w^-), \qquad w^+ = w^\circ - t\,\nabla f(w^\circ). \tag{8.8.13}$$
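In code, the iteration (8.8.13) might look as follows; the quadratic stand-in loss, Q, b, and the hyperparameter values are illustrative only:

from numpy import array, dot

Q = array([[3.0, 1.0], [1.0, 2.0]])     # positive-definite matrix (illustrative)
b = array([1.0, -1.0])

def grad_f(w):
    # gradient of the stand-in loss f(w) = w.Qw/2 - b.w
    return dot(Q, w) - b

t, s = 0.1, 0.5                          # illustrative learning rate and decay rate
w_prev = array([0.0, 0.0])
w = array([0.0, 0.0])

for n in range(200):
    w_circ = w + s*(w - w_prev)          # momentum point w°
    w_new = w_circ - t*grad_f(w_circ)    # accelerated step: gradient evaluated at w°
    w_prev, w = w, w_new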
It turns out the function
$$V(w) = f(w) + \frac{L}{2}\,|w - \rho w^-|^2, \tag{8.8.14}$$
with a suitable choice of ρ, does the job. With the choices
$$t = \frac{1}{L}, \qquad s = \frac{1-\sqrt r}{1+\sqrt r}, \qquad \rho = 1 - \sqrt r,$$
we will show
$$V(w^+) \le \rho\, V(w). \tag{8.8.15}$$
Taking the initial momentum point to be w0⁻ = w0,
$$V(w_0) = f(w_0) + \frac{L}{2}|w_0 - \rho w_0|^2 = f(w_0) + \frac{m}{2}|w_0|^2 \le 2f(w_0).$$
Moreover f(w) ≤ V(w). Iterating (8.8.15), we obtain
$$V(w_n) \le \rho^n\, V(w_0), \qquad n = 1, 2, \dots$$
This derives: if f(w) satisfies (8.7.1), then, with the above choices of t and s, and with E(w) = f(w) − f(w∗), the accelerated descent sequence satisfies
$$E(w_n) \le 2\,(1 - \sqrt r)^n\, E(w_0), \qquad n = 1, 2, \dots$$
While the convergence rate for accelerated descent is slightly worse than
heavy ball descent, the value of accelerated descent is its validity for all
convex functions satisfying (8.7.1), and the fact, also due to Nesterov [19],
that this convergence rate is best-possible among all such functions.
Now we derive (8.8.15). Assume (w+ )− = w and w∗ = 0, f (w∗ ) = 0. We
know w◦ = (1 + s)w − sw− and w+ = w◦ − tg ◦ , where g ◦ = ∇f (w◦ ).
$$R(a, b) = r\rho s^2 |b|^2 + (1-\rho)|a + sb|^2 - |(1-\rho)a + sb|^2 + \rho\,|(1-\rho)a + \rho b|^2,$$
which is positive.
Appendices
A.1 SQL
Recall that matrices (§2.1), datasets, CSV files, spreadsheets, arrays, and dataframes are basically the same objects. Databases are collections of tables, where a table is another object similar to the above. Hence SQL tables join this list of interchangeable objects:

matrix ←→ dataset ←→ CSV file ←→ spreadsheet ←→ array ←→ dataframe ←→ SQL table.   (A.1.2)
select from
limit
select distinct
where/where not <column>
where <column> = <data> and/or <column> = <data>
order by <column1>,<column2>
insert into table (<column1>,<column2>,...) \
values (<data1>, <data2>, ...)
is null
update <table> set <column> = <data> where ...
like <regex> (%, _, [abc], [a-f], [!abc])
delete from <table> where ...
select min(<column>) from <table> (also max, count, avg)
where <column> in/not in (<data array>)
between/not between <data1> and <data2>
as
join (left, right, inner, full)
create database <database>
drop database <database>
create table <table>
truncate <table>
alter table <table> add <column> <datatype>
alter table <table> drop column <column>
insert into <table> select
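As an illustration of how such statements are issued from Python, here is a minimal sketch using sqlalchemy; the URI is of the kind constructed later in this appendix, the table and column names follow the Menu table used below, and the price threshold is a placeholder:

from sqlalchemy import create_engine, text

# URI of the form built later in this appendix
uri = "mysql+pymysql://username:password@servername:3306/rawa"
engine = create_engine(uri)

with engine.connect() as connection:
    result = connection.execute(
        text("select dish, price from Menu where price < 1500 order by price"))
    for row in result:
        print(row)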
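Consider, for example, a dict along the following lines (the price, in cents, and the quantity shown here are only illustrative; the keys are those discussed next):

item1 = { "dish":"Hummus", "price":800, "quantity":2 }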
This is an unordered listing of key-value pairs. Here the keys are the strings dish, price, and quantity. Keys need not be strings; they may be integers or any immutable Python objects. Since a Python list is mutable, a key cannot be a list. Values may be any Python objects, so a value may be a list. In a dict, values are accessed through their keys. For example, item1["dish"] returns 'Hummus'.
A list-of-dicts is simply a Python list whose elements are Python dicts, for example L = [item1, item2]. Then

len(L), L[0]["dish"]

returns

(2, 'Hummus')
returns True.
from json import dumps, loads

s = dumps(L)
Now print L and print s. Even though L and s “look” the same, L is a
list, and s is a string. To emphasize this point, note
L1 = loads(s)
Then L == L1 returns True. Strings having this form are called JSON
strings, and are easy to store in a database as VARCHARs (see Figure A.4).
The basic object in the Python module pandas is the dataframe (Figures
A.1, A.2, A.4, A.5). The pandas module can convert a dataframe df to
many, many other formats
from pandas import DataFrame

df = DataFrame(L)
df

L1 = df.to_dict('records')
L == L1

returns True. Here the option 'records' returns a list-of-dicts; other options return a dict-of-dicts or other combinations.
To convert a CSV file into a dataframe, use the code
menu_df = read_csv("menu.csv")
menu_df
To go the other way, to convert the dataframe df to the CSV file menu1.csv, use the code
df.to_csv("menu1.csv")
df.to_csv("menu2.csv",index=False)
protocol = "mysql+pymysql://"
credentials = "username:password"
server = "@servername"
port = ":3306"
uri = protocol + credentials + server + port
This string contains your database username, your database password, the database server name, the server port, and the protocol. If the database is "rawa", the URI is
database = "/rawa"
uri = protocol + credentials + server + port + database
import sqlalchemy

engine = sqlalchemy.create_engine(uri)
df.to_sql('Menu',engine,if_exists='replace')
df.to_sql('Menu',engine)
One benefit of this syntax is the automatic closure of the connection upon completion. This completes the discussion of how to convert between dataframes and SQL tables, and hence between any of the objects in (A.1.2).
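Going in the other direction, from an SQL table back to a dataframe, can be done with read_sql; a minimal sketch, with the URI constructed as above:

import sqlalchemy
from pandas import read_sql

engine = sqlalchemy.create_engine(uri)
with engine.connect() as connection:
    menu1_df = read_sql("select * from Menu", connection)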
/* Menu */
dish varchar
price integer
/* ordersin */
orderid integer
created datetime
customerid integer
items json
/* ordersout */
orderid integer
subtotal integer
tip integer
tax integer
total integer
To achieve this task, we download the CSV files menu.csv and orders.csv, then we carry out these steps. (price and tip in menu.csv and orders.csv are in cents, so they are INTs.)
5. Add a key items to OrdersIn whose values are JSON strings specifying
the items ordered in orders, using the prices in menu (these are in cents
so they are INTs). The JSON string is that of a list-of-dicts of the form discussed above, L = [item1, item2] (see row 0 in Figure A.4).
Do this by looping over each order in the list-of-dicts orders, then
looping over each item in the list-of-dicts menu, and extracting the
quantity ordered of the item item in the order order.
6. Add a key subtotal to OrdersOut whose values (in cents) are com-
puted from the above values.
Add a key tax to OrdersOut whose values (in cents) are computed
using the Connecticut tax rate 7.35%. Tax is applied to the sum of
subtotal and tip.
Add a key total to OrdersOut whose values (in cents) are computed
from the above values (subtotal, tax, tip).
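For example, an order with subtotal 2500 cents and tip 500 cents (numbers chosen only for illustration) has tax = int(0.0735 × 3000) = 220 cents and total = 2500 + 500 + 220 = 3220 cents.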
# step 1
from pandas import *
protocol = "https://"
server = "math.temple.edu"
path = "/~hijab/teaching/csv_files/restaurant/"
url = protocol + server + path
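# (step 1, continued) read the two CSV files into dataframes; the file
# names are the ones given in the text
menu_df = read_csv(url + "menu.csv")
orders_df = read_csv(url + "orders.csv")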
# step 2
menu = menu_df.to_dict('records')
orders = orders_df.to_dict('records')
# step 3
OrdersIn = [ ]
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["created"] = r["created"]
    d["customerId"] = r["customerId"]
    OrdersIn.append(d)
# step 4
OrdersOut = [ ]
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["tip"] = r["tip"]
    OrdersOut.append(d)
# step 5
from json import *
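# A sketch of step 5: build the items JSON string for each order.  The
# exact column layout of orders.csv is assumed here -- namely, that each
# order row lists, under each dish's name, the quantity ordered of that dish.
for i,r in enumerate(orders):
    items = []
    for item in menu:
        quantity = r.get(item["dish"], 0)
        if quantity:
            items.append({ "dish":item["dish"], "price":item["price"],
                "quantity":quantity })
    OrdersIn[i]["items"] = dumps(items)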
# step 6
for i,r in enumerate(OrdersOut):
    items = loads(OrdersIn[i]["items"])
    subtotal = sum([ item["price"]*item["quantity"] for item in items ])
    r["subtotal"] = subtotal
    tip = OrdersOut[i]["tip"]
    tax = int(.0735*(tip + subtotal))
    r["tax"] = tax
    r["total"] = subtotal + tip + tax
# step 7
ordersin_df = DataFrame(OrdersIn)
ordersout_df = DataFrame(OrdersOut)
# step 8
import sqlalchemy
from sqlalchemy import create_engine, text
engine = create_engine(uri)
dtype1 = { "dish":sqlalchemy.String(60),
    "price":sqlalchemy.Integer }
dtype2 = {
"orderId":sqlalchemy.Integer,
"created":sqlalchemy.String(30),
"customerId":sqlalchemy.Integer,
"items":sqlalchemy.String(1000)
}
dtype3 = {
"orderId":sqlalchemy.Integer,
"tip":sqlalchemy.Integer,
"subtotal":sqlalchemy.Integer,
"tax":sqlalchemy.Integer,
"total":sqlalchemy.Integer
}
protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
connection = engine.connect()
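The dataframes would then be written to the database using to_sql, passing the column types defined above; a minimal sketch, with table names as in the schema listing and if_exists='replace' so the step can be rerun:

menu_df.to_sql('Menu', engine, if_exists='replace', index=False, dtype=dtype1)
ordersin_df.to_sql('ordersin', engine, if_exists='replace', index=False, dtype=dtype2)
ordersout_df.to_sql('ordersout', engine, if_exists='replace', index=False, dtype=dtype3)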
num = 0
print(num)
def num_plates(item):
    dishes = loads(item)
    return sum( [ dish["quantity"] for dish in dishes ] )
Then we use map to apply this function to every element in the series df["items"], resulting in another series. Then we sum the resulting series.
num = df["items"].map(num_plates).sum()
print(num)
Since the total number of plates is 14,949, and the total number of orders
is 4970, the average number of plates per order is 3.76.
A.2 Minimizing Sequences

So far, questions surrounding the existence of minimizers were ignored. In this section, which may safely be skipped, we review the foundational material supporting the existence of minimizers.
The first issue that must be clarified is the difference between the min-
imum and the infimum. In a given situation, it is possible that there is
no minimum. By contrast, in any reasonable situation, there is always an
infimum.
For example, since y = e^x is an increasing function, the minimum
$$\min_{0 \le x \le 1} e^x = \min\{e^x \mid 0 \le x \le 1\}$$
equals e^0 = 1, and the minimizer is x = 0. By contrast, consider the minimum of y = 1/x over x > 0.
In this situation, the minimizer does not exist, but, since the values of 1/x are
arbitrarily close to 0, we say the infimum is 0. Since there is no minimizer,
there is no minimum value. Also, even though 0 is the infimum, we do not
say ∞ is the “infimizer”, since ∞ is not an actual number.
When S is a finite collection of real numbers, the minimum of S exists. When S is infinite, a minimum need not exist. When the minimum exists, we write m = min S.
If S is bounded below, then S has many lower bounds. The greatest
among these lower bounds is the infimum of S. A foundational axiom for
real numbers is that the infimum always exists. When m is the infimum of
S, we write m = inf S.
Existence of Infima
Any collection S of real numbers that is bounded below has an infimum:
There is a lower bound m for S that is greater than any other lower
bound b for S.
For example, for S = [0, 1], inf S = 0 and min S = 0, and, for S = (0, 1),
inf S = 0, but min S does not exist. For both these sets S, it is clear that 0
is the infimum. The power of the axiom comes from its validity for any set
S of scalars that is bounded below, no matter how complicated.
By definition, the infimum of S is the lower bound for S that is greater
than any other lower bound for S. From this, if min S exists, then inf S =
min S.
An error sequence is a decreasing sequence of nonnegative scalars
$$e_1 \ge e_2 \ge \cdots \ge 0.$$
We say the error sequence converges to zero if
$$\inf_{n \ge 1} e_n = 0.$$
Error Sequence
An error sequence e1 ≥ e2 ≥ · · · ≥ 0 converges to zero iff for any ϵ > 0,
there is an N > 0 with
0 ≤ en < ϵ, n ≥ N.
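For example, e_n = 1/n is an error sequence converging to zero: it is decreasing and nonnegative, and, given ϵ > 0, any N > 1/ϵ gives 0 ≤ e_n = 1/n ≤ 1/N < ϵ for n ≥ N.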
A sequence x1, x2, ... converges to x∗ if, for every ϵ > 0, there is an N > 0 with |x_n − x∗| < ϵ for n ≥ N. In this case we write lim_{n→∞} x_n = x∗, or we write x_n → x∗.
Note this definition of convergence is consistent with the previous definition, since an error sequence e1, e2, ... converges to zero (in the first sense) iff
$$\lim_{n \to \infty} e_n = 0.$$
Here it is important that the indices n1 < n2 < n3 < . . . be strictly increas-
ing.
If a sequence x1 , x2 , . . . has a subsequence x′1 , x′2 , . . . converging to x∗ ,
then we say the sequence x1 , x2 , . . . subconverges to x∗ . For example, the
sequence 1, −1, 1, −1, 1, −1, . . . subconverges to 1 and also subconverges
to −1, as can be seen by considering the odd-indexed terms and the even-
indexed terms separately.
Note a subsequence of an error sequence converging to zero is also an
error sequence converging to zero. As a consequence, if a sequence converges
to x∗ , then every subsequence of the sequence converges to x∗ . From this
it follows that the sequence 1, −1, 1, −1, 1, −1, . . . does not converge to
anything: it bounces back and forth between ±1.
$$I_0 \supset I_1 \supset I_2 \supset \cdots,$$
and the infimum
$$x^* = \inf_{n \ge 1} x^*_n$$
must exist.
A minimizer is a vector x∗ satisfying f (x∗ ) = m. As we saw above, a
minimizer may or may not exist, and, when the minimizer does exist, there
may be several minimizers.
A minimizing sequence for f (x) over S is a sequence x1 , x2 , . . . of vectors
in S such that the corresponding values f (x1 ), f (x2 ), . . . are decreasing and
converge to m = inf S f (x) as n → ∞. In other words, x1 , x2 , . . . is a
minimizing sequence for f (x) over S if
$$f(x_1) \ge f(x_2) \ge \cdots$$
and
$$\inf_S f(x) = \inf_{n \ge 1} f(x_n).$$
or
0 < f (x1 ) − m < (f (x0 ) − m)/2.
Existence of Minimizers
If f (x) is continuous on Rd and S is a bounded set in Rd , then there
is a minimizer x∗,
$$f(x^*) = \inf_{x \,\in\, S} f(x). \tag{A.2.2}$$
[1] Joshua Akey, Genome 560: Introduction to Statistical Genomics, 2008. https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture1.pdf.
[2] Christopher M. Bishop, Pattern Recognition and Machine Learning, Information Sci-
ence and Statistics, Springer, 2006.
[3] Sébastien Bubeck, Convex Optimization: Algorithms and Complexity, Foundations
and Trends in Machine Learning, vol. 8, Now Publishers, 2015.
[4] Harald Cramér, Mathematical Methods of Statistics, Princeton University Press, 1946.
[5] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, Mathematics for Machine Learning, Cambridge University Press, 2020.
[6] J. L. Doob, Probability and Statistics, Transactions of the American Mathematical
Society 36 (1934), 759-775.
[7] R. A. Fisher, The conditions under which χ2 measures the discrepancy between ob-
servation and hypothesis, Journal of the Royal Statistical Society 87 (1924), 442-450.
[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
[9] Google, Machine Learning. https://developers.google.com/machine-learning.
[10] Robert M. Gray, Toeplitz and Circulant Matrices: A Review, Foundations and Trends
in Communications and Information Theory 2 (2006), no. 3, 155-239.
[11] T. L. Heath, The Works of Archimedes, Cambridge University, 1897.
[12] Omar Hijab, A Note on Binary Classifiers, Preprint (2024).
[13] Nikolai Janakiev, Classifying the Iris Data Set with Keras, 2018. https://janakiev.com/blog/keras-iris.
[14] Lily Jiang, A Visual Explanation of Gradient Descent Methods, 2020. https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c.
[15] J. W. Longley, An Appraisal of Least Squares Programs for the Electronic Computer
from the Point of View of the User, Journal of the American Statistical Association
62.319 (1967), 819-841.
[16] David G. Luenberger and Yinyu Ye, Linear and Nonlinear Programming, Springer,
2008.
[17] Ioannis Mitliagkas, Theoretical principles for deep learning, lecture notes, 2019. https://mitliagkas.github.io/ift6085-dl-theory-class-2019/.
[18] Marvin Minsky and Seymour Papert, Perceptrons, An Introduction to Computational
Geometry, MIT Press, 1988.
[19] Yurii Nesterov, Lectures on Convex Optimization, Springer, 2018.
[20] Roger Penrose, A generalized inverse for matrices, Proceedings of the Cambridge
Philosophical Society 51 (1955), 406-413.
[21] Boris Teodorovich Polyak, Some methods of speeding up the convergence of iteration
methods, USSR Computational Mathematics and Mathematical Physics 4(5) (1964),
1-17.
[22] Karl Pearson, On the criterion that a given system of deviations from the probable in
the case of a correlated system of variables is such that it can be reasonably supposed
to have arisen from random sampling, Philosophical Magazine Series 5 50:302 (1900),
157-175.
[23] Sebastian Raschka, PCA in three simple steps, 2015. https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html.
[24] Herbert Robbins and Sutton Monro, A Stochastic Approximation Method, The Annals
of Mathematical Statistics 22 (1951), no. 3, 400 – 407.
[25] Sheldon M. Ross, Probability and Statistics for Engineers and Scientists, Sixth Edi-
tion, Academic Press, 2021.
[26] Mark J. Schervish, Theory of Statistics, Springer, 1995.
[27] Stanford University, CS224N: Natural Language Processing with Deep Learning. https://web.stanford.edu/class/cs224n.
[28] Irène Waldspurger, Gradient Descent With Momentum, 2022. https://www.ceremade.dauphine.fr/~waldspurger/tds/22_23_s1/advanced_gradient_descent.pdf.
[29] Wikipedia, Logistic Regression, 2015. https://en.wikipedia.org/wiki/Logistic_regression.
[30] Stephen J. Wright and Benjamin Recht, Optimization for Data Analysis, Cambridge
University, 2022.
Python
*, 9, 16
all, 189
append, 189
def.angle, 24, 75
def.assign_clusters, 189
def.backward_prop, 361, 369, 413
def.ball, 62
def.chi2_independence, 323
def.confidence_interval, 295, 307, 312, 315
def.delta_out, 412
def.derivative, 368
def.dimension_staircase, 130
def.downstream, 413
def.ellipse, 51, 58
def.find_first_defect, 128
def.forward_prop, 360, 367, 408
def.gd, 424
def.goodness_of_fit, 320
def.H, 347
def.hexcolor, 10
def.incoming, 366, 407
def.J, 409
def.local, 410
def.nearest_index, 189
def.newton, 417
def.num_plates, 479
def.outgoing, 366, 407
def.pca, 182
def.pca_with_svd, 182
def.plot_cluster, 190
def.plot_descent, 418
def.poly, 444
def.project, 121
def.project_to_ortho, 122
def.pvalue, 272
def.random_batch_mean, 249
def.random_vector, 189
def.tensor, 32
def.train_nn, 426
def.ttest, 308
def.type2_error, 301, 309
def.update_means, 189
def.update_weights, 426
def.zero_variance, 108
def.ztest, 299
diag, 175
dict, 467
display, 150
enumerate, 183
floor, 167
import, 8
itertools.product, 62
join, 10
json.dumps, 468
json.loads, 468
keras
    datasets
        mnist.load_data, 4
lambda, 365
list, 7
matplotlib.pyplot.axes, 187
matplotlib.pyplot.contour, 58
matplotlib.pyplot.figure, 183
matplotlib.pyplot.grid, 7
matplotlib.pyplot.hist, 243
matplotlib.pyplot.imshow, 8, 9
matplotlib.pyplot.legend, 167
matplotlib.pyplot.meshgrid, 58
matplotlib.pyplot.plot, 45
matplotlib.pyplot.scatter, 7
matplotlib.pyplot.show, 7
matplotlib.pyplot.stairs, 130
matplotlib.pyplot.subplot, 183
matplotlib.pyplot.text, 168
matplotlib.pyplot.title, 347
matplotlib.pyplot.xlabel, 445
numpy.allclose, 83
numpy.amax, 445
numpy.amin, 445
numpy.arange, 58, 167
numpy.arccos, 24, 75
numpy.argsort, 182
numpy.array, 8, 65
numpy.column_stack, 86, 100
numpy.corrcoef, 54
numpy.cov, 47
numpy.cumsum, 180
numpy.degrees, 24
numpy.dot, 73
numpy.exp, 347
numpy.isclose, 158
numpy.linalg.eig, 144
numpy.linalg.eigh, 144, 180
numpy.linalg.inv, 84
numpy.linalg.matrix_rank, 128
numpy.linalg.norm, 21, 189
numpy.linalg.pinv, 121
numpy.linalg.svd, 175
numpy.linspace, 62
numpy.log, 347
numpy.mean, 14, 15
numpy.meshgrid, 62
numpy.outer, 323
numpy.pi, 347
numpy.random.binomial, 242
numpy.random.default_rng, 249
numpy.random.default_rng.shuffle, 249
numpy.random.normal, 271
numpy.random.random, 45
numpy.random.randn, 286
numpy.reshape, 179
numpy.roots, 43
numpy.row_stack, 69
numpy.shape, 65
numpy.sqrt, 24
pandas.DataFrame, 468
pandas.DataFrame.drop, 72
pandas.DataFrame.to_csv, 470