0% found this document useful (0 votes)

4K views497 pages

Mathematics of Machine Learning

Uploaded by

ahokano

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4K views497 pages

Mathematics of Machine Learning

Uploaded by

ahokano

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 497

Mathematics of Machine Learning

Tivadar Danka

Mar 16, 2023

CONTENTS

I Introduction 3
1 Introduction 7
1.1 What is this book about? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 How to read this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

II Linear algebra 13
2 Vectors in theory 15
2.1 Representing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 What is a vector space? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Examples of vector spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Linear basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 Vectors in practice 31
3.1 Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 NumPy arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Is NumPy really faster than Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4 The geometric structure of vector spaces: measuring distances 47

4.1 Norms and distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Deﬁning distances from norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5 Inner products, angles, and lots of reasons to care about them 57

5.1 The generated norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 The geometric interpretation of inner products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Orthogonal and orthonormal bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 The orthogonal complement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6 The ﬁrst steps in computational linear algebra 71

6.1 Basic NumPy functions and array methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Norms and distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3 The dot product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4 The Gram-Schmidt orthogonalization process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

i
6.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

7 Matrices, the workhorses of machine learning 85

7.1 Unraveling matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2 Manipulating matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3 Matrices as arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.4 Matrices in NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.5 Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.6 Linear transformations and data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

8 Linear transformations 101

8.1 What is a linear transformation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.2 Linear transformations and matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.3 Matrix operations revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.4 Inverting linear transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.5 Change of basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.6 Linear transformations in the Euclidean plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

9 Determinants, or how linear transformations aﬀect volume 123

9.1 Area in the Euclidean plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.2 Determinants, orientation, and volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.3 The recursive deﬁnition of determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
9.4 Fundamental properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
9.5 Determinants and invertibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

10 Linear equations 135

10.1 Gaussian elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.2 When can a system of linear equations be solved? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
10.3 Inverting matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

11 The LU decomposition 143

11.1 Implementing the LU decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
11.2 Inverting a matrix, for real . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.3 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

12 Determinants in practice 151

12.1 The lesser of two evils . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
12.2 The proper method for computing determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
12.3 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
12.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

13 Eigenvalues and eigenvectors 157

13.1 Eigenvalues of matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
13.2 Finding eigenvalue-eigenvector pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
13.3 Eigenvectors, eigenspaces, and their bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
13.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

14 Special transformations and matrix decompositions 165

14.1 The adjoint transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
14.2 Orthogonal transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
14.3 Self-adjoint transformations and the spectral decomposition theorem . . . . . . . . . . . . . . . . . . 168
14.4 The Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

ii
14.5 Orthogonal projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
14.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

15 Computing eigenvalues 179

15.1 Power iteration for calculating the eigenvectors of real symmetric matrices . . . . . . . . . . . . . . . 179
15.2 The QR algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

III Functions 191

16 Functions in theory 193
16.1 The mathematical deﬁnition of a function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
16.2 Domain and image. Injective, surjective, bijective functions. . . . . . . . . . . . . . . . . . . . . . . . 197
16.3 Operations with functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
16.4 Mental models of functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
16.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

17 Functions in practice 205

17.1 Operations on functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
17.2 Functions as callable objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
17.3 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
17.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

18 Numbers 211
18.1 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

19 Sequences 217
19.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
19.2 Famous convergent sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
19.3 The big and small O notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
19.4 Real numbers are sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

20 The topology of real numbers 225

20.1 Open and closed sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
20.2 Bounded sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
20.3 Compact sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
20.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

21 Limits and continuity 233

21.1 Limits of functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
21.2 Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
21.3 Properties of continuous functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
21.4 Continuity on compact sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
21.5 What’s next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

22 Diﬀerentiation in theory 243

22.1 The definition of the derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
22.2 Equivalent forms of differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
22.3 Differentiation and continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
22.4 Higher derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

23 Diﬀerentiation in practice 253

23.1 Rules of diﬀerentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
23.2 Derivatives of elementary functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
23.3 Extending the Function base class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

iii
23.4 Numerical diﬀerentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
23.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

24 Minima and maxima 263

24.1 Derivatives and local behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
24.2 Local minima and maxima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
24.3 Characterization of optima with higher order derivatives . . . . . . . . . . . . . . . . . . . . . . . . . 272
24.4 Mean value theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
24.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

25 The basics of gradient descent 277

25.1 Derivatives revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
25.2 The gradient descent algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
25.3 Drawbacks and caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282

26 Integration in theory 287

26.1 Calculating the area under a function’s graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
26.2 The Riemann integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
26.3 Beyond upper and lower sums . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
26.4 Integration as the inverse of diﬀerentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
26.5 Integrals in machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

27 Integration in practice 299

27.1 Integrals and operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
27.2 Partial integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
27.3 Integration by substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
27.4 Numerical integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
27.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

28 Why does gradient descent work? 307

28.1 Diﬀerential equations 101 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
28.2 A continuous version of gradient ascent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
28.3 What’s next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

IV Multivariable functions 319

29 Multivariable functions 321
29.1 What is a multivariable function? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
29.2 Linear functions in multiple variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
29.3 The curse of dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327

30 Derivatives and gradients 329

30.1 Partial derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
30.2 The total derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
30.3 Directional derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
30.4 Properties of the gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
30.5 Multivariable functions in code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

31 Derivatives of vector-valued functions 341

31.1 The derivatives of curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
31.2 The Jacobian and Hessian matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
31.3 The total derivative for vector-vector functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
31.4 Derivatives and function operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

iv
32 Minima and maxima in multiple dimensions 349
32.1 Local extrema in multiple dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350

33 Gradient descent in its full form 355

33.1 Gradient ascent and descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
33.2 Implementation of gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356

V Probability theory 357

34 What is probability? 359
34.1 Thinking in absolutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
34.2 Thinking in probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

35 The axioms of probability 365

35.1 The mathematical formulation of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
35.2 Probability measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
35.3 Probability spaces on ℝ𝑛 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
35.4 How to interpret probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
35.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377

36 Conditional probability 379

36.1 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
36.2 Properties of the conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
36.3 The Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
36.4 The Bayesian interpretation of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
36.5 The probabilistic inference process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
36.6 The Monty Hall paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389

37 Random variables 391

37.1 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
37.2 Real-valued random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
37.3 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
37.4 Independence of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
37.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397

38 Distributions 399
38.1 Discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
38.2 Law of total probability, revisited once more . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
38.3 Real-valued distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
38.4 Notable real-valued distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
38.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415

39 Densities 417
39.1 Density functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
39.2 Classiﬁcation of real-valued random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421

40 The expected value 425

40.1 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
40.2 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
40.3 Properties of the expected value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
40.4 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
40.5 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
40.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436

v
41 The Law of Large Numbers 437
41.1 Tossing coins… . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
41.2 …rolling dices… . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
41.3 …and all the rest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
41.4 The strong law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444

VI Statistics 447

VII Classical machine learning 449

VIII Neural networks 451

IX Advanced optimization 453

X Convolutional networks 455

XI Appendix 457
42 It’s just logic 459
42.1 Mathematical logic 101 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
42.2 Logical connectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
42.3 The propositional calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
42.4 Variables and predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
42.5 Existential and universal quantiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
42.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466

43 The structure of mathematics 467

43.1 What is a deﬁnition? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
43.2 What is a theorem? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
43.3 What is a proof? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
43.4 Equivalences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
43.5 Proof techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472

44 Basics of set theory 477

44.1 What is a set? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
44.2 Operations on sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
44.3 The Cartesian product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
44.4 The Russell paradox (optional) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483

45 A crash course in Python 485

45.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
45.2 Control ﬂow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
45.3 Fundamental data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
45.4 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
45.5 Decorators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
45.6 Object-oriented programming in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486

46 Bibliography 487

Bibliography 489

vi
Mathematics of Machine Learning

Hey there!
First of all, thank you! Your support is what makes this book possible! You are an absolute legend.
In the early access program, I’ll release the sections of this book as I write them. During our time together, my goal is
to guide you through the inner workings of machine learning, from high school mathematics to backpropagation. Each
week, a new chapter will be published, and I’ll be there for you to discuss your thoughts about the book.
For this purpose, I have created a Discord server, where I’ll be available for you at all times. You can join here: https:
//discord.gg/JC2RFpzun6
To give you a heads up, I am focusing on the content first, appearances second. So, some figures might be clumsy, and
the editing might not be perfect. Don’t worry, though. These will be fixed before the full release.

Early access FAQ

When will the book be ﬁnished?

I am planning to ﬁnish writing the content until the end of 2023. After the content is ﬁnalized, I’ll focus on editing and
formatting. This is especially important regarding the pdf version of the book. To avoid spending an excessive
amount of time with LaTeX customization, prettifying the pdf is going to take place at a later time.

What will you get from the early access program?

• The latest version of the book each week, in an interactive Jupyter Book format + pdf.
• Exclusive access to a new chapter each week as I ﬁnish them.
• A personal hotline to me where you can share your feedback with me to build the best learning resource for you.

What will I get from the early access program?

Writing a book is a long and challenging project. I want to do this the right way, so I decided to dedicate 100% of my
time and energy. However, I can’t do this without your support. I created the Early Access Program for those wishing to
join me in this journey. With you signing up for the Early Access Program, I’ll get
• your ﬁnancial support so I can work on this project full time,
• and your continual feedback, which is essential for me to write the best book on the subject for you.

CONTENTS 1
Mathematics of Machine Learning

Where can I contact you?

There are two main ways. You can contact me at the Discord server of the early access (join here: https://discord.gg/
JC2RFpzun6), or you can shoot a message on Twitter, you can ﬁnd me here: https://twitter.com/TivadarDanka.

Acknowledgements

This book is dedicated to my mother, who I lost while making this book. Thanks, Mom! You are inside every line I write.

2 CONTENTS
Part I

Introduction

3
Mathematics of Machine Learning

5
Mathematics of Machine Learning

6
CHAPTER

ONE

INTRODUCTION

Why do we have to learn mathematics? - This is a question I am asked and think about almost every day.
On the surface, advanced mathematics doesn’t impact software engineering and machine learning in a production setting.
You don’t have to calculate gradients, solve linear equations, or find eigenvalues by hand. Basic and advanced algorithms
are abstracted away into libraries and APIs, performing all the hard work for you.
Nowadays, implementing a state-of-the-art deep neural network is almost equivalent to instantiating an object in Tensor-
Flow, loading the pre-trained weights, and letting the data blaze through the model. Just like all technological advances,
this is a double-edged sword. On the one hand, frameworks that accelerate prototyping and development enable machine
learning in practice. Without them, we wouldn’t have seen the explosion in deep learning that we witnessed in the last
decade.
On the other hand, high-level abstractions are barriers between us and the underlying technology. User-level knowledge
is only sufficient when one is treading on familiar paths. (Or until something breaks.)
If you are not convinced, let’s do a thought experiment! Imagine moving to a new country without speaking the language
and knowing the way of life. However, you have a smartphone and a reliable internet connection.
How do you start exploring?
With Google Maps and a credit card, you can do many awesome things there: explore the city, eat in excellent restaurants,
have a good time. You can do the groceries every day without speaking a word: just put the stuff in your basket and swipe
your card at the cashier.
After a few months, you’ll start to pick up some language as well—simple things like saying greetings or introducing
yourself. You are off to a good start!
There are built-in solutions for everyday tasks that just work—food ordering services, public transportation, etc. However,
at some point, they will break down. For instance, you need to call the delivery person who dropped off your package at
the wrong door. This requires communication.
You may also want to do more. Get a job, or perhaps even start your own business. For that, you need to communicate
with others effectively.
Learning the language when you plan to live somewhere for a few months is unnecessary. However, if you want to stay
there for the rest of your life, it is one of the best investments you can make.
Now replace the country with machine learning and the language with mathematics.
Fact is, algorithms are written in the language of mathematics. To work with algorithms on a professional level, you have
to speak it.

7
Mathematics of Machine Learning

1.1 What is this book about?

There is a similarity between knowing one’s way about a town and mastering a field of knowledge; from any
given point one should be able to reach any other point. One is even better informed if one can immediately
take the most convenient and quickest path from the one point to the other. — George Pólya and Gábor Szegő,
in the introduction of the legendary book Problems and Theorems in Analysis
The above quote is one of my all-time favorites. For me, it implies that knowledge rests on many pillars. Like a chair
has four legs, a well-rounded machine learning engineer also has several skills that enable them to be effective in their
job. Each of us focuses on a balanced constellation of skills, and for many, mathematics is a great addition. You can start
machine learning without advanced mathematics, but at some point in your career, getting familiar with the mathematical
background of machine learning can help you bring your skills to the next level.
In my opinion, there are two paths to mastery in deep learning. One starts from the practical parts, the other starts from
theory. Both are perfectly viable, and eventually, they intertwine. This book is for those who started on the practical,
application-oriented path, like data scientists, machine learning engineers, or even software developers interested in the
topic.
This book is not a 100% pure mathematical treatise. At points, I will make some shortcuts to balance between clarity
and mathematical correctness. My goal is to give you the “Eureka!” moments and help you understand the bigger picture
instead of getting you ready for a PhD in mathematics.
Most machine learning books I have read fall into one of the two categories.
1. Focus on practical applications, unclear or imprecise with mathematical concepts.
2. Focus on theory, involving heavy mathematics with almost no real applications.
I want this book to offer the best of both: a sound introduction of basic and advanced mathematical concepts, keeping
machine learning in sight at all times. My goal is not only to cover the bare fundamentals but to give a breadth of knowl-
edge. In my experience, to master a subject, one needs to go both deep and wide. Covering only the very essentials of
mathematics would be like a tightrope walk. Instead of performing a balancing act every time you encounter a mathe-
matical subject in the future, I want you to gain a stable footing. Such confidence can bring you very far and set you apart
from others.
During our journey, we are going to follow this roadmap. (You might need to zoom in, as the figure is relatively large.)
Before we start, let’s take a brief look into each part.

1.1.1 Linear algebra

We are going to begin our journey with linear algebra. In machine learning, data is represented by vectors. Essentially,
training a learning algorithm is ﬁnding more descriptive representations of data through a series of transformations.
Linear algebra is the study of vector spaces and their transformations.
Simply speaking, a neural network is just a function mapping the data to a high-level representation. Linear transforma-
tions are the fundamental building blocks of these. Developing a good understanding of them will go a long way, as they
are everywhere in machine learning.

8 Chapter 1. Introduction
Mathematics of Machine Learning

Fig. 1.1: The complete roadmap of mathematics for machine learning.

1.1. What is this book about? 9
Mathematics of Machine Learning

1.1.2 Calculus

While linear algebra shows how to describe predictive models, calculus has the tools to fit them to the data. When you
train a neural network, you are almost certainly using gradient descent, which is rooted in calculus and the study of
differentiation.
Besides differentiation, its “inverse” is also a central part of calculus: integration.
Integrals are used to express essential quantities such as expected value, entropy, mean squared error, and many more.
They provide the foundations for probability and statistics.

1.1.3 Multivariate calculus

When doing machine learning, we deal with functions with millions of variables.

10 Chapter 1. Introduction
Mathematics of Machine Learning

In higher dimensions, things work diﬀerently. This is where multivariable calculus comes in, where diﬀerentiation and
integration are adapted to these spaces.

1.1.4 Probability theory

How to draw conclusions from experiments and observations? How to describe and discover patterns in them?
These are answered by probability theory and statistics, the logic of scientiﬁc thinking.

1.1.5 Beyond the fundamentals

Linear algebra, calculus, and probability theory form the foundations of mathematics in machine learning. They are just
the starting points. The most exciting stuﬀ comes after we are familiar with them! Advanced statistics, optimization
techniques, backpropagation, the internals of neural networks. In the second part of the book, we will take a detailed look
at all of those.

1.1. What is this book about? 11

Mathematics of Machine Learning

1.2 How to read this book

Mathematics follows a definition-theorem-proof structure that might be difficult to follow at first. If you are unfamiliar
with such a flow, don’t worry. I’ll give a gentle introduction right now.
In essence, mathematics is the study of abstract objects (such as functions) through their fundamental properties. Instead
of empirical observations, mathematics is based on logic, making it universal. A correct mathematical result is set in
stone, remaining valid forever. (Or, until the axioms of logic change.) If we want to use the powerful tool of logic, the
mathematical objects need to be precisely defined. Definitions are presented in boxes like this below.

Deﬁnition 1 (An example deﬁnition.)

This is how deﬁnitions are presented in this book.

Given a definition, results are formulated as if A, then B statements, where A is the premise, and B is the conclusion. Such
results are called theorems. For instance, if a function is differentiable, then it is also continuous. If a function is convex,
then it has global minima. If we have a function, then we can approximate it with arbitrary precision using a single-layer
neural network. You get the pattern. Theorems are the core of mathematics.
We must provide a sound logical argument to accept the validity of a proposition, one that deduces the conclusion from
the premise. This is called a proof, responsible for the steep learning curve of mathematics. Contrary to other scientific
disciplines, proofs in mathematics are indisputable statements, set in stone forever. On a practical note, look out for these
boxes.

Theorem 1 (An example theorem.)

Let 𝑥 be a fancy mathematical object. The following two statements hold.
(a) If 𝐴, then 𝐵.
(b) If 𝐶 and 𝐷, then 𝐸.

Proof. This is where proofs go. □

To enhance the learning experience, I’ll often make good-to-know but not absolutely essential information into remarks.

Remark 1 (An exciting remark.)

Mathematics is awesome. You’ll be a better engineer because of it.

The most eﬀective way of learning is building things and putting theory into practice. In mathematics, this is the only way
to learn. What this means to you is need to read through the text carefully. Don’t take anything for granted just because
it is written down. Think through every sentence, take apart every argument and calculation. Try to prove theorems by
yourself before reading the proofs.

12 Chapter 1. Introduction
Part II

Linear algebra

13
CHAPTER

TWO

VECTORS IN THEORY

I want to point out that the class of abstract linear spaces is no larger than the class of spaces whose elements
are arrays. So what is gained by abstraction? First of all, the freedom to use a single symbol for an array;
this way we can think of vectors as basic building blocks, unencumbered by components. The abstract view
leads to simple, transparent proofs of results. — Peter D. Lax, in Chapter 1 of his book Linear Algebra and
its Applications
The mathematics of machine learning rest upon three pillars: linear algebra, mathematical analysis, and probability theory.
Linear algebra describes how to represent and manipulate data; mathematical analysis helps us deﬁne and ﬁt models; while
probability theory helps interpret them.
These are building on top of each other, and we will start at the beginning: representing and manipulating data.

2.1 Representing data

To guide us throughout this section, we will look at the famous Iris dataset. This contains the measurements from three
species of Iris: the lengths and widths of sepals and petals. Each data point includes these four measurements, and we
also know the corresponding species (Iris setosa, Iris virginica, Iris versicolor).
The dataset can be loaded right away from scikit-learn, so let’s take a look!

from sklearn.datasets import load_iris

data = load_iris()

X, y = data["data"], data["target"]

X[:10]

array([[5.1, 3.5, 1.4, 0.2],

[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])

15
Mathematics of Machine Learning

Before going into the mathematical deﬁnitions, let’s establish a common vocabulary ﬁrst. The measurements themselves
are stored in a tabular format. Rows represent samples, and columns represent measurements. A particular measurement
type is often called feature. As X.shape tells us, the Iris dataset has 150 data points and four features.

X.shape

(150, 4)

For a given sample, the corresponding species is called the label. In our case, this is either Iris setosa, Iris virginica, or
Iris versicolor. Here, the labels are encoded with the numbers 0, 1, and 2.

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In its entirety, the Iris dataset forms a matrix, and the data points form vectors. Simply speaking, matrices are tables,
while vectors are tuples of numbers. (Tuples are are just finite sequence of numbers, like (1.297, −2.35, 32.3, 29.874).)
However, this view doesn’t give us the big picture. Moreover, since we have more than three features, we cannot visualize
the dataset easily. As humans cannot see in more than three dimensions, visualization breaks down.
Besides representing the data points in a compact form, we want to perform operations on them, like addition and scalar
multiplication. Why do we need to add data points together? To give you a simple example, it is often beneficial if the
features are on the same scale. If a given feature is distributed on a smaller scale than the others, it will have less influence
on the predictions.
Think about this: if somebody is whispering to you something from the next room while speakers blast loud music right
next to your ear, you won’t hear anything from what the person is saying to you. Large-scale features are the blasting
music, while the smaller ones are the whisper. You may obtain much more information from the whisper, but you need
to quiet down the music first.
To see this phenomenon in action, let’s take a look at the distribution of the features of our dataset!
You can see in Fig. 2.1 that some are more stretched-out (like the sepal length), while others are narrower (like sepal
width). In practical scenarios, this can hurt the predictive performance of our algorithms.
To solve it, we can remove the mean and the standard deviation of a dataset. If the dataset consists of the vectors
𝑥1 , 𝑥2 , … , 𝑥150 , we can calculate their mean by

1 150
𝜇= ∑ 𝑥 ∈ ℝ4
150 𝑖=1 𝑖

and their standard deviation by

√
√ 1 150
𝜎=√ ∑(𝑥 − 𝜇)2 ∈ ℝ4 ,
150 𝑖=1 𝑖
⎷
where the square operation in (𝑥𝑖 − 𝜇)2 is taken elementwise. The components of 𝜇 = (𝜇1 , 𝜇2 , 𝜇3 , 𝜇4 ) and 𝜎 =
(𝜎1 , 𝜎2 , 𝜎3 , 𝜎4 ) are the means and variances of the individual features. (Recall that the Iris dataset contains 150 samples
and 4 features per sample.)

16 Chapter 2. Vectors in theory

Mathematics of Machine Learning

Fig. 2.1: Distribution of the individual features of the Iris dataset.

In other words, the mean describes the average of samples, while the standard deviation represents the average distance
from the mean. The larger the standard deviation is, the more spread out the samples are.
With these quantities, the scaled dataset can be described as
𝑥1 − 𝜇 𝑥2 − 𝜇 𝑥 −𝜇
, , … , 150 ,
𝜎 𝜎 𝜎
where both the substraction and the division are taken elementwise.
If you are familiar with Python and NumPy, this is how it is done. (Don’t worry if you are not, everything you need to
know about them will be explained in the next chapter, with example code.)

X_scaled = (X - X.mean(axis=0))/X.std(axis=0)

X_scaled[:10]

array([[-0.90068117, 1.01900435, -1.34022653, -1.3154443 ],

[-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
[-1.38535265, 0.32841405, -1.39706395, -1.3154443 ],
[-1.50652052, 0.09821729, -1.2833891 , -1.3154443 ],
[-1.02184904, 1.24920112, -1.34022653, -1.3154443 ],
[-0.53717756, 1.93979142, -1.16971425, -1.05217993],
[-1.50652052, 0.78880759, -1.34022653, -1.18381211],
[-1.02184904, 0.78880759, -1.2833891 , -1.3154443 ],
[-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
[-1.14301691, 0.09821729, -1.2833891 , -1.44707648]])

If you compare this modified version with the original, you can see that its features are on the same scale. From a (very)
abstract point of view, machine learning is nothing else but a series of learned data transformations, turning raw data into
a form where prediction is simple.
In a mathematical setting, manipulating data and modeling its relations to the labels arise from the concept of vector spaces
and transformations between them. Let’s take the first steps by making the definition of vector spaces precise!

2.1. Representing data 17

Mathematics of Machine Learning

Fig. 2.2: Features of the Iris dataset after scaling.

2.2 What is a vector space?

Representing multiple measurements as a tuple (𝑥1 , 𝑥2 , … , 𝑥𝑛 ) is a natural idea that has a ton of merits. The tuple form
suggests that the components belong together, giving a clear and concise way to store information.
However, this comes at a cost: now we have to deal with more complex objects. Despite having to deal with objects like
(𝑥1 , … , 𝑥𝑛 ) instead of numbers, there are similarities. For instance, if 𝑥 = (𝑥1 , … , 𝑥𝑛 ) and 𝑦 = (𝑦1 , … , 𝑦𝑛 ) are two
arbitrary tuples,
• they can be added together by 𝑥 + 𝑦 = (𝑥1 + 𝑦1 , … , 𝑥𝑛 + 𝑦𝑛 ),
• and they can be multiplied with scalars: if 𝑐 ∈ ℝ, then 𝑐𝑥 = (𝑐𝑥1 , … , 𝑐𝑥𝑛 ).
It’s almost like using a number.
These operations have clear geometric interpretations as well. Adding two vectors together is the same as translation,
while multiplying with a scalar is a simple stretching. (Or squeezing, if |𝑐| < 1.)

Fig. 2.3: Geometric interpretation of addition and scalar multiplication.

18 Chapter 2. Vectors in theory

Mathematics of Machine Learning

On the other hand, if we want to follow our geometric intuition (which we definitely do), it is unclear how to define vector
multiplication. The definition

𝑥𝑦 = (𝑥1 𝑦1 , … , 𝑥𝑛 𝑦𝑛 )

might make sense algebraically, but it is unclear what it means in a geometric sense.
When we think about vectors and vector spaces, we are thinking about a mathematical structure that ﬁts our intuitive
views and expectations. So, let’s turn these into the deﬁnition!

Deﬁnition 1.2.1 (Vector spaces.)

A vector space is a mathematical structure (𝑉 , 𝐹 , +, ⋅), where
(a) 𝑉 is the set of vectors,
(b) + ∶ 𝑉 × 𝑉 → 𝑉 is the addition operation, satisfying
• 𝑥 + 𝑦 = 𝑦 + 𝑥 (commutativity),
• 𝑥 + (𝑦 + 𝑧) = (𝑥 + 𝑦) + 𝑧 (associativity),
• there is an element 0 ∈ 𝑉 such that 𝑥 + 0 = 𝑥 (existence of the null vector),
• there is an inverse −𝑥 ∈ 𝑉 for each 𝑥 ∈ 𝑉 such that 𝑥 + (−𝑥) = 0 (existence of additive inverses)
for all vectors 𝑥, 𝑦, 𝑧 ∈ 𝑉 ,
(c) 𝐹 is a ﬁeld of scalars (most commonly the real numbers ℝ or the complex numbers ℂ),
(d) and ⋅ ∶ 𝐹 × 𝑉 → 𝑉 is the scalar multiplication operation, satisfying
• 𝑎(𝑏𝑥) = (𝑎𝑏)𝑥 (associativity),
• 𝑎(𝑥 + 𝑦) = 𝑎𝑥 + 𝑎𝑦 (distributivity),
• 1𝑥 = 𝑥
for all scalars 𝑎, 𝑏 ∈ 𝐹 and vectors 𝑥, 𝑦 ∈ 𝑉 .

This deﬁnition is overloaded with new concepts, so we need to unpack it.

First, looking at operations like addition and scalar multiplication as functions might be unusual for you, but this is a
perfectly natural representation. In writing, we use the notation 𝑥 + 𝑦, but when thinking about + as a function of two
variables, we might as well write +(𝑥, 𝑦). The form 𝑥 + 𝑦 is called infix, while +(𝑥, 𝑦) is called prefix notations.
In vector spaces, the input of addition are two vectors and the result is a single vector, thus + is a function that maps the
Cartesian product 𝑉 × 𝑉 to 𝑉 . Similarly, scalar multiplication takes a scalar and a vector, resulting in a vector; meaning
a function that maps 𝐹 × 𝑉 to 𝑉 .
This is also good place to note that mathematical definitions are always formalized in hindsight after the objects themselves
are somewhat crystallized and familiar to the users. Mathematics is often presented as definitions first, then theorems
second, but this is not how it is done in practice. Examples motivate definitions, not the other way around.
In general, the field of scalars can be something other than real or complex numbers. The term field refers to a well-
defined mathematical structure, which turns a natural notion mathematically precise. Without going into the technical
details, we will think about fields as “a set of numbers where addition and multiplication works just as for real numbers”.
Since we are not concerned with the most general case, we will use ℝ or ℂ to avoid unnecessary difficulty. If you are not
familiar with the exact mathematical definition of a field, don’t worry, just think of ℝ each time you read the word “field”.
When everything is clear from the context, (𝑉 , ℝ, +, ⋅) will be often referred to as 𝑉 for notational simplicity. So, if the
field 𝐹 is not specified, it is implicitly assumed to be ℝ. When we want to emphasize this, we’ll call these real vector
spaces.

2.2. What is a vector space? 19

Mathematics of Machine Learning

At ﬁrst sight, this deﬁnition is certainly too complex to comprehend. It seems like just a bunch of sets, operations, and
properties thrown together. However, to help us build a mental model, we can imagine a vector as an arrow, starting from
the null vector. (Recall that the null vector 0 is that special one for which 𝑥 + 0 = 𝑥 holds for all 𝑥. Thus, it can be
considered as an arrow with zero length; the origin.)
To further familiarize ourselves with the concept, let’s see some examples of vector spaces!

2.3 Examples of vector spaces

Examples are one of the best ways of building insight and understanding of seemingly diﬃcult concepts like vector
spaces. We, humans, usually think in terms of models instead of abstractions. (Yes, this includes pure mathematicians.
Even though they might deny it.)
Example 1. The most ubiquitous instance of the vector space is (ℝ𝑛 , ℝ, +, ⋅), the same one we used to motivate the
deﬁnition itself. (ℝ𝑛 refers to the n-fold Cartesian product of the set of real numbers. If you are unfamiliar with this
notion, check the set theory tutorial in the Appendix.)
(ℝ𝑛 , ℝ, +, ⋅) is the canonical model, the one we use to guide us throughout our studies. If 𝑛 = 2, we are simply talking
about the familiar Euclidean plane.

Fig. 2.4: The Euclidean plane as a vector space.

Using ℝ2 or ℝ3 for visualization can help a lot. What works here will usually work in the general case, although sometimes
this can be dangerous. Math relies on both intuition and logic. We develop ideas using our intuition, but we confirm them
with our logic.
Example 2. Not all vector spaces take the form of a collection of finite tuples. An example of this is the space of
polynomials with real coefficients, defined by
𝑛
ℝ[𝑥] = { ∑ 𝑝𝑖 𝑥𝑖 ∶ 𝑝𝑖 ∈ ℝ, 𝑛 = 0, 1, … }.
𝑖=0

Two polynomials 𝑝(𝑥) and 𝑞(𝑥) can be added together by

(𝑝 + 𝑞)(𝑥) = 𝑝(𝑥) + 𝑞(𝑥),

and can be multiplied with a real scalar by

(𝑐𝑝)(𝑥) = 𝑐𝑝(𝑥).

20 Chapter 2. Vectors in theory

Mathematics of Machine Learning

With these operations, (ℝ[𝑥], ℝ, +, ⋅) is a vector space. Although most of the time we percieve polynomials as functions,
they can be represented as tuples of coeﬃcients as well:
𝑛
∑ 𝑝𝑖 𝑥𝑖 ⟷ (𝑝0 , … , 𝑝𝑛 ).
𝑖=0

Note that 𝑛, the degree of the polynomial, is unbounded. As a consequence, this vector space has a signiﬁcantly richer
structure than ℝ𝑛 .
Example 3. The previous example can be further generalized. Let 𝐶([0, 1]) denote the set of all continuous real functions
𝑓 ∶ [0, 1] → ℝ. Then (𝐶(ℝ), ℝ, +, ⋅) is a vector space, where the addition and scalar multiplication are deﬁned just as in
the previous example:

(𝑓 + 𝑔)(𝑥) ∶= 𝑓(𝑥) + 𝑔(𝑥), (𝑐𝑓)(𝑥) = 𝑐𝑓(𝑥)

for all 𝑓, 𝑔 ∈ 𝐶(ℝ) and 𝑐 ∈ ℝ.

Yes, that is right: functions can be thought of as vectors as well. Spaces of functions play a significant role in mathematics,
and they come in several different forms. We often restrict the space to continuous functions, differentiable functions, or
basically any subset that is closed under the given operations.
(In fact, ℝ𝑛 can be also thought of as a function space. From an abstract viewpoint, each vector 𝑥 = (𝑥1 , … , 𝑥𝑛 ) is a
mapping from {1, 2, … , 𝑛} to ℝ.)
Function spaces are encountered in more advanced topics, such as inverting ResNet architectures, which we won’t deal
with in this book. However, it is worth seeing examples that are different (and not as straightforward) as ℝ𝑛 .

2.4 Linear basis

Although our vector spaces contain inﬁnitely many vectors, we can reduce the complexity by ﬁnding special subsets that
can express any other vector.
To make this idea precise, let’s consider our recurring example ℝ𝑛 . There, we have a special vector set
𝑒1 = (1, 0, … , 0)
𝑒2 = (0, 1, … , 0)
⋮
𝑒𝑛 = (0, 0, … , 1)

which can be used to express each element 𝑥 = (𝑥1 , … , 𝑥𝑛 ) as

𝑛
𝑥 = ∑ 𝑥𝑖 𝑒𝑖 , 𝑥𝑖 ∈ ℝ, 𝑒𝑖 ∈ ℝ 𝑛
𝑖=1

For instance, 𝑒1 = (1, 0) and 𝑒2 = (0, 1) in ℝ2 .

What we have just seen feels extremely trivial and it seems to only complicate things. Why would we need to write vectors
𝑛
in the form of 𝑥 = ∑𝑖=1 𝑥𝑖 𝑒𝑖 , instead of simply using the coordinates (𝑥1 , … , 𝑥𝑛 )? Because, in fact, the coordinate
notation depends on the underlying vector set ({𝑒1 , … , 𝑒𝑛 } in our case) used to express other vectors.
A vector is not the same as its coordinates! A single vector can have multiple different coordinates in different systems,
and switching between these is a useful tool.
Thus, the set 𝐸 = {𝑒1 , … , 𝑒𝑛 } ⊆ ℝ𝑛 is rather special, as it significantly reduces the complexity of representing vectors.
With the vector addition and scalar multiplication operations, it spans our vector space entirely. 𝐸 is an instance of a
vector space basis, a set that serves as a skeleton of ℝ𝑛 .
In this section, we are going to introduce and study the concept of vector space basis in detail.

2.4. Linear basis 21

Mathematics of Machine Learning

2.4.1 Linear combinations and independence

Let’s zoom out from the special case ℝ𝑛 and start talking about general vector spaces. From our motivating example
regarding bases, we have seen that sums of the form
𝑛
∑ 𝑥𝑖 𝑣𝑖 ,
𝑖=1

where the 𝑣𝑖 -s are vectors and the 𝑥𝑖 coeﬃcients are scalars, play a crucial role. These are called linear combinations. A
linear combination is called trivial if all of the coeﬃcients are zero.
Given a set of vectors, the same vector can potentially be expressed as a linear combination in multiple ways. For example,
if 𝑣1 = (1, 0), 𝑣2 = (0, 1), and 𝑣3 = (1, 1), then

(2, 1) = 2𝑣1 + 𝑣2 = 𝑣1 + 𝑣3 .

This suggests that the set 𝑆 = {𝑣1 , 𝑣2 , 𝑣3 } is redundant, as it contains duplicate information. The concept of linear
dependence and independence makes this precise.

Deﬁnition 1.4.1 (Linear dependence and independence.)

Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } be a subset of its vectors. 𝑆 is said to be linearly dependent if it only
contains the zero vector, or there is a nonzero 𝑣𝑘 that can be expressed as a linear combination of the other vectors
𝑣1 , … , 𝑣𝑘−1 , 𝑣𝑘+1 , … , 𝑣𝑛 .
𝑆 is said to be linearly independent if it is not linearly dependent.

Linear dependence and independence can be looked at from a diﬀerent angle. If

𝑘−1 𝑛
𝑣𝑘 = ∑ 𝑥𝑖 𝑣𝑖 + ∑ 𝑥𝑖 𝑣𝑖 ,
𝑖=1 𝑖=𝑘+1

for some nonzero 𝑣𝑘 , then by subtracting 𝑣𝑘 , we obtain that the null vector can be obtained as a nontrivial linear combi-
nation
𝑛
0 = ∑ 𝑥 𝑖 𝑣𝑖
𝑖=1

for some scalars 𝑥𝑖 , where 𝑥𝑘 = −1. This is an equivalent deﬁnition of linear dependence. With this, we have proved
the following theorem.

Theorem 1.4.1
Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } be a subset of its vectors.
(a) 𝑆 is linearly dependent if and only if the null vector 0 can be obtained as a nontrivial linear combination.
𝑛
(b) 𝑆 is linearly independent if and only if whenever 0 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 , all coeﬃcients 𝑥𝑖 are zero.

22 Chapter 2. Vectors in theory

Mathematics of Machine Learning

2.4.2 Spans of vector sets

Linear combinations provide a way to take a small set of vectors and generate a whole lot of others from them. For a
set of vectors 𝑆, taking all of its possible linear combinations is called spanning, and the generated set is called the span.
Formally, it is deﬁned by
𝑛
span(𝑆) = { ∑ 𝑥𝑖 𝑣𝑖 ∶ 𝑛 ∈ ℕ, 𝑣𝑖 ∈ 𝑆, 𝑥𝑖 is a scalar}.
𝑖=1

Note that the vector set 𝑆 is not necessarily ﬁnite. To help illustrate the concept of span, we can visualize the process in
three dimensions. The span of two linearly independent vectors is a plane.

Fig. 2.5: The span of two linearly independent vectors 𝑢 and 𝑣 in ℝ3 .

When we are talking about spans of a ﬁnite vector set {𝑣1 , … , 𝑣𝑛 }, we denote the span as

span(𝑣1 , … , 𝑣𝑛 ).

This can help us to avoid overcomplicating notations by naming every set.

Proposition 1.4.1
Let 𝑉 be a vector space and 𝑆, 𝑆1 , 𝑆2 ⊆ 𝑉 be subsets of its vectors.
(a) If 𝑆1 ⊆ 𝑆2 , then span(𝑆1 ) ⊆ span(𝑆2 ).
(b) span(span(𝑆)) = span(𝑆).

Proof. The property (a) follows directly from the deﬁnition. To prove (b), we have to show that span(𝑆) ⊆ span(span(𝑆))
and span(span(𝑆)) ⊆ span(𝑆). The former follows from the deﬁnition. For the latter, let 𝑥 ∈ span(span(𝑆)). Then
𝑛
𝑥 = ∑ 𝛼𝑖 𝑥𝑖
𝑖=1

2.4. Linear basis 23

Mathematics of Machine Learning

𝑚
for some 𝑥𝑖 ∈ span(𝑆). Because of 𝑥𝑖 being in the span of 𝑆, we have 𝑥𝑖 = ∑𝑗=1 𝛽𝑖,𝑗 𝑠𝑗 for some 𝑠𝑗 ∈ 𝑆. Thus,

𝑛
𝑥 = ∑ 𝛼 𝑖 𝑥𝑖
𝑖=1
𝑛 𝑚
= ∑ 𝛼𝑖 ∑ 𝛽𝑖,𝑗 𝑠𝑗
𝑖=1 𝑗=1
𝑚 𝑛
= ∑ ( ∑ 𝛼𝑖 𝛽𝑖,𝑗 )𝑠𝑗 ,
𝑗=1 𝑖=1

implying that 𝑥 ∈ span(𝑆) as well. □

Because of span(span(𝑆)) = span(𝑆), if 𝑆 is linearly dependent, we can remove the redundant vectors and still keep the
span the same.
𝑛−1
Think about it: if 𝑆 = {𝑣1 , … , 𝑣𝑛 } and, say, 𝑣𝑛 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 , then 𝑣𝑛 ∈ span(𝑆\{𝑣𝑛 }). So,

span(𝑆\{𝑣𝑛 }) = span(span(𝑆\{𝑣𝑛 })) = span(𝑆).

Among sets of vectors, those that generate the entire vector space are special. Remember that we started the discussion
about linear combinations to ﬁnd subsets that can be used to express any vector. After all this setup, we are ready to make
a formal introduction. Any set of vectors 𝑆 that have the property span(𝑆) = 𝑉 is called a generating set for 𝑉 .
𝑆 can be thought of as a “lossless compression” of 𝑉 , as it contains all the information needed to reconstruct any element
in 𝑉 , yet it is smaller than the entire space. Thus, it is natural to aim to reduce the size of the generating set as much as
possible. This leads us to one of the most important concepts in linear algebra: minimal generating sets, or bases, as we
prefer to call them.

2.4.3 Bases, the minimal generating sets

With all the intuition we have built so far, let’s jump into the deﬁnition right away!

Deﬁnition 1.4.2 (Basis.)

Let 𝑉 be a vector space and 𝑆 be a subset of its vectors. 𝑆 is a basis of 𝑉 if
(a) 𝑆 is linearly independent,
(b) and span(𝑆) = 𝑉 .
The elements of a basis set are called basis vectors.

It can be shown that these deﬁning properties mean that every vector 𝑥 can be uniquely written as a linear combination of
𝑆. (This is left as an exercise for the reader)
Let’s see some examples! In ℝ3 , the set {(1, 0, 0), (0, 1, 0), (0, 0, 1)} is a basis, but so is {(1, 1, 1), (1, 1, 0), (0, 1, 1)}.
So, there can be more than one basis for the same vector space.
For ℝ𝑛 , the most commonly used basis is {𝑒1 , … , 𝑒𝑛 }, where 𝑒𝑖 is a vector whose all coordinates are 0, except the 𝑖-th
one, which is 1. This is called the standard basis.
In terms of the “information” contained in a set of vectors, bases hit the sweet spot. Adding any new vector to a basis
set would introduce redundancy; removing any of its elements would cause the set to be incomplete. These notions are
formalized in the two theorems below.

24 Chapter 2. Vectors in theory

Mathematics of Machine Learning

Theorem 1.4.2
Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } a subset of vectors. The following are equivalent.
(a) 𝑆 is a basis.
(b) 𝑆 is linearly independent and for any 𝑥 ∈ 𝑉 \𝑆, the vector set 𝑆 ∪ {𝑥} is linearly dependent. In other words, 𝑆 is a
maximal linearly independent set.

Proof. To show the equivalence of two propositions, we have to prove two things: that (a) implies (b); and that (b) implies
(a). Let’s start with the ﬁrst one!
𝑛
(a) ⟹ (b) If 𝑆 is a basis, then any 𝑥 ∈ 𝑉 can be written as 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 for some 𝑥𝑖 ∈ ℝ. Thus, by deﬁnition,
𝐸 ∪ {𝑥} is linearly dependent.
(b) ⟹ (a). Our goal is to show that any 𝑥 can be written as a linear combination of the vectors in 𝑆. By our assumption,
𝑆 ∪ {𝑥} is linearly dependent, so 0 can be written as a nontrivial linear combination:
𝑛
0 = 𝛼𝑥 + ∑ 𝛼𝑖 𝑣𝑖 ,
𝑖=1

where not all coeﬃcients are zero. Because 𝑆 is linearly independent, 𝛼 cannot be zero. (As it would imply the linear
dependence of 𝑆, which would go against our assumptions.) Thus,
𝑛
𝛼𝑖
𝑥 = ∑− 𝑣,
𝑖=1
𝛼 𝑖

showing that 𝑆 is a basis. □

Next, we are going to show that every vector of a basis is essential.

Theorem 1.4.3
Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } a basis. Then for any 𝑣 ∈ 𝑆, the span of 𝑆\{𝑣} is a proper subset of 𝑉 .

Proof. We are going to prove by contradiction. Without the loss of generality, we can assume that 𝑣 = 𝑣1 . If
span(𝑆\{𝑣1 }) = 𝑉 , then
𝑛
𝑣1 = ∑ 𝑥𝑖 𝑣𝑖 .
𝑖=2

This means that 𝑆 = {𝑣1 , … , 𝑣𝑛 } is not linearly independent, contradicting our assumptions. □

In other words, the above results mean that a basis is a maximal linearly independent and a minimal generating set at the
same time.
𝑛
Given a basis 𝑆 = {𝑣1 , … , 𝑣𝑛 }, we implictly write the vector 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 as 𝑥 = (𝑥1 , … , 𝑥𝑛 ). Since this decompo-
sition is unique, we can do this without issues. The coefficients 𝑥𝑖 are also called coordinates. (Note that the coordinates
strongly depend on the basis. Given two different bases, the coordinates of the same vector can be different.)

2.4. Linear basis 25

Mathematics of Machine Learning

2.4.4 Finite dimensional vector spaces

As we have seen previously, bases are not unique, as a single vector space can have many diﬀerent bases. A very natural
question that arises in this context is the following. If 𝑆1 and 𝑆2 are two bases for 𝑉 , then does |𝑆1 | = |𝑆2 | hold? (Where
|𝑆| denotes the cardinality of the set 𝑆, that is, its “size”.)
In other words, can we do better if we select our basis more cleverly? It turns out that we cannot, and the sizes of any two
basis sets is equal. We are not going to prove this, but here is the theorem in its entirety.

Theorem 1.4.4
Let 𝑉 be a vector space and 𝑆1 , 𝑆2 be two of its bases. Then |𝑆1 | = |𝑆2 |.

This gives us a way to define the dimension of a vector space, which is simply the cardinality of its basis. We’ll
denote the dimension of 𝑉 as dim(𝑉 ). For example, ℝ𝑛 is 𝑛-dimensional, as shown by the standard basis
{(1, 0, … , 0), … , (0, 0, … , 1)}.
If you recall the previous theorems, we almost always assumed that a basis is finite. You might ask the question: is this
always true? The answer is no. Examples 2 and 3 show that this is not the case. For instance, the countably infinite set
{1, 𝑥, 𝑥2 , 𝑥3 , … } is a basis for ℝ[𝑥]. So, according to the theorem above, no finite basis can exist there.
This marks an important distinction between vector spaces: those with finite bases are called finite-dimensional. I have
some good news: all finite-dimensional real vector spaces are essentially ℝ𝑛 . (Recall that we call a vector space real if its
scalars are the real numbers.)
To see why, suppose that 𝑉 is an 𝑛-dimensional real vector space with basis {𝑣1 , … , 𝑣𝑛 }, and define the mapping 𝜑 ∶
𝑉 → ℝ𝑛 by
𝑛
𝜑 ∶ ∑ 𝑥𝑖 𝑣𝑖 → (𝑥1 , … , 𝑥𝑛 ).
𝑖=1

𝜑 is invertible and preserves the structure of 𝑉 , that is, the addition and scalar multiplication operations. Indeed, if
𝑢, 𝑣 ∈ 𝑉 and 𝛼, 𝛽 ∈ ℝ, then 𝜑(𝛼𝑢 + 𝛽𝑣) = 𝛼𝜑(𝑥) + 𝛽𝜑(𝑦). Such mappings are called isomorphisms. The word itself
is derived from ancient Greek, with isos meaning same and morphe meaning shape. Even though this sounds abstract, the
existence of an isomorphism between two vector spaces mean that they have the same structure. So, ℝ𝑛 is not just an
example of ﬁnite dimensional real vector spaces, it is a universal model of them. (Note that if the scalars are not the real
numbers, the isomorphism to ℝ𝑛 is not true.)
Considering that we’ll almost exclusively deal with ﬁnite dimensional real vector spaces, this is good news. Using ℝ𝑛 is
not just a heuristic, it is a good mental model.

2.4.5 Why are bases so important?

If every finite-dimensional real vector space is essentially the same as ℝ𝑛 , what do we gain from abstraction? Sure, we can
just work with ℝ𝑛 without talking about bases, but to develop a deep understanding of the core mathematical concepts in
machine learning, we need the abstraction.
Let’s look ahead briefly and see an example. If you have some experience with neural networks, you know that matrices
play an essential role in defining its layers. Without any context, matrices are just a table of numbers with seemingly
arbitrary rules of computation. Have you ever wondered why matrix multiplication is defined the way it is?
Although we haven’t precisely defined matrices yet, you have probably encountered them previously. For two matrices

𝑎1,1 𝑎1,2 … 𝑎1,𝑛 𝑏1,1 𝑏1,2 … 𝑏1,𝑛

⎡𝑎 𝑎2,2 … 𝑎2,𝑛 ⎤ ⎡𝑏 𝑏2,2 … 𝑏2,𝑛 ⎤
𝐴 = ⎢ 2,1 ⎥, 𝐵 = ⎢ 2,1 ⎥,
⎢ ⋮ ⋮ ⋱ ⋮ ⎥ ⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑛 ⎦ ⎣𝑏𝑛,1 𝑏𝑛,2 … 𝑏𝑛,𝑛 ⎦

26 Chapter 2. Vectors in theory

Mathematics of Machine Learning

their product is deﬁned by

𝑛 𝑛 𝑛
∑ 𝑎1,𝑘 𝑏𝑘,1 ∑𝑘=1 𝑎1,𝑘 𝑏𝑘,2 … ∑𝑘=1 𝑎1,𝑘 𝑏𝑘,𝑛
⎡ ∑𝑛𝑘=1 𝑎 𝑏 𝑛
∑𝑘=1 𝑎2,𝑘 𝑏𝑘,2 … ∑𝑘=1 𝑎2,𝑘 𝑏𝑘,𝑛 ⎤
𝑛
𝐴𝐵 = ⎢ 𝑘=1 2,𝑘 𝑘,1 ⎥,
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
𝑛 𝑛 𝑛
⎣∑𝑘=1 𝑎𝑛,𝑘 𝑏𝑘,1 ∑𝑘=1 𝑎𝑛,𝑘 𝑏𝑘,2 … ∑𝑘=1 𝑎𝑛,𝑘 𝑏𝑘,𝑛 ⎦
that is, the 𝑖, 𝑗-th element of 𝐴𝐵 is deﬁned by
𝑛
∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 .
𝑘=1

Even though we can visualize this to make it easier to understand, the deﬁnition seems random.

Fig. 2.6: Matrix multiplication visualized.

Why not just take the componentwise product (𝑎𝑖,𝑗 𝑏𝑖,𝑗 )𝑛𝑖,𝑗=1 ? The definition becomes crystal clear once we look at a
matrix as a tool to describe linear transformations between vector spaces, as the elements of the matrix describe the
images of basis vectors. In this context, multiplication of matrices is just the composition of linear transformations.
Instead of just putting out the definition and telling you how to use it, I want you to understand why it is defined that way.
In the next chapters, we are going to learn every nook and cranny of matrix multiplication.

2.4.6 The existence of bases (optional)

At this point, you might ask the question: for a given vector space, are we guaranteed to ﬁnd a basis? Without such a
guarantee, the previous setup might be wasted. (As there might not be a basis to work with.)
Fortunately, this is not the case. As the proof is extremely diﬃcult, we will not show this, but this is so important that we
should at least state the theorem. If you are interested in how this can be done, I included a proof sketch. Feel free to
skip this, as it is not going to be essential for our purposes.

Theorem 1.4.5
Every vector space has a basis.

Proof. (Sketch.) The proof of this uses an advanced technique called transﬁnite induction, which is way beyond our
scope. Instead of being precise, let’s just focus on building intuition about how to construct a basis for any vector space.

2.4. Linear basis 27

Mathematics of Machine Learning

For our vector space 𝑉 , we will build a basis one by one. Given any non-null vector 𝑣1 , if span(𝑆1 ) ≠ 𝑉 , the set
𝑆1 = {𝑣1 } is not yet a basis. Thus, we can find a vector 𝑣2 ∈ 𝑉 \span(𝑆1 ) so that 𝑆2 ∶= 𝑆1 ∪ {𝑣2 } is still linearly
independent.
Is 𝑆2 a basis? If not, we can continue the process. In case the process stops in finitely many steps, we are done. However,
this is not guaranteed. Think about ℝ[𝑥], the vector space of polynomials, which is not finite-dimensional, as we have
seen before. This is where we need to employ some set-theoretical heavy machinery. (Which we don’t have.)
If the process doesn’t stop, we need to find a set 𝑆ℵ0 that contains all 𝑆𝑖 as a subset. (Finding this 𝑆ℵ0 set is the tricky
part.) Is 𝑆ℵ0 a basis? If not, we continue the process.
This is difficult to show, but the process eventually stops, and we can’t add any more vectors to our linearly independent
vector set that won’t destroy the independence property. When this happens, we have found a maximal linearly independent
set, that is, a basis. ≈ □

For ﬁnite dimensional vector spaces, the above process is easy to describe. In fact, one of the pillars of linear algebra is
the so-called Gram-Schmidt process, used to explicitly construct special bases for vector spaces. As several quintessential
results rely on this, we are going to study it in detail during the next chapters.

2.5 Subspaces

Before we move on, there is one more thing we need to talk about, one that will come in handy when talking about linear
transformations. (But again, linear transformations are at the heart of machine learning. Everything we learn is to get to
know them better.) For a given vector space 𝑉 , we are often interested in one of its subsets that is a vector space in its
entirety. This is described by the concept of subspaces.

Deﬁnition 1.5.1 (Subspaces.)

Let 𝑉 be a vector space. The set 𝑈 ⊆ 𝑉 is a subspace of 𝑉 if it is closed under addition and scalar multiplication.
𝑈 is a proper subspace if it is a subspace and 𝑈 ⊂ 𝑉 .

By deﬁnition, subspaces are vector spaces themselves, so we can deﬁne their dimension as well. There are at least two
subspaces of each vector space: itself and {0}. These are called trivial subspaces. Besides those, the span of a set of
vectors is always a subspace. One such example is illustrated in Fig. 2.5.
One of the most important aspects of subspaces is that we can use them to create more subspaces. This notion is made
precise below.

Deﬁnition 1.5.2 (Direct sum of subspaces.)

Let 𝑉 be a vector space and 𝑈1 , 𝑈2 be two of its subspaces. The direct sum of 𝑈1 and 𝑈2 is deﬁned by

𝑈1 + 𝑈2 = {𝑢1 + 𝑢2 ∶ 𝑢1 ∈ 𝑈1 , 𝑢2 ∈ 𝑈2 }.

You can easily verify that 𝑈1 + 𝑈2 is a subspace indeed, moreover 𝑈1 + 𝑈2 = span(𝑈1 ∪ 𝑈2 ). (See one of the
exercises at the end of the chapter.) Subspaces and their direct sum play an essential role in several topics, such as matrix
decompositions. For example, we’ll see later that many of them are equivalent to decomposing a linear space into a sum
of vector spaces.
The ability to select a basis whose subsets span certain given subspaces often comes in handy. This is formalized by the
next result.

28 Chapter 2. Vectors in theory

Mathematics of Machine Learning

Theorem 1.5.1
Let 𝑉 be a vector space and 𝑈1 , 𝑈2 ⊆ 𝑉 its subspaces such that 𝑈1 + 𝑈2 = 𝑉 . Then for any bases {𝑝1 , … , 𝑝𝑘 } ⊆ 𝑈1
and {𝑞1 , … , 𝑞𝑙 } ⊆ 𝑈2 , their union is a basis in 𝑉 .

Proof. This follows directly from the direct sum’s deﬁnition. If 𝑉 = 𝑈1 + 𝑈2 , then any 𝑥 ∈ 𝑉 can be written in the
form 𝑥 = 𝑎 + 𝑏, where 𝑎 ∈ 𝑈1 and 𝑏 ∈ 𝑈2 .
In turn, since 𝑝1 , … , 𝑝𝑘 forms a basis in 𝑈1 and 𝑞1 , … , 𝑞𝑙 forms a basis in 𝑈2 , the vectors 𝑎 and 𝑏 can be written as
𝑘 𝑙
𝑎 = ∑ 𝛼𝑖 𝑝𝑖 , 𝑏 = ∑ 𝛽𝑖 𝑞𝑖 .
𝑖=1 𝑖=1

Thus, any 𝑥 is
𝑘 𝑙
𝑥 = ∑ 𝛼𝑖 𝑝𝑖 + ∑ 𝛽𝑖 𝑞𝑖 ,
𝑖=1 𝑖=1

which is the deﬁnition of the basis. □

With vector spaces, we are barely scratching the surface. Bases are essential, but they only provide the skeleton for the
vector spaces encountered in practice. To properly represent and manipulate data, we need to build a geometric structure
around this skeleton. How to measure the “distance” between two measurements? What about their similarity?
Besides all that, there is an even more crucial question: how on earth will we represent vectors inside a computer? In the
next chapter, we will take a look at the data structures of Python, laying the foundation for the data manipulations and
transformations we’ll do later.

2.6 Problems

Problem 1. Not all vector spaces are infinite. There are some that only contain a finite number of vectors, as we shall see
next in this problem. Define the set

ℤ2 ∶= {0, 1},

where the operations +, ⋅ are deﬁned by the rules

0+0=0
0+1=1
1+0=1
1+1=0
and
0⋅0=0
0⋅1=0
1⋅0=0
1 ⋅ 1 = 1.
This is called binary (or modulo-2) arithmetic.
(a) Show that (ℤ2 , ℤ2 , +, ⋅) is a vector space.

2.6. Problems 29
Mathematics of Machine Learning

(b) Show that (ℤ𝑛2 , ℤ2 , +, ⋅) is also a vector space, where ℤ𝑛2 is the 𝑛-fold Cartesian product

ℤ𝑛2 = ℤ 2 × ⋯ × ℤ2 ,
⏟⏟⏟⏟⏟
𝑛 times

and the addition and scalar multiplication are deﬁned elementwise:

𝑥 + 𝑦 = (𝑥1 + 𝑦1 , … , 𝑥𝑛 + 𝑦𝑛 ), 𝑥, 𝑦 ∈ ℤ𝑛2 ,
𝑐𝑥 = (𝑐𝑥1 , … , 𝑐𝑥𝑛 ), 𝑐 ∈ ℤ2 .

Problem 2. Are the following vector sets linearly independent?

(a) 𝑆1 = {(1, 0, 0), (1, 1, 0), (1, 1, 1)} ⊆ ℝ3
(b) 𝑆2 = {(1, 1, 1), (1, 2, 4), (1, 3, 9)} ⊆ ℝ3
(c) 𝑆3 = {(1, 1, 1), (1, 1, −1), (1, −1, −1)} ⊆ ℝ3
(d) 𝑆4 = {(𝜋, 𝑒), (−42, 13/6), (𝜋3 , −2)} ⊆ ℝ2
Problem 3. Let 𝑉 be a finite 𝑛-dimensional vector space and let 𝑆 = {𝑣1 , … , 𝑣𝑚 } be a linearly independent set of
vectors, 𝑚 < 𝑛. Show that there is a basis set 𝐵 such that 𝑆 ⊂ 𝐵.
Problem 4. Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } be its basis. Show that every vector 𝑥 ∈ 𝑉 can be uniquely
𝑛 𝑛
written as a linear combination of vectors in 𝑆. (That is, if 𝑥 = ∑𝑖=1 𝛼𝑖 𝑣𝑖 = ∑𝑖=1 𝛽𝑖 𝑣𝑖 , then 𝛼𝑖 = 𝛽𝑖 for all
𝑖 = 1, … , 𝑛.)
Problem 5. Let 𝑉 be an arbitrary vector space and 𝑈1 , 𝑈2 ⊆ 𝑉 be two of its subspaces. Show that 𝑈1 + 𝑈2 =
span(𝑈1 ∪ 𝑈2 ).
Hint: to prove the equality of these two sets, you need to show two things:
1. if 𝑥 ∈ 𝑈1 + 𝑈2 , then 𝑥 ∈ span(𝑈1 ∪ 𝑈2 ) as well,
2. if 𝑥 ∈ span(𝑈1 ∪ 𝑈2 ), then 𝑥 ∈ 𝑈1 + 𝑈2 as well.
Problem 6. Consider the vector space of polynomials with real coefficients, defined by
𝑛
ℝ[𝑥] = {𝑝(𝑥) = ∑ 𝑝𝑖 𝑥𝑖 ∶ 𝑝𝑖 ∈ ℝ, 𝑛 = 0, 1, … }.
𝑖=0

(a) Show that

𝑛
𝑥ℝ[𝑥] ∶= {𝑝(𝑥) = ∑ 𝑝𝑖 𝑥𝑖 ∶ 𝑝𝑖 ∈ ℝ, 𝑛 = 1, 2, … }
𝑖=1

is a proper subspace of ℝ[𝑥].

(b) Show that

𝑓 ∶ ℝ[𝑥] → 𝑥ℝ[𝑥], 𝑝(𝑥) ↦ 𝑥𝑝(𝑥)

is a bijective and linear. (A function 𝑓 ∶ 𝑋 → 𝑌 is bijective if every 𝑦 ∈ 𝑌 has exactly one 𝑥 ∈ 𝑋 for which 𝑓(𝑥) = 𝑦.)
In general, a linear and bijective function 𝑓 ∶ 𝑈 → 𝑉 between vector spaces is called an isomorphism. Given the existence
of such function, we call the vector spaces 𝑈 and 𝑉 isomorphic, meaning that they have an identical algebraic structure.
Combining (a) and (b), we obtain that ℝ[𝑋] is isomorphic with its proper subspace 𝑥ℝ[𝑋]. This is quite an interesting
phenomenon: a vector space that is algebraically identical to its proper subspace.

30 Chapter 2. Vectors in theory

CHAPTER

THREE

VECTORS IN PRACTICE

So far, we have mostly talked about the theory of vectors and vector spaces. However, our ultimate goal is to build
computational models for discovering and analyzing patterns in data. To put theory into practice, we will take a look at
how vectors are represented in computations.
In computer science, there is a stark contrast between how we think about mathematical structures and how we represent
them inside a computer. Until this point, our goal was to develop a mathematical framework that enables us to reason
about the structure of data and its transformations effectively. We want a language that is
• expressive,
• easy to speak,
• and as compact as possible.
However, our goals change when we aim to do computations instead of pure logical reasoning. We want implementations
that are
• easy to work with,
• memory-efficient,
• and fast to access, manipulate and transform.
These are often contradicting requirements, and particular situations might prefer one over the other. For instance, if
we have plenty of memory but want to perform lots of computations, we can sacrifice size for speed. Because of all
the potential use-cases, there are multiple formats to represent the same mathematical concepts. These are called data
structures.
Different programming languages implement vectors differently. Because Python is ubiquitous in data science and ma-
chine learning, it’ll be our language of choice. In this chapter, we are going to study all the possible data structures in
Python to see which one is suitable to represent vectors for high performance computations.

3.1 Tuples

In standard Python, two built-in data structures can be used to represent vectors: tuples and lists. Let’s start with tuples!
They can be simply deﬁned by enumerating their elements between two parentheses, separating them with commas.

v_tuple = (1, 3.5, -2.71, "a string", 42)

print(v_tuple)

(1, 3.5, -2.71, 'a string', 42)

31
Mathematics of Machine Learning

type(v_tuple)

tuple

A single tuple can hold elements of various types. Even though we’ll exclusively deal with ﬂoats in computational linear
algebra, this property is extremely useful for general purpose programming.
We can access the elements of a tuple by indexing. Just like in (almost) all other programming languages, indexing starts
from zero. This is in contrast with mathematics, where we often start indexing from one. (Don’t tell this to anybody else,
but it used to drive me crazy. I am a mathematician ﬁrst.)

v_tuple[0]

The number of elements in a tuple can be accessed by calling the built-in len function.

len(v_tuple)

Besides indexing, we can also access multiple elements by slicing.

v_tuple[1:4]

(3.5, -2.71, 'a string')

Slicing works by specifying the ﬁrst and last elements with an optional step size, using the syntax ob-
ject[first:last:step].
Tuples are rather inﬂexible, as you cannot change their components. Attempting to do so results in a TypeError,
Python’s standard way of telling you that the object does not support the method you are trying to call (item assignment).

v_tuple[0] = 2

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-5d8791fb815d> in <module>
----> 1 v_tuple[0] = 2

TypeError: 'tuple' object does not support item assignment

Besides that, extending the tuple with additional elements is also not supported. As we cannot change the state of a tuple
object in any way after it has been instantiated, they are immutable. Depending on the use-case, immutability can be an
advantage and a disadvantage as well. Immutable objects eliminate accidental changes, but each operation requires the
creation of a new object, resulting in a computational overhead. Thus, tuples are not going to be optimal to represent large
amounts of data in complex computations.
This issue is solved by lists. Let’s take a look at them, and the new problems they introduce!

32 Chapter 3. Vectors in practice

Mathematics of Machine Learning

3.2 Lists

Lists are the workhorses of Python. In contrast with tuples, lists are extremely ﬂexible and easy to use, albeit this comes at
the cost of runtime. Similarly to tuples, a list object can be created by enumerating its objects between square brackets,
separated by commas.

v_list = [1, 3.5, -2.71, "qwerty"]

type(v_list)

list

Just like tuples, accessing the elements of a list is done by indexing or slicing.
We can do all kinds of operations on a list: overwrite its elements, append items, or even remove others.

v_list[0] = "this is a string"

v_list

['this is a string', 3.5, -2.71, 'qwerty']

This example illustrates that lists can hold elements of various types as well. Adding and removing elements can be done
with methods like append, push, pop, and remove.
Before trying that, let’s quickly take note of the memory address of our example list, which can be accessed by calling
the id function.

v_list_addr = id(v_list)

v_list_addr

140531937533320

This number simply refers to an address in my computer’s memory, where the v_list object is located. Quite literally,
as this book is compiled on my personal computer.
Now, we are going to perform a few simple operations on our list and show that the memory address doesn’t change.
Thus, no new object is created.

v_list.append([42]) # adding the list [42] to the end of our list

print(v_list)

['this is a string', 3.5, -2.71, 'qwerty', [42]]

id(v_list) == v_list_addr # adding elements doesn't create any new objects

True

3.2. Lists 33
Mathematics of Machine Learning

v_list.pop(1) # removing the element at the index "1"

print(v_list)

['this is a string', -2.71, 'qwerty', [42]]

id(v_list) == v_list_addr # removing elements still doesn't create any new objects

True

Unfortunately, adding lists together achieves a result that is completely diﬀerent from our expectations.

[1, 2, 3] + [4, 5, 6]

[1, 2, 3, 4, 5, 6]

Instead of adding the corresponding elements together, like we want vectors to behave, the lists are concatenated. This
feature is handy when writing general-purpose applications. However, this is not well-suited for scientiﬁc computations.
“Scalar multiplication” also has strange results.

3*[1, 2, 3]

[1, 2, 3, 1, 2, 3, 1, 2, 3]

Multiplying a list with an integer repeats the list by the speciﬁed number of times. Given the behavior of the + operator
on lists, this seems logical as multiplication with an integer is repeated addition:

𝑎 ⋅ 𝑏 = 𝑏⏟⏟
+⏟⋯⏟
+⏟𝑏 .
𝑎 times

Overall, lists can do much more than we need to represent vectors. Although we potentially want to change elements
of our vectors, we don’t need to add or remove elements from them, and we also don’t need to store objects other than
ﬂoats. Can we sacriﬁce these extra features and obtain an implementation suitable for our purposes yet has a lightning-fast
computational performance? Yes. Enter NumPy arrays.

3.3 NumPy arrays

Even though Python’s built-in data structures are amazing, they are optimized for ease of use, not for scientiﬁc computa-
tion. This problem was realized early on the language’s development and was addressed by the NumPy library.
One of the main selling points of Python is how fast and straightforward it is to write code, even for complex tasks. This
comes at the price of speed. However, in machine learning, speed is crucial for us. When training a neural network, a
small set of operations are repeated millions of times. Even a small percentage of improvement in performance can save
hours, days, or even weeks in case of extremely large models.
Regarding speed, the C language is at the other end of the spectrum. C code is hard to write but executes blazing fast
when done correctly. As Python is written in C, a tried and true method for achieving fast performance is to call functions
written in C from Python. In a nutshell, this is what NumPy provides: C arrays and operations, all in Python.
To get a glimpse into the deep underlying issues with Python’s built-in data structures, we should put numbers and arrays
under our magnifying glass. Inside a computer’s memory, objects are represented as ﬁxed-length 0-1 sequences. Each
component is called a bit. Bits are usually grouped into 8, 16, 32, 64, or even 128 sized chunks. Depending on what we

34 Chapter 3. Vectors in practice

Mathematics of Machine Learning

want to represent, identical sequences can mean diﬀerent things. For instance, the 8-bit sequence 00100110 can represent
the integer 38 or the ASCII character “&”.

Fig. 3.1: An 8-bit object in memory.

By specifying the data type, we can decode binary objects. 32-bit integers are called int32 types, 64-bit ﬂoats are
float64, and so on.
Since a single bit contains very little information, memory is addressed by dividing it into 32 or 64 bit sized chunks and
numbering them consecutively. This address is a hexadecimal number, starting from 0. (For simplicity, let’s assume that
the memory is addressed by 64 bits. This is customary in modern computers.)
A natural way to store a sequence of related objects (with matching data type) is to place them next to each other in the
memory. This data structure is called an array.

Fig. 3.2: An array of int64 objects.

By storing the memory address of the ﬁrst object, say 0x23A0, we can instantly retrieve the k-th element by accessing
the memory at 0x23A0 + k.

3.3. NumPy arrays 35

Mathematics of Machine Learning

We call this the static array or often the C array because this is how it is done in the magnificent C language. Although
this implementation of arrays is lightning fast, it is relatively inflexible. First, you can only store objects of a single type.
Second, you have to know the size of your array in advance, as you cannot use memory addresses that overextend the
pre-allocated part. Thus, before you start working with your array, you have to allocate memory for it. (That is, reserve
space so that other programs won’t overwrite it.)
However, in Python, you can store arbitrarily large and different objects in the same list, with the option of removing and
adding elements to it.

l = [2**142 + 1, "a string"]

l.append(lambda x: x)

[5575186299632655785383929568162090376495105,
'a string',
<function __main__.<lambda>(x)>]

In the example above, l[0] is an integer so large that it doesn’t ﬁt into 128 bits. Also, there are all kinds of objects in
our list, including a function. How is this possible?
Python’s list provides a ﬂexible data structure by
1. overallocating the memory and,
2. keeping memory addresses to the objects’ in the list instead of the objects themselves.
(At least in the most widespread CPython implementation.)

Fig. 3.3: CPython implementation of lists.

By checking the memory addresses of each object in our list l, we can see that they are all over the memory.

[id(x) for x in l]

[140408800412080, 140408800519536, 140408800458816]

Due to the overallocation, deletion or insertion can always be done simply by shifting the remaining elements. Since the
list stores the memory address of its elements, all types of objects can be stored within a single structure.

36 Chapter 3. Vectors in practice

Mathematics of Machine Learning

However, this comes at a cost. Because the objects are not contiguous in memory, we lose locality of reference, meaning
that since we frequently access distant locations of the memory, our reads are much slower. Thus, looping over a Python
list is not eﬃcient, making it unsuitable for scientiﬁc computation.
So, NumPy arrays are essentially the good old C arrays in Python, with the user-friendly interface of Python lists. (If you
have ever worked with C, you know how big of a blessing this is.) Let’s see how to work with them!
First, we import the numpy library. (To save on the characters, it is customary to import it as np.)

import numpy as np

The main data structure is the np.ndarray, short for n-dimensional array. We can use the np.array function to
create NumPy arrays from standard Python containers or initialized from scratch. (Yes, I know. This is confusing, but
you’ll get used to it. Just take a mental note that np.ndarray is the class, and np.array is the function you use to
create NumPy arrays from Python objects.)

X = np.array([87.7, 4.5, -4.1, 42.1414, -3.14, 2.001]) # creating a NumPy array␣

↪from a Python list

array([87.7 , 4.5 , -4.1 , 42.1414, -3.14 , 2.001 ])

np.ones(shape=7) # initializing a NumPy array from scratch using ones

array([1., 1., 1., 1., 1., 1., 1.])

np.zeros(shape=5) # initializing a NumPy array from scratch using zeros

array([0., 0., 0., 0., 0.])

We can even initialize NumPy arrays using random numbers. Later, when talking about probability theory, we’ll discuss
this functionality in detail, as the library covers a wide range of probability distributions.

np.random.rand(10)

array([0.66339028, 0.19520763, 0.60535065, 0.52306207, 0.24778104,

0.99992155, 0.76469949, 0.61753093, 0.46085152, 0.22215355])

Most importantly, when we have a given array, we can initialize another one with the same dimensions using the np.
zeros_like, np.ones_like, and np.empty_like functions.

np.zeros_like(X)

array([0., 0., 0., 0., 0., 0.])

Just like Python lists, NumPy arrays support item assignments and slicing.

X[0] = 1545.215
X

3.3. NumPy arrays 37

Mathematics of Machine Learning

array([1545.215 , 4.5 , -4.1 , 42.1414, -3.14 , 2.001 ])

X[1:4]

array([ 4.5 , -4.1 , 42.1414])

However, as expected, you can only store a single data type within each ndarray. When trying to assign a string as the
ﬁrst element, we get an error message.

X[0] = "str"

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-d996c0d86300> in <module>
----> 1 X[0] = "str"

ValueError: could not convert string to float: 'str'

As you might have guessed, every ndarray has a data type attribute that can be accessed at ndarray.dtype. If a
conversion can be made between the value to be assigned and the data type, it is automatically performed, making the
item assignment successful.

X.dtype

dtype('float64')

val = 23
type(val)

int

X[0] = val

array([23. , 4.5 , -4.1 , 42.1414, -3.14 , 2.001 ])

NumPy arrays are iterable, just like other container types in Python.

for x in X:
print(x)

23.0
4.5
-4.1
42.1414
-3.14
2.001

38 Chapter 3. Vectors in practice

Mathematics of Machine Learning

Are these suitable to represent vectors? Yes. Let’s see why!

3.3.1 NumPy arrays as vectors

Let’s talk about vectors once more. From now on, we are going to use NumPy ndarray-s to model vectors.

v_1 = np.array([-4.0, 1.0, 2.3])

v_2 = np.array([-8.3, -9.6, -7.7])

The addition and scalar multiplication operations are supported by default and perform as expected.

v_1 + v_2 # adding v_1 and v_2 together as vectors

array([-12.3, -8.6, -5.4])

10.0*v_1 # multiplying v_1 with a scalar

array([-40., 10., 23.])

v_1 * v_2 # the elementwise product of v_1 and v_2

array([ 33.2 , -9.6 , -17.71])

np.zeros(shape=3) + 1

array([1., 1., 1.])

Because of the dynamic typing of Python, we can (often) plug in NumPy arrays into functions intended for scalars.

def f(x):
return 3*x**2 - x**4

f(v_1)

array([-208. , 2. , -12.1141])

So far, NumPy arrays satisfy almost everything we require to represent vectors. There is only one box to be ticked:
performance. To investigate this, we measure the execution time with Python’s built-in timeit tool.
In its ﬁrst argument, timeit takes a function to be executed and timed. Instead of passing a function object, it also accepts
executable statements as a string. Since function calls have a signiﬁcant computational overhead in Python, we are passing
code rather than a function object in order to be precise with the time measurements.
Below, we compare adding together two NumPy arrays vs. Python lists containing a thousand zeros.

from timeit import timeit

(continues on next page)

3.3. NumPy arrays 39

Mathematics of Machine Learning

(continued from previous page)

n_runs = 100000
size = 1000

t_add_builtin = timeit(
"[x + y for x, y in zip(v_1, v_2)]",
setup=f"size={size}; v_1 = [0 for _ in range(size)]; v_2 = [0 for _ in␣
↪range(size)]",

number=n_runs
)

t_add_numpy = timeit(
"v_1 + v_2",
setup=f"import numpy as np; size={size}; v_1 = np.zeros(shape=size); v_2 = np.
↪zeros(shape=size)",

number=n_runs
)

print(f"Built-in addition: \t{t_add_builtin} s")

print(f"NumPy addition: \t{t_add_numpy} s")
print(f"Performance improvement: \t{100*t_add_builtin/t_add_numpy}% speedup")

Built-in addition: 3.3502430509997794 s

NumPy addition: 0.06488390699996671 s
Performance improvement: 5163.4422245896785% speedup

NumPy arrays are much-much faster. This is because they are

• contiguous in memory,
• homogeneous in type,
• with operations implemented in C.
This is just the tip of the iceberg. We have only seen a small part of it, but NumPy provides much more than a fast
data structure. As we progress in the book, we’ll slowly dig deeper and deeper, eventually discovering the vast array of
functionalities it provides.

3.3.2 Why NumPy instead of TensorFlow or PyTorch?

If you are already familiar with some deep learning frameworks, you might ask: why are we studying NumPy instead
of them? The answer is simple: because all state-of-the-art libraries are built on its legacy. Modern tensor libraries are
essentially clones of NumPy, with GPU support. Thus, most NumPy knowledge translates directly to TensorFlow and
PyTorch. If you understand how it works on a fundamental level, you’ll have a headstart in more advanced frameworks.
Moreover, our goal is to implement our neural network from scratch by the end of the book. To understand every nook
and cranny, we don’t want to use built-in algorithms like backpropagation. We’ll create our own!

40 Chapter 3. Vectors in practice

Mathematics of Machine Learning

3.4 Is NumPy really faster than Python?

NumPy is designed to be faster than vanilla Python. Is this really the case? Not all the time. If you use it wrong, it might
even hurt performance! To know when it is beneﬁcial to use NumPy, we will look at why exactly it is faster in practice.
To simplify the investigation, our toy problem will be random number generation. Suppose that we need just a single
random number. Should we use NumPy? Let’s test it! We are going to compare it with the built-in random number
generator by running both ten million times, measuring the execution time.

from numpy.random import random as random_np

from random import random as random_py

n_runs = 10000000
t_builtin = timeit(random_py, number=n_runs)
t_numpy = timeit(random_np, number=n_runs)

print(f"Built-in random:\t{t_builtin} s")

print(f"NumPy random: \t{t_numpy} s")

Built-in random: 0.2327952079995157 s

NumPy random: 1.9404740369991487 s

For generating a single random number, NumPy is signiﬁcantly slower. Why is this the case? What if we need an array
instead of a single number? Will this also be slower?
This time, let’s generate a list/array of a thousand elements.

size = 1000
n_runs = 10000

t_builtin_list = timeit(
"[random_py() for _ in range(size)]",
setup=f"from random import random as random_py; size={size}",
number=n_runs
)

t_numpy_array = timeit(
"random_np(size)",
setup=f"from numpy.random import random as random_np; size={size}",
number=n_runs
)

print(f"Built-in random with lists:\t{t_builtin_list}s")

print(f"NumPy random with arrays: \t{t_numpy_array}s")

Built-in random with lists: 0.4184184540008573s

NumPy random with arrays: 0.062079527000605594s

(Again, I don’t want to wrap the timed expressions in lambdas since function calls have an overhead in Python. I want to
be as precise as possible, so I pass them as strings to the timeit function.)
Things are looking much diﬀerent now. When generating an array of random numbers, NumPy wins hands down.
There are some curious things about this result as well. First, we generated a single random number 10 000 000 times.
Second, we generated an array of 1000 random numbers 10 000 times. In both cases, we have 10 000 000 random

3.4. Is NumPy really faster than Python? 41

Mathematics of Machine Learning

numbers in the end. Using the built-in method, it took ~2x time when we put them in a list. However, with NumPy, we
see a ~30x speedup compared to itself when working with arrays!

3.4.1 Dissecting the code: proﬁling with cProﬁler

To see what happens behind the scenes, we are going to profile the code using cProfiler. With this, we’ll see exactly how
many times a given function was called and how much time we spent inside them.
(To make profiling work from Jupyter Notebooks, we need to do some Python magic first. Feel free to disregard the
contents of the next cell; this is just to make sure that the output of the profiling is printed inside the notebook.)

from IPython.core import page

page.page = print

Let’s take a look at the built-in function ﬁrst. In the following function, we create 10 000 000 random numbers, just as
before.

def builtin_random_single(n_runs):
for _ in range(n_runs):
random_py()

From Jupyter Notebooks, where this book is written, cProﬁler can be called with the magic command %prun.

n_runs = 10000000

%prun builtin_random_single(n_runs)

10000004 function calls in 1.037 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)

1 0.630 0.630 1.037 1.037 <ipython-input-45-73f06baff02e>
↪:1(builtin_random_single)

10000000 0.407 0.000 0.407 0.000 {method 'random' of '_random.Random'␣

↪objects}

1 0.000 0.000 1.037 1.037 {built-in method builtins.exec}

1 0.000 0.000 1.037 1.037 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.
↪Profiler' objects}

There are two important columns here for our purposes. ncalls shows how many times a function was called, while
tottime is the total time spent in a function, excluding time spent in subfunctions.
The built-in function random.random() was called 10 000 000 times as expected, and the total time spent in that
function was 0.407 seconds. (If you are running this notebook locally, this number is going to be diﬀerent.)
What about the NumPy version? The results are surprising.

def numpy_random_single(n_runs):
for _ in range(n_runs):
random_np()

42 Chapter 3. Vectors in practice

Mathematics of Machine Learning

%prun numpy_random_single(n_runs)

10000004 function calls in 2.820 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)

10000000 2.174 0.000 2.174 0.000 {method 'random' of 'numpy.random.
↪mtrand.RandomState' objects}

1 0.646 0.646 2.820 2.820 <ipython-input-47-db45b38db7dd>

↪:1(numpy_random_single)

1 0.000 0.000 2.820 2.820 {built-in method builtins.exec}

1 0.000 0.000 2.820 2.820 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.
↪Profiler' objects}

Similarly as before, the numpy.random.random() function was indeed called 10 000 000 times as expected. Yet,
the script spent signiﬁcantly more time in this function than in the Python built-in random before. Thus, it is more costly
per call.
When we start working with large arrays and lists, things change dramatically. Next, we generate a list/array of 1000
random numbers, while measuring the execution time.

def builtin_random_list(size, n_runs):

for _ in range(n_runs):
[random_py() for _ in range(size)]

size = 1000
n_runs = 10000

%prun builtin_random_list(size, n_runs)

10010004 function calls in 1.138 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)

10000 0.641 0.000 1.089 0.000 <ipython-input-49-d0bf27f76fac>:3(
↪<listcomp>)

10000000 0.448 0.000 0.448 0.000 {method 'random' of '_random.Random'␣

↪objects}

1 0.049 0.049 1.138 1.138 <ipython-input-49-d0bf27f76fac>

↪:1(builtin_random_list)

1 0.000 0.000 1.138 1.138 {built-in method builtins.exec}

1 0.000 0.000 1.138 1.138 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.
↪Profiler' objects}

As we see, about 60% of the time was spent on the list comprehensions: 10 000 calls, 0.641s total. (Note that tottime
doesn’t count subfunction calls like calls to random.random() here.)
Now we are ready to see why NumPy is faster when used right.

3.4. Is NumPy really faster than Python? 43

Mathematics of Machine Learning

def numpy_random_array(size, n_runs):

for _ in range(n_runs):
random_np(size)

%prun numpy_random_array(size, n_runs)

10004 function calls in 0.066 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)

10000 0.064 0.000 0.064 0.000 {method 'random' of 'numpy.random.
↪mtrand.RandomState' objects}

1 0.001 0.001 0.066 0.066 <ipython-input-51-219ecab2e001>

↪:1(numpy_random_array)

1 0.000 0.000 0.066 0.066 {built-in method builtins.exec}

1 0.000 0.000 0.066 0.066 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.
↪Profiler' objects}

With each of the 10 000 function calls, we get a numpy.ndarray of 1000 random numbers. The reason why NumPy
is fast when used right is that its arrays are extremely efficient to work with. They are like C arrays instead of Python lists.
As we have seen, there are two significant differences between them.
• Python lists are dynamic, so for instance, you can append and remove elements. NumPy arrays have fixed lengths,
so you cannot add or delete without creating a new one.
• Python lists can hold several data types simultaneously, while a NumPy array can only contain one.
So, NumPy arrays are less flexible but significantly more performant. When this additional flexibility is not needed,
NumPy outperforms Python.

3.4.2 Where is the break-even point?

To see precisely at which size does NumPy overtakes Python in random number generation, we can compare the two by
measuring the execution times for several sizes.

sizes = list(range(1, 100))

runtime_builtin = [
timeit(
"[random_py() for _ in range(size)]",
setup=f"from random import random as random_py; size={size}",
number=100000
)
for size in sizes
]

runtime_numpy = [
timeit(
"random_np(size)",
setup=f"from numpy.random import random as random_np; size={size}",
(continues on next page)

44 Chapter 3. Vectors in practice

Mathematics of Machine Learning

(continued from previous page)

number=100000
)
for size in sizes
]

import matplotlib.pyplot as plt

with plt.style.context("seaborn-white"):
plt.figure(figsize=(10, 5))
plt.plot(sizes, runtime_builtin, label="built-in")
plt.plot(sizes, runtime_numpy, label="NumPy")
plt.xlabel("array size")
plt.ylabel("time (seconds)")
plt.title("Runtime of random array generation")
plt.legend()

Around 20, NumPy starts to beat Python in performance. Of course, this number might be diﬀerent for other operations
like calculating the sine or adding numbers together, but the tendency will be the same. Python will slightly outperform
NumPy for small input sizes, but NumPy wins by a large margin as the size grows.

3.4. Is NumPy really faster than Python? 45

Mathematics of Machine Learning

46 Chapter 3. Vectors in practice

CHAPTER

FOUR

THE GEOMETRIC STRUCTURE OF VECTOR SPACES: MEASURING

DISTANCES

Let’s revisit the Iris dataset introduced in the previous chapter! I want to test your intuition. I plotted the petal widths
against the petal lengths while hiding the class labels in the following ﬁgure.

Fig. 4.1: Petal width plotted against petal length in the Iris dataset.

Even without knowing any labels, we can intuitively point out that there are probably at least two classes. Can you
summarize your reasoning in a single sentence?
There are many valid arguments, but the most prevalent one is that the two clusters are far away from each other. As this
example illustrates, the concept of distance plays an essential role in machine learning. In this chapter, we will translate
the notion of distance to the language of mathematics and put it into the context of vector spaces.

47
Mathematics of Machine Learning

4.1 Norms and distances

Previously, we have seen that vectors are essentially arrows, starting from the null vector. Besides their direction, vectors
also have magnitude. For example, as we have learned in high school mathematics, the magnitude in the Euclidean plane
is deﬁned by

‖𝑥‖ = √𝑥21 + 𝑥22 , 𝑥 = (𝑥1 , 𝑥2 ),

while we can calculate the distance between 𝑥 and 𝑦 as

𝑑(𝑥, 𝑦) = √(𝑥1 − 𝑦1 )2 + (𝑥2 − 𝑦2 )2 ,

𝑥 = (𝑥1 , 𝑥2 ) ∈ ℝ2 ,
𝑦 = (𝑦1 , 𝑦2 ) ∈ ℝ2 .

(The function ‖ ⋅ ‖ simply denotes the magnitude of a vector.)

Fig. 4.2: Magnitude in the Euclidean plane.

The magnitude formula √𝑥21 + 𝑥22 can be generalized to higher dimensions simply by

‖𝑥‖ = √𝑥21 + ⋯ + 𝑥2𝑛 , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 .

However, just from this formula, it is not clear why it is deﬁned this way. What does the square root of a sum of squares
have to do with distance and magnitude? Behind the scenes, it is just the Pythagorean theorem.
Recall that the Pythagorean theorem states that in right triangles, the squared length of the hypotenuse equals the sum of
squared lengths of other sides, as illustrated by Fig. 4.3.
To put this into an algebraic form, it states that 𝑎2 + 𝑏2 = 𝑐2 , when 𝑐 is the hypotenuse of the right triangle, and 𝑎 and
𝑏 are its two other sides. If we apply this to a two-dimensional vector 𝑥 = (𝑥1 , 𝑥2 ), we can see that the Pythagorean
theorem gives its magnitude ‖𝑥‖2 = √𝑥21 + 𝑥22 .
This can be generalized to higher dimensions. To see what is happening, we are going to check the three-dimensional
case, as illustrated by Fig. 4.4. Here, we can apply the Pythagorean theorem twice to obtain the magnitude!
For each vector 𝑥 = (𝑥1 , 𝑥2 , 𝑥3 ), we can take a look at the triangle determined by (0, 0, 0), (𝑥1 , 0, 0), and (𝑥1 , 𝑥2 , 0) ﬁrst.
The length of the hypotenuse can be calculated by √𝑥21 + 𝑥22 . However, the points (0, 0, 0), (𝑥1 , 𝑥2 , 0), and (𝑥1 , 𝑥2 , 𝑥3 )

48 Chapter 4. The geometric structure of vector spaces: measuring distances

Mathematics of Machine Learning

Fig. 4.3: The Pythagorean theorem.

Fig. 4.4: The Pythagorean theorem in three dimensions.

4.1. Norms and distances 49

Mathematics of Machine Learning

form a right triangle. Applying the Pythagorean theorem once again we obtain

‖𝑥‖2 = √𝑥21 + 𝑥22 + 𝑥23 ,

which is the Euclidean norm. This is exactly what is going on in the general 𝑛-dimensional case.
The notions of magnitude and distance are critical in machine learning, as we can use them to determine the similarity
between data points, measure and control the complexity of neural networks, and many more.
Is the above method the only viable way to measure magnitude and distance? Certainly not. Because Manhattan’s street
layout is essentially a rectangular grid, its residents are famed for measuring distances in blocks. If something is two
blocks to the north and three blocks east, it means that you have to travel two intersections to the north and three to
the east to ﬁnd it. This gives rise to a mathematically perfectly valid notion of measurement called Manhattan distance,
deﬁned by

𝑑(𝑥, 𝑦) = |𝑥1 − 𝑦1 | + |𝑥2 − 𝑦2 |.

When using the Manhattan distance, the shortest path between two points is not unique.

Fig. 4.5: For the Manhattan distance, the shortest path between two points is not unique.

Besides the Euclidean and Manhattan distances, there are several other metrics. Once again, we are going to step away
from the concrete examples to take an abstract viewpoint.
If we talk about measurements and metrics in general, what are the properties that we expect from all of them? What
makes a measurement distance? Essentially, there are three such traits: the distance should
• be nonnegative,
• preserve scaling,
• and the distance straight from point 𝐴 to 𝐵 is always equal or smaller than touching any other point 𝐶.
These are formalized by the notion of norms.

Deﬁnition 3.1.1 (Norms.)

Let 𝑉 be a vector space. A function ‖ ⋅ ‖ ∶ 𝑉 → [0, ∞) is said to be a norm if for all 𝑥, 𝑦 ∈ 𝑉 ,
(a) ‖𝑥‖ ≥ 0 and ‖𝑥‖ = 0 if and only if 𝑥 = 0 (positive deﬁniteness),
(b) ‖𝑐𝑥‖ = |𝑐|‖𝑥‖ for all 𝑐 ∈ ℝ (positive homogeneity),

50 Chapter 4. The geometric structure of vector spaces: measuring distances

Mathematics of Machine Learning

(c) ‖𝑥 + 𝑦‖ ≤ ‖𝑥‖ + ‖𝑦‖ holds for all 𝑥, 𝑦 ∈ 𝑉 (triangle inequality).

Vector spaces with a norm are called normed spaces.

Let’s see some examples!

Example 1. Let 𝑝 ∈ [1, ∞) and deﬁne
𝑛 1/𝑝
‖𝑥‖𝑝 = ( ∑ |𝑥𝑖 |𝑝 ) , 𝑥 = (𝑥1 , … , 𝑥𝑛 )
𝑖=1

on ℝ𝑛 . The function ‖ ⋅ ‖𝑝 is called the 𝑝-norm. Showing that these are indeed norms is a bit technical. Thus, we won’t
go into the details. (The triangle inequality requires some work, but the other three properties are easy to see.)
We have already seen two special cases: the Euclidean norm (𝑝 = 2), and the Manhattan norm (𝑝 = 1). Both of them
frequently appear in machine learning. For instance, the familiar mean squared error is just the scaled Euclidean distance
between prediction and ground truth:

1 1 𝑛
MSE(𝑦, 𝑦)̂ = ‖𝑦 − 𝑦‖̂ 22 = ∑(𝑦𝑖 − 𝑦𝑖̂ )2
𝑛 𝑛 𝑖=1

As mentioned before, the 2-norm, along with the 1-norm, is commonly used to control the complexity of models dur-
𝑚
ing training. To give a concrete example, suppose that we are ﬁtting a polynomial 𝑓(𝑥) = ∑𝑖=0 𝑞𝑖 𝑥𝑖 to the data
{(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 )}. To obtain a model that generalizes well to new data, we prefer our models to be as simple
as possible. Thus, instead of using the plain mean squared error, we might consider minimizing the loss

Loss(𝑦, 𝑦,̂ 𝑞) = MSE(𝑦, 𝑦)̂ + 𝜆‖𝑞‖𝑝 , 𝑞 = (𝑞0 , 𝑞1 , … , 𝑞𝑚 ), 𝜆 ∈ [0, ∞)

where the term ‖𝑞‖𝑝 is responsible for keeping the coeﬃents of the polynomial 𝑓(𝑥) small, and 𝜆 controlls the strength
of regularization. Usually, 𝑝 is either 1 or 2, but other values from [1, ∞) are also valid.
Example 2. Let’s stay in ℝ𝑛 for a bit more! The so-called ∞-norm is deﬁned by

‖𝑥‖∞ = max{|𝑥1 |, … , |𝑥𝑛 |}.

Showing that ‖ ⋅ ‖∞ is indeed a norm is a simple task and left to the reader for practice. (This is perhaps one of the most
notorious sentences written in mathematical textbooks, but trust me, this is truly easy. Give it a shot! If you don’t see it,
try the special case ℝ2 .)
This is called the ∞-norm, and is strongly related to the 𝑝-norm that we have just seen. In fact, if we let the value 𝑝 grow
inﬁnitely, ‖𝑥‖𝑝 will be very close to ‖𝑥‖∞ , ultimately reaching it in the limit.

Remark 3.1.1 (The ∞-norm as the limit of 𝑝-norm.)

If you are already familiar with convergent sequences and limits, you can see that this is called the ∞-norm because
lim𝑝→∞ ‖𝑥‖𝑝 = ‖𝑥‖∞ . To show this, consider that
𝑛 1/𝑝
lim ‖𝑥‖𝑝 = lim ( ∑ |𝑥𝑖 |𝑝 )
𝑝→∞ 𝑝→∞
𝑖=1
𝑛 𝑝 1/𝑝
|𝑥𝑖 |
= lim ‖𝑥‖∞ ( ∑ ( ) ) .
𝑝→∞
𝑖=1
‖𝑥‖∞

|𝑥𝑖 |
Since ‖𝑥‖∞ ≤ 1 by deﬁnition,

𝑛 𝑝 1/𝑝
|𝑥𝑖 |
1 ≤ (∑( ) ) ≤ 𝑛1/𝑝 ,
𝑖=1
‖𝑥‖∞

4.1. Norms and distances 51

Mathematics of Machine Learning

holds. Because lim𝑝→∞ 𝑛1/𝑝 = 1, we can conclude that lim𝑝→∞ ‖𝑥‖𝑝 = ‖𝑥‖∞ . This is the reason why the ∞-norm is
considered a 𝑝-norm with 𝑝 = ∞.
If you are not familiar with taking limits of sequences, don’t worry. We’ll cover everything in detail when studying
single-variable calculus.

Example 3. ∞-norms can be generalized for function spaces. Remember 𝐶([0, 1]), the vector space of functions con-
tinuous on [0, 1]? We introduced this when talking about examples of vector spaces. There, ‖ ⋅ ‖∞ can be deﬁned as

‖𝑓‖∞ = sup |𝑓(𝑥)|.

𝑥∈[0,1]

This norm can be deﬁned on other function spaces, like 𝐶(ℝ), the space of continuous real functions. Since the maximum
is not guaranteed to exist (like for the sigmoid function in 𝐶(ℝ)), the maximum is replaced with supremum. Hence, the
∞-norm is often called the supremum norm.
If you imagine the function as a landscape, the supremum norm is the height of the highest peak or the depth of the
deepest trench. (Whichever is larger in absolute value.)

Fig. 4.6: The supremum norm

When encountering this norm for the ﬁrst time, it might seem challenging to understand what this has to do with any
notion of magnitude. However, ‖𝑓 − 𝑔‖∞ is a natural way to measure the distance between two functions 𝑓 and 𝑔, and in
general, magnitude is just the distance from 0.
(subsection:linear-algebra/normed-spaces/unit-spheres)

52 Chapter 4. The geometric structure of vector spaces: measuring distances

Mathematics of Machine Learning

Fig. 4.7: The distance between two functions, given by the supremum norm.

4.1.1 The unit sphere of norms

For each norm, the unit sphere 𝑆 = {𝑥 ∶ ‖𝑥‖ = 1} plays a special role. Not only does every norm uniquely determine its
unit sphere, but the other way around as well: given a sphere, a corresponding norm can be constructed.
To be more precise, if you give me a 𝑆 ⊆ 𝑉 that contains 0, is bounded, strictly convex, and symmetric set, I can
construct you the norm for which this is the unit sphere. We are not going to prove this, but it helps to illustrate that
norms essentially deﬁne a “geometry” on vector spaces.
We can visualize this in ℝ2 . (In two dimensions, spheres are called circles, so we’ll refer to them as such.)

Fig. 4.8: Unit circles of the 𝑝 = 1, 2, ∞ norms in ℝ2 .

4.1. Norms and distances 53

Mathematics of Machine Learning

4.2 Deﬁning distances from norms

Besides measuring the magnitude of vectors, we are also interested in measuring the distance between them. If you are
at the location 𝑥 in some normed space, how far is 𝑦? In normed vector spaces, we can deﬁne the distance between any
𝑥 and 𝑦 by

𝑑(𝑥, 𝑦) = ‖𝑥 − 𝑦‖.

This is called the norm-induced metric. Thus, norms measure the distance from the zero vector, and the metric 𝑑 measures
the norm of the diﬀerence.
In general, we say that a function 𝑑 ∶ 𝑉 × 𝑉 → [0, ∞) is a metric if the following hold.

Deﬁnition 3.2.1 (Metrics.)

Let 𝑉 be a vector space and 𝑑 ∶ 𝑉 × 𝑉 → [0, ∞) be a function. 𝑑 is a metric if the following conditions hold for all
𝑥, 𝑦, 𝑧 ∈ 𝑉 :
(a) whenever 𝑑(𝑥, 𝑦) = 0, we have 𝑥 = 𝑦 (positive deﬁniteness),
(b) 𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥) (symmetry),
(c) 𝑑(𝑥, 𝑧) ≤ 𝑑(𝑥, 𝑦) + 𝑑(𝑦, 𝑧) (triangle inequality).

Given the properties of norms, we can quickly check that 𝑑(𝑥, 𝑦) = ‖𝑥 − 𝑦‖ is indeed a metric. Due to the linear structure
of vector spaces, the norm-generated metric is invariant to translation. That is, for any 𝑥, 𝑦, 𝑧 ∈ 𝑉 , we have

𝑑(𝑥, 𝑦) = 𝑑(𝑥 + 𝑧, 𝑦 + 𝑧).

In other words, it doesn’t matter where you start: the distance only depends on your displacement. This is not true for any
metric. Thus, norm-induced metrics are special. In our studies, we only deal with these special cases. Because of this,
we won’t even talk about metrics, just norms.

4.3 Conclusion

In itself, a vector space is just a skeleton that provides a way to represent data. On top of this, norms deﬁne a geometric
structure that reveals properties such as magnitude and distance. Both of these are essential in machine learning. For
instance, some unsupervised learning algorithms separate data points into clusters based on their mutual distances from
each other.
There is yet another way to enhance the geometric structure of vector spaces: inner products, also called dot products.
We are going to put this concept under our magnifying glass in the next section.

4.4 Problems

Problem 1. Let 𝑉 be a vector space and deﬁne the function 𝑑 ∶ 𝑉 × 𝑉 → [0, ∞) by

0 if 𝑥 = 𝑦,
𝑑(𝑥, 𝑦) = {
1 otherwise.

(a) Show that 𝑑 is a metric.

(b) Show that 𝑑 cannot come from a norm.

54 Chapter 4. The geometric structure of vector spaces: measuring distances

Mathematics of Machine Learning

Problem 2. Let 𝑆𝑛 be the set of all ASCII strings of 𝑛 character length and define the Hamming distance ℎ(𝑥, 𝑦) for any
two 𝑥, 𝑦 ∈ 𝑆𝑛 by the number of corresponding positions where 𝑥 and 𝑦 are different.
For instance,
ℎ("001101", "101110") = 2,
ℎ("metal", "petal") = 1.
Show that ℎ satisfies the three defining properties of a metric. (Note that 𝑆𝑛 is not a vector space, so technically, the
Hamming distance is not a metric.)
Problem 3. Let ‖ ⋅ ‖ be a norm on the vector space ℝ𝑛 , and define the mapping 𝑓 ∶ ℝ𝑛 → ℝ𝑛 ,

𝑓 ∶ (𝑥1 , 𝑥2 , … , 𝑥𝑛 ) ↦ (𝑥1 , 2𝑥2 , … , 𝑛𝑥𝑛 ).

Show that

‖𝑥‖∗ ∶= ‖𝑓(𝑥)‖

is a norm on ℝ𝑛 .

4.4. Problems 55
Mathematics of Machine Learning

56 Chapter 4. The geometric structure of vector spaces: measuring distances

CHAPTER

FIVE

INNER PRODUCTS, ANGLES, AND LOTS OF REASONS TO CARE

ABOUT THEM

In the previous chapter, we imbued our vector spaces with norms, measuring the magnitude of vectors and the distance
between points. In machine learning, these concepts can be used, for instance, to identify clusters in unlabeled datasets.
However, without context, distance is often not enough. Following our geometric intuition, we can aspire to measure the
similarity of data points. This is done by the inner product. (Also known as the dot product.)
You can recall the inner product as a quantity that we used to measure the angle between two vectors in high school
geometry classes. Given two vectors 𝑥 = (𝑥1 , 𝑥2 ), 𝑦 = (𝑦1 , 𝑦2 ) from the plane, we deﬁned their inner product by

⟨𝑥, 𝑦⟩ = 𝑥1 𝑦1 + 𝑥2 𝑦2 ,

for which it can be shown that

⟨𝑥, 𝑦⟩ = ‖𝑥‖‖𝑦‖ cos 𝛼 (5.1)

holds, where 𝛼 is the angle between 𝑥 and 𝑦. (In fact, there are two such angles, but their cosine is equal.) Thus, the angle
itself can be extracted by

⟨𝑥, 𝑦⟩
𝛼 = arccos .
‖𝑥‖‖𝑦‖

We can use the inner products to determine if two vectors are orthogonal, as this happens if and only if ⟨𝑥, 𝑦⟩ = 0 holds.
During our earlier encounters with mathematics, geometric intuition (such as orthogonality) came ﬁrst, on which we built
tools such as the inner product. However, if we zoom out and take an abstract viewpoint, things are exactly the opposite.
As we’ll see soon, inner products emerge quite naturally, giving rise to the general concept of orthogonality.
In general, this is the formal deﬁnition of an inner product.

Deﬁnition 4.1 (Inner products and inner product spaces.)

Let 𝑉 be a real vector space. The function ⟨⋅, ⋅⟩ ∶ 𝑉 × 𝑉 → ℝ is called an inner product if the following holds for all
𝑥, 𝑦, 𝑧 ∈ 𝑉 and 𝑎 ∈ ℝ:
(a) ⟨𝑎𝑥 + 𝑦, 𝑧⟩ = 𝑎⟨𝑥, 𝑧⟩ + ⟨𝑦, 𝑧⟩ (linearity of the ﬁrst variable),
(b) ⟨𝑥, 𝑦⟩ = ⟨𝑦, 𝑥⟩ (symmetry),
(c) ⟨𝑥, 𝑥⟩ > 0 for all 𝑥 ≠ 0 (positive deﬁniteness).
Vector spaces with an inner product are called inner product spaces.

Right oﬀ the bat, we can immediately deduce two properties. First,

⟨0, 𝑥⟩ = ⟨0𝑥, 𝑥⟩ = 0⟨𝑥, 𝑥⟩ = 0. (5.2)

57
Mathematics of Machine Learning

As a special case, ⟨0, 0⟩ = 0. Just like we have seen for norms, a bit more is true: if ⟨𝑥, 𝑥⟩ = 0, then 𝑥 = 0. This follows
from positive definiteness and (5.2).
In addition, due to symmetry and the linearity of the first variable, inner products are also linear in the second variable.
Because of this, they are called bilinear.
To familiarize ourselves with the concept, let’s see some examples!
Example 1. As usual, the canonical and most prevalent example of inner product spaces is ℝ𝑛 , where the inner product
⟨⋅, ⋅⟩ is defined by
𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑥𝑖 𝑦𝑖 , 𝑥 = (𝑥1 , … , 𝑥𝑛 ), 𝑦 = (𝑦1 , … , 𝑦𝑛 ).
𝑖=1

This bilinear function is often called the dot product. Equipped with this, ℝ𝑛 is called the n-dimensional Euclidean space.
This is a central concept in machine learning, as data is most frequently represented in Euclidean spaces. Thus, we are
going to explore the structure of this space in great detail throughout this book.
Example 2. Besides Euclidean spaces, there are other inner product spaces that play a signiﬁcant role in mathematics and
machine learning. If you are familiar with integration, in certain function spaces the bilinear function
∞
⟨𝑓, 𝑔⟩ = ∫ 𝑓(𝑥)𝑔(𝑥)𝑑𝑥
−∞

defines an inner product space with a very rich and beautiful structure.
The symmetry and linearity of ⟨𝑓, 𝑔⟩ is clear. Only the positive definiteness seems to be an issue. For instance, if 𝑓 is
defined by

1 if 𝑥 = 0,
𝑓(𝑥) = {
0 otherwise,
then 𝑓 ≠ 0, but ⟨𝑓, 𝑓⟩ = 0. This problem can be circumvented by “overloading” the equality operator and letting 𝑓 = 𝑔
∞
if and only if ∫−∞ |𝑓(𝑥) − 𝑔(𝑥)|2 𝑑𝑥 = 0. Even though function spaces such as this play an important role in mathematics
and machine learning, their study falls outside of our scope.

5.1 The generated norm

1/2
𝑛
Recall that the 2-norm in ℝ𝑛 was defined by ‖𝑥‖2 = ( ∑𝑖=1 𝑥2𝑖 ) , which, according to our definition of the inner
product there, equals to √⟨𝑥, 𝑥⟩. This is not a coincidence. Inner products can be used to define norms on vector spaces.
To show exactly how, we need a simple tool: the Cauchy-Schwarz inequality.

Theorem 4.1.1 (Cauchy-Schwarz inequality)

Let 𝑉 be an inner product space. Then for any 𝑥, 𝑦 ∈ 𝑉 , the inequality*
|⟨𝑥, 𝑦⟩|2 ≤ ⟨𝑥, 𝑥⟩⟨𝑦, 𝑦⟩

holds.

Proof. At this point, we don’t know much about the inner product except its core deﬁning properties. So, we are going to
use a little trick. For any 𝜆 ∈ ℝ, the positive deﬁniteness implies that ⟨𝑥 + 𝜆𝑦, 𝑥 + 𝜆𝑦⟩ ≥ 0. On the other hand, because
of the bilinearity (that is, linearity in both variables) and symmetry, we have
⟨𝑥 + 𝜆𝑦, 𝑥 + 𝜆𝑦⟩ = ⟨𝑥, 𝑥⟩ + 2𝜆⟨𝑥, 𝑦⟩ + 𝜆2 ⟨𝑦, 𝑦⟩, (5.3)

58 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning

which is a quadratic polynomial in 𝜆. In general, we know that for any quadratic polynomial of the form 𝑎𝑥2 + 𝑏𝑥 + 𝑐,
the roots are given by the formula
√
−𝑏 ± 𝑏2 − 4𝑎𝑐
𝑥1,2 = .
2𝑎
Since ⟨𝑥 + 𝜆𝑦, 𝑥 + 𝜆𝑦⟩ ≥ 0, the polynomial deﬁned by (5.3) must have at most one real root. Thus, the discriminant
𝑏2 − 4𝑎𝑐 is negative or zero. Plugging in the coeﬃcients of (5.3) into the discriminant formula implies
|⟨𝑥, 𝑦⟩|2 − ⟨𝑥, 𝑥⟩⟨𝑦, 𝑦⟩ ≤ 0,
which is what we had to show. □

The Cauchy-Schwarz inequality is probably one of the most useful tools in studying inner product spaces. One application
we are going to see next is to show how inner products deﬁne norms.

Theorem 4.1.2 (The norm generated by the inner product.)

Let 𝑉 be an inner product space. Then the function ‖ ⋅ ‖ ∶ 𝑉 → [0, ∞) deﬁned by

‖𝑥‖ = √⟨𝑥, 𝑥⟩
is a norm on 𝑉 .

Proof. According to the definition of norms, we have to show that three properties hold: positive definiteness, homo-
geneity, and the triangle inequality. The first two follow easily from the same properties of inner products. The triangle
inequality follows from the Cauchy-Schwarz inequality:
‖𝑥 + 𝑦‖2 = ⟨𝑥 + 𝑦, 𝑥 + 𝑦⟩
= ‖𝑥‖2 + ‖𝑦‖2 + 2⟨𝑥, 𝑦⟩
≤ ‖𝑥‖2 + ‖𝑦‖2 + 2‖𝑥‖‖𝑦‖
2
= (‖𝑥‖ + ‖𝑦‖) ,
from which the triangle inequality follows. □

Thus, inner product spaces are normed spaces as well. They have the algebraic and geometric structure we need to
represent, manipulate, and transform data.
Most importantly, Theorem 4.1.2 can be reversed! That is, given a norm ‖ ⋅ ‖, we can deﬁne a matching inner product.

Theorem 4.1.3 (The polarization identity.)

Let 𝑉 be an inner product space, and ‖ ⋅ ‖ be the generated norm. Then
1
⟨𝑥, 𝑦⟩ = (‖𝑥 + 𝑦‖2 − ‖𝑥‖2 − ‖𝑦‖2 ). (5.4)
2

In other words, one can generate an inner product from a norm, not just the other way around.

Proof. As the inner product is bilinear, we have

⟨𝑥 + 𝑦, 𝑥 + 𝑦⟩ = ⟨𝑥, 𝑥⟩ + 2⟨𝑥, 𝑦⟩ + ⟨𝑦, 𝑦⟩,
from which (5.4) follows.

5.1. The generated norm 59

Mathematics of Machine Learning

5.2 Orthogonality

In vector spaces other than ℝ2 , the concept of orthogonality and angles are not clear at all. For instance, in spaces where
vectors are functions, there is no intuitive way to deﬁne the angles between two function. However, as the formula (5.1)
suggests in the special case ℝ2 , these can be generalized.

Deﬁnition 4.2.1 (Orthogonality of vectors.)

Let 𝑉 be an inner product space and let 𝑥, 𝑦 ∈ 𝑉 . We say that 𝑥 and 𝑦 are orthogonal if ⟨𝑥, 𝑦⟩ = 0. Orthogonality is
denoted with 𝑥 ⟂ 𝑦.

To illustrate how inner products and orthogonality deﬁne geometry on vector spaces, let’s see how the classic Pythagorean
theorem looks in this new form. Recall that the “original” version states that in right triangles, 𝑎2 + 𝑏2 = 𝑐2 , where 𝑐 is
the length of the hypotenuse, while 𝑎 and 𝑏 are the lengths of the other two sides.
In inner product spaces, this generalizes to the following way.

Theorem 4.2.1 (The Pythagorean theorem.)

Let 𝑉 be an inner product space and let 𝑥, 𝑦 ∈ 𝑉 . Then, 𝑥 and 𝑦 are orthogonal if and only if

⟨𝑥 + 𝑦, 𝑥 + 𝑦⟩ = ⟨𝑥, 𝑥⟩ + ⟨𝑦, 𝑦⟩. (5.5)

Proof. Given the deﬁnition of inner products and orthogonality, the proof is trivial. Due to the bilinearity, we have

⟨𝑥 + 𝑦, 𝑥 + 𝑦⟩ = ⟨𝑥, 𝑥 + 𝑦⟩ + ⟨𝑦, 𝑥 + 𝑦⟩
= ⟨𝑥, 𝑥⟩ + 2⟨𝑥, 𝑦⟩ + ⟨𝑦, 𝑦⟩,

from which result immediately follows. □

Why is this the Pythagorean theorem in another form? Because the norm and the inner product is related by ⟨𝑥, 𝑥⟩ = ‖𝑥‖2 ,
the equation (5.5) is equivalent to

‖𝑥 + 𝑦‖2 = ‖𝑥‖2 + ‖𝑦‖2 ,

which is exactly the famous “𝑎2 + 𝑏2 = 𝑐2 ”.

5.3 The geometric interpretation of inner products

By looking at the general deﬁnition, it is hard to get an insight into what an inner product does. However, by using the
concept of orthogonality, we can visualize what does ⟨𝑥, 𝑦⟩ represent for any 𝑥 and 𝑦.
Intuitively, any 𝑥 can be decomposed into the sum of two vectors 𝑥𝑜 + 𝑥𝑝 , where 𝑥𝑜 is orthogonal to 𝑦 and 𝑥𝑝 is parallel
to it.
How can we ﬁnd 𝑥𝑝 and 𝑥𝑜 ? Since 𝑥𝑝 has the same direction as 𝑦, it can be written in the form 𝑥𝑝 = 𝑐𝑦 for some 𝑐.
Because 𝑥𝑝 and 𝑥𝑜 sum up to 𝑥, we also have 𝑥𝑜 = 𝑥 − 𝑥𝑝 = 𝑥 − 𝑐𝑦.
Since 𝑥𝑜 is orthogonal to 𝑦, the constant 𝑐 can be determined by solving the equation

⟨𝑥 − 𝑐𝑦, 𝑦⟩ = 0.

60 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning

Fig. 5.1: Decomposition of 𝑥 into components parallel and orthogonal to 𝑦.

By using the bilinearity of the inner product, we can express 𝑐 from this equation. Thus, we have
⟨𝑥, 𝑦⟩
𝑐= .
⟨𝑦, 𝑦⟩
So,
⟨𝑥, 𝑦⟩
𝑥𝑝 = 𝑦,
⟨𝑦, 𝑦⟩
(5.6)
⟨𝑥, 𝑦⟩
𝑥𝑜 = 𝑥 − 𝑦.
⟨𝑦, 𝑦⟩
We call 𝑥𝑝 the orthogonal projection of 𝑥 onto 𝑦. This is a common transformation, so we are going to introduce the
notation
⟨𝑥, 𝑦⟩
proj𝑦 (𝑥) = 𝑦. (5.7)
⟨𝑦, 𝑦⟩
From this, we can see that the scaling ratio between 𝑦 and proj𝑦 (𝑥) can be described by inner products.

5.3.1 The angle enclosed by two vectors

So far, we have seen that we can use inner products to deﬁne the orthogonality relation between two vectors. Can we use
it to measure (and in some cases, even deﬁne) the angle? The answer is yes! In the following, we are going to see how,
arriving at the formula (5.1) already familiar from basic geometry.
To build our intuition, let’s select two arbitrary 𝑛-dimensional vectors 𝑥, 𝑦 ∈ ℝ𝑛 . The inner product of the sum 𝑥 + 𝑦 can
be calculated using the bilinearity property.
With this, we obtain that
⟨𝑥 + 𝑦, 𝑥 + 𝑦⟩ = ‖𝑥 + 𝑦‖2
(5.8)
= ‖𝑥‖2 + ‖𝑦‖2 + 2⟨𝑥, 𝑦⟩.

5.3. The geometric interpretation of inner products 61

Mathematics of Machine Learning

Fig. 5.2: The sum of 𝑥 and 𝑦.

On the other hand, considering that 𝑥, 𝑦, and 𝑥+𝑦 form a triangle, we can use the law of cosines to express ⟨𝑥+𝑦, 𝑥+𝑦⟩ =
‖𝑥 + 𝑦‖2 in a diﬀerent form.
Here, the law of cosines imply

‖𝑥 + 𝑦‖2 = ‖𝑥‖2 + ‖𝑦‖2 − 2‖𝑥‖‖𝑦‖ ⏟

cos(𝜋
⏟⏟ −⏟𝛼)
⏟. (5.9)
=− cos 𝛼

By combining (5.8) and (5.9), we get

⟨𝑥, 𝑦⟩ = ‖𝑥‖‖𝑦‖ cos 𝛼.

That is, in ℝ𝑛 , the angle enclosed by 𝑥 and 𝑦 can be extracted by

⟨𝑥, 𝑦⟩
𝛼 = arccos . (5.10)
‖𝑥‖‖𝑦‖
What about vector spaces where the angle between vectors is not deﬁned? We have seen instances of vector spaces where
the elements are polynomials, functions, and other mathematical objects. There, (5.10) can be used to deﬁne the angle!
In the following, we will explore this idea a bit further and see how to use inner products to measure similarity.

5.3.2 Measuring similarity with inner products

Given our geometric interpretation of inner products as orthogonal projections, let’s focus on the case when both 𝑥 and 𝑦
have unit norms. In this special case, the orthogonal projection equals to

proj𝑦 (𝑥) = ⟨𝑥, 𝑦⟩𝑦 (‖𝑥‖ = ‖𝑦‖ = 1).

Thus, ⟨𝑥, 𝑦⟩ precisely describes the signed magnitude of the orthogonal projection. (It can be negative when proj𝑦 (𝑥)
and 𝑦 have an opposite direction.)

62 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning

Fig. 5.3: The triangle formed by 𝑥, 𝑦, and 𝑥 + 𝑦.

With this in mind, we can see that the inner product equals to the cosine of the angle enclosed by the two vectors. Let’s
draw a picture to illustrate! (Recall that in right triangles, the cosine is the ratio of the length of the adjacent side and the
hypotenuse. In this case, the adjacent side has a length of ⟨𝑥, 𝑦⟩, while the hypotenuse is of unit length.)
In machine learning, this quantity is frequently used to measure the similarity of two vectors.
Because any vector 𝑥 can be scaled to unit norm with the transformation 𝑥 ↦ 𝑥/‖𝑥‖, we deﬁne the cosine similarity by

𝑥 𝑦
cos(𝑥, 𝑦) = ⟨ , ⟩. (5.11)
‖𝑥‖ ‖𝑦‖

If 𝑥 and 𝑦 represent the feature vectors of two data samples, cos(𝑥, 𝑦) tells us how much the features move together. Note
that because of the scaling, two samples with a high cosine similarity can be far from each other. So, this reveals nothing
about their relative positions in the feature space.

5.4 Orthogonal and orthonormal bases

Through the lenses of similarity, orthogonality means that one vector does not contain “information” about the other. We
will make this notion more precise when learning about correlation, but there are clear implications regarding the structure
of inner product spaces. Recall that with our introduction of basis vectors, our motivation was to ﬁnd a minimal set of
vectors that can be used to express any other vector. With the introduction of orthogonality, we can go a step further.

Deﬁnition 4.4.1 (Orthogonal and orthonormal bases.)

Let 𝑉 a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } its basis. We say that 𝑆 is an orthogonal basis if ⟨𝑣𝑖 , 𝑣𝑗 ⟩ = 0 whenever 𝑖 ≠ 𝑗.

5.4. Orthogonal and orthonormal bases 63

Mathematics of Machine Learning

Fig. 5.4: The inner product of two unit vectors equals the cosine of their angle.

Moreover, 𝑆 is called orthonormal if

1 if 𝑖 = 𝑗,
⟨𝑣𝑖 , 𝑣𝑗 ⟩ = {
0 if 𝑖 ≠ 𝑗.
In other words, 𝑆 is orthonormal if in addition to being orthogonal, each vector has unit norm.

Orthogonal and orthonormal bases are extremely convenient to use. If a basis is orthogonal, we can easily obtain an
orthonormal basis by simply scaling its vectors to unit norm. Thus, we’ll use orthonormal basis vectors most of the time.
Why do we love orthonormal bases so much? To see this, let {𝑣1 , … , 𝑣𝑛 } be an arbitrary basis and 𝑥 be an arbitrary
𝑛
vector. We know that 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 , but how do we ﬁnd the coeﬃcients 𝑥𝑖 ? There is a general method involving linear
equations that we’ll see later, but if {𝑣𝑖 }𝑛𝑖=1 is orthonormal, the situation is much simpler.
This is made more precise in the following theorem.

Theorem 4.4.1
Let 𝑉 be a vector space and 𝑆 = {𝑣1 , … , 𝑣𝑛 } be an orthonormal basis of 𝑉 . Then, for any 𝑥 ∈ 𝑉 ,
𝑛
𝑥 = ∑⟨𝑥, 𝑣𝑖 ⟩𝑣𝑖 (5.12)
𝑖=1

holds.

𝑛
Proof. Because 𝑣1 , … , 𝑣𝑛 is a basis, 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 for some scalars 𝑥𝑖 . However, due to the linearity of the inner
product,
𝑛 𝑛
⟨𝑥, 𝑣𝑗 ⟩ = ⟨ ∑ 𝑥𝑖 𝑣𝑖 , 𝑣𝑗 ⟩ = ∑ 𝑥𝑖 ⟨𝑣𝑖 , 𝑣𝑗 ⟩ = 𝑥𝑗
𝑖=1 𝑖=1

64 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning

holds for all 𝑗-s. □

Thus, the coeﬃcients can be calculated by taking the inner product. In other words, for orthonormal bases, 𝑥𝑗 depends
only on the 𝑗-th basis vector.
As another consequence of the orthonormality, calculating the norm is also easier, as we can always express it in terms of
the coeﬃcients. To be more precise, we have
𝑛 𝑛
‖𝑥‖2 = ⟨𝑥, 𝑥⟩ = ⟨ ∑ 𝑥𝑖 𝑣𝑖 , ∑ 𝑥𝑗 𝑣𝑗 ⟩
𝑖=1 𝑗=1
𝑛 𝑛
= ∑ ∑ 𝑥𝑖 𝑥𝑗 ⟨𝑣𝑖 , 𝑣𝑗 ⟩ (5.13)
𝑖=1 𝑗=1
𝑛
= ∑ 𝑥2𝑖 .
𝑖=1

This is called Parsival’s identity. So, given 𝑥 in terms of an orthonormal basis, its norm is easy to ﬁnd. It is not a coincident
that this formula resembles the Euclidean norm so much! (Note that here, ‖ ⋅ ‖ is a general norm.) In fact, the squared
Euclidean norm
𝑛
‖𝑥‖22 = ∑ 𝑥2𝑖 , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛
𝑖=1

is just (5.13) using the standard basis.

5.4.1 The Gram-Schmidt orthogonalization process

Orthogonal bases are awesome and all, but how do we ﬁnd them?
There is a general method called the Gram-Schmidt orthogonalization process that solves this problem. The algorithm
takes any set of basis vectors {𝑣1 , … , 𝑣𝑛 } and outputs an orthonormal basis {𝑒1 , … , 𝑒𝑛 } such that

span(𝑣1 , … , 𝑣𝑘 ) = span(𝑒1 , … , 𝑒𝑘 ), 𝑘 = 1, … , 𝑛,

that is, the subspaces generated by the first 𝑘 vectors of both sets match.
How to do that? The process is straightforward. Let’s focus on finding an orthogonal system first, which we can normalize
later to achieve orthonormality. We are going to build our set {𝑒1 , … , 𝑒𝑛 } iteratively. It is clear that

𝑒1 ∶= 𝑣1

is a good choice. Now, our goal is to ﬁnd 𝑒2 such that 𝑒2 ⟂ 𝑒1 and together, they span the same subspace as {𝑣1 , 𝑣2 }.
Remember when we talked about the geometric interpretation of orthogonality? The orthogonal component of 𝑣2 with
respect to 𝑒1 will be a good choice for 𝑒2 . Thus, let
𝑒2 ∶ = 𝑣2 − proj𝑒 (𝑣2 )
1

⟨𝑣 , 𝑒 ⟩
= 𝑣2 − 2 1 𝑒1 .
⟨𝑒1 , 𝑒1 ⟩
From the definition, it is clear that 𝑒2 ⟂ 𝑒1 , and it is also clear that {𝑒1 , 𝑒2 } spans the same subspace as {𝑣1 , 𝑣2 }.
In the next step, we perform the same process. We project 𝑣3 onto the subspace generated by 𝑒1 and 𝑒2 , then define 𝑒3
as the difference of 𝑣3 and the projection . That is,
⟨𝑣3 , 𝑒1 ⟩ ⟨𝑣 , 𝑒 ⟩
𝑒3 ∶ = 𝑣3 − 𝑒1 − 3 2 𝑒2
⟨𝑒1 , 𝑒1 ⟩ ⟨𝑒2 , 𝑒2 ⟩
= 𝑣3 − proj𝑒 ,𝑒 (𝑣3 ).
1 2

5.4. Orthogonal and orthonormal bases 65

Mathematics of Machine Learning

With this, we essentially remove the “contributions” of 𝑒1 and 𝑒2 towards 𝑣3 , thus obtaining an 𝑒3 that is orthogonal to
the previous ones.
In general, if we have 𝑒1 , … , 𝑒𝑘 , the vector 𝑒𝑘+1 can be found by

𝑒𝑘+1 ∶= 𝑣𝑘+1 − proj𝑒 (𝑣𝑘+1 ),

1 ,…,𝑒𝑘

where
𝑘
⟨𝑥, 𝑒𝑖 ⟩
proj𝑒 (𝑥) = ∑ 𝑒 (5.14)
1 ,…,𝑒𝑘
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩ 𝑖

is the generalized orthogonal projection operator, projecting a vector to the subspace generated by {𝑒1 , … , 𝑒𝑘 }. To check
that 𝑒𝑘+1 ⟂ 𝑒1 , … , 𝑒𝑘 , we have

𝑘
⟨𝑣𝑘+1 , 𝑒𝑖 ⟩
⟨𝑒𝑘+1 , 𝑒𝑗 ⟩ = ⟨𝑣𝑘+1 − ∑ 𝑒 ,𝑒 ⟩
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩ 𝑖 𝑗
𝑘
⟨𝑣𝑘+1 , 𝑒𝑖 ⟩
= ⟨𝑣𝑘+1 , 𝑒𝑗 ⟩ − ∑ ⟨𝑒𝑖 , 𝑒𝑗 ⟩
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩
= ⟨𝑣𝑘+1 , 𝑒𝑗 ⟩ − ⟨𝑣𝑘+1 , 𝑒𝑗 ⟩
= 0,

due to the orthogonality of the 𝑒𝑖 -s and the linearity of the inner product. Since {𝑒1 , … , 𝑒𝑘 } spans the same subspace as
{𝑣1 , … , 𝑣𝑘 } and 𝑒𝑘+1 is a linear combination of 𝑣𝑘+1 and 𝑒1 , … , 𝑒𝑘 (where the coeﬃcient of 𝑣𝑘+1 is nonzero),

span(𝑣1 , … , 𝑣𝑘+1 ) = span(𝑒1 , … , 𝑒𝑘+1 )

also follows.
This can be repeated until we run out of vectors and ﬁnd {𝑒1 , … , 𝑒𝑛 }.
For the sake of further reference, mathematical correctness, and a tiny bit of OCD, let’s summarize the all the above in a
single theorem.

Theorem 4.4.2 (Gram-Schmidt orthogonalization process.)

Let 𝑉 be an inner product vector space and {𝑣1 , … , 𝑣𝑛 } ⊆ 𝑉 be a set of linearly independent vectors. Then there exists
an orthonormal set {𝑒1 , … , 𝑒𝑛 } ⊆ 𝑉 such that

span(𝑒1 , … , 𝑒𝑘 ) = span(𝑣1 , … , 𝑣𝑘 )

holds for any 𝑘 = 1, … , 𝑛.

As a consequence, we can state that each ﬁnite inner product space has an orthonormal basis. We can even construct it
explicitly via the Gram-Schmidt process.

Corollary 4.4.1 (Existence of orthonormal bases.)

Let 𝑉 be a ﬁnite-dimensional inner product space. Then there exists an orthonormal basis in 𝑉 .

Going one step further, we can view Theorem 4.4.2 and its proof as an algorithm.

Algorithm 4.4.1 (Gram-Schmidt orthogonalization process)

66 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning

Inputs: a set of linearly independent vectors {𝑣1 , … , 𝑣𝑛 } ⊆ 𝑉 .

Output: a set of orthonormal vectors {𝑒1 , … , 𝑒𝑛 }, such that

span(𝑒1 , … , 𝑒𝑘 ) = span(𝑣1 , … , 𝑣𝑘 )

holds for any 𝑘 = 1, … , 𝑛.

Remark 4.4.1 (Linearly dependent inputs of the Gram-Schmidt process)

What happens if we apply the Gram-Schmidt process to a set of linearly dependent vectors?
To get a grip on the problem, let’s consider a simple case of two vectors: {𝑣1 , 𝑣2 = 𝑐𝑣1 }, where 𝑐 is an arbitrary scalar.
𝑒1 is chosen to be 𝑣1 , and 𝑒2 is deﬁned by
𝑒2 = 𝑣2 − proj𝑒 (𝑣2 )
1

⟨𝑣 , 𝑒 ⟩
= 𝑣 2 − 2 1 𝑒1
⟨𝑒1 , 𝑒1 ⟩
𝑐⟨𝑣1 , 𝑣1 ⟩
= 𝑣2 − 𝑣
⟨𝑣1 , 𝑣1 ⟩ 1
= 𝑣2 − 𝑐𝑣1
= 0.
This is true in the general case as well: when the process encounters an input vector that is linearly dependent from the
previous ones, a zero vector is added to the output.

5.5 The orthogonal complement

Earlier, we have seen that given a ﬁxed vector 𝑦 ∈ 𝑉 , we can decompose any 𝑥 ∈ 𝑉 as 𝑥 = 𝑥𝑜 + 𝑥𝑝 , where 𝑥𝑜 is
orthogonal to 𝑦, while 𝑥𝑝 is parallel to it. (We used this to provide a geometric motivation for inner products.)
This is an essential tool, and in this section, we will see that an analogue of this decomposition still holds true when 𝑦 is
replaced with an arbitrary subspace 𝑆 ⊂ 𝑉 . To see this, let’s talk about the orthogonality of subspaces.

Deﬁnition 4.5.1 (Orthogonal subspaces.)

Let 𝑉 be an arbitrary inner product space. We say that the subspaces 𝑆1 , 𝑆2 ⊆ 𝑉 are orthogonal if for every pair of
𝑥 ∈ 𝑆1 and 𝑦 ∈ 𝑆2 , we have ⟨𝑥, 𝑦⟩ = 0. This is denoted by 𝑆1 ⟂ 𝑆2 .

For example, the 𝑥-axis and the 𝑦-axis are orthogonal subspaces in ℝ2 . (Just as the 𝑥-𝑦 plane and the 𝑧-axis in ℝ3 .)
Similarly, we can talk about the orthogonality of a vector and a subspace: 𝑥 is orthogonal to the subspace 𝑆, 𝑥 ⟂ 𝑆 in
symbols, if 𝑥 is orthogonal to all vectors of 𝑆.
One of the most straightforward and essential ways to construct orthogonal subspaces is to take the orthogonal complement.

Deﬁnition 4.5.2 (Orthogonal complement.)

Let 𝑉 be an arbitrary inner product space and 𝑆 ⊆ 𝑉 its subspace. The set deﬁned by

𝑆 ⟂ ∶= {𝑥 ∈ 𝑉 ∶ 𝑥 ⟂ 𝑆} (5.15)

5.5. The orthogonal complement 67

Mathematics of Machine Learning

is called the orthogonal complement of 𝑆.

𝑆 ⟂ is not just any set, it is a subspace, as we are about to see.

Theorem 4.5.1
Let 𝑉 be an arbitrary inner product space and 𝑆 ⊆ 𝑉 one of its subspaces. 𝑆 ⟂ is orthogonal to 𝑆, and a subspace.
Moreover, 𝑆 ∩ 𝑆 ⟂ = {0}.

Proof. According to the deﬁnition of subspaces, we only have to show that 𝑆 ⟂ is closed with respect to addition and
scalar multiplication. As the inner product is bilinear, this is straightforward:

⟨𝑎𝑥 + 𝑏𝑦, 𝑠⟩ = 𝑎⟨𝑥, 𝑠⟩ + 𝑏⟨𝑦, 𝑠⟩ = 0

holds for any vectors 𝑥, 𝑦 ∈ 𝑆 ⟂ , 𝑠 ∈ 𝑆, and scalars 𝑎, 𝑏.

To see that 𝑆 ∩ 𝑆 ⟂ = {0}, let’s take an arbitrary 𝑥 ∈ 𝑆 ∩ 𝑆 ⟂ By the definition of 𝑆 ⟂ , we have ⟨𝑥, 𝑥⟩ = 0. As the inner
product is positive definite per definition, 𝑥 must be the zero vector 0. □

Recall the decomposition of any 𝑥 ∈ 𝑉 into a parallel and an orthogonal component with respect to a ﬁxed vector 𝑦? In
terms of subspaces, we can restate this as

𝑉 = span(𝑦) + span(𝑦)⟂ ,

that is, 𝑉 can be written as the direct sum of the vector space spanned by 𝑦, and its orthogonal complement. This is an
extremely powerful result, as this allows us to decouple 𝑥 from 𝑦. For instance, if we think about vectors as a collection
of features (just like the sepal and petal width and length measurements in our favourite Iris dataset), 𝑦 can represent a
certain trait that we want to exclude from our analysis.
With the notion of orthogonal complements, we can make this mathematically precise. We can also be more general. In
fact, the decomposition

𝑉 = 𝑆 + 𝑆⟂

holds for any subspace 𝑆! We are going to see at least two proofs for this. One right now, another a bit later when talking
about orthogonal projections.

Theorem 4.5.2
Let 𝑉 be an arbitrary ﬁnite dimensional inner product space and 𝑆 ⊂ 𝑉 its subspace. Then

𝑉 = 𝑆 + 𝑆⟂

holds.

Proof. Let 𝑒1 , … , 𝑒𝑘 ∈ 𝑆 be an orthonormal basis of 𝑆. This is guaranteed to exist, and we can even construct it from
an arbitrary basis using the Gram-Schmidt process.
Like during its proof, we can deﬁne the generalized orthogonal projection (5.14), given by
𝑘
proj𝑒 (𝑥) = ∑⟨𝑥, 𝑒𝑖 ⟩𝑒𝑖 .
1 ,…,𝑒𝑘
𝑖=1

68 Chapter 5. Inner products, angles, and lots of reasons to care about them
Mathematics of Machine Learning

Using this, we can decompose any 𝑥 ∈ 𝑉 as

𝑥 = (𝑥 − proj𝑒 (𝑥)) + proj𝑒 (𝑥) (5.16)

1 ,…,𝑒𝑘 1 ,…,𝑒𝑘

Since proj𝑒 ,…,𝑒 (𝑥) is the linear combination of 𝑒𝑖 ∈ 𝑆-s, it belongs to 𝑆. On the other hand, the bilinearity of the inner
1 𝑘
product gives that 𝑥 − proj𝑒 ,…,𝑒 (𝑥) ∈ 𝑆 ⟂ . Indeed, as we have
1 𝑘

𝑘
⟨𝑥 − proj𝑒 (𝑥), 𝑒𝑗 ⟩ = ⟨𝑥 − ∑⟨𝑥, 𝑒𝑖 ⟩𝑒𝑖 , 𝑒𝑗 ⟩
1 ,…,𝑒𝑘
𝑖=1
𝑘
= ⟨𝑥, 𝑒𝑗 ⟩ − ∑⟨𝑥, 𝑒𝑖 ⟩⟨𝑒𝑖 , 𝑒𝑗 ⟩
𝑖=1
= ⟨𝑥, 𝑒𝑗 ⟩ − ⟨𝑥, 𝑒𝑗 ⟩
= 0,

the vector 𝑥 − proj𝑒 ,…,𝑒 (𝑥) is orthogonal to each 𝑒𝑗 . Thus, since 𝑒1 , … , 𝑒𝑘 is an orthonormal basis of 𝑆, it is also
1 𝑘
orthogonal to 𝑆, hence 𝑥 − proj𝑒 ,…,𝑒 (𝑥) ∈ 𝑆 ⟂ .
1 𝑘

The fact that every 𝑥 ∈ 𝑉 can be decomposed as the sum of a vector from 𝑆 and a vector from 𝑆 ⟂ , as given by (5.16),
means that 𝑉 = 𝑆 + 𝑆 ⟂ , which is what we had to prove. □

5.6 Problems

Problem 1. Let 𝑎1 , … , 𝑎𝑛 > 0 be arbitrary positive numbers. Show that

𝑛
⟨𝑥, 𝑦⟩ ∶= ∑ 𝑎𝑖 𝑥𝑖 𝑦𝑖 , 𝑥, 𝑦 ∈ ℝ𝑛 .
𝑖=1

is an inner product, where 𝑥 = (𝑥1 , … , 𝑥𝑛 ) and 𝑦 = (𝑦1 , … , 𝑦𝑛 ).

Problem 2. Let 𝑉 be a ﬁnite-dimensional inner product space, let 𝑣1 , … , 𝑣𝑛 ∈ 𝑉 be a basis in 𝑉 , and deﬁne

𝑎𝑖,𝑗 ∶= ⟨𝑣𝑖 , 𝑣𝑗 ⟩.

Show that for any 𝑥, 𝑦 ∈ 𝑉 ,

𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑎𝑖,𝑗 𝑥𝑖 𝑥𝑗 ,
𝑖,𝑗=1

𝑛 𝑛
where 𝑥 = ∑𝑖=1 𝑥𝑖 𝑣𝑖 and 𝑦 = ∑𝑖=1 𝑦𝑖 𝑣𝑖 .
Problem 3. Let 𝑉 be a ﬁnite-dimensional real inner product space.
(a) Let 𝑦 ∈ 𝑉 be an arbitrary vector. Show that

𝑓 ∶ 𝑉 → ℝ, 𝑥 ↦ ⟨𝑥, 𝑦⟩

is a linear function. (That is, 𝑓(𝛼𝑢 + 𝛽𝑣) = 𝛼𝑓(𝑥) + 𝛽𝑓(𝑦) holds for all 𝑢, 𝑣 ∈ 𝑉 and 𝛼, 𝛽 ∈ ℝ).
(b) Let 𝑓 ∶ 𝑉 → ℝ an arbitrary linear function. Show that there exists an 𝑦 ∈ 𝑉 such that

𝑓(𝑥) = ⟨𝑥, 𝑦⟩.

5.6. Problems 69
Mathematics of Machine Learning

(Note that (b) is the reverse of (a), and a much more interesting result.)
Problem 4. Let 𝑉 be a real inner product space and let ‖𝑥‖ = √⟨𝑥, 𝑥⟩ be the generated norm. Show that

2‖𝑥‖2 + 2‖𝑦‖2 = ‖𝑥 + 𝑦‖2 + ‖𝑥 − 𝑦‖2 . (5.17)

This is called the parallelogram law, because if we think of 𝑥 and 𝑦 as two the sides determining a parallelogram, (5.17)
relates the length of its sides to the length of its diagonals.
Problem 5. Let 𝑉 be a real inner product space and let 𝑥1 , 𝑥2 ∈ 𝑉 . Show that if

⟨𝑥1 , 𝑦⟩ = ⟨𝑥2 , 𝑦⟩

holds for all 𝑦 ∈ 𝑉 , then 𝑥1 = 𝑥2 .

70 Chapter 5. Inner products, angles, and lots of reasons to care about them
CHAPTER

SIX

THE FIRST STEPS IN COMPUTATIONAL LINEAR ALGEBRA

Now that we started to understand the geometric structure of vector spaces, it’s time to put the theory into practice once
again. In this chapter, we’ll take a hands-down look at norms, inner products, and NumPy array operations in general.
The last time we translated theory to code, we left oﬀ at ﬁnding an ideal representation for vectors: NumPy arrays. Let’s
initialize two instances to play around with.

import numpy as np

x = np.array([1.8, -4.5, 9.2, 7.3])

y = np.array([-5.2, -1.1, 0.7, 5.1])

In linear algebra, and in most of machine learning, almost all operations involve looping through the vector components
one by one. For instance, adding together two vectors can be implemented like this.

def add(x: np.ndarray, y: np.ndarray):

x_plus_y = np.zeros(shape=len(x))

for i in range(len(x_plus_y)):
x_plus_y[i] = x[i] + y[i]

return x_plus_y

add(x, y)

array([-3.4, -5.6, 9.9, 12.4])

Of course, this is far from optimal. (It may not even work if the vectors have diﬀerent dimensions.)
For example, addition is massively parallelizable, and our implementation does not take advantage of that. With two
threads, we can do two additions simultaneously. So, adding together two-dimensional vectors would require just one
step, as one would compute x[0] + y[0], while the other x[1] + y[1]. Raw Python does not have access to such
high-performance computing tools, but NumPy does, through its functions that are implemented in C. In turn, C uses the
LAPACK (Linear Algebra PACKage) library, which makes calls to BLAS (Basic Linear Algebra Subprograms). BLAS
is optimized at the assembly level.
So, whenever it is possible, we should strive to work with vectors in a NumPythonic way. (Yes, I just made up that term.)
For vector addition, this is simply the + operator, as we have seen earlier.

np.equal(x + y, add(x, y))

array([ True, True, True, True])

71
Mathematics of Machine Learning

By the way, you shouldn’t ever compare ﬂoats with the == operator, as internal rounding errors can occur due to the ﬂoat
representation. The example below illustrates this.

1.0 == 0.3*3 + 0.1

False

To compare arrays, NumPy provides the functions np.allclose and np.equal. These compare arrays elementwise,
returning a boolean array. From this, the built-in all function can be used to see if all the elements match.

all(np.equal(x + y, add(x, y)))

True

In the following, we’ll brieﬂy review how to work with NumPy arrays in practice.

6.1 Basic NumPy functions and array methods

At this point, there are two operations that we want to do with our vectors: apply a function elementwise, or take the
sum/product of the elements. Since the +, *, and ** operators are implemented for our arrays, certain functions carry
over from scalars, as the example shows below.

def just_a_quadratic_polynomial(x):
return 3*x**2 + 1

x = np.array([1.8, -4.5, 9.2, 7.3])

just_a_quadratic_polynomial(x)

array([ 10.72, 61.75, 254.92, 160.87])

However, we can’t just plug in ndarray-s to every function. For instance, let’s take a look at Python’s built-in exp from
its math module.

from math import exp

exp(x)

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-a34f98e51bbc> in <module>
1 from math import exp
2
----> 3 exp(x)

TypeError: only size-1 arrays can be converted to Python scalars

To overcome this problem, we could manually apply the function elementwise.

72 Chapter 6. The ﬁrst steps in computational linear algebra

Mathematics of Machine Learning

def naive_exp(x: np.ndarray):

x_exp = np.empty_like(x)

for i in range(len(x)):
x_exp[i] = exp(x[i])

return x_exp

(Recall that np.empty_like(x) creates an uninitialized array that matches the dimensions of x.)

naive_exp(x)

array([6.04964746e+00, 1.11089965e-02, 9.89712906e+03, 1.48029993e+03])

A bit less naive implementation would use comprehensions to achieve the same eﬀect.

def bit_less_naive_exp(x: np.ndarray):

return np.array([exp(x_i) for x_i in x])

bit_less_naive_exp(x)

array([6.04964746e+00, 1.11089965e-02, 9.89712906e+03, 1.48029993e+03])

Even though comprehensions are more concise and readable, they still don’t avoid the core issue: for loops in Python.
This problem is solved by NumPy’s famous ufuncs, that is, “functions that operate element by element on whole arrays”.
Since they are implemented in C, they are blazing fast. For instance, the exponential function 𝑓(𝑥) = 𝑒𝑥 is given by
np.exp.

np.exp(x)

array([6.04964746e+00, 1.11089965e-02, 9.89712906e+03, 1.48029993e+03])

Not surprisingly, the results of our implementations match.

all(np.equal(naive_exp(x), np.exp(x)))

True

all(np.equal(bit_less_naive_exp(x), np.exp(x)))

True

Again, there are more advantages of using NumPy functions and operations than simplicity. In machine learning, we care
a lot about speed, and as we are about to see, NumPy delivers once more.

from timeit import timeit

n_runs = 10000
(continues on next page)

6.1. Basic NumPy functions and array methods 73

Mathematics of Machine Learning

(continued from previous page)

size = 1000

t_naive_exp = timeit(
"np.array([exp(x_i) for x_i in x])",
setup=f"import numpy as np; from math import exp; x = np.ones({size})",
number=n_runs
)

t_numpy_exp = timeit(
"np.exp(x)",
setup=f"import numpy as np; from math import exp; x = np.ones({size})",
number=n_runs
)

print(f"Built-in exponential: \t{t_naive_exp} s")

print(f"NumPy exponential: \t{t_numpy_exp} s")
print(f"Performance improvement: \t{100*t_naive_exp/t_numpy_exp}% speedup")

Built-in exponential: 1.1058482119988184 s

NumPy exponential: 0.05404456199903507 s
Performance improvement: 2046.1785073187616% speedup

For further reference, you can ﬁnd the list of available ufuncs here.

6.1.1 Sum and product, minimum and maximum

What about operations that aggregate the elements and return a single value? Not surprisingly, these can be found within
NumPy as well. For instance, let’s take a look at the sum. In terms of mathematical formulas, we are looking to implement
the function
𝑛
sum(𝑥) = ∑ 𝑥𝑖 , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 .
𝑖=1

A basic approach would be something like this.

def naive_sum(x: np.ndarray):

val = 0

for x_i in x:
val += x_i

return val

naive_sum(x)

13.799999999999999

Alternatively, we can use Python’s built-in summing function.

74 Chapter 6. The ﬁrst steps in computational linear algebra

Mathematics of Machine Learning

sum(x)

13.799999999999999

The story is the same: NumPy can do this better using its own data structures. We can either call the function np.sum,
or use the array method np.ndarray.sum.

np.sum(x)

13.799999999999999

x.sum()

13.799999999999999

Y’know by now that I love timing functions, so let’s compare the performances once more.

t_naive_sum = timeit(
"sum(x)",
setup=f"import numpy as np; x = np.ones({size})",
number=n_runs
)

t_numpy_sum = timeit(
"np.sum(x)",
setup=f"import numpy as np; x = np.ones({size})",
number=n_runs
)

print(f"Built-in sum: \t{t_naive_sum} s")

print(f"NumPy sum: \t{t_numpy_sum} s")
print(f"Performance improvement: \t{100*t_naive_sum/t_numpy_sum}% speedup")

Built-in sum: 0.5536425890004466 s

NumPy sum: 0.029986262999955215 s
Performance improvement: 1846.3207269317736% speedup

Similarly, the product

𝑛
prod(𝑥) = ∏ 𝑥𝑖 , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛
𝑖=1

is implemented by the np.prod function and np.ndarray.prod method.

np.prod(x)

-543.996

On quite a few occasions, we need to ﬁnd the maximum or minimum of an array. We can do this using the np.max and
np.min functions. (Similarly to the others, these are also available as array methods.) The rule of thumb is if you want
to perform any array operation, use NumPy functions.

6.1. Basic NumPy functions and array methods 75

Mathematics of Machine Learning

6.2 Norms and distances

Now that we have reviewed how to perform operations on our vectors eﬃciently, it’s time to dive deep into the really
interesting part: norms and distances.
Let’s start with the most important one: the Euclidean norm, also known as the 2-norm, deﬁned by
𝑛 1/2
‖𝑥‖2 = ( ∑ 𝑥2𝑖 ) , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 .
𝑖=1

A straightforward implementation would be the following.

def euclidean_norm(x: np.ndarray):

return np.sqrt(np.sum(x**2))

Note that our euclidean_norm function is dimension-agnostic, that is, it works for arrays of every dimension.

x = np.array([-3.0, 1.2, 1.2, 2.1]) # a 4-dimensional vector

y = np.array([8.1, 6.3]) # a 2-dimensional vector

euclidean_norm(x)

4.036087214122113

euclidean_norm(y)

10.261578825892242

But wait, didn’t I just mention that we should use NumPy functions whenever possible? Norms are important enough to
have their own functions: np.linalg.norm.

np.linalg.norm(x)

4.036087214122113

With a quick inspection, we can check that these match for our vector x.

np.equal(euclidean_norm(x), np.linalg.norm(x))

True

However, the Euclidean norm is just a special case of 𝑝-norms. Recall that for any 𝑝 ∈ [0, ∞), we deﬁned the 𝑝-norm
by the formula
𝑛 1/𝑝
𝑝
‖𝑥‖𝑝 = ( ∑ |𝑥𝑖 | ) , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 ,
𝑖=1

and

‖𝑥‖∞ = max{|𝑥1 |, … , |𝑥𝑛 |}, 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛

76 Chapter 6. The ﬁrst steps in computational linear algebra

Mathematics of Machine Learning

for 𝑝 = ∞. It is a good practice to keep the number of functions in a codebase minimal to reduce maintenance costs.
Can we compact all 𝑝-norms into a single Python function that takes the value of 𝑝 as an argument? Sure. We only have
a small issue: representing ∞. Python and NumPy both provide their own representations, but we will go with NumPy’s
np.inf. Surprisingly, this is a float type.

type(np.inf)

float

With this, we can do something like the following.

def p_norm(x: np.ndarray, p: float):

if np.isinf(p):
return np.max(np.abs(x))
elif p >= 1:
return (np.sum(np.abs(x)**p))**(1/p)
else:
return ValueError("p must be a float larger or equal than 1.0 or inf.")

Since ∞ can have multiple other representations, such as Python’s built-in math.inf, we can make our function more
robust by using the np.isinf function to check if an object represents ∞ or not.
A quick check shows that p_norm works as intended.

x = np.array([-3.0, 1.2, 1.2, 2.1])

for p in [1, 2, 42, np.inf]:

print(f"p-norm for p = {p}: \t {p_norm(x, p=p)}")

p-norm for p = 1: 7.5

p-norm for p = 2: 4.036087214122113
p-norm for p = 42: 3.0000000222838166
p-norm for p = inf: 3.0

However, once again, NumPy is one step ahead of us. In fact, the familiar np.linalg.norm already does this out of
the box. We can achieve the same with less code by passing the value of 𝑝 as the argument ord, short for order. For
ord = 2, we obtain the good old 2-norm.

for p in [1, 2, 42, np.inf]:

print(f"p-norm for p = {p}: \t {np.linalg.norm(x, ord=p)}")

p-norm for p = 1: 7.5

p-norm for p = 2: 4.036087214122113
p-norm for p = 42: 3.0000000222838166
p-norm for p = inf: 3.0

Somewhat surprisingly, distances don’t have their own NumPy functions. However, as the most common distance metrics
are generated from norms, we can often write our own. For instance, here is the Euclidean distance.

def euclidean_distance(x: np.ndarray, y: np.ndarray):

return np.linalg.norm(x - y, ord=2)

6.2. Norms and distances 77

Mathematics of Machine Learning

6.3 The dot product

Besides norms and distances, the third component that deﬁnes the geometry of our vector spaces is the inner product.
During our journey, we’ll almost exclusively use the dot product, deﬁned in the vector space ℝ𝑛 by
𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑥𝑖 𝑦𝑖 , 𝑥, 𝑦 ∈ ℝ𝑛 .
𝑖=1

By now, you can easily smash out a Python function that calculates this. In principle, the one-liner below should work.

def dot_product(x: np.ndarray, y: np.ndarray):

return np.sum(x*y)

Let’s test this out!

x = np.array([-3.0, 1.2, 1.2, 2.1])

y = np.array([1.9, 2.5, 3.9, 1.2])

dot_product(x, y)

4.5

When the dimension of the vectors doesn’t match, the function throws an exception as we expect.

x = np.array([-3.0, 1.2, 1.2, 2.1])

y = np.array([1.9, 2.5])

dot_product(x, y)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-086e67f8bc3e> in <module>
2 y = np.array([1.9, 2.5])
3
----> 4 dot_product(x, y)

<ipython-input-39-bb4e36bf420e> in dot_product(x, y)
1 def dot_product(x: np.ndarray, y: np.ndarray):
----> 2 return np.sum(x*y)

ValueError: operands could not be broadcast together with shapes (4,) (2,)

However, upon further attempts to break the code, a strange thing occurs. Our function dot_product should fail when
called with an 𝑛-dimensional and a one-dimensional vector, and this is not what happens.

x = np.array([-3.0, 1.2, 1.2, 2.1])

y = np.array([2.0])

dot_product(x, y)

3.0

I always advocate breaking solutions in advance to avoid later surprises, and the above example excellently illustrates the
usefulness of this principle.

78 Chapter 6. The ﬁrst steps in computational linear algebra

Mathematics of Machine Learning

Beyond the scenes, NumPy is doing something called broadcasting. When performing an operation on two arrays with
mismatching shapes, it tries to guess the correct sizes and reshape them so the operation can go through. Check out what
takes place when calculating x*y.

x*y

array([-6. , 2.4, 2.4, 4.2])

NumPy guessed that we want to multiply all elements of x by the scalar y[0], so it transforms y = np.array([2.
0]) into the four-dimensional vector np.array([2.0, 2.0, 2.0, 2.0]), then calculates the elementwise
product.
Broadcasting is extremely useful because it allows us to write much simpler code by automagically performing transfor-
mations. Still, if you are unaware of how and when broadcasting is done, it can seriously bite you in the back. Just like
in our case, as the inner product of a four-dimensional and a one-dimensional vector is not deﬁned.
To avoid writing excessive checks for edge cases (or missing them altogether), we calculate the inner product in practice
using the np.dot function.

x = np.array([-3.0, 1.2, 1.2, 2.1])

y = np.array([1.9, 2.5, 3.9, 1.2])

np.dot(x, y)

4.5

When attempting to call np.dot with misaligned arrays, it fails as supposed to. Even in cases when broadcasting bails
out our custom implementation.

x = np.array([-3.0, 1.2, 1.2, 2.1])

y = np.array([2.0])

np.dot(x, y)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-45-513b3ef24556> in <module>
2 y = np.array([2.0])
3
----> 4 np.dot(x, y)

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: shapes (4,) and (1,) not aligned: 4 (dim 0) != 1 (dim 0)

Now that we have a basic arsenal of array operations and functions, it is time to do something with them!

6.3. The dot product 79

Mathematics of Machine Learning

6.4 The Gram-Schmidt orthogonalization process

One of the most fundamental algorithms in linear algebra is the Gram-Schmidt orthogonalization process, used to turn a
set of linearly independent vectors into an orthonormal set.
To be more precise, for our input of a set of linearly independent vectors 𝑣1 , … , 𝑣𝑛 ∈ ℝ𝑛 , the Gram-Schmidt process
ﬁnds the output set of vectors 𝑒1 , … , 𝑒𝑛 ∈ ℝ𝑛 such that
• ‖𝑒𝑖 ‖ = 1 and ⟨𝑒𝑖 , 𝑒𝑗 ⟩ = 0 for all 𝑖 ≠ 𝑗 (that is, the vectors are orthonormal).
• and span(𝑒1 , … , 𝑒𝑘 ) = span(𝑣1 , … , 𝑣𝑘 ) for all 𝑘 = 1, … , 𝑛.
If you are having trouble recalling how this is done, feel free to revisit the section where we ﬁrst described the algorithm.
The learning process is a spiral, where we keep revisiting old concepts from new perspectives. For the Gram-Schmidt
process, this is our second iteration, where we put the mathematical formulation into code.
Since we are talking about a sequence of vectors, we need a suitable data structure for this purpose vectors. There are
several possibilities for this in Python. For now, we are going with the conceptually simplest, albeit computationally rather
suboptimal one: lists. (Later, we’ll revisit this algorithm using multidimensional arrays, enabling us to write super concise
code, but let’s not get ahead of ourselves.)

vectors = [np.random.rand(5) for _ in range(5)] # randomly generated vectors in a␣

↪list

vectors

[array([0.65023241, 0.67087522, 0.68986822, 0.03021897, 0.72688872]),

array([0.85941475, 0.38198361, 0.89036711, 0.15662603, 0.60220998]),
array([0.8648462 , 0.36266194, 0.31473038, 0.58518851, 0.71176751]),
array([0.98413649, 0.87825261, 0.78210159, 0.22655408, 0.66533214]),
array([0.87312727, 0.38620065, 0.50844992, 0.53770434, 0.60529503])]

The ﬁrst component of the algorithm is the orthogonal projection operator, deﬁned by
𝑘
⟨𝑥, 𝑒𝑖 ⟩
proj𝑒 (𝑥) = ∑ 𝑒.
1 ,…,𝑒𝑘
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩ 𝑖

With our NumPy tools, the implementation is straightforward by now.

from typing import List

def projection(x: np.ndarray, to: List[np.ndarray]):

"""
Computes the orthogonal projection of the vector `x`
onto the subspace spanned by the set of vectors `to`.
"""
p_x = np.zeros_like(x)

for e in to:
e_norm_square = np.dot(e, e)
p_x += np.dot(x, e)*e/e_norm_square

return p_x

80 Chapter 6. The ﬁrst steps in computational linear algebra

Mathematics of Machine Learning

To check if it works, let’s look at a simple example and visualize the results. If you are reading the Jupyter Notebook
version of the book, feel free to change the inputs and experiment with this. (Don’t worry if you don’t understand the
visualization code, it is not essential for now.)

x = np.array([1.0, 2.0])
e = np.array([2.0, 1.0])

x_to_e = projection(x, to=[e])

import matplotlib.pyplot as plt

with plt.style.context("seaborn-white"):
plt.figure(figsize=(7, 7))
plt.xlim([-0, 3])
plt.ylim([-0, 3])
plt.arrow(0, 0, x[0], x[1], head_width=0.1, color="r", label="x")
plt.arrow(0, 0, e[0], e[1], head_width=0.1, color="g", label="e")
plt.arrow(x_to_e[0], x_to_e[1], x[0] - x_to_e[0], x[1] - x_to_e[1], linestyle="--
↪")

plt.arrow(0, 0, x_to_e[0], x_to_e[1], head_width=0.1, color="b", label=

↪"projection of x onto e")

plt.legend()

6.4. The Gram-Schmidt orthogonalization process 81

Mathematics of Machine Learning

Checking the orthogonality of e and x - x_to_e provides another means of veriﬁcation.

np.allclose(np.dot(e, x - x_to_e), 0.0)

True

When writing code for production, a couple of visualizations and ad-hoc checks are not enough. An extensive set of unit
tests is customarily written to ensure that a function works as intended. We are skipping this to keep our discussion on
track, but feel free to add some of your tests. After all, mathematics and programming are not a spectator’s sport.
With the projection(x: np.ndarray, to: List[np.ndarray]) function available to us, we are ready
to knock the Gram-Schmidt algorithm out of the park.

def gram_schmidt(vectors: List[np.ndarray]):

"""
Creates an orthonormal set of vectors from the input
that spans the same subspaces.
"""
(continues on next page)

82 Chapter 6. The ﬁrst steps in computational linear algebra

Mathematics of Machine Learning

(continued from previous page)

output = []

# 1st step: finding an orthogonal set of vectors

output.append(vectors[0])
for v in vectors[1:]:
v_proj = projection(v, to=output)
output.append(v - v_proj)

# 2nd step: normalizing the result

output = [v/np.linalg.norm(v, ord=2) for v in output]

return output

Let’s quickly test out this implementation with a simple example.

test_vectors = [np.array([1.0, 0.0, 0.0]),

np.array([1.0, 1.0, 0.0]),
np.array([1.0, 1.0, 1.0])]

gram_schmidt(test_vectors)

[array([1., 0., 0.]), array([0., 1., 0.]), array([0., 0., 1.])]

So, we have just created our ﬁrst algorithm from scratch. This is like the base camp for Mount Everest. We have gone a
long way, but there is much more to go until we create a neural network from scratch. Until then, the journey is packed
with beautiful sections, and this is one of them. Take a while to appreciate this, then move on when you are ready.

Remark 5.4.1 (Linearly dependent inputs of the Gram-Schmidt process)

Recall that if the input vectors of the Gram-Schmidt are linearly dependent, some vectors of the output are zero. In practice,
this causes a lot of problems.
For instance, we normalize the vectors in the end, using list comprehension

output = [v/np.linalg.norm(v, ord=2) for v in output]

which causes numerical issues. If any v is approximately zero, its norm np.linalg.norm(v, ord=2) is going to
be really small, and division with such small numbers can lead to issues.
This issue also aﬀects the projection function. Take a look at the deﬁnition below:

def projection(x: np.ndarray, to: List[np.ndarray]):

"""
Computes the orthogonal projection of the vector `x`
onto the subspace spanned by the set of vectors `to`.
"""
p_x = np.zeros_like(x)

for e in to:
e_norm_square = np.dot(e, e)
p_x += np.dot(x, e)*e/e_norm_square

return p_x

6.4. The Gram-Schmidt orthogonalization process 83

Mathematics of Machine Learning

If e is (close to) zero, which can happen if the input vectors are linearly dependent, then e_norm_square is small.

In the following chapter, we will be meeting with the single most important objects in machine learning: matrices.

6.5 Problems

Problem 1. Implement the mean squared error

1 𝑛
MSE(𝑥, 𝑦) = ∑(𝑥 − 𝑦𝑖 )2 , 𝑥, 𝑦 ∈ ℝ𝑛
𝑛 𝑖=1 𝑖

both with and without using NumPy functions and methods. (The vectors 𝑥 and 𝑦 should be represented by NumPy arrays
in both cases.)
Problem 2. Compare the performances of the built-in maximum function max and NumPy’s np.max using timeit.
timeit, like we did above. Try running a diﬀerent number of experiments and changing the array sizes to ﬁgure out
the breakeven point between the two performances.
Problem 3. Instead of implementing the general 𝑝-norm as we earlier in this chapter, we can change things around to
obtain the version below.

def p_norm(x: np.ndarray, p: float):

if p >= 1:
return (np.sum(np.abs(x)**p))**(1/p)
elif np.isinf(p):
return np.max(np.abs(x))
else:
return ValueError("p must be a float larger or equal than 1.0 or inf.")

However, this doesn’t work for 𝑝 = ∞. What is the problem with it?
Problem 4. Let 𝑤 ∈ ℝ𝑛 be a vector with nonnegative elements. Use NumPy to implement the weighted 𝑝-norm by
𝑛 1/𝑝
‖𝑥‖𝑤 𝑝
𝑝 = ( ∑ 𝑤𝑖 |𝑥𝑖 | ) , 𝑥 = (𝑥1 , … , 𝑥𝑛 ) ∈ ℝ𝑛 .
𝑖=1

Can you come up with a scenario where this can be useful in machine learning?
Problem 5. Implement the cosine similarity function, deﬁned by the formula

𝑥 𝑦
cos(𝑥, 𝑦) = ⟨ , ⟩, 𝑥, 𝑦 ∈ ℝ𝑛 .
‖𝑥‖ ‖𝑦‖

(Whenever possible, use built-in NumPy functions.)

84 Chapter 6. The ﬁrst steps in computational linear algebra

CHAPTER

SEVEN

MATRICES, THE WORKHORSES OF MACHINE LEARNING

I am quite sure that you were already familiar with the notion of matrices before reading this book. Matrices are one
of the most important data structures that are able to represent systems of equations, graphs, mappings between vector
spaces, and many more. Matrices are the fundamental building blocks of neural networks.
At ﬁrst look, we deﬁne a matrix as a table of numbers. If the matrix 𝐴 has, for instance, 𝑛 rows and 𝑚 columns of real
numbers, we write

𝑎1,1 𝑎1,2 … 𝑎1,𝑚

⎡𝑎 𝑎2,2 … 𝑎2,𝑚 ⎤
𝐴 = ⎢ 2,1 ⎥. (7.1)
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑚 ⎦

When we don’t want to write out the entire matrix as (7.1), we use the abbreviation 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 .

The set of all 𝑛 × 𝑚 real matrices is denoted by ℝ𝑛×𝑚 . We will exclusively talk about real matrices, but when it is not
the case, this notation is modiﬁed accordingly. For instance, ℤ𝑛×𝑚 denotes the set of integer matrices.
Matrices can be added and multiplied together, or multiplied by a scalar.

Deﬁnition 6.1 (Matrix operations.)

(a) Let 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
be a matrix and 𝑐 ∈ ℝ a real number. The multiple of 𝐴 by the scalar 𝑐 is deﬁned by

𝑐𝐴 ∶= (𝑐𝑎𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
.

(b) Let 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚 𝑛,𝑚

𝑖,𝑗=1 , 𝐵 = (𝑏𝑖,𝑗 )𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
be two matrices. Their sum 𝐴 + 𝐵 is deﬁned by

𝐴 + 𝐵 ∶= (𝑎𝑖,𝑗 + 𝑏𝑖,𝑗 )𝑛,𝑚

𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
.

(c) Let 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑙

𝑖,𝑗=1 ∈ ℝ
𝑛×𝑙
and 𝐵 = (𝑏𝑖,𝑗 )𝑙,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑙×𝑚
be two matrices. Their product 𝐴𝐵 is deﬁned by

𝑙
𝐴𝐵 ∶= (∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
.
𝑘=1

Scalar multiplication and addition is clear, but matrix multiplication is not the simplest-to-understand operation ever.
Fortunately, visualization can help. In essence, the (𝑖, 𝑗)-th element is the dot product of the 𝑖-th row of 𝐴 and the 𝑗-th
column of 𝐵.
Besides addition and multiplication, there is another operation that is worth mentioning: transposition.

Deﬁnition 6.2 (Matrix transposition.)

85
Mathematics of Machine Learning

Fig. 7.1: Visualizing matrix multiplication.

Let 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚

𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
be a matrix. The matrix 𝐴𝑇 deﬁned by

𝐴𝑇 = (𝑎𝑗,𝑖 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑚×𝑛

is called the transpose of 𝐴. The operation 𝐴 ↦ 𝐴𝑇 is called transposition.

Transposition simply means “ﬂipping” the matrix, replacing rows with columns. For example,

𝑎 𝑏 𝑎 𝑐
𝐴=[ ], 𝐴𝑇 = [ ].
𝑐 𝑑 𝑏 𝑑

As opposed to addition and multiplication, transposition is a unary operation. (Unary means that it takes one argument.
Binary operations take two arguments, and so on.) Although transposition is easy to understand, there is much more
behind the surface. Later, we will give a geometric interpretation involving inner products, but for now, let’s move on to
study the invertibility of linear transformations.

7.1 Unraveling matrix multiplication

Matrix multiplication is one of the most frequently used operations in computing. As it can be performed extremely fast,
it is common to even vectorize certain algorithms just to express it in terms of matrix multiplications.
Thus, the more we know about it the better. To get a grip on the operation itself, we can take a look at it from a few
diﬀerent angles. Let’s start with a special case!
In machine learning, taking the product of a matrix and a column vector is a fundamental building block of certain models.
For instance, this is linear regression in itself, or the famous fully connected layer in neural networks.
To see what happens in this case, let 𝐴 ∈ ℝ𝑛×𝑚 be a matrix. If we treat 𝑥 ∈ ℝ𝑚 as a column vector 𝑥 ∈ ℝ𝑚×1 , then 𝐴𝑥
can be written as
𝑚
𝑎1,1 𝑎1,2 … 𝑎1,𝑚 𝑥1 ∑ 𝑎1,𝑗 𝑥𝑗
⎡𝑎 ⎤ ⎡ ⎤ ⎡ 𝑗=1
𝑚 ⎤
𝑎2,2 … 𝑎2,𝑚 𝑥 ∑𝑗=1 𝑎2,𝑗 𝑥𝑗 ⎥
𝐴𝑥 = ⎢ 2,1 ⎥⎢ 2⎥ = ⎢⎢ ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥
𝑚
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑚 .⎦ ⎣𝑥𝑚 ⎦ ⎣∑ 𝑎𝑛,𝑗 𝑥𝑗 ⎦
𝑗=1

86 Chapter 7. Matrices, the workhorses of machine learning

Mathematics of Machine Learning

Based on this, the matrix 𝐴 describes a function that takes a piece of data 𝑥, then transforms it into the form 𝐴𝑥.
This is the same as taking the linear combination of 𝐴‘s columns, that is,

𝑎1,1 𝑎1,2 … 𝑎1,𝑚 𝑥1 𝑎1,1 𝑎1,𝑚

⎡𝑎 𝑎2,2 … 𝑎2,𝑚 ⎤ ⎡ 𝑥2 ⎤ ⎡𝑎 ⎤ ⎡𝑎 ⎤
⎢ 2,1 ⎥ ⎢ ⎥ = 𝑥1 ⎢ 2,1 ⎥ + ⋯ + 𝑥𝑚 ⎢ 2,𝑚 ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥ ⎢ ⋮ ⎥
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑚 .⎦ ⎣𝑥𝑚 ⎦ ⎣𝑎𝑛,1 ⎦ ⎣𝑎𝑛,𝑚 ⎦

With a bit more suggestive notation, by denoting the 𝑖-th column as 𝑎𝑖 , we can write

𝑥1 𝑛 𝑎1,𝑖
⎡𝑎
⎢ 1 𝑎2 … 𝑎𝑛 ⎤ ⎡ ⎤
⎥ ⎢ ⋮ ⎥ = ∑ 𝑥𝑖 𝑎𝑖 , 𝑎𝑖 = ⎡
⎢ ⋮ ⎥.
⎤ (7.2)
⎣ ⎦ ⎣𝑥𝑛 ⎦ 𝑖=1 ⎣𝑎𝑛,𝑖 ⎦
If we replace the vector 𝑥 with a matrix 𝐵, the columns in the product matrix 𝐴𝐵 are linear combinations of 𝐴‘s columns,
where the coeﬃcients are determined by 𝐵.
You should really appreciate that certain operations on the data can be written in the form 𝐴𝑥. Elevating this simple
property to a higher level of abstraction, we can say that the data has the same representation as the function. If you are
familiar with programming languages like Lisp, you know how beautiful this is.
There is one more way to think about the matrix product: taking the columnwise inner products. If 𝑎𝑖 = (𝑎𝑖,1 , … , 𝑎𝑖,𝑛 )
denotes the 𝑖-th column of 𝐴, then 𝐴𝑥 can be written as

𝑎1,1 𝑎1,2 … 𝑎1,𝑚 𝑥1 ⟨𝑎1 , 𝑥⟩

⎡𝑎 𝑎2,2 … 𝑎2,𝑚 ⎤ ⎡ 𝑥2 ⎤ ⎡ ⟨𝑎2 , 𝑥⟩ ⎤
𝐴𝑥 = ⎢ 2,1 ⎥⎢ ⎥ = ⎢ ⎥, (7.3)
⎢ ⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑚 .⎦ ⎣𝑥𝑚 ⎦ ⎣⟨𝑎𝑛 , 𝑥⟩⎦

that is, the transformation 𝑥 ↦ 𝐴𝑥 projects the input 𝑥 to the row vectors of 𝐴, then compacts the results in a vector.

7.2 Manipulating matrices

Because of the well-deﬁned matrix operations, we can do algebra on matrices just as with numbers. However, there are
some diﬀerences. As manipulating matrix expressions is an essential skill, let’s take a look at its fundamental rules!

Theorem 6.2.1 (Properties of matrix addition and multiplication.)

(a) Let 𝐴, 𝐵, 𝐶 ∈ ℝ𝑛×𝑙 be arbitrary matrices. Then

𝐴 + (𝐵 + 𝐶) = (𝐴 + 𝐵) + 𝐶

holds. That is, matrix addition is associative.

(b) Let 𝐴 ∈ ℝ𝑛×𝑙 , 𝐵 ∈ ℝ𝑙×𝑘 , 𝐶 ∈ ℝ𝑘×𝑚 be arbitrary matrices. Then

𝐴(𝐵𝐶) = (𝐴𝐵)𝐶

holds. That is, matrix multiplication is associative.

𝐴(𝐵 + 𝐶) = 𝐴𝐵 + 𝐴𝐶

holds. That is, matrix multiplication is left-distributive with respect to addition.

7.2. Manipulating matrices 87

Mathematics of Machine Learning

(d) Let 𝐴, 𝐵 ∈ ℝ𝑛×𝑙 and 𝐶 ∈ ℝ𝑙×𝑚 be arbitrary matrices. Then

(𝐴 + 𝐵)𝐶 = 𝐴𝐶 + 𝐵𝐶

holds. That is, matrix multiplication is right-distributive with respect to addition.

As the proof is extremely technical and boring, we are going to skip it. However, there are a few things to note. Most
importantly, matrix multiplication is not commutative; that is, 𝐴𝐵 does not always equal to 𝐵𝐴. (It might not even be
deﬁned.) For instance, consider

1 1 1 0
𝐴=[ ], 𝐵=[ ].
1 1 0 2

You can verify by hand that

1 2 1 1
𝐴𝐵 = [ ], 𝐵𝐴 = [ ],
1 2 2 2

which are not equal.

In line with this, the algebraic identities that we use for scalars are quite diﬀerent. For instance, if 𝐴 and 𝐵 are matrices,
then
(𝐴 + 𝐵)(𝐴 + 𝐵) = 𝐴(𝐴 + 𝐵) + 𝐵(𝐴 + 𝐵)
= 𝐴2 + 𝐴𝐵 + 𝐵𝐴 + 𝐵2 ,
or
(𝐴 + 𝐵)(𝐴 − 𝐵) = 𝐴(𝐴 − 𝐵) + 𝐵(𝐴 − 𝐵)
= 𝐴2 − 𝐴𝐵 + 𝐵𝐴 − 𝐵2 .
Transposition also behaves nicely with respect to addition and multiplication.

Theorem 6.2.2 (Properties of transposition.)

(a) Let 𝐴, 𝐵 ∈ ℝ𝑛×𝑚 be arbitrary matrices. Then

(𝐴 + 𝐵)𝑇 = 𝐴𝑇 + 𝐵𝑇

holds.
(b) Let 𝐴 ∈ ℝ𝑛×𝑙 , 𝐵 ∈ ℝ𝑙×𝑚 be arbitrary matrices. Then

(𝐴𝐵)𝑇 = 𝐵𝑇 𝐴𝑇

holds.

We are not going to prove it either, but feel free to do it as an exercise.

7.3 Matrices as arrays

To do computational work with matrices inside a computer, we are looking for a data structure that represents a matrix A
and supports
• accessing elements by A[i, j],
• assigning elements by A[i, j] = value,
• addition and multiplication with the + and * operators,

88 Chapter 7. Matrices, the workhorses of machine learning

Mathematics of Machine Learning

and works lightning fast. These requirements only specify the interface of our matrix data structure, not the concrete
implementation. An obvious choice would be a list of lists, but as discussed in our section about representing vectors in
computations, this is highly suboptimal. Can we leverage the C array structure to store a matrix?
Yes, and this is precisely what NumPy does, providing a fast and convenient representation for matrices in the form of
multidimensional arrays. Before learning to use NumPy’s machinery for our purposes, let’s look a bit deeper into the
heart of the issue.
At ﬁrst glance, there seems to be a problem: a computer’s memory is one-dimensional, thus addressed (indexed) by a
single key, not two as we want. Thus, we can’t just shove a matrix into the memory. The solution is to ﬂatten the matrix
and place each consecutive row next to each other, like Fig. 7.2 illustrates in the 3 × 3 case.

Fig. 7.2: Flattening a matrix.

By storing the rows of any 𝑛 × 𝑚 matrix in a contiguous array, we get all the beneﬁts of the array data structure at the
low cost of a simple index transformation deﬁned by

(𝑖, 𝑗) ↦ 𝑖𝑚 + 𝑗.

To demonstrate what’s happening, let’s conjure up a prototypical Matrix class in Python that uses a single list to store
all the values, yet supports accessing elements by row and column indices. For the sake of illustration, let’s imagine that
a Python list is actually a static array. (At least until this presentation is over.) This is for educational purposes only, as at
the moment, we only care about understanding the ﬂattening process, not performance.
Take a moment to review the code below. I’ll explain everything line by line. (If you are not familiar with classes in
Python, I encourage you to also check the introductory OOP section in the appendix.)

from typing import Tuple

class Matrix:
def __init__(self, shape: Tuple[int, int]):
if len(shape) != 2:
raise ValueError("The shape of a Matrix object must be a two-dimensional␣
↪tuple.")

(continues on next page)

7.3. Matrices as arrays 89

Mathematics of Machine Learning

(continued from previous page)

self.shape = shape
self.data = [0.0 for _ in range(shape[0]*shape[1])]

def _linear_idx(self, i: int, j: int):

return i*self.shape[1] + j

def getitem(self, key: Tuple[int, int]):

linear_idx = self._linear_idx(*key)
return self.data[linear_idx]

def setitem(self, key: Tuple[int, int], value):

linear_idx = self._linear_idx(*key)
self.data[linear_idx] = value

def __repr__(self):
array_form = [[self[i, j] for j in range(self.shape[1])] for i in range(self.
↪shape[0])]

return "\n".join(["\t".join([f"{x}" for x in row]) for row in array_form])

The Matrix object is initialized with the __init__ method. This is called when an object is created, like we are about
to do now.

M = Matrix(shape=(3, 4))

Upon initialization, we supply the dimensions of the matrix in the form of a two-dimensional tuple, passed for the
shape argument. In our concrete example, M is a 3 × 4 matrix, represented by an array of length 12. For simplicity, our
simple Matrix is ﬁlled up with zeros by default.
Overall, the __init__ method does three things:
• checking the validity of shape,
• storing shape in an attribute,
• and initializing a list of size shape[0]*shape[1], serving as our data storage.
The second method, suggestively named _linear_idx, is responsible for translating between the row-column indices
of the matrix and the linear index for our internal one-dimensional representation. (In Python, it is customary to preﬁx
methods with an underscore if they are not intended to called externally. Many other languages, such as Java, support
hidden methods. Python is not one of them, so we have to make do with such polite suggestions instead of strictly enforced
rules.)
We can implement item retrieval via indexing by providing the __getitem__ method, expecting a two-dimensional
integer tuple as the key. For any key = (i, j), the method
• calculates the linear index using our _linear_idx method,
• then retrieves the element located at the given linear index from the list.
Item assignment happens similarly, as given by the __setitem__ magic method. Let’s try these out to see if they work.

M[1, 2] = 3.14

90 Chapter 7. Matrices, the workhorses of machine learning

Mathematics of Machine Learning

M[1, 2]

3.14

By providing a __repr__ method, we specify how a Matrix object is represented as a string. So, we can print it out
to the standard output in a pretty form.

0.0 0.0 0.0 0.0

0.0 0.0 3.14 0.0
0.0 0.0 0.0 0.0

Pretty awesome. Now that we understand some of the internals, it is time to see how we can achieve much more with
NumPy.

7.4 Matrices in NumPy

As foreshadowed earlier, NumPy provides an excellent out-of-the-box representation for matrices in the form of multi-
dimensional arrays. (These are often called tensors, but I’ll just stick to the naming array.)
I have some fantastic news: these are the same np.ndarray objects we have been using! We can create one by simply
providing a list of lists during initialization.

import numpy as np

A = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])

B = np.array([[5, 5, 5, 5], [5, 5, 5, 5], [5, 5, 5, 5]])

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

Everything works the same as we have seen so far. Operations are performed elementwise, and you can plug them into
functions like np.exp.

A + B

array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])

A*B

array([[ 0, 5, 10, 15],

[20, 25, 30, 35],
[40, 45, 50, 55]])

7.4. Matrices in NumPy 91

Mathematics of Machine Learning

np.exp(A)

array([[1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01],

[5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03],
[2.98095799e+03, 8.10308393e+03, 2.20264658e+04, 5.98741417e+04]])

Since we are working with multidimensional arrays, the transposition operator can be deﬁned. Here, this is conveniently
implemented as the np.transpose function, but can also be accessed at the np.ndarray.T attribute.

np.transpose(A)

array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])

A.T

array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])

As expected, we can get and set elements with the indexing operator []. The indexing starts from zero. (Don’t even get
me started.)

A[1, 2] # 1st row, 2nd column (if we index rows and columns from zero)

Entire rows and columns can be accessed using slicing. Instead of putting out the exact deﬁnitions, I’ll just leave a few
examples and let you ﬁgure it out with your internal pattern matching engine. (That is, your intelligence.)

A[:, 2] # 2nd column

array([ 2, 6, 10])

A[1, :] # 1st row

array([4, 5, 6, 7])

A[2, 1:4] # 2nd row, 1st-4th elements

array([ 9, 10, 11])

A[1] # 1st row

92 Chapter 7. Matrices, the workhorses of machine learning

Mathematics of Machine Learning

array([4, 5, 6, 7])

When used as an iterable, a two-dimensional array yields its rows at every step.

for row in A:
print(row)

[0 1 2 3]
[4 5 6 7]
[ 8 9 10 11]

Initializing arrays can be done with the familiar np.zeros, np.ones, and other functions.

np.zeros(shape=(4, 5))

array([[0., 0., 0., 0., 0.],

[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])

As you have guessed, that shape argument speciﬁes the dimensions of the array. We are going to explore this property
next.

7.4.1 The shape of multidimensional arrays

Let’s initialize an example multidimensional array with three rows and four columns.

A = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

The shape of an array, stored inside the attribute np.ndarray.shape, is a tuple object describing its dimensions.
In our example, since we have a 3 × 4 matrix, the shape equals (3, 4).

A.shape

(3, 4)

This innocently looking attribute determines what kind of operations you can perform with your arrays. Let me tell you,
as a machine learning engineer, shape mismatches will be the bane of your existence. You want to calculate the product
of two matrices A and B? The second dimension of A must match the ﬁrst dimension of B. Pointwise products? Matching
or broadcastable shapes are required. Understanding shapes are vital.
However, we have just learned that multidimensional arrays are linear arrays in disguise. Because of this, we can reshape
an array by slicing the linear view diﬀerently. For example, A can be reshaped into arrays with shapes (12, 1), (6,
2), (4, 3), (3, 4), (2, 6), and (1, 12).

7.4. Matrices in NumPy 93

Mathematics of Machine Learning

A.reshape(6, 2) # reshapes A into a 6 x 2 matrix

array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])

The np.ndarray.reshape method returns a newly constructed array object but doesn’t change A. In other words,
reshaping is not destructive in NumPy.

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

Reshaping is hard to wrap your head around for the ﬁrst time. To help you visualize the process, Fig. 7.3 shows precisely
what happens in our case.

Fig. 7.3: Reshaping a one-dimensional array into multiple possible shapes.

If you are unaware of the exact dimension along a speciﬁc axis, you can get away by inputting -1 there during the
reshaping. Since the product of dimensions is constant, NumPy is smart enough to ﬁgure out the missing one for you.
This trick will get you out of trouble all the time, so it is worth taking note.

A.reshape(-1, 2)

94 Chapter 7. Matrices, the workhorses of machine learning

Mathematics of Machine Learning

array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])

We won’t go into the details now, but as you probably guessed, multidimensional arrays can have more than two di-
mensions. The range of permitted shapes for the operations will be even more complicated then. So, building a solid
understanding now will provide a massive headstart in the future.

7.5 Matrix multiplication

Without a doubt, one of the most important operations regarding matrices is multiplication. Computing determinants and
eigenvalues? Matrix multiplication. Passing data through a fully connected layer? Matrix multiplication. Convolution?
Matrix multiplication. We will see how these seemingly different things can be traced back to matrix multiplication, but
first, let’s discuss the operation itself from a computational perspective.
First, recap the mathematical definition. If 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 ∈ ℝ
𝑛×𝑚
and 𝐵 = (𝑏𝑖,𝑗 )𝑚,𝑙
𝑖,𝑗=1 ∈ ℝ
𝑚×𝑙
are two arbitrary
matrices, then their product is defined by the formula,
𝑚 𝑛,𝑙
𝐴𝐵 = ( ∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 ) ∈ ℝ𝑛×𝑙 ,
𝑘=1 𝑖,𝑗=1

which comes from the composition of the linear transformations determined by 𝐴 and 𝐵. Notice that the element in the
𝑖-th row and 𝑗-th column of 𝐴𝐵 is the dot product of 𝐴‘s 𝑖-th row and 𝐵‘s 𝑗-th column.
We can put this into code using the tools we have learned so far.

from itertools import product

def matrix_multiplication(A: np.ndarray, B: np.ndarray):

# checking if multiplication is possible
if A.shape[1] != B.shape[0]:
raise ValueError("The number of columns in A must match the number of rows in␣
↪B.")

# initializing an array for the product

AB = np.zeros(shape=(A.shape[0], B.shape[1]))

# calculating the elements of AB

for i, j in product(range(A.shape[0]), range(B.shape[1])):
AB[i, j] = np.sum(A[i, :]*B[:, j])

return AB

Let’s test our function with an example that is easy to verify by hand.

A = np.ones(shape=(4, 6))
B = np.ones(shape=(6, 3))

matrix_multiplication(A, B)

7.5. Matrix multiplication 95

Mathematics of Machine Learning

array([[6., 6., 6.],

[6., 6., 6.],
[6., 6., 6.],
[6., 6., 6.]])

The result is correct, as we expected.

Of course, matrix multiplication has its own NumPy function in the form of numpy.matmul.

np.matmul(A, B)

array([[6., 6., 6.],

[6., 6., 6.],
[6., 6., 6.],
[6., 6., 6.]])

This yields the same result as our custom function. We can test it out by generating a bunch of random matrices and
checking if the results match.

for _ in range(100):
n, m, l = np.random.randint(1, 100), np.random.randint(1, 100), np.random.
↪randint(1, 100)

A = np.random.rand(n, m)
B = np.random.rand(m, l)

if not np.allclose(np.matmul(A, B), matrix_multiplication(A, B)):

print(f"Result mismatch for\n{A}\n and\n{B}")
break

According to this small test, our matrix_multiplication function yields the same result as NumPy’s built-in one.
We are happy, but don’t forget: always use your chosen framework’s implementations in practice, may it be NumPy,
TensorFlow, or PyTorch.
Since writing numpy.matmul is cumbersome when lots of multiplications are present, NumPy oﬀers a way to abbreviate
using the @ operator.

A = np.ones(shape=(4, 6))
B = np.ones(shape=(6, 3))

np.allclose(A @ B, np.matmul(A, B))

True

7.6 Linear transformations and data

Besides composing linear transformations, matrix multiplication also describes the image of vectors under them. Recall
that if a transformation is given by the matrix 𝐴 ∈ ℝ𝑛×𝑚 and the input is given by 𝑥 ∈ ℝ𝑚 , then by treating 𝑥 as a
column vector 𝑥 ∈ ℝ𝑚×1 , the image of 𝑥 under 𝐴 can be calculated by
𝑚
𝑎1,1 𝑎1,2 … 𝑎1,𝑚 𝑥1 ∑ 𝑎1,𝑗 𝑥𝑗
⎡𝑎 ⎤ ⎡ ⎤ ⎡ 𝑗=1
𝑚 ⎤
𝑎2,2 … 𝑎2,𝑚 𝑥 ∑𝑗=1 𝑎2,𝑗 𝑥𝑗 ⎥
𝐴𝑥 = ⎢ 2,1 ⎥⎢ 2⎥ = ⎢⎢ ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥
𝑚
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑚 .⎦ ⎣𝑥𝑚 ⎦ ⎣∑ 𝑎𝑛,𝑗 𝑥𝑗 ⎦
𝑗=1

96 Chapter 7. Matrices, the workhorses of machine learning

Mathematics of Machine Learning

Mathematically speaking, looking at 𝑥 as a column vector is perfectly natural. Think of it as extending ℝ𝑚 with a dummy
dimension, thus obtaining ℝ𝑚×1 . This form also comes naturally by considering that the columns of a matrix are images
of the basis vectors by their very deﬁnition.
In practice, things are not as simple as they look. Implicitly, we have made a choice here: to represent datasets as a
horizontal stack of column vectors. To elaborate further, let’s consider two data points with four features and a matrix
that maps these into a three-dimensional feature space. That is, let 𝑥1 , 𝑥2 ∈ ℝ4 and let 𝐴 ∈ ℝ3×4 .

x1 = np.array([2, 0, 0, 0]) # first data point

x2 = np.array([-1, 1, 0, 0]) # second data point

A = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]) # a feature␣

↪transformation

(I speciﬁcally selected the numbers so that calculations are easily veriﬁable by hand.) To be sure, we double-check the
shapes.

A.shape

(3, 4)

x1.shape

(4,)

What happens when we call the np.matmul function?

np.matmul(A, x1)

array([ 0, 8, 16])

The result is correct. However, when we have a bunch of input data points, we prefer to calculate the images using a single
operation. This way, we can take advantage of vectorized code, locality of reference, and all the juicy computational magic
we have seen so far.
We can achieve this by horizontally stacking the column vectors, each one representing a data point. Mathematically
speaking, we want to perform the calculation
2 −1
0 1 2 3 ⎡ 0 1
⎡4 0 1⎤ ⎡
⎢ 5 6 7⎤ ⎢ ⎥
⎥ ⎢0 0 ⎥ ⎢ 8 1⎥
= ⎤
⎣8 9 10 11⎦ ⎣16 1⎦
⎣0 0 ⎦
in code. Upon looking up the NumPy documentation, we quickly ﬁnd that the np.hstack function might be the tool
for the job, at least according to its oﬃcial documentation. Yay!

np.hstack([x1, x2]) # np.hstack takes a list of np.ndarrays as its argument

array([ 2, 0, 0, 0, -1, 1, 0, 0])

Not yay. What happened? np.hstack treats one-dimensional arrays diﬀerently, and even though the math works out
perfectly by creatively abusing the notation, we don’t get away that easily in the trenches of real-life computations. Thus,
we have to reshape our inputs manually. Meet the true skill gap between junior and senior machine learning engineers:
correctly shaping multidimensional arrays.

7.6. Linear transformations and data 97

Mathematics of Machine Learning

data = np.hstack([x1.reshape(-1, 1), x2.reshape(-1, 1)])

data

array([[ 2, -1],
[ 0, 1],
[ 0, 0],
[ 0, 0]])

Let’s try this one more time.

np.matmul(A, data)

array([[ 0, 1],
[ 8, 1],
[16, 1]])

Yay! (For real this time.)

7.6.1 The shape of data

We made an extremely impactful choice in the previous section: representing individual data points as column vectors.
I am writing this with bold letters to emphasize its importance.
Why? Because we could have gone the other way and treated samples as row vectors. With our current choice, we ended
up with a multidimensional array of shape

number of dimensions × number of samples,

as opposed to

number of samples × number of dimensions.

The former is called batch-last, while the latter is called batch-first format. Popular frameworks like TensorFlow and
PyTorch use batch-first, but we are going with batch-last. The reasons go back to the very definition of matrices, where
columns are the images of basis vectors under the given linear transformation. This way, we can write multiplication from
left to right, like 𝐴𝑥 and 𝐴𝐵.
Should we define matrices as rows of basis vector images, everything turns upside down. This way, if 𝑓 and 𝑔 are linear
transformations with “matrices” 𝐴 and 𝐵, the “matrix” of the composed transformation 𝑓 ∘ 𝑔 would be 𝐵𝐴. This makes
the math complicated and ugly.
On the other hand, batch-first makes the data easier to store and read. Think about a situation when you have thousands
of data points in a single CSV file. Due to how input-output is implemented, files are read line-by-line, so it is natural
and convenient to have a single line correspond to a single sample.
No good choices here; there are sacrifices either way. Since the math works out much easier for batch-last, we will use
that format. However, in practice, you’ll find that batch-first is more common. With this textbook, I don’t intend to give
you just a manual. My goal is to help you understand the internals of machine learning. If I succeed, you’ll be able to
apply your knowledge to translate between batch-first and batch-last seamlessly.

98 Chapter 7. Matrices, the workhorses of machine learning

Mathematics of Machine Learning

7.7 Problems

Problem 1. Calculate the product of the following matrices.

(a)

6 −2
−1 2
𝐴=[ ], 𝐵=⎡ ⎤
⎢ 2 −6⎥ .
1 5
⎣−3 2 ⎦
(b)

1 2 3 7 8
𝐴=[ ], 𝐵=[ ].
4 5 6 9 10

Problem 2. The famous Fibonacci numbers are deﬁned by the recursive sequence

𝐹0 = 0,
𝐹1 = 1,
𝐹𝑛 = 𝐹𝑛−1 + 𝐹𝑛−2 .

(a) Write a recursive function that computes the 𝑛-th Fibonacci number. (Expect it to be really slow.)
(b) Show that
𝑛
1 1 𝐹 𝐹𝑛
[ ] = [ 𝑛+1 ],
1 0 𝐹𝑛 𝐹𝑛−1

and use this identity to write a non-recursive function that computes the 𝑛-th Fibonacci number.
Use Python’s builtin timeit function to measure the execution of both functions. Which one is faster?
Problem 3. Let 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚 𝑛,𝑚
𝑖,𝑗=1 , 𝐵 = (𝑏𝑖,𝑗 )𝑖,𝑗=1 . be two 𝑛 × 𝑚 matrices. Their Hadamard product is deﬁned by

𝑎1,1 𝑏1,1 𝑎1,2 𝑏1,2 … 𝑎1,𝑛 𝑏1,𝑛

⎡𝑎 𝑏 𝑎2,2 𝑏2,2 … 𝑎2,𝑛 𝑏2,𝑛 ⎤
𝐴 ⊙ 𝐵 = ⎢ 2,1 2,1 ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣ 𝑛,1 𝑏𝑛,1
𝑎 𝑎𝑛,2 𝑏𝑛,2 … 𝑎𝑛,𝑛 𝑏𝑛,𝑛 ⎦

Implement a function that takes two identically shaped NumPy arrays, then performs the Hadamard product on them.
(There are two ways to do this: with for loops and with NumPy operations. It is instructive to implement both.)
Problem 4. Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix. Functions of the form

𝐵(𝑥, 𝑦) = 𝑥𝑇 𝐴𝑦, 𝑥, 𝑦 ∈ ℝ𝑛

are called bilinear forms. Implement a function that takes two vectors and a matrix (all represented by NumPy arrays),
then calculates the corresponding bilinear form.

7.7. Problems 99
Mathematics of Machine Learning

100 Chapter 7. Matrices, the workhorses of machine learning

CHAPTER

EIGHT

LINEAR TRANSFORMATIONS

“Why do my eyes hurt?”“You’ve never used them before.” - Morpheus to Neo, when waking up from the Matrix
for the ﬁrst time
In most linear algebra courses, the curriculum is all about matrices. In machine learning, we work with them all the time.
Here is the thing: matrices don’t tell the whole story. It is hard to understand the patterns by looking only at matrices.
For instance, why is matrix multiplication deﬁned in such a complex way as it is? Why are relations like 𝐵 = 𝑇 −1 𝐴𝑇
important? Why are some matrices invertible and some are not?
To really understand what is going on, we have to look at what gives rise to matrices: linear transformations. Like for
Neo, this might hurt a bit, but it will greatly reward us later down the line. Let’s get to it!
With the introduction of inner products, orthogonality, and orthogonal/orthonormal bases, we know everything about
the structure of our feature spaces. However, in machine learning, our interest mainly lies in transforming the data. To
illustrate this, we should take another look at the Iris dataset. Each sample is represented by a four-dimensional vector
in its raw form, belonging to one of the three classes. Since human perception is limited to three dimensions, we can’t
visualize these directly. However, we can map each feature against every other.
To simplify the data and gain predictive insight, we can train a simple neural network and check out how it maps the
dataset into a new feature space. Since the Iris set contains three class labels, the result is three-dimensional. Similarly to
the raw data, we are going to visualize this by plotting the features pairwise.
Even though we only passed the data through a function 𝑓 ∶ ℝ4 → ℝ3 without adding new information, the transformed
dataset looks much more descriptive than the original one.
From this viewpoint, a neural network is just a function composed of smaller parts (known as layers), transforming
the data to a new feature space in every step. One of the key components of models in machine learning are linear
transformations. You probably encountered them as functions of the form 𝑓(𝑥) = 𝐴𝑥, but this is only one way to look
at them. This section will start from a geometric viewpoint, then move towards the algebraic representation that you are
probably already familiar with. To understand how neural networks can learn powerful high-level representations of the
data, looking at the geometry of transforms is essential.

8.1 What is a linear transformation?

Let’s not hesitate a moment further, and jump into the deﬁnition right away!

Deﬁnition 7.1.1 (Linear transformations.)

Let 𝑈 and 𝑉 be two vector spaces (over the same scalar ﬁeld), and let 𝑓 ∶ 𝑈 → 𝑉 be a function between them. We say
that 𝑓 is linear if

𝑓(𝑎𝑥 + 𝑏𝑦) = 𝑎𝑓(𝑥) + 𝑏𝑓(𝑦) (8.1)

101
Mathematics of Machine Learning

Fig. 8.1: The Iris dataset, visualized by plotting every feature against every other feature. Colors are according to class
labels, while the diagonals represent the density estimation of each feature.

102 Chapter 8. Linear transformations

Mathematics of Machine Learning

Fig. 8.2: The Iris dataset, transformed by a neural network.

8.1. What is a linear transformation? 103

Mathematics of Machine Learning

holds for all vectors 𝑥, 𝑦 ∈ 𝑈 and all scalars 𝑎, 𝑏.

This is why linear algebra is called linear algebra. In essence, a linear transformation is a mapping between two vector
spaces that preserves the algebraic structure: addition and scalar multiplication. (Functions between vector spaces are
often called transformations, so we will use this terminology.)

Remark 7.1.1
Linearity is essentially comprising two properties in one: 𝑓(𝑥 + 𝑦) = 𝑓(𝑥) + 𝑓(𝑦) and 𝑓(𝑎𝑥) = 𝑎𝑓(𝑥) for all vectors
𝑥, 𝑦 and all scalars 𝑎. From these two, (8.1) follows by

𝑓(𝑎𝑥 + 𝑏𝑦) = 𝑓(𝑎𝑥) + 𝑓(𝑏𝑦)

= 𝑎𝑓(𝑥) + 𝑏𝑓(𝑦).

Two properties immediately jump out from the deﬁnition. First, since

𝑓(𝑥) = 𝑓(𝑥 + 0) = 𝑓(𝑥) + 𝑓(0),

𝑓(0) = 0 holds for every linear transformation. In addition, the composition of linear transformations is still linear, as

𝑓(𝑔(𝑎𝑥 + 𝑏𝑦)) = 𝑓(𝑎𝑔(𝑥) + 𝑏𝑔(𝑦)) = 𝑎𝑓(𝑔(𝑥)) + 𝑏𝑓(𝑔(𝑦))

shows for any linear 𝑓 and 𝑔.

As usual, let’s see some examples to build intuition.
Example 1. For any scalar 𝑐, the scaling transformation 𝑓(𝑥) = 𝑐𝑥 is linear. This is probably the simplest example out
there, and it can be deﬁned in all vector spaces.

Fig. 8.3: Scaling as a linear transformation.

104 Chapter 8. Linear transformations

Mathematics of Machine Learning

It’s easy to see that scaling is linear:

𝑐(𝑎𝑥 + 𝑏𝑦) = 𝑐(𝑎𝑥) + 𝑐(𝑏𝑦)
= 𝑎(𝑐𝑥) + 𝑏(𝑐𝑦).
2
Example 2. In ℝ , rotations around the origin by an angle 𝛼 are also linear.

Fig. 8.4: Rotation in the Euclidean plane as a linear transformation.

To show that rotations are indeed linear, I am pull the deﬁnition out from the hat: the rotation of a planar vector 𝑥 =
(𝑥1 , 𝑥2 ) with the angle 𝛼 is described by

𝑓(𝑥) = (𝑥1 cos 𝛼 − 𝑥2 sin 𝛼, 𝑥1 sin 𝛼 + 𝑥2 cos 𝛼),

from which (8.1) is easily conﬁrmed. I know that this looks like magic, but trust me, the rotation formula will be explained
in detail. You can sweat it out with some basic trigonometry, or wait until we do this later with matrices.
In general, linear transformations have a strong connection with the geometry of the space. Later we are going to study the
linear transformations of ℝ2 in detail, with an emphasis on geometric ones such as this. (Note that rotations are slightly
more complicated in higher dimensions, as they will require an axis to rotate around.)
Example 3. In any vector space 𝑉 and a nonzero vector 𝑣 ∈ 𝑉 , the translation deﬁned by 𝑓(𝑥) = 𝑥 + 𝑣 is not linear, as
𝑓(0) = 𝑣 ≠ 0.
We’ll see more examples later in the section. For now, let’s move to some general properties of linear transformations.

8.1. What is a linear transformation? 105

Mathematics of Machine Learning

8.1.1 Key properties

For any linear transformation 𝑓 ∶ 𝑈 → 𝑉 , the image

im(𝑓) = {𝑣 ∈ 𝑉 ∶ 𝑣 = 𝑓(𝑢) for some 𝑢 ∈ 𝑈 }

is always a subspace of 𝑉 . This is easy to check: if 𝑣1 , 𝑣2 ∈ im𝑓, then there are 𝑢1 , 𝑢2 ∈ 𝑈 such that 𝑓(𝑢1 ) = 𝑣1 and
𝑓(𝑢2 ) = 𝑣2 . Thus,

𝑎𝑣1 + 𝑏𝑣2 = 𝑎𝑓(𝑢1 ) + 𝑏𝑓(𝑢2 ) = 𝑓(𝑎𝑢1 + 𝑏𝑢2 ) ∈ im𝑓.

To add one more level of abstraction, we will see that the set of all linear transformations is a vector space.

Theorem 7.1.1
Let 𝑈 and 𝑉 be two vector spaces over the same ﬁeld 𝐹 . Then the set of all linear transformations

𝐿(𝑈 , 𝑉 ) = {𝑓 ∶ 𝑈 → 𝑉 |𝑓 is linear} (8.2)

is also a vector space over 𝐹 , with the usual deﬁnitions for function addition and scalar multiplication.

The proof of this is just a boring checklist, going through the items of the deﬁnition of vector spaces. (I recommend you
to walk through it at least once to solidify your understanding of vector spaces, but there is really nothing special there.)

8.2 Linear transformations and matrices

The deﬁnition of linear transformations, as we saw it, might seem a bit abstract. However, there is a simple and expressive
way to characterize them.
To see this, let 𝑓 ∶ 𝑈 → 𝑉 be a linear transformation between two vector spaces 𝑈 and 𝑉 . Suppose that {𝑢1 , … , 𝑢𝑚 }
𝑚
is a basis in 𝑈 , while {𝑣1 , … , 𝑣𝑛 } is a basis in 𝑉 . Since every 𝑥 ∈ 𝑈 can be written in the form 𝑥 = ∑𝑖=1 𝑥𝑖 𝑢𝑖 , the
linearity of 𝑓 implies
𝑚 𝑚
𝑓( ∑ 𝑥𝑗 𝑢𝑗 ) = ∑ 𝑥𝑗 𝑓(𝑢𝑗 ), (8.3)
𝑗=1 𝑗=1

meaning that 𝑓(𝑥) is a linear combination of 𝑓(𝑢1 ), … , 𝑓(𝑢𝑚 ). In other words, every linear transformation is completely
determined by the images of basis vectors. To expand this idea, suppose that for every 𝑢𝑗 , we have
𝑛
𝑓(𝑢𝑗 ) = ∑ 𝑎𝑖,𝑗 𝑣𝑖
𝑖=1

for some scalars 𝑎𝑖,𝑗 .

These 𝑛 × 𝑚 numbers completely describe 𝑓. For notational simplicity, we store these in a 𝑛 × 𝑚-sized table called a
matrix, which we’ll denote by 𝐴𝑓 :

𝑎1,1 𝑎1,2 … 𝑎1,𝑚

⎡𝑎 𝑎2,2 … 𝑎2,𝑚 ⎤
𝑓 ↔ 𝐴𝑓 = ⎢ 2,1 ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑚 ,⎦

meaning that linear transformations are represented by matrices. This connection is heavily utilized throughout machine
learning.

106 Chapter 8. Linear transformations

Mathematics of Machine Learning

𝑚
Expanding (8.3) further, for every 𝑥 = ∑𝑗=1 𝑥𝑗 𝑢𝑗 we have

𝑚
𝑓(𝑥) = ∑ 𝑥𝑗 𝑓(𝑢𝑗 )
𝑗=1
𝑚 𝑛
= ∑ 𝑥𝑗 ∑ 𝑎𝑖,𝑗 𝑣𝑖
𝑗=1 𝑖=1
𝑛 𝑚
= ∑ ( ∑ 𝑎𝑖,𝑗 𝑥𝑗 )𝑣𝑖 .
𝑖=1 𝑗=1

Thus, the image of 𝑥 can be expressed as

Note that the matrix representation depends on the choice of the basis! If, say, 𝑃 = {𝑝1 , … , 𝑝𝑛 } ⊂ 𝑈 is the basis of our
matrix, we denote this dependence in the subscript, writing 𝐴𝑓,𝑃 .
To avoid confusion, we’ll almost exclusively define linear transformations by giving their matrices in the standard or-
thonormal basis. In practical scenarios, this makes it much easier to understand what is going on. So, whenever I write
something like “let 𝐴 be the matrix of a linear transformation 𝑓“, it is implictly assumed that 𝐴 is written in the basis
𝑒1 = (1, 0, … , 0), 𝑒2 = (0, 1, … , 0), … , 𝑒𝑛 = (0, 0, … , 1).
On a philosophical note, have you heard about Plato’s allegory of the cave? In this thought experiment, people are assumed
to be living in a cave constantly facing a single wall, only observing their shadows generated by a fire behind them. What
they observe and use to build an internal representation of the world is very different from reality. Applying this analogy
to linear algebra, matrices are the shadows that we observe and use in practical scenarios. In many introductory courses,
linear transformations are hidden, and only matrix calculus is taught. My first exposition into the subject was similar: the
first linear algebra course I took talked exclusively about matrices. It was as complicated and confusing as a math course
can be. (Which, I can assure you, can be very complicated and confusing.) Later in my studies, everything clicked when I
discovered that you could look at matrices from the perspective of linear transformations. Without seeing what is behind
matrices, it impossible to master linear algebra. If my approach feels too abstract for you, keep this in mind: years later,
when you are a practicing data scientist/machine learning engineer/researcher or whatever, going below the surface will
pay huge dividends.
Let’s get back on track and continue our discussion about linear transformations. The most commonly used matrix is the
matrix of the identity transformation id ∶ 𝑥 ↦ 𝑥. We’ll denote this by 𝐼. It is easy to see that

1 0 … 0
⎡0 1 … 0⎤
𝐼=⎢ ⎥. (8.4)
⎢⋮ ⋮ ⋱ ⋮⎥
⎣0 0 … 1⎦

To summarize, for a matrix 𝐴, a linear transformation can be given by 𝑥 ↦ 𝐴𝑥. In fact, the mapping

𝑓 ↦ 𝐴𝑓,𝑃

deﬁnes a one-to-one correspondence between the space of linear transformations 𝐿(𝑈 , 𝑉 ) deﬁned by (8.2) and the set of
𝑛 × 𝑚 matrices, where 𝑛 and 𝑚 are the corresponding dimensions.

8.3 Matrix operations revisited

Functions can be added and composed. Because of the connection between linear transformations and matrices, matrix
operations are inherited from the corresponding function operations.
With this principle in mind, we deﬁned matrix addition so that the the matrix of the sum of two linear transformations is
the sum of the corresponding matrices. Mathematically speaking, if 𝑓, 𝑔 ∶ 𝑈 → 𝑉 are two linear transformations with

8.3. Matrix operations revisited 107

Mathematics of Machine Learning

matrices, 𝑓 ↔ 𝐴 = (𝑎𝑖,𝑗 )𝑛,𝑚 𝑛,𝑚

𝑖,𝑗=1 and 𝑔 ↔ 𝐵 = (𝑏𝑖,𝑗 )𝑖,𝑗=1 , then

𝑛
(𝑓 + 𝑔)(𝑢𝑗 ) = 𝑓(𝑢𝑗 ) + 𝑔(𝑢𝑗 ) = ∑(𝑎𝑖,𝑗 + 𝑏𝑖,𝑗 )𝑣𝑖 .
𝑖=1

Thus, the corresponding matrices can be added together elementwise: 𝐴 + 𝐵 = (𝑎𝑖𝑗 + 𝑏𝑖,𝑗 )𝑛,𝑚
𝑖,𝑗=1 .

Multiplication between matrices is defined by the composition of the corresponding transformations. To see how, we
study a special case first. (In general, it is a good idea to look at special cases first, as they often reduce the complexity and
allows you to see patterns without information overload.) So, let 𝑓, 𝑔 ∶ 𝑈 → 𝑈 be two linear transformations, mapping
𝑈 onto itself. To determine the elements of the matrix corresponding to 𝑓 ∘ 𝑔, we have to express 𝑓(𝑔(𝑢𝑗 )) in terms of
all the basis vectors 𝑢1 , … , 𝑢𝑛 . For this, we have
𝑛
(𝑓𝑔)(𝑢𝑗 ) = 𝑓(𝑔(𝑢𝑗 )) = 𝑓( ∑ 𝑏𝑘,𝑗 𝑢𝑘 )
𝑘=1
𝑛
= ∑ 𝑏𝑘,𝑗 𝑓(𝑢𝑘 )
𝑘=1
𝑛 𝑛
= ∑ 𝑏𝑘,𝑗 ∑ 𝑎𝑖,𝑘 𝑢𝑖
𝑘=1 𝑖=1
𝑛 𝑛
= ∑ ( ∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 )𝑢𝑖 .
𝑖=1 𝑘=1

𝑛
By considering how we deﬁned a transformation’s matrix, the scalar ∑𝑘=1 𝑎𝑖,𝑘 𝑏𝑘,𝑗 is the element in the 𝑖-th row and 𝑗-th
𝑛 𝑛
column of 𝑓 ∘ 𝑔‘s matrix. Thus, matrix multiplication can be deﬁned by 𝐴𝐵 = ( ∑𝑘=1 𝑎𝑖,𝑘 𝑏𝑘,𝑗 )𝑖,𝑗=1 .

In the general case, we can only deﬁne the product of matrices if the corresponding linear transformations can be com-
posed. That is, if 𝑓 ∶ 𝑈 → 𝑉 , then 𝑔 must start from 𝑉 . Translating this into the language of the matrices, the number
of columns in 𝐴 must match the number of rows in 𝐵. So, for any 𝐴 ∈ ℝ𝑛×𝑚 and 𝐵 ∈ ℝ𝑚×𝑙 , their product is deﬁned by
𝑚 𝑛,𝑙
𝐴𝐵 = ( ∑ 𝑎𝑖,𝑘 𝑏𝑘,𝑗 ) ∈ ℝ𝑛×𝑙 .
𝑘=1 𝑖,𝑗=1

8.4 Inverting linear transformations

Regarding linear transformations, the question of invertibility is extremely important. For example, have you encountered
a system of equations like this?
2𝑥1 + 𝑥2 = 5
𝑥1 − 3𝑥2 = −8
If we deﬁne
2 1 5 𝑥1
𝐴=[ ], 𝑏=[ ], 𝑥=[ ],
1 −3 −8 𝑥2

the above system can be written in the form 𝐴𝑥 = 𝑏. These are called linear equations.
How would you write the solution of such an equation? If there were be a matrix 𝐴−1 such that 𝐴−1 𝐴 is the identity
matrix 𝐼 (defined by (8.4)), then multiplying the equation 𝐴𝑥 = 𝑏 from the left by 𝐴−1 would yield the solution in the
form 𝑥 = 𝐴−1 𝑏.
The matrix 𝐴−1 is called the inverse matrix. It might not always exist, but when it does, it is extremely important for
several reasons. We’ll talk about linear equations later, but first, let’s study the fundamentals of invertibility! Here is the
general definition.

108 Chapter 8. Linear transformations

Mathematics of Machine Learning

Deﬁnition 7.4.1 (Inverse of a linear transformation.)

Let 𝑓 ∶ 𝑈 → 𝑉 be a linear transformation between the vector spaces 𝑈 and 𝑉 . We say that 𝑓 is invertible if there is a
linear transformation 𝑓 −1 such that 𝑓 −1 ∘ 𝑓 and 𝑓 ∘ 𝑓 −1 are the identity functions; that is,

𝑓 −1 (𝑓(𝑢)) = 𝑢,
𝑓(𝑓 −1 (𝑣)) = 𝑣

holds for all 𝑢 ∈ 𝑈 , 𝑣 ∈ 𝑉 . 𝑓 −1 is called the inverse of 𝑓.

Not all linear transformations are invertible. For instance, if 𝑓 maps all vectors to the zero vector, you cannot deﬁne an
inverse.
There are certain conditions that guarantee the existence of the inverse. One of the most important ones connect the
concept of basis with invertibility.

Theorem 7.4.1 (Invertibility of linear transformations.)

Let 𝑓 ∶ 𝑈 → 𝑉 be a linear transformation and let 𝑢1 , … , 𝑢𝑛 be a basis in 𝑈 . Then 𝐴 is invertible if and only if
𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 ) is a basis in 𝑉 .

The following proof is straightforward, but can be a bit overwhelming. Feel free to skip this at the ﬁrst reading, you can
always revisit it later.

Proof. As usual, the proof of the if and only if type theorems consist of two parts, as these statements involve two
implications.
(a) First, we prove that 𝑓 is invertible, then 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 ) is a basis. That is, we need to show that 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 )
is linearly indepenendent and every 𝑣 ∈ 𝑉 can be written as their linear combination.
Since 𝑓 is invertible, 𝑓(0) = 0, moreover there are no nonzero vectors 𝑢 ∈ 𝑈 such that 𝑓(𝑢) = 0. In other words, 0
cannot be written as the nontrivial linear combination of 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 ), from which Theorem 1.4.1 implies the linear
independence.
On the other hand, since 𝑓 is surjective, every 𝑣 ∈ 𝑉 can be obtained as 𝑣 = 𝑓(𝑢) for some 𝑢 ∈ 𝑈 . As 𝑢1 , … , 𝑢𝑛 is a
𝑛
basis, 𝑢 = ∑𝑖=1 𝛼𝑖 𝑢𝑖 . Thus,
𝑣 = 𝑓(𝑢)
𝑛
= 𝑓( ∑ 𝛼𝑖 𝑢𝑖 )
𝑖=1
𝑛
= ∑ 𝛼𝑖 𝑓(𝑢𝑖 ),
𝑖=1

showing that span(𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 )) = 𝑉 .

The linear independence 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 ) and the fact that it spans 𝑉 gives that it is indeed a basis.
(b) Now we prove the other implication: if 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 ) is a basis, then 𝑓 is invertible.
If 𝑓(𝑢1 ), … , 𝑓(𝑢𝑛 ) is indeed a basis, then every 𝑣 ∈ 𝑉 can be written as
𝑛 𝑛
𝑣 = ∑ 𝛼𝑖 𝑓(𝑢𝑖 ) = 𝑓( ∑ 𝛼𝑖 𝑢𝑖 ),
𝑖=1 𝑖=1

8.4. Inverting linear transformations 109

Mathematics of Machine Learning

which shows the surjectivity. Regarding the injectivity, if 𝑣 = 𝑓(𝑥) = 𝑓(𝑦) for some 𝑥, 𝑦 ∈ 𝑈 , then, since both 𝑥 and 𝑦
can be written as a linear combination of the 𝑢𝑖 basis vectors, we would have
𝑛 𝑛
𝑣 = 𝑓(𝑥) = 𝑓( ∑ 𝑥𝑖 𝑢𝑖 ) = ∑ 𝑥𝑖 𝑓(𝑢𝑖 )
𝑖=1 𝑖=1

and
𝑛 𝑛
𝑣 = 𝑓(𝑦) = 𝑓( ∑ 𝑦𝑖 𝑢𝑖 ) = ∑ 𝑦𝑖 𝑓(𝑢𝑖 ).
𝑖=1 𝑖=1

𝑛
Thus, 0 = ∑𝑖=1 (𝑥𝑖 − 𝑦𝑖 )𝑢𝑖 , and since 𝑢1 , … , 𝑢𝑛 is a basis in U, 𝑥𝑖 = 𝑦𝑖 must hold. Hence 𝑓 is injective. □

A consequence of this theorem is that a linear transformation 𝑓 ∶ 𝑈 → 𝑉 is not invertible if the dimensions of 𝑈 and 𝑉
are diﬀerent. We can look at invertibility from the aspect of matrices as well. For any 𝐴 ∈ 𝐹 𝑛×𝑛 , if the corresponding
linear transformation is invertible, there exists a matrix 𝐴−1 ∈ 𝔽𝑛×𝑛 such that 𝐴−1 𝐴 = 𝐴𝐴−1 = 𝐼. Not surprisingly,
we call 𝐴−1 the inverse of 𝐴. If a matrix is not square, it is not invertible in the classical sense.

8.4.1 The kernel and the image

Regarding the invertibility of a linear transformation, two special sets play an essential role: the kernel and the image.
Let’s see them!

Deﬁnition 7.4.2 (Kernel and image of linear transformations.)

Let 𝑓 ∶ 𝑈 → 𝑉 be a linear transformation. Its image and kernel is deﬁned by

im𝑓 ∶= {𝑓(𝑢) ∶ 𝑢 ∈ 𝑈 }

and

ker 𝑓 ∶= {𝑢 ∈ 𝑈 ∶ 𝑓(𝑢) = 0}.

Often, we write im𝐴 and ker 𝐴 for some matrix 𝐴, referring to the linear transformation deﬁned by 𝑥 ↦ 𝐴𝑥. Due to the
linearity of 𝑓, it is easy to see that im𝑓 is a subspace of 𝑉 and ker 𝑓 is a subspace of 𝑈 . As mentioned, they are closely
connected with invertibility, as we shall see next.

Theorem 7.4.2 (Invertibility in terms of linear transformations.)

Let 𝑓 ∶ 𝑈 → 𝑉 be a linear transformation.
(a) 𝐴 is injective if and only if ker 𝑓 = {0}.
(b) 𝐴 is surjective if and only if im𝑓 = 𝑉 .
(c) 𝐴 is bijective (that is, invertible) if and only if ker 𝑓 = {0} and im𝑓 = 𝑉 .

Proof. (a) If 𝑓 is injective, there can only be one vector in 𝑈 that is mapped to 0. Since 𝑓(0) = 0 for any linear
transformation, ker 𝑓 = {0}.
On the other hand, if there are two diﬀerent vectors 𝑥, 𝑦 ∈ 𝑈 such that 𝑓(𝑥) = 𝑓(𝑦), then 𝑓(𝑥 − 𝑦) = 𝑓(𝑥) − 𝑓(𝑦) = 0,
so 𝑥 − 𝑦 ∈ ker 𝑓. Thus, ker 𝑓 = {0} implies 𝑥 = 𝑦, which gives the injectivity.

110 Chapter 8. Linear transformations

Mathematics of Machine Learning

(b) This is just the deﬁnition of surjectivity.

8.4.2 The inverse matrix

Because matrices deﬁne linear transformations, it makes sense to talk about the inverse of a matrix.
Algebraically speaking, the inverse of an 𝐴 ∈ ℝ𝑛×𝑛 is the matrix 𝐴−1 ∈ ℝ𝑛×𝑛 such that 𝐴−1 𝐴 = 𝐴𝐴−1 = 𝐼 holds.
The connection between linear transforms and matrices imply that 𝐴−1 is the matrix of 𝑓 −1 , so no surprise here.
Don’t worry if this section about invertibility feels a bit too much of algebra. Later, when talking about the determinant
of a transformation, we are going to study invertibility from a geometric perspective. In terms of matrices, later we are
going to see a general method to calculate the inverse matrix.

8.5 Change of basis

Previously in this section, we have seen that any linear transformation can be described with the images of the basis vectors.
This gave us the matrix representation that we use all the time. However, this very much depends on the choice of basis.
Diﬀerent bases yield diﬀerent matrices for the same transformation.
For instance, let’s take a look at 𝑓 ∶ ℝ2 → ℝ2 that maps 𝑒1 = (1, 0) to the vector (2, 1) and 𝑒2 = (0, 1) to (1, 2). Its
matrix in the standard orthonormal basis 𝐸 = {𝑒1 , 𝑒2 } is given by

2 1
𝐴𝑓,𝐸 = [ ]. (8.5)
1 2

The eﬀect of 𝑓 is visualized in Fig. 8.5.

What if we select a diﬀerent basis, say 𝑃 = {𝑝1 = (1, 1), 𝑝2 = (−1, 1)}? With a quick calculation, we can check that

2 1 1 3 2 1 −1 −1
[ ][ ] = [ ], [ ][ ] = [ ].
1 2 1 3 1 2 1 1

In other words, 𝑓(𝑝1 ) = 3𝑝1 + 0𝑝2 and 𝑓(𝑝2 ) = 0𝑝1 + 𝑝2 . This is visualized by Fig. 8.6.
This means that if 𝑃 = {𝑝1 , 𝑝2 } is our basis (thus, if writing (𝑎, 𝑏) means 𝑎𝑝1 + 𝑏𝑝2 ), the matrix of 𝑓 becomes

3 0
𝐴𝑓,𝑃 = [ ].
0 1

In this form, 𝐴𝑓,𝑃 is a diagonal matrix. (That is, its elements below and above the diagonal are zero.) As you can
see, having the right basis can signiﬁcantly simplify the linear transformation. For instance, in 𝑛 dimensions, applying a
transformation in diagonal form requires only 𝑛 operations, as

𝑑1 0 … 0 𝑥1 𝑑1 𝑥1
⎡0 𝑑2 … 0 ⎤ ⎡ 𝑥2 ⎤ ⎡ 𝑑2 𝑥2 ⎤
⎢ ⎥⎢ ⎥ = ⎢ ⎥
⎢⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥
⎣0 0 … 𝑑𝑛 ⎦ ⎣𝑥𝑛 ⎦ ⎣𝑑𝑛 𝑥𝑛 ⎦

Otherwise, 𝑛2 operations are needed. So, we can save a lot there.

8.5. Change of basis 111

Mathematics of Machine Learning

Fig. 8.5: The linear transformation 𝑓, deﬁned by (8.5).

Fig. 8.6: The eﬀect of 𝑓 on 𝑝1 = (1, 1) and 𝑝2 = (−1, 1).

112 Chapter 8. Linear transformations

Mathematics of Machine Learning

8.5.1 The transformation matrix

We have just seen that the matrix of a linear transformation depends on our choice of basis. However, there is a special
relation between matrices of the same transformation. We’ll explore this next. Let 𝑓 ∶ 𝑈 → 𝑈 be a linear transformation,
and let 𝑃 = {𝑝1 , … , 𝑝𝑛 }, 𝑄 = {𝑞1 , … , 𝑞𝑛 } be two bases. As before, 𝐴𝑓,𝑆 denotes the matrix of 𝑓 in some basis 𝑆.
Suppose that we know 𝐴𝑓,𝑃 , but we have our vectors represented in terms of the other basis 𝑄. How do we calculate the
images our vectors under the linear transformation? A natural idea is to ﬁrst transform our vector representations from 𝑄
to 𝑃 , apply 𝐴𝑓,𝑃 , then transform the representations back. In the following, we are going to make this precise.
Let 𝑡 ∶ 𝑈 → 𝑈 be a transformation deﬁned by 𝑝𝑖 ↦ 𝑞𝑖 for all 𝑖 ∈ {1, … , 𝑛}. Since 𝑃 and 𝑄 are bases (so the sets are
linearly independent), 𝑡 is invertible. Suppose that the matrix 𝐴𝑓,𝑄 = (𝑎𝑄 𝑛
𝑖,𝑗 )𝑖,𝑗=1 is known to us, that is,

𝑛
𝑓(𝑞𝑗 ) = 𝐴𝑓,𝑄 𝑞𝑗 = ∑ 𝑎𝑄
𝑖,𝑗 𝑞𝑖
𝑖=1

holds for all 𝑗. So, we have

(𝑡−1 𝑓𝑡)(𝑝𝑗 ) = 𝑡−1 𝑓(𝑞𝑗 )
𝑛
= 𝑡−1 ( ∑ 𝑎𝑄
𝑖,𝑗 𝑞𝑖 )
𝑖=1
𝑛
= ∑ 𝑎𝑄 −1
𝑖,𝑗 𝑡 (𝑞𝑖 )
𝑖=1
𝑛
= ∑ 𝑎𝑄
𝑖,𝑗 𝑝𝑖 .
𝑖=1

In other words, the matrix of the composed transformation 𝑡−1 𝑓𝑡 in the basis 𝑃 is the same as the matrix of 𝑓 in 𝑄. In
terms of formulas,

𝑇 −1 𝐴𝑓,𝑃 𝑇 = 𝐴𝑓,𝑄 , (8.6)

where 𝑇 denotes the matrix of 𝑡 in 𝑃 . (For notational simplicity, we omit the subscript. Most often, we don’t care what
base it is in.)
We’ll call 𝑇 the change of basis matrix. These types of relations are prevalent in linear algebra, so we’ll take the time to
introduce a deﬁnition formally.

Deﬁnition 7.5.1 (Similar matrices.)

Let 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 be two arbitrary matrices. 𝐴 and 𝐵 are called similar if there exists a matrix 𝑇 ∈ ℝ𝑛×𝑛 such that

𝐵 = 𝑇 −1 𝐴𝑇

holds. We call functions of the form 𝐴 ↦ 𝑇 −1 𝐴𝑇 similarity transformations.

In these terms, (8.6) says that the matrices of a given linear transformation are all similar to each other. This holds true
the other way around: if matrices are similar to each other, then they are coming from the same linear transformation.
With this under our belt, we can ﬁnish up with the example (8.5). In this case, 𝑇 and 𝑇 −1 can be written as

1 −1 1/2 1/2
𝑇 =[ ], 𝑇 −1 = [ ].
1 1 −1/2 1/2

(Later, we’ll see a general method to compute the inverse of any matrix, but for now, you can verify this by hand.) Thus,

1/2 1/2 2 1 1 −1 3 0
[ ][ ][ ]=[ ]. (8.7)
−1/2 1/2 1 2 1 1 0 1

8.5. Change of basis 113

Mathematics of Machine Learning

Fig. 8.7 shows how (8.7) looks like in geometric terms. From this example, we can see that a properly selected similarity
transformation can diagonalize certain matrices. Is this a coincidence? Spoiler alert: no. In a later chapter, we will see
exactly when and how this can be done.

Fig. 8.7: Change of basis, illustrated.

8.6 Linear transformations in the Euclidean plane

We have just seen that a linear transformation can be described by the image of a basis set. From a geometric viewpoint,
they are functions mapping parallelepipeds to parallelepipeds.
Because of the linearity, you can imagine this as distorting the grid determined by the bases.
In two dimensions, we have seen a few examples of geometric maps such as scaling and rotation as linear transformations.
Now we can put them into matrix form. There are ﬁve of them in particular that we will study: stretching, shearing,
rotation, reﬂection, and projection.
These simple transformations are not only essential to build intuition, but they are also frequently applied in computer
vision. Flipping, rotating, and stretching are essential parts of image augmentation pipelines, greatly enhancing the per-
formance of models.

114 Chapter 8. Linear transformations

Mathematics of Machine Learning

Fig. 8.8: How linear transforms distort the grid determined by the basis vectors.

8.6.1 Stretching

The simplest one is a generalization of scaling. We have seen a variant of this in Example 1 above. In matrix form, this
is given by

𝑐1 0
𝐴=[ ], 𝑐1 , 𝑐2 ∈ ℝ.
0 𝑐2

Linear transformations such as this can be visualized by plotting the image of the unit square determined by the standard
basis 𝑒1 = (1, 0), 𝑒2 = (0, 1).

8.6.2 Rotation

Rotations are given by the matrix

cos 𝛼 − sin 𝛼
𝑅𝛼 = [ ].
sin 𝛼 cos 𝛼

To see why, recall that each column of the transformation’s matrix describes the image of the basis vectors. The rotation
of (1, 0) is given by (cos 𝛼, sin 𝛼), while the rotation of (0, 1) is (cos(𝛼 + 𝜋/2), sin(𝛼 + 𝜋/2)). This is illustrated by
Fig. 8.10.
Like above, we can visualize the image of the unit square to gain a geometric insight into what is happening.

8.6. Linear transformations in the Euclidean plane 115

Mathematics of Machine Learning

Fig. 8.9: Stretching.

Fig. 8.10: The rotation matrix explained.

116 Chapter 8. Linear transformations

Mathematics of Machine Learning

Fig. 8.11: Rotation.

8.6.3 Shearing

Another essential geometric transform is shearing, which is frequently applied in physics. A shearing force is a pair of
forces with opposite directions, acting on the same body.
Its matrix is given by

1 𝑎
𝑆=[ ].
0 1

8.6.4 Reﬂection

Until this point, all the transformations we have seen in the Euclidean plane had preserved the “orientation” of the space.
However, this is not always the case. The transformation given by the matrices

−1 0 1 0
𝑅1 = [ ], 𝑅2 = [ ]
0 1 0 −1

act as reﬂections with respect to the 𝑦 and the 𝑥 axes..

When combined with a rotation, we can use reﬂections to ﬂip bases. For instance, the transformation

0 −1 1 0 0 1
𝑅= [ ] [ ]=[ ]
1
⏟ ⏟ 0 0 −1 1 0
rotation with 𝜋/2 =𝑅2

maps 𝑒1 to 𝑒2 and 𝑒2 to 𝑒1 .
These types of transformations play an essential role in understanding determinants, as we will soon see in the next chapter.

8.6. Linear transformations in the Euclidean plane 117

Mathematics of Machine Learning

Fig. 8.12: Shearing.

Fig. 8.13: Reﬂection.

118 Chapter 8. Linear transformations

Mathematics of Machine Learning

Fig. 8.14: Swapping 𝑒1 and 𝑒2 is a reﬂection and rotation.

In general, reﬂections can be easily deﬁned in higher dimensional spaces. For instance,

1 0 0
𝑅=⎡
⎢0 1 0⎤⎥
⎣0 0 −1⎦

is a reflection in ℝ3 that flips 𝑒3 to the opposite direction. It is just like looking in the mirror: it turns left to right and
right to left.
Reflections can flip orientations multiple times. The transformation given by

1 0 0
𝑅=⎡
⎢0 −1 0⎤⎥
⎣0 0 −1⎦

ﬂips 𝑒2 and 𝑒3 , changing the orientation twice. Later, we’ll see that the “number of changes in orientation” of a given
transformation is one of its essential descriptors.

8.6.5 Orthogonal projection

One of the most important transformations (not only in two dimensions) is the orthogonal projection. We have seen this
already when talking about inner products and their geometric representation. By taking a closer look, it turns out that they
are linear transformations.
Recall from (5.6) that the orthogonal projection of 𝑥 to some 𝑦 can be written as

⟨𝑥, 𝑦⟩
proj𝑦 (𝑥) = 𝑦. (8.8)
⟨𝑦, 𝑦⟩

8.6. Linear transformations in the Euclidean plane 119

Mathematics of Machine Learning

Fig. 8.15: Orthogonal projection.

The bilinearity of ⟨⋅, ⋅⟩ immediately implies that proj𝑦 (𝑥) is also linear. With a bit of algebra, we can rewrite this in terms
of matrices. We have
⟨𝑥, 𝑦⟩
proj𝑦 (𝑥) = 𝑦 (8.9)
⟨𝑦, 𝑦⟩
𝑥 𝑦 + 𝑥 𝑦 𝑦1
= 1 1 2 2 2 (8.10)
[ ]
‖𝑦‖ 𝑦2
1 2
𝑦 𝑦1 𝑦2 𝑥1
= [ 1 ] [(8.11)
],
‖𝑦‖ 2 𝑦 𝑦
1 2 𝑦22 𝑥2

thus,

1 𝑦2 𝑦1 𝑦2
proj𝑦 = [ 1 ].
‖𝑦‖ 𝑦1 𝑦2
2 𝑦22

Notice that proj𝑦 (𝑒2 ) = 𝑦𝑦2 proj𝑦 (𝑒1 ), so the images of the standard basis vectors are not linearly independent. As a
1
consequence, the image of the plane under proj𝑦 is span(𝑦), which is a one-dimensional subspace. From this example,
we can see that the image of a vector space under a linear transformation is not necessarily of the same dimension as the
starting space.
With these examples and knowledge under our belt, we have a basic understanding of linear transformations, the most
basic building blocks of neural networks. In the next chapter, we will study how linear transformations aﬀect the geometric
structure of the vector space.

120 Chapter 8. Linear transformations

Mathematics of Machine Learning

8.7 Problems

Problem 1. Show that if 𝐴 ∈ ℝ𝑛×𝑛 is an invertible matrix, then (𝐴−1 )𝑇 = (𝐴𝑇 )−1 .
Problem 2. Let 𝑅𝛼 be the two-dimensional rotation matrix deﬁned by

cos 𝛼 − sin 𝛼
𝑅𝛼 = [ ].
sin 𝛼 cos 𝛼

Show that 𝑅𝛼 𝑅𝛽 = 𝑅𝛼+𝛽 .

Problem 3. Let 𝐴 = (𝑎𝑖,𝑗 )𝑛𝑖,𝑗=1 ∈ ℝ𝑛×𝑛 be a matrix and let 𝐷 ∈ ℝ𝑛×𝑛 be a diagonal matrix deﬁned by

𝑑1 0 … 0
𝐷=⎡
⎢0 𝑑2 … 0⎤ ⎥,
⎣0 0 … 𝑑𝑛 ⎦

where all of its elements are zero outside the diagonal. Show that

𝑑1 𝑎1,1 𝑑2 𝑎1,2 … 𝑑𝑛 𝑎1,𝑛

⎡𝑑 𝑎 𝑑2 𝑎2,2 … 𝑑𝑛 𝑎2,𝑛 ⎤
𝐷𝐴 = ⎢ 1 2,1 ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣𝑑1 𝑎𝑛,1 𝑑2 𝑎𝑛,2 … 𝑑𝑛 𝑎𝑛,𝑛 ⎦

and
𝑑1 𝑎1,1 𝑑1 𝑎1,2 … 𝑑1 𝑎1,𝑛
⎡𝑑 𝑎 𝑑2 𝑎2,2 … 𝑑2 𝑎2,𝑛 ⎤
𝐴𝐷 = ⎢ 2 2,1 ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣𝑑𝑛 𝑎𝑛,1 𝑑𝑛 𝑎𝑛,2 … 𝑑𝑛 𝑎𝑛,𝑛 ⎦

Problem 4. Let ‖ ⋅ ‖ be a norm on ℝ𝑛 , and let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary matrix.

Show that 𝐴 is invertible if and only if the function

‖𝑥‖∗ ∶= ‖𝐴𝑥‖

is a norm on ℝ𝑛 .
Problem 5. Let 𝑈 be a normed space and 𝑓 ∶ 𝑈 → 𝑈 be a linear transformation.
If

‖𝑥‖∗ ∶= ‖𝑓(𝑥)‖

is a norm, is 𝑓 necessarily invertible?

Hint: no. Consider the vector space ℝ[𝑥] with the norm
𝑛 1/2 𝑛
‖𝑝‖ = ( ∑ 𝑝𝑖2 ) , 𝑝(𝑥) = ∑ 𝑝𝑖 𝑥𝑖
𝑖=0 𝑖=0

and the linear transformation 𝑓 ∶ 𝑝(𝑥) ↦ 𝑥𝑝(𝑥).

Problem 6. Let ⟨⋅, ⋅, ⟩ be an inner product on ℝ𝑛 . Show that there is a matrix 𝐴 ∈ ℝ𝑛×𝑛 such that

⟨𝑥, 𝑦⟩ = 𝑥𝑇 𝐴𝑦, 𝑥, 𝑦 ∈ ℝ𝑛 .

(Recall that we treat vectors 𝑥, 𝑦 ∈ ℝ𝑛 as column vectors.)

8.7. Problems 121

Mathematics of Machine Learning

Problem 7. Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix. 𝐴 is called positive deﬁnite if 𝑥𝑇 𝐴𝑥 > 0 for every nonzero 𝑥 ∈ ℝ𝑛 .
Show that 𝐴 is positive deﬁnite if and only if

⟨𝑥, 𝑦⟩ ∶= 𝑥𝑇 𝐴𝑦

is an inner product.
Problem 8. Let 𝐴 ∈ ℝ𝑛×𝑚 be a matrix, and denote its columns by 𝑎1 , … , 𝑎𝑛 ∈ ℝ𝑛 .
(a) Show that for all 𝑥 ∈ ℝ𝑚 , we have 𝐴𝑥 ∈ span(𝑎1 , … , 𝑎𝑛 ).
(b) Let 𝐵 ∈ ℝ𝑚×𝑘 , and denote the columns of 𝐴𝐵 by 𝑣1 , … , 𝑣𝑘 ∈ ℝ𝑛 . Show that

𝑣1 , … , 𝑣𝑘 ∈ span(𝑎1 , … , 𝑎𝑛 ).

Problem 9. Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix. Show that

⟨𝐴𝑥, 𝑦⟩ = ⟨𝑥, 𝐴𝑇 𝑦⟩

holds for all 𝑥, 𝑦 ∈ ℝ𝑛 , where ⟨⋅, ⋅⟩ is the Euclidean inner product.

122 Chapter 8. Linear transformations

CHAPTER

NINE

DETERMINANTS, OR HOW LINEAR TRANSFORMATIONS AFFECT

VOLUME

In the previous sections, we have seen that linear transformations can be thought as distorting the grid determined by the
basis vectors.
Following our geometric intuition, we suspect that measuring how much a transformation distorts volume and distance
can provide some valuable insight. As we will see in this chapter, this is exactly the case. Transformations that preserve
distance or norm are special, giving rise to methods such as Principal Component Analysis.

9.1 Area in the Euclidean plane

9.1.1 How linear transformations scale the area

Let’s go back to the Euclidean plane one more time. Consider any linear transformation 𝐴, mapping the unit square to a
parallelogram.

Fig. 9.1: Image of the unit square under a linear transformation.

The area of this parallelogram describes how 𝐴 scales the unit square. Let’s call it 𝜆 for now; that is,

area(𝐴(𝐶)) = 𝜆 ⋅ area(𝐶),

123
Mathematics of Machine Learning

where 𝐶 = [0, 1] × [0, 1] is the unit square, and 𝐴(𝐶) is its image

𝐴(𝐶) ∶= {𝐴𝑥 ∶ 𝑥 ∈ 𝐶}.

Due to linearity, 𝜆 also matches the scaling ratio between the area of any rectangle (with parallel sides to the coordinate
axes) and its image under 𝐴. As Fig. 9.2 shows, we can approximate any planar object as the union of rectangles.
If all rectangles are scaled by 𝜆, then unions of rectangles also scale by that factor. Thus, it follows that 𝜆 is also the
scaling ratio between any planar object 𝐸 and its image 𝐴(𝐸) = {𝐴𝑥 ∶ 𝑥 ∈ 𝐸}.

Fig. 9.2: Approximating planar objects with a union of rectangles.

This quantity 𝜆 reveals a lot about the transformation itself, but there is a question remaining: how can we calculate it?

9.1.2 Calculating the scaling factor of the area

Suppose that our linear transformation is given by

𝑥1 𝑦1
𝐴=[ ],
𝑥2 𝑦2

thus its columns 𝑥 = (𝑥1 , 𝑥2 ) and 𝑦 = (𝑦1 , 𝑦2 ) describe the two sides of the parallelogram. This is the image of the unit
square.
Our area scaling factor 𝜆 equals the area of this parallelogram, so our goal is to calculate this.
The area of any parallelogram can be calculated by multiplying the length of the base (‖𝑥‖ in this case) with the height
ℎ. (You can easily see this by cutting oﬀ a triangle at the right side of the parallelogram and putting it to the left side,
rearranging it as a rectangle.) ℎ is unknown, but with basic trigonometry, we can see that ℎ = sin 𝛼‖𝑦‖, where 𝛼 is the
angle between 𝑥 and 𝑦.
Thus,

area = sin 𝛼‖𝑦‖‖𝑥‖.

This is almost the dot product of 𝑥 and 𝑦. (Recall that the dot product can be written as ⟨𝑥, 𝑦⟩ = ‖𝑥‖‖𝑦‖ cos 𝛼.) However,
the sin 𝛼 part is not a match.

Fortunately, there is a clever trick we can use to turn this into a dot product! Since sin 𝛼 = cos (𝛼 − 𝜋2 ), we have

𝜋
area = cos (𝛼 − )‖𝑥‖‖𝑦‖.
2

124 Chapter 9. Determinants, or how linear transformations aﬀect volume

Mathematics of Machine Learning

Fig. 9.3: Image of the unit square under a linear transformation.

The issue is, the angle between 𝑥 and 𝑦 is not 𝛼 − 𝜋2 . However, we can solve this easily by applying a rotation. Applying
the transformation
0 1
𝑅−𝜋/2 = [ ],
−1 0

we obtain 𝑦rot = 𝑅−𝜋/2 𝑦 = (𝑦2 , −𝑦1 ). Since ‖𝑦rot ‖ = ‖𝑦‖, we have

area = sin 𝛼‖𝑦‖‖𝑥‖

𝜋
= cos (𝛼 − )‖𝑥‖‖𝑦‖
2
𝜋
= cos (𝛼 − )‖𝑥‖‖𝑦rot ‖
2
= ⟨𝑥, 𝑦rot ⟩.

The quantity ⟨𝑥, 𝑦rot ⟩ can be calculated using only the elements of the matrix 𝐴:

⟨𝑥, 𝑦rot ⟩ = 𝑥1 𝑦2 − 𝑥2 𝑦1 .

Notice that ⟨𝑥, 𝑦rot ⟩ can be negative! This happens when the angle between 𝑦 = 𝐴𝑒2 and 𝑥 = 𝐴𝑒1 , measured from a
counter-clockwise direction, is larger than 𝜋, as this implies cos (𝛼 − 𝜋2 ) < 0.

Hence, the quantity ⟨𝑥, 𝑦rot ⟩ is called the signed area of the parallelogram.
In two dimensions, we call this the determinant of the linear transformation. That is, for any given linear transforma-
tion/matrix 𝐴 ∈ ℝ2×2 , its determinant is deﬁned by

𝑎 𝑏
det 𝐴 = 𝑎𝑑 − 𝑐𝑏, 𝐴=[ ]. (9.1)
𝑐 𝑑

The determinant is often written as |𝐴|, but we’ll avoid this notation. We’ll deal with determinants for any matrix 𝐴 ∈
ℝ𝑛×𝑛 , but let’s stay with the 2 × 2 case just a bit to build intuition.
The determinant also reveals the orientation of the vectors: positive determinant means positive orientation, negative de-
terminant means negative orientation. (As mentioned earlier, positive orientation means that the angle measured between
𝑥 and 𝑦 in a counter-clockwise direction is between 0 and 𝜋.) This is demonstrated in Fig. 9.4 below.

9.1. Area in the Euclidean plane 125

Mathematics of Machine Learning

Fig. 9.4: Orientation of two vectors in the plane.

Overall,

area(𝐴(𝐸)) = | det 𝐴|area(𝐸) (9.2)

holds. Even though we have only shown this in two dimensions, this holds in general. (Although we don’t know how to
deﬁne the determinant there yet.)
So, if 𝑒1 and 𝑒2 is a basis on the plane, equations (9.1) and (9.2) tell us that the determinant in two dimensions equals to

det 𝐴 = orientation(𝐴𝑒1 , 𝐴𝑒2 ) × area(𝐴𝑒1 , 𝐴𝑒2 ) (9.3)

Based on the example of the Euclidean plane, we have built enough geometric intuition on understanding how linear trans-
formations distort volume and change the orientation of the space. These are described by the concept of determinants,
which we have deﬁned in the special case (9.1). We are going to move on to study the concept in its full generality.

9.2 Determinants, orientation, and volume

To introduce the formal definition of the determinant, we will take a route that is different from the usual. Most commonly,
the determinant of a linear transformation 𝐴 is defined straight away with a complicated formula, then all of its geometric
properties are shown.
Instead of this, we will deduce the determinant formula by generalizing the geometric notion we have learned in the
previous section. Here, we are roughly going to follow the outline of [[Lax07]].
We set the foundations by introducing some key notations. Let 𝐴 = (𝑎𝑖,𝑗 )𝑛𝑖,𝑗=1 ∈ ℝ𝑛×𝑛 be a matrix with columns
𝑎1 , … , 𝑎𝑛 . When we introduced the notion of matrices as linear transformations, we have seen that the 𝑖-th column is the
image of the 𝑖-th basis vector. For simplicity, let’s assume that 𝑒1 , 𝑒2 , … , 𝑒𝑛 is the standard orthonormal basis, that is, 𝑒𝑖
is the vector whose 𝑖-th coordinate is 1 and the rest is 0. Thus, 𝐴𝑒𝑖 = 𝑎𝑖 .
During our explorations in the Euclidean plane, we have seen that the determinant is the orientation of the images of basis
vectors, times the area of the parallelogram defined by them. Following this logic, we could define the determinant for
𝑛 × 𝑛 matrices by

det 𝐴 = orientation(𝐴𝑒1 , … , 𝐴𝑒𝑛 ) × volume(𝐴𝑒1 , … , 𝐴𝑒𝑛 ) (9.4)

126 Chapter 9. Determinants, or how linear transformations aﬀect volume

Mathematics of Machine Learning

Two questions surface immediately. First, how do we define the orientation of multiple vectors in the 𝑛-dimensional
space? Second, how can we even calculate the area?
Instead of finding the answers for these questions, we are going to put a twist into the story: first, we’ll find a convenient
formula for determinants, then use it to define orientation.

9.2.1 The multi-linearity of determinants

To make the relation between the determinant and the columns of the matrix 𝑎𝑖 = 𝐴𝑒𝑖 more explicit, we’ll write

det 𝐴 = det(𝑎1 , … , 𝑎𝑛 ).

Thinking about determinants this way, det is just a function of multiple variables:

det ∶ ℝ 𝑛 × ⋯ × ℝ𝑛 → ℝ.
⏟⏟⏟⏟⏟
𝑛 times

Good news: det(𝑎1 , … , 𝑎𝑛 ) is linear in each variable. That is,

det(𝑎1 , … , 𝛼𝑎𝑖 + 𝛽𝑏𝑖 , … 𝑎𝑛 ) = 𝛼 det(𝑎1 , … , 𝑎𝑖 , … 𝑎𝑛 ) + 𝛽 det(𝑎1 , … , 𝑏𝑖 , … 𝑎𝑛 )

holds. We are not going to prove this, but as the determinant represents the signed volume, you can convince yourself by
checking out Fig. 9.5.

Fig. 9.5: The multilinearity of det(𝑎1 , 𝑎2 ).

A consequence of linearity is that we can express the determinant as a linear combination of determinants for the standard
𝑛
basis vectors 𝑒1 , … , 𝑒𝑛 . For instance, consider the following. Since 𝐴𝑒1 = 𝑎1 = ∑𝑖=1 𝑎𝑖,1 𝑒𝑖 , we have
𝑛
det(𝑎1 , 𝑎2 , … , 𝑎𝑛 ) = ∑ 𝑎𝑖,1 det(𝑒𝑖 , 𝑎2 , … , 𝑎𝑛 ).
𝑖=1

𝑛
Going one step further and using that 𝑎2 = ∑𝑗=1 𝑎𝑗,2 𝑒𝑗 , we start noticing a pattern. With the linearity, we have
𝑛 𝑛
det(𝑎1 , 𝑎2 , … , 𝑎𝑛 ) = ∑ ∑ 𝑎𝑖,1 𝑎𝑗,2 det(𝑒𝑖 , 𝑒𝑗 , 𝑎3 , … , 𝑎𝑛 ).
𝑖=1 𝑗=1

9.2. Determinants, orientation, and volume 127

Mathematics of Machine Learning

We can see that the row indices in the coeﬃcients 𝑎𝑖1 𝑎𝑗2 match the indices of 𝑒𝑘 -s in det(𝑒𝑖 , 𝑒𝑗 , 𝑎3 , … , 𝑎𝑛 ). In the general
case, this pattern can be formalized in terms of permutations. Expanding the determinant of 𝐴, we have
𝑛
det(𝑎1 , … , 𝑎𝑛 ) = ∑ [ ∏ 𝑎𝜎(𝑖),𝑖 ] det(𝑒𝜎(1) , … , 𝑒𝜎(𝑛) ).
𝜎∈𝑆𝑛 𝑖=1

𝑛
This formula is not the easiest one to understand. You can think about each term ∏𝑖=1 𝑎𝜎(𝑖),𝑖 as placing 𝑛 chess rooks
on a 𝑛 × 𝑛 board such that none of them can capture each other.

Fig. 9.6: The anatomy of the term 𝑎𝜎(1)1 … 𝑎𝜎(𝑛)𝑛 .

𝑛
The formula ∑𝜎∈𝑆 [ ∏𝑖=1 𝑎𝜎(𝑖),𝑖 ] det(𝑒𝜎(1) , … , 𝑒𝜎(𝑛) ) combines all the possible ways we can do this.
𝑛

128 Chapter 9. Determinants, or how linear transformations aﬀect volume

Mathematics of Machine Learning

9.2.2 Putting it all together

There is only one thing left: calculating det(𝑒𝜎(1) , … , 𝑒𝜎(𝑛) ).

Remember when we discussed the combination of reflections and rotations in the Euclidean plane? The transformation
determined by 𝑒𝑖 ↦ 𝑒𝜎(𝑖) is similar to that. When talking about permutations, we have mentioned that each one can
be obtained by switching two elements at a time. The number of transpositions - that is, permutations affecting two
elements - in a permutation is called the sign of 𝜎. In the context of our linear transformation 𝑒𝑖 ↦ 𝑒𝜎(𝑖) , the number of
transpositions in 𝜎 is the number of reflections required and sign(𝜎) is the orientation of (𝑒𝜎(1) , … , 𝑒𝜎(𝑛) ).
Thus, with these, we can finally give a formal definition for determinants and the orientation.

Deﬁnition 8.2.1 (Determinants and orientation.)

Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary matrix and let 𝑎𝑖 ∈ ℝ𝑛 be its 𝑖-th column. The determinant of 𝐴 is deﬁned by
𝑛
det 𝐴 = det(𝑎1 , … , 𝑎𝑛 ) = ∑ sign(𝜎)[ ∏ 𝑎𝜎(𝑖),𝑖 ], (9.5)
𝜎∈𝑆𝑛 𝑖=1

and the orientation of the vectors 𝑎1 , … , 𝑎𝑛 is

orientation(𝑎1 , … , 𝑎𝑛 ) ∶= sign( det 𝐴),

When the det notation is not convenient, we denote determinants by putting the elements of the matrix inside a big absolute
value sign:

𝑎 𝑎1,2 … 𝑎1,𝑛
∣ 1,1 ∣
𝑎 𝑎2,2 … 𝑎2,𝑛 ∣
det 𝐴 = ∣ 2,1 .
∣ ⋮ ⋮ ⋱ ⋮ ∣
∣𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑛 ∣

When I was a young math student, the determinant formula (9.5) was presented as-is in my ﬁrst linear algebra class.
Without explaining the connection to volume and orientation, it took me years to properly understand it. I still think that
the determinant is one of the most complex concepts in linear algebra, especially when presented without a geometric
motivation to the deﬁnition.

9.3 The recursive deﬁnition of determinants

Now that you have a basic understanding of the determinant, you might ask: how can we calculate it in practice? Summing
over the set of all permutations and calculating their sign is not an easy operation from a computational perspective.
Good news: there is a recursive formula for the determinant. Bad news: for an 𝑛 × 𝑛 matrix, it involves 𝑛 pieces of
(𝑛 − 1) × (𝑛 − 1) matrices. Still, it is a big step from the permutation formula. Let’s see it!

Theorem 8.3.1 (Recursive formula for determinants.)

Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary square matrix. Then
𝑛
det 𝐴 = ∑(−1)𝑗+1 𝑎1,𝑗 det 𝐴1,𝑗 , (9.6)
𝑗=1

where 𝐴𝑖,𝑗 is the (𝑛 − 1) × (𝑛 − 1) matrix obtained from 𝐴 by removing its 𝑖-th row and 𝑗-th column.

9.3. The recursive deﬁnition of determinants 129

Mathematics of Machine Learning

Instead of a proof, we are going to provide an example to demonstrate the formula. For 3 × 3 matrices, this is how it
looks:
𝑎 𝑏 𝑐
𝑒 𝑓 𝑑 𝑓 𝑑 𝑒
∣𝑑 𝑒 𝑓∣ = 𝑎 ∣ ∣−𝑏∣ ∣+𝑐∣ ∣.
ℎ 𝑖 𝑔 𝑖 𝑔 ℎ
𝑔 ℎ 𝑖

9.4 Fundamental properties

When working with determinants, we prefer to create basic building blocks and rules for combining them. (Like we have
seen this pattern so many times, even when deducing the formula (9.5).) These rules are manifested by the fundamental
properties of determinants, which we will discuss now. Usually, the proofs are some heavy computations based on the
formulas (9.5) and (9.6), so I am going to be a bit unorthodox here. Instead of providing fully ﬂeshed-out proofs, I’ll give
intuitive explanations. After all, we want to build algorithms using mathematics, not building mathematics.
The ﬁrst property is concerned with the relation of composition and the determinant.

Theorem 8.4.1
Let 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 be two matrices. Then

det(𝐴𝐵) = det(𝐴) det(𝐵). (9.7)

The explanation for this is quite simple. If we think about the matrices 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 as linear transformations, we have
just seen that det(𝐴) and det(𝐵) determine how they scale the unit cube.
Since the composition of these linear transformations is the matrix product 𝐴𝐵, the linear transformation 𝐴𝐵 scales the
unit cube to a parallelepiped with signed volume det(𝐴) det(𝐵). (Because applying 𝐴𝐵 is the same as applying 𝐵 ﬁrst,
then applying 𝐴 on the result.)
Thus, by our understanding of the determinant, as the scaling factor of 𝐴𝐵 is also det(𝐴𝐵), (9.7) holds.
We can do the actual proof of this, for example, by induction based on the recursive formula (9.6), leading to a long and
involved calculation.
An immediate corollary of the product rule is a special relation between the determinants of a matrix and its inverse.

Theorem 8.4.2
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary invertible matrix. Then det 𝐴−1 = (det 𝐴)−1 .

Proof. Using the product rule, we have

1 = det 𝐼 = det(𝐴𝐴−1 ) = (det 𝐴)(det 𝐴−1 ),

from which the theorem follows. □

Because of this, we can also conclude that the determinant is preserved by the similarity relation.

Theorem 8.4.3
Let 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 be two similar matrices with 𝐵 = 𝑇 −1 𝐴𝑇 for some 𝑇 ∈ ℝ𝑛×𝑛 . Then det 𝐴 = det 𝐵.

130 Chapter 9. Determinants, or how linear transformations aﬀect volume

Mathematics of Machine Learning

Proof. This simply follows from

det 𝐵 = det 𝑇 −1 𝐴𝑇
= det 𝑇 −1 det 𝐴 det 𝑇
= det 𝐴,
which is what we had to show. □

Another important consequence is that the determinant is independent of the basis the matrix is in. If 𝐴 ∶ 𝑈 → 𝑈 is a
linear transformation and 𝑃 = {𝑝1 , … , 𝑝𝑛 }, 𝑅 = {𝑟1 , … , 𝑟𝑛 } are two bases of 𝑈 , then we know that the matrices of the
transformation are related by

𝐴𝑃 = 𝑇 −1 𝐴𝑅 𝑇 ,

where 𝐴𝑆 is the matrix of the transformation 𝐴 in a basis 𝑆 and 𝑇 ∈ ℝ𝑛×𝑛 is the change of basis matrix. Using the
previous theorem, this implies that det 𝐴𝑃 = det 𝐴𝑅 . Thus, the determinant is properly deﬁned for linear transformations,
not just matrices!

9.4.1 Swapping rows with columns

There is an essential duality relation regarding determinants: you can swap the rows and columns of a matrix, keeping all
determinant-related identities true.

Theorem 8.4.4
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary matrix. Then det 𝐴 = det 𝐴𝑇 .

Proof. Suppose that 𝐴 = (𝑎𝑖,𝑗 )𝑛𝑖,𝑗=1 . Let’s denote the elements of its transpose by 𝑎𝑡𝑖,𝑗 = 𝑎𝑗,𝑖 . According to (9.5), we
have
𝑛
det 𝐴𝑇 = ∑ sign(𝜎) ∏ 𝑎𝑡𝜎(𝑖),𝑖
𝜎∈𝑆𝑛 𝑖=1
𝑛
= ∑ sign(𝜎) ∏ 𝑎𝑖,𝜎(𝑖) .
𝜎∈𝑆𝑛 𝑖=1
𝑛
Now comes the trick. Since the product ∏𝑖=1 𝑎𝑖,𝜎(𝑖) iterates through all 𝑖-s, and the order of the terms doesn’t matter, we
might as well order the terms as 𝑖 = 𝜎−1 (1), … , 𝜎−1 (𝑛). Since sign(𝜎−1 ) = sign(𝜎), by continuing the above calculation,
we have
𝑛 𝑛
∑ sign(𝜎) ∏ 𝑎𝑖,𝜎(𝑖) = ∑ sign(𝜎−1 ) ∏ 𝑎𝜎−1 (𝑗),𝑗 .
𝜎∈𝑆𝑛 𝑖=1 𝜎∈𝑆𝑛 𝑗=1

Because every permutation is invertible and 𝜎 ↦ 𝜎−1 is a bijection, summing over 𝜎 ∈ 𝑆𝑛 is the same as summing over
𝜎−1 ∈ 𝑆𝑛 . Combining all of the above, we obtain that
𝑛
det 𝐴 = ∑ sign(𝜎−1 ) ∏ 𝑎𝜎−1 (𝑗),𝑗
𝜎∈𝑆𝑛 𝑗=1
𝑛
= ∑ sign(𝜎) ∏ 𝑎𝜎(𝑗),𝑗
𝜎∈𝑆𝑛 𝑗=1
𝑇
= det 𝐴 ,

9.4. Fundamental properties 131

Mathematics of Machine Learning

which is what we had to show. □

Theorem 8.4.5
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary matrix and let 𝐴𝑖,𝑗 denote the matrix which can be obtained by swapping the 𝑖-th and 𝑗-th
column of 𝐴. Then

det 𝐴𝑖,𝑗 = − det 𝐴,

or in other words, swapping any two columns of 𝐴 will change the sign of the determinant. Similarly, swapping two rows
also changes the sign of the determinant.

Proof. This follows from a clever application of (9.7), noticing that 𝐴𝑖,𝑗 = 𝐴𝐼 𝑖,𝑗 , where 𝐼 𝑖,𝑗 is obtained from the
identity matrix by swapping its 𝑖-th and 𝑗-th column. det 𝐼 𝑖,𝑗 is a determinant of the form det(𝑒𝜎(1) , … , 𝑒𝜎(𝑛) ), where 𝜎
is a permutation simply swapping 𝑖 and 𝑗. (That is, 𝜎 is a transposition.) Thus,

det 𝐴𝑖,𝑗 = det 𝐴 det 𝐼 𝑖,𝑗 = − det 𝐴,

which is what we had to show.

Regarding swapping rows, we can apply the previous result because transposing a matrix preserves the determinant. □

As a consequence, matrices with two matching rows have a zero determinant.

Theorem 8.4.6
Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix that has two identical rows or columns. Then det 𝐴 = 0.

Proof. Suppose that the 𝑖-th and the 𝑗-th columns are matching. Since the two columns equal, det 𝐴𝑖,𝑗 = det 𝐴. However,
applying the previous theorem, we obtain det 𝐴𝑖,𝑗 = − det 𝐴 This can only be true if det 𝐴 = 0. Again, transposing the
matrix gives the statement for rows. □

9.4.2 Linear dependence and the determinant

As yet another consequence, we obtain an essential connection between linearly dependent vector systems and determi-
nants.

Theorem 8.4.7
Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix. Then its columns are linearly dependent if and only if det 𝐴 = 0. Similarly, the rows of 𝐴
are linearly dependent if and only if det 𝐴 = 0.

Proof. (i) First, we are going to show that linearly dependent columns (or rows) imply det 𝐴 = 0. As usual, let’s denote
𝑛
the columns of 𝐴 as 𝑎1 , … , 𝑎𝑛 and for the sake of simplicity, assume that 𝑎1 = ∑𝑖=2 𝛼𝑖 𝑎𝑖 . Since the determinant is a
linear function of the columns, we have
𝑛
det(𝑎1 , 𝑎2 , … , 𝑎𝑛 ) = ∑ 𝛼𝑖 det(𝑎𝑖 , 𝑎2 , … , 𝑎𝑛 ).
𝑖=2

132 Chapter 9. Determinants, or how linear transformations aﬀect volume

Mathematics of Machine Learning

Because of the previous theorem, all terms det(𝑎𝑖 , 𝑎2 , … , 𝑎𝑛 ) are zero, implying det 𝐴 = 0, which is what we had to
show. If the rows are linearly dependent, we apply the above to obtain that det 𝐴 = det 𝐴𝑇 = 0.
(ii) Now, let’s show that det 𝐴 = 0 means linearly dependent columns. Instead of the exact proof, which is rather involved,
we should have an intuitive explanation instead.
Recall that the determinant is orientation times volume of the parallelepiped given by the columns. Since the orientation
is ±1, det 𝐴 implies that the volume of the parallelepiped is 0. This can only happen if the 𝑛 columns lie in an 𝑛 − 1-
dimensional subspace, meaning that they are linearly dependent. □

We can immediately apply this to get the following result.

Corollary 8.4.1
Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix with a constant zero column (or row). Then det 𝐴 = 0.

9.5 Determinants and invertibility

As the determinant is the signed volume of the basis vectors’ image, it can be zero in certain cases. These transformations
are rather special. When can it happen? Let’s go back to the Euclidean plane to build some intuition.
There, we have

𝑥1 𝑦1
∣ ∣ = 𝑥1 𝑦2 − 𝑥2 𝑦1 = 0,
𝑥2 𝑦2

or in other words, 𝑥𝑦1 = 𝑥𝑦2 . There is one more interpretation of this: the vector (𝑦1 , 𝑦2 ) is a scalar multiple of (𝑥1 , 𝑥2 ).
1 2
Thinking in terms of linear transformations, this means that the images of 𝑒1 and 𝑒2 lie on a subspace of ℝ2 . As we shall
see next, this is closely connected with the invertibility of the transformation.

Theorem 8.5.1 (Invertibility and the determinants.)

The linear transformation 𝐴 ∈ ℝ𝑛×𝑛 is invertible if and only if det 𝐴 ≠ 0.

Proof. When we introduced the concept of invertibility, we have seen that 𝐴 is invertible if and only if its columns
𝑎1 , … , 𝑎𝑛 form a basis. Thus, they are linearly independent. Since linear independence of columns is equivalent to a
nonzero determinant, the result follows. □

9.6 Problems

Problem 1. Calculate the determinant of

1 2 3
𝐴=⎡
⎢4 5 6⎤⎥.
⎣7 8 9⎦

Problem 2. Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix and let 𝑐 ∈ ℝ be a constant.

9.5. Determinants and invertibility 133

Mathematics of Machine Learning

(a) Show that

𝑎 … 𝑐𝑎1,𝑖 … 𝑎1,𝑛
∣ 1,1 ∣
∣ 𝑎2,1 … 𝑐𝑎2,𝑖 … 𝑎2,𝑛 ∣
= 𝑐 det 𝐴
∣ ⋮ ⋱ ⋮ ⋱ ⋮ ∣
∣𝑎𝑛,1 … 𝑐𝑎𝑛,𝑖 … 𝑎𝑛,𝑛 ∣

holds for all 𝑖 = 1, … , 𝑛.

(b) Show that

𝑎 𝑎1,2 … 𝑎1,𝑛
∣ 1,1 ∣
⋮ ⋮ ⋱ ⋮
∣ ∣
∣𝑐𝑎𝑖,1 𝑐𝑎𝑖,2 … 𝑐𝑎𝑖,𝑛 ∣ = 𝑐 det 𝐴
∣ ⋮ ⋮ ⋱ ⋮ ∣
∣ 𝑎𝑛,1 𝑎𝑛,2 … 𝑎𝑛,𝑛 ∣

holds for all 𝑖 = 1, … , 𝑛 and $.

det(𝑐𝐴) = 𝑐𝑛 det 𝐴.

Problem 3. Let 𝐴 ∈ ℝ𝑛×𝑛 be an upper triangular matrix. (That is, all elements below the diagonal are zero.) Show that
𝑛
det 𝐴 = ∏ 𝑎𝑖,𝑖 .
𝑖=1

Show that the same holds for lower triangular matrices. (That is, matrices where elements above the diagonal are zero.)
Problem 4. Let 𝑀 ∈ ℝ𝑛×𝑚 be a matrix with the block structure
𝐴 𝐵
𝑀 =[ ],
0 𝐶

where 𝐴 ∈ ℝ𝑘×𝑘 , 𝐵 ∈ ℝ𝑘×𝑙 , and 𝐶 ∈ ℝ𝑙×𝑙 .

Show that

det 𝑀 = det 𝐴 det 𝐶.

134 Chapter 9. Determinants, or how linear transformations aﬀect volume

CHAPTER

TEN

LINEAR EQUATIONS

In practice, we can translate several problems to linear equations. For example, a cash dispenser has $900 in $20 and
$50 bills. We know that there are twice as many $20 bills than $50. The question is, how many pieces of each bill the
machine has?
If we denote the number of $20 bills by 𝑥1 and the number of $50 bills by 𝑥2 , we obtain the equations

𝑥1 − 2𝑥2 = 0
20𝑥1 + 50𝑥2 = 900.

For two variables like we have now, these are easily solvable by expressing one in terms of other. Here, the ﬁrst equation
would imply 𝑥1 = 2𝑥2 . Plugging it back to the second equation, we obtain 90𝑥2 = 900, which gives 𝑥2 = 10. Arriving
full circle, we can substitute this into 𝑥1 = 2𝑥2 , yielding the solutions

𝑥1 = 20
𝑥2 = 10.

However, for thousands of variables like in real applications, we need a bit more craft. This is where linear algebra comes
in. By introducing the matrix and vectors

1 −2 𝑥1 0
𝐴=[ ], 𝑥=[ ], 𝑏=[ ],
20 50 𝑥2 900

the equation can be written in the form 𝐴𝑥 = 𝑏. That is, in terms of linear transformations, we can reformulate the
question: which vector 𝑥 is mapped to 𝑏 by the transformation 𝐴? This question is central in linear algebra. We are going
to dedicate this section to solving these.

10.1 Gaussian elimination

Let’s revisit our earlier example:

𝑥1 − 2𝑥2 = 0
20𝑥1 + 50𝑥2 = 900.
Besides the previously outlined solution, there is another approach. We can use the ﬁrst equation 𝑥1 − 2𝑥2 = 0 to get rid
of the term 𝑥1 in the second equation 20𝑥1 + 50𝑥2 = 900. We can do this by multiplying it with 20 and subtracting it
from the second row, obtaining 90𝑥2 = 900, from which 𝑥2 = 10 is obtained. This can be substituted back into the ﬁrst
row, yielding 𝑥1 = 20.
What about the general case? That is, if 𝐴 ∈ ℝ𝑛×𝑛 and 𝑥, 𝑏 ∈ ℝ𝑛 , would this work? Absolutely. So far, we have used
two rules of manipulating the equations in a linear system:
1. multiplying an equation with a nonzero scalar won’t change the solutions,
2. and adding a scalar multiple of one row to another won’t either.

135
Mathematics of Machine Learning

Earlier, we have applied these repeatedly to eliminate variables progressively in our simple example. We can easily do
the same for 𝑛 variables! First, let’s see what we are talking about!

Deﬁnition 9.1.1 (System of linear equations.)

Let 𝐴 ∈ ℝ𝑛×𝑛 be a matrix and 𝑏 ∈ ℝ𝑛 be a vector. The collection of equations

𝑎11 𝑥1 + 𝑎12 𝑥2 + ⋯ + 𝑎1𝑛 𝑥𝑛 = 𝑏1

𝑎21 𝑥1 + 𝑎22 𝑥2 + ⋯ + 𝑎2𝑛 𝑥𝑛 = 𝑏2
(10.1)
⋮
𝑎𝑛1 𝑥1 + 𝑎𝑛2 𝑥2 + ⋯ + 𝑎𝑛𝑛 𝑥𝑛 = 𝑏𝑛 .

are called the system of linear equations determined by 𝐴.

A system of linear equations is often written in the short form 𝐴𝑥 = 𝑏, where 𝐴 is called its coefficient matrix. If the
vector 𝑥 satisfies 𝐴𝑥 = 𝑏, it is called a *solution.
Speaking of the solutions, are there even any, and if so, how can we find them?
𝑎𝑘1
If 𝑎11 is nonzero, we can multiply the first equation of (10.1) with 𝑎11 and subtract it from the 𝑘-th equation.
This way, 𝑥1 will be eliminated from all but the first row, obtaining

𝑎11 𝑥1 + 𝑎12 𝑥2 + ⋯ + 𝑎1𝑛 𝑥𝑛 = 𝑏1

𝑎21 𝑎 𝑎
0𝑥1 + (𝑎22 − 𝑎12 )𝑥2 + ⋯ + (𝑎2𝑛 − 𝑎1𝑛 21 )𝑥𝑛 = 𝑏2 − 𝑏1 21
𝑎11 𝑎11 𝑎11
(10.2)
⋮
𝑎 𝑎 𝑎
0𝑥1 + (𝑎𝑛2 − 𝑎12 𝑛1 )𝑥2 + ⋯ + (𝑎𝑛𝑛 − 𝑎1𝑛 𝑛1 )𝑥𝑛 = 𝑏𝑛 − 𝑏1 𝑛1 .
𝑎11 𝑎11 𝑎11
(1) (1)
To clear up this notation a bit, let’s denote the new coeﬃcients with 𝑎𝑖𝑗 and 𝑏𝑖 . So, we have

𝑎11 𝑥1 + 𝑎12 𝑥2 + ⋯ + 𝑎1𝑛 𝑥𝑛 = 𝑏1

(1) (1) (1)
0𝑥1 + 𝑎22 𝑥2 + ⋯ + 𝑎2𝑛 𝑥𝑛 = 𝑏2
(10.3)
⋮
(1) (1) (1)
0𝑥1 + 𝑎𝑛2 𝑥2 + ⋯ + 𝑎𝑛𝑛 𝑥𝑛 = 𝑏𝑛 .

We can repeat the above process and use the second equation to get rid of the 𝑥2 variable in the third equation, and
so forth. This can be done 𝑛 − 1 times in total, ultimately leading to an equation system 𝐴(𝑛−1) 𝑥 = 𝑏(𝑛−1) where all
coeﬃcients below the diagonal of 𝐴(𝑛−1) is zero:

𝑎11 𝑎12 𝑎13 … 𝑎1𝑛

⎡ 0 (1)
𝑎22
(1)
𝑎23 … 𝑎2𝑛 ⎤
(1)
⎢ (2) (2)
⎥
𝐴(𝑛−1) =⎢ 0 0 𝑎33 … 𝑎3𝑛 ⎥ . (10.4)
⎢ ⎥
⎢ ⋮ ⋮ ⋮ ⋱ ⋮ ⎥
(𝑛−1)
⎣ 0 0 0 … 𝑎𝑛𝑛 ⎦

Notice that the 𝑘-th elimination step only affects the coefficients from the 𝑘 + 1-th row. Now we can work backwards:
(𝑛−1) (𝑛−1)
the last equation 𝑎𝑛𝑛 𝑥𝑛 = 𝑏𝑛 can be used to find 𝑥𝑛 . This can be substituted to the 𝑛 − 1-th equation, yielding
𝑥𝑛−1 . Continuing like this, we can eventually find all 𝑥1 , … , 𝑥𝑛 , obtaining a solution for our linear system.
This process is called Gaussian elimination, and it’s kind of a big deal. It is not only useful for solving linear equations, it
can be used to calculate determinants, factor matrices into the product of simpler ones, and many more. We’ll talk about
all of these in detail, but let’s focus on equations for a little more.

136 Chapter 10. Linear equations

Mathematics of Machine Learning

Unfortunately, not all linear equations can be solved. For instance, consider the system

𝑥1 + 𝑥 2 = 1
2𝑥1 + 2𝑥2 = −1.

Gaussian elimination would yield

𝑥1 + 𝑥 2 = 1
0𝑥1 + 0𝑥2 = −3
in the very ﬁrst step, making it apparent that the equation has no solutions.
Before we turn to the technical details, let’s see a simple example of how Gaussian elimination is done in practice!

10.1.1 Gaussian elimination by hand

To build a deeper understanding of Gaussian elimination, let’s consider the simple equation system

𝑥1 + 0𝑥2 − 3𝑥3 = 6
2𝑥1 + 1𝑥2 + 5𝑥3 = 2
−2𝑥1 − 3𝑥2 + 8𝑥3 = 2.

To keep track of our progress (and, since we are lazy, to avoid writing too much), we record the intermediate results as

1 0 −3 6
2 1 5 2
−2 −3 8 2

with the coeﬃcient matrix 𝐴 on the left side and 𝑏 on the other. To get a good grip on the method, I encourage you to
follow along and do the calculations for yourself by hand.
After eliminating the ﬁrst variable from the second and third equations, we have

1 0 −3 6
0 1 11 −10 ,
0 −3 2 14

while the ﬁnal step yields

1 0 −3 6
0 1 11 −10 .
0 0 35 −16

From this form, we can unravel the solutions one by one.

In the 21st century, your chances of having to solve a linear equation by hand are close to zero. (If you are reading this
book during the 22nd century or later, I am incredibly honored and surprised at the same time. Or at least, I would be if
I was still alive.) Still, understanding the general principles behind solving linear equations can take you very far.

10.1.2 When can we perform Gaussian elimination?

If you followed the description of Gaussian elimination carefully, you might have noticed that the process can break down.
We might accidentally divide with zero during any elimination step!
For instance, after the ﬁrst step given by equation (10.2), the new coeﬃcients are of the form
𝑎𝑖1
(𝑎𝑖𝑗 − 𝑎1𝑗 ),
𝑎11

10.1. Gaussian elimination 137

Mathematics of Machine Learning

(𝑘−1) (𝑘−1)
which is invalid if 𝑎11 = 0. In general, the 𝑘-th step involves division with 𝑎𝑘𝑘 . Since 𝑎𝑘𝑘 is deﬁned recursively,
describing it in terms of 𝐴 is not straightforward. For this, we introduce the concept of principal minors, the upper left
subdeterminants of a matrix.

Deﬁnition 9.1.2 (Principal minors.)

Let 𝐴 = (𝑎𝑖𝑗 )𝑛𝑖,𝑗=1 ∈ ℝ𝑛×𝑛 be an arbitrary square matrix. Deﬁne the submatrix 𝐴𝑘 ∈ ℝ𝑘×𝑘 by omitting all rows and
columns of 𝐴 with indices larger than 𝑘. For instance,

𝑎11 𝑎12
𝐴1 = [𝑎11 ] , 𝐴2 = [ ],
𝑎21 𝑎22

and so on. The k-th principal minor of 𝐴, denoted by 𝑀𝑘 is deﬁned by

𝑀𝑘 ∶= det 𝐴𝑘 .

The ﬁrst and last principal minors are special, as 𝑀1 = 𝑎11 and 𝑀𝑛 = det 𝐴. With principal minors, we can describe
when Gaussian elimination is possible. In fact, it turns out that

(1) 𝑀2 (𝑛−1) 𝑀𝑛
𝑎11 = 𝑀1 , 𝑎22 = , … , 𝑎𝑛𝑛 =
𝑀1 𝑀𝑛−1
(𝑘−1) 𝑀𝑘
and in general, 𝑎𝑘𝑘 = 𝑀𝑘−1 .

To summarize, we can state the following.

Theorem 9.1.1
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary square matrix, and let 𝑀𝑘 be its 𝑘-th principal minor. If 𝑀𝑘 ≠ 0 for all 𝑘 = 1, 2, … , 𝑛−1,
then Gaussian elimination can be successfully performed.

(𝑘−1)
As the proof is a bit involved, we are not going to do it here. (The diﬃcult step is to show 𝑎𝑘𝑘 = 𝑀𝑘 /𝑀𝑘−1 , the rest
follows immediately.) Point is, if none of the principal minors are zero, the algorithm ﬁnishes.
We can simplify this requirement a bit, and describe the Gaussian elimination in terms of the determinant, not the principal
minors.

Theorem 9.1.2
Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary square matrix. If det 𝐴 ≠ 0 holds, then all principal minors are nonzero as well.

As a consequence, if the determinant is nonzero, the Gaussian elimination is successful. A simple and nice requirement.

10.1.3 The time complexity of Gaussian elimination

To get a grip on how fast the Gaussian elimination algorithm executes, let’s do a little complexity analysis. As described
by (10.2), the first elimination step involves an addition and a multiplication for each element, except for those in the first
row. That is 2𝑛(𝑛 − 1) operations in total.
The next step is essentially the first step, done on the (𝑛 − 1) × (𝑛 − 1) matrix obtained from 𝐴(1) by removing its first
row and column. This time, we have 2(𝑛 − 1)(𝑛 − 2) operations.

138 Chapter 10. Linear equations

Mathematics of Machine Learning

Following this train of thought, we quickly get that the total number of operations are
𝑛
∑ 2(𝑛 − 𝑖 + 1)(𝑛 − 𝑖),
𝑖=1

which doesn’t look that friendly. Since we are looking for the order of complexity instead of an exact number, we can be
generous and suppose that at each elimination step, we are performing 𝑂(𝑛2 ) operations. So, we have a time complexity
of
𝑛
∑ 𝑂(𝑛2 ) = 𝑂(𝑛3 ),
𝑖=1

meaning that we need around 𝑐𝑛3 operations for Gaussian elimination, where 𝑐 is an arbitrary positive constant. This
might seem a lot, but in the beautiful domain of algorithms, this is good. 𝑂(𝑛3 ) is polynomial time, and we can be
much-much worse.

10.2 When can a system of linear equations be solved?

So, we have just seen that for any linear equation

𝐴𝑥 = 𝑏, 𝐴 ∈ ℝ𝑛×𝑛 , 𝑥, 𝑏 ∈ ℝ𝑛 ,

Gaussian elimination can be successfully performed if the principal minors 𝑀1 , … , 𝑀𝑛−1 are nonzero. Notice one caveat
about the result: 𝑀𝑛 = det 𝐴 can be zero as well. Turns out, this is quite an important detail.
If you have closely followed the discussion leading up to this point, you can see that we missed a crucial point: are there
any solutions at all for a given linear equation?
1. there are no solutions,
2. there is exactly one solution,
3. and there are multiple solutions.
All of these are relevant to us from a certain perspective, but let’s start with the most straightforward one: when do we
have exactly one solution? The answer is simple: when 𝐴 is invertible, the solution can be explicitly written as 𝑥 = 𝐴−1 𝑏.
Speaking in terms of linear transformations, we can ﬁnd a unique vector 𝑥 that is mapped to 𝑏. We summarize this idea
in the following theorem.

Theorem 9.2.1
Let 𝐴 ∈ ℝ𝑛×𝑛 be an invertible matrix. Then for any 𝑏 ∈ ℝ𝑛 , the equation 𝐴𝑥 = 𝑏 has a unique solution that can be
written as 𝑥 = 𝐴−1 𝑏.

If 𝐴 is invertible, then det 𝐴 is nonzero. Thus, using what we have learned previously, Gaussian elimination can be
performed, yielding the unique solution. Nice and simple.
If 𝐴 is not invertible, the two remaining possibilities are in play: no vector is mapped to 𝑏, which means there are no
solutions, or multiple vectors mapped to 𝑏, giving numerous solutions.
Do you remember how we used the kernel of a linear transformation to describe its invertibility? It turns out that ker 𝐴
can also be used to ﬁnd all solutions for a linear system.

Theorem 9.2.2

10.2. When can a system of linear equations be solved? 139

Mathematics of Machine Learning

Let 𝐴 ∈ ℝ𝑛×𝑛 an arbitrary matrix and let 𝑥0 ∈ ℝ𝑛 be a solution of the linear equation 𝐴𝑥 = 𝑏, where 𝑏 ∈ ℝ𝑛 . Then the
set of all solutions can be written as

𝑥0 + ker 𝐴 ∶= {𝑥0 + 𝑦 ∶ 𝑦 ∈ ker 𝐴}.

Proof. We have to show two things: (a) if 𝑥 ∈ 𝑥0 + ker 𝐴, then 𝑥 is a solution; and (b) if 𝑥 is a solution, then
𝑥 ∈ 𝑥0 + ker 𝐴.
(a) Suppose that 𝑥 ∈ 𝑥0 + ker 𝐴, that is, 𝑥 = 𝑥0 + 𝑦 for some 𝑦 ∈ ker 𝐴. Then

𝐴𝑥 = 𝐴(𝑥0 + 𝑦) = 𝐴𝑥
⏟0 + 𝐴𝑦
⏟ = 𝑏,
=𝑏 =0

which shows that 𝑥 is indeed a solution.

(b) Now let 𝑥 be an arbitrary solution. We have to show that 𝑥 − 𝑥0 ∈ ker 𝐴. This is easy, since

𝐴(𝑥 − 𝑥0 ) = 𝐴𝑥 − 𝐴𝑥0 = 𝑏 − 𝑏 = 0.

Thus, (a) and (b) imply that 𝑥0 + ker 𝐴 is the set of all solutions. □

In theory, this theorem provides an excellent way to find all solutions for linear equations, generalizing far beyond finite-
dimensional vector spaces. (Note that the proof goes through verbatim for all vector spaces and linear transformations.)
For instance, this exact result is used to describe all solutions of an inhomogeneous linear differential equation.

10.3 Inverting matrices

So far, we have seen that the invertibility of a matrix 𝐴 ∈ ℝ𝑛×𝑛 is key to solving linear equations. However, we haven’t
found a way to compute the inverse of a matrix yet.
Let’s recap what the inverse is in terms of linear transformations. If the columns of 𝐴 are denoted by the vectors
𝑎1 , … , 𝑎𝑛 ∈ ℝ𝑛 , then 𝐴 is the linear transformation that maps the standard basis vectors to the these vectors:

𝐴 ∶ 𝑒𝑖 ↦ 𝑎 𝑖 , 𝑖 = 1, … , 𝑛.

If the direction of the arrows can be reversed, that is,

𝐴−1 ∶ 𝑎𝑖 ↦ 𝑒𝑖 , 𝑖 = 1, … , 𝑛

is a well-deﬁned linear equation, then 𝐴−1 is called the inverse of 𝐴.

In light of all that we have seen in this chapter, the method for ﬁnding the inverse is simple: solve 𝐴𝑥 = 𝑒𝑖 for each 𝑖
where 𝑒1 , … , 𝑒𝑛 ∈ ℝ𝑛 is the standard basis.
𝑛 𝑛
Suppose that 𝐴𝑥𝑖 = 𝑒𝑖 , 𝑥𝑖 ∈ ℝ𝑛 . Then, if 𝑏 can be written as 𝑏 = ∑𝑖=1 𝛽𝑖 𝑒𝑖 , the vector 𝑥 = ∑𝑖=1 𝛽𝑖 𝑥𝑖 is the solution
of 𝐴𝑥 = 𝑏:
𝑛 𝑛
𝐴( ∑ 𝛽𝑖 𝑥𝑖 ) = ∑ 𝛽𝑖 𝐴𝑥𝑖
𝑖=1 𝑖=1
𝑛
= ∑ 𝛽𝑖 𝑒𝑖
𝑖=1
= 𝑏.
Thus, the inverse is the matrix whose 𝑖-th column is 𝑥𝑖 .

140 Chapter 10. Linear equations

Mathematics of Machine Learning

I know, this seems paradoxical: to ﬁnd the solution of 𝐴𝑥 = 𝑏, we need the inverse 𝐴−1 . To ﬁnd the inverse, we need
to solve 𝑛 equations. The answer is Gaussian elimination, which gives us an exact computational method to obtain 𝐴−1 .
In the next chapter, we are going to put this into practice and write our matrix-inverting algorithm from scratch. Pretty
awesome.

10.3. Inverting matrices 141

Mathematics of Machine Learning

142 Chapter 10. Linear equations

CHAPTER

ELEVEN

THE LU DECOMPOSITION

In the previous chapter, I promised that you’d never have to solve a linear equation by hand. As it turns out, this task is
perfectly suitable for computers. In this chapter, we will dive deep into the art of solving linear equations, developing the
tools from scratch.
We start by describing the process of Gaussian elimination in terms of matrices. Why would we even do that? Because
matrix multiplication can be performed extremely fast in modern computers. Expressing any algorithm in terms of
matrices is a sure way
At the start, our linear equation 𝐴𝑥 = 𝑏 is given by the coeﬃcient matrix

𝑎11 𝑎12 … 𝑎1𝑛

⎡𝑎 𝑎22 … 𝑎2𝑛 ⎤
𝐴 = ⎢ 21 ⎥ ∈ ℝ𝑛×𝑛 ,
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣𝑎𝑛1 𝑎𝑛2 … 𝑎𝑛𝑛 ⎦

and at the end of the elimination process, 𝐴 transformed to the form

𝑎11 𝑎12 𝑎13 … 𝑎1𝑛

⎡ 0
(1)
𝑎22
(1)
𝑎23 … 𝑎2𝑛 ⎤
(1)
⎢ (2) (2)
⎥
𝐴(𝑛−1) =⎢ 0 0 𝑎33 … 𝑎3𝑛 ⎥ .
⎢ ⎥
⎢ ⋮ ⋮ ⋮ ⋱ ⋮ ⎥
(𝑛−1)
⎣ 0 0 0 … 𝑎𝑛𝑛 ⎦

𝐴(𝑛−1) is upper diagonal, that is, all elements below its diagonal are zero.
Gaussian elimination performs this task one step at a time, focusing on consecutive columns. After the ﬁrst elimination
step, this is turned into the equation (10.3), described by the coeﬃcient matrix

𝑎11 𝑎12 … 𝑎1𝑛

⎡ 0 (1)
𝑎22 … 𝑎2𝑛 ⎤
(1)
𝐴(1) =⎢ ⎥ ∈ ℝ𝑛×𝑛 .
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
(1) (1)
⎣ 0 𝑎𝑛2 … 𝑎𝑛𝑛 ⎦

Can we obtain 𝐴(1) from 𝐴 via multiplication with some matrix, that is, can we ﬁnd a 𝐺1 ∈ ℝ𝑛×𝑛 such that 𝐴(1) = 𝐺1 𝐴
holds?
Yes. By deﬁning 𝐺1 as

1 0 0 … 0
⎡ − 𝑎21 1 0 … 0⎤
⎢ 𝑎𝑎11 ⎥
𝐺1 = ⎢ − 𝑎31 0 1 … 0⎥ , (11.1)
⎢ 11 ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
𝑎𝑛1
⎣− 𝑎11 0 0 … 1⎦

143
Mathematics of Machine Learning

we can see that 𝐴(1) = 𝐺1 𝐴 is the same as performing the first step of Gaussian elimination. 𝐺1 is lower diagonal; that
is, all elements above its diagonal are zero. In fact, except for the first column, all elements below the diagonal are zero
as well. (Note that 𝐺1 depends on 𝐴.)
By analogously defining

1 0 0 … 0
⎡0 1 0 … 0⎤
⎢ 𝑎32
(1) ⎥
⎢ 0⎥ ,
𝐺2 = ⎢0 − (1)
𝑎22
1 …
⎥ (11.2)
⎢⋮ ⋮ ⋱ ⋮ ⎥
⎢ 𝑎𝑛2
(1) ⎥
⎣0 − (1)
𝑎22
0 … 1⎦

we obtain 𝐴(2) = 𝐺2 𝐴(1) = 𝐺2 𝐺1 𝐴, a matrix that is upper diagonal in the ﬁrst two column. (That is, all elements are
zero below the diagonal, but only in the ﬁrst two columns.)
We can continue this process until obtaining the upper triangular matrix

𝐴(𝑛−1) = 𝐺𝑛−1 … 𝐺1 𝐴. (11.3)

The algorithm is starting to shape up nicely. The 𝐺𝑖 matrices are invertible, with inverses

1 0 0 … 0
1 0 0 … 0 ⎡0
⎡ 𝑎21 1 0 … 0⎤
1 0 … 0⎤ ⎢ ⎥
⎢ 𝑎𝑎11 ⎥ ⎢
(1)
𝑎32
0⎥ , … ,
𝐺−1
1 = ⎢ 𝑎31 0 1 … 0⎥ , 𝐺−1
2 = ⎢0 (1)
𝑎22
1 …
⎥
⎢ 11 ⎥ ⎢⋮ ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥ ⋮ ⋱ ⋮
𝑎𝑛1 ⎢ (1) ⎥
⎣ 𝑎11 0 0 … 1⎦ 𝑎𝑛2
⎣0 (1)
𝑎22
0 … 1⎦

and so on. Thus, by multiplying with their inverses one by one, we can express 𝐴 as

𝐴 = 𝐺−1 −1
1 … 𝐺𝑛−1 𝐴
(𝑛−1)
.

Fortunately, we can calculate 𝐿 ∶= 𝐺−1 −1

1 … 𝐺𝑛−1 by hand. After a quick computation, we obtain that

1 0 0 … 0
⎡ 𝑎21 1 0 … 0⎤
⎢ 𝑎11 (1)
⎥
⎢ 𝑎31 𝑎32
1 … 0⎥ ,
𝐿 = ⎢ 𝑎11 (1)
𝑎22 ⎥
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎢𝑎 (1)
𝑎𝑛2
(2)
𝑎𝑛3
⎥
𝑛1
… 1
⎣ 𝑎11 (1)
𝑎22
(2)
𝑎33 ⎦

which is lower diagonal. By deﬁning the upper diagonal matrix 𝑈 ∶= 𝐴(𝑛−1) , we obtain the famous LU decomposition,
factoring 𝐴 into a lower and an upper diagonal matrix:

𝐴 = 𝐿𝑈 .

Notice that with this algorithm, we perform two tasks for the price of one:
• factorizing 𝐴 into the product of an upper diagonal and a lower diagonal matrix,
• and performing Gaussian elimination.
From a computational standpoint, the LU decomposition is an extremely important tool. Good news: it is relatively easy
to and fast compute. Since it is just a refashioned Gaussian elimination, its complexity is 𝑂(𝑛3 ), just as we saw this
earlier. Bad news: it is not always available. Since it is tied to Gaussian elimination, we can characterize its existence
in similar terms. Recall that for the Gaussian elimination to successfully ﬁnish, the principal minors are required to be
nonzero. This is directly transferred to the LU decomposition.

144 Chapter 11. The LU decomposition

Mathematics of Machine Learning

Theorem 10.1 (Existence of the LU decomposition.)

Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary square matrix, and let 𝑀𝑘 be its 𝑘-th principal minor. If 𝑀𝑘 ≠ 0 for all 𝑘 = 1, 2, … , 𝑛−1,
then 𝐴 can be written as

𝐴 = 𝐿𝑈 , 𝐿, 𝑈 ∈ ℝ𝑛×𝑛 ,

where 𝐿 is a lower diagonal, and 𝑈 is an upper diagonal matrix. Moreover, the elements along the diagonal of 𝐿 are equal
to 1.

The gist is the same: everything is ﬁne if we avoid division with zero during the algorithm.
After all the preparations, we are ready to put things into practice!

11.1 Implementing the LU decomposition

To summarize the LU decomposition algorithm as described above, we essentially repeat two steps:
1. calculate the elimination matrices of the input,
2. and multiply the input with the elimination matrices, feeding the output back into the ﬁrst step.
The plan is clear: ﬁrst, we write a function that computes the elimination matrices and their inverses; then, we iteratively
perform the elimination steps using matrix multiplication.

import numpy as np

def compute_elimination_matrices(
A: np.ndarray,
step: int,
):
"""
Computes the step-th elimination matrix and its inverse.

A: np.ndarray of shape (n, n)

step: integer between 1 and n-1
"""
n = A.shape[0]
elim_mtx = np.eye(n)
elim_mtx_inv = np.eye(n)

if 0 < step < n:

a = A[:, step-1]/A[step-1, step-1]
elim_mtx[step:, step-1] = -a[step:]
elim_mtx_inv[step:, step-1] = a[step:]

return elim_mtx, elim_mtx_inv

Now we are ready to perform the elimination steps.

def LU(A: np.ndarray):

"""
Computes the LU factorization of a square matrix A.

(continues on next page)

11.1. Implementing the LU decomposition 145

Mathematics of Machine Learning

(continued from previous page)

A: np.ndarray of shape (n, n)
"""
n = A.shape[0]
L = np.eye(n)
U = np.copy(A)

for step in range(1, n + 1):

elim_mtx, elim_mtx_inv = compute_elimination_matrices(U, step=step)
U = np.matmul(elim_mtx, U)
L = np.matmul(L, elim_mtx_inv)

return L, U

Let’s test our function on a small matrix.

A = 10*np.random.rand(4, 4) - 5

array([[ 1.7020027 , 4.93736796, 1.92012691, 4.20560581],

[ 2.83052706, 0.11641681, -4.52557461, 0.07825449],
[ 3.81985872, 3.55058343, 1.31915557, 3.59351558],
[-4.88685672, 2.72079189, 3.61677349, 2.53183118]])

L, U = LU(A)

print(f"Lower:\n{L}\n\nUpper:\n{U}")

Lower:
[[ 1. 0. 0. 0. ]
[ 1.66305674 1. 0. 0. ]
[ 2.24433177 0.93030038 1. 0. ]
[-2.87123912 -2.08743287 -1.66626783 1. ]]

Upper:
[[ 1.7020027 4.93736796 1.92012691 4.20560581]
[ 0. -8.09470627 -7.71885461 -6.91590661]
[ 0. 0. 4.19060713 0.58861142]
[ 0. 0. 0. 1.15142459]]

np.allclose(np.matmul(L, U), A)

True

Overall, the LU decomposition is a highly versatile tool, used as a stepping stone in the implementations of essential
algorithms. One of them is computing the inverse matrix, as we shall see next.

146 Chapter 11. The LU decomposition

Mathematics of Machine Learning

11.2 Inverting a matrix, for real

So far, we have talked a lot about the inverse matrix. We explored the question of invertibility from several angles, in
terms of
• the kernel and the image,
• the determinant,
• and the solvability of linear equations.
However, we haven’t yet talked about actually computing the inverse. With the LU decomposition, we obtain a tool that
can be used for this purpose. How? By plugging in a lower triangular matrix into the Gaussian elimination process, we
get its inverse as a side eﬀect. So, we
1. calculate the LU decomposition 𝐴 = 𝐿𝑈 ,
2. invert the lower triangular matrices 𝐿 and 𝑈 𝑇 ,
3. use the identity (𝑈 −1 )𝑇 = (𝑈 𝑇 )−1 to get 𝑈 −1 ,
4. multiply 𝐿−1 and 𝑈 −1 to ﬁnally obtain 𝐴−1 = 𝑈 −1 𝐿−1 .
That’s a plan! Let’s start with inverting lower triangular matrices.

11.2.1 Inverting lower triangular matrices

Let 𝐿 ∈ ℝ𝑛×𝑛 be an arbitrary lower triangular matrix. Following the same process that led to (11.3), we obtain

𝐷 = 𝐺𝑛−1 … 𝐺1 𝐿,

where 𝐷, the ﬁnal result of Gaussian elimination, is a diagonal matrix

𝑑1 0 … 0
⎡0 𝑑2 … 0⎤
𝐷 = diag(𝑑1 , … , 𝑑𝑛 ) = ⎢ ⎥,
⎢⋮ ⋮ ⋱ ⋮ ⎥
⎣0 0 … 𝑑𝑛 ⎦
and the 𝐺𝑖 -s are the elimination matrices deﬁned by (11.1), (11.2), and so on.
Since the inverse of 𝐷 is simply 𝐷−1 = diag(𝑑1−1 , … , 𝑑𝑛−1 ), we can express 𝐿−1 as

𝐿−1 = 𝐷−1 𝐺𝑛−1 … 𝐺1 .

We can implement this very similarly to the LU decomposition; we can even reuse our com-
pute_elimination_matrices function.

def invert_lower_triangular_matrix(L: np.ndarray):

n = L.shape[0]
G = np.eye(n)
D = np.copy(L)

for step in range(1, n + 1):

elim_mtx, _ = compute_elimination_matrices(D, step=step)
G = np.matmul(elim_mtx, G)
D = np.matmul(elim_mtx, D)

D_inv = np.eye(n)/np.diagonal(D) # NumPy performs this operation elementwise

return np.matmul(D_inv, G)

11.2. Inverting a matrix, for real 147

Mathematics of Machine Learning

With this done, we are ready to invert any matrix. (That is actually invertible.)

11.2.2 Inverting any (invertible) matrix

We are almost at the ﬁnish line. Every component is ready, the only thing left to do is to put them together. We can do
this within a few lines of code.

def invert(A: np.ndarray):

L, U = LU(A)
L_inv = invert_lower_triangular_matrix(L)
U_inv = invert_lower_triangular_matrix(U.T).T
return np.matmul(U_inv, L_inv)

Voilà! Witness the result with your own eyes.

A = np.random.rand(3, 3)
A_inv = invert(A)

print(f"A:\n{A}\n\nA⁻¹:\n{A_inv}\n\nAA⁻¹:\n{np.matmul(A, A_inv)}")

A:
[[0.50099494 0.38379137 0.90459645]
[0.90899798 0.68654876 0.17644865]
[0.02572362 0.47884098 0.89581146]]

A⁻¹:
[[ 1.59422999 0.26850537 -1.66275189]
[-2.43329707 1.27870628 2.20529206]
[ 1.2548991 -0.69122123 -0.01474888]]

AA⁻¹:
[[ 1.00000000e+00 4.71048498e-15 8.53893767e-18]
[-6.79235268e-15 1.00000000e+00 1.92026231e-16]
[ 3.12838119e-14 -1.43648206e-14 1.00000000e+00]]

To test the correctness of our invert function, we quickly check the results on a few randomly generated matrices.

for _ in range(1000):
n = np.random.randint(1, 10)
A = np.random.rand(n, n)
A_inv = invert(A)
if not np.allclose(np.matmul(A, A_inv), np.eye(n), atol=1e-5):
print("Test failed.")

Since there is no error message above, the function is (probably) correct.

What seemed complex and abstract a few chapters ago is now in our hands. We can invert any matrix, not with built-in
functions, but with one that we wrote from scratch. I love these moments when the pieces are ﬁnally put together, and
everything clicks. Sit back, relax, and appreciate the journey that got us here!

148 Chapter 11. The LU decomposition

Mathematics of Machine Learning

11.2.3 How to actually invert matrices

Of course, our implementation far from optimal. When working with NumPy arrays, we can turn to the built-in functions.
In NumPy, this is np.linalg.inv.

A = np.random.rand(3, 3)
A_inv = np.linalg.inv(A)

print(f"A:\n{A}\n\nNumPy's A⁻¹:\n{A_inv}\n\nAA⁻¹:\n{np.matmul(A, A_inv)}")

A:
[[0.89317567 0.90111789 0.82417549]
[0.04126326 0.12167803 0.65452453]
[0.52059519 0.49859182 0.34184381]]

NumPy's A⁻¹:
[[ -59.83217022 21.6188274 102.86029659]
[ 68.63452126 -25.99985543 -115.69420373]
[ -8.98735362 4.99835821 15.02326035]]

AA⁻¹:
[[ 1.00000000e+00 -4.52177990e-15 -1.03125295e-14]
[ 1.46455735e-16 1.00000000e+00 -5.46033371e-16]
[-2.30941798e-15 2.59457335e-16 1.00000000e+00]]

Let’s compare the runtime of our implementation and NumPy’s.

from timeit import timeit

n_runs = 100
size = 100
A = np.random.rand(size, size)

t_inv = timeit(lambda: invert(A), number=n_runs)

t_np_inv = timeit(lambda: np.linalg.inv(A), number=n_runs)

print(f"Our invert: \t{t_inv} s")

print(f"NumPy's invert: \t{t_np_inv} s")
print(f"Performance improvement: \t{100*t_inv/t_np_inv}% speedup")

Our invert: 4.81281138399936 s

NumPy's invert: 0.02269207200151868 s
Performance improvement: 21209.219606201055% speedup

A ~200x improvement. Nice! Why is NumPy that much faster? There are two main reasons. First, it directly calls the
SGETRI function from LAPACK, which is extremely fast. Second, according to its documentation, SGETRI uses a
faster algorithm:

SGETRI computes the inverse of a matrix using the LU factorization

computed by SGETRF.

This method inverts U and then computes inv(A) by solving the system
inv(A)*L = inv(U) for inv(A).

11.2. Inverting a matrix, for real 149

Mathematics of Machine Learning

So, NumPy calls the LAPACK function, which uses LU factorization in turn. (I am not particularly adept in digging
through Fortran code that is older than I am, so let me know if I am wrong here. Nevertheless, the fact that state-of-the-
art frameworks still make calls to this ancient library is a testament to its power. Never underestimate old technologies
like LAPACK and Fortran.)

11.3 Problems

Problem 1. Show that the product of upper triangular matrices is upper triangular. Similarly, the product of lower
triangular matrices is lower triangular. (We have used these facts extensively in this section but didn’t give a proof. So,
this is an excellent time to convince yourself about this if you haven’t already.)
Problem 2. Write a function that, given an invertible square matrix 𝐴 ∈ ℝ𝑛×𝑛 and a vector 𝑏 ∈ ℝ𝑛 , ﬁnds the solution of
the linear equation 𝐴𝑥 = 𝑏. (This can be done with a one-liner if you use one of the tools we have built here.)

150 Chapter 11. The LU decomposition

CHAPTER

TWELVE

DETERMINANTS IN PRACTICE

In the theory and practice of mathematics, the development of concepts usually has a simple flow. Definitions first arise
from vague geometric or algebraic intuitions, eventually crystallized in mathematical formalism.
However, mathematical definitions often disregard practicalities. Often for a very good reason, mind you! Keeping
practical considerations out of sight gives us the power to reason about structure effectively. This is the strength of
abstraction. Eventually, if meaningful applications are found, the development flows toward computational questions,
putting speed and efficiency onto the horizon.
An epitome of this is neural networks themselves. From theoretical constructs to state-of-the-art algorithms that run on
your smartphone, machine learning research followed this same arc.
This is also what we experience in this book on a microscopic level. Among many other examples, think about deter-
minants. We introduced the determinant as the orientation of column vectors and the parallelepiped volume defined by
them. Still, we haven’t really worked on computing them in practice. Sure, we gave a formula or two, but it is hard to
decide which one is the most convoluted. All of them are.
On the other hand, the mathematical study of determinants yielded a ton of useful results: invertibility of linear transfor-
mations, characterization of Gaussian elimination, and many more. (And even more to come.)
In this chapter, we are ready to pay off our debts and develop tools to actually compute determinants. As before, we will
take a straightforward approach and use one of the previously derived determinant formulas. Spoiler alert: this is far from
optimal, so we’ll find a way to compute the determinant with high speed.

12.1 The lesser of two evils

Let’s recall what we know about determinants so far. Given a matrix 𝐴 ∈ ℝ𝑛×𝑛 , its determinant det 𝐴 quantiﬁes the
volume distortion of the linear transformation 𝑥 → 𝐴𝑥. That is, if 𝑒1 , … , 𝑒𝑛 is the standard orthonormal basis, then
informally speaking,

det 𝐴 = (orientation of 𝐴𝑒1 , 𝐴𝑒2 , … , 𝐴𝑒𝑛 )

× (area of the parallelepiped determined by 𝐴𝑒1 , 𝐴𝑒2 , … , 𝐴𝑒𝑛 ).

We have derived two formulas to compute this quantity. Initially, we described the determinant in terms of summing
over all permutations:

det 𝐴 = ∑ sign(𝜎)𝑎𝜎(1)1 … 𝑎𝜎(𝑛)𝑛 .

𝜎∈𝑆𝑛

This is diﬃcult even to understand, let alone programmatically compute. So, a recursive formula is derived, which we can
also use. It states that
𝑛
det 𝐴 = ∑(−1)𝑗+1 𝑎1,𝑗 det 𝐴1,𝑗 ,
𝑗=1

151
Mathematics of Machine Learning

where 𝐴𝑖,𝑗 is the matrix obtained by deleting the 𝑖-th row and 𝑗-th column of 𝐴. Which one would you rather use? Take
a few minutes to figure out your reasoning.
Unfortunately, there are no right choices here. With the permutation formula, one has to find a way to generate all
permutations first, then calculate their signs. Moreover, there are 𝑛! unique permutations in 𝑆𝑛 , so this sum has a lot of
terms. Using this formula seems extremely difficult, so we are going with the recursive version. Recursion has its issues
(as we are about to see very soon), but it is easy to handle from a coding standpoint. Let’s get to work!

12.1.1 The recursive way

Let’s put the formula

𝑛
det 𝐴 = ∑(−1)𝑗+1 𝑎1,𝑗 det 𝐴1,𝑗
𝑗=1

under our magnifying glass. If 𝐴 is an 𝑛 × 𝑛 matrix, then 𝐴1,𝑗 (obtained from 𝐴 by deleting its ﬁrst row and 𝑗-th column)
is of size (𝑛 − 1) × (𝑛 − 1). This is a recursive step. For each 𝑛 × 𝑛 determinant, we have to calculate 𝑛 pieces of
(𝑛 − 1) × (𝑛 − 1) determinants, and so on.
By the end, we have a lot of 1 × 1 determinants, which are trivial to calculate. So, we have a boundary condition, and
with that, we are ready to put these together inside a function.

import numpy as np

def det(A: np.ndarray):

n, m = A.shape

# making sure that A is a square matrix

if n != m:
raise ValueError("A must be a square matrix.")

if n == 1:
return A[0, 0]

else:
return sum([(-1)**i*A[0, i]*det(np.delete(A[1:], i, axis=1)) for i in␣
↪range(n)])

Let’s test the det function out on a small example. For 2 × 2 matrices, we can easily calculate the determinants using
the rule
𝑎 𝑏
det [ ] = 𝑎𝑑 − 𝑏𝑐.
𝑐 𝑑

A = np.array([[1, 2], [3, 4]])

det(A) # should be -2

-2

It seems to work. So far, so good. What is the issue? Recursion. Let’s calculate the determinant of a small 10 × 10
matrix, measuring the time it takes.

152 Chapter 12. Determinants in practice

Mathematics of Machine Learning

from timeit import timeit

A = np.random.rand(10, 10)
t_det = timeit(lambda: det(A), number=1)

print(f"The time it takes to compute the determinant of a 10 x 10 matrix: {t_det}␣

↪seconds")

The time it takes to compute the determinant of a 10 x 10 matrix: 31.

↪73533689299802 seconds

Thirty-one long and unbearable seconds. For such a simple task, this feels like an eternity.
For 𝑛 × 𝑛 inputs, we call the det function recursively 𝑛 times, on (𝑛 − 1) × (𝑛 − 1) inputs. That is, if 𝑎𝑛 denotes the
time complexity of our algorithm for an 𝑛 × 𝑛 matrix, then, due to the recursive step, we have

𝑎𝑛 = 𝑛𝑎𝑛−1 ,

which explodes really fast. In fact, 𝑎𝑛 = 𝑂(𝑛!), which is the dreaded factorial complexity. Unlike some other recursive
algorithms, caching doesn’t help either. There are two reasons for this: sub-matrices rarely match, and numpy.ndarray
objects are mutable thus not hashable.
In practice, 𝑛 can be in the millions, so this formula is utterly useless. What can we do? Simple: LU decomposition.

12.2 The proper method for computing determinants

Besides the two formulas, we saw lots of useful properties of matrices and determinants. Can we apply what we have
learned so far to simplify the problem?
Let’s consider the LU decomposition. According to this, if det 𝐴 ≠ 0, then 𝐴 = 𝐿𝑈 , where 𝐿 is lower triangular and 𝑈
is upper triangular. Since the determinant behaves nicely with respect to matrix multiplication, see (9.7), we have

det 𝐴 = det 𝐿 det 𝑈 .

Seemingly, we made our situation worse: instead of one determinant, we have to deal with two. However, 𝐿 and 𝑈 are
rather special, as they are triangular. It turns out that computing a triangular matrix’s determinant is extremely easy. We
just have to multiply the elements in the diagonal together!

Theorem 11.2.1 (Determinant of a triangular matrix.)

Let 𝐴 = (𝑎𝑖𝑗 )𝑛𝑖,𝑗=1 ∈ ℝ𝑛×𝑛 be a triangular matrix. (That is, it is either lower or upper triangular.) Then
𝑛
det 𝐴 = ∏ 𝑎𝑖𝑖 .
𝑖=1

Proof. Suppose that 𝐴 is lower triangular. (That is, all elements above its diagonal are zero.) According to the recursive
formula for det 𝐴, we have
𝑛
det 𝐴 = ∑(−1)𝑗+1 𝑎1,𝑗 det 𝐴1,𝑗 .
𝑗=1

12.2. The proper method for computing determinants 153

Mathematics of Machine Learning

Because 𝐴 is upper triangular, 𝑎1,𝑗 = 0 if 𝑗 > 1. Thus,

det 𝐴 = 𝑎11 det 𝐴1,1 .

𝐴1,1 = (𝑎𝑖𝑗 )𝑛𝑖,𝑗=2 is also lower triangular. By iterating the previous step, we obtain

det 𝐴 = 𝑎11 𝑎22 … 𝑎𝑛𝑛 ,

which is what we had to show.

If 𝐴 is upper triangular, its transpose 𝐴𝑇 is lower triangular. Thus, we can apply the previous result, so

det 𝐴 = det 𝐴𝑇 = 𝑎11 𝑎22 … 𝑎𝑛𝑛

holds as well. □

Back to our original problem. Since the diagonal 𝐿 is constant 1, as guaranteed by the LU decomposition, we have
𝑛
det 𝐴 = det 𝑈 = ∏ 𝑢𝑖𝑖 .
𝑖=1

So, the algorithm to compute the determinant is quite simple: get the LU decomposition, then calculate the product of
𝑈 ‘s diagonal. Let’s put this into practice!

import nbimporter
from scripts.LU import LU

def fast_det(A: np.ndarray):

L, U = LU(A)
return np.prod(np.diag(U))

Yes, that simple. Let’s see how it performs!

A = np.random.rand(1000, 1000)

t_fast_det = timeit(lambda : fast_det(A), number=1)

print(f"The time it takes to calculate the determinant of a 1000 x 1000 matrix: {t_
↪fast_det}")

The time it takes to calculate the determinant of a 1000 x 1000 matrix: 41.
↪27983342299922

Forty-one seconds, but for a 1000 × 1000 matrix this time. This can be even faster if we use a better implementation of
the LU decomposition algorithm. (For instance, scipy.linalg.lu, which relies on our old friend LAPACK.)
I get emotional just by looking at this result. See how far we can go with a bit of linear algebra? This is why understanding
the fundamentals such as Gaussian elimination is essential. Machine learning and deep learning are still very new ﬁelds,
and even though an insane amount of research power is being put into it, moments like these happen all the time. Simple
ideas often give birth to new paradigms.

154 Chapter 12. Determinants in practice

Mathematics of Machine Learning

12.3 Problems

Before we wrap this chapter up, let’s go back to the very beginning. Even though we have lots of reasons against using
the determinant formula, we have one for it: it is a good exercise, and implementing it will deepen your understanding.
So, in this section, you are going to build
det 𝐴 = ∑ sign(𝜎)𝑎𝜎(1)1 … 𝑎𝜎(𝑛)𝑛 ,
𝜎∈𝑆𝑛

one step at a time.

Problem 1. Implement a function that, given an integer 𝑛, returns all permutations of the set {0, 1, … , 𝑛 − 1}. Represent
each permutation 𝜎 as a list. For example,

[2, 0, 1]

would represent the permutation 𝜎, where 𝜎(0) = 2, 𝜎(1) = 0, and 𝜎(2) = 1.

Problem 2. Let 𝜎 ∈ 𝑆𝑛 be a permutation of the set 0, 1, … , 𝑛 − 1. Its inversion number is deﬁned by

inversion(𝜎) = ∣{(𝑖, 𝑗) ∶ 𝑖 < 𝑗 and 𝜎(𝑖) > 𝜎(𝑗)}∣,

where | ⋅ | denotes the number of elements in the set. Essentially, inversion describes the number of times a permutation
reverses the order of a pair of numbers.
Turns out, the sign of 𝜎 can be written as

sign(𝜎) = (−1)inversion(𝜎) .

Implement a function that ﬁrst calculates the inversion number, then the sign of an arbitrary permutation. (Permutations
are represented like in the previous problem.)
Problem 3. Put the solutions for Problem 1. and Problem 2. together and calculate the determinant of a matrix using the
permutation formula. What do you think the time complexity of this algorithm is?

12.4 Solutions

Problem 1.

from copy import deepcopy

def permutations(n: int):

if n == 0:
return [[0]]
else:
prev_permutations = permutations(n - 1)

new_permutations = []

for p in prev_permutations:
for i in range(len(p)+1):
p_new = deepcopy(p)
p_new.insert(i, n)
new_permutations.append(p_new)

return new_permutations

12.3. Problems 155

Mathematics of Machine Learning

Problem 2.

from itertools import product

def inversion(permutation: list):

n = len(permutation)
inversions = sum([1 for i, j in product(range(n), range(n)) if i < j and␣
↪permutation[i] > permutation[j]])

return inversions

def sign(permutation: list):

i = inversion(permutation)
return (-1)**i

Problem 3.

def permutation_formula(A: np.ndarray):

n, _ = A.shape
S_n = permutations(n-1)
determinant = sum([sign(p)*np.prod([A[p[i], i] for i in range(n)]) for p in S_n])
return determinant

156 Chapter 12. Determinants in practice

CHAPTER

THIRTEEN

EIGENVALUES AND EIGENVECTORS

So far, we have seen three sides of linear transformations: functions, matrices, and transforms that distort the grid of the
underlying vector space. In the Euclidean plane, we saw some examples that shed some light into the geometric nature of
them.
Following this line of thought, let’s consider the linear transformation given by the matrix

2 1
𝐴=[ ]. (13.1)
1 2

Since the columns of 𝐴 are the images of the standard basis vectors 𝑒1 = (1, 0) and 𝑒2 = (0, 1), we can visualize the
eﬀect of 𝐴. (Check here if you don’t recall this fact.)

Fig. 13.1: Images of the standard basis vectors under the linear transformation given by 𝐴.

This seems to shear, stretch, and rotate the entire grid. However, there are special directions along which 𝐴 is simply
a stretching. For instance, consider the vector 𝑢1 = (1, 1). By a simple calculation, you can verify that 𝐴𝑢1 = 3𝑢1 .
Because of the linearity, this means that if a vector 𝑥 is in span(𝑢1 ), its image under 𝐴 is 3𝑥.
Another one is 𝑢2 = (−1, 1), where we have 𝐴𝑢2 = 𝑢2 . Thus, any 𝑥 ∈ span(𝑢2 ) is left in place.

157
Mathematics of Machine Learning

Fig. 13.2: Images of 𝑢1 = (1, 1) and 𝑢2 = (−1, 1) under the linear transformation given by 𝐴.

If we select 𝑢1 , 𝑢2 as our base, the matrix of this transformation is

3 0
𝐴𝑢1 ,𝑢2 = [ ],
0 1

that is, 𝐴𝑢1 ,𝑢2 is diagonal. We love diagonal matrices in practice because multiplication with a diagonal matrix is much
faster, as it requires 𝑂(𝑛) operations, opposed to 𝑂(𝑛2 ).
Is this a general phenomena? Are these even useful? The answer is yes to both questions. What we have just seen is
formalized by the concept of eigenvalues and eigenvectors. The terminology originates from the german word “eigen”
meaning “own”, resulting in one of the ugliest naming conventions in mathematics.

Deﬁnition 12.1 (Eigenvalues and eigenvectors.)

Let 𝑓 ∶ 𝑉 → 𝑉 be an arbitrary linear transformation. We say that the 𝜆 scalar and the 𝑥 ∈ 𝑉 \{0} nonzero vector is an
eigenvalue-eigenvector pair of 𝑓 if 𝑓(𝑥) = 𝜆𝑥 holds.

13.1 Eigenvalues of matrices

Although we have formally defined eigenvalues and eigenvectors for linear transformations, we often talk about them
in context of matrices. (Because, as we have seen, matrices and linear transformations are the same.) Let’s start by
translating the definition into the language of matrices.
If 𝐴 ∈ ℝ𝑛×𝑛 is a matrix, Definition 12.1 translates to the following: the scalar 𝜆 and the vector 𝑥 ∈ ℝ\{0} is an
eigenvalue-eigenvector pair of the matrix if

𝐴𝑥 = 𝜆𝑥 (13.2)

158 Chapter 13. Eigenvalues and eigenvectors

Mathematics of Machine Learning

holds. This can be simpliﬁed: as the linear transformation 𝑥 ↦ 𝜆𝑥 corresponds to the matrix 𝜆𝐼, (13.2) is equivalent to

(𝐴 − 𝜆𝐼)𝑥 = 0 (13.3)

If you recall how matrices arise from linear transformations, you might ask the question: won’t the eigenvalues depend on
the choice of the matrix?
The following theorem states that this is not the case: the eigenvalues for a linear transformation and its matrices are the
same.

Theorem 12.1.1 (Eigenvalues of similar matrices.)

Let 𝐴, 𝐵 ∈ ℝ𝑛×𝑛 be two similar matrices, that is, suppose that there exists an invertible 𝑇 ∈ ℝ𝑛×𝑛 such that 𝐵 =
𝑇 −1 𝐴𝑇 . Then, if 𝐴𝑥 = 𝜆𝑥 holds for some scalar 𝜆 and vector 𝑥 ∈ ℝ𝑛 , then

𝐵𝑥′ = 𝜆𝑥′

holds for some 𝑥′ ∈ ℝ as well.

Proof. Let’s massage the eigenvalue equation (13.3) a bit! We have

(𝐴 − 𝜆𝐼)𝑥 = (𝐴 − 𝜆𝑇 𝑇 −1 )𝑥
= 𝑇 (𝑇 −1 𝐴𝑇 − 𝜆𝐼)𝑇 −1 𝑥
= 0.

Since 𝑇 is invertible, 𝑇 [(𝑇 −1 𝐴𝑇 − 𝜆𝐼)𝑇 −1 𝑥] = 0 can only happen if (𝑇 −1 𝐴𝑇 − 𝜆𝐼)𝑇 −1 𝑥 = 0. (Recall the relation
of the kernel and invertibility.) This looks almost like (13.3), just a bit more complicated. Let me use some suggestive
parentheses to highlight the similarities:

[𝑇 −1 𝐴𝑇 − 𝜆𝐼][𝑇 −1 𝑥] = 0.

Note that
So, with the selection 𝑥′ = 𝑇 −1 𝑥, we have

𝑇 −1 𝐴𝑇 𝑥′ = 𝜆𝑥′ ,

which is what we had to show. □

In other words, the eigenvalues of similar matrices are the same. Consequently, we can talk about the eigenvalues of
matrices, not just linear transformations. The above theorem implies that the eigenvalues of a transformation and its
corresponding matrix are the same. Moreover, the eigenvalues of the matrix don’t depend on the choice of basis.
To be more precise, suppose that 𝐴 ∶ 𝑈 → 𝑈 is a linear transformation and 𝑃 , 𝑄 are bases of 𝑈 . The matrix of 𝐴 in
some basis 𝑆 is denoted by 𝐴𝑄 . We know that there is a transformation matrix 𝑇 ∈ ℝ𝑛×𝑛 such that

𝐴𝑄 = 𝑇 −1 𝐴𝑃 𝑇 .

So, the eigenvalues are the same.

13.1. Eigenvalues of matrices 159

Mathematics of Machine Learning

13.2 Finding eigenvalue-eigenvector pairs

Even though the definition of eigenvalues-eigenvectors is easy to understand given the geometric interpretation we just
saw, it does not give us any tools to find them in practice. Using them to get simpler representations of matrices is one
thing, but we are stuck at square one without a method to find them.
First, let’s focus on the eigenvalues. Suppose that for some 𝜆, there is a nonzero vector 𝑥 such that 𝐴𝑥 = 𝜆𝑥. The
transformation defined by 𝑥 → 𝜆𝑥 is a linear one, and its matrix is diagonal:

𝜆 0 … 0 𝑥1
⎡0 𝜆 … 0⎤ ⎡𝑥 ⎤
𝜆𝑥 = ⎢ ⎥ ⎢ 2⎥ .
⎢⋮ ⋮ ⋱ ⋮⎥⎢ ⋮ ⎥
⎣ 0 0 … 𝜆⎦ ⎣𝑥𝑛 ⎦

Because linear transformations can be added and subtracted, the deﬁning equation 𝐴𝑥 = 𝜆𝑥 is equivalent to

(𝐴 − 𝜆𝐼)𝑥 = 0,

where 𝐼 denotes the identity transformation, as deﬁned by (8.4). In another words, the transformation 𝐴 − 𝜆𝐼 maps a
nonzero vector to 0, meaning that it is not invertible, as Theorem 7.4.2 implies. We can characterize this with determinants:
we need to ﬁnd all 𝜆-s such that

det(𝐴 − 𝜆𝐼) = 0.

We can summarize the above ﬁndings in the following theorem.

Theorem 12.2.1
Let 𝐴 ∶ 𝑈 → 𝑈 be an arbitrary linear transformation. Then 𝜆 is its eigenvalue if and only if det(𝐴 − 𝜆𝐼) = 0.

Although we are one step closer, ﬁnding eigenvalues based on this still seems complicated. In the following, we are going
to see what det(𝐴 − 𝜆𝐼) really is and how we can ﬁnd the solutions of det(𝐴 − 𝜆𝐼) = 0 in practice.
Before going into the generalities, let’s revisit the example (13.1). There, we have

2−𝜆 1
det(𝐴 − 𝜆𝐼) = ∣ ∣
1 2−𝜆
= (2 − 𝜆)2 − 1
= 𝜆2 − 4𝜆 + 3.

To ﬁnd the eigenvalues, we have to solve the quadratic equation

𝜆2 − 4𝜆 + 3 = 0,

which we can do easily. Recall that the solutions of any quadratic equation 𝑎𝑥2 + 𝑏𝑥 + 𝑐 = 0 are
√
−𝑏 ± 𝑏2 − 4𝑎𝑐
𝑥1,2 = .
2𝑎
Applying this, we have 𝜆1 = 3 and 𝜆2 = 1 as solutions. There are no other ones, so 1 and 3 are the only two eigenvalues
for 𝐴.
Let’s see what happens in the general case!

160 Chapter 13. Eigenvalues and eigenvectors

Mathematics of Machine Learning

13.2.1 The characteristic polynomial

As the example above suggests, if the underlying vector space 𝑈 is 𝑛-dimensional, that is, 𝐴 is an 𝑛×𝑛 matrix, det(𝐴−𝜆𝐼)
is an 𝑛-th degree polynomial in 𝜆.
To see this, let’s write det(𝐴 − 𝜆𝐼) explicitly in terms of matrices. With this in mind, we have

𝑎 −𝜆 𝑎12 … 𝑎1𝑛
∣ 11 ∣
𝑎 𝑎22 − 𝜆 … 𝑎2𝑛 ∣
det(𝐴 − 𝜆𝐼) = ∣ 21 .
∣ ⋮ ⋮ ⋱ ⋮ ∣
∣ 𝑎𝑛1 𝑎𝑛2 … 𝑎𝑛𝑛 − 𝜆∣

If you consider the formula to calculate the determinant given by (9.5), you can see that every term is a polynomial.
Depending on how many ﬁxed points 𝜎 has (that is, points where 𝜎(𝑖) = 𝑖), the degree of this polynomial varies between
0 and 𝑛.
(Alternatively, you can see that det(𝐴 − 𝜆𝐼) is a polynomial of degree 𝑛 by using the recursive formula (9.6) and applying
induction.)

Deﬁnition 12.2.1 (Characteristic polynomial of matrices.)

Let 𝐴 ∈ ℝ𝑛×𝑛 be an arbitrary matrix. The polynomial

𝑝(𝜆) = det(𝐴 − 𝜆𝐼)

is called the characteristic polynomial of 𝐴.

The roots of the characteristic polynomial are the eigenvalues. If 𝑈 is an 𝑛-dimensional complex vector space (that is,
the set of scalars is ℂ), the fundamental theorem of algebra guarantees that det(𝐴 − 𝜆𝐼) = 0 has exactly
𝑛 roots.
As a consequence, every matrix 𝐴 ∈ ℂ𝑛×𝑛 has at least one eigenvalue. Note that roots can have higher algebraic
multiplicity. For instance, the characteristic polynomial for the matrix

1 0 0
𝐵=⎡
⎢0 1 0⎤⎥
⎣0 0 2⎦

is (1 − 𝜆)2 (2 − 𝜆). So, its roots are 1 (with algebraic multiplicity 2) and 2.
If we restrict ourselves to real matrices and real vector spaces, the existence of eigenvalues and eigenvectors are not
guaranteed. For instance, consider

1 2
𝐶=[ ].
1 1

Its characteristic polynomial is −𝜆2 − 1, which doesn’t have any real roots, only complex ones: 𝜆1 = 𝑖 and 𝜆2 = −𝑖.
Mathematically speaking, if we want to stay within the conﬁnes of real vector spaces, 𝐶 has no eigenvalues. However, we
are here to do machine learning, not algebra. Thus, we are going to be a bit imprecise and treat real matrices as complex
ones. We don’t often need complex numbers to describe mathematical models of a dataset, but they frequently appear
during the analysis of matrices.

13.2. Finding eigenvalue-eigenvector pairs 161

Mathematics of Machine Learning

13.2.2 Finding eigenvectors

When an eigenvalue 𝜆 is identiﬁed, we can set out to ﬁnd the corresponding eigenvectors; that is, vectors 𝑥 where (𝐴 −
𝜆𝐼)𝑥 = 0. In more precise terms, we are looking for \ker (A - \lambda I) $.
As we have mentioned before, the kernel of any linear transformation is a subspace. As it might be more than one
dimensional, identifying it often involves an implicit description like 𝑥1 + 𝑥2 = 0.
Let’s check what happens with our recurring example

2 1
𝐴=[ ].
1 2

Previously, we have seen that 𝜆1 = 3 and 𝜆2 = 1 are the eigenvalues. To identify the corresponding eigenvectors for,
say, 𝜆1 , we have to ﬁnd all solutions for the linear equation (𝐴 − 𝜆1 𝐼)𝑥 = 0. Expanding this, we have

−𝑥1 + 𝑥2 = 0
𝑥1 − 𝑥2 = 0.

Both equations implying that all 𝑥 = (𝑥1 , 𝑥2 ) are solutions where 𝑥1 = 𝑥2 .

13.3 Eigenvectors, eigenspaces, and their bases

Deﬁnition 12.3.1
Let 𝑓 ∶ 𝑉 → 𝑉 be an arbitrary linear transformation, and 𝜆 its eigenvalue. The subspace of eigenvectors deﬁned by

𝑈𝜆 = {𝑥 ∶ 𝐴𝑥 = 𝜆𝑥}

is called the eigenspace of 𝜆.

Eigenspaces play an important role in understanding the structure of linear transformations. First, we can note that a
linear transformation keeps its eigenspaces invariant. (That is, if 𝑥 is in the 𝑈𝜆 eigenspace, then 𝑓(𝑥) ∈ 𝑈𝜆 as well.) This
property makes it possible for us to restrict linear transformations to their eigenspaces.
To illustrate the concept of eigenspaces, let’s revisit the already familiar matrix

2 1
𝐴=[ ]
1 2

one more time. Its eigenvalues are 𝜆1 = 3 and 𝜆2 = 1, and by solving the equation (𝐴 − 𝜆1 𝐼)𝑥 = 0, the eigenspace of
𝜆1 is

𝑈𝜆1 = {𝑥 ∈ ℝ2 ∶ 𝑥1 = 𝑥2 }.

Similarly, you can check that 𝑈𝜆2 = {𝑥 ∈ ℝ2 ∶ 𝑥1 = −𝑥2 }. (If you go back to Fig. 13.2, you can visualize 𝑈𝜆1 and
𝑈𝜆2 .)
Eigenspaces are not necessarily one dimensional. For instance, consider one of the the previous examples

1 0 0
𝐵=⎡
⎢0 1 0⎤⎥,
⎣0 0 2⎦

162 Chapter 13. Eigenvalues and eigenvectors

Mathematics of Machine Learning

with two eigenvalues 𝜆1 = 1 and 𝜆2 = 2. Substituting 𝜆1 back into the equation and solving for (𝐵 − 𝐼)𝑥 = 0, we
obtain that

𝑈𝜆1 = {𝑥 ∈ ℝ3 ∶ 𝑥3 = 0},

which is simply the plane determined by the ﬁrst two axes.

The structure of eigenspaces determines whether or not we can diagonalize the matrix 𝐴 with a change of basis transfor-
mation Λ = 𝑇 −1 𝐴𝑈 . The following general theorem establishes this connection.

Theorem 12.3.1
Let 𝑓 ∶ 𝑉 → 𝑉 be a linear transformation, let 𝐴 ∈ ℝ𝑛×𝑛 be its matrix in some basis, and let 𝑈𝜆1 , … , 𝑈𝜆𝑘 be the
eigenspaces of 𝑓. The following are equivalent.
(a) There is a matrix 𝑇 ∈ ℝ𝑛×𝑛 such that

Λ = 𝑇 −1 𝐴𝑇 ,

where Λ is a diagonal matrix.

(b) There is a basis 𝑢1 , … , 𝑢𝑛 for 𝑉 that can be selected from the eigenvectors of 𝑓.
(c) 𝑉 can be written as the direct sum of the eigenspaces, that is,

𝑉 = 𝑈𝜆1 + ⋯ + 𝑈𝜆𝑘 .

Proof. (a) ⟹ (b). If 𝐴 is the matrix of 𝑓 in some basis, then a a similarity transformation is equivalent to a change in
basis.
That is, the new matrix Λ = 𝑇 −1 𝐴𝑇 is the matrix of 𝑓 in a diﬀerent basis, say 𝑢1 , … , 𝑢𝑛 .
If Λ is diagonal, it can be written in the form

𝜆1 0 … 0
⎡0 𝜆2 … 0⎤
Λ=⎢ ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣0 0 … 𝜆𝑛 ⎦

(Note that the 𝜆𝑖 -s are not mutually diﬀerent.) Thus, Λ𝑢𝑖 = 𝜆𝑖 𝑢𝑖 , meaning that 𝑢1 , … , 𝑢𝑛 is a basis from the eigenvectors
of 𝑓.
(b) ⟹ (a). If 𝑢1 , … , 𝑢𝑛 is a basis from the eigenvectors of 𝑓, then its matrix Λ in that basis is diagonal. Thus, 𝐴 is
similar to Λ, which is what we had to show.
(b) ⟹ (c). By deﬁnition, the direct sum of the eigenspaces contains all linear combinations of the form
𝑛
𝑥 = ∑ 𝑥𝑖 𝑢𝑖 .
𝑖=1

Since 𝑢1 , … , 𝑢𝑛 is a basis, 𝑉 = 𝑈𝜆1 + ⋯ + 𝑈𝜆𝑘 holds.

(c) ⟹ (b). From each eigenspace 𝑈𝜆𝑖 , we can select a basis. Due to the construction of 𝑈𝜆𝑖 , its basis will consist of
eigenvectors.
Since 𝑉 = 𝑈𝜆1 + ⋯ + 𝑈𝜆𝑘 , the union of of such bases 𝑢1 , … , 𝑢𝑛 will be a basis for 𝑉 . □

13.3. Eigenvectors, eigenspaces, and their bases 163

Mathematics of Machine Learning

Even though this theorem does not give us any useful recipes on how to diagonalize a matrix, it provides us with an
extremely valuable insight: diagonalization is equivalent to ﬁnding an eigenvector basis. This is not always possible, but
when it is, we are cooking with gas.
In the next chapter, we will take a deep dive into this topic, providing multiple ways to simplify matrices. If our journey
in linear algebra is akin to a mountain climb, we will reach the peak soon.

13.4 Problems

Problem 1. Let 𝐴 ∈ ℝ𝑛×𝑛 be an upper or lower trangular matrix. Show that the eigenvalues of 𝐴 are its diagonal
elements.

164 Chapter 13. Eigenvalues and eigenvectors

CHAPTER

FOURTEEN

SPECIAL TRANSFORMATIONS AND MATRIX DECOMPOSITIONS

So far, we have aspired to develop a geometric view of linear algebra. Vectors are mathematical objects deﬁned by their
direction and magnitude. In the spaces of vectors, the concept of distance and orthogonality gives rise to a geometric
structure.
Linear transformations, the building blocks of machine learning, are just mappings that distort this structure: rotating,
stretching, skewing the geometry. However, there are types of transformations that preserve some of the structure. In
practice, these provide valuable insight, and additionally, they are much easier to work with. In this section, we will take
a look at the most important ones, those that we’ll encounter in machine learning.

14.1 The adjoint transformation

In machine learning, the most important stage is the Euclidean space ℝ𝑛 . This is where data is represented and manipu-
lated. There, the entire geometric structure is deﬁned by the inner product
𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑥𝑖 𝑦𝑖 ,
𝑖=1

giving rise to the notion of magnitude, direction (in the form of angles), and orthogonality. Because of this, transforma-
tions that can be related to the inner product are special. For instance, if ⟨𝑓(𝑥), 𝑓(𝑥)⟩ = ⟨𝑥, 𝑥⟩ holds for all 𝑥 ∈ ℝ𝑛 and
the linear transformation 𝑓 ∶ ℝ𝑛 → ℝ𝑛 , we know that 𝑓 leaves the norm invariant. That is, distance in the original and
the transformed feature space have the same meaning.
First, we will establish a general relation between images of vectors under a transform and their inner product. This is
going to be the foundation for our discussions in this chapter.

Theorem 13.1.1 (The adjoint transformation.)

Let 𝑓 ∶ ℝ𝑛 → ℝ𝑛 be a linear transformation. Then, there exists a linear transformation 𝑓 ∗ ∶ ℝ𝑛 → ℝ𝑛 for which

⟨𝑓(𝑥), 𝑦⟩ = ⟨𝑥, 𝑓 ∗ (𝑦)⟩ (14.1)

holds for all 𝑥, 𝑦 ∈ ℝ𝑛 . 𝑓 ∗ is called the adjoint transformation of 𝑓.

Moreover, if 𝐴 ∈ ℝ𝑛×𝑛 is the matrix of 𝑓 in the standard orthonormal basis, then the matrix of 𝑓 ∗ is 𝐴𝑇 . That is,

⟨𝐴𝑥, 𝑦⟩ = ⟨𝑥, 𝐴𝑇 𝑦⟩. (14.2)

165
Mathematics of Machine Learning

Proof. Suppose that 𝐴 ∈ ℝ𝑛×𝑛 is the matrix of 𝑓 in the standard orthonormal basis. For any 𝑥 = (𝑥1 , … , 𝑥𝑛 ) and
𝑦 = (𝑦1 , … , 𝑦𝑛 ), the inner product is deﬁned by
𝑛
⟨𝑥, 𝑦⟩ = ∑ 𝑥𝑖 𝑦𝑖 ,
𝑖=1

and 𝐴𝑥 can be written as

𝑛
𝑎11 𝑎12 … 𝑎1𝑛 𝑥1 ∑ 𝑎1𝑗 𝑥𝑗
⎡𝑎 ⎡ 𝑗=1
𝑛 ⎤
𝑎22 … 𝑎2𝑛 ⎤ ⎡ 𝑥2 ⎤ ⎢ ∑𝑗=1 𝑎2𝑗 𝑥𝑗 ⎥
⎢ 21 ⎥⎢ ⎥ = ⎢ ⎥.
⎢ ⋮ ⋮ ⋱ ⋮ ⎥⎢ ⋮ ⎥ ⎢ ⋮ ⎥
𝑛
⎣𝑎𝑛1 𝑎𝑛2 … 𝑎𝑛𝑛 ⎦ ⎣𝑥𝑛 ⎦ ⎣∑ 𝑎𝑛𝑗 𝑥𝑗 ⎦
𝑗=1

Using this form, we can express ⟨𝐴𝑥, 𝑦⟩ in terms of 𝑎𝑖𝑗 -s, 𝑥𝑖 -s, and 𝑦𝑖 -s. For this, we have
𝑛 𝑛
⟨𝐴𝑥, 𝑦⟩ = ∑ ( ∑ 𝑎𝑖𝑗 𝑥𝑗 )𝑦𝑖
𝑖=1 𝑗=1
𝑛 𝑛
= ∑ ( ∑ 𝑎𝑖𝑗 𝑦𝑖 ) 𝑥𝑗
𝑗=1 ⏟⏟⏟⏟⏟
𝑖=1
𝑗-th component of 𝐴𝑇 𝑦

= ⟨𝑥, 𝐴𝑇 𝑦⟩.

This shows that the transformation given by 𝑓 ∗ ∶ 𝑥 ↦ 𝐴𝑇 𝑥 satisﬁes (14.1) and (14.2), which is what we had to show. □

Why is the quantity ⟨𝐴𝑥, 𝑦⟩ that important for us? Because inner products deﬁne the geometric structure of a vector
space. Recall the equation (5.12), allowing us to fully describe any vector using only the inner products with respect to
an orthonormal basis. In addition, ⟨𝑥, 𝑥⟩ = ‖𝑥‖2 deﬁnes the notion of distance and magnitude. Because of this, (14.1)
and (14.2) will be quite useful for us.
As we are about to see, transformations that preserve the inner product are rather special, and these relations provide us
a way to characterize them both algebraically and geometrically.

14.2 Orthogonal transformations

Let’s jump straight into the deﬁnition.

Deﬁnition 13.2.1 (Orthogonal transformations.)

Let 𝑓 ∶ ℝ𝑛 → ℝ𝑛 be an arbitrary linear transformation. 𝑓 is called orthogonal if

⟨𝑓(𝑥), 𝑓(𝑦)⟩ = ⟨𝑥, 𝑦⟩

holds for all 𝑥, 𝑦 ∈ ℝ𝑛 .

As a consequence, an orthogonal 𝑓 preserves the norm: ‖𝑓(𝑥)‖2 = ⟨𝑓(𝑥), 𝑓(𝑥)⟩ = ⟨𝑥, 𝑥⟩ = ‖𝑥‖2 . Because the angle
enclosed by two vectors is deﬁned by their inner product, see (5.11), the property ⟨𝑓(𝑥), 𝑓(𝑦)⟩ = ⟨𝑥, 𝑦⟩ means that an
orthogonal transform also preserves angles.
We can translate the deﬁnition to the language of matrices as well. In practice, we are always going to work with matrices,
so this characterization is essential.

166 Chapter 14. Special transformations and matrix decompositions

Mathematics of Machine Learning

Theorem 13.2.1 (Matrices of orthogonal transformations.)

Let 𝑓 ∶ ℝ𝑛 → ℝ𝑛 be a linear transformation and 𝐴 ∈ ℝ𝑛×𝑛 be its matrix in the standard orthonormal basis. Then 𝑓 is
orthogonal if and only if 𝐴𝑇 = 𝐴−1 .

Proof. As usual, we have to show the implication in both ways.

(a) Suppose that 𝑓 is orthogonal. Then, (14.2) gives

⟨𝑥, 𝑦⟩ = ⟨𝐴𝑥, 𝐴𝑦⟩ = ⟨𝑥, 𝐴𝑇 𝐴𝑦⟩.

Thus, for any given 𝑦,

⟨𝑥, (𝐴𝑇 𝐴 − 𝐼)𝑦⟩ = 0

holds for all 𝑥. By letting 𝑥 = (𝐴𝑇 𝐴 − 𝐼)𝑦, the positive deﬁniteness of the inner product implies that (𝐴𝑇 𝐴 − 𝐼)𝑦 = 0
for all 𝑦. Thus, 𝐴𝑇 𝐴 = 𝐼, which means that 𝐴𝑇 is the inverse of 𝐴.
(b) If 𝐴𝑇 = 𝐴−1 , we have
⟨𝐴𝑥, 𝐴𝑦⟩ = ⟨𝑥, 𝐴𝑇 𝐴𝑦⟩
= ⟨𝑥, 𝐴−1 𝐴𝑦⟩
= ⟨𝑥, 𝑦⟩,
showing that 𝑓 is orthogonal. □

The fact that 𝐴𝑇 = 𝐴−1 has a profound implication regarding the columns of 𝐴. If you think back to the deﬁnition of
matrix multiplication, the element in the 𝑖-th row and 𝑗-th column of 𝐴𝐵 is the inner product of 𝐴‘s 𝑖-th row and 𝐵‘s 𝑗-th
column.
To be more precise, if the 𝑖-th column is denoted by 𝑎𝑖 = (𝑎1,𝑖 , 𝑎2,𝑖 , … , 𝑎𝑛,𝑖 ), then we have
𝑛
𝐴𝑇 𝐴 = (⟨𝑎𝑖 , 𝑎𝑗 ⟩) = 𝐼,
𝑖,𝑗=1

that is,

1 if 𝑖 = 𝑗,
⟨𝑎𝑖 , 𝑎𝑗 ⟩ = {
0 otherwise.

In other words, the columns of 𝐴 form an orthonormal system. This fact should not come as a surprise since orthogonal
transformations preserve magnitude and orthogonality, and the columns of 𝐴 are the images of the standard orthonormal
basis 𝑒1 , … , 𝑒𝑛 .
In machine learning, performing an orthogonal transformation on our features is equivalent to looking at them from
another perspective, without distortion. You might know it already, but this is what Principal Component Analysis is
doing.

14.2. Orthogonal transformations 167

Mathematics of Machine Learning

14.3 Self-adjoint transformations and the spectral decomposition

theorem

Besides orthogonal transformations, there is another important class: transformations whose adjoints are themselves. Bear
with me a bit, and we’ll see an example soon.

Deﬁnition 13.3.1 (Self-adjoint transformations.)

Let 𝑓 ∶ ℝ𝑛 → ℝ𝑛 be a linear transformation. 𝑓 is self-adjoint if 𝑓 ∗ = 𝑓, that is,

⟨𝑓(𝑥), 𝑦⟩ = ⟨𝑥, 𝑓(𝑦)⟩ (14.3)

holds for all 𝑥, 𝑦 ∈ ℝ𝑛 .

As always, we are going to translate this to the language of the matrices. If 𝐴 is the matrix of 𝑓 in the standard orthonormal
basis, we know that 𝐴𝑇 is the matrix of the adjoint. For self-adjoint transformations, it implies that 𝐴𝑇 = 𝐴. Matrices
such as these are called symmetric, and they have a lot of pleasant properties.
For us, the most important one is that symmetric matrices can be diagonalized! (That is, transformed to a diagonal matrix
with a similarity transform.) The following theorem makes this precise.

Theorem 13.3.1 (Spectral decomposition of real symmetric matrices.)

Let 𝐴 ∈ ℝ𝑛×𝑛 be a real symmetric matrix. Then 𝐴 has exactly 𝑛 real eigenvalues 𝜆1 ≥ ⋯ ≥ 𝜆𝑛 , and the corresponding
eigenvectors 𝑢1 , … , 𝑢𝑛 can be selected such that they form an orthonormal basis.
Moreover, if we let Λ = diag(𝜆1 , … , 𝜆𝑛 ) and 𝑈 be the orthogonal matrix whose columns are 𝑢1 , … , 𝑢𝑛 , then

𝐴 = 𝑈 Λ𝑈 𝑇 (14.4)

holds.

Note that the eigenvalues 𝜆1 ≥ ⋯ ≥ 𝜆𝑛 are not necessarily distinct from each other.

Proof. (Sketch) Since the proof is pretty involved, we are better oﬀ getting to know the main ideas behind it, without all
the mathematical details.
The main steps are the following.
1. If the matrix 𝐴 is symmetric, all of its eigenvalues are real.
2. Using this, it can be shown that an orthonormal basis can be formed from the eigenvectors of 𝐴.
3. Writing the matrix of the transformation 𝑥 → 𝐴𝑥 in this orthonormal basis yields a diagonal matrix. Hence, a
change of basis yields (14.4).
Showing that the eigenvalues are real requires some complex numbers magic (which is beyond our scope). The tough part
is the second step. Once that has been done, moving to the third one is straightforward, as we have seen when talking
about eigenspaces and their bases. □

We still don’t have a hands-on way to diagonalize matrices, but this theorem gets us one step closer: at least we know it
is possible for symmetric matrices. This is an important stepping stone, as we’ll be able to reduce the general case to the
symmetric one.

168 Chapter 14. Special transformations and matrix decompositions

Mathematics of Machine Learning

The requirement for a matrix to be symmetric seems like a very special one. However, in practice, we can symmetrize
matrices in several diﬀerent ways. For any matrix 𝐴 ∈ ℝ𝑛×𝑚 , the products 𝐴𝐴𝑇 and 𝐴𝑇 𝐴 will be symmetric. For
𝑇
square matrices, the average 𝐴+𝐴
2 also works. So, symmetric matrices are more common than you think.
The orthogonal matrix 𝑈 and the corresponding orthonormal basis {𝑢1 , … , 𝑢𝑛 } that diagonalizes a symmetric matrix 𝐴
has a special property that is going to be very important for us later, when we discuss the Principal Component Analysis
of data samples.

Theorem 13.3.2
Let 𝐴 ∈ ℝ𝑛×𝑛 be a real symmetric matrix and let 𝜆1 ≥ ⋯ ≥ 𝜆𝑛 be its real eigenvalues in decreasing order. Moreover,
let 𝑈 ∈ ℝ𝑛×𝑛 be the ortogonal matrix that diagonalizes 𝐴, with the corresponding orthonormal basis {𝑢1 , … , 𝑢𝑛 }.
Then
arg max 𝑥𝑇 𝐴𝑥 = 𝑢1 ,
‖𝑥‖=1

and
max 𝑥𝑇 𝐴𝑥 = 𝜆1 .
‖𝑥‖=1

Proof. Since {𝑢1 , … , 𝑢𝑛 } is an orthonormal basis, any 𝑥 can be expressed as a linear combination of them:
𝑛
𝑥 = ∑ 𝑥𝑖 𝑢𝑖 , 𝑥𝑖 ∈ ℝ.
𝑖=1

Thus, since the 𝑢𝑖 are eigenvectors of 𝐴,

𝑛 𝑛 𝑛
𝐴𝑥 = 𝐴( ∑ 𝑥𝑖 𝑢𝑖 ) = ∑ 𝑥𝑖 𝐴𝑢𝑖 = ∑ 𝑥𝑖 𝜆𝑖 𝑢𝑖 .
𝑖=1 𝑖=1 𝑖=1

Plugging it back into 𝑥𝑇 𝐴𝑥, we have

𝑛 𝑇 𝑛
𝑥𝑇 𝐴𝑥 = ( ∑ 𝑥𝑗 𝑢𝑗 ) 𝐴( ∑ 𝑥𝑖 𝑢𝑖 )
𝑗=1 𝑖=1
𝑛 𝑛
= ( ∑ 𝑥𝑗 𝑢𝑇𝑗 )𝐴( ∑ 𝑥𝑖 𝑢𝑖 )
𝑗=1 𝑖=1
𝑛 𝑛
= ( ∑ 𝑥𝑗 𝑢𝑇𝑗 )( ∑ 𝑥𝑖 𝜆𝑖 𝑢𝑖 )
𝑗=1 𝑖=1
𝑛
= ∑ 𝑥𝑖 𝑥𝑗 𝜆𝑖 𝑢𝑇𝑗 𝑢𝑖 .
𝑖,𝑗=1

Since the 𝑢𝑖 -s form an orthonormal basis,

1 if 𝑖 = 𝑗,
𝑢𝑇𝑗 𝑢𝑖 = {
0 otherwise.

In other words, 𝑢𝑇𝑗 𝑢𝑖 vanishes when 𝑖 ≠ 𝑗. Continuing the above calculation with this observation,
𝑛 𝑛
𝑥𝑇 𝐴𝑥 = ∑ 𝑥𝑖 𝑥𝑗 𝜆𝑖 𝑢𝑇𝑗 𝑢𝑖 = ∑ 𝑥2𝑖 𝜆𝑖 .
𝑖,𝑗=1 𝑖=1

14.3. Self-adjoint transformations and the spectral decomposition theorem 169

Mathematics of Machine Learning

𝑛 𝑛
When ‖𝑥‖2 = ∑𝑖=1 𝑥2𝑖 = 1, the sum ∑𝑖=1 𝑥2𝑖 𝜆𝑖 is a weighted average of the eigenvalues 𝜆𝑖 . So,
𝑛 𝑛
∑ 𝑥2𝑖 𝜆𝑖 ≤ ∑ 𝑥2𝑖 max 𝜆𝑘 = max 𝜆𝑘 = 𝜆1 ,
𝑘=1,…,𝑛 𝑘=1,…,𝑛
𝑖=1 𝑖=1

from which 𝑥𝑇 𝐴𝑥 ≤ 𝜆1 follows. (Recall that we can assume without loss in generality that the eigenvalues are decreasing.)
On the other hand, by plugging in 𝑥 = 𝑢1 , we can see that 𝑢𝑇1 𝐴𝑢1 = 𝜆1 , so the maximum is indeed attained. From these
two, the theorem follows. □

Remark 13.3.1
In other words, Theorem 13.3.2 gives that the function 𝑥 ↦ 𝑥𝑇 𝐴𝑥 assumes its maximum value at 𝑢1 , and that maximum
value is 𝑢𝑇1 𝐴𝑢1 = 𝜆1 . The quantity 𝑥𝑇 𝐴𝑥 seems quite mysterious as well, so let’s clarify this a bit. If we think in terms
of features, the vectors 𝑢1 , … , 𝑢𝑛 can be thought of as mixtures of the “old” features 𝑒1 , … , 𝑒𝑛 . When we have actual
observations (that is, data), we can use the above process to diagonalize the covariance matrix. So, if 𝐴 denotes this
covariance matrix, 𝑢𝑇1 𝐴𝑢1 is the variance of the new feature 𝑢1 .
Thus, this theorem says that 𝑢1 is the unique feature that maximizes the variance. So, among all the possible choices for
new features, 𝑢1 conveys the most information about the data.
At this point, we don’t have all the tools to see, but in connection to the Principal Component Analysis, this says that the
ﬁrst principal vector is the one that maximizes variance.

Theorem 13.3.2 is just a special case of a general theorem.

Theorem 13.3.3
Let 𝐴 ∈ ℝ𝑛×𝑛 be a real symmetric matrix, let 𝜆1 ≥ ⋯ ≥ 𝜆𝑛 be its real eigenvalues in decreasing order. Moreover, let
𝑈 ∈ ℝ𝑛×𝑛 be the ortogonal matrix that diagonalizes 𝐴, with the corresponding orthonormal basis {𝑢1 , … , 𝑢𝑛 }.
Then for all 𝑘 = 1, … , 𝑛, we have
arg max 𝑥𝑇 𝐴𝑥 = 𝑢𝑘 ,
‖𝑥‖ = 1
𝑥 ⟂ {𝑢1 , … , 𝑢𝑘−1 }
and
max 𝑥𝑇 𝐴𝑥 = 𝜆𝑘 .
‖𝑥‖ = 1
𝑥 ⟂ {𝑢1 , … , 𝑢𝑘−1 }

Proof. The proof is almost identical to the previous one. Since 𝑥 is required to be orthogonal to 𝑢1 , … , 𝑢𝑘−1 , it can be
expressed as
𝑛
𝑥 = ∑ 𝑥𝑖 𝑢𝑖 .
𝑖=𝑘

Following the calculations in the proof of the previous theorem, we have

𝑛
𝑥𝑇 𝐴𝑥 = ∑ 𝑥2𝑖 𝜆𝑖 ≤ 𝜆𝑘 .
𝑖=𝑘

On the other hand, similarly as before, 𝑢𝑇𝑘 𝐴𝑢𝑘 = 𝜆𝑘 , so the theorem follows. □

170 Chapter 14. Special transformations and matrix decompositions

Mathematics of Machine Learning

14.4 The Singular Value Decomposition

So, we can diagonalize any real symmetric matrix with an orthogonal transformation. That’s great, but what if our matrix
is not symmetric? After all, this is a rather special case.
How can we do the same for a general matrix? We’ll use a very strong tool, straight from the mathematician’s toolkit:
wishful thinking. We pretend to have the solution, then reverse engineer it. To be speciﬁc, let 𝐴 ∈ ℝ𝑛×𝑚 be any real
matrix. (It might not be square.) Since 𝐴 is not symmetric, we have to relax our wishes for factoring it into the form
𝑈 Λ𝑈 𝑇 . The most straightforward way is to assume that the orthogonal matrices to the left and to the right of Λ are not
each other’s transposes.
Thus, we are looking for orthogonal matrices 𝑈 ∈ ℝ𝑛×𝑛 and 𝑉 ∈ ℝ𝑚×𝑚 such that
𝐴 = 𝑈 Λ𝑉 𝑇

holds, where Λ ∈ ℝ𝑛×𝑚 is diagonal. (A non-square matrix Λ = (𝜆𝑖,𝑗 )𝑛,𝑚

𝑖,𝑗=1 is diagonal if 𝜆𝑖,𝑗 is zero when 𝑖 ≠ 𝑗.)

Here comes the reverse-engineering part. First, as we have discussed earlier, 𝐴𝐴𝑇 and 𝐴𝑇 𝐴 are symmetric matrices.
Second, we can simplify them by using the orthogonality of 𝑈 and 𝑉 , obtaining
𝐴𝐴𝑇 = (𝑈 Λ𝑉 𝑇 )(𝑉 Λ𝑈 𝑇 )
= 𝑈 Λ2 𝑈 𝑇 .
Similarly, we have 𝐴𝑇 𝐴 = 𝑉 Λ2 𝑉 𝑇 . Good news: we can actually ﬁnd 𝑈 and 𝑉 by applying the spectral decomposition
theorem to 𝐴𝐴𝑇 and 𝐴𝑇 𝐴 respectively. Thus, the factorization 𝐴 = 𝑈 Λ𝑉 𝑇 is valid! This form is called the singular
value decomposition (SVD), one of the pinnacle achievements of linear algebra.
Of course, we are not done yet, we only know where to look. Let’s make this mathematically precise!

Theorem 13.4.1 (Singular value decomposition. )

Let 𝐴 ∈ ℝ𝑛×𝑚 be an arbitrary matrix. Then there exists a diagonal matrix Λ ∈ ℝ𝑛×𝑚 and orthogonal matrices 𝑈 ∈ ℝ𝑛×𝑛
and 𝑉 ∈ ℝ𝑚×𝑚 such that
𝐴 = 𝑈 Λ𝑉 𝑇 .

Proof. Since 𝐴𝑇 𝐴 ∈ ℝ𝑚×𝑚 is a real symmetric matrix, we can apply the spectral decomposition theorem to obtain a
diagonal Λ ∈ ℝ𝑚×𝑚 and orthogonal 𝑉 ∈ ℝ𝑚×𝑚 such that
𝐴𝑇 𝐴 = 𝑉 Λ𝑉 𝑇

holds. Now, we only have to show that 𝐴𝑉 can be written as

𝐴𝑉 = 𝑈 Λ, 𝑈 ∈ ℝ𝑛×𝑛

for some orthogonal matrix 𝑈 . If this indeed holds, we can select 𝑈 ∶= 𝐴𝑉 Λ−1 , obtaining the singular value decompo-
sition.
Stating that 𝐴𝑉 = 𝑈 Λ for an orthogonal 𝑈 and diagonal Λ is equivalent to saying that the columns of 𝐴𝑉 are orthogonal.
(Since multiplying 𝑈 with a diagonal matrix from the right is the same as scaling the columns of 𝑈 .) In turn, this is
equivalent to showing that (𝐴𝑉 )𝑇 (𝐴𝑉 ) is diagonal. Thus, we have
(𝐴𝑉 )𝑇 (𝐴𝑉 ) = 𝑉 𝑇 𝐴
⏟ 𝑇𝐴 𝑉
=𝑉 Λ𝑉 𝑇
𝑇 𝑇
= 𝑉 (𝑉 Λ𝑉 )𝑉
= (𝑉 𝑇 𝑉 )Λ(𝑉 𝑇 𝑉 )
= Λ,

14.4. The Singular Value Decomposition 171

Mathematics of Machine Learning

which is diagonal. Now, everything is ready to reap the rewards of our work. By selecting 𝑈 ∶= 𝐴𝑉 Λ−1 , we have
𝑈 𝑇 𝐴𝑉 = Λ, which is diagonal. Thus, since 𝑈 and 𝑉 are orthogonal, we ﬁnally have

𝐴 = 𝑈 Λ𝑉 𝑇 ,

the desired singular value decomposition. □

Let’s take a moment to appreciate the power of the singular value decomposition. The columns of 𝑈 and 𝑉 are orthogonal
matrices, which are rather special transformations. As they leave the inner products and the norm invariant, the structure
of the underlying vector spaces are preserved. The diagonal Λ is also special, as it is just a stretching in the direction of
the bases. It is very surprising that any linear transformation is the composition of these three special ones.
Besides mapping out the fine structure of linear transformations, SVD offers a lot more. For instance, it generalizes the
notion of eigenvectors, a concept that was defined only for square matrices. With this, we have

𝐴𝑉 = 𝑈 Λ,

which we can take a look column-wise. Here, Λ is diagonal, but its number of elements depend on the smaller one of 𝑛
or 𝑚.
So, if 𝑢𝑖 is the 𝑖-th column of 𝑈 , 𝑣𝑖 is the 𝑖-th column of 𝑉 , the identity 𝐴𝑉 = 𝑈 Λ is translated to

𝐴𝑣𝑖 = 𝜆𝑖 𝑢𝑖 , 0 ≤ 𝑖 ≤ min(𝑛, 𝑚).

This resembles closely to the deﬁnition of eigenvalue-eigenvector pairs, except that instead of one vector, we have two.
The 𝑢𝑖 and 𝑣𝑖 are the so-called left and right singular vectors, while the scalars 𝜆𝑖 are called singular values.

14.5 Orthogonal projections

Linear transformations are essentially manipulations of data, revealing other (hopefully more useful) representations.
Intuitively, we think about them as one-to-one mappings, faithfully preserving all the “information” from the input.
This is often not the case, to such an extent that sometimes a lossy compression of the data is highly beneficial. To give
you a concrete example, consider a dataset with a million features, out of which only a couple hundred are useful. What
we can do is to identify the important features and throw away the rest, obtaining a representation that is more compact,
thus easier to work with.
This notion is formalized by the concept of orthogonal projections. We have already met them upon our first encounter
with the inner products (see (5.7)). Projections also played a fundamental role in the Gram-Schmidt process, used to or-
thogonalize an arbitrary basis. Because we are already somewhat familiar with orthogonal projections, a formal definition
is in due.

Deﬁnition 13.5.1 (Projections and orthogonal projections.)

Let 𝑉 be an arbitrary inner product space and 𝑃 ∶ 𝑉 → 𝑉 be a linear transformation. 𝑃 is a projection if 𝑃 2 = 𝑃 .
A projection 𝑃 is orthogonal if the subspaces ker 𝑃 and im𝑃 are orthogonal to each other. (That is, for every pair of
𝑥 ∈ ker 𝑃 and 𝑦 ∈ im𝑃 , we have ⟨𝑥, 𝑦⟩ = 0.)

Let’s revisit the examples we have seen so far to get a grip on the deﬁnition!
Example 1. The simplest one is the orthogonal projection to a single vector. That is, if 𝑢 ∈ ℝ𝑛 is an arbitrary vector, the
transformation
⟨𝑥, 𝑢⟩
proj𝑢 (𝑥) = 𝑢
⟨𝑢, 𝑢⟩

172 Chapter 14. Special transformations and matrix decompositions

Mathematics of Machine Learning

is the orthogonal projection to (the subspace spanned by) 𝑢. (We have talked about this when discussing the geometric in-
terpretation of inner products, where this deﬁnition was deduced from a geometric intuition.) Applying this transformation
repeatedly, we get
⟨𝑥,𝑢⟩
⟨ ⟨𝑢,𝑢⟩ 𝑢, 𝑢⟩
proj𝑢 (proj𝑢 (𝑥)) = 𝑢
⟨𝑢, 𝑢⟩
⟨𝑥,𝑢⟩
⟨𝑢,𝑢⟩ ⟨𝑢, 𝑢⟩
= 𝑢
⟨𝑢, 𝑢⟩
⟨𝑥, 𝑢⟩
= 𝑢
⟨𝑢, 𝑢⟩
= proj𝑢 (𝑥).

Thus, faithfully to its name, proj𝑢 is indeed a projection. To see that it is indeed orthogonal, let’s examine its kernel and
image! Since the value of proj𝑢 (𝑥) is a scalar multiple of 𝑢, its image is

im(proj𝑢 ) = span(𝑢).
⟨𝑥,𝑢⟩
Its kernel, the set of vectors mapped to zero by proj𝑢 , is also easy to ﬁnd, as ⟨𝑢,𝑢⟩ 𝑢 = 0 can only happen if ⟨𝑥, 𝑢⟩ = 0,
that is, if 𝑥 ⟂ 𝑢. In other words,

ker(proj𝑢 ) = span(𝑢)⟂ ,

where span(𝑢)⟂ denotes the orthogonal complement of span(𝑢). This means that proj𝑢 is indeed an orthogonal projection.
We can also describe proj𝑢 (𝑥) in terms of matrices. By writing out proj𝑢 (𝑥) component-wise, we have

⟨𝑥, 𝑢⟩𝑢1
⟨𝑥, 𝑢⟩ 1 ⎡ ⟨𝑥, 𝑢⟩𝑢2 ⎤
proj𝑢 (𝑥) = 𝑢= ⎢ ⎥,
⟨𝑢, 𝑢⟩ ‖𝑢‖2 ⎢ ⋮ ⎥
⎣⟨𝑥, 𝑢⟩𝑢𝑛 ⎦
where 𝑢 = (𝑢1 , … , 𝑢𝑛 ). This looks like some kind of matrix multiplication! As we have seen earlier, multiplying a
matrix and a vector can be described in terms of rowwise dot products. (See (7.3).)
So, according to this interpretation of matrix multiplication, we have
⟨𝑥, 𝑢⟩𝑢1
1 ⎡ ⟨𝑥, 𝑢⟩𝑢2 ⎤ 𝑢𝑢𝑇
proj𝑢 (𝑥) = ⎢ ⎥= 𝑥. (14.5)
‖𝑢‖2 ⎢ ⋮ ⎥ ‖𝑢‖2
⎣⟨𝑥, 𝑢⟩𝑢𝑛 ⎦

Note that the scaling with ‖𝑢‖2 can be incorporated into the “matrix” product by writing

𝑢𝑢𝑇 𝑢 𝑢𝑇
= ⋅ ,
‖𝑢‖2 ‖𝑢‖ ‖𝑢‖

The matrix 𝑢 ∈ ℝ𝑛×𝑛 , obtained from the product of the vector 𝑢 ∈ ℝ𝑛(×1) and its transpose 𝑢𝑇 ∈ ℝ1×𝑛 , is a rather
special one. They are called rank-1 projection matrices, and they frequently appear in mathematics.
Example 2. As we have seen when introducing the Gram-Schmidt orthogonalization process, the previous example can
be generalized by projecting to multiple vectors.
If 𝑢1 , … , 𝑢𝑘 ∈ ℝ𝑛 is a set of linearly independent and pairwise orthogonal vectors, then the linear transformation
𝑘
⟨𝑥, 𝑢𝑖 ⟩
proj𝑢 (𝑥) = ∑ 𝑢
1 ,…,𝑢𝑘
𝑖=1
⟨𝑢𝑖 , 𝑢𝑖 ⟩ 𝑖

14.5. Orthogonal projections 173

Mathematics of Machine Learning

is an orthogonal projection onto the subspace span(𝑢1 , … , 𝑢𝑘 ). This is easy to see, and I recommend the reader to do this
as an exercise. (This can be found in the problems section as well.)
From (14.5), we can determine the matrix form of proj𝑢 as well:
1 ,…,𝑢𝑘

𝑘
𝑢𝑖 𝑢𝑇𝑖
proj𝑢 (𝑥) = ( ∑ ) 𝑥.
1 ,…,𝑢𝑘 ‖𝑢𝑖 ‖2
⏟⏟⏟⏟⏟
𝑖=1
∈ℝ𝑛×𝑛

This is good to know, as projection matrices are often needed in the implementation of certain algorithms.

14.5.1 Properties of orthogonal projections

Now that we have seen a few examples, it is time to discuss orthogonal projections in more general terms. There are lots
of reasons why these special transformations are useful, and we’ll explore them in this section. First, let’s start with the
most important one: an orthogonal projection also enables to decompose vectors in terms of a given subspace plus an
orthogonal vector.

Theorem 13.5.1
Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be a projection. Then 𝑉 = ker 𝑃 + im𝑃 , that is, every vector 𝑥 ∈ 𝑉
can be written as

𝑥 = 𝑥ker + 𝑥im , 𝑥ker ∈ ker 𝑃 , 𝑥im ∈ im𝑃 .

If 𝑃 is an orthogonal projection, then 𝑥im ⟂ 𝑥ker .

Proof. Every 𝑥 can be written as

𝑥 = (𝑥 − 𝑃 𝑥) + 𝑃 𝑥.

Since 𝑃 is idempotent, that is, 𝑃 2 = 𝑃 , we have

𝑃 (𝑥 − 𝑃 𝑥) = 𝑃 𝑥 − 𝑃 (𝑃 𝑥)
= 𝑃𝑥 − 𝑃𝑥
= 0,

that is, 𝑥 − 𝑃 𝑥 ∈ ker 𝑃 . By deﬁnition, 𝑃 𝑥 ∈ im𝑃 , so 𝑉 = ker 𝑉 + im𝑉 , which proves our main proposition.
If 𝑃 is an orthogonal projection, then again by deﬁnition, 𝑥im ⟂ 𝑥ker , which is what we had to show. □

In addition, orthogonal projections are self-adjoint. This might not sound like a big deal, but self-adjointness leads to
several very pleasant properties.

Theorem 13.5.2
Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be an orthogonal projection. Then 𝑃 is self-adjoint.

Proof. According to the deﬁnition (14.3), all we need to show is that

⟨𝑃 𝑥, 𝑦⟩ = ⟨𝑥, 𝑃 𝑦⟩

174 Chapter 14. Special transformations and matrix decompositions

Mathematics of Machine Learning

holds for any 𝑥, 𝑦 ∈ 𝑉 . In the previous result, we have seen that 𝑥 and 𝑦 can be written as

𝑥 = 𝑥ker 𝑃 + 𝑥im𝑃 , 𝑥ker 𝑃 ∈ ker 𝑃 , 𝑥im𝑃 ∈ im𝑃

and

𝑦 = 𝑦ker 𝑃 + 𝑦im𝑃 , 𝑦ker 𝑃 ∈ ker 𝑃 , 𝑦im𝑃 ∈ im𝑃 .

Since 𝑃 2 = 𝑃 , we have
⟨𝑃 𝑥, 𝑦⟩ = ⟨𝑃 𝑥ker 𝑃 + 𝑃 𝑥im𝑃 , 𝑦ker + 𝑦im𝑃 ⟩
= ⟨𝑥im𝑃 , 𝑦ker 𝑃 + 𝑦im𝑃 ⟩
= ⟨𝑥 im𝑃 , 𝑦ker 𝑃 ⟩ +⟨𝑥im𝑃 , 𝑦im𝑃 ⟩
⏟⏟⏟⏟⏟
=0
= ⟨𝑥im𝑃 , 𝑦im𝑃 ⟩.
Similarly, it can be shown that ⟨𝑥, 𝑃 𝑦⟩ = ⟨𝑥im𝑃 , 𝑦im𝑃 ⟩. These two identities imply ⟨𝑃 𝑥, 𝑦⟩ = ⟨𝑥, 𝑃 𝑦⟩, which is what
we had to show. □

One straightforward consequence of self-adjointness is that the kernel of orthogonal projections is the orthogonal com-
plement of its image.

Theorem 13.5.3
Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be an orthogonal projection. Then

ker 𝑃 = (im𝑃 )⟂ .

Proof. To prove the equality of these two sets, we need to show that (a) ker 𝑃 ⊆ (im𝑃 )⟂ , and (b) (im𝑃 )⟂ ⊆ ker 𝑃 .
(a) Let 𝑥 ∈ ker 𝑃 , that is, suppose that 𝑃 𝑥 = 0. We need to show that for any 𝑦 ∈ im𝑃 , we have ⟨𝑥, 𝑦⟩ = 0. For this,
let 𝑦0 ∈ 𝑉 such that 𝑃 𝑦0 = 𝑦. (This is guaranteed to exist, since we took 𝑦 from the image of 𝑃 .) Then

⟨𝑥, 𝑦⟩ = ⟨𝑥, 𝑃 𝑦0 ⟩
= ⟨𝑃 𝑥, 𝑦0 ⟩
= ⟨0, 𝑦0 ⟩
= 0,

where we used that 𝑃 is self-adjoint. Thus, 𝑥 ∈ (im𝑃 )⟂ also holds, implying ker 𝑃 ⊆ (im𝑃 )⟂ .
(b) Now let 𝑥 ∈ (im𝑃 )⟂ . Then for any 𝑦 ∈ 𝑉 , we have ⟨𝑥, 𝑃 𝑦⟩ = 0. However,

⟨𝑃 𝑥, 𝑦⟩ = ⟨𝑥, 𝑃 𝑦⟩ = 0.

Specially, with the choice 𝑦 = 𝑃 𝑦, we have ⟨𝑃 𝑥, 𝑃 𝑥⟩ = 0. Due to the positive deﬁniteness of the inner product, this
implies that 𝑃 𝑥 = 0, that is, 𝑥 ∈ ker 𝑃 . □

Summing up all the above, if 𝑃 is an orthogonal projection of the inner product space 𝑉 , then

𝑉 = im𝑃 + (im𝑃 )⟂ .

Do you recall that when we ﬁrst encountered the concept of orthogonal complements, how we proved that 𝑉 = 𝑆 + 𝑆 ⟂
for any ﬁnite dimensional inner product space 𝑉 and its subspace 𝑆? With the use of a special orthogonal projection. We
are getting close to see the general pattern here.

14.5. Orthogonal projections 175

Mathematics of Machine Learning

Because of the kernel of an orthogonal projection 𝑃 is an orthogonal complement of the image, the transformation 𝐼 − 𝑃
is an orthogonal projection as well, with the roles of image and kernel reversed.

Theorem 13.5.4
Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be an orthogonal projection. Then 𝐼 − 𝑃 is an orthogonal projection
as well, and

ker(𝐼 − 𝑃 ) = im𝑃 , im(𝐼 − 𝑃 ) = ker 𝑃 .

The proof is so simple that this is left as an exercise for the reader.
One more thing to mention. If the image spaces of two orthogonal projections match, then the projections themselves
are equal. This is a very strong uniqueness property, as if you think about it, this is not true for other classes of linear
transformations.

Theorem 13.5.5 (Uniqueness of orthogonal projections.)

Let 𝑉 be an inner product space and 𝑃 , 𝑄 ∶ 𝑉 → 𝑉 be two orthogonal projections. If im𝑃 = im𝑄, then 𝑃 = 𝑄.

Proof. Because of ker 𝑃 = (im𝑃 )⟂ , the equality of the image spaces also imply that ker 𝑃 = ker 𝑄.
Since 𝑉 = ker 𝑃 + im𝑃 , every 𝑥 ∈ 𝑉 can be decomposed as

𝑥 = 𝑥ker 𝑃 + 𝑥im𝑃 , 𝑥ker 𝑃 ∈ ker 𝑃 , 𝑥im𝑃 ∈ im𝑃 .

This decomposition and the equality of kernel and image spaces give that

𝑃 𝑥 = 𝑃 𝑥ker 𝑃 + 𝑃 𝑥im𝑃 = 𝑥im𝑃 .

With an identical argument, we have 𝑄𝑥 = 𝑥im𝑃 , thus 𝑃 𝑥 = 𝑄𝑥 on all vectors 𝑥 ∈ 𝑉 . This proves 𝑃 = 𝑄. □

In other words, given a subspace, there can be only one orthogonal projection to it. But is there any at all? Yes, and in the
next section, we will see that it can be desrcribed in geometric terms.

14.5.2 Orthogonal projections are the optimal projections

Orthogonal projections have an extremely pleasant and mathematically useful property. In some sense, if 𝑃 ∶ 𝑉 → 𝑉 is
an orthogonal projection, 𝑃 𝑥 provides the optimal approximation of 𝑥 among all vectors in im𝑃 . To make this precise,
we can state the following.

Theorem 13.5.6 (Construction of orthogonal transformations.)

Let 𝑉 be a ﬁnite-dimensional inner product space and 𝑆 ⊆ 𝑉 its subspace. Then the transformation 𝑃 ∶ 𝑉 → 𝑉 deﬁned
by

𝑃 ∶ 𝑥 → arg min ‖𝑥 − 𝑦‖
𝑦∈𝑆

is an orthogonal projection to 𝑆.

176 Chapter 14. Special transformations and matrix decompositions

Mathematics of Machine Learning

In other words, since orthogonal projections to a given subspace are unique (as implied by Theorem 13.5.5), 𝑃 𝑥 is the
closest vector to 𝑥 in the subspace 𝑆. Thus, we can denote this as 𝑃𝑆 , emphasizing the uniqueness.
Besides having an explicit way to describe orthogonal projections, there is one extra beneﬁt. Recall that previously, we
have shown that

𝑉 = im𝑃 + (im𝑃 )⟂

holds. Since for any subspace 𝑆 an orthogonal projection 𝑃𝑆 exists whose image set is 𝑆, it also follows that 𝑉 = 𝑆 + 𝑆 ⟂ .
Although we have seen this earlier when talking about orthogonal complements, it is interesting to see a proof that doesn’t
require the construction of an orthonormal basis in 𝑆.
Interestingly, this is a point where mathematical analysis and linear algebra intersects. We don’t have the tools for it
yet, but using the concept of convergence, the above theorems can be generalized to infinite dimensional spaces. Infinite
dimensional spaces are not particularly relevant for machine learning in practice, yet they provide a beautiful mathematical
framework for the study of functions. Who knows, one day these advanced tools will provide a significant breakthrough
in machine learning.

14.6 Problems

Problem 1. Let 𝑢1 , … , 𝑢𝑘 ∈ ℝ𝑛 be a set of linearly independent and pairwise orthogonal vectors. Show that the linear
transformation
𝑘
⟨𝑥, 𝑢𝑖 ⟩
proj𝑢 (𝑥) = ∑ 𝑢
1 ,…,𝑢𝑘
𝑖=1
⟨𝑢𝑖 , 𝑢𝑖 ⟩ 𝑖

is an orthogonal projection.
Problem 2. Let 𝑢1 , … , 𝑢𝑘 ∈ ℝ𝑛 be a set of linearly independent vectors, and deﬁne the linear transformation
𝑘
⟨𝑥, 𝑢𝑖 ⟩
fakeproj𝑢 (𝑥) = ∑ 𝑢.
1 ,…,𝑢𝑘
𝑖=1
⟨𝑢𝑖 , 𝑢𝑖 ⟩ 𝑖

Is this a projection? (Hint: study the special case 𝑘 = 2 and ℝ3 . You can visualize this if needed.)
Problem 3. Let 𝑉 be an inner product space and 𝑃 ∶ 𝑉 → 𝑉 be an orthogonal projection. Show that 𝐼 − 𝑃 is an
orthogonal projection as well, and

ker(𝐼 − 𝑃 ) = im𝑃 , im(𝐼 − 𝑃 ) = ker 𝑃

holds.

14.6. Problems 177

Mathematics of Machine Learning

178 Chapter 14. Special transformations and matrix decompositions

CHAPTER

FIFTEEN

COMPUTING EIGENVALUES

In the last chapter, we have reached the singular value decomposition, one of the pinnacle results of linear algebra. We
laid all of our theoretical groundwork to get us to this point.
However, one thing is missing: computing the SVD in practice. Without this, we can’t reap all the rewards this powerful
tool offers. In this chapter, we’ll develop two methods for this purpose. One offers a deep insight into the behavior of
eigenvectors, but it doesn’t work in practice. The other offers excellent performance, but it is hard to understand what is
happening behind the formulas. Let’s start with the first one, illuminating how the eigenvectors determine the effects of
a linear transformation!

15.1 Power iteration for calculating the eigenvectors of real symmet-

ric matrices

If you recall, we discovered the SVD by tracing the problem back to the spectral decomposition of symmetric matrices. In
turn, we can obtain the spectral decomposition by finding an orthonormal basis from the eigenvectors of our matrix. The
plan is the following: first, we define a procedure that finds an orthonormal set of eigenvectors for symmetric matrices.
Then, use this to compute the SVD for arbitrary matrices.
A naive way would be find the eigenvalues by solving the polynomial equation det(𝐴 − 𝜆𝐼) = 0 for 𝜆, then compute the
corresponding eigenvectors by solving the linear equations (𝐴 − 𝜆𝐼)𝑥 = 0.
However, there are problems with this approach. For an 𝑛 × 𝑛 matrix, the characteristic polynomial 𝑝(𝜆) = det(𝐴 − 𝜆𝐼)
is a polynomial of degree 𝑛. Even if we could effectively evaluate det(𝐴 − 𝜆𝐼) for any lambda, there are serious issues.
Unfortunately, unlike for the quadratic equation 𝑎𝑥2 + 𝑏𝑥 + 𝑐 = 0, there are no formulas for finding the solutions when
𝑛 > 4. (It is not that mathematicians were just not clever enough to find them. No such formula exists.)
How can we find an alternative approach? Once again, we use the wishful thinking approach that worked so well before.
Let’s pretend that we know the eigenvalues, play around with them, and see if this gives us some useful insight.
For the sake of simplicity, assume that 𝐴 is a small symmetric 2×2 matrix, say with eigenvalues 𝜆1 = 4 and 𝜆2 = 2. Since
𝐴 is symmetric, we can even find a set of corresponding eigenvectors 𝑢1 , 𝑢2 such that 𝑢1 and 𝑢2 form an orthonormal basis.
(That is, both have an unit norm and they are orthogonal to each other.) This is guaranteed by the spectral decomposition
theorem.
Thus, any 𝑥 ∈ ℝ2 can be written as 𝑥 = 𝑥1 𝑢1 + 𝑥2 𝑢2 for some nonzero scalars 𝑥1 , 𝑥2 . What happens if we apply the
transformation 𝐴 to our vector 𝑥? Because 𝑢𝑖 are eigenvectors, we have

𝐴𝑥 = 𝐴(𝑥1 𝑢1 + 𝑥2 𝑢2 )
= 𝑥1 𝐴𝑢1 + 𝑥2 𝐴𝑢2
= 𝑥1 𝜆1 𝑢1 + 𝑥2 𝜆2 𝑢2
= 4𝑥1 𝑢1 + 2𝑥2 𝑢2 .

179
Mathematics of Machine Learning

By applying 𝐴 one more time, we obtain

𝐴2 𝑥 = 𝑥1 𝜆21 𝑢1 + 𝑥2 𝜆22 𝑢2
= 42 𝑥1 𝑢1 + 22 𝑥2 𝑢2 .
A pattern starts to emerge. In general, the 𝑘-th iteration of 𝐴 yields
𝐴𝑘 𝑥 = 𝑥1 𝜆𝑘1 𝑢1 + 𝑥2 𝜆𝑘2 𝑢2
= 4𝑘 𝑥1 𝑢1 + 2𝑘 𝑥2 𝑢2 .
By taking an inquisitive look at 𝐴𝑘 𝑥, we can note that the contribution of 𝑢1 is much more significant than 𝑢2 . Why?
Because the coefficient 𝑥1 𝜆𝑘1 = 4𝑘 𝑥1 grows faster than 𝑥2 𝜆𝑘2 = 2𝑘 𝑥2 , regardless of the value of 𝑥1 and 𝑥2 . In technical
terms, we say that 𝜆1 dominates 𝜆2 .
Now, by scaling down things with 𝜆𝑘1 , we can extract the eigenvector 𝑢1 ! That is,
𝑘
𝐴𝑘 𝑥 𝜆
= 𝑥1 𝑢1 + 𝑥2 ( 2 ) 𝑢2
𝜆𝑘1 𝜆1
= 𝑥1 𝑢1 + (something very small)𝑘 .
𝐴𝑘 𝑥
If we let 𝑘 grow infinitely, the contribution of 𝑢2 to 𝜆𝑘
vanishes. If you are familiar with the concept of limits, you could
1
write
𝐴𝑘 𝑥
lim = 𝑥1 𝑢1 . (15.1)
𝑘→∞ 𝜆𝑘
1

Remark 14.1.1 (A primer on limits.)

If you are not familiar with limits, here is a quick explanation. The identity
𝐴𝑘 𝑥
lim = 𝑥1 𝑢1
𝑘→∞ 𝜆𝑘
1
𝐴𝑘 𝑥
means that as 𝑘 grows, the quantity 𝜆𝑘
gets closer and closer to 𝑥1 𝑢1 , until the diﬀerence between them is inﬁnitesimal.
1
𝐴𝑘 𝑥
In practice, this means that we can approximate 𝑥1 𝑢1 by 𝜆𝑘
by selecting a very large 𝑘.
1

Equation (15.1) is great news for us! All we have to do is repeatedly apply the transformation 𝐴 to identify the eigenvector
for the dominant eigenvalue 𝜆1 . There is one small caveat though: we have to know the value of 𝜆1 . We’ll deal with this
later, but ﬁrst, let’s record this milestone in the form of a theorem.

Theorem 14.1.1 (Finding the eigenvector for the dominant eigenvalue with power iteration. )
Let 𝐴 ∈ ℝ𝑛×𝑛 be a real symmetric matrix. Suppose that
(a) the eigenvalues of 𝐴 are 𝜆1 > ⋯ > 𝜆𝑛 (that is, 𝜆1 is the dominant eigenvalue),
(b) and the corresponding eigenvectors 𝑢1 , … , 𝑢𝑛 form an orthonormal basis.
𝑛
Let 𝑥 ∈ ℝ𝑛 be a vector such that when written as the linear combination 𝑥 = ∑𝑖=1 𝑥𝑖 𝑢𝑖 , the coeﬃcient 𝑥1 ∈ ℝ is
nonzero. Then
𝐴𝑘 𝑥
lim = 𝑥1 𝑢1 . (15.2)
𝑘→∞ 𝜆𝑘
1

Before we jump into the proof, some explanations are in order. Recall that if 𝐴 is symmetric, the spectral decomposition
theorem guarantees that it can be diagonalized with a similarity transformation. In its proof (sketch), we mentioned that
a symmetric matrix has

180 Chapter 15. Computing eigenvalues

Mathematics of Machine Learning

• real eigenvalues,
• and an orthonormal basis from its eigenvectors.
Thus, the assumptions (a) and (b) are guaranteed, except for one caveat: the eigenvalues are not necessarily distinct.
However, this rarely causes problems in practice. There are multiple reasons for this, but most importantly, matrices with
repeated eigenvalues are so rare that they form a zero-probability set. Thus, stumbling upon one is highly unlikely.

Proof. Because 𝑢𝑘 is the eigenvector for the eigenvalue 𝜆𝑘 , we have

𝑛
𝐴𝑘 𝑥 = ∑ 𝑥𝑖 𝜆𝑘𝑖 𝑢𝑖 . (15.3)
𝑖=1

Thus,
𝑛 𝑘
𝐴𝑘 𝑥 𝜆
𝑘
= 𝑥 𝑢
1 1 + ∑ 𝑥𝑖 ( 𝑖 ) 𝑢𝑖 .
𝜆1 𝑖=2
𝜆1
Since 𝜆1 is the dominant eigenvalue, |𝜆𝑖 /𝜆1 | < 1 for 𝑖 = 2, … , 𝑛, so (𝜆𝑖 /𝜆1 )𝑘 → 0 as 𝑘 → ∞. Hence,
𝐴𝑘 𝑥
lim = 𝑥1 𝑢1 .
𝑘→∞ 𝜆𝑘
1

This is what we had to show. □

Now, let’s fix the small issue that requires us to know 𝜆1 . Since 𝜆1 is the largest eigenvalue, the previous theorem shows
that 𝐴𝑘 𝑥 equals to 𝑥1 𝜆𝑘1 𝑢1 plus some term that is much smaller, at least compared to this dominant term. We can extract
this quantity by taking the supremum norm ‖𝐴𝑘 𝑥‖∞ . (Recall that for any 𝑦 = (𝑦1 , … , 𝑦𝑘 ), the supremum norm is defined
by ‖𝑦‖∞ = max{|𝑦1 |, … , |𝑦𝑛 |}. Keep in mind that the 𝑦𝑖 -s are the coefficients of 𝑦 in the original basis of our vector
space, which is not necessarily our eigenvector basis 𝑢1 , … , 𝑢𝑛 .)
By factoring out |𝜆1 |𝑘 from 𝐴𝑘 𝑥, we have
𝑛 𝑘
𝜆𝑖
‖𝐴𝑘 𝑥‖∞ = |𝜆1 |𝑘 ∥𝑥1 𝑢1 + ∑ 𝑥𝑖 ( ) 𝑢𝑖 ∥ .
𝑖=2
𝜆1 ∞

𝑘
𝑛
Intuitively speaking, the remainder term ∑𝑖=2 𝑥𝑖 ( 𝜆𝜆𝑖 ) 𝑢𝑖 is small, thus we can approximate the norm as
1

‖𝐴𝑘 𝑥‖∞ ≈ |𝜆1 |𝑘 ‖𝑥1 𝑢1 ‖∞ .

In other words, instead of scaling with 𝜆𝑘1 , we can scale with ‖𝐴𝑘 𝑥‖∞ .
So, we are ready to describe our general eigenvector-finding procedure fully. First, we initialize a vector 𝑥0 randomly,
then we define the recursive sequence
𝐴𝑥𝑘−1
𝑥𝑘 = , 𝑘 = 1, 2, …
‖𝐴𝑥𝑘−1 ‖∞
Using the linearity of 𝐴, we can see that, in fact,
𝐴𝑘 𝑥 0
𝑥𝑘 = ,
‖𝐴𝑘 𝑥0 ‖∞
but scaling has an additional side benefit, as we don’t have to use large numbers at any computational step. With this,
(15.2) implies that
𝐴𝑘 𝑥 0
lim 𝑥𝑘 = = 𝑢1 .
𝑘→∞ ‖𝐴𝑘 𝑥0 ‖∞
That is, we can extract the eigenvector for the dominant eigenvalue without actually knowing the eigenvalue itself.

15.1. Power iteration for calculating the eigenvectors of real symmetric matrices 181
Mathematics of Machine Learning

15.1.1 Power iteration in practice

Let’s put the power iteration method into practice! The input of our power_iteration function is a square matrix
A, and we expect the output to be an eigenvector corresponding to the dominant eigenvalue.
Since this is an iterative process, we should define a condition that defines when the process should terminate. If the
consecutive members of the sequence {𝑥𝑘 }∞ 𝑘=1 are sufficiently close together, we arrived at a solution. That is, if, say

‖𝑥𝑘+1 − 𝑥𝑘 ‖2 < 1 × 10−10 ,

we can stop and return the current value. However, this might never happen. For those cases, we define a cutoff point,
say 𝑘 = 100000, when we terminate the computation even if there is no convergence.
To give us a bit more control, we can also manually define the initialization vector x_init.

import numpy as np

def power_iteration(
A: np.ndarray,
n_max_steps: int = 100000,
convergence_threshold: float = 1e-10,
x_init: np.ndarray = None,
normalize: bool = False
):
n, m = A.shape

# checking the validity of the input

if n != m:
raise ValueError("the matrix A must be square")

# reshaping or defining the initial vector

if x_init is not None:
x = x_init.reshape(-1, 1)
else:
x = np.random.normal(size=(n, 1))

# performing the iteration

for step in range(n_max_steps):
x_transformed = A @ x # applying the transform
x_new = x_transformed / np.linalg.norm(x_transformed, ord=np.inf) #␣
↪scaling the result

# quantifying the difference between the new and old vector

improvement = np.linalg.norm(x - x_new)
x = x - x_new

# stopping the iteration in case of convergence

if improvement < convergence_threshold:
break

# normalizing the result if required

if normalize:
return x / np.linalg.norm(x)

return x

To test the method, we should use an input for which the correct output is easy to calculate by hand. Our usual recurring

182 Chapter 15. Computing eigenvalues

Mathematics of Machine Learning

example

2 1
𝐴=[ ].
1 2

should be perfect, as we already know a lot about it. Previously, we have seen that its eigenvalues are 𝜆1 = 3 and 𝜆2 = 1,
with corresponding eigenvectors 𝑢1 = (1, 1) and 𝑢2 = (−1, 1).
Let’s see if our function correctly recovers (a scalar multiple of) 𝑢1 = (1, 1)!

A = np.array([[2, 1], [1, 2]])

u_1 = power_iteration(A, normalize=True)

u_1

array([[-0.70710678],
[-0.70710678]])

Success! To recover the eigenvalue, we can simply apply the linear transformation and compute the proportions.

A @ u_1 / u_1

array([[3.],
[3.]])

The result is 3, as expected.

15.1.2 Power iteration for the rest of the eigenvectors

Can we modify the power iteration algorithm to recover the other eigenvalues as well? In theory, yes. In practice, no. Let
me elaborate!
To get a grip on how to generalize the idea, let’s take another look at the equation (15.3), saying that
𝑛
𝐴𝑘 𝑥 = ∑ 𝑥𝑖 𝜆𝑘𝑖 𝑢𝑖 .
𝑖=1

𝐴𝑘 𝑥
One of the conditions for 𝜆𝑘
to converge was that 𝑥 should have a nonzero component of the eigenvector 𝑢1 , that is,
1
𝑥1 ≠ 0.
What if 𝑥1 = 0? In that case, we have

𝐴𝑘 𝑥 = 𝑥2 𝜆𝑘2 𝑢2 + ⋯ + 𝑥𝑛 𝜆𝑘2 𝑢𝑛 ,

with 𝑥2 𝜆𝑘2 𝑢2 becoming the dominant term. Thus, we have

𝑛 𝑘
𝐴𝑘 𝑥 𝜆
𝑘
= 𝑥 𝑢
2 2 + ∑ 𝑥𝑖 ( 𝑖 ) 𝑢𝑘
𝜆2 𝑖=3
𝜆2
= 𝑥2 𝑢2 + (something very small)𝑘 ,

implying that

𝐴𝑘 𝑥
lim = 𝑥2 𝑢2 .
𝑘→∞ 𝜆𝑘
2

15.1. Power iteration for calculating the eigenvectors of real symmetric matrices 183
Mathematics of Machine Learning

Let’s make this mathematically precise in the following theorem.

Theorem 14.1.2 (Generalized power iteration. )

Let 𝐴 ∈ ℝ𝑛×𝑛 be a real symmetric matrix. Suppose that
(a) the eigenvalues of 𝐴 are 𝜆1 > ⋯ > 𝜆𝑛 (that is, 𝜆1 is the dominant eigenvalue),
(b) and the corresponding eigenvectors 𝑢1 , … , 𝑢𝑛 form an orthonormal basis.
Let 𝑥 ∈ ℝ𝑛 be a vector such that when written as the linear combination of the basis 𝑢1 , … , 𝑢𝑛 , its ﬁrst nonzero component
𝑛
is 𝑢𝑙 for some 𝑙 = 1, 2, … , 𝑛. (That is, 𝑥 = ∑𝑖=𝑙 𝑥𝑖 𝑢𝑖 .)
Then
𝐴𝑘 𝑥
lim = 𝑥𝑙 𝑢𝑙
𝑘→∞ 𝜆𝑘
𝑙

holds.

The proof goes just like what we have seen a few times already. The question is, how can we eliminate the 𝑢1 , … , 𝑢𝑙−1
components from any vector? The answer is simple: orthogonal projections.
For the sake of simplicity, let’s take a look at extracting the second dominant eigenvector with power iteration. Recall
that the transformation

proj𝑢 (𝑥) = ⟨𝑢1 , 𝑥⟩𝑢1

𝑛
describes the orthogonal projection of any 𝑥 to 𝑢1 . In concrete terms, if 𝑥 = ∑𝑖=1 𝑥𝑖 𝑢𝑖 , then
𝑛
proj𝑢 (𝑥) = proj𝑢 ( ∑ 𝑥𝑖 𝑢𝑖 )
1 1
𝑖=1
𝑛
= ∑ 𝑥𝑖 proj𝑢 (𝑢𝑖 )
1
𝑖=1
= 𝑥1 𝑢1 .

This is the exact opposite of what we are looking for! However, at this point, we can see that 𝐼 − proj𝑢 is going to be
1
suitable for our purposes. This is still an orthogonal projection. Moreover, we have
𝑛 𝑛
(𝐼 − proj𝑢 )( ∑ 𝑥𝑖 𝑢𝑖 ) = ∑ 𝑥𝑖 𝑢𝑖 ,
1
𝑖=1 𝑖=2

that is, 𝐼 −proj𝑢 eliminates the 𝑢1 component of 𝑥. Thus, if we initialize the power iteration with 𝑥∗ = (𝐼 −proj𝑢 )(𝑥),
1 1
𝐴𝑘 𝑥∗
the sequence ‖𝐴𝑘 𝑥∗ ‖∞ will converge to 𝑢2 , the second dominant eigenvector.
How to compute (𝐼 − proj𝑢 )(𝑥) in practice? Recall that in the standard orthonormal basis, the matrix of proj𝑢 can
1 1
be written as 𝑢1 𝑢𝑇1 . (Keep in mind that the 𝑢𝑖 vectors form an orthonormal basis, so ‖𝑢1 ‖ = 1.) Thus, the matrix of
𝐼 − proj𝑢 is 𝐼 − 𝑢1 𝑢𝑇1 , which we can easily compute.
1

For a general vector 𝑢, this is how we can do this in NumPy.

def get_orthogonal_complement_projection(u: np.ndarray):

u = u.reshape(-1, 1)
n, _ = u.shape
return np.eye(n) - u @ u.T / np.linalg.norm(u, ord=2)**2

184 Chapter 15. Computing eigenvalues

Mathematics of Machine Learning

So, the procedure to ﬁnd all the eigenvectors is the following.

1. Initialize a random 𝑥(1) and use the power iteration to ﬁnd 𝑢1 .
2. Project 𝑥(1) to the orthogonal complement of the subspace spanned by 𝑢1 , thus obtaining
𝑥(2) ∶= (𝐼 − proj𝑢 )(𝑥(1) ),
1

which we use as the initial vector of the second round of power iteration, yielding the second dominant eigenvector 𝑢2 .
3. Project 𝑥(2) to the orthogonal complement of the subspace spanned by 𝑢1 and 𝑢2 , thus obtaining
𝑥(3) = (𝐼 − proj𝑢 )(𝑥(2) )
2

= (𝐼 − proj𝑢 )(𝑥(1) ),
1 ,𝑢2

which we use as the initial vector of the third round of power iteration, yielding the third dominant eigenvector 𝑢3 .
4. Project 𝑥(3) to…
You get the pattern. To implement this in practice, we add the find_eigenvectors function.

def find_eigenvectors(A: np.ndarray, x_init: np.ndarray):

n, _ = A.shape
eigenvectors = []

for _ in range(n):
ev = power_iteration(A, x_init=x_init)
proj = get_orthogonal_complement_projection(ev)
x_init = proj @ x_init
eigenvectors.append(ev)

return eigenvectors

Let’s test find_eigenvectors out on our old friend

2 1
𝐴=[ ]!
1 2

A = np.array([[2.0, 1.0], [1.0, 2.0]])

x_init = np.random.rand(2, 1)

find_eigenvectors(A, x_init)

[array([[0.65505892],
[0.65505892]]),
array([[ 0.12565508],
[-0.12565508]])]

The result is as we expected. (Don’t be surprised that the eigenvectors are not normalized, as we haven’t explicitly done
so in the find_eigenvectors function.)
We are ready to actually diagonalize symmetric matrices. Recall that the diagonalizing orthogonal matrix 𝑈 can be
obtained by vertically stacking the eigenvectors one by one.

def diagonalize_symmetric_matrix(A: np.ndarray, x_init: np.ndarray):

eigenvectors = find_eigenvectors(A, x_init)
U = np.hstack(eigenvectors) / np.linalg.norm(np.hstack(eigenvectors), axis=0,␣
↪ord=2)

return U, U @ A @ U.T

15.1. Power iteration for calculating the eigenvectors of real symmetric matrices 185
Mathematics of Machine Learning

diagonalize_symmetric_matrix(A, x_init)

(array([[ 0.70710678, 0.70710678],

[ 0.70710678, -0.70710678]]),
array([[ 3.00000000e+00, -5.05623488e-16],
[-3.13396163e-16, 1.00000000e+00]]))

Awesome!
What is the problem then? What you haven’t seen is that I had to run find_eigenvectors 16 times to finally find
an initial vector that yields the expected result. This is because power iteration is numerically extremely unstable. Notice
that we get a completely different result by perturbing our initial vector by 0.0001.

find_eigenvectors(A, x_init + 1e-4)

[array([[0.65511639],
[0.65511639]]),
array([[0.28417351],
[0.28417351]])]

For 𝑛 × 𝑛 matrices where 𝑛 can be in the millions, this is a serious issue.

Why did we talk so much about power iteration then? Besides being the simplest, it oﬀers a deep insight into linear
transformations work. The identity
𝑛
𝐴𝑥 = ∑ 𝑥𝑖 𝜆𝑖 𝑢𝑖 ,
𝑖=1

where 𝜆𝑖 and 𝑢𝑖 are eigenvalue-eigenvector pairs of the symmetric matrix 𝐴, reﬂecting how eigenvectors and eigenvalues
determine the behavior of the transformation.
If the power iteration is not usable in practice, how can we compute the eigenvalues? We will see this in the next section.

15.2 The QR algorithm

The algorithm used in practice to compute the eigenvalues is the so-called QR algorithm, proposed independently by
John G. R. Francis and the soviet mathematician Vera Kublanovskaya. This is where all of the lessons we have learned in
linear algebra converge. Describing the QR algorithm is very simple, as it is the iteration of a matrix decomposition and
a multiplication step.
However, understanding why it works is a diﬀerent question. Behind the scenes, the QR algorithm combines many tools
we have learned earlier. To start, let’s revisit the good old Gram-Schmidt orthogonalization process.

15.2.1 The QR decomposition

If you recall, we have encountered the Gram-Schmidt orthogonalization process when introducing the concept of orthog-
onal bases.
In essence, this algorithm takes an arbitrary basis 𝑣1 , … , 𝑣𝑛 and turns it into an orthonormal one 𝑒1 , … , 𝑒𝑛 such that
𝑒1 , … , 𝑒𝑘 spans the same subspace as 𝑣1 , … , 𝑣𝑘 . Since we last met this, we have gained a lot of perspective about linear
algebra, so we are ready to see the bigger picture.

186 Chapter 15. Computing eigenvalues

Mathematics of Machine Learning

With the orthogonal projections deﬁned by

𝑘
⟨𝑥, 𝑒𝑖 ⟩
proj𝑒 (𝑥) = ∑ 𝑒,
1 ,…,𝑒𝑘
𝑖=1
⟨𝑒𝑖 , 𝑒𝑖 ⟩ 𝑖

we can describe the Gram-Schmidt process recursively as

𝑒1 = 𝑣 1 ,
𝑒𝑘 = 𝑣𝑘 − proj𝑒 (𝑣𝑘 ),
1 ,…,𝑒𝑘−1

where the 𝑒𝑘 vectors are normalized after.

By expanding this and writing out 𝑒𝑘 explicitly, we have

𝑒1 = 𝑣 1
⟨𝑒1 , 𝑣2 ⟩
𝑒2 = 𝑣 2 − 𝑒
⟨𝑒1 , 𝑒1 ⟩ 1
⋮
⟨𝑒1 , 𝑣𝑛 ⟩ ⟨𝑒𝑛−1 , 𝑣𝑛 ⟩
𝑒𝑛 = 𝑣 𝑛 − 𝑒 −⋯− 𝑒 .
⟨𝑒1 , 𝑒1 ⟩ 1 ⟨𝑒𝑛−1 , 𝑒𝑛−1 ⟩ 𝑛−1

A pattern is starting to emerge. By arranging the 𝑒1 , … , 𝑒𝑛 terms into one side, we obtain

𝑣1 = 𝑒 1
⟨𝑒1 , 𝑣2 ⟩
𝑣2 = 𝑒 + 𝑒2
⟨𝑒1 , 𝑒1 ⟩ 1
⋮
⟨𝑒1 , 𝑣𝑛 ⟩ ⟨𝑒𝑛−1 , 𝑣𝑛 ⟩
𝑣𝑛 = 𝑒1 + ⋯ + 𝑒 + 𝑒𝑛 .
⟨𝑒1 , 𝑒1 ⟩ ⟨𝑒𝑛−1 , 𝑒𝑛−1 ⟩ 𝑛−1

This is starting to resemble some kind of matrix multiplication! Recall that matrix multiplication can be viewed as taking
the linear combination of columns. (Check (7.2) if you are uncertain about this.)
By horizontally concatenating the column vectors 𝑣𝑘 to form the matrix 𝐴 and similarly deﬁning the vector 𝑄 from the
𝑒𝑘 -s, we obtain that

𝐴 = 𝑄∗ 𝑅 ∗

for some upper triangular 𝑅, defined by the coefficients of 𝑒𝑘 in 𝑣𝑘 according to the Gram-Schmidt orthogonalization.
To be more precise, define

𝐴=⎡
⎢𝑣1 … 𝑣𝑛 ⎤
⎥, 𝑄∗ = ⎡
⎢𝑒1 … 𝑒𝑛 ⎤
⎥,
⎣ ⎦ ⎣ ⎦
and
⟨𝑒1 ,𝑣2 ⟩ ⟨𝑒1 ,𝑣𝑛 ⟩
1 ⟨𝑒1 ,𝑒1 ⟩ … ⟨𝑒1 ,𝑒1 ⟩
⎡ ⟨𝑒2 ,𝑣𝑛 ⟩ ⎤
𝑅∗ = ⎢0 1 … ⟨𝑒2 ,𝑒2 ⟩ ⎥ .
⎢ ⎥
⎢⋮ ⋮ ⋱ ⋮ ⎥
⎣0 0 ⋮ 1 ⎦

The result 𝐴 = 𝑄∗ 𝑅∗ is almost what we call the QR factorization. The columns of 𝑄∗ are orthogonal (but not or-
thonormal), while 𝑅∗ is upper triangular. We can easily orthonormalize 𝑄∗ by factoring out the norms columnwise, thus

15.2. The QR algorithm 187

Mathematics of Machine Learning

obtaining
⟨𝑒1 ,𝑣2 ⟩ ⟨𝑒1 ,𝑣𝑛 ⟩
‖𝑒1 ‖ √⟨𝑒1 ,𝑒1 ⟩
… √⟨𝑒1 ,𝑒1 ⟩ ⎤
⎡ ⟨𝑒2 ,𝑣𝑛 ⟩
𝑄=⎡ 𝑅=⎢ 0 ‖𝑒2 ‖ … ⎥
𝑒1 𝑒𝑛 ⎤
⎢ ‖𝑒1 ‖ … ‖𝑒𝑛 ‖ ⎥ , ⎢ √⟨𝑒2 ,𝑒2 ⟩ ⎥ .

⎣ ⎦ ⎢ ⋮ ⋮ ⋱ ⋮ ⎥
⎣ 0 0 ⋮ ‖𝑒𝑛 ‖ ⎦

It is easy to see that 𝐴 = 𝑄𝑅 still holds. This result is called the QR decomposition, and we have just proved the following
theorem.

Theorem 14.2.1 (QR decomposition.)

Let 𝐴 ∈ ℝ𝑛×𝑛 be an invertible matrix. Then there exists an orthogonal matrix 𝑄 ∈ ℝ𝑛×𝑛 and an upper triangular matrix
𝑅 ∈ ℝ𝑛×𝑛 such that

𝐴 = 𝑄𝑅

holds.

As we are about to see, the QR decomposition is an extremely useful and versatile tool. (Like all other matrix decompo-
sitions are.) Before we move forward to discuss how it can be used to compute the eigenvalues in practice, let’s put what
we have seen so far into code!
The QR decomposition algorithm is essentially Gram-Schmidt orthogonalization, where we explicitly memorize some
coeﬃcients and form a matrix from them. (Recall our earlier implementation if you feel overwhelmed.)

def projection_coeff(x: np.ndarray, to: np.ndarray):

return np.dot(x, to)/np.dot(to, to)

from typing import List

def projection(x: np.ndarray, to: List[np.ndarray], return_coeffs: bool = True):

"""
Computes the orthogonal projection of the vector `x`
onto the subspace spanned by the set of vectors `to`.
"""
p_x = np.zeros_like(x)
coeffs = []

for e in to:
coeff = projection_coeff(x, e)
coeffs.append(coeff)
p_x += coeff*e

if return_coeffs:
return p_x, coeffs
else:
return p_x

Now we can put these together to obtain the QR factorization of an arbitrary square matrix. (Surprisingly, this works for
non-square matrices as well, but we won’t be concerned with this.)

def QR(A: np.ndarray):

"""
(continues on next page)

188 Chapter 15. Computing eigenvalues

Mathematics of Machine Learning

(continued from previous page)

Gives the QR decomposition of A, using the
Gram-Schmidt orthogonalization process.
"""
n, m = A.shape

A_columns = [A[:, i] for i in range(A.shape[1])]

Q_columns, R_columns = [], []

Q_columns.append(A_columns[0])
R_columns.append([1] + (m-1)*[0])

for i, a in enumerate(A_columns[1:]):
p, coeffs = projection(a, Q_columns, return_coeffs=True)
next_q = a - p
next_r = coeffs + [1] + max(0, m - i - 2)*[0]

Q_columns.append(next_q)
R_columns.append(next_r)

# assembling Q and R from its columns

Q, R = np.array(Q_columns).T, np.array(R_columns).T

# normalizing Q's columns

Q_norms = np.linalg.norm(Q, axis=0)
Q = Q/Q_norms
R = np.diag(Q_norms) @ R
return Q, R

Let’s try it out on a random 3 × 3 matrix.

A = np.random.rand(3, 3)
Q, R = QR(A)

There are three things to check: (a) that 𝐴 = 𝑄𝑅, (b) 𝑄 is an orthogonal matrix, (c) and that 𝑅 is upper triangular.

np.allclose(A, Q @ R)

True

np.allclose(Q.T @ Q, np.eye(3))

True

np.allclose(R, np.triu(R))

True

Success! There is only one more question left. How does this help us in calculating the eigenvalues? Let’s see that now.

15.2. The QR algorithm 189

Mathematics of Machine Learning

15.2.2 Iterating the QR decomposition

Surprisingly, we can discover the eigenvalues of a matrix 𝐴 by a simple iterative process. First, we ﬁnd the QR decom-
position

𝐴 = 𝑄 1 𝑅1 ,

and deﬁne the matrix 𝐴1 by

𝐴1 = 𝑅1 𝑄1 ,

that is, we simply reverse the order of 𝑄 and 𝑅. Then, we start it all over and find the QR decomposition of 𝐴1 , and so
on, defining the sequence
𝐴𝑘−1 = 𝑄𝑘 𝑅𝑘 (QR decomposition)
(15.4)
𝐴𝑘 = 𝑅𝑘 𝑄𝑘 (definition).
In the long run, the diagonal elements of 𝐴𝑘 will get closer and closer to the eigenvalues of 𝐴. This is called the QR
algorithm, which is so simple that I didn’t believe it when I first saw it.
With all of our tools, we can implement the QR algorithm in a few lines.

def QR_algorithm(A: np.ndarray, n_iter: int = 1000):

for _ in range(n_iter):
Q, R = QR(A)
A = R @ Q
return A

Let’s test it right away.

A = np.array([[2.0, 1.0], [1.0, 2.0]])

QR_algorithm(A)

array([[3.00000000e+00, 2.39107046e-16],
[0.00000000e+00, 1.00000000e+00]])

We are almost at the state-of-the-art. Unfortunately, the vanilla QR algorithm has some issues, as it can fail to converge.
A simple example is given by the matrix

0 1
𝐴=[ ].
1 0

A = np.array([[0.0, 1.0], [1.0, 0.0]])

QR_algorithm(A)

array([[0., 1.],
[1., 0.]])

In practice, we can solve this with the introduction of shifts:

𝐴𝑘−1 − 𝛼𝑘 𝐼 = 𝑄𝑘 𝑅𝑘 (QR decomposition)

(15.5)
𝐴𝑘 = 𝑅 𝑘 𝑄 𝑘 + 𝛼 𝑘 𝐼 (deﬁnition),

where 𝛼𝑘 is some scalar. There are multiple approaches to deﬁning the shifts themselves (Rayleigh quotient shift, Wilkin-
son shift, etc.), but the details lie much deeper than our study.

190 Chapter 15. Computing eigenvalues

Part III

Functions

191
CHAPTER

SIXTEEN

FUNCTIONS IN THEORY

Mathematicians are like Frenchmen: whatever you say to them they translate into their own language and
forthwith it is something entirely diﬀerent. — Johann Wolfgang von Goethe
Everyone has an intuitive understanding of what functions are. At one point or another, all of us have encountered
this concept. For most of us, a function is a curve drawn with a continuous line onto a representation of the Cartesian
coordinate system.

Fig. 16.1: Deﬁnitely looks like a function.

However, in mathematics, intuitions can often lead us to false conclusions. Often, there is a diﬀerence between what
something is and how do you think about it, what your mental models are. To give an example from a real-life scenario in
machine learning, consider the following piece of code.

import numpy as np

(continues on next page)

193
Mathematics of Machine Learning

(continued from previous page)

def cross_entropy_loss(X, y):

"""
Args:
X: numpy.ndarray of shape (n_batch, n_dim).
Contains the predictions in form of a probability distribution.
y: numpy.ndarray of shape (n_batch, 1).
Contains class labels for each data point in X.

Returns:
loss: numpy.float.
Cross entropy loss of the predictions.
"""

exp_x = np.exp(X)
probs = exp_x / np.sum(exp_x, axis=1, keepdims=True)
log_probs = - np.log([probs[i, y[i]] for i in range(len(probs))])
loss = np.mean(log_probs)

return loss

Suppose that you wrote this function, and it is in your codebase somewhere. Depending on our needs, we might think of
it as cross-entropy loss, but in reality, this is a 579 character long string in the Python language, eventually processed by
an interpreter. However, when working with it, we often use a mental model that compacts this information into easily
usable chunks. Like the three words cross-entropy loss. When we reason about high-level processes like training a neural
network, abstractions such as this allow us to move further and step bigger.
But sometimes, things don’t go our way. When this function throws an error and crashes the computations, cross-entropy
loss will not cut it. Then, it is time to unravel the deﬁnition and put everything under a magnifying glass. What could have
hindered your thinking before is now essential.
These principles are also true for theory, not just for practice. Mathematics is a balancing act between logical precision
and a clear understanding, two often contradicting objectives.
Let’s go back to our starting point: functions in a mathematical sense. One possible mental model, as mentioned, is a
curve drawn with a continuous line. It allows us to reason about functions visually and intuitively answer some questions.
However, this particular mental model can go very wrong. To give an example, is this a function below?
Even though this curve is drawn with a continuous line, this is not a function. To avoid confusion later, we have to build
the foundations of our discussion if we were to talk about mathematical objects. In this chapter, our goal is to establish a
basic dictionary to properly understand the objects we are working with in machine learning.

16.1 The mathematical deﬁnition of a function

Let’s dive straight into the deep water and see the exact mathematical definition of functions! (Don’t worry if you don’t
understand it for the first read. I’ll explain everything in detail. This is the usual experience when encountering a definition
for the first time.)

Deﬁnition 15.1.1 (Functions.)

Let 𝑋 and 𝑌 be two sets. The subset 𝑓 ⊆ 𝑋 × 𝑌 is a function if for every 𝑥 ∈ 𝑋, there is at most one 𝑦 ∈ 𝑌 such that
(𝑥, 𝑦) ∈ 𝑓.

194 Chapter 16. Functions in theory

Mathematics of Machine Learning

Fig. 16.2: Is this a function?

16.1. The mathematical deﬁnition of a function 195

Mathematics of Machine Learning

For simplicity, we introduce the notation

𝑓 ∶ 𝑋 → 𝑌,

which is short for 𝑓 is a function from 𝑋 to 𝑌 . Note that 𝑋 and 𝑌 can be any set. In the examples we encounter, these
are usually the set of real numbers or vectors, but there is no such restriction.
To visualize the deﬁnition, we can draw two sets and arrows pointing from elements of 𝑋 to elements of 𝑌 . Each element
(𝑥, 𝑦) ∈ 𝑓 represents an arrow, pointing from 𝑥 to 𝑦.

Fig. 16.3: A function, as arrows between two sets

The only criteria is that there can be at most one arrow starting from any 𝑥 ∈ 𝑋. This is why Fig. 16.2 is not a function.
Deﬁning a function as a subset is mathematically precise but very low level. To be more useful, we can introduce an
abstraction by deﬁning functions with formulas, such as

𝑓 ∶ ℝ → ℝ, 𝑥 ↦ 𝑥2 ,

or simply 𝑓(𝑥) = 𝑥2 in short. This is how most of us think about functions when working with them.
Now that we are familiar with the deﬁnition, we should get to know some of the most basic structural properties of
functions.

196 Chapter 16. Functions in theory

Mathematics of Machine Learning

16.2 Domain and image. Injective, surjective, bijective functions.

We saw that, in essence, functions are arrows between sets. At this point, we don’t know anything useful about them.
When is a function invertible? How to ﬁnd their minima and maxima? Why should we even care? Probably you have a
bunch of questions here. Slowly but surely, we will cover all of these.
The ﬁrst steps in our journey are concerned with the sets from which arrows start and point. There are two important sets
in a function’s life: its domain and image.

Deﬁnition 15.2.1 (Domain and image of functions.)

Let 𝑓 ∶ 𝑋 → 𝑌 be a function. The sets

dom𝑓 ∶= {𝑥 ∈ 𝑋 ∶ there is an 𝑦 ∈ 𝑌 such that 𝑓(𝑥) = 𝑦} ⊆ 𝑋

and

im𝑓 ∶= {𝑦 ∈ 𝑌 ∶ there is an 𝑥 ∈ 𝑋 such that 𝑓(𝑥) = 𝑦} ⊆ 𝑌

are respectively called the domain and image of 𝑓.

In other words, the domain is the subset of 𝑋 where arrows start; the image is the subset of 𝑌 where arrows point.

Fig. 16.4: The domain and image of a function.

Why is this important? For one, these are directly related to the invertibility of a function. If you consider the “points and
arrows” mental representation, inverting a function is as simple as ﬂipping the direction of the arrows. When can we do
it? In some cases, doing this might not even result in a function, as in the ﬁgure below.
To put the study of functions on top of solid theoretical foundations, we introduce the concept of injective, surjective and
bijective functions.

Deﬁnition 15.2.2 (Surjective, injective, and bijective functions.)

16.2. Domain and image. Injective, surjective, bijective functions. 197

Mathematics of Machine Learning

Fig. 16.5: This function is not invertible. Reversing the arrows doesn’t give a well-deﬁned function.

Let 𝑓 ∶ 𝑋 → 𝑌 be an arbitrary function.

(a) 𝑓 is injective if for every 𝑦 ∈ 𝑌 there is at most one 𝑥 ∈ 𝑋 such that 𝑓(𝑥) = 𝑦.
(b) 𝑓 is surjective if for every 𝑦 ∈ 𝑌 there is an 𝑥 ∈ 𝑋 such that 𝑓(𝑥) = 𝑦.*
(c) 𝑓 is bijective if it is injective and surjective.

In terms of arrows, injectivity means that every element of the image has at most one arrow pointing to it, while surjectivity
is that every element indeed has at least one arrow. When both are satisﬁed, we have a bijective function, one that can be
inverted properly. When the inverse 𝑓 −1 exists, it is unique. Both 𝑓 −1 ∘ 𝑓 and 𝑓 ∘ 𝑓 −1 equal to the identity function in
their respective domains.

Fig. 16.6: Injective, surjective, and bijective functions.

Let’s see some concrete examples! For instance,

𝑓 ∶ ℝ → ℝ, 𝑓(𝑥) = 𝑥2

198 Chapter 16. Functions in theory

Mathematics of Machine Learning

is not injective nor surjective. (Ponder on this a bit if you don’t understand it right away. It helps if you draw a ﬁgure.)
On the contrary,

𝑔 ∶ ℝ → ℝ, 𝑔(𝑥) = 𝑥3

is both, so it is bijective and invertible with inverse 𝑔−1 (𝑥) = 𝑥1/3 .

Invertible functions behave nicely, and from a certain perspective, they are much better to work with.

16.3 Operations with functions

Functions, just like numbers, have operations defined on them. Two numbers can be multiplied and added together, but
can you the same with functions? Without any difficulty, they can be added together and multiplied with a scalar as
(𝑓 + 𝑔)(𝑥) ∶= 𝑓(𝑥) + 𝑔(𝑥),
(𝑐𝑓)(𝑥) ∶= 𝑐𝑓(𝑥),
where 𝑐 is some scalar.
Another essential operation is composition. Let’s consider the famous logistic regression for a minute! The estimator itself
is defined by

𝑓(𝑥) = 𝜎(𝑎𝑥 + 𝑏),

where
1
𝜎(𝑥) =
1 + 𝑒−𝑥
is the sigmoid function. The estimator 𝑓(𝑥) is the composition of two functions: 𝑙(𝑥) = 𝑎𝑥 + 𝑏 and the sigmoid function,
so

𝑓(𝑥) = 𝜎(𝑙(𝑥)).

This is how we can illustrate function composition with points and arrows.

Fig. 16.7: Composing functions.

To give one more example, a neural network with several hidden layers is just the composition of a bunch of functions.
The output of each layer is fed into the next one, which is exactly how composition is deﬁned.
In general, if 𝑓 ∶ 𝐵 → 𝐶 and 𝑔 ∶ 𝐴 → 𝐵 are two functions, then their composition is formally deﬁned by

𝑓 ∘ 𝑔 ∶ 𝐴 → 𝐶, 𝑥 ↦ 𝑓(𝑔(𝑥)).

16.3. Operations with functions 199

Mathematics of Machine Learning

16.3.1 Function addition is composition

Given the functions

𝑓 ∶ 𝑋 → ℝ, 𝑔 ∶ 𝑋 → ℝ,

we can deﬁne their sum by

(𝑓 + 𝑔)(𝑥) ∶= 𝑓(𝑥) + 𝑔(𝑥).

Believe it or not, this is yet another form of function composition. Why? Deﬁne the function add by

add ∶ 𝑋 × 𝑋 → ℝ, add(𝑥1 , 𝑥2 ) = 𝑥1 + 𝑥2 .

Now we can write addition as

(𝑓 + 𝑔)(𝑥) = add(𝑓(𝑥), 𝑔(𝑥)).

Composition is an extremely powerful tool. In fact, so powerful that given a small set of cleverly deﬁned building blocks,
“almost every function” can be obtained as the composition of these blocks. (I put “almost every function” in quotes
because if we want to say mathematically precise here, long detours are needed. To keep ourselves focused, let’s allow
ourselves to be a little hand-wavy here.)

16.4 Mental models of functions

So far, we have seen that functions are defined as arrows drawn between elements of two sets. This, although being
mathematically rigorous, does not give us useful mental models to reason about them. As you’ll surely see by the end of
our journey, in mathematics, the key is often to find the right way to look at things. Regarding functions, one of the most
common and useful mental models is their graph.
If 𝑓 ∶ ℝ → ℝ is a function mapping a real number to a real number, we can visualize it using its graph, defined by

graph(𝑓) = {(𝑥, 𝑓(𝑥)) ∶ 𝑥 ∈ ℝ}.

This set of points can be drawn in the two-dimensional plane. For instance, in the case of the famous rectiﬁed linear unit
(ReLU)

0 if 𝑥 < 0
ReLU(𝑥) = {
𝑥 if 0 ≤ 𝑥,

the graph looks like this.

Although identifying functions with their graphs can be useful, it is not generalizable for more complex mappings. Visu-
alizing it is challenging if the function’s domain and image are not the set of real numbers.
When dealing with neural networks, probably the best way to think about functions (that is, layers in this context) as
transformations of the underlying feature space. A simple example is a rotation in the two-dimensional Euclidean plane.
Image transformations provide a set of more complex examples. You rarely think about image blur as a transformation
between spaces, but this is the case. After all, an image is just a huge vector in some vector space.
In essence, a neural network is simply a stack of transformations, each taking its input from the output of the previous
one. As you’ll see, what makes them special is that the transformations are not hand-engineered but learned from the
data.

200 Chapter 16. Functions in theory

Mathematics of Machine Learning

Fig. 16.8: Graph of the ReLU activation function.

Fig. 16.9: Functions as a transformation of the space. Here, the vectors of the space are rotated around the origin.

16.4. Mental models of functions 201

Mathematics of Machine Learning

Fig. 16.10: Image operations as transformations, as done by the Albumentations library. Source of the image: Albu-
mentations: Fast and Flexible Image Augmentations by Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya,
Alex Parinov, Mikhail Druzhinin and Alexandr A. Kalinin.

202 Chapter 16. Functions in theory

Mathematics of Machine Learning

16.5 Problems

16.5.1 Challenging ones

Problem 1. Find a function 𝑓 ∶ ℝ → ℝ such that (𝑓 ∘ 𝑓)(𝑥) = −𝑥.

Problem 2. Can any real function 𝑔 ∶ ℝ → ℝ be obtained as 𝑔 = 𝑓 ∘ 𝑓 for some 𝑓 ∶ ℝ → ℝ?

16.5. Problems 203

Mathematics of Machine Learning

204 Chapter 16. Functions in theory

CHAPTER

SEVENTEEN

FUNCTIONS IN PRACTICE

In our study of functions, we started from arrows between sets and ended up with mental models such as formulas and
graphs. For pure mathematical purposes, these models are perfectly enough to conduct thorough investigations. However,
once we leave the realm of theory and start putting things into practice, we must think about how functions are represented
in programming languages.
In Python, functions are deﬁned using a straightforward syntax. For instance, this is how the square(𝑥) = 𝑥2 function
can be implemented.

def square(x):
return x**2

The result is an object of type function. (In Python, everything is an object.)

type(square)

function

Functions are called using the () operator.

square(12)

144

Python is well-known for its simplicity, and functions are no exception. However, this doesn’t mean that they are limited
in features, quite the contrary: you can achieve a lot with the clever use of functions.

17.1 Operations on functions

There are three operations that we want to do on functions: composition, addition, multiplication. The easiest way is
to call the functions themselves and fall back to the operations deﬁned for the number types. To see an example, let’s
implement the cube(𝑥) = 𝑥3 function and add/multiply/compose it with square.

def cube(x):
return x**3

x = 2

square(x) + cube(x) # addition

205
Mathematics of Machine Learning

square(x)*cube(x) # multiplication

square(cube(x)) # composition

However, there is a major problem. If you take another look at the function operations, you can notice that they take
functions and return functions. For instance, the composition is deﬁned by

compose ∶ 𝑓, 𝑔 ↦ (𝑥
⏟⏟↦ 𝑓(𝑔(𝑥)))
⏟⏟ ⏟⏟⏟ ,
the composed function

with a function as a result. We did no such thing by simply passing the return value to the outer function. There is no
function object to represent the composition.
In Python, functions are first-class objects, meaning that we can pass them to other functions and return them from
functions. (This is an absolutely fantastic feature, but if this is the first time you encounter this, it might take some time
to get used to.) Thus, we can implement the compose function above by using the first-class function feature.

def compose(f, g):

def composition(*args, **kwargs):

return f(g(*args, **kwargs))

return composition

square_cube_composition = compose(square, cube)

square_cube_composition(2)

Addition and multiplication can be done just like this. (They are even assigned as an exercise problem.)

17.2 Functions as callable objects

The standard way of function definitions is not a good fit for an application that is essential for us: parametrized functions.
Think about the case of linear functions of the form 𝑎𝑥 + 𝑏, where 𝑎 and 𝑏 are parameters. On the first try, we can do
something like this.

def linear(x, a, b):

return a*x + b

Passing the parameters as arguments seems to work, but there are serious underlying issues. For instance, functions can
have a lot of parameters. Even if we compact parameters into multidimensional arrays, we might need to deal with dozens
of such arrays. Passing them around manually is error-prone, and we usually have to work with multiple functions. For

206 Chapter 17. Functions in practice

Mathematics of Machine Learning

example, neural networks are composed of several layers. Each layer is a parameterized function, and their composition
yields a predictive model.
We can solve this issue by applying the classical object-oriented principle of encapsulation, implementing functions as
callable objects. In Python, we can do this by implementing the magic __call__ method for the class.

class Linear:
def __init__(self, a, b):
self.a = a
self.b = b

def call(self, x):

return self.a*x + self.b

f = Linear(2, -1) # this represents the function f(x) = 2*x - 1

f(2.1)

3.2

This way, we can store, access, and modify the parameters using attributes.

f.a, f.b

(2, -1)

Since there can be a lot of parameters, we should implement a method that collects them together in a dictionary.

class Linear:
def __init__(self, a, b):
self.a = a
self.b = b

def call(self, x):

return self.a*x + self.b

def parameters(self):
return {"a": self.a, "b": self.b}

f = Linear(2, -1)
f.parameters()

{'a': 2, 'b': -1}

Interactivity is one of the most useful features of Python. In practice, we frequently ﬁnd ourselves working in the REPL,
inspecting objects and calling functions by hand. We often add a concise string representation for our classes for these
situations.
By default, printing a Linear instance results in a cryptic message.

<__main__.Linear at 0x7fae3c29b0f0>

17.2. Functions as callable objects 207

Mathematics of Machine Learning

This is not very useful. Besides the class name and its location in the memory, we haven’t received any information.
We can change this by implementing the __repr__ method responsible for returning the string representation for our
object.

class Linear:
def __init__(self, a, b):
self.a = a
self.b = b

def call(self, x):

return self.a*x + self.b

def __repr__(self):
return f"Linear(a={self.a}, b={self.b})"

def parameters(self):
return {"a": self.a, "b": self.b}

f = Linear(2, -1)

Linear(a=2, b=-1)

This looks much better! Adding a pretty string representation seems like a small thing, but this can go a long way when
doing machine learning engineering in the trenches.

17.2.1 The Function base class

The Linear class that we have just seen is only the tip of the iceberg. There are hundreds of function families that are
used in machine learning. We’ll implement many of them eventually, and to keep the interfaces consistent, we are going
to add a base class from which all others will be inherited.

class Function:
def __init__(self):
pass

def call(self, *args, **kwargs):

pass

def parameters(self):
return dict()

With this, we can implement functions and function families in the following way.

import numpy as np

class Sigmoid(Function): # the parent class is explicitly declared

def __call__(self, x):
return 1/(1 + np.exp(-x))

208 Chapter 17. Functions in practice

Mathematics of Machine Learning

sigmoid = Sigmoid()

sigmoid(2)

0.8807970779778823

Even though we haven’t implemented the parameters method for the Sigmoid class, it is inherited from the base
class.

sigmoid.parameters()

{}

For now, let’s keep the base class as simple as possible. During the course of this book, we’ll progressively enhance the
Function base class to cover all the methods a neural network and its layers need. (For instance, gradients.)
(subsection:functions/functions-in-practice/composition)

17.2.2 Composition in the object-oriented way

Recall how we did function composition when working with plain Python functions? Syntactically, that can work with our
Function class as well, although there is a huge issue: the return value is not a Function type.

composed = compose(Linear(2, -1), Sigmoid())

composed(2)

0.7615941559557646

isinstance(composed, Function)

False

This kind of composition doesn’t inherit the interface we need.

composed.parameters()

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-28-02b9088725fc> in <module>
----> 1 composed.parameters()

AttributeError: 'function' object has no attribute 'parameters'

To ﬁx the issue, we implement function composition as a child of the Function base class. Recall that composition is
a function, taking two functions as input and returning one:

compose ∶ 𝑓, 𝑔 ↦ (𝑥
⏟⏟↦ 𝑓(𝑔(𝑥)))
⏟⏟ ⏟⏟⏟ .
the composed function

17.2. Functions as callable objects 209

Mathematics of Machine Learning

Keeping this in mind, this is how we can do composition.

class Composition(Function):
def __init__(self, *functions):
self.functions = functions

def call(self, x):

for f in self.functions:
x = f(x)

return x

composed = Composition(Linear(2, -1), Sigmoid())

composed(2)

0.9525741268224334

This way, we get to keep the Function interface.

composed.parameters()

{}

isinstance(composed, Function)

True

17.3 Problems

Problem 1. Implement the add function that takes the functions 𝑓 and 𝑔, and returns their sum 𝑓 + 𝑔. (You can do this
following the example of composition.)

17.4 Solutions

Problem 1.

def add(f, g):

def sum(*args, **kwargs):

return f(*args, **kwargs) + g(*args, **kwargs)

return sum

210 Chapter 17. Functions in practice

CHAPTER

EIGHTEEN

NUMBERS

It’s like asking why is Ludwig van Beethoven’s Ninth Symphony beautiful. If you don’t see why, someone can’t
tell you. I know numbers are beautiful. If they aren’t beautiful, nothing is. — Paul Erdős
When I was about to take my first mathematical analysis course at the university, coming straight from high school, I
wondered why we would spend several lectures on real numbers. At the time, I was confident in my knowledge and
thought that I knew what numbers were. This was my first painful encounter with the Dunning–Kruger effect. Suffice to
say, after a few classes, I was left confused about numbers, taking a while to understand them finally.
If you look at numbers with a magnifying glass, they become extremely complex. In this chapter, we are going to see why
and how to make sense of them. To look ahead and keep machine learning in our sight, consider that gradient descent
(you know, the optimization algorithm that is used everywhere) is not be possible for functions that are not differentiable.
In turn, a function 𝑓 is differentiable at 𝑥 if the limit

𝑓(𝑥) − 𝑓(𝑦)
lim
𝑦→𝑥 𝑥−𝑦
exists. To understand limits, we must understand real numbers ﬁrst.
Another good reason to dig deep into the patterns and structures of numbers: they are beautiful. (Like said above by Paul
Erdős, one of the greatest mathematicians ever.) There is a particular joy of understanding seemingly familiar things on
a deep level. Even though you might not use this knowledge every day, it teaches you perspective about the objects you
encounter during your work.
So, let’s get started!

18.1 Numbers

There are ﬁve famous classes of numbers that one has to know in order to become adept in mathematics:
• natural numbers, denoted by ℕ,
• integers, denoted by ℤ,
• rational numbers, denoted by ℚ,
• real numbers, denoted by ℝ,
• and ﬁnally complex numbers, denoted by ℂ.
These classes are increasing in order, that is,

ℕ ⊆ ℤ ⊆ ℚ ⊆ ℝ ⊆ ℂ.

In this section, we are going to be concerned with the ﬁrst four. (Complex numbers will get own chapter.)

211
Mathematics of Machine Learning

18.1.1 Natural numbers and integers

Natural numbers are simply deﬁned as

ℕ ∶= {1, 2, 3, … }.

Sometimes zero is included; sometimes it is not. Believe it or not, after a few thousand years, mathematicians still cannot
decide whether or not 0 is a natural number. This problem might sound comical, but trust me, I have seen senior professors
almost go into a fistfight upon debating this issue. For some people, this is almost a religious question.
I don’t particularly care, and neither should you. I propose to use the more common and practical definition, which is the
one without zero. When we really need to talk about the natural numbers AND zero, I will use the notation

ℕ0 = {0, 1, 2, … }.

The cardinality of the set of natural numbers is countably infinite. In fact, countability is defined as |ℕ|.
To be able to express negative and zero quantities, we extend natural numbers to obtain the set of integers, defined by

ℤ = {… , −2, −1, 0, 1, 2, … }.

Relatively straightforward so far. Integers are also countable: one can enumerate all of its elements by

0, 1, −1, 2, −2, 3, −3, … .

One significant advantage of integers over natural numbers is that they contain the additive inverse for each element. That
is, in plain English, if 𝑛 ∈ ℤ, then so does −𝑛 ∈ ℤ. This makes it possible to define all kinds of algebraic structures over
the integers, giving us mathematical tools to reason about phenomena modeled by them.
Note that if 𝑛, 𝑚 ∈ ℤ, then 𝑛 + 𝑚 ∈ ℤ. In mathematical terminology, we say that ℤ is closed to addition.
To summarize, ℤ is
• closed to addition,
• and every element has an additive inverse.
These two properties will guide us on how to go from natural numbers to real numbers: each extension is constructed so
that these two properties hold, but for different and different operations.

18.1.2 Rational numbers

So, we obtained ℤ from ℕ by extending it with zero and the additive inverses for each element. What about the multi-
plicative inverses? This idea leads us to the concept of rational numbers, numbers that can be written as a ratio of two
integers. It is deﬁned by

𝑝
ℚ={ ∶ 𝑝, 𝑞 ∈ ℤ, 𝑞 ≠ 0},
𝑞

is both closed to multiplication and every element (except zero) has a multiplicative inverse. This is not just a “l’art pour
l’art” mathematical construction. Rational numbers model quantities that we encounter in real life. 0.798 kilometers, 3.4
kilograms of grain, etc.
It might be surprising, but ℚ is also countable. One easy way to prove is to notice that it can be obtained as the countable
union of countable sets:
𝑝
ℚ = ∪𝑝∈ℤ { ∶ 𝑞 ∈ ℤ\{0}}.
𝑞

212 Chapter 18. Numbers

Mathematics of Machine Learning

Fig. 18.1: Enumeration of rational numbers. This shows that ℚ is indeed countable.

Since the union of countable sets is countable, ℚ is countable as well. Another (and perhaps more visual) way to see this
is to simply enumerate them in a sequence. Something like this:
1
Rational numbers can be written in decimal form, like 2 = 0.5 for example. In general, the following is true.

Theorem
Any rational number 𝑥 can be represented as a:
(a) ﬁnite decimal

𝑥 = 𝑥0 .𝑥1 … 𝑥𝑛 , 𝑥𝑖 ∈ {0, 1, 2, … , 9}

(b) or a repeating decimal

𝑥 = 𝑥0 .𝑥1 𝑥2 … 𝑥𝑛̇ … 𝑥𝑚̇ , 𝑥𝑖 ∈ {0, 1, 2, … , 9},

where the decimals between the two dots repeat inﬁnitely. (This can be just a single digit as well.)

Note that the decimal representation is not unique: for example, 1.0 and 0.9̇ is equal.
The above theorem fully characterizes rational numbers. But what about the numbers with an inﬁnite decimal form that
is not repeating? Like the famous mathematical constant 𝜋 describing the half circumference of the unit circle, that is

𝜋 = 3.14159265358979323846264338327950288419716939937510...,

with no repeating patterns. These are called irrational numbers, and together with rationals, they make up the real numbers.

18.1. Numbers 213

Mathematics of Machine Learning

18.1.3 Real numbers

The simplest way to imagine real numbers is a line, where each point represents a number.

Fig. 18.2: The real number line. Source: Wikipedia.

If we temporarily let a little bit of mathematical correctness slide, we can say that

ℝ = finite decimals ∪ infinite repeating decimals ∪ infinite nonrepeating decimals.

Real numbers are also the ﬁrst in our journey that is not countable, and we will prove this! Its proof is so beautiful that it
belongs to The Book, a collection of the most elegant and beautiful mathematical proofs.

Theorem
ℝ is not countable.

Proof. To show that ℝ is not countable, we take an indirect approach: we suppose that it is countable and demonstrate
that this leads to a contradiction. This method is called an indirect proof, a top-tier tool in a mathematician’s toolkit.
Since [0, 1) ⊆ ℝ, it is enough to show that [0, 1) is not countable. If it is countable, we can enumerate it:

[0, 1) = {𝑎1 , 𝑎2 , … }.

We can write out the decimal forms of these:

𝑎1 = 0.𝑎11 𝑎12 𝑎13 … ,
𝑎2 = 0.𝑎21 𝑎22 𝑎23 … ,
𝑎3 = 0.𝑎31 𝑎32 𝑎33 … .
⋮

Let’s focus on the diagonal! By changing the digits there, we deﬁne

5, if 𝑎𝑛𝑛 ≠ 5,
𝑎𝑛𝑛
̂ ∶= {
1, if 𝑎𝑛𝑛 = 5.

Can the number

𝑎̂ ∶= 0.𝑎11
̂ 𝑎22
̂ 𝑎33
̂ …

be found in the sequence {𝑎1 , 𝑎2 , … }? No, because the i-th decimal of 𝑎𝑖 and 𝑎̂ must be diﬀerent for all 𝑖 ∈ ℕ! We
have constructed 𝑎̂ by changing the i-th decimal of 𝑎𝑖 .

214 Chapter 18. Numbers

Mathematics of Machine Learning

To summarize, our assumption that [0, 1) can be enumerated leads us to a contradiction because we have found an element
that cannot possibly be in our enumeration. So, [0, 1) is not countable, hence ℝ is not countable as well. This is what we
needed to show! □

The method of proof that you have seen above is called Cantor’s diagonal argument. This is a beautiful and powerful
idea, and although we won’t encounter it anymore, it is the key to proving several diﬃcult theorems. (Like Gödel’s famous
incompleteness theorems that threw a huge monkey wrench into the machinery of mathematics at the beginning of the
20th century.)
Notice that the way we introduced real numbers broke the pattern we have observed before. Integers were constructed
by extending the natural numbers with additive inverses and closing them to addition. Rationals were obtained the same
way, except doing it for multiplication. As we shall see later, real numbers follow a similar process: we obtain it from
rationals by closing them to limits.

18.1. Numbers 215

Mathematics of Machine Learning

216 Chapter 18. Numbers

CHAPTER

NINETEEN

SEQUENCES

Sequences lie at the very heart of mathematics. Sequences and their limits describe long-term behavior, like the (occa-
sional) convergence of gradient descent to a local optimum. By deﬁnition, a sequence is an enumeration of mathematical
objects.
The elements of a sequence can be any mathematical object, like sets, functions, or Hilbert spaces. (Whatever those might
be.) For us, sequences are composed of numbers. We formally denote them as

{𝑎𝑛 }∞
𝑛=1 , 𝑎𝑛 ∈ ℝ.

For simplicity, the subscripts and the superscripts are often omitted, so don’t panic if you see {𝑎𝑛 }, as it is just an
abbreviation. (Or 𝑎𝑛 . Mathematicians love abbreviations.) If all elements of the sequence belongs to a set 𝐴, we often
write {𝑎𝑛 } ⊆ 𝐴.
Sequences can be bidirectional as well, those are denoted as {𝑎𝑛 }∞ 𝑛=−∞ . We don’t need them for now, but they will
frequently appear when talking about probability distributions later.

19.1 Convergence

One of the most important aspects of sequences is their asymptotic behavior, or in other words, what they do in the long
term. A particular property we often look for is convergence. In plain English, the sequence {𝑎𝑛 } converges to 𝑎 if no
matter how small of an interval (𝑎 − 𝜀, 𝑎 + 𝜀) we deﬁne (where 𝜀 can be really small), eventually all of the elements of
{𝑎𝑛 } fall into it.
The following is the mathematically precise deﬁnition of convergence.

Deﬁnition 18.1.1 (Convergence of sequences.)

The sequence {𝑎𝑛 } ⊆ ℝ is said to converge to some 𝑎 ∈ ℝ if for every 𝜀 > 0, there is a cutoﬀ index 𝑛0 ∈ ℕ such that

|𝑎𝑛 − 𝑎| < 𝜀

holds for all indices 𝑛 > 𝑛0 . 𝑎 is said to be the limit of {𝑎𝑛 } and we write

lim 𝑎𝑛 = 𝑎
𝑛→∞

𝑎𝑛 → 𝑎 (𝑛 → ∞).

217
Mathematics of Machine Learning

Note that the cutoff index 𝑛0 depends on 𝜀. We could write 𝑛0 (𝜀) to emphasize this dependency, but we will rarely do
so. To avoid referencing and naming the cutoff index 𝑛0 all the time, we often simply say that a given property “holds for
all 𝑛 large enough”. (Did I mention that mathematicians love abbreviations?)
In plain English, the definition means that no matter how small of an interval you enclose 𝑎 in, all members of the sequence
will eventually fall into it.
Although mathematically extremely precise and correct, this definition doesn’t give us a lot of tools to show if a sequence
is convergent or not. First, we have to conjure up the limit 𝑎 and then construct the cutoff indexes. For example, consider
1
𝑎𝑛 ∶= .
𝑛
To make our job easier, we can plot this to visualize the situation.

Fig. 19.1: The {1/𝑛} sequence, 𝑛 = 1, 2, … , 20.

Here, we can explicitly construct the cutoﬀ index 𝑛0 for every 𝜀. Since we want to have
1
< 𝜀,
𝑛
we can reorganize the inequality to obtain
1
< 𝑛.
𝜀
So,
1
𝑛0 ∶= ⌊ ⌋ + 1
𝜀
will do the job.
We had it easy in this example, but this is pretty much as far as we can go with the deﬁnition. For example, how do you
show the convergence of
−1
1 1 1
𝑎𝑛 ∶= ( + +⋯+ )
𝑛 𝑛+1 2𝑛

218 Chapter 19. Sequences

Mathematics of Machine Learning

with the deﬁnition only? You don’t. There are more advanced tools for this, as we shall see. (By the way, lim𝑛→∞ 𝑎𝑛 =
1
ln 2 . We will show this later when talking about integrals.) For sequences that are deﬁned recursively and there is no
analytic formula available, like

⃗ ∞
{𝐿(𝑤⃗ 𝑛 , 𝑥,⃗ 𝑦)} 𝑛=1 ,

where 𝐿 is the loss function for a neural network with weights 𝑤⃗ 𝑛 and training data (𝑥,⃗ 𝑦),
⃗ we have even more complica-
tions. There is no need to worry about them yet, so let’s go one step at a time.

19.1.1 Properties of convergence

In essence, the study of convergence for a particular sequence comes down to breaking it into simpler and simpler parts
until the limit is known.
1. Is this a “famous” sequence where the limit is known? If yes, we are done. If not, go to the next step.
2. Can you decompose it into simpler parts? If yes, is the convergence known for them? If the convergence is
unknown, can you simplify it further?
We can do this because convergence has some particularly nice properties, as summarized in the theorem below.

Theorem 18.1.1 (Properties of convergence.)

Let {𝑎𝑛 } and {𝑏𝑛 } be two convergent sequences with lim𝑛→∞ 𝑎𝑛 = 𝑎 and lim𝑛→∞ 𝑏𝑛 = 𝑏. Then
(a)

lim (𝑎𝑛 + 𝑏𝑛 ) = 𝑎 + 𝑏,
𝑛→∞

(b)

lim 𝑐𝑎𝑛 = 𝑐𝑎 for all 𝑐 ∈ ℝ,

𝑛→∞

(c)

lim 𝑎𝑛 𝑏𝑛 = 𝑎𝑏,
𝑛→∞

(d) if 𝑎𝑛 ≠ 0 and 𝑎 ≠ 0, then

1 1
lim = .
𝑛→∞ 𝑎𝑛 𝑎

The properties (a) and (b) together are called linearity of convergence. If you recall the deﬁnition of linear transformations
As we shall see later, continuity of functions also provides a great tool to study convergence properties of a sequence. In
fact, continuity is nothing else than the interchangeability of limits and functions:

lim 𝑓(𝑥𝑛 ) = 𝑓( lim 𝑥𝑛 ).

𝑛→∞ 𝑛→∞

One essential property of convergent sequences is that under certain circumstances, they preserve inequalities. This will
be true to function limits as well, so it will be important for us later as we’ll see.

Theorem 18.1.2 (Transfer principle.)

Let {𝑎𝑛 }∞
𝑛=1 be a convergent sequence. If 𝑎𝑛 ≥ 𝛼 holds for all 𝑛 ∈ ℕ, where 𝛼 ∈ ℝ is some lower bound, then
lim𝑛→∞ 𝑎𝑛 ≥ 𝛼.

19.1. Convergence 219

Mathematics of Machine Learning

|𝑎−𝛼|
Proof. We are going to do this indirectly. If lim𝑛→∞ 𝑎𝑛 < 𝛼, then by the deﬁnition of convergence, |𝑎𝑛 − 𝑎| < 2
for all large 𝑛. This means that those 𝑎𝑛 -s are actually below 𝛼, contradicting our assumptions. □

This proof is straightforward to understand if you draw a ﬁgure and visualize what happens, so I encourage you to do it.
The identical result is true if we replace ≥ with ≤ in the above, and the proof goes through word by word.
Note that if 𝑎𝑛 > 𝛼 for all 𝑛, lim𝑛→∞ 𝑎𝑛 > 𝑎 is not guaranteed! The best example to show this is 𝑎𝑛 ∶= 1/𝑛, which
converges to 0, although all of its terms are positive.
As a corollary, we obtain a tool that will be very useful for showing the convergence of particular sequences.

Corollary 18.1.1 (The squeeze principle.)

Let {𝑎𝑛 }∞ ∞ ∞
𝑛=1 , {𝑏𝑛 }𝑛=1 , and {𝑐𝑛 }𝑛=1 be three sequences such that 𝑎𝑛 ≤ 𝑏𝑛 ≤ 𝑐𝑛 for all large enough 𝑛-s. If
lim𝑛→∞ 𝑎𝑛 = lim𝑛→∞ 𝑐𝑛 = 𝛼, then lim𝑛→∞ 𝑏𝑛 = 𝛼.

In other words, squeezing {𝑏𝑛 } between two convergent sequences that have the same limit implies convergence of 𝑏𝑛 to
the joint limit.

19.2 Famous convergent sequences

Because convergence behaves nicely with respect to certain operations, we study sequences by decomposing them into
building blocks. Let’s see the most important ones that will be useful for us later!
Example 1. For any 𝑥 ≥ 0,

⎧0 if 0 ≤ 𝑥 < 1,
{
𝑛
lim 𝑥 = 1 if 𝑥 = 1, (19.1)
𝑛→∞ ⎨
{∞ if 𝑥 > 1.
⎩
If you think about it for a minute, this is easy to see. The 𝑥 = 0 and 𝑥 = 1 cases are trivial. Regarding the others,
because taking the logarithm turns exponentiation into multiplication, we have log 𝑥𝑛 = 𝑛 log 𝑥. So,

−∞ if 0 < 𝑥 < 1,
lim 𝑛 log 𝑥 = {
𝑛→∞ ∞ if 𝑥 > 1.

Since the logarithm is increasing and invertible, (19.1) follows.

Example 2. For any 𝑥 ≥ 0,

1 if 𝑥 > 0,
lim 𝑥1/𝑛 = { (19.2)
𝑛→∞ 0 if 𝑥 = 0.

Similarly to the previous example, this can be shown with the use of logarithms.

220 Chapter 19. Sequences

Mathematics of Machine Learning

19.2.1 The role of convergence in machine learning

Convergence is everywhere. We just met this concept for the first time, so we don’t see its importance just yet. However,
it is central to mathematics and machine learning.
Just to look ahead and give a few examples, differentiation is defined by a limit:

𝑓(𝑥) − 𝑓(𝑦)
𝑓 ′ (𝑥) ∶= lim .
𝑦→𝑥 𝑥−𝑦
Regarding derivatives, integrals (the “inverse” of diﬀerentiation) are limits of convergent sequences. For instance,
1 𝑛
𝑘2
∫ 𝑥2 𝑑𝑥 = lim ∑ .
0
𝑛→∞
𝑘=1
𝑛3

Because integrals are limits, so does every quantity calculated with integration, such as expected values, like for the
standard normal distribution,
∞
1 2
𝔼[𝒩(0, 1)] = ∫ 𝑥 √ 𝑒−𝑥 /2 𝑑𝑥.
−∞ 2𝜋
Convergence is also central to probability and statistics. There are two famous theorems: the Law of Large Numbers,
stating that

1 𝑛
∑ 𝑋𝑘 = 𝜇
lim
𝑛→∞ 𝑛
𝑘=1

holds; and the Central Limit Theorem that says

√ 𝑋1 + ⋯ + 𝑋𝑛
𝑛( − 𝜇) → 𝒩(𝜇, 𝜎2 ) (𝑛 → ∞)
𝑛

in distribution, for independent and identically distributed random variables 𝑋1 , 𝑋2 , … with finite expected value 𝔼[𝑋𝑖 ] =
𝜇 and variance var(𝑋𝑖 ) = 𝜎2 . They are both very important in machine learning and neural networks, for instance, the
Law of Large Numbers is one of the fundamental ideas behind stochastic gradient descent.
Even the gradient descent optimization process is a recursively defined sequence of model weights, converging towards
an optimum where the model best fits the data.
We will talk about all of these in detail. So even if you don’t understand these right now, don’t worry. It’ll be clear soon.
Before finishing up with sequences, we shall discuss what happens when a sequence is not convergent.

19.2.2 Divergent sequences

We have talked about how convergent sequences are everywhere, and they are at the core of mathematics and machine
learning. However, not all sequences are convergent.
Think about the following example:

𝑎𝑛 ∶= sin(𝑛).

When plotted, this is how it looks.

Although it is hard to prove, this sequence does not converge. Its value is constantly oscillating in [−1, 1]. We call
non-convergent sequences divergent. Among these, there is a special kind of divergence: approaching inﬁnity.

Deﬁnition 18.2.1 (∞-divergence.)

19.2. Famous convergent sequences 221

Mathematics of Machine Learning

Fig. 19.2: The {𝑠𝑖𝑛(𝑛)} sequence, 𝑛 = 1, 2, … , 20.

The sequence {𝑎𝑛 } is said to be ∞-divergent if for every arbitrarily large number 𝑥, there is a cutoﬀ index 𝑛0 such that

𝑎𝑛 > 𝑥

holds for all indices 𝑛 > 𝑛0 .

We denote ∞-divergence by writing 𝑥𝑛 → ∞. Analogously, −∞ -divergent sequences can be deﬁned.

An obvious example is {𝑛} or {𝑛 log 𝑛}. These are all across computer science as well: the runtime of algorithms given
the number of steps or the size of the input is ∞-divergent.
When you see something like 𝑎𝑛 = 𝑂(𝑛), it means that there is a constant 𝑐 such that

0 ≤ 𝑎𝑛 ≤ 𝑐𝑛

holds for all large enough 𝑛.

19.2.3 Subsequences

Sometimes, when working with sequences, we don’t need the entire thing, just its subsequence. We will not do anything
special with them just yet, but here is the formal deﬁnition.

Deﬁnition 18.2.2 (Subsequences.)

Let {𝑎𝑛 }∞ ∞ ∞
𝑛=1 , and let {𝑛𝑘 }𝑘=1 ⊆ ℕ be a strictly increasing sequence of natural numbers. Then the sequence {𝑎𝑛𝑘 }𝑘=1
is a subsequence of {𝑎𝑛 }.

Think of it as throwing elements away from a sequence.

222 Chapter 19. Sequences

Mathematics of Machine Learning

19.2.4 Series

There are a special class of sequences we need to mention: series. That is, sequences of the form
∞
𝑆𝑛 = ∑ 𝑎𝑘 , 𝑎𝑘 ∈ ℝ.
𝑘=1

This is what we mean when we write inﬁnite sums, as they are deﬁned by
∞ 𝑛
∑ 𝑎𝑘 ∶= lim ∑ 𝑎𝑘 .
𝑛→∞
𝑘=1 𝑘=1

Although we won’t go into details, the literature on series is huge. It is not an overstatement to say that almost the entire
development of mathematical analysis in the 19th and 20th century was motivated by expressing functions in series form.

19.3 The big and small O notation

If you have some experience with computer science, you are probably familiar with the big O/small O notation. There, it
is used to express the runtime of algorithms, but it is not limited to that. In general, it is used to compare the long-term
behavior of sequences. Let’s start with the deﬁnitions ﬁrst, and then I’ll explain the intuition and some use-cases.

Deﬁnition 18.3.1 (Big and small O notation.)

Let {𝑎𝑛 }∞ ∞
𝑛=1 and {𝑏𝑛 }𝑛=1 be two arbitrary sequences. We say that

(a) 𝑏𝑛 = 𝑂(𝑎𝑛 ), if there is a constant 𝐶 such that |𝑏𝑛 | ≤ 𝐶𝑎𝑛 ,

(b) and 𝑏𝑛 = 𝑜(𝑎𝑛 ) if for every 𝜀 > 0, there is a cutoﬀ index 𝑁 ∈ ℕ such that |𝑏𝑛 | ≤ 𝜀𝑎𝑛 for every 𝑛 > 𝑁 .

In plain English, “𝑏𝑛 is big O 𝑎𝑛 ” means that 𝑏𝑛 grows roughly at the same rate as 𝑎𝑛 , while “𝑏𝑛 is small o 𝑎𝑛 ” says that
𝑏𝑛 is an order of magnitude smaller than 𝑎𝑛 .
So, when we say that the runtime of an algorithm is 𝑂(𝑛) steps where 𝑛 is the input size, we mean that the algorithm
will ﬁnish in 𝐶𝑛 steps. Often, we don’t care about the constant multiplier since it doesn’t mean an order of magnitude
diﬀerence in the long run.

19.4 Real numbers are sequences

Now that we have familiarized ourselves with the concept of convergent sequences, we shall take another look at rational
and real numbers. When extending the classes of numbers going from ℕ to ℝ, we pick an operation, close the set with
respect to it, and add inverse elements to that operation.
Extending ℕ with additive inverses −𝑛 for all 𝑛 ∈ ℕ yields ℤ. Extending ℤ with multiplicative inverses 1/𝑛 for all 𝑛 and
closing it for multiplication yields ℚ. The pattern is seemingly diﬀerent in the case of ℝ, but this is not the case. After
understanding what convergence is, we have the tools to see why.
Consider the following sequence:
𝑛
1
𝑎𝑛 ∶= (1 + ) , 𝑛 = 1, 2, … .
𝑛
Since rational numbers are closed to addition and multiplication, we see that 𝑎𝑛 is rational. However,
𝑛
1
lim (1 + ) = 𝑒,
𝑛→∞ 𝑛

19.3. The big and small O notation 223

Mathematics of Machine Learning

which is the famous Euler constant, is not rational.

Thus, we have found what is missing: ℚ is not closed to taking limits. So, we can obtain the set of real numbers by closing
ℚ to taking limits.
The fact that every irrational number can be approximated with rational numbers as close as possible is often under-
appreciated. Think about this: can you represent all real numbers with a computer? Nope. This follows from a simple
cardinality argument: the number of possible ﬂoats is ﬁnite, but there are uncountably many real numbers. However,
certain numbers (like 𝜋 or 𝑒) are essential in engineering calculations and simulations. Without approximations, working
with irrational numbers would be unfeasible.

224 Chapter 19. Sequences

CHAPTER

TWENTY

THE TOPOLOGY OF REAL NUMBERS

(Almost) everything in machine learning is described with real numbers. Features, losses, parameters, probabilities.
Every model is a mapping between ℝ𝑛 and ℝ𝑚 . Because our tooling is built on top of this, it is essential to understand
how real numbers are structured. In mathematical terms, this is called topology.
According to the Cambridge English Dictionary, the word “topology” means
the way the parts of something are organized or connected.
From a mathematical perspective, topology studies the local properties of structures and spaces. In machine learning,
we are often interested in global properties like minima and maxima but only have local tools to search for them. One
example is the derivative of functions. Derivatives describe the slope of the tangent plane, and as Fig. 20.1 illustrates, this
doesn’t change if the function is modiﬁed away from the point where the derivative is taken.
In mathematics, local properties are handled in terms of sequences and neighborhoods. We have learned about sequences
in the last chapter, and now we tackle the subject of neighborhoods.
We are going to focus on three fundamental aspects:
• open and closed sets,
• behavior of sequences within sets,
• and their smallest and largest elements, upper and lower bounds.
Our main goal with mathematical analysis is to understand gradient descent, a fundamental tool for training models. To
do that, we need to understand limits. For that, sequences and real numbers, leading deep into the rabbit hole where we
are now.
Think of it as learning the Python language versus learning TensorFlow or PyTorch. Since we want to do machine learning,
we ultimately want to learn a high-level framework. However, if we lack the understanding of the basic keywords in Python
like import or def, we are not ready to learn and productively use advanced tools. Sequences, open and closed sets,
limits, and others are the fundamental building blocks of mathematical analysis, the language of optimization.

20.1 Open and closed sets

Let’s start our discussion with open and closed sets! (In this chapter, when we refer to something as a subset or set, it is
implicitly assumed to be within ℝ.)

Deﬁnition 19.1.1 (Open and closed sets.)

Let 𝐴 ⊆ ℝ be a subset of the real numbers.
(a) 𝐴 is open if for every 𝑥 ∈ 𝐴, there is a 𝜀 > 0 such that (𝑥 − 𝜀, 𝑥 + 𝜀) ⊆ 𝐴.

225
Mathematics of Machine Learning

Fig. 20.1: Gradient as a local property.

226 Chapter 20. The topology of real numbers

Mathematics of Machine Learning

(b) 𝐴 is closed if its complement ℝ\𝐴 is open.

Before we start analyzing the properties of open and closed sets, here are some key examples for building up useful mental
models.
Example 1. Intervals of the form (𝑎, 𝑏) = {𝑥 ∈ ℝ ∶ 𝑎 < 𝑥 < 𝑏} are open. This can be easily seen by picking any
𝑥 ∈ (𝑎, 𝑏) and letting 𝜀 = min{|𝑥 − 𝑎|/2, |𝑥 − 𝑏|/2}. Essentially, we take the distance from the closest endpoint and cut
that in half. Any point that is closer to 𝑥 than half-distance of the closest endpoint will be also in (𝑎, 𝑏).
Example 2. Intervals of the form [𝑎, 𝑏] = {𝑥 ∈ ℝ ∶ 𝑎 ≤ 𝑥 ≤ 𝑏} are closed. Indeed, its complement is ℝ\[𝑎, 𝑏] =
(−∞, 𝑎) ∪ (𝑏, ∞). Using the reasoning above, it is easy to see that (−∞, 𝑎) ∪ (𝑏, ∞) is open.
Example 3. Intervals of the form (𝑎, 𝑏] = {𝑥 ∈ ℝ ∶ 𝑎 < 𝑥 ≤ 𝑏} are neither open, nor closed. To see that it is not open,
observe that no interval containing 𝑏 is fully within (𝑎, 𝑏], since 𝑏 is an endpoint. For similar reasons, its complement
ℝ\(𝑎, 𝑏] = (−∞, 𝑎] ∪ (𝑏, ∞) is not open.
An important takeaway from the last example is that if a set is not closed, it doesn’t mean that it is open and vice versa.
We can rephrase the deﬁnition of openness by introducing the concept of neighborhoods. The neighborhoods of a given
point 𝑥 are the open intervals (𝑎, 𝑏) containing 𝑥. With this terminology, any set 𝐴 is open if, for any 𝑥 ∈ 𝐴, there exists
a neighborhood of 𝑥 that is fully contained within 𝐴. From this aspect, openness means that there is still “room to move”
from any point.
The most fundamental property of open and closed sets is their behavior under union and intersection.

Theorem 19.1.1
Let {𝐴𝛾 }𝛾∈Γ be an arbitrary collection of sets.
(a) If each 𝐴𝛾 is open, then ∪𝛾∈Γ 𝐴𝛾 is also open.
(b) If each 𝐴𝛾 is closed, then ∩𝛾∈Γ 𝐴𝛾 is also closed.

Proof. (a) Suppose that 𝐴𝛾 , 𝛾 ∈ Γ are open sets and let 𝑥 ∈ ∪𝛾∈Γ 𝐴𝛾 . Because 𝑥 is in the union, there is some 𝛾0 ∈ Γ
such that 𝑥 ∈ 𝐴𝛾0 . Because 𝐴𝛾0 is open, there is a small neighborhood (𝑎, 𝑏) of 𝑥 such that (𝑎, 𝑏) ⊆ 𝐴𝛾0 . Because of
this, (𝑎, 𝑏) ⊆ ∪𝛾∈Γ 𝐴𝛾 , which is what we had to show.
(b) Now let 𝐴𝛾 , 𝛾 ∈ Γ be closed sets. In this case, De Morgan’s laws imply that ℝ\(∩𝛾∈Γ 𝐴𝛾 ) = ∪𝛾∈Γ (ℝ\𝐴𝛾 ). Since
each 𝐴𝛾 is closed, ℝ\𝐴𝛾 is open. As we have previously seen, the union of open sets is open. □

Closedness and openness of a set inﬂuence its behavior regarding set sequences. The ﬁrst fundamental result regarding
this is Cantor’s axiom.

Theorem 19.1.2 (Cantor’s axiom.)

Let 𝐼𝑛 = [𝑎𝑛 , 𝑏𝑛 ] ⊆ ℝ be a sequence of intervals such that 𝐼𝑛+1 ⊆ 𝐼𝑛 for every 𝑛 ∈ ℕ. Then the intersection ∩∞
𝑛=1 𝐼𝑛
is nonempty.

This seemingly simple proposition is a profound property of real numbers, one that ultimately follows from their mathe-
matical construction. Cantor’s axiom is not true, for instance, if we talk about subsets of ℚ instead of ℝ. Think about a
sequence of rational numbers 𝑎𝑛 → 𝜋 that approximates 𝜋 from below, and another sequence 𝑏𝑛 → 𝜋 that approximates
pi from above, that is,

𝑎𝑛 < 𝜋 < 𝑏𝑛 , 𝑎𝑛 , 𝑏𝑛 ∈ ℚ.

20.1. Open and closed sets 227

Mathematics of Machine Learning

The intersection of the intervals [𝑎𝑛 , 𝑏𝑛 ] only contains 𝜋, which is not rational. Thus, in the space of rational numbers,
∩∞
𝑛=1 [𝑎𝑛 , 𝑏𝑛 ] = ∅, therefore Cantor’s axiom doesn’t hold there.

There is an old proverb about losing the war because of a nail in a horseshoe. It goes something like this:
For want of a nail the shoe was lost.For want of a shoe the horse was lost.For want of a horse the rider was
lost.For want of a rider the message was lost.For want of a message the battle was lost.For want of a battle
the kingdom was lost.And all for the want of a horseshoe nail.
Think about Cantor’s axiom as the nail in the horseshoe. Without it, we can’t talk about taking limits of sequences.
Without limits, there are no gradients. Without gradients, there is no gradient descent, and consequently, we can’t ﬁt
machine learning models.

20.1.1 Distance and topology

Originally, we have defined open sets in terms of small open intervals like (𝑥 − 𝜀, 𝑥 + 𝜀). We called a set open if you
could squeeze in such a small interval for each of its points. By taking a step of abstraction, we can rephrase the definition
in terms of norms.
From this viewpoint, an interval (𝑥 − 𝜀, 𝑥 + 𝜀) is the same as a one-dimensional open ball. Given a normed space 𝑉 with
the norm ‖ ⋅ ‖, the ball of radius 𝑟 > 0 centered at 𝑥 is defined by

𝐵(𝑥, 𝑟) = {𝑦 ∈ 𝑉 ∶ ‖𝑥 − 𝑦‖ < 𝑟}.

Equivalently, a ball of radius 𝑟 is the set of points with distance less than 𝑟 from the center point. In the Euclidean spaces
ℝ𝑛 , with the norm ‖𝑥‖ = √𝑥21 + ⋯ + 𝑥2𝑛 , this matches our intuitive understanding. This is illustrated in Fig. 20.2.

Fig. 20.2: Balls, from one to three dimensions.

However, in one dimension, the Euclidean norm simpliﬁes to ‖𝑥‖ = |𝑥|. Thus, we have

𝐵(𝑥, 𝑟) = {𝑦 ∈ ℝ ∶ |𝑥 − 𝑦| < 𝑟}
= (𝑥 − 𝑟, 𝑥 + 𝑟).

We don’t often think about the interval (𝑥 − 𝜀, 𝑥 + 𝜀) as the one-dimensional ball 𝐵(𝑥, 𝜀). However, making this
connection will make it easy to later extend the topology of ℝ to ℝ𝑛 , which is where we want to work eventually.
With norms and balls, we can rephrase the deﬁnition of open sets in the following way.

228 Chapter 20. The topology of real numbers

Mathematics of Machine Learning

Deﬁnition 19.1.2 (Open sets, second take.)

Let 𝐴 ⊆ ℝ be a subset of the real numbers.
𝐴 is open if for every 𝑥 ∈ 𝐴, there is a one-dimensional ball 𝐵(𝑥, 𝜀) of radius 𝜀 centered around 𝑥 such that 𝐵(𝑥, 𝜀) ⊆ 𝐴.

Thus, in a sense, open sets are determined by open balls.

20.1.2 Sets and sequences

Closed sets can be characterized in terms of their sequences. The following theorem shows an equivalent deﬁnition of
closed sets, giving us a helpful way of thinking about them.

Theorem 19.1.3 (Characterization of closed sets with sequences.)

Let 𝐴 ⊆ ℝ be a set. The following is equivalent.
(a) 𝐴 is closed.
(b) If {𝑎𝑛 }∞
𝑛=1 ⊆ 𝐴 is a convergent sequence, then lim𝑛→∞ 𝑎𝑛 ∈ 𝐴.

Proof. To prove that the two statements are equivalent, we have to show two things: that (a) implies (b) and that (b)
implies (a). Don’t worry if this proof seems too complicated when you read it the first time. If you don’t understand it
right away, I suggest thinking about 𝐴 as a closed interval and drawing a figure. You can also skip it since I will refer back
to this fact every time we need it later.
First, let’s see that (a) implies (b). Thus, suppose that 𝐴 is closed and {𝑎𝑛 }∞
𝑛=1 ⊆ 𝐴 is a convergent sequence, 𝑎 ∶=
lim𝑛→∞ 𝑎𝑛 . We have to show that 𝑎 ∈ 𝐴, and we are going to do this by contradiction. The plan is the following: assume
that 𝑎 ∉ 𝐴 and deduce that {𝑎𝑛 } must eventually leave 𝐴.
Indeed, suppose that 𝑎 ∈ ℝ\𝐴. Because of 𝐴 is closed, ℝ\𝐴 is open, so there is a small neighborhood (𝑎 − 𝜀, 𝑎 + 𝜀) ⊆
ℝ\𝐴. In plain English, this means that we can separate 𝑎 from 𝐴. This contradicts the fact that {𝑎𝑛 } ⊆ 𝐴 and 𝑎𝑛 → 𝑎,
because according to the definition of convergence, eventually all members of the sequence have to fall into (𝑎 − 𝜀, 𝑎 + 𝜀).
This is a contradiction, so 𝑎 ∈ 𝐴.
Second, we will show that (b) implies (a), that is, if the limit of every convergent sequence of 𝐴 is also in 𝐴, then the
set is closed. Our goal is to show that ℝ\𝐴 is open. More precisely, if 𝑥 ∈ ℝ\𝐴, we want to find a small neighborhood
(𝑥 − 𝜀, 𝑥 + 𝜀) that is disjoint from 𝐴. Again, we can show this via contradiction.
Suppose that no matter how small 𝜀 > 0 is, we can find an 𝑎 ∈ 𝐴 ∩ (𝑥 − 𝜀, 𝑥 + 𝜀). Thus, we can define a sequence
{𝑎𝑛 }∞𝑛=1 such that 𝑎𝑛 ∈ 𝐴 ∩ (𝑥 − 1/𝑛, 𝑥 + 1/𝑛). Due to the construction, lim𝑛→∞ 𝑎𝑛 = 𝑎, and as 𝐴 is closed to taking
limits according to the premise (b), this would imply that 𝑎 ∈ 𝐴, which is a contradiction. This is what we had to show.
□

This result also explains the origins of the terminology closed. A closed set is such because it is closed to limits.

20.1. Open and closed sets 229

Mathematics of Machine Learning

20.2 Bounded sets

From a (very) high-level view, machine learning can be described as an optimization problem. For inputs 𝑥 and predictions
𝑦, we are looking at a parametrized family of functions 𝑓(𝑥, 𝑤), where our parameters are condensed in the variable 𝑤.
Given a set of samples and observations, our goal is to ﬁnd the minimum of the set

{Loss(𝑓(𝑥, 𝑤), 𝑦) ∶ 𝑤 ∈ ℝa very large number } ⊆ ℝ, (20.1)

and the parameter conﬁguration 𝑤 where the optimum is attained. To make sure that our foundations are not missing this
building block, we are going to take some time to study this.

Deﬁnition 19.2.1 (Bounded sets.)

Let 𝐴 ⊆ ℝ.
(a) 𝐴 is bounded from below if there is an 𝑚 ∈ ℝ such that 𝑚 < 𝑥 for all 𝑥 ∈ 𝐴. The number 𝑚 is called a lower bound.
(b) 𝐴 is bounded from above if there is an 𝑀 ∈ ℝ such that 𝑥 < 𝑀 for all 𝑥 ∈ 𝐴. (Similarly as before, 𝑀 is called an
upper bound.)
(c) 𝐴 is bounded if it is bounded from above and below.

Being bounded means that we can include the set in a large interval [𝑚, 𝑀 ]. For optimization, there are a few essential
quantities that are related to bounds: minimal and maximal elements, smallest upper bounds, and largest lower bounds.
Let’s start with formalizing the concept of the smallest and largest element within a set.

Deﬁnition 19.2.2 (Minimal and maximal elements.)

Let 𝐴 ⊆ ℝ.
(a) 𝑚 ∈ 𝐴 is called the minimal element of 𝐴 if for every 𝑎 ∈ 𝐴, 𝑚 ≤ 𝑎 holds. The minimal element is denoted by
𝑚 = min 𝐴.
(b) 𝑀 ∈ 𝐴 is called the maximal element of 𝐴 if for every 𝑎 ∈ 𝐴, 𝑎 ≤ 𝑀 holds. The maximal element is denoted by
𝑀 = max 𝐴.

As usual, let’s see some examples.

Example 1. The set 𝐴 = [0, 1] ∪ [2, 3] has both minimal and maximal elements. Its minimal element is 0, its maximal
element is 3.
Example 2. The set 𝐴 = { 𝑛1 ∶ 𝑛 ∈ ℕ} has no minimal element. 0 is the largest lower bound, but since it is not in the
set, it is not a minimal element.
The key takeaway is that minimal and maximal elements do not necessarily exist. To avoid the inconvenience, we need
related quantities that always exist. These will be the inﬁmum and supremum.

Deﬁnition 19.2.3 (Inﬁmum and supremum.)

Let 𝐴 ⊆ ℝ.
(a) inf 𝐴 ∈ ℝ is the largest lower bound of 𝐴 if 𝑚 ≤ inf 𝐴 holds for every lower bound 𝑚. This is called the inﬁmum.
(b) sup 𝐴 ∈ ℝ is the smallest upper bound of 𝐴 if 𝑀 ≥ sup 𝐴 holds for every upper bound 𝑀 . This is called the
supremum.

230 Chapter 20. The topology of real numbers

Mathematics of Machine Learning

We won’t go into great details here, but the inﬁmum and supremum always exist. However, it is essential to note that there
is a sequence {𝑎𝑛 }∞
𝑛=1 ⊆ 𝐴 such that 𝑎𝑛 → inf 𝐴. (This is true for the supremum as well.)

20.3 Compact sets

With the concept of inﬁmum and supremum, we can formalize the optimization problem for machine learning described
by (20.1) as

inf {Loss(𝑓(𝑥, 𝑤), 𝑦) ∶ 𝑤 ∈ ℝ𝑛 },

where this number represents the smallest possible value of the loss function and 𝑤 is the parameter of our model.
However, there is a significant issue with the 𝑤 ∈ ℝ𝑛 part. First, our parameter space is high dimensional. In practice, 𝑛
can be in the millions. Besides that, we are looking at an unbounded parameter space, where such an optimum might not
even exist. Finally, is there even a parameter 𝑤 where the infimum is attained? After all, this is what we are primarily
interested in.
We can restrict the parameter space to a closed and bounded set to fix these issues. These sets are so prevalent that they
have their own name: compact sets.

Deﬁnition 19.3.1 (Compact sets.)

The set 𝐴 ⊆ ℝ is compact if it is bounded and closed.

We love compact sets. Even though their deﬁnition seems straightforward, these two properties have profound conse-
quences regarding optimization. At this point, we are not ready to talk about this in detail, but we can ﬁnd minima or
maxima in practice because continuous functions behave nicely on compact sets.
There is a key result about compact sets that will constantly resurface during our studies of functions: the Bolzano-
Weierstrass theorem.

Theorem 19.3.1 (Bolzano-Weierstrass.)

In a compact set, every sequence has a convergent subsequence.

Proof. Let 𝐴 ⊆ ℝ be a compact set, and {𝑎𝑛 }∞

𝑛=1 be an arbitrary sequence in 𝐴.

Because 𝐴 is compact, there exists an interval 𝐼1 ∶= [𝑚, 𝑀 ] that contains 𝐴 in its entirety. By cutting this interval in
half, we obtain [𝑚, (𝑚 + 𝑀 )/2] and [(𝑚 + 𝑀 )/2, 𝑀 ]. At least one of these will contain inﬁnitely many points from
{𝑎𝑛 }, let that be 𝐼2 . Repeating this process will yield a sequence of closed intervals 𝐼1 ⊇ 𝐼2 ⊇ 𝐼3 …. The length of 𝐼𝑘
is (𝑀 − 𝑚)/2𝑘 , so eventually these will get really small.
Due to the construction of these intervals, we can also deﬁne a subsequence {𝑎𝑛𝑘 }∞
𝑘=1 by selecting 𝑎𝑛𝑘 such that 𝑎𝑛𝑘 ∈ 𝐼𝑘 .

According to Cantor's axiom, ∩∞ ∞

𝑘=1 𝐼𝑘 is nonempty, so let 𝑎 ∈ ∩𝑘=1 𝐼𝑘 . Because both 𝑎𝑛𝑘 and 𝑎 are elements of 𝐼𝑘 , we
have
𝑀 −𝑚
|𝑎 − 𝑎𝑛𝑘 | ≤ , 𝑘 ∈ ℕ.
2𝑘
This shows that lim𝑘→∞ 𝑎𝑛𝑘 = 𝑎, which is what we had to show. □

The technique we used here is called lion catching. How does a mathematician catch a lion in the desert? By cutting it in
half. The lion will be located at one half or the other. This section can be cut in half repeatedly until the area becomes
small. Thus, the lion will be trapped there eventually.

20.3. Compact sets 231

Mathematics of Machine Learning

20.4 Problems

Problem 1. Let 𝐴 ⊆ ℝ be an arbitrary set. Show that there exists a sequence {𝑎𝑛 }∞ 𝑛=1 ⊆ 𝐴 such that lim𝑛→∞ 𝑎𝑛 →
sup 𝐴. (An identical statement is true for inf 𝐴 as well that can be shown in the same way.)

232 Chapter 20. The topology of real numbers

CHAPTER

TWENTYONE

LIMITS AND CONTINUITY

If I ask you to conjure up a random function from your mind, I am almost sure that you will show one that is both
continuous and differentiable. (Unless you have a weird taste as most mathematicians do.)
However, the vast majority of functions are neither. In terms of cardinality, if you count all real functions 𝑓 ∶ ℝ ↦ ℝ,
it turns out that there are 2𝑐 of them in total, but the subset of continuous ones have cardinality 𝑐. It is hard to imagine
such quantities: 𝑐 and 2𝑐 are both infinite, but, well, 2𝑐 is more infinite. Yeah, I know. Set theory is weird.
Overall, as we shall see, continuity and differentiability allow us to do meaningful work with functions. For instance,
the usual gradient descent-based optimization for neural networks doesn’t work if the loss function and the layers are not
differentiable. That alone would throw a huge monkey wrench into the cogs of machine learning since this is used all the
time in the deep learning part of the field.
This chapter explores how these concepts work together and ultimately enable us to train neural networks.

21.1 Limits of functions

Recall that in the section about sequences, we deﬁned limits of convergent sequences. Intuitively, limits capture the notion
that eventually, all elements get as close to the limit as we wish. This concept can be extended to functions as well.

Deﬁnition 20.1.1 (Limits of functions.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function. We say that

lim 𝑓(𝑥) = 𝑎
𝑥→𝑥0

if for every sequence 𝑥𝑛 → 𝑥0 , where 𝑥𝑛 does not equal to 𝑥0 for all 𝑛,

lim 𝑓(𝑥𝑛 ) = 𝑎
𝑛→∞

holds.

Right oﬀ the bat, there are two essential things to note.

1. The limit of a function is defined in terms of limits of sequences. If all possible sequences of the form {𝑓(𝑥𝑛 )}
with 𝑥𝑛 → 𝑥0 have the same limit, then lim𝑥→𝑥0 𝑓(𝑥) is defined as the common limit.
2. With ∞-divergent sequences, limits at ±∞ are defined.
Here is a figure that helps visualize the process.
To further illustrate the concept of limits for functions, let’s see some examples.

233
Mathematics of Machine Learning

Fig. 21.1: Illustration of limits.

Example 1. Deﬁne the following function.

1 if 𝑥 = 0,
𝑓(𝑥) = {
0 otherwise.

In other words, 𝑓(𝑥) is zero everywhere except at 0, where it is 1.

Does lim𝑥→0 𝑓(𝑥) exist? Yes. Because for any sequence 𝑥𝑛 → 0 that is not 0, the limit lim𝑛→∞ 𝑓(𝑥𝑛 ) = 0. On the
other hand, note that lim𝑥→0 𝑓(𝑥) ≠ 𝑓(0).
Example 2. For our second example, deﬁne

1 if 𝑥 ∈ ℚ,
𝐷(𝑥) = { (21.1)
0 otherwise.

This is the (in)famous Dirichlet function, which is hard to imagine and impossible to plot: its value is 1 at rationals and
0 at irrationals. Not surprisingly, lim𝑥→𝑥0 𝐷(𝑥) does not exist for all 𝑥0 , because rational and irrational numbers are
“dense”: every number 𝑥0 can be obtained as a limit of rationals and as a limit of irrationals.

21.1.1 Properties of limits

Since limits of functions are deﬁned as the common limit of sequences, many of its properties are inherited from se-
quences. How sequences behave under operations determines how function limits behave.

Theorem 20.1.1 (Operations and limits.)

Let 𝑓 and 𝑔 be two functions.
(a)

lim 𝑓(𝑥) + 𝑔(𝑥) = lim 𝑓(𝑥) + lim 𝑔(𝑥),

𝑥→𝑥0 𝑥→𝑥0 𝑥→𝑥0

234 Chapter 21. Limits and continuity

Mathematics of Machine Learning

Fig. 21.2: Plot of 𝑓(𝑥).

(b)
lim 𝑐𝑓(𝑥) = 𝑐 lim 𝑓(𝑥) for all 𝑐 ∈ ℝ,
𝑥→𝑥0 𝑥→𝑥0

(d) if 𝑓(𝑥) ≠ 0 in some small interval (𝑥0 − 𝜀, 𝑥0 + 𝜀) around 𝑥0 and lim𝑥→𝑥0 𝑓(𝑥) ≠ 0, then
1 1
lim = .
𝑥→𝑥0 𝑓(𝑥) lim𝑥→𝑥0 𝑓(𝑥)

Proof. This follows directly from the properties of sequence limits. □

Similarly as we have seen for convergent sequences, (a) and (b) above is referred to as the linearity of limits.

21.1.2 Left and right limits

For technical reasons, we often want to be a bit more strict regarding the sequences 𝑥𝑛 when deﬁning limits. There are
two particular cases: when we restrict the sequences to be strictly smaller or larger than the target. This is formalized by
the deﬁnition of left and right limits.

Deﬁnition 20.1.2 (Left and right limits.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function. We say that
lim 𝑓(𝑥) = 𝑎
𝑥→𝑥0 +

21.1. Limits of functions 235

Mathematics of Machine Learning

if for every sequence 𝑥𝑛 → 𝑥0 , where 𝑥𝑛 > 𝑥0 for all 𝑛,

lim 𝑓(𝑥𝑛 ) = 𝑎
𝑛→∞

holds. This is called the right limit of 𝑓 at 𝑥0 . Similarly, the left limit lim𝑥→𝑥0 − 𝑓(𝑥) can be deﬁned by restricting the
sequences {𝑥𝑛 } to be 𝑥𝑛 < 𝑥0 for all 𝑛.

This concept will be essential for us in probability theory.

21.1.3 The big and small O notation revisited

Remember how the big and small O notation expressed asymptotic properties of sequences? We’ll have a similar tool for
functions as well.

Deﬁnition 20.1.3 (Big and small O notation for functions.)

Let 𝑓 ∶ ℝ → ℝ and 𝑔 ∶ ℝ → ℝ be two arbitrary functions. We say that
(a) 𝑔(𝑥) = 𝑂(𝑓(𝑥)) as 𝑥 → 𝑎 if there is a constant 𝐶 such that for some 𝛿 > 0, we have |𝑔(𝑥)| ≤ 𝐶𝑓(𝑥) for all
𝑥 ∈ (𝑎 − 𝛿, 𝑎) ∪ (𝑎, 𝑎 + 𝛿). Similarly, 𝑔(𝑥) = 𝑂(𝑓(𝑥)) as 𝑥 → ∞ if there is a constant 𝐶 and a cutoﬀ number 𝑁 such
that |𝑔(𝑥)| ≤ 𝐶𝑓(𝑥) holds for all 𝑥 > 𝑁 .
(b) 𝑔(𝑥) = 𝑜(𝑓(𝑥)) as 𝑥 → 𝑎 if for any 𝜀 > 0, there is a 𝛿 > 0 such that we have |𝑔(𝑥)| ≤ 𝜀𝑓(𝑥) for all 𝑥 ∈
(𝑎 − 𝛿, 𝑎) ∪ (𝑎, 𝑎 + 𝛿). Similarly, 𝑔(𝑥) = 𝑜(𝑓(𝑥)) as 𝑥 → ∞ if for any 𝜀 > 0, there is a cutoﬀ number 𝑁 such that
|𝑔(𝑥)| ≤ 𝜀𝑓(𝑥) for all 𝑥 > 𝑁 .

21.1.4 Equivalent deﬁnitions of limits

If you have a sharp eye (and some experience in mathematics), you might have already posed the question: won’t showing
convergence of {𝑓(𝑥𝑛 )} for all sequences 𝑥𝑛 → 𝑥0 be diﬃcult?
Indeed, it is often not the most convenient way to think about function limits. Another equivalent deﬁnition expresses
limits in terms of smaller and smaller neighborhoods around the point in question.

Theorem 20.1.2 (Limits as error terms.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function and 𝑥0 ∈ ℝ. Then the following are equivalent.
(a) lim𝑥→𝑥0 𝑓(𝑥) = 𝑎.
(b) For every 𝜀 > 0, there exists a small neighborhood (𝑥0 − 𝛿, 𝑥0 + 𝛿) around 𝑥0 such that for every 𝑥 ∈ (𝑥0 − 𝛿, 𝑥0 ) ∪
(𝑥0 , 𝑥0 + 𝛿),

|𝑓(𝑥) − 𝑎| < 𝜀

holds.

Proof. (a) ⟹ (b). We are going to do this indirectly, so we assume that (a) holds and (b) is not true. The negation
of (b) states that there is a 𝜀 > 0 such that for every 𝛿 > 0, there is an 𝑥 ∈ (𝑥0 − 𝛿, 𝑥0 ) ∪ (𝑥0 , 𝑥0 + 𝛿) such that
|𝑓(𝑥) − 𝑎| > 𝜀. (If you don’t see why this is the negation, check out the introductory section about logic.)

236 Chapter 21. Limits and continuity

Mathematics of Machine Learning

Now we define a sequence that will contradict (a). If we select 𝛿 = 1/𝑛, we can let 𝑥𝑛 be the one in (𝑥0 − 𝛿, 𝑥0 ) ∪
(𝑥0 , 𝑥0 +𝛿) such that |𝑓(𝑥𝑛 )−𝑎| > 𝜀, as guaranteed by our assumption that (b) is false. Due to its construction, {𝑥𝑛 }∞
𝑛=1
does not converge to 𝑎. This contradicts (a), which completes our indirect proof.
(b) ⟹ (a). Let {𝑥𝑛 }∞ 𝑛=1 be an arbitrary sequence that converges to 𝑥0 . If 𝑛 is large enough (that is, larger than some
cutoff index 𝑁 ), 𝑥𝑛 will fall into (𝑥0 − 𝛿, 𝑥0 + 𝛿). Since (b) says that |𝑓(𝑥𝑛 ) − 𝑎| < 𝜀 here for all such 𝑛, we have
lim𝑛→∞ 𝑓(𝑥𝑛 ) = 𝑎 by the definition of convergence. □

In plain English, this theorem says that 𝑓(𝑥) gets arbitrarily close to lim𝑥→𝑥0 𝑓(𝑥) if 𝑥 is close enough to 𝑥0 . Definitions
similar to (b) are called epsilon-delta definitions.
There is yet another equivalent definition which, although might seem trivial, is a useful mental model when thinking
about limits.

Theorem 20.1.3
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function and 𝑥0 ∈ ℝ. Then the following are equivalent.
(a) lim𝑥→𝑥0 𝑓(𝑥) = 𝑎.
(b) There exists a function error(𝑥) such that lim𝑥→𝑥0 error(𝑥) = 0 and

𝑓(𝑥) = 𝑎 + error(𝑥)

holds.

Proof. (a) ⟹ (b). Due to how limits behave with respect to operations, it is easy to see that

error(𝑥) ∶= 𝑓(𝑥) − 𝑎

satisﬁes the requirements.

(b) ⟹ (a). Again, this is trivial because of the linearity of the limit operation:

lim 𝑓(𝑥) = lim (𝑎 + error(𝑥)) = 𝑎.

𝑥→𝑥0 𝑥→𝑥0

This is what we had to show. □

21.1.5 The transfer principle

Often, we don’t need to know the exact limits of a function; it is enough to know that the limit is above or below a specific
bound.
To give a specific example, we will look slightly ahead and talk about differentiation. I’ll explain everything in the next
section in detail, but the derivative of a function 𝑓 at the point 𝑥0 is defined as the limit

𝑓(𝑥) − 𝑓(𝑥0 )
lim .
𝑥→𝑥0 𝑥 − 𝑥0
If the function is increasing, we have
𝑓(𝑥) − 𝑓(𝑥0 )
≥ 0,
𝑥 − 𝑥0
which, given the things we are about to see, implies that the derivative is positive.

21.1. Limits of functions 237

Mathematics of Machine Learning

Without any further ado, let’s see the result!

Theorem 20.1.4
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function. If 𝑓(𝑥) ≥ 𝛼 for all 𝑥 ∈ (𝑎 − 𝛿, 𝑎) ∪ (𝑎, 𝑎 + 𝛿) and some 𝛼 ∈ ℝ lower bound,
then lim𝑥→𝑎 𝑓(𝑥) ≥ 𝛼 if the limit exists.

Proof. Due to the deﬁnition of function limits, this is the immediate consequence of the transfer principle for convergent
sequences. □

21.1.6 Famous limits

There are a few limit relations that come up all the time in calculations. These are the building blocks for calculating more
complicated limits, as they can often be reduced to a form for which the limit is known.
We won’t include the proofs here, as they are not that useful for our purposes. (Which is understanding how machine
learning algorithms work.)

Theorem 20.1.5
(a)
sin 𝑥
lim = 1. (21.2)
𝑥→0 𝑥
(b)

lim 𝑥 log 𝑥 = 0. (21.3)

𝑥→0

(c)

lim 𝑥𝑘 𝑒−𝑎𝑥 = 0, 𝑎 ∈ (0, ∞), 𝑘 = 0, 1, 2, … (21.4)

𝑥→∞

21.2 Continuity

With the extension of limits from sequences to functions, we saw that if the limit exists, it is not necessarily equal to the
function’s value at the given point. However, when it does, the function is much easier to handle. This is called continuity.

Deﬁnition 20.2.1 (Continuity.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function. We say that 𝑓 is continuous at 𝑎 if

lim 𝑓(𝑥) = 𝑓(𝑎)

𝑥→𝑎

holds.

In other words, continuity means that if 𝑥 is close to 𝑦, then 𝑓(𝑥) will also be close to 𝑓(𝑦). This is how most of our
mental models work. This is also what we want from many machine learning models. For example, if 𝑓 is a model that

238 Chapter 21. Limits and continuity

Mathematics of Machine Learning

takes images and decides if they feature a cat or not, we would expect that after changing a few pixels on 𝑥, the prediction
would stay the same. (However, this is deﬁnitely not the case in general, which is exploited by certain adversarial attacks.)
We can rephrase the above deﬁnition a bit by unpacking the limits. If you think it through a bit, it is easy to see that
continuity of 𝑓 at 𝑎 is equivalent to

lim 𝑓(𝑎𝑛 ) = 𝑓( lim 𝑎𝑛 )

𝑛→∞ 𝑛→∞

for all convergent sequences 𝑎𝑛 → 𝑎. We are going to use this very frequently.
As usual, we’ll see some examples ﬁrst. We’ll revisit the ones we saw when discussing limits.
Example 1. Let’s revisit

1 if 𝑥 = 0,
𝑓(𝑥) = {
0 otherwise.

While 𝑓(𝑥) is not continuous at 0 since lim𝑥→0 𝑓(𝑥) = 0 ≠ 𝑓(0) as we have seen before, 𝑓(𝑥) is continuous everywhere
else. (Since it is constant 0.)
Note that even though the function is not continuous at 0, the limit does exist!
Example 2. What about the Dirichlet function 𝐷(𝑥)? (See (21.1) for the deﬁnition.) Since the limits doesn’t even exist,
this is a nowhere continuous function.
Example 3. Deﬁne

𝑥 if 𝑥 ∈ ℚ,
𝑓(𝑥) = {
−𝑥 otherwise.

Surprisingly, 𝑓(𝑥) is continuous at 0, but nowhere else. As you can see, (almost) nothing is off the table with continuity.
Functions, in general, can be wild objects, and without certain regularity conditions, optimizing them is extremely hard.
In essence, this chapter aims to understand when and how we can optimize functions that we used to do when training a
machine learning model.
One final example!
Example 4. We call a function an elementary function, if it can be obtained by taking a finite sum, product, and combi-
nation of
• constant functions,
• power functions 𝑥, 𝑥2 , 𝑥3 , …,
• n-th root functions 𝑥1/2 , 𝑥1/3 , 𝑥1/4 , …,
• exponential functions 𝑎𝑥 ,
• logarithms log𝑎 𝑥,
• trigonometric and inverse trigonometric functions sin 𝑥, cos 𝑥, arcsin 𝑥, arccos 𝑥,
−1 −1
• hyperbolic and inverse hyperbolic functions sinh 𝑥, cosh 𝑥, sinh 𝑥, cosh 𝑥.
For instance,

1 − 3𝑥 + 5𝑥4
𝑓(𝑥) = sin(𝑥2 + 𝑒𝑥 ) − √
2𝑥 − 𝑥

is an elementary function. Elementary functions are continuous wherever they are deﬁned. This is going to be extremely
useful for us since showing the continuity of a complicated function like 𝑓(𝑥) is hard with the deﬁnition alone. This way,
if it is elementary, we know it is continuous. This will also be true for multivariate functions. (Like a neural network.)

21.2. Continuity 239

Mathematics of Machine Learning

21.3 Properties of continuous functions

A typical pattern in mathematics, as you have seen when discussing the properties of convergent sequences, is to study
certain properties on basic building blocks ﬁrst, then show how it behaves with respect to operations.
As the previous example regarding the continuity of elementary functions illustrates, we are going to follow a similar
pattern here.

Theorem 20.3.1
Let 𝑓 and 𝑔 be two functions.
(a) If 𝑓 and 𝑔 are continuous at 𝑎, then 𝑓 + 𝑔 and 𝑓𝑔 are also continuous at 𝑎.
(b) If 𝑓 and 𝑔 are continuous at 𝑎 and 𝑔(𝑎) ≠ 𝑎, then 𝑓/𝑔 is also continuous at 𝑎.
(c) If 𝑔 is continuous at 𝑎 and 𝑓 is continuous at 𝑔(𝑎), then 𝑓 ∘ 𝑔 is also continuous at 𝑎.

Proof. (a) and (b) follows directly from the properties of limits.
To see (c), we simply let {𝑎𝑛 }∞
𝑛=1 be a sequence that converges to 𝑎. Then, using that 𝑓 is continuous at 𝑔(𝑎) and 𝑔 is
continuous at 𝑎, we have

lim 𝑓(𝑔(𝑎𝑛 )) = 𝑓( lim 𝑔(𝑎𝑛 )) = 𝑓(𝑔( lim 𝑎𝑛 )) = 𝑓(𝑔(𝑎)),

𝑛→∞ 𝑛→∞ 𝑛→∞

which shows the continuity of 𝑓 ∘ 𝑔 at 𝑎. □

21.4 Continuity on compact sets

So far, we have only deﬁned continuity at a single point. In general, a function 𝑓 ∶ ℝ → ℝ is continuous on the set 𝐴
simply if it is continuous at its every point.
We have arrived at the point that partly explains why we love continuous functions and compact sets. The reason is simple:
functions that are continuous on compact sets are bounded there and attain their optima.

Theorem 20.4.1
Let 𝑓 be continuous on a compact set 𝐾. There exists 𝛼, 𝛽 ∈ 𝐾 such that 𝑓(𝛼) ≤ 𝑓(𝑥) ≤ 𝑓(𝛽) holds for all 𝑥 ∈ 𝐾.

Proof. Let {𝛼𝑛 }∞

𝑛=1 ⊆ 𝐾 be a sequence such that 𝑓(𝛼𝑛 ) → inf{𝑓(𝑥) ∶ 𝑥 ∈ 𝐾}. (This is guaranteed to exist, as follows
from the properties of inﬁmum and supremum.)
Now, according to the Bolzano-Weierstrass theorem, {𝛼𝑛 }∞ ∞
𝑛=1 has a convergent subsequence {𝛼𝑛𝑘 }𝑘=1 with
lim𝑘→∞ 𝛼𝑛𝑘 = 𝛼. Since 𝐾 is compact, 𝛼 ∈ 𝐾. (Recall that a compact set is closed, and closed sets contain the
limits of their convergent sequences.)
Because 𝑓 is continuous, we have

𝑓(𝛼) = 𝑓( lim 𝛼𝑛𝑘 ) = lim 𝑓(𝛼𝑛𝑘 ) = inf{𝑓(𝑥) ∶ 𝑥 ∈ 𝐾},

𝑘→∞ 𝑘→∞

240 Chapter 21. Limits and continuity

Mathematics of Machine Learning

which is what we had to show. An identical argument shows the existence of a 𝛽 ∈ 𝐾 such that 𝑓(𝑥) ≤ 𝑓(𝛽) for all
𝑥 ∈ 𝐾. □

1
This statement is not true for sets that are not closed and bounded. For example, 𝑓(𝑥) = 𝑥 is continuous on (0, 1], but
has no upper bound.
We conclude our study of continuity with this theorem.

21.5 What’s next?

Now that we are familiar with function limits and continuous functions, we are ready to tackle the first directly relevant
subject for machine learning: differentiation. We should take a look at how to analyze functions and what makes a function
“behave nicely”.
If you think through what machine learning is really about, you’ll find that it is quite straightforward from a bird’s view.
In essence, all we want to do is
1. design parametrized functions to explain the relations between data and observations,
2. find the parameters that give the best fit to our data.
To find models that are expressive enough yet it is easy to work with them, we need to restrict ourselves to functions that
satisfy certain properties. The two most important of those are continuity and differentiability. Now that we have seen
what continuity is, we can move on to study differentiable functions.
In the following few chapters, we will exclusively deal with univariate real functions. This is just to introduce concepts
without adding many layers of complexity at once. In later chapters, we are going to turn towards multivariate functions
slowly, and by the time we get to machine learning, we will master their use.

21.5. What’s next? 241

Mathematics of Machine Learning

242 Chapter 21. Limits and continuity

CHAPTER

TWENTYTWO

DIFFERENTIATION IN THEORY

I turn with terror and horror from this lamentable scourge of continuous functions with no derivatives. —
Charles Hermite
In the history of science, a few milestones are as significant as inventing the wheel. Even among these, differentiation is
a highlight. With its invention, Newton essentially created mechanics as we know it. Differentiation enables space travel,
function optimization, or even epidemiological models.

22.1 The deﬁnition of the derivative

Instead of jumping straight into the mathematical deﬁnition, let’s start our discussion with a straightforward example: a
point-like object moving along a straight line. Its movement is fully described by the time-distance plot, which shows its
distance from the starting point at a given time.
Our goal is to calculate the object’s speed at a given time. In high school, we learned that

distance
average speed = .
time
To put this into a quantitative form, if 𝑓(𝑥) denotes the time-distance function, and 𝑡1 < 𝑡2 are two arbitrary points in
time then,
𝑓(𝑡2 ) − 𝑓(𝑡1 )
average speed between 𝑡1 and 𝑡2 = .
𝑡2 − 𝑡 1
𝑓(𝑡2 )−𝑓(𝑡1 )
Expressions like 𝑡2 −𝑡1 are called diﬀerential quotients. Note that if the object moves backwards, the average speed
is negative.
The average speed has a simple geometric interpretation. If you replace the object’s motion with a constant velocity
motion moving at the average speed, you’ll end up at exactly the same place. In graphical terms, this is equivalent of
connecting (𝑡1 , 𝑓(𝑡1 )) and (𝑡2 , 𝑓(𝑡2 )) with a single line. The average speed is just the slope of this line.
Given this, we can calculate the exact speed at a single time point 𝑡, which we’ll denote with 𝑣(𝑡). The idea is simple: the
average speed in the small time-interval between 𝑡 and 𝑡 + Δ𝑡 should get closer and closer to 𝑣(𝑡) if Δ𝑡 is small enough.
(Δ𝑡 can be negative as well.)
So,
𝑓(𝑡 + Δ𝑡) − 𝑓(𝑡)
𝑣(𝑡) = lim ,
∆𝑡→0 Δ𝑡
if the above limit exists.
Following our geometric intuition, 𝑣(𝑡) is simply the slope of the tangent line of 𝑓 at 𝑡. Keeping this in mind, we are
ready to introduce the formal deﬁnition.

243
Mathematics of Machine Learning

Fig. 22.1: Time-distance plot of our moving object.

244 Chapter 22. Diﬀerentiation in theory

Mathematics of Machine Learning

Fig. 22.2: Average speed between 𝑡1 and 𝑡2 .

Deﬁnition 21.1.1 (Diﬀerentiability.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function. 𝑓 is diﬀerentiable at 𝑥0 ∈ ℝ if the limit

𝑑𝑓 𝑓(𝑥0 ) − 𝑓(𝑥)
(𝑥 ) = lim
𝑑𝑥 0 𝑥→𝑥0 𝑥0 − 𝑥

exists. If so, this is called the derivative of 𝑓 at 𝑥0 .

In other words, if 𝑓 describes a time-distance function of a moving object, then the derivative is simply its speed.
Don’t let the change in notation from 𝑡 and 𝑡 + Δ𝑡 to 𝑥0 and 𝑥 confuse you, this means exactly the same as before. Similar
to continuity, differentiability is a local property. However, we’ll be more interested in functions that are differentiable
(almost) everywhere. In those cases, the derivative is a function, often denoted with 𝑓 ′ (𝑥).
Sometimes it is confusing that 𝑥 can denote the variable of 𝑓 and the exact point where the derivative is taken. Here is a
quick glossary of terms to clarify the difference between derivative and derivative function.
𝑑𝑓
• 𝑑𝑥 (𝑥0 ): derivative of 𝑓 with respect to the variable 𝑥 at the point 𝑥0 . This is a scalar, also denoted with 𝑓 ′ (𝑥0 ).
𝑑𝑓
• 𝑑𝑥 : derivative function of 𝑓 with respect to the variable 𝑥. This is a function, also denoted with 𝑓 ′ .
Let’s see some examples!
Example 1. 𝑓(𝑥) = 𝑥. For any 𝑥, we have

𝑓(𝑥) − 𝑓(𝑦) 𝑥−𝑦

lim = lim = 1.
𝑦→𝑥 𝑥−𝑦 𝑦→𝑥 𝑥−𝑦

22.1. The deﬁnition of the derivative 245

Mathematics of Machine Learning

Fig. 22.3: Approximating the velocity at 𝑡.

246 Chapter 22. Diﬀerentiation in theory

Mathematics of Machine Learning

Thus, 𝑓(𝑥) = 𝑥 is diﬀerentiable everywhere and its derivative is the constant function 𝑓 ′ (𝑥) = 1.
Example 2. 𝑓(𝑥) = 𝑥2 . Here, we have

𝑓(𝑥) − 𝑓(𝑦) 𝑥2 − 𝑦2
lim = lim
𝑦→𝑥 𝑥−𝑦 𝑦→𝑥 𝑥 − 𝑦

(𝑥 − 𝑦)(𝑥 + 𝑦)
= lim
𝑦→𝑥 𝑥−𝑦
= lim 𝑥 + 𝑦
𝑦→𝑥

= 2𝑥.

So, 𝑓(𝑥) = 𝑥2 is diﬀerentiable everywhere and 𝑓 ′ (𝑥) = 2𝑥. Later, when talking about elementary functions, we’ll see
the general case 𝑓(𝑥) = 𝑥𝑘 .
Example 3. 𝑓(𝑥) = |𝑥| at 𝑥 = 0. For this, we have

𝑓(0) − 𝑓(𝑦) |𝑦|

lim = lim .
𝑦→0 0−𝑦 𝑦→0 𝑦
Since

|𝑦| 1 if 𝑦 > 0,
={
𝑦 −1 if 𝑦 < 0,

this limit does not exist. This is our first example of a non-differentiable function. However, |𝑥| is differentiable everywhere
else.
It is worth drawing a picture here to enhance our understanding of differentiability. Recall that the value of the derivative
at a given point equals the slope of the tangent line to the function’s graph. Since |𝑥| has a sharp corner at 0, the tangent
line is not well-defined.
Differentiability means no sharp corners in the graph, so differentiable functions are often called smooth. This is one
reason we prefer differentiable functions: the rate of change is tractable.
Next, we’ll see an equivalent definition of differentiability, involving local approximation with a linear function. From this
perspective, differentiability means manageable behavior: no wrinkles, corners, sharp changes in value.

22.2 Equivalent forms of diﬀerentiation

To really understand derivatives and differentiation, we are going to take a look at it from another point of view: local
linear approximations.
Approximation is a very natural idea in mathematics. Have you ever thought about what happens when you punch
sin(2.18) into a calculator? We cannot express the function sin with finitely many additions and multiplications, so
we have to approximate it. In practice, functions of the form 𝑝(𝑥) = 𝑝0 + 𝑝1 𝑥 + … 𝑝𝑛 𝑥𝑛 , called polynomials, can be
evaluated easily. They are just a finite combination of additions and multiplications.
Can we just replace functions with polynomials to make computations easier? (Even at the cost of perfect precision.)
It turns out that we can. We will not go into details here, but every continuous function can be approximated by a
polynomial with arbitrary precision on a compact set. Actually,
𝑁
𝑥2𝑛+1
sin 𝑥 ≈ ∑(−1)𝑛 ,
𝑛=0
(2𝑛 + 1)!

so in practice, this polynomial is evaluated.

22.2. Equivalent forms of diﬀerentiation 247

Mathematics of Machine Learning

Fig. 22.4: Tangent planes of 𝑓(𝑥) = |𝑥| at 0.

248 Chapter 22. Diﬀerentiation in theory

Mathematics of Machine Learning

In essence, diﬀerentiation is just a local approximation with a linear function. The following theorem makes this clear.

Theorem 21.2.1 (Diﬀerentiation as a local linear approximation.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function. The following are equivalent.
(a) 𝑓 is diﬀerentiable at 𝑥0 .
(b) there is an 𝛼 such that

𝑓(𝑥) = 𝑓(𝑥0 ) + 𝛼(𝑥 − 𝑥0 ) + 𝑜(|𝑥 − 𝑥0 |) as 𝑥 → 𝑥0 . (22.1)

Recall that the small O notation means that the function is an order of magnitude smaller around 𝑥0 than the function
|𝑥 − 𝑥0 |.
If exists, the 𝛼 in the above theorem is the derivative 𝑓 ′ (𝑥0 ). In other words, 𝑓(𝑥) is can be locally written as

𝑓(𝑥0 ) + 𝑓 ′ (𝑥0 )(𝑥 − 𝑥0 ) + 𝑜(|𝑥 − 𝑥0 |). (22.2)

Proof. To show the equivalence of two statements, we have to prove that diﬀerentiation implies the desired property and
vice versa. Although this might seem complicated, it is straightforward and entirely depends on how functions can be
written as their limit plus an error term.
(a) ⟹ (b). The existence of the limit
𝑓(𝑥) − 𝑓(𝑥0 )
lim = 𝑓 ′ (𝑥0 )
𝑥→𝑥0 𝑥 − 𝑥0

implies that we can write the slope of the approximating tangent in the form

𝑓(𝑥) − 𝑓(𝑥0 )
= 𝑓 ′ (𝑥0 ) + error(𝑥),
𝑥 − 𝑥0

where lim𝑥→𝑥0 error(𝑥) = 0. (Recall that one equivalent form of limits states exactly this.)
With some simple algebra, we obtain

𝑓(𝑥) = 𝑓(𝑥0 ) + 𝑓 ′ (𝑥0 )(𝑥 − 𝑥0 ) + error(𝑥)(𝑥 − 𝑥0 ).

Since the error term tends to zero as 𝑥 goes to 𝑥0 , error(𝑥)(𝑥 − 𝑥0 ) = 𝑜(|𝑥 − 𝑥0 |), which is what we wanted to show.
(b) ⟹ (a). Now, repeat what we did in the previous part, just in reverse order. We can rewrite

𝑓(𝑥) = 𝑓(𝑥0 ) + 𝛼(𝑥 − 𝑥0 ) + 𝑜(|𝑥 − 𝑥0 |)

in the form
𝑓(𝑥) − 𝑓(𝑥0 )
= 𝛼 + 𝑜(1),
𝑥 − 𝑥0

which, according to what we have used before, implies that

𝑓(𝑥) − 𝑓(𝑥0 )
lim = 𝛼.
𝑥→𝑥0 𝑥 − 𝑥0

So, 𝑓 is diﬀerentiable at 𝑥0 and its derivative is 𝑓 ′ (𝑥0 ) = 𝛼. □

22.2. Equivalent forms of diﬀerentiation 249

Mathematics of Machine Learning

One huge advantage of this form is that it will be easily generalized to multivariate functions. Even though we are far
from it, we can get a glimpse. Multivariate functions map vectors to scalars, so the ratio
𝑓(𝑥) − 𝑓(𝑥0 )
, 𝑥, 𝑥0 ∈ ℝ𝑛
𝑥 − 𝑥0
is not even deﬁned. (Since we can’t divide with a vector.) However, the expression
𝑓(𝑥) = 𝑓(𝑥0 ) + ∇𝑓(𝑥0 )𝑇 (𝑥 − 𝑥0 ) + 𝑜(|𝑥 − 𝑥0 |)
makes perfect sense, since ∇𝑓(𝑥0 )𝑇 (𝑥−𝑥0 ) is a scalar. Here, ∇𝑓(𝑥0 ) denotes the gradient of 𝑓, that is, the multivariable
version of derivatives. ∇𝑓(𝑥0 ) is an n-dimensional vector. Don’t worry if you are not familiar with this notation, we’ll
cover everything in due time. The take-home message is that this alternative deﬁnition will be more convenient for us in
the future.

22.3 Diﬀerentiation and continuity

As the following theorem states, diﬀerentiation is a more strict condition than continuity.

Theorem 21.3.1 (Diﬀerentiable functions are continuous.)

If 𝑓 ∶ ℝ → ℝ is diﬀerentiable at 𝑎, it is also continuous there.

Proof. With the general form (21.1), we have

lim 𝑓(𝑥) = lim (𝑓(𝑥0 ) + 𝑓 ′ (𝑥0 )(𝑥 − 𝑥0 ) + 𝑜(|𝑥 − 𝑥0 |)) = 𝑓(𝑥0 ),

𝑥→𝑥0 𝑥→𝑥0

which shows the continuity of 𝑓 at 𝑥0 . □

Note that the previous theorem is not true the other way around: a function can be continuous, but not differentiable. (As
the example 𝑓(𝑥) = |𝑥| at 𝑥 = 0 shows.)
This can be taken to the extremes: there are functions that are everywhere continuous but nowhere differentiable. One of
the first examples was provided by Weierstrass (from the Bolzano-Weierstrass theorem). The function itself is defined by
the infinite sum
∞
𝑊 (𝑥) = ∑ 𝑎𝑛 cos(𝑏𝑛 𝜋𝑥),
𝑛=0

where 𝑎 ∈ (0, 1), 𝑏 is a positive odd integer, and 𝑎𝑏 > 1 + 3𝜋/2.

I agree, this definition feels totally random, and you are probably wondering: how did the author come up with it? To get
a grip on this function, imagine this as the superposition of cosine waves with smaller and smaller amplitude but higher
and higher frequency. Remember that differentiation implies “no sharp corners”? This definition puts a sharp corner at
every point on the real line.
Its graph is a fractal curve with self-similarity, as illustrated below.
Examples such as this inspired the opening quote of the section:
I turn with terror and horror from this lamentable scourge of continuous functions with no derivatives. —
Charles Hermite
19th-century mathematicians certainly did not think much about nondifferentiable functions. However, there are much
more of them than differentiable ones. We won’t go into the details, but amongst all continuous functions, the set of ones
that are differentiable at least one point is meagre. Meagre is a proper technical term for sets, and although we don’t need
to know what it means exactly, its name implies that it is extremely small.

250 Chapter 22. Diﬀerentiation in theory

Mathematics of Machine Learning

Fig. 22.5: Graph of the Weierstrass function. Source: Wikipedia

22.4 Higher derivatives

One last thing to do before we move on is to talk about higher-order derivatives. Because derivatives are functions, it is a
completely natural idea to calculate the derivative of derivatives. As we will see when studying the basics of optimization
in the next chapter, the second derivatives contain quite a lot of essential information regarding minima and maxima.
The 𝑛-th derivative of 𝑓 is denoted with 𝑓 (𝑛) , where 𝑓 (0) = 𝑓. There are a few rules regarding them that are worth
keeping in mind. Although, we have to note that a derivative function is not always diﬀerentiable, as the example

0 if 𝑥 < 0,
𝑓(𝑥) = {
𝑥2 otherwise.

Now, about those rules regarding higher-order derivatives.

Theorem 21.4.1
Let 𝑓 ∶ ℝ → ℝ and 𝑔 ∶ ℝ → ℝ be two arbitrary functions.
(a) (𝑓 + 𝑔)(𝑛) = 𝑓 (𝑛) + 𝑔(𝑛)
𝑛
(b) (𝑓𝑔)(𝑛) = ∑𝑘=0 (𝑛𝑘)𝑓 (𝑛−𝑘) 𝑔(𝑘)

Proof. (a) trivially follows from the linearity of diﬀerentiation.

Regarding (b), we are going to use proof by induction. For 𝑛 = 1, the statement simply says that (𝑓𝑔)′ = 𝑓 ′ 𝑔 + 𝑓𝑔′ , as
we have seen before.

22.4. Higher derivatives 251

Mathematics of Machine Learning

Now, we assume that it is true for 𝑛 and deduce the 𝑛 + 1 case. For this, we have
𝑛
′ 𝑛 ′
(𝑓𝑔)(𝑛+1) = ((𝑓𝑔)(𝑛) ) = ∑ ( )(𝑓 (𝑛−𝑘) 𝑔(𝑘) )
𝑘=0
𝑘
𝑛
𝑛
= ∑ ( )[𝑓 (𝑛−𝑘+1) 𝑔(𝑘) + 𝑓 (𝑛−𝑘) 𝑔(𝑘+1) ]
𝑘=0
𝑘
𝑛 𝑛
𝑛 𝑛
= ∑ ( )𝑓 (𝑛−𝑘+1) 𝑔(𝑘) + ∑ ( )𝑓 (𝑛−𝑘) 𝑔(𝑘+1)
𝑘=0
𝑘 𝑘=0
𝑘
𝑛 𝑛−1
𝑛 𝑛 𝑛 𝑛
= ( )𝑓 (𝑛+1) 𝑔 + [ ∑ ( )𝑓 (𝑛+1−𝑘) 𝑔(𝑘) ] + [ ∑ ( )𝑓 (𝑛−𝑘) 𝑔(𝑘+1) ] + ( )𝑓𝑔(𝑛+1) .
0 𝑘=1
𝑘 𝑘=0
𝑘 𝑛

First, we note that (𝑛0) = (𝑛+1 𝑛 𝑛+1

0 ) = 1 and (𝑛) = (𝑛+1) = 1. Second, the recursive relation for binomial coeﬃcients says
that
𝑛+1 𝑛 𝑛
( )=( )+( ).
𝑘 𝑘 𝑘−1

With a simple reindexing, we have

𝑛−1 𝑛
𝑛 𝑛
∑ ( )𝑓 (𝑛−𝑘) 𝑔(𝑘+1) = ∑ ( )𝑓 (𝑛+1−𝑘) 𝑔(𝑘) ,
𝑘=0
𝑘 𝑘=1
𝑘 − 1

so we can join the two sums together and obtain

𝑛
𝑛 + 1 (𝑛+1) 𝑛 𝑛 𝑛+1
(𝑓𝑔)(𝑛+1) = ( )𝑓 𝑔 + ∑ [( ) + ( )𝑓 (𝑛+1−𝑘) 𝑔(𝑘) ] + ( )𝑓𝑔(𝑛+1)
0 𝑘=1
𝑘 𝑘 − 1 𝑛 + 1
𝑛+1
𝑛 + 1 (𝑛+1−𝑘) (𝑘)
= ∑( )𝑓 𝑔 ,
𝑘=0
𝑘

which is what we had to show. □

252 Chapter 22. Diﬀerentiation in theory

CHAPTER

TWENTYTHREE

DIFFERENTIATION IN PRACTICE

During our first encounter with differentiation, we seen that computing derivatives by the definition

𝑓(𝑥0 ) − 𝑓(𝑥)
𝑓 ′ (𝑥0 ) = lim
𝑥→𝑥0 𝑥0 − 𝑥

can be really hard in practice if we encounter convoluted functions such as 𝑓(𝑥) = cos(𝑥) sin(𝑒𝑥 ). Similar to convergent
sequences and limits, using the definition of differentiation won’t get us far—the complexity piles on fast. So we have to
find ways to decompose the complexity into its fundamental building blocks.

23.1 Rules of diﬀerentiation

First, we’ll look at the simplest of operations: scalar multiplication, addition, multiplication, and division.

Theorem 22.1.1 (Rules of diﬀerentiation.)

Let 𝑓 ∶ ℝ ↦ ℝ and 𝑔 ∶ ℝ ↦ ℝ be two arbitrary functions and let 𝑥0 ∈ ℝ. Suppose that both 𝑓 and 𝑔 is diﬀerentiable at
𝑥0 . Then
(a) (𝑐𝑓)′ (𝑥0 ) = 𝑐𝑓 ′ (𝑥0 ) for all 𝑐 ∈ ℝ,
(b) (𝑓 + 𝑔)′ (𝑥0 ) = 𝑓 ′ (𝑥0 ) + 𝑔′ (𝑥0 ),
(c) (𝑓𝑔)′ (𝑥0 ) = 𝑓 ′ (𝑥0 )𝑔(𝑥0 ) + 𝑓(𝑥0 )𝑔′ (𝑥0 ),
′ 𝑓 ′ (𝑥0 )𝑔(𝑥0 )−𝑓(𝑥0 )𝑔′ (𝑥0 )
(d) ( 𝑓𝑔 ) (𝑥0 ) = 𝑔(𝑥0 )2 if 𝑔(𝑥0 ) ≠ 0.

Proof. (a) and (b) is a direct consequence of the linearity of limits.

To show (c), we have to do a bit of algebra:

𝑓(𝑥)𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥0 ) 𝑓(𝑥)𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥) + 𝑓(𝑥0 )𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥0 )
lim = lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥 0 𝑥 − 𝑥0
𝑓(𝑥)𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥) 𝑓(𝑥0 )𝑔(𝑥) − 𝑓(𝑥0 )𝑔(𝑥0 )
= lim + lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥0 𝑥 − 𝑥0
𝑓(𝑥) − 𝑓(𝑥0 ) 𝑔(𝑥) − 𝑔(𝑥0 )
= lim [ 𝑔(𝑥)] + 𝑓(𝑥0 ) lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥0 𝑥 − 𝑥0
= 𝑓 ′ (𝑥0 )𝑔′ (𝑥0 ) + 𝑓(𝑥0 )𝑔′ (𝑥0 ),

from which (c) follows.

253
Mathematics of Machine Learning

For (d), we are going to start with the special case of (1/𝑔)′ . We have
1 1
𝑔(𝑥) − 𝑔(𝑥0 ) 1 𝑔(𝑥0 ) − 𝑔(𝑥)
lim = lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥0𝑔(𝑥)𝑔(𝑥0 ) 𝑥 − 𝑥0
′
𝑔 (𝑥0 )
=− ,
𝑔(𝑥0 )2

from which the general case follows by applying (c) to 𝑓 and 1/𝑔. □

There is one operation which we haven’t covered in the previous theorem: function composition. In the study of neural
networks, composition plays an essential role. Each layer can be thought of as a function, which are composed together
to form the entire network.

Theorem 22.1.2 (Chain rule/Leibniz rule.)

Let 𝑓 ∶ ℝ ↦ ℝ and 𝑔 ∶ ℝ ↦ ℝ be two arbitrary functions and let 𝑥0 ∈ ℝ. Suppose that 𝑔 is diﬀerentiable at 𝑥0 and 𝑓 is
diﬀerentiable at 𝑔(𝑥0 ). Then

(𝑓 ∘ 𝑔)′ (𝑥0 ) = 𝑓 ′ (𝑔(𝑥0 ))𝑔′ (𝑥0 )

holds.

Proof. First, we rewrite the diﬀerential quotient into the following form:

𝑓(𝑔(𝑥)) − 𝑓(𝑔(𝑥0 )) 𝑓(𝑔(𝑥)) − 𝑓(𝑔(𝑥0 )) 𝑔(𝑥) − 𝑔(𝑥0 )

lim = lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥0 𝑔(𝑥) − 𝑔(𝑥0 ) 𝑥 − 𝑥0
𝑓(𝑔(𝑥)) − 𝑓(𝑔(𝑥0 )) 𝑔(𝑥) − 𝑔(𝑥0 )
= lim lim .
𝑥→𝑥0 𝑔(𝑥) − 𝑔(𝑥0 ) 𝑥→𝑥 0 𝑥 − 𝑥0

Because 𝑔 is diﬀerentiable at 𝑥0 , it is also continuous there, so lim𝑥→𝑥0 𝑔(𝑥) = 𝑔(𝑥0 ). So, the ﬁrst term can be rewritten
as
𝑓(𝑔(𝑥)) − 𝑓(𝑔(𝑥0 )) 𝑓(𝑦) − 𝑓(𝑔(𝑥0 ))
lim = lim = 𝑓 ′ (𝑔(𝑥0 )).
𝑥→𝑥0 𝑔(𝑥) − 𝑔(𝑥0 ) 𝑦→𝑔(𝑥0 ) 𝑦 − 𝑔(𝑥0 )

Since 𝑔 is diﬀerentiable at 𝑥0 , the second term is 𝑔′ (𝑥0 ). Thus, we have

𝑓(𝑔(𝑥)) − 𝑓(𝑔(𝑥0 ))
lim = 𝑓 ′ (𝑔(𝑥0 ))𝑔′ (𝑥0 ),
𝑥→𝑥0 𝑥 − 𝑥0

which is what we had to show. □

As neural networks are just a huge composed functions, their derivative is calculated with the repeated application of the
chain rule. (Although the derivatives of its layers are vectors and matrices since they are multivariable functions.)

23.2 Derivatives of elementary functions

Following the already familiar pattern, now we calculate the derivatives for the most important class: the elementary
functions. There are a few ones that we will encounter all the time, like in the mean squared error, cross-entropy, Kullback-
Leibler divergence, etc.

254 Chapter 23. Diﬀerentiation in practice

Mathematics of Machine Learning

Theorem 22.2.1 (Derivatives of elementary functions.)

(a) (𝑥0 )′ = 0 and (𝑥𝑛 )′ = 𝑛𝑥𝑛−1 , where 𝑛 ∈ ℤ\{0},
(b) (sin 𝑥)′ = cos 𝑥 and (cos 𝑥)′ = − sin 𝑥,
(c) (𝑒𝑥 )′ = 𝑒𝑥 ,
(d) (log 𝑥)′ = 𝑥1 .

You don’t necessarily have to know how to prove these. I’ll include the proof of (a), but feel free to skip it, especially if
this is your ﬁrst encounter with calculus. What you have to remember, though, are the derivatives themselves. (However,
I’ll refer back to this part when necessary.)

Proof. (a) It is easy to see that for 𝑛 = 0, the derivative (𝑥0 )′ = 0. The case 𝑛 = 1 is also simple: calculating the
diﬀerential quotient shows that (𝑥)′ = 1. For the case 𝑛 ≥ 2, we are going to employ a small trick. Writing out the
diﬀerential quotient for 𝑓(𝑥) = 𝑥𝑛 , we obtain
𝑥𝑛 − 𝑥𝑛0
,
𝑥 − 𝑥0
which we want to simplify. If you don’t have a lot of experience in math, it might seem like magic, but 𝑥𝑛 − 𝑦𝑛 can be
written as
𝑥𝑛 − 𝑦𝑛 = (𝑥 − 𝑦)(𝑥𝑛−1 + 𝑥𝑛−2 𝑦 + ⋯ + 𝑥𝑦𝑛−2 + 𝑦𝑛−1 )
𝑛−1
= (𝑥 − 𝑦) ∑ 𝑥𝑛−1−𝑘 𝑦𝑘 .
𝑘=0

This can be seen easily by calculating the product:

𝑛
(𝑥 − 𝑦) ∑ 𝑥𝑛−1−𝑘 𝑦𝑘 . = 𝑥𝑛 + [𝑥𝑛−1 𝑦 + ⋯ + 𝑥𝑦𝑛−1 ]
𝑘=0
− [𝑥𝑛−1 𝑦 + ⋯ + 𝑥𝑦𝑛−1 ] − 𝑦𝑛
= 𝑥𝑛 − 𝑦 𝑛 .
Thus, we have
𝑛−1
𝑥𝑛 − 𝑥𝑛0 (𝑥 − 𝑥0 ) ∑𝑘=0 𝑥𝑛−1−𝑘 𝑥𝑘0
lim = lim
𝑥→𝑥0 𝑥 − 𝑥0 𝑥→𝑥0 𝑥−𝑦
𝑛−1
= lim ∑ 𝑥𝑛−1−𝑘 𝑥𝑘0
𝑥→𝑥0
𝑘=0
𝑛−1
= 𝑛𝑥0 .
𝑛 ′ 𝑛−1
So, (𝑥 ) = 𝑛𝑥 . With this and the rules of diﬀerentiation, we can calculate the derivative of any polynomial 𝑝(𝑥) =
𝑛
∑𝑘=0 𝑝𝑘 𝑥𝑘 as
𝑛
𝑝′ (𝑥) = ∑ 𝑘𝑝𝑘 𝑥𝑘−1 .
𝑘=1

The case 𝑛 < 0 follows from 𝑥−𝑛 = 1/𝑥𝑛 using the rules of diﬀerentiation. □

With these rules under our belt, we can calculate the derivatives for some of the most famous activation functions.
The most classical one, the sigmoid function is deﬁned by
1
𝜎(𝑥) = .
1 + 𝑒−𝑥

23.2. Derivatives of elementary functions 255

Mathematics of Machine Learning

Since it is an elementary function, it is diﬀerentiable everywhere. To calculate its derivative, we can use the quotient rule:
′
′ 1 𝑒−𝑥
𝜎 (𝑥) = ( ) =
1 + 𝑒−𝑥 (1 + 𝑒−𝑥 )2
−𝑥 (23.1)
1 𝑒
=
1 + 𝑒−𝑥 1 + 𝑒−𝑥
= 𝜎(𝑥)(1 − 𝜎(𝑥)).
Another popular activation function is the ReLU, deﬁned by

𝑥 if 𝑥 > 0,
ReLU(𝑥) = {
0 otherwise.
Let’s plot its graph ﬁrst!

def relu(x):
if x > 0:
return x
else:
return 0

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-5, 5, 1000)

with plt.style.context("seaborn-white"):
plt.figure(figsize=(15, 5))
plt.title("Graph of the ReLU function")
plt.plot(xs, [relu(x) for x in xs], label="ReLU")
plt.legend()

By looking at it, we can suspect that it is not diﬀerentiable at 0. Indeed, since

ReLU(𝑥) − ReLU(0) 1 if 𝑥 > 0,

={
𝑥 0 if 𝑥 < 0,
the limit of the diﬀerential quotient doesn’t exist. However, besides 0, it is diﬀerentiable and

1 if 𝑥 > 0,
ReLU′ (𝑥) = {
0 if 𝑥 < 0.

256 Chapter 23. Diﬀerentiation in practice

Mathematics of Machine Learning

Even though ReLU is not differentiable at 0, this is not a problem in practice. When performing backpropagation, it is
extremely unlikely that ReLU′ (𝑥) will receive 0 as its input. Even if this is the case, the derivative can be artificially
extended to zero by defining it as 0.

23.3 Extending the Function base class

Now that we have several tools under our belt to calculate derivatives, it’s time to think about implementations. Since we
have our own Function base class, a natural idea is to implement the derivative as a method. This is a simple solution
that is in line with object-oriented principles as well, so we should go for it!

class Function:
def __init__(self):
pass

def call(self, *args, **kwargs):

pass

# new interface element for

# computing the derivative
def prime(self):
pass

def parameters(self):
return dict()

To see a concrete example, let’s revisit the Sigmoid function, whose derivative is given by (23.1):

𝜎′ (𝑥) = 𝜎(𝑥)(1 − 𝜎(𝑥)).

class Sigmoid(Function):
def __call__(self, x):
return 1/(1 + np.exp(-x))

def prime(self, x):

return self(x) - self(x)**2

A simple implementation, yet colorful functionality. Now that we have the Sigmoid and its derivative in place, let’s plot
them together!

sigmoid = Sigmoid()

xs = np.linspace(-10, 10, 1000)

with plt.style.context("seaborn-white"):
plt.figure(figsize=(15, 5))
plt.title("Sigmoid and its derivative")
plt.plot(xs, [sigmoid(x) for x in xs], label="Sigmoid")
plt.plot(xs, [sigmoid.prime(x) for x in xs], label="Sigmoid prime")
plt.legend()

23.3. Extending the Function base class 257

Mathematics of Machine Learning

23.3.1 The derivative of compositions

At this point, I probably emphasized the importance of function compositions and the chain rule dozens of times. We
finally reached a point when we are ready to implement a simple neural network and compute its derivative! (Of course,
our methods will be far more refined in the end, but still, this is a milestone.)
How to calculate the derivative for a composition of 𝑛 functions? To see the pattern, let’s start map out the first few cases.
For 𝑛 = 2, we have the good old chain rule
′
(𝑓2 (𝑓1 (𝑥))) = 𝑓2′ (𝑓1 (𝑥)) ⋅ 𝑓1′ (𝑥).
For 𝑛 = 3, we have
′
(𝑓3 (𝑓2 (𝑓1 (𝑥)))) = 𝑓3′ (𝑓2 (𝑓1 (𝑥))) ⋅ 𝑓2′ (𝑓1 (𝑥)) ⋅ 𝑓1′ (𝑥).
Among the multitude of parentheses, we can notice a pattern. First, we should calculate the value of the composed
function 𝑓3 ∘ 𝑓2 ∘ 𝑓1 at 𝑥 while storing the intermediate results, then pass these to the appropriate derivatives and take the
product of the result.

class Composition(Function):
def __init__(self, *functions):
self.functions = functions

def call(self, x):

for f in self.functions:
x = f(x)

return x

def prime(self, x):

forward_pass = [x]

for f in self.functions:
x = f(x)
forward_pass.append(x)

forward_pass.pop() # removing the last element, as we won't need it

derivative = np.product([f.prime(x) for f, x in zip(self.functions, forward_

↪ pass)])
(continues on next page)

258 Chapter 23. Diﬀerentiation in practice

Mathematics of Machine Learning

(continued from previous page)

return derivative

To see if our implementation works, we should test it on a simple test case, say for

𝑓1 (𝑥) = 2𝑥,
𝑓2 (𝑥) = 3𝑥,
𝑓3 (𝑥) = 4𝑥.

The derivative of the composition (𝑓3 ∘ 𝑓2 ∘ 𝑓1 )(𝑥) = 24𝑥 should be constant 24.

class Linear(Function):
def __init__(self, a, b):
self.a = a
self.b = b

def call(self, x):

return self.a*x + self.b

def prime(self, x):

return self.a

def parameters(self):
return {"a": self.a, "b": self.b}

f = Composition(Linear(2, 0), Linear(3, 0), Linear(4, 0))

xs = np.linspace(-10, 10, 1000)

ys = [f.prime(x) for x in xs]

with plt.style.context("seaborn-white"):
plt.figure(figsize=(15, 5))
plt.title("The derivative of f(x) = 24x")
plt.plot(xs, ys, label="f prime")
plt.legend()

Success! Even though we only deal with single-variable functions for now, our Composition is going to be the skeleton
for neural networks.

23.3. Extending the Function base class 259

Mathematics of Machine Learning

23.4 Numerical diﬀerentiation

So far, we have seen that in the cases when at least some formula is available for the function in question, we can apply
the rules of differentiation to obtain the derivative.
However, in practice, this is often not the case. For instance, think about the case when the function represents a recorded
audio signal.
If we can’t compute the derivative exactly, a natural idea is to approximate it, that is, provide an estimate that is sufficiently
close to the real value.
For the sake of example, suppose that we don’t know the exact formula of our function to be differentiated, which is
secretly the good old sine function.

import numpy as np

def f(x):
return np.sin(x)

Recall that by deﬁnition, the derivative is deﬁned by

𝑓(𝑥0 ) − 𝑓(𝑥)
𝑓 ′ (𝑥0 ) = lim .
𝑥→𝑥0 𝑥0 − 𝑥

Since we can’t take limits inside a computer (as computers can’t deal with inﬁnity), the second best thing to do is to
approximate this by

𝑓(𝑥 + ℎ) − 𝑓(𝑥)
Δℎ 𝑓(𝑥) = , ℎ ∈ (0, ∞),
ℎ
where ℎ is an arbitrarily small but fixed quantity. Δℎ 𝑓(𝑥) is called the forward difference quotient. In theory, Δℎ 𝑓(𝑥) ≈
𝑓 ′ (𝑥) holds when ℎ is sufficiently small. Let’s see how they perform!

def delta(f, h, x):

return (f(x + h) - f(x))/h

import matplotlib.pyplot as plt

def f_prime(x):
return np.cos(x)

hs = [3.0, 1.0, 0.1]

xs = np.linspace(-5, 5, 100)
f_prime_ys = [f_prime(x) for x in xs]

with plt.style.context("seaborn-white"):
plt.figure(figsize=(15, 5))
plt.title("Approximating the derivative with finite differences")
for h in hs:
ys = [delta(f, h, x) for x in xs]
plt.plot(xs, ys, label=f"h = {h}")
plt.plot(xs, f_prime_ys, label="the true derivative")
plt.legend()

260 Chapter 23. Diﬀerentiation in practice

Mathematics of Machine Learning

Although the Δℎ 𝑓(𝑥) functions seem to be close 𝑓 ′ (𝑥) well when ℎ is small, there are a plethora of potential issues.
For one, Δℎ 𝑓(𝑥) = 𝑓(𝑥+ℎ)−𝑓(𝑥)
ℎ only approximates the derivative from the right of 𝑥, as ℎ > 0. To solve this, one might
use the backward diﬀerence quotient

𝑓(𝑥) − 𝑓(𝑥 − ℎ)
∇ℎ 𝑓(𝑥) = ,
ℎ
but that seem to have the same problems. The crux of the issue is that if 𝑓 is diﬀerentiable at some 𝑥0 , then

𝑓(𝑥 + ℎ) − 𝑓(𝑥) 𝑓(𝑥) − 𝑓(𝑥 − ℎ)

≈ ,
ℎ ℎ
but only if ℎ is very small, and the “good enough” choice for ℎ can vary from point to point.
A middle ground is provided by the so-called symmetric diﬀerence quotients, deﬁned by

𝑓(𝑥 + ℎ) − 𝑓(𝑥 − ℎ)
𝛿ℎ 𝑓(𝑥) = , ℎ ∈ (0, ∞),
2ℎ
∆ℎ 𝑓(𝑥)+∇ℎ 𝑓(𝑥)
which is the average of forward and backward differences: 𝛿ℎ 𝑓(𝑥) = 2 . These three approximators are
called finite differences.
Even though symmetric differences are provably better, the approximation errors can be significantly amplified on the
long run.
All things considered, we are not going to use finite differences for machine learning in practice. However, as we’ll see,
the gradient descent method is simply a forward difference approximation of a special differential equation.

23.5 Problems

Problem 1. Calculate the derivative of the tanh(𝑥) function deﬁned by

𝑒𝑥 − 𝑒−𝑥
tanh(𝑥) = .
𝑒𝑥 + 𝑒−𝑥

23.5. Problems 261

Mathematics of Machine Learning

262 Chapter 23. Diﬀerentiation in practice

CHAPTER

TWENTYFOUR

MINIMA AND MAXIMA

If someone gave you a function 𝑓 ∶ ℝ → ℝ defined by some tractable formula, how would you find its minima and
maxima? Take a moment and conjure up some ideas before moving on.
The first idea that comes to mind for most people is to evaluate the function for all possible values and simply find the
optimum. This method immediately breaks down due to multiple reasons. We can only perform finite evaluations, so this
would be impossible. Even if we try to define a discrete search grid cleverly and evaluate only there, this method takes an
unreasonable amount of time.
Another idea is to use some kind of inequality to provide an ad hoc upper or lower bound, then see if this bound can be
attained. However, this is nearly impossible for complex functions, like losses for neural networks.
However, derivatives provide an extremely useful way to optimize functions. Throughout the following sections, we will
study the relationship between derivatives and optimal points, and algorithms on how to find them.
Intuitively, the notion of minima and maxima is simple. Take a look at the example below.
Peaks of hills are the maxima, and bottoms of valleys are the minima. Minima and maxima are collectively called extremal
or optimal points. As our example demonstrates, we have to distinguish between local and global optima. The graph has
two valleys, and although both have a bottom, one of them is lower than the other.
We can graphically mark the local and global optima in our example based on this.
The really interesting part is finding these, as we’ll see next.

24.1 Derivatives and local behavior

Let’s consider again our simple example above to demonstrate how derivatives are connected to local minima and maxima.
If we use a bit of geometric intuition, we can observe that the tangents are horizontal at the peaks of hills and the bottoms
of valleys.
In terms of derivatives, since they describe the slope of the tangent, it means that the derivative should be 0 there.
If we think about the function as the description of a motion along the real line, derivatives say that the motion stops there
and changes direction. It slows down ﬁrst, stops, then immediately starts in the opposite direction. For instance, in the
local maxima case, the function increases up until that point, where it starts decreasing.
Again, we can describe this monotonicity behavior in terms of derivatives. Notice that when the function increases, the
derivative is positive (the object in motion has a positive speed). On the other hand, decreasing parts have a negative
derivative.
We can go ahead and put these intuitions into a mathematical form. First, we’ll start with the deﬁnitions of monotonicity
and their relation to the derivative. Then, we’ll connect all the dots and see how this comes together for characterizing the
optima.

263
Mathematics of Machine Learning

Fig. 24.1: A simple example.

264 Chapter 24. Minima and maxima

Mathematics of Machine Learning

Fig. 24.2: Local and global optima.

24.1. Derivatives and local behavior 265

Mathematics of Machine Learning

Fig. 24.3: Tangents at local and global optima.

266 Chapter 24. Minima and maxima

Mathematics of Machine Learning

Fig. 24.4: The ﬂow of the function.

24.1. Derivatives and local behavior 267

Mathematics of Machine Learning

Fig. 24.5: The sign of the derivatives.

268 Chapter 24. Minima and maxima

Mathematics of Machine Learning

Deﬁnition 23.1.1 (Locally increasing and decreasing functions.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function and let 𝑎 ∈ ℝ. We say that
(a) 𝑓 is locally increasing at 𝑎, if there is a neighborhood (𝑎 − 𝛿, 𝑎 + 𝛿) such that

≥ 𝑓(𝑥), if 𝑥 ∈ (𝑎 − 𝛿, 𝑎),
𝑓(𝑎) {
≤ 𝑓(𝑥), if 𝑥 ∈ (𝑎, 𝑎 + 𝛿),

(b) and 𝑓 is strictly locally increasing at 𝑎, if there is a neighborhood (𝑎 − 𝜀, 𝑎 + 𝜀) such that

> 𝑓(𝑥), if 𝑥 ∈ (𝑎 − 𝛿, 𝑎),

𝑓(𝑎) {
< 𝑓(𝑥), if 𝑥 ∈ (𝑎, 𝑎 + 𝛿).

The locally decreasing and strictly locally decreasing properties are deﬁned similarly, with the inequalities reversed.

For diﬀerentiable functions, the behavior of the derivative describes their local behavior in terms of monotonicity.

Theorem 23.1.1
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function that is diﬀerentiable at some 𝑎 ∈ ℝ.
(a) If 𝑓 ′ (𝑎) ≥ 0, then 𝑓 is locally increasing at 𝑎.
(b) If 𝑓 ′ (𝑎) > 0, then 𝑓 is strictly locally increasing at 𝑎.
(c) If 𝑓 ′ (𝑎) ≤ 0, then 𝑓 is locally decreasing at 𝑎.
(d) If 𝑓 ′ (𝑎) < 0, then 𝑓 is strictly locally decreasing at 𝑎.

Proof. We will only show (a), since the rest of the proofs go the same way. Due to how limits are deﬁned,

𝑓(𝑥) − 𝑓(𝑎)
lim = 𝑓 ′ (𝑎) ≥ 0
𝑥→𝑎 𝑥−𝑎
means that once 𝑥 gets close enough to 𝑎, that is, 𝑥 is from a small neighborhood (𝑎 − 𝛿, 𝑎 + 𝛿),

𝑓(𝑥) − 𝑓(𝑎)
≥ 0, 𝑥 ∈ (𝑎 − 𝛿, 𝑎 + 𝛿)
𝑥−𝑎
holds. If 𝑥 > 𝑎, then because of the diﬀerential quotient is nonnegative, 𝑓(𝑥) ≥ 𝑓(𝑎) must hold. Similarly, for 𝑥 < 𝑎,
the nonnegativity of the diﬀerential quotient implies that 𝑓(𝑥) ≤ 𝑓(𝑎).
The proof for (b), (c), and (d) is almost identical, with the obvious changes in the inequalities. □

The propositions related to not strict monotonicity are true the other way around as well

Theorem 23.1.2
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function that is diﬀerentiable at some 𝑎 ∈ ℝ.
(a) If 𝑓 is locally increasing at 𝑎, then 𝑓 ′ (𝑎) ≥ 0.
(b) If 𝑓 is locally decreasing at 𝑎, then 𝑓 ′ (𝑎) ≤ 0.

24.1. Derivatives and local behavior 269

Mathematics of Machine Learning

Proof. Similarly as before, we will only show the proof of (a), since (b) can be done in the same way. If 𝑓 is locally
increasing at 𝑎, then the diﬀerential quotient is positive:

𝑓(𝑥) − 𝑓(𝑎)
≥ 0.
𝑥−𝑎
Using the transfer principle of limits, we obtain

𝑓(𝑥) − 𝑓(𝑎)
𝑓 ′ (𝑎) = lim ≥ 0,
𝑥→𝑎 𝑥−𝑎
which is what we had to prove. □

After all this setup, we are ready to study local optima.

24.2 Local minima and maxima

As we have seen in the introduction, the tangent at the extremal points is horizontal. Now it is time to put this introduction
into a mathematically correct form.

Deﬁnition 23.2.1 (Local minima and maxima.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function and let 𝑎 ∈ ℝ.
(a) 𝑎 is a local minimum, if there is a neighborhood (𝑎 − 𝛿, 𝑎 + 𝛿) such that for every 𝑥 ∈ (𝑎 − 𝛿, 𝑎 + 𝛿), 𝑓(𝑎) ≤ 𝑓(𝑥)
holds.
(b) 𝑎 is a strict local minimum, if there is a neighborhood (𝑎−𝛿, 𝑎+𝛿) such that for every 𝑥 ∈ (𝑎−𝛿, 𝑎+𝛿), 𝑓(𝑎) < 𝑓(𝑥)
holds.
(c) 𝑎 is a local maximum, if there is a neighborhood (𝑎 − 𝛿, 𝑎 + 𝛿) such that for every 𝑥 ∈ (𝑎 − 𝛿, 𝑎 + 𝛿), 𝑓(𝑥) ≤ 𝑓(𝑎)
holds.
(d) 𝑎 is a strict local maximum, if there is a neighborhood (𝑎−𝛿, 𝑎+𝛿) such that for every 𝑥 ∈ (𝑎−𝛿, 𝑎+𝛿), 𝑓(𝑥) < 𝑓(𝑎)
holds.

Extremal points have their global versions as well. The sad truth is, even though we always want global optimums, we
only have the tools to ﬁnd local ones.

Deﬁnition 23.2.2 (Global minima and maxima.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function and let 𝑎 ∈ ℝ.
(a) 𝑎 is a global minimum, if 𝑓(𝑎) ≤ 𝑓(𝑥) holds for every 𝑥 ∈ ℝ.
(b) 𝑎 is a global maximum, if 𝑓(𝑥) ≤ 𝑓(𝑎) holds for every 𝑥 ∈ ℝ.

Note that a global optimum is also a local optimum, but not the other way around.

Theorem 23.2.1
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function that is diﬀerentiable at some 𝑎 ∈ ℝ. If 𝑓 has a local minima or maxima at 𝑎, then
𝑓 ′ (𝑎) = 0.

270 Chapter 24. Minima and maxima

Mathematics of Machine Learning

Proof. According to our previous theorem, if 𝑓 ′ (𝑎) ≠ 0, then it is either strictly increasing or decreasing locally. Since
this contradicts our assumption that 𝑎 is a local optimum, the theorem is proven. □

(In case you are interested, this was the principle of contraposition in action. From the negation of the conclusion, we
have shown the negation of the premises.)
It is very important to emphasize that the theorem is not true the other way around. For instance, the function 𝑓(𝑥) = 𝑥3
is strictly increasing everywhere, yet 𝑓 ′ (0) = 0.

Fig. 24.6: Graph of 𝑓(𝑥) = 𝑥3 as a counterexample to show that 𝑓 ′ (0) = 0 doesn’t imply local optimum.

In general, we call this behavior inflection. So, 𝑓(𝑥) = 𝑥3 is said to have an inflection point at 0. Inflection means a change
in behavior, which reflects the switch in its derivative from decreasing to increasing in this case. (The multidimensional
analogue of inflection is called a “saddle”, as we shall see later.)
So, we are not at our end goal yet, as the other half of the promised characterization is missing. The derivative is zero at

24.2. Local minima and maxima 271

Mathematics of Machine Learning

the local extremal points, but can we come up with a criterion that implies the existence of minima or maxima?
With the utilization of second derivatives, this is possible.

24.3 Characterization of optima with higher order derivatives

Let’s take a second look at our example, considering the local behavior of 𝑓 ′ this time, not just its sign. In the ﬁgure
below, the derivative is plotted along with our function.

Fig. 24.7: The function and its derivative.

The pattern seems simple: an increasing derivative implies a local minimum, a decreasing one means a local maximum.
This aligns with our intuition about derivative as speed: local maximum means that the object is going in a positive
direction, then stops and starts reversing.

272 Chapter 24. Minima and maxima

Mathematics of Machine Learning

We can make this mathematically precise with the following theorem.

Theorem 23.3.1 (The second derivative test.)

Let 𝑓 ∶ ℝ → ℝ be an arbitrary function that is twice diﬀerentiable at some 𝑎 ∈ ℝ.
(a) If 𝑓 ′ (𝑎) = 0 and 𝑓 ′′ (𝑎) > 0, then 𝑎 is a local minimum.
(b) If 𝑓 ′ (𝑎) = 0 and 𝑓 ′′ (𝑎) < 0, then 𝑎 is a local maximum.

Proof. Once again, we will only prove (a), since the proof of (b) is almost identical.
First, as we have seen when discussing the relation between derivatives and monotonicity, 𝑓 ′′ (𝑎) > 0 implies that 𝑓 ′ is
strictly locally increasing at 𝑎. Since 𝑓 ′ (𝑎) = 0, this means that

≤0 if 𝑥 ∈ (𝑎 − 𝛿, 𝑎]
𝑓 ′ (𝑥) {
≥0 if 𝑥 ∈ [𝑎, 𝑎 + 𝛿)

for some 𝛿 > 0. Because of the previously referenced theorem, 𝑓 is locally decreasing in (𝑎 − 𝛿, 𝑎] and locally increasing
in [𝑎, 𝑎 + 𝛿). This can only happen if 𝑎 is a local minimum. □

In summary, the method of ﬁnding the extrema of a function 𝑓 is the following.

1. Solve 𝑓 ′ (𝑥) = 0. Its solutions {𝑥1 , … , 𝑥𝑛 } are the candidates that can be extremal points. (But not necessarily all
of them are.)
2. Check the sign of 𝑓 ′′ (𝑥𝑖 ) for all solutions 𝑥𝑖 . If 𝑓 ′′ (𝑥𝑖 ) > 0, it is a local minimum. If 𝑓 ′′ (𝑥𝑖 ) < 0, it is a local
maximum.
If 𝑓 ′′ (𝑥𝑖 ) = 0, we still can’t draw any conclusions. The functions 𝑥4 , −𝑥2 and 𝑥4 show that it can be local minimum,
maximum, or neither.
Even though we have a “recipe”, this is still far from enough for practical purposes. Not counting that our functions of
interest are multivariable, calculating the derivative and solving 𝑓 ′ (𝑥) = 0 is not tractable. For loss functions of neural
networks, we don’t even bother writing out a formula because, for a composition of hundreds of functions, it can be
unreasonably complex.

24.4 Mean value theorems

In some cases, we can extract a lot of information about the derivatives without explicitly calculating them. These results
are extremely useful in cases where we don’t have an explicit formula for the function or the formula might be too huge.
(Like in the case of neural networks.) In the following, we’ll get to know the Lagrange’s mean value theorems, connecting
the function’s behavior at the endpoints and inside an interval.
First, we start with a special case that states that the function attains the same value at the end of some interval [𝑎, 𝑏], then
its derivative is zero somewhere inside the interval.

Theorem 23.4.1 (Rolle’s mean value theorem.)

Let 𝑓 ∶ ℝ → ℝ be a diﬀerentiable function and suppose that 𝑓(𝑎) = 𝑓(𝑏) for some 𝑎 ≠ 𝑏. Then there exists a 𝜉 ∈ (𝑎, 𝑏)
such that 𝑓 ′ (𝜉) = 0.

Proof. If you are a visual person, take a look at the ﬁgure below. This is what we need to show.

24.4. Mean value theorems 273

Mathematics of Machine Learning

Fig. 24.8: Rolle’s theorem.

To be mathematically precise, there are two cases. First, if 𝑓 is constant on [𝑎, 𝑏], then its derivative is zero on the entire
interval.
If 𝑓 is not constant, then it attains some value 𝑐 inside (𝑎, 𝑏) that is not equal to 𝑓(𝑎) = 𝑓(𝑏). For simplicity, suppose
that 𝑐 > 𝑓(𝑎). (The argument that follows goes through in the 𝑐 < 𝑓(𝑎) case with some obvious changes.) Since 𝑓 is
continuous, it attains its maximum there at a point 𝜉 ∈ [𝑎, 𝑏]. According to what we have just seen regarding the relation
of local maxima and the derivative, 𝑓 ′ (𝜉) = 0, which is what we had to show. □

Rolle’s theorem is an important stepping stone towards Lagrange’s mean value theorem, which we will show in the fol-
lowing.

Theorem 23.4.2 (Lagrange’s mean value theorem.)

Let 𝑓 ∶ ℝ → ℝ be a diﬀerentiable function and [𝑎, 𝑏] an interval for some 𝑎 ≠ 𝑏. Then there exists a 𝜉 ∈ (𝑎, 𝑏) such that

𝑓(𝑏) − 𝑓(𝑎)
𝑓 ′ (𝜉) =
𝑏−𝑎
holds.

274 Chapter 24. Minima and maxima

Mathematics of Machine Learning

Proof. Again, let’s start with a visualization to get a grip on the theorem.

Fig. 24.9: Lagrange’s mean value theorem.

𝑓(𝑏)−𝑓(𝑎)
Recall that 𝑏−𝑎 is the slope of the line going through (𝑎, 𝑓(𝑎)) and (𝑏, 𝑓(𝑏)). This line is described by the function

𝑓(𝑏) − 𝑓(𝑎)
(𝑥 − 𝑎) + 𝑓(𝑎).
𝑏−𝑎
Using this, we introduce the function

𝑓(𝑏) − 𝑓(𝑎)
𝑔(𝑥) ∶= 𝑓(𝑥) − ( (𝑥 − 𝑎) + 𝑓(𝑎)).
𝑏−𝑎

We can apply Rolle’s theorem to 𝑔(𝑥), since 𝑔(𝑎) = 𝑔(𝑏) = 0. Thus, for some 𝜉 ∈ (𝑎, 𝑏), we have

𝑓(𝑏) − 𝑓(𝑎)
𝑔( 𝜉) = 0 = 𝑓 ′ (𝜉) − ,
𝑏−𝑎
𝑓(𝑏)−𝑓(𝑎)
implying 𝑓 ′ (𝜉) = 𝑏−𝑎 , which is what we had to show. □

24.4. Mean value theorems 275

Mathematics of Machine Learning

Why are mean value theorems so important? In mathematics, they serve as a cornerstone in several results. To give you
one example, think about integration. (Perhaps you are familiar with this concept already. Don’t worry if not, we are
going to study it in detail later.) Integration is essentially the inverse of diﬀerentiation: if 𝐹 ′ (𝑥) = 𝑓(𝑥), then
𝑏
∫ 𝑓(𝑥)𝑑𝑥 = 𝐹 (𝑏) − 𝐹 (𝑎),
𝑎

which will be a simple consequence of Lagrange’s mean value theorem.

24.5 Problems

Problem 1. Find the local minimas and maximas of 𝑓(𝑥) = sin 𝑥.

276 Chapter 24. Minima and maxima

CHAPTER

TWENTYFIVE

THE BASICS OF GRADIENT DESCENT

We need to solve two computational problems to train neural networks:

1. computing the derivative of the loss 𝐿(𝑤),
2. and finding its minima using the derivative.
𝑑
Finding the minima by solving 𝑑𝑤 𝐿(𝑤) = 0 is going to work in practice. There are several problems. First, as we have
seen, not all solutions are minimal points: there are maximal and saddle points as well. (In multiple dimensions, saddle
points are the analogue of inflection points. More on this later.) Second, solving this equation is not feasible except for
the simplest cases. (Like linear regression with the mean squared error.)
Gradient descent provides a way to tackle the complexity of finding the exact solution, enabling us to do machine learning
on a large scale.

25.1 Derivatives revisited

When we encountered the concept of the derivative for the first time, we saw several of its faces. The derivative can be
thought of as
1. speed (if the function describes a time-distance graph of a moving object),
2. the slope of the tangent line of a function,
3. and the best linear approximator at a given point.
To understand how gradient descent works, we’ll see yet another interpretation: derivatives as vectors. For any differen-
tiable function 𝑓(𝑥), the derivative 𝑓 ′ (𝑥) can be thought of as a one-dimensional vector. If 𝑓 ′ (𝑥) is positive, it points
to the right. If it is negative, it points to the left. We can visualize this by drawing a horizontal vector to every point of
𝑓(𝑥)-s graph, where the length represents |𝑓 ′ (𝑥)| and the direction represents the sign.
Do you recall how monotonicity is characterized by the sign of the derivative? Negative derivative means decreasing,
positive means increasing. In other words, this implies that the derivative, as a vector, points towards the direction of the
increase.
Imagine yourself as a hiker on the x-y plane, where y signifies the height. How would you climb a mountain ahead of
you? By taking a step towards the direction of increase, that is, following the derivative. If you are not there yet, you can
still take another (perhaps smaller) step in the right direction, over and over again until you arrive. If you are right at the
top, the derivative is zero, so you won’t move anywhere.
This process is illustrated by Fig. 25.2.
What you have seen here is the gradient ascent in action. Now that we understand the main idea, we are ready to tackle
the mathematical details.

277
Mathematics of Machine Learning

Fig. 25.1: The derivative as a vector.

278 Chapter 25. The basics of gradient descent

Mathematics of Machine Learning

Fig. 25.2: Climbing a mountain, one step at a time.

25.1. Derivatives revisited 279

Mathematics of Machine Learning

25.2 The gradient descent algorithm

Let 𝑓 ∶ ℝ → ℝ be a diﬀerentiable function which we want to maximize, that is, ﬁnd

𝑥max = argmax𝑥∈ℝ 𝑓(𝑥).

Based on our intuition, the process is quite simple. First, we conjure up an arbitrary starting point 𝑥0 , then deﬁne the
sequence

𝑥𝑛+1 ∶= 𝑥𝑛 + ℎ𝑓 ′ (𝑥𝑛 ), (25.1)

where ℎ ∈ (0, ∞) is a parameter of our gradient descent algorithm, called the learning rate. In English, the formula
𝑥𝑛 + ℎ𝑓 ′ (𝑥𝑛 ) describes taking a small step from 𝑥𝑛 towards the direction of the increase, with step size ℎ𝑓 ′ (𝑥𝑛 ).
If things go our way, the sequence 𝑥𝑛 converges to a local maximum of 𝑓. However, things do not always go our way.
We’ll discuss these when talking about the issues of gradient descent.
But what about ﬁnding minima? In machine learning, we are trying to minimize loss functions. There is a simple trick:
′
the minima of 𝑓(𝑥) is the maxima of −𝑓(𝑥). So, since ( − 𝑓) = −𝑓 ′ , the deﬁnition of the approximating sequence 𝑥𝑛
changes to

𝑥𝑛+1 ∶= 𝑥𝑛 − ℎ𝑓 ′ (𝑥𝑛 ).

This is gradient descent in a nutshell.

25.2.1 Implementing gradient descent

At this point, we have all the knowledge to implement the gradient descent algorithm. As usual, I encourage you to try
implementing your version before looking at mine. Coding is one of the most eﬀective ways to learn.

import numpy as np
import nbimporter
from tools.function import Function

def gradient_descent(
f: Function,
x_init: float, # the initial guess
learning_rate: float = 0.1, # the learning rate
n_iter: int = 1000, # number of steps
return_all: bool = False # if true, returns all intermediate values
):
xs = [x_init] # we store the intermediate results for visualization

for n in range(n_iter):
x = xs[-1]
grad = f.prime(x)
x_next = x - learning_rate*grad
xs.append(x_next)

if return_all:
return xs
else:
return x

Let’s test the gradient descent out on a simple example, say 𝑓(𝑥) = 𝑥2 ! If all goes according to plan, the algorithm should
ﬁnd the minimum 𝑥 = 0 in no time.

280 Chapter 25. The basics of gradient descent

Mathematics of Machine Learning

class Square(Function):
def __call__(self, x):
return x**2

def prime(self, x):

return 2*x

f = Square()

gradient_descent(f, x_init=5.0)

7.688949513507002e-97

The result is as expected: our gradient_descent function successfully ﬁnds the minimum.
To visualize what happens, we can plot the process in its entirety.

from tools.plotting import plot_gradient_descent

xs = gradient_descent(f, x_init=5.0, n_iter=25, learning_rate=0.2, return_all=True)

plot_gradient_descent(f, xs, x_min=-5, x_max=5, label="x²")

25.2. The gradient descent algorithm 281

Mathematics of Machine Learning

(The plot_gradient_descent is a helper function to reduce boilerplate code in the book. Don’t worry about the
details. It just plots the result of the gradient descent.)

25.3 Drawbacks and caveats

Even though the idea behind the gradient descent is sound, there are several issues. During our journey in machine
learning, we’ll see most of them ﬁxed by variants of the algorithm, but it is worth looking at the potential problems of the
base version at this point.
First, the base gradient descent can get inﬁnitely stuck at a local minima. To illustrate this, let’s take a look at the
𝑓(𝑥) = cos(𝑥) + 𝑥2 function, whose global minima is at 𝑥 = 0.

class CosPlusSquare(Function):
def __call__(self, x):
return np.sin(x) + 0.5*x
(continues on next page)

282 Chapter 25. The basics of gradient descent

Mathematics of Machine Learning

(continued from previous page)

def prime(self, x):

return np.cos(x) + 0.5

f = CosPlusSquare()

xs = gradient_descent(f, x_init=7.5, n_iter=20, learning_rate=0.2, return_all=True)

plot_gradient_descent(f, xs, -10, 10, label="sin(x) + 0.5x")

Note that if the initial point 𝑥0 is selected poorly, the algorithm is much less eﬀective. This sensitivity to the initial
conditions is another weakness. It might not seem that much of an issue in a simple one-variable case that we have just
seen. However, this is a signiﬁcant headache in the million-dimensional parameter spaces that we encounter when training
neural networks. Several methods can help to alleviate the issue, and we are going to see them when talking about weight
initialization for neural networks.
The starting point is not the only parameter of the algorithm; it depends on the learning rate ℎ as well. There are several

25.3. Drawbacks and caveats 283

Mathematics of Machine Learning

potential mistakes here: a too large learning rate results in the algorithm bouncing all around the space, never ﬁnding an
optimum. On the other hand, a too small one results in an extremely slow convergence.
In the case of 𝑓(𝑥) = 𝑥2 , starting the gradient descent from 𝑥0 = 1.0 with a learning rate of ℎ = 1.05, the algorithm
diverges, with 𝑥𝑛 oscillating at a larger and larger amplitude.

f = Square()

xs = gradient_descent(f, x_init=1.0, n_iter=20, learning_rate=1.05, return_all=True)

plot_gradient_descent(f, xs, -8, 8, label="x²")

Can you come up with some solution ideas to these problems? No need to work anything out, just take a few minutes to
brainstorm and make a mental note about what comes to mind. In the later chapters, we’ll see several proposed solutions
for all of these problems, but putting some time into this is a very useful exercise.
We are almost at the end of our journey of introductory calculus. During the lectures so far, we mostly spent our time
with getting to know diﬀerentiation and its importance in optimization. However, there is a counterpart of diﬀerentiation
that will be essential for understanding probability and statistics: integration. In a sense, integration is the inverse of

284 Chapter 25. The basics of gradient descent

Mathematics of Machine Learning

diﬀerentiation, and later will be used to express quantities like expected values and loss functions. Let’s take a look at
them!

25.3. Drawbacks and caveats 285

Mathematics of Machine Learning

286 Chapter 25. The basics of gradient descent

CHAPTER

TWENTYSIX

INTEGRATION IN THEORY

When we first encountered the concept of the derivative, we introduced it through an example from physics. As Newton
created it, the derivative describes the speed of a moving object as calculated from its time-distance graph. In other words,
the speed can be derived from the time-distance information.
Can the distance be reconstructed given the speed? In a sense, this is the inverse of differentiation.
Questions such as these are hard to answer if we only look at the most general case, so let’s consider a special one. Suppose
that our object is moving with a constant speed 𝑣(𝑡) = 𝑣0 𝑚 𝑠 , for a duration of 𝑇 seconds. With some elementary logic,
we can conclude that the total distance traveled is 𝑣0 𝑇 meters.
When taking a look at the time-speed plot, we can immediately see that the distance is the area under the time-speed
function graph 𝑣(𝑡) = 𝑣0 . The graph of 𝑣(𝑡) describes a rectangle with width 𝑣0 and length 𝑇 , hence its area is indeed
𝑣0 𝑇 .
Does the area under 𝑣(𝑡) equal the distance traveled in the general case? For instance, what happens when the time-speed
plot looks something like this?
The speed is not constant here. In this case, we can do a simple trick: partition the time interval [0, 𝑇 ] into smaller ones
and approximate the object’s motion as a constant-speed motion on each of these intervals.
If the time intervals [𝑡𝑖 , 𝑡𝑖+1 ] are sufficiently granular, the distance travelled will roughly match a constant velocity motion
with the average speed at [𝑡𝑖 , 𝑡𝑖+1 ]. That is, if we introduce the notation

𝑣𝑖 ∶= average speed during the time interval [𝑡𝑖−1 , 𝑡𝑖 ], 𝑖 = 1, 2, … , 𝑛,

we should have
𝑛
∑ 𝑣𝑖 (𝑡𝑖 − 𝑡𝑖−1 ) ≈ total distance traveled during [0, 𝑇 ].
𝑖=1

Let’s think about this whole process as approximating the function 𝑣(𝑡) with a stepwise constant function 𝑣approx (𝑡). From
this angle, we have
𝑛
∑ 𝑣𝑖 (𝑡𝑖 − 𝑡𝑖−1 ) = area under 𝑣approx (𝑡),
𝑖=1

but we also

area under 𝑣approx (𝑡) ≈ area under 𝑣(𝑡).

(Very) loosely speaking, if the granularity of the time intervals [𝑡𝑖 , 𝑡𝑖+1 ] gets inﬁnitesimally small, the approximations
turn into equality. Thus,

total distance traveled during [0, 𝑇 ] = area under 𝑣approx (𝑡) in [0, 𝑇 ].

There are two key points that we need to remember: if 𝑠(𝑡) is the distance traveled and 𝑣(𝑡) is the speed, then

287
Mathematics of Machine Learning

Fig. 26.1: Time-distance plot of an object, moving with constant speed.

288 Chapter 26. Integration in theory

Mathematics of Machine Learning

Fig. 26.2: Time-distance plot of an object, moving with a changing speed.

289
Mathematics of Machine Learning

Fig. 26.3: Approximating with a constant velocity motion.

290 Chapter 26. Integration in theory

Mathematics of Machine Learning

• 𝑣(𝑡) is the derivative 𝑠′ (𝑡),

• 𝑠(𝑇 ) is the area under the graph 𝑣(𝑡) between 0 and 𝑇 .
In other words, calculating the area under the curve is the same as inverting diﬀerentiation. This process is called inte-
gration.
Unfortunately, things are not as simple as they seem. We missed a lot of mathematical detail in the above discussion. For
one, does the sum
𝑛
∑ 𝑣𝑖 (𝑡𝑖 − 𝑡𝑖−1 )
𝑖=1

converge if the partition of [0, 𝑇 ] gets more granular? Does the limit depend on the partitions? Can we even deﬁne the
area under the “graph” for all functions? Like the Dirichlet function, deﬁned by

1 if 𝑥 is rational
𝐷(𝑥) = { (26.1)
0 otherwise,
𝑛
How do we calculate limits of ∑𝑖=1 𝑣𝑖 (𝑡𝑖 − 𝑡𝑖−1 ) in practice? In addition, what does all of this have to do with machine
learning?
Fasten your seatbelts! Here comes the rigorous study of integration, clearing up all of these questions above.

26.1 Calculating the area under a function’s graph

Let’s build a solid theoretical foundation for the above intuitive explanation! Let 𝑓 ∶ [𝑎, 𝑏] → ℝ be an arbitrary bounded
function, and our goal is to calculate the signed area under the graph. (Note that the signed area is negative if the graph
goes below the 𝑥 axis. In the time-speed graph example above, this is equivalent to moving backward, thus decreasing
the distance traveled from the starting point.)
Let 𝑎 = 𝑥0 < 𝑥1 < ⋯ < 𝑥𝑛 = 𝑏 an arbitrary partition of the interval [𝑎, 𝑏]. For notational convenience, we’ll denote
this partition as 𝑋 = {𝑥0 , … , 𝑥𝑛 } as well. The granularity (or mesh) of 𝑋 is deﬁned by

|𝑋| ∶= max |𝑥𝑖 − 𝑥𝑖−1 |,

𝑖=1,…,𝑛

which is the length of the biggest gap in 𝑋. Note that the partition is not necessarily uniform, so |𝑥𝑖 − 𝑥𝑖−1 | is not
constant.
We are going to use an argument similar to the squeeze principle to make the approximation idea rigorous. (You know,
the one where we replaced the speed of a moving object with a piecewise constant one.) Instead of using the averages of
𝑓(𝑥) on each interval [𝑥𝑖−1 , 𝑥𝑖 ], we are going to provide an upper and lower estimation by using

𝑚𝑖 ∶= inf 𝑓(𝑥)
𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ]

and
𝑀𝑖 ∶= sup 𝑓(𝑥).
𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ]

Mathematically speaking, the inﬁmum and the supremum are much easier to work with than the average. Now we can
approximate 𝑓(𝑥) with a piecewise constant function from both above and below. This is visualized by Fig. 26.4.
Our plan is to squeeze the area between the lower and upper sums
𝑛
𝐿[𝑓, 𝑋] ∶= ∑ 𝑚𝑖 (𝑥𝑖 − 𝑥𝑖−1 ) (26.2)
𝑖=1

26.1. Calculating the area under a function’s graph 291

Mathematics of Machine Learning

Fig. 26.4: Estimating the area under the curve of 𝑓 using the partition 𝑋

and
𝑛
𝑈 [𝑓, 𝑋] ∶= ∑ 𝑀𝑖 (𝑥𝑖 − 𝑥𝑖−1 ), (26.3)
𝑖=1

then study if these two match. (As usual, the dependence on 𝑓 and 𝑋 will be omitted if it is clear from the context.)
It is clear from the construction that

𝐿[𝑓, 𝑋] ≤ area under the graph ≤ 𝑈 [𝑓, 𝑋]

holds for any partition 𝑋.

As the granularity of the partition 𝑋 goes to zero, hopefully, both 𝐿[𝑓, 𝑋] and 𝑈 [𝑓, 𝑋] will converge to the same number.
This common limit intuitively should be the “area under the function graph”, but currently, our notion of the area is not
general enough to make such bold statements. For instance, how would you deﬁne the “area” under the Dirichlet function,
deﬁned by (26.1)? As we shall see soon, integration will generalize our heuristic notion of area. To get to that point, we
have a lot to do. First, we’ll take a closer look at the partitions.

26.1.1 Partitions and their reﬁnements

We need to introduce some basic facts about reﬁning partitions to construct mathematically correct arguments regarding
the convergence of the approximating sums 𝐿[𝑓, 𝑋] and 𝑈 [𝑓, 𝑋].

Deﬁnition 25.1.1 (Reﬁnement of partitions.)

Let 𝑋 = {𝑥0 , … , 𝑥𝑛 } and 𝑌 = {𝑦0 , … , 𝑦𝑚 } be two partitions of [𝑎, 𝑏]. We say that 𝑌 is a reﬁnement of 𝑋 if 𝑋 ⊆ 𝑌 .

We can visualize this easily.

Reﬁnements are vital for understanding why integration works. One of the core reasons is the following theorem.

292 Chapter 26. Integration in theory

Mathematics of Machine Learning

Fig. 26.5: The partition 𝑌 , as a reﬁnement of 𝑋

Proposition 25.1.1
Let 𝑓 ∶ [𝑎, 𝑏] → ℝ be a bounded function and 𝑋 and 𝑌 be two partitions of [𝑎, 𝑏]. Suppose that 𝑌 is a reﬁnement of 𝑋.
Then

𝐿[𝑓, 𝑋] ≤ 𝐿[𝑓, 𝑌 ] (26.4)

and

𝑈 [𝑓, 𝑌 ] ≤ 𝑈 [𝑓, 𝑋]. (26.5)

Proof. We are going to show 𝐿[𝑓, 𝑋] ≤ 𝐿[𝑓, 𝑌 ], as (26.5) follows from a similar argument. Suppose that 𝑥𝑖−1 ≤ 𝑦𝑗 ≤
⋯ ≤ 𝑦𝑙 ≤ 𝑥𝑖 . Mathematically speaking, we have

inf 𝑓(𝑥) ≤ inf 𝑓(𝑥), 𝑘 = 𝑗 + 1, … , 𝑙.

𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ] 𝑥∈[𝑦𝑘−1 ,𝑦𝑘 ]

𝑙
Since 𝑥𝑖 − 𝑥𝑖−1 = ∑𝑘=𝑗+1 𝑦𝑘 − 𝑦𝑘−1 , the above implies that

𝑙
inf 𝑓(𝑥)(𝑥𝑖 − 𝑥𝑖−1 ) = ∑ inf 𝑓(𝑥)(𝑦𝑘 − 𝑦𝑘−1 )
𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ] 𝑥∈[𝑥𝑖−1 ,𝑥𝑖 ]
𝑘=𝑗+1
(26.6)
𝑙
≤ ∑ inf 𝑓(𝑥)(𝑦𝑘 − 𝑦𝑘−1 ).
𝑥∈[𝑦𝑘−1 ,𝑦𝑘 ]
𝑘=𝑗+1

Don’t worry if these mathematical formalisms make this hard to follow. Just take a look at Fig. 26.6 below, which
summarizes all that we have done so far.

26.1. Calculating the area under a function’s graph 293

Mathematics of Machine Learning

Fig. 26.6: Reﬁnement of lower sums.

Since 𝐿[𝑓, 𝑋] and 𝐿[𝑥, 𝑌 ] are composed from parts like in (26.6), summing over 𝑖 in the above immediately yields
𝐿[𝑓, 𝑋] ≤ 𝐿[𝑥, 𝑌 ]. □

We are almost there. There is one thing left for us to show: that for any two partitions, the lower sum is always smaller
than the upper sum. Hence, the squeeze principle could be applied to show that the lower and upper sums converge to the
same limit in some instances.
For, we need a simple but fundamental fact about partitions.

Proposition 25.1.2
Let 𝑋 and 𝑌 be two partitions of [𝑎, 𝑏]. Then there is a partition 𝑍 that is a reﬁnement for both 𝑋 and 𝑌 .

Proof. It is easy to see that 𝑍 = 𝑋 ∪ 𝑌 satisﬁes our requirements. □

The above 𝑍 is called a mutual reﬁnement of 𝑋 and 𝑌 . We can show a fundamental relation between the upper and lower
sums with this idea.

Proposition 25.1.3
Let 𝑓 ∶ [𝑎, 𝑏] → ℝ be a bounded real function and let 𝑋 and 𝑌 be two partitions of the interval [𝑎, 𝑏]. Then

𝐿[𝑓, 𝑋] ≤ 𝑈 [𝑓, 𝑌 ]

holds.

Proof. Let 𝑍 be a mutual reﬁnement of 𝑋 and 𝑌 , as guaranteed by the previous result. Then, (26.4) and (26.5) implies

294 Chapter 26. Integration in theory

Mathematics of Machine Learning

that
𝐿[𝑓, 𝑋] ≤ 𝐿[𝑓, 𝑍] ≤ 𝑈 [𝑓, 𝑍] ≤ 𝑈 [𝑓, 𝑌 ],
which is what we wanted to show. □

26.2 The Riemann integral

Let’s denote the set of all partitions on [𝑎, 𝑏] by ℱ[𝑎, 𝑏]:

ℱ[𝑎, 𝑏] = {𝑋 ∶ 𝑋 is a partition of [𝑎, 𝑏]}.
Now we are ready to deﬁne the integral of the function as the single value that separates upper and lower sums.

Deﬁnition 25.2.1 (Riemann-integrability.)

Let 𝑓 ∶ [𝑎, 𝑏] → ℝ be a bounded function. We say that 𝑓 is Riemann-integrable (or just integrable) on [𝑎, 𝑏] if
sup 𝐿[𝑓, 𝑋] = inf 𝑈 [𝑓, 𝑋].
𝑋∈ℱ[𝑎,𝑏] 𝑋∈ℱ[𝑎,𝑏]

This value is called the Riemann integral (or just the integral) of 𝑓 over [𝑎, 𝑏], denoted by
𝑏
∫ 𝑓(𝑥)𝑑𝑥.
𝑎

𝑏
The function 𝑓 in ∫𝑎 𝑓(𝑥)𝑑𝑥 is called the integrand. How do we calculate the integral itself? The hard way is to define a
sequence of partitions 𝑋𝑛 and show that
lim 𝐿[𝑓, 𝑋𝑛 ] = lim 𝑈 [𝑓, 𝑋𝑛 ],
𝑛→∞ 𝑛→∞
𝑏
so this number is necessarily ∫𝑎 𝑓(𝑥)𝑑𝑥. We’ll see the easy way soon, but let’s see an example demonstrating this process.
1
Let’s calculate ∫0 𝑥2 𝑑𝑥! The simplest is to use the uniform partition 𝑋𝑛 = {𝑖/𝑛}𝑛𝑖=0 , obtaining
𝑛 2
𝑖−1 1
𝐿[𝑥2 , 𝑋𝑛 ] = ∑ ( )
𝑖=1
𝑛 𝑛
1 𝑛
= ∑(𝑖 − 1)2 .
𝑛3 𝑖=1
𝑛 𝑛(𝑛+1)(2𝑛+1)
Since ∑𝑘=1 𝑘2 = 6 (as it can be shown by induction), it is easy to see that
1
lim 𝐿[𝑥2 , 𝑋𝑛 ] = .
𝑛→∞ 3
1 1
With a similar argument, you can check that lim𝑛→∞ 𝑈 [𝑥2 , 𝑋𝑛 ] = 3 as well, thus, ∫0 𝑥2 𝑑𝑥 exists and
1
1
∫ 𝑥2 𝑑𝑥 = .
0 3
Although this method works for simple cases such as 𝑓(𝑥) = 𝑥2 , it breaks down for more complex functions, as calculating
limits of upper and lower sums can be difficult. In addition, selecting the right partition is also a challenge. For instance,
𝜋
can you calculate ∫0 sin(𝑥)𝑑𝑥 by the definition?
Because we are lazy (just like any good mathematician), we want to find a general method to calculate integrals. We’ll
see this in the next section.

26.2. The Riemann integral 295

Mathematics of Machine Learning

26.3 Beyond upper and lower sums

Lower and upper sums are needed to make the notion of an integral mathematically precise. Combined with the squeeze
principle, they are used to provide a deﬁnition.
However, other tools become available once we know that a function is integrable. Such as the general approximating
sum, as we are about to see next.

Theorem 25.3.1
Let 𝑓 ∶ ℝ → ℝ be an arbitrary function, and let 𝑋𝑛 = {𝑥0,𝑛 , … , 𝑥𝑛,𝑛 } be a sequence of partitions on [𝑎, 𝑏] such that
|𝑋𝑛 | → 0. Then 𝑓 is integrable if and only if the limit
𝑛
lim ∑ 𝑓(𝜉𝑖 )(𝑥𝑖,𝑛 − 𝑥𝑖−1,𝑛 )
𝑛→∞
𝑖=1

exists and in this case,

𝑛 𝑏
lim ∑ 𝑓(𝜉𝑖 )(𝑥𝑖,𝑛 − 𝑥𝑖−1,𝑛 ) = ∫ 𝑓(𝑥)𝑑𝑥,
𝑛→∞
𝑖=1 𝑎

holds, where 𝜉𝑖 ∈ [𝑥𝑖−1,𝑛 , 𝑥𝑖,𝑛 ].

We will not prove the above theorem, as the proof is technical and doesn’t provide any valuable insight. However, the
point is clear: local inﬁma and suprema in lower and upper sums can be replaced with any local value.
For simplicity, we’ll denote this sum by
𝑛
𝑆[𝑓, 𝑋, 𝜉𝑋 ] = ∑ 𝑓(𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ) (26.7)
𝑖=1

for any 𝑋 = {𝑥0 , … , 𝑥𝑛 } and 𝜉𝑋 = {𝜉1 , … , 𝜉𝑛 } with 𝜉𝑖 ∈ [𝑥𝑖−1 , 𝑥𝑖 ].

26.4 Integration as the inverse of diﬀerentiation

Now that we understand the mathematical deﬁnition of the integral, it is time to ﬁnd some tools that enable its use in
practice. The most important result is the Newton-Leibniz formula, named after Isaac Newton and Gottfried Wilhelm
Leibniz, the inventors of calculus. (Fun fact: these men discovered calculus independently and were mortal enemies
throughout their lives.)

Theorem 25.4.1 (The fundamental theorem of calculus, a.k.a. the Newton-Leibniz formula.)
Let 𝑓 ∶ ℝ → ℝ a function that is integrable on some [𝑎, 𝑏] and suppose that there is an 𝐹 ∶ ℝ → ℝ such that 𝐹 ′ (𝑥) = 𝑓(𝑥).
Then
𝑏
∫ 𝑓(𝑥)𝑑𝑥 = 𝐹 (𝑏) − 𝐹 (𝑎) (26.8)
𝑎

holds.

𝑥
In other words, by deﬁning 𝑥 ↦ 𝐹 (𝑎) + ∫𝑎 𝑓(𝑥)𝑑𝑥, we can eﬀectively reconstruct a function from its derivative.

296 Chapter 26. Integration in theory

Mathematics of Machine Learning

Proof. Let 𝑎 = 𝑥0 < 𝑥1 < ⋯ < 𝑥𝑛 = 𝑏 be an arbitrary partition of [𝑎, 𝑏] According to Lagrange's mean value theorem,
there is a 𝜉𝑖 ∈ (𝑥𝑖−1 , 𝑥𝑖 ) for all 𝑖 = 1, … , 𝑛 such that

𝐹 (𝑥𝑖 ) − 𝐹 (𝑥𝑖−1 ) = 𝐹 ′ (𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 )

= 𝑓(𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ).

Thus, we can sum these numbers up, eliminating all but the ﬁrst and last elements:
𝑛 𝑛
∑ 𝑓(𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ) = ∑ 𝐹 (𝑥𝑖 ) − 𝐹 (𝑥𝑖−1 )
𝑖=1 𝑖=1
= 𝐹 (𝑏) − 𝐹 (𝑎).

On the other hand, due to the properties of lower and upper sums, we have
𝑛
𝐿[𝑓, 𝑋] ≤ ∑ 𝑓(𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ) ≤ 𝑈 [𝑓, 𝑋]
𝑖=1

Since 𝑓 is integrable, the squeeze principle implies that

𝑏 𝑛
∫ 𝑓(𝑥)𝑑𝑥 = ∑ 𝑓(𝜉𝑖 )(𝑥𝑖 − 𝑥𝑖−1 ) = 𝐹 (𝑏) − 𝐹 (𝑎)
𝑎 𝑖=1

must hold. This is what we had to show. □

Remark 25.4.1 (Increments of functions.)

For simplicity, the increments of a function 𝐹 on the interval [𝑎, 𝑏] is also denoted by
𝑥=𝑏
[𝐹 (𝑥)]𝑥=𝑎 ∶= 𝐹 (𝑏) − 𝐹 (𝑎).

Thus, according to the fundamental theorem of calculus,

𝑏
𝑥=𝑏
∫ 𝑓(𝑥)𝑑𝑥 = [𝐹 (𝑥)]𝑥=𝑎
𝑎
′
holds if 𝐹 (𝑥) = 𝑓(𝑥).

Note that integration is insensitive towards changing the values of 𝑓(𝑥) at countably many points. To be more precise,
suppose that 𝑓 ∶ ℝ → ℝ is a function that is integrable on [−1, 1]. Let’s change its value at a single point and deﬁne

𝑓(0) + 1 if 𝑥 = 0
𝑓 ∗ (𝑥) = {
𝑓(𝑥) otherwise.

If a partition is given by −1 = 𝑥0 < … 𝑥𝑘−1 ≤ 0 < 𝑥𝑘 < ⋯ < 𝑥𝑛 = 1, then

∣𝐿[𝑓, 𝑋] − 𝐿[𝑓 ∗ , 𝑋]∣ = ∣ inf 𝑓(𝑥) − inf 𝑓 ∗ (𝑥)∣(𝑥𝑘 − 𝑥𝑘−1 )

𝑥∈[𝑥𝑘−1 ,𝑥𝑘 ]
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟ 𝑥∈[𝑥𝑘−1 ,𝑥𝑘 ]
∶=𝑚

and

∣𝑈 [𝑓, 𝑋] − 𝑈 [𝑓 ∗ , 𝑋]∣ = ∣ sup 𝑓(𝑥) − sup 𝑓 ∗ (𝑥)∣(𝑥𝑘 − 𝑥𝑘−1 )

𝑥∈[𝑥 ,𝑥 ]
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
𝑘−1 𝑘 𝑥∈[𝑥𝑘−1 ,𝑥𝑘 ]
∶=𝑀

26.4. Integration as the inverse of diﬀerentiation 297

Mathematics of Machine Learning

holds. We can select the partition such that 𝑥𝑘 − 𝑥𝑘−1 < 𝜀 = min{𝑚, 𝑀 } for some arbitrary 𝜀 > 0, thus, ∣𝐿[𝑓, 𝑋] −
𝐿[𝑓 ∗ , 𝑋]∣ and ∣𝑈 [𝑓, 𝑋] − 𝑈 [𝑓 ∗ , 𝑋]∣ can be made as small as needed. This implies that
𝑏 𝑏
∫ 𝑓(𝑥)𝑑𝑥 = ∫ 𝑓 ∗ (𝑥)𝑑𝑥.
𝑎 𝑎

Hence, saying that integration is the inverse of differentiation is mathematically a bit imprecise. Given a differentiable
𝑥
function 𝐹 (𝑥), its derivative is unique, but there are infinitely many functions whose integral 𝐹 (𝑎) + ∫𝑎 𝑔(𝑦)𝑑𝑦 recon-
structs 𝐹 .

26.5 Integrals in machine learning

After all this theory, you might ask: what does integration have to do with machine learning? Without being mathemati-
cally rigorous, here is a (very) brief overview of what’s to come.
First, you can think about integration as a continuous generalization of the arithmetic mean. As you can see, for equidistant
partitions, an approximating sum

1 𝑛
𝑆[𝑓, 𝑋, 𝜉] = ∑ 𝑓(𝜉𝑖 )
𝑛 𝑖=1

is exactly the average of 𝑓(𝜉1 ), … , 𝑓(𝜉𝑛 ). In machine learning, averages are frequently used to express quantities. Think
about it: overall loss functions are often averages of certain individual losses. On a ﬁne enough scale, averages become
integrals.
Along with linear algebra and calculus, the central pillar of machine learning is probability theory and statistics, which gives
us a way to model the world based on our observations. Probability and statistics are the logic of science and decision-
making. There, integration is used to express probabilities, expected value, information, and many more. Without a
rigorous theory of integration, we cannot build probabilistic models beyond a certain point.

298 Chapter 26. Integration in theory

CHAPTER

TWENTYSEVEN

INTEGRATION IN PRACTICE

Even though we understand what an integral is, we are far from computing them in practice. As opposed to differentiation,
analytically evaluating an integral can be really difficult and often downright impossible. The formula (26.8) suggests that
the key is to find the function whose derivative is the integrand, called the antiderivative or primitive function. This is
harder than you think. Nevertheless, there are several tools for this, and we are going to devote this section to study the
most important ones.
The key is often finding the antiderivative, so we introduce the notation

𝐹 = ∫ 𝑓𝑑𝑥,

for the functions where 𝐹 ′ = 𝑓. Note that since (𝐹 + some constant)′ = 𝐹 ′ , the antiderivative ∫ 𝑓𝑑𝑥 is not uniquely
determined. However, this is not an issue for us, as the Newton-Leibniz formula states that
𝑏
∫ 𝑓(𝑥)𝑑𝑥 = 𝐹 (𝑏) − 𝐹 (𝑎).
𝑎

Thus, any additional constants would be eliminated.

With this under our belt, we are ready to dig deep into evaluating integrals in practice.

27.1 Integrals and operations

As we have seen this several times (for instance when discussing the rules of diﬀerentiation), the relations of an operation
with addition, multiplication, and possibly others are extremely useful for gaining insight and develop practical tools.
This is the same for integration as well. Similarly as before, the linearity of the integral is our main tool to evaluate them.

Theorem 26.1.1 (Linearity of the Riemann integral.)

Let 𝑓, 𝑔 ∶ ℝ → ℝ be two functions that are integrable on [𝑎, 𝑏]. Then
𝑏 𝑏 𝑏
(a) ∫𝑎 𝑓(𝑥) + 𝑔(𝑥)𝑑𝑥 = ∫𝑎 𝑓(𝑥)𝑑𝑥 + ∫𝑎 𝑔(𝑥)𝑑𝑥,
𝑏 𝑏
(b) ∫𝑎 𝑐𝑓(𝑥)𝑑𝑥 = 𝑐 ∫𝑎 𝑓(𝑥)𝑑𝑥.

Proof. (a) If 𝑓 and 𝑔 is integrable, then for any 𝜀 > 0, there are partitions 𝑋𝑓 , 𝑋𝑔 such that
𝑏 𝑏
∫ 𝑓(𝑥)𝑑𝑥 − 𝜀 ≤ 𝐿[𝑓, 𝑋𝑓 ] ≤ 𝑈 [𝑓, 𝑋𝑓 ] ≤ ∫ 𝑓(𝑥)𝑑𝑥 + 𝜀
𝑎 𝑎

299
Mathematics of Machine Learning

and
𝑏 𝑏
∫ 𝑔(𝑥)𝑑𝑥 − 𝜀 ≤ 𝐿[𝑔, 𝑋𝑔 ] ≤ 𝑈 [𝑔, 𝑋𝑔 ] ≤ ∫ 𝑔(𝑥)𝑑𝑥 + 𝜀,
𝑎 𝑎

where the lower and upper sums are deﬁned by (26.2) and (26.3). So, for the mutual reﬁnement 𝑋 = 𝑋𝑓 ∪ 𝑋𝑔 , we have
𝑏 𝑏
∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥 − 2𝜀 ≤ 𝐿[𝑓, 𝑋] + 𝐿[𝑔, 𝑋]
𝑎 𝑎
≤ 𝑆[𝑓, 𝑋, 𝜉𝑋 ] + 𝑈 [𝑔, 𝑋, 𝜉𝑋 ]
≤ 𝑈 [𝑓, 𝑋𝑓 ] + 𝑈 [𝑔, 𝑋𝑔 ]
𝑏 𝑏
≤ ∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥 + 2𝜀,
𝑎 𝑎

where 𝑆 is deﬁned by (26.7). From this deﬁnition, it can also be seen that

𝑆[𝑓 + 𝑔, 𝑋, 𝜉𝑋 ] = 𝑆[𝑓, 𝑋, 𝜉𝑋 ] + 𝑆[𝑔, 𝑋, 𝜉𝑋 ].

Thus,
𝑏 𝑏
∣ ∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥 − 𝑆[𝑓 + 𝑔, 𝑋, 𝜉𝑋 ]∣ ≤ 𝜀,
𝑎 𝑎

implying that
𝑏 𝑏
lim 𝑆[𝑓 + 𝑔, 𝑋, 𝜉𝑋 ] = ∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥
|𝑋|→0 𝑎 𝑎
𝑏
= ∫ 𝑓(𝑥) + 𝑔(𝑥)𝑑𝑥.
𝑎

Our Theorem regarding the approximating sum 𝑆 implies that 𝑓 + 𝑔 is integrable on [𝑎, 𝑏] and
𝑏 𝑏 𝑏
∫ 𝑓(𝑥) + 𝑔(𝑥) = ∫ 𝑓(𝑥)𝑑𝑥 + ∫ 𝑔(𝑥)𝑑𝑥.
𝑎 𝑎 𝑎

(b) This follows from the fact that 𝑆[𝑐𝑓, 𝑋, 𝜉𝑋 ] = 𝑐𝑆[𝑓, 𝑋, 𝜉𝑋 ]. □

27.2 Partial integration

As we have learned when studying the rules of diﬀerentiation, for an arbitrary 𝑓 and 𝑔, we have

(𝑓𝑔)′ = 𝑓 ′ 𝑔 + 𝑓𝑔′ .

Applying this logic to the antiderivatives,

𝑓𝑔 = ∫ 𝑓 ′ 𝑔 + 𝑓𝑔′ 𝑑𝑥

holds. Rearranging the equation a bit, we obtain the formula of partial integration:

∫ 𝑓 ′ 𝑔 = 𝑓𝑔 − ∫ 𝑓𝑔′ . (27.1)

This is summed up in the following theorem.

300 Chapter 27. Integration in practice

Mathematics of Machine Learning

Theorem 26.2.1 (Partial integration.)

Let 𝑓, 𝑔 ∶ ℝ → ℝ be two functions. If both are diﬀerentiable on the interval [𝑎, 𝑏], then
𝑏 𝑏
𝑥=𝑏
∫ 𝑓 ′ (𝑥)𝑔(𝑥)𝑑𝑥 = [𝑓(𝑥)𝑔(𝑥)]𝑥=𝑎 − ∫ 𝑓(𝑥)𝑔′ (𝑥)𝑑𝑥
𝑎 𝑎

holds.

How is this useful for us? Consider a situation where ﬁnding the antiderivative of 𝑓 and the derivative of 𝑔 is easy, but
the antiderivative of the product 𝑓𝑔 is hard. For example, can you quickly calculate

∫ 𝑥 log 𝑥𝑑𝑥?

Applying (27.1) with the roles 𝑓 ′ (𝑥) = 𝑥 and 𝑔(𝑥) = log 𝑥 immediately yields

1 2 1
∫ 𝑥 log 𝑥 = 𝑥 log 𝑥 − ∫ 𝑥𝑑𝑥
2 2
1 2 1 2
= 𝑥 log 𝑥 − 𝑥 .
2 4

27.3 Integration by substitution

As the partial integration formula is the “opposite” of the diﬀerentiation rule for products, there is an analogue for the
chain formula as well. Recall that for two diﬀerentiable functions, we had

(𝑓 ∘ 𝑔)′ (𝑥) = 𝑓 ′ (𝑔(𝑥))𝑔′ (𝑥).

Translating this to the language of integrals, we obtain the following result.

Theorem 26.3.1 (Integration by substitution.)

Let 𝑓, 𝑔 ∶ ℝ → ℝ be integrable functions on [𝑎, 𝑏]. Suppose that 𝑓 is continuous and 𝑔 is diﬀerentiable. Then
𝑏 𝑔(𝑏)
∫ 𝑓(𝑔(𝑦))𝑔′ (𝑦)𝑑𝑦 = ∫ 𝑓(𝑥)𝑑𝑥
𝑎 𝑔(𝑎)

holds.

This is called integration by substitution. To give you an example of its use, consider

∫ 𝑥 sin 𝑥2 𝑑𝑥.

With the roles 𝑦(𝑥) = 𝑥2 , we have

1
∫ 𝑥 sin 𝑥2 𝑑𝑥 = ∫ sin 𝑦𝑑𝑦
2
1
= − cos 𝑦
2
1
= − cos 𝑥2 .
2

27.3. Integration by substitution 301

Mathematics of Machine Learning

Partial integration and substitution are our main weapons when calculating integrals on paper. Most of the integrals one
might encounter can be solved with the creative (and possible iterated) application of these two rules. The recipe is simple:
find the antiderivative, then use the Newton-Leibniz formula to compute the value of the integral.
However, there is a serious issue: antiderivatives can be extremely hard to find, maybe even impossible. This makes
integrals difficult to compute symbolically. For instance, consider

2
∫ 𝑒−𝑥 𝑑𝑥,

2 2
where the function 𝑒−𝑥 describes the well-known Gaussian bell curve. As surprising as it is, ∫ 𝑒−𝑥 𝑑𝑥 cannot be de-
scribed with a closed formula! (That is, one that uses a ﬁnite number of operations and only elementary functions.) It’s
not that mathematicians were not clever enough to discover that, this is proven to be impossible.
Thus, computing integrals is much simpler to do numerically. This is in stark contrast with diﬀerentiation, which is easy
to do symbolically, but hard numerically.

27.4 Numerical integration

Instead of using symbolic computation to get the exact value of an integral, we will resolve to approximation once again.
Previously, Theorem 25.3.1 showed us that an integral is the limit of the Riemann-sums:
𝑏 𝑛
∫ 𝑓(𝑥)𝑑𝑥 = lim ∑ 𝑓(𝜉𝑖 )(𝑥𝑖,𝑛 − 𝑥𝑖−1,𝑛 ), (27.2)
𝑛→∞
𝑎 𝑖=1

where 𝑋𝑛 = {𝑥0,𝑛 , … , 𝑥𝑛,𝑛 } is a partition of [𝑎, 𝑏] and 𝜉𝑖 ∈ [𝑥𝑖−1,𝑛 , 𝑥𝑖,𝑛 ] are arbitrary intermediate values.
𝑛 𝑏
In other words, if 𝑛 is large enough, the sum ∑𝑖=1 𝑓(𝜉𝑖 )(𝑥𝑖,𝑛 − 𝑥𝑖−1,𝑛 ) is close to ∫𝑎 𝑓(𝑥)𝑑𝑥. There are two crucial
issues: ﬁrst, how to select the partition and the intermediate values; second, how fast is the convergence?
If we want to make (27.2) useful, we have to devise a concrete method that prescribes the 𝑥𝑖 -s, 𝜉𝑖 -s, and tells us how
large of an 𝑛 should we select. This is an extremely rich subject that has been the focus of studies ever since the since the
introduction of integration. So, there is a lot to talk about here. To keep things simple, let’s just focus on the essentials.
The most straightforward method is to select an uniform partition, then approximate the area under the function curve
with a sequence of trapezoids. That is, let 𝑋 = {𝑎, 𝑎 + 𝑏−𝑎 𝑏−𝑎
𝑛 , 𝑎 + 2 𝑛 , … , 𝑏}, and

As the trapezoid’s area is given by ℎ 𝑎+𝑏

2 , the area under the curve in [𝑥𝑖−1 , 𝑥𝑖 ] is approximated by

𝑓(𝑥𝑖 ) + 𝑓(𝑥𝑖−1 ) 𝑓(𝑥𝑖 ) + 𝑓(𝑥𝑖−1 )

(𝑥𝑖 − 𝑥𝑖−1 ) = ,
2 2𝑛
we have the approximation
𝑏
1 𝑛 𝑓(𝑥𝑖−1 ) + 𝑓(𝑥𝑖 )
∫ 𝑓(𝑥)𝑑𝑥 ≈ ∑
𝑎 𝑛 𝑖=1 2
(27.3)
1 1 𝑛−1 𝑏−𝑎
= (𝑓(𝑥0 ) + 𝑓(𝑥𝑛 )) + ∑ 𝑓(𝑥𝑖 ), 𝑥𝑖 = 𝑎 + 𝑖 .
2𝑛 𝑛 𝑖=1 𝑛

This is called the trapezoidal rule. It might seem complicated, but (27.3) is just a weighted sum of the 𝑓(𝑥𝑖 ) values.
Its rate of convergence is quadratic, as stated by the following theorem.

Theorem 26.4.1 (Trapezoidal rule.)

302 Chapter 27. Integration in practice

Mathematics of Machine Learning

Fig. 27.1: Approximating the area under a function with successive trapezoids.

27.4. Numerical integration 303

Mathematics of Machine Learning

Let 𝑓 ∶ [𝑎, 𝑏] → ℝ be a twicely diﬀerentiable function, and let

1 𝑛 𝑓(𝑥𝑖−1 ) + 𝑓(𝑥𝑖 )
𝐼𝑛 ∶= ∑
𝑛 𝑖=1 2

be the approximation given by the trapezoidal rule. Then

𝑏
1
∣ ∫ 𝑓(𝑥)𝑑𝑥 − 𝐼𝑛 ∣ = 𝑂( ).
𝑎 𝑛2

There are other methods, for instance, Simpson’s rule approximates the function with a piecewise quadratic one. (Instead
of a piecewise linear one, like we did for the trapezoidal rule.) Since the approximation is more accurate, the convergence
is also faster: Simpson’s rule converges at a 𝑂(𝑛−4 ) rate. Without going into details, it is given by
⌊𝑛/2⌋
𝑏−𝑎
𝑆𝑛 = ∑ 𝑓(𝑥2𝑖−2, ) + 4𝑓(𝑥2𝑖−𝑖 ) + 𝑓(𝑥2𝑖 ), (27.4)
3𝑛 𝑖=1

𝑏
with ∣ ∫𝑎 𝑓(𝑥)𝑑𝑥 − 𝑆𝑛 ∣ = 𝑂(𝑛−4 ), where 𝑥𝑖 is again the equidistant partition 𝑥𝑖 = 𝑎 + 𝑖 𝑏−𝑎
𝑛 .

The formula (27.4) can be diﬃcult to unpack, but the essence remains the same: we compute the function’s values at
given points, then take their weighted sum.

27.4.1 Implementing the trapezoidal rule

To show you how straightforward the trapezoidal rule is, let’s implement in practice! To keep it simple, we are imple-
menting this as a function that takes another function as its input.

def trapezoidal_rule(f, a, b, n):

partition = [a + i*(b - a)/n for i in range(n+1)]
vals = [f(x) for x in partition]
I_n = (vals[0] + vals[-1])/(2*n) + sum(vals[1:-1])/n

return I_n

This can be made even simpler with NumPy, but I’ll leave this to you as an exercise. Let’s test it on an example instead!
With the use of the Newton-Leibniz formula, you can verify that
1
1
∫ 𝑥2 𝑑𝑥 = .
0 3

(We even computed this with our bare hands, using lower and upper sums.) After plugging in the function lambda x:
x**2 into trapezoidal_rule, we can see that this method is indeed correct.

import matplotlib.pyplot as plt

with plt.style.context("seaborn-white"):
plt.figure(figsize=(7, 7))
ns = range(1, 25, 1)
Is = [trapezoidal_rule(lambda x: x**2, 0, 1, n) for n in ns]
plt.scatter(ns, Is)

304 Chapter 27. Integration in practice

Mathematics of Machine Learning

27.5 Conclusion

Congratulations! With all these knowledge about integration under your belt, you ﬁnished with the technically most
challenging subject so far.
However, we are just getting started. In machine learning, things are happening in spaces with millions of dimensions.
So, we need to generalize all the tools we have developed so far. Fortunately, a solid knowledge of single-variable calculus
is an excellent guide for multivariable functions as well. Some concepts work similarly, some have to be re-thought.
Have some rest, maybe brieﬂy review what we have done so far, then go dive deep into the next chapter: multivariable
calculus.

27.5. Conclusion 305

Mathematics of Machine Learning

306 Chapter 27. Integration in practice

CHAPTER

TWENTYEIGHT

WHY DOES GRADIENT DESCENT WORK?

Young man, in mathematics you don’t understand things. You just get used to them. — John von Neumann
In the practice of machine learning, we use gradient descent so much that we get used to it. We hardly ever question why
it works.
What’s usually told is the mountain-climbing analogue: to find the peak (or the bottom) of a bumpy terrain, one has to
look at the direction of the steepest ascent (or descent), and take a step in that direction. This direction is desribed by
the gradient, and the iterative process of finding local extrema by following the gradient is called gradient ascent/descent.
(Ascent for finding peaks, descent for finding valleys.)
However, this is not a mathematically precise explanation. There are several questions left unanswered, and based on our
mountain-climbing intuition, it’s not even clear if the algorithm works.
Without a precise understanding of gradient descent, we are practically flying blind. In this chapter, our goal is to look
behind gradient descent and reveal the magic behind it.
Understanding the “whys” of the gradient descent starts with one of the most beautiful areas of mathematics: differential
equations.

28.1 Diﬀerential equations 101

28.1.1 What is a diﬀerential equation?

Equations play an essential role in mathematics. This is common wisdom, but there is a deep truth behind it. Quite often,
equations arise from modeling systems such as interactions in a biochemical network, economic processes, and thousands
more. For instance, modelling the metabolic processes in organisms yields linear equations of the form

𝐴𝑥 = 𝑏, 𝐴 ∈ ℝ𝑛×𝑛 , 𝑥, 𝑏 ∈ ℝ𝑛

where the vectors 𝑥 and 𝑏 represent the concentration of molecules (where 𝑥 is the unknown), and the matrix 𝐴 represents
the interactions between them. Linear equations are easy to solve, and we understand quite a lot about them.
However, the equations we have seen so far are unﬁt to model dynamical systems, as they lack a time component. To
describe, for example, the trajectory of a space station orbiting around Earth, we have to describe our models in terms of
functions and their derivatives.
For instance, the trajectory of a swinging pendulum can be described by the equation
𝑔
𝑥′′ (𝑡) + sin 𝑥(𝑡) = 0, (28.1)
𝐿
where
• 𝑥(𝑡) describes the angle of the pendulum from the vertical,

307
Mathematics of Machine Learning

• 𝐿 is the length of the (massless) rod that our object of mass 𝑚 hangs on,
• and 𝑔 is the gravitational acceleration constant ≈ 9.7𝑚/𝑠2 .
According to the original interpretation of differentiation, if 𝑥(𝑡) describes the movement of the pendulum at time 𝑡, then
𝑥′ (𝑡) and 𝑥′′ (𝑡) describe the velocity and the acceleration of it, where the differentiation is taken with respect to the time
𝑡.
(In fact, the differential equation (28.1) is a direct consequence of Newton’s second law of motion.)

Fig. 28.1: A swinging pendulum.

Equations involving functions and their derivatives, such as (28.1), are called ordinary differential equations, or ODEs
in short. Without any overexaggeration, their study has been the main motivating force of mathematics since the 17th
century. Trust me when I say this, differential equations are one of the most beautiful objects in mathematics. As we are
about to see, the gradient descent algorithm is, in fact, an approximate solution of differential equations.
The first part of this chapter will serve as a quickstart to differential equations. I am mostly going to follow the fantastic
Nonlinear Dynamics and Chaos book by Steven Strogatz [[Str00]]. If you ever feel the desire to dig deep into dynamical
systems, I wholeheartedly recommend this book to you. (This is one of my favorite math books ever, it reads like a novel.
The quality and clarity of its exposition serves as a continuous inspiration for my writing.)

28.1.2 The (slightly more) general form of ODEs

Let’s dive straight into the deep waters and start with an example to get a grip on diﬀerential equations. Quite possibly,
the simplest example is the equation

𝑥′ (𝑡) = 𝑥(𝑡),

where the diﬀerentiation is taken with respect to the time variable 𝑡. If, for example, 𝑥(𝑡) is the size of a bacterial colony,
the equation 𝑥′ (𝑡) = 𝑥(𝑡) describes its population dynamics if the growth is unlimited. Think about 𝑥′ (𝑡) as the rate
at which the population grows: if there are no limitations in space and nutrients, every bacterial cell can freely replicate
whenever possible. Thus, since every cell can freely divide, the speed of growth matches the colony’s size.

308 Chapter 28. Why does gradient descent work?

Mathematics of Machine Learning

In plain English, the solutions of the equation 𝑥′ (𝑡) = 𝑥(𝑡) are functions whose derivatives are themselves. After a bit of
thinking, we can come up with a family of solutions: 𝑥(𝑡) = 𝑐𝑒𝑡 , where 𝑐 ∈ ℝ is an arbitrary constant. (Recall that 𝑒𝑡 is
an elementary function, and we have seen that its derivative is itself.)
If you are a visual person, some of the solutions are plotted on Fig. 28.2.

Fig. 28.2: Some solutions of the exponential growth equation.

There are two key takeaways here: diﬀerential equations describe dynamical processes that change in time, and they can
have multiple solutions. Each solution is determined by two factors: the equation itself 𝑥′ (𝑡) = 𝑥(𝑡), and an initial
condition 𝑥(0) = 𝑥∗ . If we specify 𝑥(0) = 𝑥∗ , then the value of 𝑐 is given by

𝑥(0) = 𝑐𝑒0 = 𝑐 = 𝑥∗ .

Thus, ODEs have a bundle of solutions, each one determined by the initial condition. So, it’s time to discuss diﬀerential
equations in more general terms!

Deﬁnition 27.1.1 (Ordinary diﬀerential equations in one dimension.)

Let 𝑓 ∶ ℝ → ℝ be a diﬀerentiable function. The equation

𝑥′ (𝑡) = 𝑓(𝑥(𝑡)) (28.2)

is called a ﬁrst-order homogeneous ordinary diﬀerential equation.

When it is clear, the dependence on 𝑡 is often omitted, so we only write 𝑥′ = 𝑓(𝑥). (Some resources denote the time
derivative by 𝑥,̇ a notation that can be originated from Newton. We will not use this, though it is good to know.)
The term “first-order homogeneous ordinary differential equation” doesn’t exactly roll off the tongue, and it is overloaded
with heavy terminology. So, let’s unpack what is going on here.
The differential equation part is clear: it is a functional equation that involves derivatives. Since the time 𝑡 is the only
variable, the differential equation is ordinary. (As opposed to differential equations involving multivariable functions

28.1. Diﬀerential equations 101 309

Mathematics of Machine Learning

and partial derivatives, but more on those later.) As only the first derivative is present, the equation becomes first-order.
Second-order would involve second derivatives, and so on. Finally, since the right-hand side 𝑓(𝑥) doesn’t explicitly depend
on the time variable 𝑡, the equation is homogeneous in time. Homogeneity means that the rules governing our dynamical
system don’t change over time.
Don’t let the 𝑓(𝑥(𝑡)) part scare you! For instance, in our example 𝑥′ (𝑡) = 𝑥(𝑡), the role of 𝑓 is cast to the identity
function 𝑓(𝑥) = 𝑥. In general, 𝑓(𝑥) establishes a relation between the quantity 𝑥(𝑡) (which can be position, density, etc)
and its derivative, that is, its rate of change.
As we have seen, we think in terms of differential equations and initial conditions that pinpoint solutions among a bundle
of functions. Let’s put this into a proper mathematical definition!

Deﬁnition 27.1.2 (Initial value problems.)

Let 𝑥′ = 𝑓(𝑥) be a ﬁrst order homogeneous ordinary diﬀerential equation and let 𝑥0 ∈ ℝ be an arbitrary value. The
system

𝑥′ = 𝑓(𝑥)
{
𝑥(𝑡0 ) = 𝑥0

is called an initial value problem. If a function 𝑥(𝑡) satisﬁes both conditions, it is said to be a solution to the initial value
problem.

Most often, we select 𝑡0 to be 0. After all, we have the freedom to select the origin of the time as we want.
Unfortunately, things are not as simple as they seem. In general, differential equations and initial value problems are tough
to solve. Except for a few simple ones, we cannot find exact solutions. (And when I say we, I include every person on
the planet.) In these cases, there are two things that we can do: either we construct approximate solutions via numeric
methods or turn to qualitative methods that study the behavior of the solutions without actually finding them.
We’ll talk about both, but let’s turn to the qualitative methods first. As we’ll see, looking from a geometric perspective
gives us a deep insight into how differential equations work.

28.1.3 A geometric interpretation of diﬀerential equations

When ﬁnding analytic solutions is not feasible, we look for a qualitative understanding of the solutions, focusing on the
local and long-term behavior instead of formulas.
Imagine that given a diﬀerential equation

𝑥′ (𝑡) = 𝑓(𝑥(𝑡)),

you are interested in a particular solution that assumes the value 𝑥∗ at time 𝑡0 . For instance, you could be studying the
dynamics of a bacterial colony and want to provide a predictive model to ﬁt your latest measurement 𝑥(𝑡0 ) = 𝑥∗ . In the
short term, where will your solutions go?
We can immediately notice that if 𝑥(𝑡0 ) = 𝑥∗ and 𝑓(𝑥∗ ) = 0, then the constant function 𝑥(𝑡) = 𝑥∗ is a solution! These
are called equilibrium solutions, and they are extremely important. So, let’s make a formal deﬁnition!

Deﬁnition 27.1.3 (Equilibrium solutions.)

Let

𝑥′ (𝑡) = 𝑓(𝑥(𝑡)) (28.3)

be a ﬁrst order homogeneous ODE, and let 𝑥∗ ∈ ℝ be an arbitrary point. If 𝑓(𝑥∗ ) = 0, then 𝑥∗ is called an equilibrium
point of the equation 𝑥′ = 𝑓(𝑥).

310 Chapter 28. Why does gradient descent work?

Mathematics of Machine Learning

For equilibrium points, the constant function 𝑥(𝑡) = 𝑥∗ is a solution of (28.3). This is called an equilibrium solution.

Think about our recurring example, the simplest ODE 𝑥′ (𝑡) = 𝑥(𝑡). As mentioned, we can interpret this equation as a
model of unrestricted population growth under ideal conditions. In that case, 𝑓(𝑥) = 𝑥, and this is zero only for 𝑥 = 0.
Therefore, the constant 𝑥(𝑡) = 0 function is a solution. This makes perfect sense: if a population has zero individuals, no
change is going to happen in its size. In other words, the system is in equilibrium.
Like a pendulum that stopped moving and reached its resting point at the bottom. However, pendulums have two equilibria:
one at the top and one at the bottom. (Let’s suppose that the mass is held by a massless rod. Otherwise, it would collapse)
At the bottom, you can push the hanging mass all you want, it’ll return to rest. However, at the top, any small push would
disrupt the equilibrium state, to which it would never return.
To shed light on this phenomenon, let’s look at another example: the famous logistic equation

𝑥′ (𝑡) = 𝑥(𝑡)(1 − 𝑥(𝑡)). (28.4)

From a population dynamics perspective, if our favorite equation 𝑥′ (𝑡) = 𝑥(𝑡) describes the unrestricted growth of a
bacterial colony, the logistic equation models the population growth under a resource constraint. If we assume that 1
is the total capacity of our population, the growth becomes more difficult as the size approaches this limit. Thus, the
population’s rate of change 𝑥′ (𝑡) can be modelled as 𝑥(𝑡)(1 − 𝑥(𝑡)), where the term 1 − 𝑥(𝑡) slows down the process as
the colony nears the sustain capacity.
We can write the logistic equation in the general form (28.2) by casting the role 𝑓(𝑥) = 𝑥(1 − 𝑥). Do you recall our
theorem about the relation of derivatives and monotonicity? Translated to the differential equation 𝑥′ = 𝑓(𝑥), this reveals
the flow of our solutions! To be specific,

⎧increasing if 𝑓(𝑥) > 0,

{
𝑥(𝑡) is ⎨decreasing if 𝑓(𝑥) < 0,
{constant if 𝑓(𝑥) = 0.
⎩
We can visualize this in the so-called phase portrait.
Thus, the monotonicity describes long-term behavior:

⎧1 if 𝑥′ (0) > 0,
{
lim 𝑥(𝑡) = 0
⎨ if 𝑥′ (0) = 0, (28.5)
𝑡→∞
{−∞ if 𝑥′ (0) < 0.
⎩
With a little bit of calculation (whose details are not essential for us), we can obtain that we can write the solutions as
1
𝑥(𝑡) = ,
1 + 𝑐𝑒−𝑡
where 𝑐 ∈ ℝ is an arbitrary constant. For 𝑐 = 1, this is the famous Sigmoid function. You can check by hand that these
are indeed solutions. We can even plot them, as shown in Fig. 28.4 below.
As we can see in Fig. 28.4, the monotonicity of the solutions are as we predicted in (28.5).
We can characterize the equilibria based on the long-term behavior of nearby solutions. (In the case of our logistic
equation, the equilibria are 0 and 1.) This can be connected to the local behavior of 𝑓: if it decreases around the
equilibrium 𝑥∗ , it attracts the nearby solutions. On the other hand, if 𝑓 increases around 𝑥∗ , then the nearby solutions are
repelled.
This gives rise to the concept of stable and unstable equilibria.

Deﬁnition 27.1.4 (Stable and unstable equilibria.)

Let 𝑥′ = 𝑓(𝑥) be a first-order homogeneous ordinary differential equation, and suppose that 𝑓 is differentiable. Moreover,
let 𝑥∗ be an equilibrium of the equation.

28.1. Diﬀerential equations 101 311

Mathematics of Machine Learning

Fig. 28.3: The ﬂow of solutions for 𝑥′ = 𝑥(1 − 𝑥), visualized on the phase portrait. (The arrows represent the direction
where the solutions for given initial values are headed.)

Fig. 28.4: Solutions of the logistic diﬀerential equation 𝑥′ = 𝑥(1 − 𝑥).

312 Chapter 28. Why does gradient descent work?

Mathematics of Machine Learning

𝑥∗ is called a stable equilibrium if there is a neighborhood (𝑥∗ −𝜀, 𝑥∗ +𝜀) around 𝑥∗ such that for all 𝑥0 ∈ (𝑥∗ −𝜀, 𝑥∗ +𝜀),
the solution of the initial value problem

𝑥′ = 𝑓(𝑥)
{
𝑥(0) = 𝑥0

converges towards 𝑥∗ . (That is, lim𝑡→∞ 𝑥(𝑡) = 𝑥∗ holds.)

If 𝑥∗ is not stable, it is called unstable.

In the case of the logistic ODE 𝑥′ = 𝑥(1 − 𝑥), 𝑥∗ = 1 is a stable and 𝑥∗ = 0 is an unstable equilibrium. This makes
sense given its population dynamics interpretation: the equilibrium 𝑥∗ = 1 means that the population is at maximum
capacity. If the size is slightly above or below the capacity 1, some specimens die due to starvation, or the colony reaches
its constraints. On the other hand, no matter how small the population is, it won’t ever go extinct in this ideal model.
Recall how the derivatives characterize the monotonicity of diﬀerentiable functions? With this, we have a simple tool that
can help us decide whether a given equilibrium is stable or not.

Theorem 27.1.1
Let 𝑥′ = 𝑓(𝑥) be a first-order homogeneous ordinary differential equation, and suppose that 𝑓 is differentiable. Moreover,
let 𝑥∗ be a equilibrium point of the equation.
If 𝑓 ′ (𝑥∗ ) < 0, then 𝑥∗ is a stable equilibrium.

The concept of stable equilibrium is fundamental, even in the most general cases. At this point, it’s time to take a few
steps backward and remind ourselves why we are here: to understand gradient descent. If stable equilibria remind you of
a local minimum which a gradient descent process converges towards, it is not an accident. We are ready to see what’s
behind the scenes.

28.2 A continuous version of gradient ascent

Now, let’s talk about maximizing a function 𝐹 ∶ ℝ → ℝ. Suppose that 𝐹 is twice diﬀerentiable, and we denote its
derivative by 𝐹 ′ = 𝑓. Luckily, the local maxima of 𝐹 can be found with the help of its second derivative by looking for
𝑥∗ where 𝑓(𝑥∗ ) = 0 and 𝑓 ′ (𝑥∗ ) < 0.
Does this look familiar? If 𝑓(𝑥∗ ) = 0 indeed holds, then 𝑥(𝑡) = 𝑥∗ is an equilibrium solution; and since 𝑓 ′ (𝑥∗ ) < 0, it
attracts the nearby solutions as well. This means that if 𝑥0 is drawn from the basin of attraction and 𝑥(𝑡) is the solution
of the initial value problem

𝑥′ = 𝑓(𝑥)
{ (28.6)
𝑥(0) = 𝑥0 ,

then lim𝑡→∞ 𝑥(𝑡) = 𝑥∗ . In other words, the solution converges towards 𝑥∗ , a local maxima of 𝐹 ! This is gradient ascent
in a continuous version.
We are happy, but there is an issue. We’ve talked about how hard solving diﬀerential equations are. For a general 𝐹 , we
have no prospects to actually ﬁnd the solutions. Fortunately, we can approximate them.

28.2. A continuous version of gradient ascent 313

Mathematics of Machine Learning

28.2.1 Gradient ascent as a discretized diﬀerential equation

When studying differentiation in practice, we have seen that derivatives can be approximated numerically by the forward
difference
𝑥(𝑡 + ℎ) − 𝑥(𝑡)
𝑥′ (𝑡) ≈ .
ℎ
If 𝑥(𝑡) is indeed the solution for the initial value problem (28.6), we are in luck! Using forward differences, we can take
a small step from 0 and approximate 𝑥(ℎ) by substituting the forward difference into the differential equation. To be
precise, we have

𝑥(ℎ) − 𝑥(0)
≈ 𝑓(𝑥(0)),
ℎ
from which

𝑥(ℎ) ≈ 𝑥(0) + ℎ𝑓(𝑥(0))

follows. By deﬁning 𝑥0 and 𝑥1 by

𝑥0 ∶= 𝑥(0),
𝑥1 ∶= 𝑥0 + ℎ𝑓(𝑥0 ),
we have 𝑥1 ≈ 𝑥(ℎ). If this looks like the ﬁrst step of the gradient ascent (25.1) to you, you are on the right track. Using
the forward diﬀerence once again, this time from the point 𝑥(ℎ), we obtain

𝑥(2ℎ) ≈ 𝑥(ℎ) + ℎ𝑓(𝑥(ℎ))

≈ 𝑥1 + ℎ𝑓(𝑥1 ),

thus by defining 𝑥2 ∶= 𝑥1 + ℎ𝑓(𝑥1 ), we have 𝑥2 ≈ 𝑥(2ℎ). Notice that in 𝑥2 , two kinds of approximation errors are
accumulated: first the forward difference, then the approximation error of the previous step.
This motivates us to define the recursive sequence

𝑥0 ∶= 𝑥(0),
(28.7)
𝑥𝑛+1 ∶= 𝑥𝑛 + ℎ𝑓(𝑥𝑛 ),

which approximates 𝑥(𝑛ℎ) with 𝑥𝑛 , as this is implied by the very deﬁnition. This recursive sequence is the gradient
ascent itself, and the small step ℎ is the learning rate! Check (25.1) if you don’t believe me. (28.7) is called the Euler
method.
Without going into the details, if ℎ is small enough and 𝑓 “behaves properly”, the Euler method will converge to the
equilibrium solution 𝑥∗ . (Whatever proper behavior might mean.)
We only have one more step: to turn everything into gradient descent instead of ascent. This is extremely simple, as
gradient descent is just applying gradient ascent to −𝑓. Think about it: minimizing a function 𝑓 is the same as maximizing
its negative −𝑓. And with that, we are done! The famous gradient descent is a consequence of dynamical systems
converging towards their stable equilibria, and this is beautiful.

28.2.2 The gradient ascent in action

To see the gradient ascent (that is, the Euler method) in action, we should go back to our good old example: the logistic
equation (28.4). So, suppose that we want to ﬁnd the local maxima of the function
1 2 1 3
𝐹 (𝑥) = 𝑥 − 𝑥 ,
2 3
plotted in Fig. 28.5.

314 Chapter 28. Why does gradient descent work?

Mathematics of Machine Learning

Fig. 28.5: The graph of 𝐹 (𝑥) = 12 𝑥2 − 31 𝑥3 .

First, we can use what we learned and ﬁnd the maxima using the derivative 𝑓(𝑥) = 𝐹 ′ (𝑥) = 𝑥(1 − 𝑥), concluding that
there is a local maximum at 𝑥∗ = 1. (Don’t just take my word, check out Theorem 23.3.1 and work it out!)
Since 𝑓(𝑥∗ ) = 𝐹 ′ (𝑥∗ ) = 0 and 𝑓 ′ (𝑥∗ ) < 0, the point 𝑥∗ is a stable equilibrium of the logistic equation

𝑥′ = 𝑥(1 − 𝑥).

Thus, if the initial value 𝑥(0) = 𝑥0 is suﬃciently close to 𝑥∗ = 1, the solution 𝑥(𝑡) of the initial value problem

𝑥′ = 𝑥(1 − 𝑥),
{
𝑥(0) = 𝑥0 ,

then lim𝑡→∞ 𝑥(𝑡) = 𝑥∗ . (In fact, we can select any initial value 𝑥0 from the inﬁnite interval (0, ∞), and the convergence
will hold.) Upon discretization via the Euler method, we obtain the recursive sequence

𝑥0 = 𝑥(0),
𝑥𝑛+1 = 𝑥𝑛 + ℎ𝑥𝑛 (1 − 𝑥𝑛 ).

This process is visualized by Fig. 28.6.

We can even take the discrete solution provided by the Euler method, and plot it on the 𝑥-𝐹 (𝑥) plane.

28.2. A continuous version of gradient ascent 315

Mathematics of Machine Learning

Fig. 28.6: Solving 𝑥′ = 𝑥(1 − 𝑥) via the Euler-method. (For visualization purposes, the initial value was set at 𝑡0 = −5.)

Fig. 28.7: Mapping the Euler method on the 𝑥, 𝑥′ plane.

316 Chapter 28. Why does gradient descent work?

Mathematics of Machine Learning

28.3 What’s next?

To sum up what we’ve seen so far, our entire goal was to understand the very principles of gradient descent, the most
important optimization algorithm in machine learning. Its main principle is straightforward: to find a local minimum of
a function, first find the direction of decrease, then take a small step towards there. This seemingly naive algorithm has a
foundation that lies deep within differential equations. Turns out that if we look at our functions as rules determining a
dynamical system, local extrema correspond to equilibrium states. These dynamical systems are described by differential
equations, and the local maxima are equilibrium states that solutions towards them. From this viewpoint, the gradient
descent algorithm is nothing else than a numerical solution to this equation.
What we’ve seen so far only covers the single-variable case, and as I have probably told this many times, machine learning
is done in millions of dimensions. Still, the intuition we built up will be our guide in the study of multivariable functions
and high-dimensional spaces. There, the principles are the same, but the objects of study are much more complex. The
main challenge in multivariable calculus is to manage the complexity, and this is where our good friends, vectors and
matrices will do much of the heavy lifting.
Multivariable calculus is where linear algebra and the study of functions come together, providing the skeleton for building
and training neural networks. Let’s jump into it!

28.3. What’s next? 317

Mathematics of Machine Learning

318 Chapter 28. Why does gradient descent work?

Part IV

Multivariable functions

319
CHAPTER

TWENTYNINE

MULTIVARIABLE FUNCTIONS

How different is multivariable calculus from its single-variable counterpart? When I was a student, I had a professor who
used to say something like, “multivariable and single-variable functions behave the same, you just have to write more”.
Well, this couldn’t be further from the truth. Just think about what we are doing in machine learning: training models
with gradient descent; that is, finding a configuration of parameters that minimize a parametric function. In one variable
(which is not a realistic assumption), we can do this with the derivative. How can we extend the derivative to multiple
dimensions?
The inputs of multivariable functions are vectors. Thus, given a function 𝑓 ∶ ℝ𝑛 → ℝ, we can’t just define
𝑑𝑓 𝑓(x0 ) − 𝑓(x)
(x ) = lim , x0 , x ∈ ℝ𝑛
𝑑x 0 x→x0 x0 − x
to the analogue of Definition 21.1.1. Why? Because the division with the vector x0 − x is not defined.
As we’ll see, differentiation in multiple dimensions is much more complicated. Think about it: in one dimension, there
are only two directions, left and right. This is not true even for two dimensions, with an infinite number of directions at
each point.
So, what are multivariable functions anyway?

29.1 What is a multivariable function?

We introduced functions as general mappings between two sets. However, we’ve only discussed functions that map real
numbers to real numbers. Simple scalar-scalar functions are great for conveying ideas, but the world around us is much
more complex than what we could describe with them. At the other end of the spectrum, set-set functions are way too
general to be useful.
In practice, three categories are special enough to be analyzed mathematically but general enough to describe the patterns
in science and engineering: those that
1. map scalars to vectors, that is, 𝑓 ∶ ℝ → ℝ𝑛 (curves),
2. map vectors to scalars, that is, 𝑓 ∶ ℝ𝑛 → ℝ (scalar fields),
3. and those that map vectors to vectors, that is, 𝑓 ∶ ℝ𝑛 → ℝ𝑚 (vector fields).
The scalar-vector variants are called curves, the vector-scalar ones are surfaces, and the vector-vector functions are what
we call vector fields. This nomenclature looks a bit abstract, so let’s see some examples.
Scalar-vector functions, or curves in their more user-friendly name, are the mathematical representations of movement.
A space station orbiting around Earth describes a curve. So does the trajectory of a stock in the market.
To give you a concrete example, the scalar-vector function
cos(𝑡)
𝑓(𝑡) = [ ]
sin(𝑡)

321
Mathematics of Machine Learning

describes the unit circle. This is illustrated by Fig. 29.1.

Fig. 29.1: A scalar-vector function, that is, a curve.

Not all curves are closed. For example, the curve

cos(𝑡)
𝑔(𝑡) = ⎡ ⎤
⎢ sin(𝑡) ⎥
⎣ 𝑡 ⎦
represents a motion that spirals upward, as illustrated by Fig. 29.2. These curves are called open.
Because of their inherent ability to describe trajectories, scalar-vector functions are essential in mathematics and science.
Are you familiar with Newton’s second law of motion, stating that force equals mass times acceleration? This is described
by the equation, 𝐹 = 𝑚𝑎, which is an instance of an ordinary differential equation. All of its solutions are curves.
On the surface, scalar-vector functions have little to do with machine learning, but that’s not the case. Even though we
won’t deal with them extensively, they have a serious presence behind the scenes. For instance, gradient descent is a
discretized curve.
Vector-scalar functions will be our focus for the next few chapters. When I write “multivariable function”, I’ll most often
refer to a vector-scalar function.
Think about a map of a mountain landscape. This maps the height - a scalar - to each coordinate, thereby defining the
surface. This is just a function 𝑓 ∶ ℝ2 → ℝ in mathematical terms. Thinking about scalar fields as surfaces is useful for
building geometric intuition, giving us a way to visualize them.
Let’s clear up the notation first. If 𝑓 ∶ ℝ𝑛 → ℝ is a function of 𝑛 variables, we might write 𝑓(x) for an x ∈ ℝ𝑛 or
𝑓(𝑥1 , … , 𝑥𝑛 ) for 𝑥𝑖 ∈ ℝ if we want to emphasize the dependence on its variables. A function of 𝑛 variables is the same
as a function of a single vector variable. I know this seems confusing, but trust me, you’ll get used to it in no time.
To give a concrete example for a vector-scalar function, let’s consider pressure. Pressure is the ratio of the magnitude of
the force and the area of the surface of contact:
𝐹
𝑝= .
𝐴

322 Chapter 29. Multivariable functions

Mathematics of Machine Learning

Fig. 29.2: An open curve.

Fig. 29.3: A surface given by a vector-scalar function.

29.1. What is a multivariable function? 323

Mathematics of Machine Learning

This can be thought of as a function of two variables: 𝑝(𝑥, 𝑦) = 𝑥/𝑦.

To illustrate how problematic things can become in multiple dimensions, consider the pressure around (0, 0). Although
we haven’t talked about the limits of multivariable functions yet, what do you think
𝑥
lim
(𝑥,𝑦)→(0,0) 𝑦

should be? Based on how we deﬁned limits for single-variable functions,

𝑥
lim 𝑛
𝑛→∞ 𝑦𝑛

must match for all possible choices for 𝑥𝑛 and 𝑦𝑛 . This is not the case. Consider 𝑥𝑛 = 𝛼2 /𝑛 and 𝑦𝑛 = 𝛼/𝑛 for any 𝛼
real number. With this choice, we have
𝑥𝑛 𝛼2 /𝑛
lim = = 𝛼.
𝑛→∞ 𝑦𝑛 𝛼/𝑛
Thus, the above limit is not defined. All we did here is approach zero along slightly different trajectories, yet the result is
a total mess. In one variable, we have to flex our intellectual muscles to produce such examples; in multiple variables, a
simple 𝑥/𝑦 will do the trick.
Vector-vector functions are called vector fields. For example, consider our solar system, modeled by ℝ3 . Each point
is affected by a gravitational force, which is a vector. Thus, the gravitational pull can be described by a 𝑓 ∶ ℝ3 → ℝ3
function, hence the name vector field.
Although they are often hidden in the background, vector fields play an essential role in machine learning. Remember
when we discussed why does gradient descent work? (At least in one variable.) All the differential equations we have
encountered there are equivalent to vector fields.
Why? Consider the differential equation 𝑥′ = 𝑓(𝑥). If 𝑥(𝑡) describes the trajectory of a moving object, then its derivative
𝑥′ (𝑡) is its speed. Thus, we can interpret the equation 𝑥′ (𝑡) = 𝑓(𝑥(𝑡)) as prescribing the speed of our object at every
position. It’s not that spectacular when our object is moving in one dimension (like we assumed in the previous chapter),
but if the trajectory 𝑥 ∶ ℝ → ℝ2 describes a motion on the plane, the function 𝑓 ∶ ℝ2 → ℝ2 can be visualized neatly.
For example, consider the population dynamics of a simple predator-prey system. Predators feed on the prey, thus, their
numbers can grow in the abundance of food. In turn, over-consumption decreases the prey population, causing a famine
among the predators and decreasing their numbers. This leads to a growth in the prey, and the cycle starts over again.
If 𝑥1 (𝑡) and 𝑥2 (𝑡) are the size of the prey and predator populations, respectively, then their dynamics are described by
the famous Lotka-Volterra equations:
𝑥′1 = 𝑥1 − 𝑥1 𝑥2
𝑥′2 = 𝑥1 𝑥2 − 𝑥2 .
If we represent the trajectory as the scalar-vector function
𝑥1 (𝑡)
x ∶ ℝ → ℝ2 , x(𝑡) = [ ],
𝑥2 (𝑡)
then the derivative
𝑥′1 (𝑡)
x′ (𝑡) = [ ]
𝑥′2 (𝑡)
is given by the vector-vector function
𝑥1 − 𝑥1 𝑥2
𝑓 ∶ ℝ2 → ℝ 2 , 𝑓(𝑥1 , 𝑥2 ) = [ ].
𝑥1 𝑥2 − 𝑥2
𝑓 can be visualized by drawing a vector onto each point of the plane, as illustrated by Fig. 29.4.
Vector fields have serious applications in machine learning. As we shall see soon, the multivariable derivative (called
gradient) defines a vector field. Moreover, as indicated by the single-variable case, the gradient descent algorithm will be
the discretized trajectory determined by the vector field of the gradient.

324 Chapter 29. Multivariable functions

Mathematics of Machine Learning

Fig. 29.4: The vector ﬁeld given by the Lotka-Volterra equations.

29.2 Linear functions in multiple variables

One of the most important functions in mathematics is the linear function. In one variable, it takes the form 𝑙(𝑥) = 𝑎𝑥+𝑏,
where 𝑎 and 𝑏 are arbitrary real numbers.
We’ve seen linear functions several times already. For instance, Theorem 21.2.1 gives that diﬀerentiation is equivalent to
ﬁnding the best linear approximation.
Linear functions, that is, functions of the form
𝑛
𝑓(𝑥1 , … , 𝑥𝑛 ) = 𝑏 + ∑ 𝑎𝑖 𝑥𝑖 , 𝑏, 𝑎𝑖 ∈ ℝ
𝑖=1

are as important in multiple variables as in one.

To build up a deep understanding, we’ll take a look at the simplest case: a line on the two-dimensional plane.
Given its normal vector m = (𝑚1 , 𝑚2 ) and its arbitrary point v0 , the vector x is on the line if and only if m and x − v0
is orthogonal, that is, if

⟨m, x − v0 ⟩ = 0 (29.1)

holds. (29.1) is called the normal vector equation of the line.

By using the bilinearity of the inner product and writing out ⟨m, x⟩ in terms of their coordinates, we can simplify (29.1).
Assuming that 𝑚2 ≠ 0, that is, the line is not parallel to the 𝑥2 axis, a quick calculation yields
𝑚1 1
𝑥2 = − 𝑥 + ⟨m, v0 ⟩.
𝑚2 1 𝑚2

This is a linear function of the single variable 𝑥1 in its full glory. The coeﬃcient − 𝑚 1
𝑚2 describes the slope, while 𝑚2 ⟨m, v0 ⟩
1

describes the shift.

29.2. Linear functions in multiple variables 325

Mathematics of Machine Learning

Fig. 29.5: A line on the plane.

326 Chapter 29. Multivariable functions

Mathematics of Machine Learning

In other words, linear functions are equivalent to vector equations of the form (29.1), at least in one variable.
What happens if we apply the same argument in higher dimensional spaces? In ℝ𝑛+1 , the normal vector equation

⟨m, x − v0 ⟩ = 0, m, x, v0 ∈ ℝ𝑛+1 (29.2)

deﬁnes a hyperplane, that is, an 𝑚-dimensional plane. (One dimension less than the embedding plane, which is ℝ𝑛+1 in
our case.) Unraveling (29.2), we obtain
𝑛
1 𝑚𝑖
𝑥𝑛+1 = ⟨m, v0 ⟩ − ∑ 𝑥.
𝑚𝑛+1 𝑖=1
𝑚𝑛+1 𝑖

Thus, the general form of a linear function in 𝑛 variables

𝑛
𝑓(𝑥1 , … , 𝑥𝑛 ) = 𝑏 + ∑ 𝑎𝑖 𝑥𝑖 , 𝑏, 𝑎𝑖 ∈ ℝ
𝑖=1

originates from the normal vector equation of the 𝑛-dimensional plane, embedded in the 𝑛 + 1-dimensional space. This
can also be written in the vectorized form
𝑓(x) = 𝑏 + ⟨a, x⟩
(29.3)
= a𝑇 x + 𝑏, a, x ∈ ℝ𝑛 , 𝑏 ∈ ℝ,

which is how we’ll mostly use it in the future. (Note that when looking at the matrix representation of a vector u ∈ ℝ𝑛 ,
we always use the column form ℝ𝑛×1 . Moreover, a is not the normal vector of the plane.)

29.3 The curse of dimensionality

Before we move on to study the inner workings of multivariable calculus, I want to emphasize how seriously multiple
dimensions complicate things in machine learning.

29.3.1 Impossibly large grids

First, let’s talk about optimization. If all else fails, optimizing a single-variable function 𝑓 ∶ [𝑎, 𝑏] → ℝ can be as simple
as partitioning [𝑎, 𝑏] into a grid of 𝑛 points, evaluate the function at each point, then ﬁnd the minima/maxima.
We cannot* do this *in higher dimensions. To see why, consider ResNet18, the famous convolutional network archi-
tecture. It has precisely 11689512 parameters. Thus, training is equivalent to optimizing a function of a whopping
11689512-variable function. If we were to construct a grid with just two points along every dimension, we would have
211689512 points to evaluate the function at.
For comparison, the number of atoms in our observable universe is around 1082 . A number that is dwarfed by the size of
our grid. Thus, grid search is currently impossible on such an enormous grid. We are forced to devise clever algorithms
that can tackle the size and complexity of large dimensional spaces.

29.3.2 The strange behavior of distance

In high dimensions, a strange thing starts to happen with balls. Recall that by deﬁnition, the n-dimensional ball of radius
𝑟 around the point x0 ∈ ℝ𝑛 is deﬁned by

𝐵𝑛 (𝑟, x0 ) = {x ∈ ℝ𝑛 ∶ ‖x − x0 ‖ < 𝑟},

and we denote its volume by 𝑉𝑛 (𝑟). (The volume depends only on the radius and the dimension, not the center.)

29.3. The curse of dimensionality 327

Mathematics of Machine Learning

It turns out that

𝑛
𝜋2
𝑉𝑛 (𝑟) = 𝑟𝑛 ,
Γ(1 + 𝑛2 )
where Γ(𝑧) is the famous Gamma function, the generalization of the factorial.
The volume formula might seem complicated because of the Gamma function, the 𝜋, and all the other terms, but let’s
focus on the core of the issue. What happens if we slice oﬀ an 𝜀-wide shell from the unit ball?
It turns out that the volume of the unit ball is concentrated around its outer shell, as shown by the volume formula:
𝑉𝑛 (1 − 𝜀)
lim = lim (1 − 𝜀)𝑛 = 0.
𝑛→∞ 𝑉𝑛 (1) 𝑛→∞

Heuristically, this means that if you randomly select a point from the unit ball, its distance from the center will be close
to 1 in high dimensions.
In other words, distance doesn’t behave as you would intuitively expect. Another way of looking at the issue would be to
study the eﬀects of taking one step in each possible direction, starting from the origin and arriving at the point

1 = (1, 1, … , 1) ∈ ℝ𝑛 ,

something like how Fig. 29.6 illustrates in the three-dimensional case.

Fig. 29.6: Taking a step in each direction in three dimensions.

The Euclidean distance we have traveled is

𝑛
√
‖1‖ = √∑ 1 = 𝑛,
𝑖=1

which goes to infinity as the number of dimensions grows. That is, the diagonal of the unit cube is really big.
These two phenomena can cause significant headaches in practice. More parameters result in more expressive models but
also make training much more difficult. This is called the curse of dimensionality.

328 Chapter 29. Multivariable functions

CHAPTER

THIRTY

DERIVATIVES AND GRADIENTS

Now that we understand why multivariate functions and high-dimensional spaces are more complex than the single-variable
case we studied earlier, it’s time to see how to do things in the general case.
To recap quickly, our goal in machine learning is to optimize functions with millions of variables. For instance, think
about a neural network 𝑁 (x, w), where x ∈ ℝ𝑛 is the input data and the vector w ∈ ℝ𝑚 compresses all of the weight
parameters. In the case of, say, the binary cross-entropy loss, we have the loss function
𝑑
𝐿(w) = − ∑ 𝑁 (x𝑖 , w) log 𝑦𝑖 ,
𝑘=1

where x𝑖 is the 𝑖-th data point with ground truth 𝑦𝑖 . (I told you that we have to write much more in multivariable calculus.)
Training the neural network is the same as finding a global minimum of 𝐿(w), if it exists.
We have already seen how we can do optimization in a single variable:
• figure out the direction of increase by calculating the derivative,
• take a small step,
• then iterate.
For this to work in multiple variables, we need to generalize the concept of the derivative.
We quickly discovered the issue: since division with a vector is not defined, the difference quotient

𝑓(x) − 𝑓(y)
x−y

makes no sense when 𝑓 ∶ ℝ𝑛 → ℝ is a function of 𝑛 variables and x, y ∈ ℝ𝑛 are 𝑛-dimensional vectors.

How to make sense of it, then? This is the topic of the next chapter.# test

30.1 Partial derivatives

Let’s take a look at multivariable functions more closely! For the sake of simplicity, let 𝑓 ∶ ℝ2 → ℝ be our function of
two variables. To emphasize the dependence on the individual variables, we often write

𝑓(𝑥1 , 𝑥2 ), 𝑥1 , 𝑥2 ∈ ℝ.

We can quickly notice that by ﬁxing one of the variables, we obtain the two single-variable functions! That is, if 𝑥1 , 𝑥2 ∈
ℝ2 is ﬁxed, then we have
𝑥 ↦ 𝑓(𝑥, 𝑥2 ),
𝑥 ↦ 𝑓(𝑥1 , 𝑥),

329
Mathematics of Machine Learning

Fig. 30.1: Slicing the surface with the 𝑥 − 𝑧 plane.

where 𝑥 ∈ ℝ is a scalar. Think about this as slicing the function graph with a plane parallel to the 𝑥 − 𝑧 or the 𝑦 − 𝑧, as
illustrated by Fig. 30.1. The part cut out by the plane is a single-variable function.
We can define the derivative of these functions by the limit of difference quotients. These are called the partial derivatives:
𝜕𝑓 𝑓(𝑥, 𝑥2 ) − 𝑓(𝑥1 , 𝑥2 )
(𝑥 , 𝑥 ) = lim ,
𝜕𝑥1 1 2 𝑥→𝑥1 𝑥 − 𝑥1
𝜕𝑓 𝑓(𝑥1 , 𝑥) − 𝑓(𝑥1 , 𝑥2 )
(𝑥 , 𝑥 ) = lim .
𝜕𝑥2 1 2 𝑥→𝑥2 𝑥 − 𝑥2
𝜕𝑓 𝜕𝑓
(Keep in mind that 𝑥1 signifies the variable in 𝜕𝑥 , but an actual scalar value in the argument of 𝜕𝑥1 (𝑥1 , 𝑥2 ). This can
1
be quite confusing, but you’ll soon learn to make sense of it.)
The definition is similar for general multivariable functions; we just have to write much more. There, the partial derivative
of 𝑓 ∶ ℝ𝑛 → ℝ at the point x = (𝑥1 , … , 𝑥𝑛 ) with respect to the 𝑖-th variable is defined by
𝑖-th variable
𝜕𝑓 𝑓(𝑥1 , … , 𝑥
⏞ , … , 𝑥𝑛 ) − 𝑓(𝑥1 , … , 𝑥𝑖 , … , 𝑥𝑛 ) (30.1)
(𝑥 , … , 𝑥𝑛 ) = lim .
𝜕𝑥𝑖 1 𝑥→𝑥𝑖 𝑥 − 𝑥𝑖
One of the biggest challenges in multivariable calculus is to manage the ever-growing notational complexity. Just take a
look at the difference quotient above:
𝑓(𝑥1 , … , 𝑥, … , 𝑥𝑛 ) − 𝑓(𝑥1 , … , 𝑥𝑖 , … , 𝑥𝑛 )
.
𝑥 − 𝑥𝑖
This is not the prettiest to look at, and this kind of notational complexity can pile up fast. Fortunately, we can use linear
algebra to the rescue! Not only can we compact the variables into the vector x = (𝑥1 , … , 𝑥𝑛 ), we can use the standard
basis
e𝑖 = (0, … , 0, 1⏟ , 0, … , 0)
𝑖-th component

330 Chapter 30. Derivatives and gradients

Mathematics of Machine Learning

to write the diﬀerence quotients as

𝑓(x + ℎe𝑖 ) − 𝑓(x)

, ℎ ∈ ℝ.
ℎ
Thus, (30.1) can be compacted. With this newly found form, we are ready to make a concise and formal deﬁnition for
partial derivatives.

Deﬁnition 29.1.1 (Partial derivatives.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be a function of 𝑛 variables. The partial derivative of 𝑓 at the point x = (𝑥1 , … , 𝑥𝑛 ) with respect to the
𝑖-th variable is deﬁned by

𝜕𝑓 𝑓(x + ℎe𝑖 ) − 𝑓(x)

(x) = lim .
𝜕𝑥𝑖 ℎ→0 ℎ

If the above limit exists, we say that 𝑓 is partially diﬀerentiable with respect to the 𝑖-th variable 𝑥𝑖 .

𝜕
The partial derivative is again a vector-scalar function. Because of this, it is often written as 𝜕𝑥𝑖 𝑓, reﬂecting on the fact
𝜕
that the symbol " 𝜕𝑥 "can be thought of as a function that maps functions to functions. I know, this is a bit abstract, but
𝑖
you’ll get used to it quickly.
As usual, there are several alternative notations for the partial derivatives. Among others, the symbols
• 𝑓𝑥𝑖 (x),
• 𝐷𝑖 𝑓(x),
• 𝜕𝑖 𝑓(x)
𝜕𝑓
denote the 𝑖-th partial derivative of 𝑓 at x. For simplicity, we’ll use the old school 𝜕𝑥𝑖 (x).

30.1.1 Examples

It’s best to start with a few examples to illustrate the concept of partial derivatives.
Example 1. Let

𝑓(𝑥1 , 𝑥2 ) = 𝑥21 + 𝑥22 .

To calculate, say, 𝜕𝑓/𝜕𝑥1 , we ﬁx the second variable and treat 𝑥2 as a constant. Formally, we obtain the single-variable
function

𝑓 1 (𝑥) ∶= 𝑥2 + 𝑥22 , 𝑥2 ∈ ℝ,

whose derivative gives the ﬁrst partial derivative:

𝜕𝑓 𝑑𝑓
(𝑥 , 𝑥 ) = (𝑥 ) = 2𝑥1 .
𝜕𝑥1 1 2 𝑑𝑥 1

Similarly, we get that

𝜕𝑓
(𝑥 , 𝑥 ) = 2𝑥2 .
𝜕𝑥2 1 2

Once you are comfortable with the mental gymnastics of ﬁxing variables, you’ll be able to perform partial diﬀerentiation
without writing out all the intermediate steps.

30.1. Partial derivatives 331

Mathematics of Machine Learning

Example 2. Let’s see a more complicated example. Deﬁne

𝑓(𝑥1 , 𝑥2 ) = sin(𝑥21 + 𝑥2 ).

By fixing 𝑥2 , we obtain a composite function. Thus the chain rule is used to calculate the first partial derivative:
𝜕𝑓
(𝑥 , 𝑥 ) = 2𝑥1 cos(𝑥21 + 𝑥2 ).
𝜕𝑥1 1 2
Similarly, we obtain that
𝜕𝑓
(𝑥 , 𝑥 ) = cos(𝑥21 + 𝑥2 ).
𝜕𝑥2 1 2
(I highly advise you to carry out the above calculations step by step as an exercise, even if you understand all the inter-
mediate steps.)
Example 3. Finally, let’s see a function that is partially differentiable in one variable but not in the other. Define the
function

−1 if 𝑥2 < 0,
𝑓(𝑥1 , 𝑥2 ) = {
1 else.

By ﬁxing 𝑥2 , the resulting function is constant, thus

𝜕𝑓
(𝑥 , 𝑥 ) = 0
𝜕𝑥2 1 2
everywhere. However, in 𝑥2 , there is a discontinuity at 0. Thus the second partial derivative does not exist there.

30.1.2 The gradient

If a function is partially diﬀerentiable in every variable, we can compact the derivatives together in a single vector to form
the gradient.

Deﬁnition 29.1.2 (The gradient.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be a function that is partially diﬀerentiable in all of its variables. Then its gradient is deﬁned by the
(column) vector
𝜕
𝜕𝑥 𝑓(x)
⎡ 𝜕1 ⎤
∇𝑓(x) ∶= ⎢ 𝜕𝑥 𝑓(x) ⎥ ∈ ℝ𝑛×1 .
⎢ 2⋮ ⎥
⎢ ⎥
𝜕
⎣ 𝜕𝑥𝑛 𝑓(x)⎦

A few remarks are in order. First, the symbol ∇ is called nabla, a symbol that was conceived to denote gradients.
Second, the gradient can be thought of as a vector-vector function. To see that, consider the already familiar function
𝑓(𝑥1 , 𝑥2 ) = 𝑥21 + 𝑥22 . The gradient of 𝑓 is

2𝑥1
∇𝑓(𝑥1 , 𝑥2 ) = [ ],
2𝑥2
or

∇𝑓(x) = 2x

332 Chapter 30. Derivatives and gradients

Mathematics of Machine Learning

Fig. 30.2: The vector ﬁeld given by the gradient of 𝑥21 + 𝑥22 .

in vectorized form. We can visualize this by drawing the vector ∇𝑓(𝑥1 , 𝑥2 ) at each point (𝑥1 , 𝑥2 ) ∈ ℝ2 .
Thus, you can think about ∇𝑓 as a vector-vector function ∇𝑓 ∶ ℝ𝑛 → ℝ𝑛 . The gradient at a given point x is obtained by
evaluating this function, yielding (∇𝑓)(x).
For clarity, the parentheses are omitted, arriving at the all familiar notation ∇𝑓(x).

30.1.3 Higher order partial derivatives

The partial derivatives of a vector-scalar function 𝑓 ∶ ℝ𝑛 → ℝ are vector-scalar functions themselves. Thus, we can
perform partial diﬀerentiation one more time!
If they exist, the second order partial derivatives are deﬁned by

𝜕2𝑓 𝜕 𝜕𝑓
(a) ∶= ( (a)). (30.2)
𝜕𝑥𝑖 𝜕𝑥𝑗 𝜕𝑥𝑖 𝜕𝑥𝑗

𝜕2 𝑓
(When the second partial differentiation takes place with respect to the same variable, (30.2) is abbreviated by 𝜕𝑥2𝑖
(a).)
The definition begs the question: is the order of differentiation interchangeable? That is, does

𝜕2𝑓 𝜕2𝑓
(a) = (a)
𝜕𝑥𝑖 𝜕𝑥𝑗 𝜕𝑥𝑗 𝜕𝑥𝑖

hold? The answer is quite surprising: the order is interchangeable under some mild assumptions, but not in the general
case. There is a famous theorem about it which we won’t prove, but it’s essential to know.

Theorem 29.1.1

30.1. Partial derivatives 333

Mathematics of Machine Learning

Let 𝑓 ∶ ℝ𝑛 → ℝ be an arbitrary vector-scalar function and let a ∈ ℝ𝑛 . If there is a small ball 𝐵(𝜀, a) ⊆ ℝ𝑛 centered at
a such that 𝑓 has continuous second-order partial derivatives at all points of 𝐵(𝜀, a), then

𝜕2𝑓 𝜕2𝑓
(a) = (a)
𝜕𝑥𝑖 𝜕𝑥𝑗 𝜕𝑥𝑗 𝜕𝑥𝑖

holds for all 𝑖 = 1, … , 𝑛.

30.2 The total derivative

Partial derivatives seem to generalize the notion of diﬀerentiability for multivariable functions. However, something is
missing. Let’s revisit the single-variable case for a moment.
Recall that according to Theorem 21.2.1, the diﬀerentiability of a single-variable function 𝑓 ∶ ℝ → ℝ at a given point 𝑎
is equivalent to a local approximation of 𝑓 by the linear function

𝑙(𝑥) = 𝑓(𝑎) + 𝑓 ′ (𝑎)(𝑥 − 𝑎).

If 𝑥 is close to 𝑎, 𝑙(𝑥) is also close to 𝑓(𝑥). Moreover, this is the best linear approximation we can do around 𝑎. In a
single variable, this is equivalent to differentiation.
𝑓(x)−𝑓(y)
This gives us an idea: even though difference quotients like x−y does not exist in multiple variables, the best local
approximation with a multivariable linear function does!
Thus, the notion of total differentiability is born.

Deﬁnition 29.2.1 (Total diﬀerentiability.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be a function of 𝑛 variables. We say that 𝑓 is totally diﬀerentiable (or sometimes just diﬀerentiable in
short) at a ∈ ℝ𝑛 if there exists a row vector 𝐷𝑓 (a) ∈ ℝ1×𝑛 such that

𝑓(x) = 𝑓(a) + 𝐷𝑓 (a)(x − a) + 𝑜(‖x − a‖) (30.3)

holds for all a ∈ 𝐵(𝜀, a), where 𝜀 > 0 and 𝐵(𝜀, a) is deﬁned by

𝐵(𝜀, a) = {x ∈ ℝ𝑛 ∶ ‖x − a‖} < 𝜀.

(In other words, 𝐵(𝜀, a) is a ball of radius 𝜀 > 0 around a.) When exists, the vector 𝐷𝑓 (a) is called the total derivative
of 𝑓 at a.

Recall that when it is not stated explicitly, we prefer to work with column vectors, because we want to write our linear
transformations in the form 𝐴x, where 𝐴 ∈ ℝ𝑚×𝑛 and x ∈ 𝑛 × 1. Thus, the “dimensionology” of the formula

𝑓(x)
⏟ = 𝑓(a)
⏟ +𝐷⏟𝑓 (a) (x
⏟ − a) +𝑜(‖x − a‖) ∈ ℝ1×1
∈ℝ1×1 ∈ℝ1×1 ∈ℝ1×𝑛 ∈ℝ𝑛×1

works out. (Don’t be fooled, ℝ1×1 is a scalar.)

Let’s unravel the notion of total diﬀerentiability. The form (30.3) implies that a totally diﬀerentiable function 𝑓 equals to
the linear part 𝑓(a) + 𝐷𝑓 (a)(x − a) plus a small error.
The surface given by the linear part is called the tangent plane. We can visualize it for functions of two variables.
Unsurprisingly, the partial and total derivatives share an intimate connection.

334 Chapter 30. Derivatives and gradients

Mathematics of Machine Learning

Fig. 30.3: The tangent plane.

Theorem 29.2.1 (Total derivative and the partial derivatives.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be a function that is totally diﬀerentiable at a. Then all of its partial derivatives exist at a and

𝑓(x) = 𝑓(a) + ∇𝑓(a)𝑇 (x − a) + 𝑜(‖x − a‖) (30.4)

holds for all a in some 𝐵(𝜀, a). (That is, 𝐷𝑓 (a) = ∇𝑓(a)𝑇 .)

In other words, the equation (30.4) gives that the coeﬃcients of the best linear approximation are equal to the partial
derivatives.

Proof. Because 𝑓 is totally diﬀerentiable at a, the deﬁnition gives that 𝑓 can be written in the form

𝑓(x) = 𝑓(a) + 𝐷𝑓 (a)(x − a) + 𝑜(‖x − a‖),

where 𝐷𝑓 (a) = (𝑑1 , … , 𝑑𝑛 ) is the vector that describes the coeﬃcients of the linear part.
Our goal is to show that
𝑓(a + ℎe𝑖 ) − 𝑓(a)
lim = 𝑑𝑖 ,
ℎ→0 ℎ
where e𝑖 is the unit (column) vector whose 𝑖-th component is 1, while the others are 0.
Let’s do a quick calculation! Based on what we know, we have
𝑓(a + ℎe𝑖 ) − 𝑓(a) 𝐷𝑓 (a)ℎe𝑖 + 𝑜(‖ℎe𝑖 ‖)
=
ℎ ℎ
= 𝐷𝑓 (a)e𝑖 + 𝑜(1)
= 𝑑𝑖 + 𝑜(1),

30.2. The total derivative 335

Mathematics of Machine Learning

𝑓(a+ℎe𝑖 )−𝑓(a)
thus conﬁrming that limℎ→0 ℎ = 𝑑𝑖 , which is what we had to show. □

What’s all the hassle with total differentiation, then? Theorem 29.2.1 tells us that total differentiability is a stronger
condition than partial differentiability.
Surprisingly, the other direction is not true: the existence of partial derivatives does not imply total differentiability, as
the example

1 if 𝑥 = 0 or 𝑦 = 0,
𝑓(𝑥, 𝑦) = {
0 otherwise

illustrates. This function has all its partial derivatives at 0, yet the total derivative does not exist. (You can convince
yourself by either drawing a ﬁgure, or noting that the function 1 − d𝑇 x can never be 𝑜(‖x‖), regardless of the choice of
d.)

Remark 29.2.1 (The total derivative as an operator.)

Just like for single-variable functions, the total derivative is a function 𝐷𝑓 ∶ ℝ𝑛 → ℝ𝑛 .
At the highest level of abstraction, we can think about the total derivative as an operator that maps a vector-vector function
to a vector-scalar function: 𝑛
𝐷 ∶ (ℝ𝑛 )ℝ → (ℝ𝑛 )ℝ ,
𝐷 ∶ 𝑓 ↦ 𝐷𝑓 ,
where 𝐴𝐵 denotes the set of all functions mapping 𝐴 to 𝐵.
You are not required to understand this at all, but trust me, the more abstract your thinking is, the more powerful you’ll
be.

30.3 Directional derivatives

So far, we have talked about two kinds of derivatives: partial derivatives that describe the rate of change along a ﬁxed
axis, and total derivatives that give the best linear approximation of the function at a given point.
Partial derivatives are only concerned with a few particular directions. However, this is not the end of the story in multiple
variables. With the standard orthonormal basis vectors e𝑖 , the partial derivatives are deﬁned by

𝜕𝑓 𝑓(a + ℎe𝑖 ) − 𝑓(a)

(a) = lim , 𝑖 = 1, 2. (30.5)
𝜕𝑥𝑖 ℎ→0 ℎ

As we have seen earlier, these describe the rate of change along the dimensions. However, the standard orthonormal
vectors are just a few special directions.
What about an arbitrary direction v? Can we deﬁne the derivative along these? Sure! There is nothing stopping us to
replace e𝑖 with v in (30.5). Thus, the directional derivatives are born.

Deﬁnition 29.3.1 (Directional derivatives.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be a function of 𝑛 variables and let v ∈ ℝ𝑛 be an arbitrary vector. The directional derivative of 𝑓 along
v is deﬁned by the limit

𝜕𝑓 𝑓(a + ℎv) − 𝑓(a)

∶= lim .
𝜕v ℎ→0 ℎ

336 Chapter 30. Derivatives and gradients

Mathematics of Machine Learning

Good news: the directional derivatives can be described in terms of the gradient.

Theorem 29.3.1
Let 𝑓 ∶ ℝ𝑛 → ℝ be a function of 𝑛 variables. If 𝑓 is totally diﬀerentiable at a, then its directional derivatives exist in all
directions, and
𝜕𝑓
(a) = ∇𝑓(a)𝑇 v.
𝜕v

Proof. Because of the total diﬀerentiability, Theorem 29.2.1 gives that

𝑓(x) = 𝑓(a) + ∇𝑓(a)𝑇 (x − a) + 𝑜(‖x − a‖)

around a. Thus,
𝑓(a + ℎv) − 𝑓(a) ℎ∇𝑓(a)𝑇 v + 𝑜(ℎ)
=
ℎ ℎ
= ∇𝑓(a)𝑇 v + 𝑜(1),

giving that
𝜕𝑓 𝑓(a + ℎv) − 𝑓(a)
(a) = lim
𝜕v ℎ→0 ℎ
𝑇
= lim ∇𝑓(a) v + 𝑜(1)
ℎ→0
= ∇𝑓(a)𝑇 v,
as we needed to show. □

In other words, theorem:multivariable-functions/partial-derivatives/directional-derivative gives us that no matter the di-

rection v, the directional derivative can be written in terms of the gradient and v. If you think about this for a minute, this
is quite amazing: the rates of change along two special directions determine the rate of change in any other direction.

30.4 Properties of the gradient

In one variable, we have learned that if the derivative of 𝑓 is positive at some 𝑎, then 𝑓 increases around 𝑎. (If the
derivative is negative, 𝑓 decreases.) If we think about the derivative 𝑓 ′ (𝑎) as a one-dimensional vector, the above result
says that the derivative points towards the direction of increase.
Is this true in higher dimensions? Yes, and this is what makes gradient descent work.

Theorem 29.4.1 (The gradient determines the direction of the increase.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be a function of 𝑛 variables, and suppose that 𝑓 is totally diﬀerentiable at a ∈ ℝ𝑛 .
Then
𝜕𝑓 ∇𝑓(a)
arg max (a) = . (30.6)
𝑛
v∈{x∈ℝ ∶‖x‖=1} 𝜕v ‖∇𝑓(a)‖

30.4. Properties of the gradient 337

Mathematics of Machine Learning

I know, (30.6) is pretty overloaded, so let’s unpack it. First, let’s start with the mysterious arg max. For a given function
𝑓,

arg max 𝑓(𝑥)

𝑥∈𝑆

denotes the values that maximizes 𝑓 on the set 𝑆. As the maximum may not be unique, arg max can yield a set. (The
deﬁnition of arg min is the same, but with minimum instead of maximum.)
Thus, in English, (30.6) states that the unit direction that maximizes the directional derivative at a is the normalized
gradient. Now we are ready to see the proof!

Proof. Do you remember the Cauchy-Schwarz inequality? It was a long time ago, so let’s recall it! In the vector space
ℝ𝑛 , the Cauchy-Schwarz inequality tells us that for any x, y ∈ ℝ𝑛 ,

x𝑇 y ≤ ‖x‖‖y‖.

Now, as Theorem 29.3.1 implies, the directional derivatives can be written as

𝜕𝑓
(a) = ∇𝑓(a)𝑇 v.
𝜕v
Combined with the Cauchy-Schwarz inequality, we get that

𝜕𝑓
(a) = ∇𝑓(a)𝑇 v
𝜕v
≤ ‖∇𝑓(a)‖‖v‖.

By restricting the directions to unit vectors,

𝜕𝑓
(a) ≤ ‖∇𝑓(a)‖ (30.7)
𝜕v
follows. Thus, the directional derivatives must be less or equal than the gradient’s norm. (At least, along a direction vector
with unit length.)
However, by letting v0 = ∇𝑓(a)/‖∇𝑓(a)‖, we obtain that

𝜕𝑓
(a) = ∇𝑓(a)𝑇 v0
𝜕v0
∇𝑓(a)𝑇 ∇𝑓(a)
=
‖∇𝑓(a)‖
‖∇𝑓(a)‖2
=
‖∇𝑓(a)‖
= ‖∇𝑓(a)‖.
∇𝑓(a) ∇𝑓(a)
Thus, with the choice v0 = ‖∇𝑓(a)‖ , equality can be attained in (30.7). This means that ‖∇𝑓(a)‖ maximizes the directional
derivative at a, which is what we had to prove. □

With that, we have the basics of differentiation in multiple variables under our belt. To sum up, we have learned that the
difference quotient definition of the derivative does not generalize directly for multiple variables, but we can fix all but
one variables to make the difference quotient work, thus obtaining partial derivatives.
On the other hand, the linear approximation definition works in multiple dimensions, but instead of

𝑓(𝑎) + 𝑓 ′ (𝑎)(𝑥 − 𝑎), 𝑓 ∶ ℝ → ℝ, 𝑥, 𝑎 ∈ ℝ,

338 Chapter 30. Derivatives and gradients

Mathematics of Machine Learning

like we had in one variable, we obtain

𝑓(a) + ∇𝑓(a)𝑇 (x − a), 𝑓 ∶ ℝ𝑛 → ℝ, x, a ∈ ℝ𝑛 ,

where the analogue of the derivative is the gradient vector ∇𝑓(a) ∈ ℝ𝑛 .

Even when we were studying differentiation in one variable for the first time, I told you that the local linear approximation
definition would be useful someday. That time is now, and we are reaping the benefits. Soon, we’ll see gradient descent
in its full glory.

30.5 Multivariable functions in code

It’s been a long time since we’ve put theory into code. So, let’s take a look at multivariable functions!
Last time, we built a Function base class with two main methods: one for computing the derivative (Function.
prime), one for getting the dictionary of parameters (Function.parameters).
This won’t be much of a surprise: the multivariate function base class is not much diﬀerent. For clarity, we’ll appropriately
rename the prime method to grad.

class MultivariableFunction:
def __init__(self):
pass

def call(self, *args, **kwargs):

pass

def grad(self):
pass

def parameters(self):
return dict()

Let’s see a few examples right away. The simplest one is the squared Euclidean norm 𝑓(x) = ‖x‖2 , a closed relative to
the mean squared error function. Its gradient is given by

∇𝑓(x) = 2x,

thus everything is ready to implement it. As we’ve used NumPy arrays to represent vectors, we’ll use them as the input
as well.

import numpy as np

class SquaredNorm(MultivariableFunction):
def __call__(self, x: np.array):
return np.mean(x**2)

def grad(self, x: np.array):

return 2*x

Note that SquaredNorm is diﬀerent from 𝑓(x) = ‖x‖2 in a mathematical sense, as it accepts any NumPy array, not
just an 𝑛-dimensional vector. This is not a problem now, but will be one later, so keep that in mind.
Another can be given by the parametric linear function

𝑔(𝑥, 𝑦) = 𝑎𝑥 + 𝑏𝑦,

30.5. Multivariable functions in code 339

Mathematics of Machine Learning

where 𝑎, 𝑏 ∈ ℝ are arbitrary parameters. Let’s see how 𝑔(𝑥, 𝑦) is implemented!

class Linear(MultivariableFunction):
def __init__(self, a: float, b: float):
self.a = a
self.b = b

def call(self, x: np.array):

"""
x: np.array of shape (2, 1)
"""
x = x.reshape(2)
return self.a*x[0] + self.b*x[1]

def grad(self, x: np.array):

return np.array([self.a, self.b]).reshape(2, 1)

def parameters():
return {"a": self.a, "b": self.b}

Note that as we are working with column vectors, the input x is an array of shape (2, 1).
To check if our implementation works correctly, we can quickly test it out on a simple example.

g = Linear(a=1, b=-1)

g(np.array([1, 0]).reshape(2, 1))

Perhaps we might have overlooked this question until now, but trust me, specifying the input and output shapes is of
crucial importance. When doing mathematics, we can be ﬂexible in our notation and treat any vector 𝑥 ∈ ℝ𝑛 as a column
or row vector, but this is painfully not the case in practice.
Correctly keeping track of array shapes is of utmost importance, and can save you hundreds of hours. No joke.
For now, that’s basically all to our MultivariableFunction class. Later, when implementing our neural networks
from scratch, we’ll add other methods for utility. However, regarding “mathematical functionality”, we are almost done.

340 Chapter 30. Derivatives and gradients

CHAPTER

THIRTYONE

DERIVATIVES OF VECTOR-VALUED FUNCTIONS

In a single variable, deﬁning higher-order derivatives is simple. We simply have to keep repeating diﬀerentiation:
′
𝑓 ′′ (𝑥) = (𝑓 ′ (𝑥)) ,
′
𝑓 ′′′ (𝑥) = (𝑓 ′′ (𝑥)) ,

and so on. However, this is not that straightforward with multivariable functions. So far, we have only talked about
gradients, the generalization of the derivative for vector-scalar functions.
As ∇𝑓(a) is a column vector, the gradient is a vector-vector function ∇ ∶ ℝ𝑛 → ℝ𝑛 . We only know how to compute the
derivative of vector-scalar functions. It’s time to change that!

31.1 The derivatives of curves

Curves, often describing the solutions of dynamical systems, are one of the most important objects in mathematics. We
don’t use them explicitly in machine learning, but they are underneath algorithms such as gradient descent. (Where we
traverse a discretized curve leading to a local minimum.)
Formally, a curve - that is, a scalar-vector function - is given by a function

𝛾1 (𝑡)
⎡ 𝛾 (𝑡) ⎤
𝛾 ∶ ℝ → ℝ𝑛 , 𝛾(𝑡) = ⎢ 2 ⎥ ∈ ℝ𝑛(×1) ,
⎢ ⋮ ⎥
⎣𝛾𝑛 (𝑡)⎦

where the 𝛾𝑖 ∶ ℝ → ℝ functions are good old single-variable scalar-scalar functions. As the independent variable often
represents time, it is customary to denote it with 𝑡.
We can diﬀerentiate 𝛾 componentwise:

𝛾1′ (𝑡)
⎡ 𝛾 ′ (𝑡) ⎤
𝛾 ′ (𝑡) ∶= ⎢ 2 ⎥ ∈ ℝ𝑛(×1) .
⎢ ⋮ ⎥
⎣𝛾𝑛′ (𝑡)⎦

If we indeed imagine 𝛾(𝑡) as a trajectory in space, 𝛾 ′ (𝑡) is the tangent vector to 𝛾 at 𝑡. Since the diﬀerentiation is
componentwise, Theorem 21.2.1 implies that if 𝛾 is diﬀerentiable at some 𝑎,

𝛾(𝑡) = 𝛾(𝑎) + 𝛾 ′ (𝑡)(𝑡 − 𝑎) + 𝑜(|𝑡 − 𝑎|) (31.1)

there. The equation (31.1) is a true vectorized formula: some components are vectors, and some are scalars. Yet, this is
simple and makes perfect sense to us. Hiding the complexities of vectors and matrices is the true power of linear algebra.

341
Mathematics of Machine Learning

It is easy to see that for any two 𝛾, 𝜂 ∶ ℝ → ℝ𝑛 , diﬀerentiation is additive, as (𝛾 + 𝜂)′ = 𝛾 ′ + 𝜂′ . What happens when
we compose a scalar-vector function with a vector-scalar one?
This situation is commonplace in machine learning. If, say, 𝐿 ∶ ℝ𝑛 → ℝ describes the loss function and 𝛾 ∶ ℝ → ℝ𝑛
is our trajectory in the parameter space ℝ𝑛 , the composite function 𝑓(𝛾(𝑡)) describes the model loss at time 𝑡. Thus, to
compute (𝑓 ∘ 𝛾)′ , we have to generalize the chain rule.

Theorem 30.1.1 (The chain rule for composing scalar-vector and vector-scalar functions.)
Let 𝛾 ∶ ℝ → ℝ𝑛 and 𝑓 ∶ ℝ𝑛 → ℝ be arbitrary functions. If 𝛾 is differentiable at some 𝑎 ∈ ℝ and 𝑓 is differentiable at
𝛾(𝑎), then 𝑓 ∘ 𝛾 ∶ ℝ → ℝ is also differentiable at 𝑎 and

(𝑓 ∘ 𝛾)′ (𝑎) = ∇𝑓(𝛾(𝑎))𝑇 𝛾 ′ (𝑎)

there.

Proof. As 𝑓 is diﬀerentiable at 𝛾(𝑎), Theorem 29.2.1 gives

𝑓(𝛾(𝑡)) = 𝑓(𝛾(𝑎)) + ∇𝑓(𝛾(𝑎))𝑇 (𝛾(𝑡) − 𝛾(𝑎)) + 𝑜(‖𝛾(𝑥) − 𝛾(𝑎)‖).

Thus,
𝑓(𝛾(𝑡)) − 𝑓(𝛾(𝑎))
(𝑓 ∘ 𝛾)′ (𝑎) = lim
𝑡→𝑎 𝑡−𝑎
= ∇𝑓(𝛾(𝑎))𝑇 𝛾 ′ (𝑎),
which is what we had to prove. □

31.2 The Jacobian and Hessian matrices

Now, our task is to extend the derivative for vector-vector functions. Let f ∶ ℝ𝑛 → ℝ𝑚 be an arbitrary vector-vector
function. By writing out the output of f explicitly, we can decompose it into multiple components:
𝑓1 (x)
f(x) = ⎡
⎢ ⋮ ⎥∈ℝ
⎤ 𝑚(×1)

⎣𝑓𝑚 (x)⎦
where 𝑓𝑖 ∶ ℝ𝑛 → ℝ are vector-scalar functions.
The natural idea is to compute the partial derivatives for 𝑓𝑖 , compacting them into a matrix. And so we shall!

Deﬁnition 30.2.1 (The Jacobian matrix.)

Let f ∶ ℝ𝑛 → ℝ𝑚 be an arbitrary vector-vector function, and suppose that

f(x) = (𝑓1 (x), … , 𝑓𝑚 (x)),

where all 𝑓𝑖 ∶ ℝ𝑛 → ℝ are (partially) diﬀerentiable at some a.

The matrix
𝜕𝑓1 𝜕𝑓2 𝜕𝑓𝑚
(a) 𝜕𝑥1 (a) … 𝜕𝑥1 (a)
⎡ 𝜕𝑥
𝜕𝑓1
1
𝜕𝑓2 𝜕𝑓𝑚 ⎤
𝐽f (a) ∶= ⎢ 𝜕𝑥 (a) 𝜕𝑥2 (a) … 𝜕𝑥2 (a)⎥ 𝑚×𝑛
⎢ 2⋮ ⋮ ⋱ ⋮ ⎥∈ℝ
⎢ ⎥
𝜕𝑓1 𝜕𝑓2 𝜕𝑓𝑚
⎣ 𝜕𝑥𝑛 (a) 𝜕𝑥𝑛 (a) … 𝜕𝑥𝑛 (a)⎦

342 Chapter 31. Derivatives of vector-valued functions

Mathematics of Machine Learning

is called the Jacobian of f at a.

In other words, the columns of the Jacobian are the gradients of 𝑓𝑖 :

𝐽f (a) = [∇𝑓1 (a) ∇𝑓2 (a) … ∇𝑓𝑚 (a)] .

I have good news: the best local linear approximation of f around a is given by

f(x) = f(a) + 𝐽f (a)𝑇 (x − a) + 𝑜(‖x − a‖),

if the best local linear approximation exists. Thus, the Jacobian is a proper generalization of the gradient.
We can use the Jacobian to generalize the notion of second-order derivatives for vector-scalar functions: by computing
the Jacobian of the gradient, we obtain a special matrix, the analogue of the second derivative.

Deﬁnition 30.2.2 (The Hessian matrix.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be an arbitrary vector-scalar function, and suppose that all of its second-order partial derivatives exist at
a ∈ ℝ𝑛 .
The matrix
𝜕2 𝑓 𝜕2 𝑓 𝜕2 𝑓
𝜕𝑥21
(a) 𝜕𝑥1 𝜕𝑥2 (a) … 𝜕𝑥1 𝜕𝑥𝑛 (a)
⎡ 𝜕2 𝑓 𝜕2 𝑓 𝜕2 𝑓
⎤
⎢ 𝜕𝑥2 𝜕𝑥1 (a) 𝜕𝑥22
(a) … 𝜕𝑥2 𝜕𝑥𝑛 (a)⎥ 𝑛×𝑛
𝐻𝑓 (a) ∶= ⎢ ⎥∈ℝ
⎢ 2⋮ ⋮ ⋱ ⋮ ⎥
𝜕 𝑓 𝜕2 𝑓 𝜕2 𝑓
⎣ 𝜕𝑥𝑛 𝜕𝑥1 (a) 𝜕𝑥𝑛 𝜕𝑥2 (a) … 𝜕𝑥𝑛2 (a) ⎦

is called the Hessian of 𝑓 at a.

In other words,

𝐻𝑓 (a) = 𝐽∇𝑓 (a)

holds by deﬁnition. Moreover, if 𝑓 behaves nicely (for instance, all second-order partial derivatives exist and are contin-
uous), Theorem 29.1.1 implies that the Hessian is symmetric; that is, 𝐻𝑓 (a) = 𝐻𝑓 (a)𝑇 .

31.3 The total derivative for vector-vector functions

One last generalization, I promise. Recall that the existence of the gradient (that is, partial diﬀerentiability) doesn’t imply
total diﬀerentiability for vector-scalar functions, as the example

1 if 𝑥 = 0 or 𝑦 = 0,
𝑓(𝑥, 𝑦) = {
0 otherwise

shows at zero.
This is true for vector-vector functions as well, as the Jacobian is the generalization of the gradient, not the total derivative.
It is best to rip the band-aid off quickly and define the total derivative for vector-vector functions. The definition will be
a bit abstract, but trust me, the investment will pay off when talking about the chain rule. (Which is the foundation of
backpropagation, the algorithm that makes gradient descent computationally feasible.)

Deﬁnition 30.3.1 (Total diﬀerentiability of vector-vector functions.)

31.3. The total derivative for vector-vector functions 343

Mathematics of Machine Learning

Let f ∶ ℝ𝑛 → ℝ𝑚 be an arbitrary vector-vector function. We say that 𝑓 is totally diﬀerentiable (or sometimes just
diﬀerentiable in short) at a ∈ ℝ𝑛 if there exists a matrix 𝐷f (a) ∈ ℝ𝑚×𝑛 such that

f(x) = f(a) + 𝐷f (a)(x − a) + 𝑜(‖x − a‖) (31.2)

holds for all a ∈ 𝐵(𝜀, a), where 𝜀 > 0 and 𝐵(𝜀, a) is deﬁned by

𝐵(𝜀, a) = {x ∈ ℝ𝑛 ∶ ‖x − a‖ < 𝜀}.

(In other words, 𝐵(𝜀, a) is a ball of radius 𝜀 > 0 around a.) When exists, the matrix 𝐷f (a) is called the total derivative
of 𝑓 at a.

Notice that Deﬁnition 30.3.1 is almost verbatim to Deﬁnition 29.2.1, except that the “derivative” is a matrix this time.
You are probably not surprised to hear that its relation with the Jacobian is the same as the gradient and the total derivative
in the vector-scalar case.

Theorem 30.3.1 (Total derivative and the partial derivatives.)

Let f ∶ ℝ𝑛 → ℝ𝑚 be a function that is totally diﬀerentiable at a. Then all of its partial derivatives exist at a and

𝐷f (a) = 𝐽f (a)𝑇 .

The proof is almost identical to the one of Theorem 29.2.1, with more complex notations. I strongly recommend you to
work it out line by line, as this kind of mental gymnastics helps signiﬁcantly to get used to matrices in practice.
Componentwise, the total derivative can be written as
𝜕𝑓1 𝜕𝑓1 𝜕𝑓1
𝜕𝑥1 (a) 𝜕𝑥2 (a) … 𝜕𝑥𝑛 (a)
⎡ 𝜕𝑓2 𝜕𝑓2 𝜕𝑓2 ⎤
𝐷f (a) = ⎢ 𝜕𝑥1 (a) 𝜕𝑥2 (a) … 𝜕𝑥𝑛 (a) ⎥ 𝑚×𝑛
⎢ ⋮ ⋮ ⋱ ⋮ ⎥∈ℝ .
⎢ ⎥
𝜕𝑓𝑚 𝜕𝑓𝑚 𝜕𝑓𝑚
⎣ 𝜕𝑥1 (a) 𝜕𝑥 0
2
(a) … 𝜕𝑥𝑛 (a)⎦

By introducing the notation

𝜕𝑓1
𝜕𝑥𝑖 (a)
⎡ 𝜕𝑓 ⎤
𝜕f ⎢ 𝜕𝑥
2
(a) ⎥ 𝑚×1
(a) = ⎢ 𝑖 ⎥∈ℝ ,
𝜕𝑥𝑖 ⎢ ⋮ ⎥
𝜕𝑓𝑚
⎣ 𝜕𝑥𝑖 (a)⎦

the total derivative 𝐷f (a) can be written in the block-forms

𝜕f 𝜕f 𝜕f
𝐷f (a) = [ 𝜕𝑥 (a) 𝜕𝑥2 (a) … 𝜕𝑥𝑛 (a)]
1

and
∇𝑓1 (a)𝑇
⎡ ∇𝑓 (a)𝑇 ⎤
𝐷f (a) = ⎢ 2 ⎥.
⎢ ⋮ ⎥
𝑇
∇𝑓
⎣ 𝑚 (a) ⎦

344 Chapter 31. Derivatives of vector-valued functions

Mathematics of Machine Learning

31.4 Derivatives and function operations

We have generalized the notion of derivatives as far as possible for us. Now it’s time to study their relations with the
two essential function operations: addition and composition. (As there is no vector multiplication in higher dimensional
spaces, the product and ratio of f, g ∶ ℝ𝑛 → ℝ𝑚 is undeﬁned.)
Let’s start with the simpler one: addition.

Theorem 30.4.1 (Linearity of the total derivative.)

Let f, g ∶ ℝ𝑛 → ℝ𝑚 be two vector-vector functions that are diﬀerentiable at some a ∈ ℝ𝑛 , and let 𝛼, 𝛽 ∈ ℝ be two
arbitrary scalars.
Then 𝛼f + 𝛽g is also diﬀerentiable at a and

𝐷𝛼f+𝛽g (a) = 𝛼𝐷f (a) + 𝛽𝐷g (a)

there.

Proof. Because of the total diﬀerentiability, (31.2) implies that

𝛼f(x) + 𝛽g(x) = 𝛼f(a) + 𝛽g(a)

+ (𝛼𝐷f (a) + 𝛽𝐷g (a))(x − a)
+ 𝑜(‖x − a‖),

which implies

𝐷𝛼f+𝛽g (a) = 𝛼𝐷f (a) + 𝛽𝐷g (a).

This is what we had to show. □

Linearity is always nice, but what we need is the ultimate generalization of the chain rule. We previously saw the special
case of composing a scalar-vector and a vector-vector function (see Theorem 30.1.1), but we need to go one step further.
The multivariable chain rule is extremely important in machine learning. A neural network is a composite function, each
layer forming a component. During gradient descent, we use the chain rule to calculate its derivative.

Theorem 30.4.2 (Multivariable chain rule.)

Let f ∶ ℝ𝑚 → ℝ𝑙 and g ∶ ℝ𝑛 → ℝ𝑚 be two vector-vector functions. If g is totally differentiable at a ∈ ℝ𝑛 and f is totally
differentiable at g(a), then f ∘ g is differentiable at a and

𝐷f∘g (a) = 𝐷f (g(a))𝐷g (a) (31.3)

holds.

To our advantage, the derivative of a composed function (31.3) is given by the product of two matrices. Since matrix
multiplication can be done lightning fast, this is good news.
We will see two proofs for Theorem 30.4.2. One is done with a faster-than-light engine, while the other shows much more
by reducing the general case to Theorem 30.1.1. Both provide a ton of insight. Let’s start with the heavy machinery.

Proof. (First method.)

31.4. Derivatives and function operations 345

Mathematics of Machine Learning

As f is totally diﬀerentiable at g(a), the equation (31.2) implies

f(g(x)) = f(g(a)) + 𝐷f (g(a))(g(x) − g(a)) + 𝑜(‖g(x) − g(a)‖).

In turn, again because of the total diﬀerentiability of g at a, we have

g(x) − g(a) = 𝐷g (a)(x − a) + 𝑜(‖x − a‖).

Thus, we can continue our calculation by

f(g(x)) = f(g(a)) + 𝐷f (g(a))(g(x) − g(a)) + 𝑜(‖g(x) − g(a)‖)

= f(g(a)) + 𝐷f (g(a))𝐷g (a)(x − a)
+𝐷 f (g(a))𝑜(‖x − a‖) + 𝑜(‖g(x) − g(a)‖),
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
=𝑜(‖x−a‖)

showing that f ∘ g is totally diﬀerentiable at a with total derivative

𝐷f∘g (a) = 𝐷f (g(a))𝐷g (a),

which is what we needed to show. □

Now, about that second proof.

Proof. (Second method.)

Let’s unpack 𝐷f∘g (a) a bit. Writing out the components of f ∘ g, we have

(f ∘ g)1 (x)
(f ∘ g)(x) = ⎡ ⎤ 𝑙
⎢ (f ∘ g)2 (x) ⎥ ∈ ℝ , x ∈ ℝ𝑛 .
⎣⋮ (f ∘ g)𝑙 (x)⎦

By deﬁnition, the 𝑖-th row and 𝑗-th column of 𝐷f∘g (a) is

𝜕(f ∘ g)𝑖
(𝐷f∘g (a)) = (a).
𝑖,𝑗 𝜕𝑥𝑗

If you look at it long enough, you’ll realize that 𝜕(f∘g)

𝜕𝑥𝑗 (a) is the derivative of a single variable function. Indeed, the
𝑖

function to be diﬀerentiated is the composition of the curve

𝛾 ∶ 𝑡 ↦ 𝑔(𝑎1 , … , 𝑎𝑗−1 , 𝑡, 𝑎𝑗+1 , … , 𝑎𝑛 )

and the vector-scalar function 𝑓𝑖 ∶ ℝ𝑚 → ℝ. Thus, the chain rule for the composition of scalar-vector and vector-scalar
functions (given by Theorem 30.1.1) can be applied:

𝜕(f ∘ g)𝑖 𝜕
(a) = ∇𝑓𝑖 (g(a))𝑇 g(a),
𝜕𝑥𝑗 𝜕𝑥𝑗
𝜕
where 𝜕𝑥𝑗 g(a) is the componentwise derivative

𝜕𝑔1 (a)
⎡ 𝜕𝑔𝜕𝑥(a)
𝑗 ⎤

𝜕 ⎢ 𝜕𝑥 2
⎥
g(a) = ⎢ 𝑗
⎥.
𝜕𝑥𝑗 ⎢ ⋮ ⎥
𝜕𝑔𝑚 (a)
⎣ 𝜕𝑥𝑗 ⎦

346 Chapter 31. Derivatives of vector-valued functions

Mathematics of Machine Learning

To sum up, we have

𝜕(f ∘ g)𝑖 𝜕
(a) = ∇𝑓𝑖 (g(a))𝑇 g(a)
𝜕𝑥𝑗 𝜕𝑥𝑗
𝑚
𝜕𝑓𝑖 𝜕𝑔
=∑ (a) 𝑘 (a).
𝑘=1
𝜕𝑥𝑘 𝜕𝑥𝑗
This is the element in the 𝑖-th row and 𝑗-th column of the matrix product 𝐷f (g(a))𝐷g (a), hence

𝐷f∘g (a) = 𝐷f (g(a))𝐷g (a),

which is what we had to show. □

With the concept of total derivatives for vector-vector functions and the general chain rule under our belt, we are ready
to actually do things with multivariable functions. Thus, our next stop lays the foundations of optimization.

31.4. Derivatives and function operations 347

Mathematics of Machine Learning

348 Chapter 31. Derivatives of vector-valued functions

CHAPTER

THIRTYTWO

MINIMA AND MAXIMA IN MULTIPLE DIMENSIONS

In a single variable, we have successfully used the derivatives to find local optima of differentiable functions.
Recall that if 𝑓 ∶ ℝ → ℝ is differentiable everywhere, then Theorem 23.3.1 gives that
(a) 𝑓 ′ (𝑎) = 0 and 𝑓 ′′ (𝑎) > 0 implies a local minimum,(b) and 𝑓 ′ (𝑎) = 0 and 𝑓 ′′ (𝑎) < 0 implies a local maximum.
(A simple 𝑓 ′ (𝑎) = 0 is not enough to conclude the local extremum, as the example 𝑓(𝑥) = 𝑥3 shows at 0).
Can we do something similar in multiple variables? Right from the start, there seem to be an issue: the derivative is not
a scalar (thus, we can’t equate it to zero).
This is easy to solve: the analogue of the condition 𝑓 ′ (𝑎) = 0 is ∇𝑓(a) = (0, 0, … , 0) for multivariate functions. For
simplicity, the zero vector (0, 0, … , 0) will also be denoted by 0. Don’t worry, this won’t be confusing; it’s all clear from
the context. Introducing a new notation for the zero vector would just add more complexity.
We can even visualize this. In a single variable, we have already seen this: as Fig. 32.1 illustrates, 𝑓 ′ (𝑎) = 0 implies that
the tangent line is horizontal.

Fig. 32.1: Local extrema in a single variable.

349
Mathematics of Machine Learning

In multiple variables, the situation is similar: ∇𝑓(a) = 0 implies that the best local linear approximation (30.3) is constant;
that is, the tangent plane is horizontal. (As visualized by Fig. 32.2.)

Fig. 32.2: Local extrema in multiple variables.

32.1 Local extrema in multiple dimensions

So, what does ∇𝑓(a) = 0 imply? Similarly to the single-variable case, we have three options:
1. local minima,
2. local maxima,
3. neither.
The functions
𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 ,
𝑔(𝑥, 𝑦) = −(𝑥2 + 𝑦2 ),
ℎ(𝑥, 𝑦) = 𝑥2 − 𝑦2
at (0, 0) provide an example for all three, as Fig. 32.3, Fig. 32.4, and Fig. 32.5 show. (Keep in mind that a local extremum
might be global.)
To put things into order, let’s start formulating deﬁnitions and theorems.

Deﬁnition 31.1.1 (Critical points.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be an arbitrary vector-scalar function and suppose that 𝑓 is partially diﬀerentiable in all variables at some
a ∈ ℝ.
We say that a is a critical point of 𝑓 if
∇𝑓(a) = 0

350 Chapter 32. Minima and maxima in multiple dimensions

Mathematics of Machine Learning

Fig. 32.3: A (local) maximum.

Fig. 32.4: A (local) minimum.

32.1. Local extrema in multiple dimensions 351

Mathematics of Machine Learning

Fig. 32.5: A saddle point.

holds.

For the sake of precision, let’s deﬁne local extrema in multiple dimensions.

Deﬁnition 31.1.2 (Local minima and maxima.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be an arbitrary vector-scalar function and let a ∈ ℝ𝑛 be an arbitrary point.
(a) a is a local minimum if there exists an 𝜀 > 0 such that

𝑓(a) ≤ 𝑓(x), x ∈ 𝐵(𝜀, a).

(b) a is a strict local minimum if there exists an 𝜀 > 0 such that

𝑓(a) < 𝑓(x), x ∈ 𝐵(𝜀, a).

(c) a is a local maximum if there exists an 𝜀 > 0 such that

𝑓(a) ≥ 𝑓(x), x ∈ 𝐵(𝜀, a).

(d) a is a strict local maximum if there exists an 𝜀 > 0 such that

𝑓(a) > 𝑓(x), x ∈ 𝐵(𝜀, a).

As the example of 𝑥2 − 𝑦2 shows, a critical point is not necessarily a local extremum, but a local extremum is always a
critical point. The next result, which is the analogue of Theorem 23.2.1, makes this mathematically precise.

Theorem 31.1.1

352 Chapter 32. Minima and maxima in multiple dimensions

Mathematics of Machine Learning

Let 𝑓 ∶ ℝ𝑛 → ℝ be an arbitrary vector-scalar function, and suppose that 𝑓 is partially diﬀerentiable with respect to all
variables at some a ∈ ℝ.
If 𝑓 has a local extremum at a, then ∇𝑓(a) = 0.

Proof. This is a direct consequence of Theorem 23.2.1, as if a = (𝑎1 , … , 𝑎𝑛 ) is a local extremum of the vector-scalar
function 𝑓, then it is a local extremum of the single-variable functions 𝑓(a + ℎe𝑖 ), where e𝑖 is the vector whose 𝑖-th
component is 1, while the others are zero.
According to the very deﬁnition of the partial derivative given by Deﬁnition 29.1.1,

𝑑 𝜕𝑓
𝑓(a + ℎe𝑖 ) = (a).
𝑑ℎ 𝜕𝑥𝑖

Thus, Theorem 23.2.1 gives that

𝜕𝑓
(a) = 0
𝜕𝑥𝑖

for all 𝑖 = 1, … , 𝑛, giving that ∇𝑓(a) = 0. □

32.1.1 The two-dimensional case

So, how can we ﬁnd the local extrema with the derivative? As we have already suggested, studying the second derivative
will help us pinpoint the extrema among critical points. Unfortunately, things are much more complicated in 𝑛 variables,
so let’s focus on the two-variable case ﬁrst.

Theorem 31.1.2 (The two-variable second derivative test.)

Let 𝑓 ∶ ℝ2 → ℝ be an arbitrary vector-scalar function, and suppose that 𝑓 is partially diﬀerentiable with respect to both
variables at some a ∈ ℝ. Also suppose that a is a critical point, that is, ∇𝑓(a) = 0.
𝜕2 𝑓
(a) If det 𝐻𝑓 (a) > 0 and 𝜕𝑥22
> 0, then a is a local minimum.
𝜕2 𝑓
(b) If det 𝐻𝑓 (a) > 0 and 𝜕𝑥22
< 0, then a is a local maximum.

(c) If det 𝐻𝑓 (a) < 0, then a is a saddle point.

We will not prove this, but some remarks are in order. First, as the determinant of the Hessian can be zero, Theorem
31.1.3 does not cover all possible cases.
It’s probably best to see a few examples, so let’s revisit the previously seen functions

𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 ,
𝑔(𝑥, 𝑦) = −(𝑥2 + 𝑦2 ),
ℎ(𝑥, 𝑦) = 𝑥2 − 𝑦2 .

All three have a critical point at 0, so the Hessians can provide a clearer picture. The Hessians are given by the matrices

2 0 −2 0 2 0
𝐻𝑓 (𝑥, 𝑦) = [ ], 𝐻𝑔 (𝑥, 𝑦) = [ ], 𝐻ℎ (𝑥, 𝑦) = [ ].
0 2 0 −2 0 −2

𝜕2 𝑓
For functions of two variables, Theorem 31.1.3 says that it is enough to study det 𝐻𝑓 (a) and 𝜕𝑦2 (a).

32.1. Local extrema in multiple dimensions 353

Mathematics of Machine Learning

2
In the case of 𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 , we have 𝐻𝑓 (0, 0) = 4 and 𝜕𝜕𝑦𝑓2 (0, 0) = 2, giving that 0 is a local minimum of
𝑓(𝑥, 𝑦) = 𝑥2 + 𝑦2 . Similarly, we can conclude that 0 is a local maximum of 𝑔(𝑥, 𝑦) = −(𝑥2 + 𝑦2 ). (Which shouldn’t
surprise you, as 𝑔 = −𝑓.)
Finally, for ℎ(𝑥, 𝑦) = 𝑥2 − 𝑦2 , the second derivative test conﬁrms that 0 is indeed a saddle point.

32.1.2 The general case

So, what’s up with the general case? Unfortunately, just studying the determinant of the Hessian matrix is not enough.
We need to bring in the heavy hitters: eigenvalues. Here is the second derivative test in its full glory.

Theorem 31.1.3 (The multivariable second derivative test.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be an arbitrary vector-scalar function, and suppose that 𝑓 is partially diﬀerentiable with respect to all
variables at some a ∈ ℝ. Also suppose that a is a critical point, that is, ∇𝑓(a) = 0.
(a) If all the eigenvalues of 𝐻𝑓 (a) are positive, then a is a local minimum.
(b) If all the eigenvalues of 𝐻𝑓 (a) are negative, then a is a local maximum.
(c) If all the eigenvalues of 𝐻𝑓 (a) are either positive or negative, then a is a saddle point.

That’s right: if any of the eigenvalues are zero, then the test is inconclusive. You might recall from linear algebra that
in practice, computing the eigenvalues is not as fast as computing the second-order derivatives, but there are plenty of
numerical methods. (Like the QR-algorithm.)
To sum it up, the method of optimizing (diﬀerentiable) multivariable functions is a simple two-step process:
1. ﬁnd the critical points by solving the equation ∇𝑓(x) = 0,
2. then use the second derivative test to determine which critical points are extrema.
Do we use this method in practice to optimize functions? No. Why? Most importantly because computing the eigenvalues
of the Hessian for a vector-scalar function of millions of variables is extremely hard. Why is the second derivative test so
important? Because understanding the behavior of functions around their extremal points is essential to truly understand
gradient descent. Believe it or not, this is the key behind the theoretical guarantees for gradient descent.
Speaking of gradient descent, now is the time to dig deep into the algorithm that powers neural networks.

354 Chapter 32. Minima and maxima in multiple dimensions

CHAPTER

THIRTYTHREE

GRADIENT DESCENT IN ITS FULL FORM

Gradient descent is one of the most important algorithms in machine learning. We have talked about this a lot, although
up until this point, we have only seen it for single-variable functions. (Which is, let’s admit it, not the most practical
use-case.)
However, now we have all the tools we need to talk about gradient descent in its general form. Let’s get to it!

33.1 Gradient ascent and descent

Suppose that we have a diﬀerentiable vector-scalar function 𝑓 ∶ ℝ𝑛 → ℝ that we want to maximize. This can describe the
loss function of a neural network, the return on investment of an investing strategy, or any other quantity.
Calculating the gradient and ﬁnding the critical points is often not an option, as solving the equation ∇𝑓(x) = 0 can be
computationally unfeasible. Thus, we resort to an iterative solution.
The algorithm is the same as for single-variable functions:
1. we start from a random point,
2. calculate the gradient,
3. take a step towards its direction.
This is called gradient ascent. We can formalize it in the following way.

Algorithm 32.1.1 (The gradient ascent algorithm.)

Step 1. Initialize the starting point x0 ∈ ℝ𝑛 and select a learning rate ℎ ∈ (0, ∞).
Step 2. Let

x𝑛+1 ∶= x𝑛 + ℎ∇𝑓(x𝑛 ).

Step 3. Repeat Step 2. until convergence.

If we want to minimize 𝑓, we might as well maximize −𝑓. The only eﬀect of this is a sign change for the gradient. In this
form, the algorithm is called gradient descent, and this is what’s widely used for training neural networks.

Algorithm 32.1.2 (The gradient descent algorithm.)

Step 1. Initialize the starting point x0 ∈ ℝ𝑛 and select a learning rate ℎ ∈ (0, ∞).
Step 2. Let

x𝑛+1 ∶= x𝑛 − ℎ∇𝑓(x𝑛 ).

355
Mathematics of Machine Learning

Step 3. Repeat Step 2. until convergence.

33.2 Implementation of gradient descent

After all of the setup, implementing the gradient descent is straightforward.

import numpy as np
import nbimporter
from tools.function import MultivariableFunction

def gradient_descent(
f: MultivariableFunction,
x_init: np.array, # the initial guess
learning_rate: float = 0.1, # the learning rate
n_iter: int = 1000, # number of steps
):
x = x_init

for n in range(n_iter):
grad = f.grad(x)
x = x - learning_rate*grad

return x

Notice that it is almost identical to the single variable version. To see if it works correctly, let’s test it out on the squared
Euclidean norm function! (The one that we implemented a few chapters earlier.)

from tools.function import SquaredNorm

squared_norm = SquaredNorm()
local_minimum = gradient_descent(
f=squared_norm,
x_init=np.array([10.0, -15.0])
)

local_minimum

array([ 1.23023192e-96, -1.84534788e-96])

There is nothing special to it, really. The issues with multivariable gradient descent are the same as what we discussed at
the single-variable version: it can get stuck in local minima, it is sensitive to our choice of learning rate, and the gradient
can be computationally hard to calculate in high dimensions.
For machine learning, we’ll have to solve all of these issues.

356 Chapter 33. Gradient descent in its full form

Part V

Probability theory

357
CHAPTER

THIRTYFOUR

WHAT IS PROBABILITY?

When going about our lives, we almost always think in binary terms. A statement is either true or false. An outcome has
either occurred or not.
In practice though, we rarely have the comfort of certainty. We have to operate with incomplete information. When a
scientist observes the outcome of an experiment, can she verify her hypothesis with 100% certainty? No. Because she
does not have complete control over all the variables of the experiment (like the weather or the alignment of stars), the
observed eﬀect might be unintentional. Each result will either strengthen or weaken our belief in the hypothesis, but none
will provide ultimate proof.
In machine learning, our job is not to simply provide a prediction about some class label, but to formulate a mathematical
model that summarizes our knowledge about the data in a way that conveys information about the degree of our certainty
in the prediction as well.
So, ﬁtting a parametric function 𝑓 ∶ ℝ𝑛 → ℝ𝑚 to model the relation between the data and the variable to be predicted is
not enough. We will need an entirely new vocabulary to formulate such models. We need to think in terms of probabilities.

34.1 Thinking in absolutes

First, let’s talk about how we think. On the most basic level, our knowledge about the world is stored in propositions. In
a mathematical sense, a proposition is a declaration that is either true or false. (In binary terms, true is denoted by 1 and
false is denoted by 0.)
“The sky is blue.“
“There are inﬁnitely many prime numbers.“
“1 + 1 = 3.“
“I got the ﬂu.“
Propositions are often abbreviated as variables such as 𝐴 = "it's raining outside".
Determining the truth value of a given proposition using evidence and reasoning is called inference. To be able to formulate
valid arguments and understand how inference works, we’ll take a quick visit in the world of mathematical logic.

359
Mathematics of Machine Learning

34.1.1 Mathematical logic 101

So, we have propositions like 𝐴 = "it's raining outside", or 𝐵 = "the sidewalk is wet". We need more expression power:
propositions are building blocks, and we want to combine them, yielding more complex propositions.
We can formulate complex propositions from smaller building blocks with logical connectives. Consider the proposition
“if it is raining outside, then the sidewalk is wet”. This is the combination of 𝐴 and 𝐵, strung together by the implication
connective.
There are four essential connectives:• NOT (¬), also known as negation,• AND (∧), also known as conjunction,• OR (∨),
also known as disjunction,• THEN (→), also known as implication.
Connectives are deﬁned by the truth values of the resulting propositions. For instance, if 𝐴 is true, then ¬𝐴 is false; if 𝐴
is false, then ¬𝐴 is true. Denoting true by 1 and false by 0, we can describe connectives with truth tables. Here is the one
for negation.

𝐴 ¬𝐴
0 1
1 0

AND (∧) and OR (∨) connect two propositions. 𝐴 ∧ 𝐵 is true if both 𝐴 and 𝐵 are true, while 𝐴 ∨ 𝐵 is true if either one
is.
𝐴 𝐵 𝐴∧𝐵 𝐴∨𝐵
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 1

The implication connective THEN (→) formalizes the deduction of a conclusion 𝐵 from a premise 𝐴. By deﬁnition,
𝐴 → 𝐵 is true if 𝐵 is true or both 𝐴 and 𝐵 are false. An example: if “it’s raining outside”, THEN “the sidewalk is wet”.

𝐴 𝐵 𝐴→𝐵
0 0 1
0 1 1
1 0 0
1 1 1

Note that 𝐴 → 𝐵 does not imply 𝐵 → 𝐴. This common logical fallacy is called aﬃrming the consequent, and we’ve all
fell victim to it at some point in our lives. To see a concrete example: if “it’s raining outside”, then “the sidewalk is wet”,
but not the other way around. The sidewalk can be wet for other reasons, like someone spilling a barrel of water.
Connectives correspond to set operations. Why? Let’s take a look at the formal deﬁnition of set operations.

Definition 33.1.1 (The (reasonably) formal definition of set operations and relations.)
Let 𝐴 and 𝐵 be two sets.
(a) The union of 𝐴 and 𝐵 is defined by

𝐴 ∪ 𝐵 ∶= {𝑥 ∶ (𝑥 ∈ 𝐴) ∨ (𝑥 ∈ 𝐵)},

that is, 𝐴 ∪ 𝐵 contains all elements that are in 𝐴 or 𝐵.

(b) The intersection of 𝐴 and 𝐵 is deﬁned by

𝐴 ∩ 𝐵 ∶= {𝑥 ∶ (𝑥 ∈ 𝐴) ∧ (𝑥 ∈ 𝐵)},

that is, 𝐴 ∪ 𝐵 contains all elements that are in 𝐴 and 𝐵.

360 Chapter 34. What is probability?

Mathematics of Machine Learning

(c) We say that 𝐴 is a subset of 𝐵, that is, 𝐴 ⊆ 𝐵 if

(𝑥 ∈ 𝐴) → (𝑥 ∈ 𝐵)

is true for all 𝑥 ∈ 𝐴.

(d) The complement of 𝐴 with respect to an Ω ⊃ 𝐴 is deﬁned by

Ω\𝐴 ∶= {𝑥 ∈ Ω ∶ ¬(𝑥 ∈ 𝐴)},

that is, Ω\𝐴 contains all elements that are in Ω, but not in 𝐴.

If you carefully read through the deﬁnitions, you can see how connectives and set operations relate. ∧ is intersection, ∨ is
union, ¬ is the complement, and → is the subset relation. This is illustrated by Fig. 35.1. (I’ve slightly abused the notation
here, as statements like 𝐴 ∧ 𝐵 ⟺ 𝐴 ∩ 𝐵 is mathematically incorrect. 𝐴 and 𝐵 cannot be a proposition and a set at
the same time, and thus equivalence is not precise. )

Fig. 34.1: Connectives and set operations

Why is this important? Because probability operates on sets, and sets play the role of propositions. We’ll see this later,
but ﬁrst, let’s dive deep into how mathematical logic formalizes scientiﬁc thinking.

34.1.2 The inference process

Let’s refine the inference process of mathematical logic. A proposition is either true or false, fair and square. How can
we determine that in practice? Say, how do we find the truth value of the proposition “there are infinitely many prime
numbers“?
Using evidence and deduction. Like Sherlock Holmes solving a crime by connecting facts, we rely on knowledge of the
form “if 𝐴, then 𝐵“. Our knowledge about the world is stored in true implications. For example,
• “If it is raining, then the sidewalk is wet.“

34.1. Thinking in absolutes 361

Mathematics of Machine Learning

• “If 𝐴𝐵𝐶 is a right triangle, then 𝐴2 + 𝐵2 = 𝐶 2 .“

• “If a system is closed, then its entropy cannot decrease.“
Notice that 𝐴 → 𝐵 doesn’t imply 𝐵 → 𝐴. If it rains, the sidewalk is wet; but if the sidewalk is wet, it’s not necessarily
because of rain. Maybe someone spilled a barrel of water.
As we have seen, the implication can be translated into the language of set theory. (As all the other connectives.) While
∧ corresponds to intersection and OR to union, the implication is the subset relation. Keep this in mind, as it’s going to
be important.
During inference, we use implications in the following way.
1. If 𝐴, then 𝐵.
2. 𝐴.
3. Therefore 𝐵.
This is called the modus ponens. If it sounds abstract, here is a concrete example.
1. If it is raining, the sidewalk is wet.
2. It is raining.
3. Therefore, the sidewalk is wet.
Thus, we can infer the state of the sidewalk without looking at it. This is bigger than it sounds: modus ponens is a
cornerstone of scientiﬁc thinking. We would still be living in caves without it. Modus ponens enables us to build robust
skyscrapers of knowledge.
However, it’s not all perfect. Classical deductive logic might help to prove the inﬁnity of prime numbers, but it fails
spectacularly when confronted with inference problems outside the realms of mathematics and philosophy.

34.2 Thinking in probabilities

Classical logic has a fatal ﬂaw: it is unable to deal with uncertainty. Think about the simple proposition “it is raining
outside”. If we are unable to actually observe the weather but have some indirect evidence (like the fact that the sidewalk
is wet, or the sky is cloudy, or it’s autumn out there), “it is raining outside” is probable, but not certain.
We need to tool to measure the truth value on a 0 − 1 scale. This is where probabilities come in.
In a mathematical sense, probability is a function that assigns a numerical value between zero and one to various sets that
represent events. (You can think about events as propositions.) Events are subsets of the event space, often denoted with
the capital Greek letter omega (Ω). This is illustrated in Fig. 34.2.
This sounds quite abstract, so let’s see a simple example: rolling a fair six-sided dice. We can encode all possi-
ble outcomes with the event space Ω = {1, 2, 3, 4, 5, 6}. Events such as 𝐴 = "the outcome is even" or 𝐵 =
"the outcome is larger than 3" are represented by the sets

𝐴 = {2, 4, 6},
𝐵 = {4, 5, 6}.

As the dice is fair, the probability of each outcome is the same:

𝑃 ({1}) = ⋯ = 𝑃 ({6}) = 6.

There are two properties that make such a function 𝑃 a proper measure of probability:
1. the probability of the event space is one,
2. and the probability of the union of disjoint events is the sum of probabilities.

362 Chapter 34. What is probability?

Mathematics of Machine Learning

Fig. 34.2: Events and the event space

In our dice-rolling example, this is translated to, for instance,

𝑃 (the outcome is even) = 𝑃 ({2, 4, 6})

= 𝑃 ({2}) + 𝑃 ({4}) + 𝑃 ({6})
1 1 1
= + +
6 6 6
1
= .
2
We’ll talk about these properties extensively in the next chapter. As logical connectives can be represented in the language
of set theory, set operations translate the semantics of logics into probabilities. Intersection is joint occurrence of events.
Union is the occurrence of either one.
This way, we are able to build models involving uncertainty and develop a calculus to work with said models. In the Tower
of Babel that is mathematics, statistics deals with the modeling part, and probability theory deals with the calculus part.
Even technically well-trained engineers conflate modeling and working with models. For instance, when we talk about
flipping fair coins, the probability of heads and tails are both 1/2. Even when we are absolutely sure about the model but
have ten heads in a row, most would immediately jump to the conclusion that our coin is biased.
To make sure we are not making this mistake, first we are going to learn what is a probabilistic model and how to work
with it.

34.2. Thinking in probabilities 363

Mathematics of Machine Learning

Fig. 34.3: The probabilities of intersection and union

364 Chapter 34. What is probability?

CHAPTER

THIRTYFIVE

THE AXIOMS OF PROBABILITY

In the previous chapter, we have talked about probability as the extension of mathematical logic. Just like formal logic,
probability has its axioms, which we need to understand to work with probability models. In this chapter, we are going
to seek the answer to a fundamental question: what is the mathematical model of probability and how to work with it?

35.1 The mathematical formulation of probability

Probabilities are deﬁned in context of experiments and outcomes. To talk about probabilities, we need to deﬁne what do
we assign probabilities to. Formally speaking, we denote the probability of the event 𝐴 by 𝑃 (𝐴). First, we’ll talk about
what events are.

35.1.1 Event spaces and event algebras

Let’s revisit the six-sided example from the previous chapter. There are six diﬀerent mutually exclusive outcomes, and
together they form the event space, denoted by Ω:

Ω ∶= {1, 2, 3, 4, 5, 6}.

In general, the event space is the collection of all mutually exclusive outcomes. It can be any set.
Returning to our dice-rolling example, what kind of events can we assign probabilities to? Obviously, the individual
outcomes come to mind. However, we can think of events like “the result is an odd number”, “the result is 2 or 6”, or
“the result is not 1”. Following this logic, our expectations are that for any two events 𝐴 and 𝐵,
• 𝐴 or 𝐵,
• 𝐴 and 𝐵,
• and not 𝐴
are events as well. These can be translated to the language of set theory, and are formalized by the notion of event algebras.

Deﬁnition 34.1.1 (Event algebras.)

Let Ω be an event space. A collection of its subsets Σ ⊆ 2Ω is called an event algebra over Ω if the following properties
hold.
(a) Ω ∈ Σ. (That is, the set of all outcomes is an event.)
(b) For all 𝐴 ∈ Σ, the set Ω\𝐴 is also an element of Σ. (That is, event algebras are closed to complements.)
(c) For all 𝐴1 , 𝐴2 , ⋯ ∈ Σ, the set ∪∞
𝑛=1 𝐴𝑛 is also an element of Σ. (That is, event algebras are closed to unions.)

365
Mathematics of Machine Learning

Since events are modelled by sets, logical concepts like and, or, and not can be translated to set operations. That is,
• the joint occurrence of events 𝐴 and 𝐵 is equivalent to 𝐴 ∩ 𝐵,
• 𝐴 or 𝐵 is equivalent to 𝐴 ∪ 𝐵,
• and not 𝐴 is equivalent to Ω\𝐴.
In the literature, event algebras are frequently referred to as 𝜎-algebras. We’ll use the former terminology, but keep this
in mind for your later studies.
An immediate consequence of the deﬁnition is that for any events 𝐴1 , 𝐴2 , ⋯ ∈ Σ, their intersection ∩∞
𝑛=1 𝐴𝑛 is also a
member of Σ. Indeed, as De Morgan’s laws suggest,

Ω\( ∩∞ ∞
𝑛=1 𝐴𝑛 ) = ∪𝑛=1 (Ω\𝐴𝑛 ).

Since Ω\(Ω\𝐴) = 𝐴, we have

∩∞ ∞
𝑛=1 𝐴𝑛 = Ω\(Ω\ ∩𝑛=1 𝐴𝑛 )

= Ω\ ∪∞
𝑛=1 ( Ω\𝐴
⏟ 𝑛 ).
∈Σ

Thus, the deﬁning properties of event algebras guarantee that ∩∞

𝑛=1 𝐴𝑛 is indeed an element of Σ.
At first glance, event algebras seem a bit abstract. As usual, a bit of abstraction now will pay us huge dividends later in
our studies. To bring this concept closer, here is a summary of event algebras in English.
• The set of all possible outcomes is an event.
• For any event, it not occurring is an event as well.
• For any events, their joint occurrence is an event as well.
• For any events, at least one of them occurring is an event as well.
Now that we have the formal definition under our belt, let’s see the first example.
Example 1. Rolling a six-sided dice. There, the event algebra is simply the power set of the event space:

Ω = {1, 2, 3, 4, 5, 6}, Σ = 2Ω .

Even though this is one of the simplest examples, it will serve as a prototype and a building block for constructing more
complicated event spaces.
Example 2. Tossing a coin 𝑛 times. A single toss has two possible outcomes: heads or tails. For simplicity, we are going
to encode heads with 0 and tails with 1. Since we are tossing the coin 𝑛 times, the result of an experiment will be an
𝑛-long sequence of ones and zeros. Like this: (0, 1, 1, 1, … , 0, 1). Thus, the complete event space is Ω = {0, 1}𝑛 .
(We are not talking about probabilities just yet, but feel free to spend some time ﬁguring out how to assign them to these
events. Don’t worry if this is not clear; we will go through it in detail.)
Just like in the previous example, the event algebra 2Ω is a good choice. This covers all events that we need, for instance,
“the number of tails is 𝑘“.
In practice, event algebras are rarely given explicitly. Sure, for simple cases such as the above, it is possible.
What about cases where the event spaces are not countable? For instance, suppose that we are picking a random number
between 0 and 1. Then, Ω = [0, 1], but selecting Σ = 2[0,1] is extremely problematic. Recall that we want to assign a
probability to every event in Σ. The power set 2[0,1] is so large that very strange things can occur. In certain scenarios,
we can cut up sets into a ﬁnite number of pieces and reassemble two identical copies of the set from its pieces. (If you are
interested in more, check out the Banach-Tarski paradox.)
To avoid weird things like the above mentioned, we need another way to describe event algebras.

366 Chapter 35. The axioms of probability

Mathematics of Machine Learning

35.1.2 Describing event algebras

Let’s start with a simple yet fundamental property of event algebras that we’ll soon use to give a friendly description of
event algebras.

Theorem 34.1.1 (Intersection of event algebras.)

Let Ω be an event space and Σ1 , Σ2 be two event algebras over it. Then Σ1 ∩ Σ2 is an event algebra as well.

Proof. As we saw in the definition of event algebras, there are three properties we need to verify to show that Σ1 ∩ Σ2 is
an event algebra. This is very simple to check, so I suggest taking a shot by yourself first before reading my explanation.
(a) As both Σ1 and Σ2 are event algebras, Ω ∈ Σ1 and Ω ∈ Σ2 both hold. Thus, by definition of the intersection,
Ω ∈ Σ 1 ∩ Σ2 .
(b) Let 𝐴 ∈ Σ1 ∩ Σ2 . As both of them are event algebras, Ω\𝐴 ∈ Σ1 and Ω\𝐴 ∈ Σ2 . Thus, Ω\𝐴 is an element of the
intersection as well.
(c) Let 𝐴1 , 𝐴2 , ⋯ ∈ Σ1 ∩ Σ2 be arbitrary events. We can use the exact same argument as before: as both Σ1 and Σ2 are
event algebras, ∪∞ 𝑛=1 𝐴𝑛 ∈ Σ1 and ∪𝑛=1 𝐴𝑛 ∈ Σ2 . So, the union is also a member of the intersection. □
∞

With all that, we are ready to describe event algebras with a generating set.

Theorem 34.1.2 (Generated event algebras.)

Let Ω be an event space and 𝑆 ⊆ 2Ω be an arbitrary collection of its sets. Then there is an unique smallest event algebra
𝜎(𝑆) that contains 𝑆.

(By smallest, we mean that if Σ is an event algebra containing 𝑆, then 𝜎(𝑆) ⊆ Σ.)

Proof. Our previous result shows that the intersection of event algebras is also an event algebra. So, let’s take all event
algebras that contain 𝑆 and take their intersection. Formally, we deﬁne

𝜎(𝑆) = ∩{Σ ∶ Σ is an event algebra and 𝑆 ⊆ Σ}.

By deﬁnition, 𝜎(𝑆) is clearly the smallest, and it also contains 𝑆. □

Right away, we can use this to precisely construct the event algebra for an extremely common task: picking a number
between 0 and 1.
Example 3. Selecting a random number between 0 and 1. It is clear that the event space is Ω = [0, 1]. What about
the events? In this situation, we want to ask questions like the probability of a random number 𝑋 falling between some
𝑎, 𝑏 ∈ [0, 1]. That is, events like (𝑎, 𝑏), (𝑎, 𝑏], [𝑎, 𝑏), [𝑎, 𝑏]. (Whether or not we want strict inequality regarding 𝑎 and 𝑏.)
So, a proper event algebra can be given by the algebra generated by events of the form (𝑎, 𝑏]. That is,

Σ = 𝜎({(𝑎, 𝑏] ∶ 0 ≤ 𝑎 < 𝑏 ≤ 1}).

This Σ has a rich structure. For instance, it contains simple events like {𝑥}, where 𝑥 ∈ [0, 1], but more complex ones
like “𝑋 is a rational number” or “𝑋 is an irrational number”. Give yourself a few minutes to see why this is true. Don’t
worry if you don’t see the solution, we’ll work this out in the problems section. (If you think this through, you’ll also see
why we chose intervals of the form (𝑎, 𝑏] instead of others like (𝑎, 𝑏) or [𝑎, 𝑏].)
Now that we understand what events and event algebras are, we can take our ﬁrst detailed look at probability. In the next
section, we will introduce its precise mathematical deﬁnition.

35.1. The mathematical formulation of probability 367

Mathematics of Machine Learning

35.1.3 Event algebras over real numbers

From all the examples we have seen so far, it is clear that most commonly, we deﬁne probability spaces on ℕ or on ℝ.
When Ω ⊆ ℕ, the choice of event algebra is clear, as Σ = 2Ω will always work.
However, as suggested in Example 3 above, selecting Σ = 2Ω when Ω ⊆ ℝ can lead to some weird stuﬀ. Because we are
interested in the probability of events like [𝑎, 𝑏], our standard choice is going to be the generated event algebra

ℬ(ℝ) = 𝜎({(𝑎, 𝑏) ∶ 𝑎, 𝑏 ∈ ℝ}), (35.1)

called the Borel-algebra, named after the famous French mathematician Émile Borel. Due to its construction, ℬ contains
all events that are important to us, such as intervals and unions of intervals. Elements of ℬ are called Borel-sets.
Because event algebras are closed to unions, you can see that all types of intervals can be found in ℬ(ℝ). This is summa-
rized by the following theorem.

Theorem 34.1.3
For all 𝑎, 𝑏 ∈ ℝ, the sets [𝑎, 𝑏], (𝑎, 𝑏], [𝑎, 𝑏), (−∞, 𝑎], (−∞, 𝑎), (𝑎, ∞), [𝑎, ∞) are elements of ℬ(ℝ).

As an exercise, try to come up with the proof by yourself. One trick to get the ideas ﬂowing is to start drawing some
ﬁgures. If you can visualize what happens, you’ll discover a proof quickly.

Proof. In general, for a given set 𝑆, we can show that it belongs to ℬ(ℝ) by writing it as the union/intersection/diﬀerence
of known Borel sets. First, we have

(𝑎, ∞) = ∪∞
𝑛=1 (𝑎, 𝑛),

so (𝑎, ∞) ∈ ℬ(ℝ). With a similar argument, we see that (−∞, 𝑎) ∈ ℬ(ℝ). Next,

(−∞, 𝑎] = ℝ\(𝑎, ∞),

[𝑎, ∞) = ℝ\(−∞, 𝑎),

so (−∞, 𝑎], [𝑎, ∞) ∈ ℬ(ℝ) for all 𝑎. From these, the sets [𝑎, 𝑏], (𝑎, 𝑏], [𝑎, 𝑏) can be produced by intersections. □

35.2 Probability measures

Let’s recap what we have learned so far! In the language of mathematics, experiments with intrinsic uncertainty are
described with outcomes, event spaces, and events. The collection of all possible mutually exclusive outcomes of an
experiment is the event space Ω. Its certain subsets are the events, to which we want to assign probabilities. The events
form the so-called event algebra Σ. We denote the probability of an event 𝐴 with 𝑃 (𝐴)
Intuitively speaking, we have three reasonable expectations about probability.
1. 𝑃 (Ω) = 1, that is, the probability that at least one outcome occurs is 1. In other words, our event space is a
complete description of the experiment.
2. 𝑃 (∅) = 0, that is, the probability that none of the outcomes occur is 0. Again, this means that our event space is
complete.
3. The probability that either of two events occurs for two mutually exclusive events is the sum of the individual
probabilities.

368 Chapter 35. The axioms of probability

Mathematics of Machine Learning

These are formalized by the following deﬁnition.

Deﬁnition 34.2.1
Let Ω be an event space and Σ be an event algebra over Ω. We say that the function 𝑃 ∶ Σ → [0, 1] is a probability
measure on Σ if the following properties hold.
(a) 𝑃 (Ω) = 1.
∞
(b) If 𝐴1 , 𝐴2 , … are mutually disjoint events (that is, 𝐴𝑖 ∩ 𝐴𝑗 = ∅ for all 𝑖 ≠ 𝑗), then 𝑃 (∪∞
𝑛=1 𝐴𝑛 ) = ∑𝑛=1 𝑃 (𝐴𝑛 ).
This property is called the 𝜎-additivity of probability measures.
Along with the probability measure 𝑃 , the structure (Ω, Σ, 𝑃 ) is said to form a probability space.

As usual, let’s see some concrete examples ﬁrst! We are going to continue with the ones we worked out when discussing
event algebras.
Example 1, continued. Rolling a six-sided dice. Recall that the event space and algebra were deﬁned by

Ω = {1, 2, 3, 4, 5, 6}, Σ = 2Ω .

If we don’t have any extra knowledge about our dice, it is reasonable to assume that each outcome is equally probable.
That is, since there are six possible outcomes, we have
1
𝑃 ({1}) = ⋯ = 𝑃 ({6}) = .
6
Notice that in this case, knowing the probabilities for the individual outcomes is enough to determine the probability of
any event. This is due to the (𝜎-)additivity of the probability. For instance, the event “the outcome of the dice roll is an
odd number” is described by
3
𝑃 ({1, 3, 5}) = 𝑃 ({1}) + 𝑃 ({3}) + 𝑃 ({5}) = .
6
In English, the probability of any event can be written down with the following formula:

favorable outcomes
𝑃 (event) = .
all possible outcomes
You might remember this from your elementary and high school studies (depending on the curriculum in your country).
This is a useful formula, but there is a caveat: it only works if we assume that each outcome has equal probability.
In the case when our dice is not uniformly weighted, the occurrences of individual outcomes are not equal. (Just think of
a lead dice, where one side is signiﬁcantly heavier than the others.) For now, we are not going to be concerned with this
case. Later, this generalization will be discussed in detail.
Example 2, continued. Tossing a coin 𝑛 times. Here, our event space and algebra was Ω = {0, 1}𝑛 and Σ = 2Ω . For
simplicity, let’s assume that 𝑛 = 5.
What is the probability of a particular result, say HHTTT? Going step by step, the probability that the ﬁrst toss will be
heads is 1/2. That is,

1
𝑃 (ﬁrst toss is heads) =
2
Since the ﬁrst toss is independent of the second,
1
𝑃 (second toss is heads) =
2

35.2. Probability measures 369

Mathematics of Machine Learning

as well. To combine this and calculate the probability that the first two tosses are both heads, we can think like the
following. Among the outcomes where the first toss is heads, exactly half of them will have the second toss heads as well.
So, we are looking for the half of the half. That is,
𝑃 (first two tosses are heads) = 𝑃 (first toss is heads)𝑃 (second toss is heads)
1
= .
4
Going further with the same logic, we obtain that
1 1
𝑃 (HHTTT) = = .
25 32
If we look a bit deeper, we can notice that this follows the previously seen “favorable/all” formula. Indeed, as we can see
with a bit of combinatorics, there are 25 total possibilities, all of them having equal probability.
Considering this, what is the probability that out of our five tosses, exactly two of them are heads? In the language of
sets, we can encode each five-toss experiment as a subset of {1, 2, 3, 4, 5}, the elements signifying the toss that resulted
in heads. (So, for example, {1, 4, 5} would encode the outcome HTTHH.) With this, the experiments when there are
two heads are exactly the two-element subsets of {1, 2, 3, 4, 5}.
From our combinatorics studies, we know that the number of subsets with given elements is
𝑛
( ),
𝑘
where 𝑛 is the size of our set and 𝑘 is the desired size of the subsets. So, in total, there are (52) number of occurrences
with exactly two heads. Thus, following the “favorable/all” formula, we have
5 1 10
𝑃 (two heads out of five tosses) = ( ) = .
2 32 32
One more example, and we are ready to move forward.
Example 3. Selecting a random number between 0 and 1. Here, our event space was Ω = [0, 1], and our event algebra
was the generated algebra
Σ = 𝜎({(𝑎, 𝑏] ∶ 0 ≤ 𝑎 < 𝑏 ≤ 1}).
Without any further information, it is reasonable to assume that every number can be selected with an equal probability.
What does this even mean for an infinite event space like Ω = [0, 1]? We can’t divide 1 into infinitely many equal parts.
So, instead of thinking about individual outcomes, we should start thinking about events. Let’s denote our randomly
selected number with 𝑋. If all number is “equally likely”, what is 𝑃 (𝑋 ∈ (0, 1/2])? Intuitively, given our equally likely
hypothesis, this probability should be proportional to the size of [0, 1/2]. Thus,
𝑃 (𝑋 ∈ 𝐼) = |𝐼|,
where 𝐼 is some interval and |𝐼| is its length. For instance,
𝑃 (𝑎 < 𝑋 < 𝑏) = 𝑃 (𝑎 ≤ 𝑋 < 𝑏)
= 𝑃 (𝑎 < 𝑋 ≤ 𝑏)
= 𝑃 (𝑎 ≤ 𝑋 ≤ 𝑏)
= 𝑏 − 𝑎.
By giving the probabilities on the generating set of the event algebra, the probabilities for all other events can be deduced.
For instance,
𝑃 (𝑋 = 𝑥) = 𝑃 (0 ≤ 𝑋 ≤ 𝑥) − 𝑃 (0 ≤ 𝑋 < 𝑥)
=𝑥−𝑥
= 0.
Thus, the probability of picking a given number is surprisingly zero. There is an important lesson here: events with zero
probability can happen. This sounds counterintuitive at first, but based on the above example, you can see that it is true.

370 Chapter 35. The axioms of probability

Mathematics of Machine Learning

35.2.1 Fundamental properties of probability

Now that we are familiar with the mathematical model of probability, we can start working with them. Manipulating
expressions of probabilities gives us the ability to deal with more and more complex scenarios.
If you recall, probability measures had three simple deﬁning properties:
(a) 𝑃 (Ω) = 1,
(b) 𝑃 (∅) = 0, and
∞
(c) 𝑃 (∪∞
𝑛=1 𝐴𝑘 ) = ∑𝑛=1 𝑃 (𝐴𝑘 ), if the events 𝐴𝑘 are mutually disjoint.

From these properties, many others can be deduced. For simplicity, here is a theorem summarizing the most important
ones.

Theorem 34.2.1
Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴, 𝐵 ∈ Σ be two arbitrary events.
(a) 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∪ 𝐵).
(b) 𝑃 (𝐴) = 𝑃 (𝐴 ∩ 𝐵) + 𝑃 (𝐴\𝐵). Speciﬁcally, 𝑃 (Ω\𝐴) + 𝑃 (𝐴) = 1.
(c) if 𝐴 ⊆ 𝐵, then 𝑃 (𝐴) ≤ 𝑃 (𝐵).

The proof of this is so simple that it is left to the reader as a simple exercise. All of these follow from the additivity of
probability measures with respect to disjoint events. (If you don’t see the solution, try drawing Venn diagrams.)
Another fundamental tool is the law of total probability, which is used all the time when dealing with more complex
events.

Theorem 34.2.2 (Law of total probability.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴 ∈ Σ be an arbitrary event. If 𝐴1 , 𝐴2 , ⋯ ∈ Σ are mutually disjoint events
(that is, 𝐴𝑖 ∩ 𝐴𝑗 = ∅ if 𝑖 ≠ 𝑗) for which ∪∞𝑛=1 𝐴𝑛 = Ω, then

∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴 ∩ 𝐴𝑛 ) (35.2)
𝑛=1

holds. We call mutually disjoint events whose union is the entire event space partitions.

Proof. This simply follows from the 𝜎-additivity of probability measures. Feel free to give the proof a shot by yourself
to test your understanding.
If you can’t see this, no worries. Here is a brief explanation. Since 𝐴1 , 𝐴2 , … are mutually disjoint, 𝐴 ∩ 𝐴1 , 𝐴 ∩ 𝐴2 , …
are mutually disjoint as well. Moreover, since ∪∞ 𝑛=1 𝐴𝑛 = Ω, we also have

∞
∪𝑛=1 𝐴𝑛 ∩ 𝐴 = ( ∪ ∞
𝑛=1 𝐴𝑛 ) ∩ 𝐴

=Ω∩𝐴
= 𝐴.

Thus, the 𝜎-additivity of probability measures imply that

∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴 ∩ 𝐴𝑛 ),
𝑛=1

35.2. Probability measures 371

Mathematics of Machine Learning

which is what we had to show. □

Let’s see an example right away! Suppose that we toss two dice. What is the probability that the sum of the results is 7?
First, we should properly describe the probability space. For notational simplicity, let’s denote the result of the throws
with 𝑋 and 𝑌 . What we are looking for is 𝑃 (𝑋 + 𝑌 = 7). Modeling the toss with two dice is the simplest if we impose
order among the dice: we designate the ﬁrst and the second dice. With this in mind, the event space Ω is described by
the Cartesian product
Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}
= {(𝑖, 𝑗) ∶ 𝑖, 𝑗 ∈ {1, 2, 3, 4, 5, 6}},
and the outcomes are tuples of the form (𝑖, 𝑗). (That is, the tuple (𝑖, 𝑗) encodes the elementary event {𝑋 = 𝑖, 𝑌 = 𝑗}.)
Since the tosses are independent of each other,
11 1
𝑃 (𝑋 = 𝑖, 𝑌 = 𝑗) = = .
66 36
(When it is clear, we omit the brackets of the event {𝑋 = 𝑖, 𝑌 = 𝑗}.)
Since the ﬁrst throw falls between 1 and 6, we can partition the event space by forming

𝐴𝑛 ∶= {𝑋 = 𝑛}, 𝑛 = 1, … , 6.

Thus, the law of total probability gives

6
𝑃 (𝑋 + 𝑌 = 7) = ∑ 𝑃 ({𝑋 + 𝑌 = 7} and {𝑋 = 𝑛}).
𝑛=1

However, if we know that 𝑋 + 𝑌 = 7 and 𝑋 = 𝑛, we know that 𝑌 = 7 − 𝑛 must hold as well. So, continuing the
calculation above,
6
𝑃 (𝑋 + 𝑌 = 7) = ∑ 𝑃 ({𝑋 + 𝑌 = 7} and {𝑋 = 𝑛})
𝑛=1
6
= ∑ 𝑃 (𝑋 = 𝑛, 𝑌 = 7 − 𝑛)
𝑛=1
6
1
=∑
𝑛=1
36
1
= .
6
So, the law of total probability helps us deal with complex events by decomposing them into simpler ones. We have seen
this pattern dozens of times now, and once again, it proves to be essential.
As yet another consequence of 𝜎-additivity, we can calculate the probability of an increasing sequence of events by taking
the limit.

Theorem 34.2.3 (Lower continuity of probability measures.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴1 ⊆ 𝐴2 ⊆ ⋯ ∈ Σ be an increasing sequence of events. Then

𝑃 (∪∞
𝑛=1 𝐴𝑛 ) = lim 𝑃 (𝐴𝑛 ) (35.3)
𝑛→∞

holds. This property is called the lower continuity of probability measures.

372 Chapter 35. The axioms of probability

Mathematics of Machine Learning

Proof. Since the events are increasing, that is, 𝐴𝑛−1 ⊆ 𝐴𝑛 , we can write 𝐴𝑛 as

𝐴𝑛 = 𝐴𝑛−1 ∪ (𝐴𝑛 \𝐴𝑛−1 ),

where 𝐴𝑛−1 and 𝐴𝑛 \𝐴𝑛−1 are disjoint. Thus,

∪∞ ∞
𝑛=1 𝐴𝑛 = ∪𝑛=1 𝐴𝑛 \𝐴𝑛−1 , 𝐴0 ∶= ∅,

which gives
∞
𝑃 (∪∞
𝑛=1 𝐴𝑛 ) = ∑ 𝑃 (𝐴𝑛 \𝐴𝑛−1 )
𝑛=1
𝑛
= lim ∑ 𝑃 (𝐴𝑘 \𝐴𝑘−1 )
𝑛→∞
𝑘=1
𝑛
= lim ∑ 𝑃 (𝐴𝑘 ) − 𝑃 (𝐴𝑘−1 )
𝑛→∞
𝑘=1
= lim 𝑃 (𝐴𝑛 ),
𝑛→∞

where we used that 𝑃 (∅) = 0. □

We can state an analogue of the above theorem for a decreasing sequence of events.

Theorem 34.2.4 (Upper continuity of probability measures.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴1 ⊇ 𝐴2 ⊇ ⋯ ∈ Σ be a decreasing sequence of events. Then

𝑃 (∩∞
𝑛=1 𝐴𝑛 ) = lim 𝑃 (𝐴𝑛 ) (35.4)
𝑛→∞

holds. This property is called the upper continuity of probability measures.

Proof. For simplicity, let’s denote the inﬁnite intersection by 𝐴 ∶= ∩∞

𝑛=1 𝐴𝑛 .

By deﬁning 𝐵𝑛 ∶= 𝐴1 \𝐴𝑛 , we have ∪∞

𝑛=1 𝐵𝑛 = 𝐴1 \𝐴. Since 𝐴𝑛 is decreasing, 𝐵𝑛 is increasing, so we can apply the
previous theorem to obtain
𝑃 (𝐴1 \𝐴) = lim 𝑃 (𝐴1 \𝐴𝑛 )
𝑛→∞
= 𝑃 (𝐴1 ) − lim 𝑃 (𝐴𝑛 ).
𝑛→∞

Since 𝑃 (𝐴1 \𝐴) = 𝑃 (𝐴1 ) − 𝑃 (𝐴), we obtain 𝑃 (𝐴) = lim𝑛→∞ 𝑃 (𝐴𝑛 ). □

35.3 Probability spaces on ℝ𝑛

Now that we have a mathematical deﬁnition of a probabilistic model, it is time to take a step toward space where machine
learning is done: ℝ𝑛 .
In machine learning, every data point is an elementary outcome, located somewhere in the Euclidean space ℝ𝑛 . Because
of this, we are interested in modeling experiments there.

35.3. Probability spaces on ℝ𝑛 373

Mathematics of Machine Learning

How can we deﬁne a probability space there? Similarly as we did with on real line, we describe a convenient event algebra
by generating. There, we can use the higher dimensional counterpart of the intervals (𝑎, 𝑏): 𝑛-dimensional spheres. For
this, we deﬁne the set

𝐵(𝑥, 𝑟) = {𝑦 ∈ ℝ𝑛 ∶ ‖𝑥 − 𝑦‖ < 𝑟},

where the norm ‖ ⋅ ‖ denotes the usual Euclidean norm. (The 𝐵 denotes the word “ball”. In mathematics, 𝑛-dimensional
spheres are often called balls.) Similarly to the real line, the Borel-algebra is deﬁned by

ℬ(ℝ𝑛 ) = 𝜎({𝐵(𝑥, 𝑟) ∶ 𝑥 ∈ ℝ𝑛 , 𝑟 > 0}). (35.5)

As we saw on real line, the structure of ℬ(ℝ𝑛 ) is richer than what the definition suggests at first glance. Here, the analogue
of interval is a rectangle, defined by

(𝑎, 𝑏) = (𝑎1 , 𝑏1 ) × ⋯ × (𝑎𝑛 , 𝑏𝑛 )

= {𝑥 ∈ ℝ𝑛 ∶ 𝑎𝑖 < 𝑥𝑖 < 𝑏𝑖 , 𝑖 = 1, … , 𝑛}.

Similarly, we can deﬁne [𝑎, 𝑏], (𝑎, 𝑏], [𝑎, 𝑏), and others.

Theorem 34.3.1
For any 𝑎, 𝑏 ∈ ℝ𝑛 , the sets [𝑎, 𝑏], [𝑎, 𝑏), (𝑎, 𝑏], (𝑎, ∞), [𝑎, ∞), (−∞, 𝑎), (−∞, 𝑏] are elements of ℬ(ℝ𝑛 ).

Proof. The proof goes along the same line as the counterpart for ℬ(ℝ). As such, it is left as an exercise to the reader.
As a hint, ﬁrst we can show that (𝑎, 𝑏) can be written as a countable union of balls. We can also show that this holds true
for sets like
ℝ × ⋯ × (−∞,
⏟ 𝑎𝑖 ) × ⋯ × ℝ
𝑖-th component

as well. From these two, we can write the others as unions/intersections/diﬀerences. □

As an example, let’s throw a few darts at a rectangular wall. Suppose that we are terrible darts players and hitting any
point on the wall is equally likely.
We can model this event space with Ω = [0, 1] × [0, 1] ⊆ ℝ2 , representing our wall. What are the possible events?
For instance, there is a circular darts board hanging on the wall, and we want to ﬁnd the probability of hitting it. In this
scenario, we can restrict the Borel sets deﬁned by (35.5) to

ℬ(Ω) ∶= {𝐴 ∩ Ω ∶ 𝐴 ∈ ℬ(ℝ𝑛 )}.

Now that the event space and algebra is clear, we need to think about assigning probabilities. Our assumption is that
hitting any point is equally likely. So, by generalizing the allfavorable outcomes
possible outcomes formula we have seen in the discrete case, we
deﬁne the probability measure by

volume(𝐴)
𝑃 (𝐴) = .
volume(Ω)

We can draw the following picture.

As we shall see later, this is a special case of uniform distributions, one of the most prevalent distributions in probability
theory. However, there is a lot to talk about until then. Before we conclude this chapter, let’s discuss how can we interpret
probabilities.

374 Chapter 35. The axioms of probability

Mathematics of Machine Learning

Fig. 35.1: Probability space for throwing darts at a wall.

35.4 How to interpret probability?

Now that we know how to work with probabilities, it is time to study how can we assign probabilities to real-life events.
First, we are going to take a look at the frequentist interpretation, explaining probabilities with relative frequencies. (If
you are one of those people who are religious about this question, calm down. We’ll discuss the Bayesian interpretation
in detail, but it is not time yet.)
Let’s go back to the beginning and consider the coin-tossing experiment, the most basic example possible. If I toss a fair
coin 1000 times, how many of them will be heads? Most people immediately answer 500, but this is not correct. There is
no right answer, any number of heads between 0 and 1000 can happen. Of course, most probably it will be around 500,
but with a very small probability, there can be zero heads as well.
In general, the probability of an event describes its relative frequence among inﬁnitely many attempts. That is,
number of occurences
𝑃 (event) ≈ .
number of attempts
When the number of attempts goes towards inﬁnity, the relative frequency of occurrences converges to the true underlying
probability. In other words, if 𝑋𝑖 quantitatively describes our 𝑖-th attempt by

1 if the event occurs

𝑋𝑖 = {
0 otherwise,

then
𝑋1 + ⋯ + 𝑋𝑛
𝑃 (event) = lim .
𝑛→∞ 𝑛
We can illustrate this by doing a quick simulation using the coin-tossing example. Don’t worry if you don’t understand
the code, we’ll talk about it in detail in the next chapters.

import numpy as np
from scipy.stats import randint
(continues on next page)

35.4. How to interpret probability? 375

Mathematics of Machine Learning

(continued from previous page)

n_tosses = 1000
# coin tosses: 0 for tails and 1 for heads
coin_tosses = [randint.rvs(low=0, high=2) for _ in range(n_tosses)]
averages = [np.mean(coin_tosses[:k+1]) for k in range(n_tosses)]

Let’s plot the results for some insight.

import matplotlib.pyplot as plt

with plt.style.context("seaborn-white"):
plt.figure(figsize=(16, 8))
plt.title("Relative frequency of the coin tosses")
plt.xlabel("Relative frequency")
plt.ylabel("Number of tosses")

# plotting the averages

plt.plot(range(n_tosses), averages, linewidth=3) # the averages

# plotting the true expected value

plt.plot([-100, n_tosses+100], [0.5, 0.5], c="k")
plt.xlim(-10, n_tosses+10)
plt.ylim(0, 1)

The relative frequency quite nicely stabilizes around 1/2, which is the true probability of our fair coin landing on its heads.
Is this an accident? No.
We will make all of this mathematically precise when talking about the law of large numbers. For now, you can think
about estimating probabilities this way. In the next chapter, we will introduce the Bayesian viewpoint, a probabilistic
framework for updating our models given new observations.

376 Chapter 35. The axioms of probability

Mathematics of Machine Learning

35.5 Problems

Problem 1. Let’s roll two six-sided dice! Describe the event space, event algebra, and the corresponding probabilities for
this experiment.
Problem 2. Let Ω = [0, 1], and the corresponding event algebra be the generated algebra

Σ = 𝜎({(𝑎, 𝑏] ∶ 0 ≤ 𝑎 < 𝑏 ≤ 1}).

Show that the following events are members of Σ.

(a) 𝑆1 = {𝑥} for all 𝑥 ∈ [0, 1].
(b) 𝑆2 = ∪𝑛𝑖=1 (𝑎𝑖 , 𝑏𝑖 ). (Show that this is also true when the intervals [… ] are replaced with open and half-open versions
(… ), (… ], [… ).)
(c) 𝑆3 = [0, 1] ∩ ℚ. (That is, the set of rational numbers in [0, 1].)
(d) 𝑆4 = [0, 1] ∩ (ℝ\ℚ). (That is, the set of irrational numbers in [0, 1].)
Problem 3. Let’s roll two six-sided dice. What is the probability that
(a) Both rolls are odd numbers?
(b) At least one of them is an odd number?
(c) None of them are odd numbers?
Problem 4. Let Ω = ℝ2 be the event space, where we deﬁne the open disks

𝐷(𝑥, 𝑟) ∶= {𝑧 ∈ ℝ2 ∶ ‖𝑧 − 𝑥‖ < 𝑟}, 𝑥 = (𝑥1 , 𝑥2 ) ∈ ℝ2 , 𝑟 > 0,

and the open rectangles by

𝑅(𝑥, 𝑦) = (𝑥1 , 𝑦1 ) × (𝑥2 , 𝑦2 )

= {𝑧 = (𝑥1 , 𝑥2 ) ∈ ℝ2 ∶ 𝑥1 < 𝑧1 < 𝑦1 , 𝑥2 < 𝑧2 < 𝑦2 }.

Show that the event algebras generated by these sets are the same, that is,

𝜎({𝐷(𝑥, 𝑟) ∶ 𝑥 ∈ ℝ, 𝑟 > 0}) = 𝜎({𝑅(𝑥, 𝑦) ∶ 𝑥, 𝑦 ∈ ℝ}).

35.5. Problems 377

Mathematics of Machine Learning

378 Chapter 35. The axioms of probability

CHAPTER

THIRTYSIX

CONDITIONAL PROBABILITY

In the previous chapter, we learned the foundations of probability. Now we can speak in terms of outcomes, events, and
chances. However, in real-life applications, these basic tools are not enough to build useful predictive models.
To illustrate this, let’s build a probabilistic spam ﬁlter! For every email we receive, we want to estimate the probability
𝑃 (email is spam). The closer this is to 1, the more likely that we are looking at a spam email.
Based on our inbox, we might calculate the relative frequency of spam emails and obtain that

number of our spam emails

𝑃 (email is spam) ≈ .
number of emails in our inbox
However, this doesn’t help us at all. Based on this, we can randomly discard every email with this probability, but that
would be a horrible spam ﬁlter.
To improve, we need to dig a bit deeper. When analyzing spam emails, we start to notice patterns. For instance, the
phrase “act now” can be found almost exclusively in spam. After a quick count, we get that

𝑃 (email is a spam, given that it contains the phrase "act now")

number of spam emails containing the phrase "act now"
=
number of emails containing the phrase "act now"
≈ 0.95.

This looks much more useful for our spam filtering efforts. By checking for the presence of the phrase “act now”, we can
confidently classify an email as spam.
Of course, there is much more to spam filtering, but this simple example demonstrates the importance of probabilities
conditional on other events. To put this into mathematical form, we introduce the following definition.

Deﬁnition 35.1 (Conditional probability.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴, 𝐵 ∈ Σ be two events. The conditional probability of 𝐴 given 𝐵 is deﬁned
by

𝑃 (𝐴 ∩ 𝐵)
𝑃 (𝐵 ∣ 𝐴) ∶= .
𝑃 (𝐴)

You can think about 𝑃 (𝐵|𝐴) as restricting the event space to 𝐴, as illustrated by Fig. 36.1.
When there are more conditions, say 𝐴1 and 𝐴2 , the deﬁnition takes the form

𝑃 (𝐵 ∩ 𝐴1 ∩ 𝐴2 )
𝑃 (𝐵 ∣ 𝐴1 , 𝐴2 ) = ,
𝑃 (𝐴1 ∩ 𝐴2 )

and so on.

379
Mathematics of Machine Learning

Fig. 36.1: A visual reprensentation of conditional probability

To bring this concept closer, let’s revisit the simple dice-rolling experiment. Suppose that your friend rolls a six-sided
dice and tells you that the outcome is an odd number. Given this information, what is the probability that the result is 3?
For simplicity, let’s denote the outcome of the roll with 𝑋. Mathematically speaking, this can be calculated by

𝑃 (𝑋 = 3 and 𝑋 ∈ {1, 3, 5})

𝑃 (𝑋 = 3|𝑋 ∈ {1, 3, 5}) =
𝑃 (𝑋 ∈ {1, 3, 5})
𝑃 (𝑋 = 3)
=
𝑃 (𝑋 ∈ {1, 3, 5})
1/6
=
1/2
1
= .
3
This is the number that we expected.
Although this simple example doesn’t demonstrate the usefulness of conditional probability, this is a cornerstone in ma-
chine learning. In essence, learning from data can be formulated as estimating 𝑃 (label|data). We are going to expand on
this idea later in this chapter.

380 Chapter 36. Conditional probability

Mathematics of Machine Learning

36.1 Independence

The idea behind conditional probability is that observing certain events changes the probability of others. Is this always
the case though?
In probabilistic modeling, recognizing when observing an event doesn’t inﬂuence another is equally important. This
motivates the concept of independence.

Deﬁnition 35.1.1 (Independence of events.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴, 𝐵 ∈ Σ be two events. We say that 𝐴 and 𝐵 are independent if

𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐴)𝑃 (𝐵)

holds.

Equivalently, this can be formulated in terms of conditional probabilities. By the deﬁnition, if 𝐴 and 𝐵 are independent,
we have
𝑃 (𝐴 ∩ 𝐵)
𝑃 (𝐵 ∣ 𝐴) =
𝑃 (𝐴)
𝑃 (𝐴)𝑃 (𝐵)
=
𝑃 (𝐴)
= 𝑃 (𝐵).
To see an example, let’s go back to coin tossing, and suppose that we toss a coin two times. Let the result of the ﬁrst and
second toss be denoted by 𝑋1 and 𝑋2 respectively. What is the probability that both of these tosses are heads? As we
saw this when discussing the probability space given by this experiment, we can see that
1
𝑃 (𝑋1 = heads and 𝑋2 = heads) = 𝑃 (𝑋1 = heads)𝑃 (𝑋2 = heads) = .
4
That is, the two events are independent of each other.
Regarding probability, there are many common misconceptions. One is about the interpretation of independence. Sup-
pose that I toss a fair coin ten times, all of them resulting in heads. What is the probability that my next toss will be
heads?
Most would immediately conclude that this must be very small since having eleven heads in a row is highly unlikely.
However, once we have the ten results available, we no longer talk about the probability of eleven coin tosses, just the last
one! Since the coin tosses are independent of each other, the chance of heads for the eleventh toss (given the results of
the previous ten) is still 50%. This phenomenon is called the gambler’s fallacy, and I am pretty sure that at some point in
your life, you fell victim to it. (I sure did.)

36.2 Properties of the conditional probability

In practical scenarios, working with conditional probabilities might be easier. (For instance, sometimes we can estimate
them directly, while the standard probabilities are diﬃcult to gauge.)
Because of this, we need tools to work with them.

36.1. Independence 381

Mathematics of Machine Learning

36.2.1 The Law of total probability revisited

Remember the law of total probability? We can use conditional probabilities to put it into a slightly diﬀerent form.

Theorem 35.2.1 (Law of total probability, conditional version.)

∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴|𝐴𝑘 )𝑃 (𝐴𝑘 ). (36.1)
𝑘=1

Proof. The proof is the trivial application of the law of total probability and the deﬁnition of conditional probabilities: as
𝑃 (𝐴 ∩ 𝐴𝑘 ) = 𝑃 (𝐴|𝐴𝑘 )𝑃 (𝐴𝑘 ),
∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴 ∩ 𝐴𝑘 )
𝑘=1
∞
= ∑ 𝑃 (𝐴|𝐴𝑘 )𝑃 (𝐴𝑘 )
𝑘=1

holds, which is what we had to show. □

Why is this useful for us? Let’s demonstrate this with an example. Suppose that we have three urns containing red and
blue colored balls. The ﬁrst one contains 4 blue, the second one contains 2 red and 2 blue, while the last one contains 1
red and 3 blue balls.

Fig. 36.2: Urns with colored balls.

We randomly pick one. However, picking the first one is twice as likely as picking the other two. (That is, we pick the first
urn 50% of the time, while the second and the third 25%-25% of the time.) From that urn, we also randomly pick a ball.
What is the probability that we select a red ball? Without using the law of total probability, this is difficult to compute.
Let’s denote the color of the selected ball by 𝑋 and suppose that the event 𝐴𝑛 describes picking the 𝑛-th urn. Then, we
have
3 3
𝑃 (𝑋 = red) = ∑ 𝑃 ({𝑋 = red} ∩ 𝐴𝑘 ) = ∑ 𝑃 (𝑋 = red ∣ 𝐴𝑘 )𝑃 (𝐴𝑘 ).
𝑘=1 𝑘=1

382 Chapter 36. Conditional probability

Mathematics of Machine Learning

Without using conditional probabilities, calculating 𝑃 ({𝑋 = red} ∩ 𝐴𝑘 ) is diﬃcult. (Since we are not picking each urn
with equal probability.) However, we can simply calculate the conditionals by counting the number of red balls in each
urn. That is, we have
𝑃 (𝑋 = red ∣ 𝐴1 ) = 0
2
𝑃 (𝑋 = red ∣ 𝐴2 ) =
4
1
𝑃 (𝑋 = red ∣ 𝐴3 ) = .
4
Since 𝑃 (𝐴1 ) = 1/2, 𝑃 (𝐴2 ) = 1/4, and 𝑃 (𝐴3 ) = 1/4, the probability we are looking for is
3
𝑃 (𝑋 = red) = ∑ 𝑃 (𝑋 = red ∣ 𝐴𝑘 )𝑃 (𝐴𝑘 )
𝑘=1
1 21 11
=0 + +
2 44 44
3
= .
16
Note that because the urns are not selected with equal probability,
number of red balls
𝑃 (𝑋 = red) ≠ ,
number of balls
as one would naively guess. This is because the urns are not selected with uniform probability, as frequently happens in
statistics.

36.2.2 Chain rule

Another useful property of the conditional probability that, due to their deﬁnition, we can use them to express the joint
probability of events:
𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐵 ∣ 𝐴)𝑃 (𝐴).

Even though this sounds trivial, there are cases when we can estimate/compute the conditional probability, but not the
joint probability. In fact, this simple identity can be generalized for an arbitrary number of conditions. This is called the
chain rule. (Despite its name, it has nothing to do with the chain rule for diﬀerentiation.)

Theorem 35.2.2 (The chain rule.)

Let (Ω, Σ, 𝑃 ) be a probability space and 𝐴1 , 𝐴2 , ⋯ ∈ Σ be arbitrary events. Then
𝑛
𝑃 (𝐴1 ∩ ⋯ ∩ 𝐴𝑛 ) = ∏ 𝑃 (𝐴𝑘 ∣ 𝐴1 , … , 𝐴𝑘−1 ) (36.2)
𝑘=1

holds.

Proof. First, we notice that 𝑃 (𝐴1 ∩ ⋯ ∩ 𝐴𝑛 ) can be written as

𝑃 (𝐴1 ∩ 𝐴𝑛 ) 𝑃 (𝐴1 ∩ 𝐴𝑛−1 )
𝑃 (𝐴1 ∩ ⋯ ∩ 𝐴𝑛 ) = … 𝑃 (𝐴1 ),
𝑃 (𝐴1 ∩ 𝐴𝑛−1 ) 𝑃 (𝐴1 ∩ 𝐴𝑛−2 )
because the terms cancel out each other. Since
𝑃 (𝐴1 ∩ 𝐴𝑘 )
= 𝑃 (𝐴𝑘 ∣ 𝐴1 , … , 𝐴𝑘−1 ),
𝑃 (𝐴1 ∩ 𝐴𝑘−1 )
the chain rule (36.2) follows. □

36.2. Properties of the conditional probability 383

Mathematics of Machine Learning

36.3 The Bayes Theorem

In essence, machine learning is about turning observations into predictive models. Probability theory gives us a language
to express our models. For instance, going back to our spam filter example from the beginning of the chapter, we can
notice that 5% of our emails are spam. However, this is not enough information to filter out spam emails. Upon inspection,
we have observed that 95% of mails that contain the phrase “act now” are spam. (But only 1% of the all mails contain
“act now”.)
In the language of conditional probabilities, we have concluded that
𝑃 (spam ∣ contains "act now") = 0.95.
With this, we can start looking for emails containing the phrase “act now” and discard them with 95% confidence. Is this
spam filter effective? Not really, since there can be other frequent keywords in spam mails that we don’t check. How can
we check this?
For one, we can take a look at the conditional probability 𝑃 (contains "act now" ∣ spam), describing the frequency of the
“act now” keyword among all the spam emails. A low frequency means that we are missing out on other keywords that
we can use for filtering.
Generally speaking, we often want to compute/estimate the quantity 𝑃 (𝐴|𝐵), but our observations only allow us to infer
𝑃 (𝐵|𝐴). So, we need a way to reverse the condition and the event. With a bit of algebra, we can do this easily.

Theorem 35.3.1 (The Bayes formula.)

Let (Ω, Σ, 𝑃 ) be a probability space and 𝐴, 𝐵 be two arbitrary events. Then
𝑃 (𝐴 ∣ 𝐵)𝑃 (𝐵)
𝑃 (𝐵 ∣ 𝐴) = (36.3)
𝑃 (𝐴)
holds.

Proof. By the deﬁnition of conditional probability, we have

𝑃 (𝐴 ∩ 𝐵)
𝑃 (𝐵 ∣ 𝐴) =
𝑃 (𝐴)
𝑃 (𝐴 ∩ 𝐵)𝑃 (𝐵)
=
𝑃 (𝐵)𝑃 (𝐴)
𝑃 (𝐵)
= 𝑃 (𝐴 ∣ 𝐵) ,
𝑃 (𝐴)
which is what we had to show. □

To see how it works in action, let’s put it to the test in our spam ﬁltering example. Given the information we know, we
have
𝑃 (spam ∣ contains "act now") = 0.95,
𝑃 (contains "act now") = 0.01,
𝑃 (spam) = 0.05.
So, according to the Bayes formula,
𝑃 (spam ∣ contains "act now")𝑃 (contains "act now")
𝑃 (contains "act now" ∣ spam) =
𝑃 (spam)
0.95 × 0.01
=
0.05
19
= .
100

384 Chapter 36. Conditional probability

Mathematics of Machine Learning

Thus, by ﬁltering only for the phrase “act now”, we are missing a lot of spam.
We can take the Bayes formula one step further by combining it with the law of total probability. (See the equation (36.1).)

Theorem 35.3.2 (The Bayes theorem.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴, 𝐵 ∈ Σ be arbitrary events. Moreover, let 𝐴1 , 𝐴2 , ∈Σ ̇ be a partition of
the event space Ω. (That is, the 𝐴𝑘 -s are pairwise disjoint and their union is the entire event space.) Then

𝑃 (𝐴 ∣ 𝐵)𝑃 (𝐵)
𝑃 (𝐵 ∣ 𝐴) = ∞
∑𝑘=1 𝑃 (𝐴 ∣ 𝐴𝑘 )𝑃 (𝐴𝑘 )

holds.

Proof. The proof immediately follows from the Bayes formula (36.3) and the law of total probability (36.1). □

36.4 The Bayesian interpretation of probability

Historically, probability was introduced as the relative frequency of observed events. However, the invention of conditional
probabilities and the Bayes formula enabled another interpretation that slowly became prevalent in statistics and machine
learning.
In pure English, the Bayes formula can be thought of as updating our probabilistic models using new observations. Suppose
that we are interested in the event 𝐵. Without observing anything, we can formulate a probabilistic model by assigning
a probability to 𝐵, that is, estimating 𝑃 (𝐵). This is what we call the prior. However, observing another event 𝐴 might
change our probabilistic model.
Thus, we would like to estimate the posterior probability 𝑃 (𝐵 ∣ 𝐴). We can’t do this directly, but thanks to our prior
model, we can tell 𝑃 (𝐴 ∣ 𝐵). The quantity 𝑃 (𝐴 ∣ 𝐵) is called the likelihood. Combining these with the Bayes formula,
we can see that the posterior is proportional to the likelihood and the prior.

Fig. 36.3: The Bayes formula, as the product of the likelihood and prior.

Let’s see a concrete example that will make the idea clear. Suppose that we are creating a diagnostic test for an exotic
disease. How likely is the disease present in a random person?

36.4. The Bayesian interpretation of probability 385

Mathematics of Machine Learning

Without knowing any speciﬁcs about the situation, we can only use statistics to formulate the probability model. Let’s say
that only 2% of the population is aﬀected. So, our probabilistic model is

𝑃 (infected) = 0.02, 𝑃 (healthy) = 0.98.

However, once someone produces a positive test, things change. The goal is to estimate the posterior probability
𝑃 (infected|positive test), a more accurate model.
Since no medical test is perfect, false positives and false negatives can happen. From the manufacturer, we know that it
gives true positives 99% of the time, but the chance for a false positive is 5%. In probabilistic terms, we have

𝑃 (positive test ∣ infected) = 0.99,

𝑃 (positive test ∣ healthy) = 0.05.

With these, the Bayes theorem gives us

𝑃 (positive test ∣ infected)𝑃 (infected)

𝑃 (infected ∣ positive test) =
𝑃 (positive test ∣ infected)𝑃 (infected) + 𝑃 (positive test ∣ healthy)𝑃 (healthy)
0.99 × 0.02
=
0.99 × 0.02 + 0.05 × 0.98
≈ 0.29.

So, the chance of being infected upon producing a positive test is surprisingly 29%. (Given these speciﬁc true and false
positive rates.)
These probabilistic thinking principles are also valid for machine learning. If we abstract away the process of learning
from data, we are essentially 1) making observations, 2) updating our models given the new observations, and 3) and start
the process over again. The Bayes theorem gives a concrete tool for the job.

36.5 The probabilistic inference process

As we have seen before, probability theory is the extension of mathematical logic. So far, we have discussed how logical
connectives correspond to set operations, and how probability generalizes the truth value by adding the component of
uncertainty. What about the probabilistic inference process? Can we generalize classical inference and use probabilistic
reasoning to construct arguments? Yes.
To illustrate, let’s start with a story. It’s 6:00 AM. The alarm clock is blasting, but you are having a hard time getting out
of bed. You don’t feel well. Your muscles are weak, and your head is exploding. After a brief struggle, you manage to
call a doctor and list all the symptoms. Your sore throat makes speaking painful.
“It’s probably just the flu“, she says.
Interactions like this are everyday occurrences. Yet, we hardly think about the reasoning process behind them. After all,
you could have been hungover. Similarly, if the police find a murder weapon at your house, they’ll suspect that you are
the killer. The two are related, but not the same. For instance, the murder weapon could have been planted.
The bulk of humanity’s knowledge is obtained in this manner: we collect evidence, then explain it with various hypotheses.
How do we infer the underlying cause from observing the effect? Most importantly, how can we avoid fooling ourselves
into false conclusions?
Let’s focus on “muscle fatigue, headache, sore throat → flu“. This is certainly not true in an absolute sense, as these
symptoms resemble how you would feel after shouting and drinking excessively during a metal concert. Which is far from
the flu. Yet, a positive diagnosis of flu is plausible. Given the evidence at hand, our belief is increased in the hypothesis.
Unfortunately, classical logic cannot deal with plausible. Only with absolute. Probability theory solves this problem by
measuring plausibility on a 0 − 1 scale, instead of being stuck at the extremes. Zero is impossible. One is certain. All the
values in between represent degrees of uncertainty.

386 Chapter 36. Conditional probability

Mathematics of Machine Learning

Let’s put this into mathematical terms!

How can we establish a probabilistic link between cause and eﬀect? In classical logic, events are interesting in the context
of other events. Before, implication and modus ponens provided the context. Translating to the language of probability,
the question is the following. What is the probability of 𝐵, given that 𝐴 is observed? The answer: conditional probabilities.
Why does conditional probability generalize the concept of implication? It’s easier to draw a picture, so consider the two
extreme cases in Fig. 36.4. (Recall that implication corresponds to the subset relation.)

Fig. 36.4: Conditional probability as logical implication

Essentially, 𝑃 (𝐵 ∣ 𝐴) = 1 means that 𝐴 → 𝐵 is true, while 𝑃 (𝐵 ∣ 𝐴) = 0 means that it is not. We can take this analogy
further: a small 𝑃 (𝐵 ∣ 𝐴) means that 𝐴 → 𝐵 is likely false, and a large 𝑃 (𝐵 ∣ 𝐴) means that it is likely to be true. This
is illustrated by Fig. 36.5.
Thus, the “probabilistic modus ponens” goes like this:
1. 𝑃 (𝐵 ∣ 𝐴) ≈ 1.
2. 𝐴.
3. Therefore, 𝐵 is probable.
This is quite a relief, as now we have a solid theoretical justification for most of our decisions. Thus, the diagnostic process
that kicked up our investigation makes a lot more sense now:
1. 𝑃 (flu ∣ headache, muscle fatigue, sore throat) ≈ 1.
2. “Headache and muscle fatigue”.
3. Therefore, “flu” is probable.
However, one burning question remains. How do we know that 𝑃 (flu ∣ headache, muscle fatigue, sore throat) ≈ 1 holds?

36.5. The probabilistic inference process 387

Mathematics of Machine Learning

Fig. 36.5: Conditional probability as the extension of logical implication

36.5.1 Aﬃrming the consequent

Let’s focus on the probabilistic version of “headache, sore throat, muscle fatigue → flu“. We know that this is not certain,
only plausible. Yet, the reverse implication “flu \to headache, sore throat, muscle fatigue“ is almost certain.
When naively arguing that the evidence implies the hypothesis, we have the opposite in mind. Instead of applying the
modus ponens, we use the faulty argument
1. 𝐴 → 𝐵.
2. 𝐵.
3. Therefore 𝐴.
We have talked about this before. This logical fallacy is called affirming the consequent, and it’s completely wrong from
a purely logical standpoint. However, the Bayes theorem provides its probabilistic version. The proposition 𝐴 → 𝐵
translates to 𝑃 (𝐵 ∣ 𝐴) = 1, which implies that when 𝐵 is observed, 𝐴 is more likely. Why? Because then, we have

𝑃 (𝐵 ∣ 𝐴)𝑃 (𝐴)
𝑃 (𝐴 ∣ 𝐵) =
𝑃 (𝐵)
𝑃 (𝐴)
=
𝑃 (𝐵)
> 𝑃 (𝐵).

This is good news, as reversing the implication is not totally wrong. Instead, we have the “probabilistic aﬃrming the
consequent”:
1. 𝐴 → 𝐵.
2. 𝐵.
3. Therefore, 𝐴 is more probable.

388 Chapter 36. Conditional probability

Mathematics of Machine Learning

With this, the probabilistic reasoning process makes perfect sense. To recall, the issue with arguments like “if you have
muscle fatigue, sore throat, and a headache, then you have the flu“ is that the symptoms can be caused by other conditions;
and in rare cases, the flu does not carry all of these these symptoms.
Yet, this kind of thinking can surprisingly effective in real-life decision-making. Probability and conditional probability
extends our reasoning toolkit with inductive methods in three steps:
1. generalized the binary 0 − 1 truth values to allow the representation of uncertainty,
2. defined the analogue of “if 𝐴, then 𝐵“ -type implications using conditional probability,
3. and came up with a method to infer the cause from observing the effect.
These three ideas are seriously powerful, and their inception has enabled science to perform unbelievable feats.
(If you are interested in learning more about the relation of probability theory and logic, I recommend you the great book
[[JBP03]].)

36.6 The Monty Hall paradox

Before we ﬁnish with conditional probability, we’ll touch on an important problem. Regarding probability, we often have
seemingly contradictory phenomena, going against our intuitive expectations. These are called paradoxes. To master
probabilistic thinking, we need to resolve them and eliminate common fallacies from our thinking processes. So far, we
have already seen the gambler’s fallacy when talking about the concept of independence. Now, we’ll discuss the famous
Monty Hall paradox.
In the ’60s, there was a TV show in the United States, called Let’s Make a Deal. As a contestant, you faced three closed
doors, one having a car behind it (that you could take home), while the rest were empty. You had the opportunity to open
one.

Fig. 36.6: Three closed doors, one of which contains a reward behind it.

Suppose that after selecting door No. 1, Monty Hall, the show host, opens the third door, showing that it was not the
winning one. Now, you have the opportunity to change your mind and open door No. 2 instead of the first one. Do you
take it?
At first glance, your chances are 50%-50%, so you might not be better off with switching. However, this is not true!
To set things straight, let’s do a careful probabilistic analysis! Let 𝐴𝑖 denote the event that the price is behind the 𝑖-th
door, while 𝐵𝑖 is the event of Monty opening the 𝑖-th door. Before Monty opens the third one, our model is
1
𝑃 (𝐴1 ) = 𝑃 (𝐴2 ) = 𝑃 (𝐴3 ) = ,
3

36.6. The Monty Hall paradox 389

Mathematics of Machine Learning

Fig. 36.7: Monty opened the third door for you. Do you switch?

and we want to calculate 𝑃 (𝐴1 ∣ 𝐵3 ) and 𝑃 (𝐴2 ∣ 𝐵3 ). By thinking from the perspective of the show host, which door
would you open? If you know that the price is behind the 1st door, you open the 2nd and 3rd one with equal probability.
However, if the price is actually behind the 2nd door (and the contestant selected the 1st one), you always open the 3rd
one. That is,
1
𝑃 (𝐵3 ∣ 𝐴1 ) = 𝑃 (𝐵2 ∣ 𝐴1 ) = ,
2
𝑃 (𝐵3 ∣ 𝐴2 ) = 1.
Thus, by applying the Bayes formula, we have

𝑃 (𝐵3 ∣ 𝐴1 )𝑃 (𝐴1 )
𝑃 (𝐴1 ∣ 𝐵3 ) =
𝑃 (𝐵3 )
1/6
= ,
𝑃 (𝐵3 )

and
𝑃 (𝐵3 ∣ 𝐴2 )𝑃 (𝐴2 )
𝑃 (𝐴2 ∣ 𝐵3 ) =
𝑃 (𝐵3 )
1/3
= .
𝑃 (𝐵3 )
In conclusion, 𝑃 (𝐴2 ∣ 𝐵3 ) is twice as large as 𝑃 (𝐴1 ∣ 𝐵3 ), from which we deduce

1 2
𝑃 (𝐴1 ∣ 𝐵3 ) = , 𝑃 (𝐴2 ∣ 𝐵3 ) = .
3 3
So, you should always switch doors. Surprising, isn’t it? Here, the paradox is that contrary to what we might expect,
changing our minds is the better option. With clear probabilistic thinking, we can easily resolve this.

390 Chapter 36. Conditional probability

CHAPTER

THIRTYSEVEN

RANDOM VARIABLES

Having a probability space to model our experiments and observations is ﬁne and all, but in almost all of the cases, we are
interested in a quantitative measure of the outcome. To give you an example, let’s consider an already familiar situation:
tossing coins. Suppose that we are tossing a fair coin 𝑛 times, but we are only interested in the number of heads. How do
we model the probability space this time?
By taking things one step at a time, ﬁrst we construct an event space by enumerating all possible outcomes in a single set,
just like we already did:

Ω = {0, 1}𝑛 , Σ = 2Ω .

Since the coin is fair, each outcome 𝜔 has the probability 𝑃 (𝜔) = 21𝑛 . This probability space (Ω, Σ, 𝑃 ) is nice and simple
so far. Using the additivity of probability measures, we can calculate the probability of eny event. That is, for any 𝐴 ∈ Σ,
we have
|𝐴|
𝑃 (𝐴) = ,
|Ω|
where | ⋅ | denotes the number of elements in a given set.
However, as mentioned, we are only interested in the number of heads. Should we just incorporate this information
somewhere in the probability space? Sure, we could do that, but that would couple the elementary outcomes (that is, a
series of heads or tails) with the measurements. This can significantly complicate our model.
Instead of overloading this probability space to directly deal with the desired measurements, we can do something much
simpler: introduce a function 𝑋 ∶ Ω → ℕ, mapping outcomes to measurements.
These functions are called random variables, and they are at the very center of probability theory and statistics. By
collecting data, we are observing random variables, and by fitting predictive models, we approximate them using the
observations. Now that we understand why we need them, we are going to make this notion mathematically precise.
In their ultimate form, random variables are special mappings between probability spaces and event spaces. By taking one
step at a time, we’ll deal with so-called discrete random variables (such as the above example) first, real random variables
second, and the general case last.

37.1 Discrete random variables

Following our motivating example describing the number of heads in 𝑛 coin tosses, we can create a formal deﬁnition.

Deﬁnition (Discrete random variables.)

Let (Ω, Σ, 𝑃 ) be a probability space, and {𝑥𝑘 }∞ 𝑘=1 be an arbitrary sequence of real numbers. The function 𝑋 ∶ Ω →
{𝑥1 , 𝑥2 , … } is called a discrete random variable if the sets

𝑆𝑘 = {𝜔 ∈ Ω ∶ 𝑋(𝜔) = 𝑥𝑘 }

391
Mathematics of Machine Learning

are events for any integer 𝑘 ∈ ℤ. (That is, 𝑆𝑘 ∈ Σ.)

You might ask, why are we requiring the sets {𝜔 ∈ Ω ∶ 𝑋(𝜔) = 𝑥𝑘 } to be events? It seems like just another technical
condition, but this plays an essential role. Ultimately, we are deﬁning random variables because we are want to measure
the probabities of our observations. This condition assures that we can do this.
To simplify our notations, we write

𝑃 (𝑋 = 𝑥𝑘 ) ∶= 𝑃 ({𝜔 ∈ Ω ∶ 𝑋(𝜔) = 𝑥𝑘 })

when we talk about these probabilities. Let’s see a concrete example!

In the case of the coin tossing above, our random variable is deﬁned by

𝑋 = number of heads.

Even though we can deﬁne 𝑋 in terms of formulas by

𝑛
𝑋(𝜔) = ∑ 𝜔𝑘 , 𝜔 = (𝜔1 , … , 𝜔𝑛 ) ∈ Ω,
𝑘=1

this is not needed. Often such a thing is not even possible. Regarding our random variables, we are not interested in
knowing the entire mapping, but rather questions such as the probability of 𝑘 heads among 𝑛 tosses.
If we record the “timestamps” where the outcome is heads, we can encode each 𝜔 as a subset of {1, 2, … , 𝑛}. For instance,
if the 1st, 3rd, and 37th tosses are heads and the rest are tails, this is {1, 3, 37}. To calculate the probability of 𝑘 heads,
we need to count the number of 𝑘-sized subsets for a set of 𝑛 elements. This is given by the binomial coeﬃcient (𝑛𝑘). So,

𝑛 1
𝑃 (𝑋 = 𝑘) = ( ) 𝑛 .
𝑘 2

37.2 Real-valued random variables

What if our measurements are not discrete? For instance, suppose that we have a class of students in front of us. We are
interested in the distribution of their body height. So, we pick one student by random and measure their height with our
shiny new tool, capable to measure the height with perfect precision.
In this case, discrete random variables are not enough, but we can deﬁne something similar.

Deﬁnition (Real valued random variables.)

Let (Ω, Σ, 𝑃 ) be a probability space. The function 𝑋 ∶ Ω → ℝ is called a random variable if

𝑋 −1 ((𝑎, 𝑏)) ∶= {𝜔 ∈ Ω ∶ 𝑎 < 𝑋(𝜔) < 𝑏}

are events for all 𝑎, 𝑏 ∈ ℝ. (That is, 𝑋 −1 ((𝑎, 𝑏)) ∈ Σ for all 𝑎, 𝑏 ∈ ℝ.)

Let’s unwrap this deﬁnition. First of all, 𝑋 is a mapping from the event space Ω to the set of real numbers ℝ.
Similarly to the discrete case, we are interested in the probabilities of events like 𝑋 −1 ((𝑎, 𝑏)). Again, for simplicity, we
write

𝑃 (𝑎 < 𝑋 < 𝑏) = 𝑃 (𝑋 −1 ((𝑎, 𝑏))).

You can imagine 𝑋 −1 ((𝑎, 𝑏) as the subset of Ω that are mapped to (𝑎, 𝑏). (In general, sets of the form 𝑋 −1 (𝐴) are called
inverse images.)

392 Chapter 37. Random variables

Mathematics of Machine Learning

Fig. 37.1: A real-valued random variable is a mapping from the event space to the set of real numbers.

Fig. 37.2: Inverse image of an interval.

37.2. Real-valued random variables 393

Mathematics of Machine Learning

Let’s see an example right away. Suppose that we are throwing darts at a circular board on the wall. (For simplicity,
assume that we are so good that we always hit the board.) As we have seen when discussing event algebras in higher
dimensions, we can model this by selecting

Ω = 𝐵(0, 1)
= {𝑥 ∈ ℝ2 ∶ ‖𝑥‖ < 1}

and
Σ = ℬ(𝐵(0, 1))
= 𝜎({𝐴 ∩ 𝐵(0, 1) ∶ 𝐴 ∈ ℬ(ℝ𝑛 )),
while
area(𝐴) area(𝐴)
𝑃 (𝐴) = = .
area(Ω) 𝜋
Since dart boards are subdivided by concentric circles, scoring is determined by the distance from the center. So, we
might as well deﬁne our random variable by

𝑋 ∶ distance of the impact point from the center.

𝑋 encodes all that we are interested in in terms of scoring. In general, we have

⎧0 if 𝑟 ≤ 0,
{
𝑃 (𝑋 < 𝑟) = ⎨𝑟2 if 0 < 𝑟 < 1,
{1 otherwise.
⎩
What if we have more than one measurement? For instance, in the case of the famous Iris dataset (one that we have seen
a few times so far), we have four measurements. Sure, we can just deﬁne four random variables, but then we cannot take
advantage of all the heavy machinery we built so far: linear algebra and multivariate calculus.
For this, we will take a look at random variables in the general case.

37.3 Random variables

Let’s cut to the chase.

Deﬁnition (Random variables.)

Let (Ω1 , Σ1 , 𝑃1 ) be a probability space and let (Ω2 , Σ2 ) be another event space Ω2 with event algebra Σ2 . The function
𝑋 ∶ Ω1 → Ω2 is a random variable if for every 𝐸 ∈ Σ2 , the set

𝑋 −1 (𝐸) ∶= {𝜔 ∈ Ω1 ∶ 𝑋(𝜔) ∈ 𝐸}

is a member of Σ1 . (That is, 𝑋 −1 (𝐸) ∈ Σ1 .)

In mathematical literature, random variables are usually denoted with either capital latin letters such as 𝑋, 𝑌 , or Greek
letters. (Mostly starting from 𝜉.)
Random variables essentially push probability measures forward from abstract probability spaces to more tractable ones.
On the event space (Ω2 , Σ2 ), we can deﬁne a probability measure 𝑃2 by

𝑃2 (𝐸) ∶= 𝑃1 (𝑋 −1 (𝐸)), 𝐸 ∈ Σ2 ,

making it possible to transform one probability space to another, while keeping the underlying probabilistic model intact.

394 Chapter 37. Random variables

Mathematics of Machine Learning

This general case covers all the mathematical objects we are interested in for machine learning. Staying with the Iris
dataset, the random variable

𝑋 ∶ set of iris ﬂowers → ℝ4 ,

iris ﬂower ↦ (petal width, petal length, sepal width, sepal length)

describes the generating distribution for the dataset, while for classification tasks, we are interested in approximating the
random variable
𝑌 ∶ set of iris flowers → {setosa, versicolor, virginica},
iris flower ↦ class label.
Now we will take a deeper look into why random variables are defined this way. This will be a bit technical, so feel free
to skip it. It won’t adversely affect your ability to work with random variables.

37.3.1 Behind the deﬁnition of random variables

So, random variables are functions, mapping the probability space onto a measurement space. The only question is, why
are the sets 𝑋 −1 (𝐸) so special? Let’s revisit one of our motivating examples: picking a random student and measuring
their height. We are interested in questions such as the probability of a student having a body height between 155 cm and
185 cm. (If you prefer using the imperial metric system, then 155 cm is roughly 5.09 feet and 185 cm is around 6.07
feet.) Translating this to formulas, we are interested in

𝑃 (155 ≤ 𝑋 ≤ 185) = 𝑃 (𝑋 −1 ([155, 185])).

(In the above formula, I wrote the same thing using two diﬀerent notations.)
So, how is 𝑋 −1 ([155, 185]) an event? To ﬁnd this out, let’s look at inverse images in general.

Deﬁnition (Inverse image of sets with respect to functions.)

Let 𝑓 ∶ 𝐸 → 𝐻 be a function between the two sets 𝐸 and 𝐻, and let 𝐴 ⊆ 𝐻 be an arbitrary set. The inverse image of 𝐴
with respect to the function 𝑓 is deﬁned by

𝑓 −1 (𝐴) = {𝑥 ∈ 𝐸 ∶ 𝑓(𝑥) ∈ 𝐴}.

We like inverse images of sets because they behave nicely under set operations. This is formalized by the following
theorem.

Theorem
Let 𝑓 ∶ 𝐸 → 𝐻 be a function between the two sets 𝐸 and 𝐻. For any 𝐴1 , 𝐴2 , ⋯ ⊆ 𝐻, the following hold.
(a) 𝑓 −1 (∪∞ ∞
𝑛=1 𝐴𝑛 ) = ∪𝑛=1 𝑓
−1
(𝐴𝑛 ),
(b) 𝑓 −1 (𝐴1 \𝐴2 ) = 𝑓 −1 (𝐴1 )\𝑓 −1 (𝐴2 ),
(c) 𝑓 −1 (∩∞ ∞
𝑛=1 𝐴𝑛 ) = ∩𝑛=1 𝑓
−1
(𝐴𝑛 ).

Proof. (a) We can easily see this by simply writing out the deﬁnitions. That is, we have

𝑓 −1 (∪∞ ∞
𝑛=1 𝐴𝑛 ) = {𝑥 ∈ 𝐸 ∶ 𝑓(𝑥) ∈ ∪𝑛=1 𝐴𝑛 }
= ∪∞
𝑛=1 {𝑥 ∈ 𝐸 ∶ 𝑓(𝑥) ∈ 𝐴𝑛 }
= ∪∞
𝑛=1 𝑓
−1
(𝐴𝑛 ),

37.3. Random variables 395

Mathematics of Machine Learning

which is what we had to show. (If you are not comfortable with working with sets, feel free to review the chapter on
introductory set theory.)
(b) This can be done in the same manner as (a).
(c) The De Morgan laws imply that

𝐻\( ∪∞ ∞
𝑛=1 𝐴𝑛 ) = ∩𝑛=1 (𝐻\𝐴𝑛 )

holds. Combining this with (a) and (b), (c) follows. □

Why is this important? Recall that the Borel sets, our standard event algebra on real numbers, is deﬁned by

ℬ = 𝜎({(−∞, 𝑥] ∶ 𝑥 ∈ ℝ}). (37.1)

These contain all events that we are interested in regarding the measurements. Combined with our previous result, we
can reveal what is not in plain sight about random variables.

Theorem
Let (Ω, Σ, 𝑃 ) be a probability space and 𝑋 ∶ Ω → ℝ be a random variable, and let 𝐴 ∈ ℬ, where ℬ is deﬁned by (37.1).
Then 𝑋 −1 (𝐴) ∈ Σ.

That is, we can measure the probability of 𝑋 −1 (𝐴) for any Borel set 𝐴. Without this, our random variables would not be
that useful. To make our notations more intuitive, we write

𝑃 (𝑋 ∈ 𝐴) ∶= 𝑃 (𝑋 −1 (𝐴)).

In plain English, 𝑃 (𝑋 ∈ 𝐴) is the probability of our measurement 𝑋 falling into the set 𝐴.
Now that we understand what all of this means, let’s see the simple proof!

Proof. This is a simple consequence of the fact that ℬ is the event algebra generated by sets of the form (−∞, 𝑥], and
the inverse images behave nicely under set operations (as the previous result suggests). □

37.4 Independence of random variables

When building probabilistic models of the external world, the assumption of independence signiﬁcantly simpliﬁes the
subsequent mathematical analysis. Recall that on a probability space (Ω, Σ, 𝑃 ) the events 𝐴, 𝐵 ∈ Σ are independent if

𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐴)𝑃 (𝐵),

or equivalently,

𝑃 (𝐴|𝐵) = 𝑃 (𝐴).

In plain English, observing one event doesn’t change our probabilistic belief about the other. Since a random variable 𝑋
is described by events of the form 𝑋 −1 (𝐸), we can generalize the notion of independence to random variables.

Deﬁnition (Independence of random variables.)

396 Chapter 37. Random variables

Mathematics of Machine Learning

Let 𝑋, 𝑌 ∶ Ω1 → Ω2 be two random variables between the probability space (Ω1 , Σ1 , 𝑃 ) and event algebra (Ω2 , Σ2 ).
We say that 𝑋 and 𝑌 are independent if for every 𝐴, 𝐵 ∈ Σ2 ,

𝑃 (𝑋 ∈ 𝐴, 𝑌 ∈ 𝐵) = 𝑃 (𝑋 ∈ 𝐴)𝑃 (𝑌 ∈ 𝐵)

holds.

Again, think about two coin tosses. 𝑋1 describes the first coin toss, 𝑋2 describes the other. Since the tosses are inde-
pendent, no observation of the first one reveals any extra information about the second one. This is formalized by the
definition above.
On the other hand, to show two dependent random variables, consider the following. We’ll roll a six-sided dice, and denote
the result by 𝑋. After that, we roll with 𝑋 pieces of six-sided dices, and denote the sum total of their values by 𝑌 .
𝑋 and 𝑌 are dependent on each other. For instance, consider that 𝑃 (𝑋 = 1, 𝑌 ≥ 7) = 0, but neither 𝑃 (𝑋 = 1) and
𝑃 (𝑌 ≥ 7) are zero.

37.5 Problems

Problem 1. Let 𝑋 and 𝑌 be two independent random variables, and let 𝑎, 𝑏 ∈ ℝ be two arbitrary constants. Show that
𝑋 − 𝑎 and 𝑌 − 𝑏 are also independent from each other.

37.5. Problems 397

Mathematics of Machine Learning

398 Chapter 37. Random variables

CHAPTER

THIRTYEIGHT

DISTRIBUTIONS

Let’s recap what we have learned so far. In probability theory, our goal is to
1. model real-life scenarios affected by uncertainty,
2. and to analyze them using mathematical tools such as calculus.
For the latter purpose, probability spaces are not easy to work with. A probability measure is a function defined on an
event algebra, so we can’t really use calculus there.
Random variables bring us one step closer to the solution, but they can be also difficult to work with. Even though a real
random variable 𝑋 ∶ Ω → ℝ maps an abstract probablity space to the set of real numbers, there are some complications.
Ω can be anything, and if you recall, we might not even have a tractable formula for 𝑋.
For example, if 𝑋 denotes the lifetime of a lightbulb, we don’t have a formula. So again, we can’t use calculus. However,
there is a way to represent the information contained by a random variable in a sequence, a vector-scalar function, or a
scalar-scalar function.
Enter probability distributions and density functions.

38.1 Discrete distributions

Consider a simple experiment, like tossing a fair coin 𝑛 times and counting the number of heads, denoting it with 𝑋. As
we have seen before, 𝑋 is a discrete random variable with

(𝑛𝑘) 21𝑛 if 𝑘 = 0, 1, … , 𝑛,
𝑃 (𝑋 = 𝑘) = {
0 otherwise.

However, the sequence {𝑃 (𝑋 = 𝑘)}𝑛𝑘=0 fully describes the random variable 𝑋!

Think about it. As our event space is Ω = {0, 1, … , 𝑛}, any event is of the form 𝐴 = {𝑎1 , 𝑎2 , … , 𝑎𝑙 } ⊂ Ω. Thus,
𝑙
𝑃 (𝑋 ∈ 𝐴) = ∑ 𝑃 (𝑋 = 𝑎𝑖 ),
𝑖=1

where we used the (𝜎-)additivity of probability. The sequence {𝑃 (𝑋 = 𝑘)}𝑛𝑘=0 is all the information we need.
As a consequence, instead of working with 𝑋 ∶ Ω → ℕ, we can forget about it and use only {𝑃 (𝑋 = 𝑘)}𝑛𝑘=0 . Why is
this good for us?
Because sequences are awesome. As opposed to the mysterious random variables, we have a lot of tools to work with
them. Most importantly, we can represent them in a programming language as an array of numbers. We can’t do such a
thing with pure random variables.

Deﬁnition 37.1.1 (Probability mass function.)

399
Mathematics of Machine Learning

Let 𝑋 ∶ Ω → {𝑥1 , 𝑥2 , … } be a discrete random variable. The function 𝑝𝑋 ∶ ℝ → [0, 1] deﬁned by

𝑝𝑋 (𝑥𝑘 ) = 𝑃 (𝑋 = 𝑘𝑘 )

is called the probability mass function (or PMF in short) of the discrete random variable 𝑋.

In general, a sequence of real numbers deﬁnes a discrete distribution if its elements are nonnegative and it sums up to one.

Deﬁnition 37.1.2 (Discrete probability distribution.)

Let {𝑝𝑘 }∞
𝑘=1 be a sequence of real numbers. We say that {𝑝𝑘 } is a discrete probability distribution if

(a) 𝑝𝑘 ≥ 0 for all 𝑘,

∞
(b) and ∑𝑘=1 𝑝𝑘 = 1.

Remark 37.1.1
Note that if the random variable assumes only ﬁnitely many variables (such as in our coin tossing example before), only
ﬁnitely many values are nonzero in the distribution.

As recently hinted, every discrete random variable 𝑋 deﬁnes the distribution {𝑃 (𝑋 = 𝑥𝑘 )}∞𝑘=1 , where {𝑥1 , 𝑥2 , … } are
the possible values 𝑋 can take. This is true the other way around: given a discrete distribution p = {𝑝𝑘 }∞𝑘=1 , we can
construct a random variable 𝑋 whose PMF is p.
Thus, the probability mass function of 𝑋 is also referred to as its distribution. I know, it is a bit confusing, as the word
“distribution” is quite overloaded in math. You’ll get used to it.
These discrete probability distributions are well-suited for performing quantitative analysis, as opposed to the base form
of random variables. As an additional beneﬁt, think about how distributions generalize random variables. No matter if
we talk about coin tosses or medical tests, the rate of success is given by the above discrete probability distribution.
Before moving on to discussing the basic properties of discrete distributions, let’s see some examples!

38.1.1 The Bernoulli distribution

Let’s start the long line of examples with the most basic probability distribution possible: the Bernoulli distribution,
describing a simple coin-tossing experiment. We are tossing a coin having probability 𝑝 of coming up heads and probability
1 − 𝑝 of coming up tails. The experiment is encoded in the random variable 𝑋 that takes the value 1 if the toss results in
heads, 0 otherwise.

1 if the toss results in heads,

𝑋={
0 otherwise.

Thus,

⎧1 − 𝑝 if 𝑘 = 0,
{
𝑃 (𝑋 = 𝑘) = ⎨𝑝 if 𝑘 = 1,
{0 otherwise.
⎩
When a random variable 𝑋 is distributed according to this, we write

𝑋 ∼ Bernoulli(𝑝),

400 Chapter 38. Distributions

Mathematics of Machine Learning

where 𝑝 ∈ [0, 1] is a parameter of the distribution.

It’s time to talk about distributions in practice. There are several stats packages for Python, but we’ll use the almighty
scipy. (Which is not exactly a stats package, but it has an excellent statistical module.)

from scipy.stats import bernoulli

We can generate random values using the rvs method of the bernoulli object. (Just like for any other distribution
from scipy.)

[bernoulli.rvs(p=0.5) for _ in range(10)] # ten Bernoulli(0.5)-distributed random␣

↪numbers

[0, 1, 1, 0, 0, 0, 0, 0, 1, 0]

In scipy, the probability mass function is implemented in the pmf method. We can even visualize the distribution using
Matplotlib. (Don’t worry if you don’t understand the code below. As we will routinely do things like this, I’ll introduce
you to the necessary libraries when the time comes.)

import matplotlib.pyplot as plt

params = [0.25, 0.5, 0.75]

with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, len(params), figsize=(4*len(params), 4), sharey=True)
fig.suptitle("The Bernoulli distribution")
for ax, p in zip(axs, params):
x = range(2)
y = [bernoulli.pmf(k=k, p=p) for k in x]
ax.bar(x, y)
ax.set_title(f"p = {p}")
ax.set_ylabel("P(X = k)")
ax.set_xlabel("k")

If you are interested in the details, feel free to check out the SciPy documentation for further methods!

38.1. Discrete distributions 401

Mathematics of Machine Learning

38.1.2 The binomial distribution

Let’s take our previous coin-tossing example one step further. Suppose that we toss the same coin 𝑛 times, and 𝑋 denotes
the number of heads out of 𝑛 tosses. What is the probability of getting exactly 𝑘 heads?
Say, 𝑛 = 5 and 𝑘 = 3. For example, the configuration 11010 (where 0 denotes tails and 1 denotes heads) has the
probability 𝑝3 (1 − 𝑝)2 , as there are three heads and two tails from five independent tosses.
How many such configurations are available? Selecting the position of the three heads is the same as selecting a three-
element subset out of a set of five elements. Thus, there are (𝑛𝑘) possibilities.
Combining this, we have

(𝑛𝑘)𝑝𝑘 (1 − 𝑝)𝑛−𝑘 if 𝑘 = 0, 1, … , 𝑛,
𝑃 (𝑋 = 𝑘) = {
0 otherwise.

This is called the binomial distribution, one of the most frequently encountered ones in probability and statistics. In
notation, we write

𝑋 ∼ Binomial(𝑛, 𝑝),

where the 𝑛 ∈ ℕ and 𝑝 ∈ [0, 1] are its two parameters. Let’s visualize the distribution!

from scipy.stats.distributions import binom

params = [(20, 0.25), (20, 0.5), (20, 0.75)]

with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, len(params), figsize=(4*len(params), 4), sharey=True)
fig.suptitle("The binomial distribution")
for ax, (n, p) in zip(axs, params):
x = range(n+1)
y = [binom.pmf(n=n, p=p, k=k) for k in x]
ax.bar(x, y)
ax.set_title(f"n = {n}, p = {p}")
ax.set_ylabel("P(X = k)")
ax.set_xlabel("k")

402 Chapter 38. Distributions

Mathematics of Machine Learning

38.1.3 The geometric distribution

A bit more coin tossing. We toss the same coin until a heads turn up. Let 𝑋 denote the number of tosses needed. With
some elementary probabilistic thinking, we can deduce that
(1 − 𝑝)𝑘−1 𝑝 if 𝑘 = 1, 2, … , 𝑛,
𝑃 (𝑋 = 𝑘) = {
0 otherwise.
(Since if heads turn up ﬁrst for the 𝑘-th toss, we tossed 𝑘 − 1 tails previously.) This is called the geometric distribution
and is denoted as
𝑋 ∼ Geo(𝑝),
with 𝑝 ∈ [0, 1] being the only parameter. Similarly, we can plot the histograms to visualize the distribution family.

from scipy.stats import geom

params = [0.2, 0.5, 0.8]

with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, len(params), figsize=(5*len(params), 5), sharey=True)
fig.suptitle("The geometric distribution")
for ax, p in zip(axs, params):
x = range(1, 20)
y = [geom.pmf(p=p, k=k) for k in x]
ax.bar(x, y)
ax.set_title(f"p = {p}")
ax.set_ylabel("P(X = k)")
ax.set_xlabel("k")

Note that none of the probabilities 𝑃 (𝑋 = 𝑘) are zero, but as 𝑘 grows, they become extremely small. (The closer 𝑝 is to
1, the faster the decay.)
∞
It might not be immediately obvious that ∑𝑘=1 (1 − 𝑝)𝑘−1 𝑝 = 1. To do that, we’ll apply a magic trick. (You know. As
the famous Arthur C. Clarke quote goes, “Any suﬃciently advanced mathematics is indistinguishable from magic.” Or
technology. It’s the same.)
In fact, for an arbitrary 𝑥 ∈ (−1, 1), the astounding identity
∞
1
∑ 𝑥𝑘 = (38.1)
𝑘=0
1−𝑥

38.1. Discrete distributions 403

Mathematics of Machine Learning

holds. This is the famous geometric series. Using (38.1), we have

∞ ∞
∑ 𝑃 (𝑋 = 𝑘) = ∑(1 − 𝑝)𝑘−1 𝑝
𝑘=1 𝑘=1
∞
= 𝑝 ∑(1 − 𝑝)𝑘
𝑘=0
1
=𝑝
1 − (1 − 𝑝)
= 1.

Using the geometric series is one of the most common tricks up a mathematician’s sleeve. We’ll use this, for instance,
when talking about expected values for certain distributions.

38.1.4 The uniform distribution

Let’s discard the coin and roll a six-sided dice. We’ve seen this before: the probability of each outcome is the same, that
is,
1
𝑃 (𝑋 = 1) = 𝑃 (𝑋 = 2) = ⋯ = 𝑃 (𝑋 = 6) = ,
6
where 𝑋 denotes the outcome of the roll. This is a special instance of the uniform distribution.
In general, let 𝐴 = {𝑎1 , 𝑎2 , … , 𝑎𝑛 } be a ﬁnite set. The discrete random variable 𝑋 ∶ Ω → 𝐴 is uniformly distributed on
𝐴, that is,

𝑋 ∼ Uniform(𝐴),

if
1
𝑃 (𝑋 = 𝑎1 ) = 𝑃 (𝑋 = 𝑎2 ) = ⋯ = 𝑃 (𝑋 = 𝑎𝑛 ) = .
𝑛
Note that 𝐴 must be a ﬁnite set: no discrete uniform distribution exists on inﬁnite sets.
Here is the probability mass function for rolling a six-sided dice. Not the most exciting one, I know.

from scipy.stats import randint

with plt.style.context("seaborn-white"):
fig = plt.figure(figsize=(16, 8))
plt.title("The uniform distribution")

x = range(-1, 9)
y = [randint.pmf(k=k, low=1, high=7) for k in x]
plt.bar(x, y)
plt.ylim(0, 1)
plt.ylabel("P(X = k)")
plt.xlabel("k")

404 Chapter 38. Distributions

Mathematics of Machine Learning

38.1.5 The single-point distribution

We’ve left the simplest one for the last: the single-point distribution. Let 𝑎 ∈ ℝ be an arbitrary real number. We say that
the random variable 𝑋 is distributed according to 𝛿(𝑎) if

1 if 𝑥 = 𝑎,
𝑃 (𝑋 = 𝑥) = {
0 otherwise.

That is, 𝑋 assumes 𝑎 with probability 1. Their corresponding cumulative distribution function is

1 if 𝑥 ≥ 𝑎,
𝐹𝑋 (𝑥) = {
0 otherwise,

which is a simple step function with a single jump.

Trust me, explicitly naming such a simple distribution is immensely useful. There are two main reasons that come to
mind. First, the single-point distribution often arises as the limit distribution of sequences of random variables. (We’ll
make this precise when talking about the Law of Large Numbers.)
Second, every discrete distribution can be written in terms of single-point distributions. This is not absolutely necessary
for you to understand right now, but it’ll be essential on a more advanced level.

Remark 37.1.2 (Discrete distributions as the linear combination of single-point distributions.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝑋 ∶ Ω → {𝑥1 , 𝑥2 , … } be a discrete random variable with probability mass
function 𝑝𝑖 = 𝑃 (𝑋 = 𝑥𝑖 ).
By introducing the single-point distributions 𝑋𝑖 ∼ 𝛿(𝑥𝑖 ), we have
∞
𝐹𝑋 (𝑥) = ∑ 𝑝𝑖 𝐹𝑋𝑖 (𝑥).
𝑖=1

This decomposition can be extremely useful.

38.1. Discrete distributions 405

Mathematics of Machine Learning

38.2 Law of total probability, revisited once more

With the help of discrete random variables, we can dress up the law of total probability in new clothes.

Theorem 37.2.1 (Law of total probability, discrete random variable version.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝐴 ∈ Σ be an arbitrary event. If 𝑋 ∶ Ω → {𝑥1 , 𝑥2 , … } is a discrete random
variable, then
∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴|𝑋 = 𝑥𝑘 )𝑃 (𝑋 = 𝑥𝑘 ). (38.2)
𝑘=−∞

Proof. For any discrete random variable 𝑋 ∶ Ω → {𝑥1 , 𝑥2 , … }, the events {𝑋 = 𝑥𝑘 } partition the event space: they
are mutually disjoint, and their union gives Ω. Thus, the law of total probability can be applied, obtaining
∞
𝑃 (𝐴) = ∑ 𝑃 (𝐴, 𝑋 = 𝑥𝑘 )
𝑘=−∞
∞
= ∑ 𝑃 (𝐴|𝑋 = 𝑥𝑘 )𝑃 (𝑋 = 𝑥𝑘 ),
𝑘=−∞

which is what we had to prove. □

In other words, we can study events in the context of discrete random variables. This is extremely useful in practice.
(Soon, we’ll see that it’s not only for the discrete case.)
Let’s put (38.2) to work right away.

38.2.1 Sums of discrete random variables

Since discrete probability distributions are represented by sequences, we can use a wide array of tools from mathematical
analysis to work with them. (This was the whole reason behind switching random variables to distributions.) As a
consequence, we can easily describe more complex random variables by constructing them from simpler ones.
For instance, consider rolling two dice, where we are interested in the distribution of the sum. So, we can write this as
the sum of random variables 𝑋1 and 𝑋2 , denoting the outcome of the ﬁrst and second toss respectively. We know that

1
if 𝑘 = 1, 2, … , 6,
𝑃 (𝑋𝑖 = 𝑘) = { 6
0 otherwise

for 𝑖 = 1, 2. Using (38.2) and the fact that the two outcomes are independent, we have
6
𝑃 (𝑋1 + 𝑋2 = 𝑘) = ∑ 𝑃 (𝑋1 + 𝑋𝑘 = 𝑘|𝑋2 = 𝑙)𝑃 (𝑋2 = 𝑙)
𝑙=1
6
= ∑ 𝑃 (𝑋1 = 𝑘 − 𝑙)𝑃 (𝑋2 = 𝑙)
𝑙=1

If this looks familiar, it is not an accident. What you see here is the famous convolution operation in action.

Deﬁnition 37.2.1 (Discrete convolution.)

406 Chapter 38. Distributions

Mathematics of Machine Learning

Let 𝑎 = {𝑎𝑘 }∞ ∞
𝑘=−∞ and 𝑏 = {𝑏𝑘 }𝑘=−∞ be two arbitrary sequences. Their convolution is deﬁned by

∞ ∞
𝑎 ∗ 𝑏 ∶= { ∑ 𝑎𝑘−𝑙 𝑏𝑙 } .
𝑙=−∞ 𝑘=−∞

∞
That is, the 𝑘-th element of the sequence 𝑎 ∗ 𝑏 is deﬁned by the sum ∑𝑙=−∞ 𝑎𝑘−𝑙 𝑏𝑙 . This might be hard to imagine, but
thinking about the probabilistic interpretation makes the deﬁnition clear. The random variable 𝑋1 + 𝑋2 can assume the
value 𝑘 if 𝑋1 = 𝑘 − 𝑙 and 𝑋2 = 𝑙, for all possible 𝑙 ∈ ℤ.

Remark 37.2.1 (Switching up the indices.)

Due to symmetry,
∞ ∞
∑ 𝑎𝑘−𝑙 𝑏𝑙 = ∑ 𝑎𝑙 𝑏𝑘−𝑙 .
𝑙=−∞ 𝑙=−∞

Thus, an alternative deﬁnition of 𝑎 ∗ 𝑏

∞ ∞
𝑎 ∗ 𝑏 = { ∑ 𝑎𝑙 𝑏𝑘−𝑙 } .
𝑙=−∞ 𝑘=−∞

∞
This trick is often extremely useful, as when 𝑎𝑘 and 𝑏𝑘 is explicitly given, sometimes ∑𝑙=−∞ 𝑎𝑙 𝑏𝑘−𝑙 is simpler to calculate
∞
than ∑𝑙=−∞ 𝑎𝑘−𝑙 𝑏𝑙 , and vice versa.

Convolution is supported by NumPy, so with its help, we can visualize the distribution of our 𝑋1 + 𝑋2 .

import numpy as np

dist_1 = [0, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

dist_2 = [0, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
sum_dist = np.convolve(dist_1, dist_1)

with plt.style.context("seaborn-white"):
plt.figure(figsize=(16, 8))
plt.bar(range(0, len(sum_dist)), sum_dist)
plt.title("Distribution of X₁ + X₂")
plt.ylabel("P(X + Y = k)")
plt.xlabel("k")

38.2. Law of total probability, revisited once more 407

Mathematics of Machine Learning

Let’s talk about the general case. The pattern is clear, so we can formulate a theorem.

Theorem 37.2.2 (Sums of discrete random variables.)

If 𝑋, 𝑌 ∶ Ω → ℤ are both integer-valued random variables, then the distribution of 𝑋 + 𝑌 is given by the convolution
of the respective distributions:
∞
𝑃 (𝑋 + 𝑌 = 𝑘) = ∑ 𝑃 (𝑋 = 𝑘 − 𝑙)𝑃 (𝑌 = 𝑙),
𝑙=−∞

that is,

𝑝𝑋+𝑌 = 𝑝𝑋 ∗ 𝑝𝑌 .

Proof. The proof is a straightforward application of the law of total probability (38.2):
∞
𝑃 (𝑋 + 𝑌 = 𝑘) = ∑ 𝑃 (𝑋 + 𝑌 = 𝑘|𝑌 = 𝑙)𝑃 (𝑌 = 𝑙)
𝑙=−∞
∞
= ∑ 𝑃 (𝑋 = 𝑘 − 𝑙)𝑃 (𝑌 = 𝑙)
𝑙=−∞
= (𝑝𝑋 ∗ 𝑝𝑌 )(𝑘),

which is what we had to prove. □

Another example of random variable sums is the binomial distribution itself. Instead of thinking about the number of
successes of an experiment out of 𝑛 independent tries, we can model the core experiment as a Bernoulli distribution. That

408 Chapter 38. Distributions

Mathematics of Machine Learning

is, if 𝑋𝑖 is a Bernoulli(𝑝) distributed random variable describing the success of the 𝑖-th attempt, we have

𝑃 (𝑋1 + ⋯ + 𝑋𝑛 = 𝑘) = ∑ 𝑃 (𝑋1 = 𝑖1 , … , 𝑋𝑛 = 𝑖𝑛 )
𝑖1 +…𝑖𝑛 =𝑘

= ∑ 𝑃 (𝑋1 = 𝑖1 ) … 𝑃 (𝑋𝑛 = 𝑖𝑛 )
⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟
𝑖1 +…𝑖𝑛 =𝑘 𝑋1 ,…,𝑋𝑛 are independent

= ∑ 𝑝 (1 − 𝑝)𝑛−𝑘
𝑘

𝑖1 +…𝑖𝑛 =𝑘

𝑛
= ( )𝑝𝑘 (1 − 𝑝)𝑛−𝑘 ,
𝑘

where the sum ∑𝑖 traverses all tuples (𝑖1 , … , 𝑖𝑛 ) ∈ {0, 1}𝑛 for which 𝑖1 + ⋯ + 𝑖𝑛 = 𝑘. (As there are (𝑛𝑘) of
1 +…𝑖𝑛 =𝑘
such tuples, we have ∑𝑖 𝑝𝑘 (1 − 𝑝)𝑛−𝑘 = (𝑛𝑘)𝑝𝑘 (1 − 𝑝)𝑛−𝑘 in the last step.)
1 +…𝑖𝑛 =𝑘

38.3 Real-valued distributions

So far, we have talked about discrete random variables; that is, random variables with countably many values. However,
not all experiments/observations/measurements are like this. For instance, the height of a person is a random variable
that can assume a continuum of values.
To give a tractable example, let’s pick a number 𝑋 from [0, 1], with each one having an “equal chance”. In this context,
equal chance means that

𝑃 (𝑎 < 𝑋 ≤ 𝑏) = |𝑏 − 𝑎|.

Can we describe 𝑋 with a single real function? Like in the discrete case, we can try

𝐹 (𝑥) = 𝑃 (𝑋 = 𝑥),

but this wouldn’t work. Why? Because for each 𝑥 ∈ 𝑋, we have 𝑃 (𝑋 = 𝑥) = 0. That is, picking a particular number 𝑥
has zero probability. Instead, we can try 𝐹𝑋 (𝑥) = 𝑃 (𝑋 ≤ 𝑥), which is

⎧0 if 𝑥 ≤ 0,
{
𝐹𝑋 (𝑥) = ⎨𝑥 if 0 < 𝑥 ≤ 1,
{1 otherwise.
⎩
We can plot this for visualization.

from scipy.stats import uniform

X = np.linspace(-0.5, 1.5, 100)
y = uniform.cdf(X)

with plt.style.context('seaborn-white'):
plt.figure(figsize=(16, 8))
plt.plot(X, y)

38.3. Real-valued distributions 409

Mathematics of Machine Learning

In the following section, we will properly deﬁne and study this object in detail for all real-valued random variables.

38.3.1 The cumulative distribution function

What we have seen in our motivating example is an instance of a cumulative distribution functionor CDF in short. Let’s
jump into the formal deﬁnition right away.

Deﬁnition 37.3.1 (Cumulative distribution function.)

Let 𝑋 be a real-valued random variable. The function deﬁned by

𝐹𝑋 (𝑥) ∶= 𝑃 (𝑋 ≤ 𝑥) (38.3)

is called the cumulative distribution function (or CDF in short) of 𝑋.

Again, let’s unpack this. Recall that in the deﬁnition of real-valued random variables, we have used the inverse images
𝑋 −1 ((𝑎, 𝑏))?
Something similar is going on here. 𝑃 (𝑋 ≤ 𝑥) is the abbreviation for 𝑃 (𝑋 −1 ((−∞, 𝑥])), which we are too lazy to
write. Similarly to 𝑋 −1 ((𝑎, 𝑏))), you can visualize 𝑋 −1 ((−∞, 𝑥])) by pulling the interval (−∞, 𝑥] back to Ω using the
mapping 𝑋.
Sets of the form 𝑋 −1 ((−∞, 𝑥]) are called the level sets of 𝑋.
According to the Oxford English Dictionary, the word cumulative means “increasing or increased in quantity, degree,
or force by successive additions”. For discrete random variables, using 𝑃 (𝑋 = 𝑘) was enough, but since real random
variables are more nuanced, we have to use the cumulative probabilities 𝑃 (𝑋 ≤ 𝑥) to meaningfully describe them.
Why do we like to work with distribution functions? Because they condense all the relevant information about a random
variable in a real function. For instance, we can express probabilities like

𝑃 (𝑎 < 𝑋 ≤ 𝑏) = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎).

To give an example, let’s revisit the introduction, where we were selecting a random number between zero and one. There,

410 Chapter 38. Distributions

Mathematics of Machine Learning

Fig. 38.1: The level set of a random variable.

the random variable 𝑋 with CDF

⎧0 if 𝑥 ≤ 0,
{
𝐹𝑋 (𝑥) = 𝑥 if 0 < 𝑥 ≤ 1, (38.4)
⎨
{1 otherwise.
⎩
is said to be uniformly distributed over [0, 1], or 𝑋 ∼ Uniform(0, 1) in short. We’ll see a ton of examples later, but keep
note of this, as the uniform distribution will be our textbook example throughout this section.

38.3.2 Properties of the distribution function

Cumulative distribution functions have three properties that characterize them: they are always nondecreasing, right-
continuous (whatever that might be), and their limits are 0 and 1 towards −∞ and ∞ respectively. You might have
guessed some of this from the deﬁnition, but here is the formal theorem that summarizes this.

Theorem 37.3.1 (Properties of CDFs.)

Let 𝑋 be a real-valued random variable with CDF 𝐹𝑋 . Then 𝐹𝑋 is
(a) non-decreasing (that is, 𝑥 ≤ 𝑦 implies 𝐹𝑋 (𝑥) ≤ 𝐹𝑋 (𝑦)),
(b) right-continuous (that is, lim𝑥→𝑥0 + 𝐹𝑋 (𝑥) = 𝐹𝑋 (𝑥0 ), or in other words, taking the right limit is interchangeable with
𝐹𝑋 ),
(c)
lim 𝐹𝑋 (𝑥) = 0, lim 𝐹𝑋 (𝑥) = 1
𝑥→−∞ 𝑥→∞

holds.

38.3. Real-valued distributions 411

Mathematics of Machine Learning

Proof. The proofs are relatively straightforward. (a) follows from the fact that if 𝑥 < 𝑦, then we have

𝑋 −1 ((−∞, 𝑥]) ⊆ 𝑋 −1 ((−∞, 𝑦]).

In other words, the event 𝑋 ≤ 𝑥 is a subset of 𝑋 ≤ 𝑦. Thus, due to the monotonicity of probability measures, we have
𝑃 (𝑋 ≤ 𝑥) ≤ 𝑃 (𝑋 ≤ 𝑦).
(b) Here, we need to show that lim𝑥→𝑥0 + 𝑃 (𝑋 ≤ 𝑥) = 𝑃 (𝑋 ≤ 𝑥0 ). For this, note that for any 𝑥𝑛 → 𝑥0 with 𝑥𝑛 > 𝑥0 ,
the event sequence {𝜔 ∈ Ω ∶ 𝑋(𝜔) ≤ 𝑥𝑛 } is decreasing, and

∩∞ −1 −1
𝑛=1 𝑋 ((−∞, 𝑥𝑛 ]) = 𝑋 ((−∞, 𝑥]).

Because of the upper continuity of probability measures (see (35.4)), the right continuity of 𝐹𝑋 follows.
(c) Again, this follows from the fact that

∩∞ −1
𝑛=1 𝑋 ((−∞, 𝑛]) = ∅

and

∪∞ −1
𝑛=1 𝑋 ((−∞, 𝑛]) = Ω.

Since 𝑃 (∅) = 0 and 𝑃 (Ω), the statement follows from the upper and lower continuity of probability measures. (See
(35.3) and (35.4).) □

Remark 37.3.1 (Alternative deﬁnition of CDF-s)

In the literature, you can sometimes ﬁnd that instead of (38.3), the CDF of 𝑋 is deﬁned by
∗
𝐹𝑋 (𝑥) ∶= 𝑃 (𝑋 < 𝑥),

that is, 𝑋 < 𝑥 instead of 𝑋 ≤ 𝑥. This doesn’t change the big picture, but some details are slightly diﬀerent. For instance,
this change makes 𝐹𝑋 left-continuous instead of right-continuous. These minute details matter if you dig really deep, but
in machine learning, we’ll be ﬁne without thinking too much about them.

Theorem 37.3.1 is true the other way around: if you give me a nondecreasing right-continuous function 𝐹 (𝑥) with
lim𝑥→−∞ 𝐹 (𝑥) = 0 and lim𝑥→∞ 𝐹 (𝑥) = 1, I can construct a random variable such that its distribution function matches
𝐹 (𝑥).

38.3.3 Cumulative distribution functions for discrete random variables

The discrete and real-valued case is not entirely disjoint: in fact, discrete random variables have cumulative distribution
functions as well. (But not the other way around, as real-valued random variables cannot be described with sequences.)
Say, if 𝑋 is a discrete random variable taking the values 𝑥1 , 𝑥2 , …, then its CDF is

𝐹𝑋 (𝑥) = ∑ 𝑃 (𝑋 = 𝑥𝑖 ),
𝑥𝑖 ≤𝑥

which is a piecewise continuous function. In the case of binomial distributions, here is what it looks like.

412 Chapter 38. Distributions

Mathematics of Machine Learning

Fig. 38.2: The CDF of Binomial(𝑛, 𝑝)

38.4 Notable real-valued distributions

The strength or probability lies in its ability to translate real-world phenomena into coin tosses, dice rolls, dart throws,
lightbulb lifespans, and many more. This is possible because of distributions. Distributions are the ribbons stringing
together a vast bundle of random variables.
Let’s meet some of the most important ones!

38.4.1 The uniform distribution

We have already seen a special case of the uniform distribution: selecting a random number from the interval [0, 1], such
that all outcomes are “equally likely”. The general uniform distribution captures the same concept, except on an arbitrary
interval [𝑎, 𝑏]. That is, the random variable 𝑋 is uniformly distributed on the interval [𝑎, 𝑏], or 𝑋 ∼ Uniform(𝑎, 𝑏) in
symbols, if
1
𝑃 (𝛼 < 𝑋 ≤ 𝛽) = ∣[𝑎, 𝑏] ∩ (𝛼, 𝛽]∣
𝑏−𝑎
for all 𝛼 < 𝛽, where ∣[𝑐, 𝑑]∣ denotes the length of the interval [𝑐, 𝑑],
In other words, the probability of our random number falling into a given interval is proportional to the interval’s length.
This is how the condition “equally likely” makes sense: as there are uncountably many possible outcomes, the probability
of each individual outcome is zero, but equally long intervals have an equal chance.
In line with the deﬁnition, the distribution function of 𝑋 is

⎧0 if 𝑥 ≤ 𝑎,
{
𝐹𝑋 (𝑥) = ⎨ 𝑥−𝑎
𝑏−𝑎 if 𝑎 < 𝑥 ≤ 𝑏,
{1 otherwise.
⎩

38.4. Notable real-valued distributions 413

Mathematics of Machine Learning

38.4.2 The exponential distribution

Let’s turn our attention toward a diﬀerent problem: lightbulbs. According to some mysterious (and probably totally
inaccurate) lore, lightbulbs possess the so-called memoryless property. That is, its expected lifespan is the same at any
point in its life.
To put this into a mathematical form, let 𝑋 be a random variable denoting the lifespan of a given lightbulb. The mem-
oryless property states that if the lightbulb has already lasted 𝑠 seconds, then the probability of lasting another 𝑡 is the
same as in the very ﬁrst moment of its life. That is,

𝑃 (𝑋 > 𝑡 + 𝑠|𝑋 > 𝑠) = 𝑃 (𝑋 > 𝑡).

Expanding the left side, we have

𝑃 (𝑋 > 𝑡 + 𝑠, 𝑋 > 𝑠)
𝑃 (𝑋 > 𝑡 + 𝑠|𝑋 > 𝑠) =
𝑃 (𝑋 > 𝑠)
𝑃 (𝑋 > 𝑡 + 𝑠)
= ,
𝑃 (𝑋 > 𝑠)
as {𝑋 > 𝑡 + 𝑠} ∩ {𝑋 > 𝑠} = {𝑋 > 𝑡 + 𝑠}. Thus, the memoryless property implies that

𝑃 (𝑋 > 𝑡 + 𝑠) = 𝑃 (𝑋 > 𝑡)𝑃 (𝑋 > 𝑠). (38.5)

If we think about the probabilities as a function 𝑓(𝑡) = 𝑃 (𝑋 > 𝑡), (38.5) can be viewed as a functional equation. And
a famous one for that. Without going into the painful details the only continuous solution is the exponential function
𝑓(𝑡) = 𝑒𝑎𝑡 , where 𝑎 ∈ ℝ is a parameter.
As we are talking about the lifespan of a lightbulb here, the probability of it lasting forever is zero. That is,
lim 𝑃 (𝑋 > 𝑡) = 0
𝑡→∞

holds. Thus, as

⎧0 if 𝑎 < 0,
{
lim 𝑒𝑎𝑡 =
𝑡→∞ ⎨1 if 𝑎 = 0,
{∞ if 𝑎 > 0,
⎩
only the negative parameters are valid in our case. This characterizes the exponential distribution. In general, 𝑋 ∼ exp(𝜆)
for a 𝜆 > 0 if

0 if 𝑥 < 0,
𝐹𝑋 (𝑥) = {
1 − 𝑒−𝜆𝑥 if 0 ≤ 𝑥.

Let’s plot this for visualization!

from scipy.stats import expon

X = np.linspace(-0.5, 10, 100)
params = [0.1, 1, 10]
ys = [expon.cdf(X, scale=1/l) for l in params]

with plt.style.context('seaborn-white'):
plt.figure(figsize=(16, 8))

for l, y in zip(params, ys):

plt.plot(X, y, label=f"λ = {l}")

plt.legend()

414 Chapter 38. Distributions

Mathematics of Machine Learning

The exponential distribution is extremely useful and frequently encountered in real-life applications. For instance, it
models the requests incoming to a server, customers standing in a queue, buses arriving at a bus stop, and many more.
We’ll talk more about special distributions in later chapters, and we’ll add quite a few others as well.

38.5 Conclusion

Distributions are the lifeblood of probability theory, and distributions can be represented with cumulative distribution
functions.
However, CDFs have a signiﬁcant drawback: it’s hard to express the probability of more complex events with them.
Later, we’ll see several concrete examples of where CDFs fail. Without going into details, one example points towards
multidimensional distributions. (I hope that their existence and importance are not surprising you.) There, the distribution
functions can be used to express the probability of rectangle-shaped events, but not, say, spheres.
To be a tiny bit more precise, if 𝑋, 𝑌 ∼ Uniform(0, 1), then the probability

𝑃 (𝑋 2 + 𝑌 2 < 1)

cannot be directly expressed in terms of the two-dimensional CDF 𝐹𝑋,𝑌 (𝑥, 𝑦). (Whatever that may be.) Fortunately,
this is not our only tool.
Enter probability density functions.

38.5. Conclusion 415

Mathematics of Machine Learning

416 Chapter 38. Distributions

CHAPTER

THIRTYNINE

DENSITIES

39.1 Density functions

Distribution functions are not our only tool to describe real-valued random variables. If you have studied probability
theory from a book/lecture/course written by a non-mathematician, you have probably seen a function like
1 𝑥2
𝑝(𝑥) = √ 𝑒− 2
2𝜋
referred to as “probability” at some point. Let me tell you, this is deﬁnitely not a probability. I have seen this mistake so
much that I decided to write short Twitter threads properly explaining probabilistic concepts, from which this book was
grown out of. So, I take this issue to heart.
Here is the problem with cumulative distribution functions: they represent global information about local objects. Let’s
unpack this idea. If 𝑋 is a real-valued random variable, the CDF

𝐹𝑋 (𝑥) = 𝑃 (𝑋 ≤ 𝑥)

describes the probability of 𝑋 being smaller than a given 𝑥. But what if we are interested in what happens around 𝑥?
Say, in the case of the uniform distribution (38.4), we have

𝑃 (𝑋 = 𝑥) = lim 𝑃 (𝑥 − 𝜀 < 𝑋 ≤ 𝑥)
𝜀→0
= lim (𝐹𝑋 (𝑥) − 𝐹𝑋 (𝑥 − 𝜀))
𝜀→0
= lim 𝜀
𝜀→0
= 0.

(We used Theorem 34.2.4 when taking the limit.)

Thus, as we have already seen, the probability of picking a particular point is zero. Contrary to the discrete case, 𝑃 (𝑋 =
𝑥) tells us nothing about how the distribution of 𝑋 behaves around 𝑥.
And the worst thing is, this is the same for a wide array of distributions. For instance, you can check it manually for the
exponential distribution.
Isn’t this strange? The probability of individual outcomes is zero for both the uniform and exponential distribution, yet
the distributions themselves couldn’t be any diﬀerent.
Let’s examine the problem from another perspective. By deﬁnition,

𝑃 (𝑎 < 𝑋 ≤ 𝑏) = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎)

holds. Does this look familiar to you? Increments of 𝐹𝑋 on the right, probabilities on the left. Where have we seen
increments before?

417
Mathematics of Machine Learning

′
In the fundamental theorem of calculus, that’s where. That is, if 𝐹𝑋 is diﬀerentiable and its derivative is 𝐹𝑋 (𝑥) = 𝑓𝑋 (𝑥),
then
𝑏
∫ 𝑓𝑋 (𝑥)𝑑𝑥 = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎). (39.1)
𝑎

The function 𝑓𝑋 (𝑥) seems to be what we are looking for: it represents the local behavior of 𝑋 around 𝑥. But instead of
describing the probability, it describes its rate of change. This is called a probability density function.
By turning this argument around, we can deﬁne density functions using (39.1). Here is the mathematically precise version.

Deﬁnition 38.1.1 (Density functions.)

Let (Ω, Σ, 𝑃 ) be a probability space, and 𝑋 ∶ Ω → ℝ be a real-valued random variable. The function 𝑓𝑋 ∶ ℝ → ℝ is
called the probability density function of 𝑋, if
𝑏
∫ 𝑓𝑋 (𝑥)𝑑𝑥 = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎), (39.2)
𝑎

holds for all 𝑎, 𝑏 ∈ ℝ.

Again, (39.2) is the Newton-Leibniz formula (26.8) in disguise. The following theorem makes this connection precise.

Theorem 38.1.1 (The density function as derivative.)

Let 𝑋 be a real-valued random variable. If the cumulative distribution function 𝐹𝑋 (𝑥) is everywhere diﬀerentiable, then
𝑑
𝑓𝑋 (𝑥) = 𝐹 (𝑥)
𝑑𝑥 𝑋
is a density function for 𝑋.

Proof. This is just a simple application of the fundamental theorem of calculus. If the derivative indeed exists, then
𝑏
𝑑
∫ 𝐹 (𝑥)𝑑𝑥 = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎),
𝑎 𝑑𝑥 𝑋
𝑑
which means that 𝑓𝑋 (𝑥) = 𝑑𝑥 𝐹𝑋 (𝑥) is indeed a density function. □

Remark 38.1.1 (Density functions are not unique.)

Note that density functions are not unique. If 𝑋 is a random variable with density 𝑓𝑋 , then, say, modifying 𝑓𝑋 at a single
point still functions as a density function for 𝑋.
To be more precise, deﬁne

∗ 𝑓𝑋 (𝑥) if 𝑥 ≠ 0,
𝑓𝑋 (𝑥) = {
𝑓𝑋 (0) + 1 if 𝑥 = 0.
∗ ∗
You can check it by hand that 𝑓𝑋 is still a density for 𝑋, yet 𝑓𝑋 ≠ 𝑓𝑋 .

One more thing before we move on. Recall that discrete random variables are characterized by probability mass functions:
these two objects are two sides of the same coin.

418 Chapter 39. Densities

Mathematics of Machine Learning

The probability mass function is analogous to the density function, yet we don’t have terminology for random variables
with the latter. We’ll ﬁx this now.

Deﬁnition 38.1.2 (Continuous random variables.)

Let (Ω, Σ, 𝑃 ) be a probability space, and 𝑋 ∶ Ω → ℝ be a real-valued random variable. We say that 𝑋 is continuous if
it has a probability density function; that is, there exists a function 𝑓𝑋 ∶ ℝ → ℝ such that
𝑏
∫ 𝑓𝑋 (𝑥)𝑑𝑥 = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎),
𝑎

where 𝐹𝑋 is the cumulative distribution function of 𝑋.

Discrete and continuous random variables are the backbones of probability theory: the most interesting random variables
are falling into these classes. (Later in the chapter, we’ll see that there are more types, but these two are the most
important.)
Now we are ready to get our hands dirty and see some density functions in practice.

39.1.1 Density functions in practice

After all this introduction, let’s see a few concrete examples. So far, we have seen two real-valued non-discrete distribu-
tions: the uniform and the exponential.
Example 1. Let’s start with 𝑋 ∼ Uniform(0, 1). Can we apply Theorem 38.1.1 directly? Not without a little snag. Or
two, to be more precise.
Why? Because the distribution function

⎧0 if 𝑥 ≤ 0,
{
𝐹𝑋 (𝑥) =
⎨𝑥 if 0 < 𝑥 ≤ 1,
{1 if 1 < 𝑥
⎩
is not diﬀerentiable at 𝑥 = 0 and 𝑥 = 1. However, it is diﬀerentiable everywhere else, and its derivative

⎧0 if 𝑥 < 0,
′
{
𝐹𝑋 (𝑥) = ⎨1 if 0 < 𝑥 < 1,
{0 if 1 < 𝑥
⎩

is indeed a density function. (You can check this by hand.) This density is patched together from the derivative of 𝐹𝑋 (𝑥)
on the intervals (−∞, 0), (0, 1), and (1, ∞).
Example 2. In the case of the exponentially distributed random variable 𝑌 ∼ exp(𝜆), the function

0 if 𝑥 < 0,
𝑓𝑌 (𝑥) = {
𝜆𝑒−𝜆𝑥 if 0 ≤ 𝑥

is a proper density function, which we obtained by diﬀerentiating 𝐹𝑌 (𝑥) whenever possible. Again, the density 𝑓𝑋 (𝑥) is
patched together from the derivatives on the intervals (−∞, 0) and (0, ∞).
Example 3. Now, I am going to turn everything upside down. Let 𝑍 ∼ Bernoulli(1/2), which is a discrete random
variable with probability mass function
1
𝑝𝑍 (0) = 𝑝𝑍 (1) = ,
2

39.1. Density functions 419

Mathematics of Machine Learning

Fig. 39.1: Density function of the uniform distribution on [0, 1].

Fig. 39.2: Density function of the exp(1) distribution.

420 Chapter 39. Densities

Mathematics of Machine Learning

and cumulative distribution function

⎧0 if 𝑥 < 0,
{
𝐹𝑍 (𝑥) = 12 if 0 ≤ 𝑥 < 1,
⎨
{1 if 1 ≤ 𝑥.
⎩
Like the uniform and exponential distributions, this CDF is also differentiable except for a few points. (Which are 0 and
1.)
Thus, it is natural to guess that like before, we can patch its derivatives together to obtain a density function. However,
there is a bigger snag: the derivative of 𝐹𝑍 is zero, at least wherever it exists. It turns out that 𝑍 does not have a density
function at all!
What’s the issue? I’ll tell you: the jump discontinuities of 𝐹𝑍 (𝑥) at 𝑥 = 0 and 𝑥 = 1. Although the CDFs of the uniform
and exponential distributions were not differentiable at finitely many points, they did not include any jump discontinuities.
We are not going to dive deep into the details, but the gist is: if there is a jump discontinuity in the CDF, the density
function does not exist.

Remark 38.1.2 (The non-existence of density despite the lack of jump discontinuities.)
Unfortunately, the reverse direction of “jump discontinuity in the CDF ⟹ no PDF exists” is not true, I repeat, not true.
We can ﬁnd random variables whose cumulative distribution functions are continuous, but their density does not exist.
One famous example is the Cantor function, also known as the Devil’s staircase. (Only follow this link if you are brave
enough, or well-trained in real analysis. Which is the same.)

39.2 Classiﬁcation of real-valued random variables

So far, we have been focusing on two special kinds of real-valued random variables: discrete random variables and
continuous ones.
We’ve seen all kinds of objects describing them. Every real-valued random variable has a cumulative distribution function,
but while discrete ones are characterized by probability mass functions, the continuous ones are by density functions.
Are these two all that’s out there?
No. There are mixed cases. For instance, consider the following example. We are selecting a random number from [0, 1],
but we add a little twist to the picking process. First, we toss a fair coin, and if it comes up heads, we pick 0. Otherwise,
we pick uniformly between zero and one.
To describe this weird process, let’s introduce two random variables: let 𝑋 be the ﬁnal outcome, and 𝑌 be the outcome
of the coin toss. Then, using the conditional version of the law of total probability (see Theorem 35.2.1), we have
𝑃 (𝑋 ≤ 𝑥) = 𝑃 (𝑋 ≤ 𝑥|𝑌 = heads)𝑃 (𝑌 = heads)
+ 𝑃 (𝑋 ≤ 𝑥|𝑌 = tails)𝑃 (𝑌 = tails).
As
0 if 𝑥 < 0,
𝑃 (𝑋 ≤ 𝑥|𝑌 = heads) = {
1 if 0 ≤ 𝑥,

and 𝑃 (𝑋 ≤ 𝑥|𝑌 = tails) = 𝐹Uniform(0,1) (𝑥), we ultimately have

⎧0 if 𝑥 < 0,
{
𝐹𝑋 (𝑥) = ⎨ 𝑥+1
2 if 0 ≤ 𝑥 < 1,
{1 if 1 ≤ 𝑥.
⎩

39.2. Classiﬁcation of real-valued random variables 421

Mathematics of Machine Learning

Ultimately, 𝐹𝑋 is the convex combination of two cumulative distribution functions. (A convex combination is a linear
combination where the coeﬃcients are positive and their sum is 1.)

Fig. 39.3: CDF of the mixed distribution 𝑋.

Thus, the random variable 𝑋 is not discrete, nor continuous. So, what is it?
It’s time to put order to chaos! In this section, we are going to provide a complete classiﬁcation for our real-valued random
variables. This is a beautiful, albeit advanced topic, so feel free to skip it on a ﬁrst read.

39.2.1 Sets of zero measure

Let’s start at a seemingly distant topic: subsets of ℝ that are so small that they practically vanish.
Since ℝ is a one-dimensional object, we are usually talking about length here, but let’s forget that terminology and talk
about measure instead. We’ll denote the measure of a set 𝐴 ⊆ ℝ by 𝜆(𝐴), whatever that might be.
We are not going too deep into the details and will keep on using the notion of measure intuitively. For instance, the
measure of an interval [𝑎, 𝑏] is 𝜆([𝑎, 𝑏]) = 𝑏 − 𝑎.
Our measure 𝜆 has some fundamental properties, for instance,
∞
(a) 𝜆(∅) = 0,(b) 𝜆(𝐴) ≤ 𝜆(𝐵) if 𝐴 ⊆ 𝐵,(c) and 𝜆(∪𝑘=1 𝐴𝑘 ) = ∑𝑘=1 𝜆(𝐴𝑘 ) if 𝐴𝑖 ∩ 𝐴𝑗 = ∅.
This almost behaves like a probability measure, with one glaring exception: 𝜆(ℝ) = ∞. This is not an accident.
What is the measure of a ﬁnite set {𝑎1 , … , 𝑎𝑛 }? Intuitively, it is zero, and from this example, we’ll conjure up the concept
of sets of zero measure.

Theorem 38.2.1 (Sets of zero measure.)

Let 𝐴 ⊆ ℝ be an arbitrary set. Suppose that for any arbitrarily small 𝜀 > 0, there exists an union of intervals 𝐸 =
∪∞
𝑘=1 (𝑎𝑖 , 𝑏𝑖 ) such that

422 Chapter 39. Densities

Mathematics of Machine Learning

(a) 𝜆(𝐸) < 𝜀,(b) and 𝐴 ⊆ 𝐸,

then 𝜆(𝐴) = 0.

Proof. As 𝐴 ⊆ 𝐸, 𝜆(𝐴) ≤ 𝜆(𝐸) < 𝜀. This means that 𝜆(𝐴) is smaller than any positive real number, thus it must be
zero. □

Let’s see some examples.

Example 1. A set of a single element has zero measure. As any {𝑎} can be covered by the interval (𝑎 − 𝜀, 𝑎 + 𝜀). As
𝜆((𝑎 − 𝜀/2, 𝑎 + 𝜀/2)) = 𝜀, the conditions of Theorem 38.2.1 apply, thus 𝜆({𝑎}) = 0.
Example 2. A ﬁnite set has zero measure. To see this, let 𝐴 = {𝑎1 , … , 𝑎𝑛 } be our ﬁnite set. The system of intervals

𝜀 𝜀
𝐸 = ∪𝑛𝑘=1 (𝑎𝑘 − ,𝑎 + )
2𝑛 𝑘 2𝑛

will do the job, as

𝑛
𝜀 𝜀
𝜆(𝐸) = ∑ 𝜆((𝑎𝑘 − , 𝑎𝑘 + ))
𝑘=1
2𝑛 2𝑛
𝑛
𝜀
=∑
𝑘=1
𝑛
= 𝜀.
Example 3. A countable set has zero measure. For any 𝐴 = {𝑎1 , 𝑎2 , … }, the system of intervals

𝜀 𝜀
𝐸 = ∪∞
𝑘=1 (𝑎𝑘 − ,𝑎 + )
2𝑘+1 𝑘 2𝑘+1

work perfectly, as
∞
𝜀
𝜆(𝐸) ≤ ∑ = 𝜀.
𝑘=1
2𝑘+1

For instance, as the set of integers and rational numbers are both countable, 𝜆(ℤ) = 𝜆(ℚ) = 0.
Overall, sets of zero measure are true to their name: they are small. (They are not necessarily countable though.) Why
are these important? We’ll see this in the next section.

Remark 38.2.1 (Density functions are not unique, take two.)

Do you recall Remark 38.1.1, where we seen that changing the density function of 𝑋 at a single point is also a density for
𝑋?
Turns out that you can actually modify 𝑓𝑋 at an entire set of measure zero. Say,

∗ 𝑓𝑋 (𝑥) if 𝑥 ∉ ℚ,
𝑓𝑋 (𝑥) = {
0 if 𝑥 ∈ ℚ

is still a density function for 𝑋. Unfortunately, we don’t have the tools to show this, as it would require moving beyond
the good old Riemann integral, which is way beyond our scope.

39.2. Classiﬁcation of real-valued random variables 423

Mathematics of Machine Learning

39.2.2 The decomposition of real-valued random variables

The main diﬀerence between a discrete and continuous random variable is the set where they live. Fundamentally, they
are both real-valued random variables, but the range of a discrete variable is a set of measure zero.
Let’s introduce the concept of singular random variables to make this notion precise.

Deﬁnition 38.2.1 (Singular random variables.)

Let (Ω, Σ, 𝑃 ) be a probability space, and 𝑋 ∶ Ω → ℝ be a real-valued random variable. We say that 𝑋 is singular if its
range 𝑋(Ω) = {𝑋(𝜔) ∶ 𝜔 ∈ Ω} ⊆ ℝ is a set of zero measure, that is,

𝜆(𝑋(Ω)) = 0

holds.

All discrete random variables are singular, but not the other way around. For instance, Cantor function is a good example.
Why are singular random variables so special? Because every distribution can be written as the sum of a singular and a
continuous one! Here is the famous Lebesgue decomposition theorem.

Theorem 38.2.2 (Lebesgue’s decomposition theorem)

Let (Ω, Σ, 𝑃 ) be a probability space, and 𝑋 ∶ Ω → ℝ be a real-valued random variable. Then there exists a singular
random variable 𝑋𝑠 and a continuous random variable 𝑋𝑐 such that

𝐹𝑋 = 𝛼𝐹𝑋𝑠 + 𝛽𝐹𝑋𝑐 ,

where 𝛼 + 𝛽 = 1, and 𝐹𝑋 , 𝐹𝑋𝑠 , 𝐹𝑋𝑐 are the corresponding cumulative distribution functions.

We are not going to prove this here, but the gist is this: there are singular random variables, continuous ones, and their
sum.

424 Chapter 39. Densities

CHAPTER

FORTY

THE EXPECTED VALUE

40.1 Discrete random variables

Let’s play a simple game. I toss a coin, and if it comes up heads, you win $1. If it is tails, you lose $2.
Up until now, we were dealing with questions like the probability of winning. Say, if 𝑋 describes your winnings per
round, we have
1
𝑃 (heads) = 𝑃 (tails) = .
2
Despite the equal chances, should you play this game? Let’s ﬁnd out.
After 𝑛 rounds, your earnings can be calculated by the number of heads times $1 minus the number of tails times $2. If
we divide total earnings by 𝑛, we obtain your average winnings per round. That is,
total winnings
your average winnings =
𝑛
1 ⋅ #heads − 2 ⋅ #tails
=
𝑛
#heads #tails
=1⋅ −2⋅ ,
𝑛 𝑛
where #heads and #tails denote the number of heads and tails respectively.
Recall the frequentist interpretation of probability? According to our intuition, we should have
#heads 1
lim = 𝑃 (heads) = ,
𝑛→∞ 𝑛 2
#tails 1
lim = 𝑃 (tails) = .
𝑛→∞ 𝑛 2
This means that if you play long enough, your average winnings per round is
your average winnings = 1 ⋅ 𝑃 (heads) − 2 ⋅ 𝑃 (tails)
1
=− .
2
So, as you are losing half a dollar per round on average, you deﬁnitely shouldn’t play this game.
Let’s formalize this argument with a random variable. Say, if 𝑋 describes your winnings per round, we have
1
𝑃 (𝑋 = 1) = 𝑃 (𝑋 = −2) = ,
2
so the average winnings can be written as

average value of 𝑋 = 1 ⋅ 𝑃 (𝑋 = 1) − 2 ⋅ 𝑃 (𝑋 = −2).

425
Mathematics of Machine Learning

With a bit of a pattern matching, we ﬁnd that for a general discrete random variable 𝑋, the formula looks like

average value of 𝑋 = ∑(value) ⋅ 𝑃 (𝑋 = value).

value

And from this, the deﬁnition of expected value is born.

Deﬁnition 39.1.1 (The expected value of discrete random variables.)

Let (Ω, Σ, 𝑃 ) be a probability space, and 𝑋 ∶ Ω → {𝑥1 , 𝑥2 , … } be a discrete random variable. The expected value of
𝑋 is deﬁned by
∞
𝔼[𝑋] = ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 ).
𝑘=1

In English, the expected value describes the average value of a random variable in the long run. The expected value is
also called the mean and is often denoted by 𝜇.
It’s time for examples.
Example 1. Expected value of the Bernoulli distribution. Let 𝑋 ∼ Bernoulli(𝑝). Its expected value is quite simple to
compute, as

𝔼[𝑋] = 0 ⋅ 𝑃 (𝑋 = 0) + 1 ⋅ 𝑃 (𝑋 = 1) = 𝑝.

We’ve seen this before: the introductory example with the simple game is the transformed Bernoulli distribution 3 ⋅
Bernoulli(1/2) − 2.
Example 2. Expected value of the binomial distribution. Let 𝑋 ∼ Binomial(𝑛, 𝑝). Then
𝑛
𝔼[𝑋] = ∑ 𝑘𝑃 (𝑋 = 𝑘)
𝑘=0
𝑛
𝑛
= ∑ 𝑘( )𝑝𝑘 (1 − 𝑝)𝑛−𝑘
𝑘=0
𝑘
𝑛
𝑛!
= ∑𝑘 𝑝𝑘 (1 − 𝑝)𝑛−𝑘 .
𝑘=0
𝑘!(𝑛 − 𝑘)!

𝑛!
The plan is the following: absorb that 𝑘 with the fraction 𝑘!(𝑛−𝑘)! , and adjust the sum such that its terms form the
probability mass function for Binomial(𝑛 − 1, 𝑝). As 𝑛 − 𝑘 = (𝑛 − 1) − (𝑘 − 1), we have
𝑛
𝑛!
𝔼[𝑋] = ∑ 𝑘 𝑝𝑘 (1 − 𝑝)𝑛−𝑘
𝑘=0
𝑘!(𝑛 − 𝑘)!
𝑛
(𝑛 − 1)!
= 𝑛𝑝 ∑ 𝑝𝑘−1 (1 − 𝑝)(𝑛−1)−(𝑘−1)
𝑘=1
(𝑘 − 1)!((𝑛 − 1) − (𝑘 − 1))!
𝑛−1
(𝑛 − 1)!
= 𝑛𝑝 ∑ 𝑝𝑘 (1 − 𝑝)(𝑛−1−𝑘)
𝑘=0
𝑘!(𝑛 − 1 − 𝑘)!
𝑛−1
= 𝑛𝑝 ∑ 𝑃 (Binomial(𝑛 − 1, 𝑝) = 𝑘)
𝑘=0
= 𝑛𝑝.

This computation might not look like the simplest, but once you get familiar with the trick, that’ll be like second nature
for you.

426 Chapter 40. The expected value

Mathematics of Machine Learning

Example 3. Expected value of the geometric distribution. Let 𝑋 ∼ Geo(𝑝). We need to calculate
∞
𝔼[𝑋] = ∑ 𝑘(1 − 𝑝)𝑘−1 𝑝.
𝑘=1

Do you remember the geometric series (38.1)? This is almost it, except for the 𝑘 term, which throws a monkey wrench
into our gears. To ﬁx that, we’ll use another magic trick. Recall that
∞
1
= ∑ 𝑥𝑘 .
1 − 𝑥 𝑘=0

Now, we are going to diﬀerentiate the geometric series, thus obtaining

𝑑 1 𝑑 ∞ 𝑘
= ∑𝑥
𝑑𝑥 1 − 𝑥 𝑑𝑥 𝑘=0
∞
𝑑 𝑘
=∑ 𝑥
𝑘=0
𝑑𝑥
∞
= ∑ 𝑘𝑥𝑘−1 ,
𝑘=1

where we used the linearity of the derivative and the pleasant analytic properties of the geometric series. Mathematicians
would scream upon the sight of switching the derivative and the sum, but don’t worry, everything here is correct as is.
(It’s just that mathematicians are really afraid of interchanging limits. For a good reason, if I may say so.)
On the other hand,
𝑑 1 1
= ,
𝑑𝑥 1 − 𝑥 (1 − 𝑥)2

thus
∞
1
∑ 𝑘𝑥𝑘−1 = .
𝑘=1
(1 − 𝑥)2

Combining all of these, we ﬁnally have

∞
𝔼[𝑋] = ∑ 𝑘(1 − 𝑝)𝑘−1 𝑝
𝑘=1
∞
= 𝑝 ∑ 𝑘(1 − 𝑝)𝑘−1
𝑘=1
1 1
=𝑝 = .
𝑝2 𝑝
Example 4. Expected value of the constant random variable. Let 𝑐 ∈ ℝ be an arbitrary constant. We can think about
𝑐 as the random variable 𝑐 ∶ Ω → ℝ that assumes the value 𝑐(𝜔) = 𝑐 everywhere. (The abuse comes from denoting a
constant and a function with the same symbol.)
As 𝑐 is a discrete random variable, its expected value is simply

𝔼[𝑐] = 𝑐𝑃 (𝑐 = 𝑐) = 𝑐.

I know, this example looks silly, but it is quite useful, as we shall see this soon. (And many times later.)

40.1. Discrete random variables 427

Mathematics of Machine Learning

40.1.1 The expected value in poker

One more example, before we move on. I was a mediocre no-limit Texas hold’em player a while ago, and the ﬁrst time I
heard about the expected value was years before I studied probability theory.
According to the rules of Texas hold-em, each player holds two cards on their own, while ﬁve more shared cards are dealt.
The shared cards are available for everyone, and the player with the strongest hand wins.
Fig. 40.1 shows how the table looks before the last card (the river) is revealed.

Fig. 40.1: The poker table before the river card

There is money in the pot to be won, but to see the river, you have to call the opponent’s bet.
The question is, should you? Expected value to the rescue.
Let’s build a probabilistic model. We would win the pot with certain river cards but lose with all the others. If 𝑋 represents
our winnings, then
#winning cards
𝑃 (𝑋 = pot) = ,
#remaining cards
#losing cards
𝑃 (𝑋 = −bet) = .
#remaining cards
Thus, the expected value is

𝔼[𝑋] = pot ⋅ 𝑃 (𝑋 = pot) − bet𝑃 (𝑋 = −bet)

#winning cards #losing cards
= pot ⋅ − bet .
#remaining cards #remaining cards

When is the expected value positive? With some algebra, we obtain that 𝔼[𝑋] > 0 if and only if

#winning cards bet

> ,
#losing cards pot

428 Chapter 40. The expected value

Mathematics of Machine Learning

which is called positive pot odds. If this is satisfied, making the bet is the right call. You might lose a hand with positive
pot odds, but in the long term, your winnings will be positive.
Of course, pot odds are extremely hard to determine in practice. For instance, you don’t know what others hold, and
counting the cards that would win the pot for you is not possible unless you have a good read on the opponents. Poker is
much more than just math. Good players choose their bet specifically to throw off their opponents’ pot odds.

40.2 Continuous random variables

So far, we have only defined the expected value for discrete random variables. As 𝔼[𝑋] describes the average value of 𝑋
in the long run, it should exist for continuous random variables as well.
The interpretation of the expected value was simple: outcome times probability, summed over all potential values.
However, there is a snag with continuous random variables: we don’t have such a mass distribution, as the probabilities
of individual outcomes are zero: 𝑃 (𝑋 = 𝑥) = 0. Moreover, we can’t sum uncountably many values.
What can we do?
Wishful thinking. This is one of the most powerful techniques in mathematics, and I am not joking.
Here’s the plan. We’ll pretend that the expected value of a continuous random variable is well-defined, and let our
imagination run free. Say goodbye to mathematical precision, and allow our intuition to unfold.
Instead of the probability of a given outcome, we can talk about 𝑋 landing in a small interval. First, we divide up the set
of real numbers into really small parts. To be more precise, let 𝑥0 < 𝑥1 < ⋯ < 𝑥𝑛 be a granular partition of the real
line. If the partition is refined enough, we should have
𝑛
𝔼[𝑋] ≈ ∑ 𝑥𝑘 𝑃 (𝑥𝑘−1 < 𝑋 ≤ 𝑋𝑘 ). (40.1)
𝑘=1

The probabilities in (40.1) can be expressed in terms of the CDF:

𝑛 𝑛
∑ 𝑥𝑘 𝑃 (𝑥𝑘−1 < 𝑋 ≤ 𝑋𝑘 ) = ∑ 𝑥𝑘 (𝐹𝑋 (𝑥𝑘 ) − 𝐹𝑋 (𝑥𝑘−1 )).
𝑘=1 𝑘=1

These increments remind us of the diﬀerence quotients. We don’t quite have these inside the sum, but by a “fancy
multiplication with one”, we can achieve this:
𝑛 𝑛
𝐹𝑋 (𝑥𝑘 ) − 𝐹𝑋 (𝑥𝑘−1 )
∑ 𝑥𝑘 (𝐹𝑋 (𝑥𝑘 ) − 𝐹𝑋 (𝑥𝑘−1 )) = ∑ 𝑥𝑘 (𝑥𝑘 − 𝑥𝑘−1 ) .
𝑘=1 𝑘=1
𝑥𝑘 − 𝑥𝑘−1

If the 𝑥𝑖 -s are close to each other (and we can select them to be arbitrarily close), the diﬀerence quotients are close to the
derivative of 𝐹𝑋 , which is the density function 𝑓𝑋 . Thus,
𝑛 𝑛
𝐹𝑋 (𝑥𝑘 ) − 𝐹𝑋 (𝑥𝑘−1 ) ′
∑ 𝑥𝑘 (𝑥𝑘 − 𝑥𝑘−1 ) ≈ ∑ 𝑥𝑘 (𝑥𝑘 − 𝑥𝑘−1 )𝑓𝑋 (𝑥𝑘 ).
𝑘=1
𝑥𝑘 − 𝑥𝑘−1 𝑘=1

This is a (26.7)! Hence, the last sum is close to a Riemann-integral:

𝑛 ∞
′
∑ 𝑥𝑘 (𝑥𝑘 − 𝑥𝑘−1 )𝑓𝑋 (𝑥𝑘 ) ≈ ∫ 𝑥𝑓𝑋 (𝑥)𝑑𝑥.
𝑘=1 −∞

Although we were not exactly precise in our argument, all of the above can be made mathematically correct. (But we are
not going to do it here, as it is not relevant to us.) Thus, we ﬁnally obtain the formula of the expected value for continuous
random variables.

40.2. Continuous random variables 429

Mathematics of Machine Learning

Deﬁnition 39.2.1 (The expected value of continuous random variables.)

Let (Ω, Σ, 𝑃 ) be a probability space, and 𝑋 ∶ Ω → ℝ be a continuous random variable. The expected value of 𝑋 is
deﬁned by
∞
𝔼[𝑋] = ∫ 𝑥𝑓𝑋 (𝑥)𝑑𝑥.
−∞

As usual, let’s see some examples ﬁrst.

Example 1. Expected value of the uniform distribution. Let 𝑋 ∼ Uniform(𝑎, 𝑏). Then
∞
1
𝔼[𝑋] = ∫ 𝑥 𝑓 (𝑥)𝑑𝑥
−∞ 𝑏−𝑎 𝑋
𝑏
= ∫ 𝑥𝑑𝑥
𝑎
𝑥=𝑏
1
=[ 𝑥2 ]
2(𝑏 − 𝑎) 𝑥=𝑎
𝑎+𝑏
= ,
2
which is the midpoint of the interval [𝑎, 𝑏], where the Uniform(𝑎, 𝑏) lives.
Example 2. Expected value of the exponential distribution. Let 𝑋 ∼ exp(𝜆). Then, we need to calculate the integral
∞
𝔼[𝑋] = ∫ 𝑥𝜆𝑒−𝜆𝑥 𝑑𝑥.
0

We can do this via partial integration: by letting 𝑓(𝑥) = 𝑥 and 𝑔′ (𝑥) = 𝑒−𝜆𝑥 , we have
∞
𝔼[𝑋] = ∫ 𝑥𝜆𝑒−𝜆𝑥 𝑑𝑥
0
∞
𝑥=∞
= [ − 𝑥𝑒−𝜆𝑥 ]𝑥=0 + ∫ 𝑒−𝜆𝑥 𝑑𝑥
⏟⏟⏟⏟⏟⏟⏟ 0
=0
𝑥=∞
1 −𝜆𝑥
=[− 𝑒 ]
𝜆 𝑥=0
1
= .
𝜆

40.3 Properties of the expected value

As usual, the expected value has several useful properties. Most importantly, the expected value is linear with respect to
the random variable.

Theorem 39.3.1 (Linearity of the expected value.)

Let (Ω, Σ, 𝑃 ) be a probability space, and let 𝑋, 𝑌 ∶ Ω → ℝ be two random variables. Moreover, let 𝑎, 𝑏 ∈ ℝ be two
scalars. Then

𝔼[𝑎𝑋 + 𝑏𝑌 ] = 𝑎𝔼[𝑋] + 𝑏𝔼[𝑌 ]

430 Chapter 40. The expected value

Mathematics of Machine Learning

holds.

We are not going to prove this theorem here, but know that linearity is an essential tool. Do you recall the game that we
used to introduce the expected value for discrete random variables? I toss a coin, and if it comes up heads, you win $1.
Tails, you lose $2. If you think about it for a minute, this is the

𝑋 = 3 ⋅ Bernoulli(1/2) − 2

distribution, and as such,

𝔼[𝑋] = 𝔼[3 ⋅ Bernoulli(1/2) − 2]
= 3𝔼[Bernoulli(1/2)] − 2
1
=3⋅ −2
2
1
=− .
2
Of course, linearity goes way beyond this simple example. As you’ve gotten used to this already, linearity is a crucial
property in mathematics. We love linearity.

Remark 39.3.1
Notice that Theorem 39.3.1 did not say that 𝑋 and 𝑌 have to be both discrete or both continuous. Even though we have
only defined the expected value in such cases, there is a general definition that works for all random variables.
The snag is, it requires a familiarity with measure theory, falling way outside of our scope. Suffice to say, the theorem
works as is.

If the expected value of a sum is the sum of the expected values, does the same apply to the product? Not in general, but
fortunately, this works for independent random variables.

Theorem 39.3.2 (Expected value of the product of independent random variables.)

Let (Ω, Σ, 𝑃 ) be a probability space, and let 𝑋, 𝑌 ∶ Ω → ℝ be two independent random variables.
Then

𝔼[𝑋𝑌 ] = 𝔼[𝑋]𝔼[𝑌 ]

holds.

This property is extremely useful, as we’ll see in the next section, where we’ll talk about the variance and the covariance.
One more property that’ll help us to calculate the expected value of functions of the random variable, such as 𝑋 2 or sin 𝑋.

Theorem 39.3.3 (Law of the unconscious statistician.)

Let (Ω, Σ, 𝑃 ) be a probability space, let 𝑋Ω → ℝ be a random variable, and let 𝑔 ∶ ℝ → ℝ be an arbitrary function.
(a) If 𝑋 is discrete with possible values 𝑥1 , 𝑥2 , …, then
∞
𝔼[𝑔(𝑋)] = ∑ 𝑔(𝑥𝑛 )𝑃 (𝑋 = 𝑥𝑛 ).
𝑛=1

40.3. Properties of the expected value 431

Mathematics of Machine Learning

(b) If 𝑋 is continuous with the probability density function 𝑓𝑋 (𝑥), then

∞
𝔼[𝑔(𝑋)] = ∫ 𝑔(𝑥)𝑓𝑋 (𝑥)𝑑𝑥.
−∞

Thus, calculating 𝔼[𝑋 2 ] for a continuous random variable can be done by simply taking
∞
𝔼[𝑋 2 ] = ∫ 𝑥2 𝑓𝑋 (𝑥)𝑑𝑥,
−∞

which will be used all the time.

40.4 Variance

Plainly speaking, the expected value measures the average value of the random variable. However, even though both
Uniform(−1, 1) and Uniform(−100, 100) have zero expected value, the latter is much more spread out than the former.
Thus, 𝔼[𝑋] is not a good descriptor of the random variable 𝑋.
To add one more layer, we measure the average deviation from the expected value. These are done via the variance and
the standard deviation.

Deﬁnition 39.4.1 (Variance and standard deviation.)

Let (Ω, Σ, 𝑃 ) be a probability space, let 𝑋 ∶ Ω → ℝ be a random variable, and let 𝜇 = 𝔼[𝑋] be is its expected value.
The variance of 𝑋 is deﬁned by

Var[𝑋] = 𝔼[(𝑋 − 𝜇)2 ],

while its standard deviation is deﬁned by

Std[𝑋] = √Var[𝑋].

Take note that in the literature, the expected value is often denoted by 𝜇, while the standard deviation is by 𝜎. Together,
they form two of the most important descriptors of a random variable.
Fig. 40.2 shows a visual interpretation of the mean and standard deviation in the case of a normal distribution. The mean
shows the average value, while the standard deviation can be interpreted as the average deviation from the mean. (We’ll
talk about the normal distribution in great detail later, so don’t worry if it is not yet familiar to you.)
The usual method of calculating variance is not taking the expected value of (𝑋 − 𝜇)2 , but taking the expected value of
𝑋 2 and subtracting 𝜇2 from it. This is shown by the following proposition.

Proposition 39.4.1
Let (Ω, Σ, 𝑃 ) be a probability space, and let 𝑋 ∶ Ω → ℝ be a random variable.
Then

Var[𝑋] = 𝔼[𝑋 2 ] − 𝔼[𝑋]2 .

432 Chapter 40. The expected value

Mathematics of Machine Learning

Fig. 40.2: Mean (𝜇) and standard deviation (𝜎) of the standard normal distribution.

Proof. Let 𝜇 = 𝔼[𝑋]. Because of the linearity of the expected value, we have

Var[𝑋] = 𝔼[(𝑋 − 𝜇)2 ]

= 𝔼[𝑋 2 − 2𝜇𝑋 + 𝜇2 ]
= 𝔼[𝑋]2 − 2𝜇𝔼[𝑋] + 𝜇2
= 𝔼[𝑋]2 − 2𝜇+ 𝜇2
= 𝔼[𝑋 2 ] − 𝜇2 ,

which is what we had to show. □.

Is the variance linear as well? No, but there are some important identities regarding scalar multiplication and addition.

Theorem 39.4.1 (Variance and the linear operations.)

Let (Ω, Σ, 𝑃 ) be a probability space, and let 𝑋 ∶ Ω → ℝ be a random variable.
(a) Let 𝑎 ∈ ℝ be an arbitrary constant. Then

Var[𝑎𝑋] = 𝑎2 Var[𝑋].

(b) Let 𝑌 ∶ Ω → ℝ be a random variable that is independent from 𝑋. Then

Var[𝑋 + 𝑌 ] = Var[𝑋] + Var[𝑌 ].

40.4. Variance 433

Mathematics of Machine Learning

Proof. (a) Let 𝜇𝑋 = 𝔼[𝑋]. Then we have

Var[𝑎𝑋] = 𝔼[(𝑎𝑋 − 𝑎𝜇𝑋 )2 ]

= 𝔼[𝑎2 (𝑋 − 𝜇𝑋 )2 ]
= 𝑎2 𝔼[(𝑋 − 𝜇𝑋 )2 ]
= 𝑎2 Var[𝑋],

which is what we had to show.

(b) Let 𝜇𝑌 = 𝔼[𝑌 ]. Then, due to the linearity of the expected value, we have

Var[𝑋 + 𝑌 ] = 𝔼[(𝑋 + 𝑌 − (𝜇𝑋 + 𝜇𝑌 ))2 ]

= 𝔼[((𝑋 − 𝜇𝑋 ) + (𝑌 − 𝜇𝑌 ))2 ]
= 𝔼[(𝑋 − 𝜇𝑋 )2 ] + 2𝔼[(𝑋 − 𝜇𝑋 )(𝑌 − 𝜇𝑌 )] + 𝔼[(𝑌 − 𝜇𝑌 )2 ].

Now, as 𝑋 and 𝑌 are independent, 𝔼[𝑋𝑌 ] = 𝔼[𝑋]𝔼[𝑌 ]. Thus, due to the linearity of the expected value,

𝔼[(𝑋 − 𝜇𝑋 )(𝑌 − 𝜇𝑌 )] = 𝔼[𝑋𝑌 − 𝑋𝜇𝑌 − 𝜇𝑋 𝑌 + 𝜇𝑋 𝜇𝑌 ]

= 𝔼[𝑋𝑌 ] − 𝔼[𝑋𝜇𝑌 ] − 𝔼[𝜇𝑋 𝑌 ] + 𝜇𝑋 𝜇𝑌
= 𝔼[𝑋]𝔼[𝑌 ] − 𝔼[𝑋]𝜇𝑌 − 𝜇𝑋 𝔼[𝑌 ] + 𝜇𝑋 𝜇𝑌
= 𝜇𝑋 𝜇𝑌 − 𝜇𝑋 𝜇𝑌 − 𝜇𝑋 𝜇𝑌 + 𝜇𝑋 𝜇𝑌
= 0.

Thus, continuing the ﬁrst calculation,

Var[𝑋 + 𝑌 ] = 𝔼[(𝑋 − 𝜇𝑋 )2 ] + 2𝔼[(𝑋 − 𝜇𝑋 )(𝑌 − 𝜇𝑌 )] + 𝔼[(𝑌 − 𝜇𝑌 )2 ]

= 𝔼[(𝑋 − 𝜇𝑋 )2 ] + 𝔼[(𝑌 − 𝜇𝑌 )2 ],

which is what we had to show. □

40.5 Covariance and correlation

Expected value and variance measure a random variable in isolation. However, in real problems, we need to discover
relations between separate measurements. Say, 𝑋 describes the price of a given real estate, while 𝑌 measures its size.
These are certainly related, but one does not determine the other. For instance, the location might be a diﬀerentiator
between the prices.
The simplest statistical way of measuring similarity is the covariance and correlation.

Deﬁnition 39.5.1 (Covariance and correlation.)

Let (Ω, Σ, 𝑃 ) be a probability space, let 𝑋, 𝑌 ∶ Ω → ℝ be two random variables, and let 𝜇𝑋 = 𝔼[𝑋], 𝜇𝑌 = 𝔼[𝑌 ] be
their expected values and 𝜎𝑋 = Std[𝑋], 𝜎𝑌 = Std[𝑌 ] their standard deviations.
(a) The covariance of 𝑋 and 𝑌 is deﬁned by

Cov[𝑋, 𝑌 ] = 𝔼[(𝑋 − 𝜇𝑋 )(𝑌 − 𝜇𝑌 )].

(b) The correlation of 𝑋 and 𝑌 is deﬁned by

Cov[𝑋, 𝑌 ]
Corr[𝑋, 𝑌 ] = .
𝜎𝑋 𝜎𝑌

434 Chapter 40. The expected value

Mathematics of Machine Learning

Similarly to variance, the deﬁnition of covariance can be simpliﬁed to provide an easier way of calculating its exact value.

Proposition 39.5.1
Let (Ω, Σ, 𝑃 ) be a probability space, let 𝑋, 𝑌 ∶ Ω → ℝ be two random variables, , and let 𝜇𝑋 = 𝔼[𝑋], 𝜇𝑌 = 𝔼[𝑌 ] be
their expected values.
Then

Cov[𝑋, 𝑌 ] = 𝔼[𝑋𝑌 ] − 𝜇𝑋 𝜇𝑌 .

Proof. This is just a simple calculation. According to the deﬁnition, we have

Cov[𝑋, 𝑌 ] = 𝔼[(𝑋 − 𝜇𝑋 )(𝑌 − 𝜇𝑌 )]

= 𝔼[𝑋𝑌 − 𝑋𝜇𝑌 − 𝜇𝑋 𝑌 + 𝜇𝑋 𝜇𝑌 ]
= 𝔼[𝑋𝑌 ] − 𝔼[𝑋𝜇𝑌 ] − 𝔼[𝜇𝑋 𝑌 ] + 𝜇𝑋 𝜇𝑌 ]
= 𝔼[𝑋𝑌 ] − 𝔼[𝑋]𝜇𝑌 − 𝜇𝑋 𝔼[𝑌 ] + 𝜇𝑋 𝜇𝑌
= 𝔼[𝑋𝑌 ] − 𝜇𝑋 𝜇𝑌 − 𝜇𝑋 𝜇𝑌 + 𝜇𝑋 𝜇𝑌
= 𝔼[𝑋𝑌 ] − 𝜇𝑋 𝜇𝑌 ,

which is what we had to show. □

One of the most important properties of covariance and correlation is that they are zero for independent random variables.

Theorem 39.5.1
Let (Ω, Σ, 𝑃 ) be a probability space, and let 𝑋, 𝑌 ∶ Ω → ℝ be two independent random variables.
Then Cov[𝑋, 𝑌 ] = 0. (And consequently, Corr[𝑋, 𝑌 ] = 0 as well.)

The proof follows straight from the deﬁnition and Theorem 39.3.2, so this is left as an exercise for you.
Take note, as this is extra important: independence implies zero covariance, but zero covariance does not imply indepen-
dence. Here is an example.
Let 𝑋 be a discrete random variable with the probability mass function
1
𝑃 (𝑋 = −1) = 𝑃 (𝑋 = 0) = 𝑃 (𝑋 = 1) = ,
3
and let 𝑌 = 𝑋 2 . The expected value of 𝑋 is

𝔼[𝑋] = (−1) ⋅ 𝑃 (𝑋 = −1) + 0 ⋅ 𝑃 (𝑋 = 0) + 1 ⋅ 𝑃 (𝑋 = 1)

1 1
=− +0+
3 3
= 0,

40.5. Covariance and correlation 435

Mathematics of Machine Learning

while the law of the unconscious statistician gives that

𝔼[𝑌 ] = 𝔼[𝑋 2 ] = 1 ⋅ 𝑃 (𝑋 = −1) + 0 ⋅ 𝑃 (𝑋 = 0) + 1 ⋅ 𝑃 (𝑋 = 1)

1 1
= +0+
3 3
2
= ,
3
and

𝔼[𝑋𝑌 ] = 𝔼[𝑋 3 ] = 0.

Thus,
Cov[𝑋, 𝑌 ] = 𝔼[𝑋𝑌 ] − 𝔼[𝑋]𝔼[𝑌 ]
= 𝔼[𝑋 3 ] − 𝔼[𝑋]𝔼[𝑋 2 ]
2
=0−0⋅
3
= 0.
However, 𝑋 and 𝑌 are not independent, as 𝑌 = 𝑋 2 is a function of 𝑋.
(I shamelessly stole this example from a brilliant Stack Overﬂow thread, which you should read for more on this question.)

40.6 Problems

Problem 1. Let 𝑋, 𝑌 ∶ Ω → ℝ be two random variables.

(a) Show that if 𝑋 ≥ 0, then 𝔼[𝑋] ≥ 0.
(b) Show that if 𝑋 ≥ 𝑌 , then 𝔼[𝑋] ≥ 𝔼[𝑌 ].
Problem 2. Let 𝑋 ∶ Ω → ℝ be a random variable. Show that if Var[𝑋] = 0, then 𝑋 assumes only a single value. (That
is, the set 𝑋(Ω) = {𝑋(𝜔) ∶ 𝜔 ∈ Ω} has only a single element.)

436 Chapter 40. The expected value

CHAPTER

FORTYONE

THE LAW OF LARGE NUMBERS

We’ll continue our journey with a quite remarkable and famous result: the law of large numbers. You have probably
already heard several faulty arguments invoking the law of large numbers. For instance, gamblers are often convinced
that their bad luck will end soon because of the said law. This is one of the most frequently misused mathematical terms,
and we are here to clean that up.
We’ll do this in two passes. First, we are going to see an intuitive interpretation, then add the technical but important
mathematical details. I’ll try to be gentle.

41.1 Tossing coins…

First, let’s toss some coins again. If we toss coins repeatedly, what is the relative frequency of heads on the long run?
We should have a pretty good guess already: the average number of heads should converge to 𝑃 (heads) = 𝑝 as well.
Why? Because we have seen this when studying the frequentist interpretation of probability.
Our simulation showed that the relative frequency of heads does indeed converge to the true probability. This time, we’ll
carry the simulation a bit further.
First, to formulate the problem, let’s introduce the independent random variables 𝑋1 , 𝑋2 , … that are distributed along
Bernoulli(𝑝), where 𝑋𝑖 = 0 if the toss results in tails, while 𝑋𝑖 = 1 if it is heads. We are interested in the long-term
behavior of
𝑋 + ⋯ + 𝑋𝑛
𝑋̄ 𝑛 = 1 .
𝑛
𝑋̄ 𝑛 is called the sample average. We have already seen that the sample average gets closer and closer to 𝑝 as 𝑛 grows.
Let’s see the simulation one more time, before we go any further. (The parameter 𝑝 is selected to be 1/2 for the sake of
the example.)

import numpy as np
from scipy.stats import bernoulli

n_tosses = 1000
idx = range(n_tosses)

coin_tosses = [bernoulli.rvs(p=0.5) for _ in idx]

coin_toss_averages = [np.mean(coin_tosses[:k+1]) for k in idx]

And here is the plot.

437
Mathematics of Machine Learning

import matplotlib.pyplot as plt

with plt.style.context("seaborn-white"):
plt.figure(figsize=(16, 8))
plt.title("Relative frequency of the coin tosses")
plt.xlabel("Relative frequency")
plt.ylabel("Number of tosses")

# plotting the averages

plt.plot(range(n_tosses), coin_toss_averages, linewidth=3) # the averages

# plotting the true expected value

plt.plot([-100, n_tosses+100], [0.5, 0.5], c="k")
plt.xlim(-10, n_tosses+10)
plt.ylim(0, 1)

Nothing new so far. However, if you have a sharp eye, you might ask the question: is this just an accident? After all, we
are studying the average
𝑋 + ⋯ + 𝑋𝑛
𝑋̄ 𝑛 = 1 ,
𝑛
which is (almost) a binomially distributed random variable! To be more precise, if 𝑋𝑖 ∼ Bernoulli(𝑝), then
1
𝑋̄ 𝑛 ∼ Binomial(𝑛, 𝑝).
𝑛
(We have seen this earlier when discussing the sums of discrete random variables.)
At this point, it is far from guaranteed that this distribution will be concentrated around a single value. So, let’s do some
more simulations. This time, we’ll toss a coin a thousand times thousand times to see the distribution of the averages.
Quite meta, I know.

more_coin_tosses = bernoulli.rvs(p=0.5, size=(n_tosses, n_tosses))

more_coin_toss_averages = np.array([[np.mean(more_coin_tosses[i][:j+1]) for j in idx]␣
↪for i in idx])

438 Chapter 41. The Law of Large Numbers

Mathematics of Machine Learning

We can visualize the distributions on histograms.

with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
fig.suptitle("The distribution of sample averages")
for ax, i in zip(axs, [5, 100, 999]):
x = [k/i for k in range(i+1)]
y = more_coin_toss_averages[:, i]
ax.hist(y, bins=x)
ax.set_title(f"n = {i}")

In other words, the probability of 𝑋𝑛 falling far from 𝑝 becomes smaller and smaller. For any small 𝜀, we can formulate
the probability of “𝑋̄ 𝑛 falls farther from 𝑝 than 𝜀” as 𝑃 (|𝑋̄ 𝑛 − 𝑝| > 𝜀).
Thus, mathematically speaking, our guess is that the limit

lim 𝑃 (|𝑋̄ 𝑛 − 𝑝| > 𝜀) = 0

𝑛→∞

holds.
Again, is this just an accident, and were we just lucky to study an experiment where this is true? Would the same work
for random variables other than Bernoulli ones? What will the sample averages converge to? (If they converge at all.)
We’ll ﬁnd out.

41.2 …rolling dices…

Let’s play dice. To keep things simple, we are interested in the average value of a roll on the long run. To build a proper
probabilistic model, let’s introduce random variables!
A single roll is uniformly distributed on {1, 2, … , 6}, and each roll is independent from the others. So, let 𝑋1 , 𝑋2 , … be
independent random variables, each distributed according to Uniform({1, 2, … , 6}).
How does the sample average 𝑋̄ 𝑛 behave? Simulation time. We’ll randomly generate 1000 rolls, then explore how 𝑋̄ 𝑛
behaves.

from scipy.stats import randint

n_rolls = 1000
(continues on next page)

41.2. …rolling dices… 439

Mathematics of Machine Learning

(continued from previous page)

idx = range(n_rolls)

dice_rolls = [randint.rvs(low=1, high=7) for _ in idx]

dice_roll_averages = [np.mean(dice_rolls[:k+1]) for k in idx]

Again, to obtain a bit of an insight, we’ll visualize the averages on a plot.

with plt.style.context("seaborn-white"):
plt.figure(figsize=(16, 8))
plt.title("Sample averages of rolling a six-sided dice")

# plotting the averages

plt.plot(idx, dice_roll_averages, linewidth=3) # the averages

# plotting the true expected value

plt.plot([-100, n_rolls+100], [3.5, 3.5], c="k")

plt.xlim(-10, n_rolls+10)
plt.ylim(0, 6)

The ﬁrst thing to note is that these are suspiciously close to 3.5. This is not a probability, but the expected value:

𝔼[𝑋1 ] = 𝔼[𝑋2 ] = ⋯ = 3.5.

For Bernoulli(𝑝) distributed random variables, the expected value coincides with the probability 𝑝. However, this time,
𝑋̄ 𝑛 does not have a nice and explicit distribution like in the case of coin tosses, where the sample averages were binomially
distributed. So, let’s roll some more dices to estimate how 𝑋̄ 𝑛 is distributed.

more_dice_rolls = randint.rvs(low=1, high=7, size=(n_rolls, n_rolls))

more_dice_roll_averages = np.array([[np.mean(more_dice_rolls[i][:j+1]) for j in idx]␣
↪for i in idx])

440 Chapter 41. The Law of Large Numbers

Mathematics of Machine Learning

with plt.style.context("seaborn-white"):
fig, axs = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
fig.suptitle("The distribution of sample averages")
for ax, i in zip(axs, [5, 100, 999]):
x = [6*k/i for k in range(i+1)]
y = more_dice_roll_averages[:, i]
ax.hist(y, bins=x)
ax.set_title(f"n = {i}")

It seems like that once more, the distribution of 𝑋̄ 𝑛 is concentrated around 𝔼[𝑋1 ]. Our intuition tells us that this is not
an accident; that this phenomenon is true for a wide range of random variables.
Let me spoil the surprise: this is indeed the case, and we’ll see this now.

41.3 …and all the rest

This time, let 𝑋1 , 𝑋2 , … be a sequence of independent and identically distributed random variables. Not coin tosses, not
dice rolls, but any distribution. We saw that the sample average 𝑋̄ 𝑛 seem to converge to the joint expected value of the
𝑋𝑖 -s:

"𝑋̄ 𝑛 → 𝔼[𝑋1 ]".

Note the quotation marks: 𝑋̄ 𝑛 is not a number, but a random variable. Thus, we can’t (yet) speak about their convergence.
In mathematically precise terms, what we saw previously is that for large enough 𝑛-s, the sample average 𝑋̄ 𝑛 is highly
unlikely to fall far from the joint expected value 𝜇 = 𝔼[𝑋1 ]; that is,
lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| > 𝜀) = 0 (41.1)
𝑛→∞

holds for all 𝜀 > 0.

The limit (41.1) seems hard to prove right now even in the simple case of coin tossing. There, 𝑋̄ 𝑛 ∼ 1
𝑛 Binomial(𝑛, 𝑝),
thus
⌊𝑛(𝑝+𝜀)⌋
𝑛
𝑃 (|𝑋̄ 𝑛 − 𝜇| > 𝜀) = 1 − ∑ ( )𝑝𝑘 (1 − 𝑝)𝑛−𝑘 ,
𝑘=⌊𝑛(𝑝−𝜀)⌋
𝑘

which does not look friendly at all. (The symbol ⌊𝑥⌋ denotes the largest integer that is smaller than 𝑥.)
Thus, our plan is the following.

41.3. …and all the rest 441

Mathematics of Machine Learning

1. Find a way to estimate 𝑃 (|𝑋̄ 𝑛 − 𝜇| > 𝜀) in a way that is independent from the distribution of the 𝑋𝑖 -s.
2. Use the upper estimate to show lim𝑛→∞ 𝑃 (|𝑋̄ 𝑛 − 𝜇| > 𝜀) = 0.
Let’s go.

41.3.1 The Markov and Chebyshev inequalities

First, the upper estimates. There are two general inequalities that’ll help us to deal with 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀)

Theorem 40.3.1 (Markov’s inequality.)

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝑋 → [0, ∞) be a nonnegative random variable. Then

𝔼[𝑋]
𝑃 (𝑋 ≥ 𝑡) ≤
𝑡
holds for any 𝑡 ∈ (0, ∞).

Proof. We have to separate the discrete and the continuous cases. The proofs are almost identical, so I’ll only do the
discrete case here, while the continuous is left for you as an exercise to test your understanding.
So, let 𝑋 ∶ Ω → {𝑥1 , 𝑥2 , … } be a discrete random variable (where 𝑥𝑘 ≥ 0 for all 𝑘), and 𝑡 ∈ (0, ∞) be an arbitrary
positive real number. Then
∞
𝔼[𝑋] = ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 )
𝑘=1

= ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 ) + ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 ),
𝑘∶𝑥𝑘 <𝑡 𝑘∶𝑥𝑘 ≥𝑡

where the sum ∑𝑘∶𝑥 only accounts for 𝑘-s with 𝑥𝑘 < 𝑡, and similarly, ∑𝑘∶𝑥 only accounts for 𝑘-s with 𝑥𝑘 ≥ 𝑡.
𝑘 <𝑡 𝑘 ≥𝑡

As the 𝑥𝑘 -s are nonnegative by assumption, we can estimate 𝔼[𝑋] from below by omitting one of them. Thus,

𝔼[𝑋] = ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 ) + ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 )
𝑘∶𝑥𝑘 <𝑡 𝑘∶𝑥𝑘 ≥𝑡

≥ ∑ 𝑥𝑘 𝑃 (𝑋 = 𝑥𝑘 )
𝑘∶𝑥𝑘 ≥𝑡

≥ 𝑡 ∑ 𝑃 (𝑋 = 𝑥𝑘 )
𝑘∶𝑥𝑘 ≥𝑡

= 𝑡𝑃 (𝑋 ≥ 𝑡),

from which Markov’s inequality

𝔼[𝑋]
𝑃 (𝑋 ≥ 𝑡) ≤
𝑡
follows. □

The law of large numbers is only one step away from Markov’s inequality. This last step is so useful that it deserves to be
its own theorem. Meet the famous inequality of Chebyshev.

Theorem 40.3.2 (Chebyshev’s inequality.)

442 Chapter 41. The Law of Large Numbers

Mathematics of Machine Learning

Let (Ω, Σ, 𝑃 ) be a probability space and let 𝑋 → ℝ be a nonnegative random variable with ﬁnite variance 𝜎2 = Var[𝑋] <
∞ and expected value 𝔼[𝑋] = 𝜇.
Then
𝜎2
𝑃 (|𝑋 − 𝜇| ≥ 𝑡) ≤
𝑡2
holds for all 𝑡 ∈ (0, ∞).

Proof. As |𝑋 − 𝜇| is a nonnegative random variable, we can apply Markov's inequality to obtain

𝑃 (|𝑋 − 𝜇| ≥ 𝑡) = 𝑃 (|𝑋 − 𝜇|2 ≥ 𝑡2 )

𝔼[|𝑋 − 𝜇|2 ]
≤ .
𝑡2
However, as 𝔼[|𝑋 − 𝜇|2 ] = Var[𝑋] = 𝜎2 , we have

𝔼[|𝑋 − 𝜇|2 ]
𝑃 (|𝑋 − 𝜇| ≥ 𝑡) ≤
𝑡2
𝜎2
= 2
𝑡
which is what we had to show. □

And with that, we are ready to precisely formulate and prove the law of large numbers.

41.3.2 The weak law of large numbers

After all this setup, the (weak) law of large numbers is just a small step away. Here it is in its full glory.

Theorem 40.3.3 (The weak law of large numbers)

Let 𝑋1 , 𝑋2 , … be a sequence of independent and identically distributed random variables with ﬁnite expected value
𝜇 = 𝔼[𝑋1 ] and variance 𝜎2 = Var[𝑋1 ], and let

𝑋 + ⋯ + 𝑋𝑛
𝑋̄ 𝑛 = 1
𝑛
be their sample average. Then

lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀) = 0
𝑛→∞

holds for any 𝜀 > 0.

41.3. …and all the rest 443

Mathematics of Machine Learning

Proof. As the 𝑋𝑖 -s are independent, the variance of the sample average is

𝑋 + ⋯ + 𝑋𝑛
Var[𝑋̄ 𝑛 ] = Var[ 1 ]
𝑛
1
= 2 Var[𝑋1 + ⋯ + 𝑋𝑛 ]
𝑛
1
= 2 (Var[𝑋1 ] + ⋯ + Var[𝑋𝑛 ])
𝑛
𝑛𝜎2
= 2
𝑛
𝜎2
= .
𝑛
Now, by using Chebyshev's inequality, we obtain

Var[𝑋̄ 𝑛 ]
𝑃 (|𝑋𝑛 − 𝜇| ≥ 𝜀) ≤
𝜀2
2
𝜎
= 2.
𝑛𝜀
Thus,
0 ≤ lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀)
𝑛→∞
𝜎2
≤ lim = 0,
𝑛→∞ 𝑛𝜀2

hence

lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀) = 0,
𝑛→∞

which is what we needed to show. □

Theorem 40.3.3 is not all that can be said about the sample averages. There is a stronger result, showing that the sample
averages are in fact converge to the mean with probability 1.

41.4 The strong law of large numbers

Why is Theorem 40.3.3 called the “weak” law? Think about the statement

lim 𝑃 (|𝑋̄ 𝑛 − 𝜇| ≥ 𝜀) = 0 (41.2)

𝑛→∞

for a moment. For a given 𝜔 ∈ Ω, this doesn’t tell us anything about the convergence of a concrete sample average
𝑋 (𝜔) + ⋯ + 𝑋𝑛 (𝜔)
𝑋̄ 𝑛 (𝜔) = 1 ,
𝑛
it just tells us that in a probabilistic sense, 𝑋̄ 𝑛 is concentrated around the joint expected value 𝜇. In a sense, (41.2) is a
weaker version of

𝑃 ( lim 𝑋̄ 𝑛 = 𝜇) = 1,
𝑛→∞

hence the terminology weak law of large numbers. Do we have a stronger result than Theorem 40.3.3? Yes, we do.

Theorem 40.4.1 (The strong law of large numbers.)

444 Chapter 41. The Law of Large Numbers

Mathematics of Machine Learning

Let 𝑋1 , 𝑋2 , … be a sequence of independent and identically distributed random variables with ﬁnite expected value
𝜇 = 𝔼[𝑋1 ] and variance 𝜎2 = Var[𝑋1 ], and let

𝑋 + ⋯ + 𝑋𝑛
𝑋̄ 𝑛 = 1
𝑛
be their sample average. Then

𝑃 ( lim 𝑋̄ 𝑛 = 𝜇) = 1.
𝑛→∞

We are not going to prove this, just know that the sample average will converge to the mean with probability one.

Remark 40.4.1 (Convergence of random variables.)

What we have seen in the weak and strong laws of large numbers are not unique to sample averages. Similar phenomena
can be observed in other cases, thus, these types of convergences have their own exact deﬁnitions.
If 𝑋1 , 𝑋2 , … is a sequence of random variables, we say that
(a) 𝑋𝑛 converges in probability towards 𝑋 if

lim 𝑃 (|𝑋𝑛 − 𝑋| ≥ 𝜀) = 0
𝑛→∞

𝑃
for all 𝜀 > 0. Convergence in probability is denoted by 𝑋𝑛 −
→ 𝑋.
(b) 𝑋𝑛 converges almost surely towards 𝑋 if

𝑃 ( lim 𝑋𝑛 = 𝑋) = 1
𝑛→∞

a. s.
holds. Almost sure convergence is denoted by 𝑋𝑛 −−→ 𝑋.
Thus, the weak and strong laws of large numbers state that in certain cases, the sample averages converge to the expected
value both in probability and almost surely.

41.4. The strong law of large numbers 445

Mathematics of Machine Learning

446 Chapter 41. The Law of Large Numbers

Part VI

Statistics

447
Part VII

Classical machine learning

449
Part VIII

Neural networks

451
Part IX

Advanced optimization

453
Part X

Convolutional networks

455
Part XI

Appendix

457
CHAPTER

FORTYTWO

IT’S JUST LOGIC

The rules of logic are to mathematics what those of structure are to architecture. — Bertrand Russell
“Mathematics is a language”, one of my professors used to say all the time. “Learning mathematics starts with building up
a basic vocabulary.”
What he forgot to add is that mathematics is the language of thinking. I often get the question: do you need to know
mathematics to be a software engineer/data scientist/random technical professional? My answer is simple. If you regularly
have to solve problems in your profession, then mathematics is extremely beneficial for you. You don’t have to think
effectively, but you are better off.
The learning curve of mathematics is steep. You have experienced it yourself, and the difficulty may have deterred you
from reaching a familiarity with its fundamentals. I have good news for you: if we treat learning mathematics as learning
a foreign language, we can start by building up a basic vocabulary instead of diving into poems and novels. Like as my
professor suggested.

42.1 Mathematical logic 101

Logic and clear thinking lie at the very foundations of mathematics. But what are those? How would you explain what
“logic” is?
Our thinking processes are formalized by the ﬁeld of mathematical logic. In logic, we work with propositions; that is,
statements that are either true or false. “It is raining outside.” “The sidewalk is wet.” These are both valid propositions.
To be able to reason about propositions eﬀectively, we often denote them with roman capital letters, such as

𝐴 = it is raining outside,
𝐵 = the sidewalk is wet.

Each proposition has a corresponding truth value, which is either true or false. These are often abbreviated as 1 and 0.
Although this seems like no big deal, ﬁnding the truth value can be extremely hard. Think about the proposition

𝐴 = if the solutions of an algorithm can be

veriﬁed in polynomial time, it can also be
solved in polynomial time.

This is the famous P = NP conjecture, one of the longest-standing unsolved problems in mathematics. The statement is
easy to understand, but solving the problem (that is, finding the truth value of the corresponding proposition) has eluded
even the smartest minds.
In essence, the entire body of our scientific knowledge lies in propositions whose truth values we have identified. So, how
do we do that in practice?

459
Mathematics of Machine Learning

42.2 Logical connectives

In themselves, propositions are not enough to provide an effective framework for reasoning. Mathematics (and the entire
science) is the collection of complex propositions formulated from smaller building blocks with logical connectives. Each
connective takes one or more propositions and transforms their truth value.
“If it is raining outside, then the sidewalk is wet.” This is the combination of two propositions, strung together by the
implication connective. There are four essential connectives: negation, disjunction, conjunction, and implication. We will
take a close look at each one.
Negation flips the truth value of a proposition to its opposite. It is denoted by the mathematical symbol ¬: if 𝐴 is a
proposition, then ¬𝐴 is its negation. In general, connectives are defined by truth tables that enumerate all possible truth
values of the resulting expression, given its inputs. In writing, this looks complicated, so here is the truth table of ¬ to
illustrate the concept.

𝐴 ¬𝐴
0 1
1 0

When expressing propositions in a natural language, negation translates to the word “not”. For instance, the negation of
the proposition “the screen is black” is “the screen is not black”. (Not “the screen is white”.)
Logical conjunction is equivalent to grammatical conjunction “and”, denoted by the symbol ∧. The proposition 𝐴 ∧ 𝐵 is
true if and only if both 𝐴 and 𝐵 are true. For example, when we say that “the table is set and the food is ready”, we mean
to convey that both conjuncts are true. Here is the truth table:

𝐴 𝐵 𝐴∧𝐵
0 0 0
0 1 0
1 0 0
1 1 1

Disjunction is known as “or” in the English language and is denoted by the symbol ∨. The proposition 𝐴 ∨ 𝐵 is true
whenever either one is:
𝐴 𝐵 𝐴∨𝐵
0 0 0
0 1 1
1 0 1
1 1 1

Disjunction is inclusive, unlike the exclusive or we frequently use in our natural language. When you say “I am traveling
by train or car”, both cannot be true. The disjunction connective is not exclusive.
Finally, the implication connective (→) formalizes the deduction of a conclusion 𝐵 from a premise 𝐴: “if 𝐴, then 𝐵.”
The implication is true when the conclusion is true, or both the premise and the conclusion are false.

𝐴 𝐵 𝐴→𝐵
0 0 1
0 1 1
1 0 0
1 1 1

One example would be the famous quote from Descartes: “I think, therefore I am.” Translated to the language of formal
logic, this is simply

"I think" → "I exist".

460 Chapter 42. It’s just logic

Mathematics of Machine Learning

Sentences of the form “if 𝐴, then 𝐵“ are called conditionals. It’s not all just philosophy. Science is the collection of
propositions like “if 𝑋 is a closed system, then the entropy of 𝑋 cannot decrease”. (As the 2nd law of thermodynamics
states.)
The entire body of scientiﬁc knowledge is made of 𝐴 → 𝐵 propositions, and scientiﬁc research is equivalent to pursuing
the truth value of implications. When solving problems in practice, we rely on theorems (that is, implications) that turn
our premises into conclusions.

42.3 The propositional calculus

If you got the feeling that the connectives are akin to arithmetic operations, you are correct. Connectives yield propositions.
Thus connectives can again be applied, resulting in complex expressions like ¬(𝐴∨𝐵)∧𝐶. Constructing such expressions
and deductive arguments is called the propositional calculus.
Just like arithmetic operations, expressions made up of propositions and connectives also have identities. Think about the
famous algebraic identity

(𝑎 + 𝑏)(𝑎 − 𝑏) = 𝑎2 − 𝑏2 ,

that is one of the most frequently used symbolic expressions. Such an identity means we can write one thing in another
form. In mathematical logic, we call these logical equivalences.

Deﬁnition (Logical equivalences.)

The propositions 𝑃 and 𝑄 are logically equivalent if they always have the same truth value.
If 𝑃 and 𝑄 are logically equivalent, we write

𝑃 ≡ 𝑄.

To show you an example, let’s look at our ﬁrst theorem, one that establishes logical equivalences for the conjunction
connective.

Theorem (Properties of conjunction.)

Let 𝐴, 𝐵, and 𝐶 be arbitrary propositions. Then
(a) (𝐴 ∧ 𝐵) ∧ 𝐶 ≡ 𝐴 ∧ (𝐵 ∧ 𝐶) (associativity)
(b) 𝐴 ∧ 𝐵 ≡ 𝐵 ∧ 𝐴 (commutativity)
(c) 𝐴 ∨ (𝐵 ∧ 𝐶) ≡ (𝐴 ∨ 𝐵) ∧ (𝐴 ∨ 𝐶) (distributivity)
(d) 𝐴 ∧ 𝐴 ≡ 𝐴 (idempotence)

Proof. Showing these properties is done by drawing up their truth tables. We will do this for (a), while the rest is left for
you as an exercise. (I highly suggest you to do this, as performing a task by yourself is an excellent learning opportunity.)

42.3. The propositional calculus 461

Mathematics of Machine Learning

For the associativity property, the sizeable truth table

𝐴 𝐵 𝐶 𝐴∧𝐵 𝐵∧𝐶 (𝐴 ∧ 𝐵) ∧ 𝐶 𝐴 ∧ (𝐵 ∧ 𝐶)
0 0 0 0 0 0 0
0 0 1 0 0 0 0
0 1 0 0 0 0 0
0 1 1 0 1 0 0
1 0 0 0 0 0 0
1 0 1 0 0 0 0
1 1 0 1 0 0 0
1 1 1 1 1 1 1

provides a proof. □

A few remarks are in order. First, we should read the truth table from left to right columns. Strictly speaking, we can
omit the columns for 𝐴 ∧ 𝐵 and 𝐵 ∧ 𝐶. However, including them saves the mental gymnastics.
Second, because of the associativity, we can freely write 𝐴 ∧ 𝐵 ∧ 𝐶, as the order of operations is irrelevant.
Finally, note that our ﬁrst theorem is a premise and a conclusion, connected by the implication connective. If we denote
them by
𝑃 = 𝐴, 𝐵, 𝐶 are propositions,
𝑄 = (𝐴 ∧ 𝐵) ∧ 𝐶 ≡ 𝐴 ∧ (𝐵 ∧ 𝐶),
then the ﬁrst part of our theorem is just the proposition 𝑃 → 𝑄, one that we have proven to be true via laying out the
truth table. This shows the immense power of the propositional calculus we are building here.
Theorem 41.3.1 has an analogue for disjunction. This is stated below for the sake of completeness, but the proof is left to
you as an exercise.

Theorem (Properties of disjunction.)

Let 𝐴, 𝐵, and 𝐶 be arbitrary propositions. Then
(a) (𝐴 ∨ 𝐵) ∨ 𝐶 ≡ 𝐴 ∨ (𝐵 ∨ 𝐶) (associativity)
(b) 𝐴 ∨ 𝐵 ≡ 𝐵 ∨ 𝐴 (commutativity)
(c) 𝐴 ∧ (𝐵 ∨ 𝐶) ≡ (𝐴 ∧ 𝐵) ∨ (𝐴 ∧ 𝐶) (distributivity)
(d) 𝐴 ∨ 𝐴 ≡ 𝐴 (idempotence)

Just as arithmetic operations, connectives have order of precedence as well: ¬, ∧, ∨, →. This means that, for instance,
((¬𝐴) ∧ 𝐵) ∨ 𝐶 can be written as ¬𝐴 ∧ (𝐵 ∨ 𝐶).
In our calculus of propositions, one of the most important rules is De Morgan’s laws, describing how conjunction and
disjunction behave with respect to negation.

Theorem (De Morgan’s laws.)

Let 𝐴 and 𝐵 two arbitrary propositions. Then
(a) ¬(𝐴 ∧ 𝐵) ≡ ¬𝐴 ∨ ¬𝐵
(b) ¬(𝐴 ∨ 𝐵) ≡ ¬𝐴 ∧ ¬𝐵
holds.

462 Chapter 42. It’s just logic

Mathematics of Machine Learning

Proof. As usual, we can prove De Morgan’s laws by laying out the two truth tables

𝐴 𝐵 ¬𝐴
¬𝐵 𝐴∧𝐵 ¬(𝐴 ∧ 𝐵)
¬𝐴 ∨ ¬𝐵
0 0 1
1 0 1
1
0 1 1
0 0 1
1
1 0 0
1 0 1
1
1 1 0
0 1 0
0

and
𝐴 𝐵 ¬𝐴
¬𝐵 𝐴∨𝐵 ¬(𝐴 ∨ 𝐵)
¬𝐴 ∧ ¬𝐵
0 0 1
1 0 1
1
0 1 1
0 1 0
0
1 0 0
1 1 0
0
1 1 0
0 1 0
0

that are verifying our claim. □

The propositional calculus we have established so far is the mathematical formalization of thinking. One thing is missing,
though: deduction, or as Wikipedia puts it, “the mental process of drawing inferences in which the truth of their premises
ensures the truth of their conclusion”. This is given via the famous rule of modus ponens.

Theorem (Modus ponens.)

Let 𝐴 and 𝐵 be two propositions. If 𝐴 and 𝐴 → 𝐵 are true, then 𝐵 is be true as well.

Proof. Let’s take a look at the truth table of → once again:

𝐴 𝐵 𝐴→𝐵
0 0 1
0 1 1
1 0 0
1 1 1

42.3. The propositional calculus 463

Mathematics of Machine Learning

By looking at its rows, we can see that when 𝐴 is true and the implication 𝐴 → 𝐵 is true, 𝐵 is true as well, as the principle
of modus ponens indicates. □

As modus ponens sounds extremely abstract, here is a concrete example. From common sense, we know that the impli-
cation “if it’s raining, then the sidewalk is wet” is true. If we observe from a roof window that it’s indeed raining, we can
conﬁdently conclude that the sidewalk is wet, even without looking at it.
In symbolic notation, we can write

𝐴 → 𝐵, 𝐴 ⊢ 𝐵,

where the turnstile symbol ⊢ essentially reads as “proves”. Thus, the modus ponens says that 𝐴 → 𝐵 and 𝐴 prove 𝐵.
Modus ponens is how we use our theorems. It is always in the background.

Remark
This is a great opportunity to point out one of the most frequent logical fallacies: reversing the implication. When debating
about a given topic, participants often resort to the faulty argument

𝐴 → 𝐵, 𝐵 ⊢ 𝐴.

Of course, this is not true. For instance, consider our favorite example:

𝐴 = it's raining outside,

𝐵 = the sidewalk is wet.

Clearly, 𝐴 → 𝐵 holds, but 𝐵 → 𝐴 does not. There are other reasons for a wet sidewalk. For instance, someone
accidentally spilled a barrel of water on it.

42.4 Variables and predicates

So, mathematics is about propositions, implications, and their truth values. We have seen that we can formulate proposi-
tions and reason about pretty complicated expressions using our propositional calculus. However, the language we have
built up so far is not suitable for propositions with variables.
For instance, think about the sentence

𝑥 is a nonnegative real number.

Because the truth value depends on 𝑥, this is not a well-formed proposition. Sentences with variables are called predicates,
and we denote them by emphasizing the dependence on their variables; for instance

𝑃 (𝑥) ∶ 𝑥 ≥ 0,

𝑄(𝑥, 𝑦) ∶ 𝑥 + 𝑦 is an even number.

Each predicate has a domain from which its variables can be taken. You can think about a predicate 𝑃 (𝑥) as a function
that maps its domain to the set {0, 1}, representing its truth value. (Although, strictly speaking, we don’t have functions
available as tools when deﬁning the very foundation of our formal language. However, we are not philosophers or set
theorists, so we don’t have to be concerned about such details.)

464 Chapter 42. It’s just logic

Mathematics of Machine Learning

Predicates deﬁne truth sets, that is, subsets of the domain where the predicate is true. Formally, they are denoted by

{𝑥 ∈ 𝐷 ∶ 𝑃 (𝑥)}, (42.1)

where 𝑃 (𝑥) is a predicate with domain 𝐷.

Translated to English, (42.1) reads as “all elements 𝑥 of 𝐷 for which 𝑃 (𝑥) is true”.
Although we haven’t talked about sets before, truth sets probably seem familiar if you have a computer science background.
For instance, if you have ever used the Python programming language, you have probably seen expressions like

s = {x for x in range(1, 100) if x % 5 == 0}

all the time. These are called comprehensions, and they are inspired by the so-called set-builder notation given by (42.1).

42.5 Existential and universal quantiﬁcation

Predicates are a big step towards properly formalizing mathematical thinking, but we are not quite there yet. To give you
an example from machine learning, let’s talk about ﬁnding the minima of loss functions. (That is, training a model.)
A point 𝑥 is said to be the global minimum of a function 𝑓(𝑥) if for all other 𝑦 in its domain 𝐷, 𝑓(𝑥) ≤ 𝑓(𝑦) holds. For
instance, the point 𝑥 = 0 is a minima of the function 𝑓(𝑥) = 𝑥2 .
How would you express this in our formal language? For one, we could say that

for all 𝑦 ∈ 𝐷, 𝑓(𝑥) ≤ 𝑓(𝑦) is true,

where we ﬁx 𝑓(𝑥) = 𝑥2 and 𝑥 = 0. There are two parts of this sentence: for all 𝑦 ∈ 𝐷, and 𝑓(𝑥) ≤ 𝑓(𝑦) is true. The
second one is a predicate:

𝑃 (𝑦) ∶ 𝑓(𝑥) ≤ 𝑓(𝑦),

where 𝑦 ∈ ℝ.
The second part seems new, as we have never seen the words "for all" in our formal language before. They express a kind
of quantification about when the predicate 𝑃 (𝑦) is true.
In mathematical logic, there are two quantifiers we need to be happy: the universal quantifier "for all" denoted by the
symbol ∀, and the existential quantifier "there exists" denoted by ∃.
For example, consider the sentence “all of my friends are mathematicians”. By defining the set 𝐹 to be set of my friends
and the predicate on this domain as

𝑀 (𝑥) ∶ 𝑥 is a mathematician,

we can formalize our sentence as

∀𝑥 ∈ 𝐹 , 𝑀 (𝑥).

Remember that the domain of the predicate 𝑀 (𝑥) is 𝐹 . We could omit that, but it’s much more user-friendly this way.
Similarly, “I have at least one friend who is a mathematician” translates to

∃𝑥 ∈ 𝐹 , 𝑀 (𝑥).

When there is a more complex proposition behind the quantiﬁer, we mark its scope with parentheses:

∀𝑥 ∈ 𝐹 , (𝐴(𝑥) → (𝐵(𝑥) ∧ 𝐶(𝑥))).

42.5. Existential and universal quantiﬁcation 465

Mathematics of Machine Learning

Note that as (∀𝑥 ∈ 𝐹 , 𝑀 (𝑥)) and (∃𝑥 ∈ 𝐹 , 𝑀 (𝑥)) have a single truth value, they propositions, not predicates! Thus,
quantiﬁers turn predicates into propositions. Just like any other propositions, logical connectives can be applied to them.
Among all the operations, negation is the most interesting here. To see why, let’s consider the previous example: “all of
my friends are mathematicians”. At ﬁrst, you might say that its negation is “none of my friends are mathematicians”, but
that is not correct. Think about it: I can have mathematician friends, as long as not all of them are mathematicians. Thus,

¬("all of my friends are mathematicians"*) ≡ "I have at least a non-mathematician friend".

In other words (or should I say symbols), we have

¬(∀𝑥 ∈ 𝐹 , 𝑀 (𝑥)) ≡ ∃𝑥 ∈ 𝐹 , ¬𝑀 (𝑥).

That is, roughly speaking, the negation of ∀ is ∃ and the negation of ∃ is ∀.

42.6 Problems

Problem 1. Using truth tables, show that.

(a) 𝐴 ∨ ¬𝐴 is true,
(b) and 𝐴 ∧ ¬𝐴 is false.
In other words, 𝐴∨¬𝐴 is a tautology, while 𝐴∧¬𝐴 is a contradiction. (We call expressions that are always true tautologies,
while expressions that are always false contradictions.)
Problem 2. Deﬁne the exclusive or operation ⊕ by the truth table

𝐴 𝐵 𝐴⊕𝐵
0 0 0
0 1 1
1 0 1
1 1 0

Show that
(a) 𝐴 ⊕ 𝐵 ≡ (¬𝐴 ∧ 𝐵) ∨ (𝐴 ∧ ¬𝐵),
(b) and 𝐴 ⊕ 𝐵 ≡ (¬𝐴 ∨ ¬𝐵) ∧ (𝐴 ∨ 𝐵)
holds.

466 Chapter 42. It’s just logic

CHAPTER

FORTYTHREE

THE STRUCTURE OF MATHEMATICS

We’ve come a long way from the start: we studied propositions, logical connectives, predicates, quantifiers, and all the
formal logic. This was to be able to talk about mathematics. However, ultimately, we want to do mathematics.
As the only exact science, mathematics is built on top of definitions, theorems, and proofs. We precisely define objects,
formulate conjectures about them, then prove those with logically correct arguments. You can think of mathematics as a
colossal building made of propositions, implications, and modus ponens. If one theorem fails, all others that build upon
it fail too.
In other fields of science, the modus operandi is to hypothesize, experiment, and validate. However, experiments are not
𝑛
enough in mathematics. For instance, think about the famous Fermat numbers, that is, numbers of the form 𝐹𝑛 ∶= 22 +1.
Fermat conjectured them to be all prime numbers, as 𝐹0 , 𝐹1 , 𝐹2 , 𝐹3 , and 𝐹4 are all primes.
Five affirmative “experiments” might have been enough to accept the hypothesis as true in certain fields of science. Not in
mathematics. Around the 1700s, Euler showed that 𝐹5 = 4294967297 is not a prime, as 4294967297 = 641×6700417.
(Imagine calculating that in the 18th century, long before the age of computing.)
So far, we’ve seen some definitions, theorems, and even proofs when talking about mathematical logic. It’s time to put
them under the magnification glass and see what they are!

43.1 What is a deﬁnition?

Ambiguity is the drawback of natural languages. How would you define, say, the concept of “hot”? Upon several attempts,
you would soon discover that no two people have the same definition.
In mathematics, there is no room for ambiguity. Every object and every property must be precisely defined. It’s best to
look at a good example instead of philosophizing about it.

Deﬁnition 42.1.1 (Divisors.)

Let 𝑏 ∈ ℤ be an integer. We say that 𝑎 ∈ ℤ is a divisor of 𝑏 if there exists a integer 𝑘 ∈ ℤ such that 𝑏 = 𝑘𝑎.
The property “𝑎 is a divisor of 𝑏” is denoted by 𝑎 ∣ 𝑏.

For example, 2 ∣ 10 and 5 ∣ 10, but 7 ∤ 10. (Crossed symbols mean the negation of the said property.)
In terms of our formal language, the deﬁnition of “𝑎 is a divisor of 𝑏” can be written as

𝑎 ∣ 𝑏 ∶ ∃𝑘 ∈ ℤ, 𝑏 = 𝑘𝑎. (43.1)

Don’t let the 𝑎 ∣ 𝑏 notation deceive you; this is a predicate in disguise. We could have denoted 𝑎 ∣ 𝑏 by

divisor(𝑎, 𝑏) ∶ ∃𝑘 ∈ ℤ, 𝑏 = 𝑘𝑎.

467
Mathematics of Machine Learning

Although every mathematical definition can be formalized, we’ll prefer our natural language because it is much easier to
understand. (At least for humans. Not so much for computers.)
Like building blocks, definitions build on top of each other.
(If you have a sharp eye for details, you noticed that even Definition 42.1.1 is built upon other concepts such as numbers,
multiplication, and equality. We haven’t defined them precisely, just assumed they are there. Since our goal is not to
re-build mathematics from scratch, we’ll let this one slide.)
Again, it’s best to see an example here. Let’s see what even and odd numbers are!

Deﬁnition 42.1.2 (Even and odd numbers.)

Let 𝑛 ∈ ℤ be an integer. We say that 𝑛 is even if 2 ∣ 𝑛.
In turn, we say that 𝑛 is odd if 2 ∤ 𝑛. (The notation 𝑎 ∤ 𝑏 is the negation of the “𝑎 is a divisor of 𝑏” predicate.)

One more time, with our formal language. For an integer 𝑛 ∈ ℤ, the predicates

even(𝑛) ∶ 2 ∣ 𝑛

and

odd(𝑛) ∶ 2 ∤ 𝑛

express the same as Deﬁnition 42.1.2.

These examples are not that exciting, so let’s see something more interesting!

Deﬁnition 42.1.3 (Prime numbers.)

Let 𝑝 ∈ ℕ be a positive integer. We say that 𝑝 is a prime number if
(a) 𝑝 > 1,
(b) and if 𝑎 ∣ 𝑝, then 𝑎 = 1 or 𝑎 = 𝑝.

In other words, primes have no integer divisors other than themselves. The ﬁrst few primes are 2, 3, 5, 7, 11, 13, 17, and
many more. Non-prime integers are called composite numbers.
The deﬁnition of primality can be written as

𝑃 (𝑝) ∶ (𝑝 > 1) ∧ (∀𝑎 ∈ ℤ, (𝑎 ∣ 𝑝 → ((𝑎 = 1) ∨ (𝑎 = 𝑝)))).

This might look complicated, but we can decompose it into parts, as shown by Fig. 43.1.
Primes play an essential role in our everyday lives! For instance, many mainstream cryptographic methods use large
primes to cipher and decipher messages. Without them, you wouldn’t be able to initiate ﬁnancial transactions securely.
Their usefulness is guaranteed by their various properties, established in the form of theorems. We’ll see a few of them
soon enough, but ﬁrst, let’s talk about what theorems really are.

468 Chapter 43. The structure of mathematics

Mathematics of Machine Learning

Fig. 43.1: Deﬁnition of primality in ﬁrst-order language, decomposed into its parts.

43.2 What is a theorem?

So, a deﬁnition is essentially a predicate whose truth set consists of our objects of interest. The whole point of mathematics
is to ﬁnd true propositions involving those objects, most often in the form 𝐴 → 𝐵. Consider the following theorem that
is a cornerstone of optimization for machine learning.

Theorem 42.2.1 (Existence of global minima for convex functions.)

Let 𝑓 ∶ ℝ𝑛 → ℝ be a function. If 𝑓 is convex, then there exists an 𝑥∗ such that 𝑓 assumes its global minimum at 𝑥∗ .
(That is, for all 𝑥 ∈ ℝ𝑛 , we have 𝑓(𝑥∗ ) ≤ 𝑓(𝑥).)

Don’t worry if you are unfamiliar with the concepts of convexity and local minimum; it’s beside the point. The gist is that
Theorem 42.2.1 can be written as

∀𝑓 ∈ 𝐹 , (𝐶(𝑓) → 𝑀 (𝑓)),

where 𝐹 denotes the set of all functions ℝ𝑛 → ℝ, and the predicates 𝐶(𝑓) and 𝑀 (𝑓) are deﬁned by

𝐶(𝑓) ∶ 𝑓 is convex,
𝑀 (𝑓) ∶ ∃𝑥∗ , 𝑥∗ is a global minimum of 𝑓.

Notice the structure of the theorem: “Let 𝑥 ∈ 𝐴. If 𝐵(𝑥), then 𝐶(𝑥).” With the ﬁrst sentence, we are setting the domains
of the predicates 𝐴(𝑥) and 𝐵(𝑥), and putting an universal quantiﬁer in front of the conditional “if 𝐵(𝑥), then 𝐶(𝑥).”

43.2. What is a theorem? 469

Mathematics of Machine Learning

43.3 What is a proof?

Now that we understand what theorems are, it’s time to look at proofs. We have just seen that theorems are true propo-
sitions. Proofs are deductions that establish the truth of a proposition. Let’s see an example instead of talking like a
philosopher!
The proof of Theorem 42.2.1 is not within our reach yet, so let’s look at something much simpler: the sum of even
numbers.

Theorem 42.3.1 (The sum of even numbers.)

Let 𝑛, 𝑚 ∈ ℤ be two integers. If 𝑛 and 𝑚 are even, then 𝑛 + 𝑚 is even.

Proof. Since 𝑛 is even, 2 ∣ 𝑛. According to Deﬁnition 42.1.1, this means that there exists an integer 𝑘 ∈ ℤ such that
𝑛 = 2𝑘.
Similarly, as 𝑚 is also even, there exists an integer 𝑙 ∈ ℤ such that 𝑚 = 2𝑙.
Summing up the two, we obtain that
𝑛 + 𝑚 = 2𝑘 + 2𝑙
= 2(𝑘 + 𝑙),
giving that 𝑛 + 𝑚 is indeed even. □

(The square symbol □ is just there to mark the end of the proof. It reads as “quod erat demonstrandum”, meaning “what
was to be shown”.)
If you read the above proof carefully, you might notice that it is a chain of implications and modus ponens. These two
form the backbone of our deductive skills. What is proven is set in stone.
Understanding what proofs are is one of the biggest skill gaps in mathematics. Don’t worry if you don’t get it immediately;
this is a deep concept. You’ll get used to proofs eventually.

43.4 Equivalences

The building blocks of mathematics are propositions of the form 𝐴 → 𝐵; at least, this is what I emphasized throughout
this chapter.
I was not precise. The proposition 𝐴 → 𝐵 translates to “if 𝐴, then 𝐵“, but sometimes, we know much more. Quite
frequently, 𝐴 and 𝐵 have the same truth values. In natural language, we express this by saying “𝐴 if and only if 𝐵“.
(Although this is much rarer than the simple conditional.)
In logic, we express this relation with the biconditional connective ↔, deﬁned by

𝐴 ↔ 𝐵 ≡ (𝐴 → 𝐵) ∧ (𝐵 → 𝐴).

Theorems of the “if and only if” type are called equivalences, and they play an essential role in mathematics. When
proving an equivalence, we must show both 𝐴 → 𝐵 and 𝐵 → 𝐴.
To see an example, let’s go back to elementary geometry. As you have probably learned in high school, we can describe
geometric objects on the plane with vectors that are represented by a tuple of two real numbers.
This way, geometric properties can be translated into analytic ones, and we can often prove hard theorems by a simple
calculation.

470 Chapter 43. The structure of mathematics

Mathematics of Machine Learning

For instance, let’s talk about orthogonality, one of the most important concepts in mathematics. This is how orthogonality
is deﬁned for two planar vectors.

Deﬁnition 42.4.1 (Orthogonality.)

Let 𝑎 and 𝑏 be two nonzero vectors on the plane. We say that 𝑎 and 𝑏 are orthogonal if their enclosed angle is 𝜋/2.
Orthogonality is denoted by the ⟂ symbol; that is, 𝑎 ⟂ 𝑏 means that 𝑎 and 𝑏 are orthogonal.

For the sake of simplicity, we always assume that the enclosed angle is between 0 and 𝜋. (An angle of 𝜋 radians is 180
degrees, but we’ll always use radians.)
However, measuring the angle enclosed by two arbitrary vectors is not as easy as it sounds. We need a tractable formula,
and this is where the dot product comes in.

Deﬁnition 42.4.2 (Dot product of planar vectors.)

Let 𝑎 = (𝑎1 , 𝑎2 ) and 𝑏 = (𝑏1 , 𝑏2 ) be two vectors on the plane. Their dot product 𝑎 ⋅ 𝑏 is deﬁned by

𝑎 ⋅ 𝑏 ∶= |𝑎||𝑏| cos 𝛼,

where 𝛼 is the angle enclosed by the two vectors, and | ⋅ | denotes the magnitude of a vector.

Dot products give an equivalent deﬁnition of orthogonality in the form of an “if and only if” theorem.

Theorem 42.4.1
Let 𝑎 = (𝑎1 , 𝑎2 ) and 𝑏 = (𝑏1 , 𝑏2 ) be two nonzero vectors on the plane. Then 𝑎 and 𝑏 orthogonal if and only if 𝑎 ⋅ 𝑏 = 0.

Let’s see the proof of this equivalence!

Proof. We have to prove two implications:

(i) 𝑎 ⟂ 𝑏 ⟹ 𝑎 ⋅ 𝑏 = 0,(ii) 𝑎 ⋅ 𝑏 = 0 ⟹ 𝑎 ⟂ 𝑏.
Let’s start with (i). If 𝑎 ⟂ 𝑏, then their enclosed angle 𝛼 equals to 𝜋/2. Thus,
𝜋
𝑎 ⋅ 𝑏 = |𝑎||𝑏| cos
2
= |𝑎||𝑏|0
= 0,

which is what we needed to show.

To prove (ii), we have to notice that since 𝑎 and 𝑏 are nonzero, their magnitudes |𝑎|, |𝑏| are also nonzero. Thus,

𝑎 ⋅ 𝑏 = |𝑎||𝑏| cos 𝛼 = 0

can only hold if cos 𝛼 = 0. In turn, this means that 𝛼 = 𝜋/2; that is, 𝑎 ⟂ 𝑏. (Recall that we assumed the enclosed angle
𝛼 to be between 0 and 𝜋.) □

43.4. Equivalences 471

Mathematics of Machine Learning

43.5 Proof techniques

There is no way around it: proving theorems is hard. Some took the smartest of minds decades, and some conjectures
remain unresolved after a century. (That is, they are not proven nor disproven.)
A few basic yet powerful tools can get one push through the diﬃculties. In the following, we’ll look at the three most
important ones: proof by induction, proof by contradiction, and the principle of contraposition.

43.5.1 Proof by induction

How do you climb a set of stairs? Simple. You climb the ﬁrst step, then climb the next one and so on.
You might be surprised, but this is something we frequently use in mathematics all the time. Let’s illuminate this by an
example.

Theorem 42.5.1 (Sum of square numbers.)

Let 𝑛 ∈ ℕ be an arbitrary integer. Then

𝑛(𝑛 + 1)
1+2+⋯+𝑛= (43.2)
2
holds.

Proof. For 𝑛 = 1, the case is clear: the left-hand side of (43.2) evaluates to 1, while the right-hand side is

1(1 + 1)
= 1.
2
Thus, our proposition holds for 𝑛 = 1.
Here comes the magic, that is, the induction step. Let’s assume that (43.2) holds for a given 𝑛; that is, we have

𝑛(𝑛 + 1)
1+2+⋯+𝑛= .
2
This is what’s called the induction hypothesis. Using this assumption, we are going to prove that (43.2) holds for 𝑛 + 1 as
well. In other words, our goal is to show that

(𝑛 + 1)(𝑛 + 2)
1 + 2 + ⋯ + 𝑛 + (𝑛 + 1) = .
2
Due to our induction hypothesis, we have

1 + 2 + ⋯ + 𝑛 + (𝑛 + 1) = [1 + 2 + ⋯ + 𝑛] + (𝑛 + 1)
𝑛(𝑛 + 1)
= + (𝑛 + 1).
2
Continuing the calculation, we obtain

𝑛(𝑛 + 1) 𝑛(𝑛 + 1) 2(𝑛 + 1)

+ (𝑛 + 1) = +
2 2 2
𝑛(𝑛 + 1) + 2(𝑛 + 1)
=
2
(𝑛 + 1)(𝑛 + 2)
= ,
2

472 Chapter 43. The structure of mathematics

Mathematics of Machine Learning

which is what we had to show. □

To sum up what happened, let’s denote the equation (43.2) by the predicate

𝑛(𝑛 + 1)
𝑆(𝑛) ∶ 1 + 2 + ⋯ + 𝑛 = .
2
Proof by induction consists of two main steps. First, we establish that 𝑆(1) is true. Then, we show that for arbitrary 𝑛,
the implication 𝑆(𝑛) → 𝑆(𝑛 + 1) holds. Starting from the induction step, this implies that 𝑆(𝑛) is indeed true for all 𝑛:
the chain of implications
𝑆(1) → 𝑆(2),
𝑆(2) → 𝑆(3),
𝑆(3) → 𝑆(4),
⋮
combined with 𝑆(1) and the almighty modus ponens yields the truth of 𝑆(𝑛).
We took the ﬁrst step 𝑆(1), then proved that we can take the next step from anywhere.
Induction is not simple to grasp, so here is another example. (It is slightly more complex than the previous one.) Follow
through with the proof and see if you can identify the marks of induction.

Theorem 42.5.2 (The fundamental theorem of number theory.)

Let 𝑛 ∈ ℤ be an integer and suppose that 𝑛 > 1. Then 𝑛 can be uniquely represented as the product of prime numbers;
that is, there exists prime numbers 𝑝1 , 𝑝2 , … , 𝑝𝑙 and exponents 𝑘1 , 𝑘2 , … , 𝑘𝑙 > 1 such that
𝑘 𝑘 𝑘
𝑛 = 𝑝1 1 𝑝2 2 … 𝑝𝑙 𝑙 . (43.3)

Furthermore, this representation is unique.

For example, 24 = 23 3, and 24 cannot be written as a diﬀerent product of primes.

The presence of natural language masks it, but in essence, Theorem 42.5.2 can be translated to the sentence
𝑘 𝑘 𝑘
∀𝑛 ∈ ℤ[(𝑛 > 1) → (∃𝑝1 , … , 𝑝𝑙 , 𝑘1 , … , 𝑘𝑙 ∈ ℤ, 𝑛 = 𝑝1 1 𝑝2 2 … 𝑝𝑙 𝑙 )].

For simplicity, we’ll only prove the existence of the prime factorization, not the unicity.

Proof. (Existence.) For 𝑛 = 2, the theorem is trivially true, as 2 is a prime itself.

Now let 𝑛 > 2 and suppose that (43.3) is true for all integers 𝑚 that are smaller or equal to 𝑛. (This is our induction
hypothesis.)
Our goal is to show that (43.3) also holds for 𝑛 + 1.
There are two possibilities: either 𝑛 + 1 is a prime or a composite number. If it is a prime, we are done, as 𝑛 + 1 is by
itself in the form (43.3). Otherwise, if 𝑛 + 1 is a composite number, we can ﬁnd a divisor that is not 1 or 𝑛 + 1:

𝑛 + 1 = 𝑎𝑏

for some 𝑎, 𝑏 ∈ ℤ. Since 𝑎, 𝑏 ≤ 𝑛, we can apply the induction hypothesis! Spelling it out, it means that we can write
them as 𝛼 𝛼
𝑎 = 𝑝1 1 … 𝑝𝑙 𝑙 ,
𝛽 𝛽
𝑏 = 𝑞1 1 … 𝑞𝑚𝑚 ,

43.5. Proof techniques 473

Mathematics of Machine Learning

where the 𝑝𝑖 , 𝑞𝑖 are the primes and the 𝛼𝑖 , 𝛽𝑖 are the exponents. Thus,

𝑛 + 1 = 𝑎𝑏
𝛼 𝛼 𝛽 𝛽
= 𝑝1 1 … 𝑝𝑙 𝑙 𝑞1 1 … 𝑞𝑚𝑚 ,

which is just (43.3), with a bit more symbols. □

Induction is like a power tool in mathematics. It is extremely powerful, and when it is applicable, it’ll almost always do
the job.

43.5.2 Proof by contradiction

Sometimes, it is easier to prove theorems by assuming that their conclusion is false, then deduce a contradiction.
Again, it’s best to see a quick example. Let’s revisit our good old friends, the prime numbers.

Theorem 42.5.3
There are inﬁnitely many prime numbers.

Proof. Assume that there are ﬁnitely many prime numbers: 𝑝1 , 𝑝2 , … , 𝑝𝑛 .

Is the integer 𝑝1 𝑝2 … 𝑝𝑛 + 1 a prime? If 𝑝1 , 𝑝2 , … , 𝑝𝑛 are all of the prime numbers, it is enough to check if

𝑝𝑖 ∤ 𝑝1 𝑝2 … 𝑝𝑛 + 1.

This holds indeed, as by definition, 𝑝1 𝑝2 … 𝑝𝑛 + 1 = 𝑝𝑖 𝑘 + 1, where 𝑘 is simply the product of the prime numbers other
than 𝑝𝑖 .
Since no 𝑝𝑖 is a divisor of 𝑝1 𝑝2 … 𝑝𝑛 + 1, it must be a prime. We have found a new prime that is not on our list! This
means that our assumption (that there are finitely many prime numbers) has led to a contradiction.
Thus, there must be infinitely many prime numbers. □

If you have a sharp eye, you probably noticed that the above example is not of the form 𝐴 → 𝐵; it’s just a simple
proposition:

𝐴 = there are inﬁnitely many prime numbers.

In these cases, showing that ¬𝐴 is false yields the desired conclusion. However, this technique works for 𝐴 → 𝐵 -style
propositions as well.

43.5.3 Contraposition

The ﬁnal technique we will study is contraposition, a clever method that puts a twist into the classic 𝐴 → 𝐵-style thinking.
We should get to know the implication connective a bit better to see what it is. As it turns out, 𝐴 → 𝐵 can be written in
terms of negation and disjunction.

Theorem 42.5.4

474 Chapter 43. The structure of mathematics

Mathematics of Machine Learning

Let 𝐴 and 𝐵 two propositions. Then

𝐴 → 𝐵 ≡ ¬𝐴 ∨ 𝐵.

Proof. The truth table

𝐴 𝐵 ¬𝐴 ¬𝐴 ∨ 𝐵
0 0 1 1
0 1 1 1
1 0 0 0
1 1 0 1
provides a proof.

Why is this relevant? Simple. Take a look at the following corollary.

Corollary 42.5.1 (Principle of contraposition.)

Let 𝐴 and 𝐵 two propositions. Then
𝐴 → 𝐵 ≡ ¬𝐵 → ¬𝐴.

Proof. Theorem 42.5.4 implies that

𝐴 → 𝐵 ≡ ¬𝐴 ∨ 𝐵
≡ 𝐵 ∨ ¬𝐴
≡ ¬𝐵 → ¬𝐴,
which is what we had to prove. □

Here is a simple proposition about integers to give you a mathematical example.

Theorem 42.5.5
Let 𝑛 ∈ ℤ be an integer. If 2 ∤ 𝑛, then 4 ∤ 𝑛.

Proof. We should prove this via contraposition. Thus, assume that 4 ∣ 𝑛. This means that
𝑛 = 4𝑘
for some integer 𝑘 ∈ ℤ. However, this implies that
𝑛 = 2(2𝑘),
which shows that 2 ∣ 𝑛. Due to the principle of contraposition, (4 ∣ 𝑛) → (2 ∣ 𝑛) is logically equivalent to (2 ∤ 𝑛) →
(4 ∤ 𝑛), which is what we had to prove. □

Contraposition is not only useful in mathematics, it is a valuable thinking tool in general. Let’s consider our recurring
proposition: “if it is raining outside, then the sidewalk is wet”. We know this to be true, but this also means that “if the
sidewalk is not wet, then it is not raining”. (Because otherwise, the sidewalk would be wet.)
You perform these types of arguments every day without even noticing it. Now you have a name for them and can start
to apply this pattern consciously.

43.5. Proof techniques 475

Mathematics of Machine Learning

476 Chapter 43. The structure of mathematics

CHAPTER

FORTYFOUR

BASICS OF SET THEORY

In other words, general set theory is pretty trivial stuﬀ really, but, if you want to be a mathematician, you need
some and here it is; read it, absorb it, and forget it. — Paul R. Halmos
Although Paul Halmos said the above a long time ago, it has remained quite accurate. Except for one part: set theory is
not only necessary for mathematicians, but for computer scientists, data scientists, and software engineers as well.
You might have heard about or studied set theory before. It is hard to see why it is so essential for machine learning, but
trust me, set theory is the very foundation of mathematics. Deep down, everything is a set or a function between sets. (As
we will see later, even functions are deﬁned as sets.)
Think about the relation of set theory and machine learning like grammar and poetry. To write beautiful poetry, one
needs to be familiar with the rules of the language. For example, data points are represented as vectors in vector spaces,
often constructed as the Cartesian product of sets. (Don’t worry if you are not familiar with Cartesian products, we’ll get
there soon.) Or, to really understand probability theory, you need to be familiar with event spaces, which are systems of
sets closed under certain operations.
So, what are sets anyway?

44.1 What is a set?

On the surface level, a set is just a collection of things. We deﬁne sets by enumerating their elements like

𝑆 = {red, green, ♡}.

Two sets are equal if they have the same elements. Given any element, we can always tell if it is a member of a given set
or not. When every element of 𝐴 is also an element of 𝐵, we say that 𝐴 is a subset of 𝐵, or in notation,

𝐴 ⊆ 𝐵.

If we have a set, we can deﬁne subsets by specifying a property that all of its elements satisfy, for example

even numbers = {𝑛 ∈ ℕ ∶ 𝑛%2 = 0}.

(The % denotes the modulo operator.) This latter method is called the set-builder notation, and if you are familiar with
the Python programming language, you can see this inspired list comprehensions. There, one would write something like
this.

even_numbers = {n for n in range(100) if n%2 == 0}

print(even_numbers)

477
Mathematics of Machine Learning

{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
↪ 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82,␣

↪84, 86, 88, 90, 92, 94, 96, 98}

For a given set 𝐴, the collection of its subsets are denoted by 2𝐴 :

2𝐴 ∶= {𝐵 ∶ 𝐵 ⊆ 𝐴}. (44.1)

2𝐴 is called the power set of 𝐴.

However, deﬁning sets as the collection of elements does not work. Without further conditions, it can lead to paradoxes,
as the famous Russell paradox shows. (We’ll talk about this later in this chapter.) To avoid going down the rabbit hole
of set theory, we accept that sets have some proper deﬁnition buried within a thousand-page sized tome of mathematics.
Instead of worrying about this, we focus on what we can do with sets.

44.2 Operations on sets

Describing more complex sets with only these two methods (listing its members or using the set-builder notation) will be
extremely diﬃcult. To make the job easier, we deﬁne operations on sets.

44.2.1 Union, intersection, diﬀerence

The most basic operations are the union, intersection, and diﬀerence. You are probably familiar with these, as they are
encountered frequently as early as high school. Even you feel familiar with them, check out the formal deﬁnition next.

Deﬁnition 43.2.1 (Set operations.)

Let 𝐴 and 𝐵 two sets. We deﬁne
(a) their union by 𝐴 ∪ 𝐵 ∶= {𝑥 ∶ 𝑥 ∈ 𝐴 or 𝑥 ∈ 𝐵},
(b) their intersection by 𝐴 ∩ 𝐵 ∶= {𝑥 ∶ 𝑥 ∈ 𝐴 and 𝑥 ∈ 𝐵},
(c) and their diﬀerence by 𝐴\𝐵 = {𝑥 ∶ 𝑥 ∈ 𝐴 and 𝑥 ∉ 𝐵}.

We can easily visualize these with Venn diagrams, as you can see below.

Fig. 44.1: Set operations visualized in Venn diagrams.

478 Chapter 44. Basics of set theory

Mathematics of Machine Learning

We can express set operations in plain English as well. For example, 𝐴 ∪ 𝐵 means that “𝐴 or 𝐵“. Similarly, 𝐴 ∩ 𝐵 means
“𝐴 and 𝐵“, while 𝐴\𝐵 is “𝐴 but not 𝐵“. When talking about probabilities, these will be useful for translating events to
the language of set theory.
These set operations also have a lot of pleasant properties. For example, they behave nicely with respect to parentheses.

Theorem 43.2.1
Let 𝐴, 𝐵, and 𝐶 be three sets. The union operation is
(a) associative, that is, 𝐴 ∪ (𝐵 ∪ 𝐶) = (𝐴 ∪ 𝐵) ∪ 𝐶,
(b) commutative, that is, 𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴.
Moreover, the intersection operation is also associative and commutative.
Finally,
(c) the union is distributive with respect to the intersection, that is, 𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶),
(d) and the intersection is distributive with respect to the union, that is, 𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶).

Union and intersection can be deﬁned for an arbitrary number of operands. That is, if 𝐴1 , 𝐴2 , … , 𝐴𝑛 are sets,

𝐴1 ∪ ⋯ ∪ 𝐴𝑛 ∶= (𝐴1 ∪ ⋯ ∪ 𝐴𝑛−1 ) ∪ 𝐴𝑛 ,

and similar for the intersection. Note that this is a recursive deﬁnition! Because of associativity, the order of parentheses
doesn’t matter.
The associativity and commutativity might seem too abstract and trivial at the same time. However, this is not the case
for all operations, so it is worth emphasizing to get used to the concepts. If you are curious, noncommutative operations
are right under our noses. A simple example is string concatenation.

a = "string"
b = "concatenation"
a + b == b + a

False

44.2.2 De Morgan’s laws

One of the fundamental rules describes how set diﬀerence, union, and intersection behave together regarding set opera-
tions. These are called De Morgan’s laws.

Theorem 43.2.2 (De Morgan’s laws.)

Let 𝐴, 𝐵, and 𝐶 be three sets. Then
(a) 𝐴\(𝐵 ∪ 𝐶) = (𝐴\𝐵) ∩ (𝐴\𝐶),
(b) 𝐴\(𝐵 ∩ 𝐶) = (𝐴\𝐵) ∪ (𝐴\𝐶).

Proof. For simplicity, we are going to prove this using Venn diagrams. Although drawing a picture is not a “proper”
mathematical proof, this is not a problem. We are here to understand things, not to get hung up on philosophy.
Here is the illustration.

44.2. Operations on sets 479

Mathematics of Machine Learning

Fig. 44.2: De Morgan’s laws, illustrated on Venn diagrams.

Based on this, you can easily see both (a) and (b). □
Note that De Morgan’s laws can be generalized to cover any number of sets. So, for any Γ index set,
𝐴\(∩𝛾∈Γ 𝐵𝛾 ) = ∪𝛾∈Γ (𝐴\𝐵𝛾 ),
𝐴\(∪𝛾∈Γ 𝐵𝛾 ) = ∩𝛾∈Γ (𝐴\𝐵𝛾 )

44.3 The Cartesian product

One of the most fundamental ways to construct new sets is the Cartesian product.

Deﬁnition 43.3.1 (The Cartesian product.)

Let 𝐴 and 𝐵 be two sets. Their Cartesian product 𝐴 × 𝐵 is deﬁned by
𝐴 × 𝐵 ∶= {(𝑎, 𝑏) ∶ 𝑎 ∈ 𝐴 and 𝑏 ∈ 𝐵}.

The elements of the product are called tuples. Note that this operation is not associative nor commutative! To see this,
consider that for example,
{1} × {2} ≠ {2} × {1}

480 Chapter 44. Basics of set theory

Mathematics of Machine Learning

and
({1} × {2}) × {3} ≠ {1} × ({2} × {3}).

The Cartesian product for an arbitrary number of sets is deﬁned with a recursive deﬁnition, just like we did with the union
and intersection. So, if 𝐴1 , 𝐴2 , … , 𝐴𝑛 are sets, then
𝐴1 × ⋯ × 𝐴𝑛 ∶= (𝐴1 × ⋯ × 𝐴𝑛−1 ) × 𝐴𝑛 .

Here, the elements are tuples of tuples of tuples of…, but to avoid writing an excessive number of parentheses, we can
abbreviate it as (𝑎1 , … , 𝑎𝑛 ). When the operands are the same, we usually write 𝐴𝑛 instead of 𝐴 × ⋯ × 𝐴.
One of the most common examples is the Cartesian plane, which you probably have seen before.

Fig. 44.3: The Cartesian plane. Source: Wikipedia

To give you a machine learning-related example, let’s take a look at how we are usually given the data! Let’s focus on the
famous Iris dataset, a subset of ℝ4 . Here, the axes represent sepal length, sepal width, petal length, and petal width.
As the example demonstrates, Cartesian products are useful because they combine related information into a single math-
ematical structure. This is a recurring pattern in mathematics: building complex things from simpler building blocks and

44.3. The Cartesian product 481

Mathematics of Machine Learning

Fig. 44.4: The sepal width, plotted against the sepal length in the Iris dataset. Source: scikit-learn documentation.

482 Chapter 44. Basics of set theory

Mathematics of Machine Learning

abstracting away the details by turning the result into yet another building block. (As one would do for creating complex
software as well.)

44.4 The Russell paradox (optional)

Let’s return to a remark I made earlier: naively deﬁning sets as collections of things is not going to cut it. In the following,
we are going to see why. Prepare for some mind-twisting mathematics.
As we have seen, sets can be made of sets. For instance, {ℕ, ℤ, ℝ} is a collection of the most commonly used number
sets. We might as well deﬁne the set of all sets, which we’ll denote with Ω.
With that, we can use the set-builder notation to describe the following collection of sets:

𝑆 ∶= {𝐴 ∈ Ω ∶ 𝐴 ∉ 𝐴}.

In plain English, 𝑆 is a collection of sets that are not elements of themselves. Although this is weird, it looks valid. We
used the property “𝐴 ∉ 𝐴” to filter the set of all sets. What is the problem?
For one, we can’t decide if 𝑆 is an element of 𝑆 or not. If 𝑆 ∈ 𝑆, then by the defining property, 𝑆 ∉ 𝑆. On the other
hand, if 𝑆 ∉ 𝑆, then by the definition, 𝑆 ∈ 𝑆. This is definitely very weird.
We can diagnose the issue by decomposing the set-builder notation. In general terms, it can be written as

𝑥 ∈ 𝐴 ∶ 𝑇 (𝑥),

where 𝐴 is some set and 𝑇 (𝑥) is a property, that is, a true or false statement about 𝑥. In the deﬁnition {𝐴 ∈ Ω ∶ 𝐴 ∉ 𝐴},
our abstract property is deﬁned by

true if 𝐴 ∉ 𝐴,
𝑇 (𝐴) = {
false otherwise.

This is perfectly valid, so the problem must be in the other part: the set Ω. It turns out, the set of all sets is not a set. So,
deﬁning sets as a collection of things is not enough. Since sets are at the very foundations of mathematics, this discovery
threw a giant monkey wrench into the machine around the late 19th - early 20th century, and it took lots of years and
brilliant minds to ﬁx it.
Fortunately, as machine learning practitioners, we don’t have to care about such low-level details as the axioms of set
theory. For us, it is enough to know that a solid foundation exists somewhere. (Hopefully.)

44.4. The Russell paradox (optional) 483

Mathematics of Machine Learning

484 Chapter 44. Basics of set theory

CHAPTER

FORTYFIVE

A CRASH COURSE IN PYTHON

This section is just a draft of a future Python quickstart for those who are new to the language. For now, I’ll just link a
few tutorials that are relevant for understanding the course material.

45.1 Variables

45.2 Control ﬂow

45.3 Fundamental data structures

45.3.1 Tuples

45.3.2 Lists

45.3.3 Dictionaries

• keys must be hashable!

45.3.4 Comprehensions

45.4 Functions

45.5 Decorators

45.5.1 The least recently used cache

Computations in practice can be expensive.

from functools import lru_cache

lru_cache uses a dictionary, so only works for hashable inputs.

485
Mathematics of Machine Learning

45.6 Object-oriented programming in Python

This is what I’ll talk about here later:

• classes
• initialization
• attributes
• inheritance
• special methods (__call__, __getitem__, etc)
• “private” methods: _method
OOP tutorials for you meanwhile:
• freecodecamp’s Python OOP course

486 Chapter 45. A crash course in Python

CHAPTER

FORTYSIX

BIBLIOGRAPHY

487
Mathematics of Machine Learning

488 Chapter 46. Bibliography

BIBLIOGRAPHY

[JBP03] E.T. Jaynes, G.L. Bretthorst, and Cambridge University Press. Probability Theory: The Logic of Science. Cam-
bridge University Press, 2003.
[Lax07] Peter D. Lax. Linear Algebra and Its Applications. Wiley-Interscience, second edition, 2007.
[Str00] Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry and Engi-
neering. Westview Press, 2000.

489

Mathematics of Machine Learning
No ratings yet
Mathematics of Machine Learning
577 pages
Mathematical Techniques An Introduction For The Engineering, Physical
No ratings yet
Mathematical Techniques An Introduction For The Engineering, Physical
997 pages
Mathematics For Machine Learning
No ratings yet
Mathematics For Machine Learning
249 pages
Van Der Post H. Data Science With Rust. From Fundamentals To Insights 2024
No ratings yet
Van Der Post H. Data Science With Rust. From Fundamentals To Insights 2024
672 pages
Sebastiao Salgado - From My Land To The Planet-Contrasto
100% (1)
Sebastiao Salgado - From My Land To The Planet-Contrasto
184 pages
Statistics For Absolute Beginners (Second Edition) (Oliver Theobald
0% (1)
Statistics For Absolute Beginners (Second Edition) (Oliver Theobald
144 pages
Exploring University Mathematics With Python
100% (2)
Exploring University Mathematics With Python
100 pages
Supercharged Python - Take Your Code To The Next Level
75% (4)
Supercharged Python - Take Your Code To The Next Level
667 pages
Virtual Reality in Behavioral Neuroscience: New Insights and Methods
100% (1)
Virtual Reality in Behavioral Neuroscience: New Insights and Methods
386 pages
David Forsyth - Applied Machine Learning (2019)
No ratings yet
David Forsyth - Applied Machine Learning (2019)
496 pages
Daily Dose of Data Science - Archive
No ratings yet
Daily Dose of Data Science - Archive
580 pages
Machine Learning, Animated (Liu, Mark) (Z-Library)
No ratings yet
Machine Learning, Animated (Liu, Mark) (Z-Library)
582 pages
PD 2 Preview
0% (3)
PD 2 Preview
32 pages
DevOps For Data Science (Alex K Gold) (Z-Library)
No ratings yet
DevOps For Data Science (Alex K Gold) (Z-Library)
274 pages
Statistics and Data Visualisation With Python 2023
100% (1)
Statistics and Data Visualisation With Python 2023
554 pages
Dokumen - Pub - Big Data Analytics A Guide To Data Science Practitioners Making The Transition To Big Data 1032457554 9781032457550
No ratings yet
Dokumen - Pub - Big Data Analytics A Guide To Data Science Practitioners Making The Transition To Big Data 1032457554 9781032457550
328 pages
Linear Algebra in Python
No ratings yet
Linear Algebra in Python
42 pages
Mastering Python For Data Science With Numpy & Pandas
100% (2)
Mastering Python For Data Science With Numpy & Pandas
136 pages
Understanding and Using Linear Programming: Printed Book
No ratings yet
Understanding and Using Linear Programming: Printed Book
1 page
Robert J. Marks II - The Joy of Fourier
No ratings yet
Robert J. Marks II - The Joy of Fourier
811 pages
Uncertainty Quantifi Cation and Predictive Computational Science
No ratings yet
Uncertainty Quantifi Cation and Predictive Computational Science
349 pages
Sarkar, DR Tirthajyoti - Roychowdhury, Shubhadeep - Data Wrangling With Python - Creating Actionable Data From Raw Sources-Packt Publishing (2019)
No ratings yet
Sarkar, DR Tirthajyoti - Roychowdhury, Shubhadeep - Data Wrangling With Python - Creating Actionable Data From Raw Sources-Packt Publishing (2019)
538 pages
Machine Learning 1
100% (1)
Machine Learning 1
245 pages
TH9 Ports and Harbor
No ratings yet
TH9 Ports and Harbor
11 pages
Augmenting Cognition by Idan Segev Henry Markram
100% (1)
Augmenting Cognition by Idan Segev Henry Markram
216 pages
ST102/ST109 Elementary Statistical Theory Course Pack 2022/23 (Michaelmas Term)
100% (1)
ST102/ST109 Elementary Statistical Theory Course Pack 2022/23 (Michaelmas Term)
235 pages
Quantum Information Theory Concepts and Methods - Joseph M. Renes (2022)
No ratings yet
Quantum Information Theory Concepts and Methods - Joseph M. Renes (2022)
336 pages
Carol Ash - The Probability Tutoring Book - An Intuitive Course For Engineers and Scientists (And Everyone Else!) - Wiley-IEEE Press (1996)
100% (2)
Carol Ash - The Probability Tutoring Book - An Intuitive Course For Engineers and Scientists (And Everyone Else!) - Wiley-IEEE Press (1996)
481 pages
How Many Subjects Statistical Power Analysis in Research
100% (1)
How Many Subjects Statistical Power Analysis in Research
107 pages
Python
100% (1)
Python
635 pages
Business Analytics Data Science For Business Problems (Walter R. Paczkowski)
No ratings yet
Business Analytics Data Science For Business Problems (Walter R. Paczkowski)
416 pages
Notes Unit 1+2+3+4
No ratings yet
Notes Unit 1+2+3+4
110 pages
Optoisolation and Optical Sensor Products: Selection Catalog
100% (1)
Optoisolation and Optical Sensor Products: Selection Catalog
60 pages
Efficient Academic Writer - 11 March Updated
No ratings yet
Efficient Academic Writer - 11 March Updated
152 pages
Wassce Waec 2023 Physics Paper 2 Past Questions and Answers PDF
No ratings yet
Wassce Waec 2023 Physics Paper 2 Past Questions and Answers PDF
25 pages
Probability and Statistics For Computer Science
No ratings yet
Probability and Statistics For Computer Science
374 pages
LA Nutshell
No ratings yet
LA Nutshell
239 pages
(Cambridge Texts in Applied Mathematics) Philippe G. Ciarlet-Introduction To Numerical Linear Algebra and Optimisation-Cambridge University Press (1989)
No ratings yet
(Cambridge Texts in Applied Mathematics) Philippe G. Ciarlet-Introduction To Numerical Linear Algebra and Optimisation-Cambridge University Press (1989)
447 pages
Data Science Full Stack Roadmap
No ratings yet
Data Science Full Stack Roadmap
25 pages
Linear Algebra
No ratings yet
Linear Algebra
91 pages
David Forsyth - Probability and Statistics For Computer Science (2018, Springer)
No ratings yet
David Forsyth - Probability and Statistics For Computer Science (2018, Springer)
368 pages
IIR Filter
100% (1)
IIR Filter
50 pages
Matrix Algebra For Engineers
No ratings yet
Matrix Algebra For Engineers
190 pages
Mathematics For ML PDF
No ratings yet
Mathematics For ML PDF
10 pages
Linear Guest PDF
No ratings yet
Linear Guest PDF
436 pages
Cerebral Benefits Induced by Electrical Muscle Stimulation - Evidence From A Human and Rat Study
No ratings yet
Cerebral Benefits Induced by Electrical Muscle Stimulation - Evidence From A Human and Rat Study
23 pages
Thinking in Pandas - How To Use The Python Data Analysis Library The Right Way (2020)
100% (2)
Thinking in Pandas - How To Use The Python Data Analysis Library The Right Way (2020)
190 pages
The Parallel Plate Capacitor - Construction, Principle, Videos and Examples
No ratings yet
The Parallel Plate Capacitor - Construction, Principle, Videos and Examples
5 pages
Maps and Globes Worksheets
No ratings yet
Maps and Globes Worksheets
7 pages
Analysis and Linear Algebra For Finance Part I
No ratings yet
Analysis and Linear Algebra For Finance Part I
127 pages
Solutions To Chapter 13
100% (1)
Solutions To Chapter 13
45 pages
Mindfulness and Mind Wandering - The Protective Effects of Brief Meditation in Anxious Individuals
No ratings yet
Mindfulness and Mind Wandering - The Protective Effects of Brief Meditation in Anxious Individuals
9 pages
Electrical Stimulation
No ratings yet
Electrical Stimulation
15 pages
Numpy Ref
No ratings yet
Numpy Ref
1,128 pages
Artful Architecture
No ratings yet
Artful Architecture
24 pages
R Numerical Analysis
100% (3)
R Numerical Analysis
279 pages
Modulation of Torque Evoked by Wide Pulse, High Frequency Neuromuscular Electrical Stimulation
No ratings yet
Modulation of Torque Evoked by Wide Pulse, High Frequency Neuromuscular Electrical Stimulation
13 pages
Mit Data Science Program
100% (1)
Mit Data Science Program
15 pages
Physical Sciences p1 Grade 11 Exemplar 2013 Eng 071645
No ratings yet
Physical Sciences p1 Grade 11 Exemplar 2013 Eng 071645
19 pages
Deep Learning Fundamentals Materials
100% (1)
Deep Learning Fundamentals Materials
216 pages
MIMS Ep2006 256
No ratings yet
MIMS Ep2006 256
53 pages
Motoneuron Persistent Inward Current Contribution To Increased Torque Responses To Wide-Pulse High-Frequency Neuromuscular Electrical Stimulation
No ratings yet
Motoneuron Persistent Inward Current Contribution To Increased Torque Responses To Wide-Pulse High-Frequency Neuromuscular Electrical Stimulation
10 pages
Enhancing Adaptations To Neuromuscular Electrical
No ratings yet
Enhancing Adaptations To Neuromuscular Electrical
9 pages
Mathematical Tools For Data Science
No ratings yet
Mathematical Tools For Data Science
9 pages
Chapter-6 - Analysis of Plane Stress
No ratings yet
Chapter-6 - Analysis of Plane Stress
9 pages
PIIS0960982223001744
No ratings yet
PIIS0960982223001744
12 pages
La Ges CS JPG
No ratings yet
La Ges CS JPG
10 pages
What Makes A Good Photo
No ratings yet
What Makes A Good Photo
26 pages
Slope Stabilization Methods
No ratings yet
Slope Stabilization Methods
56 pages
Project
No ratings yet
Project
10 pages
Reaction Mechanism
100% (1)
Reaction Mechanism
19 pages
Visco Meter
No ratings yet
Visco Meter
6 pages
IJERT Data Analysis Using Python
No ratings yet
IJERT Data Analysis Using Python
6 pages
Julia Programming
No ratings yet
Julia Programming
115 pages
Numpy Handwritten Notes
No ratings yet
Numpy Handwritten Notes
8 pages
Alok Mitra Notes
No ratings yet
Alok Mitra Notes
5 pages
Computational Tools and Software MATLAB Python
No ratings yet
Computational Tools and Software MATLAB Python
5 pages
Karakteristik Tanah
No ratings yet
Karakteristik Tanah
22 pages
Pre Cal 4. Hyperbolas
No ratings yet
Pre Cal 4. Hyperbolas
7 pages
Advance Python Question Paper 2023
No ratings yet
Advance Python Question Paper 2023
2 pages
McCool-DEVELOPMENT OF A PLUG ASSISTED THERMOFROMING SIMULATION-Paper
No ratings yet
McCool-DEVELOPMENT OF A PLUG ASSISTED THERMOFROMING SIMULATION-Paper
4 pages
Appc 3.14a Ca2
No ratings yet
Appc 3.14a Ca2
2 pages
Facing Facts: Neuronal Mechanisms of Face Perception: Review Acta Neurobiol Exp 2008, 68: 229-252
No ratings yet
Facing Facts: Neuronal Mechanisms of Face Perception: Review Acta Neurobiol Exp 2008, 68: 229-252
24 pages
The Circadian System and The Balance of The Autonomic Nervous System
No ratings yet
The Circadian System and The Balance of The Autonomic Nervous System
20 pages
Python Interview Questions
No ratings yet
Python Interview Questions
8 pages
Re Rein Detailing
No ratings yet
Re Rein Detailing
47 pages
2023 Nov Algebra 1
No ratings yet
2023 Nov Algebra 1
2 pages
ANS Chap PDF
No ratings yet
ANS Chap PDF
19 pages
Plane-Stress Fracture Toughness Testing Using A Crack-Line-Loaded Speclment
No ratings yet
Plane-Stress Fracture Toughness Testing Using A Crack-Line-Loaded Speclment
22 pages
40034
No ratings yet
40034
6 pages
10 Maths Entrance Examination
No ratings yet
10 Maths Entrance Examination
14 pages
Valvula de Control Sanitaria para Vapor Limpio y Puro Dn15 100 Pv928
No ratings yet
Valvula de Control Sanitaria para Vapor Limpio y Puro Dn15 100 Pv928
5 pages
A Study On Friction and Wear Behaviour of Carburized
No ratings yet
A Study On Friction and Wear Behaviour of Carburized
8 pages
Activity 13 DNAReplicationPupil Case Study
No ratings yet
Activity 13 DNAReplicationPupil Case Study
9 pages
Python Recursion
No ratings yet
Python Recursion
11 pages
Exercise and The Brain - Something To Chew On PDF
No ratings yet
Exercise and The Brain - Something To Chew On PDF
8 pages
Properties of Trigonometric Functions
No ratings yet
Properties of Trigonometric Functions
9 pages
BDNF 4
No ratings yet
BDNF 4
8 pages
Rods and Beams
No ratings yet
Rods and Beams
12 pages
Module-2 Fundamentals of Surveying
100% (1)
Module-2 Fundamentals of Surveying
2 pages
MI199 Training Flow Measurement
No ratings yet
MI199 Training Flow Measurement
9 pages
Tommy 21b Notes
No ratings yet
Tommy 21b Notes
19 pages
Learning Probabilistic Graphical Models in R - Sample Chapter
No ratings yet
Learning Probabilistic Graphical Models in R - Sample Chapter
37 pages
Data Structures Using C and C++ - Y. Langsam, M. Augenstein and A. M. Tenenbaum
No ratings yet
Data Structures Using C and C++ - Y. Langsam, M. Augenstein and A. M. Tenenbaum
99 pages
Angular 2 Essentials - Sample Chapter
0% (1)
Angular 2 Essentials - Sample Chapter
39 pages
1 Introduction To Vectors: Factorization: A LU - . - . - . - . - . - . - . - . .
No ratings yet
1 Introduction To Vectors: Factorization: A LU - . - . - . - . - . - . - . - . .
2 pages
Practical Linear Algebra
100% (1)
Practical Linear Algebra
253 pages
Pattern Recognition and Machine Learning Errata and Additional Comments
0% (1)
Pattern Recognition and Machine Learning Errata and Additional Comments
7 pages
ADVANCED COLLEGE ALGEBRA STUDY GUIDE
From Everand
ADVANCED COLLEGE ALGEBRA STUDY GUIDE
Harrison K Cook
No ratings yet
Advanced college algebra study guide
From Everand
Advanced college algebra study guide
Harrison Cook
No ratings yet