
Intuitive Machine Learning

Vincent Granville, Ph.D. | www.MLTechniques.com | Version 1.0, September 2022


Preface

This book covers the foundations of machine learning, with modern approaches to solving complex problems.
Emphasis is on scalability, automation, testing, optimization, and interpretability (explainable AI). For instance,
regression techniques – including logistic and Lasso regression – are presented as a single method, without using
advanced linear algebra. There is no need to learn 50 variants when one does it all and more. Confidence regions
and prediction intervals are built using the parametric bootstrap, without statistical models or probability distributions.
Models (including generative models and mixtures) are mostly used to create rich synthetic data to test and
benchmark various methods.
Topics covered include clustering and classification, GPU machine learning, ensemble methods including an
original boosting technique, elements of graph modeling, deep neural networks, auto-regressive and non-periodic
time series, Brownian motions and related processes, simulations, interpolation, random numbers, natural
language processing (smart crawling, taxonomy creation, and structuring unstructured data), computer vision
(shape generation and recognition), curve fitting, cross-validation, goodness-of-fit metrics, feature selection,
gradient methods, optimization techniques, and numerical stability.
Methods are accompanied by enterprise-grade Python code, replicable datasets, and visualizations, including
data animations (GIFs, videos, and even sound, all produced in Python). The code uses various data structures
and library functions, sometimes with advanced options. It constitutes a Python tutorial in itself, and an
introduction to scientific computing. Some data animations and chart enhancements are done in R. The code,
datasets, spreadsheets, and data visualizations are also on GitHub, here.
Chapters are mostly independent of each other, allowing you to read them in any order. A glossary, an index,
and numerous cross-references make navigation easy and unify the chapters. The style is very compact, getting
to the point quickly, and suitable for business professionals eager to learn a lot of useful material in a limited
amount of time. Jargon and arcane theories are absent, replaced by plain English to make reading easier for
non-experts, and to help you discover topics usually inaccessible to beginners.
While state-of-the-art research is presented in all chapters, the prerequisites for reading this book are minimal:
a professional background in analytics, or a first course in calculus and linear algebra. The original presentation
avoids all unnecessary math and statistics, without eliminating advanced topics.
Finally, this book is the main reference for my upcoming course on intuitive machine learning. For details
about the classes, see here.
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central
(acquired by a publicly traded company), former VC-funded executive, author, and patent owner. Vincent’s
past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, and InfoSpace. Vincent is
also a former post-doc at Cambridge University and at the National Institute of Statistical Sciences (NISS).
Vincent has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B),
and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books,
available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical
systems, experimental math, and probabilistic number theory.

Contents

1 Machine Learning Cloud Regression and Optimization 7


1.1 Introduction: circle fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Previous versions of my method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Methodology, implementation details and caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 Solution, R-squared and backward compatibility . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Upgrades to the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Case studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Logistic regression, two ways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Ellipsoid and hyperplane fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2.1 Curve fitting: 250 examples in one video . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2.2 Confidence region for the fitted ellipse . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.2.3 Python code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.3 Non-periodic sum of periodic time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.3.1 Numerical instability and how to fix it . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.3.2 Python code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.4 Fitting a line in 3D, unsupervised clustering, and other generalizations . . . . . . . . . . . 23
1.3.4.1 Example: confidence region for the cluster centers . . . . . . . . . . . . . . . . . 24
1.3.4.2 Exact solution and caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.4.3 Comparison with K-means clustering . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.4.4 Python code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2 A Simple, Robust and Efficient Ensemble Method 31


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.1 How hidden decision trees (HDT) work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.2 NLP Case study: summary and findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.4 Improving the methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Correcting for bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1.1 Time-adjusted scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.2 Excel spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.3 Python code and dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Model-free confidence intervals and perfect nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.1 Interesting asymptotic properties of confidence intervals . . . . . . . . . . . . . . . . . . . 39

3 Gentle Introduction to Linear Algebra 41


3.1 Power of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Examples, Generalization, and Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Example with a non-invertible matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Fast computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.3 Square root of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Application to Machine Learning problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Time series: auto-regressive processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.3 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Mathematics of auto-regressive time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Simulations: curious fractal time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1.1 White noise: Fréchet, Weibull and exponential cases . . . . . . . . . . . . . . . . 46

3.4.1.2 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 Solving Vandermonde systems: a numerically stable method . . . . . . . . . . . . . . . . . 47
3.5 Math for Machine Learning: Must-Read Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4 The Art of Visualizing High Dimensional Data 49


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Spatial time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 Prediction intervals in any dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.3 Supervised classification of an infinite dataset . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.3.1 Machine learning perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.3.2 Six challenging problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.3.3 Mathematical background: the Riemann Hypothesis . . . . . . . . . . . . . . . . 52
4.2.3.4 Partial solutions to the six challenging problems . . . . . . . . . . . . . . . . . . 53
4.2.4 Algorithms with chaotic convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Python code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Paths simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Visual convergence analysis in 2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.3 Supervised classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5 Fast Classification and Clustering via Image Convolution Filters 64


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Generating the synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.1 Simulations with logistic distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.2 Mapping the raw observations onto an image bitmap . . . . . . . . . . . . . . . . . . . . . 66
5.3 Classification and unsupervised clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.1 Supervised classification based on convolution filters . . . . . . . . . . . . . . . . . . . . . 67
5.3.2 Clustering based on histogram equalization . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3.3 Fractal classification: deep neural network analogy . . . . . . . . . . . . . . . . . . . . . . 68
5.3.4 Generalization to higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3.5 Towards a very fast implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4 Python code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.1 Fractal classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2 GPU classification and clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.3 Home-made graphic library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6 Shape Classification via Explainable AI 78


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 Mathematical foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3 Shape signature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.1 Weighted centroid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.2 Computing the signature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4 Shape Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4.1 Shape classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

7 Synthetic Data, Interpretable Regression, and Submodels 84


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Synthetic data sets and the spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2.1 Correlation structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2.2 Standardized regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2.3 Initial conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2.4 Simulations and Excel spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.3 Damping schedule and convergence acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3.1 Spreadsheet implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3.2 Interpretable regression with no overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3.3 Adaptive damping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.4 Performance assessment on synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

7.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.4.2 Distribution-free confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.4.2.1 Parametric bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.5 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.5.1 Combinatorial approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.5.2 Stepwise approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

8 From Interpolation to Fuzzy Regression 96


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.2 Original version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.3 Full, non-linear model in higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.3.1 Geometric proximity, weights, and numerical stability . . . . . . . . . . . . . . . . . . . . 98
8.3.2 Predicted values and prediction intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.3.3 Illustration, with spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.3.3.1 Output fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.4.1 Performance assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.4.2 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.4.3 Amplitude restoration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.6 Python source code and datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

9 Detecting Subtle Departures from Randomness 107


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.2 Pseudo-random numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.2.1 Strong pseudo-random numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.2.1.1 New test of randomness for PRNGs . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.2.1.2 Theoretical background: the law of the iterated logarithm . . . . . . . . . . . . . 109
9.2.1.3 Connection to the Generalized Riemann Hypothesis . . . . . . . . . . . . . . . . 109
9.2.2 Testing well-known sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9.2.2.1 Reverse-engineering a pseudo-random sequence . . . . . . . . . . . . . . . . . . . 111
9.2.2.2 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.3 Python code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.3.1 Fixes to the faulty random function in Python . . . . . . . . . . . . . . . . . . . . . . . . 114
9.3.2 Prime test implementation to detect subtle flaws in PRNGs . . . . . . . . . . . . . . . . 114
9.3.3 Special formula to compute 10 million digits of √2 . . . . . . . . . . . . . . . . . . . . . 117

10 Some Unusual Random Walks 121


10.1 Symmetric unbiased constrained random walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
10.1.1 Three fundamental properties of pure random walks . . . . . . . . . . . . . . . . . . . . . 121
10.1.2 Random walks with more entropy than pure random signal . . . . . . . . . . . . . . . . . 122
10.1.2.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
10.1.2.2 Algorithm to generate quasi-random sequences . . . . . . . . . . . . . . . . . . . 123
10.1.2.3 Variance of the modified random walk . . . . . . . . . . . . . . . . . . . . . . . . 123
10.1.3 Random walks with less entropy than pure random signal . . . . . . . . . . . . . . . . . . 124
10.2 Related stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.2.1 From Brownian motions to clustered Lévy flights . . . . . . . . . . . . . . . . . . . . . . . 125
10.2.2 Integrated Brownian motions and special auto-regressive processes . . . . . . . . . . . . . 126
10.3 Python code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
10.3.1 Computing probabilities and variances attached to Sn . . . . . . . . . . . . . . . . . . . . 127
10.3.2 Path simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

11 Miscellaneous Topics 130


11.1 The sound that data makes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
11.1.1 From data visualizations to videos to data music . . . . . . . . . . . . . . . . . . . . . . . 130
11.1.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
11.1.3 Python code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
11.2 Data videos and enhanced visualizations in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
11.2.1 Cairo library to produce better charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
11.2.2 AV library to produce videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

11.3 Dual confidence regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
11.3.1 Case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
11.3.2 Standard confidence region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
11.3.3 Dual confidence region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
11.3.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
11.3.5 Original problem with minimum contrast estimators . . . . . . . . . . . . . . . . . . . . . 136
11.3.6 General shape of confidence regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
11.4 Fast feature selection based on predictive power . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
11.4.1 How cross-validation works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.4.2 Measuring the predictive power of a feature . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.4.3 Efficient implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
11.5 Natural language processing: taxonomy creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.5.1 Designing a keyword taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.5.2 Fast clustering algorithm for keyword data . . . . . . . . . . . . . . . . . . . . . . . . . . 142
11.5.2.1 Computational complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
11.5.2.2 Smart crawling of the whole Internet and a bit of graph theory . . . . . . . . . . 143
11.6 Automated detection of outliers and number of clusters . . . . . . . . . . . . . . . . . . . . . . . 144
11.6.1 Black-box elbow rule to detect outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.7 Advice to beginners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
11.7.1 Getting started and learning how to learn . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
11.7.1.1 Getting help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.7.1.2 Beyond Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.7.2 Automated data cleaning and exploratory analysis . . . . . . . . . . . . . . . . . . . . . . 146
11.7.3 Example of simple analysis: marketing attribution . . . . . . . . . . . . . . . . . . . . . . 147
11.7.4 Upcoming books and courses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

Glossary 149

Bibliography 152

Index 154

Glossary

Autoregressive process Auto-correlated time series, as described in section 3.4. Time-continuous versions
include Gaussian processes and Brownian motions, while random walks are a discrete
example; two-dimensional versions exist. These processes are essentially integrated
white noise. See pages 44, 92, 126
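To make the definition concrete, here is a minimal NumPy sketch (not taken from the book's code) that simulates an AR(1) process, the simplest auto-regressive model; the coefficient phi = 0.8 is an arbitrary illustrative choice.

import numpy as np

def simulate_ar1(n=500, phi=0.8, sigma=1.0, seed=0):
    # AR(1): x[t] = phi * x[t-1] + Gaussian white noise
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x

series = simulate_ar1()
# The lag-1 auto-correlation is close to phi for a long enough series
print(np.corrcoef(series[:-1], series[1:])[0, 1])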
Binning Feature binning consists of aggregating the values of a feature into a small number
of bins, to avoid overfitting and reduce the number of nodes in methods such as
naive Bayes, neural networks, or decision trees. Binning can be applied to two or
more features simultaneously. I discuss optimum binning in this book. See pages
32, 68, 139
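As a simple illustration (a sketch of quantile-based binning, not the optimum binning method discussed in the book), here is how a single feature can be aggregated into four bins with NumPy; the feature values are made up.

import numpy as np

rng = np.random.default_rng(42)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # hypothetical feature

edges = np.quantile(feature, [0.25, 0.50, 0.75])   # interior cut points
bin_id = np.digitize(feature, edges)               # bin label in {0, 1, 2, 3}

for b in range(4):
    print(f"bin {b}: {np.sum(bin_id == b)} observations")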
Boosted model Blending of several models to get the best of each one, also referred to as ensemble
methods. The concept is illustrated with hidden decision trees in this book. Other
popular examples are gradient boosting and AdaBoost. See pages 31, 149
Bootstrapping A data-driven, model-free technique to estimate parameter values by optimizing
goodness-of-fit metrics. Related to resampling in the context of cross-validation.
In this book, I discuss the parametric bootstrap on synthetic data that mimics the
actual observations. See pages 10, 91
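For illustration only, a minimal parametric bootstrap sketch for the mean of a sample, assuming a Gaussian model; the distribution, sample size, and 95% level are arbitrary choices, not the book's examples.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)      # observed sample

# Fit a simple parametric model (Gaussian) to the observations
mu_hat, sigma_hat = data.mean(), data.std(ddof=1)

# Generate synthetic datasets from the fitted model and re-estimate the mean
boot_means = np.array([
    rng.normal(mu_hat, sigma_hat, size=data.size).mean()
    for _ in range(5000)
])

# Percentile confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% parametric bootstrap CI for the mean: [{lo:.3f}, {hi:.3f}]")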
Confidence Region A confidence region of level γ is a 2D set of minimum area covering a proportion
γ of the mass of a bivariate probability distribution. It is a 2D generalization
of confidence intervals. In this book, I also discuss dual confidence regions – the
analogue of credible regions in Bayesian inference. See pages 7, 10, 13, 15, 24, 134,
137
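A rough sketch of the idea (assuming SciPy is available; this is a kernel density approximation, not the book's parametric bootstrap construction): keep the points whose estimated density exceeds the threshold that leaves a proportion γ of the mass above it.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 1.0]]
sample = rng.multivariate_normal([0.0, 0.0], cov, size=2000)   # synthetic bivariate data

gamma = 0.95
kde = gaussian_kde(sample.T)        # density estimate on the sample
density = kde(sample.T)

# Points with density above this threshold form an approximate
# minimum-area region covering a proportion gamma of the mass
threshold = np.quantile(density, 1 - gamma)
inside = density >= threshold
print(f"fraction of sample inside the {gamma:.0%} region: {inside.mean():.3f}")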
Cross-validation Standard procedure used in bootstrapping, and to test and validate a model, by
splitting your data into training and validation sets. Parameters are estimated
based on training set data. An alternative to cross-validation is testing your model
on synthetic data with known response. See pages 10, 32, 88, 94, 139, 149
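The basic building block is a random split into training and validation sets; here is a minimal sketch (illustrative only, with an arbitrary 20% holdout).

import numpy as np

def train_validation_split(X, y, validation_fraction=0.2, seed=0):
    # Shuffle the indices, then carve out the validation portion
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(validation_fraction * len(X))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_train, X_val, y_train, y_val = train_validation_split(X, y)
print(X_train.shape, X_val.shape)   # (40, 2) (10, 2)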
Decision trees A simple, intuitive non-linear modeling technique used in classification problems.
It can handle missing and categorical data, as well as a large number of features,
but requires appropriate feature binning. Typically, one blends multiple binary trees,
each with a few nodes, to boost performance. See pages 31, 32, 34, 36, 149, 150
Dimension reduction A technique to reduce the number of features in your dataset while minimizing the
loss in predictive power. The best-known methods are principal component analysis,
and feature selection that maximizes goodness-of-fit metrics. See pages 7, 11, 150, 151
Empirical distribution Cumulative frequency histogram attached to a statistic (for instance, nearest
neighbor distances), and based on observations. When the number of observations tends
to infinity and the bin sizes tend to zero, this step function tends to the theoretical
cumulative distribution function of the statistic in question. See pages 11, 91, 108
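A minimal sketch of the empirical CDF, evaluated on an exponential sample so it can be compared with the known theoretical CDF 1 - exp(-t); the sample size and distribution are arbitrary illustrative choices.

import numpy as np

def empirical_cdf(sample, t):
    # Proportion of observations at or below t: the ECDF evaluated at t
    return np.mean(np.asarray(sample) <= t)

sample = np.random.default_rng(1).exponential(scale=1.0, size=5000)
for t in (0.5, 1.0, 2.0):
    print(t, empirical_cdf(sample, t), 1 - np.exp(-t))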
Ensemble methods A technique consisting of blending multiple models together, such as many decision
trees with logistic regression, to get the best of each method and outperform each
method taken separately. Examples include boosting, bagging, and AdaBoost. In
this book, I discuss hidden decision trees. See pages 31, 78, 149
Explainable AI Automated machine learning techniques that are easy to interpret are referred to
as interpretable machine learning or explainable artificial intelligence. As much as
possible, the methods discussed in this book belong to that category. The goal is to
design black-box systems less likely to generate unexpected results with unintended
consequences. See pages 8, 64, 69, 78, 85

Feature selection Features – as opposed to the model response – are also called independent variables
or predictors. Feature selection, akin to dimensionality reduction, aims at
finding the minimum subset of variables with enough predictive power. It is also
used to eliminate redundant features and find causality (typically using hierarchical
Bayesian models), as opposed to mere correlations. Sometimes, two features have
poor predictive power when taken separately, but provide improved predictions when
combined. See pages 7, 10, 32, 89, 92, 130, 138, 149, 151
Goodness-of-fit A model fitting criterion or metric to assess how well a model or sub-model fits a
dataset, or to measure its predictive power on a validation set. Examples include
R-squared, Chi-squared, Kolmogorov-Smirnov, error rates such as the false positive rate,
and other metrics discussed in this book. See pages 10, 51, 88, 89, 139, 149, 151
Gradient methods Iterative optimization techniques to find the minimum or maximum of a function,
such as the maximum likelihood. When there are numerous local minima or maxima,
use swarm optimization. Gradient methods (for instance, stochastic gradient
descent or Newton’s method) assume that the function is differentiable. If not,
other techniques such as Monte Carlo simulations or the fixed-point algorithm can
be used. Constrained optimization involves using Lagrange multipliers. See pages
10, 26, 50, 84
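A minimal gradient descent sketch on a toy differentiable function; the learning rate and iteration count are arbitrary, and this is not the book's implementation.

import numpy as np

def gradient_descent(grad, x0, learning_rate=0.1, n_iter=200):
    # Basic gradient descent; assumes the objective is differentiable
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - learning_rate * grad(x)
    return x

# Minimize f(x, y) = (x - 3)^2 + 2 * (y + 1)^2, whose gradient is given below
grad_f = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # approaches (3, -1)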
Graph structures Graphs are found in decision trees, in neural networks (connections between neurons),
in nearest neighbors methods (NN graphs), in hierarchical Bayesian models,
and more. See pages 65, 69, 142, 143
Hyperparameter A hyperparameter is used to control the learning process: for instance, the dimension,
the number of features, parameters, layers (neural networks) or clusters
(clustering problems), or the width of a filtering window in image processing. By
contrast, the values of other parameters (typically node weights in neural networks
or regression coefficients) are derived via training. See pages 24, 51, 65, 70, 96, 150
Link function A link function maps a nonlinear relationship to a linear one so that a linear model
can be fit, and then mapped back to the original form using the inverse function.
For instance, the logit link function is used in logistic regression. Generalizations
include quantile functions and inverse sigmoids in neural networks, to work with
additive (linear) parameters. See pages 8, 11, 150
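A minimal sketch of the logit link and its inverse (the sigmoid), the pair used in logistic regression; the probability value is arbitrary.

import numpy as np

def logit(p):
    # Link function: maps a probability in (0, 1) to the real line
    return np.log(p / (1 - p))

def inverse_logit(z):
    # Inverse link (sigmoid): maps a linear score back to a probability
    return 1 / (1 + np.exp(-z))

p = 0.73
z = logit(p)                  # linear scale used by the regression
print(z, inverse_logit(z))    # recovers 0.73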
Logistic regression A generalized linear regression method where the binary response (fraud/non-fraud
or cancer/non-cancer) is modeled as a probability via the logistic link function.
Alternatives to the iterative maximum likelihood solution are discussed in this book.
See pages 11, 28, 31, 35, 149, 150
Neural network A black-box system used for predictions, optimization, or pattern recognition, especially
in computer vision. It consists of layers, neurons in each layer, link functions
to model non-linear interactions, parameters (weights associated with the connections
between neurons), and hyperparameters. Networks with several layers are called
deep neural networks. Also, neurons are sometimes called nodes. See pages 64, 68,
70, 78, 96, 149, 150
NLP Natural language processing is a set of techniques to deal with unstructured text
data, such as emails, automated customer support, or webpages downloaded with
a crawler. The example discussed in section 11.5 deals with creating a keyword
taxonomy based on parsing Google search results pages. See pages 31, 141
Numerical stability This issue, which occurs in unstable optimization problems, typically with multiple
minima or maxima, is frequently overlooked and leads to poor predictions or high volatility.
It is sometimes referred to as an ill-conditioned problem. I explain how to fix it
in several examples in this book, for instance in section 3.4.2. Not to be confused
with numerical precision. See pages 7, 9, 54
Overfitting Using too many unstable parameters resulting in excellent performance on the training
set, but poor performance on future data or on the validation set. It typically
occurs with numerically unstable procedures such as regression (especially polynomial
regression) when the training set is not large enough, or in the presence of
wide data (more features than observations) when using a method not suited to this
situation. The opposite is underfitting. See pages 10, 87, 96, 149, 151

Predictive power A metric to assess the goodness-of-fit or performance of a model or subset of features,
for instance in the context of dimensionality reduction or feature selection. Typical
metrics include R-squared, or confusion matrices in classification. See pages 33, 35,
39, 138, 140, 150
R-squared A goodness-of-fit metric to assess the predictive power of a model, measured on a
validation set. Alternatives include adjusted R-squared, mean absolute error and
other metrics discussed in this book. See pages 7, 10, 51, 85, 88, 90, 92, 99, 150, 151
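A minimal sketch of the standard R-squared computation, one minus the ratio of residual to total sum of squares; the numbers are made up for illustration.

import numpy as np

def r_squared(y_true, y_pred):
    # R-squared = 1 - SS_res / SS_tot
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 7.2, 9.4, 10.6])
print(r_squared(y_true, y_pred))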
Random number Pseudo-random numbers are sequences of binary digits, usually grouped into blocks,
satisfying properties of independent Bernoulli trials. In this book, the concept
is formally defined, and strong pseudo-random number generators are built and used in
computer-intensive simulations. See pages 24, 107, 114, 144
Regression methods I discuss a unified approach to all regression problems in chapter 1. Traditional
techniques include linear, logistic, Bayesian, polynomial and Lasso regression (to deal
with numerical instability and overfitting), solved using optimization techniques,
maximum likelihood methods, linear algebra (eigenvalues and singular value decomposition)
or stepwise procedures. See pages 7, 8, 10, 11, 14, 22, 31, 35, 41, 44,
47, 51, 84, 90, 96, 103, 150, 151
Supervised learning Techniques dealing with labeled data (classification) or when the response is known
(regression). The opposite is unsupervised learning, for instance clustering problems.
In between, you have semi-supervised learning and reinforcement learning
(favoring good decisions). The technique described in chapter 1 fits into unsupervised
regression. Adversarial learning consists of testing your model against extreme cases
intended to make it fail, in order to build better models. See page 151
Synthetic data Artificial data simulated using a generative model, typically a mixture model, to
enrich existing datasets and improve the quality of training sets. Called augmented
data when blended with real data. See pages 7, 8, 10, 12, 22, 24, 28, 43, 50, 64, 65,
70, 83, 89, 100, 122, 135, 144, 149
Tensor Matrix generalization with three or more dimensions. A matrix is a two-dimensional
tensor. A triple summation with three indices is represented by a three-dimensional
tensor, while a double summation involves a standard matrix. See pages 64, 69
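To illustrate the triple summation mentioned above, a small NumPy sketch comparing np.einsum with explicit loops; the tensor and vectors are random, for illustration only.

import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(3, 4, 5))                       # three-dimensional tensor
x, y, z = rng.normal(size=3), rng.normal(size=4), rng.normal(size=5)

# Triple summation: sum over i, j, k of T[i, j, k] * x[i] * y[j] * z[k]
via_einsum = np.einsum('ijk,i,j,k->', T, x, y, z)
via_loops = sum(T[i, j, k] * x[i] * y[j] * z[k]
                for i in range(3) for j in range(4) for k in range(5))
print(via_einsum, via_loops)   # identical up to floating-point error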
Training set Dataset used to train your model in supervised learning. Typically, a portion of the
training set is used to train the model, while the other part is used as the validation set.
See pages 8, 10, 12, 15, 24, 31, 35, 51, 67, 83, 90, 96, 100, 139, 149, 150, 151
Validation set A portion of your training set, typically 20%, used to measure the actual performance
of your predictive algorithm outside the training set. In cross-validation and
bootstrapping, the training and validation sets are split into multiple subsets to get
a better sense of variations in the predictions. See pages 10, 22, 36, 51, 88, 96, 139,
149, 150, 151

Bibliography

[1] Weighted percentiles using numpy. Forum discussion, 2020. StackOverflow [Link]. 96
[2] Jan Ackmann et al. Machine-learned preconditioners for linear solvers in geophysical fluid flows. Preprint,
pages 1–19, 2020. arXiv:2010.02866 [Link]. 88
[3] Rabi Bhattacharya and Edward Waymire. Random Walk, Brownian Motion, and Martingales. Springer,
2021. 121
[4] Barbara Bogacka. Lecture Notes on Time Series. 2008. Queen Mary University of London [Link]. 44
[5] Oliver Bröker and Marcus J. Grote. Sparse approximate inverse smoothers for geometric and algebraic
multigrid. Applied Numerical Mathematics, 41(1):61–80, 2002. 85
[6] Oliver Chikumbo and Vincent Granville. Optimal clustering and cluster identity in understanding high-
dimensional data spaces with tightly distributed points. Machine Learning and Knowledge Extraction,
1(2):715–744, 2019. 145
[7] Keith Conrad. L-functions and the Riemann Hypothesis. 2018. 2018 CTNT Summer School [Link]. 110
[8] D.J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer, second edition,
2002. Volume 1 – Elementary Theory and Methods. 125
[9] D.J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer, second edition,
2014. Volume 2 – General Theory and Structure. 125
[10] Tilman M. Davies and Martin L. Hazelton. Assessing minimum contrast parameter estimation for spatial
and spatiotemporal log-Gaussian Cox processes. Statistica Neerlandica, 67(4):355–389, 2013. 136
[11] Marc Deisenroth, A. Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge Uni-
versity Press, 2020. [Link]. 48
[12] Harold G. Diamond and Wen-Bin Zhang. Beurling Generalized Numbers. American Mathematical Society,
2016. Mathematical Surveys and Monographs, Volume 213 [Link]. 111
[13] Arash Farahmand. Math 55 Lecture Notes. 2021. University of Berkeley [Link]. 42, 48
[14] P. A. Van Der Geest. The binomial distribution with dependent Bernoulli trials. Journal of Statistical
Computation and Simulation, pages 141–154, 2004. [Link]. 122
[15] Stamatia Giannarou and Tania Stathaki. Shape signature matching for object identification invariant to
image transformations and occlusion. 2007. ResearchGate [Link]. 79
[16] B.V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of Independent Random Variables.
Addison-Wesley, 1954. 126
[17] Manuel González-Navarrete and Rodrigo Lambert. Non-markovian random walks with memory lapses.
Preprint, pages 1–14, 2018. arXiv [Link]. 121
[18] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. [Link]. 48
[19] Vincent Granville. Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numera-
tion Systems. MLTechniques.com, 2018. [Link]. 111
[20] Vincent Granville. New perspective on the Riemann Hypothesis. Preprint, pages 1–23, 2022. MLTech-
niques.com [Link]. 20, 21
[21] Vincent Granville. Stochastic Processes and Simulations: A Machine Learning Perspective. MLTech-
niques.com, 2022. [Link]. 28, 46, 54, 65, 83, 85, 126, 134, 138, 147
[22] Vincent Granville, Mirko Krivanek, and Jean-Paul Rasson. Simulated annealing: A proof of convergence.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:652–656, 1996. 67
[23] Vincent Granville and Richard L Smith. Disaggregation of rainfall time series via Gibbs sampling. NISS
Technical Report, pages 1–21, 1996. [Link]. 102
[24] Kristen Grauman. Shape matching. 2008. University of Texas, Austin [Link]. 82

[25] Radim Halir and Jan Flusser. Numerically stable direct least squares fitting of ellipses. Preprint, pages
1–8, 1998. [Link]. 12, 14
[26] Adam J. Harper. Moments of random multiplicative functions, II: High moments. Algebra and Number
Theory, 13(10):2277–2321, 2019. [Link]. 107
[27] Adam J. Harper. Moments of random multiplicative functions, I: Low moments, better than squareroot
cancellation, and critical multiplicative chaos. Forum of Mathematics, Pi, 8:1–95, 2020. [Link]. 107, 109
[28] Adam J. Harper. Almost sure large fluctuations of random multiplicative functions. Preprint, pages 1–38,
2021. arXiv [Link]. 109
[29] T. W. Hilberdink and M. L. Lapidus. Beurling Zeta functions, generalised primes, and fractal membranes.
Preprint, pages 1–31, 2004. arXiv [Link]. 110, 111
[30] Christian Hill. Learning Scientific Programming with Python. Cambridge University Press, 2016. [Link].
14
[31] Robert V. Hogg, Joseph W. McKean, and Allen T. Craig. Introduction to Mathematical Statistics. Pearson,
eighth edition, 2016. [Link]. 48
[32] Zhiqiu Hu and Rong-Cai Yang. A new distribution-free approach to constructing the confidence region for
multiple parameters. PLOS One, pages 1–13, 2013. [Link]. 135
[33] Chigozie Kelechi. Towards efficiency in the residual and parametric bootstrap techniques. American Journal
of Theoretical and Applied Statistics, 5(5), 2016. [Link]. 92
[34] Yuk-Kam Lau, Gerald Tenenbaum, and Jie Wu. On mean values of random multiplicative functions.
Proceedings of the American Mathematical Society, 142(2):409–420, 2013. [Link]. 107, 109
[35] Jing Lei et al. Distribution-free predictive inference for regression. Journal of the American Statistical
Association, 113:1094–1111, 2018. [Link]. 92
[36] Christoph Molnar. Interpretable Machine Learning. ChristophMolnar.com, 2022. [Link]. 92
[37] Marc-Andreas Muendler. Linear difference equations and autoregressive processes. 2000. University of
Berkeley [Link]. 44
[38] Peter Mörters and Yuval Peres. Brownian Motion. Cambridge University Press, 2010. Cambridge Series
in Statistical and Probabilistic Mathematics, Volume 30 [Link]. 121, 125
[39] Jesper Møller. Introduction to spatial point processes and simulation-based inference. In International
Center for Pure and Applied Mathematics (Lecture Notes), Lomé, Togo, 2018. 136
[40] Guillermo Navas-Palencia. Optimal binning: mathematical programming formulation. Preprint, pages
1–21, 2020. arXiv:2001.08025 [Link]. 32
[41] Fred Park. Shape descriptor / feature extraction techniques. 2011. UCI iCAMP 2011 [Link]. 79
[42] Carl Rasmussen and Christopher Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[Link]. 47
[43] Kamron Saniee. A simple expression for multivariate Lagrange interpolation. SIAM Undergraduate Re-
search Online, 2007. SIURO [Link]. 98
[44] Luuk Spreeuwers. Image Filtering with Neural Networks: Applications and Performance Evaluation. PhD
thesis, University of Twente, 1992. 68
[45] E.C. Titchmarsh and D.R. Heath-Brown. The Theory of the Riemann Zeta-Function. Oxford Science
Publications, second edition, 1987. 53, 110
[46] Chris Tofallis. Fitting equations to data with the perfect correlation relationship. Preprint, pages 1–11,
2015. Hertfordshire Business School Working Paper [Link]. 8
[47] D. Umbach and K.N. Jones. A few methods for fitting circles to data. IEEE Transactions on Instrumen-
tation and Measurement, 52(6):1881–1885, 2003. [Link]. 9, 12
[48] D. A. Vaccari and H. K. Wang. Multivariate polynomial regression for identification of chaotic time series.
Mathematical and Computer Modelling of Dynamical Systems, 13(4):1–19, 2007. [Link]. 12
[49] Yu Vizilter and Sergey Zheltov. Geometrical correlation and matching of 2D image shapes. 2012. Re-
searchGate [Link]. 81
[50] Lan Wu, Yongcheng Qi, and Jingping Yang. Asymptotics for dependent Bernoulli random variables.
Statistics and Probability Letters, pages 455–463, 2012. [Link]. 121
[51] Shaohong Yan, Aimin Yang, et al. Explicit algorithm to the inverse of Vandermonde matrix. In 2009
International Conference on Test and Measurement, 2009. IEEE [Link]. 42

Index

α-compositing, 52 confidence interval, 39, 149


m-interlacing, 66 confidence level, 134, 137
confidence region, 10, 24, 134
A/B testing, 147 dual region, 39, 135, 149
AdaBoost, 31, 149 conformal map, 9
adversarial learning, 151 confusion matrix, 139, 151
algebraic number, 112 connected components, 142
analytic function, 110 contour level, 137
anti-aliasing, 50, 54, 132 contour plot, 137
association rule, 140 convergence
attraction basin, 53 conditional, 109
attractor distribution, 126 convergence acceleration, 54
augmented data, 83, 151 convex linear combination, 103
auto-correlation, 44, 111 covariance matrix, 85, 134
auto-regressive process, 44, 126, 149 credible interval, 39
credible region (Bayesian), 135, 149
Bailey–Borwein–Plouffe formulas, 112 cross-validation, 10, 139
Bayesian classification, 69 curve fitting, 21
Bayesian inference, 39
hierarchical models, 150 decision tree, 31
naive Bayes, 140 Dedekind zeta function, 110
Bernoulli trials, 134 deep neural network, 68, 150
Berry-Esseen inequality, 108 Diehard tests of randomness, 109
Beurling primes, 111 dimensionality reduction, 11
binning, 139 Dirichlet character, 110, 111
optimum binning, 32, 149 Dirichlet functional equation, 110
bootstrapping, 10, 91 Dirichlet series, 107
percentile method, 96 Dirichlet theorem, 110
Brownian motion, 46, 121, 125, 144, 149 Dirichlet-L function, 110
Lévy flight, 126 dissimilarity metric, 141
distributed architecture, 138
Cauchy distribution, 126 dot product, 9
Cauchy-Riemann equations, 110 dummy variable, 31
causality, 150 dyadic map, 111
Cayley-Hamilton theorem, 42 dynamical systems, 111
CDF regression, 12 dyadic map, 111
central limit theorem, 126, 134 ergodicity, 111
characteristic polynomial, 42, 44, 45, 127 logistic map, 111
Chebyshev’s bias (prime numbers), 110 shift map, 111
checksum, 146
Chi-squared, 150 eigenvalue, 8, 47, 85, 151
classification, 151 power iteration, 87
clustering, 151 elbow rule, 144
Collatz conjecture, 118 empirical distribution, 11, 91
color model multivariate, 108
RGB, 50, 133 empirical quantiles, 96
RGBA, 50, 51, 71, 133 ensemble methods, 31, 78
color transparency, 15, 133 entropy, 140
complex random variable, 107 equidistribution modulo 1, 114
computational complexity, 141 ergodicity (dynamical systems), 111
computer vision, 7, 78 Euler product, 107

experimental design, 147 Lasso regression, 10, 151
experimental math, 51 law of the iterated logarithm, 108, 109, 121
explainable AI, 8, 69, 78, 85 least absolute residuals, 96
exploratory analysis, 146 link function, 8, 11
exponential decay, 35 log-polar map, 9
extrapolation, 103 logistic distribution, 11
extreme value theory, 126 logistic map, 111
logistic regression, 11
feature selection, 10, 92, 138 unsupervised, 28
fixed-point algorithm, 54, 84, 150 logit function, 150
flag vector, 140, 147 Lévy distribution, 126
fractal dimension, 46 Lévy flight, 126
fractional part function, 113
Frobenius norm, 85 Map-reduce, 138
Fruchterman and Rheingold algorithm, 142 marketing attribution, 147
Fréchet distribution, 46, 126 Markov chain, 44
fuzzy classification, 51 MCMC, 107
Mathematica, 137
Gamma function, 46, 126 maximum likelihood estimation, 136, 150, 151
Gaussian distribution, 134 mean squared error, 10, 25
Gaussian mixture, 65 medoid, 26
Gaussian primes, 110 Mersenne twister, 24, 111, 114, 123
Gaussian process, 44, 149 minimum contrast estimation, 136
general linear model, 8 mixture model, 24, 40, 137, 151
generalized linear model, 8, 43 model fitting, 51, 150
generalized logistic distribution, 85 model identifiability, 10
generative model, 151 modulus (complex number), 127
geostatistics, 97 Monte Carlo simulations, 107, 150
goodness-of-fit, 51, 139 multiplicative function
GPU-based clustering, 66 completely multiplicative, 107, 109
gradient boosting, 149 Rademacher, 107
gradient operator, 10
graph database, 142 n-gram (NLP), 141
naive Bayes, 140, 149
half-tone (music), 130 natural language processing, 31, 141
Hartman–Wintner theorem, 121 nearest neighbor interpolation, 96, 99
hash table, 140, 141 nearest neighbors method, 150
sparse, 141 neural network, 68
Hausdorff distance, 82 hidden layer, 68
hidden decision trees, 31, 32, 149 hyperparameter, 70
hidden layer, 68 neuron, 68, 150
hierarchical clustering, 68, 141 sparse, 64
histogram equalization, 66, 68 very deep, 68
Hoeffding inequality, 124 node (decision tree), 32, 149
Hotelling distribution, 135 perfect node, 39
Hurst exponent, 46 usable node, 33
hyperparameter, 24, 51, 98 normal number, 108
strongly normal, 109
ill-conditioned problem, 21, 47, 87, 150 numerical stability, 42
image segmentation, 68
interarrival times, 125 ordinary least squares, 44, 96
inverse distance weighting, 99 outliers, 144
iterated logarithm, 108, 109, 121 overfitting, 10, 149
Itô integral, 47
palette, 133
K-means clustering, 26, 27 parametric bootstrap, 15, 24, 92, 149
key-value pair, 32, 140 partial least squares, 8
Kolmogorov-Smirnov, 150 percentile bootstrap, 96
Kolmogorov-Smirnov test, 108 Poisson point process, 125
positive semidefinite (matrix), 43, 86
Lagrange interpolation, 47 power iteration, 87
Lagrange multiplier, 10, 150 preconditioning, 87

prediction interval, 10, 91, 96 square-free integer, 108
predictive power, 32, 39, 139, 140 stable distribution, 126
prime test (of randomness), 109, 122 stationary distribution, 47
principal component analysis, 42, 149 stationary process, 44, 126
probability distribution stepwise regression, 93
Cauchy, 126 stochastic function, 46
Fréchet, 46, 126 stop word (NLP), 141
Gaussian, 134 supervised classification, 66
generalized logistic, 85 swarm optimization, 22, 150
Hotelling, 135 synthetic data, 8, 22, 24, 83, 85, 122, 135
logistic, 11 synthetic metric, 140
Lévy, 126
Rademacher, 107, 108 Tarjan’s algorithm, 142
Weibull, 46, 126 tensor, 69
probability generating function, 122 text normalization, 141
proxy space, 137 Theil-Sen estimator, 96
pseudo-inverse matrix, 43 time series, 45
pseudo-random numbers, 123, 144 auto-regressive, 45, 126
congruential generator, 114 disaggregation, 102
Diehard tests, 109 Hurst exponent, 46
Mersenne twister, 114, 123 non-periodic, 20
prime test, 109, 122 total least squares, 8
strongly random, 109, 112 training set, 96, 139
TestU01, 109 transcendental number, 112

quadratic irrational, 111, 114 unsupervised clustering, 66


quantile, 135, 150 unsupervised learning, 28, 151
quantile regression, 10
quantiles validation set, 10, 51, 96, 139
empirical, 96 Vandermonde matrix, 42, 47
weighted, 96 video compression
FFmpeg, 50, 54
R-squared, 10
Rademacher distribution, 108 Watts and Strogatz model, 143
Rademacher function, 107 Weibull distribution, 46, 126
random multiplicative function, 107 weighted least squares, 8
Rademacher, 109 weighted quantiles, 96
random variable weighted regression, 11
complex, 107 white noise, 22, 44, 126, 149
random walk, 121, 149 wide data, 150
first hitting time, 122, 125
XOR operator, 114
zero crossing, 121
regression splines, 8
regular expression, 141, 146
reinforcement learning, 151
resampling, 91
Riemann Hypothesis, 102
Generalized, 109
Riemann zeta function, 107, 110
root mean squared error, 51

semi-supervised learning, 151


shape signature, 79
Shepard’s method, 99
shift map, 111
sigmoid function, 150
singular value decomposition, 8, 151
six degrees of separation, 143
smoothing parameter, 98
spatial statistics, 97
square root (matrix), 43, 86

