Introduction to
Computational Data Science
Using ScalaTion
...
John A. Miller
Department of Computer Science
University of Georgia
...
Brief Table of Contents
I Foundations 47
2 Linear Algebra 49
2.1 Linear System of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4 Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.5 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7 Internal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3 Probability 69
3.1 Probability Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Probability Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Empirical Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.6 Algebra of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.7 Median, Mode and Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.8 Joint, Marginal and Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.9 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.10 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.11 Estimating Parameters from Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.12 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.14 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.15 Notational Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.16 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
II Modeling 149
6 Prediction 151
6.1 Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Quality of Fit for Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.4 Simpler Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.5 Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.6 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.7 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.8 Lasso Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.9 Quadratic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.10 Cubic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.11 Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.12 Transformed Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.13 Regression with Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.14 Weighted Least Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.15 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.16 Trigonometric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
7 Classification 257
7.1 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.2 Quality of Fit for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.4 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.5 Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.6 Tree Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.7 Bayesian Network Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.8 Markov Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.9 Decision Tree ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.10 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
10.11 2D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.12 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.13 Extreme Learning Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
14 Clustering 545
14.1 KNN Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.2 Clusterer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
14.3 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.4 K-Means Clustering - Hartigan-Wong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
14.5 K-Means++ Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.6 Clustering Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.7 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.8 Markov Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
III Simulation 567
Appendices 721
A Optimization in Data Science 723
A.1 Partial Derivatives and Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
A.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728
A.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
A.4 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
A.5 Stochastic Gradient Descent with Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
A.6 SGD with ADAptive Moment Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
A.7 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
A.8 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
A.9 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.10 Method of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.11 Karush-Kuhn-Tucker Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.12 Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
A.13 Augmented Lagrangian Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.14 Alternating Direction Method of Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
A.15 Nelder-Mead Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
Contents
I Foundations 47
2 Linear Algebra 49
2.1 Linear System of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.1 Vector Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.2 Element-wise Multiplication and Division . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.3 Vector Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.4 Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.5 Vector Operations in ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.4 Vector Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.1 Gradient Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.2 Jacobian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.4.3 Hessian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.5 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.5.1 Matrix Operation in ScalaTion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.6.1 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.7 Internal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.8.1 Three Dimensional Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.8.2 Four Dimensional Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3 Probability 69
3.1 Probability Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.1 Joint Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.2 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.1 Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.2 Continuous Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Probability Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.1 Cumulative Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.2 Probability Mass Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.3 Probability Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Empirical Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.1 Continuous Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.2 Discrete Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.6 Algebra of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.1 Expectation is a Linear Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.2 Variance is not a Linear Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.3 Convolution of Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.4 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.7 Median, Mode and Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.1 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.2 Quantile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.7.3 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.8 Joint, Marginal and Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.1 Discrete Case: Joint and Marginal Mass . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.2 Continuous Case: Joint and Marginal Density . . . . . . . . . . . . . . . . . . . . . . . 86
3.8.3 Discrete Case: Conditional Mass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8.4 Continuous Case: Conditional Density . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.8.6 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.8.7 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.9 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.10 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.11 Estimating Parameters from Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.11.1 Sample Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.11.2 Confidence Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.11.3 Estimation for Discrete Outcomes/Responses . . . . . . . . . . . . . . . . . . . . . . . 97
3.12 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.12.1 Positive Log Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.12.2 Joint Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.3 Conditional Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.4 Relative Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.12.5 Cross Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.12.6 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.12.7 Probability Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.14 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.15 Notational Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.16 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5 Data Preprocessing 141
5.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.1 Remove Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.2 Convert String Columns to Numeric Columns . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.3 Identify Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.4 Preliminary Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.2 Methods for Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.1 Based on Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.2 Based on InterQuartile Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.2.3 Based on Quantiles/Percentiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.3 Imputation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.3.1 Imputation Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.4 Align Multiple Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5 Creating Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
II Modeling 149
6 Prediction 151
6.1 Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.1.1 Predictor Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Quality of Fit for Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2.1 Fit Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.3 Optimization - Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.3.5 NullModel Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.4 Simpler Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.3 Optimization - Derivative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4.5 SimplerRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.5 Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.5.3 Optimization - Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.5.4 Example Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.5.5 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.5.6 SimpleRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.6 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6.3 Optimization - Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.6.4 Matrix Inversion Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.6.5 LU Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.6.6 Cholesky Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.6.7 QR Factorization Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.6.8 Use of Factorization in Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.6.9 Model Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.6.10 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.6.11 Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.6.12 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.6.13 Regression Problem: Texas Temperatures . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.6.14 Regression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.6.15 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.6.16 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
6.7 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.7.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.7.4 Centering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.7.5 The λ Hyper-parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.7.6 Comparing RidgeRegression with Regression . . . . . . . . . . . . . . . . . . . . . . 204
6.7.7 RidgeRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.7.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.8 Lasso Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.8.3 Optimization Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.8.4 The λ Hyper-parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.8.5 Regularized and Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.6 LassoRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.8.8 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.9 Quadratic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.2 Comparison of quadratic and Regression . . . . . . . . . . . . . . . . . . . . . . . . 214
6.9.3 SymbolicRegression.quadratic Method . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.9.4 Quadratic Regression with Cross Terms . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.9.5 Response Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.10 Cubic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.10.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.10.2 Comparison of cubic, quadratic and Regression . . . . . . . . . . . . . . . . . . . . 222
6.10.3 SymbolicRegression.cubic Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.10.4 Cubic Regression with Cross Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
6.10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.11 Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.1 Sample Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.2 As a Data Science Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.11.3 SymbolicRegression Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.11.4 Implementation of the apply Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
6.11.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.11.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.12 Transformed Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.12.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.12.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
6.12.3 Square Root Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6.12.4 Log Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.12.5 Reciprocal Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
6.12.6 Box-Cox Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.12.7 Quality of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.12.8 TranRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.12.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.13 Regression with Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.13.1 Handling Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.13.2 ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.13.3 RegressionCat Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.13.4 RegressionCat Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.14 Weighted Least Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.2 Root Absolute Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.14.3 RegressionWLS Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
6.15 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.15.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.15.2 PolyRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.15.3 PolyORegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6.16 Trigonometric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.16.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.16.2 TrigRegression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.16.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
7 Classification 257
7.1 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.1.1 Classifier Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.2 Quality of Fit for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.2.1 FitC Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.3 Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.3.1 NullModel Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
7.3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
7.4 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.4.1 Factoring the Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.4.2 Estimating Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.4.3 Laplace Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
7.4.4 Table Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7.4.5 The train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7.4.6 The test Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.7 The predictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.8 The lpredictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.4.9 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
7.4.10 NaiveBayes Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
7.4.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.5 Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.5.1 BayesClassifier Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.6 Tree Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.6.1 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.6.2 Conditional Probability Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.6.3 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
7.6.4 The train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.5 The predictI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.6 TANBayes Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.7 Bayesian Network Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.7.1 Network Augmented Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
7.8 Markov Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.8.1 Markov Blanket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.8.2 Factoring the Joint Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.8.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.9 Decision Tree ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.2 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.9.3 Early Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.9.4 DecisionTree Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.9.5 DecisionTree ID3 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.9.6 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.9.7 DecisionTree ID3wp Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.10 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.10.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
7.10.2 Forward Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
7.10.3 Backward Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
7.10.4 Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
7.10.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.10.6 Reestimation of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.10.7 HiddenMarkov Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.10.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
8.7.2 DecisionTree C45 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
8.7.3 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.7.4 DecisionTree C45wp Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.8 Bagging Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.1 Creating Subsample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.8.3 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
8.8.4 BaggingTrees Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
8.9 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.1 Extracting Sub-features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.9.3 RandomForest Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.10 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.10.1 Separating Hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.10.2 Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
8.10.3 Running the Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
8.10.4 SupportVectorMachine Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
8.10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
8.11 Neural Network Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.2 Training Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.11.3 Prediction Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
8.11.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
8.11.5 NeuralNet Class 3L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.11.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
9.5.4 RegressionTreeMT class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
9.6 Random Forest Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.6.1 RegressionTreeRF Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.7 Gradient Boosting Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.7.1 RegressionTreeGB Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
9.9 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
10.6.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
10.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10.6.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10.6.4 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
10.6.5 NeuralNet 2L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
10.6.6 NeuralNet 2L Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.7 Three-Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
10.7.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
10.7.2 Ridge Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
10.7.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
10.7.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
10.7.5 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
10.7.6 train Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.7.7 Stochastic Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.7.8 Example Error Calculation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
10.7.9 Response Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
10.7.10 NeuralNet 3L Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
10.7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
10.8 Multi-Hidden Layer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
10.8.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
10.8.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
10.8.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
10.8.4 Number of Nodes in Hidden Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
10.8.5 Avoidance of Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
10.8.6 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
10.8.7 NeuralNet XL Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
10.8.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
10.9 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
10.10 1D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
10.10.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
10.10.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
10.10.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
10.10.4 Matrix Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
10.10.5 Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
10.10.6 Example Error Calculation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
10.10.7 Two Convolutional Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
10.10.8 CNN 1D Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
10.10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
10.11 2D CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.11.1 Filtering Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.11.2 Pooling Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
10.11.3 Flattening Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
10.11.4 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.6 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.12 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.12.1 Definition of Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.12.2 Type of Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
10.12.3 NeuralNet XLT Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
10.12.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
10.13 Extreme Learning Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.1 Model Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
10.13.4 ELM 3L1 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
10.13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
11.4.5 AR Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
11.4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
11.5 Moving-Average (MA) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
11.5.1 MA(q) Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
11.5.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
11.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
11.6 Auto-Regressive, Moving Average (ARMA) Models . . . . . . . . . . . . . . . . . . . . . . . . 473
11.6.1 Selection Based on ACF and PACF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
11.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
11.6.3 ARMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
11.6.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
11.7 Rolling-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
11.7.1 1-Fold Rolling-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
11.7.2 Rolling Validation and the Forecasting Matrix . . . . . . . . . . . . . . . . . . . . . . 478
11.7.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
11.8 ARIMA (Integrated) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
11.8.1 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
11.8.2 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
11.8.3 Backshift Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
11.8.4 Stationary Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
11.8.5 ARIMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
11.8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
11.9 SARIMA (Seasonal) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.1 Determination of the Seasonal Period . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.2 Seasonal Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.3 Seasonal AR and MA Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
11.9.4 Case Study: COVID-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
11.9.5 SARIMA Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
11.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
11.10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
12.2.1 Model Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
12.2.2 SARIMAX Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
12.2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
12.3 Vector Auto-Regressive (VAR) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
12.3.1 VAR(p, 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
12.3.2 VAR(p, n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.4 VAR Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
12.3.5 AR∗(p, n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
12.3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
12.4 Nonlinear Time Series Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.1 Nonlinear Autoregressive (NAR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.2 Autoregressive Neural Network (ARNN) . . . . . . . . . . . . . . . . . . . . . . . . . . 509
12.4.3 Nonlinear Autoregressive, Moving-Average (NARMA) . . . . . . . . . . . . . . . . . . 509
12.5 Recurrent Neural Networks (RNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
12.5.1 RNN(1, 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
12.5.2 RNN(p, n_h) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
12.5.3 RNN(p, n_h, n_v) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
12.5.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
12.5.5 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
12.5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
12.6 Gated Recurrent Unit (GRU) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
12.6.1 A GRU Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
12.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
12.6.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
12.6.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
12.7 Minimal Gated Unit (MGU) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
12.8 Long Short Term Memory (LSTM) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
12.8.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
12.9 Encoder-Decoder Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
12.9.1 Simple Encoder-Decoder Consisting of Two GRU Cells . . . . . . . . . . . . . . . . . . 530
12.9.2 Teacher Forcing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.9.3 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
12.10 Transformer Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12.10.1 Self-Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12.10.2 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
12.10.3 Encoder-Decoder Architecture for Transformers . . . . . . . . . . . . . . . . . . . . . . 536
12.10.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
12.10.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
13 Dimensionality Reduction 539
13.1 Reducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
13.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
13.2.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
13.2.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
13.3 Autoencoder (AE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
13.3.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
13.3.2 Denoising Autoencoder (DAE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
14 Clustering 545
14.1 KNN Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.1.1 KNN Regression Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
14.1.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
14.2 Clusterer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
14.3 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.3.1 Initial Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
14.3.2 Reassignment of Points to Closest Clusters . . . . . . . . . . . . . . . . . . . . . . . . 552
14.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
14.3.4 KMeansClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
14.3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
14.4 K-Means Clustering - Hartigan-Wong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
14.4.1 Adjusted Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.4.2 KMeansClusteringHW Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
14.5 K-Means++ Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.5.1 Picking Initial Centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
14.5.2 KMeansClustererPP Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
14.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
14.6 Clustering Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.2 ClusteringPredictor Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
14.6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
14.7 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.7.1 HierClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
14.7.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
14.8 Markov Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
14.8.1 MarkovClusterer Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
14.8.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
15.2 Types of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
15.2.1 Example: Modeling an M/M/1 Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
15.3 Random Number Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
15.3.1 Example RNG: Random0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
15.3.2 Testing Random Number Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
15.3.3 Example RNG: Random3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
15.3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
15.4 Random Variate Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
15.4.1 Inverse Transform Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
15.4.2 Convolution Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
15.4.3 Acceptance-Rejection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
15.4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
15.5 Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
15.5.1 Generating a Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
15.5.2 Generating a Non-Homogeneous Poisson Process . . . . . . . . . . . . . . . . . . . . . 585
15.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
15.6 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.6.1 Simulation of Card Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
15.6.2 Integral of a Complex Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
15.6.3 Grain Dropping Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
15.6.4 Simulation of the Monty Hall Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
15.6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
15.7 Hand Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
15.7.1 Little’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
15.7.2 Event Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
15.7.3 Spreadsheet Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
15.7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
15.8 Tableau-Oriented Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
15.8.1 Iterating through Tableau Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
15.8.2 Reproducing the Hand Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.8.3 Customized Logic/Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
15.8.4 Tableau.scala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
15.8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
16.2.4 MarkovChain Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
16.2.5 Continuous-Time Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
16.2.6 Limiting/Steady-State Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
16.2.7 MarkovChainCT Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
16.2.8 Queueing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
16.2.9 MMc Queue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
16.2.10 MMcK Queue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
16.2.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
16.3 Dynamic Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
16.3.1 Example: Traffic Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
16.3.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
16.4 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
16.4.1 Example: Golf Ball Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
16.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
16.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
16.5 Extended Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
16.5.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
16.5.2 Example: SEIHRD Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
16.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
16.6 ODE Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
18 Process-Oriented Models 667
18.1 Base Traits and Classes for Process-Oriented Models . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.1 Identifiable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.2 Locatable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.3 Modelable Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.1.4 Temporal Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.2 Concurrent Processing of Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
18.2.1 Java’s Thread Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
18.2.2 ScalaTion’s Coroutine Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
18.3 Process Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
18.3.1 Model Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
18.3.2 Component Trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
18.3.3 Example: BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
18.3.4 Executing the Bank Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.5 Network Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.6 Comparison to Event Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
18.3.7 SimActor Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
18.3.8 Source Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
18.3.9 Sink Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
18.3.10 Transport Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
18.3.11 Resource Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
18.3.12 WaitQueue Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
18.3.13 WaitQueue LCFS Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
18.3.14 Junction Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
18.3.15 Gate Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684
18.3.16 Route Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
18.3.17 Model Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
18.3.18 Vehicle Traffic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
18.3.19 Model MBM Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
18.3.20 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
18.4 Agent-Based Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
18.4.1 SimAgent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694
18.4.2 Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
18.4.3 Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
18.4.4 Bank Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
18.4.5 Vehicle Traffic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697
18.4.6 Hybrid Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
18.4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
18.5 Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
18.5.1 2D Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
18.5.2 3D Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
18.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
19 Simulation Output Analysis 705
19.1 Point and Interval Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
19.2 One-Shot Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
19.3 Simulation Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
19.3.1 Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709
19.3.2 Model Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
19.4 Method of Independent Replications (MIR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
19.4.1 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 712
19.4.2 Example: MIR Version of BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
19.5 Method of Batch Means (MBM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
19.5.1 Effect of Increasing the Number of Batches . . . . . . . . . . . . . . . . . . . . . . . . 715
19.5.2 Effect on Batch Correlation of Increasing the Batch Size . . . . . . . . . . . . . . . . . 716
19.5.3 MBM versus MIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716
19.5.4 Relative Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.5.5 Example: MBM Version of BankModel . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
19.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
Appendices 721
A.7 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
A.8 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
A.8.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 749
A.9 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.1 Newton-Raphson Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.2 Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
A.9.3 BFGS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
A.9.4 Limited Memory-BFGS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
A.9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753
A.9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
A.10 Method of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.10.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
A.11 Karush-Kuhn-Tucker Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.11.1 Active and Inactive Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
A.12 Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
A.13 Augmented Lagrangian Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.13.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
A.13.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
A.14 Alternating Direction Method of Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
A.14.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.14.2 LassoAddm Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.14.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
A.15 Nelder-Mead Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
A.15.1 NelderMeadSimplex Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
A.15.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
B.4.1 Embedding Relationships in Vertex-Types . . . . . . . . . . . . . . . . . . . . . . . . . 792
B.4.2 Resource Description Framework (RDF) Graphs . . . . . . . . . . . . . . . . . . . . . 793
B.4.3 From Relational to Graph Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794
B.5 Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
B.5.1 Type Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
B.5.2 Constraints and Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796
B.5.3 KGTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797
B.6 Exercises - Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
B.7 Graph Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.1 Path Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.2 Centrality and Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.7.3 Community Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
B.8 Graph Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
B.8.1 Graph Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
B.8.2 Subgraph Isomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
B.8.3 Graph Homomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
B.8.4 Application to Query Processing in Graph Databases . . . . . . . . . . . . . . . . . . 803
B.9 Graph Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
B.9.1 Matrix Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
B.9.2 Graph Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805
B.10 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
B.10.1 AGGREGATE and COMBINE Operations . . . . . . . . . . . . . . . . . . . . . . . . 807
B.11 Exercises - Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
Preface
Applied Mathematics accelerated starting with the differential equations of Euler's analytical mechanics published in the early 1700s [45, 117]. Over time, increasingly accurate mathematical models of natural phenomena were developed. The models are scrutinized by how well they match empirical data and related models. Theories were developed that featured a collection of consistent, related models. In his theory of Universal Gravity [132], Newton argues the sufficiency of this approach, while others seek to understand the underlying substructures and causal mechanisms [117].
Data Science can trace its lineage back to Applied Mathematics. One way to represent a mathematical model is as a function f : Rⁿ → R.

y = f(x, b) + ε

This illustrates that a response variable y is functionally related to other predictive variables x (vector in bold font). Uncertainty in the relationship is modeled as a random variable ε (blue font) that follows some probability distribution.
Making useful predictions, or even inferences such as that one product lasts longer than another, is clouded by this uncertainty. DeMoivre developed a limiting distribution for the Binomial Distribution. Laplace derived a central limit theorem that showed that the sample means from several distributions follow this same distribution. Gauss [180] studied this uncertainty and deduced a distribution for measurement errors from basic principles. This distribution is now known as the Gaussian or Normal distribution. Inferences such as which of two products has the longer expected lifetime can now be made to a certain level of confidence. Gauss also developed the method of least squares estimation.
Momentum in using probability distributions to analyze data, fit parameters and make inferences under uncertainty led to mathematical statistics emerging from applied mathematics in the late 1800s. In particular, Galton and Pearson collected and transformed statistical techniques into a mathematical discipline (e.g., Pearson correlation coefficient, method of moments estimation, p-value, Chi-square test, statistical hypothesis testing, principal component analysis). In the early 1900s, Gosset and Fisher expanded mathematical statistics (e.g., analysis of variance, design of experiments, maximum likelihood estimation, Student's t-distribution, F-distribution).
With the increasing capabilities of computers, the amount of data available for training models grew rapidly. This led Computer Scientists into the fray, with machine learning coined in 1959 and data mining beginning in the late 1980s. Machine Learning developed slowly over the decades until the invention of the back-propagation algorithm for neural networks in the mid 1980s led to important advances. Data Mining billed itself as finding patterns in data. Databases are often utilized and data preprocessing is featured, in the sense that mining through large amounts of data should be done with care.
With greater computing capabilities and larger amounts of data, statistics and machine learning are leaning toward each other: the emphasis is to develop accurate, interpretable and explainable models for prediction, classification and forecasting. Data may also be clustered, and simulation models that mimic phenomena or systems may be created. Training a model is typically done using an optimization algorithm (e.g., gradient descent) to minimize the errors in the model's predictions. These constitute the elements of data science.
This book is an introduction to data science that includes mathematical and computational foundations.
It is divided into three parts: (I) Foundations, (II) Modeling, and (III) Simulation. A review of Optimization
from the point of view of data science is included in the Appendix. The level of the book is College Junior
through beginning Graduate Student. The ideal mathematical background includes Differential, Integral and
Vector Calculus, Applied Linear Algebra, Probability and Mathematical Statistics. The following advanced
topics may be found useful for Data Science: Differential Equations, Nonlinear Optimization, Measure
Theory, Functional Analysis and Differential Geometry. Data Science also involves Computer Programming,
Database Management, Data Structures and Algorithms. Advanced topics include Parallel Processing,
Distributed Systems and Big Data frameworks (e.g., Hadoop and Spark). This book has been used in the
Data Science I and Data Science II courses at the University of Georgia.
Chapter 1
rithms used for learning. Mathematical derivations are provided for the loss functions that are used to train the models. Short Scala code snippets are provided to illustrate how the algorithms work. The Scala object-oriented, functional language allows the creation of concise code that looks very much like the mathematical expressions. Models based on ordinary differential equations and simulation models are also provided.
The prerequisite material for data science includes Vector Calculus, Applied Linear Algebra and Calculus-
based Probability and Statistics. Datasets can be stored as vectors and matrices, learning/parameter esti-
mation often involves taking gradients, and probability and statistics are needed to handle uncertainty.
1.2 ScalaTion
ScalaTion supports multi-paradigm modeling that can be used for simulation, optimization and analytics.
In ScalaTion, the modeling package provides tools for performing data analytics. Datasets are becom-
ing so large that statistical analysis or machine learning software should utilize parallel and/or distributed
processing. Databases are also scaling up to handle greater amounts of data, while at the same time increas-
ing their analytics capabilities beyond the traditional On-Line Analytic Processing (OLAP). ScalaTion
provides many analytics techniques found in tools like MATLAB, R and Weka. The analytics component
contains six types of tools: predictors, classifiers, forecasters, clusterers, recommenders and reduc-
ers. A trait is defined for each type.
To use ScalaTion, go to the Website http://www.cs.uga.edu/~jam/scalation.html and click on the
most recent version of ScalaTion and follow the first three steps: download, unzip, build.
Current projects are targeting Big Data Analytics in four ways: (i) use of sparse matrices, (ii) parallel
implementations using Scala’s support for parallelism (e.g., .par methods, parallel collections and actors),
(iii) distributed implementations using Akka, and (iv) high performance data stores including columnar
databases (e.g., Vertica), document databases (e.g., MongoDB), graph databases (e.g., Neo4j) and distributed
file systems (e.g., HDFS).
2. modeling - regression models with sub-packages for classification, clustering, neural networks, and time
series
2. simulation - multiple simulation engines
• if

  if x < y then                                    if x < y:
      x += 1                                           x += 1
  else if x > y then                               elif x > y:
      y += 1                                           y += 1
  else                                             else:
      x += 1                                           x += 1
      y += 1                                           y += 1
  end if
The else and end are optional, as are the line breaks. Note, the x += 1 shortcut simply means x =
x + 1 for both languages.
• match
  z = c match                                      match c:
      case '+' => x + y                                case '+':
                                                           z = x + y
      case '-' => x - y                                case '-':
                                                           z = x - y
      case '*' => x * y                                case '*':
                                                           z = x * y
      case '/' => x / y                                case '/':
                                                           z = x / y
      case _   => println ("not supported")            case _:
                                                           print ("not supported")
In Scala 3, the case may be indented like Python. Also an end may be added.
• while
  while x <= y do                                  while x <= y:
      x += 0.5                                         x += 0.5
  end while
• for
  for i <- 0 until 10 do                           for i in range (0, 10):
      a(i) = 0.5 * i ~^ 2                              a[i] = 0.5 * i**2
  end for
The end is optional, as are the line breaks. Note: for i <- 0 to 10 do will include 10, while until will stop at 9. Both Scala and Python support other varieties of for. The for-yield collects all the computed values into a.
  val a = for i <- 0 until 10 yield 0.5 * i ~^ 2
• cfor
  var i = 0
  cfor (i < 10, i += 1) {
      a(i) = 0.5 * i ~^ 2
  } // cfor
This for loop follows a C style; it provides improved efficiency and allows returns inside the loop. It is defined as follows:

  inline def cfor (pred: => Boolean, step: => Unit) (body: => Unit): Unit =
      while pred do { body; step }
  end cfor
• try
  try                                              try:
      file = new File ("myfile.csv")                   x = 1 / 0
  catch                                            except ZeroDivisionError:
      case ex: FileNotFoundException =>                print ("division by zero")
          println ("not found")
  end try
The end is optional. Both languages support a finally clause, and Python provides a shortcut with statement that comes in handy for opening files and automatically closing them at the end of the statement's scope.
• assign with if
  val y = if x < 1 then sqrt (x) else x ~^ 2       y = sqrt (x) if x < 1 else x**2
All Scala control structures return values and so can be used in assignment statements. Note, prefix
sqrt with math for Python.
Note, the end tags are optional since Scala 3 uses significant indentation like Python.
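As a minimal sketch of a multi-line function (the name hypotenuse and the standard Pythagorean definition are assumed from the discussion that follows), a function in ScalaTion may be written as:

  import scala.math.sqrt
  import scalation.~^                              // ScalaTion's exponentiation operator

  def hypotenuse (a: Double, b: Double): Double =
      sqrt (a ~^ 2 + b ~^ 2)                       // length of the hypotenuse for legs a and b
  end hypotenuse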
Optionally, an end hypotenuse may be added, which is often useful for functions that include several lines of code. The Python code below is very similar, with the exception of the exponentiation operator: ~^ for ScalaTion and ** in Python. Outside of ScalaTion, import scalation.~^. Both Double in Scala and float in Python indicate 64-bit floating point numbers.
  import math

  def hypotenuse (a: float, b: float) -> float:
      return math.sqrt (a**2 + b**2)               # hypotenuse of a right triangle with legs a and b
The dot product operator on vectors is used extensively in data science. It multiplies the corresponding elements of the two vectors and then sums the products. An implementation in ScalaTion is given, followed by a similar implementation in Python that includes type annotations for improved readability and type checking.
  import scalation.mathstat.VectorD

  def dot (x: VectorD, y: VectorD): Double =
      var sum = 0.0
      for i <- x.indices do sum += x(i) * y(i)     // add the product of the i-th elements
      sum
  end dot
  import numpy as np

  def dot (x: np.ndarray, y: np.ndarray) -> float:
      total = 0.0
      for i in range (len (x)):
          total += x[i] * y[i]                     # add the product of the i-th elements
      return total
Note, see the Chapter on Linear Algebra for more efficient implementations of dot product. Also, both
numpy.ndarray and VectorD directly provide dot product.
  val z = x dot y

  z = x.dot (y)
In cases where the arguments are 2D arrays, np.dot is the same as matrix multiplication (x @ y) and for
scalars it is simple multiplication (x * y). ScalaTion supports several forms of multiplication for both
vectors and matrices (see the Linear Algebra Chapter).
Executable top-level functions can also be defined in similar ways in both Scala 3 and Python.
  @main def hello (): Unit =
      val message = "Hello Data Science"
      println (s"message = $message")
1.2.4 Classes
Defining a class is a good way to combine a data structure with its natural operations. A class consists of fields/attributes for maintaining data and methods for retrieving and updating the data.
An example of a class in Scala 3 is the Complex class that supports complex numbers (e.g., 2.1 + 3.2i) and operations on complex numbers such as the + method. Of course, the actual implementation provides many methods (see scalation.mathstat.Complex).
  @param re  the real part (e.g., 2.1)
  @param im  the imaginary part (e.g., 3.2)

  case class Complex (re: Double, im: Double = 0.0)
       extends Fractional [Complex] with Ordered [Complex]:
Notice that the second argument im provides a default value of 0.0, so the class can be instantiated using either one or two arguments/parameters.
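For instance, the two instantiations referenced below may look as follows (a minimal sketch with hypothetical values):

  val c1 = Complex (2.1, 3.2)                      // two arguments
  var c2 = Complex (1.0)                           // one argument (im defaults to 0.0)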
Also observe that the first variable c1 cannot be reassigned, as it is declared val, while the second variable c2 can be, as it is declared var. Finally, notice that the Complex class extends both Fractional and Ordered. These are traits that the class Complex inherits. Some of the functionality (e.g., method implementations) can be provided by the trait itself. The class must implement those that are not implemented, or override the ones with implementations to customize their behavior, if need be. Classes can extend several traits (multiple inheritance), but may only extend one class (single inheritance).
Although Python already has a built-in class called complex, one could imagine coding one as follows:
  class Complex:
      def __init__ (self, re: float, im: float = 0.0):
          self.re = re
          self.im = im

      def __add__ (self, c):
          return Complex (self.re + c.re, self.im + c.im)
Notice there are a few differences: The constructor for Scala is any code at the top level of the class, and arguments to the constructor are given in the class definition, while Python has an explicit constructor called __init__. Scala has an implicit reference to the instance object called this, while Python has an explicit reference to the instance object called self. Furthermore, naming the method __add__ makes it so + can be used to add two complex numbers. Another difference (not shown here) is that fields/attributes as well as methods in Scala can be made internal using the private access modifier. In Python, a code convention of having the first character of an identifier be an underscore (_) indicates that it should not be used externally.
The basic data-types in Scala are integer types: Byte (8 bits), Short (16), Int (32) and Long (64), floating
point types: Float (32) and Double (64), character types: Char (single quotes) and String (double quotes),
and Boolean.
Corresponding Python data types are integer types: int (unlimited), floating point types: float32 (32)
and float (64), complex (128), character types: str (single or double quotes), and bool.
There are many operators that can be applied to these data-types; see https://docs.scala-lang.org/tour/operators.html for the precedence of the operators. ScalaTion adds a few itself, such as ~^ for exponentiation. Also, ScalaTion provides complex numbers via the Complex class in the mathstat package.
1.2.6 Collection Types
The most commonly used collection types in Scala are Array, ArrayBuffer, Range, List, Map, Set, and
Tuple. The Python rough equivalents (in lower case) are on the right (Map becomes dict).
  val a = Array.ofDim [Double] (10)                a = np.zeros (10)
  val b = ArrayBuffer (2, 3, 3)
  val r = 0 until 10                               r = range (10)
  val l = List (2, 3, 3)                           l = [2, 3, 3]
  val m = Map ("no" -> 0, "yes" -> 1)              m = {"no": 0, "yes": 1}
  val s = Set (1, 2, 3, 5, 7)                      s = {1, 2, 3, 5, 7}
  val t = (firstName, lastName)                    t = (firstName, lastName)
For more collection types consult their documentation: https://scala-lang.org/api/3.x/ for Scala and
https://docs.python.org/3/library/collections.html for Python. Scala typically has mutable and
immutable versions of most collection types.
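ScalaTion supports vectors via the VectorD class in its mathstat package. For example, a response vector y (its values mirror the numpy array shown later in this subsection; a minimal sketch) may be created as follows:

  import scalation.mathstat.VectorD

  val y = VectorD (1, 2, 4, 7, 9, 8, 6, 5, 3)      // response vector y used below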
A matrix is a 2D array; the one below is a 9-by-2 matrix holding two variables/features x0 and x1 in the columns of the matrix.
  //                     col0 col1
  val x = MatrixD ((9, 2), 1, 8,                   // row 0
                           2, 7,                   // row 1
                           3, 6,                   // row 2
                           4, 5,                   // row 3
                           5, 5,                   // row 4
                           6, 4,                   // row 5
                           7, 4,                   // row 6
                           8, 3,                   // row 7
                           9, 2)                   // row 8
As practice, try to find a vector b of length/dimension 2, so that x * b is close to y. The * operator does
matrix-vector multiplication. It takes the dot product of the ith row of matrix x and vector b to obtain the
ith element in the resulting vector.
In Python, numpy arrays can be used to do the same thing. The following 1D array can represent a vector. Note the use of the period in "1." to make the elements floating point numbers; the "D" in VectorD indicates such for ScalaTion.
  y = np.array ([1., 2., 4., 7., 9., 8., 6., 5., 3.])
Using double square brackets “[[”, numpy can be used to represent matrices. Each “[ ... ]” corresponds to a
row in the matrix.
  #              col0 col1
  x = np.array ([[1., 8.],                         # row 0
                 [2., 7.],                         # row 1
                 [3., 6.],                         # row 2
                 [4., 5.],                         # row 3
                 [5., 5.],                         # row 4
                 [6., 4.],                         # row 5
                 [7., 4.],                         # row 6
                 [8., 3.],                         # row 7
                 [9., 2.]])                        # row 8
In ScalaTion, a tensor with three dimensions may be created with the TensorD class, e.g., a 4-by-3-by-2 tensor holding values 1 through 12 in sheet 0 and 13 through 24 in sheet 1 (the constructor call follows the same pattern as MatrixD above):

  val z = TensorD ((4, 3, 2),  1,  2,  3,          // sheet 0: rows 0-3, columns 0-2
                               4,  5,  6,
                               7,  8,  9,
                              10, 11, 12,
                              13, 14, 15,          // sheet 1: rows 0-3, columns 0-2
                              16, 17, 18,
                              19, 20, 21,
                              22, 23, 24)
In Python, the above tensor can be defined as a 3D numpy array. Each row and column position has two
sheet values, e.g., ”[1., 13.]”.
  #               column 0     column 1     column 2
  z = np.array ([[[1., 13.],   [2., 14.],   [3., 15.]],    # row 0
                 [[4., 16.],   [5., 17.],   [6., 18.]],    # row 1
                 [[7., 19.],   [8., 20.],   [9., 21.]],    # row 2
                 [[10., 22.],  [11., 23.],  [12., 24.]]])  # row 3
Vectors, matrices and tensors will be discussed in greater detail in the Linear Algebra Chapter.
1.3 A Data Science Project
The orientation of this textbook is that of developing modeling techniques and the understanding of how
to apply them. A secondary goal is to explain the mathematics behind the models in sufficient detail to
understand the algorithms implementing the modeling techniques. Concise code based on the mathematics
is included and explained in the textbook. Readers may drill down to see the actual ScalaTion code.
The textbook is intended to facilitate trying out the modeling techniques as they are learned and to
support a group-based term project that includes the following ten elements. The term project is to culminate
in a presentation that explains what was done concerning these ten elements.
1. Problem Statement. Imagine that your group is hired as consultants to solve some problem for a company or government agency. The answers and recommendations that your group produces should not depend solely on prior knowledge, but rather on sophisticated analytics performed on multiple large-scale datasets. In particular, the study should be focused and its purpose should be clearly stated. What not to do: The following datasets are relevant to the company, so we ran them through an analytics package (e.g., R) and obtained the following results.
2. Collection and Description of Datasets. To reduce the chances of results being relevant only to a single dataset, multiple datasets should be used for the study (at least two). An explanation must be given of how each dataset relates to the other datasets as well as to the problem statement. When a dataset is in the form of a matrix, metadata should be collected for each column/variable. In some cases the response column(s)/variable(s) will be obvious; in others it will depend on the purpose of the study. Initially, the rest of the columns/variables may be considered as features that may be useful in predicting responses. Ideally, the datasets should be loaded into a well-designed database. ScalaTion provides two high-performance database systems: a relational database system and a graph database system in scalation.database.table and scalation.database.graph, respectively.
3. Data Preprocessing Techniques Applied. During the preprocessing phase (before the modeling
techniques are applied), the data should be cleaned up. This includes elimination of features with zero
variance or too many missing values, as well as the elimination of key columns (e.g., on the training
data, the employee-id could perfectly predict the salary of an employee, but is unlikely to be of any
value in making predictions on the test data). For the remaining columns, strings should be converted
to integers and imputation techniques should be used to replace missing values.
4. Visual Examination. At this point, Exploratory Data Analysis (EDA) should be applied. Commonly, one column of a dataset in the combined data matrix will be chosen as the response column, call it the response vector y, and the rest of the columns that remain after preprocessing form the m-by-n data matrix X. In general, models are of the form

y = f(x) + ε     (1.1)

where f is a function mapping feature vector x into a predicted value for response y. The last term ε may be viewed as random error. In an ideal model, the last term will be pure error (e.g., white noise). Since most models are approximations, technically the last term should be referred to as a residual (that which is not explained by the model). During exploratory data analysis, the value of y should be plotted against each feature/column x:j of data matrix X. The relationships between the columns should be examined by computing a correlation matrix. Two columns that are very highly correlated are supplying redundant information, and typically, one should be removed. For a regression type problem, where y is treated as a continuous random variable, a simple linear regression model should be created for each feature xj,

y = b0 + b1 xj + ε     (1.2)

where the parameters b = [b0, b1] are to be estimated. The line generated by the model should be plotted along with the {(xij, yi)} data points. Visually, look for patterns such as white noise, linear relationship, quadratic relationship, etc. Plotting the residuals {(xij, εi)} will also be useful. One should also create Histograms and Box-Plots for each variable as well as consider removing outliers.
5. Modeling Techniques Chosen. For every type of modeling problem, there is the notion of a NullModel: For prediction, it is to guess the mean, i.e., given a feature vector z, predict the value E[y], regardless of the value of z. The coefficient of determination R² for such models will be zero. If a more sophisticated model cannot beat the NullModel, it is not helpful in predicting or explaining the phenomena. Projects should include four classes of models: (i) NullModel, (ii) simple, easy to explain models (e.g., Multiple Linear Regression), (iii) complex, performant models (e.g., Quadratic Regression, Extreme Learning Machines), and (iv) complex, time-consuming models (e.g., Neural Networks). If classes (ii-iv) do not improve upon class (i) models, new datasets should be collected. If this does not help, a new problem should be sought. On the flip side, if class (ii) models are nearly perfect (R² close to 1), the problem being addressed may be too simple for a term project. At least one modeling technique should be chosen from each class.
7. Feature Selection. Although feature selection can occur during multiple phases in a modeling study, an overview should be given at this point in the presentation. Explain which features were eliminated and why they were eliminated prior to building the models. During model building, explain what features were eliminated, e.g., using forward selection, backward elimination, Lasso Regression, dimensionality reduction, etc. Also address and quantify the relative importance of the remaining features. Explain how features that are categorical are handled.
8. Reporting of Results. First, the experimental setup should be described in sufficient detail to facilitate reproducibility of your results. One way to show overall results is to plot predicted responses
ŷ and actual responses y versus the instance index i = 0 until m. Reports are to include the Quality
of Fit (QoF) for the various models and datasets in the form of tables, figures and explanation of the
results. Besides the overall model, for many modeling techniques the importance/significance of model
parameters/variables may be assessed as well. Tables and figures must include descriptive captions
and color/shape schemes should be consistent across figures.
9. Interpretation of Results. With the results clearly presented, they need to be given insightful
interpretations. What are the ramifications of the results? Are the modeling techniques useful in
making predictions, classifications or forecasts?
10. Recommendations of Study. The organization that hired your group would like some take home
messages that may result in improvements to the organization (e.g., what to produce, what processes
to adapt, how to market, etc.). A brief discussion of how the study could be improved (possibly leading
to further consulting work) should be given.
1.4 Additional Textbooks
More detailed development of this material can be found in textbooks on statistical learning. See Table 1.1 for a mapping between the chapters in the four textbooks.
The next two chapters serve as quick reviews of the two principal mathematical foundations for data
science: linear algebra and probability.
Part I
Foundations
Chapter 2
Linear Algebra
Data science and analytics make extensive use of linear algebra. For example, let yi be the income of the ith
individual and xij be the value of the j th predictor/feature (age, education, etc.) for the ith individual. The
responses (outcomes of interest) are collected into a vector y, the values for predictors/features are collected
in a matrix X and the parameters/coefficients b are fit to the data.
y0 = x00 b0 + x01 b1
y1 = x10 b0 + x11 b1
This linear system has two equations with two variables having unknown values, b0 and b1 . Such linear
systems can be used to solve problems like the following: Suppose a movie theatre charges 10 dollars per
child and 20 dollars per adult. The evening attendance is 100, while the revenue is 1600 dollars. How many
children (b0 ) and adults (b1 ) were in attendance?
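A sketch of solving this word problem in ScalaTion, using the matrix inverse (the same MatrixD and VectorD calls appear in the exercises at the end of this chapter; the variable names are illustrative):

  import scalation.mathstat.{MatrixD, VectorD}

  val x = MatrixD ((2, 2),  1,  1,                 // attendance: b0 + b1 = 100
                           10, 20)                 // revenue: 10 b0 + 20 b1 = 1600
  val y = VectorD (100, 1600)
  val b = x.inverse * y                            // b = [40, 60]: 40 children and 60 adults
  println (s"b = $b")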
In matrix form, such a linear system may be written as

y = Xb     (2.1)
2.2 Matrix Inversion
If the matrix is of full rank with m = n, then the unknown vector b may be uniquely determined by multiplying both sides of the equation by the inverse of X, X⁻¹.

b = X⁻¹ y     (2.2)

Multiplying matrix X by its inverse X⁻¹, X⁻¹X, results in an n-by-n identity matrix Iₙ = [1_{i=j}], where the indicator function 1_{i=j} equals 1 when i = j and 0 otherwise.
A faster and more numerically stable way to solve for b is to perform Lower-Upper (LU) Factorization.
This is done by factoring matrix X into lower L and upper U triangular matrices.
X = LU (2.3)
Then LUb = y, so multiplying both sides by L⁻¹ gives Ub = L⁻¹y. Taking an augmented matrix

$$\left[\begin{array}{cc|c} 1 & 3 & 1 \\ 2 & 1 & 7 \end{array}\right]$$

and performing row operations to make it upper right triangular has the effect of multiplying by L⁻¹. In this case, the first row multiplied by -2 is added to the second row to give

$$\left[\begin{array}{cc|c} 1 & 3 & 1 \\ 0 & -5 & 5 \end{array}\right]$$

From this, backward substitution can be used to determine b₁ = −1 and then that b₀ = 4, i.e., b = [4, −1].
In cases where m > n, the system may be overdetermined, and no solution will exist. Values for b are
then often determined to make y and Xb agree as closely as possible, e.g., minimize absolute or squared
differences.
Vector notation is used in this book, with vectors shown in boldface and matrices in uppercase. Note,
matrices in ScalaTion are in lowercase, since by convention, uppercase indicates a type, not a variable.
ScalaTion supports vectors and matrices in its mathstat package. A commonly used operation is the dot
(inner) product, x · y, or in ScalaTion, x dot y.
2.3 Vector
A vector may be viewed as a point in multi-dimensional space, e.g., in three space, we may have

x = [0.5774, 0.5774, 0.5774]    and    y = [1, 1, 0]

where x is a point on the unit sphere and y is a point in the plane determined by the first two coordinates. Their dot product is

$$x \cdot y = \sum_{i=0}^{n-1} x_i y_i = 1.1547 \qquad (2.4)$$

Note, the inner product applies more generally, e.g., ⟨x, y⟩ may be applied when x and y are infinite sequences or functions. See the exercises for the definition of an inner product space.
2.3.4 Norm
The norm of a vector is its length. Assuming Euclidean distance, the norm is

$$\|x\| = \sqrt{\sum_{i=0}^{n-1} x_i^2} = 1 \qquad (2.5)$$
The norm of y is √2. If θ is the angle between the x and y vectors, then the dot product is the product of their norms and the cosine of the angle.

$$\cos(\theta) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{1.1547}{1 \cdot \sqrt{2}} = 0.8165$$

so the angle θ = 0.616 radians. Vectors x and y are orthogonal if the angle θ = π/2 radians (90 degrees).
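These computations may be verified with a short sketch, assuming VectorD's norm method and scalar division (both part of its vector arithmetic):

  import scala.math.{acos, sqrt}
  import scalation.mathstat.VectorD

  val x = VectorD (1, 1, 1) / sqrt (3.0)           // unit vector: [0.5774, 0.5774, 0.5774]
  val y = VectorD (1, 1, 0)
  val cosT = (x dot y) / (x.norm * y.norm)         // 0.8165
  println (acos (cosT))                            // theta = 0.616 radians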
In general there are ℓp norms. The two that are used here are the ℓ2 norm ‖x‖ = ‖x‖₂ (Euclidean distance)

$$\|x\|_2 = \sqrt{\sum_{i=0}^{n-1} |x_i|^2} \qquad (2.6)$$

and the ℓ1 norm ‖x‖₁ (Manhattan distance)

$$\|x\|_1 = \sum_{i=0}^{n-1} |x_i| \qquad (2.7)$$
Vector notation facilitates concise mathematical expressions. Many common statistical measures for
populations or samples can be given in vector notation. For an m dimensional vector (m-vector) the following
may be defined.
μ(x) = μ_x = (1 · x) / m

σ²(x) = σ²_x = (x − μ_x) · (x − μ_x) / m = (x · x) / m − μ²_x

σ(x, y) = σ_{x,y} = (x − μ_x) · (y − μ_y) / m = (x · y) / m − μ_x μ_y

ρ(x, y) = ρ_{x,y} = σ_{x,y} / (σ_x σ_y)
which are the population mean, variance, covariance and correlation, respectively.
The size of the population is m, which corresponds to the number of elements in the vector. A vector of all ones is denoted by 1. For an m-vector, ‖1‖² = 1 · 1 = m. Note, the sample mean uses the same formula, while the sample variance and covariance divide by m − 1, rather than m (sample indicates that only some fraction of the population is used in the calculation).
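As a quick check of these formulas, the population mean and variance may be computed with dot products (a minimal sketch; the vector values are illustrative):

  import scalation.mathstat.VectorD

  val x  = VectorD (1, 2, 4, 7, 9)
  val m  = x.dim.toDouble                          // population size
  val mu = x.sum / m                               // population mean
  val s2 = (x dot x) / m - mu * mu                 // population variance
  println (s"mu = $mu, sigma^2 = $s2")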
Vectors may be used for describing the motion of an object through space over time. Let u(t) be the location of an object (e.g., a golf ball) in three-dimensional space R³ at time t.
To describe the motion, let v(t) be the velocity at time t, and a be the constant acceleration, then according
to Newton’s Second Law of Motion,
$$u(t) = u(0) + v(0)\, t + \frac{1}{2}\, a\, t^2$$
The time-varying function u(t) will show the trajectory of the golf ball.
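A sketch of the trajectory computation (the initial position, velocity and acceleration values are illustrative assumptions):

  import scalation.mathstat.VectorD

  val u0 = VectorD (0, 0, 0)                       // initial position u(0)
  val v0 = VectorD (10, 0, 20)                     // initial velocity v(0)
  val a  = VectorD (0, 0, -9.8)                    // constant acceleration (gravity)

  def u (t: Double): VectorD = u0 + v0 * t + a * (0.5 * t * t)

  println (u (1.0))                                // position after one second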
In ScalaTion, vectors are provided by the VectorD class in the mathstat package, declared along the following lines:

  class VectorD (val dim: Int,
                 private [mathstat] var v: Array [Double] = null)
        extends IndexedSeq [Double]
           with Cloneable [VectorD]
           with DefaultSerializable:
VectorD includes methods for size, indices, set, copy, filter, select, concatenate, vector arithmetic, power, square, reciprocal, abs, sum, mean, variance, rank, cumulate, normalize, dot, norm, max, min, mag, argmax, argmin, indexOf, indexWhere, count, contains, sort and swap.
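A few of these methods in action (a minimal sketch):

  import scalation.mathstat.VectorD

  val x = VectorD (3, 1, 2)
  val y = VectorD (1, 1, 1)
  println (x.sum)                                  // 6.0
  println (x.mean)                                 // 2.0
  println (x.norm)                                 // sqrt (14) = 3.742
  println (x dot y)                                // 6.0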
2.4 Vector Calculus
Data science uses optimization to fit parameters in models, where for example a quality of fit measure (e.g.,
sum of squared errors) is minimized. Typically, gradients are involved. In some cases, the gradient of the
measure can be set to zero allowing the optimal parameters to be determined by matrix factorization. For
complex models, this may not work, so an optimization algorithm that moves in the direction opposite to
the gradient can be applied.
Consider, for example, the function f(x, y) = (x − 2)² + (y − 3)². The functional value at the point [3, 2] is f([3, 2]) = 1 + 1 = 2, and at the point [1, 1], f([1, 1]) = 1 + 4 = 5. The following contour curves illustrate how the elevation of the function increases with distance from the point [2, 3].
[Figure: contour curves of f centered at the point [2, 3], with gradient vectors drawn at [3, 2] (blue) and [1, 1] (purple); x and y axes from −3 to 5.]
The gradient of function f consists of a vector formed from the two partial derivatives.

$$\mathrm{grad}\, f = \nabla f = \left[\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y}\right]$$

The gradient evaluated at point/vector u ∈ R² is

$$\nabla f(u) = \left[\frac{\partial f}{\partial x}(u),\; \frac{\partial f}{\partial y}(u)\right]$$

The gradient indicates the direction of steepest increase/ascent. For example, the gradient at the point [3, 2] is ∇f([3, 2]) = [2, −2] (in blue), while at [1, 1], ∇f([1, 1]) = [−2, −4] (in purple).
A gradient's norm indicates the magnitude of the rate of change (or steepness). When the elevation changes are fixed (here they differ by one), the closeness of the contour curves also indicates steepness. Notice that the gradient vector at point [x, y] is orthogonal to the contour curve intersecting that point.
By setting the gradient equal to zero, in this case

$$\frac{\partial f}{\partial x} = 2(x - 2), \qquad \frac{\partial f}{\partial y} = 2(y - 3)$$

one may find the vector that minimizes function f, namely u = [2, 3], where f = 0. For more complex functions, repeatedly moving in the opposite direction to the gradient may lead to finding a minimal value.
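The following sketch applies this idea (simple gradient descent) to the example function above; the step size eta and iteration count are illustrative choices:

  import scalation.mathstat.VectorD

  def gradF (u: VectorD): VectorD = VectorD (2 * (u(0) - 2), 2 * (u(1) - 3))

  var u   = VectorD (3, 2)                         // starting point
  val eta = 0.1                                    // step size / learning rate
  for it <- 1 to 100 do u = u - gradF (u) * eta    // move opposite to the gradient
  println (u)                                      // converges toward the minimum [2, 3]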
In general, the gradient (or gradient vector) of function f : Rⁿ → R is

$$\nabla f = \frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_0}, \ldots, \frac{\partial f}{\partial x_{n-1}}\right] \qquad (2.8)$$

or evaluated at point/vector x ∈ Rⁿ is

$$\nabla f(x) = \frac{\partial f}{\partial x}(x) = \left[\frac{\partial f}{\partial x_0}(x), \ldots, \frac{\partial f}{\partial x_{n-1}}(x)\right] \qquad (2.9)$$
In data science, it is often convenient to take the gradient of a dot product of two functions of x, in which case the following product rule can be applied.

$$\nabla\,[f(x) \cdot g(x)] = [J\, f(x)]^{\intercal}\, g(x) + [J\, g(x)]^{\intercal}\, f(x) \qquad (2.10)$$

Here J f(x) denotes the Jacobian of a vector-valued function f : Rⁿ → Rᵐ, the m-by-n matrix whose rows are the gradients of the component functions.

$$J\, f(x) = \begin{bmatrix} \nabla f_0(x) \\ \nabla f_1(x) \\ \vdots \\ \nabla f_{m-1}(x) \end{bmatrix} \qquad (2.11)$$

This follows the numerator layout where the functions correspond to rows (the opposite is called the denominator layout, which is the transpose of the numerator layout).
Consider the following function f : R² → R² that maps vectors in R² into other vectors in R².

$$f(x) = [\, f_0(x),\; f_1(x) \,] = [\, (x_0 - 2)^2 + (x_1 - 3)^2,\;\; 2(x_0 - 3)^2 + 3(x_1 - 2)^2 \,]$$

Its Jacobian is the 2-by-2 matrix of partial derivatives, one row per component function.

$$J\, f(x) = \begin{bmatrix} \frac{\partial f_0}{\partial x_0} & \frac{\partial f_0}{\partial x_1} \\[4pt] \frac{\partial f_1}{\partial x_0} & \frac{\partial f_1}{\partial x_1} \end{bmatrix}$$

Taking the partial derivatives gives the following Jacobian matrix.

$$J\, f(x) = \begin{bmatrix} 2x_0 - 4 & 2x_1 - 6 \\ 4x_0 - 12 & 6x_1 - 12 \end{bmatrix}$$
The Hessian matrix of a scalar-valued function f : Rⁿ → R is the matrix of its second partial derivatives

$$H\, f(x) = \left[\frac{\partial^2 f}{\partial x_i\, \partial x_j}\right]_{0 \le i < n,\; 0 \le j < n} \qquad (2.12)$$

For n = 2, the Hessian is

$$\begin{bmatrix} \frac{\partial^2 f}{\partial x_0^2} & \frac{\partial^2 f}{\partial x_0\, \partial x_1} \\[4pt] \frac{\partial^2 f}{\partial x_1\, \partial x_0} & \frac{\partial^2 f}{\partial x_1^2} \end{bmatrix}$$

Taking the second partial derivatives gives the following Hessian matrix.

$$\begin{bmatrix} 4 & 0 \\ 0 & 6 \end{bmatrix}$$
Consider a differentiable function of n variables, f : Rⁿ → R. The points at which its gradient vector ∇f is zero are referred to as critical points. In particular, they may be local minima, local maxima or saddle points/inconclusive, depending on whether the Hessian matrix H is positive definite, negative definite, or otherwise. A symmetric matrix A is positive definite if xᵀAx > 0 for all x ≠ 0 (alternatively, all of A's eigenvalues are positive). Note: a positive/negative semi-definite Hessian matrix may or may not indicate an optimal (minimal/maximal) point.
2.5 Matrix
A matrix may be viewed as a collection of vectors, one for each row in the matrix. Matrices may be used to
represent linear transformations
f : Rⁿ → Rᵐ     (2.13)

that map vectors in Rⁿ to vectors in Rᵐ. For example, in ScalaTion an m-by-n matrix A with m = 3
rows and n = 2 columns may be created as follows:
  val a = MatrixD ((3, 2), 1, 2,
                           3, 4,
                           5, 6)

to produce matrix A.

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$$
Matrix A will transform u vectors in R² into v vectors in R³.

Au = v     (2.14)

For example,

$$A \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 5 \\ 11 \\ 17 \end{bmatrix}$$
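In ScalaTion, this transformation is a one-line matrix-vector multiplication (a minimal sketch using the matrix a created above):

  import scalation.mathstat.{MatrixD, VectorD}

  val a = MatrixD ((3, 2), 1, 2,
                           3, 4,
                           5, 6)
  val u = VectorD (1, 2)
  println (a * u)                                  // matrix-vector product: [5, 11, 17]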
ScalaTion supports retrieval of row vectors, column vectors and matrix elements. In particular, the following access operations are supported.

A           = a                    = matrix
a_i         = a(i)                 = row vector i
a_:j        = a(?, j)              = column vector j
a_ij        = a(i, j)              = the element at row i and column j
A_{i:k,j:l} = a(i to k, j to l)    = row and column matrix slice

Note, i until k does not include k, while i to k does. Common operations on matrices are supported as well.
Matrix Addition and Subtraction

Matrices are added and subtracted element-wise, e.g., for C = A + B,

$$c_{ij} = a_{ij} + b_{ij} \qquad (2.15)$$

Matrix Multiplication

$$c_{ij} = \sum_{k=0}^{n-1} a_{ik}\, b_{kj} = a_i \cdot b_{:j} \qquad (2.16)$$
Mathematically, this is written as C = AB. The ij element in matrix C is the vector dot product of the ith
row of A with the j th column of B.
Matrix Transpose

The transpose of matrix A, written Aᵀ (val t = a.transpose or val t = a.T), simply exchanges the roles of rows and columns.
  def transpose: MatrixD =
      val a = Array.ofDim [Double] (dim2, dim)
      for j <- indices do
          val v_j = v(j)
          var i = 0
          cfor (i < dim2, i += 1) { a(i)(j) = v_j(i) }
      end for
      new MatrixD (dim2, dim, a)
  end transpose
Matrix Determinant

The determinant of a square (m = n) matrix A, written |A| (val d = a.det), indicates whether the matrix is singular or not (and hence whether it is invertible), based on whether the determinant is zero or not.
Trace of a Matrix

The trace of matrix A ∈ Rⁿˣⁿ is simply the sum of its diagonal elements.

$$\mathrm{tr}(A) = \sum_{i=0}^{n-1} a_{ii} \qquad (2.17)$$

In ScalaTion, the trace is computed using the trace method (e.g., a.trace).
Matrix Dot Product
ScalaTion provides several types of dot products on both vectors and matrices, two of which are shown
below. The first method computes the usual dot product between two vectors. Note, the parameter y is
generalized to take any vector-like data type.
  def dot (y: IndexedSeq [Double]): Double =
      var sum = 0.0
      for i <- v.indices do sum += v(i) * y(i)
      sum
  end dot
When relevant, an n-vector (e.g., x ∈ Rⁿ) may be viewed as an n-by-1 matrix (column vector), in which case xᵀ would be viewed as a 1-by-n matrix (row vector). Consequently, dot product (and outer product) can be defined in terms of matrix multiplication and transpose operations.

x · y = xᵀ y        dot (inner) product     (2.18)
x ⊗ y = x yᵀ        outer product           (2.19)
The second method takes the dot product of two matrices. It extends the notion to matrices and is an efficient way to compute AᵀB = A · B = a.transpose * b = a dot b.
  def dot (y: MatrixD): MatrixD =
      if dim != y.dim then
          flaw ("dot", s"matrix dot matrix - incompatible cross dimensions: dim = $dim, y.dim = ${y.dim}")

      val a = Array.ofDim [Double] (dim2, y.dim2)
      for i <- 0 until dim2 do                     // each column i of this matrix
          for j <- 0 until y.dim2 do               // each column j of matrix y
              var sum = 0.0
              for k <- 0 until dim do sum += v(k)(i) * y.v(k)(j)
              a(i)(j) = sum                        // column i dot column j
          end for
      end for
      new MatrixD (dim2, y.dim2, a)                // simple (unoptimized) sketch of the elided loops
  end dot
2.6 Matrix Factorization
Many problems in data science involve matrix factorization, for example to solve linear systems of equations or to perform Ordinary Least Squares (OLS) estimation of parameters. ScalaTion supports several factorization techniques, including the techniques shown in Table 2.2. These algorithms are faster or more numerically stable than algorithms for matrix inversion. See the Prediction chapter to see how matrix factorization is used in Ordinary Least Squares estimation.
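The factorization techniques share a factor-then-solve pattern. A minimal sketch using LU factorization (the same calls appear in the exercises at the end of this chapter):

  import scalation.mathstat.{Fac_LU, MatrixD, VectorD}

  val x  = MatrixD ((2, 2), 1, 3,
                            2, 1)
  val y  = VectorD (1, 7)
  val lu = new Fac_LU (x)                          // factor X into L and U
  val b  = lu.factor ().solve (y)                  // solve LUb = y, giving b = [4, -1]
  println (s"b = $b")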
Eigenvalues and Eigenvectors

Consider the matrix A and the unit vectors x and z:

$$A = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad z = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Multiplying A and x yields [2, 0]ᵀ, while multiplying A and z yields [0, 3]ᵀ. Thus, letting λ₀ = 2 and λ₁ = 3, we see that Ax = λ₀x and Az = λ₁z. In general, an n-by-n matrix A of rank r will have r non-zero eigenvalues λᵢ with corresponding eigenvectors x⁽ⁱ⁾ such that

$$A\, x^{(i)} = \lambda_i\, x^{(i)} \qquad (2.20)$$

In other words, there will be r unit eigenvectors, for which multiplying by the matrix simply rescales the eigenvector x⁽ⁱ⁾ by its eigenvalue λᵢ. The same will happen for any non-zero vector in alignment with one of the r unit eigenvectors.

Given an eigenvalue λᵢ, an eigenvector may be found by noticing that

$$(A - \lambda_i I)\, x^{(i)} = 0 \qquad (2.21)$$

Any vector in the nullspace of the matrix A − λᵢI is an eigenvector corresponding to λᵢ. Note, if the above equation is transposed, it is called a left eigenvalue problem (see the section on Markov Chains).
In low dimensions, the eigenvalues may be found as roots of the characteristic polynomial/equation derived from taking the determinant of A − λᵢI. Software like ScalaTion, however, uses iterative algorithms that convert a matrix into Hessenberg and tridiagonal forms.
2.7 Internal Representation
The current internal representation used for storing the elements in a dense matrix is Array [Array [Double]] in row major order (row-by-row). Depending on usage, operations may be more efficient using column major order (column-by-column). Also, using a one-dimensional array Array [Double] mapping (i, j) to the kth location may be more efficient. Furthermore, having operations access through sub-matrices (blocks) may improve performance because of caching efficiency or improved performance for parallel and distributed versions.
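For example, one possible row-major mapping of a 2D index (i, j) into a 1D array position k is sketched below (illustrative; not ScalaTion's actual implementation):

  val dim = 3; val dim2 = 2
  val flat = Array.ofDim [Double] (dim * dim2)            // one-dimensional storage

  inline def index (i: Int, j: Int): Int = i * dim2 + j   // k = i * dim2 + j

  flat (index (2, 1)) = 6.0                               // set element (2, 1)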
The mathstat package provides several classes implementing multiple types of vectors and matrices as
shown in Table 2.3 including VectorD and MatrixD.
The suffix ‘D’ indicates the base element type is Double. There are also implementations for Complex ‘C’,
Int ‘I’, Long ‘L’, Rational ‘Q’, Real ‘R’, String ‘S’, and TimeNum ‘T’.
Note, ScalaTion 2.0 currently only supports dense vectors and matrices. See older versions for the
other types of vectors and matrices.
ScalaTion supports many operations involving matrices and vectors, including those shown in Table 2.5.
2.8 Tensor
Loosely speaking, a tensor is a generalization of scalar, vector and matrix. The order of the tensor indicates the number of dimensions. In this text, tensors are treated as hyper-matrices, and issues such as basis independence, contravariant and covariant vectors/tensors, and the rules for index notation involving super- and subscripts are ignored [111]. To examine the relationship between order 2 tensors and matrices more deeply, see the last exercise.
For data science, input into a model may be a vector (e.g., simple regression, univariate time series), a
matrix (e.g., multiple linear regression, neural networks), a tensor with three dimensions (e.g., monochro-
matic/greyscale images), and a tensor with four dimensions (e.g., color images).
2.8.1 Three Dimensional Tensors
In ScalaTion, tensors with three dimensions are supported by the TensorD class.

  @param dim   size of the 1st level/dimension (row) of the tensor
  @param dim2  size of the 2nd level/dimension (column) of the tensor
  @param dim3  size of the 3rd level/dimension (sheet) of the tensor

  class TensorD (val dim: Int, val dim2: Int, val dim3: Int,
                 private [mathstat] var v: Array [Array [Array [Double]]] = null)
        extends Error with Serializable
A tensor T is stored in a triple array [t_ijk]. Below is an example of a 2-by-2-by-2 tensor, T = [T::0 | T::1]

$$\begin{bmatrix} t_{000} & t_{010} & \big| & t_{001} & t_{011} \\ t_{100} & t_{110} & \big| & t_{101} & t_{111} \end{bmatrix}$$
2.8.2 Four Dimensional Tensors
In ScalaTion, tensors with four dimensions are supported by the Tensor4D class. The default names for
the dimensions [111] were chosen to follow a common convention (row, column, sheet, channel).
  @param dim   size of the 1st level/dimension (row) of the tensor (height)
  @param dim2  size of the 2nd level/dimension (column) of the tensor (width)
  @param dim3  size of the 3rd level/dimension (sheet) of the tensor (depth)
  @param dim4  size of the 4th level/dimension (channel) of the tensor (spectra)

  class Tensor4D (val dim: Int, val dim2: Int, val dim3: Int, val dim4: Int,
                  private [mathstat] var v: Array [Array [Array [Array [Double]]]] = null)
        extends Error with Serializable
2.9 Exercises
1. Draw two 2-dimensional non-zero vectors, x and y, whose dot product x · y is zero.
2. A vector can be transformed into a unit vector in the same direction by dividing by its norm, x/‖x‖. Let y = 2x and show that the dot product of the corresponding unit vectors equals one. This means that their Cosine Similarity equals one.

$$\cos_{xy} = \cos(\theta) = \frac{x \cdot y}{\|x\|\,\|y\|} \qquad \text{where } \theta \text{ is the angle between the vectors}$$
3. Correlation ρ_xy vs. Cosine Similarity cos_xy. What does it mean when the correlation (cosine similarity) is 1, 0, or -1, respectively? In general, does ρ_xy = cos_xy? What about in special cases?
4. Given the matrix X and the vector y, solve for the vector b in the equation y = Xb using matrix
inversion and LU factorization.
import scalation.mathstat.{MatrixD, VectorD, Fac_LU}

val x = MatrixD ((2, 2), 1, 3,
                         2, 1)
val y = VectorD (1, 7)
println ("using inverse: b = X^-1 y = " + x.inverse * y)
println ("using LU factorization: b = " + { val lu = new Fac_LU (x); lu.factor ().solve (y) })
Modify the code to show the inverse matrix X −1 and the factorization into the L and U matrices.
5. If Q is an orthogonal matrix, then QᵀQ becomes what type of matrix? What about QQᵀ? Illustrate with an example 3-by-3 matrix. What is the inverse of Q?
6. Show that the Hessian matrix of a scalar-valued function f : R^n → R is the transpose of the Jacobian of the gradient, i.e.,

H f(x) = [J ∇f(x)]ᵀ
7. Critical points for a function f : R^n → R occur when ∇f(x) = 0. How can the Hessian matrix be used to decide whether a particular critical point is a local minimum or maximum?
8. Define three functions, f1 (x, y), f2 (x, y) and f3 (x, y), that have critical points (zero gradient) at the
point [2, 3] such that this point is (a) a minimal point, (b) a maximal point, (c) a saddle point,
respectively. Compute the Hessian matrix at this point for each function and use it to explain the type
of critical point. Plot the three surfaces in 3D.
Hint: see https://www.math.usm.edu/lambers/mat280/spr10/lecture8.pdf
9. Determine the eigenvalues for the matrix A given in the section on eigenvalues and eigenvectors, by
setting the determinant of A − λI equal to zero.
" #
2−λ 0
0 3−λ
(2 − λ)(3 − λ) − 0 = 0
10. A vector space V over field K (e.g., R or C) is a set of objects, e.g., vectors x, y, and z, and two operations, addition and scalar multiplication, that are closed over V,

x, y ∈ V =⇒ x + y ∈ V   (2.22)
x ∈ V and a ∈ K =⇒ ax ∈ V   (2.23)

and satisfy the following axioms:
(x + y) + z = x + (y + z)
x+y = y+x
∃ 0 ∈ V s.t. x + 0 = x
∃ − x ∈ V s.t. x + (−x) = 0
(ab)x = a(bx)
a(x + y) = ax + ay
(a + b)x = ax + bx
∃1 ∈ K s.t. 1x = x
11. A normed vector space V over field K is a vector space with a function defined that gives the length (norm) of a vector,

x ∈ V =⇒ ‖x‖ ∈ R+
The distance between vectors x and y is given by

d(x, y) = ‖x − y‖

and the ℓp-norm of an n-dimensional vector x is

‖x‖_p = ( Σ_{i=0}^{n−1} |x_i|^p )^{1/p}
Norms and distances are very useful in data science; for example, loss functions used to judge/optimize models are often defined in terms of norms or distances.
Show that the last axiom, called the triangle inequality, holds for ℓ2-norms.
Hint: ‖x‖₂² is the sum of the squared elements of x.
12. An inner product space H over field K is a vector space with one more operation, the inner product,

x, y ∈ H =⇒ ⟨x, y⟩ ∈ K

⟨x, y⟩ = ⟨y, x⟩*
⟨ax + by, z⟩ = a⟨x, z⟩ + b⟨y, z⟩
⟨x, x⟩ > 0 unless x = 0
Note, the complex conjugate negates the imaginary part of a complex number, e.g., (c + di)* = c − di.
Show that an n-dimensional Euclidean vector space using the definition of dot product given in this chapter is an inner product space over R.
13. Explain the meaning of the following statement, “a tensor of order 2 for a given coordinate system can
be represented by a matrix.”
Hint: see “Tensors: A Brief Introduction” [32]
2.10 Further Reading
1. Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares [21]
Chapter 3
Probability
Probability is used to measure the likelihood of certain events occurring, such as flipping a coin and getting a
head, rolling a pair of dice and getting a sum of 7, or getting a full house in five card draw. Given a random
experiment, the sample space Ω is the set of all possible outcomes.
Technically speaking, an event is a measurable subset of Ω (see [41] for a measure-theoretic definition).
Letting F be the set of all possible events, one may define a probability space as follows:
Definition: A probability space is defined as a triple (Ω, F, P ).
Given an event A ∈ F, the probability of its occurrence is restricted to the unit interval, P (A) ∈ [0, 1].
Thus, P may be viewed as a function that maps events to the unit interval.
P : F → [0, 1] (3.2)
If events A and B are independent, the probability of both occurring is simply the product of the individual probabilities, P(AB) = P(A) P(B).
3.1.2 Conditional Probability
The conditional probability of the occurrence of event A, given it is known that event B has occurred/will occur, is

P(A|B) = P(AB) / P(B)   (3.5)
If events A and B are independent, the conditional probability reduces to P(A|B) = P(A).
Bayes Theorem

P(A|B) = P(B|A) P(A) / P(B)   (3.7)
When determining the conditional probability A|B directly is difficult, one may try going the other direction and first determine B|A.
Example
Consider flipping two fair coins. The size of the outcome space is 4 and, since the event space F contains all subsets of Ω, its size is 2⁴ = 16. Define the following two events:
What is the probability that event A occurred, given that you know that event B occurred? Since fair coins are used, the probability of a head (or tail) is 1/2 and the probabilities reduce to the ratios of set sizes.
3.2 Random Variable
Rather than just looking at individual events, e.g., E1 or E2 , one is often more interested in the probability
that random variables take on certain values.
Definition: A random variable y is a function that maps outcomes in the sample space Ω into a set/domain
of numeric values Dy .
y : Ω → Dy (3.8)
Some commonly used domains are the real numbers R, integers Z, natural numbers N, or subsets thereof. An example of a mapping from outcomes to numeric values is tail → 0, head → 1. In other cases, such as the roll of a single die, the map is the identity function.
One may think of a random variable y (blue font) as taking on values from a given domain Dy . With a
random variable, its value is uncertain, i.e., its value is only known probabilistically.
For A ⊆ Dy one can measure the probability of the random variable y taking on a value from the set A. This is denoted by

P(y ∈ A)   (3.9)

This corresponds to the underlying event E, the pre-image of A under y,

E = y⁻¹(A)   (3.10)

whose probability is

P(E)   (3.11)
3.3 Probability Distribution
A random variable y is characterized by how its probability is distributed over its domain Dy . This can be
captured by functions that map Dy to R+ .
Fy : Dy → [0, 1] (3.12)
It measures the amount probability or mass accumulated over the domain up to and including the point y.
The color highlighted symbol y is the random variable, while y simply represents a value.
Fy (y) = P (y ≤ y) (3.13)
To illustrate the concept, let x1 and x2 be the number on die 1 and die 2, respectively. Let y = x1 + x2; then Fy(6) = P(y ≤ 6) = 5/12. The entire CDF for the discrete random variable y (roll of two dice), Fy(y), is

{(2, 1/36), (3, 3/36), (4, 6/36), (5, 10/36), (6, 15/36), (7, 21/36), (8, 26/36), (9, 30/36), (10, 33/36), (11, 35/36), (12, 36/36)}
As another example, the CDF for a continuous random variable y that is defined to be uniformly distributed on the interval [0, 2] is

Fy(y) = y/2   on [0, 2]

When random variable y follows this CDF, we may say that y is distributed as Uniform (0, 2), symbolically, y ∼ Uniform (0, 2).
For a discrete random variable, a probability mass function (pmf) may be defined.

py : Dy → [0, 1]   (3.14)

It can be calculated as the first difference of the CDF, i.e., the amount of accumulated mass at point yi minus the amount of accumulated mass at the previous point yi−1,

py(yi) = Fy(yi) − Fy(yi−1)   (3.15)

For a single die x1, the pmf is

{(1, 1/6), (2, 1/6), (3, 1/6), (4, 1/6), (5, 1/6), (6, 1/6)}
A second die x2 will have the same pmf. Both random variables follow the Discrete Uniform Distribution, Randi (1, 6).

px(x) = (1/6) · 1{1≤x≤6}   (3.16)
The pmf for the sum y = x1 + x2 is

{(2, 1/36), (3, 2/36), (4, 3/36), (5, 4/36), (6, 5/36), (7, 6/36), (8, 5/36), (9, 4/36), (10, 3/36), (11, 2/36), (12, 1/36)}

The random variable y follows the Discrete Triangular Distribution (that peaks in the middle) and not the flat Discrete Uniform Distribution.

py(y) = (min(y − 1, 13 − y)/36) · 1{2≤y≤12}   (3.17)

or equivalently,

py(y) = (6 − |7 − y|)/36   for y ∈ {2, . . . , 12}   (3.18)
Suppose y is defined on a continuous domain, e.g., Dy = [0, 2], and that mass/probability is uniformly spread amongst all the points in the domain. In such situations, it is not productive to consider the mass at one particular point. Rather, one would like to consider the mass in a small interval and scale it by dividing by the length of the interval. In the limit, this is the derivative, which gives the density. For a continuous random variable, if the function Fy is differentiable, a probability density function (pdf) may be defined.
fy : Dy → R+   (3.19)

fy(y) = dFy(y)/dy   (3.20)
For example, the pdf for a uniformly distributed random variable y on [0, 2] is

fy(y) = d/dy (y/2) = 1/2   on [0, 2]

The pdf for the Uniform Distribution is shown in the figure below.
[Figure: pdf for the Uniform (0, 2) Distribution, fy(y) = 1/2 plotted for y ∈ [0, 2]]
Random variates of this type may be generated using ScalaTion’s Uniform (0, 2) class within the
scalation.random package.
import scalation.random.Uniform

val rvg = Uniform (0, 2)
val yi  = rvg.gen
For another example, the pdf for an exponentially distributed random variable y on [0, ∞) with rate parameter λ is

fy(y) = λ e^{−λy}   on [0, ∞)

The pdf for the Exponential (λ = 1) Distribution is shown in the figure below.
[Figure: pdf for the Exponential (λ = 1) Distribution, fy(y) = e^{−y} plotted for y ∈ [0, 4]]
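Exponential random variates may be generated similarly; a sketch assuming scalation.random.Exponential takes the mean (1/λ) as its parameter:

import scalation.random.Exponential

val rvg = Exponential (1.0)                      // mean 1/lambda = 1 (assumed parameterization)
val yi  = rvg.gen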
Going the other direction, the CDF Fy(y) can be computed by summing the pmf py(y),

Fy(y) = Σ_{xi ≤ y} py(xi)   (3.21)
3.4 Empirical Distribution
An empirical distribution may be used to describe a dataset probabilistically. Consider a dataset (X, y)
where X ∈ Rm×n is the data matrix collected about the predictor variables and y ∈ Rm is the data vector
collected about the response variable. In other words, the dataset consists of m instances of an n-dimensional
predictor vector xi and a response value yi .
The joint empirical probability mass function (epmf) may be defined on the basis of a given dataset
(X, y).
pdata(x, y) = ν(x, y)/m = (1/m) Σ_{i=0}^{m−1} 1{xi = x, yi = y}   (3.23)
where ν(x, y) is the frequency count and 1{c} is the indicator function (if c then 1 else 0).
The corresponding Empirical Cumulative Distribution Function (ECDF) may be defined as follows:
Fdata(x, y) = (1/m) Σ_{i=0}^{m−1} 1{xi ≤ x, yi ≤ y}   (3.24)
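For a single response vector y, the ECDF is just the fraction of instances at most x; a minimal sketch using only VectorD's dim and element access:

import scalation.mathstat.VectorD

def ecdf (y: VectorD)(x: Double): Double =
    var cnt = 0
    for i <- 0 until y.dim do if y(i) <= x then cnt += 1
    cnt / y.dim.toDouble
end ecdf

val y = VectorD (1, 2, 2, 3, 5)
println (ecdf (y)(2.0))                          // 3 of 5 instances -> 0.6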
3.5 Expectation
Using the definition of a CDF, one can determine the expected value (or mean) for random variable y using
a Riemann-Stieltjes integral.
E[y] = ∫_{Dy} y dFy(y)   (3.25)
The mean specifies the center of mass, e.g., a two-meter rod with the mass evenly distributed throughout would have a center of mass at 1 meter. Although it will not affect the center of mass calculation, since the total probability is 1, unit mass is assumed (one kilogram). The center of mass is the balance point in the middle of the bar.
For a continuous random variable, E[y] = ∫_{Dy} y fy(y) dy (3.26), while for a discrete random variable, the expectation reduces to a probability-weighted sum,

E[y] = Σ_{y∈Dy} y py(y)   (3.27)
The mean for rolling two dice is E[y] = 7. One way to interpret this is to imagine winning y dollars by playing a game, e.g., two dollars for rolling a 2 and twelve dollars for rolling a 12, etc. The expected earnings when playing the game once are seven dollars. Also, by the law of large numbers, the average earnings for playing the game n times will converge to seven dollars as n gets large.
3.5.3 Variance
The variance of random variable y is given by
V[y] = E[(y − E[y])²]   (3.28)
The variance specifies how the mass spreads out from the center of mass. For example, the variance of y ∼
Uniform (0, 2) is
V[y] = E[(y − 1)²] = ∫₀² (y − 1)² (1/2) dy = 1/3
That is, the variance of the one kilogram, two-meter rod is 1/3 kilogram·meter². Again, for probability to be viewed as mass, unit mass (one kilogram) must be used, so the answer may also be given as 1/3 meter². Similar to the interpretation of the mean as the center of mass, the variance corresponds to the moment of inertia.
The standard deviation is simply the square root of variance.
SD[y] = √V[y]   (3.29)
For the two-meter rod, the standard deviation is 1/√3 = 0.57735. The percentage of mass within one standard deviation unit of the center of mass is then 58%. Many distributions, such as the Normal (Gaussian) distribution, concentrate mass closer to the center. For example, the Standard Normal Distribution has the following pdf.
fy(y) = (1/√(2π)) e^{−y²/2}   (3.30)
The mean for this distribution is 0, while the variance is 1. The percentage of mass within one standard
deviation unit of the center of mass is 68%. The pdf for the Normal (µ = 0, σ 2 = 1) Distribution is shown
in the figure below.
[Figure: pdf for the Normal (µ = 0, σ² = 1) Distribution plotted for y ∈ [−3, 3]]
Note, the uncentered variance (or mean square) of the random variable y is simply E[y²].
3.5.4 Covariance
The covariance of two random variable x and y is given by
78
C [z] = C [zi , zj ] 0≤i,j<k (3.32)
3.6 Algebra of Random Variables
When random variables x1 and x2 are added to create a new random variable y,

y = x1 + x2

how is y described in terms of mean, variance and probability distribution? Also, what happens when a random variable is multiplied by a constant?

y = ax

The expectation of a sum is the sum of the expectations, E[x1 + x2] = E[x1] + E[x2], and the expectation of a random variable multiplied by a constant is the constant multiplied by the random variable's expectation, E[ax] = a E[x].
In general, the variance of a sum is V[x1 + x2] = V[x1] + V[x2] + 2 C[x1, x2]. When the random variables are independent, the covariance is zero, so the variance of the sum is just the sum of the variances. The variance of a random variable multiplied by a constant is the constant squared multiplied by the random variable's variance, V[ax] = a² V[x].
Convolution: Discrete Case
Assuming the random variables are independent and discrete, the pmf of the sum py is the convolution of
two pmfs px1 and px2 .
py(y) = Σ_{x∈Dx} px1(x) px2(y − x)   (3.39)
For example, letting x1, x2 ∼ Bernoulli(p), i.e., px1(x) = p^x (1 − p)^{1−x} on Dx = {0, 1}, gives

py(0) = Σ_{x∈Dx} px1(x) px2(0 − x) = (1 − p)²

py(1) = Σ_{x∈Dx} px1(x) px2(1 − x) = 2p(1 − p)

py(2) = Σ_{x∈Dx} px1(x) px2(2 − x) = p²
which indicates that y ∼ Binomial(p, 2). The pmf for the Binomial(p, n) distribution is

py(y) = C(n, y) p^y (1 − p)^{n−y}   (3.40)

where C(n, y) is the binomial coefficient.
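The convolution sum may be checked numerically in plain Scala; for example, with p = 0.3 the three probabilities below come out to 0.49, 0.42 and 0.09, matching Binomial(0.3, 2).

val p  = 0.3
val px = Array (1 - p, p)                        // Bernoulli pmf on {0, 1}
val py = Array.tabulate (3) { y =>
    (0 to 1).map { x =>
        val i = y - x                            // argument to the second pmf
        if i >= 0 && i <= 1 then px(x) * px(i) else 0.0
    }.sum
}
println (py.mkString (", "))                     // 0.49, 0.42, 0.09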
[Figure: 6-by-6 grid of the outcomes for two dice x and z, with downward diagonals collecting equal sums]
As the joint pmf pxz (xi , zj ) = px (xi )pz (zj ) = 1/36 is constant over all points, the convolution sum for a
particular value of y corresponds to the downward diagonal sum where the dice sum to that value, e.g.,
py (3) = 2/36, py (7) = 6/36.
Convolution: Continuous Case
Now, assuming the random variables are independent and continuous, the pdf of the sum fy is the convolution
of two pdfs fx1 and fx2 .
fy(y) = ∫_{Dx} fx1(x) fx2(y − x) dx   (3.42)
For example, letting x1, x2 ∼ Uniform(0, 1), i.e., fx1(x) = 1 on Dx = [0, 1], gives

for y ∈ [0, 1]:  fy(y) = ∫_{[0, y]} fx1(x) fx2(y − x) dx = y

for y ∈ [1, 2]:  fy(y) = ∫_{[y−1, 1]} fx1(x) fx2(y − x) dx = 2 − y
y = Σ_{i=0}^{m−1} xi

When xi ∼ Uniform(0, 1) with mean 1/2 and variance 1/12, then for m large enough, y will follow a Normal distribution

y ∼ Normal(µ, σ²)

where µ = m/2 and σ² = m/12. The pdf for the Normal Distribution is
fy(y) = (1/(√(2π) σ)) e^{−(1/2)((y−µ)/σ)²}   (3.43)
For most distributions, summed random variables will be approximately distributed as Normal, as in-
dicated by the Central Limit Theorem (CLT); for proofs see [47, 11]. Suppose xi ∼ F with mean µx and
variance σx2 < ∞, then the sum of m independent and identically distributed (iid) random variables is
distributed as follows:
y = Σ_{i=0}^{m−1} xi ∼ N(m µx, m σx²)   as m → ∞   (3.44)

This is one simple form of the CLT. See the exercises for a visual illustration of the CLT.
Similarly, the sum of m independent and identically distributed random variables (with mean µx and variance σx²) divided by m will also be Normally distributed for sufficiently large m.
y = (1/m) Σ_{i=0}^{m−1} xi

The expectation of y is (1/m) m µx = µx, while the variance is σx²/m, so

y ∼ Normal(µx, σx²/m)
As E[y] = µx, y can serve as an unbiased estimator of µx. This can be transformed to the Standard Normal Distribution with the following transformation.

z = (y − µx) / (σx/√m) ∼ Normal(0, 1)
The Normal distribution is also referred to as the Gaussian distribution. See the exercises for related
distributions: Chi-square, Student’s t and F .
3.7 Median, Mode and Quantiles
As stated, the mean is the expected value, a probability weighted sum/integral of the values in the domain of
the random variable. Other ways of characterizing a distribution are based more directly on the probability.
3.7.1 Median
Moving along the distribution, the place at which half of the mass is below you and half is above you is the
median.
P(y ≤ median) ≥ 1/2   and   P(y ≥ median) ≥ 1/2   (3.45)
Given equally likely values (1, 2, 3), the median is 2. Given equally likely values (1, 2, 3, 4), there are two common interpretations for the median: the smallest value satisfying the above equation (i.e., 2) or the average of the values satisfying the equation (i.e., 2.5). The median for two dice (with the numbers summed), which follow the Triangular distribution, is 7.
3.7.2 Quantile
The median is also referred to as the half quantile.
Q[y] = Fy⁻¹(1/2)   (3.46)
More generally, the p ∈ [0, 1] quantile is given by the inverse CDF. For the Uniform (0, 2) example,

p = Fy(y) = y/2   on [0, 2]

Taking the inverse yields the iCDF, y = Fy⁻¹(p) = 2p.
3.7.3 Mode
Similarly, one may be interested in the mode, which is the average of the points of maximal probability mass.
The mode for rolling two dice is y = 7. For continuous random variables, it is the average of points of
maximal probability density.
For the two-meter rod, the mean, median and mode are all equal to 1.
3.8 Joint, Marginal and Conditional Distributions
Knowledge of one random variable may be useful in narrowing down the possibilities for another random
variable. Therefore, it is important to understand how probability is distributed in multiple dimensions.
There are three main concepts: joint, marginal and conditional.
In general, the joint CDF for two random variables x and y is Fxy(x, y) = P(x ≤ x, y ≤ y). In the discrete case, the joint pmf may be recovered from the joint CDF by taking differences in both dimensions,

pxy(xi, yj) = Fxy(xi, yj) − [Fxy(xi−1, yj) + Fxy(xi, yj−1) − Fxy(xi−1, yj−1)]   (3.51)
See the exercises to check this formula for the matrix shown below.
Imagine nine weights placed in a 3-by-3 grid with the number indicating the relative mass.
px(xi) = Σ_{yj∈Dy} pxy(xi, yj)   (sum out y)   (3.52)

py(yj) = Σ_{xi∈Dx} pxy(xi, yj)   (sum out x)   (3.53)
Carrying out the summations or calling margProbX (pxy) for px (xi ) and margProbY (pxy) for py (yj ) gives,
It is now easy to see that px is based on row sums, while py is based on column sums.
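A small sketch of these calls (assuming the Probability object resides in scalation.mathstat, as suggested by its use of VectorD and MatrixD):

import scalation.mathstat.{MatrixD, Probability}
import Probability.{margProbX, margProbY}

val pxy = MatrixD ((2, 2), 0.1, 0.2,
                           0.3, 0.4)             // a small joint pmf
println (margProbX (pxy))                        // row sums    -> (0.3, 0.7)
println (margProbY (pxy))                        // column sums -> (0.4, 0.6)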
3.8.2 Continuous Case: Joint and Marginal Density
In the continuous case, the joint pdf for two random variables x and y is
fxy(x, y) = ∂²Fxy/∂x∂y (x, y)   (3.54)
Consider the following joint pdf that specifies the distribution of one kilogram of mass (or probability)
uniformly over a 2-by-3 meter plate.
fxy(x, y) = 1/6   on [0, 2] × [0, 3]
[Figure: one kilogram of mass spread uniformly over a 2-by-3 plate, with a vertical red line and a horizontal green line indicating the two marginals]
Fxy(x, y) = ∫₀ˣ ∫₀ʸ (1/6) dy dx = xy/6
There are two marginal pdfs that are single integrals: Think of the mass of the vertical red line being
collected into the thick red bar at the bottom. Collecting all such lines creates the red bar at the bottom
and its mass is distributed as follows:
fx(x) = ∫₀³ (1/6) dy = 3/6 = 1/2   on [0, 2]   (integrate out y)
Now think of the mass of the horizontal green line being collected into the thick green bar on the left.
Collecting all such lines creates the green bar on the left and its mass is distributed as follows:
fy(y) = ∫₀² (1/6) dx = 2/6 = 1/3   on [0, 3]   (integrate out x)
3.8.3 Discrete Case: Conditional Mass
Conditional probability can be examined locally. Given two discrete random variables x and y, the conditional
mass function of x given y is defined as follows:
px|y(xi, yj) = P(x = xi | y = yj) = pxy(xi, yj) / py(yj)   (3.55)
where pxy (xi , yj ) is the joint mass function. Again, the marginal mass functions are
px(xi) = Σ_{yj∈Dy} pxy(xi, yj)

py(yj) = Σ_{xi∈Dx} pxy(xi, yj)
Consider the following example: Roll two dice. Let x be the value on the first die and y be the sum of the two dice. Compute the conditional pmf for x given that it is known that y = 2.

px|y(xi, 2) = P(x = xi | y = 2) = pxy(xi, 2) / py(2)   (3.56)
Try this problem for each possible value for y.
3.8.4 Continuous Case: Conditional Density
Given two continuous random variables x and y, the conditional density function of x given y is defined as follows:

fx|y(x, y) = fxy(x, y) / fy(y)   (3.57)
where fxy (x, y) is the joint density function. The marginal density functions are
fx(x) = ∫_{y∈Dy} fxy(x, y) dy   (3.58)

fy(y) = ∫_{x∈Dx} fxy(x, y) dx   (3.59)
The marginal density function in the x-dimension is the probability mass projected onto the x-axis from all
other dimensions, e.g., for a bivariate distribution with mass distributed in the first xy quadrant, all the
mass will fall onto the x-axis.
Consider the example below, where the random variable x indicates how far down the center-line of a straight golf hole the golf ball was driven, in units of 100 yards. The random variable y indicates how far left (positive) or right (negative) the golf ball ends up from the center of the fairway. Let us call these random variables distance and dispersion. The golfer teed the ball up at location [0, 0]. For simplicity, assume the probability is uniformly distributed within the triangle.
[Figure: triangular region of golf ball positions, x ∈ [0, 3] by y ∈ [−1, 1]]
fxy(x, y) = 1/3   on x ∈ [0, 3], y ∈ [−x/3, x/3]
The distribution (density) of the driving distance down the center-line is given by the marginal density for the random variable x,

fx(x) = ∫_{−x/3}^{x/3} (1/3) dy = [y/3]_{−x/3}^{x/3} = 2x/9
Therefore, the conditional density of dispersion y given distance x is given by

fy|x(x, y) = fxy(x, y) / fx(x) = (1/3) / (2x/9) = 3/(2x)   on y ∈ [−x/3, x/3]
3.8.5 Independence
Two random variables x and y are said to be independent, denoted x ⊥ y, when the joint CDF (equivalently pmf/pdf) can be factored into the product of its marginal CDFs (equivalently pmfs/pdfs).
For example, determine which of the following two joint density functions defined on [0, 1]² signify independence.
For the first joint density, fxy(x, y) = 4xy, the two marginal densities are the following:
fx(x) = ∫₀¹ 4xy dy = [4xy²/2]₀¹ = 2x
fy(y) = ∫₀¹ 4xy dx = [4x²y/2]₀¹ = 2y
The product of the marginal densities, fx(x) fy(y) = 4xy, is the joint density, so this joint density signifies independence.
Compute the conditional density under the assumption that the random variables, x and y, are indepen-
dent.
fx|y(x, y) = fxy(x, y) / fy(y)   (3.62)
As the joint density can be factored, fxy(x, y) = fx(x) fy(y), we obtain

fx|y(x, y) = fx(x) fy(y) / fy(y) = fx(x)   (3.63)
showing that the value of random variable y has no effect on x. See the exercises for a proof that independence
implies zero covariance (and therefore zero correlation).
3.8.6 Conditional Expectation
The conditional expectation of x given y = y is a probability-weighted sum using the conditional pmf,

E[x | y = y] = Σ_{x∈Dx} x px|y(x, y)   (3.65)
Consider the previous example on the dispersion y of a golf ball conditioned on the driving distance x. Compute the conditional mean and the conditional variance for y given x.
µy|x = E[y | x = x] = ∫_{−x/3}^{x/3} y fy|x(x, y) dy

σ²y|x = E[(y − µy|x)² | x = x] = ∫_{−x/3}^{x/3} (y − µy|x)² fy|x(x, y) dy
3.8.7 Conditional Independence
A wide class of modeling techniques fall under the umbrella of probabilistic graphical models (e.g., Bayesian Networks and Markov Networks). They work by factoring a joint probability based on conditional independencies. Random variables x and y are conditionally independent given z, denoted

x ⊥ y | z

which means that

px,y|z(x, y | z) = px|z(x | z) py|z(y | z)
3.9 Odds
Another way of looking at probability is odds. This is the ratio of the probability of an event A occurring over the probability of the event not occurring, S − A.

odds(y ∈ A) = P(y ∈ A) / P(y ∈ S − A) = P(y ∈ A) / (1 − P(y ∈ A))   (3.67)

For example, the odds of rolling a pair of dice and getting a natural is 8 to 28.

odds(y ∈ {7, 11}) = 8/28 = 2/7 ≈ 0.2857
Of the 36 individual outcomes, eight will be a natural and 28 will not. Odds can be easily calculated from
probability.
3.10 Example Problems
Understanding some of the techniques to be discussed requires some background in conditional probability.
1. Consider the probability of rolling a natural (i.e., 7 or 11) with two dice where the random variable y
is the sum of the dice.
If you knew you rolled a natural, what is the conditional probability that you rolled a 5 or 7?
P(y ∈ A | x ∈ B) = P(y ∈ A, x ∈ B) / P(x ∈ B)

where

P(y ∈ A, x ∈ B) = P(x ∈ B | y ∈ A) P(y ∈ A)

so that

P(y ∈ A | x ∈ B) = P(x ∈ B | y ∈ A) P(y ∈ A) / P(x ∈ B)
This is Bayes Theorem written using random variables, which provides an alternative way to compute conditional probabilities, i.e., P(y ∈ {5, 7} | y ∈ {7, 11}) = P(y = 7) / P(y ∈ {7, 11}) = (6/36) / (8/36) = 3/4.
2. Consider three coins, where coin 1 is two-headed and coins 2 and 3 are fair. One of the three coins is selected at random; let x indicate which coin was selected. What is the probability that coin 1 was selected?

P(x = 1) = 1/3

Obviously, the probability is 1/3, since the probability of picking any of the three coins is the same. This is the prior probability.
Not satisfied with this level of uncertainty, you conduct experiments. In particular, you flip the selected coin three times and get all heads. Let y indicate the number of heads flipped. Using Bayes Theorem, we have,
P(x = 1 | y = 3) = P(y = 3 | x = 1) P(x = 1) / P(y = 3) = (1 · (1/3)) / (5/12) = 4/5
3.11 Estimating Parameters from Samples
Given a model for predicting a response value for y from a feature/predictor vector x,

y = f(x; b) + ε

one needs to pick a functional form for f and collect a sample of data to estimate the parameters b. The sample will consist of m instances (yi, xi) that form the response/output vector y and the data/input matrix X.

y = f(X; b) + ε
There are multiple types of estimation procedures. The central ideas are to minimize error or maximize the likelihood that the model would generate data like the sample. A common way to minimize error is to minimize the Mean Squared Error (MSE). The error vector is the difference between the actual response vector y and the predicted response vector ŷ.

ε = y − ŷ = y − f(X; b)
The mean squared error based on the length (Euclidean norm) of the error vector ‖ε‖ is given by

E[‖ε‖²] = V[‖ε‖] + E[‖ε‖]²   (3.68)

where V[‖ε‖] is the error variance and E[‖ε‖] is the error mean. If the model is unbiased, the error mean will be zero, in which case the goal is to minimize the error variance.
Consider a simple model where the response is the mean plus an error term,

y = µ + ε

where ε ∼ N(0, σ²). Create a sample of size m = 100 data points, using a Normal random variate generator. The population values for the mean µ and standard deviation σ are typically unknown and need to be estimated from the sample, hence the names sample mean µ̂ and sample standard deviation σ̂. Show the generated sample by plotting the data points and displaying a histogram.
@main def sampleStats (): Unit =

end sampleStats
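One possible body for this stub is sketched below; the population mean of 10 is an assumed value, σ = 8 matches the interval half width computed later in this section, and Normal is assumed to take the variance σ² as its second parameter.

import scalation.mathstat._
import scalation.random.Normal

@main def sampleStats (): Unit =

    val rvg = Normal (10.0, 64.0)                // assumed mu = 10; sigma^2 = 64 (sigma = 8)
    val y   = VectorD (for _ <- 0 until 100 yield rvg.gen)
    println (s"sample mean      = ${y.mean}")
    println (s"sample std. dev. = ${y.stdev}")
    new Plot (null, y)                           // plot the points (null x-axis -> index)
    new Histogram (y)                            // display a histogram

end sampleStats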
Imports: scalation.mathstat._, scalation.random._.
µ̂ = (1 · y)/m = (1/m) Σ_{i=0}^{m−1} yi
To create a confidence interval, we need to determine the variability or variance in the estimate µ̂.
"
m−1
# m−1
1 X 1 X σ2
V [µ̂] = V yi = 2
V [yi ] =
m i=0 m i=0 m
The difference between the estimate from the sample and the population mean is Normally distributed and
centered at zero (show that µ̂ is an unbiased estimator for µ, i.e., E [µ̂] = µ).
µ̂ − µ ∼ N(0, σ²/m)
We would like to transform the difference so that the resulting expression follows a Standard Normal distribution. This can be done by dividing by σ/√m.
(µ̂ − µ) / (σ/√m) ∼ N(0, 1)
Consequently, the probability that the expression is greater than z is given in terms of the CDF of the Standard Normal distribution, FN(z).

P( (µ̂ − µ) / (σ/√m) > z ) = 1 − FN(z)
One might consider that if z = 2, two standard deviation units, then the estimate is not close enough. The same problem can exist on the negative side, so we should require

|µ̂ − µ| / (σ/√m) ≤ 2
In other words,
|µ̂ − µ| ≤ 2σ/√m
This condition implies that µ would likely be inside the following confidence interval.
[ µ̂ − 2σ/√m,  µ̂ + 2σ/√m ]
In this case, it is easy to compute values for the lower and upper bounds of the confidence interval. With σ = 8 and m = 100, the interval half width is simply 2 · 8/10 = 1.6, which is subtracted from and added to the sample mean.
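In plain Scala, the bounds follow directly (using the σ = 8 and m = 100 from this example):

val (sig, m) = (8.0, 100)
val ihw = 2.0 * sig / math.sqrt (m)              // interval half width = 1.6
println (s"confidence interval = [ mu^ - $ihw, mu^ + $ihw ]")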
Use ScalaTion to determine the probability that µ is within such confidence intervals.

println (s"1 - F(2) = ${1 - normalCDF (2)}")
The probability is one minus twice this value. If 1.96 is used instead of 2, what is the probability, expressed as a percentage?
Typically, the population standard deviation is unlikely to be known. It would need to be estimated by using the sample standard deviation, where the sample variance is
σ̂² = (1/(m − 1)) Σ_{i=0}^{m−1} (yi − µ̂)²   (3.69)
Note, this textbook uses θ̂ to indicate an estimator for parameter θ, regardless of whether it is a Maximum Likelihood Estimator (MLE). This substitution introduces more variability into the estimation of the confidence interval and results in the Standard Normal distribution (z-distribution)

[ µ̂ − z*σ/√m,  µ̂ + z*σ/√m ]   (3.70)

being replaced by the Student's t distribution

[ µ̂ − t*σ̂/√m,  µ̂ + t*σ̂/√m ]   (3.71)
where z* and t* represent distances from zero, e.g., 1.96 or 2.09, that are large enough so that the analyst is comfortable with the probability that they may be wrong.
The numerators for the interval half widths (ihw) are calculated by the following top-level functions in Statistics.scala. The z_sigma function is used for the z-distribution.
def z_sigma (sig: Double, p: Double = .95): Double =
    val pp = 1.0 - (1.0 - p) / 2.0               // e.g., .95 --> .975 (two tails)
    val z  = random.Quantile.normalInv (pp)
    z * sig
end z_sigma
Does the probability you determined in the last example problem make sense? Seemingly, if you took several samples, only a certain percentage of them would have the population mean within their confidence interval.
@main def confidenceIntervalTest (): Unit =

    val (mu_, sig_) = (y.mean, y.stdev)          // sample mean and std dev

end confidenceIntervalTest
3.12 Entropy
The entropy of a discrete random variable y with probability mass function (pmf) py (y) is the negative of
the expected value of the log of the probability.
H(y) = H(py) = −E[log₂ py] = −Σ_{y∈Dy} py(y) log₂ py(y)   (3.72)
For finite domains of size k = |Dy |, entropy H(y) ranges from 0 to log2 (k). Low entropy (close to 0) means
that there is low uncertainty/risk in predicting an outcome of an experiment involving the random variable
y, while high entropy (close to log2 k) means that there is high uncertainty/risk in predicting an outcome of
such an experiment. For binary classification (k = 2), the upper bound on entropy is 1.
The entropy may be normalized by setting the base of the logarithm to the size of the domain k, in which
case, the entropy will be in the interval [0, 1].
Hk(y) = Hk(py) = −E[logk py] = −Σ_{y∈Dy} py(y) logk py(y)
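A short sketch using the entropy and nentropy methods of the Probability object (assumed to reside in scalation.mathstat):

import scalation.mathstat.{Probability, VectorD}
import Probability.{entropy, nentropy}

val fair   = VectorD (0.5, 0.5)
println (entropy (fair))                         // 1 bit: maximal uncertainty for k = 2
val biased = VectorD (0.9, 0.1)
println (entropy (biased))                       // about 0.469 bits: lower uncertainty
println (nentropy (biased))                      // same here, since k = 2 matches base 2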
A random variable y ∼ Bernoulli(p) may be used to model the flip of a single coin that has a probability of success/head (1) of p. Its pmf is given by the following formula.

py(y) = p^y (1 − p)^{1−y}
The figure below plots the entropy H([p, 1 − p]) as probability of a head p ranges from 0 to 1.
[Figure: Entropy H([p, 1 − p]) for the Bernoulli pmf as p ranges from 0 to 1, peaking at 1 bit when p = 1/2]
A random variable y = z1 + z2 where z1 , z2 are distributed as Bernoulli(p) may be used to model the
sum of flipping two coins.
See the exercises for how to extend entropy to continuous random variables.
The concept of plog (probability log), plog(p) = −log₂(p), can also be used in place of probability and offers several advantages: (1) multiplying many small probabilities may lead to round-off error or underflow; (2) independence leads to addition of plog values rather than multiplication of probabilities; and (3) its relationship to log-likelihood in Maximum Likelihood Estimation.
plog(x) = Σ_j plog(xj)   for independent random variables   (3.75)
The greater the plog, the less likely the occurrence, e.g., the plog of rolling snake eyes (1, 1) with two dice is about 5.17, while the plog of rolling a 7 is about 2.58. Note, probabilities 1 and .5 give plogs of 0 and 1, respectively.
3.12.2 Joint Entropy
Entropy may be defined for multiple random variables as well. Given two discrete random variables, x, y,
with a joint pmf px,y (x, y) the joint entropy is defined as follows:
H(x, y) = H(px,y) = −E[log₂ px,y] = −Σ_{x∈Dx} Σ_{y∈Dy} px,y(x, y) log₂ px,y(x, y)   (3.76)
The conditional entropy of x given y is defined similarly, using the conditional pmf inside the logarithm:

H(x|y) = H(px|y) = −E[log₂ px|y] = −Σ_{x∈Dx} Σ_{y∈Dy} px,y(x, y) log₂ px|y(x, y)   (3.77)
Suppose an experiment involves two random variables x and y. Initially, the overall entropy is given by the joint entropy H(x, y). Now, partial evidence allows the value of y to be determined, so the overall entropy should decrease by y's entropy,

H(x|y) = H(x, y) − H(y)

When there is no dependency between x and y (i.e., they are independent), H(x, y) = H(x) + H(y), so H(x|y) = H(x). At the other extreme, when there is full dependency (i.e., the value of x can be determined from the value of y),

H(x|y) = 0   (3.80)
Given a discrete random variable y with two candidate probability mass functions (pmfs) py(y) and qy(y), the relative entropy is defined as follows:

H(py||qy) = E[log₂ (py/qy)] = Σ_{y∈Dy} py(y) log₂ (py(y)/qy(y))   (3.81)
One way to look at relative entropy is that it measures the uncertainty that is introduced by replacing
the true/empirical distribution py with an approximate/model distribution qy . If the distributions are
identical, then the relative entropy is 0, i.e., H(py ||py ) = 0. The larger the value of H(py ||qy ) the greater
the dissimilarity between the distributions py and qy .
As an example, assume the true distribution for a coin is [.6, .4], but it is thought that the coin is fair
[.5, .5]. The relative entropy is computed as follows:
H(py||qy) = .6 log₂(.6/.5) + .4 log₂(.4/.5) = 0.029
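This calculation may be reproduced with the rentropy method of the Probability object (package location assumed as before):

import scalation.mathstat.{Probability, VectorD}
import Probability.rentropy

val py = VectorD (0.6, 0.4)                      // true/empirical distribution
val qy = VectorD (0.5, 0.5)                      // model distribution (fair coin)
println (rentropy (py, qy))                      // approximately 0.029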
Given a continuous random variable y with two candidate probability density functions (pdfs) fy(y) and gy(y), the relative entropy is defined as follows:

H(fy||gy) = E[log₂ (fy/gy)] = ∫_{Dy} fy(y) log₂ (fy(y)/gy(y)) dy   (3.82)
In this subsection, we examine the relationship between KL Divergence and Maximum Likelihood. Consider the dissimilarity of an empirical distribution pdata(x, y) and a model generated distribution pmod(x, y, b).

H(pdata(x, y)||pmod(x, y, b)) = E[log (pdata(x, y)/pmod(x, y, b))]   (3.83)
= Σ_{i=0}^{m−1} pdata(xi, yi) log (pdata(xi, yi)/pmod(xi, yi, b))   (3.84)
Note that pdata(xi, yi) is unaffected by the choice of parameters b, so it represents a constant C.
H(pdata(x, y)||pmod(x, y, b)) = C − Σ_{i=0}^{m−1} pdata(xi, yi) log pmod(xi, yi, b)   (3.85)
The probability for the ith data instance is 1/m, thus
H(pdata(x, y)||pmod(x, y, b)) = C − (1/m) Σ_{i=0}^{m−1} log pmod(xi, yi, b)   (3.86)
The second term is the negative log-likelihood (see the Chapter on Generalized Linear Models for details).
Cross entropy H(py × qy) is the sum of the entropy of the empirical distribution and the model distribution's relative entropy to the empirical distribution, i.e., H(py × qy) = H(py) + H(py||qy). It can be calculated using the following formula (see exercises for details):
H(py × qy) = −Σ_{y∈Dy} py(y) log₂ qy(y)   (3.88)
Since cross entropy is more efficient to calculate than relative entropy, it is a good candidate as a loss function
for machine learning algorithms. The smaller the cross entropy, the more the model (e.g., Neural Network)
agrees with the empirical distribution (dataset). The formula looks like the one for ordinary entropy with
qy substituted in as the argument for the log function. Hence the name cross entropy.
3.12.6 Mutual Information
Recall that if x and y are independent, then for all x ∈ Dx and y ∈ Dy, pxy(x, y) = px(x) py(y).
The relative entropy (KL divergence) of the joint distribution to the product of the marginal distributions is referred to as mutual information,

H(pxy || px × py) = Σ_{x∈Dx} Σ_{y∈Dy} pxy(x, y) log₂ ( pxy(x, y) / (px(x) py(y)) )
As with covariance (or correlation), mutual information will be zero when x and y are independent. While independence merely implies zero covariance, independence is equivalent to zero mutual information. Mutual information is symmetric and non-negative. See the exercises for additional comparisons between covariance/correlation and mutual information.
While mutual information measures the dependence between two random variables, relative entropy (KL divergence) measures the dissimilarity of two distributions.
Mutual Information corresponds to Information Gain, i.e., the drop in entropy of one random variable
due to knowledge of the value of the other random variable.
3.12.7 Probability Object
Class Methods:
object Probability:

    def isProbability (px: VectorD): Boolean  = px.min >= 0.0 && abs (px.sum - 1.0) < EPSILON
    def isProbability (pxy: MatrixD): Boolean = pxy.mmin >= 0.0 && abs (pxy.sum - 1.0) < EPSILON
    def freq (x: VectorI, vc: Int, y: VectorI, k: Int): MatrixD =
    def freq (x: VectorI, y: VectorI, k: Int, vl: Int): (Double, VectorI) =
    def freq (x: VectorD, y: VectorI, k: Int, vl: Int, cont: Boolean,
    def count (x: VectorD, vl: Int, cont: Boolean, thres: Double): Int =
    def toProbability (nu: VectorI): VectorD = nu.toDouble / nu.sum.toDouble
    def toProbability (nu: VectorI, n: Int): VectorD = nu.toDouble / n.toDouble
    def toProbability (nu: MatrixD): MatrixD = nu / nu.sum
    def toProbability (nu: MatrixD, n: Int): MatrixD = nu / n.toDouble
    def probY (y: VectorI, k: Int): VectorD = y.freq (k)._2
    def jointProbXY (px: VectorD, py: VectorD): MatrixD = outer (px, py)
    def margProbX (pxy: MatrixD): VectorD =
    def margProbY (pxy: MatrixD): VectorD =
    def condProbY_X (pxy: MatrixD, px_ : VectorD = null): MatrixD =
    def condProbX_Y (pxy: MatrixD, py_ : VectorD = null): MatrixD =
    inline def plog (p: Double): Double = - log2 (p)
    def plog (px: VectorD): VectorD = px.map (plog (_))
    def entropy (px: VectorD): Double =
    def entropy (nu: VectorI): Double =
    def entropy (px: VectorD, b: Int): Double =
    def nentropy (px: VectorD): Double =
    def rentropy (px: VectorD, qx: VectorD): Double =
    def centropy (px: VectorD, qx: VectorD): Double =
    def entropy (pxy: MatrixD): Double =
    def entropy (pxy: MatrixD, px_y: MatrixD): Double =
    def muInfo (pxy: MatrixD, px: VectorD, py: VectorD): Double =
    def muInfo (pxy: MatrixD): Double = muInfo (pxy, margProbX (pxy), margProbY (pxy))
For example, the following freq method is used by Naïve Bayes Classifiers. It computes the Joint Frequency Table (JFT) for all value combinations of vectors x and y by counting the number of cases where xi = v and yi = c.
@param x   the variable/feature vector
@param vc  the number of distinct values in vector x (value count)
@param y   the response/classification vector
@param k   the maximum value of y + 1 (number of classes)
3.13 Exercises
Several random number and random variate generators can be found in ScalaTion’s random package. Some
of the following exercises will utilize these generators.
1. Let the random variable h be the number of heads when two coins are flipped. Determine the following conditional probability: P(h = 2 | h ≥ 1).
P(A|B) = P(B|A) P(A) / P(B)
3. Compute the mean and variance for the Bernoulli Distribution with success probability p.
4. Use the Randi random variate generator to run experiments to check the pmf and CDF for rolling two
dice.
import scalation.mathstat._
import scalation.random.Randi
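One hedged way to complete this exercise (assuming Randi provides an integer generator igen and that new VectorD (dim) creates a zero vector):

import scalation.mathstat._
import scalation.random.Randi

@main def twoDiceCheck (): Unit =

    val die = Randi (1, 6)                       // discrete uniform on {1, ..., 6}
    val nu  = new VectorD (13)                   // frequency counts indexed by sum 2..12
    for _ <- 0 until 100000 do nu(die.igen + die.igen) += 1
    val pmf = nu / nu.sum                        // empirical pmf
    var acc = 0.0
    val cdf = pmf.map { p => acc += p; acc }     // running sum gives the empirical CDF
    println (s"pmf = $pmf \ncdf = $cdf")

end twoDiceCheck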
V[y] = E[(y − E[y])²] = E[y²] − E[y]²
7. Show that the covariance of two independent, continuous random variables, x and y, is zero.
C[x, y] = E[(x − µx)(y − µy)] = ∫_{Dy} ∫_{Dx} (x − µx)(y − µy) fxy(x, y) dx dy
8. Derive the formula for the expectation of the sum of random variables.
9. Derive the formula for the variance of the sum of random variables.
10. Use the Uniform random variate generator and the Histogram class to run experiments illustrating
the Central Limit Theorem (CLT).
import scalation.mathstat._
import scalation.random.Uniform

@main def cLTTest (): Unit =

    val rg = Uniform ()
    val x  = VectorD (for i <- 0 until 100000 yield rg.gen + rg.gen + rg.gen + rg.gen)
    new Histogram (x)

end cLTTest
The following related distributions are defined in terms of z ∼ N(0, 1) and independent Chi-square random variables u and v:

z² ∼ χ²₁

z / √(v/k) ∼ t_k

(u/k1) / (v/k2) ∼ F_{k1,k2}
14. Run the confidenceIntervalTest main function (see the Confidence Interval section) for values of m = 20 to 40, 60, 80 and 100. Report the confidence interval and the number of cases where the true value was inside the confidence interval for (a) the z-distribution and (b) the t-distribution. Explain.
16. Show that the formula for computing the joint probability mass function (pmf) for the 3-by-3 grid of weights is correct. Hint: Add/subtract rectangular regions of the grid and make sure nothing is double counted.
17. Show for k = 2, where pp = [p, 1 − p], that H(pp) = −p log₂(p) − (1 − p) log₂(1 − p). Plot the entropy H(pp) versus p.

val p = VectorD.range (1, 100) / 100.0
val h = p.map (p => -p * log2 (p) - (1 - p) * log2 (1 - p))
new Plot (p, h)
18. Plot the entropy H and normalized entropy Hk for the first 16 Binomial(p, n) distributions, i.e., for
the number of coins n = 1, . . . , 16. Try with p = .6 and p = .5.
19. Entropy can be defined for continuous random variables. Take the definition for discrete random
variables and replace the sum with an integral and the pmf with a pdf. Compute the entropy for
y ∼ Uniform(0, 1).
20. Using the summation formulas for entropy, relative entropy and cross entropy, show that cross entropy
is the sum of entropy and relative entropy.
21. Show that mutual information equals the sum of the marginal entropies minus the joint entropy, i.e., H(x) + H(y) − H(x, y).
22. Compare correlation and mutual information in terms of how well they measure dependence between
random variables x and y. Try various functional relationships: negative exponential, reciprocal,
constant, logarithmic, square root, linear, right-arm quadratic, symmetric quadratic, cubic, exponential
and trigonometric.
y = f(x) + ε
Other types of relationships are also possible. Try various constrained mathematical relations: circle,
ellipse and diamond.
f(x, y) + ε = c
23. Consider an experiment involving the roll of two dice. Let x indicate the value of die 1 and x2 indicate the value of die 2. In order to examine dependency between random variables, define y = x + x2.
The joint pmf px,y can be recorded in a 6-by-11 matrix that can be computed from the following
feasible occurrence matrix (0 → cannot occur, 1 → can occur), since all the non-zero probabilities are
the same (equal likelihood).
// X  - die 1:   1, 2, 3, 4, 5, 6
// X2 - die 2:   1, 2, 3, 4, 5, 6
// Y = X + X2:   2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
val nuxy = MatrixD ((6, 11), 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
                             0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
                             0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,
                             0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
                             0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
                             0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
Use methods in the Probability object to compute the joint, marginal and conditional probability
distributions, as well as the joint, marginal, conditional and relative entropy, and mutual information.
Explore the independence between random variables x and y.
24. Convolution. The convolution operator may be applied to vectors as well as functions (including
mass and density functions). Consider two vectors c ∈ Rm and x ∈ Rn . Without loss of generality let
m ≤ n, then their convolution is defined as follows:
y = c ⋆ x,   where   yk = Σ_{j=0}^{m−1} cj x_{k−j}   for k = 0, . . . , m + n − 2   (3.92)
Note, there are also 'same' and 'valid' versions of the convolution operator.
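A minimal sketch of the 'full' convolution for VectorD (a hypothetical helper, not necessarily ScalaTion's own convolution API):

import scalation.mathstat.VectorD

def conv (c: VectorD, x: VectorD): VectorD =
    val y = new VectorD (c.dim + x.dim - 1)      // result indices k = 0..m+n-2
    for k <- 0 until y.dim; j <- 0 until c.dim do
        val i = k - j                            // index into x
        if i >= 0 && i < x.dim then y(k) += c(j) * x(i)
    y
end conv

println (conv (VectorD (1, 2), VectorD (1, 1, 1)))   // expect (1, 3, 3, 2)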
25. Consider a distribution with density on the interval [0, 2]. Let the probability density function (pdf)
for this distribution be the following:
fy(y) = y/2   on [0, 2]
(i) Draw/plot the pdf fy (y) vs. y for the interval [0, 2].
(ii) Determine the Cumulative Distribution Function (CDF), Fy (y).
(iii) Draw/plot the CDF Fy (y) vs y for the interval [0, 2].
(iv) Determine the expected value of the Random Variable (RV) y, i.e., E [y].
26. Take the limit of the difference quotient of the monomial xⁿ to show that

d/dx xⁿ = n x^{n−1}
Recall the definition of derivative as the limit of the difference quotient.
d/dx f(x) = lim_{h→0} (f(x + h) − f(x)) / h
Recall the notations due to Leibniz, Lagrange, and Euler.
d/dx f(x) = f′(x) = Dx f(x)
27. Take the integral and then the derivative of the monomial xⁿ to show that

d/dx ∫ xⁿ dx = xⁿ
3.14 Further Reading
1. Probability and Mathematical Statistics [163].
3.15 Notational Conventions
With respect to random variables, vectors and matrices, the following notational conventions shown in Table
3.1 will be used in this book.
Built on the Functional Programming features in Scala, ScalaTion supports several function types:

type FunctionS2S = Double => Double      // function of a scalar
type FunctionS2V = Double => VectorD     // vector-valued function of a scalar
These function types are defined in the scalation and scalation.mathstat packages. A scalar-valued
function type ends in ’S’, a vector-valued function type ends in ’V’, and a matrix-valued function type ends
in ’M’.
Mathematically, the scalar-valued functions are denoted by a symbol, e.g., f.

S2S function  f : R → R
V2S function  f : R^n → R
S2V function  f : R → R^n
V2V function  f : R^m → R^n
M2V function  f : R^{m×p} → R^n
V2M function  f : R^p → R^{m×n}
M2M function  f : R^{p×q} → R^{m×n}
3.16 Model
Models are about making predictions, such as: given certain properties of a car, predict the car's mileage; given recent performance of a stock index fund, forecast its future value; or given a person's credit report, classify them as either likely or not likely to repay a loan. The thing that is being predicted, forecasted or classified is referred to as the response/output variable, call it y. In many cases, the "given something" is either captured by other input/feature variables collected into a vector, call it x,

y = f(x; b) + ε   (3.94)

or by previous values of y. Some functional form f is chosen to map input vector x into a predicted value for response y. The last term indicates the difference between actual and predicted values, i.e., the residuals ε. The function f is parameterized and often these parameters can be collected into a vector (or matrix) b.
If values for the parameter vector b are set randomly, the model is unlikely to produce accurate pre-
dictions. The model needs to be trained by collecting a dataset, i.e., several (m) instances of (xi , yi ), and
optimizing the parameter vector b to minimize some loss function, such as mean squared error (mse),
mse = (1/m) ‖y − ŷ‖²   (3.95)
where y is the vector from all the response instances and ŷ = f (X; b) is the vector of predicted response
values and X is the matrix formed from all the input/feature vector instances.
Estimation Procedures
Although there are many types of parameter estimation procedures, this text only utilizes the three most
commonly used procedures [14].
The method of moments develops equations that relate the moments of a distribution to the parameters of the
model, in order to create estimates for the parameters. Least Squares Estimation takes the sum of squared
errors and sets the parameter values to minimize this sum. It has three main varieties: Ordinary Least
Squares (OLS), Weighted Least Squares (WLS), and Generalized Least Squares (GLS). Finally, Maximum
Likelihood Estimation sets the parameter values so that the observed data is likely to occur. The easiest
way to think about this is to imagine that one wants to create a generative model (a model that generates
data). One would want to set the parameters of the model so it generates data that looks like the given
dataset.
Setting of parameters is done by solving a system of equations for the simpler models, or by using an
optimization algorithm for more complex models.
Quality of Fit (QoF)
After a model is trained, its Quality of Fit (QoF) should be evaluated. One way to perform the evaluation
is to train the model on the full dataset and test as well on the full dataset. For complex models with many parameters, over-fitting will likely occur; then its excellent evaluation is unlikely to be reproduced when the model is applied in the real world. To avoid overly optimistic evaluations due to over-fitting, it is common to divide a dataset (X, y) into a training dataset and a testing dataset, where training is conducted on the training dataset (Xr, yr) and evaluation is done on the test dataset (Xe, ye). The conventions used in this book for the full, training and test datasets are shown in Table 3.3.
Note, when training and testing on the full dataset, the training and test dataset are actually the same, i.e.,
they are the full dataset. If a model has many parameters, the Quality of Fit (QoF) found from training
and testing on the full dataset should be suspect. See the section on cross-validation for more details.
In ScalaTion, the Model trait serves as the base trait for all the modeling techniques in the modeling package and its sub-packages classifying, clustering, fda, forecasting, and recommender.
Model Trait
Trait Methods:
trait Model:
The getFname method returns the predictor variable/feature names in the model. The train method
will use a training or full dataset to train the model, i.e., optimize its parameter vector b to minimize a
given loss function. After training, the quality of the model may be assessed using the test method. The
evaluation may be performed on a test or full dataset. Finally, information about the model may be extracted
by the following three methods: (1) hparameter showing the hyper-parameters, (2) parameter showing the
parameters, and (3) report showing the hyper-parameters, the parameter, and the Quality of Fit (QoF) of
the model. Note, hyper-parameters are used by some modeling techniques to influence either the result or
how the result is obtained.
Classes that implement (directly or indirectly) the Model trait should default x_ and x_e to the full data/input matrix x, and y_ and y_e to the full response/output vector y that are passed into the class constructor.
Implementations of the train method take a training data/input matrix x_ and a training response/output vector y_ and optimize the parameter vector b to, for example, minimize error or maximize likelihood. Implementations of the test method take a test data/input matrix x_e and the corresponding test response/output vector y_e to compute errors and evaluate the Quality of Fit (QoF). Note that with cross-validation (to be explained later), there will be multiple training and test datasets created from one full dataset. Implementations of the hparameter method simply return the hyper-parameter vector hparam, while implementations of the parameter method simply return the optimized parameter vector b. (The fname and technique parameters for Regression are the feature names and the solution/optimization technique used to estimate the parameter vector, respectively.)
Associated with the Model trait is the FitM trait that provides QoF measures common to all types of models. For prediction, Fit extends FitM with several additional QoF measures, which are explained in the Prediction Chapter. Similarly, FitC extends FitM for classification models.
FitM Trait
Trait Methods:
trait FitM:
The diagnose method takes the actual response/output vector y and the predictions from the model yp
and calculates the basic QoF measures.
@param y   the actual response/output vector to use (test/full)
@param yp  the predicted response/output vector (test/full)
@param w   the weights on the instances (defaults to null)
val mu = y.mean                                          // mean of y (may be zero)
val e  = y - yp                                          // residual/error vector
sse    = e.normSq                                        // sum of squares for error
if w == null then
    sst = y.cnormSq                                      // sum of squares total
    ssr = sst - sse                                      // sum of squares model
else
    ssr = (w * (yp - (w * yp / w.sum).sum) ~^ 2).sum     // regression sum of squares
    sst = ssr + sse
end if
Note, ~^ is the exponentiation operator provided in ScalaTion, where the first character is ~ to give the operator higher precedence than multiplication (*).
One of the measures is based on absolute errors, Mean Absolute Error (MAE), and is computed as the ℓ1 norm of the error vector divided by the number of elements in the response vector (m). The rest are based on squared values. Various squared ℓ2 norms may be taken to compute these quantities, i.e., sst = y.cnormSq is the centered norm squared of y, while sse = e.normSq is the norm squared of e. Then ssr, the sum of squares model/regression, is the difference. The idea is that one starts with the variation in the response, some of which can be accounted for by the model, with the remaining part considered errors. As models are less than perfect, what remains is better referred to as residuals, part of which a better model could account for. The fraction of the variation accounted for by the model to the total variation is called the coefficient of determination, R² = ssr/sst ≤ 1. A measure that parallels MAE is the Root Mean Squared Error (RMSE). It is typically higher, as a large squared term has more of an effect. Both are interpretable as they are in the units of the response variable, e.g., imagine one hits a golf ball at 150 mph with an MAE of 7 mph and an RMSE of 10 mph. Further explanations are given in the Prediction Chapter.
Chapter 4
Data Management
4.1 Introduction
Data Science relies on having large amounts of quality data. Collecting data and handling data quality issues
are of utmost importance. Without support from a system or framework, this can be very time-consuming
and error-prone. This chapter provides a quick overview of the support provided by ScalaTion for data
management.
In the era of big data, a variety of database management technologies have been proposed, including
those under the umbrella of Not-only-SQL (NoSQL). These technologies include the following:
• Key-value stores (e.g., Memcached). When the purpose of the data store is very rapid lookup and not advanced query capabilities, a key-value store may be ideal. They are often implemented as distributed hash tables.
• Document-oriented databases (e.g., MongoDB). These databases are intended for storage and retrieval
of unstructured (e.g., text) and semi-structured (e.g., XML or JSON) data.
• Columnar databases (e.g., Vertica). Such databases are intended for structured data like traditional relational databases, but to better facilitate data compression and analytic operations. Data is stored in columns rather than rows as in traditional relational databases.
• Graph databases (e.g., Neo4j). These make the implicit relationships (via foreign-key, primary-key pairs) in relational databases explicit. A tuple in a relational database is mapped to a node in a graph database, while an implicit relationship is mapped to an edge in a graph database. The database then consists of a collection of directed graphs, each consisting of nodes and edges connecting the nodes. These databases are particularly suited to social networks.
The purpose of these database technologies is to provide enhanced performance over traditional, row-oriented relational databases, and each of the above is best suited to particular types of data.
Data management capabilities provided by ScalaTion include Relational Databases, Columnar Databases
and Graph Databases. All include extensions making them suitable as a Time Series DataBase (TSDB).
Graph databases are discussed in the Appendix.
Preprocessing of data should be done before applying analytics techniques to ensure they are working on
quality data. ScalaTion provides a variety of preprocessing techniques, as discussed in the next chapter.
4.1.1 Analytics Databases
In data science, it is convenient to collect data from multiple sources and store the data in a database.
Analytics databases are organized to support efficient data analytics.
A database supporting data science should make it easy and efficient to view and select data to be fed into models. The structures supported by the database should make it easy to extract data to create vectors, matrices and tensors that are used by data science tools and packages.
Multiple systems, including ScalaTion’s TSDB, are built on top of columnar, main memory databases
in order to provide high performance. ScalaTion’s TSDB is a Time Series DataBase that has built-in
capabilities for handling time series data. It is able to store non-time series data as well. It provides multiple
Application Programming Interfaces (APIs) for convenient access to the data [?].
trait Tabular [T <: Tabular [T]] (val name: String, val schema: Schema,
                                  val domain: Domain, val key: Schema)
      extends Serializable:
For convenience, the following two Scala type definitions are utilized.
type Schema = Array [String]
type Domain = Array [Char]
Tabular structures are logically linked together via foreign keys. A foreign key is an attribute that
references a primary key in some table (typically another table). In ScalaTion, the foreign key specification
is added via the following method call after the Tabular structure is created.
def addForeignKey (fkey: String, refTab: T): Unit
ScalaTion supports the following domains/data-types: 'D'ouble, 'I'nt, 'L'ong, 'S'tring, and 'T'imeNum.
'D' - Double  - VectorD - 64 bit double precision floating point number
'I' - Int     - VectorI - 32 bit integer
'L' - Long    - VectorL - 64 bit long integer
'S' - String  - VectorS - variable length numeric string
'T' - TimeNum - VectorT - time numbers for date-time
These data types are generalized into a ValueType as a Scala union type.
type ValueType = (Double | Int | Long | String | TimeNum)
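For example, a ValueType variable may hold a value of any of these types without wrapping:

val v1: ValueType = 3.14         // Double
val v2: ValueType = 42           // Int
val v3: ValueType = "Athens"     // String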
4.2 Relational Data Model
A relational database table may be built up as follows: A cell in the table holds an atomic value of type
ValueType. A tuple (or row) in the table is simply an array of ValueType. A relational Table consists of a
bag (or multi-set) of tuples. Each column in the Table is restricted to a particular domain. Note, uniqueness
of primary keys is enforced by creating a primary index.
type Tuple = Array [ValueType]
Using the operator += as an alias for the add method, the following code may be used to populate the Bank
database.
customer += ("Peter", "Oak St", "Bogart")
         += ("Paul", "Elm St", "Watkinsville")
         += ("Mary", "Maple St", "Athens")
customer.show ()
...
deposit.show ()
Fundamental Relational Algebra Operators
The following six relational algebra operators form the fundamental operators for ScalaTion’s table pack-
age and are shown in Table 4.1. They are fundamental in the sense that the rest of the operators, although
convenient, do not increase the expressive power of the query language.
1. Rename Operator. The rename operator will return the table customer renamed as client.
customer.ρ ("client")
2. Project Operator. The project operator will return the specified columns in table customer.
3. Select Operator. The select operator will return the rows that match the predicate in table customer.
(ScalaTion code for these first three operators is sketched after this list.)
4. Union Operator. The union operator will return the union of rows from deposit and loan. Duplicate
tuples may be eliminated by creating an index. For this operator the textbook syntax and ScalaTion
syntax are identical.
deposit ∪ loan
deposit ∪ loan
5. Minus Operator. The minus operator will return the rows from account (result of the union) that
are not in loan. For this operator the textbook syntax and ScalaTion syntax are identical.
account − loan
account - loan
6. Cartesian Product Operator. The product operator will return all combinations of rows in customer
with rows in deposit. For this operator the textbook syntax and ScalaTion syntax are identical.
customer × deposit
customer × deposit
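The first three operators may be written in ScalaTion as follows (a minimal sketch; the predicate syntax follows the example queries in Section 4.2.4):

customer.ρ ("client")                   // rename customer to client
customer.π ("cname, ccity")             // project onto the cname and ccity columns
customer.σ ("ccity == 'Athens'")        // select the customers living in Athens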
Additional Relational Algebra Operators
The next eight operators, although not fundamental, are important operators in ScalaTion's table
package and are shown in Table 4.1.
1. Join Operator. In order to combine information from two tables, join operators are preferred over
products, as they are much more efficient and only combine related rows. ScalaTion’s table package
supports natural-join, equi-join, theta-join, left outer join, and right outer join, as shown below. For
each tuple in the left table, the equi-join pairs it with all tuples in the right table that match it on
the given attributes (in this case customer.bname = deposit.bname). The natural-join is an equi-
join on the common attributes in the two tables, followed by projecting away any duplicate columns.
The theta-join generalizes an equi-join by allowing any comparison operator to be used (in this case
deposit1.balance < deposit2.balance). The symbol for semi-join is adopted for the outer joins as it is an
available Unicode symbol. The left outer join keeps all tuples from the left table (padding with nulls if
need be), while the right outer join keeps all tuples from the right table.
customer ⋈ deposit
customer ⋈ ("cname == cname", deposit)
deposit ⋈ ("balance < balance", deposit)
customer ⋉ deposit
customer ⋊ deposit
Additional forms of joins are also available in the Table class. Join is not fundamental, as its result
can be produced by combining product and select.
2. Divide Operator. For the query below, the divide operator will return the cnames where the customer
has a deposit account at all branches (of course, it would make sense to first select on the
branches).
The divide operator requires the other attributes (in this case cname) in the left table to be paired up
with all the attribute values (in this case bname) in the right table.
3. Intersect Operator. The intersect operator will return the rows in account that are also in loan.
For this operator the textbook syntax and ScalaTion syntax are identical.
account ∩ loan
account ∩ loan
4. GroupBy Operator. The groupBy operator forms groups of tuples based on the equality of attribute
values. The following example groups the tuples in the deposit table based on the value of the
bname attribute.
γbname (deposit)
5. Aggregate Operator. The aggregate operator returns values for the grouped-by attribute (e.g.,
bname) and applies aggregate operators on the specified columns (e.g., avg (balance)). Typically it is
called after the groupBy operator.
deposit F ("bname", (count, "accno"), (avg, "balance"))
6. OrderBy Operator. The orderBy operator effectively puts the rows into ascending order based on
the given attributes.
↑bname (deposit)
7. OrderByDesc Operator. The orderByDesc operator effectively puts the rows into descending order
based on the given attributes.
↓bname (deposit)
8. Select-Project Operator. The selproject is a combination operator added for convenience and
efficiency, especially for columnar relational databases (see the next section). As whole columns are
stored together, this operator only requires one column to be accessed.
customer σπ ("ccity", _ == "Athens")
4.2.4 Example Queries
1. List the names of customers who live in the city of Athens.
val liveAthens = customer.σ ("ccity == 'Athens'").π ("cname")
liveAthens.show ()
2. List the names of customers who live in Athens or bank (have deposits in branches located) in Athens.
val bankAthens = (deposit ⋈ branch).σ ("bcity == 'Athens'").π ("cname")
bankAthens.show ()
3. List the names of customers who live and bank in the same city.
val sameCity = (customer ⋈ deposit ⋈ branch).σ ("ccity == bcity").π ("cname")
sameCity.create_index ()
sameCity.show ()
4. List the names and account numbers of customers with the largest balance.
val largest = deposit.π ("cname, accno") -
              (deposit ⋈ ("balance < balance", deposit)).π ("cname, accno")
largest.show ()
5. List the names of customers who are silver club members (have loans where they have deposits).
val silver = (loan.π ("cname, bname") ∩ deposit.π ("cname, bname")).π ("cname")
silver.create_index ()
silver.show ()
6. List the names of customers who are gold club members (have loans only where they have deposits).
val gold = loan.π ("cname") -
           (loan.π ("cname, bname") - deposit.π ("cname, bname")).π ("cname")
gold.create_index ()
gold.show ()
8. List the names of customers who have deposits at all branches located in Athens.
val allAthens = deposit.π ("cname, bname") / inAthens
allAthens.create_index ()
allAthens.show ()
4.2.5 Persistence
Modern databases do much of their processing in main memory due to its large size and high speed. Although
main memory built with MRAM may be persistent, it is typically volatile, meaning that if power is lost,
so is the data. It is therefore essential to provide efficient mechanisms for making and maintaining the
persistence of data.
Traditional database management systems achieve this by having a persistent data store in non-volatile
storage (e.g., Hard-Disk Drives (HDD) or Solid-State Drives (SSD)) and a large database cache in main
memory. Complex page management algorithms are used to ensure persistence and transactional correctness
(see the next subsection).
A simple way to provide persistence is to design the database management system to operate in main-
memory and then provide load and save methods that utilize built-in serialization to save to or load from
persistent storage. This is what ScalaTion does.
The load method will read a table with a given name into main-memory using serialization.
@param name  the name of the table to load
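A minimal sketch of a matching load method, assuming the same STORE_DIR and SER constants used by save below and the standard java.io stream classes:

import java.io.{FileInputStream, ObjectInputStream}

def load (name: String): Table =
    val ois = new ObjectInputStream (new FileInputStream (STORE_DIR + name + SER))
    val tab = ois.readObject ().asInstanceOf [Table]       // deserialize the table
    ois.close ()
    tab
end load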
The save method will write the entire contents of this table into a file using serialization.
def save (): Unit =
    val oos = new ObjectOutputStream (new FileOutputStream (STORE_DIR + name + SER))
    oos.writeObject (this)
    oos.close ()
end save
For small databases, this approach is fine, but as databases become large, greater efficiency must be
sought. One cannot save a whole table every time there is a change. See the exercises for alternatives.
4.2.6 Transactions
The idea of a transaction is to bundle a sequence of operations into a meaningful action that one wants to
succeed, such as transferring money from one bank account to another.
Making the action a transaction has the main benefit of making it atomic: the action either completes
successfully (called a commit) or is completely undone, having no effect on the database state (called a
rollback). The third option, a partially completed action, would in this case lead to a bank customer losing
their money.
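As a sketch, a transfer action might be bundled as follows (the begin/commit/rollback calls and the Account type are hypothetical, for illustration only):

def transfer (from: Account, to: Account, amount: Double): Unit =
    begin ()                                      // start the transaction
    try
        from.withdraw (amount)                    // debit the source account
        to.deposit (amount)                       // credit the target account
        commit ()                                 // make both updates permanent
    catch case e: Exception => rollback ()        // undo any partial effects
end transfer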
Making a transaction atomic can be achieved by maintaining a log. Operations can be written to the log
and then only saved once the transaction commits. If a transaction cannot commit, it must be rolled back.
There must also be a recovery procedure to handle the situation when volatile storage is lost. For this to
function, committed log records must be flushed to persistent storage.
A second important advantage of making an action a transaction is to protect it from other transactions,
so it can behave as if it were running in isolation. Rather than worrying about how other transactions
may corrupt the action, this worry is turned over to the database management system. One
form of potential interference involves two transactions running concurrently and accessing the same bank
accounts. If one transaction accesses all the accounts first, there will be no corruption. Such an execution
of two transactions is called a serial execution (one transaction executes at a time). Unfortunately, modern
high-performance database management systems could not operate at the slow speed this would dictate.
Transactions must be run concurrently, not serially. The correctness condition called serializability allows
transactions to run with their concurrency controlled by a protocol that ensures their effects on the database
are equivalent to one of their slow-running, serially-executing cousin schedules. In other words, the fast-running
serializable schedule for a set of transactions must be equivalent to some serial execution of the
same set of transactions. See the exercises for more details on equivalence (e.g., conflict and view equivalence)
and various concurrency control protocols that can be used to ensure correctness with minimal impact on
performance.
class Table (override val name: String, override val schema: Schema,
             override val domain: Domain, override val key: Schema)
      extends Tabular [Table] (name, schema, domain, key)
      with Serializable:
Internally, the Table class maintains a collection of tuples. Using a Bag allows for duplicates, if wanted.
Creating an index on the primary key will efficiently eliminate any duplicates. Foreign key relationships are
specified in linkTypes. The class also provides a groupMap used by the groupBy operator.
The Table class supports three types of indices:
1. Primary Index. A unique index on the primary key (which may be composite).
private [table] val index = IndexMap [KeyType, Tuple] ()
2. Secondary Unique Indices. A unique index on a single attribute (other than the primary key). For
example, a student id may be used as the primary key for a Student table, while email may also be required
to be unique. Since there can be multiple such indices, a Map is used to name each index.
private [table] val sindex = Map [String, IndexMap [ValueType, Tuple]] ()
3. Non-Unique Indices. When fast lookup is required based on an attribute/column that is not required
to be unique (e.g., name), such an index may be used. Again, since there can be multiple such indices,
a Map is used to name each index.
private [table] val mindex = Map [String, MIndexMap [ValueType, Tuple]] ()
The following methods may be used to create the various types of indices: primary unique index, secondary
unique index, or non-unique index, respectively.
def create_index (rebuild: Boolean = false): Unit =
def create_sindex (atr: String): Unit =
def create_mindex (atr: String): Unit =
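For example, for a Student table, the three kinds of indices might be created as follows (a sketch; the attribute names are illustrative):

student.create_index ()                 // unique index on the primary key (e.g., sid)
student.create_sindex ("email")         // secondary unique index on email
student.create_mindex ("name")          // non-unique index on name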
The following factory method in the companion object provides a more convenient way to create a table.
The strim method splits a string into an array of strings based on a separation character and then trims
away any white-space.
def apply (name: String, schema: String, domain_ : String, key: String): Table =
    new Table (name, strim (schema), strim (domain_).map (_.head), strim (key))
end apply
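For example, the customer table from the Bank database might be created as follows (a sketch assuming all three attributes have String domains):

val customer = Table ("customer", "cname, street, ccity", "S, S, S", "cname")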
The following two classes extend the Table class in the direction of the Graph Data Model; see Appendix C.
case class LTable (name_ : String, schema_ : Schema, domain_ : Domain, key_ : Schema)
     extends Table (name_, schema_, domain_, key_)
     with Serializable:
The LTable class (for Linked-Table) simply adds an explicit link from the foreign key to the primary key
that it references. For each tuple in a linked-table, a link is added to the referenced table, so that the foreign
key is linked to the primary key. Caveat: LTable does not handle composite foreign keys. Although in general
primary keys may be composite, a foreign key is conceptualized as a column value and its associated link.
@param fkey    the foreign key column
@param refTab  the referenced table being linked to
The LTable class makes many-to-one relationships/associations explicit and improves the efficiency of
the most common form of join operation, which is based on equating a foreign key (fkey) to a primary key
(pkey). Without an index, such joins are performed using a Nested-Loop Join algorithm. The existence of an index
on the primary key allows a much more efficient Indexed Join algorithm to be utilized. The direct linkage
provides additional speed-up for such join operations (see the exercises for a comparison). Note that the
linkage is only in one direction, so joining from the primary key table to the foreign key table would require
a non-unique index on the foreign key column, or resorting to a slow nested-loop join.
Note, the link and foreign key value are in some sense redundant. Removing the foreign key column is
possible, but may force the need for an additional join for some queries, so the database designer may wish
to keep the foreign key column. ScalaTion leaves this issue up to the database designer.
The next class moves further in the direction of the Graph Data Model.
4.2.9 VTable Class
@param domain_  the domains/data-types for the attributes ('D', 'I', 'L', 'S', 'X', 'T')
@param key_     the attributes forming the primary key
case class VTable (name_ : String, schema_ : Schema, domain_ : Domain, key_ : Schema)
     extends Table (name_, schema_, domain_, key_)
     with Serializable:
The VTable class (for Vertex-Table) supports many-to-many relationships with efficient navigation in
both directions. Supporting this is much more complicated than what is needed for LTable, but it provides for
index-free adjacency, similar to what is provided by Graph Database systems.
The VTable model is graph-like in that it elevates tuples into vertices as first-class citizens of the data
model. However, edges are embedded inside vertices and exist only to establish adjacency. Edges do not
have labels, attributes or properties. Although this simplifies the data model and makes it more relation-like,
it is not set up to naturally support finding, for example, shortest paths.
The Vertex class extends the notion of Tuple into values stored in the tuple part, along with foreign-key
links captured as outgoing edges.
@param tuple  the tuple part of a vertex
...
end Vertex
For data models where edges become first-class citizens, see the Appendix on Graph Data Models.
4.3 Columnar Relational Data Model
Of the NoSQL database management systems, columnar databases are closest to traditional relational
databases. Rather than tuples/rows taking center stage, columns/vectors take center stage.
A columnar database is made up of the following components:
• Element - a value from a given Domain or Datatype (e.g., Int, Long, Double, Rational, Real, Complex,
String, TimeNum)
• Column/Vector - a collection of values from the same Datatype (e.g., forming VectorI, VectorL,
VectorD, VectorQ, VectorR, VectorC, VectorS, VectorT)
Table 4.2 shows the first 10 rows (out of 392) for the well-known Auto MPG dataset (see https://
archive.ics.uci.edu/ml/datasets/Auto+MPG).
Table 4.2: Example Columnar Relation: First 10 Rows of Auto MPG Dataset
mpg cylinders displacement horsepower weight acceleration model year origin car name
Double Int Double Double Double Double Int Int String
18.0 8 307.0 130.0 3504.0 12.0 70 1 ”chevrolet chevelle”
15.0 8 350.0 165.0 3693.0 11.5 70 1 ”buick skylark 320”
18.0 8 318.0 150.0 3436.0 11.0 70 1 ”plymouth satellite”
16.0 8 304.0 150.0 3433.0 12.0 70 1 ”amc rebel sst”
17.0 8 302.0 140.0 3449.0 10.5 70 1 ”ford torino”
15.0 8 429.0 198.0 4341.0 10.0 70 1 ”ford galaxie 500”
14.0 8 454.0 220.0 4354.0 9.0 70 1 ”chevrolet impala”
14.0 8 440.0 215.0 4312.0 8.5 70 1 ”plymouth fury iii”
14.0 8 455.0 225.0 4425.0 10.0 70 1 ”pontiac catalina”
15.0 8 390.0 190.0 3850.0 8.5 70 1 ”amc ambassador dpl”
Since each column is stored as a vector, it can be readily compressed. Due to the high repetition in the
cylinders column, it can be effectively compressed using Run Length Encoding (RLE). In
addition, a column can be efficiently extracted, since it is already stored as a vector in the database. These
vectors can be used in aggregate operators or passed into analytic models.
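The idea behind RLE is simply to replace each run of repeated values with a (value, run-length) pair. A minimal sketch of the encoding (not ScalaTion's implementation):

def rle (xs: Seq [Int]): Seq [(Int, Int)] =
    if xs.isEmpty then Seq ()
    else
        val (run, rest) = xs.span (_ == xs.head)       // take the leading run
        (xs.head, run.size) +: rle (rest)              // encode it and recurse

rle (Seq (8, 8, 8, 8, 4, 4, 6))                        // => Seq ((8, 4), (4, 2), (6, 1))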
Data files in various formats (e.g., comma separated values (csv)) can be loaded into the database.
val auto_mpg = Relation ("auto_mpg", "auto_mpg.csv")
It is easy to create a Multiple Linear Regression model for this dataset. Simply pick the response column,
in this case mpg, and the predictor columns, in this case all other columns besides car name (any connection
between car name and mpg would be coincidental). The response column/variable goes into a vector.
val y = auto_mpg.toVectorD (0)
val x = auto_mpg.toMatrixD (1 to 7)
Then the matrix x and vector y can be passed into a Regression model constructor.
val rg = new Regression (x, y)
See the next chapter for how to train a model, evaluate the quality of fit and make predictions.
The first API is a Columnar Relational Algebra that includes the standard operators of relational algebra
plus those common to column-oriented databases. It consists of the Table trait and two implementing classes:
Relation and MM_Relation. Persistence for Relation is provided by the save method, while MM_Relation
utilizes memory-mapped files.
The name of the first relation is “sensor” and it stores information about traffic sensors.
• The fourth argument is the column number for the primary key (key),
• The fifth argument, “ISDDI”, indicates the domains (domain) for the attributes (Integer, String,
Double, Double, Integer).
• The sixth and optional argument can be used to define foreign keys (fKeys).
• The seventh and optional argument indicates whether to enter the relation into the system Catalog.
The second relation road stores the Id, name, beginning and ending latitude-longitude coordinates.
The third relation mroad is for multi-lane roads.
The fourth relation traffic stores the data collected from traffic sensors. The primary key in this case
is composite, Seq (0, 1), as both the time and the sensorId are required for unique identification.
The fifth relation wsensor stores information about weather sensors.
Finally, the sixth relation weather stores data collected from the weather sensors.
Select Operator
The select operator will return the rows that match the predicate, in this case rdName == “I285”.
Project Operator
The project operator will return the specified columns, in this case rdName, lat1 and long1.
Union Operator
The union operator will return the rows from r and s with no duplicates. For this operator the textbook
syntax and column db syntax are identical.
r∪s
Minus Operator
The minus operator will return the rows from r that are not in s. For this operator the textbook syntax and
column db syntax are identical.
r−s
Product Operator
The product operator will return all combinations of rows in r with rows in s. For this operator the textbook
syntax and column db syntax are identical.
r × s
Rename Operator
The rename operator will return the relation r renamed as r2.
r.ρ(“r2”)
The above six operators form the fundamental operators for ScalaTion's column db package and are
shown as the first group in Table 4.3.
Table 4.3: Columnar Relational Algebra (r = road, s = sensor, t = traffic, q = mroad, w = weather)
The next seven operators, although not fundamental, are important operators in ScalaTion's column db
package and are shown as the second group in Table 4.3.
Join Operators
In order to combine information from two relations, join operators are preferred over products, as they are
much more efficient and only combine related rows. ScalaTion's column db package supports natural-join,
equi-join, general theta-join, left outer join, and right outer join, as shown below.
r ⋈ s                                      (natural join)
r ⋈ (“roadId”, “roadId”, s)                (equi-join)
r ⋈ [Int] (s, (“roadId”, “roadId”, ==))    (theta join)
t ⋉ (“time”, “time”, w)                    (left outer join)
t ⋊ (“time”, “time”, w)                    (right outer join)
Intersect Operator
The intersect operator will return the rows in r that are also in s. For this operator the textbook syntax
and column db syntax are identical.
r∩s
GroupBy Operator
The groupBy operator forms groups within the relation based on the equality of attribute values. The following
example groups the traffic data based on the value of the “sensorId” attribute.
t.γ(“sensorId”)
The extended projection operator eproject applies aggregate operators to the aggregation columns (first
arguments) and a regular projection to the other columns (second arguments). Typically it is called after the
groupBy operator.
OrderBy Operator
The orderBy operator effectively puts the rows into ascending (or descending) order based on the given
attributes.
t.ω(“sensorId”)
Compress Operator
The compress operator will compress the given columns of the relation.
t.ζ(“count”)
Uncompress Operator
The uncompress operator will uncompress the given columns of the relation.
t.Z(“count”)
2. Retrieve the automobile mileage data for cars with 8 cylinders, returning the car name and mpg (a
sketch of this query is given after this list).
3. Retrieve traffic data within a 100-kilometer grid around the center of Austin, Texas. The latitude-
longitude coordinates for Austin, Texas are (30.266667, -97.733333).
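As a sketch, query 2 might be expressed using the SQL-like API of Section 4.4 (assuming the auto_mpg relation loaded above):

auto_mpg.where [Int] ("cylinders", _ == 8)
        .select ("car name", "mpg")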
Class Methods:
@param name     the name of the relation
@param colName  the names of columns
@param col      the Scala Vector of columns making up the columnar relation
@param key      the column number for the primary key (< 0 => no primary key)
@param domain   an optional string indicating domains for columns (e.g., 'SD' = 'String', 'Double')
@param fKeys    an optional sequence of foreign keys - Seq (column name, ref table name, ref column position)
@param enter    whether to enter the newly created relation into the Catalog

class Relation (val name: String, val colName: Seq [String], var col: Vector [Vec] = null,
                val key: Int = 0, val domain: String = null,
                var fKeys: Seq [(String, String, Int)] = null, enter: Boolean = true)
      extends Table with Error with Serializable
4.4 SQL-Like Language
The SQL-Like API in ScalaTion provides many of the language constructs of SQL in a functional style.
1. Retrieve the vehicle traffic counts over time from all sensors on the road with Id = 101.
(traffic join sensor).where [Int] ("roadId", _ == 101)
                     .select ("sensorId", "time", "count")
2. Retrieve the vehicle traffic counts averaged over time from all sensors on the road with Id = 101.
(traffic join sensor).where [Int] ("roadId", _ == 101)
                     .groupBy ("sensorId")
                     .eselect ((avg, "acount", "count")) ("sensorId")
4.4.3 RelationSQL Class
Class Methods:
@param name     the name of the relation
@param colName  the names of columns
@param col      the Scala Vector of columns making up the columnar relation
@param key      the column number for the primary key (< 0 => no primary key)
@param domain   an optional string indicating domains for columns (e.g., 'SD' = 'String', 'Double')
@param fKeys    an optional sequence of foreign keys - Seq (column name, ref table name, ref column position)

class RelationSQL (name: String, colName: Seq [String], col: Vector [Vec],
                   key: Int = 0, domain: String = null,
                   fKeys: Seq [(String, String, Int)] = null)
      extends Tabular with Serializable
...
def toVectorL (colName: String): VectorL = r.toVectorL (colName)
def toVectorS (colPos: Int = 0): VectorS = r.toVectorS (colPos)
def toVectorS (colName: String): VectorS = r.toVectorS (colName)
def toVectorT (colPos: Int = 0): VectorT = r.toVectorT (colPos)
def toVectorT (colName: String): VectorT = r.toVectorT (colName)
def show (limit: Int = Int.MaxValue) = r.show (limit)
def save () = r.save ()
def generateIndex (reset: Boolean = false) = r.generateIndex (reset)
4.5 Exercises
1. Use Scala 3 to complete the implementation of the following ScalaTion data models: Table, LTable,
and VTable in the scalation.table package. A group will work on one of the data models. See Appendix
C for two more data models: GTable and PGraph.
• Test all types of non-unique indices (MIndexMap). Use the import scheme shown in the beginning
of Table.scala.
• Add use of indexing to speed up as many operations as possible.
• Speed up joins by using Unique Indices and Non-Unique Indices.
• Use index-free adjacency when possible for further speed-up.
• Make the save operation efficient by only serializing tuples/vertices that have changed since the
last load. One way to approach this would be to maintain a map in persistent storage,
Map [KeyType, (TimeNum, Tuple)]
where the key for a tuple/vertex may be used to check the timestamp of the tuple/vertex. Unless
the timestamp of the volatile tuple/vertex is larger, there is no need to save it. Further speed
improvement may be obtained by switching from Java's built-in serialization to Kryo's more compact
binary serialization.
4. Create the sensor schema using the RelationSQL class in the columnar db package.
7. Retrieve traffic data within a 100-kilometer grid around the center of Austin, Texas. The latitude-
longitude coordinates for Austin, Texas are (30.266667, -97.733333).
Formulate a relational algebra expression to list the names of the professors of the courses taken by Peter.
Chapter 5
Data Preprocessing
5.2 Methods for Outlier Detection
Data points that are considered outliers may arise from errors or highly unusual occurrences. For
example, suppose a dataset records the times for members of a football team to run a 100-yard dash and
one of the recorded values is 3.2 seconds. This is an outlier. Some analytics techniques are less sensitive to
outliers, e.g., ℓ1 regression, while others, e.g., ℓ2 regression, are more sensitive. Detection of outliers suffers
from the obvious problems of being too strict (in which case good data may be thrown away) or too lenient
(in which case outliers are passed to an analytics technique). One may choose to handle outliers separately,
or turn them into missing values, so that both outliers and missing values may be handled together.
If measured values for a random variable xj are approximately Normally distributed and are several standard
deviation units away from the center (µxj), they are rare events. Depending on the situation, this may be
important information to examine, but it may often indicate incorrect measurement. Table 5.1 shows how
unlikely it is to obtain data points in the distant tails of a Normal distribution. The standard way to detect
outliers using the standard deviation method is to examine points beyond three standard deviation (σxj)
units. This is also called the z-score method, as xj needs to be transformed to zj, which follows the Standard
Normal distribution.

zj = (xj − µxj) / σxj    (5.1)
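A hedged sketch of the z-score method for a data column x, assuming VectorD provides mean and stdev methods:

val z = (x - x.mean) / x.stdev                                    // standardize the column
val outlierIdx = z.indices.filter (i => math.abs (z(i)) > 3.0)    // beyond 3 sigma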
[Figure: pdf for the Standard Normal Distribution, fzj(z) vs. z]
The InterQuartile Range (IQR) method instead flags a data point xj as an outlier when it falls outside the
middle quartiles expanded by a scale factor δ:

xj ∉ [ Q.25[xj] − δ · IQR,  Q.75[xj] + δ · IQR ]    (5.2)
For the Normal distribution, a scale factor of δ = 1.5 corresponds to 2.69792 standard deviation
units, and δ = 2.0 corresponds to 3.3724 standard deviation units (see the exercises). The advantage of
this method over the previous one is that it can work when the data points are not approximately Normal.
This includes cases where the distribution is not symmetric (a problematic situation for the previous
method). A weakness of the IQR method occurs when data are concentrated near the median, resulting in
an IQR that is in some sense too small to be useful.
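A sketch of the IQR rule in equation (5.2), assuming a hypothetical quantile method on VectorD:

def iqrOutliers (x: VectorD, delta: Double = 1.5): Seq [Int] =
    val (q1, q3) = (x.quantile (0.25), x.quantile (0.75))    // hypothetical quantile method
    val iqr = q3 - q1
    val (lo, hi) = (q1 - delta * iqr, q3 + delta * iqr)
    x.indices.filter (i => x(i) < lo || x(i) > hi)           // indices outside the interval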
Use of Box-Plots provides visual support for spotting outliers. The IQR is shown as a box, with whiskers
extending δ · IQR units beyond the box in both directions, and with indications of the locations of extreme
data points beyond the whiskers.
ScalaTion provides the following outlier detection techniques:
• Standard Deviation Method: data points too many standard deviation units (typically 2.5 to 3.5,
defaults to 2.7) away from the mean, DistanceOutlier;
• InterQuartile Range Method: data points a scale factor/expansion multiplier (typically 1.5 to 2.0,
defaults to 1.5) times the IQR beyond the middle two quartiles, QuartileXOutlier; and
• Quantile/Percentile Method: data points in the extreme percentages (typically 0.7 to 10 percent,
defaults to 0.7), i.e., having the smallest or largest values, QuantileOutlier.
Note: These defaults put these three outlier detection methods in alignment when data points are approx-
imately Normally distributed.
The following function will turn outliers into missing values by reassigning the outliers to noDouble,
ScalaTion's indicator of a missing value of type Double.
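A minimal sketch of such a function, assuming an outlier detector has already returned the index positions of the outliers and that VectorD supports copy and element update:

def outliersToMissing (x: VectorD, outlierIdx: Set [Int]): VectorD =
    val x2 = x.copy                                // copy, leaving x unchanged
    for i <- outlierIdx do x2(i) = noDouble        // reassign outliers to noDouble
    x2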
An alternative to eliminating outliers during data preprocessing is to eliminate them during modeling
by looking for extreme residuals. In addition to looking at the magnitude of a residual εi, some argue only
to remove data points that also have high influence on the model's parameters/coefficients, using techniques
such as DFFITS, Cook's Distance, or DFBETAS [34].
5.3 Imputation Techniques
The two main ways to handle missing values are (1) throw them away, or (2) use imputation to replace them
with reasonable guesses. When there is a gap in time series data, imputation may be used for short gaps,
but is unlikely to be useful for long gaps, especially when the imputation techniques are simple. An
alternative would be to use an advanced modeling technique like SARIMA for the imputation, but then the results
of a modeling study using SARIMA are likely to be biased. Imputation implementations are based on the
Imputation trait in the scalation.modeling package.
Trait Methods:
trait Imputation
2. object ImputeForward extends Imputation: Use the previous value and slope to estimate the next
missing value.
3. object ImputeBackward extends Imputation: Use the subsequent value and slope to estimate the
previous missing value.
4. object ImputeMean extends Imputation: Use the filtered mean to estimate the next missing value.
5. object ImputeMovingAvg extends Imputation: Use the moving average of the last 'dist' values to
estimate the next missing value.
6. object ImputeNormal extends Imputation: Use the median of three random values, drawn from a
Normal distribution based on the filtered mean and variance, to estimate the next missing value.
7. object ImputeNormalWin extends Imputation: Same as ImputeNormal, except the mean and variance
are recomputed over a sliding window.
5.4 Align Multiple Time Series
When the data include multiple time series, there are likely to be time alignment problems. The frequency
and/or phase may not be in agreement. For example, traffic count data may be recorded every 15 minutes
and phased on the hour, while weather precipitation data may be collected every 30 minutes and phased to
10 minutes past the hour.
ScalaTion supports the following alignment techniques: (1) approximate left outer join and (2) dynamic
time warping. The first operator will perform a left outer join between two relations based on their
time (TimeNum) columns. Rather than the usual matching based on equality, approximately equal times are
considered sufficient for alignment. For example, to align the traffic data with the weather data, the following
approximate left outer join may be used.
5.5 Creating Vectors and Matrices
Once the data have been preprocessed, columns may be projected out to create a matrix that may be passed
to analytics/modeling techniques.
This matrix may then be passed into multiple modeling techniques, e.g., (1) a Multiple Linear Regression
model or (2) an Auto-Regressive, Integrated, Moving-Average (ARIMA) model.
By default in ScalaTion, the rightmost columns are the response/output variables. As many of the
modeling techniques have a single response variable, it will be assumed to be in the last column. There are also
constructors and factory apply functions that take explicit vector and matrix parameters, e.g., a matrix of
predictor variables and a response vector.
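For example, with the response in the last column, a combined data matrix xy may be split as follows (a sketch assuming MatrixD supports column selection with ? and a column range):

val x = xy (?, 0 until xy.dim2 - 1)     // predictor columns
val y = xy (?, xy.dim2 - 1)             // response column (the last)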
5.6 Exercises
1. Assume random variable xj is distributed N (µ, σ).
(a) Show that when the scale factor δ = 1.5, the InterQuartile Range method corresponds to the
Standard Deviation method at 2.69792 standard deviation units.
(b) Show that when the scale factor δ = 2.0, the InterQuartile Range method corresponds to the
Standard Deviation method at 3.3724 standard deviation units.
(c) What should the scale factor δ need to be to correspond to 3 standard deviation units?
2. Randomly generate 10,000 data points from the Standard Normal distribution. Count how many of
these data points are considered as outliers for
(a) the Standard Deviation method set at 3.3724 standard deviation units, and
(b) the InterQuartile Range method with δ = 2.0.
(c) the Quantile/Percentile method set at what? percent.
3. Load the auto_mpg.csv dataset into an auto_mpg relation. Perform the preprocessing steps above to
create a cleaned-up relation auto_mpg2 and produce a data matrix called auto_mat from this relation.
Print out the correlation matrix for auto_mat. Which columns have the highest correlation? To predict
the miles per gallon mpg, which columns are likely to be the best predictors?
4. Find a dataset at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php)
and carry out the same steps.
Part II
Modeling
Chapter 6
Prediction
As the name predictive analytics indicates, the purpose of techniques that fall in this category is to develop
models to predict outcomes. For example, the distance a golf ball travels y when hit by a driver depends
on several factors or inputs x such as club head speed, barometric pressure, and smash factor (how square
the impact is). The models can be developed using a combination of data (e.g., from experiments) and
knowledge (e.g., Newton’s Second Law). The modeling techniques discussed in this technical report tend
to emphasize the use of data more than knowledge, while those in the simulation modeling technical report
emphasize knowledge.
Abstractly, a predictive model can generally be formulated using a prediction function f as follows:
y = f(x, t; b) + ε    (6.1)
where
• y is a response/output scalar,
• x is a predictor/input vector,
Both the response y and the residuals/errors ε are treated as random variables, while the predictor/feature
variables x may be treated as either random or deterministic, depending on context. Depending on the goals
of the study, as well as whether the data are the product of controlled/designed experiments, the random or
the deterministic view may be more suitable.
The parameters b can be adjusted so that the predictive model matches the available data. Note,
in the definition of a function, the arguments appear before the “;”, while the parameters appear after.
The residuals/errors ε are typically additive as shown above, but may also be multiplicative. Of course, the
formulation could be generalized by turning the output/response into a vector y and the parameters into a
matrix B.
When a model is time-independent or time can be treated as just another dimension within the x vectors,
prediction functions can be represented as follows:
y = f(x; b) + ε    (6.2)
Another way to look at such models is that we are trying to estimate the conditional expectation of y given x.

y = E[y|x] + ε
ε = y − f(x; b)
Given a dataset (m instances of data), each instance contributes to an overall residual/error vector ε.
One of the simpler ways to estimate the parameters b is to minimize the size of the residual/error vector,
e.g., its Euclidean norm. The square of this norm is the sum of squared errors (sse = ε · ε).
This corresponds to minimizing the raw mean square error (mse = sse/m). See the section on Generalized
Linear Models for further development along these lines.
In ScalaTion, data are passed to the train function to train the model/fit the parameters b. In the
case of prediction, the predict function is used to predict values for the scalar response y.
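Putting these together, a typical modeling session has the following shape (a sketch using the Regression class discussed later in this chapter):

val mod = new Regression (x, y)        // model with data matrix x and response vector y
mod.trainNtest ()()                    // train the model and evaluate its QoF
val yp  = mod.predict (z)              // predict the response for a new instance vector z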
A key question to address is the possible functional forms that f may take, such as the importance of
time, the linearity of the function, the domains for y and x, etc. We consider several cases in the subsections
below.
6.1 Predictor
In ScalaTion, the Predictor trait provides a common framework for several predictor classes such as
SimpleRegression or Regression. All of the modeling techniques discussed in this chapter extend the
Predictor trait. They also extend the Fit trait to enable Quality of Fit (QoF) evaluation. (Unlike classes,
traits support multiple inheritance).
Many modeling techniques utilize several predictor/input variables to predict a value for a response/out-
put variable, e.g., given values for [x0 , x1 , x2 ] predict a value for y. The datasets fed into such modeling
techniques will collect multiple instances of the predictor variables into a matrix x and multiple instances of
the response variable into a vector y. The Predictor trait takes datasets of this form.
Trait Methods:
@param x       the input/data m-by-n matrix
               (augment with a first column of ones to include intercept in model)
@param y       the response/output m-vector
@param fname   the feature/variable names (if null, use x_j)
@param hparam  the hyper-parameters for the model
...
def stepRegressionAll (idx_q: Int = QoF.rSqBar.ordinal, cross: Boolean = true):
        (LinkedHashSet [Int], MatrixD) =
The Predictor trait extends the Model trait (see the end of the Probability chapter) and has the following
methods:
1. The getX method returns the actual data/input matrix used by the model. Some complex models
expand the columns in an initial data matrix to add for example quadratic or cross terms.
2. The getY method returns the actual response/output vector used by the model. Some complex models
transform the initial response vector.
3. The getFname method returns the names of predictor variable/features, both given and extended.
5. The train method takes the dataset passed into the model (either the full dataset or a training-data)
and optimizes the model parameters b.
6. The train2 method takes the dataset passed into the model (either the full dataset or a training
dataset) and optimizes the model parameters b. It also optimizes the hyper-parameters.
7. The test method evaluates the Quality of Fit (QoF) either on the full dataset or a designated test-data
using the diagnose method.
8. The trainNtest method trains on the training-set and evaluates on the test-set.
9. The predict method takes a data vector (e.g., a new data instance) and predicts its response. Another
predict method takes a matrix as input (with each row being an instance) and makes predictions for
each row.
10. The hparameter method returns the hyper-parameters for the model. Many simple models have none,
but more sophisticated modeling techniques such as RidgeRegression and LassoRegression have
them (e.g., a shrinkage hyper-parameter).
11. The parameter method returns the estimated parameters for the model.
12. The residual method returns the difference between the actual and predicted response vectors. The
residual indicates what the model has left to explain/account for (e.g., an ideal model will only leave
the noise in the data unaccounted for).
13. The buildModel method builds a sub-model that is restricted to the given columns of the data matrix.
This method is called by the following feature selection methods.
14. The selectFeatures method makes it easy to switch between forward, backward and stepwise feature
selection.
15. The forwardSel method is used for forward selection of variables/features for inclusion into the model.
At each step, the variable that increases the predictive power of the model the most is selected. This
method is called repeatedly in forwardSelAll to find the "best" combination of features; it is not
guaranteed to find the optimal combination.
16. The importance method is used to indicate the relative importance of the features/variables.
17. The backwardElim method is used for backward elimination of variables/features from the model. At
each step, the variable that contributes the least to the predictive power of the model is eliminated. This
method is called repeatedly in backwardElimAll to find the "best" combination of features; it is not
guaranteed to find the optimal combination.
18. The stepRegressionAll method decides to add or remove a variable/feature based on whichever leads
to the greater improvement. It continues until there is no further improvement. A swap operation may
yield a better combination of features.
19. The vif method returns the Variance Inflation Factors (VIFs) for each of the columns in the data/input
matrix. High VIF scores may indicate multi-collinearity.
21. The validate method divides a dataset into a training-set and a test-set, trains on one and tests on
the other to determine out-of-sample Quality of Fit (QoF).
22. The crossValidate method implements k-fold cross-validation, where a dataset is divided into a
training-set and a test-set. The training-set is used by the train method, while the test-set is used by
the test method. The crossValidate method is similar to validate, but more extensive in that it
repeats this process k times and makes sure all the data ends up in one of the k test-sets.
6.2 Quality of Fit for Prediction
The related Fit trait provides a common framework for computing Quality of Fit (QoF) measures. The
dataset for many models comes in the form of an m-by-n data matrix X and an m-dimensional response vector y. After
the parameters b (an n-vector) have been fit/estimated, the error vector ε may be calculated. The basic
QoF measures involve taking either ℓ1 (Manhattan) or ℓ2 (Euclidean) norms of the error vector, as indicated
in Table 6.1.
Typically, if a model has m instances/rows in the dataset and n parameters to fit, the error vector will live
in an (m − n)-dimensional space (ignoring issues related to the rank of the data matrix). Note, if n = m, there
may be a unique solution for the parameter vector b, in which case ε = 0, i.e., the error vector lives in a
0-dimensional space. The Degrees of Freedom (for error) is the dimensionality of the space that the error
vector lives in, namely df = m − n.
Trait Methods:
@param dfm  the degrees of freedom for the model/regression
@param df   the degrees of freedom for error
For modeling, a user chooses one of the classes (directly or indirectly) extending the trait Predictor
(e.g., Regression) to instantiate an object. Next, the train method would typically be called, followed
by the test method, which computes the residual/error vector and calls the diagnose method. Then the
fitMap method would be called to return the quality of fit statistics computed by the diagnose method. The
quality of fit measures computed by the diagnose method in the Fit class are shown below.
@param y   the actual response/output vector to use (test/full)
@param yp  the predicted response/output vector (test/full)
@param w   the weights on the instances (defaults to null)
One may look at the sum of squared errors (sse) as an indicator of model quality.

sse = ε · ε    (6.4)
In particular, sse can be compared to the sum of squares total (sst), which measures the total variability of
the response y,

sst = ‖y − µy‖² = y · y − m µy² = y · y − (1/m)(Σi yi)²    (6.5)

while the sum of squares regression (ssr = sst − sse) measures the variability captured by the model, so the
coefficient of determination measures the fraction of the variability captured by the model.

R² = ssr / sst = 1 − sse / sst ≤ 1    (6.6)
Values for R2 would be non-negative, unless the proposed model is so bad (worse than the Null Model that
simply predicts the mean) that the proposed model actually adds variability.
6.3 Null Model
The NullModel class implements the simplest type of predictive modeling technique. If all else fails it may
be reasonable to simply guess that y will take on its expected value or mean.
y = E[y] + ε    (6.7)
This could happen if the predictors x are not relevant, are not collected in a useful range, or the relationship is
too complex for the modeling techniques you have applied.
y = b0 + ε    (6.8)
This intercept-only model is just a constant term plus the error/residual term.
6.3.2 Training
The training dataset in this case consists only of a response vector y. The error vector in this case is

ε = y − ŷ = y − b0 1    (6.9)
For Least Squares Estimation (LSE), the loss function L(b) can be set to half the sum of squared errors.

L(b) = (1/2) sse = (1/2) ‖ε‖² = (1/2) ε · ε    (6.10)

Substituting for ε gives

L(b) = (1/2) (y − b0 1) · (y − b0 1)    (6.11)
Recall the derivative product rule for dot products:

(f · g)′ = f′ · g + f · g′
(f · f)′ = 2 f′ · f

Dividing by 2 gives

(1/2) (f · f)′ = f′ · f    (6.12)
Taking the derivative w.r.t. b0, dL/db0, using the derivative product rule and setting it equal to zero yields
the following equation.

dL/db0 = −1 · (y − b0 1) = 0
Therefore, the optimal value for the parameter b0 is
b0 = (1 · y)/(1 · 1) = (1 · y)/m = µy    (6.13)
This shows that the optimal value for the parameter is the mean of the response vector.
In ScalaTion this requires just one line of code inside the train method.
def train (x_null: MatrixD = null, y_ : VectorD = y): Unit =
    b = VectorD (y_.mean)                          // parameter vector [b0]
end train
After values for the model parameters are determined, it is important to assess the Quality of Fit (QoF).
The test method will compute the residual/error vector and then call the diagnose method.
def test (x_null: MatrixD = null, y_ : VectorD = y): (VectorD, VectorD) =
    val yp = VectorD.fill (y_.dim)(b(0))           // y predicted for (test/full)
    (yp, diagnose (y_, yp))                        // return predictions and QoF
end test
The coefficient of determination R² for the null model is always 0, i.e., none of the variance in the
random variable y is explained by the model. A more sophisticated model should only be used if it is better
than the null model, that is, when its R² is strictly greater than zero. Also, a model can have a negative R²
if its predictions are worse than guessing the mean.
Finally, the predict method is simply:

def predict (z: VectorD): Double = b(0)
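Putting the pieces together, a NullModel may be exercised as follows (a minimal sketch):

import scalation.mathstat.VectorD
import scalation.modeling.NullModel

val y   = VectorD (1, 3, 3, 4)                 // response vector
val mod = new NullModel (y)                    // instantiate the null model
mod.train ()                                   // estimate b0 = mean of y = 2.75
println (mod.predict (VectorD (1.0)))          // predicts 2.75 for any input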
For the example dataset used below, the trained Null Model is

y = 2.75 + ε    (6.14)

The table below shows the values of x, y, ŷ, ε, and ε² for the Null Model.

x      y      ŷ        ε        ε²
1      1      11/4    −7/4     49/16
2      3      11/4     1/4      1/16
3      3      11/4     1/4      1/16
4      4      11/4     5/4     25/16
10     11     11       0       19/4 = 4.75

The sum of squared errors (sse) is given in the lower, right corner of the table. The sum of squares total for
this dataset is also 4.75, so

R² = 1 − sse/sst = 1 − 4.75/4.75 = 0
The plot below illustrates how the Null Model attempts to fit the four given data points.
[Figure: Null Model horizontal line ŷ = 2.75 vs. the four data points, y vs. x]
Class Methods:
@param y  the response/output vector
6.3.6 Exercises
1. Determine the value of the second derivative of the loss function
d²L/db0² = ?
at the critical point b0 = µy . What kind of critical point is this?
2. Let the response vector y be
val y = VectorD (1, 3, 3, 4)
Draw an xy plot of the data points. Give the value for the parameter vector b. Show the error distance
for each point in the plot. Compare the sum of squared errors sse with the sum of squares total sst.
What is the value for the coefficient of determination R2 ?
3. Using ScalaTion, analyze the NullModel for the following response vector y.
val y = VectorD (2.0, 3.0, 5.0, 4.0, 6.0)      // response vector y
println (s"y = $y")
4. Execute the NullModel on the Auto MPG dataset. See scalation.modeling.Example_AutoMPG. What
is the quality of the fit (e.g., R² or rSq)? Is this value expected? Is it possible for a model to perform
worse than this?
6.4 Simpler Regression
The SimplerRegression class supports simpler linear regression. In this case, the predictor vector x consists
of a single variable x0, i.e., x = [x0], and there is only a single parameter, the coefficient for x0 in the
model.

y = b · x + ε = b0 x0 + ε    (6.15)

where ε represents the residuals/errors (the part not explained by the model).
6.4.2 Training
A dataset may be collected to provide an estimate for the parameter b0. Given m data points, stored in an
m-dimensional vector x0, and m response values, stored in an m-dimensional vector y, we may obtain the
following vector equation.

y = b0 x0 + ε    (6.16)
One way to find a value for the parameter b0 is to minimize the norm of the residual/error vector ε.

min_b0 ‖y − b0 x0‖    (6.18)

This is equivalent to minimizing half the squared norm ((1/2) ‖ε‖² = (1/2) ε · ε = (1/2) sse). Thus the loss
function is

L(b) = (1/2) (y − b0 x0) · (y − b0 x0)    (6.19)
Taking the derivative w.r.t. b0 and setting it equal to zero gives

dL/db0 = −x0 · (y − b0 x0) = 0    (6.20)
Therefore, the optimal value for the parameter b0 is
b0 = (x0 · y) / (x0 · x0)    (6.21)
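This optimal slope may be checked directly in ScalaTion for the example data used next (dot is VectorD's dot product):

val x0 = VectorD (1, 2, 3, 4)
val y  = VectorD (1, 3, 3, 4)
val b0 = (x0 dot y) / (x0 dot x0)              // = 32/30 = 16/15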
6.4.4 Example Calculation
Consider the following data points {(1, 1), (2, 3), (3, 3), (4, 4)} and solve for the parameter (slope) b0.
b0 = ([1, 2, 3, 4] · [1, 3, 3, 4]) / ([1, 2, 3, 4] · [1, 2, 3, 4]) = 32/30 = 16/15

Using this optimal value for the parameter b0 = 16/15, we may obtain predicted values for each of the
x-values. The table below shows the values of x, y, ŷ, ε, and ε² for the Simpler Regression Model,

y = (16/15) · [x] + ε = (16/15) x + ε

x      y      ŷ          ε          ε²
1      1      16/15     −1/15       1/225
2      3      32/15      13/15    169/225
3      3      48/15     −3/15       9/225
4      4      64/15     −4/15      16/225
10     11     160/15     5/15      13/15 = 0.867
The sum of squared errors (sse) is given in the lower, right corner of the table. The sum of squares total for
this dataset is 4.75, so
R² = 1 − sse/sst = 1 − 0.867/4.75 = 0.817
The plot below illustrates how the Simpler Regression Model attempts to fit the four given data points.
[Figure: Simpler Regression Model line vs. the four data points, y vs. x]
Note that this model has no intercept. This makes the solution for the parameter very easy, but may
make the model less accurate. This is remedied in the next section. Since having no intercept really means the
intercept is zero, the regression line will pass through the origin. This is referred to as Regression Through
the Origin (RTO) and should only be applied when the data scientist has reason to believe it makes sense.
Class Methods:
@param x       the data/input matrix (only the first column is used)
@param y       the response/output vector
@param fname_  the feature/variable names (only the first name is used)
6.4.6 Exercises
1. For x0 = [1, 2, 3, 4] and y = [1, 3, 3, 4], try various values for the parameter b0. Plot the sum of squared
errors (sse) vs. b0. Note, the code must be completed before it is compiled and run.
import scalation.mathstat._

@main def simplerRegression_exer_1 (): Unit =

    val x0  = VectorD (1, 2, 3, 4)
    val y   = VectorD (1, 3, 3, 4)
    val b0  = VectorD.range (0, 50) / 25.0
    val sse = new VectorD (b0.dim)
    for i <- b0.indices do
        val e = ?                              // to be completed: error vector for slope b0(i)
        sse(i) = e dot e
    end for
    new Plot (b0, sse, lines = true)

end simplerRegression_exer_1
2. From the X matrix and y vector, plot the set of data points {(xi1, yi) | 0 ≤ i < m} and draw the line
that is nearest to these points. What is the slope of this line? Pass the X matrix and y vector as
arguments to the SimplerRegression class to obtain the b = [b0] vector.
// 4 data points: x0
val x = MatrixD ((4, 1), 1,                    // x 4-by-1 matrix
                         2,
                         3,
                         4)
val y = VectorD (1, 3, 3, 4)                   // y vector
An alternative to using the above constructor new SimplerRegression is to use a factory method
SimplerRegression. Substitute in the following lines of code to do this.
val x  = VectorD (1, 2, 3, 4)
val rg = SimplerRegression (x, y, null)
new Plot (x, y, yp, lines = true)
3. From the X matrix and y vector, plot the set of data points {(xi1 , yi ) | 0 ≤ i < m} and draw the line
that is nearest to these points and intersects the origin [0, 0]. What is the slope of this line? Pass the
X matrix and y vector as arguments to the SimplerRegression class to obtain the b = [b0 ] vector.
// 5 data points: x0
val x = MatrixD ((5, 1), 0,                    // x 5-by-1 matrix
                         1,
                         2,
                         3,
                         4)
val y = VectorD (2, 3, 5, 4, 6)                // y vector
4. Execute the SimplerRegression on the Auto MPG dataset. See scalation.modeling.Example_AutoMPG.
What is the quality of the fit (e.g., R² or rSq)? Is this value expected? What does it say about this
model? Try using different columns for the predictor variable.
5. Compute the second derivative of the loss function w.r.t. b0, d²L/db0². Under what conditions will it
be positive?
6.5 Simple Regression
The SimpleRegression class supports simple linear regression. It combines the benefits of the last two
modeling techniques: the intercept model NullModel and the slope model SimplerRegression. It is guaranteed
to be at least as good as the better of these two modeling techniques. In this case, the predictor vector
x ∈ R² consists of the constant one and a single variable x1, i.e., x = [1, x1], so there are now two parameters
b = [b0, b1] ∈ R² in the model.

y = b · x + ε = b0 + b1 x1 + ε    (6.22)

where ε represents the residuals (the part not explained by the model).
6.5.2 Training
The model is trained on a dataset consisting of m data points/vectors, stored row-wise in an m-by-2 matrix
X ∈ R^{m×2}, and m response values, stored in an m-dimensional vector y ∈ R^m.

y = Xb + ε    (6.23)

The parameter vector b may be determined by solving the following optimization problem:

min_b ‖ε‖

Substituting ε = y − ŷ = y − Xb yields

min_b ‖y − Xb‖

Using the fact that the matrix X consists of two column vectors 1 and x1, this can be rewritten as

min_{[b0, b1]} ‖y − [1 x1] [b0, b1]ᵀ‖

Since x0 is just 1, for simplicity we drop the subscript on x1. Thus the loss function (1/2) sse is

L(b) = (1/2) (y − (b0 1 + b1 x)) · (y − (b0 1 + b1 x))    (6.27)
6.5.3 Optimization - Gradient
A function of several variables can be optimized using Vector Calculus by setting its gradient (see the Linear
Algebra chapter) equal to zero and solving the resulting system of equations. When the system of equations
is linear, matrix factorization may be used; otherwise, techniques from Nonlinear Optimization may be
needed.
Taking the gradient of the loss function L gives

∇L = [∂L/∂b0, ∂L/∂b1]    (6.28)

The goal is to find the value of the parameter vector b that yields a zero gradient (flat response surface).
Setting the gradient equal to zero (0 = [0, 0]) yields two equations.

∇L(b) = [∂L/∂b0 (b), ∂L/∂b1 (b)] = 0    (6.29)
The gradient (the two partial derivatives) may be determined using the derivative product rule for dot
products.

(1/2) (f · f)′ = f′ · f
Setting ∂L/∂b0 = 0 and solving for b0:

−1 · (y − (b0 1 + b1 x)) = 0
1 · y − 1 · (b0 1 + b1 x) = 0
b0 1 · 1 = 1 · y − b1 1 · x

b0 = (1 · y − b1 1 · x) / m    (6.30)

Setting ∂L/∂b1 = 0 and multiplying through by m:

−x · (y − (b0 1 + b1 x)) = 0
x · y − x · (b0 1 + b1 x) = 0
b0 1 · x + b1 x · x = x · y
m b0 1 · x + m b1 x · x = m x · y    (6.31)

Substituting for m b0 = 1 · y − b1 1 · x yields

[1 · y − b1 1 · x] 1 · x + m b1 x · x = m x · y
b1 [m x · x − (1 · x)²] = m x · y − (1 · x)(1 · y)
Solving for b1 gives

b1 = (m x · y − (1 · x)(1 · y)) / (m x · x − (1 · x)²)    (6.32)
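Both parameters may be computed directly from these formulas (a sketch assuming VectorD provides dot and sum):

val x  = VectorD (1, 2, 3, 4)
val y  = VectorD (1, 3, 3, 4)
val m  = x.dim.toDouble
val b1 = (m * (x dot y) - x.sum * y.sum) / (m * (x dot x) - x.sum * x.sum)   // = 0.9
val b0 = (y.sum - b1 * x.sum) / m                                            // = 0.5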
The b0 parameter gives the intercept, while the b1 parameter gives the slope of the line that best fits the data
points.
Consider again the problem from the last section where the data points are {(1, 1), (2, 3), (3, 3), (4, 4)} and
solve for the two parameters, the intercept b0 and the slope b1. The table below shows the values of x, y,
ŷ, ε, and ε² for the Simple Regression Model.

x      y      ŷ       ε       ε²
1      1      1.4    -0.4     0.16
2      3      2.3     0.7     0.49
3      3      3.2    -0.2     0.04
4      4      4.1    -0.1     0.01
10     11     11      0       0.7
For which models (NullModel, SimplerRegression and SimpleRegression) did the residual/error vector
sum to zero?
The sum of squared errors (sse) is given in the lower right corner of the table. The sum of squares total
for this dataset is 4.75, so the Coefficient of Determination is

R² = 1 − sse/sst = 1 − 0.7/4.75 = 0.853
The plot below illustrates how the Simple Regression Model (SimpleRegression) attempts to fit the
four given data points.

[Figure: Simple Regression Model Line vs. Data Points — y plotted against x]
More concise and intuitive formulas for the parameters b0 and b1 may be derived.

• Using the definition for mean from Chapter 3 for µx and µy, it can be shown that the expression for
b0 shortens to

b0 = µy − b1 µx    (6.33)

Draw a line through the following two points: [0, b0] (the intercept) and [µx, µy] (the center of mass).
How does this line compare to the regression line?

• Now, using the definitions for covariance σx,y and variance σx² from Chapter 3, it can be shown that
the expression for b1 shortens to

b1 = σx,y / σx²    (6.34)

If the slope of the regression line is simply the ratio of the covariance to the variance, what would the
slope be if y = x? It may also be written as follows:

b1 = Sxy / Sxx    (6.35)

where Sxy = Σi (xi − µx)(yi − µy) and Sxx = Σi (xi − µx)².
Table 6.5 extends the previous table to facilitate computing the parameter vector b using the concise
formulas.

Table 6.5: Simple Regression Model: Expanded Table with Centering µx = 2.5, µy = 2.75

x     x − µx     y     y − µy     ŷ      ε       ε²
1     -1.5       1     -1.75      1.4    -0.4    0.16
2     -0.5       3      0.25      2.3     0.7    0.49
3      0.5       3      0.25      3.2    -0.2    0.04
4      1.5       4      1.25      4.1    -0.1    0.01
----------------------------------------------------------
10     0        11      0         11      0      0.7

Sxx = Σi (xi − µx)² = 1.5² + 0.5² + 0.5² + 1.5² = 5

Syy = Σi (yi − µy)² = 1.75² + 0.25² + 0.25² + 1.25² = 4.75

Sxy = Σi (xi − µx)(yi − µy) = (−1.5 · −1.75) + (−0.5 · 0.25) + (0.5 · 0.25) + (1.5 · 1.25) = 4.5

Therefore,

b1 = Sxy / Sxx = 4.5 / 5 = 0.9

and b0 = µy − b1 µx = 2.75 − 0.9 · 2.5 = 0.5.
Next, the relationships between the predictor variables xj (the columns in the input/data matrix X) should
be compared. If two of the predictor variables are highly correlated, their individual effects on the response
variable y may be indistinguishable. The correlations between the predictor variables may be seen by
examining the correlation matrix. Including the response variable in a combined data matrix xy allows one
to see how each predictor variable is correlated with the response.

banner ("Correlation Matrix for Columns of xy")
println (s"x_fname = ${stringOf (x_fname)}")
println (s"y_name  = MPG")
println (s"xy.corr = ${xy.corr}")
Although Simple Regression may be too simple for many problems/datasets, it should be used in Ex-
ploratory Data Analysis (EDA). A simple regression model should be created for each predictor variable xj .
The data points and the best fitting line should be plotted with y on the vertical axis and xj on the hori-
zontal axis. The data scientist should look for patterns/tendencies of y versus xj , such as linear, quadratic,
logarithmic, or exponential patterns. When there is no relationship, the points will appear to be randomly
and uniformly positioned in the plane.
for j <- x.indices2 do
    banner (s"Plot response y vs. predictor variable ${x_fname(j)}")
    val xj  = x(?, j)
    val mod = SimpleRegression (xj, y, Array ("one", x_fname(j)))
    mod.trainNtest () ()                                   // train and test model
    val yp  = mod.predict (mod.getX)
    new Plot (xj, y, yp, s"EDA: y and yp (red) vs. ${x_fname(j)}", lines = true)
end for
The figure below shows four possible patterns: Linear (blue), Quadratic (purple), Inverse (green), Inverse-
Square (black). Each curve depicts a function 1 + x^p, for p = −2, −1, 1, 2.

[Figure: Finding a Pattern — Linear (blue), Quadratic (purple), Inverse (green), Inverse-Square (black); y plotted against x]
To look for quadratic patterns, the following code regresses on the square of each predictor variable (i.e., xj²).

for j <- x.indices2 do
    banner (s"Plot response y vs. predictor variable ${x_fname(j)}")
    val xj  = x(?, j)
    val mod = SimpleRegression.quadratic (xj, y, Array ("one", x_fname(j) + "^2"))
    mod.trainNtest () ()                                   // train and test model
    val yp  = mod.predict (mod.getX)
    new Plot (xj, y, yp, s"EDA: y and yp (red) vs. ${x_fname(j)}", lines = false)
end for
To determine the effect of having linear and quadratic terms (both xj and x2j ) the Regression class that
supports Multiple Linear Regression or the SymbolicRegression object may be used. Generally, one could
include both terms if there is sufficient improvement over just using one term. If one term is chosen, use
the linear term unless the quadratic term is sufficiently better (see the section on Symbolic Regression for a
more detailed discussion).
Plotting

The Plot and PlotM classes in the mathstat package can be used for plotting data and results. Both use
ZoomablePanel in the scala2d package to support zooming and dragging. The mouse wheel controls the
amount of zooming (scroll value where up is negative and down is positive), while mouse dragging repositions
the objects in the panel (drawing canvas).

@param x       the x vector of data values (horizontal), use null to use y's index
@param y       the y vector of data values (primary vertical, black)
@param z       the z vector of data values (secondary vertical, red) to compare with y
@param _title  the title of the plot
@param lines   flag for generating a line plot

class Plot (x: VectorD, y: VectorD, z: VectorD = null, _title: String = "Plot y vs. x",
            lines: Boolean = false)
      extends VizFrame (_title, null):
Class Methods:

@param x       the data/input matrix augmented with a first column of ones
                (only use the first two columns [1, x1])
@param y       the response/output vector
@param fname_  the feature/variable names (only use the first two names)
...
             b_ : VectorD = b, vifs: VectorD = vif ()): String =
def confInterval (x_ : MatrixD = getX): VectorD =
6.5.7 Exercises

1. From the X matrix and y vector, plot the set of data points {(xi1, yi) | 0 ≤ i < m} and draw the line
that is nearest to these points (i.e., that minimizes ‖ε‖). Using the formulas developed in this section,
what are the intercept and slope [b0, b1] of this line?
Also, pass the X matrix and y vector as arguments to the SimpleRegression class to obtain the b
vector.

// 4 data points:         one  x1
val x = MatrixD ((4, 2), 1, 1,                     // x 4-by-2 matrix
                         1, 2,
                         1, 3,
                         1, 4)
val y = VectorD (1, 3, 3, 4)                       // y vector
2. For more complex models, setting the gradient to zero and solving a system of simultaneous equations
may not work, in which case more general optimization techniques may be applied. Two simple
optimization techniques are grid search and gradient descent.
For grid search, in a spreadsheet set up a 5-by-5 grid around the optimal point for b found in the
previous problem. Compute values h for the loss function L = ½ sse for each point in the grid. Plot h
versus b0 across the optimal point. Do the same for b1. Make a 3D plot of the surface h as a function
of b0 and b1.
For gradient descent, pick a starting point b0, compute the gradient ∇L and move −η∇L from b0,
where η is the learning rate (e.g., 0.1). Repeat for a few iterations. What is happening to the value of
the loss function L = ½ sse?

Hint: ∇L = [−1 · ε, −x · ε]
3. From the X matrix and y vector, plot the set of data points {(xi1, yi) | 0 ≤ i < m} and draw the line
that is nearest to these points. What are the intercept and slope of this line? Pass the X matrix and
y vector as arguments to the SimpleRegression class to obtain the b vector.

// 5 data points:         one  x1
val x = MatrixD ((5, 2), 1, 0,                     // x 5-by-2 matrix
                         1, 1,
                         1, 2,
                         1, 3,
                         1, 4)
val y = VectorD (2, 3, 5, 4, 6)                    // y vector
5. Let the errors εi have E[εi] = 0 and V[εi] = σ², and be independent of each other. Show that the variances
for the parameters b0 and b1 are as follows:

V[b1] = σ² / Sxx

Hint: V[b1] = V[Sxy / Sxx] = Sxx⁻² V[Sxy].

V[b0] = (1/m + µx²/Sxx) σ²

6. Further assume that εi ∼ N(0, σ²). Show that the confidence intervals for the parameters b0 and b1
are as follows:

b1 ± t* s / √Sxx

Hint: Let the error variance estimator be s² = sse / (m − 2) = mse.

b0 ± t* s √(1/m + µx²/Sxx)
sse
estimate the error variance s2 = = mse. Take its square root to obtain the residual standard
m−2
error s. Use these to compute 95% confidence intervals for the parameters: b0 and b1 .
8. Consider the above simple dataset, but where the y values are reversed so the slope is negative and
the fitted line is away from the origin.

val x  = VectorD (1, 2, 3, 4, 5)
val y  = VectorD (4, 5, 3, 3, 1)
val ox = MatrixD.one (x.dim) :^+ x

Compare the SimplerRegression model with the SimpleRegression model. Examine the QoF mea-
sures for each model and make an argument for which model to pick. Also compute R0² (R² relative
to 0):

R0² = 1 − ‖y − ŷ‖² / ‖y‖²    (6.37)

R² = 1 − ‖y − ŷ‖² / ‖y − µy‖²    (6.38)

For Regression Through the Origin (RTO), some software packages use R0² in place of R². See
[43] for a deeper discussion of the issues involved, including when it is appropriate to not include an
intercept b0 in the model. ScalaTion provides functions for both in the FitM trait: def rSq (the
default) and def rSq0.
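Both measures can also be computed directly from their definitions. A minimal sketch using only VectorD dot products, with y and yp assumed given (the trailing underscores in the names are just to avoid clashing with FitM's rSq and rSq0):

// R^2 relative to 0 (eq. 6.37) and the standard R^2 (eq. 6.38), by hand
def rSq0_ (y: VectorD, yp: VectorD): Double =
    val e = y - yp                                 // residual vector
    1.0 - (e dot e) / (y dot y)

def rSq_ (y: VectorD, yp: VectorD): Double =
    val e = y - yp; val yc = y - y.mean            // residuals and mean-centered y
    1.0 - (e dot e) / (yc dot yc)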
6.6 Regression

The Regression class supports multiple linear regression, where multiple input/predictor variables are used
to predict a value for the response/output variable. When the response variable has non-zero correlations
with multiple predictor variables, this technique tends to be effective and efficient, and it leads to explainable
models. It typically should be applied in combination with more complex modeling techniques. In this case,
the predictor vector x is multi-dimensional [1, x1, ..., xk] ∈ ℝⁿ, so the parameter vector b = [b0, b1, ..., bk] has
the same dimension as x, while the response y is a scalar.

6.6.1 Model

[Figure 6.1: Network view of Multiple Linear Regression — inputs x0, x1, x2 connect to output y via edge weights b0, b1, b2, with bias β]

The intercept can be provided by fixing x0 to one, making b0 the intercept. Alternatively, x0 can be used
as a regular input variable by introducing another parameter β for the intercept. In Neural Networks, β is
referred to as the bias and bj is referred to as the edge weight connecting input vertex/node j to the output
node, as shown in Figure 6.1. Note, if an activation function fa is added to the model, the Multiple Linear
Regression model becomes a Perceptron model.

y = b · x + ε = b0 + b1 x1 + ... + bk xk + ε    (6.39)

where ε represents the residuals (the part not explained by the model).
6.6.2 Training
Using several data samples as a training set (X, y), the Regression class in ScalaTion can be used to
estimate the parameter vector b. Each sample pairs an x input vector with a y response value. The x vectors
are placed into a data/input matrix X ∈ Rm×n row-by-row with a column of ones as the first column in X.
The individual response values taken together form the response vector y ∈ Rm .
The training diagram shown in Figure 6.2 illustrates how the i-th instance/row flows through the diagram,
computing the predicted response ŷ = b · x and the error ε = y − ŷ.
[Figure 6.2: Training diagram — instance (xi0, xi1, xi2, yi) from (X, y) flows through b · x to produce ŷ, with error ε = y − ŷ]
The matrix-vector product Xb provides an estimate for the response vector ŷ.

y = Xb + ε    (6.40)

The goal is to minimize the distance between y and its estimate ŷ, i.e., minimize the norm of the
residual/error vector:

min_b ‖ε‖

Substituting ε = y − ŷ = y − Xb yields

min_b ‖y − Xb‖

This is equivalent to minimizing half the dot product of the error vector with itself (½ ‖ε‖² = ½ ε · ε = ½ sse).
Thus, the loss function is

L(b) = ½ (y − Xb) · (y − Xb)    (6.43)

Applying the derivative product rule for dot products,

½ (f · f)′ = f′ · f    (6.45)

yields the j-th partial derivative.
∂L/∂bj = −x:j · (y − Xb) = −x:jᵀ (y − Xb)    (6.46)

Notice that the parameter bj is only multiplied by column x:j in the matrix-vector product Xb. The dot
product is equivalent to a transpose operation followed by matrix multiplication. The gradient is formed by
collecting all these partial derivatives together.

∇L = −Xᵀ (y − Xb)    (6.47)

Now, setting the gradient equal to the zero vector 0 ∈ ℝⁿ yields

−Xᵀ (y − Xb) = 0
−Xᵀ y + (Xᵀ X) b = 0

A more detailed derivation of this equation is given in section 3.4 of “Matrix Calculus: Derivation and Simple
Application” [82]. Moving the term not involving b to the right side results in the Normal Equations.

(Xᵀ X) b = Xᵀ y    (6.48)
Note: minimizing the distance between y and Xb is equivalent to minimizing the sum of the squared
residuals/errors (the Least Squares method).

ScalaTion provides five techniques for solving the Normal Equations for the parameter vector b:
Matrix Inversion, LU Factorization, Cholesky Factorization, QR Factorization and SVD Factorization.

(Xᵀ X) b = Xᵀ y

A simple technique is Matrix Inversion, which involves computing the inverse of Xᵀ X and using it to multiply
both sides of the Normal Equations.

b = (Xᵀ X)⁻¹ Xᵀ y    (6.49)

where (Xᵀ X)⁻¹ is an n-by-n matrix, Xᵀ is an n-by-m matrix and y is an m-vector. When X is full rank,
the expression above involving the X matrix may be referred to as the pseudo-inverse X⁺.

X⁺ = (Xᵀ X)⁻¹ Xᵀ

When X is not full rank, Singular Value Decomposition may be applied to compute X⁺. Using the pseudo-
inverse, the parameter vector b may be solved for as follows:

b = X⁺ y    (6.50)

The pseudo-inverse can be computed by first multiplying X by its transpose. Gaussian Elimination can be
used to compute the inverse of this, which can then be multiplied by the transpose of X. In ScalaTion,
the computation for the pseudo-inverse (x_pinv) looks similar to the math.
val x_pinv = (x.T * x).inverse * x.T

Most of the factorization classes/objects implement matrix inversion, including Fac_Inverse, Fac_LU,
Fac_Cholesky, and Fac_QR. The default Fac_LU combines reasonable speed and robustness.

def inverse: MatrixD = Fac_LU.inverse (this) ()

For efficiency, the code in Regression does not calculate x_pinv; rather, it directly solves for the parameters b.

val b = fac.solve (x.T * y)
Starting with the solution to the Normal Equations that uses the inverse to determine the optimal parameter
vector b,

b = (Xᵀ X)⁻¹ Xᵀ y    (6.51)

one can substitute the rhs into the prediction equation ŷ = Xb:

ŷ = X (Xᵀ X)⁻¹ Xᵀ y = H y    (6.52)

where H = X (Xᵀ X)⁻¹ Xᵀ is the hat matrix (it puts a hat on y). The hat matrix may be viewed as a projection
matrix.
For LU Factorization, the product Xᵀ X is factored into

Xᵀ X = L U

where L is a lower left triangular n-by-n matrix and U is an upper right triangular n-by-n matrix. Then the
Normal Equations may be rewritten as

L U b = Xᵀ y

Letting w = U b allows the problem to be solved in two steps. The first is solved by forward substitution to
determine the vector w; the second by backward substitution to determine b.

L w = Xᵀ y
U b = w
Example Calculation

Consider the example where the input/data matrix X and output/response vector y are as follows:

X = [[1, 1],
     [1, 2],
     [1, 3],
     [1, 4]],    y = [1, 3, 3, 4]

Putting these values into the Normal Equations (Xᵀ X) b = Xᵀ y yields the augmented matrix

[ 4   10 | 11 ]
[ 10  30 | 32 ]

Multiply the first row by -2.5 and add it to the second row,

[ 4   10 | 11  ]
[ 0    5 | 4.5 ]

This results in the following optimal parameter vector b = [0.5, 0.9]. Note, the product of L and U gives Xᵀ X.
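The same result can be reproduced with one of the factorization classes shown above. A hedged sketch, assuming Fac_LU resides in scalation.mathstat (consistent with its use inside MatrixD.inverse):

import scalation.mathstat.{MatrixD, VectorD, Fac_LU}

val x = MatrixD ((4, 2), 1, 1,                     // data matrix with intercept column
                         1, 2,
                         1, 3,
                         1, 4)
val y = VectorD (1, 3, 3, 4)

val fac = new Fac_LU (x.T * x)                     // factor X^T X = L U
fac.factor ()
val b = fac.solve (x.T * y)                        // L w = X^T y, then U b = w
println (s"b = $b")                                // expect b = [0.5, 0.9]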
For Cholesky Factorization, Xᵀ X is factored into

Xᵀ X = L Lᵀ

where L is a lower triangular n-by-n matrix. Then the Normal Equations may be rewritten as

L Lᵀ b = Xᵀ y

Letting w = Lᵀ b, we may solve for w using forward substitution

L w = Xᵀ y

and then solve for b using backward substitution.

Lᵀ b = w

As an example, the product of L and its transpose Lᵀ gives Xᵀ X. Therefore, w can be determined by forward
substitution and b by backward substitution.
For QR Factorization, the data matrix X itself is factored into

X = Q R

where Q is an orthogonal m-by-n matrix and R is a right upper triangular n-by-n matrix. Starting
again with the Normal Equations,

(Xᵀ X) b = Xᵀ y
(QR)ᵀ QR b = (QR)ᵀ y
Rᵀ Qᵀ Q R b = Rᵀ Qᵀ y

and using the fact that Qᵀ Q = I, we obtain the following:

Rᵀ R b = Rᵀ Qᵀ y

Multiplying both sides by (Rᵀ)⁻¹ yields

R b = Qᵀ y

Since R is an upper triangular matrix, the parameter vector b can be determined by backward substitution.
Alternatively, the pseudo-inverse may be computed as follows:

X⁺ = R⁻¹ Qᵀ
private def solver (x_ : MatrixD): Factorization =
    algorithm match
    case "Fac_Cholesky" => new Fac_Cholesky (x_.T * x_)    // Cholesky Factorization
    case "Fac_LU"       => new Fac_LU (x_.T * x_)          // LU Factorization
    case "Fac_Inverse"  => new Fac_Inverse (x_.T * x_)     // Inverse Factorization
    case "Fac_SVD"      => new Fac_SVD (x_)                // Singular Value Decomp.
    case _              => new Fac_QR (x_)                 // QR Factorization (default)
    end match
end solver
The train method below computes the parameter/coefficient vector b by calling the solve method provided
by the factorization classes.

def train (x_ : MatrixD = x, y_ : VectorD = y): Unit =
    val fac = solver (x_)
    fac.factor ()                                          // factor the matrix
    ...
    if b(0).isNaN then flaw ("train", s"parameter b = $b")
    debug ("train", s"$fac estimates parameter b = $b")
end train
After training, the test method does two things: First, the residual/error vector is computed. Second,
several quality of fit measures are computed by calling the diagnose method.

def test (x_ : MatrixD = x, y_ : VectorD = y): (VectorD, VectorD) =
    val yp = predict (x_)                                  // make predictions
    e = y_ - yp                                            // RECORD the residuals/errors
    (yp, diagnose (y_, yp))                                // return predictions and QoF
end test

To see how the train and test methods work in a Regression model, see the Collinearity Test and Texas
Temperatures examples in subsequent subsections.
Degrees of Freedom

The prediction vector ŷ is a projection of the response vector y ∈ ℝᵐ onto ℝᵏ, the space (hyperplane)
spanned by the vectors x1, ..., xk. Since ε = y − ŷ, one might think that the residual/error ε ∈ ℝ^{m−k}. As
Σi εi = 0 when an intercept parameter b0 is included in the model (n = k + 1), this constraint reduces the
dimensionality of the space by one, so ε ∈ ℝ^{m−n}.

Therefore, the Degrees of Freedom (DoF) captured by the regression model, dfr, and those left for the error,
df, follow directly: with an intercept, dfr = n − 1 = k and df = m − n.

As an example, the equation ŷ = 2x1 + x2 + 0.5 defines a dfr = 2 dimensional hyperplane (or ordinary plane)
as shown in Figure 6.3.

[Figure 6.3: Response surface — the plane ŷ = 2x1 + x2 + 0.5 plotted over predictors x1 and x2]
Adjusted Coefficient of Determination R̄²

Define the DoF ratio as

rdf = (dfr + df) / df

SimplerRegression is at one extreme of model complexity, where df = m − 1 and dfr = 1, so rdf = m/(m − 1)
is close to one. For a more complicated model, say with n = m/2, rdf will be close to 2. This ratio can be
used to adjust the Coefficient of Determination R², reducing it as the number of parameters increases. The
result is called the Adjusted Coefficient of Determination R̄².

R̄² = 1 − rdf (1 − R²)

Suppose m = 121, n = 21 and R² = 0.9; as an exercise, show that rdf = 1.2 and R̄² = 0.88.
Dividing sse and ssr by their respective Degrees of Freedom gives the mean square error and the mean
square regression, respectively:

mse = sse / df
msr = ssr / dfr

The mean square error mse follows a Chi-square distribution with df Degrees of Freedom, while the mean
square regression msr follows a Chi-square distribution with dfr Degrees of Freedom. Consequently, the
ratio

msr / mse ∼ F_{dfr, df}    (6.53)

that is, it follows an F-distribution with (dfr, df) Degrees of Freedom. If this number exceeds the critical
value, one can claim that the parameter vector b is not zero, implying the model is useful. More general
quality of fit measures, useful for comparing models, are the Akaike Information Criterion (AIC) and the
Bayesian Information Criterion (BIC).

In ScalaTion, the various Quality of Fit (QoF) measures are computed by the diagnose method in the
Fit class, as described in section 1 of this chapter.
def diagnose (y: VectorD, yp: VectorD, w: VectorD = null)

It looks at different ways to measure the difference between the actual y and predicted yp values for the
response. The differences are optionally weighted by the vector w. Weighting is not applied when w is null.
Now the difficult issue is how to guard against over-fitting. With enough flexibility and parameters to
fit, modeling techniques can push quality measures like R2 to perfection (R2 = 1) by fitting the signal and
the noise in the data. Doing so tends to make a model worse in practice than a simple model that just
captures the signal. That is where quality measures like R̄2 (or AIC) come into play, but computations of
R̄2 require determination of Degrees of Freedom (df ), which may be difficult for some modeling techniques.
Furthermore, the amount of penalty introduced by such quality measures is somewhat arbitrary.
Would it not be better to measure quality in a way in which models that fit noise are downgraded because
they perform more poorly on data they have not seen? Is it really a test if the model has already seen
the data? The answers to these questions are obvious, but the solution to the underlying problem is a bit
tricky. The first thought would be to divide the dataset in half, but then only half of the data are available
for training. Also, picking a different half may result in substantially different quality measures.
This leads to two guiding principles: First, the majority of the data should be used for training. Second,
multiple testing should be done. In general, conducting real-world tests of a model can be difficult. There
are, however, strategies that attempt to approximate such testing. Two simple and commonly used strategies
are the following: Leave-One-Out and Cross-Validation. In both cases, a dataset is divided into a training
set and a test set.
Leave-One-Out

When fitting the parameters b, the more data available in the training set, in all likelihood, the better the
fit. The Leave-One-Out strategy takes this to the extreme by splitting the dataset into a training set of
size m − 1 and a test set of size 1 (e.g., row t in data matrix X). From this, a test error can be computed:
εt = yt − b · xt. This can be repeated by iteratively letting t range from the first to the last row of data matrix
X. For certain predictive analytics techniques such as Multiple Linear Regression, there are efficient ways
to compute the test sse based on the leverage each point in the training set has [85].
k-Fold Cross-Validation

A more generally applicable strategy is called cross-validation, where a dataset is divided into k test sets.
For each test set, the corresponding training set consists of all the instances not chosen for that test set. A
simple way to do this is to let the first test set be the first m/k rows of matrix X, the second be the second
m/k rows, etc.

val tsize = m / k                                          // test set size
for l <- 0 until k do
    x_e = x (l * tsize until (l+1) * tsize)                // l-th test set
    x_  = x.not (l * tsize until (l+1) * tsize)            // l-th training set
end for
The model is trained k times, once per training set. The corresponding test set is then used to
estimate the test sse (or another quality measure such as mse). These are more meaningful out-of-sample
results. From each of these samples, a mean, standard deviation and confidence interval may be computed
for the test sse.

Due to patterns that may exist in the dataset, it is more robust to randomly select each of the test sets.
The row indices may be permuted, giving a random selection that ensures every data instance shows up in
exactly one test set, as sketched below.
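A minimal plain-Scala sketch of this permutation idea (the values of m and k and the variable names are illustrative only):

import scala.util.Random

val m = 20; val k = 5                                      // m instances, k folds
val perm  = Random.shuffle ((0 until m).toVector)          // random permutation of row indices
val folds = perm.grouped (m / k).toVector                  // k disjoint test index sets
for l <- 0 until k do println (s"fold $l test rows = ${folds(l)}")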
Typically, training QoF (in-sample) measures such as R² will be better than testing QoF (out-of-sample)
measures such as R²cv. Adjusted measures such as R̄² are intended to follow R²cv more closely than R² does.
ScalaTion supports cross-validation via its crossValidate method.
@param k      the number of cross-validation iterations/folds (defaults to 5)
@param rando  flag indicating whether to use randomized or simple cross-validation

It also supports a simpler strategy that only tests once, via its validate method defined in the Predictor
trait. It utilizes the Test-n-Train Split TnT_Split from the mathstat package.

@param rando  flag indicating whether to use randomized or simple validation
@param ratio  the ratio of the TESTING set to the full dataset (e.g., 70-30, 80-20)
@param idx    the prescribed TESTING set indices
...
    train (x_, y_)
    val qof = test (x_e, y_e)._2
    if qof (QoF.sst.ordinal) <= 0.0 then
        flaw ("validate", "chosen testing set has no variability")
    end if
    println (FitM.fitMap (qof, QoF.values.map (_.toString)))
    qof
end validate
6.6.11 Collinearity

Consider the matrix-vector equation used for estimating the parameters b via the minimization of ‖ε‖.

y = Xb + ε

The parameter/coefficient vector b = [b0, b1, ..., bk] may be viewed as weights on the column vectors in the
data/predictor matrix X.

y = b0 1 + b1 x:1 + ... + bk x:k + ε

A question arises when two of these column vectors are nearly the same (or more generally, nearly parallel
or anti-parallel). They will affect, and may obfuscate, each other's parameter values.
First, we will examine ways of detecting such problems and then give some remedies. A simple check is to
compute the correlation matrix for the column vectors in matrix X. High (positive or negative) correlation
indicates collinearity.
Example Problem

Consider the following data/input matrix X and response vector y. This is the same example used for
SimpleRegression with a new variable x2 added (i.e., y = b0 + b1 x1 + b2 x2 + ε). The collinearityTest
main function allows one to see the effects of increasing the collinearity of features/variables x1 and x2.

package <your-package>
...
    //                       one  x1  x2
    val x = MatrixD ((4, 3), 1,   1,  1,                   // input/data matrix
                             1,   2,  2,
                             1,   3,  3,
                             1,   4,  0)                   // change 0 by adding 0.5 until it is 4
    ...
    val v = x(?, 0 until 2)
    banner (s"Test without column x2")
    println (s"v = $v")
    var mod = new Regression (v, y)
    mod.trainNtest () ()
    println (mod.summary ())

    for i <- 0 to 8 do
        banner (s"Test Increasing Collinearity: x_32 = ${x(3, 2)}")
        println (s"x = $x")
        println (s"x.corr = ${x.corr}")
        mod = new Regression (x, y)
        mod.trainNtest () ()
        println (mod.summary ())
        x(3, 2) += 0.5
    end for

end collinearityTest
Try changing the value of element x32 from 0 to 4 by 0.5 and observe what happens to the correlation
matrix. What effect do these changes have on the parameter vector b = [b0, b1, b2], and how do the first two
parameters compare to the regression where the last column of X is removed, giving the parameter vector
b = [b0, b1]?

The corr method is provided by the scalation.mathstat.MatrixD class. For this method, if either
column vector has zero variance, it returns 1.0 when the column vectors are the same and -0.0 otherwise
(indicating undefined).

Note, perfect collinearity produces a singular matrix, in which case many factorization algorithms will
give NaN (Not-a-Number) for much of their output. In this case, Fac_SVD (Singular Value Decomposition)
should be used. This can be done by changing the following hyper-parameter provided by the Regression
object before instantiating the Regression class.
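A hedged sketch of such a change; the "factorization" key is assumed from the HyperParameter entry shown in the REPORT output later in this section:

Regression.hp ("factorization") = "Fac_SVD"                // select SVD before building the model
val mod = new Regression (x, y)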
Multi-Collinearity

Even if no particular entry in the correlation matrix is high, a column in the matrix may still be nearly
a linear combination of other columns. This is the problem of multi-collinearity. It can be checked by
computing the Variance Inflation Factor (VIF) function (vif in ScalaTion). For a particular parameter
bj for the variable/predictor xj, the function is evaluated as follows:

vif(bj) = 1 / (1 − R²(xj))    (6.54)

where R²(xj) is the R² for the regression of variable xj onto the rest of the predictors. It measures how well the
variable xj (or its column vector x:j) can be predicted by all xl for l ≠ j. Values above 20 (R²(xj) = 0.95)
are considered by some to be problematic. In particular, the value for parameter bj may be suspect, since
its variance is inflated by vif(bj).

σ̂²(bj) = mse / (k σ̂²(xj)) · vif(bj)    (6.55)
See the exercises for details. Both corr and vif may be tested in ScalaTion using RegressionTest4.
One remedy to reduce collinearity/multi-collinearity is to eliminate the variable with the highest corr/vif
value. Another is to use regularized regression such as RidgeRegression or LassoRegression.
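A small usage sketch of both diagnostics, assuming the corr and vif members referenced above and an already-constructed x and y:

val mod = new Regression (x, y)
mod.trainNtest () ()                                       // train and test the model
println (s"x.corr = ${x.corr}")                            // pairwise column correlations
println (s"vif    = ${mod.vif ()}")                        // variance inflation factor per variable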
Forward Selection

The forwardSel method, coded in the Predictor trait, performs forward selection by adding the most
predictive variable to the existing model, returning the variable to be added and a reference to the new
model with the added variable/feature.

@param cols   the columns of matrix x currently included in the existing model
@param idx_q  index of Quality of Fit (QoF) measure to use for comparing quality

def forwardSel (cols: LinkedHashSet [Int], idx_q: Int = QoF.rSqBar.ordinal): BestStep =

The BestStep case class is used to record the best improvement step found so far.

@param col  the column/variable to ADD/REMOVE for this step
@param qof  the Quality of Fit (QoF) for this step
@param mod  the model including selected features/variables for this step

case class BestStep (col: Int = -1, qof: VectorD = null, mod: Predictor = null)
Selecting the most predictive variable to add boils down to comparing on the basis of a Quality of Fit
(QoF) measure. The default is the Adjusted Coefficient of Determination R̄². The optional argument idx_q
indicates which QoF measure to use (defaults to QoF.rSqBar.ordinal). To start with a minimal model, set
cols = Set (0) for an intercept-only model. The method will consider every variable/column in x.indices2
not already in cols and pick the best one for inclusion.

for j <- x.indices2 if ! (cols contains j) do

To find the best model, the forwardSel method should be called repeatedly while the quality of fit measure
is sufficiently improving. This process is automated in the forwardSelAll method.
is sufficiently improving. This process is automated in the forwardSelAll method.
1 @ param idx_q index of Quality of Fit ( QoF ) to use for comparing quality
2 @ param cross whether to include the cross - validation QoF measure
3
4 def forwardSelAll ( idx_q : Int = QoF . rSqBar . ordinal , cross : Boolean = true ) :
5 ( LinkedHashSet [ Int ] , MatrixD ) =
The forwardSelAll method takes the QoF measure to use as the selection criterion and whether to apply
cross-validation as inputs and returns the best collection of features/columns to include in the model as well
as the QoF measures for all steps.
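A hedged usage sketch based on the signature above:

val (cols, qofSteps) = mod.forwardSelAll ()                // selected columns + QoF per step
println (s"selected columns = $cols")
println (s"QoF per step     = $qofSteps")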
To see how R², R̄², sMAPE, and R²cv change with the number of features/parameters added to the model
by the forwardSelAll method, run the test code provided in the scalation modeling module.
sMAPE, the symmetric Mean Absolute Percentage Error, is explained in detail in the Time Series/Temporal
Models Chapter.
Backward Elimination

The backwardElim method, coded in the Predictor trait, performs backward elimination by removing the
least predictive variable from the existing model, returning the variable to eliminate, the new parameter
vector and a reference to the new model with the removed variable/feature.

@param cols   the columns of matrix x currently included in the existing model
@param idx_q  index of Quality of Fit (QoF) measure to use for comparing quality
@param first  first variable to consider for elimination
               (default (1) assumes intercept x_0 will be in any model)

def backwardElim (cols: LinkedHashSet [Int], idx_q: Int = QoF.rSqBar.ordinal,
                  first: Int = 1): BestStep =

To start with a maximal model, set cols = Set (0, 1, ..., k) for a full model. As with forwardSel,
the idx_q optional argument allows one to choose from among the QoF measures. The last parameter first
provides immunity from elimination for any variable/parameter whose index is less than first (e.g., to ensure
that models include an intercept b0, set first to one). The method will consider every variable/column from
first until x.dim2 that is in cols and pick the worst one for elimination.

for j <- first until x.dim2 if cols contains j do
To find the best model, the backwardElim method should be called repeatedly until the quality of fit measure
sufficiently decreases. This process is automated in the backwardElimAll method.

@param idx_q  index of Quality of Fit (QoF) measure to use for comparing quality
@param first  first variable to consider for elimination
@param cross  whether to include the cross-validation QoF measure

The backwardElimAll method takes as inputs the QoF measure to use as the selection criterion, the index
of the first variable to consider for elimination, and whether to apply cross-validation, and returns the best
collection of features/columns to include in the model as well as the QoF measures for all steps.
Some studies have indicated that backward elimination can outperform forward selection, but it is difficult
to say in general.
More advanced feature selection techniques include using genetic algorithms to find near optimal subsets
of variables as well as techniques that select variables as part of the parameter estimation process, e.g.,
LassoRegression.
Stepwise Regression

An improvement over Forward Selection and Backward Elimination is possible with Stepwise Regression.
It starts with either no variables or just the intercept in the model and adds the one variable that improves
the selection criterion the most. It then adds the second best variable in step two. After the second step, it
determines whether it is better to add or remove a variable. It continues in this fashion until no improvement
in the selection criterion is found, at which point it terminates. Note, for Forward Selection and Backward
Elimination it may be instructive to continue all the way to the end (all variables for forward/no variables for
backward).

Stepwise regression may lead to coincidental relationships being included in the model, particularly if a t-
test is the basis of inclusion or a penalty-free QoF measure such as R² is used. Typically, this approach is used
with a QoF measure that penalizes extra variables/parameters, e.g., the Adjusted R² (R̄²), the cross-validation
R² (R²cv) or the Akaike Information Criterion (AIC). See the section on Maximum Likelihood Estimation for
a definition of AIC. Alternatives to Stepwise Regression include Lasso Regression (ℓ1 regularization) and, to
a lesser extent, Ridge Regression (ℓ2 regularization).
ScalaTion provides the stepRegressionAll method for Stepwise Regression. At each step it calls
forwardSel and backwardElim and chooses the one yielding the better improvement.

@param idx_q  index of Quality of Fit (QoF) measure to use for comparing quality
@param cross  whether to include the cross-validation QoF measure

An option for further improvement is to add a swapping operation, which finds the best variable to remove
and replace with a variable not in the model. Unfortunately, this may lead to a quadratic number of steps in
the worst case (as opposed to linear for forward, backward and stepwise without swapping). See the exercises
for more details.
Categorical Variables/Features

For Regression, the variables/features have so far been treated as continuous or ordinal. However, some
variables may be categorical in nature, where there is no ordering of the values for a categorical variable.
Although one can encode “English”, “French”, “Spanish” as 0, 1, and 2, this may lead to problems, such
as concluding that the average of “English” and “Spanish” is “French”.
In such cases, it may be useful to replace a categorical variable with multiple dummy variables, as illustrated
below. Typically, a categorical variable (column in the data matrix) taking on k distinct values is replaced
with k − 1 dummy variables (columns in the data matrix). For details on how to do this effectively, see the
section on RegressionCat.
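A purely illustrative sketch of k − 1 dummy columns for a k = 3 valued variable (the helper name is hypothetical; ScalaTion's RegressionCat handles the encoding internally):

// map a 3-valued categorical variable to 2 dummy variables
def dummyEncode (lang: String): VectorD = lang match
    case "French"  => VectorD (1, 0)
    case "Spanish" => VectorD (0, 1)
    case _         => VectorD (0, 0)               // "English" serves as the baseline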
The trainNtest method defined in the Predictor trait does several things: it trains the model on x_ and y_,
tests the model on xx and yy, produces a report about training and testing, and optionally plots the actual
and predicted responses.

@param x_  the training/full data/input matrix (defaults to full x)
@param y_  the training/full response/output vector (defaults to full y)
@param xx  the testing/full data/input matrix (defaults to full x)
@param yy  the testing/full response/output vector (defaults to full y)

The report method returns the following basic information: (1) the name of the modeling technique nm, (2)
the values of the hyper-parameters hp (used for controlling the model/optimizer), (3) the feature/predictor
variable names fn, (4) the values of the parameters b, and (5) several Quality of Fit measures qof.
REPORT
----------------------------------------------------------------------------
modelName mn = Regression
----------------------------------------------------------------------------
hparameter hp = HyperParameter(factorization -> (Fac_QR,Fac_QR))
----------------------------------------------------------------------------
features fn = Array(x0, x1, x2, x3)
----------------------------------------------------------------------------
parameter b = VectorD(151.298,-1.99323,-0.000955478,-0.384710)
----------------------------------------------------------------------------
fitMap qof = LinkedHashMap(
rSq -> 0.991921, rSqBar -> 0.989902, sst -> 941.937500, sse -> 7.609494,
mse0 -> 0.475593, rmse -> 0.689633, mae -> 0.531353, dfm -> 3.000000,
df -> 12.000000, fStat -> 491.138015, aic -> -8.757481, bic -> -5.667126,
mape -> 1.095990, smape -> 1.094779, mase -> 0.066419)
The plot below shows the results from running the ScalaTion Regression Model in terms of the actual (y)
vs. predicted (yp) response vectors.

[Figure: Regression Model: y(*) vs. yp(+) — y and yp plotted against the instance index]
More details about the parameters/coefficients, including standard errors, t-values, p-values, and Variance
Inflation Factors (VIFs), are shown by the summary method.

println (mod.summary ())

For the Texas Temperatures dataset it provides the following information: the Estimate is the value assigned
to the parameter for the given Var. The Std. Error, t-value, p-value and VIF are also given.
Given the following assumptions: (1) ε ∼ D(0, σI) for some distribution D, and (2) for each column j, ε and
x:j are independent, the covariance matrix of the parameter vector b is

C[b] = σ² (Xᵀ X)⁻¹    (6.56)
Using the estimate of the error variance

σ̂² = (ε · ε) / df = mse    (6.57)

the standard deviation (or standard error) of the j-th parameter/coefficient may be given as the square root
of the j-th diagonal element of the covariance matrix.

σ̂_bj = σ̂ √[(Xᵀ X)⁻¹]jj    (6.58)

The corresponding t-value is simply the parameter value divided by its standard error, which indicates how
many standard deviation units it is away from zero. The farther away from zero, the more significant (or more
important to the model) the parameter is.

t(bj) = bj / σ̂_bj    (6.59)
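A hedged sketch of eqs. (6.56)-(6.59) computed directly, assuming x, y, b, m and n are given and that matrix-scalar multiplication and element access behave as in the listings elsewhere in this chapter:

val e    = y - x * b                                       // residual vector
val mse_ = (e dot e) / (m - n)                             // error variance estimate (df = m - n)
val covB = (x.T * x).inverse * mse_                        // covariance matrix of b (eq. 6.56)
for j <- 0 until n do
    val se_j = math.sqrt (covB(j, j))                      // standard error of b_j (eq. 6.58)
    println (s"t(b_$j) = ${b(j) / se_j}")                  // t-value (eq. 6.59)
end for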
When the error distribution is Normal, then t(bj) follows the Student's t Distribution. For example, the pdf
for the Student's t Distribution with df = ν = 2 Degrees of Freedom is shown in the figure below (the t
Distribution approaches the Normal Distribution as ν increases).

[Figure: pdf f_y(y) of the Student's t Distribution with ν = 2, plotted for y in [−3, 3]]
The corresponding p-value P(|y| > t) measures how significant the t-value is, e.g., for ν = df = 12,

F_y(−1.683018) = 0.0590926
P(|y| > 1.683018) = 2 F_y(−1.683018) = 0.118185

Typically, the t-value is only considered significant if it is in the tails of the Student's t distribution. The
farther out in the tails, the less likely it is for the parameter to be non-zero (and hence part of the model)
simply by chance. The p-value measures the risk (chance of being wrong) in including parameter bj, and
therefore variable xj, in the model.
The predict Method

Finally, given a new data vector z, the predict method may be used to predict its response value.

val z = VectorD (1.0, 30.0, 1000.0, 100.0)
println (s"predict (z) = ${mod.predict (z)}")
Feature Selection

Feature selection (or Variable Selection) may be carried out by using either forwardSel or backwardElim.
These methods add or remove one variable at a time. To iteratively add or remove variables, the following
methods may be called.

mod.forwardSelAll (cross = false)
mod.backwardElimAll (cross = false)
mod.stepRegressionAll (cross = false)

The default criterion for choosing which variable to add/remove is the Adjusted R². It may be changed via the
idx_q parameter to these methods (see the Fit trait for the possible values for this parameter). Note:
cross-validation is turned off (cross = false) due to the small size of the dataset.
The source code for the Texas Temperatures example is a test case in Regression.scala.
Class Methods:

@param x       the data/input m-by-n matrix
                (augment with a first column of ones to include an intercept in the model)
@param y       the response/output m-vector
@param fname_  the feature/variable names (defaults to null)
@param hparam  the hyper-parameters (defaults to Regression.hp)
6.6.15 Exercises

1. For Exercise 1 from the last section, compute A = Xᵀ X and z = Xᵀ y. Now solve the following linear
system of equations for b:

A b = z
2. Gradient descent can be used for Multiple Linear Regression as well. For gradient descent, pick a
starting point b0, compute the gradient of the loss function ∇L and move −η∇L from b0, where η is
the learning rate. Write a Scala program that repeats this for several iterations for the above data.
What is happening to the value of the loss function L?

∇L = −Xᵀ (y − Xb) = −Xᵀ ε
Starting with data matrix x, response vector y and parameter vector b, in ScalaTion the calculations
become

val yp = x * b                                             // y predicted
val e  = y - yp                                            // error
val g  = x.T * e                                           // negative gradient
b     += g * eta                                           // update parameter b
val h  = 0.5 * (e dot e)                                   // half the sum of squared errors

Unless the dataset is normalized, finding an appropriate learning rate eta may be difficult. See the
MatrixTransform object for details. Do this for the Blood Pressure Example_BPressure dataset. Try
using another dataset.
3. Consider the relationships between the predictor variables and the response variable in the AutoMPG
dataset. This is a well-known dataset that is available at multiple websites, including the UCI Machine
Learning Repository http://archive.ics.uci.edu/ml/datasets/Auto+MPG. The response variable
is the miles per gallon (mpg: continuous), while the predictor variables are cylinders: multi-valued
discrete, displacement: continuous, horsepower: continuous, weight: continuous, acceleration:
continuous, model year: multi-valued discrete, origin: multi-valued discrete, and car name: string
(unique for each instance). Since the car name is unique and obviously not causal, this variable is
eliminated, leaving seven predictor variables. First compute the correlations between mpg (vector y)
and the seven predictor variables (each column vector x:j in matrix X),

val correlation = y corr x_j

and then plot mpg versus each of the predictor variables. The source code for this example is at
http://www.cs.uga.edu/~jam/scalation_2.0/src/main/scala/scalation/modeling/Example_AutoMPG.scala.

Alternatively, a .csv file containing the AutoMPG dataset may be read into a relation called auto_tab,
from which the data matrix x and response vector y may be produced. If the dataset has missing values,
they may be replaced using a spreadsheet or using the techniques discussed in the Data Preprocessing
Chapter.

val auto_tab = Relation (BASE_DIR + "auto-mpg.csv", "auto_mpg", null, -1)
val (x, y) = auto_tab.toMatrixDD (1 to 6, 0)
println (s"x = $x"); println (s"y = $y")
4. Apply Regression analysis on the AutoMPG dataset. Compare with the results of applying the NullModel,
SimplerRegression and SimpleRegression. Try using SimplerRegression and SimpleRegression
with different predictor variables for these models. How do their R² values compare to the correlation
analysis done in the previous exercise?
5. Examine the collinearity and multi-collinearity of the column vectors in the AutoMPG dataset.
6. For the AutoMPG dataset, repeatedly call the backwardElim method to remove the predictor variable
that contributes the least to the model. Show how the various quality of fit (QoF) measures change as
variables are eliminated. Do the same for the forwardSel method. Using R̄2 , select the best models
from the forward and backward approaches. Are they the same?
7. Compare model assessment and model validation. Compute sse, mse and R2 for the full and best
AutoMPG models trained on the entire data set. Compare this with the results of Leave-One-Out,
5-fold Cross-Validation and 10-fold Cross-Validation.
8. Recall from the Multi-Collinearity subsection the formula for the inflated variance of parameter bj:

σ̂²(bj) = mse / (k σ̂²(xj)) · vif(bj)

Derive this formula. The standard error is the square root of this value. Use the estimate for bj and
its standard error to compute a t-value and p-value for the estimate. Run the AutoMPG model and
explain these values produced by the summary method.
9. Singular Value Decomposition Technique. In cases where the rank of the data/input matrix X
is not full or its multi-collinearity is high, a useful technique to solve for the parameters of the model is
Singular Value Decomposition (SVD). Based on the derivation given in http://www.ime.unicamp.br/
~marianar/MI602/material%20extra/svd-regression-analysis.pdf, we start with the equation
estimating y as the product of the data matrix X and the parameter vector b.

y = X b

The matrix X may be factored as

X = U Σ Vᵀ

where, in the full-rank case, U is an m-by-n orthogonal matrix, Σ is an n-by-n diagonal matrix of singular
values, and Vᵀ is an n-by-n orthogonal matrix. The rank r = rank(X) equals the number of nonzero singular
values in Σ, so in general, U is m-by-r, Σ is r-by-r, and Vᵀ is r-by-n. The singular values are the
square roots of the nonzero eigenvalues of Xᵀ X. Substituting for X yields

y = U Σ Vᵀ b

Defining d = Σ Vᵀ b, we may write

y = U d

This can be viewed as an estimating equation where X is replaced with U and b is replaced with d.
Consequently, a least squares solution for the alternate parameter vector d is given by

d = (Uᵀ U)⁻¹ Uᵀ y

Since Uᵀ U = I, this reduces to

d = Uᵀ y

Finally, solving d = Σ Vᵀ b for b yields

b = V Σ⁻¹ d

where Σ⁻¹ is a diagonal matrix whose elements on the main diagonal are the reciprocals of the singular
values.
10. Improve Stepwise Regression. Write ScalaTion code to improve the stepRegressionAll method
by implementing the swapping operation. Then redo Exercise 6 using all three: Forward Selection,
Backward Elimination, and Stepwise Regression, with all four criteria: R², R̄², R²cv, and AIC. Plot the
curve for each criterion, determine the best number of variables and what these variables are. Compare
the four criteria.
As part of a larger project, compare this form of feature selection with that provided by Ridge Regression
and Lasso Regression. See the next two sections.
Now add features including quadratic terms, cubic terms, and dummy variables to the model using
SymbolicRegression.quadratic, SymbolicRegression.cubic, and RegressionCat. See the subse-
quent sections.
In addition to the AutoMPG dataset, use the Concrete dataset and three more datasets from the UCI
Machine Learning Repository. The UCI datasets should have more instances (m) and variables (n)
than the first two datasets. The testing should also be done in R or Python.
11. Regression as Projection. Consider the following six vectors/points in 3D space where the response
variable y is modeled as a linear function of predictor variables x1 and x2.

//                        x1  x2  y
val xy = MatrixD ((6, 3), 1,  1,  2.8,
                          1,  2,  4.2,
                          1,  3,  4.8,
                          2,  1,  5.3,
                          2,  2,  5.5,
                          2,  3,  6.5)

[Figure: the six data points plotted in 3D — y against x1 and x2]

y = b0 x1 + b1 x2 + ε

Determine the plane (response surface) that these six points are projected onto.

ŷ = b0 x1 + b1 x2

For this problem, the number of instances m = 6 and the number of parameters/predictor variables
n = 2. Determine the number of Degrees of Freedom for the model dfm and the number of Degrees of
Freedom for the residuals/errors df.
12. Given a data matrix X ∈ ℝ^{m×2} and response vector y ∈ ℝᵐ, where X = [1, x], compute Xᵀ X and
Xᵀ y. Use these to set up an augmented matrix and then apply LU Factorization to make it upper
triangular. Solve for the parameters b0 and b1 symbolically. Simplify to reproduce the formulas for b0
and b1 for Simple Regression.
13. Recall that ŷ = H y where the hat matrix is H = X (Xᵀ X)⁻¹ Xᵀ. The leverage of point i is defined to be
hii.

hii = xiᵀ (Xᵀ X)⁻¹ xi    (6.61)

The main diagonal of the hat matrix gives the leverage for each of the points. Points with high leverage
are those above a threshold such as

hii ≥ 2 tr(H) / m    (6.62)

Note that the trace tr(H) = rank(H) = rank(X) will equal n when X has full rank. List the high
leverage points for the Example_AutoMPG dataset.
14. Points that are influential in determining values for model coefficients/parameters combine high lever-
age with large residuals. Measures of influence include Cook's Distance, DFFITS, and DFBETAS
[34]; see http://home.iitk.ac.in/~shalab/regression/Chapter6-Regression-Diagnostic%
20for%20Leverage%20and%20Influence.pdf. These measures can also be useful in detecting po-
tential outliers. Compute these measures for the Example_AutoMPG dataset.
15. The best two predictor variables for AutoMPG are weight and model year, and with the weight given
in units of 1000 pounds, the prediction equation for the Regression model (with intercept) defines the
hyperplane shown below.

[Figure: fitted hyperplane — y (mpg) over x1 (weight) and x2 (model year)]

Make a plot of the hyperplane for the second best combination of features. Compare the QoF of these
two models and explain how the feature combinations affect the response variable (mpg).
16. State and explain the conditions required for the Ordinary Least Squares (OLS) estimate of the
parameter vector b for multiple linear regression to be B.L.U.E., which stands for Best Linear Unbiased
Estimator. See the Gauss-Markov Theorem.
6.7 Ridge Regression

The RidgeRegression class supports multiple linear ridge regression. As with Regression, the predictor
variables x are multi-dimensional [x1, ..., xk], as are the parameters b = [b1, ..., bk]. Ridge regression adds
a penalty based on the ℓ2 norm of the parameters b to reduce the chance of them taking on large values
that may lead to less robust models.

The penalty holds down the values of the parameters, and this may result in several advantages: (1)
better out-of-sample (e.g., cross-validation) quality of fit, (2) reduced impact of multi-collinearity, (3) turning
singular matrices non-singular, and, to a limited extent, (4) eliminating features/predictor variables from the
model.

The penalty is not to be included on the intercept parameter b0, as this would shift predictions in a way
that would adversely affect the quality of the model. See the exercise on scale invariance.

6.7.1 Model

y = b · x + ε = b1 x1 + ... + bk xk + ε    (6.63)

where ε represents the residuals (the part not explained by the model).
6.7.2 Training

Centering the dataset (X, y) has the following effects: First, when the X matrix is centered, the intercept
b0 = µy. Second, when y is centered as well, µy becomes zero, implying b0 = 0. To rescale back to the original
response values, µy can be added back during prediction. Therefore, both the data/input matrix X and the
response/output vector y should be centered (zero mean).

The regularization of the model adds an ℓ2-penalty on the parameters b. The objective function to
minimize is now the loss function L(b) = ½ sse plus the ℓ2-penalty.

f_obj = L(b) + (λ/2) ‖b‖² = ½ ε · ε + (λ/2) b · b    (6.66)

where λ is the shrinkage parameter. A large value for λ will drive the parameters b toward zero, while even a
small value can help stabilize the model (e.g., for nearly singular matrices or high multi-collinearity).

f_obj = ½ (y − Xb) · (y − Xb) + (λ/2) b · b    (6.67)
6.7.3 Optimization

Fortunately, the quadratic nature of the penalty function allows it to be combined easily with the quadratic
error terms, so that matrix factorization can still be used for finding optimal values for the parameters.
Taking the gradient of the objective function f_obj with respect to b and then setting it equal to zero
yields

−Xᵀ (y − Xb) + λ b = 0    (6.68)

Recall that the first term of the gradient was derived in the Regression section. See the exercises below for
deriving the last term of the gradient. Multiplying out gives

−Xᵀ y + (Xᵀ X) b + λ b = 0
(Xᵀ X) b + λ b = Xᵀ y
(Xᵀ X + λI) b = Xᵀ y    (6.69)

Matrix factorization may now be used to solve for the parameters b in the modified Normal Equations. For
example, use of matrix inversion yields

b = (Xᵀ X + λI)⁻¹ Xᵀ y    (6.70)

For Cholesky factorization, one may compute Xᵀ X and simply add λ to each of the diagonal elements (i.e.,
along the ridge). QR and SVD factorizations require similar, but slightly more complicated, modifications.
Note, use of SVD can improve the efficiency of searching for an optimal value for λ [71, 196].
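A direct sketch of eqs. (6.69)-(6.70); the identity-matrix constructor MatrixD.eye and matrix-scalar multiplication are assumptions, while .inverse was shown earlier in this chapter:

val lambda = 0.01                                          // shrinkage hyper-parameter
val a = x.T * x + MatrixD.eye (x.dim2, x.dim2) * lambda    // X^T X + lambda I
val b = a.inverse * (x.T * y)                              // solve the modified Normal Equations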
6.7.4 Centering

Before creating a RidgeRegression model, the X data matrix and the y response vector should be centered.
This is accomplished by subtracting the means (a vector of column means for X and a mean value for y).

val mu_x = x.mean                                          // column-wise mean of x
val mu_y = y.mean                                          // mean of y
val x_c  = x - mu_x                                        // centered x (column-wise)
val y_c  = y - mu_y                                        // centered y

The centered matrix x_c and centered vector y_c are then passed into the RidgeRegression constructor.

val mod = new RidgeRegression (x_c, y_c)
mod.trainNtest ()

Now, when making predictions, the new data vector z needs to be centered by subtracting mu_x. Then the
predict method is called, after which the mean of y is added back.

val z_c = z - mu_x                                         // center z first
yp = mod.predict (z_c) + mu_y                              // predict z_c and add y's mean
println (s"predict (z) = $yp")
6.7.5 The λ Hyper-parameter

The value for λ can be user-specified (typically a small value) or chosen by a method like findLambda. It
finds a roughly optimal value for the shrinkage parameter λ based on the cross-validated sum of squared
errors sse_cv. The search starts with the low default value for λ and then doubles it with each iteration,
returning the minimizing λ and its corresponding cross-validated sse. A more precise search could be used
to provide a better value for λ.

def findLambda: (Double, Double) =
    var l      = lambda                                    // start with a small default value
    var l_best = l
    var sse    = Double.MaxValue
    for i <- 0 to 20 do
        RidgeRegression.hp ("lambda") = l
        val mod   = new RidgeRegression (x, y)
        val stats = mod.crossValidate ()
        val sse2  = stats (QoF.sse.ordinal).mean
        banner (s"RidgeRegression with lambda = ${mod.lambda_} has sse = $sse2")
        if sse2 < sse then { sse = sse2; l_best = l }
        l *= 2
    end for
    (l_best, sse)                                          // best lambda and its sse_cv
end findLambda
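A hedged usage sketch of this search followed by a refit at the chosen value:

val (l_best, sse_cv) = mod.findLambda                      // best lambda and its cross-validated sse
println (s"best lambda = $l_best with sse_cv = $sse_cv")
RidgeRegression.hp ("lambda") = l_best                     // refit with the chosen shrinkage value
val mod2 = new RidgeRegression (x_c, y_c)
mod2.trainNtest ()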
Third, predict a value for new input vector z using each model.
banner ("Make Predictions")
val z   = VectorD (20.0, 80.0)                             // new instance to predict
val _1z = VectorD.++ (1.0, z)                              // prepend 1 to z
val z_c = z - mu_x                                         // center z
println (s"rg.predict (z)  = ${rg.predict (_1z)}")         // predict using _1z
println (s"mod.predict (z) = ${mod.predict (z_c) + mu_y}") // predict using z_c and add y's mean
SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -281.426985 835.349154 -0.336897 0.768262 NA
x1 -7.611030 8.722908 -0.872534 0.474922 3.653976
x2 19.010291 8.423716 2.256758 0.152633 3.653976
SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -7.611271 8.722908 -0.872561 0.474910 NA
x1 19.009947 8.423716 2.256717 0.152638 3.653976
Notice there is very little difference between the two models. Try increasing the value of the shrinkage
hyper-parameter λ beyond its default value of 0.01. This example can be run as follows:
$ sbt
sbt> runMain scalation.modeling.ridgeRegressionTest
Automatic Centering

ScalaTion provides factory methods, apply and center, in the RidgeRegression companion object that
center the data for the user.

// val mod = RidgeRegression (xy, fname)                   // apply takes a combined matrix xy
val mod = RidgeRegression.center (x, y, fname)             // center takes a matrix x and vector y
mod.trainNtest () ()
val yp = mod.predict (z - x.mean) + y.mean

The user must still center any vectors passed into the predict method and add back the response mean at
the end, e.g., pass z - x.mean and add back y.mean.
Note, care should be taken regarding x.mean and y.mean when performing validation or cross-validation.
The means for the full, training and testing sets may differ.
Class Methods:

@param x       the centered data/input m-by-n matrix NOT augmented with a column of 1s
@param y       the centered response/output m-vector
@param fname_  the feature/variable names (defaults to null)
@param hparam  the shrinkage hyper-parameter, lambda (0 => OLS) in the penalty
               term 'lambda * b dot b'
6.7.8 Exercises
1. Based on the example given in this section, try increasing the value of the hyper-parameter λ and
examine its effect on the parameter vector b, the quality of fit and predictions made.
import RidgeRegression.hp

Alternatively,

hp ("lambda") = 1.0
See the HyperParameter class in the scalation.modeling package for details.
2. For the AutoMPG dataset, use the findLambda method to find a value for λ that roughly minimizes
out-of-sample sse_cv based on using the crossValidate method. Plot sse_cv vs. λ.
3. Why is it important to center (zero mean) both the data matrix X and the response vector y? What
is scale invariance and how does it relate to centering the data?
4. The Degrees of Freedom (DoF) used in ScalaTion's RidgeRegression class is approximate. As the
shrinkage parameter λ increases, the effective DoF (eDoF) should be used instead. A general definition
of effective DoF is the trace tr of the hat matrix H = X(X^T X + λI)^{-1} X^T:

    dfm_eff = tr(H)

Read [86] and explain the difference between DoF and effective DoF (eDoF) for Ridge Regression.
5. A matrix that is close to singularity is said to be ill-conditioned. The condition number κ of a matrix
A (e.g., A = X^T X) is defined as follows:

    κ = ‖A‖ ‖A^{-1}‖ ≥ 1

When κ becomes large, the matrix is considered to be ill-conditioned. In such cases, it is recommended
to use QR or SVD Factorization for Least-Squares Regression [39]. Compute the condition number
of X^T X for various datasets.
6. For the last term of the gradient of the objective function, show that

    ∂/∂b_j [ (λ/2) b · b ] = ∂/∂b_j [ (λ/2) Σ_i b_i^2 ] = λ b_j

Put these together to show that ∇ (λ/2) b · b = λ b.
7. For over-parameterized (or under-determined) regression where n > m (the number of parameters exceeds
the number of instances), it is common to seek a min-norm solution.

    b = X^T (X X^T)^{-1} y
8. Compare different algorithms for finding a suitable value for the shrinkage parameter λ.
Hint: see Lecture Notes on Ridge Regression - https://arxiv.org/pdf/1509.09169.pdf - [196].
6.8 Lasso Regression
The LassoRegression class supports multiple linear regression using the Least absolute shrinkage and
selection operator (Lasso) that constrains the values of the b parameters and effectively sets those with low
impact to zero (thereby deselecting such variables/features). Rather than using an ℓ2-penalty (Euclidean
norm) like RidgeRegression, it uses an ℓ1-penalty (Manhattan norm). In RidgeRegression, when b_j
approaches zero, b_j^2 becomes very small and has little effect on the penalty. For LassoRegression, the
effect based on |b_j| will be larger, so it is more likely to set parameters to zero. See section 6.2.2 in [85]
for a more detailed explanation of how LassoRegression can eliminate a variable/feature by setting its
parameter/coefficient to zero.
    y = b · x + ε = b_0 + b_1 x_1 + ... + b_k x_k + ε        (6.71)

where ε represents the residuals (the part not explained by the model). See the exercise that considers
whether to include the intercept b_0 in the shrinkage.
6.8.2 Training
The regularization of the model adds an ℓ1-penalty on the parameters b. The objective function to minimize
is now the loss function L(b) = (1/2) sse plus the penalty.

    f_obj = (1/2) sse + λ ‖b‖_1 = (1/2) ‖ε‖_2^2 + λ ‖b‖_1        (6.72)

where λ is the shrinkage parameter. Substituting ε = y − Xb yields,

    f_obj = (1/2) ‖y − Xb‖_2^2 + λ ‖b‖_1        (6.73)

Replacing the norms with dot products gives,

    f_obj = (1/2) (y − Xb) · (y − Xb) + λ 1 · |b|        (6.74)
Although similar to the ℓ2-penalty used in Ridge Regression, it may often be more effective. Still, the
ℓ1-penalty for Lasso has the disadvantage that the absolute values in the ℓ1 norm make the objective function
non-differentiable.

    λ 1 · |b| = λ Σ_{j=0}^{k} |b_j|        (6.75)

Therefore, the straightforward strategy of setting the gradient equal to zero to develop appropriate modified
Normal Equations that allow the parameters to be determined by matrix factorization no longer works.
Instead, the objective function needs to be minimized using a search-based optimization algorithm.
6.8.3 Optimization Strategies
There are multiple optimization algorithms that can be applied for parameter estimation in Lasso Regression.
Coordinate Descent
Coordinate Descent attempts to optimize one variable/feature at a time (repeated one-dimensional
optimization). For normalized data, the following algorithm has been shown to work:
https://xavierbourretsicotte.github.io/lasso_implementation.html. A sketch is given below.
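The following is a rough sketch of coordinate descent for Lasso, written against MatrixD/VectorD operations used elsewhere in this book. It is only an illustration under the assumption that each column of x is normalized so that x_j dot x_j = 1; it is not ScalaTion's implementation, which uses ADMM as described next.

import scalation.mathstat._

// soft thresholding: shrink v toward zero by kappa, clipping to zero within kappa
def softThr (kappa: Double, v: Double): Double =
    math.signum (v) * math.max (math.abs (v) - kappa, 0.0)

// repeated one-dimensional optimization, one parameter b(j) at a time
def lassoCD (x: MatrixD, y: VectorD, lambda: Double, maxIter: Int = 100): VectorD =
    val b = new VectorD (x.dim2)                           // parameter vector, starts at zero
    for it <- 1 to maxIter; j <- b.indices do
        val r = y - x * b + x(?, j) * b(j)                 // partial residual excluding term j
        b(j)  = softThr (lambda, x(?, j) dot r)            // soft-threshold the 1D optimum
    end for
    b
end lassoCD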
ScalaTion uses the Alternating Direction Method of Multipliers (ADMM) [22] algorithm to optimize the b
parameter vector. The algorithm for using ADMM for Lasso Regression is outlined in section 6.4 of Boyd [22]
(https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf). Optimization problems in ADMM
form separate the objective function into two parts f and g.
For Lasso Regression, the f function will capture the loss function ((1/2) sse), while the g function will capture
the ℓ1 regularization, i.e.,

    f(b) = (1/2) ‖y − Xb‖_2^2,    g(z) = λ ‖z‖_1        (6.77)
Introducing z allows the functions to be separated, while the constraint b − z = 0 keeps z and b close. Therefore, the
iterative step in the ADMM optimization algorithm becomes

    b = (X^T X + ρI)^{-1} (X^T y + ρ(z − u))
    z = S_{λ/ρ} (b + u)
    u = u + b − z

where u is the vector of Lagrange multipliers and S_{λ/ρ} is the soft thresholding function.
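For reference, the soft thresholding function has the standard definition (as in Boyd [22]):

    S_κ(v) = sign(v) max(|v| − κ, 0)

so each component of b + u is shrunk toward zero by λ/ρ, and components within λ/ρ of zero are set exactly to zero, which is how Lasso deselects features.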
The findLambda method for LassoRegression parallels the one shown for RidgeRegression:

def findLambda: (Double, Double) =
    var l      = lambda                          // start with a small default value
    var l_best = l
    var sse    = Double.MaxValue
    for i <- 0 to 20 do
        LassoRegression.hp ("lambda") = l
        val mod   = new LassoRegression (x, y)
        val stats = mod.crossValidate ()
        val sse2  = stats (QoF.sse.ordinal).mean
        banner (s"LassoRegression with lambda = ${mod.lambda_} has sse = $sse2")
        if sse2 < sse then
            sse = sse2; l_best = l
        end if
        Fit.showQofStatTable (stats)
        l *= 2
    end for
    (l_best, sse)
end findLambda
As the default value for the shrinkage/penalty parameter λ is very small, the optimal solution will be
close to the Ordinary Least Squares (OLS) solution, shown in green at b = [b1, b2] = [3, 1] in Figure 6.5.
Increasing the penalty parameter will pull the optimal b towards the origin. At any given point in the plane,
the objective function is the sum of the loss function L(b) and the penalty function p(b). The contours in
blue show points of equal height for the penalty function, while those in black show the same for the loss
function. Suppose for some λ the point [2, 0] is this penalized optimum. This would mean that moving
toward the origin would be non-productive, as the increase in the loss would exceed the drop in the penalty.
On the other hand, moving toward [3, 1] would be non-productive, as the increase in the penalty would exceed
the drop in the loss. Notice in this case that the penalty has pulled the b2 parameter to zero (an example of
feature selection). Ridge regression will be less likely to pull a parameter to zero, as its contours are circles
rather than diamonds. Lasso regression's contours have sharp points on the axes, which increases the
chance of intersecting a loss contour on an axis.
[Figure 6.5: contours of the loss function (black) and the ℓ1 penalty (blue) in the (b1, b2) plane]
6.8.5 Regularized and Robust Regression
Regularized and Robust Regression are useful in many cases, including high-dimensional data, correlated
data, non-normal data and data with outliers [70]. These techniques work by adding ℓ1 and/or ℓ2 penalty
terms to shrink the parameters and/or changing from an ℓ2 to an ℓ1 loss function. Modeling techniques include
Ridge, Lasso, Elastic Nets, Least Absolute Deviation (LAD) and Adaptive LAD [70].
Class Methods:

@param x       the data/input m-by-n matrix
@param y       the response/output m-vector
@param fname_  the feature/variable names (defaults to null)
@param hparam  the shrinkage hyper-parameter, lambda (0 => OLS) in the penalty
               term 'lambda * b dot b'
6.8.7 Exercises
1. Compare the results of LassoRegression with those of Regression and RidgeRegression. Examine
the parameter vectors, quality of fit and predictions made.
// 5 data points:        one   x_0    x_1
val x = MatrixD ((5, 3), 1.0, 36.0,  66.0,       // 5-by-3 matrix
                         1.0, 37.0,  68.0,
                         1.0, 47.0,  64.0,
                         1.0, 32.0,  53.0,
                         1.0,  1.0, 101.0)
val y = VectorD (745.0, 895.0, 442.0, 440.0, 1598.0)
val z = VectorD (1.0, 20.0, 80.0)

// Create a LassoRegression model
2. Based on the last exercise, try increasing the value of the hyper-parameter λ and examine its effect on
the parameter vector b, the quality of fit and predictions made.
import LassoRegression.hp
3. Using the above dataset and the AutoMPG dataset, determine the effects of (a) centering the data
(µ = 0), (b) standardizing the data (µ = 0, σ = 1).
import MatrixTransforms._
4. Explain how the Coordinate Descent Optimization Algorithm works for Lasso Regression. See
https://xavierbourretsicotte.github.io/lasso_implementation.html.
5. Explain how the ADMM Optimization Algorithm works for Lasso Regression. See
https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf.
6. Compare LassoRegression with Regression that uses forward selection or backward elimination
for feature selection. What are the advantages and disadvantages of each approach to feature selection?
7. Compare LassoRegression with Regression on the AutoMPG dataset. Specifically, compare the
quality of fit measures as well as how well feature selection works.
8. Show that the contour curves for the Simple Regression loss function L(b0, b1) are ellipses. The general
equation of an ellipse centered at (h, k) is

    (x − h)^2 / a^2 + (y − k)^2 / b^2 = 1
9. Elastic Nets combine both ℓ2 and ℓ1 penalties to try to combine the best features of both RidgeRegression
and LassoRegression. Elastic Nets naturally include two shrinkage parameters, λ1 and λ2. Is the
additional complexity worth the benefits?
10. Regularization using Lasso has the nice property of being able to force parameters/coefficients to zero,
but this may require a large shrinkage hyper-parameter λ that shrinks non-zero coefficients more than
desired. Newer regularization techniques reduce the shrinkage effect compared to Lasso, by having a
penalty profile that matches Lasso for small coefficients, but is below Lasso for large coefficient values.
Make a plot of the penalty profiles for Lasso, Smoothly Clipped Absolute Deviations (SCAD) and
Minimax Concave Penalty (MCP).
6.8.8 Further Reading
1. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
[22]
6.9 Quadratic Regression
The quadratic method in the SymbolicRegression object adds quadratic terms into the model. It can
often be the case that the response variable y has a nonlinear relationship with one or more of the predictor
variables x_j. The simplest such nonlinear relationship is quadratic. Looking at a plot of y vs.
x_j, it may be evident that a bending curve will fit the data much better than a straight line. For example,
a particle under constant acceleration has a position that changes quadratically with time.
When there is only one predictor variable x, the response y is modeled as a quadratic function of x
(forming a parabola).

    y = b_0 + b_1 x + b_2 x^2 + ε        (6.79)
The quadratic method achieves this simply by expanding the data matrix: each column in the dataset
(initial data matrix) has another column added that contains the values of the original column
squared. It is important that the initial data matrix has no intercept. The expansion will optionally
add an intercept column (column of all ones). Since 1^2 = 1, a ones column and its square would be perfectly
collinear and make the matrix singular, if the user were to include one.

    y = b · x' + ε

where x' = [1, x1, x2, x1^2, x2^2], b = [b0, b1, b2, b3, b4], and ε represents the residuals (the part not explained by
the model).

The number of terms (nt) in the model increases linearly with the dimensionality of the space (n)
according to the following formula:

    nt = 2n + 1        e.g., nt = 5 for n = 2

Each column in the initial data matrix is expanded into two in the expanded data matrix and an intercept
column is optionally added.
val (x, y) = (xy.not (?, 1), xy (?, 1))          // x is the first column, y is the last column
val ox = VectorD.one (xy.dim) +^: x              // prepend a column of all ones
Now compare their summary results. The summary results for the Regression model are shown below:
SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 -13.285714 5.154583 -2.577457 0.041913 NA
x1 8.285714 1.020760 8.117205 0.000188 1.000000
The summary results for the SymbolicRegression.quadratic model are given here:
SUMMARY
Parameters/Coefficients:
Var Estimate Std. Error t value Pr(>|t|) VIF
----------------------------------------------------------------------------------
x0 4.035714 3.873763 1.041807 0.345231 NA
x1 -2.107143 1.975007 -1.066904 0.334798 21.250000
x2 1.154762 0.214220 5.390553 0.002965 21.250000
The summary results for the SymbolicRegression.quadratic model highlight a couple of important
issues: the p-value for the linear term x1 is large (an insignificant parameter) and the VIF values are high
(multi-collinearity between x1 and its square x2).
Try eliminating x1 to see if these two improve without much of a drop in Adjusted R-squared R̄². Note,
eliminating x1 makes the model non-hierarchical (see the exercises). Figure 6.6 shows the predictions (yp)
of the Regression and quadratic models.
Figure 6.6: Actual y (red) vs. Regression yp (green) vs. quadratic yp (blue)
The quadratic method in the SymbolicRegression object creates a Regression object that uses mul-
tiple regression to fit a quadratic surface to the data.
Method:

@param x          the initial data/input m-by-n matrix (before quadratic term expansion)
                  must not include an intercept column of all ones
@param y          the response/output m-vector
@param fname      the feature/variable names (defaults to null)
@param intercept  whether to include the intercept term (column of ones) _1 (defaults to true)
@param cross      whether to include 2-way cross/interaction terms x_i x_j (defaults to false)
@param hparam     the hyper-parameters (defaults to Regression.hp)
The apply method is defined in the SymbolicRegression object. The Set (1, 2) specifies that first
(Linear) and second (Quadratic) order terms will be included in the model. The intercept flag indicates
whether a column of ones will be added to the input/data matrix, for example:
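The following is a minimal sketch of creating a quadratic model this way (x, y and x_fname are assumed to be in scope; the quadratic convenience method wraps a similar call):

val mod = SymbolicRegression (x, y, x_fname, Set (1, 2))     // powers {1, 2}: linear + quadratic
mod.trainNtest ()()
println (mod.summary ())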
The next few modeling techniques described in subsequent sections support the development of low-order
multi-dimensional polynomial regression models. Higher order polynomial regression models are typically
restricted to one-dimensional problems (see the PolyRegression class).
Model Equation
In two dimensions (2D) where x = [x1, x2], the quadratic cross model/regression equation is the following:

    y = b_0 + b_1 x1 + b_2 x2 + b_3 x1^2 + b_4 x2^2 + b_5 x1 x2 + ε
The number of terms (nt) in the model increases quadratically with the dimensionality of the space (n)
according to the formula for triangular numbers shifted by (n → n + 1).

    nt = (n+2 choose 2) = (n + 2)(n + 1) / 2        e.g., nt = 6 for n = 2        (6.83)

This result may be derived by summing the number of constant terms (1), linear terms (n), quadratic terms
(n), and cross terms (n choose 2).
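Carrying out this sum as a quick check:

    1 + n + n + n(n − 1)/2 = (2 + 4n + n^2 − n) / 2 = (n^2 + 3n + 2) / 2 = (n + 1)(n + 2) / 2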
Such models generalize quadratic by introducing cross terms, e.g., x1 x2. Adding cross terms makes the
number of terms increase quadratically rather than linearly with the dimensionality. Consequently, multi-
collinearity problems (check the VIF scores) may be intensified, and the need for feature selection therefore
increases.
    y = f(x1, x2) + ε        (6.84)
For example, a model with two predictor variables and one response variable may be displayed in three
dimensions. Such a response surface can also be shown in two dimensions using contour plots, where a
contour/curve shows points of equal height. Figure 6.7 shows three types of contours that represent the
types of terms in quadratic regression: (1) linear terms, (2) quadratic terms, and (3) cross terms. In the
figure, the first green line is for x1 + x2 = 4, the first blue curve is for x1^2 + x2^2 = 16, and the first red curve
is for x1 x2 = 4.
[Figure 6.7: contours in the (x1, x2) plane for linear (green), quadratic (blue) and cross (red) terms]
A constant term simply moves the whole response surface up or down. The coefficients for each of the terms can
rotate and stretch these curves.
The response surface for Quadratic Regression on AutoMPG based on the best combination of features,
weight and modelyear, is shown in Figure 6.8.
[Figure 6.8: response surface for Quadratic Regression on AutoMPG with x1 = weight and x2 = modelyear]
6.9.6 Exercises
1. Enter the x, y dataset from the example given in this section and use it to create a quadratic model.
Show the expanded input/data matrix and the response vector using the following two print statements.

val qrg = SymbolicRegression.quadratic (x, y)
println (s"expanded x = ${qrg.getX}")
println (s"y = ${qrg.getY}")
2. Perform Quadratic Regression on the Example_BPressure dataset using the first two columns of its
data matrix x.

import Example_BPressure.{x01 => x, y}
3. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest R̄²?
4. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * m * m).

for i <- x.indices do
    x(i, 0) = i
    y(i)    = i * i + i + noise.gen
end for

Compare the results of Regression vs. quadratic. Compare the Quality of Fit and the parameter
values. What correspondence do the parameters have with the coefficients used to generate the data?
Plot y vs. x, yp and y vs. t for both Regression and quadratic. Also plot the residuals e vs. x for
both. Note, t is the index vector VectorD.range (0, m).
5. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.

var k = 0
for i <- grid; j <- grid do
    x(k) = VectorD (i, j)
    y(k) = x(k, 0)~^2 + 2 * x(k, 1) + noise.gen
    k   += 1
end for
Compare the results of Regression vs. quadratic. Try modifying the equation for the response and
see how the Quality of Fit changes.
6. The quadratic model, as well as its more complex cousin cubic, may have issues with high
multi-collinearity, i.e., high VIF values. Although high VIF values may not be a problem for prediction
accuracy, they can make interpretation and inferencing difficult. For the problem given in this
section, rather than adding x^2 to the existing Regression model, find a second order polynomial that
could be added without causing high VIF values. VIF values are lowest when column vectors are
orthogonal. See the section on Polynomial Regression for more details.
7. Extrapolation far from the training data can be risky for many types of models. Show how having
higher order polynomial terms in the model can increase this risk.
8. A polynomial regression model is said to be hierarchical [143, 167, 127] if it contains all terms up to
x^k, e.g., a model with x, x^2, x^3 is hierarchical, while a model with x, x^3 is not. Show that hierarchical
models are invariant under linear transformations.
Hint: Consider the following two models where x is the distance on I-70 West in miles from the center
of Denver (junction with I-25) and y is the elevation in miles above sea level.

    ŷ = b_0 + b_1 x + b_2 x^2
    ŷ = b_0 + b_2 x^2

The first model is hierarchical, while the second is not. A second study is conducted, but now the
distance z is from the junction of I-70 and I-76. A linear transformation can be used to resolve the
problem.

    x = z + 7

Putting z into the second model (assuming the first study indicated a linear term is not needed) gives,

    ŷ = b_0 + b_2 (z + 7)^2 = (b_0 + 49 b_2) + 14 b_2 z + b_2 z^2
9. Perform quadratic and quadratic (with cross terms) regression on the Example_BPressure dataset
using the first two columns of its data matrix x.

import Example_BPressure.{x01 => x, y}
10. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest R̄²?
11. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.

var k = 0
for i <- grid; j <- grid do
    x(k) = VectorD (i, j)
    y(k) = x(k, 0)~^2 + 2 * x(k, 1) + x(k, 0) * x(k, 1) + noise.gen
    k   += 1
end for
Compare the results of Regression, quadratic with cross = false, and quadratic with cross =
true.
12. Prove that the number of terms for a quadratic function f(x) in n dimensions is (n+2 choose 2), by decomposing
the function into its quadratic (both squared and cross), linear and constant terms,

    f(x) = x^T A x + b^T x + c

where A is an n-by-n matrix, b is an n-dimensional column vector and c is a scalar. Hint: A is
symmetric, but the main diagonal is not repeated, and we are looking for unique terms (e.g., x1 x2 and
x2 x1 are treated as the same). Note, when n = 1, A and b become scalars, yielding the usual quadratic
function ax^2 + bx + c.
6.10 Cubic Regression
The cubic method in the SymbolicRegression object adds cubic terms in addition to the quadratic terms
added by the quadratic method. Linear terms in a model allow for slopes and quadratic terms allow for
curvature. If the curvature changes substantially or there is an inflection point (curvature changes sign), then
cubic terms may be useful. For example, before the inflection point the curve/surface may be concave upward,
while after the point it may be concave downward, e.g., a car stops accelerating and starts decelerating.
When there is only one predictor variable x, the response y is modeled as a cubic function of x.

    y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + ε        (6.85)
The number of terms (nt) in the model still increases quadratically with the dimensionality of the space
(n), according to the formula for triangular numbers shifted by (n → n + 1) plus n for the cubic terms.

    nt = (n+2 choose 2) + n = (n + 2)(n + 1) / 2 + n        e.g., nt = 8 for n = 2        (6.87)

When n = 10, the number of terms and corresponding parameters is nt = 76, whereas for Regression,
quadratic, and quadratic with cross terms and order 2, it would be 11, 21 and 66, respectively. Issues related
to negative Degrees of Freedom, over-fitting and multi-collinearity will need careful attention.
val rg = new Regression (ox, y)                  // create a regression model
rg.trainNtest ()()                               // train and test the model
println (rg.summary ())                          // show the summary
Figure 6.9 shows the predictions (yp) of the Regression, quadratic and cubic models.
Figure 6.9: Actual y (red) vs. Regression (green) vs. quadratic (blue) vs. cubic (black)
Notice the quadratic curve follows the linear curve (line), while the cubic curve more closely follows the data.
Class Methods:

@param x          the initial data/input m-by-n matrix (before quadratic term expansion)
                  must not include an intercept column of all ones
@param y          the response/output m-vector
@param fname      the feature/variable names (defaults to null)
@param intercept  whether to include the intercept term (column of ones) _1 (defaults to true)
@param cross      whether to include 2-way cross/interaction terms x_i x_j (defaults to false)
@param cross3     whether to include 3-way cross/interaction terms x_i x_j x_k (defaults to false)
@param hparam     the hyper-parameters (defaults to Regression.hp)
The Set (1, 2, 3) specifies that first (Linear), second (Quadratic), and third (Cubic) order terms will
be included in the model. The intercept flag indicates whether a column of ones will be added to the
input/data matrix.
Model Equation
In two dimensions (2D) where x = [x1, x2], the cubic model/regression equation with cross terms is the
following:

    y = b · x' + ε

where

    x' = [1, x1, x2, x1^2, x2^2, x1^3, x2^3, x1 x2, x1^2 x2, x1 x2^2]    (expanded input vector)
    b  = [b0, b1, b2, b3, b4, b5, b6, b7, b8, b9]                        (parameter/coefficient vector)
    ε  = y − b · x'                                                      (error/residual)
Naturally, the number of terms in the model increases cubically with the dimensionality of the space (n),
according to the formula for tetrahedral numbers shifted by (n → n + 1).

    nt = (n+3 choose 3) = (n + 3)(n + 2)(n + 1) / 6        e.g., nt = 10 for n = 2        (6.90)

When n = 10, the number of terms and corresponding parameters is nt = 286, whereas for Regression,
quadratic, quadratic with cross, and cubic with both crosses and order 2, it would be 11, 21, 66 and 76,
respectively. Issues related to negative Degrees of Freedom, over-fitting and multi-collinearity will need even
more careful attention.
If polynomials of higher degree are needed, ScalaTion provides a couple of means to deal with this. First,
when the data matrix consists of a single column and x is one dimensional, the PolyRegression class may
be used. If one or two variables need higher degree terms, the caller may add these columns themselves as
additional columns in the data matrix input into the Regression class. The SymbolicRegression object
described in the next section allows the user to try many functional forms.
Quadratic and Cubic Regression may fail, producing Not-a-Number (NaN) results, when a dataset contains
one or more categorical variables. For example, a variable like citizen with values "no" and "yes" is likely to be
encoded 0, 1. If such a column is squared or cubed, the new column will be identical to the original column, so
the two will be perfectly collinear. One solution is not to expand such columns. If one must, then a different
encoding may be used, e.g., 1, 2. See the section on RegressionCat for more details, and the small
illustration below.
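A tiny sketch of the collinearity problem (assuming VectorD supports an element-wise map, as used with TranRegression later in this chapter):

val citizen  = VectorD (0, 1, 1, 0)              // 0/1 encoded categorical column
val citizen2 = citizen.map (v => v * v)          // element-wise square: VectorD (0, 1, 1, 0)
// citizen2 equals citizen, so the expanded matrix has two identical (collinear) columns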
6.10.5 Exercises
1. Generate and compare the model summaries produced by the three models (Regression, quadratic
and cubic) applied to the dataset given in this section.
2. An inflection point occurs when the second derivative changes sign. Find the inflection point in the
following cubic equation:
Plot the cubic function to illustrate. Explain why there are no inflection points for quadratic models.
3. Many laws in science involve quadratic and cubic terms as well as the inverses of these terms (e.g.,
inverse square laws). Find such a law and an open dataset to test the law.
4. Perform Cubic and Cubic with cross terms Regression on the Example_BPressure dataset using the
first two columns of its data matrix x.

import Example_BPressure.{x01 => x, y}
5. Perform both forward selection and backward elimination to find out which of the terms have the most
impact on predicting the response. Which feature selection approach (forward selection or backward
elimination) finds a model with the highest R̄²?
6. Generate a dataset with data matrix x and response vector y using the following loop where noise =
new Normal (0, 10 * s * s) and grid = 1 to s.

var k = 0
for i <- grid; j <- grid do
    x(k) = VectorD (i, j)
    y(k) = x(k, 0)~^2 + 2 * x(k, 1) + x(k, 0) * x(k, 1) + noise.gen
    k   += 1
end for
Compare the results of Regression, quadratic with cross = false, quadratic with cross = true,
cubic with cross = false, cubic with cross = true, and cubic with cross = true, cross3 = true.
Try modifying the equation for the response and see how the Quality of Fit changes.
6.11 Symbolic Regression
The last two sections covered Quadratic and Cubic Regression, but there are many possible functional forms.
For example, in physics, force often decreases with distance following an inverse square law. Newton's Law
of Universal Gravitation states that masses m1 and m2 with center of mass positions p1 and p2 (with
distance r = ‖p2 − p1‖) will attract each other with force f,

    f = G m1 m2 / r^2        (6.91)

where the gravitational constant G = 6.67408 · 10^{-11} m^3 kg^{-1} s^{-2}.
Recast as a regression model (m1 → x0, m2 → x1, r → x2), this becomes

    y = b_0 x0 x1 x2^{-2} + ε        (6.92)

Given a four column dataset [x0, x1, x2, y], a Symbolic Regression could be run to estimate a more general
model that includes all possible terms with powers x_j^{-2}, x_j^{-1}, x_j^{1}, x_j^{2}. It could also include cross (two-way
interaction) terms between all these terms. In this case, it is necessary to add cross3 (three-way interaction)
terms. An intercept would imply force with no masses involved, so it should be left out of the model.
It is easier to collect data where the Earth is used for mass 1 and mass 2 is for people at various distances
from the center of the Earth (m1 → x0, r → x1, f → y).

    y = b_0 x0 x1^{-2} + ε        (6.93)
In this case the parameter b0 will correspond to GM , where G is the Gravitational Constant and M is the
Mass of the Earth. The following code provides simulated data and uses symbolic regression to determine
the Gravitational Constant.
val noise = Normal (0, 10)                       // random noise
val rad   = Uniform (6370, 7000)                 // distance from center of the Earth in km
val mas   = Uniform (50, 150)                    // mass of person
The statement val mod = SymbolicRegression (...) invokes the factory method called apply in the
SymbolicRegression object. The SymbolicRegression object provides methods for quadratic, cubic, and
more general symbolic regression.
Object Methods:

object SymbolicRegression:
    ...
end SymbolicRegression
The apply method is flexible enough to include many functional forms as terms in a model. Feature
selection can be used to eliminate many of the terms to produce a meaningful and interpretable model.
Note, unless measurements are precise and experiments are controlled, other terms besides the one given by
Newton's Law of Universal Gravitation are likely to be selected.
@param x          the initial data/input m-by-n matrix (before expansion)
                  must not include an intercept column of all ones
@param y          the response/output m-vector
@param fname      the feature/variable names (defaults to null)
@param powers     the set of powers to raise matrix x to (defaults to null)
@param intercept  whether to include the intercept term (column of ones) _1 (defaults to true)
@param cross      whether to include 2-way cross/interaction terms x_i x_j (defaults to true)
@param cross3     whether to include 3-way cross/interaction terms x_i x_j x_k (defaults to false)
@param hparam     the hyper-parameters (defaults to Regression.hp)
@param terms      custom terms to add into the model, e.g.,
                  Array ((0, 1.0), (1, -2.0)) adds x0 x1^(-2)
where type Xj2p = (Int, Double) indicates raising column Xj to the p-th power.
1. The powers set takes each column in matrix X and raises it to the p-th power for every p ∈ powers.
The expression X~^p produces a matrix with all columns raised to the p-th power. For example, Set (1,
2, 0.5) will add the original columns, quadratic columns, and square root columns.
2. The intercept flag indicates whether an intercept (column of ones) is to be added to the model.
Again, such a column must not be included in the original matrix.
3. The cross flag indicates whether two-way cross/interaction terms of the form x_i x_j (for i ≠ j) are to
be added to the model.
4. The cross3 flag indicates whether three-way cross/interaction terms of the form x_i x_j x_k (for i, j, k not
all the same) are to be added to the model.
5. The terms (repeated) array allows custom terms to be added into the model (see the sketch below). For example,

Array ((0, 1.0), (1, -2.0))
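Putting these options together, a call might look as follows. This is only a hedged sketch using the parameter order from the documentation above; the exact signature in ScalaTion may differ.

val mod = SymbolicRegression (x, y, fname,
                              Set (1.0, 2.0, 0.5),           // powers: original, squared, sqrt columns
                              false,                         // intercept: leave out the ones column
                              true,                          // cross:  add 2-way terms x_i x_j
                              false,                         // cross3: skip 3-way terms
                              Regression.hp,                 // default hyper-parameters
                              Array ((0, 1.0), (1, -2.0)))   // custom term x0 * x1^(-2)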
Much of the functionality to do this is supplied by the MatrixD class in the mathstat package. The operator
++^ concatenates two matrices column-wise, while the operator x~^p returns a new matrix where each of the
columns in the original matrix is raised to the p-th power. The crossAll method returns a new matrix
consisting of columns that multiply each column by every other column. The crossAll3 method returns a
new matrix consisting of columns that multiply each column by all combinations of two other columns.
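A small sketch of these operators (operator names ++^, ~^ and crossAll are taken from the text above; this is an illustration, not a test of the exact API):

val x  = MatrixD ((3, 2), 1, 2,
                          3, 4,
                          5, 6)
val x2 = x ~^ 2                                  // each column raised to the 2nd power
val xx = x ++^ x2                                // columns of x followed by columns of x2
val xc = x.crossAll                              // products of each pair of distinct columns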
buildMatrix Method
The bulk of the work is done by the buildMatrix method that creates the input data matrix, column by
column.
def buildMatrix (x: MatrixD, fname: Array [String],
                 powers: Set [Double], intercept: Boolean,
                 cross: Boolean, cross3: Boolean,
                 terms: Array [Xj2p]*): (MatrixD, Array [String]) =
    val _1     = VectorD.one (x.dim)             // one vector
    var xx     = new MatrixD (x.dim, 0)          // start empty
    var fname_ = Array [String] ()
    ...
    if cross then
        xx      = xx ++^ x.crossAll              // add 2-way cross terms x_i x_j
        fname_ ++= crossNames (fname)
    end if

    if cross3 then
        xx      = xx ++^ x.crossAll3             // add 3-way cross terms x_i x_j x_k
        fname_ ++= crossNames3 (fname)
    end if

    if intercept then
        xx     = _1 +^: xx                       // add the intercept term (_1)
        fname_ = Array ("one") ++ fname_
    end if
    ...
6.11.5 Regularization
Due to the fact that symbolic regression may introduce many terms into the model and exhibit high
multi-collinearity, regularization becomes even more important.
Symbolic Ridge Regression can be beneficial in dealing with multi-collinearity. The SymRidgeRegression
object supports the same methods as SymbolicRegression, except buildMatrix, which it reuses.

object SymRidgeRegression:
    ...
Symbolic Lasso Regression
Other forms of regularization can be useful as well. Symbolic Lasso Regression can be beneficial in dealing
with multi-collinearity and, more importantly, by setting some parameters/coefficients b_j to zero, thereby
eliminating the j-th term. This is particularly important for symbolic regression, as the number of possible
terms can become very large.

object SymLassoRegression:
    ...
6.11.6 Exercises
1. Exploratory Data Analysis Revisited. For each predictor variable xj in the Example AutoMPG
dataset, determine the best power to raise that column to. Plot y and yp versus xj for SimpleRegression.
Compare this to the plot of y and yp versus xj for SymbolicRegression using the best power.
2. Combine all the best powers together to form a model matrix with the same number of columns as
the original AutoMPG matrix and compare SymbolicRegression with Regression on the original
matrix.
3. Use forward, backward and stepwise regression to look for a better (than the last exercise) combination
of features for the AutoMPG dataset.
4. Redo the last exercise using SymRidgeRegression. Note any differences.
6. When, for example, quadratic terms are added to the expanded matrix, explain why it will not
work to simply center (by subtracting the column means) the original data matrix X.
7. Compare the effectiveness of the following two search strategies that are used in Symbolic Regression:
(a) Genetic Algorithms and (b) FFX Algorithm.
8. Present a review of a paper that discusses how Symbolic Regression has been used to reproduce a
theory in a scientific discipline.
6.12 Transformed Regression
The TranRegression class supports transformed multiple linear regression and hence, the predictor vector
x is multi-dimensional [1, x1, ..., xk]. In certain cases, the relationship between the response scalar y and the
predictor vector x is not linear. There are many possible functional relationships that could apply [144], but
five obvious choices are the following:
1. The response grows exponentially versus a linear combination of the predictor variables.
2. The response grows quadratically versus a linear combination of the predictor variables.
3. The response grows as the square root of a linear combination of the predictor variables.
4. The response grows logarithmically versus a linear combination of the predictor variables.
5. The response grows inversely (as the reciprocal) versus a linear combination of the predictor variables.
This capability can be easily implemented by introducing a transform (transformation function) into Regression.
The transformation function and its inverse are passed into the TranRegression class, which extends the
Regression class.
The transform f_t (tran) and its inverse transform f_a = f_t^{-1} (itran) for the five cases are as follows:

    (log, exp), (log1p, expm1), (sqrt, sq), (sq, sqrt), (exp, log), (recip, recip)

The second pair is the first pair shifted by one (log1p(x) = log(1 + x) and expm1(x) = exp(x) − 1) to better
handle cases where x is very small. These and other common functions can be found in the scala.math
package or the scalation package defined in CommonFunctions.scala.
Transformed Regression models extend the reach of linear models while maintaining their simplicity and
highly efficient parameter estimation techniques. Beyond these models lie Generalized Linear Models and
Nonlinear Models.
Regression (x, y.map (tran), fname_, hparam)
The inverse transform f_a (itran) is then applied in the overridden predict method.

override def predict (z: VectorD): Double = itran (b dot z)
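A minimal usage sketch (assuming a data matrix x, response vector y and new instance z are in scope; log/exp are the defaults per the class signature in Section 6.12.8):

val mod = new TranRegression (x, y)              // tran = log, itran = exp by default
mod.trainNtest ()()
val yp = mod.predict (z)                         // itran rescales back to y's original scale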
6.12.2 Training
Using several data samples as a training set (X, y), a loss function L(b) can be minimized to find an optimal
solution for the parameter vector b.
The training diagram shown in Figure 6.10 illustrates how the i-th instance/row flows through the diagram,
computing the transformed response z = f_t(y), the predicted transformed response ẑ = b · x and the
transformed error e = z − ẑ.
[Figure 6.10: training diagram for Transformed Regression: inputs x_i0, x_i1, x_i2 are weighted by b to give ẑ = b · x, the response is transformed to z = f_t(y_i), and the error is e = z − ẑ]
    L(b) = (1/2) ‖e‖^2 = (1/2) ‖z − ẑ‖^2 = (1/2) ‖f_t(y) − Xb‖^2        (6.98)

where the transformed error vector e = z − ẑ, z = f_t(y), and ẑ = Xb. Note, f_t : R^m → R^m is the vectorized
version of f_t. See the exercises for a loss function based on the actual (untransformed) errors.
Taking the gradient of the loss function and setting it to zero,

    ∇L(b) = X^T [f_t(y) − Xb] = 0        (6.99)

    (X^T X) b = X^T f_t(y)        (6.100)
6.12.3 Square Root Transformation
The square root transformation takes the square root of the response variable and regresses it onto a linear
model.

    f_t(y) = √y = b · x + ε        (6.101)
As an example in 1D, the TranRegression class is compared to the quadratic method and the Regression
class.
    y  = b_0 + b_1 x + ε              (regression)
    y  = b_0 + b_1 x + b_2 x^2 + ε    (quadratic regression)
    √y = b_0 + b_1 x + ε              (sqrt transformed regression)
The ScalaTion code shown below compares these three models. To run this code, be sure to add the
appropriate package statement, depending on where this code is placed.
import scala.math.sqrt
import scalation._
import scalation.mathstat._
import scalation.modeling._
...
// 8 data points:         x   y
val xy = MatrixD ((8, 2), 1,  2,                 // 8-by-2 combined matrix
                          2,  5,
                          3, 10,
                          4, 15,
                          5, 20,
                          6, 30,
                          7, 50,
                          8, 60)
val x_fname  = Array ("x")                       // names of features for x
val ox_fname = Array ("_1", "x")                 // names of features for ox
...
println (qrg.summary ())                         // parameter/coefficient stats
val yp2 = qrg.predict (qrg.getX)                 // y predicted for Quadratic
...
end tranRegressionTest
Figure 6.11 shows the predictions (yp) of the Regression, quadratic and TranRegression models.
Figure 6.11: Actual y (red) vs. Regression (green) vs. quadratic (blue) vs. TranRegression (black)
Notice that the sqrt transformed regression model closely follows the quadratic regression model, yet has one
fewer parameter. The square root transformation can model quadratic effects and stabilize error variance,
but it makes interpretation of coefficients less direct. See the exercises for a comparison.
Imagine a system where the rate of change of the response variable y with the predictor variable x (e.g.,
time) is proportional to its current value y and is y0 when x = 0.

    dy/dx = g y

This differential equation can be solved by direct integration to obtain

    ∫ dy/y = ∫ g dx

As the integral of 1/y is ln(y), integrating both sides gives

    ln(y) = g x + C

Solving for the constant gives C = ln(y0), and then taking the exp function of both sides produces (ignoring
noise/error)

    ln(y) = g x + ln(y0)
    y = y0 e^{gx}
When the growth factor g is positive, the system exhibits exponential growth; when it is negative, it
exhibits exponential decay. So far we have ignored noise. For previous modeling techniques, we have assumed
that noise is additive and typically normally distributed. For phenomena exhibiting exponential growth or
decay, this may not be the case. When the error is multiplicative, we may collect it into the exponent.

    y = y0 e^{gx + ε}

Now applying a log transformation will yield

    log(y) = log(y0) + g x + ε = b_0 + b_1 x + ε
An alternative to using TranRegression is to use Exponential Regression ExpRegression, a form of
Nonlinear Regression Model (see the exercises for a comparison). The derivation above can be checked with
simulated data, as in the sketch below.
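This sketch uses made-up values y0 = 2 and g = 0.3; Normal is ScalaTion's random normal distribution as used earlier in this chapter:

val m = 50
val noise = Normal (0, 0.1)                      // multiplicative noise in the exponent
val (y0, g) = (2.0, 0.3)                         // made-up intercept and growth factor
val x = new MatrixD (m, 2)
val y = new VectorD (m)
for i <- 0 until m do
    x(i) = VectorD (1, i)                        // ones column and x value
    y(i) = y0 * math.exp (g * i + noise.gen)
end for
val mod = new TranRegression (x, y)              // defaults: tran = log, itran = exp
mod.trainNtest ()()
println (mod.summary ())                         // expect b(0) ~ ln(y0) and b(1) ~ g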
    f_t(y) = y^{-1} = b · x + ε        (6.103)
It may be the case that rates, such as Miles-Per-Gallon (MPG), are inversely related to other quantities.
This can be tested using the AutoMPG dataset as follows:
import scala.math._
import scalation.mathstat._
import scalation.modeling._
import scalation.modeling.Example_AutoMPG._
import scalation.modeling.TranRegression._
...
// val f = (sqrt, sq, "sqrt")
// val f = (sq, sqrt, "sq")
// val f = (exp, log, "exp")
// TranRegression.setLambda (0.2); val f = (box_cox, cox_box, "box_cox")
Be sure to try all the commented out transformations as well as various values for λ for the Box-Cox
transformation (see the next subsection).
    f_t(y) = (y^λ − 1) / λ        (6.104)

where λ ≠ 0 determines the power function on y, e.g., 0.5 for sqrt and 2.0 for sq. The following simplified
version of the form given by Box and Cox may be used as well.

    f_t(y) = y^λ        (6.105)

The inverse transform is f_a(y) = y^{1/λ}. When λ = 0, the log transformation is used. See the exercises for a
more detailed treatment.
One way to find suitable values for λ (lambda) is to perform a grid search over the interval [−3, 3], possibly
as large as [−5, 5], as in the sketch below.

TranRegression (x, y, lambda)
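A simple grid-search sketch, patterned after the findLambda methods earlier in this chapter (crossValidate and QoF.sse.ordinal as used there; their availability on TranRegression is an assumption, since it extends Regression):

var (l_best, sse_best) = (1.0, Double.MaxValue)
for k <- -6 to 6 if k != 0 do
    val lambda = 0.5 * k                                   // grid over [-3, 3] in steps of 0.5
    val mod    = TranRegression (x, y, lambda)             // factory call shown above
    val stats  = mod.crossValidate ()
    val sse    = stats (QoF.sse.ordinal).mean
    if sse < sse_best then { sse_best = sse; l_best = lambda }
end for
println (s"best lambda = $l_best with sse_cv = $sse_best")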
6.12.8 TranRegression Class
Class Methods:

@param x       the data/input matrix
@param y       the response/output vector
@param fname_  the feature/variable names (defaults to null)
@param hparam  the hyper-parameters (defaults to Regression.hp)
@param tran    the transformation function (defaults to log)
@param itran   the inverse transformation function to rescale predictions
               to the original y scale (defaults to exp)

class TranRegression (x: MatrixD, y: VectorD, fname_ : Array [String] = null,
                      hparam: HyperParameter = Regression.hp,
                      tran: FunctionS2S = log, itran: FunctionS2S = exp)
      extends Regression (x, y.map (tran), fname_, hparam):
6.12.9 Exercises
1. Use the following code to generate a dataset. You will need to import from scalation.math.sq and
scalation.random.

val cap = 30
val rng = 0 until cap
val (m, n) = (cap * cap, 3)
val err = Normal (0, cap)
val x   = new MatrixD (m, n)
val y   = new VectorD (m)
for i <- rng; j <- rng do x(cap * i + j) = VectorD (1, i, j)
for k <- y.indices do y(k) = sq (10 + 2 * x(k, 1) + err.gen)

As an alternative, try

for k <- y.indices do y(k) = sq (10 + 2 * x(k, 1) + 0.3 * x(k, 2) + err.gen)
Notice that it uses a linear model inside and takes the square for the response variable y. Use
Regression to create a predictive model. Ideally, the model should approximately recapture the
equations used to generate the data. What correspondence do the parameters b have to these equa-
tions? Next, examine the relationship between the response y and predicted response yp, as well as
the residuals (or remaining error) from the model.
val reg = new Regression (x, y)
reg.trainNtest ()()
println (reg.summary ())
2. Transform the response y to a transformed response y2 that is the square root of the former.

val y2 = y.map (sqrt)
Redo the regression as before, but now using the transformed response y2, i.e., new Regression (x,
y2). Compute and plot the corresponding y2 versus yp2 and then e2 vectors. What do the residuals
look like now? How can predictions be made on the original scale?
3. Now transform yp2 to yp3 in order to match the actual response y, by using the inverse transformation
function sq. Now, compute and plot the corresponding y versus yp3 and then e3 vectors. How well
does yp3 predict the original response y? Compute the Coefficient of Determination R². What is the
difference between the residuals e2 and e3? Finally, use PlotM to compare Regression vs. Transformed
Regression.

val ys2 = MatrixD (y2, yp2)
val ys3 = MatrixD (y, yp3, yp)
new PlotM (null, ys2.T, null, "Transformed")
new PlotM (null, ys3.T, null, "Tran-back")
4. The TranRegression class provides direct support for making transformations. Compare the quality
of fit resulting from Regression versus TranRegression.

banner ("Regression")
val rg = new Regression (x, y)
rg.trainNtest ()()
println (rg.summary ())
6. Compare SimpleRegression, TranRegression and ExpRegression on the beer foam dataset www.tf.
uni-kiel.de/matwis/amat/iss/kap_2/articles/beer_article.pdf. The last two are similar, but
TranRegression assumes multiplicative noise, while ExpRegression assumes additive noise, so they
produce different predictions. Plot and compare the three predictions.

val x1 = VectorD (0, 15, 30, 45, 60, 75, 90, 105, 120, 150, 180,
                  210, 240, 300, 360)
val y  = VectorD (14.0, 12.1, 10.9, 10.0, 9.3, 8.6, 8.0, 7.5,
                  7.0, 6.2, 5.5, 4.5, 3.5, 2.0, 0.9)
val _1 = VectorD.one (x1.dim)
val x  = MatrixD (_1, x1)
7. Compare the following loss function based on actual (untransformed) errors with the one based on
transformed errors e.

    L(b) = (1/2) ‖ε‖^2 = (1/2) ‖y − ŷ‖^2 = (1/2) ‖y − f_a(Xb)‖^2
8. A general form of transformation functions [164] was given by Tukey in 1957.

    f_t(y) = y^λ              λ ≠ 0        (6.106)
    f_t(y) = log y            λ = 0        (6.107)

    f_t(y) = (y^λ − 1) / λ    λ ≠ 0        (6.108)
    f_t(y) = log y            λ = 0        (6.109)

Show the former equations have a discontinuity at λ = 0, while the latter do not. Hint: Take the limit
as λ → 0 and use L'Hospital's Rule.
9. Explain the advantage of the following transformation for λ ≠ 0, proposed by Bickel and Doksum in
1981.

    f_t(y) = (|y|^λ sign(y) − 1) / λ        (6.110)
6.13 Regression with Categorical Variables
An ANalysis of COVAriance (ANCOVA) model may be developed using the RegressionCat class. This
type of model comes into play when input variables are mixed, i.e., some are (i) continuous/ordinal, while
others are (ii) categorical/binary. The main difference between the two types of variables is that type (i) variables
define the notion of less than (<), while variables of type (ii) do not. Also, the expected value means much
less for type (ii) variables, e.g., what is the expected value of English, French and Spanish? If we encode a
language variable x_j as 0, 1 or 2 for English, French and Spanish, respectively, and half of a group speaks
English with the rest speaking Spanish, then the expected value would be French. Worse, if the encoding
changes, so does the expected value.
In the binary case, when a variable x_j may take on only two distinct values, e.g., Red or Black, it may
simply be encoded as 0 for Red and 1 for Black. Therefore, a single zero-one encoded/dummy variable x_j
can be used to distinguish the two cases. For example, when x_j ∈ {Red, Black}, it would be replaced by
one encoded/dummy variable, x_j0, as shown in Table 6.11.
Categorical Variables
For the more general categorical case, when the number of distinct values for a variable x_j is greater than two,
simply encoding the j-th column may not be ideal. Instead, multiple dummy variables should be used. The
number of dummy variables required is one less than the number of distinct values n_dv. In one-hot encoding,
the number of dummy variables may be equal to n_dv; however, this will produce a singular expanded
data matrix X, i.e., perfect multi-collinearity (see the exercises).
First, the categorical variable x_j may be encoded using integer values as follows:

    encoded x_j = 0, 1, ..., n_dv − 1
Next, for categorical variable x_j, create n_dv − 1 dummy variables {x_jk | k = 0, ..., n_dv − 2} and use the
following loop to set the value for each dummy variable.

for k <- 0 until n_dv - 1 do x_jk = is (x_j == k+1)

where 1{c} is the indicator function that returns 1 when the condition c evaluates to true and 0 otherwise
(the is function in ScalaTion). In this way, x_j ∈ {English, French, German, Spanish} would be replaced
by three dummy variables, x_j0, x_j1 and x_j2, as shown in Table 6.11. A minimal sketch of this encoding
is given below.
Table 6.9: Conventional Dummy Encoding of a Categorical Variable
Unfortunately, for the conventional encoding of a categorical variable, a dummy variable column will
be identical to its square, which will result in a singular matrix for quadratic regression. One solution is
to exclude dummy variables from the column expansion done by quadratic. Alternatively, a more robust
encoding such as the one given in Table 6.10 may be used.
map2Int Method
Conversion from strings to an integer encoding can be accomplished using the map2Int method in the
VectorS class within the scalation.mathstat package. It converts a VectorS into a VectorI by mapping
each distinct value in VectorS into a distinct numeric integer value, returning the new vector and the
bidirectional mapping, e.g., VectorS ("A", "B", "C", "A", "D") will be mapped to VectorI (0, 1, 2,
0, 3). Use the from method in BiMap to recover the original string.
def map2Int: (VectorI, BiMap [String, Int]) =
    val map   = new BiMap [String, Int] ()
    var count = 0
    for i <- indices if ! (map contains (v(i))) do
        map   += v(i) -> count
        count += 1
    end for
    val vec = VectorI (for i <- indices yield map (v(i)))
    (vec, map)
end map2Int
The vector of encoded integers vec can be made into a matrix using MatrixI (vec). To produce the
dummy variable columns the dummyVars function within the RegressionCat companion object may be
called. See the first exercise for an example.
Multi-column expansion may be done by the caller in cases where there are few categorical variables, by
expanding the input data matrix before passing it to the Regression class. The expansion occurs automatically
when the RegressionCat class is called. This class performs the expansion and then delegates the
work to the Regression class.
Before continuing the discussion of the RegressionCat class, a restricted form is briefly discussed.
6.13.2 ANOVA
An ANalysis Of VAriance (ANOVA) model may be developed using the ANOVA1 class. This type of model
comes into play when all input/predictor variables are categorical/binary. One-way Analysis of Variance
is framed in ScalaTion using General Linear Model notation and supports the use of one
binary/categorical treatment variable t. For example, the treatment variable t could indicate the type of
fertilizer applied to a field.
The ANOVA1 class in ScalaTion only supports one categorical variable, so in general, x consists of n_dv − 1
dummy variables d_k for k ∈ {1, ..., n_dv − 1}.

    y = b · x + ε = b_0 + b_1 d_1 + ... + b_l d_l + ε        (6.111)

where l = n_dv − 1 and ε represents the residuals (the part not explained by the model). The dummy variables
are binary and are used to determine the level/type of a categorical variable. See http://psych.colorado.
edu/~carey/Courses/PSYC5741/handouts/GLM%20Theory.pdf.
In ScalaTion, the ANOVA1 class is implemented using regular multiple linear regression. A data/input
matrix X is built from columns corresponding to levels/types for the treatment vector t. As with multiple
linear regression, the y vector holds the response values. Multi-way Analysis of Variance may be performed
using the more general RegressionCat class.
The dummy variables are binary (or shifted binary) and are used to determine the level of a categorical
variable. See http://www.ams.sunysb.edu/~zhu/ams57213/Team3.pptx.
In general, there may be multiple categorical variables and an expansion will be done for each such
variable. Then the data for continuous variable are collected into matrix X and the values for the categorical
variables are collected into matrix T .
In ScalaTion, RegressionCat is implemented using regular multiple linear regression. An augmented
data/input matrix is built from X, corresponding to the continuous variables, with additional columns
corresponding to the multiple levels for the columns in the treatment matrix T. As with multiple linear regression,
the y vector holds the response values.
6.13.4 RegressionCat Class
Class Methods:

@param x_      the data/input matrix of continuous variables
@param t       the treatment/categorical variable matrix
@param y       the response/output vector
@param fname_  the feature/variable names (defaults to null)
@param hparam  the hyper-parameters (defaults to Regression.hp)
6.13.5 Exercises
1. Mapping Strings to Integers. Use the map2Int method in the VectorS class within ScalaTion’s
scalation.mathstat package to convert the given strings into encoded integers. Turn this vector into
a matrix and pass it into the dummyVars function to produce the dummy variable columns. Print out
the values xe, xm and xd.
val x1 = VectorS ("English", "French", "German", "Spanish")
val (xe, map) = x1.map2Int                       // map strings to integers
val xm = MatrixI (xe)                            // form a matrix from the vector
val xd = RegressionCat.dummyVars (xm)            // make dummy variable columns

Add code to recover the string values from the encoded integers using the returned map.
2. Compare the results of using the RegressionCat class versus the Regression class for the following
dataset.
// 6 data points:        one   x_1    x_2
val x = MatrixD ((6, 3), 1.0, 36.0,  66.0,                        // 6-by-3 matrix
                         1.0, 37.0,  68.0,
                         1.0, 47.0,  64.0,
                         1.0, 32.0,  53.0,
                         1.0, 42.0,  83.0,
                         1.0,  1.0, 101.0)
val t = MatrixI ((6, 1), 1, 1, 2, 2, 3, 3)                        // treatment levels
val y = VectorD (745.0, 895.0, 442.0, 440.0, 643.0, 1598.0)       // response vector
val z = VectorD (1.0, 20.0, 80.0, 2)                              // new instance
3. Test the RegressionCat class on the AutoMPG dataset using the RegressionCat.apply function.
val mod = RegressionCat (oxr, y, 6, oxr_fname)

The apply method creates a RegressionCat object from a single combined data matrix, splitting it into
continuous and categorical matrices based on the value of nCat.

@param xt      the combined data/input matrix of continuous and categorical variables
@param y       the response/output vector
@param nCat    the column index at which the categorical variables start
@param fname   the feature/variable names
@param hparam  the hyper-parameters

def apply (xt: MatrixD, y: VectorD, nCat: Int, fname: Array [String] = null,
           hparam: HyperParameter = Regression.hp): RegressionCat =
    val x = xt (?, 0 until nCat)
    val t: MatrixI = xt (?, nCat until xt.dim2)
    new RegressionCat (x, t, y, fname, hparam)
end apply
4. There is a problem called "too many dummy variables". What is this problem and when does it become
significant?
5. To reduce the number of dummy variables and thereby the number of columns in a data matrix, de-
pending on the application, it may make sense to combine or fuse similar levels. Suppose a dataset has
state as categorical variable with 50 possible values. Rather than converting this into 49 new dummy
variable columns, consider a way of grouping based on (1) proximity/geography or (2) similarity/com-
mon characteristics.
6. For each string (or word), One-Hot Encoding will introduce a column for each distinct string. The
column that is hot (1) directly indicates what the string is (e.g., [0, 1, 0, 0] represents French).
Table 6.11: One-Hot Encoding of a Categorical Variable
6.14 Weighted Least Squares Regression
The RegressionWLS class supports weighted multiple linear regression. In this case, the predictor vector x
is multi-dimensional [1, x1 , ...xk ].
y = b · x + = b0 + b1 x1 + . . . + bk xk + (6.113)
where represents the residuals (the part not explained by the model). Under multiple linear regression, the
parameter vector b is estimated using matrix factorization with the Normal Equations.
(Xᵀ X) b = Xᵀ y
Let us look at the error vector ε = y − Xb in more detail. A basic assumption is that ε ∼ N(0, σ² I), i.e., each error is Normally distributed with mean 0 and variance σ². If this is violated substantially, the estimates for the parameters b may be less accurate than desired. One way this can happen is that the variance changes from instance to instance, ε_i ∼ N(0, σ_i²). This is called heteroscedasticity (or heteroskedasticity) and it would imply that certain instances (data points) have greater influence on b than they should. The problem can be corrected by weighting each instance by the inverse of its residual/error variance.
w_i = 1 / σ_i²   (6.114)
This raises the question of how to estimate the residual/error variance σ_i². One simple approach (used by setWeights0) is to substitute the residual absolute deviation r_i for it directly.

w_i = 1 / r_i   (6.115)
More commonly, a second unweighted regression is performed, regressing r onto X to obtain the predictions r̂. The predicted RADs may be smoother and less likely to be zero. See Exercise 1 for a comparison of the two methods setWeights0 (uses r_i) and setWeights (uses r̂_i).
w_i = 1 / r̂_i   (6.116)
See [127] for additional discussion concerning how to set weights. These weights can be used to build a
diagonal weight matrix W that factors into the Normal Equations
Xᵀ W X b = Xᵀ W y   (6.117)
In ScalaTion, this is accomplished by computing a weight vector w and taking its element-wise square root ω = √w. The data matrix X is then re-weighted by pre-multiplying it by ω (rtW in the code), as if it were a diagonal matrix: rtW *~: x. The response vector y is re-weighted using vector multiplication: rtW * y. The re-weighted matrix and vector are passed into the Regression class, which solves for the parameter vector b.
In summary, Weighted Least-Squares (WLS) is accomplished by re-weighting and then using Ordinary
Least Squares (OLS). See http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares.
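The essence of this re-weighting step is sketched below. This is a minimal sketch, assuming the weight vector w has already been set (e.g., by setWeights) and that VectorD supports an element-wise map; rtW and the *~: operator are as described above.

// a minimal sketch of WLS via re-weighting, assuming weight vector w is given
val rtW = w.map (math.sqrt)                        // element-wise square root of the weights
val xW  = rtW *~: x                                // re-weight the data matrix X
val yW  = rtW * y                                  // re-weight the response vector y
val mod = new Regression (xW, yW)                  // solve re-weighted OLS for parameter b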
Class Methods:
6.14.4 Exercises
1. The setWeights0 method uses actual RADs, rather than the predicted RADs used by the setWeights method. Compare the two methods of setting the weights on the following dataset.
// 5 data points: one x_1 x_2
val x = MatrixD ((5, 3), 1.0, 36.0,  66.0,         // 5-by-3 matrix
                         1.0, 37.0,  68.0,
                         1.0, 47.0,  64.0,
                         1.0, 32.0,  53.0,
                         1.0,  1.0, 101.0)
val y = VectorD (745.0, 895.0, 442.0, 440.0, 1598.0)
val z = VectorD (1.0, 20.0, 80.0)
Try the two methods on other datasets and discuss the advantages and disadvantages.
2. Explain why ri or r̂i can serve as replacements for the residual/error variance σi2 .
3. As Weighted Least Squares (WLS) reduces the contribution of instances with large (predicted) residuals, one might expect MSE, RMSE, and R² to be worse, but MAE to be better than for Ordinary Least Squares (OLS). Check this on multiple datasets.
4. Show that re-weighting the data matrix X and the response vector y and solving for the parameter vector b in the standard Normal Equations (Xᵀ X) b = Xᵀ y gives the same result as not re-weighting and solving for the parameter vector b in the Weighted Normal Equations Xᵀ W X b = Xᵀ W y.
5. Given an error vector ε, what does its covariance matrix C[ε] represent? How can it be estimated? What are its diagonal elements?
6. When the non-diagonal elements are non-zero, it may be useful to consider using Generalized Least
Squares (GLS). What are the trade-offs of using this more complex technique?
6.15 Polynomial Regression
The PolyRegression class supports polynomial regression. In this case, x is formed from powers of a single parameter t, [1, t, t², . . . , t^k].

y = b · x + ε = b_0 + b_1 t + b_2 t² + . . . + b_k t^k + ε   (6.118)

where ε represents the residuals (the part not explained by the model). Such models are useful when there is a nonlinear relationship between a response and a predictor variable, e.g., y may vary quadratically with t.
A training set now consists of two vectors, one for the m-vector t and one for the m-vector y. An easy
way to implement polynomial regression is to expand each t value into an x vector to form a data/input
matrix X and pass it to the Regression class (multiple linear regression). The columns of data matrix X
represent powers of the vector t.
X = [1, t, t², . . . , t^k]   (6.119)
In ScalaTion the vector t is expanded into a matrix X before calling Regression. The number of
columns in matrix X is the order k plus 1 for the intercept.
val x = new MatrixD (t.dim, 1 + k)
for i <- t.indices do x(i) = expand (t(i))
val mod = new Regression (x, y)
The expand method in the PolyRegression class calls the forms function in the PolyRegression object
that takes a 1-vector and computes the values for all of its polynomial forms/terms, returning them as a
vector.
@param v   the 1-vector (e.g., i-th row of t) for creating forms/terms
@param k   number of features/predictor variables (not counting intercept) = 1
@param nt  the number of terms
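As a rough illustration of such an expansion (a sketch, not ScalaTion's actual forms function), a single value t_i of order k = 2 expands as follows.

// a minimal sketch: expand value t_i into polynomial terms [1, t_i, t_i^2, ..., t_i^k]
def expandPoly (ti: Double, k: Int): Array [Double] =
    Array.tabulate (k + 1) (p => math.pow (ti, p))

expandPoly (3.0, 2)                                // = Array (1.0, 3.0, 9.0)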
Class Methods:
@param t       the initial data/input m-by-1 matrix: t_i expands to
               x_i = [1, t_i, t_i^2, ..., t_i^k]
@param y       the response/output vector
@param ord     the order (k) of the polynomial (max degree)
@param fname_  the feature/variable names (defaults to null)
@param hparam  the hyper-parameters (defaults to PolyRegression.hp)

class PolyRegression (t: MatrixD, y: VectorD, ord: Int, fname_ : Array [String] = null,
                      hparam: HyperParameter = PolyRegression.hp)
      extends Regression (PolyRegression.allForms (t, ord), y, fname_, hparam):
Unfortunately, when the order k of the polynomial gets moderately large, the multi-collinearity problem can become severe. In such cases it is better to use orthogonal polynomials rather than regular polynomials [169]. This is done in ScalaTion by using the PolyORegression class.
Class Methods:
@param t       the initial data/input m-by-1 matrix: t_i expands to
               x_i = [1, t_i, t_i^2, ..., t_i^k]
@param y       the response/output vector
@param ord     the order (k) of the polynomial (max degree)
@param fname_  the feature/variable names (defaults to null)
@param hparam  the hyper-parameters (defaults to PolyRegression.hp)
6.15.4 Exercises
1. Generate two vectors t and y as follows.
val noise = Normal (0.0, 100.0)
val t = VectorD.range (0, 100)
val y = new VectorD (t.dim)
for i <- 0 until 100 do y(i) = 10.0 - 10.0 * i + i~^2 + i * noise.gen
Test new PolyRegression (t, y, order) for various orders and factorization techniques, e.g., reset
hyper-parameter hp.
val hp = new HyperParameter
hp += ("factorization", "Fac_Cholesky", "Fac_Cholesky")
hp("factorization") = "Fac_QR"
2. Test new PolyORegression (t, y, order) for various orders and factorization techniques. It uses orthogonal polynomials instead of simple polynomials. Again, test for multi-collinearity using the correlation matrix and vif.
3. Workday traffic data has two peaks, one for the morning rush and one for the late afternoon rush. What polynomial function of time could match these characteristics? Collect traffic data and use PolyRegression to model the data. Use the lowest order polynomial that provides a reasonable fit.
6.16 Trigonometric Regression
The TrigRegression class supports trigonometric regression. In this case, x is formed from trigonometric
functions of a single parameter t, [1, sin(ωt), cos(ωt), . . . , sin(kωt), cos(kωt)].
A periodic function can be expressed as a linear combination of trigonometric functions (sine and cosine functions) of increasing frequencies. Consequently, if the data points have a periodic nature, a trigonometric regression model may be superior to alternatives.
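As a rough illustration (a sketch, not ScalaTion's actual allForms implementation), a time value t_i could be expanded into a row of X as follows, where the base frequency ω is assumed to be 2π over the period.

import scala.math.{Pi, cos, sin}

// a minimal sketch: expand t_i into [1, sin(w t_i), cos(w t_i), ..., sin(k w t_i), cos(k w t_i)]
def trigForms (ti: Double, k: Int, period: Double): Array [Double] =
    val w = 2.0 * Pi / period                      // assumed base frequency
    Array (1.0) ++ (1 to k).flatMap (j => Array (sin (j * w * ti), cos (j * w * ti)))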
Class Methods:
@param t       the initial data/input m-by-1 matrix: t_i expands to x_i
@param y       the response/output vector
@param ord     the order (k), the maximum multiplier in the trig function (kwt)
@param fname_  the feature/variable names (defaults to null)
@param hparam  the hyper-parameters (defaults to Regression.hp)

class TrigRegression (t: MatrixD, y: VectorD, ord: Int, fname_ : Array [String] = null,
                      hparam: HyperParameter = Regression.hp)
      extends Regression (TrigRegression.allForms (t, ord), y, fname_, hparam):
6.16.3 Exercises
1. Create a noisy cubic function and test how well TrigRegression can fit the data for various values of
k (harmonics) generated from this function.
val noise = Normal (0.0, 10000.0)
val t = VectorD.range (0, 100)
val y = new VectorD (t.dim)
for i <- 0 until 100 do
    val x = (i - 40) / 2.0
    y(i) = 1000.0 + x + x * x + x * x * x + noise.gen
end for
2. Make the noisy cubic function periodic and test how well TrigRegression can fit the data for various
values of k (harmonics) generated from this function.
val noise = Normal (0.0, 10.0)
val t = VectorD.range (0, 200)
val y = new VectorD (t.dim)
for i <- 0 until 5 do
    for j <- 0 until 20 do
        val x = j - 4
        y(40*i + j) = 100.0 + x + x * x + x * x * x + noise.gen
    end for
    for j <- 0 until 20 do
        val x = 16 - j
        y(40*i + 20 + j) = 100.0 + x + x * x + x * x * x + noise.gen
    end for
end for
Chapter 7
Classification
When the response variable y is discrete and bounded,

y ∈ Z_k = {0, 1, . . . , k − 1}   (7.1)

the problem shifts from prediction to classification. This facilitates giving the response meaningful class names, e.g., low-risk, medium-risk and high-risk. However, when the response is discrete, but unbounded (e.g., Poisson Regression), or ordinal (e.g., the number of states voting for a particular political party), the problem can be considered to be a prediction problem.
y = f(x; b) + ε   (7.2)
As with Regression in continuous domains, some of the modeling techniques in this chapter will focus on
estimating the conditional expectation of y given x.
y = E [y|x] + (7.3)
ŷ = E [y|x] (7.4)
Others will focus on maximizing the conditional probability of y given x, i.e., finding the conditional mode.
7.1 Classifier
The Classifier trait provides a common framework for several classifiers such as NaiveBayes.
Trait Methods:
@param x       the input/data m-by-n matrix
@param y       the response/output m-vector (class values)
@param fname   the feature/variable names (if null, use x_j's)
@param k       the number of classes (categorical response values)
@param cname   the names/labels for each class
@param hparam  the hyper-parameters for the model
def backwardElimAll (idx_q: Int = QoF.rSqBar.ordinal, first: Int = 1, cross: Boolean = true):
def stepRegressionAll (idx_q: Int = QoF.rSqBar.ordinal, cross: Boolean = true):
def vif (skip: Int = 1): VectorD =
inline def testIndices (n_test: Int, rando: Boolean): IndexedSeq [Int] =
def validate (rando: Boolean = true, ratio: Double = 0.2)
             (idx: IndexedSeq [Int] = testIndices ((ratio * y.dim).toInt, rando)): VectorD =
def crossValidate (k: Int = 5, rando: Boolean = true): Array [Statistic] =
For modeling, a user chooses one of the classes extending the trait Classifier (e.g., DecisionTree_ID3) to instantiate an object. Next, the train method would typically be called.
@param x_  the training/full data/input matrix (defaults to full x)
@param y_  the training/full response/output vector (defaults to full y)
This implementation simply computes the class/prior frequencies (nu_y) and probabilities (p_y). This works for a simple model such as NullModel, but needs to be overridden for most models. The test method is abstract and thus must be defined in all implementing classes.
@param x_  the testing/full data/input matrix (defaults to full x)
@param y_  the testing/full response/output vector (defaults to full y)
While the modeling techniques in the last chapter focused on minimizing errors, the focus in this chapter will be on minimizing incorrect classifications. Generally, this is done by dividing a dataset up into a training dataset and a test dataset. A technique for utilizing one dataset to produce a single training and test dataset pair is called validation.
def validate (rando: Boolean = true, ratio: Double = 0.2)
             (idx: IndexedSeq [Int] =
                  testIndices ((ratio * y.dim).toInt, rando)): VectorD =
    val (x_e, x_, y_e, y_) = TnT_Split (x, y, idx)          // Test-n-Train Split
Another technique for utilizing one dataset to produce multiple training and test datasets is called cross-
validation. As discussed in the Model Validation section in the Prediction chapter, k-fold cross-validation is
a useful general purpose strategy for examining the quality of a model. It performs k iterations of training
(train method) and testing (test method).
def crossValidate (k: Int = 5, rando: Boolean = true): Array [Statistic] =
    if k < MIN_FOLDS then flaw ("crossValidate", s"k = $k must be at least $MIN_FOLDS")
    val stats   = FitC.qofStatTable                         // create table of QoF measures
    val fullIdx = if rando then permGen.igen                // permuted indices
                  else VectorI.range (0, y.dim)             // ordered indices
    val sz      = y.dim / k                                 // size of each fold
    val ratio   = 1.0 / k                                   // fraction used for testing
Setting rando to true is usually preferred, as it randomizes the instances selected for the test dataset,
so that patterns coincidental to the index are broken up.
Once a model/classifier has been sufficiently trained and tested, it is ready to be put into practice on
new data via the classify method.
@param z  the data vector to classify
It calls the model-specific predictI method and returns its value, as well as the corresponding class label and its relative probability.
The Classifier trait also provides methods to determine the value count (vc) for the features/variables and a method to shift the values in a vector toward zero by subtracting the minimum value. It has base implementations for test methods. Finally, several methods for feature selection are provided: ScalaTion currently uses forward selection, backward elimination, and stepwise refinement algorithms for feature selection.
7.2 Quality of Fit for Classification
The FitC trait provides methods for computing Quality of Fit (QoF) measures for classifiers. Many are
derived from the so-called Confusion Matrix that keeps track of correct and incorrect classifications. In
ScalaTion when k = 2, the confusion matrix C is configured as follows:

C = [ c00 = tn   c01 = fp
      c10 = fn   c11 = tp ]

where tn and tp are true negatives and positives, respectively, and fn and fp are false negatives and positives, respectively. The selected row is determined by the actual value y, while the selected column is determined by the predicted value yp. The confusion matrix is computed using the confusion method.
@param y_  the actual class values/labels for full (y) or test (y_e) dataset
@param yp  the predicted class values/labels
The first column indicates that the prediction/classification is negative (no or 0), while the second column indicates it is positive (yes or 1). The first letter ('f' or 't') indicates whether the classification is correct (true) or not (false). After calling the confusion method, the summary method should be called. To see values for the basic QoF measures from FitM, the diagnose method may be called instead of confusion. The FitC trait includes several methods for directly computing QoF measures as well.
Trait Methods:
@param y  the vector of actual class values/labels
@param k  the number of distinct class values/labels
def f1_measure (p: Double, r: Double): Double = 2.0 * p * r / (p + r)
def f1v: VectorD = (pv * rv * 2.0) / (pv + rv)
def kappa: Double =
def fit: VectorD =
def fitMicroMap: Map [String, VectorD] =
def help: String = FitC.help
def fitLabel_v: Seq [String] = FitC.fitLabel_v
def summary (x_ : MatrixD, fname: Array [String], b: VectorD, vifs: VectorD = null): String =
7.3 Null Model
The NullModel class implements a simple Classifier suitable for discrete input data. Corresponding to the
Null Model in the Prediction chapter, one could imagine estimating probabilities for outcomes of a random
variable y. Given an instance, this random variable indicates the classification or decision to be made. For
example, it may be used for a decision on whether or not to grant a loan request. The model may be trained
by collecting a training dataset. Probabilities may be estimated from data stored in an m-dimensional
response/classification vector y within the training dataset. These probabilities are estimated based on the
frequency ν (nu in the code) with which each class value occurs.
P(y = c) = ν(y = c) / m = m_c / m   (7.7)
Exercise 1 below is the well-known toy classification problem on whether to play tennis (y = 1) or not (y = 0)
based on weather conditions. Of the 14 days (m = 14), tennis was not played on 5 days and was played on
9 days, i.e.,
P(y = 0) = 5/14   and   P(y = 1) = 9/14
This information, class frequencies and class probabilities, can be placed into a Class Frequency Vector (CFV) and a Class Probability Vector (CPV), as shown in Tables 7.1 and 7.2.

Table 7.1: Class Frequency Vector (CFV)

y      0      1
ν      5      9

Table 7.2: Class Probability Vector (CPV)

y      0      1
P      5/14   9/14
Picking the maximum probability case, one should always predict that tennis will be played, i.e., ŷ = 1.
This modeling technique should outperform purely random guessing, since it factors in the relative
frequency with which tennis is played. As with the NullModel for prediction, more sophisticated modeling
techniques should perform better than this NullModel for classification. If they are unable to provide higher
accuracy, they are of questionable value.
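A minimal sketch of this frequency/probability estimation (plain Scala collections, rather than ScalaTion's internal nu and p_y fields):

// a minimal sketch: estimate class frequencies and probabilities from y
val y  = Array (0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)       // 5 no's, 9 yes's
val m  = y.length
val nu = y.groupBy (identity).view.mapValues (_.length).toMap   // class frequencies
val p  = nu.view.mapValues (_.toDouble / m).toMap               // class probabilities
val yHat = p.maxBy (_._2)._1                                    // argmax class => 1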
7.3.1 NullModel Class
Class Methods:
@param y       the response/output m-vector (class values)
@param k       the number of distinct values/classes
@param cname_  the names for all classes

class NullModel (y: VectorI, k: Int = 2, cname_ : Array [String] = Array ("No", "Yes"))
      extends Classifier (null, y, null, k, cname_, null)   // no x matrix, no hparam
      with FitC (y, k):
The NullModel is so simple that it just uses the train method from Classifier.
Typically, one dataset is divided into a training dataset and testing dataset. For example, 80% may
be used for training (estimating probabilities) with the remaining 20% used for testing the accuracy of the
model. Furthermore, this is often done repeatedly as part of a cross-validation procedure.
7.3.2 Exercises
1. The NullModel classifier can be used to solve problems such as the one below. Given the Outlook, Temperature, Humidity, and Wind, determine whether it is more likely that someone will (1) or will not (0) play tennis. The dataset is widely available on the Web. It is also available in scalation.modeling.classifying.Example_PlayTennis. Use the NullModel for classification and evaluate its effectiveness using cross-validation.
The Example_PlayTennis object is used to test all integer-based classifiers.
// The Example_PlayTennis object is the well-known classification problem on whether
// to play tennis based on given weather conditions. Applications may need to slice xy.
//     val x = xy(?, 0 until 4)                             // columns 0, 1, 2, 3
//     val y = xy(?, 4)                                     // column 4
// @see euclid.nmu.edu/~mkowalcz/cs495f09/slides/lesson004.pdf

object Example_PlayTennis:

    // dataset ----------------------------------------------------------------
    // x0: Outlook:     Rain (0), Overcast (1), Sunny (2)
    // x1: Temperature: Cold (0), Mild (1), Hot (2)
    // x2: Humidity:    Normal (0), High (1)
    // x3: Wind:        Weak (0), Strong (1)
    // y:  the response/classification decision
    // variables/features:    x0  x1  x2  x3  y             // combined matrix
    val xy = MatrixI ((14, 5), 2,  2,  1,  0,  0,           // day  1
                               2,  2,  1,  1,  0,           // day  2
                               1,  2,  1,  0,  1,           // day  3
                               0,  1,  1,  0,  1,           // day  4
                               0,  0,  0,  0,  1,           // day  5
                               0,  0,  0,  1,  0,           // day  6
                               1,  0,  0,  1,  1,           // day  7
                               2,  1,  1,  0,  0,           // day  8
                               2,  0,  0,  0,  1,           // day  9
                               0,  1,  0,  0,  1,           // day 10
                               2,  1,  0,  1,  1,           // day 11
                               1,  1,  1,  1,  1,           // day 12
                               1,  2,  0,  0,  1,           // day 13
                               0,  1,  1,  1,  0)           // day 14

    val fn = Array ("Outlook", "Temp", "Humidity", "Wind")  // feature names
    val cn = Array ("No", "Yes")                            // class names for y
    val k  = cn.size                                        // number of classes

end Example_PlayTennis
2. Build a NullModel classifier for the Breast Cancer problem (data in breast-cancer.arff file).
7.4 Naı̈ve Bayes
The NaiveBayes class implements a Naı̈ve Bayes (NB) Classifier suitable for discrete input data. A Bayesian
Classifier is a special case of a Bayesian Network where one of the random variables is distinguished as the
basis for making decisions, call it random variable y, the class variable. The NullModel ignores weather con-
ditions which are the whole point of the Example PlayTennis exercise. For Naı̈ve Bayes, weather conditions
(or other data relevant to decision making) are captured in an n-dimensional vector of random variables.
P(y|x) = P(x|y) P(y) / P(x)   (7.9)
Since the denominator is the same for all y, it is sufficient to maximize the right hand side of the following proportionality statement.

P(y|x) ∝ P(x|y) P(y)

Notice that the right hand side is the joint probability of all the random variables, P(x, y) = P(x|y) P(y).
One could in principle represent the joint probability P (x, y) or the conditional probability P (x|y) in
a matrix. Unfortunately, with 30 binary random variables, the matrix would have over one billion rows
and exhibit issues with sparsity. Bayesian classifiers will factor the probability and use multiple matrices to
represent the probabilities.
P(x|y) = ∏_{j=0}^{n−1} P(x_j|y)   (7.12)
Research has shown that, even though the assumption that the x-variables are independent given the response/class variable y is often violated by a dataset, Naïve Bayes still tends to perform well [212]. Substituting this factorization into the joint probability formula yields
P(x, y) = P(y) ∏_{j=0}^{n−1} P(x_j|y)   (7.13)
The classification problem then is to find the class value for y that maximizes this probability, i.e., let ŷ be
the argmax of the product of the class probability P (y) and all the conditional probabilities P (xj |y). The
argmax is the value in the domain Dy = {0, . . . , k − 1} that maximizes the probability.
ŷ = argmax_{y ∈ {0,...,k−1}} P(y) ∏_{j=0}^{n−1} P(x_j|y)   (7.14)
The conditional probability for random variable xj given random variable y can be estimated as the ratio of
two frequencies.
P(x_j = h | y = c) = ν(x_:j = h, y = c) / ν(y = c)   (7.16)
In other words, the conditional probability is the ratio of the joint frequency count for a given h and c divided by the class frequency count for a given c. These frequency counts can be collected into Joint Frequency Tables (JFTs), one per feature.
For the Example_PlayTennis problem, Figure 7.1 shows the tree for a Naïve Bayes Classifier. The edges from classification variable y to the feature variables x_j are shown in black.
Figure 7.1: Naïve Bayes Classifier: y = Play, x0 = Outlook, x1 = Temp, x2 = Humidity, x3 = Wind
For this problem, the Joint Frequency Matrix/Table (JFT) for the Outlook random variable x0 is shown in Table 7.3. Dividing each column by its class frequency yields the Conditional Probability Table (CPT) shown in Table 7.4.

Table 7.3: JFT for x_0

x0 \ y    0     1
0         2     3
1         0     4
2         3     2

Table 7.4: CPT for x_0

x0 \ y    0     1
0         2/5   3/9
1         0     4/9
2         3/5   2/9
Continuing with the Example_PlayTennis problem, the Joint Frequency Matrix/Table for the Wind random variable x3 is shown in Table 7.5.
Table 7.5: JFT for x_3

x3 \ y    0     1
0         2     6
1         3     3

Dividing by the class frequencies yields the corresponding CPT (Table 7.6).

Table 7.6: CPT for x_3

x3 \ y    0     1
0         2/5   6/9
1         3/5   3/9
Similar matrices/tables can be created for the other random variables: Temperature x1 and Humidity x2 .
P(x_j = h | y = c) = [ν(x_:j = h, y = c) + m_e/vc_j] / [ν(y = c) + m_e]   (7.17)
where m_e is the parameter used for the m-estimate. The term added to the numerator takes the one (or m_e) fake instance(s) and spreads uniform probability over each of the possible values for x_j, of which there are vc_j. Table 7.7 shows the result of adding 1/3 in the numerator and 1 in the denominator (e.g., for h = 0 and c = 0, (2 + 1/3)/(5 + 1) = 7/18).
Table 7.7: CPT for x_0 with m_e = 1

x0 \ y    0       1
0         7/18    10/30
1         1/18    13/30
2         10/18   7/30

Or in decimal, see Table 7.8.
Another problem is when a conditional probability in a CPT is zero. If any CPT has a zero element, the
corresponding product for the column (where the CPV and CPTs are multiplied) will be zero no matter how
high the other probabilities may be. This happens when the frequency count is zero in the corresponding
JFT (see element (1, 0) in Table 7.3). The question now is whether this is due to the combination of x0 = 1
Table 7.8: CPT for x_0 with m_e = 1 in decimal

x0 \ y    0        1
0         0.3889   0.3333
1         0.0556   0.4333
2         0.5556   0.2333
and y = 0 being highly unlikely, or that the dataset is not large enough to exhibit this combination. Laplace
smoothing guards against this problem as well.
Other values (including fractional values) may be used for me as well. ScalaTion uses a small value for
the default me to reduce the distortion of the CPTs.
The dimensions are as follows: for each matrix, the number of rows is vcj , the value count for feature xj ,
and the number of columns is k, the number of class values; while the number of matrices is n, the number
of x-random variables (features).
For the Example_PlayTennis problem, vc = [3, 3, 2, 2], k = 2, and n = 4.
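The ragged structure can be sketched with plain Scala arrays (hypothetical local names, not ScalaTion's internal representation):

// a minimal sketch: ragged storage for CPTs -- one vc_j-by-k matrix per feature
val vc  = Array (3, 3, 2, 2)                       // value counts per feature
val k   = 2                                        // number of classes
val cpt = Array.tabulate (vc.length) (j => Array.ofDim [Double] (vc(j), k))
for m <- cpt do println (s"${m.length}-by-${m(0).length}")   // 3-by-2, 3-by-2, 2-by-2, 2-by-2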
Note that the alternative of storing the tables in a rectangular hyper-matrix or tensor would result in a dimensionality of 4-by-3-by-2, but this would in general be wasteful of space. Each variable only needs space for the values it allows, as indicated by the value counts vc = [3, 3, 2, 2]. The user may specify the optional vc parameter in the constructor call. If the vc parameter is unspecified, then ScalaTion uses the vc_fromData method to determine the value counts from the training data. In some cases, the test data may include a value unseen in the training data. Currently, ScalaTion requires the user to pass vc into the constructor in such cases.
The freq method for computing Joint Frequency Tables (JFTs) is defined in the RTensorD object. The cprob_Xy method defined in the NaiveBayes class divides each JFT by the class frequencies nu_y to obtain the corresponding CPT. Laplace smoothing adds m_e/vc(j) to the numerator and m_e to the denominator.
@param x_     the integer-valued data vectors stored as rows of a matrix
@param y_     the class vector, where y(i) = class for row i of the matrix x, x(i)
@param nu_Xy  the joint frequency of X and y for each feature xj and class value
Note that the classify method defined in the Classifier trait calls predictI returning its value as well
as its class label and relative probability.
Since multiplying many small probabilities risks underflow, the products may be computed as sums of logarithms.

log P(z, y) = log P(y) + ∑_{j=0}^{n−1} log P(z_j|y)   (7.18)
           y = 0    y = 1
P(y)       5/14     9/14
P(z0|y)    3/5      2/9
P(z3|y)    3/5      3/9
P(z, y)    9/70     1/21
The two probabilities are approximately 0.129 for c = 0 (Do not Play) and 0.0476 for c = 1 (Play). The
higher probability is for c = 0.
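As a rough sketch of this computation (not ScalaTion's actual predictI), the argmax over classes in log space could look like the following, where p_y and cpt are hypothetical local names holding the class probabilities and the per-feature CPTs.

// a minimal sketch of Naive Bayes classification in log space
// p_y(c) = P(y = c); cpt(j)(h)(c) = P(x_j = h | y = c) -- hypothetical names
def classify (z: Array [Int], p_y: Array [Double], cpt: Array [Array [Array [Double]]]): Int =
    val logp = Array.tabulate (p_y.length) { c =>
        math.log (p_y(c)) + z.indices.map (j => math.log (cpt(j)(z(j))(c))).sum
    }
    logp.indexOf (logp.max)                        // argmax class
end classify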
To perform feature selection in a systematic way, ScalaTion provides several methods for feature selection, including forward selection, backward elimination and stepwise refinement (see the Classifier trait).
Class Methods:
@param x       the input/data m-by-n matrix
@param y       the class vector, where y(i) = class for row i of matrix x
@param fname_  the names of the features/variables (defaults to null)
@param k       the number of classes (defaults to 2)
@param cname_  the names of the classes
@param vc      the value count (number of distinct values) for each feature
@param hparam  the hyper-parameters
7.4.11 Exercises
1. With Laplace smoothing set at 1 fake instance, complete the Example_PlayTennis problem given in this section by creating CPTs for random variables x1 and x2 and then computing the relative probabilities for z = [2, 2, 1, 1]. Hint: The relative probabilities are [0.03376, 0.004288]. What decision would the classifier make?
2. Use ScalaTion's integer-based NaiveBayes class to build a classifier for the Example_PlayTennis problem.
import scalation.modeling.classifying.Example_PlayTennis._
banner ("Play Tennis Example")
println (s"xy = $xy")                              // combined data matrix [x | y]
3. Compare the confusion matrix, accuracy, precision and recall of NaiveBayes on the full dataset to that
of NullModel.
4. For the Example_PlayTennis problem, compare the accuracy of NaiveBayes to that of NullModel using
(a) 80-20% train-test split validation
(b) 5-fold cross-validation (cv).
println ("mod test accu = " + mod.validate ()())   // out-of-sample validation
FitM.showQofStatTable (mod.crossValidate ())       // 5-fold cross-validation
Note: validate and crossValidate are better suited to larger datasets (the Example_PlayTennis toy dataset only has 14 instances).
5. Compare the confusion matrix, accuracy, precision and recall of RoundRegression on the full dataset
to that of NullModel.
6. Perform feature selection on the Example_PlayTennis problem. Which feature/variable is removed from the model first, second and third? Explain the basis for the featureSelection method's decision to remove a feature.
7. Use the integer-based NaiveBayes class to build a classifier for the Breast Cancer problem (data in
breast-cancer.arff file). Compare its accuracy to that of NullModel.
7.5 Bayes Classifier
The BayesClassifier trait provides methods for more advanced Bayesian Classifiers, including calculations
of joint probabilities and Conditional Mutual Information (CMI). For data with small value counts, CMI
tends to be more useful than correlation for examining dependencies between random variables. More
information will be provided in the next section on TAN Bayes.
Class Methods:
1 @ param k the number of classes
2
3 trait Ba ye s Cl as si f ie r ( k : Int = 2) :
4
7.6 Tree Augmented Naı̈ve Bayes
The TANBayes class implements a Tree Augmented Naı̈ve (TAN) Bayes Classifier suitable for discrete input
data. Unlike Naı̈ve Bayes, a TAN Bayes model can capture more, yet limited dependencies between vari-
ables/features. In general, xj can be dependent on the class y as well as one other variable xpj . Representing
the dependency pattern graphically, y becomes a root node of a Directed Acyclic Graph (DAG), where each
node/variable has at most two parents.
Starting with the joint probability defined in the section on Naı̈ve Bayes,
we can obtain a better factored approximation (better than Naı̈ve Bayes) by keeping the most important
dependencies amongst the random variables. Each feature xj , except a selected x-root, xr , will have one
x-parent xpj in addition to its y-parent, i.e.,
The dependency pattern among the x random variables forms a tree and this tree augments the Naı̈ve Bayes
structure where each x random variable has y as its only parent.
P(x, y) = P(y) ∏_{j=0}^{n−1} P(x_j|x_pj, y)   (7.20)
Now, each feature x_j is conditioned on its x-parent x_pj and the class variable y. More precisely, since the root x_r has no x-parent, it can be factored out as a special case.
P(x, y) = P(y) P(x_r|y) ∏_{j≠r} P(x_j|x_pj, y)   (7.21)
Figure 7.2 shows the DAG for a TAN Bayes Classifier. The edges from classification variable y to the feature
variables xj are shown in black, while edges between the feature variables forming the tree are shown in blue.
Figure 7.2: TAN Bayes Classifier: y = Play, x0 = Outlook, x1 = Temp, x2 = Humidity, x3 = Wind
As with Naı̈ve Bayes, the goal is to find an optimal value for the random variable y that maximizes the
probability.
ŷ = argmax_{y ∈ D_y} P(y) P(x_r|y) ∏_{j≠r} P(x_j|x_pj, y)   (7.22)
The Mutual Information (MI) between two random variables x and z is

I(x; z) = ∑_x ∑_z p(x, z) log [ p(x, z) / (p(x) p(z)) ]   (7.23)
The Conditional Mutual Information (CMI) between two random variables x (e.g., x_j) and z (e.g., x_l) given a third random variable y is

I(x; z|y) = ∑_y p(y) ∑_x ∑_z p(x, z|y) log [ p(x, z|y) / (p(x|y) p(z|y)) ]   (7.24)
Equivalently, in terms of joint probabilities,

I(x; z|y) = ∑_y ∑_x ∑_z p(x, z, y) log [ p(y) p(x, z, y) / (p(x, y) p(z, y)) ]   (7.25)
The steps involved in the structure learning algorithm for TAN Bayes are the following:
1. Compute the CMI I(x_j; x_l|y) for all combinations of random variables, j ≠ l.
2. Build a complete undirected graph with a node for each x_j random variable. The weight on undirected edge {x_j, x_l} is its CMI value.
3. Apply a Maximum Spanning Tree algorithm (e.g., Prim or Kruskal) to the undirected graph to create a maximum spanning tree (those n − 1 edges that (a) connect all the nodes, (b) form a tree, and (c) have maximum cumulative edge weights). Note, ScalaTion's MinSpanningTree in the scalation.graph_db package can be used with parameter min = false.
4. Select one of the nodes to serve as the x-root x_r.
5. To build the directed tree, start with the root node x_r and traverse from there, giving each edge directionality as you go outward from the root.
Table 7.10: Parent Table

xj     x_pj
x0     x3
x3     null
Table 7.11: Extended JFT for x_0

x0 \ x3, y    0,0   0,1   1,0   1,1
0             0     3     2     0
1             0     2     0     2
2             2     1     1     1
In this case, the only modification to the CPV and CPTs from the Naïve Bayes solution is that the JFT and CPT for x0 are extended. The extended Joint Frequency Table (JFT) for x0 is shown in Table 7.11. The column sums are 2, 6, 3, 3, respectively. Again they must add up to the same total of 14. Dividing each element in the JFT by its column sum yields the extended Conditional Probability Table (CPT) shown in Table 7.12.
Table 7.12: Extended CPT for x_0

x0 \ x3, y    0,0   0,1   1,0   1,1
0             0     1/2   2/3   0
1             0     1/3   0     2/3
2             1     1/6   1/3   1/3
In general for TANBayes, the x-root will have a regular CPT, while all other x-variables will have an
extended CPT, i.e., the extended CPT for xj is calculated as follows:
P(x_j = h | x_p = l, y = c) = ν(x_:j = h, x_p = l, y = c) / ν(x_p = l, y = c)   (7.26)
7.6.3 Smoothing
The analog of Laplace smoothing used in Naı̈ve Bayes is the following.
P(x_j = h | x_p = l, y = c) = [ν(x_:j = h, x_p = l, y = c) + m_e/vc_j] / [ν(x_p = l, y = c) + m_e]   (7.27)
In Friedman’s paper [50], he suggests using the marginal distribution rather than uniform (as shown above),
which results in the following formula.
P(x_j = h | x_p = l, y = c) = [ν(x_:j = h, x_p = l, y = c) + m_e · mp_j] / [ν(x_p = l, y = c) + m_e]   (7.28)

where

mp_j = ν(x_:j) / m   (7.29)
Class Methods:
@param x       the input/data m-by-n matrix
@param y       the class vector, where y(i) = class for row i of matrix x
@param fname_  the names of the features/variables (defaults to null)
@param k       the number of classes (defaults to 2)
@param cname_  the names of the classes
@param vc      the value count (number of distinct values) for each feature
@param hparam  the hyper-parameters
7.6.7 Exercises
1. Use the integer-based TANBayes to build classifiers for (a) the Example_PlayTennis problem and (b) the Breast Cancer problem (data in breast-cancer.arff file). Compare its accuracy to that of NullModel and NaiveBayes.
2. Compare the correlation matrix on X with the corresponding Conditional Mutual Information (CMI) matrix. How well do they capture dependencies? Use multiple datasets.
3. Show that the two formulas (the one using conditional probability and the other using joint probability) for Conditional Mutual Information (CMI) give the same results.
4. Re-engineer TANBayes to use correlation instead of Conditional Mutual Information (CMI). Compare
the results with the current TANBayes implementation.
5. The FANBayes class implements a Forest Augmented Naı̈ve (FAN) Bayes Classifier suitable for discrete
input data. It competes with TANBayes by allowing multiple trees and is thus more flexible. Compare
FANBayes and TANBayes on multiple datasets.
7.7 Bayesian Network Classifier
A Bayesian Network Classifier [17] is used to classify a discrete input data vector x by determining which
of k classes has the highest Joint Probability of x and the response/outcome y (i.e., one of the k classes) of
occurring.
Using the Chain Rule of Probability, the Joint Probability calculation can be factored into multiple calculations of conditional probabilities as well as the class probability of the response. For example, given three variables, the joint probability may be factored as follows:

P(x_0, x_1, x_2) = P(x_0) P(x_1|x_0) P(x_2|x_0, x_1)
Conditional dependencies are specified using a Directed Acyclic Graph (DAG). A feature/variable rep-
resented by a node in the network is conditionally dependent on its parents only,
ŷ = argmax_{y ∈ D_y} P(y) ∏_{j=0}^{n−1} P(x_j|x_p(j), y)   (7.32)
where x_p(j) is the vector of features/variables that x_j is dependent on, i.e., its parents. In our model, each variable has a dependency with the response variable y (a de facto parent). Note, some more general BN formulations do not distinguish one of the variables to be the response y as we do.
Conditional probabilities are recorded in tables referred to as Conditional Probability Tables (CPTs).
Each variable will have a CPT and the number of columns in the table is governed by the number of other
variables it is dependent upon. If this number is large, the CPT may become prohibitively large.
7.8 Markov Network
A Markov Network is a probabilistic graphical model where directionality/causality between random variables
is not considered, only their bidirectional relationships. In general, let x be an n-dimensional vector of random
variables.
x = [x0 , . . . xn−1 ]
Given a data instance x, its likelihood of occurrence is given by the joint probability.
P (x = x)
In order to compute the joint probability, it needs to be factored based on conditional independencies. These
conditional independencies may be illustrated graphically, by creating a vertex for each random variable xi
and letting the structure of the graph reflect the conditional independencies,
xi ⊥ xk | {xj }
such that removal of the vertices in the set {x_j} will disconnect x_i and x_k in the graph. These conditional independencies may be exploited to factor the joint probability, e.g., for the independency x_0 ⊥ x_1 | x_2,

P(x_0, x_1, x_2) = P(x_0|x_2) P(x_1|x_2) P(x_2)
When two random variables are directly connected by an undirected edge (denoted x_i − x_j), they cannot be separated by the removal of other vertices. Together they form an Undirected Graph G(x, E), where the vertex-set is the set of random variables x and the edge-set E consists of the directly connected pairs of random variables. When the random variables are distributed in space, the Markov Network may form a grid, in which case the network is often referred to as a Markov Random Field (MRF).
The edges E are selected so that random variable x_i will be conditionally independent of any other (k ≠ i) random variable x_k that is not in its Markov blanket.
xi ⊥ xk | B(xi ) (7.34)
7.8.2 Factoring the Joint Probability
Factorization of the joint probability is based on the graphical structure of G that reflects the conditional independencies. It has been shown (see the Hammersley-Clifford Theorem) that P(x) may be factored according to the set of maximal cliques Cl in graph G.

P(x) = (1/Z) ∏_{c ∈ Cl} φ_c(x_c)   (7.35)
For each clique c in the set Cl, a potential function φc (xc ) is defined. (Potential functions are non-negative
functions that are used in place of marginal/conditional probabilities and need not sum to one; hence the
normalizing constant Z).
Suppose a graph G([x_0, x_1, x_2, x_3, x_4], E) has two maximal cliques, Cl = {[x_0, x_1, x_2], [x_2, x_3, x_4]}; then

P(x) = (1/Z) φ_0(x_0, x_1, x_2) φ_1(x_2, x_3, x_4)
7.8.3 Exercises
1. Consider the random vector x = [x0 , x1 , x2 ] with conditional independency
x0 ⊥ x1 | x2
show that
7.9 Decision Tree ID3
A Decision Tree (or Classification Tree) classifier [162, 158] will take an input vector x and classify it, i.e.,
give one of k class values to y by applying a set of decision rules configured into a tree. Abstractly, the
decision rules may be viewed as a function f .
7.9.1 Entropy
In decision trees, the goal is to reduce the disorder in decision making. Assume the decision is of the yes(1)/no(0) variety and consider the following decision/classification vectors: y = (1, 1, . . . , 1, 1) or y′ = (1, 0, . . . , 1, 0). In the first case all the decisions are yes, while in the second, there are an equal number of yes and no decisions. One way to measure the level of disorder is Shannon entropy. To compute the entropy, first convert the m-dimensional decision/classification vector y into a k-dimensional probability vector p. The frequency and toProbability functions in the Probability object may be used for this task (see NullModel from the last chapter).
For the two cases, p = (1, 0) and p′ = (.5, .5), so computing the Shannon entropy H(p) (see the Probability Chapter), we obtain H(p) = 0 and H(p′) = 1. These indicate that there is no disorder in the first case and maximum disorder in the second case.
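A minimal sketch of the entropy computation (plain Scala, rather than ScalaTion's Probability object):

// a minimal sketch: Shannon entropy H(p) in bits, with 0 log 0 taken as 0
def entropy (p: Array [Double]): Double =
    - p.filter (_ > 0.0).map (pi => pi * math.log (pi) / math.log (2.0)).sum

entropy (Array (1.0, 0.0))                         // = 0.0  (no disorder)
entropy (Array (0.5, 0.5))                         // = 1.0  (maximum disorder for k = 2)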
Entropy is used as a measure of the impurity of a node (e.g., to what degree it is a mixture of '−' and '+'). For a discussion of additional measures see [158].
H(p) = H(5/14, 9/14) = −(5/14) log₂(5/14) − (9/14) log₂(9/14) = 0.9403
Recall that the features are Outlook x0 , Temp x1 , Humidity x2 , and Wind x3 . To reduce entropy, find
the feature/variable that has the greatest impact on reducing disorder. If feature/variable j is factored into
the decision making, entropy is now calculated as follows:
∑_{v=0}^{vc_j−1} [ν(x_:j = v)/m] H(p_{x_:j=v})   (7.37)

where ν(x_:j = v) is the frequency count of value v for column vector x_:j in matrix X. The sum is the weighted average of the entropy over all possible vc_j values for variable j.
Table 7.13: Tennis Example (the 14-day dataset listed in the Example_PlayTennis object)
To see how this works, let us compute new entropy values assuming each feature/variable is used, in turn,
as the principal feature for decision making. Starting with feature j = 0 (Outlook) with values of Rain (0),
Overcast (1) and Sunny (2), compute the probability vector and entropy for each value and weight them by
how often that value occurs.
∑_{v=0}^{2} [ν(x_:0 = v)/m] H(p_{x_:0=v})   (7.38)
For v = 0, we have 2 no (0) cases and 3 yes (1) cases (2−, 3+), for v = 1, we have (0−, 4+) and for v = 2,
we have (3−, 2+).
(5/14) H(p_{x_:0=0}) + (4/14) H(p_{x_:0=1}) + (5/14) H(p_{x_:0=2})
We are left with computing three entropy values:
H(p_{x_:0=0}) = H(2/5, 3/5) = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.9710
H(p_{x_:0=1}) = H(0/4, 4/4) = −(0/4) log₂(0/4) − (4/4) log₂(4/4) = 0.0000
H(p_{x_:0=2}) = H(3/5, 2/5) = −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.9710
The weighted average is then 0.6936, so that the drop in entropy (also called information gain) is 0.9403 -
0.6936 = 0.2467. As shown in Table 7.14, the other entropy drops are 0.0292 for Temperature (1), 0.1518
for Humidity (2) and 0.0481 for Wind (3).
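A rough sketch of this gain computation for the Outlook feature, reusing the entropy function sketched above (the branch counts are the (n−, n+) pairs from the text):

// a minimal sketch: weighted-average entropy after splitting on Outlook
val branches = Array ((2, 3), (0, 4), (3, 2))      // (n-, n+) for v = 0, 1, 2
val m = branches.map { case (n, p) => n + p }.sum  // total instances = 14
val after = branches.map { case (n, p) =>
    val tot = (n + p).toDouble
    (tot / m) * entropy (Array (n / tot, p / tot)) // weight * branch entropy
}.sum                                              // = 0.6936
val gain = 0.9403 - after                          // information gain = 0.2467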
Hence, Outlook (j = 0) should be chosen as the principal feature for decision making. As the remaining entropy is still too high, make a tree with Outlook (0) as the root and make a branch for each value of Outlook: Rain (0), Overcast (1), Sunny (2). Each branch defines a sub-problem.
The resulting tree is shown in Figure 7.3 where the node is designated by the variable (in this case x0 ).
The edges indicate the values that this variable can take on, while the two numbers n− p+ indicate the
number of negative and positive cases.
[Figure 7.3: root node x0 (5− 9+), with branches =0 (2− 3+), =1 (0− 4+) and =2 (3− 2+)]
Sub-problem x0 = 0
The sub-problem for Outlook: Rain (0) (see Table 7.15) is defined as follows: Take all five cases/rows in the data matrix X for which x_:0 = 0.
If we select Wind (j = 3) as the next variable, we obtain the following cases: For v = 0, we have (0−, 3+), so the probability vector and entropy are

p_{x_:3=0} = (0/3, 3/3)    H(p_{x_:3=0}) = 0
Table 7.15: Sub-problem for node x0 and branch 0
For v = 1, we have (2−, 0+), so the probability vector and entropy are

p_{x_:3=1} = (2/2, 0/2)    H(p_{x_:3=1}) = 0
If we stop expanding the tree at this point, we have the following rules.
if x0 == 0 then
    if x3 == 0 then yes
    if x3 == 1 then no
if x0 == 1 then yes
if x0 == 2 then no
The overall entropy can be calculated as the weighted average of all the leaf nodes.
(3/14) · 0 + (2/14) · 0 + (4/14) · 0 + (5/14) · 0.9710 = 0.3468
Sub-problem x0 = 2
Note that if x0 = 1, the entropy for this case is already zero, so this node need not be split and remains as a leaf node. There is still some uncertainty left when x0 = 2, so this node may be split. The sub-problem for Outlook: Sunny (2) (see Table 8.1) is defined as follows: Take all five cases/rows in the data matrix X for which x_:0 = 2.
It should be obvious that y = 1 − x_:2. For v = 0, we have (0−, 2+), so the probability vector and entropy are

p_{x_:2=0} = (0/2, 2/2)    H(p_{x_:2=0}) = 0

For v = 1, we have (3−, 0+), so the probability vector and entropy are

p_{x_:2=1} = (3/3, 0/3)    H(p_{x_:2=1}) = 0
At this point, the overall entropy is zero and the decision tree is the following (shown as a pre-order
traversal from ScalaTion).
Decision Tree:
[ -1 -> Node (j = 0, nu = VectorI(5, 9), y = 1, leaf = false)
[ 0 -> Node (j = 3, nu = VectorI(2, 3), y = 1, leaf = false)
[ 0 -> Node (j = 1, nu = VectorI(0, 3), y = 1, leaf = true) ]
[ 1 -> Node (j = 1, nu = VectorI(2, 0), y = 0, leaf = true) ]
]
[ 1 -> Node (j = 1, nu = VectorI(0, 4), y = 1, leaf = true) ]
[ 2 -> Node (j = 2, nu = VectorI(3, 2), y = 0, leaf = false)
[ 0 -> Node (j = 1, nu = VectorI(0, 2), y = 1, leaf = true) ]
[ 1 -> Node (j = 1, nu = VectorI(3, 0), y = 0, leaf = true) ]
]
]
The above process of creating the decision tree is done by a recursive, greedy algorithm. As with many
greedy algorithms, it does not guarantee an optimal solution.
def predictIrec (z: VectorI, n: Node = root): Int =
def predictIrecD (z: VectorD, n: Node = root): Int =
def printTree (): Unit =
The DecisionTree_ID3 class extends this trait, implementing the ID3 algorithm with methods for training, testing, making predictions and producing summary statistics. The train method calls the private buildTree method that recursively builds the decision tree by expanding nodes until the entropy drops to the cutoff threshold or the tree depth reaches the specified tree height.
7.9.5 DecisionTree_ID3 Class
Class Methods:

@param x       the input/data m-by-n matrix with instances stored in rows
@param y       the response/classification m-vector, where y_i = class for row i
@param fname_  the name for each feature/variable xj (defaults to null)
@param k       the number of classes (defaults to 2)
@param cname_  the name for each class
@param hparam  the hyper-parameters for the Decision Tree classifier
7.9.6 Pruning
An alternative to early termination is to build a complex tree and then prune the tree. Pruning involves selecting a node whose children are all leaves and undoing the split that created the children. Compared to early termination, pruning will take more time to come up with a solution. For the tennis example, pruning could be used to turn node 5 into a leaf node (pruning away two nodes), where the decision would be the majority decision y = 1. The entropy for this has already been calculated to be .3468. Instead, node 1 could be turned into a leaf (pruning away two nodes). This case is symmetric to the other one, so the entropy would also be .3468, but the decision would be y = 0. The original ID3 algorithm did not use pruning, but its follow-on algorithm C4.5 does (see the next Chapter). The ScalaTion implementation of ID3 does support pruning. The DecisionTree_ID3wp class extends DecisionTree_ID3 with methods for finding candidates for pruning and doing the actual pruning.
7.9.7 DecisionTree_ID3wp Class
Class Methods:

@param x       the input/data m-by-n matrix with instances stored in rows
@param y       the response/classification m-vector, where y_i = class for row i of matrix x
@param fname_  the name for each feature/variable xj (defaults to null)
@param k       the number of classes (defaults to 2)
@param cname_  the name for each class
@param hparam  the hyper-parameters for the Decision Tree classifier
7.9.8 Exercises
1. The Play Tennis example (see NaiveBayes) can also be analyzed using decision trees.

import Example_PlayTennis._

Use DecisionTree_ID3 to build classifiers for the Example_PlayTennis problem. Compare its accuracy to that of NullModel, NaiveBayes and TANBayes.
2. Do the same for the Breast Cancer problem (data in breast-cancer.arff file).
3. For the Breast Cancer problem, evaluate the effectiveness of the prune method.
4. Again for the Breast Cancer problem, explore the results for various limitations to the maximum
height/depth of tree via the height hyper-parameter.
7.10 Hidden Markov Model
A Hidden Markov Model (HMM) provides a natural way to study a system with an internal state and
external observations. One could image looking at a flame and judging the temperature (internal state) of
the flame by its color (external observation). When this is treated as a discrete problem, an HMM may
be used; whereas, as a continuous problem, a Kalman Filter may be used (see the chapter on State Space
Models). For HMMs, we assume that the internal state is unknown (hidden), but may be predicted by from
the observations.
Consider two discrete-valued, discrete-time stochastic processes. The first process x_t represents the internal state of a system, while the second process y_t represents the external observations. The internal state influences the observations. In a deterministic setting, one might imagine

y_t = f(x_t)
Unfortunately, since both xt and yt are both stochastic processes, their trajectories need to be described
probabilistically. For tractability and because it often suffices, the assumption is made that the state xt is
only significantly influenced by its previous state xt−1 .
In other words, the transitions from state to state are governed by a discrete-time Markov chain and characterized by the state-transition probability matrix A = [a_ij], where

a_ij = P(x_t = j | x_{t−1} = i)

The influence of the state upon the observation is also characterized by the emission probability matrix B = [b_jk], where

b_jk = P(y_t = k | x_t = j)

is the conditional probability of the observation being k when the state is j. This represents a second simplifying assumption that the observation is effectively independent of prior states or observations. To predict the evolution of the system, it is necessary to characterize the initial state of the system x_0.
π_j = P(x_0 = j)

The dynamics of an HMM model are thus represented by the two matrices A and B and an initial state probability vector π.
7.10.1 Example Problem
Let the system under study be a lane of road with a sensor to count traffic flow (number of vehicles passing
the sensor in a five minute period). As a simple example, let the state of the road be whether or not there
is an accident ahead. In other words, the state of road is either 0 (No-accident) or 1 (Accident). The only
information available is the traffic counts and of course historical information for training an HMM model.
Suppose the chance of an accident ahead is 10%.
π = [0.9, 0.1]
From historical information, two transition probabilities are estimated: the first is for the transition from
no accident to accident which is 20%; the second from accident to no-accident state (i.e., the accident has
been cleared) which is 50% (i.e., probability of one half that the accident will be cleared by the next time
increment). The number of states n = 2. Therefore, the state-transition probability matrix A is

A = [ 0.8  0.2
      0.5  0.5 ]

As A maps states to states, A is an n-by-n matrix.
Clearly, the state will influence the traffic flow (tens of cars per 5 minutes) with possible values of 0, 1,
2, 3. The number of observed values m = 4. Again from historical data the emission probability matrix B
is estimated to be

B = [ 0.1  0.2  0.3  0.4
      0.5  0.2  0.2  0.1 ]

As B maps states to observed values, B is an n-by-m matrix.
One question to address is, given a time series (observations sequence), what corresponding sequence of
states gives the highest probability of occurrence to the observed sequences.
y = [3, 3, 0]
P(NNN, y) = π_0 · b_03 · a_00 · b_03 · a_00 · b_00 = 0.9 · 0.4 · 0.8 · 0.4 · 0.8 · 0.1 = 0.009216
P(NNA, y) = π_0 · b_03 · a_00 · b_03 · a_01 · b_10 = 0.9 · 0.4 · 0.8 · 0.4 · 0.2 · 0.5 = 0.011520
P(NAN, y) = π_0 · b_03 · a_01 · b_13 · a_10 · b_00 = 0.9 · 0.4 · 0.2 · 0.1 · 0.5 · 0.1 = 0.000360
P(NAA, y) = π_0 · b_03 · a_01 · b_13 · a_11 · b_10 = 0.9 · 0.4 · 0.2 · 0.1 · 0.5 · 0.5 = 0.001800
P(ANN, y) = π_1 · b_13 · a_10 · b_03 · a_00 · b_00 = 0.1 · 0.1 · 0.5 · 0.4 · 0.8 · 0.1 = 0.000160
P(ANA, y) = π_1 · b_13 · a_10 · b_03 · a_01 · b_10 = 0.1 · 0.1 · 0.5 · 0.4 · 0.2 · 0.5 = 0.000200
P(AAN, y) = π_1 · b_13 · a_11 · b_13 · a_10 · b_00 = 0.1 · 0.1 · 0.5 · 0.1 · 0.5 · 0.1 = 0.000025
P(AAA, y) = π_1 · b_13 · a_11 · b_13 · a_11 · b_10 = 0.1 · 0.1 · 0.5 · 0.1 · 0.5 · 0.5 = 0.000125
The state giving the highest probability is x = N N A. The marginal probability of the observed sequence
P (y) can be computed by summing over all eight states.
P(y) = ∑_x P(x, y) = 0.023406
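A minimal brute-force sketch of this enumeration (plain Scala; feasible only for tiny state spaces and short sequences):

// a minimal sketch: enumerate all state sequences for y = [3, 3, 0]
val pi = Array (0.9, 0.1)
val a  = Array (Array (0.8, 0.2), Array (0.5, 0.5))         // state transitions
val b  = Array (Array (0.1, 0.2, 0.3, 0.4),                 // emissions for state N
                Array (0.5, 0.2, 0.2, 0.1))                 // emissions for state A
val y  = Array (3, 3, 0)

val joint = for x0 <- 0 to 1; x1 <- 0 to 1; x2 <- 0 to 1 yield
    (Seq (x0, x1, x2),
     pi(x0) * b(x0)(y(0)) * a(x0)(x1) * b(x1)(y(1)) * a(x1)(x2) * b(x2)(y(2)))
println (joint.maxBy (_._2))                                // best: NNA with 0.01152
println (joint.map (_._2).sum)                              // P(y) = 0.023406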
The algorithms given in the subsections below are adapted from [181]. For these algorithms, we divide
the time series/sequence of observations into two parts (past and future).
y_t− = [y_0, y_1, . . . , y_t]
y_t+ = [y_{t+1}, y_{t+2}, . . . , y_{T−1}]
They allow one to calculate (1) the probability of arriving in a state at time t with observations yt− , (2) the
conditional probability of seeing future observations yt+ from a given state at time t, and (3) the conditional
probability of being in a state at time t given all the observations y.
7.10.2 Forward Algorithm
The forward algorithm (α-pass) computes the A (alpha) matrix. The probability of being in state j at time t, having seen the past observations y_t−, is

α_tj = b_{j,y_t} ∑_{i=0}^{n−1} α_{t−1,i} a_ij = b_{j,y_t} [α_{t−1} · a_:j]
To get to state j at time t, the system must transition from some state i at time t − 1 and at time t emit
the value yt . These values may be saved in a T -by-n matrix A = [αtj ] and efficiently computed by moving
forward in time.
def forwardEval0 (): MatrixD =
    for j <- rstate do alp(0, j) = pi(j) * b(j, y(0))       // compute alpha_0 (at t = 0)
    for t <- 1 until tt; j <- rstate do                     // iterate over time and states
        alp(t, j) = b(j, y(t)) * (alp(t-1) dot a(?, j))
    end for
    alp
end forwardEval0
The marginal probability is now simply the sum of the elements in the last row of the α matrix.

P(y) = ∑_{j=0}^{n−1} α_{T−1,j}
ScalaTion also provides a forwardEval method that uses scaling to avoid underflow.
7.10.3 Backward Algorithm
The backward algorithm (β-pass) computes the B matrix. The conditional probability of having future
observations after time t (y = yt+ ) given the current state xt = i is
β_ti = ∑_{j=0}^{n−1} a_ij b_{j,y_{t+1}} β_{t+1,j}
From state i at time t, the system must transition to some state j at time t + 1 and at time t + 1 emit the
value yt+1 . These values may be saved in a T -by-n matrix B = [βti ] and efficiently computed by moving
backward in time.
def backwardEval0 (): MatrixD =
    for i <- rstate do bet(tt-1, i) = 1.0                   // initialize beta_{tt-1} to 1
    for t <- tt-2 to 0 by -1; i <- rstate do                // iterate back over time, over states
        bet(t, i) = 0.0
        for j <- rstate do bet(t, i) += a(i, j) * b(j, y(t+1)) * bet(t+1, j)
    end for
    bet
end backwardEval0
ScalaTion also provides a backwardEval method that uses scaling to avoid underflow.
7.10.4 Viterbi Algorithm
The conditional probability of being in state i at time t, given all the observations (y = y), is

γ_ti = α_ti β_ti / P(y)
In ScalaTion, the Γ = [γti ] matrix is calculated using the Hadamard product.
def gamma (alp: MatrixD, bet: MatrixD): MatrixD = (alp *~ bet) / probY (alp)
The conditional probability of being in state i at time t and transitioning to state j at time t + 1 given
all observations (y = y) is
γ_tij = α_ti a_ij b_{j,y_{t+1}} β_{t+1,j} / P(y)
The Viterbi Algorithm viterbiDecode computes the Γ matrix (gam in code) from scaled versions of alp and
bet. It also computes the Γ = [γtij ] tensor (gat in code).
7.10.5 Training
The train method will call forwardEval, backwardEval and viterbiDecode to calculate updated values
for the A, B and Γ matrices as well as for the Γ tensor. These values are used to re-estimate the π, A and
B parameters.
@param x_  the training/full data/input matrix (ignored)
@param y_  the training/full response/output vector (defaults to full y)
The training loop will terminate early when there is no improvement to P(y). To avoid underflow, −log P(y) is used.
The π vector can be re-estimated as follows:

π_i = α_0i / b_{i,y_0} = γ_0i
The A matrix can be re-estimated as follows:

a_ij = ∑_{t=0}^{T−2} γ_tij / ∑_{t=0}^{T−2} γ_ti
The B matrix can be re-estimated as follows:

b_ik = ∑_{t=0}^{T−1} I_{y_t = k} γ_ti / ∑_{t=0}^{T−1} γ_ti
The detailed derivations are left to the exercises.
Class Methods:
@param y       the observation vector/observed discrete-valued time series
@param m       the number of observation symbols/values {0, 1, ..., m-1}
@param n       the number of (hidden) states in the model
@param cname_  the class names for the states, e.g., ("Hot", "Cold")
@param pi      the probability vector for the initial state
@param a       the state transition probability matrix (n-by-n)
@param b       the observation probability matrix (n-by-m)
@param hparam  the hyper-parameters
7.10.8 Exercises
1. Show that for t ∈ {0, . . . , T − 2},

γ_ti = ∑_{j=0}^{n−1} γ_tij
2. Show that

π_i = α_0i / b_{i,y_0} = γ_0i
3. Show that

a_ij = ∑_{t=0}^{T−2} γ_tij / ∑_{t=0}^{T−2} γ_ti
4. Show that

b_ik = ∑_{t=0}^{T−1} I_{y_t = k} γ_ti / ∑_{t=0}^{T−1} γ_ti
Further Reading

2. Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition [205]
Chapter 8
For the problems in this chapter, the response/classification variable is still discrete, but some or all of the feature variables are now continuous. Technically, classification problems fit in this category when it is infeasible or nonproductive to compute frequency counts for all values of a variable (e.g., for x_j, the value count vc_j = ∞). If a classification problem almost fits in the previous chapter, one may consider the use of binning to convert numerical variables into categorical variables (e.g., convert weight into weight classes). Care should be taken, since binning introduces hidden parameters into the model and arbitrary choices may influence results.
8.1 Gaussian Naïve Bayes
The NaiveBayesR class implements a Gaussian Naïve Bayes Classifier, which is the most commonly used such classifier for continuous input data. The classifier is trained using a data matrix X and a classification vector y. Each data vector in the matrix is classified into one of k classes numbered 0, 1, . . . , k − 1.
Class probabilities are calculated based on the population of each class in the training set. Relative probabilities are computed by multiplying these by values computed using conditional density functions based on the Normal (Gaussian) distribution. The classifier is naïve, because it assumes feature independence and therefore simply multiplies the conditional densities.
Starting with the main result from the section on Naïve Bayes (equation 4.5),

    \hat{y} = \operatorname{argmax}_{y \in \{0,\dots,k-1\}} P(y) \prod_{j=0}^{n-1} P(x_j | y)    (8.1)
if all the variables xj are continuous, we may switch from conditional probabilities P (xj |y) to conditional
densities f (xj |y). The best prediction for class y is the value ŷ that maximizes the product of the conditional
densities multiplied by the class probability.
    \hat{y} = \operatorname{argmax}_{y \in \{0,\dots,k-1\}} P(y) \prod_{j=0}^{n-1} f(x_j | y)    (8.2)
Although the formula assumes the conditional independence of the x_j's, the technique can be applied as long as correlations are not too high.
Using the Gaussian assumption, the conditional density of x_j given y is approximated by estimating the two parameters of the Normal distribution,

    x_j \,|\, (y = c) \sim \text{Normal}(\mu_c, \sigma_c^2)    (8.3)

where class c ∈ {0, 1, . . . , k − 1}, \mu_c = E[x|y=c] and \sigma_c^2 = V[x|y=c]. Thus, the conditional density function is

    f(x_j | y = c) = \frac{1}{\sqrt{2\pi}\,\sigma_c}\, e^{-(x-\mu_c)^2 / 2\sigma_c^2}    (8.4)
Class probabilities P(y = c) may be estimated as m_c/m, where m_c is the frequency count of the number of occurrences of c in the class vector y. Conditional densities are needed for each of the k class values, for each of the n variables (each x_j), i.e., kn are needed. Corresponding means and variances may be estimated as follows:

    \hat{\mu}_{cj} = \frac{1}{m_c} \sum_{i=0}^{m-1} (x_{ij} \,|\, y_i = c)    (8.5)

    \hat{\sigma}_{cj}^2 = \frac{1}{m_c - 1} \sum_{i=0}^{m-1} ((x_{ij} - \hat{\mu}_{cj})^2 \,|\, y_i = c)    (8.6)
Using conditional density (cd) functions estimated in the train method (see code for details), an input
vector z can be classified using the predictI or classify method.
    override def predictI (z: VectorD): Int =
        for c <- 0 until k; j <- x.indices2 do p_yz(c) *= cd(c)(j)(z(j))
        p_yz.argmax ()                                           // return class with highest probability
    end predictI
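For illustration, the per-class means and variances of equations 8.5 and 8.6 might be computed along the following lines. This is a sketch under assumed ScalaTion conventions (MatrixD element update, VectorD's mean and normSq), not the actual train code:

    def estimateStats (x: MatrixD, y: VectorI, k: Int): (MatrixD, MatrixD) =
        val mu   = new MatrixD (k, x.dim2)                       // mu(c, j)   = mean of x_j given y = c
        val sig2 = new MatrixD (k, x.dim2)                       // sig2(c, j) = variance of x_j given y = c
        for c <- 0 until k; j <- 0 until x.dim2 do
            val xcj = VectorD (for i <- y.indices if y(i) == c yield x(i, j))
            mu(c, j)   = xcj.mean
            sig2(c, j) = (xcj - xcj.mean).normSq / (xcj.dim - 1)
        (mu, sig2)
    end estimateStats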
Class Methods:
    @param x       the real-valued data vectors stored as rows of a matrix
    @param y       the class vector, where y_i = class for row i of the matrix x, x(i)
    @param fname_  the names for all features/variables (defaults to null)
    @param k       the number of classes (defaults to 2)
    @param cname_  the names for all classes
    @param hparam  the hyper-parameters
8.1.2 Exercises
1. Use NaiveBayesR to classify manufactured parts according to whether they should pass quality control based on curvature and diameter tolerances. See people.revoledu.com/kardi/tutorial/LDA/Numerical%20Example.html for details.

    // features/variables:
    //     x1: curvature
    //     x2: diameter
    //     y:  classification: pass (0), fail (1)
    //                          x1    x2   y
    val xy = MatrixD ((7, 3), 2.95, 6.63, 0,                     // joint data matrix
                              2.53, 7.79, 0,
                              3.57, 5.65, 0,
                              3.16, 5.47, 0,
                              2.58, 4.46, 1,
                              2.16, 6.22, 1,
                              3.27, 3.52, 1)

    val fname = Array ("curvature", "diameter")                  // feature names
    val cname = Array ("pass", "fail")                           // class names
    val nbr   = NaiveBayesR (xy, fname, 2, cname) ()             // create NaiveBayesR nbr
8.2 Simple Logistic Regression
The SimpleLogisticRegression class supports simple logistic regression. In this case, the predictor vector x is two-dimensional [1, x_1]. Again, the goal is to fit the parameter vector b in the regression equation

    y = b \cdot x + \epsilon = b_0 + b_1 x_1 + \epsilon    (8.7)

where ε represents the residuals (the part not explained by the model). This looks like simple linear regression, with the difference being that the response variable y is binary (y ∈ {0, 1}). Since y is binary, minimizing the distance, as was done before, may not work well. First, instead of focusing on y ∈ {0, 1}, we focus on the conditional probability of success

    p_y(x) = P(y = 1 | x) \in [0, 1]    (8.8)

For example, the random variable y could be used to indicate whether a customer will pay back a loan (1) or not (0). The predictor variable x_1 could be the customer's FICO score.
A naïve model would treat this probability as a linear function, p_y(x) = b_0 + b_1 x_1, but the left-hand side is bounded in [0, 1] while the right-hand side is not. Instead, the linear combination is passed through the logistic function

    \text{logistic}(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}    (8.9)
Letting z = b_0 + b_1 x_1, we obtain

    p_y(x) = \text{logistic}(b_0 + b_1 x_1) = \frac{e^{b_0 + b_1 x_1}}{1 + e^{b_0 + b_1 x_1}}    (8.10)

or in vector form,

    p_y(x) = \frac{e^{b \cdot x}}{1 + e^{b \cdot x}}    (8.11)
Multiplying through by 1 + e^{b·x} and solving gives

    e^{b \cdot x} = \frac{p_y(x)}{1 - p_y(x)}    (8.13)

Taking the natural logarithm of both sides gives

    \ln \frac{p_y(x)}{1 - p_y(x)} = b \cdot x = b_0 + b_1 x_1    (8.14)
where the function on the left hand side is called the logit function.
Putting the model in this form shows it is a special case of a Generalized Linear Model (see the Chapter on
Generalized Linear Models) and will be useful in the estimation procedure.
Second, instead of minimizing a sum of squared errors, the parameters are estimated by maximizing the likelihood of the parameter vector b given the data,

    L(b | X, y)    (8.16)

In this case, y ∈ {0, 1}, so if we estimate the likelihood for a single data instance (or row), we have

    L_i = p_y(x_i)^{y_i} (1 - p_y(x_i))^{1 - y_i}    (8.17)

If y_i = 1, then L_i = p_y(x_i) and otherwise L_i = 1 − p_y(x_i). These are the probabilities for the two outcomes for a Bernoulli random variable (and equation 8.17 concisely captures both).
For each instance i ∈ {0, . . . , m − 1}, a similar factor is created. These are multiplied together for all the instances (in the dataset, or training or testing set). The likelihood of b given the predictor matrix X and the response vector y is then

    L(b | X, y) = \prod_{i=0}^{m-1} p_y(x_i)^{y_i} (1 - p_y(x_i))^{1 - y_i}    (8.18)
8.2.6 Log-likelihood Function
To reduce round-off errors, a log (e.g., natural log, ln) is taken

    l(b | X, y) = \sum_{i=0}^{m-1} y_i \ln(p_y(x_i)) + (1 - y_i) \ln(1 - p_y(x_i))    (8.19)

    l(b | X, y) = \sum_{i=0}^{m-1} y_i \ln \frac{p_y(x_i)}{1 - p_y(x_i)} + \ln(1 - p_y(x_i))    (8.20)

    l(b | X, y) = \sum_{i=0}^{m-1} y_i\, b \cdot x_i + \ln(1 - p_y(x_i))    (8.21)

Now substituting p_y(x_i) = \frac{e^{b \cdot x_i}}{1 + e^{b \cdot x_i}} gives

    l(b | X, y) = \sum_{i=0}^{m-1} y_i\, b \cdot x_i - \ln(1 + e^{b \cdot x_i})    (8.22)
Multiplying the log-likelihood by −2 makes the distribution approximately Chi-square (see Wilks' Theorem [203]).

    -2l = -2 \sum_{i=0}^{m-1} y_i\, b \cdot x_i - \ln(1 + e^{b \cdot x_i})    (8.23)

Or since b = [b_0, b_1],

    -2l = -2 \sum_{i=0}^{m-1} y_i (b_0 + b_1 x_{i1}) - \ln(1 + e^{b_0 + b_1 x_{i1}})    (8.24)

Letting \beta_i = b_0 + b_1 x_{i1},

    -2l = -2 \sum_{i=0}^{m-1} y_i \beta_i - \ln(1 + e^{\beta_i})    (8.25)

    -2l = -2 \sum_{i=0}^{m-1} y_i \beta_i - \beta_i - \ln(e^{-\beta_i} + 1)    (8.26)
    def ll (b: VectorD): Double =
        var sum = 0.0
        for i <- y.indices do
            val bx = b(0) + b(1) * x(i, 1)
            sum += y(i) * bx - bx - log (exp (-bx) + 1.0)        // per equation 8.26
        end for
        -2.0 * sum
    end ll
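Since ll is just a scalar function of b, a coarse grid search (as asked for in the exercises) can locate an approximate minimizer. A sketch with hypothetical grid ranges:

    var bBest = VectorD (0.0, 0.0)
    var fBest = Double.MaxValue
    for i <- 0 to 200; j <- 0 to 200 do
        val b = VectorD (-10.0 + 0.1 * i, -2.0 + 0.02 * j)       // candidate (b0, b1)
        val f = ll (b)
        if f < fBest then { bBest = b; fBest = f }
    println (s"grid-optimal b = $bBest with -2l = $fBest")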
In some cases, maximizing the likelihood may result in an imbalance between false positives and false negatives. Quality of Fit (QoF) measures may improve by tuning the classification/decision threshold cThresh. Decreasing the threshold pushes false negatives to false positives. Increasing the threshold does the opposite. Ideally, tuning the threshold will also push more cases into the diagonal of the confusion matrix and minimize errors. Finally, in some cases it may be more important to reduce one more than the other, false negatives vs. false positives (see the exercises).
Class Methods:
    @param x       the input/design matrix augmented with a first column of ones
    @param y       the binary response vector, y_i in {0, 1}
    @param fname_  the names for all features/variables
    @param cname_  the names for both classes
    @param hparam  the hyper-parameters
    Seq ("n_dev", "r_dev", "aic")
    override def predictI (z: VectorD): Int =
    override def predictI (z: VectorI): Int = predictI (z.toDouble)
    override def summary (x_ : MatrixD = null, fname_ : Array [String] = null,
                          b_ : VectorD = b, vifs: VectorD = null): String =
8.2.10 Exercises
1. Plot the standard logistic function (sigmoid for scalars, sigmoid_ for vectors).

    import scalation.modeling.ActivationFun.sigmoid_
    val z  = VectorD.range (0, 160) / 10.0 - 8.0
    val fz = sigmoid_ (z)
    new Plot (z, fz)
2. For the mtcars dataset, determine the model parameters b_0 and b_1 directly (i.e., do not call train). Rather, perform a grid search for a minimal value of the ll function. Use the x matrix (one, mpg) and y vector (V/S) from SimpleLogisticRegressionTest.

    // 32 data points:        One   Mpg
    val x = MatrixD ((32, 2), 1.0, 21.0,                         //  1 - Mazda RX4
                              1.0, 21.0,                         //  2 - Mazda RX4 Wag
                              1.0, 22.8,                         //  3 - Datsun 710
                              1.0, 21.4,                         //  4 - Hornet 4 Drive
                              1.0, 18.7,                         //  5 - Hornet Sportabout
                              1.0, 18.1,                         //  6 - Valiant
                              1.0, 14.3,                         //  7 - Duster 360
                              1.0, 24.4,                         //  8 - Merc 240D
                              1.0, 22.8,                         //  9 - Merc 230
                              1.0, 19.2,                         // 10 - Merc 280
                              1.0, 17.8,                         // 11 - Merc 280C
                              1.0, 16.4,                         // 12 - Merc 450SE
                              1.0, 17.3,                         // 13 - Merc 450SL
                              1.0, 15.2,                         // 14 - Merc 450SLC
                              1.0, 10.4,                         // 15 - Cadillac Fleetwood
                              1.0, 10.4,                         // 16 - Lincoln Continental
                              1.0, 14.7,                         // 17 - Chrysler Imperial
                              1.0, 32.4,                         // 18 - Fiat 128
                              1.0, 30.4,                         // 19 - Honda Civic
                              1.0, 33.9,                         // 20 - Toyota Corolla
                              1.0, 21.5,                         // 21 - Toyota Corona
                              1.0, 15.5,                         // 22 - Dodge Challenger
                              1.0, 15.2,                         // 23 - AMC Javelin
                              1.0, 13.3,                         // 24 - Camaro Z28
                              1.0, 19.2,                         // 25 - Pontiac Firebird
                              1.0, 27.3,                         // 26 - Fiat X1-9
                              1.0, 26.0,                         // 27 - Porsche 914-2
                              1.0, 30.4,                         // 28 - Lotus Europa
                              1.0, 15.8,                         // 29 - Ford Pantera L
                              1.0, 19.7,                         // 30 - Ferrari Dino
                              1.0, 15.0,                         // 31 - Maserati Bora
                              1.0, 21.4)                         // 32 - Volvo 142E

    // V/S (e.g., V-6 vs. I-4)
    val y = VectorI (0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
                     0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1)
4. If the treatment for a disease is risky and consequences of having the disease are minimal, would you
prefer to focus on reducing false positives or false negatives?
5. If the treatment for a disease is safe and consequences of having the disease may be severe, would you
prefer to focus on reducing false positives or false negatives?
8.3 Logistic Regression
The LogisticRegression class supports logistic regression. In this case, x may be multi-dimensional [1, x_1, . . . , x_k]. Again, the goal is to fit the parameter vector b in the regression equation

    y = b \cdot x + \epsilon = b_0 + b_1 x_1 + \ldots + b_k x_k + \epsilon    (8.27)

where ε represents the residuals (the part not explained by the model). This looks like multiple linear regression, the difference being that the response variable y is binary (y ∈ {0, 1}). Since y is binary, minimizing the distance, as was done before, may not work well. First, instead of focusing on y ∈ {0, 1}, we focus on the conditional probability of success

    p_y(x) = P(y = 1 | x) \in [0, 1]    (8.28)

Still, p_y(x) is bounded, while b·x is not. We therefore need a transformation, such as the logit transformation, and fit b·x to this function. Treating this as a Generalized Linear Model problem,

    y = \mu(x) + \epsilon    (8.29)

    g(\mu(x)) = b \cdot x    (8.30)

    \text{logit}(\mu(x)) = \ln \frac{p_y(x)}{1 - p_y(x)} = b \cdot x    (8.31)
This is the logit regression equation. Second, instead of minimizing the sum of squared errors, we wish to maximize the likelihood of predicting correct outcomes. For the i-th training case x_i with outcome y_i, the likelihood function is based on the Bernoulli distribution.
The overall likelihood function is the product over all m cases. The equation is the same as 8.18 from the last section.

    L(b | X, y) = \prod_{i=0}^{m-1} p_y(x_i)^{y_i} (1 - p_y(x_i))^{1 - y_i}    (8.33)

Following the same derivation steps gives the same log-likelihood as in equation 8.22.

    l(b | X, y) = \sum_{i=0}^{m-1} y_i\, b \cdot x_i - \ln(1 + e^{b \cdot x_i})    (8.34)
Again, multiplying the log-likelihood function by −2 makes the distribution approximately Chi-square.

    -2l = -2 \sum_{i=0}^{m-1} y_i\, b \cdot x_i - \ln(1 + e^{b \cdot x_i})    (8.35)
The likelihood can be maximized by minimizing −2l, which is a nonlinear function of the parameter vector b. Various optimization techniques may be used to search for optimal values for b. Currently, ScalaTion uses BFGS, a popular general-purpose Quasi-Newton NLP solver. Other possible optimizers include L-BFGS and IRWLS. For a more detailed derivation, see http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf.
Class Methods:
    @param x       the input/design matrix augmented with a first column of ones
    @param y       the binary response vector, y_i in {0, 1}
    @param fname_  the names for all features/variables (defaults to null)
    @param cname_  the names for both classes
    @param hparam  the hyper-parameters
8.3.2 Exercises
1. Use LogisticRegression to classify whether the stock market will be increasing or not. The Smarket dataset is in the ISLR library, see [85] section 4.6.2.
2. Use LogisticRegression to classify whether a customer will purchase caravan insurance. The Caravan dataset is in the ISLR library, see [85] section 4.6.6.
8.4 Simple Linear Discriminant Analysis
The SimpleLDA class supports Linear Discriminant Analysis, which is useful for multiway classification of continuously valued data. The response/classification variable can take on k possible values, y ∈ {0, 1, . . . , k − 1}. The feature variable x is one-dimensional for SimpleLDA, but can be multi-dimensional for LDA, discussed in the next section. Given the data about an instance stored in variable x, pick the best (most probable) classification y = c.
As was done for Naïve Bayes classifiers, we are interested in the probability of y given x.

    P(y | x) = \frac{P(x | y)\, P(y)}{P(x)}    (8.36)

Since x is now continuous, we need to work with conditional densities, as is done in Gaussian Naïve Bayes classifiers,

    P(y | x) = \frac{f(x | y)\, P(y)}{f(x)}    (8.37)

where

    f(x) = \sum_{c=0}^{k-1} f(x | y = c)\, P(y = c)    (8.38)
Now let us assume the conditional probabilities are normally distributed with a common variance,

    x \,|\, (y = c) \sim \text{Normal}(\mu_c, \sigma^2)    (8.39)

where class c ∈ {0, 1, . . . , k − 1}, \mu_c = E[x|y=c] and \sigma^2 is the pooled variance (weighted average of V[x|y=c]). Thus, the conditional density function is

    f(x | y = c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu_c)^2 / 2\sigma^2}    (8.40)
Substituting into equation 8.37 gives

    P(y | x) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu_c)^2 / 2\sigma^2}\, P(y)}{f(x)}    (8.41)

where

    f(x) = \sum_{c=0}^{k-1} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu_c)^2 / 2\sigma^2}\, P(y = c)    (8.42)
Because of differing means, each conditional density will be shifted, resulting in a mountain range appearance when plotted together. Given a data point x, the question becomes, which mountain is it closest to, in the sense of maximizing the conditional probability expressed in equation 8.41.

    P(y | x) \propto \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu_c)^2 / 2\sigma^2}\, P(y)    (8.43)

Since the term \frac{1}{\sqrt{2\pi}\,\sigma} is the same for all values of y, it may be ignored. Taking the natural logarithm yields
    \ln(P(y | x)) \propto \frac{-(x - \mu_c)^2}{2\sigma^2} + \ln(P(y))    (8.44)

Expanding −(x − µ_c)² gives −x² + 2xµ_c − µ_c², and the first term may be ignored (same for all y).

    \ln(P(y | x)) \propto \frac{x \mu_c}{\sigma^2} - \frac{\mu_c^2}{2\sigma^2} + \ln(P(y))    (8.45)
The right-hand side functions in equation 8.45 are linear in x and are called discriminant functions δ_c(x).
Given training data vectors x and y, define x_c (or xc in the code) to be the vector of all x_i values where y_i = c and let its length be denoted by m_c. Now the k means may be estimated as follows:

    \hat{\mu}_c = \frac{1 \cdot x_c}{m_c}    (8.46)
The common variance may be estimated using a pooled variance estimator.

    \hat{\sigma}^2 = \frac{1}{m - k} \sum_{c=0}^{k-1} \|x_c - \mu_c\|^2    (8.47)

Finally, m_c/m can be used to estimate P(y = c).
These can easily be translated into ScalaTion code. Most of the calculations are done in the train method. It estimates the class probability vector p_y, the group means vector mu and the pooled variance sig2. The vectors term1 and term2 capture the x-term (µ_c/σ²) and the constant term (µ_c²/2σ² − ln(P(y))) in equation 8.45.
    override def train (x_ : MatrixD = x, y_ : VectorI = y): Unit =
        val xc = for c <- 0 until k yield                              // groups for x
            VectorD (for i <- y_.indices if y_(i) == c yield x_(i, 0)) // group c (single-column matrix)
        p_y = VectorD (xc.map (_.dim / y.dim.toDouble))                // probability y = c
        mu  = VectorD (xc.map (_.mean))                                // group means
        var sum = 0.0
        for c <- 0 until k do sum += (xc(c) - mu(c)).normSq
        sig2  = sum / (m - k).toDouble                                 // pooled variance
        term1 = mu / sig2
        term2 = mu ~^ 2 / (2.0 * sig2) - p_y.map (log (_))
    end train
Given the two precomputed terms, the classify method simply multiplies the first by z(0) and subtracts the second. Then it finds the argmax of the delta vector to return the class with the maximum delta, which corresponds to the most probable classification.

    \hat{y} = \operatorname{argmax}_c\; \frac{z \mu_c}{\sigma^2} - \frac{\mu_c^2}{2\sigma^2} + \ln(P(y))    (8.48)
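Putting the pieces together, the classification step can be sketched as follows, using the vectors computed in train (a sketch, not ScalaTion's exact classify code):

    def classifySketch (z: VectorD): Int =
        val delta = term1 * z(0) - term2                         // discriminant value for each class
        delta.argmax ()                                          // most probable class
    end classifySketch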
8.4.1 SimpleLDA Class
Class Methods:
    @param x       the input/design matrix with only one column
    @param y       the response/classification vector, y_i in {0, 1}
    @param fname_  the name for the feature/variable
    @param k       the number of possible values for y (0, 1, ... k-1)
    @param cname_  the names for all classes
    @param hparam  the hyper-parameters

    class SimpleLDA (x: MatrixD, y: VectorI, fname_ : Array [String] = Array ("x1"),
                     k: Int = 2, cname_ : Array [String] = Array ("No", "Yes"),
                     hparam: HyperParameter = Classifier.hp)
          extends Classifier (x, y, fname_, k, cname_, hparam)
             with FitC (y, 2):
8.4.2 Exercises
1. Generate two samples using Normal (98.6, 1.0) and Normal (101.0, 1.0) with 100 in each sample.
Put the data instances into a single x vector. Let the y vector be 0 for the first sample and 1 for the
second. Use SimpleLDA to classify all 200 data points and determine the values for tp, tn, fn and
fp. See scalation.modeling.classifying.simpleLDATest2.
8.5 Linear Discriminant Analysis
Like SimpleLDA, the LDA class supports Linear Discriminant Analysis, which is used for multiway classification of continuously valued data. Similarly, the response/classification variable can take on k possible values, y ∈ {0, 1, . . . , k − 1}. Unlike SimpleLDA, this class is intended for cases where the feature vector x is multi-dimensional. The classification y = c is chosen to maximize the conditional probability of class y given the n-dimensional data/feature vector x.

    P(y | x) = \frac{f(x | y)\, P(y)}{f(x)}    (8.49)
where

    f(x) = \sum_{c=0}^{k-1} f(x | y = c)\, P(y = c)
In the multi-dimensional case, x|y has a multivariate Gaussian distribution, Normal(µ_c, Σ), where µ_c is the mean vector E[x|y=c] and Σ is the common covariance matrix (weighted average of C[x|y=c]). The conditional density function is given by

    f(x | y = c) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_c)^\top \Sigma^{-1} (x - \mu_c)}
The discriminant functions are obtained by multiplying out and again dropping terms independent of c.

    \delta_c(x) = x^\top \Sigma^{-1} \mu_c - \frac{\mu_c^\top \Sigma^{-1} \mu_c}{2} + \ln(P(y = c))    (8.50)
As in the last section, the means for each class c (µc ), the common covariance matrix (Σ), and the class
probabilities (P (y)) must be estimated.
Class Methods:
    @param x       the real-valued training/test data vectors stored as rows of a matrix
    @param y       the training/test classification vector, where y_i = class for row i
    @param fname_  the names for all features/variables (defaults to null)
    @param k       the number of classes (k in {0, 1, ... k-1})
    @param cname_  the names for all classes
    @param hparam  the hyper-parameters

                   hparam: HyperParameter = Classifier.hp)
          extends Classifier (x, y, fname_, k, cname_, hparam)
             with FitC (y, k):
8.5.2 Exercises
1. Use LDA to classify manufactured parts according to whether they should pass quality control based on curvature and diameter tolerances. See people.revoledu.com/kardi/tutorial/LDA/Numerical%20Example.html for details.
8.6 K-Nearest Neighbors Classifier
The KNN_Classifier class is used to classify a new vector z into one of k classes y ∈ {0, 1, . . . , k − 1}. It works by finding the κ nearest neighbors to the point z. These neighbors essentially vote according to their classification. The class with the most votes is selected as the classification of vector z. Using a distance metric, the κ vectors nearest to z are found in the training data, which are stored row-wise in data matrix X. The corresponding classifications are given in vector y, such that the classification for vector x_i is given by y_i.
In ScalaTion, to avoid the overhead of calling sqrt, the square of the Euclidean distance is used (although other metrics can easily be swapped in). The squared distance from vector x to vector z is then

    d(x, z) = \|x - z\|^2 = (x - z) \cdot (x - z)

The distance metric is used to collect the κ nearest vectors into the set top_κ(z), such that there does not exist any vector x_j ∉ top_κ(z) that is closer to z.
In case of ties for the most distant point to include in top_κ(z), one could pick the first point encountered or the last point. A less biased approach would be to randomly break the tie.
Now y(top_κ(z)) can be defined to be the vector of votes from the members of the set, e.g., y(top_3(z)) = [1, 0, 1]. The ultimate classification is then simply the mode (most frequent value) of this vector (e.g., 1 in this case).
The kNearest method finds the κ x-vectors closest to the given vector z. This method updates topK by replacing the most distant x vector in topK with a new one if it is closer. Each element in the topK array is a tuple (j, d(j)) indicating which vector was selected and its distance from z. Each of these selected vectors will have its vote taken, voting for the class for which it is labeled. These votes are tallied in the count vector. The class with the highest count will be selected as the best class.
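A minimal sketch of the whole κ-NN classification step, using assumed ScalaTion vector operations (normSq, argmax) rather than the actual kNearest code:

    def knnPredict (x: MatrixD, y: VectorI, z: VectorD, k: Int, kappa: Int): Int =
        val dist  = for i <- x.indices yield (i, (x(i) - z).normSq)  // squared distances to z
        val topK  = dist.sortBy (_._2).take (kappa)                  // the kappa nearest rows
        val count = new VectorI (k)                                  // votes per class
        for (i, _) <- topK do count(y(i)) += 1
        count.argmax ()                                              // class with the most votes
    end knnPredict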
Class Methods:
    @param x       the input/data matrix
    @param y       the classification of each vector in x
    @param fname_  the names for all features/variables (defaults to null)
    @param k       the number of classes (defaults to 2)
    @param cname_  the names for all classes
    @param kappa   the number of nearest neighbors to consider
    @param hparam  the hyper-parameters

    class KNN_Classifier (x: MatrixD, y: VectorI, fname_ : Array [String] = null,
                          k: Int = 2, cname_ : Array [String] = Array ("No", "Yes"),
                          kappa: Int = 3, hparam: HyperParameter = null)
          extends Classifier (x, y, fname_, k, cname_, hparam)
             with FitC (y, k):
8.6.3 Exercises
1. Create a KNN_Classifier for the joint data matrix given below and determine its tp, tn, fn, fp values upon re-classification of the data matrix. Let κ = 3. Use Leave-One-Out validation for computing tp, tn, fn, fp.

    //                         x1  x2  y
    val xy = MatrixD ((10, 3),  1,  5, 1,                        // joint data matrix
                                2,  4, 1,
                                3,  4, 1,
                                4,  4, 1,
                                5,  3, 0,
                                6,  3, 1,
                                7,  2, 0,
                                8,  2, 0,
                                9,  1, 0,
                               10,  1, 0)
2. Under what circumstances would one expect a KNN Classifier to perform better than
LogisticRegression?
8.7 Decision Tree C45
The DecisionTree_C45 class implements a Decision Tree classifier that uses the C4.5 algorithm. The classifier is trained using an m-by-n data matrix X and an m-dimensional classification vector y. Each data vector in the matrix is classified into one of k classes numbered 0, . . . , k − 1. Each column in the matrix represents a feature (e.g., Humidity). The value count vector vc gives the number of distinct values per feature (e.g., 2 for Humidity).
Depending on the data type of a column, ScalaTion's implementation of C4.5 works like ID3 unless the column is continuous. A column is flagged isCont if it is continuous or relatively large ordinal. For a column that isCont, values for the feature are split into a left group and a right group based upon whether they are ≤ or > an optimal threshold, respectively.
Candidate thresholds/split points are the midpoints between consecutive sorted column values. The threshold giving the maximum entropy drop (or gain) is the one that is chosen.
    object Example_PlayTennis_Cont:

        val fname = Array ("Outlook", "Temp", "Humidity", "Wind")   // feature/variable names
        val conts = Set (1, 2)                                      // set of continuous features
        val cname = Array ("No", "Yes")                             // class names for y
        val k     = cname.size                                      // number of classes

    end Example_PlayTennis_Cont
As with the ID3 algorithm, the C4.5 algorithm picks x_0 as the root node. This feature is not continuous and has three branches. Branch b_0 will lead to a node where, as before, x_3 is chosen. Branch b_1 will lead to a leaf node. Finally, branch b_2 will lead to a node where continuous feature x_2 is chosen.

Sub-problem x_0 = 2

Note that if x_0 = 0 or 1, the algorithm works like ID3. However, there is still some uncertainty left when x_0 = 2, so this node may be split, and it turns out the split will involve continuous feature x_2. The sub-problem for Outlook: Rain (2), see Table 8.1, is defined as follows: Take all five cases/rows in the data matrix X for which x_0 = 2.
The distinct values for feature x_2 in sorted order are the following: [70.0, 85.0, 90.0, 95.0]. Therefore, the candidate threshold/split points for continuous feature x_2 are their midpoints: [77.5, 87.5, 92.5]. Threshold 77.5 yields (0−, 2+) on the left and (3−, 0+) on the right, 87.5 yields (1−, 2+) on the left and (2−, 0+) on the right, and 92.5 yields (2−, 2+) on the left and (1−, 0+) on the right. Clearly, the best threshold value is 77.5. Since a continuous feature splits elements into low (left) and high (right) groups, rather than branching on all possible values, the same continuous feature may be chosen again by a descendant node.
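The threshold search just described can be sketched in plain Scala as follows. The Humidity values and labels below are a hypothetical reconstruction of the Outlook: Rain sub-problem, consistent with the counts given above:

    import scala.math.log

    def entropy (ys: Seq [Int]): Double =                        // entropy (log base 2)
        val m = ys.size.toDouble
        ys.groupBy (identity).values.map { g =>
            val p = g.size / m
            -p * log (p) / log (2.0) }.sum

    def bestThreshold (xj: Seq [Double], ys: Seq [Int]): (Double, Double) =
        val vals  = xj.distinct.sorted
        val cands = vals.sliding (2).map (p => (p(0) + p(1)) / 2.0).toSeq
        val m = xj.size.toDouble
        val scored = for thr <- cands yield                      // weighted entropy of each split
            val (l, r) = (xj zip ys).partition (_._1 <= thr)
            (thr, (l.size / m) * entropy (l.map (_._2)) + (r.size / m) * entropy (r.map (_._2)))
        scored.minBy (_._2)                                      // lowest entropy = largest drop

    val x2 = Seq (70.0, 70.0, 85.0, 90.0, 95.0)                  // Humidity for the 5 cases
    val y2 = Seq (1, 1, 0, 0, 0)                                 // 1 = Yes, 0 = No
    println (bestThreshold (x2, y2))                             // expect (77.5, 0.0)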
Class Methods:
    @param x       the input/data matrix with instances stored in rows
    @param y       the response/classification vector, where y_i = class for row i
    @param fname_  the names for all features/variables (defaults to null)
    @param k       the number of classes (defaults to 2)
    @param cname_  the names for all classes
    @param conts   the set of feature indices for variables that are treated as continuous
    @param hparam  the hyper-parameters for the Decision Tree classifier
8.7.3 Pruning
8.7.4 DecisionTree C45wp Class
Class Methods:
    @param x       the input/data matrix with instances stored in rows
    @param y       the response/classification vector, where y_i = class for row i
    @param fname_  the names for all features/variables (defaults to null)
    @param k       the number of classes (defaults to 2)
    @param cname_  the names for all classes
    @param conts   the set of feature indices for variables that are treated as continuous
    @param hparam  the hyper-parameters for the decision tree
8.7.5 Exercises
1. Run DecisionTree_C45 on the Example_PlayTennis dataset and verify that it produces the same answer as DecisionTree_ID3.
2. Complete the C4.5 Decision Tree for the Example_PlayTennis_Cont problem.
3. Run DecisionTree_C45 on the winequality-white dataset. Plot the accuracy and F1 measure versus the maximum tree height/depth (height).
8.8 Bagging Trees
Bootstrap does sampling with replacement, thus allowing many large sub-samples of a dataset or training set to be created. Bootstrap Aggregation (Bagging) allows a modeling technique to be applied on several sub-samples. As a decision tree is at risk of overfitting, it is a good candidate for bagging. A decision tree is created for each sub-sample. When a new data point is to be classified, each tree makes its prediction and the majority vote is taken as the overall classification. When k is larger than 2, the plurality is taken. Bagging works on the notion of "wisdom of the crowd": the consensus of several trees is more likely to be correct than the prediction of a single tree. Experimentation has borne this out.
The creation of a sub-sample is done by the subSample method in the modeling.Sampling class. It creates a random sub-sample of rows from data matrix x and elements from classification vector y, returning the sub-sample matrix and vector, as well as the indices selected irows. A simplified sketch is given below.
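The sketch below is a simplified stand-in for subSample (plain Scala arrays rather than ScalaTion's MatrixD/VectorI, and a hypothetical signature), showing the bootstrap idea:

    import scala.util.Random

    def subSample (x: Array [Array [Double]], y: Array [Int], bRatio: Double,
                   stream: Int): (Array [Array [Double]], Array [Int], Array [Int]) =
        val rng   = new Random (stream)
        val size  = (bRatio * x.length).toInt
        val irows = Array.fill (size)(rng.nextInt (x.length))    // sample row indices with replacement
        (irows.map (x(_)), irows.map (y(_)), irows)              // sub-sampled matrix, vector, indices
    end subSample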
8.8.2 Training
Training involves creating a DecisionTree_C45 classifier for each sub-sample and calling train for each tree.
8.8.3 Hyper-parameters
The following hyper-parameters can be adjusted to improve the model: The number of trees (nTrees) to create has a major effect. The bagging ratio (bRatio) is the ratio of the size of the sub-sample to the size of the dataset (or training set). Finally, the height/depth limit (height) puts a limit on the height of each decision tree.

    protected val nTrees = hparam ("nTrees").toInt
    private   val bRatio = hparam ("bRatio").toDouble
    private   val height = hparam ("height").toInt

Note that many (more than 50) trees may be needed to get good results for BaggingTrees and RandomForest.
Class Methods:
    @param x       the data matrix (instances by features)
    @param y       the response/class labels of the instances
    @param fname_  the names of the variables/features (defaults to null)
    @param k       the number of classes (defaults to 2)
    @param cname_  the names of the classes
    @param conts   the set of feature indices for variables that are treated as continuous
    @param hparam  the hyper-parameters for the bagging trees
8.9 Random Forest
The BaggingTrees class provides diversity of opinion via bootstrap sampling, and Random Forests increase the diversity by having each tree work on potentially different, yet closely related problems. Each tree selects a subset of the columns in the original data matrix.
The RandomForest class builds multiple decision trees for a given problem. Each decision tree is built using a sub-sample (rows) of the data matrix x as well as a subset of the features/variables (columns) of x. As with BaggingTrees, the fraction of rows used is given by the bagging ratio bRatio, while the new RandomForest hyper-parameter fbRatio specifies the ratio between the number of columns selected and the total number of columns. Reasonable values for these ratios are around 0.7 (70%).
Given a new instance vector z, each of the trees will classify it and the class with the most votes (one from each tree) will be the overall response of the random forest.
8.9.2 Training
Training involves creating a DecisionTree_C45 classifier for each sub-sample on selected features and calling train for each tree. The currently selected features are held in the columns variable and saved for the l-th tree in jcols. In addition, the relevant feature names (fname2) and continuous column indicators (conts2) must be extracted. A sketch of random feature selection is given after the parameter list.

    @param x_  the training/full data/input matrix (defaults to full x)
    @param y_  the training/full response/output vector (defaults to full y)
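The column selection can be sketched as follows (hypothetical, not ScalaTion's code): each tree draws a random subset of roughly fbRatio times the number of columns:

    import scala.util.Random

    def selectFeatures (nCols: Int, fbRatio: Double, stream: Int): Array [Int] =
        val rng  = new Random (stream)
        val nSel = math.max ((fbRatio * nCols).toInt, 1)         // at least one column
        rng.shuffle ((0 until nCols).toList).take (nSel).sorted.toArray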
8.9.3 RandomForest Class
Class Methods:
    @param x       the data matrix (instances by features)
    @param y       the response/class labels of the instances
    @param fname_  the feature/variable names (defaults to null)
    @param k       the number of classes (defaults to 2)
    @param cname_  the names of the classes
    @param conts   the set of feature indices for variables that are treated as continuous
    @param hparam  the hyper-parameters
8.9.4 Exercises
1. Compare DecisionTree_C45, BaggingTrees and RandomForest classifiers on the White-Wine, Breast Cancer, and Diabetes datasets. Use the default settings for the hyper-parameters.
2. Compare BaggingTrees and RandomForest classifiers on the White-Wine, Breast Cancer, and Diabetes datasets. Increase the number of trees from the low default setting (Supreme Court 9) to 51, incrementing by 2. Plot the accuracy and F1 measure versus the number of trees.
3. Compare BaggingTrees and RandomForest classifiers on the White-Wine, Breast Cancer, and Diabetes datasets. Increase the maximum height/depth of the trees from 3 to 20. Plot the accuracy and F1 measure versus the maximum tree height.
8.10 Support Vector Machine
The idea behind support vector machines is actually quite simple and can be easy to follow when x ∈ R² and y ∈ {−1, +1}. The negative responses can be indicated in a plot by "−" (or "o"), while positive responses can be indicated by "+" (or "*"). Consider the following Eight Point Example depicted in Figure 8.1, where the goal is to separate the two sets of points.

[Figure 8.1: The Eight Point Example]

Suppose the points represent tank regiments for the blue (o) and green (*) armies. Your mission as a peace-keeping force is to split the two armies. Naturally, one should try to maximize the distance to any tank regiment. At this point, the peace-keeping force is to be placed in a line, although this can be relaxed later on.
The first question is to determine the equation of the line. This can be done by lining a ruler up on the front of one army and moving it toward the other army and stopping in the middle. Clearly that line is x + y = 7. This can be verified by computing the distance to each of the points and observing that the minimum distance orthogonal to the line for each army is \frac{1}{\sqrt{2}} = 0.7071. This distance is referred to as the half margin, and the goal is to maximize it. From a data science perspective, it makes sense to maximize the margin, as any point below the split line will be classified as −1, while any point above the split line will be classified as +1.
Notice that only five of the eight points are relevant in determining the position/equation of the middle separating line (see the points on the red lines in Figure 8.2). These may be viewed as front-line points or, more generally, as support vectors.
[Figure 8.2: Maximizing the Margin]

The separating line (more generally, hyperplane) may be written as

    w \cdot x - b = 0    (8.51)

where w is a vector normal to the hyperplane and b is an offset. For the example, w = [1, 1] and b = 7. Note that w is the gradient of f(x) = w · x − b w.r.t. x.
The minimum directional distance from the hyperplane to the origin 0 is given by

    \frac{b}{\|w\|}    (8.52)

and equals \frac{7}{\sqrt{2}} = 4.9497 for the example problem. Subtracting this value would move the hyperplane to the origin.
The equations for the two red lines/hyperplanes are given by

    w \cdot x - b = -1    (8.53)

    w \cdot x - b = +1    (8.54)

Note that making these two equations valid in general may require rescaling of w. In this case, the equations can be verified by substituting the support vectors.
The full margin is the distance between these two red lines/hyperplanes and is given by

    \frac{2}{\|w\|}    (8.57)

which equals \frac{2}{\sqrt{2}} = \sqrt{2} for the example problem. Therefore, maximizing the margin is equivalent to minimizing the norm of w, \|w\|.
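A quick numeric check of the example's geometry, assuming VectorD supplies a norm method:

    val w = VectorD (1.0, 1.0)                                   // normal vector of x + y = 7
    val b = 7.0
    println (s"distance to origin = ${b / w.norm}")              // 4.9497...
    println (s"half margin        = ${1.0 / w.norm}")            // 0.7071...
    println (s"full margin        = ${2.0 / w.norm}")            // 1.4142... = sqrt(2)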
8.10.2 Optimization Problem
Given a dataset (X, y), with x_i being the i-th row of matrix X ∈ R^{m×n} and y_i being the i-th element of vector y ∈ R^m, the goal of minimizing \|w\| can be cast as a constrained optimization problem:

    \min \frac{1}{2}\|w\|^2 = \min \frac{1}{2} \sum_{j=0}^{n-1} w_j^2    (8.58)

subject to

    w \cdot x_i - b \le -1 \quad \text{if } y_i = -1    (8.59)

    w \cdot x_i - b \ge +1 \quad \text{if } y_i = +1    (8.60)

These two constraints may be combined into one:

    y_i (w \cdot x_i - b) \ge 1 \quad \text{for } i = 0, \ldots, m-1    (8.61)

or in the form of standard inequality constraints,

    y_i (w \cdot x_i - b) - 1 \ge 0 \quad \text{for } i = 0, \ldots, m-1    (8.62)
This is a quadratic optimization problem with linear inequality constraints involving n + 1 unknowns (w and b) and m constraints. Such optimization problems can be solved by introducing a Lagrange multiplier α_i ≥ 0 for each inequality constraint (see https://aa.ssdi.di.fct.unl.pt/files/AA-09_notes.pdf). Therefore, the Lagrangian (see the Appendix) is given by

    L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=0}^{m-1} \alpha_i [y_i (w \cdot x_i - b) - 1]    (8.63)

The problem is now to minimize the Lagrangian by finding optimal values for the parameters w and b. Taking the gradient of L w.r.t. w and b, and setting it equal to zero yields

    \frac{\partial L}{\partial w} = w - \sum_{i=0}^{m-1} \alpha_i y_i x_i = 0    (8.64)

    \frac{\partial L}{\partial b} = \sum_{i=0}^{m-1} \alpha_i y_i = 0    (8.65)
Dual Formulation
The dual form will reformulate the problem in terms of the Lagrange multipliers. As an exercise, show that

    \|w\|^2 = w \cdot w = \left(\sum_i \alpha_i y_i x_i\right) \cdot \left(\sum_j \alpha_j y_j x_j\right) = \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j    (8.66)

    \sum_{i=0}^{m-1} \alpha_i [y_i (w \cdot x_i - b) - 1] = \sum_i \alpha_i y_i\, w \cdot x_i - b \sum_i \alpha_i y_i - \sum_i \alpha_i
The first term becomes

    \sum_i \alpha_i y_i \sum_j \alpha_j y_j\, x_j \cdot x_i = \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j

The second term is zero due to the constraint \sum_i \alpha_i y_i = 0. Combining yields

    L_D(\alpha) = -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j + \sum_i \alpha_i    (8.67)

The dual problem is to maximize L_D,

    \max_\alpha L_D(\alpha) = -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j + \sum_i \alpha_i    (8.68)

subject to

    \sum_i \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \ge 0    (8.69)
Class Methods:
    @param x       the input/data matrix with points stored as rows
    @param y       the classification of the data points stored in a vector
    @param fname_  the feature/variable names (defaults to null)
    @param cname_  the names of the classes
    @param hparam  the hyper-parameters
    class SupportVectorMachine (x: MatrixD, y: VectorI, fname_ : Array [String] = null,
                                cname_ : Array [String] = Array ("-", "+"),
                                hparam: HyperParameter = null)
          extends Classifier (x, y, fname_, 2, cname_, hparam)
             with FitC (y, k = 2):

        override def parameter: VectorD = w :+ b
        override def train (x_ : MatrixD = x, y_ : VectorI = y): Unit =
        def test (x_ : MatrixD = x, y_ : VectorI = y): (VectorI, VectorD) =
        override def predictI (z: VectorD): Int = if (w dot z) >= b then 1 else -1
        override def summary (x_ : MatrixD = null, fname_ : Array [String] = null,
                              b_ : VectorD = p_y, vifs: VectorD = null): String =
        override def toString: String = "(w, b) = " + (w, b)
8.10.5 Exercises
1. Develop a SupportVectorMachine model for the Eight Point Example given in this section. What are w and b? What is the size of the margin?
2. Compare DecisionTree_C45, BaggingTrees, RandomForest, and SupportVectorMachine classifiers on the White-Wine, Breast Cancer, and Diabetes datasets.
3. Give pseudocode for Platt's Sequential Minimal Optimization (SMO) algorithm for SVM [146].
4. Explain the improvements to Platt's SMO algorithm suggested by Keerthi et al. [93].
8.11 Neural Network Classifiers
It is easy to use Neural Networks as classifiers. The last layer can be set up to provide a probability of outcome. For example, for two-way classification (k = 2), a single output node's value can indicate the probability of a positive outcome (y = 1). One minus this value is used as the indicator for the corresponding negative outcome (y = 0). Whichever is higher is then the predicted class. Of course, for a better balance of false positives and false negatives, a threshold may be used. Probabilities are produced by the sigmoid activation function

    S(u) = \frac{1}{1 + e^{-u}}    (8.70)
The NeuralNet_Class_3L class supports single-output, 3-layer (input, hidden and output) Neural-Network classifiers. The model equation is a specialization of the one given for NeuralNet_3L (see the Chapter on Nonlinear Models and Neural Networks). Given an input vector x ∈ R^n, the model predicts a value which is considered off by ε.

    y = S(B \cdot f(A \cdot x)) + \epsilon = S(B^\top f(A^\top x)) + \epsilon    (8.71)

where

• ε ∈ R is the residual/error

As discussed in the Chapter on Nonlinear Models and Neural Networks, the parameter weight matrix and bias vector are bundled into a NetParam object.
The corresponding network diagram is shown in Figure 8.3. Its input layer has n = 2 nodes, its hidden layer has n_z = 3 nodes, and its output layer has 1 node.
[Figure 8.3: A Simple Three-Layer (input, hidden, output) Neural Network Classifier]
Given a new input vector z, the predicted output is

    \hat{y} = S(B \cdot f(A \cdot z)) = S(B^\top f(A^\top z))    (8.73)

If ŷ > 0.5, then return 1 (positive classification); otherwise return 0 (negative classification). More generally, the test can involve a threshold τ, i.e., ŷ > τ.
8.11.4 Optimization
One may utilize the same loss function that is used for regression-type problems (sse or mse), in which case the mse loss function is

    L(A, B) = \frac{1}{2m} \|y - S(f(XA)B)\|^2    (8.74)

Or using summation notation,

    L(A, B) = \frac{1}{2m} \sum_{i=0}^{m-1} [y_i - S(B^\top f(A^\top x_i))]^2    (8.75)

However, a typically better alternative is to use cross-entropy for the loss function [135] (section 3.1).

    L(A, B) = -\frac{1}{m} \sum_{i=0}^{m-1} y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i)    (8.76)

    L(A, B) = -\frac{1}{m} \sum_{i=0}^{m-1} y_i \ln S(B^\top f(A^\top x_i)) + (1 - y_i) \ln(1 - S(B^\top f(A^\top x_i)))    (8.77)
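Equation 8.76 translates directly into code; a minimal sketch, assuming the predicted probabilities yp lie strictly between 0 and 1:

    import scala.math.log

    def crossEntropy (y: VectorD, yp: VectorD): Double =
        var sum = 0.0
        for i <- y.indices do
            sum += y(i) * log (yp(i)) + (1.0 - y(i)) * log (1.0 - yp(i))
        -sum / y.dim                                             // mean cross-entropy loss
    end crossEntropy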
Partial Derivatives
The partial derivative w.r.t. the weight b_h connecting hidden node h with the output node (there is only one) is

    \frac{\partial L}{\partial b_h} = \frac{1}{m}\, z_{:h} \cdot (\hat{y} - y)    (8.78)

The partial derivative w.r.t. the weight a_{jh} connecting input node j with hidden node h is

    \frac{\partial L}{\partial a_{jh}} = \frac{1}{m}\, \ldots    (8.79)
See the Probability Chapter for more information about cross-entropy and the Nonlinear Models and Neural
Network Chapter for computing gradients (and their partial derivatives) of loss functions. See the exercises
for partial derivatives for the biases α and β.
Class Methods:
    @param x       the m-by-n input matrix (training data consisting of m input vectors)
    @param y       the m output vector (training data consisting of m output integer values)
    @param fname_  the feature/variable names (defaults to null)
    @param cname_  the names for all classes
    @param nz      the number of nodes in hidden layer (-1 => use default formula)
    @param hparam  the hyper-parameters for the model/network
    @param f       the activation function family for layers 1->2 (input to hidden);
                   the activation function family for layers 2->3 (hidden to output) is sigmoid
8.11.6 Exercises
1. Compare the QoF measures for NeuralNet_Class_3L using sse vs. cross-entropy for the loss function. Do this for the Breast Cancer and Diabetes datasets.
2. Suppose model 1 predicts 0.1 when the actual value is 1 and 0.9 when it is 0. Consider the following two vectors: y = [1, 0] and ŷ = [0.1, 0.9]. Compute the cross-entropy loss function

    L = -\frac{1}{2} (1 \ln 0.1 + 0 \ln 0.9)

Now suppose for model 2 the predictions are in closer agreement with the actual values, ŷ = [0.9, 0.1]. Recompute the cross-entropy loss function

    L = -\frac{1}{2} (1 \ln 0.9 + 0 \ln 0.1)

Which model is better? How do these two examples help explain why cross-entropy works as a loss function?
3. Derive the results for the partial derivative w.r.t. weight b_h, \frac{\partial L}{\partial b_h}.

4. Derive the results for the partial derivative w.r.t. weight a_{jh}, \frac{\partial L}{\partial a_{jh}}.

5. Derive the results for the partial derivative w.r.t. bias α_h, \frac{\partial L}{\partial \alpha_h}.

6. Derive the results for the partial derivative w.r.t. bias β, \frac{\partial L}{\partial \beta}.
Chapter 9
General Linear Models, presented in the Prediction Chapter, cover a wide range of simple models that are easy to use and explain and can be rapidly trained, typically using Ordinary Least Squares (OLS), which boils down to matrix factorization.
Complex nonlinear models can provide improved Quality of Fit and handle cases where General Linear Models are not sufficiently predictive. However, they often come with the following disadvantages: they are harder to use and explain, have substantially longer training times, are more likely to overfit, and require tuning of hyper-parameters.
This chapter examines two categories of modeling techniques that fall between the two extreme categories of models just discussed. The two categories of models of intermediate complexity are Generalized Linear Models and Regression Trees.

9.1 Generalized Linear Models

Recall the Linear Model:

    y \sim \text{Normal}(b \cdot x, \sigma^2)    (9.1)

    y = b \cdot x + \epsilon    (9.2)

where ε ∼ Normal(0, σ²).
Now generalize the first term b·x into an optionally transformed linear function of x called the mean function.

    y = \mu_y(x) + \epsilon    (9.3)
The mean function µ_y(x) satisfies

    g(\mu_y(x)) = b \cdot x    (9.4)

where g is an invertible function. One may think of g as a link function, or its inverse g⁻¹ as an activation function. The idea of a link function is that it uncovers an underlying linear model b·x. In other words, g is a function that links y's mean to a linear combination of the predictor variables. The idea of an activation function is that it allows a transformation function to be applied to the linear combination b·x, as is done in Perceptrons.

    \mu_y(x) = g^{-1}(b \cdot x)    (9.5)

For example, a commonly used activation function in Perceptrons is the sigmoid function.

    \text{sigmoid}(t) = \frac{1}{1 + e^{-t}}    (9.7)

The inverse of the sigmoid function is the logit function.

    \text{logit}(u) = \ln \frac{u}{1 - u}    (9.8)
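A quick numeric check of this inverse relationship, in plain Scala:

    import scala.math.{exp, log}

    def sigmoid (t: Double): Double = 1.0 / (1.0 + exp (-t))
    def logit (u: Double): Double = log (u / (1.0 - u))
    println (logit (sigmoid (1.5)))                              // 1.5 (up to round-off)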
Applying the logit function to both sides of equation 9.5 yields

    y = \mu_y(x) + \epsilon
    \text{logit}(\mu_y(x)) = b \cdot x

These are the model equations of Logistic Regression. When the link function is the identity function, the response random variable is continuous, and the errors are distributed as Normal(0, σ²), we are back to a Linear Model, written as a Generalized Linear Model.

    y = \mu_y(x) + \epsilon
    \mu_y(x) = b \cdot x

Several additional combinations of link functions and residual distributions are commonly used, as shown in the table below.
    Model Type                  Response Type (y)      Link Function      Residual Distribution
    Logistic Regression         binary {0, 1}          logit              Bernoulli Distribution
    Poisson Regression          integer {0, . . . , ∞}  ln                 Poisson Distribution
    Exponential Regression      continuous [0, ∞)      ln or reciprocal   Exponential Distribution
    General Linear Model (GLM)  continuous (−∞, ∞)     identity           Normal Distribution
9.2 Maximum Likelihood Estimation
In this section, rather than estimating parameters using Least Squares Estimation (LSE), Maximum Likelihood Estimation (MLE) will be used. For a deeper view of estimation concepts, see [15]. Given a dataset with m instances, the model will produce an error for each instance. When the error is large, the model is in disagreement with the data. When errors are normally distributed, the probability density will be low for a large error, meaning this is an unlikely case. If this is true for many of the instances, the problem is not the data, it is the values given for the parameters. The parameter vector b should be set to maximize the likelihood of seeing the instances in the dataset. This notion is captured in the likelihood function L(b). Note that for the Simpler Regression model there is only a single parameter, the slope b.
Given an m-instance dataset (x, y), where both are m-dimensional vectors, and a Simpler Regression model

    y = b x + \epsilon

where ε ∼ N(0, σ²), let us consider how to estimate the parameter b. While LSE is based upon the distance of errors from zero, MLE transforms this distance based upon, for example, the pdf of the error distribution. Notice that for small errors, the Normal distribution is more tolerant than the Exponential distribution, while for larger errors it is less tolerant. See the exercises for how to visualize this.
For this model and dataset, the likelihood function L(b) is the product of m Normal density functions (making the assumption that the instances are independent).

    L(b) = \prod_{i=0}^{m-1} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\epsilon_i^2 / 2\sigma^2}

Since ε_i = y_i − b x_i, we may rewrite L(b) as

    L(b) = \prod_{i=0}^{m-1} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y_i - b x_i)^2 / 2\sigma^2}

Taking the natural logarithm gives the log-likelihood function l(b)

    l(b) = -m \ln(\sqrt{2\pi}\,\sigma) - \sum_{i=0}^{m-1} (y_i - b x_i)^2 / 2\sigma^2

Taking the derivative with respect to b and setting it to zero,

    \frac{dl}{db} = \sum_{i=0}^{m-1} x_i (y_i - b x_i) / \sigma^2 = 0

so the optimal slope must satisfy

    \sum_{i=0}^{m-1} x_i (y_i - b x_i) = 0

i.e., b = (x · y)/(x · x), the same estimate produced by Least Squares.
9.2.1 Akaike Information Criterion
The Akaike Information Criterion (AIC) serves as an alternative to measures like R²-Adjusted, R̄².

    \text{AIC} = 2(n + 1) - 2\, l^*

It is twice the number of parameters n + 1 to be estimated minus twice the optimal log-likelihood l*. Note that the n comes from estimating the coefficients dim(b), and the 1 comes from estimating the error/residual variance σ². The smaller the AIC value, the better the model.
For linear regression, the above formula is equivalent to

    \text{AIC} = 2(n + 1) + m \ln \frac{sse}{m} + \text{constant}

and the constant may be ignored, as all models for the given dataset will have the same constant.
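For linear regression, this translates into a one-liner; a sketch with the constant dropped:

    import scala.math.log

    def aic (sse: Double, m: Int, n: Int): Double =
        2.0 * (n + 1) + m * log (sse / m)                        // smaller is better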
    y = \mu_y(x) ~ \epsilon
    g(\mu_y(x)) = b \cdot x

The ~ is used to indicate either + (for additive errors) or × (for multiplicative errors).
Now given an m-instance dataset (X, y), we need to determine the link function g, the residual/error distribution ε ∼ Dist(b), and finally the parameters b using MLE.
When y is continuous, the pdf conditioned on the parameters may be used to capture the error distribution,

    f_\epsilon(\epsilon | b)

The error for each instance may be calculated from the dataset.
When errors are highly correlated, Generalized Linear Models are not ideal. Thus, we assume the errors are independent and identically distributed (iid).
Because of the independence assumption, the joint error density is the product of the density for each instance ε_i. Therefore, the likelihood function w.r.t. the parameter vector b is

    L(b) = \prod_{i=0}^{m-1} f_\epsilon(\epsilon_i | b)
Taking the natural logarithm yields

    l(b) = \sum_{i=0}^{m-1} \ln f_\epsilon(\epsilon_i | b)    (9.10)

Analogously, for discrete random variables, the pdf is replaced with the pmf

    l(b) = \sum_{i=0}^{m-1} \ln p_\epsilon(\epsilon_i | b)    (9.11)
For Maximum Likelihood Estimation, the above two equations may serve as a starting point.
9.3 Poisson Regression
The PoissonRegression class can be used for developing Poisson Regression models. In this case, a response y may be thought of as a count that may take on a non-negative integer value. The probability mass function (pmf) for the Poisson distribution with mean λ may be defined as follows:

    f(y; \lambda) = \frac{\lambda^y}{y!}\, e^{-\lambda}

    y = \mu(x) + \epsilon
    g(\mu(x)) = b \cdot x

The link function g for Poisson Regression is the ln (natural logarithm) function.

    \ln(\mu(x)) = b \cdot x

    L = \prod_{i=0}^{m-1} \frac{\mu(x_i)^{y_i}}{y_i!}\, e^{-\mu(x_i)}

    l = \sum_{i=0}^{m-1} y_i \ln(\mu(x_i)) - \mu(x_i) - \ln(y_i!)

    l = \sum_{i=0}^{m-1} y_i\, b \cdot x_i - e^{b \cdot x_i} - \ln(y_i!)

Since the last term is independent of the parameters, removing it will not affect the optimization.

    l = \sum_{i=0}^{m-1} y_i\, b \cdot x_i - e^{b \cdot x_i}
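The final log-likelihood translates directly into code; a minimal sketch (hypothetical, not ScalaTion's implementation) using plain Scala arrays:

    import scala.math.exp

    def ll (b: Array [Double], x: Array [Array [Double]], y: Array [Int]): Double =
        var sum = 0.0
        for i <- x.indices do
            var bx = 0.0
            for j <- b.indices do bx += b(j) * x(i)(j)           // dot product b . x_i
            sum += y(i) * bx - exp (bx)                          // y_i b.x_i - e^{b.x_i}
        sum                                                      // maximize via an NLP solver
    end ll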
Example Problem:
9.3.1 PoissonRegression Class
Class Methods:
    @param x       the data/input matrix augmented with a first column of ones
    @param y       the integer response/output vector, y_i in {0, 1, ... }
    @param fname_  the names of the features/variables (defaults to null)
    @param hparam  the hyper-parameters (currently has none)
9.4 Regression Trees
As with Decision (or Classification) Trees, Regression Trees make predictions based upon what range each variable/feature is in. If the tree is binary, there are two ranges for each feature split: low (below a threshold) and high (above a threshold). Building a Regression Tree essentially then requires finding thresholds for splitting variables/features. A threshold will split a dataset into two groups. Letting θ_k be a threshold for splitting variable x_j, we may split the rows in the X matrix into left and right groups,

    \text{left}_k(X) = \{x_i : x_{ij} \le \theta_k\} \qquad \text{right}_k(X) = \{x_i : x_{ij} > \theta_k\}

For splitting variable x_j, the threshold θ_k should be chosen to minimize the weighted sum of the Mean Squared Error (MSE) of the left and right sides. Alternatively, one can minimize the Sum of Squared Errors (SSE). This variable becomes the root node of the regression tree. The dataset for the root node's left branch consists of left_k(X), while the right branch consists of right_k(X). If the maximum tree depth is limited to one, the root's left child and right child will be leaf nodes. For a leaf node, the prediction value that minimizes MSE is the mean µ(y), see the exercises.
Given a limit on the depth of the tree, nodes are split recursively, starting with the root. The process terminates when the limit is reached or improvement is deemed inadequate. Each split adds a constraint on a variable.
Depth = 1
Depth = 2
Further splitting may occur on x0 (or xj for multidimensional examples). If we let the maximum tree depth
be two, we obtain the following four regions, corresponding to the four leaf nodes,
with means µ0 (y) = 5.61, µ1 (y) = 6.75, µ2 (y) = 8.80 and µ3 (y) = 9.03. Each internal (non-leaf) node will
have a threshold. They are θ0 = 6.5, θ1 = 3.5 and θ2 = 8.5.
9.4.2 Regions
The number of regions (or leaf nodes) is always one greater than the number of thresholds. The region for leaf node l, R_l = (x_j, (a_l, b_l]), defines the feature/variable being split and the interval of inclusion. Corresponding to each region R_l is an indicator function,

    1_l(x) = 1 \text{ if } x_j \in (a_l, b_l], \text{ else } 0

which simply indicates {false, true} or {0, 1} whether variable x_j is in the interval (a_l, b_l]. Now define 1^*_l(x) as the product of the indicator functions from leaf l until (not including) the root of the tree,

    1^*_l(x) = \prod_{r \in anc(l)} 1_r(x)

where anc(l) is the set of ancestors of leaf node l (inclusive of l, exclusive of the root). Since only one of these 1^* indicator functions can be true for any given x vector, we may concisely express the regression tree model as follows:
    y = \sum_{l \in \text{leaves}} 1^*_l(x)\, \mu_l(y) + \epsilon    (9.16)
Thus, given a predictor vector x, predicting a value for the response variable y corresponds to taking the
mean y-value of the vectors in x’s composite region (the intersection of regions from the leaf until the root).
The prediction function to compute ŷ is thus piecewise constant.
As locality determines the prediction for Regression Trees, they are similar to KNN Regression.
The score for a candidate split is the weighted sum

    w_l \cdot \text{mse}(\text{left}_k(X)) + w_r \cdot \text{mse}(\text{right}_k(X))

where w_l and w_r are the weights for the left and right sides, respectively. The overall score for a tree is then just the weighted sum of the MSEs for all leaf nodes, where the weight is the fraction of the instances that qualify for that leaf node.
Possible values for θ_k are the values between any two consecutive values in vector x_{:j} sorted. This will allow any possible split of x_{:j} to be considered. For example, {1, 10, 11, 12} should not be split in the middle, e.g., into {1, 10} and {11, 12}, but rather into {1} and {10, 11, 12}. Possible thresholds (split points) are the averages of any two consecutive values, i.e., 5.5, 10.5 and 11.5. A straightforward way to implement determining the next variable x_j and its threshold θ_k would be to iterate over all features/variables and split points, as sketched below. Calculating the weighted sum of left and right mse from scratch for each candidate split point is inefficient. These values may be computed incrementally using the fast thresholding algorithm [40]. See [195] for derivations of efficient algorithms.
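The straightforward (non-incremental) search can be sketched as follows in plain Scala, minimizing total SSE, which is equivalent to minimizing the weighted MSE criterion above:

    def bestSplit (xj: Array [Double], y: Array [Double]): (Double, Double) =
        def sse (ys: Array [Double]): Double =
            if ys.isEmpty then 0.0
            else
                val mu = ys.sum / ys.length
                ys.map (v => (v - mu) * (v - mu)).sum

        val vals   = xj.distinct.sorted
        val cands  = for i <- 0 until vals.length - 1 yield (vals(i) + vals(i+1)) / 2.0
        val scored = for thr <- cands yield                      // total SSE of each candidate split
            val (l, r) = (xj zip y).partition (_._1 <= thr)
            (thr, sse (l.map (_._2)) + sse (r.map (_._2)))
        scored.minBy (_._2)                                      // (best threshold, its SSE)
    end bestSplit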
Class Methods:
    @param x            the m-by-n input/data matrix
    @param y            the response m-vector
    @param fname_       the names of the features/variables (defaults to null)
    @param hparam       the hyper-parameters for the model
    @param curDepth     current depth
    @param branchValue  the branch value for the tree node
    @param feature      the feature for the tree's parent node
    @param leaves       the leaf counter

    class RegressionTree (x: MatrixD, y: VectorD, fname_ : Array [String] = null,
                          hparam: HyperParameter = RegressionTree.hp, curDepth: Int = 0,
                          branchValue: Int = -1, feature: Int = -1, leaves: Counter = Counter ())
          extends Predictor (x, y, fname_, hparam)
             with Fit (dfm = x.dim2 - 1, df = x.dim - x.dim2):
9.5 Linear Model Trees
The regression trees in the last section have prediction functions that are piecewise constant, while those in this section are piecewise linear. Each region in the previous section is covered by a flat surface, while the regions in this section are covered by hyperplanes. These types of regression trees are also called Model Trees (e.g., M5 [148]).
At each leaf node, rather than taking the average of all the points (like using a Null Model), a multiple linear regression model is used. As such, more data points are needed in each leaf, implying the need to have a sufficiently large dataset, and typically there is less splitting (smaller tree depth).
As the degrees of freedom in leaves may become small, it is useful to use Stepwise Refinement to reduce the number of parameters.
9.5.1 Splitting
A node should only be split if its multiple regression model is significantly worse than the combination of
the two regression models of its children.
9.5.2 Pruning
9.5.3 Smoothing
The response surface may be made more smooth by taking a weighted average of the predictions of all models
from the root to the leaf.
9.6 Random Forest Regression
Random Forest Regression takes multiple regression trees and averages their predictions. The trees are used in parallel.
9.7 Gradient Boosting Regression
Gradient Boosting Regression tries to correct the predictions f_m of a regression tree by using a subsequent tree (with prediction function f_{m+1}) to predict the stage-m residuals/errors, ε = y − f_m(X). The trees are used sequentially in M stages and the corrections are moderated by a learning rate, as sketched below.
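A minimal sketch of the idea in plain Scala. Here fitTree is a hypothetical stand-in that trains a regression tree on (x, residuals) and returns its predictions on the training inputs:

    def boost (x: Array [Array [Double]], y: Array [Double], mStages: Int, eta: Double,
               fitTree: (Array [Array [Double]], Array [Double]) => Array [Double]): Array [Double] =
        val mean = y.sum / y.length
        var yp   = Array.fill (y.length)(mean)                        // stage 0: predict the mean
        for stage <- 1 to mStages do
            val res  = Array.tabulate (y.length)(i => y(i) - yp(i))   // current residuals
            val corr = fitTree (x, res)                               // tree's predictions of residuals
            yp = Array.tabulate (y.length)(i => yp(i) + eta * corr(i))// moderated correction
        yp
    end boost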
9.8 Exercises
1. Explain the difference between AIC and the Bayesian Information Criterion (BIC).

2. Consider the following Exponential and standard Normal probability density functions:

    f_x(x) = e^{-x}

    f_y(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}

Plot and compare the pdfs f_x and f_y vs. x over the interval [0, 4].

3. For Regression Trees, show that for each leaf node the optimal value for the constant is the mean.

5. Create Regression Tree models for the Example_AutoMPG dataset. Compare the results for multiple depths.

6. How can bagging and boosting be used to improve upon Regression Trees?

7. Contrast KNN Regression with Regression Trees in terms of the shape of regions and how they are formed.
9.9 Further Reading
1. Inductive Learning of Tree-based Regression Models [195], https://www.dcc.fc.up.pt/~ltorgo/PhD/

4. Optimal Classification and Regression Trees with Hyperplanes Are as Powerful as Classification and Regression Neural Networks, https://dbertsim.mit.edu/pdfs/papers/2018-sobiesk-optimal-classification-and-regression-trees.pdf
Chapter 10
10.1 Nonlinear Regression
The NonlinearRegression class supports Nonlinear Regression (NLR). In this case, the vector of input/pre-
dictor variable x ∈ Rn can be multi-dimensional [1, x1 , ...xk ] and the function f is nonlinear in the parameters
b ∈ Rp .
y = f (x; b) + (10.1)
where represents the residuals (the part not explained by the model). The function f : Rn × Rp → R is
called the mean response function.
Note that y = b0 + b1 x1 + b2 x21 + is still linear in the parameters. The example below is not, as there
is no transformation that will make the formula linear in the parameters.
y = (b0 + b1 x1 )/(b2 + x1 ) +
Nonlinear Regression can model phenomena more precisely than the linear approximations of multiple
linear regression. Although quadratic and cubic regressions allow the lines to be bent to better fit the data,
they do not provide the flexibility of nonlinear regression; see Chapter 13 of [102]. The functional forms in
nonlinear regression can be selected based on the physics driving the phenomena. At the other extreme,
simple nonlinear functions of linear combinations of several input variables allow many diverse phenomena
to be modeled, for example, using Neural Networks. Neural Networks trade interpretability for a universal
modeling framework.
10.1.2 Training
A training dataset consisting of m input-output pairs is used to minimize the error in the prediction by
adjusting the parameter vector b. Given an input matrix X consisting of m input vectors and an output
vector y consisting of m output values, minimize the distance between the target output vector y and the
predicted output vector f (X; b).
The model training process depicted in Figure 10.1 involves
• a Model with mean response function f or vectorized mean response function f : R^{m×n} × R^p → R^m, and
• a Trainer that adjusts the parameter vector b by minimizing a loss function L.
In Figure 10.1, the ith row/instance of the Dataset is taken as the input vector xi ∈ R^n to the model to
produce a predicted response,

ŷi = f(xi; b)

where f (f) is the (vectorized) mean response function and the parameter vector b ∈ R^p.

Figure 10.1: Model training process: the Dataset (X, y) feeds the Model with function f and parameters b to produce ŷ, while the Trainer minimizes the loss L.
Note that for linear models typically p = n, but for nonlinear models, it is not uncommon for p > n, i.e.,
the number of parameters to be greater than the number of input/predictor variables.
Loss Function
Optimization techniques can be used to determine a “good” value for the parameter vector b, by minimizing
a loss function.
L(b; y, ŷ)
Common forms of loss functions are based on minimizing an ℓ1 or ℓ2 norm of the error vector ε = y − ŷ =
y − f(X; b). Again, for an ℓ2 norm, it is convenient to use its square and minimize the dot product of the
error with itself (or rather half of that, (1/2) ε · ε).

L(b) = (1/2) (y − f(X; b)) · (y − f(X; b))    (10.3)

An alternative equation for the loss function uses summation notation to add up the square of each error
εi = yi − f(xi; b) over all m instances.

L(b) = (1/2) Σ_{i=0}^{m-1} [yi − f(xi; b)]²    (10.4)
Special Case of a Linear Model
For a linear model, f is simply a linear combination of the input variables, i.e., f (X; b) = Xb, so
L(b) = (1/2) (y − Xb) · (y − Xb)

Recall that taking the gradient with respect to b and setting it equal to zero yields the Normal Equations,

(X^T X) b = X^T y

Then matrix factorization can be used to relatively quickly find a globally optimal value for the parameter
vector b.
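As a minimal sketch (assuming ScalaTion's Fac_Cholesky factorization class with factor () and solve () methods; solveNormalEq is an illustrative name), the Normal Equations for the linear case may be solved directly:

// A minimal sketch: solve (X^T X) b = X^T y via Cholesky factorization.
import scalation.mathstat.{Fac_Cholesky, MatrixD, VectorD}

def solveNormalEq (x: MatrixD, y: VectorD): VectorD =
    val xtx = x.transpose * x              // X^T X (n-by-n, symmetric positive definite)
    val xty = x.transpose * y              // X^T y (n-vector)
    val fac = new Fac_Cholesky (xtx)       // factor X^T X = L L^T (assumed API)
    fac.factor ()
    fac.solve (xty)                        // back-substitute to obtain b
end solveNormalEq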
10.1.3 Optimization
For nonlinear regression, a Least-Squares (minimizing the errors/residuals) method can be used to fit the
parameter vector b. Optimal values for the parameters b may be found by either solving the Normal
Equations or using an iterative optimization algorithm. In either case, gradients of the loss function are used
to find these values.
L(b) = (1/2) Σ_{i=0}^{m-1} [yi − f(xi; b)]²
The jth partial derivative of this loss function L(b) may be determined using the chain rule.

∂L(b)/∂bj = − Σ_{i=0}^{m-1} [yi − f(xi; b)] ∂f(xi; b)/∂bj    (10.5)

Setting each partial derivative to zero gives

Σ_{i=0}^{m-1} f(xi; b) ∂f(xi; b)/∂bj = Σ_{i=0}^{m-1} yi ∂f(xi; b)/∂bj
Collecting these equations for all values of j leads to the Normal Equations for nonlinear models. Unfortunately,
they form a system of nonlinear equations requiring numerical solution. Instead of solving the Normal
Equations for nonlinear models, one may use an optimization algorithm to minimize the loss function L(b).
First-order optimization algorithms utilize first derivatives, typically moving in the opposite direction to the
gradient.
∇L(b) = [∂L(b)/∂b0, ..., ∂L(b)/∂bp−1]    (10.6)
Noting that εi = yi − f(xi; b), the jth partial derivative can be written more concisely.

∂L(b)/∂bj = − Σ_{i=0}^{m-1} [εi] ∂f(xi; b)/∂bj    (10.7)
Therefore, an optimization algorithm essentially operates off of the gradient of the mean response function
f . For greater accuracy and efficiency for a given function f , utilization of formulas for its partial derivatives
is preferred over their numerical calculation.
Second-order optimization algorithms utilize both first derivatives (gradient) and second derivatives (Hessian).
A user-defined mean response function f, which takes a vector of inputs x and a vector of parameters b,
is passed as a parameter to the NonlinearRegression class. This function is used to create a predicted
output value ypi for each input vector xi. The sseF method (serving as the loss function) applies function f
to all m input vectors to compute predicted output values. These are then subtracted from the actual
outputs to create an error vector, whose squared Euclidean norm is returned. The sseF method is embedded
in the train method.
6 // : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :
7 /* * Function to compute sse for the given values for the parameter vector b .
8 * @ param b the parameter vector
9 */
10 def sseF ( b : VectorD ) : Double =
11 val yp = x_ . map ( f (_ , b ) ) // predicted response vector
12 ( y_ - yp ) . normSq // sum of squared errors
13 end sseF
14
15 val bfgs = new BFGS ( sseF ) // minimize sseF using nonlinear optimizer
16 val result = bfgs . solve ( b_init ) // result from optimizer
17 val sse = result . _1 // optimal function value
18 b = result . _2 // optimal parameter vector
19 end train
ScalaTion’s optimization package provides several solvers for linear, quadratic, integer and nonlinear programming/optimization. Currently, the BFGS class (a Quasi-Newton optimizer) is used for finding an optimal
b by minimizing sseF. It uses the second-order Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm that
can improve convergence over first-order algorithms such as gradient/steepest descent. The BFGS algorithm
determines a search direction by deflecting the gradient/steepest descent direction vector (opposite the gradient)
by multiplying it by a matrix that approximates the inverse Hessian. This optimizer requires an initial
guess b_init for the parameter vector b.
For more information see http://www.bjsos.umd.edu/socy/alan/stats/socy602_handouts/kut86916_ch13.pdf.
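As a hedged usage sketch (the tiny dataset is made up purely for illustration, and it assumes f has type (VectorD, VectorD) => Double with train () usable with its defaults), the rational-function example from the start of this section might be fit as follows:

// Illustrative only: fit y = (b0 + b1 x1)/(b2 + x1) with NonlinearRegression.
import scalation.mathstat.{MatrixD, VectorD}
import scalation.modeling.NonlinearRegression

val x = MatrixD ((4, 2), 1, 1,                    // column of ones, then x1
                         1, 2,
                         1, 3,
                         1, 4)
val y = VectorD (0.8, 1.1, 1.3, 1.4)              // made-up responses

def f (x: VectorD, b: VectorD): Double = (b(0) + b(1) * x(1)) / (b(2) + x(1))

val b_init = VectorD (1.0, 1.0, 1.0)              // initial guess for BFGS
val mod = new NonlinearRegression (x, y, f, b_init)
mod.train ()                                      // minimizes sseF via BFGS
println (mod.parameter)                           // fitted parameter vector b (assumed accessor)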
10.1.4 Use of the Chain Rule
The Chain Rule for Differential Calculus can be used to obtain derivatives, partial derivatives and gradients
needed in Nonlinear Regression and Neural Networks. Consider the following function of the variable x; the
rule works as follows:
f (x) = (7 − 3x2 )2
To obtain the derivative of the function f w.r.t. x, df/dx, f can be expressed as the composition of two
functions f = f ∘ u, e.g., with u(x) = 7 − 3x² and f(u) = u².

df/dx = (df/du)(du/dx)    (10.8)

where

df/du = 2u
du/dx = −6x

Multiplying and substituting for u yields

df/dx = [2(7 − 3x²)][−6x] = −12x(7 − 3x²)
This basic rule carries over to partial derivatives.

∂L/∂bj = (dL/du)(∂u/∂bj)    (10.9)

For example, consider the function f(x, y) = (7 − 3xy)². Let u(x, y) = 7 − 3xy and f(u) = u²; applying the
chain rule gives

∂f/∂x = (df/du)(∂u/∂x)    (10.10)

∂f/∂x = [2u][−3y] = −6y(7 − 3xy)
Of course, there are additional chain rules in multivariate calculus; see https://www.whitman.edu/mathematics/multivariable/multivariable.pdf.
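The chain-rule result is easy to sanity-check numerically; the following plain-Scala snippet compares the analytic derivative −12x(7 − 3x²) to a central-difference approximation.

// Numeric check of the chain-rule result for f(x) = (7 - 3x^2)^2.
def f (x: Double): Double  = math.pow (7.0 - 3.0 * x * x, 2)
def df (x: Double): Double = -12.0 * x * (7.0 - 3.0 * x * x)

val x = 1.5
val h = 1e-6
val approx = (f (x + h) - f (x - h)) / (2.0 * h)    // central difference
println (s"analytic = ${df (x)}, numeric = $approx")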
10.1.5 NonlinearRegression Class
Class Methods:
1 @ param x the data / input matrix optionally augmented with a first column of ones
2 @ param y the response / output vector
3 @ param f the nonlinear function f (x , b ) to fit
4 @ param b_init the initial guess for the parameter vector b
5 @ param fname_ the feature / variable names ( defaults to null )
6 @ param hparam the hyper - parameters ( currently has none )
7
10.1.6 Exercises
1. Given the Normal Equations for Nonlinear Regression models,

Σ_{i=0}^{m-1} f(xi; b) ∂f(xi; b)/∂bj = Σ_{i=0}^{m-1} yi ∂f(xi; b)/∂bj

show that they reduce to the usual (linear) Normal Equations when the mean response function is linear,

f(xi; b) = xi · b

Hint:

∂f(xi; b)/∂bj = xij
2. List and compare several nonlinear optimization algorithms from the following classes:
(a) derivative-free optimization algorithms,
(b) first-order optimization algorithms, and
(c) second-order optimization algorithms.
3. The Michaelis-Menten model for enzyme kinetics has the following mean response function:

y = b0 x / (x + b1) + ε

where parameter b0 = Vmax is the maximum reaction velocity (occurs at high substrate concentration)
and parameter b1 = Km is the Michaelis constant that measures the strength of the enzyme-substrate
interaction. Use the dataset given in Table 2 of [119] to train the model.
4. Consider the general formula for the jth partial derivative of the loss function:

∂L(b)/∂bj = − Σ_{i=0}^{m-1} [yi − f(xi; b)] ∂f(xi; b)/∂bj

∂L(b)/∂bj = − Σ_{i=0}^{m-1} [εi] ∂f(xi; b)/∂bj

First-order optimization algorithms use gradients/partial derivatives to iteratively improve the
parameters. Gradient Descent (GD) uses the whole training set, while Stochastic Gradient Descent
(SGD) uses a (typically random) subset of the training set (called a mini-batch). For pure Stochastic
Gradient Descent (SGD), the parameters are updated for every data instance i, so the above equation
becomes the following:

∂L(b)/∂bj = − [εi] ∂f(xi; b)/∂bj
Now, suppose the error εi > 0 (positive). This means that the predicted value ŷi = f(xi; b) is too low.
This raises the question of how to change the parameter bj to reduce the loss function, i.e., make the error
smaller or, equivalently, increase the mean response function f.
Clearly, the answer depends on the slope of the mean response function,

∂f(xi; b)/∂bj

For the case where the slope of the mean response function is positive, determine whether to increase
or decrease the parameter bj. Also, explain what this does to the loss function.
(a) bj = 1
(b) bj = 4
[Figure: How to Update Parameter bj, with y (black) and ŷ (blue) plotted against bj.]
Assuming the slope of the mean response function f is positive at both values of bj , how should the
j th parameter bj be changed (increased or decreased) for (a) and (b)?
10.2 Simple Exponential Regression
The SimpleExpRegression class can be used for developing Simple Exponential Regression models. These
are useful when data exhibit exponential growth or decay.
y = b0 e^{b1 x} + ε    (10.11)
The mean response function has two parameters, a multiplier b0 and a growth rate b1 .
Note, for SimpleExpRegression the number of parameters p = 2, while the number of predictor variables
n = 1.
10.2.2 Training
Given a dataset (x, y) where x ∈ Rm and y ∈ Rm , the parameters b = [b0 , b1 ] may be determined using
Least Squares Estimation (LSE). The loss function may be deduced from the general NLR loss function in
summation form.
L(b) = (1/2) Σ_{i=0}^{m-1} [yi − f(xi; b)]²

Replacing the general mean response function f(xi; b) with b0 e^{b1 xi} yields

L(b) = (1/2) Σ_{i=0}^{m-1} [yi − b0 e^{b1 xi}]²    (10.13)
When ε ∼ N(0, σ²I), i.e., each error has mean 0 and variance σ², the same parameter estimates can be
obtained using Maximum Likelihood Estimation (MLE); see the Chapter on Generalized Linear Models.
See the Exercises for showing the equivalence of the two estimation techniques.
10.2.3 Optimization
Optimization for Simple Exponential Regression involves determining the gradient of the loss function with
respect to the parameters b. Starting with the general formula for the j th partial derivative of the loss
function L(b)
∂L(b)/∂bj = − Σ_{i=0}^{m-1} [εi] ∂f(xi; b)/∂bj
the partial derivatives of the mean response function are needed. In this case, there are two parameters
b = [b0 , b1 ] and two formulas:
Partial Derivatives of the Mean Response Function

∂f(xi; b)/∂b0 = e^{b1 xi}    (10.14)

∂f(xi; b)/∂b1 = xi b0 e^{b1 xi}    (10.15)

Substituting these into the general formula gives the partial derivatives of the loss function:

∂L(b)/∂b0 = − Σ_{i=0}^{m-1} [εi] e^{b1 xi}    (10.16)

∂L(b)/∂b1 = − Σ_{i=0}^{m-1} [εi] xi b0 e^{b1 xi}    (10.17)

Since the ith predicted response ŷi = b0 e^{b1 xi}, these two equations may be simplified as follows:

∂L(b)/∂b0 = − Σ_{i=0}^{m-1} [εi] ŷi/b0 = − ε · (ŷ/b0)    (10.18)

∂L(b)/∂b1 = − Σ_{i=0}^{m-1} [εi] xi ŷi = − ε · (x ∗ ŷ)    (10.19)

where ∗ is the element-wise vector product (e.g., [2, 4, 6] ∗ [5, 3, 1] = [10, 12, 6]). The gradient of the loss
function is the vector formed from the two partial derivatives given above.

∇L(b) = [∂L(b)/∂b0, ∂L(b)/∂b1]    (10.20)
A Gradient Descent Optimizer is a simple type of first-order optimization algorithm. It starts at a random
(or guessed) point in parameter space and iteratively moves in the direction opposite to the gradient. To
control how rapidly the algorithm moves in that direction, a learning rate η is introduced as a hyper-parameter
to tune. With each iteration the parameter b is updated as follows:
b = b − η ∇L(b)
The pseudocode for the Gradient Descent Algorithm for Simple Exponential Regression is as follows:
1 b = b_init                     // initial random/guessed parameter vector
2 while loss_decreasing do       // stopping rule
3     val yp = f (x, b)          // y-predicted vector
4     val ε  = y - yp            // error vector
5     val δ0 = (ε dot yp) / b0   // delta 0 - negative partial of loss w.r.t. b0
6     val δ1 = ε dot (yp * x)    // delta 1 - negative partial of loss w.r.t. b1
7     b0 += η * δ0               // update to first parameter
8     b1 += η * δ1               // update to second parameter
9 end while
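A runnable plain-Scala rendering of this pseudocode is sketched below (fitExp, eta and maxIter are illustrative names; a practical version would replace the fixed iteration count with a real stopping rule).

// A minimal sketch of gradient descent for y = b0 * exp(b1 * x):
// the deltas are the negated partial derivatives derived above.
import scala.math.exp

def fitExp (x: Array [Double], y: Array [Double],
            eta: Double = 0.001, maxIter: Int = 1000): (Double, Double) =
    var (b0, b1) = (1.0, 0.0)                            // initial guess for b
    for _ <- 1 to maxIter do
        val yp = x.map (xi => b0 * exp (b1 * xi))        // predicted responses
        val e  = y.zip (yp).map ((yi, ypi) => yi - ypi)  // error vector
        val d0 = e.zip (yp).map ((ei, ypi) => ei * ypi / b0).sum
        val d1 = e.zip (yp).zip (x).map { case ((ei, ypi), xi) => ei * ypi * xi }.sum
        b0 += eta * d0                                   // move opposite the gradient
        b1 += eta * d1
    (b0, b1)
end fitExp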
10.2.4 Linearization
In this subsection, we examine two versions of simple exponential regression to see if the model can be
linearized, by transforming the model so that it becomes linear in the parameters. The mean response
function for the simple exponential regression model is
f(x, b) = b0 e^{b1 x}
Multiplicative Errors

y = b0 e^{b1 x} ε

Taking logarithms, with β0 = ln(b0) and e = ln(ε), gives a form linear in the parameters:

ln(y) = β0 + b1 x + e

This linearized form can be used to fit the transformed response as was done with TranRegression.
Additive Errors
y = b0 e^{b1 x} + ε
Notice that the error model is different (multiplicative vs. additive), so the transformed linear regression model
is different from the original nonlinear exponential regression model. They may be viewed as competitors;
see the exercises.
Some nonlinear models are intrinsically linear in that the data can be transformed into a linear form,
although the transformation may transform the errors inappropriately. The following nonlinear model,
however, cannot be transformed to a linear form, regardless of the error model. It adds an additional
parameter to the simple exponential model.

y = b0 + b1 e^{b2 x} + ε    (10.22)
10.2.5 Exercises
1. Consider the following hospitalization dataset from Chapter 13 of [102].
1 val xy = MatrixD ((15 , 2) , 2 , 54 ,
2 5, 50 ,
3 7, 45 ,
4 10 , 37 ,
5 14 , 35 ,
6 19 , 25 ,
7 26 , 20 ,
8 31 , 16 ,
9 34 , 18 ,
10 38 , 13 ,
11 45 , 8,
12 52 , 11 ,
13 53 , 8,
14 60 , 4,
15 65 , 6)
2. Model the hospitalization dataset using Simple Linear Regression, plot the regression line and show
the Quality of Fit (QoF) measures.
3. Model the hospitalization dataset using Simple Exponential Regression, plot the regression curve and
show the Quality of Fit (QoF) measures. How do the QoF measures compare with those produced by
Simple Linear Regression?
4. Model the hospitalization dataset using Log Transformed Linear Regression, plot the regression curve
and show the Quality of Fit (QoF) measures. How do the QoF measures compare with those produced
by Simple Linear Regression and Simple Exponential Regression?
5. For the Simple Exponential Regression model

y = b0 e^{b1 x} + ε

show that when the error vector ε ∼ N(0, σ²I), the LSE and MLE estimation techniques will produce the
same parameter estimates.
6. Use the pseudo-code given in this section to write a Gradient Descent Optimizer for Simple Exponential
Regression using ScalaTion.
7. Pass the loss function into a second-order Quasi-Newton Optimizer and compare to the first-order
algorithm in terms of the parameter solution, the value of the loss function, and the number of steps
needed for convergence.
10.3 Exponential Regression
The simple exponential regression model can be extended to have multiple predictor variables, e.g., x =
[x1 , x2 ] and b = [b0 , b1 , b2 ].
y = b0 e^{b1-2 · x} + ε    (10.23)

f(x, b) = b0 e^{b1-2 · x}

where b1-2 = [b1, b2] is the subvector of parameters in the exponent. Absorbing the multiplier into the
exponent (replace b0 by e^{b0} and set x0 = 1) gives the loss function

L(b) = (1/2) Σ_{i=0}^{m-1} [yi − e^{b·xi}]²
Again, starting with the general formula for the j th partial derivative of the loss function L(b)
∂L(b)/∂bj = − Σ_{i=0}^{m-1} [εi] ∂f(xi; b)/∂bj

the problem reduces to finding the jth partial derivative of the mean response function.

∂f(xi, b)/∂bj = xij e^{b·xi}

Substituting this result into the general formula gives

∂L(b)/∂bj = − Σ_{i=0}^{m-1} [εi] xij e^{b·xi}    (10.25)
This mean response function can be linearized using a log transformation. Alternatively, one can use a
Generalized Linear Model (GLM) for such datasets to provide more options for dealing with the error
distribution. See http://www.stat.uni-muenchen.de/~leiten/Lehre/Material/GLM_0708/chapterGLM.pdf
for more details. Also, see the exercises.
Class Methods:
1 @ param x the data / input matrix
2 @ param y the response / output vector
3 @ param fname_ the feature / variable names ( defaults to null )
4 @ param hparam the hyper - parameters ( currently none )
5 @ param nonneg whether to check that responses are nonnegative
6
7 class ExpRegression ( x : MatrixD , y : VectorD , fname_ : Array [ String ] = null ,
8     hparam : HyperParameter = null , nonneg : Boolean = true )
9     extends Predictor (x , y , fname_ , hparam )
10    with Fit ( dfm = x . dim2 - 1 , df = x . dim - x . dim2 ) :
11
10.3.2 Exercises
1. A Generalized Linear Model (GLM) that utilizes Maximum Likelihood Estimation (MLE) may be
used for such data. The response variable y is modeled as the product of a mean response function
f(x; b) = μy(x) and exponentially distributed residuals/errors ε.

y = μy(x) × ε

ln(μy(x)) = b · x

Explain why the errors/residuals εi = yi/μy(xi) follow an Exponential distribution.

f(yi/μy(xi)) = (1/μy(xi)) e^{−yi/μy(xi)}
2. Show that the likelihood function for Exponential Regression is the following:
L = Π_{i=0}^{m-1} (1/μy(xi)) e^{−yi/μy(xi)}

L = Π_{i=0}^{m-1} e^{−b·xi} e^{−yi/e^{b·xi}}

Taking the natural logarithm gives the log-likelihood function:

l = Σ_{i=0}^{m-1} [ −b·xi − yi/e^{b·xi} ]
3. Write a function to compute the negative log-likelihood (−l) and pass it into a nonlinear optimization
algorithm for minimization.
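A minimal sketch of such a function (negLogLikelihood is an illustrative name, assuming x(i) returns the ith input row) is given below; the partially applied negLogLikelihood (x, y) then has the VectorD => Double shape expected by an optimizer such as BFGS (cf. sseF in Section 10.1.3).

// A minimal sketch: negative log-likelihood -l(b) for Exponential Regression.
import scala.math.exp
import scalation.mathstat.{MatrixD, VectorD}

def negLogLikelihood (x: MatrixD, y: VectorD)(b: VectorD): Double =
    var sum = 0.0
    for i <- 0 until x.dim do
        val bx = b dot x(i)               // linear predictor b . x_i
        sum += -bx - y(i) / exp (bx)      // i-th log-likelihood term
    -sum                                  // negate, so minimizing maximizes l
end negLogLikelihood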
10.4 Perceptron
The Perceptron class supports single-valued 2-layer (input and output) Neural Networks. The inputs into a
Neural Net are given by the input vector x, while the outputs are given by the output value y. As depicted
in Figure 10.2, each component of the input xj is associated with an input node in the network, while the
output y is associated with the single output node. The input layer consists of n input nodes, while the
output layer consists of 1 output node.
Figure 10.2: A perceptron with input nodes x0, x1, x2, edge weights b0, b1, b2, bias β and activation function f producing the output y.
An edge connects each input node with the output node, i.e., there are n edges in the network. To
include an intercept in the model (also referred to as bias) one of the inputs (say x0 ) must always be set to
1. Alternatively, a bias offset β can be associated with the output node and added to the weighted sum (see
below).
y = b · x + ε = b0 + b1 x1 + ... + bn−1 xn−1 + ε
We now take the linear combination of the inputs, b · x, and apply an activation function fa (or simply f ).
y = f(b · x) + ε = f( Σ_{j=0}^{n-1} bj xj ) + ε    (10.26)
The general mean response function fr (x; b) of nonlinear regression, which takes two vectors as input, is
now replaced by first taking a linear combination of x and then applying a simpler activation function that
takes a single scalar as input, i.e., f (b · x).
10.4.2 Ridge Functions
Restricting the multi-dimensional function fr(x; b) to f(b · x) means that the response surface is simply a
ridge function. As an example, let b = [.5, 2, 1] and x = [1, x1, x2]; then

y = f(.5 + 2x1 + x2)

[Figure: the resulting ridge function response surface y plotted over x1 and x2.]
10.4.3 Training
Given several input vectors and output values (e.g., in a training set), optimize/fit the weights b connecting
the layers. After training, given an input vector x, the net can be used to predict the corresponding output
value ŷ.
A training dataset consisting of m input-output pairs is used to minimize the error in the prediction by
adjusting the parameter/weight vector b ∈ Rn . Given an input/data matrix X ∈ Rm×n consisting of m
input vectors and an output vector y ∈ Rm consisting of m output values, minimize the distance between
the actual/target output vector y and the predicted output vector ŷ,
ŷ = f (Xb) (10.27)
where f : Rm → Rm is the vectorized version of the activation function f . The vectorization may occur over
the entire training set or more likely, an iterative algorithm may work with a group/batch of instances at a
time. In other words, the goal is to minimize some norm of the error vector.
[Figure: Perceptron training. Each instance (xi, yi) of the Dataset (X, y) is fed through the network fa(b · x) to produce ŷ; the error ε = y − ŷ drives the updates to the parameters b.]
ε = y − ŷ = y − f(Xb)    (10.28)

min_b ‖y − f(Xb)‖
As was the case with regression, it is convenient to minimize the dot product of the error with itself
(‖ε‖² = ε · ε). In particular, we aim to minimize half of this value, half sse (the loss function L).

L(b) = (1/2) [y − f(Xb)] · [y − f(Xb)]    (10.29)
Using summation notation gives
L(b) = (1/2) Σ_{i=0}^{m-1} [yi − f(xi · b)]²    (10.30)
10.4.4 Optimization
Optimization for Perceptrons and Neural Networks is typically done using an iterative optimization algorithm
that utilizes gradients. Popular optimizers include Gradient Descent (GD), Stochastic Gradient Descent
(SGD), Stochastic Gradient Descent with Momentum (SGDM), Root Mean Square Propagation (RMSProp)
and Adaptive Moment Estimation (Adam) (see Appendix on Optimization for details).
The gradient of the loss function L is calculated by computing all of the partial derivatives with respect
to the parameters/weights.
∇L(b) = [∂L/∂b0, ..., ∂L/∂bn−1]
Partial Derivative for bj
Starting with the nonlinear regression general formula for the j th partial derivative of the loss function L(b)
∂L(b)/∂bj = − Σ_{i=0}^{m-1} [εi] ∂fr(xi; b)/∂bj

and specializing the mean response function fr(xi; b) to f(xi · b), i.e., the function of two vectors is replaced
with an activation function applied to the dot product of the two vectors, yields

∂L(b)/∂bj = − Σ_{i=0}^{m-1} [εi] ∂f(xi · b)/∂bj
Now, the partial derivative of f(xi · b) with respect to bj can be determined via the chain rule using the
scalar pre-activation response ui = xi · b.

∂f(xi · b)/∂bj = [∂f(ui)/∂ui] [∂ui/∂bj]    (10.31)

Since ∂f(ui)/∂ui is a regular derivative (it is one dimensional), it can be replaced with df(ui)/dui, which
may be denoted as f′(ui).

∂f(ui)/∂ui = df(ui)/dui = f′(ui)    (10.32)
Combining Results
∂f(xi · b)/∂bj = f′(ui) xij    (10.33)

Therefore, the jth partial derivative of the loss function L becomes

∂L(b)/∂bj = − Σ_{i=0}^{m-1} [εi] xij f′(ui)    (10.34)

∂L(b)/∂bj = − x:j · (ε ∗ f′(u))    (10.35)

where x:j is the jth column of data matrix X, ∗ is the element-wise vector product, u = Xb, and f′(u) is
the vector extension of the derivative of the activation function df(ui)/dui = f′(ui) over all m instances.
See the derivation below that expresses the loss function as the dot product of the error vectors. It
provides another way to obtain the same results.
Alternative Derivation at the Vector Level
The same basic derivation can be carried out at the vector level as well. Taking the partial derivative with
respect to the j th parameter/weight, bj , is a bit complicated since we need to use the chain rule and the
product rule. First, letting u = Xb (the pre-activation response) allows the loss function to be simplified to
1
L(b) = [y − f (u)] · [y − f (u)] (10.36)
2
The chain rule from vector calculus to be applied is
∂L(b) ∂L(b) ∂u
= · (10.37)
∂bj ∂u ∂bj
The first partial derivative is
∂L(b)
= − [y − f (u)] ∗ f 0 (u) (10.38)
∂u
where the first part of the r.h.s. is f 0 (u) which is the derivative of f with respect to vector u and the
second part is the difference between the actual and predicted output/response vectors. The two vectors are
multiplied together, element-wise.
The second partial derivative is
∂u
= x:j (10.39)
∂bj
where x:j is the j th column of matrix X (see Exercises 4, 5 and 6 for details).
The dot product of the two partial derivatives gives
∂L(b)
= − x:j · [y − f (u)] f 0 (u)
∂bj
Since the error vector = y − f (u), we may simplify the expression.
∂L(b)
= − x:j · ( ∗ f 0 (u)) (10.40)
∂bj
The j th partial derivative (or j th element of the gradient) indicates the relative amount to move (change bj )
in the j th dimension to reduce L. Notice that equations 10.33 and 10.38 are the same.
The δ Vector
The goal of training is to minimize the errors. The errors are calculated using forward propagation through
the network. The parameters (weights and bias) are updated by back-propagation of the errors, or more
precisely, the slope-adjusted errors as illustrated in Figure 10.5.
Figure 10.5: Back-propagation in a perceptron. The prediction flows forward from inputs x0, x1, x2 through weights b0, b1, b2 to output y, while the slope-adjusted error δ flows backward to update the parameters.
Therefore, it is helpful especially for multi-layer neural networks to define the delta vector δ as follows:
δ = ∂L(b)/∂u = − ε ∗ f′(u)    (10.41)

It multiplies the derivative of the vectorized activation function f by the negative error vector, element-wise.
If the error is small or the derivative is small, the adjustment to the parameter should be small. The partial
derivative of L with respect to bj now simplifies to

∂L(b)/∂bj = x:j · δ    (10.42)
As an example, compute the m-dimensional vectors, ε and δ, for Exercise 7. With these parameters, the
predicted output/response vector ŷ may be computed in two steps: the first step computes the pre-activation
response u = Xb; the second step takes this vector and applies the activation function to each of its elements.
This requires looking ahead to the subsection on activation functions. The sigmoid function (abbreviated
sigmoid) is defined as follows:

sigmoid(u) = 1/(1 + e^{−u})

Figure 10.6 shows the response surface for the example problem where the response y is color-coded.
Pre-activation Vector u
From Figure 10.6 create the input/data matrix X ∈ R^{9×3} and multiply it by the current parameter vector
b ∈ R^3. The pre-activation vector is the aggregated signal before activation.

u = Xb = [.1, .15, .2, .2, .25, .3, .3, .35, .4]

Figure 10.6: Example Perceptron Problem (Exercise 7): response values, black (.2), blue (.3), green (.5),
purple/crimson (.8), red (1), form a diagonal terrace pattern over the (x1, x2) grid.

The predicted response vector is determined by applying the activation function to each element of the
pre-activation vector.

ŷ = sigmoid(u) = sigmoid([.1, .15, .2, .2, .25, .3, .3, .35, .4])
  = [.5249, .5374, .5498, .5498, .5621, .5744, .5744, .5866, .5986]
Error Vector
The error vector is simply the difference between the actual and predicted output/response vectors.
ε = y − ŷ
  = [.5000, .3000, .2000, .8000, .5000, .3000, 1.0000, .8000, .5000] −
    [.5249, .5374, .5498, .5498, .5621, .5744, .5744, .5866, .5986]
  = [−.0249, −.2374, −.3498, .2501, −.0621, −.2744, .4255, .2133, −.0986]
Delta Vector δ
To compute the delta vector δ, we must look ahead to get the derivative of the activation function (see
exercise 8).
Therefore, since sigmoid(u) = ŷ,

δ = − ε ∗ [ŷ ∗ (1 − ŷ)]
  = [.0062, .0590, .0865, −.0619, .0153, .0670, −.1040, −.0517, .0237]
Collecting the partial derivatives gives the gradient:

∇L(b) = ∂L(b)/∂b = − X^T [ε ∗ f′(Xb)] = X^T δ    (10.44)
Parameter Update
Since many optimizers, such as gradient descent, move in the direction opposite to the gradient by a distance
governed by the learning rate η (alternatively, step size), the following term should be subtracted from the
weight/parameter vector b.
∇L(b) η = X^T δ η    (10.45)

The right-hand side is an n-by-m matrix times an m-vector, yielding an n-vector result. Since gradient-based
optimizers move in the negative gradient direction by an amount determined by the magnitude of the
gradient times a learning rate η, the parameter/weight vector b is updated as follows:

b = b − X^T δ η    (10.46)
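One gradient-descent step at the vector level can be sketched as follows (a minimal illustration; updateStep is an illustrative name, and it assumes ScalaTion's MatrixD/VectorD support *, transpose and map as used elsewhere in this chapter).

// A minimal sketch of one gradient-descent step for a sigmoid perceptron:
// u = Xb, yp = sigmoid(u), e = y - yp, delta = -e * f'(u), b = b - X^T delta eta.
import scala.math.exp
import scalation.mathstat.{MatrixD, VectorD}

def sigmoid_ (u: VectorD): VectorD = u.map (ui => 1.0 / (1.0 + exp (-ui)))

def updateStep (x: MatrixD, y: VectorD, b: VectorD, eta: Double): VectorD =
    val u  = x * b                                  // pre-activation vector u = Xb
    val yp = sigmoid_ (u)                           // predicted response f(u)
    val e  = y - yp                                 // error vector
    val d  = new VectorD (x.dim)                    // delta vector (constructor assumed)
    for i <- 0 until x.dim do d(i) = -e(i) * yp(i) * (1.0 - yp(i))
    b - x.transpose * d * eta                       // move opposite the gradient
end updateStep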
The parameter/weight vector b may be initialized with small random values, e.g., via the weightVec function:

5 def weightVec ( rows : Int , stream : Int = 0 , limit : Double = -1.0) : VectorD =
6 val lim = if limit <= 0.0 then limitF ( rows ) else limit
7 val rvg = new RandomVecD ( rows , lim , 0.0 , stream = stream )
8 rvg . gen
9 end weightVec
10
11 private inline def limitF ( rows : Int ) : Double = 1.0 / sqrt ( rows )
For testing or learning purposes, the weights may also be set manually.
1 @ param w0 the initial weights for parameter b
2
10.4.7 Activation Functions
An activation function fa (or simply f) takes an aggregated signal and transforms it. These activation
functions typically introduce smooth non-linearities, either between lower and upper bounds (e.g., [0, 1] or
[−1, 1]) or following a rectified linear unit (zero for negative signals and linear for positive signals).
The simplest activation function is the id or identity function, where the aggregated signal is passed
through unmodified. In this case, Perceptron is in alignment with Regression (see Exercise 9). This
activation function is usually not intended for neural nets with more layers, since theoretically such networks
can be reduced to a two-layer network (although it may be applied in the last layer).
More generally useful activation functions include reLU, lreLU, eLU, seLU, geLU, sigmoid, tanh and
gaussian. Several activation functions are compared in [118, 137]. For these activation functions the
outputs in the y vector need to be transformed into the range specified for the activation function, see Table
10.1. It may also be useful to transform/standardize the inputs to hit the sweet spot for the particular
activation function being used.
The curves of three of the activation functions are shown in Figure 10.7.
Table 10.1: Activation Functions: Identity id, Rectified Linear Unit reLU, Leaky Rectified Linear Unit
lreLU, Exponential Linear Unit eLU, Scaled Exponential Unit seLU, Gaussian Error Linear Unit geLU,
Sigmoid sigmoid, Hyperbolic Tangent tanh, and Gaussian gaussian.
The Gaussian Error Linear Unit (geLU) activation function is uΦ(u) where Φ(u) is the CDF for the
standard Normal (or Gaussian) distribution (see the Probability Chapter). As this function is computed
numerically, the following approximation may be used instead.
uΦ(u) ≈ f(u) = .5u [1 + tanh(√(2/π) (u + .044715u³))]    (10.47)
The derivative of the geLU activation function f 0 (u) may be computed using product and chain rules [133].
Figure 10.7: Activation Functions: sigmoid (blue), tanh (black), reLU (green)
The sigmoid function has an ‘S’ shape, which facilitates its use as a smooth and differentiable version
of a step function, with larger negative values tending to zero and larger positive values tending to one. In
the case of using sigmoid for the activation function, f′(u) = f(u)[1 − f(u)], so the delta vector may be
computed directly from the predicted values ŷ.
Gradient-descent algorithms iteratively move in the negative gradient direction by an amount determined
by the magnitude of the gradient times a learning rate η, so the parameter/weight vector b is adjusted as
follows:
b = b − X^T δ η
Assuming the learning rate η = 1 and taking the δ vector from the example, the update to parameter/weight
vector b is
X^T δ η = [.0402, −.1218, .1886]
Check to see if the new values for b have improved/decreased the loss function L.
Training makes careful adjustments to obtain nearly (locally) optimal values for L. Gradient-descent works
by iteratively moving in the direction opposite to the gradient until a stopping rule evaluates to true (e.g.,
stop after the qth increase in L and return the best solution so far). The rate of convergence can be adjusted
using the learning rate η, which multiplies the gradient. Setting it too low slows convergence, while setting
it too high can cause oscillation or divergence. In ScalaTion, the learning rate η (eta in the code) is a
hyper-parameter that defaults to 0.1, but is easily adjusted, e.g.,
1 Optimizer . hp ( " eta " ) = 0.05
train Method
The train method contains the main training loop that is shown below. Inside the loop, new values yp are
predicted, from which an error vector e is determined. This is used to calculate the delta vector d, which
along with x.T and eta is used to update the parameter/weight vector b. Note, x_ and y_, which default
to x and y, constitute the training set.
1 @ param x_ the training / full data / input matrix
2 @ param y_ the training / full response / output vector
3
The vector function f.f_ is the vectorization of the scalar activation function f.f, and is created in ScalaTion
using the vectorize higher-order function defined in the mathstat package, e.g., given a scalar function f, it
can produce the corresponding vector function f_.
1 def vectorize ( f : FunctionS2S ) : FunctionV2V = ( x : VectorD ) => x . map ( f ( _ ) )
2 val f_ = vectorize ( f )
The function f.d is the derivative of the vector activation function. The collectLoss method is defined in
the MonitorLoss trait.
The core of the algorithm is the first four lines in the loop. Table 10.2 shows the correspondence between
these lines of code and the main/boxed equations derived in this section. Note, all the equations in the table
are vector assignments.
The third line of code appears to be different from the mathematical equation, in terms of passing the pre-
activation versus the post-activation response. It turns out that all derivatives for the activation functions
Table 10.2: Correspondence between Code and Boxed Equations
(except Gaussian) are either formulas involving constants or simple functions of the activation function itself,
so for efficiency, the yp vector is passed in.
Warning: This implementation is minimal to illustrate the basic required mechanisms, and as such is
likely to be less accurate. For example, the code does not create mini-batches, has a stopping rule that is too
simple, uses a fixed learning rate, and does not utilize a modern optimization algorithm. For a useful perceptron,
NeuralNet_2L in the neuralnet package may be used with one node in the output layer.
A perceptron can be considered to be a special type of nonlinear or transformed regression, see Exercise 10.
The Perceptron class defaults to the f sigmoid Activation Function Family (AFF), which is defined with
the ActivationFun object.
1 @ param name the name of the activation function
2 @ param f the activation function itself ( scalar version )
3 @ param f_ the vector version of the activation function
4 @ param d the vector version of the activation function derivative
5 @ param bounds the ( lower , upper ) bounds on the range of the activation function
6 e . g . , (0 , 1) for sigmoid , defaults to null = > no limit
7 @ param arange the ( lower , upper ) bounds on the input ( active ) range of the function
8 e . g . , ( -2 , 2) for sigmoid , defaults to null = > no limit
9
16 end AFF
Rescaling
Depending on the activation function, rescaling of outputs and/or inputs may be necessary:
1. Output: If the actual response vector y is outside the bounds/range of the activation function, it will
be impossible for the predicted response vector ŷ to approximate it, so rescaling will be necessary. The
bounds field in AFF is used for rescaling the vector y.
2. Input: If the linear combination of inputs takes the pre-activation value beyond the effective domain
(active range) of the activation function, training will become slow (or even ineffective), e.g., when u
is large, f(u) will be nearly 1 for sigmoid, even when there are substantial changes to u induced by
updates to the parameters. The arange field in AFF is used for rescaling the columns of matrix X.
As a convenience, all the modeling techniques in ScalaTion have a factory method for rescaling the inputs
and outputs.
1 @ param x the data / input matrix
2 @ param y the response / output vector
3 @ param fname the feature / variable names ( defaults to null )
4 @ param hparam the hyper - parameters ( defaults to hp )
5 @ param f the activation function family for layers 1 - >2 ( input to output )
6
Other activation functions should be experimented with, as one may produce better results. All the
activation functions shown in Table 10.1 are available in the ActivationFun object.
Essentially, parameter optimization in perceptrons involves using/calculating several vectors as summa-
rized in Table 10.3 where n is the number of parameters and m is the number of instances used at a particular
point in the iterative optimization algorithm, for example, corresponding to the total number of instances
in a training set for Gradient Descent (GD).
Stochastic Gradient Descent (SGD) utilizes a fraction of the m instances to make updates to the parameters,
corresponding to just 1 for pure SGD or to the size of a mini-batch (e.g., 30) for the common form of SGD.
Instead of the single loop of the Perceptron, there is a nested loop: the outer loop is over training epochs,
while the inner loop is over mini-batches.
1 cfor ( go && epoch <= maxEpochs , epoch + = 1) { // iterate over each epoch
2 val batches = permGen . igen . chop ( nB ) // permute indices & chop
3 for ib <- batches do b -= updateWeight ( x ( ib ) , y ( ib ) ) // iterate : update param b
This has the obvious benefit of reducing the amount of computation required to make parameter updates.
Somewhat surprisingly, it also makes it easier for the algorithm to escape local minima and to generally be
a more robust algorithm. The mini-batch size is another hyper-parameter that can be tuned.
Class Methods:
1 @ param x the data / input m - by - n matrix ( data consisting of m input vectors )
2 @ param y the response / output m - vector ( data consisting of m output values )
3 @ param fname_ the feature / variable names ( defaults to null )
4 @ param hparam the hyper - parameters for the model / network ( defaults to Perceptron . hp )
5 @ param f the activation function family for layers 1 - >2 ( input to output )
6 @ param itran the inverse transformation function returns responses to original scale
7
The train method uses Gradient Descent with a simple stopping rule and a non-adaptive learning rate.
A better optimizer is used by the train method in the NeuralNet_2L class, which uses Stochastic Gradient
Descent, a better stopping rule (see the StoppingRule class), and an adaptive learning rate. The work is
delegated to the Optimizer_SGD object and can easily be changed to use Stochastic Gradient Descent with
Momentum via the Optimizer_SGDM object. A NeuralNet_2L may be thought of as multiple Perceptrons.
10.4.10 Exercises
1. Plot the rest of activation functions in Table 10.1 in the same plot and compare them.
2. The Texas Temperature regression problem can also be analyzed using a perceptron.
1 val fname = Array ( " one " , " Lat " , " Elev " , " Long " )
2
22 val y = VectorD (56.0 , 48.0 , 60.0 , 46.0 , 38.0 , 46.0 , 53.0 , 46.0 ,
23 44.0 , 41.0 , 47.0 , 36.0 , 52.0 , 60.0 , 56.0 , 62.0)
24
3. Analyze the Example Concrete dataset in the neuralnet package, which has three output variables
y0 , y1 and y2 . Create a perceptron for each output variable.
4. Show that for the pre-activation vector

u = Xb = Σ_j bj x:j

the partial derivative with respect to bj is

∂u/∂bj = x:j
5. Given the formula for the loss function L : R^m → R expressed in terms of the pre-activation vector
u = Xb and the vectorized activation function f : R^m → R^m,

L(u) = (1/2) (y − f(u)) · (y − f(u))

derive the formula for the gradient of L with respect to u.

∇L = ∂L/∂u = − f′(u) ∗ (y − f(u))

Hint: Take the gradient, ∂L/∂u, using the product rule (d1 · f2 + f1 · d2).

∂L/∂u = − (∂f(u)/∂u) · (y − f(u))

where f1 = f2 = y − f(u) and d1 = d2 = − ∂f(u)/∂u. Next, assuming ∂f(u)/∂u is a diagonal matrix, show
that the above equation can be rewritten as

∂L/∂u = − f′(u) ∗ (y − f(u))

where f′(u) = [f′(u0), ..., f′(um−1)] and the two vectors, f′(u) and y − f(u), are multiplied element-wise.
6. Show that the m-by-m Jacobian matrix, Jf(u) = ∂f(u)/∂u, is a diagonal matrix, i.e.,

[Jf(u)]_{ij} = ∂fi(u)/∂uj = 0 if i ≠ j

where fi = f, the scalar activation function. Each diagonal element is the derivative of the activation
function applied to the ith input, f′(ui). See the section on Vector Calculus in Chapter 2 that discusses
Gradient Vectors, Jacobian Matrices and Hessian Matrices.
7. Show the first 10 iterations that update the parameter/weight vector b, which is initialized to [.1, .2, .1].
Use the following combined input-output matrix. Let the perceptron use the default sigmoid activation function.
1 // 9 data points : one x1 x2 y
2 val xy = MatrixD ((9 , 4) , 1.0 , 0.0 , 0.0 , 0.5 ,
3 1.0 , 0.0 , 0.5 , 0.3 ,
4 1.0 , 0.0 , 1.0 , 0.2 ,
5 1.0 , 0.5 , 0.0 , 0.8 ,
6 1.0 , 0.5 , 0.5 , 0.5 ,
7 1.0 , 0.5 , 1.0 , 0.3 ,
8 1.0 , 1.0 , 0.0 , 1.0 ,
9 1.0 , 1.0 , 0.5 , 0.8 ,
10 1.0 , 1.0 , 1.0 , 0.5)
11
12 Perceptron . hp ( " eta " ) = 1.0               // try several values for eta
13 val nn = new Perceptron (x , y , null , hp )    // create a perceptron , user control
14 // val nn = Perceptron ( xy , null , hp )       // create a perceptron , automatic scaling
For each iteration, do the following: print the weight/parameter update vector X^T δ η and the new
value for the weight/parameter vector b, and make a table with m rows showing the values computed
for each instance (e.g., ui, ŷi, εi and δi).

8. Show that for the sigmoid function

sigmoid(u) = [1 + e^{−u}]^{−1} = 1/(1 + e^{−u})

its derivative is

sigmoid′(u) = d sigmoid(u)/du = sigmoid(u)[1 − sigmoid(u)]
9. Show that when the activation function f is the id function, f′(u) is the one vector, 1. Plug this
into the equation for the gradient of the loss function to obtain the following result.
∂L/∂b = − X^T [1 ∗ ε] = − X^T (y − Xb)

Setting the gradient equal to zero now yields (X^T X) b = X^T y, the Normal Equations.
10. Show that a Perceptron with an invertible activation function f is similar to TranRegression with
transform f^{−1}. Explain any differences in the parameter/weight vector b and the sum of squared errors
sse. Use the sigmoid activation function and the AutoMPG dataset and make the following two plots
(using PlotM): y, ypr, ypt vs. t and y, ypr, ypp vs. t, where y is the actual response/output, ypr is
the prediction from Regression, ypt is the prediction from TranRegression and ypp is the prediction
from Perceptron.
10.5 Multi-Output Prediction
The PredictorMV trait (Predictor Multi-Variate) provides the basic structure and API for a variety of
modeling techniques that produce multiple responses/outputs, e.g., Neural Networks and Multi-Variate
Regression. It serves the same role that Predictor does for the regression modeling techniques that have a
single response/output variable.
y = f(B · x) + ε = f(B^T x) + ε    (10.50)

where B is the n-by-ny parameter matrix and ε is the error vector.
10.5.2 Training
The training equation takes the model equation and several instances in a dataset to provide estimates for
the values in parameter matrix B. Compared to the single response/output variable case, the main difference
is that the response/output vector, the parameter vector, and the error vector now all become matrices.
Y = f (XB) + E (10.51)
where X is an m-by-n data/input matrix, Y is an m-by-ny response/output matrix, B is an n-by-ny parameter
matrix, f is a function mapping one m-by-ny matrix to another, and E is an m-by-ny residual/error
matrix. Note, a bold function symbol f is used to denote a function mapping either vectors to vectors
(as was the case in the Model Equation subsection) or matrices to matrices (as is the case here).

f : R^{m×ny} → R^{m×ny}
If one is interested in referring to the kth component (or column) of the output, the model equation
decomposes into

yk = f(b:k · x) + εk
386
Analogous to other predictive modeling techniques, PredictorMV takes four arguments: the data/input
matrix x, the response/output matrix y, the feature/variable names fname, and the hyper-parameters for
the model/network hparam.
Class Methods:
1 @ param x the input / data m - by - n matrix
2 ( augment with a first column of ones to include intercept in model )
3 @ param y the response / output m - by - ny matrix
4 @ param fname the feature / variable names ( if null , use x_j s )
5 @ param hparam the hyper - parameters for the model / network
6
The methods provided by PredictorMV are similar to those in Predictor, but extensions to handle response
matrices are added.
The RegressionMV class extends PredictorMV and shares the matrix factorization across outputs, solving
for each output/response variable individually; it is hence faster than performing a separate Regression
for each output.
Class Methods:
1 @ param x the data / input m - by - n matrix
2 ( augment with a first column of ones to include intercept in model )
3 @ param y the response / output m - by - ny matrix
4 @ param fname_ the feature / variable names ( defaults to null )
5 @ param hparam the hyper - parameters ( defaults to Regression . hp )
6
7 hp += ( " lambda " , 0.01 , 0.01) // regularization / shrinkage hyper - parameter
8 hp += ( " upLimit " , 4 , 4) // up - limit hyper - parameter for stopping rule
9 hp += ( " beta " , 0.9 , 0.9) // momentum decay hyper - parameter
10 hp += ( " nu " , 0.9 , 0.9) // importance of momentum in parameter est .
11
17 end Optimizer
The Optimizer trait provides an abstract method called optimize that must be implemented in extending
optimization classes such as Optimizer_SGD. It also provides a method for automatic optimization that includes
built-in grid search. In addition, it provides a helper method (permGenerator) that facilitates the creation
of random mini-batches.
1 trait Optimizer extends MonitorLoss with StoppingRule :
2
Inside, the x is multiplied by the weight matrix w and the bias vector b is added. Note, the *: is right
associative since the NetParam object is on the right (see NeuralNet_2L for an example of its usage).
Class Methods:
1 @ param w the weight matrix
2 @ param b the optional bias / intercept vector ( null = > not used )
3
6 def copy : NetParam = NetParam ( w . copy , if b ! = null then b . copy else null )
7 def trim ( dim : Int , dim2 : Int ) : NetParam =
8 def update ( c : NetParam ) : Unit = { w = c . w ; b = c . b }
9 def set ( c : NetParam ) : Unit = { w = c . w ; b = c . b }
10 def update ( cw : MatrixD , cb : VectorD = null ) : Unit = { w = cw ; b = cb }
11 def set ( cw : MatrixD , cb : VectorD = null ) : Unit = { w = cw ; b = cb }
12 def + = ( c : NetParam ) : Unit =
13 def + = ( cw : MatrixD , cb : VectorD ) : Unit =
14 def -= ( c : NetParam ) : Unit =
15 def -= ( cw : MatrixD , cb : VectorD = null ) : Unit =
16 def * ( x : MatrixD ) : MatrixD =
17 def *: ( x : MatrixD ) : MatrixD = x * w + b
18 def dot ( x : VectorD ) : VectorD = ( w dot x ) + b
19 def toMatrixD : MatrixD = if b == null then w else b +: w
20 override def toString : String = s " b . w = $w \n b . b = $b "
10.6 Two-Layer Neural Networks
The NeuralNet 2L class supports multi-valued 2-layer (input and output) Neural Networks. The inputs into
a Neural Network are given by the input vector x, while the outputs are given by the output vector y. Each
input xj is associated with an input node in the network, while each output yk is associated with an output
node in the network, as shown in Figure 10.8. The input layer consists of n input nodes, while the output
layer consists of ny output nodes.
Figure 10.8: A 2-layer neural network with input nodes x0, x1, x2, edge weights bjk connecting input node j to output node k, biases β0, β1, and activation function f producing outputs y0 and y1.
An edge connects each input node with each output node, i.e., there are nny edges in the network. The
edge connecting input node j to output node k has weight bjk . In addition, below each output node is a
bias βk . An alternative to having explicit bias offsets, is to include an intercept in the model (as an implicit
bias) by adding a special input node (say x0 ) having its input always set to 1.
yk = f(b:k · x) + εk = f( Σ_{j=0}^{n-1} bjk xj ) + εk
The model equation for NeuralNet_2L can be written in vector form as follows:

y = f(B · x) + ε = f(B^T x) + ε    (10.53)
10.6.2 Training
Given several input vectors and output vectors in a training dataset (i = 0, . . . , m − 1), the goal is to
optimize/fit the parameters/weights B. The training dataset consisting of m input-output pairs is used to
minimize the error in the prediction by adjusting the parameter/weight matrix B. Given an input matrix
X ∈ Rm×n consisting of m input vectors and an output matrix Y ∈ Rm×ny consisting of m output vectors,
minimize the distance between the actual/target output matrix Y and the predicted output matrix Ŷ,

Ŷ = f(XB)    (10.54)

min_B ‖Y − Ŷ‖_F    (10.55)

where ‖·‖_F is the Frobenius norm, X is an m-by-n matrix, Y is an m-by-ny matrix, and B is an n-by-ny
matrix. Other norms may be used as well, but the square of the Frobenius norm will give the overall sum
of squared errors sse.

‖E‖²_F = Σ_{i=0}^{m-1} Σ_{j=0}^{ny-1} ε²ij    (10.56)
10.6.3 Optimization
As was the case with regression, it is convenient to minimize the dot product of the error with itself. We do
this for each of the columns of the Y matrix to get the sse for each yk and sum them up. The goal then is
to simply minimize the loss function L(B) = 21 sse(B). As in the Perceptron section, we work with half of
the sum of squared errors sse. Summing the error over each column vector y:k in matrix Y gives
L(B) = (1/2) Σ_{k=0}^{ny-1} (y:k − f(Xb:k)) · (y:k − f(Xb:k))    (10.57)

This nonlinear optimization problem may be solved by a variety of optimization techniques, including
Gradient Descent, Stochastic Gradient Descent or Stochastic Gradient Descent with Momentum.
Most optimizers require a derivative and ideally these should be provided in functional form (otherwise
the optimizer will need to numerically approximate them). Again, for the sigmoid activation function,

sigmoid(u) = 1/(1 + e^{−u})

the derivative is

sigmoid′(u) = sigmoid(u)[1 − sigmoid(u)]

The loss function for the kth output column is

L(b:k) = (1/2) (y:k − f(Xb:k)) · (y:k − f(Xb:k))
Notice that this is the same as the Perceptron loss function, just with subscripts on y and b.
In Regression, we took the gradient and set it equal to zero. Here, gradients will need to be computed by
the optimizer. The equations will be the same as given in the Perceptron section, again just with subscripts
added. The boxed equations from the Perceptron section become the following: the prediction vector for
the kth response/output is

ŷ:k = f(Xb:k)

δ:k = ∂L/∂uk = − ε:k ∗ f′(Xb:k)    (10.60)

∂L/∂b:k = − X^T [ε:k ∗ f′(Xb:k)] = X^T δ:k

Finally, the update for the kth parameter vector is

b:k = b:k − X^T δ:k η    (10.61)
Sigmoid Case
Ŷ = f(XB)    (10.62)

The m-by-ny negative of the error matrix, −E, is the difference between the predicted and actual/target
output/response.

−E = Ŷ − Y    (10.63)

The m-by-ny delta matrix ∆ adjusts the error according to the slopes within f′(XB) and is the element-wise
matrix (Hadamard) product of f′(XB) and −E.

∆ = f′(XB) ⊙ (Ŷ − Y)    (10.64)

In math, the Hadamard product may be denoted by the ⊙ symbol, while in ScalaTion it is denoted by
a ⊙-like Unicode symbol or *~. Finally, the n-by-ny parameter matrix B is updated by subtracting X^T ∆ η.

B = B − X^T ∆ η    (10.65)
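The boxed matrix updates can be sketched as follows for the sigmoid case (a minimal illustration, assuming MatrixD supports *, *~ (Hadamard), transpose and row-wise map as described in this section; updateB is an illustrative name).

// A minimal sketch: Yp = f(XB); Delta = (Yp - Y) (Hadamard) f'(XB); B -= X^T Delta eta.
import scala.math.exp
import scalation.mathstat.{MatrixD, VectorD}

val f_ = (u: VectorD) => u.map (ui => 1.0 / (1.0 + exp (-ui)))   // vector sigmoid
val fM = (u: MatrixD) => u.map (f_ (_))                          // matrixized sigmoid

def updateB (x: MatrixD, y: MatrixD, b: MatrixD, eta: Double): MatrixD =
    val yp    = fM (x * b)                    // predicted output matrix f(XB)
    val delta = (yp - y) *~ (yp - yp *~ yp)   // (Yhat - Y) element-wise yp(1 - yp) = f'(XB)
    b - x.transpose * delta * eta             // B = B - X^T Delta eta
end updateB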
The train method calls the appropriate optimizer and records statistics about the number of epochs. The
corresponding code for the train method is shown below:
1 @ param x_ the training / full data / input matrix
2 @ param y_ the training / full response / output matrix
3
For NeuralNet_2L, the bulk of the work is done by the optimize2 method in, for example, the Optimizer_SGD
class. The initialization part of this method is defined as follows:
1 @ param x the m - by - nx input matrix ( training data consisting of m input vectors )
2 @ param y the m - by - ny output matrix ( training data consisting of m output vectors )
3 @ param bb the array of parameters ( weights & biases ) between every two adjacent layers
4 @ param eta the initial learning / convergence rate
5 @ param ff the array of activation function family for every two adjacent layers
6
The main loop iterates up to maxEpochs, but may exit early depending on what stopWhen returns. The
learning rate η is periodically adjusted.
1 var sse_best_ = -0.0
2 var ( go , epoch ) = ( true , 1)
3 cfor ( go && epoch <= maxEpochs , epoch + = 1) { // iterate over each epoch
4 val batches = permGen . igen . chop ( nB ) // permute indices & chop
5
13 sse_best_ = sse_best // save best in sse_best_
14 go = false
15 else
16 if epoch % ADJUST_PERIOD = = 0 then η * = ADJUST_FACTOR
17 end if
18 } // cfor
The end of the optimize2 method returns the best value for the loss function and the number of epochs.
1 if go then (( y - f . fM ( b * x ) ) . normFSq , maxEpochs ) // return sse and # epochs
2 else ( sse_best_ , epoch - upLimit )
3 end optimize2
Note: f.fM is the matrix version of the activation function and it is created using the matrixize high-order
function that takes a vector function as input.
1 def matrixize ( f : FunctionV2V ) : FunctionM2M = ( x : MatrixD ) = > x . map ( f ( _ ) )
2 val fM = matrixize ( f_ )
Similarly, f.dM is the matrix version of the derivative of the activation function. The NeuralNet_2L class
also provides train2 methods (with built-in η search).
Also note: ScalaTion provides the following alternatives for (a) the Hadamard product: a ⊙-like Unicode
symbol or *~, and (b) the transpose: a T-like Unicode symbol or transpose.
Class Methods:
1 @ param x the m - by - n input / data matrix ( full / training data having m input vectors )
2 @ param y the m - by - ny output / response matrix ( full / training data having m vectors )
3 @ param fname_ the feature / variable names ( defaults to null )
4 @ param hparam the hyper - parameters for the model / network ( defaults to Optimizer . hp )
5 @ param f the activation function family for layers 1 - >2 ( input to output )
6 @ param itran the inverse transfo rmation function returns response matrix to original
7 scale
8
17 def test ( x_ : MatrixD = x , y_ : MatrixD = y ) : ( MatrixD , MatrixD ) =
18 override def makePlots ( yy_ : MatrixD , yp : MatrixD ) : Unit =
19 def predict ( v : VectorD ) : VectorD = f . f_ ( bb (0) dot v )
20 override def predict ( v : MatrixD = x ) : MatrixD = f . fM ( bb (0) * v )
21 def buildModel ( x_cols : MatrixD ) : NeuralNet_2L =
22 def summary2 ( x_ : MatrixD = getX , fname_ : Array [ String ] = fname ,
23 b_ : MatrixD = parameter ) : String =
Object Methods:
1 object NeuralNet_2L extends Scaling :
2
29 hparam : HyperParameter = Optimizer . hp , f : AFF = f_sigmoid ) :
30 NeuralNet_2L =
10.6.7 Exercises
1. The dataset in Example Concrete consists of 7 input variables and 3 output variables.
1 // Input Variables (7) ( component kg in one M ˆ 3 concrete ) :
2 // 1. Cement
3 // 2. Blast Furnace Slag
4 // 3. Fly Ash
5 // 4. Water
6 // 5. Super Plasticizer ( SP )
7 // 6. Coarse Aggregate
8 // 7. Fine Aggregate
9 // Output Variables (3) :
10 // 1. SLUMP ( cm )
11 // 2. FLOW ( cm )
12 // 3. 28 - day Compressive STRENGTH ( Mpa )
Create a NeuralNet 2L model to predict values for the three outputs y0 , y1 and y2 . Compare with the
results of using three Perceptrons.
2. Create a NeuralNet 2L model to predict values for the one output for the AutoMPG dataset. Compare
with the results of using the following models: (a) Regression, (b) Perceptron.
3. Were the results for the AutoMPG dataset the same for Perceptron and NeuralNet_2L? Please
explain. In general, is a NeuralNet_2L equivalent to ny Perceptrons?
4. Compare the convergence for the AutoMPG dataset of the following three optimization algorithms by
plotting the drop in the loss function versus the number of epochs.
(c) Stochastic Gradient Descent with Momentum (SGDM) from the train method in NeuralNet_2L
using Optimizer_SGDM.
5. Explain how the ADAM optimizer works and redo the above exercise using Keras comparing GD, SGD,
SGDM and Adam.
6. Draw a NeuralNet 2L with n = 4 input nodes and ny = 2 output nodes. Label the eight edges with
weights from the 4-by-2 weight matrix B = [bjk ]. Write the two model equations, one for y0 and one
for y1 . Combine these two equations into one vector equation for y = [y0 , y1 ]. Given column vector
|
x = [1, x1 , x2 , x3 ], express ŷ = f (B x) at the scalar level.
10.7 Three-Layer Neural Networks
The NeuralNet 3L class supports 3-layer (input, hidden and output) Neural Networks. The inputs into a
Neural Net are given by the input vector x, while the outputs are given by the output vector y. Between
these two layers is a single hidden layer, whose intermediate values will be denoted by the vector z. Each
input xj is associated with an input node in the network, while each output yk is associated with an output
node in the network, as shown in Figure 10.9. The input layer consists of n input nodes, the hidden layer
consists of nz hidden nodes, and the output layer consists of ny output nodes.
Figure 10.9: A 3-layer neural network with input nodes x0, x1, hidden nodes z0, z1, z2 (activation f0, biases αh), output nodes y0, y1 (activation f1, biases βk), first-layer weights ajh and second-layer weights bhk.
There are two sets of edges. Edges in the first set connect each input node with each hidden node, i.e.,
there are nnz such edges in the network. The parameters (or edge weights) for the first set of edges are
maintained in matrix A = [ajh ]n×nz . Edges in the second set connect each hidden node with each output
node, i.e., there are nz ny such edges in the network. The parameters (or edge weights) for the second set of
edges are maintained in matrix B = [bhk ]nz ×ny .
There are now two activation functions, f0 and f1 . f0 is applied at each node in the hidden layer, while f1
plays this role for the output layer. Having two activation functions allows greater capability to approximate
a variety of functional forms (for more information see [36, 81] on Universal Approximation Theorems).
y = f1(B · f0(A · x)) + ε = f1(Bᵀ f0(Aᵀx)) + ε    (10.66)
The innermost matrix-vector product multiplies the transpose of the n-by-nz matrix A by the n-by-1 vector
x, producing an nz -by-1 vector, which is passed into the f0 vectorized activation function. The outermost
matrix-vector product multiplies the transpose of the nz-by-ny matrix B by the resulting nz-by-1 vector,
producing an ny-by-1 vector, which is passed into the f1 vectorized activation function.
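To make the dimensions concrete, the computation may be sketched in a few lines of plain Scala (a minimal illustration, not ScalaTion's implementation; the helper names and the choice of sigmoid for both f0 and f1 are assumptions):

    // multiply the transpose of an n-by-p weight matrix w by an n-vector x
    def dotT (w: Array [Array [Double]], x: Array [Double]): Array [Double] =
        Array.tabulate (w(0).length) (h => (for j <- w.indices yield w(j)(h) * x(j)).sum)

    def sigmoid (u: Double): Double = 1.0 / (1.0 + math.exp (-u))

    // y-hat = f1(B^T f0(A^T x)) with a being n-by-nz and b being nz-by-ny
    def predict3L (a: Array [Array [Double]], b: Array [Array [Double]],
                   x: Array [Double]): Array [Double] =
        val z = dotT (a, x).map (sigmoid)              // hidden layer values (f0 = sigmoid)
        dotT (b, z).map (sigmoid)                      // output layer values (f1 = sigmoid)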
Intercept/Bias
As before, one may include an intercept in the model (also referred to as bias) by having a special input
node (say x0) that always provides the value 1. A column of all ones in an input matrix (see below) can
achieve this. This approach could be carried forward to the hidden layer by including a special node (say z0)
that always produces the value 1 (referred to as the bias trick). In such a case, the computation performed at
node z0 would be thrown away and replaced with 1, although a clever implementation could avoid the extra
calculation. The alternative is to replace the uniform notion of parameters with two types of parameters,
weights and biases. ScalaTion supports this with the NetParam case class.
    @param w  the weight matrix
    @param b  the bias/intercept vector

    ...
    def dot (x: VectorD): VectorD = (w dot x) + b
Following this approach, there is no need for the special nodes and the dot product is re-defined to add the
bias b to the regular matrix-vector dot product (w dot x). Note, the NetParam class defines several other
methods as well. The vector version of the predict method in NeuralNet 3L uses this dot product to make
predictions.
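A plain-Scala analog (hypothetical and simplified; ScalaTion's actual NetParam case class defines several other methods) conveys the idea of bundling a weight matrix with a bias vector and folding the bias into the dot product:

    // bundle weights w (n-by-nz) and biases b (nz); no all-ones input node needed
    case class NetParamLike (w: Array [Array [Double]], b: Array [Double]):
        // (w^T x) + b : add the bias after the matrix-vector product
        def dot (x: Array [Double]): Array [Double] =
            Array.tabulate (b.length) (h =>
                (for j <- w.indices yield w(j)(h) * x(j)).sum + b(h))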
    @param v  the new input vector
Note: f0 corresponds to f.f in the code, while f1 corresponds to f1.f in the code.
Including the bias vectors α and β and splitting the computation of the predicted value ŷ into two steps
yields,
z = f0(Aᵀx + α)    (10.67)
ŷ = f1(Bᵀz + β)    (10.68)
ŷ = B · f0 (A · x) (10.69)
where matrix A ∈ R2×2 and matrix B ∈ R2×1 . Expanding the outer dot product results in the following:
[Figure: Example Three-Layer Network with inputs x0, x1, hidden nodes z0, z1 (biases α0, α1) and a single output y0 (bias β0)]
Each hidden node brings in a ridge function, and since the second activation is the identity function, the
output ŷ is a linear combination of the two (e.g., sigmoid) ridge functions.
[Figure 10.11: Plot of ŷ versus x0 and x1]
The equation (not optimized) used in Figure 10.11 is
ŷ = 1 / (1 + e^−(2x0 + x1 + .5)) + 1 / (1 + e^−(x0 − x1 + .5))
See the subsection on Response Surface and the exercises for an optimized equation.
Note, the use of the identity id activation function for the last layer means that the final layer performs a
linear transformation, so that rescaling of the outputs y may be avoided.
10.7.3 Training
Given a training dataset made up of an m-by-n input matrix X and an m-by-ny output matrix Y, training
consists of making a prediction Ŷ,

Ŷ = f1(f0(X A) B)

and determining the error in prediction E = Y − Ŷ with the goal of minimizing the error.
Training involves an iterative procedure (e.g., stochastic gradient descent) that adjusts parameter values
(for weights and biases) to minimize a loss function such as sse or rather half sse (or L). Before the main
loop, random parameter values (for weights and biases) need to be assigned to NetParam A and NetParam
B. Roughly as outlined in section 3 of [157], the training can be broken into four steps:
1. Compute predicted values for output ŷ and compare with actual values y to determine the error y − ŷ.
2. Back propagate the adjusted error to determine the amount of correction needed at the output layer.
Record this as vector δ¹.
3. Back propagate the correction to the hidden layer and determine the amount of correction needed at
the hidden layer. Record this as vector δ⁰.
4. Use the delta vectors, δ¹ and δ⁰, to make updates to NetParam A and NetParam B, i.e., the weights
and biases.
10.7.4 Optimization
In this subsection, the basic elements of the back-propagation algorithm are presented. In particular, we
now go over the four steps outlined above in more detail. Biases are ignored for simplicity, so the A and B
NetParams are treated as weight matrices. In the code, the same logic includes the biases (so nothing is lost,
see exercises). Note that L denotes a loss function, while h is an index into the hidden layer.
1. Compute predicted values: Based on the randomly assigned weights to the A and B matrices, predicted
outputs ŷ are calculated. First values for the hidden layer z are calculated, where the value for hidden
node h, zh , is given by
zh = f0 (a:h · x) for h = 0, . . . , nz − 1
where f0 is the first activation function (e.g., sigmoid), a:h is column-h of the A weight matrix, and x
is an input vector for a training sample/instance (row in the data matrix). Typically, several samples
(referred to as a mini-batch) are used in each step. Next, the values computed at the hidden layer are
used to produce predicted outputs ŷ, where the value for output node k, ŷk, is given by

ŷk = f1(b:k · z)  for k = 0, . . . , ny − 1

where the second activation function f1 may be the same as (or different from) the one used in the
hidden layer and b:k is column-k of the B weight matrix. Now the difference between the actual and
predicted output can be calculated by simply subtracting the two vectors, or element-wise, the error
for the k-th output, εk, is given by

εk = yk − ŷk  for k = 0, . . . , ny − 1
Obviously, for subsequent iterations, the updated/corrected weights rather than the initial random
weights are used.
2. Back propagate from output layer: Given the computed error vector ε, the delta/correction vector δ¹
for the output layer may be calculated, where for output node k, δ¹k is given by

δ¹k = − εk f1′(b:k · z)

where f1′ is the derivative of the activation function (e.g., for sigmoid, f′(u) = f(u)[1 − f(u)]). The
partial derivative of the loss function L with respect to the weight connecting hidden node h with
output node k, bhk, is given by

∂L/∂bhk = zh δ¹k    (10.75)
3. Back propagate from hidden layer: Given the delta/correction vector δ¹ from the output layer, the
delta vector for the hidden layer δ⁰ may be calculated, where for hidden node h, δ⁰h is given by

δ⁰h = (bh · δ¹) f0′(a:h · x)

This equation is parallel to the one given for δ¹k in that an error-like factor multiplies the derivative
of the activation function. In this case, the error-like factor is the weighted combination of the δ¹k for
the output nodes connected to hidden node h times row-h of weight matrix B. The weighted combination
is computed using the dot product.

bh · δ¹ = Σk bhk δ¹k, summing over k = 0, . . . , ny − 1
The partial derivative of L with respect to the weight connecting input node j with hidden node h,
ajh, is given by

∂L/∂ajh = xj δ⁰h    (10.77)
4. Update weights: The weight matrices A and B, connecting input to hidden and hidden to output layers,
respectively, may now be updated based on the partial derivatives. For gradient descent, movement
is in the opposite direction, so the sign flips from positive to negative. These partial derivatives are
multiplied by the learning rate η which moderates the adjustments to the weights.

bhk −= zh δ¹k η        ajh −= xj δ⁰h η
Figure 10.12 shows the forward (→) and backward (←) propagation through the Neural Network and depicts
the role of δ in updating the parameters from the perspective of hidden node zh .
[Figure 10.12: Forward (→) and backward (←) propagation through the network from the perspective of hidden node zh, with deltas δ⁰h, δ¹0 and δ¹1]
To improve the stability of the algorithm, weights are adjusted based on accumulated corrections over a
mini-batch of instances, where a mini-batch is a sub-sample of the training dataset and may be up to the
size of the entire training dataset (for i = 0, . . . , m − 1). Once training has occurred over the current
mini-batch including, at the end, updates to the A and B estimates, the current epoch is said to be complete.
Correspondingly, the above equations may be vectorized/matrixized so that calculations are performed over
many instances in a mini-batch using matrix operations. Each outer iteration (epoch) typically should
improve the A and B estimates. Simple stopping rules include specifying a fixed number of iterations or
breaking out of the outer loop when the loss function L fails to decrease for q iterations.
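The stopping rule may be sketched as follows (assumed logic mirroring the description above; lossForEpoch is a hypothetical stand-in for one full pass over the mini-batches):

    // iterate up to maxEpochs, but break out when the loss fails to
    // decrease for q consecutive epochs
    def trainLoop (maxEpochs: Int, q: Int, lossForEpoch: Int => Double): Double =
        var best   = Double.MaxValue
        var badRun = 0                                 // consecutive non-improving epochs
        var epoch  = 1
        while epoch <= maxEpochs && badRun < q do
            val loss = lossForEpoch (epoch)            // train on all mini-batches once
            if loss < best then { best = loss; badRun = 0 }
            else badRun += 1                           // loss failed to decrease
            epoch += 1
        best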
The four boxed equations from the previous section become seven due to the extra layer. The optimizers
compute predicted outputs and take differences between the actual/target values and these predicted values
to compute an error matrix. These results are then used to compute delta matrices that form the basis for
updating the parameter/weight matrices A ∈ Rn×nz and B ∈ Rnz ×ny .
1. The hidden (latent) values for all m instances and all nz hidden nodes are computed by applying
the first matrixized activation function f0 to the matrix product XA to produce the latent feature
matrix Z ∈ Rm×nz .
Z = f0 (XA) (10.80)
The predicted output matrix Ŷ ∈ Rm×ny is similarly computed by applying the second matrixized
activation function f1 to the matrix product ZB.
Ŷ = f1 (ZB) (10.81)
2. The negative of the error matrix E ∈ Rm×ny is just the difference between the predicted and
actual/target values.
E = Ŷ − Y (10.82)
3. This information is sufficient to calculate delta matrices: ∆¹ for adjusting B and ∆⁰ for adjusting A.
The output-layer delta matrix ∆¹ ∈ Rm×ny is the element-wise matrix (Hadamard) product of E
and f1′(ZB).

∆¹ = E ⊙ f1′(ZB)    (10.83)

The hidden-layer delta matrix ∆⁰ ∈ Rm×nz is the element-wise matrix (Hadamard) product of
∆¹ Bᵀ and f0′(XA).

∆⁰ = [∆¹ Bᵀ] ⊙ f0′(XA)    (10.84)
4. As mentioned, the delta matrices form the basis (a matrix transpose × delta × the learning rate η)
for updating the parameter/weight matrices, A and B.

B −= Zᵀ ∆¹ η    (10.85)
A −= Xᵀ ∆⁰ η    (10.86)
10.7.6 train Method
The corresponding ScalaTion code for the train method is shown below.

    @param x_  the training/full data/input matrix
    @param y_  the training/full response/output matrix
The main part of the optimize3 method is a nested loop: the outer loop iterates up to maxEpochs, but
may exit early depending on what stopWhen returns. The inner loop iterates through the mini-batches (ib
is the ith mini-batch). The learning rate η is periodically adjusted.
    var sse_best_ = -0.0
    var (go, epoch) = (true, 1)
    cfor (go && epoch <= maxEpochs, epoch += 1) {      // iterate over each epoch
        val batches = permGen.igen.chop (nB)           // permute indices & chop
The updateWeight method is used to update the parameters (weights and biases). It performs both
forward and back propagation on the mini-batch passed in.
    inline def updateWeight (x: MatrixD, y: MatrixD): (NetParam, NetParam) =
        val α  = η / x.dim                             // eta over batch size
        var z  = f.fM (a * x)                          // Z  = f(XA)
        var yp = f1.fM (b * z)                         // Yp = f(ZB)
        var ε  = yp - y                                // negative of error matrix
        val δ1 = f1.dM (yp) *~ ε                       // delta matrix for y
        val δ0 = f.dM (z) *~ (δ1 * b.w.T)              // delta matrix for z
The end of the optimize3 method returns the best value for the loss function and the number of epochs.
    if go then ((y - b * f1.fM (f.fM (a * x))).normFSq, maxEpochs)   // return sse, epochs
    else (sse_best_, epoch - upLimit)
    end optimize3
For stochastic gradient descent in the Optimizer SGD class, the inner loop divides the training dataset
into nB mini-batches. A batch is a randomly selected group/batch of rows. Each batch (ib) is passed to the
updateWeight (x(ib), y(ib)) method that updates the A and B parameter/weight matrices.
Neural networks may be used for prediction/regression as well as classification problems. For predic-
tion/regression, the number of output nodes would correspond to the number of responses. For example,
in the ExampleConcrete example there are three response columns, requiring three instances of Regression
or one instance of NeuralNet 3L. Three separate NeuralNet 3L instances each with one output node could
be used as well. Since some activation functions have limited ranges, it is common practice for these types of
problems to let the activation function in the last layer be identity id. If this is not done, response columns
need to be re-scaled based on the training dataset. Since the testing dataset may have values outside this
range, this approach may not be ideal.
For classification problems, it is common to have an output node for each response value for the categorical
variable, e.g., “no”, “yes” would have y0 and y1 , while “red”, “green”, “blue” would have y0 , y1 and y2 . The
softmax activation function is a common choice for the last layer for classification problems.
fi(u) = e^ui / (1 · e^u)    for i = 0, . . . , n − 1

where the denominator 1 · e^u = Σj e^uj sums the exponentials.
The values produced by the softmax activation function can be thought of as giving a probability score to
the particular class label, e.g., the image shows a “cat” vs. “dog”.
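A direct plain-Scala rendering of the softmax formula may be written as follows (subtracting the maximum before exponentiating is a standard numerical safeguard, not part of the formula itself):

    def softmax (u: Array [Double]): Array [Double] =
        val mx  = u.max                                // for numerical stability
        val exp = u.map (ui => math.exp (ui - mx))     // exponentiate each score
        val tot = exp.sum
        exp.map (_ / tot)                              // normalize so scores sum to one

    val scores = softmax (Array (1.0, 2.0, 3.0))       // ~ [0.090, 0.245, 0.665]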
An alternative for binary classification (k = 2) is to have one output node and use sigmoid activation for
the last layer.
Let x = [x0, x1] = [2, 1] and y0 = .8, and then compute the error ε0 = y0 − ŷ0 by feeding the values from
vector x forward. First compute values at the hidden layer for z.
zh = f0 (a:h · x + αh )
z0 = f0 (a:0 · x + α0 )
z0 = f0 ([0.1, 0.3] · [2.0, 1.0] + 0.1)
z0 = f0 (0.6) = 0.645656
z1 = f0 (a:1 · x + α1 )
z1 = f0 ([0.2, 0.4] · [2.0, 1.0] + 0.1)
z1 = f0 (0.9) = 0.710950
One may compute the values for sigmoid activation function as follows:

    println (ActivationFun.sigmoid_ (VectorD (0.6, 0.9)))
ŷk = f1(b:k · z + βk)
ŷ0 = f1(b:0 · z + β0)
ŷ0 = f1([0.5, 0.6] · [0.645656, 0.71095] + 0.1)
ŷ0 = f1(0.849398) = 0.7004408

ε0 = y0 − ŷ0
ε0 = 0.8 − 0.7004408 = 0.0995592
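The worked example may be checked with a few lines of plain Scala (assuming, as above, sigmoid for both f0 and f1):

    def sigmoid (u: Double): Double = 1.0 / (1.0 + math.exp (-u))

    val z0 = sigmoid (0.1*2.0 + 0.3*1.0 + 0.1)         // sigmoid(0.6) = 0.645656
    val z1 = sigmoid (0.2*2.0 + 0.4*1.0 + 0.1)         // sigmoid(0.9) = 0.710950
    val yp = sigmoid (0.5*z0 + 0.6*z1 + 0.1)           // sigmoid(0.849398) = 0.700441
    val e0 = 0.8 - yp                                  // error ε0 = 0.099559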
From these a prediction formula may be created that closely approximates the predict method.
Class Methods:

    @param x       the m-by-n input/data matrix (full/training data having m vectors)
    @param y       the m-by-ny output/response matrix (full/training data having m vectors)
    @param fname_  the feature/variable names (defaults to null)
    @param nz      the number of nodes in hidden layer (-1 => use default formula)
    @param hparam  the hyper-parameters for the model/network (defaults to Optimizer.hp)
    @param f       the activation function family for layers 1->2 (input to hidden)
    @param f1      the activation function family for layers 2->3 (hidden to output)
    @param itran   the inverse transformation returns response matrix to original scale
10.7.11 Exercises
1. Delta Vectors: For the example error calculation problem given in this section, calculate the δ¹ = [δ¹0]
vector using the following formula.

δ¹k = − εk f1′(b:k · z)

Rework the problem using separate weights (b:k) and biases (βk) (not lumped together as parameters)
to compute δ¹k.
Calculate the δ⁰ = [δ⁰0, δ⁰1] vector using the analogous reworked formula having separate weights (a:h)
and biases (αh) to compute δ⁰h.
2. Parameter Update Equations: Use the δ¹ vector to update weight matrix B, i.e., for each row h,

bh −= zh δ¹ η
β −= δ¹ η

Use the δ⁰ vector to update weight matrix A, i.e., for each row j,

aj −= xj δ⁰ η
α −= δ⁰ η
3. Derive the equation for the partial derivative of the loss function L w.r.t. bhk,

∂L/∂bhk = zh δ¹k

by defining the pre-activation value vk = b:k · z and applying the following chain rule:

∂L/∂bhk = (∂L/∂vk)(∂vk/∂bhk)
4. Explain the formulations for the two delta matrices.

∆¹ = E ⊙ f1′(ZB)
∆⁰ = [∆¹ Bᵀ] ⊙ f0′(XA)
5. The dataset in Example Concrete consists of 7 input variables and 3 output variables. See the
NeuralNet 2L section for details. Create a NeuralNet 3L model to predict values for the three outputs
y0 , y1 and y2 . Compare with the results of using a NeuralNet 2L model.
6. Create a NeuralNet 3L model to predict values for the one output for the AutoMPG dataset. Compare
with the results of using the following models: (a) Regression, (b) Perceptron, (c) NeuralNet 2L.
7. For the AutoMPG dataset let the activation function for the last layer be id, so that rescaling of the
output/response y is not needed. Then try all of ScalaTion’s activation functions and compare the
QoF for (a) in-sample testing, (b) validation, i.e., train-n-test split (TnT), and (c) cross-validation.
8. Use the formula for ŷ given in the Response Surface subsection to plot ŷ vs. x1 and x2 .
9. Conduct a literature study on the effectiveness and efficiency of various training algorithms for neural
networks, for example see [31].
10.8 Multi-Hidden Layer Neural Networks
The NeuralNet XL class supports basic x-layer (input, {hidden} and output) Neural Networks. For example,
a four-layer neural network (see Figure 10.13) will have four layers of nodes (one input layer numbered
0, two hidden layers numbered 1 and 2, and one output layer numbered 3). Note, since the input layer's
purpose is just to funnel the input into the model, it is also common to refer to such a neural network as
a three-layer network. This has the advantage that the number of layers now corresponds to the number
of parameter/weight matrices.
Figure 10.13: Four-Layer (input, hidden (l = 0), hidden (l = 1), output (l = 2)) Neural Network
In ScalaTion, the number of active layers is denoted by nl (which in this case equals 3 with l = 0, 1, 2).
Since arrays of matrices are used in the ScalaTion code, multiple layers of hidden nodes are supported.
The matrix Bl = [bljk] maintains the weights for layer l, while βl = [βlk] maintains the biases for layer l.
In particular, parameter b which holds the weights and biases for all layers is of type NetParams where

    type NetParams = Array [NetParam]

For simplicity, the model equation below rolls the biases into the weights using NetParam.

y = fnl−1(Bnl−1 · . . . f1(B1 · f0(B0 · x))) + ε

where Bl is the NetParam (weight matrix and bias vector) connecting layer l to layer l + 1 and fl is the
vectorized activation function at layer l + 1.
10.8.2 Training
As before, the training dataset consists of an m-by-n input matrix X and an m-by-ny output matrix Y .
During training, the predicted values Ŷ are compared to actual/target values Y.
Corrections based on these errors are propagated backward through the network to improve the parameter
estimates (weights and biases) layer by layer.
10.8.3 Optimization
The seven boxed equations from the previous section become six due to unification of the last two. As before,
the optimizers compute a predicted output matrix and then take differences between the actual/target values
and these predicted values to compute an error matrix. These computed matrices are then used to compute
delta matrices that form the basis for updating the weight matrices. Again for simplicity, biases are ignored in
the equations below, but are taken care of in the code through the NetParam abstraction. See the exercises
for details.
1. The values are fed forward through the network, layer by layer. For layer l, these values are stored
in matrix Zl. The first layer is the input, so Z0 = X. For the rest of the layers, Zl+1 equals the
result of activation function fl being applied to the product of the previous layer's Zl matrix times its
parameter matrix Bl.

Z0 = X    (10.90)
Zl+1 = fl(Zl Bl)    (10.91)
2. The negative of the error matrix E is just the difference between the predicted and actual/target
values, where Ŷ = Znl .
E = Ŷ − Y (10.92)
3. This information is sufficient to calculate delta matrices ∆l. For the last layer:

∆nl−1 = fnl−1′(Znl−1 Bnl−1) ⊙ E    (10.93)

For the rest of the layers in the backward direction with l being decremented:

∆l = fl′(Zl Bl) ⊙ (∆l+1 Bl+1ᵀ)    (10.94)
4. As mentioned, the delta matrices form the basis (a matrix transpose × delta × the learning rate η)
for updating the parameter/weight matrices, Bl for each layer l.

Bl −= Zlᵀ ∆l η    (10.95)
The implementation of the train method encodes these equations and uses gradient descent to improve the
parameters Bl over several epochs, terminating early when the objective/cost function fails to improve.

    @param x_  the training/full data/input matrix
    @param y_  the training/full response/output matrix
The code for the optimize method may be found in the Optimizer SGD and Optimizer SGDM classes. Again,
the train2 method includes limited auto-tuning of hyper-parameters.
If the nz array is null, then default numbers of nodes for the hidden layers are utilized. The default rule sets
the first hidden layer to 2 * n + 1 nodes and divides this by the layer number for subsequent layers.
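This default rule may be sketched as follows (an assumed reading of the formula just described, using integer division):

    // first hidden layer gets 2n + 1 nodes; hidden layer l gets (2n + 1) / l
    def defaultHiddenSizes (n: Int, numHidden: Int): Array [Int] =
        Array.tabulate (numHidden) (l => (2 * n + 1) / (l + 1))

    val sizes = defaultHiddenSizes (7, 3)              // e.g., [15, 7, 5] for n = 7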
The number of nodes in each layer is currently used as a very rough estimate of the Degrees of Freedom
for the neural network. Also, the Degrees of Freedom is only considered for the first output variable (y0 ).
There is ongoing research to characterize Generalized Degrees of Freedom for neural networks [61]. As this
work matures, plans are for ScalaTion to include them.
10.8.6 Deep Learning
When the number of hidden layers is increased beyond the base levels of one or two, the learning may be
described as deep. In addition, special types of layers or units may be included such as convolutional or
recurrent.
There are Universal Approximation Theorems that indicate that Neural Networks with one hidden layer
(with arbitrary width) can approximate a broad class of functions mapping inputs to outputs. However,
other theorems show that using more hidden layers allows the width to be reduced. This is the realm of deep
learning that has shown success in many application areas, including automatic speech recognition, image
classification, computer vision and natural language processing.
Class Methods:

    @param x       the m-by-n input/data matrix (full/training data having m input vectors)
    @param y       the m-by-ny output/response matrix (full/training data having m vectors)
    @param fname_  the feature/variable names (defaults to null)
    @param nz      the number of nodes in each hidden layer, e.g.,
                   Array (9, 8) => 2 hidden layers of sizes 9 and 8 (null => use default formula)
    @param hparam  the hyper-parameters for the model/network (defaults to Optimizer.hp)
    @param f       the array of activation function families between every pair of layers
    @param itran   the inverse transformation function returns response to original scale
10.8.8 Exercises
1. Examine the implementation of the train method and the NetParam case class, where the net param-
eter b has two parts: the weight matrix b.w and the bias vector b.b. Show how the biases affect the
calculation of prediction matrix Ŷ = Znl in the feed forward process.
2. Examine the implementation of the train method and the NetParam case class and show how the
biases affect the update of the weights b.w in the back-propagation process.
3. Examine the implementation of the train method and the NetParam case class and show how the
biases b.b are updated in the back-propagation process.
4. The dataset in Example Concrete consists of 7 input variables and 3 output variables. See the
NeuralNet 2L section for details. Create a NeuralNet XL model with four layers to predict values
for the three outputs y0 , y1 and y2 . Compare with the results of using a NeuralNet 3L model.
5. Create a NeuralNet XL model with four layers to predict values for the one output for the AutoMPG
dataset. Compare with the results of using the following models: (a) Regression, (b) Perceptron,
(c) NeuralNet 2L, (d) NeuralNet 3L.
6. Tuning the Hyper-Parameters: The learning rate η (eta in the code) needs frequent tuning.
ScalaTion, as with most packages, has limited auto-tuning of the learning rate. Tune the other hyper-
parameters for the AutoMPG dataset.

    object Optimizer:

Note, in other packages patience is the number of upward steps, not the number of subsequent upward
steps. Below are additional neural network hyper-parameters for a future release of ScalaTion.

    hp += ("dropout", 0.05, 0.05)                      // probability of neuron dropping out
    hp += ("valSplit", 0.1, 0.1)                       // training-validation set split fraction
7. Tuning the Network Architecture: The architecture of the neural network can be tuned by
varying the number of hidden layers and the number of nodes in each layer.
Tune the architecture for the AutoMPG dataset. The number of layers and number of nodes in each
layer should only be increased when there is non-trivial improvement.
8. Feature Selection for Neural Networks. Although Neural Networks may be used without Feature
Selection, it can still be useful to consider. An improvement over Forward Selection and Backward
Elimination is Stepwise Regression. Start with no variables in the model and add one variable that
improves the selection criterion the most. Add the second best variable for step two. After the second
step, determine whether it is better to add or remove a variable. Continue in this fashion until no
improvement in the selection criterion is found. For Forward Selection and Backward Elimination it
may be instructive to continue all the way to the end (all variables for forward/no variables for backward).
Stepwise regression may lead to coincidental relationships being included in the model, particularly
if a penalty-free QoF measure such as R² is used. Typically, this approach is used when there is a
penalty for having extra variables/parameters, e.g., adjusted R² (R̄²), cross-validation R² (R²cv) or Akaike
Information Criterion (AIC). See the section on Maximum Likelihood Estimation for a definition of
AIC. Alternatives to Stepwise Regression include Lasso Regression (ℓ1 regularization) and to a lesser
extent Ridge Regression (ℓ2 regularization).
Perform Forward Selection, Backward Elimination, and Stepwise Regression with all four criteria: R²,
R̄², R²cv, and AIC. Plot the curve for each criterion, determine the best number of variables and what
these variables are. Compare the four criteria.
As part of a larger project compare this form of feature selection with that provided by Ridge Regression
and Lasso Regression.
Use the following types of models: TranRegression, Perceptron, NeuralNet 3L, and NeuralNet XL
(4 layers).
In addition to the AutoMPG dataset, use the Concrete dataset and three more datasets from UCI
Machine Learning Repository. The UCI datasets should have more instances (m) and variables (n)
than the first two datasets. The testing is to be done in ScalaTion and Keras.
9. Double Descent in Deep Learning. Conventional wisdom indicates that as more features, nodes,
or parameters are added to the network/model, the training loss (e.g., based on mse or sse) will continue
to decrease, but the testing loss (out-of-sample) will start to increase. Recent research suggests that
as even more parameters are added to the network/model, the testing loss may begin to decrease for
a second time. Try this with some of the datasets from the last question. Note that a decrease in the
loss function corresponds to an increase in R² for training or R²cv for testing.
10.9 Convolutional Neural Networks
One way to think about a Convolutional Neural Network (CNN) is to take an ordinary Neural Network such
as a 2-layer Neural Network and add a special non-neural layer in front of it, as depicted in Figure 10.14.
This special layer is called the convolutional layer (there may be more than one). Its purpose is to reduce
the volume of input in such a way that the salient features are brought forward. When the input is huge,
such as image data, it is important to reduce the amount of data sent to the last layers, now referred to as
the Fully-Connected (FC) layers.
[Figure 10.14: Conv Layers feeding hidden nodes z0, z1, z2 that are fully connected to outputs y0, y1]
The convolution layers will take the input and apply convolutional filtering along with pooling operations.
Inputs to the convolutional layers may be 1D (vector), 2D (matrix) or 3D or higher (tensor). After the convo-
lutional layers, a flattening layer is used to turn matrices/tensors into vectors. Whatever the dimensionality
of the input, the output of the flattening layer will be one dimensional. The flattening layer will fully connect
to the first FC layer using a parameter matrix for edge weights. In general, the box labeled Conv Layers
may consist of multiple convolutional layers with multiple pooling layer interspersed, followed by a flattening
layer. Some of the advantages of convolutional networks are listed below:
For additional background, there are multiple basic introductions to convolutional networks including
[97, 108]. In the next two sections, 1D and 2D convolutional networks will be presented.
10.10 1D CNN
A 1D CNN takes the simplest form of input, that of a one-dimensional vector. Such input is found in
sequence data such as time series, sound/audio and text. More well known are 2D CNNs where the input is
two-dimensional as in grayscale images. Color images would be 3D as there are typically three color channels.
Convolutional Neural Networks can be deep and complex, but in their simplest form they may be viewed
as a modified Fully-Connected Neural Network. The example shown in Figure 10.15 of a one-dimensional
convolutional network replaces the first dense layer with a convolutional layer. There are two differences:
First, the connections are sparse (at least not full) and the weights/parameters are shared by all second layer
nodes. Notice that each node zh is only connected to three input nodes forming a window of width nc = 3
starting at index h,

zh = f0(c · xh:h+2)

and that the parameters are always c = [c0, c1, c2]. The second set of edges fully connect the z nodes with
the y nodes.
[Figure 10.15: 1D Convolutional Neural Network — the cofilter c = [c0, c1, c2] (with shared bias α) slides over inputs x0 . . . x4 to form hidden nodes z0, z1, z2, which are fully connected to outputs y0, y1 via weights bhk and biases β0, β1]
Convolutional Filter
As the edge weights are shared, rather than duplicating them for multiple edges, we may think of the vector
c = [c0, c1, c2] as a convolutional filter (or cofilter) that is used to calculate second layer node values,

zh = c · xh:h+2

for each index value h by shifting the window/slice xh:h+2 and taking the dot product with the cofilter c.
For example, if the input x = [1, 2, 3, 4, 5] and the cofilter c = [0.5, 1.0, 0.5], then
z = [4, 6, 8]
To keep things simple, we assume the bias α is zero and that the convolutional layer's activation function
f0 is the identity function.

z = x ∗c c    (10.98)

The z vector is then propagated through the fully connected layer to obtain predicted output ŷ.

ŷk = f1(b:k · z)
In cases where activation function f0 is not the identity function, the equation becomes
z = f0 (x ∗c c) (10.99)
ScalaTion’s conv function in object CoFilter 1D implements this convolution operator ∗c . Note, the
subscript c is used to avoid confusion with element-wise vector multiplication.
    @param c  the cofilter vector of coefficients
    @param x  the input/data vector
The second equation takes the z vector and propagates it through the Fully-Connected (FC) component
of the network
y = f1(B · z) + ε    (10.100)
where the B matrix is a 3-by-2 matrix in the example.
Recall the model equation for NeuralNet 3L written in vector form.

y = f1(B · f0(A · x)) + ε

For 1D CNNs, the model equation is very similar.

y = f1(B · f0(x ∗c c)) + ε    (10.101)
Rather than taking the dot product with a matrix of unshared parameters/weights A, the input vector x
undergoes convolution with cofilter vector c. Adding in the shared scalar bias α for the convolutional layer
and the unshared bias vector β for the fully connected layer gives
y = f1(B · f0(x ∗c c + α) + β) + ε    (10.102)
When input vector x is large it may be useful to further reduce the vector sizes by pooling, e.g., max-
pooling with stride s = 2, would take the max of two adjacent elements and then move on to the next two.
ScalaTion’s pool function in object CoFilter 1D implements the pooling operator.
    @param x  the input/data vector
    @param s  the size of the pooling window
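Non-overlapping max-pooling may be sketched as follows (assumed semantics, with the window size equal to the stride s; ScalaTion's pool may treat edges differently):

    // take the max of each consecutive window of s elements
    def pool1d (x: Array [Double], s: Int): Array [Double] =
        Array.tabulate (x.length / s) (i => x.slice (i * s, i * s + s).max)

    val p = pool1d (Array (4.0, 6.0, 8.0, 2.0), 2)     // = [6.0, 8.0] (hypothetical input)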
10.10.2 Training
Training 1D Convolutional Neural Networks is more complicated, although more efficient, than training
Fully-Connected Neural Networks [97]. Training involves finding values for the parameters B ∈ Rnz ×ny ,
β ∈ Rny , c ∈ Rnc and α ∈ R that minimize a given loss function L. To simplify the development, the biases
will be ignored.
For a single input vector x yielding prediction

ŷ = f1(B · f0(x ∗c c))    (10.103)

the loss function to be minimized is

L(B, c) = ½ [(y − ŷ) · (y − ŷ)]    (10.104)
10.10.3 Optimization
As with NeuralNet 3L, there are basically four steps in optimization.
1. Compute predicted values: Based on the randomly assigned weights to vector c and matrix B,
predicted outputs ŷ are calculated. First values for the hidden layer z are calculated, where the value
for hidden node h, zh , is given by
zh = f0 (x[h,nc ] · c) for h = 0, . . . , nz − 1
where f0 is the first activation function (e.g., reLU), c is the convolutional filter which contains the
shared weights/parameters, and x is an input vector for a training sample/instance (row in the data
matrix). Next, the values computed at the hidden layer are used to produce predicted outputs ŷ,
where the value for output node k, ŷk , is given by
ŷk = f1 (b:k · z) for k = 0, . . . , ny − 1
where the second activation function f1 may be the same as (or different from) the one used in the
hidden layer and b:k is column-k of the B weight matrix. Now the difference between the actual and
predicted output, the error, for the k-th output, εk, is given by

εk = yk − ŷk  for k = 0, . . . , ny − 1
2. Back propagate from output layer: Given the computed error vector ε, the delta/correction vector
δ¹ for the output layer may be calculated, where for output node k, δ¹k is given by

δ¹k = − εk f1′(b:k · z)

where f1′ is the derivative of the activation function. The partial derivative of the loss function L with
respect to the weight connecting hidden node h with output node k, bhk, is given by

∂L/∂bhk = zh δ¹k    (10.106)
3. Back propagate from hidden layer: Given the delta/correction vector δ¹ from the output layer,
the delta vector for the hidden layer δ⁰ may be calculated, where for hidden node h, δ⁰h is given by

δ⁰h = (bh · δ¹) f0′(x[h,nc] · c)

The partial derivative of L with respect to the j-th weight in the convolutional filter, cj, is given by

∂L/∂cj = Σh xj+h δ⁰h = x[j,nz] · δ⁰    (10.108)

where the sum ranges over h = 0, . . . , nz − 1.
4. Update weights: The weights c and B, in the convolutional and fully-connected layers, respectively,
may now be updated based on the partial derivatives. For gradient descent, movement is in the opposite
direction, so the sign flips from positive to negative. These partial derivatives are multiplied by the
learning rate η which moderates the adjustments to the weights.
Z = f0 (X ∗c c) (10.111)
The predicted output matrix Ŷ ∈ Rm×ny is similarly computed by applying the second matrixized
activation function f1 to the matrix product ZB.
Ŷ = f1 (ZB) (10.112)
2. The negative of the error matrix E ∈ Rm×ny is just the difference between the predicted and
actual/target values.
E = Ŷ − Y (10.113)
3. This information is sufficient to calculate delta matrices: ∆¹ for adjusting B and ∆⁰ for adjusting the
cofilter c. The output-layer delta matrix ∆¹ ∈ Rm×ny is the element-wise matrix (Hadamard) product of E
and f1′(ZB).

∆¹ = E ⊙ f1′(ZB)    (10.114)

The hidden-layer delta matrix ∆⁰ ∈ Rm×nz is the element-wise matrix (Hadamard) product of
∆¹ Bᵀ and f0′(X ∗c c).

∆⁰ = [∆¹ Bᵀ] ⊙ f0′(X ∗c c)    (10.115)
4. As mentioned, the delta matrices form the basis (a matrix transpose × delta × the learning rate η)
for updating the parameter/weight matrix B and cofilter vector c.

B −= Zᵀ ∆¹ η    (10.116)
cj −= [X:,[j,nz]ᵀ ∆⁰].mean η    (10.117)
    val δ1 = f1.dM (yp) *~ ε                           // delta matrix for y
    val δ0 = f.dM (z) *~ (δ1 * b.w.T)                  // delta matrix for z
    CNN_1D.updateParam (x_, z, δ0, δ1, eta, c, b)
Again note: ScalaTion provides the following alternatives for (a) Hadamard product: a Unicode Hadamard
symbol or *~, and (b) Transpose: a T-like Unicode symbol or transpose.
The updateParam method produces new values for c and b.
    def updateParam (x_ : MatrixD, z: MatrixD, δ0: MatrixD, δ1: MatrixD, eta: Double,
                     c: VectorD, b: NetParam) =
        for j <- c.indices do
            var sum = 0.0
            for i <- x_.indices; h <- z.indices2 do sum += x_(i, h+j) * δ0(i, h)
            c(j) -= (sum / x_.dim) * eta               // update c in conv filter
        end for
        b -= (z.T * δ1 * eta, δ1.mean * eta)           // update b weights/biases
    end updateParam
[Figure 10.16: 1D CNN with two cofilters c0 and c1 (shared bias α) producing feature maps φ0 and φ1, which are flattened into hidden nodes z0 . . . z5 and fully connected to outputs y0, y1]
The cofilter c0 will move over the input nodes producing feature map φ0 (the yellow nodes in the middle),
while cofilter c1 will also move over the input nodes producing feature map φ1 (the orange nodes in the
middle). As there are no other layers before the fully-connected layer, these two feature maps are combined
(or flattened) and activated to form a hidden (z = [z0 , z1 , z2 , z3 , z4 , z5 ]) layer. The stride indicates how far
the cofilter moves (e.g., down one node, down two nodes) on each step. Here the stride is equal to one.
Figure 10.17 shows the same Convolutional Neural Network from the point of view of tensors flowing
through the network (in this case the tensors are just vectors). Cofilter c0 moves through the input vector x
taking the dot product of itself with the blue, purple/crimson and black windows/slices, respectively. These
three dot products produce the values in feature map φ0 . (Note that blue line from cofilter c0 to feature
map φ0 has utilized all the values in the cofilter and all values in the blue window and is shown as a single
line to reduce clutter.) The same logic is used for the second cofilter c1 to create feature map φ1 . The
difference lies only in the values in each of these two convolution vectors.
Now the two feature maps φ0 and φ1 need to be combined. In general this is done by a flattening
operation. For the 1D case, the feature maps are already vectors so flattening reduces to vector concatenation.
This flattened vector is activated using f0 to produce z that serves as an entry point into the fully-connected
part of the network. Further computations produce the output vector ŷ based on vector z and parameter
matrix B, with their product activated using f1 .
Figure 10.17: Tensor Flow Diagram Showing Two Convolution Filters c0, c1 and Two Feature Maps φ0, φ1
followed by a Fully-Connected Layer
A Convolutional Neural Network may have multiple convolutional layers as well as pooling layers (dis-
cussed in more detail in the next section). Activation (e.g., using reLU) occurs at the end of paired
convolutional-pooling layers. In addition, the fully-connected part may consist of multiple layers.
Class Methods:

    @param x       the input/data matrix with instances stored in rows
    @param y       the output/response matrix, where y_i = response for row i of matrix x
    @param fname_  the feature/variable names (defaults to null)
    @param nf      the number of filters for this convolutional layer
    @param nc      the width of the filters (size of cofilters)
    @param hparam  the hyper-parameters for the model/network
    @param f       the activation function family for layers 1->2 (input to hidden)
    @param f1      the activation function family for layers 2->3 (hidden to output)
    @param itran   the inverse transformation function returns responses to original scale
    with Fit (dfm = x.dim2 - 1, df = x.dim - x.dim2):
10.10.9 Exercises
1. Delta Vectors: For the example error calculation problem given in this section, calculate the δ¹ and
δ⁰ vectors.
2. Parameter Update Equations: For the same example, use the δ¹ vector to update weight matrix B
for the fully-connected layer. Use the δ⁰ vector to update weight vector c for the convolutional layer.
3. For the same example, make four training steps and see what happens to sse.
5. Create and demo a 1D CNN for the MIT-BIH Arrhythmia Database using ScalaTion’s CNN 1D class
and Keras. See https://physionet.org/content/mitdb/1.0.0.
10.11 2D CNN
Two-dimensional Convolution Networks (2D CNN2) are needed when the data instances are two-dimensional
and based on a regular grid pattern where the values in the grid can be mapped to a matrix. For 2D CNNs,
flattening operators are added to the convolutional and pooling operators. The following references focus on
2D Convolutional Networks [88, 214]. To clarify what can happen in a 2D CNN convolutional layer, consider
the well-known MNIST problem to classify hand written digits 0 to 9. Given 10,000 grayscale images, each
having 28-by-28 pixels, the goal is to train a convolutional network to classify an image as one of 10 digits.
The following simple network architecture may be used as a starting point for the MNIST problem:
• Convolutional Layer. A convolutional layer consists of one or more convolutional filters, for which a
convolutional operation is applied to the input for each filter, conceptually in parallel. The result of
the convolutions may then be passed into an activation function. For this problem, assume there is
just one 5-by-5 convolutional filter.
• Pooling Layer. A pooling layer takes the output of a convolutional layer and reduces/aggregates it.
A common pooling operation is generally applied to the results of the convolution. For this problem,
assume there is a 2-by-2 max-pooling operation applied to the one filter.
• Flattening Layer. Between the convolutional layers and fully-connected layers a flattening layer turns
the matrices (or tensors for 3D CNN) flowing through the convolutional layers and pooling layers into
vectors as required by the fully-connected layers.
• Fully Connected Output Layer. A 10 node layer where the ith node represents the ith digit. Whichever
output node gets the highest score is the classification result. In general, there may be multiple fully-
connected layers.
Advanced network architectures may have many convolutional layers and pooling layers. Since one layer
comes after the other, the signals are processed in series (not parallel). For this problem, assume there is
just one convolutional layer.
Instead of looking at a large input matrix X, consider one of the matrices given on the following Webpage
https://cs231n.github.io/convolutional-networks. It is a 5-by-5 matrix, extracted below.
    X  =   0 0 2 1 0
           0 0 0 1 2
           1 2 2 0 2        (10.118)
           2 0 0 0 1
           2 2 2 0 1

Consider also the following 2-by-2 convolutional filter C.

    C  =   1 1              (10.119)
           0 1

The feature map matrix Φ is produced by using the convolution operator (∗c),

Φ = X ∗c C    (10.120)

where each element in the resulting 4-by-4 feature map matrix Φ is the dot product of a subimage and
the convolutional filter. The subimage is a shifting 2-by-2 window/slice of the original image with top-left
position (j, k).
    Φ  =   0 2 4 3
           2 2 1 5
           3 4 2 3          (10.122)
           4 2 0 2
Notice for this example, if a reLU activation function is applied, the final feature map is unaltered.
ScalaTion’s conv function in the CoFilter 2D object implements the convolution operator.
    def conv (x: MatrixD, c: MatrixD): MatrixD =
        val (m, n) = (c.dim, c.dim2)                           // filter dimensions (reconstructed header;
        val phi = new MatrixD (x.dim - m + 1, x.dim2 - n + 1)  //  the original listing begins mid-function)
        for j <- phi.indices; k <- phi.indices2 do
            phi(j, k) = (x(j until j+m, k until k+n) *~ c).sum
        end for
        phi
    end conv
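A plain-Scala analog of the conv function (a hypothetical helper, written here to verify the feature map of equation (10.122)) may be applied to the 5-by-5 image X and 2-by-2 filter C:

    def conv2d (x: Array [Array [Double]], c: Array [Array [Double]]): Array [Array [Double]] =
        val (m, n) = (c.length, c(0).length)           // dimensions of the filter
        Array.tabulate (x.length - m + 1, x(0).length - n + 1) ((j, k) =>
            (for a <- 0 until m; b <- 0 until n yield x(j+a)(k+b) * c(a)(b)).sum)

    val x = Array (Array (0, 0, 2, 1, 0), Array (0, 0, 0, 1, 2), Array (1, 2, 2, 0, 2),
                   Array (2, 0, 0, 0, 1), Array (2, 2, 2, 0, 1)).map (_.map (_.toDouble))
    val c = Array (Array (1, 1), Array (0, 1)).map (_.map (_.toDouble))
    val phi = conv2d (x, c)                            // the 4-by-4 feature map Φ above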
" #
2 5
P = (10.124)
4 3
ScalaTion’s pool function in the CoFilter 2D object implements the pooling operator.
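A plain-Scala sketch of 2-by-2 max-pooling with stride 2 (assumed semantics, not ScalaTion's pool code) reproduces P from the feature map Φ:

    def pool2d (x: Array [Array [Double]], s: Int): Array [Array [Double]] =
        Array.tabulate (x.length / s, x(0).length / s) ((i, j) =>
            (for a <- 0 until s; b <- 0 until s yield x(i*s + a)(j*s + b)).max)

    val phi = Array (Array (0, 2, 4, 3), Array (2, 2, 1, 5),
                     Array (3, 4, 2, 3), Array (4, 2, 0, 2)).map (_.map (_.toDouble))
    val p = pool2d (phi, 2)                            // = [[2, 5], [4, 3]]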
P.flatten = [ 2, 5, 4, 3 ]
The second equation takes the z vector and propagates it through the Fully-Connected (FC) layer of the
network. The model equation includes these two steps with an error term added.

y = f1(B · z) + ε    (10.126)

where the B matrix connects each of the 144 nodes in the flattened layer to the 10 output nodes of the FC
layer. Combining these two equations yields the overall model equation.
More generally, multiple convolutional filters (along with their pooling) can be used simultaneously.
See the exercises for an example with multiple convolutional filters. In addition, multiple combination
convolution-pooling layers are often used, as in Deep Learning. The fully-connected part of the network may
also consist of multiple layers.
10.11.5 Training
Training involves finding values for the parameters B ∈ Rnz×ny, β ∈ Rny, C ∈ Rnc×nc and α ∈ R that
minimize a given loss function L. To simplify the development, the biases will be ignored.
For a single input matrix X yielding prediction ŷ, the loss function to be minimized is

L(B, C) = ½ [(y − ŷ) · (y − ŷ)]    (10.129)
10.11.6 Optimization
10.11.7 Exercises
1. Develop the modeling equations for a 2D CNN that consists of the following layers:
(a) Convolutional Layer with two feature maps. 5 × 5 convolutional filters, no padding, stride 1 and
reLU activation.
(b) Pooling Layer using max-pooling. 2 × 2 pooling with a stride of 2.
(c) Flattening Layer to convert matrices into a vector.
(d) Fully-Connected Layer with 10 output nodes.
2. If the above network is used for a 28×28 MNIST image/matrix, give the dimensions for all components
in the network. How many parameters will need to be optimized?
3. Create and demo a 2D CNN for the MNIST dataset using Keras. See http://yann.lecun.com/exdb/mnist.
4. It is common in data science to normalize input data. This is especially true when there are nonlinear
transformations. For CNNs and deep learning, normalizing the initial input may be insufficient, so other
types of normalization have been introduced, such as batch normalization, and layer normalization.
Discuss the advantages and disadvantages of the various normalization techniques.
5. The convolution operator slides a window the size of the convolution filter over the image. For a dilated
CNN, when the dilation rate is 2, the window size is enlarged in each dimension since every other pixel
is skipped. Given a 3-by-3 convolution filter, the window size (size of the receptive field in the image)
is 3-by-3. In this case the dilation rate is 1. In general, a dilation rate of n will mean n − 1 pixels will
be skipped for every pixel used. What would the window size be if the dilation rate is 2?
6. Create a 3D CNN that processes color images using Keras. It will take a four-dimensional input
tensor X = [xijkl ] where i indicates which image, j indicates which (horizontal) row, k indicates
which (vertical) column, and l indicates which color channel. Use the CIFAR-10 dataset
https://www.cs.toronto.edu/~kriz/cifar.html.
10.12 Transfer Learning
Transfer learning is applicable to many types of machine learning problems [140]. In this section, we focus
on transfer learning for neural networks [190].
The first sets of transformations (from input to the first hidden layer and between the first and second
hidden layers) in a Neural Network allow nonlinear effects to be created to better capture characteristics of
the system under study. These are taken in combination in later layers of the Neural Network. The
thought is that related problems share similar nonlinear effects. In other words, two datasets on related
problems used to train two Neural Networks are likely to develop similar nonlinear effects at certain layers.
If this is the case, the training of the first Neural Network could expedite the training of the second Neural
Network. Some research has shown that the Quality of Fit (QoF) may be enhanced as well.
The easiest way to imagine this is to have two four-layer Neural Networks, say with 30 inputs for the
first and 25 for the second. Let the first hidden layer have 50 nodes and the second have 20 nodes, with an
output layer for the single output/response value. The only difference in node count is in the input layer.
For the first Neural Network, the sizes of the parameter/weight matrices are 30-by-50, 50-by-20 and 20-by-1.
The only difference in the second Neural Network is that the first matrix is 25-by-50. After the first Neural
Network is trained, its second matrix (50-by-20) along with its associated bias vector could be transferred
to the second Neural Network. When training starts for the second Neural Network, random initialization
is skipped for this matrix (and its associated bias vector). A training choice is whether to freeze this layer
or allow its values to be adjusted during back-propagation.
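Conceptually, the transfer amounts to copying one trained layer's parameters and optionally freezing them, as in the following plain-Scala sketch (all names are hypothetical, for illustration only):

    case class Layer (var w: Array [Array [Double]], var b: Array [Double],
                      var frozen: Boolean = false)

    // copy the trained 50-by-20 layer from network 1 into network 2,
    // skipping random initialization; if frozen, back-propagation leaves it unchanged
    def transferLayer (from: Layer, to: Layer, freeze: Boolean): Unit =
        to.w = from.w.map (_.clone)
        to.b = from.b.clone
        to.frozen = freeze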
• Dataset. A dataset D may be defined as an ordered pair of tensors D = (X, Y) where X is the input
tensor and Y is the output tensor. It is assumed that there is an unknown function f and a noise
generator that characterizes the relationship between input tensor X and output tensor Y.
Y = f (X) + (10.130)
• Task. In a limited sense, a task τ may be viewed as a procedure to produce a model function fm that
approximates f. Secondary goals include interpretability and generalizability. The closeness of the
approximation can be measured by some norm of the differences between the functions or more generally
by a task-inspired loss function (e.g., MSE for regression tasks or cross entropy for classification tasks)
minimized over a function space F. For example, if the input tensor is two-dimensional (a matrix) and the
output tensor is one-dimensional (a vector), F could be the set of all linear transformations of the
column space of the input matrix to R. The goal of multiple linear regression is to find a point in the
column space of matrix X closest to the vector y.
• Transfer Learning. Now given two datasets D1 and D2 , and two tasks τ1 and τ2 , the question is
whether the efforts in task τ1 can reduce the effort and/or improve the quality of outcome of task τ2 .
Transfer learning is ideally applied when both the datasets and the tasks are similar (or related in some
sense) and when the first dataset is large enough to support accurate training. Applying transfer learning
to the second task may be motivated by limited time/computational resources or lack of data. The lack of
data often happens for classification tasks with missing labels (y values).
An area in which transfer learning has demonstrated considerable success is in Convolutional Networks.
Therefore, consider a dataset where the input is a three-dimensional tensor X ∈ Rm×w×h , where m is the
number of grayscale images, w is their pixel width and h is their pixel height. For image classification, the
output is a one-dimensional vector y ∈ {0, . . . K − 1}m , where K is the number of classes in classification C.
For simplicity, assume that m2 < m1 , w2 = w1 , h2 = h1 , and K2 ≤ K1 . We are left with two issues: how
similar/related are the two sets of images and what is the logical/semantic connection between classification
C2 and classification C1 . The similarity of classification schemes could be based on the fraction of classes in
C2 that are also in C1 .
r = |C2 ∩ C1| / |C2|    (10.132)
This can be refined when there is an underlying ontology to provide meaning to the classifications and their
class labels, as well as a metric for semantic distance. Note, some pre-trained models may have been trained
to discriminate between hundreds or thousands of image types, e.g., ImageNet [98].
Selection of a source dataset may now be decomposed into the following steps:
1. Initial Screening. Initial screening of potential candidate source datasets can be done by comparing
their similarity to that of the target dataset D2 . Information divergence (e.g., KL divergence) or
distances based on optimal transport may be used to explore the similarity of datasets (or their under-
lying probability measures). Source datasets sufficiently similar and with pre-trained models should
be tested in step 2.
2. Choosing Candidates. Datasets identified from step 1 may be selected as candidates by applying
them to the target problem in the following way. Replace the last fully-connected layer of the source
convolutional network with a new one, where the output layer is reduced from K1 to K2 nodes. Call this
the adapted model. Freeze all the previous layers and train the last layer from scratch. A layer being
frozen means that its parameters (weights and biases) will not be changed during back-propagation.
Choose the models with better Quality of Fit (QoF) measures or lower loss functions.
3. Fine-Tuning. Let D1 be one of the candidate datasets. One option would be to train the adapted
model/network from scratch. Unfortunately, this would be time consuming at best and likely to fail
at worst due to lack of class labels. Retraining involves a choice for each layer: whether to freeze,
fine-tune, or train from scratch. Making the right choice for each layer can lead to a more accurate
model. Unfortunately, a convolutional network with 16 trainable layers, such as VGG-16 [189], would
require 3¹⁶ = 43,046,721 models to train. Therefore, establishing criteria for deciding how to handle
each layer is important. Although it tends to be generally the case that the later layers are specific
to the given task and that earlier layers are more generic and hence transferable as is, recent research
has shown benefits in fine-tuning some early layers [66].
Domain Adaptation
To summarize, given two datasets D1 = (X1, Y1) and D2 = (X2, Y2), and two tasks τ1 and τ2, utilize trained
model/network fm1 to produce a new model fm2 by reusing much of fm1 to make training have similar or
better outcomes in less time.
Typically in domain adaptation, we may assume that τ1 = τ2 and the task is classification (thus, the
outputs are vectors). It may be further assumed that a label mapping function can be defined so that class
labels in classification 1 can be mapped without loss of information to labels in classification 2. Thus, we
may stipulate that the output vectors are points in a common label space.
The focus in domain adaptation is on applying transformations on the input datasets, X1 and X2 . Let
the form of inputs now be reduced to matrices X1 ∈ Rm1 ×n1 and X2 ∈ Rm2 ×n2 . To facilitate visualization,
let the number of columns in both matrices be 2 (n1 = n2 = 2). Now, two-dimensional histograms may be
produced for each input data matrix.
Instance Selection
Class Methods:

    @param x         the m-by-n input matrix (training data consisting of m input vectors)
    @param y         the m-by-ny output matrix (training data consisting of m output vectors)
    @param fname_    the feature/variable names (defaults to null)
    @param nz        the number of nodes in each hidden layer, e.g.,
                     Array (9, 8) => 2 hidden layers of sizes 9 and 8 (null => use default formula)
    @param hparam    the hyper-parameters for the model/network
    @param f         the array of activation function families between every pair of layers
    @param l_tran    the layer to be transferred in (defaults to first hidden layer)
    @param transfer  the saved network parameters from a layer of a related neural network;
                     trim before passing in if the size does not match
    @param itran     the inverse transformation function returns responses to original scale

    ...
    extends NeuralNet_XL (x, y, fname_, nz, hparam, f, itran):

    def trim (tl: NetParam, lt: Int = l_tran): NetParam = tl.trim (sizes(lt), sizes(lt+1))
10.12.4 Exercises
1. Find two related datasets at https://huggingface.co/datasets/inria-soda/tabular-benchmark
and use the larger dataset to transfer in a layer and see if this improves the QoF. Compare the following
three scenarios:
(a) QoF of small dataset trained with its own data
(b) QoF of small dataset after a layer is transferred in and not frozen
(c) QoF of small dataset after a layer is transferred in and frozen (see freeze method in NeuralNet XL)
10.13 Extreme Learning Machines
An Extreme Learning Machine (ELM) may be viewed as a three-layer Neural Network with the first set of
parameters (weights and biases) frozen. The values for these parameters may be randomly generated
or transferred in from a Neural Network. With the first set of fixed-parameters frozen, only the second set
of parameters needs to be optimized. When the second activation function is the identity function (id), the
optimization problem is the same as for the Regression problem, so that matrix factorization may be used
to train the model. This greatly reduces the training time over Neural Networks. Although the first set of
fixed-parameters is not optimized, nonlinear effects are still created at the hidden layer and there may be
enough flexibility left in the second set of parameters to retain some of the advantages of Neural Networks.
y = b · f0(Aᵀx) + ε    (10.135)
where A is the first layer NetParam consisting of an n-by-nz weight matrix and an nz bias vector. In
ScalaTion, these parameters are initialized as follows (see exercises for details):
    private var a = new NetParam (weightMat3 (n, nz, s),   // weight matrix
                                  weightVec3 (nz, s))      // bias vector
The second layer is simply an nz weight vector b. The first layer’s weights and biases A are frozen, while
the second layer parameters b are optimized.
10.13.2 Training
Given a dataset (X, y), training will be used to adjust values of the parameter vector b. The objective is to
minimize the distance between the actual and predicted response vectors.
10.13.3 Optimization
An optimal value for parameter vector b may be found using Regression.
@param x_  the training/full data/input matrix
@param y_  the training/full response/output vector
...
end train
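To make the idea concrete, here is a minimal, self-contained sketch in plain Scala (not ScalaTion's actual
train implementation, which uses Regression / matrix factorization). With A frozen, the hidden-layer output
H = f0(X A + bias) is a fixed matrix, so fitting b reduces to solving the least-squares normal equations
(HᵀH) b = Hᵀy; tanh stands in for f0 here, and all helper names are hypothetical.

object ELMTrainSketch:

    type Mat = Array[Array[Double]]

    def matMul (a: Mat, b: Mat): Mat =
        Array.tabulate (a.length, b(0).length) ((i, j) =>
            (0 until b.length).map (k => a(i)(k) * b(k)(j)).sum)

    // Solve the square system m x = v by Gaussian elimination (no pivoting checks).
    def solve (m: Mat, v: Array[Double]): Array[Double] =
        val a = m.map (_.clone); val b = v.clone; val n = b.length
        for k <- 0 until n; i <- k+1 until n do
            val f = a(i)(k) / a(k)(k)
            for j <- k until n do a(i)(j) -= f * a(k)(j)
            b(i) -= f * b(k)
        val x = new Array[Double] (n)
        for i <- n-1 to 0 by -1 do
            x(i) = (b(i) - (i+1 until n).map (j => a(i)(j) * x(j)).sum) / a(i)(i)
        x

    // Train: freeze (aW, aB); form H = f0 (X A + bias); fit b by least squares.
    def train (x: Mat, y: Array[Double], aW: Mat, aB: Array[Double]): Array[Double] =
        val f0  = (z: Double) => math.tanh (z)                     // example activation f0
        val h   = matMul (x, aW).map (row => row.zip (aB).map ((z, b0) => f0 (z + b0)))
        val ht  = h.transpose
        val hth = matMul (ht, h)                                   // H^T H
        val hty = ht.map (row => row.zip (y).map (_ * _).sum)      // H^T y
        solve (hth, hty)                                           // b = (H^T H)^{-1} H^T y

end ELMTrainSketch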
Extreme Learning Machines are typically competitive with low-order polynomial regression and may
have fewer parameters than CubicXRegression.
Class Methods:
@param x       the m-by-n input matrix (training data consisting of m input vectors)
@param y       the m output vector (training data consisting of m output scalars)
@param fname_  the feature/variable names (if null, use x_j's)
@param nz      the number of nodes in hidden layer (-1 => use default formula)
@param hparam  the hyper-parameters for the model/network
@param f       the activation function family for layers 1->2 (input to hidden)
@param itran   the inverse transformation function returns responses to original scale
When the second activation function is not id, the optimization of the second set of parameters
works as it does for Transformed Regression. Also, when multiple outputs are needed, ELM_3L may be
used.
10.13.5 Exercises
1. Create an ELM_3L1 model to predict values for the AutoMPG dataset. Compare with the results of
   using the following models: (a) Regression, (b) Perceptron, (c) NeuralNet_2L, (d) NeuralNet_3L,
   (e) NeuralNet_XL

2. Time each of the six models given above using ScalaTion's time method (in the scalation package).
def time [R] (block: => R): R =
This method can time the execution of any block of code (time { block }).
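For example, assuming mod is one of the models above, the train-test call may be timed as follows:

val (yp, qof) = time { mod.trainNtest ()() }         // time the block; its result is returned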
3. Compare the following strategies for initializing NetParam A (weights and biases for the first layer).
(a) randomly generated weights
(b) binary sparse weights
(c) ternary sparse weights
For an explanation of these three weight initialization schemes, see https://core.ac.uk/download/
pdf/301130498.pdf.
Chapter 11
For time series forecasting, time series/temporal models are used to make forecasts into the future.
Applications are numerous and include weather, sales projections, energy consumption, economic indicators,
financial instruments and traffic conditions. For time series classification, related models are applied to
sequential data in areas that include Speech Recognition and Natural Language Processing (NLP).
Until this chapter, predictive models have been of the form
y = f(x; b) + ε

where x is the vector of predictive variables/features, y is the response variable and ε is the residual/error.
In order to fit the parameters b, m samples are collected into a data/input matrix X and a response/output
vector y. The samples are treated as though they are independent. In many cases, such as data collected
over time, they are not independent. For example, the current Gross Domestic Product (GDP) will
likely show high dependency upon the previous quarter's GDP. If the model is to forecast the next quarter's
GDP, surely it should take into account the current (or recent) GDP values.
One may think about forecasting as follows: Given that you know today’s high temperature yt , predict (or
forecast) tomorrow’s high temperature yt+1 . Of course, using historical data (past temperatures) would be
likely to improve the quality of the forecasts.
To begin with, one could focus on forecasting the value of response y at time t + 1 as a function of current
and past (or lagged) values of y, e.g.,

[yt, yt−1, . . . , yt−p+1]
In addition to past values of the response vector, due to the time-oriented nature of the data, one can
literally use past errors to improve forecasts. This is easier to rationalize if they are termed shocks or
innovations, e.g., something unexpected happened that caused a forecast to be off. Obviously, other variables
besides the response variable itself may be used. In some models, these are referred to as exogenous variables.
Some forecasting techniques, such as Time Series Regression with Lagged Variables, can select the features
using, for example, forward selection, backward elimination or stepwise refinement.
Initially in this chapter, time will be treated as discrete time.
11.1 Forecaster
The Forecaster trait within the scalation.modeling.forecasting package provides a common framework
for several forecasters. The key methods are the following:
• The train method must be called first as it matches the parameter values to patterns in the data.
Training can occur on a training set or the full dataset/time series.
• The test method is used to assess the Quality of Fit of the trained model in making predictions (one
time-unit ahead forecasts). It can be applied to the full dataset for in-sample testing, or to the testing
set for out-of-sample testing.
• The testF method takes testing to the next level. Once predictions are judged to be satisfactory, the
model can be assessed in terms of its ability to make longer term forecasts. Obviously, the farther into
the future the forecasts are made, the greater the challenge. The quality of the forecasts is likely to
degrade as the forecasting horizon h increases.
• The predict method gives a one-step ahead forecast: given the current time t (e.g., think of it as
today), based on, for example, the current value yt and past values, predict the next (e.g., tomorrow's)
value yt+1.
• The predictAll method extends this over the time series (or a sub-range of it).
• The forecast method is the analog of predict, but for multi-horizon forecasting, returning a vector
of forecasts for 1 to h time units (e.g., days) ahead.
• The forecastAt method will provide all the forecasts for a particular forecasting horizon h.
• The forecastAll method extends this over all forecasting horizons from 1 up to and including h. The
commonly used recursive method for forecasting into the future requires forecasts to be generated and
stored in a matrix (called yf) one horizon at a time. The first forecast is made entirely from actual
data, while the next one depends on the prior forecast. Depending on how many past values are
used in the model, eventually the recursive method will produce forecasts based solely on forecasted
values. The alternative, the direct method, will be discussed later.
Forecaster Trait
Trait Methods:
@param y       the response vector (time series data)
@param tt      the time vector, if relevant (index as time may suffice)
@param hparam  the hyper-parameters for models extending this trait
def getFname: Array[String] = Array ("no-x features")
def train (x_null: MatrixD, y_: VectorD): Unit
def test (x_null: MatrixD, y_: VectorD): (VectorD, VectorD)
protected def testSetup (y_: VectorD, doPlot: Boolean = true): (VectorD, VectorD) =
def trainNtest (y_: VectorD = y)(yy: VectorD = y): (VectorD, VectorD) =
def testF (h: Int, y_: VectorD): (VectorD, VectorD)
protected def testSetupF (y_: VectorD, h: Int, doPlot: Boolean = true): (VectorD, VectorD) =
def hparameter: HyperParameter = hparam
def parameter: VectorD = new VectorD (0)             // vector with no elements
def nparams: Int = parameter.dim                     // number of parameters
def residual: VectorD = { if e == null then flaw ("residual", "must call test method first"); e }
def predict (z: VectorD): Double =
def predict (t: Int, y_: VectorD): Double
def predictAll (y_: VectorD): VectorD =
def forecast (t: Int, yf: MatrixD, y_: VectorD, h: Int): VectorD
def forecastAt (yf: MatrixD, y_: VectorD, h: Int): VectorD
def forecastAll (y_: VectorD, h: Int): MatrixD =
def forwardSel (cols: Set[Int], idx_q: Int = QoF.rSqBar.ordinal): (Int, Forecaster) = ???
def forwardSelAll (idx_q: Int = QoF.rSq.ordinal, cross: Boolean = false): (Set[Int], MatrixD) =
Class Methods:
@param y      the response vector (time series data) for the training/full dataset
@param lags_  the maximum number of lags

val lags = min (y.dim - 1, lags_)                    // lags can't exceed dataset size
val mu   = y.mean                                    // sample mean
val sig2 = y.variance                                // sample variance
val acv  = new VectorD (lags + 1)                    // auto-covariance vector
for k <- acv.indices do acv(k) = y acov k            // k-th lag auto-covariance
val acr  = acv / acv(0)                              // auto-correlation function

override def toString: String =

end Stats4TS
11.1.2 Auto-Correlation Function
To better understand the dependencies in the data, it is useful to look at the auto-correlation. Consider the
following time series data, used in forecasting lake levels, recorded in the Lake Level Time Series Dataset
(see cran.r-project.org/web/packages/fpp/fpp.pdf):
val m = 98
val t = VectorD.range (0, m)
val y = VectorD (580.38, 581.86, 580.97, 580.80, 579.79, 580.39, 580.42, 580.82, 581.40, 581.32,
581.44, 581.68, 581.17, 580.53, 580.01, 579.91, 579.14, 579.16, 579.55, 579.67,
578.44, 578.24, 579.10, 579.09, 579.35, 578.82, 579.32, 579.01, 579.00, 579.80,
579.83, 579.72, 579.89, 580.01, 579.37, 578.69, 578.19, 578.67, 579.55, 578.92,
578.09, 579.37, 580.13, 580.14, 579.51, 579.24, 578.66, 578.86, 578.05, 577.79,
576.75, 576.75, 577.82, 578.64, 580.58, 579.48, 577.38, 576.90, 576.94, 576.24,
576.84, 576.85, 576.90, 577.79, 578.18, 577.51, 577.23, 578.42, 579.61, 579.05,
579.26, 579.22, 579.38, 579.10, 577.95, 578.12, 579.75, 580.85, 580.41, 579.96,
579.61, 578.76, 578.18, 577.21, 577.13, 579.10, 578.25, 577.91, 576.89, 575.96,
576.80, 577.68, 578.38, 578.52, 579.74, 579.31, 579.89, 579.96)
First plot this dataset and then look at its Auto-Correlation Function (ACF).
new Plot (t, y, null, "Plot of y vs. t", true)
The Auto-Correlation Function (ACF) measures how much the past can influence a forecast. If the
forecast is for time t + 1, then the current/past time points are t, t − 1, . . . , t − p′. The k-lag auto-covariance
(auto-correlation), γk (ρk), is the covariance (correlation) of yt and yt−k.
Note that γ0 = V[yt] and ρk = γk/γ0. These equations assume the stochastic process {yt | t ∈ [0, m − 1]} is
covariance stationary (see exercises).
Although vectors need not be created, to compute corr(yt , yt−2 ) one could imagine computing the cor-
relation between y(2 until m) and y(0 until m-2). In ScalaTion, the ACF is provided by the acF
function from the Correlogram trait in the scalation.mathstat package.
@main def correlogramTest (): Unit =

    val y = VectorD (1, 2, 5, 8, 3, 6, 9, 4, 5, 11,
                     12, 16, 7, 6, 13, 15, 10, 8, 14, 17)

    banner ("Plot Data")
    new Plot (null, y, null, "y vs. t", lines = true)
    ...

end correlogramTest
The first point in the plot is the auto-correlation of yt with itself, while the rest of the points are ACF(k),
the k-lag auto-correlation.
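For concreteness, here is a minimal plain-Scala sketch (not the Correlogram implementation) of the k-lag
auto-covariance γk = (1/m) ∑_{t=k}^{m−1} (yt − µ)(yt−k − µ) and the ACF ρk = γk/γ0, using the biased
1/m estimator assumed here:

// ACF sketch: k-lag auto-covariance and auto-correlation (biased 1/m estimator).
def acov (y: Array[Double], k: Int): Double =
    val m  = y.length
    val mu = y.sum / m
    (k until m).map (t => (y(t) - mu) * (y(t-k) - mu)).sum / m

def acf (y: Array[Double], maxLag: Int): Array[Double] =
    val g0 = acov (y, 0)
    (0 to maxLag).map (k => acov (y, k) / g0).toArray           // rho(0) = 1 by construction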
11.1.3 Correlogram
A correlogram shows how a time series correlates with lagged versions of itself and allows one to visualize
the important dependencies in the time series.
Correlogram Trait
Trait Methods:
@param y  the time series data (response vector)
1. Mean Absolute Error (MAE)

   MAE = (1/m) ∑_{t=0}^{m−1} |yt − ŷt|        (11.5)

2. Root Mean Squared Error (RMSE)

   RMSE = [ (1/m) ∑_{t=0}^{m−1} (yt − ŷt)² ]^{1/2}        (11.6)

   Problem: Same as for MAE, plus it can be overly influenced by outliers (or near outliers).

3. Mean Absolute Percentage Error (MAPE)

   MAPE = (100/m) ∑_{t=0}^{m−1} |yt − ŷt| / |yt|        (11.7)

   Problem: If some values for yt are 0, it will return infinity. Also, an upward shift of the values will
   reduce/improve the score (e.g., changing from Celsius to Kelvin).

4. Symmetric Mean Absolute Percentage Error (sMAPE)

   sMAPE = (200/m) ∑_{t=0}^{m−1} |yt − ŷt| / (|yt| + |ŷt|)        (11.8)

   Problem: The same problem, but to a lesser degree, as both yt and ŷt need to be zero. Also, an upward
   shift of the values will reduce the score.

5. Mean Absolute Scaled Error (MASE)

   MASE = MAE / MAEn        (11.9)

   where the denominator is the MAE for the Naïve Model (Simple Random Walk) that is used as the
   standard baseline model to compare other models with.

   MAEn = (1/(m−1)) ∑_{t=1}^{m−1} |yt − yt−1|        (11.10)

   While the standard baseline model for prediction is the Null Model (guess the mean), it tends not to
   perform well for time series forecasting, so the Naïve Model (guess the previous value) is used in its
   place. Therefore, MASE measures how well the forecasting model works relative to the Naïve Model.
   When MASE is 1, the model performance is on par with the Naïve Model; values below 1 are better
   and values above 1 are worse. Note, for long-horizon forecasts, MASE values above 1 are likely to
   happen.
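These measures are straightforward to compute directly. Below is a minimal, self-contained sketch in plain
Scala (hypothetical helper functions, not ScalaTion's Fit class) for aligned actual and forecast arrays, with
MAEn computed from the Naïve Model as in Equation 11.10.

// Sketch of the forecast-accuracy measures above (plain Scala, hypothetical helpers).
// y and yp are aligned arrays of actual and forecasted values.
def mae (y: Array[Double], yp: Array[Double]): Double =
    y.zip (yp).map ((a, p) => math.abs (a - p)).sum / y.length

def smape (y: Array[Double], yp: Array[Double]): Double =
    200.0 * y.zip (yp).map ((a, p) => math.abs (a - p) / (math.abs (a) + math.abs (p))).sum / y.length

def maeNaive (y: Array[Double]): Double =                       // MAE_n for the Naive Model
    (1 until y.length).map (t => math.abs (y(t) - y(t-1))).sum / (y.length - 1)

def mase (y: Array[Double], yp: Array[Double]): Double =        // scaled by the Naive Model's MAE
    mae (y, yp) / maeNaive (y)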
11.2 Baseline Models: Random Walk, Null and Trend Models
There are three simple baseline models for time series data: the Random Walk, the Null Model and the
Trend Model. These models are very simple and serve as baselines for other forecasting models to compete
against.
Notice that this model has no parameters to estimate and as such is considered to be a baseline model for
more sophisticated models to improve upon. If they cannot, their value is questionable.
Several distributions may be used to generate white noise, with the most common being the Gaussian (Normal)
distribution, N(0, σ²I).
q = m (m + 2) ∑_{k=1}^{h} ρk² / (m − k)  ∼  χ²_h        (11.16)
where ρk is the k-lag autocorrelation and m is the length of the time series. Large values for q indicate
significant autocorrelation. Hyndman [84] recommends that the number of lags h be 10 for non-seasonal
processes. See the section on SARIMA for a discussion of seasonal models.
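A minimal sketch of the q statistic in plain Scala (hypothetical helper; rho holds the ACF values with
rho(0) = 1). The χ² comparison may be done with a quantile function such as chiSquareInv, shown in the
exercises below.

// Ljung-Box statistic: q = m (m+2) * sum_{k=1..h} rho(k)^2 / (m - k)
def ljungBox (rho: Array[Double], m: Int, h: Int = 10): Double =
    m * (m + 2.0) * (1 to h).map (k => rho(k) * rho(k) / (m - k)).sum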
11.2.4 RandomWalk Class
Class Methods:
@param y       the response vector (time series data)
@param tt      the time vector, if relevant (time index may suffice)
@param hparam  the hyper-parameters (none => use null)
where µy is the mean of the response variable estimated from the training set. For in-sample assessment it is
computed over the full dataset. For out-of-sample assessment using rolling validation, each retraining may
produce a slightly different value for the mean. See the section on Rolling Validation.
Class Methods:
@param y       the response vector (time series data)
@param tt      the time vector, if relevant (time index may suffice)
@param hparam  the hyper-parameters (none => use null)
override def parameter: VectorD = VectorD (mu)
def predict (t: Int, y_: VectorD): Double = mu
def forecast (t: Int, yf: MatrixD, y_: VectorD, h: Int): VectorD =
def forecastAt (yf: MatrixD, y_: VectorD, h: Int): VectorD =
Class Methods:
@param y       the response vector (time series data)
@param tt      the time vector (required for the trend model)
@param hparam  the hyper-parameters (none => use null)
in-sample Quality of Fit (QoF) measures. See the exercises for out-of-sample quality assessment.
@main def randomWalkTest2 (): Unit =

    import Example_LakeLevels.{t, y}
    ...

end randomWalkTest2
Note, computation of mase requires the first value of y (y(0)), which is eliminated by the test method as it
aligns the time series y and yp. Hence, mase is computed separately from the other QoF measures.
In the plot, the forecasts (in red) simply follow the actual values (in black) with a one-step delay. The report
for RandomWalk shows the quality of fit.
REPORT
----------------------------------------------------------------------------
modelName mn = RandomWalk
----------------------------------------------------------------------------
hparameter hp = null
----------------------------------------------------------------------------
features fn = Array(no-x features)
----------------------------------------------------------------------------
parameter b = VectorD()
----------------------------------------------------------------------------
fitMap qof = LinkedHashMap(rSq -> 0.676806, rSqBar -> 0.673440, sst -> 166.664699,
sse -> 53.865000, mse0 -> 0.555309, rmse -> 0.745191, mae -> 0.585567, dfm -> 1.000000,
df -> 96.000000, fStat -> 201.035387, aic -> -105.107880, bic -> -99.958458,
mape -> 0.101146, smape -> 0.101154)
----------------------------------------------------------------------------
mase = 1.0
In particular, the R² is 0.676806, the Mean Absolute Error (MAE) is 0.585567, the Mean Absolute Percentage
Error (MAPE) is 0.101146, the symmetric MAPE is 0.101154, and the Mean Absolute Scaled Error (MASE)
is 1.0 (as expected).
In the plot, the forecasts (in red) are a flat line that cuts the actual values (in black) in the middle (mean).
The report for NullModel shows a reduced quality of fit.
REPORT
----------------------------------------------------------------------------
modelName mn = NullModel
----------------------------------------------------------------------------
hparameter hp = null
----------------------------------------------------------------------------
features fn = Array(no-x features)
----------------------------------------------------------------------------
parameter b = VectorD(578.990)
----------------------------------------------------------------------------
fitMap qof = LinkedHashMap(rSq -> 0.000000, rSqBar -> 0.000000, sst -> 166.664699,
sse -> 166.664699, mse0 -> 1.718193, rmse -> 1.310799, mae -> 1.057230, dfm -> 1.000000,
df -> 96.000000, fStat -> 0.000000, aic -> -159.888779, bic -> -154.739357,
mape -> 0.182630, smape -> 0.182614)
----------------------------------------------------------------------------
mase = 1.805481
In this case, the R2 is 0.0 (as expected), the Mean Absolute Error (MAE) is 1.057230, the Mean Absolute
Percentage Error (MAPE) is 0.182630, the symmetric MAPE is 0.182614, and the Mean Absolute Scaled
Error (MASE) is 1.805481.
In the plot, the forecasts (in red) form a straight line with a slight downward slope through the actual values
(in black). The report for TrendModel shows a reduced quality of fit.
REPORT
----------------------------------------------------------------------------
modelName mn = TrendModel
----------------------------------------------------------------------------
hparameter hp = null
----------------------------------------------------------------------------
features fn = Array(no-x features)
----------------------------------------------------------------------------
parameter b = VectorD(580.169,-0.0240708)
----------------------------------------------------------------------------
fitMap qof = LinkedHashMap(rSq -> 0.264379, rSqBar -> 0.264379, sst -> 166.664699,
sse -> 122.602045, mse0 -> 1.263939, rmse -> 1.124250, mae -> 0.920808, dfm -> 1.000000,
df -> 96.000000, fStat -> 34.501992, aic -> -144.997325, bic -> -139.847903,
mape -> 0.159076, smape -> 0.159073)
----------------------------------------------------------------------------
mase = 1.572507
For the final baseline model, the R² is 0.264379, the Mean Absolute Error (MAE) is 0.920808, the Mean
Absolute Percentage Error (MAPE) is 0.159076, the symmetric MAPE is 0.159073, and the Mean Absolute
Scaled Error (MASE) is 1.572507.
Figure 11.1 shows in-sample, one-step ahead forecasting plots, where the actual values y are shown in
red, Random Walk forecasts y-RW are shown in green, Null Model forecasts y-NM are shown in blue, and
Trend Model forecasts y-TM are shown in black.
Figure 11.1: Comparison of Baseline Time Series Models: Lake Level vs. Year
Notice that y-RW follows the actual time series value with a lag of one time unit, while y-NM and y-TM are
straight lines, with the former being a horizontal line at the mean value for the time series and the latter
showing negative slope indicating a downward trend.
A useful time series model should have a MASE smaller than the minimum of the values from the three
baselines (RandomWalk, NullModel and TrendModel).
11.2.10 Exercises
1. Generate Gaussian white noise with a variance of one and plot it.
val noise = Normal (0, 1)
val y = VectorD (for i <- 0 until 100 yield noise.gen)
new Plot (null, y, null, "white noise")
2. Create a Random Walk model for the above white noise process, plot the process value vs. the
   forecasted value and assess the Quality of Fit (QoF). Also, examine the Auto-Correlation Function
   (ACF) plot. What would it mean if a peak in the ACF plot were outside the error bands? Ignore
   the first value, corresponding to the zeroth lag, where the correlation ρ0 should be one and the partial
   correlation ψ00 should be zero.
val mod = new RandomWalk (y)                         // time series model
mod.trainNtest ()()                                  // train-test model on full dataset
mod.plotFunc (mod.acF, "ACF")
3. For the Lake-Level Dataset, apply the Ljung-Box Test to see if yt or εt are white noise. Determine the
   values for q for each case and the 95% critical value of χ²_10.

import scalation.variate.Quantile.chiSquareInv
println (chiSquareInv (0.95, 10))
4. Give the model equations for a Random Walk with Drift. How is the drift parameter estimated? What
are its advantages?
5. Explain the results from the three baseline models, Random Walk, Null and Trend Models, for out-of-sample
   horizon h = 1 assessment shown below. See the section on Rolling Validation for how out-of-sample
   QoF measures are produced.
Random Walk: rollValidate: for horizon h = 1:
LinkedHashMap(rSq -> 0.511033, rSqBar -> 0.511033, sst -> 74.797020, sse -> 36.573300,
mse0 -> 0.746394, rmse -> 0.863941, mae -> 0.690000, dfm -> 1.000000, df -> 49.000000,
fStat -> 51.211192, aic -> -59.547204, bic -> -55.763564, mape -> 0.119283, smape -> 0.119301)
Null Model: rollValidate: for horizon h = 1:
LinkedHashMap(rSq -> -0.668405, rSqBar -> -0.668405, sst -> 74.797020, sse -> 124.791686,
mse0 -> 2.546769, rmse -> 1.595860, mae -> 1.313594, dfm -> 1.000000, df -> 49.000000,
fStat -> -19.630623, aic -> -90.603986, bic -> -86.820346, mape -> 0.227405, smape -> 0.227085)
Trend Model: rollValidate: for horizon h = 1:
LinkedHashMap(rSq -> -0.433676, rSqBar -> -0.463545, sst -> 74.797020, sse -> 107.234717,
mse0 -> 2.188464, rmse -> 1.479346, mae -> 1.255724, dfm -> 1.000000, df -> 48.000000,
fStat -> -14.519640, aic -> -89.187547, bic -> -85.403906, mape -> 0.216952, smape -> 0.217159)
6. Compare the three baseline models, Random Walk, Null and Trend Models, using out-of-sample
   assessment for h > 1 via the RollValidate object. See the section on Rolling Validation. What can
   be said about the three baseline models as a group? What about the relative strengths of each baseline
   modeling technique as the forecasting horizon increases?
7. Two of the three baseline models can be unified using the Simple Moving Average Model. While
   RandomWalk takes the last value as the forecast, SimpleMovingAverage takes the average of the last
   q values as the forecast.

@param y       the response vector (time series data)
@param tt      the time points, if needed
@param hparam  the hyper-parameters

   Show that for q = 1, SimpleMovingAverage gives the same results as RandomWalk, while as q becomes
   large, it approximates the results of the Mean Model, i.e., NullModel.
8. Consider a stochastic process generated from a Random Walk with y0 = 0 and εt+1 ∼ N(0, σ²).
   Show that E[yt] = 0 and V[yt] = t σ². The fact that the variance grows linearly in time shows that a
   Random Walk is not a covariance stationary process (see the section on ARIMA).
11.3 Simple Exponential Smoothing
The basic idea of Simple Exponential Smoothing (SES) is to start with Random Walk,

ŷt+1 = yt        (11.20)

but stabilize/smooth out the forecast by maintaining a state variable that summarizes the past.
The state variable st+1 is a weighted combination of the most recent value yt and the most recent value
of the state variable st. The forecasted value ŷt+1 is taken as the value of the state variable st+1.
The state variable may be initialized as the initial value of the time series (other options include taking an
average or letting it be a second parameter passed to an optimizer).

s0 = y0        (11.23)

The parameter α ∈ [0, 1] is referred to as the smoothing parameter. When α = 1, SES becomes Random
Walk, as older values in the time series are ignored. When α = 0, the state remains locked at its initial
value. Thus, α ∈ (0, 1] is a more practical range.
The reason it is called exponential smoothing is that the impact of older values in the time series decreases
exponentially.
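To make the recurrence concrete, here is a minimal plain-Scala sketch (not ScalaTion's smooth method),
assuming the standard SES update st+1 = α yt + (1 − α) st with s0 = y0; the forecast ŷt+1 is st+1.

// Simple Exponential Smoothing sketch: s(0) = y(0); s(t+1) = a * y(t) + (1 - a) * s(t).
def smoothSES (y: Array[Double], a: Double): Array[Double] =
    val s = new Array[Double] (y.length)
    s(0) = y(0)                                                 // initialize state to first value
    for t <- 0 until y.length - 1 do
        s(t+1) = a * y(t) + (1.0 - a) * s(t)                    // weighted combination
    s                                                           // s(t+1) is the forecast for y(t+1)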
11.3.2 Training
For SES, one may minimize the sum of squared errors (sse). The error at time t is given by
εt = yt − ŷt        (11.25)

ε = y − ŷ        (11.26)

A loss function L(α) = sse = ε · ε may be minimized to determine the parameter α (alternatively, both s0
and α).
In ScalaTion, the train method is used to find an optimal value for α (a).
override def train (x_null: MatrixD, y_: VectorD): Unit =
    ...
    if opt then
        val optimizer = new L_BFGS_B (f_obj, l = lo, u = up)    // Quasi-Newton optimizer
        val opt = optimizer.solve (VectorD (a), toler = TOL)    // optimize value for a
        a = (opt._2)(0)                                         // pull from result
    end if
    s = smooth (a)                                              // predicted values
end train
    import Example_LakeLevels.{t, y}
    ...
    for i <- 0 to 5 do
        val a = i.toDouble / 5.0
        banner (s"Build SimpleExpSmoothing model with a = $a")
        mod.reset (a)
        mod.train (null, y)                          // train the model on full dataset
        val (yp, qof) = mod.test (null, y)           // test the model on full dataset
        println (mod.report (qof))                   // report on Quality of Fit (QoF)
    end for

end simpleExpSmoothingTest5
Class Methods:
@param y       the response vector (original time series data)
@param tt      the time vector, if relevant (time index may suffice)
@param hparam  the hyper-parameters

def smooth (a: Double = α, y_: VectorD = y): VectorD =
def train (x_null: MatrixD, y_: VectorD): Unit =
def test (x_null: MatrixD, y_: VectorD): (VectorD, VectorD) =
def testF (h: Int, y_: VectorD): (VectorD, VectorD) =
override def parameter: VectorD = VectorD (α)
def predict (t: Int, y_: VectorD): Double = s(t+1)
override def predictAll (y_: VectorD): VectorD = s
def forecast (t: Int, yf: MatrixD, y_: VectorD, h: Int): VectorD =
def forecastAt (yf: MatrixD, y_: VectorD, h: Int): VectorD =
def forecastAll (h: Int, y_: VectorD): MatrixD =
11.3.5 Exercises
1. Test customized (α = 0.5) vs. optimized (α determined by the optimizer) smoothing on the following
   synthetic data.

@main def simpleExpSmoothingTest3 (): Unit =

    val m = 50
    val r = Random ()
    val y = VectorD (for i <- 0 until m yield i + 10.0 * r.gen)
    ...

end simpleExpSmoothingTest3
2. For some optimization software it is necessary to pass in an objective function and derivative
   function(s). For this type of optimization, ŷ needs to be replaced.

   ŷt = α ∑_{k=0}^{t−1} (1 − α)^k yt−1−k        (11.29)
Therefore, the loss function is expressed as a double sum.

   L(α) = ∑_{t=0}^{m−1} [ yt − α ∑_{k=0}^{t−1} (1 − α)^k yt−1−k ]²        (11.30)
Create a formula for the derivative of L(α) and pass the function and its derivative into the NewtonRaphson
class found in the scalation.optimization package to find an optimal value for α.
11.4 Auto-Regressive (AR) Models
In order to predict future values for a response variable y, the obvious thing to do is to find variables that
may influence the value of y. The most obvious are prior (or lagged) values of y itself. Often, other variables
may be helpful as well, some of which may be time series themselves. Given the variety of information
potentially available, it should not be surprising that there are numerous modeling techniques used for time
series forecasting [20, 84].

One of the simplest types of forecasting models of the form given in the first section of this chapter is
to make the future value yt+1 linearly dependent on the last p values of y. In particular, a pth-order
Auto-Regressive AR(p) model predicts the next value yt+1 from the sum of the last p values, each weighted
by its own coefficient/parameter φj,
• p′ = p − 1
Zero-Centered
To better capture the dependency, the data can be zero-centered, which can be accomplished by subtracting
the mean µ, zt = yt − µ.
E[zt] = 0        (11.34)

V[zt] = E[zt²] = γ0        (11.35)

C[zt, zt−k] = E[zt zt−k] = γk        (11.36)
11.4.1 AR(1) Model
When the future is mainly dependent on the most recent value, e.g., ρ1 is high and the rest (ρ2, ρ3, etc.)
are decaying, then an AR(1) model may be sufficient.
An estimate for the parameter φ0 may be determined from the Auto-Correlation Function (ACF).
Using the definition for γk given above for k ≥ 1, this can be rewritten as
γk = φ0 γk−1 + 0

The zero is due to the fact that E[εt zt−k] = C[εt, zt−k] and the past value zt−k is independent of the future
noise shock εt. Now dividing by the variance γ0 yields

ρk = φ0 ρk−1        (11.39)

An estimate for parameter φ0 may be easily determined by simply setting k = 1 in the above equation.

φ0 = ρ1        (11.40)

Furthermore, ρ0 is 1 and

γ0 = φ0 γ1 + E[εt (ẑt + εt)]

Since ẑt is independent of εt, the last term becomes E[εt εt] and, as γ1 = γ0 ρ1, the variance of the noise may
be written as follows:
11.4.2 AR(p) Model
The zero-centered model equation for an AR(p) model includes the past p values.
(Illustration: the parameter filter [φ0, φ1, φ2] slides over the zero-centered series z0, z1, z2, z3, z4 to produce
the forecast ẑ5.)
As with convolutional filters, the parameter vector φ can slide along the zero-centered time series zt to
compute forecasted values ẑt. To obtain forecasted values for the original time series, simply add back the
mean,
Working with the zero-centered equations, one can derive a system of linear equations to solve for the
parameters. First, multiplying zt by zt−k gives
Dividing by γ0 yields equations relating the parameters and correlations,
These equations contain p unknowns; by letting k = 1, 2, . . . , p, they can be used to generate p equations,
or one matrix equation.
Yule-Walker Equations
The equations below are known as the Yule-Walker equations. Note that ρ−j = ρj , ρ0 equals 1, and k
advances row by row.
ρ1 = φ0 ρ0 + φ1 ρ1 + . . . + φp−1 ρp−1
ρ2 = φ0 ρ1 + φ1 ρ0 + . . . + φp−1 ρp−2
...
ρp = φ0 ρp−1 + φ1 ρp−2 + . . . + φp−1 ρ0
Letting ρ be the p-dimensional vector of lag auto-correlations and φ be the p-dimensional vector of param-
eters/coefficients, we may concisely write
ρ = Rφ (11.47)
where R is a p-by-p symmetric Toeplitz (one value per diagonal) matrix of correlations with ones on the
main diagonal.
        ⎡ 1      ρ1     . . .   ρp−2   ρp−1 ⎤
        ⎢ ρ1     1      . . .   ρp−3   ρp−2 ⎥
    R = ⎢ . . .  . . .  . . .   . . .  . . .⎥        (11.48)
        ⎢ ρp−2   ρp−3   . . .   1      ρ1   ⎥
        ⎣ ρp−1   ρp−2   . . .   ρ1     1    ⎦
One way to solve for the parameter vector φ is to take the inverse (or use related matrix factorization
techniques).
φ = R−1 ρ (11.49)
σ²_ε = γ0 (1 − ρ · φ) = γ0 (1 − ρᵀ R⁻¹ ρ)        (11.50)
Due to the special structure of the R matrix, more efficient techniques may be used, see the next subsection.
Equations for AR(2)
An easy way to solve for φ is to apply LU Factorization to the augmented matrix below.

    ⎡ 1    ρ1  |  ρ1 ⎤
    ⎣ ρ1   1   |  ρ2 ⎦
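Carrying out the elimination (or applying Cramer's rule, which is equivalent for this 2-by-2 system) gives
closed forms for the AR(2) parameters; a small sketch (hypothetical helper):

// Solve the AR(2) Yule-Walker system [1 r1; r1 1] [phi0; phi1] = [r1; r2] (det = 1 - r1^2).
def ar2Params (r1: Double, r2: Double): (Double, Double) =
    val det = 1.0 - r1 * r1
    (r1 * (1.0 - r2) / det,                                     // phi0
     (r2 - r1 * r1) / det)                                      // phi1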
11.4.3 Training
The steps in training include computing the Auto-Correlation Function, executing the Durbin-Levinson
algorithm, and zero-centering the response. The train method of the AR class is implemented as follows:
def train (x_null: MatrixD, y_: VectorD): Unit =
    m = y_.dim                                       // length of relevant time series
    resetDF (pnq, m - pnq)                           // reset the degrees of freedom
    makeCorrelogram (y_)                             // correlogram computes psi matrix
    φ = psiM(p)(1 until p+1)                         // coefficients = p-th row, columns 1..p
    δ = statsF.mu * (1 - φ.sum)                      // compute drift/intercept
end train
The drift δ is computed from the mean and the values of φ. The parameter/coefficient vector φ can be
estimated using multiple training algorithms/estimation procedures [194]:

1. Method of Moments (MoM). The first two moments, mean and covariance, are used in the Yule-Walker
   Method. In particular, the auto-correlations in the ACF are used.
2. Maximum Likelihood Estimation (MLE). Minimize the negative log-likelihood. See the ARMA
section.
3. Least Squares Estimation (LSE). Minimize the conditional sum of squared errors. See the ARMA
section.
In ScalaTion, the coefficients φ are estimated using the Durbin-Levinson algorithm and extracted from
the pth row of the ψ (psi) matrix. Define ψkj to be φj for an AR(k) model. Letting k range up to p allows
the φj parameters to be calculated. Letting k range up to the maximum number of lags (ml) allows the
Partial Auto-Correlation Function (PACF) to be computed.
Invoke the durbinLevinson method [151], passing in the auto-covariance vector γ (g) and the maximum
number of lags (ml). From 1 up to the maximum number of lags, iteratively compute the following:

ψkk = ( γk − ∑_{j=1}^{k−1} ψk−1,j γk−j ) / rk−1

ψkj = ψk−1,j − ψkk ψk−1,k−j

rk = rk−1 (1 − ψkk²)
def durbinLevinson (g: VectorD, ml: Int): MatrixD =
    val ψ = new MatrixD (ml+1, ml+1)                 // psi matrix (ml = max lags)
    val r = new VectorD (ml+1); r(0) = g(0)
    ...
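The body of the loop is elided above by the page break; the following sketch (not ScalaTion's exact code)
completes it directly from the three recurrences, using plain Scala arrays in place of MatrixD and VectorD.

// Durbin-Levinson sketch: fill row k of psi from row k-1, then update r(k).
def durbinLevinsonSketch (g: Array[Double], ml: Int): Array[Array[Double]] =
    val psi = Array.ofDim[Double] (ml+1, ml+1)                  // psi matrix (ml = max lags)
    val r   = new Array[Double] (ml+1); r(0) = g(0)
    for k <- 1 to ml do
        val sum = (1 until k).map (j => psi(k-1)(j) * g(k-j)).sum
        psi(k)(k) = (g(k) - sum) / r(k-1)                       // k-lag partial auto-correlation
        for j <- 1 until k do
            psi(k)(j) = psi(k-1)(j) - psi(k)(k) * psi(k-1)(k-j)
        r(k) = r(k-1) * (1.0 - psi(k)(k) * psi(k)(k))
    psi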
In particular, ψkk is the k-lag partial auto-correlation, and the parameter vector φ is taken from the pth
row of the Ψ matrix, as noted above.

The Partial Auto-Correlation Function (PACF) is extracted from the main diagonal of the Ψ matrix and
can be used along with the ACF to select the appropriate type of model. As k increases, the k-lag auto-correlation
ρk will decrease and eventually adding more parameters/coefficients will be of little help.

Deciding where to cut off the model based on ρk from the ACF is somewhat arbitrary, but the k-lag
partial auto-correlation ψkk drops toward zero more abruptly, giving a stronger signal as to what model
to select. The partial auto-correlation differs from auto-correlation in that it removes indirect correlation.
For example, if ρ1 = .7, one would expect ρ2 to equal ρ1² = .49, as yt−2 is correlated with yt−1 and yt−1 is
correlated with yt. The 2-lag partial auto-correlation ψ22 measures whether there is any direct correlation
between yt−2 and yt. For this example, when ρ2 = .49, ψ22 = 0, indicating no direct correlation and implying
that the φ1 zt−2 term need not be included in the model.
See the exercises and the ARMA section for more details.
11.4.4 Forecasting
After the parameters/coefficients have been estimated as part of the train method, the AR(p) model can
be used for forecasting.
@main def aRTest4 (): Unit =
    ...
    val yf = mod.forecastAll (y, hh)                 // forecast h-steps ahead (1..hh)
    println (s"y.dim = ${y.dim}, yp.dim = ${yp.dim}, yf.dims = ${yf.dims}")
    assert (yf(?, 0)(0 until m) == y)                // col 0 must agree with actual values
    differ (yf(?, 1)(1 until m), yp)
    assert (yf(?, 1)(1 until m) == yp)               // col 1 must agree with 1-step predictions

    for h <- 1 to hh do
        val (yfh, qof) = mod.testF (h, y)
        val yy = y(h until m)
        println (s"Evaluate QoF for horizon $h:")
        println (FitM.fitMap (qof, QoF.values.map (_.toString)))
        println (s"Fit.mae (y, yfh, h)  = ${Fit.mae (y, yfh, h)}")
        println (s"Fit.mae_n (y, 1)     = ${Fit.mae_n (y, 1)}")
        println (s"Fit.mase (y, yfh, h) = ${Fit.mase (y, yfh, h)}")
    end for

end aRTest4
ScalaTion will not predict a value corresponding to y0 (although some packages offer the option of
backcasting). When p is greater than 1, it repeats the y0 value into negative times, hence the y(max (0, t-j))
factor.
@param t   the time point from which to make prediction
@param y_  the actual values to use in making predictions
Multi-horizon forecasting is more challenging than one-step ahead predictions. When the forecasting
horizon h > 1,
As h increases, fewer actual values are used to make the forecast. The forecastAt method makes forecasts
for horizon h. It uses the last p actual values to make the first forecast, and then uses the last p − 1 actual
values and the first forecast to make the next forecast. As expected, the quality of the forecast will degrade
as h gets larger.
@param yf  the forecasting matrix (time x horizons)
@param y_  the actual values to use in making forecasts
@param h   the forecasting horizon, number of steps ahead to produce forecasts
    for j <- 0 until p do sum += φ(j) * yf(max (0, t1-j), max (0, h-1-j))
    yf(t+h, h) = sum                                 // forecast down the diagonal
end for
yf(?, h)                                             // return h-step ahead forecast vector
end forecastAt
The forecasting matrix yf is required since the next horizon's forecasts depend on those from the previous
horizon (the forecastAll method in the Forecaster trait sorts all this out).
11.4.5 AR Class
Class Methods:
@param y       the response vector (time series data)
@param tt      the time vector, if relevant (time index may suffice)
@param hparam  the hyper-parameters
11.4.6 Exercises
1. Compute the ACF for the Lake Level time series dataset. For AR(1), set parameter φ0 to ρ1, the first
   lag auto-correlation. Compute ŷt by letting ŷ0 = y0 and for k <- 1 until y.dim

   zt = yt − µ

   ŷt = ρ1 zt−1 + µ

2. For an AR(2) model,

   zt = φ0 zt−1 + φ1 zt−2 + εt
   ρk = φ0 ρk−1 + φ1 ρk−2

   Setting k = 1 and then k = 2 produces two equations in two unknowns, φ0 and φ1. Solve
   for φ0 and φ1 in terms of the first and second lag auto-correlations, ρ1 and ρ2, by completing the LU
   Factorization of the Augmented Yule-Walker Matrix. Given these two formulas, plug in values for (a)
   ρ1 and ρ2 for the Lake Level Dataset, see Table 11.1, (b) ρ1 = .7 and ρ2 = .6.

   For both (a) and (b), compute ŷt by letting ŷ0 = y0, ŷ1 = y1 and for k <- 2 until y.dim

   zt = yt − µ

   ŷt = φ0 zt−1 + φ1 zt−2 + µ
3. Use the ScalaTion class AR to develop Auto-Regressive Models for p = 1, 2, 3, for the Lake Level time
series dataset. Plot ŷt and yt versus t for each model. Also, compare the first two models with those
developed in the previous exercises.
Also, plot ŷt and yt versus t. Look at the ACF and PACF to see if some other AR(p) might be better.
mod.plotFunc (ar.acF, "ACF")
mod.plotFunc (ar.pacF, "PACF")
Choose a value for p and create an AR(p) model. Explain your choice of p. Also, plot ŷt and yt versus
t.
5. The k-lag auto-covariance is useful when the stochastic process {yt | t ∈ [0, m − 1]} is covariance
   stationary.
γk = C [yt , yt−k ]
When yt is covariance stationary, the covariance is only determined by the lag between the variables and
not where they occur in the time series, e.g., C [y4 , y3 ] = C [y2 , y1 ] = γ1 . Thus, the covariance matrix
Γ is Toeplitz. Covariance stationarity also requires that E [yt ] = µ, i.e., the mean is time invariant.
Repeat the previous exercise, but generate a time series from a process that is not covariance stationary.
What can be done to transform such a process into a covariance stationary process? Hint: read ahead
to the section on ARIMA models or additional reading such as [168], Chapter 23.
6. For a covariance stationary process yt , derive the relationship between the drift parameter δ and the
mean µ. Further, given an AR(p) model,
Show that δ drops out of the equation when zt + µ is substituted for yt in the above equation.
7. Assume the time indexed, m dimensional vector y is a zero-centered, covariance stationary Gaussian
process N(0, Γ) having the following probability density function (pdf),
f(y; φ, σ²) = [(2π)^m det Γ]^{−1/2} e^{−(1/2) yᵀ Γ⁻¹ y}
Recall that the covariance matrix for a covariance stationary process is Toeplitz, meaning

Γ = [γ_{i−j}], for i, j = 0, . . . , m − 1
Write the likelihood and log-likelihood functions. Discuss how to create an algorithm for Maximum
Likelihood Estimation. Be sure to exploit the special structure of the covariance matrix.
8. Derive the formula for the noise variance σ²_ε = V[εt] = E[εt εt],

   σ²_ε = γ0 (1 − ρ · φ)

   γ0 = φ0 γ1 + φ1 γ2 + . . . + φp−1 γp + E[εt εt]
9. Create a process with φ0 and φ1 nonzero, such that ρ1 is approximately zero, while ρ2 is large. Compare
the Quality of Fit (QoF) of an AR(2) model versus an AR(1) model for this process. Complete the
code below before running.
val m = 100
val y = new VectorD (m)
val noise = Normal (0, 1)
val φ = VectorD (?, ?)
y(0) = noise.gen; y(1) = noise.gen
for t <- 2 until m do y(t) = φ(0) * y(t-1) + φ(1) * y(t-2) + noise.gen
   yt+1 = δ + φ0 yt + εt+1

   is stationary, unit-root, or explosive, if |φ0| is less than, equal to, or greater than 1, respectively. Letting
   y0 = 10, δ = 0, and noise = Normal (0, 1), examine the time evolution of the process for φ0 = .7,
   1, and 2. Explain what is happening. Which ones are covariance stationary?
11. For the Lake Level Dataset, the first ten values for the ACF ρk and PACF ψkk are given in Table 11.1.
Table 11.1: Correlogram (ACF ρk , PACF ψkk ) for the Lake Level Dataset
k ρk ψkk
0 1.000000 0.000000
1 0.840488 0.840488
2 0.622644 -0.285357
3 0.472722 0.147962
4 0.386269 0.032072
5 0.343057 0.069543
6 0.303435 -0.026385
7 0.285146 0.108148
8 0.287510 0.047882
9 0.283758 0.003677
10 0.203506 -0.237926
As AR extends Correlogram, the ACF and PACF functions may be plotted. Any points inside the two
lines (upper and lower bounds) are of lesser significance.
@main def aRTest3 (): Unit =
    ...
end aRTest3
Based on the values in Table 11.1 and the ACF and PACF plots, explain your choices of p for candidate
AR(p) models. Make a loop for p <- 1 to 10 do and check the QoF (R2 , MAE, and sMAPE) for
your choices versus the other models. Justify your choices.
11.5 Moving-Average (MA) Models
The Auto-Regressive (AR) models predict future values based on past values. Suppose the daily values
for a stock are 100, 110, 120, 110, 90. To forecast the next value, one could look at these values or focus on
each day's errors or shocks. Whether previous values of the time series (AR) or unexpected changes (MA)
(e.g., while predictions indicated that the Dow-Jones Industrial Average would continue its slow rise, news
of its large drop has investors nervous) are more important depends on the situation. In the simplest case,
an MA model is of the following form,

yt+1 = µ + θ0 εt + εt+1

ŷt+1 = µ + θ0 εt        (11.55)

ẑt+1 = θ0 εt        (11.56)

that is, the mean of the process plus some fraction of yesterday's shock. Let us assume that µ = 100 and
the parameter/coefficient θ0 had been estimated to be 0.8. Now, if the mean is 100 dollars and yesterday's
shock was that the stock went up an extra 10 dollars, then the new forecast would be 100 + 8 = 108 (day 2).
This can be seen more clearly in Table 11.2.
The question of AR(1) versus MA(1) may be viewed as asking which column, zt−1 or εt−1, leads to better
forecasts. It depends on the data, but considering the GDP example again, zt−1 indicates how far above or
below the mean the current GDP is. One could imagine shocks εt−1 to GDP, such as new tariffs, tax cuts or
bad weather, being influential shocks to the economy. In such cases, an MA model may provide better
forecasts than an AR model.
11.5.1 MA(q) Model
The model equation for an MA(q) model includes the past q values of εt.

Zero-Centered

γk = E[(θ0 εt−1 + . . . + θq−1 εt−q + εt)(θ0 εt−k−1 + . . . + θq−1 εt−k−q + εt−k)]        (11.59)

When k = 0
When k ≥ 1
For k ∈ [1, . . . , q], similar equations can be created, only with the parameter index shifted so the noise shocks
match up.

γ1 = E[(θ0 εt−1 + θ1 εt−2 + . . . + θq−1 εt−q + εt)(θ0 εt−2 + θ1 εt−3 + . . . + θq−1 εt−1−q + εt−1)]
γk = E[(θ0 εt−1 + θ1 εt−2 + . . . + θq−1 εt−q + εt)(θ0 εt−k−1 + θ1 εt−k−2 + . . . + θq−1 εt−k−q + εt−k)]

ρk = (σ²_ε / γ0) (θk−1 + θk θ0 + θk+1 θ1 + . . . + θq−1 θq−k−1)        (11.62)
Notice that when k > q, the ACF will be zero. This is because only the last q noise shocks are included in
the model, so any earlier noise shocks are forgotten. MA processes will tend to exhibit a more
rapid drop-off of the ACF compared to the slow decay for AR processes.

Unfortunately, the system of equations generated from these equations is nonlinear. Consequently,
training is more difficult and less efficient.
11.5.2 Training
Training for MA(1)
Training for Moving-Average models is easiest to understand for the case of a single parameter θ0 . In general,
training is about errors and so rearranging the MA equation gives the following:
et = zt − θ0 et−1 (11.63)
Given a value for parameter θ0 , this is a recursive equation that can be used to compute subsequent errors
from previous ones. Unfortunately, the equation cannot be used to compute the first error e0 . One approach
to deal with the indeterminacy of e0 is to assume (or condition on) it being 0.
e0 = 0 (11.64)
Note, this affects more than the first error, since the first affects the second and so on. Next, we may compute
a sum of squared errors, in this case called the Conditional Sum of Squared Errors (denoted in ScalaTion as
csse).
csse = ∑_{t=0}^{m−1} et² = ∑_{t=1}^{m−1} (zt − θ0 et−1)²        (11.65)
One way to find a near-optimal value for θ0 is to minimize the Conditional Sum of Squared Errors.
argmin_{θ0} ∑_{t=1}^{m−1} (zt − θ0 et−1)²        (11.66)
As the parameter θ0 ∈ (−1, 1), an optimal value minimizing csse may be found using Grid Search. A more
efficient approach is to use Newton's method for optimization.

θ0^i = θ0^{i−1} − (d csse / dθ0) / (d² csse / dθ0²)        (11.67)
471
where

d csse / dθ0 = −2 ∑_{t=1}^{m−1} et−1 (zt − θ0 et−1)

d² csse / dθ0² = 2 ∑_{t=1}^{m−1} et−1²
Substituting gives

θ0^i = θ0^{i−1} + ∑_{t=1}^{m−1} et−1 (zt − θ0 et−1) / ∑_{t=1}^{m−1} et−1²        (11.68)
Maximum Likelihood Estimation (MLE) is also used for MA(q) parameter estimation.
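A sketch of this iteration for MA(1) in plain Scala (conditioning on e0 = 0; a hypothetical helper, not
ScalaTion's optimizer-based training). Note that after the errors are recomputed, zt − θ0 et−1 equals et, so
the numerator simplifies to ∑ et−1 et.

// Newton's method for MA(1): recompute errors, then update theta0 via Equation 11.68.
def trainMA1 (z: Array[Double], iters: Int = 20): Double =
    var theta0 = 0.0
    val e = new Array[Double] (z.length)                        // e(0) = 0 (conditioning)
    for _ <- 1 to iters do
        for t <- 1 until z.length do e(t) = z(t) - theta0 * e(t-1)
        val num = (1 until z.length).map (t => e(t-1) * e(t)).sum
        val den = (1 until z.length).map (t => e(t-1) * e(t-1)).sum
        theta0 += num / den
    theta0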
11.5.3 Exercises
1. For an MA(1) model, solve for θ0 using the equation for the noise variance and the equation for the
correlation.
   γ0 = (1 + θ0²) σ²_ε

   ρ1 = (σ²_ε / γ0) θ0 = θ0 / (1 + θ0²)
2. Develop an MA(1) model for the Lake Level time series dataset using the solution you derived in the
   previous question. Plot yt and ŷt vs. t. Estimate the mean Lake Level and then plot zt and εt.
3. Use the MA(1) solution you derived above for θ0 and the fact that φ0 = ρ1 for AR(1) to produce two
Forecast Charts for the first 10 values in the Lake-Level Dataset, one for MA(1) and one for AR(1).
Compute R2 and sMAPE from information in the charts. Which model has a better Quality of Fit
(QoF)?
4. Use ScalaTion to assess the quality of an MA(1) model versus an MA(2) model for the Lake Level
time series dataset.
5. The position in the ACF plot where ρk drops off can be used for q, the order of the model. What does
this plot suggest for the Lake Level time series dataset?
11.6 Auto-Regressive, Moving-Average (ARMA) Models
The ARMA class provides basic time series analysis capabilities for Auto-Regressive (AR) and Moving-Average
(MA) models. In an ARMA(p, q) model, p and q refer to the order of the Auto-Regressive and Moving-
Average components of the model. ARMA models are often used for forecasting.
Recall that a pth -order Auto-Regressive AR(p) model predicts the next value yt+1 from a weighted
combination of prior values and that a q th -order Moving-Average MA(q) model predicts the next value yt+1
from the combined effects of prior noise/disturbances.
A combined pth -order Auto-Regressive, q th -order Moving-Average ARMA(p, q) model predicts the next
value yt+1 from both a weighted combination of prior values and the combined effects of prior noise/distur-
bances.
ψkk = [Rk⁻¹ ρ1:k]k        (11.72)
Figure 11.3 shows the ACF ρk and PACF ψkk values versus the number of lags k. The PACF indicates
the correlation between zt and zt−k beyond the chaining effect. For example, zt and zt−2 would naturally
be correlated at the level ρ21 , since the correlation between zt and zt−1 is ρ1 , and zt−1 and zt−2 is ρ1 . Thus,
partial correlation is a better indication of whether a term φk zt−k contributes to the model.
From Figure 11.3, three partial correlations are worthy of consideration: ψ11 = 0.831911, ψ22 = −0.285357,
and maybe ψ10,10 = −0.237926. From this perspective, candidate models are AR(1), AR(2) and AR(10). Of
course, AR(10) will include several terms that contribute very little, so if a software package supports feature
selection, they can be removed so that only three terms will be included in the model. Other options include
maximizing R̄² or minimizing the AIC. See the exercises.
Figure 11.3: Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF)
11.6.2 Training

Training for ARMA models involves simultaneously finding the p φ parameters for the AR part and the q θ
parameters for the MA part. We have seen that there are multiple optimization algorithms for finding the
φ parameters for AR models.

Technique   Description
MoM         Durbin-Levinson Solution to Yule-Walker Equations
OLS         Ordinary Least Squares
WLS         Weighted Least Squares
GLS         Generalized Least Squares
MLE         Maximum Likelihood Estimation

Similarly, there are multiple, more complex, optimization algorithms for finding the θ parameters for
MA models (Table 11.4).

The train method in the ARMA class shown below uses the conditional sum of squared errors (csse) as the
loss function. The parameter vector b collects all the parameters φ, θ, and δ and passes them into a BFGS
optimizer. The optimal values are determined and unpacked back into φ, θ, and δ.
Table 11.4: MA Optimization Algorithms
Technique Description
MoI Method of Innovations
cSSE Conditional Sum of Squares
MLE Maximum Likelihood Estimation
val optimizer = new BFGS (csse)                      // apply Quasi-Newton BFGS optimizer
val (fb, bb) = optimizer.solve (b)                   // optimal solution and parameters
The predict method gives a one-step ahead forecast ŷt+1 using the past p time series values and the
most recent q errors/shocks. Note that errors before the start of the time series do not exist and cannot be
predicted, hence the if in the second loop. Also, errors must be computed sequentially as predictions are
made.
override def predict (t: Int, y_: VectorD): Double =
    var sum = δ
    for j <- 0 until p do sum += φ(j) * y_(max (0, t-j))
    for j <- 0 until q if t-j >= 0 do sum += θ(j) * e(t-j)
    if t < y_.dim-1 then e(t+1) = y_(t+1) - sum
    sum
end predict
Class Methods:
@param y       the response vector (time series data)
@param tt      the time vector, if relevant (time index may suffice)
@param hparam  the hyper-parameters
11.6.4 Exercises
1. Plot the PACF for the Lake Level time series dataset and use it to pick a value for p based on rule 1:
“stop at the drop” and rule 2: “pull all peaks”. Run AR for p ranging from 1 to 10. Assess the Quality
of Fit (QoF) for each AR(p) model over this range. What would R̄², AIC, and MAPE/sMAPE pick,
respectively?
2. Rule 1, and especially rule 2, should be applied within the context of significance, based on the following
   critical values,

   ρ*k = z*α/2 / √(m − k)        (11.73)

   where m is the length of the time series, k is the lag, and α is the confidence level (e.g., for α = .95,
   z*α/2 = 1.96). Determine all peaks (both positive and negative) that are outside the interval [−ρ*k, ρ*k].
3. Plot the ACF for the Lake Level time series dataset and use it to pick a value for q. Assess the quality
of an MA(q) with this value for q. Try it for q being one lower and one higher.
4. Using the selected values for p and q from the two previous exercises, assess the quality of an ARMA(p,
q). Try the four possibilities around the point (p, q).
5. For an ARMA (p, q) model, explain why the Partial Auto-correlation Function (PACF) is useful in
choosing a value for the p AR hyper-parameter.
6. For an ARMA(p, q) model, explain why the Auto-correlation Function (ACF) is useful in choosing
   a value for the q MA hyper-parameter.
11.7 Rolling-Validation
Although cross-validation is very useful for prediction and classification problems, there is no direct analog
for time series data. Instances cannot be randomly selected for training and testing, because the instances
are now time ordered and dependency is high between instances that are close in time (e.g., as measured by
auto-correlation).
In time series, a rolling-validation can be used as a replacement for cross-validation. Rolling-validation
works by defining two adjacent, non-overlapping time windows: the first containing training data and the
second containing testing data. These two windows are rolled forward in time together.
Based on these time windows, the response vector y is chopped into two sub-vectors:
An algorithm for rolling validation may start by training the model on y(r) . Then iterate through y(e)
making a forecast for each element in this vector, producing a forecast vector ŷ. The forecast error is then
ε = y(e) − ŷ        (11.78)
This basic algorithm will only work for very short testing windows and thus the quality assessment may be
unreliable.
Retraining
If the testing window is long, training may become out of date, so one would expect the forecasts to degrade
as the algorithm iterates through the testing data. This argues for periodic retraining within the testing
loop. Depending on the runtime efficiency of the training algorithm, the retraining may be more or less
frequent. ScalaTion defines a retraining cycle rc that causes retraining to be performed after every rc
forecasts. See the RollingValidation object in the scalation.modeling.forecasting package for details.
Multi-Horizon Forecasting
Although one-step ahead forecasting is the most accurate, today's forecasting models are capable of producing
useful forecasts for several steps ahead. As an example, Numerical Weather Prediction (NWP) produces
14-day forecasts. With a time-step of 6 hours, this implies 56-step ahead forecasts. In this case, the
forecasting horizon h = 56. Note, among other techniques, NWP uses Ensemble Kalman Filters; see the
chapter on State-Space Models.
There are three main techniques available for multi-horizon forecasting [187]:
1. The Direct Method applies a trained forecasting function f to recent data to obtain an h-step ahead
forecast.
Clearly, the training must be based upon minimizing the h-step ahead forecast error (or maximizing the
likelihood). For a single output model like ARMA, a separate model could be built for each forecasting
horizon 1 through h. For a multiple output model like a Neural Network, each output node would correspond
to a particular horizon 1 through h.
2. The Recursive Method builds up forecasts one step at a time. For efficiency, it is implemented in an
   iterative rather than recursive manner. For example, for h = 3 and p ≥ h, forecasts would be produced
   as follows:

   Notice that the farther into the future the forecast is made, the more it depends on prior forecasts (not
   actual data). As errors can compound, longer term forecasts may degrade. A sketch of this method is
   given after this list.

   Predicting future shocks/errors is precarious, so order q needs to be at least the maximum forecasting
   horizon h for MA(q) models. When h > q, all the errors will need to be forecasted, and the best
   estimate for future noise is zero, so MA(q) forecasts will quickly degenerate to the mean of the time
   series (like the NullModel).
3. Hybrid Methods try to combine the best of the direct and recursive methods. One way to do this is to
   have a model for each forecasting horizon as with the direct method, but utilize forecasts from previous
   models (1 through h − 1) in the model for horizon h. See [187, 186] for more details.
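A minimal sketch of the recursive method for a zero-centered AR(p) model (plain Scala, hypothetical helper,
not the Forecaster implementation): each horizon's forecast is appended to a working history, so later
horizons consume earlier forecasts.

// Recursive h-step forecasting: out(k) is the forecast for horizon k+1.
def forecastRecursive (phi: Array[Double], z: Array[Double], h: Int): Array[Double] =
    val hist = scala.collection.mutable.ArrayBuffer.from (z)    // actual values, then forecasts
    val out  = new Array[Double] (h)
    for k <- 0 until h do
        val t = hist.length
        out(k) = (0 until phi.length).map (j => phi(j) * hist(t-1-j)).sum
        hist += out(k)                                          // later horizons use this forecast
    out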
for training in order to predict the first value in the testing set. For the next value in testing, the actual first
value in the testing set can be used for its prediction. This process continues until the end of the testing
set. To keep the parameters from becoming stale, retraining occurs for every rc predictions (rc = 4 in the
example below). All of the previous values are used for retraining. In this way both (training and testing)
roll forward.
val y = VectorD.range (1, 25)
val (rc, h) = (4, 2)                                 // retraining cycle, horizon
val mod = new RandomWalk (y)                         // create an RW model
mod.train (null, y)                                  // train the model on full dataset
val (yp, qof) = mod.test (null, y)                   // test the model on full dataset
println (mod.report (qof))                           // report on Quality of Fit (QoF)
val yf = mod.forecastAll (y, h)                      // produce all forecasts up to horizon h
println (s"yf = $yf")                                // print forecast matrix
FitM.showQofStatTable (RollingValidation.rollValidate (mod, rc, h))
yt        ŷt @ h=1    ŷt @ h=2    t
1.00000 0.00000 0.00000 0.00000
2.00000 1.00000 0.00000 1.00000
3.00000 2.00000 1.00000 2.00000
4.00000 3.00000 2.00000 3.00000
5.00000 4.00000 3.00000 4.00000
6.00000 5.00000 4.00000 5.00000
7.00000 6.00000 5.00000 6.00000
8.00000 7.00000 6.00000 7.00000
9.00000 8.00000 7.00000 8.00000
10.0000 9.00000 8.00000 9.00000
11.0000 10.0000 9.00000 10.00000
12.0000 11.0000 10.0000 11.00000
13.0000 12.0000 11.0000 12.00000
14.0000 13.0000 12.0000 13.00000
15.0000 14.0000 13.0000 14.00000
16.0000 15.0000 14.0000 15.00000
17.0000 16.0000 15.0000 16.00000
18.0000 17.0000 16.0000 17.00000
19.0000 18.0000 17.0000 18.00000
20.0000 19.0000 18.0000 19.00000
21.0000 20.0000 19.0000 20.00000
22.0000 21.0000 20.0000 21.00000
23.0000 22.0000 21.0000 22.00000
24.0000 23.0000 22.0000 23.00000
0.00000 24.0000 23.0000 24.00000
0.00000 0.00000 24.0000 25.00000
Lines in Table 11.5 indicate the points at which training (or retraining) occurs:
• Training at time point 12 (using 0-11) is used for forecasting values for 12 to 15.
• Training at time point 16 (using 0-15) is used for forecasting values for 16 to 19.
• Training at time point 20 (using 0-19) is used for forecasting values for 20 to 23.
Forecasts are produced for times 24 and 25, but as no actual values are available at these times, they cannot
be used for quality assessment. In general, the last h rows of the yf matrix must be excluded from quality
assessment. Due to the regular nature of the values in Table 11.5, the calculation of QoF measures is easily
carried out, as shown below.
sst = ∑_{t=12}^{23} (t + 1 − 18.5)² = 143

sse = ∑_{t=12}^{23} ((t + 1) − t)² = 12        R² = 1 − 12/143 = 0.916    (h = 1)

sse = ∑_{t=12}^{23} ((t + 1) − (t − 1))² = 48  R² = 1 − 48/143 = 0.664    (h = 2)
Notice: (1) Random Walk involves no real training and was picked to make the patterns obvious. (2)
The dataset is too small and regular to be anything but a toy example. (3) QoF measures get worse as
the forecasting horizon becomes larger, and for baseline models this may happen rapidly. (4) If the testing
set is small, sst will be substantially reduced, which will make R² look very poor (this is one reason other
QoF measures are preferred over R² for time series). For example, for this problem MAE = h (i.e., 1 or
2), independently of the size of the testing set. See the exercises for application of other forecasting models to
longer datasets.
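The hand calculation above can be checked numerically; the following is a minimal sketch assuming ScalaTion's VectorD with its mean and normSq methods (names as used elsewhere in this text).

val yTest = VectorD.range (13, 25)                    // actual values for t = 12 to 23
val yHat1 = VectorD.range (12, 24)                    // h = 1 forecasts (previous value)
val yHat2 = VectorD.range (11, 23)                    // h = 2 forecasts (value two back)
val sst   = (yTest - yTest.mean).normSq               // total sum of squares
println (s"R^2 @h=1 = ${1 - (yTest - yHat1).normSq / sst}")
println (s"R^2 @h=2 = ${1 - (yTest - yHat2).normSq / sst}")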
debug ("rollValidate", s"train: tr_size = $tr_size; test: te_size = $te_size, rc = $rc")
...
for k <- 1 to h do
    val yfh = yf (tr_size until y.dim, k)
    new Plot (t, yy, yfh, s"Plot yy, yfh vs. t (h = $k)", lines = true)
    banner (s"rollValidate: for horizon h = $k:")
    println (FitM.fitMap (mod.diagnose (yy, yfh), QoF.values.map (_.toString)))
end for
end rollValidate
The first algorithm is subsumed by the second. The main differences occur in the for loop that iterates
through the testing set.
for i <- 0 until te_size do                               // iterate over testing set
    val t = tr_size + i                                   // next time point to forecast
    if i % rc == 0 then mod.train (null, y (0 until t))   // retrain on sliding set
    yp(i) = mod.predict (t-1, y)                          // predict the next value
    val yd = mod.forecast (t-1, yf, y, h)                 // forecast the next h values
                                                          // yf updated on diagonals
    assert (yp(i) =~ yd(0))                               // forecasts =? predictions
end for
The time t is the training set size tr_size plus the index i into the testing set. If the index i modulo rc
is zero, retraining occurs. The predict method returns a one-step ahead prediction for time t using past
known values.
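For an AR(p) model, a minimal sketch of such a predict method is given below; it is not necessarily ScalaTion's actual source, and δ (the constant term), φ (the parameter vector) and p are assumed to be fields set during training.

def predict (t: Int, y_ : VectorD): Double =
    var sum = δ                                           // start with the constant term
    for j <- 0 until p do sum += φ(j) * y_(t-j)           // add the p weighted lagged values
    sum
end predict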
The forecast method returns a vector containing forecasts for horizons 1 to h. It uses the recursive method
and requires the forecasting matrix yf, which it updates as it goes, making modifications down diagonals. The
updated values are captured in a vector called yd and returned.
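For an AR(p) model, a hedged sketch of such a forecast method is given below; again this is not ScalaTion's actual source. δ, φ and p are assumed fields, and column k of yf is assumed to hold the horizon-k forecasts, as in Table 11.5.

def forecast (t: Int, yf: MatrixD, y_ : VectorD, h: Int): VectorD =
    val yd = new VectorD (h)                              // forecasts for horizons 1 to h
    for k <- 1 to h do                                    // build up recursively by horizon
        var sum = δ
        for j <- 0 until p do
            val tj = t + k - 1 - j                        // time index of the j-th lag
            sum += φ(j) * (if tj > t then yd(tj-t-1)      // use a prior forecast
                           else y_(tj))                   // use an actual value
        end for
        yf(t+k, k) = sum                                  // record down the diagonal of yf
        yd(k-1)    = sum
    end for
    yd
end forecast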
Note the switch: predict uses y, while forecast uses yf. The first column of the yf matrix contains actual
values, while the rest contain forecasts. Obviously, actual values are preferred over forecasted values, but for
multi-horizon forecasting, prior forecasts are used when actual values are not available (or would constitute
cheating). Values are extracted from the yf matrix by tracking back up the diagonal until the first column
is reached and then moving up that column. Until the first column is reached, only forecasted values are available.
Furthermore, for AR(p), if h > p, some forecasts will be based solely upon prior forecasts and not directly
on any actual values. The assert statement makes sure the prediction agrees with the one-step forecast.
Example: Rolling Validation for an AR(3) Model
Consider how rolling validation would work for an AR(3) model for the Lake Levels dataset that consists of
98 years' worth of time series data. Using the default rule that initially the first half (49) is used for training
and the second half (49) for testing, Table 11.6 shows the forecasting matrix around this split (made the
same size as the previous table for ease of understanding).
Table 11.6: Forecast Matrix around the Training/Testing Split for AR(3) on the Lake Levels Dataset

yt         ŷt @h=1    ŷt @h=2    t
578.670 578.284 578.786 37.0000
579.550 578.945 578.512 38.0000
578.920 579.645 578.966 39.0000
578.090 578.617 579.431 40.0000
579.370 578.096 578.688 41.0000
580.130 579.809 578.379 42.0000
580.140 579.970 579.610 43.0000
579.510 579.832 579.641 44.0000
579.240 579.233 579.594 45.0000
578.660 579.212 579.204 46.0000
578.860 578.588 579.207 47.0000
578.050 579.030 578.725 48.0000
577.790 577.946 579.047 49.0000
576.750 578.045 578.220 50.0000
576.750 576.873 578.326 51.0000
577.820 577.298 577.436 52.0000
578.640 578.345 577.759 53.0000
580.580 578.789 578.458 54.0000
579.480 580.760 578.750 55.0000
577.380 578.783 580.220 56.0000
576.900 577.202 578.777 57.0000
576.940 577.436 577.775 58.0000
576.240 577.383 577.940 59.0000
576.840 576.509 577.793 60.0000
576.850 577.500 577.128 61.0000
576.900 577.140 577.870 62.0000
Again, rolling validation is used to roll forward through the testing set, with fresh data and periodic
retraining, making out-of-sample multi-horizon forecasts. The first 2-step ahead, out-of-sample forecast is
for time-unit 49. It uses three previous values to make a 2-step ahead forecast for 49 (shown in bold). The
most reliable values are actual values (shown in the first column). Actual values are available for times 47
and 46. Using the actual value for 48 is not appropriate, since this is tantamount to soothsaying (knowing the
future). Although the data is yearly, suppose it is daily and today is Wednesday and you want to forecast
two days ahead (Friday). It is fine to use actual values from Tuesday and Wednesday, but not Thursday
(since it is tomorrow). When actual values are not appropriate, use forecasts from the least horizon that
is appropriate (in this case the one-day-ahead forecast for Thursday). Therefore, multi-horizon forecasting
involves moving up the diagonal until column 0 is reached and then moving up column 0 (as indicated by
the pattern of values in bold).
11.7.3 Exercises
1. Compute the following additional QoF measures for rolling validation on the dataset given in Table
11.5: RMSE, sMAPE and MASE.
2. Rolling-Validation results were given for the Lake Levels Dataset for the three baseline models: Random
Walk, Null and Trend models. See the Baseline Models section. Compare these to the modeling
techniques of intermediate complexity covered so far in this text: Simple Exponential Smoothing
(SES), Auto-Regressive (AR) and Auto-Regressive, Moving-Average (ARMA).
3. Find the optimal values for p and q in ARMA(p, q) models for the Lake Levels Dataset for each
forecasting horizon h = 1, 2, 3. Compare in-sample and out-of-sample QoF measures. How do these
results relate to suggestions implied by the Correlogram?
4. An alternative to the rolling validation algorithm would be to keep the size of the training set the same
throughout, by discarding values at the beginning of the time series. Although in general this
means less data for training, it could reduce staleness (where values from the distant past have undue
influence on parameter estimation). Try making this change to the code and assess its impact.
5. To further reduce staleness, one could use a restricted training set that is a user-specified distance back
from the testing window. Try making this change to the code and assess its impact.
6. Compare recursive and direct multi-horizon forecasting under rolling-validation for Auto-Regressive,
Moving-Average (ARMA) models.
7. Compare two recursive-direct hybrid multi-horizon forecasting methods found in the literature. Find
papers that compare the relative quality of recursive, direct and hybrid methods.
8. The complete forecasting matrix yf created, for example, by calling forecastAll is not needed for
rolling validation. Only cap values from the training set combined with the testing set are needed. Make
this change to the forecast method to improve its efficiency.
9. Design a k-fold rolling-validation algorithm by dividing the testing set into multiple folds and by using
j ∈ {0, . . . , k − 1} to keep track of the current fold. Now each fold has two adjacent, non-overlapping
time windows. The first time window is for training and has length l, while the second time window is
for testing and has length λ = ⌊(m − l)/k⌋.

W_j^{(r)} = {t_{λj}, . . . , t_{l+λj−1}}          training    (11.84)
W_j^{(e)} = {t_{l+λj}, . . . , t_{l+λ(j+1)−1}}    testing     (11.85)
For example, let m = 200, k = 10, l = 100, then λ = 10 and the ten pairs of time windows are shown
below.
W_:^{(r)} = {t_0, . . . , t_99}, {t_10, . . . , t_109}, . . . , {t_90, . . . , t_189}       (11.86)
W_:^{(e)} = {t_100, . . . , t_109}, {t_110, . . . , t_119}, . . . , {t_190, . . . , t_199}   (11.87)
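As a quick check of the window arithmetic, the following minimal sketch (plain Scala, no library types needed) prints the fold boundaries for the example above.

val (m, k, l) = (200, 10, 100)
val lam = (m - l) / k                                 // testing window length λ = 10
for j <- 0 until k do
    println (s"fold $j: train [${lam*j}, ${l+lam*j-1}], test [${l+lam*j}, ${l+lam*(j+1)-1}]")
end for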
11.8 ARIMA (Integrated) Models
In cases where a time series is not stationary, an ARIMA(p, d, q) model may be used. The new hyper-parameter
d is the number of times the time series is differenced. When there is a significant trend in the data (e.g.,
the values or levels are increasing/decreasing over time), an ARIMA(p, 1, q) model may be effective. This model
can take a first difference of the values in the time series, i.e.,

y′_t = y_t − y_{t−1}
This new 'differenced' time series can then be put into an ARMA(p, q) model to predict the next difference
y′_{t+1} from both a weighted combination of prior values and the combined effects of prior noise/disturbances.

y′_{t+1} = δ + φ · ←y′_{t−p′:t} + θ · ←ε_{t−q′:t} + ε_{t+1}                               (11.90)
y′_{t+1} = δ + φ_0 y′_t + · · · + φ_{p′} y′_{t−p′} + θ_0 ε_t + · · · + θ_{q′} ε_{t−q′} + ε_{t+1}   (11.91)
Higher-order differencing is possible, but is not commonly used. Think of the original/level time series as values,
the first difference as rates of change, and the second difference as changes in rates (e.g., position, velocity,
and acceleration).
Training for an ARIMA model can apply one of the optimization algorithms used for ARMA models.
Note, the "I" in ARIMA could be interpreted to mean, once you have taken a difference (like a derivative), it
should be "Integrated" back to get the final forecast (see the subsection below).
11.8.1 Differencing
The del method (also ∆ in the code) in ARIMA.scala in the forecasting package takes a time series y
as a vector and returns the first difference. Note that the 'differenced' series has one less element than the
original series.

def del (y: VectorD): VectorD = VectorD (for t <- 0 until y.dim-1 yield y(t+1) - y(t))
The first element in the original time series, y_0, must be maintained to enable the original time series to be
exactly restored from the 'differenced' series. Restoration of the original time series is achieved using the
undel (inverse difference) method.

def undel (v: VectorD, y0: Double): VectorD =
    val y = new VectorD (v.dim + 1)
    y(0) = y0
    for t <- 1 until y.dim do y(t) = v(t-1) + y(t-1)
    y
end undel
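As a usage sketch (assuming the two methods above are in scope), differencing followed by inverse differencing restores the original series exactly.

val y1 = VectorD (1, 3, 6, 10, 15)                    // original time series
val v  = del (y1)                                     // first difference: (2, 3, 4, 5)
val y2 = undel (v, y1(0))                             // restore using the saved first element
println (s"y1 = $y1, y2 = $y2")                       // y1 and y2 hold the same values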
11.8.2 Forecasting
Forecasts for the differenced time series may be produced using an ARMA model.

ŷ′_{t+1} = δ + φ · ←y′_{t−p′:t} + θ · ←ε_{t−q′:t}    (11.93)

A forecast for the original time series is then obtained by adding the last value back in:

ŷ_{t+1} = ŷ′_{t+1} + y_t    (11.94)
11.8.3 Backshift Operator
The backshift (or lag) operator B is defined by

B^k y_t = y_{t−k}    (11.95)

that is, it moves back k positions in the time series. Consequently, an AR(p) model may be written as

y_t = δ + φ_0 B^1 y_t + · · · + φ_{p−1} B^p y_t + ε_t

Note the change from forecasting y_{t+1} to y_t, as it is more convenient with the backshift operator. In vector
notation φ = [φ_0, . . . , φ_{p−1}] and, using a dot product, this becomes
y_t = δ + φ · [B^1, . . . , B^p] y_t + ε_t    (11.97)

Similarly, for an MA(q) model,

y_t = δ + θ · [B^1, . . . , B^q] ε_t + ε_t    (11.98)
Notice that (1 − B^1) y_t = y_t − y_{t−1}, i.e., the first difference, and in general, (1 − B^1)^d gives the dth difference.
The backshift operator can be used to reformulate an ARIMA(p, 0, q) model as follows:

(1 − φ · [B^1, . . . , B^p]) y_t = δ + (1 + θ · [B^1, . . . , B^q]) ε_t

This is just the ARMA model with the AR part collected on the left and the MA part collected on the right. In
particular, y_t is not repeated. So why use the more obscure formulation? Because it makes it easy to apply the
differencing in an ARIMA(p, d, q) model:

(1 − φ · [B^1, . . . , B^p]) (1 − B)^d y_t = δ + (1 + θ · [B^1, . . . , B^q]) ε_t
11.8.4 Stationary Processes
A general assumption for ARMA models is that the time series y_t is covariance (or weakly) stationary, meaning
that the first two moments are time invariant, i.e.,

E [y_t] = µ    (the mean is constant over time)
C [y_t, y_{t−k}] = γ_k    (the autocovariance depends only on the lag k)

If the time series violates these conditions strongly enough, several approaches may be tried, including
transformations, detrending and differencing.
Unit-Root Process
It is not uncommon to have a stochastic process where the variance grows over time; the Random Walk
process is an example. Interestingly, for an AR(1) model, a slight change to the parameter/coefficient will
change a process from a stationary process, to a unit-root process, and then to an explosive process (see the
exercises). For an AR(1) process, this demarcation is obvious: the process is stationary, unit-root or
explosive if |φ_0| is less than, equal to, or greater than 1, respectively. As an analogy, think of exponential
decay vs. growth with e^{0.9}, e^1, and e^{1.1}. For AR(p) processes, the case is not so obvious; however, by finding
the roots of the characteristic equation for a given model equation, the answer can be obtained.
Characteristic Equation
Starting with the boxed equation for ARIMA models with δ = 0, formulated using the backshift
operator, an AR(1) model becomes

(1 − φ_0 B^1) y_t = ε_t    (11.107)

Similarly, for an AR(2) model,

(1 − φ_0 B^1 − φ_1 B^2) y_t = ε_t    (11.108)

In general, these can be rewritten in terms of characteristic polynomials,

φ(B) y_t = ε_t    (11.109)

where for an AR(p) model the characteristic polynomial is

φ(B) = 1 − φ_0 B^1 − φ_1 B^2 − · · · − φ_{p−1} B^p
Table 11.7: Characteristic Polynomials for ARIMA (p, d, q) Models
An AR(p) process is covariance stationary if all the roots of the following characteristic equation,

φ(B) = 0    (11.111)
φ(ζ) = 0    (11.112)

are outside the unit circle. As the roots of a polynomial function may be complex numbers, we replace B with
a complex variable ζ ∈ C.
For example, for an AR(1) process, the characteristic equation is

1 − φ_0 ζ = 0    (11.113)

The root of this equation is clearly

ζ = 1 / φ_0    (11.114)
Based upon the magnitude of the complex number |ζ|, three cases exist:
• When |ζ| > 1, the coefficient |φ_0| < 1 and the process will be covariance stationary,
• When |ζ| = 1, the coefficient |φ_0| = 1 and it will be a unit-root process, and
• When |ζ| < 1, the coefficient |φ_0| > 1 and the process will be explosive.
Note, the unit circle denotes a circle with radius one in the complex plane.
For an AR(2) process, the characteristic equation is

1 − φ_0 ζ − φ_1 ζ² = 0    (11.115)

The roots of this equation are given by the quadratic formula,

ζ = (φ_0 ± √(φ_0² + 4 φ_1)) / (−2 φ_1)    (11.116)
Trend Stationary Process
A (linear) trend stationary process can be constructed by combining a deterministic linear trend µ(t) with
a covariance stationary process z_t,

y_t = µ(t) + z_t    (11.117)

where the mean varies with time t, i.e., µ(t) = b_0 + b_1 t. Simple Regression can be used to determine the
coefficients b_0 and b_1. A minimal detrending sketch is given below.
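This sketch estimates the trend by ordinary least squares directly with VectorD operations (dot, mean, normSq); using ScalaTion's SimpleRegression class would be the more typical route, so treat this as an illustration rather than the library's API.

val t  = VectorD.range (0, y.dim)                     // time index 0, 1, ..., m-1
val b1 = ((t - t.mean) dot (y - y.mean)) / (t - t.mean).normSq
val b0 = y.mean - b1 * t.mean                         // OLS slope and intercept
val z  = y - (t * b1 + b0)                            // detrended series z_t = y_t - mu(t)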
Trend stationary processes have an advantage over unit-root processes in that a shock’s influence will
diminish over time and the process will revert to the mean [217]. Detrending can be accomplished by
subtracting the deterministic trend µ(t) or by differencing. Note, for nonlinear trends (not considered here),
differencing may not remove the trend.
In order to classify a process as covariance stationary, trend stationary or non-stationary, the following tests
are available in ScalaTion.
The Augmented Dickey-Fuller (ADF) Test can be used to test whether a time series has a unit root
(alternatively, is covariance stationary).

H0: y_t has a unit root
H1: y_t is covariance stationary

The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test can be used to test whether a time series is trend
stationary (alternatively, has a unit root).

H0: y_t is trend stationary
H1: y_t has a unit root
These two popular tests can be applied in combination to classify a process as one of the following:
1. Covariance Stationary: ADF gives H1 and KPSS gives H0.
2. Trend Stationary: ADF gives H0 and KPSS gives H0 (may warrant further investigation).
3. Unit Root (non-stationary): ADF gives H0 and KPSS gives H1.
4. Conflicting: ADF gives H1 and KPSS gives H1 (may warrant further investigation).
Unless the process is classified as Covariance Stationary, one may difference the time series and apply the
tests again. For more information on testing processes for stationarity, see [217].
As many real-world time series are not initially covariance stationary and differencing can lose some
pattern information, there is ongoing research on analyzing non-stationary or locally stationary processes
directly [38].
Class Methods:

@param y       the original input vector (time series data)
@param tt      the time vector, if relevant (time index may suffice)
@param hparam  the hyper-parameters

...

protected def init (v: VectorD): Unit =
def setPQ (pq: VectorI): Unit =
def showParameterEstimates (): Unit =
override def train (x_null: MatrixD, y_ : VectorD): Unit =
protected def nll (b: VectorD): Double =
protected def updateFittedValues (): Double =
override def test (x_null: MatrixD, y_ : VectorD): (VectorD, VectorD) =
override def predictAll (y_ : VectorD): VectorD =
override def forecast (t: Int = y.dim, h: Int = 1): VectorD =
override def forecastAll (y_ : VectorD, h: Int): MatrixD =
def residuals: VectorD = if differenced then y - predictAll (y) else e
11.8.6 Exercises
1. Use the del method to create a version that takes the number of differences d as a second parameter.
Do this (a) iteratively, (b) recursively.
2. Take a first difference of the Lake Level time series dataset and plot the values versus time.
3. Look at the ACF and PACF plots for the Lake Level ‘differenced series’ to determine values for p and
q. Create an ARIMA (p, 1, q) model. Plot its in-sample forecast ŷ versus the actual value y and
determine its in-sample Quality of Fit (QoF) measures.
4. Use Grid Search on 0 to 19 for p and q to find the hyper-parameter values that (a) minimize AIC, (b)
minimize sMAPE, (c) minimize MAE, (d) minimize RMSE, (e) maximize R2 , and (f) maximize R̄2 .
How do these compare?
5. Take the best model according to the consensus in the last question and determine its out-of-sample
Quality of Fit (QoF) measures for forecasting horizon h = 1. Plot its out-of-sample forecast ŷ versus
the actual value y. Try this for a training to testing ratio of (a) 60-40 and (b) 70-30.
6. Determine the out-of-sample Quality of Fit (QoF) measures for forecasting horizons h = 2 to 14 (two
weeks ahead). How rapidly do the QoF measures decline?
7. Repeat the questions on the Lake-Level dataset for the COVID-19 dataset.
8. Compare an ARIMA (0, 1, 0) model and a Random Walk with Drift model. Remove the constant/drift
from this ARIMA model and determine what other model type it matches.
9. Compare an ARIMA (0, 1, 1) model and Simple Exponential Smoothing with Growth model. Remove
the constant/drift from this ARIMA model and determine what other model type it matches.
10. Generate and plot the following four types of stochastic processes: covariance stationary process, unit
root process, explosive process, and (linear) trend stationary process.

y_{t+1} = 0.99 y_t + ε_t
y_{t+1} = 1.00 y_t + ε_t
y_{t+1} = 1.01 y_t + ε_t
y_{t+1} = 1 + 0.1 (t + 1) + ε_t
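A minimal generation sketch for these four processes is given below; it assumes ScalaTion's Normal variate generator (scalation.random.Normal with a gen method) for the noise terms, so package paths and names should be checked against the ScalaTion version in use.

import scalation.mathstat.VectorD
import scalation.random.Normal

val m   = 200                                         // number of time points
val eps = Normal ()                                   // noise generator: N(0, 1)
val (y1, y2, y3, y4) = (new VectorD (m), new VectorD (m), new VectorD (m), new VectorD (m))
for t <- 0 until m-1 do
    y1(t+1) = 0.99 * y1(t) + eps.gen                  // covariance stationary
    y2(t+1) = 1.00 * y2(t) + eps.gen                  // unit root (random walk)
    y3(t+1) = 1.01 * y3(t) + eps.gen                  // explosive
    y4(t+1) = 1 + 0.1 * (t+1) + eps.gen               // (linear) trend stationary
end for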
11.9 SARIMA (Seasonal) Models
In this context, the word “seasonal” is used to mean any time period longer than one. For example, vehicle
traffic may exhibit periodic behavior every 24 hours. For Caltrans PeMS data, the time resolution is 5
minutes (smoothed to 15), so the period or seasonal length s = 288 time units (96 for the smoothed data).
In COVID-19 pandemic forecasting there is a strong weekly period in daily reported deaths, so the seasonal
period s = 7.
The seasonal differencing (1 − B^s)^D happens first, followed by regular differencing (1 − B)^d, and then essentially
an ARMA model is applied.
Considering the COVID-19 daily data, when s = 7 (seven days), D = 1 (one seasonal difference), and
d = 0 (no regular difference), the differenced series would be

z_t = (1 − B^s)^D y_t = y_t − y_{t−7}    (11.119)

This differenced time series would indicate how y_t changes each day from what it was last week on the same
day (e.g., Monday to Monday). Although one might think it would be more useful to know the change from
the day before, the strong seasonal pattern may override the basic principle that the most recent data is the
most useful. A case study of COVID-19 is given later in the section to illustrate this. A sketch of stride-s
(seasonal) differencing is given below.
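This is a hedged sketch analogous to the del method shown earlier, taking the seasonal period s as a parameter.

// stride-s (seasonal) difference: z_t = y_t - y_{t-s}
def delS (y: VectorD, s: Int): VectorD =
    VectorD (for t <- 0 until y.dim - s yield y(t+s) - y(t))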
More generally, both regular and seasonal differencing may be applied:

z_t = (1 − B)^d (1 − B^s)^D y_t    (11.122)

Plots in the case study compare three alternatives: Modeling y_t, Modeling (1 − B) y_t, and Modeling (1 − B²) y_t.
Class Methods:

@param y       the original input vector (time series data)
@param dd      the order of seasonal differencing
@param period  the seasonal period (at least 2)
@param tt      the time vector, if relevant (time index may suffice)
@param hparam  the hyper-parameters

...
override def predictAll (y_ : VectorD): VectorD =
override def forecast (t: Int = y.dim, h: Int = 1): VectorD =
11.9.6 Exercises
1. Write out a SARIMA(5, 1, 3) × (4, 1, 2)_7 model, with and without using the backshift operator.
2. In predicting a value for y_{t+1}, how far back in time would the above SARIMA model need to go?
11.10 Further Reading
1. Forecasting: Principles and Practice [84]
https://otexts.com/fpp3
Chapter 12
Univariate Time Series (UTS) models only consider how a time series is influenced by past values of the time
series itself. Some such models also allow for a trend as a simple function (e.g., linear, quadratic) of time t
to be included. Still, UTS modeling and forecasting might be considered myopic, ignoring other potentially
important variables or factors.
Multivariate Time Series (MTS) models consider the interaction of multiple related time series. MTS
models consider endogenous and exogenous variables. A variable is considered to be exogenous when it is not
correlated with the error term. This ideal implies that the values of exogenous predictor variables will not
influence the errors. See the exercises in the Auto-Regressive with eXogenous Variables (ARX) section for
further discussion of this issue and the notion of weak exogeneity. Values from exogenous variables are recorded
in the datasets and may be used by models that utilize exogenous variables, such as ARX, SARIMAX, VARX
and VARMAX models. However, generally exogenous values are not predicted nor forecasted by the models.
Consider the following simple model equation,

y_t = δ + φ_0 y_{t−1} + β_0 x_t + ε_t    (12.1)

While x_t, being exogenous, is uncorrelated with ε_t, y_t is influenced by the value of the error term and is,
hence, endogenous. In time series, it is the endogenous variables that are predicted or forecasted. The
discussion in this chapter will start with models that allow only one endogenous variable, the ARX and
SARIMAX models. Next up are Vector Auto-Regressive (VAR) models that only have endogenous variables
and face the challenge in multi-horizon forecasting of compounding errors across time and series, e.g., errors
in forecasts of hospitalizations feed into future forecasts of new deaths and their errors. A recent survey
of Multivariate Time Series [122] discusses the evolution of such modeling techniques.
Nonlinear Time Series models allow additional functional forms to be fit and therefore to forecast the time
series. Linear models are defined, as they were for regression, to be linear in the parameters. Hence the
inclusion of quadratic terms does not make the model nonlinear. Note that while linear regression allows
for efficient optimization using matrix factorization, having an MA component in the time series model
(e.g., θ_0 ε_{t−1}) will necessitate the use of nonlinear optimization. The discussion will start with Nonlinear
Auto-Regressive (NAR) models and extend into various types of Neural Network architectures.
12.1 Auto-Regressive with eXogenous variables (ARX) Models
Suppose there is another time series x_t that may be a leading indicator for y_t. An example in pandemic
modeling would be when x_t is hospitalizations and y_t is deaths. Shifting the x_t curve to the right will tend
to make it match up (somewhat) with the y_t curve. This time shift indicates how much hospitalizations lead
deaths. Note, this tendency may be diffuse, e.g., 8 to 10 days. In this case, three lags from x_t could help in
predicting y_t.
An Auto-Regressive with eXogenous variables (ARX) model allows for this type of connection, where
actual values from xt can be used to help make yt forecasts. The X indicates that one or more eXogenous
variables are included in the model. Other variables may include the number of patients in ICU, number of
positive cases, etc. Related modeling techniques include SARIMAX, VAR, VARX, VARMA and VARMAX
models; these are discussed in the next two sections.
An ARX(p) model adds a lagged exogenous term to the AR(p) model equation, e.g.,

y_t = δ + φ · ←y_{t−p′:t} + β_0 x_{t−1} + ε_t

This is the same as the AR(p) model equation from the last chapter with the β_0 x_{t−1} term added in. Recall
that p′ = p − 1.
As x_t may be a leading indicator, starting at x_{t−a} may be more effective than starting at x_{t−1}. Note, if
a = 0, x_{t−a} = x_t is called a contemporaneous variable, e.g., for two daily time series, one may be made public
before the other and, therefore, be put to use in forecasting.
where the parameter/coefficient vectors β_j ∈ R^n.
12.1.4 Determining the Exogenous Lag Interval [a, b]
While the Partial AutoCorrelation Function (PACF) can be used to establish a value for the number of
endogenous lags p (although hyper-parameter search may be more effective), cross-correlation (or measures
derived from it) may be used to select the interval [a, b].
y_t = β_0 + β_1 x_t + ε_t    (12.5)

Knowing previous values of x_t often helps with the predictions, so it is common to introduce a lagged
predictor variable x_{t−1},

y_t = β_0 + β_1 x_t + β_2 x_{t−1} + ε_t    (12.6)

One may also introduce a lagged response variable y_{t−1}.
The recent values of an exogenous variable may also be smoothed by averaging the last k values, i.e., the
backward moving average

x̄_t = (1/k) ∑_{i=0}^{k−1} x_{t−i}    (12.8)

Such forecasts tend to be conservative and do not require any parameters to be estimated for the exogenous
variables; the modeling equation then simply uses the average x̄_t in place of x_t.
When k = 1, the averaging will correspond to using a Random Walk Model for each exogenous variable,
while when k is large, it will approximate the Null (or Mean) Model.
Note, when k = 1, the ARXA_MV Model reduces to an ARX_MV Model where each x̄_t is replaced with x_t.
A sketch of the backward average is given below.
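This minimal sketch (a hypothetical helper, not necessarily in ScalaTion) computes the backward moving average of equation 12.8 for one exogenous series.

// backward moving average of the last k values of x ending at time t
def backAvg (x: VectorD, t: Int, k: Int): Double =
    var sum = 0.0
    for i <- 0 until k do sum += x(t-i)               // assumes t - k + 1 >= 0
    sum / k
end backAvg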
Class Methods:

@param x       the input/predictor matrix built out of lags of y
               (and optionally from exogenous variables ex)
@param yy      the output/response vector trimmed to match x.dim
@param lags    the maximum lag included (inclusive)
@param fname   the feature/variable names
@param hparam  the hyper-parameters (use Regression.hp for default)

class ARX (x: MatrixD, yy: VectorD, lags: Int, fname: Array [String] = null,
           hparam: HyperParameter = Regression.hp)
      extends Regression (x, yy, fname, hparam)
      with ForecasterX (lags):
As stated, yy is a trimmed version of the given endogenous time series y. This is because the first response
cannot be forecasted due to missing past values, as there is no value at time t = −1 or earlier. Therefore,
the size of the yy vector is reduced to y.dim - 1. Note, some forecasting packages perform back-casting
to make a forecast ŷ_0, while others will require all p past values to be available and therefore do not start
making forecasts until ŷ_p. Thus, care needs to be taken when comparing results from different forecasting
packages.
Rather than passing in a user-supplied predictor matrix built out of lags of the endogenous variable y_t,
the apply method in the companion object may be called.
@param y       the original un-expanded output/response vector
@param lags    the maximum lag included (inclusive)
@param hparam  the hyper-parameters (use Regression.hp for default)

def apply (y: VectorD, lags: Int, hparam: HyperParameter = Regression.hp): ARX =
When there are exogenous variables, x_t = [x_{t0}, x_{t1}, . . . , x_{t,n−1}], the following method should be called.

@param y       the original un-expanded output/response vector
@param lags    the maximum lag included (inclusive)
@param ex      the input matrix for exogenous variables (one per column)
@param hparam  the hyper-parameters (use Regression.hp for default)
@param elag1   the minimum exo lag included (inclusive)
@param elag2   the maximum exo lag included (inclusive)

def exo (y: VectorD, lags: Int, ex: MatrixD, hparam: HyperParameter = Regression.hp)
        (elag1: Int = max (1, lags / 5), elag2: Int = max (1, lags)): ARX =
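As a hedged usage sketch based on the signatures above (with y and ex assumed to be in scope), a model may be built with or without exogenous variables; the empty second argument list takes the default exogenous lag interval.

val mod1 = ARX (y, 10)                                // 10 endogenous lags, no exogenous vars
val mod2 = ARX.exo (y, 10, ex)()                      // exogenous lags with default interval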
Object Methods:

object ARX_MV:
    ...

When averaging is to be used, the corresponding class and object are ARXA and ARXA_MV.
12.1.10 Exercises
1. Compare ARX(p, 0, []) to AR(p) models for the Lake Level dataset. This is the case when there
are no exogenous variables. The two models will likely differ slightly, since AR uses the Method of
Moments for estimation, while ARX uses either OLS via the Regression class or regularized OLS via
the RidgeRegression class.
3. When errors are correlated, e.g., E [ε_{t−1} ε_t] ≠ 0, the significance of relationships may be over-estimated.
This is referred to as spurious regression [60]. Discuss the ramifications of this problem.
4. Weak Exogeneity. OLS results rely on the error at time t being uncorrelated with the current value
of the exogenous variable. See https://www.reed.edu/economics/parker/s12/312/notes/Notes9.pdf,
https://www.reed.edu/economics/parker/312/tschapters/S13_Ch_2.pdf.

E [ε_t | x_t] = 0    (12.13)

Estimate the conditional expectation of errors given hospitalizations for the COVID-19 dataset. Recall
that new deaths is the endogenous variable.
5. For the previous problem, use GLS for parameter estimation, standard errors and forecasting. Compare
with the results given by OLS.
12.2 SARIMAX Models
The natural scale-up from an ARX model is a Seasonal Auto-Regressive, Integrated, Moving-Average with
eXogenous variables (SARIMAX) model. It adds the ability to consider the effects of shocks and some
long-term periodic effects. In addition, it allows differencing of the time series.
The specification of a SARIMAX model subsumes the ARX specification,

SARIMAX (p, d, q) × (P, D, Q)_s [a, b]

where
• p is the number of Auto-Regressive (AR) terms/lagged endogenous values, e.g., temperature for five
(p = 5) previous days used to estimate tomorrow’s temperature;
• d is the number of stride-1 Differences (Integrations (I)) to take, e.g., focus on the daily change in
temperature rather than the temperature itself;
• q is the number of Moving-Average (MA) terms/lagged shocks, e.g., a shock (unexpected change that
induces forecast errors) due to the emergence of a new more virulent strain;
• P is the number of Seasonal (stride-s) Auto-Regressive (AR) terms/lagged endogenous values, e.g., for
traffic forecasts, values from the previous Mondays will likely work better than values from the previous
days;
• D is the number of Seasonal (stride-s) Differences to take, e.g., for daily time series data, taking a
difference based on one week may work better than one day;
• Q is the number of Seasonal (stride-s) Moving-Average (MA) terms/lagged shocks; e.g., the start of
college football season is a dramatic shock for Saturday traffic forecasts;
• s is the Seasonal period (e.g., week, month, or whatever time period best captures the pattern), e.g.,
COVID-19 daily data exhibit a strong weekly pattern, so setting s = 7 tends to improve the accuracy
of the model; and
• [a, b] is the range of eXogenous (X) lags to include (where, as a shorthand, [b] = [1, b]), e.g., as
hospitalization data tends to lead new deaths by several days, finding appropriate values for a and b can
improve model accuracy.
z_t = (1 − B)^d (1 − B^s)^D y_t    (12.15)

Again, there may be multiple exogenous variables, i.e., replace x_t with a vector x_t. For COVID-19 forecasting
of new deaths as the endogenous variable, potentially useful exogenous variables include icu_patients,
hosp_patients, new_tests, people_vaccinated; see the exercises.
12.2.2 SARIMAX Object
Class Methods:

@param y       the original endogenous input vector (time series data)
@param x       the exogenous time series data as an input matrix
@param dd      the order of seasonal differencing
@param period  the seasonal period (at least 2)
@param tt      the time vector, if relevant (time index may suffice)
@param hparam  the hyper-parameters
12.2.3 Exercises
1. Start with a SARIMA model for forecasting weekly new deaths and find the best exogenous variable
to add to improve the accuracy of the forecasts. Pick the best variable from icu_patients,
hosp_patients, new_tests, people_vaccinated.
12.3 Vector Auto-Regressive (VAR) Models
Multivariate Time Series Analysis extends univariate time series by analyzing multiple variables. It works
on multiple interrelated time series,
with one time series for each of the n variables [201]. By convention in this text, the time series are oriented
in a matrix so that the j th variable is held in the j th column and the tth time point is held in the tth row.
This way the matrix holding the data will correspond to how data is typically organized in a csv data file.
The forecasted value for the j th variable at time t, ytj , can depend on the previous (or lagged) values
of all the variables. This notion is captured in Vector Auto-Regressive Models [176, 218]. A Vector Auto-
Regressive Model of order p with n variables, VAR(p, n), will utilize the most recent p values for each variable
to produce a forecast.
12.3.1 VAR(p, 2)
When n = 1, VAR(p, n) becomes AR(p), so the simplest actual vector case is VAR(p, 2), also known as
a bivariate VAR(p) model. Such models can be useful in traffic forecasting, as flow and speed are related
variables for which time series data is maintained for each, i.e., y_{t0} is the traffic flow at a particular sensor
at time t and y_{t1} is the traffic speed at that sensor at time t.
For each lag, each of the lag variables is used to predict each response variable, so the complete set of
parameters forms a tensor consisting of p matrices. For example, a VAR(3, 2) model may be written as

y_t = δ + Φ^{(0)} y_{t−3} + Φ^{(1)} y_{t−2} + Φ^{(2)} y_{t−1} + ε_t

where δ ∈ R^n is a constant, Φ^{(0)} ∈ R^{n×n} is the parameter matrix (first row of the tensor) for first lags,
Φ^{(1)} ∈ R^{n×n} is the parameter matrix (second row of the tensor) for second lags, Φ^{(2)} ∈ R^{n×n} is the
parameter matrix (third row of the tensor) for third lags, and ε_t ∈ R^n is the residual/error/shock vector. Notice
that Φ^{(0)} y_{t−3} is a matrix-vector product yielding an n-dimensional vector. Also see the note about index
(superscript/subscript) ordering in the section on AR models.
The equations for each component of the vector y_t = [y_{t0}, y_{t1}] are

$$ y_t = \begin{bmatrix} y_{t0} \\ y_{t1} \end{bmatrix}
       = \begin{bmatrix} \delta_0 \\ \delta_1 \end{bmatrix}
       + \begin{bmatrix} \phi_{00}^{(0)} & \phi_{01}^{(0)} \\ \phi_{10}^{(0)} & \phi_{11}^{(0)} \end{bmatrix} y_{t-3}
       + \begin{bmatrix} \phi_{00}^{(1)} & \phi_{01}^{(1)} \\ \phi_{10}^{(1)} & \phi_{11}^{(1)} \end{bmatrix} y_{t-2}
       + \begin{bmatrix} \phi_{00}^{(2)} & \phi_{01}^{(2)} \\ \phi_{10}^{(2)} & \phi_{11}^{(2)} \end{bmatrix} y_{t-1}
       + \epsilon_t $$
Convention: known values are shown in black and unknown values are shown in blue. For example,
given values from the past (e.g., y_{t−1}, y_{t−2}), before the beginning of day t, make a forecast for the value of
y_t. Note, the forecast ŷ_t is known, but the actual value y_t is not yet known.
12.3.2 VAR(p, n)
A general Vector Auto-Regressive VAR(p, n) model with n variables, each with p lags, will have p parameter
matrices. The equation for the response vector y_t ∈ R^n may be written in matrix-vector form as follows:

y_t = δ + Φ^{(0)} y_{t−p} + Φ^{(1)} y_{t−p+1} + · · · + Φ^{(p−1)} y_{t−1} + ε_t

where constant vector δ ∈ R^n, parameter matrices Φ^{(l)} ∈ R^{n×n} (l = 0, . . . , p − 1), and error vector ε_t ∈ R^n.
12.3.3 Training
One way to train a VAR model is to treat each variable as a separate regression problem and use least squares
to estimate the parameters, e.g., the parameters for y_{tj} are held in the matrix slice Φ_{:j}. A more efficient
approach, taken by ScalaTion, is to use the RegressionMV class that fits multiple responses to multiple
predictor variables.
var x = ARX.makeExoCols (lags, y, 1, lags+1)          // add columns for each lagged variable
val yy = y (1 until y.dim)                            // trim y to match x
if intercept then x = VectorD.one (yy.dim) +^: x      // add first column of all ones
For example, when run on the GasFurnace dataset, which consists of 296 rows (time steps) and 2 columns
(gas flow rate and Carbon Dioxide concentration variables), the VAR(3, 2) model produces the following results
for In-Sample testing.
" # " #
(0) (0)
(0) φ00 φ01 −0.170147 0.243170
Φ = (0) (0) =
φ10 φ11 0.0152980 0.416977
" # " #
(1) (1)
(1) φ00 φ01 −0.0755910 −0.454592
Φ = (1) (1) =
φ10 φ11 −0.0660577 −1.60076
" # " #
(2) (2)
(2) φ00 φ01 1.14269 0.0743863
Φ = (2) (2) =
φ10 φ11 0.0589660 2.17496
12.3.6 Exercises
1. Plot the flow yt0 and speed yt1 given in the Traffic Sensor Dataset (traffic.csv). Estimate values for
the four parameters contained in the one parameter matrix Φ(0) of a VAR (1, 2) model (p = 1, n = 2).
Compute the Quality of Fit (QoF).
2. Use the Traffic Sensor Dataset (traffic.csv) to estimate values for the three parameter matrices Φ(0) ,
Φ(1) and Φ(2) of a VAR (3, 2) model (p = 3, n = 2). Compute the Quality of Fit (QoF) and compare
with the VAR(1, 2) model.
3. Consider what happens if two time series {y_{t0}} and {y_{t1}} are following similar trends. What difficulties
could this cause in a VAR model?
4. How could a Vector Error Correction Model (VECM) handle the above problem?
5. Compare an ARX model with a VAR model for the COVID-19 weekly dataset.
12.4 Nonlinear Time Series Models
12.4.1 Nonlinear Auto-Regressive (NAR)
As was the case for Nonlinear Regression, Nonlinear Time Series models have the potential for better fitting
models.
A pth order Nonlinear Auto-Regressive NAR(p) model is a generalization of an AR(p) model. The
forecasted value at time t is a nonlinear function f of previous values of y, e.g.,

y_t = f(y_{t−1}, . . . , y_{t−p}; φ) + ε_t

where φ is a vector of p parameters and p is also taken to be the number of lagged values to use. In general, for
nonlinear models the number of lags may or may not equal the number of parameters p.
A simple nonlinear model of this kind is an Auto-Regressive Neural Network (ARNN), a three-layer
network applied to the lagged values x_t = [y_{t−1}, . . . , y_{t−p}],

ŷ_t = w · f(Φ x_t)

The p × p matrix Φ holds the parameters/weights connecting the input and hidden layers, while the p-dimensional
vector w holds the parameters/weights connecting the hidden and output layers. There is only
an activation function (vectorized f) for the hidden layer. Note that Φ x_t is matrix-vector multiplication. Of
course, the basic ARNN model can be embellished.
12.5 Recurrent Neural Networks (RNN)
Regular neural networks are often referred to as feed-forward neural networks, as there is no feedback in the
calculations. Recurrent Neural Networks (RNN) provide a straightforward mechanism that uses feedback to
allow the past to be captured (metaphorically, remembered).
As exponential smoothing maintains a state variable that summarizes the past, with the importance of
past values decaying with age, recurrent neural networks maintain a hidden state vector h_t. The new state
vector is computed as a combination of the prior state vector and the current input vector. The forecasted
value (output) is determined from the state vector. For simplicity, it is assumed here that the dimensionality of
the output vector is one (k = 1).
A single hidden layer RNN(p, nh) has one input layer, one hidden layer, and one output layer, much like
a 3-layer Fully-Connected Neural Network. The difference is that the hidden state vector is fed back into
the calculation for the next time step. The hyper-parameter p can denote the dimensionality of the input
vector, while nh can denote the number of hidden units (or the dimensionality of the hidden state vector).
The size (and shape) of the input depends on the modeling approach. The size p (as in AR(p) models)
could be used to indicate the number of lags. One may say that the past is captured by the hidden state, but
it may be helpful to explicitly include lags. In multivariate time series, there will be multiple variables,
so p could denote the number of variables (e.g., new_deaths, hosp_patients for COVID-19 forecasting). Often,
it will be useful to include both, so here we use p for lags and nv for variables; the input at time t
then becomes a matrix X_t. Thus, the input over time becomes a 3D tensor X (time × lags × variables).
12.5.1 RNN(1, 1)
The simplest Recurrent Neural Network has input dimension one and hidden dimension one, i.e., it has one
hidden unit/node and makes forecasts based on the most recent information. For such models, the hidden
state vector is one dimensional (i.e., scalar). In order to forecast a value for y_t, denoted ŷ_t, a weighted
combination of the previous value x_t = y_{t−1} and the previous state h_{t−1} is passed to an activation function
f.

x_t = y_{t−1}
h_t = f (φ x_t + w h_{t−1} + β^{(h)})
ŷ_t = v h_t + β^{(y)}

where φ and w are scalar parameters/weights and β^{(h)} and β^{(y)} are the scalar biases. The above RNN model
is a NARMA(1, 1) model [33].
The hidden state variable h_t is defined recursively and as such provides memory to the model, since its
value results from all previous values [124].
Notice that if the activation function f is the identity function, v = 1, and β^{(y)} = 0, then the middle
equation becomes

ŷ_t = h_t = φ y_{t−1} + w h_{t−1} + β^{(h)}
12.5.2 RNN(p, nh)
A single hidden layer Recurrent Neural Network of order (p, nh) makes forecasts based on information
going back p lags, with an nh-dimensional hidden state vector. The scalar parameters φ and w now become
parameter matrices Φ (renamed U) and W. They are given in pre-transposed form to facilitate the direct
application of matrix multiplication without the need to take a transpose.

h_t = f (U x_t + W h_{t−1} + β^{(h)})
ŷ_t = g (V h_t + β^{(y)})

where
• x_t ∈ R^p holds the collected inputs (time series values from the recent past)
• f : R^{nh} → R^{nh} is the hidden layer activation function, defaulting to the tanh function (vectorized)
• g : R^{nh} → R^k is the output layer activation function, defaulting to the identity function (scalar when k = 1)
Note, the hidden layer is recurrent, while the output layer may be dense. This RNN model is a NARMA
model [33].
The computation of the forecasted value ŷt is depicted in Figure 12.1 where the recurrent unit is within
the purple box.
The hidden layer is typically implemented by looping through all of the time steps, taking the next input
xt and the prior hidden state vector ht−1 into the unit for calculation.
Figure 12.2 shows a small, yet complete Recurrent Neural Network (RNN) with p = 2 (two lags), nh = 2
(two hidden nodes/units), and k = 1 (size of output). As the RNN iterates over time t, the calculations are
repeated with new inputs x_t and the hidden state h_{t−1} saved from the last iteration. This saved vector is
shown in the orange box.
Figure 12.1: A recurrent unit: inputs x_t and prior state h_{t−1} pass through activation f to produce the new state h_t and the forecast ŷ_t
h_{t0} = f (u_0 · x_t + w_0 · h_{t−1} + β_0^{(h)})    (12.30)
h_{t1} = f (u_1 · x_t + w_1 · h_{t−1} + β_1^{(h)})    (12.31)
Figure 12.2: Three-Layer (input, hidden, output) Recurrent Neural Network (RNN) with Biases Removed
Unfortunately, such neural networks may have stability problems, so work moved on to units with gates
that add stability (e.g., to avoid vanishing or exploding gradients). Gates are used to control how much of
the previous state is preserved as signals propagate through the units. This allows gated units to have longer
memories than simple RNNs.
12.5.3 RNN(p, nh, nv)
When there are multiple variables (e.g., nv = 2, with one for COVID new hospitalizations and one for
new deaths), the situation becomes more complex. In this case, the input is a matrix X_t and the
output/response is a vector, y_t = [y_{t0}, y_{t1}, . . . , y_{t,nv−1}], where
• X_t ∈ R^{nv×p} holds the collected inputs (time series values from the recent past)
• f : R^{nh} → R^{nh} is the hidden layer activation function, defaulting to the tanh function (vectorized)
• g : R^{nh} → R^k is the output layer activation function, defaulting to the identity function (vectorized)
12.5.4 Training
Below is a simple gradient descent implementation for training the RNN on a dataset (full or training).
The loop body consists of the three parts described in the next subsection.

def train (): Unit =
    for it <- 1 to max_epochs do
        forward ()                  // forward propagate: get intermediate and output results
        backward ()                 // back-propagate: compute the gradients
        update_params ()            // update the parameters (weights and biases)
    end for
end train
12.5.5 Optimization
Forward Pass
The forward method performs forward propagation to calculate yp, the loss L and intermediate variables for
each time step.

def forward (): Unit =
    for t <- 0 until n_seq do
        val h_pre = if t == 0 then h_m1 else h(t-1)        // get previous hidden state
        h(t) = tanh_ (U * x(t) + W * h_pre + b_h)          // compute new hidden state
        if CLASSIF then
            yp(t) = softmax_ (V * h(t) + b_y)              // activation: softmax for classification
            L(t)  = (-y(t) * log_ (yp(t))).sum             // cross-entropy loss function
        else
            yp(t) = V * h(t) + b_y                         // activation: id for forecasting
            L(t)  = (y(t) - yp(t)).normSq                  // sse loss function
        end if
    end for
end forward
Backward Pass
The backward method performs back-propagation to calculate the gradients using chain rules.

def backward (): Unit =
    // start back-propagation with the final/feed-forward (ff) layer (uses id)
    ...
Parameter Update
Based on the calculated partial derivatives, update the parameters (weights and biases).

def update_params (): Unit =
    // hidden state (h)
    U   -= hg.dU * eta
    W   -= hg.dW * eta
    b_h -= hg.db * eta

    // output layer
    V   -= dV * eta
    b_y -= db_y * eta
end update_params
12.5.6 Exercises
1. Use ScalaTion’s RNN class to make forecasts based on the Lake Level dataset given in Example LakeLevels.
12.6 Gated Recurrent Unit (GRU) Networks
Gated units have alleviated some of the problems with traditional RNNs. As the simplest gated unit, a
Minimal Gated Unit (MGU) has a single gate and a minimal number of parameters compared to other gated
units.
A Gated Recurrent Unit (GRU) [30, 29] is slightly more complex, but is more commonly used than an
MGU. It adds a second gate, a third pair of parameter matrices and a third bias vector. The two gates in
a GRU are the reset gate and the update gate. These are used to control the degree to which the previous
state h_{t−1} factors, along with the current input x_t, into the calculation of the new state h_t, which will
be a mixture of the previous state h_{t−1} and a new candidate state h̃_t (or c(t) in the code) coming out of
the activation function f. Figure 12.3 shows how signals propagate through one unit. This unit would be
connected to one on the left taking input x_{t−1} and one on the right taking input x_{t+1}. The state provides
memory for the next unit.
Figure 12.3: Signal flow through a Gated Recurrent Unit: the reset gate r_t and update gate z_t control how the previous state h_{t−1} and input x_t form the candidate state h̃_t and, by mixing, the new state h_t
The elements in the control vectors, reset r_t and update z_t, come through sigmoid activation, so they are
always between 0 (open circuit) and 1 (closed circuit) and are shown in purple.
One way that a GRU could be set up to handle univariate time series data, e.g., as a NAR(p) model, is to
collect the p most recent values into the input vector x_t = [y_{t−1}, . . . , y_{t−p}].
Other forms of feature selection or engineering could be done as well; one could include time t, shocks, etc.,
or extend to multivariate time series analysis.
GRU Equations
The equations below show how information flows through a gated recurrent unit [30, 215, 211]; they match
the forward-pass code given later in this section.

Reset Gate. The degree to which the previous state h_{t−1} influences the new candidate state h̃_t is
controlled by the reset gate. When the reset gate is open (r_t is close to zero) the previous state is all but
ignored, whereas, when the gate is closed, its influence will be strong.

r_t = sigmoid (U_r x_t + W_r h_{t−1} + β^{(r)})

Update Gate. The update gate controls the relative mixture of the previous state h_{t−1} and the new
candidate state h̃_t used to form the new state h_t. When the update gate is open (z_t is close to zero) the
previous state is passed through almost intact, whereas, when the gate is closed, the new state is essentially
the candidate state.

z_t = sigmoid (U_z x_t + W_z h_{t−1} + β^{(z)})

Activate. A candidate state h̃_t can be created by applying the activation function f (e.g., tanh or reLU)
to the weighted combination of the previous state h_{t−1} and the current input x_t. However, the reset control
r_t can be used to cut off the influence of the previous state.

h̃_t = f (U_c x_t + W_c (r_t ∗ h_{t−1}) + β^{(c)})

Mix. The new state h_t is created as a mixture of the previous state h_{t−1} and the newly created candidate
state h̃_t. The update control z_t determines how much of each is put into the new state h_t.

h_t = (1 − z_t) ∗ h_{t−1} + z_t ∗ h̃_t

When the update gate is nearly open (z_t near 0), the new state preserves much of the previous state, while when
it is nearly closed (z_t near 1), the new state is mainly given by the candidate state. Notice that when both
r_t and z_t = 1, the GRU equations reduce to those of the simple RNN given in the last section (see exercises).
Recall that ∗ (or ⊙) is the element-wise vector product.
A Gated Recurrent Unit introduces a greater number of variables and parameters than a dense Feed-Forward
Neural Network. The variables for a GRU unit consist of one variable that serves as input x_t, one that
represents state and is modified by the unit, having before (h_{t−1}) and after (h_t) values, a candidate state h̃_t,
and two control variables, r_t and z_t. These five vector-valued variables are listed in Table 12.1. All but
one of the variables have dimension nh (or n_mem in the code), the dimensionality of state variables (memory
size). That one, x_t, has dimension p, corresponding to the number of lags in univariate time series (of course,
feature engineering can be used to add more). For multivariate time series it corresponds to the number of
variables nv (n_var in the code). Note, using both multiple lags and multivariate time series will require the input
to be a 2-dimensional matrix X_t.
X_t ∈ R^{nv×p}    input matrix at time t    (12.37)
The parameters for a GRU unit consist of six weight/parameter matrices (in three pairs) and three bias
vectors, as listed in Table 12.2. The matrices may be thought of as pre-transposed to facilitate application of
matrix multiplication without the need to transpose.
Now, swap a GRU layer in for the dense hidden layer. Let the number of lags p = 2. Then at time t, the
network maps input x_t = [y_{t−2}, y_{t−1}] to output y_t.
The U weights shown in Figure 12.5 are meant to represent the weights in the three weight matrices, U_r,
U_z and U_c, while the W weights are meant to represent the weights in the three weight matrices, W_r, W_z
and W_c. Each of the two gates has its own weight matrices and the candidate state has its own as well.
Figure 12.4: Three-Layer (input, hidden, output) Neural Network with Biases Removed
Each of the two units takes the input vector x_t and the previous state h_{t−1} and computes the next state
h_t. Computations are performed over all nh units in the form of matrix-vector multiplications, e.g., W_r h_{t−1},
as depicted in Figure 12.3. The states coming out of the GRU layer are fed into a final dense layer to make
one-step-ahead forecasts ŷ_t. (This treatment can be extended for multi-horizon forecasts.)
Figure 12.5: Three-Layer (input, hidden, output) Neural Network with a GRU Layer
The GRU iterates through time (using a loop in the implementation). A useful way to visualize the execution is
to duplicate the unit for each of the m timestamps in the time series. Figure 12.6 illustrates this, imagining
h_0 and h_1 executing at times 0, t − 1, t, t + 1, and m − 1. Note, unless imputation or back-casting is used,
there is no input for time 0. Consequently, for this example, one would set x_0 = 0 and h_0 = 0.
12.6.2 Training
Below is a simple gradient descent implementation for training the GRU on a dataset (full or training).
As with the RNN, each epoch consists of a forward pass, a backward pass and a parameter update.

def train (): Unit =
    for it <- 1 to max_epochs do
        forward ()                  // forward propagate: get intermediate and output results
        backward ()                 // back-propagate: compute the gradients
        update_params ()            // update the parameters (weights and biases)
    end for
end train
Figure 12.6: The GRU unrolled over time: units h_0 and h_1 are replicated at times 0, . . . , t − 1, t, t + 1, . . . , m − 1, with the states h_{0,t−1} and h_{1,t−1} passed from each time step to the next
See the exercises for how to replace gradient descent with stochastic gradient descent using mini-batches.
12.6.3 Optimization
As shown in the code above, there are three parts: a forward pass, a backward pass and a parameter update.
Forward Pass
Using the identity activation function for the hidden-to-output layer yields the forward propagation
computations shown below. This case is for a univariate time series, i.e., y = [y_0, y_1, . . . , y_{m−1}], in which
case V ∈ R^{1×nh}.
def forward (): Unit =
    for t <- 0 until n_seq do
        val h_pre = if t == 0 then h_m1 else h(t-1)            // get previous hidden state
        r(t) = sigmoid_ (Ur * x(t) + Wr * h_pre + b_r)         // reset gate
        z(t) = sigmoid_ (Uz * x(t) + Wz * h_pre + b_z)         // update gate
        c(t) = tanh_ (Uc * x(t) + Wc * (r(t) * h_pre) + b_c)   // candidate state
        h(t) = (_1 - z(t)) * h_pre + z(t) * c(t)               // hidden state
        if CLASSIF then
            yp(t) = softmax_ (V * h(t) + b_y)                  // activation: softmax for classification
            L(t)  = (-y(t) * log_ (yp(t))).sum                 // cross-entropy loss function
        else
            yp(t) = V * h(t) + b_y                             // activation: id for forecasting
            L(t)  = (y(t) - yp(t)).normSq                      // sse loss function
        end if
    end for
end forward
Consequently, the sse loss function (or divide by m − 1 for mse) on the training (or full) dataset of size m is

L = ∑_{t=1}^{m−1} (y_t − ŷ_t)²    (12.38)

As there is no data for predicting ŷ_0, it is not considered. The following are the trainable parameters: weight
matrices U_r, W_r, U_z, W_z, U_c, W_c, V, and bias vectors β^{(r)}, β^{(z)}, β^{(c)}, β^{(y)}.
Backward Pass
The Gate case class holds information on the gate's value and its partial derivatives.

    ...
end Gate

For each gate, the partial derivatives are accumulated from the incoming gradient din and the gate's two
inputs, x_t and h_{t−1}, where ⊗ denotes the outer (vector) product:

∂_U L += din ⊗ x_t
∂_W L += din ⊗ h_{t−1}
∂_b L += din
Candidate c Mixin. Working backward through the mix equation h_t = (1 − z_t) ∗ h_{t−1} + z_t ∗ h̃_t and
the candidate's tanh activation:

dhbk = ∂_{h_t} L                          (save the incoming gradient on h_t)
din  = ∂_{h_t} L ∗ z_t ∗ tanh′(h̃_t)      (gradient at the candidate's pre-activation)
c += (din, x_t, h_{t−1} ∗ r_t)            (accumulate ∂_U, ∂_W, ∂_b for the candidate)
dhr  = W_c^T din
∂_{h_t} L = dhr ∗ r_t                     (gradient flowing back to h_{t−1} through the reset path)

Reset r Gate

Update z Gate
The backward method has three parts: (1) start back-propagation with the final/feed-forward (ff) layer (uses
id for activation); (2) loop back in time, adding to the partials for U, W and b, as well as for the state at
time t, h_t; (3) handle the end case for t = 0, where h_{−1} becomes h_m1.

def backward (): Unit =
    val e = yp - y                                        // negative error matrix
    db_y = e.sumVr                                        // vector of row sums
    for t <- 0 until n_seq do dV += outer (e(t), h(t))    // outer vector product
    val dh_ff = e * V                                     // partial w.r.t. h: n_seq by n_mem
    var dh = new VectorD (dh_ff.dim2)                     // hold partial for hidden state
    var dIn, dhr: VectorD = null
    ...
        dh += Wz.T * dIn
    end for
    ...
end backward
For completeness, the code (corresponding to the ... above) for the end case (t = 0) is shown below.

dh += dh_ff(0)                                            // update partial: h hidden @ t = 0
...
Parameter Update
After computing values for the variables in the forward pass and computing partial derivatives in the
backward pass, the partials, moderated by the learning rate η (or eta), can be subtracted from the current
values of the parameters (weight matrices and bias vectors).

def update_params (): Unit =
    // update gate (z)
    Uz  -= z.dU * eta
    Wz  -= z.dW * eta
    b_z -= z.db * eta

    // reset gate (r)
    Ur  -= r.dU * eta
    Wr  -= r.dW * eta
    b_r -= r.db * eta

    // candidate state (c)
    Uc  -= c.dU * eta
    Wc  -= c.dW * eta
    b_c -= c.db * eta

    // output layer
    V   -= dV * eta
    b_y -= db_y * eta
end update_params
12.6.4 Exercises
1. Use ScalaTion’s GRU class to make forecasts based on the Lake Level dataset given in Example LakeLevels.
2. Create a GRU model using Keras for the Lake Level dataset. Guide: https://keras.io/guides/working_with_rnns/
and https://faroit.com/keras-docs/2.0.5/layers/recurrent; API: https://keras.io/api/layers/recurrent_layers/gru/.
Compare the results to those obtained with ScalaTion.
4. When both the reset gate rt and the update gate zt are fixed at 1, show that the GRU equations
reduce to those of the simple RNN given in the last section.
5. Show that the weight matrices U_r and W_r can be combined into one to make a slightly more concise
formula for the reset gate r_t (same for the update gate).
Given U_r ∈ R^{nh×p}, W_r ∈ R^{nh×nh}, h_{t−1} ∈ R^{nh}, and x_t ∈ R^p, show that if W = [U_r, W_r] ∈ R^{nh×(p+nh)},
then

W [x_t ; h_{t−1}] = U_r x_t + W_r h_{t−1}    (12.39)

where [x_t ; h_{t−1}] ∈ R^{p+nh} stacks the two vectors.
6. Use the above identity to rewrite the GRU equations using three parameter matrices and three bias
vectors.
7. Write a formula for the total number of trainable parameters in a GRU where the state vector is
nh -dimensional and the input vector is p-dimensional.
9. Explain the difference between stateless and stateful training of a GRU (or MGU, LSTM) [44].
12.7 Minimal Gated Unit (MGU) Networks
A Minimal Gated Unit (MGU) [215] has a single gate and a minimal number of parameters compared to
other gated units. Note, the GRU reset and update gates, as well as their corresponding vectors, are unified
into the forget gate in an MGU. An MGU has a forget gate whose purpose is to weigh accumulated past
information versus new information. The greater the forgetting, the greater the reliance on recent data.
Figure: Signal flow through a Minimal Gated Unit, where the forget gate fo_t controls both the candidate state h̃_t and the mix of h_{t−1} and h̃_t that forms the new state h_t. The input collects the p most recent values,

x_t = [y_{t−1}, . . . , y_{t−p}]
MGU Equations
The equations below show how information flows through a minimal gated unit [215, 211].
Notice that when fot is essentially 1, the equations reduce to those of a simple RNN.
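As a sketch, the standard MGU formulation [215] may be written in this document's notation as follows (the parameter naming Uf, Wf, Uc, Wc, β^(f), β^(c) is an assumption following the conventions of the GRU section; ∗ denotes the element-wise product):

    fot = sigmoid(Uf xt + Wf ht−1 + β^(f))            (forget gate)
    h̃t  = tanh(Uc xt + Wc (fot ∗ ht−1) + β^(c))       (candidate state)
    ht  = (1 − fot) ∗ ht−1 + fot ∗ h̃t                 (new state)

Setting fot = 1 gives ht = h̃t = tanh(Uc xt + Wc ht−1 + β^(c)), the simple RNN update, while fot near 0 passes ht−1 through almost unchanged.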
The variables for an MGU unit consist of one input variable xt; one state variable, with value ht−1 before the unit and ht after; a candidate state h̃t; and one control variable, fot. These four vector-valued variables are listed in Table 12.3.
Table 12.3: MGU Variables

    variable     description
    xt           input vector at time t
    ht−1, ht     state vector, before and after the unit
    h̃t           candidate state vector
    fot          forget gate control vector
The parameters consist of two pairs of weight/parameter matrices along with the biases, see Table 12.4.
The first activation function serves as a switch (typically sigmoid) and the second activation function
defaults to tanh, but other activation functions (e.g., reLU) may be used.
In the above formulation, the dimensions of the vectors are as follows: xt ∈ R^p and fot, h̃t, ht ∈ R^{nh}, where p is the number of features engineered into the input and nh is the number of units.
Forget Gate. The purpose of the forget gate is to determine how much of the previous state to forget.
When the forget gate is open (fot is close to zero) the previous state ht−1 is passed through almost intact,
whereas, when the gate is closed, the new state is essentially the candidate state. The value of fot also determines the influence the previous state ht−1 has on the new candidate state h̃t.
12.8 Long Short Term Memory (LSTM) Networks
Long Short Term Memory (LSTM) networks [75] provide increased memory by introducing another state
variable called the cell state ct that works in parallel with the hidden state ht . For additional control of how
past information is propagated, three gates are used: a forget gate, an input gate, and an output gate. The
corresponding vectors, fot , int , and out , hold values in (0, 1) and thus act as switches to control the flow.
An LSTM has advantages over a GRU when its longer memory is beneficial and the time series is long
enough for effective training. An LSTM has more parameters to train than a GRU, so it takes longer to
train.
LSTM Equations
The equations below show how information flows through a Long Short-Term Memory network [75, 211].
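As a sketch, a standard LSTM formulation may be written in this document's notation as follows (the exact parameter naming is an assumption; compare Tables 12.5 and 12.6; f defaults to tanh and ∗ denotes the element-wise product):

    fot = sigmoid(Uf xt + Wf ht−1 + β^(f))            (forget gate)
    int = sigmoid(Ui xt + Wi ht−1 + β^(i))            (input gate)
    out = sigmoid(Uo xt + Wo ht−1 + β^(o))            (output gate)
    cct = f(Uc xt + Wc ht−1 + β^(c))                  (candidate cell state)
    ct  = fot ∗ ct−1 + int ∗ cct                      (cell state; equation 12.44)
    ht  = out ∗ f(ct)                                 (hidden state)

The paragraphs below discuss each of these in turn.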
Forget Gate. The new cell state takes a fraction of the previous cell state and a fraction of the new
candidate cell state (see below). The forget gate determines how much of the previous cell state is kept;
when it is open (near zero) this information is forgotten, while when it is closed (near one) it is strongly
remembered. (It may be more intuitive to think of this as a remember gate.)
Input Gate. The new candidate cell state (see below) as a weighted combination of the input xt and
the previous hidden state ht−1 is a key calculation in an LSTM. The input gate controls how much of the
new candidate cell state goes into the actual new cell state. When the input gate is open (near zero) the
new candidate cell state has little influence, while when it is closed, the new candidate cell state enters the
calculation at full strength.
Output Gate. The output gate comes into play at the end of the unit calculations and is used to moderate the activated cell state before it is assigned to the hidden state. Open (near zero) dampens it, while closed does not. This provides additional stability in the hidden state ht, which is important as it is widely fed back into most of the LSTM equations.
Candidate Cell State. As with a GRU, an LSTM first calculates a candidate state, here a candidate for the cell state (the main computations in an LSTM operate on the cell state; at the end a final calculation determines the hidden state). The candidate cell state cct is computed by applying the activation function f (e.g., tanh or reLU) to the weighted combination of the current input xt and previous state ht−1.
Cell State. The actual cell state ct is a combination of the previous cell state ct−1 and the candidate
cell state cct , where the fraction of ct−1 included is determined by the forget gate, while the fraction of cct
included is determined by the input gate.
ct = fot ∗ ct−1 + int ∗ cct (12.44)
Hidden State. As the principal output of LSTM units, the hidden state ht may be thought of as the cell state ct with an activation function applied to it. For added stability (recall an RNN has problems with calculations blowing up), this value is moderated by the output gate, whose value is in (0, 1).
The variables for an LSTM unit consist of three control variables, two state variables and one internal candidate state variable, as described in Table 12.5.
The parameters for an LSTM unit consist of six weight/parameter matrices (in three pairs) and three bias vectors, as listed in Table 12.6.
12.8.1 Exercises
1. Use ScalaTion’s LSTM class to make forecasts based on the Lake Level dataset given in Example LakeLevels.
12.9 Encoder-Decoder Architectures
One may consider ways to improve LSTM (or GRU) networks; incremental improvement may be obtained by adding LSTM and/or Feed-Forward layers. It may be beneficial to replace or augment the Feed-Forward layer with a decoder, for example one containing LSTM units.
The purpose of the encoder is to create a context vector that captures the input in a summarized/encoded form. This may simply be the hidden state ht produced by the LSTM units at the final time step t, or it may be more complex and include multiple hidden state vectors.
The context vector serves as the initial state vector for the decoder (the second portion of the network architecture). In order to adjust the dimensionality of the data (inputs vs. outputs), a single Feed-Forward layer may be used (in line with the V matrix discussed in prior sections).
While the encoder can be thought of as encoding the past, the decoder uses this encoding to forecast the future. The hope is that the two parts can specialize, with the encoder striving to better capture the patterns in the data and the decoder focusing on using these patterns to make accurate forecasts.
[Figure: encoder-decoder architecture producing forecasts ŷt+1, ŷt+2, . . . , with prior forecasts fed back in at later horizons (red line).]
The intuition is that training will optimize the parameters in the first GRU (the encoder) to capture patterns in the data, while the parameters in the second GRU will be optimized for making accurate forecasts given the encoding saved in the context vector. Depending on the number of lags making up the input xt, forecasts at later horizons will need prior forecasts to be fed into them, as shown by the red line in the figure.
12.9.2 Teacher Forcing
The figure shows what happens during inference (actual forecasting). Training is a bit different. First, there is Back-Propagation Through Time (BPTT), ... Second, actual values, such as yt+1, may be used rather than forecasted values ŷt+1. This is not possible (or allowed) during actual forecasting, as it would constitute knowing the future. However, some degree of teacher forcing, i.e., using actual values in place of forecasted values during training, has shown potential [204].
Alignment Scores
An alignment score between encoder state hτ and decoder state st (renamed to distinguish it from the encoder states) is given by aτt.
Attention Weights
The attention weights are found by normalizing the alignment scores to the range (0, 1) using the softmax function.
    ατt = e^{aτt} / Σ_τ e^{aτt}        (12.47)
Context Vector
The context vector ct is the attention-weighted sum of the encoder hidden states, ct = Σ_τ ατt hτ (12.48); i.e., it uses state vectors from the encoder hτ, weighted by their importance in calculating st as specified by their attention weights. Define the function context to consist of the above three equations,
ct = context(st−1 , H) (12.49)
where the matrix H maintains the hidden state vectors from the encoder GRU, such that the τ th row of H
stores hτ .
Modifications to GRU Equations
A slight modification to the GRU equations for the decoder is needed [213]. There is useful state information from the previous time step as well as from the context vector. These need to be combined in some fashion, for example, by vector concatenation.
The simple change is just to replace the state vector from the GRU section h with the concatenated state
vector cs in the GRU equations.
12.9.4 Exercises
1. Notice in the equations above, the context vector and state vector are treated the same way throughout, e.g., both the context vector and state vector are regulated by the reset control vector rt when computing h̃t. It may be beneficial to only do this to the state vector and not the context vector. Rewrite the above equations to achieve this.
Hint: keep the context and state vectors separate and do not use cst ; also introduce new parameter
matrices to include ct into the equations.
2. Consider other ways of combining the context and state vectors, besides concatenation.
12.10 Transformer Models
Transformers [197] are well-suited to finding patterns in sequential data, including natural language and
time series. While Recurrent Neural Networks utilize a hidden state vector that summarizes what happened
in the past, a transformer can utilize temporal relationships/dependencies between any two elements in the
time series. This enriched view of dependencies is referred to as the self-attention mechanism.
12.10.1 Self-Attention
Given a multi-variate time series Y consisting of m time steps and nv variables,
Y = [ytj ] (12.51)
the inputs xt ∈ R^{nv} into the transformer can be defined as follows:
vt = Wv xt (12.53)
where Wv ∈ R^{nu×nv} is the value matrix (its dimensions are the number of units × the size of the input vector). Note that the transformation can have two effects: the new values are likely to be in a higher-dimensional space (controlled by the user) and the values are rescaled (more suitable for neural computation).
Now attention weights [αtτ ] are applied to produce a context vector at time t, ct ,
    ct = Σ_{τ=0}^{m−1} αtτ vτ = αt V        (12.54)

where V = [v0, v1, . . . , vm−1]^⊤ is the matrix whose rows are the value vectors. The question remains of how to determine the attention scores/weights. We start by defining the following two learned views of the input sequence,
qt = Wq xt
kt = Wk xt
where Wq and Wk ∈ R^{nk×nv} are learned weight matrices. As with nu, nk is chosen by the user. (Note, in Natural Language Processing (NLP) the dimension of the word vector xt is typically very high, so the matrices are referred to as projection matrices (they project to a lower dimensional space), while for time series they would typically do the opposite.)
Let us consider how input xt is related to the other inputs xτ . In the transformed space qt is its
representative in making such an inquiry (query). One simple way to measure relatedness of two vectors is
to take their dot product (proportional to the cosine of the angle between them).
ωtτ = qt · kτ (12.55)
These are referred to as the unnormalized attention scores. Notice that the other vector is represented by the transformed vector kτ. Using separate vectors qt and kt makes it possible for the attention scores to be asymmetric, meaning the influence (and therefore the need for attention) between inputs xt and xτ need not be the same in both directions. (Recall that, as measures of dependence, correlation is symmetric, while conditional entropy is not.)
The scores may be efficiently computed for all τ using matrix-vector multiplication,
    ωt = K qt        (12.56)

where K = [k0, k1, . . . , km−1]^⊤ is the matrix whose rows are the key vectors.
The following equation is used for computing (normalized) attention scores.
    αt = softmax(ωt / √nk)        (12.57)

The unnormalized attention scores are first divided by √nk and then passed through the softmax function to put the elements in the interval (0, 1). Dividing by √nk (or √dk in other papers) may improve the stability of calculations. According to [197], “We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by √dk.”
Single-Head Self-Attention
Putting the formulas together yields a means for computing a context vector ct .
    ct = V^⊤ softmax(K qt / √nk)        (12.58)
This is implemented in ScalaTion as follows:
/** @param q_t  the query vector at time t (based on input vector x_t)
 *  @param k    the key matrix K
 *  @param v    the value matrix V
 */
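A minimal sketch consistent with equation (12.58) is given below (the local softmaxV helper is hypothetical, added for self-containment; ScalaTion's actual implementation may differ):

import scalation.mathstat.{MatrixD, VectorD}

def softmaxV (x: VectorD): VectorD =                       // hypothetical helper: vector softmax
    val ex = x.map (math.exp)                              // element-wise exponential
    ex / ex.sum                                            // normalize to sum to one

def context (q_t: VectorD, k: MatrixD, v: MatrixD): VectorD =
    val score = (k * q_t) / math.sqrt (k.dim2.toDouble)    // scaled scores K q_t / sqrt(n_k)
    v.transpose * softmaxV (score)                         // context vector V^T softmax(...)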
Actually, all the context vectors (also referred to as an attention matrix) can be computed together using
matrix multiplication.
    C = attention(Q, K, V) = softmax(Q K^⊤ / √nk) V        (12.59)
The attention method computes the attention matrix C of equation (12.59).
/** @param q  the query matrix Q (q_t over all time)
 *  @param k  the key matrix K
 *  @param v  the value matrix V
 */
The context and attention methods are provided by the Attention trait.
/** @param n_var  the size of the input vector x_t (number of variables)
 *  @param n_mod  the size of the output (dimensionality of the model, d_model)
 *  @param heads  the number of attention heads
 *  @param n_v    the size of the value vectors
 */
trait Attention (n_var: Int, n_mod: Int = 512, heads: Int = 8, n_v: Int = -1):
Multi-Head Self-Attention
This self-attention mechanism is said to be bundled into an attention head. The transformer architecture
allows for the use of multiple attention heads. For example, [197] suggests having 8 heads.
    attentionMH(Q, K, V; Wq, Wk, Wv, W^o) = [concat_{i=0}^{heads−1} attention(Q W^q_i, K W^k_i, V W^v_i)] W^o        (12.60)
The attentionMH method in the Attention trait computes multi-head attention weights.
/** @param q    the query matrix Q (q_t over all time)
 *  @param k    the key matrix K
 *  @param v    the value matrix V
 *  @param w_q  the weight tensor for query Q (w_q(i) matrix for i-th head)
 *  @param w_k  the weight tensor for key K (w_k(i) matrix for i-th head)
 *  @param w_v  the weight tensor for value V (w_v(i) matrix for i-th head)
 *  @param w_o  the overall weight matrix to be applied to the concatenated attention
 */
Sinusoidal positional encoding is often used ... Given a time value t, a positional vector of length d is
created by alternately calling sin and cos functions,
pt = [sin(ω1 t), cos(ω1 t), sin(ω2 t), cos(ω2 t), . . . , sin(ωd/2 t), cos(ωd/2 t)] (12.61)
where the frequencies ωk decrease geometrically (equivalently, the wavelengths 2π/ωk increase geometrically).
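As a sketch, a positional encoding matrix (one row per time value) can be computed as follows, assuming the common choice ωk = 1/10000^{2k/d}; this is illustrative, not ScalaTion's actual code:

import scalation.mathstat.MatrixD

def positionalEncoding (m: Int, d: Int): MatrixD =
    val p = new MatrixD (m, d)                             // row t holds the positional vector p_t
    for t <- 0 until m; k <- 0 until d / 2 do
        val w = 1.0 / math.pow (10000, 2.0 * k / d)        // assumed frequency ω_k
        p(t, 2*k)   = math.sin (w * t)                     // even positions use sin
        p(t, 2*k+1) = math.cos (w * t)                     // odd positions use cos
    p
end positionalEncoding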
Although the commonly used sinusoidal positional encoding could be used, some studies have shown that other approaches may be better for time series forecasting [125].
Encoder
An encoder layer consists of (1) a multi-head self-attention module (shown in orange), followed by (2) a feed-forward neural network that produces this layer's output (shown in lime). This output is fed into the next encoder layer as input. The input to the first encoder layer is [xt] (typically adjusted based upon positional encoding). In addition, after completion of each module, upstream information is added in (skip connection), followed by layer normalization (see exercises). Addition and normalization are shown in yellow, as depicted in Figure 12.9.
1. Attention: For a single head, attention(Q, K, V ) computes ... For multi-head attention, ...
3. Layer Normalization: Suppose a layer in a network outputs a vector z whose size is the number of units in the layer. This vector is then normalized by subtracting the mean and dividing by the standard deviation. In ScalaTion, this is provided by the standardize method in VectorD or the more robust standardize2 method (takes a Normal random variable to a Standard Normal random variable). See the exercise about pre-layer and post-layer normalization.

4. Feed-Forward Neural Network: A basic configuration is to have a Linear Layer (no activation function), followed by a reLU/geLU layer (see the Perceptron section), followed by another linear layer, and finally a dropout layer.
6. Layer Normalization: The addition in the last step may throw off normalization, so it needs to be done
again.
[Figure 12.9: an encoder layer — Q, K, V feed attention(Q, K, V); the result is added to the upstream input and normalized, giving [zt]; the feed-forward output then has [zt] added in and is normalized again.]
Decoder
A decoder layer includes the modules from an encoder with the following changes and additions. The self-attention is modified by using masking to prevent the decoder (i.e., the forecaster) from seeing the future. In addition, the decoder takes the output from its corresponding encoder and processes it with
another multi-head self-attention module. In other words, it contains three modules: (1) multi-head self-
attention module applied to input, (2) multi-head self-attention module applied to encoder output, and (3)
a feed-forward neural network to produce its output. The final decoder layer produces the forecasts.
NEED DETAILED FIGURE (using tikz)
12.10.4 Exercises
1. The Scaled Dot Product of vectors x and y is defined as
    x sdot y = (x · y) / √n        (12.62)
where n is the dimensionality (number of elements) of each vector. Explain why dividing by the square
root of dimensionality improves the stability of gradient calculations.
12.10.5 Further Reading
• “Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch,” Sebastian Raschka, https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
Chapter 13
Dimensionality Reduction
When data matrices are very large with high dimensionality, analytics becomes difficult. In addition, there is likely to be collinearity between vectors, making the computation of inverses or pseudo-inverses problematic. In such cases, it is useful to reduce the dimensionality of the data.
13.1 Reducer
The Reducer trait provides a common framework for several data reduction algorithms.
Trait Methods:
trait Reducer
13.2 Principal Component Analysis (PCA)
The PrincipalComponents class computes the Principal Components (PCs) for data matrix X with the
following dimensions
X ∈ Rm×n (13.1)
where the number of rows m is the number of instances/samples and the number of columns n is the number
of predictor variables.
Principal Component Analysis (PCA) can be used to reduce the dimensionality of the data matrix. Using the PrincipalComponents class, first find the PCs by calling ‘findPCs’ and then call ‘reduceData’ to reduce the data (i.e., reduce matrix X to a lower dimensionality matrix).
13.2.1 Representation
PCA will replace the data matrix X ∈ Rm×n with a lower-dimensional reduced matrix Z ∈ Rm×k for k ≤ n.
Z = XEk (13.2)
    z = Ek^⊤ x        (13.5)

Note that each element of vector z is a linear combination of elements in the original vector x.
Example Problem:
Class Methods:
def findPCs (k: Int): MatrixD =
def reduceData (): MatrixD =
def recover (): MatrixD = reducedMat * featureMat.t + mu
def solve (i: Int): (VectorD, VectorD) =
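A hypothetical usage sketch is shown below (the import path and constructor are assumptions; the method names follow the listing above):

import scalation.mathstat.MatrixD
import scalation.modeling.PrincipalComponents              // assumed package location

val x = MatrixD ((4, 3), 1, 2, 3,                          // small illustrative 4-by-3 data matrix
                         2, 3, 4,
                         3, 5, 5,
                         4, 6, 7)
val pca = new PrincipalComponents (x)                      // assumed constructor
val e_k = pca.findPCs (2)                                  // find the top k = 2 PCs
val z   = pca.reduceData ()                                // reduced 4-by-2 matrix Z = X E_k
val xr  = pca.recover ()                                   // approximate recovery of X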
13.2.2 Exercises
1.
13.3 Autoencoder (AE)
An Autoencoder (AE) is a fully-connected neural network that contains a middle layer of lower dimensionality
than the input layer. The output layer has the same dimensionality as the input layer, as shown in figure
13.1. The loss function then simply measures the difference between the input vectors and output vectors.
If the loss is small, then the middle layer may be used as a lower-dimensional representation of the input.
[Figure 13.1: a three-layer autoencoder — inputs x0, x1, x2 are connected through weights A and activation f0 to a smaller hidden layer z0, z1, which is connected through weights B and activation f1 to outputs y0, y1, y2.]
A common choice for the loss function is the sum of squared reconstruction errors,

    sse = ‖X − Ŷ‖²_F        (13.6)

where ‖·‖F denotes the Frobenius norm.
13.3.1 Representation
For a three layer autoencoder, the lower dimensional representation vector z corresponds to the values at
the middle hidden layer and is given by the following equation,
    z = f0(A^⊤ x + β)        (13.7)
where x is the input vector, A is the parameter/weight matrix, β is the bias vector and f0 is the vectorized
activation function.
Notice that if f0 is the identity function, then z becomes a linear transformation of x, as is the case for
PCA.
Chapter 14
Clustering
Clustering is related to classification, except that specific classes are not prescribed. Instead, data points (vectors) are placed into clusters based on some similarity or distance metric (e.g., Euclidean or Manhattan distance). It is also related to prediction in the sense that a predictive model may be associated with each cluster. Points in a cluster are, according to some metric, closer to each other than to points not in their cluster. Closeness or similarity may be defined in terms of ℓp distance ‖x − z‖p, correlation ρ(x, z), or cosine cos(x, z). Abstractly, we may represent any of these by a distance d(x, z). In ScalaTion, the function dist in the clustering package computes the square of the Euclidean distance between two vectors, but may easily be changed (e.g., (x - z).norm1 for Manhattan distance).
Consider a general modeling equation, where the parameters b are estimated based on a dataset (X, y).
    y = f(x; b) + ε

Rather than trying to approximate the function f over the whole data domain, one might think that, given a point z, points similar to (or close to) z might be more useful in making a prediction f(z).
A simple way to do this would be to find the κ-nearest neighbors to point z,
each with centroid ξc = µ(Xc). Typically, point xi is in cluster c because it is closer to c's centroid than to any other centroid, i.e.,

    xi ∈ Xc =⇒ d(xi, ξc) ≤ d(xi, ξh) for all clusters h
Define the cluster assignment function ξ to take a point xi and assign it to the cluster with the closest
centroid ξ c , i.e.,
ξ(xi ) = c
The goal becomes to find an optimal cluster assignment function by minimizing the following objective/cost function:

    min_ξ Σ_{i=0}^{m−1} d(xi, ξ_{ξ(xi)})
If the distance d is ‖xi − ξ_{ξ(xi)}‖²₂ (the default in ScalaTion), then the above sum may be viewed as a form of the sum of squared errors (sse).
If one knew the optimal centroids ahead of time, finding an optimal cluster assignment function ξ would
be trivial and would take O(kmn) time. Unfortunately, k centroids must be initially chosen, but then as
assignments are made, the centroids will move, causing assignments to need re-evaluation. The details vary
by clustering algorithm, but it is useful to know that finding an optimal cluster assignment function is
NP-hard [7].
Other factors that can be considered in forming clusters include balancing the size of clusters and maximizing the distance between clusters.
14.1 KNN Regression
Similar to the KNN Classifier class, the KNN Regression class makes predictions based on individual pre-
dictions of its κ-nearest neighbors. For prediction, its function is analogous to using clustering for prediction
and will be compared in the exercises in later sections of this chapter.
Training in KNN Regression is lazy and is done in the predict method, based on the following equation:
    ŷ = (1/κ) 1 · y(topκ(z))        (14.1)

Given point z, find the κ points that are closest, sum their response values y, and return the average.
Class Methods:
Note that the train method has nothing to do, so it need not be called.
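A hypothetical usage sketch is shown below (the class/package names and constructor are assumptions based on the text):

import scalation.mathstat.{MatrixD, VectorD}
import scalation.modeling.KNN_Regression                   // assumed class/package name

val x = MatrixD ((4, 2), 1, 5,                             // small illustrative data matrix
                         2, 4,
                         9, 1,
                        10, 1)
val y   = VectorD (1, 1, 0, 0)                             // response values
val mod = new KNN_Regression (x, y)                        // assumed constructor; κ left at its default
println (mod.predict (VectorD (2, 5)))                     // average response of the κ nearest neighbors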
14.1.2 Exercises
1. Apply KNN Regression to the following combined data matrix.
// x1 x2 y
val xy = MatrixD ((10, 3), 1, 5, 1, // joint data matrix
2, 4, 1,
3, 4, 1,
4, 4, 1,
5, 3, 0,
6, 3, 1,
7, 2, 0,
8, 2, 0,
9, 1, 0,
10, 1, 0)
14.2 Clusterer
The Clusterer trait provides a common framework for several clustering algorithms.
Clusterer Trait
Trait Methods:
trait Clusterer:
For readability, names may be given to clusters (see the name methods). To obtain a new (and likely different) cluster assignment, the setStream method may be called to change the random number stream. The train methods in the implementing classes will take a set of points (vectors) and apply iterative algorithms to find a “good” cluster assignment function. The cluster method may be called after train to see the cluster assignments. The centroids are returned as rows in a matrix by calling centroids, and the cluster sizes are given by csize. The initCentroids method initializes the centroids, while calcCentroids calculates the centroids based on the points contained in each cluster.
def calcCentroids (x: MatrixD, to_c: Array [Int], sz: VectorI, cent: MatrixD): Unit =
cent.setAll (0.0) // set cent matrix to all zeros
for i <- x.indices do
val c = to_c(i) // x_i currently assigned to cluster c
cent(c) = cent(c) + x(i) // add the next vector in cluster
end for
for c <- cent.indices do cent(c) = cent(c) / sz(c) // divide to get averages/means
end calcCentroids
Given a new point/vector z, the classify method will indicate which cluster it belongs to (in the range 0 to k-1). The distances between a point and the centroids are computed by the distance method. The
objective/cost function is defined to be the sum of squared errors (sse). If the cost of an optimal solution
is known, checkOpt will return true if the cluster assignment is optimal.
14.3 K-Means Clustering
The KMeansClustering class clusters several vectors/points using k-means clustering. The user selects the
number of clusters desired (k). The algorithm will partition the points in X into k clusters. Each cluster
has a centroid (mean) and each data point xi ∈ X is placed in the cluster whose centroid it is nearest to.
See the exercises for more details on the second technique for initializing clusters/centroids.
If any cluster turns out to be empty, move a point from another cluster. In ScalaTion this is done by
removing a point from the largest cluster and adding it to the empty cluster. This is performed by the
fixEmptyClusters method.
After the assign and fixEmptyClusters methods have been called, the data matrix X will be logically
partitioned into k non-empty sub-matrices Xc with cluster c having nc (sz(c)) points/rows.
Calculating Centroids
The next step is to calculate the centroids using the calcCentroids method. For cluster c, the centroid is
the vector mean of the rows in submatrix Xc .
    ξc = (1/nc) Σ_{xi ∈ Xc} xi
ScalaTion iterates over all points and based on their cluster assignment adds them to one of the k centroids
(stored in the cent matrix). After the loop, these sums are divided by the cluster sizes sz to get means.
The calcCentroids method is defined in the base trait Clusterer.
14.3.2 Reassignment of Points to Closest Clusters
After initialization, the algorithm iteratively reassigns each point to the cluster containing the closest cen-
troid. The algorithm stops when there are no changes to the cluster assignments. For each iteration, each
point xi needs to be re-evaluated and moved (if need be) to the cluster with the closest centroid. Reas-
signment is based on taking the argmin of all the distances to the centroids with ties going to the current
cluster.
In ScalaTion, this is done by the reassign method which iterates over each xi ∈ X computing the distance
to each of k centroids. The cluster (c2) with the closest centroid is found using the argmin method. The
distance to c2’s centroid is then compared to the distance to its current cluster c1’s centroid, and if the
distance to c2’s centroid is less, xi will be moved and a done flag will be set to false, indicating that during
this reassignment phase at least one change was made.
The exercises explore a change to this algorithm by having it return after the first change.
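A minimal sketch of the reassignment logic is given below (simplified: ScalaTion's actual reassign also randomizes the index order and may return immediately after the first change):

def reassign (): Boolean =
    var done = true                                        // assume no changes
    for i <- x.indices do
        val c1  = to_c(i)                                  // x_i's current cluster
        val dis = VectorD (for c <- cent.indices yield dist (x(i), cent(c)))
        val c2  = dis.argmin ()                            // cluster with the closest centroid
        if dis(c2) < dis(c1) then                          // strictly closer => move (ties stay put)
            sz(c1) -= 1; sz(c2) += 1                       // update cluster sizes
            to_c(i) = c2                                   // reassign x_i to cluster c2
            done = false                                   // a change was made
    end for
    done
end reassign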
14.3.3 Training
The train method simply uses these methods until the reassign method returns true (internally the done
flag is true). The method is set up to work for this and derived classes. It assigns points to clusters and
then either initializes/picks centroids or calculates centroids from the first cluster assignment. Inside the
loop, reassign and calcCentroids are called until there is no change to the cluster assignment. After the
loop, an exception is thrown if there are any empty clusters (a useful safe-guard since this method is used
by derived classes). Finally, if post-processing is to be performed (post = true), then the swap method is
called. This method will swap two points in different clusters, if the swap results in a lower sum of squared
error (sse).
def train (): Unit =
sz.set (0) // cluster sizes initialized to zero
raniv = PermutedVecI (VectorI.range (0, x.dim), stream) // for randomizing index order
assign () // randomly assign points to clusters
fixEmptyClusters () // move points into empty clusters
if ! initCentroids () then calcCentroids (x, to_c, sz, cent) // pick points for initial centroids
breakable {
for l <- 1 to MAX_IT do
if reassign () then break () // reassign points (no change => break)
calcCentroids (x, to_c, sz, cent) // re-calculate the centroids
end for
} // breakable
val ce = sz.indexOf (0) // check for empty clusters
if ce != -1 then throw new Exception (s"Empty cluster c = $ce")
if post then swap () // swap points to improve sse
end train
Class Methods:
class KMeansClusterer (x: MatrixD, k: Int, val flags: Array [Boolean] = Array (false, false))
extends Clusterer:
14.3.5 Exercises
1. Plot the following points.
// x0 x1
val x = MatrixD ((6, 2), 1.0, 2.0,
2.0, 1.0,
4.0, 5.0,
5.0, 4.0,
8.0, 9.0,
9.0, 8.0)
new Plot (x(?, 0), x(?, 1), null, "x0 vs. x1")
For k = 3, determine the optimal cluster assignment ξ. What is the sum of squared errors sse for this
assignment?
2. Using the data from the previous exercise, apply the K-Means Clustering Algorithm by hand to com-
plete the following cluster assignment function table. Let the number of clusters k be 3 (clusters 0, 1
and 2). The ξ 0 column is the initial random cluster assignment, while the next two columns represent
the cluster assignments for the next two iterations.
point (x0 , x1 ) ξ0 ξ1 ξ2
0 (1, 2) 0 ? ?
1 (2, 1) 2 ? ?
2 (4, 5) 0 ? ?
3 (5, 4) 1 ? ?
4 (8, 9) 1 ? ?
5 (9, 8) 2 ? ?
3. The test function in the Clusterer object is used to test various configurations of classes extending Clusterer, such as the KMeansClusterer class.
def test (x: MatrixD, fls: Array [Boolean], alg: Clusterer, opt: Double = -1.0): Unit =
Explain the meaning of each of the flags: post and immediate. Call the test function, passing in x
and k from the last exercise. Also, let the value opt be the value determined in the last exercise. The
test method will give the number of test cases out of NTESTS that are correct in terms of achieving
the minimum sse.
4. The primary versus secondary techniques for initializing the clusters/centroids are provided by the
KMeansClusterer class and the KMeansClusterer2 class, respectively. Test the quality of these two
techniques.
5. Show that the time complexity of the reassign method is O(kmn). The time complexity of K-Means
Clustering using Lloyd’s Algorithm [113] is simply the complexity of the reassign method times the
number of iterations. In practice, the number of iterations tends to be small, but in the worst case
only upper and lower bounds are known, see [9] for details.
6. Consider the objective/cost function given in ISL equation 10.11 in [85]. What does it measure and
how does it compare to sse used in this book?
14.4 K-Means Clustering - Hartigan-Wong
An alternative to the Lloyd algorithm that often produces more tightly packed clusters is the Hartigan-Wong
algorithm [69]. Improvement is seen in the fraction of times that optimal clusters are formed as well as the
reduction in sum of squared errors (sse). The change to the code is minimal in that only the reassign
method needs to be overridden.
The basic difference is that rather than simply reassigning each point to the cluster with the closest
centroid (the Lloyd algorithm), the Hartigan-Wong algorithm weights the distance by the relative changes
in the number of points in a cluster. For example, if a point is to be moved into a cluster with 10 points
currently, the weight would be 10/11. If the point is to stay in its present cluster with 10 points currently,
the loss in removing it would be weighted as 10/9. The weighting scheme has two effects: First it makes it
more likely to move a point out of its current cluster. Second it makes it more likely to join a small cluster.
Mathematically, the weighted distance d′ to cluster c when the point xi ∉ Xc (the cost of joining) is given by

    d′(xi, ξc) = [nc / (nc + 1)] d(xi, ξc)        (14.2)

while when xi ∈ Xc (the cost of staying) it is

    d′(xi, ξc) = [nc / (nc − 1)] d(xi, ξc)        (14.3)
The code for the reassign method is similar to the one in KMeansClusterer, except that the private
method closestByR2 calculates weighted distances to return the closest centroid.
Besides switching from distance d to weighted distance d0 , the code also randomizes the index order and has
the option of returning immediately after a change is made.
14.4.1 Adjusted Distance
The distance2 method computes the adjusted distance of point u to all of the centroids cent, where cc is
the current centroid that u is assigned to. Notice the inflation of distance when c == cc, and its deflation,
otherwise.
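A sketch following equations (14.2) and (14.3) is given below (simplified relative to ScalaTion's actual distance2):

def distance2 (u: VectorD, cent: MatrixD, cc: Int): VectorD =
    val d = new VectorD (cent.dim)                         // one weighted distance per centroid
    for c <- cent.indices do
        val n_c = sz(c).toDouble                           // current size of cluster c
        d(c) = if c == cc then dist (u, cent(c)) * n_c / (n_c - 1.0)   // inflate: cost of staying
               else            dist (u, cent(c)) * n_c / (n_c + 1.0)   // deflate: cost of joining
    end for
    d
end distance2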
Class Methods:
class KMeansClustererHW (x: MatrixD, k: Int, flags: Array [Boolean] = Array (false, false))
extends KMeansClusterer (x, k, flags)
14.4.3 Exercises
1. Compare KMeansClustererHW with KMeansClusterer for a variety of datasets, starting with the six points given in the last section (Exercise 1). Compare the quality of the solution in terms of the fraction of optimal clusterings and the mean of the sse over the NTESTS test cases.
14.5 K-Means++ Clustering
The KMeansClustererPP class clusters several vectors/points using a k-means++ clustering algorithm [10].
The class may be derived from a K-Means clustering algorithm and in ScalaTion it is derived from the
Hartigan-Wong algorithm (KMeansClustererHW). The innovation in KMeansClustererPP is to pick the initial centroids wisely, yet randomly. The wise part is to make sure the points are well separated. The random part involves making a probability mass function (pmf) where points farther away from the current centroids are more likely to be selected as the next centroid. Picking the initial centroids entirely at random leads to KMeansClusterer2, which typically does not perform as well as KMeansClusterer. However, maintaining randomness while giving preference to more distant points becoming the next centroid has been shown to work well.
val ranI = new Randi (0, x.dim-1, stream) // uniform random integer generator
cent(0) = x(ranI.igen) // pick first centroid uniformly at random
The rest of the centroids are chosen following a distance-derived discrete distribution, using the ranD random
variate generator object. The probability mass function (pmf) for this discrete distribution is produced so
that the probability of a point being selected as the next centroid is proportional to its distance to the closest
existing centroid.
Each time a new centroid is chosen, the pmf must be updated, as the new centroid is likely to be the closest centroid for some of the remaining, as yet unchosen, points. Given that the next centroid to be selected is the c-th centroid, the update pmf method will update the pmf and return a new distance-derived discrete distribution.
The pmf vector initially records the shortest distance from each point xi to any of the existing already
selected centroids {0, . . . , c − 1}. These distances are turned into probabilities by dividing by their sum. The
pmf vector then defines a new distance-derived random generator that is returned.
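A sketch of this update is given below (hypothetical: it assumes pmf is a VectorD holding each point's shortest distance so far, and that ScalaTion's Discrete random variate generator accepts a probability vector):

def updatePMF (c: Int): Discrete =
    for i <- x.indices do
        val d = dist (x(i), cent(c))                       // distance to the newest centroid c
        if d < pmf(i) then pmf(i) = d                      // keep the shortest distance seen so far
    end for
    Discrete (pmf / pmf.sum)                               // normalize distances into probabilities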
14.5.2 KMeansClustererPP Class
Class Methods:
class KMeansClustererPP (x: MatrixD, k: Int, flags: Array [Boolean] = Array (false, false))
extends KMeansClustererHW (x, k, flags)
14.5.3 Exercises
1. Compare KMeansClustererPP with KMeansClustererHW and KMeansClusterer for a variety of datasets,
starting with the six points given in the KMeansClusterer section (Exercise 1). Compare the quality
of the solution in terms of the fraction of optimal clusterings and the mean of the sse over the NTESTS
test cases.
14.6 Clustering Predictor
The ClusteringPredictor class is used to predict a response value for a new vector z. It works by finding the cluster that the point z would belong to. That cluster's recorded response value is then given as the predicted response. The per-cluster recorded response value is the consensus (e.g., average) of the response values yi for each member of the cluster. Training involves clustering the points in data matrix X and then computing
each cluster’s response. Assuming the closest centroid to z is ξ c , the predicted value ŷ is
    ŷ = (1/nc) Σ_{ξ(xi)=c} yi        (14.4)

where nc is the number of points in cluster c and ξ(xi) = c means that the i-th point is assigned to cluster c.
14.6.1 Training
The train method first clusters the points/rows in data matrix X by calling the train method of a clustering algorithm (e.g., clust = KMeansClusterer (...)). It then calls the assignResponse method to assign a consensus (average) response value for each cluster.
The computed consensus values are stored in yclus, so that the predict method may simply use the
underlying clustering algorithm to classify a point z to indicate which cluster it belongs to. This is then
used to index into the yclus vector.
Class Methods:
override def test (xx: MatrixD = x, yy: VectorD = y): (VectorD, VectorD) =
def classify (z: VectorD): Int = clust.classify (z)
override def predict (z: VectorD): Double = yclus (clust.classify (z))
def reset (): Unit =
override def buildModel (x_cols: MatrixD): Predictor =
14.6.3 Exercises
1. Apply ClusteringPredictor to the following combined data matrix.
// x0 x1 y
val xy = MatrixD ((10, 3), 1, 5, 1, // joint data matrix
2, 4, 1,
3, 4, 1,
4, 4, 1,
5, 3, 0,
6, 3, 1,
7, 2, 0,
8, 2, 0,
9, 1, 0,
10, 1, 0)
14.7 Hierarchical Clustering
One critique of K-Means Clustering is that the user chooses the desired number of clusters (k) beforehand. With modern computing power, several values for k may be tried, so this is less of an issue now. There is, however, a clustering technique called Hierarchical Clustering [87] where this is a non-issue.
In ScalaTion, the HierClusterer class starts with each point in the data matrix X forming its own cluster (m clusters). In each iteration, the algorithm will merge the two closest clusters into one larger cluster, thereby reducing the number of clusters by one. A sketch of the train method is shown below.
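This is a simplified sketch, not ScalaTion's actual source; clust, closestClusters and finalAssignments are hypothetical helpers:

import scala.collection.mutable.{Set => MSet}

private val clust = MSet [Set [Int]] ()                    // current clusters, as sets of point indices

def train (): Unit =
    for i <- x.indices do clust += Set (i)                 // each point starts in its own cluster
    while clust.size > k do
        val (ci, cj) = closestClusters ()                  // hypothetical: find the two closest clusters
        clust -= ci; clust -= cj                           // remove them
        clust += ci union cj                               // add their merger (one fewer cluster)
    end while
    finalAssignments ()                                    // hypothetical: set assignments, calc centroids
end train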
After reducing the number of clusters to the desired number k (which defaults to 2), final cluster assignments are made and centroids are calculated. Intermediate clustering results are available, making it easier for the user to pick the desired number of clusters after the fact. The algorithm can then be rerun with this value for k.
Class Methods:
14.7.2 Exercises
1. Compare HierClusterer with KMeansClustererHW and KMeansClusterer for a variety of datasets,
starting with the six points given in the KMeansClusterer section (Exercise 1). Compare the quality
of the solution in terms of the fraction of optimal clusterings and the mean of the sse over the NTESTS
test cases.
2. K-Means Clustering techniques often tend to produce better clusters (e.g., lower sse) than Hierarchical
Clustering techniques. For what types of datasets might Hierarchical Clustering be preferred?
14.8 Markov Clustering
The MarkovClusterer class implements a Markov Clustering Algorithm (MCL) and is used to cluster nodes
in a graph. The graph is represented as an edge-weighted adjacency matrix (a non-zero cell indicates nodes
i and j are connected).
The primary constructor takes either a graph (adjacency matrix) or a Markov transition matrix as input.
If a graph is passed in, the normalize method must be called to convert it into a Markov transition matrix.
Before normalizing, it may be helpful to add self loops to the graph. The matrix (graph or transition) may
be either dense or sparse. See the MarkovClusteringTest object at the bottom of the file for examples.
Class Methods:
14.8.2 Exercises
1. Draw the directed graph obtained from the following adjacency matrix, where g(i, j) == 1.0 means
that a directed edge exists from node i to node j.
1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0,
0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0,
1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
Apply the MCL Algorithm to this graph and explain the significance of the resulting clusters.
Part III
Simulation
Chapter 15
Simulation Foundations
ScalaTion supports multi-paradigm modeling that can be used for simulation, optimization and analytics.
The focus of this chapter is simulation modeling. Viewed as a black-box, a simple model maps an input
vector x and a scalar time t to an output/response vector y,
    y = f(x, t) + ε        (15.1)

where ε is the error/residual vector.
A simulation model typically adds to these the notion of state, represented by a vector-valued function
of time x(t). External input (e.g., an external driving force) is now renamed to u(t) and the error vector
is replaced with a noise process w(t) (e.g., a Gaussian white noise process). Commonly, simulations model
systems that evolve over time.
[Figure: block diagram — input u(t) feeds the system state x(t), which produces the output y(t).]
Knowledge about a system or process is used to define state as well as how state can change over time.
Theoretically, this should make such models more accurate, more robust, and have more explanatory power.
Ultimately, we may still be interested in how inputs affect outputs, but to increase the realism of the model
with the hope of improving its accuracy, much attention must be directed in the modeling effort to state
and state transitions. This is true to a degree with most simulation modeling paradigms or world views.
Once a simulation model has been validated, one of its strengths is that it can be used to address what-if questions. What if we add another lane to an interstate highway? Will this lead to reduced traffic congestion
and reduced travel times? Such capabilities allow simulation models to play a larger role in prescriptive analytics. This can be taken farther with simulation optimization, which can seek improvements to systems.
This chapter focuses on the foundations necessary for creating simulation models, as well as some simulation modeling techniques that can be performed without substantial software.
The following textbooks on Discrete-Event Simulation are recommended:
1. Discrete-Event System Simulation, 5th Edition, J. Banks, J. Carson, B. Nelson and D. Nicol, 2010 [12].
15.1 Basic Concepts
The following basic concepts are common to many types of simulation models.
• Simulation Model: A simulation model consists of a collection of entities that interact with each other. The model may be viewed as a simplified version of a system (existing or imagined). The model should be useful for description, prediction, and/or prescription. For improved explainability, it is often desirable that the model mimic the behavior of the real system.
• Entity: An entity is an identifiable object in a simulation model, e.g., a customer entering a bank, or
a vehicle traveling on a road. One may think of an entity having a trajectory in time and space.
• Attribute: An attribute is a property of an entity that is relevant for the model, e.g., the speed and
weight of the vehicle.
• State: The current values for all the variables in the model (or attribute values for all entities). These may be collected into a state vector x(t) that evolves over time. The Restorable State of the model may be thought of as a sufficient recording of the execution of the model so far. This would allow a snapshot to be saved and later restored for continued execution. For some models, recent history needs to be maintained along with the current state. For Markov Models, the state only depends on current values, i.e., the future conditioned on the present is independent of the past. For such models, the two notions of state are identical.
• Event: An event is an instantaneous occurrence that has the potential to change the state of the
system being modeled, e.g., the arrival of customer at a bank. It may also trigger other events to occur
in the future.
• Simulation Clock: To keep track of the advancement of time, a simulation clock is maintained. For
continuous-time simulation, time smoothly advances in small increments. For discrete-time simulation,
time advances by one time-unit (e.g., set to 1) for each tick of the clock. For discrete-event simulation,
time jumps from the “event time of the current event” to the “event time of the next event” (any
intermediate time is skipped).
• Activity: Entities in a model undergo activities that start on one event, have a duration described by
a random variable, and end on another event, e.g., a customer being served by a bank teller.
• Indefinite Delay: An entity must wait for a server or other resource to become available. As this delay depends on other entities, the delay is not definite as it is for activities, e.g., waiting time in a queue.
15.2 Types of Models
While many modeling techniques, such as Regression, focus on predicting expected values,
ŷ = E [y|x] (15.3)
simulation models generate data. Then the techniques discussed in this text can be used to analyze the data instances produced as output of the simulation model.
where λ is the arrival rate and µ is the service rate (not to be confused with the mean).
The queue can be modeled as a Continuous-Time Markov Chain and the expected waiting time in the queue can be determined. Unfortunately, as queuing systems become more complex, analytic solutions may not be available.
A more general, although less efficient, approach is to simulate customers making their way through the queue. They arrive at a certain time ta, begin service at time ts and depart at time td. Averages for m customers may then be used to estimate the expected waiting times Tq and service times Ts.
The construction of such simulation models is straightforward. One approach is based on the observation that the number of customers in a queueing system remains constant, except at special points in time where an event occurs. Changes to the state of the system (e.g., the number of customers) can only occur at events.
The simulation therefore consists of programming logic that indicates what happens when an event occurs.
This is the first major paradigm for simulation modeling and is referred to as event-scheduling.
For complex simulations, the logic may become fragmented, so an alternative is to track active entities
in the systems and give the logic that they follow. For example, one may think of a customer entering a
bank as an actor following a script. The programming logic for the simulation model then becomes writing
scripts of each type of actor. This is the second major paradigm for simulation modeling and is referred to
as process-interaction.
There are additional paradigms for simulation modeling that will be discussed later.
15.3 Random Number Generation
Let us imagine you are one of the actors following the script that you were given and you notice that all
actors are doing exactly the same thing, following the same steps, take the same routes through the system
and experiencing the same service times. Clearly, such simulations would not mimic reality. There should
be uncertainty or variability in behavior. This is accomplished by introducing randomness.
The question becomes how to introduce randomness in a deterministic digital computer. In particular, one wants to generate a sequence of random numbers of the following form.

    ri ∼ UID(0, 1)        (15.6)

The random numbers should be Uniformly and Independently Distributed. A Random Number Generator (RNG) may be used to produce such numbers.
//:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
/** Return the modulus used by this random number generator.
*/
def getM: Double = M.toDouble
//:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
/** Return the next random number as a ‘Double‘ in the interval (0, 1).
* Compute x_i = (x_i-1 + 1) % m using x = (x + 1) % m
*/
inline def gen: Double = { x = (x + 1) % M; x * NORM }
//:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
/** Return the next stream value as an ‘Int‘ in the set {1, 2, ... , m-1}.
* Compute x_i = (x_i-1 + 1) % m using x = (x + 1) % m
*/
inline def igen: Int = { x = (x + 1) % M; x }
end Random0
The Random0 class will produce numbers in the interval [0, 1) when the gen method is called. The period of
the generator (before it repeats itself) is equal to M. So far, so good. Before using this generator, a battery
of tests should be applied to check its suitability.
The Means Test simply computes several means by averaging sub-sequences of the random number stream.
Suppose the test has a sample of n means from sub-sequences of length m.
    µi = (1/m) Σ_{j=0}^{m−1} r_{im+j}    for i = 0, . . . , n − 1        (15.7)
Due to the Central Limit Theorem, the means should be Normally distributed and the sample should have the following expected value and variance.

    E[µi] = (1/m) m E[ri] = 0.5        (15.8)

    V[µi] = (1/m²) m V[ri] = 1/(12m)        (15.9)
Distribution Test
The Distribution Test determines how well sub-streams of the generated random numbers are distributed over the unit interval. Are they uniformly spread out over the interval or concentrated in certain regions? This can be assessed by a Goodness-of-Fit Test, e.g., the Chi-square Goodness-of-Fit Test or the Kolmogorov-Smirnov Goodness-of-Fit Test (see the exercises).
The Chi-square Goodness-of-Fit Test checks how well a histogram from a subsequence of length m
matches the density function of the Uniform(0, 1) distribution. Each interval, say Ij , in the histogram will
have an observed oj and expected ej number of generated random numbers within interval Ij .
f is the probability density function (pdf) and in this case it is the pdf for the Uniform (0, 1) distribution.
Suppose there are n intervals, then ej = m/n, while oj is determined by a counter that is incremented
whenever a random number ri is generated that is within interval Ij . The Chi-square test statistic is then
    χ² = Σ_{j=0}^{n−1} (oj − ej)² / ej        (15.10)
When χ² > χ²_{α,n−1}, the sample suggests the distribution is not Uniform, where α is the significance level (e.g., .95).
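As a sketch, the test statistic can be computed directly from vectors of observed and expected counts (a hypothetical helper, not part of ScalaTion):

import scalation.mathstat.VectorD

def chiSquareStat (o: VectorD, e: VectorD): Double =
    val d2 = (o - e).map (z => z * z)                      // squared deviations (o_j − e_j)²
    (d2 / e).sum                                           // Σ (o_j − e_j)² / e_j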
Auto-Correlation Test

There are many tests related to checking correlation. As with time series, the auto-correlation may be examined by looking at a Correlogram that indicates the Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF) for increasing lags. The k-lag auto-correlation ρk is given by

    ρk = C[rj, rj+k] / V[rj]    for any j        (15.11)

Ideally, ρ0 = 1 and the rest are close to zero, indicating lack of correlation. Note, the above formula for ρk assumes stationarity (see the chapter on time series).
The Random0 class fails all three tests, see the exercises.
Extensive testing has been used to find good values for the constant A. One example is A = 16807. Random number generators of this form are known as Multiplicative Linear Congruential Generators (MLCG). This value for A is an example of a primitive element modulo M, allowing the generator to exhibit a full period, period = M-1. In other words, the generator will produce all stream values from 1 to M-1, inclusive, before repeating itself. In ScalaTion, this generator is available in the Random3 class.
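A minimal MLCG sketch using these constants is shown below (illustrative only, not the Random3 source; the product A·x is computed in Long to avoid Int overflow):

class MLCG (seed: Int = 1):
    private val M = 2147483647L                            // modulus 2^31 − 1 (a Mersenne prime)
    private val A = 16807L                                 // multiplier (a primitive element mod M)
    private var x = seed.toLong                            // current stream value in {1, ..., M−1}
    def gen: Double = { x = (A * x) % M; x / M.toDouble }  // x_i = A x_{i−1} mod M, scaled to (0, 1)
end MLCG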
The next step up (longer period and better properties) is a Multiple Recursive Generator (MRG). These can be combined for further improvement into a Combined Multiple Recursive Generator (CMRG). ScalaTion's Random class is an example of a CMRG developed by L'Ecuyer and Touzin [107].
15.3.4 Exercises
1. Apply the above three tests to the Random0, Random and Random3 random number generators. See the
RNGTester object in the scalation.random package for the test methods: meansTest, distributionTest,
and correlationTest.
2. Discuss additional tests that are applied to assess the quality of a random number generator.
3. Explain how Multiple Recursive Generators (MRG) work and what advantages they may have over
Linear Congruential Generators (LCG).
15.4 Random Variate Generation
With a good quality random number generator as a foundation, Random Variate Generators (RVG) for a
variety of probability distributions can be created.
Suppose y ∼ F and r ∼ Uniform(0, 1). Now let F be a general uniform distribution, Uniform(a, b).
[Figure: CDF F(y) of the Uniform(a, b) distribution for a = 2, b = 4.]
Conceptually, the Inverse Transform Method generates a random number r and uses it to select the height of a horizontal line in the diagram. Drop a vertical line from where it intersects the Cumulative Distribution Function (CDF) F. The position on the horizontal axis is the value for the random variate y. For example, if r = 0.5, then y = 3.0.
Mathematically, the relationship is

    F(y) = r        (15.16)

    y = F⁻¹(r)        (15.17)

The CDF for the general uniform distribution is F(y) = (y − a)/(b − a), so

    y = a + (b − a) r        (15.18)
Exponential Distribution
The Inverse Transform Method also works well for the Exponential(λ) distribution, where the CDF is F(y) = 1 − e^{−λy}. Setting F(y) = r and solving for the inverse yields,

    F(y) = 1 − e^{−λy} = r
    e^{−λy} = 1 − r
    e^{−λy} = r
    −λy = ln(r)
    y = −ln(r) / λ
The third step relies on the fact that 1 − r and r have the same distribution.
To illustrate its use, for example with λ = 1, if the generated random number is r = 0.5, then the generated random variate is y = 0.693.
[Figure: CDF F(y) of the Exponential(1) distribution.]
There are numerous Random Variate Generators in ScalaTion that extend the Variate abstract class,
including the Exponential class.
    val mean = mu

    def pf (z: Double): Double = if z >= 0 then l * exp (-l*z) else 0.0

end Exponential
Note, since the mean is the reciprocal of the rate, µ = 1/λ, the simple gen method -mu * log (r.gen) produces exponentially distributed random variates.
case class Binomial (p: Double = .5, n: Int = 10, stream: Int = 0)
extends Variate (stream):
if p < 0.0 || p > 1.0 then flaw ("constructor", "parameter p must be in [0, 1]")
if n <= 0 then flaw ("constructor", "parameter n must be positive")
_discrete = true
private val q = 1.0 - p // probability of failure
private val p_q = p / q // the ratio p divided by q
private val coin = Bernoulli (p, stream) // coin with prob of success of p
val mean = p * n
def pf (z: Double): Double = { val k = z.toInt; if z == k then pf (k) else 0.0 }
def pf (k: Int): Double = if k in (0, n) then choose (n, k) * p~^k * q~^(n-k) else 0.0
579
override def pmf (k: Int): Array [Double] =
val d = Array.ofDim [Double] (n+1) // array to hold pmf distribution
d(0) = q~^n
for k <- 1 to n do d(k) = d(k-1) * p_q * (n-k+1) / k.toDouble
d
end pmf
end Binomial
It evaluates the given formula a total of n times and returns the sum.
where the reciprocal of the constant c ≥ 1 indicates the probability of acceptance. The procedure is to generate a random value y from the simple distribution and, depending on the following ratio

    f(y) / (c g(y))        (15.20)

randomly keep it, with acceptance more likely the closer the ratio is to 1, i.e.,

    if f(y) / (c g(y)) ≥ r then accept else reject        (15.21)

where r is a random number. Rejection means to keep trying.
15.4.4 Exercises
1. Consider a distribution where the density linearly increases on the interval [0, b]. The pdf for this
distribution is the following:
    f(y) = 2y / b²    on [0, b]
Use the Inverse Transform Method (ITM) to generate random variates following this distribution.
(a) Determine the Cumulative Distribution Function (CDF) F (y).
(b) Determine the inverse Cumulative Distribution Function (iCDF) F −1 (r).
(c) Write code for the gen method.
(d) Create a case class to contain the gen method and produce a Histogram that shows how the
generated random variates are distributed.
2. Use the convolution method to generate random variates following the Erlang(λ, k) distribution, where
λ is the rate parameter and k is the number of events. The random variable can be used to measure the
time for k events to occur. When k = 1, it reduces to the Exponential distribution. Since an Erlang
random variable is the sum of k independent exponential random variables, the convolution method
may be applied. The pdf for the Erlang distribution is shown below.
    f(y) = λ^k y^{k−1} e^{−λy} / (k − 1)!    on [0, ∞)
3. Test the Convolution Method for generating Binomial random variates for p = .5 and n = 4. Generate
10,000 random variates and show the histogram.
4. Consider the Standard Normal distribution for positive values of y. Its density function can be bounded
using c times an Exponential density function. Flipping a coin allows the generation of negative values.
Apply the Acceptance-Rejection method to generate Standard Normal random variates.
Hint: see http://www.columbia.edu/~ks20/4703-Sigman/4703-07-Notes-ARM.pdf [171].
5. Consider a distribution with density on the interval [0, 2]. Let the probability density function (pdf)
for this distribution be the following:
    f_y(y) = y / 2    on [0, 2]
Use the Inverse Transform Method (ITM) to generate random variates following this distribution.
(i) Determine the inverse Cumulative Distribution Function (iCDF) F_y⁻¹(r). Recall r denotes a random number.
(ii) Write code for the gen method for its Random Variate Generator (RVG).
(iii) Draw the CDF Fy (y) vs. y and illustrate how the ITM works in this case.
15.5 Poisson Process
In this section, the relationships between three random variables are examined. Consider a system in which
events (e.g., arrivals) occur randomly, but at a constant rate λ. It is further assumed that the time to the
next arrival is independent of the previous arrival.
• Inter-arrival Time. T = the time interval between subsequent arrivals/events. As the time duration
∆t becomes arbitrarily small, the probability of an arrival equals the arrival rate multiplied by the
time span.
$$ S_n = \sum_{i=1}^{n} T_i \qquad (15.23) $$
• Counting Process. N(t) = the count of the number of events (e.g., arrivals) by time t.
Due to the independence assumption, the probability of no arrivals by time t + ∆t is the product of the following two probabilities.
$$ \frac{d}{dt}\bar{F}_T(t) = -\lambda\, \bar{F}_T(t) \qquad (15.31) $$
Both sides may now be integrated.
$$ \int \frac{d\bar{F}_T(t)}{\bar{F}_T(t)} = \int -\lambda\, dt \qquad (15.32) $$
The integral of a reciprocal introduces a natural logarithm.
See the exercises for more details. Switching back to the regular CDF shows that the inter-arrival time follows the Exponential distribution.
Consequently, Sn follows the Erlang distribution (see the exercises from the last section).
The counting process, N (t), is a Poisson Process with the following pmf:
$$ p_{N(t)}(n) = \frac{(\lambda t)^n}{n!}\, e^{-\lambda t} \qquad (15.36) $$
The probability of no arrivals by time t is $p_{N(t)}(0) = e^{-\lambda t}$, which matches the survival function $\bar{F}_T(t)$ of the inter-arrival time.
First is the event/arrival times, e.g., the times that vehicles pass a road sensor. The gen method will produce the arrival times from time zero up to the terminal/end time of the simulation t. These arrival times are returned in a vector. As the inter-arrival time distribution t_ia is Exponential (mu, stream) where mu = 1.0 / lambda, it is used to generate each time increment.
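A minimal sketch of such a gen method, assuming plain scala.util.Random and an ArrayBuffer in place of ScalaTion's Exponential and VectorD:

import scala.collection.mutable.ArrayBuffer
import scala.math.log
import scala.util.Random

// Accumulate Exponential (mu) inter-arrival times until the terminal time t
// is passed; return the arrival times generated so far.
def genArrivalTimes (t: Double, lambda: Double, stream: Int = 0): ArrayBuffer [Double] =
    val r     = new Random (stream)
    val mu    = 1.0 / lambda                             // mean inter-arrival time
    val atime = ArrayBuffer [Double] ()
    var now   = -mu * log (1.0 - r.nextDouble ())        // time of the first arrival
    while now <= t do
        atime += now
        now += -mu * log (1.0 - r.nextDouble ())         // add the next time increment
    atime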
15.5.2 Generating a Non-Homogeneous Poisson Process
A simulation using a constant arrival rate to model traffic flow will be of low fidelity. Vehicle arrival rates
vary dramatically over a day, with low vehicle counts at night and high spikes during morning and late
afternoon rush hours.
To create more realistic, or higher fidelity, simulation models, a Non-Homogeneous Poisson Process
(NHPP) may be used. The extension can be accomplished by converting the constant λ to a function of
time λ(t). In ScalaTion, the NH PoissonProcess can be used to generate arrivals where the arrival rate
is given by the lambdaf function.
The gen method must be overridden to adjust the time jump based on the current arrival rate. Fortunately, this can be done by dividing an Exponential (1) random variate by the current arrival rate.
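The following hedged sketch shows the adjusted time advance (again using plain scala.util.Random rather than ScalaTion's classes); note that this first-order adjustment is an approximation when lambdaf varies rapidly within a single jump:

import scala.collection.mutable.ArrayBuffer
import scala.math.log
import scala.util.Random

// Each time jump is an Exponential (1) random variate divided by the current
// arrival rate lambdaf (now), so high rates produce closely spaced arrivals;
// lambdaf is assumed to be strictly positive.
def genArrivalTimesNH (t: Double, lambdaf: Double => Double, stream: Int = 0): ArrayBuffer [Double] =
    val r     = new Random (stream)
    val atime = ArrayBuffer [Double] ()
    var now   = 0.0
    var done  = false
    while ! done do
        now += -log (1.0 - r.nextDouble ()) / lambdaf (now)
        if now <= t then atime += now else done = true
    atime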
15.5.3 Exercises
1. Plot the pdf of the Erlang(1, k) distribution, for k = 2 and k = 3.
2. Compare the above pdf with a histogram of the time for the second arrival (k = 2) and the third arrival
(k = 3). See the Histogram class in the mathstat package.
3. Call the num method for several time points and plot it (N (t)) versus time t.
4. For the vehicle arrival process simulation, collect data on the number of arrivals every 5 minutes.
Create a histogram to show the distribution.
5. One way to create a time-dependent lambda function lambdaf is to take a data file, say consisting
of traffic counts, e.g., travelTime.csv in the data directory. Then create a Polynomial Regression
model using PolyRegression from the modeling package. Finally, define a lambdaf that calls the
Polynomial Regression model predict method.
For the ODE

$$ \frac{d}{dt}\bar{F}_T(t) = -\lambda\, \bar{F}_T(t) $$

the rate of decrease in the cCDF is proportional to its value. This is analogous to the phenomenon of radioactive decay, where the rate of decay is proportional to the amount of radioactive material. Both are described by the above Ordinary Differential Equation (ODE). The following steps may be followed to solve the ODE.

Step 1: Separation of Variables

$$ \frac{d\bar{F}_T}{\bar{F}_T} = -\lambda\, dt $$

Step 2: Integrate Both Sides

$$ \int \frac{d\bar{F}_T}{\bar{F}_T} = \int -\lambda\, dt $$

Step 3: Use the Fact that the Derivative of $\ln x$ is $\frac{1}{x}$

$$ \ln \bar{F}_T + C = -\lambda t $$

Step 4: Determine the Constant of Integration C using the Initial Condition (IC): $\bar{F}_T(0) = 1$

$$ \ln 1 + C = 0 $$

Since C = 0, we have,

$$ \ln \bar{F}_T = -\lambda t $$
$$ \bar{F}_T = e^{-\lambda t} $$
Verify that the solution is correct by showing that it satisfies both the ODE and the IC.
15.6 Monte Carlo Simulation
Many problems can be addressed by drawing a sample and determining whether it satisfies some criterion.
For example, draw five cards and determine whether the hand is a full house (three-of-a-kind and pair/two-of-a-kind). If this process is repeated enough times, an estimate for the probability of a full house may be
obtained.
• The face-value is 1 (Ace), 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, or King. The rank (low to high) usually moves the Ace to the high end of the list.
• The suit is ordered (low to high) in some games like bridge: Clubs (♣), Diamonds (♦), Hearts (♥),
and Spades (♠).
The deck will be shuffled using the shuffle method to randomize the positions of cards in the deck. The
draw method pulls the next card from the top of the deck. Calling it five times yields a poker hand. A
counter can be incremented in case the hand is a full house. The ratio of this counter to the number of
hands dealt becomes an estimate for the probability of a full house.
class Cards:
//::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
/** Draw the top card from the deck and return it. Return -1 if no cards are
* left in the deck.
*/
def draw (): Int = if top == NUM_CARDS then -1
else { val c = card(top); top += 1; c }
//::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
/** Shuffle the deck of cards.
*/
def shuffle (): Unit =
for i <- card.indices do swap (card, i, rn.igen)
top = 0
end shuffle
//::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
/** Convert the card number c (0 to 51) to the face value (1 to 13) and
* suit (0(C), 1(D), 2(H), 3(S)).
* @param c the card ordinal number
*/
def value (c : Int): (Int, Char) = (c % 13 + 1, suit (c / 13))
//::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
/** Convert the card deck to a string.
*/
override def toString: String = "Cards ( " + stringOf (for c <- card yield value (c)) + " )"
end Cards
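A hedged usage sketch of the counting loop described above (it assumes the remaining members of Cards -- the card array, top cursor, NUM_CARDS and the random index generator rn -- as in ScalaTion; the helper isFullHouse is an illustrative name):

// A hand is a full house when its face-value counts are exactly {2, 3}.
def isFullHouse (faces: Seq [Int]): Boolean =
    faces.groupBy (identity).values.map (_.size).toSeq.sorted == Seq (2, 3)

@main def fullHouseTest (): Unit =
    val deck  = new Cards ()
    val reps  = 1000000                                  // number of hands to deal
    var count = 0
    for it <- 1 to reps do
        deck.shuffle ()
        val faces = for j <- 1 to 5 yield deck.value (deck.draw ())._1
        if isFullHouse (faces) then count += 1
    println (s"estimated P(full house) = ${count.toDouble / reps}")   // near 0.00144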
The estimate of the probability of a full house can be checked against probability theory. The total number
of hands consisting of five cards is
$$ \binom{52}{5} = \frac{52 \cdot 51 \cdot 50 \cdot 49 \cdot 48}{1 \cdot 2 \cdot 3 \cdot 4 \cdot 5} = 2{,}598{,}960 $$
The number of ways nways to get a full house may be determined as follows: (a) choose 1 of 13 face values for the three-of-a-kind, (b) choose 1 of 12 remaining face values for the pair, (c) choose 3 of 4 suits for the three-of-a-kind, and (d) choose 2 of 4 suits for the pair, i.e.,

$$ \binom{13}{1} \binom{12}{1} \binom{4}{3} \binom{4}{2} = 13 \cdot 12 \cdot 4 \cdot 6 = 3{,}744 $$
Letting y be the random draw of five cards, the probability of a full house is simply the ratio of the two
numbers.
$$ P(y = \text{full house}) = \frac{3{,}744}{2{,}598{,}960} = 0.00144 $$
As discussed earlier, there is no closed-form formula for the Cumulative Distribution Function (CDF) for the Normal distribution. As the CDF is the integral of the probability density function (pdf), numerical integration can be used to compute it. Monte Carlo integration offers one approach for doing this, and it is practically effective for multiple integrals in higher dimensions.
To illustrate, consider the following one dimensional function defined on the domain [0, 1].
$$ y = f(x) = \sqrt{1 - x^2} $$
[Figure: Integration: Area Under the Curve — plot of y = f(x) for x ∈ [0, 1]]
The integral is the area under the curve. This is the same as the mean height of the curve times the
length/size of the domain (1 in this case). The mean height of a function may be estimated by computing
the height at many randomly selected points and taking the average.
$$ \bar{y} = \frac{1}{m} \sum_{i=0}^{m-1} f(x_i) = \frac{1}{m} \sum_{i=0}^{m-1} \sqrt{1 - x_i^2} $$
In ScalaTion, this capability is provided by the MonteCarloIntegration.integrate method.
def integrate (f: FunctionS2S, a: Double, b: Double, m: Int, s: Int = 0): Double =
val length = b - a
val x = Uniform (a, b, s)
var sum = 0.0
for it <- 0 until m do sum += f(x.gen)
sum * length / m
end integrate
The particular function f is passed into integrate in the code below. As the function traces the unit circle
in the first quadrant, multiplying by 4 allows π to be approximated.
import MonteCarloIntegration.integrate
end monteCarloIntegrationTest
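Since the body of the test is elided above, the following self-contained, plain-Scala stand-in (illustrative names, not ScalaTion's actual test) shows the idea:

import scala.util.Random

// Plain-Scala stand-in for MonteCarloIntegration.integrate: average the height
// of f at m uniformly random points in [a, b], times the length of the domain.
def integrateMC (f: Double => Double, a: Double, b: Double, m: Int, s: Int = 0): Double =
    val r   = new Random (s)
    var sum = 0.0
    for it <- 0 until m do sum += f (a + (b - a) * r.nextDouble ())
    sum * (b - a) / m

@main def mcPiTest (): Unit =
    val est = 4.0 * integrateMC (x => math.sqrt (1.0 - x * x), 0.0, 1.0, 1000000)
    println (s"pi estimate = $est")                      // should be near 3.14159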
The case class Uniform is a random variate generator from the random package. Uniform (a, b, s) generates (via the gen method) uniformly distributed random numbers in the interval [a, b] using random number stream s.
private val grain = RandomVecD (2, max = 1, min = -1, stream = stream)
//::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
/** Return the fraction of grains found inside the unit circle.
* @param n the number of grains to generate
*/
def fraction (n: Int): Double =
var count = 0
for i <- 0 until n do if grain.gen.normSq <= 1.0 then count += 1
count / n.toDouble
end fraction
end GrainDrop
The fraction method counts the number of generated grains that are inside the unit circle and divides that count by the total number of grains generated. For a good random number generator, this
fraction should correspond to the ratio of the areas for the unit circle versus the bounding square {(x, y) :
x ∈ [−1, 1], y ∈ [−1, 1]}. The area of the square is 4 so the fraction times 4 should provide an estimate for π.
end grainDropTest
15.6.4 Simulation of the Monty Hall Problem
Imagine you are a contestant on the Let's Make a Deal game show and the host, Monty Hall, asks you to select door number 0, 1 or 2, behind which are two worthless prizes and one luxury car. Whatever door you pick, he randomly opens one of the other non-car doors and asks if you want to stay with your initial choice or switch to the remaining door. What are the probabilities of winning if you (a) stay with your initial choice, or (b) switch to the other door? Finish the code below to validate your results.
end MontyHall
Note, since the opened door never has the car behind it, the car must be behind either the originally picked
door (stay) or the remaining door (switch). Hence, the form of the above if then else statement.
15.6.5 Exercises
1. The hands in Five-Card Draw Poker are the following: (1) high card, (2) pair, (3) two pair, (4) three-
of-a-kind, (5) straight, (6) flush, (7) full house, (8) four-of-a-kind, (9) straight flush, and (10) royal
flush.
Use probability theory to determine the probabilities of each Poker hand. Do the same thing using
Monte Carlo simulation and compare the results. Let the number of repetitions (hands drawn) increase
until the estimates stabilize. Hint: use a large number of samples.
2. Use Monte Carlo simulation to integrate the CDF for the Standard Normal Distribution at 1, Fy (1).
Note, the distribution is symmetric around zero, so the following integral may be computed.
$$ \text{area} = \int_0^1 \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy $$
[Figure: Integration: Area Under the Curve — plot of the Standard Normal pdf y = f(x)]

The solution for $F_y(1)$ will be $\frac{1}{2} + \text{area}$. Check your answers by calling CDF.normalCDF (1) in the random package.
3. Finish the coding of the Simulation of the Monty Hall Problem. Determine the winning percentages
for the Stay and Switch strategies as the number of game simulations increases. Do they converge?
Explain why one strategy is better than the other.
4. Estimate the probability distribution for rolling three 6-sided dice and taking their sum
y = x1 + x2 + x3
where xi ∼ Randi(1, 6), i.e., pxi (k) = 1/6 for k = 1, 2, 3, 4, 5, 6. The probability mass function (pmf) for y has a range from 3 to 18. Use Monte Carlo Simulation to estimate py (k) for k = 3, 4, . . . , 18. Draw/plot the pmf py (k) vs. k.
5. When rolling nd six-sided dice, how many ways can a four be rolled, for nd = 1, 2, 3, 4 dice?

Table 15.1: Counting the Number of Ways to Roll a Four (sum of nd dice)

Use Monte Carlo simulation to estimate nways and the probability mass function py (k) for nd = 1, 2, 3, 4 dice.
6. The number of ways nways can also be solved using the following recursive function for the sum s from
nd to 6nd .
$$ nways(n_d, s) = \sum_{k=1}^{6} nways(n_d - 1, s - k) $$
The base case for the recursion is when nd = 1, in which case nways (1, s) = 1 for s = 1, . . . , 6.
Write a program to calculate nways for nd = 1, 2, 3, 4 dice. Note, the recursive function may be
computed more efficiently using dynamic programming.
7. Question 1: Develop a Monte Carlo simulation to estimate the volume of a unit sphere (radius equal
to one). What is your estimate? Also, provide your code. Hint: a point is inside the sphere when
x2 + y 2 + z 2 ≤ 1.
15.7 Hand Simulation
Before using or developing software to perform simulation, it is instructive to carry out a simple simulation
by hand.
This can be done as follows: Generate random variates for inter-arrival and service times for m = 10
customers. Fill in these two columns in Table 15.2. Use Exponential(1/λ) for inter-arrival times and
Exponential(1/µ) for service times. Let λ = 10 and µ = 12 per hour, giving means of 6 and 5 minutes,
respectively. Also, fill in the zeroth row, for a non-existent customer, with all zeros.
Notice that the sample mean inter-arrival time and sample mean service time shown in the last row should correspond to the theoretical means of 6 and 5 (keeping in mind the inaccuracy of small samples).
The hand simulation part now begins. Filling in the table row-by-row requires the event logic to be
followed. A customer cannot begin service until the previous customer has departed. They must wait in the
queue until the server is available. The time between arrival and beginning of service is the wait time. The
inter-arrival time indicates the time gap for the next arriving customer.
Given the inter-arrival (iarrival) times (ιi ) and service times (si ), the equations below may be applied to
provide values for the empty columns.
Note, the columns in bold correspond to events. Before looking at the completed table, try to fill in the
previous one.
Table 15.3: Completed Hand Simulation of M/M/1 Queue
Define Σq , Σs and Σy to be the sums of the waiting, service and system times, respectively. As shown in the table, the average times in the Queue, Service and sYstem are given by the formulas below.

$$ \lambda_e = \frac{m}{\tau} = \frac{10}{64.0} = 0.1563 \text{ per minute} = 9.375 \text{ per hour} \qquad (15.46) $$
Little's Law relates time averages to occupancy averages. As waiting queues get longer, one would expect the waiting time to increase. Using Little's Law (see the section on Markov Chains for more details),
the occupancy (number in the system) Ly is proportional to the time in the system Ty (and the same is true
for sub-components).
Ly = λe Ty (15.47)
The proportionality constant is λe . Little’s Law allows the following summary results for our M/M/1 Queue
simulation to be collected into Table 15.4.
Table 15.4: Formulas for M/M/1 Queueing Models
$$ L_y = \frac{1}{\tau} \int_0^\tau L_y(t)\, dt \qquad (15.48) $$
Notice that the function Ly (t) can only change at event times (the state of the system can only change at
these times). The event times are the following:
VectorD(0.0, 6.0, 9.0, 11.0, 14.0, 17.0, 18.0, 21.0, 24.0, 29.0,
34.0, 36.0, 38.0, 41.0, 45.0, 50.0, 55.0, 56.0, 64.0)
Event time 0.0 has the start simulation event, times 6.0 and 9.0 have arrival events, time 11.0 has a departure
event, times 21.0 and 29.0 have both arrival and departure events, time 64.0 has the last departure event.
As there is 1 start simulation event, 8 arrival events, 8 departure events and 2 dual events, there should be
a total of 19 event times.
For the above simulation, Ly (t) takes on the values 0, 1 or 2. When Ly (t) = 0 the server is idle, Ly (t) = 1
there is one customer in service and none waiting, and Ly (t) = 2 there is one customer in service and one
waiting.
Start with Ly (0) = 0; then ":" means no change, "+" means add 1, and "-" means subtract 1. In this way the value of Ly (t) can be determined for all event times.

0:, 6+, 9+, 11-, 14+, 17-, 18+, 21:, 24-, 29:,
34+, 36-, 38-, 41+, 45-, 50+, 55-, 56+, 64-
Check that the number of arrivals (+) equals the number of departures (-). Tracing through the list, one
can deduce the occupancy for the time intervals between the events.
• Ly (t) = 0: [0, 6], [38, 41], [45, 50], [55, 56]
• Ly (t) = 1: [6, 9], [11, 14], [17, 18], [24, 34], [36, 38], [41, 45], [50, 55], [56, 64]
• Ly (t) = 2: [9, 11], [14, 17], [18, 24], [34, 36]
Plot Ly (t) vs. t from 0.0 to 64.0 and use it to determine the area under the curve. The subtotals for each are 15, 36 and 13, so the integral sums to 0 · 15 + 1 · 36 + 2 · 13 = 62. Similar calculations yield results for Lq and Ls .
Lq = Σq /τ = 13/64 = 0.203
Ls = Σs /τ = 49/64 = 0.766
Ly = Σy /τ = 62/64 = 0.969
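The area computation can be captured in a short function; the following is a minimal, plain-Scala sketch (names are illustrative) that integrates a piecewise-constant occupancy function given the event times and their occupancy changes:

// Integrate L(t) over [0, tau] and divide by tau for the time average, where
// change(i) is the occupancy change (+1 arrival, -1 departure, 0 dual/start)
// applied at times(i).
def timeAverage (times: Array [Double], change: Array [Int]): Double =
    var area = 0.0
    var l    = 0                                         // current occupancy L(t)
    for i <- 0 until times.length - 1 do
        l += change(i)                                   // apply the event at times(i)
        area += l * (times(i+1) - times(i))              // constant until the next event
    area / (times.last - times.head)

Feeding in the 19 event times above with changes (0, +1, +1, -1, ...) yields 62/64 = 0.969 for Ly.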
Note, for 15 minutes there were no customers in the system, so the server idle time is 15 minutes, while the
server busy time is 64 - 15 = 49 minutes.
15.7.4 Exercises
1. Use a spreadsheet to plot Lq (t), Ls (t) and Ly (t) versus time from 0.0 to τ for the M/M/1 Queue.
Directly compute the area under the curve and then the average height.
2. Do the same plot for the M/M/1 Queue using PlotM from ScalaTion’s mathstat package.
3. Suppose there are now two servers and the arrival rate λ = 20 per hour. Redo all the calculations for this M/M/2 Queue.
5. Write code that takes the arrival and departure columns and produces all the event times in time order. Hint: set up an i-cursor for the arrival list and a j-cursor for the departure list; if the values at i and j are the same, copy that value into the event list and advance both cursors, otherwise copy the smaller value and advance its cursor. Do this in a while loop.
6. Recall that λe = m/τ . For the definitions given in this section for Lq , Ls , Ly , and Tq , Ts , Ty , show the
following forms of Little’s Law hold.
Lq = λe Tq (15.49)
Ls = λe Ts (15.50)
Ly = λe Ty (15.51)
7. Spreadsheet Simulation: A Small Fast Food Restaurant has two servers and enough space for three
customers to wait (at most five customers total at any given time). For the case of a single queue,
perform a spreadsheet simulation for m = 20 customer arrivals. Assume each server can process µ = 30
customers per hour and that the customer arrival rate λ = 75 customers per hour (assume Exponential
distributions). Each completed order gives a net profit (before paying the servers) of 2.00 dollars. Each
server makes 11.00 dollars per hour. Should the restaurant hire a third server? Explain in terms of
profit (after paying the servers) per hour. Give the simulation table and summary results produced by
the spreadsheet.
15.8 Tableau-Oriented Simulation
In tableau-oriented simulation models, each simulation entity’s event times are recorded in a row of a
matrix/tableau. For example in a Bank simulation, each row would store information about a particular
customer, e.g., when they arrived, how long they waited, their service time duration, etc. If 20 customers
are simulated, the matrix will have 20 rows (actually 24 since there are always 4 special rows). Average
waiting and service times can be easily calculated by summing columns and dividing by the number of
customers. This approach is similar to, but not as flexible as, Spreadsheet simulation. The complete code for this example may be found in Ex_Bank.scala in the scalation.simulation.tableau package.
end runEx_Bank
Note that it is important that the various random variate generators use “different random number streams”
to keep them independent. The stream number (0 to 999) specifies the combinations of seeds to use for the
random number generator. In this model, iArrivalRV uses stream, while serviceRV uses stream + 1.
• The zeroth row is a placeholder for the previous non-existent entity (the values are all zero).
• Rows 1 to m are for the entity timings, i.e., row i records times for the ith entity.
• The last three rows hold the column sums, sample averages and time averages, respectively.
The simulate method is used to evaluate the equations, row-by-row. A basic set of equations is provided in
the Model class that works for a collection of simple, related models. Other models will require the simulate
method to be overridden.
for i <- 1 to m do tab(i, 0) = i // ID-0
The columns are the same as those given in the Hand Simulation section.
end runQueue_MM1
The Known Random Variate Generator (RVG) simply repeats the given sequence of numbers.
override def simulate (startTime: Double): Unit =
var l = 0 // last established call
for i <- 1 to m do
tab(i, 1) = rv(0).gen // IArrival-1
tab(i, 2) = tab(i-1, 2) + tab(i, 1) // Arrival-2
if tab(l, 6) <= tab(i, 2) then // call established
tab(i, 3) = tab(i, 2); l = i // Begin-3
tab(i, 4) = tab(i, 3) - tab(i, 2) // Wait-4
tab(i, 5) = rv(1).gen // Service-5
tab(i, 6) = tab(i, 3) + tab(i, 5) // Departure-6
tab(i, 7) = tab(i, 6) - tab(i, 2) // Total-7
end if
end for
end simulate
Model developers may wish to copy the base simulate method code from Model and make minimal modifications to the equations. The only changes above are the introduction of the variable l, the addition of the if statement, and the modification to the Begin-3 equation.
15.8.4 Tableau.scala
The Model class supports tableau-oriented simulation models in which each simulation entity's events are recorded in tabular form (in a matrix). This is analogous to Spreadsheet Simulation (http://www.informs-sim.org/wsc06papers/002.pdf).
Class Methods:
class Model (name: String, m: Int, rv: Array [Variate], label_ : Array [String])
extends Modelable:
The report method displays the tab matrix and should be called after simulate. The last two rows
display useful averages such as average inter-arrival, waiting, service and system times. The summary method
displays averages for lengths of queues and waiting times, etc. It summarizes the number and time in
the queue (q), service (s) and system (y), i.e., L q, L s, L y, T q, T s, T y. The Model.occupancy
(mm1.timeLine ()) line gives the event times and corresponding values for Ly (t). The save method saves
the matrix in a .csv file that may be loaded into a spreadsheet for further processing.
The report method for the hand simulation problem outputs the following table where the last three
rows show the sums, sample averages and time averages.
-------------------------------------------------------------------------------------------------
ID-0 IArrival-1 Arrival-2 Begin-3 Wait-4 Service-5 Departure-6 Total-7
-------------------------------------------------------------------------------------------------
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
-------------------------------------------------------------------------------------------------
1.000 6.000 6.000 6.000 0.000 5.000 11.000 5.000
2.000 3.000 9.000 11.000 2.000 6.000 17.000 8.000
3.000 5.000 14.000 17.000 3.000 4.000 21.000 7.000
4.000 4.000 18.000 21.000 3.000 3.000 24.000 6.000
5.000 3.000 21.000 24.000 3.000 5.000 29.000 8.000
6.000 8.000 29.000 29.000 0.000 7.000 36.000 7.000
7.000 5.000 34.000 36.000 2.000 2.000 38.000 4.000
8.000 7.000 41.000 41.000 0.000 4.000 45.000 4.000
9.000 9.000 50.000 50.000 0.000 5.000 55.000 5.000
10.000 6.000 56.000 56.000 0.000 8.000 64.000 8.000
-------------------------------------------------------------------------------------------------
55.000 56.000 278.000 291.000 13.000 49.000 340.000 62.000
5.500 5.600 27.800 29.100 1.300 4.900 34.000 6.200
0.859 0.875 4.344 4.547 0.203 0.766 5.313 0.969
-------------------------------------------------------------------------------------------------
ID-0 IArrival-1 Arrival-2 Begin-3 Wait-4 Service-5 Departure-6 Total-7
-------------------------------------------------------------------------------------------------
15.8.5 Exercises
1. Run the bank simulation for many more customers and see if the averages begin to stabilize.
2. Suppose there are now two servers and the arrival rate λ = 20 per hour. Override the simulate method for this M/M/2 Queue. The logic will need to handle the fact that there are now two statistically identical servers. What do the new results indicate?
3. Now suppose the second server works 20% faster than the first server. What do the new results
indicate?
5. An M/M/1/1 Queue has one server and a system capacity of one (no space for waiting). Develop and
run a Tableau simulation for λ = 10 per hour and µ = 12 per hour. Redo for an M/M/2/2 Queue and
λ = 20 per hour.
Chapter 16
In dynamic models, the state of a system may be described by a state vector x(t). For example, a particle
may be tracked over time. In two-dimensional space, one might be interested in the height and down range
distance of the particle over time,

$$ x(t) = [x_0(t), x_1(t)] \qquad (16.1) $$

where $x_0(t)$ is the height of the particle and $x_1(t)$ is the horizontal distance travelled at time t.
The dynamics of the particle may be described by an n-dimensional vector-valued function of time.
f : R+ → Rn (16.2)
Such a function may be developed using physical laws that are expressed as a system of Ordinary Differential
Equations (ODEs) consisting of first-order time derivatives. When the system is linear, the time derivative
ẋ(t) equals an affine transformation of the current state x(t), i.e., the sum of a linear transformation of the
current state and a constant vector.
16.1 Example: Trajectory of a Ball in One-Dimensional Space
Consider the application of Newton's Laws of Motion to determine the trajectory of a ball in one dimension.
Suppose someone hits a golf ball with a driver straight up in the air and wishes to know how high it will go
and how long it will be in the air. Let x(t) = [y(t), v(t)] be the height of the ball y(t) and its velocity v(t)
at time t. Also, let the initial conditions be [0 m, 60 m/s], corresponding to almost hitting the ball from the
ground with initial upward velocity of 60 mps (approximately 134.2 mph).
$$ \dot{y}(t) = v(t) $$
$$ \dot{v}(t) = -g $$

where the gravity of Earth g = 9.807 m/s² and v̇(t) = a (constant acceleration). The system of differential equations may be written using vector and matrix notation, with all the columns being treated as column vectors.

$$ \dot{x}(t) = \begin{bmatrix} \dot{y}(t) \\ \dot{v}(t) \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} x(t) + \begin{bmatrix} 0 \\ -g \end{bmatrix} $$
This system of differential equations may be solved by integration. Integrating both sides of the second
equation/row produces,
$$ v(t) - v(0) = \int_0^t -g\, d\tau = -gt $$

Integrating the first equation/row as well produces the general solution,

$$ y(t) = y(0) + v(0)\, t - \frac{1}{2}gt^2 $$
$$ v(t) = v(0) - gt $$

Applying the initial conditions $[0, 60]$ gives,

$$ y(t) = -\frac{1}{2}gt^2 + 60t $$
$$ v(t) = -gt + 60 $$
16.1.2 Discretization
Typically, it is not so easy to solve a system of ODEs, so iterative algorithms (called integrators, e.g., Euler, Verlet, Runge-Kutta, Dormand-Prince) are used to provide approximate solutions. Using the Verlet Method [198, 67, 165] with a time gap of ∆t, the ODEs can be discretized to the following form,

$$ y(t) = y(t - \Delta t) + v(t - \Delta t)\,\Delta t - \frac{1}{2} g (\Delta t)^2 $$
$$ v(t) = v(t - \Delta t) - g\,\Delta t $$

where the current value is computed from the previous value. This also allows a continuous-time system to be treated as a discrete-time system. Here, to maintain notational compatibility with the notation previously used for time series analysis, t in $x_t$ has two interpretations: (1) time index and (2) the actual discrete time/timestamp. Therefore, the above equations can be written as follows.

$$ y_t = y_{t-1} + v_{t-1}\,\Delta t - \frac{1}{2} g (\Delta t)^2 $$
$$ v_t = v_{t-1} - g\,\Delta t $$
val g = 9.807
val mps = 60.0
def f (t: Double): VectorD = VectorD (-0.5 * g * t~^2 + mps * t, -g * t + mps)
To determine the maximum height, simply set the velocity to zero (x₁(t) = 0), solve for the time τ when the velocity becomes zero, and calculate the height x₀(τ) at time τ.

$$ \tau = 60/g = 6.118 \text{ s} $$
$$ x_0(\tau) = -0.5 \cdot g \cdot 6.118^2 + 60 \cdot 6.118 = 183.5 \text{ m} $$
Can a golf ball hit with the swing speed of an average golfer really go so high? To seek the answer a
theorist and an experimentalist may be consulted. The theorist explains that the model ignores other
forces (e.g., drag due to air resistance) and therefore, the state equation needs to have an error term. The
experimentalist explains that measurements (e.g., using RADAR/LIDAR) need to be taken and of course
there will be measurement errors (another error term). These errors can be modeled as noise and, as such,
turn the deterministic state vector into a random one x(t). Furthermore, since measurements are not made
continuously, discrete time is introduced. The measurements may be recorded as time series data yt .
Now there are two stochastic processes: {x(t) | t ∈ [0, te ]}, the actual process, and {yt | t ∈ {0, 1, . . . , te }}, the observed process. Simplicity argues for merging the two stochastic processes. Unfortunately, it is commonly
the case that the actual process is only partially observable (some of the state variables are not directly
measurable). Therefore, merging the two may result in less accurate and/or less explainable models.
Dynamic models consisting of state equations and observation/measurement equations come in several
varieties:
Note: Kalman-Bucy Filter has continuous-time for the state and discrete time for the observations. Similarly,
CT Hidden Markov Model has continuous-time for the state and discrete time for the observations. The
next section discusses Dynamic Linear Models as Kalman Filters where the forcing/control vector is missing.
Kalman filters are models that are particularly useful for dealing with/filtering out noise.
16.1.4 Exercises
1. The diameter and mass of modern golf balls are approximately 4.268 cm and 45.93 grams, respectively.
Combine Newton’s Second Law of Motion (F = ma) and the Law of Gravity (FG = −mg) to deduce
the following equation:
v̇(t) = − g
2. Consider the effect of wind/air resistance (drag) as another force, drag force FD (in addition to gravity
FG )
$$ F_D = \frac{\rho\, C_D\, A}{2}\, v(t)^2 $$
where ρ = density of air (1.225 kg/m3 ), CD = drag coefficient (0.4), and A = cross sectional area of
the golf ball (14.3 cm2 ).
Recompute the time in the air and maximum height considering the effects of both FG and FD .
16.2 Markov Chains
A Markov Chain [173] is a simple type of Markov Model where one is interested in tracking the state of a
system over time. Consider the following Markov Chain with n = 6 states, corresponding to the number of
dollars one currently has. At each discrete time point, flip a coin, heads (with probability p) gives a dollar,
while tails (with probability q = 1 − p) takes a dollar. There are two terminal states, 0 (lose) and 5 (win).
One pays x0 ∈ {1, 2, 3, 4} dollars to start the game. The Markov Chain that models this game is shown in
Figure 16.1.
[Figure 16.1: State Transition Diagram for the Six-State Markov Chain — states 0 through 5, with probability-p transitions to the right, probability-q transitions to the left, and probability-1 self-loops at the absorbing states 0 and 5]
Consider a discrete-valued, discrete-time stochastic process that represents the state of a system over time t.

[Diagram: a transition from state i to state j, labeled with probability a_ij]
The Markov Property can be generally stated as the future given the present is conditionally independent
of the past.
16.2.1 Probability Mass Function
The probability mass function (pmf) for the state at time t as a vector may be given as follows:
Due to the Markov Property, the next state vector (as a row vector) may be computed by vector-matrix
multiplication.
π t = π t−1 A (16.8)
Suppose p = .6 and the initial state is 3 (3 dollars to enter the game). Probabilistically the initial state is
given by π 0 = [0, 0, 0, 1, 0, 0], so the RHS of the above equation is
$$ [0, 0, 0, 1, 0, 0] \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ .4 & 0 & .6 & 0 & 0 & 0 \\ 0 & .4 & 0 & .6 & 0 & 0 \\ 0 & 0 & .4 & 0 & .6 & 0 \\ 0 & 0 & 0 & .4 & 0 & .6 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} $$
Then probabilistically the next state is π 1 = π 0 A = [0, 0, .4, 0, .6, 0]. The dot product of the row vector π 0
with each column of A is used to compute π 1 .
In ScalaTion, advancing to the next state is carried out by the next method.
where the right-associative *: operator is for vector-matrix multiplication (and complements the left-associative * operator for matrix-vector multiplication).
This recurrence can be unfolded to yield the following equation,
$$ \pi_t = \pi_{t-1} A = \pi_{t-2} A^2 = \cdots = \pi_0 A^t $$
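A hedged, plain-array sketch of this recurrence for the six-state game (ScalaTion's next method uses VectorD *: MatrixD instead):

// Advance the state distribution: pi_t(j) = sum_i pi_{t-1}(i) a(i)(j).
def next (pi: Array [Double], a: Array [Array [Double]]): Array [Double] =
    Array.tabulate (pi.length)(j => (for i <- pi.indices yield pi(i) * a(i)(j)).sum)

@main def gameChainTest (): Unit =
    val a = Array (Array (1.0, 0.0, 0.0, 0.0, 0.0, 0.0),   // lose: absorbing state
                   Array (0.4, 0.0, 0.6, 0.0, 0.0, 0.0),
                   Array (0.0, 0.4, 0.0, 0.6, 0.0, 0.0),
                   Array (0.0, 0.0, 0.4, 0.0, 0.6, 0.0),
                   Array (0.0, 0.0, 0.0, 0.4, 0.0, 0.6),
                   Array (0.0, 0.0, 0.0, 0.0, 0.0, 1.0))   // win: absorbing state
    var pi = Array (0.0, 0.0, 0.0, 1.0, 0.0, 0.0)          // start with 3 dollars
    for t <- 1 to 3 do
        pi = next (pi, a)
        println (s"pi_$t = ${pi.mkString ("[", ", ", "]")}")   // pi_1 = [0, 0, .4, 0, .6, 0]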
16.2.2 Reducible Markov Chains
State j is reachable from state i if

$$ a_{ij}^{(t)} > 0 \qquad (16.9) $$

for some discrete time t, where $a_{ij}^{(t)}$ is the i, j element in $A^t$.
Based on reachability, a Markov Chain may be decomposed into multiple subchains, where all the states in each subchain are reachable from each other. If states i and j are mutually reachable, they are said to communicate. Since communication forms an equivalence relation (reflexive, symmetric and transitive), each subchain is an equivalence class. The six-state Markov Chain shown in the figure has three equivalence classes.
1. S1 = {0} Lost

2. S2 = {1, 2, 3, 4}

3. S3 = {5} Won
States 0 and 5 are absorbing states, since once in such a state, it will never be left. A Markov Chain that
has just one communication/equivalence class is called an Irreducible Markov Chain.
A state i is said to be recurrent if the probability of returning to it some time in the future is one; otherwise it is said to be transient (it may never return). A recurrent state is either positive recurrent (finite expected return time) or null recurrent (infinite expected return time).
$$ \pi = \lim_{t \to \infty} \pi_t \qquad (16.10) $$
After convergence, π may be substituted for both π t and π t−1 , so the previous boxed equation becomes,
π = πA (16.11)
where π is the probability vector (non-negative and sums to 1) and A is the transition probability matrix.
When this equation has a solution, π is the limiting/steady-state probability vector.
Interpretation
For example, for a Markov Chain that oscillates between two states, the long-term probabilities will depend on the initial conditions.
Solving for the Limiting Probabilities
πA − π = 0 (16.12)
with 0 as a row vector. Using an identity matrix I, this may be rewritten,
π(A − I) = 0 (16.13)
Taking the transpose produces

$$ (A - I)^\top \pi^\top = 0^\top \qquad (16.14) $$

The vector π is the eigenvector solution to the left eigenvalue problem for eigenvalue λ = 1 (see the chapter on Linear Algebra). This can be solved by computing the nullspace of $(A - I)^\top$. In ScalaTion, the limiting distribution can be found using QR Factorization (see the exercises).
Consider the example problem from Introduction to Probability Models, 3rd Ed., Ross, p. 146 [160, 161]
having the following transition probability matrix A. Solve for the stationary (steady-state) distribution
three ways:
1. Start with state probability vector π = [π0 , π1 , π2 ] = [.5, .5, 0] and repeatedly compute πA (see the sketch after this list).

2. Solve the vector equation π = πA directly:

$$ [\pi_0, \pi_1, \pi_2] = [\pi_0, \pi_1, \pi_2] \begin{bmatrix} .5 & .4 & .1 \\ .3 & .4 & .3 \\ .2 & .3 & .5 \end{bmatrix} $$

Since the A matrix is stochastic, one of the equations is redundant and may be replaced with the normalization equation ‖π‖₁ = 1.
3. This may be rewritten as an augmented matrix and solved using LU Factorization.

$$ \begin{bmatrix} -.5 & .3 & .2 & \vert & 0 \\ .4 & -.6 & .3 & \vert & 0 \\ 1 & 1 & 1 & \vert & 1 \end{bmatrix} $$

Note, the matrix above is simply $(A - I)^\top$ with the last row replaced with 1s from the normalization equation.
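A hedged sketch of method 1 (the iteration referenced in the list above), again in plain Scala:

// Power iteration: repeatedly compute pi A until the distribution converges.
@main def steadyStateTest (): Unit =
    val a = Array (Array (0.5, 0.4, 0.1),
                   Array (0.3, 0.4, 0.3),
                   Array (0.2, 0.3, 0.5))
    var pi = Array (0.5, 0.5, 0.0)
    for t <- 1 to 50 do
        pi = Array.tabulate (3)(j => (for i <- 0 until 3 yield pi(i) * a(i)(j)).sum)
    println (pi.mkString ("[", ", ", "]"))   // converges to [21/62, 23/62, 18/62]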
Software Solution
end markovChainTest4
The MarkovChain class in the scalation.simulation.state package provides both transient (via the
next method) and steady-state (via the limit method) solutions.
16.2.4 MarkovChain Class
Class Methods:
[Figure 16.3: State Transition Diagram for Five-State Continuous-Time Markov Chain — states 0 through 4, with arrival-rate λ transitions to the right and service-rate µ transitions to the left]
Self-loops are not included, since the state is unchanged until there is an out transition. One may view this
as an event that changes the state (here there are arrival and service completion events).
The transient solution for Continuous-Time Markov Chains can be found by solving Kolmogorov differ-
ential equations, see Ross, Chapter 6 [160].
16.2.6 Limiting/Steady-State Distribution
The limiting/steady-state solution can be given by,
πQ = 0 (16.15)
$$ [\pi_0, \pi_1, \pi_2, \pi_3, \pi_4] \begin{bmatrix} -\lambda & \lambda & 0 & 0 & 0 \\ \mu & -(\lambda+\mu) & \lambda & 0 & 0 \\ 0 & \mu & -(\lambda+\mu) & \lambda & 0 \\ 0 & 0 & \mu & -(\lambda+\mu) & \lambda \\ 0 & 0 & 0 & \mu & -\mu \end{bmatrix} = [0, 0, 0, 0, 0] $$
The diagonal elements are set so that the rows of matrix Q sum to 0 (inflow = outflow).
Multiplying the vector π by the matrix Q produces the following five equations.
−λπ0 + µπ1 = 0
λπj−1 − (λ + µ)πj + µπj+1 = 0 for j = 1, 2, 3
λπ3 − µπ4 = 0
In this case a simpler approach is possible, as the solution can be developed using partial balance equations that equate up-flow with down-flow, so $\lambda \pi_{j-1} = \mu \pi_j$. Therefore, $\pi_1 = \frac{\lambda}{\mu}\pi_0$, $\pi_2 = \frac{\lambda}{\mu}\pi_1$, $\pi_3 = \frac{\lambda}{\mu}\pi_2$, and $\pi_4 = \frac{\lambda}{\mu}\pi_3$.
Traffic Intensity: The traffic intensity is the ratio of the arrival rate to the service rate. The higher the
traffic intensity, the more the chain is pushed to the right (toward more congestion).
$$ \rho = \frac{\lambda}{\mu} \qquad (16.17) $$
The partial balance equations can be expressed with ρ replacing λ and µ.

$$ \pi_j = \rho\, \pi_{j-1} \qquad (16.18) $$

This recursive equation can be unfolded to give,

$$ \pi_j = \rho^j\, \pi_0 \qquad (16.19) $$

Applying the normalization equation (the five probabilities must sum to one) gives,

$$ [1 + \rho + \rho^2 + \rho^3 + \rho^4]\, \pi_0 = 1 \qquad (16.20) $$

$$ \pi_0 = \frac{1-\rho}{1-\rho^n} \qquad (16.21) $$

Finally, the state probabilities may be determined.

$$ \pi_j = \frac{1-\rho}{1-\rho^n}\, \rho^j \qquad (16.22) $$
This is the solution for an M/M/1/K Queue (with K = 4 and n = K + 1 = 5). The notation M/M/1/K
means the arrival process is Markovian (Poisson or Exponential inter-arrival times), the service distribution
is Exponential, the number of servers is 1 and customer capacity is K (one in service and the rest waiting).
The MarkovChainCT class in the scalation.simulation.state package provides both transient (via the
next method) and steady-state (via the limit method) solutions. Currently, the transient solution has not
been implemented (hence = ???).
Class Methods:
$$ \pi_j = \frac{1-\rho}{1-\rho^{K+1}}\, \rho^j \qquad (16.23) $$
This solution works when ρ = 0 (π0 = 1), ρ ∈ (0, 1), ρ = 1 (via L’Hospital’s Rule) and when ρ > 1.
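A small hedged sketch of equation (16.23), guarding the ρ = 1 case (the function name is illustrative):

// M/M/1/K steady-state probability pi_j; at rho = 1 the formula reduces to the
// uniform value 1 / (K + 1) via L'Hospital's Rule.
def mm1kProb (rho: Double, kCap: Int)(j: Int): Double =
    if rho == 1.0 then 1.0 / (kCap + 1)
    else (1.0 - rho) * math.pow (rho, j) / (1.0 - math.pow (rho, kCap + 1))

@main def mm1kTest (): Unit =
    val pi = (0 to 4).map (mm1kProb (10.0 / 12.0, 4))    // lambda = 10, mu = 12, K = 4
    println (pi.mkString ("[", ", ", "]"))               // the five probabilities sum to 1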
As the waiting capacity K (and therefore the number of states n) goes to infinity, stability requires traffic intensity ρ < 1, in which case $\rho^{K+1}$ goes to zero. Therefore, the solution for an M/M/1 Queue can be obtained.

$$ \pi_j = (1 - \rho)\, \rho^j \qquad (16.24) $$
Given that πj = P (x = j) indicates the probability there are j customers in the system, the expected number
in the system is given as follows.
$$ E[x] = \sum_{j=0}^{\infty} j\, \pi_j = (1-\rho) \sum_{j=0}^{\infty} j\rho^j \qquad (16.25) $$

$$ L = E[x] = \frac{\rho}{1-\rho} \qquad (16.26) $$
The number in service $L_s$ corresponds to the probability the server is busy, $1 - \pi_0 = \rho$. Therefore, the expected length of the queue is

$$ L_q = L - L_s = \frac{\rho}{1-\rho} - \rho = \frac{\rho^2}{1-\rho} \qquad (16.27) $$
$$ L_s = \lambda T_s = \frac{\lambda}{\mu} = \rho \qquad (16.28) $$
This relationship between length (number in) and time carries over to the queue
Lq = λTq (16.29)
L = λT (16.30)
Summary of Results
In summary, the formulas for an M/M/1 Queue are collected into Table 16.2.
The MMc Queue class in the scalation.simulation.queueingnet package produces steady-state solutions
for M/M/1 and M/M/c queues, where c is the number of servers.
Table 16.2: Formulas for M/M/1 Queueing Models
Class Methods:
The MMcK Queue class produces steady-state solutions for M/M/1/K and M/M/c/K queues, where c is the
number of servers and K is the system capacity.
Class Methods:
16.2.11 Exercises
1. For the given six-state discrete-time Markov Chain, advance the state probability π t over the next 20
time points.
2. For the six-state Markov Chain, what happens to the probabilities of states 1 to 4 as time increases?

3. Consider the middle subchain, states 1 to 4. Do they form an irreducible Markov Chain and admit a stationary/steady-state solution? Compute the value for this solution π.
4. A square matrix is stochastic if all its elements are non-negative and all the columns sums equal 1.
Show that a stochastic matrix has an eigenvalue equal to 1. Hint: see https://textbooks.math.
gatech.edu/ila/stochastic-matrices.html
5. For discrete-time Markov chains, explain how the limit method works for computing π.
6. Use the formula for the sum of a geometric series,

$$ s_n = \sum_{j=n}^{\infty} \rho^j = \frac{\rho^n}{1-\rho} $$

to verify that for the five-state Continuous-Time Markov Chain

$$ \pi_0 = \frac{1-\rho}{1-\rho^5} $$
7. Develop the steady-state solution πj for an M/M/c/K Queue where c is the number of servers and K
is the system capacity.
8. Derive the formula for the expected number in the system L for an M/M/1 Queue. Hint: $\frac{d}{d\rho}\,\rho^j = j\rho^{j-1}$.
9. The relationship L = λT is called Little’s Law. Sketch Stidham’s proof of the law. See [174]
10. Use the MMc Queue class to address the one line versus two line question. Let λ = 20 and µ = 12 per
hour. What is the mean time in the queue Tq for an M/M/2 queue? Compare this with Tq for two
M/M/1 queues, where customers are randomly split between the two lines/servers, i.e., the arrival rate
to each is λ/2. Note, if customers join the shorter line, the analysis of this problem becomes difficult,
but simulation is still straightforward.
11. In the above problem, let λ take on all integer values from 4 to 23 and plot Tq (the mean waiting time)
over these values for both the one line and two line solutions. Note, for λ = 24 or higher the queues
will be unstable.
12. Explain what the Kolmogorov backward equations are and how they can be used to solve for transient
solutions to Continuous-Time Markov Chains [172].
13. One simple way to model an epidemic such as the COVID-19 Pandemic is to use a Discrete-Time Markov
Chain (DTMC). One could start with an SEIR compartmental model and relate subpopulations of
individuals to probabilities of being in a given state. Consider the discrete-time Markov Chain model
shown in Figure 16.4.
[Figure 16.4: State Transition Diagram for the SEIR Discrete-Time Markov Chain: S → E → I → R, with a self-loop at each state]
Assume the population of the state of Georgia that is susceptible to COVID-19 is N = 10, 000, 000. The basic SEIR model assumes there are four subpopulations of individuals.
S = N π0 Susceptible
E = N π1 Exposed
I = N π2 Infected
R = N π3 Recovered
Further assume that on average it takes 20 days to transition from state S to state E, 8 days from E
to I, and 10 days from I to R. Let the transition probabilities correspond to the reciprocals of the
days. Remember the probabilities in each row of the transition probability matrix must add to 1. The
discrete time unit is one day. Each day, an individual may transition to the next state (e.g., S to E)
or remain in the same state (e.g., S to S). Again the probabilities must add to one.
(a) Construct the transition probability matrix A.
(b) Let the initial probability vector π 0 = [0.99, 0.0, 0.01, 0.0], i.e., 99% in state S and 1% in state I.
Compute π t for the next two weeks (14 days). Show π t for each of these days.
14. The above DTMC model is too simple to exhibit high accuracy in forecasting COVID-19. Discuss a
more accurate simulation/modeling technique for COVID-19.
15. Question 2: Consider a CTMC for the M/M/2 queue (i.e., Exponential inter-arrival times with rate λ, Exponential service times with rate µ, and two service units). The rate µ is for each server. The traffic intensity $\rho = \frac{\lambda}{2\mu}$.
(a) Solve for the steady-state probabilities πj , using the partial balance equations.
λ πj−1 = 2µ πj for j ≥ 2
λ π0 = µ π1
Hint:

$$ \pi_0 = \frac{1-\rho}{1+\rho}, \qquad \pi_j = \;?\; \text{ for } j \ge 1 $$
(b) Use this result to solve for the expected number in the system.
$$ L = E[x] = \sum_{j=0}^{\infty} j\, \pi_j $$
(c) Using the formula for L, logic and Little’s Law, create a formula summary table for the M/M/2
queue having six formulas (Lq , Ls , L, Tq , Ts , T ). The summary will have the form of Table 15.2.
(d) Suppose λ = 12 per hour (overall arrival rate) and µ = 7.5 per hour (per server service rate);
compute values for π0 and the six formulas (for times in minutes).
16.3 Dynamic Linear Models
As with a Hidden Markov Model (HMM), a Dynamic Linear Model (DLM) may be used to represent a
system in terms of two stochastic processes, the state of the system at time t, xt and the observed values
from measurements of the system at time t, yt . The main difference from an HMM is that the state and
its observation are treated as continuous quantities. For time series analysis, it is natural to treat time as
discrete values.
As background, consider the following system of homogeneous ODEs that only includes a linear transformation (no constant term). The transition matrix $F^{(c)}$ has been renamed to emphasize that it is for the continuous-time problem.

$$ \dot{x}(t) = F^{(c)} x(t) $$

The solution may be written using the matrix exponential,

$$ x_t = e^{\Delta t\, F^{(c)}}\, x_{t-\Delta t} \qquad (16.32) $$

where ∆t is the time gap between consecutive time points. It is assumed here that the time gaps are uniform. The matrix exponential ($e^X = \sum_k \frac{1}{k!} X^k$) can be calculated using the Al-Mohy & Higham algorithm [4]. We define matrix F as follows:

$$ F = e^{\Delta t\, F^{(c)}} \qquad (16.33) $$
Substituting in the matrix F gives,
xt = F xt−∆t (16.34)
Again, to maintain notational compatibility with the notation previously used for time series analysis, t in $x_t$ has two interpretations: (1) time index and (2) the actual discrete time/timestamp. Therefore, the above equation can be written as follows.
xt = F xt−1 (16.35)
To deal with uncertainty in the system, a noise term may be added. In addition, the state and observation equations are distinguished.
For a basic DLM, the dynamics of the system are described by two equations: The State Equation
indicates how the next state vector xt is dependent on the previous state vector xt−1 and a process noise
vector wt ∼ Normal(0, Q)
xt = F xt−1 + wt (16.36)
where Q is the covariance matrix for the process noise. If the dynamics are deterministic, then the covariance
matrix is zero, otherwise it can capture uncertainty in the relationships between the state variables (e.g.,
simple models of the flight of a golf ball often ignore the effects due to the spin on the golf ball).
The Observation/Measurement Equation indicates how at time t, the observation vector yt is dependent
on the current state xt and a measurement noise vector vt ∼ Normal(0, R)
yt = Hxt + vt (16.37)
where R is the covariance matrix for the measurement noise/error. The process noise and measurement noise
are assumed to be independent of each other. The state transition matrix F indicates the linear relationships
between the state variables, while the H matrix establishes linear relationships between the state of system
and its observations/measurements.
The sensor tries to capture the dynamics of the system, but depending on the quality of the sensor there
will be measurement errors. The observation/measurement variables yt = [yt0 , yt1 ] may correspond to the
state variables in a one-to-one correspondence or by some linear relationship. The observation of the system
then may be described by the following observation equations:
Further assume that estimates for the F and H parameters of the model have been found (see the subsection
on Training).
State Equations

$$ x_t = \begin{bmatrix} x_{t0} \\ x_{t1} \end{bmatrix} = \begin{bmatrix} 0.9 & 0.2 \\ -0.4 & 0.8 \end{bmatrix} x_{t-1} + w_t $$
These state equations suggest that the flow will be a high percentage of the previous flow, but that higher speed suggests increasing flow. In addition, the speed is based on the previous speed, but higher flow suggests that speeds may be decreasing (e.g., due to congestion).
Observation/Measurement Equations
$$ y_t = \begin{bmatrix} y_{t0} \\ y_{t1} \end{bmatrix} = \begin{bmatrix} 1.0 & -0.1 \\ -0.1 & 1.0 \end{bmatrix} x_t + v_t $$
These observation/measurement equations suggest that higher speed makes it more likely for a vehicle to pass the sensor without being counted, and higher flow makes the under-estimation of speed greater.
16.3.2 Exercises
1. For a DLM, consider the case where m = n = 1. The state equations and measurement equations
become
xt = axt−1 + wt
yt = cxt + vt
where wt ∼ Normal(0, σq2 ) and vt ∼ Normal(0, σr2 ). Compare this model with an AR(1) model.
2. For the Traffic Sensor Example, let Q = σq2 I and R = σr2 I. Develop a DLM model using ScalaTion
and try low, medium and high values for the variances σq2 and σr2 (9 combinations). Let the initial
state of the system be x00 = 100.0 vehicles per 15 minutes and x01 = 100.0 km per hour. How does
the relative amount of process and measurement error affect the dynamics/observation of the system?
3. Consider the state and observation equations given in the Traffic Sensor Example and assume that the
state equations are deterministic (no uncertainty in system, only in its observation). Reduce the DLM
to a simpler type of time series model. Explain.
4. Use the Traffic Sensor Dataset (traffic.csv) to estimate values for the 2-by-2 covariance matrices Q
and R.
5. Use the Traffic Sensor Dataset (traffic.csv) to estimate values for the parameters of a DLM model,
i.e., for the F and H 2-by-2 matrices.
16.4 Kalman Filter
A Kalman Filter (KF) is a Dynamic Linear Model that incorporates an outside influence on the system. If a driving force or control is applied to the system, an additional term $Gu_t$ is added to the state equation [202, 154],

$$ x_t = F x_{t-1} + G u_t + w_t $$

For the golf ball example below, the control vector is $u_t = [-g]$. The observation/measurement equation remains the same.

$$ y_t = H x_t + v_t $$
The process noise wt and the measurement noise vt also remain the same. The Kalman Filter model,
therefore includes five matrices.
$$ y_t = y_{t-1} + v_{t-1}\,\Delta t - \frac{1}{2} g (\Delta t)^2 $$
$$ v_t = v_{t-1} - g\,\Delta t $$
When noise is added, the variables will become random variables (blue font). The discrete-time system then
serves as the basis to formulate the Kalman filter state equations [154].
In this case, the force/control is the one-dimensional vector ut = [−g]. Again the initial conditions are x0 = [0 m, 60 m/s]. For simplicity, the process noise wt is assumed to be on scale with the measurement noise (see below).

$$ Q = C[w_t] = 0.05\, I = \begin{bmatrix} 0.05 & 0.0 \\ 0.0 & 0.05 \end{bmatrix} $$
The measurement equation indicates what variables are measured and how they relate to the state variables. When the state variables are directly measurable and one is interested in all the state variables, the observation matrix will be the 2-dimensional identity matrix H = I.
yt = Ixt + vt
vt ∼ Normal(0, R)
Modern golf ball tracking devices have standard deviations as low as σ0 = 0.25 meters for position in a
particular dimension and σ1 = 0.2 meters per second for velocity. These are rough estimates based on
sources such as [106]. These may be used for the variances (diagonal elements) in the covariance matrix.
What remains is to determine the correlation ρ01 between vt0 and vt1 to get the covariance σ01 = ρ01 σ0 σ1 . Such information is hard to come by, but it makes sense that they would be positively correlated, so let ρ01 = 0.5.

$$ R = C[v_t] = \begin{bmatrix} \sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2 \end{bmatrix} = \begin{bmatrix} 0.0625 & 0.025 \\ 0.025 & 0.04 \end{bmatrix} $$
See the next section for alternative approaches for estimating Q and R.
Note that if only the height is of interest, then observation matrix H = [1, 0].
16.4.2 Training
The main goal of training is to minimize the error in estimating the state. At time t, a new measurement
yt becomes available. The errors before and after this event are the differences between the actual state xt
and the estimated state before $\hat{x}_t^-$ (predicted) and after $\hat{x}_t$ (corrected) [202].

$$ e_t^- = x_t - \hat{x}_t^- $$
$$ e_t = x_t - \hat{x}_t $$

Since $w_t$ has a zero mean, the covariance matrices ($P_t^-$ and $P_t$) for the before and after state errors may be computed as expectations of their outer products.

$$ P_t^- = C[e_t^-] = E[e_t^- \otimes e_t^-] \quad \text{before errors} \qquad (16.39) $$
$$ P_t = C[e_t] = E[e_t \otimes e_t] \quad \text{after errors} \qquad (16.40) $$
The essential insight by Kalman was that the after estimate should be the before estimate adjusted by a weighted difference between the actual measured value $y_t$ and its before estimate $H\hat{x}_t^-$.

$$ \hat{x}_t^- = F \hat{x}_{t-1} + G u_t \quad \text{predicted} \qquad (16.41) $$
$$ \hat{x}_t = \hat{x}_t^- + K_t\, [\, y_t - H \hat{x}_t^- \,] \quad \text{corrected} \qquad (16.42) $$
The n-by-m Kt matrix is called the Kalman Gain and the above equations may be referred to as the Kalman
state update equations. If the actual measurement is very close to its predicted value, little adjustment to
the predicted state value is needed. On the other hand, when there is a disagreement, the adjustment based
upon the measurement should be tempered based upon the reliability of the measurement. A small gain will
dampen the adjustment, while a high gain may result in large adjustments. The trick is to find the optimal
gain Kt .
Using a Minimum Variance Unbiased Estimator (MVUE) for parameter estimation for a Kalman Filter
means that the trace of the error covariance matrix should be minimized (see exercises for details).
Plugging the Kalman Gain equation into the above equation gives an optimization problem, whose solution (see the exercises) produces the following equations that can be used to update the Kalman Gain.
$$ P_t^- = F P_{t-1} F^\top + Q \qquad (16.45) $$
$$ K_t = P_t^- H^\top\, [\, H P_t^- H^\top + R \,]^{-1} \qquad (16.46) $$
$$ P_t = [\, I - K_t H \,]\, P_t^- \qquad (16.47) $$
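To make the predict/correct cycle concrete, here is a hedged one-dimensional sketch (n = m = 1, so all matrices reduce to scalars; the class name is illustrative, not ScalaTion's actual Kalman Filter class):

// One-dimensional Kalman Filter: scalar versions of equations (16.41)-(16.42)
// and (16.45)-(16.47).
case class KalmanFilter1D (f: Double, g: Double, h: Double, q: Double, r: Double,
                           var x: Double = 0.0, var p: Double = 1.0):
    def update (y: Double, u: Double): Double =
        val xp = f * x + g * u                           // predicted state (16.41)
        val pp = f * p * f + q                           // before-error covariance (16.45)
        val k  = pp * h / (h * pp * h + r)               // Kalman Gain (16.46)
        x = xp + k * (y - h * xp)                        // corrected state (16.42)
        p = (1.0 - k * h) * pp                           // after-error covariance (16.47)
        x
end KalmanFilter1D

A small q (trusted dynamics) keeps the gain low and damps the measurement adjustment, while a small r (trusted sensor) pushes the gain toward 1/h.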
16.4.3 Exercises
1. Suppose that fog negatively affects traffic and speed. Use the Traffic Sensor with Fog Dataset (traffic fog.csv)
to estimate values for the 2-by-2 covariance matrices Q and R.
2. Use the Traffic Sensor with Fog Dataset (traffic fog.csv) to estimate values for the parameters of
a Kalman Filter model, i.e., for the F , G and H 2-by-2 matrices.
3. Show that if ŷ is an unbiased estimator for y (i.e., E [ŷ] = E [y]) then the minimum error variance
V [ky − ŷk] is
4. Explain why minimizing the trace of the covariance C [et ] leads to optimal Kalman Gain K.
16.5 Extended Kalman Filter
When some of the relationships between state variables are nonlinear, the simplest option is to use an
Extended Kalman Filter (EKF). The linear combinations in the equations for Kalman Filters are now
replaced with differentiable (nonlinear) vector functions f and h.
The dynamics of the state are governed by a (nonlinear) state transition function f : Rn → Rn and specified
in the state equation,
xt = f (xt−1 , ut ) + wt (16.48)
Observation/Measurement Function
yt = h(xt ) + vt (16.49)
The process noise wt and the measurement noise vt should be close to Gaussian (Normally distributed).
They are also assumed to be additive, otherwise they need to be incorporated into the f and h functions.
function                description
f : R^n → R^n           state transition function
h : R^n → R^m           observation function
For the example in the next subsection of a SEIHRD epidemic/pandemic model, the state vector is 6-
dimensional (n = 6), while the measurement/observation vector is 4-dimensional (m = 4).
16.5.1 Training
Extended Kalman Filters operate much like Kalman filters [154]. The only change to the Kalman update
equations is to use the f and h functions in place of the multiplications involving the F , G and H matrices
(for simplicity in previous section, these were taken to be constant matrices, but in a more general treatment
they would vary with time Ft , Gt and Ht ).
The Kalman state update equations for EKF are as follows:
$$ \hat{x}_t^- = f(\hat{x}_{t-1}, u_t) \quad \text{predicted} \qquad (16.50) $$
$$ \hat{x}_t = \hat{x}_t^- + K_t\, [\, y_t - h(\hat{x}_t^-) \,] \quad \text{corrected} \qquad (16.51) $$
$$ P_t^- = F_{t-1} P_{t-1} F_{t-1}^\top + Q \qquad (16.52) $$
$$ K_t = P_t^- H_t^\top\, [\, H_t P_t^- H_t^\top + R \,]^{-1} \qquad (16.53) $$
$$ P_t = [\, I - K_t H_t \,]\, P_t^- \qquad (16.54) $$
where now the matrices Ft and Ht are slopes of the f and h functions, respectively.
$$ F_t = \frac{\partial f}{\partial x_t} \qquad (16.55) $$
$$ H_t = \frac{\partial h}{\partial x_t} \qquad (16.56) $$
In other words, at each discrete time step t, the nonlinear vector-valued functions f and h are approximated by their local slopes at the predicted state $\hat{x}_t^-$, computed as Jacobian matrices. Note, $F_t^\top$ and $H_t^\top$ are the transposes of $F_t$ and $H_t$, respectively.
Recall that the Jacobian of a vector function $f : \mathbb{R}^n \to \mathbb{R}^l$ is an l-by-n matrix.

$$ J_f(x) = \left[ \frac{\partial f_i}{\partial x_j} \right]_{0 \le i < l,\; 0 \le j < n} = \begin{bmatrix} \frac{\partial f_0}{\partial x_0} & \frac{\partial f_0}{\partial x_1} & \dots & \frac{\partial f_0}{\partial x_{n-1}} \\ \frac{\partial f_1}{\partial x_0} & \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_{n-1}} \\ \dots & \dots & \dots & \dots \\ \frac{\partial f_{l-1}}{\partial x_0} & \frac{\partial f_{l-1}}{\partial x_1} & \dots & \frac{\partial f_{l-1}}{\partial x_{n-1}} \end{bmatrix} $$
3. The number of Infected and not Hospitalized individuals at time t is denoted by I(t).
To make this more clear, suppose the study is the spread of COVID-19 during the year 2020 in the United
States. The population of individuals is given by N = 330 million.
State Transition Rates
The dynamics of the system are governed by the state transition rates. These will become parameters in the differential equations to be estimated from the data. Assuming the birth rate matches the COVID-19 death rate, the population does not change over time (a short time span approximation).
The state transition diagram is shown in Figure 16.5. If there is a significant reinfection rate, an edge can
be added from state R to state S.
[Figure 16.5: State Transition Diagram for the SEIHRD Model — S → E (q0), E → I (q1), I → H (q2), I → R (q3), H → R (q4), H → D (q5)]
The six time variables are interrelated through ordinary differential equations, where a dot above the variable
denotes a time derivative,
$$ \dot{S}(t) = \dot{D}(t) - q_0\, S(t) \qquad \text{Susceptible} $$
$$ \dot{E}(t) = q_0\, S(t) - q_1 E(t) \qquad \text{Exposed} $$
$$ \dot{I}(t) = q_1 E(t) - (q_2 + q_3)\, I(t) \qquad \text{only Infected} $$
$$ \dot{H}(t) = q_2 I(t) - (q_4 + q_5)\, H(t) \qquad \text{Infected and Hospitalized} $$
$$ \dot{R}(t) = q_3 I(t) + q_4 H(t) \qquad \text{Recovered} $$
$$ \dot{D}(t) = q_5 H(t) \qquad \text{Died} $$
The LHS is the rate of change of the variable, while the RHS is the sum of incoming edges minus the sum
of the outgoing edges. From epidemiology, the first transition rate q0 can be further dissected. Exposure
depends on members in S(t) coming in contact with an infected individual in I(t) or H(t) in terms of their
fraction of the population N and is proportional to the new parameter α.
$$ q_0 = \alpha\, \frac{I(t) + H(t)}{N} \qquad (16.58) $$
Substituting for $q_0$ gives,

$$ \dot{S}(t) = q_5 H(t) - \alpha\, \frac{I(t) + H(t)}{N}\, S(t) \qquad \text{Susceptible} $$
$$ \dot{E}(t) = \alpha\, \frac{I(t) + H(t)}{N}\, S(t) - q_1 E(t) \qquad \text{Exposed} $$
$$ \dot{I}(t) = q_1 E(t) - (q_2 + q_3)\, I(t) \qquad \text{only Infected} $$
$$ \dot{H}(t) = q_2 I(t) - (q_4 + q_5)\, H(t) \qquad \text{Infected and Hospitalized} $$
$$ \dot{R}(t) = q_3 I(t) + q_4 H(t) \qquad \text{Recovered} $$
$$ \dot{D}(t) = q_5 H(t) \qquad \text{Died} $$
In order for the six equations to be useful for forecasting, the six parameters, α, q1 , q2 , q3 , q4 , q5 , must be estimated from data.
Notice that a further simplification is possible: One of the variables may be removed since all the variables
sum up to N .
Discretization
As was done with the golf ball example, the ODEs may be discretized. In this case, since all the differential
equations are first-order, the performance of the Euler Method should be acceptable, although more advanced
numerical integration methods could be applied. (Note that Newton’s Second Law is a second-order ODE
(converted into two coupled first order equations) so the Verlet Method was superior to the Euler Method
for the golf ball trajectory problem.) Recall that ∆t is the time gap in the discrete-time system and time
can now be used as subscript since it is discrete.
$$S_t = S_{t-1} + \Delta t \left[ q_5 H_{t-1} - \alpha\, \frac{I_{t-1} + H_{t-1}}{N}\, S_{t-1} \right] \qquad \text{Susceptible}$$
$$E_t = E_{t-1} + \Delta t \left[ \alpha\, \frac{I_{t-1} + H_{t-1}}{N}\, S_{t-1} - q_1 E_{t-1} \right] \qquad \text{Exposed}$$
$$I_t = I_{t-1} + \Delta t \left[ q_1 E_{t-1} - (q_2 + q_3) I_{t-1} \right] \qquad \text{only Infected}$$
$$H_t = H_{t-1} + \Delta t \left[ q_2 I_{t-1} - (q_4 + q_5) H_{t-1} \right] \qquad \text{Infected and Hospitalized}$$
$$R_t = R_{t-1} + \Delta t \left[ q_3 I_{t-1} + q_4 H_{t-1} \right] \qquad \text{Recovered}$$
$$D_t = D_{t-1} + \Delta t \left[ q_5 H_{t-1} \right] \qquad \text{Died}$$
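To make the update concrete, one Euler step of this discrete-time system can be coded directly. The sketch below is illustrative and not tied to ScalaTion's API; the function name seihrdStep is hypothetical.

// One Euler step of the discretized SEIHRD system (illustrative sketch).
// x = [S, E, I, H, R, D] at time t-1; returns the state at time t.
def seihrdStep (x: Array [Double], alpha: Double,
                q1: Double, q2: Double, q3: Double, q4: Double, q5: Double,
                n: Double, dt: Double = 1.0): Array [Double] =
    val (s, e, i, h, r, d) = (x(0), x(1), x(2), x(3), x(4), x(5))
    val exposure = alpha * (i + h) / n * s           // flow from S to E per unit time
    Array (s + dt * (q5 * h - exposure),             // Susceptible
           e + dt * (exposure - q1 * e),             // Exposed
           i + dt * (q1 * e - (q2 + q3) * i),        // only Infected
           h + dt * (q2 * i - (q4 + q5) * h),        // Infected and Hospitalized
           r + dt * (q3 * i + q4 * h),               // Recovered
           d + dt * (q5 * h))                        // Died
end seihrdStep

Iterating seihrdStep from an initial state gives a forecast trajectory once the parameters have been estimated.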
Similar discretized systems of ODEs are given in the literature for SEIR models and their extensions [27].
The discretized system of ODEs may be now formulated as an Extended Kalman Filter model. For this
model, there is no forcing/control vector, i.e., ut is removed. Also, the observation function h is assumed to
be linear. Therefore, the state and observation/measurement vector equations take the following form:
$$x_t = f(x_{t-1}) + w_t$$
$$y_t = H x_t + v_t$$
State Equations
The state vector includes random variables for each of the variables in the discrete-time system of equations.
$$x_t = [S_t, E_t, I_t, H_t, R_t, D_t]$$
For this model, the nonlinear function f is quadratic so it can be written in matrix form.
$$
\begin{bmatrix} S_t \\ E_t \\ I_t \\ H_t \\ R_t \\ D_t \end{bmatrix} =
x_{t-1}^{\top}
\begin{bmatrix}
0 & 0 & -\frac{\alpha}{N}\Delta t & -\frac{\alpha}{N}\Delta t & 0 & 0 \\
0 & 0 & \frac{\alpha}{N}\Delta t & \frac{\alpha}{N}\Delta t & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} x_{t-1} +
\begin{bmatrix}
1 & 0 & 0 & q_5 \Delta t & 0 & 0 \\
0 & 1 - q_1 \Delta t & 0 & 0 & 0 & 0 \\
0 & q_1 \Delta t & 1 - q_{23} \Delta t & 0 & 0 & 0 \\
0 & 0 & q_2 \Delta t & 1 - q_{45} \Delta t & 0 & 0 \\
0 & 0 & q_3 \Delta t & q_4 \Delta t & 1 & 0 \\
0 & 0 & 0 & q_5 \Delta t & 0 & 1
\end{bmatrix} x_{t-1} + w_t
$$

where $x_{t-1}^{\top}$ is the transpose of $x_{t-1}$ (making it a row vector), $q_{23} = q_2 + q_3$ and $q_{45} = q_4 + q_5$.
Observation/Measurement Equations
The observation/measurement vector includes random variables for the observable state variables, i.e., those variables for which time series data exist.
For this model, there is a direct relationship between the state and measurement variables, so the observa-
tion/measurement vector is as follows:
$$
\begin{bmatrix} I_t^o \\ H_t^o \\ R_t^o \\ D_t^o \end{bmatrix} =
\begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} x_t + v_t
$$
Jacobian Matrices
Ft ∈ R6×6 is given by the Jacobian matrix for the state transition function f .
$$F_t = \frac{\partial f}{\partial x_t}$$

Similarly, $H_t \in \mathbb{R}^{4 \times 6}$ is given by the Jacobian matrix for the observation function $h$.

$$H_t = \frac{\partial h}{\partial x_t} =
\begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$
The covariance of the process noise Q ∈ Rn×n and the covariance of the measurement noise R ∈ Rm×m can
be challenging to determine. These covariance matrices feed into the calculation of the Kalman Gain K, so
that reliance on process predictions versus measured quantities is partially based on how low their respective
covariances are.
There are three basic approaches for assigning values to Q and R: (1) Use knowledge of the process for
Q and of the measurement devices for R. The covariance of the process noise Q may be roughly determined
based on inherent uncertainty in the model (unpredictably changing wind conditions) or missing model
elements (e.g., the force of drag). The covariance of the measurement noise R may be roughly determined
based on characteristics of the measurement device(s). (2) Use hyper-parameters and Hyper-Parameter
Optimization (HPO) as discussed below. (3) Use Adaptive (Extended) Kalman Filters that adjust Qt and
Rt at each step based on calculated errors/residuals/innovations [3].
As approach (1) is very problem specific and approach (3) is complicated, approach (2) is a good alter-
native for a quick start. This alternative uses tunable hyper-parameters λQ and λR as multipliers of identity
matrices.
$$Q = \lambda_Q I$$
$$R = \lambda_R I$$

where $I$ is an $n$-by-$n$ identity matrix in $Q$ and an $m$-by-$m$ identity matrix in $R$. The hyper-parameters are problem specific, but $\lambda = 0.1$ or $1.0$ are reasonable ballpark values [179] for starting grid searches or more efficient HPO techniques.
Note that approach (1) is used in the golf ball trajectory problem given in the last section and approach
(3) is explored in the exercises.
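A grid search over $\lambda_Q$ and $\lambda_R$ might look like the following sketch, where runEKF and sse are hypothetical stand-ins for fitting the filter and scoring its forecasts; replace them with the actual model code.

// Hedged sketch of a grid search over the noise hyper-parameters (approach 2).
def runEKF (lambdaQ: Double, lambdaR: Double): Array [Double] =
    Array.fill (28)(0.0)                             // placeholder: forecast errors
def sse (errors: Array [Double]): Double =
    errors.map (e => e * e).sum                      // sum of squared errors

val lambdas = Seq (0.01, 0.1, 1.0, 10.0)             // candidate multipliers
val (err, lamQ, lamR) = (for lq <- lambdas; lr <- lambdas yield
    (sse (runEKF (lq, lr)), lq, lr)).minBy (_._1)    // keep the best pair
println (s"best: sse = $err at lambda_Q = $lamQ, lambda_R = $lamR")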
Initialization
Some initialization is needed before starting an Extended Kalman Filter. Due to lack of sensitivity, the initial covariance of the process error can be set to an n-by-n identity matrix, i.e., $P_0 = I$. The initial state is $x_0 = [329999999, 0, 1, 0, 0, 0]$, i.e., one infected individual in a population of 330 million. The time gap/increment is $\Delta t = 1$ day.
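Putting the pieces together, one EKF predict/correct step for this model might look like the sketch below. It assumes ScalaTion-like MatrixD/VectorD operations (*, +, -, transpose, inverse, eye); treat the exact method names as assumptions rather than a definitive implementation.

import scalation.mathstat.{MatrixD, VectorD}

// Hedged sketch of one EKF step for the SEIHRD model; f is the discretized
// state transition, fJac its Jacobian F_t, and h, q, r are H, Q, R above.
def ekfStep (x: VectorD, p: MatrixD, y: VectorD,
             f: VectorD => VectorD, fJac: VectorD => MatrixD,
             h: MatrixD, q: MatrixD, r: MatrixD): (VectorD, MatrixD) =
    val xPred = f(x)                                       // predicted state (16.50)
    val ft    = fJac (x)                                   // local slope F_t (16.55)
    val pPred = ft * p * ft.transpose + q                  // predicted covariance (16.52)
    val k     = pPred * h.transpose *
                (h * pPred * h.transpose + r).inverse      // Kalman gain (16.53)
    val xCorr = xPred + k * (y - h * xPred)                // corrected state (16.51)
    val pCorr = (MatrixD.eye (x.dim, x.dim) - k * h) * pPred   // corrected covariance (16.54)
    (xCorr, pCorr)
end ekfStep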
16.5.3 Exercises
1. Use the ScalaTion COVID-19 datasets found at https://github.com/scalation/data to train
an Extended Kalman Filter (EKF) and make four-week ahead forecasts. It is composed of data
collected from multiple datasets stored at https://github.com/CSSEGISandData/COVID-19/tree/
master/csse_covid_19_data. In particular, use the dataset that contains data about COVID-19
cases, hospitalizations, recoveries, and deaths in the United States for one full year since the first
confirmed case on January 22, 2020. The dataset is in a 367 row by 6 column CSV file containing the
data indicated in the table below. There are 367 rows since the first row contains column headers and
the year 2020 was a leap year.
Column Description
d date: January 22, 2020 to January 21, 2021
t time: 0 to 365
Ct total (cumulative) number of confirmed cases by day t
Ht total (cumulative) number of hospitalized individuals by day t
Rt total (cumulative) number of recovered individuals by day t
Dt total (cumulative) number of deaths by day t
It deduced current number of infected individuals on day t (not counting hospitalizations)
ItH deduced current number of hospitalized individuals on day t
The data cover the last four states of the State Transition Diagram. The data also include the total number of confirmed cases $C_t$ and the total number of hospitalizations $H_t$ that are not directly part of the SEIHRD model, but are used to compute $I_t$ and $I_t^H$.

$$I_t = C_t - H_t$$
$$I_t^H = H_t - R_t - D_t$$
2. Early Stage Approximation. In the early stage of an epidemic/pandemic, the number of Susceptible individuals changes slowly, so that $S_{t-1}/N$ is nearly a constant that can be rolled into $\alpha$. This allows the first two difference equations to be rewritten.
The state equations are now linear (and the observation/measurement equations have been linear), so
an ordinary Kalman Filter may be used. Use the ScalaTion COVID-19 dataset for the year 2020 to
train a Kalman Filter (KF) and make four-week ahead forecasts. Compare with the results of using
the EKF.
3. St and Et are hard to measure. Use the following identity to eliminate St from the model.
St = N − Et − It − Ht − Rt
Use the ScalaTion COVID-19 dataset for the year 2020 to train a Kalman Filter (KF) and make
four-week ahead forecasts. Compare with previous results.
4. Based on the last exercise, there is only one unobserved variable, $E_t$. Use the early stage approximation to eliminate $E_t$. Now all the state variables are observable ($n = m = 4$). Create a new Kalman Filter
where both F and H are 4-by-4 matrices. Use the ScalaTion COVID-19 dataset for the year 2020
to train this new Kalman Filter (KF) and make four-week ahead forecasts. Compare with previous
results.
5. When all the state variables are observable and the state transition and observation functions are
linear, a Vector Auto-Regressive VAR(p, n) model may be applied. Use the ScalaTion COVID-19
dataset for the year 2020 to train a VAR model and make four-week ahead forecasts. In particular,
create a VAR(1, 4) model and a VAR(2, 4) model as shown below, where $y_t = [I_t, H_t, R_t, D_t]$, $p = 2$ and $n = 4$, with two parameter matrices $\Phi_0 \in \mathbb{R}^{4 \times 4}$ and $\Phi_1 \in \mathbb{R}^{4 \times 4}$. Compare with previous results.
6. Create an AR∗(1, 4) model and an AR∗(2, 4) model. Recall that an AR∗(p, n) model is a VAR(p, n) model where all the parameter matrices are diagonal. Compare with previous results.
7. When the state transition function f or the observation function h is no longer well approximated locally by a linear function, or the noise is no longer well approximated by Gaussian distributions, it is recommended to use an Unscented Kalman Filter [199]. Use the ScalaTion COVID-19 dataset
for the year 2020 to train an Unscented Kalman Filter (UKF) and make four-week ahead forecasts.
Compare with previous results.
8. As the dimensionality of the problem becomes very large, such as in Numerical Weather Prediction
(NWP), it is recommended to use an Ensemble Kalman Filter [46, 91]. Use the ScalaTion
COVID-19 dataset for the year 2020 to train an Ensemble Kalman Filter (EnKF) and make four-week
ahead forecasts. Compare with previous results.
9. There are two techniques for estimating the noise covariance matrices Qt and Rt for Adaptive (Extended) Kalman Filters, one based on innovations (errors before correction), the other based on residuals (errors after correction). Read the following two papers [24, 3] and write a short essay on how the two
techniques work.
10. The discretization of the system of Ordinary Differential Equations (ODEs) for the SEIHRD model
used the Euler Method. For a system of ODEs, the vector equation is of the form:
(f) The number of Deaths in region k at time t is denoted by Dkt .
Since infections in one region can cause infections in other regions, use of mobility data can improve
forecasts. A simple approach is to add a connectivity/mobility matrix that indicates the amount of
travel between regions.
Cjk = number of individuals traveling from region j to region k in one time unit (16.59)
The diagonal of the matrix is minus the number of individuals leaving the region. The issue is to determine, for each state, the number of individuals entering region k in one time unit. For example, for susceptibility the travel-adjusted number is
$$\tau(S_{kt}) = S_{kt} + \sum_{j=0}^{l-1} C_{jk}\, \frac{S_{jt}}{n_j} \tag{16.60}$$
Therefore, the six space-time variables are now interrelated through a new set of ordinary differential
equations,
$$\dot{S}_{kt} = q_5\, \tau(H_{kt}) - \alpha\, \frac{\tau(I_{kt}) + \tau(H_{kt})}{n_k}\, \tau(S_{kt}) \qquad \text{Susceptible}$$
$$\dot{E}_{kt} = \alpha\, \frac{\tau(I_{kt}) + \tau(H_{kt})}{n_k}\, \tau(S_{kt}) - q_1\, \tau(E_{kt}) \qquad \text{Exposed}$$
$$\dot{I}_{kt} = q_1\, \tau(E_{kt}) - (q_2 + q_3)\, \tau(I_{kt}) \qquad \text{only Infected}$$
$$\dot{H}_{kt} = q_2\, \tau(I_{kt}) - (q_4 + q_5)\, \tau(H_{kt}) \qquad \text{Infected and Hospitalized}$$
$$\dot{R}_{kt} = q_3\, \tau(I_{kt}) + q_4\, \tau(H_{kt}) \qquad \text{Recovered}$$
$$\dot{D}_{kt} = q_5\, \tau(H_{kt}) \qquad \text{Died}$$
Parameter Estimation
A common way to estimate the parameters is to use Maximum Likelihood Estimation [109]. The
likelihood function is ...
The basic six differential equations can be approximated using six difference equations, where for example $\Delta S_{kt} = S_{kt} - S_{k,t-1}$. One time unit will correspond to one day.
$$\Delta S_{kt} = q_5 H_{k,t-1} - \alpha\, \frac{I_{k,t-1} + H_{k,t-1}}{n_k}\, S_{k,t-1} \qquad \text{Susceptible}$$
$$\Delta E_{kt} = \alpha\, \frac{I_{k,t-1} + H_{k,t-1}}{n_k}\, S_{k,t-1} - q_1 E_{k,t-1} \qquad \text{Exposed}$$
$$\Delta I_{kt} = q_1 E_{k,t-1} - (q_2 + q_3) I_{k,t-1} \qquad \text{only Infected}$$
$$\Delta H_{kt} = q_2 I_{k,t-1} - (q_4 + q_5) H_{k,t-1} \qquad \text{Infected and Hospitalized}$$
$$\Delta R_{kt} = q_3 I_{k,t-1} + q_4 H_{k,t-1} \qquad \text{Recovered}$$
$$\Delta D_{kt} = q_5 H_{k,t-1} \qquad \text{Died}$$
Again, the six parameters to estimate are $\alpha, q_1, q_2, q_3, q_4$, and $q_5$. Each of the fifty states in the United States may be solved separately or pooled together. Note that the approximation above only uses first lags; utilizing more lags may lead to better results.
The dataset will consist of three time series, Ikt infected cases, Hkt hospitalizations, and Dkt deaths.
TBD.
16.6 ODE Parameter Estimation
$$y = x(t) + \epsilon$$
$$\frac{dx(t)}{dt} = f(x(t);\, b)$$
Chapter 17
Event-Oriented Models
The simulation modeling techniques discussed so far center around specifying a set of equations. The structure or operation rules of actual systems may require logic that is hard to express in this fashion. A very flexible way of expressing such logic is to focus on what happens that may affect the state of the system. This is the approach followed by Event-Oriented Models. These are also called Event Scheduling Models, a name that suggests how the simulation engine would need to work. The engine needs to provide a means for creating, scheduling and processing events over time.
Before discussing event oriented models in more detail, several simulation modeling paradigms will be
highlighted.
• State-Oriented Models. State-oriented models focus on states and state transitions. State-oriented models include Markov Chains and their generalizations, such as Generalized Semi-Markov Processes (GSMPs). A GSMP can be defined using three functions: an activation function a, a clock function c, and a state-transition function.
In simulation, advancing to the current state x(t) causes a set of events {e} to be activated according
to the activation function a. Events occur instantaneously and may affect both the clock and transition
functions. The clock function c determines how time advances from t to t0 and the state-transition
function determines the next state x(t0 ). One can also tie in the input and output vectors. The input
vector u is used to initialize a state at some start time t0 and the response vector y can be a function
of the state sampled at multiple times during the execution of the simulation model.
• Event-Oriented Models. State-oriented models may become unwieldy when the state-space becomes
very large. One option is to focus on state changes that occur by processing events in time order. An
event may indicate what other events it causes as well as how it may change the state. Essentially, the
activation and state transition functions are divided into several simpler functions, one for each event
e:
– {e} = ae (x(t)),
– x(t0 ) = de (x(t)).
The logic for each event type implements ae (which other events to trigger) and de (how to change, or transition, the state). Time advance is simplified to just setting the time t′ to the time of the most imminent event on a Future Event List (activated events are placed on this list in time order).
• Process-Oriented Models. One of the motivations for process-oriented models is that event-oriented
models provide a fragmented view of the system or phenomena. As combinations of low-level events
determine behavior, it may be difficult to see the big picture or have an intuitive feel for the behavior.
Process-oriented or process-interaction models aggregate events by putting them together to form a
process. An example of a process is a customer in a store. As the simulated customer (as an active
entity) carries out behavior it will conditionally execute multiple events over time. A simulation
then consists of many simultaneously active entities and may be implemented using coroutines (or
threads/actors as a more heavyweight alternative). Typically, there is one coroutine for each active
entity. The overall state of a simulation then includes a combination of the states of each active entity
(each coroutine/thread has its own stack). The global shared state also includes a variety of resource types.
• Activity-Oriented Models. There are many types of activity-oriented models, including Petri-Nets and Activity-Cycle Diagrams. The main characteristic of such models is a focus on the notion of activity. An activity (e.g., customer checkout) corresponds to a distinct action that occurs over time and includes a start event and an end event. Activities may be started because time advances to their start time or a triggering condition becomes true. Activities typically involve one or more entities.
State information is stored in activities, entities and the global shared state.
• System Dynamics Models. System dynamics models have been added to DeMO, since hybrid
models that combine continuous and discrete aspects are becoming more popular. One may consider
modeling the flight of a golf ball once struck by a golf club. Let the response vector y = [y0 y1 ] where
y0 indicates the horizontal distance traveled, while y1 indicates the vertical height of the ball. Future
positions of y depend on the current position and time t. Using Newton’s Second Law of Motion, y
can be estimated by solving a system of Ordinary Differential Equations (ODEs) such as
The object uses the Dormand-Prince ODE solver to solve this problem. More accurate models for
estimating how far a golf ball will carry when struck by a driver can be developed based on inputs/fac-
tors such as club head speed, spin rate, smash factor, launch angle, dimple patterns, ball compression
characteristics, etc. There have been numerous studies of this problem, including [25].
In addition to these main modeling paradigms, ScalaTion supports a simpler approach called Tableau-Oriented Models that can be thought of as an automated analog of Spreadsheet Simulation.
One may also classify models as supporting Object-Oriented Simulation where the advantages of
object-oriented programming are utilized. For example, some forms of process-oriented simulation (e.g.,
GPSS) require each process to strictly follow a sequence of blocks. This makes it easier to create models
(and they can be fully specified in a GUI), but reduces the flexibility with which simulations can be built (see [155] for details). In ScalaTion, all of the simulation modeling techniques take advantage of object-orientation, especially the process-interaction models. Because each actor runs concurrently, this style falls in the process-oriented category; in the sense that it has active entities that are concurrent objects, it is in the lineage of Simula-67.
17.2 List Processing
Event-Oriented Models will require events and entities to be maintained in various types of lists or queues.
[Diagram: an ArrayDeque supports prepend/removeHead at the front and addOne/removeLast at the back]
The following methods are provided by ArrayDeque for adding and removing items from the ends of a
queue.
The Queue class implements its enqueue (with alias +=) and its dequeue methods as follows:
def enqueue (elem: A): this.type = this += elem // join the back of the line (addOne)
Note, specifying this.type means q.enqueue (elem) returns the type of q which could be Queue or a
subclass of Queue, e.g., MyQueue extends Queue.
See http://scalada.blogspot.com/2008/02/thistype-for-chaining-method-calls.html for an exam-
ple.
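A brief usage example of the FCFS behavior and the method chaining enabled by this.type:

import scala.collection.mutable.Queue

val q = Queue [Int] ()
q.enqueue (1).enqueue (2)          // chaining works since enqueue returns this.type
q += 3                             // += is an alias for addOne
println (q.dequeue ())             // prints 1: first-come, first-served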
17.2.2 LCFS Queue
Another type of queue is a Last-Come, First-Serve (LCFS) Queue also known as a Last-In, First-Out (LIFO)
Queue. As a data structure, they may also be implemented as an array or linked list. Efficient access to the front (top) of the queue is needed, as well as efficient methods for adding an item to the front of the queue (push) and removing an item from the front of the queue (pop).
The Stack class in the scala.collection.mutable package provides these capabilities. It also extends the ArrayDeque class from the same package.
The Stack class implements its push and its pop methods as follows:
def push (elem: A): this.type = prepend (elem) // join at the front
eventList += anEvent
Note, += is an alias for the addOne method. The most imminent event is removed from the eventList as follows:
nextEvent = eventList.dequeue ()
[Diagram: events are added to the Future Event List with += and the most imminent event (nextEvent) is removed with dequeue]
(2) Event Graph models capture the event logic related to triggering other events in causal links. In
this way, Event Graph models are more declarative (less procedural) than Event Scheduling models. They
also facilitate a graphical representation and animation. They can also serve as a design diagram for event
scheduling as they depict the relationships between events.
17.3 Event Scheduling
A simple, yet practical way to develop a simulation engine to support discrete-event simulation is to imple-
ment event-scheduling. This involves creating the following three classes: Event, Entity and Model. An
Event is defined as an instantaneous occurrence that can trigger other events and/or change the state of
the simulation. An Entity, such as a customer in a bank, flows through the simulation. The Model serves
as a container/controller for the whole simulation and carries out the scheduling of events in time order. The
centerpiece class for event scheduling and the one model developers will be most concerned with is the Event
class.
Class Methods:
abstract class Event (val entity: Entity, director: Model, delay: Double = 0.0,
stat: Statistic = null, val proto: EventNode = null)
extends Identifiable with Ordered [Event]:
An Event must be defined inside a Model referenced by director and have an Entity that is involved
in this event.
An important field in the Event class is actTime, which indicates the activation/occurrence time for the
event.
The methods in this class perform the following functions:
• The compare method compares the actTime of two events, thus allowing events to be placed in time order in the F.E.L. (eventList). Since Scala’s PriorityQueue class is organized as Highest Priority First (HPF), the logic of the compare method is reversed (see the sketch after this list).
• The cancel method allows scheduled events to be cancelled by marking them as not live.
• The occur method must be implemented in each subclass and it captures the event logic for a particular
type of event (e.g., Arrival). The method may (1) schedule other events and (2) specify state changes.
• The toString method converts internal information about an event into a string.
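The reversed comparison mentioned above might look like the following one-liner (an illustrative sketch; ScalaTion's actual implementation may differ in detail):

// Reversed so the highest-priority-first PriorityQueue yields the event
// with the smallest activation time first (illustrative sketch).
def compare (ev: Event): Int = ev.actTime compare actTime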
A frequently used method from the Model class is schedule. It will place the event on the eventList
in time order. The eventList is managed by the Model.
Much of what is needed to develop simple event scheduling models has now been covered.
17.3.2 Example: Bank Model
To create a simple bank simulation model, one could use the classes defined in the event-scheduling engine
to create
The complete code for this Bank simulation example may be found in the scalation.simulation.event
package in Bank.scala.
The event logic is coded in the occur method which in general triggers future events and updates the
current state. It indicates what happens when the event occurs.
Arrival Class
The Arrival case class extends the abstract Event class. The class constructor takes two parameters: the entity involved in the event and the time delay (how far in the future this event is to occur). These parameters are passed into the base class (Event) along with the Model director (this) and a Statistic object for recording statistics on inter-arrival times (t_ia_stat).
@param customer the entity that arrives, in this case a bank customer
@param delay the time delay for this event’s occurrence
end Arrival
Before implementing the logic of the occur methods, it is useful to create an Event Graph design diagram (even if Event Graphs are not used for the implementation). Each type of event is depicted as a node and the directed edges indicate event causality. For the Bank Model, two event types will suffice (Arrival and Departure). There are three causal links (directed edges): an arrival event triggers the next arrival and may trigger a service completion/departure event, and a departure event may trigger the next service completion event. Typically, the directed edges have conditions on them (i.e., the event is only triggered when the condition is true). The events are also triggered to occur in the future based on the transition/delay times associated with the edges. Figure 17.3 depicts the event graph for the Bank Model (the delay times are placed not in the graph, but in the caption, to avoid clutter).

The first condition, nArr < nStop − 1, holds until the simulation stopping rule takes effect; the second condition, nIn = 0, being true allows the arriving customer to go directly into service as the server is not busy; and the third condition, nIn > 1, allows a customer in the queue to begin service by scheduling the end of the service activity.
Figure 17.3: Bank Event Graph: nodes Arrival and Departure, with edge conditions nArr < nStop − 1 (Arrival self-loop), nIn = 0 (Arrival to Departure) and nIn > 1 (Departure self-loop), and with edge delay times of tia, ts, and ts, going left to right
The occur method will schedule the next arrival event (up to the limit) and check whether the teller is busy. If so, it will place the customer in the Wait Queue (W.Q.); otherwise, it schedules its own departure to correspond to its service completion time. Finally, it adjusts the state by incrementing both the number of arrivals (nArr) and the number in the system (nIn).
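Assembled from this description, the occur method might look like the following sketch (Arrival is an inner class of BankModel, so model members are in scope; details may differ from Bank.scala):

// Hedged sketch of Arrival.occur following the description above.
def occur (): Unit =
    if nArr < nStop - 1 then                          // more arrivals to generate
        val toArrive = Entity (iArrivalRV.gen, serviceRV.gen, BankModel.this)
        schedule (Arrival (toArrive, toArrive.iArrivalT))
    if nIn == 0 then                                  // teller idle: schedule own departure
        schedule (Departure (customer, customer.serviceT))
    else                                              // teller busy: join the wait queue
        waitQueue.enqueue (customer)
    nArr += 1                                         // one more arrival
    nIn  += 1                                         // one more customer in the system
end occur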
Suppose nStop = 3 and that three arrival events, A0 , A1 , A2 , with corresponding customers C0 , C1 , C2
occur before any departure events D0 , D1 , D2 . The table below shows the situation at the end of each occur
method.
Departure Class
For the Departure class, the occur method will check whether there is another customer waiting in the queue and, if so, schedule that customer’s departure. It will then record its own departure by updating the state, in this case decrementing nIn and incrementing nOut.
@param customer the entity that departs, in this case a bank customer
@param delay the time delay for this event’s occurrence
end Departure
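A corresponding sketch of the departure logic (again illustrative; details may differ from Bank.scala):

// Hedged sketch of Departure.occur following the description above.
def occur (): Unit =
    if waitQueue.nonEmpty then                        // another customer is waiting
        val waiting = waitQueue.dequeue ()            // begin that customer's service
        schedule (Departure (waiting, waiting.serviceT))
    nIn  -= 1                                         // one fewer customer in the system
    nOut += 1                                         // one more departed customer
end occur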
BankModel Class
The BankModel class defines a simple Event-Scheduling model of a Bank where service is provided by one
teller and models an M/M/1 queue. The parameters to the constructor are the name of the simulation model,
the number of independent replications to run (see the Simulation Output Analysis Chapter for details),
the number of entities to create before stopping the simulation, and the base random number stream to
use. Each random variate should use a different stream for independence. The model needs to initialize the model constants, create the random variates and state variables, specify the event logic in subclasses of Event, in this case the Arrival and Departure events defined in the previous subsections, and finally start the simulation after scheduling the first priming event. Once the simulation stops, reports and summaries may be output.
class BankModel (name: String = "Bank", reps: Int = 1, nStop: Int = 100, stream: Int = 0)
extends Model (name, reps):
// Create Random Variables (RVs)
end BankModel
Note, to aid with debugging, it may be useful to add the following state variable.
var nOut = 0.0 // number of customers that departed
The three code segments may now be merged together and compiled. The following imports are required: scalation.mathstat.Statistic, scalation.random.Exponential, and scalation.random.RandomSeeds.N_STREAMS. Outside of ScalaTion, for example in my scalation, the following import is also required: scalation.simulation.event.
Model Execution
Defining Simulation Scenarios
Some of the model preamble (before the Event subclasses) may be moved out of the BankModel to allow
trying multiple combinations of model parameters/constants.
The scenario specification can also be extended to include the random variates as well, allowing different
arrival and service distributions to be readily tested.
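For instance (a hypothetical sketch; BankScenario and its fields are chosen here for illustration), the constants and variates might be bundled into a scenario object that a model variant takes as a parameter:

import scalation.random.{Exponential, Variate}
import scalation.random.RandomSeeds.N_STREAMS

// Hypothetical scenario bundle: constants and random variates are specified
// outside the model so alternative combinations are easy to compare.
case class BankScenario (lambda: Double = 6.0,       // arrival rate (per hour)
                         mu:     Double = 7.5,       // service rate (per hour)
                         stream: Int    = 0):
    val iArrivalRV: Variate = Exponential (60.0 / lambda, stream)
    val serviceRV:  Variate = Exponential (60.0 / mu, (stream + 1) % N_STREAMS)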
Statistic Class
In order to collect statistical information, the constructor of the Event class calls the tally method from
the Statistic class in the scalation.mathstat package to obtain statistics on
Class Methods:
def set (n_ : Int, sum_ : Double, sumAb_ : Double, sumSq_ : Double,
minX_ : Double, maxX_ : Double): Unit =
def reset (): Unit =
def tally (x: Double): Unit =
inline def num: Int = n
inline def nd: Double = n.toDouble
inline def min: Double = if n == 0 then 0.0 else minX
inline def max: Double = maxX
def mean: Double = if n == 0 then 0.0 else sum / nd
def variance: Double =
def stdev: Double = sqrt (variance)
def ms: Double = sumSq / nd
def ma: Double = sumAb / nd
def rms: Double = sqrt (ms)
def interval (p: Double = .95): Double =
def interval_z (p: Double = .95): Double =
def show: String = s"Statistic: $n, $sum, $sumAb, $sumSq, $minX, $maxX"
def statRow: Array [Any] = Array (name, num, min, max, mean, stdev, interval ())
override def toString: String =
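A small usage example of tally-based collection:

import scalation.mathstat.Statistic

val stat = new Statistic ("t_s")                     // statistics on service times
for x <- Seq (2.0, 3.5, 4.1, 2.9) do stat.tally (x)  // record each observation
println (s"mean = ${stat.mean}, stdev = ${stat.stdev}")
println (stat.show)                                  // the raw accumulators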
Monitor Class
The Monitor class is used to trace the key actions in the execution of a model. This class works for both event-oriented and process-oriented models. Information collected by calling the trace method is saved in a log file.

Class Methods:

Model developers may call trace anywhere in their code. The Model class calls trace multiple times. The traces are written into a file located in, for example, the log/simulation directory. If this directory does not exist, it will need to be created for the model to run. Both scalation and my scalation have this directory.
17.3.3 Example: Call Center Model
A simpler model is one for which there is no waiting queue. If a server is free/idle, service may begin; otherwise, the call can be attempted later. A simple call center can be modeled in this way. The CallCenter.scala file contains event logic for the case of a call center with a staff of one. The Event Graph design diagram is similar to the one for the Bank Model. The major difference is that, as there is no queue, an ending call cannot trigger the beginning of the next call; i.e., the Departure event does not trigger any other events, it just updates the state.
Figure 17.4: Call Center Event Graph: nodes Arrival and Departure, with edge condition nIn = 0 and edge delay times of tia and ts, going left to right
Arrival Class
The Arrival class replaces waiting in a queue with incrementing a counter of the number of lost calls
(nLost).
end Arrival
Departure Class
The logic for the Departure class simply records information and updates counters.
end Departure
The remaining four of the five classes used for creating simulation models following the Event Scheduling paradigm are discussed in the next four subsections.
Class Methods:
case class Entity (val iArrivalT: Double, var serviceT: Double, director: Model)
extends Identifiable:
Important fields in the Entity class are the entity id eid, the time at which the entity arrived arrivalT =
director.clock + iArrivalT, and the time it begins waiting startWait, if applicable.
In the BankModel, a new entity (called toArrive) for a future arrival event is created before being passed
to the event.
17.3.5 WaitQueue Class
When entities are unable to begin service immediately, they are often placed in a wait queue. The WaitQueue
class provides a First-Come, First-Served (FCFS) queue and is implemented by extending Scala’s Queue class.
An entity is added to the back of the queue using the enqueue method and is removed from the front of the
queue using the dequeue method. By default it has infinite capacity, but may be restricted by passing a
value in the cap parameter. In this case, entities are barred from entering the queue when the queue is full. The number of times entities are barred is returned by the barred method.
The dequeue method collects both sample statistics (for Tq ) and time-persistent statistics (for Lq ) on queue occupancy; see the exercises.
The summary method returns statistics about waiting times in the queue.
Class Methods:
case class WaitQueue (director: Model, ext: String = "", cap: Int = Int.MaxValue)
extends Queue [Entity]:
17.3.6 WaitQueue LCFS Class
Again, entities in the model that are unable to begin service immediately are often placed in a wait queue. The
WaitQueue LCFS class provides a Last-Come, First-Served (LCFS) queue and is implemented by extending
Scala’s Stack class. An entity is added to the front (top) of the queue using the enqueue method and is
removed from the front (top) of the queue using the dequeue method. By default it has infinite capacity,
but may be restricted by passing a value in the cap parameter. In this case, entities are barred from entering
the queue, and need to stay where they are, go somewhere else or be lost to the simulation. The number of
times entities are barred is returned by the barred method. The summary method returns statistics about
waiting times in the LCFS queue.
Class Methods:
case class WaitQueue_LCFS (director: Model, ext: String = "", cap: Int = Int.MaxValue)
      extends Stack [Entity]:
See the exercises for how to add additional types of wait queues.
• The schedule method is used to add events to the eventList so they may be processed in time order.
To preempt a scheduled event, the cancel method may be called.
• The simulate method repeatedly removes and processes events in the eventList. The simulate
method will cause the main simulation loop to execute, which will remove the most imminent event
from the eventList and invoke its occur method.
The main loop in the Model class is the following while with the line controlling animation removed.
while simulating && ! eventList.isEmpty do
nextEvent = eventList.dequeue ()
if nextEvent.live then
_clock = nextEvent.actTime
nextEnt = nextEvent.entity
log.trace (this, s"executes ${nextEvent.me} on ${nextEnt.eid}", nextEnt, _clock)
debug ("simulate", s"$nextEvent \t" + "%g".format (_clock))
nextEvent.occur ()
end if
end while
The simulation will continue until a stopping rule evaluates to true (either the simulating flag becomes
false or the eventList becomes empty). The nextEvent is removed from the front of the priority queue and
checked to make sure it is still live (not cancelled). If it is live, the director’s simulation clock is set to this event’s activation time actTime. For tracing purposes, this event’s associated entity is referenced. Then this event’s occur method is called, which carries out the event logic based on the type of event it is.
Methods to getStatistics and report statistical results are also provided.
Class Methods:
@param name the name of the model
@param animation whether to animate the model (only for Event Graphs)
The animate methods are used with Event Graphs (see the next section).
As a starting point for developing simulation models following the event scheduling paradigm, a model developer may copy the Ex Template.scala file in the scalation.simulation.event package.
polishes the part. The service rate for the first machine is µ1 = 12 hr−1 and is µ2 = 15 hr−1 for the second
machine. Both machines have space to store three parts waiting to be machined. When there is a backup
of parts, flow from other departments must continue, so these parts are removed to be sold for scrap. The
company wishes to conduct a simulation study to see which policy is better.
• Policy Two: Do not block machine 1 when machine 2’s queue is full, rather send the partially finished
part to scrap.
The cost of a raw part is 100 dollars, the value of a partially finished part is 50 dollars, and the value of a finished part is 200 dollars. The operational cost of machine 1 is 60 dollars an hour and 30 dollars an hour for machine 2.
Management has argued that machine 1 should be blocked, since 50 dollars are lost whenever a partially finished part is sold for scrap, while if it is a raw part it can be resold for 100 dollars. Others say stopping/forced idling of machine 1 is costly.
Figure 17.5 shows the flow of parts through the machine shop. Use simulation to estimate performance characteristics and determine the better policy. How sensitive is the decision to the relative service rates of the two machines?
Figure 17.5: Machine Shop Schematic Diagram: Each Queue Has Capacity Three
The Machine.scala file in the event.example_1 package contains a partial implementation of the machine shop model. The Event Graph design diagram now requires three event types: Arrival, FinishMachine1, and FinishMachine2. The Event Graph is depicted in Figure 17.6.
[Figure 17.6: Machine Shop Event Graph with edge conditions nIn1 = 0 and nIn2 = 0]

The state variables are the following:
Blocking. Note that one policy may require machine station 1 to be blocked due to no room in machine station 2. The blocking occurs when machine 1 finishes a part that is ready to move on to machine station 2, but cannot. This part will be stuck at machine 1, and machine 1’s operator will be idle, until machine 2 completes the part it is working on. An additional edge should be added to the Event Graph to indicate this causal connection. As the blocked part is not in a queue and not on the F.E.L., it needs to be held somewhere, e.g., in a model variable called heldAtMachine1.
17.4 Event Graphs
Event Graphs operate in a fashion similar to Event Scheduling. Originally proposed as a graphical conceptual
modeling technique (Schruben, 1983) for designing event oriented simulation models, modern programming
languages now permit more direct support for this style of simulation modeling.
In ScalaTion, the simulation engine for Event Graphs consists of the following seven classes:
The first five classes are shared with Event Scheduling.
1. An EventNode (subclass of Event), defined as an instantaneous occurrence that can trigger other events
and/or change the state of the simulation, is represented as a node in the event graph.
2. A CausalLink emanating from an event/node is represented as an outgoing directed edge in the event
graph. It represents causality between events. One event can conditionally trigger another event to
occur some time in the future.
class BankModel2 (name: String, nStop: Int, iarrivalRV: Variate, serviceRV: Variate)
extends Model (name, true) // true => animation on
The Scala code below was made more declarative than typical event-scheduling code to better mirror event graph specifications, where the causal links specify the conditions and time delays. For instance, the condition () => nArr < nStop-1 is an anonymous function/closure returning Boolean that will be evaluated when arrival events are handled. In this case, it represents a stopping rule; when the number of arrivals exceeds the threshold, the arrival event will no longer schedule the next arrival. The serviceRV is a random variate to be used for computing service times.
In the BankModel2 class, the logic in the Event classes is simplified somewhat due to the specification of CausalLinks. Before the Event classes are specified, the following four types of definitions are required: statistical accumulators, event nodes, causal links/edges, and state variables.
First the statistical accumulators are defined, one for inter-arrivals and one for service. A wait queue is
given and it maintains its own statistics.
val t_ia_stat = new Statistic ("t_ia") // time between Arrivals statistics
val t_s_stat = new Statistic ("t_s") // time in Service statistics
addStats (t_ia_stat, t_s_stat)
val waitQueue = WaitQueue (this) // waiting queue that collects stats
For animation of the event graph, a prototype for each type of event is created and displayed as a node.
Locations are required for each EventNode.
val aLoc = Array (150.0, 200.0, 50.0, 50.0) // Arrival event node location
val dLoc = Array (450.0, 200.0, 50.0, 50.0) // Departure event node location
val aProto = new EventNode (this, aLoc) // prototype for all Arrival events
val dProto = new EventNode (this, dLoc) // prototype for all Departure events
The edges connecting these prototypes represent the causal links. The aLink array holds two causal links emanating from Arrival, the first a self-link representing triggered arrivals and the second representing an arrival finding an idle server, so it can schedule its own departure. The dLink array holds one causal link emanating from Departure, a self-link representing the departing customer causing the next customer in the waiting queue to enter service (i.e., have its departure scheduled).
val aLink = Array (CausalLink ("l_A2A", this, () => nArr < nStop-1, aProto),
CausalLink ("l_A2D", this, () => nIn == 0, dProto))
val dLink = Array (CausalLink ("l_D2D", this, () => nIn > 1, dProto))
aProto.displayLinks (aLink)
dProto.displayLinks (dLink)
The state variables, nArr, nIn and nOut, are defined as vars since they will change during the simulation.
An animation of the Event Graph consisting of two EventNodes Arrival and Departure and three
CausalLinks is depicted in Figure 17.7.
The main thing to write within each subclass of Event is the occur method. To handle arrival events,
the occur method of the Arrival class first checks the aLinks to see if it needs to trigger additional events.
It then updates the current state by incrementing both the number of arrivals (nArr) and the number in the
system (nIn).
end Arrival
To handle departure events, the occur method of the Departure class first checks the dLink to see if
it needs to trigger additional events. It then updates the state by decrementing the number in the system
(nIn) and incrementing the number of departures (nOut).
end Departure
Four of the classes used for creating simulation models following the Event Scheduling paradigm can be reused for Event Graphs, namely Entity, Event, Model, and WaitQueue. In addition, EventNode is required, as its instances form the nodes in an Event Graph, while an edge in the Event Graph is an instance of the CausalLink class. These two new classes (EventNode and CausalLink) are described in the subsections below.
17.4.2 EventNode Class
The EventNode class provides facilities for defining simulation events. Subclasses of Event provide event-
logic in their implementation of the occur method. The main purpose of EventNode is to associate a type of
event with a node in the event graph.
Class Methods:
def occur (): Unit = throw new NoSuchMethodException ("this occur should not be called")
def displayLinks (links: Array [CausalLink]): Unit =
Class Methods:
case class CausalLink (label: String, director: Model, val condition: () => Boolean,
causedEvent: Event)
extends Identifiable:
Event graphs support a more declarative means for specifying a simulation, allowing the relationships between events to be seen visually and providing basic animation of simulation model execution.
17.5 Exercises
1. It is common practice to implement an eventList as a Heap-Based Priority Queue. The most imminent
event is at the top of the heap data structure in a priority queue. Processing this event involves removing
and returning this event, swapping in the event in the last position in the heap, and reestablishing the
heap order. Draw pictures to illustrate what happens to the heap.
2. As mentioned, the dequeue method in the WaitQueue class collects both sample and time-persistent statistics. The Statistic class in the mathstat package is used to collect sample statistics via tally, while the TimeStatistic class is used to collect time-persistent statistics via accum.
@param name the name for this statistic (e.g., ’numberInQueue’ or ’tellerQ’)
@param _lastTime the time of last observation
@param _startTime the time observation began
How do the tally and accum methods differ? Why is it necessary to have two such methods?
3. Explain what each line of the main while loop does in the simulate method of the Model class.
4. Draw an event graph for the simulation of an M/M/c/K Queue. Let the arrival rate λ = 7 per hour and
the service rate µ = 8 per hour. Write and execute an event scheduling model for this Queue System.
Compare the results with the theoretical results from Queueing Theory. Consider the following cases:
c = 1, K = 1; c = 1, K = ∞; c = 2, K = 10; c = 2, K = ∞.
5. Wendy’s vs. McDonald’s Lines. Given two servers, is it better to have a line/queue for each server or one common line for both servers? Analyze the waiting times and the standard deviation of the waiting times. Let λ = 20 hr−1 and µ = 12 hr−1 .
Hint: To compare the two waiting times, compute the adjusted waiting time. For example, with a
single queue, suppose Wq is the average time in queue for the nq customers that had to wait. The
overall adjusted waiting time is then
$$T_q = \frac{n_q W_q + (m - n_q)\, 0}{m}$$
When there are two queues with waiting times Wq1 and Wq2 the formula becomes
7. Machine Shop. Complete the implementation of the Machine Shop simulation and determine which
policy is better.
8. Two-Stage Queueing System Simulation. Consider modeling a system with two stages of service:
In stage one a patient registers, and in stage two the patient receives treatment. The line for registration
is unbounded, while patients waiting for treatment must be within a room with a total capacity of
K. The first stage has one server, while the second stage has two. Based on the cases in the previous
exercise, simulate a two-stage service system where the first stage has a queue with c = 1, K = ∞ and
the second stage has a queue with c = 2, K = 10. When the second queue is full, the server in the first stage will be blocked (i.e., must be idle until space is available).
9. Emergency Department Simulation. Create and execute an event scheduling model for an emer-
gency department/room based on the specifications given in the following paper, “Modeling and Im-
proving Emergency Department Systems using Discrete Event Simulation,” by Christine Duguay and
Fatah Chetouane, https://journals.sagepub.com/doi/10.1177/0037549707083111.
10. ScalaTion uses Scala’s Priority Queue class for its time ordered F.E.L. (eventList), but that class
could also be used for priority based waiting queues. Add a new class to the event package called
WaitQueue PQ to provide this capability.
11. Use the new WaitQueue PQ class along with WaitQueue and WaitQueue LCFS to test various job schedul-
ing algorithms: FCFS, LCFS, Shortest Job First (SJF) and Highest Priority First (HPF).
12. Event Scheduling (ES) Simulation: A Small Fast Food Restaurant has two servers and enough
space for three customers to wait (at most five customers total at any given time). For the case of a
single queue, perform an ES simulation for m = 20 customer arrivals. Assume each server can process
µ = 30 customers per hour and that the customer arrival rate λ = 75 customers per hour (assume
Exponential distributions). Each completed order gives a net profit (before paying the servers) of 2.00
dollars. Each server makes 11.00 dollars per hour. Should the restaurant hire a third server? Explain
in terms of profit (after paying the servers) per hour. Note: this simulation problem is posed in the
Hand/Spreadsheet Simulation section of the Simulation Foundations Chapter.
(i) Use ScalaTion’s Known Random Variate Generator (RVG) to make ES reproduce the results of the
Spreadsheet Simulation. Give the Event Graph, code for the Event subclasses including their occur
methods, and the summary results.
(ii) Replace the Known RVG with ScalaTion’s Exponential RVG in order to run the simulation
longer to obtain better results. Also, run multiple replications to produce multiple estimates for the
final profit for having two servers vs. three servers. To make the replications independent, make sure
each replication uses a different base random number stream.
(iii) Explain how Confidence Intervals can be used to make more informed decisions.
(iv) [Bonus] Create 95% Confidence Intervals for the two and three server simulations. Let the number
of replications be 10 and use the Student’s t Distribution.
13. Simulation Model Design: Draw an Event Graph for a simulation model used to study a
Bank with two tellers with Exponential inter-arrival and service time distributions with rates λ and
µ, respectively. Having one line and two servers along with Exponential distributions makes this an
M/M/2 queue. Explain the nodes and edges in your event graph.
Chapter 18
Process-Oriented Models
18.1 Base Traits and Classes for Process-Oriented Models
The simulation package contains several base traits and classes that can be used by several types of
simulation models, and are especially useful for process-oriented simulation models.
trait Identifiable:
trait Locatable:
trait Modelable:
trait Temporal
extends Identifiable:
18.2 Concurrent Processing of Actors
A simulation with multiple actors as active entities, whose behaviors overlap in time, is most naturally implemented using concurrent programming.
Traditionally, programming language support for concurrent programming has been limited and this
makes providing support of process-oriented models more difficult. Typically, support for coroutines is
sufficient for developing simulation engines of this type. Languages supporting coroutines include: Simula,
Smalltalk, Modula-2, Ruby, Julia, Go, and Kotlin.
On the other hand, many languages support threads, notably Java and therefore all Java-based languages including Scala. Although threads are capable of getting the job done, they introduce two problems: unusual transfer of control between threads can lead to bugs that are challenging to eliminate, and there is more overhead (time and space) in using threads rather than coroutines. To deal with the first problem, ScalaTion implements a Coroutine class using Java’s Runnable interface and Thread class, so that users of ScalaTion can avoid the complexity and bugs associated with raw threads. Improvement on the problem of overhead is provided by Java’s virtual threads (Project Loom); these run in user space with reduced overhead and can allow many more actors to run concurrently.
import static java.lang.System.out;

public class PingPong implements Runnable
{
    private final String word;        // the word to print repeatedly
    private final int    delay;       // delay in milliseconds between prints

    /**************************************************************
     * Construct a ping or pong object.
     */
    public PingPong (String whatToSay, int delayTime)
    {
        word  = whatToSay;
        delay = delayTime;
    } // PingPong
/**************************************************************
* Run method for the ping/pong object.
*/
public void run ()
{
for ( ; ; ) {
out.println (word);
try {
Thread.sleep (delay);
} catch (InterruptedException ex) {
return;
} // try
} // for
} // run
/**************************************************************
* Main method for invoking the application.
* @param args Command-line arguments
*/
public static void main (String [] args)
{
new Thread (new PingPong ("ping", 333)).start ();
new Thread (new PingPong ("PONG", 1000)).start ();
} // main
} // PingPong
$ javac PingPong.java
$ java PingPong
Notice the interleaved execution. Had run been called directly, rather than indirectly via the start method, no interleaving would be seen, since the first thread would have to finish before the second one could begin. Try changing "start" to "run" in the code above. Note, to terminate the program, type Ctrl-C.
abstract class Coroutine (label: String = "cor")
extends Runnable:
The Coroutine class also uses the following from the java.util.concurrent package: Executors, ExecutorService, Future, Semaphore, and ThreadPoolExecutor. See Coroutine.scala in the scalation.simulation package for details.
18.3 Process Interaction
Many discrete-event simulation models are written using the process-interaction world view, because the
code tends to be concise and intuitively easy to understand. Take for example the process-interaction model
of a bank (BankModel, a subclass of Model) shown later. Following this world view, one simply constructs the simulation components and then provides a script for entities (SimActors) to follow while in the system. In this case, the act method for the Customer class provides the script (what entities should do), i.e., enter the bank, wait in the queue if the tellers are busy, then receive service and finally leave the bank.
The development of a simulation engine for process-interaction models is complicated by the fact that con-
current (or at least quasi-concurrent) programming is required. Various language features/capabilities from
lightweight to middleweight include continuations, coroutines, fibers, actors, virtual-threads and threads.
Heavyweight concurrency via OS processes is infeasible, since simulations may require a very large number
of concurrent entities. The main requirement is for a concurrent entity to be able to suspend its execution
and be resumed where it left off (its state being maintained on a stack). Since preemption is not necessary,
lightweight concurrency constructs are ideal. Presently, ScalaTion uses the Coroutine class.
ScalaTion includes several types of model components: Gate, Junction, Resource, Route, Sink,
Source, Transport, WaitQueue and WaitQueue LCFS. A model may be viewed as a directed graph with
several types of nodes:
• Gate: a gate is used to control the flow of entities; they cannot pass when it is shut.
• WaitQueue: a FCFS wait-queue provides a place for entities to wait, e.g., waiting for a resource to
become available or a gate to open.
• WaitQueue LCFS: an LCFS wait-queue provides a place for entities to wait, e.g., waiting for a resource
to become available or a gate to open.
These nodes are linked together with directed edges (from, to) that model the flow entities from node to
node. A Source node must have no incoming edges, while a Sink node must have no outgoing edges.
• Route: a route bundles multiple transports together (e.g., a two-lane, one-way street).
• Transport: a transport is used to move entities from one component node to the next.
The model graph includes coordinates for the component nodes to facilitate animation of the model.
Coordinates for the component edges are calculated based on the coordinates of its from and to nodes.
Small colored tokens move along edges and jump through nodes as the entities they represent flow through
the system.
Formally, the model is not required to be a graph, since entities can move from node to node without going along an edge (e.g., from WaitQueue to Resource). These graph-like structures are referred to as Network Diagrams. ScalaTion’s Agent-Based Simulation (see the next section), however, does require the model to be based on an underlying Property Graph.
18.3.1 Model Template
All process-interaction simulation models in ScalaTion are of the following basic form. The file called
Ex Template.scala in the scalation.simulation.process package may be used as a starting template
for the development of specific process-interaction simulation models.
class SOMEModel (name: String = "SOME", reps: Int = 1, animating: Boolean = true,
aniRatio: Double = 8.0, nStop: Int = 100, stream: Int = 0)
extends Model (name, reps, animating, aniRatio):
val entry = Source ("entry", this, () => SOMEActor (), 0, nStop, iArrivalRV, (100, 290))
val exit = Sink ("exit", (600, 290))
end SOMEActor
simulate ()
waitFinished ()
Model.shutdown ()
end SOMEModel
Specification of a process-interaction model involves four steps: (1) Initialize Model Constants, (2) Create Random Variables (RVs), (3) Create Model Components, and (4) Specify Scripts for each Type of Simulation Actor. The SOMEModel class can be invoked as follows:
Class Methods:
trait Component
extends Identifiable with Locatable:
[Network Diagram: entry → toTellerQ → tellerQ → teller → toDoor → door]
• Initialize Model Constants: In this case customer arrival (lambda) and service (mu) rates need to be
specified. In addition, the number of service units (nTellers) needs to be specified.
• Create Random Variables (RVs): This model will have three random variates: one for the inter-
arrival times (iArrivalRV), one for service times (serviceRV), and one for movement (moveRV) along
Transports.
• Create Model Components: A key step is to define the component nodes entry, tellerQ, teller,
and door. Then two edge components, toTellerQ and toDoor, are defined. These six components are
added to the BankModel using the addComponent method. Note, the endpoint nodes for an edge must
be added before the edge itself.
class BankModel (name: String = "Bank", reps: Int = 1, animating: Boolean = true,
aniRatio: Double = 8.0, nStop: Int = 100, stream: Int = 0)
extends Model (name, reps, animating, aniRatio):
val entry = Source ("entry", this, () => Customer (), 0, nStop, iArrivalRV, (100, 290))
val tellerQ = WaitQueue ("tellerQ", (330, 290))
val teller = Resource ("teller", tellerQ, nTellers, serviceRV, (350, 285))
val door = Sink ("door", (600, 290))
val toTellerQ = Transport ("toTellerQ", entry, tellerQ, moveRV)
val toDoor = Transport ("toDoor", teller, door, moveRV)
• Specify Scripts for each Type of Simulation Actor: Finally, an inner case class called Customer is defined
where the act method specifies the script for bank customers to follow. The act method specifies the
behavior of concurrent entities (SimActor) and is analogous to the run method for Java/Scala Threads.
end Customer
simulate ()
waitFinished ()
Model.shutdown ()
end BankModel
1. Upon creation by the Source at entry, the actor will move to the teller queue.
2. The actor will check whether all the tellers are busy and if all are busy, will wait in the queue tellerQ
which is a WaitQueue. Note, the call to noWait is just for statistics collection.
3. The actor will utilize one of the tellers in the teller Resource, for a period of time corresponding to
its service time.
4. After service is finished, the actor will then release the teller. This allows a waiting actor to begin
service and triggers the collection of statistics.
5. The actor will then move along the toDoor transport to the door.

6. Upon arrival at the door, a Sink, the actor will leave the bank/simulation and overall statistics will be collected.
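Assembled from these steps, the act script might look like the following sketch. The method names (waitIn, noWait, utilize, release, move, leave) follow the component descriptions in this chapter, but treat them as assumptions; details may differ from the actual BankModel.

case class Customer () extends SimActor ("c", this):
    override def act (): Unit =
        toTellerQ.move ()                         // 1. move to the teller queue
        if teller.busy then tellerQ.waitIn ()     // 2. wait while all tellers are busy
        else tellerQ.noWait ()                    //    record a zero wait for statistics
        teller.utilize ()                         // 3. receive service for the service time
        teller.release ()                         // 4. release the teller
        toDoor.move ()                            // 5. move along the transport to the door
        door.leave ()                             // 6. leave the bank/simulation
end Customer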
The last three method calls will run the simulation using multiple threads/coroutines (simulate), wait
until all the threads/coroutines are finished (waitFinished), and then safely shut down concurrent execution
(Model.shutdown).
18.3.4 Executing the Bank Model
The BankModel class can be invoked as follows:
To make the animation easier to follow, try changing the aniRatio to 50.0.
An active entity (SimActor) representing a Customer is shown as token (small circle) frozen in its motion
along the first Transport. Like an Event Graph, a Network Diagram can be used for both simulation model
design and animation.
18.3.7 SimActor Class
The SimActor abstract class represents entities that are active in the model. The act abstract method,
which specifies entity behavior, must be defined for each subclass. Each SimActor extends the Coroutine
class and may be roughly thought of as running in its own thread. The script for entities/sim-actors to follow
is specified in the act method of the subclass as was done for the Customer case class in the BankModel.
For example, a customer in the BankModel will enter the bank, move toward a teller, wait in the queue if there is a line, be served by the teller and finally leave the bank. The director will transfer control to the actor (bank customer), which will execute code to get to the next step and then transfer control back to the director. In this way the entity progresses through time, processing multiple events over its lifetime. A SimActor is created by a Source and terminated by a Sink. The act method encodes the logic of the actor's script.
Class Methods:
Two of the key methods involved in transferring control between actors (via the director) are schedule
and yieldToDirector.
The schedule method places this actor in the agenda (like a future event list), effectively specifying when the actor (a coroutine) will be reactivated. The delay parameter indicates how far into the future this will be.
When the actor has completed a step (conceptually like an embedded event) and has either placed itself in a queue or the agenda, it is ready to let another actor execute. It does this by yielding control to the director, via the yieldToDirector method, so the director can take the next action. The quit parameter is a flag indicating whether this actor has completed its last step.
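Sketches of these two methods, consistent with the description above, are given below (the internal reschedule method is an assumption; actTime holds the actor's activation time, as used by the director's scheduling loop shown later):

def schedule (delay: Double): Unit =
    actTime = director.clock + delay             // clock time at which this actor is to be reactivated
    director.reschedule (this)                   // place this actor in the director's agenda
end schedule

def yieldToDirector (quit: Boolean = false): Unit =
    director.log.trace (this, "yields control to", director, director.clock)
    yyield (director, quit)                      // yield control to the director; quit = true on last step
end yieldToDirector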
18.3.8 Source Class
The Source class generates the entities (SimActors) entering the model. Its act method loops to create the specified number (units) of entities, as shown in the following excerpt.
breakable {
for i <- 1 to units do // minor loop - make actors
debug ("act", s"make $i SimActor")
if director.stopped then break () // terminate source, simulation ended
val actor = makeEntity () // make new actor
actor.mySource = this // actor’s source
actor.subtype = esubtype // set the entity subtype
director.numActors += 1 // number of actors created by all sources
director.log.trace (this, "generates", actor, director.clock)
director.animate (actor, CreateToken, randomColor (actor.id), Ellipse (),
Array (at(0) + at(2) + RAD / 2.0, at(1) + at(3) / 2.0 - RAD))
actor.schedule (0.0)
For conciseness, the Source.group method may be used to create multiple sources for a model.
def group (director: Model, makeEntity: () => SimActor, units: Int, xy: (Int, Int),
src: (String, Int, Variate, (Int, Int))*): List [Source] =
val sourceGroup = new ListBuffer [Source] ()
for s <- src do sourceGroup += Source (s._1, director, makeEntity, s._2, units, s._3,
(xy._1 + s._4._1, xy._2 + s._4._2))
sourceGroup.toList
end group
For animation, the location of a Source node is specified by loc.
Class Methods:
def this (name: String, director: Model, makeEntity: () => SimActor, esubtype: Int,
units: Int, iArrivalTime: Variate, xy: (Double, Double)) =
def display (): Unit =
def act (): Unit =
18.3.9 Sink Class
The Sink class is used to terminate entities/actors and collect overall statistics. Note, when a SimActor is no longer referenced (e.g., in the director's agenda or in a wait queue), it becomes available for garbage collection (memory reclamation).
Class Methods:
18.3.10 Transport Class
The Transport class provides a pathway between two other component nodes. The Components in a Model conceptually form a graph in which the edges are Transport objects and the nodes are other Component objects. An edge may be either a Transport or a Route. A Transport is directional, connecting a from component to a to component. When flow is required in both directions, two transports are required. A SimActor may utilize a Transport by calling either the move or jump method. The move method is intended for smooth animation, while the jump method transports the entity quickly to the next component.
Class Methods:
class Transport (name: String, val from: Component, val to: Component,
motion: Variate, isSpeed: Boolean = false, bend: Double = 0.0,
shift1: R2 = new R2 (0.0, 0.0), shift2: R2 = new R2 (0.0, 0.0))
extends Component:
Class Methods:
18.3.11 Resource Class
The Resource class provides a specified number of service units (e.g., bank tellers) that entities/actors may utilize and then release.
@param name the name of the resource
@param line the line/queue where entities wait
@param units the number of service units (e.g., bank tellers)
@param serviceTime the service time distribution
@param at the location of the resource (x, y, w, h)
class Resource (name: String, line: WaitQueue, private var units: Int, serviceTime: Variate,
at: Array [Double])
extends Component:
def this (name: String, line: WaitQueue, units: Int, serviceTime: Variate,
xy: (Double, Double))
def changeUnits (dUnits: Int): Unit =
def display (): Unit =
def busy: Boolean = inUse == units
def utilize (): Unit =
def utilize (duration: Double): Unit =
def release (): Unit =
18.3.12 WaitQueue Class
The WaitQueue class provides a First-Come, First-Served (FCFS) queue where entities/actors may wait, e.g., when all the service units of a Resource are busy.
Class Methods:
class WaitQueue (name: String, at: Array [Double], cap: Int = Int.MaxValue)
extends Queue [SimActor] with Component:
def this (name: String, xy: (Double, Double), cap: Int) =
def isFull: Boolean = length >= cap
def barred: Int = _barred
def display (): Unit =
def waitIn (): Boolean =
def noWait (): Unit = tally (0.0)
18.3.13 WaitQueue_LCFS Class
The WaitQueue_LCFS class provides a Last-Come, First-Served (LCFS) queue (a stack) where entities/actors may wait.
Class Methods:
class WaitQueue_LCFS (name: String, at: Array [Double], cap: Int = Int.MaxValue)
extends Stack [SimActor] with Component:
Class Methods:
18.3.14 Junction Class
The Junction class provides a connection point between two transports/routes, through which entities/actors jump.
@param name the name of the junction
@param director the director controlling the model
@param jTime the jump-time through the junction
@param at the location of the junction (x, y, w, h)
class Junction (name: String, director: Model, jTime: Variate, at: Array [Double])
extends Component:
def this (name: String, director: Model, jTime: Variate, xy: (Double, Double)) =
def display (): Unit =
def jump (): Unit =
18.3.15 Gate Class
The Gate class is used to control the flow of entities/actors (e.g., a traffic light); they cannot pass while the gate is shut.
Class Methods:
class Gate (name: String, director: Model, line: WaitQueue, units: Int,
onTime: Variate, offTime: Variate,
loc: Array [Double], shut0: Boolean = false, cap: Int = 10)
extends SimActor (name, director) with Component:
def this (name: String, director: Model, line: WaitQueue, units: Int,
onTime: Variate, offTime: Variate,
xy: (Double, Double), shut0: Boolean, cap: Int) =
def shut: Boolean = _shut
def display (): Unit =
def release (): Unit =
def act (): Unit =
def gateColor: Color = if _shut then red else green
def flip (): Unit = _shut = ! _shut
def duration: Double = if _shut then offTime.gen else onTime.gen
18.3.16 Route Class
The Route class bundles multiple Transports together to form a multi-lane pathway between two components. See the RoadModel, which uses two Route objects, one for West-bound traffic and the other for East-bound traffic. Each route has two lanes, making the road a four-lane road overall.
Class Methods:
18.3.17 Model Class
The Model class serves as the director that directs the actors in the play (i.e., the simulation model). Each entity (SimActor) is implemented as a Coroutine and may be roughly thought of as running in its own thread. Control is transferred back and forth between the director and the actors in the play.
Class Methods:
@param name the name of the simulation model
@param reps the number of independent replications to run
@param animating whether to animate the model
@param aniRatio the ratio of simulation speed vs. animation speed
@param full generate a full report with both sample and time-persistent statistics
class Model (name: String, val reps: Int = 1, animating: Boolean = true, aniRatio: Double = 1.0,
val full: Boolean = true)
extends Coroutine (name) with Completion with Modelable with Component:
The operation of the process-interaction simulation engine can be understood by looking at the inner scheduling loop within the act method of the Model class. The director takes the first actor in the agenda and marks it as theActor. Then it advances its clock to the activation time of theActor. The director then transfers control by yielding to theActor. After executing the next step in its logic, theActor will transfer control back to the director. This continues until the agenda becomes empty or the director is instructed to stop simulating.
while simulating && ! agenda.isEmpty do // INNER SCHEDULING LOOP
_theActor = agenda.dequeue () // next from priority queue
_clock = _theActor.actTime // advance the time
log.trace (this, "resumes", _theActor, _clock)
yyield (_theActor) // director yields to actor
end while
For One-Shot Simulation (OSS), reps should be one, and for the Method of Independent Replications (MIR) it should be, say, 10 or greater. See the Simulation Output Analysis Chapter for details. The simulate method is called by the class extending Model to initialize the component parts of the model. Its call to start will make a new thread that begins executing the director's act method.
18.3.18 TrafficModel
The TrafficModel simulates cars passing through a 4-way intersection controlled by traffic lights (see Figure 18.3). For conciseness, the Source.group method is used to create all four Sources. The coordinates (800, 250) form a reference point; the rest are relative to the reference point.
val source = Source.group (this, () => Car (), nStop, (800, 250),
("s1N", 0, iArrivalRV, (0, 0)), // from North
("s1E", 1, iArrivalRV, (230, 200)),
("s1S", 2, iArrivalRV, (30, 400)),
("s1W", 3, iArrivalRV, (-200, 230)))
A place is needed for cars waiting for a stop light to change from red to green. Thus four WaitQueues
are needed.
val queue = WaitQueue.group ((800, 430), ("q1N", (0, 0)),    // before North light
                             ("q1E", (50, 20)),              // before East light
                             ("q1S", (30, 70)),              // before South light
                             ("q1W", (-20, 50)))             // before West light
The 4-way intersection requires four traffic lights to control the flow of cars. At any particular time, two of the lights should be red (closed Gate) and two should be green (open Gate). The onTimeRV gives the duration for the green light, while offTimeRV gives the duration of the red light. Both are Sharp distributions that give a constant value. The group method swaps the on and off times, based upon whether a light's number in the group is even or odd. Lights are positioned at the back of the intersection, e.g., the light for traffic coming from the North source "s1N" is the bottom left light in Figure 18.3.
After making it through the intersection, traffic continues to its designated Sink. For example, cars
created by source s1N will be terminated by sink k1S.
For each Source, two Routes are created: one from the Source to the WaitQueue and the other from the
Gate to the Sink.
In total, there are 16 nodes: four Sources, four WaitQueues, four Gates and four Sinks. In addition, there are 8 edges: all Routes with two lanes each (so there are 16 underlying Transports).
For this simulation, Cars are the actors moving along the roads (they may be thought of as autonomous vehicles or car-driver combinations). The behavior of a car depends on the direction it is traveling and is specified by its subtype, as sketched below.
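A sketch of the Car class is shown here; the component-array names (toQueue, light, queue, toSink, sink) are hypothetical, and the act body only approximates the script implied by the components above.

case class Car () extends SimActor ("c", this):

    def act (): Unit =
        val i = subtype                              // directional subtype set by this car's Source
        toQueue(i).move ()                           // move from the Source to the light's WaitQueue
        if light(i).shut then queue(i).waitIn ()     // wait in the queue when the light is red
        else queue(i).noWait ()                      // otherwise record a zero waiting time
        toSink(i).move ()                            // move through the intersection to the Sink
        sink(i).leave ()                             // terminate at the Sink
    end act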
end Car
The TrafficModel may be run as follows:
runMain scalation.simulation.process.example_1.runTraffic
Class Methods:
18.3.19 Model_MBM Class
Notice that the Model_MBM class extends the Model class, reusing all of its methods except its act method.
18.3.20 Exercises
1. Explain why the act method cannot be just a regular method/function call.
2. Explain what happens in the inner scheduling loop of the Model class. For the BankModel, suppose
there are three coroutines/threads, the director, customer1, and customer2. Using three vertical
time lines, one for each coroutine, with the director in the middle, show the control transfers between
them. Assume customer2 arrives before customer1 finishes its service.
3. Wendy's vs. McDonald's Lines. Given two servers, is it better to have a line/queue for each server or one common line for both servers? Create a Network Diagram for Wendy's and another one for McDonald's.
4. Implement a Process-Interaction Simulation and analyze the waiting times and the standard deviation of the waiting times for Wendy's and McDonald's. Let λ = 20 hr−1 and µ = 12 hr−1 .
5. Machine Shop. Create a Network Diagram for the Machine Shop described in the Event-Oriented
Simulation chapter.
6. Implement a Machine Shop simulation using Process-Interaction and determine which policy is better.
7. Vehicle Traffic Simulation. Create a Network Diagram for the stretch of US 101 at the Stanford
exits from Willow Road to Oregon Expressway. See the Caltrans PeMS map.
8. Create a Process-Interaction Simulation of US 101 (Bayshore Freeway) at the Stanford exits. Data is recorded every five minutes at each of the sensors from Willow Road to Oregon Expressway, giving 288 data points per day per sensor. Collect data for a portion of the year 2021 for these sensors. Use it to calibrate and validate your models. Place Sources at the beginning of all road segments and on-ramps. Model the traffic inflow to the model using a Non-Homogeneous Poisson Process (NHPP) for each source that fits the data at that location. Place Sinks at the end of all road segments and off-ramps. Place Junctions at all road sensors. Use Routes for the road segments between sensors. There are typically four lanes Northbound and four lanes Southbound. Finally, use the data provided by Caltrans PeMS (traffic flow and speed) to measure the accuracy (sMAPE) of your simulation model. See https://getd.libs.uga.edu/pdfs/peng_hao_201908_phd.pdf and https://dot.ca.gov/programs/traffic-operations/mpr/pems-source.
10. Compare the results of BankModel for the event package and process package. What happens as the number of entities (customers) increases? What happens when the move method is replaced with the jump method? How do these results compare to those from Queueing Theory?
11. Rewrite the Car class for the VehicleModel from section 17.3.18 and put it in your answer sheet. Put this inside a new simulation model called TrafficModelTurn that has cars go straight with probability 0.75 and turn right with probability 0.25. Run the original TrafficModel and the new TrafficModelTurn models and indicate the changes in travel time (cars going from Source to Sink).
runMain scalation.simulation.process.example_1.runTraffic
runMain scalation.simulation.process.example_1.runTrafficTurn
Give the mean travel times reported by each Sink for both models.
12. Question 4: Develop a process-interaction simulation model for a Bank with two tellers (nTellers = 2). Let the inter-arrival time distribution be Exponential with rate λ = 12 per hour and the service time distribution be Exponential with rate µ = 7.5 per hour. Simulate for 100 customers and 1 replication. Report the mean waiting time Tq in minutes. You may modify the BankModel class (Bank.scala) in the
scalation.simulation.process.example_1
package, if you like. Show all modifications you made to the code. Note, it is important to develop a correct model as this question is linked to the next question.
18.4 Agent-Based Simulation
Agent-Based Simulation (ABS) may be viewed as a cousin of Process-Interaction Simulation. An important enhancement provided by ABS is a richer structure for actors to interact. For example, in a traffic simulation, it may be useful for a Car to know about the cars ahead and behind, as they may influence what the car does. Although this can be done with process-interaction, it is up to the model developer to create all the code to handle this. An ABS system should provide a framework that facilitates enriched interactions, reducing the burden on the simulation model developer. To reflect the enhanced capabilities of actors, including greater knowledge of their environment and other actors, they are typically named agents.
For simplicity, we focus on ABS as a form of time-based simulation (discrete-time or discrete-event) with
event causality and a time-advance mechanism, but do not consider the more general Agent-Based Modeling
(ABM) that may run multiple autonomous agents without controlled causal ordering of events, see [208] for
a discussion.
In this context, the increased flexibility provided by Agent-Based Simulation partially derives from the
capabilities/properties of agents. Desirable characteristics for agents include the following [115, 116, 114]:
1. An agent needs to be identifiable, self-contained, and active. Although actors in the process-interaction paradigm share this, the event-scheduling paradigm does not, as entities are passive and their logic/behavior specification is scattered among multiple event routines.
2. Agents have some level of autonomy. An agent should have the ability to sense its environment, make
decisions and act accordingly. An example where this is not the case would be a SimActor whose
script includes no parameterization or decision making (e.g., if statements). An agent should be able
to Observe-Decide-Act [59].
3. Agents have the ability to interact with other agents. An example, in a Vehicle Traffic simulation,
would be a car using a car-following rule/model which influences the driver’s speed and gap to the car
in front. The agent must be aware of its neighborhood to put a car-following rule/model into play.
Some models may require more detailed interaction between agents, e.g., may require a communication protocol.
4. An agent is situated in an environment and interacts with its local environment. As agents can move around in their environment, they will typically be given coordinates. Although coordinates were given for process-interaction, they were added for animation, and providing a real interpretation of the coordinates is up to the model developer. An ABS system should directly support this capability.
5. It is also useful to provide support for agents to learn. For example, the model developer could provide a set of possible rules for a car to change lanes. Support for learning would mean that cars can collect information and analyze it to improve their decision making. This allows the agents to adapt as the simulation continues. Improvement implies goals; for example, the car would prefer less (not more) travel time.
6. Resources may exist in the environment or within agents. For example, a server is often thought of as part of the environment in process-interaction simulations, but a more realistic or detailed model could represent each teller as an agent that does other things besides just serving bank customers, e.g., they work a shift, have lunch and take breaks.
Further Reading
• “Introductory Tutorial: Agent-Based Modeling and Simulation,” by Charles Macal, Michael North,
Proceedings of the 2014 Winter Simulation Conference, https://informs-sim.org/wsc14papers/
includes/files/004.pdf.
In ScalaTion agents may access information about the simulated world via a Knowledge Graph. In
particular, a spatial Property Graph (PGraph) is used to set up the components in the simulated world. See
the Property Graph section in the Data Management Chapter.
The vertices in the PGraph represent resources that agents can work with. An agent has a position in the
simulated world with 2D or 3D coordinates. These coordinates may be transformed to screen coordinates
for purposes of animation (see the next section). The edges in the PGraph represent one-way connections
between vertices.
18.4.1 SimAgent
A SimAgent is a dynamic entity that moves to vertices and along edges as the simulation progresses. Its location is recorded in terms of its topological coordinates. An agent is thought to be at a vertex or on an edge. In theory, vertices are points, but ScalaTion allows them to take up space and measures distance from the center of the vertex. Distance along an edge is given by the distance from the beginning of the edge, where it connects to the from vertex, to the agent. In animation, an agent is represented as a token that moves along the graph. Due to the small size of vertices, tokens within a vertex may be represented collectively. Tokens on edges are represented as circles moving along the edges. Topological coordinates are specified using the Topological trait.
While moving through the graph, an agent may interact with vertices as well as other agents. Agents
moving along an edge may speed up, slow down or jump to a parallel edge (e.g., lane change in a vehicle
traffic simulation).
At a vertex, agents may passively wait (e.g., in a wait queue), work with a server for a period of time, update their properties (e.g., the value of a part changes at each machine stage), wait for a traffic light to change color, or choose which edge to follow upon leaving the vertex (e.g., the road to turn onto).
As expected, agents play a more central role in Agent-Based Simulation compared to Process-Interaction Simulation, where much of the specification of behavior/decision making is often delegated to the components/blocks in the simulation.
A Property Graph (PGraph) consists of multiple vertex-types (VertexType) and multiple edge-types (EdgeType) as defined in the scalation.database.graph package.
18.4.2 Vertices
The vertices in a property graph are grouped into one or more vertex-types. A vertex is Identifiable and situated in space using Spatial coordinates.
class Vertex (_name: String, val prop: Property, _pos: VectorD = null)
extends Identifiable (_name)
with Spatial (_pos)
with PartiallyOrdered [Vertex]
with Serializable:
ScalaTion's ABS system defined in scalation.simulation.agent_based currently supports the following types of vertices:
• Gate: a gate is used to control the flow of agents; they cannot pass when it is shut.
• Resource: a resource provides service to agents using one or more service units (see the BankModel below).
• Sink: a sink terminates agents and collects overall statistics.
• Source: a source creates agents and serves as their entry point into the simulated world.
• WaitQueue: a FCFS wait-queue provides a place for agents to wait, e.g., waiting for a resource to become available or a gate to open.
• WaitQueue_LCFS: an LCFS wait-queue provides a place for agents to wait, e.g., waiting for a resource to become available or a gate to open.
These vertex-types correspond to the component nodes in process-interaction. Their constructors and methods are oriented towards allowing more flexibility and specificity for the agents.
18.4.3 Edges
The edges in a property graph are grouped into one or more edge-types. An edge is Identifiable and situated in space using Spatial coordinates based on one of the vertices it connects to.
@param _name the name of this edge (’name’ from ‘Identifiable‘)
@param from the source/from vertex of this edge
@param prop maps edge property names into property values
@param to the target/to vertex of this edge
class Edge (_name: String, val from: Vertex, val prop: Property, val to: Vertex)
extends Identifiable (_name)
with Spatial (if from == null then to.pos else from.pos)
with Serializable:
• Link: a link supports simple/quick movement of agents between closely connected vertices (e.g., a
queue before a resource).
• Route: a route bundles multiple transports together (e.g., a two-lane, one-way street).
• Transport: a transport is used to move agents from one vertex to the next.
Note, closely connected nodes in process-interaction were not required to have an edge between them; however, as ScalaTion's ABS system is built using PGraph, edges are required between vertices, hence the need for the Link class.
The agent-based version of the BankModel is constructed with steps similar to those of the process-interaction version.
Create Random Variates (RVs). A fourth random variate is added for jumping through the link between the wait queue and the resource containing the servers/tellers.
Create the Graph Model. Specifying a graph model replaces the specification of the model components under process-interaction. Notice the addition of the Link edge.
val entry = Source ("entry", this, 0.0, iArrivalRV, () => Customer (), nStop, pos = entry_pos)
val tellerQ = WaitQueue ("tellerQ", this, pos = WaitQueue.at (330, 290))
val teller = Resource ("teller", this, serviceRV, nTellers, pos = Resource.at (380, 285))
val door = Sink ("door", this, pos = Sink.at (600, 290))
val toTellerQ = Transport ("toTellerQ", this, entry.vert, tellerQ, moveRV)
val toTeller = Link ("to", this, tellerQ, teller, jumpRV)
val toDoor = Transport ("toDoor", this, teller, door, moveRV)
Specify Scripts for each Type of Simulation Agent. The script is similar to the specification of actor scripts for process-interaction. The service, move and jump times may also be passed directly into the work, move and jump methods. Notice the increased specification available to SimAgents. For more complex models, this allows greater flexibility. In addition, an explicit ping method is required, as it is no longer implicitly done by the resource, i.e., actions are now under the control of the agent.
case class Customer () extends SimAgent ("c", director.clock, this, cust_pos.copy):
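    // A sketch of the agent's script (an assumption based on the description above;
    // the exact method signatures in the agent_based package may differ).
    def act (): Unit =
        toTellerQ.move (this, moveRV.gen)              // move along the transport to the teller queue
        if teller.busy then tellerQ.waitIn (this)      // wait when all tellers are busy
        else tellerQ.noWait (this)                     // otherwise record a zero waiting time
        toTeller.jump (this, jumpRV.gen)               // jump through the link to the teller
        teller.work (this, serviceRV.gen)              // work with a teller for the service time
        teller.ping ()                                 // explicitly ping to activate a waiting agent
        toDoor.move (this, moveRV.gen)                 // move along the transport to the door
        door.leave (this)                              // leave the bank/simulation
    end act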
end Customer
simulate ()
waitFinished ()
Model.shutdown ()
In an agent-based vehicle traffic simulation, a car may need to efficiently find its neighboring cars, for example in the following situations:
• The car ahead exits: the exiting car may then link the car behind (you) to the car ahead of itself, as in deletion from a doubly linked list.
• Your car wants to change lanes: the new car to follow needs to be found efficiently, by searching the local road segment.
• Your car wants to turn onto another road: the car needs to know where it is in the graph. The intersection is a vertex, the chosen road is an edge, and the car starts on a particular road segment.
The following Car class includes an ability to change lanes. This requires a more flexible move method,
where the second parameter indicates the fraction of the length of a Route to move along. Midway, a new
lane l2 is determined and passed to the changeLane method. The car continues along lane l2 to the traffic
light. If the light is green, then it continues through, otherwise it waits in the queue. After getting through
the light it continues to its Sink.
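A sketch of such a Car class is given below; the component names (lane, light, queue, toSink, sink), the helper chooseLane and the position car_pos are hypothetical — only the move fractions and the changeLane call follow directly from the description above.

case class Car () extends SimAgent ("c", director.clock, this, car_pos.copy):

    def act (): Unit =
        val l1 = lane(subtype)                            // starting lane for this car's direction
        l1.move (this, 0.5)                               // move halfway along the route
        val l2 = chooseLane (l1)                          // midway, determine a (possibly new) lane
        if l2 != l1 then l1.changeLane (this, l2)         // change to lane l2
        l2.move (this, 0.5)                               // continue along lane l2 to the traffic light
        if light(subtype).shut then queue(subtype).waitIn (this)   // red light => wait in the queue
        toSink(subtype).move (this, 1.0)                  // after the light, continue to the Sink
        sink(subtype).leave (this)                        // leave the simulation
    end act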
end Car
18.4.7 Exercises
1. Compare the software available for Agent-Based Simulation.
2. Discuss the simulation world-views/paradigms used for modeling port traffic in the following disserta-
tion, “Application of Mixed Simulation Method to Modelling Port Traffic.”
3. What data structures and algorithms can be used in the changeLane method to efficiently find the car ahead?
5. Develop an Agent-Based Simulation for Vehicle Traffic Forecasting. Add the capabilities discussed
in this section to provide a more realistic simulation compared to Process-Interaction. See
7. Develop the Emergency Department Simulation as an Agent-Based Simulation and model doctors
and nurses as agents rather than resources as was indicated for the process-interaction approach.
Create and execute a model for an emergency department/room based on the specifications given in
the following paper, “Modeling and Improving Emergency Department Systems using Discrete Event
Simulation,” by Christine Duguay and Fatah Chetouane, https://journals.sagepub.com/doi/10.
1177/0037549707083111.
9. Develop an Agent-Based Simulation for analyzing the COVID-19 Pandemic. See “A realistic agent-
based simulation model for COVID-19 based on a traffic simulation and mobile phone data,” Sebastian
A. Muller et al., https://arxiv.org/pdf/2011.11453.pdf as a starting point.
10. Develop an Agent-Based Simulation for Military Applications. See “Simulating Small Unit Military
Operations with Agent-Based Models of Complex Adaptive Systems,” by Victor Middleton, Proceed-
ings of the 2010 Winter Simulation Conference, https://www.informs-sim.org/wsc10papers/013.
pdf as a starting point.
11. Explain how Agent-Based Modeling and Simulation (ABMS) is used to study emergent phenomena.
Give an example.
18.5 Animation
Process-Interaction and Agent-Based Simulation are ideally suited for animation. The environment can be
given largely by displaying nodes and edges of a graph. The dynamics can be displayed by creating, moving
and destroying tokens. The locations and actions of an actor or agent over its lifetime can be depicted in
the animation.
18.5.1 2D Animation
In the JVM world (which includes Java, Scala and several other languages), simple 2D animation can be accomplished using awt and swing. A richer graphics library is provided by JavaFx. Motion involves both space and time, so Java's Thread class and Runnable interface are also used for animation.
Basics of 2D Animation
Java’s abstract window toolkit (awt) and swing packages support simple 2D animations. The example below
illustrates this by drawing a large blue circle and having a small red ball continually trace the circle. The
drawing Canvas is a class that extends JPanel and is an inner class within a JFrame. The run method (1) updates the ball coordinates, (2) sleeps for tau milliseconds, e.g., 20 milliseconds corresponds to 50 frames per second (generally fast enough for humans to see motion as smooth), and (3) repaints the canvas by calling repaint.
private val dim = new Dimension (600, 500) // the size of the canvas
private val tau = 20 // operate at 50 Hz
private val circle = new Ellipse2D.Double (200, 200, 200, 200) // the circle to traverse
private val ballPos = new Point2D.Double (0, 300) // ball position
private val ball = new Ellipse2D.Double () // the moving ball
ballPos.x = 300 + 100 * cos (theta)                 // x-coordinate on the circle (center (300, 300), radius 100)
ballPos.y = 300 + 100 * sin (theta)                 // y-coordinate on the circle
println (s"ballPos = $ballPos")
end SimpleAnimator
The call to repaint will cause execution of the paintComponent method. Based on the new ball coordinates determined by the angle theta, the ball will move a few pixels each time the canvas is repainted. The setFrame (x, y, w, h) method is used to reset the ball coordinates: the x-coordinate, y-coordinate, width, and height of the ball. Notice that draw draws the shape's boundary, while fill fills the shape in with color. The x and y coordinates specify the top-left position for the bounding box of the ball. As the ballPos is intended to be the center of the ball, half the width/height (10) must be subtracted to center the ball on the large blue circle.
This example animation may be run as follows:
ScalaTion avoids the direct use of any graphics framework to facilitate changing frameworks as the
technology evolves. Consequently, the scala2d package is used to insulate code from the particulars of any
graphics framework. The SimpleAnimator2 class shows the minimal changes required to switch from direct
use of Java graphics libraries to the use of the scala2d package.
2D Animation in ScalaTion
The Model classes in the event, process, and agent_based packages import the following from the scalation.animation package: AnimateCommand, CommandType, and DgAnimator.
The AnimateCommand class provides a data structure for holding animation command specifications.
@param action the animation action to perform
@param eid the external id for the component acted upon
@param shape the shape of graph component (node, edge or token)
@param label the display label for the component
@param primary whether the component is primary (true) or secondary (false)
@param color the color of the component
@param pts the set of points/dimensions giving the shape's location and size
@param time simulation time when the command is to be performed
@param from_eid the ’eid’ of the origination node (only for edges)
@param to_eid the ’eid’ of the destination node (only for edges)
case class AnimateCommand (action: CommandType, eid: Int, shape: Shape, label: String,
primary: Boolean, color: Color, pts: Array [Double], time: Double,
from_eid: Int = -1, to_eid: Int = -1):
The CommandType enumeration specifies the types of commands passed from a simulation engine to the
animation engine.
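A sketch of the enumeration is given below; CreateToken appears in the code excerpts above, while the remaining members are representative assumptions about the command set:

enum CommandType:

    case CreateNode, CreateEdge, CreateToken,        // create a node, edge or token
         MoveNode, MoveToken, MoveToken2Node,        // move a node or token
         DestroyNode, DestroyEdge, DestroyToken      // destroy a node, edge or token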
end CommandType
The DgAnimator class is an animation engine for animating graphs. It contains a Panel that provides a
paintComponent method for displaying the nodes, edges and tokens of the graph.
@param fgColor the foreground color
@param bgColor the background color
@param aniRatio the ratio of simulation speed vs. animation speed
class DgAnimator (_title: String, fgColor: Color = black, bgColor: Color = white,
aniRatio: Double = 1.0)
extends VizFrame (_title, null, 1200, 800) with Runnable:
The run method repeatedly executes animation commands, sleeps and repaints the drawing canvas/panel.
The length of the sleep is determined by the time gap between commands based on their values from the
simulation clock as well as the aniRatio.
18.5.2 3D Animation
C++ and C# are languages commonly used for 3D animation. There are several libraries available in the JVM world as well. JavaFx has limited support for 3D. Currently, LWJGL and libGDX are widely used by the JVM community.
Further Reading
18.5.3 Exercises
1. Create an animation of the trajectory of a golf ball using the equations given in the State Space Models
chapter. Add a tracer to see the path taken by the golf ball.
2. Redesign the scala2d package and call it scala2df to use JavaFx rather than awt and swing.
3. Design a scala3d package that uses JavaFx with its limited 3D capabilities.
7. List two ways animation is useful in simulation. Make a convincing argument for each.
Chapter 19
Simulation Output Analysis
Unlike some modeling techniques, a simulation run will produce one result, and the next run another result. Individually, these results may be misleading, as they depend on the combination of random variates generated. Simulation's true power comes from generating several results and then performing statistical analysis on them. In other words, the simulation outputs need to be analyzed.
Many types of output may be analyzed. For example, many simulation studies are interested in reducing waiting times. For such studies, one may define the sample mean of an output vector $y \in \mathbb{R}^m$ (e.g., the waiting times of m entities):

$$\bar{\mu} = \frac{\mathbf{1} \cdot y}{m} = \frac{1}{m} \sum_{i=0}^{m-1} y_i \tag{19.2}$$
The sample variance is computed as follows:

$$\hat{\sigma}^2 = \frac{\| y - \bar{\mu} \|^2}{m - 1} = \frac{1}{m-1} \sum_{i=0}^{m-1} (y_i - \bar{\mu})^2 \tag{19.3}$$
Typically, simulations are used to study average behavior, so the focus is on the sample mean. To collect statistics on average behavior, several means need to be collected, and it is important that they be independent (or at least not highly correlated). There are two common methods used to achieve this: the Method of Independent Replications and the Method of Batch Means.
Both methods will produce n means: $\bar{\mu}_0, \bar{\mu}_1, \ldots, \bar{\mu}_{n-1}$. From these a grand mean will be calculated.

$$\bar{\bar{\mu}} = \frac{1}{n} \sum_{i=0}^{n-1} \bar{\mu}_i \tag{19.4}$$
The grand mean can be used to estimate average behavior or performance characteristics, such as average waiting times. In addition to using the grand mean as a point estimate, it is common practice to obtain an interval estimate $[\bar{\bar{\mu}} - ihw, \; \bar{\bar{\mu}} + ihw]$, where ihw is the interval half width.
To create an interval estimate in the form of a confidence interval, we need to determine the variability or variance in the point estimate $\bar{\bar{\mu}}$.
"
n−1
# n−1
1X 1 X σµ̄2
¯] = V
V [µ̄ µ̄i = V [µ̄i ] = (19.5)
n i=0 n2 i=0 n
Consequently, the following is an estimate for the variance of $\bar{\bar{\mu}}$.

$$\hat{\sigma}_{\bar{\bar{\mu}}}^2 = \frac{\hat{\sigma}_{\bar{\mu}}^2}{n} \tag{19.6}$$
Therefore, the interval half width is

$$ihw = t^* \frac{\hat{\sigma}_{\bar{\mu}}}{\sqrt{n}} \tag{19.7}$$
where $t^*$ is a critical value from the Student's t Distribution. The interval estimate is then

$$\left[ \bar{\bar{\mu}} - \frac{t^* \hat{\sigma}_{\bar{\mu}}}{\sqrt{n}}, \;\; \bar{\bar{\mu}} + \frac{t^* \hat{\sigma}_{\bar{\mu}}}{\sqrt{n}} \right] \tag{19.8}$$
Suppose the true mean waiting time µ can be computed (e.g., from Queueing Theory); then the probability that the confidence interval contains it corresponds to the confidence level chosen (and t∗ is determined by this and the Degrees of Freedom).
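As a small sketch in ScalaTion, the interval estimate may be computed from a vector of means as follows (this assumes VectorD's mean and stdev methods; the critical value tStar must be supplied from a Student's t table for the chosen confidence level and degrees of freedom):

import scalation.mathstat.VectorD

def confInterval (means: VectorD, tStar: Double): (Double, Double) =
    val grandMean = means.mean                              // grand mean (Equation 19.4)
    val ihw = tStar * means.stdev / math.sqrt (means.dim)   // interval half width (Equation 19.7)
    (grandMean - ihw, grandMean + ihw)                      // interval estimate (Equation 19.8)
end confInterval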
19.2 One-Shot Simulation
One-shot simulation happens when there is a single simulation run and that run is treated as a whole (i.e., it is not broken into multiple batches). This is the simplest way to run a simulation, and it is a good idea to convince yourself that the model is running correctly in this mode before moving on to a method more useful for simulation output analysis.
The process package contains several example models that are placed in sub-directories according to how they are set up for output analysis and validation.
• example_1:
One-Shot Simulation (OSS)
During model development, OSS may be used since model execution takes less time and animation is on by default, thus giving the model developer quick feedback on the correctness of the model. The default settings are 1 replication, animation on, use of the move method, and extending Model (no batching). The move method gives smooth motion for better animation.
• example_MIR:
Method of Independent Replications (MIR)
OSS will report a mean, but not a standard deviation, since that requires at least 2 simulation runs (or replications). The default settings are 10 replications, animation off, use of the jump method, and extending Model (no batching). The jump method gives less overhead in Transports, which for many models may provide more accurate statistics that correspond better to those in the event package. The developer needs to decide depending on what they are modeling. For example, one should use move and not jump for traffic simulations, but use jump over move for tandem queues where the time to move from one queue to the next is negligible. Note, using move with OSS would likely be preferred, as it makes it easier to track the entities while viewing the animation.
• example_MBM:
Method of Batch Means (MBM)
Although MIR can provide more useful statistics, it tends to be less effective for studying long-term or steady-state behavior. MBM has one long simulation run (replication), but divides it up into several batches. Each batch produces a batch mean, and since there are several of them, standard deviations can be computed. The default settings are 1 replication, animation off, use of the jump method, and extending Model_MBM (batching). The Model_MBM class extends the Model class with batching capabilities (see the section on MBM for details).
19.3 Simulation Model Validation
In data science in general, model validation is both important and challenging. As simulation models tend to have many human-designed components that must work together, the chance of making errors is high. Also, like all computer programs, bugs need to be removed. With sorting algorithms, it is obvious when the output is incorrect. With simulation, it is often difficult to know whether the output is correct or not.
One option would be to collect data and see if the model output agrees with the data. This begs the question of whether a program implementing a model and having bugs could agree with the data. As this is a real possibility, one should be skeptical of a new model and first look for ways to falsify the model. Of course, there is the additional complexity of whether the problem is with the program implementing the model, or the model itself, or likely both. To have a clean separation between the implementing program and the model would require a formal (or executable) specification of the model. The process of showing that a program implementation agrees with the model is sometimes called verification (to be discussed later).
Focusing first on validating a model based on a program that implements it, there are many ways to check the simulation model's behavior or output [104, 166].
1. Make sure the model works for Simple Scenarios. For example, consider the Machine Shop with limited storage capacity for parts being processed by machines in sequence. This model can be difficult to build correctly, and it is hard to know if its output is correct. Due to the limited queue sizes, parts may be sold for scrap or machines may be blocked from operating. However, when the part arrival rate is small, no parts will be sent to scrap and no machines will be blocked. A simpler model can be used (or built) that uses infinite capacity queues. The new machine shop model should agree with this simpler model.
2. Slowly Increase the Complexity of scenarios. For a machine shop with two machines in sequence, the first queue becoming full will introduce the first increment of complexity. What is expected to happen when the arrival rate increases enough to cause the first queue to become full? In case the second queue becomes full first, its capacity can be temporarily increased to focus on one problem at a time. After any bugs triggered by the first queue becoming full have been removed, the arrival rate may be further increased to make the second queue become full. Finally, after bugs associated with this have been removed, add the option of blocking the first machine when the second queue becomes full. Removing bugs for all three issues simultaneously may be very difficult. And remember, the Machine Shop model is a relatively simple model.
3. Use Analytic Models as beacons in the sea of complexity. They may be too simple for the problem being addressed, but may be close enough to be indicators of whether you are on the right track. For example, the infinite queue approximation for machines in sequence can be solved analytically using a Queueing Network model. Although the restrictions to unchanging parameters, certain distributions and steady-state conditions may have ruled out Queueing Networks as viable models for the original problem, they can still be very helpful in model validation. Time series forecasting models such as SARIMAX and LSTM may be very helpful in validating Vehicle Traffic simulation models.
4. A collection of related, validated models in a specific domain, along with explanations of when they are applicable and why they are at least an approximation to the phenomena/system under study, could be termed a Theory. The theory would typically include constraints that could be used to assess the realism of a newly proposed (simulation) model. Models that are accurate may still be in discord with theory, suggesting either the model is magical or the theory needs revision. There are new initiatives in data science and machine learning that go by various names such as theory-guided data science [90, 123] or physics-informed machine learning [89].
5. The above items focused on model output, but model inputs are important as well. Input Analysis can be used to choose between distributions, e.g., are service times Exponential, Erlang, Normal, Weibull, etc.? If arrivals follow a Non-Homogeneous Poisson Process (NHPP), is the function λ(t) estimated correctly? Are the probabilities for cars going straight, turning right, or turning left estimated correctly? Plots and Goodness-of-Fit tests can be useful for this.
6. Before, it was argued that simpler models are helpful for validation. Ideally, the new model could be positioned between a simpler model and a more Complex Model that has been previously validated. Such a model may not exist given the needs of your study, but if one does, it could be that your model is faster or more generalizable than the complex model. Alternatively, it may be more amenable to optimization or interpretation. In any case, the complex model can be very helpful for validation.
7. Besides looking at the results/outputs of a simulation model, one should Examine the Behavior of the model. One way to do this is to examine a trace of the model execution (e.g., what happens with each event). This can be tedious and even mind-numbing. An alternative is to examine an animation of the simulation. The animation should be watched in slow motion to see what happens to the entities (or actors or agents) in detail. For both tracing and animation, the number of entities should be reduced. The real simulation may require thousands of entities; examining all of them will likely be fruitless.
8. Although humans can digest short traces, they may miss the part of the simulation exhibiting the incorrect behavior (e.g., machine 1 gets blocked, but a newly arriving entity begins service in the blocked queue), so Anomaly Detection in Simulation Traces can be used. For example, do entities enter the sink out of order, or does an entity's wait time vary greatly from the entities going through the system at roughly the same time and taking the same pathway? Both machine learning and rule-based anomaly detection algorithms can be useful.
9. As a later step in the validation process, the output of the model (whether predictions, forecasts or classifications) needs to be compared with data, and Quality of Fit (QoF) measures should be examined. For example, R², MAE and sMAPE may be used to assess the quality of forecasts. The QoF measures should be compared to those of other models addressing the problem under study.
10. The model developers should also follow standard Software Engineering Practices, a topic too large to
summarize here.
When calibrating model parameters, e.g., via Grid Search, note that a single run is not reliable, i.e., multiple runs need to be made for each combination of parameters. See the exercises for better alternatives to Grid Search.
19.4 Method of Independent Replications (MIR)
Since simulation models produce stochastic outputs, One-Shot Simulation may not be reliable. Suppose one is interested in the customer waiting time Tq (renamed w). Each customer will have their own waiting time wj. One simulation run typically involves hundreds or thousands of entities/customers. One might think that taking an average or mean would provide a useful estimate. Unfortunately, the wj's may be highly correlated, which tends to inflate variability. This is evidenced in the following table, where an M/M/1 Queue is simulated for nr = 10 replications (where the only thing changing per run is the base random number stream).
For MIR, let $w_{ij}$ be the waiting time for the $j^{th}$ entity in the $i^{th}$ run. The $i^{th}$ run mean is given as follows:

$$\bar{w}_i = \frac{1}{n_s} \sum_{j=0}^{n_s-1} w_{ij} = \text{mean waiting time for the } i^{th} \text{ replication} \tag{19.9}$$
The number of customers per run nStop = ns is set to 100 (and then 1000) in Table 19.1. In order to make
the replications independent, it is essential to change the stream each time.
Notice the high variability in w̄i particularly for nStop = 100, less so with nStop = 1000. These data are
created by running the MIR version of BankModel.
runMain scalation.simulation.process.example_MIR.runBank
The changes to the code from the example_1 version are the following: animation is turned off, reps = 10, jump replaces move, and the time on the transports is greatly reduced.
The mean and standard deviation for each column can be used to compute confidence intervals.
19.4.1 Confidence Intervals
Case nr = 10, ns = 100
For the case where reps = 10, nStop = 100, the grand mean is simply the mean of nr = 10 run means.
$$\bar{\bar{w}} = \frac{1}{n_r} \sum_{i=0}^{n_r-1} \bar{w}_i = 30.011 \tag{19.10}$$

$$\hat{\sigma}_{\bar{w}} = \sqrt{\frac{1}{n_r - 1} \sum_{i=0}^{n_r-1} (\bar{w}_i - \bar{\bar{w}})^2} = 24.760 \tag{19.11}$$

$$t^* \frac{\hat{\sigma}_{\bar{w}}}{\sqrt{n_r}} = 2.262 \, \frac{24.760}{\sqrt{10}} = 17.712 \tag{19.12}$$
where t∗ is the value for the Student's t Distribution with nr − 1 Degrees of Freedom, where the total area/probability in the two tails is 0.05 (95% confidence interval). Therefore, the interval estimate is the following:
$$\left[ \bar{\bar{w}} - \frac{t^* \hat{\sigma}_{\bar{w}}}{\sqrt{n_r}}, \;\; \bar{\bar{w}} + \frac{t^* \hat{\sigma}_{\bar{w}}}{\sqrt{n_r}} \right] \tag{19.13}$$

Finally, the 95% confidence interval is

$$[30.011 - 17.712, \; 30.011 + 17.712] = [12.299, \, 47.723] \tag{19.14}$$

Case nr = 10, ns = 1000
For the case where reps = 10, nStop = 1000, the interval half width (ihw) is
$$t^* \frac{\hat{\sigma}_{\bar{w}}}{\sqrt{n_r}} = 2.262 \, \frac{6.478}{\sqrt{10}} = 4.634 \tag{19.15}$$
and the 95% confidence interval is much tighter (using the grand mean 30.774 from Table 19.2).

$$[30.774 - 4.634, \; 30.774 + 4.634] = [26.140, \, 35.408] \tag{19.16}$$

Case nr = 40, ns = 1000
For the case where reps = 40, nStop = 1000, the interval half width (ihw) is
$$t^* \frac{\hat{\sigma}_{\bar{w}}}{\sqrt{n_r}} = 2.023 \, \frac{7.758}{\sqrt{40}} = 2.481 \tag{19.17}$$
and the 95% confidence interval is even tighter (using the grand mean 30.489 from Table 19.2).

$$[30.489 - 2.481, \; 30.489 + 2.481] = [28.008, \, 32.970] \tag{19.18}$$
Case nr = 40, ns = 10000
For the case where reps = 40, nStop = 10000, the interval half width (ihw) is
$$t^* \frac{\hat{\sigma}_{\bar{w}}}{\sqrt{n_r}} = 2.023 \, \frac{2.598}{\sqrt{40}} = 0.831 \tag{19.19}$$
and the 95% confidence interval is reasonably tight.
With 10,000 entities per run, it is reasonable to compare these results with the result from queueing theory.
$$T_q = \frac{\rho/\mu}{1 - \rho} \tag{19.21}$$
Since the traffic intensity $\rho = \frac{\lambda}{\mu} = \frac{6}{7.5} = 0.8$,
$$T_q = \frac{0.8/7.5}{1 - 0.8} = 0.533 \text{ hours} = 32 \text{ minutes} \tag{19.22}$$
Note, the theoretical value of 32 minutes is inside all the confidence intervals.
class BankModel (name: String = "Bank", reps: Int = 100, animating: Boolean = false,
aniRatio: Double = 8.0, nStop: Int = 1000, stream: Int = 0)
extends Model (name, reps, animating, aniRatio):
//--------------------------------------------------
// Initialize Model Constants
//--------------------------------------------------
// Create Random Variables (RVs)
//--------------------------------------------------
// Create Model Components
val entry = Source ("entry", this, () => Customer (), 0, nStop, iArrivalRV, (100, 290))
val tellerQ = WaitQueue ("tellerQ", (330, 290))
val teller = Resource ("teller", tellerQ, nTellers, serviceRV, (350, 285))
val door = Sink ("door", (600, 290))
val toTellerQ = Transport ("toTellerQ", entry, tellerQ, moveRV)
val toDoor = Transport ("toDoor", teller, door, moveRV)
//--------------------------------------------------
// Specify Scripts for each Type of Simulation Actor
end Customer
simulate ()
waitFinished ()
Model.shutdown ()
end BankModel
19.5 Method of Batch Means (MBM)
The Method of Batch Means (MBM) is intended to provide a more efficient and reliable way to analyze the steady-state, as opposed to the approach of the last section of making nStop large enough for the simulation to exhibit steady-state behavior. Each of the forty runs/replications had to go through a warm-up period or transient phase.
With MBM, the simulation goes through only one transient phase, at the beginning. Rather than having a mean for each run, a mean is created for each batch: one long run is divided into multiple batches. The trick is to make the batches uncorrelated enough, so that the advantage of independent replications is not lost. If the batch means are highly correlated, the confidence intervals will not be reliable. Fortunately, the larger the batch size sizeB (sb), the smaller the correlation between the batch means. The other hyper-parameter is nBatch (nb). These are analogs of ns and nr.
For MBM, let wij be the waiting time for the j th entity in the ith batch. The ith batch mean is given as
follows:
$$\bar{w}_i = \frac{1}{s_b} \sum_{j=0}^{s_b-1} w_{ij} = \text{mean waiting time for the } i^{th} \text{ batch} \tag{19.23}$$
The grand mean, the standard deviation of the batch means, and the interval half width are then computed as before:

$$\bar{\bar{w}} = \frac{1}{n_b} \sum_{i=0}^{n_b-1} \bar{w}_i \tag{19.24}$$

$$\hat{\sigma}_{\bar{w}} = \sqrt{\frac{1}{n_b - 1} \sum_{i=0}^{n_b-1} (\bar{w}_i - \bar{\bar{w}})^2} \tag{19.25}$$

$$ihw = t^* \frac{\hat{\sigma}_{\bar{w}}}{\sqrt{n_b}} \tag{19.26}$$
where t∗ is the value for the Student's t Distribution with nb − 1 Degrees of Freedom, where the total area/probability in the two tails is 0.05 (95% confidence interval).
Table 19.2: M/M/1 Queue: MBM vs. MIR

nb or nr    MBM grand mean    MBM ihw    MIR grand mean    MIR ihw
   10           31.875         5.074         30.774         4.634
   20           30.146         2.821         28.880         2.815
   30           30.030         2.980         29.924         3.074
   40           31.703         3.121         30.489         2.481
   50           32.141         2.845         30.777         2.398
   60           32.382         2.574         30.935         2.154
   70           32.503         2.313         31.120         1.914
   80           32.220         2.089         30.495         1.759
   90           31.812         1.947         30.584         1.683
  100           31.764         1.841         30.426         1.583
The acorr method in VectorD computes the lag-1 autocorrelation ρ1 for the vector shown above to be -0.2767. Having larger batch sizes sb is likely to reduce the magnitude (absolute value) of the correlation. Assuming covariance stationarity (see the section on the Auto-Correlation Function in the time series chapter), the lag-1 autocorrelation may be computed as follows:
$$\rho_1 = \frac{C[\bar{w}_i, \bar{w}_{i-1}]}{V[\bar{w}_i]} \tag{19.27}$$
See the exercises to see how autocorrelation changes with increasing batch sizes.
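In code, this may be obtained directly (a usage sketch, assuming the batch means are held in a VectorD named wBar; acorr may also take an explicit lag argument):

val rho1 = wBar.acorr ()        // lag-1 autocorrelation of the batch means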
The Method of Batch Means and the Method of Independent Replications may be depicted as shown
below. The MIR simulation consists of nr = 10 runs/replications, while the MBM simulation consists of one
run, that is divided into nb = 10 batches. Let '-' represent 10 entities/customers; thus the length of a replication is ns = 100 entities, and similarly the size of each batch is sb = 100 entities.
MIR:
---------- w_0
---------- w_1
---------- w_2
---------- w_3
---------- w_4
---------- w_5
---------- w_6
---------- w_7
---------- w_8
---------- w_9
MBM:
|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
w_0 w_1 w_2 w_3 w_4 w_5 w_6 w_7 w_8 w_9
The w̄i (w_i) are run means for MIR and batch means for MBM.
The quality of an interval estimate may be assessed by its relative precision, the interval half width relative to the grand mean:

$$\gamma = \frac{ihw}{\bar{\bar{w}}} \tag{19.28}$$
Suppose the goal is to achieve 90% relative precision (or γ ≤ .1). For nb = 10, γ = 5.074/31.875 = .159, indicating more batches are needed. For nb = 20, γ = 2.821/30.146 = .094, indicating acceptable relative precision. Note, some simulation studies may prefer to work with absolute precision instead.
A basic procedure for MBM would be to increase the batch size until the correlation between the batch means drops below a threshold (e.g., ρ1 ≤ .2 or .3) and "1 - relative precision" drops below another threshold (e.g., γ ≤ .05 or .1), as sketched below. There are several advanced procedures that are more efficient/effective (although more complex) for MBM; see [105, 12, 182, 5].
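A minimal sketch of this basic procedure is shown below; runMBM, lag1Acorr and relPrecision are hypothetical helpers standing in for running the model with a given batch size and computing ρ1 and γ from the resulting batch means:

var sb   = 100                                       // initial batch size (assumed)
var done = false
while ! done do
    val batchMeans = runMBM (nBatch, sb)             // run one long simulation, collect batch means
    val rho1  = lag1Acorr (batchMeans)               // lag-1 autocorrelation of the batch means
    val gamma = relPrecision (batchMeans)            // gamma = ihw / grand mean (Equation 19.28)
    done = rho1 <= 0.2 && gamma <= 0.1               // thresholds suggested in the text
    if ! done then sb *= 2                           // otherwise, increase the batch size
end while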
class BankModel (name: String = "Bank", nBatch: Int = 100, sizeB: Int = 1000,
animating: Boolean = false, aniRatio: Double = 8.0, stream: Int = 0)
extends Model_MBM (name, nBatch, sizeB, animating, aniRatio):
val nStop = nBatch * sizeB // number arrivals before stopping the Source
//--------------------------------------------------
// Initialize Model Constants
//--------------------------------------------------
// Create Random Variables (RVs)
//--------------------------------------------------
// Create Model Components
val entry = Source ("entry", this, () => Customer (), 0, nStop, iArrivalRV, (100, 290))
val tellerQ = WaitQueue ("tellerQ", (330, 290))
val teller = Resource ("teller", tellerQ, nTellers, serviceRV, (350, 285))
val door = Sink ("door", (600, 290))
val toTellerQ = Transport ("toTellerQ", entry, tellerQ, moveRV)
val toDoor = Transport ("toDoor", teller, door, moveRV)
//--------------------------------------------------
// Specify Scripts for each Type of Simulation Actor
end Customer
simulate ()
waitFinished ()
Model.shutdown ()
end BankModel
19.6 Exercises
1. Consider how increasing the number of batches by 10 for both MBM and MIR up 100 batches/repli-
cations. effects the accuracy of the simulation. Using the data from Table 19.2, plot the grand means
versus n (the number of batches/replications). Also plot the theory line (32) and discuss the converge.
2. Convergence for MIR is dependent upon the length of the runs/replications ns . For nr = 40, plot
the MIR grand means as the run length ns increase from 10 to 100,000 on a log scale (10, 100, 1000,
10,000, 100,000). Again plot the theory line (32) and discuss the converge.
3. Increasing the MBM batch size sb can also be advantageous, in that the correlation between batches
decreases. For nb = 40, plot the MBM grand means as the run length sb increase from 10 to 100,000
on a log scale (10, 100, 1000, 10,000, 100,000). On another plot, indicate the lag-1 autocorrelation
between the batch means w̄i . What is the takeaway message?
5. Under what circumstances would MBM be an inappropriate method to use in a simulation study?
6. For a large number of calibration parameters or parameters having many levels, Grid Search becomes
infeasible. Consult the literature for better alternatives.
7. Question 5: For the process-interaction simulation model of a Bank with two tellers (see Section
17.3.20: Exercise 12), determine the mean waiting time Tq three ways. For the simulations use ns =
sb = 1000.
(a) Analytic Model based on Queueing Theory. Give the formula and compute the value for Tq . Be
sure to indicate the time units.
(b) Simulation Model using the Method of Independent Replications (MIR). Give the Grand Mean
and its Confidence Interval. Make sure “1 - relative precision” γ ≤ .1 (may require more runs). Is the
value for Tq from Queueing Theory inside this confidence interval?
(c) Simulation Model using the Method of Batch Means (MBM). Assume the lag-1 autocorrelation ρ1
is small enough. Give the Grand Mean and its Confidence Interval. Make sure “1 - relative precision”
γ ≤ .1 (may require more batches). Is the value for Tq from Queueing Theory inside this confidence
interval?
Appendices
Appendix A
Optimization
As discussed in earlier chapters, when matrix factorization cannot be applied for determining optimal values
for parameters, an optimization algorithm will often need to be applied. This chapter provides a quick
overview of optimization algorithms that are useful for data science. Note that the notation in the opti-
mization field differs in that we now focus on optimizing the vector x rather than the parameter vector
b.
Many optimization problems may be formulated as restricted forms of the following,
minimize f (x)
subject to g(x) ≤ 0
h(x) = 0
where f (x) is the objective function, g(x) ≤ 0 are the inequality constraints, and h(x) = 0 are the equality
constraints. Consider the example below.
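For concreteness, here is a problem consistent with the solutions discussed below (the specific objective and constraints are reconstructions/assumptions, chosen to match the stated optimal solutions):

minimize    f(x) = (x_1 - 4)^2 + (x_2 - 2)^2
subject to  x_1 - 3 ≤ 0
            x_2 - 1 ≤ 0
            x_1 - x_2 = 0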
If we ignore all the constraints, the optimal solution is x = [4, 2] where f (x) = 0, while enforcing the
inequality constraints makes this solution infeasible. The new optimal solution is x = [3, 1] where f (x) = 2.
Finally, the optimal solution when all constraints are enforced is x = [1, 1] where f (x) = 10. Note, for this
example there is just one equality constraint that forces x1 = x2 .
A.1 Partial Derivatives and Gradients
These topics were introduced in the chapter on Linear Algebra, but are probed in more depth here.
Definition: The partial derivative w.r.t. $x_j$ of a multivariate function $f: \mathbb{R}^n \to \mathbb{R}$ ($y = f(x)$) is defined as follows.

$$\frac{\partial f}{\partial x_j} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_j + h, \ldots, x_n) - f(x)}{h} \tag{A.1}$$
It indicates the rate of change in function f with small changes to the xj coordinate in Rn space. All the
other coordinates are held fixed.
Proposition: Addition Rule:

$$\frac{\partial}{\partial x_j}(f + g) = \frac{\partial f}{\partial x_j} + \frac{\partial g}{\partial x_j} \tag{A.2}$$
Proposition: Subtraction Rule:

$$\frac{\partial}{\partial x_j}(f - g) = \frac{\partial f}{\partial x_j} - \frac{\partial g}{\partial x_j} \tag{A.3}$$
Proposition: Product Rule:

$$\frac{\partial}{\partial x_j}(f g) = \frac{\partial f}{\partial x_j} g + f \frac{\partial g}{\partial x_j} \tag{A.4}$$
Proposition: Quotient Rule:

$$\frac{\partial}{\partial x_j}(f / g) = \frac{\dfrac{\partial f}{\partial x_j} g - f \dfrac{\partial g}{\partial x_j}}{g^2} \tag{A.5}$$
Proposition: Chain Rule: for a composition h = f(u) with u = g(x),

$$\frac{\partial h}{\partial x_j} = \frac{df}{du} \frac{\partial u}{\partial x_j} \tag{A.6}$$
Proposition: Given f(x(t), y(t)), where x and y may be thought of as functions of time t, the derivative is

$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} \tag{A.7}$$
These rules can be generalized to higher dimensions. Naturally, there are additional chain rules for more
complex functional compositions.
A.1.3 Gradient
Definition: The gradient of a multivariate function $f: \mathbb{R}^n \to \mathbb{R}$ ($y = f(x)$) is defined as follows.

$$\nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right] \tag{A.8}$$
It is the n-dimensional vector of partial derivatives. At each point in Rn , it is orthogonal to the contour
curves of the response variable y and points in the direction of steepest increase.
The gradient may be written more concisely as follows:

$$\partial_x f = \nabla f \tag{A.10}$$
The two Jacobian matrices are multiplied together. Consider the case where n = 4, p = 3, q = 2 and $h = [h_0, h_1] = f \circ g$:

$$\begin{bmatrix} \partial_{x_0} h_0 & \partial_{x_1} h_0 & \partial_{x_2} h_0 & \partial_{x_3} h_0 \\ \partial_{x_0} h_1 & \partial_{x_1} h_1 & \partial_{x_2} h_1 & \partial_{x_3} h_1 \end{bmatrix} = \begin{bmatrix} \partial_{u_0} f_0 & \partial_{u_1} f_0 & \partial_{u_2} f_0 \\ \partial_{u_0} f_1 & \partial_{u_1} f_1 & \partial_{u_2} f_1 \end{bmatrix} \begin{bmatrix} \partial_{x_0} g_0 & \partial_{x_1} g_0 & \partial_{x_2} g_0 & \partial_{x_3} g_0 \\ \partial_{x_0} g_1 & \partial_{x_1} g_1 & \partial_{x_2} g_1 & \partial_{x_3} g_1 \\ \partial_{x_0} g_2 & \partial_{x_1} g_2 & \partial_{x_2} g_2 & \partial_{x_3} g_2 \end{bmatrix}$$
where u = [u0, u1, u2] = g(x). Now consider the case when q = 1, i.e., function f is scalar-valued; then

    ∂x h = ∂u f(g(x)) ∂x g(x)

In other words, the gradient of h w.r.t. x as a 1-by-n matrix (effectively a vector) is the product of the gradient of f as a 1-by-p matrix and the Jacobian of g as a p-by-n matrix. Further consider the case when p = 1, making g a scalar-valued function. The gradient of h w.r.t. x as a 1-by-n matrix (vector) becomes the product of the gradient of f as a 1-by-1 matrix (scalar) and the gradient of g as a 1-by-n matrix (vector). Note that the scalar ∂u f(g(x)) is an ordinary derivative. Focusing on one dimension in the vector x, xj, the j-th partial derivative is then

    ∂xj h = ∂u f(g(x)) ∂xj g(x)

When the arguments to the functions are understood (and hence dropped), this can be shortened to

    ∂xj h = ∂u f ∂xj g        (A.15)
When the function is univariate and scalar, the ∂ symbol is understood by context to be an ordinary
derivative, rather than a partial derivative.
The first two rules can be re-written as shown below by extending the concise notation to include Jacobians, i.e., ∂x g = Jg where both x and g are vector-valued.
Differential Object

    object Differential:
        ...
        def laplacian (f: FunctionV2S, x: VectorD): Double =
            ...
    end Differential
Most of these methods also have math-like Unicode equivalents (see code for details). The most common
first order methods are shown below.
    @param f  the function whose derivative is sought
    @param x  the point (scalar) at which to estimate the derivative
A.2 Automatic Differentiation
As we have learned, Neural Networks work because of back-propagation, but this requires the manual development of partial derivatives. Notice that for a Gated Recurrent Unit (GRU), getting the partial derivatives correct is not so easy, and with newer architectures, it is even harder.
Automatic Differentiation [63, 142, 13] allows one to specify the equations for forward propagation and
have the system automatically handle the backward propagation (and even generalize it).
This has opened up a new research area called differential programming [200], where model developers can specify parameterized equations and the system can automatically fit the parameters (based on data) using first or second order optimizers.
Consider again the Perceptron, whose forward pass computes the predicted output

    ŷ = f(x · b)        (A.16)
Now the weight vector (not matrix) is b ∈ Rn . In addition, the forward pass calculates the loss function.
    L = ½ (y − ŷ)² = ½ (y − f(x · b))²        (A.17)
Applying the chain rule with u = f(x · b) gives

    ∂L/∂b = (∂L/∂u)(∂u/∂b)        (A.18)

The chain rule can be applied again with v = x · b to obtain

    ∂L/∂b = (∂L/∂u)(∂u/∂v)(∂v/∂b)        (A.19)
In concise notation,

    ∂b L = ∂u L ∂v f(x · b) ∂b (x · b)        (A.20)
The formulas for the forward and backward passes are shown on the left and right, respectively.
    L(u) = ½ (y − u)²    →    ∂u L = −(y − u)            (A.21)
    u = f(x · b)         →    ∂v f(x · b) = f′(v)        (A.22)
    v = x · b            →    ∂b (x · b) = x             (A.23)
The calculations may be depicted in a computation graph as shown in Figure A.1. Forward calculations
(left-to-right) are shown above the nodes, while backward calculations (right-to-left) are shown below the
nodes.
[Figure A.1: Computation graph for the Perceptron example — forward pass (above the nodes): x → v = x · b → u = f(v) → L = ½(y − u)²; backward pass (below the nodes): x, f′(v), u − y]
Label each node vi, and then the partial derivative calculations simply accumulate the multiplications right-to-left according to the following formula:

    v̄i = v̄i+1 (∂vi+1/∂vi)        (A.24)

where v̄i = ∂L/∂vi and v̄4 = 1. This value v̄i is called the adjoint, a term used in adjoint methods that provide more efficient ways of calculating derivatives [54].
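To make the forward and backward passes concrete, the plain-Scala sketch below evaluates Equations A.21 to A.23 for one input row. The sigmoid activation and the specific input values are assumptions for illustration; ScalaTion supplies activation function families.

    import scala.math.exp

    @main def perceptronGrad (): Unit =
        val x = Array (1.0, 0.5, 0.25)                    // one input row (assumed values)
        val b = Array (0.1, 0.2, 0.1)                     // weight vector (assumed values)
        val y = 0.8                                       // target output
        def f (v: Double)  = 1.0 / (1.0 + exp (-v))       // sigmoid activation (assumed)
        def fp (v: Double) = f(v) * (1.0 - f(v))          // its derivative f'(v)

        val v = x.indices.map (i => x(i) * b(i)).sum      // forward: v = x · b
        val u = f(v)                                      // forward: u = f(v)
        val L = 0.5 * (y - u) * (y - u)                   // forward: loss L

        val dLdu = u - y                                  // backward: ∂L/∂u = -(y - u)
        val dLdv = dLdu * fp(v)                           // backward: ∂L/∂v
        val dLdb = x.map (_ * dLdv)                       // backward: ∂L/∂b = x ∂L/∂v
        println (s"L = $L, gradient = ${dLdb.mkString ("[", ", ", "]")}")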
The automatic differentiation calculations are summarized in Tables A.3 and A.4 symbolically and nu-
merically.
Table A.3: Forward and Backward Symbolic Calculations for Perceptron Example
Table A.4: Forward and Backward Calculations for Perceptron Example
The parameters are then updated by moving opposite the gradient, moderated by the learning rate η:

    b = b − ∂b L η        (A.25)

With the learning rate η = 1, the new parameter values will be b = [−.1666, .0778, −.1666]. Notice that this update differs from the one given in the Perceptron section, as that one is based on all the rows in data matrix X, whereas this one only uses row six.
The previous example used a single row (the sixth) x ∈ X to perform the computation, as would be done with pure (single instance) Stochastic Gradient Descent. On the other hand, Gradient Descent would use the full input data/training matrix X. The updated two-step chain rule becomes the following:

    ∂L/∂b = (∂L/∂u)(∂u/∂v)(∂v/∂b)        (A.26)
This states that the gradient of the loss function L w.r.t. the parameters b is the product of the gradient of the loss function L w.r.t. u times the Jacobian of u w.r.t. v times the Jacobian of v w.r.t. b. The two Jacobian matrices are diagonal because the activation function maps individual elements, i.e.,

    ∂u/∂v = diag [f′(v)]

Thus, the matrix multiplication becomes the element-wise vector product of three vectors.
The full input calculations are depicted in the computation graph shown in Figure A.3.
[Figure A.3: Computation graph for the full input case — forward pass: X → v = Xb → u = f(v) → L = ½‖y − u‖²; backward pass: Xᵀ, f′(v), u − y]
Calculations for the full input case are summarized in Table A.5. Where there is a vector of length nine, the previous table's numbers correspond to the sixth element. The overall loss and gradient are not expected to agree, since in the first table they are based on one row, while in this table they are based on all nine rows.
Table A.5: Forward and Backward Calculations for Perceptron Input X Example
Using one half sse as the loss function, it may be expressed as one half the Frobenius norm squared.

    L = ½ ‖Y − f1(f0(XA + α)B + β)‖F²        (A.30)
Partial derivatives are now needed for all weight matrices and bias vectors:

    ∂L/∂A, ∂L/∂α, ∂L/∂B, ∂L/∂β

Or in aggregated form,

    ∂A L, ∂α L, ∂B L, ∂β L        (A.32)
Ignoring biases (or using the Bias Trick to incorporate them into weight matrices), the prediction equation becomes the following:

    Ŷ = f1(f0(XA)B)

The gradient of the loss function w.r.t. the last weight matrix B, ∂L/∂B, is an nz-by-ny matrix and can be decomposed using the following matrix calculus chain rule.

    ∂L/∂B = (∂L/∂Ŷ)(∂Ŷ/∂U)(∂U/∂B)        (A.35)
Figure A.4 represents a computation graph for a three-layer neural network relevant to ∂L/∂B.
[Figure A.4: Computation graph relevant to B — forward pass: X → V = XA → Z = f0(V) → U = ZB → Ŷ = f1(U) → L = ½‖Y − Ŷ‖F²; backward pass: Zᵀ, f1′(U), Ŷ − Y]
To better illustrate how the partial derivative products accumulate right-to-left, Table A.6 shows the forward
and backward calculations relevant to weight matrix B for three layer neural networks symbolically.
Table A.6: Forward and Backward Matrix Calculations Relevant to B for Neural Network Example
    B = B − Zᵀ Δ1 η        (A.36)

i.e., move in the direction opposite the gradient (Zᵀ Δ1), moderated by the learning rate η.
Similarly, the gradient w.r.t. the first weight matrix A decomposes using a longer chain rule:

    ∂L/∂A = (∂L/∂Ŷ)(∂Ŷ/∂U)(∂U/∂Z)(∂Z/∂V)(∂V/∂A)        (A.38)
Figure A.5 represents a computation graph for a three-layer neural network relevant to ∂L/∂A.
[Figure A.5: Computation graph relevant to A — forward pass: X → V = XA → Z = f0(V) → U = ZB → Ŷ = f1(U) → L = ½‖Y − Ŷ‖F²; backward pass: Xᵀ, f0′(V), Bᵀ, f1′(U), Ŷ − Y]
Notice that the forward passes for the two computation graphs are identical, while the backward passes are the same until computing the partial derivative for node U. Table A.7 shows the forward and backward calculations relevant to weight matrix A for three-layer neural networks symbolically.
Table A.7: Forward and Backward Matrix Calculations Relevant to A for Neural Network Example
    A = A − Xᵀ Δ0 η        (A.39)

i.e., move in the direction opposite the gradient (Xᵀ Δ0), moderated by the learning rate η.
A.3 Gradient Descent
One of the simplest algorithms for unconstrained optimization is Gradient Descent (GD). Imagine you are in a mountain range at some point x with elevation f(x). Your goal is to find the valley (or ideally the lowest valley). Look around (assume you cannot see very far) and determine the direction and magnitude of steepest ascent. This is the gradient.
Using the objective/cost function from the beginning of the chapter, the gradient of the objective function ∇f(x) is the vector formed by the partial derivatives [∂f/∂x1, ∂f/∂x2].
In its most elemental form the algorithm simply moves in the direction that is opposite to the gradient −∇f(x) and a distance determined by the magnitude of the gradient. Unfortunately, at some points in the search space the magnitude of the gradient may be very large and moving that distance may result in divergence (you keep getting farther away from the valley). One solution is to temper the gradient by multiplying it by a learning rate η (a tunable hyper-parameter typically smaller than one). Using a tuned learning rate, update your current location x as follows:

    x = x − η ∇f(x)
Repeat this process until a stopping rule signals sufficient convergence. Examples of stopping rules include stopping when the change to x or f(x) becomes small, or stopping after the objective function has increased for too many consecutive iterations/steps.
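A minimal sketch of the update loop on the running example f(x) = (x1 − 4)² + (x2 − 2)² is shown below. It is plain Scala with a hand-coded gradient, rather than ScalaTion's grad function.

    @main def gdSketch (): Unit =
        val η = 0.1                                       // learning rate (assumed value)
        var x = Array (0.0, 0.0)                          // starting point
        for it <- 1 to 100 do
            val g = Array (2 * (x(0) - 4), 2 * (x(1) - 2))   // gradient ∇f(x)
            x = Array (x(0) - η * g(0), x(1) - η * g(1))     // move opposite the gradient
        println (x.mkString ("[", ", ", "]"))             // converges toward [4, 2]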
Wolfe Line Search
The parent optimization algorithm will choose a search direction p that is moving downward, while the gradient at the current location x will be in the direction of steepest increase. The basic idea is to increase/advance the displacement in the search direction α, so long as it is productive to do so. In general a line search algorithm may work in two phases: a bracketing phase and a selection phase. The bracketing phase finds an interval of acceptable step lengths that satisfy certain conditions, e.g., Wolfe condition 1 for an upper bound and Wolfe condition 2 for a lower bound. The selection phase looks for a reasonably good solution within the interval. As line search is called frequently, one typically wants to use minimal effort in phase 2. Also, the next iteration of the parent algorithm may in a way undo some of the work done in the last line search. One simple selection algorithm is bisection search, although interpolative search may be used [136].
The Wolfe condition 1 (or Armijo condition) will be satisfied when the function at the new point f(x + αp) is sufficiently less than its starting value (α = 0).

    f(x + αp) ≤ f(x) + c1 α [∇f(x) · p]

To see the Sufficient Decrease Condition (SDC) clearly, one may consider gradient descent, which will set the search direction p = −∇f(x), so that the above equation becomes the following:

    f(x − α ∇f(x)) ≤ f(x) − c1 α ‖∇f(x)‖²
Although various optimization algorithms deflect away from the direction of steepest descent, the dot
product ∇f (x) · p will be negative. For small values of c1 (e.g., .0001), this condition will allow the new
point (due to the line search) to be on a line with small negative slope emanating from (α = 0, f (x)).
The Wolfe condition 2 has Weak and Strong versions. The goal of the Weak version is to have the dot product of the gradient and search direction become less negative.

    ∇f(x + αp) · p ≥ c2 [∇f(x) · p]
The idea of this Curvature Condition (CC) is that as one approaches a minimal point, this dot product will approach zero from negative values. One use of the CC condition is to make sure to advance/increase α while the descent is still very steep, i.e., some fraction (e.g., c2 = .9) of the original rate of descent.
The following method returns whether Wolfe condition 1, the Sufficient Decrease Condition (SDC), is satisfied.

    @param fx   the functional value of the original point
    @param fy   the functional value of the new point y = x + p * a
    @param a    the displacement in the search direction
    @param gxp  the dot product of the gradient vector g(x) and the search vector p

    inline def wolfe1 (fx: Double, fy: Double, a: Double, gxp: Double): Boolean =
        fy <= fx + c1 * a * gxp
The next method returns whether Wolfe condition 2, the Curvature Condition (CC), is satisfied.

    @param p    the search direction vector
    @param gy   the gradient at new point y = x + p * a
    @param gxp  the dot product of the gradient vector g(x) and the search vector p

    inline def wolfe2 (p: VectorD, gy: VectorD, gxp: Double): Boolean =
        (gy dot p) >= c2 * gxp
For the Perceptron trained by Gradient Descent, the loss function over all rows of the data matrix X is

    L(b) = ½ (y − f(Xb)) · (y − f(Xb))

in which case the gradient is

    ∇L(b) = − Xᵀ [f′(Xb) ⊙ ε]        (A.44)

where ε = y − f(Xb) and ⊙ denotes the element-wise (Hadamard) product. For each epoch, the parameter vector b is updated according to the GD Update Equation.

    b = b − η ∇L(b)        (A.45)
A.3.3 Exercises
1. Write a ScalaTion program to solve the example problem given above.
    // function to optimize
    def f (x: VectorD): Double = (x(0) - 4)~^2 + (x(1) - 2)~^2
2. Add code to collect the trajectory of vector x in a matrix z and plot the two columns in the z matrix.
    val z = new MatrixD (MAX_IT, 2)                       // store x's trajectory
    z(k-1) = x.copy
    new Plot (z(?, 0), z(?, 1))
3. Wolfe condition 2 also has a Strong version. How does it differ from the weak condition and when is it useful?
A.4 Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a foundational algorithm for Data Science. It fits in the class of stochastic optimization algorithms, since the objective/loss function is noisy, as it is based on functions of randomly selected mini-batches of data instances. Such algorithms need to be robust and need not be the best optimization algorithm in a deterministic setting.
Several improved variants of Stochastic Gradient Descent are incorporated into popular neural network software packages.
The GradientDescent_NoLS class provides methods to optimize a loss/objective function f that may be stochastic. It tries to find a value for vector x that minimizes the loss function. In the context of Neural Networks, x may be thought of as the parameter vector (weights and biases). This optimizer implements Gradient Descent with no contained Line-Search optimizer.
This algorithm is rather simple: (1) Compute the gradient vector g using the grad function. (2) Multiply
it by the step-size/learning rate α and subtract this from the previous value of the parameter vector x.
Note, the exact relationship between the learning rate α and the hyper-parameter eta (η) is code dependent.
Of course the core logic shown above must be embedded in an iterative optimization algorithm (see the solve method in the GradientDescent_NoLS class).
    @param f       the vector-to-scalar (V2S) objective/loss function
    @param grad    the vector-to-vector (V2V) gradient function, grad f
    @param hparam  the hyper-parameters
The solve method iterates over several time-steps or learning epochs. It computes the gradient, multiplies
it by learning rate α and subtracts this product from the vector x. A stopping rule is checked and if progress
is stalled, the algorithm terminates early. This method returns the best solution found for the loss function
f and the vector x.
    @param x0  the starting point
    @param α   the step-size/learning rate
    var (go, it) = (true, 1)
    cfor (go && it <= MAX_IT, it += 1) {                  // iterate over each epoch/timestep
        val g = grad (x)                                  // get gradient of loss function
        x  -= g * α                                       // update parameters x
        f_x = f (x)                                       // compute new loss function value

        best = stopWhen (f_x, x)
        if best._2 != null then go = false                // early termination, return best
    } // cfor
    if go then getBest                                    // best solution found
    else best
    end solve
Rather than computing the gradient of the loss function over the full dataset,

    L(b) = ½ (y − f(Xb)) · (y − f(Xb))

an estimate of the gradient is computed for a limited number of instances (a mini-batch). Several non-overlapping mini-batches are created simultaneously by taking a random permutation of the row indices of data/input matrix X. The permutation is split into nB mini-batches. Letting iB be the indices for the i-th mini-batch and X[iB] be the projection of matrix X onto the rows in iB, the estimate for the gradient is simply
    ∇L(b) = − X[iB]ᵀ [f′(X[iB]b) ⊙ ε]        (A.49)

where ε = y[iB] − f(X[iB]b). Using the definition of the delta vector

    δ = − f′(X[iB]b) ⊙ ε

the estimated gradient becomes

    ∇L(b) = X[iB]ᵀ δ
For each epoch, nB mini-batches are created. For each mini-batch, the parameter vector b is updated according to this equation, using that mini-batch's estimate for the gradient.

    b = b − η ∇L(b) = b − X[iB]ᵀ δ η        (A.50)
At a high level, the optimize2 method of the Optimizer_SGD class shown in the NeuralNet_2L section works as follows:
1. The outermost loop makes many complete passes through the training set portion of the dataset. Each pass may be thought of as a learning epoch.
2. The inner loop will iterate through all the mini-batches, updating the parameters (weights and biases)
based on delta corrections (combination of errors and slopes).
3. The updateWeight method simply encodes the boxed equations from the Perceptron section: com-
puting predicted output, the negative of the error vector, the delta vector, and a mini-batch size
normalized learning rate, and finally, returning the parameter vector b update.
4. The last part of the outer loop computes the new loss function and applies a stopping rule for early
termination. The parameter settings with the lowest loss function are recorded and returned along
with number of epochs.
5. The final line of optimize simply returns the value of the loss function and the number of epochs used by the algorithm, when there is no early termination.
The optimize3 and optimize methods in the Optimizer_SGD class are for NeuralNet_3L and NeuralNet_XL, respectively.
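The sketch below illustrates the mini-batch mechanics described in step 2: permute the row indices, split them into nB batches, and update per batch. It is plain Scala with a hypothetical update callback, not ScalaTion's actual optimize2 code.

    import scala.util.Random

    // One epoch of mini-batch SGD over m instances split into nB batches (illustrative)
    def sgdEpoch (m: Int, nB: Int, update: IndexedSeq [Int] => Unit): Unit =
        val perm    = Random.shuffle ((0 until m).toVector)   // random permutation of row indices
        val batches = perm.grouped (m / nB).toVector          // split into nB mini-batches
        for iB <- batches do update (iB)                      // parameter update per mini-batch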
A.5 Stochastic Gradient Descent with Momentum
To better handle situations where the gradient becomes small or erratic, previous values of the gradient can be weighed in with the current gradient. Their contributions can be exponentially decayed, so that recent gradients have greater influence. The decay rate β ∈ [0, 1] indicates how fast prior gradients are discounted. If β = 0, momentum is not used. In ScalaTion, the hyper-parameter beta (β) is set to 0.9, but can easily be changed.
These contributions may be collected via the gradient-based parameter updates to the parameter vector x [55]. First the gradient vector g is computed using the grad function. The gradient-based momentum vector p is calculated as the weighted average of the current gradient and the previous momentum. Finally, the parameter vector x is updated using a weighted average of the current gradient and the momentum, moderated by the learning rate α > 0 (derived from the hyper-parameter eta (η) in some code).

    p = g (1 − β) + p β
    x = x − [g (1 − ν) + p ν] α
If β is zero, the algorithm behaves the same as Stochastic Gradient Descent. At the other extreme, if β is 1, there is no decay and all previous gradients will weigh in, so eventually the new gradient value will have little impact and the algorithm will become oblivious to its local environment.
The third hyper-parameter ν determines how much weight to place on the current gradient versus the momentum when updating the parameter vector x. There are two special cases to consider: ν = 0 ignores the momentum and reduces to plain SGD, while ν = 1 uses only the momentum, as in the (normalized) Stochastic Heavy Ball (SHB) method.
See [55] for a more nuanced discussion on the variety of stochastic gradient descent algorithms that use
momentum.
    @param f       the vector-to-scalar (V2S) objective/loss function
    @param grad    the vector-to-vector (V2V) gradient function, grad f
    @param hparam  the hyper-parameters
The solve method iterates over several time-steps or learning epochs. It computes the gradient and
uses it to recompute the momentum as the weighted average of this gradient and the previous momentum.
The update to the vector x is another weighted average of this gradient and the updated momentum. This
update is then multiplied by learning rate α and subtracted from the vector x. A stopping rule is checked
and if progress is stalled, the algorithm terminates early. This method returns the best solution found for
the loss function f and the vector x.
    @param x0  the starting point
    @param α   the step-size/learning rate

    var (go, it) = (true, 1)
    cfor (go && it <= MAX_IT, it += 1) {                  // iterate over each epoch/timestep
        val g = grad (x)                                  // get gradient of loss function
        p   = g * (1 - β) + p * β                         // update momentum-based agg. gradient
        x  -= (g * (1 - ν) + p * ν) * α                   // update parameters
        f_x = f (x)                                       // compute new loss function value
where δ = f′(XB) ⊙ ε.
The Optimizer_SGDM class has three optimization methods supporting Stochastic Gradient Descent with Momentum. The optimize2 method is for NeuralNet_2L. It takes the input x and output y matrices, the initial guess for parameters bb, the initial learning rate eta, and the activation function ff.
    @param x    the m-by-n input matrix (training data consisting of m input vectors)
    @param y    the m-by-ny output matrix (training data consisting of m output vectors)
    @param bb   the array of parameters (weights & biases) between every two adjacent layers
    @param eta  the initial learning/convergence rate
    @param ff   the array of activation function families for every two adjacent layers
The method initializes several constants and variables, mainly related to the hyper-parameters. The permutation generator is used to create random mini-batches.
    val permGen   = permGenerator (x.dim)                 // permutation vector generator
    val b         = bb(0)                                 // net-params: weights and biases
    val f         = ff(0)                                 // activation function
    val bSize     = min (hp("bSize").toInt, x.dim)        // batch size
    val maxEpochs = hp("maxEpochs").toInt                 // maximum number of epochs
    val upLimit   = hp("upLimit").toInt                   // limit on increasing loss
    val β         = hp("beta").toDouble                   // momentum hyper-parameter
    val ν         = hp("nu").toDouble                     // 0 => SGD, 1 => (normalized) SHB
    var η         = eta                                   // set initial learning rate
    val nB        = x.dim / bSize                         // the number of batches
    var p         = new MatrixD (b.w.dim, b.w.dim2)       // momentum matrix
The code below shows the double loop (over epoch and ib). The parameter vector b is updated for each batch by calling updateWeight. The rest of the outer loop simply looks for early termination based on a stopping rule and records the best solution for b found so far. The square of the Frobenius norm is used to compute the sse over all outputs. The final part of the outer loop increases the learning rate η at the end of each adjustment period (as the algorithm gets closer to an optimal solution, gradients shrink and may slow down the algorithm).
    var sse_best_   = -0.0
    var (go, epoch) = (true, 1)
    cfor (go && epoch <= maxEpochs, epoch += 1) {         // iterate over each epoch
        val batches = permGen.igen.chop (nB)              // permute & split into batches
The core of the algorithm is the updateWeight method that (1) computes the predicted outputs, (2) computes the error matrix (actually minus the error matrix for convenience), (3) computes the delta correction matrix as the Hadamard product of the slope matrix and the error matrix, (4) adjusts the learning rate by dividing by the batch size, (5) computes the change to parameters ignoring momentum from previous calls, (6) adds some of the momentum to this parameter change, and (7) returns the update matrix.
    @param x  the input matrix for the current batch
    @param y  the output matrix for the current batch
    val δ = f.dM (yp) ⊙ ε                                 // delta matrix for y
    val g = x.T * δ                                       // gradient matrix
The final part of the optimize2 method returns the value of the loss function and the number of iterations
when there is no early termination.
    if go then ((y - f.fM (b * x)).normFSq, maxEpochs)    // return sse and # epochs
    else (sse_best_, epoch - upLimit)
    end optimize2
Note: the ScalaTion code also uses the ⊙ operator for the Hadamard product (the alternative is *~) and a T-like Unicode symbol for transpose.
The optimize3 and optimize methods in the Optimizer_SGDM class are for NeuralNet_3L and NeuralNet_XL, respectively.
A.5.2 Exercises
1. Plot the loss function versus the time-step/epoch for the AutoMPG dataset for NeuralNet_2L, NeuralNet_3L and NeuralNet_XL, comparing SGD and SGDM.
Hint: see the MonitorLoss trait in the modeling package.
A.6 SGD with ADAptive Moment Estimation
The ADAptive Moment estimation (Adam) Optimizer extends the optimizers that use first moments of momentum (means) of the gradients by including second moments (uncentered variances) [96]. The algorithm computes the gradient g and the momentum p just like in the previous section.
The new element is to include the uncentered variance (second raw moment) v in the calculation. It is included to normalize the gradient, dividing it by √v̂ (the analog of dividing a random variable by its standard deviation). A very small value eps is added to it to avoid division by zero.
The calculated momentum mean p and momentum uncentered variance v are not used directly. This is because these two vectors are initialized to zero, so they will be under-estimates. These values are inflated by dividing them by (1 − β1^t) and (1 − β2^t), respectively. For example, if β1 = 0.9 and t = 1, p will be inflated by a factor of 10. Raising β1 (same for β2) to the t-th power reduces this effect as the time-steps increase.
The GradientDescent_Adam class provides an implementation of this algorithm.
    @param f       the vector-to-scalar (V2S) objective/loss function
    @param grad    the vector-to-vector (V2V) gradient function, grad f
    @param hparam  the hyper-parameters
The solve method iteratively applies the above logic looking for a minimal solution. Again the code will
terminate early due to lack of progress.
    @param x0  the starting point
    @param α   the step-size/learning rate
    var (go, it) = (true, 1)
    cfor (go && it <= MAX_IT, it += 1) {                  // iterate over epochs/timesteps
        val g = grad (x)                                  // get gradient of loss function
        p   = p * β1 + g * (1 - β1)                       // update biased 1st moment
        v   = v * β2 + g~^2 * (1 - β2)                    // update biased 2nd raw moment
        ph  = p / (1 - β1~^it)                            // compute bias-corrected 1st moment
        vh  = v / (1 - β2~^it)                            // compute bias-corrected 2nd raw mo.
//      x  -= ph * α                                      // update parameters (1st moment)
        x  -= (ph / (vh~^0.5 + EPS)) * α                  // update parameters (both moments)
        f_x = f (x)                                       // compute new loss function value
A.6.1 Exercises
1. Apply the Adam Optimizer to NeuralNet_2L, NeuralNet_3L, and NeuralNet_XL, i.e., finish the coding of Optimizer_Adam.
2. Test reducing the inflation of p and v by starting the time t at a larger value than 1 and discuss the
effects, if any.
3. Report on the tuning of the hyper-parameters eta (η), beta (β1 ) and beta2 (β2 ).
    object Minimize
        ...
    end Minimize
4. Compare the loss curves, Quality of Fit (QoF), and run-times of Optimizer_SGD, Optimizer_SGDM, and Optimizer_Adam in ScalaTion, and analogously in Keras and PyTorch.
A.7 Coordinate Descent
Rather than moving in the opposite direction of the gradient, the coordinate descent algorithm picks a coordinate direction and tries moving forward (1) or backward (-1) parallel to the coordinate axis. The next coordinate to try may be picked by a selection rule or cyclically, as done by the code below.
Upon selecting a direction, a Line Search algorithm is applied to move down in that direction. The code can perform an exact (e.g., GoldenSectionLS) or inexact (e.g., WolfeLS) line search. The Line Search algorithm looks in direction dir and returns the distance to move in that direction.
    @param x     the current point
    @param dir   the direction to move in
    @param step  the initial step size
The solve method cyclically picks a coordinate axis and tries moving in both the forward and backward directions. The algorithm terminates when the distance moved on the last step drops below a tolerance or when a maximum MAX_IT number of iterations is exceeded.
    @param x0     the starting point
    @param step   the initial step size
    @param toler  the tolerance

    def solve (x0: VectorD, step: Double = STEP, toler: Double = EPSILON): FuncVec =
        val n    = x0.dim
        var x    = x0                                     // current point
        var fx   = f(x)                                   // obj. function at current point
        var y    = VectorD.nullv                          // next point
        var fy   = 0.0                                    // obj. function at next point
        val dir  = new VectorD (n)                        // set dir. by cycling coordinates
        var dist = 1.0                                    // distance current to next point
        var down = true                                   // moving down flag

        var it = 1
        cfor (it <= MAX_IT && down && dist > toler, it += 1) {
            ...
        } // cfor
        (fx, x)                                           // return functional value and point
    end solve
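To see the overall pattern without the elided loop details, here is a self-contained sketch of cyclic coordinate descent on the running example, using a crude step-halving line search instead of GoldenSectionLS or WolfeLS (illustrative only, not ScalaTion's solve).

    @main def cdSketch (): Unit =
        def f (x: Array [Double]) = (x(0) - 4) * (x(0) - 4) + (x(1) - 2) * (x(1) - 2)
        var x = Array (0.0, 0.0)                          // starting point
        for it <- 1 to 50; j <- x.indices; s <- Seq (1.0, -1.0) do
            var step = 1.0                                // initial step size
            while step > 1e-8 do                          // crude backtracking line search
                val y = x.clone; y(j) += s * step         // try moving along coordinate j
                if f(y) < f(x) then x = y else step /= 2  // accept improvement or halve step
        println (x.mkString ("[", ", ", "]"))             // converges toward [4, 2]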
A.8 Conjugate Gradient
The Conjugate Gradient (or Conjugate Gradient Descent) algorithm, like SGDM, combines prior gradient (or search direction) information with the current gradient (or search direction).
The new search direction dir(t) is opposite the current gradient gr(t) plus a correction term β dir(t−1) proportional to the previous direction. This proportion β for the Fletcher-Reeves (FR) formula is the ratio of dot products of the search directions.

    β = − [dir(t) · dir(t)] / [dir(t−1) · dir(t−1)]        (A.66)
For the Polak-Ribiere (PR) formula β is the ratio of dot products where the numerator takes the difference
between subsequent search directions.
    β = − [dir(t) · (dir(t) − dir(t−1))] / [dir(t−1) · dir(t−1)]        (A.67)
ScalaTion uses the PR formula (adding EPSILON and taking a max).
    @param sd1  the search direction at the previous point
    @param sd2  the search direction at the current point
The ConjugateGradient class supports finding minimal values for an objective/loss function f . It also
supports having constraints and utilizes line-search.
    @param f        the objective function/loss to be minimized
    @param g        the constraint function to be satisfied, if any
    @param ineq     whether the constraint function must satisfy inequality or equality
    @param exactLS  whether to use exact (e.g., `GoldenLS`) or inexact (e.g., `WolfeLS`) Line Search
The solve method provides the iterative search engine and is set up for deterministic functions, but
could be adapted to a stochastic setting.
    @param x0     the starting point
    @param step   the initial step-size
    @param toler  the tolerance

    def solve (x0: VectorD, step: Double = STEP, toler: Double = EPSILON): FuncVec =
        var x    = x0                                     // current point
        var f_x  = fg (x)                                 // objective function at current point
        var y    = VectorD.nullv                          // next point
        var f_y  = 0.0                                    // objective function at next point
        var dir  = - grad (fg, x)                         // initial direction is -gradient
        var dir0 = VectorD.nullv                          // keep the previous direction
        var dist = 1.0                                    // distance current to next point
        var down = true                                   // moving down flag

        for t <- 1 to MAX_IT if down && dist > toler && dir.normSq > toler do
            y   = x + dir * lineSearch (x, dir, step)     // determine the next point
            f_y = fg (y)                                  // obj. function value for next point
            dir0 = dir                                    // save the current direction
            dir  = - grad (fg, y)                         // next search dir. via Gradient Desc.
            if t > 1 then dir += dir0 * beta (dir0, dir)  // modify search direction using PR-CG
When formulas are available for the partial derivatives making up the gradient, they should be used since
they are faster and more accurate than numerical methods.
    @param partials  the vector of partial derivative functions
ScalaTion provides Gradient Descent and Conjugate Gradient Descent algorithms in two versions, one
with Line Search (LS) and one without (NoLS) as shown in Table A.8.
A.8.1 Exercises
1. Adapt the above code for a stochastic setting.
2. Report on the literature concerning the use of or influence of the Conjugate Gradient algorithm in
Machine Learning.
3. Consult the literature on the effectiveness of the Conjugate Gradient algorithm in the presence of
constraints.
A.9 Quasi-Newton Methods
A.9.1 Newton-Raphson Method
In one dimension, the Newton (or Newton-Raphson) Method simply optimizes (minimizes) a function f by moving in the direction opposite the gradient (first derivative) divided by the Hessian (second derivative). The step size is moderated by the learning rate η (eta).

    xi+1 = xi − η (∂x f) / (∂x² f)        (A.68)
The NewtonRaphson class provides solve methods for finding roots (places where the function evaluates to zero) and a method for finding optimal values. The optimize method finds a local optimum close to the starting point/guess x0. It applies the logic of the above equation and numerically approximates the first and second derivatives.
    @param x0  the starting point/guess

    var it = 1
    cfor (it < MAX_IT && abs (df_x) > EPS, it += 1) {
        df_x  = maxmag (D(f)(x), EPS)                     // make sure 1st der. isn't too small
        d2f_x = maxmag (DD(f)(x), EPS)                    // make sure 2nd der. isn't too small
        x    -= df_x / d2f_x * eta                        // subtract the ratio
        f_x   = f(x)
    } // cfor
Now ∂x f is a gradient vector and ∂x² f is a Hessian matrix. As division is not supported, matrix inversion is used. Using alternate notation, it can be written as follows:

    xi+1 = xi − η ∇x f [Hx f]⁻¹

This represents a vector-matrix multiplication, although it can be easily switched to a matrix-vector multiplication since the Hessian matrix is symmetric (switching from a row to a column vector for the gradient).
One may view the multiplication by the Hessian as modifying (or deflecting) the gradient due to the function's curvature. Define this to be the direction vector d. Rather than taking the inverse, it will be faster and more numerically stable to use matrix factorization (as was done for regression).
[Hx f ] d = ∇x f (A.73)
The solve method in the Newton class finds a local optimum close to the starting point/guess x0. This version numerically approximates the first and second derivatives. It uses factorization (Fac_LU) to solve for the direction vector d.
    @param x0  the starting point/guess
    @param α   the current learning rate

        ...
        x  -= d * α                                       // subtract direction * α
        f_x = f (x)                                       // functional value
    } // cfor
Rather than recomputing and inverting the Hessian on every iteration, the approximate inverse Hessian can be updated efficiently based on the Sherman-Morrison formula (see https://mdav.ece.gatech.edu/ece-6270-spring2021/notes/09-bfgs.pdf). The quantity added to the approximate inverse Hessian is

    [(s ⊗ s)(sy + y · ay)] / sy² − [ay ⊗ s + s ⊗ ay] / sy        (A.74)
where scalar sy = s · y and vector ay = H⁻¹y. Recall that ⊗ is the symbol for outer product (takes two vectors and produces a matrix, i.e., an order-2 tensor). The corresponding ScalaTion code is shown below.
    @param aHi  the current value of the approximate Hessian inverse (aHi)
    @param s    the step vector (next point - current point)
    @param y    the difference in the gradients (next - current)
Using this update method, the inverse Hessian is never computed, only efficiently approximated. This particular approximation yields the Broyden [23], Fletcher [48], Goldfarb [56], Shanno [170] (BFGS) algorithm. See [141] for a historical review of second-order optimization algorithms.
    @param x0  the starting point/guess
    @param α   the current learning rate
The Limited-memory BFGS (L-BFGS) algorithm maintains only the last m steps or changes in x-position vectors s and changes in gradient vectors y. These are stored using the Ring class that efficiently maintains the last cap = m additions.
    @param g  the current gradient
    @param k  the k-th iteration
The solve method for L-BFGS only needs slight modification from the BFGS algorithm.
    @param x0  the starting point/guess
    @param α   the current learning rate
A.9.5 Summary
The three algorithms trade off the efficiency of the approximate Hessian inverse update for more iterations/ability to find optima. The following study [178] shows these trade-offs for several example problems. The time complexity of the update is Newton O(n³), BFGS O(n²), and L-BFGS O(mn).
ScalaTion provides each of the three algorithms in two versions, one with Line Search (LS) and one
without (NoLS). The classes implementing these algorithms are shown in Table A.9.
The first numerically computes the gradient, while the second one requires the user/application to pass in the gradient, typically as a function mapping vectors to vectors (FunctionV2V). As the Newton Method requires computing the Hessian as well, it needs a function for each partial derivative so that it can perform a more efficient Jacobian calculation for the Hessian. In this case the type needs to be Array [FunctionV2S].
A.9.6 Exercises
1. Compare the success rate, number of iterations and execution times of each of the three algorithms for their Line Search (LS) versions. Use some of the 30 benchmark problems given in Appendix A of [103].
https://arxiv.org/pdf/2204.05297.pdf
2. Compare the success rate, number of iterations and execution times of each of the three algorithms for
their No Line Search (NoLS) versions. Use the same benchmark.
3. Compare the best of the Newton/Quasi-Newton methods with Gradient Descent and Conjugate Gra-
dient methods.
A.10 Method of Lagrange Multipliers
The Method of Lagrange Multipliers (or Lagrangian Method) provides a means for solving constrained optimization problems. For optimization problems involving only one equality constraint, one may introduce a Lagrange multiplier λ. At optimality, the gradient of f should be orthogonal to the surface defined by the constraint h(x) = 0; otherwise, moving along the surface in the opposite direction to the gradient (−∇f(x) for minimization) would improve the solution. Since the gradient of h, ∇h(x), is orthogonal to the surface as well, this implies that the two gradients should only differ by a constant multiplier λ.

    −∇f(x) = λ ∇h(x)

For the example problem (minimize f(x) = (x1 − 4)² + (x2 − 2)² subject to h(x) = x1 − x2 = 0), this yields the following equations.
−2(x1 − 4) = λ
−2(x2 − 2) = −λ
x1 − x2 = 0
The first two equations are from the gradient w.r.t. x, while the third equation is simply the constraint itself
h(x) = 0. The equations may be rewritten in the following form.
2x1 + λ = 8
2x2 − λ = 4
x1 − x2 = 0
This is a linear system of equations with 3 variables [x1 , x2 , λ] and 3 equations that may be solved, for
example, by LU Factorization. In this case, the last equation gives x1 = x2 , so adding equations 1 and 2
yields 4x1 = 12. Therefore, the optimal value is x = [3, 3] with λ = 2 where f (x) = 2.
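The linear system may also be solved directly in ScalaTion. The sketch below assumes MatrixD, VectorD and Fac_LU as found in the scalation.mathstat package; package locations and factorization method names may vary by version.

    import scalation.mathstat.{MatrixD, VectorD, Fac_LU}

    @main def lagrangeSolve (): Unit =
        val a = MatrixD ((3, 3), 2.0,  0.0,  1.0,         // 2 x1      + λ = 8
                                 0.0,  2.0, -1.0,         //      2 x2 - λ = 4
                                 1.0, -1.0,  0.0)         // x1   - x2     = 0
        val b  = VectorD (8.0, 4.0, 0.0)
        val lu = new Fac_LU (a)
        lu.factor ()                                      // compute the LU factorization
        println (lu.solve (b))                            // expect [x1, x2, λ] = [3, 3, 2]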
Adding another equality constraint is addressed by adding another Lagrange multiplier, e.g., 4 variables [x1, x2, λ1, λ2] and 4 equations: two from the gradient w.r.t. x and one for each of the two constraints.
Linear systems of equations are generated when the objective function is at most quadratic and the
constraints are linear. If this is not the case, a nonlinear system of equations may be generated.
A.11 Karush-Kuhn-Tucker Conditions
Introducing inequality constraints makes the situation a little more complicated. A generalization of the Method of Lagrange Multipliers based on the Karush-Kuhn-Tucker (KKT) conditions is needed. For minimization, the KKT conditions are as follows:

    ∇f(x) + α · ∇g(x) + λ · ∇h(x) = 0        (stationarity)
    g(x) ≤ 0 and h(x) = 0                    (primal feasibility)
    αi gi(x) = 0 for all i                   (complementary slackness)

Furthermore, the Lagrange multipliers for the inequality constraints α are themselves constrained to be non-negative.

    α ≥ 0
When the objective function is at most quadratic and the constraints are linear, the problem of finding an optimal value for x is referred to as Quadratic Programming. Many estimation/learning problems in data science are of this form. Beyond Quadratic Programming lie problems in Nonlinear Programming. Linear Programming (linear objective function and linear constraints) typically finds less use (e.g., Quantile Regression) in estimation/learning, so it will not be covered in this chapter, although it is provided by ScalaTion.
A.12 Quadratic Programming
The QuadraticSimplex class solves Quadratic Programming (QP) problems using the Quadratic Simplex Algorithm. Given a constraint matrix A, constant vector b, cost matrix Q and cost vector c, find values for the solution/decision vector x that minimize the objective function f(x), while satisfying all of the constraints, i.e.,

    minimize    f(x) = ½ x · Qx + c · x
    subject to  g(x) = Ax − b ≤ 0
Before considering the type of optimization algorithm to use, we may simplify the problem by applying the KKT conditions.

    Qx + c = α · A
    Ax − b ≤ 0
    α ≥ 0
Class Methods:

    @param a    the M-by-N constraint matrix
    @param b    the M-length constant/limit vector
    @param q    the N-by-N cost/revenue matrix (second order component)
    @param c    the N-length cost/revenue vector (first order component)
    @param x_B  the initial basis (set of indices where x_i is in the basis)

    ...
    def dual: VectorD = null
    def objValue (x: VectorD): Double = (x dot (q * x)) * .5 + (c dot x)
    def showTableau (): Unit =
A.13 Augmented Lagrangian Method
The Augmented Lagrangian Method (also known as the Method of Multipliers) takes a constrained optimization problem with equality constraints and solves it as a series of unconstrained optimization problems.

    minimize    f(x)
    subject to  h(x) = 0

where f(x) is the objective function and h(x) = 0 are the equality constraints.
In penalty form, the constrained optimization problem becomes

    minimize    f(x) + (ρk/2) ‖h(x)‖₂²

where k is the iteration counter. The square of the Euclidean norm indicates to what degree the equality constraints are violated. Replacing the square of the Euclidean norm with the dot product gives

    minimize    f(x) + (ρk/2) h(x) · h(x)
The value of the penalty parameter ρk increases (e.g., linearly) with k and thereby enforces the equality
constraints more strongly with each iteration.
An alternative to minimizing f(x) with a quadratic penalty is to minimize using the Augmented Lagrangian Lρk(x, λ).

    Lρk(x, λ) = f(x) + (ρk/2) h(x) · h(x) − λ · h(x)        (A.78)
where λ is the vector of Lagrange multipliers. After each iteration, the Lagrange multipliers are updated.

    λ = λ − ρk h(x)
This method allows for quicker convergence without the need for the penalty ρk to become as large (see the exercises for a comparison of the Augmented Lagrangian Method with the Penalty Method). This method may be combined with an algorithm for solving unconstrained optimization problems (see the exercises for how it can be combined with the Gradient Descent algorithm). The method can also be extended to work with inequality constraints.

A.13.1 Example Problem

Consider again the problem

    minimize    f(x) = (x1 − 4)² + (x2 − 2)²
    subject to  h(x) = x1 − x2 = 0

where x ∈ R², f is the objective function and h is the single equality constraint. The Augmented Lagrangian for this problem is
    Lρk(x, λ) = (x1 − 4)² + (x2 − 2)² + (ρk/2)(x1 − x2)² − λ(x1 − x2)        (A.79)
The gradient of the Augmented Lagrangian ∇Lρk(x, λ) is made up of the following two partial derivatives.

    ∂L/∂x1 = 2(x1 − 4) + ρk (x1 − x2) − λ
    ∂L/∂x2 = 2(x2 − 2) − ρk (x1 − x2) + λ
The Lagrange multiplier update becomes

    λ = λ − ρk (x1 − x2)
The code in the exercises tightly integrates the Gradient Descent algorithm with the Augmented Lagrangian
method by updating the penalty and Lagrange multiplier during each iteration.
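A minimal plain-Scala sketch of this integration is shown below. The inner learning rate (0.01), iteration counts and linear penalty growth are assumed values, and the structure differs in detail from the exercise code.

    @main def augLagSketch (): Unit =
        var x = Array (0.0, 0.0)                          // starting point
        var λ = 0.0                                       // Lagrange multiplier
        var ρ = 1.0                                       // penalty parameter
        for k <- 1 to 50 do
            for _ <- 1 to 100 do                          // inner loop: gradient descent on L
                val h  = x(0) - x(1)                      // equality constraint value
                val g0 = 2 * (x(0) - 4) + ρ * h - λ       // ∂L/∂x1
                val g1 = 2 * (x(1) - 2) - ρ * h + λ       // ∂L/∂x2
                x = Array (x(0) - 0.01 * g0, x(1) - 0.01 * g1)
            λ -= ρ * (x(0) - x(1))                        // update Lagrange multiplier
            ρ += 1.0                                      // increase penalty linearly
        println (s"x = ${x.mkString ("[", ", ", "]")}, λ = $λ")   // approaches [3, 3], λ = 2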
A.13.2 Exercises
1. Write a ScalaTion program to solve the example problem given above.
    // function to optimize
    def f (x: VectorD): Double = (x(0) - 4)~^2 + (x(1) - 2)~^2
    ...
    // augmented Lagrangian
    def lg (x: VectorD): Double = f(x) + (p/2) * h(x)~^2 - l * h(x)
2. Add code to collect the trajectory of vector x in a matrix z and plot the two columns in the z matrix.
    val z = new MatrixD (MAX_IT, 2)                       // store x's trajectory
    z(k-1) = x.copy
    new Plot (z(?, 0), z(?, 1))
3. Compare the Augmented Lagrangian Method with the Penalty Method by simply removing the La-
grange multiplier from the code.
A.14 Alternating Direction Method of Multipliers
For problems that have non-differentiable points, it may work better to approach such points from two directions. This happens with ℓ1 regularization, e.g., in Lasso regression. Given a differentiable function f(x) along with a regularization term α‖x‖₁, the objective

    minimize    f(x) + α ‖x‖₁

where x ∈ Rⁿ and f : Rⁿ → R, is non-differentiable in each dimension as xj approaches zero, with the slopes for the second term jumping from negative α to positive α.
The Alternating Direction Method of Multipliers (ADMM) approach [22] is to separate the objective function into two parts, f(x) and g(z):

    minimize    f(x) + g(z)
    subject to  x − z = 0        (A.82)
and letting

    Lρk(x, z, λ) = f(x) + g(z) + (ρk/2) ‖x − z‖₂² − λ · (x − z)        (A.84)

where λ is a vector of Lagrange multipliers and ρk is the penalty parameter for the k-th iteration (see the last section).
The problem can be solved by iterating over k and solving three sub-problems for each iteration (the prime indicates the new value for iteration k). The x and z variable vectors are updated in an alternating fashion.

1. Minimize the augmented Lagrangian w.r.t. the x vector with z and λ fixed with initial values at the start of iteration k.

    x′ = argmin_x Lρk(x, z, λ)

2. Minimize the augmented Lagrangian w.r.t. the z vector with x fixed at its new value and λ fixed at its start value.

    z′ = argmin_z Lρk(x′, z, λ)

3. Update the Lagrange multipliers based on the new values for x and z.

    λ′ = λ − ρk (x′ − z′)        (A.87)
A.14.1 Example Problem
As a reformulation of the half sse loss function for regression, f(x) may be written

    f(x) = ½ ‖y − Dx‖₂²        (A.88)

where D ∈ R^{m×n} is the given data matrix, x ∈ Rⁿ is the parameter vector to be fit, and y ∈ Rᵐ is the given response vector. Letting the shrinkage parameter α be constant, the regularization term becomes

    g(z) = α ‖z‖₁
The Augmented Lagrangian is then

    Lρk(x, z, λ) = ½ ‖y − Dx‖₂² + α ‖z‖₁ + (ρk/2) ‖x − z‖₂² − λ · (x − z)        (A.90)
1. The first sub-problem can be solved by setting the following gradient to zero.

    ∂L/∂x = − Dᵀ(y − Dx) + ρk (x − z) − λ = 0        (A.91)

Collecting terms gives an equation that can be solved by matrix factorization (like it was done for ridge regression) to determine x′.

    (DᵀD + ρk I) x = Dᵀ y + ρk z + λ        (A.92)
2. The first term in the second sub-problem is constant w.r.t. z and may be ignored. Using a soft-thresholding function for the gradient of α‖z‖₁ (see the first exercise) and taking the overall gradient w.r.t. z and dividing by ρk gives the following:

    ∂L/∂z = S_{α/ρk}(z) − (x′ − z) + λ/ρk = 0        (A.93)
3. The update to the Lagrange multipliers is the same as the general equation.

    λ′ = λ − ρk (x′ − z′)        (A.94)
A.14.3 Exercises
1. The sub-differential of the scalar absolute value function f,

    f(x) = α|x|

is α sign(x) where

    sign(x) = 1 if x > 0;  [−1, 1] if x = 0;  −1 if x < 0
Unfortunately, sign(x) is a relation, but not a function. To avoid this problem, one can approximate the derivative of α|x| with the following soft-thresholding function.

    Sθ(x) = x − θ if x > θ;  0 if |x| ≤ θ;  x + θ if x < −θ
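In code, the soft-thresholding function is a three-way branch; a minimal Scala sketch (hypothetical helper, not ScalaTion's API):

    // Soft-thresholding function S_θ(x) used in place of the sub-differential of α|x|
    def softThreshold (x: Double, θ: Double): Double =
        if x > θ then x - θ
        else if x < -θ then x + θ
        else 0.0                                          // the |x| ≤ θ case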
2. Rework the above example problem using the scaled version of ADMM [22].
    x′ = argmin_x (f(x) + (ρk/2) ‖x − z + u‖₂²)         (A.95)
    z′ = argmin_z (g(z) + (ρk/2) ‖x′ − z + u‖₂²)        (A.96)
    u′ = u + (x′ − z′)                                  (A.97)

where u = −λ/ρk.
3. Show how the equations from the above problem relate to those given in the Lasso section of the
Prediction chapter.
4. ADMM may be applied to the more general problem

    minimize    f(x) + g(z)
    subject to  Ax + Bz = c

where x ∈ Rⁿ and z ∈ Rᵐ are variable vectors, and A ∈ R^{p×n} and B ∈ R^{p×m} are coefficient matrices. Rewrite the augmented Lagrangian for this more general case.
A.15 Nelder-Mead Simplex
Efficient nonlinear optimization without the use of gradients is challenging. Of the derivative-free methods, the Nelder-Mead Simplex [131, 177] method is surprisingly robust and effective. A simplex is a triangular shape defined by n + 1 vertices in n-dimensional space, e.g., a triangle in a two-dimensional plane. The algorithm begins with an initial simplex that is systematically moved downhill by methods that expand, reflect, contract-out, contract-in and shrink the simplex.
More specifically, the algorithm improves the simplex by replacing the worst vertex (xh) with a better one found on the line containing xh and the centroid (xc). It tries reflection, expansion, outer contraction and inner contraction points, in that order. If none succeeds, it shrinks the simplex. The algorithm iterates until the distance between the best and worst points in the simplex drops below a given tolerance.
Consider the following objective function.

    f(x) = (x0 − 2)² + (x1 − 3)² + 1

The optimal (minimal) solution is the green point at (2, 3) with a functional value of 1, with contours drawn around it. The initial simplex is shown as the blue triangle in Figure A.6: the worst point is in red with a functional value of 6, the second worst is in purple with a functional value of 5, and the best point is in blue with a functional value of 3. In this case, the reflect method will succeed with a new better point (2, 2) with a functional value of 2, shown in cyan. The new cyan point will replace the red point, forming a new simplex closer to the optimal solution.
[Figure A.6: Contour Curves for Nelder-Mead Simplex (red = 6, purple = 5, blue = 3)]
Point/vertex replacement is based on finding points along a line that includes the worst point xh and the
centroid xc formed from the other points (excluding the worst). In this example, the centroid xc = (1.5, 1.5).
• Reflection. The reflection point xr is the centroid plus α times the distance from the worst point to
the centroid and is given as the cyan point in Figure A.6. In 2D and α = 1, it flips the triangle to the
other side.
xr = xc + (xc − xh ) ∗ α (A.101)
• Expansion. The expansion point xe pushes further away from the centroid than the reflection point.
xe = xc + (xr − xc ) ∗ γ (A.102)
• Outer Contraction. The outer contraction point xo is between the centroid and the reflection point.
xo = xc + (xr − xc ) ∗ β (A.103)
• Inner Contraction. The inner contraction point xi is between the worst point and the centroid.
xi = xc + (xh − xc ) ∗ β (A.104)
For the above example with α = 1, γ = 2, and β = 0.5, these points are xr = (2, 2), xe = (2.5, 2.5), xo =
(1.75, 1.75), and xi = (1.25, 1.25).
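The four candidate points follow directly from the formulas; the sketch below computes them for this example using ScalaTion's VectorD (the vector +, − and * scalar operators are assumed; this is not the ScalaTion Nelder-Mead implementation).

    import scalation.mathstat.VectorD

    @main def nmCandidates (): Unit =
        val (α, γ, β) = (1.0, 2.0, 0.5)                   // reflect, expand, contract coefficients
        val xh = VectorD (1, 1)                           // worst point
        val xc = VectorD (1.5, 1.5)                       // centroid of the other points
        val xr = xc + (xc - xh) * α                       // reflection point -> (2, 2)
        val xe = xc + (xr - xc) * γ                       // expansion point -> (2.5, 2.5)
        val xo = xc + (xr - xc) * β                       // outer contraction -> (1.75, 1.75)
        val xi = xc + (xh - xc) * β                       // inner contraction -> (1.25, 1.25)
        println (s"xr = $xr, xe = $xe, xo = $xo, xi = $xi")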
The transformation method applies some simple rules for selecting one of these points to replace the worst point (see the code for details). If none of these points is selected to replace the worst point xh, then all the points are pulled toward the best point xl by a factor of δ by the shrink method.
A.15.2 Exercises
1. Make a table showing the new candidate points/vertices for each of the first five iterations of the
Nelder-Mead algorithm. Include the functional values of these points.
2. Consider ways to normalize the search space and how this can improve the performance of the Nelder-
Mead algorithm.
3. The performance of the Nelder-Mead algorithm can be affected by the choice of the initial simplex.
Discuss issues related to (a) initial general location, (b) initial size, and (c) initial shape.
Hint: see “Practical Initialization of the Nelder–Mead Method for Computationally Expensive Opti-
mization Problems,” [188].
Appendix B
Graph databases enhance traditional, row-oriented relational databases by making explicit the implicit relationships in which a foreign key references a primary key. For example, in the sensor relation, rather than having a foreign key roadId that references the primary key roadId in the road relation, the relationship is made explicit via an edge-type.
Table B.1 shows the correspondence between concepts in the relational database model and the graph
database model.
Table B.1: Mapping from Relational Database Concept to Graph Database Concept
The notion that all the vertices in a vertex-type have the same type means that their properties have the same names and domains. This also implies that they will have the same arity. Furthermore, the notion that all edges in an edge-type have the same type means that their properties have the same names and domains, as well as that they connect vertices from the same vertex-types. For example, edges in the Takes edge-type connect vertices from the Student vertex-type to the Course vertex-type. Note, property graphs may have fixed schema, be schema-less or have flexible schema.
The subsections below define several types of graphs that are useful for graph analytics and graph databases. Such graphs need information content that is provided by labels or properties. In some cases, the labels or properties may only be associated with vertices/nodes, but more generally, they are provided for both vertices and edges. Labels associate a single value, while properties associate multiple values. For graph analytics and some types of graph databases, labels may suffice, but properties are usually required for robust database applications.
B.1 Directed Graphs
The starting point for graph databases is the basic definition of a directed graph from graph theory. A
directed graph consists of vertices (nodes) connected via one-way (directed) edges. More formally, it is a
two-tuple G(V, E) where
• V = set of vertices/nodes
• E ⊆ V × V = set of edges { (source-vertex, target-vertex) }

The edge from vertex/node u to vertex/node v is an ordered pair (u, v), also denoted as u → v, or more concisely uv. The from vertex u is referred to as the source and the to vertex v is referred to as the target. Consider the directed graph shown in Figure B.1.
[Figure B.1: a directed graph with four vertices 0, 1, 2, 3]
    V = {0, 1, 2, 3}
    E = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 0), (3, 1)}

As adjacency sets:

    a(0) = {1, 2}
    a(1) = {2}
    a(2) = {3}
    a(3) = {0, 1}
As an adjacency matrix:

        ⎡ 0 1 1 0 ⎤
    A = ⎢ 0 0 1 0 ⎥
        ⎢ 0 0 0 1 ⎥
        ⎣ 1 1 0 0 ⎦
As a transition matrix, where the probability of moving from vertex u to an adjacent vertex v is du⁻¹, one over the out-degree of vertex u:

        ⎡ 0  .5 .5 0 ⎤
    P = ⎢ 0  0  1  0 ⎥
        ⎢ 0  0  0  1 ⎥
        ⎣ .5 .5 0  0 ⎦
This assumes a uniform probability of selecting the next vertex to move to. In general, the transition matrix
may be defined using Markov Chain transition probabilities (also see the Chapter on State Space Models).
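A sketch of the derivation for one row in the uniform case is shown below (plain Scala; a hypothetical helper, not the ScalaTion graph API).

    // Transition probabilities out of a vertex with adjacency set adj in an n-vertex graph
    def transitionRow (adj: Set [Int], n: Int): Array [Double] =
        val p = 1.0 / adj.size                            // one over the out-degree
        Array.tabulate (n) (v => if adj contains v then p else 0.0)

    // e.g., transitionRow (Set (1, 2), 4) yields [0.0, 0.5, 0.5, 0.0] -- row 0 of P above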
Movement in a directed graph must be along the edges and is called a walk, meaning a sequence of edges (e0, e1, ...), where the source vertex of ei must be the target vertex of ei−1. A trail is a walk where all the edges are distinct. Finally, a path is a trail where all the vertices are distinct.
A directed graph may be disconnected, weakly connected, or strongly connected. It is strongly connected if a path exists between any two vertices u, v ∈ V. It is weakly connected if its underlying undirected graph is connected, and is disconnected otherwise.
Adding labels, a labeled directed graph may be represented as a three-tuple G(V, E, L) where

• V = set of vertices/nodes
• E ⊆ V × V = set of edges { (source-vertex, target-vertex) }
• L = set of labels
Notice that if a graph has nv vertices/nodes, the number of edges may range from 0 to nv². There are two general ways to represent the connectivity structure of a graph: an adjacency matrix and an adjacency list. When the graph is sparse, i.e., ne ≪ nv², then an adjacency list will be more efficient.
Under this approach, a vertex/node needs to keep track of its outgoing edges (also known as children).
The Graph0 class below has a name and a flag inverse indicating whether to also store the incoming
edges (parents). The rest of the arguments to the constructor are Arrays, where each element of the array
corresponds to a vertex in the graph. These arrays record the children ch and the vertex label. In addition,
vertex ids id are maintained internally. The children of a node are a set of integers. For example, vertex 1
may have children { 2, 4, 5 }.
Consider the graph shown in Figure B.2. If Bob and Sue know each other, then the graph consists of 2 vertices and 2 edges.

[Figure B.2: vertices Bob and Sue connected by edges in both directions]
Graph0 Class

Class Methods:

    @param ch       the array of child (adjacency) vertex sets (outgoing edges)
    @param label    the array of vertex labels: v -> vertex label
    @param inverse  whether to store inverse adjacency sets (parents)
    @param name     the name of the digraph
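For example, the two-vertex graph of Figure B.2 may be built as follows. The constructor shape is inferred from the parameter list above, and SET abbreviates a mutable Set, as in the Graph example later in this section.

    import scala.collection.mutable.{Set => SET}

    val bobSue = new Graph0 (Array (SET (1), SET (0)),    // children: Bob -> Sue, Sue -> Bob
                             Array ("Bob", "Sue"),        // vertex labels
                             false, "bobSue")             // no inverse sets; graph name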
[Figure B.3: edge-labeled directed graph — Bob knows Sue, Sue employs Bob]
A fully-labeled directed graph may be represented as a five-tuple G(V, E, L, lv, le) where

• V = set of vertices/nodes
• E ⊆ V × V = set of edges { (source-vertex, target-vertex) }
• L = set of labels
• lv = vertex labeling function mapping V to L
• le = edge labeling function mapping E to L

Of course, it may also be useful to split the set of labels L into Lv = set of vertex/node labels and Le = set of edge labels.
The Graph class in scalation.database.graph_pm shown below provides more functionality than the Graph0 class, mainly adding edge labels elabel. The edge labels are stored in a Map for looking up the label based on the source and target vertices. An optional schema specification is also provided to introduce a type structure. (This will not be discussed further here.)
Graph Class
Class Methods:
    val links = new Graph (Array (SET (1), SET (0), SET (0, 1)),   // children
                           Array ("Bob", "Sue", "Joe"),            // vertex labels
                           Map ((0, 1) -> "knows",                 // edge labels
                                (1, 0) -> "employs",
                                (2, 0) -> "knows",
                                (2, 1) -> "knows"),
                           false, "links")
Notice that since there is a single label per edge and only one edge between any two vertices, the basic
structure borrowed from graph theory has not changed. It has only been embellished.
B.1.3 Directed Multi-Graphs
In order to better model the real world, if one thinks of vertices representing entities and edges representing relationships, then it makes sense for two entities to be in different relationships, e.g., Sue knows Bob and Sue employs Bob. Allowing multiple edges (and edge labels) between two vertices, (u, v), introduces a fundamental change to the graph.
No longer is an edge uniquely identified by its source and target vertices, u → v. So long as the edge
labels are different, multiple edges may connect the same source and target vertices. For example, if Bob
and Sue know each other and Sue employs Bob, then the graph consists of 2 vertices and 3 edges as shown
in Figure B.4.
[Figure B.4: directed multi-graph — Bob knows Sue, Sue knows Bob, and Sue employs Bob]
Allowing multiple edges between vertices along with both edge and vertex labels allows rich information content to be stored. The easiest way to achieve this is to simply make the edge labels multi-valued.
Following this approach, a fully-labeled directed multi-graph may be represented as a five-tuple G(V, E, L, lv, le) where

• V = set of vertices/nodes
• E ⊆ V × V = set of edges { (source-vertex, target-vertex) }
• L = set of labels
• lv = vertex labeling function mapping V to L
• le = edge labeling function mapping E to 2^L

Note, 2^L is the power-set of the labels, so le may map to any element of the power-set (i.e., any subset).
Such graphs may be implemented as follows (see the scalation.database.mugraph_pm package):
MuGraph Class
Class Methods:
@param ch       the array of child (adjacency) vertex sets (outgoing edges)
@param label    the array of vertex labels: v -> vertex label
@param elabel   the map of edge labels: (u, v) -> set of edge labels
@param inverse  whether to store inverse adjacency sets (parents)
@param name     the name of the multi-digraph
@param schema   optional schema: map from label to label type
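For example, the multi-graph of Figure B.4 may be built as follows (a sketch, assuming the constructor argument order listed above and set-valued edge labels):

    val links = new MuGraph (Array (SET (1), SET (0)),                  // children
                             Array ("Bob", "Sue"),                      // vertex labels
                             Map ((0, 1) -> SET ("knows"),              // Bob -> Sue
                                  (1, 0) -> SET ("knows", "employs")),  // Sue -> Bob
                             false, "links")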
One may say the easiest approach allows an edge to have multiple labels, but does not really support
having multiple edges between a pair of vertices. However, a straightforward algorithm can convert it to
a representation that does. A natural storage structure that makes the multiple edges explicit is one that
treats edges as triples.
Using triples, a fully-labeled directed multi-graph may be represented as a four-tuple G(V, E, L, l) where
• V = set of vertices/nodes
• E ⊆ V × L × V = bag of triples (edges)
• L = set of labels
• l = vertex labeling function, l : V → L
Again, it may also be useful to split the set of labels L into Lv = set of vertex/node labels and Le = set of
edge labels.
As mentioned, an edge is no longer an ordered pair, as it becomes a triple where the 3 parts are known
by various names as shown in Table B.2.
This concept may be implemented as follows (see the scalation.database.triplegraph package). The Triple class holds information about a triple (a 3-part edge).

@param h  the head vertex
@param r  the relation/edge-label
@param t  the tail vertex
TripleGraph Class
Class Methods:
@param label    the array of vertex labels
@param triples  the bag of triples in the triple-graph
@param name     the name of the triple-graph
@param schema   optional schema: map from label to label type
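For example, the same multi-graph may be built from triples as follows (a sketch, assuming Triple takes integer vertex ids for its head and tail):

    val links = new TripleGraph (Array ("Bob", "Sue"),                  // vertex labels
                                 Bag (Triple (0, "knows", 1),           // Bob knows Sue
                                      Triple (1, "knows", 0),           // Sue knows Bob
                                      Triple (1, "employs", 0)),        // Sue employs Bob
                                 "links")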
Such a structure provides the basis for Resource Description Framework (RDF) graphs, to be discussed later
in this chapter.
B.1.4 Exercises
1. Given the above adjacency matrix A, compute the k-hop adjacency matrix A^k for k = 2, i.e., A^2 = AA.
2. For the graph shown in Figure B.1, give an example of a path, a trail that is not a path, and a walk that is not a trail.
4. Survey the literature to compare different ways of creating storage structures for fully-labeled directed
multi-graphs.
B.2 A Graph Database with Relational Roots
Before moving on to Property Graphs (the predominant form of today's Graph Databases), a further extension of the Table class is discussed. The main difference is that it uses a tuple construct to store vertex and edge attributes/properties, following the Relational Model more closely and requiring the specification of a schema. Property Graphs use maps for storing the properties of vertices and edges, and support both having a schema and being schema-less.
case class GTable (name_ : String, schema_ : Schema, domain_ : Domain, key_ : Schema)
      extends Table (name_, schema_, domain_, key_)
      with Serializable:
The GTable class (for Graph-Table) supports many-to-many relationships with efficient navigation in
both directions. Supporting this is much more complicated than what is needed for LTable, but provides
for index-free adjacency, similar to what is provided by Graph Database systems.
The GTable model is graph-like in that it (as did VTable) elevates tuples into vertices as first-class citizens
of the data model. Also, a directed edge has attributes and serves to link a source (from) vertex to a target
(to) vertex. Now, if distance is included as one of the edge attributes, shortest path algorithms may be
applied.
The Edge class includes three parts: the edge attributes in the form of a tuple of values, the source (from) vertex and the target (to) vertex.
@param tuple  the tuple part of the edge
@param from   the source vertex
@param to     the target vertex
    // ...
end Edge
The Vertex class extends the notion of Tuple with values stored in the tuple part, along with foreign-key links captured as outgoing edges. The edge Map has a key that is the edge label (e.g., employs) and a value that is a set/bag of outgoing edges (e.g., all of the outgoing employs edges). Each edge in turn references the target vertex (e.g., the person employed).
@param tuple  the tuple part of the vertex

    val edge = Map [String, Bag [Edge]] ()                 // map edge-label -> { edges }

    def neighbors: Bag [Vertex] =
    def neighbors (elab: String): Bag [Vertex] =
    def neighbors (ref: (String, GTable)): Bag [Vertex] =
    override def toString: String = s"vertex: ${stringOf (tuple)}"

end Vertex
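For example, the student, course and professor tables linked below may be declared as follows (a sketch, assuming a companion apply method that accepts comma-separated schema and domain strings, as in the KGTable example later in this appendix):

    val student   = GTable ("student",   "sid, sname, city",        "I, S, S",    "sid")
    val course    = GTable ("course",    "cid, cname, hours, dept", "I, X, I, S", "cid")
    val professor = GTable ("professor", "pid, pname, rank",        "I, S, S",    "pid")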
student.addEdgeType ("cid", course, false)                 // student has M courses
course.addEdgeType ("sid", student, false)                 // course has M students
course.addEdgeType ("pid", professor)                      // course has 1 professor
Select applies a given predicate to all vertices in this GTable, keeping those where the predicate evaluates to
true.
override def select (predicate: Predicate): GTable =
    val s = new GTable (s"${name}_s_${cntr.inc ()}", schema, domain, key)
    s.vertices ++= (for v <- vertices if predicate (v.tuple) yield v)
    s
end select
Union takes the union of all vertices in the two tables. An index may be created to eliminate duplicates.
override def union (r2: Table): GTable =
    if incompatible (r2) then return this
    val s = new GTable (s"${name}_u_${cntr.inc ()}", schema, domain, key)
    s.vertices ++= (
        if r2.isInstanceOf [GTable] then vertices ++ r2.asInstanceOf [GTable].vertices
        else vertices ++ r2.tuples.map (Vertex (_)))
    s
end union
Minus will keep each vertex in this table only if it does not also appear in the second table.
override def minus (r2: Table): GTable =
    if incompatible (r2) then return this
    val s = new GTable (s"${name}_m_${cntr.inc ()}", schema, domain, key)
    for v <- vertices do
        if ! (r2 contains v.tuple) then s.vertices += v
    end for
    s
end minus
Having a graph database where the vertices are explicitly linked via edges reduces the frequency with which join operations are needed. Starting with a subset of vertices in a first GTable, relevant attribute values can be efficiently extracted from it and a second table without performing a join.
For example, if one wishes to run a query to retrieve the courses each student is taking, one may write
the following:
student extract ("sname, cname", ("cid", course))
This query will extract sname values from the Student table and pair them with cname values from the
Course table. Use of cid indicates the edge-type to follow. (In general, there may be multiple types of
outgoing edges.)
B.2.4 Exercises
1. Use Scala 3 to complete the implementation of GTable in the scalation.database.table package.
2. Using your implementation for GTable, create a schema where vertices represent cities and edges
represent roads connecting them. City attributes include id, name, state, lat, long, and population.
Road attributes include distance.
3. Populate the database with sample data for Northeast Georgia, with major roads (US and State Roads)
connecting the following cities: Athens, Jefferson, Watkinsville, Monroe, Bethlehem, Winder, Bogart,
Statham, Bishop, Good Hope, Hull, Colbert, Crawford, Nicholson, Danielsville.
4. Using GTable’s graph algebra, list all major roads leaving Athens, GA.
B.3 Property Graphs
Property graphs [16] enhance plain labeled directed multi-graphs by replacing labels with a list/map of
properties. (Note, some database models and systems such as Neo4j keep labels and assign them a special
purpose.) As such, information may be flexibly added to a property graph. The structure and type of
a property graph is defined by specifying its structural organization (analogously to specifying relational
schema). Some Graph Databases/Property Graphs are schema-less to add greater flexibility and agility.
Property Graphs may be divided into two groups: Labeled Property Graphs and Typed Property Graphs. A typed property graph requires all vertices to have a unique Vertex-Type that defines the vertex-schema (set of properties) for vertices of a given type. Similarly, it requires all edges to have a unique Edge-Type. In addition to defining the edge-schema, it specifies the source and target Vertex-Types. Some graph database engines, however, choose to provide users with greater flexibility. For instance, Neo4j does not give a vertex a type, but instead may give one or more labels to a vertex (node). For example, the label Person may be used to indicate the kind of node (this is done without creating a Vertex-Type). Some vertices may have two (or more) labels, e.g., Person and Golfer. As vertices are often accessed based on their labels, indices are automatically created to find the vertices with given labels. Different graph database engines make choices regarding whether to use labels or types for vertices, as well as such choices for edges. Furthermore, labels may be optional vs. required, as well as unique vs. multivalued. See the following document for the design choices made by some of the popular graph database engines: https://medium.com/geekculture/labeled-vs-typed-property-graphs-all-graph-databases-are-not-the-same-efdbc782f099.
ScalaTion’s graph database supports property graphs and is organized as follows: The primary concepts
are vertex and edge. Similar vertices and edges are collected into vertex-types and edge-types. A prop-
erty graph consists of multiple vertex-types and edge-types. Element and element type are introduced as
generalizations that are useful in defining graph algebra operators.
• The ValueType type is a Scala 3 union type for atomic database values.
type ValueType = (Double | Int | Long | String | TimeNum)
• The Property type corresponds to the notion of an attribute in a relational database. It maps property names to property values (e.g., Map ("name" -> "Bob", "salary" -> 85000.0)).
type Property = Map [String, ValueType]
Vertex Class
The Vertex class specifies the form of vertices/nodes in the property graph. A vertex maintains the properties for one entity, e.g., a person. It is analogous to a tuple in a relational database.
@param _name  the name of this vertex ('name' from Identifiable), vertex label
@param prop   maps vertex property names into property values
@param _pos   the position (Euclidean coordinates) of this vertex ('pos' from Spatial)

class Vertex (_name: String, val prop: Property, _pos: VectorD = null)
      extends Identifiable (_name)
      with Spatial (_pos)
      with PartiallyOrdered [Vertex]
      with Serializable:
// use the companion object's apply method to use a system-generated vertex label
VertexType Class
The VertexType class corresponds to the notion of an entity type in an Entity-Relationship Model. A
vertex-type collects vertices of the same type, e.g., a person vertex-type. Its schema specification determines
the types of properties that vertices in this collection are allowed to have. The color and shape are for
display purposes. It is analogous to a relation with no foreign keys in a relational database.
@param _name   the name of this vertex type ('name' from Identifiable)
@param schema  the property names for this vertex type
@param verts   the set of vertices having this vertex type (extension)
@param color   the display color for vertices of this type
@param shape   the display shape template for vertices of this type
Edge Class
The Edge class allows explicit relationships to be formed between vertices. An edge connects a source vertex to a target vertex and may have its own properties. It may be thought of as a triple (from, prop, to), but also allows for an edge label and a shift used for display purposes. It is roughly analogous to an implicit relationship manifest via foreign key-primary key pairs in a relational database.
@param _name  the name of this edge ('name' from Identifiable), edge label
@param from   the source/from vertex of this edge
@param prop   maps edge property names into property values
@param to     the target/to vertex of this edge
@param shift  number of units to shift to accommodate a bundle of edges in a composite edge

class Edge (_name: String, val from: Vertex, val prop: Property, val to: Vertex, val shift: Int = 0)
      extends Identifiable (_name)
      with Spatial (if from == null then to.pos else from.pos)
      with Serializable:
// use the companion object's apply method to use a system-generated edge label
EdgeType Class
The EdgeType class corresponds to the notion of a relationship type in an Entity-Relationship Model. An edge-type collects edges of the same type and has a source VertexType and a target VertexType. Its schema specification determines the types of properties that edges in this collection are allowed to have. The color and shape are for display purposes. An edge-type is analogous to a relation with foreign keys in a relational database.
@param _name   the name of this edge-type ('name' from Identifiable)
@param from    the source vertex
@param schema  the property names for this edge-type
@param to      the target vertex
@param edges   the set of edges having this edge-type (extension)
@param color   the display color for edges of this type
@param shape   the display shape template for edges of this type
PGraph Class
The PGraph class is used to store property graphs. A property graph has a name, zero or more VertexTypes and zero or more EdgeTypes. Each of the EdgeTypes can only reference VertexTypes specified in this PGraph. The animating and aniRatio are used for display purposes.
@param name       the name of the property graph
@param vt         the set of vertex types
@param et         the set of edge types
@param animating  whether to animate the model (defaults to false)
@param aniRatio   the ratio of simulation speed vs. animation speed
A SocialNetwork property graph can be created as shown below; the listing that follows gives the full construction of the property graph.

val g = PGraph ("SocialNetwork", VEC (vt0), VEC (et0, et1))
A PGraph with no EdgeTypes essentially allows a similar organization to a relational database. Simply add
a property that acts as a foreign key and use a join operation.
object SocialNetwork:
    // ...
    val employs: Property = Map ("type" -> "employs")
    // ...
    val v = VEC (
        new Vertex ("Bob", Map ("name" -> "Bob", "state" -> "GA", "salary" -> 85000.0), x0),
        new Vertex ("Sue", Map ("name" -> "Sue", "state" -> "FL", "salary" -> 95000.0), x1),
        new Vertex ("Joe", Map ("name" -> "Joe", "state" -> "GA", "salary" -> 99000.0), x2))
    val vt0 = VertexType ("person", "name, state, salary", v)

    println (s"check schema for vertex-type vt0 = ${vt0.check}")
    vt0.buildIndex ("name")
    // ...
    val et0 = EdgeType ("knows", vt0, "type", vt0, VEC (
        new Edge ("knows", v(0), Map ("type" -> "knows", "since" -> 5), v(1)),
        new Edge ("knows", v(1), Map ("type" -> "knows", "since" -> 2), v(0), -1),
        new Edge ("knows", v(2), Map ("type" -> "knows", "since" -> 4), v(0))))
    val et1 = EdgeType ("employs", vt0, "type", vt0, VEC (
        new Edge ("employs", v(1), employs, v(0), 1),
        new Edge ("employs", v(2), employs, v(1))))

    println (s"check schema for edge-type et0 = ${et0.check}")
    println (s"check schema for edge-type et1 = ${et1.check}")

end SocialNetwork
Index-Free Adjacency
Notice that the basic structure for property graphs makes it easy to find the vertices connected to an edge. Some types of navigation will require finding the edges that a vertex is connected to. Suppose there are vertex-
types Student and Course connected with edge-type Takes. The meta-graph for this is shown in the next
subsection. Suppose a student named "Bob" wishes to know the courses he is taking. One would like to
traverse from the "Bob" vertex to his course vertices. In general this will not work, since some queries may
depend upon an edge property (e.g., the grade). In order to handle such cases, the "Bob" vertex should
reference its outgoing edges in the Takes edge-type. Unfortunately, "Bob" may have taken many courses
(i.e., the Takes edge/relationship type is many-to-many). A storage solution is to have the vertex reference
the first edge in the "Bob" group of courses and have this edge reference the next edge in the group, etc.
The chain of edges may be organized as either a singly or doubly linked list, however, to make edge deletions
efficient the list should be doubly linked. This is the approach utilized by Neo4j’s native graph storage engine,
as described in Chapter 6 of [156]. Note, ScalaTion’s GTable class provides an alternative of collecting all
a vertex’s outgoing edges into an ArrayBuffer.
To illustrate details about the storage, example vertex-types and edge-types will be shown for a simple
course-enrollment database.
Property Graphs are rich enough to depict their own conceptual model. A meta-graph shows the meta-
data for a graph database. The meta-graph in Figure B.5 shows three vertex-types, Student, Course and
Professor. These are connected via the Takes and TaughtBy edge-types. The Takes edge/relationship type
is many-to-many, while the TaughtBy edge/relationship type is many-to-one.
The arrow on the TaughtBy edge-type indicates a many-to-one relationship from Course to Professor. That
is, a course is taught by one Professor, while a professor may teach many Courses.
Figure B.5: meta-graph with vertex-types Student, Course and Professor, connected by edge-types Takes and TaughtBy.
The Course and Professor vertex-types as well as the TaughtBy edge-type are shown in Table B.5 in tabular format. Using the index-free adjacency approach, each Course vertex maintains a reference to its first outgoing TaughtBy edge. Similarly, each Professor vertex maintains a reference to its first incoming TaughtBy edge.
To handle the case where there are multiple edges (e.g., a Professor may teach multiple Courses), an edge chain is maintained in the TaughtBy Edge-Type. Following the edge chain for “Dr. Bill” gives the first edge as e0, with the next edge e1 and then e2; a null terminates the edge chain.
The edge chains are shown here as singly-linked lists for simplicity, while in practice doubly-linked lists allow
for more efficient maintenance of the edge chains.
Table B.5: Course Enrollment Database
• Neo4j's Cypher Query Language. The MATCH statement is structured according to the TaughtBy edge/relationship, with the WHERE clause constraining the professors and the RETURN clause indicating the returned results.
MATCH (c: Course) -[:TaughtBy]-> (p: Professor)
WHERE p.pname = 'Dr. John'
RETURN c.cname
• Apache TinkerPop’s Gremlin Query Language. Using syntax from the Groovy programming language
(a JVM cousin of Scala), the same query can be written in a functional programming style.
g.V ().has ('Professor', 'pname', 'Dr. John')
 .in ('TaughtBy')
 .values ('cname')
• Graph Query Language (GQL) is an emerging standard for graph databases. It utilizes concepts from multiple existing languages; see https://www.gqlstandards.org/existing-languages.
The Cypher Query Language
In order to execute queries using the Cypher query language, a graph database needs to be populated. This is done by first creating several nodes. These nodes may then be referenced to create several edges.
CREATE
  (s1: Student { sid: 101, sname: 'Peter', city: 'Athens' }),
  (s2: Student { sid: 102, sname: 'Paul', city: 'Bogart' }),
  (s3: Student { sid: 103, sname: 'Mary', city: 'Athens' }),
  (c1: Course { cid: 4370, cname: 'DB', hours: 4 }),
  (c2: Course { cid: 4550, cname: 'AI', hours: 3 }),
  (p1: Professor { pid: 201, pname: 'Dr. Bill', rank: 'AssocProf' }),
  (p2: Professor { pid: 202, pname: 'Dr. John', rank: 'Professor' }),
  (s1) -[:Takes]-> (c1),
  (s2) -[:Takes]-> (c1),
  (s3) -[:Takes]-> (c1),
  (s2) -[:Takes]-> (c2),
  (s3) -[:Takes]-> (c2),
  (p1) <-[:TaughtBy]- (c2),
  (p2) <-[:TaughtBy]- (c1)
A WHERE clause can be added to, for example, restrict the returned nodes to students living in ’Athens’.
MATCH (s: Student)
WHERE s.city = 'Athens'
RETURN s
This filter can be specified inside the node specification itself, as shown below.
MATCH (s: Student { city: 'Athens' })
RETURN s
Many meaningful queries correspond to following paths in the graph database. Cypher provides convenient
syntax for specifying path patterns. Those paths matching the pattern serve as the basis for what is returned.
In the query below, the sid and sname of students taking the ’Database’ course will be returned.
MATCH (s: Student) -[:Takes]-> (c: Course { cname: 'Database' })
RETURN s.sid, s.sname
This can be compared to an equivalent SQL query.
SELECT s.sid, s.sname
FROM Student s, Takes t, Course c
WHERE c.cname = 'Database' AND s.sid = t.sid AND t.cid = c.cid
It is easy to make connections that are multiple hops away in the graph. To find the professors teaching
’Peter’, the following 2-hop path pattern may be used.
MATCH (s: Student { sname: 'Peter' }) -[:Takes]-> (c: Course) -[:TaughtBy]-> (p: Professor)
RETURN p
Like SQL, Cypher allows results to be combined with UNION (AS is needed for renaming, so the columns
agree).
MATCH (s: Student)
RETURN s.sname AS name
UNION
MATCH (p: Professor)
RETURN p.pname AS name
Note: Like SQL, UNION removes duplicates, while UNION ALL does not.
Other commonly used clauses include ORDER BY to sort the answer to a query and LIMIT to restrict
the size of the answer.
For more information on the syntax of Cypher and more example queries, see the Neo4j Cypher Manual,
https://neo4j.com/docs/cypher-manual/current/introduction/.
Several operators apply only to a single vertex-type. For the discussion below, the person vertex-type will be used.

val person = g.vmap ("person")
• The project operator for vertex-types will project onto the properties given in the specified subschema
x.
def project (x: Schema): VertexType =
    if ! subset (x, schema) then flaw ("project", "subschema x does not follow schema")
    new VertexType (name + "_p", x,
        for v <- verts yield Vertex (v.prop.filter (x contains _._1)))
end project
For example, to return the vertices with the properties trimmed down to just names, the following two queries may be used. The corresponding Cypher query is also shown.

val q0 = person.project (Array ("name"))
val q1 = person.project ("name")

MATCH (p: Person) RETURN p.name
• The select operator for vertex-types will return the subset of vertices that satisfy the predicate pred.
def select (pred: Property => Boolean): VertexType =
    new VertexType (name + "_s", schema,
        for v <- verts if pred (v.prop) yield v)
end select
For example, to return the vertices where the name property is “Sue”, the following two queries may
be used (the second one is an abbreviated form of the first one). The corresponding Cypher query is
also shown.
val q2 = person.select ((p: Property) => p("name") == "Sue")
val q3 = person.select (_("name") == "Sue")

MATCH (p: Person) WHERE p.name = 'Sue' RETURN p
• The unionAll operator for vertex-types will return all the vertices from the two vertex-types.
def unionAll (vt2: VertexType): VertexType =
    new VertexType (name + "_ua_" + vt2.name, schema, verts ++ vt2.verts)
end unionAll
For example, the union of person and android will return person and android vertices. The corre-
sponding Cypher query is also shown.
val q4 = person unionAll android

MATCH (p: Person) RETURN p UNION ALL MATCH (a: Android) RETURN a
• The union operator for vertex-types will return the vertices that are in either of two vertex-types,
without duplication.
def union (vt2: VertexType): VertexType =
    new VertexType (name + "_u_" + vt2.name, schema, (verts ++ vt2.verts).distinct)
end union
For example, the union of person and android will return person and android vertices. If any two
vertices have exactly the same properties, the duplicate will be removed. The corresponding Cypher
query is also shown.
val q4 = person union android

MATCH (p: Person) RETURN p UNION MATCH (a: Android) RETURN a
• The intersect operator for vertex-types will return the vertices belonging to both vertex-types (this and vt2).
def intersect (vt2: VertexType): VertexType =
    new VertexType (name + "_i_" + vt2.name, schema, (verts intersect vt2.verts))
end intersect
• The minus operator for vertex-types will return the vertices that are in this vertex-type but not in the second (vt2). The Cypher query language does not currently provide a MINUS operation (although a work-around can be used).
def minus (vt2: VertexType): VertexType =
    new VertexType (name + "_m_" + vt2.name, schema, verts diff vt2.verts)
end minus
For example, q5 starts with the vertices in q4 and subtracts the vertices in android.

val q5 = q4 minus android
• The groupBy operator groups the vertices based on sharing a property value pname.
def groupBy (pname: String, agg_name: String, agg_fn: Double => Double): VertexType =
    debug ("groupBy", s"group $schema by $pname")
    if ! (schema contains pname) then flaw ("groupBy", s"property $pname missing from schema")
    if checkMissing (pname) then flaw ("groupBy", s"property $pname missing from a vertex")
    // ...
• The orderBy operator orders the vertices within this vertex-type by the values of the given property name pname.
def orderBy (pname: String): VertexType =
    new VertexType (name + "_o", schema, verts.sortWith (_.prop (pname) < _.prop (pname)))
end orderBy
Other operators apply to edge-types. The expand operators expand into vertices that are targets (to) or
sources (from) of the edges in the edge-type. Several other operators are analogs of those provided by
vertex-types.
• The expandTo operator expands this edge-type with its 'to' vertex-type, appending its properties.

def expandTo: EdgeType =
    val edgez = for e <- edges yield Edge (e.from, e.prop +++ e.to.prop, null)
    new EdgeType (name + "_et", from, schema ++ to.schema, null, edgez)
end expandTo
def project (x: Schema): EdgeType =
def select (pred: Property => Boolean): EdgeType =
def unionAll (et2: EdgeType): EdgeType =
def union (et2: EdgeType): EdgeType =
def intersect (et2: EdgeType): EdgeType =
def minus (et2: EdgeType): EdgeType =
def orderBy (pname: String): EdgeType =
The following four operators work on both vertex-types and edge-types to produce new property graphs.
def expandOut (from: VertexType, ets: VEC [EdgeType], tos: VEC [VertexType], newName: String): PGraph =
def expandIn (froms: VEC [VertexType], ets: VEC [EdgeType], to: VertexType, newName: String): PGraph =
def expandBoth (froms: VEC [VertexType], ets: VEC [EdgeType], tos: VEC [VertexType], newName: String): PGraph =
def join (g2: PGraph, vt1: VertexType, vt2: VertexType, newName: String): PGraph =
As Cypher and Gremlin are the two major query languages for graph databases, graph algebras have
been developed for each: Cypher [79, 120, 185], Gremlin [193, 192]. Additional information on graph algebra
may be found in the text on Graph Data Warehousing [53].
B.4 Special Types of Graph Databases
Graph databases attempt to strike the right balance between (i) rich and flexible database models, (ii)
efficient query processing, and (iii) reduction in complexity. For example, compared to relational databases,
graph databases trade off (iii) for gains in (i) and (ii). Different organizational structures for graph databases
will be positioned differently among these three competing goals.
The VertexType class has a name, a schema for its properties, and an edge schema for its edge references, as well as the vertices contained in it.
@param name     the name of this vertex-type
@param schema   the property names for this vertex-type
@param eschema  the edge names for this vertex-type
@param verts    the set of vertices having this vertex-type (extension)

case class VertexType (name: String, schema: VEC [String], eschema: VEC [String],
                       verts: VEC [Vertex])
      extends Flaw ("VertexType") with Serializable:
A Course vertex-type may have a reference TaughtBy to a Professor vertex-type. At the instance level this would indicate that, for example, the Database course is taught by professor 2.
val Professor = VertexType ("professor",
    VEC ("pid", "name", "phone"),
    VEC (),
    VEC (Vertex (Map ("pid" -> 1, "name" -> "Bob", "phone" -> 1234567)),
         Vertex (Map ("pid" -> 2, "name" -> "Sue", "phone" -> 2345678)),
         Vertex (Map ("pid" -> 3, "name" -> "Joe", "phone" -> 3456789))))
The database.graph_relation package is similar to a traditional relational model, except that foreign key values are no longer stored in a column of a table whose correspondence with the primary key value facilitates a join. Rather, a direct reference to the vertex containing that primary key value is used. The trade-off is a slight increase in the complexity of the database model for the potential of faster joins. See the exercises for a comparison of property graphs, graph relations and relations.
On the flip side, various extensions can be added to the above organizational structure. For example, type
hierarchies can be imposed on vertex-types as well as edge-types.
The IRIs are identifiers for (Web) resources and the Resource Description Framework Schema (RDFS) can be used to assign types to resources, dividing them into groups called classes. The rdf:type predicate may be used to assign a type to a resource, e.g.,

resource1 rdf:type Class1
states that resource1 is an instance of Class1, an rdfs:Class. Subclasses may be defined as well (see https://www.w3.org/TR/rdf-schema/).

Class2 rdfs:subClassOf Class1

Similarly, properties can be defined using rdf:Property and subtypes defined using rdfs:subPropertyOf.
A convenient way to define an RDF/RDFS dataset is via the Turtle language (see https://www.w3.org/TR/2014/REC-turtle-20140225/). In addition, JSON-LD provides an alternative convenient syntax (see https://www.w3.org/TR/json-ld/).
To handle certain use cases in RDF and RDFS, more complicated graph forms are needed, including
bipartite graphs, hypergraphs and Labeled Directed Multigraph with Triple Nodes (LDM-3N) [134]. There
are also efforts underway to provide interoperability between RDF and Property Graphs [8].
The Shapes Constraint Language (SHACL) provides constraints for RDF graphs.
The SPARQL query language makes it easy to express queries in that the WHERE clause consists of constraints in the form of triple patterns that are similar to triple statements (see https://www.w3.org/TR/sparql11-query/).
PREFIX university: <http://.../university>
SELECT ?p
WHERE { ?c rdf:type university:Course .
        ?c university:taughtBy ?p . }
The first triple pattern constrains the variable ?c to be a university course, while the second one constrains ?p to be someone who teaches a course at the university. The SELECT clause indicates what to return as the answer to the query, while the PREFIX indicates the data source and its namespace.
Table B.6: Types of Data Models (Relational (R), Graph-Relational (GR), Graph (G))
B.5 Knowledge Graphs
Some view knowledge graphs as RDF/RDFS graphs and their extensions such as using the Web Ontology
Language (OWL) to specify type hierarchies and constraints. Another viewpoint is that they are graph
databases, primarily property graphs, used to store knowledge in addition to data.
A more common view is that knowledge graphs go beyond the structural issues related to basing them on RDF graphs versus property graphs, by including deeper issues [77, 78].
For example, various sub-languages for OWL support particular forms of description logic. Description
logics are especially helpful in deducing if one class/concept subsumes another class/concept, purely from
the logical specification (e.g., Animal subsumes Mammal).
Also see https://web.stanford.edu/~vinayc/kg/notes/KG_Notes_v1.pdf
One could start with a graph database based on Property Graphs and add some of the features/capabilities discussed in the subsections below.
RDFS also allows type hierarchies to be created among relationships (edge-types). For example, a relationship (edge-type) mother-of could be defined to be a subtype of parent-of. Then any query looking for parents of a Person would follow the parent-of relationship and any subtype relationships.
It consists of five concepts/classes (Person, Student, Professor, Course, ⊤), where ⊤ is the top concept (anything), and nine roles/binary relationships (hasId, hasName, hasStreet, hasCity, hasDept, hasLevel, hasCid, takes, teaches). The ⊓ symbol indicates concept intersection/conjunction, thus specifying that the Person concept has all four roles/properties listed in its definition. The ∃ symbol is an existential restriction.
To permit inferencing in polynomial time, OWL EL is limited to (1) negation (¬C), (2) conjunction (C ⊓ D), (3) disjunction (C ⊔ D), (4) existential restriction (∃R.C), (5) concept inclusion (C ⊑ D), (6) role inclusion (R ⊑ S), and (7) role chain (R1 ◦ R2 ⊑ R).
Concept inclusion C ⊑ D is the flip side of concept subsumption D ⊒ C. When C ⊑ D, if c ∈ C^I, then c ∈ D^I. Note, C is a concept, while C^I is the application of the concept to a knowledge base (or in general any interpretation) and indicates the set of individuals classified under C, asserted or inferred.
Instances/individuals may be associated with concepts and roles as follows: (1) concept assertion C(a),
e.g., Student(peter), Course(database) and (2) role assertion R(a, b), e.g., takes(peter, database).
In addition to the top concept (⊤) containing all individuals, there is also a bottom concept (⊥) that can never have individuals.
See [99, 126] for the complete definitions for OWL 2 EL.
A reasoner can be applied to make inferences, such as concept inclusion (⊑), e.g., C ⊑ D, meaning that concept D is more general than concept C, e.g., Student ⊑ Person.
Related to constraints are rules. While inferencing with constraints is used to determine whether something (e.g., Person subsumes Student) is true, inferencing with rules produces new facts. Consider the following rule written in Datalog [62].
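For example, a classic Datalog rule derives grandparents from the parent table:

    grandparent(x, z) :- parent(x, y), parent(y, z)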
It produces new information that is not stored in the database or knowledge base. Although this could be
easily accomplished using relational algebra on a table parent(x, y),
the following rules (with the second one being recursive) are beyond the capabilities of standard relational
algebra.
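For example, the classic ancestor rules illustrate such recursion:

    ancestor(x, y) :- parent(x, y)
    ancestor(x, z) :- parent(x, y), ancestor(y, z)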
Other implementations of rules used for knowledge graphs or, more generally, knowledge bases include the Semantic Web Rule Language (SWRL) and the Rule Interchange Format (RIF) [152].
For a discussion of future direction of research on knowledge graphs, see [76].
B.5.3 KGTable
The KGTable class in ScalaTion allows a knowledge-graph-table (or simply kg-table) to specify a parent
or supertype.
@param name_    the name of the graph-table
@param schema_  the attributes for the graph-table
@param domain_  the domains/data-types for attributes ('D', 'I', 'L', 'S', 'X', 'T')
@param key_     the attributes forming the primary key
@param parent   the parent (super-type) table

class KGTable (name_ : String, schema_ : Schema, domain_ : Domain, key_ : Schema,
               val parent: KGTable = null)
      extends GTable (name_,
                      if parent == null then schema_ else parent.schema ++ schema_,
                      if parent == null then domain_ else parent.domain ++ domain_,
                      if parent == null then key_ else parent.key)
      with Serializable:
The schema defined above for PostgreSQL may be specified in ScalaTion as follows:
val person    = KGTable ("person", "id, name, street, city", "I, S, S, S", "id")
val student   = KGTable ("student", "dept, level", "S, I", null, person)
val professor = KGTable ("professor", "dept", "S", null, person)
val course    = KGTable ("course", "cid, cname, hours, dept", "I, X, I, S", "cid")

student.addEdgeType ("cid", course, false)                 // student has M courses
course.addEdgeType ("id", student, false)                  // course has M students
course.addEdgeType ("pid", professor)                      // course has 1 professor
In a manner similar to the GTable graph database, a KGTable knowledge graph may be populated as
follows:
val v_Joe = person.addV (91, "Joe", "Birch St", "Athens")
val v_Sue = person.addV (92, "Sue", "Ceder St", "Athens")

val v_Peter = student.addV (101, "Peter", "Oak St", "Bogart", "CS", 3)
val v_Paul  = student.addV (102, "Paul", "Elm St", "Watkinsville", "CE", 4)
val v_Mary  = student.addV (103, "Mary", "Maple St", "Athens", "CS", 4)

val v_DrBill = professor.addV (104, "DrBill", "Plum St", "Athens", "CS")
val v_DrJohn = professor.addV (105, "DrJohn", "Pine St", "Watkinsville", "CE")

val v_Database     = course.addV (4370, "Database Management", 4, "CS")
val v_Architecture = course.addV (4720, "Comp. Architecture", 4, "CE")
val v_Networks     = course.addV (4760, "Computer Networks", 4, "CS")

student.add2E ("cid", Edge (v_Peter, v_Database), "id", course)
       .add2E ("cid", Edge (v_Peter, v_Architecture), "id", course)
       .add2E ("cid", Edge (v_Paul, v_Database), "id", course)
       .add2E ("cid", Edge (v_Paul, v_Networks), "id", course)
       .add2E ("cid", Edge (v_Mary, v_Networks), "id", course)
The all operator will retrieve vertices from the complete hierarchy that starts with the given class. The first show will display Joe and Sue, while the second will display all people.

person.show ()
person.all ().show ()
B.6 Exercises - Part I
1. Present a Group Lecture Series on High Level Query Languages for Graph Databases.
6. Use Scala 3 to complete the implementation of EdgeType and PGraph in the scalation.database.graph package. Again, it should provide operators like those in the Neo4j graph algebra. The existing operators in these classes may need to be renamed and/or modified.
7. Using your implementation for PGraph, translate the road sensor schema.
10. Using Neo4j’s Cypher Query language, retrieve the sensors that are on I35.
11. Retrieve traffic data within a 100 kilometer-grid from the center of Austin, Texas. The latitude-
longitude coordinates for Austin, Texas are (30.266667, -97.733333).
• ScalaTion data models: Table, LTable, VTable, GTable, and PGraph. Each group picks one to
work on.
• Relations - using PostgreSQL/MySQL
• Properties Graphs - using Neo4j
Plot the performance of various types of queries executed with each of the four database systems. For each plot, the x-axis will be the number of tuples/vertices, while the y-axis will be the time in milliseconds (using nanoTime) to execute the query. ScalaTion provides a convenient time function for this, as follows:
def time [R] (block: => R): R =
    val t0 = nanoTime ()
    val result = block                                     // call-by-name
    val t1 = nanoTime ()
    println ("Elapsed time: " + (t1 - t0) * NS_PER_MS + " ms")
    result
end time
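For example, a query like those from Section B.3 may be timed as follows (a usage sketch):

    val answer = time { person.select (_("name") == "Sue") }    // prints the elapsed time in ms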
Be sure to skip the timing results for the first iteration and average the next five to get reliable timing
results. Note, the JIT compiler is used during the first iteration, so it tends to be slow. The performance
evaluation will require a large number of tuples/vertices, so a tuple/vertex generator will be needed.
Make sure to test path queries common in social networking applications.
B.7 Graph Data Science
Graph Analytics or Graph Data Science is emerging as an important area. It has two sides: first, how can data science/machine learning be used to build graph databases or knowledge graphs; second, how can these be used to enhance data science/machine learning? Enhancement includes making existing predictive models better as well as supporting new types of analysis.
The first place to start is to see how graph algorithms are used in graph data science. Graph database providers such as Neo4j and TigerGraph supply libraries to support graph data science [130, 94, 153, 149].
B.8 Graph Pattern Matching
Graph Pattern Matching [19] captures the idea that one can specify a query as a (typically small) graph
or more generally a graph pattern and then find all subgraphs in the graph database that match it. If one
is interested in subgraphs that match in terms of labels/properties and topology, then the graph matching
problem is called subgraph isomorphism. Various forms of graph simulation may also be used, including
simple, dual, strong, strict and tight simulation. The graph simulation algorithms return all the subgraph isomorphism matches plus some that, while failing the subgraph isomorphism test, may still be of interest to the user; in addition, these algorithms are much faster.
Another direction that also relaxes subgraph isomorphism is graph homomorphism [95, 184].
Pruning is used to remove cases where vertex v ∈ φ(u) does not have children with vertex labels matching
those of vertex u.
def prune0 (φ: Array [SET [Int]]): Array [SET [Int]] =
    var (rem, alter) = (SET [Int] (), true)                // vertices to be removed
    breakable {
        while alter do                                     // check for matching children
            alter = false                                  // no vertices removed yet
            // ...
The prune method also checks whether the edge labels match (see scalation.database.graph_pm.GraphSim).
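The idea behind the pruning loop may be sketched in a self-contained form as follows (an illustration with assumed names, not ScalaTion's actual code): qCh and gCh are the child sets of the query graph and the data graph, and phi(u) is the candidate match set for query vertex u; any candidate lacking a matching child for some query child is removed, until a fixpoint is reached.

    import scala.collection.mutable.{Set => SET}

    def prune (qCh: Array [SET [Int]], gCh: Array [SET [Int]],
               phi: Array [SET [Int]]): Array [SET [Int]] =
        var alter = true
        while alter do                                     // repeat until no change
            alter = false
            for u <- phi.indices; v <- phi(u).toList do    // candidate v for query vertex u
                if qCh(u).exists (uc => ! gCh(v).exists (phi(uc).contains)) then
                    phi(u) -= v                            // v lacks a child matching a child of u
                    alter = true
        phi
    end prune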
Dual Simulation
Strict Simulation
Tight Simulation
B.9 Graph Representation Learning
Large graphs are difficult to analyze as a whole. One approach is to convert a graph into a matrix and then use spectral theory (eigenvalues/singular values) to make a lower-dimensional approximation of the matrix (e.g., for an Adjacency Matrix, Laplacian Matrix, or Normalized Laplacian Matrix). Graph Representation Learning [207] ...
Given a Directed Multi-Graph G(V, E, Le, l, Lv), suppose an edge-label (or more generally an edge property) can be interpreted as a weight (e.g., inverse of distance). The weight can be interpreted as vertex proximity or similarity. Between any two vertices (u, v) there can be multiple edges and edge labels.

{(u, le, v) ∈ E}    (B.3)

These can be aggregated (e.g., using count, min, max, mean, median) to create a single weight. This weight can be used to create a non-negative weighted adjacency matrix, A = [aij].
Graph Laplacian
Although the Graph Laplacian is usually applied in the context of undirected graphs, it can be defined for directed graphs and directed multi-graphs.
L = D − A    (B.6)
where A is the weighted adjacency matrix and D is the out-degree (alternatively the in-degree) diagonal
matrix. For directed graphs, L may not be a symmetric matrix. Consequently, a spectral decomposition
giving the eigenvalues of the matrix may include complex numbers. In such cases, one could work with
singular values [26]. A simpler alternative would be to work with the undirected graph underlying the
directed graph, where L is a symmetric matrix and the eigenvalues are real numbers. Having the spectrum of a matrix allows overall characteristics of graphs to be studied and graphs to be compared. For example, if the spectra of two graphs are different, then they cannot be isomorphic [26]. In addition, small eigenvalues may be ignored to obtain a lower dimensional representation of the matrix (and consequently the graph).
The normalized Laplacian is

L = D^(−1/2) (D − A) D^(−1/2)    (B.7)
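Both forms are easy to compute from A. A minimal sketch using plain Scala arrays (an illustration, not ScalaTion's MatrixD API) for L = D − A, with D taken as the out-degree (row-sum) diagonal matrix:

    def laplacian (a: Array [Array [Double]]): Array [Array [Double]] =
        val d = a.map (_.sum)                              // out-degree of each vertex
        Array.tabulate (a.length, a.length) ((i, j) =>
            (if i == j then d(i) else 0.0) - a(i)(j))      // L = D - A
    end laplacian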
B.9.2 Graph Embeddings
As many data science/machine learning modeling techniques take vectors and matrices as input, these need to be created from the information content of knowledge graphs. For example, a state in the United States may have demographic and policy information that can usefully be fed into a model. In addition, similar information about neighboring states, or for example mobility data between the states, can be useful as well. Capturing all this information may lead to a high dimensionality vector that is not ideal for modeling. The goal of this area of research is to embed the information in a lower dimensionality vector, while not losing important structural information [210]. Also see https://neo4j.com/developer/graph-data-science/graph-embeddings/.
A vertex embedding function φ : V → R^d maps each vertex to a d-dimensional vector. This can be done in numerous ways, but the idea is to make the dot product of two embedding vectors approximate a similarity function sim applied to the vertices, i.e.,

φ(u) · φ(v) ≈ sim(u, v)
FastRP
DeepWalk
A random walk in a directed graph will involve moving from vertex to vertex along edges. The selection of the
next vertex is governed by the transition matrix P . This may also be viewed as moving from state-to-state
in a Discrete-Time Markov Chain {xt |t = 0, 1, ...}. The walk will start at a particular vertex (or state) and
randomly progress from there.
Given a weighted adjacency matrix A, the transition probability for a standard random walk is given by the following conditional probability [145, 83].

p(x_{t+1} = u | x_t = v) = a_{vu} / d_v    (B.9)
The transition matrix is then P = D^(−1) A. As indicated in the section on Markov Chains, the long-term, steady-state solution may be found by solving for π in

π = πP    (B.10)
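Such a walk may be sketched as follows (an illustration with assumed names): the next vertex u is drawn with probability a_{vu} / d_v, matching P = D^(−1) A; every vertex is assumed to have at least one outgoing edge.

    import scala.util.Random

    def randomWalk (a: Array [Array [Double]], start: Int, len: Int,
                    rng: Random = new Random ()): List [Int] =
        def next (v: Int): Int =                           // sample u with probability a(v)(u) / d(v)
            val r = rng.nextDouble () * a(v).sum
            var (u, cum) = (0, a(v)(0))
            while cum < r do { u += 1; cum += a(v)(u) }
            u
        var walk = List (start)
        for _ <- 1 until len do walk = next (walk.head) :: walk
        walk.reverse
    end randomWalk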
Node2Vec
GraphSAGE
B.10 Graph Neural Networks
The convolutional filters used in CNNs allow one to extract hidden features by looking at a collection of
nearby points. The notion of nearby may be thought of in terms of Euclidean distance.
Graphs can be used to provide a more versatile notion of which points are nearby. This may be done by counting hops, the number of edges traversed to get from one vertex to another. If distance is an edge property (given or derived), then nearby vertices may be defined as those within a certain distance, as shown in Figure B.6.
Figure B.6: vertices v0 through v4, with nearby vertices defined by distance.
There are many types of Graph Neural Networks (GNNs) [216, 206] ... A GNN may be defined for several
types of graphs.
Consider the following slightly modified definition of a vertex-labeled directed graph as a four-tuple G(V, E, d, x) where
• V = set of vertices
• E ⊆ V × V = set of edges
• d = the dimension of the property/attribute vectors
• x : V → R^d
Hence, xv denotes a property/attribute vector associated with vertex v. In general, a GNN could also use
values on edges. The following definitions are needed for understanding how GNNs work [209].
Neighborhood
The neighborhood of vertex v4 is {v1 , v3 }, where v1 is an upstream vertex and v3 is a downstream vertex.
806
B.10.1 AGGREGATE and COMBINE Operations
As with CNNs, the initial processing of data occurs with non-neural layers. For each layer l, data from a vertex is mixed with its neighbors' data using the AGGREGATE and COMBINE operations [68, 129].
AGGREGATE
The values for the neighborhood of vertex v, x_{N(v)}, are aggregated to form a vector of values representing the neighborhood:

x_{N(v)} = AGGREGATE({x_u^(l−1) : u ∈ N(v)})    (B.12)
COMBINE
A new value for vertex v, x_v^(l), is calculated by combining the previous vector value for vertex v, x_v^(l−1), and the vector value produced from v's neighborhood.

x_v^(l) = COMBINE(x_v^(l−1), x_{N(v)})    (B.13)
The COMBINE operation may be as simple as an element-wise mean or vector addition (+).
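One simple possibility is sketched below (an illustration on plain arrays, not necessarily the exact scheme used in the example that follows): AGGREGATE takes the element-wise mean over the neighborhood, and COMBINE takes the element-wise mean of the vertex's own vector and the aggregated one.

    def aggregate (neighbors: Seq [Array [Double]]): Array [Double] =
        neighbors.transpose.map (_.sum / neighbors.size).toArray    // element-wise mean over N(v)

    def combine (xv: Array [Double], xNv: Array [Double]): Array [Double] =
        xv.zip (xNv).map ((a, b) => (a + b) / 2.0)                  // element-wise mean of the pair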
As an example, consider applying a Graph Neural Network to vehicle traffic flow on a highway system. Let
the vertices in the graph represent sensors. At a particular time t, the sensor records the flow (vehicles in last
5 minutes) and speed (in mph). The neighborhood may capture the notion of upstream flow past one sensor
and downstream flow past another sensor. Suppose the initial property/attribute vectors are the following:
v     flow   speed
v4     99     66
v1     90     68
v3     80     72
x_v^(1) = COMBINE(x_v^(0), x_{N(v)}) = [93, 67]
Typically, the COMBINE operation includes a learnable weight/parameter matrix B and an activation
function f (e.g., reLU).
x_v^(l) = f(B [x_v^(l−1), x_{N(v)}])    (B.14)
f is the vectorization of activation function f . The weights/parameters in the GNN indicate the importance
of data from upstream and downstream sensors in, for example, classifying, predicting or forecasting results
at a particular sensor. Pooling can also be used to reduce the size of the vectors.
ScalaTion will support GraphNeuralNets of the spatial variety; those of the spectral variety are not considered (see [206] for a comparison).
Bibliography
[1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews:
computational statistics, 2(4):433–459, 2010.
[2] Charu C Aggarwal et al. Data Classification: Algorithms and Applications. Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series, 2015.
[3] Shahrokh Akhlaghi, Ning Zhou, and Zhenyu Huang. Adaptive adjustment of noise covariance in Kalman
filter for dynamic state estimation. In 2017 IEEE Power & Energy Society General Meeting, pages 1–5.
IEEE, 2017.
[4] Awad H Al-Mohy and Nicholas J Higham. A new scaling and squaring algorithm for the matrix
exponential. SIAM Journal on Matrix Analysis and Applications, 31(3):970–989, 2010.
[5] Christos Alexopoulos, Andrew F Seila, and J Banks. Output data analysis. In Handbook of Simulation,
number 7. John Wiley & Sons, 1998.
[6] Waqas Ali, Muhammad Saleem, Bin Yao, Aidan Hogan, and Axel-Cyrille Ngonga Ngomo. A survey
of RDF stores & SPARQL engines for querying knowledge graphs. The VLDB Journal, pages 1–26, 2021.
[7] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-
squares clustering. Machine Learning, 75:245–248, 2009.
[8] Renzo Angles, Harsh Thakkar, and Dominik Tomaszuk. RDF and property graphs interoperability:
Status and issues. In AMW, 2019.
[9] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In Symposium on Compu-
tational Geometry, volume 6, pages 1–10, 2006.
[10] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. Technical
report, Stanford, 2006.
[11] Jeremy Avigad, Johannes Hölzl, and Luke Serafin. A formally verified proof of the central limit
theorem. Journal of Automated Reasoning, 59(4):389–423, 2017.
[12] Jerry Banks, John Carson, Barry Nelson, and David Nicol. Discrete event system simulation, 5th
Edition. Pearson, 2010.
[13] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind.
Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research,
18:1–43, 2018.
[14] Anil K Bera and Yannis Bilias. The MM, ME, ML, EL, EF and GMM approaches to estimation: a
synthesis. Journal of Econometrics, 107(1-2):51–86, 2002.
[15] Joseph Berkson. Minimum chi-square, not maximum likelihood! The Annals of Statistics, 8(3):457–
487, 1980.
[16] Maciej Besta, Emanuel Peter, Robert Gerstenberger, Marc Fischer, Michal Podstawski, Claude
Barthels, Gustavo Alonso, and Torsten Hoefler. Demystifying graph databases: Analysis and tax-
onomy of data organization, system designs, and graph queries. arXiv preprint arXiv:1910.09017,
2019.
[17] Concha Bielza and Pedro Larrañaga. Discrete Bayesian Network Classifiers: A Survey. ACM Com-
puting Surveys (CSUR), 47(1):5, 2014.
[18] Nicholas H Bingham and John M Fry. Regression: Linear models in Statistics. Springer Science &
Business Media, 2010.
[19] Sarra Bouhenni, Said Yahiaoui, Nadia Nouali-Taboudjemat, and Hamamache Kheddouci. A survey on
distributed graph pattern matching in massive graphs. ACM Computing Surveys (CSUR), 54(2):1–35,
2021.
[20] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis:
forecasting and control. John Wiley & Sons, 2015.
[21] Stephan Boyd and Lieven Vandenberghe. Introduction to Applied Linear Algebra: Vectors, Matrices,
and Least Squares. Cambridge University Press, 2018.
[22] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed Optimiza-
tion and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and
Trends® in Machine Learning, 3(1):1–122, 2011.
[23] Charles G Broyden. The convergence of a class of double-rank minimization algorithms: 2. the new
algorithm. IMA journal of applied mathematics, 6(3):222–231, 1970.
[24] Yalcin Bulut, D Vines-Cavanaugh, and Dionisio Bernal. Process and measurement noise estimation
for Kalman filtering. In Structural Dynamics, Volume 3, pages 375–386. Springer, 2011.
[25] Brett Burglund and Ryan Street. Golf ball flight dynamics. Flathead Valley, 2011.
[26] Steven Kay Butler. Eigenvalues and structures of graphs. University of California, San Diego, 2008.
[27] José M Carcione, Juan E Santos, Claudio Bagaini, and Jing Ba. A simulation of a COVID-19 epidemic
based on a deterministic SEIR model. Frontiers in Public Health, 8:230, 2020.
[28] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages
121–130. Springer, 1996.
[29] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties
of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[30] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Hol-
ger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[31] Zafer Cömert and Adnan Fatih Kocamaz. A study of artificial neural network training algorithms for
classification of cardiotocography signals. J Sci Technol, 7(2):93–103, 2017.
[32] Pierre Comon. Tensors: A brief introduction. IEEE Signal Processing Magazine, 31(3):44–53, 2014.
[33] Jerome Connor, Les E Atlas, and Douglas R Martin. Recurrent networks and narma modeling. In
Advances in neural information processing systems, pages 301–308, 1992.
[34] Denis Cousineau and Sylvain Chartier. Outliers detection and treatment: A review. International
Journal of Psychological Research, 3(1):58–67, 2010.
[35] Thomas M. Cover and Joy A. Thomas. Entropy, Relative Entropy and Mutual Information. Technical
report, Columbia University, 1991.
[37] Marc Peter Deisenroth, A Aldo Faisal, and Cheng Soon Ong. Mathematics for machine learning.
Cambridge University Press, 2020.
[38] Holger Dette and Weichi Wu. Prediction in locally stationary time series. Journal of Business &
Economic Statistics, 40(1):370–381, 2022.
[39] Do Q. Lee. Numerically efficient methods for solving least squares problems, 2012.
[40] Justin Domke. Statistical Machine Learning Notes: Trees. Technical report, University of Mas-
sachusetts, 2018.
[41] Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.
[42] A Ehiagwina. Application of Mixed Simulation Method to Modelling Port Traffic. PhD thesis, Liverpool
John Moores University, 2021.
[43] Joseph G Eisenhauer. Regression through the origin. Teaching statistics, 25(3):76–80, 2003.
[44] Steven Elsworth and Stefan Güttel. Time series forecasting using LSTM networks: A symbolic ap-
proach. arXiv preprint arXiv:2003.05672, 2020.
[45] Leonhard Euler. Mechanica Sive Motus Scientia Analytice Exposita: Instar Supplementi Ad Commen-
tar. Acad. Scient. Imper, volume 2. Ex typographia academiae scientiarum, 1736.
[46] Geir Evensen. The ensemble kalman filter: Theoretical formulation and practical implementation.
Ocean dynamics, 53(4):343–367, 2003.
[47] Yuval Filmus. Two proofs of the central limit theorem. Retrieved from http://www.cs.toronto.edu/~yuvalf/CLT.pdf, 2010.
[48] Roger Fletcher. A new approach to variable metric algorithms. The computer journal, 13(3):317–322,
1970.
[49] Valeria Fonti and Eduard Belitser. Feature Selection using LASSO. Technical report, VU Amsterdam,
2017.
[50] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine learning,
29(2-3):131–163, 1997.
[51] Fabien Gandon, Reto Krummenacher, Sung-Kook Han, and Ioan Toma. The resource description
framework and its schema, 2011.
[52] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with
LSTM. In Ninth International Conference on Artificial Neural Networks. IET, 1999.
[54] Michael B Giles, Mihai C Duta, Jens-Dominik Muller, and Niles A Pierce. Algorithm developments
for discrete adjoint methods. AIAA journal, 41(2):198–205, 2003.
[55] Igor Gitman, Hunter Lang, Pengchuan Zhang, and Lin Xiao. Understanding the role of momentum in
stochastic gradient methods. Advances in Neural Information Processing Systems, 32, 2019.
[56] D Goldfarb. A family of variable metric updates derived by variational means. Mathematics of
Computing, 24:317–322, 1970.
[57] John Goldsmith. Probability for linguists. Mathématiques et sciences humaines. Mathematics and
social sciences, (180):73–98, 2007.
[58] Gene H. Golub and Charles F. Van Loan. Matrix Computations, 4th Edition, volume 3. JHU Press,
2013.
[59] Maximilian Götzinger, Dávid Juhász, Nima Taherinejad, Edwin Willegger, Benedikt Tutzer, Pasi Liljeberg, Axel Jantsch, and Amir M Rahmani. RoSA: A framework for modeling self-awareness in cyber-physical systems. IEEE Access, 8:141373–141394, 2020.
[60] Clive WJ Granger and Paul Newbold. Spurious regressions in econometrics. Journal of econometrics,
2(2):111–120, 1974.
[61] Erin Grant and Yan Wu. Predicting generalization with degrees of freedom in neural networks. In
ICML 2022 2nd AI for Science Workshop, 2022.
[62] Todd J Green, Shan Shan Huang, Boon Thau Loo, Wenchao Zhou, et al. Datalog and recursive query
processing. Foundations and Trends® in Databases, 5(2):105–195, 2013.
[63] Andreas Griewank et al. On automatic differentiation. Mathematical Programming: recent develop-
ments and applications, 6(6):83–107, 1989.
[64] Andrey Gubichev. Query processing and optimization in graph databases. PhD thesis, Technische
Universität München, 2015.
[65] Ricardo Guimarães and Ana Ozaki. Reasoning in knowledge graphs. In International Research School
in Artificial Intelligence in Bergen (AIB 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.
[66] Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris.
Spottune: transfer learning through adaptive fine-tuning. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 4805–4814, 2019.
[67] Ernst Hairer, Christian Lubich, Gerhard Wanner, et al. Geometric numerical integration illustrated by the Störmer–Verlet method. Acta numerica, 12(12):399–450, 2003.
[68] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and
applications. arXiv preprint arXiv:1709.05584, 2017.
[69] John A Hartigan, Manchek A Wong, et al. A k-means clustering algorithm. Applied statistics,
28(1):100–108, 1979.
[70] Hussein Abdulahman Hashem. Regularized and Robust Regression Methods for High Dimensional Data.
PhD thesis, Brunel University, 2014.
[71] Trevor Hastie and Robert Tibshirani. Efficient quadratic regularization for expression arrays. Bio-
statistics, 5(3):329–340, 2004.
[72] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, 2nd Edition. Springer-Verlag New York, 2009.
[73] Yanchen He. The application of alternating direction method of multipliers on l1-norm problems. In Journal of Physics: Conference Series, volume 1187, page 042070. IOP Publishing, 2019.
[75] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–
1780, 1997.
[76] Aidan Hogan. Knowledge graphs: Research directions. Reasoning Web International Summer School,
pages 223–253, 2020.
[77] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez,
José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, et al. Knowledge graphs.
arXiv preprint arXiv:2003.02320, 2020.
[78] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard De Melo, Claudio Gutierrez,
Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, et al. Knowledge
graphs. ACM Computing Surveys (CSUR), 54(4):1–37, 2021.
[79] Jürgen Hölsch and Michael Grossniklaus. An algebra and equivalences to transform graph patterns in Neo4j. In EDBT/ICDT 2016 Workshops: EDBT Workshop on Querying Graph Structured Data (GraphQ), 2016.
[80] Douglas Holtz-Eakin, Whitney Newey, and Harvey S Rosen. Estimating vector autoregressions with
panel data. Econometrica: Journal of the econometric society, pages 1371–1395, 1988.
[81] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal
approximators. Neural networks, 2(5):359–366, 1989.
[82] Pili Hu. Matrix Calculus: Derivation and Simple Application. Technical report, City University of
Hong Kong, 2012.
[83] Zexi Huang, Arlei Silva, and Ambuj Singh. A broader picture of random-walk based graph embedding.
In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pages
685–695, 2021.
[84] Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.
[85] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical
Learning: With Applications in R. Springer-Verlag New York, 2013.
[86] Lucas Janson, William Fithian, and Trevor Hastie. Effective degrees of freedom: a flawed metaphor.
Biometrika, 99(1):1–8, 2012.
[88] Jefkine Kafunah. Backpropagation in convolutional neural networks. DeepGrid—Organic Deep Learn-
ing, Nov, 29:1–10, 2016.
[89] George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang.
Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
[90] Anuj Karpatne, Gowtham Atluri, James H Faghmous, Michael Steinbach, Arindam Banerjee, Auroop
Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar. Theory-guided data science: A new
paradigm for scientific discovery from data. IEEE Transactions on knowledge and data engineering,
29(10):2318–2331, 2017.
[91] Matthias Katzfuss, Jonathan R Stroud, and Christopher K Wikle. Understanding the ensemble Kalman filter. The American Statistician, 70(4):350–357, 2016.
[92] Luke Keele and Nathan J Kelly. Dynamic models for dynamic theories: The ins and outs of lagged
dependent variables. Political analysis, pages 186–205, 2006.
[93] S. Sathiya Keerthi, Shirish Krishnaj Shevade, Chiranjib Bhattacharyya, and Karuturi Radha Krishna Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural computation, 13(3):637–649, 2001.
[94] Samir Khuller and Balaji Raghavachari. Basic graph algorithms. In Algorithms and theory of compu-
tation handbook: general concepts and techniques, pages 7–7. 2010.
[95] Jinha Kim, Hyungyu Shin, Wook-Shin Han, Sungpack Hong, and Hassan Chafi. Taming subgraph isomorphism for RDF query processing. arXiv preprint arXiv:1506.01973, 2015.
[96] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[97] Serkan Kiranyaz, Onur Avci, Osama Abdeljaber, Turker Ince, Moncef Gabbouj, and Daniel J Inman. 1D convolutional neural networks and applications: A survey. arXiv preprint arXiv:1905.03554, 2019.
[98] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
[99] Markus Krötzsch. Efficient inferencing for the description logic underlying OWL EL. Institut AIFB, KIT, Karlsruhe, 2010.
[100] Markus Krötzsch, Frantisek Simancik, and Ian Horrocks. A description logic primer. arXiv preprint
arXiv:1201.4089, 2012.
[101] Viktor Kunčak and Jad Hamza. Stainless verification system tutorial. In Formal Methods in Computer-Aided Design (FMCAD), pages 2–7, 2021.
[102] M. Kutner, C. Nachtsheim, and J. Neter. Introduction to nonlinear regression and neural networks. Technical report, University of Minnesota, 2016.
[103] Giovanni Lavezzi, Kidus Guye, and Marco Ciarcià. Nonlinear programming solvers for unconstrained
and constrained optimization problems: a benchmark analysis. arXiv preprint arXiv:2204.05297, 2022.
[104] Averill M Law. How to build valid and credible simulation models. In 2008 Winter Simulation
Conference, pages 39–47. IEEE, 2008.
[105] Averill M Law. Simulation Modeling and Analysis, Fifth Edition. McGraw-Hill Education, 2015.
[106] Robert J Leach, Stephanie E Forrester, AC Mears, and Jonathan R Roberts. How valid and accurate are
measurements of golf impact parameters obtained using commercially available radar and stereoscopic
optical launch monitors? Measurement, 112:125–136, 2017.
[107] Pierre L'Ecuyer. Good parameters and implementations for combined multiple recursive random number generators. Operations Research, 47(1):159–164, 1999.
[108] Hagyeong Lee and Jongwoo Song. Introduction to convolutional neural network using Keras; an understanding from a statistician. Communications for Statistical Applications and Methods, 26(6):591–610, 2019.
[109] Jun Li. A robust stochastic method of estimating the transmission potential of 2019-nCoV. arXiv preprint arXiv:2002.03828, 2020.
[110] Michael Y Li and James S Muldowney. Global stability for the SEIR model in epidemiology. Mathematical biosciences, 125(2):155–164, 1995.
[111] Lek-Heng Lim. Tensors and hypermatrices. Handbook of Linear Algebra, pages 231–260, 2013.
[112] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization.
Mathematical programming, 45(1-3):503–528, 1989.
[113] Stuart Lloyd. Least squares quantization in PCM. IEEE transactions on information theory, 28(2):129–137, 1982.
[114] Charles Macal and Michael North. Introductory tutorial: Agent-based modeling and simulation. In
Proceedings of the Winter Simulation Conference 2014, pages 6–20. IEEE, 2014.
[115] Charles M Macal and Michael J North. Agent-based modeling and simulation. In Proceedings of the 2009 Winter Simulation Conference (WSC), pages 86–98. IEEE, 2009.
[116] Charles M Macal and Michael J North. Tutorial on agent-based modeling and simulation. Journal of Simulation, 4(3):151–162, 2010.
[117] Penelope Maddy. How applied mathematics became pure. The Review of Symbolic Logic, 1(1):16–41,
2008.
[118] Franco Manessi and Alessandro Rozza. Learning combinations of activation functions. arXiv preprint
arXiv:1801.09403, 2018.
[119] Maja Marasović, Tea Marasović, and Mladen Miloš. Robust nonlinear regression in enzyme kinetic
parameters estimation. Journal of Chemistry, 2017, 2017.
[120] József Marton, Gábor Szárnyas, and Dániel Varró. Formalising openCypher graph queries in relational algebra. In European Conference on Advances in Databases and Information Systems, pages 182–196. Springer, 2017.
[121] P McCullagh and JA Nelder. Generalized Linear Models, 2nd Edition. Chapman and Hall, London, 1989.
[122] John A Miller, Mohammed Aldosari, Farah Saeed, Nasid Habib Barna, Subas Rana, I Budak Arpinar,
and Ninghao Liu. A survey of deep learning and foundation models for time series forecasting. arXiv
preprint arXiv:2401.13912, 2024.
[123] John A Miller, Hao Peng, and Michael E Cotterell. Adding support for theory in open science big
data. In Big Data (BigData Congress), 2017 IEEE International Congress on, pages 251–255. IEEE,
2017.
[124] Piotr Mirowski. Time series modeling with hidden variables and gradient-based algorithms. PhD thesis, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, 2011.
[125] R Mohammadi Farsani and Ehsan Pazouki. A transformer self-attention model for time series fore-
casting. Journal of Electrical and Computer Engineering Innovations (JECEI), 9(1):1–10, 2020.
[126] Sutapa Mondal, Vijaya Raghava Mutharaju, and Sumit Bhatia. Embeddings for the EL++ description
logic. PhD thesis, IIIT-Delhi, 2020.
[127] Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining. Introduction to Linear Regression
Analysis, volume 821. John Wiley & Sons, 2012.
[128] Steve Ataky Tsham Mpinda, Lucas Cesar Ferreira, Marcela Xavier Ribeiro, and Marilde Terez-
inha Prado Santos. Evaluation of graph databases performance through indexing techniques. In-
ternational Journal of Artificial Intelligence & Applications (IJAIA), 6(5):87–98, 2015.
[129] Nihal V Nayak. Graph neural networks - notes. Technical report, Brown University, 2020.
[130] Mark Needham and Amy E Hodler. Graph algorithms: practical examples in Apache Spark and Neo4j.
O’Reilly Media, 2019.
[131] John A Nelder and Roger Mead. A simplex method for function minimization. The computer journal,
7(4):308–313, 1965.
[132] Isaac Newton. Philosophiae naturalis principia mathematica, volume 2. typis A. et JM Duncan, 1833.
[133] Anh Nguyen, Khoa Pham, Dat Ngo, Thanh Ngo, and Lam Pham. An analysis of state-of-the-art
activation functions for supervised deep neural network. In 2021 International Conference on System
Science and Engineering (ICSSE), pages 215–220. IEEE, 2021.
[134] Vinh Nguyen, Jyoti Leeka, Olivier Bodenreider, and Amit Sheth. A formal graph model for RDF and its implementation. arXiv preprint arXiv:1606.00480, 2016.
[135] Michael A Nielsen. Neural networks and deep learning, volume 25. Determination Press, San Francisco, CA, USA, 2015.
[136] Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
[137] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. Activation functions:
Comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378, 2018.
[138] Jeremy Orloff and Jonathan Bloom. 18.05 introduction to probability and statistics. Massachusetts
Institute of Technology: MIT OpenCourseWare, 2014.
[139] Jeremy Orloff and Jonathan Bloom. Maximum Likelihood Estimates. Technical report, MIT, 2014.
[140] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge
and data engineering, 22(10):1345–1359, 2009.
[141] Joanna Maria Papakonstantinou. Historical development of the BFGS secant method and its charac-
terization properties. Rice University, 2009.
[142] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[143] Julio L Peixoto. A property of well-formulated polynomial regression models. The American Statisti-
cian, 44(1):26–30, 1990.
[144] Jolynn Pek, Octavia Wong, and AC Wong. Data transformations for inference with linear regression:
Clarifications and recommendations. Practical Assessment, Research, and Evaluation, 22(1):9, 2017.
[145] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations.
In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 701–710, 2014.
[146] John Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[147] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
[148] John R Quinlan et al. Learning with continuous classes. In 5th Australian joint conference on artificial
intelligence, volume 92, pages 343–348. World Scientific, 1992.
[149] Md Saidur Rahman et al. Basic graph theory, volume 9. Springer, 2017.
[150] Supriya Ramireddy. Query processing in graph databases. PhD thesis, University of Georgia, 2017.
[151] Suhasini Subba Rao. A course in Time Series Analysis. Technical report, Texas A&M University, 2018.
[152] Thanyalak Rattanasawad, Kanda Runapongsa Saikaew, Marut Buranarach, and Thepchai Supnithi.
A review and comparison of rule languages and rule-based inference engines for the semantic web. In
2013 International Computer Science and Engineering Conference (ICSEC), pages 1–6. IEEE, 2013.
[153] Santanu Saha Ray. Graph theory with algorithms and its applications: in applied science and technology.
Springer, 2013.
[154] Matthew B Rhudy, Roger A Salguero, and Keaton Holappa. A Kalman filtering tutorial for undergraduate students. International Journal of Computer Science & Engineering Survey, 8(1):1–9, 2017.
[155] Stephen D Roberts and Dennis Pegden. The history of simulation modeling. In 2017 Winter Simulation
Conference (WSC), pages 308–323. IEEE, 2017.
[156] Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases. O'Reilly Media, Inc., 2013.
[157] Raul Rojas. The backpropagation algorithm. In Neural networks, pages 149–182. Springer, 1996.
[158] Lior Rokach. Decision Trees. Technical report, Tel-Aviv University, 2015.
[159] Michael J Rosenfeld. OLS in matrix form. NYU Lecture Notes, 2013.
[160] Sheldon M Ross. Introduction to probability models, 3rd Edition. Academic Press, San Diego, 1985.
[161] Sheldon M Ross. Introduction to probability models, 11th Edition. Academic press, 2014.
[162] Dan Roth. Decision Trees. Technical report, University of Illinois, 2016.
[163] Prasanna Sahoo. Probability and Mathematical Statistics. University of Louisville, Louisville, KY
40292, USA, 2013.
[164] Remi M Sakia. The Box–Cox transformation technique: A review. Journal of the Royal Statistical Society: Series D (The Statistician), 41(2):169–178, 1992.
[165] Anders W Sandvik. Numerical solutions of classical equations of motion. Lecture notes, PY 502 Computational Physics, Boston University, 2015.
[166] Robert G Sargent. An introductory tutorial on verification and validation of simulation models. In 2015 Winter Simulation Conference (WSC), pages 1729–1740. IEEE, 2015.
[167] Shalabh. Regression Analysis, Chapter 12: Polynomial regression models. University lectures, Indian Institute of Technology Kanpur, 2010.
[168] Cosma Shalizi. Advanced data analysis from an elementary point of view, 2013.
[169] Cosma Shalizi. Modern Regression. Technical report, Carnegie Mellon University, 2015.
[170] David F Shanno. Conditioning of quasi-newton methods for function minimization. Mathematics of
computation, 24(111):647–656, 1970.
[171] Karl Sigman. Acceptance-rejection method. Technical report, Columbia University, 2007.
[172] Karl Sigman. IEOR 6711: Continuous-time Markov chains. Technical report, Columbia University, 2009.
[173] Karl Sigman. Limiting distribution for a Markov chain recurrence and transience. Technical report,
Columbia University, 2009.
[174] Karl Sigman. Notes on Little's law. Technical report, Columbia University, 2009.
[175] Gregory A Silver, John A Miller, Maria Hybinette, Gregory Baramidze, and William S York. An
ontology for discrete-event modeling and simulation. Simulation, 87(9):747–773, 2011.
[176] Christopher A Sims. Macroeconomics and reality. Econometrica: journal of the Econometric Society,
pages 1–48, 1980.
[177] Saša Singer and Sanja Singer. Efficient implementation of the Nelder–Mead search algorithm. Applied Numerical Analysis & Computational Mathematics, 1(2):524–534, 2004.
[178] Anders Skajaa. Limited memory bfgs for nonsmooth optimization. Master’s thesis, Courant Institute
of Mathematical Science, New York University, 2010.
[179] A Solonen, J Hakkarainen, A Ilin, M Abbas, and A Bibov. Estimating model error covariance matrix parameters in extended Kalman filtering. Nonlinear Processes in Geophysics, 21(5):919–927, 2014.
[180] Saul Stahl. The evolution of the normal distribution. Mathematics magazine, 79(2):96–113, 2006.
[181] Mark Stamp. A Revealing Introduction to Hidden Markov Models. Technical report, San Jose State
University, 2018.
[182] Natalie M Steiger, Emily K Lada, James R Wilson, Jeffrey A Joines, Christos Alexopoulos, and David
Goldsman. Asap3: A batch means procedure for steady-state simulation analysis. ACM Transactions
on Modeling and Computer Simulation (TOMACS), 15(1):39–73, 2005.
[183] Suhartono Suhartono. Time series forecasting by using seasonal autoregressive integrated moving
average: Subset, multiplicative or additive model. J. Math. Stat, 7:20–27, 2011.
[184] Shixuan Sun, Xibo Sun, Yulin Che, Qiong Luo, and Bingsheng He. RapidMatch: a holistic approach to subgraph query processing. Proceedings of the VLDB Endowment, 14(2):176–188, 2020.
[185] Gábor Szárnyas, József Marton, János Maginecz, and Dániel Varró. Reducing property graph queries
to relational algebra for incremental view maintenance. arXiv preprint arXiv:1806.07344, 2018.
[186] Souhaib Ben Taieb. Machine learning strategies for multi-step-ahead time series forecasting. Université Libre de Bruxelles, Belgium, pages 75–86, 2014.
[187] Souhaib Ben Taieb and Rob J Hyndman. Recursive and direct multi-step forecasting: the best of both worlds. Working Paper 19/12, Monash University, Department of Econometrics and Business Statistics, 2012.
[188] Shintaro Takenaga, Yoshihiko Ozaki, and Masaki Onishi. Practical initialization of the Nelder–Mead method for computationally expensive optimization problems. Optimization Letters, 17(2):283–297, 2023.
[189] Srikanth Tammina. Transfer learning using VGG-16 with deep convolutional neural network for clas-
sifying images. International Journal of Scientific and Research Publications, 9(10):143–150, 2019.
[190] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey
on deep transfer learning. In International conference on artificial neural networks, pages 270–279.
Springer, 2018.
[191] Thaddeus Tarpey. Generalized Linear Models (GLM). Technical report, Wright State University, 2012.
[192] Harsh Thakkar, Sören Auer, and Maria-Esther Vidal. Formalizing Gremlin pattern matching traversals in an integrated graph algebra. In BlockSW/CKG@ISWC, 2019.
[193] Harsh Thakkar, Dharmen Punjani, Soeren Auer, and Maria-Esther Vidal. Towards an integrated graph
algebra for graph pattern matching with Gremlin (extended version). arXiv preprint arXiv:1908.06265,
2019.
[194] You Tingyan. Multistep Yule-Walker estimation of autoregressive models. Technical report, National University of Singapore, 2010.
[195] Luís Fernando Raínho Alves Torgo. Inductive learning of tree-based regression models. 1999.
[196] Wessel N van Wieringen. Lecture notes on ridge regression. arXiv preprint arXiv:1509.09169, 2015.
[197] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing
systems, 30, 2017.
[198] Loup Verlet. Computer "experiments" on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Physical review, 159(1):98, 1967.
[199] Eric A Wan, Rudolph Van Der Merwe, and Simon Haykin. The unscented Kalman filter. Kalman filtering and neural networks, 5(2007):221–280, 2001.
[200] Fei Wang, James Decker, Xilun Wu, Gregory Essertel, and Tiark Rompf. Backpropagation with
callbacks: Foundations for efficient and expressive differentiable programming. Advances in Neural
Information Processing Systems, 31, 2018.
[201] William WS Wei. Multivariate Time Series Analysis and Applications. John Wiley & Sons, 2018.
[202] Greg Welch and Gary Bishop. An introduction to the Kalman filter: SIGGRAPH 2001 course 8. In Computer Graphics, Annual Conference on Computer Graphics & Interactive Techniques, pages 12–17, 2001.
[203] Samuel S Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses.
The annals of mathematical statistics, 9(1):60–62, 1938.
[204] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent
neural networks. Neural computation, 1(2):270–280, 1989.
[205] Ian Witten, Eibe Frank, Mark Hall, and Christopher Pal. Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition. Elsevier Inc., 2017.
[206] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[207] Feng Xia, Ke Sun, Shuo Yu, Abdul Aziz, Liangtian Wan, Shirui Pan, and Huan Liu. Graph learning:
A survey. IEEE Transactions on Artificial Intelligence, 2(2):109–127, 2021.
[208] Shufang Xie, Tao Zhang, and Oliver Rose. Agent-based simulation with process-interaction worldview.
Simul. Notes Eur., 29(4):169–177, 2019.
[209] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks?
arXiv preprint arXiv:1810.00826, 2018.
[210] Mengjia Xu. Understanding graph embedding methods and their applications. SIAM Review,
63(4):825–853, 2021.
[211] Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. A review of recurrent neural networks:
LSTM cells and network architectures. Neural computation, 31(7):1235–1270, 2019.
[212] Harry Zhang. The optimality of naive Bayes. AA, 1(2):3, 2004.
[213] Jianshu Zhang, Jun Du, and Lirong Dai. A GRU-based encoder-decoder approach with attention for
online handwritten mathematical expression recognition. In 2017 14th IAPR international conference
on document analysis and recognition (ICDAR), volume 1, pages 902–907. IEEE, 2017.
[214] Zhifei Zhang. Derivation of backpropagation in convolutional neural network (CNN). University of Tennessee, Knoxville, TN, 2016.
[215] Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, and Zhi-Hua Zhou. Minimal gated unit for recurrent
neural networks. International Journal of Automation and Computing, 13(3):226–234, 2016.
[216] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li,
and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint
arXiv:1812.08434, 2018.
[217] Eric Zivot and Jiahui Wang. Unit root tests. Modeling Financial Time Series with S-Plus, pages
111–139, 2006.
[218] Eric Zivot and Jiahui Wang. Vector autoregressive models for multivariate time series. Modeling
financial time series with S-PLUS®, pages 385–429, 2006.