
Linear Regression, from ordinary least squares to Ridge and Lasso regression

Morten Hjorth-Jensen [1, 2]
[1] Department of Physics and Center for Computing in Science Education, University of Oslo, Norway
[2] Department of Physics and Astronomy and Facility for Rare Isotope Beams, Michigan State University, USA

October 2, 2023

Plans for week 40, October 2-6


The main topics are:

1. Brief repetition from last week


2. Derivation of the equations for ordinary least squares
3. Discussion on how to prepare data and examples of applications of linear
regression
4. Mathematical interpretations of linear regression
5. Ridge and Lasso regression and Singular Value Decomposition
6. Video of lecture at https://youtu.be/RlCLw-y9qwM
7. Whiteboard notes

Reading recommendations:
1. These notes
2. Goodfellow, Bengio and Courville, Deep Learning, chapter 2 on linear
algebra and sections 3.1-3.10 on elements of statistics (background)
3. Hastie, Tibshirani and Friedman, The elements of statistical learning,
sections 3.1-3.4 (on relevance for the discussion of linear regression).
4. Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong, Mathematics for
Machine Learning, see chapter 6 in particular for exercises on derivatives,
see https://mml-book.github.io/book/mml-book.pdf

© 1999-2023, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license

Why Linear Regression (aka Ordinary Least Squares and
family), repetition from last week
We first need a reminder from last week about linear regression: fitting a
continuous function with a parameterization that is linear in the
parameters β.
• Method of choice for fitting a continuous function!

• Gives an excellent introduction to central Machine Learning features with
understandable pedagogical links to other methods like Neural Networks,
Support Vector Machines etc
• Analytical expression for the fitting parameters β
• Analytical expressions for statistical properties like mean values, variances,
confidence intervals and more
• Analytical relation with probabilistic interpretations
• Easy to introduce basic concepts like bias-variance tradeoff, cross-validation,
resampling and regularization techniques and many other ML topics

• Easy to code! And links well with classification problems and logistic
regression and neural networks
• Allows for easy hands-on understanding of gradient descent methods
• and many more features

For more discussions of Ridge and Lasso regression, Wessel van Wieringen’s article
is highly recommended. Similarly, Mehta et al’s article is also recommended.

The equations for ordinary least squares


Our data, to which we want to apply a machine learning method, consist of
a set of inputs $\boldsymbol{x}^T = [x_0, x_1, x_2, \dots, x_{n-1}]$ and the outputs we want to model,
$\boldsymbol{y}^T = [y_0, y_1, y_2, \dots, y_{n-1}]$. We assume that the output data can be represented
(for a regression case) by a continuous function $f$ through

$$
y_i = f(x_i) + \epsilon_i,
$$

or in general

$$
\boldsymbol{y} = f(\boldsymbol{x}) + \boldsymbol{\epsilon},
$$

where $\boldsymbol{\epsilon}$ represents some noise which is normally assumed to follow
a normal probability distribution with zero mean value and variance $\sigma^2$.
In linear regression we approximate the unknown function with another
continuous function $\tilde{\boldsymbol{y}}(\boldsymbol{x})$ which depends linearly on some unknown parameters
$\boldsymbol{\beta}^T = [\beta_0, \beta_1, \beta_2, \dots, \beta_{p-1}]$.

Last week we introduced the so-called design matrix in order to define the
approximation ỹ via the unknown quantity β as

ỹ = Xβ,
and in order to find the optimal parameters βi we defined a function which
gives a measure of the spread between the values yi (which represent the output
values we want to reproduce) and the parametrized values ỹi , namely the so-called
cost/loss function.

The cost/loss function


We used the mean squared error to define the way we measure the quality of our
model
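$$
C(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=0}^{n-1}\left(y_i - \tilde{y}_i\right)^2 = \frac{1}{n}\left\{(\boldsymbol{y}-\tilde{\boldsymbol{y}})^T(\boldsymbol{y}-\tilde{\boldsymbol{y}})\right\},
$$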
or using the matrix X and in a more compact matrix-vector notation as
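$$
C(\boldsymbol{\beta}) = \frac{1}{n}\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})\right\}.
$$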
This function represents one of many possible ways to define the so-called cost
function.
It is also common to define the function C as
$$
C(\boldsymbol{\beta}) = \frac{1}{2n}\sum_{i=0}^{n-1}\left(y_i - \tilde{y}_i\right)^2,
$$
since when taking the first derivative with respect to the unknown parameters
β, the factor of 2 cancels out.

Interpretations and optimizing our parameters


The function
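$$
C(\boldsymbol{\beta}) = \frac{1}{n}\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})\right\},
$$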
can be linked to the variance of the quantity yi if we interpret the latter as the
mean value. When linking (see the discussions next week) with the maximum
likelihood approach below, we will indeed interpret yi as a mean value

$$
y_i = \langle y_i \rangle = \beta_0 x_{i,0} + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_{p-1} x_{i,p-1} + \epsilon_i,
$$

where ⟨yi ⟩ is the mean value. Keep in mind also that till now we have treated
yi as the exact value. Normally, the response (dependent or outcome) variable
yi is the outcome of a numerical experiment or another type of experiment and
could thus be treated itself as an approximation to the true value. It is then
always accompanied by an error estimate, often limited to a statistical error

estimate given by the standard deviation discussed earlier. In the discussion
here we will treat yi as our exact value for the response variable.
In order to find the parameters βi we will then minimize the spread of C(β),
that is we are going to solve the problem
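$$
\min_{\boldsymbol{\beta}\in\mathbb{R}^{p}} \frac{1}{n}\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})\right\}.
$$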

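In practical terms it means we will require

$$
\frac{\partial C(\boldsymbol{\beta})}{\partial \beta_j} = \frac{\partial}{\partial \beta_j}\left[\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i - \beta_0 x_{i,0} - \beta_1 x_{i,1} - \beta_2 x_{i,2} - \dots - \beta_{p-1} x_{i,p-1}\right)^2\right] = 0,
$$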

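which results in

$$
\frac{\partial C(\boldsymbol{\beta})}{\partial \beta_j} = -\frac{2}{n}\left[\sum_{i=0}^{n-1} x_{ij}\left(y_i - \beta_0 x_{i,0} - \beta_1 x_{i,1} - \beta_2 x_{i,2} - \dots - \beta_{p-1} x_{i,p-1}\right)\right] = 0,
$$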

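or in matrix-vector form (dividing out the factor $-2/n$, see the derivation below)

$$
\frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T} = 0 = \boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right).
$$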

Interpretations and optimizing our parameters


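We can rewrite, see the derivations below,

$$
\frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T} = 0 = \boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right),
$$

as

$$
\boldsymbol{X}^T\boldsymbol{y} = \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta},
$$

and if the matrix $\boldsymbol{X}^T\boldsymbol{X}$ is invertible we have the solution

$$
\boldsymbol{\beta} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}.
$$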
We note also that since our design matrix is defined as $\boldsymbol{X}\in\mathbb{R}^{n\times p}$, the
product $\boldsymbol{X}^T\boldsymbol{X}\in\mathbb{R}^{p\times p}$. In most cases we have that $p \ll n$. In our example
case below we have $p = 5$, meaning that we end up inverting a small $5\times 5$
matrix. This is a rather common situation: in many cases we end up with
low-dimensional matrices to invert. The methods discussed here, and many
other supervised learning algorithms like classification with logistic regression or
support vector machines, exhibit dimensionalities which allow for the usage of
direct linear algebra methods such as LU decomposition or Singular Value
Decomposition (SVD) for finding the inverse of the matrix $\boldsymbol{X}^T\boldsymbol{X}$.
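
As a rough illustration of the point about small matrices and direct methods, the sketch below builds a design matrix with p = 5 polynomial columns from synthetic data (the data, the seed and all variable names are assumptions made for this example only). It solves the normal equations with an LU-based solver and compares with the SVD-based pseudoinverse of X.

```python
import numpy as np

rng = np.random.default_rng(2023)  # seed chosen only for reproducibility
n, p = 100, 5
x = rng.random(n)
y = 2.0 + 5.0 * x**2 + 0.1 * rng.standard_normal(n)  # synthetic data

# Design matrix with columns 1, x, x^2, x^3, x^4 (shape n x p)
X = np.vander(x, N=p, increasing=True)

XtX = X.T @ X  # small p x p matrix
print("shape of X^T X:", XtX.shape)
print("condition number of X^T X:", np.linalg.cond(XtX))

# Solve X^T X beta = X^T y with an LU-based solver (no explicit inverse)
beta_solve = np.linalg.solve(XtX, X.T @ y)

# SVD-based pseudoinverse applied to X itself
beta_pinv = np.linalg.pinv(X) @ y

print("largest difference between the two solutions:",
      np.max(np.abs(beta_solve - beta_pinv)))
```

The printed condition number gives a feeling for how close to singular the matrix is; a large value is a warning sign that an explicit inverse may lose accuracy.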

Small question: Do you think the example we have at hand here (the nuclear
binding energies) can lead to problems in inverting the matrix X T X? What
kind of problems can we expect?

Some useful matrix and vector expressions
The following matrix and vector relations will be useful here and for the rest
of the course. Vectors are always written as boldfaced lower case letters and
matrices as upper case boldfaced letters. In the following we will discuss how to
calculate derivatives of various matrices relevant for machine learning. We will
often represent our data in terms of matrices and vectors.
Let us first introduce some conventions. We assume that $\boldsymbol{y}$ is a vector of length
$m$, that is it has $m$ elements $y_0, y_1, \dots, y_{m-1}$. By convention we start labeling
vectors with the zeroth element, as is done for arrays in Python and C++/C, for example.
Similarly, we have a vector $\boldsymbol{x}$ of length $n$, that is $\boldsymbol{x}^T = [x_0, x_1, \dots, x_{n-1}]$.
We assume also that y is a function of x through some given function f

y = f (x).

The Jacobian
We define the partial derivatives of the various components of y as functions of
xi in terms of the so-called Jacobian matrix
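$$
\boldsymbol{J} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} =
\begin{bmatrix}
\frac{\partial y_0}{\partial x_0} & \frac{\partial y_0}{\partial x_1} & \frac{\partial y_0}{\partial x_2} & \dots & \frac{\partial y_0}{\partial x_{n-1}} \\
\frac{\partial y_1}{\partial x_0} & \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \dots & \frac{\partial y_1}{\partial x_{n-1}} \\
\frac{\partial y_2}{\partial x_0} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \dots & \frac{\partial y_2}{\partial x_{n-1}} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_{m-1}}{\partial x_0} & \frac{\partial y_{m-1}}{\partial x_1} & \frac{\partial y_{m-1}}{\partial x_2} & \dots & \frac{\partial y_{m-1}}{\partial x_{n-1}}
\end{bmatrix},
$$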
which is an m × n matrix. If x is a scalar, then the Jacobian is only a
single-column vector, or an m × 1 matrix. If on the other hand y is a scalar, the
Jacobian becomes a 1 × n matrix.
When this matrix is a square matrix m = n, its determinant is often referred
to as the Jacobian determinant. Both the matrix and (if m = n) the determinant
are often referred to simply as the Jacobian. The Jacobian matrix represents
the differential of y at every point where the vector is differentiable.

Derivatives, example 1
Let now $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x}$, where $\boldsymbol{A}$ is an $m\times n$ matrix and the matrix does not depend
on $\boldsymbol{x}$. If we write out the vector $\boldsymbol{y}$ component by component we have

$$
y_i = \sum_{j=0}^{n-1} a_{ij}x_j,
$$

for all $i = 0, 1, 2, \dots, m-1$. The individual matrix elements of $\boldsymbol{A}$ are given by
the symbol $a_{ij}$. It follows that the partial derivatives of $y_i$ with respect to $x_k$ are

$$
\frac{\partial y_i}{\partial x_k} = a_{ik} \quad \forall i = 0, 1, 2, \dots, m-1.
$$

From this we have, using the definition of the Jacobian
$$
\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \boldsymbol{A}.
$$

Example 2
We define a scalar (our cost/loss functions are in general also scalars, just think
of the mean squared error) as the result of some matrix vector multiplications

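$$
\alpha = \boldsymbol{y}^T\boldsymbol{A}\boldsymbol{x},
$$

with $\boldsymbol{y}$ a vector of length $m$, $\boldsymbol{A}$ an $m\times n$ matrix and $\boldsymbol{x}$ a vector of length $n$. We
assume also that $\boldsymbol{A}$ does not depend on either of the two vectors. In order to find
the derivative of $\alpha$ with respect to the two vectors, we define an intermediate
vector $\boldsymbol{z}$. We define first $\boldsymbol{z}^T = \boldsymbol{y}^T\boldsymbol{A}$, a vector of length $n$. We have then, using
the definition of the Jacobian,

$$
\alpha = \boldsymbol{z}^T\boldsymbol{x},
$$

which means that (using our previous example) we have

$$
\frac{\partial \alpha}{\partial \boldsymbol{x}} = \boldsymbol{z} = \boldsymbol{A}^T\boldsymbol{y}.
$$

Note that the resulting vector elements are the same for $\boldsymbol{z}^T$ and $\boldsymbol{z}$; the only
difference is that one is just the transpose of the other.
Since $\alpha$ is a scalar we have $\alpha = \alpha^T = \boldsymbol{x}^T\boldsymbol{A}^T\boldsymbol{y}$. Defining now $\boldsymbol{z}^T = \boldsymbol{x}^T\boldsymbol{A}^T$ we
find that

$$
\frac{\partial \alpha}{\partial \boldsymbol{y}} = \boldsymbol{z}^T = \boldsymbol{x}^T\boldsymbol{A}^T.
$$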

Example 3
We start with a new scalar but where now the vector y is replaced by a vector
x and the matrix A is a square matrix with dimension n × n.

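$$
\alpha = \boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x},
$$

with $\boldsymbol{x}$ a vector of length $n$.
We write out the specific sums involved in the calculation of $\alpha$

$$
\alpha = \sum_{i=0}^{n-1}\sum_{j=0}^{n-1} x_i a_{ij} x_j,
$$

taking the derivative of $\alpha$ with respect to a given component $x_k$ we get the two
sums

$$
\frac{\partial \alpha}{\partial x_k} = \sum_{i=0}^{n-1} a_{ik}x_i + \sum_{j=0}^{n-1} a_{kj}x_j,
$$

for all $k = 0, 1, 2, \dots, n-1$. We identify these sums as

$$
\frac{\partial \alpha}{\partial \boldsymbol{x}} = \boldsymbol{x}^T\left(\boldsymbol{A}^T + \boldsymbol{A}\right).
$$

If the matrix $\boldsymbol{A}$ is symmetric, that is $\boldsymbol{A} = \boldsymbol{A}^T$, we have

$$
\frac{\partial \alpha}{\partial \boldsymbol{x}} = 2\boldsymbol{x}^T\boldsymbol{A}.
$$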
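
A quick numerical sanity check of the result for the quadratic form can be useful. The sketch below (random matrix, seed and step size are arbitrary choices made for illustration) compares the analytical expression x^T (A^T + A) with a central finite-difference approximation of the derivative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 4
A = rng.standard_normal((n, n))  # generic, non-symmetric matrix
x = rng.standard_normal(n)

def alpha(v):
    """Scalar quadratic form v^T A v."""
    return v @ A @ v

# Analytical derivative: x^T (A^T + A)
analytic = x @ (A.T + A)

# Central finite differences, one component x_k at a time
h = 1e-6
numeric = np.zeros(n)
for k in range(n):
    e = np.zeros(n)
    e[k] = h
    numeric[k] = (alpha(x + e) - alpha(x - e)) / (2 * h)

print("largest deviation:", np.max(np.abs(analytic - numeric)))
```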

Example 4
We let the scalar α be defined by

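$$
\alpha = \boldsymbol{y}^T\boldsymbol{x},
$$

where both $\boldsymbol{y}$ and $\boldsymbol{x}$ have the same length $n$, or if we wish to think of them
as column vectors, they have dimensions $n\times 1$. We assume that both $\boldsymbol{y}$ and $\boldsymbol{x}$
depend on a vector $\boldsymbol{z}$ of the same length. To calculate the derivative of $\alpha$ with
respect to a given component $z_k$ we need first to write out the inner product
that defines $\alpha$ as

$$
\alpha = \sum_{i=0}^{n-1} y_i x_i,
$$

and the partial derivative

$$
\frac{\partial \alpha}{\partial z_k} = \sum_{i=0}^{n-1}\left(x_i\frac{\partial y_i}{\partial z_k} + y_i\frac{\partial x_i}{\partial z_k}\right),
$$

for all $k = 0, 1, 2, \dots, n-1$. We can rewrite the partial derivative in a more
compact form as

$$
\frac{\partial \alpha}{\partial \boldsymbol{z}} = \boldsymbol{x}^T\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{z}} + \boldsymbol{y}^T\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{z}},
$$

and if $\boldsymbol{y} = \boldsymbol{x}$ we have

$$
\frac{\partial \alpha}{\partial \boldsymbol{z}} = 2\boldsymbol{x}^T\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{z}}.
$$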

The mean squared error and its derivative


We defined earlier a possible cost function using the mean squared error
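$$
C(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=0}^{n-1}\left(y_i - \tilde{y}_i\right)^2 = \frac{1}{n}\left\{(\boldsymbol{y}-\tilde{\boldsymbol{y}})^T(\boldsymbol{y}-\tilde{\boldsymbol{y}})\right\},
$$

or using the design/feature matrix $\boldsymbol{X}$ we have the more compact matrix-vector
notation

$$
C(\boldsymbol{\beta}) = \frac{1}{n}\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})\right\}.
$$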

We note that the design matrix X does not depend on the unknown param-
eters defined by the vector β. We are now interested in minimizing the cost
function with respect to the unknown parameters β.
The mean squared error is a scalar and if we use the results from example
four above (the case where the two vectors are equal), we can define a new vector

w = y − Xβ,

which depends on β. We rewrite the cost function as


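$$
C(\boldsymbol{\beta}) = \frac{1}{n}\boldsymbol{w}^T\boldsymbol{w},
$$

with partial derivative

$$
\frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{2}{n}\boldsymbol{w}^T\frac{\partial \boldsymbol{w}}{\partial \boldsymbol{\beta}},
$$

and using that

$$
\frac{\partial \boldsymbol{w}}{\partial \boldsymbol{\beta}} = -\boldsymbol{X},
$$

where we used the result from example one above. Inserting the last expression
we obtain

$$
\frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\frac{2}{n}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)^T\boldsymbol{X},
$$

or as

$$
\frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T} = -\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right).
$$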
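
The gradient of the mean squared error can be checked numerically in the same spirit. The following is a minimal sketch with random data (all names, sizes and the seed are assumptions for illustration); it compares the expression -(2/n) X^T (y - X beta) with central finite differences of the cost function.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

def cost(b):
    """Mean squared error C(b) = (1/n) (y - X b)^T (y - X b)."""
    w = y - X @ b
    return (w @ w) / n

# Analytical gradient, written as a column vector
grad_analytic = -(2.0 / n) * X.T @ (y - X @ beta)

# Central finite differences along each unit vector of the parameter space
h = 1e-6
grad_numeric = np.array([
    (cost(beta + h * e) - cost(beta - h * e)) / (2 * h)
    for e in np.eye(p)
])

print("largest deviation:", np.max(np.abs(grad_analytic - grad_numeric)))
```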

Other useful relations


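We list here some other useful relations we may encounter (recall that vectors
are defined by boldfaced lower-case letters)

$$
\frac{\partial\, \mathrm{tr}(\boldsymbol{B}\boldsymbol{A})}{\partial \boldsymbol{A}} = \boldsymbol{B}^T,
$$

$$
\frac{\partial \log{\vert\boldsymbol{A}\vert}}{\partial \boldsymbol{A}} = \left(\boldsymbol{A}^{-1}\right)^T.
$$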

Meet the Hessian Matrix


A very important matrix we will meet again and again in machine learning is the
Hessian. It is given by the second derivative of the cost function with respect
to the parameters β. Using the above expression for derivatives of vectors and
matrices, we find that the second derivative of the mean squared error as cost
function is,
 
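$$
\frac{\partial}{\partial \boldsymbol{\beta}}\frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T} = \frac{\partial}{\partial \boldsymbol{\beta}}\left[-\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)\right] = \frac{2}{n}\boldsymbol{X}^T\boldsymbol{X}.
$$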

The Hessian matrix plays an important role and is defined here as

H = X T X.
For ordinary least squares, it is inversely proportional (derivation next week)
to the variance of the optimal parameters β̂. Furthermore, we will see later
this week that it is (aside from the factor 1/n) equal to the covariance matrix. It
also plays a very important role in optimization algorithms and Principal Component
Analysis as a way to reduce the dimensionality of a machine learning/data
analysis problem.
Linear algebra question: Can we use the Hessian matrix to say something
about properties of the cost function (our optimization problem)? (Hint: think
about convex or concave problems and how to relate these to a matrix!)
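
One way to explore the question numerically is to look at the eigenvalues of the Hessian. The sketch below uses a synthetic polynomial design matrix (an assumption for illustration, not a data set from these notes). Since v^T X^T X v equals the squared norm of Xv, the eigenvalues cannot be negative, which is what the printout is meant to show.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n, p = 100, 5
x = rng.random(n)
X = np.vander(x, N=p, increasing=True)  # columns 1, x, ..., x^4

# Second derivative of the mean squared error with respect to beta
H = (2.0 / n) * X.T @ X

# H is symmetric, so the symmetric eigenvalue solver applies
eigvals = np.linalg.eigvalsh(H)
print("eigenvalues of the Hessian:", eigvals)
print("all eigenvalues non-negative:", bool(np.all(eigvals >= -1e-12)))
```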

Interpretations and optimizing our parameters


The residuals ϵ are in turn given by

ϵ = y − ỹ = y − Xβ,

and with
X T (y − Xβ) = 0,
we have
X T ϵ = X T (y − Xβ) = 0,
meaning that the solution for β is the one which minimizes the residuals.
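
The orthogonality condition between the columns of X and the residuals is easy to verify on a small example. The sketch below assumes synthetic data and a fourth-order polynomial design matrix; the names and the seed are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 100
x = rng.random(n)
y = 2.0 + 5.0 * x**2 + 0.1 * rng.standard_normal(n)  # synthetic data
X = np.vander(x, N=5, increasing=True)  # columns 1, x, ..., x^4

# Least squares solution
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals of the fit and their projection on the columns of X
residuals = y - X @ beta
print("X^T residuals:", X.T @ residuals)
```

All entries should vanish up to round-off. Because the first column of X is the intercept column of ones, the first entry also says that the residuals sum to zero.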

Example relevant for the exercises


In order to understand the relation among the predictors p, the set of data n
and the target (outcome, output etc) y, we consider a simple polynomial fit.
We assume our data can be represented by a fourth-order polynomial. For the ith
component we have

$$
\tilde{y}_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 x_i^4.
$$

We have five predictors/features. The first is the intercept β0. The other terms
are βi with i = 1, 2, 3, 4. Furthermore we have n entries for each predictor. It
means that our design matrix is an n × p matrix X.

Own code for Ordinary Least Squares


It is rather straightforward to implement the matrix inversion and obtain the
parameters β. After having defined the matrix X and the outputs y we have
# matrix inversion to find beta
# First we set up the data
import numpy as np
x = np.random.rand(100)

y = 2.0+5*x*x+0.1*np.random.randn(100)
# and then the design matrix X including the intercept
# The design matrix now as function of a fourth-order polynomial
X = np.zeros((len(x),5))
X[:,0] = 1.0
X[:,1] = x
X[:,2] = x**2
X[:,3] = x**3
X[:,4] = x**4
beta = (np.linalg.inv(X.T @ X) @ X.T ) @ y
# and then make the prediction
ytilde = X @ beta

Alternatively, you can use the least squares functionality in Numpy as


fit = np.linalg.lstsq(X, y, rcond =None)[0]
ytildenp = np.dot(fit,X.T)

Adding error analysis and training set up


We can easily test our fit by computing the R2 score that we discussed in
connection with the functionality of Scikit-Learn in the introductory slides.
Since we are not using Scikit-Learn here we can define our own R2 function as
def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)

and we would be using it as


print(R2(y,ytilde))

We can easily add our MSE score as


def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n

print(MSE(y,ytilde))

and finally the relative error as


def RelativeError(y_data,y_model):
    return abs((y_data-y_model)/y_data)

print(RelativeError(y, ytilde))

Splitting our Data in Training and Test data
It is normal in essentially all Machine Learning studies to split the data
in a training set and a test set (sometimes also an additional validation set).
Scikit-Learn has its own function for this. There is no explicit recipe for how
much data should be included as training data and, say, test data. An accepted
rule of thumb is to use approximately 2/3 to 4/5 of the data as training data.
We will postpone a discussion of this splitting to the end of these notes and our
discussion of the so-called bias-variance tradeoff. Here we limit ourselves to
repeating the above polynomial fitting example, now splitting the data
into a training set and a test set.

The complete code with a simple data set


import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)

def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n

x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)

# The design matrix now as function of a fourth-order polynomial


X = np.zeros((len(x),5))
X[:,0] = 1.0
X[:,1] = x
X[:,2] = x**2
X[:,3] = x**3
X[:,4] = x**4
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# matrix inversion to find beta
beta = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train
print(beta)
# and then make the prediction
ytilde = X_train @ beta
print("Training R2")
print(R2(y_train,ytilde))
print("Training MSE")
print(MSE(y_train,ytilde))
ypredict = X_test @ beta
print("Test R2")
print(R2(y_test,ypredict))
print("Test MSE")
print(MSE(y_test,ypredict))

Making your own test-train splitting
# equivalently in numpy
def train_test_split_numpy(inputs, labels, train_size, test_size):
    # test_size is kept in the signature but unused; the split point is set by train_size
    n_inputs = len(inputs)
    # Use one shared permutation so that every input stays paired with its label
    permutation = np.random.permutation(n_inputs)
    inputs_shuffled = inputs[permutation]
    labels_shuffled = labels[permutation]
    train_end = int(n_inputs*train_size)
    X_train, X_test = inputs_shuffled[:train_end], inputs_shuffled[train_end:]
    Y_train, Y_test = labels_shuffled[:train_end], labels_shuffled[train_end:]

    return X_train, X_test, Y_train, Y_test

But since scikit-learn has its own function for doing this and since it
interfaces easily with tensorflow and other libraries, we normally recommend
using the latter functionality.

Reducing the number of degrees of freedom, overarching view
Many Machine Learning problems involve thousands or even millions of features
for each training instance. Not only does this make training extremely slow, it
can also make it much harder to find a good solution, as we will see. This problem
is often referred to as the curse of dimensionality. Fortunately, in real-world
problems, it is often possible to reduce the number of features considerably,
turning an intractable problem into a tractable one.
Later we will discuss some of the most popular dimensionality reduction
techniques: the principal component analysis (PCA), Kernel PCA, and Locally
Linear Embedding (LLE).
Principal component analysis and its various variants deal with the problem
of fitting a low-dimensional affine subspace to a set of data points in a
high-dimensional space. With its family of methods it is one of the most used tools
in data modeling, compression and visualization.

Preprocessing our data


Before we proceed however, we will discuss how to preprocess our data. Till
now and in connection with our previous examples we have not met so many
cases where we are too sensitive to the scaling of our data. Normally the data
may need a rescaling and/or may be sensitive to extreme values. Scaling the data
renders our inputs much more suitable for the algorithms we want to employ.
For data sets gathered for real world applications, it is rather normal that
different features have very different units and numerical scales. For example, a
data set detailing health habits may include features such as age in the range
0-80, and caloric intake of order 2000. Many machine learning methods are
sensitive to the scales of the features and may perform poorly if the features
have very different scales. Therefore, it is typical to scale the features in a
way that avoids such outlier values.

Functionality in Scikit-Learn
Scikit-Learn has several functions which allow us to rescale the data, normally
resulting in much better results in terms of various accuracy scores. The
StandardScaler function in Scikit-Learn ensures that for each feature/predictor
we study (every column in the design/feature matrix) the mean value is zero and
the variance is one. This scaling has the drawback that it does not ensure
that we have a particular maximum or minimum in our data set. Another
function included in Scikit-Learn is the MinMaxScaler, which ensures that
all features lie between 0 and 1.

More preprocessing
The Normalizer scales each data point such that the feature vector has a
Euclidean length of one. In other words, it projects a data point onto the unit
circle (or unit sphere in the case of higher dimensions). This means
every data point is scaled by a different number (by the inverse of its length).
This normalization is often used when only the direction (or angle) of the data
matters, not the length of the feature vector.
The RobustScaler works similarly to the StandardScaler in that it ensures
statistical properties for each feature that guarantee that they are on the same
scale. However, the RobustScaler uses the median and quartiles, instead of mean
and variance. This makes the RobustScaler ignore data points that are very
different from the rest (like measurement errors). These odd data points are also
called outliers, and might often lead to trouble for other scaling techniques.
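
The properties described for the Normalizer and the RobustScaler can be inspected directly. The sketch below uses a small random matrix with one artificial outlier (the data, the seed and the variable names are assumptions for illustration) and the default settings of the Scikit-Learn classes.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, RobustScaler, StandardScaler

rng = np.random.default_rng(4)  # arbitrary seed
X = rng.standard_normal((20, 3))
X[0, 0] = 100.0  # artificial outlier in the first feature

# Normalizer: every row (data point) is rescaled to unit Euclidean length
X_norm = Normalizer().fit_transform(X)
row_lengths = np.linalg.norm(X_norm, axis=1)
print("row lengths after Normalizer:", row_lengths)

# RobustScaler: centers each column on its median and scales by quartiles
X_robust = RobustScaler().fit_transform(X)
print("column medians after RobustScaler:", np.median(X_robust, axis=0))

# StandardScaler for comparison: each column gets zero mean
X_standard = StandardScaler().fit_transform(X)
print("column means after StandardScaler:", X_standard.mean(axis=0))
print("first column, largest value (standard vs robust):",
      X_standard[:, 0].max(), X_robust[:, 0].max())
```

Comparing the first column of the two scaled matrices shows how differently the single outlier is treated by the two scalers.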

Frequently used scaling functions


Many features are often scaled using standardization to improve performance.
In Scikit-Learn this is given by the StandardScaler function as discussed
above. It is easy however to write your own. Mathematically, this involves
subtracting the mean and dividing by the standard deviation over the data set,
for each feature:

$$
x_j^{(i)} \rightarrow \frac{x_j^{(i)} - \overline{x}_j}{\sigma(x_j)},
$$

where $\overline{x}_j$ and $\sigma(x_j)$ are the mean and standard deviation, respectively, of the
feature $x_j$. This ensures that each feature has zero mean and unit standard
deviation. For data sets where we do not have the standard deviation or don't
wish to calculate it, it is then common to simply set it to one.

Example of own Standard scaling
Let us consider the following vanilla example where we use both Scikit-Learn
and write our own function as well. We produce a simple test design matrix with
random numbers. Each column could then represent a specific feature whose
mean value is subtracted.
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
import numpy as np
import pandas as pd
from IPython.display import display
np.random.seed(100)
# setting up a 10 x 5 matrix
rows = 10
cols = 5
X = np.random.randn(rows,cols)
XPandas = pd.DataFrame(X)
display(XPandas)
print(XPandas.mean())
print(XPandas.std())
XPandas = (XPandas -XPandas.mean())
display(XPandas)
# This option does not include the standard deviation
scaler = StandardScaler(with_std=False)
scaler.fit(X)
Xscaled = scaler.transform(X)
display(XPandas-Xscaled)

Small exercise: perform the standard scaling by including the standard
deviation and compare with what Scikit-Learn gives.

Min-Max Scaling
Another commonly used scaling method is min-max scaling. This is very useful
when we want the features to lie in a certain interval. To scale the feature
$x_j$ to the interval $[a, b]$, we can apply the transformation

$$
x_j^{(i)} \rightarrow (b-a)\frac{x_j^{(i)} - \min(x_j)}{\max(x_j) - \min(x_j)} + a,
$$

where $\min(x_j)$ and $\max(x_j)$ return the minimum and maximum value of $x_j$ over
the data set, respectively.
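
A minimal NumPy sketch of min-max scaling is given below. It uses the form (b - a)(x - min)/(max - min) + a, which sends the smallest value of each feature to a and the largest to b. The function name min_max_scale and the test data are hypothetical, and the sketch assumes that no column is constant.

```python
import numpy as np

def min_max_scale(X, a=0.0, b=1.0):
    """Scale each column of X linearly to the interval [a, b]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Assumes no constant column, otherwise the denominator is zero
    return (b - a) * (X - col_min) / (col_max - col_min) + a

rng = np.random.default_rng(5)  # arbitrary seed
X = rng.standard_normal((10, 3))

X_scaled = min_max_scale(X, a=-1.0, b=1.0)
print("column minima:", X_scaled.min(axis=0))
print("column maxima:", X_scaled.max(axis=0))
```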

Testing the Mean Squared Error as function of Complexity


One of the aims is to reproduce Figure 2.11 of Hastie et al.
Our data is defined by x ∈ [−3, 3] with a total of for example 100 data points.
np.random.seed()
n = 100

maxdegree = 14
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)

where y is the function we want to fit with a given polynomial.


Write a first code which sets up a design matrix X defined by a fourth-order
polynomial. Scale your data and split it in training and test data.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

np.random.seed(2018)
n = 50
maxdegree = 5
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
TestError = np.zeros(maxdegree)
TrainError = np.zeros(maxdegree)
polydegree = np.zeros(maxdegree)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Fit the scaler on the training data only and apply it to both sets
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

for degree in range(maxdegree):
    # PolynomialFeatures adds the intercept column, hence fit_intercept=False
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression(fit_intercept=False))
    clf = model.fit(x_train_scaled,y_train)
    y_fit = clf.predict(x_train_scaled)
    y_pred = clf.predict(x_test_scaled)
    polydegree[degree] = degree
    TestError[degree] = np.mean( np.mean((y_test - y_pred)**2) )
    TrainError[degree] = np.mean( np.mean((y_train - y_fit)**2) )

plt.plot(polydegree, TestError, label='Test Error')
plt.plot(polydegree, TrainError, label='Train Error')
plt.legend()
plt.show()

More preprocessing examples, two-dimensional example, the Franke function
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

# Where to save the figures and data files


PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"

if not os.path.exists(PROJECT_ROOT_DIR):
    os.mkdir(PROJECT_ROOT_DIR)

if not os.path.exists(FIGURE_ID):
    os.makedirs(FIGURE_ID)

if not os.path.exists(DATA_ID):
    os.makedirs(DATA_ID)

def image_path(fig_id):
    return os.path.join(FIGURE_ID, fig_id)

def data_path(dat_id):
    return os.path.join(DATA_ID, dat_id)

def save_fig(fig_id):
    plt.savefig(image_path(fig_id) + ".png", format='png')

def FrankeFunction(x,y):
    term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
    term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
    term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
    term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
    return term1 + term2 + term3 + term4

def create_X(x, y, n ):
    if len(x.shape) > 1:
        x = np.ravel(x)
        y = np.ravel(y)

    N = len(x)
    l = int((n+1)*(n+2)/2) # Number of elements in beta
    X = np.ones((N,l))

    for i in range(1,n+1):
        q = int((i)*(i+1)/2)
        for k in range(i+1):
            X[:,q+k] = (x**(i-k))*(y**k)

    return X

# Making meshgrid of datapoints and compute Franke's function


n = 5
N = 1000
x = np.sort(np.random.uniform(0, 1, N))
y = np.sort(np.random.uniform(0, 1, N))
z = FrankeFunction(x, y)
X = create_X(x, y, n=n)
# split in training and test data
X_train, X_test, y_train, y_test = train_test_split(X,z,test_size=0.2)

clf = skl.LinearRegression().fit(X_train, y_train)

# The mean squared error and R2 score


print("MSE before scaling: {:.2f}".format(mean_squared_error(clf.predict(X_test), y_test)))
print("R2 score before scaling {:.2f}".format(clf.score(X_test,y_test)))

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Feature min values before scaling:\n {}".format(X_train.min(axis=0)))
print("Feature max values before scaling:\n {}".format(X_train.max(axis=0)))

print("Feature min values after scaling:\n {}".format(X_train_scaled.min(axis=0)))


print("Feature max values after scaling:\n {}".format(X_train_scaled.max(axis=0)))

clf = skl.LinearRegression().fit(X_train_scaled, y_train)

print("MSE after scaling: {:.2f}".format(mean_squared_error(clf.predict(X_test_scaled), y_test)))


print("R2 score for scaled data: {:.2f}".format(clf.score(X_test_scaled,y_test)))

To think about, first part


When you are comparing your own code with for example Scikit-Learn’s library,
there are some technicalities to keep in mind. The examples here demonstrate
some of these aspects with potential pitfalls.
The discussion here focuses on the role of the intercept, how we can set
up the design matrix, what scaling we should use and other topics which tend
to confuse us.
The intercept can be interpreted as the expected value of our target/output
variables when all other predictors are set to zero. Thus, if we cannot assume
that the expected outputs/targets are zero when all predictors are zero (the
columns in the design matrix), it may be a bad idea to implement a model
which penalizes the intercept. Furthermore, in for example Ridge and Lasso
regression (to be discussed in more detail next week), the default solutions from
the library Scikit-Learn (when not shrinking β0 ) for the unknown parameters
β, are derived under the assumption that both y and X are zero centered, that
is we subtract the mean values.

More thinking
If our predictors represent different scales, then it is important to standardize the
design matrix X by subtracting the mean of each column from the corresponding
column and dividing the column with its standard deviation. Most machine
learning libraries do this as a default. This means that if you compare your code
with the results from a given library, the results may differ.

The StandardScaler function in Scikit-Learn does this for us. For the data
sets we have been studying in our various examples, the data are in many cases
already scaled and there is no need to scale them. You, as a user of different
machine learning algorithms, should always perform a survey of your data, with
a critical assessment of whether you need to scale them.
If you need to scale the data, not doing so will give an unfair penalization
of the parameters since their magnitude depends on the scale of their
corresponding predictor.
Suppose as an example that you have an input variable given by the
heights of different persons. Human height might be measured in inches or
meters or kilometers. If measured in kilometers, a standard linear regression
model with this predictor would probably give a much bigger coefficient term
than if measured in millimeters. This can clearly lead to problems in evaluating
the cost/loss functions.
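
The effect of the unit of measurement on the fitted coefficient can be made concrete with a small experiment. The sketch below uses made-up height and weight data (the linear relation, the noise level and the helper ols_slope are all hypothetical) and fits the same data with the height expressed in meters, millimeters and kilometers.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed
height_m = rng.normal(1.75, 0.1, size=200)               # synthetic heights in meters
weight = 40.0 * height_m + rng.normal(0.0, 2.0, size=200)  # made-up linear relation

def ols_slope(x, y):
    """Slope of an ordinary least squares fit y = beta0 + beta1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

slope_m = ols_slope(height_m, weight)            # height in meters
slope_mm = ols_slope(1000.0 * height_m, weight)  # height in millimeters
slope_km = ols_slope(height_m / 1000.0, weight)  # height in kilometers

print("slope per meter:     ", slope_m)
print("slope per millimeter:", slope_mm)
print("slope per kilometer: ", slope_km)
```

The predictions are the same in all three cases; only the size of the coefficient changes, by the same factor as the unit conversion. A penalty on the size of the coefficients would therefore treat the three versions very differently.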

Still thinking
Keep in mind that when you transform your data set before training a model,
the same transformation needs to be done on your eventual new data set before
making a prediction. If we translate this into Python code, it could
be implemented as follows (note that some lines are commented out since the model
function has not been defined)
#Model training, we compute the mean value of y and X
y_train_mean = np.mean(y_train)
X_train_mean = np.mean(X_train,axis=0)
X_train = X_train - X_train_mean
y_train = y_train - y_train_mean
# The we fit our model with the training data
#trained_model = some_model.fit(X_train,y_train)

#Model prediction, we need also to transform our data set used for the prediction.
X_test = X_test - X_train_mean #Use mean from training data
#y_pred = trained_model(X_test)
y_pred = y_pred + y_train_mean
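A runnable variant of the recipe above may help to see that nothing is lost by centering. The sketch below is only an illustration under our own assumptions: the synthetic data, the 80/20 split and the use of np.linalg.lstsq as the model are not part of the original example. It fits ordinary least squares on centered training data, shifts the test data with the training means, and compares the predictions with a fit that uses an explicit intercept column.

```python
import numpy as np

rng = np.random.default_rng(2023)

# Synthetic data, an assumption made only for this illustration
X = rng.uniform(0, 1, size=(100, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Simple manual split: first 80 rows for training, last 20 for testing
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# The mean values are computed from the training data only
X_train_mean = np.mean(X_train, axis=0)
y_train_mean = np.mean(y_train)

# Ordinary least squares on the centered training data (no intercept column)
beta, *_ = np.linalg.lstsq(
    X_train - X_train_mean, y_train - y_train_mean, rcond=None
)

# Shift the test data with the training means, predict, and shift back
y_pred = (X_test - X_train_mean) @ beta + y_train_mean

# Reference fit with an explicit intercept column on the unscaled data
X_train_1 = np.column_stack((np.ones(len(X_train)), X_train))
X_test_1 = np.column_stack((np.ones(len(X_test)), X_test))
beta_ref, *_ = np.linalg.lstsq(X_train_1, y_train, rcond=None)
y_ref = X_test_1 @ beta_ref

print("Largest difference between the two predictions:",
      np.max(np.abs(y_pred - y_ref)))
```

For ordinary least squares the two sets of predictions should agree to numerical precision; this is no longer guaranteed once a penalty term is added.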

What does centering (subtracting the mean values) mean mathematically?
Let us try to understand what this may imply mathematically when we subtract
the mean values, also known as zero centering. For simplicity, we will focus on
ordinary regression, as done in the above example.
The cost/loss function for regression is
$$
C(\beta_0, \beta_1, \dots, \beta_{p-1}) = \frac{1}{n}\sum_{i=0}^{n-1}\left(y_i - \beta_0 - \sum_{j=1}^{p-1} X_{ij}\beta_j\right)^2.
$$

Recall also that we use the squared value since this leads to an increase of the
penalty for higher differences between predicted and output/target values.
What we have done is to single out the β0 term in the definition of the mean
squared error (MSE). The design matrix X does in this case not contain any
intercept column. When we take the derivative with respect to β0 , we want the
derivative to obey
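$$
\frac{\partial C}{\partial \beta_j} = 0,
$$
for all j. For β0 we have
$$
\frac{\partial C}{\partial \beta_0} = -\frac{2}{n}\sum_{i=0}^{n-1}\left(y_i - \beta_0 - \sum_{j=1}^{p-1} X_{ij}\beta_j\right).
$$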

Setting this derivative to zero and multiplying away the constant 2/n, we obtain
$$
\sum_{i=0}^{n-1}\beta_0 = \sum_{i=0}^{n-1} y_i - \sum_{i=0}^{n-1}\sum_{j=1}^{p-1} X_{ij}\beta_j.
$$

Further Manipulations
Let us specialize first to the case where we have only two parameters β0 and β1.
Our result for β0 simplifies then to
$$
n\beta_0 = \sum_{i=0}^{n-1} y_i - \sum_{i=0}^{n-1} X_{i1}\beta_1.
$$
We obtain then
$$
\beta_0 = \frac{1}{n}\sum_{i=0}^{n-1} y_i - \beta_1\frac{1}{n}\sum_{i=0}^{n-1} X_{i1}.
$$
If we define
$$
\mu_1 = \frac{1}{n}\sum_{i=0}^{n-1} X_{i1},
$$
and if we define the mean value of the outputs as
$$
\mu_y = \frac{1}{n}\sum_{i=0}^{n-1} y_i,
$$
we have
$$
\beta_0 = \mu_y - \beta_1\mu_1.
$$
In the general case, that is when we have more parameters than β0 and β1, we have
$$
\beta_0 = \frac{1}{n}\sum_{i=0}^{n-1} y_i - \frac{1}{n}\sum_{i=0}^{n-1}\sum_{j=1}^{p-1} X_{ij}\beta_j.
$$

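Replacing $y_i$ with $y_i - \mu_y$ and centering also our design matrix results in
a cost function (in vector-matrix disguise)
$$
C(\beta) = (\tilde{y} - \tilde{X}\beta)^T(\tilde{y} - \tilde{X}\beta).
$$

Wrapping it up
If we minimize with respect to β we have then
$$
\hat{\beta} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{y},
$$
where $\tilde{y} = y - \mu_y$ and $\tilde{X}_{ij} = X_{ij} - \frac{1}{n}\sum_{k=0}^{n-1} X_{kj}$.
For Ridge regression we need to add $\lambda\beta^T\beta$ to the cost function and get then
$$
\hat{\beta} = (\tilde{X}^T\tilde{X} + \lambda I)^{-1}\tilde{X}^T\tilde{y}.
$$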

What does this mean? And why do we insist on all this? Let us look at some
examples.

Linear Regression code, Intercept handling first


This code shows a simple first-order fit to a data set using the above transformed
data, where we consider the role of the intercept first, by either excluding it or
including it (code example thanks to Øyvind Sigmundson Schøyen). Here our
scaling of the data is done by subtracting the mean values only. Note also that
we do not split the data into training and test.
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

np.random.seed(2021)


def MSE(y_data, y_model):
    n = np.size(y_model)
    return np.sum((y_data - y_model) ** 2) / n


def fit_beta(X, y):
    return np.linalg.pinv(X.T @ X) @ X.T @ y


true_beta = [2, 0.5, 3.7]

x = np.linspace(0, 1, 11)
y = np.sum(
    np.asarray([x ** p * b for p, b in enumerate(true_beta)]), axis=0
) + 0.1 * np.random.normal(size=len(x))

degree = 3
X = np.zeros((len(x), degree))

# Include the intercept in the design matrix
for p in range(degree):
    X[:, p] = x ** p

beta = fit_beta(X, y)

# Intercept is included in the design matrix
skl = LinearRegression(fit_intercept=False).fit(X, y)

print(f"True beta: {true_beta}")
print(f"Fitted beta: {beta}")
print(f"Sklearn fitted beta: {skl.coef_}")
ypredictOwn = X @ beta
ypredictSKL = skl.predict(X)
print("MSE with intercept column")
print(MSE(y, ypredictOwn))
print("MSE with intercept column from SKL")
print(MSE(y, ypredictSKL))

plt.figure()
plt.scatter(x, y, label="Data")
plt.plot(x, X @ beta, label="Fit")
plt.plot(x, skl.predict(X), label="Sklearn (fit_intercept=False)")

# Do not include the intercept in the design matrix
X = np.zeros((len(x), degree - 1))
for p in range(degree - 1):
    X[:, p] = x ** (p + 1)

# Intercept is not included in the design matrix
skl = LinearRegression(fit_intercept=True).fit(X, y)

# Use centered values for X and y when computing coefficients
y_offset = np.average(y, axis=0)
X_offset = np.average(X, axis=0)

beta = fit_beta(X - X_offset, y - y_offset)
intercept = np.mean(y_offset - X_offset @ beta)

print(f"Manual intercept: {intercept}")
print(f"Fitted beta (without intercept): {beta}")
print(f"Sklearn intercept: {skl.intercept_}")
print(f"Sklearn fitted beta (without intercept): {skl.coef_}")
ypredictOwn = X @ beta
ypredictSKL = skl.predict(X)
print("MSE with Manual intercept")
print(MSE(y, ypredictOwn + intercept))
print("MSE with Sklearn intercept")
print(MSE(y, ypredictSKL))

plt.plot(x, X @ beta + intercept, "--", label="Fit (manual intercept)")
plt.plot(x, skl.predict(X), "--", label="Sklearn (fit_intercept=True)")
plt.grid()
plt.legend()
plt.show()

The intercept is the value of our output/target variable when all our features
are zero, that is where our function crosses the y-axis (for a one-dimensional case).
Printing the MSE, we see first that both methods give the same MSE, as they
should. However, when we move to for example Ridge regression (discussed next
week), the way we treat the intercept may give a larger or smaller MSE, meaning
that the MSE can be affected by whether the intercept is penalized. Not including the
intercept in the fit means that the regularization term does not include β0. For
different values of λ, this may lead to differing MSE values.
To remind the reader, the regularization term, with the intercept in Ridge
regression is given by
$$
\lambda\|\beta\|_2^2 = \lambda\sum_{j=0}^{p-1}\beta_j^2,
$$
but when we take out the intercept, this equation becomes
$$
\lambda\|\beta\|_2^2 = \lambda\sum_{j=1}^{p-1}\beta_j^2.
$$
For Lasso regression we have
$$
\lambda\|\beta\|_1 = \lambda\sum_{j=1}^{p-1}|\beta_j|.
$$

It means that, when scaling the design matrix and the outputs/targets, by
subtracting the mean values, we have an optimization problem which is not
penalized by the intercept. The MSE value can then be smaller since it focuses
only on the remaining quantities. If we however bring back the intercept, we will
get an MSE which then contains the intercept. This becomes more important
when we discuss Ridge and Lasso regression next week.
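To make the role of the intercept concrete, the following sketch compares the two treatments for Ridge regression. It is a minimal illustration under our own assumptions: the data, the value lmb = 0.1 and the helper names ridge_beta and mse are hypothetical and chosen only for this example. In case (a) the intercept column is part of the design matrix and β0 is penalized, in case (b) the data are centered and β0 is recovered afterwards without any penalty.

```python
import numpy as np

rng = np.random.default_rng(2021)
x = np.linspace(0, 1, 21)
y = 2.0 + 0.5 * x + 3.7 * x**2 + 0.1 * rng.normal(size=len(x))
lmb = 0.1  # hypothetical value of the hyperparameter lambda


def ridge_beta(X, y, lmb):
    """Closed-form Ridge solution (X^T X + lambda I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lmb * np.eye(p), X.T @ y)


def mse(y_data, y_model):
    return np.mean((y_data - y_model) ** 2)


# (a) Intercept column inside the design matrix: beta_0 is penalized
X_full = np.column_stack((np.ones_like(x), x, x**2))
beta_full = ridge_beta(X_full, y, lmb)
mse_penalized = mse(y, X_full @ beta_full)

# (b) Centered data without intercept column: beta_0 is not penalized
X_feat = np.column_stack((x, x**2))
X_mean, y_mean = X_feat.mean(axis=0), y.mean()
beta_c = ridge_beta(X_feat - X_mean, y - y_mean, lmb)
intercept = y_mean - X_mean @ beta_c
mse_unpenalized = mse(y, X_feat @ beta_c + intercept)

# Ordinary least squares as a reference (lambda = 0)
mse_ols = mse(y, X_full @ ridge_beta(X_full, y, 0.0))

print("MSE, intercept penalized:    ", mse_penalized)
print("MSE, intercept not penalized:", mse_unpenalized)
print("MSE, ordinary least squares: ", mse_ols)
```

The two Ridge results will in general differ from each other, while neither can fall below the ordinary least squares value on the same data. For λ = 0 the two treatments coincide.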

Mathematical Interpretation of Ordinary Least Squares


What is presented here is a mathematical analysis of various regression algorithms
(ordinary least squares, Ridge and Lasso Regression). The analysis is based on an
important algorithm in linear algebra, the so-called Singular Value Decomposition
(SVD).
We have shown that in ordinary least squares the optimal parameters β are
given by
$$
\hat{\beta} = \left(X^T X\right)^{-1} X^T y.
$$
The hat over β means we have the optimal parameters after minimization
of the cost function.
This means that our best model is defined as
$$
\tilde{y} = X\hat{\beta} = X\left(X^T X\right)^{-1} X^T y.
$$
We now define a matrix
$$
A = X\left(X^T X\right)^{-1} X^T.
$$
We can rewrite
$$
\tilde{y} = X\hat{\beta} = Ay.
$$
The matrix A has the important property that $A^2 = A$. This is the definition
of a projection matrix. Since A is in addition symmetric, $A^T = A$, we can
interpret our optimal model ỹ as being represented by an orthogonal projection
of y onto the space spanned by the column vectors of X. A projection matrix
that is not symmetric would instead define an oblique projection.

Residual Error
We have defined the residual error as
$$
\epsilon = y - \tilde{y} = \left[I - X\left(X^T X\right)^{-1} X^T\right] y.
$$
The residual errors are then the projections of y onto the orthogonal complement
of the space spanned by the column vectors of X.
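These properties are easy to check numerically. The sketch below uses a randomly generated design matrix (an assumption made only for this illustration) and tests that A is idempotent and symmetric, and that the residual is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.normal(size=(n, p))  # hypothetical full-rank design matrix
y = rng.normal(size=n)       # hypothetical targets

# Projection matrix A = X (X^T X)^{-1} X^T
A = X @ np.linalg.inv(X.T @ X) @ X.T
y_tilde = A @ y
residual = y - y_tilde

print("A^2 = A:          ", np.allclose(A @ A, A))
print("A^T = A:          ", np.allclose(A, A.T))
print("X^T residual = 0: ", np.allclose(X.T @ residual, 0.0))
print("trace(A):         ", np.trace(A))
```

The trace of A equals the dimension of the space we project onto, here the number of columns p.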

Simple case
If the matrix X is an orthogonal (or unitary in case of complex values) matrix,
we have
$$
X^T X = X X^T = I.
$$
In this case the matrix A becomes
$$
A = X\left(X^T X\right)^{-1} X^T = I,
$$
and we have the obvious case
$$
\epsilon = y - \tilde{y} = 0.
$$

This serves also as a useful test of our codes.

The singular value decomposition


The examples we have looked at so far are cases where we normally can invert
the matrix $X^T X$. Using a polynomial expansion to fit various functions
leads to row vectors of the design matrix which are essentially orthogonal due
to the polynomial character of our model. Obtaining the inverse of $X^T X$
is then often done via a so-called LU, QR or Cholesky decomposition.
As we will also see in the first project, this may however not be the case
in general and a standard matrix inversion algorithm based on say LU, QR or
Cholesky decomposition may lead to singularities. We will see examples of this
below.
There is however a way to circumvent this problem and also gain some
insights about the ordinary least squares approach, and later shrinkage methods
like Ridge and Lasso regressions.
This is given by the Singular Value Decomposition (SVD) algorithm,
perhaps the most powerful linear algebra algorithm. The SVD provides a
numerically stable matrix decomposition that is used in a large swath of
applications.
In machine learning it plays a central role in dealing with for example design
matrices that may be near singular or singular. Furthermore, as we will see here,
the singular values can be related to the covariance matrix (and thereby the
correlation matrix) and in turn the variance of a given quantity. It plays also an
important role in the principal component analysis where high-dimensional data
can be reduced to the statistically relevant features.

Linear Regression Problems


One of the typical problems we encounter with linear regression, in particular
when the matrix X (our so-called design matrix) is high-dimensional, is that of
near singular or singular matrices. The column vectors of X may be
linearly dependent, normally referred to as super-collinearity. This means that
the matrix may be rank deficient and it is basically impossible to model the
data using linear regression. As an example, consider the matrix
$$
X = \begin{bmatrix}
1 & -1 & 2 \\
1 & 0 & 1 \\
1 & 2 & -1 \\
1 & 1 & 0
\end{bmatrix}.
$$
The columns of X are linearly dependent. We see this easily since the
first column is the row-wise sum of the other two columns. The rank (more
correctly, the column rank) of a matrix is the dimension of the space spanned by
the column vectors. Hence, the rank of X is equal to the number of linearly
independent columns. In this particular case the matrix has rank 2.
Super-collinearity of an (n × p)-dimensional design matrix X implies that
the matrix $X^T X$ (the matrix we need to invert to solve the linear
regression equations) is non-invertible. If we have a square matrix that does not
have an inverse, we say this matrix is singular. The example here demonstrates
this
$$
X = \begin{bmatrix}
1 & -1 \\
1 & -1
\end{bmatrix}.
$$
We see easily that det(X) = x11 x22 − x12 x21 = 1 × (−1) − (−1) × 1 = 0. Hence,
X is singular and its inverse is undefined. This is equivalent to saying that the
matrix X has at least one eigenvalue which is zero.
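The statements about the two example matrices can be verified with a few numpy calls. This is a small sketch; the variable names are our own.

```python
import numpy as np

# The 4 x 3 example with linearly dependent columns
X = np.array([[1.0, -1.0, 2.0],
              [1.0, 0.0, 1.0],
              [1.0, 2.0, -1.0],
              [1.0, 1.0, 0.0]])

# First column equals the row-wise sum of the other two
print(np.allclose(X[:, 0], X[:, 1] + X[:, 2]))

rank_X = np.linalg.matrix_rank(X)
XtX = X.T @ X
rank_XtX = np.linalg.matrix_rank(XtX)
print("rank of X:", rank_X, " rank of X^T X:", rank_XtX)

# The singular 2 x 2 example
X2 = np.array([[1.0, -1.0],
               [1.0, -1.0]])
det_X2 = np.linalg.det(X2)
print("det:", det_X2)
```

Since $X^T X$ is a 3 × 3 matrix of rank 2, it has no inverse.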

Fixing the singularity
If our design matrix X which enters the linear regression problem
$$
\beta = (X^T X)^{-1} X^T y, \tag{1}
$$
has linearly dependent column vectors, we will not be able to compute the inverse
of $X^T X$ and we cannot find the parameters (estimators) $\beta_i$. The estimators
are only well-defined if $(X^T X)^{-1}$ exists. Linear dependence is more likely to
occur when the matrix X is high-dimensional, and we may then easily encounter a
situation where the regression parameters $\beta_i$ cannot be estimated.
A cheap ad hoc approach is simply to add a small diagonal component to
the matrix to invert, that is we change
$$
X^T X \rightarrow X^T X + \lambda I,
$$
where I is the identity matrix. When we discuss Ridge regression this is actually
what we end up evaluating. The parameter λ is called a hyperparameter. More
about this later.
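The effect of the shift can be sketched with the rank-deficient 4 × 3 matrix from the previous section. The target vector y and the value of λ below are hypothetical and serve only as an illustration.

```python
import numpy as np

X = np.array([[1.0, -1.0, 2.0],
              [1.0, 0.0, 1.0],
              [1.0, 2.0, -1.0],
              [1.0, 1.0, 0.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical targets

XtX = X.T @ X
lmb = 1e-3  # hypothetical small hyperparameter
XtX_reg = XtX + lmb * np.eye(3)

print("rank without shift:", np.linalg.matrix_rank(XtX))
print("rank with shift:   ", np.linalg.matrix_rank(XtX_reg))
print("condition number with shift:", np.linalg.cond(XtX_reg))

# The shifted system can now be solved
beta = np.linalg.solve(XtX_reg, X.T @ y)
print(beta)
```

The shifted matrix has full rank, but note that its condition number grows as λ becomes smaller, so λ should not be chosen arbitrarily close to zero.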

Basic math of the SVD


From standard linear algebra we know that a square matrix X can be diagonalized
by an orthogonal/unitary transformation if and only if it is a so-called normal
matrix, that is if $X \in \mathbb{R}^{n\times n}$ we have
$X X^T = X^T X$ or if $X \in \mathbb{C}^{n\times n}$ we have $X X^\dagger = X^\dagger X$. The matrix has then a
set of eigenpairs
$$
(\lambda_1, u_1), \dots, (\lambda_n, u_n),
$$
and the eigenvalues are given by the diagonal matrix
$$
\Sigma = \mathrm{Diag}(\lambda_1, \dots, \lambda_n).
$$
The matrix X can be written in terms of an orthogonal/unitary transformation U
$$
X = U\Sigma V^T,
$$
with $U U^T = I$ or $U U^\dagger = I$.
Not all square matrices are diagonalizable. A matrix like the one discussed
above
$$
X = \begin{bmatrix}
1 & -1 \\
1 & -1
\end{bmatrix}
$$
is not diagonalizable, it is a so-called defective matrix. It is easy to see that the
condition $X X^T = X^T X$ is not fulfilled.

The SVD, a Fantastic Algorithm


However, and this is the strength of the SVD algorithm, any general matrix X
can be decomposed in terms of a diagonal matrix and two orthogonal/unitary
matrices. The Singular Value Decomposition (SVD) theorem states that a general
m × n matrix X can be written in terms of a diagonal matrix Σ of dimensionality
m × n and two orthogonal matrices U and V, where the first has dimensionality
m × m and the last dimensionality n × n. We have then
$$
X = U\Sigma V^T.
$$
As an example, the above defective matrix can be decomposed as
$$
X = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix}
\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} = U\Sigma V^T,
$$
with singular values σ1 = 2 and σ2 = 0. The SVD always exists!
The SVD orders the singular values such that $\sigma_i \ge \sigma_{i+1}$ for all
i, and for dimensions larger than i = p the singular values are zero.
In the general case, where our design matrix X has dimension n × p, the
matrix is thus decomposed into an n × n orthogonal matrix U, a p × p orthogonal
matrix V and a diagonal matrix Σ with r = min(n, p) singular values $\sigma_i \ge 0$ on
the main diagonal and zeros filling the rest of the matrix. There are at most p
singular values assuming that n > p. In our regression examples for the nuclear
masses and the equation of state this is indeed the case, while for the Ising model
we have p > n. These are often cases that lead to near singular or singular
matrices.
The columns of U are called the left singular vectors while the columns of V
are the right singular vectors.

Economy-size SVD
If we assume that n > p, then our matrix U has dimension n × n. The last
n − p columns of U become however irrelevant in our calculations since they are
multiplied with the zeros in Σ.
The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values, Σ, along with the columns in either
U or V that multiply those zeros in the expression. Removing these zeros and
columns can improve execution time and reduce storage requirements without
compromising the accuracy of the decomposition.
If n > p, we keep only the first p columns of U and Σ has dimension p × p.
If p > n, then only the first n columns of V are computed and Σ has dimension
n × n. The n = p case is obvious, we retain the full SVD. In general the
economy-size SVD requires fewer floating-point operations while preserving the
desired accuracy.
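In numpy the economy-size decomposition is selected with the full_matrices=False argument of np.linalg.svd. The sketch below, with a random 6 × 3 matrix as an assumed example, compares the shapes of the returned arrays and rebuilds X from the economy-size factors.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.normal(size=(n, p))  # hypothetical design matrix with n > p

# Full SVD: U is n x n
U, S, VT = np.linalg.svd(X, full_matrices=True)
print(U.shape, S.shape, VT.shape)        # (6, 6) (3,) (3, 3)

# Economy-size SVD: only the first p columns of U are kept
U_e, S_e, VT_e = np.linalg.svd(X, full_matrices=False)
print(U_e.shape, S_e.shape, VT_e.shape)  # (6, 3) (3,) (3, 3)

# With the economy-size factors, np.diag(S_e) has the right shape directly
X_rebuilt = U_e @ np.diag(S_e) @ VT_e
print(np.allclose(X, X_rebuilt))
```

The singular values are returned as a one-dimensional array in both cases; only the shape of U (or of V when p > n) changes.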

Codes for the SVD


import numpy as np


# SVD decomposition and reconstruction
def SVD(A):
    ''' Takes as input a numpy matrix A, computes its singular value
    decomposition (SVD) and returns the matrix rebuilt from U, Sigma and V^T.
    The SVD is numerically more stable than the inversion algorithms provided by
    numpy and scipy.linalg at the cost of being slower.
    '''
    U, S, VT = np.linalg.svd(A, full_matrices=True)
    # Both tests should print matrices of (numerical) zeros
    print('test U')
    print((np.transpose(U) @ U - U @ np.transpose(U)))
    print('test VT')
    print((np.transpose(VT) @ VT - VT @ np.transpose(VT)))
    print(U)
    print(S)
    print(VT)

    # Place the singular values on the diagonal of a matrix with the shape of A
    D = np.zeros((len(U), len(VT)))
    for i in range(0, len(S)):
        D[i, i] = S[i]
    return U @ D @ VT


X = np.array([[1.0, -1.0], [1.0, -1.0]])
#X = np.array([[1, 2], [3, 4], [5, 6]])

print(X)
C = SVD(X)
# Print the difference between the original matrix and the SVD one
print(C - X)

The matrix X has columns that are linearly dependent. The second column is
the first column multiplied by −1. The rank of a matrix (the column
rank) is the dimension of space spanned by the column vectors. The rank of
the matrix is the number of linearly independent columns, in this case just 1.
We see this from the singular values when running the above code, since only
one of them is non-zero. Running
the standard inversion algorithm for matrix inversion with $X^T X$ results in the
program terminating due to a singular matrix.

Note about SVD Calculations


The U, S, and V matrices returned from the svd() function cannot be multiplied
directly.
As you can see from the code, the S vector must be converted into a diagonal
matrix. This may cause a problem as the sizes of the matrices may not fit the rules
of matrix multiplication, where the number of columns in a matrix must match
the number of rows in the subsequent matrix.
If you wish to include the zero singular values, you will need to resize the
matrices and set up a diagonal matrix as done in the above example.

Mathematics of the SVD and implications


Let us take a closer look at the mathematics of the SVD and the various
implications for machine learning studies.

Our starting point is our design matrix X of dimension n × p
$$
X = \begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2} & \dots & x_{0,p-1} \\
x_{1,0} & x_{1,1} & x_{1,2} & \dots & x_{1,p-1} \\
x_{2,0} & x_{2,1} & x_{2,2} & \dots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n-2,0} & x_{n-2,1} & x_{n-2,2} & \dots & x_{n-2,p-1} \\
x_{n-1,0} & x_{n-1,1} & x_{n-1,2} & \dots & x_{n-1,p-1}
\end{bmatrix}.
$$
We can SVD decompose our matrix as
$$
X = U\Sigma V^T,
$$
where U is an orthogonal matrix of dimension n × n, meaning that
$U U^T = U^T U = I_n$. Here $I_n$ is the unit matrix of dimension n × n.
Similarly, V is an orthogonal matrix of dimension p × p, meaning that
$V V^T = V^T V = I_p$. Here $I_p$ is the unit matrix of dimension p × p.
Finally Σ contains the singular values $\sigma_i$. This matrix has dimension n × p
and the singular values $\sigma_i$ are all non-negative. They are ordered in
descending order, that is
$$
\sigma_0 \ge \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{p-1} \ge 0.
$$
All other elements of Σ are zero.

Example Matrix
As an example, consider the following 3 × 2 example for the matrix Σ
$$
\Sigma = \begin{bmatrix}
2 & 0 \\
0 & 1 \\
0 & 0
\end{bmatrix}.
$$
The singular values are σ0 = 2 and σ1 = 1. It is common to rewrite the
matrix Σ as
$$
\Sigma = \begin{bmatrix}
\tilde{\Sigma} \\
0
\end{bmatrix},
$$
where
$$
\tilde{\Sigma} = \begin{bmatrix}
2 & 0 \\
0 & 1
\end{bmatrix},
$$
contains only the singular values. Note also (and we will use this below) that
$$
\Sigma^T\Sigma = \begin{bmatrix}
4 & 0 \\
0 & 1
\end{bmatrix},
$$
which is a 2 × 2 matrix while
$$
\Sigma\Sigma^T = \begin{bmatrix}
4 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 0
\end{bmatrix},
$$
is a 3 × 3 matrix. The last row and column of this last matrix contain only
zeros. This will have important consequences for our SVD decomposition of the
design matrix.

Setting up the Matrix to be inverted


The matrix that may cause problems for us is $X^T X$. Using the SVD we can
rewrite this matrix as
$$
X^T X = V\Sigma^T U^T U\Sigma V^T,
$$
and using the orthogonality of the matrix U we have
$$
X^T X = V\Sigma^T\Sigma V^T.
$$
We define $\Sigma^T\Sigma = \tilde{\Sigma}^2$ which is a diagonal matrix containing only the singular
values squared. It has dimensionality p × p.
We can now insert the result for the matrix $X^T X$ into our equation for
ordinary least squares where
$$
\tilde{y}_{\mathrm{OLS}} = X\left(X^T X\right)^{-1} X^T y,
$$
and using our SVD decomposition of X we have
$$
\tilde{y}_{\mathrm{OLS}} = U\Sigma V^T\left(V\tilde{\Sigma}^2 V^T\right)^{-1} V\Sigma^T U^T y,
$$
which gives us, using the orthogonality of the matrix V and assuming that all
singular values are non-zero,
$$
\tilde{y}_{\mathrm{OLS}} = U\begin{bmatrix} I_p & 0 \\ 0 & 0 \end{bmatrix}U^T y = \sum_{i=0}^{p-1} u_i u_i^T y.
$$
It means that the ordinary least squares model (with the optimal parameters)
ỹ corresponds to an orthogonal projection of the output (or target) vector
y onto the space spanned by the first p column vectors of the matrix U. Note
that the summation ends at p − 1, that is ỹ ≠ y in general. We can thus not use
the orthogonality relation $U U^T = I$ for the full matrix U.
This can already be seen when we multiply the matrices $\Sigma^T U^T$.
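The projection formula can be tested against the standard expression for the ordinary least squares prediction. The sketch below assumes a random full-rank design matrix with n > p (our own choice for illustration) and uses the economy-size SVD, whose matrix U contains exactly the first p left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 4
X = rng.normal(size=(n, p))  # hypothetical full-rank design matrix
y = rng.normal(size=n)       # hypothetical targets

# Standard OLS prediction X (X^T X)^{-1} X^T y
y_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)

# The same prediction from the first p left singular vectors
U, S, VT = np.linalg.svd(X, full_matrices=False)
y_svd = U @ (U.T @ y)

# Written out as the sum over u_i u_i^T y
y_sum = sum(np.outer(U[:, i], U[:, i]) @ y for i in range(p))

print(np.allclose(y_ols, y_svd), np.allclose(y_ols, y_sum))
print("Prediction equals y:", np.allclose(y_ols, y))
```

Since only p of the n left singular vectors enter the sum, the prediction differs from y unless y happens to lie in the column space of X.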

Further properties (important for our analyses later)


Let us study again $X^T X$ in terms of our SVD,
$$
X^T X = V\Sigma^T U^T U\Sigma V^T = V\Sigma^T\Sigma V^T.
$$
If we now multiply from the right with V (using the orthogonality of V) we
get
$$
\left(X^T X\right)V = V\Sigma^T\Sigma.
$$
This means the vectors $v_i$ of the orthogonal matrix V are the eigenvectors of
the matrix $X^T X$ with eigenvalues given by the singular values squared, that is
$$
\left(X^T X\right)v_i = v_i\sigma_i^2.
$$
Similarly, if we use the SVD decomposition for the matrix $X X^T$, we have
$$
X X^T = U\Sigma V^T V\Sigma^T U^T = U\Sigma\Sigma^T U^T.
$$
If we now multiply from the right with U (using the orthogonality of U) we
get
$$
\left(X X^T\right)U = U\Sigma\Sigma^T.
$$
This means the vectors $u_i$ of the orthogonal matrix U are the eigenvectors of
the matrix $X X^T$ with eigenvalues given by the singular values squared, that is
$$
\left(X X^T\right)u_i = u_i\sigma_i^2.
$$
Important note: we have defined our design matrix X to be an n × p
matrix. In most supervised learning cases we have that n ≥ p, and quite often
we have n >> p. For linear algebra based methods like ordinary least squares or
Ridge regression, this leads to a matrix $X^T X$ which is small and thereby easier
to handle from a computational point of view (in terms of number of floating
point operations).
In our lectures, the number of columns will always refer to the number of
features in our data set, while the number of rows represents the number of
data inputs. Note that in other texts you may find the opposite notation. This
has consequences for the definition of for example the covariance matrix and its
relation to the SVD.
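The relation between the singular values of X and the eigenvalues of the two product matrices can be checked directly. The following sketch assumes a random 7 × 3 matrix as an example and uses np.linalg.eigvalsh, which is appropriate since both product matrices are symmetric.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 7, 3
X = rng.normal(size=(n, p))  # hypothetical design matrix with n > p

U, S, VT = np.linalg.svd(X, full_matrices=True)
V = VT.T

# Eigenvalues sorted in descending order, to match the ordering of S
eig_XtX = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
eig_XXt = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]

print("sigma^2:           ", S**2)
print("eigenvalues X^T X: ", eig_XtX)
print("eigenvalues X X^T: ", eig_XXt)

# The columns of V are eigenvectors of X^T X: (X^T X) V = V diag(sigma^2)
print(np.allclose(X.T @ X @ V, V * S**2))
```

The p × p matrix reproduces the squared singular values, while the n × n matrix has the same non-zero eigenvalues padded with n − p zeros.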

Meet the Covariance Matrix


Before we move on to a discussion of Ridge and Lasso regression, we want to
show an important example of the above.
We have already noted that the matrix $X^T X$ in ordinary least squares is
proportional to the second derivative of the cost function, that is we have
$$
\frac{\partial^2 C(\beta)}{\partial\beta\,\partial\beta^T} = \frac{2}{n} X^T X.
$$
This quantity defines what is called the Hessian matrix (the second derivative
of a function we want to optimize).
The Hessian matrix plays an important role and is defined in this course, up to
the constant factor 2/n, as
$$
H = X^T X.
$$

The Hessian matrix for ordinary least squares is also proportional to the
covariance matrix. This means also that we can use the SVD to find the
eigenvalues of the covariance matrix and the Hessian matrix in terms of the
singular values. Let us develop these arguments, as they will play an important
role in our machine learning studies.

Introducing the Covariance and Correlation functions


Before we discuss the link between for example Ridge regression and the singular
value decomposition, we need to remind ourselves about the definition of the
covariance and the correlation function. These are quantities that play a central
role in machine learning methods.
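Suppose we have defined two vectors x and y with n elements each. The
covariance matrix C is defined as
$$
C[x, y] = \begin{bmatrix}
\mathrm{cov}[x, x] & \mathrm{cov}[x, y] \\
\mathrm{cov}[y, x] & \mathrm{cov}[y, y]
\end{bmatrix},
$$
where for example
$$
\mathrm{cov}[x, y] = \frac{1}{n}\sum_{i=0}^{n-1}(x_i - \overline{x})(y_i - \overline{y}).
$$
With this definition and recalling that the variance is defined as
$$
\mathrm{var}[x] = \frac{1}{n}\sum_{i=0}^{n-1}(x_i - \overline{x})^2,
$$
we can rewrite the covariance matrix as
$$
C[x, y] = \begin{bmatrix}
\mathrm{var}[x] & \mathrm{cov}[x, y] \\
\mathrm{cov}[x, y] & \mathrm{var}[y]
\end{bmatrix}.
$$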

Note: we have used 1/n in the above definitions of the sample variance
and covariance. We assume then that we can calculate the exact mean value.
What you will find in essentially all statistics texts are equations with a factor
1/(n − 1). This is called Bessel's correction. This method corrects the bias in the
estimation of the population variance and covariance. It also partially corrects
the bias in the estimation of the population standard deviation. If you use
numpy's np.cov function with its default settings to calculate the covariance,
this quantity will be computed with a factor 1/(n − 1).
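The difference between the two normalizations is easily made explicit. In the sketch below (with synthetic data assumed only for illustration) the covariance is computed by hand with both factors and compared with np.cov, whose ddof argument selects the normalization.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 4 + 3 * x + rng.normal(size=n)

# Hand-written covariance with the factors 1/n and 1/(n-1)
products = (x - x.mean()) * (y - y.mean())
cov_n = np.sum(products) / n
cov_bessel = np.sum(products) / (n - 1)

W = np.vstack((x, y))
C_default = np.cov(W)         # default: factor 1/(n-1)
C_biased = np.cov(W, ddof=0)  # ddof=0: factor 1/n

print(cov_n, C_biased[0, 1])
print(cov_bessel, C_default[0, 1])
```

For large n the two results are close, since they differ only by the factor n/(n − 1).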

Covariance and Correlation Matrix


The covariance can take any value between minus infinity and infinity and may thus lead to
problems with loss of numerical precision for particularly large values. It is
common to scale the covariance matrix by introducing instead the correlation
matrix defined via the so-called correlation function
$$
\mathrm{corr}[x, y] = \frac{\mathrm{cov}[x, y]}{\sqrt{\mathrm{var}[x]\,\mathrm{var}[y]}}.
$$
The correlation function is then given by values corr[x, y] ∈ [−1, 1]. This
avoids possible problems with too large values. We can then define the correlation
matrix for the two vectors x and y as
$$
K[x, y] = \begin{bmatrix}
1 & \mathrm{corr}[x, y] \\
\mathrm{corr}[y, x] & 1
\end{bmatrix}.
$$
In the above example this is the function we constructed using pandas.

Correlation Function and Design/Feature Matrix


In our derivation of the various regression algorithms like Ordinary Least
Squares or Ridge regression we defined the design/feature matrix X as
$$
X = \begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2} & \dots & x_{0,p-1} \\
x_{1,0} & x_{1,1} & x_{1,2} & \dots & x_{1,p-1} \\
x_{2,0} & x_{2,1} & x_{2,2} & \dots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n-2,0} & x_{n-2,1} & x_{n-2,2} & \dots & x_{n-2,p-1} \\
x_{n-1,0} & x_{n-1,1} & x_{n-1,2} & \dots & x_{n-1,p-1}
\end{bmatrix},
$$
with $X \in \mathbb{R}^{n\times p}$, with the predictors/features p referring to the column numbers
and the entries n being the row elements. We can rewrite the design/feature
matrix in terms of its column vectors as
$$
X = \begin{bmatrix} x_0 & x_1 & x_2 & \dots & x_{p-1} \end{bmatrix},
$$
with a given vector
$$
x_i^T = \begin{bmatrix} x_{0,i} & x_{1,i} & x_{2,i} & \dots & x_{n-1,i} \end{bmatrix}.
$$
With these definitions, we can now rewrite our 2 × 2 correlation/covariance
matrix in terms of a more general design/feature matrix $X \in \mathbb{R}^{n\times p}$. This leads
to a p × p covariance matrix for the vectors $x_i$ with i = 0, 1, ..., p − 1
$$
C[x] = \begin{bmatrix}
\mathrm{var}[x_0] & \mathrm{cov}[x_0, x_1] & \mathrm{cov}[x_0, x_2] & \dots & \mathrm{cov}[x_0, x_{p-1}] \\
\mathrm{cov}[x_1, x_0] & \mathrm{var}[x_1] & \mathrm{cov}[x_1, x_2] & \dots & \mathrm{cov}[x_1, x_{p-1}] \\
\mathrm{cov}[x_2, x_0] & \mathrm{cov}[x_2, x_1] & \mathrm{var}[x_2] & \dots & \mathrm{cov}[x_2, x_{p-1}] \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}[x_{p-1}, x_0] & \mathrm{cov}[x_{p-1}, x_1] & \mathrm{cov}[x_{p-1}, x_2] & \dots & \mathrm{var}[x_{p-1}]
\end{bmatrix},
$$
and the correlation matrix
$$
K[x] = \begin{bmatrix}
1 & \mathrm{corr}[x_0, x_1] & \mathrm{corr}[x_0, x_2] & \dots & \mathrm{corr}[x_0, x_{p-1}] \\
\mathrm{corr}[x_1, x_0] & 1 & \mathrm{corr}[x_1, x_2] & \dots & \mathrm{corr}[x_1, x_{p-1}] \\
\mathrm{corr}[x_2, x_0] & \mathrm{corr}[x_2, x_1] & 1 & \dots & \mathrm{corr}[x_2, x_{p-1}] \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{corr}[x_{p-1}, x_0] & \mathrm{corr}[x_{p-1}, x_1] & \mathrm{corr}[x_{p-1}, x_2] & \dots & 1
\end{bmatrix}.
$$

Covariance Matrix Examples


The Numpy function np.cov calculates the covariance elements using the factor
1/(n − 1) instead of 1/n since it assumes we do not have the exact mean values.
The following simple function uses the np.vstack function which takes each
vector of dimension 1 × n and produces a 2 × n matrix W.
Note that this assumes you have the features as the rows, and the inputs as
columns, that is
$$
W = \begin{bmatrix}
x_0 & x_1 & x_2 & \dots & x_{n-2} & x_{n-1} \\
y_0 & y_1 & y_2 & \dots & y_{n-2} & y_{n-1}
\end{bmatrix},
$$
which in turn is converted into the 2 × 2 covariance matrix C via the
Numpy function np.cov(). We note that we can also calculate the mean value
of each set of samples x etc using the Numpy function np.mean(x). We can
also extract the eigenvalues of the covariance matrix through the np.linalg.eig()
function.
# Importing various packages
import numpy as np
n = 100
x = np.random.normal(size=n)
print(np.mean(x))
y = 4+3*x+np.random.normal(size=n)
print(np.mean(y))
W = np.vstack((x, y))
C = np.cov(W)
print(C)

Correlation Matrix
The previous example can be converted into the correlation matrix by simply
scaling the matrix elements with the variances. We should also subtract the
mean values for each column. This leads to the following code which sets up the
correlation matrix for the previous example in a more brute force way. Here
we subtract the mean values for each column of the design matrix, calculate the
relevant variances and covariances and then finally set up the 2 × 2 correlation
matrix (since we have only two vectors).
import numpy as np
n = 100
# define two vectors
x = np.random.random(size=n)
y = 4+3*x+np.random.normal(size=n)
# centering the x and y vectors (subtracting the mean values)
x = x - np.mean(x)
y = y - np.mean(y)
variance_x = np.sum(x@x)/n
variance_y = np.sum(y@y)/n
print(variance_x)
print(variance_y)
cov_xy = np.sum(x@y)/n
cov_xx = np.sum(x@x)/n
cov_yy = np.sum(y@y)/n
# C holds the correlation matrix
C = np.zeros((2,2))
C[0,0]= cov_xx/variance_x
C[1,1]= cov_yy/variance_y
C[0,1]= cov_xy/np.sqrt(variance_y*variance_x)
C[1,0]= C[0,1]
print(C)

We see that the matrix elements along the diagonal are one as they should
be and that the matrix is symmetric. Furthermore, diagonalizing this matrix we
easily see that it is a positive definite matrix.
The above procedure with numpy can be made more compact if we use
pandas.

Correlation Matrix with Pandas


We show here how we can set up the correlation matrix using pandas, as done
in this simple code
import numpy as np
import pandas as pd
n = 10
x = np.random.normal(size=n)
x = x - np.mean(x)
y = 4+3*x+np.random.normal(size=n)
y = y - np.mean(y)
# Note that we transpose the matrix in order to stay with our ordering n x p
X = (np.vstack((x, y))).T
print(X)
Xpd = pd.DataFrame(X)
print(Xpd)
correlation_matrix = Xpd.corr()
print(correlation_matrix)

We expand this model to the Franke function discussed above.

Correlation Matrix with Pandas and the Franke function


# Common imports
import numpy as np
import pandas as pd


def FrankeFunction(x, y):
    term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
    term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
    term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
    term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
    return term1 + term2 + term3 + term4


def create_X(x, y, n):
    if len(x.shape) > 1:
        x = np.ravel(x)
        y = np.ravel(y)

    N = len(x)
    l = int((n+1)*(n+2)/2)  # Number of elements in beta
    X = np.ones((N, l))
    for i in range(1, n+1):
        q = int((i)*(i+1)/2)
        for k in range(i+1):
            X[:, q+k] = (x**(i-k))*(y**k)

    return X


# Generate sorted random data points and compute Franke's function
n = 4
N = 100
x = np.sort(np.random.uniform(0, 1, N))
y = np.sort(np.random.uniform(0, 1, N))
z = FrankeFunction(x, y)
X = create_X(x, y, n=n)
Xpd = pd.DataFrame(X)
# subtract the mean values and set up the covariance matrix
Xpd = Xpd - Xpd.mean()
covariance_matrix = Xpd.cov()
print(covariance_matrix)

We note here that the covariance is zero for the first row and column since
all matrix elements in the first column of the design matrix were set to one
(we are fitting the function in terms of a polynomial of degree n).
This means that the variance for these elements will be zero and will cause
problems when we set up the correlation matrix. We can simply drop these
elements and construct a correlation matrix without these elements.
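Dropping the constant column can be sketched with pandas as follows. To keep the example short we assume a one-dimensional polynomial design matrix with hypothetical column labels instead of the full Franke function setup; the same drop call applies to the first column of the matrix built by create_X.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 50))

# Design matrix with a constant (intercept) column, hypothetical column labels
X = np.column_stack((np.ones_like(x), x, x**2))
Xpd = pd.DataFrame(X, columns=["1", "x", "x^2"])

# The constant column has zero variance, so its correlations are
# expected to be undefined (NaN)
print(Xpd.corr())

# Drop the constant column before setting up the correlation matrix
correlation_matrix = Xpd.drop(columns="1").corr()
print(correlation_matrix)
```

The resulting 2 × 2 matrix contains only well-defined correlations between the remaining features.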

Rewriting the Covariance and/or Correlation Matrix


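We can rewrite the covariance matrix in a more compact form in terms of the
design/feature matrix X (with the mean value of each column subtracted) as
$$
C[x] = \frac{1}{n} X^T X = E[X^T X].
$$
To see this let us simply look at a design matrix $X \in \mathbb{R}^{2\times 2}$
$$
X = \begin{bmatrix}
x_{00} & x_{01} \\
x_{10} & x_{11}
\end{bmatrix} = \begin{bmatrix} x_0 & x_1 \end{bmatrix}.
$$
If we then compute the expectation value (note the 1/n factor instead of
1/(n − 1))
$$
E[X^T X] = \frac{1}{n} X^T X = \frac{1}{n}\begin{bmatrix}
x_{00}^2 + x_{10}^2 & x_{00}x_{01} + x_{10}x_{11} \\
x_{01}x_{00} + x_{11}x_{10} & x_{01}^2 + x_{11}^2
\end{bmatrix},
$$
which is just
$$
C[x_0, x_1] = C[x] = \begin{bmatrix}
\mathrm{var}[x_0] & \mathrm{cov}[x_0, x_1] \\
\mathrm{cov}[x_1, x_0] & \mathrm{var}[x_1]
\end{bmatrix},
$$
where we wrote
$$
C[x_0, x_1] = C[x]
$$
to indicate that this is the covariance of the vectors x of the design/feature
matrix X.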
It is easy to generalize this to a matrix X ∈ Rn×p .

Linking with the SVD


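We saw earlier that
$$
X^T X = V\Sigma^T U^T U\Sigma V^T = V\Sigma^T\Sigma V^T.
$$
Since the matrices here have dimension p × p, with p corresponding to the number
of singular values, we defined earlier the matrix
$$
\Sigma^T\Sigma = \begin{bmatrix} \tilde{\Sigma} & 0 \end{bmatrix}
\begin{bmatrix} \tilde{\Sigma} \\ 0 \end{bmatrix} = \tilde{\Sigma}^2,
$$
where the tilde-matrix $\tilde{\Sigma}$ is a matrix of dimension p × p containing only the
singular values $\sigma_i$, that is
$$
\tilde{\Sigma} = \begin{bmatrix}
\sigma_0 & 0 & 0 & \dots & 0 & 0 \\
0 & \sigma_1 & 0 & \dots & 0 & 0 \\
0 & 0 & \sigma_2 & \dots & 0 & 0 \\
0 & 0 & 0 & \dots & \sigma_{p-2} & 0 \\
0 & 0 & 0 & \dots & 0 & \sigma_{p-1}
\end{bmatrix},
$$
meaning we can write
$$
X^T X = V\tilde{\Sigma}^2 V^T.
$$
Multiplying from the right with V (using the orthogonality of V) we get
$$
\left(X^T X\right)V = V\tilde{\Sigma}^2.
$$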

What does it mean?
This means the vectors $v_i$ of the orthogonal matrix V are the eigenvectors of
the matrix $X^T X$ with eigenvalues given by the singular values squared, that is
$$
\left(X^T X\right)v_i = v_i\sigma_i^2.
$$
In other words, each non-zero singular value of X is a positive square root of
an eigenvalue of $X^T X$. It means also that the columns of V are the eigenvectors
of $X^T X$. Since we have ordered the singular values of X in a descending order,
it means that the column vectors $v_i$ are hierarchically ordered by how much
correlation they encode from the columns of X.
Note that these are also the eigenvectors and eigenvalues of the Hessian
matrix. Note also that the Hessian matrix we are discussing here is from a cost
function defined by the mean squared error only.
If we now recall the definition of the covariance matrix (not using Bessel's
correction) we have
$$
C[X] = \frac{1}{n} X^T X,
$$
meaning that every squared singular value of X divided by n (the
number of samples) is an eigenvalue of the covariance matrix. Every singular
value of X is thus a non-negative square root of an eigenvalue of $X^T X$. If the matrix
X is self-adjoint, the singular values of X are equal to the absolute value of the
eigenvalues of X.
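As a numerical check of this link, the sketch below builds a centered design matrix from correlated synthetic data (the mixing matrix is a hypothetical choice for illustration), sets up the covariance matrix with the factor 1/n, and compares its eigenvalues with the squared singular values divided by n.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 3

# Correlated features from a hypothetical mixing matrix
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(n, p)) @ mixing
X = X - X.mean(axis=0)  # center each column

# Covariance matrix without Bessel's correction
C = X.T @ X / n
C_numpy = np.cov(X, rowvar=False, ddof=0)

# Singular values of X and eigenvalues of C, both in descending order
S = np.linalg.svd(X, compute_uv=False)
eigenvalues = np.sort(np.linalg.eigvalsh(C))[::-1]

print(eigenvalues)
print(S**2 / n)
```

Note that the argument rowvar=False tells np.cov that the features are stored as columns, in line with our n × p ordering.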

And finally XX T
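For $\boldsymbol{X}\boldsymbol{X}^T$ we found

$$
\boldsymbol{X}\boldsymbol{X}^T = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{U}^T = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T\boldsymbol{U}^T.
$$

Since the matrices here have dimension $n\times n$, we have

$$
\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T = \begin{bmatrix} \tilde{\boldsymbol{\Sigma}} \\ \boldsymbol{0} \end{bmatrix}\begin{bmatrix} \tilde{\boldsymbol{\Sigma}} & \boldsymbol{0} \end{bmatrix} = \begin{bmatrix} \tilde{\boldsymbol{\Sigma}}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix},
$$

leading to

$$
\boldsymbol{X}\boldsymbol{X}^T = \boldsymbol{U}\begin{bmatrix} \tilde{\boldsymbol{\Sigma}}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix}\boldsymbol{U}^T.
$$

Multiplying with $\boldsymbol{U}$ from the right gives us the eigenvalue problem

$$
(\boldsymbol{X}\boldsymbol{X}^T)\boldsymbol{U} = \boldsymbol{U}\begin{bmatrix} \tilde{\boldsymbol{\Sigma}}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix}.
$$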

It means that the eigenvalues of $\boldsymbol{X}\boldsymbol{X}^T$ are again given by the squares of the non-zero singular values, now followed by a series of zeros. The column vectors of $\boldsymbol{U}$ are the eigenvectors of $\boldsymbol{X}\boldsymbol{X}^T$ and measure how much correlation is contained in the rows of $\boldsymbol{X}$.
Since we will mainly be interested in the correlations among the features of our data (the columns of $\boldsymbol{X}$), the quantities of interest for us are the non-zero singular values and the column vectors of $\boldsymbol{V}$.
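The structure of the spectrum of $\boldsymbol{X}\boldsymbol{X}^T$ (the squared singular values followed by zeros) can be illustrated as follows. This is a sketch with a small random matrix with $n > p$; sizes and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)        # arbitrary seed
n, p = 6, 3                            # illustrative sizes with n > p
X = rng.normal(size=(n, p))

# Full SVD, so that U is the complete n x n orthogonal matrix
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Eigenvalues of X X^T: p squared singular values followed by n - p zeros
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]     # descending order
expected = np.concatenate([s**2, np.zeros(n - p)])
print(np.allclose(eigvals, expected))

# The columns of U diagonalize X X^T
D = U.T @ (X @ X.T) @ U
print(np.allclose(D, np.diag(expected)))
```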

Ridge and LASSO Regression


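Let us remind ourselves about the expression for the standard Mean Squared Error (MSE) which we used to define our cost function and the equations for the ordinary least squares (OLS) method, that is our optimization problem is

$$
\min_{\boldsymbol{\beta}\in\mathbb{R}^{p}} \frac{1}{n}\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})\right\},
$$

or we can state it as

$$
\min_{\boldsymbol{\beta}\in\mathbb{R}^{p}} \frac{1}{n}\sum_{i=0}^{n-1}(y_i-\tilde{y}_i)^2 = \frac{1}{n}||\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}||_2^2,
$$

where we have used the definition of the 2-norm of a vector, that is

$$
||\boldsymbol{x}||_2 = \sqrt{\sum_i x_i^2}.
$$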

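By minimizing the above equation with respect to the parameters $\boldsymbol{\beta}$ we could then obtain an analytical expression for the parameters $\boldsymbol{\beta}$. We can add a regularization parameter $\lambda$ by defining a new cost function to be optimized, that is

$$
\min_{\boldsymbol{\beta}\in\mathbb{R}^{p}} \frac{1}{n}||\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}||_2^2 + \lambda||\boldsymbol{\beta}||_2^2,
$$

which leads to the Ridge regression minimization problem where we require that $||\boldsymbol{\beta}||_2^2 \le t$, where $t$ is a finite number larger than zero. By defining

$$
C(\boldsymbol{X},\boldsymbol{\beta}) = \frac{1}{n}||\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}||_2^2 + \lambda||\boldsymbol{\beta}||_1,
$$

we have a new optimization equation

$$
\min_{\boldsymbol{\beta}\in\mathbb{R}^{p}} \frac{1}{n}||\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}||_2^2 + \lambda||\boldsymbol{\beta}||_1,
$$

which leads to Lasso regression. Lasso stands for least absolute shrinkage and selection operator.
Here we have defined the norm-1 as

$$
||\boldsymbol{x}||_1 = \sum_i |x_i|.
$$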
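The three cost functions defined above can be written compactly in code. The sketch below keeps the $1/n$ factor in front of the squared error; the function names (`mse`, `ridge_cost`, `lasso_cost`) and the tiny data set are hypothetical and serve only as an illustration.

```python
import numpy as np

def mse(X, y, beta):
    """Mean squared error (1/n) * ||y - X beta||_2^2."""
    residual = y - X @ beta
    return residual @ residual / len(y)

def ridge_cost(X, y, beta, lmb):
    """MSE plus the squared 2-norm penalty lambda * ||beta||_2^2."""
    return mse(X, y, beta) + lmb * np.sum(beta**2)

def lasso_cost(X, y, beta, lmb):
    """MSE plus the 1-norm penalty lambda * ||beta||_1."""
    return mse(X, y, beta) + lmb * np.sum(np.abs(beta))

# Tiny made-up example
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, -2.0])

print(mse(X, y, beta), ridge_cost(X, y, beta, 0.1), lasso_cost(X, y, beta, 0.1))
```

For $\lambda = 0$ both penalized costs reduce to the plain MSE.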

Deriving the Ridge Regression Equations
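Using the matrix-vector expression for Ridge regression and dropping the factor $1/n$ in front of the standard mean squared error equation, we have

$$
C(\boldsymbol{X},\boldsymbol{\beta}) = (\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}) + \lambda\boldsymbol{\beta}^T\boldsymbol{\beta},
$$

and taking the derivatives with respect to $\boldsymbol{\beta}$ we obtain a slightly modified matrix inversion problem which for $\lambda > 0$ does not suffer from singularity problems. We obtain the optimal parameters

$$
\hat{\boldsymbol{\beta}}_{\mathrm{Ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y},
$$

with $\boldsymbol{I}$ being a $p\times p$ identity matrix, and with the constraint that

$$
\sum_{i=0}^{p-1}\beta_i^2 \le t,
$$

with $t$ a finite positive number.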
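The differentiation step leading to the Ridge solution can be spelled out as follows. This is the standard matrix-calculus computation for the cost function without the $1/n$ factor:

```latex
\frac{\partial C(\boldsymbol{X},\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
  = -2\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}) + 2\lambda\boldsymbol{\beta} = 0
\quad\Longrightarrow\quad
\left(\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I}\right)\boldsymbol{\beta} = \boldsymbol{X}^T\boldsymbol{y}.
```

The second derivative is $2(\boldsymbol{X}^T\boldsymbol{X}+\lambda\boldsymbol{I})$, which is positive definite for $\lambda > 0$, so the stationary point is the unique minimum.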
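If we keep the $1/n$ factor, the equation for the optimal $\boldsymbol{\beta}$ changes to

$$
\hat{\boldsymbol{\beta}}_{\mathrm{Ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X} + n\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}.
$$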
In many textbooks the $1/n$ factor is omitted. Note that a library like Scikit-Learn does not include the $1/n$ factor in the cost function of its Ridge regression.
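The two closed-form Ridge expressions (with and without the $1/n$ factor) can be sketched in a few lines of NumPy. The helper name `ridge_beta`, the synthetic data and the value of $\lambda$ are assumptions made for the illustration; the check simply verifies that the gradient of the corresponding cost function vanishes at the computed solution.

```python
import numpy as np

def ridge_beta(X, y, lmb):
    """Solve (X^T X + lmb * I) beta = X^T y without forming the inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lmb * np.eye(p), X.T @ y)

rng = np.random.default_rng(42)       # arbitrary seed
n, p = 30, 5                           # illustrative sizes
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lmb = 0.5                              # illustrative regularization strength

# Cost without the 1/n factor: (y - X beta)^T (y - X beta) + lmb * beta^T beta
beta = ridge_beta(X, y, lmb)
grad = -2 * X.T @ (y - X @ beta) + 2 * lmb * beta
print(np.allclose(grad, 0))

# Cost with the 1/n factor: the diagonal term becomes n * lmb
beta_n = ridge_beta(X, y, n * lmb)
grad_n = -(2 / n) * X.T @ (y - X @ beta_n) + 2 * lmb * beta_n
print(np.allclose(grad_n, 0))
```

If Scikit-Learn is installed, `Ridge(alpha=lmb, fit_intercept=False)` should reproduce `beta`, since its documented objective is $||\boldsymbol{y}-\boldsymbol{X}\boldsymbol{w}||_2^2 + \alpha||\boldsymbol{w}||_2^2$.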
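When we compare this with the ordinary least squares result we have

$$
\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y},
$$

which requires inverting $\boldsymbol{X}^T\boldsymbol{X}$, a matrix that can be singular or nearly singular. With the SVD, however, we can always compute its pseudoinverse.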
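To make the singularity issue concrete, the sketch below builds a design matrix with two linearly dependent columns, so that $\boldsymbol{X}^T\boldsymbol{X}$ cannot be inverted. It then computes a least-squares solution from the SVD by inverting only the non-negligible singular values (the pseudoinverse), and shows that the Ridge system remains solvable. The cutoff `1e-10` and all data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)        # arbitrary seed
n = 10
x = rng.normal(size=n)
# Second column is exactly twice the first, so X has rank 2 rather than 3
X = np.column_stack([x, 2.0 * x, rng.normal(size=n)])
y = rng.normal(size=n)

rank = np.linalg.matrix_rank(X, tol=1e-10)
print(rank)

# Pseudoinverse solution from the SVD: keep only non-negligible singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
keep = s > 1e-10 * s[0]
beta_svd = Vt[keep].T @ ((U[:, keep].T @ y) / s[keep])

# Agrees with NumPy's built-in pseudoinverse (same relative cutoff)
beta_pinv = np.linalg.pinv(X, rcond=1e-10) @ y
print(np.allclose(beta_svd, beta_pinv))

# Ridge: X^T X + lmb * I is invertible for any lmb > 0
lmb = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lmb * np.eye(3), X.T @ y)
print(beta_ridge)
```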
We see that Ridge regression is nothing but the standard OLS with a modified diagonal term added to $\boldsymbol{X}^T\boldsymbol{X}$. The consequences, in particular for our discussion of the bias-variance tradeoff, are rather interesting. We will see that for specific values of $\lambda$, we may even reduce the variance of the optimal parameters $\boldsymbol{\beta}$. These topics, and other related ones, will be discussed after the more linear algebra oriented analysis here.
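Using our insights about the SVD of the design matrix $\boldsymbol{X}$, we have already analyzed the OLS solutions in terms of the columns of the matrix $\boldsymbol{U}$ of left singular vectors as

$$
\tilde{\boldsymbol{y}}_{\mathrm{OLS}} = \boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{U}\boldsymbol{U}^T\boldsymbol{y},
$$

where $\boldsymbol{U}$ here contains only the first $p$ left singular vectors.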
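For Ridge regression this becomes

$$
\tilde{\boldsymbol{y}}_{\mathrm{Ridge}} = \boldsymbol{X}\boldsymbol{\beta}_{\mathrm{Ridge}} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T\left(\boldsymbol{V}\boldsymbol{\Sigma}^2\boldsymbol{V}^T + \lambda\boldsymbol{I}\right)^{-1}(\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T)^T\boldsymbol{y} = \sum_{j=0}^{p-1}\frac{\sigma_j^2}{\sigma_j^2+\lambda}\boldsymbol{u}_j\boldsymbol{u}_j^T\boldsymbol{y},
$$

with the vectors $\boldsymbol{u}_j$ being the columns of $\boldsymbol{U}$ from the SVD of the matrix $\boldsymbol{X}$.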
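The SVD expression for the Ridge prediction can be compared with the direct matrix solution. The following is a minimal sketch on random data (sizes, seed and $\lambda$ are arbitrary), using the thin SVD so that `U` holds only the first $p$ left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(3)        # arbitrary seed
n, p = 25, 4                           # illustrative sizes
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lmb = 2.0                              # illustrative regularization strength

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # U is n x p

# Direct route: solve the Ridge normal equations, then predict
beta_ridge = np.linalg.solve(X.T @ X + lmb * np.eye(p), X.T @ y)
y_direct = X @ beta_ridge

# SVD route: sum_j sigma_j^2 / (sigma_j^2 + lmb) * u_j u_j^T y
shrink = s**2 / (s**2 + lmb)
y_svd = U @ (shrink * (U.T @ y))
print(np.allclose(y_direct, y_svd))

# With lmb = 0 the factors are all one and we recover the OLS projection U U^T y
y_ols = X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(y_ols, U @ (U.T @ y)))
```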

Interpreting the Ridge results
Since $\lambda \ge 0$, it means that compared to OLS, we have

$$
\frac{\sigma_j^2}{\sigma_j^2+\lambda} \le 1.
$$

Ridge regression finds the coordinates of $\boldsymbol{y}$ with respect to the orthonormal basis $\boldsymbol{U}$; it then shrinks the coordinates by $\frac{\sigma_j^2}{\sigma_j^2+\lambda}$. Recall that the SVD has singular values ordered in a descending way, that is $\sigma_i \ge \sigma_{i+1}$.
For small singular values $\sigma_i$ it means that their contributions become less important, a fact which can be used to reduce the number of degrees of freedom. More about this when we have covered the material on a statistical interpretation of various linear regression methods.
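To get a feeling for the shrinkage, one can tabulate the factors $\sigma_j^2/(\sigma_j^2+\lambda)$ for a few values of $\lambda$. The singular values below are made up for the illustration, and `shrinkage_factors` is a hypothetical helper name. The sum of the factors is what Hastie, Tibshirani and Friedman call the effective degrees of freedom of the Ridge fit.

```python
import numpy as np

def shrinkage_factors(s, lmb):
    """Ridge shrinkage factors sigma_j^2 / (sigma_j^2 + lmb)."""
    return s**2 / (s**2 + lmb)

# Hypothetical singular values, ordered in a descending way
s = np.array([10.0, 3.0, 1.0, 0.1])

for lmb in [0.0, 0.1, 1.0, 10.0]:
    factors = shrinkage_factors(s, lmb)
    # The sum of the factors plays the role of an effective number of parameters
    print(f"lambda = {lmb:5.1f}  factors = {np.round(factors, 3)}  sum = {factors.sum():.3f}")
```

Directions with small singular values are damped first as $\lambda$ grows, while directions with large singular values are almost untouched.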

More interpretations
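For the sake of simplicity, let us assume that the design matrix is orthonormal, that is

$$
\boldsymbol{X}^T\boldsymbol{X} = (\boldsymbol{X}^T\boldsymbol{X})^{-1} = \boldsymbol{I}.
$$

In this case the standard OLS results in

$$
\boldsymbol{\beta}^{\mathrm{OLS}} = \boldsymbol{X}^T\boldsymbol{y} = \sum_{i=0}^{n-1}\boldsymbol{u}_i\boldsymbol{u}_i^T\boldsymbol{y},
$$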

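and

$$
\boldsymbol{\beta}^{\mathrm{Ridge}} = \left(\boldsymbol{I} + \lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y} = \left(1+\lambda\right)^{-1}\boldsymbol{\beta}^{\mathrm{OLS}},
$$

that is the Ridge estimator scales the OLS estimator by the inverse of a factor $1+\lambda$, and the Ridge estimator converges to zero when the hyperparameter goes to infinity.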
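The scaling by $(1+\lambda)^{-1}$ is easy to verify numerically. In the sketch below a design matrix with orthonormal columns is generated from a QR decomposition of a random matrix; this construction, the sizes and the value of $\lambda$ are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)        # arbitrary seed
n, p = 12, 3                           # illustrative sizes
# Reduced QR: Q has orthonormal columns, so it can play the role of X
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = rng.normal(size=n)
lmb = 0.7                              # illustrative regularization strength

print(np.allclose(X.T @ X, np.eye(p)))          # X^T X = I

beta_ols = X.T @ y                               # OLS for an orthonormal design
beta_ridge = np.linalg.solve(X.T @ X + lmb * np.eye(p), X.T @ y)

# Ridge is the OLS estimate divided by 1 + lambda
print(np.allclose(beta_ridge, beta_ols / (1 + lmb)))
```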
We will come back to more interpretations after we have gone through some of the statistical analysis part.
For more discussions of Ridge and Lasso regression, Wessel van Wieringen's article is highly recommended. Similarly, Mehta et al.'s article is also recommended.

Deriving the Lasso Regression Equations


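Using the matrix-vector expression for Lasso regression, we have the following cost function

$$
C(\boldsymbol{X},\boldsymbol{\beta}) = \frac{1}{n}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}) + \lambda||\boldsymbol{\beta}||_1.
$$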

Taking the derivative with respect to $\boldsymbol{\beta}$ and recalling that the derivative of the absolute value is (we drop the boldfaced vector symbol for simplicity)

$$
\frac{d|\beta|}{d\beta} = \mathrm{sgn}(\beta) = \begin{cases} 1 & \beta > 0 \\ -1 & \beta < 0, \end{cases}
$$

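we have that the derivative of the cost function is

$$
\frac{\partial C(\boldsymbol{X},\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\frac{2}{n}\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}) + \lambda\,\mathrm{sgn}(\boldsymbol{\beta}) = 0,
$$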
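and reordering (multiplying by $n/2$) we have

$$
\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} + \frac{n}{2}\lambda\,\mathrm{sgn}(\boldsymbol{\beta}) = \boldsymbol{X}^T\boldsymbol{y}.
$$

We can redefine $\lambda$ to absorb the constant $n/2$ and we rewrite the last equation as

$$
\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} + \lambda\,\mathrm{sgn}(\boldsymbol{\beta}) = \boldsymbol{X}^T\boldsymbol{y}.
$$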
This equation does not lead to a closed-form expression for $\boldsymbol{\beta}$, as was the case for both Ridge regression and ordinary least squares. It can however be solved with standard convex optimization algorithms, using for example the Python package CVXOPT. We will discuss this later.
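As an illustration of how the Lasso problem can be solved iteratively, here is a minimal sketch using proximal gradient descent with soft thresholding (often called ISTA). Note that this is a swapped-in technique chosen because it needs only NumPy; it is not the CVXOPT approach mentioned above. The function names `soft_threshold` and `lasso_ista`, the synthetic data and the number of iterations are all assumptions. The cost function is the one with the $1/n$ factor, before $\lambda$ is redefined.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft thresholding: shrink z towards zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lmb, n_iter=5000):
    """Minimize (1/n)||y - X beta||_2^2 + lmb * ||beta||_1 by proximal gradient."""
    n, p = X.shape
    beta = np.zeros(p)
    # Lipschitz constant of the gradient of the smooth part; step size is 1/L
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    for _ in range(n_iter):
        grad = -2.0 / n * X.T @ (y - X @ beta)        # gradient of the MSE term
        beta = soft_threshold(beta - grad / L, lmb / L)
    return beta

rng = np.random.default_rng(11)       # arbitrary seed
n, p = 40, 6                           # illustrative sizes
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.5])   # made-up sparse truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

lmb = 0.2                              # illustrative regularization strength
beta = lasso_ista(X, y, lmb)
print(beta)

# Optimality check: (2/n) X^T (y - X beta) equals lmb * sgn(beta_j) where
# beta_j != 0, and has absolute value at most lmb where beta_j == 0.
g = 2.0 / n * X.T @ (y - X @ beta)
active = beta != 0
print(np.allclose(g[active], lmb * np.sign(beta[active]), atol=1e-6))
print(np.all(np.abs(g[~active]) <= lmb + 1e-6))
```

The first condition in the check is the stationarity equation derived above; the second is the corresponding condition at $\beta_j = 0$, where the absolute value is not differentiable and any slope between $-1$ and $1$ is allowed.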
