Day 1
Morten Hjorth-Jensen^{1,2}
1. Department of Physics and Center for Computing in Science Education, University of Oslo, Norway
2. Department of Physics and Astronomy and Facility for Rare Isotope Beams, Michigan State University, USA
October 2, 2023
Reading recommendations:
1. These notes
2. Goodfellow, Bengio and Courville, Deep Learning, chapter 2 on linear
algebra and sections 3.1-3.10 on elements of statistics (background)
3. Hastie, Tibshirani and Friedman, The elements of statistical learning,
sections 3.1-3.4 (of relevance for the discussion of linear regression).
4. Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong, Mathematics for
Machine Learning, see chapter 6 in particular for exercises on derivatives,
see https://mml-book.github.io/book/mml-book.pdf
• Easy to code! And links well with classification problems and logistic
regression and neural networks
• Allows for easy hands-on understanding of gradient descent methods
• and many more features
For more discussions of Ridge and Lasso regression, Wessel van Wieringen’s article
is highly recommended. Similarly, Mehta et al’s article is also recommended.
$$y_i = f(x_i) + \epsilon_i,$$
or in general
$$y = f(x) + \epsilon,$$
where $\epsilon$ represents noise, normally assumed to follow a normal distribution with zero mean and variance $\sigma^2$.
In linear regression we approximate the unknown function with another continuous function $\tilde{y}(x)$ which depends linearly on some unknown parameters
$$\beta^T = [\beta_0, \beta_1, \beta_2, \ldots, \beta_{p-1}].$$
Last week we introduced the so-called design matrix in order to define the
approximation ỹ via the unknown quantity β as
ỹ = Xβ,
and in order to find the optimal parameters βi we defined a function which
gives a measure of the spread between the values yi (which represent the output
values we want to reproduce) and the parametrized values ỹi , namely the so-called
cost/loss function.
where $\langle y_i \rangle$ is the mean value. Keep in mind also that until now we have treated $y_i$ as the exact value. Normally, the response (dependent or outcome) variable $y_i$ is the outcome of a numerical experiment or another type of experiment and could thus be treated itself as an approximation to the true value. It is then always accompanied by an error estimate, often limited to a statistical error estimate given by the standard deviation discussed earlier. In the discussion here we will treat $y_i$ as our exact value for the response variable.
In order to find the parameters βi we will then minimize the spread of C(β),
that is we are going to solve the problem
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \Big\{ (y - X\beta)^T (y - X\beta) \Big\}.$$
which results in
$$\frac{\partial C(\beta)}{\partial \beta_j} = -\frac{2}{n} \left[ \sum_{i=0}^{n-1} x_{ij} \left( y_i - \beta_0 x_{i,0} - \beta_1 x_{i,1} - \beta_2 x_{i,2} - \cdots - \beta_{n-1} x_{i,n-1} \right) \right] = 0,
$$
Small question: Do you think the example we have at hand here (the nuclear
binding energies) can lead to problems in inverting the matrix X T X? What
kind of problems can we expect?
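The question can also be explored numerically. Below is a minimal sketch (it does not use the actual binding-energy data; a simple polynomial design matrix built from a stand-in for the mass numbers is assumed) which inspects the condition number of $X^T X$:

import numpy as np

# Stand-in for the mass numbers (assumption, for illustration only)
A = np.linspace(1, 270, 100)
degree = 6
# Polynomial design matrix with columns 1, A, A^2, ..., A^degree
X = np.vander(A, degree + 1, increasing=True)

XtX = X.T @ X
print("condition number of X^T X:", np.linalg.cond(XtX))
# A very large condition number signals that inverting X^T X is numerically fragile.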
Some useful matrix and vector expressions
The following matrix and vector relation will be useful here and for the rest
of the course. Vectors are always written as boldfaced lower case letters and
matrices as upper case boldfaced letters. In the following we will discuss how to
calculate derivatives of various matrices relevant for machine learning. We will
often represent our data in terms of matrices and vectors.
Let us introduce first some conventions. We assume that y is a vector of length
m, that is it has m elements y0 , y1 , . . . , ym−1 . By convention we start labeling
vectors with the zeroth element, as are arrays in Python and C++/C, for example.
Similarly, we have a vector x of length n, that is xT = [x0 , x1 , . . . , xn−1 ].
We assume also that y is a function of x through some given function f
y = f (x).
The Jacobian
We define the partial derivatives of the various components of y as functions of
xi in terms of the so-called Jacobian matrix
$$J = \frac{\partial y}{\partial x} =
\begin{bmatrix}
\frac{\partial y_0}{\partial x_0} & \frac{\partial y_0}{\partial x_1} & \frac{\partial y_0}{\partial x_2} & \ldots & \ldots & \frac{\partial y_0}{\partial x_{n-1}} \\
\frac{\partial y_1}{\partial x_0} & \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \ldots & \ldots & \frac{\partial y_1}{\partial x_{n-1}} \\
\frac{\partial y_2}{\partial x_0} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \ldots & \ldots & \frac{\partial y_2}{\partial x_{n-1}} \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
\frac{\partial y_{m-1}}{\partial x_0} & \frac{\partial y_{m-1}}{\partial x_1} & \frac{\partial y_{m-1}}{\partial x_2} & \ldots & \ldots & \frac{\partial y_{m-1}}{\partial x_{n-1}}
\end{bmatrix},$$
which is an m × n matrix. If x is a scalar, then the Jacobian is only a
single-column vector, or an m × 1 matrix. If on the other hand y is a scalar, the
Jacobian becomes a 1 × n matrix.
When this matrix is a square matrix m = n, its determinant is often referred
to as the Jacobian determinant. Both the matrix and (if m = n) the determinant
are often referred to simply as the Jacobian. The Jacobian matrix represents
the differential of y at every point where the vector is differentiable.
Derivatives, example 1
Let now $y = Ax$, where $A$ is an $m \times n$ matrix which does not depend on $x$. If we write out the vector $y$ component by component we have
$$y_i = \sum_{j=0}^{n-1} a_{ij} x_j,$$
for all $i = 0, 1, 2, \ldots, m-1$. The individual matrix elements of $A$ are given by the symbol $a_{ij}$. It follows that the partial derivatives of $y_i$ with respect to $x_k$ are
$$\frac{\partial y_i}{\partial x_k} = a_{ik} \quad \forall i = 0, 1, 2, \ldots, m-1.$$
From this we have, using the definition of the Jacobian
$$\frac{\partial y}{\partial x} = A.$$
Example 2
We define a scalar (our cost/loss functions are in general also scalars, just think
of the mean squared error) as the result of some matrix vector multiplications
$$\alpha = y^T A x,$$
with y a vector of length m, A an m × n matrix and x a vector of length n. We
assume also that A does not depend on any of the two vectors. In order to find
the derivative of α with respect to the two vectors, we define an intermediate
vector $z$. We define first $z^T = y^T A$, a vector of length $n$. We have then, using the definition of the Jacobian,
$$\alpha = z^T x,$$
which means that (using our previous example) we have
$$\frac{\partial \alpha}{\partial x} = z = A^T y.$$
Note that the resulting vector elements are the same for $z^T$ and $z$; the only difference is that one is just the transpose of the other.
Since $\alpha$ is a scalar we have $\alpha = \alpha^T = x^T A^T y$. Defining now $z^T = x^T A^T$, we find that
$$\frac{\partial \alpha}{\partial y} = z^T = x^T A^T.$$
Example 3
We start with a new scalar where now the vector $y$ is replaced by the vector $x$ and the matrix $A$ is a square matrix of dimension $n \times n$,
$$\alpha = x^T A x,$$
with $x$ a vector of length $n$.
We write out the specific sums involved in the calculation of $\alpha$,
$$\alpha = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} x_i a_{ij} x_j.$$
Taking the derivative of $\alpha$ with respect to a given component $x_k$ we get the two sums
$$\frac{\partial \alpha}{\partial x_k} = \sum_{i=0}^{n-1} a_{ik} x_i + \sum_{j=0}^{n-1} a_{kj} x_j,$$
for all $k = 0, 1, 2, \ldots, n-1$. We identify these sums as
$$\frac{\partial \alpha}{\partial x} = x^T \left( A^T + A \right).$$
If the matrix $A$ is symmetric, that is $A = A^T$, we have
$$\frac{\partial \alpha}{\partial x} = 2 x^T A.$$
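As a quick numerical sanity check of this relation (a minimal sketch with a randomly chosen matrix $A$ and vector $x$), we can compare the analytical gradient $x^T(A^T + A)$ with a finite-difference approximation:

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

# the scalar alpha = x^T A x
alpha = lambda v: v @ A @ v

# analytical gradient: x^T (A^T + A)
grad_analytic = x @ (A.T + A)

# central finite-difference gradient
eps = 1e-6
grad_fd = np.array([(alpha(x + eps*np.eye(n)[k]) - alpha(x - eps*np.eye(n)[k]))/(2*eps)
                    for k in range(n)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))  # expected: True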
Example 4
We let the scalar α be defined by
$$\alpha = y^T x,$$
where both y and x have the same length n, or if we wish to think of them
as column vectors, they have dimensions n × 1. We assume that both y and x
depend on a vector z of the same length. To calculate the derivative of α with
respect to a given component zk we need first to write out the inner product
that defines α as
$$\alpha = \sum_{i=0}^{n-1} y_i x_i,$$
We note that the design matrix $X$ does not depend on the unknown parameters defined by the vector $\beta$. We are now interested in minimizing the cost function with respect to the unknown parameters $\beta$.
The mean squared error is a scalar and if we use the results from example
three above, we can define a new vector
w = y − Xβ,
$$\frac{\partial \mathrm{tr}(BA)}{\partial A} = B^T,$$
$$\frac{\partial \log |A|}{\partial A} = \left( A^{-1} \right)^T.$$
The Hessian matrix plays an important role and is defined here as
$$H = X^T X.$$
For ordinary least squares, it is inversely proportional (derivation next week) to the variance of the optimal parameters $\hat{\beta}$. Furthermore, we will see later this week that it is (apart from the factor $1/n$) equal to the covariance matrix. It also plays a very important role in optimization algorithms and in Principal Component Analysis as a way to reduce the dimensionality of a machine learning/data analysis problem.
Linear algebra question: Can we use the Hessian matrix to say something about properties of the cost function (our optimization problem)? (Hint: think about convex or concave problems and how to relate these to a matrix!)
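As a hint, the eigenvalues of the Hessian can be inspected numerically. A minimal sketch (with a random matrix standing in for a real design matrix):

import numpy as np

np.random.seed(0)
n, p = 100, 5
X = np.random.randn(n, p)            # stand-in design matrix (assumption)

H = 2.0/n * X.T @ X                  # Hessian of the MSE cost function
eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh since H is symmetric
print(eigenvalues)
# All eigenvalues are non-negative: the OLS cost function is convex,
# and strictly convex (unique minimum) when all eigenvalues are positive.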
$$\epsilon = y - \tilde{y} = y - X\beta,$$
and with
$$X^T (y - X\beta) = 0,$$
we have
$$X^T \epsilon = X^T (y - X\beta) = 0,$$
meaning that the solution for $\beta$ is the one which minimizes the residuals.
Here we have five predictors/features. The first is the intercept $\beta_0$. The other terms are $\beta_i$ with $i = 1, 2, 3, 4$. Furthermore, we have $n$ entries for each predictor. This means that our design matrix is an $n \times p$ matrix $X$.
import numpy as np

def MSE(y_data, y_model):
    n = np.size(y_model)
    return np.sum((y_data - y_model)**2)/n

x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)
# and then the design matrix X including the intercept
# The design matrix now as function of a fourth-order polynomial
X = np.zeros((len(x),5))
X[:,0] = 1.0
X[:,1] = x
X[:,2] = x**2
X[:,3] = x**3
X[:,4] = x**4
# Solve the normal equations for the optimal parameters beta
beta = (np.linalg.inv(X.T @ X) @ X.T ) @ y
# and then make the prediction
ytilde = X @ beta
print(MSE(y,ytilde))
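If $X^T X$ is close to singular, the explicit matrix inverse above may fail or become inaccurate. A minimal sketch of two alternatives which, for this well-behaved example, should reproduce essentially the same fit (it continues with the X, y and MSE defined in the code above; the pseudoinverse is based on the SVD discussed later in these notes):

from sklearn.linear_model import LinearRegression

# SVD-based pseudoinverse instead of an explicit matrix inverse
beta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y
print(MSE(y, X @ beta_pinv))

# scikit-learn, without adding an extra intercept (it is already a column of X)
clf = LinearRegression(fit_intercept=False).fit(X, y)
print(MSE(y, clf.predict(X)))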
Splitting our Data in Training and Test data
It is normal in essentially all Machine Learning studies to split the data in a training set and a test set (sometimes also an additional validation set). Scikit-Learn has its own function for this. There is no explicit recipe for how much data should be included as training data and how much as test data. An accepted rule of thumb is to use approximately 2/3 to 4/5 of the data as training data. We will postpone a discussion of this splitting to the end of these notes and our discussion of the so-called bias-variance tradeoff. Here we limit ourselves to repeating the above equation of state fitting example, but now splitting the data into a training set and a test set.
x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)
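A minimal sketch of how the splitting and fitting could proceed with scikit-learn (the full code of the original example is not reproduced here):

import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(2021)
x = np.random.rand(100)
y = 2.0 + 5*x*x + 0.1*np.random.randn(100)

# design matrix up to degree 4, including the intercept column
X = np.column_stack([x**p for p in range(5)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# OLS fit on the training data only
beta = np.linalg.pinv(X_train.T @ X_train) @ X_train.T @ y_train
print("Training MSE:", np.mean((y_train - X_train @ beta)**2))
print("Test MSE:", np.mean((y_test - X_test @ beta)**2))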
Making your own test-train splitting
# equivalently in numpy (inputs and labels are assumed to be numpy arrays)
def train_test_split_numpy(inputs, labels, train_size, test_size):
    n_inputs = len(inputs)
    # shuffle inputs and labels with the *same* permutation so that
    # each input stays paired with its label
    permutation = np.random.permutation(n_inputs)
    inputs_shuffled = inputs[permutation]
    labels_shuffled = labels[permutation]
    train_end = int(n_inputs*train_size)
    X_train, X_test = inputs_shuffled[:train_end], inputs_shuffled[train_end:]
    Y_train, Y_test = labels_shuffled[:train_end], labels_shuffled[train_end:]
    return X_train, X_test, Y_train, Y_test
But since scikit-learn has its own function for doing this and since it
interfaces easily with tensorflow and other libraries, we normally recommend
using the latter functionality.
Many machine learning algorithms are sensitive to the scales of the features and may perform poorly if the features are on very different scales. Therefore, it is typical to scale the features in a way that avoids such outlier values.
Functionality in Scikit-Learn
Scikit-Learn has several functions which allow us to rescale the data, normally resulting in much better results in terms of various accuracy scores. The StandardScaler function in Scikit-Learn ensures that for each feature/predictor we study, the mean value is zero and the variance is one (for every column in the design/feature matrix). This scaling has the drawback that it does not ensure that we have a particular maximum or minimum in our data set. Another function included in Scikit-Learn is the MinMaxScaler, which ensures that all features lie between 0 and 1.
More preprocessing
The Normalizer scales each data point such that the feature vector has a Euclidean length of one. In other words, it projects a data point onto the circle (or sphere in the case of higher dimensions) with a radius of 1. This means every data point is scaled by a different number (by the inverse of its length). This normalization is often used when only the direction (or angle) of the data matters, not the length of the feature vector.
The RobustScaler works similarly to the StandardScaler in that it ensures
statistical properties for each feature that guarantee that they are on the same
scale. However, the RobustScaler uses the median and quartiles, instead of mean
and variance. This makes the RobustScaler ignore data points that are very
different from the rest (like measurement errors). These odd data points are also
called outliers, and might often lead to trouble for other scaling techniques.
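A small sketch comparing these scalers on a toy feature matrix with an outlier can make the differences concrete (the values are chosen for illustration only):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer

# one feature with an outlier, plus a second well-behaved feature
X = np.array([[1.0, 10.0],
              [2.0, 11.0],
              [3.0, 12.0],
              [100.0, 13.0]])   # 100.0 is the outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__)
    print(scaler.fit_transform(X))

# The Normalizer instead rescales each row (data point) to unit Euclidean length
print("Normalizer")
print(Normalizer().fit_transform(X))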
Example of own Standard scaling
Let us consider the following vanilla example where we use both Scikit-Learn and write our own function as well. We produce a simple test design matrix with random numbers. Each column could then represent a specific feature whose mean value is subtracted.
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
import numpy as np
import pandas as pd
from IPython.display import display
np.random.seed(100)
# setting up a 10 x 5 matrix
rows = 10
cols = 5
X = np.random.randn(rows,cols)
XPandas = pd.DataFrame(X)
display(XPandas)
print(XPandas.mean())
print(XPandas.std())
XPandas = (XPandas -XPandas.mean())
display(XPandas)
# This option does not include the standard deviation
scaler = StandardScaler(with_std=False)
scaler.fit(X)
Xscaled = scaler.transform(X)
display(XPandas-Xscaled)
Min-Max Scaling
Another commonly used scaling method is min-max scaling. This is very useful when we want the features to lie in a certain interval. To scale the feature $x_j$ to the interval $[a, b]$, we can apply the transformation
$$x_j^{(i)} \rightarrow (b-a) \frac{x_j^{(i)} - \min(x_j)}{\max(x_j) - \min(x_j)} + a,$$
where $\min(x_j)$ and $\max(x_j)$ return the minimum and maximum value of $x_j$ over the data set, respectively.
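A short sketch of this transformation, both done by hand and with scikit-learn's MinMaxScaler (here for the default interval $[0, 1]$, that is $a = 0$ and $b = 1$):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[2.0], [4.0], [6.0], [10.0]])   # a single feature, toy values

a, b = 0.0, 1.0
x_manual = (b - a)*(x - x.min())/(x.max() - x.min()) + a

x_sklearn = MinMaxScaler(feature_range=(a, b)).fit_transform(x)
print(np.allclose(x_manual, x_sklearn))   # expected: True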
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(2018)
n = 50
maxdegree = 5
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
TestError = np.zeros(maxdegree)
TrainError = np.zeros(maxdegree)
polydegree = np.zeros(maxdegree)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

# Where to save figures and data files (example values, adjust to your own folder layout)
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"

if not os.path.exists(PROJECT_ROOT_DIR):
    os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
    os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
    os.makedirs(DATA_ID)

def image_path(fig_id):
    return os.path.join(FIGURE_ID, fig_id)

def data_path(dat_id):
    return os.path.join(DATA_ID, dat_id)

def save_fig(fig_id):
    plt.savefig(image_path(fig_id) + ".png", format='png')
def FrankeFunction(x,y):
term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
return term1 + term2 + term3 + term4
def create_X(x, y, n ):
if len(x.shape) > 1:
x = np.ravel(x)
y = np.ravel(y)
N = len(x)
l = int((n+1)*(n+2)/2) # Number of elements in beta
X = np.ones((N,l))
for i in range(1,n+1):
q = int((i)*(i+1)/2)
for k in range(i+1):
X[:,q+k] = (x**(i-k))*(y**k)
return X
# X_train, X_test, y_train and y_test are assumed defined by an earlier train/test split (not shown here)
clf = skl.LinearRegression().fit(X_train, y_train)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Feature min values before scaling:\n {}".format(X_train.min(axis=0)))
print("Feature max values before scaling:\n {}".format(X_train.max(axis=0)))
More thinking
If our predictors represent different scales, then it is important to standardize the design matrix $X$ by subtracting the mean of each column from the corresponding column and dividing the column by its standard deviation. Most machine learning libraries do this as a default. This means that if you compare your code with the results from a given library, the results may differ.
The StandardScaler function in Scikit-Learn does this for us. For the data sets we have been studying in our various examples, the data are in many cases already scaled and there is no need to scale them. As a user of different machine learning algorithms, you should always perform a survey of your data, with a critical assessment of whether you need to scale them.
If you need to scale the data, not doing so will give an unfair penalization of the parameters, since their magnitude depends on the scale of the corresponding predictor.
Suppose as an example that you have an input variable given by the heights of different persons. Human height might be measured in inches or meters or kilometers. If measured in kilometers, a standard linear regression model with this predictor would probably give a much bigger coefficient than if measured in millimeters. This can clearly lead to problems in evaluating the cost/loss functions.
Still thinking
Keep in mind that when you transform your data set before training a model, the same transformation needs to be applied to any new data set before making a prediction. Translated into Python code, it could be implemented as follows (note that some lines are commented out since the model function has not been defined):
#Model training, we compute the mean value of y and X
y_train_mean = np.mean(y_train)
X_train_mean = np.mean(X_train,axis=0)
X_train = X_train - X_train_mean
y_train = y_train - y_train_mean
# Then we fit our model with the training data
#trained_model = some_model.fit(X_train,y_train)
#Model prediction, we need also to transform our data set used for the prediction.
X_test = X_test - X_train_mean #Use mean from training data
#y_pred = trained_model(X_test)
#y_pred = y_pred + y_train_mean
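A hedged, self-contained version of the sketch above, using scikit-learn's LinearRegression (with fit_intercept=False, since the centering takes care of the intercept) as a stand-in for some_model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(10)
x = np.random.rand(100)
y = 2.0 + 5*x*x + 0.1*np.random.randn(100)
X = np.column_stack((x, x**2))                 # no intercept column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Model training: compute the mean values of y and X on the training data
y_train_mean = np.mean(y_train)
X_train_mean = np.mean(X_train, axis=0)
trained_model = LinearRegression(fit_intercept=False).fit(X_train - X_train_mean,
                                                          y_train - y_train_mean)

# Model prediction: transform the test data with the *training* means
y_pred = trained_model.predict(X_test - X_train_mean) + y_train_mean
print("Test MSE:", np.mean((y_test - y_pred)**2))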
Recall also that we use the squared value since this leads to an increase of the
penalty for higher differences between predicted and output/target values.
What we have done is to single out the β0 term in the definition of the mean
squared error (MSE). The design matrix X does in this case not contain any
intercept column. When we take the derivative with respect to β0 , we want the
derivative to obey
$$\frac{\partial C}{\partial \beta_j} = 0,$$
for all $j$. For $\beta_0$ we have
$$\frac{\partial C}{\partial \beta_0} = -\frac{2}{n} \sum_{i=0}^{n-1} \left( y_i - \beta_0 - \sum_{j=1}^{p-1} X_{ij} \beta_j \right).$$
Further Manipulations
Let us first specialize to the case where we have only two parameters $\beta_0$ and $\beta_1$. Our result for $\beta_0$ simplifies then to
$$n \beta_0 = \sum_{i=0}^{n-1} y_i - \sum_{i=0}^{n-1} X_{i1} \beta_1.$$
We obtain then
$$\beta_0 = \frac{1}{n} \sum_{i=0}^{n-1} y_i - \beta_1 \frac{1}{n} \sum_{i=0}^{n-1} X_{i1}.$$
If we define
$$\mu_1 = \frac{1}{n} \sum_{i=0}^{n-1} X_{i1},$$
and if we define the mean value of the outputs as
$$\mu_y = \frac{1}{n} \sum_{i=0}^{n-1} y_i,$$
we have
$$\beta_0 = \mu_y - \beta_1 \mu_1.$$
In the general case, that is when we have more parameters than $\beta_0$ and $\beta_1$, we have
$$\beta_0 = \frac{1}{n} \sum_{i=0}^{n-1} y_i - \frac{1}{n} \sum_{i=0}^{n-1} \sum_{j=1}^{p-1} X_{ij} \beta_j.$$
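As a small numerical check of this relation (a minimal sketch with a toy design matrix), the intercept recovered from centered data agrees with the $\beta_0$ obtained from a fit that includes an explicit intercept column:

import numpy as np

np.random.seed(1)
n, p = 100, 4
X = np.random.randn(n, p)                     # features, no intercept column
y = 1.5 + X @ np.array([0.5, -1.0, 2.0, 0.3]) + 0.1*np.random.randn(n)

# Fit with an explicit intercept column
Xb = np.column_stack((np.ones(n), X))
beta_full = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y

# Fit on centered data, then recover the intercept from the means
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta_centered = np.linalg.pinv(Xc.T @ Xc) @ Xc.T @ yc
beta0 = y.mean() - X.mean(axis=0) @ beta_centered

print(np.allclose(beta_full[0], beta0))           # expected: True
print(np.allclose(beta_full[1:], beta_centered))  # expected: True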
Replacing $y_i$ with $y_i - \bar{y}$ and centering also our design matrix results in a cost function (in vector-matrix disguise)
Wrapping it up
If we minimize with respect to β we have then
What does this mean? And why do we insist on all this? Let us look at some
examples.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2021)

def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n

# true_beta, fit_beta and the fitted scikit-learn model skl are defined earlier
# in the full example (not shown here): true_beta holds the polynomial
# coefficients used to generate the data, fit_beta solves the normal equations
# and skl is a LinearRegression(fit_intercept=False) fit.
x = np.linspace(0, 1, 11)
y = np.sum(
    np.asarray([x ** p * b for p, b in enumerate(true_beta)]), axis=0
) + 0.1 * np.random.normal(size=len(x))

degree = 3
X = np.zeros((len(x), degree))
# Include the intercept in the design matrix
for p in range(degree):
    X[:, p] = x ** p

beta = fit_beta(X, y)
plt.figure()
plt.scatter(x, y, label="Data")
plt.plot(x, X @ beta, label="Fit")
plt.plot(x, skl.predict(X), label="Sklearn (fit_intercept=False)")
The intercept is the value of our output/target variable when all our features
are zero and our function crosses the y-axis (for a one-dimensional case).
Printing the MSE, we see first that both methods give the same MSE, as they
should. However, when we move to for example Ridge regression (discussed next
week), the way we treat the intercept may give a larger or smaller MSE, meaning
that the MSE can be penalized by the value of the intercept. Not including the
intercept in the fit means that the regularization term does not include $\beta_0$. For
different values of λ, this may lead to differing MSE values.
To remind the reader, the regularization term with the intercept in Ridge regression is given by
$$\lambda ||\beta||_2^2 = \lambda \sum_{j=0}^{p-1} \beta_j^2.$$
It means that, when scaling the design matrix and the outputs/targets by subtracting the mean values, we have an optimization problem which is not
penalized by the intercept. The MSE value can then be smaller since it focuses
only on the remaining quantities. If we however bring back the intercept, we will
get an MSE which then contains the intercept. This becomes more important
when we discuss Ridge and Lasso regression next week.
We now define a matrix
$$A = X \left( X^T X \right)^{-1} X^T.$$
We can rewrite
$$\tilde{y} = X \hat{\beta} = A y.$$
The matrix A has the important property that A2 = A. This is the definition
of a projection matrix. We can then interpret our optimal model ỹ as being
represented by an orthogonal projection of y onto a space defined by the column
vectors of X. In our case here the matrix A is a square matrix. If it is a general
rectangular matrix we have an oblique projection matrix.
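A quick numerical illustration of the projection property $A^2 = A$ (a minimal sketch with a random full-rank design matrix):

import numpy as np

np.random.seed(0)
n, p = 10, 3
X = np.random.randn(n, p)              # random full-rank design matrix

A = X @ np.linalg.inv(X.T @ X) @ X.T   # the projection matrix
print(np.allclose(A @ A, A))           # A^2 = A, expected: True
print(np.allclose(A.T, A))             # an orthogonal projection matrix is also symmetric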
Residual Error
We have defined the residual error as
$$\epsilon = y - \tilde{y} = \left[ I - X \left( X^T X \right)^{-1} X^T \right] y.$$
The residual errors are then the projections of $y$ onto the orthogonal complement of the space defined by the column vectors of $X$.
Simple case
If the matrix X is an orthogonal (or unitary in case of complex values) matrix,
we have
X T X = XX T = I.
In this case the matrix $A$ becomes
$$A = X \left( X^T X \right)^{-1} X^T = I,$$
and we have
$$\epsilon = y - \tilde{y} = 0.$$
Cholesky decomposition may lead to singularities. We will see examples of this
below.
There is however a way to circumvent this problem and also gain some
insights about the ordinary least squares approach, and later shrinkage methods
like Ridge and Lasso regressions.
This is given by the Singular Value Decomposition (SVD) algorithm, perhaps the most powerful linear algebra algorithm. The SVD provides a matrix decomposition that is used in a large swath of applications and is always numerically stable.
In machine learning it plays a central role in dealing with for example design
matrices that may be near singular or singular. Furthermore, as we will see here,
the singular values can be related to the covariance matrix (and thereby the
correlation matrix) and in turn the variance of a given quantity. It plays also an
important role in the principal component analysis where high-dimensional data
can be reduced to the statistically relevant features.
The columns of $X$ are linearly dependent. We see this easily since the first column is the row-wise sum of the other two columns. The rank (more correctly, the column rank) of a matrix is the dimension of the space spanned by the column vectors. Hence, the rank of $X$ is equal to the number of linearly independent columns. In this particular case the matrix has rank 2.
Super-collinearity of an $(n \times p)$-dimensional design matrix $X$ implies that the matrix $X^T X$ (the matrix we need to invert to solve the linear regression equations) is non-invertible. If we have a square matrix that does not have an inverse, we say that this matrix is singular. The example here demonstrates this:
$$X = \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}.$$
We see easily that $\det(X) = x_{11} x_{22} - x_{12} x_{21} = 1 \times (-1) - (-1) \times 1 = 0$. Hence, $X$ is singular and its inverse is undefined. This is equivalent to saying that the matrix $X$ has at least one eigenvalue which is zero.
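This can be verified directly with numpy (a small sketch):

import numpy as np

X = np.array([[1.0, -1.0],
              [1.0, -1.0]])

print(np.linalg.det(X))           # 0.0: the matrix is singular
print(np.linalg.matrix_rank(X))   # 1: only one linearly independent column
print(np.linalg.eigvals(X))       # the eigenvalues (here both are zero)
# np.linalg.inv(X) would raise a LinAlgError: Singular matrix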
Fixing the singularity
If our design matrix $X$, which enters the linear regression problem
$$\beta = \left( X^T X \right)^{-1} X^T y,$$
has linearly dependent column vectors, we will not be able to compute the inverse of $X^T X$ and we cannot find the parameters (estimators) $\beta_i$. The estimators are only well-defined if $(X^T X)^{-1}$ exists. This is more likely to happen when the matrix $X$ is high-dimensional. In this case it is likely to encounter a situation where the regression parameters $\beta_i$ cannot be estimated.
A cheap ad hoc approach is simply to add a small diagonal component to
the matrix to invert, that is we change
$$X^T X \rightarrow X^T X + \lambda I,$$
where I is the identity matrix. When we discuss Ridge regression this is actually
what we end up evaluating. The parameter λ is called a hyperparameter. More
about this later.
The SVD writes a general matrix $X$ in terms of a diagonal matrix $\Sigma$ of dimension $m \times n$ and two orthogonal matrices $U$ and $V$, where the first has dimensionality $m \times m$ and the last dimensionality $n \times n$. We have then
$$X = U \Sigma V^T.$$
As an example, the above defective matrix can be decomposed as
$$X = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix} \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} = U \Sigma V^T,$$
with singular values $\sigma_1 = 2$ and $\sigma_2 = 0$. The SVD always exists!
The SVD decomposition gives singular values ordered as $\sigma_i \geq \sigma_{i+1}$ for all $i$, and for dimensions larger than $i = p$ the singular values are zero.
In the general case, where our design matrix X has dimension n × p, the
matrix is thus decomposed into an n × n orthogonal matrix U , a p × p orthogonal
matrix V and a diagonal matrix Σ with r = min(n, p) singular values σi ≥ 0 on
the main diagonal and zeros filling the rest of the matrix. There are at most p
singular values assuming that n > p. In our regression examples for the nuclear
masses and the equation of state this is indeed the case, while for the Ising model
we have p > n. These are often cases that lead to near singular or singular
matrices.
The columns of U are called the left singular vectors while the columns of V
are the right singular vectors.
Economy-size SVD
If we assume that n > p, then our matrix U has dimension n × n. The last
n − p columns of U become however irrelevant in our calculations since they are
multiplied with the zeros in Σ.
The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values, Σ, along with the columns in either
U or V that multiply those zeros in the expression. Removing these zeros and
columns can improve execution time and reduce storage requirements without
compromising the accuracy of the decomposition.
If $n > p$, we keep only the first $p$ columns of $U$ and $\Sigma$ has dimension $p \times p$. If $p > n$, then only the first $n$ columns of $V$ are computed and $\Sigma$ has dimension $n \times n$. The $n = p$ case is obvious; we retain the full SVD. In general the economy-size SVD requires fewer floating-point operations while conserving the desired accuracy.
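In numpy the economy-size (thin) SVD is obtained with the full_matrices=False flag; a small sketch showing the resulting shapes:

import numpy as np

np.random.seed(0)
n, p = 6, 3
X = np.random.randn(n, p)

U_full, S, VT = np.linalg.svd(X, full_matrices=True)
U_thin, S_thin, VT_thin = np.linalg.svd(X, full_matrices=False)

print(U_full.shape, U_thin.shape)   # (6, 6) versus (6, 3)
print(S.shape, VT.shape)            # 3 singular values, V^T is 3 x 3
# The thin SVD still reconstructs X exactly:
print(np.allclose(U_thin @ np.diag(S_thin) @ VT_thin, X))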
import numpy as np

def SVD(A):
    # Perform the SVD of A and reconstruct it from U, Sigma and V^T
    U, S, VT = np.linalg.svd(A, full_matrices=True)
    print('test U')
    print( (np.transpose(U) @ U - U @np.transpose(U)))
    print('test VT')
    print( (np.transpose(VT) @ VT - VT @np.transpose(VT)))
    print(U)
    print(S)
    print(VT)
    D = np.zeros((len(U),len(VT)))
    for i in range(0,len(VT)):
        D[i,i]=S[i]
    return U @ D @ VT

# Example matrix whose first column is the row-wise sum of the other two columns
# (values chosen for illustration; the original matrix definition is not shown here)
X = np.array([[1.0, -1.0, 2.0],
              [1.0, 0.0, 1.0],
              [1.0, 2.0, -1.0],
              [1.0, 1.0, 0.0]])
print(X)
C = SVD(X)
# Print the difference between the original matrix and the SVD one
print(C-X)
The matrix $X$ has columns that are linearly dependent: the first column is the row-wise sum of the other two columns. The rank of a matrix (the column rank) is the dimension of the space spanned by the column vectors, that is, the number of linearly independent columns, in this case just 2. We see this from the singular values when running the above code. Running the standard matrix inversion algorithm on $X^T X$ results in the program terminating due to a singular matrix.
Our starting point is our design matrix $X$ of dimension $n \times p$,
$$X = \begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2} & \ldots & \ldots & x_{0,p-1} \\
x_{1,0} & x_{1,1} & x_{1,2} & \ldots & \ldots & x_{1,p-1} \\
x_{2,0} & x_{2,1} & x_{2,2} & \ldots & \ldots & x_{2,p-1} \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
x_{n-2,0} & x_{n-2,1} & x_{n-2,2} & \ldots & \ldots & x_{n-2,p-1} \\
x_{n-1,0} & x_{n-1,1} & x_{n-1,2} & \ldots & \ldots & x_{n-1,p-1}
\end{bmatrix}.$$
$$X = U \Sigma V^T,$$
Example Matrix
As an example, consider the following $3 \times 2$ example for the matrix $\Sigma$:
$$\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}.$$
The singular values are $\sigma_0 = 2$ and $\sigma_1 = 1$. It is common to rewrite the matrix $\Sigma$ as
$$\Sigma = \begin{bmatrix} \tilde{\Sigma} \\ \boldsymbol{0} \end{bmatrix},$$
where
$$\tilde{\Sigma} = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$$
contains only the singular values. Note also (and we will use this below) that
$$\Sigma^T \Sigma = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix},$$
which is a $2 \times 2$ matrix, while
$$\Sigma \Sigma^T = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
is a $3 \times 3$ matrix. The last row and column of this last matrix contain only zeros. This will have important consequences for our SVD decomposition of the design matrix.
$$X^T X = V \Sigma^T U^T U \Sigma V^T,$$
and using the orthogonality of the matrix $U$ we have
$$X^T X = V \Sigma^T \Sigma V^T.$$
We define $\Sigma^T \Sigma = \tilde{\Sigma}^2$, which is a diagonal matrix containing only the singular values squared. It has dimensionality $p \times p$.
We can now insert the result for the matrix $X^T X$ into our equation for ordinary least squares, where
$$\tilde{y}_{\mathrm{OLS}} = X \left( X^T X \right)^{-1} X^T y,$$
and using our SVD decomposition of $X$ we have
$$\tilde{y}_{\mathrm{OLS}} = U \Sigma V^T \left( V \tilde{\Sigma}^2 V^T \right)^{-1} V \Sigma^T U^T y,$$
which gives us, using the orthogonality of the matrix $V$,
$$\tilde{y}_{\mathrm{OLS}} = U U^T y = \sum_{i=0}^{p-1} u_i u_i^T y.$$
It means that the ordinary least squares model (with the optimal parameters) $\tilde{y}$ corresponds to an orthogonal transformation of the output (or target) vector $y$ by the columns of the matrix $U$. Note that the summation ends at $p-1$, that is $\tilde{y} \neq y$. We can thus not use the orthogonality relation for the matrix $U$. This can already be seen when we multiply the matrices $\Sigma^T U^T$.
$$X^T X = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T.$$
If we now multiply from the right with $V$ (using the orthogonality of $V$) we get
$$\left( X^T X \right) V = V \Sigma^T \Sigma.$$
This means that the vectors $v_i$ of the orthogonal matrix $V$ are the eigenvectors of the matrix $X^T X$, with eigenvalues given by the singular values squared, that is
$$X^T X v_i = v_i \sigma_i^2.$$
Similarly,
$$X X^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T.$$
This means that the vectors $u_i$ of the orthogonal matrix $U$ are the eigenvectors of the matrix $X X^T$, with eigenvalues given by the singular values squared, that is
$$X X^T u_i = u_i \sigma_i^2.$$
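These relations are easy to verify numerically; a small sketch comparing the eigenvalues of $X^T X$ and $X X^T$ with the squared singular values of $X$:

import numpy as np

np.random.seed(0)
n, p = 8, 3
X = np.random.randn(n, p)

U, S, VT = np.linalg.svd(X)
eig_XtX = np.linalg.eigvalsh(X.T @ X)    # eigenvalues in ascending order
eig_XXt = np.linalg.eigvalsh(X @ X.T)

print(np.allclose(np.sort(S**2), eig_XtX))   # sigma_i^2 are the eigenvalues of X^T X
print(np.allclose(np.sort(np.concatenate((S**2, np.zeros(n - p)))), eig_XXt))
# X X^T has the same nonzero eigenvalues plus n - p zeros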
$$\frac{\partial^2 C(\beta)}{\partial \beta \partial \beta^T} = \frac{2}{n} X^T X.$$
This quantity defines what is called the Hessian matrix (the second derivative of the function we want to optimize).
The Hessian matrix plays an important role and is defined in this course as
$$H = X^T X.$$
The Hessian matrix for ordinary least squares is also proportional to the
covariance matrix. This means also that we can use the SVD to find the
eigenvalues of the covariance matrix and the Hessian matrix in terms of the
singular values. Let us develop these arguments, as they will play an important
role in our machine learning studies.
Note: we have used $1/n$ in the above definitions of the sample variance and covariance. We assume then that we can calculate the exact mean value. What you will find in essentially all statistics texts are equations with a factor $1/(n-1)$. This is called Bessel's correction. This method corrects the bias in the estimation of the population variance and covariance. It also partially corrects the bias in the estimation of the population standard deviation. If you use a library like Scikit-Learn or numpy's functions to calculate the covariance, this quantity will be computed with a factor $1/(n-1)$.
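A small sketch showing the difference between the $1/n$ and $1/(n-1)$ conventions with numpy (both np.var and np.cov take a ddof argument, the "delta degrees of freedom"):

import numpy as np

np.random.seed(0)
x = np.random.randn(10)

print(np.var(x, ddof=0))   # 1/n convention (the default for np.var)
print(np.var(x, ddof=1))   # 1/(n-1), Bessel's correction
print(np.cov(x, ddof=0))   # np.cov uses 1/(n-1) by default; ddof=0 forces 1/n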
$$\mathrm{corr}[x, y] = \frac{\mathrm{cov}[x, y]}{\sqrt{\mathrm{var}[x] \, \mathrm{var}[y]}}.$$
The correlation function is then given by values $\mathrm{corr}[x, y] \in [-1, 1]$. This avoids eventual problems with too large values. We can then define the correlation matrix for the two vectors $x$ and $y$ as
$$K[x, y] = \begin{bmatrix} 1 & \mathrm{corr}[x, y] \\ \mathrm{corr}[y, x] & 1 \end{bmatrix},$$
In the above example this is the function we constructed using pandas.
$$x_i^T = \begin{bmatrix} x_{0,i} & x_{1,i} & x_{2,i} & \ldots & \ldots & x_{n-1,i} \end{bmatrix}.$$
$$C[x] = \begin{bmatrix}
\mathrm{var}[x_0] & \mathrm{cov}[x_0, x_1] & \mathrm{cov}[x_0, x_2] & \ldots & \ldots & \mathrm{cov}[x_0, x_{p-1}] \\
\mathrm{cov}[x_1, x_0] & \mathrm{var}[x_1] & \mathrm{cov}[x_1, x_2] & \ldots & \ldots & \mathrm{cov}[x_1, x_{p-1}] \\
\mathrm{cov}[x_2, x_0] & \mathrm{cov}[x_2, x_1] & \mathrm{var}[x_2] & \ldots & \ldots & \mathrm{cov}[x_2, x_{p-1}] \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
\mathrm{cov}[x_{p-1}, x_0] & \mathrm{cov}[x_{p-1}, x_1] & \mathrm{cov}[x_{p-1}, x_2] & \ldots & \ldots & \mathrm{var}[x_{p-1}]
\end{bmatrix},$$
and the correlation matrix
$$K[x] = \begin{bmatrix}
1 & \mathrm{corr}[x_0, x_1] & \mathrm{corr}[x_0, x_2] & \ldots & \ldots & \mathrm{corr}[x_0, x_{p-1}] \\
\mathrm{corr}[x_1, x_0] & 1 & \mathrm{corr}[x_1, x_2] & \ldots & \ldots & \mathrm{corr}[x_1, x_{p-1}] \\
\mathrm{corr}[x_2, x_0] & \mathrm{corr}[x_2, x_1] & 1 & \ldots & \ldots & \mathrm{corr}[x_2, x_{p-1}] \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
\mathrm{corr}[x_{p-1}, x_0] & \mathrm{corr}[x_{p-1}, x_1] & \mathrm{corr}[x_{p-1}, x_2] & \ldots & \ldots & 1
\end{bmatrix},$$
which in turn is converted into the $2 \times 2$ covariance matrix $C$ via the Numpy function np.cov(). We note that we can also calculate the mean value of each set of samples $x$ etc. using the Numpy function np.mean(x). We can also extract the eigenvalues of the covariance matrix through the np.linalg.eig() function.
# Importing various packages
import numpy as np
n = 100
x = np.random.normal(size=n)
print(np.mean(x))
y = 4+3*x+np.random.normal(size=n)
print(np.mean(y))
W = np.vstack((x, y))
C = np.cov(W)
print(C)
Correlation Matrix
The previous example can be converted into the correlation matrix by simply scaling the matrix elements with the variances. We should also subtract the mean values for each column. This leads to the following code, which sets up the correlation matrix for the previous example in a more brute-force way. Here we subtract the mean values for each column of the design matrix, calculate the relevant mean values and variances and then finally set up the $2 \times 2$ correlation matrix (since we have only two vectors).
import numpy as np
n = 100
# define two vectors
x = np.random.random(size=n)
y = 4+3*x+np.random.normal(size=n)
#scaling the x and y vectors
x = x - np.mean(x)
y = y - np.mean(y)
variance_x = np.sum(x@x)/n
variance_y = np.sum(y@y)/n
print(variance_x)
print(variance_y)
cov_xy = np.sum(x@y)/n
cov_xx = np.sum(x@x)/n
cov_yy = np.sum(y@y)/n
C = np.zeros((2,2))
C[0,0]= cov_xx/variance_x
C[1,1]= cov_yy/variance_y
C[0,1]= cov_xy/np.sqrt(variance_y*variance_x)
C[1,0]= C[0,1]
print(C)
We see that the matrix elements along the diagonal are one as they should
be and that the matrix is symmetric. Furthermore, diagonalizing this matrix we
easily see that it is a positive definite matrix.
The above procedure with numpy can be made more compact if we use
pandas.
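A hedged sketch of how the pandas version could look for the same two vectors, using the DataFrame corr() method:

import numpy as np
import pandas as pd

n = 100
x = np.random.random(size=n)
y = 4 + 3*x + np.random.normal(size=n)

df = pd.DataFrame({'x': x, 'y': y})
print(df.corr())    # the 2 x 2 correlation matrix, computed directly by pandas
print(df.cov())     # and the corresponding covariance matrix (1/(n-1) convention)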
def FrankeFunction(x,y):
term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
return term1 + term2 + term3 + term4
def create_X(x, y, n ):
if len(x.shape) > 1:
x = np.ravel(x)
y = np.ravel(y)
N = len(x)
l = int((n+1)*(n+2)/2) # Number of elements in beta
X = np.ones((N,l))
for i in range(1,n+1):
q = int((i)*(i+1)/2)
for k in range(i+1):
X[:,q+k] = (x**(i-k))*(y**k)
return X
We note here that the covariance is zero for the first row and column since all elements of the first column of the design matrix are equal to one (we are fitting the function in terms of a polynomial of degree $n$). This means that the variance for these elements will be zero and will cause problems when we set up the correlation matrix. We can simply drop these elements and construct a correlation matrix without them.
To see this let us simply look at a design matrix X ∈ R2×2
$$X = \begin{bmatrix} x_{00} & x_{01} \\ x_{10} & x_{11} \end{bmatrix} = \begin{bmatrix} x_0 & x_1 \end{bmatrix}.$$
If we then compute the expectation value (note the $1/n$ factor instead of $1/(n-1)$),
$$\mathbb{E}[X^T X] = \frac{1}{n} X^T X = \frac{1}{n} \begin{bmatrix} x_{00}^2 + x_{10}^2 & x_{00} x_{01} + x_{10} x_{11} \\ x_{01} x_{00} + x_{11} x_{10} & x_{01}^2 + x_{11}^2 \end{bmatrix},$$
which is just
$$C[x_0, x_1] = C[x] = \begin{bmatrix} \mathrm{var}[x_0] & \mathrm{cov}[x_0, x_1] \\ \mathrm{cov}[x_1, x_0] & \mathrm{var}[x_1] \end{bmatrix},$$
where we wrote
C[x0 , x1 ] = C[x]
to indicate that this is the covariance of the vectors x of the design/feature
matrix X.
It is easy to generalize this to a matrix X ∈ Rn×p .
$$X^T X = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T.$$
Since the matrices here have dimension $p \times p$, with $p$ corresponding to the number of singular values, we have, using the matrix defined earlier,
$$\Sigma^T \Sigma = \begin{bmatrix} \tilde{\Sigma} & \boldsymbol{0} \end{bmatrix} \begin{bmatrix} \tilde{\Sigma} \\ \boldsymbol{0} \end{bmatrix} = \tilde{\Sigma}^2,$$
and thereby
$$\left( X^T X \right) V = V \tilde{\Sigma}^2.$$
What does it mean?
This means the vectors vi of the orthogonal matrix V are the eigenvectors of
the matrix X T X with eigenvalues given by the singular values squared, that is
$$X^T X v_i = v_i \sigma_i^2.$$
And finally $XX^T$
For $XX^T$ we found
$$X X^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T.$$
Since the matrices here have dimension $n \times n$, we have
$$\Sigma \Sigma^T = \begin{bmatrix} \tilde{\Sigma} \\ \boldsymbol{0} \end{bmatrix} \begin{bmatrix} \tilde{\Sigma} & \boldsymbol{0} \end{bmatrix} = \begin{bmatrix} \tilde{\Sigma}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix},$$
leading to
$$X X^T = U \begin{bmatrix} \tilde{\Sigma}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} U^T.$$
Multiplying with $U$ from the right gives us the eigenvalue problem
$$\left( X X^T \right) U = U \begin{bmatrix} \tilde{\Sigma}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix}.$$
eigenvectors of $XX^T$ and measure how much correlation is contained in the rows of $X$.
Since we will mainly be interested in the correlations among the features of our data (the columns of $X$), the quantities of interest for us are the non-zero singular values and the column vectors of $V$.
which leads to Lasso regression. Lasso stands for least absolute shrinkage and
selection operator.
Here we have defined the norm-1 as
$$||x||_1 = \sum_i |x_i|.$$
Deriving the Ridge Regression Equations
Using the matrix-vector expression for Ridge regression and dropping the parameter $1/n$ in front of the standard mean squared error equation, we have
and taking the derivatives with respect to β we obtain then a slightly modified
matrix inversion problem which for finite values of λ does not suffer from
singularity problems. We obtain the optimal parameters
$$\hat{\beta}_{\mathrm{Ridge}} = \left( X^T X + \lambda I \right)^{-1} X^T y,$$
with $I$ being a $p \times p$ identity matrix, with the constraint that
$$\sum_{i=0}^{p-1} \beta_i^2 \leq t,$$
with $t$ a finite positive number.
If we keep the $1/n$ factor, the equation for the optimal $\beta$ changes to
$$\hat{\beta}_{\mathrm{Ridge}} = \left( X^T X + n\lambda I \right)^{-1} X^T y.$$
In many textbooks the $1/n$ term is often omitted. Note that a library like Scikit-Learn does not include the $1/n$ factor in the setup of the cost function.
When we compare this with the ordinary least squares result we have
$$\hat{\beta}_{\mathrm{OLS}} = \left( X^T X \right)^{-1} X^T y,$$
which can lead to singular matrices. However, with the SVD, we can always
compute the inverse of the matrix X T X.
We see that Ridge regression is nothing but the standard OLS with a modified
diagonal term added to X T X. The consequences, in particular for our discussion
of the bias-variance tradeoff are rather interesting. We will see that for specific
values of λ, we may even reduce the variance of the optimal parameters β. These
topics and other related ones, will be discussed after the more linear algebra
oriented analysis here.
Using our insights about the SVD of the design matrix $X$, we have already analyzed the OLS solutions in terms of the left singular vectors, the columns of the matrix $U$, as
$$\tilde{y}_{\mathrm{OLS}} = X\beta = U U^T y.$$
For Ridge regression this becomes
$$\tilde{y}_{\mathrm{Ridge}} = X \beta_{\mathrm{Ridge}} = U \Sigma V^T \left( V \Sigma^2 V^T + \lambda I \right)^{-1} (U \Sigma V^T)^T y = \sum_{j=0}^{p-1} u_j u_j^T \frac{\sigma_j^2}{\sigma_j^2 + \lambda} y,$$
with the vectors uj being the columns of U from the SVD of the matrix X.
Interpreting the Ridge results
Since $\lambda \geq 0$, it means that compared to OLS, we have
$$\frac{\sigma_j^2}{\sigma_j^2 + \lambda} \leq 1.$$
Ridge regression finds the coordinates of $y$ with respect to the orthonormal basis $U$, and it then shrinks the coordinates by the factors $\sigma_j^2/(\sigma_j^2 + \lambda)$. Recall that the SVD has singular values ordered in a descending way, that is $\sigma_i \geq \sigma_{i+1}$.
For small singular values $\sigma_i$ this means that their contributions become less important, a fact which can be used to reduce the number of degrees of freedom. More about this when we have covered the material on a statistical interpretation of various linear regression methods.
More interpretations
For the sake of simplicity, let us assume that the design matrix is orthonormal,
that is
$$X^T X = (X^T X)^{-1} = I.$$
In this case the standard OLS results in
$$\beta^{\mathrm{OLS}} = X^T y = \sum_{i=0}^{n-1} u_i u_i^T y,$$
and
$$\beta^{\mathrm{Ridge}} = (I + \lambda I)^{-1} X^T y = (1 + \lambda)^{-1} \beta^{\mathrm{OLS}},$$
that is the Ridge estimator scales the OLS estimator by the inverse of a factor
1 + λ, and the Ridge estimator converges to zero when the hyperparameter goes
to infinity.
We will come back to more interpretations after we have gone through some of the statistical analysis part.
For more discussions of Ridge and Lasso regression, Wessel van Wierin-
gen’s article is highly recommended. Similarly, Mehta et al’s article is also
recommended.
Taking the derivative with respect to $\beta$ and recalling that the derivative of the absolute value is (we drop the boldfaced vector symbol for simplicity)
$$\frac{d|\beta|}{d\beta} = \mathrm{sgn}(\beta) = \begin{cases} 1 & \beta > 0 \\ -1 & \beta < 0, \end{cases}$$
we have
$$\frac{\partial C(X, \beta)}{\partial \beta} = -\frac{2}{n} X^T (y - X\beta) + \lambda \, \mathrm{sgn}(\beta) = 0,$$
and reordering we have
$$X^T X \beta + \frac{n\lambda}{2} \mathrm{sgn}(\beta) = X^T y.$$
We can redefine $\lambda$ to absorb the constant $n/2$ and rewrite the last equation as
$$X^T X \beta + \lambda \, \mathrm{sgn}(\beta) = X^T y.$$
This equation does not lead to a nice analytical equation as in either Ridge
regression or ordinary least squares. This equation can however be solved by
using standard convex optimization algorithms using for example the Python
package CVXOPT. We will discuss this later.
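While we come back to proper optimization algorithms later, a hedged sketch using scikit-learn's Lasso class (which solves this problem with coordinate descent) already illustrates the shrinkage and the variable selection:

import numpy as np
from sklearn.linear_model import Lasso

np.random.seed(0)
n, p = 100, 5
X = np.random.randn(n, p)
# only two of the five features actually matter in this toy example
y = 3.0*X[:, 0] - 2.0*X[:, 1] + 0.1*np.random.randn(n)

for lam in (0.01, 0.1, 1.0):
    clf = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
    print(lam, clf.coef_)   # larger alpha drives more coefficients exactly to zero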