Day 1
Morten Hjorth-Jensen^{1,2}
1. Department of Physics and Center for Computing in Science Education, University of Oslo, Norway
2. Department of Physics and Astronomy and Facility for Rare Isotope Beams, Michigan State University, USA
October 2, 2023
Reading recommendations:
1. These notes
2. Goodfellow, Bengio and Courville, Deep Learning, chapter 2 on linear
algebra and sections 3.1-3.10 on elements of statistics (background)
3. Hastie, Tibshirani and Friedman, The elements of statistical learning,
sections 3.1-3.4 (of relevance for the discussion of linear regression).
4. Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong, Mathematics for
Machine Learning, see chapter 6 in particular for exercises on derivatives,
see https://mml-book.github.io/book/mml-book.pdf
• Easy to code! And links well with classification problems and logistic
regression and neural networks
• Allows for easy hands-on understanding of gradient descent methods
• and many more features
For more discussions of Ridge and Lasso regression, Wessel van Wieringen’s article
is highly recommended. Similarly, Mehta et al’s article is also recommended.
$$y_i = f(x_i) + \epsilon_i,$$
or in general
$$y = f(x) + \epsilon,$$
where $\epsilon$ represents noise, normally assumed to follow a normal distribution with zero mean and variance $\sigma^2$.
In linear regression we approximate the unknown function with another continuous function $\tilde{y}(x)$ which depends linearly on some unknown parameters
$$\beta^T = [\beta_0, \beta_1, \beta_2, \ldots, \beta_{p-1}].$$
Last week we introduced the so-called design matrix in order to define the
approximation ỹ via the unknown quantity β as
ỹ = Xβ,
and in order to find the optimal parameters βi we defined a function which
gives a measure of the spread between the values yi (which represent the output
values we want to reproduce) and the parametrized values ỹi , namely the so-called
cost/loss function.
where $\langle y_i \rangle$ is the mean value. Keep in mind also that until now we have treated $y_i$ as the exact value. Normally, the response (dependent or outcome) variable $y_i$ is the outcome of a numerical experiment or another type of experiment and could thus be treated itself as an approximation to the true value. It is then always accompanied by an error estimate, often limited to a statistical error estimate given by the standard deviation discussed earlier. In the discussion here we will treat $y_i$ as our exact value for the response variable.
In order to find the parameters βi we will then minimize the spread of C(β),
that is we are going to solve the problem
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \Big\{ (y - X\beta)^T (y - X\beta) \Big\}.$$
which results in
$$\frac{\partial C(\beta)}{\partial \beta_j} = -\frac{2}{n} \left[ \sum_{i=0}^{n-1} x_{ij} \left( y_i - \beta_0 x_{i,0} - \beta_1 x_{i,1} - \beta_2 x_{i,2} - \cdots - \beta_{n-1} x_{i,n-1} \right) \right] = 0,
$$
Small question: Do you think the example we have at hand here (the nuclear
binding energies) can lead to problems in inverting the matrix X T X? What
kind of problems can we expect?
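The question can also be explored numerically. Below is a minimal sketch (it does not use the actual binding-energy data; a simple polynomial design matrix built from a stand-in for the mass numbers is assumed) which inspects the condition number of $X^T X$:

import numpy as np

# Stand-in for the mass numbers (assumption, for illustration only)
A = np.linspace(1, 270, 100)
degree = 6
# Polynomial design matrix with columns 1, A, A^2, ..., A^degree
X = np.vander(A, degree + 1, increasing=True)

XtX = X.T @ X
print("condition number of X^T X:", np.linalg.cond(XtX))
# A very large condition number signals that inverting X^T X is numerically fragile.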
Some useful matrix and vector expressions
The following matrix and vector relation will be useful here and for the rest
of the course. Vectors are always written as boldfaced lower case letters and
matrices as upper case boldfaced letters. In the following we will discuss how to
calculate derivatives of various matrices relevant for machine learning. We will
often represent our data in terms of matrices and vectors.
Let us introduce first some conventions. We assume that y is a vector of length
m, that is it has m elements y0 , y1 , . . . , ym−1 . By convention we start labeling
vectors with the zeroth element, as are arrays in Python and C++/C, for example.
Similarly, we have a vector x of length n, that is xT = [x0 , x1 , . . . , xn−1 ].
We assume also that y is a function of x through some given function f
y = f (x).
The Jacobian
We define the partial derivatives of the various components of y as functions of
xi in terms of the so-called Jacobian matrix
$$J = \frac{\partial y}{\partial x} =
\begin{bmatrix}
\frac{\partial y_0}{\partial x_0} & \frac{\partial y_0}{\partial x_1} & \frac{\partial y_0}{\partial x_2} & \ldots & \ldots & \frac{\partial y_0}{\partial x_{n-1}} \\
\frac{\partial y_1}{\partial x_0} & \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \ldots & \ldots & \frac{\partial y_1}{\partial x_{n-1}} \\
\frac{\partial y_2}{\partial x_0} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \ldots & \ldots & \frac{\partial y_2}{\partial x_{n-1}} \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
\frac{\partial y_{m-1}}{\partial x_0} & \frac{\partial y_{m-1}}{\partial x_1} & \frac{\partial y_{m-1}}{\partial x_2} & \ldots & \ldots & \frac{\partial y_{m-1}}{\partial x_{n-1}}
\end{bmatrix},$$
which is an m × n matrix. If x is a scalar, then the Jacobian is only a
single-column vector, or an m × 1 matrix. If on the other hand y is a scalar, the
Jacobian becomes a 1 × n matrix.
When this matrix is a square matrix m = n, its determinant is often referred
to as the Jacobian determinant. Both the matrix and (if m = n) the determinant
are often referred to simply as the Jacobian. The Jacobian matrix represents
the differential of y at every point where the vector is differentiable.
Derivatives, example 1
Let now $y = Ax$, where $A$ is an $m \times n$ matrix which does not depend on $x$. If we write out the vector $y$ component by component we have
$$y_i = \sum_{j=0}^{n-1} a_{ij} x_j,$$
for all $i = 0, 1, 2, \ldots, m-1$. The individual matrix elements of $A$ are given by the symbol $a_{ij}$. It follows that the partial derivatives of $y_i$ with respect to $x_k$ are
$$\frac{\partial y_i}{\partial x_k} = a_{ik} \quad \forall i = 0, 1, 2, \ldots, m-1.$$
From this we have, using the definition of the Jacobian
$$\frac{\partial y}{\partial x} = A.$$
Example 2
We define a scalar (our cost/loss functions are in general also scalars, just think
of the mean squared error) as the result of some matrix vector multiplications
$$\alpha = y^T A x,$$
with y a vector of length m, A an m × n matrix and x a vector of length n. We
assume also that A does not depend on any of the two vectors. In order to find
the derivative of α with respect to the two vectors, we define an intermediate
vector $z$. We define first $z^T = y^T A$, a vector of length $n$. We have then, using the definition of the Jacobian,
$$\alpha = z^T x,$$
which means that (using our previous example) we have
$$\frac{\partial \alpha}{\partial x} = z = A^T y.$$
Note that the resulting vector elements are the same for $z^T$ and $z$; the only difference is that one is just the transpose of the other.
Since $\alpha$ is a scalar we have $\alpha = \alpha^T = x^T A^T y$. Defining now $z^T = x^T A^T$, we find that
$$\frac{\partial \alpha}{\partial y} = z^T = x^T A^T.$$
Example 3
We start with a new scalar where now the vector $y$ is replaced by the vector $x$ and the matrix $A$ is a square matrix of dimension $n \times n$,
$$\alpha = x^T A x,$$
with $x$ a vector of length $n$.
We write out the specific sums involved in the calculation of $\alpha$,
$$\alpha = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} x_i a_{ij} x_j.$$
Taking the derivative of $\alpha$ with respect to a given component $x_k$ we get the two sums
$$\frac{\partial \alpha}{\partial x_k} = \sum_{i=0}^{n-1} a_{ik} x_i + \sum_{j=0}^{n-1} a_{kj} x_j,$$
for all $k = 0, 1, 2, \ldots, n-1$. We identify these sums as
$$\frac{\partial \alpha}{\partial x} = x^T \left( A^T + A \right).$$
If the matrix $A$ is symmetric, that is $A = A^T$, we have
$$\frac{\partial \alpha}{\partial x} = 2 x^T A.$$
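As a quick numerical sanity check of this relation (a minimal sketch with a randomly chosen matrix $A$ and vector $x$), we can compare the analytical gradient $x^T(A^T + A)$ with a finite-difference approximation:

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

# the scalar alpha = x^T A x
alpha = lambda v: v @ A @ v

# analytical gradient: x^T (A^T + A)
grad_analytic = x @ (A.T + A)

# central finite-difference gradient
eps = 1e-6
grad_fd = np.array([(alpha(x + eps*np.eye(n)[k]) - alpha(x - eps*np.eye(n)[k]))/(2*eps)
                    for k in range(n)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))  # expected: True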
Example 4
We let the scalar α be defined by
$$\alpha = y^T x,$$
where both y and x have the same length n, or if we wish to think of them
as column vectors, they have dimensions n × 1. We assume that both y and x
depend on a vector z of the same length. To calculate the derivative of α with
respect to a given component zk we need first to write out the inner product
that defines α as
$$\alpha = \sum_{i=0}^{n-1} y_i x_i,$$
We note that the design matrix $X$ does not depend on the unknown parameters defined by the vector $\beta$. We are now interested in minimizing the cost function with respect to the unknown parameters $\beta$.
The mean squared error is a scalar and if we use the results from example
three above, we can define a new vector
w = y − Xβ,
$$\frac{\partial \mathrm{tr}(BA)}{\partial A} = B^T,$$
$$\frac{\partial \log |A|}{\partial A} = \left( A^{-1} \right)^T.$$
The Hessian matrix plays an important role and is defined here as
$$H = X^T X.$$
For ordinary least squares, it is inversely proportional (derivation next week) to the variance of the optimal parameters $\hat{\beta}$. Furthermore, we will see later this week that it is (apart from the factor $1/n$) equal to the covariance matrix. It also plays a very important role in optimization algorithms and in Principal Component Analysis as a way to reduce the dimensionality of a machine learning/data analysis problem.
Linear algebra question: Can we use the Hessian matrix to say something about properties of the cost function (our optimization problem)? (Hint: think about convex or concave problems and how to relate these to a matrix!)
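As a hint, the eigenvalues of the Hessian can be inspected numerically. A minimal sketch (with a random matrix standing in for a real design matrix):

import numpy as np

np.random.seed(0)
n, p = 100, 5
X = np.random.randn(n, p)            # stand-in design matrix (assumption)

H = 2.0/n * X.T @ X                  # Hessian of the MSE cost function
eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh since H is symmetric
print(eigenvalues)
# All eigenvalues are non-negative: the OLS cost function is convex,
# and strictly convex (unique minimum) when all eigenvalues are positive.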
$$\epsilon = y - \tilde{y} = y - X\beta,$$
and with
$$X^T (y - X\beta) = 0,$$
we have
$$X^T \epsilon = X^T (y - X\beta) = 0,$$
meaning that the solution for $\beta$ is the one which minimizes the residuals.
Here we have five predictors/features. The first is the intercept $\beta_0$. The other terms are $\beta_i$ with $i = 1, 2, 3, 4$. Furthermore, we have $n$ entries for each predictor. This means that our design matrix is an $n \times p$ matrix $X$.
import numpy as np

def MSE(y_data, y_model):
    n = np.size(y_model)
    return np.sum((y_data - y_model)**2)/n

x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)
# and then the design matrix X including the intercept
# The design matrix now as function of a fourth-order polynomial
X = np.zeros((len(x),5))
X[:,0] = 1.0
X[:,1] = x
X[:,2] = x**2
X[:,3] = x**3
X[:,4] = x**4
# Solve the normal equations for the optimal parameters beta
beta = (np.linalg.inv(X.T @ X) @ X.T ) @ y
# and then make the prediction
ytilde = X @ beta
print(MSE(y,ytilde))
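If $X^T X$ is close to singular, the explicit matrix inverse above may fail or become inaccurate. A minimal sketch of two alternatives which, for this well-behaved example, should reproduce essentially the same fit (it continues with the X, y and MSE defined in the code above; the pseudoinverse is based on the SVD discussed later in these notes):

from sklearn.linear_model import LinearRegression

# SVD-based pseudoinverse instead of an explicit matrix inverse
beta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y
print(MSE(y, X @ beta_pinv))

# scikit-learn, without adding an extra intercept (it is already a column of X)
clf = LinearRegression(fit_intercept=False).fit(X, y)
print(MSE(y, clf.predict(X)))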
Splitting our Data in Training and Test data
It is normal in essentially all Machine Learning studies to split the data in a training set and a test set (sometimes also an additional validation set). Scikit-Learn has its own function for this. There is no explicit recipe for how much data should be included as training data and how much as test data. An accepted rule of thumb is to use approximately 2/3 to 4/5 of the data as training data. We will postpone a discussion of this splitting to the end of these notes and our discussion of the so-called bias-variance tradeoff. Here we limit ourselves to repeating the above equation of state fitting example, but now splitting the data into a training set and a test set.
x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)
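A minimal sketch of how the splitting and fitting could proceed with scikit-learn (the full code of the original example is not reproduced here):

import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(2021)
x = np.random.rand(100)
y = 2.0 + 5*x*x + 0.1*np.random.randn(100)

# design matrix up to degree 4, including the intercept column
X = np.column_stack([x**p for p in range(5)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# OLS fit on the training data only
beta = np.linalg.pinv(X_train.T @ X_train) @ X_train.T @ y_train
print("Training MSE:", np.mean((y_train - X_train @ beta)**2))
print("Test MSE:", np.mean((y_test - X_test @ beta)**2))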
Making your own test-train splitting
# equivalently in numpy (inputs and labels are assumed to be numpy arrays)
def train_test_split_numpy(inputs, labels, train_size, test_size):
    n_inputs = len(inputs)
    # shuffle inputs and labels with the *same* permutation so that
    # each input stays paired with its label
    permutation = np.random.permutation(n_inputs)
    inputs_shuffled = inputs[permutation]
    labels_shuffled = labels[permutation]
    train_end = int(n_inputs*train_size)
    X_train, X_test = inputs_shuffled[:train_end], inputs_shuffled[train_end:]
    Y_train, Y_test = labels_shuffled[:train_end], labels_shuffled[train_end:]
    return X_train, X_test, Y_train, Y_test
But since scikit-learn has its own function for doing this and since it
interfaces easily with tensorflow and other libraries, we normally recommend
using the latter functionality.
Many machine learning algorithms are sensitive to the scales of the features and may perform poorly if the features are on very different scales. Therefore, it is typical to scale the features in a way that avoids such outlier values.
Functionality in Scikit-Learn
Scikit-Learn has several functions which allow us to rescale the data, normally resulting in much better results in terms of various accuracy scores. The StandardScaler function in Scikit-Learn ensures that for each feature/predictor we study, the mean value is zero and the variance is one (for every column in the design/feature matrix). This scaling has the drawback that it does not ensure that we have a particular maximum or minimum in our data set. Another function included in Scikit-Learn is the MinMaxScaler, which ensures that all features lie between 0 and 1.
More preprocessing
The Normalizer scales each data point such that the feature vector has a Euclidean length of one. In other words, it projects a data point onto the circle (or sphere in the case of higher dimensions) with a radius of 1. This means every data point is scaled by a different number (by the inverse of its length). This normalization is often used when only the direction (or angle) of the data matters, not the length of the feature vector.
The RobustScaler works similarly to the StandardScaler in that it ensures
statistical properties for each feature that guarantee that they are on the same
scale. However, the RobustScaler uses the median and quartiles, instead of mean
and variance. This makes the RobustScaler ignore data points that are very
different from the rest (like measurement errors). These odd data points are also
called outliers, and might often lead to trouble for other scaling techniques.
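A small sketch comparing these scalers on a toy feature matrix with an outlier can make the differences concrete (the values are chosen for illustration only):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer

# one feature with an outlier, plus a second well-behaved feature
X = np.array([[1.0, 10.0],
              [2.0, 11.0],
              [3.0, 12.0],
              [100.0, 13.0]])   # 100.0 is the outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__)
    print(scaler.fit_transform(X))

# The Normalizer instead rescales each row (data point) to unit Euclidean length
print("Normalizer")
print(Normalizer().fit_transform(X))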
Example of own Standard scaling
Let us consider the following vanilla example where we use both Scikit-Learn and write our own function as well. We produce a simple test design matrix with random numbers. Each column could then represent a specific feature whose mean value is subtracted.
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
import numpy as np
import pandas as pd
from IPython.display import display
np.random.seed(100)
# setting up a 10 x 5 matrix
rows = 10
cols = 5
X = np.random.randn(rows,cols)
XPandas = pd.DataFrame(X)
display(XPandas)
print(XPandas.mean())
print(XPandas.std())
XPandas = (XPandas -XPandas.mean())
display(XPandas)
# This option does not include the standard deviation
scaler = StandardScaler(with_std=False)
scaler.fit(X)
Xscaled = scaler.transform(X)
display(XPandas-Xscaled)
Min-Max Scaling
Another commonly used scaling method is min-max scaling. This is very useful when we want the features to lie in a certain interval. To scale the feature $x_j$ to the interval $[a, b]$, we can apply the transformation
$$x_j^{(i)} \rightarrow (b-a) \frac{x_j^{(i)} - \min(x_j)}{\max(x_j) - \min(x_j)} + a,$$
where $\min(x_j)$ and $\max(x_j)$ return the minimum and maximum value of $x_j$ over the data set, respectively.
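A short sketch of this transformation, both done by hand and with scikit-learn's MinMaxScaler (here for the default interval $[0, 1]$, that is $a = 0$ and $b = 1$):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[2.0], [4.0], [6.0], [10.0]])   # a single feature, toy values

a, b = 0.0, 1.0
x_manual = (b - a)*(x - x.min())/(x.max() - x.min()) + a

x_sklearn = MinMaxScaler(feature_range=(a, b)).fit_transform(x)
print(np.allclose(x_manual, x_sklearn))   # expected: True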
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(2018)
n = 50
maxdegree = 5
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
TestError = np.zeros(maxdegree)
TrainError = np.zeros(maxdegree)
polydegree = np.zeros(maxdegree)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

# Where to save figures and data files (example values, adjust to your own folder layout)
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"

if not os.path.exists(PROJECT_ROOT_DIR):
    os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
    os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
    os.makedirs(DATA_ID)

def image_path(fig_id):
    return os.path.join(FIGURE_ID, fig_id)

def data_path(dat_id):
    return os.path.join(DATA_ID, dat_id)

def save_fig(fig_id):
    plt.savefig(image_path(fig_id) + ".png", format='png')
def FrankeFunction(x,y):
term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
return term1 + term2 + term3 + term4
def create_X(x, y, n ):
if len(x.shape) > 1:
x = np.ravel(x)
y = np.ravel(y)
N = len(x)
l = int((n+1)*(n+2)/2) # Number of elements in beta
X = np.ones((N,l))
for i in range(1,n+1):
q = int((i)*(i+1)/2)
for k in range(i+1):
X[:,q+k] = (x**(i-k))*(y**k)
return X
# X_train, X_test, y_train and y_test are assumed defined by an earlier train/test split (not shown here)
clf = skl.LinearRegression().fit(X_train, y_train)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Feature min values before scaling:\n {}".format(X_train.min(axis=0)))
print("Feature max values before scaling:\n {}".format(X_train.max(axis=0)))
More thinking
If our predictors represent different scales, then it is important to standardize the design matrix $X$ by subtracting the mean of each column from the corresponding column and dividing the column by its standard deviation. Most machine learning libraries do this as a default. This means that if you compare your code with the results from a given library, the results may differ.
The StandardScaler function in Scikit-Learn does this for us. For the data sets we have been studying in our various examples, the data are in many cases already scaled and there is no need to scale them. As a user of different machine learning algorithms, you should always perform a survey of your data, with a critical assessment of whether you need to scale them.
If you need to scale the data, not doing so will give an unfair penalization of the parameters, since their magnitude depends on the scale of the corresponding predictor.
Suppose as an example that you have an input variable given by the heights of different persons. Human height might be measured in inches or meters or kilometers. If measured in kilometers, a standard linear regression model with this predictor would probably give a much bigger coefficient than if measured in millimeters. This can clearly lead to problems in evaluating the cost/loss functions.
Still thinking
Keep in mind that when you transform your data set before training a model, the same transformation needs to be applied to any new data set before making a prediction. Translated into Python code, it could be implemented as follows (note that some lines are commented out since the model function has not been defined):
#Model training, we compute the mean value of y and X
y_train_mean = np.mean(y_train)
X_train_mean = np.mean(X_train,axis=0)
X_train = X_train - X_train_mean
y_train = y_train - y_train_mean
# Then we fit our model with the training data
#trained_model = some_model.fit(X_train,y_train)
#Model prediction, we need also to transform our data set used for the prediction.
X_test = X_test - X_train_mean #Use mean from training data
#y_pred = trained_model(X_test)
#y_pred = y_pred + y_train_mean
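A hedged, self-contained version of the sketch above, using scikit-learn's LinearRegression (with fit_intercept=False, since the centering takes care of the intercept) as a stand-in for some_model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(10)
x = np.random.rand(100)
y = 2.0 + 5*x*x + 0.1*np.random.randn(100)
X = np.column_stack((x, x**2))                 # no intercept column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Model training: compute the mean values of y and X on the training data
y_train_mean = np.mean(y_train)
X_train_mean = np.mean(X_train, axis=0)
trained_model = LinearRegression(fit_intercept=False).fit(X_train - X_train_mean,
                                                          y_train - y_train_mean)

# Model prediction: transform the test data with the *training* means
y_pred = trained_model.predict(X_test - X_train_mean) + y_train_mean
print("Test MSE:", np.mean((y_test - y_pred)**2))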
Recall also that we use the squared value since this leads to an increase of the
penalty for higher differences between predicted and output/target values.
What we have done is to single out the β0 term in the definition of the mean
squared error (MSE). The design matrix X does in this case not contain any
intercept column. When we take the derivative with respect to β0 , we want the
derivative to obey
$$\frac{\partial C}{\partial \beta_j} = 0,$$
for all $j$. For $\beta_0$ we have
$$\frac{\partial C}{\partial \beta_0} = -\frac{2}{n} \sum_{i=0}^{n-1} \left( y_i - \beta_0 - \sum_{j=1}^{p-1} X_{ij} \beta_j \right).$$
Further Manipulations
Let us first specialize to the case where we have only two parameters $\beta_0$ and $\beta_1$. Our result for $\beta_0$ simplifies then to
$$n \beta_0 = \sum_{i=0}^{n-1} y_i - \sum_{i=0}^{n-1} X_{i1} \beta_1.$$
We obtain then
$$\beta_0 = \frac{1}{n} \sum_{i=0}^{n-1} y_i - \beta_1 \frac{1}{n} \sum_{i=0}^{n-1} X_{i1}.$$
If we define
$$\mu_1 = \frac{1}{n} \sum_{i=0}^{n-1} X_{i1},$$
and if we define the mean value of the outputs as
$$\mu_y = \frac{1}{n} \sum_{i=0}^{n-1} y_i,$$
we have
$$\beta_0 = \mu_y - \beta_1 \mu_1.$$
In the general case, that is when we have more parameters than $\beta_0$ and $\beta_1$, we have
$$\beta_0 = \frac{1}{n} \sum_{i=0}^{n-1} y_i - \frac{1}{n} \sum_{i=0}^{n-1} \sum_{j=1}^{p-1} X_{ij} \beta_j.$$
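As a small numerical check of this relation (a minimal sketch with a toy design matrix), the intercept recovered from centered data agrees with the $\beta_0$ obtained from a fit that includes an explicit intercept column:

import numpy as np

np.random.seed(1)
n, p = 100, 4
X = np.random.randn(n, p)                     # features, no intercept column
y = 1.5 + X @ np.array([0.5, -1.0, 2.0, 0.3]) + 0.1*np.random.randn(n)

# Fit with an explicit intercept column
Xb = np.column_stack((np.ones(n), X))
beta_full = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y

# Fit on centered data, then recover the intercept from the means
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta_centered = np.linalg.pinv(Xc.T @ Xc) @ Xc.T @ yc
beta0 = y.mean() - X.mean(axis=0) @ beta_centered

print(np.allclose(beta_full[0], beta0))           # expected: True
print(np.allclose(beta_full[1:], beta_centered))  # expected: True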
Replacing $y_i$ with $y_i - \bar{y}$ and centering also our design matrix results in a cost function (in vector-matrix disguise)
Wrapping it up
If we minimize with respect to β we have then
What does this mean? And why do we insist on all this? Let us look at some
examples.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2021)

def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n

# true_beta, fit_beta and the fitted scikit-learn model skl are defined earlier
# in the full example (not shown here): true_beta holds the polynomial
# coefficients used to generate the data, fit_beta solves the normal equations
# and skl is a LinearRegression(fit_intercept=False) fit.
x = np.linspace(0, 1, 11)
y = np.sum(
    np.asarray([x ** p * b for p, b in enumerate(true_beta)]), axis=0
) + 0.1 * np.random.normal(size=len(x))

degree = 3
X = np.zeros((len(x), degree))
# Include the intercept in the design matrix
for p in range(degree):
    X[:, p] = x ** p

beta = fit_beta(X, y)
plt.figure()
plt.scatter(x, y, label="Data")
plt.plot(x, X @ beta, label="Fit")
plt.plot(x, skl.predict(X), label="Sklearn (fit_intercept=False)")
The intercept is the value of our output/target variable when all our features
are zero and our function crosses the y-axis (for a one-dimensional case).
Printing the MSE, we see first that both methods give the same MSE, as they
should. However, when we move to for example Ridge regression (discussed next
week), the way we treat the intercept may give a larger or smaller MSE, meaning
that the MSE can be penalized by the value of the intercept. Not including the
intercept in the fit means that the regularization term does not include $\beta_0$. For
different values of λ, this may lead to differing MSE values.
To remind the reader, the regularization term with the intercept in Ridge regression is given by
$$\lambda ||\beta||_2^2 = \lambda \sum_{j=0}^{p-1} \beta_j^2.$$
It means that, when scaling the design matrix and the outputs/targets by subtracting the mean values, we have an optimization problem which is not
penalized by the intercept. The MSE value can then be smaller since it focuses
only on the remaining quantities. If we however bring back the intercept, we will
get an MSE which then contains the intercept. This becomes more important
when we discuss Ridge and Lasso regression next week.
We now define a matrix
$$A = X \left( X^T X \right)^{-1} X^T.$$
We can rewrite
$$\tilde{y} = X \hat{\beta} = A y.$$
The matrix A has the important property that A2 = A. This is the definition
of a projection matrix. We can then interpret our optimal model ỹ as being
represented by an orthogonal projection of y onto a space defined by the column
vectors of X. In our case here the matrix A is a square matrix. If it is a general
rectangular matrix we have an oblique projection matrix.
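A quick numerical illustration of the projection property $A^2 = A$ (a minimal sketch with a random full-rank design matrix):

import numpy as np

np.random.seed(0)
n, p = 10, 3
X = np.random.randn(n, p)              # random full-rank design matrix

A = X @ np.linalg.inv(X.T @ X) @ X.T   # the projection matrix
print(np.allclose(A @ A, A))           # A^2 = A, expected: True
print(np.allclose(A.T, A))             # an orthogonal projection matrix is also symmetric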
Residual Error
We have defined the residual error as
$$\epsilon = y - \tilde{y} = \left[ I - X \left( X^T X \right)^{-1} X^T \right] y.$$
The residual errors are then the projections of $y$ onto the orthogonal complement of the space defined by the column vectors of $X$.
Simple case
If the matrix X is an orthogonal (or unitary in case of complex values) matrix,
we have
X T X = XX T = I.
In this case the matrix $A$ becomes
$$A = X \left( X^T X \right)^{-1} X^T = I,$$
and we have
$$\epsilon = y - \tilde{y} = 0.$$
Cholesky decomposition may lead to singularities. We will see examples of this
below.
There is however a way to circumvent this problem and also gain some
insights about the ordinary least squares approach, and later shrinkage methods
like Ridge and Lasso regressions.
This is given by the Singular Value Decomposition (SVD) algorithm, perhaps the most powerful linear algebra algorithm. The SVD provides a matrix decomposition that is used in a large swath of applications and is always numerically stable.
In machine learning it plays a central role in dealing with for example design
matrices that may be near singular or singular. Furthermore, as we will see here,
the singular values can be related to the covariance matrix (and thereby the
correlation matrix) and in turn the variance of a given quantity. It plays also an
important role in the principal component analysis where high-dimensional data
can be reduced to the statistically relevant features.
The columns of $X$ are linearly dependent. We see this easily since the first column is the row-wise sum of the other two columns. The rank (more correctly, the column rank) of a matrix is the dimension of the space spanned by the column vectors. Hence, the rank of $X$ is equal to the number of linearly independent columns. In this particular case the matrix has rank 2.
Super-collinearity of an $(n \times p)$-dimensional design matrix $X$ implies that the matrix $X^T X$ (the matrix we need to invert to solve the linear regression equations) is non-invertible. If we have a square matrix that does not have an inverse, we say that this matrix is singular. The example here demonstrates this:
$$X = \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}.$$
We see easily that $\det(X) = x_{11} x_{22} - x_{12} x_{21} = 1 \times (-1) - (-1) \times 1 = 0$. Hence, $X$ is singular and its inverse is undefined. This is equivalent to saying that the matrix $X$ has at least one eigenvalue which is zero.
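This can be verified directly with numpy (a small sketch):

import numpy as np

X = np.array([[1.0, -1.0],
              [1.0, -1.0]])

print(np.linalg.det(X))           # 0.0: the matrix is singular
print(np.linalg.matrix_rank(X))   # 1: only one linearly independent column
print(np.linalg.eigvals(X))       # the eigenvalues (here both are zero)
# np.linalg.inv(X) would raise a LinAlgError: Singular matrix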
Fixing the singularity
If our design matrix $X$, which enters the linear regression problem
$$\beta = \left( X^T X \right)^{-1} X^T y,$$
has linearly dependent column vectors, we will not be able to compute the inverse of $X^T X$ and we cannot find the parameters (estimators) $\beta_i$. The estimators are only well-defined if $(X^T X)^{-1}$ exists. This is more likely to happen when the matrix $X$ is high-dimensional. In this case it is likely to encounter a situation where the regression parameters $\beta_i$ cannot be estimated.
A cheap ad hoc approach is simply to add a small diagonal component to
the matrix to invert, that is we change
$$X^T X \rightarrow X^T X + \lambda I,$$
where I is the identity matrix. When we discuss Ridge regression this is actually
what we end up evaluating. The parameter λ is called a hyperparameter. More
about this later.
The SVD writes a general matrix $X$ in terms of a diagonal matrix $\Sigma$ of dimension $m \times n$ and two orthogonal matrices $U$ and $V$, where the first has dimensionality $m \times m$ and the last dimensionality $n \times n$. We have then
$$X = U \Sigma V^T.$$
As an example, the above defective matrix can be decomposed as
$$X = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix} \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} = U \Sigma V^T,$$
with singular values $\sigma_1 = 2$ and $\sigma_2 = 0$. The SVD always exists!
The SVD decomposition gives singular values ordered as $\sigma_i \geq \sigma_{i+1}$ for all $i$, and for dimensions larger than $i = p$ the singular values are zero.
In the general case, where our design matrix X has dimension n × p, the
matrix is thus decomposed into an n × n orthogonal matrix U , a p × p orthogonal
matrix V and a diagonal matrix Σ with r = min(n, p) singular values σi ≥ 0 on
the main diagonal and zeros filling the rest of the matrix. There are at most p
singular values assuming that n > p. In our regression examples for the nuclear
masses and the equation of state this is indeed the case, while for the Ising model
we have p > n. These are often cases that lead to near singular or singular
matrices.
The columns of U are called the left singular vectors while the columns of V
are the right singular vectors.
Economy-size SVD
If we assume that n > p, then our matrix U has dimension n × n. The last
n − p columns of U become however irrelevant in our calculations since they are
multiplied with the zeros in Σ.
The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values, Σ, along with the columns in either
U or V that multiply those zeros in the expression. Removing these zeros and
columns can improve execution time and reduce storage requirements without
compromising the accuracy of the decomposition.
If $n > p$, we keep only the first $p$ columns of $U$ and $\Sigma$ has dimension $p \times p$. If $p > n$, then only the first $n$ columns of $V$ are computed and $\Sigma$ has dimension $n \times n$. The $n = p$ case is obvious; we retain the full SVD. In general the economy-size SVD requires fewer floating-point operations while conserving the desired accuracy.
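In numpy the economy-size (thin) SVD is obtained with the full_matrices=False flag; a small sketch showing the resulting shapes:

import numpy as np

np.random.seed(0)
n, p = 6, 3
X = np.random.randn(n, p)

U_full, S, VT = np.linalg.svd(X, full_matrices=True)
U_thin, S_thin, VT_thin = np.linalg.svd(X, full_matrices=False)

print(U_full.shape, U_thin.shape)   # (6, 6) versus (6, 3)
print(S.shape, VT.shape)            # 3 singular values, V^T is 3 x 3
# The thin SVD still reconstructs X exactly:
print(np.allclose(U_thin @ np.diag(S_thin) @ VT_thin, X))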
import numpy as np

def SVD(A):
    # Perform the SVD of A and reconstruct it from U, Sigma and V^T
    U, S, VT = np.linalg.svd(A, full_matrices=True)
    print('test U')
    print( (np.transpose(U) @ U - U @np.transpose(U)))
    print('test VT')
    print( (np.transpose(VT) @ VT - VT @np.transpose(VT)))
    print(U)
    print(S)
    print(VT)
    D = np.zeros((len(U),len(VT)))
    for i in range(0,len(VT)):
        D[i,i]=S[i]
    return U @ D @ VT

# Example matrix whose first column is the row-wise sum of the other two columns
# (values chosen for illustration; the original matrix definition is not shown here)
X = np.array([[1.0, -1.0, 2.0],
              [1.0, 0.0, 1.0],
              [1.0, 2.0, -1.0],
              [1.0, 1.0, 0.0]])
print(X)
C = SVD(X)
# Print the difference between the original matrix and the SVD one
print(C-X)
The matrix $X$ has columns that are linearly dependent: the first column is the row-wise sum of the other two columns. The rank of a matrix (the column rank) is the dimension of the space spanned by the column vectors, that is, the number of linearly independent columns, in this case just 2. We see this from the singular values when running the above code. Running the standard matrix inversion algorithm on $X^T X$ results in the program terminating due to a singular matrix.
Our starting point is our design matrix $X$ of dimension $n \times p$,
$$X = \begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2} & \ldots & \ldots & x_{0,p-1} \\
x_{1,0} & x_{1,1} & x_{1,2} & \ldots & \ldots & x_{1,p-1} \\
x_{2,0} & x_{2,1} & x_{2,2} & \ldots & \ldots & x_{2,p-1} \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
x_{n-2,0} & x_{n-2,1} & x_{n-2,2} & \ldots & \ldots & x_{n-2,p-1} \\
x_{n-1,0} & x_{n-1,1} & x_{n-1,2} & \ldots & \ldots & x_{n-1,p-1}
\end{bmatrix}.$$
$$X = U \Sigma V^T,$$
Example Matrix
As an example, consider the following $3 \times 2$ example for the matrix $\Sigma$:
$$\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}.$$
The singular values are $\sigma_0 = 2$ and $\sigma_1 = 1$. It is common to rewrite the matrix $\Sigma$ as
$$\Sigma = \begin{bmatrix} \tilde{\Sigma} \\ \boldsymbol{0} \end{bmatrix},$$
where
$$\tilde{\Sigma} = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$$
contains only the singular values. Note also (and we will use this below) that
$$\Sigma^T \Sigma = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix},$$
which is a $2 \times 2$ matrix, while
$$\Sigma \Sigma^T = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
is a $3 \times 3$ matrix. The last row and column of this last matrix contain only zeros. This will have important consequences for our SVD decomposition of the design matrix.
$$X^T X = V \Sigma^T U^T U \Sigma V^T,$$
and using the orthogonality of the matrix $U$ we have
$$X^T X = V \Sigma^T \Sigma V^T.$$
We define $\Sigma^T \Sigma = \tilde{\Sigma}^2$, which is a diagonal matrix containing only the singular values squared. It has dimensionality $p \times p$.
We can now insert the result for the matrix $X^T X$ into our equation for ordinary least squares, where
$$\tilde{y}_{\mathrm{OLS}} = X \left( X^T X \right)^{-1} X^T y,$$
and using our SVD decomposition of $X$ we have
$$\tilde{y}_{\mathrm{OLS}} = U \Sigma V^T \left( V \tilde{\Sigma}^2 V^T \right)^{-1} V \Sigma^T U^T y,$$
which gives us, using the orthogonality of the matrix $V$,
$$\tilde{y}_{\mathrm{OLS}} = U U^T y = \sum_{i=0}^{p-1} u_i u_i^T y.$$
It means that the ordinary least squares model (with the optimal parameters) $\tilde{y}$ corresponds to an orthogonal transformation of the output (or target) vector $y$ by the columns of the matrix $U$. Note that the summation ends at $p-1$, that is $\tilde{y} \neq y$. We can thus not use the orthogonality relation for the matrix $U$. This can already be seen when we multiply the matrices $\Sigma^T U^T$.
$$X^T X = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T.$$
If we now multiply from the right with $V$ (using the orthogonality of $V$) we get
$$\left( X^T X \right) V = V \Sigma^T \Sigma.$$
This means that the vectors $v_i$ of the orthogonal matrix $V$ are the eigenvectors of the matrix $X^T X$, with eigenvalues given by the singular values squared, that is
$$X^T X v_i = v_i \sigma_i^2.$$
Similarly,
$$X X^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T.$$
This means that the vectors $u_i$ of the orthogonal matrix $U$ are the eigenvectors of the matrix $X X^T$, with eigenvalues given by the singular values squared, that is
$$X X^T u_i = u_i \sigma_i^2.$$
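These relations are easy to verify numerically; a small sketch comparing the eigenvalues of $X^T X$ and $X X^T$ with the squared singular values of $X$:

import numpy as np

np.random.seed(0)
n, p = 8, 3
X = np.random.randn(n, p)

U, S, VT = np.linalg.svd(X)
eig_XtX = np.linalg.eigvalsh(X.T @ X)    # eigenvalues in ascending order
eig_XXt = np.linalg.eigvalsh(X @ X.T)

print(np.allclose(np.sort(S**2), eig_XtX))   # sigma_i^2 are the eigenvalues of X^T X
print(np.allclose(np.sort(np.concatenate((S**2, np.zeros(n - p)))), eig_XXt))
# X X^T has the same nonzero eigenvalues plus n - p zeros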
$$\frac{\partial^2 C(\beta)}{\partial \beta \partial \beta^T} = \frac{2}{n} X^T X.$$
This quantity defines what is called the Hessian matrix (the second derivative of the function we want to optimize).
The Hessian matrix plays an important role and is defined in this course as
$$H = X^T X.$$
The Hessian matrix for ordinary least squares is also proportional to the
covariance matrix. This means also that we can use the SVD to find the
eigenvalues of the covariance matrix and the Hessian matrix in terms of the
singular values. Let us develop these arguments, as they will play an important
role in our machine learning studies.
Note: we have used $1/n$ in the above definitions of the sample variance and covariance. We assume then that we can calculate the exact mean value. What you will find in essentially all statistics texts are equations with a factor $1/(n-1)$. This is called Bessel's correction. This method corrects the bias in the estimation of the population variance and covariance. It also partially corrects the bias in the estimation of the population standard deviation. If you use a library like Scikit-Learn or numpy's functions to calculate the covariance, this quantity will be computed with a factor $1/(n-1)$.
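A small sketch showing the difference between the $1/n$ and $1/(n-1)$ conventions with numpy (both np.var and np.cov take a ddof argument, the "delta degrees of freedom"):

import numpy as np

np.random.seed(0)
x = np.random.randn(10)

print(np.var(x, ddof=0))   # 1/n convention (the default for np.var)
print(np.var(x, ddof=1))   # 1/(n-1), Bessel's correction
print(np.cov(x, ddof=0))   # np.cov uses 1/(n-1) by default; ddof=0 forces 1/n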
$$\mathrm{corr}[x, y] = \frac{\mathrm{cov}[x, y]}{\sqrt{\mathrm{var}[x] \, \mathrm{var}[y]}}.$$
The correlation function is then given by values $\mathrm{corr}[x, y] \in [-1, 1]$. This avoids eventual problems with too large values. We can then define the correlation matrix for the two vectors $x$ and $y$ as
$$K[x, y] = \begin{bmatrix} 1 & \mathrm{corr}[x, y] \\ \mathrm{corr}[y, x] & 1 \end{bmatrix},$$
In the above example this is the function we constructed using pandas.
$$x_i^T = \begin{bmatrix} x_{0,i} & x_{1,i} & x_{2,i} & \ldots & \ldots & x_{n-1,i} \end{bmatrix}.$$
$$C[x] = \begin{bmatrix}
\mathrm{var}[x_0] & \mathrm{cov}[x_0, x_1] & \mathrm{cov}[x_0, x_2] & \ldots & \ldots & \mathrm{cov}[x_0, x_{p-1}] \\
\mathrm{cov}[x_1, x_0] & \mathrm{var}[x_1] & \mathrm{cov}[x_1, x_2] & \ldots & \ldots & \mathrm{cov}[x_1, x_{p-1}] \\
\mathrm{cov}[x_2, x_0] & \mathrm{cov}[x_2, x_1] & \mathrm{var}[x_2] & \ldots & \ldots & \mathrm{cov}[x_2, x_{p-1}] \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
\mathrm{cov}[x_{p-1}, x_0] & \mathrm{cov}[x_{p-1}, x_1] & \mathrm{cov}[x_{p-1}, x_2] & \ldots & \ldots & \mathrm{var}[x_{p-1}]
\end{bmatrix},$$
and the correlation matrix
$$K[x] = \begin{bmatrix}
1 & \mathrm{corr}[x_0, x_1] & \mathrm{corr}[x_0, x_2] & \ldots & \ldots & \mathrm{corr}[x_0, x_{p-1}] \\
\mathrm{corr}[x_1, x_0] & 1 & \mathrm{corr}[x_1, x_2] & \ldots & \ldots & \mathrm{corr}[x_1, x_{p-1}] \\
\mathrm{corr}[x_2, x_0] & \mathrm{corr}[x_2, x_1] & 1 & \ldots & \ldots & \mathrm{corr}[x_2, x_{p-1}] \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
\mathrm{corr}[x_{p-1}, x_0] & \mathrm{corr}[x_{p-1}, x_1] & \mathrm{corr}[x_{p-1}, x_2] & \ldots & \ldots & 1
\end{bmatrix},$$
which in turn is converted into the $2 \times 2$ covariance matrix $C$ via the Numpy function np.cov(). We note that we can also calculate the mean value of each set of samples $x$ etc. using the Numpy function np.mean(x). We can also extract the eigenvalues of the covariance matrix through the np.linalg.eig() function.
# Importing various packages
import numpy as np
n = 100
x = np.random.normal(size=n)
print(np.mean(x))
y = 4+3*x+np.random.normal(size=n)
print(np.mean(y))
W = np.vstack((x, y))
C = np.cov(W)
print(C)
Correlation Matrix
The previous example can be converted into the correlation matrix by simply scaling the matrix elements with the variances. We should also subtract the mean values for each column. This leads to the following code, which sets up the correlation matrix for the previous example in a more brute-force way. Here we subtract the mean values for each column of the design matrix, calculate the relevant mean values and variances and then finally set up the $2 \times 2$ correlation matrix (since we have only two vectors).
import numpy as np
n = 100
# define two vectors
x = np.random.random(size=n)
y = 4+3*x+np.random.normal(size=n)
#scaling the x and y vectors
x = x - np.mean(x)
y = y - np.mean(y)
variance_x = np.sum(x@x)/n
variance_y = np.sum(y@y)/n
print(variance_x)
print(variance_y)
cov_xy = np.sum(x@y)/n
cov_xx = np.sum(x@x)/n
cov_yy = np.sum(y@y)/n
C = np.zeros((2,2))
C[0,0]= cov_xx/variance_x
C[1,1]= cov_yy/variance_y
C[0,1]= cov_xy/np.sqrt(variance_y*variance_x)
C[1,0]= C[0,1]
print(C)
We see that the matrix elements along the diagonal are one as they should
be and that the matrix is symmetric. Furthermore, diagonalizing this matrix we
easily see that it is a positive definite matrix.
The above procedure with numpy can be made more compact if we use
pandas.
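A hedged sketch of how the pandas version could look for the same two vectors, using the DataFrame corr() method:

import numpy as np
import pandas as pd

n = 100
x = np.random.random(size=n)
y = 4 + 3*x + np.random.normal(size=n)

df = pd.DataFrame({'x': x, 'y': y})
print(df.corr())    # the 2 x 2 correlation matrix, computed directly by pandas
print(df.cov())     # and the corresponding covariance matrix (1/(n-1) convention)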
def FrankeFunction(x,y):
term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
return term1 + term2 + term3 + term4
def create_X(x, y, n ):
if len(x.shape) > 1:
x = np.ravel(x)
y = np.ravel(y)
N = len(x)
l = int((n+1)*(n+2)/2) # Number of elements in beta
X = np.ones((N,l))
for i in range(1,n+1):
q = int((i)*(i+1)/2)
for k in range(i+1):
X[:,q+k] = (x**(i-k))*(y**k)
return X
We note here that the covariance is zero for the first row and column since all elements of the first column of the design matrix are equal to one (we are fitting the function in terms of a polynomial of degree $n$). This means that the variance for these elements will be zero and will cause problems when we set up the correlation matrix. We can simply drop these elements and construct a correlation matrix without them.
To see this let us simply look at a design matrix X ∈ R2×2
$$X = \begin{bmatrix} x_{00} & x_{01} \\ x_{10} & x_{11} \end{bmatrix} = \begin{bmatrix} x_0 & x_1 \end{bmatrix}.$$
If we then compute the expectation value (note the $1/n$ factor instead of $1/(n-1)$),
$$\mathbb{E}[X^T X] = \frac{1}{n} X^T X = \frac{1}{n} \begin{bmatrix} x_{00}^2 + x_{10}^2 & x_{00} x_{01} + x_{10} x_{11} \\ x_{01} x_{00} + x_{11} x_{10} & x_{01}^2 + x_{11}^2 \end{bmatrix},$$
which is just
$$C[x_0, x_1] = C[x] = \begin{bmatrix} \mathrm{var}[x_0] & \mathrm{cov}[x_0, x_1] \\ \mathrm{cov}[x_1, x_0] & \mathrm{var}[x_1] \end{bmatrix},$$
where we wrote
C[x0 , x1 ] = C[x]
to indicate that this is the covariance of the vectors x of the design/feature
matrix X.
It is easy to generalize this to a matrix X ∈ Rn×p .
$$X^T X = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T.$$
Since the matrices here have dimension $p \times p$, with $p$ corresponding to the number of singular values, we have, using the matrix defined earlier,
$$\Sigma^T \Sigma = \begin{bmatrix} \tilde{\Sigma} & \boldsymbol{0} \end{bmatrix} \begin{bmatrix} \tilde{\Sigma} \\ \boldsymbol{0} \end{bmatrix} = \tilde{\Sigma}^2,$$
and thereby
$$\left( X^T X \right) V = V \tilde{\Sigma}^2.$$
What does it mean?
This means the vectors vi of the orthogonal matrix V are the eigenvectors of
the matrix X T X with eigenvalues given by the singular values squared, that is
$$X^T X v_i = v_i \sigma_i^2.$$
And finally $XX^T$
For $XX^T$ we found
$$X X^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T.$$
Since the matrices here have dimension $n \times n$, we have
$$\Sigma \Sigma^T = \begin{bmatrix} \tilde{\Sigma} \\ \boldsymbol{0} \end{bmatrix} \begin{bmatrix} \tilde{\Sigma} & \boldsymbol{0} \end{bmatrix} = \begin{bmatrix} \tilde{\Sigma}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix},$$
leading to
$$X X^T = U \begin{bmatrix} \tilde{\Sigma}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} U^T.$$
Multiplying with $U$ from the right gives us the eigenvalue problem
$$\left( X X^T \right) U = U \begin{bmatrix} \tilde{\Sigma}^2 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix}.$$
eigenvectors of $XX^T$ and measure how much correlation is contained in the rows of $X$.
Since we will mainly be interested in the correlations among the features of our data (the columns of $X$), the quantities of interest for us are the non-zero singular values and the column vectors of $V$.
which leads to Lasso regression. Lasso stands for least absolute shrinkage and
selection operator.
Here we have defined the norm-1 as
$$||x||_1 = \sum_i |x_i|.$$
Deriving the Ridge Regression Equations
Using the matrix-vector expression for Ridge regression and dropping the parameter $1/n$ in front of the standard mean squared error equation, we have
and taking the derivatives with respect to β we obtain then a slightly modified
matrix inversion problem which for finite values of λ does not suffer from
singularity problems. We obtain the optimal parameters
$$\hat{\beta}_{\mathrm{Ridge}} = \left( X^T X + \lambda I \right)^{-1} X^T y,$$
with $I$ being a $p \times p$ identity matrix, with the constraint that
$$\sum_{i=0}^{p-1} \beta_i^2 \leq t,$$
with $t$ a finite positive number.
If we keep the $1/n$ factor, the equation for the optimal $\beta$ changes to
$$\hat{\beta}_{\mathrm{Ridge}} = \left( X^T X + n\lambda I \right)^{-1} X^T y.$$
In many textbooks the $1/n$ term is often omitted. Note that a library like Scikit-Learn does not include the $1/n$ factor in the setup of the cost function.
When we compare this with the ordinary least squares result we have
$$\hat{\beta}_{\mathrm{OLS}} = \left( X^T X \right)^{-1} X^T y,$$
which can lead to singular matrices. However, with the SVD, we can always
compute the inverse of the matrix X T X.
We see that Ridge regression is nothing but the standard OLS with a modified
diagonal term added to X T X. The consequences, in particular for our discussion
of the bias-variance tradeoff are rather interesting. We will see that for specific
values of λ, we may even reduce the variance of the optimal parameters β. These
topics and other related ones, will be discussed after the more linear algebra
oriented analysis here.
Using our insights about the SVD of the design matrix $X$, we have already analyzed the OLS solutions in terms of the left singular vectors, the columns of the matrix $U$, as
$$\tilde{y}_{\mathrm{OLS}} = X\beta = U U^T y.$$
For Ridge regression this becomes
$$\tilde{y}_{\mathrm{Ridge}} = X \beta_{\mathrm{Ridge}} = U \Sigma V^T \left( V \Sigma^2 V^T + \lambda I \right)^{-1} (U \Sigma V^T)^T y = \sum_{j=0}^{p-1} u_j u_j^T \frac{\sigma_j^2}{\sigma_j^2 + \lambda} y,$$
with the vectors uj being the columns of U from the SVD of the matrix X.
Interpreting the Ridge results
Since $\lambda \geq 0$, it means that compared to OLS, we have
$$\frac{\sigma_j^2}{\sigma_j^2 + \lambda} \leq 1.$$
Ridge regression finds the coordinates of $y$ with respect to the orthonormal basis $U$, and it then shrinks the coordinates by the factors $\sigma_j^2/(\sigma_j^2 + \lambda)$. Recall that the SVD has singular values ordered in a descending way, that is $\sigma_i \geq \sigma_{i+1}$.
For small singular values $\sigma_i$ this means that their contributions become less important, a fact which can be used to reduce the number of degrees of freedom. More about this when we have covered the material on a statistical interpretation of various linear regression methods.
More interpretations
For the sake of simplicity, let us assume that the design matrix is orthonormal,
that is
$$X^T X = (X^T X)^{-1} = I.$$
In this case the standard OLS results in
$$\beta^{\mathrm{OLS}} = X^T y = \sum_{i=0}^{n-1} u_i u_i^T y,$$
and
$$\beta^{\mathrm{Ridge}} = (I + \lambda I)^{-1} X^T y = (1 + \lambda)^{-1} \beta^{\mathrm{OLS}},$$
that is the Ridge estimator scales the OLS estimator by the inverse of a factor
1 + λ, and the Ridge estimator converges to zero when the hyperparameter goes
to infinity.
We will come back to more interpretations after we have gone through some of the statistical analysis part.
For more discussions of Ridge and Lasso regression, Wessel van Wierin-
gen’s article is highly recommended. Similarly, Mehta et al’s article is also
recommended.
Taking the derivative with respect to $\beta$ and recalling that the derivative of the absolute value is (we drop the boldfaced vector symbol for simplicity)
$$\frac{d|\beta|}{d\beta} = \mathrm{sgn}(\beta) = \begin{cases} 1 & \beta > 0 \\ -1 & \beta < 0, \end{cases}$$
we have
$$\frac{\partial C(X, \beta)}{\partial \beta} = -\frac{2}{n} X^T (y - X\beta) + \lambda \, \mathrm{sgn}(\beta) = 0,$$
and reordering we have
$$X^T X \beta + \frac{n\lambda}{2} \mathrm{sgn}(\beta) = X^T y.$$
We can redefine $\lambda$ to absorb the constant $n/2$ and rewrite the last equation as
$$X^T X \beta + \lambda \, \mathrm{sgn}(\beta) = X^T y.$$
This equation does not lead to a nice analytical equation as in either Ridge
regression or ordinary least squares. This equation can however be solved by
using standard convex optimization algorithms using for example the Python
package CVXOPT. We will discuss this later.
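While we come back to proper optimization algorithms later, a hedged sketch using scikit-learn's Lasso class (which solves this problem with coordinate descent) already illustrates the shrinkage and the variable selection:

import numpy as np
from sklearn.linear_model import Lasso

np.random.seed(0)
n, p = 100, 5
X = np.random.randn(n, p)
# only two of the five features actually matter in this toy example
y = 3.0*X[:, 0] - 2.0*X[:, 1] + 0.1*np.random.randn(n)

for lam in (0.01, 0.1, 1.0):
    clf = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
    print(lam, clf.coef_)   # larger alpha drives more coefficients exactly to zero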