
Linear Regression

Nathanaël Carraz Rakotonirina

Mathématiques Informatique et Statistique Appliquées (MISA)


Université d’Antananarivo
Model
The linear regression model is of the form

f(x; θ) = w_1 x_1 + ... + w_D x_D + b = w^⊤ x + b

- θ = (w, b): parameters
- w: weights
- b: bias
b can be absorbed into w by defining w = [b, w_1, ..., w_D] and x = [1, x_1, ..., x_D], so that

f(x; θ) = w^⊤ x
x can be replaced by a non-linear function of the inputs, φ(x), called a basis expansion function:

f(x; θ) = w^⊤ φ(x)
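As an illustration, here is a minimal Python sketch of one possible basis expansion, a polynomial in a scalar input (the degree and weight values are arbitrary assumptions, not from the slides):

import numpy as np

# Hypothetical polynomial basis expansion for a scalar input x:
# φ(x) = [1, x, x^2, ..., x^degree]
def poly_features(x, degree=3):
    return np.array([x**k for k in range(degree + 1)])

w = np.array([0.5, 1.0, -0.3, 0.05])   # one weight per basis function (assumed values)
x = 2.0
y_hat = w @ poly_features(x)           # f(x; θ) = w^⊤ φ(x)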
The general form of the linear regression model with all observations:

ŷ = Xw + b

- N: number of observations
- D: number of features
- ŷ ∈ ℝ^N: predictions
- X ∈ ℝ^{N×D}: inputs (design matrix)
- w ∈ ℝ^D: weights
- b ∈ ℝ: bias
When the bias b is absorbed
ŷ = Xw
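A minimal NumPy sketch of this (the shapes and random data are assumed for illustration), checking that absorbing the bias leaves the predictions unchanged:

import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))        # design matrix, shape (N, D)
w = rng.normal(size=D)             # weights
b = 1.5                            # bias

y_hat = X @ w + b                  # ŷ = Xw + b

# Absorb the bias: prepend a column of ones to X and b to w
X_aug = np.hstack([np.ones((N, 1)), X])
w_aug = np.concatenate(([b], w))
assert np.allclose(y_hat, X_aug @ w_aug)   # ŷ = Xw with the bias absorbed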
Loss function - Least squares

Goal:
Find the parameters w that minimize the residual sum of squares (loss)
RSS(w) = (1/2) Σ_{i=1}^{N} (y_i − f(x_i))² = (1/2) Σ_{i=1}^{N} (y_i − w^⊤ x_i)²

We can minimize it analytically or iteratively using gradient descent.
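A minimal sketch of the iterative route, running plain gradient descent on RSS(w) with synthetic data (the learning rate and iteration count are arbitrary choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)    # noisy targets

w = np.zeros(D)
lr = 0.002                                   # step size (assumed)
for _ in range(5000):
    grad = X.T @ (X @ w - y)                 # ∇_w RSS(w) = X^⊤ Xw − X^⊤ y
    w = w - lr * grad

print(w)   # close to w_true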


Probabilistic Interpretation
The targets and inputs are related as follows

y = w^⊤ x + ε

where ε is the residual error between the predictions and the true response (unmodeled effects/random noise). We assume ε has a Gaussian distribution, ε ∼ N(0, σ²).

p(y|x; θ) = N(y; w^⊤ x, σ²)

where θ = (w, σ²).
We estimate the parameters using Maximum Likelihood Estimation. We want the parameters that maximize the likelihood Π_{i=1}^{N} p(y_i|x_i; θ). It is easier to minimize the negative log-likelihood

NLL(θ) = −Σ_{i=1}^{N} log p(y_i|x_i; θ)

It can be shown that minimizing the NLL is equivalent to minimizing the RSS.
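For completeness, here is the intermediate step (not spelled out on the slides) that makes the equivalence explicit, using the Gaussian density from above:

NLL(θ) = −Σ_{i=1}^{N} log N(y_i; w^⊤ x_i, σ²)
       = (N/2) log(2πσ²) + (1/(2σ²)) Σ_{i=1}^{N} (y_i − w^⊤ x_i)²
       = (N/2) log(2πσ²) + (1/σ²) RSS(w)

The first term does not depend on w, so minimizing NLL over w is the same as minimizing RSS(w).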
Ordinary Least Squares
Our loss function is
J(w) = RSS(w) = (1/2) Σ_{i=1}^{N} (y_i − w^⊤ x_i)² = (1/2) ||Xw − y||₂² = (1/2) (Xw − y)^⊤ (Xw − y)

The gradient is given by

∇_w RSS(w) = X^⊤ Xw − X^⊤ y


Setting the gradient to zero gives

X^⊤ Xw = X^⊤ y

called the normal equations.
The solution ŵ, called the ordinary least squares (OLS) solution, is given by

ŵ = (X^⊤ X)⁻¹ X^⊤ y
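A minimal sketch (synthetic data assumed) of computing ŵ by solving the normal equations; solving the linear system is preferable to forming the inverse explicitly, for the numerical reasons discussed below:

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

# Solve X^⊤ X w = X^⊤ y for w (normal equations)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)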


Is it a unique global minimum?
We check whether the Hessian is positive definite. It is given by

H(w) = ∂²RSS(w)/∂w∂w^⊤ = X^⊤ X

If the columns of X are linearly independent, then H is positive definite and ŵ is the unique global minimum.
Numerical issues

The inverse should not be computed directly: X^⊤ X can be singular or ill-conditioned. There are more stable alternatives (see the sketch below):
- SVD
- QR decomposition
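A minimal sketch of both alternatives on synthetic data (assumed, as before): np.linalg.lstsq uses an SVD-based solver, and the QR route factors X = QR and solves the triangular system Rw = Q^⊤ y:

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

# SVD-based least squares (also handles rank-deficient X)
w_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

# QR decomposition: X = QR with Q orthonormal, R upper triangular
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

assert np.allclose(w_svd, w_qr)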

Explore further
- Polynomial regression (other basis expansions)
- Weighted linear regression
- Bayesian linear regression
