Linear Regression
Abstract
An exploration of linear regression and several methods of deriving its coefficient estimates:
ordinary least squares, the method of moments, and the method of maximum likelihood.
These estimation methods are staples in the tool belt of every quantitative analyst
and warrant a deep mathematical understanding.
y = α + βx (1)
The goal of linear regression is to fit a line of form (1) to our data. In the real world this
line will never fit our data perfectly, so we must introduce an error variable, ϵ, into equation (1):
y = α + βx + ϵ (2)
The methods we will discuss focus on minimizing the error term, yielding the most
accurate linear representation of our data.
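As a concrete illustration of equation (2), here is a minimal sketch that simulates data from the model. The parameter values (α = 1.0, β = 2.5, σ = 0.5), the sample size, and the use of Python/NumPy are assumptions made for this example only.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration
alpha_true, beta_true = 1.0, 2.5   # intercept and slope
sigma = 0.5                        # standard deviation of the error term
N = 200                            # sample size

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=N)       # predictor values
eps = rng.normal(0.0, sigma, size=N)     # error term, eps ~ N(0, sigma^2)
y = alpha_true + beta_true * x + eps     # equation (2): y = alpha + beta*x + eps
```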
2 Ordinary Least Squares
In ordinary least squares we choose the estimates α̂ and β̂ that minimize the residual sum of
squares, Σᵢ(yᵢ − α̂ − β̂xᵢ)². To minimize this residual sum of squares, we take the first derivatives
with respect to α̂ and β̂ and set them equal to 0 to find the minimizing values.
$$\frac{\partial}{\partial \hat{\alpha}} \sum_{i=1}^{N}(y_i - \hat{\alpha} - \hat{\beta}x_i)^2 = -2\sum_{i=1}^{N}(y_i - \hat{\alpha} - \hat{\beta}x_i) = 0 \qquad (5)$$

$$\frac{\partial}{\partial \hat{\beta}} \sum_{i=1}^{N}(y_i - \hat{\alpha} - \hat{\beta}x_i)^2 = -2\sum_{i=1}^{N}(y_i - \hat{\alpha} - \hat{\beta}x_i)\,x_i = 0 \qquad (6)$$
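To make the first-order conditions (5) and (6) concrete, the sketch below fits a line by least squares (using numpy.polyfit as a stand-in for the minimization, an assumption for illustration) and checks numerically that the fitted residuals sum to zero and are orthogonal to the x values.

```python
import numpy as np

# Hypothetical data for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=100)
y = 1.0 + 2.5 * x + rng.normal(0.0, 0.5, size=100)

# Fit y = alpha_hat + beta_hat * x by least squares
# (polyfit returns coefficients from highest degree down)
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)

resid = y - alpha_hat - beta_hat * x

# First-order conditions (5) and (6): both sums vanish at the minimizer
print(np.isclose(resid.sum(), 0.0, atol=1e-6))          # residuals sum to ~0
print(np.isclose((resid * x).sum(), 0.0, atol=1e-6))     # residuals orthogonal to x
```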
Expanding equation (5) and dividing through by N gives ȳ − α̂ − β̂x̄ = 0, where the bar notation
indicates the mean of all y or x data values. Rearranging,
α̂ = ȳ − β̂x̄ (7)
Substituting this into equation (6), and dropping the constant factor of −2 since the right-hand
side is zero:
$$
\begin{aligned}
\sum_{i=1}^{N}(y_i - \hat{\alpha} - \hat{\beta}x_i)\,x_i &= \sum_{i=1}^{N}(y_i - \bar{y} + \hat{\beta}\bar{x} - \hat{\beta}x_i)\,x_i \\
&= \sum_{i=1}^{N}x_i y_i - \bar{y}\sum_{i=1}^{N}x_i + \hat{\beta}\bar{x}\sum_{i=1}^{N}x_i - \hat{\beta}\sum_{i=1}^{N}x_i^2 \\
&= \sum_{i=1}^{N}x_i y_i - N\bar{x}\bar{y} + \hat{\beta}N\bar{x}^2 - \hat{\beta}\sum_{i=1}^{N}x_i^2 = 0
\end{aligned} \qquad (8)
$$
Solving the final expression in (8) for β̂ gives
$$\hat{\beta} = \frac{\sum_{i=1}^{N}x_i y_i - N\bar{x}\bar{y}}{\sum_{i=1}^{N}x_i^2 - N\bar{x}^2} \qquad (9)$$
And there we have it! Equations (7) and (9) give us the values of α̂ and β̂, that is, the
equation of the line which minimizes the residual sum of squares, in terms of simple sums and
means of the x and y data.
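As a sanity check on equations (7) and (9), the following sketch computes α̂ and β̂ directly from the sums and means of simulated data and compares them with NumPy's built-in least-squares fit; the data-generating values are assumptions for the example.

```python
import numpy as np

# Hypothetical data for illustration only
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=500)
y = 1.0 + 2.5 * x + rng.normal(0.0, 0.5, size=500)

N = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Equation (9): beta_hat from sums of products and squares
beta_hat = (np.sum(x * y) - N * x_bar * y_bar) / (np.sum(x**2) - N * x_bar**2)
# Equation (7): alpha_hat from the means
alpha_hat = y_bar - beta_hat * x_bar

# Compare with NumPy's least-squares polynomial fit
beta_np, alpha_np = np.polyfit(x, y, deg=1)
print(alpha_hat, beta_hat)
print(np.allclose([alpha_hat, beta_hat], [alpha_np, beta_np]))
```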
3 Method of Moments
This same result can be achieved by using moments, with one additional condition. In
statistics, a moment is a quantitative summary of a distribution; the first few moments
correspond to the mean, variance, skewness, and kurtosis. In this case, we will be working
with means.
Return to the problem set up in section one:
y = α + βx + ϵ (10)
Our assumption for the method of moments will be that the error is normally distributed
with a mean of 0, i.e.:
ϵ ∼ N(0, σ²) (11)
Given this assumption, the expected value of each error ϵᵢ is 0:
E(ϵᵢ) = 0 (12)
Rearranging equation (10), we see that y − α − βx = ϵ, so by taking the expected value
and substituting equation (12) we see
E(yᵢ − α − βxᵢ) = E(ϵᵢ) = 0 (13)
Since the error is assumed to be independent of x and has mean zero, we also have
E(ϵᵢxᵢ) = 0 (14)
Then, substituting ϵᵢ = yᵢ − α − βxᵢ into equation (14),
E((yᵢ − α − βxᵢ)xᵢ) = 0 (15)
Lastly, notice that since ϵᵢ has mean zero, the expected value of ϵᵢ² is, by definition,
the variance σ².
E(ϵᵢ²) = σ² (16)
Now, by replacing the expected values in equations (13), (15), and (16) with the corresponding
sample means of our data, we get the following. Notice the introduction of hat notation to
distinguish the coefficients estimated from sample moments from the population values.
$$
\begin{aligned}
\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{\alpha} - \hat{\beta}x_i) &= 0 \\
\frac{1}{N}\sum_{i=1}^{N}x_i\,(y_i - \hat{\alpha} - \hat{\beta}x_i) &= 0 \qquad (17) \\
\frac{1}{N}\sum_{i=1}^{N}\hat{\epsilon}_i^{\,2} &= \hat{\sigma}^2
\end{aligned}
$$
Here ϵ̂ᵢ = yᵢ − α̂ − β̂xᵢ denotes the fitted residual. Notice the similarity between (17) and
equations (5) and (6) derived in the least squares method. By following the results of the
previous section, we see that this special case of normally distributed error yields the same
coefficient estimates as ordinary least squares, although the variance estimate σ̂², which
divides by N rather than N − 1, is biased.
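The sketch below works through the sample moment conditions in (17): the first two conditions form a small linear system in α̂ and β̂ (reproducing the least-squares estimates), and the third gives the biased variance estimate σ̂² that divides by N. The simulated data are again an assumption for illustration.

```python
import numpy as np

# Hypothetical data for illustration only
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, size=500)
y = 1.0 + 2.5 * x + rng.normal(0.0, 0.5, size=500)
N = len(x)

# First two moment conditions in (17), linear in (alpha_hat, beta_hat):
#   mean(y - a - b*x)     = 0
#   mean(x*(y - a - b*x)) = 0
A = np.array([[1.0, x.mean()],
              [x.mean(), np.mean(x**2)]])
b = np.array([y.mean(), np.mean(x * y)])
alpha_hat, beta_hat = np.linalg.solve(A, b)

# Third condition: biased variance estimate (divides by N, not N - 1)
resid = y - alpha_hat - beta_hat * x
sigma2_hat = np.mean(resid**2)

print(alpha_hat, beta_hat, sigma2_hat)
```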
4 Method of Maximum Likelihood
Finally, the same estimates can be derived by maximum likelihood. Under the assumption that
the errors are independent draws from N(0, σ²), the likelihood of observing our data is the
product of the normal densities of the errors:
$$L(\alpha, \beta, \sigma) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{\epsilon_i^2}{2\sigma^2}} = \frac{1}{(2\pi)^{N/2}\,\sigma^{N}}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}\epsilon_i^2} = \frac{1}{(2\pi)^{N/2}\,\sigma^{N}}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \alpha - \beta x_i)^2} \qquad (18)$$
To maximize this likelihood function with respect to α and β, we want the sum in the exponent,
Σᵢ(yᵢ − α − βxᵢ)², to be as small as possible. To do this we take the partial derivatives of this
sum with respect to α and β and set them equal to 0.
$$
\begin{aligned}
\frac{\partial}{\partial \alpha}\sum_{i=1}^{N}(y_i - \alpha - \beta x_i)^2 &= 0 \\
\frac{\partial}{\partial \beta}\sum_{i=1}^{N}(y_i - \alpha - \beta x_i)^2 &= 0
\end{aligned} \qquad (19)
$$
Notice again the similarity to the least squares equations (5) and (6). By solving (19), we
obtain the same linear regression coefficients α̂ and β̂ as before.
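As a numerical cross-check of the maximum likelihood derivation, the sketch below minimizes the negative log of the likelihood in (18) with scipy.optimize.minimize and confirms that the resulting α̂ and β̂ agree with the least-squares fit. The data, starting values, and choice of optimizer are assumptions for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data for illustration only
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=500)
y = 1.0 + 2.5 * x + rng.normal(0.0, 0.5, size=500)
N = len(x)

def neg_log_likelihood(params):
    """Negative log of the likelihood in (18), parameterised by (alpha, beta, log sigma)."""
    alpha, beta, log_sigma = params
    sigma = np.exp(log_sigma)              # keep sigma positive
    resid = y - alpha - beta * x
    return N * np.log(sigma) + 0.5 * N * np.log(2 * np.pi) + np.sum(resid**2) / (2 * sigma**2)

result = minimize(neg_log_likelihood, x0=np.zeros(3), method="BFGS")
alpha_mle, beta_mle, log_sigma_mle = result.x

# The MLE coefficients coincide with the least-squares solution
beta_ols, alpha_ols = np.polyfit(x, y, deg=1)
print(alpha_mle, beta_mle, np.exp(log_sigma_mle))
print(np.allclose([alpha_mle, beta_mle], [alpha_ols, beta_ols], atol=1e-3))
```

Reparameterising with log σ keeps the standard deviation positive without needing a constrained optimizer.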