
Chapter 2

Least Squares Regression

We have so far predicted salaries without taking into account any information about the
person whose salary is being predicted. Of course, there are lots of things which can influence
someone’s salary: how much experience they have, their college GPA, what city they live
in, and so forth. We call these things features. In this chapter, we’ll see how to apply the
framework of empirical risk minimization to the problem of predicting based on one or more
features.

2.1 Simple Linear Regression


As mentioned above, there are many features that could be useful in predicting someone’s
salary, but we will begin by considering just one: years of experience. That is, we wish to
make an accurate prediction of a data scientist’s salary by using only their experience as
an input. The feature x that we’ll base our prediction on is called the predictor variable
and the quantity y that we are trying to predict is called the response variable. In our
example, experience is the predictor variable and salary is the response variable.
We believe that experience is related to pay. In particular, we have a feeling that the more
experience someone has, the higher their salary is likely to be. This relationship is not exact,
of course – there is no law that says that someone with x years of experience should be paid
exactly some amount. But for the purposes of prediction, it is believable that there is some
function H(x) which takes in someone’s experience level and outputs a prediction of their
salary which is reasonably close to their actual salary.
We call such a function a prediction rule. You can think of it as a formula for making
predictions. Here is one example of a prediction rule:
H1(years of experience) = $50,000 + $2,000 × (years of experience)
This rule says that, to predict your pay, we should start with $50,000 and add $2,000 for
every year of experience you have. In other words, pay increases linearly with experience.
Another prediction rule is:
H2(years of experience) = $60,000 × 1.05^(years of experience)


In this rule, pay increases exponentially with experience. Yet another prediction rule is:
H3 (years of experience) = $100,000 − $5,000 × (years of experience)
This one says that your pay decreases with experience. This is (hopefully) not the case, and
it goes against our intuition. This is probably a bad prediction rule in the sense that it does
not make accurate predictions, but it is a prediction rule nonetheless.
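
In code, a prediction rule is just a function that takes in years of experience and returns a predicted salary. Here is a minimal sketch (a hypothetical Python rendering of the three rules above) that evaluates each rule for someone with four years of experience:

```python
def H1(years):
    # Linear rule: start at $50,000 and add $2,000 per year of experience.
    return 50_000 + 2_000 * years

def H2(years):
    # Exponential rule: start at $60,000 and grow by 5% per year of experience.
    return 60_000 * 1.05 ** years

def H3(years):
    # A (probably bad) rule under which predicted pay decreases with experience.
    return 100_000 - 5_000 * years

for H in (H1, H2, H3):
    print(H.__name__, round(H(4)))
```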
As before, we will assess whether a prediction rule is good or bad by seeing if the predictions
it makes match the data. The figure below shows what we might expect to see if we ask
several data scientists how much they make and how much experience they have.

In general, we observe a positive trend: the more experience someone has, the higher their
pay tends to be. We can plot the prediction rules H1 , H2 , and H3 on top of this data to get
a sense for the accuracy of their predictions.

Out of these, H1 appears to “fit” the data the best. We’ll now make this more precise.

2.1.1 Loss Functions


Suppose we have surveyed n data scientists, and for each, recorded their experience, xi , and
their salary, yi . We want to assess the accuracy of a prediction rule, H(x), which takes as
input years of experience and outputs the predicted salary.

Consider an arbitrary person: person i. Their salary is yi , and their experience is xi . Our
prediction for their salary is H(xi ). The absolute loss of our prediction is
$$|y_i - H(x_i)|$$
In the last chapter, we saw that the empirical risk, or average loss incurred when using H
to predict the salary for everyone in the data set, is one way to quantify how good or bad
H is. In this case, the mean absolute error is:
$$R_{\text{abs}}(H) = \frac{1}{n}\sum_{i=1}^{n} |y_i - H(x_i)|$$

Our goal now is to find a prediction rule H which results in the smallest mean absolute error.
Here we run into our first problem: it turns out that we can find a function which has zero
absolute error, but it isn’t as useful as we’d like! Here it is:

This prediction rule makes exactly the right prediction for every person in the data set. But
in order to go through every data point, this function has made itself quite wiggly. We
don’t believe that pay is truly related to experience in this way. To put it differently, we feel
that the data has some noise, and the above function describes the noise rather than the
underlying pattern that we are interested in. When this is the case, we say that the line has
overfit the data.
One antidote to overfitting is to mandate that the hypothesis not be so wiggly; in other
words, restrict the hypothesis space so that it doesn’t include such complicated functions.
This is an instance of Occam’s Razor: we should choose the simplest possible description of
the data which actually works. In this example, a straight line is a good, simple description
of the data, and so we’ll restrict the prediction rule to be of the form H(x) = w1 x + w0 , i.e.,
straight lines. This setting is called simple linear regression.
Our new goal is to find a linear prediction rule with the smallest mean absolute error. That
is, we want to solve:
$$H^* = \mathop{\arg\min}_{\text{linear } H} \frac{1}{n}\sum_{i=1}^{n} |y_i - H(x_i)|$$

But there is still a problem with this: it turns out that $R_{\text{abs}}$ is relatively hard to minimize.¹
The cause of this difficulty is our use of the absolute loss. Instead, we’ll try another approach
to quantifying the error of our prediction rule, the square loss:

$$(y_i - H(x_i))^2.$$

The average loss on the data set becomes

$$R_{\text{sq}}(H) = \frac{1}{n}\sum_{i=1}^{n} (y_i - H(x_i))^2.$$

This is also known as the mean squared error. Minimizing this is often called least
squares regression.
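
As a quick illustration, here is a minimal sketch (assuming NumPy, with hypothetical arrays `x` and `y` holding experience and salary values) that computes both the mean absolute error and the mean squared error of a candidate prediction rule:

```python
import numpy as np

def mean_absolute_error(H, x, y):
    # R_abs(H): the average of |y_i - H(x_i)| over the data set.
    return np.mean(np.abs(y - H(x)))

def mean_squared_error(H, x, y):
    # R_sq(H): the average of (y_i - H(x_i))^2 over the data set.
    return np.mean((y - H(x)) ** 2)

# Hypothetical data: years of experience and salaries in dollars.
x = np.array([1.0, 3.0, 5.0, 8.0])
y = np.array([55_000.0, 62_000.0, 71_000.0, 80_000.0])

H1 = lambda x: 50_000 + 2_000 * x   # the first prediction rule from the text
print(mean_absolute_error(H1, x, y), mean_squared_error(H1, x, y))
```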

2.1.2 Minimizing the Mean Squared Error


Let's now minimize the mean squared error, $R_{\text{sq}}(H)$. Since $R_{\text{sq}}$ is a function of $H$, but $H$ is determined by our choice of $w_0$ and $w_1$, we will also write $R_{\text{sq}}(H)$ as $R_{\text{sq}}(w_0, w_1)$ to indicate more clearly that the loss is determined by the values of $w_0$ and $w_1$. When we minimize the loss function, we are really trying to find the optimal values of $w_0, w_1$: those that define the best-fitting line for the data. We have:

$$
\begin{aligned}
R_{\text{sq}}(H) &= \frac{1}{n}\sum_{i=1}^{n} (y_i - H(x_i))^2 \\
R_{\text{sq}}(w_0, w_1) &= \frac{1}{n}\sum_{i=1}^{n} \left(y_i - (w_0 + w_1 x_i)\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2
\end{aligned}
$$

This is differentiable. To find the best w0 and w1 , we’ll use calculus. Taking the partial
derivatives with respect to w0 and w1 and setting them to zero gives

$$\frac{\partial R_{\text{sq}}(w_0, w_1)}{\partial w_0} = \frac{1}{n}\sum_{i=1}^{n} -2\,(y_i - w_0 - w_1 x_i) = 0,$$

$$\frac{\partial R_{\text{sq}}(w_0, w_1)}{\partial w_1} = \frac{1}{n}\sum_{i=1}^{n} -2x_i\,(y_i - w_0 - w_1 x_i) = 0.$$

We must now solve this system of equations to find the optimal values of w0 and w1 .
¹ Although it can be done with linear programming.

We will start by setting $\partial R_{\text{sq}}/\partial w_0 = 0$ and solving for $w_0$.²

$$\frac{\partial R_{\text{sq}}}{\partial w_0} = 0 \implies \frac{1}{n}\sum_{i=1}^{n} 2\,(w_1 x_i + w_0 - y_i) = 0$$

We can take the 2 outside of the sum:

$$\implies \frac{2}{n}\sum_{i=1}^{n} (w_1 x_i + w_0 - y_i) = 0$$

If we multiply both sides by n/2, we are left with a constant factor of 1 on the left hand side, whereas the right hand side remains zero. In effect, we can get rid of the 2/n:

$$\implies \sum_{i=1}^{n} (w_1 x_i + w_0 - y_i) = 0$$

Our goal is to isolate the w0 on one side of the equation, but we are momentarily prevented
from this by the fact that w0 is inside of the sum. We can remove it from the summation,
however; the first step is to break it into three independent sums:

$$\implies \sum_{i=1}^{n} w_1 x_i + \sum_{i=1}^{n} w_0 - \sum_{i=1}^{n} y_i = 0$$

$$\implies \sum_{i=1}^{n} w_0 = \sum_{i=1}^{n} y_i - w_1 \sum_{i=1}^{n} x_i$$

$$\implies n w_0 = \sum_{i=1}^{n} y_i - w_1 \sum_{i=1}^{n} x_i$$

$$\implies w_0 = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - w_1 \sum_{i=1}^{n} x_i\right)$$


Define $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. If $x_i$ and $y_i$ are experience and salary, then $\bar{x}$ is the mean experience and $\bar{y}$ is the mean salary of people in our data set. Using this notation,

$$\implies w_0 = \bar{y} - w_1 \bar{x}$$
And so we have successfully isolated w0 . We now solve ∂Rsq /∂w1 = 0 for w1 :

$$\frac{\partial R_{\text{sq}}}{\partial w_1} = 0 \implies \frac{1}{n}\sum_{i=1}^{n} 2\left((w_1 x_i + w_0) - y_i\right) x_i = 0$$
² Note that we could have also decided to solve this equation for $w_1$, as long as we then solve the other equation for $w_0$.

We'll get rid of the 2 and the 1/n straight away:

$$\implies \sum_{i=1}^{n} \left((w_1 x_i + w_0) - y_i\right) x_i = 0$$

We now substitute our solution for $w_0$:

$$\implies \sum_{i=1}^{n} \left((w_1 x_i + \bar{y} - w_1 \bar{x}) - y_i\right) x_i = 0$$

We'll group the $x_i$ with the $\bar{x}$ and the $y_i$ with the $\bar{y}$:

$$\implies \sum_{i=1}^{n} \left(w_1 (x_i - \bar{x}) - (y_i - \bar{y})\right) x_i = 0$$

Splitting the summand:

$$\implies w_1 \sum_{i=1}^{n} (x_i - \bar{x})\,x_i - \sum_{i=1}^{n} (y_i - \bar{y})\,x_i = 0$$

$$\implies w_1 \sum_{i=1}^{n} (x_i - \bar{x})\,x_i = \sum_{i=1}^{n} (y_i - \bar{y})\,x_i$$

$$\implies w_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})\,x_i}{\sum_{i=1}^{n} (x_i - \bar{x})\,x_i}$$

We could stop here; this is a totally valid formula for $w_1$. But we'll continue to get a formula with a little more symmetry. The key is that $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, as can be verified. This enables us to write:

$$\implies w_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

We'll show that the numerators of these last two formulas for $w_1$ are the same (and you can show that the denominators are, too). We have:

$$
\begin{aligned}
\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) &= \sum_{i=1}^{n} (y_i - \bar{y})\,x_i - \sum_{i=1}^{n} (y_i - \bar{y})\,\bar{x} \\
&= \sum_{i=1}^{n} (y_i - \bar{y})\,x_i - \bar{x}\sum_{i=1}^{n} (y_i - \bar{y}) \\
&= \sum_{i=1}^{n} (y_i - \bar{y})\,x_i - 0 \\
&= \sum_{i=1}^{n} (y_i - \bar{y})\,x_i
\end{aligned}
$$

We have arrived at a formula for w1 , the slope of our linear prediction rule, which can be
computed from the data. Notice that the formula for the intercept, w0 , involves w1 . So to
find the linear prediction rule, we should first calculate w1 , then plug it into the formula to
find w0 .
In summary, the formulas we have found are:

$$w_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x},$$

where x̄ and ȳ are the mean of the xi and yi , respectively. These are called the least squares
solutions for the slope and intercept parameters.
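
To make these formulas concrete, here is a minimal sketch (assuming NumPy, with hypothetical arrays `x` and `y` of experience and salary values) that computes the least squares slope and intercept:

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (w0, w1) minimizing mean squared error for H(x) = w0 + w1 * x."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of (x_i - x_bar)(y_i - y_bar) divided by sum of (x_i - x_bar)^2.
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: plug w1 into w0 = y_bar - w1 * x_bar.
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data: years of experience and salaries in dollars.
x = np.array([1.0, 3.0, 5.0, 8.0])
y = np.array([55_000.0, 62_000.0, 71_000.0, 80_000.0])
print(least_squares_fit(x, y))
```

Note that the slope is computed first and then plugged into the formula for the intercept, mirroring the order described in the text.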
It should be emphasized that the fact that we have formulas which allow us to calculate w0
and w1 directly is a consequence of using the mean squared error. If we had used another
loss function, for example, the mean absolute error, then we may not be so lucky to have
formulas. Instead, we would have to minimize the loss using an algorithmic approach in
which we iterate towards the correct answer.³

³ There are even some losses which are so hard to optimize that we can't even hope to find an algorithm which does so exactly; we have to approximate their minimizers.

2.1.3 Fitting non-linear trends


You may be surprised to learn that the formulas we have just derived for fitting a straight line to the data can sometimes be cleverly used to fit certain nonlinear curves to data. By applying a suitable transformation, it can be possible to turn a nonlinear relationship between $x$ and $y$ into a linear relationship between two quantities that can be directly computed from $x$ and $y$.

For instance, we can fit a curve of the form

y = w0 + w1 ln x

to data with the tools we already have. While y does not vary linearly with x, it does vary
linearly with ln x. So if we let z = ln x be our feature variable, we can use the formulas we
have developed to find the slope and intercept of the linear relationship.

$$w_1 = \frac{\displaystyle\sum_{i=1}^{n} \Bigl(\ln x_i - \tfrac{1}{n}\textstyle\sum_{j=1}^{n} \ln x_j\Bigr)(y_i - \bar{y})}{\displaystyle\sum_{i=1}^{n} \Bigl(\ln x_i - \tfrac{1}{n}\textstyle\sum_{j=1}^{n} \ln x_j\Bigr)^2}$$

$$w_0 = \bar{y} - w_1 \cdot \frac{1}{n}\sum_{i=1}^{n} \ln x_i$$

We can use w1 and w0 as the parameters that define the best-fitting curve of the form
y = w0 + w1 ln x.

Similarly, we can use the least squares solutions we derived above to fit data with a curve of the form

$$y = c_0 x^{c_1}.$$

Here, it is not so obvious that this relationship can be written as a linear relationship in new
variables. The trick is taking a logarithm of both sides. We can use a logarithm with any
base. We’ll pick base ten. The relationship becomes

$$
\begin{aligned}
\log y &= \log(c_0 x^{c_1}) \\
\log y &= \log c_0 + \log x^{c_1} \\
\log y &= \log c_0 + c_1 \log x.
\end{aligned}
$$

If we let z = log x be our predictor variable, and v = log y be our response variable, we see
that there is a linear relationship v = w0 + w1 z, where the intercept is w0 = log c0 and the
slope is w1 = c1 .

Therefore, we can use the following formulas for w0 and w1 , which we can then use to find
c0 and c1 .
$$w_1 = \frac{\displaystyle\sum_{i=1}^{n} \Bigl(\log x_i - \tfrac{1}{n}\textstyle\sum_{j=1}^{n} \log x_j\Bigr)\Bigl(\log y_i - \tfrac{1}{n}\textstyle\sum_{j=1}^{n} \log y_j\Bigr)}{\displaystyle\sum_{i=1}^{n} \Bigl(\log x_i - \tfrac{1}{n}\textstyle\sum_{j=1}^{n} \log x_j\Bigr)^2}$$

$$w_0 = \frac{1}{n}\sum_{i=1}^{n} \log y_i - w_1 \cdot \frac{1}{n}\sum_{i=1}^{n} \log x_i$$
Notice that $w_1 = c_1$ and $w_0 = \log c_0$, so we can find $c_0$ by computing $c_0 = 10^{w_0}$. This gives us a way to find the parameters $c_0, c_1$ for our original prediction rule $y = c_0 x^{c_1}$.
In general, we can use the formulas we've derived to fit any prediction rule that can be written in the form $g(y) = w_1 \cdot f(x) + w_0$, where $f(x)$ is some transformation of $x$, such as $x^2$, $e^x$, $\log x$, etc., and $g(y)$ is some transformation of $y$. Here is the procedure, with a code sketch following the steps below:
1. Create a new data set $(z_1, v_1), \ldots, (z_n, v_n)$, where $z_i = f(x_i)$ and $v_i = g(y_i)$.
2. Fit $v = w_1 z + w_0$ using the familiar least squares solutions:

$$w_1 = \frac{\sum_{i=1}^{n} (z_i - \bar{z})(v_i - \bar{v})}{\sum_{i=1}^{n} (z_i - \bar{z})^2}, \qquad w_0 = \bar{v} - w_1 \cdot \bar{z}$$

where z̄ is the mean of the zi ’s and v̄ is the mean of the vi ’s.


3. If necessary, use w0 and w1 to find the parameters of the original prediction rule.
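
As a concrete illustration of this procedure, here is a minimal sketch (assuming NumPy, with hypothetical arrays `x` and `y` of positive values) that fits the power-law rule $y = c_0 x^{c_1}$ by taking $f(x) = \log x$ and $g(y) = \log y$:

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = c0 * x**c1 by least squares on log-log transformed data."""
    # Step 1: transform the data, z_i = log10(x_i) and v_i = log10(y_i).
    z, v = np.log10(x), np.log10(y)
    # Step 2: fit v = w1*z + w0 with the usual least squares solutions.
    w1 = np.sum((z - z.mean()) * (v - v.mean())) / np.sum((z - z.mean()) ** 2)
    w0 = v.mean() - w1 * z.mean()
    # Step 3: recover the original parameters, c1 = w1 and c0 = 10**w0.
    return 10 ** w0, w1

# Hypothetical data roughly following y = 2 * x**1.5.
x = np.array([1.0, 2.0, 4.0, 8.0])
y = np.array([2.1, 5.5, 16.3, 44.8])
print(fit_power_law(x, y))   # approximately (2, 1.5)
```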

2.2 Multiple Linear Regression


Next, we will look at linear regression through the lens of linear algebra, which will eventually
allow us to extend what we know to new settings. In particular, we’ll be able to include
more than one predictor variable in making our predictions.
We have defined the least squares line as the prediction rule H(x) = w0 + w1 x for which the
mean square error is minimized. That is, it is the line that minimizes the risk:

$$R_{\text{sq}}(H) = \frac{1}{n}\sum_{i=1}^{n} (y_i - H(x_i))^2$$

or, equivalently:

$$R_{\text{sq}}(w_0, w_1) = \frac{1}{n}\sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2.$$

Notice that the risk is $\frac{1}{n}$ times a sum of squares, and remember from linear algebra that we also measure the length of a vector using a sum of squares. For example, the length of the vector

$$\vec{x} = \begin{bmatrix} 2 \\ 5 \\ 4 \end{bmatrix}$$

is computed as

$$\sqrt{2^2 + 5^2 + 4^2} = \sqrt{45} = \sqrt{9 \cdot 5} = 3\sqrt{5}.$$

If we use $\|\vec{v}\|$ to denote the length of a vector $\vec{v}$, this means we can write $R_{\text{sq}}(H) = \frac{1}{n}\|\vec{e}\|^2$, where $\vec{e}$ is the vector whose $i$th component is $e_i = y_i - H(x_i)$, which is the error in the $i$th prediction. We'll call $\vec{e}$ the error vector. Since each component of $\vec{e}$ comes from a difference between $y_i$ and $H(x_i)$, we can think of the vector $\vec{e}$ as a difference of two vectors $\vec{y}$ and $\vec{h}$, where the $i$th component of $\vec{y}$ is $y_i$ and the $i$th component of $\vec{h}$ is $H(x_i)$. We call $\vec{y}$ the observation vector and $\vec{h}$ the hypothesis vector or prediction vector. We are writing the error vector as the difference between the observation vector and the prediction vector. Since $\vec{e} = \vec{y} - \vec{h}$, the risk can be written as $R_{\text{sq}}(H) = \frac{1}{n}\|\vec{e}\|^2 = \frac{1}{n}\|\vec{y} - \vec{h}\|^2$.
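
Here is a minimal sketch (assuming NumPy, with hypothetical data and a hypothetical candidate rule) that builds the prediction vector and the error vector and evaluates the risk as a squared length:

```python
import numpy as np

# Hypothetical data and a candidate linear rule H(x) = w0 + w1 * x.
x = np.array([2.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 3.0])
w0, w1 = 0.5, 0.3

h = w0 + w1 * x               # prediction vector: ith entry is H(x_i)
e = y - h                     # error vector: e = y - h
risk = np.dot(e, e) / len(y)  # R_sq(H) = (1/n) * ||e||^2
print(risk)
```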
Now, since our function $H(x)$ is a linear function of the form $H(x) = w_0 + w_1 x$, this means we can write the prediction vector $\vec{h}$ as follows:

$$\vec{h} = \begin{bmatrix} H(x_1) \\ H(x_2) \\ \vdots \\ H(x_n) \end{bmatrix} = \begin{bmatrix} w_0 + w_1 x_1 \\ w_0 + w_1 x_2 \\ \vdots \\ w_0 + w_1 x_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}.$$

Letting

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \quad \text{and} \quad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix},$$

we have $\vec{h} = X\vec{w}$. The matrix $X$ is called the design matrix, and the vector $\vec{w}$ is called the parameter vector.
Notice that we are able to write ⃗h as the product of a matrix and a vector because our choice
of hypothesis H(x) was linear in the parameters w0 , w1 . We’ll soon be able to adjust the form
of H(x) to fit any model that is linear in the parameters, including arbitrary polynomials.
For now, we continue with our linear fit $H(x) = w_0 + w_1 x$. We have established that

$$R_{\text{sq}}(H) = R_{\text{sq}}(w_0, w_1) = \frac{1}{n}\|\vec{y} - \vec{h}\|^2 = \frac{1}{n}\|\vec{y} - X\vec{w}\|^2.$$
Note that we can think of the risk as a function of the parameter vector $\vec{w}$, since changing this parameter vector produces different values for the risk. The values of $X$ and $\vec{y}$ are completely determined by the data set, so they are fixed, and $\vec{w}$ is the only thing that can change. Our goal is to find the choice of $\vec{w}$ for which $R_{\text{sq}}(\vec{w}) = \frac{1}{n}\|\vec{y} - X\vec{w}\|^2$ is minimized. Equivalently, our goal is to find the vector $\vec{w}$ for which the vector $\vec{y} - X\vec{w}$ has the smallest length, since minimizing a constant times the square of the length is equivalent to minimizing the length itself. We'll use calculus to find the value of $\vec{w}$ where the minimum is achieved.

2.2.1 Minimizing the Mean Squared Error


At this point, we have an expression for the mean squared error in matrix notation:

$$R_{\text{sq}}(\vec{w}) = \frac{1}{n}\|\vec{y} - X\vec{w}\|^2.$$

Given the data, $X$ and $\vec{y}$ are fixed, known quantities. Our goal is to find the “best” $\vec{w}$, in the sense that $\vec{w}$ results in the smallest mean squared error. We'll use vector calculus to do so.

Our main strategy for minimizing risk thus far has been to take the derivative of the risk function, set it to zero, and solve for the minimizer. In this case, we want to differentiate $R_{\text{sq}}(\vec{w})$ with respect to a vector, $\vec{w}$. The derivative of a scalar-valued function with respect to a vector input is called the gradient. The gradient of $R_{\text{sq}}(\vec{w})$ with respect to $\vec{w}$, written $\nabla_{\vec{w}} R_{\text{sq}}$ or $\frac{dR_{\text{sq}}}{d\vec{w}}$, is defined to be the vector of partial derivatives:

$$\frac{dR_{\text{sq}}}{d\vec{w}} = \begin{bmatrix} \dfrac{\partial R_{\text{sq}}}{\partial w_0} \\[2mm] \dfrac{\partial R_{\text{sq}}}{\partial w_1} \\[1mm] \vdots \\[1mm] \dfrac{\partial R_{\text{sq}}}{\partial w_d} \end{bmatrix},$$

where $w_0, w_1, \ldots, w_d$ are the entries of the vector $\vec{w}$. In our case, since we are using a linear prediction rule, $\vec{w}$ has only two components, $w_0$ and $w_1$, but we define the gradient for the more general case where $\vec{w}$ has an arbitrary number of components. This definition of the gradient with respect to a vector says that when we think of $R_{\text{sq}}(\vec{w})$ as a function of a vector, we are really thinking of it as a function of multiple variables, which are the components of the vector. The gradient of $R_{\text{sq}}(\vec{w})$ with respect to $\vec{w}$ is the same as the gradient of $R_{\text{sq}}(w_0, w_1, \ldots, w_d)$, a multivariable function.

Our goal is to find the gradient of the mean squared error. We start by expanding the mean squared error in order to get rid of the squared norm, and we write the mean squared error in terms of dot products instead. We need to recall a few key facts from linear algebra, which we will make use of shortly. Here, $A$ and $B$ are matrices, and $\vec{u}, \vec{v}, \vec{w}, \vec{z}$ are vectors:

$$
\begin{aligned}
(A + B)^T &= A^T + B^T, \\
(AB)^T &= B^T A^T, \\
\vec{u} \cdot \vec{v} &= \vec{v} \cdot \vec{u} = \vec{u}^T \vec{v} = \vec{v}^T \vec{u}, \\
\|\vec{u}\|^2 &= \vec{u} \cdot \vec{u}, \\
(\vec{u} + \vec{v}) \cdot (\vec{w} + \vec{z}) &= \vec{u} \cdot \vec{w} + \vec{u} \cdot \vec{z} + \vec{v} \cdot \vec{w} + \vec{v} \cdot \vec{z}.
\end{aligned}
$$

Therefore:

$$
\begin{aligned}
R_{\text{sq}}(\vec{w}) &= \frac{1}{n}\|\vec{y} - X\vec{w}\|^2 \\
&= \frac{1}{n}(\vec{y} - X\vec{w})^T(\vec{y} - X\vec{w}) \\
&= \frac{1}{n}\left(\vec{y}^T - \vec{w}^T X^T\right)(\vec{y} - X\vec{w}) \\
&= \frac{1}{n}\left[\vec{y}^T\vec{y} - \vec{y}^T X\vec{w} - \vec{w}^T X^T\vec{y} + \vec{w}^T X^T X\vec{w}\right].
\end{aligned}
$$

Observe that $\vec{y}^T X\vec{w}$ and $\vec{w}^T X^T\vec{y}$ are both the dot product of $\vec{w}$ with $X^T\vec{y}$ and are therefore equal. So,

$$
\begin{aligned}
R_{\text{sq}}(\vec{w}) &= \frac{1}{n}\left[\vec{y}^T\vec{y} - 2X^T\vec{y} \cdot \vec{w} + \vec{w}^T X^T X\vec{w}\right] \\
&= \frac{1}{n}\left[\vec{y} \cdot \vec{y} - 2X^T\vec{y} \cdot \vec{w} + \vec{w}^T X^T X\vec{w}\right]
\end{aligned}
$$

Now we take the gradient. We have:

$$
\begin{aligned}
\frac{dR_{\text{sq}}}{d\vec{w}} &= \frac{d}{d\vec{w}}\left(\frac{1}{n}\left[\vec{y} \cdot \vec{y} - 2X^T\vec{y} \cdot \vec{w} + \vec{w}^T X^T X\vec{w}\right]\right) \\
&= \frac{1}{n}\left[\frac{d}{d\vec{w}}(\vec{y} \cdot \vec{y}) - \frac{d}{d\vec{w}}\left(2X^T\vec{y} \cdot \vec{w}\right) + \frac{d}{d\vec{w}}\left(\vec{w}^T X^T X\vec{w}\right)\right].
\end{aligned}
$$

For the first term, the gradient of $\vec{y} \cdot \vec{y}$ with respect to $\vec{w}$ is zero, since $\vec{y}$ does not change as $\vec{w}$ changes, so we treat it as a constant. For the second term, we will use a property shown in the homework, which said that for any constant vector $\vec{v}$, $\frac{d}{d\vec{w}}(\vec{v} \cdot \vec{w}) = \vec{v}$. Letting $\vec{v} = -2X^T\vec{y}$ and applying this property shows that the gradient of $-2X^T\vec{y} \cdot \vec{w}$ with respect to $\vec{w}$ is $-2X^T\vec{y}$. For the last term, we also rely on a homework problem where you show that $\frac{d}{d\vec{w}}\left(\vec{w}^T X^T X\vec{w}\right) = 2X^T X\vec{w}$. Therefore,

$$\frac{dR_{\text{sq}}}{d\vec{w}} = \frac{1}{n}\left(-2X^T\vec{y} + 2X^T X\vec{w}\right).$$


This is the gradient of the mean squared error. Our next step is to set it equal to zero and solve for $\vec{w}$ (the factor of $\frac{1}{n}$ does not affect where the gradient is zero, so we can drop it):

$$\frac{dR_{\text{sq}}}{d\vec{w}} = 0 \implies -2X^T\vec{y} + 2X^T X\vec{w} = 0 \implies X^T X\vec{w} = X^T\vec{y}.$$

The last line is an important one. The equation $X^T X\vec{w} = X^T\vec{y}$ defines a system of equations in matrix form known as the normal equations. This system can be solved using Gaussian elimination, or, if the matrix $X^T X$ is invertible, by multiplying both sides by its inverse to obtain $\vec{w} = (X^T X)^{-1} X^T\vec{y}$.

The solution $\vec{w}$ of the normal equations, $X^T X\vec{w} = X^T\vec{y}$, is the solution to the least squares problem. Before, we had formulas for the least squares solutions for $w_0$ and $w_1$; the normal equations give equivalent solutions. In fact, you can derive our old formulas from the normal equations with a little bit of algebra. The advantage of the normal equations will be that they generalize more easily. We'll be able to use the normal equations to fit arbitrary polynomials to data, as well as to make predictions based on multiple features.

Let’s try out an example. We will use the normal equations to find the linear function that
best approximates the data (2, 1), (5, 2), (7, 3), (8, 3), where the first element of each pair is
the predictor variable x, and the second element is the response variable, y.

We have

$$\vec{y} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 3 \end{bmatrix} \quad \text{and} \quad X = \begin{bmatrix} 1 & 2 \\ 1 & 5 \\ 1 & 7 \\ 1 & 8 \end{bmatrix}.$$

From this, we can calculate

$$X^T X = \begin{bmatrix} 4 & 22 \\ 22 & 142 \end{bmatrix} \quad \text{and} \quad X^T\vec{y} = \begin{bmatrix} 9 \\ 57 \end{bmatrix},$$

so the normal equations are

$$\begin{bmatrix} 4 & 22 \\ 22 & 142 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} 9 \\ 57 \end{bmatrix}.$$

Now, we just need to solve this linear system. The left-hand side matrix can be inverted, and left multiplication by its inverse gives

$$\vec{w} = \begin{bmatrix} 2/7 \\ 5/14 \end{bmatrix}.$$

The regression line is therefore $y = \frac{2}{7} + \frac{5}{14}x$, as shown in the picture below.

You can verify that the formulas for the least squares solutions derived above in section 2.1.2
give the same slope and intercept as we found using the normal equations.
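
Here is a minimal sketch (assuming NumPy) that reproduces this worked example by building the design matrix and solving the normal equations numerically:

```python
import numpy as np

# Worked example data: (2, 1), (5, 2), (7, 3), (8, 3).
x = np.array([2.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 3.0])

# Design matrix: a column of ones followed by the column of x values.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # approximately [0.2857, 0.3571], i.e. [2/7, 5/14]
```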

2.2.2 Nonlinear Prediction Rules


The really nice thing about this linear algebra approach to least squares is that we can
use the same strategy to fit many different curves to a set of data points. Before, we only
considered the case where the prediction rule H(x) was a linear function of x. We were not
able to, say, fit an arbitrary quadratic of the form $H(x) = w_0 + w_1 x + w_2 x^2$. With the linear algebra method, we can now do this easily.
For example, consider the set of five points (−2, 0), (−1, 0), (0, 1), (1, 0), (2, 0), where the
first element of each pair is the predictor variable, x, and the second element is the response
variable, y. The system of equations corresponding to a parabola that best fits these points
has design matrix

$$X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \\ 1 & x_4 & x_4^2 \\ 1 & x_5 & x_5^2 \end{bmatrix} = \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \end{bmatrix},$$
parameter vector

$$\vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix},$$

and observation vector

$$\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}.$$
We can calculate that

$$X^T X = \begin{bmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{bmatrix} \quad \text{and} \quad X^T\vec{y} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix},$$

so that the least squares solution satisfies

$$\begin{bmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{bmatrix} \vec{w} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}.$$

Solving this system of equations gives

$$\vec{w} = \begin{bmatrix} 17/35 \\ 0 \\ -1/7 \end{bmatrix},$$

so that the best-fitting quadratic is $y = \frac{17}{35} - \frac{1}{7}x^2$. The plot below shows the data points and the best-fitting quadratic.

The least squares method we have developed using linear algebra can be used to find the best-fitting curve of any functional form that is linear in the parameters, such as

$$H(x) = w_0 + w_1 e^x$$

or

$$H(x) = w_0 + w_1 x + w_2 \sin x.$$

This includes all polynomials, because any polynomial

$$H(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d$$

is linear when thought of as a function of the coefficients $w_0, w_1, w_2, \ldots, w_d$.
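
Here is a minimal sketch (assuming NumPy) that fits the parabola from the example above by building the quadratic design matrix and solving the normal equations; the same pattern applies to any rule that is linear in its parameters:

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Fit H(x) = w0 + w1*x + ... + wd*x**d by solving the normal equations."""
    # Design matrix: column j holds x**j, so column 0 is all ones.
    X = np.column_stack([x ** j for j in range(degree + 1)])
    # Solve X^T X w = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example data: (-2, 0), (-1, 0), (0, 1), (1, 0), (2, 0).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
print(fit_polynomial(x, y, degree=2))   # approximately [17/35, 0, -1/7]
```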

2.2.3 Using Multiple Features


The linear algebra approach is also useful for fitting a function of multiple variables to data,
under the same stipulation that the function should be linear in the parameters. This is
called multiple regression because we are using multiple predictor variables to predict the
value of the response variable. This is highly useful because the more predictor variables we
take into consideration when predicting the value of the response variable, the more likely
we are to make an accurate prediction. For example, Galton might have been able to make
more accurate predictions had he used the mother’s height and father’s height as separate
predictor variables, rather than lumping them together into a single midparent height.

For example, suppose we want to predict the price $y$ of a laptop computer based on its weight $x^{(1)}$ and amount of memory $x^{(2)}$. Here we are using superscript notation with parentheses to distinguish between the different predictor variables. In particular, the parentheses are used to emphasize that these are indices, not exponents. We could just use different letters for each predictor variable (and it may help to think of the superscripts as such), but we'll use this notation because we want to be able to generalize to situations where we have an arbitrary number of predictor variables, maybe even more than there are letters in the alphabet.

To make a prediction, we could collect data for $n$ laptops, recording each laptop's weight in ounces, amount of random access memory in gigabytes, and price in dollars. We use subscripts to distinguish between the different laptops in our data set. The $i$th laptop in our data set has weight $x_i^{(1)}$, memory $x_i^{(2)}$, and price $y_i$. Suppose that a plot of the data points $(x_i^{(1)}, x_i^{(2)}, y_i)$ shows that it makes sense to use a prediction rule of the form

$$H(x^{(1)}, x^{(2)}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)},$$

which is the equation of a plane.


Since our function $H$ is linear in the parameters, we can write the prediction vector $\vec{h}$ as follows:

$$\vec{h} = \begin{bmatrix} H(x_1^{(1)}, x_1^{(2)}) \\ H(x_2^{(1)}, x_2^{(2)}) \\ \vdots \\ H(x_n^{(1)}, x_n^{(2)}) \end{bmatrix} = \begin{bmatrix} w_0 + w_1 x_1^{(1)} + w_2 x_1^{(2)} \\ w_0 + w_1 x_2^{(1)} + w_2 x_2^{(2)} \\ \vdots \\ w_0 + w_1 x_n^{(1)} + w_2 x_n^{(2)} \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}.$$
This corresponds to a design matrix of

$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} \end{bmatrix}$$

and a parameter vector of

$$\vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}.$$
Letting our observation vector be

$$\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},$$

we have everything we need to solve the normal equations $X^T X\vec{w} = X^T\vec{y}$ and find the parameters of the prediction rule

$$H(x^{(1)}, x^{(2)}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)}$$

with minimum mean squared error.

If you collected actual data on laptop computers and used the normal equations to find the
values of the parameters of the prediction rule, what signs would you expect w0 , w1 , and
w2 to have? The more memory the computer has, the more expensive it probably is, and so
w2 would be positive. As for w1 , you could make the case either way: people are willing to
spend extra for ultralight laptops, but at the same time, smaller, cheaper computers tend
to weigh less than expensive and large gaming laptops with bigger screens. So it isn’t clear
whether w1 would be positive or negative. The constant term w0 , known as the bias, is
often hard to interpret because it represents a nonsensical situation, in this case, the price of
a computer that weighs nothing and has no memory. It’s again not clear whether w0 would
be positive or negative.

What do the magnitudes of the parameters $w_0, w_1, w_2$ represent? It is tempting to say that the parameters indicate the contribution of each feature variable towards the overall price of a laptop. For example, if $w_2 > w_1$, you might think that memory is therefore more important than weight in determining a laptop's price. However, there's a problem with this line of reasoning, which is that it doesn't take into account the scale of the data. For example, if we had measured each computer's weight in pounds instead of ounces, every weight value would be one sixteenth as large, and the associated parameter $w_1$ would have to be sixteen times as large to produce the same predictions. A simple change of units could therefore flip which of $w_1$ and $w_2$ is larger, but it wouldn't make sense to say that weight had suddenly become more or less important than memory. In order to directly compare the magnitudes of the parameters, we need to make sure the data is all measured on the same scale. The best way to accomplish this is to standardize each feature variable by subtracting the mean value of that variable and dividing by the standard deviation. This ensures that all feature variables use the same scale, since they are all measured in standard units. Constructing the design matrix from the standardized feature variables and solving the resulting normal equations gives parameters that can be directly compared. These parameters are known as the standardized regression coefficients.
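
Here is a minimal sketch (assuming NumPy, with a hypothetical array `features` whose columns are weight and memory and a hypothetical price array `y`) of how standardized regression coefficients might be computed:

```python
import numpy as np

def standardized_coefficients(features, y):
    """Fit a multiple regression after converting each feature to standard units."""
    # Standardize each column: subtract its mean and divide by its standard deviation.
    standardized = (features - features.mean(axis=0)) / features.std(axis=0)
    # Design matrix: a column of ones followed by the standardized features.
    X = np.column_stack([np.ones(len(y)), standardized])
    # Solve the normal equations X^T X w = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical laptops: columns are weight (ounces) and memory (GB); y is price ($).
features = np.array([[48.0, 8.0], [32.0, 16.0], [64.0, 8.0], [40.0, 32.0]])
y = np.array([900.0, 1300.0, 1100.0, 1800.0])
print(standardized_coefficients(features, y))
```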

In general, suppose we want to use $d$ predictor variables $x^{(1)}, x^{(2)}, \ldots, x^{(d)}$ in our prediction of a response variable $y$, using a rule of the form

$$H(x^{(1)}, x^{(2)}, \ldots, x^{(d)}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \cdots + w_d x^{(d)}.$$

We collect data from $n$ individuals, and for each individual $i$, create a feature vector, $\vec{x}_i$, which is a vector in $\mathbb{R}^d$ containing each of the $d$ features corresponding to this one individual:

$$\vec{x}_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(d)} \end{bmatrix}.$$

Define the augmented feature vector $\text{Aug}(\vec{x}_i)$ to be the vector in $\mathbb{R}^{d+1}$ whose first component is 1 and whose other $d$ components are the components of $\vec{x}_i$:

$$\text{Aug}(\vec{x}_i) = \begin{bmatrix} 1 \\ x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(d)} \end{bmatrix}.$$

We want to find a prediction rule of the form

$$H(x^{(1)}, x^{(2)}, \ldots, x^{(d)}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \cdots + w_d x^{(d)},$$

so we use a design matrix of

$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(d)} \\ 1 & x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(d)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(d)} \end{bmatrix} = \begin{bmatrix} \text{Aug}(\vec{x}_1)^T \\ \text{Aug}(\vec{x}_2)^T \\ \vdots \\ \text{Aug}(\vec{x}_n)^T \end{bmatrix}$$

and a parameter vector of

$$\vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix},$$

so we can see that

$$X\vec{w} = \begin{bmatrix} \text{Aug}(\vec{x}_1) \cdot \vec{w} \\ \text{Aug}(\vec{x}_2) \cdot \vec{w} \\ \vdots \\ \text{Aug}(\vec{x}_n) \cdot \vec{w} \end{bmatrix} = \begin{bmatrix} H(\vec{x}_1) \\ H(\vec{x}_2) \\ \vdots \\ H(\vec{x}_n) \end{bmatrix}.$$
We let our observation vector be

$$\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},$$

and as before, our goal is to minimize

$$R_{\text{sq}}(\vec{w}) = \frac{1}{n}\|\vec{y} - X\vec{w}\|^2.$$
We already know that the choice of $\vec{w}$ that minimizes $R_{\text{sq}}(\vec{w})$ is the same vector that solves the normal equations $X^T X\vec{w} = X^T\vec{y}$, so we can find the optimal parameters for our prediction rule by solving the normal equations.
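
As an illustration, here is a minimal sketch (assuming NumPy, with a hypothetical n-by-d array `features` of raw feature vectors and a hypothetical response array `y`) that augments each feature vector with a leading 1, solves the normal equations, and makes a prediction for a new individual:

```python
import numpy as np

def augment(features):
    """Prepend a column of ones: row i becomes Aug(x_i)^T."""
    return np.column_stack([np.ones(len(features)), features])

def fit_multiple_regression(features, y):
    """Solve the normal equations X^T X w = X^T y for the parameter vector w."""
    X = augment(features)
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(w, new_features):
    """Evaluate H for each new feature vector: Aug(x) dot w."""
    return augment(new_features) @ w

# Hypothetical data with d = 3 features for n = 5 individuals.
features = np.array([[1.0, 2.0, 0.5],
                     [2.0, 1.0, 1.5],
                     [3.0, 4.0, 2.0],
                     [4.0, 3.0, 2.5],
                     [5.0, 5.0, 3.0]])
y = np.array([3.0, 5.0, 9.0, 11.0, 14.0])
w = fit_multiple_regression(features, y)
print(predict(w, np.array([[2.5, 2.0, 1.0]])))
```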

Now we can make predictions based on lots of predictor variables instead of just one. Of
course, the more predictor variables we have, the better predictions we are able to make,
right? Not necessarily. There is such a thing as having too many variables. Suppose we
are trying to predict a child’s height. To do so, we gather as many predictor variables as
possible. Some predictors, like the height of each parent, are clearly related to the outcome.
But other variables, like the day of the week on which the child was born, are probably
unrelated. However, if we have too many of these unrelated variables, by sheer chance a
pattern can emerge. This pattern isn’t meaningful - it’s just due to noise - but least squares
regression does not know the difference and will happily overfit the noise.
More predictor variables also introduce other challenges. For one, it is very hard to visualize
multidimensional data, especially data in more than three dimensions, since we don’t have
a good way of plotting such data. This makes it hard to guess the appropriate form of the prediction rule. Even if we can come up with an appropriate functional form, actually solving the normal equations can be computationally difficult if the design matrix is very large.
