Notes Chapter 2
We have so far predicted salaries without taking into account any information about the
person whose salary is being predicted. Of course, there are lots of things which can influence
someone’s salary: how much experience they have, their college GPA, what city they live
in, and so forth. We call these things features. In this chapter, we’ll see how to apply the
framework of empirical risk minimization to the problem of predicting based on one or more
features.
In this rule, pay increases exponentially with experience. Yet another prediction rule is:
$$H_3(\text{years of experience}) = \$100{,}000 - \$5{,}000 \times (\text{years of experience})$$
This one says that your pay decreases with experience. This is (hopefully) not the case, and
it goes against our intuition. This is probably a bad prediction rule in the sense that it does
not make accurate predictions, but it is a prediction rule nonetheless.
As before, we will assess whether a prediction rule is good or bad by seeing if the predictions
it makes match the data. The figure below shows what we might expect to see if we ask
several data scientists how much they make and how much experience they have.
In general, we observe a positive trend: the more experience someone has, the higher their
pay tends to be. We can plot the prediction rules H1 , H2 , and H3 on top of this data to get
a sense for the accuracy of their predictions.
Out of these, H1 appears to “fit” the data the best. We’ll now make this more precise.
Consider an arbitrary person: person $i$. Their salary is $y_i$, and their experience is $x_i$. Our prediction for their salary is $H(x_i)$. The absolute loss of our prediction is
$$|y_i - H(x_i)|.$$
In the last chapter, we saw that the empirical risk, or average loss incurred when using H
to predict the salary for everyone in the data set, is one way to quantify how good or bad
H is. In this case, the mean absolute error is:
$$R_{\text{abs}}(H) = \frac{1}{n}\sum_{i=1}^{n}|y_i - H(x_i)|.$$
Our goal now is to find a prediction rule H which results in the smallest mean absolute error.
Here we run into our first problem: it turns out that we can find a function which has zero
absolute error, but it isn’t as useful as we’d like! Here it is:
This prediction rule makes exactly the right prediction for every person in the data set. But
in order to go through every data point, this function has made itself quite wiggly. We
don’t believe that pay is truly related to experience in this way. To put it differently, we feel
that the data has some noise, and the above function describes the noise rather than the
underlying pattern that we are interested in. When this is the case, we say that the line has
overfit the data.
One antidote to overfitting is to mandate that the hypothesis not be so wiggly; in other
words, restrict the hypothesis space so that it doesn’t include such complicated functions.
This is an instance of Occam’s Razor: we should choose the simplest possible description of
the data which actually works. In this example, a straight line is a good, simple description
of the data, and so we’ll restrict the prediction rule to be of the form H(x) = w1 x + w0 , i.e.,
straight lines. This setting is called simple linear regression.
Our new goal is to find a linear prediction rule with the smallest mean absolute error. That
is, we want to solve:
$$H^* = \operatorname*{arg\,min}_{\text{linear } H}\;\frac{1}{n}\sum_{i=1}^{n}|y_i - H(x_i)|$$
But there is still a problem with this: it turns out that $R_{\text{abs}}$ is relatively hard to minimize (although it can be done with linear programming). The cause of this difficulty is our use of the absolute loss. Instead, we'll try another approach to quantifying the error of our prediction rule: the square loss, $(y_i - H(x_i))^2$. Averaging the square loss over the whole data set gives
$$R_{\text{sq}}(H) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - H(x_i)\right)^2.$$
This is also known as the mean squared error. Minimizing this is often called least squares regression. Since our prediction rules are straight lines of the form $H(x) = w_0 + w_1 x$, we can write the mean squared error as a function of the two parameters:
$$R_{\text{sq}}(w_0, w_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (w_0 + w_1 x_i)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w_0 - w_1 x_i\right)^2.$$
This is differentiable. To find the best w0 and w1 , we’ll use calculus. Taking the partial
derivatives with respect to w0 and w1 and setting them to zero gives
$$\frac{\partial R_{\text{sq}}(w_0, w_1)}{\partial w_0} = \frac{1}{n}\sum_{i=1}^{n}-2\left(y_i - w_0 - w_1 x_i\right) = 0,$$
$$\frac{\partial R_{\text{sq}}(w_0, w_1)}{\partial w_1} = \frac{1}{n}\sum_{i=1}^{n}-2x_i\left(y_i - w_0 - w_1 x_i\right) = 0.$$
We must now solve this system of equations to find the optimal values of w0 and w1 .
$$\frac{\partial R_{\text{sq}}}{\partial w_0} = 0 \implies \frac{2}{n}\sum_{i=1}^{n}\left(w_1 x_i + w_0 - y_i\right) = 0$$
If we multiply both sides by n/2 we are left with a constant factor of 1 on the left hand side,
whereas the right hand side remains zero. In effect, we can get rid of the 2/n:
$$\implies \sum_{i=1}^{n}\left(w_1 x_i + w_0 - y_i\right) = 0$$
Our goal is to isolate the w0 on one side of the equation, but we are momentarily prevented
from this by the fact that w0 is inside of the sum. We can remove it from the summation,
however; the first step is to break it into three independent sums:
$$\implies w_1\sum_{i=1}^{n}x_i + \sum_{i=1}^{n}w_0 - \sum_{i=1}^{n}y_i = 0$$
$$\implies \sum_{i=1}^{n}w_0 = \sum_{i=1}^{n}y_i - w_1\sum_{i=1}^{n}x_i$$
$$\implies nw_0 = \sum_{i=1}^{n}y_i - w_1\sum_{i=1}^{n}x_i$$
$$\implies w_0 = \frac{1}{n}\left(\sum_{i=1}^{n}y_i - w_1\sum_{i=1}^{n}x_i\right)$$
Define $\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$. If $x_i$ and $y_i$ are experience and salary, then $\bar{x}$ is the mean experience and $\bar{y}$ is the mean salary of people in our data set. Using this notation,
$$\implies w_0 = \bar{y} - w_1\bar{x}.$$
And so we have successfully isolated $w_0$. (Note that we could also have solved this equation for $w_1$ instead, as long as we then solved the other equation for $w_0$.) We now solve $\partial R_{\text{sq}}/\partial w_1 = 0$ for $w_1$:
$$\frac{\partial R_{\text{sq}}}{\partial w_1} = 0 \implies \frac{2}{n}\sum_{i=1}^{n}\left((w_1 x_i + w_0) - y_i\right)x_i = 0$$
$$\implies \sum_{i=1}^{n}\left((w_1 x_i + w_0) - y_i\right)x_i = 0$$
Substituting $w_0 = \bar{y} - w_1\bar{x}$,
$$\implies \sum_{i=1}^{n}\left((w_1 x_i + \bar{y} - w_1\bar{x}) - y_i\right)x_i = 0$$
We’ll group the xi with the x̄ and the yi with the ȳ:
$$\implies \sum_{i=1}^{n}\left(w_1(x_i - \bar{x}) - (y_i - \bar{y})\right)x_i = 0$$
$$\implies w_1\sum_{i=1}^{n}(x_i - \bar{x})x_i - \sum_{i=1}^{n}(y_i - \bar{y})x_i = 0$$
$$\implies w_1\sum_{i=1}^{n}(x_i - \bar{x})x_i = \sum_{i=1}^{n}(y_i - \bar{y})x_i$$
$$\implies w_1 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})x_i}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}$$
We could stop here; this is a totally valid formula for $w_1$. But we'll continue in order to get a formula with a little more symmetry. The key is that $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, as can be verified. This enables us to write:
$$\implies w_1 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
We'll show that the numerators of these last two formulas for $w_1$ are the same (and you can check that the denominators are the same in a similar way):
$$\begin{aligned}
\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) &= \sum_{i=1}^{n}(y_i - \bar{y})x_i - \sum_{i=1}^{n}(y_i - \bar{y})\bar{x} \\
&= \sum_{i=1}^{n}(y_i - \bar{y})x_i - \bar{x}\sum_{i=1}^{n}(y_i - \bar{y}) \\
&= \sum_{i=1}^{n}(y_i - \bar{y})x_i - 0 \\
&= \sum_{i=1}^{n}(y_i - \bar{y})x_i.
\end{aligned}$$
We have arrived at a formula for w1 , the slope of our linear prediction rule, which can be
computed from the data. Notice that the formula for the intercept, w0 , involves w1 . So to
find the linear prediction rule, we should first calculate w1 , then plug it into the formula to
find w0 .
In summary, the formulas we have found are:
$$w_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1\bar{x},$$
where x̄ and ȳ are the mean of the xi and yi , respectively. These are called the least squares
solutions for the slope and intercept parameters.
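To make these formulas concrete, here is a minimal sketch in Python (using NumPy) that computes the least squares slope and intercept; the experience and salary values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: years of experience and salary for five people.
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([55_000.0, 70_000.0, 88_000.0, 95_000.0, 110_000.0])

x_bar, y_bar = x.mean(), y.mean()

# The least squares solutions derived above.
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

print(w0, w1)          # intercept and slope
print(w0 + w1 * 4)     # predicted salary at 4 years of experience
```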
It should be emphasized that the fact that we have formulas which allow us to calculate $w_0$ and $w_1$ directly is a consequence of using the mean squared error. If we had used another loss function, for example the mean absolute error, we might not be so lucky as to have closed-form formulas. Instead, we would have to minimize the risk using an algorithmic approach in which we iterate towards the correct answer.
Sometimes we can transform a nonlinear relationship between $x$ and $y$ into a linear relationship between two quantities that can be directly computed from $x$ and $y$. For example, we can fit a prediction rule of the form
$$y = w_0 + w_1\ln x$$
to data with the tools we already have. While $y$ does not vary linearly with $x$, it does vary linearly with $\ln x$. So if we let $z = \ln x$ be our feature variable, we can use the formulas we have developed to find the slope and intercept of the linear relationship.
$$w_1 = \frac{\sum_{i=1}^{n}\left(\ln x_i - \frac{1}{n}\sum_{j=1}^{n}\ln x_j\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(\ln x_i - \frac{1}{n}\sum_{j=1}^{n}\ln x_j\right)^2}, \qquad w_0 = \bar{y} - w_1\cdot\frac{1}{n}\sum_{i=1}^{n}\ln x_i$$
We can use w1 and w0 as the parameters that define the best-fitting curve of the form
y = w0 + w1 ln x.
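As a quick sketch of this idea, the snippet below (with made-up data) transforms the feature with a natural logarithm and then reuses the simple linear regression formulas; the variable names are ours, not from the text.

```python
import numpy as np

# Hypothetical (x, y) data that roughly follows y = w0 + w1 * ln(x).
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([2.1, 2.8, 3.6, 4.2, 5.0])

z = np.log(x)                      # the new feature: z = ln(x)
z_bar, y_bar = z.mean(), y.mean()

w1 = np.sum((z - z_bar) * (y - y_bar)) / np.sum((z - z_bar) ** 2)
w0 = y_bar - w1 * z_bar

print(w0, w1)                      # parameters of y = w0 + w1 * ln(x)
print(w0 + w1 * np.log(10.0))      # prediction at x = 10
```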
Similarly, we can use the least squares solutions we derived above to fit data with a curve of the form
$$y = c_0 x^{c_1}.$$
Here, it is not so obvious that this relationship can be written as a linear relationship in new variables. The trick is taking a logarithm of both sides. We can use a logarithm with any base; we'll pick base ten. The relationship becomes
$$\log y = \log c_0 + c_1\log x.$$
If we let z = log x be our predictor variable, and v = log y be our response variable, we see
that there is a linear relationship v = w0 + w1 z, where the intercept is w0 = log c0 and the
slope is w1 = c1 .
Therefore, we can use the following formulas for w0 and w1 , which we can then use to find
c0 and c1 .
$$w_1 = \frac{\sum_{i=1}^{n}\left(\log x_i - \frac{1}{n}\sum_{j=1}^{n}\log x_j\right)\left(\log y_i - \frac{1}{n}\sum_{j=1}^{n}\log y_j\right)}{\sum_{i=1}^{n}\left(\log x_i - \frac{1}{n}\sum_{j=1}^{n}\log x_j\right)^2},$$
$$w_0 = \frac{1}{n}\sum_{i=1}^{n}\log y_i - w_1\cdot\frac{1}{n}\sum_{i=1}^{n}\log x_i.$$
Notice that $w_1 = c_1$ and $w_0 = \log c_0$, so we can find $c_0$ by computing $c_0 = 10^{w_0}$. This gives us a way to find the parameters $c_0, c_1$ for our original prediction rule $y = c_0 x^{c_1}$.
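Here is a short sketch of the log-log fit, again with made-up data; the key step is undoing the logarithm at the end to recover $c_0 = 10^{w_0}$.

```python
import numpy as np

# Hypothetical data that roughly follows y = c0 * x ** c1.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([2.0, 5.5, 10.1, 24.0, 55.0])

# Work in log10 space, where log y = log c0 + c1 * log x is linear.
z = np.log10(x)
v = np.log10(y)
z_bar, v_bar = z.mean(), v.mean()

w1 = np.sum((z - z_bar) * (v - v_bar)) / np.sum((z - z_bar) ** 2)
w0 = v_bar - w1 * z_bar

c1 = w1            # the exponent
c0 = 10 ** w0      # undo the log to recover the leading coefficient
print(c0, c1)
```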
In general, we can use the formulas we've derived to fit any prediction rule that can be written in the form $g(y) = w_1\cdot f(x) + w_0$, where $f(x)$ is some transformation of $x$, such as $x^2$, $e^x$, $\log x$, etc., and $g(y)$ is some transformation of $y$. Here is the procedure:
1. Create a new data set $(z_1, v_1), \ldots, (z_n, v_n)$, where $z_i = f(x_i)$ and $v_i = g(y_i)$.
2. Fit $v = w_1 z + w_0$ using the familiar least squares solutions:
$$w_1 = \frac{\sum_{i=1}^{n}(z_i - \bar{z})(v_i - \bar{v})}{\sum_{i=1}^{n}(z_i - \bar{z})^2}, \qquad w_0 = \bar{v} - w_1\bar{z}.$$
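The procedure is easy to wrap up as a small helper function. The sketch below is one possible implementation (the function name and example data are ours); it is shown here fitting $y = w_0 + w_1 x^2$, i.e., $f(x) = x^2$ and $g(y) = y$.

```python
import numpy as np

def fit_transformed(x, y, f=lambda t: t, g=lambda t: t):
    """Fit g(y) = w1 * f(x) + w0 by least squares on the transformed data."""
    z, v = f(np.asarray(x, dtype=float)), g(np.asarray(y, dtype=float))
    z_bar, v_bar = z.mean(), v.mean()
    w1 = np.sum((z - z_bar) * (v - v_bar)) / np.sum((z - z_bar) ** 2)
    w0 = v_bar - w1 * z_bar
    return w0, w1

# Hypothetical data that roughly follows y = x**2.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 4.1, 9.3, 15.8]
print(fit_transformed(x, y, f=np.square))   # (w0, w1)
```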
Let's now return to the mean squared error of our linear prediction rule,
$$R_{\text{sq}}(H) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - H(x_i)\right)^2$$
or, equivalently:
$$R_{\text{sq}}(w_0, w_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w_0 - w_1 x_i\right)^2.$$
Notice that the risk is $\frac{1}{n}$ times a sum of squares, and remember from linear algebra that we also measure the length of a vector using a sum of squares. For example, the length of the vector
$$\vec{x} = \begin{bmatrix} 2 \\ 5 \\ 4 \end{bmatrix}$$
is computed as
$$\sqrt{2^2 + 5^2 + 4^2} = \sqrt{45} = \sqrt{9\cdot 5} = 3\sqrt{5}.$$
If we use $\|\vec{v}\|$ to denote the length of a vector $\vec{v}$, this means we can write $R_{\text{sq}}(H) = \frac{1}{n}\|\vec{e}\|^2$, where $\vec{e}$ is the vector whose $i$th component is $e_i = y_i - H(x_i)$, which is the error in the $i$th prediction. We'll call $\vec{e}$ the error vector. Since each component of $\vec{e}$ comes from a difference between $y_i$ and $H(x_i)$, we can think of the vector $\vec{e}$ as a difference of two vectors $\vec{y}$ and $\vec{h}$, where the $i$th component of $\vec{y}$ is $y_i$ and the $i$th component of $\vec{h}$ is $H(x_i)$. We call $\vec{y}$ the observation vector and $\vec{h}$ the hypothesis vector or prediction vector. We are writing the error vector as the difference between the observation vector and the prediction vector: since $\vec{e} = \vec{y} - \vec{h}$, the risk can be written as $R_{\text{sq}}(H) = \frac{1}{n}\|\vec{e}\|^2 = \frac{1}{n}\|\vec{y} - \vec{h}\|^2$.
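As a small numerical check of this identity (with made-up numbers), the mean squared error computed from the definition agrees with $\frac{1}{n}\|\vec{e}\|^2$:

```python
import numpy as np

# Hypothetical data and a candidate line H(x) = w0 + w1 * x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 4.0, 4.5])
w0, w1 = 1.0, 0.9

h = w0 + w1 * x                        # prediction vector: ith entry is H(x_i)
e = y - h                              # error vector

mse_from_definition = np.mean((y - h) ** 2)
mse_from_norm = np.linalg.norm(e) ** 2 / len(y)
print(mse_from_definition, mse_from_norm)   # the two agree
```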
Now, since our function $H(x)$ is a linear function of the form $H(x) = w_0 + w_1 x$, we can write the prediction vector $\vec{h}$ as follows:
$$\vec{h} = \begin{bmatrix} H(x_1) \\ H(x_2) \\ \vdots \\ H(x_n) \end{bmatrix} = \begin{bmatrix} w_0 + w_1 x_1 \\ w_0 + w_1 x_2 \\ \vdots \\ w_0 + w_1 x_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}.$$
Letting
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \qquad \text{and} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix},$$
we have $\vec{h} = X\vec{w}$. The matrix $X$ is called the design matrix, and the vector $\vec{w}$ is called the parameter vector.
Notice that we are able to write $\vec{h}$ as the product of a matrix and a vector because our choice of hypothesis $H(x)$ was linear in the parameters $w_0, w_1$. We'll soon be able to adjust the form of $H(x)$ to fit any model that is linear in the parameters, including arbitrary polynomials. For now, we continue with our linear fit $H(x) = w_0 + w_1 x$. We have established that
$$R_{\text{sq}}(H) = R_{\text{sq}}(w_0, w_1) = \frac{1}{n}\|\vec{y} - \vec{h}\|^2 = \frac{1}{n}\|\vec{y} - X\vec{w}\|^2.$$
Note that we can think of the risk as a function of the parameter vector $\vec{w}$, since changing this parameter vector produces different values for the risk. The values of $X$ and $\vec{y}$ are completely determined by the data set, so they are fixed, and $\vec{w}$ is the only thing that can change. Our goal is to find the choice of $\vec{w}$ for which $R_{\text{sq}}(\vec{w}) = \frac{1}{n}\|\vec{y} - X\vec{w}\|^2$ is minimized. Equivalently, our goal is to find the vector $\vec{w}$ for which the vector $\vec{y} - X\vec{w}$ has the smallest length, since minimizing a constant times the square of the length is equivalent to minimizing the length itself. We'll use calculus to find the value of $\vec{w}$ where the minimum is achieved.
$$R_{\text{sq}}(\vec{w}) = \frac{1}{n}\|\vec{y} - X\vec{w}\|^2.$$
Given the data, $X$ and $\vec{y}$ are fixed, known quantities. Our goal is to find the "best" $\vec{w}$, in the sense that $\vec{w}$ results in the smallest mean squared error. We'll use vector calculus to do so.
Our main strategy for minimizing risk thus far has been to take the derivative of the risk function, set it to zero, and solve for the minimizer. In this case, we want to differentiate $R_{\text{sq}}(\vec{w})$ with respect to a vector, $\vec{w}$. The derivative of a scalar-valued function with respect to a vector input is called the gradient. The gradient of $R_{\text{sq}}(\vec{w})$ with respect to $\vec{w}$, written $\nabla_{\vec{w}} R_{\text{sq}}$ or $\frac{dR_{\text{sq}}}{d\vec{w}}$, is defined to be the vector of partial derivatives:
$$\frac{dR_{\text{sq}}}{d\vec{w}} = \begin{bmatrix} \dfrac{\partial R_{\text{sq}}}{\partial w_0} \\[2mm] \dfrac{\partial R_{\text{sq}}}{\partial w_1} \\ \vdots \\ \dfrac{\partial R_{\text{sq}}}{\partial w_d} \end{bmatrix}.$$
Our goal is to find the gradient of the mean squared error. We start by expanding the mean squared error in order to get rid of the squared norm, and we write the mean squared error in terms of dot products instead. We need to recall a few key facts from linear algebra, which we will make use of shortly. Here, $A$ and $B$ are matrices, and $\vec{u}, \vec{v}, \vec{w}, \vec{z}$ are vectors:
$$(A + B)^T = A^T + B^T,$$
$$(AB)^T = B^T A^T,$$
$$\vec{u}\cdot\vec{v} = \vec{v}\cdot\vec{u} = \vec{u}^T\vec{v} = \vec{v}^T\vec{u},$$
$$\|\vec{u}\|^2 = \vec{u}\cdot\vec{u},$$
$$(\vec{u} + \vec{v})\cdot(\vec{w} + \vec{z}) = \vec{u}\cdot\vec{w} + \vec{u}\cdot\vec{z} + \vec{v}\cdot\vec{w} + \vec{v}\cdot\vec{z}.$$
Therefore:
$$\begin{aligned}
R_{\text{sq}}(\vec{w}) &= \frac{1}{n}\|\vec{y} - X\vec{w}\|^2 \\
&= \frac{1}{n}\left(\vec{y} - X\vec{w}\right)^T\left(\vec{y} - X\vec{w}\right) \\
&= \frac{1}{n}\left(\vec{y}^T - \vec{w}^T X^T\right)\left(\vec{y} - X\vec{w}\right) \\
&= \frac{1}{n}\left[\vec{y}^T\vec{y} - \vec{y}^T X\vec{w} - \vec{w}^T X^T\vec{y} + \vec{w}^T X^T X\vec{w}\right].
\end{aligned}$$
The two middle terms are equal, since each is a scalar and $\vec{y}^T X\vec{w} = (X^T\vec{y})\cdot\vec{w} = \vec{w}^T X^T\vec{y}$. Combining them,
$$R_{\text{sq}}(\vec{w}) = \frac{1}{n}\left[\vec{y}\cdot\vec{y} - 2\left(X^T\vec{y}\right)\cdot\vec{w} + \vec{w}^T X^T X\vec{w}\right].$$
Taking the gradient term by term gives
$$\frac{dR_{\text{sq}}}{d\vec{w}} = \frac{1}{n}\left(-2X^T\vec{y} + 2X^T X\vec{w}\right).$$
This is the gradient of the mean squared error.
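As a sanity check of this gradient formula, we can compare it against a finite-difference approximation on randomly generated data; this sketch (variable names ours) is not part of the derivation, just a numerical confirmation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix for H(x) = w0 + w1 x
y = rng.normal(size=n)
w = rng.normal(size=2)

def risk(w):
    return np.sum((y - X @ w) ** 2) / n

analytic = (1 / n) * (-2 * X.T @ y + 2 * X.T @ X @ w)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (risk(w + eps * np.eye(2)[j]) - risk(w - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])
print(np.allclose(analytic, numeric))   # True
```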
Our next step is to set this gradient equal to zero and solve for $\vec{w}$:
$$\frac{dR_{\text{sq}}}{d\vec{w}} = 0 \implies \frac{1}{n}\left(-2X^T\vec{y} + 2X^T X\vec{w}\right) = 0$$
$$\implies X^T X\vec{w} = X^T\vec{y}.$$
The last equation is an important one. The equation $X^T X\vec{w} = X^T\vec{y}$ defines a system of equations in matrix form known as the normal equations. This system can be solved using Gaussian elimination, or, if the matrix $X^T X$ is invertible, by multiplying both sides by its inverse to obtain $\vec{w} = (X^T X)^{-1}X^T\vec{y}$.
The solution $\vec{w}$ of the normal equations, $X^T X\vec{w} = X^T\vec{y}$, is the solution to the least squares problem. Before, we had formulas for the least squares solutions for $w_0$ and $w_1$; the normal equations give equivalent solutions. In fact, you can derive our old formulas from the normal equations with a little bit of algebra. The advantage of the normal equations is that they generalize more easily. We'll be able to use the normal equations to fit arbitrary polynomials to data, as well as to make predictions based on multiple features.
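In code, solving the normal equations is one line with NumPy. The sketch below uses made-up data; in practice `np.linalg.lstsq` is the more numerically robust routine, but solving $X^T X\vec{w} = X^T\vec{y}$ directly mirrors the derivation above.

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y for the parameter vector w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example with a single feature: H(x) = w0 + w1 * x.
x = np.array([1.0, 2.0, 4.0, 6.0])
y = np.array([1.5, 2.0, 3.8, 5.9])
X = np.column_stack([np.ones_like(x), x])   # a column of ones, then the feature
print(least_squares(X, y))                  # [w0, w1]
```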
Let’s try out an example. We will use the normal equations to find the linear function that
best approximates the data (2, 1), (5, 2), (7, 3), (8, 3), where the first element of each pair is
the predictor variable x, and the second element is the response variable, y.
We have
$$\vec{y} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 3 \end{bmatrix} \qquad \text{and} \qquad X = \begin{bmatrix} 1 & 2 \\ 1 & 5 \\ 1 & 7 \\ 1 & 8 \end{bmatrix},$$
so the normal equations $X^T X\vec{w} = X^T\vec{y}$ become
$$\begin{bmatrix} 4 & 22 \\ 22 & 142 \end{bmatrix}\vec{w} = \begin{bmatrix} 9 \\ 57 \end{bmatrix}.$$
Now, we just need to solve this linear system. The left-hand side matrix can be inverted, and left multiplication by its inverse gives
$$\vec{w} = \begin{bmatrix} 2/7 \\ 5/14 \end{bmatrix}.$$
The regression line is therefore $y = \frac{2}{7} + \frac{5}{14}x$, as shown in the picture below.
You can verify that the formulas for the least squares solutions derived above in section 2.1.2
give the same slope and intercept as we found using the normal equations.
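Here is that verification carried out in NumPy, computing the parameters both ways for the four data points above:

```python
import numpy as np

# The data points (2, 1), (5, 2), (7, 3), (8, 3) from the example above.
x = np.array([2.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 3.0])

# Via the normal equations.
X = np.column_stack([np.ones_like(x), x])
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Via the simple linear regression formulas from section 2.1.
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(w_normal)    # approximately [0.2857, 0.3571], i.e. [2/7, 5/14]
print(w0, w1)      # the same values from the formulas
```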
Let's look at one more example, this time fitting a parabola $H(x) = w_0 + w_1 x + w_2 x^2$ to the five data points $(-2, 0)$, $(-1, 0)$, $(0, 1)$, $(1, 0)$, $(2, 0)$. Each row of the design matrix is now $(1, x_i, x_i^2)$; the last row, for instance, is $(1, x_5, x_5^2) = (1, 2, 4)$. We have the design matrix
$$X = \begin{bmatrix} 1 & -2 & 4 \\ 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \end{bmatrix},$$
parameter vector
$$\vec{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix},$$
and observation vector
$$\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}.$$
We can calculate that
$$X^T X = \begin{bmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{bmatrix} \qquad \text{and} \qquad X^T\vec{y} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix},$$
and solving the normal equations gives $\vec{w} = \left(\frac{17}{35}, 0, -\frac{1}{7}\right)$, so the best-fitting parabola is $H(x) = \frac{17}{35} - \frac{1}{7}x^2$.
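A quick check of this example in NumPy, assuming the five data points read off above, reproduces $X^T X$, $X^T\vec{y}$, and the solution:

```python
import numpy as np

# Data points assumed to be (-2, 0), (-1, 0), (0, 1), (1, 0), (2, 0).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
X = np.column_stack([np.ones_like(x), x, x ** 2])   # rows are (1, x_i, x_i^2)

print(X.T @ X)                            # matches the matrix above
print(X.T @ y)                            # [1, 0, 0]
print(np.linalg.solve(X.T @ X, X.T @ y))  # approximately [0.486, 0, -0.143]
```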
The least squares method we have developed using linear algebra can be used to find the best-fitting curve of any functional form that is linear in the parameters, such as
$$H(x) = w_0 + w_1 e^x$$
or
$$H(x) = w_0 + w_1 x + w_2\sin x.$$
In particular, we can fit a polynomial of any degree $d$:
$$H(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d.$$
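Any such rule can be fit by giving the design matrix one column per term of $H$. For instance, here is a sketch (with made-up data) for $H(x) = w_0 + w_1 x + w_2\sin x$:

```python
import numpy as np

# Hypothetical data; H(x) = w0 + w1*x + w2*sin(x) is nonlinear in x
# but linear in the parameters w0, w1, w2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 3.8, 3.9, 4.6, 5.2])

X = np.column_stack([np.ones_like(x), x, np.sin(x)])   # one column per term of H
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # [w0, w1, w2]
```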
For example, suppose we want to predict the price $y$ of a laptop computer based on its weight $x^{(1)}$ and amount of memory $x^{(2)}$. Here we are using superscript notation with parentheses
to distinguish between the different predictor variables. In particular, the parentheses are
used to emphasize that these are indices, not exponents. We could just use different letters
for each predictor variable (and it may help to think of the superscripts as such), but we'll use this notation because we want to be able to generalize to situations with an arbitrary number of predictor variables, maybe even more than there are letters in the alphabet.
To make a prediction, we could collect data for n laptops, recording each laptop’s weight
in ounces, amount of random access memory in gigabytes, and price in dollars. We use
subscripts to distinguish between the different laptops in our data set. The $i$th laptop in our data set has weight $x_i^{(1)}$, memory $x_i^{(2)}$, and price $y_i$. Suppose that a plot of the data points $(x_i^{(1)}, x_i^{(2)}, y_i)$ shows that it makes sense to use a prediction rule of the form
$$H\!\left(x^{(1)}, x^{(2)}\right) = w_0 + w_1 x^{(1)} + w_2 x^{(2)}.$$
If you collected actual data on laptop computers and used the normal equations to find the
values of the parameters of the prediction rule, what signs would you expect w0 , w1 , and
w2 to have? The more memory the computer has, the more expensive it probably is, and so
w2 would be positive. As for w1 , you could make the case either way: people are willing to
spend extra for ultralight laptops, but at the same time, smaller, cheaper computers tend
to weigh less than expensive and large gaming laptops with bigger screens. So it isn’t clear
whether w1 would be positive or negative. The constant term w0 , known as the bias, is
often hard to interpret because it represents a nonsensical situation, in this case, the price of
a computer that weighs nothing and has no memory. It’s again not clear whether w0 would
be positive or negative.
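As an illustration only, here is how such a two-feature fit could look in NumPy; the laptop weights, memory sizes, and prices below are invented, so the resulting signs should not be read as facts about real laptops.

```python
import numpy as np

# Invented laptop data: weight in ounces, memory in GB, price in dollars.
weight = np.array([35.0, 48.0, 52.0, 64.0, 80.0])
memory = np.array([8.0, 8.0, 16.0, 16.0, 32.0])
price  = np.array([650.0, 700.0, 950.0, 1100.0, 1800.0])

# Design matrix: a column of ones, then one column per feature.
X = np.column_stack([np.ones_like(weight), weight, memory])
w = np.linalg.solve(X.T @ X, X.T @ price)
print(w)                                   # [w0, w1, w2]

# Predicted price of a 50 oz laptop with 16 GB of memory.
print(np.array([1.0, 50.0, 16.0]) @ w)
```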
In general, suppose we want to use $d$ predictor variables $x^{(1)}, x^{(2)}, \ldots, x^{(d)}$ in our prediction of a response variable $y$, using a rule of the form
$$H\!\left(x^{(1)}, \ldots, x^{(d)}\right) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \cdots + w_d x^{(d)}.$$
We collect data from $n$ individuals and, for each individual $i$, create a feature vector $\vec{x}_i$, which is a vector in $\mathbb{R}^d$ containing each of the $d$ features corresponding to this one individual:
$$\vec{x}_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(d)} \end{bmatrix}.$$
Now we can make predictions based on lots of predictor variables instead of just one. Of
course, the more predictor variables we have, the better predictions we are able to make,
right? Not necessarily. There is such a thing as having too many variables. Suppose we
are trying to predict a child’s height. To do so, we gather as many predictor variables as
possible. Some predictors, like the height of each parent, are clearly related to the outcome.
But other variables, like the day of the week on which the child was born, are probably
unrelated. However, if we have too many of these unrelated variables, a pattern can emerge by sheer chance. This pattern isn't meaningful (it's just due to noise), but least squares regression does not know the difference and will happily overfit the noise.
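A small simulation (with made-up data) illustrates the point: as we append columns of pure noise to the design matrix, the training error keeps dropping even though the new features carry no information about $y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30

# One genuinely relevant feature, plus noise in the response.
x_real = rng.normal(size=n)
y = 2.0 + 1.5 * x_real + rng.normal(scale=1.0, size=n)

def training_mse(X, y):
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return np.mean((y - X @ w) ** 2)

X = np.column_stack([np.ones(n), x_real])
print(training_mse(X, y))            # error using only the real feature

# Append irrelevant random features one at a time; the training MSE never
# increases and typically decreases, i.e. least squares fits the noise.
for _ in range(10):
    X = np.column_stack([X, rng.normal(size=n)])
    print(training_mse(X, y))
```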
More predictor variables also introduce other challenges. For one, it is very hard to visualize multidimensional data, especially data in more than three dimensions, since we don't have a good way of plotting such data. That makes it hard to guess the appropriate form of the prediction rule. And even if we can come up with an appropriate functional form, actually solving the normal equations can be computationally difficult if the design matrix is very large.