Lecture 4
We continue our discussion of regression by talking about residuals and outliers, and then
look at some more advanced approaches for linear regression, including nonlinear models
and sparsity- and robustness-oriented approaches.
Remember that we defined the residual for data point i as ε̂i = yi − ŷi: the difference between the observed value and the predicted value (in contrast, the error εi is the difference between the observed value and the true prediction from the correct line). We can often learn a lot about how well our model did by analyzing the residuals. Intuitively, if the model is good, then a plot of the residuals (yi − ŷi) against the fitted values (ŷi) should look like noise (i.e., there shouldn't be any visible patterns).
Anscombe’s quartet can once again provide us with some intuition.
[Figure: the four datasets of Anscombe's quartet, each shown with its fitted regression line.]
Below, we plot the residuals yi − ŷi vs. ŷi (note that unlike the previous plot, we’re not plotting against
x!), which shows how much above and below the fitted line the data points are.
[Figure: the residuals y − ŷ plotted against the fitted values ŷ for each of the four Anscombe datasets.]
We immediately see that in panels 2, 3, and 4, the residuals don’t look anything like random noise as
there are specific patterns to them! This suggests that a linear fit is not appropriate for datasets 2, 3,
and 4.
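To make this concrete, here is a minimal Python sketch of producing such a residual plot. The data are synthetic (a hidden quadratic trend, loosely mimicking the second Anscombe dataset), not the quartet itself:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic data with a hidden quadratic trend, fit (wrongly) with a straight line.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 1.0 + 0.8 * x - 0.08 * x**2 + rng.normal(scale=0.2, size=x.size)

    slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line
    y_hat = intercept + slope * x
    residuals = y - y_hat

    # A clear curved (non-random) pattern here signals that the linear fit is inadequate.
    plt.scatter(y_hat, residuals)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("fitted value")
    plt.ylabel("residual")
    plt.show()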
While the raw residuals should look like noise, in general their distribution isn't as simple as the i.i.d. Gaussian distribution that we assume for the error. Recalling the probabilistic model from the previous chapter, the data are assumed to be generated by yi = Xi β + εi, where each εi is a zero-mean Gaussian with variance σ². But ŷi doesn't come directly from this process! Instead, it comes from putting y through the linear regression equations we saw last time. Recall our regression equation solution:

β̂ = (XᵀX)⁻¹Xᵀy
Disclaimer: The rest of this section relies heavily on basic linear algebra. If you’re uncom-
fortable or unfamiliar with linear algebra, feel free to skip ahead to the summary at the end
of this section.
So the residuals ε̂ (length n vector) are:
ε̂ = y − ŷ
  = y − X β̂
  = y − X(XᵀX)⁻¹Xᵀ y                              (4.1)
  = (I − H)y,        where H = X(XᵀX)⁻¹Xᵀ
  = (I − H)(Xβ + ε)
  = (I − H)Xβ + (I − H)ε
  = (I − X(XᵀX)⁻¹Xᵀ)Xβ + (I − H)ε
  = (X − X(XᵀX)⁻¹(XᵀX))β + (I − H)ε
  = (I − H)ε,

where the last step uses X(XᵀX)⁻¹(XᵀX) = X, so that the first term is zero.
This shows how the actual noise ε (which we don't get to observe) relates to the residuals ε̂ = y − ŷ. In fact, in general the residuals are correlated with each other! Furthermore, the variance of the i-th residual ε̂i may not be the same as that of εi. We can standardize the residuals so that each one has the same variance σ². Using the matrix H = X(XᵀX)⁻¹Xᵀ, which is called the hat matrix, we can define the standardized residuals as ε̂i / √(1 − Hii), where Hii is the i-th diagonal entry of H.
Thus, our model tells us that the residuals may not have the same distribution and may be correlated, but the standardized residuals all have the same, albeit unknown, variance σ².
In order to analyze residuals even further, many packages will go one step further and compute Studentized residuals. These are computed by estimating the noise variance σ² and then dividing the standardized residuals by the estimated standard deviation. Why is this process called “Studentizing”? The estimated noise variance has a (scaled) χ² distribution with n − m − 1 degrees of freedom, and the standardized residual ε̂i / √(1 − Hii) has a normal distribution, so the result has a Student t distribution.
Most packaged software will standardize residuals for you, but if you’re writing your own
analysis code, don’t forget to standardize before analyzing the residuals!
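If you are rolling your own analysis, here is a minimal NumPy sketch of the computations above; the helper name and the toy data are made up for illustration:

    import numpy as np

    def standardized_residuals(X, y):
        """Fit OLS, then return raw, standardized, and Studentized residuals."""
        n, p = X.shape
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # beta-hat = (X'X)^{-1} X'y
        resid = y - X @ beta_hat                       # raw residuals
        H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix H = X(X'X)^{-1}X'
        h = np.diag(H)                                 # leverages H_ii
        standardized = resid / np.sqrt(1 - h)          # all share the same variance
        sigma2_hat = resid @ resid / (n - p)           # estimated noise variance
        studentized = standardized / np.sqrt(sigma2_hat)
        return resid, standardized, studentized

    # Toy example: a line plus Gaussian noise, with an intercept column in X.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    X = np.column_stack([np.ones_like(x), x])
    y = 2 + 0.5 * x + rng.normal(scale=1.0, size=50)
    _, std_resid, stu_resid = standardized_residuals(X, y)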
While the residuals may be correlated with each other, an important theoretical result states
that under the probabilistic model we’ve been using for regression, the residuals ε̂ are un-
correlated with the fitted values ŷ. This means that when we plot the residuals against the
fitted values (as we did in the previous example for Anscombe’s Quartet), the resulting plot
should look like random noise if the fitted linear regression model is any good.
Here are the most important points from our analysis of residuals:
• The residuals themselves, ε̂i, might have different variances, and furthermore, might be correlated with each other. We can address the first problem by standardizing them so that they all have the same (unknown) variance σ², or by Studentizing them so that they have variance approximately 1. It's important to make sure you visualize standardized or Studentized residuals!
• The residuals ε̂ are uncorrelated with the fitted values ŷ. This means that if the model
we use is in fact a good fit, we shouldn’t see any patterns in the standardized residuals.
4.2 Outliers
Real life datasets often have some points that are far away from the rest of the data. Here
are some informal definitions used to characterize such anomalies:
• An outlier is any point that’s far away from the rest of the data. There are different
definitions and ways of quantifying them depending on how you define “far away”.
• The leverage of a data point is a quantitative description of how far it is from the
rest of the points in the x-direction. We’ll see a more formal description later, but
intuitively, points that are farther away from the average in the x-direction have higher
leverage, and points closer to the average have lower leverage.
• An influential point is an outlier with high leverage that significantly affects the
slope of a regression line. We’ll quantify this a little later.
Figure 4.1: (a) Some points and a regression line fit to those points. (b) When viewing y as a function of x, points A and B are both outliers since they're far away from the data, but A is also an influential point, since moving it just a little will have a large effect on the slope of the regression line. (c) The difference between fitting y = ax + b (blue, dashed) and x = cy + d (green, dotted). While the results would be similar if not for the outliers, A has a big impact on the first fit, while B has a big impact on the second.
Figure 4.1 illustrates these concepts. Since influential points can really mess up linear re-
gression results, it’s important to deal with them. Sometimes, such points indicate the need
for more data-gathering: if most of our data is concentrated in one area and our sampling
methodology was flawed, we might have missed data points in the region between the main
cluster and an outlier. Alternatively, if such points don’t indicate a problem in data collec-
tion but really just are irrelevant outliers, we can remove such points from the analysis; we’ll
see a few techniques for identifying them later today.
The leverage of any particular point is formally defined as the corresponding diagonal element
of the hat matrix, Hii (as defined in Equation (4.1)). We can see this intuition more clearly
in the one-dimensional case, where
Hii = 1/n + (xi − x̄)² / ∑_{j=1}^n (xj − x̄)²
Here, we can see that points whose xi is far from the mean x̄ have larger leverage, as expected. Leverage only measures how far away from the average x-value a point is: it's a measure of how influential a point has the potential to be, and doesn't depend on the point's y-value.
How might we quantify the influence of a particular point? One way is to take each point
and fit the model with and without that point. We can then compare the two results: the
more different they are, the more influential that point would be. Cook’s distance is a
measure of how influential each point i is that captures exactly this idea. The definition is
based on fitting the full model (i.e., using all of our data), and then fitting the model again, but without point i. Cook's distance for point i is then

Di = ( ∑_{j=1}^n (ŷj − ŷj(−i))² ) / (p · MSE),

where

• ŷj is the predicted value at xj based on the model fit to all of the data,
• ŷj(−i) is the predicted value at xj based on the model fit without point i,
• p is the number of coefficients in the model, and
• MSE is the mean squared error of the full model's fit.
Unfortunately, it seems from this formula that we have to recompute β̂ without point i any
time we want to compute Di . But, with a bit of algebra, we can derive the following formula,
which gives us an easy way to compute Cook’s distance in terms of the leverage:
Di = (ε̂i² / (p · MSE)) · (Hii / (1 − Hii)²)
We could thus use Cook’s distance to compute how influential a point is and screen for
outliers by removing points that are too influential based on a threshold of our choice.
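Here is a short NumPy sketch of these formulas; the data are synthetic, and the deliberately added point at x = 15 plays the role of an influential outlier:

    import numpy as np

    def leverage_and_cooks(X, y):
        """Leverages H_ii and Cook's distances from a single full-model fit."""
        n, p = X.shape
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        resid = y - X @ beta_hat
        H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
        h = np.diag(H)                                 # leverages H_ii
        mse = resid @ resid / (n - p)                  # mean squared error of the fit
        D = resid**2 / (p * mse) * h / (1 - h)**2      # Cook's distance formula above
        return h, D

    # A cluster of well-behaved points plus one far-away, badly-fitting point.
    rng = np.random.default_rng(1)
    x = np.append(rng.uniform(0, 5, size=30), 15.0)
    y = np.append(2 + 0.5 * x[:-1] + rng.normal(scale=0.3, size=30), 20.0)
    X = np.column_stack([np.ones_like(x), x])
    h, D = leverage_and_cooks(X, y)
    print(np.argmax(D))                                # index of the most influential point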
Manually removing outliers can feel rather cumbersome, raising the question of whether
there are regression methods that automatically handle outliers more gracefully. This leads
us to the topic of robust regression.
One issue with standard linear regression is that it can be affected by outliers. As we saw
with influential points, one inconveniently-placed outlier can dramatically alter the outcome
of a regression. In this section, we’ll first look at two complementary views of how we can
achieve more robust outcomes by tweaking the model we’ve learned about, and then see a
completely different way of achieving robustness using random sampling.
Figure 4.2: Graphs showing the squared-error, LAD, Huber, and bisquare loss functions. Note the
different y-axis scales!
In standard least-squares regression, we choose β to minimize the total loss

∑_{i=1}^n ρ(ri),

where ri = (yi − Xi β) is the residual and ρ(r) = r² is the squared error function. Unfortunately, squared error puts very large penalties on large errors: ρ(2) = 4, but ρ(10) = 100. So, any good solution to this optimization problem will avoid large errors, even if that means fitting to outliers. To get around this, we'll try choosing a different function ρ. The function ρ(r) we choose here is usually called the loss function, and any solution to a problem of this form is called an M-estimator. What other loss functions can we try?
• Least absolute deviations, or LAD: ρ(r) = |r|. While this avoids the problem of
excessively penalizing outliers, it leads to a loss function that isn’t differentiable at
r = 0. It also is less stable than least-squares: changing the x-value of a point just a
little can have a large impact on the solution.
• Huber: ρ(r) = r²/2 if |r| < k, and ρ(r) = k(|r| − k/2) if |r| ≥ k.
This is similar to LAD, but replaces the sharp corner around 0 with a smoothed parabola for differentiability (being differentiable makes it easy to take derivatives when optimizing for the best β); see the sketch after this list.
• Bisquare: these functions have a behavior similar to squared loss near 0, but level off
after a certain point. Some even assign a cost of 0 to very large deviations, to avoid
being sensitive to outliers at all.
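As promised above, here is a hedged sketch of fitting an M-estimator with the Huber loss via generic numerical optimization; the tuning constant k = 1.345 is a conventional choice, and the data are synthetic:

    import numpy as np
    from scipy.optimize import minimize

    def huber(r, k=1.345):
        """Huber loss: quadratic for |r| < k, linear beyond that."""
        return np.where(np.abs(r) < k, r**2 / 2, k * (np.abs(r) - k / 2))

    def huber_fit(X, y, k=1.345):
        """Choose beta to minimize the summed Huber loss of the residuals."""
        beta0 = np.linalg.solve(X.T @ X, X.T @ y)      # least-squares starting point
        objective = lambda beta: huber(y - X @ beta, k).sum()
        return minimize(objective, beta0).x

    # A clean line plus a few gross outliers in y.
    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 60)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
    y[:5] += 30.0                                      # corrupt a few points
    X = np.column_stack([np.ones_like(x), x])
    print(huber_fit(X, y))                             # close to [1, 2] despite the outliers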
Figure 4.3: Several distributions and their tails: Gaussian with variance 1 (blue), Laplacian (green), Cauchy (red), and Student's t with 5 degrees of freedom (orange).
4.3.3 RANSAC
Another approach to robust regression is RANSAC, or RANdom SAmple Consensus. While
RANSAC doesn’t have the theoretical justifications of the methods in the last two sections,
it nonetheless is widely used and often works well in practice. It's so popular that there's even a song about RANSAC¹!
The basic assumption of RANSAC is just that the data consists primarily of non-outliers,
which we’ll call “inliers”. The algorithm works as follows: we randomly pick a subset of our
points to call inliers and compute a model based on those points. Any other points that
fit this computed model well are added to the inliers. We then compute the error for this
model, and repeat the entire process several times. Whichever model achieves the smallest
error is the one we pick. Here’s a description in pseudocode:
1: function RANSAC
2:     for iteration t ∈ 1, 2, . . . , T do
3:         Choose a random subset of points to be inliers; call them It
4:         Fit a model to the points in It, then add to It any other points that fit this model well
5:         Refit the model to It and compute its error
6:     return the model with the smallest error
¹ See http://www.youtube.com/watch?v=1YNjMxxXO-E
Notice how imprecise our specification is; the following are all parameters that must be decided upon to use RANSAC:
• The number of iterations, T: this one can be chosen based on a theoretical result.
• The size of the random subset chosen at each iteration.
• The threshold for deciding whether a point fits the computed model well enough to be added to the inliers.
• The measure of error used to compare the candidate models.
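Here is a hedged Python sketch of this procedure for fitting a line; the subset size, the residual threshold, and the use of mean squared error over the inliers are illustrative choices, not part of RANSAC's definition:

    import numpy as np

    def ransac_line(x, y, T=100, subset_size=3, threshold=1.0, seed=0):
        """RANSAC for y = a*x + b; returns the best (slope, intercept) found."""
        rng = np.random.default_rng(seed)
        best_model, best_error = None, np.inf
        for _ in range(T):
            # Guess a random subset of inliers and fit a line to it.
            idx = rng.choice(len(x), size=subset_size, replace=False)
            slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
            # Add every other point that fits this line well to the inliers.
            inliers = np.abs(y - (slope * x + intercept)) < threshold
            if inliers.sum() <= subset_size:
                continue
            # Refit to the enlarged inlier set and score the model by its error.
            slope, intercept = np.polyfit(x[inliers], y[inliers], deg=1)
            error = np.mean((y[inliers] - (slope * x[inliers] + intercept)) ** 2)
            if error < best_error:
                best_model, best_error = (slope, intercept), error
        return best_model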
Another useful criterion that we often want our model to satisfy is sparsity. This means that
we want many of the coefficients βk to be 0. This constraint is useful in many cases, such
as when there are more features than data points, or when some features are very similar to
others. Adding in a sparsity constraint in these settings often helps prevent overfitting, and
leads to simpler, more interpretable models.
A natural first attempt, known as ridge regression, adds a penalty on the squared magnitudes of the coefficients:

min_β ∑_{i=1}^n (yi − Xi β)² + λ ∑_{k=1}^p βk²,

where λ is a nonnegative parameter that trades off between how small we want the coefficients βk to be and how small we want the error to be: if λ is 0, then the problem is the same as before, but if λ is very large, the second term counts a lot more than the first, and so we'll try to make the coefficients βk as small as possible. λ is often called a regularization
parameter, and we’ll talk a little more next week about how to choose it. With some linear
algebra, we can get the following solution:
β̂ = (XᵀX + λI)⁻¹Xᵀy.
We see that our intuition from before carries over: very large values of λ will give us a
solution of β̂ = 0, and if λ = 0, then our solution is the same as before.
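For example, here is a minimal NumPy sketch of this closed-form solution (synthetic data; the values of λ are arbitrary, and in practice the intercept is usually left unpenalized):

    import numpy as np

    def ridge(X, y, lam):
        """Closed-form ridge regression: beta-hat = (X'X + lam*I)^{-1} X'y."""
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)
    print(ridge(X, y, lam=0.0))     # lam = 0 recovers ordinary least squares
    print(ridge(X, y, lam=10.0))    # larger lam shrinks all coefficients toward 0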
Unfortunately, while ridge regression does give us smaller coefficients (and helps make the
problem solvable when X T X isn’t invertible), it doesn’t provide us with sparsity. Here’s an
example illustrating why: Suppose we have only 3 input components, and they’re all the
same. Then the solutions βa = (1, 1, 1) and βb = (0, 0, 3) will both give us the same y, but
the second one is sparse while the first one isn’t. But, the second one actually produces a
higher ridge penalty than the first! So, rather than produce a sparse solution, ridge regression
will actually favor solutions that are not sparse.
While using three identical inputs is a contrived example, in practice it’s common to have
several input dimensions that could have similar effects on the output, and choosing a small
(sparse) subset is critical to avoid overfitting.
For the purposes of preventing overfitting and where we don’t care about sparsity, ridge
regression is actually widely used in practice. However, if we seek a sparse solution, we must
turn to a different approach.
Up until now, we’ve been treating the parameter β as a fixed but unknown quantity. But
what if we treat it as a random quantity? Suppose before we know y or X, we have a prior
belief about β, which we’ll encode in the form of a prior distribution p(β). Once we observe
X and y, we can compute a more informed distribution, called a posterior distribution,
from that information. We’ll write it as p(β|X, y), where the “|” means “given” or “having
observed”. Using Bayes’ rule, we can compute the posterior distribution as:
p(β|X, y) ∝ p(β)p(X, y|β) (4.2)
Notice that we’ve already specified p(X, y|β): it’s just the model for the errors ε. Once we
choose p(β), we can try to find the value of β that’s most likely according to p(β|X, y). If
we choose p(β) to be a zero-mean Gaussian whose variance is inversely proportional to λ, then we'll end up with exactly ridge regression.
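To see this correspondence concretely, here is a standard derivation sketch (with σε² denoting the noise variance and σβ² the prior variance of each coefficient; the λ above absorbs these constants). Taking the negative log of the posterior in (4.2),

−log p(β | X, y) = (1/(2σε²)) ∑_{i=1}^n (yi − Xi β)² + (1/(2σβ²)) ∑_{k=1}^p βk² + const,

so finding the most probable β is exactly the ridge problem with λ = σε²/σβ²: a tighter (smaller-variance) prior on β corresponds to stronger regularization.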
The most direct way to encode sparsity would be to penalize the number of nonzero coefficients,

min_β ∑_{i=1}^n (yi − Xi β)² + λ ∑_{k=1}^p 1{βk ≠ 0},     (4.3)

where the penalty imposes a cost for any nonzero value of β. Alas, there is no known efficient
method for computing a solution to this problem. For any moderate to large dataset, the
above optimization problem is intractable to solve in any reasonable amount of time. We
basically can’t do much better than searching over every possible combination of which βk s
are 0 and which aren’t.
Instead, we’ll opt to solve the following problem:
min_β ∑_{i=1}^n (yi − Xi β)² + λ ∑_{k=1}^p |βk|.
This problem, also known as ℓ1-norm minimization or the Least Absolute Shrinkage and Selection Operator (LASSO), is an approximation to (4.3) that can be solved efficiently.
Additionally, under certain reasonable conditions, if the true model is in fact sparse, then the
solution to this problem will also be sparse. There are numerous other interesting theoretical
results about LASSO, and it’s an active area of open research! As a result of its popularity,
most software packages have an option for LASSO when performing linear regression.
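For instance, here is a hedged sketch using scikit-learn's Lasso; its alpha parameter plays the role of λ up to a scaling convention, and the sparse ground truth below is synthetic:

    import numpy as np
    from sklearn.linear_model import Lasso

    # Sparse ground truth: only two of ten coefficients are nonzero.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 10))
    true_beta = np.zeros(10)
    true_beta[[0, 3]] = [2.0, -1.5]
    y = X @ true_beta + rng.normal(scale=0.1, size=100)

    model = Lasso(alpha=0.1).fit(X, y)   # l1-penalized least squares
    print(model.coef_)                   # most entries come out exactly 0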
The Bayesian interpretation corresponds to using a Laplacian prior over β, so that our prior belief for each coefficient βk is proportional to e^(−λ|βk|). If we want to perform hypothesis tests or
get confidence intervals for coefficients that come out of this method, we’ll need to use
nonparametric statistics, which we’ll talk about in Chapter 5.
Figure 4.4: Several coefficient priors for sparse regression: Gaussian (left) as in ridge regression,
Laplacian (center) as in LASSO, and a horseshoe prior (right). Note that the horseshoe prior
approaches ∞ as the coefficient becomes smaller.
So far, we’ve dealt entirely with linear models and linear relationships. Unfortunately, in
the real world, we often come across data with nonlinear properties. While we can often
get by with cleverly defined features, we often have to deal with output data that has a
nonlinear dependence that we can’t capture. For example, if our output data is binary, or
even between 0 and 1, the techniques we’ve talked about so far have no way of constraining
the predictions to that range.
Recall that before, we assumed that y = µy + ε, where µy = Xβ.
Generalized linear models⁴ are a family of methods that assume the following:

µy = g⁻¹(Xβ),
where g(·)⁵ is called the link function and is usually nonlinear. Here, the interaction between the input X and the parameters β remains linear, but the result of that linear interaction is passed through the inverse link function to obtain the mean µy of the output. We often write η = Xβ, so that µy = g⁻¹(η). The choice of link function typically depends on the problem being solved. Below, we provide an example link function commonly used when y is binary.
⁴ These are not to be confused with general linear models, which we'll talk more about next week.
⁵ Using the inverse of g(·) in the setup is a matter of historical tradition.
Figure 4.5: A sigmoid function maps a real number to a value between 0 and 1.

When y is binary, a common choice is the logit link g(µ) = log(µ / (1 − µ)), whose inverse g⁻¹(η) = 1/(1 + e^(−η)) is the sigmoid function shown in Figure 4.5; a generalized linear model with this link is exactly logistic regression.
Generalized linear models also make fewer assumptions about the form of the error: while
before, we assumed that y = µy + ε, generalized linear models give us more freedom in
expressing the relationship between y and µy by fully specifying the distribution for y.
While generalized linear regression problems usually can’t be solved in closed form (in other
words, we can’t get a simple expression that tells us what β is), there are efficient compu-
tational methods to solve such problems, and many software packages have common cases
such as logistic regression implemented.
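As an example, here is a hedged sketch of fitting a logistic regression (a GLM with the logit link) using statsmodels; the data are simulated so that the true coefficients are known:

    import numpy as np
    import statsmodels.api as sm

    # Binary outcomes whose success probability is a sigmoid of x.
    rng = np.random.default_rng(5)
    x = rng.normal(size=200)
    eta = -0.5 + 2.0 * x                      # linear predictor eta = X beta
    p = 1.0 / (1.0 + np.exp(-eta))            # inverse link: the sigmoid
    y = rng.binomial(1, p)

    X = sm.add_constant(x)                    # add an intercept column
    model = sm.GLM(y, X, family=sm.families.Binomial())   # logit is the default link
    result = model.fit()                      # fit by iteratively reweighted least squares
    print(result.params)                      # estimates should be near [-0.5, 2.0]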