Lecture 4
We continue our discussion of regression by talking about residuals and outliers, and then
look at some more advanced approaches for linear regression, including nonlinear models
and sparsity- and robustness-oriented approaches.
Remember that we defined the residual for data point i as ε̂i = yi − ŷi: the difference between the observed value and the predicted value (in contrast, the error εi is the difference between the observed value and the true prediction from the correct line). We can often learn a lot about how well our model did by analyzing the residuals. Intuitively, if the model is good, then a plot of the residuals (yi − ŷi) against the fitted values (ŷi) should look like noise (i.e., there shouldn't be any visible patterns).
Anscombe’s quartet can once again provide us with some intuition.
[Figure: the four datasets of Anscombe's quartet, each shown with its fitted regression line.]
Below, we plot the residuals yi − ŷi vs. ŷi (note that unlike the previous plot, we’re not plotting against
x!), which shows how much above and below the fitted line the data points are.
[Figure: the residuals y − ŷ plotted against the fitted values ŷ for each of the four Anscombe datasets.]
We immediately see that in panels 2, 3, and 4, the residuals don’t look anything like random noise as
there are specific patterns to them! This suggests that a linear fit is not appropriate for datasets 2, 3,
and 4.
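To make this concrete, here is a minimal Python sketch of producing such a residual plot. The data are synthetic (a hidden quadratic trend, loosely mimicking the second Anscombe dataset), not the quartet itself:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic data with a hidden quadratic trend, fit (wrongly) with a straight line.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 1.0 + 0.8 * x - 0.08 * x**2 + rng.normal(scale=0.2, size=x.size)

    slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line
    y_hat = intercept + slope * x
    residuals = y - y_hat

    # A clear curved (non-random) pattern here signals that the linear fit is inadequate.
    plt.scatter(y_hat, residuals)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("fitted value")
    plt.ylabel("residual")
    plt.show()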
While the raw residuals should look like noise, in general their distribution isn't as simple as the i.i.d. Gaussian distribution that we assume for the error. Recalling the probabilistic model from the previous chapter, the data are assumed to be generated by yi = Xi β + εi, where each εi is a zero-mean Gaussian with variance σ². But ŷi doesn't come directly from this process! Instead, it comes from putting y through the linear regression equations we saw last time. Recall our regression equation solution:

β̂ = (XᵀX)⁻¹Xᵀy
Disclaimer: The rest of this section relies heavily on basic linear algebra. If you’re uncom-
fortable or unfamiliar with linear algebra, feel free to skip ahead to the summary at the end
of this section.
So the residuals ε̂ (length n vector) are:
ε̂ = y − ŷ
  = y − X β̂
  = y − X(XᵀX)⁻¹Xᵀ y                              (4.1)
  = (I − H)y,        where H = X(XᵀX)⁻¹Xᵀ
  = (I − H)(Xβ + ε)
  = (I − H)Xβ + (I − H)ε
  = (I − X(XᵀX)⁻¹Xᵀ)Xβ + (I − H)ε
  = (X − X(XᵀX)⁻¹(XᵀX))β + (I − H)ε
  = (I − H)ε,

where the last step uses X(XᵀX)⁻¹(XᵀX) = X, so that the first term is zero.
This shows how the actual noise ε (which we don't get to observe) relates to the residuals ε̂ = y − ŷ. In fact, in general the residuals are correlated with each other! Furthermore, the variance of the i-th residual ε̂i may not be the same as that of εi. We can standardize the residuals so that each one has the same variance σ². Using the matrix H = X(XᵀX)⁻¹Xᵀ, which is called the hat matrix, we can define the standardized residuals as ε̂i / √(1 − Hii), where Hii is the i-th diagonal entry of H.
Thus, our model tells us that the residuals may not have the same distribution and may be correlated, but the standardized residuals all have the same, albeit unknown, variance σ².
In order to analyze residuals even further, many packages will go one step further and compute Studentized residuals. These are computed by estimating the noise variance σ² and then dividing the standardized residuals by the estimated standard deviation. Why is this process called “Studentizing”? The estimated noise variance has a (scaled) χ² distribution with n − m − 1 degrees of freedom, and the standardized residual ε̂i / √(1 − Hii) has a normal distribution, so the result has a Student t distribution.
Most packaged software will standardize residuals for you, but if you’re writing your own
analysis code, don’t forget to standardize before analyzing the residuals!
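If you are rolling your own analysis, here is a minimal NumPy sketch of the computations above; the helper name and the toy data are made up for illustration:

    import numpy as np

    def standardized_residuals(X, y):
        """Fit OLS, then return raw, standardized, and Studentized residuals."""
        n, p = X.shape
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # beta-hat = (X'X)^{-1} X'y
        resid = y - X @ beta_hat                       # raw residuals
        H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix H = X(X'X)^{-1}X'
        h = np.diag(H)                                 # leverages H_ii
        standardized = resid / np.sqrt(1 - h)          # all share the same variance
        sigma2_hat = resid @ resid / (n - p)           # estimated noise variance
        studentized = standardized / np.sqrt(sigma2_hat)
        return resid, standardized, studentized

    # Toy example: a line plus Gaussian noise, with an intercept column in X.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    X = np.column_stack([np.ones_like(x), x])
    y = 2 + 0.5 * x + rng.normal(scale=1.0, size=50)
    _, std_resid, stu_resid = standardized_residuals(X, y)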
While the residuals may be correlated with each other, an important theoretical result states
that under the probabilistic model we’ve been using for regression, the residuals ε̂ are un-
correlated with the fitted values ŷ. This means that when we plot the residuals against the
fitted values (as we did in the previous example for Anscombe’s Quartet), the resulting plot
should look like random noise if the fitted linear regression model is any good.
Here are the most important points from our analysis of residuals:
• The residuals themselves, ε̂i, might have different variances, and furthermore, might be correlated with each other. We can address the first problem by standardizing them so that they all have the same (unknown) variance σ², or by Studentizing them so that they have variance approximately 1. It's important to make sure you visualize standardized or Studentized residuals!
• The residuals ε̂ are uncorrelated with the fitted values ŷ. This means that if the model
we use is in fact a good fit, we shouldn’t see any patterns in the standardized residuals.
4.2 Outliers
Real life datasets often have some points that are far away from the rest of the data. Here
are some informal definitions used to characterize such anomalies:
• An outlier is any point that’s far away from the rest of the data. There are different
definitions and ways of quantifying them depending on how you define “far away”.
• The leverage of a data point is a quantitative description of how far it is from the
rest of the points in the x-direction. We’ll see a more formal description later, but
intuitively, points that are farther away from the average in the x-direction have higher
leverage, and points closer to the average have lower leverage.
• An influential point is an outlier with high leverage that significantly affects the
slope of a regression line. We’ll quantify this a little later.
Figure 4.1: (a) Some points and a regression line fit to those points. (b) When viewing y as a function of x, points A and B are both outliers since they're far away from the data, but A is also an influential point, since moving it just a little will have a large effect on the slope of the regression line. (c) The difference between fitting y = ax + b (blue, dashed) and x = cy + d (green, dotted). While the results would be similar if not for the outliers, A has a big impact on the first fit, while B has a big impact on the second.
Figure 4.1 illustrates these concepts. Since influential points can really mess up linear re-
gression results, it’s important to deal with them. Sometimes, such points indicate the need
for more data-gathering: if most of our data is concentrated in one area and our sampling
methodology was flawed, we might have missed data points in the region between the main
cluster and an outlier. Alternatively, if such points don’t indicate a problem in data collec-
tion but really just are irrelevant outliers, we can remove such points from the analysis; we’ll
see a few techniques for identifying them later today.
The leverage of any particular point is formally defined as the corresponding diagonal element
of the hat matrix, Hii (as defined in Equation (4.1)). We can see this intuition more clearly
in the one-dimensional case, where
Hii = 1/n + (xi − x̄)² / ∑_{j=1}^n (xj − x̄)²
Here, we can see that points whose xi is far from the mean x̄ have larger leverage, as expected. Leverage only measures how far away from the average x-value a point is: it's a measure of how influential a point has the potential to be, and doesn't depend on the point's y-value.
How might we quantify the influence of a particular point? One way is to take each point
and fit the model with and without that point. We can then compare the two results: the
more different they are, the more influential that point would be. Cook’s distance is a
measure of how influential each point i is that captures exactly this idea. The definition is
based on fitting the full model (i.e., using all of our data), and then fitting the model again, but without point i. Cook's distance for point i is then

Di = ( ∑_{j=1}^n (ŷj − ŷj(−i))² ) / (p · MSE),

where

• ŷj is the predicted value at xj based on the model fit to all of the data,
• ŷj(−i) is the predicted value at xj based on the model fit without point i,
• p is the number of coefficients in the model, and
• MSE is the mean squared error of the full model's fit.
Unfortunately, it seems from this formula that we have to recompute β̂ without point i any
time we want to compute Di . But, with a bit of algebra, we can derive the following formula,
which gives us an easy way to compute Cook’s distance in terms of the leverage:
Di = (ε̂i² / (p · MSE)) · (Hii / (1 − Hii)²)
We could thus use Cook’s distance to compute how influential a point is and screen for
outliers by removing points that are too influential based on a threshold of our choice.
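Here is a short NumPy sketch of these formulas; the data are synthetic, and the deliberately added point at x = 15 plays the role of an influential outlier:

    import numpy as np

    def leverage_and_cooks(X, y):
        """Leverages H_ii and Cook's distances from a single full-model fit."""
        n, p = X.shape
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        resid = y - X @ beta_hat
        H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
        h = np.diag(H)                                 # leverages H_ii
        mse = resid @ resid / (n - p)                  # mean squared error of the fit
        D = resid**2 / (p * mse) * h / (1 - h)**2      # Cook's distance formula above
        return h, D

    # A cluster of well-behaved points plus one far-away, badly-fitting point.
    rng = np.random.default_rng(1)
    x = np.append(rng.uniform(0, 5, size=30), 15.0)
    y = np.append(2 + 0.5 * x[:-1] + rng.normal(scale=0.3, size=30), 20.0)
    X = np.column_stack([np.ones_like(x), x])
    h, D = leverage_and_cooks(X, y)
    print(np.argmax(D))                                # index of the most influential point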
Manually removing outliers can feel rather cumbersome, raising the question of whether
there are regression methods that automatically handle outliers more gracefully. This leads
us to the topic of robust regression.
One issue with standard linear regression is that it can be affected by outliers. As we saw
with influential points, one inconveniently-placed outlier can dramatically alter the outcome
of a regression. In this section, we’ll first look at two complementary views of how we can
achieve more robust outcomes by tweaking the model we’ve learned about, and then see a
completely different way of achieving robustness using random sampling.
Figure 4.2: Graphs showing the squared-error, LAD, Huber, and bisquare loss functions. Note the
different y-axis scales!
In standard least-squares regression, we choose β to minimize the total loss

∑_{i=1}^n ρ(ri),

where ri = (yi − Xi β) is the residual and ρ(r) = r² is the squared error function. Unfortunately, squared error puts very large penalties on large errors: ρ(2) = 4, but ρ(10) = 100. So, any good solution to this optimization problem will avoid large errors, even if that means fitting to outliers. To get around this, we'll try choosing a different function ρ. The function ρ(r) we choose here is usually called the loss function, and any solution to a problem of this form is called an M-estimator. What other loss functions can we try?
• Least absolute deviations, or LAD: ρ(r) = |r|. While this avoids the problem of
excessively penalizing outliers, it leads to a loss function that isn’t differentiable at
r = 0. It also is less stable than least-squares: changing the x-value of a point just a
little can have a large impact on the solution.
• Huber: ρ(r) = r²/2 if |r| < k, and ρ(r) = k(|r| − k/2) if |r| ≥ k.
This is similar to LAD, but replaces the sharp corner around 0 with a smoothed parabola for differentiability (being differentiable makes it easy to take derivatives when optimizing for the best β); see the sketch after this list.
• Bisquare: these functions have a behavior similar to squared loss near 0, but level off
after a certain point. Some even assign a cost of 0 to very large deviations, to avoid
being sensitive to outliers at all.
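As promised above, here is a hedged sketch of fitting an M-estimator with the Huber loss via generic numerical optimization; the tuning constant k = 1.345 is a conventional choice, and the data are synthetic:

    import numpy as np
    from scipy.optimize import minimize

    def huber(r, k=1.345):
        """Huber loss: quadratic for |r| < k, linear beyond that."""
        return np.where(np.abs(r) < k, r**2 / 2, k * (np.abs(r) - k / 2))

    def huber_fit(X, y, k=1.345):
        """Choose beta to minimize the summed Huber loss of the residuals."""
        beta0 = np.linalg.solve(X.T @ X, X.T @ y)      # least-squares starting point
        objective = lambda beta: huber(y - X @ beta, k).sum()
        return minimize(objective, beta0).x

    # A clean line plus a few gross outliers in y.
    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 60)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
    y[:5] += 30.0                                      # corrupt a few points
    X = np.column_stack([np.ones_like(x), x])
    print(huber_fit(X, y))                             # close to [1, 2] despite the outliers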
Figure 4.3: Several distributions and their tails: Gaussian with variance 1 (blue), Laplacian (green), Cauchy (red), and Student's t with 5 degrees of freedom (orange).
4.3.3 RANSAC
Another approach to robust regression is RANSAC, or RANdom SAmple Consensus. While
RANSAC doesn’t have the theoretical justifications of the methods in the last two sections,
it nonetheless is widely used and often works well in practice. It's so popular that there's even a song about RANSAC¹!
The basic assumption of RANSAC is just that the data consists primarily of non-outliers,
which we’ll call “inliers”. The algorithm works as follows: we randomly pick a subset of our
points to call inliers and compute a model based on those points. Any other points that
fit this computed model well are added to the inliers. We then compute the error for this
model, and repeat the entire process several times. Whichever model achieves the smallest
error is the one we pick. Here’s a description in pseudocode:
1: function RANSAC
2:     for iteration t ∈ 1, 2, . . . , T do
3:         Choose a random subset of points to be inliers; call them It
4:         Fit a model to the points in It, then add to It any other points that fit this model well
5:         Refit the model to It and compute its error
6:     return the model with the smallest error
¹ See http://www.youtube.com/watch?v=1YNjMxxXO-E
Notice how imprecise our specification is; the following are all parameters that must be decided upon to use RANSAC:
• The number of iterations, T: this one can be chosen based on a theoretical result.
• The size of the random subset chosen at each iteration.
• The threshold for deciding whether a point fits the computed model well enough to be added to the inliers.
• The measure of error used to compare the candidate models.
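Here is a hedged Python sketch of this procedure for fitting a line; the subset size, the residual threshold, and the use of mean squared error over the inliers are illustrative choices, not part of RANSAC's definition:

    import numpy as np

    def ransac_line(x, y, T=100, subset_size=3, threshold=1.0, seed=0):
        """RANSAC for y = a*x + b; returns the best (slope, intercept) found."""
        rng = np.random.default_rng(seed)
        best_model, best_error = None, np.inf
        for _ in range(T):
            # Guess a random subset of inliers and fit a line to it.
            idx = rng.choice(len(x), size=subset_size, replace=False)
            slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
            # Add every other point that fits this line well to the inliers.
            inliers = np.abs(y - (slope * x + intercept)) < threshold
            if inliers.sum() <= subset_size:
                continue
            # Refit to the enlarged inlier set and score the model by its error.
            slope, intercept = np.polyfit(x[inliers], y[inliers], deg=1)
            error = np.mean((y[inliers] - (slope * x[inliers] + intercept)) ** 2)
            if error < best_error:
                best_model, best_error = (slope, intercept), error
        return best_model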
Another useful criterion that we often want our model to satisfy is sparsity. This means that
we want many of the coefficients βk to be 0. This constraint is useful in many cases, such
as when there are more features than data points, or when some features are very similar to
others. Adding in a sparsity constraint in these settings often helps prevent overfitting, and
leads to simpler, more interpretable models.
A natural first attempt, known as ridge regression, adds a penalty on the squared magnitudes of the coefficients:

min_β ∑_{i=1}^n (yi − Xi β)² + λ ∑_{k=1}^p βk²,

where λ is a nonnegative parameter that trades off between how small we want the coefficients βk to be and how small we want the error to be: if λ is 0, then the problem is the same as before, but if λ is very large, the second term counts a lot more than the first, and so we'll try to make the coefficients βk as small as possible. λ is often called a regularization
parameter, and we’ll talk a little more next week about how to choose it. With some linear
algebra, we can get the following solution:
β̂ = (XᵀX + λI)⁻¹Xᵀy.
We see that our intuition from before carries over: very large values of λ will give us a
solution of β̂ = 0, and if λ = 0, then our solution is the same as before.
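For example, here is a minimal NumPy sketch of this closed-form solution (synthetic data; the values of λ are arbitrary, and in practice the intercept is usually left unpenalized):

    import numpy as np

    def ridge(X, y, lam):
        """Closed-form ridge regression: beta-hat = (X'X + lam*I)^{-1} X'y."""
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)
    print(ridge(X, y, lam=0.0))     # lam = 0 recovers ordinary least squares
    print(ridge(X, y, lam=10.0))    # larger lam shrinks all coefficients toward 0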
Unfortunately, while ridge regression does give us smaller coefficients (and helps make the
problem solvable when X T X isn’t invertible), it doesn’t provide us with sparsity. Here’s an
example illustrating why: Suppose we have only 3 input components, and they’re all the
same. Then the solutions βa = (1, 1, 1) and βb = (0, 0, 3) will both give us the same y, but
the second one is sparse while the first one isn’t. But, the second one actually produces a
higher ridge penalty than the first! So, rather than produce a sparse solution, ridge regression
will actually favor solutions that are not sparse.
While using three identical inputs is a contrived example, in practice it’s common to have
several input dimensions that could have similar effects on the output, and choosing a small
(sparse) subset is critical to avoid overfitting.
For the purposes of preventing overfitting and where we don’t care about sparsity, ridge
regression is actually widely used in practice. However, if we seek a sparse solution, we must
turn to a different approach.
Up until now, we’ve been treating the parameter β as a fixed but unknown quantity. But
what if we treat it as a random quantity? Suppose before we know y or X, we have a prior
belief about β, which we’ll encode in the form of a prior distribution p(β). Once we observe
X and y, we can compute a more informed distribution, called a posterior distribution,
from that information. We’ll write it as p(β|X, y), where the “|” means “given” or “having
observed”. Using Bayes’ rule, we can compute the posterior distribution as:
p(β|X, y) ∝ p(β)p(X, y|β) (4.2)
Notice that we’ve already specified p(X, y|β): it’s just the model for the errors ε. Once we
choose p(β), we can try to find the value of β that’s most likely according to p(β|X, y). If
we choose p(β) to be a zero-mean Gaussian whose variance is inversely proportional to λ, then we'll end up with exactly ridge regression.
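To see this correspondence concretely, here is a standard derivation sketch (with σε² denoting the noise variance and σβ² the prior variance of each coefficient; the λ above absorbs these constants). Taking the negative log of the posterior in (4.2),

−log p(β | X, y) = (1/(2σε²)) ∑_{i=1}^n (yi − Xi β)² + (1/(2σβ²)) ∑_{k=1}^p βk² + const,

so finding the most probable β is exactly the ridge problem with λ = σε²/σβ²: a tighter (smaller-variance) prior on β corresponds to stronger regularization.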
The most direct way to encode sparsity would be to penalize the number of nonzero coefficients,

min_β ∑_{i=1}^n (yi − Xi β)² + λ ∑_{k=1}^p 1{βk ≠ 0},     (4.3)

where the penalty imposes a cost for any nonzero value of β. Alas, there is no known efficient
method for computing a solution to this problem. For any moderate to large dataset, the
above optimization problem is intractable to solve in any reasonable amount of time. We
basically can’t do much better than searching over every possible combination of which βk s
are 0 and which aren’t.
Instead, we’ll opt to solve the following problem:
min_β ∑_{i=1}^n (yi − Xi β)² + λ ∑_{k=1}^p |βk|.
This problem, also known as ℓ1-norm minimization or the Least Absolute Shrinkage and Selection Operator (LASSO), is an approximation to (4.3) that can be solved efficiently.
Additionally, under certain reasonable conditions, if the true model is in fact sparse, then the
solution to this problem will also be sparse. There are numerous other interesting theoretical
results about LASSO, and it’s an active area of open research! As a result of its popularity,
most software packages have an option for LASSO when performing linear regression.
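For instance, here is a hedged sketch using scikit-learn's Lasso; its alpha parameter plays the role of λ up to a scaling convention, and the sparse ground truth below is synthetic:

    import numpy as np
    from sklearn.linear_model import Lasso

    # Sparse ground truth: only two of ten coefficients are nonzero.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 10))
    true_beta = np.zeros(10)
    true_beta[[0, 3]] = [2.0, -1.5]
    y = X @ true_beta + rng.normal(scale=0.1, size=100)

    model = Lasso(alpha=0.1).fit(X, y)   # l1-penalized least squares
    print(model.coef_)                   # most entries come out exactly 0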
The Bayesian interpretation corresponds to using a Laplacian prior over β, so that our prior belief for each coefficient βk is proportional to e^(−λ|βk|). If we want to perform hypothesis tests or
get confidence intervals for coefficients that come out of this method, we’ll need to use
nonparametric statistics, which we’ll talk about in Chapter 5.
Figure 4.4: Several coefficient priors for sparse regression: Gaussian (left) as in ridge regression,
Laplacian (center) as in LASSO, and a horseshoe prior (right). Note that the horseshoe prior
approaches ∞ as the coefficient becomes smaller.
So far, we’ve dealt entirely with linear models and linear relationships. Unfortunately, in
the real world, we often come across data with nonlinear properties. While we can often
get by with cleverly defined features, we often have to deal with output data that has a
nonlinear dependence that we can’t capture. For example, if our output data is binary, or
even between 0 and 1, the techniques we’ve talked about so far have no way of constraining
the predictions to that range.
Recall that before, we assumed that y = µy + ε, where µy = Xβ.
Generalized linear models⁴ are a family of methods that assume the following:

µy = g⁻¹(Xβ),
where g(·)⁵ is called the link function and is usually nonlinear. Here, the interaction between the input X and the parameters β remains linear, but the result of that linear interaction is passed through the inverse link function to obtain the mean µy of the output. We often write η = Xβ, so that µy = g⁻¹(η). The choice of link function typically depends on the problem being solved. Below, we provide an example link function commonly used when y is binary.
⁴ These are not to be confused with general linear models, which we'll talk more about next week.
⁵ Using the inverse of g(·) in the setup is a matter of historical tradition.
Figure 4.5: A sigmoid function maps a real number to a value between 0 and 1.

When y is binary, a common choice is the logit link g(µ) = log(µ / (1 − µ)), whose inverse g⁻¹(η) = 1/(1 + e^(−η)) is the sigmoid function shown in Figure 4.5; a generalized linear model with this link is exactly logistic regression.
Generalized linear models also make fewer assumptions about the form of the error: while
before, we assumed that y = µy + ε, generalized linear models give us more freedom in
expressing the relationship between y and µy by fully specifying the distribution for y.
While generalized linear regression problems usually can’t be solved in closed form (in other
words, we can’t get a simple expression that tells us what β is), there are efficient compu-
tational methods to solve such problems, and many software packages have common cases
such as logistic regression implemented.
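As an example, here is a hedged sketch of fitting a logistic regression (a GLM with the logit link) using statsmodels; the data are simulated so that the true coefficients are known:

    import numpy as np
    import statsmodels.api as sm

    # Binary outcomes whose success probability is a sigmoid of x.
    rng = np.random.default_rng(5)
    x = rng.normal(size=200)
    eta = -0.5 + 2.0 * x                      # linear predictor eta = X beta
    p = 1.0 / (1.0 + np.exp(-eta))            # inverse link: the sigmoid
    y = rng.binomial(1, p)

    X = sm.add_constant(x)                    # add an intercept column
    model = sm.GLM(y, X, family=sm.families.Binomial())   # logit is the default link
    result = model.fit()                      # fit by iteratively reweighted least squares
    print(result.params)                      # estimates should be near [-0.5, 2.0]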