
Regression Analysis - PSTAT 126

REVIEW LECTURE

Department of Statistics and Applied Probability


University of California, Santa Barbara
Goal of Regression Analysis

The goal of regression analysis is to model the relationship between independent variables
(predictors) and a dependent variable (response).

1 Estimation:
model the relationship between one or more predictors and a response using an observed
data set.

2 Prediction:
predict new outcomes for a new set of inputs using the fitted model.

Examples:

(1) drug overdose and life expectancy

(2) high school grade point average (gpa) and college entrance test score

(3) latitude and cancer mortality rate

(4) physical activity and body mass index (BMI)


Deterministic (or functional) relationships, e.g., the relationship between degrees
Fahrenheit and degrees Celsius is known to be:
Fahr = (9/5) · Cels + 32

[Plot: Fahrenheit vs. Celsius, a perfectly straight line]

The relationship is perfect. We are not interested in that.


Statistical relationships:

Skin Cancer Mortality versus Latitude

[Scatterplot: Mortality (Deaths per 10 million) vs. Latitude (at center of state)]


The relationship is not perfect. Indeed, the plot exhibits some “trend,” but it also exhibits
some “scatter.”
Simple Linear Regression

A simple linear regression model is defined as

Yi = β0 + β1 xi + εi , i = 1, . . . , n

Four conditions that comprise “simple linear regression model”:


1 The mean of the response, E(Yi), at each value of the predictor, xi, is a Linear function of xi.

2 The errors, εi, are Independent.

3 The errors, εi, at each value of the predictor, xi, are Normally distributed.

4 The errors, εi, at each value of the predictor, xi, have Equal variances (denoted σ²).
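A minimal sketch of fitting such a model in R (the software implied by the MASS package referenced later), using simulated data rather than any of the lecture data sets:

# Simulate data that satisfy the LINE conditions, then fit a simple linear regression
set.seed(1)
x <- runif(50, 0, 10)
y <- 5 + 0.5 * x + rnorm(50, mean = 0, sd = 1)   # true beta0 = 5, beta1 = 0.5, sigma = 1
fit <- lm(y ~ x)
summary(fit)   # estimates of beta0 and beta1, their standard errors, and sigma-hat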
Histogram of Body Height

For both genders, height follows a “normal distribution.”


[Plot: Height vs Gender for Women and Men]

If we connect the mean of the height for both genders, we get a “line.”
Normal Distribution
Inference for β1 (Population Slope)

The formula for the confidence interval for β1 , in words, is:

Sample estimate ± (t-multiplier × standard error)

and, in notation, is:

b1 ± tn−2,1−α/2 · se(b1)

where se(b1) = σ̂ / √Sxx. Recall that σ̂² = MSE, MSE = (1/(n−2)) Σᵢ (Yi − Ŷi)², and
Sxx = Σᵢ (xi − x̄)², with sums running over i = 1, …, n.
Hypothesis Test for β1

1 Hypothesis Test
H0 : β1 = β10 and H1 : β1 ≠ β10,
where β10 is a fixed value. Usually β10 = 0.

2 Test statistic
t* = (b1 − β10) / se(b1)

3 Null distribution: t* ∼ tn−2

4 p-value: p = 2P(tn−2 > |t*|).

5 Critical value: cα = tn−2,1−α/2.

For a two-sided test, we reject H0 if |t*| > cα, or equivalently, we reject H0 if p < α.
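Continuing the simulated sketch above, both the confidence interval and the t-test for β1 can be read off a fitted lm object, or computed by hand from the formulas:

# Inference for the slope beta1 (continuing the simulated example)
confint(fit, level = 0.95)        # 95% confidence intervals for beta0 and beta1
summary(fit)$coefficients         # t-statistics and two-sided p-values for H0: beta = 0
b1  <- coef(fit)[2]               # manual version of the t-test with beta10 = 0
se1 <- summary(fit)$coefficients[2, 2]
tstar <- (b1 - 0) / se1
2 * pt(abs(tstar), df = df.residual(fit), lower.tail = FALSE)   # two-sided p-value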


Goodness of Fit

Skin Cancer Mortality versus Latitude

[Scatterplot: Mortality (Deaths per 10 million) vs. Latitude (at center of state), annotated with an observation Yi, its fitted value Ŷi, and the overall mean Ȳ]

1 Regression sum of squares (a component that is due to the change in x):

SSR = Σᵢ (Ŷi − Ȳ)²

2 Error sum of squares (a component that is just due to random error):

SSE = Σᵢ (Yi − Ŷi)²

3 Total sum of squares:

SSTO = Σᵢ (Yi − Ȳ)²

(All sums run over i = 1, …, n.)

If the regression sum of squares is a “large” component of the total sum of squares, it
suggests that there is a linear association between the predictor x and the response Y .
1 One measure of goodness of fit is the coefficient of determination, R²:

R² = SSR/SSTO = 1 − SSE/SSTO

It shows the proportion of the variation in the response Y explained by the linear
regression model.

2 R² is a number between 0 and 1.

3 If R² = 1, all of the data points fall on the regression line. The predictor x accounts
for all of the variation in Y .

4 If R² = 0, the estimated regression line is perfectly horizontal. The predictor x
accounts for none of the variation in Y .
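To connect the formulas to software output, here is a sketch (continuing the simulated example) of computing R² from the sums of squares and comparing it with the value reported by lm:

# Coefficient of determination from the sums of squares
SSE  <- sum(resid(fit)^2)
SSTO <- sum((y - mean(y))^2)
SSR  <- SSTO - SSE
SSR / SSTO                 # R^2 = SSR/SSTO
summary(fit)$r.squared     # should agree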
We have a random sample of n = 35 students and we are interested in how strong the
linear relationship is between the height (x) of a student and his or her GPA (Y ).

[Scatterplot: GPA vs Height, with the fitted line Ŷ and the mean line Ȳ; R² = 0.0028]

In this example SSR = 0.0276, SSE = 9.7055 and SSTO = 9.7331. Note that
R² = SSR/SSTO = 0.0028.
Skin cancer mortality rates (Y ) vs latitude (x):

[Scatterplot: Mortality (Deaths per 10 million) vs. Latitude (at center of state), with the fitted line Ŷ and the mean line Ȳ; R² = 0.6798]

In this example SSR = 36464.2, SSE = 17173.07 and SSTO = 53637.27. Note that
R² = SSR/SSTO = 0.6798.
Interpretation of R² when 0 < R² < 1

When R² is some number between 0 and 1, like 0.6 or 0.3, we say either

1 R² × 100 percent of the variation in Y is reduced by taking into account predictor x,

or

2 R² × 100 percent of the variation in Y is explained by the variation in predictor x.
Confidence Interval for the Mean Response E(Yh)

For the mean response E(Yh ) when the predictor value is xh , the general
formula for the confidence interval in words is

Sample estimate ± t-multiplier × standard error

and the formula in notation is



Ŷh ± tn−2,1−α/2 · se(Ŷh )

1 Ŷh is the fitted value or predicted value of the response when the predictor is
xh .
2 tn−2,1−α/2 is the t-multiplier.
3 se(Ŷh ) = σ̂ · √( 1/n + (xh − x̄)²/Sxx ), and it is the standard error of Ŷh .
Prediction Interval for a New Response Yh(new)

The general formula in words is

Sample estimate ± t-multiplier × standard error

and the formula in notation is


Ŷh ± tn−2,1−α/2 · σ̂ · √( 1 + 1/n + (xh − x̄)²/Sxx )

where σ̂² = MSE and Sxx = Σᵢ (xi − x̄)².
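A sketch of obtaining both intervals in R (continuing the simulated example; xh = 5 is an arbitrary new predictor value):

# Confidence interval for E(Y_h) and prediction interval for Y_h(new) at x_h = 5
newpoint <- data.frame(x = 5)
predict(fit, newdata = newpoint, interval = "confidence", level = 0.95)   # CI for the mean response
predict(fit, newdata = newpoint, interval = "prediction", level = 0.95)   # PI for a new response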
Analysis of Variance
Note that
Yi − Ȳ = (Yi − Ŷi ) + (Ŷi − Ȳ )

Skin Cancer Mortality versus Latitude

[Scatterplot: Mortality (Deaths per 10 million) vs. Latitude (at center of state), annotated with Yi, Ŷi, and Ȳ]

Decomposition of the Total Sum of Squares SSTO

It can be further shown that


Σᵢ (Yi − Ȳ)² = Σᵢ (Ŷi − Ȳ)² + Σᵢ (Yi − Ŷi)²   (sums over i = 1, …, n)

which is equivalent to
SSTO = SSR + SSE

1 The degrees of freedom associated with SSTO are n − 1.

2 The degrees of freedom associated with SSR are 1.

3 The degrees of freedom associated with SSE are n − 2.


ANOVA Table

The formula for each entry is summarized for you in the following analysis of
variance table:

Source        df       SS       MS                   F
Regression    1        SSR      MSR = SSR/1          F* = MSR/MSE
Error         n − 2    SSE      MSE = SSE/(n − 2)
Total         n − 1    SSTO
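In R, the same table can be produced for a fitted model (continuing the simulated example):

# ANOVA table: one row for the predictor (SSR, df = 1) and one for the residuals (SSE, df = n - 2)
anova(fit)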
Residual Analysis (Diagnostics) To Check “LINE”
Conditions

To conduct linear regression analysis, the four “LINE” conditions must hold. How can we
check whether or not the four “LINE” conditions hold?
The main “tools” use the residuals (the observed prediction errors):

1 To check the Linear condition and the Equal condition:


A “residuals vs fit" plot or equivalently “residuals vs predictor” plot is used to detect
non-linearity, unequal error variances, and outliers.

2 To check the Independent condition:


A “residuals vs order plot” is a way of detecting a particular form of
non-independence of the error terms, namely serial correlation.

3 To check the Normal condition:


A normal Q-Q probability plot of the residuals is used to detect departures from
normality. If the resulting plot is approximately linear, we proceed assuming that the
error terms are normally distributed.
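A sketch of producing these diagnostic plots in R (continuing the simulated example):

# Residual diagnostics for the LINE conditions
e <- resid(fit)
plot(fitted(fit), e)     # residuals vs fitted values: checks the L and E conditions
plot(e, type = "b")      # residuals vs observation order: checks the I condition
qqnorm(e); qqline(e)     # normal Q-Q plot of residuals: checks the N condition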
Well-Behaved Pattern in a Residual vs Fit Plot

[Plot: Residuals vs Fitted Values showing a random, patternless scatter around zero]
Normally Distributed Residuals
The following normal Q-Q plot suggests that the residuals (and hence the error
terms) are normally distributed:

[Normal Q-Q plot: sample quantiles vs. theoretical quantiles falling approximately on a straight line]
Remedies (Transformations) when “LINE” Conditions
are not Met

(1) What transformation do we use when non-linearity is the only problem?


In this case, we transform x values and log-transforming x is one commonly used remedy.

(2) What transformation do we use when non-normality and/or unequal variances are the
problem(s)?
In this case, we transform Y values and log-transforming Y is one commonly used remedy.

(3) What transformation do we use when “everything” seems wrong?


1 First, to correct non-normality and/or unequal variances, we transform the Y values
(which may also help with the non-linearity), for example by log-transforming Y .
2 Then, to correct non-linearity, we transform the x values, for example by log-transforming x
or adding x², x³, … to the model, i.e., polynomial regression.

(4) What to do when it is difficult to determine which transformation on Y to use?


1 Box-Cox transformations are a family of power transformations on Y of the form
Y′ = Y^λ, where λ is a parameter to be determined using the data.
2 The boxcox() function in the MASS package can be used for Box-Cox transformations.
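A sketch of using boxcox() on the simulated example (note that Box-Cox requires a positive response; the simulated y above is positive):

# Choose lambda for a Box-Cox power transformation of Y
library(MASS)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))   # plots the profile log-likelihood over lambda
lambda_hat <- bc$x[which.max(bc$y)]                # lambda with the highest log-likelihood
lambda_hat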
Multiple Linear Regression

A population model for a multiple linear regression model that relates a response Y to
(p − 1) x variables is written as

Yi = β0 + β1 xi1 + β2 xi2 + ... + βp−1 xi(p−1) + εi

1 We assume that the εi are independent and have a normal distribution with mean 0
and constant variance σ 2 .

2 The subscript i refers to the ith individual or unit in the population, and for the x
variables, the subscript following i simply denotes which x variable it is.

3 The word “linear” in “multiple linear regression” refers to the fact that the model is
linear in the parameters β0 , β1 , ...βp−1 .

4 For example, the model Yi = β0 + β1 xi + β2 xi² + εi is still a “multiple linear
regression” model, even though the highest power of x is two.
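A minimal sketch of fitting a multiple linear regression in R, again with simulated (not lecture) data; the polynomial fit illustrates item 4:

# Multiple linear regression with two predictors
set.seed(2)
x1 <- runif(60); x2 <- runif(60)
y2 <- 1 + 2 * x1 - 3 * x2 + rnorm(60, sd = 0.5)
mfit <- lm(y2 ~ x1 + x2)
summary(mfit)                   # b0, b1, b2 with standard errors and t-tests
pfit <- lm(y2 ~ x1 + I(x1^2))   # still a "linear" model: linear in the parameters
summary(pfit)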
Interpretation of the Model Parameters

1 Each coefficient βk , 1 ≤ k ≤ p − 1, represents the change in the mean response,
E(Y ), per unit increase in the associated k th predictor variable when all the other
predictors are held constant.

2 The intercept term, β0 , represents the mean response, E(Y ), when all the predictors
x1 , x2 , . . . , xp−1 , are zero. As in a simple linear regression setting, it may or may not
have any practical meaning.
General F -Test

The “general linear F -test” involves three basic steps


1 Define a larger full model. (Model with more parameters.)

2 Define a smaller reduced model. (Model with fewer parameters.)

3 Use an F -statistic to decide whether or not to reject the smaller reduced


model in favor of the larger full model.
The null hypothesis always pertains to the reduced model, while the alternative
hypothesis always pertains to the full model.
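A sketch of the general linear F-test in R, using the simulated multiple regression data above:

# Compare a reduced model (fewer parameters) with a full model (more parameters)
reduced <- lm(y2 ~ x1)
full    <- lm(y2 ~ x1 + x2)
anova(reduced, full)   # F-statistic and p-value; a small p-value favors the full model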
Confidence Interval for E(Yh) and Prediction Interval
for Yh(new)
The formula for the prediction interval for Yh,(new) is:
Ŷh ± tα/2,n−p × √( MSE + [se(Ŷh )]² )

The formula for the confidence interval for E(Yh ) is:

Ŷh ± tα/2,n−p × se(Ŷh )

1 Ŷh is the “fitted value” or “predicted value” of the response when the predictor values
are xh .
Ŷh = b0 + b1 xh1 + b2 xh2 + ... + bp−1 xh(p−1)

2 tα/2,n−p is the t-multiplier. Note again that the t-multiplier has n − p degrees of
freedom because the prediction interval uses the mean square error (M SE) whose
denominator is n − p.

3 Observe that the only difference in the formulas is that the standard error of the
prediction for Yh,(new) has an extra M SE term in it that the standard error of the fit for
E(Yh ) does not.
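A sketch of both intervals for the simulated multiple regression fit above (the new predictor values are arbitrary):

# Intervals at a new set of predictor values
newx <- data.frame(x1 = 0.5, x2 = 0.2)
predict(mfit, newdata = newx, interval = "confidence")   # CI for E(Y_h)
predict(mfit, newdata = newx, interval = "prediction")   # PI for Y_h(new)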
Multiple Linear Regression Model Assumptions

The four conditions (“LINE”) that comprise the multiple linear regression model generalize the
simple linear regression model conditions to take account of the fact that we now have multiple
predictors:
1 The mean of the response, E(Yi ), at each set of values of the predictors, (xi1 , xi2 , . . .), is a
Linear function.

2 The errors, εi , are Independent.

3 The errors, εi , at each set of values of the predictors, (xi1 , xi2 , . . .), are Normally distributed.

4 The errors, εi , at each set of values of the predictors, (xi1 , xi2 , . . .), have Equal variances
(denoted σ 2 ).
Similarly to simple linear regression, an alternative way to describe all four assumptions is that the
errors, εi , are independent normal random variables with mean zero and constant variance, σ 2 .
Residual Analysis (Diagnostics)

As in simple linear regression, we can assess whether these conditions seem to hold for a
multiple linear regression model applied to a particular sample dataset by looking at the prediction
errors, i.e., the residuals, ei = Yi − Ŷi .

1 How do we check the “L” condition?


1 plot the residuals, ei , against the fitted values, Ŷi
2 plot the residuals, ei , against each of the predictors in the model

2 How do we check the “I” condition?


plot the residuals, ei , against the time (or space) sequence if the data observations were
collected over time (or space)

3 How do we check the “N” condition?


create a histogram and/or normal Q-Q probability plot of the residuals, ei

4 How do we check the “E” condition?


1 plot the residuals, ei , against the fitted values, Ŷi
2 plot the residuals, ei , against each of the predictors in the model
Violation of any of these four may necessitate remedial action such as transforming one or more
predictors and/or the response variable.
Parallel Model vs. Non-Parallel Model

1. If there’s no interaction term between a continuous predictor and a categorical


(qualitative) predictor, then the fitted lines for each category are parallel.
[Plot: Birth Weight vs Length of Gestation, with parallel fitted lines for Non-Smokers and Smokers]

The regression model contains additive effects and the response function can be written as a
sum of functions of the predictor variables:

E(Yi ) = (β0 ) + (β1 xi1 ) + (β2 xi2 )

where Yi is birth weight for individual i, xi1 is length of gestation for individual i, and xi2 is a
categorical variable representing smoking status (Yes/No).
2. If there’s an interaction effect between a continuous predictor and a categorical
(qualitative) predictor, then the fitted lines for each category are non-parallel.
[Plot: Treatment Effectiveness vs Age, with non-parallel fitted lines for treatments A, B, and C]

The regression model contains interaction effects and the response function cannot be written
as a sum of functions of the predictor variables.

E(Yi ) = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β12 xi1 xi2 + β13 xi1 xi3

where Yi is treatment effectiveness for individual i, xi1 is age for individual i, xi2 = 1 if individual i
receives treatment A; 0 otherwise, and xi3 = 1 if individual i receives treatment B; 0 otherwise.
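A sketch contrasting an additive (parallel-lines) model with an interaction (non-parallel-lines) model, using simulated data with a continuous predictor and a three-level categorical predictor:

# Additive vs interaction model with a categorical predictor
set.seed(3)
age   <- runif(90, 20, 60)
group <- factor(rep(c("A", "B", "C"), each = 30))
eff   <- 10 + 0.8 * age + 5 * (group == "B") - 4 * (group == "C") +
         0.3 * age * (group == "B") + rnorm(90, sd = 2)
additive <- lm(eff ~ age + group)   # one common slope: parallel fitted lines
interact <- lm(eff ~ age * group)   # a separate slope per group: non-parallel lines
anova(additive, interact)           # F-test for whether the interaction terms are needed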
1 Recall that if we have two continuous predictors xi1 and xi2 in a multiple linear regression
model, we are fitting a “plane” to an observed data set.

2 What will the plane look like if there is an interaction effect between xi1 and xi2 ? It will be a
“warped plane”.
3 Example: A data set consists of 654 observations on children aged 3 to 19. Forced
Expiratory Volume (FEV), which is a measure of lung capacity, is the response (Y ). Age (x1 )
and height (x2 ) are two continuous predictors.

[3D surface plots: FEV as a function of age and height; a flat plane when there is no interaction effect, and a “warped plane” when there is an interaction effect between height and age]
Variable Selection and Model Building

Strategy:
1 Know your goal, know your research question. Knowing how you plan to use your
regression model can assist greatly in the model building stage.

2 Identify all of the possible candidate predictors.


1. Don’t worry about interactions or the appropriate functional form – such as x² and log x
– just yet.
2. Just make sure you identify all the possible important predictors.

3 Use variable selection procedures to find the middle ground between an


underspecified model and a model with extraneous or redundant variables.
Two possible variable selection procedures are stepwise regression and best subsets
regression.

4 Fine-tune the model to get a correctly specified model.


Iterate back and forth between formulating different regression models and checking the
behavior of the residuals until you are satisfied with the model.
Overview of Stepwise Regression

1 First, we start with no predictors in our “stepwise model.”

2 Then, at each step along the way we either enter or remove a predictor based on
some criteria, for example:
1 General (partial) F -tests – that is, the t-tests for the individual slope parameters

2 Akaike’s Information Criterion (AIC)

3 Bayesian Information Criterion (BIC)

3 We stop when no more predictors can be justifiably entered or removed from our
stepwise model.
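A sketch of stepwise selection in R with the simulated multiple regression data, using either the AIC or the BIC penalty:

# Stepwise regression starting from the intercept-only model
null_model <- lm(y2 ~ 1)
full_model <- lm(y2 ~ x1 + x2)
step(null_model, scope = formula(full_model), direction = "both")                        # AIC penalty (k = 2)
step(null_model, scope = formula(full_model), direction = "both", k = log(length(y2)))   # BIC penalty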
Information Criteria (Other Criteria to Choose
Predictors)

1. Akaike’s Information Criterion (AIC)

AIC = n log(SSE) − n log(n) + 2p

2. Bayesian Information Criterion (BIC)

BIC = n log(SSE) − n log(n) + p log(n)

1 Notice that the only difference between AIC and BIC is the multiplier of p, the number of
parameters.

2 The BIC places a higher penalty ( log(n) > 2 when n > 7) on the number of parameters in
the model so will tend to reward more parsimonious (smaller) models.

3 For regression models, the information criteria combine information about the SSE, number
of parameters p in the model, and the sample size n.

4 A small value, compared to values for other possible models, is good.
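A sketch of computing the two criteria from the slide's formulas for a fitted lm object (continuing the simulated simple regression example):

# Information criteria from SSE, n, and p
SSE <- deviance(fit)       # residual sum of squares of an lm fit
n   <- nobs(fit)
p   <- length(coef(fit))   # number of regression parameters
n * log(SSE) - n * log(n) + 2 * p         # AIC as defined on the slide
n * log(SSE) - n * log(n) + p * log(n)    # BIC as defined on the slide
# R's built-in AIC()/BIC() are based on the full log-likelihood, so their values differ
# from these by an additive constant, but the ranking of models on the same data is the same.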


Best Subsets Regression

The general idea behind best subsets regression is that we select the subset of
predictors that do the best at meeting some well-defined objective criterion:

1 largest increase in R² (Note we don’t use the magnitude of R² to evaluate a model,
because R² always increases as we include more predictors in a model.)

2 largest adjusted R²

3 smallest MSE (Adjusted R² and MSE are equivalent criteria when we choose
predictors.)

4 Mallows’ Cp close to the number of parameters p (Note we don’t use Cp to
evaluate the full model, for which Cp always equals p.)
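A sketch of best subsets regression using regsubsets() from the leaps package (an assumption: leaps is the package used here; the simulated data above has only two candidate predictors, so the example is deliberately small):

# Best subsets regression: the best model of each size, with several criteria
library(leaps)
subs <- regsubsets(y2 ~ x1 + x2, data = data.frame(y2, x1, x2), nvmax = 2)
s <- summary(subs)
s$adjr2   # adjusted R^2 for the best model of each size
s$cp      # Mallows' Cp
s$bic     # BIC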
Outliers, High-Leverage Data Points, and Influential Points

1. What is the distinction between an outlier and a high-leverage data point?


An outlier is a data point whose response Y does not follow the general trend of the
rest of the data.

A data point has high leverage if it has “extreme” predictor x values.

Outliers and high leverage data points have the potential to be influential, but we
generally have to investigate further to determine whether or not they are actually
influential.

2. When will a data point be influential?


A data point is influential if it unduly influences any part of a regression analysis, such
as the predicted responses, the estimated slope coefficients, or the hypothesis test
results.
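In R, these ideas correspond to standard numerical diagnostics for a fitted lm object (continuing the simulated simple regression example):

# Leverage, outlyingness, and influence diagnostics
hatvalues(fit)          # leverages: large values flag extreme x values
rstandard(fit)          # standardized residuals: large |values| flag outliers in Y
cooks.distance(fit)     # Cook's distance: large values flag influential points
influence.measures(fit) # several influence statistics together, with unusual cases flagged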
Example 1

The red point is an outlier but it’s not a high-leverage point.

[Scatterplot of Y vs x, showing the linear fit with the red point and the linear fit without the red point; the two fitted lines nearly coincide]

It’s not an influential point because the predicted responses, estimated slope coefficients, and
hypothesis test results are not affected by the inclusion of the red data point.
Example 2

The red point is a high-leverage point but it’s not an outlier.

[Scatterplot of Y vs x, showing the linear fit with the red point and the linear fit without the red point; the two fitted lines nearly coincide]

It’s not an influential point because the predicted responses, estimated slope coefficients, and
hypothesis test results are not affected by the inclusion of the red data point.
Example 3

The red point is a high-leverage point and an outlier.

[Scatterplot of Y vs x, showing the linear fit with the red point and the linear fit without the red point; the two fitted lines differ noticeably]

It is also an influential point. The predicted responses and estimated slope coefficients are clearly
affected by the presence of the red data point.
The Idea of “Leave-One-Out”

The following plot has only n = 4 data points.


[Plot: n = 4 data points, with the linear fit using all points and the linear fit without the red point]

Observe that the red data point “pulls” the estimated regression line towards it. When the
red data point is omitted, the estimated regression line “bounces back” away from the
point.
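A sketch of the leave-one-out idea in code (applied to the simulated simple regression example rather than the 4-point plot): refit without one point and compare the coefficients.

# Drop the most influential observation and refit
i <- which.max(cooks.distance(fit))
fit_without_i <- lm(y ~ x, subset = -i)
coef(fit)             # intercept and slope using all points
coef(fit_without_i)   # intercept and slope without observation i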
Some Advice with Problematic Data Points

First, check for obvious data errors:

1 If the data point is a procedural error and invalidates the measurement, delete it.

2 If the data point is not representative of the intended study population, delete it.

Consider the possibility that you might have just mis-formulated your regression model:

1 Did you leave out any important predictors?

2 Should you consider adding some interaction terms?

3 Is there any non-linearity that needs to be modeled?

Decide whether or not deleting data points is warranted:

1 Do not delete data points just because they do not fit your pre-conceived regression model.

2 You must have a good, objective reason for deleting data points.

3 If you delete any data after you’ve collected it, justify and describe it in your reports.

4 If you are not sure what to do about a data point, analyze the data twice – once with and
once without the data point – and report the results of both analyses.
