Notes 1017 Part1


Chapter 11

Linear Regression and Correlation
The modeling of the relationship between a response variable and a set of
explanatory variables is one of the most widely used of all statistical techniques.
We refer to this type of modeling as regression analysis. The basic idea of
regression analysis is to obtain a model for the functional relationship between a
response variable (often referred to as the dependent variable) and one or more
explanatory variables (often referred to as the independent variables).

The research context is that two variables have been observed for each of n
participants. The research team then has a spreadsheet with n pairs of
observations $(x_i, y_i)$, $i = 1, \dots, n$. One of the variables (here y) is the outcome
variable or dependent variable. This is the variable hypothesized to be affected by
the other variable in scientific research. The other variable (here x) is the
independent variable. It may be hypothesized to predict the outcome variable or to
cause a change in the outcome variable.
An example of a research project seeking to document a causal association would
be a clinical trial in which 𝑥𝑖 was the dosage of a medicine randomly assigned to a
participant (say simvastatin) and 𝑦𝑖 was the participant’s response after a specified
period taking the medicine (say cholesterol reduction after 3 months).

An example of a study seeking to document the value of a predictive association would be an observational study in which $x_i$ was the score of a statistics student on the first examination in a course and $y_i$ was the student's score on the final examination in the course.

A recommended first step is to create a scatterplot of the observations, with the vertical axis representing the dependent variable and the horizontal axis representing the independent variable. The "pencil test" is to hold a pencil up to the scatterplot and examine whether its straight edge describes the data well. If so, then it is reasonable to assume that a linear model describes the data. The linear model is reasonable for many data sets in observational studies. A more objective procedure is to use a "nonlinear smoother" such as LOWESS to estimate the association. If the LOWESS curve is not well approximated by a line, then the assumption of linearity is not reasonable.
In this chapter, we consider simple linear regression analysis, in which there is a single given independent variable x and the equation for predicting a dependent variable y is a linear function of that independent variable. We write the prediction equation as
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
where $\hat{\beta}_0$ is the intercept and $\hat{\beta}_1$ is the slope.

Assuming linearity, we would like to write y as a linear function of x: $y = \beta_0 + \beta_1 x$. However, according to such an equation, y is an exact linear function of x; no room is left for the inevitable errors (deviations of actual y-values from their predicted values). Therefore, corresponding to each observation $y_i$, we introduce a random error term $\varepsilon_i$ and assume the model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$
where the errors $\varepsilon_i$ are assumed to be independent and normally distributed with mean 0 and a common standard deviation.
These assumptions are illustrated in Figure 11.2. The actual values of the dependent
variable are distributed normally with mean values falling on the regression line and
the same standard deviation at all values of the independent variable. The only
assumption not shown in the figure is independence from one measurement to another.
These are the formal assumptions, made in order to derive the significance tests and
prediction methods that follow.
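Written compactly, the same assumptions (this is only a symbolic restatement of the prose above, not an additional assumption) are
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma_\varepsilon^2) \ \text{independently for } i = 1, \dots, n.$$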
Estimating Model Parameters

OLS (ordinary least squares) is the most commonly used method to estimate the parameters of the linear model. An arbitrary linear function $b_0 + b_1 x$ is used as a fit for the dependent variable values. The method uses the residuals $y_i - b_0 - b_1 x_i$. The fitted model is judged by how small the set of residuals is. OLS uses each residual and focuses on the magnitude of the residuals by examining the sum of squares function
$$SS(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2.$$
The OLS method is to find the arguments $(\hat{\beta}_0, \hat{\beta}_1)$ that make $SS(b_0, b_1)$ as small as possible. This minimization is a standard calculus problem.
Step 1 is to calculate the partial derivatives of $SS(b_0, b_1)$ with respect to each argument.
Step 2 is to find the arguments $(\hat{\beta}_0, \hat{\beta}_1)$ that make the two partial derivatives zero. The resulting equations are called the normal equations:
$$\sum_{i=1}^{n} \big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big) = 0, \qquad \sum_{i=1}^{n} x_i \big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big) = 0.$$
These equations have a very important interpretation. Let $r_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$, $i = 1, \dots, n$. The first normal equation is equivalent to $\sum_{i=1}^{n} r_i = 0$, and the second is $\sum_{i=1}^{n} r_i x_i = 0$. That is, there are two constraints on the n residuals. The OLS residuals must sum to zero, and the OLS residuals are orthogonal to the independent variable values. The n residuals then have $n - 2$ degrees of freedom.
Step 3 is to solve this system of two linear equations in two unknowns, which yields
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y}_n)(x_i - \bar{x}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}, \qquad \hat{\beta}_0 = \bar{y}_n - \hat{\beta}_1 \bar{x}_n.$$
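As a concrete illustration, here is a minimal Python sketch (using NumPy; the simulated data and variable names are illustrative assumptions, not part of the notes) that computes the closed-form OLS estimates and verifies the two normal-equation constraints on the residuals.

```python
import numpy as np

# Hypothetical data: n pairs (x_i, y_i); any paired numeric arrays would do.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)
y = 10.0 + 0.8 * x + rng.normal(0.0, 5.0, size=50)

# Closed-form OLS solution from the normal equations.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# The two normal-equation constraints: residuals sum to zero and are
# orthogonal to the independent variable (both up to rounding error).
residuals = y - beta0_hat - beta1_hat * x
print(beta0_hat, beta1_hat)
print(residuals.sum())          # approximately 0
print((residuals * x).sum())    # approximately 0
```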
There are a number of modifications of the formula for $\hat{\beta}_1$ that are helpful.

1. One shows the relation of $\hat{\beta}_1$ and the Pearson product moment correlation. The Pearson product moment correlation is a dimensionless measure of association. The formula is
$$r(x, y) = \frac{\sum_{i=1}^{n} (y_i - \bar{y}_n)(x_i - \bar{x}_n)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y}_n)^2}}.$$
The Cauchy-Schwarz inequality shows that $|r(x, y)| \le 1$. A correlation of +1 or -1 shows a perfect linear association. A correlation of 0 means no linear association. The numerators of $\hat{\beta}_1$ and $r(x, y)$ are the same; a rescaling that follows from this is noted after this list.
2. The next variation will be used in calculating the distributional properties of $\hat{\beta}_1$ and uses the identity that
$$\sum_{i=1}^{n} (y_i - \bar{y}_n)(x_i - \bar{x}_n) = \sum_{i=1}^{n} y_i (x_i - \bar{x}_n),$$
so that $\hat{\beta}_1$ can be written as a linear combination of the observed $y_i$.
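Because the numerators match, a useful rescaling (a standard consequence of the two formulas above, used again in the worked example later in these notes) is
$$\hat{\beta}_1 = r(x, y)\, \frac{s_y}{s_x},$$
where $s_x$ and $s_y$ are the sample standard deviations of the $x_i$ and the $y_i$.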
Fisher's Decomposition of the Total Sum of Squares
---- a fundamental tool for the analysis of the linear model:
$$TSS = \sum_{i=1}^{n} (y_i - \bar{y}_n)^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y}_n)^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SS(\text{Regression}) + SS(\text{Error})$$
Conventionally displayed in an Analysis of Variance (ANOVA) Table as below.

Analysis of Variance Table
One Predictor Linear Regression

Source       DF       Sum of Squares           Mean Square                      F
Regression   1        r(x, y)^2 TSS            r(x, y)^2 TSS                    (n - 2) r(x, y)^2 / (1 - r(x, y)^2)
Error        n - 2    (1 - r(x, y)^2) TSS      (1 - r(x, y)^2) TSS / (n - 2)
Total        n - 1    TSS = (n - 1) SD_y^2
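As an illustration, here is a short Python sketch that fills in this table from the sample size n, the correlation r, and TSS (the function name and the numeric inputs are hypothetical, chosen only for illustration).

```python
def anova_table(n, r, tss):
    """One-predictor regression ANOVA entries computed from n, r(x, y), and TSS."""
    ss_reg = r**2 * tss                 # regression sum of squares
    ss_err = (1 - r**2) * tss           # error (residual) sum of squares
    ms_reg = ss_reg                     # mean square for regression (1 df)
    ms_err = ss_err / (n - 2)           # mean square error (n - 2 df)
    f_stat = ms_reg / ms_err            # equals (n - 2) r**2 / (1 - r**2)
    return {"Regression": (1, ss_reg, ms_reg, f_stat),
            "Error": (n - 2, ss_err, ms_err),
            "Total": (n - 1, tss)}

# Hypothetical inputs for illustration only.
print(anova_table(n=50, r=0.4, tss=1000.0))
```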
The estimate of the regression slope can potentially be greatly affected by
high leverage points. These are points that have very high or very low values of
the independent variable—outliers in the x direction. They carry great weight in
the estimate of the slope. A high leverage point that also happens to correspond to
a y outlier is a high influence point. It will alter the slope and twist the line
badly.

To this point, we have considered only the estimates of intercept and slope. We also have to estimate the true error variance $\sigma_\varepsilon^2$. We can think of this quantity as "variance around the line" or as the mean squared prediction error. The estimate of $\sigma_\varepsilon^2$ is based on the residuals $y_i - \hat{y}_i$, which are the prediction errors in the sample. The estimate of $\sigma_\varepsilon^2$ based on the sample data is the sum of squared residuals divided by $n - 2$, the degrees of freedom. The estimated variance is often shown in computer output as MS(Error) or MS(Residual).
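In symbols, and consistent with the Error row of the ANOVA table above,
$$s_\varepsilon^2 = MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} = \frac{\big(1 - r(x, y)^2\big)\, TSS}{n - 2}.$$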

The estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $s_\varepsilon$ are basic in regression analysis. They specify the regression line and the probable degree of error associated with y-values for a given value of x. The next step is to use these sample estimates to make inferences about the true parameters.
Inferences About Regression Parameters

There must be a probabilistic model for the data so that researchers can make inferences and find confidence intervals. The model for one-predictor linear regression is
$$Y_i = \beta_0 + \beta_1 x_i + \sigma_{Y|x} Z_i.$$
The outcome or dependent (random) variables $Y_i$, $i = 1, \dots, n$ are each assumed to be the sum of the linear regression expected value $\beta_0 + \beta_1 x_i$ and a random error term $\sigma_{Y|x} Z_i$. The random variables $Z_i$, $i = 1, \dots, n$ are assumed to be independent standard normal random variables. The parameter $\beta_0$ is the intercept parameter and is fixed but unknown. The parameter $\beta_1$ is the slope parameter, also fixed but unknown, and is the focus of the statistical analysis. The parameter $\sigma_{Y|x}$ is also fixed but unknown. Another description of this model is that $Y_i$, $i = 1, \dots, n$ are independent normally distributed random variables with $Y_i$ having the distribution $N(\beta_0 + \beta_1 x_i, \sigma_{Y|x}^2)$. That is, $E(Y_i \mid X = x_i) = \beta_0 + \beta_1 x_i$ and $\mathrm{Var}(Y_i \mid X = x_i) = \sigma_{Y|x}^2$. The assumption that $\mathrm{Var}(Y_i \mid X = x_i) = \sigma_{Y|x}^2$ is called the homoscedasticity assumption.
There are four assumptions; a short simulation illustrating them is sketched after this list.

• The outcome variables $Y_i$, $i = 1, \dots, n$ are independent.
• $E(Y_i \mid X = x_i) = \beta_0 + \beta_1 x_i$ for $i = 1, \dots, n$.
• Homoscedasticity: $\mathrm{Var}(Y_i \mid X = x_i) = \sigma_{Y|x}^2$ for all $i$.
• $Y_i$, $i = 1, \dots, n$ are normally distributed random variables.
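A minimal Python sketch that simulates data satisfying these four assumptions (all numeric parameter values below are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values for the illustration.
beta0, beta1, sigma = 2.0, 0.5, 3.0
x = np.linspace(0, 10, 100)          # fixed independent-variable settings

# Independent standard normal Z_i give independent, normal Y_i with
# mean beta0 + beta1 * x_i and constant variance sigma**2.
z = rng.standard_normal(x.size)
y = beta0 + beta1 * x + sigma * z

# OLS recovers the parameters up to sampling error.
beta1_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)
```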
Variance Calculations
The most complex variance formula in this course so far is
$$\mathrm{var}(aX + bY) = a^2\,\mathrm{var}(X) + b^2\,\mathrm{var}(Y) + 2ab\,\mathrm{cov}(X, Y).$$
More complex calculations are required for the variance-covariance matrix of the OLS estimates. The easiest way is to use the variance-covariance matrix of a random vector. Let Y be an $n \times 1$ vector of random variables $(Y_1, Y_2, \dots, Y_n)^T$. That is, each component of the vector is a random variable. Then the expected value of the vector Y is the $n \times 1$ vector whose components are the respective means of the random variables; that is, $E(Y) = (EY_1, EY_2, \dots, EY_n)^T$. The variance-covariance matrix of the random vector Y is the $n \times n$ matrix whose diagonal entries are the respective variances of the random variables and whose off-diagonal elements are the covariances of the random variables. That is,
$$\mathrm{vcv}(Y) = \begin{pmatrix} \mathrm{var}(Y_1) & \mathrm{cov}(Y_1, Y_2) & \cdots & \mathrm{cov}(Y_1, Y_n) \\ \mathrm{cov}(Y_2, Y_1) & \mathrm{var}(Y_2) & \cdots & \mathrm{cov}(Y_2, Y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(Y_n, Y_1) & \mathrm{cov}(Y_n, Y_2) & \cdots & \mathrm{var}(Y_n) \end{pmatrix}$$

In terms of expectation operator calculations,
$$\mathrm{vcv}(Y) = E\big[(Y - EY)(Y - EY)^T\big] = \Sigma.$$
Variance of a Set of Linear Combinations
Examples

The first use of this result is to find the variance of a linear combination of values from Y, an $n \times 1$ vector of random variables. Let $a$ be an $n \times 1$ vector of constants, and let $W = a^T Y$. Then $\mathrm{var}(a^T Y) = a^T\, \mathrm{vcv}(Y)\, a$. This is the completely general form of
$$\mathrm{var}(aX + bY) = a^2\,\mathrm{var}(X) + b^2\,\mathrm{var}(Y) + 2ab\,\mathrm{cov}(X, Y).$$
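A quick numerical check of $\mathrm{var}(a^T Y) = a^T\,\mathrm{vcv}(Y)\,a$ in Python (the covariance matrix and weight vector below are hypothetical, and the Monte Carlo comparison is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 3x1 random vector with a chosen variance-covariance matrix.
sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
mean = np.zeros(3)
a = np.array([1.0, -2.0, 0.5])       # hypothetical weights

# Theoretical variance of the linear combination W = a^T Y.
var_theory = a @ sigma @ a

# Monte Carlo estimate from simulated draws of Y.
y = rng.multivariate_normal(mean, sigma, size=200_000)
var_mc = (y @ a).var(ddof=1)

print(var_theory, var_mc)   # the two values should be close
```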

The second example is fundamental to this chapter. The OLS estimates of the parameters are always the same linear functions of the observed data:
$$\hat{\beta}_1 = \sum_{i=1}^{n} \frac{(x_i - \bar{x}_n)}{\sum_{j=1}^{n} (x_j - \bar{x}_n)^2}\, Y_i, \qquad \hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{x}_n,$$
so their variances and covariance follow from the rule above.
Testing a Null Hypothesis about $\beta_1$
The last detail before deriving tests and confidence intervals for the slope of the regression function is to find $E(\hat{\beta}_1)$.
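Under the four model assumptions, applying the linear-combination variance rule gives the standard results
$$E(\hat{\beta}_1) = \beta_1, \qquad \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma_{Y|x}^2}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2},$$
so the test of $H_0: \beta_1 = 0$ uses the statistic
$$t = \frac{\hat{\beta}_1}{\sqrt{MSE \big/ \sum_{i=1}^{n} (x_i - \bar{x}_n)^2}},$$
which has a $t$-distribution with $n - 2$ degrees of freedom when $H_0$ is true and the assumptions hold.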
Confidence Interval for $\beta_1$
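The corresponding standard $100(1 - \alpha)\%$ confidence interval for the slope (a summary consistent with the variance result above) is
$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, \sqrt{\frac{MSE}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}}.$$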
Distribution for the Estimated Intercept
Predicting New y-value Using Regression

Confidence Interval for $E(Y(x))$
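The standard form of this interval, with $\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ denoting the fitted mean response at the setting $x$, is
$$\hat{y}(x) \pm t_{\alpha/2,\, n-2}\, \sqrt{MSE \left( \frac{1}{n} + \frac{(x - \bar{x}_n)^2}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} \right)}.$$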


Prediction Interval for a Future Observation $Y_F(x)$

Usually, the more relevant forecasting problem is that of predicting an individual value $Y(x)$ rather than $E(Y(x))$. In most computer packages, the interval for predicting an individual value is called a prediction interval.
Two issues:

1. $Y(x)$ is a random variable; in other words, it will be observed in the future. The resolution to this is to set the time of the prediction to be after the collection of the regression data but before the future observation is made. Then the $100(1 - \alpha)\%$ prediction interval is as sketched after this list, where $\hat{y}(x)$ is the fitted value based on the regression data using the independent variable setting $x$.

2. $\sigma_{Y|x}^2$ is not known. As usual, we estimate it using MSE and widen the interval by using the $t$-distribution with $n - 2$ degrees of freedom in place of $z_{\alpha/2}$.
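The standard form of the prediction interval, consistent with the two points above (the extra 1 under the square root accounts for the variability of the future observation itself), is
$$\hat{y}(x) \pm t_{\alpha/2,\, n-2}\, \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(x - \bar{x}_n)^2}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} \right)}.$$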
Example question:
A research team collected data on 𝑛 = 450 students in a statistics course. The
observed average final examination score was 524, with an observed standard
deviation of 127.6 (the divisor in the estimated variance was 𝑛 − 1). The average
first examination score was 397, with an observed standard deviation of 96.4. The
correlation coefficient between the first examination score and the final
examination score was 0.63.
a. Report the analysis of variance table and result of the test of the null
hypothesis that the slope of the regression line of final exam score on first
exam score is zero against the alternative that it is not. Use the 0.10, 0.05, and
0.01 levels of significance.
Analysis of Variance Table
Source DF Sum of Squares Mean Square F

Regression 1 2901541.5 2901541.5 294.8

Error 448 4408968.7 9841.4

Total 449 7310510.2


b. Determine the least-squares fitted equation and give the 99% confidence
interval for the slope of the regression of final examination score on first
examination score.
c. Use the least-squares prediction equation to estimate the final
examination score of students who scored 550 on the first examination.
Give the 99% confidence interval for the expected final examination
score of these students.
d. Use the least-squares prediction equation to predict the final examination
score of a student who scored 550 on the first examination. Give the 99%
prediction interval for the final examination score of this student.
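A Python sketch of the computations for parts a-d from the summary statistics given above (the critical values are computed with SciPy; treat this as an illustrative outline of the calculations rather than a worked answer key):

```python
import numpy as np
from scipy import stats

# Summary statistics from the problem statement.
n, r = 450, 0.63
ybar, sy = 524.0, 127.6        # final exam mean and SD
xbar, sx = 397.0, 96.4         # first exam mean and SD

# Part a: ANOVA table entries from n, r, and TSS = (n - 1) * sy**2.
tss = (n - 1) * sy**2
ss_reg = r**2 * tss
ss_err = (1 - r**2) * tss
mse = ss_err / (n - 2)
f_stat = ss_reg / mse           # compare with F(1, n - 2) critical values

# Part b: least-squares fit and 99% confidence interval for the slope.
slope = r * sy / sx
intercept = ybar - slope * xbar
sxx = (n - 1) * sx**2
se_slope = np.sqrt(mse / sxx)
t_crit = stats.t.ppf(0.995, n - 2)
ci_slope = (slope - t_crit * se_slope, slope + t_crit * se_slope)

# Parts c and d: estimation and prediction at x = 550.
x_new = 550.0
y_hat = intercept + slope * x_new
se_mean = np.sqrt(mse * (1 / n + (x_new - xbar)**2 / sxx))        # part c
se_pred = np.sqrt(mse * (1 + 1 / n + (x_new - xbar)**2 / sxx))    # part d
ci_mean = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pi_future = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

print(ss_reg, ss_err, mse, f_stat)
print(slope, intercept, ci_slope)
print(y_hat, ci_mean, pi_future)
```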
Fisher’s Transformation of the Correlation Coefficient

The textbook uses Fisher's transformation of the correlation coefficient to get a confidence interval for a correlation coefficient. It is more useful in calculating Type II error rates and in sample size calculations.

The transformation is applied to the Pearson product moment correlation coefficient $R_{xy}$ calculated using $n$ observations $(X_i, Y_i)$ from a bivariate normal random variable with population correlation coefficient $\rho = \mathrm{corr}(X, Y)$. Fisher's result is that
$$F(R_{xy}) = \frac{1}{2} \ln\!\left(\frac{1 + R_{xy}}{1 - R_{xy}}\right)$$
is approximately distributed as
$$N\!\left(\frac{1}{2} \ln\!\left(\frac{1 + \rho}{1 - \rho}\right),\; \frac{1}{n - 3}\right).$$
Confidence Interval for a Correlation Coefficient
Example question (modification of the previous example):
A research team collected data on 𝑛 = 450 students in a statistics course. The
correlation coefficient between the first examination score and the final
examination score was 0.63. Find the 99% confidence interval for the population
correlation of the first examination score and the final examination score.
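A minimal Python sketch of this calculation using Fisher's transformation (the interval is built on the transformed scale and then back-transformed with the hyperbolic tangent; the 99% normal critical value is computed with SciPy):

```python
import numpy as np
from scipy import stats

n, r = 450, 0.63
z_crit = stats.norm.ppf(0.995)            # two-sided 99% critical value

# Fisher's transformation F(r) = (1/2) ln((1 + r)/(1 - r)) = arctanh(r),
# approximately N(arctanh(rho), 1/(n - 3)).
fz = np.arctanh(r)
half_width = z_crit / np.sqrt(n - 3)

# Back-transform the endpoints to the correlation scale.
ci_rho = (np.tanh(fz - half_width), np.tanh(fz + half_width))
print(ci_rho)
```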
Example Sample Size Calculation:
A research team wishes to test the null hypothesis $H_0: \rho = 0$ at $\alpha = 0.005$ against the alternative $H_1: \rho > 0$ using Fisher's transformation of the Pearson product moment correlation coefficient $R_{xy}$ as the test statistic. They have asked their consulting statistician for a sample size $n$ such that $\beta = 0.01$ when $\rho = 0.316$ (that is, $\rho^2 = 0.10$).
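A sketch of one standard way to obtain the sample size from the Fisher-transformation approximation (it solves $F(\rho_1) = (z_\alpha + z_\beta)/\sqrt{n - 3}$ for $n$; rounding up at the end is a conventional choice):

```python
import numpy as np
from scipy import stats

alpha, beta, rho1 = 0.005, 0.01, 0.316

z_alpha = stats.norm.ppf(1 - alpha)   # one-sided test critical value
z_beta = stats.norm.ppf(1 - beta)     # for the required power 1 - beta

# Under the Fisher approximation, F(R) ~ N(F(rho), 1/(n - 3)), so power
# 1 - beta at rho1 requires F(rho1) = (z_alpha + z_beta) / sqrt(n - 3).
f_rho1 = np.arctanh(rho1)
n = 3 + ((z_alpha + z_beta) / f_rho1) ** 2
print(int(np.ceil(n)))                # round up to the next whole participant
```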
Supplemental Information on Variance-covariance Calculations
