Notes 1017 Part1
Linear Regression
and Correlation
The modeling of the relationship between a response variable and a set of
explanatory variables is one of the most widely used of all statistical techniques.
We refer to this type of modeling as regression analysis. The basic idea of
regression analysis is to obtain a model for the functional relationship between a
response variable (often referred to as the dependent variable) and one or more
explanatory variables (often referred to as the independent variables).
The research context is that two variables have been observed for each of n
participants. The research team then has a spreadsheet with n pairs of
observations 𝑥𝑖 , 𝑦𝑖 , 𝑖 = 1, … , 𝑛. One of the variables (here y) is the outcome
variable or dependent variable. This is the variable hypothesized to be affected by
the other variable in scientific research. The other variable (here x) is the
independent variable. It may be hypothesized to predict the outcome variable or to
cause a change in the outcome variable.
An example of a research project seeking to document a causal association would
be a clinical trial in which 𝑥𝑖 was the dosage of a medicine randomly assigned to a
participant (say simvastatin) and 𝑦𝑖 was the participant’s response after a specified
period taking the medicine (say cholesterol reduction after 3 months).
OLS (ordinary least squares) is the most widely used method for estimating the
parameters of the linear model. An arbitrary candidate line $b_0 + b_1 x$ is used as
a fit for the dependent variable values. The method uses the residuals
$y_i - b_0 - b_1 x_i$. The fitted model is judged by how small the set of residuals
is. OLS uses each residual and focuses on the magnitude of the residuals by
examining the sum of squares function $SS(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$.
The OLS method is to find the arguments $(\hat{\beta}_0, \hat{\beta}_1)$ that make $SS(b_0, b_1)$ as
small as possible. This minimization is a standard calculus problem.
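As a quick numerical illustration (a sketch with made-up data, not from the notes), the sum of squares function can be minimized directly and the result compared with a library fit:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical paired observations (x_i, y_i), for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Sum of squares function SS(b0, b1) = sum (y_i - b0 - b1*x_i)^2
def ss(b):
    b0, b1 = b
    return np.sum((y - b0 - b1 * x) ** 2)

# Numerical minimization of SS over (b0, b1)
result = minimize(ss, x0=[0.0, 0.0])
print("numerical (b0, b1):", result.x)

# Closed-form OLS fit for comparison (np.polyfit returns [slope, intercept])
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print("closed-form (b0, b1):", b0_hat, b1_hat)
```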
Step 1 is to calculate the partial derivatives of $SS(b_0, b_1)$ with respect to each
argument.
Step 2 is to find the arguments $(\hat{\beta}_0, \hat{\beta}_1)$ that make the two partial
derivatives zero. The resulting equations are called the normal equations:
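In their standard form (obtained by setting each partial derivative to zero and rearranging), the normal equations are
$$\sum_{i=1}^{n} y_i = n\, b_0 + b_1 \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 .$$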
1. One of them shows the relation between $\hat{\beta}_1$ and the Pearson product moment
correlation. The Pearson product moment correlation is a dimensionless
measure of association. The formula is
$$r(x, y) = \frac{\sum_{i=1}^{n}(y_i - \bar{y}_n)(x_i - \bar{x}_n)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y}_n)^2}}$$
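The relation referred to in item 1, in its standard form, is that the OLS slope rescales the correlation by the ratio of the sample standard deviations:
$$\hat{\beta}_1 = r(x, y)\,\frac{SD_Y}{SD_X} = r(x, y)\,\sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y}_n)^2}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}} .$$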
Total row of the analysis of variance decomposition: degrees of freedom $n - 1$, $TSS = (n - 1)\,SD_Y^2$.
The estimate of the regression slope can potentially be greatly affected by
high leverage points. These are points that have very high or very low values of
the independent variable—outliers in the x direction. They carry great weight in
the estimate of the slope. A high leverage point that also happens to correspond to
a y outlier is a high influence point. It will alter the slope and twist the line
badly.
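A small simulation (hypothetical data, not from the notes) makes the effect of a high influence point concrete: adding a single point that is extreme in x and inconsistent in y can change the fitted slope substantially.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical baseline data following a true slope of about 2
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

# Fit without and with a high influence point (x far to the right, y far off the line)
slope_before, _ = np.polyfit(x, y, deg=1)
x_bad = np.append(x, 30.0)   # x outlier: high leverage
y_bad = np.append(y, 5.0)    # y value far below the trend: high influence
slope_after, _ = np.polyfit(x_bad, y_bad, deg=1)

print("slope without the point:", round(slope_before, 3))
print("slope with the point:   ", round(slope_after, 3))  # pulled toward the outlier
```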
To this point, we have considered only the estimates of intercept and slope. We
also have to estimate the true error variance $\sigma_{\varepsilon}^2$. We can think of this quantity as
"variance around the line" or as the mean squared prediction error. The estimate
of $\sigma_{\varepsilon}^2$ is based on the residuals $y_i - \hat{y}_i$, which are the prediction errors in the
sample. The estimate of $\sigma_{\varepsilon}^2$ based on the sample data is the sum of squared
residuals divided by $n - 2$, the degrees of freedom. The estimated variance is often
shown in computer output as MS(Error) or MS(Residual).
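A short sketch of that calculation, reusing the hypothetical data from the earlier example:

```python
import numpy as np

# Hypothetical paired data, as in the earlier sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b1_hat, b0_hat = np.polyfit(x, y, deg=1)
residuals = y - (b0_hat + b1_hat * x)   # prediction errors in the sample
sse = np.sum(residuals ** 2)            # sum of squared residuals
mse = sse / (len(y) - 2)                # MS(Residual): estimate of the error variance
s_eps = np.sqrt(mse)                    # estimated standard deviation around the line
print(mse, s_eps)
```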
The estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $s_{\varepsilon}$ are basic in regression analysis. They specify
the regression line and the probable degree of error associated with y-values for a
given value of x. The next step is to use these sample estimates to make
inferences about the true parameters.
Inferences About Regression Parameters
There must be a probabilistic model for the data so that researchers can make
inferences and find confidence intervals. The model for one-predictor linear
regression is $Y_i = \beta_0 + \beta_1 x_i + \sigma_{Y|x} Z_i$. The outcome or dependent (random)
variables $Y_i$, $i = 1, \ldots, n$ are each assumed to be the sum of the linear regression
expected value $\beta_0 + \beta_1 x_i$ and a random error term $\sigma_{Y|x} Z_i$. The random
variables $Z_i$, $i = 1, \ldots, n$ are assumed to be independent standard normal
random variables. The parameter $\beta_0$ is the intercept parameter and is fixed but
unknown. The parameter $\beta_1$ is the slope parameter, also fixed but unknown,
and is the focus of the statistical analysis. The parameter $\sigma_{Y|x}$ is also fixed but
unknown. Another description of this model is that $Y_i$, $i = 1, \ldots, n$ are
independent normally distributed random variables with $Y_i$ having the
distribution $N(\beta_0 + \beta_1 x_i,\, \sigma_{Y|x}^2)$. That is, $E(Y_i \mid X = x_i) = \beta_0 + \beta_1 x_i$ and
$Var(Y_i \mid X = x_i) = \sigma_{Y|x}^2$. The assumption that $Var(Y_i \mid X = x_i) = \sigma_{Y|x}^2$ is the same
for every $i$ is called the homoscedasticity assumption.
There are four assumptions: the mean of $Y$ is a linear function of $x$, the observations are independent, the errors are normally distributed, and the error variance is constant (homoscedasticity).
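As an illustration only (made-up parameter values, not from the notes), the model can be simulated directly, which also shows what the homoscedasticity assumption means in practice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up true parameter values for illustration
beta0, beta1, sigma = 4.0, 1.5, 2.0

x = np.linspace(0, 10, 200)
z = rng.standard_normal(x.size)        # independent standard normal errors Z_i
y = beta0 + beta1 * x + sigma * z      # Y_i = beta0 + beta1*x_i + sigma*Z_i

# Each Y_i is N(beta0 + beta1*x_i, sigma^2): the mean changes with x,
# but the spread around the line is the same for every x (homoscedasticity).
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print(b0_hat, b1_hat)   # close to the true values for large n
```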
The first use of this result is to find the variance of a linear combination of
values from $Y$, an $n \times 1$ vector of random variables. Let $a$ be an $n \times 1$ vector
of constants, and let $W = a^T Y$. Then $var(a^T Y) = a^T\, vcv(Y)\, a$, where $vcv(Y)$ is the
variance-covariance matrix of $Y$. This is the completely general form of
$$var(aX + bY) = a^2\, var(X) + b^2\, var(Y) + 2ab\, cov(X, Y).$$
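A quick numerical check (arbitrary numbers, for illustration) confirms that the matrix form reduces to the familiar two-variable rule:

```python
import numpy as np

# Arbitrary 2x2 variance-covariance matrix for (X, Y) and coefficients (a, b)
var_x, var_y, cov_xy = 4.0, 9.0, 2.5
vcv = np.array([[var_x, cov_xy],
                [cov_xy, var_y]])
a_vec = np.array([3.0, -2.0])          # W = 3X - 2Y

# General matrix form: var(a^T Y) = a^T vcv(Y) a
var_matrix_form = a_vec @ vcv @ a_vec

# Two-variable rule: a^2 var X + b^2 var Y + 2ab cov(X, Y)
a, b = a_vec
var_scalar_form = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy

print(var_matrix_form, var_scalar_form)   # identical
```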
The second example is fundamental to this chapter. The OLS estimates of the
parameters are always the same functions of the observed data:
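In their standard closed form, those functions of the observed data are
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}, \qquad \hat{\beta}_0 = \bar{y}_n - \hat{\beta}_1 \bar{x}_n .$$
In particular, $\hat{\beta}_1 = \sum_{i=1}^{n} c_i Y_i$ with $c_i = (x_i - \bar{x}_n)/\sum_{j=1}^{n}(x_j - \bar{x}_n)^2$, a linear combination of the $Y_i$, which is why the variance result above applies directly.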
Testing a Null Hypothesis about $\beta_1$
The last detail before deriving tests and confidence intervals for the slope of
the regression function is to find $E(\hat{\beta}_1)$.
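A short sketch of that calculation, using the linear-combination form $\hat{\beta}_1 = \sum_i c_i Y_i$ noted above:
$$E(\hat{\beta}_1) = \sum_{i=1}^{n} c_i\, E(Y_i) = \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i = \beta_1 ,$$
since $\sum_i c_i = 0$ and $\sum_i c_i x_i = 1$ when $c_i = (x_i - \bar{x}_n)/\sum_j (x_j - \bar{x}_n)^2$. So $\hat{\beta}_1$ is an unbiased estimator of the slope.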
Confidence Interval for $\beta_1$
Distribution for the Estimated Intercept
Predicting New y-value Using Regression
1. $Y(x)$ is a random variable; in other words, it will be observed in the future.
The resolution to this is to set the time of the prediction to be after the
collection of the regression data but before the future observation is made.
Then the $100(1 - \alpha)\%$ prediction interval is
$$\hat{y}(x) \pm z_{\alpha/2}\,\sigma_{Y|x}\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x}_n)^2}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}},$$
where $\hat{y}(x)$ is the fitted value based on the regression data using the
independent variable setting $x$.
2. $\sigma_{Y|x}^2$ is not known. As usual, we estimate it using MSE and stretch $z_{\alpha/2}$ by the
t-distribution with $n - 2$ degrees of freedom.
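Combining the two points, a standard form of the practical prediction interval (a sketch consistent with the model above, with MSE in place of the unknown variance) is
$$\hat{y}(x) \pm t_{\alpha/2,\, n-2}\,\sqrt{MSE}\,\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x}_n)^2}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}} .$$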
Example question:
A research team collected data on 𝑛 = 450 students in a statistics course. The
observed average final examination score was 524, with an observed standard
deviation of 127.6 (the divisor in the estimated variance was 𝑛 − 1). The average
first examination score was 397, with an observed standard deviation of 96.4. The
correlation coefficient between the first examination score and the final
examination score was 0.63.
a. Report the analysis of variance table and result of the test of the null
hypothesis that the slope of the regression line of final exam score on first
exam score is zero against the alternative that it is not. Use the 0.10, 0.05, and
0.01 levels of significance.
Analysis of Variance Table
Source DF Sum of Squares Mean Square F
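A sketch of the arithmetic needed to fill in the table from the summary statistics given above (n = 450, SD of the final exam = 127.6, r = 0.63), using the standard identities $TSS = (n-1)\,SD_Y^2$, $SS(\text{Regression}) = r^2\, TSS$, and $SS(\text{Residual}) = (1 - r^2)\, TSS$:

```python
# Worked arithmetic for the example, from the reported summary statistics
n = 450
sd_final = 127.6      # observed SD of the final exam scores (divisor n - 1)
r = 0.63              # correlation between first and final exam scores

tss = (n - 1) * sd_final ** 2      # total sum of squares, df = n - 1
ss_reg = r ** 2 * tss              # sum of squares for regression, df = 1
ss_res = (1 - r ** 2) * tss        # residual sum of squares, df = n - 2

ms_reg = ss_reg / 1
ms_res = ss_res / (n - 2)
f_stat = ms_reg / ms_res           # F statistic for H0: slope = 0

print(f"Regression    1  {ss_reg:14.1f}  {ms_reg:14.1f}  F = {f_stat:.1f}")
print(f"Residual    {n-2}  {ss_res:14.1f}  {ms_res:14.1f}")
print(f"Total       {n-1}  {tss:14.1f}")
```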