R18&19
REGRESSION ANALYSIS
The intercept term (α) is the point where the regression line crosses the Y-axis, i.e., the predicted value of the dependent variable when X = 0.
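In its simplest form, the linear regression model is written as:

$$ Y_i = \alpha + \beta X_i + \varepsilon_i $$

where $\alpha$ is the intercept term, $\beta$ is the slope coefficient, and $\varepsilon_i$ is the error term for observation i.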
INTERPRETING REGRESSION RESULTS
DUMMY VARIABLES
Observations for most independent variables (e.g., firm size, level of GDP,
and interest rates) can take on a wide range of values.
However, there are occasions when the independent variable is binary in
nature—it is either on or off.
Independent variables that fall into this category are called dummy variables
and are often used to quantify the impact of qualitative variables.
Dummy variables are assigned a value of 0 or 1.
For example, in a time series regression of monthly stock returns, you could
employ a January dummy variable that would take on the value of 1 if a stock
return occurred in January, and 0 if it occurred in any other month.
The purpose of including the January dummy variable would be to see if stock
returns in January were significantly different from stock returns in all other
months of the year.
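As a minimal sketch of this setup (the return data are simulated and all names are hypothetical), the January dummy regression can be estimated as follows:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: 10 years of monthly stock returns (values simulated).
rng = np.random.default_rng(42)
months = np.tile(np.arange(1, 13), 10)          # calendar month of each observation
returns = rng.normal(0.01, 0.04, size=months.size)

# January dummy: 1 if the return occurred in January, 0 in any other month.
jan_dummy = (months == 1).astype(float)

# OLS of returns on the dummy; the dummy coefficient is the difference between
# mean January returns and mean returns in the other eleven months.
X = sm.add_constant(jan_dummy)
fit = sm.OLS(returns, X).fit()
print(fit.params, fit.pvalues)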
Coefficient of Determination of a Regression (R²)
The R² of a regression model captures the fit of the model; it represents the
proportion of variation in the dependent variable that is explained by the
independent variable(s).
For a regression model with a single independent variable, R² is the square of
the correlation between the independent and dependent variables.
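For the single-regressor case, this relationship can be written as:

$$ R^2 = r_{XY}^2 = \frac{\text{explained variation}}{\text{total variation}} = 1 - \frac{RSS}{TSS} $$

where $r_{XY}$ is the sample correlation between X and Y, RSS is the residual sum of squares, and TSS is the total sum of squares.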
ASSUMPTIONS UNDERLYING LINEAR REGRESSION
Because OLS estimators are derived from random samples, they are themselves
random variables: they vary from one sample to the next.
OLS estimators will therefore have their own probability distributions (i.e.,
sampling distributions).
These sampling distributions allow us to estimate population parameters, such as
the population mean, the population regression intercept term, and the
population regression slope coefficient.
Drawing multiple samples from a population will produce multiple sample
means.
The distribution of these sample means is referred to as the sampling distribution
of the sample mean.
The mean of this sampling distribution is used as an estimator of the population
mean and is said to be an unbiased estimator of the population mean.
An unbiased estimator is one for which the expected value of the estimator is
equal to the parameter you are trying to estimate.
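A minimal simulation sketch (all parameter values assumed for illustration) makes this concrete: repeatedly drawing samples and re-estimating the OLS slope traces out its sampling distribution, whose mean sits at the true parameter.

import numpy as np

# True population parameters (assumed for this illustration).
alpha_true, beta_true, sigma_eps = 2.0, 0.5, 1.0
rng = np.random.default_rng(0)

slopes = []
for _ in range(5_000):                       # draw many independent samples
    x = rng.normal(0.0, 2.0, size=100)
    y = alpha_true + beta_true * x + rng.normal(0.0, sigma_eps, size=100)
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # OLS slope estimate
    slopes.append(b)

# Unbiasedness: the mean of the sampling distribution sits at the true slope.
print(np.mean(slopes))   # approximately 0.5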
PROPERTIES OF OLS ESTIMATORS
Given the central limit theorem (CLT), for large sample sizes it is reasonable to
assume that the sampling distribution approaches a normal distribution.
OLS estimators are also consistent. A consistent estimator is one for which the
accuracy of the parameter estimate increases as the sample size increases.
Like the sampling distribution of the sample mean, OLS estimators for the
population intercept term and slope coefficient also have sampling
distributions.
The OLS estimators of α and β are unbiased and consistent estimators of their
respective population parameters.
Being able to assume that the estimators of α and β are normally distributed is a
key property in allowing us to make statistical inferences about the population
coefficients.
The variance of the slope estimator (β̂) increases with the variance of the error
and decreases with the variance of the explanatory variable, as formalized below.
The variance of the slope indicates the reliability of the sample estimate of the
coefficient: the higher the variance of the error, the lower the reliability of the
coefficient estimate.
Higher variance of the explanatory (X) variable(s) indicates that there is sufficient
diversity in the observations (i.e., the sample is representative of the population)
and, hence, lower variability of (and higher confidence in) the slope estimate.
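These statements follow from the standard expression for the variance of the OLS slope estimator in a single-regressor model:

$$ \operatorname{Var}(\hat{\beta}) = \frac{\sigma_{\varepsilon}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} $$

A larger error variance inflates the numerator, while greater dispersion in X inflates the denominator and shrinks the variance of the slope estimate.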
QUIZ
2. What is the most appropriate interpretation of a slope coefficient estimate equal to 10.0?
A. The predicted value of the dependent variable when the independent variable is zero is
10.0.
B. The predicted value of the independent variable when the dependent variable is zero is 0.1.
C. For every one unit change in the independent variable, the model predicts that the
dependent variable will change by 10 units.
D. For every one unit change in the independent variable, the model predicts that the
dependent variable will change by 0.1 units.
3. The reliability of the estimate of the slope coefficient in a regression model is most likely:
A. positively affected by the variance of the residuals and negatively affected by the
variance of the independent variables.
B. negatively affected by the variance of the residuals and negatively affected by the
variance of the independent variables.
C. positively affected by the variance of the residuals and positively affected by the variance
of the independent variables.
D. negatively affected by the variance of the residuals and positively affected by
the variance of the independent variables.
HYPOTHESIS TESTING
The steps in the hypothesis testing procedure for regression coefficients are
as follows:
Specify the hypothesis to be tested.
Calculate the test statistic.
Reject or fail to reject the null hypothesis after comparing the test statistic to its
critical value.
The estimated slope coefficient, β̂, will be normally distributed with a
standard deviation known as the standard error of the regression
coefficient (S_b) → hypothesis testing can be conducted with the sample
value of the coefficient and its standard error.
Suppose we want to test the hypothesis that the value of the slope
coefficient is equal to β₀, as shown below.
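The test statistic is the standard t-ratio, with n − 2 degrees of freedom in the single-regressor case:

$$ t = \frac{\hat{\beta} - \beta_0}{S_b} $$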
The p-value is the smallest level of significance for which the null hypothesis
can be rejected.
An alternative method of doing hypothesis testing of regression coefficients
is to compare the p-value to the significance level:
If the p-value is less than the significance level, the null hypothesis can be
rejected.
If the p-value is greater than the significance level, the null hypothesis cannot be
rejected.
In general, regression outputs will provide the p-value for the standard
hypothesis (H₀: β = 0 versus Hₐ: β ≠ 0).
Consider again the example where β̂ = 0.76, S_b = 0.33, and the level of
significance is 5%. The regression output provides a p-value of 0.026.
Because the p-value < level of significance, we reject the null hypothesis
that β = 0, which is the same result as the one we got when performing the
t-test.
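For reference, the t-test in this example works out as:

$$ t = \frac{0.76 - 0}{0.33} \approx 2.30 $$

which exceeds the two-tailed 5% critical value, consistent with the reported p-value of 0.026.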
QUIZ
Bob Shepperd is trying to forecast the 10-year T-bond yield. Shepperd tries a variety of explanatory
variables in several iterations of a single-variable model. Partial results are provided below (note that
these represent three separate one-variable regressions):
To isolate the effect of a single independent variable, say X1, in a multiple
regression of Y on X1 and X2, the slope coefficient β1 can be recovered by a
partialling-out procedure:
First, we estimate the residuals using OLS estimation techniques for a single
regression of Y on X2.
We then do the same, but this time estimate the residuals in the model
regressing X1 on X2.
The residuals from the first step are then regressed against the residuals from
the second step; the slope of this regression is the estimate of the coefficient β1,
as sketched below.
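A minimal Python sketch of this partialling-out (Frisch-Waugh) procedure, using simulated data with assumed coefficients, shows that the residual regression recovers the same β1 as the full multiple regression:

import numpy as np
import statsmodels.api as sm

# Simulated data for Y = b0 + b1*X1 + b2*X2 + e (all coefficients assumed).
rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)           # X1 correlated with X2
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# Step 1: residuals of Y regressed on X2.
res_y = sm.OLS(y, sm.add_constant(x2)).fit().resid
# Step 2: residuals of X1 regressed on X2.
res_x1 = sm.OLS(x1, sm.add_constant(x2)).fit().resid
# Step 3: slope from regressing the first residuals on the second equals b1.
b1 = sm.OLS(res_y, res_x1).fit().params[0]
print(b1)   # approximately 2.0, matching the X1 coefficient in the full regression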
INTERPRETING MULTIPLE REGRESSION RESULTS
Standard error of the regression (SER) measures the uncertainty about the
accuracy of the predicted values of the dependent variable.
Graphically, the relationship is stronger when the actual (x, y) data points lie
closer to the regression line.
OLS estimation minimizes the sum of the squared differences between the
predicted and actual values for each observation, so a smaller SER indicates a
tighter fit, as formalized below.
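Assuming a model with k slope coefficients and n observations, the SER is computed from the residual sum of squares:

$$ SER = \sqrt{\frac{\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}{n - k - 1}} $$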
Example: Consider the hypothesis that future 10-year real earnings growth
in the S&P 500 (EG10) can be explained by the trailing dividend payout
ratio of the stocks in the index (PR) and the yield curve slope (YCS). Test the
statistical significance of the independent variable PR in the real earnings
growth example at the 10% significance level. Assume that the number of
observations is 46 and the critical t-value for 10% level of significance is 1.68.
The results of the regression are produced in the following table.
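With n = 46 observations and k = 2 slope coefficients (PR and YCS), the degrees of freedom are n − k − 1 = 46 − 2 − 1 = 43, and the test compares the coefficient's t-ratio against the critical value:

$$ t = \frac{\hat{b}_{PR} - 0}{S_{\hat{b}_{PR}}}, \qquad \text{reject } H_0\!: b_{PR} = 0 \text{ if } |t| > 1.68 $$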
F-TEST
For models with multiple variables, the univariate t-test is not applicable
when testing complex hypotheses involving the impact of more than one
variable. Instead, we use the F-test.
The F-test is useful for evaluating a model against other competing partial
(restricted) models.
For example, a model with three independent variables (X1, X2, and X3) can be
compared against a model with only one independent variable (X1). We are
trying to see whether the two additional variables (X2 and X3) in the full model
contribute meaningfully to explaining the variation in Y (the relevant test
statistic is shown below).
A more general F-test is used to test the null hypothesis that none of the
variables included in the model contributes meaningfully to explaining the
variation in Y against the alternative that at least one of the variables does
contribute in a statistically significant way, as formalized below.
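In both cases the test statistic takes the standard form, where q is the number of coefficient restrictions imposed under the null and k is the number of slope coefficients in the unrestricted model:

$$ F = \frac{\left(RSS_{restricted} - RSS_{unrestricted}\right)/q}{RSS_{unrestricted}/(n - k - 1)} $$

The null hypothesis is rejected when F exceeds the critical value from the F distribution with q and n − k − 1 degrees of freedom.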