
R18: LINEAR REGRESSION

REGRESSION ANALYSIS

 Regression analysis seeks to measure how changes in one variable, called the
dependent (or explained) variable, can be explained by changes in one or more
other variables, called the independent (or explanatory) variables.
 This relationship is captured by estimating a linear equation.
 Error may be reduced by using more independent variables or by using different,
more appropriate independent variables.
LINEAR REGRESSION CONDITIONS

 To use linear regression, three conditions need to be satisfied:


 The relationship between Y and X should be linear
 The error term must be additive (i.e., the variance of the error term is
independent of the observed data)
 All X variables should be observable (i.e., the model is not appropriate when
there is missing data)
 Appropriate transformations of the independent variable(s) can make a
nonlinear relationship amenable to being fitted with a linear model (see the
sketch at the end of this section)
 Independent variable data are transformed first, then the transformed values are
used in the linear equation as X
 The dependent variable must be a linear function of the coefficients.
 For example, consider an unknown parameter, p, in the function Y = α + βX^p + ε. Here,
βX^p contains two unknown parameters (β and p), and p enters through the exponent
rather than multiplicatively, so the model is not linear in its coefficients and it would not
be appropriate to apply linear regression in such a case.
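 As an illustration of the transformation point above, a minimal sketch (the data,
variable names, and use of numpy are illustrative assumptions, not from the original notes):

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 0.5 * x**2 + rng.normal(0, 1, size=200)   # nonlinear in x, but linear in the coefficients

x_transformed = x**2                                 # transform the independent variable first
X = np.column_stack([np.ones_like(x_transformed), x_transformed])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares fit
print(alpha_hat, beta_hat)                           # estimates should be close to 2.0 and 0.5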
QUIZ

1. Generally, if the value of the independent variable is zero, then the


expected value of the dependent variable would be equal to the:
A. slope coefficient.
B. intercept coefficient.
C. error term.
D. residual.

2. The error term represents the portion of the:


A. dependent variable that is not explained by the independent variable(s)
but could possibly be explained by adding additional independent variables.
B. dependent variable that is explained by the independent variable(s).
C. independent variables that are explained by the dependent variable.
D. dependent variable that is explained by the error in the independent
variable(s).
ORDINARY LEAST SQUARES ESTIMATION

 Ordinary least squares (OLS) estimation is a process that estimates the
parameters α and β so as to minimize the sum of the squared residuals (i.e., error
terms)
 Rewriting our regression equation, Yi = α + βXi + εi, as εi = Yi − α − βXi
→ the OLS sample coefficients are those that minimize Σεi² = Σ(Yi − α − βXi)²
 The estimated slope coefficient (β) for the regression line describes the
change in Y for a one-unit change in X:
β = Cov(X, Y) / Var(X) = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
 The intercept term (α) is the line’s intersection with the Y-axis at X = 0:
α = Ȳ − βX̄
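 A minimal numerical sketch of these slope and intercept formulas (hypothetical data;
numpy assumed):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope = Cov(X, Y) / Var(X)
alpha_hat = y.mean() - beta_hat * x.mean()                  # intercept = Y-bar - slope * X-bar
print(alpha_hat, beta_hat)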
INTERPRETING REGRESSION RESULTS

 Example:
DUMMY VARIABLES

 Observations for most independent variables (e.g., firm size, level of GDP,
and interest rates) can take on a wide range of values.
 However, there are occasions when the independent variable is binary in
nature—it is either on or off.
 Independent variables that fall into this category are called dummy variables
and are often used to quantify the impact of qualitative variables.
 Dummy variables are assigned a value of 0 or 1.
 For example, in a time series regression of monthly stock returns, you could
employ a January dummy variable that would take on the value of 1 if a stock
return occurred in January, and 0 if it occurred in any other month.
 The purpose of including the January dummy variable would be to see if stock
returns in January were significantly different from stock returns in all other months
of the year.
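 A sketch of how such a January dummy could be constructed and used (the data, dates,
and use of pandas/numpy are illustrative assumptions):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-31", periods=60, freq="M")      # 60 monthly observations
returns = pd.Series(rng.normal(0.01, 0.04, size=60), index=dates)

jan_dummy = (returns.index.month == 1).astype(int)             # 1 in January, 0 in all other months
X = np.column_stack([np.ones(len(returns)), jan_dummy])
coeffs = np.linalg.lstsq(X, returns.values, rcond=None)[0]
print(coeffs[1])   # difference between average January return and average non-January return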
Coefficient of Determination of a Regression (R²)

 The R² of a regression model captures the fit of the model; it represents the
proportion of variation in the dependent variable that is explained by the
independent variable(s).
 For a regression model with a single independent variable, R² is the square of the
correlation between the independent and dependent variable.
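 A quick numerical check of this property for a single-variable regression (hypothetical
data; numpy assumed):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
y_hat = alpha + beta * x

r_squared = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)   # ESS / TSS
corr = np.corrcoef(x, y)[0, 1]
print(r_squared, corr**2)   # the two values match for a single-variable regression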
ASSUMPTIONS UNDERLYING LINEAR REGRESSION

 The expected value of the error term, conditional on the independent


variable, is zero [E(εi|Xi) = 0]
 This means X has no information about the location of ε
 This assumption is not directly testable
 Evaluation of whether this assumption is reasonable requires an examination of
the data generating process. Generally, a violation would be evidenced by the
following:
 Survivorship bias: occurs when the observations are collected after-the-fact (companies
that get dropped from an index are not included in the sample)
 Sample selection bias: occurs when occurrence of an event (i.e., an observation) is
contingent on specific outcomes.
 Simultaneity bias: happens when the values of X and Y are simultaneously determined
 Omitted variables: important explanatory (i.e., X) variables are excluded from the
model, and the errors will capture the influence of the omitted variables. Omitting
important variables causes the coefficients to be biased and may indicate nonexistent
(i.e., misleading) relationships
 Attenuation bias: occurs when X variables are measured with error and leads to
underestimation of the regression coefficients
ASSUMPTIONS UNDERLYING LINEAR REGRESSION

 All (X, Y) observations are independent and identically distributed (i.i.d.)


 Variance of X is positive (otherwise estimation of β would not be possible)
 Variance of the errors is constant (i.e., homoskedasticity)
 It is unlikely that large outliers will be observed in the data. OLS estimates
are sensitive to outliers, and large outliers have the potential to create
misleading regression results
→ Collectively, these assumptions ensure that the regression estimators are
unbiased.
They also ensure that the estimators are normally distributed and, as a
result, allow for hypothesis testing
PROPERTIES OF OLS ESTIMATORS

 Because OLS estimators are derived from random samples, these estimators
are also random variables because they vary from one sample to the next
 OLS estimators will have their own probability distributions (i.e., sampling
distributions)
 These sampling distributions allow us to estimate population parameters, such as
the population mean, the population regression intercept term, and the
population regression slope coefficient
 Drawing multiple samples from a population will produce multiple sample
means.
 The distribution of these sample means is referred to as the sampling distribution
of the sample mean.
 The mean of this sampling distribution is used as an estimator of the population
mean and is said to be an unbiased estimator of the population mean.
 An unbiased estimator is one for which the expected value of the estimator is
equal to the parameter you are trying to estimate.
PROPERTIES OF OLS ESTIMATORS

 Given the central limit theorem (CLT), for large sample sizes, it is reasonable to
assume that the sampling distribution will approach the normal distribution.
 This means that the estimator is also a consistent estimator. A consistent estimator is
one for which the accuracy of the parameter estimate increases as the sample size
increases
 Like the sampling distribution of the sample mean, OLS estimators for the
population intercept term and slope coefficient also have sampling
distributions.
 The OLS estimators α and β are unbiased and consistent estimators of the
respective population parameters.
 Being able to assume that α and β are normally distributed is a key property in
allowing us to make statistical inferences about population coefficients
 The variance of the slope (β) increases with variance of the error and decreases
with the variance of the explanatory variable
 The variance of the slope indicates the reliability of the sample estimate of the
coefficient, and the higher the variance of the error, the lower the reliability of the
coefficient estimate
 Higher variance of the explanatory (X) variable(s) indicates that there is sufficient
diversity in observations (i.e., the sample is representative of the population) and,
hence, lower variability (and higher confidence) of the slope estimate
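 For a single-variable regression, this relationship follows from the standard result
Var(β) = σε² / Σ(Xi − X̄)², where σε² is the variance of the error term: a larger error
variance raises the variance of the slope estimate, while greater variation in X lowers it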
QUIZ
2. What is the most appropriate interpretation of a slope coefficient estimate equal to 10.0?
A. The predicted value of the dependent variable when the independent variable is zero is
10.0.
B. The predicted value of the independent variable when the dependent variable is zero is 0.1.
C. For every one unit change in the independent variable, the model predicts that the
dependent variable will change by 10 units.
D. For every one unit change in the independent variable, the model predicts that the
dependent variable will change by 0.1 units.

3. The reliability of the estimate of the slope coefficient in a regression model is most likely:
A. positively affected by the variance of the residuals and negatively affected by the
variance of the independent variables.
B. negatively affected by the variance of the residuals and negatively affected by the
variance of the independent variables.
C. positively affected by the variance of the residuals and positively affected by the variance
of the independent variables.
D. negatively affected by the variance of the residuals and positively affected by
the variance of the independent variables.
HYPOTHESIS TESTING

 The steps in the hypothesis testing procedure for regression coefficients are
as follows:
 Specify the hypothesis to be tested.
 Calculate the test statistic.
 Reject or fail to reject the null hypothesis after comparing the test statistic to its
critical value
 The estimated slope coefficient, β, will be normally distributed with a
standard deviation known as the standard error of the regression
coefficient (Sb) → Hypothesis testing can be conducted with the sample value
of the coefficient and its standard error
 Suppose we want to test the hypothesis that the value of the slope
coefficient is equal to β0

 We would use the t-statistic: t = (β − β0) / Sb, with n − 2 degrees of freedom


→ If the absolute value of the test statistic > critical t-value → Reject the null hypothesis
CONFIDENCE INTERVALS

 The confidence interval of the slope coefficient = β ± (tc × Sb)


 tc is the critical value for a given level of significance and degrees of freedom
(n − 2)
 If we correctly rejected the null hypothesis of β = 0 in our hypothesis test,
zero does not fall in the confidence interval
 In other words, if the hypothesized value of the slope coefficient falls outside of
the confidence interval, we can reject the null.
 If it falls inside the confidence interval, we fail to reject the null hypothesis
CONFIDENCE INTERVALS

 Example: A regression model estimated using 46 observations has β = 0.76


and a coefficient standard error Sb = 0.33. Determine if the slope coefficient is
statistically different from zero at the 5% level of significance. The critical t-value
for a sample size of 46 and a 5% level of significance is 2.02.
 Also calculate the confidence interval of the slope coefficient.
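 Solution: t = 0.76 / 0.33 ≈ 2.30 > 2.02 → reject the null hypothesis that β = 0; the slope
is statistically different from zero at the 5% level of significance.
 Confidence interval: 0.76 ± (2.02 × 0.33) ≈ 0.09 to 1.43, which does not contain zero,
consistent with rejecting the null.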
THE P-VALUE

 The p-value is the smallest level of significance for which the null hypothesis
can be rejected.
 An alternative method of doing hypothesis testing of regression coefficients
is to compare the p-value to the significance level:
 If the p-value is less than the significance level, the null hypothesis can be
rejected.
 If the p-value is greater than the significance level, the null hypothesis cannot be
rejected.
 In general, regression outputs will provide the p-value for the standard
hypothesis (Ho: β = 0 versus Ha: β ≠ 0).
 Consider again the example where β = 0.76, Sb = 0.33, and the level of
significance is 5%. The regression output provides a p-value = 0.026.
Because the p-value < level of significance, we reject the null hypothesis
that β = 0, which is the same result as the one we got when performing the
t-test.
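 The same p-value can be recovered from the t-statistic; a minimal check (use of scipy
here is an assumption, not part of the original notes):

from scipy import stats

t_stat = 0.76 / 0.33            # estimated coefficient divided by its standard error
df = 46 - 2                     # n - 2 degrees of freedom for a single regression
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed p-value
print(round(p_value, 3))        # approximately 0.026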
QUIZ
Bob Shepperd is trying to forecast the 10-year T-bond yield. Shepperd tries a variety of explanatory variables in several
iterations of a single-variable model. Partial results are provided below (note that these represent three separate
one-variable regressions):

The critical t-value at 5% level of significance is equal to 2.02.


1. For the regression model involving inflation as the explanatory variable, the confidence interval for the slope
coefficient is closest to:
A. −0.27 to 2.43.
B. 0.26 to 2.43.
C. −2.27 to 2.43.
D. 0.22 to 1.88.
2. For the regression model involving unemployment rate as the explanatory variable, what are the results of a
hypothesis test that the slope coefficient is equal to 0.20 (vs. not equal to 0.20) at 5% level of significance?
A. The coefficient is not significantly different from 0.20 because the p-value is <0.001.
B. The coefficient is significantly different from 0.20 because the t-value is 2.33, which is greater than the critical t-value
of 2.02.
C. The coefficient is significantly different from 0.20 because the t-value is −5.67.
D. The coefficient is not significantly different from 0.20 because the t-value is −2.33.
R19: REGRESSION WITH MULTIPLE
EXPLANATORY VARIABLES
MULTIPLE REGRESSION

 General form of the multiple regression model: Y = α + β1X1 + β2X2 + … + βkXk + ε

 Assumptions of multiple regression:


 The expected value of the error term, conditional on the independent variables,
is zero: [E(εi|Xi’s) = 0].
 All (Xs and Y) observations are i.i.d.
 The variance of X is positive (otherwise estimation of β would not be possible).
 The variance of the errors is constant (i.e., homoskedasticity).
 There are no outliers observed in the data
 X variables are not perfectly correlated (i.e., they are not perfectly linearly
dependent). In other words, each X variable in the model should have some
variation that is not fully explained by the other X variables (unique to multiple
regression)
MULTIPLE REGRESSION

 The interpretation of the slope coefficient is that it captures the change in


the dependent variable for a one-unit change in the independent
variable, holding the other independent variables constant
 As a result, the slope coefficients in a multiple regression are sometimes called
partial slope coefficients
 The ordinary least squares (OLS) estimation process for multiple regression
differs from single regression.
 In a stepwise fashion, each explanatory variable is first regressed against the other
explanatory variables, and the residuals from these regressions then serve as
explanatory variables in a regression involving the dependent variable
 Consider a simple, two-independent-variable model: Y = α + β1X1 + β2X2 + ε

 First, we estimate the residuals in the following model using OLS estimation techniques for
a single regression: X1 = γ0 + γ1X2 + u → residuals û
 We then do the same, but this time estimate the residuals in the model
Y = δ0 + δ1X2 + v → residuals v̂
 The residuals v̂ are regressed against the residuals û from the first step to estimate the
slope coefficient β1: v̂ = β1û + error
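 A minimal numerical sketch of this partialling-out procedure (hypothetical data; numpy
assumed), showing that the residual-on-residual slope matches the coefficient from the
full multiple regression:

import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)                 # x1 is correlated with x2
y = 1.0 + 2.5 * x1 + 6.0 * x2 + rng.normal(size=n)

def ols_residuals(target, regressor):
    # Residuals from a single regression of target on a constant and one regressor
    X = np.column_stack([np.ones(len(regressor)), regressor])
    coef = np.linalg.lstsq(X, target, rcond=None)[0]
    return target - X @ coef

u = ols_residuals(x1, x2)                          # step 1: residuals of x1 regressed on x2
v = ols_residuals(y, x2)                           # step 2: residuals of y regressed on x2
beta1_partial = (u @ v) / (u @ u)                  # step 3: regress v on u
print(beta1_partial)                               # close to 2.5, the partial slope on x1

X_full = np.column_stack([np.ones(n), x1, x2])
print(np.linalg.lstsq(X_full, y, rcond=None)[0])   # full regression: intercept, beta1, beta2 agree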
INTERPRETING MULTIPLE REGRESSION RESULTS

 Suppose we run a regression of the dependent variable Y on a single


independent variable X1 and get the following result: Y = 2.0 + 4.5X1
 Interpretation of the estimated slope coefficient: if X1 increases by 1 unit, we
would expect Y to increase by 4.5 units

 Now suppose we add a second independent variable X2 to the regression


and get the following result: Y = 1.0 + 2.5X1 + 6.0X2
 The estimated slope coefficient for X1 changed from 4.5 to 2.5 when we added X2 to
the regression → we should expect this to happen most of the time when a second
variable is added (unless X2 is uncorrelated with X1), because if X1 increases by 1 unit,
then we would expect X2 to change as well
 Interpretation of the estimated slope coefficient for X1: if X1 increases by 1
unit, we would expect Y to increase by 2.5 units, holding X2 constant
INTERPRETING MULTIPLE REGRESSION RESULTS
QUIZ
MEASURES OF FIT IN LINEAR REGRESSION

 Standard error of the regression (SER) measures the uncertainty about the
accuracy of the predicted values of the dependent variable.
 Graphically, the relationship is stronger when the actual (X, Y) data points lie closer
to the regression line
 OLS estimation minimizes the sum of the squared differences
between the predicted value and the actual value for each observation:
RSS = Σ(Yi − Ŷi)², and SER = √[RSS / (n − k − 1)]
 The total variation in the dependent variable can be decomposed as:
Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ)² + Σ(Yi − Ŷi)²

 Or: TSS = ESS + RSS, where TSS is the total sum of squares, ESS is the explained sum
of squares, and RSS is the residual sum of squares
COEFFICIENT OF DETERMINATION

 Dividing both sides by TSS, we see that 1 = (ESS/TSS) + (RSS/TSS)


 The first term on the right side captures the proportion of variation in Y that is
explained. This proportion is the coefficient of determination (R²) of a multiple
regression and is a goodness-of-fit measure
→ R² = ESS/TSS = % of variation explained by the regression model
→ For a multiple regression: R² = ESS/TSS = 1 − (RSS/TSS)

 While it is a goodness-of-fit measure, R² by itself may not be a reliable


measure of the explanatory power of the multiple regression model because:
 R² almost always increases as independent variables are added to the model,
even if the marginal contribution of the new variables is not statistically significant
 A relatively high R² may reflect the impact of a large set of independent
variables rather than how well the set explains the dependent variable →
overstating the explanatory power of the regression
 R² is not comparable across models with different dependent variables
 There are no clear predefined values of R² that indicate whether the model is
good or not → For some noisy variables, models with low R² may provide
valuable insight
ADJUSTED R²

 To overcome the problem of overestimating the impact of additional


variables on the explanatory power of a regression model, many
researchers recommend adjusting R² for the number of independent
variables: adjusted R² = 1 − [(n − 1) / (n − k − 1)] × (1 − R²)

 Adjusted R² will be less than or equal to R² → Adding a new independent variable


to the model will increase R² and may increase or decrease adjusted R²
 Example: An analyst runs a regression of monthly value-stock returns on five
independent variables over 60 months. The total sum of squares for the
regression is 460, and the residual sum of squares is 170. Calculate the R²
and adjusted R².
 Example: Suppose the analyst now adds four more independent variables
to the previous regression, and the R² increases to 65.0%. Identify which
model the analyst would most likely prefer.
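 Solution (first example): R² = 1 − RSS/TSS = 1 − 170/460 ≈ 0.630, or 63.0%.
Adjusted R² = 1 − [(60 − 1) / (60 − 5 − 1)] × (1 − 0.630) ≈ 1 − (59/54)(0.370) ≈ 0.596, or 59.6%.
 Solution (second example): with nine independent variables, adjusted R² =
1 − [(60 − 1) / (60 − 9 − 1)] × (1 − 0.650) = 1 − (59/50)(0.350) ≈ 0.587, or 58.7%. Because
adjusted R² falls from roughly 59.6% to 58.7%, the analyst would most likely prefer the
original five-variable model.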
JOINT HYPOTHESIS TESTS AND CONFIDENCE INTERVALS

 As with single regression, the magnitude of the coefficients in a multiple


regression tells us nothing about the importance of the independent
variable in explaining the dependent variable.
 Thus, we must conduct hypothesis testing on the estimated slope coefficients to
determine if the independent variables make a significant contribution to
explaining the variation in the dependent variable
 The t-statistic used to test the significance of the individual coefficients in a
multiple regression is calculated using the same formula that is used with single
regression: t = (estimated bj − hypothesized bj) / Sbj

→ For multiple regression, the t-statistic has (n − k − 1) degrees of freedom


DETERMINING STATISTICAL SIGNIFICANCE

 The most common hypothesis test done on the regression coefficients is to


test statistical significance, which means testing the null hypothesis that the
coefficient is zero versus the alternative that it is not:
 Testing statistical significance ⇒ H0: bj = 0 versus Ha: bj ≠ 0

 Confidence interval for a regression coefficient in a multiple regression: bj ± (tc × Sbj)
 Example: Consider the hypothesis that future 10-year real earnings growth
in the S&P 500 (EG10) can be explained by the trailing dividend payout
ratio of the stocks in the index (PR) and the yield curve slope (YCS). Test the
statistical significance of the independent variable PR in the real earnings
growth example at the 10% significance level. Assume that the number of
observations is 46 and the critical t-value for 10% level of significance is 1.68.
The results of the regression are produced in the following table.
F-TEST

 For models with multiple variables, the univariate t-test is not applicable
when testing complex hypotheses involving the impact of more than one
variable. Instead, we use the F-test
 F-test is useful to evaluate a model against other competing partial models
 For example, a model with three independent variables (X1, X2, and X3) can be
compared against a model with only one independent variable (X1). We are
trying to see if the two additional variables (X2 and X3) in the full model
contribute meaningfully to explain the variation in Y:

 The F-statistic for testing multiple regression coefficients, which is always a one-tailed


test: F = [(RSS_restricted − RSS_full) / q] / [RSS_full / (n − k_full − 1)], where q is the number
of restrictions (the number of extra variables in the full model) and k_full is the number of
explanatory variables in the full model
F-TEST

 Example: A researcher is seeking to explain returns on a stock using the


market returns as an explanatory variable (CAPM formulation). The
researcher wants to determine whether two additional explanatory
variables contribute meaningfully to variation in the stock’s return. Using a
sample consisting of 64 observations, the researcher found that RSS in the
model with three explanatory variables is 6,650 while the RSS in the single-
variable model is 7,140. Evaluate the model with extra variables relative to
the standard CAPM formulation. 5% significance level
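 Solution: the single-variable model is the restricted model (RSS = 7,140) and the
three-variable model is the full model (RSS = 6,650), with q = 2 restrictions and
n − k_full − 1 = 64 − 3 − 1 = 60. F = [(7,140 − 6,650) / 2] / (6,650 / 60) = 245 / 110.8 ≈ 2.21.
The 5% critical value for F with (2, 60) degrees of freedom is approximately 3.15, so we fail
to reject the null; the two additional variables do not contribute meaningfully beyond the
standard CAPM formulation.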
F-TEST

 A more generic F-test is used to test the null hypothesis that none of the variables
included in the model contributes meaningfully to explaining the variation in Y,
versus the alternative that at least one of the variables does contribute in a
statistically significant way

 We calculate the F-statistic as follows: F = (ESS / k) / [RSS / (n − k − 1)]

→ Compare against the critical F → If F-stat > critical F → Reject the null hypothesis

 Example: An analyst runs a regression of monthly value-stock returns on five


independent variables over 46 months. The total sum of squares is 460, and
the residual sum of squares is 170. Test the null hypothesis at the 5%
significance level (95% confidence) that the coefficients on all five independent
variables are equal to zero.
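 Solution: ESS = TSS − RSS = 460 − 170 = 290, k = 5, and n − k − 1 = 46 − 5 − 1 = 40.
F = (290 / 5) / (170 / 40) = 58 / 4.25 ≈ 13.6. The 5% critical value for F with (5, 40) degrees
of freedom is approximately 2.45, so we reject the null hypothesis; at least one of the slope
coefficients is statistically different from zero.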
