Lecture 2

Multiple linear regression utilizes two or more independent variables to explain the variation in a dependent variable, potentially improving predictions and understanding of financial factors. Analysts must carefully specify the model, select relevant variables, and ensure underlying assumptions are met to avoid spurious relationships. Tools like AIC and BIC help assess model quality and prevent overfitting, guiding analysts in selecting the most parsimonious model for their data.


Basics of Multiple Regression
• Multiple linear regression uses two or more independent variables to describe
the variation of the dependent variable rather than just one independent
variable, as in simple linear regression.
• It allows the analyst to estimate more complex models with multiple
explanatory variables and, if used correctly, may lead to better predictions,
better portfolio construction, or a better understanding of the drivers of
security returns.
• If used incorrectly, however, multiple linear regression may yield spurious
relationships, lead to poor predictions, and offer a poor understanding of
relationships.
• The analyst must first specify the model, making several decisions in the
process and answering the following questions, among others:
• What is the dependent variable of interest?
• What independent variables are important?
• What form should the model take?
• What is the goal of the model—prediction or understanding of the
relationship?
• The analyst specifies the dependent and independent variables and then
employs software to estimate the model and produce related statistics.
• The good news is that the software, such as that shown in Exhibit 1, does the
estimation; our primary tasks are to specify the model and to interpret the
software's output, which are the main subjects of this content.
USES OF MULTIPLE LINEAR REGRESSION

• There are many investment problems in which the analyst needs to consider the
impact of multiple factors on the subject of research rather than a single factor.
• In the complex world of investments, it is intuitive that explaining or
forecasting a financial variable by a single factor may be insufficient.
• The complexity of financial and economic relations calls for models with
multiple explanatory variables, subject to fundamental justification and various
statistical tests.
Examples of how multiple regression may be used include the following:

• A portfolio manager wants to understand how returns are influenced by a
set of underlying factors: the size effect, the value effect, profitability, and
investment aggressiveness. The goal is to estimate a Fama–French five-factor
model that will provide an understanding of the factors that are important for
driving a particular stock's excess returns.
• A financial adviser wants to identify whether certain variables, such as
financial leverage, profitability, revenue growth, and changes in market share,
can predict whether a company will face financial distress.
• An analyst wants to examine the effect of different dimensions of country
risk, such as political stability, economic conditions, and environmental, social,
and governance (ESG) considerations, on equity returns in that country.
• Multiple regression can be used to identify relationships between variables, to test
existing theories, or to forecast. There are many decisions that the analyst must make in
this process.
• For example, if the dependent variable is continuous, such as returns, the traditional
regression model is typically the first step. If, however, the dependent variable is
discrete, such as an indicator variable for whether a company is a takeover target,
then, as we shall see, the model may be estimated as a logistic regression (see the
sketch after this list).
• In either case, the process of determining the best model follows a similar path. The model
must first be specified, including independent variables that may be continuous, such as
company financial features, or discrete (i.e., dummy variables), indicating membership in a
class, such as an industry sector.
• Next, the regression model is estimated and analyzed to ensure it satisfies key underlying
assumptions and meets the analyst’s goodness-of-fit criteria.
• Once the model is tested and its out-of-sample performance is deemed acceptable, then it
can be used for further identifying relationships between variables, for testing existing
theories, or for forecasting.
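
To make this choice concrete, here is a minimal Python sketch using synthetic data and the statsmodels package (an assumption; the lecture does not name any particular software): ordinary least squares when the dependent variable is continuous, and logistic regression when it is binary.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = sm.add_constant(rng.normal(size=(200, 3)))  # intercept plus 3 independent variables

# Continuous dependent variable (e.g., returns): estimate by ordinary least squares.
y_cont = X @ np.array([0.1, 0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=200)
ols_fit = sm.OLS(y_cont, X).fit()

# Discrete (binary) dependent variable (e.g., takeover target or not): logistic regression.
y_bin = (X @ np.array([0.0, 1.0, -1.0, 0.5]) + rng.logistic(size=200) > 0).astype(int)
logit_fit = sm.Logit(y_bin, X).fit(disp=False)

print(ols_fit.params)    # estimated intercept and slope coefficients
print(logit_fit.params)  # estimated log-odds coefficients
```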
THE BASICS OF MULTIPLE REGRESSION

• The goal of simple regression is to explain the variation of the dependent
variable, Y, using the variation of an independent variable, X.
• The goal of multiple regression is the same, to explain the variation of the
dependent variable, Y, but using the variations in a set of independent
variables, X1, X2, ..., Xk.
• Recall the variation of Y is

$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2,$$

which we also refer to as the sum of squares total.
• When we introduce additional independent variables to help explain the
variation of the dependent variable, we have the multiple regression equation:

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + \varepsilon_i$$

In this equation, the terms involving the k independent variables are the
deterministic part of the model, whereas the error term, εi, is the stochastic or
random part of the model. The model is estimated over n observations, where n
must be larger than k.
• It is important to note that a slope coefficient in a multiple regression, known
as a partial regression coefficient or a partial slope coefficient, must be
interpreted with care.
• A partial regression coefficient, bj, describes the impact of that independent
variable on the dependent variable, holding all the other independent variables
constant.
• For example, in the multiple regression equation

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + \varepsilon_i,$$

the coefficient b2 measures the change in Y for a one-unit change in X2,
assuming X1 and X3 are held constant (see the sketch below).
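
To illustrate, here is a minimal Python sketch (synthetic data; statsmodels assumed, as the lecture names no package) confirming that the fitted coefficient on X2 equals the change in the predicted Y when X2 increases by one unit while X1 and X3 are held fixed:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                                  # X1, X2, X3
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=500)

fit = sm.OLS(y, sm.add_constant(X)).fit()
b2 = fit.params[2]                                             # partial slope on X2

x_base = np.array([[1.0, 0.2, 0.0, -0.4]])                     # const, X1, X2, X3
x_bump = x_base.copy()
x_bump[0, 2] += 1.0                                            # raise X2 by one unit only

# The change in the prediction equals b2 because X1 and X3 were held constant.
print(fit.predict(x_bump) - fit.predict(x_base), b2)
```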
• Consider an estimated regression equation in which the monthly excess returns
of a bond index (RET) are regressed against the change in monthly government
bond yields (BY) and the change in investment-grade credit spreads (CS).
The estimated regression, using 60 monthly observations, is

$$RET_t = 0.0023 - 5.0585\,BY_t - 2.1901\,CS_t.$$
We learn the following from this regression:

1. The bond index RET yields, on average, 0.0023% per month, or
approximately 0.028% per year, if the changes in the government bond yields
and investment-grade credit spreads are zero.

2. The change in the bond index return for a given one-unit change in the
monthly government bond yield, BY, is –5.0585%, holding CS constant.
3. If the investment-grade credit spreads, CS, increase by one unit, the bond
index returns change by –2.1901%, holding BY constant.
4. For a month in which the change in the credit spreads is 0.001 and the change
in the government bond yields is 0.005, the expected excess return on the bond
index is 0.0023 + (−5.0585)(0.005) + (−2.1901)(0.001) ≈ −0.0252%.
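
This arithmetic can be checked with a few lines of Python:

```python
# Plugging the reported coefficients into the estimated equation.
b0, b_by, b_cs = 0.0023, -5.0585, -2.1901
expected_ret = b0 + b_by * 0.005 + b_cs * 0.001   # changes in BY and CS for the month
print(round(expected_ret, 4))                      # -0.0252, i.e., about -0.0252%
```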
ASSUMPTIONS UNDERLYING MULTIPLE LINEAR REGRESSION

• Linearity: The relationship between the dependent variable and the
independent variables is linear.
• Homoskedasticity: The variance of the regression residuals is the same for all
observations.
• Independence of errors: The observations are independent of one another.
This implies the regression residuals are uncorrelated across observations.
• Normality: The regression residuals are normally distributed.
• Independence of independent variables:
• a. Independent variables are not random.
• b. There is no exact linear relation between two or more of the independent
variables or combinations of the independent variables.
• The independence assumption is needed to enable the estimation of the
coefficients.

• If there is an exact linear relationship between independent variables, the
model cannot be estimated.
• In the more common case of approximate linear relationships, which may be
indicated by significant pairwise correlations between the independent
variables, the model can be estimated, but its interpretation is problematic.
• Regression software produces diagnostic plots, which are a useful tool for
detecting potential violations of the assumptions underlying multiple linear
regression.
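
As a sketch of two such diagnostic plots in Python (statsmodels and matplotlib assumed; the data are synthetic stand-ins): residuals versus fitted values checks linearity and homoskedasticity, and a normal Q-Q plot checks normality of the residuals.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Synthetic stand-in for monthly excess returns regressed on three factors.
rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(120, 3)))          # 120 months, 3 factors
y = X @ np.array([0.2, 1.0, 0.4, 0.3]) + rng.normal(scale=0.5, size=120)
fit = sm.OLS(y, X).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted values: a patternless band suggests linearity and
# homoskedasticity; a funnel shape would suggest heteroskedasticity.
ax1.scatter(fit.fittedvalues, fit.resid, s=10)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normal Q-Q plot: points near the 45-degree line suggest roughly normal residuals.
sm.qqplot(fit.resid, line="45", fit=True, ax=ax2)
plt.tight_layout()
plt.show()
```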
• To illustrate the use of such plots, we first estimate a regression to analyze 10
years of monthly total excess returns of ABC stock using the Fama–French
three-factor model. As noted previously, this model uses market excess return
(MKTRF), size (SMB) and value (HML) as explanatory variables.

• We start our analysis by generating a scatterplot matrix using software. This
matrix is also referred to as a pairs plot.
• Looking at the scatterplots between the independent variables, SMB and HML
have little or no correlation, as indicated by the relatively flat line for the
SMB–HML pair. This is a desirable characteristic between explanatory
variables.
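
A minimal sketch of generating such a pairs plot in Python (pandas and matplotlib assumed; the factor series here are synthetic stand-ins, as real MKTRF, SMB, and HML data would come from a factor-data provider):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Synthetic stand-in for 120 months of factor returns.
rng = np.random.default_rng(1)
factors = pd.DataFrame(rng.normal(scale=0.04, size=(120, 3)),
                       columns=["MKTRF", "SMB", "HML"])

scatter_matrix(factors, figsize=(8, 8), diagonal="hist")
plt.show()

# Near-zero pairwise correlation between SMB and HML is the desirable case.
print(factors.corr())
```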
Goodness of Fit
• In the simple regression model, the coefficient of determination, also known
as R-squared or R2, is a measure of the goodness of fit of an estimated
regression to the data.
• R2 can also be defined in multiple regression as the ratio of the variation of the
dependent variable explained by the independent variables (sum of squares
regression) to the total variation of the dependent variable (sum of squares
total).
• In multiple linear regression, however, R2 is less appropriate as a measure
of a model's goodness of fit.
• This is because as independent variables are added to the model, R2
will increase or stay the same, but it will not decrease. Problems with
using R2 in multiple regression include the following:
• The R2 cannot provide information on whether the coefficients are
statistically significant.
• The R2 cannot provide information on whether there are biases in the
estimated coefficients and predictions.
• The R2 cannot tell whether the model fit is good. A good model may have
a low R2, as in many asset-pricing models, and a bad model may have a
high R2 due to overfitting and biases in the model.
• Overfitting of a regression model is a situation in which the model is too complex,
meaning there may be too many independent variables relative to the number of
observations in the sample.
• A result of overfitting is that the coefficients on the independent variables may not
represent true relationships with the dependent variable.
• An alternative measure of goodness of fit is the adjusted R2 (often written R̄2), which is
typically part of the multiple regression output produced by most statistical
software packages.
• A benefit of using the adjusted R2 is that it does not automatically increase when
another independent variable is added to a regression. This is because it adjusts for
the degrees of freedom, as follows, where k is the number of independent variables:

$$\bar{R}^2 = 1 - \left(\frac{n-1}{n-k-1}\right)(1 - R^2)$$
• The following are two key observations about adj. R2 when adding a new
variable to a regression:
• If the coefficient's t-statistic is greater than 1.0 in absolute value, adj. R2 increases.
• If the coefficient's t-statistic is less than 1.0 in absolute value, adj. R2 decreases.
• Note that a t-statistic with an absolute value of 1.0 does not indicate the
coefficient is significantly different from zero at typical levels of significance, 5%
and 1%.
• So, adjusted R2 does not set a very high bar for the statistic to increase.
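
A minimal sketch of the adjusted R2 calculation from the formula above (Python; the numbers are purely illustrative):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Example: R-squared of 0.60 with 60 observations and 5 independent variables.
print(adjusted_r2(0.60, n=60, k=5))  # about 0.563, slightly below the raw 0.60
```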
• Consider the regression output provided in Exhibit 1, which shows the results
from the regression of portfolio returns on the returns for five hypothetical
fundamental factors, which we shall call Factors 1 through 5.
• The goal of this regression is to identify the factors that best explain the returns
on the portfolio.
AIC and BIC
• As both the R2 and adjusted R2 may increase when we add independent variables, we
risk model overfitting. Fortunately, there are several statistics to help compare model
quality and identify the most parsimonious model, including two statistics more
commonly known by their acronyms, AIC and BIC.
• We can use Akaike’s information criterion (AIC) to evaluate a collection of models
that explain the same dependent variable. It is often provided in the output of
regression software, but AIC can also be calculated using information in the regression
output, where SSE is the sum of squares error:

$$\text{AIC} = n \ln\left(\frac{\text{SSE}}{n}\right) + 2(k+1)$$

AIC is a measure of model parsimony, so a lower AIC indicates a better-fitting
model. The term 2(k + 1) is the penalty assessed for adding independent variables to
the model.
In a similar manner, Schwarz’s Bayesian information criterion (BIC or SBC)
allows comparison of models with the same dependent variable, as follows:

$$\text{BIC} = n \ln\left(\frac{\text{SSE}}{n}\right) + \ln(n)(k+1)$$
• Compared to AIC, BIC assesses a greater penalty for having more
parameters in a model, so it will tend to prefer smaller, more parsimonious
models. This is because ln(n) is greater than 2, even for very small sample
sizes.

• Practically speaking, AIC is preferred if the model is used for prediction
purposes, but BIC is preferred when the best goodness of fit is desired.

• Importantly, the value of either measure considered alone is meaningless;
the relative values of AIC or BIC among a set of candidate models are what
really matter.
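
A minimal sketch of computing and comparing AIC and BIC from the formulas above (Python; the SSE values are hypothetical):

```python
import math

def aic(n: int, k: int, sse: float) -> float:
    """Akaike's information criterion: n * ln(SSE/n) + 2(k + 1)."""
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(n: int, k: int, sse: float) -> float:
    """Schwarz's Bayesian information criterion: n * ln(SSE/n) + ln(n)(k + 1)."""
    return n * math.log(sse / n) + math.log(n) * (k + 1)

# Hypothetical comparison: a sixth variable barely lowers SSE, so both criteria
# still prefer the five-variable model (lower value = better).
print(aic(60, 5, 1.50), aic(60, 6, 1.49))
print(bic(60, 5, 1.50), bic(60, 6, 1.49))
```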
