1.10 Simple Linear Regression - Answers

1. An investor receives a research report that analyzes two securities through a simple linear regression. The study discloses the coefficient of
determination (R2) and the sum of squares error (SSE). To estimate the accuracy of the model using an absolute measure of the errors, the
investor will most likely need to know the:

 A. sample size.

B. regression coefficients.

C. correlation between the securities.

Explanation

The standard error of the estimate (se) is an absolute (ie, not relative) measure of the accuracy of the regression, as reflected by the average
distance between each predicted (Ŷ) and observed (Y) value of the dependent variable. Since se is the square root of the mean square error
(MSE), it is presented in the same units as Y. Smaller values of se mean that the regression line is a better fit for the data.

In this scenario, se is the correct statistic to use since the investor wants an absolute measure of the accuracy of the model. Other measures of
goodness of fit, such as the coefficient of determination (R2) and the F-statistic, provide a relative measure and are not appropriate for this case.
Since this scenario does not provide the MSE, it must be calculated as the ratio of:

the sum of squares error (SSE), or residual sum of squares, to


the degrees of freedom, derived from the sample size (n) and the number of variables.

Since there are two variables and SSE is known, the investor will need to know n to calculate se.
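
In formula terms, with n − 2 degrees of freedom for a simple linear regression:

se = √MSE = √[SSE / (n − 2)]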

(Choices B and C) The regression coefficients (ie, the intercept and the slope) and the correlation between the variables are not necessary to
calculate se.

Things to remember:
The standard error of the estimate (se) is an absolute (ie, not relative) measure of the accuracy of the regression, presented in the same data units
as the dependent variable. se is the square root of the mean square error (MSE), which is the ratio of the residual sum of squares to the degrees
of freedom.

Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error
of Estimate in a simple linear regression
LOS



2. An economist runs the following regression models between two macroeconomic datasets with variable transformations:

Regression M1 M2

Model lnY = b0 + b1 X + ε lnY = b0 + b1 lnX + ε

Intercept 5.723 5.338

Slope 0.233 1.184

Coefficient of determination (R2) 0.830 0.573

Standard error of the estimate (se) 0.415 0.658

F-statistic 137.001 37.624

Based only on this data, which of the following functional forms is the most appropriate fit for the sample data?

A. Lin-log

 B. Log-lin

C. Log-log

Explanation

Linearity is one of the four key assumptions of a simple linear regression. However, even if data sets do not have a linear relationship, a simple
linear regression can be used by transforming one or both variables by using, for example, the square, the reciprocal, or the log of the variable.
This results in distinct functional forms, allowing the regression to better fit a curvature. These transformed models have linear parameters (ie,
the intercept b0 and the slope b1), although the variables are not linear.

In this scenario:

M1 is a log-lin model, with a logarithmic dependent variable (lnY) and a linear independent variable (X), and
M2 is a log-log model, with both variables in their logarithmic forms.

The results show that M1 has the largest coefficient of determination and F-statistic, and the smallest standard error of the estimate. Therefore, the
log-lin model displays the best fit for the data (Choice C).

It is important to remember that regression statistics can be compared for regressions only when each regression's dependent variable has
the same form. For instance, linear and lin-log models can be compared since the dependent variable for each is linear, but a lin-log model
cannot be directly compared with a log-log model. In this question, M1 and M2 can be compared since both use lnY as the dependent variable.

(Choice A) Neither regression is in the lin-log functional form, which has a linear dependent variable (Y) and a logarithmic independent variable
(lnX).

Things to remember:
Simple linear regressions may have different functional forms, including logarithmic transformations of one or both variables (ie, lin-log, log-lin, log-
log). These transformed models have linear parameters, although the variables are not linear. The statistics of models with different forms of
dependent variable (eg, Y and lnY) are not comparable.

Describe different functional forms of simple linear regressions


LOS



3. An economist runs a linear regression and analyzes the regression's residual plot:

The most appropriate conclusion is that the pattern of the residual plot indicates the presence of:

A. outliers.

 B. nonlinearity.

C. heteroskedasticity.

Explanation

A residual is the difference between the observed value of a dependent variable Y and the value predicted by a regression. In a simple linear
regression (SLR), residuals are the vertical distance between the actual observations and the regression line, which is the straight line that
minimizes the sum of the squares of all residuals (ie, the sum of squares error).

The residual plot ideally should be scattered randomly; the presence of a pattern often indicates the violation of one or more assumptions
underlying the SLR model.

In this residual plot (graph on the right), there is a pattern: a concentration of positive residuals in the lower and upper ranges of the independent
variable X and of negative residuals in the middle range. This indicates a violation of linearity, which assumes that X and Y have a linear
relationship. This violation is called nonlinearity and is evident in the scatterplot of Y against X (graph on the left).

Things to remember:
The presence of a pattern in a residual plot often indicates a violation of the assumptions of a linear regression model. If the residual plot shows a
concentration of positive residuals in a range and of negative residuals in other range(s), then the relationship cannot be displayed as a straight
line and the linearity assumption is violated. This is called nonlinearity.

Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these
assumptions may have been violated
LOS



4. A linear regression analysis generates the following ANOVA (analysis of variance) table:

ANOVA Table

Source        Degrees of freedom   Sum of squares   Mean squares   F-statistic

Regression             1                9.5231          9.5231       163.4274

Residual              78                4.5452          0.0583

Total                 79               14.0683

Based on this data, the standard error of the estimate (se) is closest to:

 A. 0.241

B. 0.323

C. 0.677

Explanation

The standard error of the estimate (se) measures the accuracy of the regression as reflected by the average distance between each predicted
Ŷ and observed Y. It is the square root of the mean square error (MSE) and is presented in the same units as Y. In this scenario:
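
Using the MSE of 0.0583 from the Residual row of the ANOVA table, the calculation can be laid out as:

se = √MSE = √0.0583 ≈ 0.241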

Smaller se values mean that a regression has smaller errors, so the regression line is a better fit for the data. This important measure of
goodness of fit can be derived from the ANOVA table for a simple linear regression since the table includes the MSE.

(Choice B) 0.323 results from dividing the sum of squares error (SSE) by the sum of squares total (SST), which equals 1 − R2; it is the proportion of the
change in the dependent variable Y not explained by the model.

(Choice C) 0.677 is the coefficient of determination (R2) that measures the proportion of the change in Y explained by the independent variable
X. R2 = sum of squares regression (SSR) / SST.

Things to remember:
The standard error of the estimate (se), derived from the ANOVA table, is a key measure to evaluate how the regression fits the data. It quantifies
the accuracy of the regression (ie, the average error) and is measured in the same units as the dependent variable. It is the square root of the
mean square error (MSE); smaller se values mean that a regression has smaller errors, so the regression line is a better fit for the data.

Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error
of Estimate in a simple linear regression
LOS



5. An analyst runs a linear regression on crude oil prices (in USD per barrel) and a company's gross margin (in %), with the following results:

The analyst forecasts an average oil price of USD 66 in the following quarter. Considering t-values of ±2.228, the 95% prediction interval for the
company's gross margin is closest to:

 A. 40.6% to 52.1%.

B. 42.3% to 50.5%.

C. 44.2% to 48.6%.

Explanation

Simple linear regressions are commonly used to predict a dependent variable (Ŷf) based on a forecasted value of the independent variable (Xf).
The forecast's prediction interval is the expected range of values around Ŷf given a certain level of significance (α). For example, if α = 5%,
there is a 95% probability that the actual value of Y will be within the prediction interval.

The width of the range is a function of two components: the critical t-value (tc), which is a reliability factor, and the standard error of the forecast,
which is a measure of the regression's accuracy. In this scenario, the prediction interval is calculated as follows:
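
In general form, with Xf = 66 and the two-sided critical t-value of 2.228:

Prediction interval = Ŷf ± (2.228 × sf), where Ŷf is the gross margin predicted by the regression at an oil price of USD 66 and sf is the standard
error of the forecast.
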
6. An analyst runs a simple linear regression (SLR) and plots the resulting residuals:

The regression's residuals are presented on the vertical axis of each graph. In the left graph, residuals are plotted against the regression's
independent variable, and in the right graph, they are plotted in chronological order. Based only on the graphs, the model most likely violates
which of the following linear regression assumptions?

A. Linearity

 B. Independence

C. Homoskedasticity

Explanation

A residual is the difference between the observed value of the dependent variable and the value predicted by the regression. In a simple linear
regression (SLR), residuals are the vertical distance between the actual observations and the regression line, which is the straight line that
minimizes the sum of the squares of all residuals (ie, the sum of squares error).

Ideally, the residual plot is randomly scattered. Therefore, the presence of patterns often indicates the violation of one or more assumptions
underlying the SLR model. In this scenario, when the residuals are plotted against the independent variable (X), they do not display any visible
pattern.

When the residuals are plotted against time (ie, chronological order), there is a pattern: Neighboring (ie, succeeding and preceding) residuals
have similar signs and magnitudes. This indicates a violation of the independence assumption, which assumes that observations (and
residuals) are uncorrelated; it is called autocorrelation.

Things to remember:
Independence assumes that observations (and residuals) are uncorrelated. Autocorrelation, the violation of this assumption, is often recognized
by plotting the residuals against time (ie, chronological order) and identifying that neighboring residuals have similar signs and magnitudes.

Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these
assumptions may have been violated
LOS
7. The following table summarizes the relationship between two exchange rates, USD/GBP and USD/EUR for the last 12 months:

Statistic Data

Standard deviation of USD/GBP 1.389

Standard deviation of USD/EUR 1.208

Correlation of USD/GBP and USD/EUR 0.781

An analyst runs a linear regression to test if USD/GBP explains changes in USD/EUR. The slope of this regression model is closest to:

A. 0.405

 B. 0.679

C. 0.898

Explanation

This regression model examines how well USD/GBP explains USD/EUR. Therefore, USD/GBP is the explanatory or independent variable (X),
while USD/EUR is the explained or dependent variable (Y). Since there is only one independent variable, this is a simple linear regression, and
the slope coefficient is the ratio of the covariance (Y, X) to the variance of X.

In this scenario, the covariance (Y, X) can be obtained using the correlation of Y and X and the standard deviations of X and Y. After obtaining the
covariance, the slope coefficient is calculated as follows:
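
Laying out the calculation with the figures above:

Covariance (Y, X) = correlation (Y, X) × sY × sX = 0.781 × 1.208 × 1.389 ≈ 1.310
Slope = covariance (Y, X) / variance of X = 1.310 / 1.389² ≈ 1.310 / 1.929 ≈ 0.679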

(Choice A) 0.405 results from incorrectly using the correlation instead of covariance in the slope formula (ie, 0.781 / 1.389²).

(Choice C) 0.898 results from incorrectly swapping the variables (ie, using USD/GBP as the dependent variable and USD/EUR as the
independent variable [0.781 × 1.389 / 1.208]).

Things to remember:
In a simple linear regression, the independent variable (X) explains the dependent variable (Y). The slope coefficient is the ratio of the covariance
(Y, X) to the variance of X. It can also be calculated as the product of the correlation (Y, X) and the ratio of the standard deviations of Y and X.

Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS



8. An economist builds a linear regression model (Model 1) to investigate if the unemployment rate can be used to anticipate trends in a stock
index. Then, the economist builds a second model (Model 2) to test if the stock index is a good predictor of trends in GDP. The stock index is the
dependent variable:

A. in both models.

 B. only in Model 1.

C. only in Model 2.

Explanation

A linear regression evaluates the relationship between two or more variables. It uses sample data to test and quantify a linear relationship by
plotting a straight line that best fits the data. This relationship can be expressed as a linear equation. The regression model also predicts the
value of a variable based on the assumed value(s) for the other variable(s).

In this scenario:

Model 1 examines if changes in the stock index are explained by the unemployment rate. In this model, the stock index is the dependent
variable (ie, explained variable), which is usually named Y. The goal of the model is to predict trends in the stock index based on the
expected unemployment rate.
Model 2 examines if changes in the stock index explain changes in GDP. In this model, the stock index is the independent variable (ie,
explanatory variable), which is usually named X. The goal of the model is to predict trends in GDP based on the expected stock index
(Choices A and C).

A linear regression model has only one dependent variable, but it may have several independent variables. In this case, the model is known as a
simple linear regression since it has only one independent variable.

Things to remember:
A linear regression tests if independent variables explain changes in a dependent variable, quantifying their relationship in a linear equation.
Linear regressions have only one dependent variable but may have several independent variables. When there is only one independent variable,
the model is known as a simple linear regression.

Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS



9. A portfolio manager uses a regression model to evaluate how a company's debt-to-equity ratio influences its beta. The analysis of variance
(ANOVA) for the regression model is:

ANOVA Table

Source        Degrees of freedom   Sum of squares   Mean squares   F-statistic

Regression             1                0.1426          0.1426           ?

Residual               8                0.2986          0.0373

Total                  9                0.4413

The manager formulates the null hypothesis (H0) that the slope coefficient is zero. If the critical F-value at a 5% significance level is 5.32, the manager's
most appropriate decision is to:

A. reject H0 since the F-statistic is less than 5.32

B. reject H0 since the F-statistic is greater than 5.32.

 C. fail to reject H0 since the F-statistic is less than 5.32.

Explanation

A hypothesis test and the F-statistic are used to determine the statistical significance of a linear regression by testing whether results are
random. In a simple linear regression (SLR), the goal of the F-test is to assess whether the single independent variable (X) helps to explain
changes in the dependent variable (Y).

The F-statistic is the ratio of two variances, the mean square regression (MSR) and the mean square error (MSE). The greater (smaller) the F-
statistic, the more (less) the variation is explained by the regression. The F-statistic is compared to the critical F-value, based on the desired
level of significance, the number of independent variables, and the degrees of freedom. If the F-statistic is greater (less) than the critical F-value,
then reject (fail to reject) H0.

In a hypothesis test for an SLR, the null hypothesis (H0) states that the slope (b1) equals zero (ie, X is not useful to predict Y). The step-by-step
process is as follows:
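
Using the mean squares from the ANOVA table:

F-statistic = MSR / MSE = 0.1426 / 0.0373 ≈ 3.82
3.82 < 5.32 (critical F-value), so fail to reject H0.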

In this scenario, since H0 cannot be rejected, there is not sufficient evidence to indicate that the slope is different from zero. Therefore, the debt-
to-equity ratio is not statistically significant as a predictor of beta (Choices A and B).

Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression
LOS
10. An analyst runs a regression to determine whether the value of Stock Index A is a good predictor of the value of Stock Index B. Selected data
are presented below:

             Coefficient   Standard error

Intercept       31.930          91.950

Index A          3.587           0.198

The analyst expects Index A to reach a value of 509 next month and estimates the standard error of the forecast (sf) to be 42.893. If the critical t-
values at a 10% significance level are 1.701 (two-sided) and 1.313 (one-sided), the prediction interval for Index B is closest to:

 A. 1,785 to 1,931.

B. 1,801 to 1,914.

C. 1,815 to 1,901.

Explanation

Simple linear regressions are commonly used to predict a dependent variable (Ŷf) based on a forecasted value of the independent variable (Xf).
The prediction interval of a forecast is the expected range of values around Ŷf, given a level of significance (α). The width of the range
is a function of two components:

The critical t-value (tc), a reliability factor that is negatively related to α and to the number of observations (n).

The standard error of the forecast (sf), which measures the accuracy of the predictions (ie, expected distance between observed values
and Ŷf). The sf formula shows that sf is positively related to the standard error of the estimate (se) and to the difference between Xf and the
mean value of X. In addition, sf is negatively related to the number of observations (n).

In this scenario, Xf = 509 and α = 10%, so there is a 90% probability that Index B's actual value will be within the prediction interval around Ŷf.
Therefore:
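
Laying out the calculation with the coefficients above (using the two-sided t-value, since a prediction interval is two-sided):

Ŷf = 31.930 + (3.587 × 509) ≈ 1,857.7
Prediction interval = Ŷf ± (tc × sf) = 1,857.7 ± (1.701 × 42.893) ≈ 1,857.7 ± 73.0, or approximately 1,785 to 1,931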

Things to remember:
A forecast's prediction interval is the expected range of values around the predicted dependent variable. The smaller (greater) the level of
significance and the number of observations, and the worse (better) the accuracy of the regression, the wider (narrower) the interval.

Calculate and interpret the predicted value for the dependent variable, and a prediction interval for it, given an estimated linear regression model
and a value for the independent variable
LOS
11. An analyst evaluates the relationship between two securities and runs a linear regression based on 12 observations. Selected data from the
ANOVA (analysis of variance) table is presented below:

Source Sum of squares

Regression 20.601

Residual 8.326

Total 28.927

Based on this information, the F-statistic is closest to:

 A. 24.74

B. 27.22

C. 29.69

Explanation

The F-statistic is used to test the statistical significance of a linear regression by testing whether results are random. In the F-test for a simple
linear regression, if the F-statistic is greater than the critical F-value, there is evidence that the single independent variable (X) helps to explain
changes in the dependent variable (Y) (ie, results are not random).

The F-statistic is the ratio of the mean square regression (MSR) to the mean square error (MSE) and can be interpreted as the ratio of the
explained variance to the unexplained variance. The greater (smaller) the F-statistic, the more (less) of the variation is explained by the
regression. In the current scenario:
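
Using the sums of squares above with n = 12:

MSR = SSR / df(model) = 20.601 / 1 = 20.601
MSE = SSE / df(residual) = 8.326 / (12 − 2) = 0.8326
F-statistic = MSR / MSE = 20.601 / 0.8326 ≈ 24.74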

Note that the denominators used to calculate MSR and MSE are the degrees of freedom (df) in each case:

df(model), denominator of MSR, is the number of independent variables, which is 1 in a simple linear regression.
df(residual), denominator of MSE, is the difference between the sample size (n) and the number of parameters (2 in a simple linear
regression), so 10 = 12 − 2.
df(total) is the sum of df(model) and df(residual), or 11.

(Choice B) 27.22 is obtained by incorrectly using total df in the denominator of MSE [20.601 / (8.326 / 11)].

(Choice C) 29.69 is obtained by incorrectly using the total observations in the denominator of MSE [20.601 / (8.326 / 12)].

Things to remember:
The F-statistic is used to test the statistical significance of a linear regression (ie, test if results are random). The F-statistic is a ratio of two
variances, the mean square regression (MSR) and the mean square error (MSE). The greater (smaller) the F-statistic, the more (less) of the
variation is explained by the regression.

Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression
LOS



12. A fixed income analyst runs a linear regression to evaluate the relationship between the price of two bonds. Selected data is presented below:

ANOVA Table

Source        Degrees of freedom   Sum of squares   Mean squares

Regression             1               20.032          20.032

Residual              28              159.371           5.692

Total                 29              179.404

At a 5% significance level, the critical F-value is 4.196. Based only on the data, the most appropriate conclusion is that the:

A. coefficient of determination (R2) is 12.6%.

 B. standard error of the estimate (se) is 2.386.

C. model is statistically significant since the F-statistic is 3.52.

Explanation

There are different ways to evaluate how well a regression model explains the relationship between the dependent (Y) and the independent (X)
variables in a simple linear regression:

The coefficient of determination (R2) measures the proportion of the change in Y explained by X.
The standard error of the estimate (se) measures the accuracy of the regression.
The F-statistic is used to test whether the regression results are random or statistically significant.

These statistics are three key measures of goodness of fit; they are usually analyzed together since they have different specific interpretations and
limitations (ie, each statistic explains only a part of the regression). The ANOVA table supplies the required information to calculate all three
measures. The calculations are:
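
Laying out the calculations with the ANOVA figures above:

R2 = SSR / SST = 20.032 / 179.404 ≈ 0.112, or 11.2% (not 12.6%, Choice A)
se = √MSE = √5.692 ≈ 2.386
F-statistic = MSR / MSE = 20.032 / 5.692 ≈ 3.52, which is below the critical F-value of 4.196, so the model is not statistically significant (Choice C)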

Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error
of Estimate in a simple linear regression
LOS
13. An analyst runs a simple linear regression using a sample of 10 observations. To assess the validity of the regression statistics, the analyst
randomly draws 20 observations from the same data. The results of both regression analyses are:

               Sample size   Sum of squares total (SST)   Sum of squares error (SSE)

Regression 1       10                  167.00                       47.75

Regression 2       20                  334.00                       95.50

Based on this information, the analyst should most likely conclude that:

A. Regression 1 has a larger F-statistic.

B. both regressions have the same F-statistic.

 C. Regression 2 has a larger F-statistic.

Explanation

Sum of squares total

Sum of squares total (SST) = Sum of squares regression (SSR) + Sum of squares error (SSE)

A simple linear regression (SLR) can be used to assess the degree to which variations in a dependent variable are explained by variations in an
independent variable. Outputs of a regression analysis include the:

sum of squares total (SST), the variation in the observed values of the dependent variable versus its average value,
sum of squares regression (SSR), the proportion of variation of the dependent variable explained by changes in the independent variable,
and
sum of squares error (SSE), the proportion of variation of the dependent variable unexplained by changes in the independent variable.

F-statistics are used to evaluate the goodness of fit (ie, how well the regression model fits the data). F-statistics reflect the explained versus
the unexplained variation in the dependent variable, adjusted for sample size and degrees of freedom (ie, mean square regression [MSR] /
mean square error [MSE]). For a given data set, both the SSR and SSE increase as sample size increases:
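
Laying out the F-statistics with the figures above (SSR = SST − SSE):

Regression 1: SSR = 167.00 − 47.75 = 119.25; F = (119.25 / 1) / (47.75 / 8) ≈ 119.25 / 5.97 ≈ 20.0
Regression 2: SSR = 334.00 − 95.50 = 238.50; F = (238.50 / 1) / (95.50 / 18) ≈ 238.50 / 5.31 ≈ 45.0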

In this scenario, both the SST and SSE double when the sample size doubles, indicating that both samples exhibit the same relative variance.
Because MSR is divided by a fixed df of 1 while MSE is divided by a growing residual df (n − 2), a larger sample size exhibiting the same relative
variation results in a greater value for the F-statistic, other things equal (Choices A and B).

A larger sample size that exhibits the same relative variation of the dependent variable should give an analyst greater confidence regarding the
explanatory power of the independent variable.

Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression
LOS
14. A foreign exchange trader runs a linear regression with 24 observations in which BRL/USD is the independent variable and MXN/USD is the
dependent variable and compiles the following results:

The standard error of the estimate is closest to:

A. 0.418

 B. 0.436

C. 0.473

Explanation

The standard error of the estimate (se) measures the accuracy of a regression as the average distance between each predicted and observed
value of the dependent variable (Y). It is also known as the standard error of the regression and equals the square root of the mean square error (MSE). It is an absolute
measure presented in the same units as Y. Smaller values of se mean that the regression has smaller errors; therefore, the regression line is a
better fit for the data.

Since this scenario does not provide the MSE, it must be calculated as the ratio of the sum of squares error (SSE), or residual sum of squares, to
the degrees of freedom (df). Note that the df(residual) is the difference between the sample size (n = 24) and the number of variables (X and Y)
(ie, 24 − 2 = 22).
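
Taking SSE = 4.190, the figure used in the choice explanations below:

se = √(SSE / df) = √(4.190 / 22) ≈ √0.190 ≈ 0.436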

(Choice A) 0.418 is obtained by incorrectly using the number of observations (ie, sample size) in the denominator instead of the df, n − 2 (ie,
square root of 4.190 / 24).

(Choice C) 0.473 is obtained by incorrectly using the sum of squares total (SST), or total sum of squares, in the numerator instead of SSE (ie,
square root of 4.932 / 22).

Things to remember:
The standard error of the estimate (se), or square root of the mean square error, quantifies the accuracy of the regression (ie, the average error). It
is an absolute measure of how well the regression line fits the data, presented in the same units as the dependent variable. Smaller values of se
mean that the regression has a smaller error and better precision.

Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error
of Estimate in a simple linear regression
LOS



15. Given that the goal of a linear regression is to explain the variation of the dependent variable Y, the sum of squares regression (SSR) is best
described as the:

A. total variation of Y.

 B. explained variation of Y.

C. unexplained variation of Y.

Explanation

A linear regression evaluates the relationship between two or more variables, plotting a straight line that best fits the data. The more the model
explains the variation of the dependent variable Y, the greater its goodness of fit.

The variation of Y is often called the sum of squares total (SST). SST is the total squared difference between each observed Y and the average
value Ῡ, based on the results from the sample data (ie, independent of the regression model). Then, the regression model can help to explain
SST based on the changes of one or more independent variables. The model breaks the SST into two measures of total squared differences:

Sum of squares regression (SSR) measures differences between Ῡ and Ŷ (the value of Y predicted by the regression model). Since Ῡ can
be interpreted as the expected value of Y without a regression model, SSR quantifies how much of SST is explained by the model.

Sum of squares error (SSE) measures differences between Y and Ŷ. Since this is a measure of the prediction error, SSE quantifies how
much of SST is unexplained by the model (Choice C).

Therefore, SST equals the total variation of Y (ie, the sum of the explained variation and the unexplained variation) (Choice A).

Things to remember:
The variation of Y is known as the sum of squares total (SST), the total squared difference between each observed Y and the average value Ῡ,
based on the results from the sample data. The regression model can help to explain SST as two components: the sum of squares regression
(SSR), the variation explained by the regression model; and the sum of squares error (SSE), the variation unexplained by the model.

Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS



16. After running a simple linear regression, a trader analyzes the resulting residual plot:

Based on only the residual plot, this regression model most likely violated which of the following linear regression assumptions?

A. Linearity

B. Independence

 C. Homoskedasticity

Explanation
A residual is the difference between the observed value of the dependent variable and the value predicted by the regression. In a simple linear
regression (SLR), residuals are the vertical distance between the actual observations and the regression line, which is the straight line that
minimizes the sum of the squares of all residuals (ie, the sum of squares error).

Ideally, the residual plot is randomly scattered; the presence of a pattern often indicates the violation of one or more assumptions underlying
the simple linear regression model.

In this scenario, there is a pattern: the variance increases as the value of the independent variable increases. This indicates a violation of
homoskedasticity, which assumes that observations have a similar dispersion and residuals have constant variance. That violation is called
heteroskedasticity.

(Choice A) Linearity assumes that the variables have a linear relationship. Nonlinearity, the violation of this assumption, is indicated by a
concentration of positive residuals in a range (or ranges) of the plot and of negative residuals in other range(s). No concentration exists in this
scenario's residual plot.

(Choice B) Independence assumes that observations are uncorrelated. Autocorrelation, the violation of this assumption, is often indicated by
plotting the residuals against time (ie, observation order), and identifying that neighboring residuals have similar signs and magnitudes. There is
no evidence of autocorrelation in this residual plot.

Things to remember:
The presence of patterns in residual plots often indicates a violation of the assumptions of a linear regression. If the residual plot shows changes
in variance across observations, the homoskedasticity assumption is violated and the linear regression will be less accurate for part of the
population. This is called heteroskedasticity.

Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these
assumptions may have been violated
LOS



17. An analyst runs a linear regression between two variables but infers that they have a nonlinear relationship. Which of the following
approaches would be most appropriate to address this issue?

 A. Transform one or both variables

B. Use an indicator (dummy) variable

C. Increase the number of observations

Explanation

Linearity is one of the four key assumptions of a simple linear regression. However, even if data sets do not have a linear relationship, it is
possible to transform one or both variables by using the square, the reciprocal, or the log and then employing a linear regression with the
transformed variable(s).

A common solution to nonlinearity is to use a log transformation. This results in distinct functional forms, allowing the regression line to better fit
a curvature. In the graphs, it is evident that the linear functional form does not depict a linear relationship but the log-lin functional form
provides a much better fit, resulting in a stronger regression model. Although the graphs indicate this improvement, the statistics of both
regression models cannot be compared directly since they use different forms of the dependent variable.

(Choice B) An indicator variable (ie, dummy variable) is used in specific situations (eg, when there is seasonal data indicating autocorrelation) but
not as a solution to nonlinearity.

(Choice C) Increasing the sample size reduces a regression's uncertainty but does not resolve the nonlinear relationship between variables.

Things to remember:
The transformation (eg, the square, the reciprocal, or the log) of one or both variables in a simple linear regression is often used when there is a
nonlinear relationship. This results in distinct functional forms, allowing the regression line to better fit a curvature.

Describe different functional forms of simple linear regressions


LOS



18. A linear regression will most likely provide valid conclusions even if:

 A. the independent variable is not normally distributed.

B. there is a strong seasonal pattern in a time series data.

C. the dispersion of the dependent variable changes as the independent variable increases.

Explanation

Assumptions of a simple linear regression

Linearity           Relationship between X and Y is linear

Homoskedasticity    Residuals for all observations have constant variance

Independence        Observations are independent and residuals are uncorrelated

Normality           Residuals are normally distributed

X = Independent variable (explanatory)
Y = Dependent variable (explained)
Residuals = Observed Y − Predicted Y

Normality is one of the key assumptions of simple linear regression. It does not refer to a normal distribution of the variables (ie, the independent
variable, X, or the dependent variable, Y); however, the residuals must be normally distributed.

The residuals are the difference between the observed and predicted dependent variable (Y). The nonnormality of residuals compromises some
uses of the regression model (eg, prediction intervals), but it does not affect the estimated coefficients (ie, slope and intercept), and thus the
variables. As the number of observations increases, this assumption becomes less relevant, according to the central limit theorem.

(Choice B) A strong seasonal pattern in time series data is an indicator of autocorrelation (ie, correlation between sequential values of the same
variable), violating the independence assumption. This harms the independence of the residuals and biases the variance of the estimated
coefficients, invalidating any significance test of the coefficients.

(Choice C) If the dispersion of Y changes as X increases, there is heteroskedasticity, violating the homoskedasticity assumption. The variance of
the residuals also changes and the regression model will be a less precise predictor for part of the population (ie, not a good fit).

Things to remember:
Normality is the assumption that the residuals of a linear regression are normally distributed. It does not require a normal distribution of the
variables. A violation of normality compromises some uses of the regression (eg, prediction intervals), but not the estimation of coefficients. This
assumption becomes irrelevant when there are many observations due to the central limit theorem.

Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these
assumptions may have been violated
LOS



19. The following table summarizes performance data for a mutual fund and its benchmark for the last 50 months:

                    Fund return (Y)   Benchmark return (X)

Mean                     1.48                1.20

Variance                 5.60                5.07

Covariance (Y,X)         4.78

Based on this data, if an investor expects a 0% return for the benchmark next month, the predicted return of the mutual fund will be closest to:

 A. 0.35%

B. 0.94%

C. 1.48%

Explanation

The coefficients of a linear regression (ie, slope and intercept) can be estimated based on common statistical measures, as given in this
scenario. When the expected benchmark return (the independent variable, X) equals zero, the predicted return of the mutual fund (the predicted
dependent variable, Ŷ) equals the value of the intercept.

Based on the information given, find the intercept using the return means as the independent and dependent variable values and the slope of the
regression line. The slope is the ratio of the covariance of Y and X to the variance of X. This is derived from the least squares criterion that drives
the goal of the linear regression: to minimize the sum of the squared residuals so that the regression line is the best fit for the data.
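
Laying out the calculation with the figures above:

Slope (b1) = covariance (Y, X) / variance of X = 4.78 / 5.07 ≈ 0.943
Intercept (b0) = mean of Y − (b1 × mean of X) = 1.48 − (0.943 × 1.20) ≈ 0.35
Predicted return when X = 0: Ŷ = b0 ≈ 0.35%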

It is important to note that 0.35% is the predicted mutual fund return, not the observed (ie, actual) return, and that the data set is just a sample of
the population, so the calculated coefficients will always be estimates of the population parameters.

(Choice B) 0.94% is the slope, or the change in the predicted return of the mutual fund for every unit change in the expected return of the
benchmark.

(Choice C) 1.48% is the mean (ie, average) return of the mutual fund. It is the expected return when there is no information about the
benchmark.

Things to remember:
Regression coefficients can be estimated using common statistical measures. This results from the least squares criterion that drives the linear
regression's goal of minimizing the sum of the squared residuals (ie, the difference between observed Y and predicted Y) so that the estimated
regression line is the best fit for the data.

Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS



20. An equity analyst evaluates how the percentage change in cotton prices affects the percentage change in the stock price of an apparel
company. The analyst prepares a linear regression, resulting in an intercept of 0.3 and a slope of −0.7. According to this regression, the predicted
change in stock price given a 2% increase in cotton prices is closest to:

A. −1.4%

 B. −1.1%

C. −0.1%

Explanation

The linear equation for this scenario is Ŷ = 0.3 − 0.7X, where X is the percentage change in cotton prices and Ŷ is the predicted percentage change
in the stock price. The intercept is 0.3, so if the change in cotton prices is zero, the predicted change in stock price is 0.3%. The slope is −0.7,
meaning that every 1% increase in cotton prices decreases the predicted stock price change by an incremental 0.7%.

Therefore, a 2% increase in cotton prices results in a predicted decrease of 1.1% [0.3 − (0.7 × 2)] in the stock price. It is important to note that
this is a predicted variation, not the observed (ie, actual) stock price change.

A linear regression with one independent variable has two coefficients and can be plotted as a straight line on a scatterplot of the data, where the:

Intercept is the value of the dependent variable when the independent variable equals zero (ie, the expected variation in the estimate that
is unrelated to the independent variable).
Slope is the change in the dependent variable for every unit change in the independent variable.

(Choice A) −1.4% results from ignoring the intercept, but the regression model includes the intercept as the variation unrelated to changes in the
independent variable.

(Choice C) −0.1% results from misplacing both coefficients and multiplying the 2% variation by the intercept instead of the slope [(0.3 × 2) − 0.7].

Things to remember:
In a linear regression, the intercept is the value of the dependent variable Y when the independent variable X equals zero (ie, the expected change
in the dependent variable regardless of the change in the independent variable). The slope is the change in the dependent variable for every unit
change in the independent variable. These coefficients and the independent variable explain the variations in the dependent variable.

Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS



21. A trader runs a linear regression model and calculates a prediction interval. The trader reviews the results, then runs a second model based
on the same data set and calculates a significantly narrower prediction interval. Based only on this information, the trader most likely:

A. decreased the sample size.

 B. increased the level of significance.

C. assumed the forecasted independent variable is farther from the sample mean.

Explanation

Simple linear regressions can be used to predict a dependent variable (Ŷf) based on a forecasted value of the independent variable (Xf). The
forecast's prediction interval is the expected range of values around Ŷf for a given level of significance (α).

The range (ie, the width) of the interval is a function of two components: the critical t-value, which is a reliability factor, and the standard error of
the forecast (sf), which measures the accuracy of the forecast (for a given value of Xf). Factors that can affect this range, when comparing the first
model to the second, are:

Decreasing the number of observations (ie, sample size) results in a less reliable model. The prediction interval is wider to reflect the
greater uncertainty (Choice A).

Increasing the significance level decreases the confidence level (ie, the probability that the actual value of Y lies within the prediction
interval) and narrows the interval.

If the forecasted independent variable is farther from the sample mean independent variable, it decreases accuracy and increases
uncertainty, thus widening the prediction interval (Choice C).

Things to remember:
The prediction interval of a forecast is the expected range of values around the predicted dependent variable, resulting from factors such as the
number of observations, the significance level, and the forecasted independent variable. The smaller (greater) the number of observations and the
significance level and the farther (closer) the forecasted independent variable from the sample mean, the wider (narrower) the interval.

Calculate and interpret the predicted value for the dependent variable, and a prediction interval for it, given an estimated linear regression model
and a value for the independent variable
LOS

Copyright © UWorld. Copyright CFA Institute. All rights reserved.
