1.10 Simple Linear Regression - Answers
A. sample size.
B. regression coefficients.
Explanation
The standard error of the estimate (se) is an absolute (ie, not relative) measure of the accuracy of the regression, as reflected by the average distance between the predicted (Ŷ) and observed (Y) values of the dependent variable. Since se is the square root of the mean square error (MSE), it is presented in the same units as Y. Smaller values of se mean that the regression line is a better fit for the data.
In this scenario, se is the correct statistic to use since the investor wants an absolute measure of the accuracy of the model. Other measures of goodness of fit, such as the coefficient of determination (R²) and the F-statistic, provide a relative measure and are not appropriate for this case.
Since this scenario does not provide the MSE, it must be calculated as the ratio of the sum of squares error (SSE) to the degrees of freedom (n − 2).
Since there are two variables and SSE is known, the investor will need to know n to calculate se.
(Choices B and C) The regression coefficients (ie, the intercept and the slope) and the correlation between the variables are not necessary to calculate se.
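As a rough illustration of this calculation, the sketch below uses hypothetical values for SSE and n; the question's actual figures are not reproduced here.

```python
import math

# Hypothetical inputs (not the question's actual data):
sse = 4.5   # sum of squares error (residual sum of squares)
n = 50      # sample size

# In a simple linear regression, df(residual) = n - 2
mse = sse / (n - 2)        # mean square error
se = math.sqrt(mse)        # standard error of the estimate

print(f"MSE = {mse:.4f}, se = {se:.4f}")
```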
Things to remember:
The standard error of the estimate (se) is an absolute (ie, not relative) measure of the accuracy of the regression, presented in the same data units
as the dependent variable. se is the square root of the mean square error (MSE), which is the ratio of the residual sum of squares to the degrees
of freedom.
Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error
of estimate in a simple linear regression
LOS
[Table: regression statistics for models M1 and M2]
Based only on this data, which of the following functional forms is the most appropriate fit for the sample data?
A. Lin-log
B. Log-lin
C. Log-log
Explanation
Linearity is one of the four key assumptions of a simple linear regression. However, even if data sets do not have a linear relationship, a simple
linear regression can be used by transforming one or both variables by using, for example, the square, the reciprocal, or the log of the variable.
This results in distinct functional forms, allowing the regression to better fit a curvature. These transformed models have linear parameters (ie,
the intercept b0 and the slope b1), although the variables are not linear.
In this scenario:
M1 is a log-lin model, with a logarithmic dependent variable (lnY) and a linear independent variable (X), and
M2 is a log-log model, with both variables in their logarithmic forms.
The data show that M1 has the largest coefficient of determination and F-statistic and the smallest standard error of the estimate. Therefore, the log-lin model, not the log-log model, displays the best fit for the data (Choice C).
It is important to remember that regression statistics can be compared for regressions only when each regression's dependent variable has
the same form. For instance, linear and lin-log models can be compared since the dependent variable for each is linear, but a lin-log model
cannot be directly compared with a log-log model. In this question, M1 and M2 can be compared since both use lnY as the dependent variable.
(Choice A) Neither regression is in the lin-log functional form, which has a linear dependent variable (Y) and a logarithmic independent variable
(lnX).
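As an illustration of the transformation idea, the sketch below fits a log-lin model by ordinary least squares on made-up data; the variable names and values are assumptions for illustration, not the study's data.

```python
import math

# Hypothetical sample exhibiting exponential-style growth (not the study's data)
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.0, 4.4, 6.5, 9.7, 14.1]

# Log-lin model: transform only the dependent variable
ln_y = [math.log(v) for v in y]

# Ordinary least squares for the transformed model ln(Y) = b0 + b1 * X
n = len(x)
mean_x, mean_ly = sum(x) / n, sum(ln_y) / n
cov_xy = sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, ln_y)) / (n - 1)
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)

b1 = cov_xy / var_x          # slope of the log-lin model
b0 = mean_ly - b1 * mean_x   # intercept of the log-lin model

print(f"ln(Y) = {b0:.3f} + {b1:.3f} * X")
```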
Things to remember:
Simple linear regressions may have different functional forms, including logarithmic transformations of one or both variables (ie, lin-log, log-lin, log-
log). These transformed models have linear parameters, although the variables are not linear. The statistics of models with different forms of
dependent variable (eg, Y and lnY) are not comparable.
The most appropriate conclusion is that the pattern of the residual plot indicates the presence of:
A. outliers.
B. nonlinearity.
C. heteroskedasticity.
Explanation
A residual is the difference between the observed value of a dependent variable Y and the value predicted by a regression. In a simple linear
regression (SLR), residuals are the vertical distance between the actual observations and the regression line, which is the straight line that
minimizes the sum of the squares of all residuals (ie, the sum of squares error).
The residual plot ideally should be scattered randomly; the presence of a pattern often indicates the violation of one or more assumptions
underlying the SLR model.
In this residual plot (graph on the right), there is a pattern: a concentration of positive residuals in the lower and upper ranges of the independent
variable X and of negative residuals in the middle range. This indicates a violation of linearity, which assumes that X and Y have a linear
relationship. This violation is called nonlinearity and is evident in the scatterplot of Y against X (graph on the left).
Things to remember:
The presence of a pattern in a residual plot often indicates a violation of the assumptions of a linear regression model. If the residual plot shows a
concentration of positive residuals in a range and of negative residuals in other range(s), then the relationship cannot be displayed as a straight
line and the linearity assumption is violated. This is called nonlinearity.
Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these
assumptions may have been violated
LOS
ANOVA Table
Source   df   Sum of squares
Total    79   14.0683
Based on this data, the standard error of the estimate (se) is closest to:
A. 0.241
B. 0.323
C. 0.677
Explanation
The standard error of the estimate (se) measures the accuracy of the regression as reflected by the average distance between the predicted (Ŷ) and observed (Y) values of the dependent variable. It is the square root of the mean square error (MSE) and is presented in the same units as Y.
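A sketch of the calculation, reconstructed from the surviving Total row (df = 79, so n = 80; SST = 14.0683) and the R² of 0.677 cited in the answer choices; the SSR and SSE rows of the original table are inferred, not quoted.

```python
import math

sst = 14.0683        # sum of squares total (from the ANOVA Total row)
n = 80               # Total df = n - 1 = 79
r_squared = 0.677    # coefficient of determination cited in Choice C

sse = (1 - r_squared) * sst      # unexplained variation
mse = sse / (n - 2)              # df(residual) = n - 2 = 78
se = math.sqrt(mse)              # standard error of the estimate

print(f"SSE = {sse:.4f}, MSE = {mse:.5f}, se = {se:.3f}")   # se ≈ 0.241
```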
Smaller se values mean that a regression has smaller errors, so the regression line is a better fit for the data. This important measure of
goodness of fit can be derived from the ANOVA table for a simple linear regression since the table includes the MSE.
(Choice B) 0.323 results from dividing the sum of squares error (SSE) by the sum of squares total (SST), which equals 1 − R²; it is the proportion of the change in the dependent variable Y not explained by the model.
(Choice C) 0.677 is the coefficient of determination (R²), which measures the proportion of the change in Y explained by the independent variable X: R² = sum of squares regression (SSR) / SST.
Things to remember:
The standard error of the estimate (se), derived from the ANOVA table, is a key measure to evaluate how the regression fits the data. It quantifies
the accuracy of the regression (ie, the average error) and is measured in the same units as the dependent variable. It is the square root of the
mean square error (MSE); smaller se values mean that a regression has smaller errors, so the regression line is a better fit for the data.
Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error
of estimate in a simple linear regression
LOS
The analyst forecasts an average oil price of USD 66 in the following quarter. Considering t-values of ±2.228, the 95% prediction interval for the
company's gross margin is closest to:
A. 40.6% to 52.1%.
B. 42.3% to 50.5%.
C. 44.2% to 48.6%.
Explanation
Simple linear regressions are commonly used to predict a dependent variable (Ŷf) based on a forecasted value of the independent variable (Xf).
The forecast's prediction interval is the expected range of values around Ŷf given a certain level of significance (α). For example, if α = 5%,
there is a 95% probability that the actual value of Y will be within the prediction interval.
The width of the range is a function of two components: the critical t-value (tc), which is a reliability factor, and the standard error of the forecast,
which is a measure of the regression's accuracy. In this scenario, the prediction interval is calculated as follows:
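The fitted regression and the standard error of the forecast are not reproduced in the surviving text, so the sketch below only shows the general form of the calculation, using the given t-value of 2.228 and hypothetical values for Ŷf and sf.

```python
t_crit = 2.228        # two-sided critical t-value given in the question (df = n - 2 = 10)

# Hypothetical values for illustration only (not quoted from the question):
y_hat_f = 46.4        # predicted gross margin (%) at the forecasted oil price of USD 66
s_f = 2.0             # standard error of the forecast

lower = y_hat_f - t_crit * s_f
upper = y_hat_f + t_crit * s_f
print(f"95% prediction interval: {lower:.1f}% to {upper:.1f}%")
```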
6. An analyst runs a simple linear regression (SLR) and plots the resulting residuals:
The regression's residuals are presented on the vertical axis of each graph. In the left graph, residuals are plotted against the regression's
independent variable, and in the right graph, they are plotted in chronological order. Based only on the graphs, the model most likely violates
which of the following linear regression assumptions?
A. Linearity
B. Independence
C. Homoskedasticity
Explanation
A residual is the difference between the observed value of the dependent variable and the value predicted by the regression. In a simple linear
regression (SLR), residuals are the vertical distance between the actual observations and the regression line, which is the straight line that
minimizes the sum of the squares of all residuals (ie, the sum of squares error).
Ideally, the residual plot is randomly scattered. Therefore, the presence of patterns often indicates the violation of one or more assumptions underlying the SLR model. In this scenario, when the residuals are plotted against the independent variable (X), they do not display any visible pattern.
When the residuals are plotted against time (ie, chronological order), there is a pattern: Neighboring (ie, succeeding and preceding) residuals
have similar signs and magnitudes. This indicates a violation of the independence assumption, which assumes that observations (and
residuals) are uncorrelated; it is called autocorrelation.
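One simple way to quantify this pattern is the lag-1 autocorrelation of the residuals; the sketch below uses made-up residuals that drift over time, not the analyst's data.

```python
# Hypothetical residuals in chronological order (not the analyst's data)
residuals = [0.8, 0.6, 0.5, 0.1, -0.2, -0.5, -0.6, -0.4, 0.0, 0.3, 0.6, 0.7]

n = len(residuals)
mean_r = sum(residuals) / n

# Lag-1 autocorrelation: correlation between each residual and the preceding one
num = sum((residuals[t] - mean_r) * (residuals[t - 1] - mean_r) for t in range(1, n))
den = sum((r - mean_r) ** 2 for r in residuals)
lag1_autocorr = num / den

print(f"Lag-1 autocorrelation = {lag1_autocorr:.2f}")  # values near +1 suggest positive autocorrelation
```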
Things to remember:
Independence assumes that observations (and residuals) are uncorrelated. Autocorrelation, the violation of this assumption, is often recognized
by plotting the residuals against time (ie, chronological order) and identifying that neighboring residuals have similar signs and magnitudes.
Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these assumptions may have been violated
LOS
7. The following table summarizes the relationship between two exchange rates, USD/GBP and USD/EUR, for the last 12 months:
[Table: correlation of USD/GBP and USD/EUR and the standard deviation of each exchange rate]
An analyst runs a linear regression to test if USD/GBP explains changes in USD/EUR. The slope of this regression model is closest to:
A. 0.405
B. 0.679
C. 0.898
Explanation
This regression model examines how well USD/GBP explains USD/EUR. Therefore, USD/GBP is the explanatory or independent variable (X),
while USD/EUR is the explained or dependent variable (Y). Since there is only one independent variable, this is a simple linear regression, and
the slope coefficient is the ratio of the covariance (Y, X) to the variance of X.
In this scenario, the covariance (Y, X) can be obtained using the correlation of Y and X and the standard deviations of X and Y. After obtaining the
covariance, the slope coefficient is calculated as follows:
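A sketch of the calculation, using the figures quoted in the answer-choice explanations below (correlation = 0.781, standard deviation of USD/GBP = 1.389, standard deviation of USD/EUR = 1.208); the original table is not reproduced here.

```python
corr_xy = 0.781    # correlation between USD/EUR (Y) and USD/GBP (X)
sd_x = 1.389       # standard deviation of USD/GBP (independent variable)
sd_y = 1.208       # standard deviation of USD/EUR (dependent variable)

cov_xy = corr_xy * sd_x * sd_y        # covariance from the correlation and standard deviations
slope = cov_xy / (sd_x ** 2)          # slope = cov(Y, X) / var(X)

# Equivalent shortcut: slope = correlation * (sd_y / sd_x)
print(f"Slope = {slope:.3f}")         # ≈ 0.679 (Choice B)
```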
(Choice A) 0.405 results from incorrectly using the correlation instead of the covariance in the slope formula (ie, 0.781 / 1.389²).
(Choice C) 0.898 results from incorrectly swapping the variables (ie, using USD/GBP as the dependent variable and USD/EUR as the
independent variable [0.781 × 1.389 / 1.208]).
Things to remember:
In a simple linear regression, the independent variable (X) explains the dependent variable (Y). The slope coefficient is the ratio of the covariance
(Y, X) to the variance of X. It can also be calculated as the product of the correlation (Y, X) and the ratio of the standard deviations of Y and X.
Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS
A. in both models.
B. only in Model 1.
C. only in Model 2.
Explanation
A linear regression evaluates the relationship between two or more variables. It uses sample data to test and quantify a linear relationship by
plotting a straight line that best fits the data. This relationship can be expressed as a linear equation. The regression model also predicts the
value of a variable based on the assumed value(s) for the other variable(s).
In this scenario:
Model 1 examines if changes in the stock index are explained by the unemployment rate. In this model, the stock index is the dependent
variable (ie, explained variable), which is usually named Y. The goal of the model is to predict trends in the stock index based on the
expected unemployment rate.
Model 2 examines if changes in the stock index explain changes in GDP. In this model, the stock index is the independent variable (ie,
explanatory variable), which is usually named X. The goal of the model is to predict trends in GDP based on the expected stock index
(Choices A and C).
A linear regression model has only one dependent variable, but it may have several independent variables. In this case, the model is known as a
simple linear regression since it has only one independent variable.
Things to remember:
A linear regression tests if independent variables explain changes in a dependent variable, quantifying their relationship in a linear equation.
Linear regressions have only one dependent variable but may have several independent variables. When there is only one independent variable,
the model is known as a simple linear regression.
Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS
ANOVA Table
Source   df   Sum of squares
Total    9    0.4413
The manager formulates the null hypothesis (H0) that the slope coefficient is zero. If the critical F-value at a 5% significance is 5.32, the manager's
most appropriate decision is to:
Explanation
A hypothesis test and the F-statistic are used to determine the statistical significance of a linear regression by testing whether results are
random. In a simple linear regression (SLR), the goal of the F-test is to assess whether the single independent variable (X) helps to explain
changes in the dependent variable (Y).
The F-statistic is the ratio of two variances, the mean square regression (MSR) and the mean square error (MSE). The greater (smaller) the F-
statistic, the more (less) the variation is explained by the regression. The F-statistic is compared to the critical F-value, based on the desired
level of significance, the number of independent variables, and the degrees of freedom. If the F-statistic is greater (less) than the critical F-value,
then reject (fail to reject) H0.
In a hypothesis test for an SLR, the null hypothesis (H0) states that the slope (b1) equals zero (ie, X is not useful to predict Y). The step-by-step process is as follows:
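A sketch of the decision rule using the question's critical value of 5.32; the MSR and MSE figures are hypothetical (the rest of the ANOVA table is not reproduced here) and are chosen to match the fail-to-reject outcome described below.

```python
f_critical = 5.32     # critical F-value at 5% significance (from the question)

# Hypothetical ANOVA quantities for illustration only:
msr = 0.09            # mean square regression = SSR / 1
mse = 0.04            # mean square error = SSE / (n - 2) = SSE / 8

f_stat = msr / mse
if f_stat > f_critical:
    print(f"F = {f_stat:.2f} > {f_critical}: reject H0; the slope is statistically significant")
else:
    print(f"F = {f_stat:.2f} <= {f_critical}: fail to reject H0; the slope is not significantly different from zero")
```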
In this scenario, since H0 cannot be rejected, there is not sufficient evidence to indicate that the slope is different from zero. Therefore, the debt-
to-equity ratio is not statistically significant as a predictor of beta (Choices A and B).
Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression
LOS
10. An analyst runs a regression to determine whether the value of Stock Index A is a good predictor of the value of Stock Index B. Selected data
are presented below:
The analyst expects Index A to reach a value of 509 next month and estimates the standard error of the forecast (sf) to be 42.893. If the critical t-
values at a 10% significance level are 1.701 (two-sided) and 1.313 (one-sided), the prediction interval for Index B is closest to:
A. 1,785 to 1,931.
B. 1,801 to 1,914.
C. 1,815 to 1,901.
Explanation
Simple linear regressions are commonly used to predict a dependent variable (Ŷf) based on a forecasted value of the independent variable (Xf).
The prediction interval of a forecast is the expected range of values around Ŷf, given a level of significance (α). The width of the range
is a function of two components:
The critical t-value (tc), a reliability factor that is negatively related to α and to the number of observations (n).
The standard error of the forecast (sf), which measures the accuracy of the predictions (ie, expected distance between observed values
and Ŷf). The sf formula shows that sf is positively related to the standard error of the estimate (se) and to the difference between Xf and the
mean value of X. In addition, sf is negatively related to the number of observations (n).
In this scenario, Xf = 509 and α = 10%, so there is a 90% probability that Index B's actual value will be within the prediction interval around Ŷf.
Therefore:
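A sketch of the interval calculation; the predicted value Ŷf does not survive in the text, so a value of about 1,858 (the midpoint implied by the answer choices) is assumed here for illustration.

```python
y_hat_f = 1858        # assumed predicted value of Index B (not quoted from the question)
s_f = 42.893          # standard error of the forecast (given)
t_crit = 1.701        # two-sided critical t-value at 10% significance (given)

lower = y_hat_f - t_crit * s_f
upper = y_hat_f + t_crit * s_f
print(f"90% prediction interval: {lower:,.0f} to {upper:,.0f}")   # ≈ 1,785 to 1,931 (Choice A)
```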
Things to remember:
A forecast's prediction interval is the expected range of values around the predicted dependent variable. The smaller (greater) the level of
significance and the number of observations, and the worse (better) the accuracy of the regression, the wider (narrower) the interval.
Calculate and interpret the predicted value for the dependent variable, and a prediction interval for it, given an estimated linear regression model
and a value for the independent variable
LOS
11. An analyst evaluates the relationship between two securities and runs a linear regression based on 12 observations. Selected data from the
ANOVA (analysis of variance) table is presented below:
Source       Sum of squares
Regression   20.601
Residual     8.326
Total        28.927
Based on this data, the regression's F-statistic is closest to:
A. 24.74
B. 27.22
C. 29.69
Explanation
The F-statistic is used to test the statistical significance of a linear regression by testing whether results are random. In the F-test for a simple
linear regression, if the F-statistic is greater than the critical F-value, there is evidence that the single independent variable (X) helps to explain
changes in the dependent variable (Y) (ie, results are not random).
The F-statistic is the ratio of the mean square regression (MSR) to the mean square error (MSE) and can be interpreted as the ratio of the
explained variance to the unexplained variance. The greater (smaller) the F-statistic, the more (less) of the variation is explained by the
regression. In the current scenario:
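A sketch of the calculation using the ANOVA figures above (SSR = 20.601, SSE = 8.326, n = 12):

```python
ssr = 20.601   # sum of squares regression
sse = 8.326    # sum of squares error
n = 12         # number of observations

msr = ssr / 1            # df(model) = 1 in a simple linear regression
mse = sse / (n - 2)      # df(residual) = 12 - 2 = 10

f_stat = msr / mse
print(f"F-statistic = {f_stat:.2f}")   # ≈ 24.74 (Choice A)
```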
Note that the denominators used to calculate MSR and MSE are the degrees of freedom (df) in each case:
df(model), denominator of MSR, is the number of independent variables, which is 1 in a simple linear regression.
df(residual), denominator of MSE, is the difference between the sample size (n) and the number of parameters (2 in a simple linear
regression), so 10 = 12 − 2.
df(total) is the sum of df(model) and df(residual), or 11.
(Choice B) 27.22 is obtained by incorrectly using total df in the denominator of MSE [20.601 / (8.326 / 11)].
(Choice C) 29.69 is obtained by incorrectly using the total observations in the denominator of MSE [20.601 / (8.326 / 12)].
Things to remember:
The F-statistic is used to test the statistical significance of a linear regression (ie, test if results are random). The F-statistic is a ratio of two
variances, the mean square regression (MSR) and the mean square error (MSE). The greater (smaller) the F-statistic, the more (less) of the
variation is explained by the regression.
Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression
LOS
ANOVA Table
Source   df   Sum of squares
Total    29   179.404
At a 5% significance level, the critical F-value is 4.196. Based only on the data, the most appropriate conclusion is that the:
Explanation
There are different ways to evaluate how well a regression model explains the relationship between the dependent (Y) and the independent (X)
variables in a simple linear regression:
The coefficient of determination (R2) measures the proportion of the change in Y explained by X.
The standard error of the estimate (se) measures the accuracy of the regression.
The F-statistic is used to test whether the regression results are random or statistically significant.
These statistics are three key measures of goodness of fit; they are usually analyzed together since they have different specific interpretations and
limitations (ie, each statistic explains only a part of the regression). The ANOVA table supplies the required information to calculate all three
measures. The calculations are:
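Only the Total row of the ANOVA table (df = 29, SST = 179.404) survives here, so the sketch below splits SST with a hypothetical SSR purely to show how the three measures are computed; the actual conclusion depends on the SSR/SSE split, which is not shown.

```python
import math

sst = 179.404      # sum of squares total (from the Total row)
n = 30             # Total df = n - 1 = 29

# Hypothetical split of SST for illustration only:
ssr = 120.0                      # sum of squares regression (assumed)
sse = sst - ssr                  # sum of squares error

r_squared = ssr / sst                      # coefficient of determination
se = math.sqrt(sse / (n - 2))              # standard error of the estimate
f_stat = (ssr / 1) / (sse / (n - 2))       # F-statistic = MSR / MSE

print(f"R^2 = {r_squared:.3f}, se = {se:.3f}, F = {f_stat:.2f} (compare with critical F = 4.196)")
```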
Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error of estimate in a simple linear regression
LOS
13. An analyst runs a simple linear regression using a sample of 10 observations. To assess the validity of the regression statistics, the analyst
randomly draws 20 observations from the same data. The results of both regression analyses are:
[Table: sample size, sum of squares total (SST), and sum of squares error (SSE) for the two regressions]
Based on this information, the analyst should most likely conclude that:
Explanation
Sum of squares total (SST) = Sum of squares regression (SSR) + Sum of squares error (SSE)
A simple linear regression (SLR) can be used to assess the degree to which variations in a dependent variable are explained by variations in an
independent variable. Outputs of a regression analysis include the:
sum of squares total (SST), the variation in the observed values of the dependent variable versus its average value,
sum of squares regression (SSR), the proportion of variation of the dependent variable explained by changes in the independent variable,
and
sum of squares error (SSE), the proportion of variation of the dependent variable unexplained by changes in the independent variable.
F-statistics are used to evaluate the goodness of fit (ie, how well the regression model fits the data). F-statistics reflect the explained versus
the unexplained variation in the dependent variable, adjusted for sample size and degrees of freedom (ie, mean square regression [MSR] /
mean square error [MSE]). For a given data set, both the SSR and SSE increase as sample size increases:
In this scenario, both the SST and SSE double when the sample size doubles. This indicates the same relative variance is exhibited by both
samples. Since the divisor in calculating the F-statistic is inversely related to sample size, other things equal, a larger sample size exhibiting the
same relative variation results in a greater value for the F-statistic (Choices A and B).
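A sketch of this effect, using hypothetical sums of squares in which SST, SSR, and SSE all double when the sample size doubles:

```python
def f_statistic(ssr, sse, n):
    """F-statistic for a simple linear regression: MSR / MSE."""
    msr = ssr / 1            # df(model) = 1
    mse = sse / (n - 2)      # df(residual) = n - 2
    return msr / mse

# Hypothetical figures (not the analyst's data): same relative variation in both samples
f_small = f_statistic(ssr=6.0, sse=4.0, n=10)    # SST = 10, SSE = 4
f_large = f_statistic(ssr=12.0, sse=8.0, n=20)   # SST = 20, SSE = 8

print(f"F (n=10) = {f_small:.1f}, F (n=20) = {f_large:.1f}")  # the larger sample yields a larger F
```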
A larger sample size that exhibits the same relative variation of the dependent variable should give an analyst greater confidence regarding the
explanatory power of the independent variable.
Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression
LOS
14. A foreign exchange trader runs a linear regression with 24 observations in which BRL/USD is the independent variable and MXN/USD is the
dependent variable and compiles the following results:
A. 0.418
B. 0.436
C. 0.473
Explanation
The standard error of the estimate (se) measures the accuracy of a regression as the average distance between the predicted and observed values of the dependent variable (Y). It is also known as the standard error of the regression or the root mean square error (ie, the square root of the MSE). It is an absolute measure presented in the same units as Y. Smaller values of se mean that the regression has smaller errors; therefore, the regression line is a better fit for the data.
Since this scenario does not provide the MSE, it must be calculated as the ratio of the sum of squares error (SSE), or residual sum of squares, to the degrees of freedom (df). Note that df(residual) is the difference between the sample size (n = 24) and the number of estimated parameters (the intercept and the slope) (ie, 24 − 2 = 22).
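A sketch of the calculation using the figures quoted in the surrounding text (SSE = 4.190, n = 24):

```python
import math

sse = 4.190    # sum of squares error (residual sum of squares)
n = 24         # number of observations

mse = sse / (n - 2)        # df(residual) = 24 - 2 = 22
se = math.sqrt(mse)        # standard error of the estimate

print(f"se = {se:.3f}")    # ≈ 0.436 (Choice B)
```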
(Choice A) 0.418 is obtained by incorrectly using the number of observations (ie, sample size) in the denominator instead of the df, n − 2 (ie,
square root of 4.190 / 24).
(Choice C) 0.473 is obtained by incorrectly using the sum of squares total (SST), or total sum of squares, in the numerator instead of SSE (ie,
square root of 4.932 / 22).
Things to remember:
The standard error of the estimate (se), or square root of the mean square error, quantifies the accuracy of the regression (ie, the average error). It
is an absolute measure of how well the regression line fits the data, presented in the same units as the dependent variable. Smaller values of se
mean that the regression has a smaller error and better precision.
Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error
of estimate in a simple linear regression
LOS
A. total variation of Y.
B. explained variation of Y.
C. unexplained variation of Y.
Explanation
A linear regression evaluates the relationship between two or more variables, plotting a straight line that best fits the data. The more the model
explains the variation of the dependent variable Y, the greater its goodness of fit.
The variation of Y is often called the sum of squares total (SST). SST is the total squared difference between each observed Y and the average
value Ῡ, based on the results from the sample data (ie, independent of the regression model). Then, the regression model can help to explain
SST based on the changes of one or more independent variables. The model breaks the SST into two measures of total squared differences:
Sum of squares regression (SSR) measures differences between Ῡ and Ŷ (the value of Y predicted by the regression model). Since Ῡ can
be interpreted as the expected value of Y without a regression model, SSR quantifies how much of SST is explained by the model.
Sum of squares error (SSE) measures differences between Y and Ŷ. Since this is a measure of the prediction error, SSE quantifies how
much of SST is unexplained by the model (Choice C).
Therefore, SST equals the total variation of Y (ie, the sum of the explained variation and the unexplained variation) (Choice A).
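A minimal sketch, with made-up data, verifying the decomposition SST = SSR + SSE for a fitted simple linear regression:

```python
# Hypothetical sample (not from the question)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.9, 4.2, 4.8, 6.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares coefficients
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation of Y
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)           # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")  # the two totals match
```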
Things to remember:
The variation of Y is known as the sum of squares total (SST), the total squared difference between each observed Y and the average value Ῡ,
based on the results from the sample data. The regression model can help to explain SST as two components: the sum of squares regression
(SSR), the variation explained by the regression model; and the sum of squares error (SSE), the variation unexplained by the model.
Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS
Based on only the residual plot, this regression model most likely violated which of the following linear regression assumptions?
A. Linearity
B. Independence
C. Homoskedasticity
Explanation
A residual is the difference between the observed value of the dependent variable and the value predicted by the regression. In a simple linear
regression (SLR), residuals are the vertical distance between the actual observations and the regression line, which is the straight line that
minimizes the sum of the squares of all residuals (ie, the sum of squares error).
Ideally, the residual plot is randomly scattered; the presence of a pattern often indicates the violation of one or more assumptions underlying
the simple linear regression model.
In this scenario, there is a pattern: the variance increases as the value of the independent variable increases. This indicates a violation of
homoskedasticity, which assumes that observations have a similar dispersion and residuals have constant variance. That violation is called
heteroskedasticity.
(Choice A) Linearity assumes that the variables have a linear relationship. Nonlinearity, the violation of this assumption, is indicated by a
concentration of positive residuals in a range (or ranges) of the plot and of negative residuals in other range(s). No concentration exists in this
scenario's residual plot.
(Choice B) Independence assumes that observations are uncorrelated. Autocorrelation, the violation of this assumption, is often indicated by
plotting the residuals against time (ie, observation order), and identifying that neighboring residuals have similar signs and magnitudes. There is
no evidence of autocorrelation in this residual plot.
Things to remember:
The presence of patterns in residual plots often indicates a violation of the assumptions of a linear regression. If the residual plot shows changes
in variance across observations, the homoskedasticity assumption is violated and the linear regression will be less accurate for part of the
population. This is called heteroskedasticity.
Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these
assumptions may have been violated
LOS
Explanation
Linearity is one of the four key assumptions of a simple linear regression. However, even if data sets do not have a linear relationship, it is
possible to transform one or both variables by using the square, the reciprocal, or the log and then employing a linear regression with the
transformed variable(s).
A common solution to nonlinearity is to use a log transformation. This results in distinct functional forms, allowing the regression line to better fit
a curvature. In the graphs, it is evident that the linear functional form does not depict a linear relationship, but the log-lin functional form
provides a much better fit, resulting in a stronger regression model. Although the graphs indicate this improvement, the statistics of both
regression models cannot be compared directly since they use different forms of the dependent variable.
(Choice B) An indicator variable (ie, dummy variable) is used in specific situations (eg, when there is seasonal data indicating autocorrelation) but
not as a solution to nonlinearity.
(Choice C) Increasing the sample size reduces a regression's uncertainty but does not resolve the nonlinear relationship between variables.
Things to remember:
The transformation (eg, the square, the reciprocal, or the log) of one or both variables in a simple linear regression is often used when there is a
nonlinear relationship. This results in distinct functional forms, allowing the regression line to better fit a curvature.
C. the dispersion of the dependent variable changes as the independent variable increases.
Explanation
Normality is one of the key assumptions of simple linear regression. It does not refer to a normal distribution of the variables (ie, the independent
variable, X, or the dependent variable, Y); however, the residuals must be normally distributed.
The residuals are the difference between the observed and predicted values of the dependent variable (Y). The nonnormality of residuals compromises some uses of the regression model (eg, prediction intervals), but it does not affect the estimated coefficients (ie, slope and intercept) and thus the estimated relationship between the variables. As the number of observations increases, this assumption becomes less relevant, according to the central limit theorem.
(Choice B) A strong seasonal pattern in time series data is an indicator of autocorrelation (ie, correlation between sequential values of the same
variable), violating the independence assumption. This harms the independence of the residuals and biases the variance of the estimated
coefficients, invalidating any significance test of the coefficients.
(Choice C) If the dispersion of Y changes as X increases, there is heteroskedasticity, violating the homoskedasticity assumption. The variance of
the residuals also changes and the regression model will be a less precise predictor for part of the population (ie, not a good fit).
Things to remember:
Normality is the assumption that the residuals of a linear regression are normally distributed. It does not require a normal distribution of the variables. A violation of normality compromises some uses of the regression (eg, prediction intervals) but not the estimation of coefficients. This assumption becomes less relevant when there are many observations, due to the central limit theorem.
Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these
assumptions may have been violated
LOS
[Table: summary statistics for the fund return (Y) and the benchmark return (X)]
Based on this data, if an investor expects a 0% return for the benchmark next month, the predicted return of the mutual fund will be closest to:
A. 0.35%
B. 0.94%
C. 1.48%
Explanation
The coefficients of a linear regression (ie, slope and intercept) can be estimated based on common statistical measures, as given in this
scenario. When the expected benchmark return (the independent variable, X) equals zero, the predicted return of the mutual fund (the predicted
dependent variable, Ŷ) equals the value of the intercept.
Based on the information given, find the intercept using the mean returns (as the dependent and independent variable values) and the slope of the regression line (ie, the intercept equals the mean of Y minus the slope times the mean of X). The slope is the ratio of the covariance of Y and X to the variance of X. This is derived from the least squares criterion that drives the goal of the linear regression: to minimize the sum of the squared residuals so that the regression line is the best fit for the data.
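A sketch of the calculation; the slope (0.94) and the fund's mean return (1.48%) are quoted in the answer-choice explanations, while the benchmark's mean return is not shown and is assumed to be about 1.20% for illustration (consistent with an intercept of 0.35%).

```python
slope = 0.94          # change in fund return per unit change in benchmark return
mean_y = 1.48         # mean fund return, in % (Choice C)
mean_x = 1.20         # mean benchmark return, in % (assumed; not quoted in the question)

intercept = mean_y - slope * mean_x     # b0 = mean(Y) - b1 * mean(X)
predicted_at_zero = intercept           # with X = 0, the prediction equals the intercept

print(f"Predicted fund return at a 0% benchmark return = {predicted_at_zero:.2f}%")
```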
It is important to note that 0.35% is the predicted mutual fund return, not the observed (ie, actual) return, and that the data set is just a sample of
the population, so the calculated coefficients will always be estimates of the population parameters.
(Choice B) 0.94% is the slope, or the change in the predicted return of the mutual fund for every unit change in the expected return of the
benchmark.
(Choice C) 1.48% is the mean (ie, average) return of the mutual fund. It is the expected return when there is no information about the
benchmark.
Things to remember:
Regression coefficients can be estimated using common statistical measures. This results from the least squares criterion that drives the linear
regression's goal of minimizing the sum of the squared residuals (ie, the difference between observed Y and predicted Y) so that the estimated
regression line is the best fit for the data.
Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS
A. −1.4%
B. −1.1%
C. −0.1%
Explanation
The regression equation for this scenario is: predicted monthly stock price change (%) = 0.3 − 0.7 × (monthly change in cotton price, %). The intercept is 0.3, so if the change in cotton price is zero, the predicted monthly change in stock price is 0.3%. The slope is −0.7, meaning that every 1% increase in cotton prices would make the stock price decrease by an incremental 0.7%.
Therefore, a 2% increase in cotton prices results in a predicted decrease of 1.1% [0.3 − (0.7 × 2)] in the stock price. It is important to note that
this is a predicted variation, not the observed (ie, actual) stock price change.
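A sketch of the calculation with the coefficients given above (intercept 0.3, slope −0.7):

```python
intercept = 0.3     # predicted stock price change (%) when cotton prices are unchanged
slope = -0.7        # change in stock price (%) per 1% change in cotton price
x_change = 2.0      # forecasted increase in cotton price (%)

predicted_change = intercept + slope * x_change
print(f"Predicted stock price change = {predicted_change:.1f}%")   # -1.1% (Choice B)
```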
A linear regression with one independent variable has two coefficients and can be displayed as a line through a scatterplot, where the:
Intercept is the value of the dependent variable when the independent variable equals zero (ie, the expected variation in the estimate that
is unrelated to the independent variable).
Slope is the change in the dependent variable for every unit change in the independent variable.
(Choice A) −1.4% results from ignoring the intercept, but the regression model includes the intercept as the variation unrelated to changes in the
independent variable.
(Choice C) −0.1% results from misplacing both coefficients and multiplying the 2% variation by the intercept instead of the slope [(0.3 × 2) − 0.7].
Things to remember:
In a linear regression, the intercept is the value of the dependent variable Y when the independent variable X equals zero (ie, the expected change
in the dependent variable regardless of the change in the independent variable). The slope is the change in the dependent variable for every unit
change in the independent variable. These coefficients and the independent variable explain the variations in the dependent variable.
Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of
these coefficients
LOS
C. assumed the forecasted independent variable is farther from the sample mean.
Explanation
Simple linear regressions can be used to predict a dependent variable (Ŷf) based on a forecasted value of the independent variable (Xf). The
forecast's prediction interval is the expected range of values around Ŷf for a given level of significance (α).
The range (ie, the width) of the interval is a function of two components: the critical t-value, which is a reliability factor, and the standard error of
the forecast (sf), which measures the accuracy of the forecast (for a given value of Xf). Factors that can affect this range, when comparing the first
model to the second, are:
Decreasing the number of observations (ie, sample size) results in a less reliable model. The prediction interval is wider to reflect the
greater uncertainty (Choice A).
Increasing the significance level decreases the confidence level (ie, the probability that the actual value of Y lies within the prediction
interval) and narrows the interval.
If the forecasted independent variable is farther from the sample mean independent variable, it decreases accuracy and increases
uncertainty, thus widening the prediction interval (Choice C).
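To illustrate these effects, the sketch below evaluates the standard error of the forecast with the usual formula and hypothetical inputs, showing that sf (and hence the prediction interval) widens as Xf moves away from the sample mean.

```python
import math

def forecast_std_error(se, n, x_f, x_mean, var_x):
    """sf = se * sqrt(1 + 1/n + (Xf - mean(X))^2 / ((n - 1) * var(X)))."""
    return se * math.sqrt(1 + 1 / n + (x_f - x_mean) ** 2 / ((n - 1) * var_x))

# Hypothetical regression summary (for illustration only)
se, n, x_mean, var_x = 2.0, 30, 10.0, 4.0

for x_f in (10.0, 14.0, 18.0):    # forecasts progressively farther from the sample mean
    sf = forecast_std_error(se, n, x_f, x_mean, var_x)
    print(f"Xf = {x_f:>4}: sf = {sf:.3f}")   # sf grows with the distance from the mean
```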
Things to remember:
The prediction interval of a forecast is the expected range of values around the predicted dependent variable, resulting from factors such as the
number of observations, the significance level, and the forecasted independent variable. The smaller (greater) the number of observations and the
significance level and the farther (closer) the forecasted independent variable from the sample mean, the wider (narrower) the interval.
Calculate and interpret the predicted value for the dependent variable, and a prediction interval for it, given an estimated linear regression model
and a value for the independent variable
LOS