ISOM2500 Regression Practice Questions


ISOM2500 Practice question for final examination: Regression

Last update: 2019/11/26

Lecturer: Prof. Du, Lilun

1. Let (𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, 2, … , 𝑛 be a sample of 𝑛 paired observations, and let 𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 be the simple
regression line for the data. The least squares estimate of 𝛽1 represents

(a) the predicted value of 𝑦 when 𝑥 = 0.


(b) the expected change in 𝑦 per unit change in 𝑥.
(c) the predicted value of 𝑦.
(d) variation around the line of regression.

2. A regression model was fitted and the residuals were found to be approximately normally
distributed. The scatter plot of residuals versus predicted values on the right contains 104 observations.
The RMSE:

(a) is around 0.
(b) is around 40
(c) is around 25.
(d) cannot be estimated from the plot.

[3-5] Suppose that in the population the annual salary (𝑦𝑖 ) of the CEO measured in million dollars is related
to the annual sales of the company (𝑥𝑖 ) measured in million dollars according to the following regression
model:

𝑦𝑖 = 5 + 0.1𝑥𝑖 + 𝜀𝑖 ,

where 𝜀𝑖 ∼ 𝑁(0, 9²) and 𝑥𝑖 ∼ 𝑁(50, 10²).

Note that we assume 𝑥𝑖 to be random and cov(𝑥𝑖 , 𝜀𝑗 ) = 0 for all 𝑖, 𝑗 = 1,2, … .

3. What is the standard deviation of CEO salaries, in million dollars, for CEOs of firms with annual
sales of five million dollars?

(a) 9 (b) 10 (c) 19 (d) 13.45

4. What is the expected difference in million dollars between the salary of the CEO of a firm with five
million dollars in annual sales and that of the CEO of a firm with annual sales of eight million dollars?

(a) -0.3 (b) -0.5 (c) 8 (d) 5


5. What is the probability that the salary of a CEO is greater than 7 million dollars if sales are 10 million
dollars?

(a) 0.044 (b) 0.11 (c) 0.38 (d) 0.45
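
As a rough numerical check on questions 3 to 5, the sketch below treats sales 𝑥 as fixed, so that salary is normal with mean 5 + 0.1𝑥 and standard deviation 9, as the stated model implies. The function and variable names are illustrative only and are not part of the original question set.

```python
from statistics import NormalDist

# Model stated above: y = 5 + 0.1x + eps, with eps ~ N(0, 9^2)
BETA0, BETA1, SIGMA = 5.0, 0.1, 9.0

def conditional_salary(sales):
    """Distribution of salary (million $) given sales (million $)."""
    return NormalDist(mu=BETA0 + BETA1 * sales, sigma=SIGMA)

# At a fixed sales level, the spread of salaries is just the error SD.
sd_at_5 = conditional_salary(5).stdev

# Expected salary difference between firms with sales of 5 and 8.
expected_diff = conditional_salary(5).mean - conditional_salary(8).mean

# P(salary > 7 | sales = 10)
p_above_7 = 1 - conditional_salary(10).cdf(7)

print(sd_at_5, round(expected_diff, 3), round(p_above_7, 3))
```

Note that the distribution of 𝑥 plays no role once the sales level is given; it would only matter for the unconditional spread of salaries.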

6. The residual plot for a linear regression model is shown below.

Which of the following is true?

(a) A linear model is okay because the association between the two variables is fairly strong.
(b) The linear model is no good because the correlation is near 0.
(c) The linear model is no good because some residuals are large.
(d) The linear model is no good because of the curve in the residuals.

[7-8] Let (𝑋, 𝑌) be a pair of random variables. Suppose you run a linear least squares regression of
𝑌 on 𝑋. The estimated regression line is 𝑌̂ = 3 + 2𝑋. The t-statistic for testing the null hypothesis that
𝛽1 = 1 is 3.

7. You get an additional data point with X = 2 and Y = 7 and run the regression again including the
new data point. What happens to the estimated slope coefficient?

(a) It increases.
(b) It decreases.
(c) It remains the same.
(d) Cannot tell based on the information given.

8. What happens to the standard error of the residuals in the new regression, run with the new data
point, relative to the standard error of the residuals in the original regression?

(a) It increases.
(b) It decreases.
(c) It remains the same.
(d) Cannot tell based on the information given.
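
The new point in questions 7 and 8 satisfies 3 + 2 × 2 = 7, so it lies exactly on the fitted line. The sketch below illustrates what refitting does in that situation, using hypothetical toy data (not from the question) and a hand-written least squares helper named `fit_line` for illustration.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept, slope and residual SD for y = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))  # residual SD uses n - 2 degrees of freedom
    return b0, b1, s_e

# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.2, 6.7, 9.4, 10.8, 13.1]

b0, b1, s_e = fit_line(xs, ys)

# Add one point that lies exactly on the fitted line, then refit.
x_new = 2.0
xs2, ys2 = xs + [x_new], ys + [b0 + b1 * x_new]
b0_new, b1_new, s_e_new = fit_line(xs2, ys2)

print(b1, b1_new)    # slopes agree up to floating-point rounding
print(s_e, s_e_new)  # same SSE divided by one more degree of freedom
```

The original line still minimizes the sum of squared residuals because the added point contributes a residual of zero.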

9. The p-value for testing whether the slope equals 0 in a simple regression is 0.45. Then

(a) H0: β1 = 0 should be retained.


(b) the data suggests that the predictor x is not helpful in predicting the response y.
(c) the slope is less than 1 SE from zero.
(d) all the above are correct.

10. The following results were obtained from a simple regression analysis:

𝑦̂ = 27.2895 − 1.2024𝑥

𝑟² = 0.6744, 𝑠ₑ² = 0.2934

For each unit change in the independent variable x, the estimated change in the mean value of the dependent
variable y is equal to

(a) –1.2024 (b) 0.6774 (c) 37.2895 (d) 0.2934

11. Which assumption of SRM is violated in the residual plot at the right?

(a) The relationship is linear


(b) The random errors are independent
(c) The random errors have equal variance
(d) The random errors are normally distributed.

12. The scatter plot of sales (𝑦) of half-gallon orange juice versus the price (𝑥) is given below. We
apply a log transformation to both 𝑦 and 𝑥 to fit the nonlinear pattern.

Transformed Fit Log to Log

log(𝑦𝑖 ) = 4.81 − 1.75 × log(𝑥𝑖 )

Summary of Fit
RSquare                       0.755
RSquare Adj                   0.750
Root Mean Square Error        0.386
Mean of Response              3.136
Observations (or Sum Wgts)    50

Parameter Estimates
Term          Estimate   Std Error   t Ratio   Prob>|t|
Intercept     4.8        0.148       32.50     <.0001*
Log(Price)    -1.75      0.144       -12.17    <.0001*

Assuming the transformed 𝑥 and 𝑦 satisfy the SRM, which of the following statements is correct?

(a) As the price increases by 1%, the sales decrease by 1.75% on average.
(b) As the price increases by $1, the sales decrease by 1.75 units on average.
(c) As the price increases by 1%, the sales decrease by 1.75 units.
(d) As the price increases by $1, the sales decrease by 175%.
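
To see how a log-log slope translates into percentage changes, the sketch below compares the exact change implied by the fitted equation with the usual approximation that the slope is the percentage change in 𝑦 per 1% change in 𝑥. The helper name `pct_change_in_y` is hypothetical.

```python
def pct_change_in_y(slope, pct_change_in_x):
    """Predicted % change in y under log(y) = b0 + slope * log(x)."""
    factor = (1 + pct_change_in_x / 100) ** slope
    return (factor - 1) * 100

# Slope taken from the fitted log-log equation above.
exact = pct_change_in_y(-1.75, 1.0)  # exact implied change for a 1% price rise
approx = -1.75 * 1.0                 # "slope = elasticity" approximation

print(round(exact, 3), approx)
```

The two figures are close for small percentage changes, which is why the slope of a log-log fit is read as an elasticity. The intercept cancels out of the ratio, so it does not appear in the helper.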

13. The heights (y) of 50 men and their shoe sizes (x) were obtained. The variable height is measured
in centimetres and the shoe sizes of these 50 men ranged from 8 to 11. From these 50 pairs of
observations, the least squares regression line predicting height from shoe size was computed to
be 𝑦̂ = 130.455 + 4.7498𝑥. What height would you predict for a man with a shoe size of 13?

(a) 130.46cm
(b) 192.20cm
(c) 182.70cm
(d) I would not use this regression line to predict the height of a man with a shoe size of 13.
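
The sketch below evaluates the fitted line at a shoe size of 13 and, separately, checks whether that value lies inside the observed range of 8 to 11. It assumes the intercept is 130.455, the reading consistent with answer option (a); the function name is illustrative.

```python
X_MIN, X_MAX = 8, 11      # observed shoe-size range stated in the question
B0, B1 = 130.455, 4.7498  # assumed reading of the fitted coefficients

def predict_height(shoe_size):
    """Return (predicted height in cm, whether the input is inside the observed range)."""
    inside = X_MIN <= shoe_size <= X_MAX
    return B0 + B1 * shoe_size, inside

height, inside = predict_height(13)
print(round(height, 2), inside)
```

Returning the range flag alongside the number makes it explicit when a prediction is an extrapolation rather than an interpolation.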

[14-16] A large national bank charges local companies for using their services. A bank officer reported the
results of a regression analysis designed to predict the bank's charges (𝑌), measured in dollars per month,
for services rendered to local companies. One explanatory variable used to predict service charge to a
company is the company's sales revenue (𝑋), measured in millions of dollars. Data for 21 companies who
use the bank's services were used to fit the model. The results of the simple linear regression are provided
below.

(1) 𝑦̂ = − 2,700 + 20𝑥


(2) RMSE of residuals is 65.
(3) p-value for testing 𝛽1 = 0 is 0.034.

14. Interpret the estimate of the standard deviation of the error terms.

(a) About 95% of the observed service charges fall within $65 of the least squares line.
(b) About 95% of the observed service charges equal their corresponding predicted values.
(c) About 95% of the observed service charges fall within $130 of the least squares line.
(d) For every $1 million increase in sales revenue, we expect a service charge to increase $65.

15. Interpret the p-value for testing whether 𝛽1 = 0.

(a) There is sufficient evidence (at α = 0.05) to conclude that sales revenue (X) is a useful
linear predictor of service charge (Y).
(b) There is insufficient evidence (at α = 0.05) to conclude that sales revenue (X) is a useful
linear predictor of service charge (Y).
(c) Sales revenue (X) is a poor predictor of service charge (Y).
(d) For every $1 million increase in sales revenue, we expect a service charge to increase
$0.034.
16. A 95% confidence interval for 𝛽1 is [15, 30]. Interpret the interval.

(a) We are 95% confident that the mean service charge will fall between $15 and $30 per month.
(b) We are 95% confident that the sales revenue (X) will increase between $15 and $30 million for
every $1 increase in service charge (Y).
(c) We are 95% confident that average service charge (Y) will increase between $15 and $30 for every
$1 million increase in sales revenue (X).
(d) At the α= 0.05 level, there is no evidence of a linear relationship between service charge (Y) and
sales revenue (X).

[17-18] A medium-sized business has a policy that keeps its weekly advertising budget within the range
from $2000 to $6000. The marketing manager has collected data from a sample of weeks, recording the
amount spent on advertising (ADV) and the revenue (REV) for each week. The amounts spent on
advertising are recorded in thousands of dollars. Revenue amounts are in actual dollars. After examining
the data, the manager decides to use a (natural) log transformation on both variables in order to derive a
regression line. The log-log equation is determined to be:

Estimated log 𝑅𝐸𝑉 = 6.8 + 2.2 log 𝐴𝐷𝑉

17. In the context of this application, elasticity refers to which of the following?

(a) The slope of the line 𝑏1 = 2.2


(b) How the absolute change in REV relates to the absolute change in ADV
(c) The size of the intercept 𝑏0 = 6.8
(d) The fact that the original data collected for revenue and advertising dollars did not meet the
linear condition for regression

18. What percentage increase in revenue would be predicted for a 0.5% increase in dollars spent on
advertising?

(a) 1.1% (b) 1.78% (c) 0.354% (d) 0.72%


19. The elasticity of y with respect to x describes

(a) how small percentage changes in x are associated with small percentage changes in y
(b) how small percentage changes in y are associated with small percentage changes in x.
(c) how changes in y affect changes in x.
(d) how changes in x affect changes in y.

[20-21] It is believed that the average number of hours spent studying per day (HOURS) during
undergraduate education should have a positive linear relationship with the starting salary (SALARY,
measured in thousands of dollars per month) after graduation. Given below is the output from regressing
starting salary on number of hours spent studying per day for a sample of 51 students.

R Square          0.7845
Standard Error    1.3704
Observations      51

             Coefficients   Standard Error   t Stat     p-value
Intercept    -1.8940        0.4018           -4.7134    2.051E-05
HOURS        0.9795         0.0733           13.3561    5.944E-18

20. What is the value of the test statistic for testing whether average SALARY is linearly related to
HOURS?

(a) -4.7134 (b) -1.8940 (c) 0.9795 (d) 13.3561

21. The 90% confidence interval for the average change in SALARY (in thousands of dollars) as a
result of spending an extra hour per day studying is

(a) wider than [-2.70, -1.09].


(b) narrower than [-2.70, -1.09].
(c) wider than [0.83, 1.13].
(d) narrower than [0.83, 1.13].
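
The relationship between the coefficient, its standard error, the t statistic, and interval width can be sketched as follows. This is a normal-approximation sketch: with 49 degrees of freedom the exact t quantiles are slightly larger than the z quantiles used here, so the true intervals are a little wider. The helper name `approx_ci` is hypothetical.

```python
from statistics import NormalDist

coef, se = 0.9795, 0.0733  # HOURS row of the output above

# t statistic for H0: slope = 0; close to the reported 13.3561 (inputs are rounded).
t_stat = coef / se

def approx_ci(level):
    """Normal-approximation confidence interval for the slope."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef - z * se, coef + z * se

ci90 = approx_ci(0.90)
ci95 = approx_ci(0.95)
print(round(t_stat, 2), ci90, ci95)
```

Lowering the confidence level shrinks the quantile and therefore the interval, whichever reference distribution is used.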
[22-25] A large company employs several thousand people in the manufacture of keyboards, equipment
cases, and cables for the small-computer industry. The personnel manager of the company would like to
find ways to forecast the absent rate among the company employees. An effective method of forecasting
would greatly strengthen the ability to plan properly. He took a sample of 40 employees and recorded the
number of absent days (Y) during the last fiscal year along with employee age (X). The computer output of
a regression analysis is as follows.

22. An employee, John, is 30 years old. According to the regression equation, what is his expected
number of absent days in the coming fiscal year?

(a) 3.34 (b) 4.56 (c) 5.34 (d) 6.56

23. Test whether the regression coefficient 𝛽1 of age is larger than 0 at the 5% significance level.

(a) p-value is 0.002 and we reject the null 𝐻0 : 𝛽1 ≤ 0


(b) p-value is 0.998 and we cannot reject the null 𝐻0 : 𝛽1 ≤ 0
(c) p-value is 0.000 and we reject the null 𝐻0 : 𝛽1 ≤ 0
(d) p-value is 0.47 and we cannot reject the null 𝐻0 : 𝛽1 ≤ 0

24. Find a 95% confidence interval for the regression coefficient of age.

(a) [0.1056, 0.4286] (b) [0.2006, 0.3104] (c)[0.1961, 0.3114 ] (d) [0.0056, 0.4286]

25. The sample mean and sample standard deviation for age are 37.87 and 10.39, respectively. Find a
95% prediction interval for the absent days of a 30-year-old employee.

(a) [2.907, 3.773] (b) [2.907, 4.025] (c) [1.998, 4.025] (d) [1.998, 3.773] (e) [1.054, 5.625]
[26-27] An insurance agent has selected a sample of drivers that she insures whose ages are in the range
from 16 to 42 years. For each driver, she records the age of the driver (𝑥) and the dollar amount of claims
(𝑦) that the driver filed in the previous 12 months. A scatterplot showing the dollar amount of claims as
the response and the age as the predictor shows a linear trend. The least squares regression line is
determined to be: 𝑦̂ = 3715 − 75.4𝑥. A plot of the residuals versus age of the drivers showed no pattern,
and the following were reported: 𝑟² = 0.822 and the standard deviation of the residuals is 312.1.

26. Which of the following is correct?

(a) If the age of a driver increases from 20 to 21, the dollar amount of claims is expected to
decrease by $75.4 on average.
(b) If the age of a driver increases by one year, the dollar amount of claims is predicted to
increase by $3715.
(c) One can use the least squares regression line to obtain a reliable prediction of the dollar
amount of claims for a driver whose age is 55 years.
(d) The dollar amount of claims for a 10-year-old driver is expected to be $2961.

27. Which of the following is false?

(a) 82.2% of the variation in the dollar amounts of claims is explained by the age of the driver.
(b) The correlation coefficient, 𝑟, between the response and the predictor is 0.907.
(c) If there are 38 drivers included in this model, then 𝑆𝑆𝐸 = 3506631 (SSE: sum of squared
residuals).
(d) If the unit of 𝑦 changes from dollars to thousands of dollars, 𝑟² remains unchanged.
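
Two of the quantities in question 27 follow from standard identities: the correlation has magnitude equal to the square root of 𝑟² and takes the sign of the slope, and the residual standard deviation satisfies 𝑠ₑ² = SSE / (𝑛 − 2). The sketch below applies both, assuming the reported residual standard deviation uses the 𝑛 − 2 divisor.

```python
import math

r2, s_e, slope = 0.822, 312.1, -75.4  # values reported in the passage
n = 38                                # sample size supposed in option (c)

# The correlation carries the sign of the fitted slope.
r = math.copysign(math.sqrt(r2), slope)

# s_e^2 = SSE / (n - 2), so SSE = s_e^2 * (n - 2).
sse = s_e ** 2 * (n - 2)

print(round(r, 3), round(sse))
```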

28. A regression analysis between sales (in $1,000) and advertising (in $1,000) resulted in the following
least squares line: Sales = 80 + 5Advertising . This implies that

(a) as advertising increases by $1,000, sales increase on average by $5,000.
(b) as advertising increases by $1,000, sales increase by $5,000.
(c) as advertising increases by $1,000, sales decrease on average by $5,000.
(d) as advertising increases by $1,000, sales increase on average by $80,000.
(e) as advertising increases by $5, sales increase on average by $80.

29. If a test of hypothesis has a Type I error probability (α) of 0.01, it means that
(a) if the null hypothesis is true, you don't reject it 1% of the time.
(b) if the null hypothesis is true, you reject it 1% of the time.
(c) if the null hypothesis is false, you don't reject it 1% of the time.
(d) if the null hypothesis is false, you reject it 1% of the time.

30. In a study of the association between car mileage (miles per gallon, mpg) and car weight, the
association is found to be curved. To make the association linear, the response is changed to
100 times the reciprocal of the mileage. The scatterplot of the new response
versus car weight (in thousands of pounds) is shown below.

A least-squares linear regression is fitted to the transformed variables, and yields the following
equation:
Estimated new response = 0.95 + 1.25 × Weight (thousands of lbs).

Based on the equation, what's the predicted mileage (measured in mpg) for a car of weight 5,000
pounds?

(a) 6251
(b) 0.016
(c) 7.2
(d) 13.89
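
Because the regression is fitted to a transformed response, a prediction on the original scale requires undoing the transformation. The sketch below assumes the new response is exactly 100 / mpg, as the question describes; the function name is illustrative.

```python
def predicted_mpg(weight_thousand_lbs):
    """Predict the transformed response, then invert 100 / mpg to recover mpg."""
    new_response = 0.95 + 1.25 * weight_thousand_lbs
    return 100 / new_response

mpg = predicted_mpg(5.0)  # a 5,000 lb car is 5.0 in thousands of pounds
print(round(mpg, 2))
```

The weight must be entered in thousands of pounds, the unit used when fitting the equation.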

31. For a given sample size n, if the level of significance (𝛼) is decreased, the probability of a Type II
error of the test (𝛽)

(a) will increase.


(b) will decrease.
(c) will remain the same.
(d) cannot be determined.
32. The normal quantile plot of residuals from a regression equation in the plot below suggests that

(a) The fitted equation is linear.


(b) The R-squared statistic is about 0.9 or more.
(c) The distribution of residuals is close to a normal distribution.
(d) The data in the sample are dependent.

33. If the role of the explanatory variable and the response variable are switched in a regression and
correlation situation, which of the following would stay the same?

(a) The slope of the regression line.


(b) The intercept of the regression line.
(c) The correlation between the two variables.
(d) None of the above would stay the same.
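
The effect of switching the roles of the two variables can be checked numerically. The sketch below uses hypothetical toy data and a hand-written helper, `slope_and_r`, that returns the least squares slope and the sample correlation.

```python
import math

def slope_and_r(xs, ys):
    """Least squares slope of ys on xs, and the sample correlation."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.5]

slope_yx, r_yx = slope_and_r(xs, ys)  # regress y on x
slope_xy, r_xy = slope_and_r(ys, xs)  # regress x on y
print(slope_yx, slope_xy, r_yx, r_xy)
```

The correlation formula is symmetric in the two variables, while the slope divides by the spread of whichever variable is treated as the predictor. The product of the two slopes equals 𝑟².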

34. A statistics professor used 𝑋 = “number of class days attended” (out of 30) as an explanatory
variable to predict 𝑌 = “score received on final exam” for a class of his students. The resulting
regression equation was 𝑌̂ = 39.4 + 1.4𝑋. Which of the following statements is true?

(a) If attendance increases by 1.4 days, the expected exam score will increase by 1 point
(b) If attendance increases by 1 day, the expected exam score will increase by 39.4 points
(c) If attendance increases by 1 day, the expected exam score will increase by 1.4 points
(d) If the student does not attend at all, the expected exam score is 1.4.

35. Given the regression equation 𝑌̂ = −4.3 + 5.9𝑋, which of the following statements is incorrect?

(a) The 𝑟² value could be less than 0.


(b) The correlation between 𝑥 and 𝑦 is positive.
(c) The slope of the line is 5.9.
(d) Given an 𝑥 value of 2, the predicted value of 𝑦 is 7.5.
(e) The intercept value is less than 0.

36. What does a residual represent?

(a) The difference between the actual 𝑌 values and the mean of 𝑌.
(b) The difference between the actual 𝑌 values and the predicted values of 𝑌.
(c) The square root of the slope.
(d) The predicted value of 𝑌 for the average 𝑋 value.

37. What must be correct about a simple linear regression model?

(a) The explanatory variable must come from a normal distribution.


(b) The response must come from a normal distribution.
(c) The errors are assumed to have a normal distribution.
(d) The errors are not independent.

38. In simple linear regression, the least squares estimate of the y-intercept (𝑏0) represents the

(a) estimated average 𝑌 when 𝑋 = 0.


(b) change in estimated average 𝑌 per unit change in 𝑋.
(c) predicted value of 𝑌.
(d) variation around the sample regression line.

39. The coefficient of determination (𝑟²) of a fitted simple regression model tells us

(a) that the coefficient of correlation (r) is larger than 1.


(b) whether r has any significance.
(c) that we should not partition the total variation.
(d) the proportion of total variation that is explained by the model.

40. Which of the following assumptions concerning the probability distribution of the random error
term is stated incorrectly?

(a) The distribution is normal.


(b) The mean of the distribution is 0.
(c) The variance of the distribution increases as X increases.
(d) The errors are independent.

41. If the correlation coefficient (𝑟) = 1.00, then


(a) The y-intercept (𝑏0) must equal 0.
(b) The explained variation equals the unexplained variation.
(c) There is no unexplained variation.
(d) There is no explained variation.
42. The strength of the linear relationship between two numerical variables may be measured by the
(a) scatter diagram.
(b) coefficient of correlation.
(c) slope.
(d) Y intercept.

[43-44] Each worker at an assembly plant that produces clock radios is responsible for the entire assembly
of each unit they work on. The plant manager has collected data from a sample of workers: the number of
years (YRS) of experience at the plant, and the number of hours per unit (TIME) required for assembly.
The scatterplot of TIME versus YRS is shown below.

43. Which of the following is an appropriate reason why a regression line should not be used to make
predictions based on the data?

(a) The magnitude of the slope of the line is too large


(b) The intercept of the fitted line has no practical interpretation in this context
(c) The linear condition for simple regression does not appear to be met
(d) The association between TIME and YRS is negative
44. The manager has decided to transform the response variable from TIME (hours/unit) to 1/TIME
(units/hour). The scatterplot of 1/TIME versus YRS is shown below.

Which of the following is an appropriate interpretation of these results?

(a) The unit of 𝑠ₑ is hours per unit


(b) More experienced workers are predicted to produce more units per hour on average than
less experienced workers
(c) Because the transformed model has a higher 𝑅², it is better.
(d) The slope measures the elasticity between 1/TIME and YRS
45. It is believed that GPA (grade point average, based on a four-point scale) should have a positive
linear relationship with ACT scores. Given below is the Excel output from regressing GPA on ACT
scores using a data set of 8 randomly chosen students from a Big Ten university.

Regression of GPA on ACT

            Coefficients   Standard Error   t Stat   p-value   Lower 95%   Upper 95%
Intercept   0.5681         0.9284           0.6119   0.5630    -1.7036     2.8398
ACT         0.1021         0.0356           2.8633   0.0286    0.0148      0.1895

What is the predicted average value of GPA when ACT = 20?

(a) 2.61
(b) 2.66
(c) 2.80
(d) 3.12
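
Reading a regression output table comes down to two small calculations: plugging a predictor value into the fitted line, and confirming that the t statistic is the coefficient divided by its standard error. The sketch below does both with the numbers from the output above; small differences from the reported t Stat reflect rounding of the printed values.

```python
b0, b1 = 0.5681, 0.1021  # Intercept and ACT coefficients from the output above
se_b1 = 0.0356           # standard error of the ACT coefficient

gpa_at_20 = b0 + b1 * 20  # point prediction at ACT = 20
t_act = b1 / se_b1        # should roughly match the reported t Stat

print(round(gpa_at_20, 2), round(t_act, 2))
```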

46. In a simple linear regression problem, 𝑟 and 𝑏1

(a) may have opposite signs.


(b) must have the same sign.
(c) must have opposite signs.
(d) are equal.

47. In a simple linear regression, the least squares regression line is


(a) the line which makes the sample correlation as close to +1 or −1 as possible.
(b) the line which best splits the data in half, with half of the data points lying above the
regression line and half of the data points lying below the regression line.
(c) the line which minimizes the sum of squared residuals.
(d) the line which minimizes the number of points that it does not pass through.
48. A least squares regression line is determined from a sample of values for variables x and y, where x
is the size of a listed home (in square feet), and 𝑦 is the selling price of the home. Which of the
following statements is true concerning the fitted line 𝑦̂ = 𝑏0 + 𝑏1𝑥?

(a) If there is a positive correlation r between x and y, then the slope b1 must also be positive.
(b) The units on the intercept b0 and the slope b1 will be the same as the units on the variable y.
(c) If 𝑟² = 0.85, then it is appropriate to conclude that a change in x will cause a change in y.
(d) None of the above is true.

[49-50] A least-squares linear regression is fitted to a data set, and the residual plot is shown below.

49. Which of the following is correct?

(a) A linear model is okay because the association between the two variables is fairly strong.
(b) The linear model is not good because the correlation between the response and the predictor is
near 0.
(c) The linear model is not good because some residuals are large.
(d) The linear model is not good because of the curve in the residuals.

50. If one uses the least-squares linear regression to make predictions, which of the following statements
is true?

(a) The predictions tend to be too high for large 𝑥's.


(b) The predictions tend to be too high for intermediate 𝑥's.
(c) The predictions tend to be too high for small 𝑥's.
(d) None of the above is correct.
