QM-II Midterm OCT 2014 Solution


Quantitative Methods II

Mid-Term Examination (Sections B, C, D, E and F)


Monday, October 20, 2014
Time : 180 minutes
Total No. of Pages : 18

Name ________________________

Total No. of Questions: 3

Roll No. ________________________

Total marks: 51

Section ________________________

Instructions
1. This is a closed book exam. You are NOT allowed to use text book and class notes.
2. Answer all questions only in the space provided following the question.
3. Show all work and give adequate explanations to get full credit.
4. You may use the backside of the last page for rough work only if needed. Do NOT attach any rough work/sheets.
5. Encircle or underline your final answer for each part.
6. No clarifications will be made during the exam.
7. Assume 95% confidence level if necessary (α = 0.05).
8. Use approximate critical values for Z, t, F, and χ² tests if the exact value is not available in the tables attached with the question paper.

Question Number    Max Marks    Marks Scored
Q1                 16
Q2                 18
Q3                 17
Total

Question 1 (18 points)


The yearly US Sales of domestically produced cars is collected for the period 1970-1999, along with the
data on the following:
PriceIndex - CPI for Transportation
Income - Total Disposable Income in the US (billions of dollars)
Interest - Prime Interest Rate (%) Charged by Banks
The correlation matrix for these variables is given below (only the entries shown are available):

              Year       Sales      PriceIndex   Income     Interest
Year          1
Sales         -0.5453    1.0000
PriceIndex               -0.6089    1.0000
Income                   -0.5033                 1.0000
Interest                 -0.3842                            1.0000

SPSS was used to carry out Stepwise Regression in order to predict Sales. The summary of the models
fitted in the first 2 steps, the ANOVA table and Coefficients table obtained are given below.
Model Summary

Model    R       R Square    Adjusted R Square    Std. Error of the Estimate
1
2        .708    .502        .465                 760459.004

a. Predictors: (Constant), ___________
b. Predictors: (Constant), ___________, Interest
c. Dependent Variable: Sales

ANOVA

Model              Sum of Squares    df    Mean Square    F    Sig.
1   Regression
    Residual
    Total
2   Regression     1.571E13
    Residual       1.561E13
    Total          3.132E13

Coefficients

                     Unstandardized Coefficients     Standardized Coefficients
Model                B               Std. Error      Beta                         t          Sig.
1   (Constant)       9102897.600     433224.149                                   21.012     .000
    ___________      -17258.553      4248.694        -.609                        -4.062     .000
2   (Constant)       1.023E7         576427.867                                   17.740     .000
    ___________      -16873.654      3853.737        -.595                        -4.379     .000
    Interest         -124592.212     46820.081       -.362                        -2.661     .013

a. Dependent Variable: Sales

Question 1.1
a) What is the predictor variable used in Model 1? Explain clearly. (1 point)

PriceIndex.
Since it has the largest correlation in absolute value with Sales (|-0.6089| = 0.6089), it will be the first variable to enter the regression equation and will give the highest value of R^2 among the single-predictor models.

b) What proportion of variation in Sales does this predictor variable explain in Model 1? Explain clearly. (1 point)

With a single predictor X, R^2 = [Corr(X, Y)]^2 = (-0.6089)^2 = 0.3708, so PriceIndex explains about 37.08% of the variation in Sales.

c) What is the Std. Error of the Estimate for Model 1? Explain clearly. (2 points)

R^2 = 1 - SSE/SST
Therefore, SSE = SST * (1 - R^2) = 3.132 × 10^13 * (1 - 0.3708) = 1.97 × 10^13
Standard Error (se) = sqrt(SSE / (n - k - 1)) = 838930.95 (n = 30, k = 1)
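The arithmetic for parts (b) and (c) can be reproduced with a short sketch. It assumes only the figures quoted in the paper (the correlation -0.6089, SST = 3.132E13, n = 30); the variable names are illustrative. Because the script carries the unrounded R^2 forward, its standard error differs slightly from the 838930.95 obtained above with R^2 rounded to 0.3708.

```python
import math

# Inputs taken from the exam paper
r_sales_price = -0.6089   # correlation between Sales and PriceIndex
sst = 3.132e13            # total sum of squares from the ANOVA table
n, k = 30, 1              # 30 yearly observations (1970-1999), one predictor

# With a single predictor, R^2 is the squared simple correlation
r_squared = r_sales_price ** 2

# R^2 = 1 - SSE/SST, so SSE = SST * (1 - R^2)
sse = sst * (1 - r_squared)

# Standard error of the estimate: sqrt(SSE / (n - k - 1))
std_error = math.sqrt(sse / (n - k - 1))

print(f"R^2 = {r_squared:.4f}")
print(f"SSE = {sse:.3e}")
print(f"Std. error of the estimate = {std_error:.2f}")
```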

Question 1.2
a) What is the magnitude of the semipartial (or part) correlation for the variable Interest in Model 2? Explain. (1 point)

When the second variable, Interest, is added, the squared semipartial correlation equals the increase in R^2:
sr^2(Interest, Sales) = R^2 (new model with 2 variables) - R^2 (old model with 1 variable) = 0.502 - 0.3708 = 0.1312
Therefore, the magnitude of sr(Interest, Sales) = sqrt(0.1312) = 0.3622 (Note: from this calculation alone we cannot say whether the semipartial correlation is -ve or +ve).
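As a quick check, the same calculation can be sketched in a few lines. The two R^2 values are the ones used in the solution; the variable names are illustrative only.

```python
import math

r2_model1 = 0.3708   # R^2 with PriceIndex only (Question 1.1 b)
r2_model2 = 0.502    # R^2 with PriceIndex and Interest (Model Summary)

# The squared semipartial correlation of the added variable
# equals the increase in R^2 when it enters the model.
sr_squared = r2_model2 - r2_model1
sr_magnitude = math.sqrt(sr_squared)

print(f"sr^2 = {sr_squared:.4f}, |sr| = {sr_magnitude:.4f}")
```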

b) Carry out an appropriate test, at 95% confidence level, to determine if Model 2 as a whole is valid (significant). State the null and alternate hypotheses and show all work. (2 points)

$H_0: \beta_1 = \beta_2 = 0$
$H_1:$ At least one $\beta$ value is not zero.

MSR = SSR/k = 1.571E13 / 2 = 7.855 × 10^12
MSE = SSE/(n-k-1) = 1.561E13 / 27 = 5.78 × 10^11 (n = 30, k = 2)
F_val = MSR/MSE = 13.58996
F_crit = F(0.05; 2, 27) = 3.35
Since F_val > F_crit, we reject H0.
Hence the model as a whole is valid.
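A minimal sketch of the overall F-test follows. The sums of squares come from the ANOVA table and the critical value 3.35 is the table value quoted above (it is hard-coded, not computed). The script uses an unrounded MSE, so its F statistic differs from 13.58996 in the third decimal.

```python
# Overall F-test for Model 2 (two predictors)
ssr = 1.571e13   # regression sum of squares
sse = 1.561e13   # residual sum of squares
n, k = 30, 2     # observations and number of predictors

msr = ssr / k                # mean square regression
mse = sse / (n - k - 1)      # mean square error, 27 degrees of freedom
f_value = msr / mse

f_crit = 3.35                # F(0.05; 2, 27) as read from the tables
reject_h0 = f_value > f_crit

print(f"F = {f_value:.3f}, F_crit = {f_crit}, reject H0: {reject_h0}")
```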

c) Given no change in the other significant explanatory variables, can it be concluded from Model 2 that Interest has a higher impact on Sales than the other variable used in the model? Explain clearly. (1 point)

No.
The standardized beta for Interest is -0.362, while the standardized beta for PriceIndex is -0.595.
This implies that a one-SD change in Interest has a smaller impact on Sales than a one-SD change in PriceIndex.

Question 1.3: Can it be concluded, at 95% confidence level, that an increase in Interest rate by 5% decreases yearly Sales by at least 250000 units or more? Show all work. (3 points)

Equivalently: do Sales decrease by at least 250000/5 = 50000 units per 1% increase in Interest?

$H_0: \beta_2 \ge -50000$
$H_1: \beta_2 < -50000$
t_val = (-124592.212 + 50000) / 46820.081 = -1.593
t_crit = t(0.05, 27) = -1.703 (one-tailed, lower tail)
As t_val > t_crit, we cannot reject H0. Hence, we cannot conclude that an increase in Interest by 1% will decrease Sales by at least 50000 units (that is, by at least 250000 units for a 5% increase).
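The one-tailed t-test above can be sketched as follows. The coefficient and standard error are the Model 2 values for Interest; the critical value -1.703 is the table value quoted in the solution and is hard-coded here.

```python
# One-tailed test of H0: beta_2 >= -50000 against H1: beta_2 < -50000
b_interest = -124592.212     # unstandardized coefficient of Interest (Model 2)
se_interest = 46820.081      # its standard error
hypothesized = -250000 / 5   # required change in Sales per 1% change in Interest

t_value = (b_interest - hypothesized) / se_interest

t_crit = -1.703              # t(0.05, 27), lower tail, from the tables
reject_h0 = t_value < t_crit

print(f"t = {t_value:.3f}, t_crit = {t_crit}, reject H0: {reject_h0}")
```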

Question 1.4: What can you say about the relationship between Interest and the other predictor variable used in Models 1 and 2? Explain clearly. (2 points)

The estimate for PriceIndex in Model 1 (-17258.553) becomes less negative in the presence of Interest in Model 2 (-16873.654).
Therefore, omitting Interest from Model 1 biases the PriceIndex estimate in the negative direction.
Also, Interest has a negative relationship with Sales (Model 2).
Since PriceIndex, when proxying for Interest, was picking up some of this negative relationship (resulting in the lower value of its estimate), PriceIndex must have a positive relationship with Interest.
Question 1.5: The partial correlations of the excluded variables after Model 2 was fitted are 0.184 and 0.246, and the corresponding part correlations (semi-partial correlations) are 0.077 and 0.098 respectively. Conduct an appropriate test, at 95% confidence level, to determine if one of these excluded variables should be added to the regression model. State the null and alternate hypotheses and show all work. (3 points)

We need to use the partial F-test to see if the variable should be included.
F_val = (Change in R^2 / # of added variables) / ((1 - R^2(Full model)) / (n-k-1))
where Change in R^2 = R^2(Full model with 3 variables) - R^2(Reduced model with 2 variables), # of added variables = 1, and n-k-1 = 30-3-1 = 26.
Let us check the variable with the larger partial correlation (0.246), whose part correlation is 0.098. The change in R^2 equals the squared part correlation, so
F_val = (0.098)^2 / ((1 - (0.502 + (0.098)^2)) / 26)
= 0.51127
F_crit = 3.84 (used by SPSS) or F_crit = F(0.05; 1, 26) = 4.23
Since F_val < F_crit, the excluded variable should not be added to the model.
As the variable with the largest partial correlation is not included, the other variable must not be included either.
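The partial F statistic can be sketched as a small helper. The function name `partial_f` and its parameters are illustrative assumptions; the inputs are the part correlations and R^2 given in the question, and the critical value 4.23 is the table value quoted above.

```python
def partial_f(part_corr, r2_reduced, n, k_full, added=1):
    """Partial F statistic for adding `added` variables to a model.

    The increase in R^2 from adding one variable equals its squared
    part (semipartial) correlation.
    """
    delta_r2 = part_corr ** 2
    r2_full = r2_reduced + delta_r2
    return (delta_r2 / added) / ((1 - r2_full) / (n - k_full - 1))


# Candidate with the larger partial correlation (part correlation 0.098)
f_larger = partial_f(0.098, r2_reduced=0.502, n=30, k_full=3)
# The other candidate (part correlation 0.077)
f_smaller = partial_f(0.077, r2_reduced=0.502, n=30, k_full=3)

f_crit = 4.23   # F(0.05; 1, 26) from the tables
print(f"F (0.098) = {f_larger:.4f}, F (0.077) = {f_smaller:.4f}, F_crit = {f_crit}")
```

Both statistics fall well below the critical value, which matches the conclusion that neither excluded variable should enter the model.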

Question 2:

Table 1. Regression Coefficients

                Unstandardized Coefficients     Standardized Coefficients
Model           B              Std. Error       Beta                         t          Sig.
(Constant)      -12738.581     200.801                                       -63.439    .000
Carat           18381.261      141.733

a. Dependent Variable: Price

Table 3. Coefficients

                Unstandardized Coefficients     Standardized Coefficients
Model           B              Std. Error       Beta                         t          Sig.
(Constant)      7.265          .011                                          682.302    .000
Carat           1.375          .008             .921                         183.004    .000

a. Dependent Variable: ln(Price)

$$\hat{Y}_i \pm t_{\alpha/2,\,n-2}\; S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}}$$

Table 4. Coefficients

                Unstandardized Coefficients    Standardized Coefficients                         Collinearity Statistics
Model           B          Std. Error          Beta                         t           Sig.     Tolerance    VIF
(Constant)      7.346      .012                                             607.930     .000
Carat           1.392      .009                .932                         158.992     .000     .705         1.418
Good            -.211      .016                -.096                        -13.549     .000     .484         2.064
V Good          -.134      .013                -.093                        -10.143     .000     .290         3.454
Carat_I         -.044      .009                -.046                        -4.758      .000     .257         3.892

a. Dependent Variable: ln(Price)


Table 5. Excluded Variables

                                                              Collinearity Statistics
Model        Beta In    t         Sig.    Partial Correlation    Tolerance    VIF       Minimum Tolerance
Ideal        .015       .878      .380    .011                   .085         11.697    .085
Carat_G      -.023      -1.263    .207    -.016                  .074         13.505    .074
Carat_VG     -.023      -1.263    .207    -.016                  .074         13.505    .074

Here we are looking at the effect for a given carat, so the second term (marked blue) is fixed. The third term, i.e. the variable Good (marked red), has the maximum impact on the log of price.
The same can be concluded from the standardized beta value of Good, which is -0.096 and is lower (more negative) than the others.

Question 3:

A large grocery store in the US wishes to understand the key drivers that determine the amount spent per transaction by their customers. Therefore, it obtained a random sample of 4000 transactions with information on the amount spent (Revenue), the product category on which the transaction was made (Product Family), the annual income of the customer (Annual Income), the number of children in the household the customer belongs to, and finally whether the customer is a homeowner (Homeowner).
In order to enable regression analysis, the following indicator (dummy) variables were created:
Own_Hm = 1 (Yes to Homeowner), 0 otherwise,
Ann_Inc2 = 1 (Annual Income in the range $30K - $50K), 0 otherwise
Ann_Inc3 = 1 (Annual Income in the range $50K - $70K), 0 otherwise
Ann_Inc4 = 1 (Annual Income in the range $70K - $90K), 0 otherwise
Ann_Inc5 = 1 (Annual Income in the range $90K and above), 0 otherwise
Prod_Fam2 = 1 (Product Family is Drink), 0 otherwise
Prod_Fam3 = 1 (Product Family is Non-Consumable), 0 otherwise.

The following outputs were generated using this data:


Regression Output 1 (Revenue($) Response Var)

Regression Statistics
Multiple R             0.0340
R Square               0.0012
Adjusted R Square      0.0002
Standard Error         8.1499
Observations           4000.0000

ANOVA
               df           SS             MS         F         Significance F
Regression     4.0000       306.4494       76.6123    1.1535    0.3294
Residual       3995.0000    265348.3672    66.4201
Total          3999.0000    265654.8166

               Coefficients    Standard Error    t Stat     P-value
Intercept      12.6841         0.2786            45.5352    0.0000
Ann_Inc2       0.2787          0.3588            0.7766     0.4374
Ann_Inc3       0.7617          0.4106            1.8551     0.0637
Ann_Inc4       0.4524          0.4712            0.9602     0.3370
Ann_Inc5       -0.0130         0.4229            -0.0307    0.9755

Regression Output 2 (Revenue($) Response Var)

Regression Statistics
Multiple R             0.059
R Square               0.004
Adjusted R Square      0.003
Standard Error         8.137
Observations           4000.000

ANOVA
               df          SS            MS         F         Significance F
Regression     1.000       933.834       933.834    14.103    0.000
Residual       3998.000    264720.983    66.213
Total          3999.000    265654.817

               Coefficients    Standard Error    t Stat    P-value
Intercept      12.136          0.255             47.564    0.000
Children       0.326           0.087             3.755     0.000

Regression Output 3 (Revenue($) Response Var)

Regression Statistics
Multiple R             0.046
R Square               0.002
Adjusted R Square      0.002
Standard Error         8.144
Observations           4000.000

ANOVA
               df          SS            MS         F        Significance F
Regression     2.000       550.538       275.269    4.150    0.016
Residual       3997.000    265104.278    66.326
Total          3999.000    265654.817

               Coefficients    Standard Error    t Stat    P-value
Intercept      13.192          0.152             86.958    0.000
Prod_Fam2      -0.975          0.458             -2.131    0.033
Prod_Fam3      -0.743          0.332             -2.240    0.025

Regression Output 4 (Revenue($) Response Var)

Regression Statistics
Multiple R             0.075
R Square               0.006
Adjusted R Square      0.005
Standard Error         8.131
Observations           4000.000

ANOVA
               df          SS            MS         F        Significance F
Regression     3.000       1496.798      498.933    7.548    0.000
Residual       3996.000    264158.018    66.106
Total          3999.000    265654.817

               Coefficients    Standard Error    t Stat    P-value
Intercept      12.361          0.267             46.345    0.000
Children       0.329           0.087             3.783     0.000
Prod_Fam2      -0.992          0.457             -2.171    0.030
Prod_Fam3      -0.747          0.331             -2.256    0.024

Regression Output 5 (Revenue($) Response Var)

Child_Fam3 = Children*Prod_Fam3

Regression Statistics
Multiple R             0.080
R Square               0.006
Adjusted R Square      0.006
Standard Error         8.127
Observations           4000.000

ANOVA
               df          SS            MS         F        Sig F
Regression     3.000       1708.947      569.649    8.624    0.000
Residual       3996.000    263945.870    66.053
Total          3999.000    265654.817

               Coefficients    Standard Error    t Stat    P-value
Intercept      12.214          0.258             47.399    0.000
Children       0.393           0.090             4.379     0.000
Prod_Fam2      -1.010          0.455             -2.218    0.027
Child_Fam3     -0.322          0.112             -2.882    0.004

Use the information given above to answer the following questions. For each question give adequate
explanation and support your answer with given information precisely.

a) Rank the income groups based on average revenue obtained per transaction in the sample
data from largest to smallest. Provide precise reasons as to how you obtained this
ranking. Is this ranking valid for the population? What is the average revenue per
transaction obtained for the income group ($10K-$30K)? (2 points)
Consider the prediction model obtained from Regression Output 1. It is:

$\hat{Y} = 12.6841 + 0.2787\,\text{Ann\_Inc2} + 0.7617\,\text{Ann\_Inc3} + 0.4524\,\text{Ann\_Inc4} - 0.0130\,\text{Ann\_Inc5}$

It is clear from the problem description that Ann_Inc1 (income group ($10K-$30K)) is treated as the base. Therefore, the coefficient of each variable Ann_Inci, where i > 1, describes the difference between the average revenue obtained for customers belonging to income group Ann_Inci and those belonging to Ann_Inc1 in the sample dataset. Thus, by simply comparing the values of the coefficients (the base group corresponding to 0), we obtain the ranking:
1) Ann_Inc3 ($50K-$70K), 2) Ann_Inc4 ($70K-$90K), 3) Ann_Inc2 ($30K-$50K), 4) Ann_Inc1 ($10K-$30K), 5) Ann_Inc5 ($90K and above).

This ranking is, however, not valid for the entire population because, at a significance level of 0.05, the model as a whole obtained from Regression Output 1 is not significant: the Significance F = 0.3294. That is, the average revenue spent by the various income groups is not significantly different from that spent by the income group ($10K-$30K), so all other income groups have the same ranking as Ann_Inc1.
Since the base income group is Ann_Inc1, the y-intercept b_0 provides the average revenue obtained for the income group ($10K-$30K), which is $12.6841.
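The sample ranking can be reproduced by turning the dummy coefficients of Regression Output 1 into group means. This is a minimal sketch; the dictionary layout and names are illustrative, and the base group Ann_Inc1 is represented by the intercept alone.

```python
# Coefficients from Regression Output 1 (base group: Ann_Inc1, $10K-$30K)
intercept = 12.6841
dummy_coefs = {
    "Ann_Inc2": 0.2787,
    "Ann_Inc3": 0.7617,
    "Ann_Inc4": 0.4524,
    "Ann_Inc5": -0.0130,
}

# Sample mean of each group = intercept + that group's dummy coefficient
group_means = {"Ann_Inc1": intercept}
group_means.update({g: intercept + b for g, b in dummy_coefs.items()})

# Rank from largest to smallest average revenue per transaction
ranking = sorted(group_means, key=group_means.get, reverse=True)

for group in ranking:
    print(f"{group}: {group_means[group]:.4f}")
```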
b) The grocery store wishes to estimate the average amount spent per transaction on non-consumables. Provide the most accurate estimate possible. Provide details on how you obtained this estimate. (2 points)
The most appropriate prediction equation for this question is the one listed in Regression Output 3:

$\hat{Y} = 13.192 - 0.975\,\text{Prod\_Fam2} - 0.743\,\text{Prod\_Fam3}$

The model as a whole is significant. Further, variables Prod_Fam2 and Prod_Fam3 each add significant additional information to the model in the presence of the other.
While the prediction equations obtained from Regression Output 4 and Regression Output 5 are superior in terms of lower standard error (s) and higher adjusted R^2, they are not appropriate. That is because those models require that the estimate be made for a fixed number of children, which is not so for the question asked.
For non-consumables, Prod_Fam3 = 1. Hence, the estimate obtained is 13.192 - 0.743 = 12.449.
c) If, in Regression Output 3, the base chosen for product family is drinks (Prod_Fam2), what will be the corresponding prediction equation? (2 points)
The current prediction equation in Regression Output 3 is

$\hat{Y} = 13.192 - 0.975\,\text{Prod\_Fam2} - 0.743\,\text{Prod\_Fam3}$

By making Prod_Fam2 the base, and letting Prod_Fam1 = 1 when the product family is food (0 otherwise), the prediction equation becomes

$\hat{Y} = (13.192 - 0.975) + (-0.743 - (-0.975))\,\text{Prod\_Fam3} + (0.975)\,\text{Prod\_Fam1}$
$\hat{Y} = 12.217 + 0.232\,\text{Prod\_Fam3} + 0.975\,\text{Prod\_Fam1}$
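Changing the base category leaves the fitted group means unchanged, which gives a simple way to verify the new equation. The sketch below assumes the Regression Output 3 coefficients; the helper `rebase` and the category labels are illustrative names, not part of the exam.

```python
# Regression Output 3: food is the base category
b0, b_drink, b_noncons = 13.192, -0.975, -0.743

# Fitted mean revenue per transaction for each product family
means = {
    "food": b0,
    "drink": b0 + b_drink,
    "non_consumable": b0 + b_noncons,
}


def rebase(group_means, base):
    """Return (intercept, dummy coefficients) when `base` is the reference group."""
    intercept = group_means[base]
    coefs = {g: m - intercept for g, m in group_means.items() if g != base}
    return intercept, coefs


intercept, coefs = rebase(means, "drink")
print(f"Intercept: {intercept:.3f}")
for group, coef in coefs.items():
    print(f"{group}: {coef:+.3f}")
```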

d) Is there a significant difference in the average amount spent per transaction between that on food and non-consumables? Why or why not? Provide precise reasons. (2 points)
The difference in the average amount spent per transaction between food and non-consumables is measured by the coefficient of the variable Prod_Fam3, since food is the base category. Therefore, in Regression Output 3, this question is answered by testing the hypothesis:

$H_0: \beta_{\text{Prod\_Fam3}} = 0$
$H_A: \beta_{\text{Prod\_Fam3}} \neq 0$

This test is conducted using the t test, for which t = -2.240. The associated p-value = 0.025 < 0.05. Consequently, we reject the null hypothesis and conclude that the difference is significant.
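The two-tailed p-value can be approximated without statistical tables. The sketch below assumes that, with 3997 residual degrees of freedom, the t distribution is close enough to the standard normal to use the normal tail; the t statistic is recomputed from the rounded coefficient and standard error, so it differs slightly from the -2.240 printed in the output.

```python
import math

# Prod_Fam3 row of Regression Output 3
coef, std_err = -0.743, 0.332

t_stat = coef / std_err

# Two-tailed p-value under a normal approximation (assumption: df = 3997
# is large enough that t is approximately standard normal).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < alpha}")
```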
e) The grocery store wishes to target those customers, as well as items on which the amount
spent is maximum. Assuming that no customer has more than five children, identify the
appropriate customer segment as well as the appropriate product family. Provide precise
reasons behind your answer. (2 points)
Prediction equations obtained from Regression Output 4 and Regression Output 5 are the most appropriate models to consider, as only these two models incorporate the effect of both product family and number of children. The prediction equation obtained from Regression Output 5 is slightly superior to that obtained from Regression Output 4 since it has a slightly lower standard error and slightly higher adjusted R^2. Both models as a whole are significant, as is each variable in the presence of the others. The appropriate prediction equation from Regression Output 5 is:

$\hat{Y} = 12.214 + 0.393\,\text{Children} - 1.010\,\text{Prod\_Fam2} - 0.322\,\text{Child\_Fam3}$

From the above model it is clear that with Children = 0, product families food and non-consumables are superior to drinks, as the coefficient of Prod_Fam2 is -1.010.
As the number of children increases, the average revenue for food increases at a faster rate than for non-consumables. For food, the rate of increase is 0.393. For non-consumables it is (0.393 - 0.322) = 0.071.
Therefore, the grocery store should target those with 5 children purchasing food items. The same conclusion is also obtained from the prediction equation in Regression Output 4.
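The conclusion can be checked by evaluating the fitted equation over every combination of product family and number of children from 0 to 5. This is a sketch using the coefficients from the Regression Output 5 coefficient table; the function name `predicted_revenue` and the family labels are illustrative.

```python
def predicted_revenue(children, family):
    """Fitted revenue per transaction from Regression Output 5."""
    prod_fam2 = 1 if family == "drink" else 0
    prod_fam3 = 1 if family == "non_consumable" else 0
    child_fam3 = children * prod_fam3   # interaction term Children*Prod_Fam3
    return 12.214 + 0.393 * children - 1.010 * prod_fam2 - 0.322 * child_fam3


families = ["food", "drink", "non_consumable"]
candidates = [(c, f) for c in range(6) for f in families]

# Segment (children, family) with the highest fitted revenue
best = max(candidates, key=lambda cf: predicted_revenue(*cf))

print(f"Best segment: {best}, fitted revenue = {predicted_revenue(*best):.3f}")
```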
f) What is the chance that a customer with 3 children will spend more than $10.00 on food
items per transaction? Provide details on your calculations. (3 points)

Here, the option is to use either the prediction equation obtained from Regression Output 4 or that from Regression Output 5. However, the prediction equation obtained from Regression Output 5 is marginally better than that from Regression Output 4, since it has a lower standard error and a slightly better R^2 value.
Therefore, the appropriate prediction equation to use is:

$\hat{Y} = 12.214 + 0.393\,\text{Children} - 1.010\,\text{Prod\_Fam2} - 0.322\,\text{Child\_Fam3}$

The average revenue obtained from a customer with 3 children spending on food is 12.214 + 3*0.393 = 13.393. The estimated standard deviation around this mean is 8.127. Therefore the z value associated with $10.00 is (10 - 13.393)/8.127 = -0.4175.
Hence the required probability is P(Z > -0.4175) = 0.6618.
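The probability can be computed directly from the standard normal CDF. This sketch follows the solution's approach of treating the regression standard error 8.127 as the standard deviation around the fitted mean; the helper `normal_cdf` is an illustrative name built on `math.erf`.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Fitted mean for a food purchase by a customer with 3 children (Output 5)
mean = 12.214 + 0.393 * 3
std_dev = 8.127          # standard error of the regression, used as sigma

z = (10.00 - mean) / std_dev
prob_more_than_10 = 1 - normal_cdf(z)

print(f"mean = {mean:.3f}, z = {z:.4f}, P(spend > 10) = {prob_more_than_10:.4f}")
```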

g) Does the number of children affect food purchases more than non-consumables? Why or why not? State your reasons precisely. (2 points)
The prediction equation obtained from Regression Output 5 is

$\hat{Y} = 12.214 + 0.393\,\text{Children} - 1.010\,\text{Prod\_Fam2} - 0.322\,\text{Child\_Fam3}$

For food purchases (Prod_Fam2 = 0, Child_Fam3 = 0) the prediction equation becomes

$\hat{Y} = 12.214 + 0.393\,\text{Children}$

For non-consumables (Prod_Fam2 = 0, Child_Fam3 = Children), the prediction equation becomes

$\hat{Y} = 12.214 + 0.071\,\text{Children}$

Since 0.393 > 0.071, the number of children affects food purchases more than non-consumables.
h) Suppose the grocery store has reason to believe that, in addition to the independent variables considered in Regression Output 4, homeowners spend significantly more on non-consumables than non-homeowners do on any product category. If so, how will you modify the model provided in Regression Output 4? Provide the model in terms. If you are adding new variables to the model, provide details on what you expect the value to be. Positive? Negative? (2 points)
We first create an interaction variable OwnH_PFam3 = Own_Hm*Prod_Fam3. Clearly, OwnH_PFam3 = 1 only if Own_Hm = 1 and Prod_Fam3 = 1. Thus, the modified model would be:

$Y = \beta_0 + \beta_1\,\text{Children} + \beta_2\,\text{Prod\_Fam2} + \beta_3\,\text{Child\_Fam3} + \beta_4\,\text{OwnH\_PFam3} + \varepsilon$

We would expect $\beta_4 > 0$.
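Constructing the interaction variable is a one-line product of the two dummies. The sketch below uses a handful of made-up rows purely for illustration (they are not from the exam data); the helper name `add_interaction` is also hypothetical.

```python
def add_interaction(row):
    """Return a copy of `row` with the dummy OwnH_PFam3 = Own_Hm * Prod_Fam3."""
    out = dict(row)
    out["OwnH_PFam3"] = row["Own_Hm"] * row["Prod_Fam3"]
    return out


# Hypothetical transactions covering all four dummy combinations
sample = [
    {"Own_Hm": 1, "Prod_Fam3": 1},   # homeowner buying non-consumables
    {"Own_Hm": 1, "Prod_Fam3": 0},   # homeowner, other product family
    {"Own_Hm": 0, "Prod_Fam3": 1},   # non-homeowner buying non-consumables
    {"Own_Hm": 0, "Prod_Fam3": 0},   # non-homeowner, other product family
]

augmented = [add_interaction(r) for r in sample]
flags = [r["OwnH_PFam3"] for r in augmented]
print(flags)
```

Only the first row gets OwnH_PFam3 = 1, matching the statement that the variable switches on only when both dummies equal 1.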
