MATH 533 Part C - Regression and Correlation Analysis


MATH 533: Applied Managerial Statistics

Part C: Regression and Correlation Analysis


Using MINITAB, perform the regression and correlation analysis for the data on CREDIT
BALANCE (Y) and SIZE (X) by answering the following.
1. Generate a scatterplot for CREDIT BALANCE vs. SIZE, including the graph of the "best
fit" line. Interpret.
[Figure: Scatterplot of Credit Balance($) vs Size with the best-fit line. Vertical axis: Credit Balance($), ticks from 2000 to 6000. Horizontal axis: Size.]

The scatterplot of Credit Balance ($) versus Size shows that the best-fit line slopes
upward (positive), which indicates that Credit Balance varies directly with Size: as Size
increases, Credit Balance tends to increase as well. Correct
MINITAB OUTPUT:
Regression Analysis: Credit Balance($) versus Size
The regression equation is
Credit Balance($) = 2591 + 403 Size
Predictor    Coef  SE Coef      T      P
Constant   2591.4    195.1  13.29  0.000
Size       403.22    50.95   7.91  0.000

S = 620.162   R-Sq = 56.6%   R-Sq(adj) = 55.7%

Analysis of Variance

Source          DF        SS        MS      F      P
Regression       1  24092210  24092210  62.64  0.000
Residual Error  48  18460853    384601
Total           49  42553062

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  4607.5   119.0  (4368.2, 4846.9)  (3337.9, 5877.2)

Values of Predictors for New Observations

New Obs  Size
      1  5.00

2. Determine the equation of the "best fit" line, which describes the relationship between
CREDIT BALANCE and SIZE.
The equation of the best fit line, which describes the relationship between Credit Balance
and Size, is
Credit Balance ($) = 2591 + 403.2 Size Correct
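As a quick check of the fitted line, the sketch below evaluates it at a household size of 5 using the unrounded coefficients from the MINITAB output (2591.4 and 403.22). The function name predict_credit_balance is only illustrative.

```python
# Coefficients as reported in the MINITAB output (unrounded values).
INTERCEPT = 2591.4   # Constant
SLOPE = 403.22       # Size


def predict_credit_balance(size):
    """Return the fitted Credit Balance ($) for a given household size."""
    return INTERCEPT + SLOPE * size


fit_at_5 = predict_credit_balance(5)
print(round(fit_at_5, 1))
```

The value agrees with the Fit of 4607.5 that MINITAB reports for Size = 5.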

3. Determine the coefficient of correlation. Interpret.


The coefficient of correlation is r = 0.752. The positive sign indicates a direct
relationship between Credit Balance and Size, and the magnitude indicates a fairly strong
linear association. The p-value of 0.000 is far below 0.05, so a correlation this large
would be extremely unlikely to arise by chance if the two variables were uncorrelated.
Correct

MINITAB OUTPUT:
Pearson correlation of Credit Balance ($) and Size = 0.752
P-Value = 0.000
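The p-value for a correlation comes from a t statistic with n - 2 degrees of freedom. The sketch below computes that statistic from the reported r, assuming n = 50 (inferred from the Total DF of 49 in the Analysis of Variance table).

```python
import math

r = 0.752   # Pearson correlation from the MINITAB output
n = 50      # assumed sample size: Total DF (49) + 1

# Test statistic for H0: rho = 0, on n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t_stat, 2))
```

Up to rounding of r, this is the same T value (7.91) that MINITAB reports for the Size coefficient, which is why the two tests give the same p-value in simple regression.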

4. Determine the coefficient of determination. Interpret.


The coefficient of determination is R-Sq = 0.566. It gives the proportion of variability
in the response that is accounted for by the regression model, so for this model Size
explains 56.6% of the variation in Credit Balance. Correct

MINITAB OUTPUT:
S = 620.162   R-Sq = 56.6%   R-Sq(adj) = 55.7%
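These summary values can be recovered from the sums of squares in the Analysis of Variance table. The following is a minimal sketch using the figures reported there; the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_regression = 24092210
ss_error = 18460853
ss_total = 42553062
df_error = 48

r_sq = ss_regression / ss_total       # coefficient of determination
s = math.sqrt(ss_error / df_error)    # standard error of the estimate
r = math.sqrt(r_sq)                   # positive root, because the slope is positive

print(f"R-Sq = {r_sq:.1%}, S = {s:.3f}, r = {r:.3f}")
```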

5. Test the utility of this regression model (use a two tail test with α = .05). Interpret your
results, including the p-value.
The null hypothesis, Ho, states that there is no significant correlation, that is, the
correlation coefficient ρ = 0.

The Significance Level, α = 0.05

Decision Rule: Reject Ho if p-value < 0.05

From the Analysis of Variance table, I find that the p-value is 0.000, which is much less
than 0.05. Therefore, I reject the null hypothesis of no significant correlation and
conclude that, according to the overall test of significance, the regression model is
useful for predicting Credit Balance. Correct

MINITAB OUTPUT:
Analysis of Variance

Source          DF        SS        MS      F      P
Regression       1  24092210  24092210  62.64  0.000
Residual Error  48  18460853    384601
Total           49  42553062
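The F statistic in this table is the ratio of the two mean squares. The sketch below recomputes it and approximates the p-value with the standard library only, integrating the t density numerically (with one predictor, F on 1 and 48 degrees of freedom is the square of t on 48 degrees of freedom). The helper functions are illustrative and are not how MINITAB computes the value.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)


ms_regression = 24092210   # MS for Regression
ms_error = 384601          # MS for Residual Error
f_stat = ms_regression / ms_error

# With one predictor, F on (1, 48) df equals t squared on 48 df.
p_value = two_tailed_p(math.sqrt(f_stat), 48)
print(round(f_stat, 2), p_value < 0.05)
```

The p-value is far smaller than 0.0005, which is why MINITAB prints it as 0.000.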

6. Based on your findings in 1-5, what is your opinion about using SIZE to predict CREDIT
BALANCE? Explain.
Based on my findings, Size is a good predictor of Credit Balance. The two variables are
positively correlated (r = 0.752), the regression model is statistically significant
(p-value = 0.000), and Size accounts for 56.6% of the variation in Credit Balance. As the
size of a household grows, its credit balance tends to grow as well. Correct

7. Compute the 95% confidence interval for β1. Interpret this interval.


N/A
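Although no answer was required here, a 95% confidence interval for the slope can be sketched from the regression output, assuming the interval asked for is the one for the Size coefficient. The critical value 2.0106 is the standard table value of t for 48 degrees of freedom and a 0.025 upper tail; it is an assumption brought in from a t table and does not appear in the MINITAB output.

```python
coef = 403.22      # Size coefficient from the MINITAB output
se_coef = 50.95    # its standard error (SE Coef)
t_crit = 2.0106    # assumed table value t(0.025, 48 df); not in the output

margin = t_crit * se_coef
ci_slope = (coef - margin, coef + margin)
print(tuple(round(v, 1) for v in ci_slope))
```

Under these assumptions, each additional household member is associated with an increase in mean credit balance of roughly $301 to $506.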
8. Using an interval, estimate the average credit balance for customers that have household
size of 5. Interpret this interval.

The average credit balance for customers with a household size of 5 is estimated to lie
within the interval (4368.2, 4846.9). This is the 95% confidence interval for the mean
credit balance of customers that have a household size of 5. Correct

MINITAB OUTPUT:
Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  4607.5   119.0  (4368.2, 4846.9)  (3337.9, 5877.2)

Values of Predictors for New Observations

New Obs  Size
      1  5.00
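The 95% CI in this output is the fitted value plus or minus a t critical value times SE Fit. The sketch below reproduces it, assuming the table value t(0.025, 48 df) = 2.0106, which is not part of the MINITAB output.

```python
fit = 4607.5       # Fit at Size = 5
se_fit = 119.0     # SE Fit
t_crit = 2.0106    # assumed table value t(0.025, 48 df)

ci_mean = (fit - t_crit * se_fit, fit + t_crit * se_fit)
print(tuple(round(v, 1) for v in ci_mean))
```

The result matches the reported interval to within the rounding of SE Fit.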

9. Using an interval, predict the credit balance for a customer that has a household size of 5.
Interpret this interval.
The credit balance for an individual customer with a household size of 5 is expected to
lie within the interval (3337.9, 5877.2). This is the 95% prediction interval for the
credit balance of a single customer that has a household size of 5; it is wider than the
confidence interval for the mean because it also reflects customer-to-customer variation.
Correct

MINITAB OUTPUT:
Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  4607.5   119.0  (4368.2, 4846.9)  (3337.9, 5877.2)

Values of Predictors for New Observations

New Obs  Size
      1  5.00
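The 95% PI combines the uncertainty in the fitted mean (SE Fit) with the scatter of individual customers around the line (S). The following sketch reproduces it under the same assumption as before, a table value of t(0.025, 48 df) = 2.0106 that does not appear in the MINITAB output.

```python
import math

fit = 4607.5       # Fit at Size = 5
se_fit = 119.0     # SE Fit
s = 620.162        # S, the standard error of the estimate
t_crit = 2.0106    # assumed table value t(0.025, 48 df)

# Standard error for predicting a single new observation.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)
pi_single = (fit - t_crit * se_pred, fit + t_crit * se_pred)
print(tuple(round(v, 1) for v in pi_single))
```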

10. What can we say about the credit balance for a customer that has a household size of 10?
Explain your answer.
We cannot say anything reliable about the credit balance for a customer that has a
household size of 10. The largest value of the predictor variable (Size) used to fit the
regression model is 7, which is well below 10, so predicting at Size = 10 would be
extrapolation and the model cannot be used to estimate that credit balance accurately.
Correct
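One way to guard against this kind of misuse is to check a requested value against the observed range of the predictor before predicting. This is a minimal sketch; the function name and the decision to treat the range 1 to 7 as the limit are illustrative choices based on the Size values in the sample.

```python
OBSERVED_SIZE_RANGE = (1, 7)  # smallest and largest Size in the sample


def is_extrapolation(size, observed_range=OBSERVED_SIZE_RANGE):
    """Return True when size lies outside the range used to fit the model."""
    low, high = observed_range
    return size < low or size > high


for size in (5, 10):
    status = "extrapolation" if is_extrapolation(size) else "within range"
    print(size, status)
```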

In an attempt to improve the model, we fit a multiple regression model predicting
CREDIT BALANCE based on INCOME, SIZE and YEARS.
11. Using MINITAB run the multiple regression analysis using the variables INCOME, SIZE
and YEARS to predict CREDIT BALANCE. State the equation for this multiple
regression model.
MINITAB OUTPUT:
Regression Analysis: Credit Balance($) versus Income ($1000), Size, Years
The regression equation is
Credit Balance($) = 1276 + 32.3 Income ($1000) + 347 Size + 7.9 Years

Predictor         Coef  SE Coef     T      P
Constant        1276.0    273.6  4.66  0.000
Income ($1000)  32.272    4.348  7.42  0.000
Size            346.85    36.03  9.63  0.000
Years             7.88    12.34  0.64  0.526

S = 424.715   R-Sq = 80.5%   R-Sq(adj) = 79.2%

Analysis of Variance

Source          DF        SS        MS      F      P
Regression       3  34255444  11418481  63.30  0.000
Residual Error  46   8297619    180383
Total           49  42553062

Source          DF    Seq SS
Income ($1000)   1  16703393
Size             1  17478430
Years            1     73620

Unusual Observations

      Income      Credit
Obs  ($1000)  Balance($)     Fit  SE Fit  Residual  St Resid
  3     32.0      5100.0  3830.1    93.7    1269.9     3.07R
  5     31.0      1864.0  3001.7   139.3   -1137.7    -2.84R
 11     25.0      4208.0  3210.1   103.3     997.9     2.42R
 17     55.0      4412.0  5250.3   116.3    -838.3    -2.05R

R denotes an observation with a large standardized residual.


The multiple regression equation is:
Credit Balance($) = 1276 + 32.3 Income ($1000) + 347 Size + 7.9 Years

Correct
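To show how the multiple regression equation is used, the sketch below evaluates it with the unrounded coefficients from the MINITAB output. The customer values (income of $40,000, household size 3, 10 years) are hypothetical and chosen only for illustration; they are not taken from the data set.

```python
# Unrounded coefficients from the MINITAB output.
COEFFICIENTS = {
    "constant": 1276.0,
    "income": 32.272,   # per $1000 of income
    "size": 346.85,
    "years": 7.88,
}


def predict_credit_balance(income_thousands, size, years):
    """Return the fitted Credit Balance ($) from the multiple regression."""
    return (COEFFICIENTS["constant"]
            + COEFFICIENTS["income"] * income_thousands
            + COEFFICIENTS["size"] * size
            + COEFFICIENTS["years"] * years)


# Hypothetical customer, for illustration only.
example_fit = predict_credit_balance(income_thousands=40, size=3, years=10)
print(round(example_fit, 2))
```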

12. Perform the Global Test for Utility (F-Test). Explain your conclusion.
The null hypothesis, Ho, states that none of the independent variables is useful for
predicting Credit Balance, that is, all of the slope coefficients are zero
(β1 = β2 = β3 = 0).

Significance Level, α = 0.05

Decision Rule: Reject Ho if p-value < 0.05
From the Analysis of Variance table, we find that the p-value (0.000) for F = 63.30 is
much less than 0.05. Therefore, we reject the null hypothesis and conclude that,
according to the overall test of significance, the multiple regression model is useful:
at least one of the independent variables contributes to predicting Credit Balance.
Correct
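The global F statistic can be rebuilt from the sums of squares and degrees of freedom in the multiple regression Analysis of Variance table. This is a small sketch of that arithmetic; the variable names are illustrative.

```python
# Values from the multiple regression Analysis of Variance table.
ss_regression = 34255444
ss_error = 8297619
df_regression = 3    # number of predictors
df_error = 46        # n - number of predictors - 1

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_stat = ms_regression / ms_error

print(round(ms_regression), round(ms_error), round(f_stat, 2))
```

The mean squares and the F value agree with the MS and F columns reported by MINITAB.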

MINITAB OUTPUT:
Test for Equal Variances: Credit Balance($) versus Income ($1000)
95% Bonferroni confidence intervals for standard deviations

 Income
($1000)  N    Lower    StDev   Upper
     21  2  267.855   830.85  344720
     22  2  188.069   583.36  242037
     23  1        *        *       *
     25  1        *        *       *
     26  1        *        *       *
     27  2  101.215   313.96  130260
     29  1        *        *       *
     30  3  123.736   309.43    7053
     31  1        *        *       *
     32  1        *        *       *
     33  1        *        *       *
     34  1        *        *       *
     35  1        *        *       *
     37  2  328.265  1018.23  422465
     39  2  276.062   856.31  355281
     40  1        *        *       *
     41  1        *        *       *
     42  1        *        *       *
     44  1        *        *       *
     46  1        *        *       *
     48  2   80.471   249.61  103563
     50  2  259.193   803.98  333571
     51  1        *        *       *
     52  1        *        *       *
     54  3  396.622   991.86   22607
     55  4  290.865   647.76    5780
     61  1        *        *       *
     62  2  221.807   688.01  285457
     63  1        *        *       *
     64  1        *        *       *
     65  1        *        *       *
     66  2   87.765   272.24  112951
     67  2   70.212   217.79   90361

Bartlett's Test (Normal Distribution)
Test statistic = 5.59, p-value = 0.935
Levene's Test (Any Continuous Distribution)
Test statistic = 1.01, p-value = 0.479

Test for Equal Variances: Credit Balance($) versus Size
95% Bonferroni confidence intervals for standard deviations

Size   N    Lower    StDev    Upper
   1   5  137.540  271.807  1303.27
   2  15  459.836  698.998  1337.23
   3   8  193.542  336.323   943.85
   4   9  415.251  701.689  1796.00
   5   5  340.696  673.284  3228.28
   6   5  360.277  711.981  3413.83
   7   3  150.085  356.267  5956.16

Bartlett's Test (Normal Distribution)
Test statistic = 8.07, p-value = 0.233
Levene's Test (Any Continuous Distribution)
Test statistic = 1.12, p-value = 0.369

Test for Equal Variances: Credit Balance($) versus Years
95% Bonferroni confidence intervals for standard deviations

Years  N    Lower    StDev   Upper
    1  2  541.930  1714.03  875261
    2  1        *        *       *
    4  2  452.950  1432.60  731550
    5  2  130.788   413.66  211232
    6  1        *        *       *
    7  2   78.920   249.61  127462
    8  1        *        *       *
    9  2   76.013   240.42  122768
   10  2  135.483   428.51  218815
   11  4  204.115   461.26    4413
   12  4  348.641   787.86    7538
   13  4  167.957   379.55    3631
   14  5  584.321  1221.32    7236
   15  3  232.333   590.58   14935
   16  4  231.705   523.61    5010
   17  2  111.114   351.43  179457
   18  5  452.721   946.25    5607
   19  2  121.398   383.96  196067
   20  2  540.589  1709.78  873094

Bartlett's Test (Normal Distribution)
Test statistic = 13.77, p-value = 0.543
Levene's Test (Any Continuous Distribution)
Test statistic = 2.23, p-value = 0.029

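In conclusion, since the p-values of Bartlett's Test (Normal Distribution) are all
greater than 0.05 (0.935, 0.233 and 0.543), I am unable to reject the null hypothesis of
equal variances. Levene's Test, which does not assume normality, also fails to reject
the null hypothesis of equal variances for Income (p-value = 0.479) and Size
(p-value = 0.369); for Years, however, its p-value of 0.029 is less than 0.05, so that
test does reject equal variances.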
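Applying the 0.05 decision rule to each reported p-value can be sketched as follows. The dictionary simply restates the six p-values from the MINITAB output; its layout is an illustrative choice.

```python
ALPHA = 0.05

# p-values reported by MINITAB for each factor and test.
p_values = {
    ("Income ($1000)", "Bartlett"): 0.935,
    ("Income ($1000)", "Levene"): 0.479,
    ("Size", "Bartlett"): 0.233,
    ("Size", "Levene"): 0.369,
    ("Years", "Bartlett"): 0.543,
    ("Years", "Levene"): 0.029,
}

# Tests whose p-value falls below the significance level.
rejected = [key for key, p in p_values.items() if p < ALPHA]

for (factor, test), p in p_values.items():
    decision = "reject" if p < ALPHA else "fail to reject"
    print(f"{factor:15s} {test:9s} p = {p:.3f} -> {decision} equal variances")
```

Under this rule only Levene's Test for Years falls below 0.05; the other five tests fail to reject equal variances.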
13. Perform the t-test on each independent variable. Explain your conclusions and clearly
state how you should proceed. In particular, which independent variables should we keep
and which should be discarded.
Test the significance of the individual coefficients of the independent variables.
The null hypothesis, Ho, states that there is no significant correlation between the
independent variable and Credit Balance, that is, the correlation coefficient ρ = 0.
Decision Rule: Reject Ho if p-value < 0.05
MINITAB OUTPUT:
Income ($1000)
Analysis of Variance

Source          DF        SS        MS      F      P
Regression       1  16703393  16703393  31.02  0.000
Residual Error  48  25849669    538535
Total           49  42553062

Years
Analysis of Variance

Source          DF        SS      MS     F      P
Regression       1      2878    2878  0.00  0.955
Residual Error  48  42550184  886462
Total           49  42553062

Size
Analysis of Variance

Source          DF        SS        MS      F      P
Regression       1  24092210  24092210  62.64  0.000
Residual Error  48  18460853    384601
Total           49  42553062

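The independent variables Income ($1000) and Size should be kept because each makes a
significant contribution to the regression model (p-value = 0.000 for both, in the
individual tests above and in the t-tests of the multiple regression output). The variable
Years should be discarded because it does not make a significant contribution
(p-value = 0.955 on its own, and 0.526 for its t-test in the multiple regression). To
proceed, the model should be refit using only Income and Size. Correct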
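The T column of the multiple regression output is each coefficient divided by its standard error. The sketch below recomputes those ratios and applies the 0.05 rule to the reported p-values; the data structures are illustrative.

```python
ALPHA = 0.05

# (Coef, SE Coef) pairs from the multiple regression output.
coefficients = {
    "Income ($1000)": (32.272, 4.348),
    "Size": (346.85, 36.03),
    "Years": (7.88, 12.34),
}
# p-values reported by MINITAB for the same t-tests.
p_values = {"Income ($1000)": 0.000, "Size": 0.000, "Years": 0.526}

t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}
keep = [name for name, p in p_values.items() if p < ALPHA]

for name, t in t_stats.items():
    verdict = "keep" if name in keep else "discard"
    print(f"{name:15s} T = {t:5.2f}  p = {p_values[name]:.3f}  -> {verdict}")
```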

14. Is this multiple regression model better than the linear model that we generated in parts 1-10? Explain.
The proportion of variability in the response that is accounted for by a model is given
by the coefficient of determination, R-Sq, so a higher value indicates a better fit. R-Sq
is greater for the multiple regression model (0.805) than for the simple linear
regression model (0.566). Because R-Sq never decreases when predictors are added,
R-Sq(adj) is the fairer comparison, and it is also higher (79.2% versus 55.7%). Hence the
multiple regression model is better than the simple linear regression model. Correct
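Because R-Sq cannot decrease when predictors are added, a comparison on adjusted R-Sq is a useful cross-check. The sketch below computes it for both models from the error and total sums of squares in their Analysis of Variance tables, assuming n = 50 (Total DF of 49 plus 1); the function name is illustrative.

```python
def adjusted_r_sq(ss_error, ss_total, n, k):
    """Adjusted R-squared for a model with k predictors fitted to n observations."""
    return 1 - (ss_error / (n - k - 1)) / (ss_total / (n - 1))


SS_TOTAL = 42553062
N = 50  # assumed sample size: Total DF (49) + 1

simple_adj = adjusted_r_sq(18460853, SS_TOTAL, N, k=1)    # Size only
multiple_adj = adjusted_r_sq(8297619, SS_TOTAL, N, k=3)   # Income, Size, Years

print(f"simple: {simple_adj:.1%}, multiple: {multiple_adj:.1%}")
```

Both values agree with the R-Sq(adj) figures in the MINITAB output, and the multiple regression model remains clearly ahead after the adjustment.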
Project Part C: Grading Rubric

Category                          Points Value  Your Points  Description
Questions 1 - 12 and 14 - 5                 65           65  addressed with appropriate output, graphs and
pts. each. Everyone gets                                     interpretations
credit for No. 7
Question 13                                 15           15  addressed with appropriate output, graphs and
                                                             interpretations
Summary                                     20           20  writing, grammar, clarity, logic, and cohesiveness
Total                                      100          100  A quality paper will meet or exceed all of the
                                                             above requirements.
