Practical Session 1 Solved
Multiple regression is used to model desire to have cosmetic surgery (y) as a function of the following explanatory variables
(regressors) :
$x_1$: self-esteem,
$x_2$: body satisfaction,
$x_3$: impression of reality TV, and
$x_4$: gender.
$Y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \beta_4 x_{4,i} + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma)$
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7054
R Square 0.4976
Adjusted R Square 0.4854
Standard Error 2.2509
Observations 170.0000
ANOVA
df SS MS F Significance F
Regression 4.0000 827.8333 206.9583 40.8492 0.0000
Residual 165.0000 835.9549 5.0664
Total 169.0000 1663.7882
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 14.0107 0.7753 18.0702 0.0000 12.4798 15.5415 12.4798 15.5415
SELFESTM -0.0479 0.0367 -1.3066 0.1932 -0.1204 0.0245 -0.1204 0.0245
BODYSAT -0.3223 0.1435 -2.2465 0.0260 -0.6056 -0.0390 -0.6056 -0.0390
IMPREAL 0.4931 0.1274 3.8707 0.0002 0.2416 0.7446 0.2416 0.7446
GENDER -2.1865 0.6766 -3.2314 0.0015 -3.5225 -0.8505 -3.5225 -0.8505
The “desire to have cosmetic surgery” decreases on average by 0.32 units for each one-unit increase in “body satisfaction”, keeping the other regressors constant.
The “desire to have cosmetic surgery” increases on average by 0.49 units for each one-unit increase in “impression of reality TV”, keeping the other regressors constant.
L.F. - ST22 - Advanced Statistical Methods
Since “gender” is a qualitative variable, the interpretation of its coefficient is out of the scope of this course.
When all variables are equal to zero, the desire to have cosmetic surgery is expected to be equal to 14.01.
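The interpretations above can be checked with a minimal sketch that evaluates the fitted equation. The coefficients are copied from the SUMMARY OUTPUT table; the function name `predict_desire` is hypothetical and only serves the illustration.

```python
# Coefficients copied from the SUMMARY OUTPUT table of the four-regressor model.
COEFFICIENTS = {
    "Intercept": 14.0107,
    "SELFESTM": -0.0479,
    "BODYSAT": -0.3223,
    "IMPREAL": 0.4931,
    "GENDER": -2.1865,
}


def predict_desire(selfestm, bodysat, impreal, gender):
    """Return the fitted value of DESIRE for one set of regressor values."""
    b = COEFFICIENTS
    return (
        b["Intercept"]
        + b["SELFESTM"] * selfestm
        + b["BODYSAT"] * bodysat
        + b["IMPREAL"] * impreal
        + b["GENDER"] * gender
    )


# Fitted value when every regressor is zero: this is the intercept.
baseline = predict_desire(0, 0, 0, 0)

# Change in the fitted value for a one-unit increase in a single regressor,
# all other regressors held constant: this is that regressor's coefficient.
effect_bodysat = predict_desire(0, 1, 0, 0) - baseline
effect_impreal = predict_desire(0, 0, 1, 0) - baseline

print(baseline, effect_bodysat, effect_impreal)
```

Because the model is linear, the one-unit effect does not depend on the starting values chosen for the other regressors.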
3. Global model validity (Fisher test) & model quality (R2 and R2adjusted)
Considering that the p-value of the Fisher F test is close to zero, we conclude that the model is valid for the overall population,
since at least one regressor is significant for predicting y.
Considering the coefficient of determination (0.4976) and the adjusted coefficient of determination (0.4854), we conclude that
approximately half of the total variability is explained by the regression model (which is low).
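As a check, the quality indicators quoted above can be recomputed from the ANOVA table alone. This is a minimal sketch using the sums of squares, the number of observations (170) and the number of regressors (4) from the output; the variable names are illustrative.

```python
# Values copied from the ANOVA table of the four-regressor model.
n = 170                   # observations
k = 4                     # regressors
ss_regression = 827.8333
ss_residual = 835.9549
ss_total = ss_regression + ss_residual

# Coefficient of determination: share of total variability explained.
r_squared = ss_regression / ss_total

# Adjusted R-squared penalises the number of regressors.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Fisher statistic: mean square of regression over mean square of residuals.
f_statistic = (ss_regression / k) / (ss_residual / (n - k - 1))

print(round(r_squared, 4), round(adjusted_r_squared, 4), round(f_statistic, 4))
```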
(other example)
4. Marginal contributions of explanatory variables (Student tests) & Confidence Intervals for the β’s
The p-values of the Student t tests are provided in the P-value column of the coefficients table, and the confidence intervals are
provided in the Lower 95% and Upper 95% columns.
The p-values for the variables BODYSAT, IMPREAL and GENDER are lower than 0.05. Thus, we can conclude that these
variables have a significant marginal contribution at the 5% significance level. (This can also be seen from the 95% confidence
intervals, which do not contain zero.)
On the other hand, the p-value for the variable SELFESTM is higher than 0.05, so this variable is not significant at the
5% level. (This can also be seen from its 95% confidence interval, which contains zero.)
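The link between the t statistic, the significance decision and the confidence interval can be sketched as follows. The coefficients and standard errors come from the output table. The critical value 1.974 is an assumption: it is an approximate two-tailed 5% Student critical value for 165 degrees of freedom, not a figure given in the output.

```python
# Assumption: approximate two-tailed 5% critical value for df = 165.
T_CRITICAL = 1.974

# (coefficient, standard error) copied from the coefficients table.
ESTIMATES = {
    "SELFESTM": (-0.0479, 0.0367),
    "BODYSAT": (-0.3223, 0.1435),
    "IMPREAL": (0.4931, 0.1274),
    "GENDER": (-2.1865, 0.6766),
}


def t_statistic(coef, se):
    """Test statistic for H0: beta = 0."""
    return coef / se


def confidence_interval(coef, se, t_crit=T_CRITICAL):
    """Two-sided confidence interval: coef +/- t_crit * se."""
    half_width = t_crit * se
    return coef - half_width, coef + half_width


significant = []
for name, (coef, se) in ESTIMATES.items():
    t = t_statistic(coef, se)
    low, high = confidence_interval(coef, se)
    # Rejecting H0 is equivalent to zero lying outside the interval.
    if abs(t) > T_CRITICAL:
        significant.append(name)
    print(f"{name}: t = {t:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

Small differences from the t statistics printed in the output are expected, since the inputs here are rounded to four decimals.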
T-test interpretation
The t statistic reported in each regression indicates the statistical significance of the relationship between the two variables.
For example, the t statistic for Age in the regression of Salary on Age is 2.128, the absolute value of which is greater than the critical t-value of 1.969.
This means that the relationship between Age and Salary is statistically significant; we reject the hypothesis that the relationship
between Age and Salary observed in the sample data is simply due to chance. Furthermore, the regression
coefficient of 802.9337 for Age means that as Age increases by one year, the average salary is higher by $802.9337.
Note that in regression (iii), the t statistic for Sex is -0.079. In this case, the absolute value of the t statistic, 0.079, is less than the critical
value of 1.969. Hence, the relationship between Sex and Salary is not statistically significant; we cannot reject the
hypothesis that the coefficient is really zero in the population of interest and that we have observed something
slightly different from zero in our sample simply due to chance (sampling fluctuations). It follows that the regression
coefficient of -192.7 for Sex is not significant.
Similarly, we can interpret the results of the other regressions.
We substitute the previous values for the regressors in the following equation:
Obtaining:
Concerning the multicollinearity analysis, we compute the correlation coefficients between the variables:
           DESIRE     SELFESTM   BODYSAT    IMPREAL    GENDER
DESIRE     1
SELFESTM   -0.48546   1
BODYSAT    -0.64359   0.757216   1
IMPREAL    0.131726   0.16651    0.14344    1
GENDER     -0.63678   0.511079   0.828248   0.065294   1
We observe some high correlations between regressors, in particular between BODYSAT and GENDER (0.83) and between SELFESTM and BODYSAT (0.76).
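A simple screening of the correlation table can be sketched as follows. The correlations between regressors are copied from the table; the 0.7 cut-off is an assumption chosen for illustration, not a rule stated in this course.

```python
# Pairwise correlations between regressors, copied from the correlation table.
CORRELATIONS = {
    ("SELFESTM", "BODYSAT"): 0.757216,
    ("SELFESTM", "IMPREAL"): 0.16651,
    ("SELFESTM", "GENDER"): 0.511079,
    ("BODYSAT", "IMPREAL"): 0.14344,
    ("BODYSAT", "GENDER"): 0.828248,
    ("IMPREAL", "GENDER"): 0.065294,
}

# Assumption: pairs with |r| above this threshold are flagged as "high".
THRESHOLD = 0.7

high_pairs = sorted(
    pair for pair, r in CORRELATIONS.items() if abs(r) > THRESHOLD
)

for first, second in high_pairs:
    r = CORRELATIONS[(first, second)]
    print(f"{first} / {second}: r = {r:.3f}")
```

Pairwise correlations only detect collinearity between two regressors at a time, so this screening is a first look rather than a full diagnosis.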
Concerning the residual analysis, the residuals’ histogram is provided, denoting normality:
The standard residuals are plotted, showing that they are homoscedastic (constant in variance), and no outliers are detected
at the 99% confidence level (since all standard residuals have an absolute value lower than 3).
[Figure: plot of the standard residuals]
Since we observed that the variable SELFESTM is not significant at the 5% level, we remove it from the model and
re-estimate the model.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7017
R Square 0.4924
Adjusted R Square 0.4832
Standard Error 2.2557
Observations 170.0000
ANOVA
df SS MS F Significance F
Regression 3.0000 819.1836 273.0612 53.6679 0.0000
Residual 166.0000 844.6047 5.0880
Total 169.0000 1663.7882
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 13.3228 0.5704 23.3549 0.0000 12.1965 14.4491
BODYSAT -0.4508 0.1047 -4.3065 0.0000 -0.6575 -0.2441
IMPREAL 0.4827 0.1274 3.7885 0.0002 0.2311 0.7343
GENDER -1.9113 0.6444 -2.9661 0.0035 -3.1836 -0.6391
Context
How much influence does the media, especially reality television programs, have on one’s decision to undergo cosmetic surgery?
This was the question of interest to psychologists who published an article in Body Image: An International Journal of Research
(March 2010).
In the study, 170 college students answered questions about their impressions of reality TV shows featuring cosmetic surgery.
BODYSAT—scale ranging from 1 to 9, where the higher the value, the greater the satisfaction with one’s own body;
IMPREAL—scale ranging from 1 to 7, where the higher the value, the more one believes reality television shows featuring
cosmetic surgery are realistic.
The data for the study (simulated based on statistics reported in the journal article) are saved in the Excel file BDYIMG (available
on BlackBoard).
Additional Participants:
The restaurant manager would like to construct a model that would explain the spending amount in
terms of the frequency and the age for all clients.
You are asked to proceed with the different tests required to validate the linear regression model for all
clients among which the sample was taken.
Variable #1 (SPENDING)
Mean 4.17188
Corrected Standard Deviation 1.53249
Variable #2 (FREQUENCY)
Mean 10.59375
Corrected Standard Deviation 6.76738
Variable #3 (AGE)
Mean 35.75
Corrected Standard Deviation 11.6453
Count n 32
R-square 0.58999
ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 29.8504
Total 72.8047 ?
SOLUTIONS:
ANOVA TABLE
            SS        df               MS           F            p-value
Regression  42.9543   2                21.47715     20.8652933   < 0.01
Residual    29.8504   29 = 32 - (2+1)  1.02932414
Total       72.8047   31 = 32 - 1
SS Regression = SS Total - SS Residual = 72.8047 - 29.8504 = 42.9543
H0: $\beta_1 = \beta_2 = 0$
H1: at least one of the $\beta_j$ is not 0, j = 1, 2
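The completion of the ANOVA table can be reproduced with a short sketch. The inputs (n = 32, two regressors, total and residual sums of squares) come from the exercise statement; the variable names are illustrative.

```python
# Inputs from the exercise statement.
n = 32                 # clients in the sample
k = 2                  # regressors: FREQUENCY and AGE
ss_total = 72.8047
ss_residual = 29.8504

# The regression sum of squares is obtained by difference.
ss_regression = ss_total - ss_residual

# Degrees of freedom and mean squares.
df_regression = k
df_residual = n - (k + 1)
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# Fisher statistic for H0: beta_1 = beta_2 = 0.
f_statistic = ms_regression / ms_residual

# R-square can be recovered from the same table.
r_squared = ss_regression / ss_total

print(ss_regression, ms_regression, ms_residual, f_statistic, r_squared)
```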
Frequency:
95% CI for $\beta_1$: $\hat{\beta}_1 \pm t_{29,\alpha/2} \times s_{\hat{\beta}_1} = [-0.17208 \pm 2.045 \times 0.02782]$
We will run a Student test for the coefficient $\beta_1$ of the explanatory variable Frequency as follows:
H0: $\beta_1 = 0$ in the presence of AGE
H1: $\beta_1 \neq 0$ in the presence of AGE
We calculate the test statistic $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{-0.17208}{0.02782} = -6.18547807$
The critical value associated with a Student distribution for df = 29 and a type I error risk $\alpha = 0.05$ is $t_{29;0.025} = 2.045$ (two-tailed test).
$|t| \gg$ critical value, so we can reject H0.
Age:
95% CI for $\beta_2$: $\hat{\beta}_2 \pm t_{29,\alpha/2} \times s_{\hat{\beta}_2} = [-0.00401 \pm 2.045 \times 0.01617]$
We will run a Student test for the coefficient $\beta_2$ of the explanatory variable Age as follows:
H0: $\beta_2 = 0$ in the presence of FREQUENCY
H1: $\beta_2 \neq 0$ in the presence of FREQUENCY
We calculate the test statistic $t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{-0.00401}{0.01617} = -0.24799011$
The critical value associated with a Student distribution for df = 29 and a type I error risk $\alpha = 0.05$ is $t_{29;0.025} = 2.045$ (two-tailed test).
$|t| <$ critical value, so we cannot reject H0.
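The two Student tests and the confidence interval bounds can be evaluated numerically with the sketch below. The estimates, standard errors and the critical value 2.045 are those used in the solution; the function and variable names are illustrative.

```python
# Two-tailed 5% critical value for df = 29, as used in the solution.
T_CRITICAL = 2.045

# (estimate, standard error) for each coefficient, from the solution.
ESTIMATES = {
    "FREQUENCY": (-0.17208, 0.02782),
    "AGE": (-0.00401, 0.01617),
}


def student_test(estimate, std_error, t_crit=T_CRITICAL):
    """Return (t statistic, reject H0?, CI lower bound, CI upper bound)."""
    t = (estimate - 0) / std_error
    half_width = t_crit * std_error
    return t, abs(t) > t_crit, estimate - half_width, estimate + half_width


results = {name: student_test(b, se) for name, (b, se) in ESTIMATES.items()}

for name, (t, reject, low, high) in results.items():
    decision = "reject H0" if reject else "do not reject H0"
    print(f"{name}: t = {t:.4f} ({decision}), 95% CI = [{low:.5f}, {high:.5f}]")
```

The interval for AGE contains zero while the interval for FREQUENCY does not, which matches the two test decisions.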
Regression of Y w.r.t X1
ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 694 925
Total 22 170 000
             Coefficients   Standard error
Constant     706
Variable X1  326.9          10.08
Regression of Y w.r.t. X2
ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 22 156 025
Total ? ?
             Coefficients   Standard error
Constant     1 811.2
Variable X2  9.32           63.62
             Coefficients   Standard error
Constant     742
Variable X1  327.27         10.15
Variable X2  -8.98          11.35
SOLUTIONS:
a. r²=SSRegression/SSTotal=(SSTotal-SSResidual)/SSTotal
Regression of Y w.r.t X1
ANALYSIS OF VARIANCE
Sum of Squares
Regression 21475075
Residual 694 925
Total 22 170 000
r² = (22170000 - 694925)/22170000 = 21475075/22170000 = 0.96865471
Regression of Y w.r.t. X2
ANALYSIS OF VARIANCE
Sum of Squares
Regression 13975
Residual 22 156 025
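Total 22 170 000
r² = 13975/22170000 = 0.00063
With both X1 and X2 in the model (residual sum of squares 681 981.4):
r² = (22170000 - 681981.4)/22170000 = 21488018.6/22170000 = 0.96923855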
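The formula in (a) can be applied to the two simple regressions with a short sketch. It uses only the total and residual sums of squares printed in the two ANOVA tables; the function name `r_squared` is illustrative.

```python
def r_squared(ss_total, ss_residual):
    """r^2 = SSRegression / SSTotal = (SSTotal - SSResidual) / SSTotal."""
    return (ss_total - ss_residual) / ss_total


SS_TOTAL = 22_170_000.0

# Residual sums of squares from the two simple regressions.
SS_RESIDUAL_X1 = 694_925.0
SS_RESIDUAL_X2 = 22_156_025.0

ss_regression_x1 = SS_TOTAL - SS_RESIDUAL_X1
ss_regression_x2 = SS_TOTAL - SS_RESIDUAL_X2

r2_x1 = r_squared(SS_TOTAL, SS_RESIDUAL_X1)
r2_x2 = r_squared(SS_TOTAL, SS_RESIDUAL_X2)

print(ss_regression_x1, round(r2_x1, 8))
print(ss_regression_x2, round(r2_x2, 8))
```

X1 alone explains almost 97% of the variability of Y, whereas X2 alone explains well under 1%.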
b.
Regression of Y w.r.t X1
             Coefficients   Standard error
Constant     706
Variable X1  326.9          10.08
We will run a Student test for the coefficient $\beta_1$ of the explanatory variable X1 as follows:
H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$
We calculate the test statistic $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{326.9}{10.08} = 32.4305556$
The critical value associated with a Student distribution for df = n - 2 = 36 - 2 = 34 and a type I error risk $\alpha = 0.05$,
$t_{34;0.025}$, is not included in the tables. We know $t_{30;0.025} = 2.042$ and $t_{40;0.025} = 2.021$ (two-tailed test), so it lies between these two values.
$|t| \gg$ critical value, so we can reject H0.
Regression of Y w.r.t X2
             Coefficients   Standard error
Constant     1 811.2
Variable X2  9.32           63.62
We will run a Student test for the coefficient $\beta_2$ of the explanatory variable X2 as follows:
H0: $\beta_2 = 0$
H1: $\beta_2 \neq 0$
We calculate the test statistic $t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{9.32}{63.62} = 0.14649481$
The critical value associated with a Student distribution for df = n - 2 = 36 - 2 = 34 and a type I error risk $\alpha = 0.05$,
$t_{34;0.025}$, is not included in the tables. We know $t_{30;0.025} = 2.042$ and $t_{40;0.025} = 2.021$ (two-tailed test), so it lies between these two values.
$|t| \ll$ critical value, so we do not reject H0.
X1:
We will run a Student test for the coefficient $\beta_1$ of the explanatory variable X1 as follows:
H0: $\beta_1 = 0$ in the presence of X2
H1: $\beta_1 \neq 0$ in the presence of X2
We calculate the test statistic $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{327.27}{10.15} = 32.2433498$
The critical value associated with a Student distribution for df = n - (k+1) = n - 3 = 36 - 3 = 33 and a type I error risk $\alpha = 0.05$,
$t_{33;0.025}$, is not included in the tables. We know $t_{30;0.025} = 2.042$ and $t_{40;0.025} = 2.021$ (two-tailed test), so it lies between these two values.
$|t| \gg$ critical value, so we can reject H0.
X2:
We will run a Student test for the coefficient $\beta_2$ of the explanatory variable X2 as follows:
H0: $\beta_2 = 0$ in the presence of X1
H1: $\beta_2 \neq 0$ in the presence of X1
We calculate the test statistic $t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{-8.98}{11.35} = -0.79118943$
The critical value associated with a Student distribution for df = n - (k+1) = n - 3 = 36 - 3 = 33 and a type I error risk $\alpha = 0.05$,
$t_{33;0.025}$, is not included in the tables. We know $t_{30;0.025} = 2.042$ and $t_{40;0.025} = 2.021$ (two-tailed test), so it lies between these two values.
$|t| <$ critical value, so we cannot reject H0.
             Coefficients   Standard error
Constant     706
Variable X1  326.9          10.08
$\hat{\beta}_0 = 706$ ; $\hat{\beta}_1 = 326.9$
ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 694 925
Total 22 170 000
3) Pierre Durand, an employee of this group, is 38 years old, with X2=10 years of service and X1=4
years of graduate studies. His monthly salary is Y=2050 Euros and he thinks he is underpaid.
Calculate a 95% confidence interval for the mean salary of an employee with Pierre Durand’s profile.
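$\hat{y} \pm t_{n-2,\alpha/2} \times s_\varepsilon \times \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} = 2013.6 \pm 2.042 \times 142.964928 \times \sqrt{\frac{1}{36} + \frac{(4 - 3.5)^2}{35 \times 2.40^2}}$
$= [2013.6 \pm 2.042 \times 142.964928 \times 0.17034629] = [2013.6 \pm 49.7299378] = [1963.87006 ; 2063.32994]$
where $s_\varepsilon = \sqrt{S^2} = 142.964928$ and $\hat{y} = 706 + 326.9 \times 4 = 2013.6$.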
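The interval can be recomputed step by step with the sketch below. All inputs are read from the solution above: n = 36, the fitted simple regression on X1, the residual sum of squares 694 925, the mean 3.5 and standard deviation 2.40 of X1 as they appear in the formula, and the tabulated critical value 2.042. The variable names are illustrative.

```python
import math

# Inputs read from the solution.
n = 36
b0, b1 = 706.0, 326.9        # fitted simple regression of Y on X1
x_p = 4.0                    # Pierre Durand's years of graduate studies
x_mean = 3.5                 # sample mean of X1
s_x = 2.40                   # corrected standard deviation of X1
ss_residual = 694_925.0
t_crit = 2.042               # tabulated value used in the solution

# Point prediction.
y_hat = b0 + b1 * x_p

# Residual standard deviation: sqrt(SSResidual / (n - 2)).
s_eps = math.sqrt(ss_residual / (n - 2))

# Sum of squared deviations of X1, recovered from its standard deviation.
sum_sq_dev = (n - 1) * s_x ** 2

# Half-width of the confidence interval for the mean response.
half_width = t_crit * s_eps * math.sqrt(1 / n + (x_p - x_mean) ** 2 / sum_sq_dev)

lower, upper = y_hat - half_width, y_hat + half_width
print(round(y_hat, 1), round(s_eps, 4), round(lower, 2), round(upper, 2))
```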
Durbin Watson:
Visually: A visual examination of this plot does not show any obvious pattern in the residual.
For α = .05, n = 13 observations, and k = 2 independent variables in the regression model, the Durbin-Watson table shows the
following upper and lower critical values:
Lower critical value: 0.86
Upper critical value: 1.56
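Since our test statistic of 1.3475 does not lie outside of this range, we do not have sufficient evidence to reject the null hypothesis
of the Durbin-Watson test. In other words, we cannot conclude that the residuals are autocorrelated (strictly speaking, a statistic between the two critical values falls in the inconclusive zone of the test).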
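For reference, the Durbin-Watson statistic and the usual textbook decision rule for positive autocorrelation can be sketched as follows. The residuals of this example are not given in the text, so the function is only illustrated on a toy series; the function names are hypothetical, and the three-way rule (reject, inconclusive, do not reject) is the standard convention rather than something stated above.

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_decision(d, lower, upper):
    """Textbook rule for testing positive autocorrelation."""
    if d < lower:
        return "reject H0"
    if d > upper:
        return "do not reject H0"
    return "inconclusive"


# Toy residuals that alternate in sign give a statistic above 2.
toy_d = durbin_watson([1, -1, 1, -1])

# Statistic and critical values quoted in the text.
decision = dw_decision(1.3475, lower=0.86, upper=1.56)
print(toy_d, decision)
```

In both the "inconclusive" and "do not reject H0" cases, the null hypothesis of no autocorrelation is not rejected.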
If you reject the null hypothesis and conclude that autocorrelation is present in the residuals, then you have a few different options
to correct this problem if you deem it to be serious enough:
For positive serial correlation, consider adding lags of the dependent and/or independent variable to the model.
For negative serial correlation, check to make sure that none of your variables are overdifferenced.
For seasonal correlation, consider adding seasonal dummy variables to the model.
With $b_j$ being the estimate of the regression coefficient concerned and $s_{b_j}$ its standard error.
Study the impact of “dependent variable” on “independent
variable”
Forecasting