
L.F. – ST22 – Advanced Statistical Methods

Practical Session 1 – Multiple Linear Regression Using Excel


Objectives

Using Excel, you will conduct multiple linear regression analysis.


You must follow the different steps of linear regression analysis, as discussed in class, and interpret all the outputs obtained.
You can rely on the checklist below, which specifies all the elements that must be included.

1. Choice of variables included in the model

Multiple regression is used to model the desire to have cosmetic surgery (y) as a function of the following explanatory variables (regressors):
- x1: self-esteem,
- x2: body satisfaction,
- x3: impression of reality TV, and
- x4: gender.

Therefore, the model equation is the following:

$Y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \beta_4 x_{4,i} + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma)$

2. Equation for model (based on sample  estimates) & interpretation of coefficients

Using Excel, we obtain:


$\hat{y}_i = 14.01 - 0.05\,x_{1,i} - 0.32\,x_{2,i} + 0.49\,x_{3,i} - 2.19\,x_{4,i}$

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.7054
  R Square            0.4976
  Adjusted R Square   0.4854
  Standard Error      2.2509
  Observations        170

ANOVA
              df    SS          MS         F         Significance F
  Regression    4    827.8333   206.9583   40.8492   0.0000
  Residual    165    835.9549     5.0664
  Total       169   1663.7882

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
  Intercept        14.0107           0.7753   18.0702    0.0000     12.4798     15.5415
  SELFESTM         -0.0479           0.0367   -1.3066    0.1932     -0.1204      0.0245
  BODYSAT          -0.3223           0.1435   -2.2465    0.0260     -0.6056     -0.0390
  IMPREAL           0.4931           0.1274    3.8707    0.0002      0.2416      0.7446
  GENDER           -2.1865           0.6766   -3.2314    0.0015     -3.5225     -0.8505
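
For readers who want to check the Excel output with code, the same fit can be reproduced with Python's statsmodels. This is a minimal sketch, assuming the BDYIMG data have been saved with columns named DESIRE, SELFESTM, BODYSAT, IMPREAL and GENDER; the file name used here is only illustrative.

```python
# Minimal sketch: reproduce the Excel regression output with statsmodels.
# Assumes the BDYIMG data are available as "BDYIMG.xlsx" (file name is an
# assumption) with columns DESIRE, SELFESTM, BODYSAT, IMPREAL, GENDER.
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("BDYIMG.xlsx")

X = sm.add_constant(df[["SELFESTM", "BODYSAT", "IMPREAL", "GENDER"]])
y = df["DESIRE"]

model = sm.OLS(y, X).fit()
print(model.summary())              # coefficients, t stats, p-values, R², F test
print(model.conf_int(alpha=0.05))   # 95% confidence intervals for the betas
```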

Interpretation of the unknown parameters:

- The desire to have cosmetic surgery decreases on average by 0.05 units for each one-unit increase in self-esteem, keeping the other regressors constant.
- The desire to have cosmetic surgery decreases on average by 0.32 units for each one-unit increase in body satisfaction, keeping the other regressors constant.
- The desire to have cosmetic surgery increases on average by 0.49 units for each one-unit increase in impression of reality TV, keeping the other regressors constant.

- Since “gender” is a qualitative variable, the interpretation of its coefficient is out of the scope of this course.

Interpretation of the intercept

- When all regressors are equal to zero, the desire to have cosmetic surgery is expected to be equal to 14.01.

3. Global model validity (Fisher test) & model quality (R² and adjusted R²)

Considering that the p-value for the Fisher F test is close to zero, we conclude that the model is valid for the overall population, since there is at least one significant regressor for predicting y.

Considering the coefficient of determination (0.4976) and the adjusted coefficient of determination (0.4854), we conclude that approximately half of the total variability is explained by the regression model (which is rather low).
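
The F statistic, its p-value, R² and adjusted R² can be recomputed directly from the ANOVA sums of squares above; the Python sketch below (scipy is assumed, since the practical itself is Excel-based) shows the arithmetic.

```python
# Sketch: recompute F, its p-value, R² and adjusted R² from the ANOVA table.
from scipy import stats

n, k = 170, 4
ss_reg, ss_res, ss_tot = 827.8333, 835.9549, 1663.7882

ms_reg = ss_reg / k                      # 206.96
ms_res = ss_res / (n - k - 1)            # 5.07
F = ms_reg / ms_res                      # about 40.85
p_value = stats.f.sf(F, k, n - k - 1)    # "Significance F", essentially 0

r2 = ss_reg / ss_tot                             # about 0.4976
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # about 0.4854
print(F, p_value, r2, r2_adj)
```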

(other example)

4. Marginal contributions of explanatory variables (Student tests) & confidence intervals for the β's

The p-values of the Student t tests are provided in the second column of the following table, and the confidence intervals are provided in the last two columns.

              P-value   Lower 95%   Upper 95%
  SELFESTM     0.1932     -0.1204      0.0245
  BODYSAT      0.0260     -0.6056     -0.0390
  IMPREAL      0.0002      0.2416      0.7446
  GENDER       0.0015     -3.5225     -0.8505

The p-values for the variables BODYSAT, IMPREAL and GENDER are lower than 0.05. Thus, we can conclude that these variables have a significant marginal contribution at the 95% confidence level. (This can also be seen from the confidence intervals, since zero is not included in them.)

On the other hand, the p-value for the variable SELFESTM is higher than 0.05, denoting that this variable is not significant at the 95% confidence level. (This can also be seen from its confidence interval, since zero lies inside it.)
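
As a cross-check, the t statistics, two-sided p-values and 95% confidence intervals can be recomputed from the coefficients and standard errors reported above; a Python sketch (scipy assumed) follows.

```python
# Sketch: marginal Student tests and 95% CIs from the coefficient table above.
from scipy import stats

df_resid = 170 - 4 - 1                    # n - k - 1 = 165
t_crit = stats.t.ppf(0.975, df_resid)     # about 1.974

coefs = {"SELFESTM": (-0.0479, 0.0367),
         "BODYSAT":  (-0.3223, 0.1435),
         "IMPREAL":  (0.4931, 0.1274),
         "GENDER":   (-2.1865, 0.6766)}

for name, (b, se) in coefs.items():
    t = b / se
    p = 2 * stats.t.sf(abs(t), df_resid)
    lo, hi = b - t_crit * se, b + t_crit * se
    print(f"{name}: t = {t:.4f}, p = {p:.4f}, 95% CI = [{lo:.4f}; {hi:.4f}]")
```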

Correlation coefficients interpretation


The correlation coefficients in Table 1 provide information about the linear relationship between pairs of variables. A correlation coefficient of zero indicates that there is no linear relationship between the two variables. The closer the correlation coefficient is to -1 or +1, the stronger the relationship between the two variables; a coefficient of +1 indicates a perfect positive relationship and -1 a perfect negative relationship.
The correlation of 0.686 between Age and Experience, for example, indicates that the higher the age of a participant, the greater their work experience is on average. Note, however, that this relationship is not perfect: a higher age does not mean proportionally greater experience in every case, only that experience is greater on average for the higher age group.

T-test interpretation
The t statistic in the various regressions indicates the statistical significance of the relationship between the two variables.
For example, the t statistic for Age in the Salary regression is 2.128, whose absolute value is greater than the critical t-value of 1.969. This means that the relationship between Age and Salary is statistically significant; the relationship observed in the sample data is not simply due to chance. Furthermore, the regression coefficient of 802.9337 for Age means that, as Age increases by one year, the average salary is higher by $802.9337.

Note that in regression (iii), the t statistic for Sex is -0.079. In this case, its absolute value, 0.079, is less than the critical value of 1.969. Hence, we infer that the relationship between Sex and Salary is statistically non-significant; we retain the hypothesis that the correlation between Sex and Salary is really zero in the population of interest, and that we observed something slightly different from zero in our sample simply due to chance (sampling fluctuations). It follows that the regression coefficient of -192.7 for Sex is not significant, and we retain the hypothesis that it is really zero in the population.
The results of the other regressions can be interpreted similarly.

5. New case estimation. Forecasted values.

Participant 171 Participant 172


GENDER = female (0 value) GENDER = male (1 value)
SELFESTM = 20 SELFESTM = 38
BODYSAT = 4 BODYSAT = 2
IMPREAL = 3 IMPREAL = 2

We substitute the previous values for the regressors in the following equation:

$\hat{y}_i = 14.01 - 0.05\,x_{1,i} - 0.32\,x_{2,i} + 0.49\,x_{3,i} - 2.19\,x_{4,i}$

Obtaining:

              beta      participant 171   participant 172
  Intercept   14.0107          1                 1
  SELFESTM    -0.0479         20                38
  BODYSAT     -0.3223          4                 2
  IMPREAL      0.4931          3                 2
  GENDER      -2.1865          0                 1

$\hat{y}_{171} = 14.0107 - 0.0479 \cdot 20 - 0.3223 \cdot 4 + 0.4931 \cdot 3 - 2.1865 \cdot 0 \approx 13.24$
$\hat{y}_{172} = 14.0107 - 0.0479 \cdot 38 - 0.3223 \cdot 2 + 0.4931 \cdot 2 - 2.1865 \cdot 1 \approx 10.35$
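
The same substitution can be scripted; the short Python sketch below recomputes the two predictions from the estimated coefficients reported above.

```python
# Sketch: predicted desire for the two new participants, from the fitted betas.
betas = {"const": 14.0107, "SELFESTM": -0.0479, "BODYSAT": -0.3223,
         "IMPREAL": 0.4931, "GENDER": -2.1865}

participants = {
    171: {"SELFESTM": 20, "BODYSAT": 4, "IMPREAL": 3, "GENDER": 0},  # female
    172: {"SELFESTM": 38, "BODYSAT": 2, "IMPREAL": 2, "GENDER": 1},  # male
}

for pid, x in participants.items():
    y_hat = betas["const"] + sum(betas[v] * x[v] for v in x)
    print(pid, round(y_hat, 2))   # about 13.24 and 10.35
```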

6. A-posteriori diagnostics: Residual analysis (including plots) & Multicollinearity analysis

Concerning the multicollinearity analysis, we compute the correlation coefficients between the variables:

              DESIRE     SELFESTM   BODYSAT    IMPREAL    GENDER
  DESIRE       1
  SELFESTM    -0.48546    1
  BODYSAT     -0.64359    0.757216   1
  IMPREAL      0.131726   0.16651    0.14344    1
  GENDER      -0.63678    0.511079   0.828248   0.065294   1

We observe some high correlations between regressors, notably between SELFESTM and BODYSAT (0.76) and between BODYSAT and GENDER (0.83).
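
Beyond the pairwise correlations, variance inflation factors (VIFs) are a common complementary multicollinearity check, not part of the Excel output shown here; the Python sketch below reuses the df DataFrame assumed earlier.

```python
# Sketch: correlation matrix and variance inflation factors (VIF).
# Assumes df is the DataFrame with the BDYIMG data loaded earlier.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

regressors = ["SELFESTM", "BODYSAT", "IMPREAL", "GENDER"]
print(df[["DESIRE"] + regressors].corr())

X = sm.add_constant(df[regressors])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```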

Concerning the residual analysis, the residuals' histogram is provided, suggesting approximate normality.

The standardized residuals are also plotted; they appear homoscedastic (constant in variance), and no outliers are detected at the 99% confidence level, since all standardized residuals have an absolute value lower than 3.

[Figure: plot of the standard residuals]
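
A rough equivalent of Excel's standard residuals can be obtained in Python via internally studentized residuals (the definitions differ slightly, so this is only an approximate check); the sketch reuses the fitted model object assumed earlier.

```python
# Sketch: internally studentized residuals as a stand-in for Excel's
# "standard residuals"; values with |r| > 3 would be flagged as outliers.
# Assumes `model` is the fitted statsmodels OLS result from earlier.
import numpy as np

influence = model.get_influence()
std_resid = influence.resid_studentized_internal
print("max |standardized residual|:", np.abs(std_resid).max())
print("indices with |r| > 3:", np.where(np.abs(std_resid) > 3)[0])
```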

7. Alternative (improved) model

Since we observed that the variable SELFESTM is not significant at the 95% confidence level, we remove it from the model and re-estimate the model.

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.7017
  R Square            0.4924
  Adjusted R Square   0.4832
  Standard Error      2.2557
  Observations        170

ANOVA
              df    SS          MS         F         Significance F
  Regression    3    819.1836   273.0612   53.6679   0.0000
  Residual    166    844.6047     5.0880
  Total       169   1663.7882

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
  Intercept        13.3228           0.5704   23.3549    0.0000     12.1965     14.4491
  BODYSAT          -0.4508           0.1047   -4.3065    0.0000     -0.6575     -0.2441
  IMPREAL           0.4827           0.1274    3.7885    0.0002      0.2311      0.7343
  GENDER           -1.9113           0.6444   -2.9661    0.0035     -3.1836     -0.6391

In the alternative model, all the regressors are significant.


(Neither the R Square nor the adjusted R Square has changed appreciably.)

The final equation is the following:

$\hat{y}_i = 13.32 - 0.45\,\text{bodysat}_i + 0.48\,\text{impreal}_i - 1.91\,\text{gender}_i$
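
The reduced model can be re-estimated in the same way as before; a short Python sketch, reusing the df DataFrame assumed earlier, is given below.

```python
# Sketch: re-estimate the model without SELFESTM (assumes df from earlier).
import statsmodels.api as sm

X_red = sm.add_constant(df[["BODYSAT", "IMPREAL", "GENDER"]])
reduced = sm.OLS(df["DESIRE"], X_red).fit()
print(reduced.params)                            # about 13.32, -0.45, 0.48, -1.91
print(reduced.rsquared, reduced.rsquared_adj)    # about 0.4924 and 0.4832
```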

Context

Reality TV and Cosmetic Surgery


(Chapter 12: #12.17 p.725; #12.27 p.732; #12.128 p.804; #12.116 p.782)

How much influence does the media, especially reality television programs, have on one’s decision to undergo cosmetic surgery?
This was the question of interest to psychologists who published an article in Body Image: An International Journal of Research
(March 2010).
In the study, 170 college students answered questions about their impressions of reality TV shows featuring cosmetic surgery.

The five variables analyzed in the study were measured as follows:


DESIRE—scale ranging from 5 to 25, where the higher the value, the greater the interest in having cosmetic surgery;
GENDER—1 if male, 0 if female;
SELFESTM—scale ranging from 4 to 40, where the higher the value, the greater the level of self-esteem;

BODYSAT—scale ranging from 1 to 9, where the higher the value, the greater the satisfaction with one’s own body;
IMPREAL—scale ranging from 1 to 7, where the higher the value, the more one believes reality television shows featuring
cosmetic surgery are realistic.

The data for the study (simulated based on statistics reported in the journal article) are saved in the Excel file BDYIMG (available
on BlackBoard).

Additional Participants:

Participant 171 Participant 172


GENDER = female GENDER = male
SELFESTM = 20 SELFESTM = 38
BODYSAT = 4 BODYSAT = 2
IMPREAL = 3 IMPREAL = 2

Exercise: Testing a model (without all tables)


A staff restaurant conducted a survey collecting data from a random sample of 32 clients. They were asked,
among other things: how many times they ate at the restaurant during the last month (variable:
FREQUENCY); how much they spent on a meal (variable: SPENDING); and how old they were (variable:
AGE).

The restaurant manager would like to construct a model that would explain the spending amount in
terms of the frequency and the age for all clients.

You will find below:


Appendix 1: scatterplot of SPENDING w.r.t. FREQUENCY
Appendix 2: scatterplot of SPENDING w.r.t. AGE
Appendix 3: linear regression results obtained using a statistical package modeling SPENDING in
terms of FREQUENCY and AGE.

You are asked to proceed with the different tests required to validate the linear regression model for all
clients among which the sample was taken.

Appendix 1 (1st scatterplot)


Response variable: SPENDING
Explanatory variable: FREQUENCY

Appendix 2 (2nd scatterplot)


Response variable: SPENDING
Explanatory variable: AGE

Appendix 3 (multiple linear regression model)


Response variable: SPENDING
Explanatory variables: FREQUENCY and AGE

Variable #1 (SPENDING)
Mean 4.17188
Corrected Standard Deviation 1.53249
Variable #2 (FREQUENCY)
Mean 10.59375
Corrected Standard Deviation 6.76738
Variable #3 (AGE)
Mean 35.75
Corrected Standard Deviation 11.6453
Count n 32
R-square 0.58999

Coefficient Standard Error


Intercept 6.13803
FREQUENCY -0.17208 0.02782
AGE -0.00401 0.01617

ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 29.8504
Total 72.8047 ?

SOLUTIONS:
ANOVA TABLE
              SS        df                MS           F            p-value
  Regression  42.9543   2                 21.47715     20.8652933   < 0.01
  Residual    29.8504   29 = 32 - (2+1)   1.02932414
  Total       72.8047   31 = 32 - 1

SS_Regression = 72.8047 − 29.8504 = 42.9543

Overall test (F test)

H0: β1 = β2 = 0
H1: at least one of the βj (j = 1, 2) is not 0

From the F table (df1 = 2; df2 = 29), the critical value is 3.33 (at the 5% level).

Since F = 20.87 > 3.33, there is enough statistical evidence to reject the null hypothesis (no model): the model is globally significant.
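
The F statistic and its critical value can also be recomputed as follows; this Python/scipy sketch mirrors the hand computation (the practical itself uses the printed F table).

```python
# Sketch: overall F test for the SPENDING model (n = 32, k = 2 regressors).
from scipy import stats

n, k = 32, 2
ss_res, ss_tot = 29.8504, 72.8047
ss_reg = ss_tot - ss_res                     # 42.9543

F = (ss_reg / k) / (ss_res / (n - k - 1))    # about 20.87
f_crit = stats.f.ppf(0.95, k, n - k - 1)     # about 3.33
print(F, f_crit, F > f_crit)                 # F > critical value: reject H0
```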

              Coefficient   Standard Error   t             p-value
  Intercept    6.13803
  FREQUENCY   -0.17208      0.02782          -6.18547807   < 0.05
  AGE         -0.00401      0.01617          -0.24799011   > 0.05

Frequency:

$95\%\ \text{CI for } \beta_1 = \hat{\beta}_1 \pm t_{29,\alpha/2} \times s_{\hat{\beta}_1} = [-0.17208 \pm 2.045 \times 0.02782]$

We will run a Student test for the coefficient β1 of the explanatory variable FREQUENCY as follows:
H0: β1 = 0 in the presence of AGE
H1: β1 ≠ 0 in the presence of AGE

We calculate the test statistic $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{-0.17208}{0.02782} = -6.18547807$

The critical values associated with a Student distribution for df = 29 and a type I error risk α = 0.05 are ±t29;0.025 = ±2.045 (two-tailed test).
|t| >> critical value, so we can reject H0.

Age:

$95\%\ \text{CI for } \beta_2 = \hat{\beta}_2 \pm t_{29,\alpha/2} \times s_{\hat{\beta}_2} = [-0.00401 \pm 2.045 \times 0.01617]$

We will run a Student test for the coefficient β2 of the explanatory variable AGE as follows:
H0: β2 = 0 in the presence of FREQUENCY
H1: β2 ≠ 0 in the presence of FREQUENCY

We calculate the test statistic $t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{-0.00401}{0.01617} = -0.24799011$

The critical values associated with a Student distribution for df = 29 and a type I error risk α = 0.05 are ±t29;0.025 = ±2.045 (two-tailed test).
|t| < critical value, so we cannot reject H0.
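
Both marginal tests and their 95% confidence intervals can be recomputed from the reported coefficients and standard errors; a Python sketch with scipy follows.

```python
# Sketch: Student tests and 95% CIs for FREQUENCY and AGE (df = 29).
from scipy import stats

df_resid = 29
t_crit = stats.t.ppf(0.975, df_resid)        # about 2.045

for name, b, se in [("FREQUENCY", -0.17208, 0.02782),
                    ("AGE",       -0.00401, 0.01617)]:
    t = b / se
    p = 2 * stats.t.sf(abs(t), df_resid)
    lo, hi = b - t_crit * se, b + t_crit * se
    print(f"{name}: t = {t:.4f}, p = {p:.4f}, 95% CI = [{lo:.4f}; {hi:.4f}]")
```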

Exercise 2: Choosing a model and using it


The HR director of an industrial group would like to construct a model explaining the monthly salary
of all employees.
Using data collected from a random sample of 36 employees, he tests two explanatory variables that
he deems relevant: the number of years of graduate studies (X1) and the number of years of service
(X2).
You can find below results from three regression models he tested using Excel.

1) For each of the three suggested models:


a. Calculate the coefficient of determination r²
b. Conduct the Student tests for the explanatory variables.
2) Can you help the HR director choose the most suitable model?
a. Using the information provided, which model would you suggest to use? Justify your choice.
b. Estimate the parameters of the chosen model.
3) Pierre Durand, an employee of this group, is 38 years old, with 10 years of service and 4 years of
graduate studies. His monthly salary is 2050 Euros and he thinks he is underpaid.
Calculate a 95% confidence interval for the mean salary of an employee with Pierre Durand’s profile.
If you were the HR director of that firm, what would you tell Pierre Durand about his salary?

Variable Mean Sample standard deviation*


Y 1850 795.88
X1 3.5 2.40
X2 4.17 2.15
* Reminder: this is the unbiased point estimate (computed with a 1/(n−1) coefficient)

Regression of Y w.r.t X1
ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 694 925
Total 22 170 000

              Coefficients   Standard error
  Constant     706
  Variable X1  326.9          10.08

Regression of Y w.r.t. X2
ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 22 156 025
Total ? ?

              Coefficients   Standard error
  Constant     1 811.2
  Variable X2  9.32           63.62

Regression of Y w.r.t. X1, X2


ANALYSIS OF VARIANCE
Sum of Squares
Regression ?
Residual 681 981.4
Total ?

              Coefficients   Standard error
  Constant     742
  Variable X1  327.27         10.15
  Variable X2  -8.98          11.35

SOLUTIONS:

1) For each of the three suggested models:


a. Calculate the coefficient of determination r²
b. Conduct the Student tests for the explanatory variables.

a. r²=SSRegression/SSTotal=(SSTotal-SSResidual)/SSTotal

Regression of Y w.r.t X1
ANALYSIS OF VARIANCE
Sum of Squares
Regression    21 475 075
Residual         694 925
Total         22 170 000

r² = (22 170 000 − 694 925)/22 170 000 = 21 475 075/22 170 000 = 0.96865471

Regression of Y w.r.t. X2
ANALYSIS OF VARIANCE
Sum of Squares
Regression        13 975
Residual      22 156 025
Total         22 170 000

r² = (22 170 000 − 22 156 025)/22 170 000 = 13 975/22 170 000 = 0.00063036

Regression of Y w.r.t. X1, X2


ANALYSIS OF VARIANCE
              Sum of Squares
Regression    21 488 018.6
Residual         681 981.4
Total         22 170 000

r² = (22 170 000 − 681 981.4)/22 170 000 = 21 488 018.6/22 170 000 = 0.96923855
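
The three r² values follow directly from the sums of squares; a short Python sketch of the arithmetic is given below.

```python
# Sketch: r² for the three HR models from the sums of squares above.
ss_total = 22_170_000
for label, ss_res in [("Y ~ X1", 694_925),
                      ("Y ~ X2", 22_156_025),
                      ("Y ~ X1 + X2", 681_981.4)]:
    r2 = (ss_total - ss_res) / ss_total
    print(label, round(r2, 5))    # 0.96865, 0.00063, 0.96924
```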

b.

Regression of Y w.r.t X1
              Coefficients   Standard error
  Constant     706
  Variable X1  326.9          10.08

We will run a Student test for the coefficient β1 of the explanatory variable X1 as follows:
H0: β1 = 0
H1: β1 ≠ 0

We calculate the test statistic $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{326.9}{10.08} = 32.4305556$

The critical values associated with a Student distribution for df = n − 2 = 36 − 2 = 34 and a type I error risk α = 0.05, t34;0.025, are not included in the tables. We know t30;0.025 = 2.042 and t40;0.025 = 2.021 (two-tailed test).
|t| >> critical value, so we can reject H0.

Regression of Y w.r.t X2
              Coefficients   Standard error
  Constant     1 811.2
  Variable X2  9.32           63.62

We will run a Student test for the coefficient β2 of the explanatory variable X2 as follows:
H0: β2 = 0
H1: β2 ≠ 0

We calculate the test statistic $t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{9.32}{63.62} = 0.14649481$

The critical values associated with a Student distribution for df = n − 2 = 36 − 2 = 34 and a type I error risk α = 0.05, t34;0.025, are not included in the tables. We know t30;0.025 = 2.042 and t40;0.025 = 2.021 (two-tailed test).
|t| << critical value, so we do not reject H0.

Regression of Y w.r.t. X1, X2


              Coefficients   Standard error
  Constant     742
  Variable X1  327.27         10.15
  Variable X2  -8.98          11.35

X1:
We will run a Student test for the coefficient β1 of the explanatory variable X1 as follows:
H0: β1 = 0 in the presence of X2
H1: β1 ≠ 0 in the presence of X2

We calculate the test statistic $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{327.27}{10.15} = 32.2433498$

The critical values associated with a Student distribution for df = n − (k+1) = n − 3 = 36 − 3 = 33 and a type I error risk α = 0.05, t33;0.025, are not included in the tables. We know t30;0.025 = 2.042 and t40;0.025 = 2.021 (two-tailed test).

|t| >> critical value, so we can reject H0.

X2:
We will run a Student test for the coefficient β2 of the explanatory variable X2 as follows:
H0: β2 = 0 in the presence of X1
H1: β2 ≠ 0 in the presence of X1

We calculate the test statistic $t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{-8.98}{11.35} = -0.79118943$

The critical values associated with a Student distribution for df = n − (k+1) = n − 3 = 36 − 3 = 33 and a type I error risk α = 0.05, t33;0.025, are not included in the tables. We know t30;0.025 = 2.042 and t40;0.025 = 2.021 (two-tailed test).
|t| < critical value, so we cannot reject H0.

2.a Choose the most suitable model

From 1.b, we choose the regression of Y w.r.t X1: X1 is the only significant explanatory variable, and adding X2 barely improves r².

2.b Parameters ($\hat{\beta}_0$, $\hat{\beta}_1$, $s^2$)

              Coefficients   Standard error
  Constant     706
  Variable X1  326.9          10.08

$\hat{\beta}_0 = 706$ ; $\hat{\beta}_1 = 326.9$

ANALYSIS OF VARIANCE
              Sum of Squares
  Regression  21 475 075
  Residual       694 925
  Total       22 170 000

s² = SS_Residual/(n − 2) = 694 925/34 = 20 438.9706

3) Pierre Durand, an employee of this group, is 38 years old, with X2=10 years of service and X1=4
years of graduate studies. His monthly salary is Y=2050 Euros and he thinks he is underpaid.
Calculate a 95% confidence interval for the mean salary of an employee with Pierre Durand’s profile.

We use the model chosen in 2.a


95% CI for E(y) when x = 4:

$\hat{y} \pm t_{n-2,\alpha/2} \times s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} = \left[2013.6 \pm 2.042 \times 142.964928 \times \sqrt{\frac{1}{36} + \frac{(4 - 3.5)^2}{35 \times 2.40^2}}\right]$
$= [2013.6 \pm 2.042 \times 142.964928 \times 0.17034629] = [2013.6 \pm 49.7299378] = [1963.87006\,;\,2063.32994]$

where $s_\varepsilon = \sqrt{s^2} = 142.964928$, $\hat{y} = 706 + 326.9 \times 4 = 2013.6$, $t_{34;0.025} \approx 2.042$ (using the Student table with df = 30), and $\sum (x - \bar{x})^2 = (n-1) \times s_x^2 = 35 \times 2.40^2$.
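
The same confidence interval can be recomputed step by step; the Python sketch below uses the exact t quantile for df = 34 (about 2.032) rather than the tabulated 2.042 for df = 30, so the bounds differ very slightly from those above.

```python
# Sketch: 95% CI for the mean salary of an employee with X1 = 4, using Y ~ X1.
from math import sqrt
from scipy import stats

n = 36
b0, b1 = 706.0, 326.9
s_eps = sqrt(694_925 / (n - 2))              # residual std. error, about 142.96
x_bar, s_x = 3.5, 2.40
sxx = (n - 1) * s_x ** 2                     # 35 * 2.40²

x_p = 4
y_hat = b0 + b1 * x_p                        # 2013.6
t_crit = stats.t.ppf(0.975, n - 2)           # about 2.032 (the text uses 2.042)
half_width = t_crit * s_eps * sqrt(1 / n + (x_p - x_bar) ** 2 / sxx)
print(y_hat - half_width, y_hat + half_width)   # roughly [1964; 2063]
```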
If you were the HR director of that firm, what would you tell Pierre Durand about his salary?

His salary of 2050 € belongs to the confidence interval.


He is not underpaid (in fact, his salary is higher than the predicted mean salary of 2013.6 €!).

Durbin Watson:

Visually: an examination of this plot does not show any obvious pattern in the residuals.

The test statistic always ranges from 0 to 4, where:
- d = 2 indicates no autocorrelation
- d < 2 indicates positive serial correlation
- d > 2 indicates negative serial correlation

In general, if d is less than 1.5 or greater than 2.5, there is potentially a serious autocorrelation problem. Otherwise, if d is between 1.5 and 2.5, autocorrelation is likely not a cause for concern.

H0 (null hypothesis): There is no correlation among the residuals.


HA (alternative hypothesis): The residuals are autocorrelated.

For α = 0.05, n = 13 observations, and k = 2 independent variables in the regression model, the Durbin-Watson table shows the following upper and lower critical values:
- Lower critical value: 0.86
- Upper critical value: 1.56

Since our test statistic of 1.3475 does not lie outside of this range, we do not have sufficient evidence to reject the null hypothesis of the Durbin-Watson test. In other words, there is no significant correlation among the residuals.

If you reject the null hypothesis and conclude that autocorrelation is present in the residuals, you have a few different options to correct this problem if you deem it to be serious enough:
- For positive serial correlation, consider adding lags of the dependent and/or independent variable to the model.
- For negative serial correlation, check to make sure that none of your variables are over-differenced.
- For seasonal correlation, consider adding seasonal dummy variables to the model.
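
If residuals are available in code (e.g. from a statsmodels fit such as the model object assumed earlier), the Durbin-Watson statistic itself can be computed as follows; note that the d = 1.3475 quoted above comes from a separate example with n = 13.

```python
# Sketch: Durbin-Watson statistic from a fitted model's residuals.
# `model` is assumed to be a statsmodels OLS result (e.g. the one fitted above);
# the d = 1.3475 quoted in the text comes from a separate n = 13 example.
from statsmodels.stats.stattools import durbin_watson

d = durbin_watson(model.resid)
print(d)    # compare with the lower/upper critical values from the DW table
```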

Confidence Interval for Regression Coefficients

$b_j \pm t_{n-(k+1),\,\alpha/2} \times s_{b_j}$, with $b_j$ being the estimated regression coefficient concerned and $s_{b_j}$ its standard error.

This interval can be used to study the impact of an independent variable on the dependent variable.
Forecasting

Use the standard deviation of the regression (the global standard error).
