Practical Examples With STATA

The document provides an overview of estimating and interpreting a multiple linear regression model (MLRM) to analyze the determinants of household saving behavior in Ethiopia. It discusses using the regress command in Stata to estimate an MLRM with monthly saving as the dependent variable and factors like family size, income, sex, and wealth as independent variables. Before interpreting the results, the document stresses the importance of testing whether the MLRM assumptions are satisfied and whether the overall model and individual coefficients are statistically significant. It outlines performing an F-test on the overall model and t-tests on individual coefficients.

Econometrics for Management
(MGMT3071)

Practical Examples with Stata
(Estimation, Interpretation & Evaluation of CLRM)

Teklebirhan Alemnew (Assistant Professor)
[email protected]
AAU, 2023
By: Teklebirhan A.
Estimation and Interpretation of MLRMs
 Suppose we want to analyze the determinants of the saving behavior
of households in Ethiopia: the case of XYZ regional state.

 To this end, suppose we formulate the following MLRM that
shows the relationship between household saving and its covariates
(family size, income, sex and wealth) for a sample of households:

Saving_i = β0 + β1 FS_i + β2 Income_i + β3 Sex_i + β4 Wealth_i + u_i

where:
Saving = Household's Monthly Saving (in ETB)
FS = Family Size
Income = Household's Monthly Income (in ETB)
Sex = Sex of the Household head
Wealth = Wealth of the household
Cont…
 We use regression to estimate the unknown effect of changing
one or more variables on another.
 If we manage to obtain the required data on all the variables
included in our model above, we can study the relationship between
them by using the regress command or the menu bar in Stata.
 Regress command:
regress [dependent variable] [independent variable(s)]
 Menu bar:
 Statistics -> Linear models and related -> Linear regression

 Before running a regression it is recommended to have a clear
idea of what you are trying to estimate (i.e. which are your
outcome and predictor variables).
 A regression makes sense only if there is a sound theory behind it.
 Estimate the Model Using the command
(use Saving Data)
 Command: reg Saving FS Income Sex Wealth
 (Stata will produce the following result)
. reg Saving FS Income Sex Wealth

Source SS df MS Number of obs = 50


F(4, 45) = 62.78
Model 27854112.9 4 6963528.21 Prob > F = 0.0000
Residual 4991302.27 45 110917.828 R-squared = 0.8480
Adj R-squared = 0.8345
Total 32845415.1 49 670314.594 Root MSE = 333.04

Saving Coef. Std. Err. t P>|t| [95% Conf. Interval]

FS -13.53576 27.05971 -0.50 0.619 -68.03682 40.9653


Income .1272463 .0154236 8.25 0.000 .0961817 .158311
Sex -149.4642 95.92771 -1.56 0.126 -342.6725 43.74418
Wealth .1275215 .0281806 4.53 0.000 .0707627 .1842802
_cons 178.3252 197.502 0.90 0.371 -219.4642 576.1147

 Note the following notations from the above table:
SS = Sum of squares
df = Degrees of freedom
MS = Mean squares
Number of obs = No of observations used in the regression
F() = F value from the joint test of significance of the model
Prob > F = p-value of the F test
R-squared = Model’s R-Squared
Adj R-squared = Model’s Adjusted R-squared
Root MSE = Root Mean Squared Error
Coef. = estimated coefficients (β̂0, β̂1, β̂2, β̂3, β̂4, respectively)
t = t-ratios/statistics of the corresponding coefficients
Report the Regression Result. How?
 There are two ways to report a regression result:
a) By fitting the estimated coefficients into the model &
b) Table form

a) Fitting the estimated coefficients into the model

 It is customary to report regression results as follows:

Saving-hat = 178.325 − 13.536 FS + 0.127 Income − 149.464 Sex + 0.128 Wealth
             (197.502)  (27.059)     (0.015)       (95.928)       (0.028)

R² = 84.8%

NB: The values in parentheses are the standard errors of the respective parameters.
b) Table Form
eststo: reg Saving FS Income Sex Wealth
esttab using save.rtf, se r2 label
Household's Monthly Saving (in ETB)
Family Size -13.54
(27.06)

Household's Monthly Income (in ETB) 0.127***


(0.0154)

Sex of the Household head -149.5


(95.93)

Wealth of the household 0.128***


(0.0282)

Constant 178.3
(197.5)
Observations 50
R2 0.848
Standard errors in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001

 You can use the eststo and esttab commands to have Stata produce a
regression table that looks like those in journal articles.
Can we interpret the above result now?

NO!!!

Why???

 BECAUSE this estimated model does not yet have statistical
backing, since the validity of the model has not been tested.
 Besides, whether the CLRM assumptions are satisfied has not
been tested either.

Therefore, before interpreting the estimated results of our
model, we first have to test the statistical significance as
well as the assumptions of the estimated model.

Let's go for that!
1. Statistical Tests of Significance (FOT)
(Testing Validity of the Regression Model)
 Broadly speaking, a test of significance is a procedure
by which sample results are used to verify the truth or
falsity of a null hypothesis.

 The problem of statistical hypothesis testing deals with


testing whether a given finding is “sufficiently” close to
the hypothesized value so that we do not reject the
stated hypothesis.

 In statistics, the stated hypothesis is known as the null


hypothesis, denoted by H0, and it is tested against an
alternative hypothesis, H1.
 A statistical test assumes that the variable (or the estimator) under
consideration has some probability distribution, and hypothesis
testing involves making assertions about the value(s) of the
parameter(s) of the assumed distribution.

 Such tests require normality of the error term (which we have
assumed) so that the dependent variable is normally distributed.

 Hence, the OLS estimators are also normally distributed because of
a property of the normal distribution: "... any linear function of a
normally distributed variable is itself normally distributed".
 Since the error variance must be estimated, the standardized
estimators follow the t-distribution, so the t-statistic of each
parameter can be computed as:

t = (β̂k − βk) / se(β̂k) ~ t(n − k)
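The t column in the regression table above can be reproduced by hand: under H0: βk = 0, the t-ratio is simply the coefficient divided by its standard error. A quick cross-check outside Stata (a Python sketch; the numbers are copied from the regression output, and the dictionary layout is for illustration only):

```python
# Values copied from the regression table above; under H0: beta_k = 0,
# the t-ratio is simply coefficient / standard error.
coefs = {
    "FS": (-13.53576, 27.05971),
    "Income": (0.1272463, 0.0154236),
    "Sex": (-149.4642, 95.92771),
    "Wealth": (0.1275215, 0.0281806),
    "_cons": (178.3252, 197.502),
}
t_stats = {name: coef / se for name, (coef, se) in coefs.items()}
for name, t in t_stats.items():
    print(name, round(t, 2))  # matches the t column of the Stata output
```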
 NB: To test a hypothesis, we choose level(s) of significance.
 Level of significance is the probability of making ‘wrong’
decision, i.e. the probability of rejecting the hypothesis when it
is actually true.
 It is customary in econometric research to choose the 1% or
the 5% or the 10% level of significance (tolerance level).

 In general, the least squares estimates (for instance, β̂1 and β̂2) are
obtained from a sample of observations on Y, X1, X2, X3, and X4.

 Since sampling errors are inevitable in all estimates, it is


necessary to apply tests of significance in order to measure
the size of the error and determine the degree of confidence
in the validity of the estimates.
We have two categories of significance tests:
A. Overall level of significance tests
i. F-statistics
ii. Coefficient of Determination (R²)

B. Individual tests of significance


T-test, i.e., t-statistics/Standard error test

i. F-statistic: measures the overall significance of the model.
 The F-statistic tests:
 The null hypothesis (H0): β1 = β2 = β3 = β4 = 0, i.e. that all
slope coefficients are equal to zero, implying that there is no
relationship between the dependent variable and the
independent variables.
 The alternative hypothesis (H1): at least one βk ≠ 0, i.e.,
not all of the coefficients are equal to zero.
 Decision Rule:
 If H0 is not rejected, it implies that there is no joint relation
between the dependent and independent variables, even if the
estimated coefficients are not exactly zero.
 If H0 is rejected, the overall model is valid and we can go
ahead and check the other tests.
How do we do the F-test?
 You can calculate the F-test manually or using a computer.
 This training concentrates on computer use.
 You are strongly advised to read a statistics book for the
manual F-statistic calculation and how it is applied.

 When you run a regression model in Stata, by default it gives
you all the validity tests, including the F-test.
 What is important is to know how to use and interpret them.
 The F-test is given at the top-right of the Stata output.
 The F-statistic is 62.78 and, as implied by the p-value of the
F-test, the model is significant at 1% (Prob > F = 0.0000).
 Therefore, the model is overall significant.
 We reject the null hypothesis that there is no relationship between
the dependent variable 'Y' and the explanatory variables.
 The bullets below explain how to determine significance.
 We normally check the significance at three levels 1%, 5% and
10%.
 Note that 1% has the highest level of confidence (99%)
followed by 5% (95%) and 10% (90%).
 That is why in some books they refer to 1% as the highest level;
5% as the moderate (medium) level; and 10% the lowest level.
 As a rule of thumb, start with the strictest level (1%); if the
result is not significant there, go down to 5% and then 10%.
Something significant at, say, 1% is automatically also significant
at the lower levels, but not vice versa.
 When using a computer (Stata or any other package), for the
F-test to be significant, the p-value for F (Prob > F) should be
less than the level (1% or 0.01, 5% or 0.05, 10% or 0.1) at
which you are testing the F-value.
 Our result above shows that the p-value for F (Prob >
F) is 0.0000, which is less than 1%.
 Therefore, overall, our model is significant or
valid at the 1% level of significance (p < 0.01).
 We, therefore, reject the null hypothesis that the
coefficients are equal to zero and conclude that the
model is valid.
– Note: in social science research, if a model is not
significant at 10%, then it is considered not significant,
as this is the lowest acceptable level of significance.
 But the F-test is a necessary, not a sufficient, test for
checking the validity of a model.
 To sufficiently check regression model validity, we
need to check the other two tests.
ii. Coefficient of Determination (R²) Test
 R² measures the proportion of the variation in the dependent
variable (y) that is explained by the independent
variable(s).
 It shows the explanatory power of the model and is given as:

R² = ESS/TSS = 1 − RSS/TSS

 The R² for our result above is 27,854,112.9 / 32,845,415.1 = 0.848,
i.e., about 85%.
 This means the predictor (independent) variables included
in the model (i.e., Family Size, Income, Sex and Wealth)
explain 85% of the variation in households' saving.
 The other 15% is attributed to variables not included in the
model. This is normal: as already noted, no regression model
can include all the independent variables.
 Problem of Using R²
 The major problem is that R² is sensitive to the number of
independent variables included in a regression model.
 The greater the number of independent variables, the higher
R² is likely to be: the more independent variables we add
(even if they are not valid), the bigger R² becomes.
 This problem arises because R² does not take into account
the number of degrees of freedom.

 To solve this problem, when testing the validity of a
regression model we use the Corrected or Adjusted R²
(denoted R̄²), which takes degrees of freedom into
account as given in the following formula:

R̄² = 1 − (1 − R²) · (n − 1) / (n − k)
• Where R2 = the coefficient of determination; n = sample
size; and k = number of parameters including the
intercept.
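With R² = 0.8480, n = 50 observations and k = 5 parameters (four slopes plus the intercept), the adjusted R² formula can be verified by hand. A quick Python check:

```python
# Adjusted R-squared: R2_adj = 1 - (1 - R2) * (n - 1) / (n - k),
# with n = 50 observations and k = 5 parameters (4 slopes + intercept).
r2, n, k = 0.8480, 50, 5
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k)
print(round(r2_adj, 4))  # matches Stata's Adj R-squared = 0.8345
```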
 For our result above, the Adj R-squared (R̄²) is 0.8345, as
shown in the table above.
 NB: R̄² is not interpreted the same way as R²; it serves only as a
reference. If it is close to R², that implies our R² is
dependable, and we can trust R² as a measure of model
adequacy.
 Our model has so far satisfied two validity tests (F-statistic &
R² test), but we still need to do a final t-statistic test, which tests
the significance of the individual independent variables.
 Individual test of significance
3. t-statistic Test (t-test)
 The t-statistic is a test of significance for individual explanatory
variables and the constant term within a model.
 The t-statistic tests:
 The null hypothesis (H0): βk = 0, i.e., that the coefficient of
the parameter is equal to zero, implying that there is no
relationship between the dependent variable and the
independent variable implied by the parameter.
 The alternative hypothesis (H1): βk ≠ 0, i.e., the coefficient is
statistically different from zero.

 The t-statistic is found by dividing each coefficient by its
standard error.
 The t-statistic is calculated for the constant and all the other
coefficients.
o In our case, the t-statistic for the coefficient of family size is -0.50
and is not significant at any acceptable level of significance (i.e.,
1%, 5% or 10%; p > 0.1).

o Income has a t-statistic of 8.25 and is significant at 1% (p < 0.01).

o Sex has a t-statistic of -1.56 and is not significant even at 10% (p > 0.1).

o Wealth has a t-statistic of 4.53 and is significant at 1% (p < 0.01).

o The constant term has a t-statistic of 0.90 and is not significant
at any acceptable level of significance (p > 0.1).

 Thus, we can conclude that income and wealth positively and
significantly affect household saving, while family size and the sex
of the household head do not have a significant effect.
Prediction/forecasting:

Use the regression result to predict what the saving
amount of a female-headed household with a family size
of 3, income of Birr 1,000, and wealth of Birr 800
would be.

display 178.3252 - 13.53576*3 + .1272463*1000 - 149.4642*0 + .1275215*800
= 366.98142
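The display command above is just a linear combination of the estimated coefficients. The same arithmetic can be reproduced outside Stata; a quick Python cross-check (coefficients copied from the regression table, Sex = 0 as in the display command):

```python
# Prediction as a plain linear combination of the estimated coefficients
# (values copied from the regression table; Sex = 0 as in the slide).
b_cons, b_fs, b_income, b_sex, b_wealth = 178.3252, -13.53576, 0.1272463, -149.4642, 0.1275215
saving_hat = b_cons + b_fs * 3 + b_income * 1000 + b_sex * 0 + b_wealth * 800
print(round(saving_hat, 5))  # 366.98142, matching the display output
```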
 Classical Linear Regression: Regression Diagnostics
1-Normality of Residuals
 If the normality assumption is violated, hypothesis testing based on
the standard statistical tests would be invalid.
 Therefore, after a model has been estimated, we have to test the
normality of the residuals using graphical and non-graphical tests.

A. Graphical method:
 reg Saving FS Income Sex Wealth
 predict r, resid
 kdensity r, normal (or: histogram r, kdensity normal)
B. Non-graphical methods: Doornik-Hansen test (mvtest norm r) and
Shapiro-Wilk test (swilk r) for normality.
 Both test the null hypothesis that the residuals are normally
distributed.
. mvtest norm r

Test for multivariate normality

Doornik-Hansen chi2(2) = 6.444 Prob>chi2 = 0.0399

. swilk r

Shapiro-Wilk W test for normal data

Variable Obs W V z Prob>z

r 50 0.91936 3.792 2.843 0.00224

 In both cases, the p-value is significant (below 0.05), implying
non-normality of the residual values.
 If the residuals do not follow a 'normal' pattern, then you should check
for omitted variables, model specification, linearity and functional form.
 In sum, you may need to reassess your model/theory.
 In practice, normality does not represent much of a problem when
dealing with really big samples.
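For intuition, a simple large-sample normality check can also be coded from scratch. The sketch below implements the Jarque-Bera statistic (not one of the two Stata tests above, but based on the same idea: comparing the skewness and kurtosis of the residuals against the normal benchmark); the function name and toy data are illustrative only:

```python
# Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the
# skewness and K the kurtosis of the residuals. Under H0 (normality),
# JB ~ chi2(2), so JB > 5.99 rejects normality at the 5% level.
def jarque_bera(resid):
    n = len(resid)
    mean = sum(resid) / n
    m2 = sum((e - mean) ** 2 for e in resid) / n
    m3 = sum((e - mean) ** 3 for e in resid) / n
    m4 = sum((e - mean) ** 4 for e in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

# Toy example: a heavily right-skewed sample yields a large JB value.
skewed = [0.1] * 40 + [5.0] * 10
print(jarque_bera(skewed) > 5.99)  # True: reject normality
```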
Econometric Criterion (Second order test)
2-Test for Heteroskedasticity
 Two methods to detect heteroskedasticity:
a) Graphical method/visual inspection (informal)
 Command: rvfplot, yline(0)
b) Breusch-Pagan heteroskedasticity test (formal)
 Command: estat hettest

a) Graphical Method
 One way to check whether the variance of the error term is
constant is by plotting residuals vs. predicted values (Yhat).

 For the homoskedastic-variance assumption to hold, we should not
observe any pattern at all when plotting residuals vs. predicted
values.
 In Stata, we do this using rvfplot right after running the
regression; it will automatically draw a scatterplot of
residuals against predicted values.
rvfplot, yline(0)

[Figure: scatterplot of residuals (vertical axis, roughly -500 to 1000)
against fitted values (horizontal axis, 0 to 3000), with a reference
line at residual = 0]

 As seen above, the residuals appear somewhat more spread out at the
middle levels of Yhat than at the lower and higher levels.
 The data are therefore suspected of heteroskedasticity; in such a
case, a more formal test (i.e., hettest) for heteroskedasticity
should be conducted.
b) Non-Graphical Method
Breusch-Pagan test: estat hettest
. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity


Ho: Constant variance
Variables: fitted values of Saving

chi2(1) = 14.43
Prob > chi2 = 0.0001

 The hettest result implies that our model is heteroskedastic: the p-value
(0.0001) is smaller than 0.01, so we can reject the null hypothesis of
constant variance at the 1% level.
 Both of our tests suggest the possible presence of heteroskedasticity in our
model.

 If heteroskedasticity exists in our model, the problem is that the estimated
standard errors of the coefficients, and therefore their t-values, are wrong,
because the usual variance formula no longer applies.
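For intuition about what estat hettest does, the sketch below implements a simplified Breusch-Pagan statistic from scratch (the n·R² variant rather than Stata's exact computation): regress the squared residuals on the fitted values and compare n times the auxiliary R² with the χ²(1) 5% critical value (3.84). The function name and toy data are illustrative only:

```python
# Simplified Breusch-Pagan LM statistic: n * R^2 from the auxiliary
# regression of squared residuals on the fitted values (one regressor,
# so R^2 is just the squared correlation). Compare with chi2(1) = 3.84.
def breusch_pagan_lm(resid, fitted):
    n = len(resid)
    u2 = [e * e for e in resid]
    mu, mf = sum(u2) / n, sum(fitted) / n
    cov = sum((a - mu) * (b - mf) for a, b in zip(u2, fitted))
    var_u2 = sum((a - mu) ** 2 for a in u2)
    var_f = sum((b - mf) ** 2 for b in fitted)
    if var_u2 == 0 or var_f == 0:   # constant squared residuals: no evidence
        return 0.0
    r2 = cov * cov / (var_u2 * var_f)
    return n * r2

# Toy example: residual spread grows with the fitted value -> large LM.
fitted = list(range(1, 41))
resid = [(-1) ** i * 0.1 * f for i, f in enumerate(fitted)]
print(breusch_pagan_lm(resid, fitted) > 3.84)  # True: heteroskedastic
```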
 To deal with the heteroskedasticity problem, a trustworthy
remedy is to use heteroskedasticity-robust standard errors.
 To do this, we add the option robust to the regress command.
Command: reg Saving FS Income Sex Wealth, robust
. reg Saving FS Income Sex Wealth, robust

Linear regression Number of obs = 50


F(4, 45) = 62.47
Prob > F = 0.0000
R-squared = 0.8480
Root MSE = 333.04

Robust
Saving Coef. Std. Err. t P>|t| [95% Conf. Interval]

FS -13.53576 16.17692 -0.84 0.407 -46.11774 19.04622


Income .1272463 .0126314 10.07 0.000 .1018054 .1526873
Sex -149.4642 98.48536 -1.52 0.136 -347.8238 48.89554
Wealth .1275215 .0278302 4.58 0.000 .0714686 .1835743
_cons 178.3252 152.9195 1.17 0.250 -129.6705 486.321

 Note that after the robust regression, the coefficient estimates are
unchanged, but the standard errors (and hence the t-values) are now
robust to heteroskedasticity.

3-Test for Multicollinearity

 To detect a multicollinearity problem, we can use the vif (variance
inflation factor) command after regress.
. vif

Variable VIF 1/VIF

Income 2.12 0.471999


FS 1.71 0.586177
Wealth 1.33 0.750755
Sex 1.01 0.989614

Mean VIF 1.54

 As a rule of thumb, if the VIF values (and hence the mean VIF) are
less than 10, the level of multicollinearity is not a problem. Here
the mean VIF is only 1.54.
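Each VIF equals 1/(1 − R²_j), where R²_j comes from regressing predictor j on the other predictors, so Stata's 1/VIF column is simply the reciprocal (the "tolerance"). A quick Python cross-check using the values above:

```python
# VIF_j = 1 / (1 - R_j^2); Stata's 1/VIF column is the reciprocal
# (tolerance). Values copied from the vif output above.
vifs = {"Income": 2.12, "FS": 1.71, "Wealth": 1.33, "Sex": 1.01}
for name, v in vifs.items():
    print(name, round(1 / v, 2))   # close to the 1/VIF column
mean_vif = sum(vifs.values()) / len(vifs)
print(round(mean_vif, 2))          # 1.54, matching the output
```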
4-Test for Autocorrelation (timeseries problem)
 Use the data set: RGDP
 Two commonly used tests detect the presence of autocorrelation
1. Durbin-Watson d Test - the most celebrated
(Rule of thumb: d ≈ 2 indicates no autocorrelation; d substantially below 2
indicates positive autocorrelation; d substantially above 2 indicates negative
autocorrelation)
Command: estat dwatson
2. Breusch-Godfrey test
Command: estat bgodfrey
. estat dwatson

Durbin-Watson d-statistic( 5, 37) = .8024311

.
. estat bgodfrey

Breusch-Godfrey LM test for autocorrelation

lags(p) chi2 df Prob > chi2

1 12.012 1 0.0005

H0: no serial correlation


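The d statistic itself is simple to compute from a residual series: d = Σ(e_t − e_{t−1})² / Σe_t². A from-scratch sketch with toy residual series (illustrative only, not the RGDP data):

```python
# Durbin-Watson statistic: d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Values near 2 suggest no first-order autocorrelation.
def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Smoothly drifting residuals (positive autocorrelation) give d near 0 ...
trending = [1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.4, 1.3, 1.2, 1.1]
print(durbin_watson(trending) < 1)    # True
# ... while sign-alternating residuals (negative autocorrelation) give d near 4.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
print(durbin_watson(alternating) > 3) # True
```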
 Both tests show that there is an autocorrelation problem in
the dataset.
 We can remove the problem by introducing the one-period
lagged value of the dependent variable as an explanatory
variable into the model.

 Example: gen LN_CPIlagged = L.LN_CPI

 Then:
reg LN_CPI LN_RGDP LN_RMSS2 LN_RTGEP LN_REER LN_CPIlagged

After this correction, the data are free of autocorrelation and
ready for interpretation.
5-Specification Error Test
 Most limitations in applied econometric research stem from
errors in specifying the model.
 If the functional form of a model is misspecified or some relevant
variables are omitted from the model, then this will not only affect
the estimates of the model but also create a serious econometric
problem known as specification bias.
 We can test for this problem using:
1. Ramsey's Regression Specification Error Test (RESET test) -
efficient & most commonly used,
Command: ovtest
Example-1: Use Saving data
2. Link Test
3. Durbin-Watson d statistic (for time series only) &
4. The Lagrange multiplier test (if the computed χ² exceeds the
table χ² value, there is a specification error)
. ovtest

Ramsey RESET test using powers of the fitted values of Saving


Ho: model has no omitted variables
F(3, 42) = 0.98
Prob > F = 0.4099

 The Ramsey RESET test implies that there is no model specification
problem (Prob > F = 0.4099 > 0.10).
. linktest

Source SS df MS Number of obs = 50


F(2, 47) = 132.18
Model 27887521.8 2 13943760.9 Prob > F = 0.0000
Residual 4957893.29 47 105487.091 R-squared = 0.8491
Adj R-squared = 0.8426
Total 32845415.1 49 670314.594 Root MSE = 324.79

Saving Coef. Std. Err. t P>|t| [95% Conf. Interval]

_hat 1.116278 .2155866 5.18 0.000 .6825737 1.549983


_hatsq -.0000493 .0000877 -0.56 0.576 -.0002257 .000127
_cons -38.85508 102.1201 -0.38 0.705 -244.2942 166.584

 The insignificant _hatsq term shows that the model has no
specification error in its functional form and no omitted relevant
variables.
General Guidelines for Building a Regression
Model
 Make sure all relevant predictors are included.
 These are based on your research question, theory,
empirical evidence and knowledge on the topic.
 Get your sampling technique right.
 Obtain the right data and choose appropriate technique.

Strategy to keep or drop variables:


 Predictor not significant and has the expected sign -> Keep it
 Predictor not significant and does not have the expected sign
-> Drop it
 Predictor is significant and has the expected sign -> Keep it
 Predictor is significant but does not have the expected sign ->
Review, you may need more variables, or there may be an
error in the data.
Next!

Chapter -4: Discrete Choice Models
