Exercises in Class Normality - SOLUTIONS
1. We have a database for a sample of 935 workers, with information on the following variables:
SALARY (monthly salary), EDAT (age), EDUCACIO (years of study), CASAT (dummy variable
that is 1 if the individual is married, 0 otherwise), PERMANENCIA (years spent in the same job),
EXPERIENCIA (years of experience) and IQ (index score of intelligence).
1.a. From this information, we estimated the following model [1] (Table 1.1):
Model (1)
Table 1.1
From this estimation (and without considering any other outcome), what can you say about the
individual and global statistical significance of the parameters estimated? And what about the
economic significance? From the value of R2, do you think that it is an appropriate model to explain
the salary? (0.5 points)
[1] The data used for this estimate are from Wooldridge, J.M. (2006) "Introduction to Econometrics", 2nd Edition.
Ed: Thomson.
Without considering any other result, we can see that the p-value of the parameter associated with
the variable "experiencia" is greater than 0.05. This means that we cannot reject the null
hypothesis that this parameter is zero. This suggests that "experiencia" is likely an irrelevant
variable in this model and could be dropped (its lack of significance may also be due to
multicollinearity). The p-value associated with the parameter of "permanencia" is less than 0.05,
which indicates that this variable is indeed relevant in explaining the salary. With respect to the
global significance of the model, the p-value associated with the F statistic is less than 0.05.
Therefore, we reject the null hypothesis that all the slope parameters are jointly zero: the model
is globally significant.
Regarding the economic significance, we only have to interpret the coefficient associated with
"permanencia". Its influence is positive: one additional year spent in the same job is associated
with an increase in salary of 10.82 units. Because the other variable is not significant, it is not
necessary to compare the magnitudes of the parameters.
The R2 obtained is very low: 1.7%. Therefore, only a very small percentage of the variability of
the salary is explained by the years spent in the same job. It is thus likely that one or more
relevant variables have been omitted from the model.
1.b. From the estimation of the previous model, the following results are obtained regarding the
error term:
Table 1.2
1. What are the characteristics of the distribution of the residuals? That is, comment on the
skewness and the kurtosis.
The skewness is greater than 0, meaning that the distribution is positively skewed.
Indeed, the graph shows that the residuals are concentrated on the left, with a longer right tail.
The kurtosis is greater than 3, meaning that the distribution is leptokurtic (it has a higher
peak and heavier tails than a normal distribution).
2. Run the Jarque-Bera normality test on the residuals: state the hypotheses being tested,
calculate the corresponding test statistic, and state the conclusion reached and its
implications for the estimation of the model. (1 point)
The Jarque-Bera statistic is calculated as follows, based on the values of the skewness b1
(1.307232) and the kurtosis b2 (6.007389), where n is the number of observations:
JB = (n/6) * [b1^2 + (b2 - 3)^2 / 4]
At the 5% significance level, the value of the statistic is greater than the critical value of the
chi-square distribution with two degrees of freedom (5.99), so we reject the null hypothesis of
normality. Therefore, the error term does not follow a normal distribution. Since we do not know the
distribution of the error term, in general we do not know the exact distribution of the estimated
coefficients, and thus the tests on the parameters (individual and global significance) are not
reliable in finite samples. Asymptotic efficiency is also no longer guaranteed, since it relies on
the OLS estimator coinciding with the ML estimator, which is no longer the case. Nonetheless, since
the sample here has more than 900 observations, we can invoke the central limit theorem: the
t-statistics converge to a normal distribution, while the F statistic (multiplied by the number of
restrictions) converges to a chi-squared distribution, so we know the approximate distribution of
our tests. Hence we can make inference on the model, even though this would not be possible in a
small sample.
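As a rough check of this calculation, the statistic can be recomputed from the reported skewness and kurtosis. This is only a minimal sketch, not the original worked solution: it assumes the usual large-sample form of the statistic (written out in the docstring) with n = 935, whereas some software uses n minus the number of regressors instead of n. The function name `jarque_bera` is purely illustrative.

```python
def jarque_bera(n, skewness, kurtosis):
    """Jarque-Bera statistic from sample size, skewness and (raw) kurtosis.

    Assumes the common form JB = n/6 * (S^2 + (K - 3)^2 / 4).
    """
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Values reported in the text for the residuals of model (1).
n_obs = 935      # sample size stated in the exercise
b1 = 1.307232    # skewness
b2 = 6.007389    # kurtosis

jb = jarque_bera(n_obs, b1, b2)
critical_5pct = 5.99  # chi-square(2) critical value quoted in the text

print(f"JB = {jb:.2f}")
print("Reject normality at 5%:", jb > critical_5pct)
```

Under these assumptions the statistic is roughly 619, far above 5.99, which agrees with the rejection of normality stated in the solution.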
2. We have a database for a sample of 935 workers, with information on the following variables:
SALARY (monthly salary), EDAT (age), EDUCACIO (years of study), CASAT (dummy variable
that is 1 if the individual is married, 0 otherwise), PERMANENCIA (years spent in the same job),
EXPERIENCIA (years of experience) and IQ (index score of intelligence).
Model (1)
Table 1 shows the results of the Jarque-Bera test applied to the residuals of the OLS estimation of
model (1).
1. In your opinion, why do the mean and the median of the residuals not coincide?
Because the distribution is not symmetric. The median is negative while the mean is equal to 0
(as it must be for OLS residuals when the model includes an intercept), which means that more
than half of the residuals lie below zero. The distribution is therefore asymmetric, with
positive skewness, as also shown by the skewness coefficient (greater than 0).
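The relationship between a zero mean and a negative median can be illustrated with a small made-up sample. The numbers below are hypothetical (they are not the residuals of model (1)); they are chosen only so that they sum to zero and have a long right tail.

```python
import statistics

# Hypothetical "residuals": they sum to zero, with one large positive value
# producing a long right tail (positive skewness).
residuals = [-2, -2, -1, -1, -1, 0, 1, 6]

mean = statistics.mean(residuals)
median = statistics.median(residuals)
share_below_zero = sum(r < 0 for r in residuals) / len(residuals)

print("mean:", mean)
print("median:", median)
print("share of residuals below zero:", share_below_zero)
```

A few large positive residuals are enough to pull the mean up to zero even though most of the observations are negative, which is the pattern described in the answer.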
2. State the null and alternative hypotheses of the Jarque-Bera test, interpret the result and
explain the implications of the result for the estimates of the model.
Table 1
The critical value of a chi-square distribution with two degrees of freedom at a significance
level (α) of 5% is 5.99. Since the Jarque-Bera statistic in Table 1 exceeds it, we reject the
null hypothesis of normality.
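The critical value 5.99 can be checked without tables, because the chi-square distribution with two degrees of freedom has the closed-form upper-tail probability P(X > x) = exp(-x/2). The sketch below uses only this fact; the helper names are illustrative and do not come from the exercise.

```python
import math


def chi2_2dof_critical(alpha):
    """Critical value c such that P(X > c) = alpha for a chi-square(2) variable."""
    return -2.0 * math.log(alpha)


def chi2_2dof_pvalue(stat):
    """Upper-tail p-value of a chi-square(2) variable at the given statistic."""
    return math.exp(-stat / 2.0)


crit = chi2_2dof_critical(0.05)
print(f"5% critical value: {crit:.2f}")
print(f"p-value at the critical value: {chi2_2dof_pvalue(crit):.2f}")
```

The same `chi2_2dof_pvalue` helper gives the p-value of any Jarque-Bera statistic, since under the null hypothesis the statistic is asymptotically chi-square with two degrees of freedom.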
Model (3)
Table 1 shows the results of the Jarque-Bera test applied to the residuals of the OLS estimation of
model (3).
1. Which kind of distribution do the residuals seem to follow? What does it suggest in terms of
misspecification?
The residuals appear to follow a bimodal distribution, which could indicate a
misspecification due to a structural change that the model does not capture.
2. Do they satisfy the assumption of normality? Comment also on the kurtosis and the
skewness.
The JB test rejects the null hypothesis of normality. Indeed, the kurtosis points to a
platykurtic distribution (flatter than a normal distribution), while the skewness is
slightly positive, indicating positive asymmetry (more observations concentrated on the left
part of the distribution).
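To see why a bimodal histogram tends to go together with a platykurtic coefficient, one can compute the kurtosis of a small two-cluster sample. The data below are hypothetical, not the residuals of model (3), and the `moment_kurtosis` helper is just an illustration of the usual moment-based definition m4 / m2^2.

```python
def moment_kurtosis(values):
    """Moment-based kurtosis m4 / m2**2 (equal to 3 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2


# Hypothetical sample with two clusters, one around -1 and one around +1.
bimodal = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]

kurt = moment_kurtosis(bimodal)
print(f"kurtosis = {kurt:.2f}")
print("platykurtic (kurtosis < 3):", kurt < 3)
```

Because the observations sit in two groups away from the centre, the fourth moment is small relative to the squared variance, and the coefficient falls well below 3.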
Table 1.
4. Using data for 706 individuals, we estimate by OLS a regression model that explains the number
of daily hours of sleep (SLEEPD) as a function of the following variables: the number of daily
hours spent working (TOTWORKD), the years of education (EDUC), the age (AGE), a dummy
variable that takes the value 1 if the individual is a man and 0 otherwise (MALE), a
dummy variable that takes the value 1 if the individual has children under 3 years old and 0
otherwise (YNGKID) and the interaction between these last two variables (MALE*YNGKDS). The
results of this estimation are shown in Table 1, while Graphs 1 and 2 show the histogram, some
descriptive statistics and the result of the JB test for the endogenous variable (Graph 1) and for
the OLS residuals of the estimation presented in Table 1 (Graph 2):
1. With this information, do you think there is any problem in this estimation regarding
compliance with the basic hypotheses on the disturbance term of the OLS model? What will
the properties of the OLS estimator be?
The basic hypothesis for the disturbance term of the model is that it follows a normal distribution
with zero expected value and variance-covariance matrix equal to sigma^2*I. This last hypothesis
means that the error term is homoskedastic and has no autocorrelation. Since these data are
cross-sectional, a problem of autocorrelation is very unlikely; therefore, with the information we
have in the graphs, we need to assess whether there could be a problem of heteroskedasticity or
non-normality. Graph 2 shows the histogram of the residuals of the OLS estimation, as well as
information on the skewness and kurtosis of the distribution, which allows us to compute the JB
normality test. The graph shows a residual distribution that is quite symmetric (the skewness
coefficient is 0.159) but quite peaked (the kurtosis coefficient is 5.23), which suggests that the
hypothesis of normality probably does not hold. In fact, the JB test statistic (149.11) is clearly
higher than the critical value of the chi-square distribution with 2 degrees of freedom (5.99 at
the 5% level). Therefore we reject the null hypothesis and conclude that the residuals do not follow
a normal distribution. In this case, the properties of the OLS estimator are the usual ones
(normality is not required for them), while for the test statistics, thanks to the central limit
theorem and the large sample size, we can rely on their approximate asymptotic distributions (the t
statistic converges to the normal distribution, while the F statistic, multiplied by the number of
restrictions, converges to the chi-square). If the sample had contained fewer observations, it
would not have been possible to carry out the tests reliably, since the distribution of the test
statistics would not have been known.
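The reported statistic can be roughly reproduced from the rounded coefficients quoted for Graph 2, which also shows how much of it comes from the kurtosis. This sketch assumes the usual form JB = n/6 * S^2 + n/24 * (K - 3)^2 with n = 706; because the skewness and kurtosis are rounded in the text, the result is only expected to be close to 149.11, not identical.

```python
n_obs = 706        # sample size stated in the exercise
skewness = 0.159   # rounded value quoted in the text
kurtosis = 5.23    # rounded value quoted in the text

# Split the statistic into its skewness and excess-kurtosis components.
skew_part = n_obs / 6.0 * skewness ** 2
kurt_part = n_obs / 24.0 * (kurtosis - 3.0) ** 2
jb = skew_part + kurt_part

print(f"skewness component: {skew_part:.2f}")
print(f"kurtosis component: {kurt_part:.2f}")
print(f"JB (from rounded inputs): {jb:.2f}")
print(f"share due to kurtosis: {kurt_part / jb:.1%}")
```

With these inputs almost all of the statistic comes from the excess kurtosis, which matches the remark that the distribution is fairly symmetric but peaked.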
2. What could be the reasons for a (possible) failure to comply with the basic hypotheses?
Looking at the result of the regression, the failure of the normality hypothesis could be related
to the omission of some relevant variable from the model, since the R2 is very low, which shows
that the explanatory power of the model is poor. As for heteroskedasticity, we do not have enough
information to draw any conclusion.