Group Work - Econometrics
Department of Management:
MBA Program
By:
Name ID. No.
1. Tilahun Yenew Bogale IUPGW/117/2014
2. Smachew Atalay Admas IUPGW/112/2014
3. Tewabe Belay Fenta IUPGW/116/2014
4. Tekelehaimanot Bogale Abebe IUPGW/114/2014
December, 2022
Injibara
This term paper focuses on the concept of econometrics and regression models. It was written by reading and discussing the specific questions forwarded by the course instructor. It is therefore organized around the questions we received, and the assignment has allowed us to practice techniques of econometrics and regression analysis. Our discussion of the questions follows from this page forward.
At its core, econometrics is an amalgam of economic theory, mathematics and statistics. Yet the subject deserves to be studied in its own right, for the justifications that follow.
To begin, linear regression is an analysis that assesses whether one or more predictor (explanatory) variables explain the dependent variable. The regression has five key assumptions, which we discuss from this line forward.
(a) Linearity
Linearity means that the predictor variables in the regression have a straight-line relationship with the outcome variable. In the classical linear regression model, the linearity assumption means that the model is correctly specified and linear in the coefficients: the model expresses the outcome as a linear combination of the parameters of the variables. In the simplest terms, linear regression requires the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects.
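As an illustration, the sketch below (our own, on simulated data) fits a line with numpy and plots the residuals against the fitted values; a random scatter around zero supports linearity, while a curved pattern would suggest a misspecified model.

```python
# Sketch: checking the linearity assumption with a residual-vs-fitted plot.
# All data here are simulated for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)   # a truly linear relationship

b, a = np.polyfit(x, y, 1)                  # slope, intercept
fitted = a + b * x
residuals = y - fitted

# Under linearity the residuals scatter randomly around zero;
# a systematic curve would indicate a non-linear relationship.
plt.scatter(fitted, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```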
(b) Homoscedasticity
Homoscedasticity means that the variance of the error term is constant across all values of the independent variables. If the error variance instead changes with the regressors (heteroscedasticity), the OLS coefficient estimates remain unbiased, but their standard errors, and hence the usual hypothesis tests, become unreliable.
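One common diagnostic is the Breusch-Pagan test; the sketch below applies the statsmodels implementation to simulated data of our own making.

```python
# Sketch: testing for heteroscedasticity with the Breusch-Pagan test,
# using simulated data for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 1 + 0.5 * x)  # error variance grows with x

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Null hypothesis: homoscedasticity (constant error variance).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")  # small p => heteroscedasticity
```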
(c) Autocorrelation
The residuals are simply the error terms: the differences between the observed values of the dependent variable and the predicted values. The classical model assumes that the residuals are not correlated with one another (no autocorrelation), an assumption most often violated in time-series data. Linear regression analysis further requires the variables to be multivariate normal, and the conditional expectation of the residual is zero. Furthermore, there must not be any relation between the residual term and the X variable; that is, they are uncorrelated. This means that the factors left unaccounted for in the residual should have no relationship with the X variables included in the model. The normality of the residuals can best be checked with a histogram.
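A standard check for first-order autocorrelation is the Durbin-Watson statistic; the sketch below (our own, with simulated autocorrelated errors) uses the statsmodels implementation.

```python
# Sketch: checking residuals for first-order autocorrelation with the
# Durbin-Watson statistic; data simulated for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
e = np.zeros(100)
for t in range(1, 100):                 # AR(1) errors: e_t = 0.8 e_{t-1} + v_t
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + e

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)
print(f"Durbin-Watson: {dw:.2f}")  # ~2 means no autocorrelation; <2 positive, >2 negative
```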
(e) Multi-collinearity
Multicollinearity refers to the situation in which the predictor variables are highly correlated with each other. This is an issue because a regression model will then not be able to accurately associate variance in the outcome variable with the correct predictor variable, leading to muddled results and incorrect inferences. This assumption is only relevant for multiple linear regression, which has multiple predictor variables.
Linear regression assumes that there is little or no multicollinearity in the data. There are several sources of multicollinearity, which may include the following:
o The data collection method employed; for example, sampling over a limited range of the values taken by the regressors in the population.
o Constraints on the model or in the population being sampled.
o Model specification; for example, adding polynomial terms to a regression model, especially when the range of the X variable is small.
o An overdetermined model. This happens when the model has more explanatory variables than the number of observations.
In general, multicollinearity occurs when the independent variables are too highly correlated with each other. If multicollinearity is found in the data, centering the data (that is, subtracting the mean of the variable from each score) might help to solve the problem. However, the simplest way to address the problem is to remove independent variables with high variance inflation factor (VIF) values, as the sketch below illustrates.
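A minimal sketch of computing VIFs with statsmodels; the data and variable names are invented, with x2 built as a near copy of x1 to force high collinearity.

```python
# Sketch: screening for multicollinearity with variance inflation factors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# A common rule of thumb flags VIF values above 5 or 10; the VIF of the
# constant term itself can be ignored.
for i, name in enumerate(X.columns):
    print(name, round(variance_inflation_factor(X.values, i), 2))
```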
In a regression model, the difference between the actual value and the estimated value of the regressand is called the stochastic error term, ui. There are various forms of error terms. A regression model is never perfectly accurate, so the stochastic error term plays an important role in capturing this difference.
A stochastic error term is a factor introduced into a regression equation to account for any variation that the independent variables cannot explain. Because a regression model can never be completely accurate, stochastic error terms are crucial in estimating this difference. An error term is generally unobservable, whereas a residual is observable and calculable, making it much easier to quantify and visualize. In effect, while an error term represents the way the observed data differ from the true population regression function, a residual represents the way the observed data differ from the values fitted from the sample.
With actual data we never obtain the mean of the regressand; we only obtain sample values, which may be close to or far from that mean, so any single sample value is uncertain. Regression analysis is important for finding the dependency of a dependent variable on independent variables, and it is also helpful in making predictions and forecasting the data. The regression function gives the mean value of the dependent variable for each value of the independent variable; it does not claim that a single fixed value of the dependent variable exists for each value of the independent variable. Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows us to learn the role of each independent variable without worrying about the other variables in the model.
a. Does the insurance premium depend on the driving experience or does the driving experience depend on the insurance premium? Do you expect a positive or a negative relationship between these two variables?
The insurance premium depends on the driving experience, so the premium is the dependent variable and driving experience is the independent variable. Since more experienced drivers tend to make fewer claims, we expect a negative relationship between the two variables.
b. Find the least squares regression line by choosing appropriate dependent and
independent variables based on your answer in part a.
To find the regression line we first have to calculate the sums of driving experience (let us say x) and insurance premium (let us say y), of their product, and of the squares of both variables, as follows.
x | y | xy | x² | y²
5 | 64 | 320 | 25 | 4,096
2 | 87 | 174 | 4 | 7,569
12 | 50 | 600 | 144 | 2,500
9 | 71 | 639 | 81 | 5,041
15 | 44 | 660 | 225 | 1,936
6 | 56 | 336 | 36 | 3,136
25 | 42 | 1,050 | 625 | 1,764
16 | 60 | 960 | 256 | 3,600
Σx = 90 | Σy = 474 | Σxy = 4,739 | Σx² = 1,396 | Σy² = 29,642

With n = 8, x̄ = 90/8 = 11.25 and ȳ = 474/8 = 59.25.
SSxy = Σxy − (Σx)(Σy)/n = 4,739 − (90)(474)/8 = −593.5
SSxx = Σx² − (Σx)²/n = 1,396 − (90)²/8 = 383.5
SSyy = Σy² − (Σy)²/n = 29,642 − (474)²/8 = 1,557.5
b = SSxy/SSxx = −593.5/383.5 = −1.5476
a = ȳ − b·x̄ = 59.25 − (−1.5476)(11.25) = 76.6605
The least squares regression line is therefore ŷ = 76.6605 − 1.5476x.
c. Give a brief interpretation of the values of a and b calculated in part b.
The value of a = 76.6605 gives the value of ŷ for x = 0; that is, it gives the monthly auto insurance premium for a driver with no driving experience. The value of b gives the change in ŷ due to a change of one unit in x. Thus, b = −1.5476 indicates that, on average, every extra year of driving experience decreases the monthly auto insurance premium by about 1.55 birr. Because b is negative, y decreases as x increases.
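The hand computation can be cross-checked with a few lines of scipy (a sketch; the eight observations are those tabulated above).

```python
# Sketch: verifying the least squares computation with scipy.stats.linregress.
from scipy import stats

experience = [5, 2, 12, 9, 15, 6, 25, 16]   # x, years of driving experience
premium = [64, 87, 50, 71, 44, 56, 42, 60]  # y, monthly premium in birr

result = stats.linregress(experience, premium)
print(f"a (intercept) = {result.intercept:.4f}")  # ~76.6605
print(f"b (slope)     = {result.slope:.4f}")      # ~-1.5476
print(f"r             = {result.rvalue:.2f}")     # ~-0.77
```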
d. Plot the scatter diagram and the regression line.
[Scatter diagram: insurance premium (0 to 100 birr, y-axis) plotted against driving experience (0 to 30 years, x-axis), with the regression line ŷ = 76.6605 − 1.5476x drawn through the points.]
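The diagram can be reproduced with matplotlib (a sketch; the styling choices are our own).

```python
# Sketch: scatter diagram of the data with the fitted regression line.
import numpy as np
import matplotlib.pyplot as plt

experience = [5, 2, 12, 9, 15, 6, 25, 16]
premium = [64, 87, 50, 71, 44, 56, 42, 60]

xs = np.linspace(0, 30, 100)
plt.scatter(experience, premium)
plt.plot(xs, 76.6605 - 1.5476 * xs, color="red", label="regression line")
plt.xlabel("Driving experience (years)")
plt.ylabel("Insurance premium (birr)")
plt.legend()
plt.show()
```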
e. Calculate r and r² and explain what they mean.
r = SSxy / √(SSxx · SSyy) = −593.5 / √((383.5)(1,557.5)) = −0.77
r² = SSxy² / (SSxx · SSyy) = (−593.5)² / ((383.5)(1,557.5)) = 0.59
Explanation: the value r = −0.77 indicates that driving experience and the monthly auto insurance premium are negatively related; the relationship is strong but not very strong. The value r² = 0.59 states that 59% of the total variation in insurance premiums is explained by years of driving experience and the remaining 41% is not. This moderate value of r² indicates that there may be many other important variables that contribute to the determination of auto insurance premiums.
f. Predict the monthly auto insurance premium for a driver with 10 years of
driving experience.
Based on the estimated regression line above, the predicted value of y for x = 10 is calculated as follows:
ŷ = 76.6605 − 1.5476x
= 76.6605 − 1.5476(10)
= 61.18
Thus a driver with 10 years of driving experience is predicted to pay a monthly premium of about 61.18 birr.
g. Test at the 5% significance level whether B, the population slope coefficient, is negative.
The hypotheses are H0: B = 0 against H1: B < 0, so we use the t distribution table and the area in the left tail of the t distribution at the 5% significance level. The degrees of freedom are
df = n − 2 = 8 − 2 = 6
From the t distribution table, the critical value of t for 0.05 area in the left tail of the t distribution and 6 degrees of freedom is −1.943.
Now we can calculate the value of the test statistic. With the standard error of b, sb = se/√SSxx = 0.5270, the value of the test statistic t for b is calculated as follows:
t = (b − 0)/sb = −1.5476/0.5270 = −2.937
Decision: the value of the test statistic, t = −2.937, is less than the critical value of −1.943 and therefore falls in the rejection region. Hence, we reject the null hypothesis and conclude that B is negative. That is, the monthly auto insurance premium decreases with an increase in years of driving experience.
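The same test can be reproduced with scipy (a sketch; note that linregress reports a two-sided p-value, which we halve for this one-sided left-tail test).

```python
# Sketch: reproducing the slope t-test for the insurance data with scipy.
from scipy import stats

experience = [5, 2, 12, 9, 15, 6, 25, 16]
premium = [64, 87, 50, 71, 44, 56, 42, 60]

res = stats.linregress(experience, premium)
t_stat = res.slope / res.stderr                      # ~ -2.937
t_crit = stats.t.ppf(0.05, df=len(experience) - 2)   # ~ -1.943

print(f"t statistic = {t_stat:.3f}")
print(f"critical t  = {t_crit:.3f}")
print(f"one-sided p = {res.pvalue / 2:.4f}")
print("reject H0" if t_stat < t_crit else "fail to reject H0")
```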
Suppose that our score on the final exam of the econometrics for management course (score) depends on class attendance (attend) and on unobserved other factors that affect exam performance:
Score = β0 + β1·attend + u
Where:
* β0 is a constant
* β1 is the change in the score explained by one additional unit of student attendance
* u stands for the other, unobserved variables
o The data are cross-sectional because we collect the data from the students at the same point in time.
o The dependent variable is exam performance (score).
o The independent variable is class attendance (attend); the other unobserved factors are captured by the error term (u).
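As an illustration, such a model could be estimated with statsmodels on simulated data; the sample size, coefficient values and attendance range below are our own invention.

```python
# Sketch: estimating Score = b0 + b1*attend + u on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
attend = rng.integers(0, 33, size=120)             # classes attended, out of 32
score = 40 + 1.2 * attend + rng.normal(0, 8, 120)  # unobserved factors in the noise

X = sm.add_constant(attend.astype(float))
model = sm.OLS(score, X).fit()
print(model.params)   # estimates of beta0 and beta1
```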
Suppose we obtained quarterly consumption data from the Ethiopian Statistics Agency for the years 2000 to 2022 and study the percentage change in real personal consumption expenditure (y) as explained by real personal disposable income (x1), the percentage change in industrial production (x2), personal saving (x3), and the unemployment rate (x4). The model is then
y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + u
* β1, β2, β3, and β4 measure the effect of each predictor after taking account of the effects of all the other predictors in the model.
The equation means that a one-unit increase or decrease in one of the predictors, holding the others constant, changes the personal consumption pattern of the Ethiopian population over the years 2000 to 2022 by the corresponding coefficient.
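A sketch of how such a multiple regression might be run with the statsmodels formula API; the data frame, its column names and the coefficient values are hypothetical stand-ins, not the actual agency series.

```python
# Sketch: multiple regression with partial-effect interpretation,
# on simulated quarterly data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 92  # quarterly observations, 2000-2022
df = pd.DataFrame({
    "x1": rng.normal(2, 1, n),   # % change in disposable income
    "x2": rng.normal(1, 2, n),   # % change in industrial production
    "x3": rng.normal(5, 1, n),   # personal saving
    "x4": rng.normal(8, 2, n),   # unemployment rate
})
df["y"] = 1 + 0.6*df.x1 + 0.2*df.x2 - 0.1*df.x3 - 0.3*df.x4 + rng.normal(0, 0.5, n)

model = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()
# Each slope is a partial effect: the change in y for a one-unit change in
# that regressor, holding the other regressors constant.
print(model.params)
```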
Implication of the inclusion of an irrelevant variable:
Irrelevant variables are variables that have only zero loadings on the dependent variable; they are not related to the outcome and contain no information for estimating it. Therefore, the inclusion of irrelevant variables in an econometric model can cause the coefficient estimates to become less precise, thereby causing the overall model to lose accuracy.
Having an omitted variable in research can bias the estimated outcome of the study and lead the researcher to a mistaken conclusion. This means that while the researcher assesses the effects of the independent variables, the bias can produce other problems in the regression analysis.
In other words, the omission from a regression of variables that affect the dependent variable may cause omitted variable bias. This bias depends on the correlation between the omitted variable and the included independent variables. Hence, the omission may lead to biased and inconsistent coefficient estimates, and biased and inconsistent estimates are not reliable in research.
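The bias can be demonstrated with a small simulation (our own construction; all coefficients are arbitrary): x2 affects y and is correlated with x1, so leaving x2 out biases the estimated coefficient on x1.

```python
# Sketch: simulating omitted variable bias.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)           # correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
short = sm.OLS(y, sm.add_constant(x1)).fit()  # omits x2

print(full.params[1])    # ~2.0: unbiased estimate of the x1 effect
print(short.params[1])   # ~2.0 + 1.5*0.8 = 3.2: biased upward
```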
In regression analysis, a dummy variable is a binary variable that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. Dummy variables, taking the values 1 and 0, are a means of introducing qualitative regressors into regression models. A dummy variable is a numerical variable used in regression analysis to represent subgroups of the sample in a study. In research design, a dummy variable is often used to distinguish different treatment groups or categories such as gender, educational level, season, private vs. public, marital status and so on. Dummy variables can be used either as explanatory variables or as the dependent variable.
In other words, dummy variables are variables which take the value of either 0 or 1. Just as a dummy is a stand-in for a real person, in quantitative analysis a dummy variable is a numeric stand-in for a qualitative fact or a logical proposition.
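For instance, a minimal sketch using pandas.get_dummies (the tiny data frame and its categories are invented):

```python
# Sketch: creating dummy variables from a categorical column with pandas.
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"],
                   "score": [70, 82, 75, 68]})

# drop_first=True keeps one category as the baseline, avoiding the dummy
# variable trap (perfect multicollinearity with the intercept).
dummies = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(dummies)   # gender_male is 1 for males, 0 for females
```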
Although there is a way to use the OLS technique when the dependent variable is a dummy, there are several statistical problems that we may face:
o The error term is not normally distributed, and the error term is heteroskedastic.
o R-squared becomes a useless measure.
o The model becomes problematic for forecasting purposes. One would like to forecast the probability that a certain set of independent variables produces a certain binomial outcome, yet OLS can produce predicted probabilities greater than one or smaller than zero.
Therefore, there is an alternative to OLS estimation that does not face these problems: logistic regression. Logistic regression is a non-linear estimation technique which solves these problems of OLS. Thus, logistic regression is proposed for this case.
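A minimal sketch of fitting such a model with the statsmodels Logit class on simulated binary-outcome data (all values below are made up):

```python
# Sketch: logistic regression on a binary dependent variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))       # true success probability
y = rng.binomial(1, p)                        # binary dependent variable

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(model.params)                           # estimated logit coefficients
print(model.predict(sm.add_constant(x))[:5])  # fitted probabilities in [0, 1]
```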
The logit model assumes a logistic distribution of the errors, while the probit model assumes normally distributed errors. Each model is explained from this point forward.
Logit model:
Logistic regression is a method that we can use to fit a regression model when
the response variable is binary. The following are its basic assumptions:
There are no extreme outliers: logistic regression assumes that there are no
extreme outliers or influential observations in the dataset.
Linearity in the logit: logistic regression assumes that there is a linear
relationship between the explanatory variables and the logit of the response variable.
The sample size is sufficiently large: the dataset must be large enough to draw
valid conclusions from the fitted logistic regression model.
Probit model:
In general, the probit model and the logit model deliver only approximations to
the true population regression function, and it is not obvious how to decide which
model to use in practice. The linear probability model has the clear drawback of
not being able to capture the nonlinear nature of the population regression
function, and it may predict probabilities that lie outside the interval [0, 1].
Probit and logit models are harder to interpret but capture the nonlinearities
better than the linear approach: both models produce predictions of probabilities
that lie inside the interval [0, 1]. The predictions of all three models are often
close to each other, so it is sensible to use the method that is easiest to apply
in the statistical software of choice.
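A sketch comparing the two models on the same simulated data (all numbers are made up); in practice their predicted probabilities track each other closely.

```python
# Sketch: comparing logit and probit fits on the same simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.9 * x))))
X = sm.add_constant(x)

logit = sm.Logit(y, X).fit(disp=0)
probit = sm.Probit(y, X).fit(disp=0)

# Coefficients differ in scale (roughly by a factor of ~1.6), but the
# predicted probabilities are nearly identical.
print(np.corrcoef(logit.predict(X), probit.predict(X))[0, 1])
```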
Binary, multinomial and ordinal are the three types of logistic regression. Binary logistic regression deals with two possible values, essentially yes or no. Multinomial logistic regression deals with three or more unordered values, and ordinal logistic regression deals with three or more classes in a predetermined order.
Binary logistic regression involves an either/or solution; there are just two possible outcomes, typically represented as 0 or 1 in coding. Examples include:
Multinomial logistic regression is a model where there are multiple classes that an
item can be classified as. There is a set of three or more predefined classes set up
prior to running the model. Examples include:
Ordinal logistic regression is also a model where there are multiple classes that an
item can be classified as; however, in this case an ordering of classes is required.
Classes do not need to be proportionate; the distance between each class can vary.
Examples include:
Variable name | Estimated coefficient | Standard error | Asymptotic t-ratio
X1 | 3.8 | 1.7 | 2.2
X2 | −1.6 | 0.54 | −3.0
constant | −4.2 | 2.3 | −1.8
With X1 = 2 and X2 = 0.5, the predicted probability is
ŷ = e^(−4.2 + 3.8X1 − 1.6X2) / (1 + e^(−4.2 + 3.8X1 − 1.6X2))
= e^(−4.2 + 3.8(2) − 1.6(0.5)) / (1 + e^(−4.2 + 3.8(2) − 1.6(0.5)))
= e^2.6 / (1 + e^2.6) = 13.46/14.46
ŷ = 0.93
This indicates that, with the given values of X1 and X2, the predicted probability of the phenomenon under examination is 93%.
Here we need to change the value of X2 from 0.5 to 1.5, and the result can be calculated as follows.
ŷ = e^(−4.2 + 3.8X1 − 1.6X2) / (1 + e^(−4.2 + 3.8X1 − 1.6X2))
= e^(−4.2 + 3.8(2) − 1.6(1.5)) / (1 + e^(−4.2 + 3.8(2) − 1.6(1.5)))
= e^1 / (1 + e^1) = 2.718/3.718
ŷ = 0.73
This indicates that a one-unit increase in X2 (from 0.5 to 1.5) lowers the predicted probability of the phenomenon under examination from 0.93 to about 0.73.
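The arithmetic can be checked with a few lines of Python (logit_prob is a helper we define here for illustration, not part of the original solution):

```python
# Sketch: predicted probabilities from the estimated logit coefficients above.
import math

def logit_prob(x1, x2):
    """Predicted probability from the fitted logit model."""
    z = -4.2 + 3.8 * x1 - 1.6 * x2
    return math.exp(z) / (1 + math.exp(z))

print(round(logit_prob(2, 0.5), 2))  # 0.93
print(round(logit_prob(2, 1.5), 2))  # 0.73
```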
Parametric tests and their non-parametric equivalents can be grouped by the purpose of the analysis as follows.
Tests of differences between groups (independent samples), such as the GPA of males vs. females:
o Parametric test:
- A t-test can be used when we have two samples that we want to compare concerning their mean value for some variable of interest.
- ANOVA can be used when we compare three or more groups.
o Non-parametric equivalent:
- To compare two groups we use the Mann-Whitney U test.
- To compare three or more groups we use the Kruskal-Wallis analysis of ranks test.
Tests of differences between variables (dependent samples): for instance, the mathematics vs. English results of males:
o Parametric test:
- To compare two variables measured in the same sample, we use the t-test for dependent samples.
o Non-parametric test:
- We use the sign test and Wilcoxon's matched-pairs test.
- For dichotomous variables, such as pass vs. fail or yes vs. no, we use McNemar's chi-square test.
Tests of relationships between variables, like the relationship between income and work experience:
o Parametric test: the Pearson correlation coefficient is used to test the association between two variables.
o Non-parametric equivalent: the Spearman rank correlation coefficient. If the two variables of interest are categorical in nature (e.g. passed vs. failed by male vs. female), we conduct the chi-square test.
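For reference, a sketch of how these tests map onto scipy.stats, run on small made-up samples:

```python
# Sketch: parametric tests and their non-parametric equivalents in scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
males = rng.normal(3.0, 0.4, 30)     # e.g. GPAs of two independent groups
females = rng.normal(3.2, 0.4, 30)

print(stats.ttest_ind(males, females))        # parametric: independent t-test
print(stats.mannwhitneyu(males, females))     # non-parametric equivalent

math_scores = rng.normal(70, 10, 30)          # two variables, same sample
english = math_scores + rng.normal(2, 5, 30)
print(stats.ttest_rel(math_scores, english))  # parametric: paired t-test
print(stats.wilcoxon(math_scores, english))   # non-parametric equivalent

income = rng.normal(50, 10, 30)               # relationship between variables
experience = 0.3 * income + rng.normal(0, 2, 30)
print(stats.pearsonr(income, experience))     # parametric correlation
print(stats.spearmanr(income, experience))    # non-parametric rank correlation
```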