
College of Business and Economics

Department of Management:
MBA Program

Group Assignment on the course: Econometrics for Management

By:
Name ID. No.
1. Tilahun Yenew Bogale IUPGW/117/2014
2. Smachew Atalay Admas IUPGW/112/2014
3. Tewabe Belay Fenta IUPGW/116/2014
4. Tekelehaimanot Bogale Abebe IUPGW/114/2014

Submitted to: Smachew Zeleke (PhD)

December, 2022

Injibara

This term paper focuses on the concept of econometrics and the regression
model. It was written by reading and discussing the specific questions forwarded
by the course instructor. It is therefore organized following those questions,
and the assignment has allowed us to practice techniques of econometrics and
regression analysis. From this page forward is our discussion of the questions.

At the very beginning, econometrics is an amalgam of economic theory, mathematics
and statistics. Yet the subject deserves to be studied in its own right, for the
following reasons.

o Economic theory makes statements or hypotheses that are mostly qualitative in
nature. For instance, the law of demand does not provide any numerical measure
of the relationship between price and quantity demanded. A quantitative measure
of the relationship between variables therefore requires econometrics and
econometricians.

o Mathematics expresses economic theory in mathematical form without regard
to measurability or empirical verification of the theory, while econometrics is
mainly interested in the empirical verification of economic theory.

o Statistics is mainly concerned with collecting, processing, and presenting
economic data in the form of charts and tables, yet it does not go any further.
Econometrics is then required to fill this visible gap.

In general, econometrics differs from both mathematical statistics and economic
statistics. In economic statistics, empirical data are collected, recorded,
tabulated and used to describe the patterns in their development over time.
Economic statistics is thus a descriptive branch of economics: it provides
neither explanations of the development of the various variables nor
measurement of the parameters of their relationships. It is because of these
facts that econometrics should be treated as a separate discipline.

To begin, linear regression is an analysis that assesses whether one or more
predictor (explanatory) variables explain the dependent variable. The regression
has five key assumptions, which we discuss from this line forward.

(a) Linearity

Linearity means that the predictor variables in the regression have a
straight-line relationship with the outcome variable. The linearity assumption
in classical linear regression indicates that the model is correctly specified
and linear in the coefficients: the model is a linear combination of the
parameters and the variables. In the simplest terms, linear regression needs the
relationship between the independent and dependent variables to be linear. It is
also important to check for outliers, since linear regression is sensitive to
outlier effects.

The linearity assumption, therefore, indicates that the relation between Y (the
dependent variable) and X (the independent variable) is linear and that a value
of Y is determined for each value of X. The assumption also imposes that the
model is complete, in the sense that all relevant variables have been included
in the model.

(b) Homoscedasticity

Homoscedasticity describes a situation in which the error term (the random
disturbance in the relationship between the independent variables and the
dependent variable) is the same across all values of the independent variables.
Homoscedasticity, or homogeneity of variances, is an assumption of equal or
similar variances across the different groups being compared. It is an important
assumption of parametric statistical tests because they are sensitive to any
dissimilarity. In short, the variance of the error term is homoscedastic: it is
constant over different observations.
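To make this concrete, below is a minimal Python sketch of a rough homoscedasticity check in the spirit of the Goldfeld-Quandt idea; the fitted values and residuals are made-up illustrations, not data from this assignment. The residuals are split at the median fitted value and the two variances are compared; a ratio far from 1 would make us suspect heteroscedasticity.

```python
# Rough homoscedasticity check: compare residual variance for small vs.
# large fitted values. Illustrative numbers only.
import statistics

fitted = [10.2, 12.5, 15.1, 18.3, 21.0, 24.6, 27.9, 31.4]
residuals = [0.5, -0.8, 1.1, -0.6, 0.9, -1.2, 1.5, -1.0]

pairs = sorted(zip(fitted, residuals))            # order by fitted value
half = len(pairs) // 2
low = [e for _, e in pairs[:half]]                # residuals at small fits
high = [e for _, e in pairs[half:]]               # residuals at large fits

ratio = statistics.variance(high) / statistics.variance(low)
print(f"variance ratio (high/low): {ratio:.2f}")  # far from 1 => suspect heteroscedasticity
```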

(c) Autocorrelation

Linear regression analysis requires that there is little or no autocorrelation
in the data. Autocorrelation occurs when the residuals are not independent of
each other. For instance, it typically occurs in stock prices, where today's
price is not independent of the previous price. A related assumption concerns
the regressor: x cannot be constant within a given sample, since we are
interested in how variation in x affects variation in y, and it is a
mathematical necessity that x takes at least two different values in the sample.
We do, however, assume that x is fixed from sample to sample: the expected value
of x is the variable itself and, treated as a random quantity, its variance is
zero. Within a sample, though, there must be variation in x.
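A standard numerical check for first-order autocorrelation is the Durbin-Watson statistic, d = Σ(e_t − e_{t−1})² / Σe_t². The short Python sketch below computes it on illustrative residuals of our own making; values of d near 2 suggest little autocorrelation, while values toward 0 or 4 suggest positive or negative autocorrelation respectively.

```python
# Durbin-Watson statistic on a (made-up) residual series.
residuals = [0.4, 0.6, 0.5, -0.2, -0.5, -0.3, 0.1, 0.3]

num = sum((residuals[t] - residuals[t - 1]) ** 2
          for t in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
d = num / den
print(f"Durbin-Watson d = {d:.2f}")  # near 2 => little autocorrelation
```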

(d) Normality of Residuals

The residuals are simply the error terms, the differences between the observed
values of the dependent variable and the predicted values. Linear regression
analysis requires the residuals to be (approximately) normally distributed, and
the conditional expectation of the residual is zero. Furthermore, there must not
be any relation between the residual term and the X variable; they must be
uncorrelated. This means that the variables left unaccounted for in the residual
should have no relationship with the X variable included in the model. This
assumption can best be checked with a histogram.
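As a sketch of that histogram check, the Python snippet below (assuming matplotlib is installed; the residuals are purely illustrative) plots the residual distribution. A roughly bell-shaped, zero-centred histogram is consistent with normally distributed residuals.

```python
# Histogram check for residual normality (illustrative residuals).
import matplotlib.pyplot as plt

residuals = [0.5, -0.8, 1.1, -0.6, 0.9, -1.2, 1.5, -1.0, 0.2, -0.1]

plt.hist(residuals, bins=5, edgecolor="black")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.title("Residual histogram (roughly bell-shaped if normal)")
plt.show()
```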

(e) Multi-collinearity

Multi-collinearity refers to predictor variables that are highly correlated with
each other. This is an issue because the regression model will then not be able
to accurately associate variance in the outcome variable with the correct
predictor variable, leading to muddled results and incorrect inferences. The
assumption is only relevant for multiple linear regression, which has multiple
predictor variables.

However, linear regression assumes that there is little or no multicollinearity
in the data. There are several sources of multi-collinearity, which may include
the following:

o The data collection method employed. For example, sampling over a limited
range of the values taken by the regressors in the population.
o Constraints on the model or in the population being sampled.
o Model specification. For example, adding polynomial terms to a regression
model, especially when the range of the X variable is small.
o An overdetermined model. This happens when the model has more
explanatory variables than the number of observations.

An additional reason for multicollinearity, especially in time series data, may
be that the regressors included in the model share a common trend, that is, they
all increase or decrease over time. Thus, in a regression of consumption
expenditure on income, wealth, and population, the regressors income, wealth,
and population may all be growing over time at more or less the same rate,
leading to collinearity among these variables.

In general, multi-collinearity occurs when the independent variables are too
highly correlated with each other. If multicollinearity is found in the data,
centering the data (that is, deducting the variable's mean from each score)
might help to solve the problem. However, the simplest way to address the
problem is to remove independent variables with high variance inflation factor
(VIF) values.
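A minimal Python sketch of the VIF idea follows, on made-up data in which the second column is nearly twice the first. Each predictor is regressed on the others and VIF_j = 1/(1 − R_j²) is reported; the two collinear columns show very large values, flagging them for possible removal.

```python
# Variance inflation factors by hand: regress each column on the others.
import numpy as np

X = np.array([
    [1.0, 2.1, 3.0],
    [2.0, 4.2, 2.9],
    [3.0, 6.1, 3.1],
    [4.0, 8.3, 2.8],
    [5.0, 10.2, 3.2],
])  # columns: x1, x2 (roughly 2*x1, hence collinear), x3

for j in range(X.shape[1]):
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    print(f"VIF for column {j}: {1 / (1 - r2):.1f}")  # large => collinear
```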

In a regression model, the difference between the actual and estimated values of
the regressand is called the stochastic error term, ui. There are various forms
of error terms. A regression model is never perfectly accurate, so the
stochastic error term plays an important role in capturing that difference.

A stochastic error term is a factor introduced into a regression equation to
account for any variation that the independent variables cannot explain. Because
a regression model can never be completely accurate, stochastic error terms are
crucial in estimating this difference. An error term is generally unobservable,
while a residual is observable and calculable, making it much easier to quantify
and visualize. In effect, an error term represents the way observed data differ
from the actual population, whereas a residual represents the way observed data
differ from the sample data.

Therefore, the reasons a disturbance term ui must be included are:

o There are some unpredictable elements of randomness in human responses;
o The effect of a large number of omitted variables is contained in u;
o There is measurement error in y; or
o The functional form f(x) is not known in general.

We need to use regression analysis because it is a reliable method of
identifying which variables have an impact on a topic of interest. The process
of performing a regression allows us to confidently determine which factors
matter most, which factors can be ignored, and how these factors influence each
other. In addition, regression analysis is a helpful statistical method that can
be leveraged across an organization to determine the degree to which particular
independent variables are influencing dependent variables.

In actual data we never observe the mean of the regressand; rather, we get
samples that may be close to or far from the mean, so there is uncertainty about
any sample value. Regression analysis is also important for finding the
dependency of a dependent variable on an independent variable, and it is helpful
in making predictions and forecasting the data. The mean value alone would
suggest that a single value of the dependent variable exists for the different
values of the independent variable, but that is not the case. Regression
analysis handles this by estimating the effect that changing one independent
variable has on the dependent variable while holding all the other independent
variables constant. This process allows us to learn the role of each independent
variable without worrying about the other variables in the model.

a. Does the insurance premium depend on the driving experience or does the
driving experience depend on the insurance premium? Do you expect a positive
or a negative relationship between these two variables?

Based on theory and intuition, we expect the insurance premium to depend on
driving experience. Consequently, the insurance premium is the dependent
variable and driving experience is the independent variable in the regression
model. A new driver is considered a high risk by the insurance companies, and he
or she has to pay a higher premium for auto insurance. On average, the insurance
premium is expected to decrease with an increase in the years of driving
experience. Therefore, we expect a negative relationship between these two
variables. In other words, both the population correlation coefficient ρ and the
population regression slope B are expected to be negative.

b. Find the least squares regression line by choosing appropriate dependent and
independent variables based on your answer in part a.

To find the regression line, we first calculate the sums of driving experience
(let us say x) and insurance premium (let us say y), their product, and the
squares of both variables, as in the table below.

x (experience)   y (premium)   xy       x²       y²
5                64            320      25       4,096
2                87            174      4        7,569
12               50            600      144      2,500
9                71            639      81       5,041
15               44            660      225      1,936
6                56            336      36       3,136
25               42            1,050    625      1,764
16               60            960      256      3,600
Sum: 90          474           4,739    1,396    29,642

Now we can calculate the average values of x and y.

x̄ = Σx/n = 90/8 = 11.25

ȳ = Σy/n = 474/8 = 59.25

Then we can calculate the sums of squares of x and y as follows.

SSxy = Σxy − (Σx)(Σy)/n = 4,739 − (90)(474)/8 = −593.5

SSxx = Σx² − (Σx)²/n = 1,396 − (90)²/8 = 383.5

SSyy = Σy² − (Σy)²/n = 29,642 − (474)²/8 = 1,557.5

Finally, to find the regression line we calculate a and b as follows.

b = SSxy/SSxx = −593.5/383.5 = −1.5476

a = ȳ − b·x̄ = 59.25 − (−1.5476)(11.25) = 76.6605

Thus, the estimated regression line ŷ = a + bx is

ŷ = 76.6605 − 1.5476x

c. Interpret the meaning of the values of a & b calculated in part b.

The value of a = 76.6605 gives the value of y for x = 0; that is, it gives the monthly
auto insurance premium for a driver with no driving experience. The value of b
gives the change in ŷ due to a change of one unit in x. Thus, b = −1.5476 indicates
that, on average, for every extra year of driving experience, the monthly auto
insurance premium decreases by the amount of birr 1.55. When b is negative, y
decreases as x increases.
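The hand computation in parts (b) and (c) can be verified with a few lines of Python; this is only a numerical re-check of the formulas used above.

```python
# Least squares slope and intercept for the premium data.
x = [5, 2, 12, 9, 15, 6, 25, 16]        # driving experience (years)
y = [64, 87, 50, 71, 44, 56, 42, 60]    # monthly premium (birr)
n = len(x)

ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

b = ss_xy / ss_xx                        # slope, about -1.5476
a = sum(y) / n - b * sum(x) / n          # intercept, about 76.6605
print(f"y-hat = {a:.4f} + ({b:.4f})x")
```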

d. Plot the scatter diagram and the regression line.

[Scatter diagram: driving experience (0 to 30 years) on the horizontal axis
against the insurance premium (0 to 100) on the vertical axis, with the fitted
regression line drawn through the data points]

* The black line shows the regression line: ŷ = 76.6605 − 1.5476x.
* The x- and y-axes show the driving experience and the insurance premium,
respectively.

e. Calculate r and r2 and explain what they mean.

r = SSxy/√(SSxx · SSyy) = −593.5/√((383.5)(1,557.5)) = −0.77

r² = (SSxy)²/(SSxx · SSyy) = (−593.5)²/((383.5)(1,557.5)) = 0.59

Explanation: The value of r = −0.77 indicates that the driving experience and the
monthly auto insurance premium are negatively related. The relationship is strong
but not very strong. The value of r2 = 0.59 states that 59% of the total variation in
insurance premiums is explained by years of driving experience and the remaining
41% is not. The low value of r2 indicates that there may be many other important
variables that contribute to the determination of auto insurance premiums.

f. Predict the monthly auto insurance premium for a driver with 10 years of
driving experience.

Based on the estimated regression line above the predicted value of y for x = 10 can
be calculated as follows.

y = 76.6605 – 1.5476(x)

= 76.6605 – 1.5476*10

= 61.18

Consequently, it can be expected that the monthly auto insurance premium of a
driver with 10 years of driving experience is about birr 61.18.

g. Formulate a hypothesis and test at the 5% significance level.

Hypothesis: the null and alternative hypotheses are formulated as follows.

o The null hypothesis H0: B ≥ 0 (B is not negative).
o The alternative hypothesis H1: B < 0 (B is negative).

Now we need to use the t distribution table to make the hypothesis test.

The given significance level is 0.05.

The degrees of freedom for the t distribution are n − 2 = 8 − 2 = 6.

From the t distribution table, the critical value of t for an area of 0.05 in
the left tail and 6 degrees of freedom is −1.943.

Now we can calculate the value of the test statistic. The test statistic t for b
is

t = b/s_b = −1.5476/0.5270 = −2.937,

where s_b = s_e/√SSxx = 10.32/√383.5 = 0.5270 and
s_e = √((SSyy − b·SSxy)/(n − 2)) = √((1,557.5 − 918.5)/6) = 10.32.

Decision: the value of the test statistic, t = −2.937, is smaller than the
critical value −1.943 and so falls in the rejection region. Hence, we reject the
null hypothesis and conclude that B is negative. That is, the monthly auto
insurance premium decreases with an increase in years of driving experience.
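As a numerical check of part (g), the Python sketch below recomputes the test statistic from the raw data, using the standard formulas s_e = sqrt((SSyy − b·SSxy)/(n − 2)) and s_b = s_e/sqrt(SSxx).

```python
# Test statistic for H1: B < 0 in the premium regression.
import math

x = [5, 2, 12, 9, 15, 6, 25, 16]
y = [64, 87, 50, 71, 44, 56, 42, 60]
n = len(x)

ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

b = ss_xy / ss_xx
s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))   # standard error of estimate
s_b = s_e / math.sqrt(ss_xx)                     # standard error of the slope
t = b / s_b
print(f"t = {t:.3f}  (critical value at 5%, 6 df: -1.943)")
```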

We suppose that our score on the final exam of the Econometrics for Management
course (score) depends on class attendance (attend) and on unobserved other
factors that affect exam performance:

score = β0 + β1·attend + u

Where:

* β0 is a constant
* β1 measures the change in score associated with one more unit of attendance
* u captures the other, unobserved variables

o The data are cross-sectional because we collect them from the students at the
same point in time
o The dependent variable is exam performance (score)
o The independent variable is class attendance (attend); u collects the other
unobserved factors

The interpretation of the regression equation is that students' attendance
changes the expected value of their score by β1 per unit of attendance.

Suppose we obtained quarterly consumption data from the Ethiopian Statistical
Agency for the years 2000 to 2022 and study the percentage change in real
personal consumption expenditure (y) as explained by real personal disposable
income (x1), the percentage change in industrial production (x2), personal
saving (x3), and the unemployment rate (x4). Then

consumption (y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε

* It is a time series because the data are a 22-year quarterly record of the
consumption pattern of the Ethiopian population.
* The consumption pattern (y) is the dependent variable, explained by the
independent variables x1, x2, x3 and x4.
* β1, β2, β3 and β4 measure the effect of each predictor after taking account of
the effect of all the other predictors in the model.

The equation means that a one-unit increase or decrease in one of the predictors
has a corresponding effect, measured by its β, on the personal consumption
pattern of the Ethiopian population over the years 2000 to 2022.
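A minimal sketch of how such an equation could be estimated by OLS in Python follows. The data are randomly generated placeholders, not actual Ethiopian statistics, and numpy's least squares routine stands in for a full econometrics package.

```python
# OLS on a consumption-style equation with four predictors (made-up data).
import numpy as np

rng = np.random.default_rng(0)
n = 88                                    # 22 years x 4 quarters
X = rng.normal(size=(n, 4))               # stand-ins for x1..x4
beta_true = np.array([0.8, 0.3, -0.1, -0.4])
y = 1.5 + X @ beta_true + rng.normal(scale=0.2, size=n)

A = np.column_stack([np.ones(n), X])      # prepend the intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print("estimated [b0, b1, b2, b3, b4]:", beta_hat.round(2))
```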

 Implication of including an irrelevant variable:

Irrelevant variables are variables that have only zero factor loadings on the
dependent variable. Thus irrelevant variables are not related to the factors and
contain no information for estimating the unobserved factors. Including
irrelevant variables in an econometric model can therefore make the coefficient
estimates less precise, causing the overall model to lose accuracy.

 Implication of omitting a relevant variable:

Having an omitted variable in research can bias the estimated outcome of the
study and lead the researcher to mistaken conclusions. This means that while the
researcher assesses the effect of the independent variable, the bias can produce
further problems in the regression analysis.

In other words, the omission from a regression of variables that affect the
dependent variable may cause omitted-variable bias. The bias depends on the
correlation between the omitted variable and the included independent variables.
Such an omission may therefore yield biased and inconsistent coefficient
estimates, and biased and inconsistent estimates are not reliable in research.

In general, if an omitted variable is correlated with the treatment variable,
then the estimated treatment effect will be biased. Including irrelevant
variables that are correlated with the existing predictors, by contrast, does
not bias the estimates but increases their variance, making estimates and
predictions less precise.

In regression analysis, a dummy variable is a binary variable that takes the
value 0 or 1 to indicate the absence or presence of some categorical effect that
may be expected to shift the outcome. Dummy variables, taking values of 1 and 0,
are a means of introducing qualitative regressors into regression models. A
dummy variable is a numerical variable used in regression analysis to represent
subgroups of the sample in a study. In research design, a dummy variable is
often used to distinguish different treatment groups such as gender, educational
level, seasons, private vs. public, marital status and so on. Dummy variables
can be used either as explanatory variables or as the dependent variable.

In a dummy dependent variable model, the dependent variable is qualitative, not
quantitative. When the qualitative dependent variable has exactly two values
(like married or unmarried), we often speak of binary choice models. In this
case, the dependent variable can be conveniently represented by a dummy variable
that takes on the value 0 or 1.

On the other hand, dummy variables can also serve as independent variables
taking the value of either 0 or 1. Just as a dummy is a stand-in for a real
person, in quantitative analysis a dummy variable is a numeric stand-in for a
qualitative fact or a logical proposition.

Even though it is possible to apply the OLS technique here, there are several
statistical problems we may face. The problems with OLS when the dependent
variable is a dummy are:

o The error term is not normally distributed. The error term is heteroskedastic.
o R-squared becomes a useless measure.

o The model becomes problematic for forecasting purposes. One would like to
forecast the probability that a certain set of independent variables produces a
certain binomial outcome, but OLS can produce probabilities greater than one or
smaller than zero.

Therefore, there is an alternative to OLS estimation that does not face these
problems: logistic regression. Logistic regression is a non-linear estimation
technique which solves these problems of OLS. Thus, logistic regression is
proposed for this case, as the sketch below illustrates.
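The sketch below illustrates the forecasting problem on made-up binary data: an OLS line fitted to a 0/1 outcome happily extrapolates past 1, whereas the logistic function that logistic regression is built on maps any index into (0, 1). Applying the logistic transform to the OLS index here is only for illustration; a real logit model is fitted by maximum likelihood.

```python
# OLS on a dummy dependent variable can predict outside [0, 1].
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]               # binary outcome

n = len(x)
b = (sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n) / \
    (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n)
a = sum(y) / n - b * sum(x) / n

z = a + b * 10                             # extrapolate to x = 10
print(f"OLS 'probability' at x = 10: {z:.2f}")   # about 1.42, i.e. > 1
print(f"logistic transform of the same index: {1 / (1 + math.exp(-z)):.2f}")
```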

The logit model assumes a logistic distribution of the errors, while the probit
model assumes normally distributed errors. From this point forward each model is
explained.

Logit model:

Logistic regression is a method that we can use to fit a regression model when
the response variable is binary. The following are its basic assumptions:

 The response variable is binary: Logistic regression assumes that the
response variable only takes on two possible outcomes, like yes or no, male or
female, pass or fail, and so on.
 The observations are independent: Logistic regression assumes that the
observations in the dataset are independent of each other. That is, the
observations should not come from repeated measurements of the same
individual or be related to each other in any way.
 There is no multicollinearity among explanatory variables: Logistic
regression assumes that there is no severe multicollinearity among
the explanatory variables. The common way to detect multicollinearity is by
using the variance inflation factor (VIF), which measures the correlation and
strength of correlation between the predictor variables in a regression model.

 There are no extreme outliers: Logistic regression assumes that there are no
extreme outliers or influential observations in the dataset.
 Logistic regression assumes that there is a linear relationship between
explanatory variables and the logit of the response variable.
 Logistic regression assumes that the sample size is sufficiently large: the
dataset must be large enough to draw valid conclusions from the fitted logistic
regression model.

Probit model:

A probit model is used to model dichotomous or binary outcome variables. For
instance, if we are interested in the factors that influence whether a political
candidate wins an election, the outcome variable is binary (0/1): win or lose.
The predictor variables of interest are the amount of money spent on the
campaign, the amount of time spent campaigning negatively, and whether the
candidate is an incumbent. The probit model assumes normally distributed errors.

In general, the probit model and the logit model deliver only approximations to
the true regression function, and it is not obvious how to decide which model to
use in practice. The linear probability model has the clear drawback of not
being able to capture the nonlinear nature of the population regression
function, and it may predict probabilities lying outside the unit interval.
Probit and logit models are harder to interpret but capture the nonlinearities
better than the linear approach: both models produce predicted probabilities
that lie inside the interval [0, 1]. Predictions from all three models are often
close to each other, so it is sensible to use the method that is easiest to
handle in the statistical software of choice, as the comparison below
illustrates.
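As a small illustration of our own, the snippet below evaluates the two link functions on the same linear index z: the logit uses the logistic CDF and the probit the standard normal CDF. The resulting probabilities are close over the whole range, which is why the two models often give similar predictions.

```python
# Logit vs. probit link on the same linear index z.
import math
from statistics import NormalDist

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p_logit = 1 / (1 + math.exp(-z))      # logistic CDF
    p_probit = NormalDist().cdf(z)        # standard normal CDF
    print(f"z = {z:+.1f}:  logit {p_logit:.3f}   probit {p_probit:.3f}")
```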

Binary, multinomial and ordinal are the three types of logistic regression.
Binary logistic regression deals with two possible values, essentially yes or
no. Multinomial logistic regression deals with three or more values, and ordinal
logistic regression deals with three or more classes in a predetermined order.

 Binary logistic regression

Binary logistic regression involves an either/or solution. There are just two possible
outcome answers. This concept is typically represented as a 0 or a 1 in coding.
Examples include:

o Whether or not to lend to a bank customer (outcomes are yes or no).
o Assessing cancer risk (outcomes are high or low).
o Will a team win tomorrow's game (outcomes are yes or no).

 Multinomial logistic regression

Multinomial logistic regression is a model where there are multiple classes that an
item can be classified as. There is a set of three or more predefined classes set up
prior to running the model. Examples include:

o Classifying texts into what language they come from.
o Predicting whether a student will go to college, trade school or into the
workforce.
o Does your cat prefer wet food, dry food or human food?

 Ordinal logistic regression

Ordinal logistic regression is also a model where there are multiple classes that an
item can be classified as; however, in this case an ordering of classes is required.
Classes do not need to be proportionate; the distance between each class can vary.
Examples include:

o Ranking restaurants on a scale of 0 to 5 stars.
o Predicting the podium results of an Olympic event.
o Assessing a choice of candidates, specifically in places that institute
ranked-choice voting.

Variable name   Estimated coefficient   Standard error   Asymptotic t-ratio
X1              3.8                     1.7              2.2
X2              -1.6                    0.54             -3.0
Constant        -4.2                    2.3              -1.8

1) What is the predicted probability that y = 1 when X1 = 2 and X2 = 0.5?

From the table we have β0 = -4.2, β1 = 3.8 and β2 = -1.6.

The equation can be written as follows.

ŷ = e^(β0 + β1X1 + β2X2) / (1 + e^(β0 + β1X1 + β2X2))

= e^(-4.2 + 3.8(2) - 1.6(0.5)) / (1 + e^(-4.2 + 3.8(2) - 1.6(0.5)))

= e^2.6 / (1 + e^2.6)

= 13.46/14.46

ŷ = 0.93

This indicates that, for the given values of X1 and X2, the predicted
probability that y = 1 is 0.93, i.e. a 93% chance.

2) Compute the change in the predicted probability when X2 increases by one
unit from X2 = 0.5 to X2 = 1.5, holding X1 at X1 = 2.

Here we need to change the value of X2 from 0.5 to 1.5 while holding X1 = 2, and
the result can be calculated as follows.

ŷ = e^(-4.2 + 3.8(2) - 1.6(1.5)) / (1 + e^(-4.2 + 3.8(2) - 1.6(1.5)))

= e^1.0 / (1 + e^1.0)

= 2.718/3.718

ŷ = 0.73

Thus, when X2 increases by one unit, the predicted probability falls from 0.93
to 0.73, a change of -0.20, i.e. about 20 percentage points.
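Both answers can be verified with a short Python sketch using the estimated coefficients from the table above.

```python
# Predicted logit probabilities and the change when X2 rises by one unit.
import math

b0, b1, b2 = -4.2, 3.8, -1.6

def p_hat(x1, x2):
    z = b0 + b1 * x1 + b2 * x2            # linear index
    return math.exp(z) / (1 + math.exp(z))

p_before = p_hat(2, 0.5)                  # about 0.93
p_after = p_hat(2, 1.5)                   # about 0.73
print(f"p(X2=0.5) = {p_before:.2f}, p(X2=1.5) = {p_after:.2f}, "
      f"change = {p_after - p_before:.2f}")
```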

Parametric test

 Makes certain assumptions about population parameters (e.g., a normal
distribution).
 If the assumptions are correct, these tests can produce more accurate and
precise estimates.
 Normally involves data expressed in absolute numbers (interval or ratio
scale) rather than ranks and categories (nominal or ordinal).
 Such tests include analysis of variance (ANOVA), t-tests, and so on.

Non-parametric test

 Can be applied to data of 'low quality' (nominal or ordinal), from small
samples, on variables about which nothing is known concerning their
distribution.
 Used in cases where the researcher knows nothing about the parameters of the
variable of interest in the population (hence the name nonparametric).
 Widely used for studying populations that take on a ranked order but have no
clear numerical interpretation, such as when assessing preferences; in terms of
levels of measurement, for data on an ordinal scale.

Basically, there is at least one nonparametric equivalent for each general type
of parametric test. These tests fall into the following three categories:

 Tests of differences between groups (independent samples) such as GPA of males
vs. females.
o Parametric test:
- The t-test can be used when we have two samples that we want to compare with
respect to their mean value for some variable of interest.
- ANOVA can be used when we compare three or more groups.
o Non-parametric equivalent:
- To compare two groups we use the Mann-Whitney U test.
- To compare three or more groups we use the Kruskal-Wallis analysis of ranks
test.
 Tests of differences between variables (dependent samples): for instance,
Mathematics vs. English results of males
o Parametric test
- To compare two variables measured in the same sample, we use the t-test for
dependent samples.
o Non-parametric test
- We use the Sign test and Wilcoxon's matched pairs test.
- For dichotomous variables, such as pass vs. fail or yes vs. no, we use
McNemar's chi-square test.
 Tests of relationships between variables, like the relationship between income
and work experience
o Parametric test: the Pearson correlation coefficient is used to test the
association.
o Non-parametric equivalent: the Spearman rank correlation coefficient. If the
two variables of interest are categorical in nature (e.g. passed vs. failed by
male vs. female), we conduct the chi-square test.
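For reference, most of the tests named above are available in scipy.stats; the sketch below, on illustrative GPA-style data, shows a parametric call next to its nonparametric equivalent.

```python
# Parametric tests and their nonparametric equivalents via scipy.stats.
from scipy import stats

g1 = [3.1, 3.4, 2.9, 3.8, 3.3]            # e.g. GPA of one group
g2 = [2.7, 3.0, 2.6, 3.2, 2.8]            # e.g. GPA of another group

print(stats.ttest_ind(g1, g2))            # parametric: two-sample t-test
print(stats.mannwhitneyu(g1, g2))         # nonparametric equivalent
print(stats.kruskal(g1, g2, [3.0, 3.5, 3.1]))  # 3+ groups, by ranks
print(stats.spearmanr(g1, g2))            # rank correlation
```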
