
ECONOMETRICS IN AGRICULTURAL ECONOMICS

Compiled by:
Tilahun.A
November 2015
Table of Contents

1.1 Econometrics
  1.1.1 Introduction
  1.1.2 Learning Task Objectives
  1.1.3 Econometrics Learning Task Sections
    1.1.3.1 Fundamental Concepts of Econometrics (3 hrs)
      Definition and Scope of Econometrics
      Why a Separate Discipline?
      Economic Models vs. Econometric Models
      Methodology of Econometrics
      Desirable Properties of an Econometric Model
      Goals of Econometrics
2. Correlation and 3. Simple Linear Regression (6 hrs)
      Correlation Analysis
      Methods of Measuring Correlation
      The Rank Correlation Coefficient
      Partial Correlation Coefficients
      The Simple Regression Model
      Ordinary Least Squares Method (OLS) and Classical Assumptions
      Hypothesis Testing of OLS Estimates
      Properties of OLS Estimators
      Extensions of Regression Models
4. Multiple Linear Regression Models (8 hrs)
      Concept and Notations of Multiple Regression Models
      Assumptions of the Multiple Linear Regression
      Estimation of Partial Regression Coefficients
      Variance and Standard Errors of OLS Estimators
      Coefficient of Multiple Determination
      Hypothesis Testing in Multiple Regression
      Dummy Variable Regression Models
5. Econometric Problems (6 hrs)
      Assumptions Revisited
      Violations of Assumptions
      Heteroscedasticity: The Error Variance is not Constant
      Autocorrelation: Error Terms are Correlated
      Multicollinearity: Exact Linear Correlation between Regressors
Unit 6: Non-linear Regression and Time Series Econometrics
References

List of Tables

Table 1: Data for computation of correlation coefficient
Table 2: Computations of inputs for correlation coefficients
Table 3: Example for rank correlation coefficient
Table 4: Computation for rank correlation coefficient
Table 5: Data on yield of corn, fertilizer and insecticides used
Table 6: Computation for partial correlation coefficients
Table 7: Data on supply and price for given commodity
Table 8: Data for computation of different parameters
Table 9: Data for multiple regression examples
Table 10: Computations of the summary statistics for coefficients for data of Table 9

List of Figures

Figure 1: Perfect linear correlations
Figure 2: Regression function/curve if the mean of the error term is not zero
Figure 3: The distribution of the error term with and without serial correlation
Figure 4: Graphical detection of autocorrelation
1. Learning Task: Econometrics

1.1 Introduction
Econometrics is about how we can use economic or social science theory and data, along with tools from
statistics, to answer “how much” type questions. It integrates mathematical knowledge, statistical skills and
economic theory to solve business and economic problems of agribusiness firms. For instance, economics
tells us that the demand for a good is a function of the good’s price, and that in most cases the price
elasticity of demand is negative. But for many practical purposes one may be interested in quantifying the
elasticity more precisely. For such questions, econometrics can provide the answer.

1.1.2 Learning Task Objectives


This learning task is designed to equip students with the basic concepts and principles of econometrics. It
attempts to foster students’ ability to estimate ordinary least squares (OLS) regressions; make predictions
using regression analysis; and integrate mathematics, statistics and economic theory to solve problems. In
addition, it exposes the students to various statistical packages for analyzing data and provides them with
practical activities to utilize the software packages in appropriate contexts or situations. Specific
objectives of the learning task include:
 Measure and analyze the association between economic variables
 Formulate models to represent and study economic relationships
 Estimate economic parameters important for policy formation and evaluation
 Forecast economic relationships

1.1.3 Econometrics Learning Task Sections

Fundamental concepts of Econometrics (3 hrs)


Pre-test
1. What is econometrics?
2. What is the importance of econometric models?
3. Mention desirable properties of econometric models
4. Differentiate between economic and econometric model
5. What are the goals of econometrics?

Definition and scope of econometrics


What is Econometrics?
Simply stated, econometrics means economic measurement. The “metric” part of the word signifies
measurement, and econometrics is concerned with the measurement of economic relationships.

It is a social science in which the tools of economic theory, mathematics and statistical inference are
applied to the analysis of economic phenomena (Arthur Goldberger).
In the words of Maddala econometrics is “the application of statistical and mathematical methods to the
analysis of economic data, with a purpose of giving empirical content to economic theories and verifying
them or refuting them.”
Econometrics utilizes economic theory, as embodied in an econometric model; facts, as summarized by
relevant data; and statistical theory, as refined into econometric techniques, to measure and empirically
test certain relationships among economic variables.
 It is a special type of economic analysis and research in which the general economic theory formulated
in mathematical form (i.e. mathematical economics) is combined with empirical measurement (i.e.
statistics) of economic phenomena.

Why a Separate Discipline?


As the definition suggests, econometrics is an amalgam of economic theory, mathematical statistics and
economic statistics. But a distinction has to be made between econometrics and economic theory,
statistics and mathematics.
1. Economic theory makes statements or hypotheses that are mostly of a qualitative nature.
Example: Other things remaining constant (ceteris paribus), a reduction in the price of a commodity is
expected to increase the quantity demanded. Economic theory thus postulates an inverse relationship
between the price and quantity demanded of a commodity. But the theory does not provide a numerical
measure of the relationship between the two. Here comes the task of the econometrician: to provide the
numerical value by which the quantity will go up or down as a result of changes in the price of the
commodity.
2. Economic statistics is concerned with collecting, processing and presenting economic data (descriptive
statistics).
Example: collecting and refining data on national accounts, index numbers, employment, prices, etc.
3. Mathematical statistics and mathematical economics provide many of the tools used in
econometrics. But econometrics needs special methods to deal with economic data, which are
generally non-experimental.
Examples: errors of measurement, the problem of multicollinearity and the problem of serial correlation
are specifically econometric problems and are not concerns of mathematical statistics.
Econometrics utilizes these data to estimate quantitative economic relationships and to test hypotheses
about them. The econometrician is called upon to develop special methods of analysis to deal with such
kinds of econometric problems.
Economic models vs. econometric models
A model is any representation of an actual phenomenon, such as an actual system or process. The real-
world system is represented by the model in order to explain it, to predict it, and to control it. Any model
represents a compromise between reality and manageability. A given representation of a real-world system
can be a model if it fulfills the following requirements:
(1) It must be a “reasonable” representation of the real-world system, and in that sense it should be
realistic.
(2) On the other hand, it must be “manageable” in that it yields certain insights or conclusions.
A good model is both realistic and manageable. A highly realistic but too complicated model is a “bad”
model in the sense that it is not manageable. A model that is highly manageable but so idealized that it is
unrealistic, not accounting for important components of the real-world system, is a “bad” model too. In
general, finding the proper balance between realism and manageability is the essence of good modeling.
Thus a good model should, on the one hand, specify the interrelationships among the parts of a system in a
way that is sufficiently detailed and explicit and, on the other hand, be sufficiently simplified and
manageable to ensure that the model can be readily analyzed and conclusions can be reached concerning
the real world.
Economic models
Any economic theory is an abstraction from the real world. For one reason, the immense complexity of
the real-world economy makes it impossible for us to understand all interrelationships at once. Another
reason is that not all interrelationships are equally important for the understanding of the economic
phenomenon under study. The sensible procedure is, therefore, to pick out the important factors and
relationships relevant to our problem and to focus our attention on these alone. Such a deliberately
simplified analytical framework is called an economic model. It is an organized set of relationships that
describes the functioning of an economic entity under a set of simplifying assumptions. All economic
reasoning is ultimately based on models. Economic models consist of the following three basic structural
elements:
1. A set of variables
2. A list of fundamental relationships
3. A number of strategic coefficients
Econometric models
The most important characteristic of economic relationships is that they contain a random element, which
is ignored by mathematical economic models that postulate exact relationships between economic
variables. In econometrics, the influence of these ‘other’ factors is taken into account by introducing a
random variable into the model.

Methodology of Econometrics
Econometric research is concerned with the measurement of the parameters of economic relationships and
with the prediction of the values of economic variables. The relationships of economic theory which can
be measured with econometric techniques are relationships in which some variables are postulated as
causes of the variation of other variables. Starting with the postulated theoretical relationships among
economic variables, econometric research or inquiry generally proceeds along the following lines/stages:
1. Statement of theory or hypothesis
2. Specification of the mathematical model
3. Specification of the econometric model
4. Obtaining the data
5. Estimation of the parameters of the econometric model
6. Hypothesis testing
7. Forecasting or prediction
8. Using the model for control or policy purposes
To illustrate the preceding steps, let us consider the well-known Keynesian theory of consumption.
1. Statement of Theory or Hypothesis
Keynes postulated that the marginal propensity to consume (MPC), the rate of change of consumption for
a unit (say, a dollar) change in income, is greater than zero but less than 1.

2. Specification of the mathematical model


In this step the econometrician has to express the relationships between economic variables in
mathematical form. The step involves the determination of three important issues:
 Determine dependent and independent (explanatory) variables to be included in the model,
 Determine a priori theoretical expectations about the size and sign of the parameters of the function,
and
 Determine mathematical form of the model (number of equations, specific form of the equations,
etc.
Although Keynes postulated a positive relationship between consumption and income, he did not specify
the precise form of the functional relationship between the two. For simplicity, a mathematical economist
might suggest the following form of the Keynesian consumption function:

$$Y = \beta_1 + \beta_2 X, \qquad 0 < \beta_2 < 1$$

where Y = consumption expenditure and X = income, and where β1 and β2, known as the
parameters of the model, are, respectively, the intercept and slope coefficients. The slope coefficient
β2 measures the MPC.
Specification of the econometric model will be based on economic theory and on any available
information related to the phenomena under investigation. Thus, specification of the econometric model
presupposes knowledge of economic theory and familiarity with the particular phenomenon being studied.
Specification of the model is the most important and the most difficult stage of any econometric research.
It is often the weakest point of most econometric applications, as there is an enormous likelihood of
committing errors in specifying the model. The most common errors of
specification are:
a. Omissions of some important variables from the function.
b. The omissions of some equations (for example, in simultaneous equations model).
c. The mistaken mathematical form of the functions.
Such misspecification errors may arise for one or more reasons. Some of the common reasons for
incorrect specification of econometric models are:
 imperfections and looseness of statements in economic theories
 limited knowledge of the factors which are operative in any particular case
 formidable obstacles presented by data requirements in the estimation of large models
3. Specification of the econometric model
The purely mathematical model of the consumption function given above is of limited interest to the
econometrician, for it assumes that there is an exact or deterministic relationship between consumption
and income. But relationships between economic variables are generally inexact.
To allow for the inexact relationships between economic variables, the econometrician would modify
the deterministic consumption function as follows:

$$Y = \beta_1 + \beta_2 X + u$$

where u, known as the disturbance, or error, term, is a random (stochastic) variable that has well-
defined probabilistic properties. The disturbance term u may well represent all those factors that affect
consumption but are not taken into account explicitly.
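To make steps 2 and 3 concrete, here is a minimal Python sketch; the coefficient values, the income series and the error spread are assumptions chosen purely for illustration, not estimates from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: assumed mathematical model C = b1 + b2*Y, with 0 < MPC < 1
b1, b2 = 200.0, 0.7                      # hypothetical intercept and MPC

income = np.linspace(1000, 5000, 50)     # hypothetical income observations

# Step 3: the econometric model adds a stochastic disturbance term u
u = rng.normal(0.0, 50.0, income.size)
consumption = b1 + b2 * income + u       # inexact relationship, as in the text

# Without u the relationship would be exact (deterministic), which real
# consumption data never satisfy.
```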
Estimating such a model (stage 5 below) is a purely technical stage which requires knowledge of the
various econometric methods, their assumptions and the economic implications for the estimates of the
parameters. This stage includes the following activities:
i. Examination of the identification conditions of the function (especially for simultaneous equations
models).
ii. Examination of the aggregation problems involved in the variables of the function.
iii. Examination of the degree of correlation between the explanatory variables (i.e. examination of the
problem of multicollinearity).
iv. Choice of the appropriate econometric technique for estimation, i.e. deciding on a specific
econometric method to be applied, such as OLS or maximum likelihood.
4. Obtaining the data

To obtain the numerical values of β1 and β2, we need data on consumption expenditure and income.

5. Estimation of the parameters and evaluation of the estimates

Once the parameters have been estimated, the econometrician must decide whether the estimates are
theoretically meaningful and statistically significant. This enables the econometrician to evaluate the
results of the calculations and determine the reliability of the results. For this purpose we use various
criteria, which may be classified into three groups:
i. Economic a priori criteria: These criteria are determined by economic theory and refer to the size
and sign of the parameters of economic relationships.
ii. Statistical criteria (first-order tests): These are determined by statistical theory and aim at the
evaluation of the statistical reliability of the estimates of the parameters of the model. The correlation
coefficient test, standard error test, t-test, F-test, and R²-test are some of the most commonly used
statistical tests.
iii. Econometric criteria (second-order tests): These are set by the theory of econometrics and aim at
the investigation of whether the assumptions of the econometric method employed are satisfied or
not in any particular case. The econometric criteria serve as a second order test (as test of the
statistical tests) i.e. they determine the reliability of the statistical criteria; they help us establish
whether the estimates have the desirable properties of unbiasedness, consistency, etc. Econometric
criteria aim at the detection of the violation or validity of the assumptions of the various
econometric techniques.
6. Hypothesis Testing
Confirming or refuting an economic theory on the basis of sample evidence is based on a branch of
statistical theory known as statistical inference (hypothesis testing).
7. Forecasting or prediction
Forecasting is one of the aims of econometric research. However, before an estimated model is used for
forecasting in one way or another, its predictive power and other requirements need to be checked. It is
possible that the model is economically meaningful and statistically and econometrically correct for the
sample period for which it has been estimated, yet unsuitable for forecasting for various reasons.
Therefore, this stage involves investigating the stability of the estimates and their sensitivity to changes
in the size of the sample. Consequently, we must establish whether the estimated function performs
adequately outside the sample data, which requires testing the model’s performance on extra-sample
data.

Desirable Properties of an Econometric Model


An econometric model is a model whose parameters have been estimated with some appropriate
econometric technique. The ‘goodness’ of an econometric model is judged customarily based on the
following desirable properties.
1. Theoretical plausibility: The model should be compatible with the postulates of economic theory
and adequately describe the economic phenomena to which it relates.
2. Explanatory ability: The model should be able to explain the observations of the actual world. It
must be consistent with the observed behaviour of the economic variables whose relationship it
determines.
3. Accuracy of the estimates of the parameters: The estimates of the coefficients should be accurate in
the sense that they approximate the true parameters of the structural model as closely as possible.
The estimates should, if possible, possess the desirable properties of unbiasedness, consistency and
efficiency.
4. Forecasting ability: The model should produce satisfactory predictions of future values of the
dependent (endogenous) variables.
5. Simplicity: The model should represent the economic relationships with maximum simplicity. The
fewer the equations and the simpler their mathematical form, the better the model provided that the
other desirable properties are not affected by the simplifications of the model.
Goals of Econometrics
Basically there are three main goals of Econometrics. They are:
i) Analysis i.e. testing economic theory
10
ii) Policy making i.e. obtaining numerical estimates of the coefficients of economic relationships for
policy simulations.
iii) Forecasting i.e. using the numerical estimates of the coefficients in order to forecast the future
values of economic magnitudes.

Classification of econometrics

Discuss classification of econometrics. Econometrics may be divided into two main categories:
a) theoretical econometrics and b) applied econometrics.

1) Theoretical econometrics is concerned with the development of appropriate methods for measuring
economic relationships specified by econometric models. The methods of estimation can be classified
into two groups:
 The single equation techniques, which are applied to one relationship at a time.
 The simultaneous equation techniques, which are applied to all the relationships of the model
simultaneously.
The choice between them depends on the nature of the relationship, identification considerations, and the
properties of the estimates of the coefficients obtained from each technique (such as consistency,
sufficiency, time, cost, simplicity, etc.). Theoretical econometrics is also concerned with spelling out the
assumptions of the models, their properties, and what happens to these properties when one or more of
the assumptions of the model are not satisfied.

2) Applied econometrics is concerned with the measurement of the parameters of economic relationships
and with the prediction of the values of economic variables. The tools of theoretical econometrics are
used to study special fields such as:
 the production function,
 the consumption function,
 the investment function,
 the demand and supply functions, etc.
Furthermore, applied econometrics describes the practical value of econometric research, and it deals
with the application of econometric techniques developed in theoretical econometrics to different fields
of economic theory for verification and forecasting.

What is the Nature of the Econometric Approach?

As discussed earlier, there are at least two basic ingredients in any econometric study, namely economic
theory and data (facts). Therefore, the major task of the econometrician is to combine these two
ingredients.

a) The theory should be developed into a usable form. The most usable form for the purposes of
econometrics is the econometric model, which is the most convenient way of summarizing the theory for
empirical measurement and testing. The most important aspect of this step is the specification of an
econometric model that appropriately represents the phenomena to be studied.

b) The other basic ingredient in an econometric study is a set of facts (data). The data have to be refined
(“massaged”) in a variety of ways to make them suitable for use in an econometric study.

Types of Data

Three types of data may be available for empirical analysis: time series, cross-section, and pooled (i.e.,
combination of time series and cross-section) data.

Time Series Data

A time series is a set of observations on the values that a variable takes at different times. Such data may
be collected at regular time intervals, such as daily (e.g., stock prices, weather reports), weekly (e.g.,
money supply figures), monthly [e.g., the unemployment rate, the Consumer Price Index (CPI)], quarterly
(e.g., GDP), annually (e.g., government budgets), every 5 years (e.g., the census of manufactures), or
decennially (e.g., the census of population).

Cross-Section Data

Cross-section data are data on one or more variables collected at the same point in time, such as the
census of population conducted by the Census Bureau every 10 years (the latest being in year 2000) or
the surveys of consumer expenditures conducted by the University of Michigan.

Pooled Data

In pooled, or combined, data there are elements of both time series and cross-section data.

Panel, Longitudinal, or Micropanel Data

This is a special type of pooled data in which the same cross-sectional unit (say, a family or a firm) is
surveyed over time. For example, the U.S. Department of Commerce carries out a census of housing at
periodic intervals. At each periodic survey the same household (or the people living at the same address)
is interviewed to find out if there has been any change in the housing and financial conditions of that
household since the last survey.
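The three data types can be pictured as small tables; a rough sketch, assuming pandas is available and using made-up numbers:

```python
import pandas as pd

# Time series: one unit observed at several dates
ts = pd.DataFrame({"year": [2001, 2002, 2003],
                   "gdp": [100, 104, 109]})

# Cross-section: several units observed at one point in time
cs = pd.DataFrame({"household": ["A", "B", "C"],
                   "income": [50, 65, 40]})

# Panel (micropanel): the same units observed repeatedly over time
panel = pd.DataFrame({"household": ["A", "A", "B", "B"],
                      "year":      [2002, 2003, 2002, 2003],
                      "income":    [48, 50, 60, 65]})
```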


2. Correlation Analysis

Pre-test Questions

1. How do you define correlation?


2. Which correlation measurement methods do you know?
3. What is regression analysis?
4. What do you know about the OLS (ordinary least squares) method?
Economic variables have a great tendency to move together, and very often data are given in
pairs of observations in which the change in one variable is on average accompanied by a
change in the other variable. This situation is known as correlation.
Correlation may be defined as the degree of relationship existing between two or more
variables. The degree of relationship existing between two variables is called simple
correlation. The degree of relationship connecting three or more variables is called multiple
correlation. In this unit, we shall examine only simple correlation. A correlation is said
to be partial if it studies the degree of relationship between two variables keeping all other
variables connected with these two constant.
Correlation may be linear, when all points (X, Y) on a scatter diagram seem to cluster near a
straight line, or nonlinear, when all points seem to lie near a curve. In other words, correlation
is said to be linear if a change in one variable brings a constant change in the other. It is
non-linear if a change in one variable brings a varying change in the other.
Correlation may also be positive or negative. Correlation is said to be positive if an increase or
a decrease in one variable is accompanied by an increase or a decrease in the other, so that
both variables change in the same direction. For example, the correlation between the
price of a commodity and its quantity supplied is positive, since as price rises, quantity
supplied increases, and vice versa. Correlation is said to be negative if an increase or a
decrease in one variable is accompanied by a decrease or an increase in the other, so that the
two change in opposite directions. For example, the correlation between the price of a


commodity and its quantity demanded is negative, since as price rises, quantity demanded
decreases, and vice versa.

Methods of Measuring Correlation


In correlation analysis there are two important things to be addressed: the type of
co-variation existing between variables and its strength. The types of correlation
mentioned above do not show us the strength of co-variation between variables. There are
three methods of measuring correlation:
1. The Scatter Diagram or Graphic Method
2. The Simple Linear Correlation Coefficient
3. The Coefficient of Rank Correlation
The Scatter Diagram or Graphic Method
The scatter diagram is a rectangular diagram which can help us visualize the relationship
between two phenomena. It plots the paired data in the X-Y plane, from the lowest data pair
to the highest. It is a non-mathematical method of measuring the degree of co-variation
between two variables. Scatter plots usually consist of a large body of data. The closer the
data points come to making a straight line, the higher the correlation between the two
variables, or the stronger the relationship.
If the data points make a straight line going from the origin out to high x- and y-values, then
the variables are said to have a positive correlation. If the line goes from a high-value on the
y-axis down to a high-value on the x-axis, the variables have a negative correlation.
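A minimal sketch of the graphic method, assuming matplotlib is available; the price-quantity pairs are hypothetical:

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations: X = price, Y = quantity supplied
x = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
y = [10, 20, 50, 40, 50, 60, 80, 90, 90, 120]

plt.scatter(x, y)                       # each point is one (X, Y) pair
plt.xlabel("Price (X)")
plt.ylabel("Quantity supplied (Y)")
plt.title("Points clustering near a rising line: positive correlation")
plt.show()
```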


Figure 1: Perfect linear correlations


A perfect positive correlation is given the value of 1. A perfect negative correlation is given
the value of -1. If there is absolutely no correlation present the value given is 0. The closer the
number is to 1 or -1, the stronger the correlation, or the stronger the relationship between the
variables. The closer the number is to 0, the weaker the correlation.
Two variables may have a positive correlation, negative correlation, or they may be
uncorrelated. This holds true both for linear and nonlinear correlation. Two variables are said
to be positively correlated if they tend to change together in the same direction, that is, if they
tend to increase or decrease together. Such positive correlation is postulated by economic
theory for the quantity of a commodity supplied and its price. When the price increases the
quantity supplied increases. Conversely, when price falls the quantity supplied decreases.
Negative correlation: Two variables are said to be negatively correlated if they tend to change
in the opposite direction: when X increases Y decreases, and vice versa. For example, saving
and household size are negatively correlated. When price increases, demand for the
commodity decreases and when price falls demand increases.

The Population Correlation Coefficient ‘ρ’ and its Sample Estimate ‘r’

In the light of the above discussions it appears clear that we can determine the kind of
correlation between two variables by direct observation of the scatter diagram. In addition, the
scatter diagram indicates the strength of the relationship between the two variables. This
section is about how to determine the type and degree of correlation using a numerical result.
For a precise quantitative measurement of the degree of correlation between Y and X we use a
parameter called the correlation coefficient, usually designated by the Greek letter ρ (rho).
Having as subscripts the variables whose correlation it measures, ρ refers to the correlation of
all the values of the population of X and Y. Its estimate from any particular sample (the
sample statistic for correlation) is denoted by r with the relevant subscripts. For example, if we
measure the correlation between X and Y, the population correlation coefficient is represented
by ρ_XY and its sample estimate by r_XY. The simple correlation coefficient is used to measure
relationships which are simple and linear only. It cannot help us in measuring non-linear or
multiple correlations. The sample correlation coefficient is defined by the formula

$$r_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \cdot \sum (Y_i - \bar{Y})^2}} \qquad (2.1)$$

or, in deviation form,

$$r_{XY} = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \cdot \sum y_i^2}} \qquad (2.2)$$

where $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ denote deviations from the respective means.
We will use a simple example from the theory of supply. Economic theory suggests that the
quantity of a commodity supplied in the market depends on its price, ceteris paribus. When
price increases the quantity supplied increases, and vice versa. When the market price falls
producers offer smaller quantities of their commodity for sale. In other words, economic
theory postulates that price (X) and quantity supplied (Y) are positively correlated.
Example 2.1: The following table shows the quantity supplied for a commodity with the
corresponding price values. Determine the type of correlation that exists between these two
variables.
Table 1: Data for computation of correlation coefficient
Time period(in days) Quantity supplied Yi (in tons) Price Xi (in shillings)
1 10 2
2 20 4
3 50 6
4 40 8
5 50 10
6 60 12
7 80 14
8 90 16
9 90 18
10 120 20

To estimate the correlation coefficient, we compute the following results.


Table 2: Computations of inputs for correlation coefficients

Y     X     x     y     x²     y²     xy     XY     X²      Y²
10    2    -9   -51    81   2601   459     20      4     100
20    4    -7   -41    49   1681   287     80     16     400
50    6    -5   -11    25    121    55    300     36    2500
40    8    -3   -21     9    441    63    320     64    1600
50   10    -1   -11     1    121    11    500    100    2500
60   12     1    -1     1      1    -1    720    144    3600
80   14     3    19     9    361    57   1120    196    6400
90   16     5    29    25    841   145   1440    256    8100
90   18     7    29    49    841   203   1620    324    8100
120  20     9    59    81   3481   531   2400    400   14400
Sum = 610  110    0     0   330  10490  1810   8520   1540   47700
Mean = 61   11

where x = X − X̄ and y = Y − Ȳ.

Or, using the deviation form (Equation 2.2), the correlation coefficient can be computed as:

$$r = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \cdot \sum y_i^2}} = \frac{1810}{\sqrt{330 \times 10490}} \approx 0.97$$

This result shows that there is a strong positive correlation between the quantity supplied and
the price of the commodity under consideration.
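The hand computation can be checked numerically; a small sketch, assuming NumPy, that recomputes Equation 2.2 for the data of Table 1:

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20], dtype=float)       # price
y = np.array([10, 20, 50, 40, 50, 60, 80, 90, 90, 120], dtype=float)  # quantity

xd, yd = x - x.mean(), y - y.mean()     # deviations from the means
r = (xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum())
print(round(r, 2))                      # 0.97, matching the result above
```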
The simple correlation coefficient always ranges between -1 and +1: its minimum value is -1
and its maximum value is +1. If r = -1, there is perfect negative correlation between the
variables. If 0 < r < +1, there is positive correlation between the two variables, and movement
from zero toward positive one increases the degree of positive correlation. If r = +1, there is
perfect positive correlation between the two variables. If the correlation coefficient is zero, it
indicates that there is no linear relationship between the two variables. If the two variables are
independent, the value of the correlation coefficient is zero.
Properties of Simple Correlation Coefficient

The simple correlation coefficient has the following important properties:
1. The value of correlation coefficient always ranges between -1 and +1.
2. The correlation coefficient is symmetric: $r_{XY} = r_{YX}$, where $r_{XY}$ is the correlation
coefficient of X on Y and $r_{YX}$ is the correlation coefficient of Y on X.
3. The correlation coefficient is independent of change of origin and change of scale. By
change of origin we mean subtracting or adding a constant from or to every value of a
variable; by change of scale we mean multiplying or dividing every value of a variable by a
constant.
4. If X and Y variables are independent, the correlation coefficient is zero.
5. The correlation coefficient has the same sign as the regression coefficients.
6. The correlation coefficient is the geometric mean of two regression coefficients.
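Property 3 is easy to verify numerically; a small sketch, assuming NumPy and reusing the Table 1 data:

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20], dtype=float)
y = np.array([10, 20, 50, 40, 50, 60, 80, 90, 90, 120], dtype=float)

r = np.corrcoef(x, y)[0, 1]                    # original coefficient
r_new = np.corrcoef(3 * x + 7, y / 10)[0, 1]   # change of origin and scale

# Shifting by a constant and multiplying by a positive constant
# leaves the correlation coefficient unchanged.
print(np.isclose(r, r_new))                    # True
```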

Though the correlation coefficient is very popular in applied statistics and econometrics, it has
its own limitations. The major limitations of the method are:
1. The correlation coefficient always assumes a linear relationship, regardless of whether that
assumption is true or not.
2. Great care must be exercised in interpreting the value of this coefficient, as it is very often
misinterpreted. For example, a high correlation between lung cancer and smoking does not by
itself show that smoking causes lung cancer.
3. The value of the coefficient is unduly affected by extreme values.
4. The coefficient requires the quantitative measurement of both variables. If one of the two
variables is not quantitatively measured, the coefficient cannot be computed.

The Rank Correlation Coefficient


The formulae of the linear correlation coefficient developed in the previous section are based
on the assumption that the variables involved are quantitative and that we have accurate data

for their measurement. However, in many cases the variables may be qualitative (or binary
variables) and hence cannot be measured numerically. For example, profession, education,
preferences for particular brands, are such categorical variables. Furthermore, in many cases
precise values of the variables may not be available, so that it is impossible to calculate the
value of the correlation coefficient with the formulae developed in the preceding section. For
such cases it is possible to use another statistic, the rank correlation coefficient (or Spearman’s
correlation coefficient). We rank the observations in a specific sequence, for example in order
of size, importance, etc., using the numbers 1, 2, 3, …, n. In other words, we assign ranks to
the data and measure the relationship between their ranks instead of their actual numerical
values. Hence the name of the statistic: rank correlation coefficient. If two variables X and
Y are ranked in such a way that the values are ranked in ascending or descending order, the
rank correlation coefficient may be computed by the formula

$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} \qquad (2.3)$$

where
D = difference between ranks of corresponding pairs of X and Y, and
n = number of observations.
The values that $r_s$ may assume range from +1 to -1.
Two points are of interest when applying the rank correlation coefficient. First, it does not
matter whether we rank the observations in ascending or descending order. However, we must
use the same rule of ranking for both variables. Second, if two (or more) observations have the
same value, we assign them the mean rank. Let us use an example to illustrate the application
of the rank correlation coefficient.
Example 2.2: A market researcher asks experts to express their preference for twelve different
brands of soap. Their replies are shown in the following table.
Table 3: Example for rank correlation coefficient
Brands of soap A B C D E F G H I J K L
Person I 9 10 4 1 8 11 3 2 5 7 12 6
Person II 7 8 3 1 10 12 2 6 5 4 11 9

The figures in this table are ranks but not quantities. We have to use the rank correlation
coefficient to determine the type of association between the preferences of the two persons.
This can be done as follows.

Table 4: Computation for rank correlation coefficient


Brands of soap A B C D E F G H I J K L Total
Person I 9 10 4 1 8 11 3 2 5 7 12 6
Person II 7 8 3 1 10 12 2 6 5 4 11 9
Di 2 2 1 0 -2 -1 1 -4 0 3 1 -3
Di2 4 4 1 0 4 1 1 16 0 9 1 9 50

The rank correlation coefficient (using Equation 2.3) is

$$r_s = 1 - \frac{6 \times 50}{12(12^2 - 1)} = 1 - \frac{300}{1716} \approx 0.825$$

This figure, 0.825, shows a marked similarity of preferences of the two persons for the
various brands of soap.
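A sketch of Equation 2.3 applied to the ranks of Table 3 (plain NumPy; the value agrees with the hand computation):

```python
import numpy as np

person1 = np.array([9, 10, 4, 1, 8, 11, 3, 2, 5, 7, 12, 6])
person2 = np.array([7, 8, 3, 1, 10, 12, 2, 6, 5, 4, 11, 9])

d = person1 - person2                        # rank differences D_i
n = len(person1)
r_s = 1 - 6 * (d**2).sum() / (n * (n**2 - 1))
print(round(r_s, 3))                         # 0.825
```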

Partial Correlation Coefficients


A partial correlation coefficient measures the relationship between any two variables when
all other variables connected with those two are kept constant. For example, let us assume that
we want to measure the correlation between the number of hot drinks (X1) consumed in a
summer resort and the number of tourists (X2) coming to that resort. It is obvious that both
these variables are strongly influenced by weather conditions, which we may designate by X3.
On a priori grounds we expect X1 and X2 to be positively correlated: when a large number of
tourists arrive in the summer resort, one should expect a high consumption of hot drinks and
vice versa. The computation of the simple correlation coefficient between X1 and X2 may not
reveal the true relationship connecting these two variables, however, because of the influence
of the third variable, weather conditions (X3). In other words, the above positive relationship
between number of tourists and number of hot drinks consumed is expected to hold if weather
conditions can be assumed constant. If weather conditions change, the relationship between
X1 and X2 may change to such an extent as to appear even negative. Thus, if the weather is
hot, the number of tourists will be large, but because of the heat they will prefer to consume
more cold drinks and ice cream rather than hot drinks. If we overlook the weather and look
only at X1 and X2, we will observe a negative correlation between these two variables, which
is explained by the fact that consumption of hot drinks as well as the number of visitors are
affected by the heat. In order to measure the true correlation between X1 and X2, we must find
some way of accounting for changes in X3. This is achieved with the partial correlation
coefficient between X1 and X2 when X3 is kept constant. The partial correlation coefficient is
determined in terms of the simple correlation coefficients among the various variables
involved in a multiple relationship. In our example there are three simple correlation
coefficients:
r12 = correlation coefficient between X1 and X2
r13 = correlation coefficient between X1 and X3
r23 = correlation coefficient between X2 and X3
The partial correlation coefficient between X1 and X2, keeping the effect of X3 constant, is
given by:

$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \qquad (2.4)$$

Similarly, the partial correlation between X1 and X3, keeping the effect of X2 constant, is given
by:

$$r_{13.2} = \frac{r_{13} - r_{12}\, r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}}$$

and

$$r_{23.1} = \frac{r_{23} - r_{12}\, r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}}$$

Example 2.3: The following table gives data on the yield of corn per acre (Y), the amount of
fertilizer used (X1) and the amount of insecticide used (X2). Compute the partial correlation
coefficient between the yield of corn and the fertilizer used, keeping the effect of insecticide
constant.
Table 5: Data on yield of corn, fertilizer and insecticides used
Year 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
Y 40 44 46 48 52 58 60 68 74 80
X1 6 10 12 14 16 18 22 24 26 32
X2 4 4 5 7 9 12 14 20 21 24

The computations are done as follows:

Table 6: Computation for partial correlation coefficients


Year   Y   X1   X2    y   x1   x2   x1y   x2y   x1x2   x1²   x2²    y²
1971 40 6 4 -17 -12 -8 204 136 96 144 64 289
1972 44 10 4 -13 -8 -8 104 104 64 64 64 169
1973 46 12 5 -11 -6 -7 66 77 42 36 49 121
1974 48 14 7 -9 -4 -5 36 45 20 16 25 81
1975 52 16 9 -5 -2 -3 10 15 6 4 9 25
1976 58 18 12 1 0 0 0 0 0 0 0 1
1977 60 22 14 3 4 2 12 6 8 16 4 9
1978 68 24 20 11 6 8 66 88 48 36 64 121
1979 74 26 21 17 8 9 136 153 72 64 81 289
1980 80 32 24 23 14 12 322 276 168 196 144 529
Sum 570 180 120 0 0 0 956 900 524 576 504 1634
Mean 57 18 12

From these sums, the simple correlation coefficients are

$r_{yx_1} = 0.9854$, $r_{yx_2} = 0.9917$, $r_{x_1x_2} = 0.9725$

Then, applying Equation 2.4,

$$r_{yx_1 \cdot x_2} = \frac{0.9854 - (0.9917)(0.9725)}{\sqrt{(1 - 0.9917^2)(1 - 0.9725^2)}} \approx 0.70$$
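The same computation in Python, as a small sketch assuming NumPy; the simple coefficients come straight from the Table 5 data, and Equation 2.4 is then applied:

```python
import numpy as np

y  = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)
x1 = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)
x2 = np.array([4, 4, 5, 7, 9, 12, 14, 20, 21, 24], dtype=float)

r_y1 = np.corrcoef(y, x1)[0, 1]     # 0.9854
r_y2 = np.corrcoef(y, x2)[0, 1]     # 0.9917
r_12 = np.corrcoef(x1, x2)[0, 1]    # 0.9725

# Partial correlation of Y and X1, holding X2 constant (Equation 2.4)
r_y1_2 = (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))
print(round(r_y1_2, 2))             # about 0.70
```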

Limitations of the Theory of Linear Correlation

Correlation analysis has serious limitations as a technique for the study of economic
relationships.
First, the above formulae for r apply only when the relationship between the variables is
linear. However, two variables may be strongly connected by a nonlinear relationship.
It should be clear that zero correlation and statistical independence of two variables (X and Y)
are not the same thing. Zero correlation implies zero covariance of X and Y, so that r = 0.
Statistical independence of X and Y implies that the probability of $x_i$ and $y_i$ occurring
simultaneously is the simple product of the individual probabilities:

$$P(x \text{ and } y) = P(x) \cdot P(y)$$
Independent variables do have zero covariance and are uncorrelated: the linear correlation
coefficient between two independent variables is equal to zero. However, zero linear
correlation does not necessarily imply independence. In other words uncorrelated variables
may be statistically dependent. For example if X and Y are related so that the observations fall
on a circle or on a symmetrical parabola, the relationship is perfect but not linear. The
variables are statistically dependent.
Second, although the correlation coefficient is a measure of the co-variability of variables, it
does not necessarily imply any functional relationship between the variables concerned.
Correlation theory does not establish or prove any causal relationship between the variables.
It seeks to discover whether a co-variation exists,
but it does not suggest that variations in, say, Y are caused by variations in X, or vice versa.
Knowledge of the value of r, alone, will not enable us to predict the value of Y from X. A
high correlation between variables Y and X may describe any one of the following situations:
(1) variation in X is the cause of variation in Y,
(2) variation in Y is the cause of variation in X,
(3) Y and X are jointly dependent, or there is a two-way causation; that is to say, Y is the
cause of (is determined by) X, but also X is the cause of (is determined by) Y. For
example, in any market q = f(p), but also p = f(q); therefore there is a two-way
causation between q and p, or in other words p and q are simultaneously determined.
(4) there is another common factor (Z), that affects X and Y in such a way as to show a
close relation between them. This often occurs in time series when two variables have

strong time trends (i.e. grow over time). In this case we find a high correlation
between Y and X, even though they happen to be causally independent,
(5) The correlation between X and Y may be due to chance.

3. The Simple Linear Regression Model


Economic theories are mainly concerned with the relationships among various economic
variables. These relationships, when phrased in mathematical terms, can predict the effect of
one variable on another. The functional relationships of these variables define the
dependence of one variable upon the other variable (s) in the specific form. In this regard
regression model is the most commonly used and appropriate technique of econometric
analysis. Regression analysis refers to estimating functions showing the relationship between
two or more variables and corresponding tests. This section introduces students with the
concept of simple linear regression analysis. It includes estimating a simple linear function
between two variables. We will restrict our discussion in this part only to two variables and
deal with more variables in the next section.

Ordinary Least Squares Method (OLS) and Classical Assumptions


There are two major ways of estimating regression functions: the ordinary least squares (OLS)
method and the maximum likelihood (MLH) method. Both methods are broadly similar in
application to the estimation techniques you may have seen in statistics courses. The ordinary
least squares method is the easiest and the most commonly used method, as opposed to the
maximum likelihood (MLH) method, which is limited by its assumptions. For instance, the
MLH method is valid only for large samples, whereas the OLS method can also be applied to
smaller samples. Owing to this merit, our discussion mainly focuses on ordinary least squares
(OLS).
The ordinary least squares (OLS) method of estimating the parameters of a regression function
is about finding values of the parameters ($\beta_0$ and $\beta_1$) of the simple linear
regression function given below for which the errors or residuals are minimized. Thus, it is
about minimizing the residuals or the errors.

$$Y_i = \beta_0 + \beta_1 X_i + U_i$$

The above identity represents the population regression function (to be estimated from a total
enumeration of data from the entire population). But most of the time it is difficult to generate
population data, owing to several reasons; instead we use sample data and estimate the sample
regression function. Thus, we use the following sample regression function for the derivation
of the parameters and the related analysis:

$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i$$
Before discussing the details of the OLS estimation techniques, let’s see the major conditions
that are necessary for the validity of the analysis, interpretations and conclusions of the
regression function. These conditions are known as classical assumptions. In fact most of
these conditions can be checked and secured very easily.
i) Classical Assumptions
For the validity of a regression function and its attributes the data we use or the terms related
to our regression function should fulfill the following conditions known as classical
assumptions.
1. The error terms ‘Ui’ are randomly distributed, or the disturbance terms are not correlated.
This means that there is no systematic variation or relation among the values of the error
terms (Ui and Uj), where i = 1, 2, 3, …, j = 1, 2, 3, … and i ≠ j. This is represented by zero
covariance among the error terms, summarized as follows:

Cov(Ui, Uj) = 0 for i ≠ j.

Note that the same argument holds for the residual terms when we use sample data or the
sample regression function; thus Cov(ei, ej) = 0 for i ≠ j. Otherwise, the error terms do not
serve an adjustment purpose; rather, this causes an autocorrelation problem.
2. The disturbance terms ‘Ui’ have zero mean. This implies that the sum of the individual
disturbance terms is zero. The deviations of the values of some of the disturbance terms
are negative, some are zero and some are positive, and the sum or the average is zero.
This is given by the following identities:

$$E(U_i) = \frac{\sum U_i}{n} = 0$$

Multiplying both sides by the sample size n, we obtain $\sum U_i = 0$. The same argument
is true for the sample regression function and so for the residual terms, given as

$$\sum e_i = 0$$

If this condition is not met, then the position of the regression function (or curve) will
not be where it is supposed to be. This results in an upward (if the mean of the error term
or residual term is positive) or downward (if the mean of the error term or residual term
is negative) shift in the regression function. For instance, suppose we have the following
regression function:

$$E(Y_i) = \beta_0 + \beta_1 X_i \quad \text{if} \quad E(U_i) = 0$$

Otherwise the estimated models will be biased, causing the regression function to shift.
For instance, if $E(U_i) > 0$ (positive), it is going to shift the estimation upward from the
true representative model. A similar argument is true for the residual term of the sample
regression function. This is demonstrated by the following figure.

Figure 2: Regression Function/curve if the mean of error term is not zero

3. The disturbance terms have constant variance in each period. This is given as follows:

$$Var(U_i) = E(U_i^2) = \sigma^2 \quad \text{(a constant) for all } i$$

This assumption is known as the assumption of homoscedasticity. If this condition is not
fulfilled, or if the variance of the error terms varies as the sample size changes or as the
value of the explanatory variables changes, then this leads to a heteroscedasticity problem.
4. Explanatory variables ‘Xi’ and disturbance terms ‘Ui’ are uncorrelated or independent,
and all the covariances of the successive values of the error term are equal to zero. The
first condition is given by Cov(Xi, Ui) = 0, from which it follows that E(Xi Ui) = 0. The
second condition means that the value the error term assumes in one period does not
depend on the value it assumed in any other period; this is known as the assumption of
non-autocorrelation or non-serial correlation. If these conditions are not met by our data
or variables, our regression function and the conclusions drawn from it will be invalid.
5. The explanatory variable Xi is fixed in repeated samples. Each value of Xi does not vary,
for instance owing to a change in sample size. This means the explanatory variables are
non-random and hence distribution-free variables.
6. Linearity of the model in parameters. Simple linear regression requires linearity in
parameters, but not necessarily linearity in variables. The same technique can be applied
to estimate regression functions of forms such as Y = f(X), Y = f(X²), Y = f(X³), and so
on. What is important is transforming the data as required.
7. Normality assumption. The disturbance term Ui is assumed to have a normal distribution
with zero mean and a constant variance. This assumption is given as follows:

$$U_i \sim N(0, \sigma^2)$$

This assumption is a combination of the zero-mean-of-error-term assumption and the
homoscedasticity assumption. It is used in testing hypotheses about the significance of
parameters. It is also useful in both estimating parameters and testing their significance
in the maximum likelihood method.
8. Explanatory variables should not be perfectly linearly correlated or highly correlated.
Using explanatory variables which are highly or perfectly correlated in a regression
function causes a biased function or model. It also results in a multicollinearity problem.
9. The relationship between variables (or the model) is correctly specified. For instance, all
the necessary variables are included in the model, and the variables are in the form that
best describes the functional relationship. For instance, “Y = f(X²)” may reflect the
relationship between Y and X better than “Y = f(X)”.
10. The explanatory variable does not take identical values across observations (there is
variation in X). This assumption is very important for improving the precision of
estimators.
Note that some of these assumptions or conditions (those which apply to more than one
explanatory variable) are meant for the next chapters (along with all the other assumptions
or conditions), so we may not restate them in the next chapter even where they are
required there also.
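As a side illustration of assumption 3, the following sketch (synthetic data, NumPy only; all numbers are assumptions) contrasts a homoscedastic disturbance with a heteroscedastic one:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 100, 200)

u_homo = rng.normal(0.0, 5.0, x.size)   # constant variance: assumption 3 holds
u_hetero = rng.normal(0.0, 0.5 * x)     # variance grows with X: violated

print(u_homo[:100].std(), u_homo[100:].std())      # roughly equal spreads
print(u_hetero[:100].std(), u_hetero[100:].std())  # spread rises with X
```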

ii) OLS Method of Estimation
Estimating a linear regression function using the ordinary least squares (OLS) method is
simply about calculating the parameters of the regression function for which the sum of
squares of the error terms is minimized. The procedure is given as follows. Suppose we want
to estimate the following equation:

$$Y_i = \beta_0 + \beta_1 X_i + U_i$$

Since most of the time we use a sample (as it is difficult to get population data), the
corresponding sample regression function is given as follows:

$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i$$

From this identity, we solve for the residual term, square both sides, and then take the sum of
both sides. These steps are given respectively as follows:

$$e_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \qquad (2.7)$$

$$RSS = \sum e_i^2 = \sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 \qquad (2.8)$$

where RSS is the residual sum of squares.


The method of OLS involves finding the estimates of the intercept and the slope for which the
sum of squares given by Equation 2.8 is minimized. To minimize the residual sum of squares
we take the first-order partial derivatives of Equation 2.8 and equate them to zero.

That is, the partial derivative with respect to $\hat{\beta}_0$:

$$\frac{\partial \sum e_i^2}{\partial \hat{\beta}_0} = -2\sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) \qquad (2.9)$$

Setting this derivative to zero,

$$-2\sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \qquad (2.10)$$

$$\Rightarrow \sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \qquad (2.11)$$

$$\Rightarrow \sum Y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum X_i \qquad (2.12)$$

where n is the sample size.

The partial derivative with respect to $\hat{\beta}_1$:

$$\frac{\partial \sum e_i^2}{\partial \hat{\beta}_1} = -2\sum X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \qquad (2.13)$$

$$\Rightarrow \sum X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \qquad (2.14)$$

$$\Rightarrow \sum X_i Y_i = \hat{\beta}_0 \sum X_i + \hat{\beta}_1 \sum X_i^2 \qquad (2.15)$$

Note that Equation 2.8 is a composite function, and we apply the chain rule in finding the
partial derivatives with respect to the parameter estimates.
Equations 2.12 and 2.15 are together called the system of normal equations. Solving the
system of normal equations simultaneously we obtain:

$$\hat{\beta}_1 = \frac{\sum X_i Y_i - n\bar{Y}\bar{X}}{\sum X_i^2 - n\bar{X}^2}$$

and, from above,

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Now we have the formulas to estimate the simple linear regression function. Let us illustrate
with an example.
Example 2.4: Given the following sample data of three pairs of ‘Y’ (dependent variable) and
‘X’ (independent variable), find a simple linear regression function; Y = f(X).

Yi Xi
10 30
20 50
30 60

a) find a simple linear regression function; Y = f(X)


b) Interpret your result.
c) Predict the value of Y when X is 45.
Solution
a. To fit the regression equation we do the following computations.

Yi    Xi    XiYi    Xi²
10    30     300     900
20    50    1000    2500
30    60    1800    3600
Sum   60   140    3100    7000

with means $\bar{Y} = 20$ and $\bar{X} = 46.67$. Then

$$\hat{\beta}_1 = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2} = \frac{3100 - 3(46.67)(20)}{7000 - 3(46.67)^2} = \frac{300}{466.67} \approx 0.64$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 20 - 0.64(46.67) = -10$$

Thus the fitted regression function is $\hat{Y}_i = -10 + 0.64X_i$.

b) Interpretation: the value of the intercept term, -10, implies that the value of the dependent variable Y is -10 when the value of the explanatory variable is zero. The value of the slope coefficient (β̂1 = 0.64) is a measure of the marginal change in the dependent variable Y when the value of the explanatory variable increases by one. In this model, the value of Y increases on average by 0.64 units when X increases by one.

c) Ŷ = -10 + (0.64)(45) = 18.8

That means when X assumes a value of 45, the value of Y on average is expected to be 18.8. The regression coefficients can also be obtained by simple formulae by taking the deviations between the original values and their means. Now, if

xi = Xi - X̄ and yi = Yi - Ȳ

then the coefficients can be computed with the alternative (shortcut) formulae given below:

β̂1 = Σxiyi / Σxi² and β̂0 = Ȳ - β̂1X̄

Example 2.5: Find the regression equation for the data under Example 2.4, using the
shortcut formula. To solve this problem we proceed as follows.

Yi    Xi    yi     xi       xiyi     xi²      yi²
10    30    -10    -16.67   166.67   277.78   100
20    50    0      3.33     0.00     11.11    0
30    60    10     13.33    133.33   177.78   100
Sum: 60   140   0   0   300.00   466.67   200
Mean: Ȳ = 20, X̄ ≈ 46.67

Then

β̂1 = Σxiyi / Σxi² = 300/466.67 ≈ 0.64, and β̂0 = Ȳ - β̂1X̄ = 20 - (0.64)(46.67) = -10,

with results similar to the previous case.
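These hand computations are easy to verify with a few lines of code. The following is a minimal sketch in Python (NumPy assumed available; the variable names are illustrative, not part of the original example):

    import numpy as np

    # Sample data from Examples 2.4 and 2.5
    X = np.array([30.0, 50.0, 60.0])
    Y = np.array([10.0, 20.0, 30.0])

    x = X - X.mean()                        # deviations of X from its mean
    y = Y - Y.mean()                        # deviations of Y from its mean

    beta1 = (x * y).sum() / (x ** 2).sum()  # shortcut formula: sum(xy)/sum(x^2)
    beta0 = Y.mean() - beta1 * X.mean()     # intercept: Ybar - beta1*Xbar

    print(beta0, beta1)                     # approximately -10.0 and 0.64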

Mean and Variance of Parameter Estimates

We have seen how the numerical values of the parameter estimates can be obtained using the OLS estimating technique. Now let us see their distributional nature, i.e. the mean and variance of the parameter estimates. Several samples of the same size can be drawn from the same population. For each sample, the parameter estimates take their own specific numerical values; that is, the values of the estimates differ when we go from one sample to another. The parameter estimates are therefore random in nature and have a distribution around the corresponding true population parameters. Remember, we discussed in the previous sections that both the error term and the dependent variable are assumed to be normally distributed. Thus, the parameter estimates also have a normal distribution with their associated mean and variance. Formulae for the mean and

variance of the respective parameter estimates and the error term are given below (the derivation is given in Annex A).

1. The mean of
2. The variance of

3. The mean of

4. The variance of
5. The estimated value of the variance of the error term
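As a quick illustration, these quantities can be computed for the data of Example 2.4. The following is a minimal sketch in Python (NumPy assumed; purely illustrative):

    import numpy as np

    X = np.array([30.0, 50.0, 60.0])
    Y = np.array([10.0, 20.0, 30.0])
    n = len(Y)

    x = X - X.mean()
    beta1 = (x * (Y - Y.mean())).sum() / (x ** 2).sum()
    beta0 = Y.mean() - beta1 * X.mean()

    e = Y - (beta0 + beta1 * X)                     # residuals
    sigma2_hat = (e ** 2).sum() / (n - 2)           # formula 5 above

    var_beta1 = sigma2_hat / (x ** 2).sum()                          # formula 4
    var_beta0 = sigma2_hat * (X ** 2).sum() / (n * (x ** 2).sum())   # formula 2
    print(sigma2_hat, var_beta0, var_beta1)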

Hypothesis Testing of OLS Estimates


After estimation of the parameters, there are important issues to be considered by the researcher. We have to know to what extent our estimates are reliable and acceptable for further use. That means we have to evaluate how well the estimates represent the true population parameters. Simply put, a model must be tested for its significance before it can be used for any other purpose. In this subsection we evaluate the reliability of the model estimated using the procedure explained above.

Research hypotheses attempt to explain, predict and explore the relationship between two or more variables. To this end, hypotheses can be thought of as the researcher's educated guess about how the study will turn out. Hypothesis testing is designed to detect significant differences: differences that did not occur by random chance. In the "one sample" case, we compare a random sample (from a large group) to a population; that is, we compare a sample statistic to a population parameter to see if there is a significant difference.

• There are two important points that should be kept in mind.

1. Hypotheses must be falsifiable. That is, hypotheses must be capable of being refuted based on the results of the study. If a researcher's hypothesis cannot be refuted, then the researcher is not conducting a scientific investigation.

2. A hypothesis must be a prediction (usually, about the relationship between two or more
variables).

Types of Hypotheses

There are two kinds of research hypotheses.

1. The null hypothesis

The null hypothesis always predicts that there will be no difference between the groups on the
variable of interest being studied, or the independent variable has no effect on the dependent
variable.

2. The alternative (or experimental) hypothesis

The alternative hypothesis predicts that there will be a difference between the groups, or that the independent variable determines the dependent variable.

The available test criteria are divided into three groups: theoretical (a priori) criteria, statistical criteria and econometric criteria. A priori criteria, set by economic theory, concern whether the coefficients of the econometric model are consistent with economic theory. Statistical criteria, also known as first-order tests, are set by statistical theory and are used to evaluate the statistical reliability of the model. Econometric criteria refer to whether the assumptions of the econometric model employed in estimating the parameters are fulfilled or not.

The square of the correlation coefficient (r²)

This is used for judging the explanatory power of the linear regression of Y on X or on the X's. The square of the correlation coefficient in simple regression is known as the coefficient of determination and is denoted by R². The coefficient of determination measures the goodness of fit of the regression line to the observed sample values of Y and X.
i) The Coefficient of determination (R2)
The coefficient of determination is the measure of the amount or proportion of the total variation of the dependent variable that is determined or explained by the model, i.e. by the presence of the explanatory variable(s) in the model. The total variation of the dependent variable is split into two additive components: a part explained by the model and a part attributed to the random term. The total variation of the dependent variable is measured from its arithmetic mean.

The total variation of the dependent variable is decomposed as TSS = ESS + RSS; that is, the total sum of squares of the dependent variable is split into the explained sum of squares and the residual sum of squares.

ei  y i  y i

y i  y i  ei
 2 
y i2  y i  ei2  2 y i ei
 2 
 y i2  y i   ei2  2  y i ei

But  y i ei 0
 2
Therefore ,  y i2  y i   ei2

The coefficient of determination is given by the formula

R² = ESS/TSS = Σŷi² / Σyi² .................................... 2.16

The coefficient of determination can also be given as

R² = 1 - RSS/TSS = 1 - Σei² / Σyi² ............................ 2.17

The higher the coefficient of determination, the better the fit; conversely, the smaller the coefficient of determination, the poorer the fit. That is why the coefficient of determination is used to compare two or more models. One minus the coefficient of determination is called the coefficient of non-determination, and it gives the proportion of the variation in the dependent variable that remains undetermined or unexplained by the model.
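The decomposition above maps directly into code. Below is a minimal sketch in Python (NumPy assumed; it reuses the data of Example 2.4 purely for illustration):

    import numpy as np

    X = np.array([30.0, 50.0, 60.0])
    Y = np.array([10.0, 20.0, 30.0])
    beta0, beta1 = -10.0, 9.0 / 14.0        # exact estimates from Example 2.4

    e = Y - (beta0 + beta1 * X)             # residuals
    TSS = ((Y - Y.mean()) ** 2).sum()       # total sum of squares
    RSS = (e ** 2).sum()                    # residual sum of squares

    R2 = 1 - RSS / TSS                      # Equation 2.17
    print(R2)                               # share of variation explained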
ii) Testing the significance of a given regression coefficient
Since the sample values of the intercept and the coefficient are estimates of the true
population parameters, we have to test them for their statistical reliability.
The significance of a model can be seen in terms of the amount of variation in the dependent
variable that it explains and the significance of the regression coefficients.
There are different tests that are available to test the statistical reliability of the parameter
estimates. The following are the common ones;
A) The standard error test
B) The standard normal test
C) The students t-test
Now, let us discuss them one by one.
A) The Standard Error Test
The standard error test of the parameter estimates is applied for judging the statistical reliability of the estimates. This test measures the degree of confidence that we may attribute to the estimates.

This test first establishes the two hypotheses that are going to be tested which are commonly
known as the null and alternative hypotheses. The null hypothesis addresses that the sample is
coming from the population whose parameter is not significantly different from zero while the
alternative hypothesis addresses that the sample is coming from the population whose
parameter is significantly different from zero. The two hypotheses are given as follows:
H0: βi=0
H1: βi≠0
The standard error test is outlined as follows:
1. Compute the standard deviations of the parameter estimates using the above formula for
variances of parameter estimates. This is because standard deviation is the positive square
root of the variance.

2. Compare the standard errors of the estimates with the numerical values of the estimates and
make decision.
A) If the standard error of the estimate is less than half of the numerical value of the estimate, we can conclude that the estimate is statistically significant. That is, if se(β̂i) < |β̂i|/2, reject the null hypothesis and conclude that the estimate is statistically significant.
B) If the standard error of the estimate is greater than half of the numerical value of the estimate, the parameter estimate is not statistically reliable. That is, if se(β̂i) > |β̂i|/2, accept the null hypothesis and conclude that the estimate is not statistically significant.
B) The Standard Normal Test
This test is based on the normal distribution. The test is applicable if:
 The standard deviation of the population is known irrespective of the sample size
 The standard deviation of the population is unknown provided that the sample size is
sufficiently large (n>30).
The standard normal test or Z-test is outlined as follows:

1. Set up the null hypothesis and the alternative hypothesis: H0: βi = β* against H1: βi ≠ β*, where β* is the hypothesized value (often zero).

2. Determine the level of significance (α) at which the test is carried out. It is the probability of committing a Type I error, i.e. the probability of rejecting the null hypothesis while it is true. It is common in applied econometrics to use a 5% level of significance.

3. Determine the theoretical or tabulated value of Z from the table. That is, find the value of Zα/2 from the standard normal table; for α = 0.05, Z0.025 = 1.96.

4. Make a decision. The decision of a statistical hypothesis test consists of one of two choices: either accepting the null hypothesis or rejecting it.

If |Zcal| ≤ Zα/2, accept the null hypothesis, while if |Zcal| > Zα/2, reject the null hypothesis. It is true that most of the time the null and alternative hypotheses are mutually exclusive: accepting the null hypothesis means rejecting the alternative hypothesis, and rejecting the null hypothesis means accepting the alternative hypothesis.
Example: Suppose a regression gives the estimate β̂ = 29.48 and the standard error of β̂ is 36. Test the hypothesis that β = 25 at the 5% level of significance using the standard normal test.
Solution: We have to follow the procedures of the test.

First we set up the hypotheses to be tested: H0: β = 25 against H1: β ≠ 25. The next step is to determine the level of significance at which the test is carried out; in this example it is given as 5%.

The third step is to find the theoretical value of Z at the specified level of significance. From the standard normal table we get Z0.025 = 1.96.

The fourth step in hypothesis testing is computing the observed or calculated value of the standard normal statistic using the following formula:

Zcal = (β̂ - β) / se(β̂) = (29.48 - 25) / 36 ≈ 0.12

Since the calculated value of the test statistic (0.12) is less than the tabulated value (1.96), the decision is to accept the null hypothesis and conclude that the value of the parameter is not significantly different from 25.
C) The Student t-Test
In conditions where Z-test is not applied (in small samples), t-test can be used to test the
statistical reliability of the parameter estimates. The test depends on the degrees of freedom

that the sample has. The test procedures of the t-test are similar to those of the Z-test. The procedures are outlined as follows:
1. Set up the hypotheses. The hypotheses for testing a given regression coefficient are:

H0: βi = 0 against H1: βi ≠ 0

2. Determine the level of significance for carrying out the test. We usually use a 5% level of significance in applied econometric research.

3. Determine the tabulated value of t from the t-table with n - k degrees of freedom, where k is the number of parameters estimated.

4. Determine the calculated value of t. The test statistic (using the t-test) is given by:

tcal = β̂i / se(β̂i)

The test rule or decision is given as follows:

Reject H0 if |tcal| > t(α/2, n-k)
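This decision rule is straightforward to automate. Below is a minimal sketch in Python (SciPy assumed available for the critical value; the numbers are illustrative):

    from scipy import stats

    beta_hat = 3.25      # an illustrative slope estimate
    se_beta = 0.898      # its illustrative standard error
    n, k = 12, 2         # sample size and number of estimated parameters

    t_cal = beta_hat / se_beta                     # statistic for H0: beta = 0
    t_tab = stats.t.ppf(1 - 0.05 / 2, df=n - k)    # two-tailed 5% critical value

    print(t_cal, t_tab, abs(t_cal) > t_tab)        # True means reject H0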

iii) Confidence Interval Estimation of the regression Coefficients

We have discussed the important tests that can be conducted to check the validity of the model and its parameters. One thing that must be clear is that rejecting the null hypothesis does not mean that the parameter estimates are correct estimates of the true population parameters. It means that the estimate comes from a sample drawn from a population whose parameter is significantly different from zero. In order to define the range within which the true parameter lies, we must construct a confidence interval for the parameter. Just as we constructed confidence interval estimates for a given population mean using the sample mean (in Introduction to Statistics), we can construct 100(1 - α)% confidence intervals for the sample regression coefficients. To do so we need the standard errors of the sample regression coefficients. The standard error of a given coefficient is the positive square root of the variance of the coefficient.
Thus, we have discussed that the formulae for the variances of the regression coefficients are given as:

Variance of the intercept: Var(β̂0) = σ̂² ΣXi² / (n Σxi²) ........... 2.18

Variance of the slope: Var(β̂1) = σ̂² / Σxi² ........................ 2.19

Where, σ̂² = Σei² / (n - k) ......................................... 2.20

is the estimate of the variance of the random term, and k is the number of parameters to be estimated in the model. The standard errors are the positive square roots of the variances, and the 100(1 - α)% confidence interval for the slope is given by:

β̂1 ± t(α/2, n-k) · se(β̂1) ......................................... 2.21

And for the intercept:

β̂0 ± t(α/2, n-k) · se(β̂0) ......................................... 2.22

Example 2.6: The following table gives the quantity supplied (Y in tons) and its price (X
pound per ton) for a commodity over a period of twelve years.
Table 7: Data on supply and price for given commodity
Y 69 76 52 56 57 77 58 55 67 53 72 64
X 9 12 6 10 9 10 7 8 12 6 11 8

Table 8: Data for computation of different parameters

Time   Y    X    XY    X²    Y²     x    y    xy   x²   y²    Ŷ       ei      ei²
1      69   9    621   81    4761   0    6    0    0    36    63.00   6.00    36.00
2      76   12   912   144   5776   3    13   39   9    169   72.75   3.25    10.56
3      52   6    312   36    2704   -3   -11  33   9    121   53.25   -1.25   1.56
4      56   10   560   100   3136   1    -7   -7   1    49    66.25   -10.25  105.06
5      57   9    513   81    3249   0    -6   0    0    36    63.00   -6.00   36.00
6      77   10   770   100   5929   1    14   14   1    196   66.25   10.75   115.56
7      58   7    406   49    3364   -2   -5   10   4    25    56.50   1.50    2.25
8      55   8    440   64    3025   -1   -8   8    1    64    59.75   -4.75   22.56
9      67   12   804   144   4489   3    4    12   9    16    72.75   -5.75   33.06
10     53   6    318   36    2809   -3   -10  30   9    100   53.25   -0.25   0.06
11     72   11   792   121   5184   2    9    18   4    81    69.50   2.50    6.25
12     64   8    512   64    4096   -1   1    -1   1    1     59.75   4.25    18.06
Sum    756  108  6960  1020  48522  0    0    156  48   894   756.00  0.00    387.04

Use Tables (Table 7 and Table 8) to answer the following questions


1. Estimate the Coefficient of determination (R2)

2. Run significance test of regression coefficients using the following test methods
A) The standard error test
B) The students t-test
3. Fit the linear regression equation and determine the 95% confidence interval for the slope.
Solution
1. Estimate the Coefficient of determination (R2)
Refer to Example 2.6 above to determine how much percent of the variations in the quantity
supplied is explained by the price of the commodity and what percent remained unexplained.
Use the data in Table 8 to estimate R² using the formula given below:

R² = Σŷi² / Σyi² = 507/894 ≈ 0.57

This result shows that 57% of the variation in the quantity supplied of the commodity under consideration is explained by the variation in the price of the commodity, and the remaining 43% is left unexplained by the price of the commodity. In other words, there may be other important explanatory variables left out that could contribute to the variation in the quantity supplied of the commodity under consideration.
2. Run significance test of regression coefficients using the following test methods
The fitted regression line for the given data is:

Ŷi = 33.75 + 3.25Xi
      (8.28)  (0.898)

where the numbers in parentheses are the standard errors of the respective coefficients.

A. Standard Error test


In testing the statistical significance of the estimates using the standard error test, the following information is needed for the decision. Since there are two parameter estimates in the model, we have to test them separately.

Testing for β̂1

We have the following information about β̂1: β̂1 = 3.25 and se(β̂1) = 0.898.

The following are the null and alternative hypotheses to be tested:

H0: β1 = 0
H1: β1 ≠ 0

Since the standard error of β̂1 (0.898) is less than half of the value of β̂1 (3.25/2 = 1.625), we reject the null hypothesis and conclude that the parameter estimate β̂1 is statistically significant.

Testing for β̂0

Again, we have the following information about β̂0: β̂0 = 33.75 and se(β̂0) = 8.28.

The hypotheses to be tested are given as follows:

H0: β0 = 0
H1: β0 ≠ 0

Since the standard error of β̂0 (8.28) is less than half of the numerical value of β̂0 (33.75/2 = 16.875), we reject the null hypothesis and conclude that β̂0 is statistically significant.

B. The students t-test


In the illustrative example, we can apply the t-test to see whether the price of the commodity is significant in determining the quantity supplied of the commodity under consideration. Use α = 0.05.

The hypothesis to be tested is:

H0: β1 = 0 against H1: β1 ≠ 0

The estimates are known: β̂1 = 3.25 and se(β̂1) = 0.898. Then we can estimate tcal as follows:

tcal = β̂1 / se(β̂1) = 3.25/0.898 ≈ 3.62

Further, the tabulated value for t is 2.228. When we compare these two values, the calculated t is greater than the tabulated value. Hence, we reject the null hypothesis. Rejecting the null
greater than the tabulated value. Hence, we reject the null hypothesis. Rejecting the null

hypothesis means, concluding that the price of the commodity is significant in determining
the quantity supplied for the commodity.
In this part we have seen how to conduct the statistical reliability test using the t-statistic. Now let us see additional information about this test. When the degrees of freedom are large, we can conduct the t-test without consulting the t-table for the theoretical value of t. This rule is known as the "2t-rule". The rule is stated as follows.

The t-table shows that the values of t change very slowly once the degrees of freedom (n - k) are greater than 8. For example, the value of t0.025 changes from 2.30 (when n - k = 8) to 1.96 (when n - k = ∞). The change from 2.30 to 1.96 is obviously very slow. Consequently, we can ignore the degrees of freedom (when they are greater than 8) and say that the theoretical value of t is approximately 2.0. Thus, a two-tailed test of a null hypothesis at the 5% level of significance can be reduced to the following rules:

1. If |tcal| is greater than 2, we reject the null hypothesis.

2. If |tcal| is less than 2, we accept the null hypothesis.

3. Fit the linear regression equation and determine the 95% confidence interval for the slope.

The fitted regression model, as indicated above, is Ŷi = 33.75 + 3.25Xi, where the numbers in parentheses below the coefficients are their standard errors.

To construct the confidence interval we need the standard error of the slope, which is determined as follows:

se(β̂1) = √(σ̂² / Σxi²) = √(38.70/48) ≈ 0.898

The tabulated value of t for degrees of freedom 12 - 2 = 10 and α/2 = 0.025 is 2.228.

Hence the 95% confidence interval for the slope is given by:

β̂1 ± t0.025 · se(β̂1) = 3.25 ± (2.228)(0.898) = 3.25 ± 2.00, i.e. (1.25, 5.25)

The result tells us that, at the error probability of 0.05, the true value of the slope coefficient lies between 1.25 and 5.25.
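All of the computations in Example 2.6 can be reproduced with a short script. Below is a minimal sketch in Python (NumPy and SciPy assumed available; it is an illustration, not part of the original example):

    import numpy as np
    from scipy import stats

    Y = np.array([69, 76, 52, 56, 57, 77, 58, 55, 67, 53, 72, 64], dtype=float)
    X = np.array([9, 12, 6, 10, 9, 10, 7, 8, 12, 6, 11, 8], dtype=float)

    n, k = len(Y), 2
    x, y = X - X.mean(), Y - Y.mean()

    b1 = (x * y).sum() / (x ** 2).sum()        # slope = 156/48 = 3.25
    b0 = Y.mean() - b1 * X.mean()              # intercept = 33.75

    e = Y - (b0 + b1 * X)                      # residuals
    sigma2 = (e ** 2).sum() / (n - k)          # estimated error variance
    se_b1 = np.sqrt(sigma2 / (x ** 2).sum())   # standard error of the slope

    R2 = 1 - (e ** 2).sum() / (y ** 2).sum()   # about 0.57
    t_cal = b1 / se_b1                         # about 3.62
    t_tab = stats.t.ppf(0.975, df=n - k)       # 2.228

    ci = (b1 - t_tab * se_b1, b1 + t_tab * se_b1)   # about (1.25, 5.25)
    print(b0, b1, R2, t_cal, ci)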

Properties of OLS Estimators
The ideal or optimum properties that the OLS estimates possess may be summarized by the well-known Gauss-Markov Theorem.

Statement of the theorem: "Given the assumptions of the classical linear regression model, the OLS estimators, in the class of linear and unbiased estimators, have the minimum variance; i.e. the OLS estimators are BLUE."

According to this theorem, under the basic assumptions of the classical linear regression model, the least squares estimators are linear, unbiased and have minimum variance (i.e. are best of all linear unbiased estimators). The theorem is sometimes referred to as the BLUE theorem (Best, Linear, Unbiased Estimator). An estimator is called BLUE if it is:
a. Linear: it is a linear function of a random variable, such as the dependent variable Y.
b. Unbiased: its average or expected value is equal to the true population parameter.
c. Minimum variance: it has the minimum variance in the class of linear and unbiased estimators. An unbiased estimator with the least variance is known as an efficient estimator.
According to the Gauss-Markov theorem, the OLS estimators possess all the BLUE properties. The detailed proof of these properties is presented in Annex B.

Extensions of Regression Models


As pointed out earlier, non-linearity may be expected in many economic relationships. In other words, the relationship between Y and X can be non-linear rather than linear. Thus, once
the independent variables have been identified the next step is to choose the functional form
of the relationship between the dependent and the independent variables. Specification of the
functional form is important, because a correct explanatory variable may well appear to be
insignificant or to have an unexpected sign if an inappropriate functional form is used. Thus
the choice of a functional form for an equation is a vital part of the specification of that
equation. The choice of a functional form almost always should be based on an examination
of the underlying economic theory. The logical form of the relationship between the
dependent variable and the independent variable in question should be compared with the
properties of various functional forms, and the one that comes closest to that underlying
theory should be chosen for the equation.

Some Commonly Used Functional Forms
a) The Linear Form: It is based on the assumption that the slope of the relationship between each independent variable and the dependent variable is constant:

Y = β0 + β1X1 + β2X2 + ... + βKXK + u, so that ∂Y/∂Xi = βi, i = 1, 2, ..., K

In this case the elasticity is not constant.

If the hypothesized relationship between Y and X is such that the slope of the relationship can
be expected to be constant and the elasticity can therefore be expected to be variable, then the
linear functional form should be used.
Note: Economic theory frequently predicts only the sign of a relationship and not its
functional form. Under such circumstances, the linear form can be used until strong evidence
that it is inappropriate is found. Thus, unless theory, common sense, or experience justifies
using some other functional form, one should use the linear model.
b) Log-linear, double-log or constant elasticity model

The most common functional form that is non-linear in the variables (but still linear in the coefficients) is the log-linear form. A log-linear form is often used because the elasticities, and not the slopes, are constant; i.e. elasticity = β1 = constant.

Thus, given the assumption of a constant elasticity, the proper form is the exponential (log-linear) form.

Given: Y = β0 X^β1 e^u

The log-linear functional form for the above equation can be obtained by a logarithmic transformation of the equation:

lnY = lnβ0 + β1 lnX + u

The model can be estimated by OLS if the basic assumptions are fulfilled.

The model is also called a constant elasticity model because the coefficient of elasticity between Y and X (β1) remains constant.

This functional form is used in the estimation of demand and production functions.
Note: We should make sure that there are no negative or zero observations in the data set
before we decide to use the log-linear model. Thus log-linear models should be run only if all
the variables take on positive values.
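Estimating a double-log model amounts to running OLS on the log-transformed data. Below is a minimal sketch in Python (NumPy assumed; the data are made up purely for illustration):

    import numpy as np

    # Illustrative data: all values must be strictly positive before taking logs
    X = np.array([2.0, 4.0, 8.0, 16.0, 32.0])
    Y = np.array([3.1, 4.4, 6.3, 8.8, 12.7])

    lnX, lnY = np.log(X), np.log(Y)

    x = lnX - lnX.mean()
    b1 = (x * (lnY - lnY.mean())).sum() / (x ** 2).sum()  # the constant elasticity
    ln_b0 = lnY.mean() - b1 * lnX.mean()                  # estimate of ln(beta0)

    # Interpretation: a 1% change in X is associated with roughly a b1 % change in Y
    print(b1, np.exp(ln_b0))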
c) Semi-log Form
The semi-log functional form is a variant of the log-linear equation in which some but not all
of the variables (dependent and independent) are expressed in terms of their logs. Such
models expressed as:

(i) Y = β0 + β1 lnX + u (lin-log model) and (ii) lnY = β0 + β1X + u (log-lin model) are called semi-log models. The semi-log functional form, in the case of taking the log of one of the independent variables (the lin-log model), can be used to depict a situation in which the impact of X on Y is expected to 'tail off' as X gets bigger, as long as β1 is greater than zero.

Example: The Engel’s curve tends to flatten out, because as incomes get higher, a smaller
percentage of income goes to consumption and a greater percentage goes to saving.

 Consumption thus increases at a decreasing rate.
 Growth models are examples of semi-log forms
d) Polynomial Form
Polynomial functional forms express Y as a function of independent variables, some of which are raised to powers other than one. For example, in a second-degree polynomial (quadratic) equation, at least one independent variable is squared:

Y = β0 + β1X + β2X² + u

Such models produce slopes that change as the independent variables change. Thus the slope of Y with respect to X is

dY/dX = β1 + 2β2X

In most cost functions, the slope of the cost curve changes as output changes.

A simple transformation of the polynomial enables us to use the OLS method to estimate the parameters of the model: setting X1 = X and X2 = X² gives Y = β0 + β1X1 + β2X2 + u, which is linear in the parameters and in the transformed variables.
e) Reciprocal Transformation (Inverse Functional Forms)


The inverse functional form expresses Y as a function of the reciprocal (or inverse) of one or more of the independent variables (in this case X1):

Y = β0 + β1(1/X1) + u
The reciprocal form should be used when the impact of a particular independent variable is
expected to approach zero as that independent variable increases and eventually approaches
infinity. Thus as X1 gets larger, its impact on Y decreases.

An asymptote or limit value is set that the dependent variable will take if the value of the X-variable increases indefinitely; β0 provides that value in the above case. The function approaches the asymptote from the top or the bottom depending on the sign of β1.

Example: Phillips curve, a non-linear relationship between the rate of unemployment and the
percentage wage change.

4. Multiple Linear Regression Models (14 hrs)

Pre-test Questions

1. What are multiple regression models?


2. How do you think multiple linear regressions are different from simple linear regression
model?
3. Why are multiple regression models advantageous over simple linear regression model?
4. Do you think the estimation and inference procedures in multiple regression are similar to those in simple linear regression?

Concept and Notations of Multiple Regression Models


Simple linear regression model (also called the two-variable model) is extensively discussed
in the previous section. Such models assume that a dependent variable is influenced by only
one explanatory variable. However, many economic variables are influenced by several
factors or variables. Hence, simple regression models are often unrealistic: such models have little practical value beyond being simple to understand. Very good examples for this argument are demand and supply, each of which has several determinants.

Adding more variables to the simple linear regression model leads us to the discussion of
multiple regression models i.e. models in which the dependent variable (or regressand)
depends on two or more explanatory variables, or regressors. The multiple linear regression
(population regression function) in which we have one dependent variable Y, and k
explanatory variables, is given by:

Yi = β0 + β1X1i + β2X2i + ... + βkXki + ui ..................... 3.1

Where, β0 = the intercept = the value of Y when all X's are zero;

β1, β2, ..., βk = the partial slope coefficients;

ui = the random term.

In this model, for example, β1 is the amount of change in Y when X1 changes by one unit, keeping the effect of the other variables constant. Similarly, β2 is the amount of change in Y when X2 changes by one unit, keeping the effect of the other variables constant. The other slopes are interpreted in the same way.

Although a multiple regression equation can be fitted for any number of explanatory variables (equation 3.1), the simplest possible multiple regression model, the three-variable regression, will be presented for the sake of simplicity. It is characterized by one dependent variable (Y) and two explanatory variables (X1 and X2). The model is given by:

Yi = β0 + β1X1i + β2X2i + ui ................................... 3.2

Where, β0 = the intercept = the value of Y when both X1 and X2 are zero;

β1 = the change in Y when X1 changes by one unit, keeping the effect of X2 constant;

β2 = the change in Y when X2 changes by one unit, keeping the effect of X1 constant.

Assumptions of the Multiple Linear Regression

Each econometric method used for estimation has its own assumptions. Knowing the assumptions, and the consequences if they are not maintained, is very important for the econometrician. As in the previous section, there are certain assumptions underlying the multiple regression model under the method of ordinary least squares (OLS). Let us see them one by one.

Assumption 1: Randomness of ui - the variable u is a real random variable.

Assumption 2: Zero mean of ui - the random variable ui has a zero mean for each value of Xi, i.e. E(ui) = 0.

Assumption 3: Homoscedasticity of the random term - the random term has a constant variance. In other words, the variance of each ui is the same for all values of Xi: Var(ui) = σ².

Assumption 4: Normality of ui - the values of each ui are normally distributed: ui ~ N(0, σ²).

Assumption 5: No autocorrelation or serial independence of the random terms - the successive values of the random term are not correlated. The value of ui (corresponding to Xi) is independent of the value of any other uj (corresponding to Xj): Cov(ui, uj) = 0 for i ≠ j.

Assumption 6: Independence of ui and Xi - every disturbance term ui is independent of the explanatory variables: Cov(ui, X1i) = Cov(ui, X2i) = 0.

Assumption 7: No errors of measurement in the X's - the explanatory variables are measured without error.


Assumption 8: No perfect multicollinearity among the X's - the explanatory variables are not perfectly linearly correlated.
Assumption 9: Correct specification of the model - the model has no specification error in that
all the important explanatory variables appear explicitly in the function and the mathematical
form is correctly defined (linear or non-linear form and the number of equations in the
model).

Estimation of Partial Regression Coefficients
The process of estimating the parameters in the multiple regression model is similar to that of the simple linear regression model. The main task is to derive the normal equations using the same procedure as in the case of simple regression. As in the simple linear regression model, OLS and Maximum Likelihood (ML) methods can be used to estimate the partial regression coefficients of multiple regression models. But, due to its simplicity and popularity, the OLS method is used here. The OLS procedure consists in choosing the values of the unknown parameters so that the residual sum of squares is as small as possible.

Under the assumption of zero mean of the random term, the sample regression function will look like the following:

Ŷi = β̂0 + β̂1X1i + β̂2X2i ...................................... 3.3

We call this equation the fitted equation. Subtracting (3.3) from (3.2), we obtain:

ei = Yi - Ŷi = Yi - β̂0 - β̂1X1i - β̂2X2i ....................... 3.4

The method of ordinary least squares (OLS), or classical least squares (CLS), involves choosing the values of the parameter estimates for which Σei² is minimum. The minimum of Σei² is obtained by differentiating this sum of squares with respect to each of the coefficients and equating the derivatives to zero. That is,

∂Σei²/∂β̂0 = -2Σ(Yi - β̂0 - β̂1X1i - β̂2X2i) = 0 ............... 3.5

∂Σei²/∂β̂1 = -2ΣX1i(Yi - β̂0 - β̂1X1i - β̂2X2i) = 0 ............ 3.6

∂Σei²/∂β̂2 = -2ΣX2i(Yi - β̂0 - β̂1X1i - β̂2X2i) = 0 ............ 3.7

Simplifying equations (3.5), (3.6) and (3.7), we obtain the system of normal equations:

ΣYi = nβ̂0 + β̂1ΣX1i + β̂2ΣX2i .................................. 3.8

ΣX1iYi = β̂0ΣX1i + β̂1ΣX1i² + β̂2ΣX1iX2i ....................... 3.9

ΣX2iYi = β̂0ΣX2i + β̂1ΣX1iX2i + β̂2ΣX2i² ....................... 3.10

Then, letting lowercase letters denote deviations from the means (x1i = X1i - X̄1, x2i = X2i - X̄2, yi = Yi - Ȳ), the three equations (3.8), (3.9) and (3.10) can be solved, using matrix operations or simultaneously, to obtain the following estimates:

β̂1 = (Σx1y Σx2² - Σx2y Σx1x2) / (Σx1² Σx2² - (Σx1x2)²)

β̂2 = (Σx2y Σx1² - Σx1y Σx1x2) / (Σx1² Σx2² - (Σx1x2)²)

β̂0 = Ȳ - β̂1X̄1 - β̂2X̄2
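These closed-form expressions are easy to compute directly. Below is a minimal sketch in Python (NumPy assumed; the data anticipate the import/GNP/price illustration given later in this chapter):

    import numpy as np

    Y  = np.array([57, 43, 73, 37, 64, 48, 56, 50, 39, 43, 69, 60], dtype=float)
    X1 = np.array([220, 215, 250, 241, 305, 258, 354, 321, 370, 375, 385, 385], dtype=float)
    X2 = np.array([125, 147, 118, 160, 128, 149, 145, 150, 140, 115, 155, 152], dtype=float)

    y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()

    S11, S22, S12 = (x1**2).sum(), (x2**2).sum(), (x1*x2).sum()
    S1y, S2y = (x1*y).sum(), (x2*y).sum()

    D  = S11 * S22 - S12**2              # common denominator
    b1 = (S1y * S22 - S2y * S12) / D     # partial slope on X1
    b2 = (S2y * S11 - S1y * S12) / D     # partial slope on X2
    b0 = Y.mean() - b1 * X1.mean() - b2 * X2.mean()

    print(b0, b1, b2)                    # approx 75.405, 0.0254, -0.2133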
Variance and Standard errors of OLS Estimators
Estimating the numerical values of the parameters is not enough in econometrics when the data come from samples. The standard errors derived here are important for two main purposes: to establish confidence intervals for the parameters and to test statistical hypotheses. They allow us to look into the precision or statistical reliability of the estimators. An estimator cannot be relied upon for any purpose if it is not a good estimator, and the precision of an estimator is measured by its standard error.

As in the case of simple linear regression, the standard errors of the coefficients are vital in statistical inferences about the coefficients. We use the standard error of a coefficient to construct a confidence interval estimate for the population regression coefficient and to test the significance of the variable to which the coefficient is attached in determining the dependent variable in the model. In this section, we will see these standard errors. The standard error of a coefficient is the positive square root of the variance of the coefficient. Thus, we start by defining the variances of the coefficients.

Variance of the intercept:

Var(β̂0) = σ̂² [1/n + (X̄1²Σx2² + X̄2²Σx1² - 2X̄1X̄2Σx1x2) / (Σx1²Σx2² - (Σx1x2)²)] ...... 3.17

Variance of β̂1:

Var(β̂1) = σ̂² Σx2² / (Σx1²Σx2² - (Σx1x2)²) .................... 3.18

Variance of β̂2:

Var(β̂2) = σ̂² Σx1² / (Σx1²Σx2² - (Σx1x2)²) .................... 3.19

Where,

σ̂² = Σei² / (n - k) ............................................ 3.20

Equation 3.20 gives the estimate of the variance of the random term, and k is the number of parameters to be estimated in the model (here k = 3). Then, the standard errors are computed as follows:

Standard error of β̂0: se(β̂0) = √Var(β̂0) ..................... 3.21

Standard error of β̂1: se(β̂1) = √Var(β̂1) ..................... 3.22

Standard error of β̂2: se(β̂2) = √Var(β̂2) ..................... 3.23
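Continuing the sketch above (reusing Y, X1, X2, S11, S22, D, b0, b1 and b2 from it), these variances translate into code as follows (an illustration, with k = 3 parameters):

    n, k = len(Y), 3
    e = Y - (b0 + b1 * X1 + b2 * X2)       # residuals
    sigma2 = (e**2).sum() / (n - k)        # Equation 3.20

    var_b1 = sigma2 * S22 / D              # Equation 3.18
    var_b2 = sigma2 * S11 / D              # Equation 3.19

    se_b1, se_b2 = var_b1 ** 0.5, var_b2 ** 0.5
    print(se_b1, se_b2)                    # approx 0.0565 and 0.2505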

Note: The OLS estimators of the multiple regression model have properties which are parallel
to those of the two-variable model.

Coefficient of Multiple Determination


In the simple regression model we discussed the coefficient of determination and its interpretation. In this section, we discuss the coefficient of multiple determination, which plays an equivalent role to that in the simple model. Just as the coefficient of determination is the square of the simple correlation coefficient in the simple model, the coefficient of multiple determination is the square of the multiple correlation coefficient.

The coefficient of multiple determination (R²) is the measure of the proportion of the variation in the dependent variable that is explained jointly by the independent variables in the model. One minus R² is called the coefficient of non-determination; it gives the proportion of the variation in the dependent variable that remains unexplained by the independent variables in the model. As in the case of simple linear regression, R² is the ratio of the explained variation to the total variation.

Mathematically:

R² = ESS/TSS = Σŷi² / Σyi²

R² can also be given in terms of the slope coefficients:

R² = (β̂1Σx1y + β̂2Σx2y) / Σyi²

In simple linear regression, the higher the R², the better the model is determined by the explanatory variable in the model. In multiple linear regression, however, every time we insert an additional explanatory variable into the model, the R² increases irrespective of any improvement in the goodness of fit of the model. That means a high R² may not imply that the model is good.

Thus, we adjust the R² as follows:

Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1)

Where, k = the number of explanatory variables in the model.

In multiple linear regression, therefore, we had better interpret the adjusted R² rather than the ordinary or unadjusted R². We know that the value of R² always lies between zero and one, but the adjusted R² can lie outside this range, even being negative.

In the case of simple linear regression, R² is the square of the linear correlation coefficient. Since the correlation coefficient lies between -1 and +1, the coefficient of determination lies between 0 and 1. The R² of multiple linear regression also lies between 0 and +1. The adjusted R², however, can sometimes be negative when the goodness of fit is poor. When the adjusted R² value is negative, we consider it as zero and interpret it as no variation of the dependent variable being explained by the regressors.
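Continuing the earlier sketch (reusing Y, X1, X2, y, b0, b1 and b2), the ordinary and adjusted R² can be computed as follows (illustrative):

    e = Y - (b0 + b1 * X1 + b2 * X2)
    R2 = 1 - (e**2).sum() / (y**2).sum()           # coefficient of multiple determination

    n, k = len(Y), 2                               # k = number of explanatory variables
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)  # adjusted R-squared
    print(R2, R2_adj)                              # approx 0.088 and -0.115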

Confidence Interval Estimation

Confidence interval estimation in multiple linear regression follows the same formulae and
procedures that we followed in simple linear regression. You are, therefore, required to
practice finding the confidence interval estimates of the intercept and the slopes in multiple
regression with two explanatory variables.

Please recall that the 100(1 - α)% confidence interval for βi is given as:

β̂i ± t(α/2, n-k) · se(β̂i)

where k is the number of parameters to be estimated, i.e. the number of variables (both dependent and explanatory).

Interpretation of the confidence interval: values of the parameter lying in the interval are plausible with 100(1 - α)% confidence.

Hypothesis Testing in Multiple Regression
Hypothesis testing is important to draw inferences about the estimates and to know how
representative the estimates are to the true population parameter. Once we go beyond the
simple world of the two-variable linear regression model, hypothesis testing assumes several
interesting forms such as the following.

a) Testing hypothesis about an individual partial regression coefficient;


b) Testing the overall significance of the estimated multiple regression model (finding
out if all the partial slope coefficients are simultaneously equal to zero);
c) Testing if two or more coefficients are equal to one another;
d) Testing that the partial regression coefficients satisfy certain restrictions
e) Testing the stability of the estimated regression model over time or in different cross-
sectional units
f) Testing the functional form of regression models.

These and other types of hypothesis tests can be found in various econometrics books. For our purposes, we will confine ourselves to the major ones.

Testing individual regression coefficients


The tests concerning the individual coefficients can be done using the standard error test or
the t-test. In all the cases the hypothesis is stated as:

a) H0: β1 = 0 against H1: β1 ≠ 0    b) H0: β2 = 0 against H1: β2 ≠ 0

In a) we would like to test the hypothesis that X1 has no linear influence on Y holding the other variables constant. In b) we test the hypothesis that X2 has no linear relationship with Y holding the other factors constant. The above hypotheses lead us to a two-tailed test; however, a one-tailed test might also be important. There are two methods for testing the significance of individual regression coefficients.

a) Standard Error Test: Using the standard error test we can test the above hypothesis.
Thus the decision rule is based on the relationship between the numerical value of the
parameter and the standard error of the same.

(i) If se(β̂i) > |β̂i|/2, we accept the null hypothesis, i.e. the estimate of βi is not statistically significant.

Conclusion: The coefficient is not statistically significant. In other words, the variable does not have a significant influence on the dependent variable.

(ii) If se(β̂i) < |β̂i|/2, we fail to accept H0, i.e. we reject the null hypothesis in favour of the alternative hypothesis, meaning the estimate of βi has a significant influence on the dependent variable.
Generalisation: The smaller the standard error, the stronger is the evidence that the estimates
are statistically significant.

(b) t-test
The more appropriate and formal way to test the above hypothesis is to use the t-test. As usual
we compute the t-ratios and compare them with the tabulated t-values and make our decision.
Therefore:
Decision Rule: accept H0 if |tcal| < t(α/2, n-k).

Otherwise, reject the null hypothesis. Rejecting H0 means the coefficient being tested is significantly different from 0. Not rejecting H0, on the other hand, means we do not have sufficient evidence to conclude that the coefficient is different from 0.

Testing the Overall Significance of Regression Model


Here, we are interested to test the overall significance of the observed or estimated regression
line, that is, whether the dependent variable is linearly related to all of the explanatory
variables. Hypotheses of such type are often called joint hypotheses. Testing the overall
significance of the model means testing the null hypothesis that none of the explanatory

variables in the model significantly determine the changes in the dependent variable. Put in
other words, it means testing the null hypothesis that none of the explanatory variables
significantly explain the dependent variable in the model. This can be stated as:

H0: β1 = β2 = ... = βk = 0 against H1: at least one βj ≠ 0

The test statistic for this test is given by:

F = (ESS/k) / (RSS/(n - k - 1)) = MSR/MSE

Where, k is the number of explanatory variables in the model.
The results of the overall significance test of a model are summarized in the analysis of
variance (ANOVA) table as follows.
Source of variation   Sum of squares    Degrees of freedom   Mean sum of squares
Regression            ESS = Σŷi²        k                    MSR = ESS/k
Residual              RSS = Σei²        n - k - 1            MSE = RSS/(n - k - 1)
Total                 TSS = Σyi²        n - 1

The values in this table are explained as follows. The three sums of squares are related in such a way that

TSS = ESS + RSS

This implies that the total sum of squares is the sum of the explained (regression) sum of
squares and the residual (unexplained) sum of squares. In other words, the total variation in
the dependent variable is the sum of the variation in the dependent variable due to the
variation in the independent variables included in the model and the variation that remained
unexplained by the explanatory variables in the model. Analysis of variance (ANOVA) is the
technique of decomposing the total sum of squares into its components. As we can see here,
the technique decomposes the total variation in the dependent variable into the explained and
the unexplained variations. The degrees of freedom of the total variation are also the sum of
the degrees of freedom of the two components. By dividing the sum of squares by the
corresponding degrees of freedom, we obtain what is called the Mean Sum of Squares
(MSS).

The mean sums of squares due to regression, residual (error) and total are calculated as the sums of squares divided by their corresponding degrees of freedom (see the last column of the above ANOVA table). The test statistic is then computed as follows:

Fcal = MSR/MSE, which follows the F distribution with (k, n - k - 1) degrees of freedom.

The test rule: reject H0 if Fcal > Fα(k, n - k - 1), where Fα(k, n - k - 1) is the value to be read from the F-distribution table at the given significance level α.

Relationship between F and R2

You may recall that R² is given by R² = ESS/TSS, so that ESS = R²·TSS.

We also know that RSS = TSS - ESS.

Hence, RSS = (1 - R²)·TSS.

The formula for F is then:

F = (ESS/k) / (RSS/(n - k - 1)) = [R²·TSS/k] / [(1 - R²)·TSS/(n - k - 1)] = (R²/k) / ((1 - R²)/(n - k - 1))
That means the calculated F can also be expressed in terms of the coefficient of
determination.
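A small helper function makes this relationship concrete. Below is a minimal sketch in Python (SciPy assumed for the critical value; illustrative only):

    from scipy import stats

    def overall_F(R2: float, n: int, k: int):
        """F statistic for H0: all k slope coefficients are zero, from R-squared."""
        F = (R2 / k) / ((1 - R2) / (n - k - 1))
        F_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)  # 5% critical value
        return F, F_crit

    print(overall_F(0.088, 12, 2))   # approx (0.43, 4.26)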

Testing the Equality of two Regression Coefficients


Given the multiple regression equation:

Yi = β0 + β1X1i + β2X2i + ui

We would like to test the hypothesis:

H0: β1 = β2 (or β1 - β2 = 0) vs. H1: β1 ≠ β2

The null hypothesis says that the two slope coefficients are equal.

Example: If Y is the quantity demanded of a commodity, X1 is the price of the commodity and X2 is the income of the consumer, the null hypothesis suggests that the price and income elasticities of demand are the same.

We can test the null hypothesis using the classical assumption that the statistic

t = (β̂1 - β̂2) / se(β̂1 - β̂2)

follows the t distribution with N - K degrees of freedom, where K = the total number of parameters estimated.

The se(β̂1 - β̂2) is given as

se(β̂1 - β̂2) = √[Var(β̂1) + Var(β̂2) - 2Cov(β̂1, β̂2)]

Thus the t-statistic is:

t = (β̂1 - β̂2) / √[Var(β̂1) + Var(β̂2) - 2Cov(β̂1, β̂2)]

Decision: Reject H0 if |tcal| > ttab.

Note: Using similar procedures one can also test linear equality restrictions (for example, β1 + β2 = 1) and other restrictions.
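A hedged sketch of this test in Python (SciPy assumed; the variance and covariance inputs would come from the estimated covariance matrix of the coefficients):

    import math
    from scipy import stats

    def t_equality(b1, b2, var_b1, var_b2, cov_b12, n, K, alpha=0.05):
        """t statistic for H0: beta1 = beta2, with N - K degrees of freedom."""
        se_diff = math.sqrt(var_b1 + var_b2 - 2 * cov_b12)
        t = (b1 - b2) / se_diff
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - K)
        return t, t_crit     # reject H0 if abs(t) > t_crit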

Illustration: The following table shows, for a particular country, the value of imports (Y), the level of Gross National Product (X1) measured in arbitrary units, and the price index of imported goods (X2), over a 12-year period.
Table 9: Data for multiple regression examples
Year 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
Y 57 43 73 37 64 48 56 50 39 43 69 60
X1 220 215 250 241 305 258 354 321 370 375 385 385
X2 125 147 118 160 128 149 145 150 140 115 155 152

a) Estimate the coefficients of the economic relationship and fit the model.

To estimate the coefficients of the economic relationship, we compute the entries given in
Table 10

Table 10: Computations of the summary statistics for coefficients for data of Table 9
Year   Y    X1    X2    x1        x2        y      x1²       x2²       x1y       x2y       x1x2      y²
1960 57 220 125 -86.5833 -15.3333 3.75 7496.668 235.1101 -324.687 -57.4999 1327.608 14.0625
1961 43 215 147 -91.5833 6.6667 -10.25 8387.501 44.44489 938.7288 -68.3337 -610.558 105.0625
1962 73 250 118 -56.5833 -22.3333 19.75 3201.67 498.7763 -1117.52 -441.083 1263.692 390.0625
1963 37 241 160 -65.5833 19.6667 -16.25 4301.169 386.7791 1065.729 -319.584 -1289.81 264.0625
1964 64 305 128 -1.5833 -12.3333 10.75 2.506839 152.1103 -17.0205 -132.583 19.52731 115.5625
1965 48 258 149 -48.5833 8.6667 -5.25 2360.337 75.11169 255.0623 -45.5002 -421.057 27.5625
1966 56 354 145 47.4167 4.6667 2.75 2248.343 21.77809 130.3959 12.83343 221.2795 7.5625
1967 50 321 150 14.4167 9.6667 -3.25 207.8412 93.44509 -46.8543 -31.4168 139.3619 10.5625
1968 39 370 140 63.4167 -0.3333 -14.25 4021.678 0.111089 -903.688 4.749525 -21.1368 203.0625
1969 43 375 115 68.4167 -25.3333 -10.25 4680.845 641.7761 -701.271 259.6663 -1733.22 105.0625
1970 69 385 155 78.4167 14.6667 15.75 6149.179 215.1121 1235.063 231.0005 1150.114 248.0625
1971 60 385 152 78.4167 11.6667 6.75 6149.179 136.1119 529.3127 78.75022 914.8641 45.5625
Sum 639 3679 1684 0.0004 0.0004 0 49206.92 2500.667 1043.25 -509 960.6667 1536.25
Mean 53.25 306.5833 140.3333 0 0 0

From Table 10, we can take the following summary results:

ΣY = 639, ΣX1 = 3679, ΣX2 = 1684, Ȳ = 53.25, X̄1 = 306.5833, X̄2 = 140.3333

The summary results in deviation form are then given by:

Σx1² = 49206.92, Σx2² = 2500.667, Σx1y = 1043.25, Σx2y = -509, Σx1x2 = 960.6667, Σy² = 1536.25

The coefficients are then obtained as follows:

β̂1 = (Σx1y Σx2² - Σx2y Σx1x2) / (Σx1²Σx2² - (Σx1x2)²) = 0.025365

β̂2 = (Σx2y Σx1² - Σx1y Σx1x2) / (Σx1²Σx2² - (Σx1x2)²) = -0.21329

β̂0 = Ȳ - β̂1X̄1 - β̂2X̄2 = 75.40512

The fitted model is then written as: Ŷ = 75.40512 + 0.025365X1 - 0.21329X2


b) Compute the variance and standard errors of the slopes.
First, you need to compute the estimate of the variance of the random term as follows

Variance of

63
WOLAITA SOD UNIVERSITY
Standard error of

Variance of

Standard error of

Similarly, the standard error of the intercept is found to be 37.98177. The detail is left for you
as an exercise.
c) Calculate and interpret the coefficient of determination.
We can use the following summary results to obtain the R²:

Σŷi² = 135.0262 (explained sum of squares)

Σei² = 1401.2238 (residual sum of squares)

Σyi² = 1536.25 (the sum of the above two). Then,

R² = Σŷi²/Σyi² = 135.0262/1536.25 ≈ 0.088, or equivalently R² = 1 - Σei²/Σyi² ≈ 0.088

That is, only about 8.8% of the variation in imports is explained jointly by X1 and X2.

d) Compute the adjusted R².

Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1) = 1 - (0.912)(11/9) ≈ -0.115

e) Construct 95% confidence interval for the true population parameters (partial regression
coefficients).[Exercise: Base your work on Simple Linear Regression]
f) Test the significance of X1 and X2 in determining the changes in Y using t-test.
The hypotheses are summarized in the following table.

Coefficient Hypothesis Estimate Std. error Calculated t Conclusion


1 H0: 1=0 0.025365 0.056462 We do not
H1: 10 reject H0 since
tcal<ttab
2 H0: 2=0 -0.21329 0.25046 We do not
H1: 20 reject H0 since
tcal<ttab

The critical value (t0.025 with 9 degrees of freedom) to be used here is 2.262. Like the standard error test, the t-test reveals that both X1 and X2 are insignificant in determining the change in Y, since the calculated t values are both less than the critical value in absolute terms.

Exercise: Test the significance of X1 and X2 in determining the changes in Y using the
standard error test.
g) Test the overall significance of the model. (Hint: use  = 0.05)
This involves testing whether at least one of the two variables X 1 and X2 determine the
changes in Y. The hypothesis to be tested is given by:

H0: β1 = β2 = 0 against H1: at least one of β1, β2 is different from zero

The ANOVA table for the test is given as follows:

Source of variation   Sum of Squares     Degrees of freedom   Mean Sum of Squares
Regression            ESS = 135.0262     3 - 1 = 2            67.513
Residual              RSS = 1401.2238    12 - 3 = 9           155.692
Total                 TSS = 1536.25      12 - 1 = 11

Fcal = MSR/MSE = 67.513/155.692 ≈ 0.4336

The tabulated F value (critical value) at the 5% level is F0.05(2, 9) ≈ 4.26.

In this case, the calculated F value (0.4336) is less than the tabulated value (4.26). Hence, we do not reject the null hypothesis and conclude that there is no significant contribution of the variables X1 and X2 to the changes in Y.

h) Compute the F value using the R².

F = (R²/k) / ((1 - R²)/(n - k - 1)) = (0.088/2) / (0.912/9) ≈ 0.434, which matches the F value obtained from the ANOVA table.

5. Dummy Variable Regression Models

Dummy Dependent Variables

Many economic choices are either-or. Farmers either use a fertilizer or they don't. People either contact extension agents or they don't. What are the factors that influence such decisions? What if we want to analyse the factors affecting fertilizer use, where the dependent variable assumes the value zero for those who don't use fertilizer and one for the others?

In this class of models, we consider the case where the dependent variable can take the value of 0 or 1; such variables are often termed dichotomous variables. The literature on this type of model is extensive and includes cases where there are more than two possible outcomes; however, we cover only an introductory section of this area of econometrics. It is important to note that these types of models tend to be associated with cross-sectional econometrics rather than time series.

1. Linear Probability Model (LPM)

The Linear Probability Model uses OLS to estimate the model; the coefficients, statistics etc. are then interpreted in the usual way. Features of the LPM:

1. The dependent variable has two values: the value 1 has a probability of p and the value 0 has a probability of (1 - p).
2. This is known as the Bernoulli probability distribution. In this case, the expected value of a random variable following a Bernoulli distribution is the probability that the variable equals 1.
3. Since the probability p must lie between 0 and 1, the expected value of the dependent variable must also lie between 0 and 1.

Problems with LPM

1. The error term is not normally distributed; it also follows the Bernoulli distribution.
2. The variance of the error term is heteroskedastic. The variance for the Bernoulli distribution is p(1 - p), where p is the probability of a success.
3. The value of the R-squared statistic is of limited use, given the distribution of the LPM.
4. Possibly the most problematic aspect of the LPM is the non-fulfilment of the
requirement that the estimated value of the dependent variable y lies between 0 and 1.
5. One way around the problem is to assume that all values below 0 and above 1 are
actually 0 or 1 respectively
6. An alternative and much better remedy to the problem is to use an alternative technique
such as the Logit or Probit models.
The final problem with the LPM is that it is a linear model, assuming that the probability of the dependent variable equaling 1 is linearly related to the explanatory variables. For example, consider a model where the dependent variable takes the value of 1 if a farmer has extension contact and 0 otherwise, regressed on the farmer's education level. The probability of contacting an extension agent is then assumed to rise at a constant rate as the education level rises.

2. The Logit Model

The main way around the problems mentioned earlier is to use a different distribution from the Bernoulli distribution, in which the relationship between x and p is non-linear and p always lies between 0 and 1. This requires the use of an 's'-shaped curve, which resembles the cumulative distribution function (CDF) of a random variable. The CDFs used to represent a discrete variable are the logistic (Logit model) and the normal (Probit model).

In the Logit model the log-odds ratio, L = ln(p/(1 - p)), is modelled as a linear function of the explanatory variables. Features of the Logit model:

1. Although L is linear in the parameters, the probabilities are non-linear.
2. The Logit model can be used in multiple regression tests.
3. If L is positive, as the value of the explanatory variables increases, the odds that the dependent variable equals 1 increase.
4. The slope coefficient measures the change in the log-odds ratio for a unit change in the explanatory variable.
5. These models are usually estimated using Maximum Likelihood techniques.
6. The R-squared statistic is not suitable for measuring the goodness of fit in discrete dependent variable models; instead we compute the count R-squared statistic.

For illustration, suppose a logit model of mortgage approval on income (y) gives an intercept of 0.56 and a slope of 0.32. The coefficient on income suggests that a one-unit increase in income produces a 0.32 rise in the log of the odds of getting a mortgage. This is difficult to interpret directly, so the coefficient itself is often not emphasized; the z-statistic (analogous to the t-statistic) and the sign of the coefficient are used for interpreting the results. We can also plug in a specific value for the income of a customer and then find the probability of getting a mortgage.

Logit model result: if we have a customer with 0.5 units of income, we can estimate a value for the logit of 0.56 + 0.32 × 0.5 = 0.72. We can use this estimated logit value to find the estimated probability of getting a mortgage. Including it in the logistic formula given earlier, we get:

p̂ = 1 / (1 + e^(-0.72)) ≈ 0.67
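The back-transformation from logit to probability is one line of code. A minimal sketch in Python (the coefficients are the illustrative ones above):

    import math

    def logit_probability(b0: float, b1: float, x: float) -> float:
        """Probability that the dependent variable equals 1 under a logit model."""
        L = b0 + b1 * x                      # the estimated logit (log-odds)
        return 1.0 / (1.0 + math.exp(-L))

    print(logit_probability(0.56, 0.32, 0.5))   # approx 0.67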

3. The Probit Model

An alternative CDF to that used in the Logit model is the normal CDF; when this is used we refer to it as the Probit model. In many respects this is very similar to the Logit model. The Probit model has also been interpreted as a 'latent variable' model. This has implications for how we explain the dependent variable, i.e. we tend to interpret it as a desire or ability to achieve something.

The models compared:

1. The coefficient estimates from all three models are related.
2. According to Amemiya, if you multiply the coefficients from a Logit model by 0.625, they are approximately the same as those of the Probit model.
3. If the coefficients from the LPM are multiplied by 2.5 (and 1.25 is subtracted from the constant term), they are approximately the same as those produced by a Probit model.

There are four basic types of variables we generally encounter in empirical analysis: nominal, ordinal, interval and ratio scale variables. In the preceding sections, we have encountered ratio scale variables. However, regression models do not deal only with ratio scale variables; they can also involve nominal and ordinal scale variables. In regression analysis, the dependent variable can be influenced by nominal variables such as sex, race, colour, geographical region etc. Models where all regressors are nominal (categorical) variables are called ANOVA (Analysis of Variance) models. If there is a mixture of nominal and ratio scale variables, the models are called ANCOVA (Analysis of Covariance) models. Look at the following example.

Illustration: The following model represents the relationship between geographical location and teachers' average salary in public schools. The data were taken from 50 states for a single year. The 50 states were classified into three regions: Northeast, South and West. The regression model looks like the following:

Yi = β0 + β1D1i + β2D2i + ui

Where Yi = the (average) salary of public school teachers in state i


D1i = 1 if the state is in the Northeast
= 0 otherwise (i.e. in other regions of the country)
D2i = 1 if the state is in the South
= 0 otherwise (i.e. in other regions of the country)
Note that the above regression model is like any multiple regression model considered
previously, except that instead of quantitative regressors, we have only qualitative (dummy)

regressors. Dummy regressors take value of 1 if the observation belongs to that particular
category and 0 otherwise.
Note also that there are 3 states (categories) for which we have created only two dummy
variables (D1 and D2). One of the rules in dummy variable regression is that if there are m
categories, we need only m-1 dummy variables. If we are suppressing the intercept, we can
have m dummies but the interpretation will be a bit different.
The intercept value represents the mean value of the dependent variable for the bench mark
category. This is the category for which we do not assign a dummy (in our case, West is a
bench mark category). The coefficients of the dummy variable are called differential
intercept coefficients because they tell us by how much the value of the intercept that receives
the value of 1 differs from the intercept coefficient of the benchmark category.

p  value (0.000) (0.233) (0.0349) R 2 0.0901

From the above fitted model, we can see that the mean salary of public school teachers in the West is about $26,158.62. The mean salary of teachers in the Northeast is lower by $1,734.47 than that of the West, and that of teachers in the South is lower by $3,264.42. From these figures, the average salaries in the latter two regions are about $24,424 and $22,894, respectively.

In order to know the statistical significance of the mean salary differences, we can run the
tests we have discussed in previous sections. The other results can also be interpreted the way
we discussed previously.
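Constructing the dummies and recovering the implied group means is mechanical. A minimal sketch in Python (NumPy assumed; the region labels are hypothetical rows, and the coefficients are those of the fitted model above):

    import numpy as np

    regions = np.array(["Northeast", "South", "West", "South"])  # hypothetical rows
    D1 = (regions == "Northeast").astype(float)   # 1 if Northeast, else 0
    D2 = (regions == "South").astype(float)       # 1 if South, else 0 (West is benchmark)

    b0, b1, b2 = 26158.62, -1734.47, -3264.42     # fitted coefficients
    predicted = b0 + b1 * D1 + b2 * D2            # implied mean salary per region
    print(predicted)                              # 24424.15, 22894.20, 26158.62, ...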

6. Econometric Problems (6 hrs)

Pre-test Questions
1. What are the major CLRM assumptions?
2. What happens to the properties of the OLS estimators if one or more of these
assumptions are violated, i.e. not fulfilled?
3. How can we check whether an assumption is violated or not?

Assumptions Revisited
In many practical cases, two major problems arise in applying the classical linear regression
model.
1) those due to assumptions about the specification of the model and about the
disturbances and
2) those due to assumptions about the data

The following assumptions fall in either of the categories.


 The regression model is linear in parameters.
 The values of the explanatory variables are fixed in repeated sampling (non-stochastic).
 The mean of the disturbance (ui) is zero for any given value of X, i.e. E(ui) = 0.
 The variance of ui is constant, i.e. homoscedastic.
 There is no autocorrelation in the disturbance terms.
 The explanatory variables are distributed independently of the ui.
 The number of observations must be greater than the number of explanatory variables.
 There must be sufficient variability in the values taken by the explanatory variables.
 There is no exact linear relationship (multicollinearity) among the explanatory variables.
 The stochastic (disturbance) terms ui are normally distributed, i.e. ui ~ N(0, σ²).
 The regression model is correctly specified, i.e. there is no specification error.

With these assumptions we can show that the OLS estimators are BLUE and normally
distributed; hence it was possible to test hypotheses about the parameters. However, if any of
these assumptions is relaxed, OLS might not work. We shall now examine in detail the
violation of some of the assumptions.

Violations of Assumptions

The Zero Mean Assumption i.e. E(ui)=0


If this assumption is violated, we obtain a biased estimate of the intercept term. But since the
intercept is usually of little importance and often has no meaningful physical interpretation,
we can live with this. The slope coefficients remain unaffected even if the assumption is
violated.

The Normality Assumption


This assumption is not essential if the objective is estimation only: the OLS estimators are
BLUE regardless of whether the ui are normally distributed or not. In addition, because of the
central limit theorem, we can argue that the test procedures – the t-tests and F-tests – are still
valid asymptotically, i.e. in large samples.

Heteroscedasticity: The Error Variance is not Constant


Under homoscedasticity, the error terms in the regression equation have a common variance.
If they do not have a common variance, we say they are heteroscedastic. The basic questions
to be addressed are:
 What is the nature of the problem?
 What are the consequences of the problem?
 How do we detect (diagnose) the problem?
 What remedies are available for the problem?

The Nature of the Problem


In the case of homoscedastic disturbance terms, the spread around the mean is constant, i.e.
Var(ui) = σ². In the case of heteroscedastic disturbance terms, the variance changes with the
explanatory variable, i.e. Var(ui) = σi². The problem of heteroscedasticity is likely to be more
common in cross-sectional than in time-series data.

Causes of Heteroscedasticity
There are several reasons why the variance of the error term may be variable, some of which
are as follows.
 Following error-learning models, as people learn, their errors of behaviour become
smaller over time, so the error variance of the regression model decreases.

 As income grows people have discretionary income and hence more scope for
choice about the disposition of their income. Hence, the variance (standard error)
of the regression is more likely to increase with income.
 Improvement in data collection techniques will reduce errors (variance).
 Existence of outliers might also cause heteroscedasticity.
 Misspecification of a model can also be a cause for heteroscedasticity.
 Skewness in the distribution of one or more explanatory variables included in the
model is another source of heteroscedasticity.
 Incorrect data transformation and incorrect functional form are other sources.
Note: Heteroscedasticity is likely to be more common in cross-sectional data than in time
series data. In cross-sectional data, one usually deals with members of a population (such as
consumers, producers, etc.) at a given point in time, and such members may differ greatly in
size. In time series data, by contrast, the variables tend to be of similar orders of magnitude
since the data are collected for the same entity over a period of time.

Consequences of Heteroscedasticity
If the error terms of an equation are heteroscedastic, there are three major consequences.
a) The ordinary least squares estimators are still linear and unbiased, since
heteroscedasticity does not cause bias in the coefficient estimates.
b) Heteroscedasticity increases the variance of the partial regression coefficients; the
minimum variance property is lost, so the OLS estimators are inefficient.
c) Consequently, the test statistics – t-test and F-test – cannot be relied on in the face of
uncorrected heteroscedasticity.

Detection of Heteroscedasticity
There are no hard and fast rules (universally agreed upon methods) for detecting the presence
of heteroscedasticity, but some rules of thumb can be suggested. Most of these methods are
based on the examination of the OLS residuals ei, since these are the ones we observe, not the
disturbances ui. There are informal and formal methods of detecting heteroscedasticity.

a) Nature of the problem


In cross-sectional studies involving heterogeneous units, heteroscedasticity is the rule rather
than the exception.
Example: in a study of input expenditure in relation to sales, the rate of interest, etc. across
small, medium and large agribusiness firms, heteroscedasticity is to be expected.

b) Graphical method
If there is no a priori or empirical information about the nature of heteroscedasticity, one
could examine the estimated squared residuals, ei², to see whether they exhibit any systematic
pattern. The squared residuals can be plotted either against Ŷ or against one of the
explanatory variables. If any systematic pattern appears, heteroscedasticity might exist. These
two methods (nature of the problem and graphical inspection) are informal methods.
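A minimal sketch of the graphical check, assuming hypothetical data generated with an error
spread that grows with the regressor (statsmodels and matplotlib assumed):

# Informal graphical check: plot squared OLS residuals against fitted values.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)     # error spread grows with x
fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(fit.fittedvalues, fit.resid ** 2)
plt.xlabel("fitted values")
plt.ylabel("squared residuals")
plt.title("A widening (funnel) pattern suggests heteroscedasticity")
plt.show()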

c) Park Test
Park suggested a statistical test for heteroscedasticity based on the assumption that the
variance of the disturbance term (σi²) is some function of the explanatory variable Xi. Park
suggested the functional form

σi² = σ² Xi^β e^(vi)

which can be transformed into a linear function using a log transformation:

ln σi² = ln σ² + β ln Xi + vi

where vi is the stochastic disturbance term. Since σi² is not known, the squared OLS residuals
ei² are used in its place, and the regression becomes

ln ei² = α + β ln Xi + vi

The Park test is thus a two-stage procedure: first run the OLS regression disregarding the
heteroscedasticity question and obtain the residuals ei; then run the regression above. If β
turns out to be statistically significant, this suggests that heteroscedasticity is present in the
data.
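A sketch of the two-stage procedure on hypothetical data (statsmodels assumed):

# Park test sketch: stage 1 fits OLS; stage 2 regresses ln(e^2) on ln(X).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)            # variance rises with x

stage1 = sm.OLS(y, sm.add_constant(x)).fit()        # ignore heteroscedasticity
ln_e2 = np.log(stage1.resid ** 2)
stage2 = sm.OLS(ln_e2, sm.add_constant(np.log(x))).fit()
print(stage2.params[1], stage2.pvalues[1])          # significant slope -> heteroscedasticity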

d) Spearman’s Rank Correlation Test

Recall the rank correlation coefficient:

rs = 1 − 6Σdi² / (n(n² − 1)), where d = difference between ranks.

Fit the regression, obtain the residuals ei, and compute the rank correlation between |ei| and
Xi. A high, statistically significant rank correlation suggests the presence of
heteroscedasticity, as in the sketch below.
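A minimal sketch, assuming hypothetical data and using scipy’s spearmanr:

# Spearman rank-correlation check: rank-correlate |residuals| with the regressor.
import numpy as np
import statsmodels.api as sm
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 150)
y = 1 + 2 * x + rng.normal(0, 0.4 * x)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

rho, pval = spearmanr(np.abs(resid), x)
print(rho, pval)    # high, significant rho -> heteroscedasticity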
Goldfeld and Quandt Test

This is one of the most popular tests and is usually suitable for large samples. The test can be
used if it is assumed that the variance (σi²) is positively related to one of the explanatory
variables in the regression model and if the number of observations is at least twice the
number of parameters to be estimated.
Given the model Yi = β1 + β2Xi + ui, suppose σi² is positively related to Xi as σi² = σ²Xi².
Goldfeld and Quandt suggest the following steps:


1. Rank the observations according to the values of Xi in ascending order.
2. Omit the central c observations (usually about the middle third), where c is specified a
priori, and divide the remaining (n − c) observations into two groups, each with
(n − c)/2 observations.
3. Fit separate regressions to the two sub-samples and obtain the respective residual sums
of squares, RSS1 and RSS2, each with (n − c)/2 − k degrees of freedom, where k is the
number of parameters estimated.
4. Compute the ratio

F = (RSS2 / df) / (RSS1 / df)

If the two variances tend to be the same, then F approaches unity. If the variances differ we
will have values for F different from one. The higher the F-ratio, the stronger the evidence of
heteroscedasticity.
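The test is available directly in statsmodels; a minimal sketch on hypothetical data follows
(the one-third drop fraction is an assumption matching the rule of thumb above):

# Goldfeld–Quandt test via statsmodels, dropping the middle third of the sample.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(1, 10, 120))                 # ordered by X (step 1)
y = 3 + 1.5 * x + rng.normal(0, 0.5 * x)
X = sm.add_constant(x)

fstat, pval, _ = het_goldfeldquandt(y, X, drop=1 / 3)
print(fstat, pval)    # F well above 1 with a small p-value -> heteroscedasticity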
Note: There are also other methods of testing for heteroscedasticity in your data, namely the
Glejser test, the Breusch–Pagan–Godfrey test, White’s general test and the Koenker–Bassett
test, the details of which you are expected to read up on. A sketch of one of them follows.
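For instance, the Breusch–Pagan test has a ready-made implementation in statsmodels; a
minimal sketch on hypothetical data:

# Breusch–Pagan test sketch: small p-values reject homoscedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 120)
X = sm.add_constant(x)
y = 3 + 1.5 * x + rng.normal(0, 0.5 * x)

resid = sm.OLS(y, X).fit().resid
lm, lm_pval, fval, f_pval = het_breuschpagan(resid, X)
print(lm_pval, f_pval)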

Remedial Measures
OLS estimators are still unbiased even in the presence of heteroscedasticity. But they are not
efficient, not even asymptotically. This lack of efficiency makes the usual hypothesis testing
procedure a dubious exercise. Remedial measures are, therefore, necessary. Generally the
solution is based on some form of transformation.

a) The Weighted Least Squares (WLS)


Given a regression model of the form Yi = β1 + β2Xi + ui, the weighted least squares method
requires running OLS on suitably transformed data, where the transformation is based on the
assumed form of heteroscedasticity. For example, if σi² = σ²Xi², dividing the whole equation
through by Xi yields a transformed model with a homoscedastic error, as in the sketch below.
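A minimal WLS sketch with statsmodels, assuming (hypothetically) that the error variance is
proportional to Xi²:

# WLS sketch: weights are the reciprocals of the assumed error variances.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 200)
y = 2 + 0.8 * x + rng.normal(0, 0.4 * x)      # sd proportional to x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # Var(u_i) assumed = sigma^2 * x_i^2
print(ols.bse)                                # unreliable under heteroscedasticity
print(wls.bse)                                # efficient, given the assumed variance form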
b) Other Remedies for Heteroscedasticity
Two other approaches could be adopted to remove the effect of heteroscedasticity.
 Include a previously omitted variable(s) if heteroscedasticity is suspected due to
omission of variables.
 Redefine the variables in such a way that avoids heteroscedasticity. For example,
instead of total income, we can use Income per capita.

Autocorrelation: Error Terms are correlated


Another assumption of the regression model was the non-existence of serial correlation
(autocorrelation) between the disturbance terms ui. Serial correlation implies that the error
term from one time period depends in some systematic way on error terms from other time
periods. Autocorrelation is more a problem of time series data than of cross-sectional data; if
such correlation is observed across cross-sectional units, it is called spatial autocorrelation. It
is therefore important to understand serial correlation and its consequences for the OLS
estimators.

Nature of Autocorrelation
The classical model assumes that the disturbance term relating to any observation is not
influenced by the disturbance term relating to any other observation:

Cov(ui, uj) = 0,  i ≠ j

But if there is any interdependence between the disturbance terms, then we have
autocorrelation:

Cov(ui, uj) ≠ 0,  i ≠ j

Causes of Autocorrelation
Serial correlation may occur because of a number of reasons.
 Inertia (built in momentum) – a salient feature of most economic variables time series
(such as GDP, GNP, price indices, production, employment etc) is inertia or
sluggishness. Such variables exhibit (business) cycles.
 Specification bias – exclusion of important variables or incorrect functional forms
 Lags – in a time series regression, value of a variable for a certain period depends on
the variable’s previous period value.
 Manipulation of data – if the raw data is manipulated (extrapolated or interpolated),
autocorrelation might result.

Autocorrelation can be negative as well as positive. The most common kind of serial
correlation is first order serial correlation, in which the current period’s error term is a
function of the previous period’s error term:

ut = ρut−1 + εt,   −1 < ρ < 1

This is also called the first order autoregressive, AR(1), scheme. The disturbance term εt
satisfies all the basic assumptions of the classical linear model.

Consequences of serial correlation


When the disturbance terms exhibit serial correlation, the values as well as the standard
errors of the parameter estimates are affected.
1) The estimates of the parameters remain unbiased even in the presence of
autocorrelation, provided the X’s and the u’s are uncorrelated.

2) Serial correlation increases the variance of the OLS estimators; the minimum
variance property of the OLS parameter estimates is violated, meaning the OLS
estimators are no longer efficient.

Figure 3: The sampling distribution of the OLS estimator with and without serial correlation.

3) Due to serial correlation, the variance of the disturbance term ui may be
underestimated. This problem is particularly pronounced when there is positive
autocorrelation.
4) If the ui are autocorrelated, then predictions based on the ordinary least squares
estimates will be inefficient because of the larger variances of the parameter
estimates. Since the variances of the OLS estimators are not minimal compared with
those of other estimators, the standard error of the forecast from OLS will not have
the least value.
Detecting Autocorrelation
Some rough idea about the existence of autocorrelation may be gained by plotting the
residuals either against their own lagged values or against time.

There are more accurate tests for the incidence of autocorrelation. The most common test of
autocorrelation is the Durbin-Watson Test.

The Durbin-Watson d Test


The test for serial correlation that is most widely used is the Durbin–Watson d test. This test
is appropriate only for the first order autoregressive scheme ut = ρut−1 + εt. The test statistic
is

d = Σ(et − et−1)² / Σet² ≈ 2(1 − ρ̂)

so d lies between 0 and 4, and a value near 2 indicates the absence of first order
autocorrelation (a sketch using statsmodels follows the assumptions below).
This test is, however, applicable where the underlying assumptions are met:
 The regression model includes an intercept term
 The serial correlation is first order in nature

 The regression does not include the lagged dependent variable as an explanatory
variable
 There are no missing observations in the data
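A minimal sketch on hypothetical AR(1) data, using the durbin_watson helper from
statsmodels:

# Durbin–Watson sketch: d ≈ 2(1 - rho_hat); d well below 2 flags positive autocorrelation.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                    # AR(1) errors: u_t = 0.7 u_{t-1} + e_t
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1 + 2 * x + u

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print(durbin_watson(resid))              # expect a value well below 2 here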

Remedial Measures for Autocorrelation


Since in the presence of serial correlation the OLS estimators are inefficient, it is essential to
seek remedial measures.
1) The solution depends on the source of the problem.
 If the source is omitted variables, the appropriate solution is to include these
variables in the set of explanatory variables.
 If the source is misspecification of the mathematical form, the relevant approach is
to change the functional form.
2) If these sources are ruled out then the appropriate procedure will be to transform the
original data so as to produce a model whose random variable satisfies the assumptions of
non-autocorrelation. But the transformation depends on the pattern of autoregressive
structure.
Multicollinearity: Exact Linear Correlation between Regressors
One of the classical assumptions of the regression model is that the explanatory variables are
uncorrelated. If the assumption that no independent variable is a perfect linear function of one
or more other independent variables is violated, we have the problem of multicollinearity. If
the explanatory variables are perfectly linearly correlated, the parameters become
indeterminate: it is impossible to find numerical values for each parameter, and the method of
estimation breaks down.
If the correlation coefficient is 0, the variables are called orthogonal; there is no problem of
multicollinearity. Neither of the above two extreme cases is often met. But some degree of
inter-correlation is expected among the explanatory variables, due to the interdependence of
economic variables.

Multicollinearity is not a condition that either exists or does not exist in economic functions,
but rather a phenomenon inherent in most relationships due to the nature of economic
magnitudes. There is, however, no conclusive evidence suggesting that a certain degree of
multicollinearity will seriously affect the parameter estimates.
Reasons for Existence of Multicollinearity
There is a tendency for economic variables to move together over time: income, consumption,
savings, investment, prices and employment tend to rise in periods of economic expansion
and decrease in periods of recession. The use of lagged values of some explanatory variables
as separate independent factors in the relationship also causes multicollinearity problems.
Example: Consumption = f(Yt, Yt-1, ...)

Thus, multicollinearity is to be expected in economic variables. Although multicollinearity is
present in cross-sectional data as well, it is more a problem of time series data.

Consequences of Multicollinearity
Recall that, if the assumptions of the classical linear regression model are satisfied, the OLS
estimators are BLUE. As stated above, if there is perfect multicollinearity between the
explanatory variables, then it is not possible to determine the regression coefficients and their
standard errors. But if collinearity among the X-variables is high, yet not perfect, the
following might be expected (although the effect of collinearity is controversial and by no
means conclusive).
1) The estimates of the coefficients are statistically unbiased. Even if an equation has
significant multicollinearity, the estimates of the parameters will still be centered
around the true population parameters.
2) When multicollinearity is present in a function, the variances and therefore the
standard errors of the estimates will increase, although some econometricians argue
that this is not always the case.

3) The computed t-ratios will fall, i.e. insignificant t-ratios will be observed in the
presence of multicollinearity. Since t = β̂k / se(β̂k), as se(β̂k) increases, t falls. Thus,
because of the high variances of the estimates, one may increasingly accept the null
hypothesis that the relevant true population value is zero.

4) A high R² but few significant t-ratios are expected in the presence of
multicollinearity: one or more of the partial slope coefficients may be individually
statistically insignificant on the basis of the t-test, yet R² may be very high. Indeed,
this is one of the signals of multicollinearity – insignificant t-values but high overall
R² and F-values. Because multicollinearity has little effect on the overall fit of the
equation, it also has little effect on the use of that equation for prediction or
forecasting.

Detecting Multicollinearity
Having studied the nature of multicollinearity and the consequences of multicollinearity, the
next question is how to detect multicollinearity. The main purpose in doing so is to decide
how much multicollinearity exists in an equation, not whether any multicollinearity exists. So
the important question is the degree of multicollinearity. But there is no one unique test that is
universally accepted. Instead, we have some rules of thumb for assessing the severity and
importance of multicollinearity in an equation. Some of the most commonly used approaches
are the following:

1) High R² but few significant t-ratios
This is the classical symptom of multicollinearity. Often, if R² is high (R² > 0.8), the F-test
will in most cases reject the hypothesis that the partial slope coefficients are simultaneously
equal to zero, but the individual t-tests will show that none or very few of the partial slope
coefficients are statistically different from zero. In other words, multicollinearity that is
severe enough to substantially lower t-scores does very little to decrease R² or the F-statistic.
So the combination of a high R² with low calculated t-values for the individual regression
coefficients is an indicator of the possible presence of severe multicollinearity.
Drawback: a non-multicollinear explanatory variable may still have a significant coefficient
even if there is multicollinearity between two or more other explanatory variables. Thus,
equations with high levels of multicollinearity will often have one or two regression
coefficients significantly different from zero, making the “high R², low t” rule a poor
indicator in such cases.
2) High pair-wise (simple) correlation coefficients among the regressors
If the pair-wise correlation coefficients are high in absolute value, then it is highly probable
that the X’s are highly correlated and that multicollinearity is a potential problem. The
question is how high r should be to suggest multicollinearity; some suggest that if r is in
excess of 0.80, multicollinearity could be suspected. Another rule of thumb is that
multicollinearity is a potential problem when the squared simple correlation coefficient
between two regressors exceeds the unadjusted R²; that is, two X’s are severely multicollinear
if r²XiXj > R². A major problem with this approach is that although high zero-order
correlations may suggest collinearity, they need not be high for collinearity to exist in any
specific case. A short sketch of the pair-wise check follows.
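A minimal sketch of the pair-wise check, assuming hypothetical regressors held in a pandas
DataFrame:

# Pair-wise correlation matrix of the regressors; |r| > 0.8 flags possible collinearity.
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + 0.1 * rng.normal(size=100),   # nearly collinear with x1
    "x3": rng.normal(size=100),                    # roughly orthogonal regressor
})
print(df.corr().round(2))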
3) VIF and Tolerance
The variance inflation factor (VIF) shows the speed with which the variances and covariances
of the estimators increase, i.e. how the variance of an estimator is inflated by the presence of
multicollinearity. In the two-regressor case, VIF is defined as

VIF = 1 / (1 − r²23)

where r23 is the correlation between the two explanatory variables. As r²23 approaches 1, the
VIF approaches infinity; if there is no collinearity, the VIF equals 1. As a rule of thumb, a
VIF of 10 or more indicates that multicollinearity is a severe problem. Tolerance is defined as
the inverse of VIF.
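statsmodels provides a helper implementing the general version of this formula,
VIFj = 1/(1 − R²j); a minimal sketch on hypothetical data:

# VIF for each regressor via statsmodels; values of 10 or more signal severe collinearity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=100)   # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))

for j in (1, 2):                               # column 0 is the constant
    print(f"VIF(x{j}) =", variance_inflation_factor(X, j))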
4) Other, more formal tests for multicollinearity
The use of formal tests to indicate the severity of multicollinearity in a particular sample is
controversial. Some econometricians reject even the simple indicators developed above,
mainly because of the limitations cited; others use a number of more formal tests, but none of
these is accepted as the best.

Remedies for Multicollinearity

There is no automatic answer to the question “what can be done to minimize the problem of
multicollinearity?” The possible solutions, if multicollinearity exists in a function, vary
depending on its severity, on the availability of other data sources, on the importance of the
factors which are multicollinear, and on the purpose for which the function is used. However,
some alternative remedies can be suggested for reducing the effect of multicollinearity.
1) Do nothing
Some writers have suggested that if multicollinearity does not seriously affect the estimates of
the coefficients, one may tolerate its presence in the function. In a sense, multicollinearity is
similar to a non-life-threatening human disease that requires an operation only if it is causing
a significant problem. A remedy for multicollinearity should only be considered if and when
the consequences cause insignificant t-scores or wildly unreliable estimated coefficients.
2) Dropping one or more of the multicollinear variables
When faced with severe multicollinearity, one of the simplest remedies is to drop one or more
of the collinear variables. Since multicollinearity is caused by correlation between the
explanatory variables, once the multicollinear variables are dropped the correlation no longer
exists.
Some people argue that dropping a variable from the model may introduce specification
error, or specification bias: since OLS estimators are still BLUE despite near collinearity,
omitting a variable may seriously mislead us as to the true values of the parameters.
Example: If economic theory says that income and wealth should both be included in the
model explaining the consumption expenditure, dropping the wealth variable would constitute
specification bias.
3) Transformation of the variables
If the variables involved are all extremely important on theoretical grounds, neither doing
nothing nor dropping a variable could be helpful. But it is sometimes possible to transform the
variables in the equation to get rid of at least some of the multicollinearity.
Two common such transformations are:
(i) to form a linear combination of the multicollinear variables
(ii) to transform the equation into first differences (or logs)
The technique of forming a linear combination of two or more of the multicollinear variables
consists of:
 creating a new variable that is a function of the multicollinear variables
 using the new variable to replace the old ones in the regression equation (if X1 and X2
are highly multicollinear, a new variable X3 = X1 + X2 or X3 = K1X1 + K2X2 might be
substituted for both of the multicollinear variables in a re-estimation of the model)


The second kind of transformation to consider as a possible remedy for severe
multicollinearity is to change the functional form of the equation. A first difference is nothing
more than the change in a variable from the previous time period: ΔXt = Xt − Xt−1. If an
equation (or some of the variables in an equation) is switched from its normal specification to
a first difference specification, it is quite likely that the degree of multicollinearity will be
significantly reduced, for two reasons (see the sketch after this list):
 Since multicollinearity is a sample phenomenon, any change in the definitions of the
variables in that sample will change the degree of multicollinearity.
 Multicollinearity takes place most frequently in time-series data, in which first
differences are far less likely to move steadily upward than are the aggregates from
which they are calculated.
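A minimal first-difference sketch using pandas (the series values are invented):

# First differences with pandas: Delta X_t = X_t - X_{t-1}.
import pandas as pd

levels = pd.Series([100, 104, 109, 115, 122], name="X")   # steadily trending levels
diffs = levels.diff().dropna()                            # 4, 5, 6, 7 - far less trending
print(diffs)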

4) Increase the sample size
Another solution to reduce the degree of multicollinearity is to attempt to increase the size of
the sample. A larger data set (often requiring new data collection) will allow more accurate
estimates than a small one, since a large sample normally reduces somewhat the variance of
the estimated coefficients, reducing the impact of multicollinearity. But for most economic
and business applications this solution is not feasible, as new data are generally impossible or
quite expensive to find. One way to increase the sample is to pool cross-sectional and time
series data.

5) Other Remedies

Several other methods have been suggested to reduce the degree of multicollinearity.
Multivariate statistical techniques such as factor analysis and principal component analysis,
or other techniques such as ridge regression, are often employed to solve the problem of
multicollinearity.

Unit 7: Non-linear Regression and Time Series Econometrics

7.1. Non-linear regression models: Overview

When we started our discussion of linear regression models, we stated that our concern in this
course is with models that are linear in the parameters; they may or may not be linear in the
variables. On the other hand, if a model is nonlinear in the parameters it is a nonlinear (in-the-
parameter) regression model whether the variables of such a model are linear or not.

However, one has to be careful here, for some models may look nonlinear in the parameters
but are inherently or intrinsically linear because with suitable transformation they can be
made linear-in-the-parameter regression models. But if such models cannot be linearized in
the parameters, they are called intrinsically nonlinear regression models. From now on when
we talk about a nonlinear regression model, we mean that it is intrinsically nonlinear. For
brevity, we will call them NLRM.

Consider now the famous Cobb–Douglas (C–D) production function. Letting Y = output, X2
= labor input, and X3 = capital input, we can write this function in three different ways:

Yi = β1 X2i^β2 X3i^β3 e^(ui)

ln Yi = α + β2 ln X2i + β3 ln X3i + ui,   where α = ln β1

Yi = β1 X2i^β2 X3i^β3 + ui

The first (multiplicative-error) form is intrinsically linear, since taking logs yields the second
form, which is linear in the parameters; the third (additive-error) form cannot be linearized in
the parameters and is intrinsically nonlinear. Another well-known, intrinsically nonlinear
function is the constant elasticity of substitution (CES) production function, of which the
Cobb–Douglas production function is a special case.
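A minimal sketch of estimating the intrinsically linear (log) form by OLS on hypothetical
data; the true parameters (β1 = 2.0, β2 = 0.7, β3 = 0.3) are invented for illustration:

# OLS on the log-linearized Cobb–Douglas form: ln Y = alpha + b2 ln X2 + b3 ln X3 + u.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
labor = rng.uniform(50, 150, 80)
capital = rng.uniform(20, 100, 80)
output = 2.0 * labor**0.7 * capital**0.3 * np.exp(rng.normal(0, 0.05, 80))

X = sm.add_constant(np.column_stack([np.log(labor), np.log(capital)]))
fit = sm.OLS(np.log(output), X).fit()
print(fit.params)    # approximately [ln 2.0, 0.7, 0.3]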

7.2. Time series Analysis

7.2.1. Linear Time Series

One of the most popular analytical tools in econometrics and statistics is the linear time series
model. We can basically summarize what we will be doing as follows. First, what is a time
series? A time series consists of a set of values of a variable y observed at equally spaced
intervals of time; the subscript t denotes a variable observed at a given sampling frequency.
Initially, low-frequency time series were studied: yearly data, and then quarterly data, were
used in macroeconomics. But as computer technology and financial transactions became
more and more complex, financial econometricians now work even with frequencies of
millisecond intervals. We will mainly focus on the relatively lower-frequency aspects of time
series.

7.2.2. Why do we study time series in econometrics?

The main aim of time series analysis is to describe the principal time series characteristics of
various macroeconomic or financial variables. Historically, Yule (1927), Slutsky (1937) and
Wold (1938) were the main pioneers of time series methods in econometrics.
Econometricians were mainly analyzing long- and short-term cyclical decompositions of
macro time series; later, more was done with multiple regression and simultaneous-equation
models. Research by the operations research community focused more on smoothing and
forecasting time series. The Box–Jenkins (1976) methodology was then adopted by
economists and engineers alike, and there have since been important developments in the
field. In a univariate context, an economist tries to understand whether there is any seasonal
pattern, or whether a macro variable shows a trend. More importantly, if there is a predictable
pattern in a macro variable, we try to forecast the future realizations of that variable.
Forecasting is a very important output for policy makers and business.

7.2.3. Why do we trust time series?

There are various reasons why we see predictable patterns in various time series. One reason
is psychological: people do not change their habits immediately, and consumers’ choices
show a kind of momentum. For instance, once people start to demand housing, they tend to
follow one another, and house prices may rise monotonically for a long period of time.
Businessmen and other agents may also behave in a manner which makes production follow
a certain pattern. Time series analysis distinguishes and analyzes these trends.

7.3. Objectives of time series

Time series analysis has four main objectives.

1 Description

The first step involves plotting the data and obtaining descriptive measures of the time series.
Decomposing the cyclical and seasonal components of a given time series is also conducted
at this stage.

2 Explanations

Fitting the most appropriate specification is done at this stage, e.g. choosing the lag order and
the form of the linear model.

3 Predictions

Given the observed macro or financial data, one usually wants to predict the future values of
these series. This is the predictive stage, where the unknown future values are estimated.

4 Policy and control

Once the mathematical relationship between the input and the output is found, the level of
input needed to achieve a targeted output can be set. This is usually most useful in
engineering, but it is also relevant for policy analysis in macroeconomics.

References
Aaron, C.J., Marvin, B.J. and Rueben, C.B. 1996. Econometrics: Basic and Applied.
Macmillan Publishing Company, New York, and Collier, London.
Gujarati, D. 2004. Basic Econometrics. 4th ed. Tata McGraw-Hill.
Koutsoyiannis, A. 2001. Theory of Econometrics. 2nd ed. Replika Press Pvt. Ltd., New Delhi.
Maddala, G.S. Introduction to Econometrics. 2nd ed. Macmillan Publishing Company, New
York.
Wooldridge, J.M. 2005. Introductory Econometrics. 3rd ed.

