
Chapter 2: Simple Linear Regression

2.1 The Concept of Regression Analysis


Regression analysis is almost certainly the most important tool at the econometrician’s
disposal. But what is regression analysis? Regression analysis is concerned with the study of
the dependence of one variable, the dependent variable, on one or more other variables, the
explanatory variables, with a view to estimating and/or predicting the (population) mean or
average value of the former in terms of the known or fixed (in repeated sampling) values of
the latter. In very general terms, regression is concerned with describing and evaluating the
relationship between a given variable and one or more other variables. More specifically,
regression is an attempt to explain movements in a variable by reference to movements in
one or more other variables. Regression allows you to estimate how a dependent
variable changes as the independent variable(s) change. To make this more concrete, denote
the variable whose movements the regression seeks to explain by y and the variables which
are used to explain those variations by x1, x2, . . . , xk. Hence, in this relatively simple setup, it
would be said that variations in k variables (the xs) cause changes in some other variable, y.
This chapter will be limited to the case where the model seeks to explain changes in
the variable y by reference to a single explanatory variable x (this restriction will be
removed in a later chapter). In regression, the dependent variable (y) and the
independent variable(s) (the xs) are treated very differently. There are various largely
interchangeable names for y and the xs, as shown in Table 2.1. All of these terms can
be used synonymously in regression analysis.
Table 2.1 Other names for the dependent and independent variables

Dependent variable        Independent variable
Explained variable        Explanatory variable
Predictand                Predictor
Regressand                Regressor
Response                  Stimulus
Endogenous                Exogenous
Outcome                   Covariate
Controlled variable       Control variable
Effect variable           Causal variable
Although it is a matter of personal taste and tradition, in this course we will use the
dependent variable/explanatory variable terminology or the more neutral terms
regressand and regressor.
2.2 Regression and Correlation
Correlation and regression are powerful statistical techniques that have wide
application in data analysis. It is not uncommon for correlation and regression to be
confused with one another, as correlation analysis often leads into regression analysis.
Although correlation analysis is closely related to regression analysis, it is conceptually
very different from regression. Correlation is a statistical technique that measures the
extent to which two or more variables fluctuate in relation to each other. That is,
correlation measures the strength of a linear relationship between two quantitative
variables: it measures how two variables move in relation to one another and expresses
the extent to which they are linearly related (meaning they change together at a
constant rate). It is a common tool for describing simple relationships without making a
statement about cause and effect.
Positive correlation: the two variables move up or down together, in the same
direction.
Negative correlation: the two variables move in opposite directions.
Zero or no correlation: there is no (linear) relationship between the two variables; as
one variable moves, the other moves in an unrelated way.
Correlation vs. regression
Basically, you need to know when to use correlation and when to use regression. Use
correlation for a quick and simple summary of the direction and strength of the
relationship between two or more numeric variables. Use regression when you are
looking to predict, optimize, or explain a numerical response, that is, how x
influences y.
Basis of comparison          Correlation                                    Regression
Definition                   A statistical measure of the association       Describes how an independent variable is
                             or co-relationship between two variables       associated with the dependent variable
Objective                    To find a value expressing the relationship    To estimate the values of a random variable
                             between variables                              based on the values of a fixed variable
When to use                  To summarize the direction and strength of     To predict, optimize, or explain a
                             the relationship between two variables         numerical response
Dependent vs. independent    No distinction between the variables           The two variables play different roles
Quantifies direction?        Yes                                            Yes
Quantifies strength?         Yes                                            Yes
Shows cause and effect?      No                                             Yes
Predicts and optimizes?      No                                             Yes
X and Y interchangeable?     Yes                                            No
Uses a fitted equation?      No                                             Yes
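To make the contrast concrete, here is a minimal Python sketch (our own illustration; the income/consumption figures are the first sample used later in Table 2.4). It computes the correlation coefficient, which is symmetric in x and y, and then the least-squares slopes of y on x and of x on y, which differ, reflecting the table above.

    import numpy as np

    # Hypothetical sample: weekly income (x) and weekly consumption (y), $
    x = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
    y = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)

    # Correlation is symmetric: corr(x, y) = corr(y, x)
    r = np.corrcoef(x, y)[0, 1]
    print(f"correlation r = {r:.3f}")

    # Regression is not symmetric: slope of y on x differs from slope of x on y
    slope_yx = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # y = a + b*x
    slope_xy = np.cov(x, y, ddof=1)[0, 1] / np.var(y, ddof=1)  # x = c + d*y
    print(f"slope of y on x: {slope_yx:.3f}; slope of x on y: {slope_xy:.3f}")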
Regression and causation
Although regression analysis deals with the dependence of one variable on other
variables, it does not necessarily imply causation. Regression describes dependence
amongst variables within a model, but a statistical relationship alone cannot establish
causation. Causation means that a change in one variable causes a change in another
variable: a causal relationship exists when one variable in a data set has a direct
influence on another, so that one event triggers the occurrence of the other. The first
event is called the cause and the second event is called the effect; a causal
relationship is also referred to as cause and effect.
Examples:
Smoking causes lung cancer.
Rain clouds cause rain.
Exercise causes muscle growth.
Overeating causes weight gain.

Types of linear regression


Simple Linear Regression Model
Simple linear regression is a linear regression model with a single explanatory variable.
When there is only one independent variable in the linear regression model, the model
is termed a simple linear regression model. When there is more than one independent
variable in the model, the linear model is termed a multiple linear regression model.
The equation for the regression line of Y on X for the population is given as:

Y = f(X) + U
  = β0 + β1X + U ........................................... (2.3.1)

where Y is termed the dependent or study variable and X is termed the independent or
explanatory variable. The terms β0 and β1 are the parameters of the model. The
parameter β0 is termed the intercept, and the parameter β1 is termed the slope
parameter. These parameters are usually called regression coefficients. The
unobservable error component U accounts for the failure of the data to lie exactly on a
straight line and represents the difference between the true and observed realizations
of Y. U surrogates for all those variables that are omitted from the model but that
collectively affect Y. The obvious question is: Why not introduce these variables into
the model explicitly? Stated otherwise, why not develop a multiple regression model
with as many variables as possible? The reasons are many.

A) Vagueness of theory: The theory, if any, determining the behavior of Y may be, and often
is, incomplete. We might know for certain that weekly income X influences weekly
consumption expenditure Y, but we might be ignorant or unsure about the other
variables affecting Y. Therefore, ui may be used as a substitute for all the excluded or
omitted variables from the model.
B) Unavailability of data: Even if we know what some of the excluded variables are and
therefore consider a multiple regression rather than a simple regression, we may not have
quantitative information about these variables. It is a common experience in empirical
analysis that the data we would ideally like to have often are not available. For example,
in principle we could introduce family wealth as an explanatory variable in addition to the
income variable to explain family consumption expenditure. But unfortunately,
information on family wealth generally is not available. Therefore, we may be forced to
omit the wealth variable from our model despite its great theoretical relevance in
explaining consumption expenditure.
C) Core variables versus peripheral variables: Assume in our consumption-income example
   that besides income X1, the number of children per family X2, sex X3, religion X4,
   education X5, and geographical region X6 also affect consumption expenditure. But it is
   quite possible that the joint influence of all or some of these variables may be so small
   and at best nonsystematic or random that as a practical matter and for cost
   considerations it does not pay to introduce them into the model explicitly. One hopes
   that their combined effect can be treated as a random variable ui.

D) Intrinsic randomness in human behavior: Even if we succeed in introducing all the relevant
   variables into the model, there is bound to be some "intrinsic" randomness in individual
   Y's that cannot be explained no matter how hard we try. Humans are not machines that
   will do as instructed, so there is an unpredictable element. For example, for unexplained
   reasons an increase in income may not influence consumption. Thus the disturbance
   term captures such human behavior that is left unexplained by the economic model.
   The disturbances, the u's, may very well reflect this intrinsic randomness.
E) Poor proxy variables: Although the classical regression model (to be developed in Chapter
   3) assumes that the variables Y and X are measured accurately, in practice the data may
   be plagued by errors of measurement. A variable may not be measurable accurately,
   either because of data collection difficulties or because it is inherently unmeasurable,
   and a proxy variable must be used instead. The disturbance term can in these
   circumstances be thought of as representing this measurement error of the variable(s).
   Example: measuring taste is not an easy job.

F) Principle of parsimony: Following Occam’s razor, we would like to keep our regression
model as simple as possible. If we can explain the behavior of Y “substantially” with
two or three explanatory variables and if our theory is not strong enough to suggest
what other variables might be included, why introduce more variables? Let ui
represent all other variables. Of course, we should not exclude relevant and
important variables just to keep the regression model simple.
G) Wrong functional form: Even if we have theoretically correct variables explaining a
   phenomenon and even if we can obtain data on these variables, very often we do not
   know the form of the functional relationship between the regressand and the
   regressors. Is consumption expenditure a linear (in the variable) function of income or a
   nonlinear (in the variable) function? If it is the former, Yi = β1 + β2Xi + ui is the proper
   functional relationship between Y and X, but if it is the latter, Yi = β1 + β2Xi + β3Xi² + ui
   may be the correct functional form. In two-variable models the functional form of the
   relationship can often be judged from the scattergram. But in a multiple regression
   model, it is not easy to determine the appropriate functional form, for graphically we
   cannot visualize scattergrams in multiple dimensions. For all these reasons, the stochastic
   disturbances ui assume an extremely critical role in regression analysis, which we will
   see as we progress; the short simulation after this list illustrates the idea.
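To see how the disturbance term soaks up omitted variables and intrinsic randomness, consider the following small simulation (a sketch of our own, not part of the original text; the coefficient values are arbitrary). Consumption is generated from income plus an omitted wealth variable and pure noise, but only income is used as a regressor, so everything else ends up in u.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    income = rng.uniform(80, 260, n)    # observed regressor X
    wealth = rng.normal(50, 10, n)      # omitted variable (data unavailable)
    noise  = rng.normal(0, 5, n)        # intrinsic randomness in behavior

    # True data-generating process: consumption depends on income AND wealth
    consumption = 10 + 0.6 * income + 0.3 * wealth + noise

    # Fit the simple model consumption = b0 + b1*income + u by least squares
    b1 = np.cov(income, consumption, ddof=1)[0, 1] / np.var(income, ddof=1)
    b0 = consumption.mean() - b1 * income.mean()
    u  = consumption - (b0 + b1 * income)   # u absorbs wealth and noise

    print(f"estimated intercept {b0:.2f}, slope {b1:.3f}")
    print(f"mean of the fitted disturbances: {u.mean():.2e}")  # ~0 by construction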
Generally speaking, simple linear regression analysis is concerned with the study of the
dependence of one variable, the dependent variable, on a single other variable, called
the explanatory or independent variable. Moreover, the true relationship that connects
the variables involved is split into two parts: a systematic (or explained) variation and a
random (or unexplained) variation. Using (2.3.1) we can disaggregate the two
components as follows:

Y = β0 + β1X + U

That is,

[variation in Y] = [systematic variation] + [random variation]

In our analysis we will assume that the "independent" variable X is nonrandom. We will
also assume a linear model. Since this course is concerned with linear models like
(2.3.1), it is essential to know what the term linear really means, for it can be
interpreted in two different ways. These are:

a) Linearity in the variables


b) Linearity in parameters
i) Linearity in the variables implies that an equation is a linear model if its graph is a
   straight line.
   Example: Consider the regression function Y = β0 + β1X. The slope (or derivative) of
   this equation, dY/dX = β1, is independent of X, so the model is linear in the variable.
   But if Y = β0 + β1X², then the variable X is raised to the second power, so the model
   is nonlinear in the variable. This is because the slope or derivative is not independent
   of the value taken by X: dY/dX = 2β1X. Hence the function is not linear in X, since X
   appears with a power of 2.

ii) Linearity in the parameters implies that the parameters (the β's) are raised only to
    their first power. In this interpretation Y = β0 + β1X² is a linear regression model, but
    Y = β0 + β1²X is not; the latter is an example of a model that is nonlinear in the
    parameters. Of the two interpretations of linearity, linearity in the parameters is what
    matters for the development of regression theory. Thus the term linear regression
    means a regression that is linear in the parameters, the β's; it may or may not be
    linear in the explanatory variables.
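A quick way to see why linearity in the parameters is what matters in practice: a model such as Y = β0 + β1X², though nonlinear in X, can still be fitted by ordinary least squares simply by regressing Y on the constructed regressor Z = X². The Python sketch below is our own illustration with arbitrary true parameter values.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 200)
    y = 2.0 + 0.5 * x**2 + rng.normal(0, 1, 200)  # nonlinear in X, linear in betas

    z = x**2                                      # treat X^2 as the regressor
    b1 = np.cov(z, y, ddof=1)[0, 1] / np.var(z, ddof=1)
    b0 = y.mean() - b1 * z.mean()
    print(f"b0 = {b0:.2f} (true 2.0), b1 = {b1:.3f} (true 0.5)")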
The following discussion stresses that regression analysis is largely concerned with
estimating and/or predicting the (population) mean or average value of the dependent
variable on the basis of the known or fixed values of the explanatory variable(s).

2.4 Population Regression Function vs. Sample Regression Function

Population Regression Function

As noted in Section 2.1, regression analysis is largely concerned with estimating
and/or predicting the (population) mean value of the dependent variable on the basis
of the known or fixed values of the explanatory variable(s). To understand this,
consider the data given in Table 2.2.
Table 2.2 Weekly family income X ($) and weekly family consumption expenditure Y ($)

Income X →            80   100   120   140   160   180   200   220   240   260
Weekly family         55    65    79    80   102   110   120   135   137   150
consumption           60    70    84    93   107   115   136   137   145   152
expenditure Y, $      65    74    90    95   110   120   140   140   155   175
                      70    80    94   103   116   130   144   152   165   178
                      75    85    98   108   118   135   145   157   175   180
                       –    88     –   113   125   140     –   160   189   185
                       –     –     –   115     –     –     –   162     –   191
Total                325   462   445   707   678   750   685  1043   966  1211
Conditional means
of Y, E(Y | X)        65    77    89   101   113   125   137   149   161   173

The data in the above table refer to a total population of 60 families in a hypothetical
community and their weekly income (X) and weekly consumption expenditure (Y), in
dollars. The 60 families are divided into 10 income groups (from $80 to $260) and the
weekly expenditures of each family in the various groups are as shown in the table.
Therefore, we have 10 fixed X values and the corresponding Y values against each of
the X values; so to speak, there are 10 Y subpopulations. There is considerable
variation in weekly consumption expenditure in each income group, which can be
seen clearly from Figure 2.1. But the general picture that one gets is that, despite the
variability of weekly consumption expenditure within each income bracket, on the
average, weekly consumption expenditure increases as income increases. To see this
clearly, in Table 2.2 we have given the mean, or average, weekly consumption
expenditure corresponding to each of the 10 levels of income. Thus, corresponding to
the weekly income level of $80, the mean consumption expenditure is $65, while
corresponding to the income level of $200, it is $137. In all we have 10 mean values
for the 10 subpopulations of Y. We call these mean values conditional expected
values, as they depend on the given values of the (conditioning) variable X.
Symbolically, we denote them as E(Y | X), which is read as the expected value of Y
given the value of X.
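The conditional means in Table 2.2 can be reproduced mechanically: group the Y values by income level and average each group. The short Python sketch below does this with the table's own data.

    import numpy as np

    # Population of Table 2.2: weekly consumption Y for each income level X
    population = {
        80:  [55, 60, 65, 70, 75],
        100: [65, 70, 74, 80, 85, 88],
        120: [79, 84, 90, 94, 98],
        140: [80, 93, 95, 103, 108, 113, 115],
        160: [102, 107, 110, 116, 118, 125],
        180: [110, 115, 120, 130, 135, 140],
        200: [120, 136, 140, 144, 145],
        220: [135, 137, 140, 152, 157, 160, 162],
        240: [137, 145, 155, 165, 175, 189],
        260: [150, 152, 175, 178, 180, 185, 191],
    }

    # Conditional expectation E(Y | X): the mean of each Y subpopulation
    for x, ys in population.items():
        print(f"E(Y | X = {x}) = {np.mean(ys):.0f}")  # 65, 77, 89, ..., 173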

The dark circled points in Figure 2.1 show the conditional mean values of Y
against the various X values. If we join these conditional mean values, we obtain
what is known as the population regression line (PRL), or more generally, the

population regression curve. More simply, it is the regression of Y on X. The adjective
"population" comes from the fact that we are dealing in this example with the entire
population of 60 families. Of course, in reality a population may have many families.

FIGURE 2.1 Conditional distribution of consumption expenditure for various levels of income (data of Table 2.2). [Scatter of weekly consumption expenditure, $, against weekly income, $, with the conditional means E(Y | X) marked.]

Geometrically, then, a population regression curve is simply the locus of the
conditional means of the dependent variable for the fixed values of the
explanatory variable(s). More simply, it is the curve connecting the means of the
subpopulations of Y corresponding to the given values of the regressor X. It can
be depicted as in Figure 2.2.

FIGURE 2.2 Population regression line (data of Table 2.2). [For each income level, e.g., X = $220, the distribution of weekly consumption expenditure Y is spread around its conditional mean, here E(Y | X = 220) = $149; the regression line connects these conditional means.]

This figure shows that for each X (i.e., income level) there is a population of Y values
(weekly consumption expenditures) that are spread around the (conditional) mean of
those Y values. For simplicity, we are assuming that these Y values are distributed
symmetrically around their respective (conditional) mean values. And the regression
line (or curve) passes through these (conditional) mean values. With this background,
the reader may find it instructive to reread the definition of regression given in
Section 2.1.
2.5 THE CONCEPT OF THE POPULATION REGRESSION FUNCTION (PRF)
From the preceding discussion and Figures 2.1 and 2.2, it is clear that each
conditional mean E(Y | Xi) is a function of Xi, where Xi is a given value of X.
Symbolically,
E(Y | Xi) = f(Xi) --------------------------------------------------------------------- (2.5.1)

where f(Xi) denotes some function of the explanatory variable X. In our example,
E(Y | Xi) is a linear function of Xi. Equation (2.5.1) is known as the conditional
expectation function (CEF) or population regression function (PRF) or population
regression (PR) for short. It states merely that the expected value of the distribution
of Y given Xi is functionally related to Xi. In simple terms, it tells how the mean or
average response of Y varies with X. What form does the function f(Xi) assume?
This is an important question because in real situations we do not have the entire
population available for examination. The functional form of the PRF is therefore an
empirical question, although in specific cases theory may have something to say. For
example, an economist might posit that consumption expenditure is linearly related to
income. Therefore, as a first approximation or a working hypothesis, we may assume
that the PRF E(Y | Xi) is a linear function of Xi, say, of the type

E(Y | Xi) = β1 + β2 Xi ---------------------------------------------------------------- (2.5.2)

where β1 and β2 are unknown but fixed parameters known as the regression
coefficients; β1 and β2 are also known as the intercept and slope coefficients,
respectively. Equation (2.5.2) itself is known as the linear population regression
function. Some alternative expressions used in the literature are linear population
regression model or simply linear population regression. In the sequel, the terms
regression, regression equation, and regression model will be used
synonymously. In regression analysis our interest is in estimating PRFs like
(2.5.2), that is, estimating the values of the unknowns β1 and β2 on the basis of
observations on Y and X. This topic will be studied in detail in Chapter 3.
2.6 THE MEANING OF THE TERM LINEAR
Since this text is concerned primarily with linear models like (2.5.2), it is essential
to know what the term linear really means, for it can be interpreted in two different
ways.
Linearity in the Variables
The first and perhaps more "natural" meaning of linearity is that the conditional
expectation of Y is a linear function of Xi, such as, for example, (2.5.2);
geometrically, the regression curve in this case is a straight line. In this
interpretation, a regression function such as E(Y | Xi) = β1 + β2 Xi² is not a linear
function because the variable X appears with a power, or index, of 2.
Linearity in the Parameters

The second interpretation of linearity is that the conditional expectation of Y,
E(Y | Xi), is a linear function of the parameters, the β's; it may or may not be linear
in the variable X. In this interpretation E(Y | Xi) = β1 + β2 Xi² is a linear (in the
parameter) regression model. To see this, let us suppose X takes the value 3.
Therefore, E(Y | X = 3) = β1 + 9β2, which is obviously linear in β1 and β2. Such
models are thus linear regression models, that is, models linear in the parameters.
Now consider the model E(Y | Xi) = β1 + β2² Xi. Now suppose X = 3; then we obtain
E(Y | Xi) = β1 + 3β2², which is nonlinear in the parameter β2. The preceding model
is an example of a nonlinear (in the parameter) regression model. Of the two
interpretations of linearity, linearity in the parameters is relevant for the development
of the regression theory to be presented shortly. Therefore, from now on the term
"linear" regression will always mean a regression that is linear in the parameters,
the β's (that is, the parameters are raised to the first power only). It may or may not
be linear in the explanatory variables, the X's. Schematically, we have Table 2.3.
Thus, E(Y | Xi) = β1 + β2 Xi, which is linear both in the parameters and the variable,
is an LRM, and so is E(Y | Xi) = β1 + β2 Xi², which is linear in the parameters but
nonlinear in the variable X.
TABLE 2.3 LINEAR REGRESSION MODELS

                                   Model linear in variables?
Model linear in parameters?        Yes                  No
Yes                                LRM                  LRM
No                                 NLRM                 NLRM

Note: LRM = linear regression model
      NLRM = nonlinear regression model

2.7 STOCHASTIC SPECIFICATION OF THE PRF


It is clear from Figure 2.1 that, as family income increases, family consumption
expenditure on the average increases, too. But what about the consumption
expenditure of an individual family in relation to its (fixed) level of income? It is
obvious from Table 2.2 and Figure 2.1 that an individual family's consumption
expenditure does not necessarily increase as the income level increases. For
example, from Table 2.2 we observe that corresponding to the income level of $100
there is one family whose consumption expenditure of $65 is less than the
consumption expenditures of two families whose weekly income is only $80. But
notice that the average consumption expenditure of families with a weekly income of
$100 is greater than the average consumption expenditure of families with a weekly
income of $80 ($77 versus $65). What, then, can we say about the relationship
between an individual family’s consumption expenditure and a given level of income?
We see from Figure 2.1 that, given the income level of Xi , an individual family’s
consumption expenditure is clustered around the average consumption of all families at
that Xi, that is, around its conditional expectation. Therefore, we can express the
deviation of an individual Yi around its expected value as follows:
ui = Yi − E(Y | Xi)   or

Yi = E(Y | Xi) + ui ----------------------------------------------------------------- (2.7.1)
where the deviation ui is an unobservable random variable taking positive or
negative values. Technically, ui is known as the stochastic disturbance or
stochastic error term. How do we interpret (2.7.1)? We can say that the
expenditure of an individual family, given its income level, can be expressed as
the sum of two components: (1) E(Y | Xi), which is simply the mean consumption
expenditure of all the families with the same level of income. This component is
known as the systematic, or deterministic, component, and (2) ui, which is the
random, or nonsystematic, component. We shall examine shortly the nature of the
stochastic disturbance term, but for the moment assume that it is a surrogate or
proxy for all the omitted or neglected variables that may affect Y but are not (or
cannot be) included in the regression model. If E(Y | Xi) is assumed to be linear in
Xi, as in (2.5.2), Eq. (2.7.1) may be written as
Yi = E(Y | Xi) + ui

   = β1 + β2 Xi + ui ----------------------------------------------------------------- (2.7.2)

Equation (2.7.2) posits that the consumption expenditure of a family is linearly
related to its income plus the disturbance term. Thus, the individual consumption
expenditures, given X = $80 (see Table 2.2), can be expressed as
Y1 = 55 = β1 + β2(80) + u1
Y2 = 60 = β1 + β2(80) + u2
Y3 = 65 = β1 + β2(80) + u3 ------------------------------------------------------- (2.7.3)
Y4 = 70 = β1 + β2(80) + u4
Y5 = 75 = β1 + β2(80) + u5

Now if we take the expected value of (2.7.1) on both sides, we obtain

E(Yi | Xi) = E[E(Y | Xi)] + E(ui | Xi)

           = E(Y | Xi) + E(ui | Xi) -------------------------------------------------- (2.7.4)


where use is made of the fact that the expected value of a constant is that constant
itself. Notice carefully that in (2.7.4) we have taken the conditional expectation,
conditional upon the given X's. Since E(Yi | Xi) is the same thing as E(Y | Xi), Eq.
(2.7.4) implies that

E(ui | Xi) = 0 ---------------------------------------------------------------------- (2.7.5)

Thus, the assumption that the regression line passes through the conditional means
of Y (see Figure 2.2) implies that the conditional mean values of ui (conditional upon
the given X's) are zero. From the previous discussion, it is clear that (2.5.2) and (2.7.2)
are equivalent forms if E(ui | Xi) = 0. But the stochastic specification (2.7.2) has the
advantage that it clearly shows that there are other variables besides income that affect
consumption expenditure and that an individual family's consumption expenditure
cannot be fully explained only by the variable(s) included in the regression model.
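Equation (2.7.5) can be verified directly on the population of Table 2.2: within each income group the deviations ui = Yi − E(Y | Xi) average exactly to zero. A minimal check in Python (our own illustration, using two of the ten subpopulations; the same holds for all of them):

    import numpy as np

    # Two Y subpopulations from Table 2.2 (income levels $80 and $220)
    population = {80: [55, 60, 65, 70, 75],
                  220: [135, 137, 140, 152, 157, 160, 162]}

    for x, ys in population.items():
        ys = np.array(ys, dtype=float)
        u = ys - ys.mean()            # u_i = Y_i - E(Y | X_i)
        print(f"X = {x}: E(u | X) = {u.mean():.1f}")  # exactly zero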

2.8 THE SAMPLE REGRESSION FUNCTION (SRF)


By confining our discussion so far to the population of Y values corresponding to the
fixed X's, we have deliberately avoided sampling considerations (note that the data
of Table 2.2 represent the population, not a sample). But it is about time to face up
to the sampling problems, for in most practical situations what we have is but a
sample of Y values corresponding to some fixed X's. Therefore, our task now is to
estimate the PRF on the basis of the sample information. As an illustration, pretend
that the population of Table 2.2 was not known to us and the only information we had
was a randomly selected sample of Y values for the fixed X's as given in Table 2.4.
Unlike Table 2.2, we now have only one Y value corresponding to the given X's;
each Y (given Xi) in Table 2.4 is chosen randomly from similar Y's corresponding to
the same Xi from the population of Table 2.2. The question is: From the sample of
Table 2.4 can we predict the average weekly consumption expenditure Y in the
population as a whole corresponding to the chosen X's? In other words, can we
estimate the PRF from the sample data? As the reader surely suspects, we may not
be able to estimate the PRF "accurately" because of sampling fluctuations.

To see this, suppose we draw another random sample from the population of Table
2.2, as presented in Table 2.5. Plotting the data of Tables 2.4 and 2.5, we obtain the
scattergram given in Figure 2.4. In the scattergram two sample regression lines are
drawn so as to "fit" the scatters reasonably well: SRF1 is based on the first sample,
and SRF2 is based on the second sample. Which of the two regression lines
represents the "true" population regression line? If we avoid the temptation of looking
at Figure 2.1, which purportedly represents the PR, there is no way we can be
absolutely sure that either of the regression lines shown in Figure 2.4 represents the
true population regression line (or curve). The regression lines in Figure 2.4 are known
as the sample regression lines. Supposedly they represent the population regression
line, but because of sampling fluctuations they are at best an approximation of the
true PR. In general, we would get N different SRFs for N different samples, and these
SRFs are not likely to be the same.
TABLE 2.4 A RANDOM SAMPLE FROM          TABLE 2.5 ANOTHER RANDOM SAMPLE
THE POPULATION OF TABLE 2.2             FROM THE POPULATION OF TABLE 2.2

Y      X                                Y      X
70     80                               55     80
65    100                               88    100
90    120                               90    120
95    140                               80    140
110   160                               118   160
115   180                               120   180
120   200                               145   200
140   220                               135   220
155   240                               145   240
150   260                               175   260

FIGURE 2.4 Regression lines based on two different samples: SRF1 is fitted to the first sample (Table 2.4) and SRF2 to the second sample (Table 2.5). [Scatter of weekly consumption expenditure, $, against weekly income, $.]
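The two sample regression lines of Figure 2.4 can be reproduced by fitting ordinary least squares to the data of Tables 2.4 and 2.5 (a sketch using the usual closed-form formulas; the two sets of estimates differ purely because of sampling fluctuation):

    import numpy as np

    x  = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
    y1 = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)   # Table 2.4
    y2 = np.array([55, 88, 90, 80, 118, 120, 145, 135, 145, 175], dtype=float)   # Table 2.5

    def ols(x, y):
        """Least-squares intercept and slope for y = b0 + b1*x."""
        b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
        return y.mean() - b1 * x.mean(), b1

    for name, y in [("SRF1 (sample 1)", y1), ("SRF2 (sample 2)", y2)]:
        b0, b1 = ols(x, y)
        print(f"{name}: Y^ = {b0:.2f} + {b1:.4f} X")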
Now, analogously to the PRF that underlies the population regression line, we can develop the
concept of the sample regression function (SRF) to represent the sample regression line.
The sample counterpart of (2.5.2) may be written as

Yˆi = βˆ1 + βˆ2 Xi ------------------------------------------------------------------------------ (2.8.1)

where Yˆ is read as "Y-hat" or "Y-cap", and

Yˆi = estimator of E(Y | Xi)
βˆ1 = estimator of β1
βˆ2 = estimator of β2

Note that an estimator, also known as a (sample) statistic, is simply a rule or formula or
method that tells how to estimate the population parameter from the information provided by the
sample at hand. A particular numerical value obtained by the estimator in an application is
known as an estimate. Now just as we expressed the PRF in two equivalent forms, (2.5.2) and
(2.7.2), we can express the SRF (2.8.1) in its stochastic form as follows:

Yi = βˆ1 + βˆ2 Xi + uˆi ---------------------------------------------------------------- (2.8.2)

where, in addition to the symbols already defined, uˆi denotes the (sample) residual term.
Conceptually uˆi is analogous to ui and can be regarded as an estimate of ui. It is introduced
in the SRF for the same reasons as ui was introduced in the PRF. To sum up, then, we find our
primary objective in regression analysis is to estimate the PRF

Yi = β1 + β2 Xi + ui --------------------------------------------------------------------- (2.7.2)

on the basis of the SRF

Yi = βˆ1 + βˆ2 Xi + uˆi ---------------------------------------------------------------- (2.8.2)

because more often than not our analysis is based upon a single sample from some population.
But because of sampling fluctuations our estimate of the PRF based on the SRF is at best an
approximate one. This approximation is shown diagrammatically in Figure 2.5.
FIGURE 2.5 Sample and population regression lines. [SRF: Yˆi = βˆ1 + βˆ2 Xi; PRF: E(Y | Xi) = β1 + β2 Xi. For a given Xi, the residual uˆi is the deviation of the observed Yi from the SRF, and the disturbance ui is its deviation from the PRF.]
For X = Xi, we have one (sample) observation Y = Yi. In terms of the SRF, the
observed Yi can be expressed as

Yi = Yˆi + uˆi --------------------------------------------------------------------------- (2.8.3)

and in terms of the PRF, it can be expressed as

Yi = E(Y | Xi) + ui ---------------------------------------------------------------------- (2.8.4)

Now obviously in Figure 2.5 Yˆi overestimates the true E(Y | Xi) for the Xi shown
therein. By the same token, for any Xi to the left of the point A, the SRF will
underestimate the true PRF. But the reader can readily see that such over- and
underestimation is inevitable because of sampling fluctuations. The critical question
now is: Granted that the SRF is but an approximation of the PRF, can we devise a
rule or a method that will make this approximation as "close" as possible? In other
words, how should the SRF be constructed so that βˆ1 is as "close" as possible to the
true β1 and βˆ2 is as "close" as possible to the true β2, even though we will never
know the true β1 and β2?
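A first taste of the answer, developed fully in Chapter 3: choose βˆ1 and βˆ2 so that the residuals uˆi = Yi − Yˆi are as small as possible in the sum-of-squares sense (the method of least squares). The sketch below (our own illustration, using the Table 2.4 sample) computes such a fit; its residuals sum to zero by construction.

    import numpy as np

    x = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
    y = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)

    # Least squares: minimize sum((y - b0 - b1*x)**2) over b0 and b1
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()

    y_hat = b0 + b1 * x   # fitted values Y^_i: points on the SRF
    u_hat = y - y_hat     # residuals u^_i: estimates of the disturbances u_i

    print(f"SRF: Y^ = {b0:.2f} + {b1:.4f} X")
    print(f"sum of residuals: {u_hat.sum():.2e}")          # zero by construction
    print(f"residual sum of squares: {(u_hat**2).sum():.1f}")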