simple-regression
After a quick review of statistical concepts, we begin with the most basic, and most important, technique in Econometrics: linear regression. You may have been introduced to it in your Stats courses, but we need to go deeper into the concept in order to adapt it to real-world uses. The key argument is that the world around us produces diffuse and fuzzy data, and we are far from estimating empirical models free from error. We need to live with this fact, and the best way to treat this inherent error is by minimizing it.[1] The most popular method for that is Ordinary Least Squares (OLS), and we introduce this technique in the context of simple linear regression, that is, a model where we try to explain the behavior of a variable of interest in terms of a single explanatory factor.

[1] By minimizing something, keep in mind that we use its mathematical meaning; that is, we try to make something (in this case, our error) not zero, but as small as possible.
Deterministic vs. stochastic relationships
Figure 1: A simple straight line.
Figure 2: A stochastic relationship.
The table below lists a few synonyms for both dependent and independent variables. These terms can be used interchangeably, with no loss of generality or meaning.
Y = β0 + β1 X
Here, the slope β1 represents by how much the dependent variable changes, given a marginal change in the independent variable.
Y = β0 + β1 X + ε
[...] part, represented by the residual, “ε.” Thus, the uncertain part of any statistical/econometric model is represented by an error term, which captures everything about the variations in Y, our dependent variable, that the explanatory variables are not able to explain.
Since we will always be dealing with random variables in this course, let us apply the expectations operator (that is, compute the Expected Value, which you learned in your Stats course) to both sides of this equation:[4]

[...] error term is zero. This is a crucial assumption we are making here, concerning the distribution of our error term. Figure 3 illustrates this latter point, simply showing that the central location of our ε friend is 0.

[4] If you do not recall what the term Expected Value means from your Stats class, make sure to give it a quick review, so you fully understand what is going on at this point. It is an important step to understand the basic foundations of our class.
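As a sketch of what applying the expectations operator to Y = β0 + β1 X + ε looks like, the steps can be written out as below. This assumes the expectation is taken conditional on X and that the error term has a conditional mean of zero; both are assumptions made for this illustration.

```latex
\begin{align*}
E(Y \mid X) &= E(\beta_0 + \beta_1 X + \varepsilon \mid X) \\
            &= \beta_0 + \beta_1 X + E(\varepsilon \mid X) \\
            &= \beta_0 + \beta_1 X
               \qquad \text{if } E(\varepsilon \mid X) = 0 .
\end{align*}
```

The constants β0 and β1 pass through the operator unchanged, and X is treated as known once we condition on it, so the only term whose expectation matters is the error.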
However, make sure you understand this point: this does not
mean that, when we get to apply this concept to real data, our
error term will show a value of zero. This is a statement about its
average, assuming we can run the same regression model with
several different samples of the same size for the same variables
X and Y. This is related to the sampling distribution of the
error term. If you do not recall what sampling distributions are, make sure to give them a quick review from Stats as well.
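The distinction between "zero on average" and "zero in any given sample" can be illustrated with a small simulation. This is only a sketch: it draws the error term directly from a normal distribution instead of running a regression, and the number of samples, the sample size, and the standard deviation are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, chosen arbitrarily

n_samples = 2000    # number of repeated samples (illustrative)
sample_size = 50    # observations per sample (illustrative)
sigma = 2.0         # assumed standard deviation of the error term

# Each row is one sample of error terms drawn with E(eps) = 0.
eps = rng.normal(loc=0.0, scale=sigma, size=(n_samples, sample_size))

sample_means = eps.mean(axis=1)    # average error within each sample
grand_mean = sample_means.mean()   # average of those averages

print("first five sample means:", np.round(sample_means[:5], 3))
print("mean across all samples:", round(float(grand_mean), 4))
```

Individual sample means wander around zero, while their average across many samples sits very close to it, which is the sense in which the error term is said to have an expected value of zero.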
Figure 3: The residual’s distribution.
E(ε|X) = E(ε)
wage = β0 + β1 educ + ε
accounted for in the model, what is contained in there? Think
about what is also relevant to explain wages.
Factors such as years of experience, innate ability, gender, race, and many other variables are included in the error term, since the only independent variable is education. To keep things simple, assume that ε consists only of ability (abil), and that it sits in the error term because it cannot be measured or because no data on it are available.
Then, going back to our previous assumption, it requires that
the average level of ability is the same, regardless of one’s years
of education. Illustrating:
E(abil|educ) = E(abil)
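To see what this condition looks like in data, the sketch below simulates a population in which ability is drawn independently of education and then compares the average ability within each education level. All numbers (the range of schooling years, the mean and spread of ability) are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed, chosen arbitrarily
n = 100_000

# Hypothetical data: years of education from 8 to 16, and an ability
# score drawn independently of education.
educ = rng.integers(low=8, high=17, size=n)
abil = rng.normal(loc=100.0, scale=15.0, size=n)

overall = abil.mean()
# Sample analogue of E(abil | educ): mean ability within each level.
by_educ = {int(e): float(abil[educ == e].mean()) for e in np.unique(educ)}

for e, m in by_educ.items():
    print(f"educ = {e:2d}: mean abil = {m:.2f}")
print(f"overall mean abil = {overall:.2f}")
```

When the assumption holds, every conditional mean is close to the overall mean. If more able people tended to stay in school longer, the conditional means would rise with educ and the assumption would fail.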
The population regression function
Y = β0 + β1 X
E(Y|X) = β0 + β1 X
yi = β0 + β1 xi + ui
on the research we are conducting: these may be households,
houses, cats, cities, etc. Conceptually, however, all the terms of
the sample regression functions are the same as before.
Additionally, we assume:
E(u) = 0
Cov(xi , ui ) = 0
That is, the expected value of the sample error term is zero, and the covariance between the independent variable(s) and the residual is zero. Recall that correlation is just covariance rescaled to lie between −1 and 1: both measure only linear association, so a zero covariance rules out a linear relationship between xi and ui, not a relationship of any form.
Figure 4: Residuals vs. regression line.
Once this graph makes sense to you, you can see that Ordinary Least Squares (OLS) is the mathematical technique for obtaining the estimates of yi, β0 and β1 that make the residuals, ui, as small as possible. In other words, it calculates the β coefficients that minimize the sum of the squared residuals of our model.[6]

[6] Why OLS, though? Put simply, it [...]

Estimator: a mathematical technique that is applied to a sample to produce numerical estimates of the “true” population parameters. Do not confuse estimator with estimate: an estimator is the formula, and an estimate is the final numerical value generated by this formula. To be clear, OLS is an estimator, and so are the formulas for β̂0 and β̂1; β0 and β1 themselves are the population parameters. On the other hand, the numerical values of β̂0, β̂1, and û computed from a given sample are estimates.
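Written out, the minimization described above can be sketched as follows. The symbols b0 and b1, standing for candidate coefficient values, are introduced here only for illustration.

```latex
\begin{align*}
(\hat{\beta}_0, \hat{\beta}_1)
  &= \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\[4pt]
\text{First-order conditions:}\quad
  & \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0, \\
  & \sum_{i=1}^{n} x_i \,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 .
\end{align*}
```

The two conditions are the sample counterparts of E(u) = 0 and Cov(xi, ui) = 0, and solving them gives closed-form expressions for β̂0 and β̂1.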
Now we are all set to start playing with real-world data. OLS will be our preferred estimator until the end of the semester. Just be aware that there are many other estimators out there, depending on the circumstances. However, the basic foundations of Econometrics are built upon OLS, and it is important to master it before moving on to more complex methods.

Keep these formulas in mind; we will use them!
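\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{Cov}(x_i, y_i)}{\mathrm{Var}(x_i)}
\]

\[
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]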
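A minimal sketch of these two formulas in Python is shown below. The function name `ols_simple`, the simulated data, and the "true" coefficients 1.0 and 0.5 are all assumptions made for the example; NumPy's `polyfit` is used only as an independent check of the result.

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form OLS for one regressor: returns (beta0_hat, beta1_hat)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations over sum of squared deviations of x.
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# Simulated sample; the population values below are assumed for illustration.
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 10, size=200)
u = rng.normal(0, 1, size=200)
y = 1.0 + 0.5 * x + u

b0, b1 = ols_simple(x, y)
resid = y - (b0 + b1 * x)

# Independent check: a degree-1 polynomial fit returns [slope, intercept].
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"beta0_hat = {b0:.4f}, beta1_hat = {b1:.4f}")
print(f"polyfit:    intercept = {intercept_np:.4f}, slope = {slope_np:.4f}")
print(f"mean residual = {resid.mean():.2e}")
```

In the vocabulary used earlier, the function plays the role of the estimator and the two numbers it returns for this particular sample are the estimates. The fitted residuals average to zero and are uncorrelated with x by construction.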