5SSPP213: Econometrics
4 March 2025
Outline
5. Binary outcomes
Gauss-Markov Assumptions
1. Linear in parameters
▶ Not necessarily in variables (this week)
2. Random sampling
3. No perfect co-linearity
4. Zero conditional mean of errors
5. Homoskedasticity (next week)
yi = β0 + β1 xi + β2 xi² + ui

▶ y = exp(x) = e^x
▶ exp(0) = 1
▶ exp(1) = 2.7183
▶ The exponential function is the inverse of the log function
▶ Note that ln(y) = β0 + β1 x is equivalent to y = exp(β0 + β1 x)
▶ If β1 > 0, then x has an increasing marginal effect on y (i.e. another unit has a larger effect than the previous unit)
▶ Recall that exp(a + b) = exp(a) exp(b)
▶ What happens to sales when the price increases by 1%? (price elasticity of demand)
▶ Elasticity of y with respect to x:

(∂y/y) / (∂x/x) = (Δy/Δx) × (x/y) = %Δy / %Δx

▶ Semi-elasticity of y with respect to x:

(∂y/y) / ∂x = %Δy / Δx

log(wages)i = β0 + β1 xi + ui

▶ Interpretation:

%Δwages ≈ (100 × β1)Δx

▶ Example: fitted log(wages) = 0.584 + 0.083 educ, R² = 0.186

▶ For a log-log model, log(CEO salary)i = β0 + β1 log(sales)i + ui:
▶ Interpretation:

%ΔCEO salary ≈ β1 %Δsales

▶ Example: fitted log(CEO salary) = 4.822 + 0.257 log(sales), R² = 0.211
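To make the semi-elasticity interpretation concrete, here is a minimal sketch in Python with simulated data; the coefficients 0.584 and 0.083 come from the wage example above, while the sample size and noise level are assumptions:

```python
# Simulated illustration: in a log-level model, one more year of educ
# raises wages by roughly 100*b1 percent.
import numpy as np

rng = np.random.default_rng(0)
educ = rng.uniform(8, 20, 500)                     # hypothetical years of education
log_wages = 0.584 + 0.083 * educ + rng.normal(0, 0.3, 500)

X = np.column_stack([np.ones_like(educ), educ])    # intercept column + educ
b, *_ = np.linalg.lstsq(X, log_wages, rcond=None)  # OLS by least squares

print(f"b1 = {b[1]:.3f} -> +1 year of educ ~ {100 * b[1]:.1f}% higher wages")
```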
yi = β0 + β1 xi + β2 xi² + ui

▶ Marginal effect: Δy/Δx ≈ β1 + 2β2 x
▶ Decreasing marginal effects: β1 > 0 and β2 < 0
▶ Increasing marginal effects: β1 < 0 and β2 > 0
▶ Example: fitted wages = 3.73 + 0.298 exper − 0.0061 exper², R² = 0.093
▶ Marginal effect: Δŷ/Δx ≈ 0.298 − 2(0.0061) exper
  First year: 0.298
  Second year: 0.286
  Eleventh year: 0.176
▶ Maximum or minimum: x* = −β1 / (2β2)
▶ Example: x* = −0.298 / (2(−0.0061)) ≈ 24.4
▶ Check if estimates make sense (see the sketch below)
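A small sketch of the quadratic marginal-effect calculation, using the coefficients from the experience example above (the function and variable names are just for illustration):

```python
# Evaluate the quadratic marginal effect b1 + 2*b2*x and its turning point,
# with the coefficients assumed from the wage-experience example.
b1, b2 = 0.298, -0.0061

def marginal_effect(exper: float) -> float:
    """Approximate effect on wages of one more year of experience."""
    return b1 + 2 * b2 * exper

for year in (0, 1, 10):                  # first, second, eleventh year
    print(f"after {year} years: {marginal_effect(year):.3f}")

turning_point = -b1 / (2 * b2)           # where the marginal effect is zero
print(f"estimated wages peak at exper = {turning_point:.1f}")
```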
yi = β0 + β1 xi + β2 xi² + β3 xi³ + ui

▶ How can I know if a non-linear model of degree r is actually linear?
▶ Logit and probit functions can be used to restrict the fitted values to
lie in the unit interval.
28 January 2025
Outline
1. Multiple regression
2. Goodness of fit
Multiple regression
Adding controls
We estimated that for country i:
Control variables
min ∑ᵢ ûi²

▶ Setting the FOCs (with respect to β̂0, β̂1 and β̂2) to 0, we get the normal equations:

∑ᵢ (yi − β̂0 − β̂1 x1i − β̂2 x2i) = 0
∑ᵢ x1i (yi − β̂0 − β̂1 x1i − β̂2 x2i) = 0
∑ᵢ x2i (yi − β̂0 − β̂1 x1i − β̂2 x2i) = 0
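A minimal sketch (hypothetical data) showing that solving the three normal equations above is exactly what least squares does; each equation should evaluate to (numerically) zero at the OLS solution:

```python
# OLS with two regressors via np.linalg.lstsq; verify the FOCs hold.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)                   # correlated controls
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

print("b0, b1, b2 =", np.round(beta_hat, 2))
print("FOCs ~ 0:", np.round(X.T @ resid, 8))         # each normal equation holds
```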
Multiple regression
ŷ = β̂0 + β̂1 x1 + β̂2 x2 + ... + β̂k xk

▶ We can write the same equation in terms of Δ increments.
▶ With two regressors, the solutions are:

β̂0 = ȳ − β̂1 x̄1 − β̂2 x̄2
β̂1 = (V2 Cy1 − C12 Cy2) / (V1 V2 − C12²)
β̂2 = (V1 Cy2 − C12 Cy1) / (V1 V2 − C12²)

▶ Where:
▶ Vk is the variance of variable k.
▶ Cyk is the covariance between y and variable k.
▶ C12 is the covariance between variables 1 and 2.
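A quick numerical check, with hypothetical data, that the closed-form two-regressor formulas above reproduce the least-squares solution:

```python
# Closed-form two-regressor OLS from variances and covariances.
import numpy as np

rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=300), rng.normal(size=300)
y = 0.5 + 1.2 * x1 - 0.7 * x2 + rng.normal(size=300)

V1, V2 = np.var(x1), np.var(x2)
C12 = np.cov(x1, x2, bias=True)[0, 1]
Cy1 = np.cov(y, x1, bias=True)[0, 1]
Cy2 = np.cov(y, x2, bias=True)[0, 1]

b1 = (V2 * Cy1 - C12 * Cy2) / (V1 * V2 - C12**2)
b2 = (V1 * Cy2 - C12 * Cy1) / (V1 * V2 - C12**2)
b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()
print(np.round([b0, b1, b2], 3))   # ~ [0.5, 1.2, -0.7]
```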
Omitted variables
Simpson's paradox
Source: https://www.ft.com/content/94e3acec-a767-11e7-ab55-27219df83c97
yi = β′0 + β′1 x1i + vi
yi = β0 + β1 x1i + β2 x2i + ui

This implies that:

vi = β2 x2i + ui

▶ Recall:

E[β̂′1] = β1 + E[ ∑ᵢ (x1i − x̄1)vi / ∑ᵢ (x1i − x̄1)² ]

▶ When:
1. cov(x1, x2) ≠ 0
2. β2 ≠ 0
▶ Then the numerator is NOT zero, and we have a biased estimator.
▶ (Also, refer to estimators for multiple regression: bias is proportional to cov(x1, x2).)
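A minimal simulation of omitted-variable bias (all parameter values assumed): leaving x2 out of the regression biases β̂1 by β2·cov(x1, x2)/var(x1):

```python
# Short regression (x2 omitted) vs long regression (x2 included).
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)                    # cov(x1, x2) != 0
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)    # beta2 != 0

b_short = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0]
b_long = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0]

print(f"short-model b1 = {b_short[1]:.2f}  (biased; ~ 2.0 + 1.5*0.8 = 3.2)")
print(f"long-model  b1 = {b_long[1]:.2f}  (unbiased; ~ 2.0)")
```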
Goodness of fit
▶ Total sum of squares (n × sample variance):

SStot = ∑ᵢ (yi − ȳ)²

R² = Variation of estimated ŷ / Total variation of observed y

R² = SSreg / SStot   (1)
Goodness of fit
R² = 1 − SSres / SStot   (3)

R-square properties
▶ R² ranges from 0 to 1.
▶ If R² = 0 ⟹ the OLS regression does not explain any variation in the values of y.
▶ If R² = 1 ⟹ the OLS regression explains all the variation of y.
▶ For regressions with just one independent variable (bivariate regression), the R-square is the square of the correlation coefficient.
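A short sketch (hypothetical data) computing R² both ways shown above, and confirming that in the bivariate case it equals the squared correlation coefficient:

```python
# R^2 as SSreg/SStot and as 1 - SSres/SStot.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(size=100)

X = np.column_stack([np.ones(100), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

ss_tot = np.sum((y - y.mean())**2)
ss_reg = np.sum((y_hat - y.mean())**2)
ss_res = np.sum((y - y_hat)**2)

print(f"R2 = {ss_reg/ss_tot:.3f} = {1 - ss_res/ss_tot:.3f}")
print(f"corr^2 = {np.corrcoef(x, y)[0, 1]**2:.3f}")   # same in the bivariate case
```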
OLS as an estimator
se(β̂1) = s / √(∑ᵢ (xi − x̄)²)

t = (β̂ − β) / se(β̂)

▶ Where:
▶ SSTj = ∑ᵢ (xij − x̄j)²
▶ Rj² is the R-square from regressing xj on all other variables:

xij = α0 + α1 xi1 + α2 xi2 + ... + αk−1 xi,(k−1) + ui

se(β̂j) = √( σ̂² / (SSTj (1 − Rj²)) )   (6)
H0: β1 = 0
HA: β1 ≠ 0

▶ When H0 is β1 = 0, then:

t = β̂1 / se(β̂1)

The t statistic

(β̂j − βj) / se(β̂j) ∼ t(n−k−1)

▶ The t distribution has df = n − k − 1, where k indicates the number of slopes being estimated.
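A minimal sketch of this t-test on simulated data where H0: β1 = 0 is true (sample size and noise level are assumed):

```python
# t = b1_hat / se(b1_hat), compared with the t(n-k-1) distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.0 * x + rng.normal(size=n)       # H0 true: beta1 = 0

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 2)                 # sigma^2 estimate, df = n-k-1
se_b1 = np.sqrt(s2 / np.sum((x - x.mean())**2))

t = b[1] / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-sided p-value
print(f"t = {t:.2f}, p = {p:.3f}")
```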
Gauss-Markov Assumptions
1. Linear in parameters:
▶ yi = β0 + β1 xi + ui, but NOT yi = β0 + β1² xi + ui
▶ We can have non-linear variables, e.g., β0 + β1 xi² + ui
▶ For example, the Mincer equation is typically used to estimate the effect of education and experience on earnings:
Gauss-Markov Assumptions
2 Random sampling:
▶ The sample data is representative (i.e., randomly drawn from the population).
▶ Since we'll obtain different estimates β̂ for each random draw of n, the β̂'s are random variables, with expected values and standard errors.
▶ This allows for hypothesis testing:
▶ With a sufficiently large sample size, the central limit theorem tells us that β̂ will follow a t-distribution.
▶ How can we test for random sampling?
Gauss-Markov Assumptions
3 No perfect co-linearity:
▶ We cannot perfectly predict any independent variable with a (linear) combination of the others.
▶ For example:
Gauss-Markov Assumptions
4 Zero conditional mean of errors:
▶ E(u | x1, x2, ..., xk) = 0
Gauss-Markov Assumptions
5 Homoskedasticity:
▶ The variance of the error term, ui, given the independent variables is constant – not correlated with any X variable:

Var(u | x1, x2, ..., xk) = σ²
OLS is "BLUE"
16 September 2024
▶ The mean: x̄ = (1/n) ∑ᵢ xi
▶ Variance: Var(x) = (1/n) ∑(xi − x̄)²
▶ Standard deviation: σ = √((1/n) ∑(xi − x̄)²)
▶ For a sample: Var(x) = (1/(n−1)) ∑(xi − x̄)²
▶ Sample standard deviation: s = √((1/(n−1)) ∑(xi − x̄)²)
▶ These are the formulas we use to estimate the population parameters in a finite sample.
▶ Where does x̄ come from? The sample!
rxy = cov(x, y) / (sx sy)

▶ For the population: ρ = cov(x, y) / (σx σy)
▶ Units? It is unit free – always between −1 and 1.
Interpreting the covariance
▶ What if we want to know how a change in the poverty rate affects life expectancy?
▶ Neither the covariance nor the correlation coefficient can give us that information.
▶ If we write an equation describing the relationship, it might look something like:

Lifeexpectancy = β0 + β1 Poverty
▶ n = 123
▶ Life expectancy at birth:
▶ x̄ = 72.20; s = 7.34.
▶ Cov: -112.44
▶ ρ = −0.75
▶ The population is, very broadly, all of the individuals that fit certain criteria.
▶ It could be a known quantity, of specific individuals (e.g. the
population of the UK),
▶ BUT it can also be more esoteric: people who haven’t been born yet;
events that haven’t happened.
▶ We can think of a sample as a random draw from the population.
▶ Random variable X:
▶ outcome of a random process.
▶ Can be continuous or discrete
▶ Probability Density Function (pdf) of X, f(x):
▶ When X is discrete: the probability that variable X takes a particular value, Pr[X = a].
▶ When X is continuous: it makes no sense to ask for the probability that a continuous random variable takes on a particular value, so we use the pdf to compute events involving a range of values, e.g. Pr[a ≤ X ≤ b].
E[x] = µ

▶ We can think of the sample mean as being a random variable with expected value equal to the population mean:

E[x̄] = µ

▶ Similarly: E[s²] = σ², etc.
Expected value
▶ For a discrete variable, the expected value is the sum of all of the possible outcomes (xj) times the probability of those outcomes occurring, f(xj):

E(x) = x1 f(x1) + x2 f(x2) + ... + xk f(xk) = ∑ᵏⱼ₌₁ xj f(xj)

s² = (1/(n−1)) ∑(xi − x̄)²

E[s²] = σ²
▶ Why n-1? This is the degrees of freedom.
▶ (Over)simplification: because we have x̄, this “counts as an
observation”.
▶ Degrees of freedom: the number of observations (n), minus the
number of random variables in the test.
▶ x̄ is a random variable
fX,Y (x, y) = P(X = x, Y = y)

▶ Conditional distributions tell us the distribution of a variable, Y, given the value of another variable, X:

fY|X (y | x) = P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)

E[a(X)Y + b(X) | X] = a(X)E(Y | X) + b(X)

3. If X and Y are independent, then E(Y | X) = E(Y).
4. E[E(Y | X)] = E(Y), the law of iterated expectations.
▶ x̄ has a distribution.
▶ The central limit theorem tells us that x̄ is normally distributed.
▶ This is true even when the distribution of x is not normal.
T-distribution
▶ x̄ is approximately normally distributed:
▶ Central tendency: µ
▶ Standard deviation: σ/√n
Degrees of freedom
Inference
▶ The central limit theorem makes it possible to make inferences about
the population based on the sample statistics.
1. Confidence intervals: Based on our estimates of µ and σ (x̄ and s, respectively), we can calculate an interval that shows how spread out our estimate is likely to be.
2. Hypothesis testing: we start by saying IF µ is equal to something (our hypothesis), how likely are we to observe the value of x̄?
▶ Think of this like working backwards: if we know something about the
distribution of x̄, we can calculate the "area under the curve" for a
given µ
▶ This makes it possible to test hypotheses based on sample statistics as
well as construct confidence intervals.
▶ This basically makes statistics useful! Without this, we cannot use
statistics to test theories – just to describe data.
The t-distribution
The distribution of x̄
▶ For a sample, we calculate the t-stat, which tells us the chance that a sample mean is above/below a certain level of µ:

t = (x̄ − µ) / (s / √n)
Hypothesis testing
▶ Hypothesis testing: we first make a guess about µ then use the t-stat
to test the hypothesis → given the sample data, how likely is a given
value of µ?
▶ We then plug that guess into the t-stat formula and compare that to
the critical t-value to evaluate how likely the given outcome is.
▶ We call this guess the Null Hypothesis, H0 : µ = µ0 .
▶ We compare this to the Alternative Hypothesis, Ha .
▶ The two-sided test: Ha : µ ̸= µ0
▶ The one-sided test: Ha : µ > µ0
▶ The one-sided test: Ha : µ < µ0
▶ Careful: the critical t-value for 1 and 2-sided tests is different
▶ H0: µ = 0
▶ Ha: µ ≠ 0 (N.B.: 2-sided test!)
▶ We observe a sample mean, x̄, which is a random variable.
▶ The hypothesis test is asking: how likely are we to observe the value of x̄ that we do, if µ = 0?
▶ Assuming µ = 0, and estimating σ with s (i.e. the sample s.d.), we know how likely a particular value of x̄ is.
▶ We pick a confidence level – let's say 95%. This means we will reject the null when our value of (±)x̄ would occur in fewer than 5% of samples under H0.
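A minimal sketch of the one-sample t-test described above, on a hypothetical sample (the true location 0.3 and the scale are assumptions):

```python
# Under H0, (xbar - mu0)/(s/sqrt(n)) follows a t distribution with n-1 df.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # hypothetical sample

mu0 = 0.0                                     # null hypothesis H0: mu = 0
t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
p = 2 * stats.t.sf(abs(t), df=len(x) - 1)     # two-sided test

print(f"t = {t:.2f}, p = {p:.3f}")            # reject H0 at 5% if p < 0.05
```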
The p-value
11 March 2025
Outline
2. Heteroskedasticity
Gauss-Markov Assumptions
1. Linear in parameters
2. Random sampling
3. No perfect co-linearity
4. Zero conditional mean of errors
5. Homoskedasticity – or constant error term.
▶ Where:
▶ SSTj = ∑ᵢ (xij − x̄j)²
▶ Rj² is the R-square from regressing xj on all other variables:

xij = α0 + α1 xi1 + α2 xi2 + ... + αk−1 xi,(k−1) + ui
What is heteroskedasticity?

y = β0 + β1 x1 + β2 x2 + ... + βk xk + u

▶ Homoskedasticity: Var(u | x1, x2, ..., xk) = σ²
▶ Heteroskedasticity: Var(u | x1, x2, ..., xk) ≠ σ²
Heteroskedasticity
▶ Calculate the squared OLS residuals (ûi²) from your original equation, regress ûi² on the regressors, and compute:

F = (R²û² / k) / ((1 − R²û²) / (n − k − 1)) ∼ F(k, n−k−1)
BP test cont.
▶ Well, actually....
▶ Typically a slightly different version is used, the Lagrange-multiplier
statistic:
LM = n·R²û² ∼ χ²(k)
▶ This follows a chi-square distribution.
▶ in Stata: hettest
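A sketch of the Breusch-Pagan test in Python on simulated heteroskedastic data, using statsmodels' het_breuschpagan (roughly analogous to Stata's hettest; the data-generating process here is an assumption):

```python
# LM = n * R^2 of the auxiliary regression of squared residuals on the
# regressors, compared against chi2(k).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(1, 10, n)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x, size=n)  # error variance grows with x

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
lm, lm_pval, fstat, f_pval = het_breuschpagan(res.resid, X)
print(f"LM = {lm:.1f}, p = {lm_pval:.4f}")           # small p: reject homoskedasticity
```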
White test
LM = n·R²û² ∼ χ²(k)
Consequences of Heteroskedasticity
Solutions to Heteroskedasticity
In Stata:
11 February 2025
β̂0 = ȳ − β̂1 x̄1 − β̂2 x̄2
β̂1 = (V2 Cy1 − C12 Cy2) / (V1 V2 − C12²)
β̂2 = (V1 Cy2 − C12 Cy1) / (V1 V2 − C12²)

▶ Where:
▶ Vk is the variance of variable k.
▶ Cyk is the covariance between y and variable k.
▶ C12 is the covariance between variables 1 and 2.
▶ Recall:

E[β̂′1] = β1 + E[ ∑ᵢ (xi − x̄)vi / ∑ᵢ (xi − x̄)² ]

▶ When:
1. cov(x1, x2) ≠ 0
2. β2 ≠ 0
▶ Then the numerator is NOT zero, and we have a biased estimator.
▶ (Also, refer to estimators for multiple regression: bias is proportional to cov(x1, x2).)
OLS as an estimator
σ²: u ∼ N(0, σ²)

se(β̂1) = s / √(∑ᵢ (xi − x̄)²)

▶ We estimate the variance of the error term, ui, using the variance of the residual, ûi (sometimes written as ei), with an adjustment for the degrees of freedom.
▶ Where:
▶ SSTj = ∑ᵢ (xij − x̄j)² (the total sum of squares of xj)
▶ Rj² is the R-square from regressing xj on all other variables.

Var(β̂j) = σ² / (SSTj (1 − Rj²)), so se(β̂j) = √(σ² / (SSTj (1 − Rj²)))   (3)

▶ Note that:

σ² / (SSTj (1 − Rj²)) = (σ² / SSTj) × (1 / (1 − Rj²))   (4)

▶ Where 1/(1 − Rj²) is the variance inflation factor. It tells us how correlation of xj with other control variables in the model increases the variance (standard error) of β̂j.
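A small sketch of the variance inflation factor on hypothetical correlated regressors: regress xj on the others and compute 1/(1 − Rj²):

```python
# VIF for x1 when x1 is strongly related to x2.
import numpy as np

rng = np.random.default_rng(8)
n = 300
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + rng.normal(scale=0.5, size=n)    # x1 strongly related to x2

# R^2 from regressing x1 on a constant and x2
X = np.column_stack([np.ones(n), x2])
b, *_ = np.linalg.lstsq(X, x1, rcond=None)
resid = x1 - X @ b
r2 = 1 - resid @ resid / np.sum((x1 - x1.mean())**2)

vif = 1 / (1 - r2)
print(f"Rj^2 = {r2:.2f}, VIF = {vif:.1f}")       # VIF >> 1: inflated se(b1)
```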
Goodness of fit
SStot = ∑ᵢ (yi − ȳ)²

R² = SSreg / SStot   (5)

R-square (cont.)
▶ The unexplained variation in y is the residual sum of squares (aka sum of squared errors):

SSres = ∑ᵢ (ŷi − yi)²
▶ If R² = 0 ⟹ the OLS regression does not explain any variation in the values of y.
▶ If R² = 1 ⟹ the OLS regression explains all the variation of y.
▶ For regressions with just one independent variable (bivariate regression), the R-square is the square of the correlation coefficient.
Review (plus a few new things!)
We want to know whether one year of tenure is worth one year of univ:

log(wages) = β0 + β1 tenure + β2 univ + β3 exper + u

H0: β1 = β2    HA: β1 ≠ β2

▶ Approach 1: The test statistic is

t = (β̂1 − β̂2) / se(β̂1 − β̂2) = (β̂1 − β̂2) / √(var(β̂1) + var(β̂2) − 2cov(β̂1, β̂2))
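A sketch of this test on simulated data where H0: β1 = β2 holds (all parameter values are assumptions); the covariance matrix of the estimates supplies var(β̂1), var(β̂2) and cov(β̂1, β̂2):

```python
# Test H0: beta1 = beta2 using var(b1) + var(b2) - 2*cov(b1, b2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)   # H0 holds

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])
V = s2 * np.linalg.inv(X.T @ X)                    # covariance matrix of b-hat

se_diff = np.sqrt(V[1, 1] + V[2, 2] - 2 * V[1, 2])
t = (b[1] - b[2]) / se_diff
print(f"t = {t:.2f}, p = {2 * stats.t.sf(abs(t), df=n-3):.3f}")
```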
F-statistic
▶ This is the test that all regressors jointly do not help explain y, i.e. the joint exclusion of all independent variables:

H0: β1 = β2 = β3 = ... = βk = 0

Fall ≡ (R²UR / q) / ((1 − R²UR) / (n − k − 1)) ∼ F(q, n−k−1)

▶ This is because under H0, the restricted R², R²R, equals 0.
▶ We reject H0 if Fall > cF.
▶ In other words, if the p-value of F is < 0.05 (or < 0.01, < 0.10, depending on the confidence level we wish).
▶ If we fail to reject, then there is no evidence that any of the independent variables help explain y.
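A minimal sketch of the overall F test on simulated data (here q = k, since all slopes are excluded under H0; the data-generating process is assumed):

```python
# Overall F test from the full model's R^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = 1 + X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
r2 = 1 - resid @ resid / np.sum((y - y.mean())**2)

F = (r2 / k) / ((1 - r2) / (n - k - 1))
p = stats.f.sf(F, k, n - k - 1)
print(f"F = {F:.1f}, p = {p:.4g}")                 # large F: reject H0
```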
Reverse causality
Omitted variables
Endogeneity
OVB
ui = δ1 hi + δ2 wi + vi
▶ When:
▶ δ1 ̸= 0
▶ cov (hi , educationi ) ̸= 0,
▶ What about adding control variables for standardized test scores (as a proxy for ability) and parents' income (as a proxy for socio-economic background)?
Simultaneity
▶ This means that our estimate β̂1 will measure the effect of education itself plus the effect of hi and wi, and thus our model is not specified correctly.
yi = β0 + β1 xi + ui
xi = γ0 + γ1 yi + vi

▶ ui will be correlated with xi
Before/after comparison
2. Internal validity: The causal claims inferred from your model are
truly valid for the population under study.
Diff-in-diff
4 February 2025
Outline
OLS as an estimator
se(β̂1) = s / √(∑ᵢ (xi − x̄)²)

t = (β̂ − β) / se(β̂)

▶ Where:
▶ SSTj = ∑ᵢ (xij − x̄j)²
▶ Rj² is the R-square from regressing xj on all other variables:

xij = α0 + α1 xi1 + α2 xi2 + ... + αk−1 xi,(k−1) + ui
Understanding the SE

se(β̂j) = √( σ² / (SSTj (1 − Rj²)) )   (3)
H0: β1 = 0
HA: β1 ≠ 0

▶ When H0 is β1 = 0, then:

t = β̂1 / se(β̂1)

The t statistic

(β̂j − βj) / se(β̂j) ∼ t(n−k−1)

▶ The t distribution has df = n − k − 1, where k indicates the number of slopes being estimated.
25 February 2025
[Scatterplot with ethnic fragmentation (0–1) on the x-axis; correlation = −0.5725. Source: Pollock (2010)]
1. If δ̂1 > 0
1.1 The new intercept will be β̂0 + δ̂1 and the regression line will shift up along the y axis by exactly δ̂1.
2. If δ̂1 < 0
2.1 The new intercept will be β̂0 + δ̂1 and the regression line will shift down along the y axis by exactly |δ̂1|.
3. Another way to define δ̂1 is to look at the expected value of the regression given the different values of the dummy variable:
1. So given our model and our interest in seeing the effect of SMD on
HDI:
▶ HDI example with a new dummy variable, Rural. The variable takes
value 1 if the majority of a population in a country lives in rural areas
and 0, otherwise.
Shifting intercepts
Using data from Pollock (2010) we can estimate the coefficients using the procedure described in the previous slide. A regression model containing an interaction of two dummy variables generates four different intercepts:

              Rural   Urban
Democracy      0.76    0.88
Dictatorship   0.52    0.75
[Scatterplot: HDI (0.4–0.9) on the y axis against Ethnicity (0–1) on the x axis]
Interaction terms
▶ The model in the previous slide has two intercepts and two slopes:
16 September 2024
1. Review: covariance/correlation
2. OLS!
▶ Simple OLS
▶ OLS assumptions
▶ Hypothesis testing
3. Second part:
▶ ID strategies
4. Third part:
▶ Working with regressions
rxy = cov(x, y) / (sx sy)

▶ For the population: ρ = cov(x, y) / (σx σy)
▶ Unit free – always between −1 and 1.
Ordinary Least Squares: regression analysis
Outline
2. OLS assumptions
OLS
▶ What if we want to know how a change in the poverty rate affects life expectancy?
▶ Neither the covariance nor the correlation coefficient can give us that information.
▶ If we write an equation describing the relationship, it might look something like:
Lifeexpectancy = β0 + β1 Poverty
▶ n = 123
▶ Life expectancy at birth:
▶ x̄ = 72.20; s = 7.34.
▶ Cov: -112.44
▶ ρ = −0.75
▶ β1 is the slope of the line – it tells us how a one-unit increase in the poverty rate would change life expectancy.
▶ Because the relationship is not perfect, we allow for an error term, ui ;
▶ We also add the subscript "i" to show what the units of observation
are.
Estimating OLS
▶ Note:
▶ Sometimes the notation is ei
▶ What’s the difference between the residual and error term, ui ?
The residuals
Least squares:

∑ᵢ ûi²

∑ᵢ (yi − (β̂0 + β̂1 xi))²

min ∑ᵢ (yi − (β̂0 + β̂1 xi))²   (1)
▶ We take the F.O.C.s of (1) w.r.t. β̂0 and β̂1 and set them to zero.
▶ We'll start with β̂0:

0 = ∂/∂β̂0 [ ∑ᵢ (yi − β̂0 − β̂1 xi)² ]
0 = −2 ∑ᵢ (yi − β̂0 − β̂1 xi)
0 = ∑ᵢ (yi − β̂0 − β̂1 xi)   (2)

▶ Now w.r.t. β̂1:

0 = ∂/∂β̂1 [ ∑ᵢ (yi − β̂0 − β̂1 xi)² ]
0 = −2 ∑ᵢ xi (yi − β̂0 − β̂1 xi)
0 = ∑ᵢ xi (yi − β̂0 − β̂1 xi)   (3)
Summation properties

0 = ∑ᵢ (yi − β̂0 − β̂1 xi)   (2)
0 = ∑ᵢ xi (yi − β̂0 − β̂1 xi)   (3)
▶ From (2):

0 = ∑ᵢ (yi − β̂0 − β̂1 xi)
0 = ∑ᵢ yi − ∑ᵢ β̂0 − ∑ᵢ β̂1 xi
0 = n ȳ − n β̂0 − β̂1 n x̄

β̂0 = ȳ − β̂1 x̄   (4)
▶ Substituting (4) into (3):

0 = ∑ᵢ (xi yi − ȳ xi + β̂1 xi x̄ − β̂1 xi²)
0 = ∑ᵢ [ xi (yi − ȳ) − β̂1 xi (xi − x̄) ]
0 = ∑ᵢ xi (yi − ȳ) − β̂1 ∑ᵢ xi (xi − x̄)

▶ Using summation properties:

β̂1 = ∑ᵢ (xi − x̄)(yi − ȳ) / ∑ᵢ (xi − x̄)²

β̂1 = Covxy / Varx
Lifeexpectancyi = β0 + β1 Rateofpovertyi + ui

▶ We estimate the coefficients based on the sample data. The results are:
▶ β̂1 = −0.23 ⟹ This is the slope of the curve – it means that for each additional percentage point of people living in extreme poverty, life expectancy in the country goes down by 0.23 years.
▶ β̂0 = 73.98 ⟹ This is the intercept – it tells us the average life expectancy when the poverty rate is zero.

given that

rxy = Covxy / (sx sy)
yi = β0 + β1 xi + ui

▶ What is ui? The error term: everything else not in the model.
▶ If the model is properly specified, then:

E(ui) = 0
E(xu) = E(x)E(u)
yi = β0 + β1 xi + ui

β̂1 = β1 + ∑ᵢ (xi − x̄)ui / ∑ᵢ (xi − x̄)²

E[β̂1] = β1 + E[ ∑ᵢ (xi − x̄)ui / ∑ᵢ (xi − x̄)² ]
Gauss-Markov Assumptions
1. Linear in parameters:
▶ yi = β0 + β1 xi + ui, but NOT yi = β0 + β1² xi + ui
▶ We can have non-linear variables, e.g., β0 + β1 xi² + ui
▶ For example, the Mincer equation is typically used to estimate the effect of education and experience on earnings:
Gauss-Markov Assumptions
2 Random sampling:
▶ The sample data is representative (i.e., randomly drawn from the population).
▶ Since we'll obtain different estimates β̂ for each random draw of n, the β̂'s are random variables, with expected values and standard errors.
▶ This allows for hypothesis testing:
▶ With a sufficiently large sample size, the central limit theorem tells us that β̂ will follow a t-distribution.
▶ How can we test for random sampling?
Gauss-Markov Assumptions
3 No perfect co-linearity:
▶ We cannot perfectly predict any independent variable with a (linear) combination of the others.
▶ For example:
Gauss-Markov Assumptions
4 Zero conditional mean of errors:
▶ E(u | x1, x2, ..., xk) = 0
5 No Homoskedasticity:
▶ The variance in the error term, ui , given the independent variables is
not correlated with any X variable.
Var (u | x1 , x2 , ..., xk ) = Ã2
Gauss-Markov Assumptions
5 No Homoskedasticity:
▶ The variance in the error term, ui , given the independent variables is
not correlated with any X variable.
Var (u | x1 , x2 , ..., xk ) = Ã2
Gauss-Markov Assumptions
5 Homoskedasticity:
▶ The variance of the error term, ui, given the independent variables is constant – not correlated with any X variable:

Var(u | x1, x2, ..., xk) = σ²
OLS is "BLUE"
Goodness of fit
▶ Total sum of squares (n × sample variance):

SStot = ∑ᵢ (yi − ȳ)²

R² = Variation of estimated ŷ / Total variation of observed y

R² = SSreg / SStot   (5)
OLS assumptions
Goodness of fit

R² = 1 − SSres / SStot   (7)

R-square properties
▶ R² ranges from 0 to 1.
▶ If R² = 0 ⟹ the OLS regression does not explain any variation in the values of y.
▶ If R² = 1 ⟹ the OLS regression explains all the variation of y.
▶ For regressions with just one independent variable (bivariate regression), the R-square is the square of the correlation coefficient.
18 March 2025
ui = δ1 hi + δ2 wi + vi
▶ When:
▶ δ1 ̸= 0
▶ cov (hi , educationi ) ̸= 0,
▶ What about adding control variables for standardized test scores (as a proxy for ability) and parents' income (as a proxy for socio-economic background)?
▶ This means that our estimate β̂1 will measure the effect of education itself plus the effect of hi and wi, and thus our model is not specified correctly.
yi = β0 + β1 xi + β2 wi + ui
yi = β0 + β1 xi + ui
xi = γ0 + γ1 yi + vi
▶ ui will be correlated with xi
y = β0 + β1 xi + β2 wi + ui
xi = γ0 + γ1 wi + γ3 zi + ui
▶ Where cor (xi , wi ) ̸= 0 and β 2 ̸= 0, which means we have
simultaneity.
▶ However, z is exogenous to wi and yi , but predicts x: cor (zi , xi ) ̸= 0
y = α0 + α1 zi + α2 wi + ui
Yi = β0 + β1 Xi + ui

▶ As it sounds, 2SLS has two stages; that is, you need to estimate two regressions:
▶ STAGE 1
1. Isolate the part of X that is uncorrelated with u by regressing X on Z using OLS:

Xi = α0 + α1 Zi + vi

▶ STAGE 2
2. We then replace Xi with its fitted value X̂i in our original model:

Yi = β̂0 + β̂1 X̂i + ûi
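A sketch of 2SLS run as two explicit OLS stages on simulated data (all parameter values are assumptions; in practice one would use a canned 2SLS routine, since the naive second-stage standard errors here would be wrong):

```python
# z shifts x but is unrelated to the error u, so stage-1 fitted values
# isolate exogenous variation in x.
import numpy as np

rng = np.random.default_rng(12)
n = 5_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 1 + 0.8 * z + 0.5 * u + rng.normal(size=n)    # x endogenous: corr(x, u) != 0
y = 2 + 1.0 * x + u

# Stage 1: regress X on Z, keep fitted values X-hat
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

# Stage 2: regress Y on X-hat
Xh = np.column_stack([np.ones(n), x_hat])
b_2sls = np.linalg.lstsq(Xh, y, rcond=None)[0]

b_ols = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]
print(f"OLS b1 = {b_ols[1]:.2f} (biased), 2SLS b1 = {b_2sls[1]:.2f} (~1.0)")
```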
▶ Relevance of Instrument
▶ If the instrument is weak, then the 2SLS estimator is biased and its t-statistic has a non-normal distribution.
▶ To check for weak instruments with a single included endogenous
regressor, check the first-stage F-stat.
▶ Exogeneity of instrument
▶ All the instruments are uncorrelated with the error term:
where Inst refers to institutions and QKi refers to the set of control
variables (Colonial past, Geography, Executive constraints...).
▶ They are interested in identifying the causal link between Inst and GDP/cap, which is captured by β1.