Lecture 2 2025

The document discusses the challenges of conducting scientific experiments in social sciences, particularly in matching households for studies on health and income. It explains the use of multivariate regression to analyze the effects of various observable and unobservable variables on health, emphasizing the importance of consistent estimation and the impact of error terms. Additionally, it covers the interpretation of regression coefficients and the use of dummy variables to capture qualitative information.

Uploaded by

Flower Fry

Lab Experiment
• Scientific Experiment: If households were guinea pigs,
– Take two households that are exactly the same in everything
– Give one household one more unit of income (and not the other)
– Measure the impact
(sounds familiar???)
Social Sciences (usually)
• Matching households in a zillion dimensions is cumbersome.
• Our health example would need enough households with exactly the same education, and then, within this group, we would see how health changes when income changes.
• Impossible when there are many other dimensions.
Too many dimensions to match!
Multivariate Regression (Contd)
• Under the partialling-out interpretation, a regression gives the effect of a unit change in one variable, keeping all the other variables constant (just like a lab experiment).
• Reminder: Keeping the other variables constant means partialling their effect out.
• Multivariate: More than one variable
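A minimal sketch of partialling out, using simulated data. The variable names, the sample size, and the "true" effects (2.5 for income, 1.5 for education) are assumptions chosen for illustration, not values from any data set. Education is first removed from both income and health, and the leftover parts are then regressed on each other.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def slope_intercept(x, y):
    """Simple OLS of y on x (with a constant): returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return b, my - b * mx

def residuals(x, y):
    """Part of y that is left after the effect of x is partialled out."""
    b, a = slope_intercept(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

random.seed(0)
n = 5000
# Simulated (hypothetical) households: income is correlated with education.
education = [random.gauss(10, 3) for _ in range(n)]
income = [0.5 * e + random.gauss(0, 1) for e in education]
health = [2.5 * i + 1.5 * e + random.gauss(0, 1)
          for i, e in zip(income, education)]

# Step 1: partial education out of income.
income_resid = residuals(education, income)
# Step 2: partial education out of health.
health_resid = residuals(education, health)
# Step 3: regress what is left of health on what is left of income.
effect, _ = slope_intercept(income_resid, health_resid)
print(round(effect, 2))  # close to the assumed true effect of 2.5
```

The estimate in step 3 is the same number a multivariate regression of health on income and education would report for income.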
Multivariate Regression (Contd)
• Think of Health depending on
– Group of observable variables: Income, Age, Health Care infrastructure, Pollution
– Group of “Unobservables”: Attitude to Health, Unobservable aspects of personal hygiene.
A regression “models” health as being dependent on the observables and the unobservables.
Regression Terminology
• Dependent Variable: Health (what you are trying to explain)
• Independent variables: the observables (like income, education, health care infrastructure, dust particles in the neighbourhood)
• Error term: captures the unexplained part of health, that is, the Unobservables in your data set that matter for determining health.
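Written as an equation, the terminology above could look as follows. The symbols (the coefficients \beta, the error term u, and the household index i) are notation assumed here for illustration; the slides do not fix a notation.

```latex
\text{Health}_i = \beta_0
  + \beta_1\,\text{Income}_i
  + \beta_2\,\text{Education}_i
  + \beta_3\,\text{HealthInfr}_i
  + \beta_4\,\text{Dust}_i
  + u_i
```

Health is the dependent variable, the four variables on the right are the independent variables, and u_i is the error term.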
Table 3
OLS: Employment growth on roads

                              (1)         (2)         (3)         (4)
New road before 2005          0.113       0.079       0.058       0.036
                              (0.019)***  (0.016)***  (0.015)***  (0.017)**
Baseline log employment       -0.275      -0.328      -0.477      -0.496
                              (0.009)***  (0.008)***  (0.013)***  (0.014)***
Population                    0.000       0.000       0.000       0.000
                              (0.000)***  (0.000)**   (0.000)**   (0.000)***
Share of land irrigated                               0.099       0.078
                                                      (0.024)***  (0.026)***
Log(land area)                                        0.141       0.126
                                                      (0.008)***  (0.008)***
Distance from town                                    -0.002      -0.002
                                                      (0.000)***  (0.000)***
Baseline number of industries                         0.024       0.026
                                                      (0.002)***  (0.002)***
Constant                      1.115       1.734       1.305       1.377
                              (0.030)***  (0.063)***  (0.078)***  (0.078)***
N                             48216       48216       46720       34888
r2                            0.13        0.17        0.21        0.22
* p < 0.10, ** p < 0.05, *** p < 0.01

This table presents OLS estimates of the relationship between log employment growth (1998-2005) and treatment, defined as having received a completed PMGSY road by 2005. The sample is all locations that received a PMGSY road before 2012. Column 1 presents the estimate only controlling for 1998 (log) employment and village population. Column 2 introduces state fixed effects. Column 3 introduces standard village level controls of share of land irrigated, log land area, distance from nearest town and number of non-farm industries present in 1998. Column 4 limits to villages in which the largest habitation had fewer than 1500 people. Standard errors are clustered at the district level.
Interpreting “Coefficients”
Dependent Variable: Health
Regression Coefficients:
Income: 2.5
Education: 1.5
Health Infr.: 4.0
Dust Particles: -0.7
Interpreting “Coefficients” (Contd)
• Income: 2.5. One unit increase of Income, everything else the same, increases health by 2.5 units.
• Dust Particles: -0.7. One additional unit of Dust Particles, everything else the same, decreases health by 0.7 units.
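The arithmetic behind this reading of the coefficients can be sketched in a few lines. The dictionary keys and the helper `predicted_change` are hypothetical names introduced here; the coefficient values are the ones on the slide.

```python
# Coefficients from the slide (dependent variable: Health).
coefficients = {
    "income": 2.5,
    "education": 1.5,
    "health_infra": 4.0,
    "dust_particles": -0.7,
}

def predicted_change(changes):
    """Predicted change in health when the listed variables change by the
    given number of units and everything else stays the same."""
    return sum(coefficients[name] * delta for name, delta in changes.items())

print(predicted_change({"income": 1}))          # 2.5
print(predicted_change({"dust_particles": 1}))  # -0.7
```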
Error Term
• The error term cannot be partialled out.
• So the interpretation of coefficients is only true under the following scenario:
“When you increase, say, Income by one unit, the error term should not change with the change in income.”
In other words:
“When you increase, say, Income by one unit, no unobservable variable relevant to explaining health should change with the change in income.”
“Consistent Estimation”
• The estimated coefficient will only be correct if none of the observable independent variables co-varies with the error term!
• In statistics: correct means… tending towards the true value… called a “Consistent estimator”
• If any variable co-varies with the error term, the variable is called “endogenous” and the estimation procedure is incorrect.
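For the simplest case of a single independent variable, the condition can be written out as below. The notation (\beta_1 for the true coefficient, \hat{\beta}_1 for its OLS estimate, u for the error term) is assumed here for illustration.

```latex
\text{Health}_i = \beta_0 + \beta_1\,\text{Income}_i + u_i,
\qquad
\operatorname{plim}\,\hat{\beta}_1
  = \beta_1 + \frac{\operatorname{Cov}(\text{Income}_i,\,u_i)}{\operatorname{Var}(\text{Income}_i)}
```

The estimate tends towards the true value \beta_1 exactly when the covariance between Income and the error term is zero. If Income is endogenous, the second term does not vanish, however large the sample.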
Inconsistent Estimation
Dependent Variable: Health
Regression Coefficients:
Income: 3.5
Health Infr.: 4.0
Dust Particles: -0.7
Since Education is not included in the observable part (maybe the information was not collected), it is now captured by the error term. Since Education and Income are correlated, the error term and Income are correlated, so the Income coefficient (now 3.5 rather than 2.5) also picks up part of the effect of Education.
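A small simulation can reproduce this pattern. Everything in it is an assumption made for illustration: the data are simulated, the true effects are set to 2.5 (income) and 1.5 (education) to echo the slides, and education is built to rise by 2/3 of a unit per unit of income so that leaving it out pushes the income coefficient towards 2.5 + 1.5 × 2/3 = 3.5.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def slope(x, y):
    """Slope from a simple OLS regression of y on x (with a constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Part of y left over after regressing y on x (with a constant)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

random.seed(1)
n = 20000
income = [random.gauss(0, 1) for _ in range(n)]
# Assumption: education co-varies with income.
education = [(2 / 3) * inc + random.gauss(0, 1) for inc in income]
health = [2.5 * inc + 1.5 * edu + random.gauss(0, 1)
          for inc, edu in zip(income, education)]

# Education left out: it sits in the error term, which now co-varies with income.
coef_omitted = slope(income, health)
# Education included: its effect is partialled out of both income and health.
coef_included = slope(residuals(education, income),
                      residuals(education, health))

print(round(coef_omitted, 2))   # close to 3.5 (inconsistent)
print(round(coef_included, 2))  # close to 2.5 (consistent)
```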
Inconsistent Estimation (Contd.)
• If you think even one variable is endogenous, the whole regression result is WRONG
• Why: Recall that when we look at each coefficient, we interpret it as the effect after partialling out all other variables. So if there is a problem in any one variable, it spreads to the coefficients of the other variables.
Detecting Inconsistent Estimation
• Try to think of any variable that is
– Relevant to explain the dependent variable
– Not included in the regression, but which you expect to be correlated with some independent variable
• Example: Dependent Variable: Wages
– Independent Variables: Education, Age (years of Experience)
– Unobservable: IQ … smart people earn higher wages, but IQ scores are also expected to be correlated with Education level.
R Square
• Variation in Health = Variation Explained by Observable Variables + Variation due to Unobservables
• R square = Explained Variation divided by the Total Variation in Health
– Proportion of Variation in Health that you have been able to explain
– Between 0 and 1
– A higher R square is better, as your observable variables can explain more
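The decomposition above can be checked on a toy example. The five income and health values below are made-up numbers for illustration only, and the regression has a single observable variable to keep the sketch short.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Made-up illustrative data (not from the lecture).
income = [1, 2, 3, 4, 5]
health = [2, 4, 5, 4, 5]

mx, my = mean(income), mean(health)
b = (sum((x - mx) * (y - my) for x, y in zip(income, health))
     / sum((x - mx) ** 2 for x in income))
a = my - b * mx
fitted = [a + b * x for x in income]

total = sum((y - my) ** 2 for y in health)                       # total variation
explained = sum((f - my) ** 2 for f in fitted)                   # explained by income
unexplained = sum((y - f) ** 2 for y, f in zip(health, fitted))  # left to the error term

r_square = explained / total
print(round(r_square, 2))                  # 0.6
print(round(explained + unexplained, 2))   # 6.0, equal to the total variation
```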
Positive News
• You can conduct a regression which has very few variables and does not “explain” that much. Yet the coefficient of the variable you are interested in can still be correct! (VERY DIFFERENT FROM PREDICTION)
• In Impact plans, we are often interested in the coefficient of a variable that refers to the plan and not in other variables.
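A simulated sketch of this point follows. The setup is an assumption for illustration: a hypothetical plan is assigned at random (so it cannot co-vary with the error term), its true effect on health is set to 2.0, and the unobservables are made very noisy.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

random.seed(2)
n = 50000
# Hypothetical plan dummy, assigned at random: 1 = covered by the plan.
program = [random.randint(0, 1) for _ in range(n)]
# Assumed true effect of the plan is 2.0; the error term is very noisy.
health = [2.0 * p + random.gauss(0, 10) for p in program]

mp, mh = mean(program), mean(health)
b = (sum((p - mp) * (h - mh) for p, h in zip(program, health))
     / sum((p - mp) ** 2 for p in program))
a = mh - b * mp

total = sum((h - mh) ** 2 for h in health)
explained = sum((a + b * p - mh) ** 2 for p in program)
r_square = explained / total

print(round(b, 2))         # close to the assumed true effect of 2.0
print(round(r_square, 3))  # very low: the regression "explains" little
```

The regression would be poor for predicting any one household's health, yet the coefficient on the plan is still estimated consistently.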
Representativeness and Correlations
Representativeness: what does it mean?

Population              Sample
Edu Yrs   Income        Edu Yrs   Income
0         1000          0         1000
0         1200          12        12000
0         1000          15        100000
0         1030
12        10000
12        12000
15        100000

Mean of Income: Population 18032.86, Sample 37666.67
Correlation (Edu Yrs, Income): Population 0.69, Sample 0.73
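The means and correlations in the table can be reproduced from the listed values. The helper names `mean` and `correlation` are introduced here for illustration; the data are the seven population households and three sampled households shown above.

```python
def mean(xs):
    return sum(xs) / len(xs)

def correlation(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

pop_edu = [0, 0, 0, 0, 12, 12, 15]
pop_income = [1000, 1200, 1000, 1030, 10000, 12000, 100000]
sample_edu = [0, 12, 15]
sample_income = [1000, 12000, 100000]

print(round(mean(pop_income), 2), round(mean(sample_income), 2))
# 18032.86 37666.67
print(round(correlation(pop_edu, pop_income), 2),
      round(correlation(sample_edu, sample_income), 2))
# 0.69 0.73
```

The sample over-represents educated households, so its mean income is about twice the population mean, while the correlation moves much less.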
Capturing Qualitative Information
• For example: Gender of a person (Male/Female), Ethnic Group Affiliation (Minority Comm/Other General Comm), Employment Status (Employed/Unemployed).
• Define “Dummy Variables”
– DummyMale=1 if Male, =0 if Female
– DummyMin=1 if Minority, =0 otherwise
– DummyUnemp=1 if Unemployed, =0 otherwise
Qualitative Information (Contd)
• Which category you assign 1 does not matter as long as you remember your choice.
• If there are more than two categories, for example Occupation status: Self Employed/Casual Labour/Unemployed, define Dummy Variables:
Dummyemp=1 if self employed, =0 otherwise
Dummycasual=1 if casual labour, =0 otherwise
Dummy Variables
• Always leave out one category. What you leave out is your choice!
• So if there are 4 categories for a variable, there should be 3 dummy variables describing that variable.
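A minimal sketch of building dummy variables while leaving one category out. The function `make_dummies`, the category labels, and the five example households are hypothetical, chosen only to mirror the occupation example.

```python
def make_dummies(values, reference):
    """Build one 0/1 dummy per category, leaving out the reference category."""
    categories = sorted(set(values) - {reference})
    return {"Dummy" + c: [1 if v == c else 0 for v in values]
            for c in categories}

# Hypothetical occupation status of five people.
occupation = ["SelfEmployed", "Casual", "Unemployed", "Casual", "SelfEmployed"]

# "Unemployed" is the category we choose to leave out.
dummies = make_dummies(occupation, reference="Unemployed")
for name, column in dummies.items():
    print(name, column)
# DummyCasual [0, 1, 0, 1, 0]
# DummySelfEmployed [1, 0, 0, 0, 1]
```

Three categories give two dummies; the unemployed person has 0 in both columns.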
Example
• Dependent Variable: Hours of Schooling
DummyMale: 2.4
DummySC: -1.3
DummyST: -2.4
DummyOBC: -2.0
Constant: 4.5
Example (Contd)
• DummyMale: 2.4. Everything else the same, the male child goes to school 2.4 hours more than the female child.
• DummySC: -1.3. Children from SC households spend 1.3 hours less in school than children from the (reference) General Category households.
• Constant: 4.5. The Constant captures the average hours of schooling of the omitted categories. In this example, the omitted (reference) category is the General Category female child: the average hours of schooling for her is 4.5 hours.
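The coefficients in this example can be combined to get the average hours of schooling for any group. The helper `predicted_hours` and the group labels are names assumed here for illustration; the numbers are the ones on the slide.

```python
# Coefficients from the slide (dependent variable: Hours of Schooling).
coef = {"Constant": 4.5, "DummyMale": 2.4,
        "DummySC": -1.3, "DummyST": -2.4, "DummyOBC": -2.0}

def predicted_hours(male, group):
    """male is 1 or 0; group is 'Gen', 'SC', 'ST' or 'OBC'.
    'Gen' (General Category) is the omitted category, so it adds nothing."""
    hours = coef["Constant"] + coef["DummyMale"] * male
    if group != "Gen":
        hours += coef["Dummy" + group]
    return round(hours, 1)

print(predicted_hours(0, "Gen"))  # 4.5  (reference: General Category female child)
print(predicted_hours(1, "Gen"))  # 6.9  (2.4 hours more than the reference)
print(predicted_hours(0, "SC"))   # 3.2  (1.3 hours less than the reference)
```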
